<a href="https://colab.research.google.com/github/RohithParimi/My_AI_ML_EDA_Portfolio/blob/main/Module_1_Data_Types_and_EDA_Intro_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

### Module 1 Goals:

* Understand what "data" truly means in the context of EDA.

* Learn about different categories of data types and their significance.

* Grasp the fundamental importance and purpose of Exploratory Data Analysis.

* Get an overview of the typical EDA process.

---
**Concepts to Cover:**

### **1. What is Data?**

* **Definition:** Raw facts, figures, or details that can be processed to produce information.
* **Structured, Unstructured, and Semi-structured Data:**
    * **Structured:** Highly organized data that fits into fixed fields within a record or file. Think rows and columns. Examples: SQL database tables, Excel spreadsheets, CSV files.
    * **Unstructured:** Data that has no predefined data model and is not organized in a predefined manner. Examples: text documents, images, audio, video files, social media posts.
    * **Semi-structured:** A blend of both: it has some organizational properties (such as tags or keys) but does not follow a fixed tabular schema. Examples: XML, JSON files.
* **Data Points/Observations/Records:** Each row in a structured dataset.
* **Features/Attributes/Variables:** Each column in a structured dataset.
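
As a minimal sketch of the three categories, the snippet below uses only Python's standard library. The sample values (`csv_text`, `json_text`, `review`) are made up for illustration and do not come from a real dataset. It shows that each CSV row is a record and each column a feature, that JSON records can carry different fields, and that free text needs a derived feature before it can be analyzed.

```python
import csv
import io
import json

# Structured: CSV text with a fixed set of columns (hypothetical sample data).
csv_text = """name,age,city
Asha,29,Pune
Ben,35,Austin
"""
records = list(csv.DictReader(io.StringIO(csv_text)))  # one dict per row
features = list(records[0].keys())                     # column names

print(f"records (rows): {len(records)}")
print(f"features (columns): {features}")

# Semi-structured: JSON has keys and nesting, but records need not share fields.
json_text = (
    '[{"name": "Asha", "tags": ["new"]},'
    ' {"name": "Ben", "address": {"city": "Austin"}}]'
)
documents = json.loads(json_text)
field_sets = [sorted(doc.keys()) for doc in documents]
print(f"fields per JSON record: {field_sets}")

# Unstructured: free text has no predefined fields; structure must be derived.
review = "Great phone, but the battery drains fast."
word_count = len(review.split())
print(f"derived feature (word count): {word_count}")
```

Note that `csv.DictReader` returns every value as a string, so `age` would still need converting to a number before any numerical analysis.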

### **2. Types of Data (in the context of statistical analysis):**

This classification is critical because the type of data dictates which statistical methods and visualization techniques are appropriate.

* **I. Numerical Data (Quantitative):** Represents measurable quantities.
    * **A. Discrete:** Values are countable, often integers. They result from counting.
        * *Examples:* Number of children in a family, count of defects, number of cars in a parking lot.
    * **B. Continuous:** Values can take any value within a given range, often involving decimals. They result from measurement.
        * *Examples:* Height, weight, temperature, time, price.

* **II. Categorical Data (Qualitative):** Represents qualities or characteristics, often non-numerical.
    * **A. Nominal:** Categories without any intrinsic order or ranking.
        * *Examples:* Marital status (single, married, divorced), blood type (A, B, AB, O), color (red, blue, green). You can't say 'red' is "greater" than 'blue'.
    * **B. Ordinal:** Categories with a clear order or ranking, but the intervals between ranks might not be equal.
        * *Examples:* Education level (High School, Bachelor's, Master's, PhD), customer satisfaction (poor, fair, good, excellent), movie ratings (1-star, 2-star...). You can say 'Master's' is "higher" than 'Bachelor's', but the "difference" between Bachelor's and Master's might not be the same as High School and Bachelor's.

* **III. Datetime Data:** Represents points or periods in time. Often needs special handling for extracting features like year, month, day of week, etc.
    * *Examples:* '2023-01-15', '10:30:00', '2023-07-16 11:41:10'.
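
To illustrate the "special handling" mentioned for datetime data, here is a small sketch that parses two of the example strings above with the standard library's `datetime` module and extracts a few calendar features. The helper name `extract_features` and the choice of features are assumptions made for this example, not a fixed recipe.

```python
from datetime import datetime

# The raw values arrive as plain strings, as in the examples above.
raw_date = "2023-01-15"
raw_timestamp = "2023-07-16 11:41:10"

# Parse each string with an explicit format so bad input fails loudly.
parsed_date = datetime.strptime(raw_date, "%Y-%m-%d")
parsed_timestamp = datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S")


def extract_features(ts):
    """Return a few calendar features often derived during EDA (hypothetical helper)."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "hour": ts.hour,              # 0 when the string has no time part
    }


date_features = extract_features(parsed_date)
timestamp_features = extract_features(parsed_timestamp)

print(date_features)
print(timestamp_features)
```

Once parsed, the values can be compared, sorted, and subtracted as points in time, which is not reliable while they are still strings in mixed formats.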

---
### **3. Why is EDA Crucial? (The "Why" behind it all)**
* **Understanding the Data:** Learn the dataset's structure and variables, and form initial insights about what it contains.
* **Data Cleaning & Preparation:** Identify and handle missing values, outliers, inconsistencies, and errors.
* **Feature Engineering Insight:** Discover relationships between variables that can lead to creating new, more informative features for models.
* **Assumption Checking:** Verify assumptions required by certain machine learning models (e.g., normality, linearity).
* **Pattern Discovery:** Uncover hidden patterns, trends, and anomalies.
* **Informing Model Selection:** The characteristics of your data often guide which ML models will perform best.
* **Communicating Insights:** Effectively present findings to stakeholders using visualizations and summaries.
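
Two of the reasons above, spotting missing values and spotting outliers, can be sketched in a few lines. The `ages` list is hypothetical, `None` stands in for a missing entry, and the 1.5 × IQR rule is one common convention for flagging outliers rather than the only valid choice. `statistics.quantiles` requires Python 3.8 or newer.

```python
import statistics

# Hypothetical column of ages; None marks a missing value.
ages = [23, 25, None, 27, 24, 26, 120, None, 25]

# Missing values: count them and keep only the observed values.
missing_count = sum(1 for value in ages if value is None)
observed = [value for value in ages if value is not None]

# Outliers via the IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [value for value in observed if value < lower or value > upper]

print(f"missing values: {missing_count}")
print(f"outliers: {outliers}")

# The outlier pulls the mean far more than the median.
print(f"mean: {statistics.mean(observed):.2f}, median: {statistics.median(observed)}")
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the domain; the code only points to where to look.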

### **4. The Typical EDA Process Overview:**
While iterative and flexible, a general EDA workflow often includes:
* **1. Data Collection/Loading:** Getting your data into your environment.
* **2. Data Cleaning:** Handling missing values, duplicates, correcting errors, data type conversions.
* **3. Univariate Analysis:** Analyzing individual variables (distributions, central tendency, spread).
* **4. Bivariate Analysis:** Analyzing relationships between two variables.
* **5. Multivariate Analysis:** Analyzing relationships among three or more variables.
* **6. Feature Engineering (initial):** Creating new features that might reveal more insights.
* **7. Insight Generation & Visualization:** Drawing conclusions and creating compelling plots.
* **8. Documentation:** Recording findings, assumptions, and next steps.
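
The first few steps of this workflow can be sketched on a tiny dataset. Everything here is illustrative: the `raw` data is invented, the `pearson` helper is written by hand to keep the example free of third-party dependencies, and dropping incomplete or duplicate rows is only one possible cleaning policy. In practice a library such as pandas is commonly used for the same steps.

```python
import csv
import io
import statistics

# Step 1, loading: hypothetical raw data with a duplicate row and a missing score.
raw = """hours_studied,score
1,52
2,58
2,58
3,
4,71
5,80
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Step 2, cleaning: drop incomplete rows, remove duplicates, convert types.
seen = set()
clean = []
for row in rows:
    if not all(row.values()):  # an empty field means a missing value
        continue
    key = (row["hours_studied"], row["score"])
    if key in seen:            # exact duplicate of an earlier row
        continue
    seen.add(key)
    clean.append((float(row["hours_studied"]), float(row["score"])))

hours = [h for h, _ in clean]
scores = [s for _, s in clean]

# Step 3, univariate analysis: central tendency and spread of one variable.
score_mean = statistics.mean(scores)
score_stdev = statistics.stdev(scores)


# Step 4, bivariate analysis: Pearson correlation between two variables.
def pearson(xs, ys):
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


r = pearson(hours, scores)

print(f"rows kept: {len(clean)} of {len(rows)}")
print(f"score mean: {score_mean}, stdev: {score_stdev:.2f}")
print(f"correlation(hours, score): {r:.3f}")
```

A correlation this close to 1 on four points says little by itself; on real data the next step would be to plot the two variables and record the finding alongside its caveats.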

---
* What is the primary difference between structured and unstructured data? Provide an example of each *not* mentioned above.

 *Ans: Structured data is organized into rows and columns with fixed fields (for example, a hotel reservation table), while unstructured data has no predefined format (for example, a scanned handwritten note).*

* Explain the difference between Discrete and Continuous numerical data. Give an example for each.

 *Ans: Discrete data comes from counting and takes only whole-number values, e.g., the number of steps you walked. Continuous data comes from measuring and can take any value within a range, including decimals, e.g., a person's exact height.*

* Explain the difference between Nominal and Ordinal categorical data. Give an example for each.

 *Ans: Nominal data consists of categories with no inherent order or ranking, e.g., the dish you ordered. Ordinal data consists of categories with a meaningful order, but the gaps between adjacent categories are not necessarily equal, e.g., your rating of the dish you ordered.*

* In your opinion, what is the *most* important reason to perform EDA before building a machine learning model? Why?

 *Ans: EDA shows what the data actually looks like, so we can choose a model that suits it. It also surfaces problems such as outliers and null values; once those are handled, the model trains on cleaner data and is likely to perform better.*

* Briefly list the steps of a typical EDA process.

 *Ans:*

 1. Data gathering and loading
 2. Data cleaning
 3. Initial analysis
 4. Univariate analysis
 5. Bivariate analysis
 6. Multivariate analysis
 7. Feature selection and feature engineering
 8. Insight generation and visualization
 9. Documentation

```python
# Numerical Data
num_students = 50   # Discrete
avg_score = 85.75   # Continuous

# Categorical Data
eye_color = "blue"                 # Nominal
satisfaction_rating = "Excellent"  # Ordinal

# Datetime Data
event_date = "2024-10-26"  # We'll learn to handle this properly later

print(f"num_students: {num_students}, Type: {type(num_students)}")
print(f"avg_score: {avg_score}, Type: {type(avg_score)}")
print(f"eye_color: {eye_color}, Type: {type(eye_color)}")
print(f"satisfaction_rating: {satisfaction_rating}, Type: {type(satisfaction_rating)}")
print(f"event_date: {event_date}, Type: {type(event_date)}")
```

Output:

```
num_students: 50, Type: <class 'int'>
avg_score: 85.75, Type: <class 'float'>
eye_color: blue, Type: <class 'str'>
satisfaction_rating: Excellent, Type: <class 'str'>
event_date: 2024-10-26, Type: <class 'str'>
```
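
The output above shows that Python reports `str` for both the nominal and the ordinal value, so the language type does not capture the statistical type. The sketch below, using invented survey data and a hand-written `SATISFACTION_ORDER` list (both assumptions for illustration), shows one way to make the ordering explicit and to pick a summary that suits each type.

```python
import statistics

# Hypothetical survey data, reusing category labels from this module.
eye_colors = ["blue", "brown", "brown", "green", "brown", "blue"]  # nominal
responses = ["good", "excellent", "fair", "good", "poor", "good", "excellent"]  # ordinal
scores = [85.75, 90.0, 78.5]  # continuous

# Nominal: only counting is meaningful, so the mode is the natural summary.
most_common_eye_color = statistics.mode(eye_colors)

# Ordinal: state the order explicitly, because plain strings sort alphabetically.
SATISFACTION_ORDER = ["poor", "fair", "good", "excellent"]
rank = {label: position for position, label in enumerate(SATISFACTION_ORDER)}

sorted_responses = sorted(responses, key=rank.get)
median_rank = statistics.median_low([rank[r] for r in responses])
median_response = SATISFACTION_ORDER[median_rank]

# Continuous: arithmetic such as the mean is meaningful.
mean_score = statistics.mean(scores)

print(f"mode of eye color: {most_common_eye_color}")
print(f"responses in rank order: {sorted_responses}")
print(f"median response: {median_response}")
print(f"mean score: {mean_score}")
```

The median is used for the ordinal variable because it depends only on order; averaging the rank numbers would quietly assume equal gaps between categories, which ordinal data does not guarantee.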
