# EDA - Exploratory Data Analysis 

Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, with a focus on understanding patterns, trends, and relationships through statistical tools and visualizations.
+ EDA is also loosely referred to as a **FEATURE ENGINEERING TECHNIQUE**, although feature engineering is strictly one stage within it.

It helps:

+ Understand the structure of data (types, distributions, relationships).
+ Detect patterns, anomalies, or missing values.
+ Guide feature engineering and model selection.

👉 In short: EDA is about “getting to know your data before modeling.”

## ARCHITECTURE OF EDA 

Input Source / Data Collection --> Collated Dataset (Data Integration) --> Data Cleaning --> Data Transformation --> Exploratory Analysis --> Feature Engineering --> Final Dataset

1. Input Source / Data Collection
   + Get data from files, databases, APIs, sensors, etc.

2. Collated Dataset (Data Integration)
   + Combine data from multiple sources into one dataset.
   + Handle duplicates, mismatched formats, etc.

3. Data Cleaning
   + Remove inconsistencies, duplicates, missing values, outliers.

4. Data Transformation
   + Encoding categorical variables, normalization/scaling, log transforms.
   + Feature extraction if needed.

5. Exploratory Analysis
   + Univariate & Bivariate analysis, visualization, summary statistics.
   + Detect trends, distributions, correlations.

6. Feature Engineering
   + Create new variables/features that help in prediction.

7. Final Dataset
   + Well-structured, cleaned, transformed, and feature-rich dataset.
   + Ready for predictive modeling.
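
The stages above can be sketched end to end on a toy dataset. This is a minimal illustration using only the Python standard library; the records, the field names (`id`, `income`), the 50000 threshold, and the helper functions are all hypothetical and not part of the original notes. Stage 5 (exploratory analysis) is left out because it inspects the data rather than changing it.

```python
import math

# Stage 1: hypothetical records collected from two input sources.
source_a = [{"id": 1, "income": 30000}, {"id": 2, "income": None}]
source_b = [{"id": 2, "income": None}, {"id": 3, "income": 90000}]

def integrate(*sources):
    """Stage 2: combine sources into one dataset, skipping duplicate ids."""
    seen, combined = set(), []
    for source in sources:
        for row in source:
            if row["id"] not in seen:
                seen.add(row["id"])
                combined.append(dict(row))
    return combined

def clean(rows):
    """Stage 3: drop rows with a missing income (one possible policy)."""
    return [row for row in rows if row["income"] is not None]

def transform(rows):
    """Stage 4: add a log-transformed income column."""
    for row in rows:
        row["log_income"] = math.log(row["income"])
    return rows

def engineer(rows):
    """Stage 6: create a new feature from an existing one."""
    for row in rows:
        row["high_income"] = row["income"] > 50000
    return rows

# Stage 7: the final dataset, ready for modeling.
final_dataset = engineer(transform(clean(integrate(source_a, source_b))))
print(final_dataset)
```

In practice each stage would usually be written with a data-frame library, but the order of the steps stays the same.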

## EDA TECHNIQUES
+ Convert data from RAW to CLEAN

1. Variable Identification
2. Univariate Analysis
3. Bivariate Analysis
4. Outlier Treatment
5. Missing Value Transformation
6. Imputation or Transformation
7. Variable Creation

### 1. Variable Identification

+ Independent variables = x1, x2, x3, ...  
+ Dependent variable = y
+ Relevant attributes
+ Irrelevant attributes

As data scientists, we always need to look for the relevant attributes.  
If you build an ML model with irrelevant or redundant attributes, it becomes prone to problems such as **overfitting** and multicollinearity.


**Example**  
Family (loans, kids' education, house property)
+ Father (govt. employee) -- dependent variable = target variable = predicted variable (y)
+ Mother (homemaker) -- independent variable = non-target variable = non-predicted attribute (x1)
+ Son (2nd class) -- independent variable = non-target variable = non-predicted attribute (x2)
+ Daughter (5th class) -- independent variable = non-target variable = non-predicted attribute (x3)

y = x1 + x2 + x3  
Linear equation: y = mx + c (Simple Linear Regression algorithm)  
y = m1x1 + m2x2 + m3x3 + c (Multiple Linear Regression algorithm)  

### 2. Univariate Analysis
--> **Plot the graph using 1 variable**
+ Study of one variable at a time.
+ Helps check distribution, central tendency, and spread.
+ Tools: histograms, boxplots, frequency tables.
+ Example: Looking at income distribution in a population.
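
A univariate look at a single column can be sketched as follows. The income values are hypothetical, and the text histogram stands in for a plotted one so that the example needs only the standard library.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical monthly incomes (in thousands) for a small population.
incomes = [22, 25, 27, 30, 31, 35, 38, 40, 45, 120]

# Central tendency and spread.
print("mean:", mean(incomes))
print("median:", median(incomes))
print("std dev:", round(stdev(incomes), 2))

# Frequency table: count values per bin of width 20 (a text histogram).
bin_width = 20
bins = Counter((value // bin_width) * bin_width for value in incomes)
for start in sorted(bins):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}")
```

The mean sits well above the median here because of the single large value, which is the kind of skew a histogram or boxplot makes visible.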

### 3. Bivariate Analysis

--> **Plot the graph using 2 variables**
+ Study of the relationship between two variables.
+ Helps detect correlation, trends, and associations.
+ Tools: scatterplots, correlation matrices, cross-tabulation.
+ Example: Relationship between study hours and exam scores.

Correlation measures the relationship between attributes in the dataset.
The correlation coefficient ranges from **-1 to 1**:

+ +ve correlation: greater than 0, up to 1 (**0 to 1**)
+ -ve correlation: less than 0, down to -1 (**-1 to 0**)
+ Zero correlation: exactly **0** (no linear relationship)

### 4. Outlier Treatment

+ Identifying data points that deviate significantly from the rest.
+ Outliers can distort averages and models.
+ Methods: IQR rule, Z-score, capping, transformation.
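
The IQR rule, the Z-score, and capping can be sketched on one small sample. The values and the Z-score cut-off of 2 are assumptions made for illustration; `statistics.quantiles` uses its default quartile method, so other tools may report slightly different fences.

```python
from statistics import mean, quantiles, stdev

# Hypothetical values with one obvious outlier (120).
values = [22, 25, 27, 30, 31, 35, 38, 40, 45, 120]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

# Z-score: flag points more than `threshold` standard deviations from the mean.
threshold = 2  # assumed cut-off; 3 is common for larger samples
mu, sigma = mean(values), stdev(values)
z_outliers = [v for v in values if abs(v - mu) / sigma > threshold]

# Capping: clip values to the IQR fences instead of dropping them.
capped = [min(max(v, lower), upper) for v in values]

print(iqr_outliers, z_outliers, capped)
```

Both rules flag 120 in this sample. With a cut-off of 3 the Z-score would miss it, because in a sample this small the outlier itself inflates the standard deviation.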

### 5. Missing Value Transformation

+ Checking for incomplete data and deciding how to handle it.
+ Options: remove rows, impute values, flag missingness.
+ Example: Missing age in a survey → replace with mean/median or keep as "unknown."

--> A dataset is built from numerical data and categorical data.

If numerical data is missing, we can apply one of:
+ mean strategy
+ median strategy
+ mode strategy

If categorical data is missing, we can apply one of:
+ mode strategy
+ KNN strategy (k-nearest neighbours)

#### Example

![Screenshot 2025-08-31 205424.png](attachment:cdbab12c-452a-4687-ae8a-6c0822dcfec8.png)

### Table 1: Mean Imputation for Numerical Data
For the numerical 'NUMBERS' column, mean imputation is used.

This method involves calculating the average (mean) of all the known values in the column and then using that average to fill in the missing spots. This is a quick and common technique for numerical data.

+ Calculation: (5 + 10 + 23 + 33 + 20 + 50) / 6 = 141 / 6 = 23.5
+ Result: Both missing values are replaced with 23.5.
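
The same calculation can be reproduced in a few lines. The six known values come from the calculation above; the positions of the two missing entries in the list are an assumption, since only the values are stated.

```python
from statistics import mean

# Known values from Table 1; the positions of the two gaps (None) are assumed.
numbers = [5, 10, None, 23, 33, None, 20, 50]

# Mean of the known values only: 141 / 6 = 23.5
known = [n for n in numbers if n is not None]
fill_value = mean(known)

# Replace every missing entry with the mean.
imputed = [fill_value if n is None else n for n in numbers]
print(imputed)
```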
---
### Table 2: Mode Imputation for Categorical Data
For the categorical 'SEASON' column, mode imputation is applied.

This technique involves finding the most frequently occurring value (the mode) in the column and using it to fill the missing entry. It's the most straightforward approach for categorical data.

+ Analysis: "Summer" appears 3 times, while "Winter" and "Rainy" each appear twice.
+ Result: The mode is "Summer," so the missing value is imputed as "Summer."
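
A minimal sketch of the same step, assuming the counts given above (Summer three times, Winter and Rainy twice each). The row order and the position of the missing entry are assumptions.

```python
from collections import Counter

# Counts match Table 2; the row order and the position of None are assumed.
seasons = ["Summer", "Winter", "Rainy", "Summer", None, "Winter", "Rainy", "Summer"]

# Find the most frequent known value (the mode).
counts = Counter(s for s in seasons if s is not None)
mode_value, _ = counts.most_common(1)[0]

# Replace the missing entry with the mode.
imputed = [mode_value if s is None else s for s in seasons]
print(imputed)
```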
---
### Table 3: K-Nearest Neighbors (KNN) Imputation
Here, the 'SEASON' column has a tie; "Summer," "Winter," and "Rainy" each appear twice. Simple mode imputation isn't effective. Therefore, a more advanced method, K-Nearest Neighbors (KNN) Imputation, is used.

KNN imputation predicts the missing value by looking at its "neighbors"—the other data points that are most similar to it. In this case, similarity is determined by the 'TEMP' column.

+ Goal: To find the missing 'SEASON' for the row where 'TEMP' is 18.
+ Process (with k=1): The algorithm finds the single data point (k=1) with the 'TEMP' value closest to 18.
    + The temperature closest to 18 is 14.
    + The season corresponding to the temperature 14 is "Rainy."
+ Result: Based on its nearest neighbor, the missing value is imputed as "Rainy."
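
The nearest-neighbour lookup can be sketched as below. Only two rows are taken from the text: TEMP 18 with a missing season, and TEMP 14 with "Rainy". Every other temperature is a hypothetical placeholder, and `knn_impute` is an illustrative helper rather than a library function.

```python
from collections import Counter

# Rows of (TEMP, SEASON). Only (18, None) and (14, "Rainy") come from the
# text; all other temperatures are hypothetical placeholders.
rows = [
    (35, "Summer"), (38, "Summer"),
    (5, "Winter"), (8, "Winter"),
    (14, "Rainy"), (26, "Rainy"),
    (18, None),
]

def knn_impute(rows, k=1):
    """Fill missing labels with the majority label of the k nearest TEMP values."""
    labelled = [(temp, season) for temp, season in rows if season is not None]
    result = []
    for temp, season in rows:
        if season is None:
            # Sort labelled rows by distance in TEMP and keep the k closest.
            nearest = sorted(labelled, key=lambda row: abs(row[0] - temp))[:k]
            season = Counter(s for _, s in nearest).most_common(1)[0][0]
        result.append((temp, season))
    return result

print(knn_impute(rows, k=1)[-1])
```

With `k=1` the single closest temperature decides the label. Larger values of `k` take a majority vote among several neighbours.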

### 6. Imputation or Transformation

+ Imputation = filling missing values (mean, median, regression, KNN).
+ Transformation = applying mathematical changes (log, square root, scaling).
+ Purpose: stabilize variance, improve normality, or reduce skewness.
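
The effect of these transformations can be seen on a small skewed sample. The values are hypothetical, and the gap between mean and median is used here only as a rough, informal indicator of skewness.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed values (one very large observation).
values = [1, 2, 2, 3, 4, 5, 8, 1000]

log_values = [math.log1p(v) for v in values]  # log(1 + v), safe at v = 0
sqrt_values = [math.sqrt(v) for v in values]

# Min-max scaling to the range [0, 1].
lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

# In skewed data the mean sits far from the median; after the log
# transform the gap is much smaller.
raw_gap = mean(values) - median(values)
log_gap = mean(log_values) - median(log_values)
print(raw_gap, log_gap)
```

Scaling changes the range of the values but not the shape of their distribution, whereas the log and square-root transforms compress the large values and so reduce skew.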

TRANSFORMER -- Transforms categorical data into numerical data so that an ML model can be built on it. This step is usually called **ENCODING**; it is separate from imputation, which fills in missing values.

In NLP, the comparable technique for turning text into numbers is called **EMBEDDING**.

Transformers are of 3 types:

1. DUMMY VARIABLE
2. LABEL ENCODER
3. ONE HOT ENCODER
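
The three types can be sketched in plain Python on a hypothetical column. One assumption to note: "dummy variables" are taken here to mean one-hot columns with the first category dropped, which is a common convention but not the only one.

```python
# Hypothetical categorical column.
seasons = ["Summer", "Winter", "Rainy", "Summer"]
categories = sorted(set(seasons))  # ['Rainy', 'Summer', 'Winter']

# Label encoding: map each category to an integer code.
label_map = {category: code for code, category in enumerate(categories)}
label_encoded = [label_map[s] for s in seasons]

# One-hot encoding: one 0/1 column per category.
one_hot = [[int(s == category) for category in categories] for s in seasons]

# Dummy variables (assumed convention): like one-hot, but drop the first
# category, since it is implied when all remaining columns are 0.
dummies = [row[1:] for row in one_hot]

print(label_encoded)
print(one_hot)
print(dummies)
```

Label encoding imposes an order on the categories (here alphabetical), which suits ordinal data; one-hot and dummy encoding avoid that implied order at the cost of extra columns.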

### 7. Variable Creation (Feature Engineering)

+ Creating new variables from existing ones to capture more information.
+ Example: From “Date of Birth” → create “Age”.
+ Increases predictive power of models.
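
The "Date of Birth" to "Age" example might look like this. The record, the field names, and the helper `age_on` are hypothetical; a fixed reference date is passed in so the result does not depend on when the code is run.

```python
from datetime import date

def age_on(date_of_birth, today):
    """Return the number of completed years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

# Hypothetical record; the new "age" feature is derived from "date_of_birth".
record = {"name": "A", "date_of_birth": date(1990, 9, 15)}
record["age"] = age_on(record["date_of_birth"], today=date(2025, 8, 31))
print(record["age"])  # 34
```

Passing `date.today()` as `today` would give the current age instead.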