## Exploratory Data Analysis (EDA) — Complete Guide

### 1. What is EDA?
Exploratory Data Analysis (EDA) is the process of understanding, summarizing, and visualizing data before applying any machine learning or statistical models.

- The main goal is to discover patterns, detect anomalies, test assumptions, and gain insights.

> EDA answers: “What does my data actually look like?”

### 2. Why is EDA Important?
**EDA helps you:**
- Understand data structure
- Identify missing values & outliers
- Detect data quality issues
- Understand relationships between variables
- Narrow down which ML models are suitable
- Avoid wrong assumptions

> Skipping EDA often means poorly chosen features, which in turn means weaker models

### 3. Types of EDA

- `Univariate Analysis`
  - One variable at a time
  - Focus: distribution, central tendency, spread

- `Bivariate Analysis`
  - Relationship between two variables

- `Multivariate Analysis`
  - More than two variables
  - Complex relationships

### 4. EDA Workflow (Step-by-Step)

`Step 1: Understand the Dataset`
- Shape of data
- Data types
- Column meanings

```python
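import pandas as pd

# Assumes `df` is already loaded, e.g. df = pd.read_csv("data.csv") (hypothetical path)
df.shape    # (rows, columns)
df.info()   # column dtypes and non-null counts
df.head()   # first 5 rows
df.tail()   # last 5 rows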
```

`Step 2: Data Cleaning (Initial)`
- Missing values
- Duplicates
- Wrong data types
- Invalid entries

```python
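df.isnull().sum()       # missing values per column
df.duplicated().sum()   # number of fully duplicated rows
df.describe()           # summary stats; implausible min/max values hint at invalid entries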
```
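
The checks above only count problems. A minimal sketch of fixing wrong data types, invalid entries, and duplicates could look like the following. The toy frame, its column names, and the valid ranges (age between 0 and 120, non-negative salary) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'age' was read as text, one salary is negative, one row is duplicated
raw = pd.DataFrame({
    "age": ["25", "31", "unknown", "31", "47"],
    "salary": [50000, 62000, 58000, 62000, -1],
})

clean = raw.copy()

# Wrong data type: coerce to numeric; unparseable strings become NaN
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Invalid entries: treat out-of-range values as missing (the ranges are assumptions)
clean.loc[~clean["age"].between(0, 120), "age"] = np.nan
clean["salary"] = clean["salary"].where(clean["salary"] >= 0)

# Duplicates: keep the first occurrence of each fully identical row
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Turning invalid entries into `NaN` rather than deleting the rows keeps the decision about imputation or removal for the missing-values step.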

`Step 3: Univariate Analysis`

#### Numerical Data
- Mean
- Median
- Mode
- Min / Max
- Variance
- Standard Deviation
- Skewness
- Kurtosis

```python
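df['age'].describe()      # count, mean, std, min, quartiles, max
df['salary'].skew()       # > 0 means a longer right tail
df['salary'].kurtosis()   # excess kurtosis: 0 for a normal distribution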
```

**Visualizations**
- Histogram
- Boxplot
- KDE plot

```python
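import matplotlib.pyplot as plt
import seaborn as sns
plt.hist(df['salary'])       # histogram of the distribution
sns.boxplot(x=df['salary'])  # boxplot; draw it in a separate cell or figure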
```

#### Categorical Data
- Frequency counts
- Percentage distribution

```python
df['gender'].value_counts()
df['gender'].value_counts(normalize=True)
```

**Visualizations**
- Bar chart
- Pie chart
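
As a rough sketch of the bar chart mentioned above, the percentage distribution from `value_counts(normalize=True)` can be passed straight to Matplotlib. The `gender` values below are toy data, and the axis labels are just one possible choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

gender = pd.Series(["F", "M", "F", "F", "M", "F"], name="gender")  # toy data

# Percentage of rows per category, rounded to one decimal
pct = gender.value_counts(normalize=True).mul(100).round(1)

fig, ax = plt.subplots()
bars = ax.bar(pct.index.tolist(), pct.values)
ax.set_xlabel("gender")
ax.set_ylabel("share of rows (%)")
ax.set_title("Category frequencies")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to render the figure. A bar chart is usually easier to read than a pie chart once there are more than a few categories.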

`Step 4: Bivariate Analysis`

#### Numerical vs Numerical
- Correlation
- Scatter plot

```python
df[['age','salary']].corr()
sns.scatterplot(x='age', y='salary', data=df)
```

#### Categorical vs Numerical
- Group statistics

```python
df.groupby('gender')['salary'].mean()
sns.boxplot(x='gender', y='salary', data=df)
```

#### Categorical vs Categorical
- Cross-tabulation
- Count plots

```python
pd.crosstab(df['gender'], df['purchased'])
sns.countplot(x='gender', hue='purchased', data=df)
```

`Step 5: Multivariate Analysis`
- Pair plots
- Heatmaps
- 3D plots (rare)

```python
sns.pairplot(df)
sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only avoids errors on text columns
```

### 5. Statistical Concepts Used in EDA
**Measures of Central Tendency**
- **Mean**: Average
- **Median**: Middle value
- **Mode**: Most frequent
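
To see why the three measures can disagree, here is a small sketch on made-up salary values with one extreme entry. The numbers are illustrative only.

```python
import pandas as pd

salaries = pd.Series([40, 42, 45, 45, 48, 50, 400])  # toy values in thousands; 400 is an outlier

mean = salaries.mean()
median = salaries.median()
mode = salaries.mode().tolist()   # mode() returns a Series because ties are possible

print(f"mean={mean:.1f}, median={median}, mode={mode}")
```

The single outlier pulls the mean to roughly 95.7 while the median and mode both stay at 45, which is why the median is often preferred for skewed data.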

**Measures of Dispersion**
- **Range**: Max − Min
- **Variance**: Average squared deviation from the mean
- **Standard Deviation (Std Dev)**: Square root of variance
- **Interquartile Range (IQR)**: Q3 − Q1
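
The dispersion measures above can be computed directly in pandas. This is a small sketch on toy values. Note that pandas uses the sample variance (`ddof=1`) by default, so it differs slightly from the population variance.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # toy data

value_range = values.max() - values.min()
variance = values.var()              # sample variance (ddof=1) is the pandas default
pop_variance = values.var(ddof=0)    # population variance
std_dev = values.std()               # square root of the sample variance
iqr = values.quantile(0.75) - values.quantile(0.25)

print(f"range={value_range}, variance={variance:.3f}, std={std_dev:.3f}, IQR={iqr}")
```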

**Distribution Shape**
- Skewness
  - Positive → Right skew
  - Negative → Left skew
- Kurtosis
  - Tail heaviness relative to a normal distribution (often loosely described as peakedness)
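
A quick way to build intuition for these two numbers is to compare a symmetric sample with a right-skewed one. The sketch below uses synthetic random data with a fixed seed. `kurt()` in pandas reports excess kurtosis, so a normal distribution scores close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                            # fixed seed for reproducibility
symmetric = pd.Series(rng.normal(size=10_000))            # roughly normal
right_skewed = pd.Series(rng.exponential(size=10_000))    # long right tail

print("normal:      skew=%.2f  excess kurtosis=%.2f" % (symmetric.skew(), symmetric.kurt()))
print("exponential: skew=%.2f  excess kurtosis=%.2f" % (right_skewed.skew(), right_skewed.kurt()))
```

The normal sample should show both values near zero, while the exponential sample shows clearly positive skewness and heavier tails.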

### 6. Missing Values Analysis
**Types of Missing Data**
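- MCAR (Missing Completely at Random): missingness is unrelated to any values, observed or unobserved
- MAR (Missing at Random): missingness depends only on other observed variables
- MNAR (Missing Not at Random): missingness depends on the unobserved value itself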

**Handling Methods**
- Remove rows/columns
- Mean/Median/Mode imputation
- Forward/Backward fill
- Model-based imputation

```python
df['age'] = df['age'].fillna(df['age'].median())  # assign back; inplace=True on a column selection is unreliable in recent pandas
```
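
The other handling methods can be sketched in a similar way. The example below measures missingness per column, keeps a flag for imputed rows, fills `age` with a group-wise median, and forward-fills an ordered column. The frame and its columns (`city`, `age`, `reading`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns)
df_demo = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "age": [30.0, np.nan, 50.0, np.nan, 54.0],
    "reading": [1.0, np.nan, np.nan, 4.0, 5.0],   # ordered measurements
})

# Share of missing values per column, as a percentage
missing_pct = df_demo.isna().mean().mul(100)

# Keep a flag so later steps can still see which values were imputed
df_demo["age_was_missing"] = df_demo["age"].isna()

# Group-wise median imputation: fill each gap with the median of its own city
df_demo["age"] = df_demo["age"].fillna(df_demo.groupby("city")["age"].transform("median"))

# Forward fill for ordered data: carry the last observed value forward
df_demo["reading"] = df_demo["reading"].ffill()

print(missing_pct)
print(df_demo)
```

Which method is appropriate depends on the missingness type: simple imputation is easiest to justify under MCAR, while MNAR usually calls for domain knowledge.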

### 7. Outlier Detection
**Methods**
- Boxplot
- Z-score
- IQR method

```python
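Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]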
```
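
The Z-score method from the list can be sketched as follows. The data is synthetic, with one extreme value injected on purpose, and the cut-off of 3 is a common rule of thumb rather than a fixed standard.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
salary = pd.Series(rng.normal(loc=50_000, scale=5_000, size=200))  # synthetic data
salary.iloc[0] = 150_000                                           # inject one extreme value

# Z-score: distance from the mean in units of standard deviation
z = (salary - salary.mean()) / salary.std()
z_outliers = salary[z.abs() > 3]

print(z_outliers)
```

Because the mean and standard deviation are themselves affected by extreme values, Z-scores work best on roughly symmetric data. For skewed columns the IQR method is usually the safer choice.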

**Handling:**
- Remove
- Cap (Winsorization)
- Transform (log, sqrt)
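
The three handling options might look like this in pandas. This is a sketch on toy values. The 5th and 95th percentile cut-offs for capping are an assumption, and the right choice depends on the data.

```python
import numpy as np
import pandas as pd

salary = pd.Series([30, 35, 40, 42, 45, 48, 50, 55, 60, 500], dtype=float)  # toy data, one extreme value

# Option 1: remove rows outside the IQR fences
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
removed = salary[salary.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Option 2: cap at chosen percentiles (winsorization)
low, high = salary.quantile([0.05, 0.95])
capped = salary.clip(lower=low, upper=high)

# Option 3: compress the right tail with a log transform (values must be positive)
logged = np.log(salary)

print(len(removed), capped.max(), round(logged.max(), 2))
```

Removing changes the sample size, capping keeps every row but alters the extreme values, and transforming keeps the ordering intact while shrinking large values.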

### 8. Feature Relationships & Correlation
**Correlation Types**
- Pearson (linear)
- Spearman (monotonic)
- Kendall (rank-based, compares concordant and discordant pairs)

```python
df.corr(method='pearson', numeric_only=True)  # method can also be 'spearman' or 'kendall'
```
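
The difference between linear and monotonic correlation shows up clearly on a relationship that always increases but is far from a straight line. The sketch below uses made-up data where `y` grows exponentially with `x`.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 21, dtype=float)
pair = pd.DataFrame({"x": x, "y": np.exp(x / 2)})   # strictly increasing, but not linear

pearson = pair.corr(method="pearson").loc["x", "y"]
spearman = pair.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

Spearman reaches 1 because the ranks of `x` and `y` agree perfectly, while Pearson stays noticeably lower because the points do not lie on a line.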

> Correlation ≠ Causation

### 9. Data Transformation in EDA
- Log transformation
- Scaling
- Normalization
- Encoding categorical features

```python
import numpy as np
np.log(df['salary'])  # requires strictly positive values; use np.log1p if zeros are possible
```
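
The remaining transformations from the list can be sketched with plain pandas. The frame below is toy data, and the new column names are chosen only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "salary": [30_000.0, 45_000.0, 60_000.0, 90_000.0],
    "gender": ["F", "M", "F", "M"],
})  # toy data

# Scaling (standardization): mean 0, standard deviation 1
df_demo["salary_std"] = (df_demo["salary"] - df_demo["salary"].mean()) / df_demo["salary"].std()

# Normalization (min-max): rescale to the [0, 1] interval
s_min, s_max = df_demo["salary"].min(), df_demo["salary"].max()
df_demo["salary_minmax"] = (df_demo["salary"] - s_min) / (s_max - s_min)

# Encoding: one-hot encode the categorical column
encoded = pd.get_dummies(df_demo, columns=["gender"])

print(encoded)
```

During EDA these transformations are mostly used to inspect how a feature behaves on a different scale. For modeling, the scaling parameters should be fitted on the training split only.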

### 10. EDA Tools & Libraries
**Python Libraries**
- Pandas → data manipulation
- NumPy → numerical operations
- Matplotlib → base visualization
- Seaborn → statistical visualization
- Plotly → interactive plots

### 11. EDA Checklist (Very Important)
- Dataset shape & types
- Missing values
- Duplicates
- Outliers
- Distributions
- Feature relationships
- Target variable behavior
- Data imbalance
- Assumption checks
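
Several checklist items can be gathered in one pass. The helper below, `quick_eda_report`, is a hypothetical convenience function written for this guide, not part of pandas, and the toy frame exists only to show the output shape.

```python
import pandas as pd

def quick_eda_report(df: pd.DataFrame, target: str) -> dict:
    """Collect a few checklist items into one dictionary (illustrative helper)."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "skewness": numeric.skew().round(2).to_dict(),
        "target_balance": df[target].value_counts(normalize=True).round(2).to_dict(),
    }

# Toy data (hypothetical columns)
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, None, 32],
    "salary": [40, 52, 88, 95, 61, 52],
    "purchased": ["no", "no", "yes", "no", "no", "no"],
})

report = quick_eda_report(toy, target="purchased")
for key, value in report.items():
    print(key, "->", value)
```

A strongly lopsided `target_balance` is the signal for data imbalance. Outliers and feature relationships still need the plots and methods from the earlier sections.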

### 12. EDA vs Data Cleaning vs Feature Engineering
- **EDA**: Understand data  
- **Data Cleaning**: Fix data  
- **Feature Engineering**: Create better features  

### 13. EDA for Machine Learning
**EDA helps decide:**
- Which features to use
- Which features to drop
- Which transformations to apply
- Which ML models are plausible candidates

### 14. Real-World Example (Mini)
`Problem: Predict house prices`

**EDA reveals:**
- Price is right-skewed → log transform
- Strong correlation with area
- Missing values in bedrooms
- Outliers in luxury houses

> Acting on these findings (log-transforming price, imputing bedrooms, treating luxury outliers) usually improves the model
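
The checks behind these findings can be sketched on synthetic data. Everything below is generated for illustration and is not a real housing dataset. The column names and the way `price` is simulated are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500

# Synthetic data for illustration only
area = rng.uniform(50, 300, size=n)                               # square metres
price = 1_000 * area * rng.lognormal(mean=0.0, sigma=0.5, size=n)
bedrooms = rng.integers(1, 6, size=n).astype(float)
bedrooms[rng.choice(n, size=25, replace=False)] = np.nan          # knock out 5% of the values

houses = pd.DataFrame({"area": area, "price": price, "bedrooms": bedrooms})

skew_raw = houses["price"].skew()
skew_log = np.log(houses["price"]).skew()
corr_area = houses["area"].corr(houses["price"])                  # Pearson by default
missing_bedrooms = int(houses["bedrooms"].isna().sum())

print(f"price skew: raw={skew_raw:.2f}, log={skew_log:.2f}")
print(f"corr(area, price)={corr_area:.2f}, missing bedrooms={missing_bedrooms}")
```

On data like this, the log transform should bring the skewness of `price` much closer to zero, which is the kind of evidence that justifies the transformation before modeling.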

### 15. Common EDA Mistakes
- Skipping EDA
- Blindly removing outliers
- Ignoring domain knowledge
- Over-visualizing
- Not documenting insights

### 16. Best Practices
- Always start with EDA
- Write observations alongside plots
- Combine stats + visuals
- Re-do EDA after cleaning
- Keep EDA notebook clean & readable

### 17. EDA Interview Questions
- What is EDA?
- How do you handle missing values?
- How do you detect outliers?
- Difference between skewness & kurtosis?
- How does EDA help ML?