<a href="https://colab.research.google.com/github/SrinathMLOps/MLPractise/blob/main/EDA_Preprocessing_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📒 Detailed Notes: EDA and Data Preprocessing with Examples (with Real-World Analogies)

We'll use the popular **Iris dataset** to explain each concept step by step, including **before and after** transformations, **real-world analogies**, and **sample data snapshots** for better understanding.
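The snippets in these notes refer to a DataFrame named `df` and, later on, to an `iris` object, without showing where they come from. A minimal setup sketch, assuming scikit-learn's `load_iris` is the source and that the label column is called `target` (the name the later snippets use):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris and build a DataFrame with the four feature columns
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Append the numeric class label as the last column
df['target'] = iris.target

# Sample snapshot of the first rows
print(df.head())
```

Keeping `target` as the last column matters for the scaling step, which selects features with `df.iloc[:, :-1]`.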

## 📊 1. Exploratory Data Analysis (EDA)
📌 **EDA involves:**
- ✅ Checking the shape of the dataset using `df.shape`, which tells you how many rows and columns the dataset has.
```python
print(df.shape)
```

- ✅ Understanding data types with `df.dtypes`, which shows the type of data in each column.
```python
print(df.dtypes)
```

- ✅ Viewing summary statistics using `df.describe()`
```python
print(df.describe())
```

- ✅ Detecting missing values with `df.isnull().sum()`
```python
print(df.isnull().sum())
```

- ✅ Exploring class distributions with `df['target'].value_counts()`
```python
print(df['target'].value_counts())
```

- ✅ Visualizing relationships using a correlation heatmap and a pairplot
```python
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.corr(), annot=True)
plt.show()
sns.pairplot(df, hue='target')
plt.show()
```

🧠 **Analogy:** Think of EDA like doing a *health checkup* before treatment.

## 🧽 2. Handling Missing Values
📌 Fill missing values (for example, with the column mean) or drop the affected rows

```python
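import numpy as np

# Introduce one missing value artificially so there is something to fix
df.loc[0, 'sepal length (cm)'] = np.nan
# Fill it with the column mean; assign back rather than using inplace=True on a column
df['sepal length (cm)'] = df['sepal length (cm)'].fillna(df['sepal length (cm)'].mean())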
```
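The heading mentions dropping as well as filling, but only filling is shown. A small before-and-after sketch of both options, using a separate copy named `demo` (a hypothetical name) so the main `df` is left alone:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
demo = pd.DataFrame(iris.data, columns=iris.feature_names)
demo.loc[0, 'sepal length (cm)'] = np.nan  # simulate one missing entry

print(demo.isnull().sum().sum())   # number of missing cells before
dropped = demo.dropna()            # option 1: drop rows containing any NaN
filled = demo.fillna(demo.mean())  # option 2: fill with per-column means
print(dropped.shape, filled.isnull().sum().sum())
```

Dropping loses the whole row, including its valid columns, while filling keeps every row at the cost of an estimated value.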

## ⚖️ 3. Feature Scaling
📌 Standardize features (zero mean, unit variance) so they are on a comparable scale

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scale every column except the last one, which is assumed to be the target
df_scaled = scaler.fit_transform(df.iloc[:, :-1])
```
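To see the before-and-after effect of standardization, the sketch below compares the mean and standard deviation of each feature. It rebuilds the features from `load_iris` so that it runs on its own; `features` and `scaled_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

scaled = StandardScaler().fit_transform(features)
# fit_transform returns a NumPy array, so rebuild a DataFrame to keep column names
scaled_df = pd.DataFrame(scaled, columns=features.columns)

print(features.describe().loc[['mean', 'std']].round(2))   # before
print(scaled_df.describe().loc[['mean', 'std']].round(2))  # after
```

The "after" standard deviations printed by pandas come out marginally above 1 because `describe()` uses the sample formula (ddof=1), whereas `StandardScaler` divides by the population standard deviation.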

## 🔠 4. Encoding Categorical Variables
📌 Convert text labels to numbers

```python
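from sklearn.preprocessing import LabelEncoder

# Demo column: three labels repeated 50 times, which assumes df still has 150 rows
df['category'] = ['A', 'B', 'C'] * 50
le = LabelEncoder()
# Classes are sorted before numbering, so A -> 0, B -> 1, C -> 2
df['category_encoded'] = le.fit_transform(df['category'])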
```
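A short sketch of what the encoder learns and how to map the numbers back to labels. The `labels` series is toy data made up for illustration, not part of Iris.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(['A', 'B', 'C', 'A', 'C'])  # hypothetical toy labels
le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))                    # learned classes, sorted
print(list(encoded))                        # integer codes
print(list(le.inverse_transform(encoded)))  # back to the original labels
```

The integer codes imply an order (A < B < C) that may not exist in the data, so this kind of encoding is usually safest for target labels rather than input features.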

## 🚨 5. Outlier Detection (Z-Score)
📌 Flag rows where any feature lies more than 3 standard deviations from its mean, then remove them

```python
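import numpy as np
from scipy.stats import zscore

# `iris` is assumed to be the object returned by sklearn's load_iris()
z = np.abs(zscore(df[iris.feature_names]))
# A row counts as an outlier if any of its features has |z| > 3
df['outlier'] = (z > 3).any(axis=1)
df = df[~df['outlier']]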
```
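The z-score itself is just `(x - mean) / std`. The sketch below computes it by hand and checks it against `scipy.stats.zscore`, assuming the population standard deviation (ddof=0), which is SciPy's default. `features`, `manual_z`, and `mask` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.datasets import load_iris

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# z = (x - mean) / std, using the population std (ddof=0) like scipy's default
manual_z = (features - features.mean()) / features.std(ddof=0)
scipy_z = zscore(features)

# Same rule as above: flag a row if any feature exceeds 3 standard deviations
mask = (np.abs(manual_z) > 3).any(axis=1)
print(mask.sum(), "row(s) flagged")
```

The threshold of 3 is a convention, not a law; on small or skewed datasets a different cutoff or a different method may fit better.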

## 🔄 6. Log Transformation
📌 Reduce skewness in data

```python
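# Cubing petal width stretches the upper tail, giving a skewed feature for the demo
df['new_feature'] = df['petal width (cm)'] ** 3
# log1p computes log(1 + x), which stays defined when x is 0
df['log_transformed'] = np.log1p(df['new_feature'])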
```
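To check whether the transformation actually reduced skewness, compare the skew statistic before and after. This sketch rebuilds the feature from `load_iris` and uses pandas' `Series.skew()`; `df_demo`, `cubed`, and `logged` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_demo = pd.DataFrame(iris.data, columns=iris.feature_names)

cubed = df_demo['petal width (cm)'] ** 3  # artificially skewed feature
logged = np.log1p(cubed)                  # log(1 + x) transform

skew_before = cubed.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skew close to 0 suggests a roughly symmetric distribution. The transform can be undone with `np.expm1` if the original scale is needed later.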

## 📚 7. Text Preprocessing
📌 Tokenize, remove stopwords, and stem

```python
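import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer models and the stop-word list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')

text = "I love natural language processing"
tokens = word_tokenize(text)
# Define the English stop-word set that the filter below relies on
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.lower() not in stop_words]
stemmed = [PorterStemmer().stem(t) for t in tokens]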
```
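When the NLTK corpora cannot be downloaded, the same three steps can still be sketched with a regex tokenizer and a tiny hand-written stop-word set. Both are stand-ins chosen for illustration, not NLTK's `word_tokenize` or its English stop-word list; only `PorterStemmer` comes from NLTK, and it needs no download.

```python
import re
from nltk.stem import PorterStemmer

text = "I love natural language processing"

# Stand-in tokenizer: lowercase, then keep alphabetic runs (not word_tokenize)
tokens = re.findall(r"[a-z]+", text.lower())
# Hypothetical miniature stop-word list, far smaller than NLTK's English list
stop_words = {"i", "a", "an", "the", "is", "and", "of", "to"}
filtered = [t for t in tokens if t not in stop_words]

# Stemming chops words down to a root form, which need not be a real word
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in filtered]
print(stemmed)
```

Creating the stemmer once and reusing it avoids constructing a new `PorterStemmer` for every token.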

## ⚖️ 8. SMOTE (Balancing Classes)
📌 Generate synthetic samples for the minority class

```python
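from imblearn.over_sampling import SMOTE

# X and y are assumed to be the Iris feature columns and the class labels
X, y = df[iris.feature_names], df['target']
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)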
```