
---

# ✅ **Complete Guide to Data Preprocessing in Machine Learning**  

Preprocessing is a **crucial step** in machine learning that can **make or break your model’s performance**. Different models **handle data differently**, so applying the correct techniques **saves time** and **improves accuracy**.  

Let’s go **step by step** 👇  

---

## 📌 **1. Understand the Data Type First**  
Before any transformation, **identify the types of data** in your dataset.  

### 🔹 **Types of Data**  
| **Data Type**      | **Example**             | **Encoding Needed?** | **Scaling Needed?** |
|--------------------|------------------------|----------------------|---------------------|
| **Numerical (Continuous)** | Age, Salary, Temperature | ❌ No | ✅ Yes |
| **Numerical (Discrete)** | Number of Children, Counts | ❌ No | ✅ Yes |
| **Categorical (Nominal)** | Gender, City, Country | ✅ Yes (One-Hot, or Target for many categories) | ❌ No |
| **Categorical (Ordinal)** | Low-Medium-High, Education Level | ✅ Yes (Ordinal Encoding) | ✅ Sometimes |

🔹 **Tools:**  
```python
df.head()
df.info()
df.describe(include='all')
df.dtypes
```

👉 **Why is this important?**  
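- **Tree-based algorithms (like Decision Trees, Random Forests, XGBoost) can split on categorical variables in principle**, but the scikit-learn tree implementations still expect numeric input. XGBoost (with `enable_categorical=True`) and LightGBM offer optional native categorical support.  
- **Linear and distance-based models (like Logistic Regression, SVM) need numerical input**, so categorical features **must be encoded**.  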
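
As a minimal sketch of this first step, the snippet below separates numeric columns from non-numeric ones with `select_dtypes`. The toy DataFrame and its column names are made up for illustration; on real data the non-numeric group may also contain datetime or boolean columns that need separate handling.

```python
import pandas as pd

# Hypothetical toy dataset; column names are illustrative only
df = pd.DataFrame({
    "age": [25, 32, 47],
    "salary": [40000.0, 52000.0, 88000.0],
    "city": ["Paris", "Berlin", "Paris"],
    "education": ["Low", "High", "Medium"],
})

# Numeric columns are candidates for scaling
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else in this toy frame is categorical and needs encoding
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```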

---

## 📌 **2. Handling Missing Values**  
### 🔹 **Rules of Thumb**  
✔ **If missing values < 5%** → Drop rows (`df.dropna()`).  
✔ **If missing values 5-40%** → Impute using:  
   - **Numerical** → Mean (if normal), Median (if skewed), KNN Imputer.  
   - **Categorical** → Mode (most frequent value) or "Unknown".  
✔ **If missing values > 40%** → Drop column if it's not important.  
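
The thresholds above are heuristics, not hard limits. One way to apply them, sketched here on a made-up DataFrame, is to compute the missing fraction per column and map it to a suggested action. The helper `suggest_action` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 rows with different amounts of missingness
df = pd.DataFrame({
    "a": [np.nan] * 1 + [1.0] * 49,    # 2% missing
    "b": [np.nan] * 10 + [2.0] * 40,   # 20% missing
    "c": [np.nan] * 30 + [3.0] * 20,   # 60% missing
    "d": [4.0] * 50,                   # complete
})

# Fraction of missing values per column
missing_frac = df.isnull().mean()

def suggest_action(frac):
    """Map a missing fraction to the rule-of-thumb action (illustrative helper)."""
    if frac == 0:
        return "nothing to do"
    if frac < 0.05:
        return "drop rows"
    if frac <= 0.40:
        return "impute"
    return "consider dropping column"

actions = missing_frac.apply(suggest_action)
print(pd.DataFrame({"missing_frac": missing_frac, "action": actions}))
```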

🔹 **Example:**  
```python
# Check for missing values in each column
df.isnull().sum()

# Check total number of missing values
df.isnull().sum().sum()

# Visualize missing values (optional, using seaborn)
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()

# drop rows with missing values
df_dropped_rows = df.dropna()

# drop columns with missing values
df_dropped_columns = df.dropna(axis=1)

# Replace missing values with zero
df_filled = df.fillna(0)

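# Numeric columns: mean OR median imputation (pick one; after the first fill
# no NaNs remain, so a second fillna on the same column would do nothing)
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].mean())
# df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())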

# Categorical columns mode imputation
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

# Forward or backward fill
# (fillna(method=...) is deprecated in recent pandas; use ffill()/bfill())
# Forward fill (propagate the last valid value forward)
df_filled_forward = df.ffill()

# Backward fill (propagate the next valid value backward)
df_filled_backward = df.bfill()

# Impute with scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer

# For numeric data
imputer = SimpleImputer(strategy='mean')  # Can use 'median', 'most_frequent', or 'constant'
df[['numeric_column']] = imputer.fit_transform(df[['numeric_column']])

# For categorical data
imputer = SimpleImputer(strategy='most_frequent')
df[['categorical_column']] = imputer.fit_transform(df[['categorical_column']])

# For time-series data, use interpolation
# Linear interpolation: use when data points are equally spaced (e.g., daily data)
df_interpolated = df.interpolate(method='linear')

# Time-based interpolation: requires a DatetimeIndex; use for unevenly spaced
# time-series data (e.g., hourly data with missing records)
df_time_interpolated = df.interpolate(method='time')

```
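
The KNN imputer named in the rules of thumb is not shown in the example above. A minimal sketch with scikit-learn's `KNNImputer`, on a tiny made-up matrix, could look like this:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric feature matrix with one missing entry
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest rows (distance is computed on the observed features)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Here the two nearest rows to `[2.0, NaN]` are the first and third rows, so the gap is filled with the average of their second feature.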

👉 **Which ML models handle missing values?**  
| Model | Needs Missing Value Handling? |
|--------|-----------------------------|
| Decision Trees, Random Forest | ✅ Usually; recent scikit-learn releases accept NaN in trees, older ones require imputation |
| XGBoost | ❌ No, handles missing values natively |
| Logistic Regression, SVM, KNN | ✅ Yes, requires imputation |
| Neural Networks | ✅ Yes, missing values must be handled |

---

## 📌 **3. Encoding Categorical Data**  
### 🔹 **Rules of Thumb**  
✔ **Few unique values (<10 categories):** **One-Hot Encoding** (`pd.get_dummies()`).  
✔ **Many unique values (>10 categories):** **Target Encoding** (for tree-based models), fitted on the training data only to avoid target leakage.  
✔ **Ordinal data (e.g., Low-Medium-High):** **Ordinal Encoding** with an explicit category order.  

🔹 **Example:**  
```python
# 1. One-Hot Encoding
# Converts each category into a separate binary column (0 or 1).
import pandas as pd
df = pd.get_dummies(df, columns=['category_column'], drop_first=True)

# Alternative: scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
encoded = encoder.fit_transform(df[['Color']])

# 2. Label Encoding
# Converts each unique category into an integer, assigned in sorted order.
# LabelEncoder is meant for target labels and ignores the real ranking
# (High < Low < Medium alphabetically); for ordinal features prefer
# OrdinalEncoder below, where the order is stated explicitly.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['ordinal_column'] = le.fit_transform(df['ordinal_column'])

# 3. Ordinal Encoding
# For ordered categories, you can specify the rank manually.
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High']]
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(data)
print(encoded)  # Low -> 0.0, Medium -> 1.0, High -> 2.0 (returned as floats)

```
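
Target encoding appears in the rules of thumb but not in the example. The sketch below is a plain, unsmoothed version written with pandas only; the frames and column names are hypothetical. Real projects usually add smoothing or cross-fitting, because encoding training rows with their own target values can leak information.

```python
import pandas as pd

# Hypothetical training and test frames; names are illustrative only
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})  # "D" never appears in train

# Learn the mapping from the training data only
global_mean = train["target"].mean()
category_means = train.groupby("city")["target"].mean()

# Apply it; categories unseen during training fall back to the global mean
train["city_encoded"] = train["city"].map(category_means)
test["city_encoded"] = test["city"].map(category_means).fillna(global_mean)

print(test)
```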

👉 **Which ML models require encoding?**  
| Model | Needs Categorical Encoding? |
|--------|-----------------------------|
| Decision Trees, Random Forest | ✅ Yes in scikit-learn (numeric input required); ordinal codes are usually enough |
| XGBoost, LightGBM | ✅ Usually; both offer optional native categorical support |
| Logistic Regression, SVM, KNN | ✅ Yes, needs numerical encoding |
| Neural Networks | ✅ Yes, needs numerical encoding |

---

## 📌 **4. Handling Outliers**  
### 🔹 **Rules of Thumb**  
✔ **For skewed numerical data**, use **log transformation** (`np.log1p()`).  
✔ **For extreme outliers**, use **Winsorization (capping)** or remove using **IQR method**.  

🔹 **Example:**  
```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# ----- 🛠 Methods to Detect Outliers -----

# 1. Z-score method
# Detects how many standard deviations a data point is from the mean.
# Rule: if the Z-score is > 3 or < -3, it's an outlier.
# Note: in a sample of n points |z| can never exceed sqrt(n - 1), so a
# threshold of 3 needs more than 10 points; this sample has 15.
data = np.array([10, 12, 15, 14, 13, 120, 14, 13, 12, 11, 13, 14, 12, 15, 13])

z_scores = zscore(data)

outliers = np.where(np.abs(z_scores) > 3)
print("Outliers:", data[outliers])  # Outliers: [120]

# 2. IQR method
# Detects outliers based on the range between the 25th percentile (Q1)
# and the 75th percentile (Q3).
# Rule: a value is an outlier if it is below Q1 - 1.5 * IQR
# or above Q3 + 1.5 * IQR.

# Sample data
df = pd.DataFrame({'Values': [10, 12, 15, 14, 13, 120, 14, 13, 12]})

# Calculate IQR
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
outliers = df[(df['Values'] < (Q1 - 1.5 * IQR)) | (df['Values'] > (Q3 + 1.5 * IQR))]
print("Outliers:\n", outliers)

# 3. Visualization-based detection
# Box plot: quickly spots outliers using the IQR.
# Scatter plot: helps to detect outliers in two-dimensional data.
# Histogram: reveals unusual frequency distributions.

# ----- 🛠 Methods to Handle Outliers -----

# 1. Remove outliers using the IQR fences
df_no_outliers = df[(df['Values'] >= (Q1 - 1.5 * IQR)) & (df['Values'] <= (Q3 + 1.5 * IQR))]

# 2. Cap outliers (Winsorization)
from scipy.stats.mstats import winsorize

# Cap 5% on both ends. The number of capped values is int(limit * n), so on
# this 9-row sample nothing changes; the call matters on larger datasets.
capped_data = winsorize(df['Values'].to_numpy(), limits=[0.05, 0.05])

# 3. Transform data (log, square root)
df['Log_Transformed'] = np.log1p(df['Values'])  # log(1 + x)

# 4. Impute outliers: replace them with the median
median_value = df['Values'].median()
df['Values'] = np.where(
    (df['Values'] < (Q1 - 1.5 * IQR)) | (df['Values'] > (Q3 + 1.5 * IQR)),
    median_value,
    df['Values']
)
```

🚩 **Which Method Should You Use?**  
| Situation | Suggested Method |
|--------|----------------------|
| Outliers are data entry errors | Remove them |
| Small dataset | Impute or cap outliers |
| Outliers carry important information | Transform data (log, sqrt) |
| Model is robust to outliers (tree-based) | Ignore or handle selectively |

👉 **Which ML models handle outliers?**  
| Model | Affected by Outliers? |
|--------|----------------------|
| Decision Trees, Random Forest | ❌ No, not sensitive |
| XGBoost, LightGBM | ❌ No, robust to outliers |
| Logistic Regression, SVM, KNN | ✅ Yes, need to remove or scale outliers |
| Neural Networks | ✅ Yes, need preprocessing |
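
For models that are sensitive to outliers, capping at the IQR fences can also be written with pandas alone. This is a minimal sketch on the same sample values used in this section; the column name `Values_capped` is introduced only for the example.

```python
import pandas as pd

df = pd.DataFrame({"Values": [10, 12, 15, 14, 13, 120, 14, 13, 12]})

# IQR fences
q1 = df["Values"].quantile(0.25)
q3 = df["Values"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are pulled back to the nearest fence;
# no rows are dropped
df["Values_capped"] = df["Values"].clip(lower=lower, upper=upper)

print(df)
```

Unlike row removal, this keeps the dataset size unchanged, which matters when the dataset is small.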

---

## 📌 **5. Feature Scaling**  

### ✅ Why Feature Scaling?

- Algorithms like **SVM, KNN, K-Means**, and models trained with **Gradient Descent** are sensitive to feature magnitudes.
- Speeds up convergence in optimization algorithms.
- Prevents dominance of features with larger scales.
### 🔹 **Rules of Thumb**  
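✔ **For roughly normal data** → **Use StandardScaler()** (Z-score standardization).  
✔ **For skewed data** → **Use a Power Transform (Box-Cox for strictly positive values, Yeo-Johnson otherwise)**. **MinMaxScaler() (0-1 scaling)** changes the range but not the shape of the distribution.  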
✔ **Tree-based models (Decision Trees, Random Forest, XGBoost)** **DON’T need scaling**.  
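
The power transform from the rules above can be sketched with scikit-learn's `PowerTransformer`. The log-normal sample below is synthetic and stands in for a real skewed feature.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (log-normal draws)
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Yeo-Johnson also accepts zero and negative values; Box-Cox needs X > 0.
# The output is standardized to zero mean and unit variance by default.
pt = PowerTransformer(method="yeo-johnson")
X_transformed = pt.fit_transform(X)

skew_before = skew(X[:, 0])
skew_after = skew(X_transformed[:, 0])
print("Skewness before:", skew_before)
print("Skewness after: ", skew_after)
```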

### 🔢 Types of Feature Scaling Techniques

#### ⚖️ 1️⃣ Min-Max Scaling (Normalization)

**Use Case:**  
- Scales features to a fixed range [0, 1].  
- Useful for algorithms sensitive to magnitudes (**KNN, SVM, Neural Networks**).  
- Not robust to outliers.


**Code:**
```python
from sklearn.preprocessing import MinMaxScaler

data = [[10], [20], [30], [40], [50]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

```
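
For reference, Min-Max scaling computes `(x - min) / (max - min)` per column. The short check below, a sketch on the same five values, confirms that the manual formula matches `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Manual min-max scaling, column by column
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(data)

print(manual.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```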


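#### ⚖️ 2️⃣ Standardization (Z-score)

**Use Case:**  
- Scales data to have mean = 0 and standard deviation = 1.
- Works well for linear models and gradient-based optimizers (Logistic Regression, Linear Regression), especially when features are roughly Gaussian.
- Less distorted by outliers than Min-Max scaling because the output is not squeezed into a fixed range, but the mean and standard deviation are still affected by them.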


**Code:**
```python
from sklearn.preprocessing import StandardScaler

data = [[10], [20], [30], [40], [50]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

```
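
One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, on the same five values, shows that the two conventions give slightly different results.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Manual standardization with the population standard deviation (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

scaled = StandardScaler().fit_transform(data)

# Same formula with pandas defaults (ddof=1): slightly smaller magnitudes
s = pd.Series(data.ravel())
pandas_style = ((s - s.mean()) / s.std()).to_numpy()

print(manual.ravel())
print(pandas_style)
```

The difference shrinks as the number of rows grows, so it rarely matters in practice, but it explains small mismatches when results are compared by hand.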
#### Best practice: Scaling in a pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Example pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Choose the appropriate scaler
    ('model', LogisticRegression())
])

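# Train the model (X_train and y_train are assumed to come from an earlier
# train/test split); the scaler is fitted on the training data only
pipeline.fit(X_train, y_train)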

```
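
A self-contained version of the same idea is sketched below. The synthetic dataset from `make_classification` and the split parameters are assumptions made only so the example runs end to end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() learns the scaling statistics from X_train only
pipeline.fit(X_train, y_train)

# score() reuses those training statistics to transform X_test,
# so no information from the test set leaks into preprocessing
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```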

👉 **Which ML models require scaling?**  
| Model | Needs Scaling? |
|--------|-------------|
| Decision Trees, Random Forest | ❌ No |
| XGBoost, LightGBM | ❌ No |
| Logistic Regression, SVM, KNN | ✅ Yes, needs scaling |
| Neural Networks | ✅ Yes, needs scaling |

---

## 📌 **6. Handling Imbalanced Data**  
✔ **For classification tasks with imbalance (>80-20 ratio):**  
   - **Use SMOTE (Synthetic Minority Oversampling)**.  
   - **Try undersampling if dataset is large**.  
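
Before resampling, it helps to measure the imbalance. A minimal sketch of the 80-20 check, on a made-up target column:

```python
import pandas as pd

# Hypothetical binary target with a 90-10 split
y = pd.Series([0] * 90 + [1] * 10, name="target")

# Share of each class
class_share = y.value_counts(normalize=True)
print(class_share)

# Flag the dataset when the majority class exceeds 80% of the rows
is_imbalanced = bool(class_share.max() > 0.80)
print("Imbalanced:", is_imbalanced)
```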

### 1️⃣ Resampling Methods

#### 🔄 a) **Oversampling** (Increase minority class)

- **Use Case:** When the dataset is small and losing information is unacceptable.
- **Techniques:** 
  - Random Oversampling
  - SMOTE (Synthetic Minority Over-sampling Technique)

**Code:**
```python
from imblearn.over_sampling import SMOTE

# Assuming X and y are your training features and target variable.
# Resample the training split only, so the test set keeps the real class ratio.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Resampled dataset shape:", X_resampled.shape)
```
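
Random oversampling, the simpler of the two techniques listed, needs no extra library. The sketch below uses pandas only; the frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical training frame: 8 majority rows, 2 minority rows
train = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,
})

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Sample minority rows with replacement until both classes have equal size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts())
```

Because the new rows are exact copies, this approach can encourage overfitting, which is the motivation for synthetic methods such as SMOTE.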
#### 🔄 b) **Undersampling** (Reduce majority class)

- **Use Case:** Large datasets with plenty of majority samples.
- **Techniques:** 
  - Random Undersampling
  - Tomek Links, NearMiss (both available in `imblearn.under_sampling`)

**Code:**
```python
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)

print("Resampled dataset shape:", X_resampled.shape)
```


👉 **Which ML models are affected by imbalance?**  
| Model | Affected by Imbalance? |
|--------|----------------------|
| Decision Trees, Random Forest | ✅ Yes, but less sensitive |
| Logistic Regression, SVM | ✅ Yes, needs balancing |
| Neural Networks | ✅ Yes, needs balancing |

---

## 📌 **7. Automating Preprocessing with Pipelines**  
Using **Scikit-learn Pipelines** ensures **reproducibility** and saves time.  

🔹 **Example:**  
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# raw_data is assumed to contain numeric columns only
processed_data = pipeline.fit_transform(raw_data)
```
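
Real datasets usually mix numeric and categorical columns. One common way to handle both in a single object is `ColumnTransformer`; the sketch below assumes a small made-up frame with gaps in both column types.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with missing values in both column types
raw_data = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "salary": [40000.0, 52000.0, np.nan, 61000.0],
    "city": ["Paris", "Berlin", np.nan, "Paris"],
})

# Numeric columns: median imputation, then standardization
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "salary"]),
    ("cat", categorical_pipeline, ["city"]),
])

processed = preprocessor.fit_transform(raw_data)
print(processed.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

The `preprocessor` can then be used as the first step of a `Pipeline` that ends in a model, so the same transformations are applied at training and prediction time.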

---

# 🎯 **Final Summary: What You Should Do Before Using Each Model**  
| Model | Needs Missing Value Handling? | Needs Encoding? | Needs Scaling? | Handles Outliers? |
|--------|----------------|----------------|---------------|----------------|
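| **Decision Trees, Random Forest** | ✅ Usually (depends on library version) | ✅ Yes (in scikit-learn) | ❌ No | ✅ Yes |
| **XGBoost, LightGBM** | ❌ No (handled natively) | ✅ Usually (native support is optional) | ❌ No | ✅ Yes |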
| **Logistic Regression, SVM, KNN** | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| **Neural Networks** | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |

---
