### What is Data Preprocessing?
Data preprocessing is the process of cleaning, transforming, and preparing raw data before feeding it to a machine learning model.
It improves data quality and consistency, which in turn tends to improve model performance.

### 🧩 Main Steps of Data Preprocessing

### 1️⃣ Data Collection

Goal: Gather data from various sources.

Examples of sources:

- Databases (SQL, MongoDB, etc.)
- CSV/Excel files
- APIs / web scraping
- IoT sensors, logs, etc.

Tools/Libraries:

In [None]:
import pandas as pd

df = pd.read_csv("data.csv")  # or pd.read_excel(), pd.read_json(), etc.
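
Reading from a database follows the same pattern as reading a file. The sketch below is only an illustration: it builds a throwaway in-memory SQLite database, and the `sales` table and its columns are made-up names, not part of any real dataset.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": [1, 2, 3], "amount": [9.5, 12.0, 7.25]}).to_sql(
    "sales", conn, index=False
)

# Pull only the rows and columns needed instead of loading the whole table.
df_sql = pd.read_sql_query(
    "SELECT id, amount FROM sales WHERE amount > 8 ORDER BY id", conn
)
conn.close()
```

Filtering in the query keeps memory use down when the source table is large.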


### 2️⃣ Data Inspection & Exploration

Goal: Understand the structure, quality, and characteristics of the data.

Common Methods:

In [None]:
df.head()        # First 5 rows
df.info()        # Data types and null values
df.describe()    # Summary statistics
df.shape         # Rows × Columns
df.columns       # Feature names
df.nunique()     # Unique values per column
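
A quick data-quality summary often complements the calls above. This is a minimal sketch on a made-up frame (`df_demo` and its columns are assumptions for illustration) that reports missing values per column and the number of fully duplicated rows.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Count and share of missing values per column, worst first.
missing = df_demo.isna().sum().to_frame("n_missing")
missing["pct_missing"] = 100 * missing["n_missing"] / len(df_demo)
missing = missing.sort_values("n_missing", ascending=False)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df_demo.duplicated().sum())
```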



Visualization:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

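sns.pairplot(df)   # pairwise scatter plots of the numeric columns

plt.figure()       # new figure so the heatmap is not drawn onto the pairplot
sns.heatmap(df.corr(numeric_only=True), annot=True)   # correlations of numeric columns only
plt.show()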


### 3️⃣ Handling Missing Values

Goal: Handle incomplete or missing data properly.

✅ Methods:


| Method | Description | Example |
|--------|-------------|---------|
| Drop missing | Remove rows or columns with NaN | `df.dropna()` |
| Imputation (Mean/Median/Mode) | Fill missing with central value | `df.fillna(df.mean(numeric_only=True))` |
| Forward/Backward Fill | Fill with previous/next value | `df.ffill()` / `df.bfill()` |
| KNN / Regression Imputation | Predict missing values | `from sklearn.impute import KNNImputer` |
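
The imputation rows above can be sketched with scikit-learn's imputers. The toy frame `X_demo` and its column names are assumptions for illustration; in practice the imputer would be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

X_demo = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "weight": [50.0, np.nan, 70.0, 80.0],
})

# Median imputation: each NaN becomes the median of its own column.
median_imputer = SimpleImputer(strategy="median")
X_median = pd.DataFrame(
    median_imputer.fit_transform(X_demo), columns=X_demo.columns
)

# KNN imputation: each NaN becomes the mean of that feature over the
# n_neighbors nearest rows, with distance computed on observed features.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = pd.DataFrame(knn_imputer.fit_transform(X_demo), columns=X_demo.columns)
```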

### 4️⃣ Handling Duplicates

Goal: Remove duplicate rows so that repeated records are not over-weighted during training.

In [None]:
df = df.drop_duplicates()
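
Duplicates are not always exact copies; sometimes a key column repeats while other columns differ. A small sketch, assuming a hypothetical `orders` table keyed by `order_id`:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "status": ["new", "new", "new", "new", "shipped"],
})

# Exact duplicates: every column matches an earlier row.
n_exact = int(orders.duplicated().sum())

# Key-based duplicates: keep only the last record per order_id.
latest = orders.drop_duplicates(subset="order_id", keep="last").reset_index(drop=True)
```

Whether to keep the first or last record depends on how the data was collected, so `keep="last"` here is only one reasonable choice.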


### 5️⃣ Handling Outliers

Goal: Detect and handle abnormal data points.

✅ Methods:

| Method | Description | Code Example |
|--------|-------------|--------------|
| IQR (Interquartile Range) | Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR | `Q1, Q3 = df[col].quantile([0.25, 0.75])`<br>`IQR = Q3 - Q1`<br>`df = df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]` |
| Z-Score Method | Remove rows with \|z-score\| > 3 | `import numpy as np`<br>`from scipy import stats`<br>`df = df[np.abs(stats.zscore(df[col])) < 3]` |
| Visualization | Boxplot / Scatterplot | `sns.boxplot(data=df, x='feature')` |
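
The IQR rule from the table can be wrapped in a small helper. This is a sketch under stated assumptions: `iqr_bounds` is a hypothetical helper name, the `price` data is invented, and `k=1.5` is the conventional multiplier rather than a requirement.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


values = pd.DataFrame({"price": [10, 12, 11, 13, 12, 11, 95]})

lower, upper = iqr_bounds(values["price"])
is_outlier = ~values["price"].between(lower, upper)  # True outside the fences
filtered = values[~is_outlier]
```

Flagging outliers in a boolean column first, rather than dropping them immediately, makes it easier to inspect them before deciding to remove, cap, or keep them.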

### 6️⃣ Encoding Categorical Data

Goal: Convert categorical values into numeric format for ML models.

✅ Encoding Techniques:

| Method | Use Case | Code Example |
|--------|----------|--------------|
| Label Encoding | Target labels; codes follow sorted order (L=0, M=1, S=2), so a real ranking is not preserved | `from sklearn.preprocessing import LabelEncoder`<br>`le = LabelEncoder()`<br>`df['size'] = le.fit_transform(df['size'])` |
| One-Hot Encoding | Nominal (unordered) categories | `df_encoded = pd.get_dummies(df, columns=['color', 'city'])` |
| Ordinal Encoding | Ordered categories with an explicit order | `size_map = {'S':1, 'M':2, 'L':3}`<br>`df['size_ord'] = df['size'].map(size_map)` |

### 7️⃣ Feature Scaling / Normalization

Goal: Put features on comparable scales so that those with large numeric ranges do not dominate distance-based or gradient-based models.

| Method | Description | Code Example |
|--------|-------------|--------------|
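| Standardization (Z-score) | Mean = 0, SD = 1 | `from sklearn.preprocessing import StandardScaler`<br>`scaler = StandardScaler()`<br>`X_scaled = scaler.fit_transform(X)` |
| Min-Max Scaling | Range [0, 1] | `from sklearn.preprocessing import MinMaxScaler`<br>`scaler = MinMaxScaler()`<br>`X_scaled = scaler.fit_transform(X)` |
| Robust Scaling | Uses median/IQR, so less sensitive to outliers | `from sklearn.preprocessing import RobustScaler`<br>`scaler = RobustScaler()`<br>`X_scaled = scaler.fit_transform(X)` |
| L2 Normalization | Scales each row (sample) to unit norm | `from sklearn.preprocessing import Normalizer`<br>`normalizer = Normalizer()`<br>`X_norm = normalizer.fit_transform(X)` |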
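
A scaler learns its statistics from the data it is fitted on, so it is usually fitted on the training rows and then reused on the test rows. The sketch below uses synthetic data (`X_demo` is an assumption for illustration) to show that pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics leak into training.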

### 8️⃣ Feature Transformation

Goal: Reduce skew so that features are closer to normally distributed, or re-express a feature on a scale that is easier to interpret.

| Method | Description | Code Example |
|--------|-------------|--------------|
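| Log Transform | Reduces right skew (values must be > -1 for `log1p`) | `df['log_feature'] = np.log1p(df['feature'])` |
| Square Root / Power | Moderate skew (non-negative values) | `df['sqrt_feature'] = np.sqrt(df['feature'])` |
| Box-Cox / Yeo-Johnson | Learned power transform; Box-Cox needs strictly positive data, Yeo-Johnson also accepts zero and negative values | `from sklearn.preprocessing import PowerTransformer`<br>`pt = PowerTransformer(method='yeo-johnson')`<br>`df['transformed'] = pt.fit_transform(df[['feature']]).ravel()` |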
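
Whether a transform helped can be checked by comparing skewness before and after. This sketch uses synthetic log-normal data (the `income` series is an assumption for illustration), which is strongly right-skewed by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=1000))

skew_before = income.skew()           # strongly positive for log-normal data
skew_after = np.log1p(income).skew()  # much closer to 0 after the log transform
```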

### 9️⃣ Feature Engineering

Goal: Create new informative features.

Examples:

In [None]:
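df['Total'] = df['A'] + df['B']                  # row-wise sum of two columns
df['Avg'] = df[['A', 'B', 'C']].mean(axis=1)     # row-wise mean across three columns
df['Date'] = pd.to_datetime(df['Date'])          # parse strings into datetimes
df['Year'] = df['Date'].dt.year                  # extract the calendar year
df['Is_Weekend'] = df['Date'].dt.dayofweek >= 5  # Monday=0, so 5 and 6 are Saturday and Sunday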

### 🔟 Feature Selection

Goal: Select the most important features for the model.


| Method | Description | Code Example |
|--------|-------------|--------------|
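| Filter Methods | Univariate scores (correlation, chi-square, ANOVA F-test) | `from sklearn.feature_selection import SelectKBest, f_classif`<br>`selector = SelectKBest(score_func=f_classif, k=10)`<br>`X_new = selector.fit_transform(X, y)` |
| Wrapper Methods | Recursive Feature Elimination (RFE) | `from sklearn.feature_selection import RFE`<br>`from sklearn.linear_model import LogisticRegression`<br>`rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5)`<br>`X_rfe = rfe.fit_transform(X, y)` |
| Embedded Methods | Lasso / Tree-based importance | `from sklearn.linear_model import Lasso`<br>`lasso = Lasso(alpha=0.01)`<br>`lasso.fit(X, y)`<br>`important_features = lasso.coef_ != 0` |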
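
After selection it is usually helpful to know which columns survived, not just the reduced array. A minimal sketch on synthetic data, where the feature names `f0` to `f7` are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Keep the 3 features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the original columns.
selected_names = X_demo.columns[selector.get_support()].tolist()
```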

### 1️⃣1️⃣ Data Splitting

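Goal: Split the dataset into training and test sets, plus a validation set when tuning hyperparameters. Split before fitting any scaler, imputer, or resampler so that test data does not leak into preprocessing.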

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
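
`train_test_split` produces two parts per call, so a train/validation/test split can be sketched with two calls. The 60/20/20 proportions and the synthetic `X_demo`/`y_demo` arrays are assumptions for illustration; `stratify` applies to classification targets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# First hold out 20% of all rows as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Then take 25% of the remaining 80% as validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
```

Stratifying keeps the class ratio roughly the same in every part, which matters when one class is rare.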


### 1️⃣2️⃣ Balancing Data (if imbalanced classes)

Goal: Handle class imbalance in classification problems.

| Method | Description | Code Example |
|--------|-------------|--------------|
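| Oversampling (SMOTE) | Generate synthetic minority samples (training set only) | `from imblearn.over_sampling import SMOTE`<br>`smote = SMOTE(random_state=42)`<br>`X_res, y_res = smote.fit_resample(X_train, y_train)` |
| Undersampling | Remove majority samples (training set only) | `from imblearn.under_sampling import RandomUnderSampler`<br>`undersampler = RandomUnderSampler(random_state=42)`<br>`X_res, y_res = undersampler.fit_resample(X_train, y_train)` |
| Class Weights | Assign weight in training | `from sklearn.ensemble import RandomForestClassifier`<br>`model = RandomForestClassifier(class_weight='balanced')`<br>`model.fit(X_train, y_train)` |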
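
Resampling is applied after the split and only to the training data, so the test set keeps the real class distribution. The sketch below uses plain random oversampling with `sklearn.utils.resample` rather than SMOTE, to avoid the extra `imbalanced-learn` dependency; the synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 180 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Draw minority rows with replacement until they match the majority count.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=42,
)

X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])
```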

### 1️⃣3️⃣ Data Pipeline Creation (Automation)

Goal: Automate preprocessing using Scikit-learn Pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['Age', 'Salary']
categorical_features = ['Gender']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # ignore categories unseen during fit
])

pipeline = Pipeline([
    ('preprocess', preprocessor)
])

X_preprocessed = pipeline.fit_transform(df)
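
A pipeline can also carry imputation and the model itself, so that every step is fitted on the training split only. This is a sketch under stated assumptions: the `people` frame and `target` array are invented toy data that reuse the column names above, and `LogisticRegression` is just one possible estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with a few missing values.
people = pd.DataFrame({
    "Age": [25, 32, np.nan, 41, 29, 52, 38, 45, 23, 36, 48, 27],
    "Salary": [30, 48, 52, np.nan, 41, 90, 60, 75, 28, 55, 82, 35],
    "Gender": ["F", "M", "M", "F", np.nan, "M", "F", "M", "F", "M", "F", "M"],
})
target = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0])

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Salary"]),
    ("cat", categorical, ["Gender"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(
    people, target, test_size=0.25, stratify=target, random_state=42
)
model.fit(X_tr, y_tr)              # all preprocessing is fitted on X_tr only
predictions = model.predict(X_te)  # the same fitted steps are applied to X_te
```

Keeping preprocessing inside the pipeline means cross-validation and grid search refit it on each fold, which avoids leakage without extra bookkeeping.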
