# Data Preprocessing & Feature Engineering Roadmap

This roadmap outlines the essential steps, techniques, and tools for cleaning, transforming, and enhancing data to build effective machine learning models using Python (`Pandas`, `NumPy`, `Scikit-learn`).

## I. Introduction & Initial Exploration

* **Why Preprocess?** The "Garbage In, Garbage Out" principle. Understanding ML algorithm requirements (numerical input, scale sensitivity, handling missing values).
* **Goals:** Improve data quality, extract meaningful patterns, meet algorithm assumptions, enhance model performance.
* **Loading Data:** Using `Pandas` (`pd.read_csv`, `pd.read_excel`, `pd.read_sql`, etc.).
* **Initial Inspection (Crucial First Step):**
    * Viewing data: `.head()`, `.tail()`, `.sample()`.
    * Understanding structure & types: `.info()`, `.shape`, `.dtypes`.
    * Summary statistics: `.describe(include='all')`.
    * Checking unique values & counts: `.nunique()`, `.value_counts()`.
    * Visual exploration (briefly): Using `Matplotlib`/`Seaborn` for histograms, box plots, scatter plots to initially identify distributions, outliers, and relationships.
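
The inspection calls above can be sketched on a small in-memory frame. The column names and values here are hypothetical stand-ins for whatever `pd.read_csv` would return; this is an illustration of the calls, not a prescribed workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for the result of pd.read_csv(...)
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "city": ["Paris", "Lyon", "Paris", None, "Nice"],
    "income": [42000.0, 55000.0, 61000.0, 38000.0, np.nan],
})

print(df.head())                    # first rows
print(df.shape)                     # (rows, columns)
print(df.dtypes)                    # dtype of each column
df.info()                           # non-null counts and memory usage
print(df.describe(include="all"))   # summary statistics for all columns
print(df.nunique())                 # distinct non-null values per column
print(df["city"].value_counts())    # frequency of each category
```

Note that `.nunique()` and `.value_counts()` skip missing values by default, so a gap in `city` does not show up as its own category.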

## II. Handling Missing Data

* **Identifying Missing Values:** Using `.isnull().sum()` or `.isna().sum()` (the two are aliases). Optionally visualizing missingness with the `missingno` library.
* **Strategies:**
    * **Deletion:**
        * Listwise Deletion (Row Removal): `df.dropna()`. Pros: Simple. Cons: Data loss, potential bias.
        * Column Deletion (Feature Removal): `df.dropna(axis=1, thresh=...)`. Pros: Removes uninformative features. Cons: Information loss.
    * **Imputation (Filling Values):** Often preferred over deletion.
        * **Simple Imputation:**
            * Mean: `SimpleImputer(strategy='mean')` (Numerical only, sensitive to outliers).
            * Median: `SimpleImputer(strategy='median')` (Numerical only, robust to outliers).
            * Mode: `SimpleImputer(strategy='most_frequent')` (Categorical or numerical).
            * Constant: `SimpleImputer(strategy='constant', fill_value=...)`.
        * **Advanced Imputation:**
            * KNN Imputation: `KNNImputer` (Uses nearest neighbors).
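            * Multivariate Imputation: `IterativeImputer` (Models each feature with missing values as a function of the other features). Still marked experimental in `Scikit-learn`: it requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported.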
        * **Missing Indicator Feature:** Creating a binary column indicating missingness (`SimpleImputer(add_indicator=True)` or manually). Can help models learn from the missingness pattern.
* **Implementation:** Using `Pandas` `.fillna()` or `Scikit-learn` imputers (preferred within pipelines). Fit imputers on training data only.
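
A minimal sketch of the "fit on training data only" rule, assuming a hypothetical two-column numeric frame and a simple positional train/test split. It combines median imputation with the missing-indicator option mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with gaps
X = pd.DataFrame({
    "age": [25, np.nan, 41, 29, 35, np.nan, 52, 47],
    "income": [42.0, 55.0, np.nan, 38.0, 61.0, 47.0, 58.0, np.nan],
})

print(X.isna().sum())  # missing count per column

# Simple positional split for illustration: first 6 rows train, last 2 test
X_train, X_test = X.iloc[:6], X.iloc[6:]

# Median imputation plus one binary indicator column per feature
# that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)  # medians are learned here only
X_test_imp = imputer.transform(X_test)        # training medians are reused

print(imputer.statistics_)  # per-column medians learned from the training rows
print(X_test_imp)
```

The test rows are filled with medians computed from the training rows, so no information from the test set influences the imputation.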

## III. Encoding Categorical Data

* **Understanding Types:** Nominal (no order) vs. Ordinal (inherent order).
* **Techniques:**
    * **Ordinal Encoding:** `OrdinalEncoder` (for ordinal features, specify category order). `LabelEncoder` (typically only for the target variable `y`). Cons: Implies potentially false order if used on nominal data.
    * **One-Hot Encoding (OHE):** `OneHotEncoder`, `pd.get_dummies()` (for nominal features). Creates one binary column per category. Pros: No order implied. Cons: High dimensionality for high-cardinality features (many unique categories). Set the `drop` parameter (`'first'`, `'if_binary'`) to avoid perfectly collinear columns in linear models. Set `handle_unknown='ignore'` so that categories unseen during fitting are encoded as all zeros instead of raising an error.
    * **Other Techniques (for High Cardinality):**
        * Binary Encoding.
        * Feature Hashing (`FeatureHasher`).
        * Target Encoding (uses target information, risk of leakage if not done carefully within CV).

## IV. Feature Scaling (Numerical Data)

* **Why Scale?** Importance for distance-based algorithms (KNN, SVM), gradient descent (Linear/Logistic Regression, NNs), and regularization. Tree-based models are less sensitive.
* **Techniques (`sklearn.preprocessing`):**
    * **Standardization (Z-score):** `StandardScaler` (mean=0, std=1). Default choice generally.
    * **Normalization (Min-Max):** `MinMaxScaler` (scales to a range, e.g., `[0, 1]`). Sensitive to outliers. Useful for specific cases (e.g., image pixels).
    * **Robust Scaling:** `RobustScaler` (uses median and IQR). Less sensitive to outliers.
* **Implementation Note:** Fit scaler on training data ONLY, then transform both training and test data.
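
A short sketch of the fit-on-train, transform-both pattern, using a hypothetical single-feature array. The test value is chosen to lie outside the training range to show how each scaler behaves on unseen extremes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data; the test value exceeds the training maximum
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = np.array([[6.0]])

# Standardization: mean and std are learned from the training data only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Min-max scaling: min and max are learned from the training data only
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)  # may fall outside [0, 1]

print(X_train_std.mean(), X_train_std.std())
print(X_test_mm)
```

Because `MinMaxScaler` reuses the training minimum and maximum, test values beyond that range map outside `[0, 1]`, which is expected behavior rather than a bug.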

## V. Handling Outliers

* **Identifying Outliers:**
    * Visualization: Box plots, scatter plots, histograms.
    * Statistical Methods: Z-score, IQR (Interquartile Range) method.
* **Strategies:**
    * **Removal:** Delete outlier data points (use with caution, understand why they are outliers).
    * **Transformation:** Apply non-linear transformations (e.g., log, square root, Box-Cox) to reduce skewness and outlier impact. Log and Box-Cox require strictly positive values.
    * **Capping/Winsorizing:** Limit extreme values to a certain percentile (e.g., replace values above 99th percentile with the 99th percentile value).
    * **Using Robust Methods:** Employ scalers and models that are less sensitive to outliers (e.g., `RobustScaler` for scaling, tree-based models, `HuberRegressor`).
    * **Treat as Missing:** Consider treating extreme outliers as missing data and impute them.
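
The IQR method and capping can be sketched with plain `Pandas`. The series below is hypothetical, and the 1.5 × IQR fence is the common convention rather than a rule from this roadmap.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 13.0, 12.0, 95.0],
              name="value")

# IQR fences: values beyond 1.5 * IQR from the quartiles are flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

# Capping: pull flagged values back to the nearest fence instead of dropping rows
capped = s.clip(lower=lower, upper=upper)

print(s[is_outlier])
print(capped.max())
```

Capping keeps the row count unchanged, which matters when other columns in the same rows carry useful information.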

## VI. Feature Engineering

* **Goal:** Create new features from existing ones to improve model performance by providing more relevant information or capturing non-linear relationships. Often requires domain knowledge.
* **Techniques:**
    * **Interaction Features:** Combining features (e.g., `X1 * X2`, `X1 / X2`). `PolynomialFeatures` generates polynomial and interaction terms automatically.
    * **Transformations:** Applying mathematical functions (`log`, `sqrt`, `exp`, `Box-Cox`) to numerical features to stabilize variance, handle skewness, or linearize relationships.
    * **Binning/Discretization:** Grouping continuous features into discrete bins (`KBinsDiscretizer`, `pd.cut`, `pd.qcut`). Can help capture non-linearities for linear models.
    * **Date/Time Features:** Extracting components like year, month, day, day of week, hour, is_weekend, time differences from datetime columns (`Pandas` `.dt` accessor).
    * **Domain-Specific Features:** Creating features based on understanding the problem context (e.g., distance calculations, text-based features like word counts/sentiment, aggregation from related data).
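
Several of the techniques above can be sketched on one small frame. The columns (`timestamp`, `price`, `quantity`) and the bin edges are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction data
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-05 08:30", "2024-01-06 14:00", "2024-01-08 22:15"]
    ),
    "price": [10.0, 200.0, 3500.0],
    "quantity": [3, 1, 2],
})

# Interaction feature: product of two columns
df["revenue"] = df["price"] * df["quantity"]

# Transformation: log1p compresses the long right tail and tolerates zeros
df["log_price"] = np.log1p(df["price"])

# Binning: fixed, hand-picked edges (pd.qcut would use quantiles instead)
df["price_band"] = pd.cut(
    df["price"], bins=[0, 50, 500, np.inf], labels=["low", "mid", "high"]
)

# Date/time components via the .dt accessor
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0, Sunday=6
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["day_of_week"] >= 5

print(df)
```

Whether any of these derived columns helps is an empirical question, so they are best evaluated with cross-validation rather than assumed useful.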

## VII. Feature Selection

* **Goal:** Select a subset of the most relevant features to improve model performance (reduce overfitting, decrease training time) and interpretability.
* **Techniques (`sklearn.feature_selection`):**
    * **Filter Methods:** Evaluate features independently of the model.
        * `VarianceThreshold`: Remove low-variance (e.g., constant) features.
        * Univariate Statistical Tests: `SelectKBest`, `SelectPercentile` using scoring functions like `f_classif`/`f_regression`, `chi2` (requires non-negative features, e.g., counts or one-hot columns), `mutual_info_classif`/`mutual_info_regression`.
    * **Wrapper Methods:** Use a specific model to evaluate subsets of features.
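        * Recursive Feature Elimination (`RFE`, `RFECV`): Iteratively refit the model and remove the least important features, ranked by the estimator's `coef_` or `feature_importances_`. `RFECV` additionally uses cross-validated performance to choose how many features to keep.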
    * **Embedded Methods:** Feature selection is part of the model training process.
        * L1 Regularization (`Lasso`): Coefficients of weakly relevant features can be shrunk exactly to zero; `SelectFromModel` can then keep only the features with non-zero coefficients.
        * Tree-based Importances: Accessing `feature_importances_` from tree models (Decision Tree, Random Forest, Gradient Boosting).
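
A filter method and a wrapper method can be compared on synthetic data. The dataset comes from `make_classification`, and the choice of three features to keep and of `LogisticRegression` as the wrapped estimator are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter method: score each feature independently with an ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=3)
X_filtered = selector.fit_transform(X, y)

# Wrapper method: repeatedly refit the model and drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

print(selector.get_support(indices=True))  # column indices kept by the filter
print(rfe.get_support(indices=True))       # column indices kept by RFE
```

The two methods need not agree on the selected columns, since the filter scores features one at a time while RFE ranks them jointly through the fitted model.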

## VIII. Pipelines & ColumnTransformer (Revisited)

* **Importance:** Essential for applying preprocessing and feature engineering steps correctly and consistently, especially within cross-validation loops to prevent data leakage.
* **Tools:** `Pipeline`, `make_pipeline`, `ColumnTransformer`, `make_column_transformer`. Together they chain all steps (imputation, encoding, scaling, feature engineering, selection, final model) into a single `Scikit-learn` estimator object, so every step is refit on the training fold only.
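
A minimal end-to-end sketch, assuming a hypothetical frame with two numeric columns and one categorical column. The data is randomly generated and the target is defined from `income` purely so the example runs; the model choice is likewise an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with some missing ages
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
    "city": rng.choice(["Paris", "Lyon", "Nice"], n),
})
df.loc[df.index % 7 == 0, "age"] = np.nan
y = (df["income"] > df["income"].median()).astype(int)  # balanced toy target

# Numeric columns: impute, then scale
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own preprocessing
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and model form one estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each CV fold refits the imputer, scaler, and encoder on its training part only
scores = cross_val_score(model, df, y, cv=5)
print(scores.mean())

model.fit(df, y)
predictions = model.predict(df)
```

Passing the whole pipeline to `cross_val_score` is what prevents leakage: preprocessing statistics are never computed on the held-out fold.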