# Advanced Data Cleaning and Feature Selection

In this notebook, we will explore advanced techniques to clean data and select the best features for your machine learning models. These methods can help improve the performance of your models when used appropriately.

## Advanced Missing Value Strategies

Handling missing data is a common task in data preprocessing. The options below range from simple statistical fills to model-based imputation:

- **Mean/Median:** Filling missing numerical values with the average or median.
- **Mode:** Filling missing categorical data with the most common value.
- **Forward/Backward Fill:** Propagating the previous (forward fill) or next (backward fill) observed value in a time series.
- **Interpolation:** Estimating missing data points using methods like linear or spline interpolation.
- **Predictive Imputation:** Using machine learning models to predict missing values based on other features.
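
The simpler strategies in the list above can be sketched with pandas alone. The column names and values here are hypothetical toy data, chosen only so each fill is easy to check by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column, a categorical column, and a daily series.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 35.0, 40.0, np.nan],
        "city": ["Paris", "Paris", None, "Rome", "Paris"],
    }
)
ts = pd.Series(
    [1.0, np.nan, 3.0, np.nan, 5.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Mean/median: fill numeric gaps with a central value of the observed data.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Mode: fill categorical gaps with the most frequent observed value.
city_mode = df["city"].fillna(df["city"].mode().iloc[0])

# Forward fill carries the previous value forward; backward fill pulls the next one back.
ts_ffill = ts.ffill()
ts_bfill = ts.bfill()

# Linear interpolation estimates each gap from its neighbours.
ts_linear = ts.interpolate(method="linear")
```

Forward and backward fill only make sense when row order is meaningful, which is why the sketch applies them to the time-indexed series rather than to the DataFrame columns.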

## Outlier Detection Methods

Outliers are data points that differ significantly from other observations. Detecting and treating outliers can improve model stability.

### Methods include:

- **Z-Score:** Flags points more than a chosen number of standard deviations (commonly 3) from the mean.
- **IQR Method:** Flags points that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile.
- **Isolation Forest:** A tree-based model that isolates points with random splits; anomalies tend to need fewer splits to isolate.
- **Domain Knowledge:** Using expertise to identify outliers relevant to the business problem.
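
As a rough illustration of the Z-score and Isolation Forest approaches, the sketch below flags two extreme values injected into synthetic data. The data, the threshold of 3, and the `contamination` setting are all assumptions made for the example, not recommended defaults:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: 200 draws from a normal distribution plus two injected extremes.
rng = np.random.default_rng(0)
values = pd.Series(
    np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]]),
    name="value",
)

# Z-score: flag points more than `threshold` standard deviations from the mean.
threshold = 3.0
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > threshold]

# Isolation Forest: fit_predict returns -1 for anomalies and 1 for inliers.
# contamination is the assumed share of anomalies in the data.
forest = IsolationForest(contamination=0.01, random_state=0)
labels = forest.fit_predict(values.to_frame())
forest_outliers = values[labels == -1]

print("Z-score outliers:", sorted(z_outliers.tolist()))
print("Isolation Forest outliers:", sorted(forest_outliers.tolist()))
```

Note that extreme values inflate the mean and standard deviation the Z-score relies on, so the two methods will not always agree on real data.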

## Feature Selection Techniques

Selecting the most relevant features is crucial for building effective models.

- **Filter Methods:** Use statistical measures like correlation or mutual information.
- **Wrapper Methods:** Use algorithms that evaluate subsets of features, like forward or backward selection.
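- **Embedded Methods:** Feature selection integrated within model training, e.g., LASSO (L1 regularization), which can shrink coefficients to exactly zero. Ridge regression (L2) shrinks coefficients but does not remove features.
- **Recursive Feature Elimination (RFE):** Repeatedly fits a model, drops the least important features, and refits until the desired number of features remains.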

💡 Goal: Keep informative features and remove redundant ones!
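
A filter method and an embedded method might look like the following in scikit-learn. This is a minimal sketch on synthetic data: the dataset, `k=5`, and `C=0.1` are illustrative assumptions, and an L1-penalized `LinearSVC` stands in here as a classification analogue of LASSO:

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical synthetic data: 20 features, of which 5 are informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

# Filter method: score each feature by mutual information with the target
# and keep the 5 highest-scoring ones. random_state makes the scores repeatable.
mi_score = partial(mutual_info_classif, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k=5)
X_filter = mi_selector.fit_transform(X_scaled, y)

# Embedded method: the L1 penalty drives some coefficients to zero during
# training, and SelectFromModel keeps the features with non-zero weight.
l1_model = LinearSVC(penalty="l1", dual=False, C=0.1, random_state=0)
embedded_selector = SelectFromModel(l1_model)
X_embedded = embedded_selector.fit_transform(X_scaled, y)

print("Filter kept:", mi_selector.get_support(indices=True))
print("Embedded kept:", embedded_selector.get_support(indices=True))
```

Smaller values of `C` mean stronger regularization, so the embedded selector keeps fewer features as `C` decreases.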

## Example: Advanced Preprocessing Code

Here's some sample Python code demonstrating these techniques using pandas and scikit-learn.
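
The code in this section expects `df`, `X`, and `y` to exist already. If you want to run it without your own dataset, one possible setup is the synthetic data below; the sizes, feature names, and 5% missing rate are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Hypothetical data: 200 rows, 15 numeric features, binary target.
X, y = make_classification(
    n_samples=200, n_features=15, n_informative=5, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# df is a copy of the features with roughly 5% of the cells blanked out,
# so the imputation step has something to fill. X itself stays complete.
rng = np.random.default_rng(42)
df = pd.DataFrame(X, columns=feature_names)
df = df.mask(rng.random(df.shape) < 0.05)
```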

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.impute import KNNImputer

# Assumes df (a DataFrame), X (a numeric feature matrix with no missing
# values), and y (class labels) are already defined.

# Advanced imputation: KNN imputation for numerical features
knn_imputer = KNNImputer(n_neighbors=5)
df_numerical = df.select_dtypes(include=[np.number])
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df_numerical),
    columns=df_numerical.columns,
    index=df_numerical.index,  # keep the original row labels
)


# Outlier removal using the IQR rule
def remove_outliers_iqr(df, column):
    """Return the rows of df whose `column` value lies within 1.5 * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[df[column].between(lower_bound, upper_bound)]


# Never ask for more features than X actually has
k = min(10, X.shape[1])

# Feature selection using SelectKBest (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X, y)

# Recursive Feature Elimination with a random forest as the ranking model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=rf, n_features_to_select=k)
X_rfe = rfe.fit_transform(X, y)

print("Advanced preprocessing complete!")
print(f"Original features: {X.shape[1]}")
print(f"SelectKBest features: {X_selected.shape[1]}")
print(f"RFE features: {X_rfe.shape[1]}")
```

## Summary

Advanced data cleaning and feature selection techniques can greatly enhance your machine learning workflows. Remember to start simple and gradually incorporate these methods as needed for your data and problem.

## Key Takeaway

Advanced preprocessing can significantly improve model performance, but always start with simple methods and understand your data first.