# 5.0 Data Preprocessing and Feature Engineering
This lesson covers the key data transformation techniques used to prepare data for machine learning models and to improve their performance.

**Lesson objectives:** By the end of this lesson, students should be able to
* Understand the concepts of data preprocessing and feature engineering
* Apply data scaling techniques using Python
* Evaluate the impact of scaling on model performance
* Understand the need for scaling in feature engineering

Data preprocessing and feature engineering are essential steps in any machine learning pipeline because real-world data is often messy, incomplete, and not in a format that models can easily process.

**Data Preprocessing**
* The process of cleaning and preparing data before feeding it into a machine learning model.
* Key steps: handling missing values, encoding categorical variables, scaling/normalizing numerical features, and splitting data into training and test sets.
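
The key steps listed above can be chained together. The following is a minimal sketch, assuming a hypothetical toy dataset (`df_demo`) and one reasonable choice of scikit-learn components; it is not the only way to organize these steps. The data is split first so that imputation and scaling statistics are learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature, both with gaps
df_demo = pd.DataFrame({
    'Age': [25, np.nan, 22, 28, 29, 35, 41, 30],
    'City': ['New York', 'Chicago', np.nan, 'Chicago', 'Houston',
             'New York', 'Houston', 'Chicago'],
    'Target': [0, 1, 0, 1, 0, 1, 1, 0],
})
X_demo = df_demo[['Age', 'City']]
y_demo = df_demo['Target']

# Step 1: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 2: numeric columns -> fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Step 3: categorical columns -> fill gaps with the mode, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer(
    [('num', numeric_steps, ['Age']), ('cat', categorical_steps, ['City'])],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then apply the same transformation to the test rows
X_tr_ready = preprocess.fit_transform(X_tr)
X_te_ready = preprocess.transform(X_te)
print(X_tr_ready.shape, X_te_ready.shape)
```

Each of these steps is covered individually in the sections that follow.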

**Feature Engineering**
* The process of transforming raw data into meaningful features that improve model performance.
* Feature engineering helps improve the predictive power of machine learning algorithms by making the data more understandable or relevant to the model.
* Includes creating new features, encoding categorical features, and applying domain knowledge to the data.

# 5.1. Data Cleaning
## 5.1.1 Handling Missing Data
Missing data can occur due to various reasons such as errors during data collection or loss of data over time.
**Methods for Handling Missing Data:**
* **1) Removing Missing Data using `.dropna()`:** If a column or row has too many missing values, it might be better to drop it. Pandas functions like `.dropna()` can be used to remove missing data.

In [None]:
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {
    'Name': ['John', 'Alice', 'Bob', np.nan, 'Eve'],
    'Age': [25, np.nan, 22, 28, 29],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

**a) Remove Rows with Any Missing Values using `dropna()`**

In [None]:
# Remove rows with any missing values
df_no_na_rows = df.dropna()

print("DataFrame After Removing Rows with Any Missing Values:")
print(df_no_na_rows)

**b) Remove Columns with Any Missing Values using `dropna()`**

In [None]:
# Remove columns with any missing values
df_no_na_columns = df.dropna(axis=1)

print("DataFrame After Removing Columns with Any Missing Values:")
print(df_no_na_columns)

**c) Remove Rows with Missing Values in Specific Columns**

In [None]:
# Remove rows where 'Age' column has missing values
df_no_na_age = df.dropna(subset=['Age'])

print("DataFrame After Removing Rows with Missing 'Age':")
print(df_no_na_age)
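
Before dropping anything, it usually helps to measure how much is missing. The sketch below counts missing values per column and then uses the `thresh` parameter of `dropna()` to drop only columns with "too many" gaps. The `Phone` column and the threshold of 3 are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', np.nan, 'Eve'],
    'Age': [25, np.nan, 22, 28, 29],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Houston'],
    'Phone': [np.nan, np.nan, np.nan, '555-0100', np.nan],  # hypothetical, mostly empty
})

# Number and share of missing values in each column
missing_counts = df_missing.isna().sum()
missing_share = df_missing.isna().mean()
print(missing_counts)
print(missing_share)

# Keep only the columns that have at least 3 non-missing values
df_dense_cols = df_missing.dropna(axis=1, thresh=3)
print(df_dense_cols.columns.tolist())
```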

* **2) Imputation using `.fillna()`:**
Instead of removing data, you can impute missing values, i.e., fill them in with a reasonable substitute value. The most common methods for imputation are:
  - Mean/Median/Mode Imputation: Replace missing values with the mean or median (numerical features) or the mode (categorical features) of the column.
  - Forward/Backward Fill: For time-series data, you can forward fill or backward fill missing values.
  - Advanced Imputation: Use models like KNN imputation or Multivariate Imputation by Chained Equations (MICE) for more sophisticated handling.

**a) Fill Missing Data with the Column Mean**

In [None]:
# Fill missing 'Age' with the mean of the column
df_imputed_age = df.fillna({'Age': df['Age'].mean()})

print("DataFrame After Imputing Missing 'Age' with Mean:")
print(df_imputed_age)

**b) Fill Missing Data with Mode for Categorical Columns**

In [None]:
# Fill missing 'City' with the mode (most frequent value)
df_imputed_city = df.fillna({'City': df['City'].mode()[0]})

print("DataFrame After Imputing Missing 'City' with Mode:")
print(df_imputed_city)

**c) Forward Fill or Backward Fill**
For time-series data or datasets where you expect adjacent rows to have similar values, you might use:
* forward fill (`ffill()`): Fills the missing value with the previous row’s value
* backward fill (`bfill()`): Fills the missing value with the next row’s value

In [None]:
# Forward fill missing data
# (fillna(method='ffill') is deprecated in recent pandas versions; use ffill() instead)
df_forward_fill = df.ffill()

print("DataFrame After Forward Filling Missing Values:")
print(df_forward_fill)

# Backward fill missing data
# (fillna(method='bfill') is deprecated in recent pandas versions; use bfill() instead)
df_backward_fill = df.bfill()

print("DataFrame After Backward Filling Missing Values:")
print(df_backward_fill)
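
The median and advanced imputation options mentioned above can be sketched as follows. This is a minimal illustration that assumes a hypothetical numeric dataset (`df_num`, with an invented `Income` column) and uses scikit-learn's `KNNImputer` with an arbitrary choice of 2 neighbors.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with one gap per column
df_num = pd.DataFrame({
    'Age': [25, np.nan, 22, 28, 29],
    'Income': [50, 62, 45, np.nan, 58],  # hypothetical values, in thousands
})

# Median imputation: each gap is filled with the median of its own column
df_median = df_num.fillna(df_num.median())
print(df_median)

# KNN imputation: each gap is filled with the average of that feature
# over the 2 most similar rows (similarity is measured on the observed features)
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df_num), columns=df_num.columns)
print(df_knn)
```

Median imputation ignores the other columns, while KNN imputation uses them to find similar rows, which can give more plausible values when features are related.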

## 5.1.2. Removing duplicates
Duplicate data can lead to inaccurate models, biased results, and wasted computational resources. Duplicate entries in your dataset can occur for various reasons, such as errors during data collection, merging datasets from different sources, or incorrect data entry. Cleaning these duplicates helps ensure the integrity and reliability of your analysis or machine learning model.

**Methods for Removing Duplicates**

Use the `duplicated()` method to identify duplicate rows in a DataFrame and the `drop_duplicates()` method to remove them.

**1) Identify duplicates:**

In [None]:
import pandas as pd

# Sample DataFrame
data = {
    'ID': [1, 2, 3, 4, 5, 3],
    'Name': ['John', 'Alice', 'Bob', 'Alice', 'Eve', 'Bob'],
    'Age': [25, 30, 22, 30, 29, 22]
}
df = pd.DataFrame(data)

# Identify duplicates
duplicates = df[df.duplicated()]
print(duplicates)

**2) Removing Duplicates in Pandas using `drop_duplicates()`**

In [None]:
# Remove duplicates (by default, keeps the first occurrence of each duplicate row)
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
# or remove duplicates and keep the last occurrence
df_no_duplicates_last = df.drop_duplicates(keep='last')
print(df_no_duplicates_last)


**3) Removing Duplicates Based on Specific Columns**

In [None]:
# Remove duplicates based on the 'Name' column
df_no_duplicates_name = df.drop_duplicates(subset=['Name'])
print(df_no_duplicates_name)

## 5.1.3. Outlier detection and removal
Outliers are data points that significantly differ from the rest of the data. They can result from measurement errors, data entry errors, or can represent genuine but rare events. Whether they should be removed or not depends on the context and the goal of your analysis or machine learning task.

Outlier detection is crucial because:
* Outliers can negatively impact models that are sensitive to extreme values, like linear regression, K-means clustering, or principal component analysis (PCA).
* Outliers can distort statistical summaries such as the mean and standard deviation.
* Some models (like random forests and tree-based models) are more robust to outliers but still may benefit from detecting and treating them.
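
The effect of a single outlier on statistical summaries can be checked directly. The sketch below uses the same values as the Z-score example in this section and compares the mean, median, and standard deviation with and without the extreme value 100.

```python
import numpy as np

values = np.array([10, 12, 12, 13, 14, 15, 16, 16, 100, 18, 19, 20])
without_outlier = values[values != 100]

mean_with, mean_without = values.mean(), without_outlier.mean()
median_with, median_without = np.median(values), np.median(without_outlier)
std_with, std_without = values.std(), without_outlier.std()

print(f"Mean:   {mean_with:.2f} (with) vs {mean_without:.2f} (without)")
print(f"Median: {median_with:.2f} (with) vs {median_without:.2f} (without)")
print(f"Std:    {std_with:.2f} (with) vs {std_without:.2f} (without)")
```

The mean and standard deviation change substantially, while the median barely moves. This is why median-based methods are described as more robust to outliers.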

There are several methods for detecting and removing outliers, each suitable for different data types and use cases. 

**Common techniques for outlier detection and removal**

**1) Z-Score (Standard Score)**
The Z-score represents how many standard deviations a data point is away from the mean: z = (x − mean) / standard deviation. A high absolute value of the Z-score indicates that the data point is far from the mean and may be an outlier; a common threshold is |z| > 3.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Example dataset
data = {'Value': [10, 12, 12, 13, 14, 15, 16, 16, 100, 18, 19, 20]}
df = pd.DataFrame(data)

# Calculate Z-scores for each data point
df['Z-Score'] = zscore(df['Value'])

# Display each value alongside its Z-score
print(df)

# Remove outliers (Z-score > 3 or < -3)
df_no_outliers = df[df['Z-Score'].abs() <= 3]

print("\nData after Removing Outliers:")
print(df_no_outliers)
# In this example, the value 100 has a Z-score of about 3.29,
# which exceeds the threshold of 3, so it is removed.

**2) IQR (Interquartile Range) Method**
The IQR method is another popular technique for detecting outliers. The IQR is the range between the first quartile (Q1) and the third quartile (Q3). 
Outliers are typically defined as:
* Lower bound: Q1 − 1.5 × IQR
* Upper bound: Q3 + 1.5 × IQR
  
Any data points below the lower bound or above the upper bound are considered outliers.

In [None]:
# Calculate the IQR for the 'Value' column
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

# Filter out outliers based on IQR
df_no_outliers_iqr = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]

print("\nData after Removing Outliers (IQR):")
print(df_no_outliers_iqr)
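
The IQR rule can be wrapped in a small helper so it is easy to reuse on other columns. The function name `iqr_bounds` and the parameter `k` are hypothetical names chosen for this sketch. The last step shows capping (clipping) as one possible alternative to removal, for cases where the rows should be kept.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier bounds of a numeric Series using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([10, 12, 12, 13, 14, 15, 16, 16, 100, 18, 19, 20])
low, high = iqr_bounds(values)

# Flag the outliers instead of removing them right away
is_outlier = (values < low) | (values > high)
print(f"Bounds: [{low}, {high}]")
print("Outliers:", values[is_outlier].tolist())

# Alternative to removal: cap extreme values at the bounds
values_capped = values.clip(lower=low, upper=high)
print(values_capped.tolist())
```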

# 5.2. Data Transformation
Data transformation is a key step in preprocessing for machine learning. It helps convert raw data into a more useful format for model training, improving the model’s performance and efficiency.

## 5.2.1. Normalization, Standardization, and Scaling
These techniques are used to adjust the range and distribution of numerical features in your dataset, which can be crucial for algorithms that are sensitive to the scale of input features (like SVM, KNN, and Gradient Descent-based algorithms).
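
To see why scale matters for a distance-based algorithm, the sketch below trains the same KNN classifier with and without standardization. It assumes scikit-learn's built-in wine dataset, whose features have very different ranges; the split ratio and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# KNN on the raw features: large-valued features dominate the distance
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc_raw = knn_raw.score(X_test, y_test)

# Same model, but features are standardized first.
# The pipeline fits the scaler on the training data only.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)
acc_scaled = knn_scaled.score(X_test, y_test)

print(f"Accuracy without scaling: {acc_raw:.3f}")
print(f"Accuracy with scaling:    {acc_scaled:.3f}")
```

On this dataset the scaled pipeline typically scores noticeably higher, although the exact numbers depend on the split.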

### 5.2.1.1. Normalization (Min-Max Scaling)
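Normalization (also known as Min-Max Scaling) rescales each feature to a fixed range, typically [0, 1] or [-1, 1]. For the [0, 1] range, each value is transformed as x' = (x − min) / (max − min). Normalization is particularly useful when the data is roughly uniformly distributed or has known bounds. Because the minimum and maximum define the scale, it is sensitive to outliers.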

In [None]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Example dataset
data = {'Feature1': [1, 5, 10, 15, 20], 'Feature2': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Normalize the data
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("Normalized DataFrame:")
print(df_normalized)
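
As a check on what `MinMaxScaler` does, the sketch below applies the min-max formula by hand to the `Feature1` values and compares the result with scikit-learn. It also shows the `feature_range` parameter for scaling to [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

feature = np.array([[1.0], [5.0], [10.0], [15.0], [20.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (feature - feature.min()) / (feature.max() - feature.min())

# Same result with scikit-learn (default range is [0, 1])
scaled_01 = MinMaxScaler().fit_transform(feature)
print(np.allclose(manual, scaled_01))

# Scale to [-1, 1] instead
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(feature)
print(scaled_sym.ravel())
```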

### 5.2.1.2. Standardization (Z-Score Normalization)
Standardization transforms each feature to have a mean of 0 and a standard deviation of 1. It’s useful when your data roughly follows a Gaussian (normal) distribution or when you want to compare features that have different units or scales. Standardization is often the preferred method for algorithms that are sensitive to feature scale or that train more reliably on centered data (e.g., regularized Linear Regression, Logistic Regression, SVM).

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Standardize the data
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("Standardized DataFrame:")
print(df_standardized)
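
Standardization can also be written out by hand, which makes the formula z = (x − mean) / std explicit. The sketch below applies it to the `Feature2` values and compares the result with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which matches NumPy's default but not pandas' `.std()` default (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

feature = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])

# Manual standardization; np.std uses ddof=0 by default
manual = (feature - feature.mean()) / feature.std()

standardized = StandardScaler().fit_transform(feature)

print(np.allclose(manual, standardized))
print(f"Mean after scaling: {standardized.mean():.2f}")
print(f"Std after scaling:  {standardized.std():.2f}")
```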

### 5.2.1.3. Scaling (Robust Scaling)
Scaling can also refer to Robust Scaling, which is less sensitive to outliers. It scales the data based on the median and interquartile range (IQR) rather than the mean and standard deviation, making it more robust to extreme values.

In [None]:
from sklearn.preprocessing import RobustScaler

# Initialize the RobustScaler
scaler = RobustScaler()

# Scale the data
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("Scaled DataFrame (Robust Scaling):")
print(df_scaled)
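
The difference between robust scaling and min-max scaling shows up most clearly when an outlier is present. The sketch below uses a hypothetical feature with one extreme value (100), writes out the robust formula (x − median) / IQR by hand, and compares both scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical feature with one extreme value
feature = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [100.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
q1, q3 = np.percentile(feature, [25, 75])
manual = (feature - np.median(feature)) / (q3 - q1)

robust = RobustScaler().fit_transform(feature)
minmax = MinMaxScaler().fit_transform(feature)

print(np.allclose(manual, robust))
print("Robust: ", robust.ravel().round(2))
print("Min-max:", minmax.ravel().round(2))
```

With min-max scaling, the outlier forces the five typical values into a narrow band near 0. With robust scaling, the typical values keep a usable spread and only the outlier lands far from the rest.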

## 5.2.2. Handling Categorical Data
Many machine learning models require numerical data, but real-world datasets often contain categorical variables like "Gender", "City", or "Category". To convert these into a usable format, it is necessary to encode them.

**Methods for Encoding Categorical Variables:**
### 5.2.2.1. One-Hot Encoding
One-Hot Encoding creates binary columns for each category in a categorical feature. It’s useful when the feature is nominal (categories have no meaningful order).

For example, a "Color" feature with values "Red", "Green", and "Blue" will be converted into three binary columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# One-hot encode the 'Color' column
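# sparse_output=False returns a dense array
# (this parameter was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)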
encoded_data = encoder.fit_transform(df[['Color']])

# Convert the result into a DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.categories_[0])
print(encoded_df)
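
For quick exploration, pandas offers `get_dummies()` as a shorter route to the same kind of output. The sketch below is an alternative to the scikit-learn encoder, not a replacement for it. The `drop_first=True` variant is an optional extra that drops one column, since it can be inferred from the others.

```python
import pandas as pd

df_colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# One binary column per category
dummies = pd.get_dummies(df_colors, columns=['Color'], dtype=int)
print(dummies)

# Optional: drop the first category to avoid a redundant column
dummies_reduced = pd.get_dummies(df_colors, columns=['Color'], drop_first=True, dtype=int)
print(dummies_reduced)
```

A practical difference is that `OneHotEncoder` remembers the categories it was fitted on, so it can apply the same encoding to new data later, which makes it the safer choice inside a training pipeline.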

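### 5.2.2.2. Label Encoding
Label Encoding assigns each category a unique integer. It is compact, but the integers imply an order, so it is not suitable for nominal features because it introduces a false order between categories. Note that scikit-learn's `LabelEncoder` assigns integers in alphabetical order (in the example below, "Large" = 0, "Medium" = 1, "Small" = 2) and is designed for encoding target labels. For ordinal features such as "Low", "Medium", "High", where the meaningful order must be preserved, use Ordinal Encoding (Section 5.2.2.3).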

In [None]:
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# Initialize LabelEncoder
encoder = LabelEncoder()

# Perform Label Encoding
df['Size_Encoded'] = encoder.fit_transform(df['Size'])

print("Label Encoded DataFrame:")
print(df)
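
The sketch below makes the ordering issue visible: `LabelEncoder` sorts the categories alphabetically, so the resulting integers do not follow the Small < Medium < Large order. An explicit mapping (the dictionary `size_order` is a hypothetical name for this example) is one simple way to control the order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['Small', 'Medium', 'Large', 'Medium', 'Small'])

# LabelEncoder assigns integers in alphabetical order of the category names
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(sizes)
print(list(label_encoder.classes_))  # position in this list = assigned integer
print(codes.tolist())

# Explicit mapping that preserves the meaningful order
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
ordered_codes = sizes.map(size_order)
print(ordered_codes.tolist())
```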

### 5.2.2.3. Ordinal Encoding
Ordinal Encoding is a form of label encoding where categories are ordered in a meaningful way, such as "Low", "Medium", "High". This method maps each category to an integer value, preserving the ordinal relationship between categories.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Example data
data = {'Rating': ['Low', 'Medium', 'High', 'Medium', 'Low']}
df = pd.DataFrame(data)

# Initialize OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

# Apply ordinal encoding
df['Rating_encoded'] = ordinal_encoder.fit_transform(df[['Rating']])

print(df)

### 5.2.2.4. Binary Encoding
Binary Encoding is a combination of label encoding and one-hot encoding. It first converts the category labels into integers (like label encoding), then writes those integers in binary and stores each binary digit in its own column. It is useful when a categorical variable has a large number of categories and you want to avoid the high dimensionality that one-hot encoding causes: n categories need only about log2(n) columns instead of n.

In [None]:
import category_encoders as ce
import pandas as pd

# Example data
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Initialize BinaryEncoder
encoder = ce.BinaryEncoder(cols=['Category'])

# Apply binary encoding
df_encoded = encoder.fit_transform(df)

print(df_encoded)
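
The idea behind binary encoding can be reproduced with pandas and NumPy alone. The sketch below is a simplified illustration of the two steps described above, using hypothetical data with five categories; the column names and integer assignment are choices made for this example and may not match the output of `category_encoders` column for column.

```python
import numpy as np
import pandas as pd

categories = pd.Series(['A', 'B', 'C', 'D', 'E', 'A', 'C'])

# Step 1: map each category to an integer (1, 2, 3, ... in order of appearance)
codes, uniques = pd.factorize(categories)
codes = codes + 1

# Step 2: write each integer in binary, one column per bit
n_bits = int(codes.max()).bit_length()             # 5 categories -> 3 bits
shifts = np.arange(n_bits - 1, -1, -1)             # most significant bit first
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits, columns=[f'Category_{i}' for i in range(n_bits)])
binary_df.insert(0, 'Category', categories)
print(binary_df)
```

Here five categories fit into three columns, whereas one-hot encoding would need five.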

### 5.2.2.5. Target Encoding (Mean Encoding)
Target Encoding replaces the category values with the mean of the target variable for each category. This method is particularly useful for high-cardinality categorical variables (i.e., when there are many unique categories).

In [None]:
import category_encoders as ce
import pandas as pd

# Example data
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# Initialize TargetEncoder
encoder = ce.TargetEncoder(cols=['Category'])

# Apply target encoding
df_encoded = encoder.fit_transform(df['Category'], df['Target'])

print(df_encoded)
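
The core of target encoding is a group-wise mean, which can be sketched in plain pandas. The data below is hypothetical. Library implementations such as `ce.TargetEncoder` typically add smoothing, so their output will generally differ from these raw means.

```python
import pandas as pd

df_target = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Target':   [1,   0,   1,   1,   0,   1],
})

# Mean of the target for each category
category_means = df_target.groupby('Category')['Target'].mean()
print(category_means)

# Replace each category with its mean target value
df_target['Category_encoded'] = df_target['Category'].map(category_means)
print(df_target)
```

Because the encoding is computed from the target, the means should be learned from the training data only and then applied to the test data; otherwise information about the test labels leaks into the features.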

## 5.2.3. Feature extraction and feature selection
Feature extraction and feature selection are techniques used to improve model performance by reducing the number of irrelevant or redundant features.

### 5.2.3.1. Feature Extraction
Feature extraction is the process of transforming raw data into a set of usable features for machine learning. It’s often used with unstructured data like images, text, or time series.
* For text data, feature extraction methods like TF-IDF (Term Frequency-Inverse Document Frequency) and Word2Vec are commonly used.
* For image data, feature extraction methods like HOG (Histogram of Oriented Gradients) or CNNs (Convolutional Neural Networks) can be applied.

**Example of Feature Extraction for Text Data (TF-IDF):**

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
texts = ["I love machine learning", "Machine learning is great", "I love coding"]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the data into feature vectors
X = vectorizer.fit_transform(texts)

# Convert the result into a DataFrame for better visualization
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print("Feature Extraction with TF-IDF:")
print(df_tfidf)

### 5.2.3.2. Feature Selection
Feature selection is the process of choosing the most relevant features for the model, while removing irrelevant or redundant features.
Common feature selection techniques include:
* Filter methods (e.g., Correlation matrix, Chi-square test)
* Wrapper methods (e.g., Recursive Feature Elimination (RFE))
* Embedded methods (e.g., Lasso regression, Decision Trees)

**Example of Feature Selection using Recursive Feature Elimination (RFE):**

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Sample data (X = features, y = target)
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target

# Initialize the model
model = LogisticRegression(max_iter=200)

# Initialize RFE and select top 2 features
# (n_features_to_select must be passed as a keyword argument)
rfe = RFE(model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print("Selected Features based on RFE:")
print(rfe.support_)
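
RFE is a wrapper method. For comparison, a filter method from the list above can be sketched with `SelectKBest`, which scores each feature independently of any model. The ANOVA F-test (`f_classif`) and `k=2` are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score every feature against the target and keep the 2 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(iris.data, iris.target)

selected_names = [
    name for name, keep in zip(iris.feature_names, selector.get_support()) if keep
]

print("F-scores:", selector.scores_.round(1))
print("Selected features:", selected_names)
print("Shape after selection:", X_selected.shape)
```

Filter methods are fast because no model is trained, but they evaluate each feature in isolation and can miss features that are only useful in combination.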

# 5.3. Data Splits
In machine learning, it is essential to split the data into different subsets to avoid overfitting and assess the model's generalizability to unseen data.

* **Training Set:** The portion of the data used to train the machine learning model.
* **Test Set:** The portion of the data used to evaluate the performance of the trained model on unseen data.
* **Validation Set:** An optional dataset used during model training for tuning hyperparameters. It’s typically used in combination with techniques like cross-validation.

The key goal of splitting the data is to simulate how the model will perform on real-world, unseen data.

## 5.3.1. Train-Test Split
The train-test split is the simplest and most common method of splitting a dataset. It involves dividing the data into two parts:

* A training set (typically 70-80% of the data).
* A test set (typically 20-30% of the data).

The model is trained on the training set and evaluated on the test set. The splitting process should ideally be random to ensure that both the training and test sets are representative of the whole dataset. In Python, this can be done easily using Scikit-learn’s `train_test_split()`.

**Example of Train-Test Split in Python:**
In this example, the dataset is split into 80% for training and 20% for testing, ensuring that the model can be trained on most of the data and tested on unseen data.

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Example dataset
data = {
    'Feature1': [1, 2, 3, 4, 5, 6],
    'Feature2': [10, 20, 30, 40, 50, 60],
    'Target': [0, 1, 0, 1, 0, 1]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Features and target variable
X = df[['Feature1', 'Feature2']]  # Feature matrix
y = df['Target']                  # Target variable

# Split the data (80% for training and 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shapes of the split data
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
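
When a separate validation set is needed, one common approach is to call `train_test_split()` twice. The sketch below assumes the iris dataset and a 60/20/20 split; `stratify` keeps the class proportions the same in every subset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train: {X_train.shape[0]}, Validation: {X_val.shape[0]}, Test: {X_test.shape[0]}")
```

The validation set is used for choosing hyperparameters, and the test set is touched only once, for the final evaluation.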

## 5.3.2. Cross-Validation
Cross-validation is a more robust technique for evaluating machine learning models. It involves splitting the data into multiple subsets (folds), training the model on different combinations of these folds, and evaluating it on the remaining fold(s). The most common form is k-fold cross-validation.

In k-fold cross-validation, the dataset is split into k equal-sized folds. The process is repeated k times, with each fold serving as the test set exactly once. In each round:
* The model is trained on the other k-1 folds.
* The model is evaluated on the held-out fold.
* The score is recorded, and the k scores are averaged at the end.
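
The k-fold procedure described above can be written out as an explicit loop. This is a minimal sketch, assuming the iris dataset, a logistic regression model, and k = 5; `cross_val_score()` performs the same kind of loop internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train a fresh model on k-1 folds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Evaluate on the held-out fold
    score = model.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

print(f"Mean accuracy: {np.mean(fold_scores):.3f}")
```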

The advantage of cross-validation over a single train-test split is that it allows the model to be evaluated multiple times, using different parts of the data for training and testing. This leads to a more reliable estimate of the model’s performance.

Several types of cross-validation methods are available, such as:
* K-Fold Cross-Validation: The dataset is split into k subsets. Each fold serves as a validation set once.
* Stratified k-Fold Cross-Validation: Used for classification tasks when the data is imbalanced. Ensures that each fold has the same proportion of class labels.
* Leave-One-Out Cross-Validation (LOO CV): A special case where each fold contains only one data point. This is often used with very small datasets.

In Python, cross-validation can be performed using Scikit-learn’s `cross_val_score()` or `KFold`.

**Example of k-Fold Cross-Validation in Python:**
In this example, a logistic regression model is evaluated using 3-fold cross-validation, and the mean of the cross-validation scores is computed.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create a simple dataset
X = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]])
y = np.array([0, 1, 0, 1, 0, 1])

# Initialize a logistic regression model
model = LogisticRegression()

# Perform 3-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=3)  # cv=3 means 3-fold cross-validation

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {np.mean(cv_scores)}")
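
For imbalanced classification data, stratified k-fold keeps the class ratio the same in every fold. The sketch below uses hypothetical labels (90 negatives, 10 positives) and counts how many positives land in each test fold for plain `KFold` versus `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# Plain k-fold: the number of positives per fold depends on the shuffle
kf = KFold(n_splits=5, shuffle=True, random_state=0)
positives_kfold = [int(y_imb[test_idx].sum()) for _, test_idx in kf.split(X_imb)]

# Stratified k-fold: every fold gets the same share of positives
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_stratified = [
    int(y_imb[test_idx].sum()) for _, test_idx in skf.split(X_imb, y_imb)
]

print("Positives per fold (KFold):          ", positives_kfold)
print("Positives per fold (StratifiedKFold):", positives_stratified)
```

To use this with `cross_val_score()`, pass the splitter object as the `cv` argument, e.g. `cv=skf`.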

**Homework/Exercise:**
* Task 1: Implement normalization and standardization on a dataset and compare the performance of different models before and after scaling.
* Task 2: Apply One-hot encoding and Label encoding to a categorical dataset and train a classification model to compare the results.
* Task 3: Use Recursive Feature Elimination (RFE) and SelectKBest to perform feature selection on a dataset and compare the model’s performance with all features versus the selected features.
* Task 4: For a dataset with missing values, demonstrate how imputation, feature extraction, and feature selection can be used together to prepare the dataset for machine learning.
* Task 5: Use train-test split on a dataset of your choice and evaluate the model performance.
* Task 6: Implement k-fold cross-validation using `cross_val_score()` and compare the results with a single train-test split.
* Task 7: If available, use a real-world dataset with imbalanced classes (e.g., fraud detection) and perform stratified k-fold cross-validation.