# Resources

https://mlcourse.ai/book/index.html


# Preprocessing

## Encoding Categorical Data in Python

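1. **Label Encoding**: Converts each category into an integer value. In scikit-learn, `LabelEncoder` assigns integers in sorted (alphabetical) label order and is intended for target labels, so the integers reflect a meaningful order only if the sorted labels happen to match it. Use ordinal encoding or a custom mapping when the order matters.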

2. **One-Hot Encoding**: Creates binary columns for each category. Suitable for nominal data where no ordinal relationship exists.

3. **Dummy Variable Encoding**: Similar to one-hot encoding, but reduces the number of features to avoid multicollinearity by creating N-1 features for N categories.

4. **Binary Encoding**: Converts categories into binary digits and splits them into separate columns, balancing between one-hot and label encoding.

5. **Frequency or Count Encoding**: Replaces categories with their counts or frequencies in the dataset. It can be useful when the frequency of a category is important.

6. **Ordinal Encoding**: Similar to label encoding, but the integers follow an explicitly specified category order. Use it when the categorical feature is ordinal.

7. **Custom Mapping**: Involves defining a custom mapping based on domain knowledge, especially when the categorical data has a known order or hierarchy.

In [None]:
! pip install category_encoders
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
import category_encoders as ce  # For Binary Encoding

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create a categorical feature for demonstration
df['flower_category'] = pd.cut(df['sepal length (cm)'], bins=3, labels=['Small', 'Medium', 'Large'])
print(df.head())


# One-Hot Encoding
# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(df[['flower_category']])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(['flower_category']))
print(one_hot_df.head())


# Label Encoding (integers follow sorted label order: Large=0, Medium=1, Small=2)
label_encoder = LabelEncoder()
df['flower_category_encoded'] = label_encoder.fit_transform(df['flower_category'])
print(df[['flower_category', 'flower_category_encoded']].head())


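# Ordinal Encoding (pass the category order explicitly; the default sorts labels alphabetically)
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['flower_category_ordinal'] = ordinal_encoder.fit_transform(df[['flower_category']]).ravel()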
print(df[['flower_category', 'flower_category_ordinal']].head())



# Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=['flower_category'])
# Pass a DataFrame so the column named in `cols` can be looked up
df_binary = binary_encoder.fit_transform(df[['flower_category']])
print(df_binary.head())
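
The cells above cover one-hot, label, ordinal, and binary encoding. The remaining three items from the list (dummy variable, frequency, and custom mapping) can be sketched with plain pandas. This is a minimal illustration on a hypothetical toy column named `size`, not part of the Iris example.

```python
import pandas as pd

# Hypothetical toy column; the labels mirror the 'flower_category' values used above.
toy = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small", "Small", "Large"]})

# Dummy variable encoding: N-1 indicator columns (the first sorted category is dropped).
dummies = pd.get_dummies(toy["size"], prefix="size", drop_first=True, dtype=int)

# Frequency encoding: replace each category with its relative frequency in the column.
freq = toy["size"].map(toy["size"].value_counts(normalize=True))

# Custom mapping: integers chosen from domain knowledge of the ordering.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
mapped = toy["size"].map(size_order)

print(dummies.columns.tolist())
print(freq.round(3).tolist())
print(mapped.tolist())
```

With `drop_first=True`, a row of all zeros stands for the dropped category, which is what removes the redundant column.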


## Feature Engineering Techniques

Feature engineering is a critical process in the preparation of data for use in machine learning. Common techniques include:

- **Feature Transformation**: $ x' = f(x) $, where $ f $ can be log, square root, etc.

- **Scaling** (Min-Max Scaling): $ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $

- **Standardization** (Z-score Normalization): $ x' = \frac{x - \mu}{\sigma} $, where $ \mu $ is the mean and $ \sigma $ is the standard deviation.

- **Normalization**: $ x' = \frac{x}{||x||} $, where $ ||x|| $ is the norm of the vector $ x $.

- **Categorical Encoding** (One-hot Encoding): Convert a categorical variable with $n$ categories into $n$ binary variables (or $n - 1$ when using dummy variable encoding).

- **Binning**: Partitioning continuous features into discrete intervals.

- **Feature Creation**: $ x_{\text{new}} = g(x_1, x_2, \dots, x_n) $, where $ g $ is a function combining one or more existing features.
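
To make the formulas above concrete, here is a small NumPy sketch that applies them by hand to one illustrative vector. The values and bin edges are made up for demonstration. The z-score uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # illustrative values, not from any dataset

log_x = np.log1p(x)                             # feature transformation, f(x) = log(1 + x)
minmax_x = (x - x.min()) / (x.max() - x.min())  # min-max scaling into [0, 1]
z_x = (x - x.mean()) / x.std()                  # z-score with population std (ddof=0)
unit_x = x / np.linalg.norm(x)                  # L2 normalization of the vector
bins = np.digitize(x, [3.0, 6.0])               # binning with hypothetical edges 3 and 6

print(minmax_x)
print(z_x.round(3))
print(bins)
```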


In [None]:
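from sklearn.preprocessing import (
    MinMaxScaler, StandardScaler, Normalizer, FunctionTransformer, KBinsDiscretizer, PolynomialFeatures
)
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

# load_boston was removed in scikit-learn 1.2, so read the raw file from the
# original source instead (the approach suggested in scikit-learn's removal notice).
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines in the file; stitch them back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
X = pd.DataFrame(data, columns=feature_names)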

# Define transformations
log_transformer = FunctionTransformer(np.log1p, validate=False)  # log(1 + x)
scaler = MinMaxScaler()
standardizer = StandardScaler()
normalizer = Normalizer(norm='l2')  # rescales each row (sample), not each column
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
polynomial = PolynomialFeatures(degree=2, include_bias=False)

# Define the column transformer (columns not listed here are dropped by default)
pipeline = ColumnTransformer([
    ('log_transform', log_transformer, ['DIS', 'LSTAT']),
    ('scaling', scaler, ['B', 'ZN']),
    ('standardization', standardizer, ['CRIM', 'TAX']),
    ('normalization', normalizer, ['RAD', 'AGE']),
    ('binning', binner, ['INDUS']),
    ('polynomial_features', polynomial, ['RM', 'PTRATIO'])
])

# Apply the transformations
X_transformed = pipeline.fit_transform(X)

# Convert the transformed array back to a DataFrame
X_transformed_df = pd.DataFrame(X_transformed, columns=pipeline.get_feature_names_out())

# Display the first few rows of the transformed DataFrame
print(X_transformed_df.head())

## Class Imbalance
Class imbalance is a common issue in classification problems. It arises when one class has far more examples than another, which can bias a model toward the majority class and make it underperform on the minority class. For instance, a credit card fraud detection dataset may contain 284,807 transactions of which only 492 (about 0.17%) are fraudulent. An imbalance this severe requires specific handling to train an effective model.

### Remedies for Class Imbalance

#### 1. Downsampling
- **What it is**: Removing observations from the majority class.
- **When to use**: Preferable for large datasets (tens of thousands of samples).
- **Pros**: Helps to equalize class distribution and avoids the overfitting to duplicated samples that upsampling can cause.
- **Cons**: Can lead to loss of valuable information from the majority class.
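
A minimal downsampling sketch with pandas, assuming a hypothetical DataFrame `df_imb` with a binary `label` column (950 majority rows, 50 minority rows):

```python
import pandas as pd

# Hypothetical imbalanced data: 950 majority rows (label 0), 50 minority rows (label 1).
df_imb = pd.DataFrame({"feature": range(1000), "label": [0] * 950 + [1] * 50})

minority = df_imb[df_imb["label"] == 1]
majority = df_imb[df_imb["label"] == 0]

# Keep every minority row and a random majority subset of the same size.
majority_down = majority.sample(n=len(minority), random_state=42)

# Concatenate and shuffle so the classes are not grouped together.
df_down = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

print(df_down["label"].value_counts().to_dict())
```

The 900 discarded majority rows are the information loss mentioned under Cons.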

#### 2. Upsampling
- **What it is**: Increasing the number of observations in the minority class.
- **When to use**: Suitable for smaller datasets.
- **Pros**: Enhances the representation of the minority class without losing information.
- **Cons**: Risk of overfitting due to duplicate minority class samples or artificial noise from synthetic samples.
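
A minimal upsampling sketch using `sklearn.utils.resample` to draw minority rows with replacement. The DataFrame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 180 majority rows (label 0), 20 minority rows (label 1).
df_imb = pd.DataFrame({"feature": range(200), "label": [0] * 180 + [1] * 20})

minority = df_imb[df_imb["label"] == 1]
majority = df_imb[df_imb["label"] == 0]

# Sample minority rows with replacement until they match the majority count.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_up = pd.concat([majority, minority_up])

print(df_up["label"].value_counts().to_dict())
print(minority_up["feature"].nunique(), "distinct minority rows after upsampling")
```

Every upsampled row is a copy of one of the 20 original minority rows, which is the source of the overfitting risk noted under Cons.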

#### 3. Synthetic Data Generation (e.g., SMOTE)
- **What it is**: Creating synthetic samples for the minority class.
- **Pros**: Improves model sensitivity towards the minority class without exact repetition.
- **Cons**: Computationally expensive and can introduce artificial noise.
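
The core idea behind SMOTE is to interpolate between a minority point and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that idea under stated assumptions (Euclidean distance, uniform random interpolation, hypothetical name `smote_like`). It is not the reference implementation; for real use, the imbalanced-learn package provides `imblearn.over_sampling.SMOTE`.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)            # random base points
    picks = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base
    gap = rng.random((n_new, 1))                   # position along the connecting segment
    return X_min[base] + gap * (X_min[picks] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))  # hypothetical minority-class features
X_synth = smote_like(X_minority, n_new=30)
print(X_synth.shape)
```

Because each synthetic point lies on a segment between two real minority points, it is new rather than an exact duplicate, but it can land in regions that overlap the majority class, which is the artificial noise mentioned under Cons.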

#### 4. Class Weight Adjustment
- **What it is**: Adjusting the model's loss function to give more weight to the minority class.
- **Pros**: Directly addresses imbalance within the learning algorithm without altering data.
- **Cons**: Requires careful tuning, not universally effective across all algorithms.
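
In scikit-learn, many estimators accept a `class_weight` argument. The sketch below uses synthetic data from `make_classification` (an assumption for illustration) and shows that the `"balanced"` setting corresponds to weights of `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic data with roughly 95% / 5% class proportions.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Weights scikit-learn derives for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (2 * np.bincount(y))

# The classifier applies these weights inside its loss; the data is left unchanged.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

print(dict(zip([0, 1], weights.round(3))))
```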

### Evaluating Imbalanced Datasets
When evaluating models trained on imbalanced datasets, plain accuracy can be misleading: a model that always predicts the majority class scores high accuracy while never identifying a minority instance. Instead, use metrics such as precision, recall, ROC AUC (area under the receiver operating characteristic curve), and AUPRC (area under the precision-recall curve). These give a more nuanced view of performance, particularly of the model's ability to correctly identify minority class instances.
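
As a small illustration with made-up labels, consider a degenerate classifier that always predicts the majority class on a 95/5 split. The sketch uses scikit-learn's metric functions; the constant score vector stands in for a model with no ranking ability.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    roc_auc_score, average_precision_score,
)

y_true = np.array([0] * 95 + [1] * 5)  # 5% minority class
y_pred = np.zeros(100, dtype=int)      # always predict the majority class
y_score = np.full(100, 0.1)            # constant score: no ranking information

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)  # no positive predictions
rec = recall_score(y_true, y_pred)
roc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} "
      f"roc_auc={roc:.2f} auprc={auprc:.2f}")
```

Accuracy comes out at 0.95 even though recall is 0. The uninformative scores give a ROC AUC of 0.5 and an AUPRC equal to the minority prevalence (0.05), which are the baselines a useful model has to beat.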

### Best Practices
- **Testing on Unaltered Data**: Regardless of the resampling method used, it's essential to evaluate the model on a test set that reflects the original class distribution. This ensures the model's performance is representative of real-world conditions.
- **Careful Implementation**: Manipulating class distributions can affect a model's learned class probabilities, potentially leading to over-prediction of the minority class.
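
One way to follow the first practice is to split before resampling, so that only the training portion is rebalanced. This is a sketch on synthetic data, with random upsampling as a stand-in for whichever resampling method is chosen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # hypothetical features
y = np.array([0] * 950 + [1] * 50)    # 5% minority class

# Stratified split: the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Upsample the minority class in the training set only.
min_mask = y_train == 1
X_min_up, y_min_up = resample(
    X_train[min_mask], y_train[min_mask],
    replace=True, n_samples=int((~min_mask).sum()), random_state=0,
)
X_train_bal = np.vstack([X_train[~min_mask], X_min_up])
y_train_bal = np.concatenate([y_train[~min_mask], y_min_up])

print("train minority share:", y_train_bal.mean())
print("test minority share:", y_test.mean())
```

Resampling before the split would let copies of the same minority row appear in both sets and inflate the test scores.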

In summary, dealing with class imbalance involves choosing the right technique based on dataset size and carefully evaluating the model with appropriate metrics to ensure it performs well on both the majority and minority classes.