# Model Tuning and Improvement

This tutorial introduces the critical aspects of model tuning and improvement in classification and regression tasks. Key topics include hyperparameter tuning techniques such as grid search and random search, feature selection and engineering, ensemble methods, and handling imbalanced data.

## I. Model Tuning and Improvement

Model tuning and improvement involve adjusting and fine-tuning the parameters of a model to optimize its performance, reduce errors, and make it more accurate and reliable. 

### 1. Hyperparameter Tuning

Hyperparameters are the settings or configurations of a model that are external to the model and cannot be learned from the training process. They need to be configured before training. Examples include the learning rate, the depth of trees in a random forest, and the number of hidden layers in a neural network.

Tuning hyperparameters is a crucial step because the right set of hyperparameters can significantly improve the model's performance. Techniques used for hyperparameter tuning include:

- **Grid Search:** Grid search is the traditional approach to hyperparameter optimization: it exhaustively tries every combination of the provided hyperparameter values to find the best model. Its main disadvantage is computational cost, which grows multiplicatively with the number of hyperparameters and candidate values and is compounded on large datasets.

- **Random Search:** Random search differs from grid search mainly in that it does not try every single combination of hyperparameters. Instead, it randomly selects combinations of hyperparameters to train the model and evaluate performance. Given the same resources, it allows for a wider exploration of hyperparameters compared to grid search.

We use Grid Search when the hyperparameter space is relatively small and computationally manageable, whereas we use Random Search when the hyperparameter space is large, and we want to explore as many different combinations as possible given a time constraint.
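The difference in how the two strategies enumerate candidates can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The search space and the budget of 10 samples below are illustrative assumptions, not recommendations.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space (assumed values): 4 * 3 * 3 = 36 combinations
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}

# Grid search evaluates every combination
grid_candidates = list(ParameterGrid(param_grid))

# Random search evaluates a fixed budget of sampled combinations
random_candidates = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

print(len(grid_candidates))    # 36
print(len(random_candidates))  # 10
```

Adding one more hyperparameter with three values would triple the grid to 108 candidates, while the random search budget would stay at 10.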

**Examples**

Let's consider an example using Decision Tree and Random Forest algorithms. Here we will perform hyperparameter tuning for both algorithms using GridSearchCV and RandomizedSearchCV.

- **Decision Tree algorithm**

The Decision Tree algorithm has a number of hyperparameters that can be tuned for better performance. Here are some of them along with a brief description:

    - max_depth: The maximum depth of the tree. This is a major way to control overfitting as higher depth will allow the model to learn relations very specific to a particular sample.

    - min_samples_split: The minimum number of samples required to split an internal node. In scikit-learn this is either an integer of at least 2 or a float giving a fraction of the samples. Increasing this parameter constrains the tree, because a node must contain more samples before it can be split.

    - min_samples_leaf: The minimum number of samples required at a leaf node. This parameter is similar to min_samples_split, but it constrains the number of samples at the leaves, the terminal nodes of the tree.

    - max_features: The number of features to consider when looking for the best split. This is a key parameter when dealing with a high-dimensional dataset.

- **Random Forest**

Random Forest is a type of ensemble machine learning model that consists of a large number of individual decision trees. Here are some of the key hyperparameters for a Random Forest:

    - n_estimators: The number of trees in the forest. Generally, more trees give better and more stable performance, with diminishing returns, but they also require more computational resources.

    - max_depth: The maximum depth of each tree. This can help control overfitting.

    - min_samples_split: The minimum number of samples required to split an internal node in each decision tree.

    - min_samples_leaf: The minimum number of samples required to be at a leaf node in each decision tree.

    - max_features: The number of features to consider when looking for the best split. This can be an integer, float, string or None.

    - bootstrap: Whether bootstrap samples are used when building trees.

Let's illustrate how to tune these hyperparameters for a Decision Tree Classifier and Random Forest Classifier using Grid Search and Random search on the iris dataset:

In [None]:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Load the iris dataset
iris = datasets.load_iris()

# Decision Tree hyperparameters
dt_param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3]
}

# Random Forest hyperparameters
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 4, 6, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

dt = DecisionTreeClassifier()
rf = RandomForestClassifier()

# Grid Search for Decision Tree
dt_grid_search = GridSearchCV(dt, dt_param_grid)
dt_grid_search.fit(iris.data, iris.target)
print("Decision Tree - Grid Search Best Params: ", dt_grid_search.best_params_)

# Random Search for Decision Tree
dt_random_search = RandomizedSearchCV(dt, dt_param_grid)
dt_random_search.fit(iris.data, iris.target)
print("Decision Tree - Random Search Best Params: ", dt_random_search.best_params_)

# Grid Search for Random Forest
rf_grid_search = GridSearchCV(rf, rf_param_grid)
rf_grid_search.fit(iris.data, iris.target)
print("Random Forest - Grid Search Best Params: ", rf_grid_search.best_params_)

# Random Search for Random Forest
rf_random_search = RandomizedSearchCV(rf, rf_param_grid)
rf_random_search.fit(iris.data, iris.target)
print("Random Forest - Random Search Best Params: ", rf_random_search.best_params_)

Decision Tree - Grid Search Best Params:  {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Decision Tree - Random Search Best Params:  {'min_samples_split': 4, 'min_samples_leaf': 1, 'max_depth': 3}


In this script, we define separate dictionaries of parameters to tune for the Decision Tree and Random Forest classifiers. We then initialize GridSearchCV and RandomizedSearchCV objects for each classifier and fit them to our data. The best performing hyperparameters are then printed.

**Note:** in a real-world scenario, you'd also want to split your data into training and testing sets and assess the performance of your model using unseen data.
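A minimal sketch of that workflow, assuming a Decision Tree on the iris dataset with a small illustrative grid: the search runs cross-validation on the training split only, and the refit best model is scored once on the held-out split. The split size, grid values, and random seeds are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that the search never sees (split size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [1, 2, 3, 4, 5], "min_samples_leaf": [1, 2, 3]},
    cv=5,
)
search.fit(X_train, y_train)  # cross-validation happens inside the training set

# The best model is refit on the full training set, then scored on unseen data
test_accuracy = search.score(X_test, y_test)
print("Best params:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

Comparing the mean cross-validation accuracy with the test accuracy gives a rough check on whether the tuned model generalizes.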


### 2. Feature Selection and Engineering

#### 1. Feature selection

Feature selection is the process of identifying the most important features for a given model. It can reduce the complexity of a model, improve its performance, and reduce overfitting.

**Example**

We'll use the mutual information method for feature selection. Mutual information measures the dependency between two variables: the greater the value, the higher the dependency.

In [1]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target
feature_names = dataset.feature_names

# Use mutual information for feature selection
selector = SelectKBest(score_func=mutual_info_regression, k=4) 
X_new = selector.fit_transform(X, y)

# Keep the names of the features flagged by the selector's boolean mask
selected_features = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print('Selected Features: ', selected_features)

Selected Features:  ['MedInc', 'AveRooms', 'Latitude', 'Longitude']


This script selects the top 4 features that have the highest mutual information with the target variable.

#### 2. Feature Engineering

Feature engineering is the process of creating new features or transforming existing ones to improve model performance. This can involve a wide range of techniques, from simple mathematical transformations to complex domain-specific methods.

**Example**

A common feature engineering technique is to create polynomial features: new features formed from powers of the original features up to a specified degree. Let's create a quadratic feature from the `AveRooms` variable:

In [10]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
dataset = fetch_california_housing(as_frame=True)
df = dataset.frame

poly = PolynomialFeatures(degree=2, include_bias=False)
df['AveRooms_2'] = poly.fit_transform(df[['AveRooms']])[:, 1]

df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal | AveRooms_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 | 48.77803 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 | 38.914354 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 | 68.693192 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 | 33.84158 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 | 39.461681 |


In this script, we're creating a new feature `AveRooms_2` which is the square of `AveRooms`.

**Note:** these are very basic examples and the choice of feature selection and feature engineering methods will highly depend on the problem at hand, the dataset, and the model being used. It's always a good practice to validate the performance of the features using a hold-out set or cross-validation.
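One way to carry out that validation is to put the selection step inside a pipeline and compare cross-validated scores with and without it. The sketch below assumes the small diabetes dataset bundled with scikit-learn and a plain linear regression; both are stand-ins chosen for brevity, not part of the examples above.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Small regression dataset bundled with scikit-learn (an assumption for brevity)
X, y = load_diabetes(return_X_y=True)

all_features = LinearRegression()
# Selection sits inside the pipeline, so it is refit on each training fold only
top_k_features = make_pipeline(
    SelectKBest(score_func=mutual_info_regression, k=4),
    LinearRegression(),
)

scores_all = cross_val_score(all_features, X, y, cv=5, scoring="r2")
scores_top_k = cross_val_score(top_k_features, X, y, cv=5, scoring="r2")

print("All features, mean R^2:", scores_all.mean())
print("Top 4 features, mean R^2:", scores_top_k.mean())
```

Keeping the selector inside the pipeline avoids choosing features with knowledge of the validation folds, which would make the scores look better than they are.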

### 3. Ensemble Methods

Ensemble methods involve combining the decisions from multiple models to improve the overall performance. They can help make the model more robust and prevent overfitting. There are several types of ensemble methods:

- **Bagging (Bootstrap Aggregating):** This technique involves creating multiple subsets of the original data (with replacement), training a model on each, and combining the outputs. An example is the Random Forest algorithm.

- **Boosting:** This technique trains models in sequence where each new model corrects the errors made by the previous ones. Examples include Gradient Boosting and AdaBoost.

- **Stacking:** This involves training multiple different models and using another machine learning model to combine their outputs.


**Examples**

Let's take a look at examples for each type of ensemble method using the Scikit-learn library: Bagging, Boosting, and Stacking.

In each of these examples, the models are trained on the Iris dataset. The models are evaluated by calculating the accuracy of their predictions on a hold-out test set.

- **Bagging (Bootstrap Aggregating):** Bagging uses the idea of bootstrap sampling to create different subsets of the original dataset, trains a model on each subset, and combines the outputs. The most common example is the Random Forest algorithm.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Initialize RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = rf.predict(X_test)
print("Random Forest Accuracy: ", accuracy_score(y_test, y_pred))

Random Forest Accuracy:  1.0
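Random Forest adds random feature selection on top of bagging. To see bagging on its own, a sketch with scikit-learn's `BaggingClassifier` wrapped around a plain decision tree is shown below. The number of estimators and the seeds are assumptions, and the split mirrors the one used above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# 50 trees, each fit on a bootstrap sample of the training set
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=50,
    bootstrap=True,
    random_state=42,
)
bagging.fit(X_train, y_train)

# Predictions are combined across the 50 trees
bagging_accuracy = accuracy_score(y_test, bagging.predict(X_test))
print("Bagging Accuracy: ", bagging_accuracy)
```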


- **Boosting:** Boosting trains models in sequence with each model learning from the errors of its predecessor. Here's an example using Gradient Boosting.

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)
gb.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = gb.predict(X_test)
print("Gradient Boosting Accuracy: ", accuracy_score(y_test, y_pred))

Gradient Boosting Accuracy:  0.9666666666666667


- **Stacking (Stacked Generalization):** Stacking involves training a model (the "meta-learner") to combine the predictions of several other models. Here's an example:

In [14]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# Define the base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42)),
]

# Initialize Stacking Classifier with a Logistic Regression as the meta-learner
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression())
stacking_clf.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = stacking_clf.predict(X_test)
print("Stacking Classifier Accuracy: ", accuracy_score(y_test, y_pred))

Stacking Classifier Accuracy:  1.0


### 4. Handling Imbalanced Data

In classification problems, it's common to have imbalanced classes, i.e., one class has significantly more samples than the other. This can lead to poor performance for the minority class. Techniques to handle imbalanced data include:

- **Oversampling:** This increases the number of instances in the minority class, typically by randomly replicating them, so that the class is better represented in the training data.

- **Undersampling:** This reduces the number of instances in the majority class, typically by randomly discarding some of them, to bring balance to the dataset.

- **SMOTE (Synthetic Minority Over-sampling Technique):** This is an oversampling technique that does not replicate minority class instances. Instead, it constructs new synthetic minority instances by interpolating between an existing minority instance and one of its nearest minority-class neighbors.
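The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_like_samples`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: column 0 is the point itself
    # (assuming there are no duplicate rows)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # pick a minority point
        neighbor = rng.choice(neighbor_idx[base, 1:])  # pick one of its k neighbors
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic

# Toy minority class: five made-up points in the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_minority, n_new=10)
print(synthetic.shape)  # (10, 2)
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.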

**Note:** these techniques don't always improve results. It's essential to validate performance with a hold-out test set to ensure the tuning steps have improved your model.


**Examples**

Here is a demonstration of two methods for handling imbalanced data, using simulated data:

**1. Over-sampling the minority class using SMOTE (Synthetic Minority Over-sampling Technique)**

In [17]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Create a binary classification dataset with imbalanced class distribution
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=10,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training data
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X_train, y_train)

# Train a Random Forest classifier on the over-sampled data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_sm, y_sm)

# Evaluate the model
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99       197
           1       0.33      0.67      0.44         3

    accuracy                           0.97       200
   macro avg       0.66      0.82      0.72       200
weighted avg       0.98      0.97      0.98       200



If you get the error `ModuleNotFoundError: No module named 'imblearn'`, you need to install the imbalanced-learn package, which is imported as `imblearn`. You can install it by uncommenting and executing the following code cell:

In [19]:
#!pip install imbalanced-learn

**2. Under-sampling the majority class**

In [18]:
from imblearn.under_sampling import RandomUnderSampler

# Apply under-sampling to the training data
undersampler = RandomUnderSampler(sampling_strategy='majority')
X_us, y_us = undersampler.fit_resample(X_train, y_train)

# Train a Random Forest classifier on the under-sampled data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_us, y_us)

# Evaluate the model
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.86      0.93       197
           1       0.10      1.00      0.18         3

    accuracy                           0.86       200
   macro avg       0.55      0.93      0.55       200
weighted avg       0.99      0.86      0.92       200



In both examples, the model has been trained on a balanced version of the training set but is tested on the original, imbalanced test set to provide a realistic evaluation of its performance.

Imbalanced datasets are common in fraud detection, where fraud cases are a small minority compared with normal cases. You can apply this demonstration to the Synthetic Financial Datasets For Fraud Detection dataset on Kaggle, a simulated dataset that is structurally similar to real-world mobile money transaction data.

The dataset can be downloaded here: https://www.kaggle.com/datasets/ealaxi/paysim1

**Notes:**
- Keep in mind that random over-sampling increases the likelihood of overfitting since it replicates minority class examples, and under-sampling can lead to loss of information. It is often worth trying a combination of these methods, and you should always validate the results using a hold-out test set.

- In real-life scenarios, data needs to be carefully preprocessed, and the model's performance should be thoroughly evaluated, ideally using multiple metrics rather than accuracy alone, which can be misleading on imbalanced data. Precision, recall, F1 score, and the confusion matrix are good indicators to use when dealing with imbalanced datasets.
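A small made-up example shows why accuracy alone can mislead. The labels below are invented for illustration: a "classifier" that always predicts the majority class scores 80% accuracy while finding none of the positive cases.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels (made up for illustration): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "classifier" that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(accuracy)  # 0.8
print(recall)    # 0.0
print(cm)        # [[8 0]
                 #  [2 0]]
```

The second row of the confusion matrix makes the failure visible: both positive cases are predicted as negative.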