# Model Prediction Improvement

# 1. Introduction

There are several methods that can help improve the prediction performance of models. Here are some commonly used techniques:

1. Feature engineering: This involves creating new features or transforming existing ones to better represent the underlying patterns in the data. It can include techniques like scaling, normalization, one-hot encoding, dimensionality reduction, and creating interaction terms.


2. Hyperparameter tuning: Models often have hyperparameters that need to be set before training. Hyperparameter tuning involves systematically searching for the best combination of hyperparameter values that optimize the model's performance. Techniques like grid search, random search, and Bayesian optimization can be used for hyperparameter tuning.


3. Cross-validation: Cross-validation is a technique for estimating how well a model generalizes and for comparing hyperparameter settings. In k-fold cross-validation, the data is split into k subsets (folds); the model is trained on k-1 folds and evaluated on the held-out fold, and the process repeats until every fold has served once as the validation set. Averaging the fold scores gives a more reliable performance estimate than a single split and helps detect overfitting.


4. Ensemble methods: Ensemble methods combine multiple models to improve prediction performance. Bagging trains models on bootstrap samples of the data and averages their predictions or takes a majority vote (random forests are a bagging method built on decision trees). Boosting trains models sequentially, each one focusing on the errors of its predecessors, and combines them with a weighted sum. Bagging mainly reduces variance and increases stability, while boosting mainly reduces bias.


5. Regularization techniques: Regularization methods help prevent overfitting by adding a penalty term to the model's objective function. L2 regularization (Ridge) shrinks all coefficients toward zero, L1 regularization (Lasso) can set the coefficients of less important features exactly to zero and so encourages sparsity, and Elastic Net combines both penalties.


6. Data augmentation: Data augmentation involves generating additional training data by applying random transformations or perturbations to the existing data. This helps to increase the diversity of the training set and can improve the model's ability to generalize to new examples.


7. Transfer learning: Transfer learning leverages knowledge from models pre-trained on large-scale datasets. Reusing the representations these models have learned allows for better performance on smaller or domain-specific datasets, which is particularly useful when the available training data is limited.


8. Model ensembling: Model ensembling involves combining the predictions of multiple models, often of different types, to make a final prediction. This can be done through techniques like stacking, where the predictions of individual models are used as input features for a meta-model.
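
As a minimal sketch of the stacking idea from item 8, the snippet below feeds two base regressors into a ridge meta-model using scikit-learn's `StackingRegressor`. The choice of base models, the Iris data, and all variable names are illustrative assumptions rather than part of the notebook's own workflow.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data: Iris class labels treated as a numeric target
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models produce out-of-fold predictions (cv=5) that become
# the input features of the meta-model (final_estimator)
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)
stack_predictions = stack.predict(X_test)
print("Stacked R^2 on the test set:", stack.score(X_test, y_test))
```

The meta-model only sees predictions made on folds the base models were not trained on, which is what keeps stacking from simply memorizing the training labels.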

# 2. Basic Model without Techniques

In [29]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model without additional techniques
model_basic = RandomForestRegressor(random_state=42)
model_basic.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred_basic = model_basic.predict(X_test)
mse_basic = mean_squared_error(y_test, y_pred_basic)
rmse_basic = np.sqrt(mse_basic)
mae_basic = mean_absolute_error(y_test, y_pred_basic)
r2_basic = r2_score(y_test, y_pred_basic)

finalleaderboard = {
    'MSE': mse_basic,
    'RMSE': rmse_basic,
    'MAE': mae_basic,
    'r2 score': r2_basic
}

finalleaderboard = pd.DataFrame.from_dict(finalleaderboard, orient='index', columns=['Result'])
finalleaderboard

| Metric | Result |
|---|---|
| MSE | 0.001383 |
| RMSE | 0.037193 |
| MAE | 0.013667 |
| r2 score | 0.998021 |


1. **Mean Squared Error (MSE)**: It measures how close the predicted values are to the true values by averaging the squared differences between them. Squaring penalizes large errors more heavily. Smaller MSE values are better.


2. **Root Mean Squared Error (RMSE)**: It's the square root of MSE, which puts the error back in the same units as the target variable. Smaller RMSE values are better.


3. **Mean Absolute Error (MAE)**: It measures the average absolute difference between the predicted values and the true values, regardless of the direction of the error. Smaller MAE values are better.


4. **R-squared (R2) Score**: It tells us how well the model fits the data by representing the proportion of the target variable's variance that is explained by the model. Higher R2 values are better, with 1 being the best possible score; a model that does worse than always predicting the mean of the target gets a negative score.

**For MSE, RMSE, and MAE, lower values are better. For R2, a higher value closer to 1 is better.**
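
To make the four definitions concrete, here is a small worked example that computes each metric by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen for easy arithmetic, not results from the notebook.

```python
import numpy as np

# Made-up targets and predictions; only the third prediction is off (by 1)
y_true = np.array([0.0, 1.0, 2.0, 2.0])
y_pred = np.array([0.0, 1.0, 1.0, 2.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)          # average squared error
rmse = np.sqrt(mse)                 # back in the units of the target
mae = np.mean(np.abs(errors))       # average absolute error

ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print("MSE:", mse)    # 0.25
print("RMSE:", rmse)  # 0.5
print("MAE:", mae)    # 0.25
print("R2:", r2)      # 1 - 1/2.75, about 0.636
```

These are the same quantities that `mean_squared_error`, `mean_absolute_error`, and `r2_score` return in the cell above.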

# 3. Improved Model with Techniques

Here's an example showcasing a few methods for improving prediction performance using a real dataset. In this example, we'll be using the Iris dataset, which is available in the scikit-learn library. As in the basic model, the class label (0, 1, or 2) is treated as a numeric regression target.

## 3.1 Preparation

In [30]:
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


In this example, we start by loading the Iris dataset using **load_iris()** from scikit-learn. Then, we split the data into training and test sets using **train_test_split()**.

In [31]:
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 3.2 Data Augmentation

We attempt data augmentation by duplicating the training data in shuffled order. We use **shuffle()** from scikit-learn's **utils** module to shuffle the training features, then stack the original and shuffled features and duplicate the target values. A reordered copy contains exactly the same examples, so on its own it adds no new diversity to the training set; effective augmentation perturbs the examples themselves.

In [32]:
# Data augmentation: Shuffle and duplicate training data
X_train_augmented = np.vstack((X_train, shuffle(X_train)))
y_train_augmented = np.concatenate((y_train, y_train))

`X_train_augmented = np.vstack((X_train, shuffle(X_train)))`: This line vertically stacks the original training data X_train with a shuffled copy of X_train. The call shuffle(X_train) returns a randomly reordered copy of the rows of X_train; because no random_state is passed, the order differs on every run.

`y_train_augmented = np.concatenate((y_train, y_train))`: This line concatenates the original training labels y_train with themselves so that the label array has the same length as the augmented features. Because the labels are not shuffled in the same order as the features, the second half of y_train_augmented does not line up row by row with the second half of X_train_augmented: each shuffled row receives the label of whichever row originally occupied its position. Passing both arrays to a single call, shuffle(X_train, y_train), would keep every row paired with its own label.

#### What is vertically stacking?

Vertically stacking refers to concatenating arrays or matrices along their vertical axis (the row axis). When you vertically stack two arrays with the same number of columns, you place one on top of the other, so the result has the rows of both.
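
The following sketch shows one way to stack a shuffled copy while keeping features and labels paired. The toy arrays `X_toy` and `y_toy` are hypothetical and built so that the pairing is easy to verify: each label is 100 times the first feature of its row.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical toy data: the label is derived from the row itself
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_toy = X_toy[:, 0] * 100

# Passing both arrays to one shuffle call applies the same permutation to each
X_shuffled, y_shuffled = shuffle(X_toy, y_toy, random_state=0)

# Stack the originals on top of the shuffled copies
X_aug = np.vstack((X_toy, X_shuffled))
y_aug = np.concatenate((y_toy, y_shuffled))

print("Shapes:", X_aug.shape, y_aug.shape)
print("Every row still has its own label:",
      np.array_equal(y_aug, X_aug[:, 0] * 100))
```

Even with aligned labels, this only repeats existing examples; it is a check of the mechanics, not a recommendation for how to augment tabular data.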

## 3.3 Feature Engineering

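Next, we apply feature engineering by scaling the features using **StandardScaler()**. The scaler is fitted on the augmented training data only, and the same transformation is then applied to the test data so that no test-set statistics leak into training.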

In [33]:
# Feature engineering: Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_augmented)
X_test_scaled = scaler.transform(X_test)
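
As a quick illustration of what the scaler does, the sketch below standardizes a fresh Iris split and inspects the resulting column statistics. The names `demo_scaler`, `X_tr`, and `X_te` are assumptions introduced for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean and std learned from training rows
X_te_scaled = demo_scaler.transform(X_te)      # the same mean and std reused here

# Training columns end up with mean 0 and standard deviation 1
print("train means:", X_tr_scaled.mean(axis=0).round(6))
print("train stds: ", X_tr_scaled.std(axis=0).round(6))

# Test columns are close to, but not exactly, mean 0
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this step matters mainly for the ridge regression used later.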

## 3.4 Hyperparameter tuning using GridSearchCV for RandomForestRegressor

After that, we perform hyperparameter tuning using **GridSearchCV.** We define a parameter grid with different values for **n_estimators**, **max_depth**, and **min_samples_split**, and use it to search for the best combination of hyperparameters for the **RandomForestRegressor** model.

In [34]:
# Hyperparameter tuning using GridSearchCV for RandomForestRegressor
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5],
    'min_samples_split': [2, 5]
}


grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42), # the model to be tuned (RandomForestRegressor)
                              param_grid_rf, # the hyperparameter grid (param_grid_rf)
                              cv=5 # the number of cross-validation folds
                             )
grid_search_rf.fit(X_train_scaled, y_train_augmented)
best_model_rf = grid_search_rf.best_estimator_

`param_grid_rf`: Defines the grid of hyperparameters to be searched, including 'n_estimators', 'max_depth', and 'min_samples_split'.

`grid_search_rf`: Creates a GridSearchCV object with the RandomForestRegressor model and the defined parameter grid.

`grid_search_rf.fit(X_train_scaled, y_train_augmented)`: Fits the GridSearchCV object on the scaled training data and augmented labels, performing an exhaustive search and cross-validation to find the best hyperparameters.

`best_model_rf`: Retrieves the best-performing model found during the search, refitted on the full training data with the optimal hyperparameter values.
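
A fitted search object also exposes the winning parameters and the score of every combination. The sketch below, which uses a smaller hypothetical grid (`demo_grid`) for speed, shows one way to inspect them. When no `scoring` argument is given, as in the cell above, `GridSearchCV` ranks regressors by R^2; here the scorer is set explicitly to negated MSE as an assumption for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical reduced grid: 2 x 2 x 2 = 8 combinations
demo_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
}
demo_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    demo_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
demo_search.fit(X_tr, y_tr)

print("Best parameters:", demo_search.best_params_)
print("Best CV score (negated MSE):", demo_search.best_score_)

# One row per parameter combination, sorted by rank
results = pd.DataFrame(demo_search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```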

## 3.5 Ridge Regression with Cross-Validation

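We use **RidgeCV** from scikit-learn to perform ridge regression with built-in cross-validation over the regularization strength alpha. We first fit a RidgeCV on the scaled training set, then create a pipeline that chains **StandardScaler()** with a new **RidgeCV()**. The pipeline does not reuse the fitted `ridge_cv` object; its RidgeCV step selects alpha again, from scikit-learn's default candidates, each time the pipeline is fitted.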

In [35]:
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])
ridge_cv.fit(X_train_scaled, y_train_augmented)
best_model_ridge = make_pipeline(StandardScaler(), RidgeCV())

1. `ridge_cv`: Creates a RidgeCV object, which is a ridge regression model with built-in cross-validation. The alphas parameter specifies the list of alpha values to be tested during the cross-validation process.


2. `ridge_cv.fit(X_train_scaled, y_train_augmented)`: Fits the RidgeCV model on the scaled training data (X_train_scaled) and augmented training labels (y_train_augmented). It performs cross-validation to determine the best alpha value from the specified list and stores it in the `alpha_` attribute.


3. `best_model_ridge`: Creates a pipeline using make_pipeline. A pipeline chains multiple steps together, in this case the StandardScaler and RidgeCV steps. The StandardScaler performs feature scaling, and RidgeCV is a new, unfitted ridge regression model that uses the default alpha candidates; it does not inherit the alpha selected by `ridge_cv` above. Because the pipeline is later fitted on X_train_scaled, the features are standardized a second time, which has little effect on data that is already standardized.
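
If the goal is a single object that both scales the features and selects alpha, one option is to pass the candidate list directly to the RidgeCV step inside the pipeline and read the chosen value after fitting. This is a sketch under that assumption; `candidate_alphas` and `ridge_pipeline` are names introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidate_alphas = [0.1, 1.0, 10.0]

# Scaling happens inside the pipeline, so it is fitted on raw features
ridge_pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=candidate_alphas))
ridge_pipeline.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
fitted_ridge = ridge_pipeline.named_steps["ridgecv"]
print("Selected alpha:", fitted_ridge.alpha_)
print("Test R^2:", ridge_pipeline.score(X_te, y_te))
```

Keeping the scaler inside the pipeline means that, under cross-validation, it is refitted on each training fold rather than on the full training set.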



## 3.6 Ensemble method

We wrap the best **RandomForestRegressor** model from the previous hyperparameter tuning step in **TransformedTargetRegressor** from scikit-learn, which standardizes the target variable **(y)** before fitting and maps predictions back to the original scale. Note that this is a wrapper around a single model rather than an ensemble that combines the predictions of several models; the notebook nevertheless keeps the name `ensemble_model` for this object.

In [36]:
# Wrap the tuned random forest so that the target is standardized during fitting
ensemble_model = TransformedTargetRegressor(regressor=best_model_rf, transformer=StandardScaler())
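
For comparison, an ensemble that actually averages several models could be sketched with scikit-learn's `VotingRegressor`, as below. The pairing of a random forest with a ridge model and the name `averaging_model` are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# VotingRegressor fits each member and averages their predictions
averaging_model = VotingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
        ("ridge", Ridge(alpha=1.0)),
    ]
)
averaging_model.fit(X_tr, y_tr)

# Check that the ensemble output equals the mean of the member outputs
member_predictions = np.column_stack(
    [member.predict(X_te) for member in averaging_model.estimators_]
)
averaged = averaging_model.predict(X_te)
print("Matches member average:",
      np.allclose(averaged, member_predictions.mean(axis=1)))
```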

## 3.7 Cross-validation

We use cross_val_score() from scikit-learn to cross-validate each model with the neg_mean_squared_error scorer. scikit-learn returns the negated MSE so that a higher score is always better. We negate the fold scores, take the square root to obtain a per-fold RMSE, and average over the folds, so the printed "CV Score" values are mean RMSEs (lower is better). The test-set metrics are computed in the next section.

In [37]:
# Cross-validation
cv_scores_rf = cross_val_score(best_model_rf, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')
cv_scores_ridge = cross_val_score(best_model_ridge, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')
cv_scores_ensemble = cross_val_score(ensemble_model, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')

# Get the mean cross-validation scores
mean_cv_score_rf = np.mean(np.sqrt(-cv_scores_rf))
mean_cv_score_ridge = np.mean(np.sqrt(-cv_scores_ridge))
mean_cv_score_ensemble = np.mean(np.sqrt(-cv_scores_ensemble))

print("Random Forest CV Score: ", mean_cv_score_rf)
print("Ridge Regression CV Score: ", mean_cv_score_ridge)

Random Forest CV Score:  0.8646026859337071
Ridge Regression CV Score:  0.7837979139653724


`cv_scores_rf`: Uses cross_val_score to perform cross-validation for the RandomForestRegressor model (best_model_rf). It computes the negative mean squared error (neg_mean_squared_error) for each fold and stores the scores in cv_scores_rf.

`cv_scores_ridge`: Performs cross-validation for the ridge pipeline (best_model_ridge). It computes the negative mean squared error (neg_mean_squared_error) for each fold and stores the scores in cv_scores_ridge.

`cv_scores_ensemble`: Performs cross-validation for the wrapped random forest (ensemble_model). It computes the negative mean squared error (neg_mean_squared_error) for each fold and stores the scores in cv_scores_ensemble.

`mean_cv_score_rf`: Calculates the mean cross-validation RMSE for the RandomForestRegressor model by negating cv_scores_rf, taking the square root of each fold's value, and averaging the results.

`mean_cv_score_ridge`: Calculates the mean cross-validation RMSE for the ridge pipeline in the same way from cv_scores_ridge.

`mean_cv_score_ensemble`: Calculates the mean cross-validation RMSE for the wrapped random forest in the same way from cv_scores_ensemble. This value is computed but not printed.
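
The order of operations matters when turning fold scores into an RMSE. The sketch below uses hypothetical fold scores (`fold_scores`) to contrast the mean of per-fold RMSEs, which is what the cell above computes, with the square root of the mean MSE.

```python
import numpy as np

# Hypothetical neg_mean_squared_error values for three folds
fold_scores = np.array([-0.04, -0.09, -0.16])

# Approach used in the cell above: per-fold RMSE, then the mean
fold_rmse = np.sqrt(-fold_scores)        # [0.2, 0.3, 0.4]
mean_of_fold_rmse = fold_rmse.mean()     # 0.3

# Alternative: average the MSE first, then take one square root
rmse_of_mean_mse = np.sqrt(-fold_scores.mean())

print("Mean of per-fold RMSE:", mean_of_fold_rmse)
print("RMSE of mean MSE:     ", rmse_of_mean_mse)
```

The second value is never smaller than the first, so it is worth stating which one a reported "CV RMSE" refers to.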

## 3.8 Evaluate the ensemble model on the test set

In [38]:
ensemble_model.fit(X_train_scaled, y_train_augmented)
y_pred_ensemble = ensemble_model.predict(X_test_scaled)
mse_ensemble = mean_squared_error(y_test, y_pred_ensemble)
rmse_ensemble = np.sqrt(mse_ensemble)
mae_ensemble = mean_absolute_error(y_test, y_pred_ensemble)
r2_ensemble = r2_score(y_test, y_pred_ensemble)

In [39]:
finalleaderboard = {
    'MSE': mse_ensemble,
    'RMSE': rmse_ensemble,
    'MAE': mae_ensemble,
    'r2 score': r2_ensemble
}

finalleaderboard = pd.DataFrame.from_dict(finalleaderboard, orient='index', columns=['Result'])
finalleaderboard

| Metric | Result |
|---|---|
| MSE | 0.307318 |
| RMSE | 0.554363 |
| MAE | 0.461029 |
| r2 score | 0.560276 |


# 4. Result Comparison

In [40]:
rmse_improved = rmse_ensemble
mae_improved = mae_ensemble
r2_improved = r2_ensemble

In [41]:
# Create a dataframe to display the evaluation metrics
data = {
    "Model": ["Basic", "Improved"],
    "RMSE": [rmse_basic, rmse_improved],
    "MAE": [mae_basic, mae_improved],
    "R^2": [r2_basic, r2_improved]
}
df = pd.DataFrame(data)

print("Model Performance Comparison")
df

Model Performance Comparison


|   | Model | RMSE | MAE | R^2 |
|---|---|---|---|---|
| 0 | Basic | 0.037193 | 0.013667 | 0.998021 |
| 1 | Improved | 0.554363 | 0.461029 | 0.560276 |


## Why Model Improvement Techniques Sometimes Fail

There could be several reasons why using model performance improvement techniques does not result in improved performance or may even worsen the results. Here are a few potential reasons:

1. Mismatch between technique and data: The techniques applied might not be appropriate for the dataset or the specific problem at hand. It's important to select techniques that are suitable for the data characteristics and problem domain.

2. Improper implementation: The techniques may not have been implemented correctly or may not have been tuned properly. In this notebook, for example, the augmentation step shuffles the features without shuffling the labels in the same order, so half of the augmented training rows carry mismatched labels. It's crucial to understand the techniques thoroughly and apply them correctly to ensure their effectiveness.

3. Overfitting: Some techniques, such as complex ensemble models or hyperparameter tuning, may inadvertently lead to overfitting if not properly controlled. Overfitting occurs when the model learns to perform well on the training data but fails to generalize to unseen data. This can result in worse performance on new data.

4. Insufficient data or features: If the dataset is too small or lacks informative features, it may be challenging to improve the model's performance significantly. Techniques like feature engineering or data augmentation may not yield substantial improvements in such cases.

5. Noise or outliers: If the dataset contains a significant amount of noise or outliers, it can adversely affect the model's performance. Some techniques may be sensitive to noise or outliers, leading to suboptimal results.

6. Randomness and variability: Machine learning models can exhibit variability due to randomness, especially when using techniques like cross-validation or random initialization. It's possible that the observed performance differences are within the expected range of variability.

It's essential to carefully analyze the specific techniques applied, the dataset characteristics, and thoroughly evaluate the results to understand why the model's performance did not improve as expected. Additionally, it's always beneficial to experiment with different techniques, explore alternative approaches, and iterate on the model improvement process to achieve better results.
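
One way to investigate a suspected implementation problem is to change a single step and compare test metrics. The sketch below does this for the augmentation step, training the same random forest on the plain training set, on a copy augmented with features shuffled independently of the labels, and on a copy augmented with features and labels shuffled together. The helper `rmse_on_test_set` and the fixed `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

def rmse_on_test_set(X_fit, y_fit):
    """Fit a random forest on the given data and return its test-set RMSE."""
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_fit, y_fit)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

# Misaligned: only the features are shuffled, the labels keep their order
X_bad = np.vstack((X_tr, shuffle(X_tr, random_state=0)))
y_bad = np.concatenate((y_tr, y_tr))

# Aligned: features and labels are shuffled with the same permutation
X_shuf, y_shuf = shuffle(X_tr, y_tr, random_state=0)
X_ok = np.vstack((X_tr, X_shuf))
y_ok = np.concatenate((y_tr, y_shuf))

rmse_plain = rmse_on_test_set(X_tr, y_tr)
rmse_bad = rmse_on_test_set(X_bad, y_bad)
rmse_ok = rmse_on_test_set(X_ok, y_ok)

print("No augmentation:        ", rmse_plain)
print("Misaligned augmentation:", rmse_bad)
print("Aligned augmentation:   ", rmse_ok)
```

Isolating one change at a time like this makes it easier to tell whether a drop in performance comes from the technique itself or from how it was applied.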

# 5. Exercise

You have been provided with a code snippet that compares the performance of a basic model with an improved model using various techniques. Your task is to modify the code and incorporate additional techniques to further enhance the model's performance. Experiment with at least two additional techniques such as feature selection, regularization, or ensemble methods. Evaluate the modified model and compare its performance against the basic model and the previous improved model.

#### Instructions:

1. Study the code provided and understand the implementation of the basic model and the techniques applied in the improved model.


2. Choose two additional techniques that you want to incorporate into the model to further enhance its performance. Examples include feature selection (e.g., using SelectKBest), regularization (e.g., L1 or L2 regularization), or ensemble methods (e.g., Gradient Boosting or Stacking).


3. Modify the code to include the chosen techniques. Make sure to appropriately apply the techniques, train the modified model, and evaluate its performance.


4. Calculate and compare the evaluation metrics such as RMSE, MAE, and R-squared for the basic model, the previous improved model, and the modified model.


5. Reflect on the results and analyze the impact of the additional techniques on the model's performance.

#### Challenge:

Experiment with different parameter settings or combinations of the additional techniques to find the best-performing model. Compare the performance of different combinations and discuss the findings in terms of the evaluation metrics.

Note: You can refer to the scikit-learn documentation for more information about the techniques and parameters you choose to incorporate.

By completing this exercise, you will gain hands-on experience in modifying and improving the model's performance using various techniques. It will also help deepen your understanding of how different techniques impact the evaluation metrics and overall model performance.

In [42]:
# write your code here
