# Model Prediction Improvement

# 1. Introduction

There are several methods that can help improve the prediction performance of models. Here are some commonly used techniques:

1. Feature engineering: This involves creating new features or transforming existing ones to better represent the underlying patterns in the data. It can include techniques like scaling, normalization, one-hot encoding, dimensionality reduction, and creating interaction terms.


2. Hyperparameter tuning: Models often have hyperparameters that need to be set before training. Hyperparameter tuning involves systematically searching for the best combination of hyperparameter values that optimize the model's performance. Techniques like grid search, random search, and Bayesian optimization can be used for hyperparameter tuning.


3. Cross-validation: Cross-validation is a technique for estimating how well a model generalizes and for tuning hyperparameters. In k-fold cross-validation, the data is split into k subsets (folds); the model is trained on k-1 folds and evaluated on the held-out fold, and the process repeats until each fold has served once as the validation set. Averaging the fold scores gives a more reliable performance estimate than a single split and helps detect overfitting.


4. Ensemble methods: Ensemble methods combine multiple models to improve prediction performance. Bagging trains models on bootstrap samples of the data and averages their predictions (or takes a majority vote for classification); random forests are a bagging method built on decision trees. Boosting fits models sequentially, each one concentrating on the errors of its predecessors, and combines them in a weighted sum. Bagging mainly reduces variance, while boosting mainly reduces bias.


5. Regularization techniques: Regularization methods help prevent overfitting by adding a penalty term to the model's objective function. Techniques like L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net can be used to shrink the coefficients of less important features or encourage sparsity in the model.


6. Data augmentation: Data augmentation involves generating additional training data by applying random transformations or perturbations to the existing data. This helps to increase the diversity of the training set and can improve the model's ability to generalize to new examples.


7. Transfer learning: Transfer learning leverages knowledge from pre-trained models that have been trained on large-scale datasets. By utilizing the learned representations from these models, transfer learning allows for better performance on smaller or domain-specific datasets. This is particularly useful when the available training data is limited.


8. Stacking: Stacking combines the predictions of multiple models, often of different types, by using them as input features for a meta-model that makes the final prediction.


9. Error analysis: By carefully analyzing the errors made by a model, it is possible to gain insights into areas of improvement. This can involve examining misclassified examples, identifying patterns or biases in the errors, and using this information to refine the model or adjust the training process.


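
In k-fold cross-validation, each fold serves once as the validation set while the model is trained on the remaining k-1 folds. The sketch below writes that loop out by hand with NumPy so the mechanics are visible. The toy data, the helper name `k_fold_indices`, and the least-squares model are assumptions made for illustration only.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Hypothetical toy regression data: y is a noisy linear function of X
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_rmse = []
for train_idx, val_idx in k_fold_indices(len(X), k=5):
    # Fit ordinary least squares (with intercept) on the k-1 training folds
    A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    # Score on the single held-out fold
    A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
    residuals = y[val_idx] - A_val @ coef
    fold_rmse.append(float(np.sqrt(np.mean(residuals ** 2))))

print("RMSE per fold:", np.round(fold_rmse, 3))
print("Mean RMSE:", round(float(np.mean(fold_rmse)), 3))
```

In practice, scikit-learn's `KFold` and `cross_val_score` handle the splitting and scoring; the manual loop is only meant to show what they do.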

# 2. Basic Model without Techniques

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

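# Load the Boston Housing dataset
# (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2)
boston = load_boston()
X, y = boston.data, boston.target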

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature engineering: Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model without additional techniques
model_basic = RandomForestRegressor(random_state=42)
model_basic.fit(X_train_scaled, y_train)

# Evaluate the model on the test set
y_pred_basic = model_basic.predict(X_test_scaled)
mse_basic = mean_squared_error(y_test, y_pred_basic)
rmse_basic = np.sqrt(mse_basic)
mae_basic = mean_absolute_error(y_test, y_pred_basic)
r2_basic = r2_score(y_test, y_pred_basic)

# Collect the test-set metrics of the basic model
finalleaderboard = {
    'MSE': mse_basic,
    'RMSE': rmse_basic,
    'MAE': mae_basic,
    'r2 score': r2_basic
}

finalleaderboard = pd.DataFrame.from_dict(finalleaderboard, orient='index', columns=['Result'])
finalleaderboard

| Metric | Result |
|---|---|
| MSE | 7.912745 |
| RMSE | 2.81296 |
| MAE | 2.041078 |
| r2 score | 0.8921 |
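
The four metrics reported for the basic model can be written out directly, which makes their relationships explicit (RMSE is the square root of MSE, and R^2 compares the squared error against that of always predicting the mean). The following is a minimal sketch; the function name `regression_metrics` and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total sum of squares
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(errors))),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical values, chosen only to exercise the function
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

These formulas match what `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute for a single-output regression.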


# 3. Improved Model with Techniques

The following example applies several of these techniques to a real dataset. We use the Boston Housing dataset, which shipped with scikit-learn until it was removed in version 1.2.

## 3.1 Preparation

In [2]:
import warnings

import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Ignore all warnings
warnings.filterwarnings("ignore")


In this example, we start by loading the Boston Housing dataset using **load_boston()** from scikit-learn. Then, we split the data into training and test sets using **train_test_split()**.

In [3]:
# Load the Boston Housing dataset
boston = load_boston()
X, y = boston.data, boston.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 3.2 Data Augmentation

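We perform a simple form of data augmentation by appending a shuffled copy of the training data to the original, using **shuffle()** from scikit-learn's **utils** module. Two caveats apply. Duplicating rows adds no new information, so it does not by itself increase the diversity of the training set. In addition, the cell below shuffles only the features while the targets are concatenated in their original order, so the rows in the appended half no longer match their targets.

In [4]:
# Data augmentation: Shuffle and duplicate training data
# NOTE: shuffle(X_train) reorders the feature rows only, while y_train is
# appended unshuffled, so the second half of the augmented set pairs features
# with the wrong targets. Shuffling both together keeps them aligned:
#   X_shuffled, y_shuffled = shuffle(X_train, y_train, random_state=42)
X_train_augmented = np.vstack((X_train, shuffle(X_train)))
y_train_augmented = np.concatenate((y_train, y_train))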
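
In the cell above, `shuffle(X_train)` is called on the features alone, so the shuffled rows lose their pairing with `y_train`. The sketch below shows the difference on toy data: passing both arrays to `shuffle` applies the same permutation to each. It then builds an augmented set by adding small Gaussian noise to the shuffled copy; the noise scale of 0.01 and the toy arrays are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data: the target is tied to the first feature, so alignment is easy to check
X = np.arange(40, dtype=float).reshape(20, 2)
y = X[:, 0] * 10

# Shuffling X alone changes the row order but leaves y untouched
X_only = shuffle(X, random_state=0)
misaligned = not np.array_equal(X_only[:, 0] * 10, y)

# Shuffling X and y together applies the same permutation to both
X_shuf, y_shuf = shuffle(X, y, random_state=0)
aligned = np.array_equal(X_shuf[:, 0] * 10, y_shuf)

# Augment with a perturbed copy (hypothetical noise scale) and matching targets
rng = np.random.default_rng(0)
X_aug = np.vstack([X, X_shuf + rng.normal(scale=0.01, size=X_shuf.shape)])
y_aug = np.concatenate([y, y_shuf])

print("X-only shuffle misaligned:", misaligned)
print("Joint shuffle aligned:", aligned)
print("Augmented shapes:", X_aug.shape, y_aug.shape)
```

Whether noise-based augmentation helps on tabular data depends on the dataset, so it is worth validating against a model trained on the unaugmented set.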

## 3.3 Feature Engineering

Next, we apply feature engineering by standardizing the features with **StandardScaler()**. The scaler is fitted on the training data only and then applied unchanged to the test data.

In [5]:
# Feature engineering: Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_augmented)
X_test_scaled = scaler.transform(X_test)
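
Standardization subtracts the training mean and divides by the training standard deviation, and the same statistics are reused for the test set. A minimal sketch on synthetic data (the arrays below are hypothetical) confirms that `StandardScaler` matches the formula written by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices standing in for the train/test split
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same transform by hand, using the training statistics
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print("Matches manual formula:", np.allclose(X_test_scaled, manual_test))
print("Train column means after scaling:", np.round(X_train_scaled.mean(axis=0), 6))
```

The scaled test columns will generally not have mean exactly 0 or standard deviation exactly 1, because the statistics come from the training set.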

## 3.4 Hyperparameter tuning using GridSearchCV for RandomForestRegressor

Next, we tune hyperparameters with **GridSearchCV**. We define a parameter grid with three values each for **n_estimators**, **max_depth**, and **min_samples_split** (27 combinations) and use 5-fold cross-validation to find the best combination for the **RandomForestRegressor** model. Because no **scoring** argument is passed, candidates are ranked by the estimator's default score, which is R^2 for regressors.

In [6]:
# Hyperparameter tuning using GridSearchCV for RandomForestRegressor
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}
grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_rf, cv=5)
grid_search_rf.fit(X_train_scaled, y_train_augmented)
best_model_rf = grid_search_rf.best_estimator_
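
After fitting, the search object exposes the winning parameters and the cross-validated score of each candidate. The sketch below runs a deliberately small grid on synthetic data from `make_regression` and sets `scoring` explicitly so the ranking uses MSE; the data and the grid values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Small hypothetical grid: 2 x 2 = 4 candidates
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 3]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print("Candidates evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
print("Best mean CV MSE:", -search.best_score_)
```

With the default `refit=True`, `search.best_estimator_` is retrained on all of the data passed to `fit` using the best parameters.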

## 3.5 Ridge Regression with Cross-Validation

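We use **RidgeCV** from scikit-learn to perform ridge regression with built-in cross-validation over the regularization strength alpha. We first fit it on the training set to select the best alpha, then build a pipeline that combines **StandardScaler()** with **RidgeCV** for the cross-validation step below.

In [7]:
# RidgeCV selects alpha by internal cross-validation over these candidates
alphas = [0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train_augmented)
# Pipeline evaluated later; the alphas are passed explicitly
# (they are also RidgeCV's default candidates)
best_model_ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))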
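
Once a pipeline containing **RidgeCV** has been fitted, the selected regularization strength can be read from the fitted step. A minimal sketch on synthetic data follows; the data is hypothetical, and the step name `"ridgecv"` is the lowercase class name that `make_pipeline` assigns automatically.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=150, n_features=8, noise=15.0, random_state=1)

alphas = [0.1, 1.0, 10.0]
ridge_pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge_pipeline.fit(X, y)

# alpha_ holds the candidate chosen by RidgeCV's internal cross-validation
chosen_alpha = ridge_pipeline.named_steps["ridgecv"].alpha_
print("Selected alpha:", chosen_alpha)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold whenever the pipeline is cross-validated, which avoids leaking validation statistics into training.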

## 3.6 Ensemble method

We wrap the best **RandomForestRegressor** from the hyperparameter tuning step in **TransformedTargetRegressor**, which standardizes the target variable **(y)** before fitting and maps predictions back to the original scale. The random forest is itself an ensemble of decision trees; this step does not average it with any other model, so **ensemble_model** is the tuned forest trained on a scaled target.

In [8]:
# Tuned random forest wrapped with target standardization
# (no additional models are averaged here)
ensemble_model = TransformedTargetRegressor(regressor=best_model_rf, transformer=StandardScaler())
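
The cell above wraps a single random forest, so no predictions are averaged across different models. For comparison, here is a minimal sketch of model averaging with scikit-learn's `VotingRegressor`, which fits each base model and returns the unweighted mean of their predictions. The choice of base models, their settings, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

# Synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=2)

# Average a random forest and a ridge regression (hypothetical settings)
averaged = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
    ("ridge", Ridge(alpha=1.0)),
])
averaged.fit(X, y)

# Reproduce the ensemble prediction from the fitted base models
individual = np.column_stack([est.predict(X) for est in averaged.estimators_])
manual_average = individual.mean(axis=1)

print("Matches manual average:", np.allclose(averaged.predict(X), manual_average))
```

Averaging tends to help most when the base models make different kinds of errors, which is why a tree ensemble and a linear model are paired here.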

## 3.7 Cross-validation

We use **cross_val_score()** from scikit-learn to run 5-fold cross-validation on each model with the **neg_mean_squared_error** scorer. The scorer returns negated MSE values, so we negate them, take the square root to obtain an RMSE per fold, and average across folds. The test-set metrics for the ensemble model are computed in the next section.

In [9]:
# Cross-validation
cv_scores_rf = cross_val_score(best_model_rf, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')
cv_scores_ridge = cross_val_score(best_model_ridge, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')
cv_scores_ensemble = cross_val_score(ensemble_model, X_train_scaled, y_train_augmented, cv=5, scoring='neg_mean_squared_error')

# Convert the negated MSE of each fold to RMSE, then average across folds
mean_cv_score_rf = np.mean(np.sqrt(-cv_scores_rf))
mean_cv_score_ridge = np.mean(np.sqrt(-cv_scores_ridge))
mean_cv_score_ensemble = np.mean(np.sqrt(-cv_scores_ensemble))

print("Random Forest CV Score: ", mean_cv_score_rf)
print("Ridge Regression CV Score: ", mean_cv_score_ridge)

Random Forest CV Score:  9.289881555664302
Ridge Regression CV Score:  8.65570992967362
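
The scores printed above are mean RMSE values: scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated and must be flipped before taking the square root. The sketch below shows the conversion on synthetic data and compares it with the built-in `neg_root_mean_squared_error` scorer; the model and data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the notebook's dataset)
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=3)
model = Ridge(alpha=1.0)

# Both calls use the same unshuffled 5-fold split, so fold scores are comparable
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
neg_rmse = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")

# Negate, then take the square root per fold
rmse_per_fold = np.sqrt(-neg_mse)

print("Per-fold RMSE:", np.round(rmse_per_fold, 3))
print("Mean of per-fold RMSE:", round(float(rmse_per_fold.mean()), 3))
print("RMSE of mean MSE:", round(float(np.sqrt(-neg_mse.mean())), 3))
```

The last two numbers differ slightly: averaging per-fold RMSE values is never larger than taking the square root of the averaged MSE, so it matters which one is reported.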


## 3.8 Evaluate the ensemble model on the test set

In [10]:
ensemble_model.fit(X_train_scaled, y_train_augmented)
y_pred_ensemble = ensemble_model.predict(X_test_scaled)
mse_ensemble = mean_squared_error(y_test, y_pred_ensemble)
rmse_ensemble = np.sqrt(mse_ensemble)
mae_ensemble = mean_absolute_error(y_test, y_pred_ensemble)
r2_ensemble = r2_score(y_test, y_pred_ensemble)

In [11]:
finalleaderboard = {
    'MSE Ensemble': mse_ensemble,
    'RMSE Ensemble': rmse_ensemble,
    'MAE Ensemble': mae_ensemble,
    'r2 score': r2_ensemble
}

finalleaderboard = pd.DataFrame.from_dict(finalleaderboard, orient='index', columns=['Result'])
finalleaderboard

| Metric | Result |
|---|---|
| MSE Ensemble | 27.344444 |
| RMSE Ensemble | 5.229191 |
| MAE Ensemble | 3.825393 |
| r2 score | 0.627124 |


# 4. Result Comparison

In [12]:
rmse_improved = rmse_ensemble
mae_improved = mae_ensemble
r2_improved = r2_ensemble

In [13]:
# Create a dataframe to display the evaluation metrics
data = {
    "Model": ["Basic", "Improved"],
    "RMSE": [rmse_basic, rmse_improved],
    "MAE": [mae_basic, mae_improved],
    "R^2": [r2_basic, r2_improved]
}
df = pd.DataFrame(data)

print("Model Performance Comparison")
df

Model Performance Comparison


| Model | RMSE | MAE | R^2 |
|---|---|---|---|
| Basic | 2.81296 | 2.041078 | 0.8921 |
| Improved | 5.229191 | 3.825393 | 0.627124 |


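In practice, we usually apply a small, deliberately chosen set of techniques rather than many at once. For educational purposes, this example combines several techniques, and the comparison shows that doing so does not guarantee a better model: the "improved" model performs worse than the basic model on the test set (RMSE 5.23 versus 2.81, R^2 0.63 versus 0.89). The most likely cause is the augmentation step, which shuffles the features without the targets and therefore pairs half of the augmented training rows with the wrong target values.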

# 5. Exercise

The code above compares a basic model with a model that applies several additional techniques. Your task is to modify the code and incorporate further techniques to improve the model's performance. Experiment with at least two additional techniques such as feature selection, regularization, or ensemble methods. Evaluate the modified model and compare its performance against the basic model and the previous improved model.

#### Instructions:

1. Study the code provided and understand the implementation of the basic model and the techniques applied in the improved model.


2. Choose two additional techniques that you want to incorporate into the model to further enhance its performance. Examples include feature selection (e.g., using SelectKBest), regularization (e.g., L1 or L2 regularization), or ensemble methods (e.g., Gradient Boosting or Stacking).


3. Modify the code to include the chosen techniques. Make sure to appropriately apply the techniques, train the modified model, and evaluate its performance.


4. Calculate and compare the evaluation metrics such as RMSE, MAE, and R-squared for the basic model, the previous improved model, and the modified model.


5. Reflect on the results and analyze the impact of the additional techniques on the model's performance.
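
As one possible starting point for steps 2 and 3, the sketch below chains feature selection (`SelectKBest`) with gradient boosting in a single pipeline and evaluates it on a held-out split. It runs on synthetic data from `make_regression`; the value `k=8` and the default boosting settings are arbitrary assumptions to be replaced by your own choices and dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with 13 features, 6 of them informative
X, y = make_regression(n_samples=300, n_features=13, n_informative=6,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Feature selection and the model live in one pipeline, so the selector
# is fitted on training data only
candidate = make_pipeline(
    SelectKBest(score_func=f_regression, k=8),   # hypothetical k
    GradientBoostingRegressor(random_state=42),
)
candidate.fit(X_train, y_train)

y_pred = candidate.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```

The same three metrics can then be added as a third row to the comparison table from Section 4.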

#### Challenge:

Experiment with different parameter settings or combinations of the additional techniques to find the best-performing model. Compare the performance of different combinations and discuss the findings in terms of the evaluation metrics.

Note: You can refer to the scikit-learn documentation for more information about the techniques and parameters you choose to incorporate.

By completing this exercise, you will gain hands-on experience in improving a model with various techniques and deepen your understanding of how each technique affects the evaluation metrics and overall model performance.

In [14]:
# write your code here
