# **Prediction Modeling using AdaBoost**

<img src="https://drive.google.com/uc?id=1SZxMXRZsESpFRyKDkg_QhZSjOekiM2_a" width="700" style="float: center"/>

- Ensemble method that combines several *weak learners* into a *strong learner*
- Weak learners are trained *sequentially*
- Each learner tries to correct the *weaknesses of its predecessor*

### **AdaBoost**
- Uses *stumps* (one-split decision trees) as weak learners to form the ensemble
- Each stump is built by taking the *previous stump's mistakes* into account
- Stumps carry *different weightages* in the final prediction

  **Stump Weightage** $=\eta \ln\big(\frac{1-\text{Total Error}}{\text{Total Error}}\big)$
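
To illustrate how the stump weightages enter the final prediction, here is a minimal sketch of a weighted vote for a two-class problem. The helper `weighted_vote` and the three stump weightages are hypothetical values chosen for illustration; they are not taken from the fitted model.

```python
def weighted_vote(stump_predictions, stump_weights):
    """Return the class (0 or 1) that collects the larger total stump weightage."""
    votes = {0: 0.0, 1: 0.0}
    for predicted_class, weight in zip(stump_predictions, stump_weights):
        votes[predicted_class] += weight
    return max(votes, key=votes.get)


# Hypothetical stumps: two predict class 0, but the one predicting class 1 has a larger weightage
stump_predictions = [0, 0, 1]
stump_weights = [0.4, 0.3, 1.2]
final_class = weighted_vote(stump_predictions, stump_weights)
print("Weighted vote:", final_class)

# With equal weightages the same stumps reduce to a simple majority vote
print("Equal weights:", weighted_vote(stump_predictions, [1.0, 1.0, 1.0]))
```

A single reliable stump can therefore outvote several weaker ones, which is the practical meaning of "different weightages".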


### Case Study: Predicting the Acceptance of Personal Loan

Data to be used: *Bank.csv*

The columns in the *Bank.csv* file are described below:

<TABLE CAPTION="Personal Loan Dataset">
<TR><TD><B>Variable</B></TD><TD><B>Description</B></TD></TR>
<TR><TD>Age</TD><TD>Customer's age</TD></TR>
<TR><TD>Experience</TD><TD># years of professional experience</TD></TR>
<TR><TD>Income</TD><TD>Annual income of the customer (&#36;000)</TD></TR>
<TR><TD>Family</TD><TD>Family size of the customer</TD></TR>
<TR><TD>CCAvg</TD><TD>Avg. spending on credit cards per month (&#36;000)</TD></TR>
<TR><TD>Education</TD><TD>Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional</TD></TR>   
<TR><TD>Mortgage</TD><TD>Value of house mortgage if any. (&#36;000)</TD></TR>
<TR><TD>Securities Account</TD><TD>Does the customer have a securities account with the bank?</TD></TR>
<TR><TD>CD Account</TD><TD>Does the customer have a certificate of deposit (CD) account with the bank?</TD></TR>
<TR><TD>Online</TD><TD>Does the customer use internet banking facilities?</TD></TR>
<TR><TD>CreditCard</TD><TD>Does the customer use a credit card issued by the bank?</TD></TR>
<TR><TD>Personal Loan (outcome)</TD><TD>Did this customer accept the personal loan offered in the campaign?</TD></TR>
</TABLE>

In the `Personal Loan` column:

- 0: Did not accept loan
- 1: Accepted loan

### Import Packages

In [None]:
import pandas as pd                  # Pandas
import numpy as np                   # Numpy
from matplotlib import pyplot as plt # Matplotlib

# Package to implement AdaBoost
import sklearn
from sklearn.ensemble import AdaBoostClassifier

# Package to implement Grid Search Cross Validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

# Package for generating confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Package for generating classification report
from sklearn.metrics import classification_report

# Package to record time
import time

# Package for Data pretty printer
from pprint import pprint

# Suppress all warnings (not only deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

### Import Data

In [None]:
# Import Data
bank_df = pd.read_csv('Bank.csv')
bank_df.head()

In [None]:
bank_df['Personal Loan'].value_counts()

Almost 90% of the instances belong to class 0 (customers who did not accept the loan).

Therefore, it is a highly *imbalanced* dataset.
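
The imbalance is the reason plain accuracy can mislead here. The sketch below uses hypothetical labels (90 zeros and 10 ones, roughly mirroring the class split described above) and a baseline that always predicts class 0. The helper `f1_for_class` is written for illustration and returns 0 when a class is never predicted correctly.

```python
def f1_for_class(actual, predicted, cls):
    """F1 score of one class, computed from true/false positives and false negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
    fp = sum(1 for a, p in zip(actual, predicted) if a != cls and p == cls)
    fn = sum(1 for a, p in zip(actual, predicted) if a == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced labels and a "always predict 0" baseline
actual = [0] * 90 + [1] * 10
predicted = [0] * 100

accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
macro_f1 = (f1_for_class(actual, predicted, 0) + f1_for_class(actual, predicted, 1)) / 2

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```

The baseline looks strong on accuracy yet scores poorly on macro-averaged F1, because the minority class contributes an F1 of 0 to the average.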

In [None]:
# Statistical Description
bank_df.describe().T

### Prepare Data

In [None]:
# Selecting data corresponding to Input Features X and Outcome y
X = bank_df.drop(columns=['Personal Loan'])
y = bank_df['Personal Loan']


# Data Partitioning into train and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=1)

## **Implementing AdaBoost for Classification**

### ***Hyperparameters of AdaBoost***

### `n_estimators`:
- The maximum number of weak learners at which boosting is terminated
- In case of perfect fit, the learning procedure is stopped early
- Default = 50
- Input options → integer

### `learning_rate` ($\eta$):
- Weight applied to each classifier at each boosting iteration
- A higher learning rate increases the contribution of each weak learner
- Default = 1.0
- Input options → float

### **Hyperparameter Tuning**

In [None]:
# Define your model
classifier = AdaBoostClassifier(algorithm = 'SAMME', random_state = 42)

In [None]:
# Start with an initial guess for parameters
n_estimators = [int(x) for x in np.linspace(start = 5, stop = 500, num = 10)]

learning_rate = [x for x in np.arange(0.1, 2.1, 0.1)]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'learning_rate': learning_rate
}

pprint(random_grid)

In [None]:
# Creating stratified folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 100)

In [None]:
# Call RandomizedSearchCV()
random_cv = RandomizedSearchCV(estimator = classifier,
                              param_distributions = random_grid,
                              n_iter = 100,
                              scoring = 'f1_macro',
                              cv = folds,
                              verbose = 2,
                              random_state = 42,
                              n_jobs = -1) # Will utilize all available CPUs

In [None]:
# Fit the model
start = time.time()            # Start Time
random_cv.fit(train_X, train_y)
stop = time.time()             # End Time
print(f"Training time: {stop - start}s")

In [None]:
print('Initial score: ', random_cv.best_score_)
print('Initial parameters: ', random_cv.best_params_)

In [None]:
# Create the parameter grid based on the results of random search
param_grid = {'n_estimators': [400, 420, 440, 460, 480, 500],
              'learning_rate': [1.15, 1.20, 1.25]
}

pprint(param_grid)

In [None]:
# Call GridSearchCV()
grid_cv = GridSearchCV(estimator = classifier,
                        param_grid = param_grid,
                        scoring= 'f1_macro',
                        cv = folds,
                        verbose = 1,
                        n_jobs = -1) # Will utilize all available CPUs

In [None]:
# Fit the model
start = time.time()            # Start Time
grid_cv.fit(train_X, train_y)
stop = time.time()             # End Time
print(f"Training time: {stop - start}s")

In [None]:
print('Improved score: ', grid_cv.best_score_)
print('Improved parameters: ', grid_cv.best_params_)

### **Analyzing the performance of each stump in the ensemble**

**Total Error of each stump**: Sum of weights associated with incorrectly classified instances
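
As a rough illustration of this definition, the sketch below computes the total error for five hypothetical instances with equal weights, then applies the two-class SAMME-style update in which misclassified instances are scaled by `exp(stump weightage)` and all weights are renormalized. The function names and the example weightage of `ln(4)` are assumptions made for the example.

```python
import math


def total_error(sample_weights, is_wrong):
    """Sum of the weights of the misclassified instances."""
    return sum(w for w, wrong in zip(sample_weights, is_wrong) if wrong)


def reweight(sample_weights, is_wrong, stump_weight):
    """Scale up misclassified instances by exp(stump_weight), then renormalize to sum to 1."""
    scaled = [w * math.exp(stump_weight) if wrong else w
              for w, wrong in zip(sample_weights, is_wrong)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical round: five equally weighted instances, the third one is misclassified
weights = [0.2] * 5
mistakes = [False, False, True, False, False]

error = total_error(weights, mistakes)
new_weights = reweight(weights, mistakes, stump_weight=math.log(4))

print("Total error:", error)
print("Updated weights:", [round(w, 3) for w in new_weights])
```

After the update the misclassified instance carries a much larger share of the total weight, so the next stump is pushed to classify it correctly.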

In [None]:
# Error of each stump
grid_cv.best_estimator_.estimator_errors_


**Stump Weightage** $=\eta \ln\big(\frac{1-\text{Total Error}}{\text{Total Error}}\big)$

For the first stump, Total Error = 0.09457143

$\eta = 1.2$

Stump Weightage = 2.7108635
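
The arithmetic can be checked directly from the formula. The sketch below plugs in the total error and learning rate quoted above; the result should land close to the reported weightage, and the second line shows what the same stump would weigh under a learning rate of 1.0 for comparison.

```python
import math

total_error = 0.09457143  # total error of the first stump, as quoted above
eta = 1.2                 # learning rate, as quoted above

# Stump weightage = eta * ln((1 - total error) / total error)
stump_weight = eta * math.log((1 - total_error) / total_error)
print(f"Stump weightage with eta = {eta}: {stump_weight:.4f}")

# Same stump with a learning rate of 1.0 (for comparison only)
unscaled_weight = math.log((1 - total_error) / total_error)
print(f"Stump weightage with eta = 1.0: {unscaled_weight:.4f}")
```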

In [None]:
# Stump Weightage
grid_cv.best_estimator_.estimator_weights_

**Making predictions on test set**

In [None]:
# Predictions on test set
y_pred = grid_cv.predict(test_X)

# Generating Classification Report
print("Classification Report - \n",
      classification_report(test_y, y_pred))

**Generating Confusion Matrix**

In [None]:
# Generate confusion matrix
cm = confusion_matrix(test_y, y_pred, labels = grid_cv.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = grid_cv.classes_)

# Specify figure size and font size
fig, ax = plt.subplots(figsize = (6, 6))
plt.rcParams.update({'font.size': 15})

# Display Confusion Matrix
disp.plot(cmap = 'Purples', ax = ax);

**Estimating Prediction Probabilities**

In [None]:
# Getting prediction probabilities
prob = grid_cv.predict_proba(test_X)

# Printing prediction results
result = pd.DataFrame({'Actual': test_y, 'Predicted': y_pred})

# Creating columns for rejection and acceptance prob.
result[['Prob. of 0','Prob. of 1']] = pd.DataFrame(prob.tolist(), index = result.index)

# Saving dataframe as a csv file
result.to_csv('Prediction Results.csv', index = False)

result.sample(10)

**Feature Importance**

In [None]:
# Storing importance values from the best fit model
importance = grid_cv.best_estimator_.feature_importances_

In [None]:
# Displaying feature importance as a dataframe
feature_imp = pd.DataFrame(list(zip(train_X.columns, importance)),
               columns = ['Feature', 'Importance'])

feature_imp = feature_imp.sort_values('Importance', ascending = False).reset_index(drop = True)

feature_imp

In [None]:
# Bar plot
plt.figure(figsize=(10, 5))
plt.barh(feature_imp['Feature'], feature_imp['Importance'], color =['teal','lime'])

plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance");

# **Prediction Intervals for Regression**

<img src="https://drive.google.com/uc?id=1aBNgNCOASpQulzRzOwAmLvE4P1Eux0ZA" width="500" style="float: center"/>

### **What is a Prediction Interval?**
- It is a **range of values** within which a new observation is expected to fall with a **certain probability**, given the existing data and model.

- **Probability**: The width of the prediction interval depends on the **desired confidence level** (e.g., 95%); higher confidence levels lead to wider intervals.

### **Confidence Level of Prediction Interval**

- The confidence level of a prediction interval is the probability that the interval will contain the actual value of a new observation (not a model parameter, which is what a confidence interval targets).

- Mathematically, the confidence level of a prediction interval is denoted by $ (1 - \alpha) \times 100\% $, where $ \alpha $ is the significance level.
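
To make the link between $\alpha$ and interval width concrete, here is a minimal sketch of a split-conformal interval, one simple way to build prediction intervals from held-out residuals. It is shown for intuition only and is not necessarily the procedure a particular library applies with its default settings. The residuals, the point prediction, and the helper name are hypothetical.

```python
import math


def split_conformal_halfwidth(calibration_residuals, alpha):
    """Half-width q such that (prediction - q, prediction + q) targets (1 - alpha) coverage."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Conformal rank ceil((n + 1) * (1 - alpha)); capped at n as a simplification
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1]


# Hypothetical residuals (actual - predicted) on a held-out calibration set
residuals = [-300, 120, -80, 450, 60, -200, 150, -500, 90, 30]
point_prediction = 10000  # hypothetical predicted price

q90 = split_conformal_halfwidth(residuals, alpha=0.1)  # 90% confidence level
q50 = split_conformal_halfwidth(residuals, alpha=0.5)  # 50% confidence level

print("90% interval:", (point_prediction - q90, point_prediction + q90))
print("50% interval:", (point_prediction - q50, point_prediction + q50))
```

Lowering $\alpha$ (raising the confidence level) selects a larger residual quantile, so the interval widens.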

### **Why Are Prediction Intervals Useful?**

- **Uncertainty Quantification**: They provide a measure of the uncertainty in individual predictions, which is crucial for risk assessment and decision-making.

- **Communication**: They are an effective tool for communicating the uncertainty in predictions to stakeholders, making the model's predictions more interpretable.




## Case Study: Predicting the Price of Used Toyota Corolla Cars

**In this case study, the objective is to predict the price of used Toyota Corolla cars.**

Data to be used: *ToyotaCorolla.csv*

The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size.

The columns in the *ToyotaCorolla.csv* file are described below:

<TABLE CAPTION="Car Sales Dataset">
<TR><TD><B>Variable</B></TD><TD><B>Description</B></TD></TR>
<TR><TD>Price</TD><TD>Offer price in Euros</TD></TR>
<TR><TD>Age</TD><TD>Age in months as of August 2004</TD></TR>
<TR><TD>Kilometers</TD><TD>Accumulated kilometers on odometer</TD></TR>
<TR><TD>Fuel type</TD><TD>Fuel type (Petrol, Diesel, CNG)</TD></TR>
<TR><TD>HP</TD><TD>Horsepower</TD></TR>
<TR><TD>Metallic</TD><TD>Metallic color? (Yes = 1, No = 0)</TD></TR>   
<TR><TD>Automatic</TD><TD>Automatic? (Yes = 1, No = 0)</TD></TR>
<TR><TD>CC</TD><TD>Cylinder volume in cubic centimeters</TD></TR>
<TR><TD>Doors</TD><TD>Number of doors</TD></TR>
<TR><TD>QuartTax</TD><TD>Quarterly road tax in Euros</TD></TR>
<TR><TD>Weight</TD><TD>Weight in kilograms</TD></TR>
</TABLE>

### Import Packages

In [None]:
import pandas as pd                  # Pandas
import numpy as np                   # Numpy
from matplotlib import pyplot as plt # Matplotlib

# Package to implement Regression Tree Model
import sklearn
from sklearn.tree import DecisionTreeRegressor

# Package to implement Grid Search Cross Validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import KFold

# Package to visualize Decision Tree
from sklearn import tree

%matplotlib inline

### Import and Prepare Data

In [None]:
# Import Data
car_df = pd.read_csv('ToyotaCorolla.csv')

# Considering top 1000 rows for modeling and analysis
car_df = car_df.iloc[0:1000]

In [None]:
# Selecting columns of interest
predictors = ['Age_08_04', 'KM', 'Fuel_Type', 'HP', 'Met_Color', 'Automatic', 'CC',
              'Doors', 'Quarterly_Tax', 'Weight']

outcome = 'Price'

In [None]:
# Creating dummy variables and specifying the set of input and output variables
X = pd.get_dummies(car_df[predictors], drop_first=True)
y = car_df[outcome]

In [None]:
# Data Partitioning into train and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=1)

### Hyperparameter Tuning using Grid Search Cross Validation

In [None]:
# Define your model
reg = DecisionTreeRegressor(random_state = 42)

In [None]:
# Start with an initial guess for parameters
hyper_params = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [20, 40, 60],
    'min_samples_leaf': [10, 20, 30, 40, 50]
}

In [None]:
# Creating folds
folds = KFold(n_splits = 5, shuffle = True, random_state = 100)

In [None]:
# Call GridSearchCV()
model_cv = GridSearchCV(estimator = reg,
                        param_grid = hyper_params,
                        scoring = 'r2', # Use a suitable regression metric
                        cv = folds,
                        verbose = 1,
                        n_jobs = -1) # Will utilize all available CPUs

In [None]:
# Fit the model
model_cv.fit(train_X, train_y)

In [None]:
print('Initial score: ', model_cv.best_score_)
print('Initial parameters: ', model_cv.best_params_)

In [None]:
# Adapt grid based on result from initial grid search
hyper_params_new = {
    'max_depth': list(range(2, 12)),
    'min_samples_split': list(range(15, 24)),
    'min_samples_leaf': list(range(2, 10))
}

In [None]:
# Call GridSearchCV()
model_cv = GridSearchCV(estimator = reg,
                        param_grid = hyper_params_new,
                        scoring = 'r2',
                        cv = folds,
                        verbose = 1,
                        n_jobs = -1) # Will utilize all available CPUs

In [None]:
# Fit the model
model_cv.fit(train_X, train_y)

In [None]:
print('Improved score: ', model_cv.best_score_)
print('Improved parameters: ', model_cv.best_params_)

In [None]:
# Storing best model
bestRegTree = model_cv.best_estimator_

# Visualizing Decision Tree
fig = plt.figure(figsize=(25,20))
a = tree.plot_tree(decision_tree = bestRegTree,
                   feature_names = train_X.columns,
                   filled = True)

### Evaluating Performance of Tuned Model on Test Set

In [None]:
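# Import the metrics explicitly (root_mean_squared_error requires scikit-learn >= 1.4)
from sklearn.metrics import r2_score, root_mean_squared_error

# Predict test set
y_pred = model_cv.predict(test_X)
r2 = r2_score(test_y, y_pred)
RMSE = root_mean_squared_error(test_y, y_pred)
print(f"R2: {r2:.4f}, RMSE: {RMSE:.2f}")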

In [None]:
y_pred

## **Prediction Intervals using MAPIE Regressor**

#### ***MAPIE: Model Agnostic Prediction Interval Estimator***
- It is a Python library designed to estimate prediction intervals in a **model-agnostic way**.
- It can be used with **any machine learning model**, including linear models, decision trees, ensemble methods, and neural networks.

[**See this link for a detailed description of `MapieRegressor`**](https://mapie.readthedocs.io/en/latest/generated/mapie.regression.MapieRegressor.html)

In [None]:
# Best Regression Model/Tree after hyperparameter tuning
bestRegTree

**Install and Import `MAPIE` Library**

In [None]:
# Install mapie (pinned below 1.0 on the assumption that later releases no longer ship MapieRegressor)
!pip install -q "mapie<1.0"

In [None]:
# Import mapie
from mapie.regression import MapieRegressor

In [None]:
# Define mapie regressor
mapie = MapieRegressor(estimator = bestRegTree, # Prediction model to use
                       n_jobs = -1,
                       random_state = 42)

# Fit mapie regressor on training data
mapie.fit(train_X, train_y)

alpha = 0.1 # For 90% confidence level

# Use mapie.predict() to get predicted values and intervals
y_test_pred, y_test_pis = mapie.predict(test_X, alpha = alpha)

In [None]:
# Predicted values
y_test_pred

In [None]:
# Prediction Intervals
y_test_pis

In [None]:
# Storing results in a dataframe
predictions = test_y.to_frame()
predictions.columns = ['Actual Value']
predictions["Predicted Value"] = y_test_pred.round()
predictions["Lower Value"] = y_test_pis[:, 0].round()
predictions["Upper Value"] = y_test_pis[:, 1].round()

# Take a quick look
predictions

### **Coverage Calculation**
- **Coverage** refers to the proportion of true/actual values that fall within the prediction intervals generated by a model.

- It is a measure of how well the prediction intervals capture the actual values.

  $\text{Coverage} = \frac{\text{Number of actual values within prediction intervals}}{\text{Total number of actual values}}$
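
The formula above can be sketched in a few lines of plain Python. The actual values and interval bounds below are hypothetical, and the check treats the bounds as inclusive, which is an assumption of this example.

```python
def coverage(actual, lower, upper):
    """Fraction of actual values that fall within [lower, upper] (bounds inclusive)."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Hypothetical actual prices with their prediction intervals
actual = [13500, 9950, 11250, 8750]
lower = [12000, 9000, 11500, 8000]
upper = [14500, 10500, 13000, 9500]

cov = coverage(actual, lower, upper)
print(f"Coverage: {cov * 100:.2f}%")
```

Here the third actual value lies below its lower bound, so three of the four intervals cover their target.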


In [None]:
# To calculate coverage score
from mapie.metrics import regression_coverage_score

In [None]:
coverage = regression_coverage_score(test_y,           # Actual values
                                     y_test_pis[:, 0], # Lower bound of prediction intervals
                                     y_test_pis[:, 1]) # Upper bound of prediction intervals

coverage_percentage = coverage * 100
print(f"Coverage: {coverage_percentage:.2f}%")

**Coverage Plot (sorted by actual value)**

In [None]:
# Import necessary library for setting up the plot format
import matplotlib as mpl

# Sort the predictions by 'Actual Value' for better visualization and reset the index
sorted_predictions = predictions.sort_values(by=['Actual Value']).reset_index(drop=True)

# Create a figure and axis object with specified size and resolution
fig, ax = plt.subplots(figsize=(25, 10), dpi=250)

# Plot the actual values with green dots
plt.plot(sorted_predictions["Actual Value"], 'go', markersize=4, label="Actual Value")

# Fill the area between the lower and upper bounds of the prediction intervals with semi-transparent green color
plt.fill_between(np.arange(len(sorted_predictions)),
                 sorted_predictions["Lower Value"],
                 sorted_predictions["Upper Value"],
                 alpha=0.2, color="green", label="Prediction Interval")

# Set font size for x and y ticks
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)

# Format y-axis to show values with commas as thousand separators
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))

# Set the limit for the x-axis to cover the range of samples
plt.xlim([0, len(sorted_predictions)])

# Label the x-axis and y-axis with appropriate font size
plt.xlabel("Samples", fontsize=20)
plt.ylabel("Target", fontsize=20)

# Add a title to the plot, including the coverage percentage, with bold formatting
plt.title(f"Prediction Intervals and Coverage: {coverage_percentage:.2f}%", fontsize=25, fontweight="bold")

# Add a legend to the plot, placed in the upper left, with specified font size
plt.legend(loc="upper left", fontsize=20)

# Save the plot as a PDF file with tight layout
plt.savefig("prediction_intervals_coverage.pdf", format="pdf", bbox_inches="tight")

# Display the plot
plt.show();
