<center><img src="https://gitlab.com/accredian/insaid-data/-/raw/main/Logo-Accredian/Case-Study-Cropped.png" width= 30% /></center>

# <center><b>Ensemble Learning</b></center>

---
# **Table of Contents**
---

**1.** **Introduction**<br>
  - **1.1** **Bagging**
  - **1.2** **Random Forest**
  - **1.3** **Boosting**
  - **1.4** **Voting**

**2.** **Stacking**<br>
**3.** [**Problem Statement**](#Section3)<br>
**4.** [**Installing & Importing Libraries**](#Section4)<br>
  - **4.1** [**Installing Libraries**](#Section41)
  - **4.2** [**Importing Libraries**](#Section42)

**5.** [**Data Acquisition & Description**](#Section5)<br>
  - **5.1** [**Data Description**](#Section51)
  - **5.2** [**Data Information**](#Section52)

**6.** [**Data Pre-Processing**](#Section6)<br>
  - **6.1** [**Pre Profiling Report**](#Section61)<br>

**7.** [**Post Data Processing & Feature Selection**](#Section7)<br>
  - **7.1** [**Feature Encoding**](#Section71)<br>
  - **7.3** [**Data Preparation**](#Section73)<br>

**8.** [**Model Development & Evaluation**](#Section8)<br>
**9.** **Optimising & Testing the Best Model**<br>
**10.** [**Conclusion**](#Section10)<br>

---
# **1.1 Bagging**
---
- This method builds **several instances** of a black-box estimator on **random** subsets of the original **training set**.

- Then these estimators **aggregate** their **individual predictions** to form a final prediction.

- These methods are used as a way to **reduce** the **variance** (overfitting) of a base estimator by introducing randomization.

- They **work best** with **strong** and **complex models** (e.g., fully developed decision trees).
</br>
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-4/master/images/bagging-updated1.png"></center>
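The bagging idea described above can be sketched with scikit-learn's `BaggingClassifier`. This is a minimal illustration, not the model used later in the notebook: the synthetic data from `make_classification`, the variable names, and the choice of 50 estimators are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# A single fully grown tree: a strong, complex, high-variance base model
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# 50 fully grown trees, each fitted on a bootstrap sample of the training set;
# the base estimator is passed positionally so the call does not depend on the keyword name
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
).fit(X_tr, y_tr)

tree_acc = single_tree.score(X_te, y_te)
bag_acc = bagged_trees.score(X_te, y_te)
print("Single tree test accuracy:", tree_acc)
print("Bagged trees test accuracy:", bag_acc)
```

Comparing the two printed scores on your own data gives a rough sense of how much variance the aggregation removes; the size of the gain depends on the dataset.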

---
# **1.2 Random Forest**
---

- Here each tree in the ensemble is built from a **sample** drawn **with replacement** (i.e., a bootstrap sample) from the training set.

- In scikit-learn, the **bootstrap** parameter of the forest is **True** by default, so each tree is trained on a bootstrap sample.

- On average, about **63.2%** of the distinct original **examples** end up in a **bootstrap sample**; the remaining **36.8%** are left out ("out-of-bag") and can serve as a **test set** for that tree. This split is what gives the **.632 bootstrap** its name.

- When splitting a node, the model searches for the **best split** either among all input features or among a random subset of features whose size is set by `max_features`.

- The purpose of these two sources of randomness is to **decrease** the **variance** of the **forest estimator**.

- Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias.

- In scikit-learn's implementation, the **classifiers** are combined by **averaging** their probabilistic **predictions**, instead of letting each classifier vote for a single class.

</br>
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-4/master/images/random-forest.png"></center>
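The 63.2% / 36.8% figures quoted above can be checked numerically. The sketch below draws one bootstrap sample with NumPy and compares the observed fraction of distinct examples with the closed-form value $1 - (1 - 1/n)^n$, which approaches $1 - 1/e \approx 0.632$ as $n$ grows. The sample size `n` and the seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the (hypothetical) training set

# One bootstrap sample: n indices drawn with replacement
bootstrap_idx = rng.integers(low=0, high=n, size=n)

# Fraction of distinct original examples that made it into the sample
in_bag_fraction = np.unique(bootstrap_idx).size / n
oob_fraction = 1.0 - in_bag_fraction

# Probability that a given example is drawn at least once in n draws
theoretical = 1.0 - (1.0 - 1.0 / n) ** n

print("In-bag fraction:    ", round(in_bag_fraction, 3))
print("Out-of-bag fraction:", round(oob_fraction, 3))
print("Theoretical in-bag: ", round(theoretical, 3))
```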

---
# **1.3 Boosting**
---

- It is **a family of algorithms** that convert weak learners into strong learners.

- Boosting is an **ensemble method** for improving the model predictions of any given learning algorithm.

- The idea of boosting is to **train weak learners sequentially**, each trying to **correct its predecessor**.

- **Two** widely used **variants** of boosting are **AdaBoost** and **Gradient Boosting**.



### **1.3.1 Adaptive Boosting**

- AdaBoost is a **popular** boosting algorithm, introduced in **1995** by **Freund and Schapire**.

- Here the core principle is to fit **a sequence of weak learners** on **repeatedly modified** versions of the data.

- The predictions from all of them are then **combined** through a **weighted majority vote** (or sum) to produce the final prediction.

- The data modifications at each so-called **boosting iteration** consist of applying weights $w_1, w_2, \dots, w_N$ to the $N$ training samples.

- Initially, those weights are all set to $w_i = \large \frac{1}{N}$, so that the first step simply trains a weak learner on the original data.

- For each successive iteration, the sample **weights** are individually **modified** and the learning algorithm is **reapplied** to the reweighted data.

- The **weights** of training examples are **increased** if they were predicted **incorrectly** at the previous step and **decreased** if they were predicted **correctly**.

- To compute the error rate of model $M_i$, we sum the weights of the examples in $D_i$ that $M_i$ misclassified:
</br>
<center>$\large error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)$</center>
</br>

- Here $err(X_j)$ is the misclassification error of example $X_j$: it is 1 if $X_j$ was misclassified and 0 otherwise.

- If an example in round $i$ was correctly classified, its weight is multiplied by $\large \frac{error(M_i)}{(1-error(M_i))}$. The weights of all examples are then normalized, so the misclassified examples gain relative weight.

- As iterations proceed, examples that are difficult to predict receive ever-increasing influence.

- Each subsequent weak learner is thereby **forced** to **concentrate** on the examples that are **missed** by the **previous** ones in the sequence.

</br>
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-4/master/images/adaboost.png"></center>

- Therefore we can say that AdaBoost iteratively trains weak learners, where each subsequent learner **pays more attention to the instances misclassified by its predecessor.**
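One round of the weight update described above can be worked through by hand. The sketch below uses five hypothetical examples, one of which is misclassified; the final renormalization step (so that the weights sum to 1) is an assumption of this sketch, following the usual textbook presentation of the algorithm.

```python
import numpy as np

N = 5
weights = np.full(N, 1.0 / N)  # initial weights w_j = 1/N

# err(X_j) for one boosting round: True where the weak learner was wrong (made-up values)
misclassified = np.array([False, False, True, False, False])

# error(M_i): sum of the weights of the misclassified examples
error = np.sum(weights * misclassified)

# Correctly classified examples are down-weighted by error / (1 - error)
weights[~misclassified] *= error / (1.0 - error)

# Renormalize so that the weights sum to 1 again (assumed step)
weights /= weights.sum()

print("error(M_i):", error)
print("updated weights:", weights)
```

With these numbers the error is 0.2, each correctly classified example ends up with weight 0.125, and the single misclassified example ends up with weight 0.5, so the next weak learner is pushed to concentrate on it.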

### **1.3.2 Gradient Boosting**

- Gradient Boosted Decision Trees (GBDT) is a **generalization** of **boosting** to arbitrary differentiable loss functions.

- GBDT is an **accurate** and **effective** off-the-shelf **procedure** that can be used for both regression and classification problems.

- Here the model **learns** directly from its **mistakes** (the **residual errors**), rather than updating the weights of the data points.

- It can **support** both **binary** and **multi-class** classification.
</br>
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-4/master/images/boosting-tree-updated.png"></center>
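To make the "learning from residuals" idea concrete, here is a hand-rolled sketch of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual. It is an illustration only and not how `GradientBoostingClassifier` is implemented internally; the synthetic sine data, the tree depth, the learning rate, and the number of rounds are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))                        # synthetic inputs
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)   # noisy targets

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # start from a constant model
initial_mse = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    residuals = y_demo - prediction                # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # small step toward the residuals

final_mse = np.mean((y_demo - prediction) ** 2)
print("MSE of constant model:", initial_mse)
print("MSE after 50 rounds:  ", final_mse)
```

Each new tree is fitted to the errors of the ensemble built so far, which is the sequential correction described in the bullets above.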

### **1.3.3 Extreme Gradient Boosting (XGBoost)**

- XGBoost is an optimized and highly efficient implementation of gradient boosting. It introduces several regularization techniques to control overfitting and improves performance.

- It operates on the **principle of gradient boosting**, which involves sequentially adding weak learners (typically decision trees) to the ensemble. **Each new learner corrects the errors made by the existing ensemble**, gradually improving the model's predictive performance.

- XGBoost is favored for its high performance, scalability, and versatility across a wide range of machine learning tasks, including regression, classification, and ranking problems.

<center><img src="https://miro.medium.com/v2/resize:fit:786/format:webp/0*zdmqFZ2nooBRedqC.png"></center>

# **1.4 Voting**

- The idea here is to **combine** conceptually **different** machine learning **classifiers**.

- Then use a **majority vote** or the **average predicted probabilities** (soft vote) to predict the class labels.

- Such a **classifier** can be useful for a set of equally well-performing models in order to **balance** out their **individual weaknesses**.

- They can be categorized into two parts i.e. **Hard Voting** and **Soft Voting**.

### **1.4.1 Hard Voting**

- Here the **predicted class** label **represents** the **majority** (mode) of the class labels predicted by each individual classifier.

- E.g., if the prediction for a given sample is
  - Classifier 1 &rarr; Class 1

  - Classifier 2 &rarr; Class 1

  - Classifier 3 &rarr; Class 2

- The **VotingClassifier** (with voting='hard') would classify the sample as “**Class 1**” based on the majority class label.

- In **case of a tie**, the VotingClassifier selects the class based on the **ascending sort order** of the class labels. E.g.
  - Classifier 1 &rarr; Class 2

  - Classifier 2 &rarr; Class 1

- The **Class label 1** will be assigned to the sample.
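The two hard-voting examples above can be reproduced with a few lines of NumPy. This is a plain illustration of majority voting with ascending-order tie-breaking, not scikit-learn's internal code; the helper name `hard_vote` is hypothetical.

```python
import numpy as np

def hard_vote(labels):
    """Return the most frequent label; ties go to the smallest label."""
    values, counts = np.unique(labels, return_counts=True)  # values come back sorted ascending
    return values[np.argmax(counts)]                         # argmax keeps the first maximum

majority = hard_vote([1, 1, 2])   # Classifier 1 -> 1, Classifier 2 -> 1, Classifier 3 -> 2
tie = hard_vote([2, 1])           # one vote each: the smaller label wins

print("Majority case:", majority)
print("Tie case:     ", tie)
```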

### **1.4.2 Soft Voting**

- It **returns** the **class label** as **argmax** of the sum of predicted probabilities.

- We can provide **specific weights** to **each classifier**.

- The **predicted** class **probabilities** for each classifier are collected, **multiplied** by the classifier **weight**, and **averaged**.

- The **final** class **label** is then derived from the class label with the **highest average probability**.

- For example, let’s assume we have 3 classifiers and a 3-class classification problem:
  - We assign equal weights to all classifiers: $w1=1, w2=1, w3=1$.

- The weighted average probabilities for a sample would then be calculated as follows:

|Classifier|Class 1|Class 2|Class 3|
|:--:|:--:|:--:|:--:|
|Classifier 1|w1 x 0.2|w1 x 0.5|w1 x 0.3|
|Classifier 2|w2 x 0.6|w2 x 0.3|w2 x 0.1|
|Classifier 3|w3 x 0.3|w3 x 0.4|w3 x 0.3|
|Weighted Average|0.37|0.4|0.23|

- Here, the predicted class label is 2, since it has the highest average probability.
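The weighted average in the table above can be verified with NumPy. The probabilities and weights are taken from the table; the variable names are chosen for this sketch.

```python
import numpy as np

# Rows: classifiers 1-3, columns: classes 1-3 (values from the table above)
probas = np.array([
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])
clf_weights = np.array([1, 1, 1])   # w1 = w2 = w3 = 1

# Weighted average of the predicted probabilities, per class
avg_proba = np.average(probas, axis=0, weights=clf_weights)

# argmax is zero-based, the classes in the example are numbered from 1
predicted_class = int(np.argmax(avg_proba)) + 1

print("Average probabilities:", np.round(avg_proba, 2))
print("Predicted class:", predicted_class)
```

Changing `clf_weights`, for instance to `[2, 1, 1]`, shows how a more trusted classifier can shift the final decision.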

---
# **2. Stacking**
---

- **Stacked generalization** is a method for **combining estimators** to **reduce** their **biases**.

- It **harnesses** the **capabilities** of a range of well-performing **models** on a classification or regression task.

<center><img src='https://raw.githubusercontent.com/insaid2018/Term-3/master/Images/16083537046041326.png'></center>

- The **predictions** of each **individual estimator** are **stacked** together and used as input to a final estimator to compute the prediction.

- The benefit of stacking is that it can often achieve **better performance** than any single model in the ensemble.
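A minimal sketch of stacking with scikit-learn's `StackingClassifier` follows. The base estimators, the logistic-regression final estimator, and the synthetic data are assumptions chosen for illustration; they are not the models evaluated later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used here only for illustration (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

stack_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),  # learns how to combine the base predictions
    cv=5,                                  # base predictions come from cross-validation folds
)
stack_clf.fit(X_tr, y_tr)

stack_acc = stack_clf.score(X_te, y_te)
print("Stacking test accuracy:", stack_acc)
```

The final estimator is trained on cross-validated predictions of the base models, which limits the leakage that would occur if it saw predictions made on the same rows the base models were fitted on.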

</br>

**<center><h3>Structural Diagram of Ensemble Techniques</h3></center>**
<center><img src="https://raw.githubusercontent.com/insaid2018/Term-3/master/Images/ensemble-learning-types-part-2-updated.png"></center>

# **Pros & Cons of Ensemble Learning**

**Pros:**

- Ensemble is a proven method for __improving the accuracy__ of the model and works in most of the cases.

- Ensemble makes the model more __robust__ and __stable__, thus ensuring decent performance on the test cases in more scenarios.

- You can use an ensemble to capture simple __linear__ as well as complex __non-linear__ relationships in the data. This can be done by using __two different models__ and forming an __ensemble of two__.

**Cons:**

- Ensemble __reduces the model interpretability__ and makes it very difficult to draw any crucial business insights at the end.

- It is __time-consuming__ and thus might not be the best idea for real-time applications.

- The __selection of models__ for creating an ensemble is an __art__ which is really hard to master.

---
<a name = Section3></a>
# **3. Problem Statement**
---

<center><img src="https://www.reno.gov/Home/ShowImage?id=7739&t=635620964226970000"></center>

**<h4>Scenario:</h4>**

- **Property Hall** is a Canadian **real estate** company that facilitates transactions between the buyers and sellers of property.

- The company's **revenue** has been **down** for the past three months and they want to identify the root cause.

- They are looking for an **automatic way** to detect **unusual behavior** in their revenue.

- The company already has **access** to the **data** of **houses** in the city of Windsor.

- To identify unusual behavior, they have hired a team of data scientists. **Consider you are one of them...**

---
<a name = Section4></a>
# **4. Installing & Importing Libraries**
---

<a name = Section41></a>
### **4.1 Installing Libraries**

In [1]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q ydata-profiling                                   # Library to generate basic statistics about data

<a name = Section42></a>
### **4.2 Importing Libraries**

In [2]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from ydata_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing NumPy (Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split                # To perform train, test and split over the data
from sklearn.svm import SVC                                         # To perform modeling using SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression                 # To perform modeling using LogisticRegression
from sklearn.ensemble import RandomForestClassifier                 # To perform modeling using RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier                  # To perform modeling using KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier             # To perform modeling using GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier                     # To perform modeling using StackingClassifier
from sklearn.tree import DecisionTreeClassifier                     # To perform modeling using DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier                       # To perform modeling using VotingClassifier
from sklearn.ensemble import BaggingClassifier                      # To perform modeling using BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier                     # To perform modeling using AdaBoostClassifier
import xgboost as xgb                                               # To perform modeling using XGBoostClassifier
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section5></a>
# **5. Data Acquisition & Description**
---

- The dataset is based on **real estate** provided by Property Hall and it is accessible <a href="https://raw.githubusercontent.com/insaid2018/Term-4/master/Data/Housing.csv">**here**</a>.

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 546 | 12 | 22 KB|

</br>

| Id | Features | Description |
| :-- | :--| :--|
|01|**price**|Sale price of a house.|
|02|**lotsize**|The lot size of a property in square feet.|
|03|**bedrooms**|Number of bedrooms.|
|04|**bathrms**|Number of bathrooms.|
|05|**stories**|Number of stories excluding basement.|
|06|**driveway**|Does the house have a driveway?|
|07|**recroom**|Does the house have a recreational room?|
|08|**fullbase**|Does the house have a full finished basement?|
|09|**gashw**|Does the house use gas for hot water heating?|
|10|**airco**|Does the house have central air conditioning?|
|11|**garagepl**|Number of garage places.|
|12|**prefarea**|Is the house located in the preferred neighbourhood of the city?|

In [3]:
data = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/insaid2018/Term-4/master/Data/Housing.csv')
print('Data Shape:', data.shape)
data.head()

Data Shape: (546, 12)


| |price|lotsize|bedrooms|bathrms|stories|driveway|recroom|fullbase|gashw|airco|garagepl|prefarea|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|42000.0|5850|3|1|2|yes|no|yes|no|no|1|no|
|1|38500.0|4000|2|1|1|yes|no|no|no|no|0|no|
|2|49500.0|3060|3|1|1|yes|no|no|no|no|0|no|
|3|60500.0|6650|3|1|2|yes|yes|no|no|no|0|no|
|4|61000.0|6360|2|1|1|yes|no|no|no|no|0|no|


<a name = Section51></a>
### **5.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [4]:
data.describe()

| |price|lotsize|bedrooms|bathrms|stories|garagepl|
|:--|:--|:--|:--|:--|:--|:--|
|count|546.0|546.0|546.0|546.0|546.0|546.0|
|mean|68121.6|5150.27|2.97|1.29|1.81|0.69|
|std|26702.67|2168.16|0.74|0.5|0.87|0.86|
|min|25000.0|1650.0|1.0|1.0|1.0|0.0|
|25%|49125.0|3600.0|2.0|1.0|1.0|0.0|
|50%|62000.0|4600.0|3.0|1.0|2.0|0.0|
|75%|82000.0|6360.0|3.0|2.0|2.0|1.0|
|max|190000.0|16200.0|6.0|4.0|4.0|3.0|


**Observation:**

- On **average** the **sale price** of the **house** is **$\$$68121.60**.

- **25%** of **houses** have **sale prices <= $\$$49125** while **50%** and **75%** of houses have **sale price <= $\$$62000** and **$\$$82000**.

- Similarly, we can get the information for the rest of the features.

<a name = Section52></a>
### **5.2 Data Information**

- In this section we will see the **information about the types of features**.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   price     546 non-null    float64
 1   lotsize   546 non-null    int64  
 2   bedrooms  546 non-null    int64  
 3   bathrms   546 non-null    int64  
 4   stories   546 non-null    int64  
 5   driveway  546 non-null    object 
 6   recroom   546 non-null    object 
 7   fullbase  546 non-null    object 
 8   gashw     546 non-null    object 
 9   airco     546 non-null    object 
 10  garagepl  546 non-null    int64  
 11  prefarea  546 non-null    object 
dtypes: float64(1), int64(5), object(6)
memory usage: 51.3+ KB


**Observation:**

- We can observe that there is **no null data present**.

- Additionally, all **features** seem to have the **correct data type**.

<a name = Section6></a>

---
# **6. Data Pre-Processing**
---

<a name = Section61></a>
### **6.1 Pre Profiling Report**

- For **quick analysis**, pandas profiling (now published as `ydata-profiling`) is very handy.

- It generates profile reports from a pandas DataFrame.

- For each column **statistics** are presented in an interactive HTML report.

In [6]:
# profile = ProfileReport(df = data)
# profile.to_file(output_file = 'Pre Profiling Report.html')
# print('Accomplished!')

In [7]:
# from google.colab import files                   # Use only if you are using Google Colab, otherwise remove it
# files.download('Pre Profiling Report.html')      # Use only if you are using Google Colab, otherwise remove it

**Observation:**

- The report shows that there are **12 features** out of which **6 are boolean**, **3 are numerical** and **3 are categorical**.

- We can observe that there is **one duplicate** row in our dataset.

- You can get the rest of the information from the report.

**Performing Operations**

In [8]:
data.drop_duplicates(inplace=True)
print('Dropping Duplicates Success!')

Dropping Duplicates Success!


<a name = Section7></a>

---
# **7. Post Data Processing & Feature Selection**
---

- In this section, we will perform **encoding** over **categorical** features and **feed** the result to the **Random Forest** model.

- **Random Forest** will then **identify important features** for our model **using some threshold**.

- This threshold is **used over** the **information gain** which results in **reduction in impurity**.

- And **finally** we will **split** our **data** for the **model development**.
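The threshold-based feature selection described above can be sketched with scikit-learn's `SelectFromModel` wrapped around a random forest. This is an illustration on synthetic data, not a step executed in this notebook; the `threshold='mean'` setting (keep features whose impurity-based importance is at least the mean importance) and the forest size are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with 12 features, 4 of them informative (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=4, random_state=0
)

# Fit a forest and keep the features whose importance reaches the threshold
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold='mean'
)
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_   # mean decrease in impurity
selected_mask = selector.get_support()                   # True for retained features
X_selected = selector.transform(X_demo)

print("Importances:", importances.round(3))
print("Kept", X_selected.shape[1], "of", X_demo.shape[1], "features")
```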

<a name = Section71></a>
### **7.1 Feature Encoding**

- In this section, we will perform **transformation** of categorical features to numeric using **get_dummies()**.

- We can observe that features such as driveway, recroom, fullbase, gashw, airco and prefarea have binary values.

In [9]:
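# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
data = pd.get_dummies(data = data, columns = ['driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'prefarea'], dtype = int)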
data.head()

| |price|lotsize|bedrooms|bathrms|stories|garagepl|driveway_no|driveway_yes|recroom_no|recroom_yes|fullbase_no|fullbase_yes|gashw_no|gashw_yes|airco_no|airco_yes|prefarea_no|prefarea_yes|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|42000.0|5850|3|1|2|1|0|1|1|0|0|1|1|0|1|0|1|0|
|1|38500.0|4000|2|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|
|2|49500.0|3060|3|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|
|3|60500.0|6650|3|1|2|0|0|1|0|1|1|0|1|0|1|0|1|0|
|4|61000.0|6360|2|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|


- Before diving further, we will **create** a **categorical feature** using the **price** feature.

- We will be **performing** **ensemble** methods on **classes** instead of numerical values.

- Essentially, we create a new column called **`price_cat`**, which will serve as the **target variable**. It categorizes the **'price'** column **based on the quantiles** of the price (the 33rd and 66th percentiles and the maximum), which are computed in the cell below.

- The **`pd.cut`** function bins the 'price' column into three categories: **'Low', 'Medium'**, and **'High'**, based on the quantile ranges.

In [10]:
quantile33 = data['price'].quantile(0.33)
quantile66 = data['price'].quantile(0.66)
quantile100 = data['price'].quantile(1)

data['price_cat'] = pd.cut(data['price'], bins=[0, quantile33, quantile66, quantile100], labels=['Low', 'Medium', 'High'])

data.drop(labels='price', axis=1, inplace=True)

print(data['price_cat'].value_counts())

data.head()

High      184
Low       181
Medium    180
Name: price_cat, dtype: int64


| |lotsize|bedrooms|bathrms|stories|garagepl|driveway_no|driveway_yes|recroom_no|recroom_yes|fullbase_no|fullbase_yes|gashw_no|gashw_yes|airco_no|airco_yes|prefarea_no|prefarea_yes|price_cat|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|5850|3|1|2|1|0|1|1|0|0|1|1|0|1|0|1|0|Low|
|1|4000|2|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|Low|
|2|3060|3|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|Low|
|3|6650|3|1|2|0|0|1|0|1|1|0|1|0|1|0|1|0|Medium|
|4|6360|2|1|1|0|0|1|1|0|1|0|1|0|1|0|1|0|Medium|


**Observation:**

- We have **successfully** **converted** our **categorical** features to numeric using dummy encoding.

- We can see that the three classes of **price_cat** are approximately **balanced** (184, 181 and 180 records).

- In such a case, we can rely on **accuracy** as a metric to **evaluate** our model.

<a name = Section73></a>
### **7.3 Data Preparation**

- Now we will **split** our **data** into **training** and **test** parts to train and evaluate the model respectively.

- Here the **test_size** represents the proportion of the dataset to include in the test split.

In [11]:
# Dividing the data into features and target variable
X = data.drop('price_cat', axis = 1)
y = data['price_cat']

In [12]:
# Assuming your data is in X (features) and y (labels/targets) format
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Check the sizes of the splits
print("Train set size:", len(X_train))
print("Test set size:", len(X_test))

Train set size: 381
Test set size: 164


**Label Encoding the target variable**

In [13]:
from sklearn.preprocessing import LabelEncoder

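label = LabelEncoder()

y_train = label.fit_transform(y_train)   # Learn the label-to-integer mapping on the training labels
y_test = label.transform(y_test)         # Reuse the same mapping for the test labels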
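A short sketch of how `LabelEncoder` maps string labels to integers is given below. It fits the encoder on training labels only and reuses the learned mapping for the test labels, which keeps both sets on the same encoding. The tiny label lists are made up for the illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
train_labels = ['Low', 'Medium', 'High', 'Low']   # hypothetical training labels
test_labels = ['High', 'Low']                     # hypothetical test labels

train_encoded = encoder.fit_transform(train_labels)   # learn the mapping and encode
test_encoded = encoder.transform(test_labels)         # encode with the same mapping

# Classes are stored in sorted (alphabetical) order: High -> 0, Low -> 1, Medium -> 2
print(list(encoder.classes_))
print(list(train_encoded), list(test_encoded))
```

Because the mapping is alphabetical, the integers carry no ordinal meaning here: 'Medium' is encoded as 2, not as the middle value.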

**Observation:**

- Now that we have split our data we are **ready** to move to the **next part** and that is Model Development & Evaluation.

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---
- In this section we will develop a variety of models such as:

|Logistic Regression|Support Vector Classifier|Decision Tree|Random Forest|Bagging|Adaptive Boosting|Gradient Boosting|XGBoost|Voting Classifier|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|

- For estimating the **performance** of the **model** we will be using **accuracy** as a **metric**.

- Considering the complexity of the data, we can run a **loop** over multiple **classifiers** and estimate the accuracy of each model.

- But in a real-life situation, one must train each model individually and then decide which model is best in which situation.

**Initialising the base classifiers**

In [14]:
# Logistic Regression
log_clf = LogisticRegression(random_state = 42, class_weight='balanced')

# Support Vector Classifier
sv_clf = SVC(random_state=42, class_weight='balanced', probability=True)

# Decision Tree
dt_clf = DecisionTreeClassifier(random_state = 42, class_weight='balanced')

# Random Forest
rf_clf = RandomForestClassifier(n_estimators=500, random_state = 42, class_weight='balanced', n_jobs=-1)

Bagging classifier

In [15]:
# Bagging Classifier (the `estimator` argument was named `base_estimator` before scikit-learn 1.2)
bag_clf = BaggingClassifier(estimator=dt_clf, n_estimators=500, random_state=42, n_jobs=-1)

Voting classifier

In [16]:
# Voting Classifier
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rf_clf), ('dt', dt_clf), ('svc', sv_clf)], voting='soft')

Boosting classifier

In [17]:
# AdaBoost Classifier (the `estimator` argument was named `base_estimator` before scikit-learn 1.2)
ada_clf = AdaBoostClassifier(estimator=dt_clf, n_estimators=500, random_state=42)

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=500, random_state = 42)

# Define XGBoost classifier
xgb_clf = xgb.XGBClassifier(objective='multi:softmax', num_class=3, random_state=42)

- Calculating the **scores of all the classifiers** initialised above and putting them in a list just for presenting it in a dataframe.

In [18]:
# Initialize a list of classifier objects
clf_list = [log_clf, sv_clf, dt_clf, bag_clf, voting_clf, rf_clf, ada_clf, gb_clf, xgb_clf]

In [19]:
# Create an empty list to append scores and classifier name
train_scores = []
test_scores = []
clf_names = []

# Train classifier over train data and append scores to empty list
for clf in clf_list:
  # Fit the train data over the classifier object
  clf.fit(X_train, y_train)

  # Append train and test score to the empty list
  train_scores.append(np.round(a=clf.score(X_train, y_train), decimals=2))
  test_scores.append(np.round(a=clf.score(X_test, y_test), decimals=2))
  clf_names.append(clf.__class__.__name__)

print('Success!')

Success!


In [20]:
# Create an accuracy dataframe from scores and names list
accuracy_frame = pd.DataFrame(data={'Train Accuracy': train_scores, 'Test Accuracy': test_scores}, index=clf_names)

# View the accuracy of all the classifiers
accuracy_frame.transpose()

| |LogisticRegression|SVC|DecisionTreeClassifier|BaggingClassifier|VotingClassifier|RandomForestClassifier|AdaBoostClassifier|GradientBoostingClassifier|XGBClassifier|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|Train Accuracy|0.7|0.56|1.0|1.0|0.99|1.0|1.0|1.0|1.0|
|Test Accuracy|0.62|0.55|0.63|0.64|0.64|0.62|0.64|0.65|0.66|


**Observation:**

- Most models reach perfect or near-perfect accuracy on the training set (0.99 to 1.00) but only 0.62 to 0.66 on the test set, suggesting that they are **overfitting** the data. Logistic Regression and SVC show a much smaller gap, but also a lower training accuracy.

- High training **accuracy** on its own is therefore **not a reliable** indicator here: the models appear to have memorized the training data rather than learned patterns that generalize to unseen data.

- It is **important** to address the overfitting problem by applying appropriate techniques such as hyperparameter tuning, regularization, or using more advanced ensemble methods.

- By tuning the hyperparameters, such as adjusting the regularization strength or optimizing the tree depth, it is possible to reduce **overfitting** and improve the models' **generalization** capabilities.

- Additionally, **implementing** techniques like **cross-validation** or using more diverse and **larger** datasets can also help combat overfitting and ensure better model performance in real-world scenarios.
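As a sketch of the cross-validation and regularization ideas mentioned above, the example below compares a fully grown decision tree with a depth-limited one using 5-fold cross-validation. The synthetic three-class data and the `max_depth=3` setting are assumptions for illustration; which model scores better depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, used here only for illustration (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)

deep_tree = DecisionTreeClassifier(random_state=0)                  # unrestricted depth
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # regularized by depth

# Each score is the accuracy on a held-out fold, so it estimates generalization
deep_scores = cross_val_score(deep_tree, X_demo, y_demo, cv=5, scoring='accuracy')
shallow_scores = cross_val_score(shallow_tree, X_demo, y_demo, cv=5, scoring='accuracy')

print("Deep tree CV accuracy:   ", deep_scores.mean())
print("Shallow tree CV accuracy:", shallow_scores.mean())
```

Unlike a single train/test split, the averaged fold scores are less sensitive to one lucky or unlucky partition of a small dataset.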

# **9. Optimising & Testing the Best Model**

- From the model results above we can see that **XGBoost** has the **highest test accuracy** (0.66), so we will try to **optimise** it using hyperparameter tuning to get the best possible result.

- **`learning_rate`:** This hyperparameter controls the **step size** at each iteration while moving toward a minimum of the loss function. A lower learning rate requires more iterations but may result in better convergence.

- **`max_depth`:** This hyperparameter determines the **maximum depth of each tree in the ensemble**. Deeper trees can model more complex relationships but are prone to overfitting.

- **`subsample`:** This hyperparameter controls the fraction of samples to be used for training each individual tree. **A value of 1.0 means using all samples**, while smaller values introduce randomness and help prevent overfitting.

- **`gamma`:** This hyperparameter is the **minimum loss reduction** required to make a further partition on a leaf node of the tree. It acts as a regularization parameter.

In [21]:
# Define the XGBoost model for multiclass classification
model = xgb.XGBClassifier(objective='multi:softmax', num_class=3, random_state=42)

# Define the hyperparameters grid for tuning
param_grid = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2],
}

# Define grid search with cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best estimator
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
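To get a feel for the cost of the grid search above, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. The grid is copied from the cell above into a separate variable (`param_grid_demo`, a name chosen for this sketch), and the fold count of 5 matches the `cv=5` argument.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, repeated so the snippet runs on its own
param_grid_demo = {
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2],
}

n_candidates = len(ParameterGrid(param_grid_demo))   # 3 * 3 * 3 * 3 combinations
n_fits = n_candidates * 5                            # one fit per candidate per CV fold

print("Candidate settings:", n_candidates)
print("Cross-validation fits:", n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.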

In [22]:
# Evaluate the best model on the test set
y_pred_train= best_model.predict(X_train)
y_pred_test = best_model.predict(X_test)
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy= accuracy_score(y_test, y_pred_test)

print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
print("Best Parameters:", best_params)

Train Accuracy: 0.989501312335958
Test Accuracy: 0.6585365853658537
Best Parameters: {'gamma': 0.2, 'learning_rate': 0.3, 'max_depth': 5, 'subsample': 0.8}


**Observation:**

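- After tuning, the training accuracy drops only slightly (from 1.00 to about 0.99), while the test accuracy stays at about 0.66. The gap between the two remains large, so the tuned model is **still overfitting**; **hyperparameter tuning** over this grid did not noticeably improve generalization.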

<a name = Section10></a>

---
# **10. Conclusion**
---

- We **studied** the **characteristics** and **distribution** of data in brief.

- We **encoded** the categorical **features** and binned the **price** into three classes to form the target variable.

- We performed **model development** by comparing individual classifiers with bagging, boosting and voting ensembles.

- We observed **better results** as compared to the results obtained in the last notebook.

- This model will **help** the **company** in **saving** a lot of **resources** (money, human resources, etc.).

- We **recommend** that you **experiment** with **more hyperparameters** and try to enhance the results of the model.