## EDUNET FOUNDATION - Self-Practice Exercise Notebook

### LAB 24 - Implementing Ensemble Learning Concepts in Python

In [2]:
#!pip install xgboost

<center><h1> Boosting Algorithms</h1></center>


- The notebook uses the [UCI Machine Learning Mushroom Dataset](https://www.kaggle.com/datasets/uciml/mushroom-classification) to implement the AdaBoost and XGBoost algorithms.
- For the set of features in the dataset, the task is to identify whether the type of mushroom is poisonous or edible.

### First, let’s import the required libraries for data preprocessing.

In [3]:
import numpy as np
import pandas as pd

Now, let's load the dataset with the `read_csv()` function in Pandas and count the distinct categories in each feature. A feature with only one unique value can be dropped, because it carries no information for the model.

Link for dataset: https://drive.google.com/file/d/1iXup6AV0rIRHG_LCs1v_JXXVd64duMEL/view?usp=drive_link

In [5]:
df = pd.read_csv("mushrooms.csv")
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [7]:
df.shape

(8124, 23)

In [8]:
for col in df.columns:
    print('Unique value count of', col, 'is', len(df[col].unique()))

Unique value count of class is 2
Unique value count of cap-shape is 6
Unique value count of cap-surface is 4
Unique value count of cap-color is 10
Unique value count of bruises is 2
Unique value count of odor is 9
Unique value count of gill-attachment is 2
Unique value count of gill-spacing is 2
Unique value count of gill-size is 2
Unique value count of gill-color is 12
Unique value count of stalk-shape is 2
Unique value count of stalk-root is 5
Unique value count of stalk-surface-above-ring is 4
Unique value count of stalk-surface-below-ring is 4
Unique value count of stalk-color-above-ring is 9
Unique value count of stalk-color-below-ring is 9
Unique value count of veil-type is 1
Unique value count of veil-color is 4
Unique value count of ring-number is 3
Unique value count of ring-type is 5
Unique value count of spore-print-color is 9
Unique value count of population is 6
Unique value count of habitat is 7


- The unique value count of `veil-type` is 1: this feature has only one distinct value and can therefore be dropped.

In [9]:
df = df.drop("veil-type", axis=1)
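
The same check can be done without reading the printed counts by hand. The sketch below is a minimal illustration on a made-up toy frame (the `toy` data and the variable names are assumptions, not part of the lab dataset); it uses `DataFrame.nunique()` to find every single-valued column and drops them in one step.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "veil-type": ["p", "p", "p", "p"],  # only one distinct value
})

# nunique() returns the number of distinct values per column.
unique_counts = toy.nunique()

# Columns with at most one distinct value cannot help a model.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

# Drop all of them at once instead of naming each column.
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.columns.tolist())
```

On the real data, replacing `toy` with `df` should single out `veil-type`, matching the counts printed above.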

Since most machine learning models require numerical input, let's encode the categorical values as integers. `LabelEncoder` is a class in the Scikit-Learn package that converts labels to integer codes.

In [11]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for column in df.columns:
    df[column] = label_encoder.fit_transform(df[column])
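
Because the loop above reuses one encoder, only the mapping for the last column stays available afterwards. A minimal sketch, assuming a made-up toy frame and a hypothetical `encoders` dictionary, shows how keeping one fitted `LabelEncoder` per column lets you read the codes back. It also shows that `LabelEncoder` assigns codes in sorted label order, so `e` (edible) becomes 0 and `p` (poisonous) becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the mushroom data (values are made up).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "x", "b", "x"],
})

encoders = {}          # hypothetical helper: one fitted encoder per column
encoded = toy.copy()
for column in toy.columns:
    enc = LabelEncoder()
    encoded[column] = enc.fit_transform(toy[column])
    encoders[column] = enc

# classes_ lists the original labels; the position of a label is its code.
print(list(encoders["class"].classes_))
print(encoded["class"].tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = encoders["class"].inverse_transform(encoded["class"])
print(list(decoded))
```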

Next, split the dataset into a feature matrix X and a target vector Y:

In [12]:
X = df.loc[:, df.columns != 'class']
Y = df['class']

In [13]:
X.shape

(8124, 21)

In [15]:
Y.shape

(8124,)

The dataset must be split into training and testing data. Let us use 70% of it for training and 30% for testing, and then standardize the values. The scaler is fitted on the training data only and then applied to the test data.

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
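
To see what this cell does, the sketch below repeats the same split-then-scale steps on synthetic data (the `X_demo` and `y_demo` arrays are made up for illustration). After `fit_transform`, every training column has mean 0 and standard deviation 1; the test columns are scaled with the training statistics, so they are only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 4)).astype(float)
y_demo = rng.integers(0, 2, size=200)

# 70% training, 30% testing, as in the lab.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learn mean/std from training data
X_te_s = scaler_demo.transform(X_te)      # reuse the training mean/std

print(X_tr.shape, X_te.shape)
print(np.allclose(X_tr_s.mean(axis=0), 0.0))
print(np.allclose(X_tr_s.std(axis=0), 1.0))
```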

## AdaBoost Implementation in Python

The sklearn library in Python provides the `AdaBoostClassifier` class, which is used here to classify each mushroom as poisonous or edible.

The method has the following parameters:

- `base_estimator:` The base learner from which the boosted ensemble is built. If None, it defaults to DecisionTreeClassifier(max_depth=1). In scikit-learn 1.2 and later this parameter is named `estimator`.
- `n_estimators:` The maximum number of estimators at which boosting is terminated, with a default value of 50. If there is a perfect fit, learning is stopped early.
- `learning_rate:` The weight applied to each classifier at each boosting iteration; a smaller value shrinks the contribution of every classifier. It has a default value of 1.
- `algorithm:` Either ‘SAMME’ or ‘SAMME.R’. In older scikit-learn releases the default is ‘SAMME.R’, which typically converges faster than SAMME, taking fewer boosting iterations and producing lower test errors. Newer releases deprecate ‘SAMME.R’ and default to ‘SAMME’.
- `random_state:` Seed used by the random number generator.
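
The defaults described above can be checked on a fitted model. The following is a minimal sketch on synthetic data from `make_classification` (the data and the `ada_demo` name are assumptions for illustration). It leaves the base learner unset, so it works regardless of whether the installed version calls that parameter `base_estimator` or `estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# No base learner is passed, so the default decision stump is used.
ada_demo = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada_demo.fit(X_tr, y_tr)

# The fitted weak learners are stored in estimators_.
first = ada_demo.estimators_[0]
print(type(first).__name__, first.max_depth)

# Boosting may stop before n_estimators if a perfect fit is reached.
print(len(ada_demo.estimators_))

# staged_score yields the test accuracy after each boosting round.
staged = list(ada_demo.staged_score(X_te, y_te))
print(staged[0], staged[-1])
```

Comparing the first and last entries of `staged` gives a quick impression of how much the additional boosting rounds contribute.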

Let us invoke an instance of the AdaBoostClassifier and fit it with the training data.

In [19]:
from sklearn.ensemble import AdaBoostClassifier

adaboost = AdaBoostClassifier(n_estimators = 50, learning_rate = 0.2).fit(X_train, Y_train)
score = adaboost.score(X_test, Y_test)

In [20]:
score

0.9848236259228876

## XGBoost implementation in Python

Unlike AdaBoost, `XGBoost` is not part of Scikit-Learn; it ships as a separate library, installed with the `pip install xgboost` command at the top of this notebook. Before importing the library and creating an instance of `XGBClassifier`, let us take a look at some of the parameters accepted by the `XGBClassifier` class.

- `max_depth:` Maximum depth of the tree for base learners.
- `learning_rate:` The learning rate of the XGBooster.
- `verbosity:` The degree of verbosity. Valid values are between 0 (silent) and 3 (debug).
- `objective:` The learning objective to be used.
- `booster:` The booster to be chosen amongst gbtree, gblinear and dart.
- `tree_method:` The tree method to be used. The most conservative option is set as default.
- `n_jobs:` Number of parallel threads.
- `gamma:` Minimum loss reduction required to make another split on a leaf node of the tree.
- `reg_alpha:` L1 regularization term on weights of XGBoost.
- `reg_lambda:` L2 regularization term on weights of XGBoost.
- `base_score:` The initial prediction (also called global bias).
- `random_state:` Random number seed.
- `importance_type:` The measure used to compute feature importances; either gain, weight, cover, total_gain or total_cover.

In [21]:
from xgboost import XGBClassifier
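# In xgboost 1.6 and later, early_stopping_rounds is a constructor argument; fit() no longer accepts it from 2.0 on.
xgboost = XGBClassifier(n_estimators = 1000, learning_rate = 0.05, early_stopping_rounds = 5)
xgboost.fit(X_train, Y_train, eval_set = [(X_test, Y_test)], verbose = False)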
score_xgb = xgboost.score(X_test,Y_test)



In [22]:
score_xgb

1.0

With these parameters, XGBoost reaches 100% accuracy on the test split of this dataset; other parameter values are worth experimenting with.
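
The XGBoost cell above passes the test split as `eval_set`, so the data used to decide when to stop training is also the data used to report the score. A common alternative is to hold out a separate validation split for early stopping. The sketch below illustrates this on synthetic data (the data, split sizes and variable names are assumptions); it assumes xgboost 1.6 or later and skips the training step if the library is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
except ImportError:  # xgboost is a separate, optional install
    XGBClassifier = None

# Synthetic binary classification data, not the mushroom dataset.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# 60% train, 20% validation (early stopping), 20% test (final score only).
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=100
)

test_accuracy = None
if XGBClassifier is not None:
    model = XGBClassifier(
        n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5
    )
    # Early stopping watches the validation split, never the test split.
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    test_accuracy = model.score(X_te, y_te)
    print(test_accuracy)
```

With this arrangement the test split is touched only once, so the reported accuracy is not influenced by the stopping decision.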
