<div style="text-align:center">
    <img src="../files/monolearn-logo.png" height="150px">
    <h1>ML course</h1>
    <h3>Session 11: XGBoost, K-Means</h3>
    <h4><a href="https://amzenterprise.ir/">Ali Momenzadeh</a></h4>
</div>

More on Gradient Boosting: https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/

#### What is XGBoost?

eXtreme Gradient Boosting (XGBoost) is a scalable, optimized implementation of the gradient boosting algorithm, designed for computational speed and model performance. It is an open-source library and a part of the Distributed Machine Learning Community. XGBoost combines software and hardware optimizations to enhance existing boosting techniques, aiming for accurate models in the shortest amount of training time.

What makes XGBoost a go-to algorithm for winning Machine Learning and Kaggle competitions?

<img src = "../files/11/1_1kjLMDQMufaQoS-nNJfg1Q.png" width=80%>

* Built-in Cross-Validation

<img src = "../files/11/1_Zyus69-ZT1OOmMGdULbYAA.gif" width=30%>

#### XGBoost - How does it work?

**Ensemble learning** is a process in which decisions from multiple machine learning models are combined to reduce errors and improve predictions compared to a single ML model. One common way to combine them is maximum (majority) voting: the aggregated decisions (or predictions, in machine learning jargon) are counted and the most frequent one becomes the final prediction. Puzzled?

Think of it as finding efficient routes to your work, college, or grocery store. Since you can take multiple routes to reach your destination, you learn the traffic and the delays each route causes at different times of the day, which lets you devise the best route. Ensemble learning works in a similar way!
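
To make the voting idea concrete, here is a minimal sketch of majority voting with NumPy. The three "models" and their predictions are made up for illustration; in practice each row would come from a trained classifier.

```python
import numpy as np

# Hypothetical class predictions (0 or 1) from three models for five samples.
predictions = np.array([
    [1, 0, 1, 1, 0],   # model A
    [1, 1, 0, 1, 0],   # model B
    [0, 0, 1, 1, 1],   # model C
])

def majority_vote(preds):
    """Return the most frequent label in each column (one column per sample)."""
    voted = []
    for column in preds.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

final = majority_vote(predictions)
print(final)
```

Each model is wrong on some samples, but as long as the models do not all make the same mistakes, the vote can still come out right.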

The image below contrasts a single ML model with an ensemble learner:

<img src = "../files/11/1_q-ZQz1EZeFCPr5ijx9nNdQ.png" width=60%>

<img src = "../files/11/1_QJZ6W-Pck_W7RlIDwUIN9Q.jpg" width=80%>

> Gradient boosting is a special case of boosting in which errors are minimized by a gradient descent procedure, producing a model in the form of an ensemble of weak prediction models, e.g. decision trees.
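
The statement above can be sketched in a few lines of NumPy. This is a simplified illustration, not how XGBoost is implemented: it assumes a squared-error loss (so the negative gradient is simply the residual), a single input feature, and one-split decision trees ("stumps") as weak learners. The helper `fit_stump` and the synthetic data are made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for t in np.unique(x)[:-1]:                 # candidate thresholds
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)    # synthetic data

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residual = y - prediction                   # negative gradient of 0.5 * (y - F)^2
    stump = fit_stump(x, residual)              # weak learner fitted to the residuals
    prediction += learning_rate * stump(x)      # small step that reduces the loss
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 100 rounds: {mse_history[-1]:.4f}")
```

Every round adds one weak learner that corrects part of what the current ensemble still gets wrong, so the training error shrinks step by step.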

#### XGBoost (Example)

Imagine that you are a hiring manager interviewing several candidates with excellent qualifications. Each step of the evolution of tree-based algorithms can be viewed as a version of the interview process.

1. Decision Tree: Every hiring manager has a set of criteria such as education level, number of years of experience, interview performance. A decision tree is analogous to a hiring manager interviewing candidates based on his or her own criteria.

2. Bagging: Now imagine that instead of a single interviewer there is an interview panel where each interviewer has a vote. Bagging, or bootstrap aggregating, involves combining inputs from all interviewers for the final decision through a democratic voting process.

3. Random Forest: It is a bagging-based algorithm with a key difference wherein only a subset of features is selected at random. In other words, every interviewer will only test the interviewee on certain randomly selected qualifications (e.g. a technical interview for testing programming skills and a behavioral interview for evaluating non-technical skills).
    
4. Boosting: This is an alternative approach where each interviewer alters the evaluation criteria based on feedback from the previous interviewer. This ‘boosts’ the efficiency of the interview process by deploying a more dynamic evaluation process.
    
5. Gradient Boosting: A special case of boosting where errors are minimized by a gradient descent algorithm; for example, strategy consulting firms use case interviews to weed out less qualified candidates.
    
    
6. XGBoost: Think of XGBoost as gradient boosting on ‘steroids’ (well it is called ‘Extreme Gradient Boosting’ for a reason!). It is a perfect combination of software and hardware optimization techniques to yield superior results using less computing resources in the shortest amount of time.
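
The sampling behind steps 2 and 3 can be sketched with NumPy. The sizes below are hypothetical, and the sketch makes one simplification: it draws a single feature subset per interviewer, whereas a random forest typically draws a fresh random subset of features at every split of every tree.

```python
import numpy as np

rng = np.random.default_rng(42)
n_candidates, n_qualifications = 1000, 8     # hypothetical sizes
n_interviewers = 5
features_per_interviewer = 3

panel = []
for _ in range(n_interviewers):
    # Bagging: each interviewer sees a bootstrap sample (rows drawn with replacement).
    rows = rng.choice(n_candidates, size=n_candidates, replace=True)
    # Random-forest flavour: each interviewer only looks at a random subset of features.
    cols = rng.choice(n_qualifications, size=features_per_interviewer, replace=False)
    panel.append((rows, np.sort(cols)))

for k, (rows, cols) in enumerate(panel, start=1):
    unique_share = np.unique(rows).size / n_candidates
    print(f"interviewer {k}: {unique_share:.0%} of candidates seen, qualifications {cols.tolist()}")
```

Because rows are drawn with replacement, each bootstrap sample contains only about 63% of the distinct candidates, so every interviewer ends up with a somewhat different view of the data.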

<img src = "../files/11/1_U72CpSTnJ-XTjCisJqCqLg.jpg" width=85%>

#### XGBoost Classification

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

In [None]:
# pip install xgboost

In [None]:
import xgboost as xgb

#### Load and prepare data

In [None]:

df = pd.read_csv("diabetes.csv")

In [None]:
df

In [None]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

#### Train and Test 

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from xgboost import XGBClassifier

classifier = XGBClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

#### Evaluation

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred) * 100
print("XGBClassifier accuracy: ", accuracy)

Let’s look at an overview of the most frequently tuned **hyperparameters**:

1. learning_rate: also called eta, it scales the contribution of each additional base learner and so controls how quickly the model fits the residual errors. Lower values usually need more trees.

    typical values: 0.01–0.2
    

2. gamma, reg_alpha, reg_lambda: these 3 parameters set the strength of the 3 types of regularization done by XGBoost: the minimum loss reduction required to create a new split, L1 regularization on leaf weights, and L2 regularization on leaf weights, respectively.

    typical values for gamma: 0 - 0.5 but highly dependent on the data
    typical values for reg_alpha and reg_lambda: 0 - 1 is a good starting point but again, depends on the data
    

3. max_depth - how deep the tree's decision nodes can go. Must be a positive integer

    typical values: 1–10
    

4. subsample - fraction of the training rows sampled to train each tree. If this value is too low, it may lead to underfitting; if it is too high, it may lead to overfitting

    typical values: 0.5–0.9
    

5. colsample_bytree - fraction of the features that can be used to train each tree. A large value means almost all features can be used to build the decision tree

    typical values: 0.5–0.9
    

**The above are the main hyperparameters people often tune.**
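
As a rough starting point for tuning, the sketch below draws random candidate settings from the typical ranges listed above. The helper `sample_params` is made up for this example, and the ranges are only the rules of thumb from this section, not recommendations from the XGBoost documentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params(rng):
    """Draw one candidate configuration from the typical ranges listed above."""
    return {
        "learning_rate": float(rng.uniform(0.01, 0.2)),
        "gamma": float(rng.uniform(0.0, 0.5)),
        "reg_alpha": float(rng.uniform(0.0, 1.0)),
        "reg_lambda": float(rng.uniform(0.0, 1.0)),
        "max_depth": int(rng.integers(1, 11)),        # integers 1..10
        "subsample": float(rng.uniform(0.5, 0.9)),
        "colsample_bytree": float(rng.uniform(0.5, 0.9)),
    }

candidates = [sample_params(rng) for _ in range(5)]
for params in candidates:
    print(params)
```

Each dictionary uses the same parameter names as the scikit-learn style estimators, so it can be passed on as `XGBClassifier(**params)` and scored on a validation split or with cross-validation to pick the best candidate.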

<hr/>

#### XGBoost Regression

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

In [None]:
# pip install xgboost

In [None]:
import xgboost as xgb

#### Load and prepare data

* Dataset: We will use a dataset that contains the carbon dioxide emissions generated from burning coal to produce electric power in the United States of America between 1973 and 2016. Using XGBoost in a Jupyter notebook, we will try to predict the carbon dioxide emissions for the next few years.

In [None]:
df = pd.read_csv("CO2.csv")

#### EDA

In [None]:
df.head()

In [None]:
df.info()

#### Data Preprocessing

We use Pandas to import the CSV file. The dataframe contains a 'YYYYMM' column that needs to be split into separate 'Year' and 'Month' columns. In this step, we also drop the original 'YYYYMM' column and replace any infinite values with NaN so that they are counted as missing values in the check that follows. Finally, we retrieve the last five rows of the dataframe to check if our code worked. And it did!

In [None]:
df['Month'] = df.YYYYMM.astype(str).str[4:6].astype(float)
df['Year'] = df.YYYYMM.astype(str).str[0:4].astype(float)

In [None]:
df.shape

In [None]:
df.drop(['YYYYMM'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.tail(5)

In [None]:
df.head()

In [None]:
print(df.dtypes)

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
X = df.loc[:,['Month', 'Year']].values
y = df.loc[:,'Value'].values

In [None]:
y

In [None]:
# data_dmatrix = xgb.DMatrix(X, label=y)

In [None]:
# data_dmatrix

#### Train and Test 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
from xgboost import XGBRegressor

regressor = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.08,
    subsample=0.75,
    colsample_bytree=1, 
    max_depth=7,
    gamma=0,
)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

#### Evaluation

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
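# With no `scoring` argument, cross_val_score uses the regressor's default score, which is R^2
scores = cross_val_score(regressor, X_train, y_train, cv=10)
print("Mean cross-validation score (R^2): %.2f" % scores.mean())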

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %f" % (rmse))

In [None]:
# R^2 is reported as-is; taking its square root would no longer be the R-squared score
r2 = r2_score(y_test, y_pred)
print("R_Squared Score : %f" % (r2))
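
For reference, both metrics can be computed directly from their definitions. The sketch below uses four made-up values (not taken from the CO2 dataset) and plain NumPy; the variable names are chosen for this example only.

```python
import numpy as np

# Small made-up example values, not taken from the CO2 dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# RMSE: square root of the mean squared error, in the same units as the target.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.4f}")
print(f"R^2:  {r2_manual:.4f}")
```

RMSE is an error, so lower is better, while R^2 measures the share of variance explained, so values closer to 1 are better.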

#### Visualize results

In [None]:
plt.figure(figsize=(10, 5), dpi=80)
sns.lineplot(x='Year', y='Value', data=df)

In [None]:
plt.figure(figsize=(10, 5), dpi=80)
x_ax = range(len(y_test))
plt.plot(x_ax, y_test, label="test")
plt.plot(x_ax, y_pred, label="predicted")
plt.title("Carbon Dioxide Emissions - Test and Predicted data")
plt.legend()
plt.show()

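Finally, the last piece of code plots the forecasted carbon dioxide emissions until 2025. The model expects 'Month' and 'Year' as inputs, so we build these features for every month from August 2016 (assumed here to be the first month after the data ends) to December 2025 and pass them to the trained regressor. Keep in mind that tree-based models cannot extrapolate beyond the range of values seen during training, so treat the result as a rough illustration.

In [None]:
# Future months, in the same column order used for training: ['Month', 'Year']
future_dates = pd.date_range(start='8/1/2016', end='12/1/2025', freq='MS')
X_future = np.column_stack([future_dates.month, future_dates.year]).astype(float)

# Predict on the future months instead of re-using the test-set predictions
forecast = pd.DataFrame({'date': future_dates, 'pred': regressor.predict(X_future)})

plt.figure(figsize=(10, 5), dpi=80)
sns.lineplot(x='date', y='pred', data=forecast)
plt.title("Carbon Dioxide Emissions - Forecast")
plt.show()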

<hr/>

### K-means

#### Supervised vs Unsupervised learning

<img src = "../files/11/1_31iqrQyCqIuuGPLUK_BjMQ.png" width=75%>

<img src = "../files/11/0_p3zNp9D8YFmX32am.jpg" width=75%>

#### K-means Animation

http://shabal.in/visuals/kmeans/1.html

#### Import libraries

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

#### Load and prepare data

In [None]:
df = pd.read_csv("Mall_Customers.csv")

#### EDA

In [None]:
df

In [None]:
df.info()

#### Using K-means in Action

In [None]:
# Columns 3 and 4 are the two features plotted later: annual income and spending score
X = df.iloc[:, [3, 4]].values

In [None]:
X

In [None]:
from sklearn.cluster import KMeans

* Within Cluster Sum of Squares (WCSS):

WCSS is the sum of squared distances between each point and the centroid of its cluster, added up over all clusters. As the number of clusters K increases, the WCSS value decreases. When we plot WCSS against K, the curve looks like an elbow, and the bend marks the point beyond which adding more clusters yields only small improvements.
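
The definition can be checked by hand on a tiny example. The sketch below uses six made-up 2-D points and a hypothetical helper named `wcss`; for a fitted scikit-learn `KMeans` model the same quantity is available as `inertia_`.

```python
import numpy as np

# Six made-up 2-D points, already assigned to two clusters.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def wcss(points, labels):
    """Sum of squared distances from every point to the centroid of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return total

print("WCSS with 2 clusters:", wcss(points, labels))
print("WCSS with 1 cluster: ", wcss(points, np.zeros(len(points), dtype=int)))
```

Putting all points into a single cluster gives a much larger WCSS than the two-cluster assignment, which is the drop that the elbow plot visualizes.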

In [None]:
wcss = []

for i in range(1, 11):
    # k-means++ is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm.
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

In [None]:
wcss

<img src="../files/11/914581_vNsFrDUvGn9yTjlnXLgW8A.png" width=70%>

#### How to choose K in K-means? (Elbow Curve)

In [None]:
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
# K = 5 is the number of clusters picked from the elbow curve above
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(X)
y_kmeans

#### Visualise the clusters

In [None]:
kmeans.cluster_centers_

In [None]:
# Plot each cluster in its own colour
colors = ['red', 'blue', 'green', 'cyan', 'magenta']
for k, color in enumerate(colors):
    plt.scatter(X[y_kmeans == k, 0], X[y_kmeans == k, 1], s = 100, c = color, label = f'Cluster {k + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()