<img src="https://pythongeeks.org/wp-content/uploads/2022/03/ml-xgboost-introduction.webp" width=80% />

# Topics:

-  What is XGBoost (Review)
-  XGBoost in action (Regression)

## What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting. Developed by Tianqi Chen, it is an implementation of the gradient boosting algorithm and one of the most widely used tools from the Distributed Machine Learning Community (DMLC). It is an ensemble learning method: it combines many decision trees, each one trained to correct the errors of the trees before it.
XGBoost is a popular choice for winning machine learning competitions, including those on Kaggle.
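
To make the idea of gradient boosting concrete, here is a minimal sketch using NumPy only. It fits one-split trees ("stumps") to the residuals of the current prediction, which is the negative gradient of the squared error. The helper names (`fit_stump`, `predict_stump`, `boost`) and the synthetic sine data are assumptions made for illustration. This is plain gradient boosting, not XGBoost's actual algorithm, which adds regularization, second-order gradient information, and many engineering optimizations.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that minimises the squared error of the residual."""
    best = None
    for t in np.unique(x)[:-1]:  # every value except the maximum is a candidate split
        left = residual[x <= t]
        right = residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly add a stump fitted to the residuals."""
    base = y.mean()
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of the squared error
        stump = fit_stump(x, residual)
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, pred = boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(f"MSE with the mean only: {mse_start:.4f}, after boosting: {mse_end:.4f}")
```

Each round shrinks the training error a little; the learning rate controls how much of each stump's correction is applied.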

# XGBoost in Action (Regression)

## Importing the libraries

In [None]:
#!pip install xgboost

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk


import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
sk.__version__

## Load and Prepare Data

We will use a dataset of carbon dioxide emissions from burning coal to generate electric power in the United States between 1973 and 2016. Using XGBoost in this Jupyter notebook, we will try to predict the carbon dioxide emissions for the next few years.
  
### **Feature description table**

| **Feature Name** | **Data Type** | **Description**                          | **Example Value** |
|------------------|--------------|------------------------------------------|------------------|
| `YYYYMM`         | Integer      | Year and Month in `YYYYMM` format. This represents the time period for each observation. | `197301`         |
| `Value`          | Float        | The carbon dioxide emissions from coal-based electricity generation recorded for the corresponding time period. | `72.076`         |

### Key Notes:
1. **Feature `YYYYMM`**:  
   - This is a **time-based feature** representing a specific month and year.
   - It can be used to plot trends and analyze seasonal or temporal patterns.

2. **Feature `Value`**:  
   - This is a **numerical feature** and can be analyzed using statistical measures (mean, median, standard deviation, etc.).  
   - It is suitable for **trend analysis** or building **time-series models**.
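
As a small sketch of how a `YYYYMM` integer can be turned into usable time features, the example below splits it with integer arithmetic and also parses it into a proper date. The `sample` frame is hypothetical: only the first row comes from the table above, and the other values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the same YYYYMM / Value layout.
# Only the first row is taken from the feature table; the rest are made up.
sample = pd.DataFrame({
    "YYYYMM": [197301, 197302, 197312, 201607],
    "Value": [72.076, 10.0, 20.0, 30.0],
})

# Integer arithmetic: 197301 // 100 -> 1973, 197301 % 100 -> 1
sample["Year"] = sample["YYYYMM"] // 100
sample["Month"] = sample["YYYYMM"] % 100

# A real datetime column is convenient for plotting and resampling
sample["Date"] = pd.to_datetime(sample["YYYYMM"].astype(str), format="%Y%m")

print(sample)
```

Keeping a datetime column alongside the numeric `Year` and `Month` features makes trend and seasonality plots easier to label.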



In [None]:
#Read the dataset and print the top 5 elements of the dataset
data = pd.read_csv('/kaggle/input/co2-dataset/co2.csv')
data.head(5)

In [None]:
data.info()

We use pandas to import the CSV file. The dataframe contains a `YYYYMM` column that needs to be separated into `Year` and `Month` columns. In this step, we also drop the original `YYYYMM` column and replace any infinite values with NaN so that they are counted as missing. Finally, we retrieve the last five rows of the dataframe to check that the code worked.

In [None]:
data['Month'] = data.YYYYMM.astype(str).str[4:6].astype(float)
data['Year'] = data.YYYYMM.astype(str).str[0:4].astype(float)

In [None]:
data.shape

In [None]:
data.drop(['YYYYMM'], axis=1, inplace=True)
data.replace([np.inf, -np.inf], np.nan, inplace=True)
data.tail(5)

In [None]:
# check for data type
print(data.dtypes)

In [None]:
data.isnull().sum()

In [None]:
import warnings 

warnings.filterwarnings('ignore') 
g = sns.FacetGrid(data, col="Year", height=3.5, aspect=.65)
g.map(sns.kdeplot, 'Value')

In [None]:
sns.scatterplot(x='Year', y='Value', data=data)
plt.title("CO2 Emissions by Year")
plt.show()

In [None]:
X = data.loc[:,['Month', 'Year']].values
y = data.loc[:,'Value'].values

In [None]:
y

In [None]:
data_dmatrix = xgb.DMatrix(X,label=y)

In [None]:
data_dmatrix

In [None]:
# Hold out 20% of the rows as a test set (random split with a fixed seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
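
`train_test_split` shuffles the rows, so the test set is a random sample of months from the whole period. For time-ordered data, a common alternative is a chronological hold-out, where the most recent months form the test set. The sketch below shows that variant on a hypothetical frame (`demo`, with made-up values); it is not the split used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame with the same Year / Month / Value columns
dates = pd.date_range("2010-01-01", periods=60, freq="MS")
demo = pd.DataFrame({
    "Year": dates.year,
    "Month": dates.month,
    "Value": np.arange(60, dtype=float),  # made-up values
})

# Sort by time, then keep the first 80% of months for training
demo = demo.sort_values(["Year", "Month"]).reset_index(drop=True)
cutoff = int(len(demo) * 0.8)
train, test = demo.iloc[:cutoff], demo.iloc[cutoff:]

X_train_t = train[["Month", "Year"]].to_numpy()
y_train_t = train["Value"].to_numpy()
X_test_t = test[["Month", "Year"]].to_numpy()
y_test_t = test["Value"].to_numpy()

print("last training month:", train[["Year", "Month"]].iloc[-1].tolist())
print("first test month:   ", test[["Year", "Month"]].iloc[0].tolist())
```

With this split, every test month comes after every training month, which is closer to how a forecast is actually used.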

In [None]:
reg_mod = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.08,
    subsample=0.75,
    colsample_bytree=1,
    max_depth=7,
    gamma=0,
)
reg_mod.fit(X_train, y_train)

In [None]:
# 10-fold cross-validation on the training set; for a regressor the default score is R-squared
scores = cross_val_score(reg_mod, X_train, y_train,cv=10)
print("Mean cross-validation score: %.2f" % scores.mean())
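
`cross_val_score` can also take an explicit `KFold` object and a named scoring metric. The sketch below shows both on synthetic data. To keep it self-contained, it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; the data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=200)

# Stand-in model; an XGBRegressor could be passed in the same way
model = GradientBoostingRegressor(n_estimators=100, random_state=0)

# Shuffled 10-fold split with a fixed seed so the folds are reproducible
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

r2_scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="r2")

# scikit-learn maximises scores, so error metrics are returned negated
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=kfold, scoring="neg_root_mean_squared_error"
)
rmse_scores = -neg_rmse

print("Mean R-squared: %.3f" % r2_scores.mean())
print("Mean RMSE:      %.3f" % rmse_scores.mean())
```

Reporting RMSE next to R-squared gives an error in the same units as the target, which is often easier to interpret.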

In [None]:
reg_mod.fit(X_train,y_train)

predictions = reg_mod.predict(X_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("RMSE: %f" % (rmse))

In [None]:
# Coefficient of determination (R-squared) on the test set; no square root is applied
r2 = r2_score(y_test, predictions)
print("R_Squared Score : %f" % (r2))

As you can see, these metrics reinforce our confidence in the model: RMSE ≈ 7.37 and R² ≈ 0.976 (√R² ≈ 0.988). Now, let's visualize the original dataset using the seaborn library.

In [None]:

plt.figure(figsize=(10, 5), dpi=80)
sns.lineplot(x='Year', y='Value', data=data)
plt.title("Annual CO2 Emissions")

In [None]:
plt.figure(figsize=(20, 5), dpi=80)
sns.barplot(data=data, x='Year', y='Value')
plt.xticks(rotation=45)

In [None]:
# Sum the monthly values within each year

yearly_sum = data.groupby('Year')['Value'].sum().reset_index()

print(yearly_sum)

yearly_sum.to_csv('Yearly_Value_Sum.csv', index=False)

plt.figure(figsize=(10, 6))
plt.plot(yearly_sum['Year'], yearly_sum['Value'], marker='o', linestyle='-')
plt.title('Total CO2 Value by Year')
plt.xlabel('Year')
plt.ylabel('Total Value')
plt.xticks(rotation=45)
plt.grid()
plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D

grouped = data.groupby(['Year', 'Month'])['Value'].sum().reset_index()

fig = plt.figure(figsize=(25, 10), dpi=80)
ax = fig.add_subplot(111, projection='3d')

ax.scatter(grouped['Year'], 
           grouped['Month'], 
           grouped['Value'], 
           c=grouped['Value'], 
           cmap='viridis', 
           marker='o')

ax.set_xlabel('Year')
ax.set_ylabel('Month')
ax.set_zlabel('Total CO2 Value')
ax.set_title("3D Visualization of CO2 Emissions by Year and Month")

plt.show()

In [None]:
plt.figure(figsize=(10, 5), dpi=80)
x_ax = range(len(y_test))
plt.plot(x_ax, y_test, label="test")
plt.plot(x_ax, predictions, label="predicted")
plt.title("Carbon Dioxide Emissions - Test and Predicted data")
plt.legend()
plt.show()

Finally, the last piece of code plots the forecasted carbon dioxide emissions through 2025.

In [None]:

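# Build (Month, Year) features for August 2016 through December 2025,
# in the same column order the model was trained on
future_dates = pd.date_range(start='2016-08-01', end='2025-12-01', freq='MS')
X_future = np.column_stack([future_dates.month, future_dates.year]).astype(float)
df = pd.DataFrame({'date': future_dates, 'pred': reg_mod.predict(X_future)})

plt.figure(figsize=(10, 5), dpi=80)
sns.lineplot(x='date', y='pred', data=df)
plt.title("Carbon Dioxide Emissions - Forecast")
plt.show()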