# Regression Analysis with XGBoost

Hi guys, welcome to [Tirendaz Academy](https://youtube.com/c/tirendazacademy) 😀 <br>
In this notebook, I'm going to talk about regression analysis with XGBoost. <br>
I'll cover the following topics:
- What is XGBoost?
- Building a regression model with XGBoost
- Building a linear regression model with Scikit-Learn

Happy learning 🐱‍🏍 

# What is XGBoost?

XGBoost is short for Extreme Gradient Boosting.

You can use the XGBoost package to implement gradient boosting.

It provides parallel tree boosting, which solves many data science problems quickly and accurately.
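To make the idea of gradient boosting a bit more concrete, here is a toy sketch in plain Python. It is not how XGBoost is implemented (XGBoost adds regularization and many optimizations on top); it only shows the core loop for squared error, where each new weak learner is fitted to the residuals of the current ensemble. The names `fit_stump` and `boost` and the data are made up for this illustration.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that best splits the residuals (least squares)."""
    best = None
    # Splitting at the largest value would leave the right side empty, so skip it
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Return a prediction function built by boosting stumps on residuals."""
    base = sum(y) / len(y)  # start from the mean of the target
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(predict, x, y):
    return sum((predict(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = boost(x, y)
baseline_mse = mse(lambda xi: sum(y) / len(y), x, y)  # always predict the mean
boosted_mse = mse(model, x, y)
print(baseline_mse, boosted_mse)
```

The boosted model should have a much lower training error than the mean-only baseline, because every round removes part of what the previous rounds got wrong.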

To install XGBoost, you can use the `pip install xgboost` command.

# Loading Dataset

In [1]:
import pandas as pd
df = pd.read_csv("bike_rentals.csv")
df.head()

,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1.0,0.0,1.0,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1.0,0.0,1.0,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1.0,0.0,1.0,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1.0,0.0,1.0,0.0,2.0,1.0,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1.0,0.0,1.0,0.0,3.0,1.0,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


# Understanding The Dataset

In [2]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
instant,731.0,366.0,211.165812,1.0,183.5,366.0,548.5,731.0
season,731.0,2.49658,1.110807,1.0,2.0,3.0,3.0,4.0
yr,730.0,0.5,0.500343,0.0,0.0,0.5,1.0,1.0
mnth,730.0,6.512329,3.448303,1.0,4.0,7.0,9.75,12.0
holiday,731.0,0.028728,0.167155,0.0,0.0,0.0,0.0,1.0
weekday,731.0,2.997264,2.004787,0.0,1.0,3.0,5.0,6.0
workingday,731.0,0.682627,0.465773,0.0,0.0,1.0,1.0,1.0
weathersit,731.0,1.395349,0.544894,1.0,1.0,1.0,2.0,3.0
temp,730.0,0.495587,0.183094,0.05913,0.336875,0.499166,0.655625,0.861667
atemp,730.0,0.474512,0.163017,0.07907,0.337794,0.487364,0.608916,0.840896


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    float64
 3   yr          730 non-null    float64
 4   mnth        730 non-null    float64
 5   holiday     731 non-null    float64
 6   weekday     731 non-null    float64
 7   workingday  731 non-null    float64
 8   weathersit  731 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         728 non-null    float64
 12  windspeed   726 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(10), int64(5), object(1)
memory usage: 91.5+ KB


# Data Preprocessing

### Handling Missing Data

In [4]:
df.isna().sum()

instant       0
dteday        0
season        0
yr            1
mnth          1
holiday       0
weekday       0
workingday    0
weathersit    0
temp          1
atemp         1
hum           3
windspeed     5
casual        0
registered    0
cnt           0
dtype: int64

Let's fill missing data with the median of each column.

In [5]:
# Median of each column that has missing values
cols_with_na = ["yr", "mnth", "temp", "atemp", "hum", "windspeed"]
values = {col: df[col].median() for col in cols_with_na}

In [6]:
df.fillna(value=values, inplace=True)

Let's take a look at missing data again.

In [7]:
df.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
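If you don't want to list the columns by hand, a shorter variant is possible. This is a minimal sketch on a made-up toy frame (the `toy` values are not from the bike rentals data): `DataFrame.median(numeric_only=True)` returns one median per numeric column, and `fillna` accepts that Series and fills each column with its own value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike rentals data (values are made up)
toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04"],
    "hum": [0.80, np.nan, 0.44, 0.59],
    "windspeed": [0.16, 0.25, np.nan, 0.19],
})

# Median of every numeric column; NaN values are skipped by default
medians = toy.median(numeric_only=True)

# fillna with a Series fills each column with the value at the matching label
filled = toy.fillna(medians)
print(filled)
```

This version returns a new frame instead of modifying the original in place, which makes it easier to compare the data before and after filling.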

### Removing unnecessary columns

In [8]:
df = df.drop(["casual","registered","dteday"], axis=1)
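One likely reason to drop `casual` and `registered` is that together they give away the target. In the five rows printed by `df.head()` earlier, `casual + registered` equals `cnt` in every row, so a model that sees both columns could simply add them up. Here is a quick check on those five rows, retyped by hand:

```python
import pandas as pd

# The five rows printed by df.head() earlier in this notebook
head = pd.DataFrame({
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# If the two columns always sum to the target, keeping them would leak the answer
leaks_target = bool((head["casual"] + head["registered"] == head["cnt"]).all())
print(leaks_target)
```

Running the same comparison on the full `df` before the drop would confirm whether this holds for all 731 rows.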

### Creating the target and feature variables

In [9]:
y = df["cnt"]
X = df.drop("cnt", axis=1)

### Splitting the dataset into the training and test set

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
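When no `test_size` is given, `train_test_split` holds out 25% of the rows for the test set. The sketch below uses stand-in arrays with 731 rows (the same row count as our dataset, but made-up values) to show the resulting sizes. Note that the cells that follow score the models with cross-validation on the full `X` and `y`, so this split is not used for the scores reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the bike rentals data (731)
X_demo = np.arange(731 * 2).reshape(731, 2)
y_demo = np.arange(731)

# Default split: 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# The same split with the test fraction spelled out
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
```

Fixing `random_state` makes the split reproducible, so both calls above select the same rows.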

# Building The Model with XGBoost

In [12]:
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

model = XGBRegressor()
# 10-fold cross-validation; scikit-learn returns the negated MSE for each fold
xg_scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)

The scores are negated mean squared errors, so let's convert them to an RMSE per fold and then take a look at the mean.

In [13]:
import numpy as np
rmse = np.sqrt(-xg_scores)
rmse

array([ 716.58107786,  669.242233  ,  513.94478758,  699.10947944,
        843.96424434, 1046.137147  , 1005.04935475,  866.77141114,
        904.24420278, 1698.27662405])

In [14]:
np.mean(rmse).round()

896.0
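Why the minus sign? scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated. Here is a small sketch of the conversion on synthetic data, using `LinearRegression` as a stand-in model (the data and the `_demo` names are made up for this illustration). It also shows the `"neg_root_mean_squared_error"` scorer, which gives the negated RMSE per fold directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (made up for this illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression()

# Scores are negated errors, so every value is <= 0
neg_mse = cross_val_score(model, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=5)
rmse_from_mse = np.sqrt(-neg_mse)

# scikit-learn can also return the negated RMSE per fold directly
neg_rmse = cross_val_score(model, X_demo, y_demo,
                           scoring="neg_root_mean_squared_error", cv=5)

print(rmse_from_mse.round(3))
print((-neg_rmse).round(3))
```

Both printed arrays should match, so either scorer can be used as long as you remember to flip the sign.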

# Building a Linear Regression Model

In [15]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Same 10-fold setup as for XGBoost so the two models are comparable
lr_scores = cross_val_score(lr, X, y, scoring="neg_mean_squared_error", cv=10)
lr_rmse = np.sqrt(-lr_scores)
lr_rmse.mean().round()

969.0
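With the numbers above, XGBoost reaches a mean RMSE of about 896 versus about 969 for linear regression, so it does somewhat better on this dataset with default settings. Since both models were scored with the same few lines, the pattern can be wrapped in a small helper. The function name `cv_rmse` and the synthetic data are my own for this sketch, and `DecisionTreeRegressor` is used here only as a stand-in for a tree-based model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse(model, X, y, cv=10):
    """Mean RMSE across cross-validation folds (lower is better)."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=cv)
    return float(np.sqrt(-scores).mean())


# Synthetic data with a nonlinear signal (made up for this illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = (3 * np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2
          + rng.normal(scale=0.2, size=300))

results = {
    "linear": cv_rmse(LinearRegression(), X_demo, y_demo),
    "tree": cv_rmse(DecisionTreeRegressor(max_depth=5, random_state=0),
                    X_demo, y_demo),
}

# Print the models from best (lowest RMSE) to worst
for name, value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.3f}")
```

Because `XGBRegressor` works with `cross_val_score` as shown earlier in this notebook, `cv_rmse(XGBRegressor(), X, y)` should give the same kind of result for the bike rentals data.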
