<a href="https://colab.research.google.com/github/Sarthak016/MachineLearning/blob/main/REGRESSION_ALL_BLUEPRINT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Seoul Bike Sharing Demand Prediction </u></b>



## <b> Problem Description </b>

Rental bikes have been introduced in many cities to improve mobility and comfort. It is important to make rental bikes available and accessible to the public at the right time, as this reduces waiting time. Providing the city with a stable supply of rental bikes therefore becomes a major concern, and the crucial part is predicting the bike count required at each hour.

Our objective is to estimate the bike count required at each hour, using the values in the other columns, so that a stable supply of rental bikes can be maintained. If we can do so for the historical data, then we should be able to estimate the bike count required at any given hour.

----

## <b> Data Description </b>

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.


### <b>Attribute Information: </b>

*  Date : year-month-day
*  Rented Bike count - Count of bikes rented at each hour
*  Hour - Hour of the day
*  Temperature-Temperature in Celsius
*  Humidity - %
*  Windspeed - m/s
*  Visibility - 10m
*  Dew point temperature - Celsius
*  Solar radiation - MJ/m2
*  Rainfall - mm
*  Snowfall - cm
*  Seasons - Winter, Spring, Summer, Autumn
*  Holiday - Holiday/No holiday
*  Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

----

> The lifecycle of a data science project
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

----

## **Import Libraries and Data** 


In [None]:
# Import necessary libraries

import numpy as np
import math

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext google.colab.data_table

import pandas as pd
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

from sklearn.feature_selection import SelectFromModel

from sklearn import neighbors
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor

import xgboost as xgb
from xgboost import plot_importance

import lightgbm 

In [None]:
# Mount Drive to load data.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read the csv file
data=pd.read_csv("/content/drive/MyDrive/Capstone 2/REGRESSION/bike sharing demand prediction/SeoulBikeData.csv",encoding= 'unicode_escape')

## **First Look**

In [None]:
# First 5 rows.
data.head()

In [None]:
# Last 5 values.
data.tail()

>Let's check the duplicate entries

In [None]:
# Check for duplicated entries.
print("Duplicate entry in data:",len(data[data.duplicated()])) 

In [None]:
# Custom Function for Dtype,Unique values and Null values
def datainfo():
    temp_ps = pd.DataFrame(index=data.columns)
    temp_ps['DataType'] = data.dtypes
    temp_ps["Non-null_Values"] = data.count()
    temp_ps['Unique_Values'] = data.nunique()
    temp_ps['NaN_Values'] = data.isnull().sum()
    temp_ps['NaN_Values_Percentage'] = (temp_ps['NaN_Values']/len(data))*100 
    return temp_ps

In [None]:
# Shape of the data.
print("Total Rows and Columns in DataFrame is :",data.shape,"\n") 
# Custom Function
datainfo()

The dataset contains 8760 rows and 14 columns. Each row contains the weather conditions and the number of bikes rented for one hour.


> Looks like "Seasons", "Holiday" and "Functioning Day" are strings (possibly categories) and the rest of the columns are numerical data. None of the columns contain any missing values, which saves us a fair bit of work!

Here are some statistics for the numerical columns:

In [None]:
# Statistical info.
data.describe().T

The ranges of values in the numerical columns seem reasonable too, so we may not have to do much data cleaning or correction. However, the "Wind speed", "Dew point temperature(°C)", "Solar Radiation", "Rainfall" and "Snowfall" columns seem to be significantly skewed, as the median (50th percentile) is much lower than the maximum value.
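
To make the skewness observation concrete, the sketch below compares the mean and the median of a small sample. The readings are made-up, rainfall-like values (an assumption for illustration only, not taken from the dataset); a mean that sits well above the median hints at a long right tail.

```python
from statistics import mean, median

# Hypothetical hourly rainfall readings (mm): mostly zero with a few large values.
rainfall = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 9.5, 35.0]

avg = mean(rainfall)
mid = median(rainfall)

# A mean far above the median, and a maximum far above both,
# points to a long right tail (positive skew).
print(f"mean={avg:.2f}, median={mid:.2f}, max={max(rainfall):.2f}")
is_right_skewed = avg > mid
```

The same comparison can be read directly from the `mean` and `50%` columns of `data.describe().T`.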

# Step 1 - Exploratory Analysis and Visualization

Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between "Rented Bike count" and other columns.

## **Missing values**

In [None]:
# Step 1: make the list of features which have missing values
feature_with_na=[feature for feature in data.columns if data[feature].isnull().sum()>0]

# Step 2: print the feature name and the percentage of missing values
for feature in feature_with_na:
  print(feature, np.round(data[feature].isnull().mean(), 4)*100,  " % missing values")

In [None]:
# Drop columns in which more than 40% of the values are NaN
perc=40.0
min_count=int(((100-perc)/100)*data.shape[0] + 1)
data=data.dropna(axis=1,thresh=min_count)

> We'll use Dataprep library for automated visualization.

In [None]:
pip install -U dataprep

In [None]:
# Using dataprep profiling to get an idea about the dataset
from dataprep.eda import create_report
report=create_report(data)
report

#### Numerical Data

In [None]:
# list of numerical variables
numerical_features=[col for col in data.columns if data[col].dtype!='O']
# Separate dataframe for Numerical feature
num_data=data[numerical_features]
num_data.head()

#### Categorical Data

In [None]:
# list of categorical variables
categorical_features=[col for col in data.columns if data[col].dtype=='O']
# Separate dataframe for Categorical feature
cat_data=data[categorical_features]
cat_data.head()

#### Discrete Variables

In [None]:
## Let's analyse the discrete variables with bar plots of the median rented bike count per value
discrete_feature=[feature for feature in numerical_features if len(data[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

for feature in discrete_feature:
    dataset=data.copy()
    fig, ax = plt.subplots(figsize=(12,6),facecolor="#363336")
    ax.patch.set_facecolor('#8C8C8C')
    dataset.groupby(feature)['Rented Bike Count'].median().plot.bar(color='red')
    ax.tick_params(axis='x', colors='#F5E9F5',labelsize=15) 
    ax.tick_params(axis='y', colors='#F5E9F5',labelsize=15)
    ax.set_xlabel(feature, color='#F5E9F5', fontsize=20)
    ax.set_ylabel("Rented Bike Count",  color='#F5E9F5', fontsize=20)       

#### Continuous Variables

In [None]:
## Lets analyse the continuous values by creating histograms to understand the distribution
continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous feature Count {}".format(len(continuous_feature)))

for feature in continuous_feature:
    dataset=data.copy()
    fig, ax = plt.subplots(figsize=(12,6),facecolor="#363336")
    ax.patch.set_facecolor('#8C8C8C')
    sns.histplot(dataset[feature], kde=True, color='r', line_kws={'linewidth': 3}, ax=ax)  # histogram with a KDE overlay
    ax.tick_params(axis='x', colors='#F5E9F5',labelsize=15) 
    ax.tick_params(axis='y', colors='#F5E9F5',labelsize=15)
    ax.set_xlabel(feature, color='#F5E9F5', fontsize=20)
    ax.set_ylabel("Count",  color='#F5E9F5', fontsize=20)
   

#### Categorical Variables

In [None]:
# Unique number of categorical features
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(data[feature].unique())))

In [None]:
#Find out the relationship between categorical variable and dependent feature Rented Bike Count
for feature in categorical_features:
    fig, ax = plt.subplots(figsize=(12,6),facecolor="#363336")
    ax.patch.set_facecolor('#8C8C8C')
    data.groupby(feature)['Rented Bike Count'].median().plot.bar(color='red', ax=ax)
    ax.tick_params(axis='x', colors='#F5E9F5',labelsize=15) 
    ax.tick_params(axis='y', colors='#F5E9F5',labelsize=15)
    ax.set_xlabel(feature, color='#F5E9F5', fontsize=20)
    ax.set_ylabel("Rented Bike Count",  color='#F5E9F5', fontsize=20)
  

## Outliers

In [None]:
for feature in numerical_features:
    dataset=data.copy()
    if 0 in dataset[feature].unique():
        pass
    else:
        fig, ax = plt.subplots(figsize=(12,6),facecolor="#363336")
        ax.patch.set_facecolor('#8C8C8C')
        sns.boxplot(x=data[feature], color='red', ax=ax)
        ax.tick_params(axis='x', colors='#F5E9F5',labelsize=15) 
        ax.tick_params(axis='y', colors='#F5E9F5',labelsize=15)
        ax.set_xlabel(feature, color='#F5E9F5', fontsize=20)
          

## Linear Relation

In [None]:
# Creating scatterplots to inspect the correlation with the target
for col in (numerical_features[1:]):
  fig, ax = plt.subplots(figsize=(12,6),facecolor="#363336")
  ax.patch.set_facecolor('#8C8C8C')
  
  sns.scatterplot(data = data, x = col ,  y = 'Rented Bike Count' ,hue = 'Seasons',s=250,palette=["pink","grey","black","red"], ax =ax)  #... using Season as hue to see the distribution of count

  z = np.polyfit(data[col], data['Rented Bike Count'], 1)  # creating best fit line
  y_hat = np.poly1d(z)(data[col])
  plt.plot(data[col], y_hat, "b--", lw=4)
  
  ax.tick_params(axis='x', colors='#F5E9F5',labelsize=15) 
  ax.tick_params(axis='y', colors='#F5E9F5',labelsize=15)
  ax.set_xlabel(col, color='#F5E9F5', fontsize=20)
  ax.set_ylabel("Rented Bike Count",  color='#F5E9F5', fontsize=20)

## Correlation

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(data[numerical_features].corr().abs(),annot=True,cmap='coolwarm',linewidths=1,linecolor='black')  # correlation is defined for numeric columns only

# Step 2 - Prepare the Dataset for Training


Before we can train the model, we need to prepare the dataset. Here are the steps we'll follow:

1. Identify the input and target column(s) for training the model.
2. Identify numeric and categorical input columns.
3. [Impute](https://scikit-learn.org/stable/modules/impute.html) (fill) missing values in numeric columns
4. [Scale](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) values in numeric columns to a $(0,1)$ range.
5. [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) categorical data into one-hot vectors.
6. Split the dataset into training and validation sets.


## Identify Inputs and Targets

The dataset contains `14` columns, but not all of them are used as model inputs. Note the following:

- The first column `Date` is a date string; it is not used as an input here.
- The column `Rented Bike Count` contains the value we need to predict, i.e. it's the target column.
- Data from all the other columns can be used as inputs to the model.
 

> **QUESTION 4**: Create a list `input_cols` of column names containing data that can be used as input to train the model, and identify the target column as the variable `target_col`.

In [None]:
# Identify the name of the target column (a single string, not a list)
target_col = 'Rented Bike Count'

# Identify the input columns (a list of column names): skip the first column and the target
input_cols = [col for col in list(data.columns)[1:] if col != target_col]

In [None]:
# It is always good practice to print and check the result of the code you execute
print(input_cols)

In [None]:
# It is always good practice to print and check the result of the code you execute
print(target_col)

Make sure that the `Date` and `Rented Bike Count` columns are not included in `input_cols`.

Now that we've identified the input and target columns, we can separate input & target data.

In [None]:
# Separate input & target data
inputs_df = data[input_cols].copy()  # copy so that later column assignments do not touch `data`
targets = data[target_col]

## Identify Numeric and Categorical Data
The next step in data preparation is to identify numeric and categorical columns. We can do this by looking at the data type of each column.

> **QUESTION 5**: Create two lists `numeric_cols` and `categorical_cols` containing names of numeric and categorical input columns within the dataframe respectively. Numeric columns have data types `int64` and `float64`, whereas categorical columns have the data type `object`.
>
> *Hint*: See this [StackOverflow question](https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas). 

In [None]:
# Identify numeric and categorical columns by data type
numeric_cols = inputs_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = inputs_df.select_dtypes(include=[object]).columns.tolist()

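## Impute Numerical Data
We saw earlier that none of the columns contain missing values, so the check below should come back empty. The imputation step is kept so that the same pipeline also handles data that does contain missing values (nan).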

In [None]:
# using isna() to calculate the null values in Numeric columns
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

Machine learning models can't work with missing data. The process of filling missing values is called [imputation](https://scikit-learn.org/stable/modules/impute.html).

<img src="https://i.imgur.com/W7cfyOp.png" width="480">

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.
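
As a rough illustration of what mean imputation does, the sketch below fills the gaps in a single made-up column using plain Python. The values and the name `temperature` are assumptions for illustration; the notebook itself relies on `SimpleImputer`.

```python
import math

# Hypothetical column with two missing readings, marked as NaN.
temperature = [4.0, float("nan"), 6.0, 8.0, float("nan"), 2.0]

# Mean of the observed (non-missing) values only.
observed = [v for v in temperature if not math.isnan(v)]
column_mean = sum(observed) / len(observed)

# Replace every NaN with that mean, leaving observed values untouched.
imputed = [column_mean if math.isnan(v) else v for v in temperature]
print(imputed)
```

Because only observed values enter the mean, the imputed entries do not shift the column average.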


> **QUESTION 6**: Impute (fill) missing values in the numeric columns of `inputs_df` using a `SimpleImputer`. 

In [None]:
# Import SimpleImputer from sklearn library
from sklearn.impute import SimpleImputer

# 1. Create the imputer
imputer = SimpleImputer(strategy = 'mean')

# 2. Fit the imputer to the numeric columns
imputer.fit(inputs_df[numeric_cols])

# 3. Transform and replace the numeric columns
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

In [None]:
# using isna()  to check the null values in Numeric columns
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

## Scale Numerical Values
The numeric columns in our dataset have varying ranges.

In [None]:
# using describe function to see statistics information and .loc to filter min and max from describe function
inputs_df[numeric_cols].describe().loc[['min', 'max']]

A good practice is to [scale numeric features](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) to a small range of values e.g. $(0,1)$. Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.
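
The arithmetic behind min-max scaling is small enough to write out by hand. The sketch below is a plain-Python illustration on made-up humidity readings (the function name `min_max_scale` and the handling of a constant column are assumptions of this sketch, not the scikit-learn implementation).

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; this sketch maps everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical humidity readings in percent.
humidity = [20.0, 35.0, 50.0, 80.0]
scaled = min_max_scale(humidity)
print(scaled)
```

Each value keeps its relative position between the column minimum and maximum, so the shape of the distribution is unchanged.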


> **QUESTION 7**: Scale numeric values to the $(0, 1)$ range using `MinMaxScaler` from `sklearn.preprocessing`.

In [None]:
# Import MinMaxScaler from sklearn library
from sklearn.preprocessing import MinMaxScaler

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the numeric columns
scaler.fit(inputs_df[numeric_cols])

# Transform and replace the numeric columns
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])

After scaling, the ranges of all numeric columns should be (0, 1).

In [None]:
# Let's check whether scaling worked
inputs_df[numeric_cols].describe().loc[['min', 'max']]

## Encode Categorical Columns
Our dataset contains several categorical columns, each with a different number of categories.

In [None]:
# Number of unique categories in each categorical column
inputs_df[categorical_cols].nunique().sort_values(ascending=False)



Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
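
A minimal plain-Python sketch of the idea, using a handful of made-up `Seasons` values (the sample rows and the sorted column order are assumptions for illustration):

```python
# Hypothetical values from the "Seasons" column.
seasons = ["Winter", "Spring", "Summer", "Winter", "Autumn"]

# One output column per distinct category, in sorted order.
categories = sorted(set(seasons))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in seasons]

print(categories)
for row in one_hot:
    print(row)
```

Every row contains exactly one `1`, which is where the name "one-hot" comes from.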

> **QUESTION 8**: Encode categorical columns in the dataset as one-hot vectors using `OneHotEncoder` from `sklearn.preprocessing`. Add a new binary (0/1) column for each category

In [None]:
# Import OneHotEncoder from sklearn library
from sklearn.preprocessing import OneHotEncoder

# 1. Create the encoder (on scikit-learn < 1.2 the argument is named sparse instead of sparse_output)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# 2. Fit the encoder to the categorical columns
encoder.fit(inputs_df[categorical_cols])

# 3. Generate column names for each category
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
len(encoded_cols)

In [None]:
# 4. Transform and add new one-hot category columns
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])

The new one-hot category columns should now be added to `inputs_df`.

In [None]:
input = inputs_df[numeric_cols + encoded_cols]
target = targets

## Feature Selection

In [None]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

### Apply Feature Selection
# First, specify the Lasso Regression model and
# select a suitable alpha (equivalent of penalty).
# The bigger the alpha, the fewer features will be selected.

# Then use the SelectFromModel object from sklearn, which
# will select the features whose coefficients are non-zero

feature_sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0)) # remember to set the seed, the random state in this function
feature_sel_model.fit(input, target)
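
Why a larger alpha selects fewer features can be seen in a simplified setting. Assuming standardized, mutually uncorrelated features (an idealization that real data rarely meets), each Lasso coefficient is the least-squares coefficient shrunk towards zero by alpha and clipped at zero. The sketch below uses made-up coefficients; the feature names and values are hypothetical.

```python
def soft_threshold(coef, alpha):
    """Shrink a coefficient towards zero by alpha; return exactly 0 if |coef| <= alpha."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0

def selected_features(coefs, alpha):
    """Names of the features whose shrunken coefficient is non-zero."""
    return [name for name, c in coefs.items() if soft_threshold(c, alpha) != 0.0]

# Hypothetical least-squares coefficients for four features.
ols_coefs = {"Hour": 0.8, "Humidity": -0.3, "Snowfall": 0.04, "Wind speed": -0.002}

for alpha in (0.005, 0.05, 0.5):
    print(f"alpha={alpha}: selected {selected_features(ols_coefs, alpha)}")
```

With correlated features the solution is no longer this simple, but the qualitative effect is the same: weak coefficients reach zero first as alpha grows.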


In [None]:
feature_sel_model.get_support()

In [None]:
# let's print the number of total and selected features
# this is how we can make a list of the selected features
selected_feat = input.columns[(feature_sel_model.get_support())]

# let's print some stats
print('total features: {}'.format((input.shape[1])))
print('selected features: {}'.format(len(selected_feat)))

In [None]:
input=input[selected_feat]

## Training and Validation Set
Finally, let's split the dataset into a training and validation set. We'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [None]:
# Import train_test_split from sklearn library to make split of data into train sets and validation sets
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(input, target, test_size=0.25, random_state=42)
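
Conceptually, the split shuffles the row indices with a fixed seed and sets aside a quarter of them. The sketch below shows that idea in plain Python; `split_indices` is a hypothetical helper, and its shuffle does not reproduce the exact rows chosen by `train_test_split`.

```python
import random

def split_indices(n_rows, val_fraction=0.25, seed=42):
    """Shuffle row indices reproducibly and split them into train and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * val_fraction))
    return indices[n_val:], indices[:n_val]

# 8760 is the number of rows reported for this dataset.
train_idx, val_idx = split_indices(8760)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters because it keeps the validation rows the same across runs, so model scores stay comparable.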

In [None]:
# It is always good practice to print and check the result.
train_inputs

In [None]:
# It is always good practice to print and check the result.
train_targets

In [None]:
# It is always good practice to print and check the result.
val_inputs

In [None]:
# It is always good practice to print and check the result.
val_targets

# Models

In [None]:
pip install catboost


In [None]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn import neighbors
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from xgboost import plot_importance
import lightgbm 
from catboost import CatBoostRegressor 
from sklearn.neural_network import MLPRegressor

In [None]:
# Models that we are going to use
models = [
           ['LinearRegression',              LinearRegression()],
           ['Lasso',                         Lasso()],
           ['Ridge',                         Ridge()],
           ['KNeighborsRegressor',           neighbors.KNeighborsRegressor()],
           ['SVR',                           SVR(kernel='rbf')],
           ['DecisionTree',                  DecisionTreeRegressor(random_state=42)],
           ['RandomForest',                  RandomForestRegressor(random_state=42)],
           ['ExtraTreesRegressor',           ExtraTreesRegressor(random_state=42)],
           ['GradientBoostingRegressor',     GradientBoostingRegressor(random_state=42)],
           ['XGBRegressor',                  xgb.XGBRegressor(random_state=42)] ,
           ['Light-GBM',                     lightgbm.LGBMRegressor(num_leaves=41, n_estimators=200,random_state=42)],
           ['CatBoost',                      CatBoostRegressor(verbose=0, early_stopping_rounds=10,random_state=42)],
           ['MLPRegressor',                  MLPRegressor(  activation='relu', solver='adam',learning_rate='adaptive',max_iter=1000,learning_rate_init=0.01,alpha=0.01)]
         ]

In [None]:
# Run all the proposed models and collect the results in a list model_data
import time
from sklearn import metrics

model_data = []
for name,model in models :

    model_data_dic = {}
    model_data_dic["Name"] = name

    # Time only the call to fit
    start = time.time()
    model.fit(train_inputs,train_targets)
    end = time.time()

    model_data_dic["Train_Time"] = end - start
    # Training set (RMSE is the square root of the mean squared error)
    train_preds = model.predict(train_inputs)
    model_data_dic["Train_R2_Score"] = metrics.r2_score(train_targets,train_preds)
    model_data_dic["Train_RMSE_Score"] = np.sqrt(metrics.mean_squared_error(train_targets,train_preds))
    # Validation set
    val_preds = model.predict(val_inputs)
    model_data_dic["Test_R2_Score"] = metrics.r2_score(val_targets,val_preds)
    model_data_dic["Test_RMSE_Score"] = np.sqrt(metrics.mean_squared_error(val_targets,val_preds))

    model_data.append(model_data_dic)

In [None]:
# Convert list to dataframe
df = pd.DataFrame(model_data)
df

In [None]:
df.plot(x="Name", y=['Test_R2_Score' , 'Train_R2_Score' , 'Test_RMSE_Score'], kind="bar" , title = 'R2 Score Results' , figsize= (10,8)) ;

* Observations
1. The best results on the test set are given by the Extra Trees Regressor, with an R2 score of 0.57
2. The lowest RMSE is also given by the Extra Trees Regressor: 0.65
3. Lasso regularization over linear regression was the worst performing model

In [None]:
# Import metrics from sklearn library
from sklearn import metrics

def evaluate_train(model, train_inputs,train_targets):
    # Prediction on train inputs
    predictions = model.predict(train_inputs)
    print('Train_Data- Model Performance')
    # RMSE is the square root of the mean squared error
    print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(train_targets, predictions)))
    print('R^2:', metrics.r2_score(train_targets, predictions))


def evaluate_val(model, val_inputs,val_targets):
    # Prediction on validation inputs
    predictions = model.predict(val_inputs)
    print('Validation_data-Model Performance')
    print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(val_targets, predictions)))
    print('R^2:', metrics.r2_score(val_targets, predictions))


# Model 1 - Training a Linear Regression Model

We're now ready to train the model. Linear regression is a commonly used technique for solving [regression problems](https://jovian.ai/aakashns/python-sklearn-logistic-regression/v/66#C6). In a linear regression model, the target is modeled as a linear combination (or weighted sum) of input features. The predictions from the model are evaluated using a loss function like the Root Mean Squared Error (RMSE).
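
The "weighted sum" can be written out directly. The sketch below computes one prediction from made-up weights and a made-up input row; the names `predict_linear`, `weights`, `bias` and the numbers are assumptions for illustration, not fitted values from this notebook.

```python
def predict_linear(features, weights, bias):
    """Return bias + sum of weight * feature value, the core of a linear regression model."""
    return bias + sum(weights[name] * value for name, value in features.items())

# Hypothetical weights and a single scaled input row; these are not fitted values.
weights = {"Hour": 400.0, "Temperature": 900.0, "Humidity": -500.0}
bias = 300.0
row = {"Hour": 0.5, "Temperature": 0.5, "Humidity": 0.25}

prediction = predict_linear(row, weights, bias)
print(prediction)
```

Training a linear regression model amounts to choosing the weights and bias that minimize the loss over the training rows.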


Here's a visual summary of how a linear regression model is structured:

<img src="https://i.imgur.com/iTM2s5k.png" width="480">

However, linear regression doesn't generalize very well when we have a large number of input columns with collinearity, i.e. when the values in one column are highly correlated with values in other column(s). This is because it tries to fit the training data perfectly. 

Regularization addresses this. Ridge Regression, a variant of linear regression, uses a technique called L2 regularization to introduce another loss term that forces the model to generalize better; we train it later in this notebook, after the plain linear model. Learn more about ridge regression here: https://www.youtube.com/watch?v=Q81RR3yKn30
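
The effect of the L2 term is easiest to see with a single feature and no intercept, where the ridge weight has the closed form `sum(x*y) / (sum(x*x) + alpha)`. The sketch below uses made-up data and a hypothetical helper `ridge_weight_1d`; it is an illustration of the shrinkage, not the solver scikit-learn uses.

```python
def ridge_weight_1d(xs, ys, alpha):
    """Closed-form ridge weight for one feature without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Hypothetical data following y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 14.0):
    print(alpha, ridge_weight_1d(xs, ys, alpha))
```

With `alpha=0` the weight equals the ordinary least-squares value of 2; larger values of alpha pull it towards zero.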

Training

>  Create and train a linear regression model using `sklearn.linear_model`.

In [None]:
from sklearn.linear_model import LinearRegression
# Create the model
LR= LinearRegression()
# Fit the model using inputs and targets
LR.fit(train_inputs,train_targets)

**Evaluation**

The model is now trained, and we can use it to generate predictions for the training and validation inputs. We can evaluate the model's performance using the RMSE (root mean squared error) loss function.
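
For reference, both metrics printed by the evaluation helpers can be computed by hand. The sketch below does so in plain Python on made-up counts and predictions (the values and the function names `rmse` and `r2` are assumptions for illustration).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly bike counts and predictions.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(rmse(actual, predicted), r2(actual, predicted))
```

RMSE is expressed in the units of the target (bikes per hour), while R² is unitless and equals 1 for a perfect fit.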

In [None]:
evaluate_train(LR, train_inputs,train_targets)

In [None]:
evaluate_val(LR, val_inputs,val_targets)

In [None]:
# Visualisation of model performance on test set
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(10,8))
plt.plot(10**(LR.predict(val_inputs)))
plt.plot(np.array(10**val_targets), color='red')

Feature Importance for LinearRegression


Let's look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

> **QUESTION 11**: Identify the weights (or coefficients) assigned to the different features by the model.
> 
> *Hint:* Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [None]:
# Coefficients of the linear regression model fitted above
weights = LR.coef_.flatten()

Let's create a dataframe to view the weight assigned to each column.

In [None]:
weights_df = pd.DataFrame({
    'columns': train_inputs.columns,
    'weight': weights
}).sort_values('weight', ascending=False)

In [None]:
# Visualize the Feature Importance on bar plot
plt.title('Feature Importance')
sns.barplot(data=weights_df.head(10), x='weight', y='columns');

## Lasso Regression

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV

lasso=Lasso()
parameters={'alpha':[1e-15,1e-13,1e-10,1e-8,1e-6,1e-4,1e-3, 1e-2,1e-1,1,5,10,20,30,40,50,60,100]}
las_regressor=GridSearchCV(lasso,parameters, scoring= 'neg_mean_squared_error',cv=5)
las_regressor.fit(train_inputs,train_targets)

In [None]:
print('using', las_regressor.best_params_,'the negative mean squared error ', las_regressor.best_score_)

In [None]:
evaluate_train(las_regressor, train_inputs,train_targets)

In [None]:
evaluate_val(las_regressor, val_inputs,val_targets)

In [None]:
# Visualisation of model performance on test set
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(10,8))
plt.plot(10**(las_regressor.predict(val_inputs)))
plt.plot(np.array(10**val_targets), color='red')

## Ridge Regression

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV

rid=Ridge()
parameters={'alpha': [1e-15,1e-13,1e-10,1e-8,1e-6,1e-4,1e-3, 1e-2,1e-1,1,5,10,20,30,40,50,60,100]}
rid_regressor=GridSearchCV(rid,parameters,scoring='neg_mean_squared_error', cv=5)
rid_regressor.fit(train_inputs,train_targets)

In [None]:
print('using', rid_regressor.best_params_,'the negative mean squared error ', rid_regressor.best_score_)

In [None]:
evaluate_train(rid_regressor, train_inputs,train_targets)

In [None]:
evaluate_val(rid_regressor, val_inputs,val_targets)

In [None]:
# Visualisation of model performance on test set
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(10,8))
plt.plot(10**(las_regressor.predict(val_inputs)))
plt.plot(np.array(10**val_targets), color='red')

## ElasticNet Regressor

In [None]:
from sklearn.linear_model import ElasticNet

elastic= ElasticNet()
parameters={'alpha': [1e-15,1e-13,1e-10,1e-8,1e-6,1e-4,1e-3, 1e-2,1e-1,1,5,10,20,30,40,50,60,100], 'l1_ratio': [0.3,0.4,0.5,0.6,0.7,0.8]}            
elastic_reg =GridSearchCV(elastic,parameters,scoring='neg_mean_squared_error', cv=5)
elastic_reg.fit(train_inputs,train_targets)

In [None]:
print('the best values are', elastic_reg.best_params_, 'with negative mean squared error', elastic_reg.best_score_)

In [None]:
evaluate_train(elastic_reg, train_inputs,train_targets)

In [None]:
evaluate_val(elastic_reg, val_inputs,val_targets)

In [None]:
# Visualisation of model performance on test set
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(10,8))
plt.plot(10**(elastic_reg.predict(val_inputs)))
plt.plot(np.array(10**val_targets), color='red')

In [None]:
dat= {  'model':['LinearRegression','Lasso','Ridge','Lasso_CV','Ridge_CV','Elastic Net'],
        'score':[LR_score,las_score,reg_score,las_reg_score, rid_reg_score, elastic_reg_score], 
        'MSE': [LR_MSE,las_MSE,reg_MSE,las_reg_MSE, rid_reg_MSE, elastic_reg_MSE],
        'RMSE': [np.sqrt(LR_MSE),np.sqrt(las_MSE),np.sqrt(reg_MSE),np.sqrt(las_reg_MSE), np.sqrt(rid_reg_MSE),np.sqrt(elastic_reg_MSE)],
        'r2_score' : [LR_r2, las_r2, reg_r2,las_reg_r2, rid_reg_r2, elastic_reg_r2],
        'adjR2': [LR_ra2, las_ra2, reg_ra2,las_reg_ra2, rid_reg_ra2, elastic_reg_ra2]
       }

ff=pd.DataFrame(dat)       

# Model 2 - Training a Support Vector Regressor

* The most important SVR parameter is the kernel type. It can be:
1. linear
2. polynomial  
3. Gaussian (RBF)

The relationship here is non-linear, so we can select a polynomial or Gaussian kernel; here we select the RBF (a Gaussian-type) kernel.
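
For intuition, the RBF kernel scores the similarity of two points as `exp(-gamma * squared distance)`. The sketch below evaluates it in plain Python; the vectors and the value of `gamma` are arbitrary assumptions for illustration, not settings used by the model in this notebook.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Hypothetical scaled feature vectors; gamma is an arbitrary illustrative value.
p = [0.0, 0.0]
q = [1.0, 1.0]
print(rbf_kernel(p, p, gamma=0.5))  # identical points
print(rbf_kernel(p, q, gamma=0.5))  # squared distance 2
```

Similarity is 1 for identical points and decays towards 0 as points move apart, which lets the model fit non-linear relationships.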

In [None]:
from sklearn.svm import SVR
SVR_model = SVR(kernel='rbf')
SVR_model.fit(train_inputs,train_targets)

In [None]:
evaluate_train(SVR_model, train_inputs,train_targets)
evaluate_val(SVR_model, val_inputs,val_targets)

# Model 3 - Training a KNN

In [None]:
from sklearn import neighbors

KNR = neighbors.KNeighborsRegressor()
KNR.fit(train_inputs,train_targets)  #fit the model

evaluate_train(KNR, train_inputs,train_targets)
evaluate_val(KNR, val_inputs,val_targets)

# Model 4 - Training and Visualizing Decision Trees

A decision tree in general parlance represents a hierarchical series of binary decisions:

<img src="https://i.imgur.com/qSH4lqz.png" width="480">

A decision tree in machine learning works in exactly the same way, except that we let the computer figure out the optimal structure & hierarchy of decisions, instead of coming up with criteria manually.
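
To give a feel for how the computer "figures out" a decision, the sketch below searches for the single threshold on one feature that minimizes the squared error of the two resulting groups. It is a simplified, plain-Python illustration on made-up data (the helpers `sse` and `best_split` are hypothetical); a real regression tree repeats a search like this over every feature and every node.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Try each midpoint between sorted x values; return (threshold, total SSE) with the lowest SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: rentals jump once the temperature passes roughly 15 degrees Celsius.
temperature = [2.0, 5.0, 10.0, 20.0, 25.0, 30.0]
rentals = [100.0, 120.0, 110.0, 500.0, 520.0, 510.0]
print(best_split(temperature, rentals))
```

The chosen threshold falls between the two clusters of rentals, which is the binary decision a tree would place at its root for this data.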

Training

We can use `DecisionTreeRegressor` from `sklearn.tree` to train a decision tree.

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Create the model
DRT=DecisionTreeRegressor(random_state=42)
# Fit the model
DRT.fit(train_inputs, train_targets)

Evaluation

Let's evaluate the decision tree using the RMSE and the R² score.

In [None]:
evaluate_train(DRT, train_inputs,train_targets)

The training set score (R²) is close to 100%! But we can't rely solely on the training set score; we must evaluate the model on the validation set too. 

The `evaluate_val` helper makes predictions and computes the RMSE and R² in one step.

In [None]:
evaluate_val(DRT, val_inputs,val_targets)

Although the training accuracy is 100%, the accuracy on the validation set is just about 76%.

Visualization

We can visualize the decision tree _learned_ from the training data.

In [None]:
# Import plot_tree from sklearn and plot the top levels of the fitted decision tree (DRT)
from sklearn.tree import plot_tree
plt.figure(figsize=(80,20))
plot_tree(DRT, feature_names=list(train_inputs.columns), max_depth=2, filled=True);



Let's check the depth of the tree that was created.

In [None]:
DRT.tree_.max_depth

We can also display the tree as text, which can be easier to follow for deeper trees.

In [None]:
# Import export_text from sklearn library
from sklearn.tree import export_text
tree_text = export_text(DRT, max_depth=10, feature_names=list(train_inputs.columns))
print(tree_text[:3000])

Feature Importance


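Based on the impurity reduction achieved by its splits (for a regression tree this is the reduction in squared error, not the Gini index), a decision tree assigns an "importance" value to each feature. These values can be used to interpret the results given by a decision tree.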
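As a small illustration of how these values behave, the sketch below trains a tree on synthetic data where the target depends strongly on the first column, weakly on the second and not at all on the third. The data and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: column 0 matters a lot, column 1 a little, column 2 not at all
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
importances = tree.feature_importances_

# The importances are normalized, so they sum to 1
for name, value in zip(["col_0", "col_1", "col_2"], importances):
    print(f"{name}: {value:.3f}")
print("sum:", importances.sum())
```

The column that drives the target receives most of the importance, which is the pattern to look for when reading the bar plot for the bike data.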

In [None]:
# Feature importances of the fitted decision tree
DRT.feature_importances_

Let's turn this into a dataframe and visualize the most important features.

In [None]:
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': DRT.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
# Visualize the Feature Importance on bar plot
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

# Model 5 - Training a Random Forest

While tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters. This is called a random forest model. 

The key idea here is that each decision tree in the forest will make different kinds of errors, and upon averaging, many of their errors will cancel out. This idea is also commonly known as the "wisdom of the crowd":

<img src="https://i.imgur.com/4Dg0XK4.png" width="480">
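The error-cancelling effect can be simulated without any machine learning library. In the sketch below each "model" is just the true value plus independent random noise, which is a simplifying assumption: real decision trees make correlated errors, so the gain in practice is smaller.

```python
import random
import statistics

random.seed(0)
true_value = 100.0
n_trials = 1000   # number of repeated experiments
n_models = 50     # number of "models" averaged in each experiment

single_errors = []
ensemble_errors = []
for _ in range(n_trials):
    # Each model predicts the true value plus independent noise
    predictions = [true_value + random.gauss(0, 10) for _ in range(n_models)]
    single_errors.append(abs(predictions[0] - true_value))
    ensemble_errors.append(abs(statistics.fmean(predictions) - true_value))

mean_single_error = statistics.fmean(single_errors)
mean_ensemble_error = statistics.fmean(ensemble_errors)
print(f"average error of one model:    {mean_single_error:.2f}")
print(f"average error of the ensemble: {mean_ensemble_error:.2f}")
```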

Training

We can use `RandomForestRegressor` from `sklearn.ensemble` to train a random forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Create the model
RF = RandomForestRegressor(n_jobs=-1, random_state=42)
# Fit the model
RF.fit(train_inputs, train_targets)

`n_jobs=-1` allows the random forest to use multiple parallel workers to train decision trees, and `random_state=42` ensures that we get the same results for each execution.
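A quick way to check the effect of `random_state` is to train the same forest twice. The sketch below uses synthetic data and a hypothetical helper `fit_and_predict`; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

def fit_and_predict(seed):
    """Train a small forest with the given seed and return its predictions."""
    forest = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=seed)
    forest.fit(X_demo, y_demo)
    return forest.predict(X_demo)

# Same seed: the bootstrap samples and splits are identical
same_seed_match = np.allclose(fit_and_predict(42), fit_and_predict(42))
# Different seed: the trees differ, so the predictions differ too
different_seed_match = np.allclose(fit_and_predict(42), fit_and_predict(7))

print("same seed gives same predictions:     ", same_seed_match)
print("different seed gives same predictions:", different_seed_match)
```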

Evaluation

Let's evaluate the random forest on the training and validation sets.

In [None]:
evaluate_train(RF, train_inputs,train_targets)

In [None]:
evaluate_val(RF, val_inputs,val_targets)

Once again, the training accuracy is almost 100%, but this time the validation accuracy is much better. In fact, it is better than the best single decision tree we have trained so far. Do you see the power of random forests?

This general technique of combining the results of many models is called "ensembling". It works because most errors of the individual models cancel out on averaging.



## Hyperparameter Tuning using Optuna

In [None]:
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

def objective(trial):
    # 'squared_error'/'absolute_error' replace the 'mse'/'mae' names removed in scikit-learn 1.2
    criterion = trial.suggest_categorical('criterion', ['squared_error', 'absolute_error'])
    # Use real booleans: the strings 'True' and 'False' are both truthy
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
    max_depth = trial.suggest_int('max_depth', 1, 500)
    # 1.0 means "use all features" (what 'auto' meant for regressors)
    max_features = trial.suggest_categorical('max_features', [1.0, 'sqrt', 'log2'])
    # max_leaf_nodes must be at least 2
    max_leaf_nodes = trial.suggest_int('max_leaf_nodes', 2, 500)
    n_estimators = trial.suggest_int('n_estimators', 30, 1000)

    regr = RandomForestRegressor(bootstrap=bootstrap, criterion=criterion,
                                 max_depth=max_depth, max_features=max_features,
                                 max_leaf_nodes=max_leaf_nodes, n_estimators=n_estimators,
                                 n_jobs=2)

    # Score each trial by its R^2 on the validation set
    regr.fit(train_inputs, train_targets)
    y_pred = regr.predict(val_inputs)
    return r2_score(val_targets, y_pred)
    

In [None]:
#Execute optuna and set hyperparameters
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100,show_progress_bar = True)

In [None]:
#Create an instance with tuned hyperparameters
optimised_rf = RandomForestRegressor(bootstrap = study.best_params['bootstrap'], criterion = study.best_params['criterion'],
                                     max_depth = study.best_params['max_depth'], max_features = study.best_params['max_features'],
                                     max_leaf_nodes = study.best_params['max_leaf_nodes'],n_estimators = study.best_params['n_estimators'],
                                     n_jobs=2)
#learn
optimised_rf.fit(train_inputs, train_targets)


In [None]:
study.best_params

In [None]:
# Call .show() on each figure, otherwise only the last one in the cell is displayed
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_slice(study).show()

In [None]:
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_inputs, train_targets)
base_accuracy = evaluate_val(base_model, val_inputs,val_targets)

In [None]:
optimised_accuracy = evaluate_val(optimised_rf, val_inputs,val_targets)

## Hyperparameter Tuning

Just like decision trees, random forests also have several hyperparameters. In fact many of these hyperparameters are applied to the underlying decision trees. 

Let's study some of the hyperparameters for random forests. You can learn more about them in the scikit-learn documentation.

**RandomSearch**

As the name suggests, the RandomSearch algorithm tries random combinations drawn from a range of values for the given parameters. Numerical parameters can be specified as a range (unlike the fixed values in GridSearch), and you can control the number of random search iterations to perform. It is known to find a very good combination in much less time than GridSearch. However, you have to choose the parameter ranges and the number of iterations carefully, because it can miss the best combination when the iterations are too few or the ranges too narrow.

Let's try `RandomizedSearchCV` here and compare its time and accuracy with GridSearch.
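To get a feel for the time savings, we can count how many model fits each strategy needs. The sketch below uses value lists of the same sizes as the random grid in this section and assumes 3-fold cross-validation with `n_iter=100`. Only the number of values per hyperparameter matters for the count.

```python
import math

# Value lists with the same sizes as the random grid used in this section
random_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': [1.0, 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

# Every combination a full grid search would have to try
n_combinations = math.prod(len(values) for values in random_grid.values())
n_iter, cv_folds = 100, 3

grid_fits = n_combinations * cv_folds   # exhaustive grid search
random_fits = n_iter * cv_folds         # randomized search with n_iter=100

print(n_combinations, grid_fits, random_fits)  # 3960 11880 300
```

With these settings the randomized search trains roughly 40 times fewer models than an exhaustive search over the same grid.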

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

In [None]:
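from sklearn.model_selection import KFold
# Use KFold for regression (StratifiedKFold is designed for classification targets)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Create the random grid
random_grid ={'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': [1.0, 'sqrt'],  # 1.0 = all features (what 'auto' meant for regressors)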
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

RF = RandomForestRegressor(n_jobs=-1, random_state=42)              

# Random search of parameters, using 3 fold cross validation,
random_search = RandomizedSearchCV(estimator = RF, 
                                   param_distributions = random_grid, 
                                   n_iter = 100, cv = 3, verbose=2)
# Fit the random search model
random_search.fit(train_inputs, train_targets)

In [None]:
#We can view the best parameters from fitting the random search
best_random_search= random_search.best_params_

In [None]:
base_model = RandomForestRegressor(n_jobs=-1, random_state=42)
base_model.fit(train_inputs, train_targets)
# Assumes evaluate_val returns the validation score of the model it is given
base_accuracy = evaluate_val(base_model, val_inputs, val_targets)

In [None]:
best_random = random_search.best_estimator_
random_accuracy = evaluate_val(best_random, val_inputs, val_targets)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

**GridSearch**

For a given model, you can define a set of parameter values that you would like to try. The `GridSearchCV` class of scikit-learn then builds models for all possible combinations of the preset hyperparameter values you provide, and the best combination is chosen based on the cross-validation score. There are two disadvantages associated with `GridSearchCV`:
1. It is computationally expensive, because every combination is trained on every fold.
2. It finds nearly optimal rather than perfectly optimal parameters, because only the listed values are tried.
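Here is a minimal sketch of the whole workflow on a deliberately tiny grid, so that it runs in a few seconds. The synthetic data, the grid `demo_grid` and the fold count are assumptions chosen for illustration, not the settings used for the bike data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, ParameterGrid

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=150)

# A tiny grid: 2 x 2 = 4 candidate combinations
demo_grid = {"max_depth": [2, 4], "n_estimators": [10, 20]}
cv = KFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(RandomForestRegressor(random_state=42), demo_grid, cv=cv, scoring="r2")
search.fit(X_demo, y_demo)

n_candidates = len(ParameterGrid(demo_grid))    # every combination is tried
n_fits = n_candidates * cv.get_n_splits()       # one fit per combination and fold

print("candidates:", n_candidates, "cross-validation fits:", n_fits)
print("best parameters:", search.best_params_)
print("best mean CV R^2:", round(search.best_score_, 3))
```

The number of fits grows multiplicatively with every value added to the grid, which is the cost the first disadvantage refers to.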

In [None]:
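from sklearn.model_selection import KFold
# Use KFold for regression (StratifiedKFold is designed for classification targets)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)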

# Create the parameter grid based on the results of random search 
search_grid = {'bootstrap': [True],
              'max_depth': [80, 90, 100, 110],
              'max_features': [2, 3],
              'min_samples_leaf': [3, 4, 5],
              'min_samples_split': [8, 10, 12],
              'n_estimators': [100, 200, 300, 1000] }

RF = RandomForestRegressor(n_jobs=-1, random_state=42)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = RF, 
                           param_grid = search_grid, 
                           cv = kfold, verbose = 2)

# Fit the grid search to the data
grid_search.fit(train_inputs, train_targets)             

In [None]:
# We can view the best parameters from fitting the Grid search
best_grid_search = grid_search.best_params_

In [None]:
# We can view the best estimator from fitting the Grid search
best_grid_estimator = grid_search.best_estimator_

In [None]:
# Evaluate the fitted estimator, not the dictionary of best parameters
grid_accuracy = evaluate_val(best_grid_estimator, val_inputs, val_targets)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))

Training the Best Model with custom Hyperparameters

>  Train a random forest regressor model with the best hyperparameters to minimize the validation loss.

In [None]:
# Create the model with custom hyperparameters
model = RandomForestRegressor(random_state=42, n_jobs=-1,max_depth=20,
                                    max_features=0.7,n_estimators=40)
# Fit the model
model.fit(train_inputs,train_targets)

In [None]:
# Training set
evaluate_train(model, train_inputs,train_targets)

# Validation set
evaluate_val(model, val_inputs,val_targets)

Visualization

We can visualize the individual decision trees _learned_ from the training data.
>We can access individual decision trees using `model.estimators_`

In [None]:
model.estimators_[0]

In [None]:
# Import plot_tree from sklearn library
from sklearn.tree import plot_tree

plt.figure(figsize=(80,20))
plot_tree(model.estimators_[0], max_depth=2, feature_names=list(train_inputs.columns), filled=True, rounded=True);

Feature Importance
Just like a decision tree, a random forest also assigns an "importance" to each feature, by combining the importance values from the individual trees.
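The combination is an average over the trees. The sketch below checks this on a small forest trained on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Collect the importance vector of every tree and average them
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print("averaged over trees:", np.round(averaged, 3))
print("forest attribute:   ", np.round(forest.feature_importances_, 3))
```

Both lines print the same values, so a feature only ranks high for the forest if it is useful across many of its trees.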

In [None]:
# Feature importances of the fitted random forest
model.feature_importances_

Let's turn this into a dataframe and visualize the most important features.

In [None]:
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
# Visualize the Feature Importance on bar plot
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

# Model 6 - Extra Trees Regressor

Training

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
num_trees = 100
# Create the model
ExtraTree = ExtraTreesRegressor(n_estimators=num_trees)
# Fit the model
ExtraTree.fit(train_inputs,train_targets)


Evaluation

In [None]:
evaluate_train(ExtraTree, train_inputs,train_targets)

In [None]:
evaluate_val(ExtraTree, val_inputs,val_targets)

## Hyperparameter Tuning using optuna


In [None]:
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

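def objective(trial):
    # 'squared_error'/'absolute_error' replace the 'mse'/'mae' names removed in scikit-learn 1.2
    criterion = trial.suggest_categorical('criterion', ['squared_error', 'absolute_error'])
    # Use real booleans: the strings 'True' and 'False' are both truthy
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
    max_depth = trial.suggest_int('max_depth', 1, 500)
    # 1.0 means "use all features" (what 'auto' meant for regressors)
    max_features = trial.suggest_categorical('max_features', [1.0, 'sqrt', 'log2'])
    # max_leaf_nodes must be at least 2
    max_leaf_nodes = trial.suggest_int('max_leaf_nodes', 2, 500)
    n_estimators = trial.suggest_int('n_estimators', 30, 1000)

    # ExtraTreesRegressor (plural) is the ensemble imported above
    regr = ExtraTreesRegressor(bootstrap=bootstrap, criterion=criterion,
                               max_depth=max_depth, max_features=max_features,
                               max_leaf_nodes=max_leaf_nodes, n_estimators=n_estimators,
                               n_jobs=2)

    # Score each trial by its R^2 on the validation set
    regr.fit(train_inputs, train_targets)
    y_pred = regr.predict(val_inputs)
    return r2_score(val_targets, y_pred)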
    

In [None]:
#Execute optuna and set hyperparameters
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100,show_progress_bar = True)

In [None]:
#Create an instance with tuned hyperparameters
optimised_Et = ExtraTreesRegressor(bootstrap = study.best_params['bootstrap'], criterion = study.best_params['criterion'],
                                     max_depth = study.best_params['max_depth'], max_features = study.best_params['max_features'],
                                     max_leaf_nodes = study.best_params['max_leaf_nodes'],n_estimators = study.best_params['n_estimators'],
                                     n_jobs=2)
#learn
optimised_Et.fit(train_inputs, train_targets)


In [None]:
study.best_params

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_slice(study)

In [None]:
base_model = ExtraTreesRegressor(random_state=42)
base_model.fit(train_inputs, train_targets)
base_accuracy = evaluate_val(base_model, val_inputs,val_targets)

In [None]:
optimised_accuracy = evaluate_val(optimised_Et, val_inputs,val_targets)

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [{
              'max_depth': [80, 150, 200,250],
              'n_estimators' : [100,150,200,250],
              'max_features': [1.0, "sqrt", "log2"]  # 1.0 = all features (what 'auto' meant for regressors)
            }]
reg = ExtraTreesRegressor(random_state=40)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = reg, param_grid = param_grid, cv = 5, n_jobs = -1 , scoring='r2' , verbose=2)
grid_search.fit(train_inputs, train_targets)

In [None]:
# Tuned parameter set
best_grid_search=grid_search.best_params_

In [None]:
# Best possible parameters for ExtraTreesRegressor
grid_search.best_estimator_

In [None]:
# Evaluate the fitted estimator, not the dictionary of best parameters
grid_accuracy = evaluate_val(grid_search.best_estimator_, val_inputs, val_targets)

# Model 7 - Training a Gradient Boosting Machine (GBM) with XGBoost

Boosting is a type of ensemble learning in which models are trained sequentially rather than separately, with each new model trained to correct the errors of the previous ones. In weight-based variants such as AdaBoost, the samples predicted correctly at each iteration (round) are given a lower weight and the ones predicted wrongly a higher weight, and a weighted average of the models produces the final outcome. Gradient boosting, described below, instead fits each new model to the residuals of the current predictions.


<img src="https://miro.medium.com/max/700/1*PZd-TOxSLV_--3glkFHwxQ.png" width="600">

We're now ready to train our gradient boosting machine (GBM) model. Here's how a GBM model works:

1. The average value of the target column is used as the initial prediction for every input.
2. The residuals (differences) between the predictions and the targets are computed.
3. A decision tree of limited depth is trained to **predict just the residuals** for each input.
4. Predictions from the decision tree are scaled using a parameter called the learning rate (this helps prevent overfitting).
5. Scaled predictions from the tree are added to the previous predictions to obtain the new and improved predictions.
6. Steps 2 to 5 are repeated to create new decision trees, each of which is trained to predict just the residuals from the previous prediction.

The term "gradient" refers to the fact that each decision tree is trained with the purpose of reducing the loss from the previous iteration (similar to gradient descent). The term "boosting" refers to the general technique of training new models to improve the results of an existing model.

Here's a visual representation of gradient boosting:

![](https://miro.medium.com/max/560/1*85QHtH-49U7ozPpmA5cAaw.png)
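The six steps can be sketched in a few lines with plain scikit-learn trees. This is a simplified illustration for squared-error loss on synthetic data, not XGBoost's actual implementation, which adds regularization and other refinements. The names `learning_rate`, `n_rounds` and `boosted_predict` are chosen for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Step 1: start from the average of the targets
initial_prediction = y_demo.mean()
predictions = np.full_like(y_demo, initial_prediction)
initial_mse = np.mean((y_demo - predictions) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y_demo - predictions                      # step 2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                           # step 3
    predictions += learning_rate * tree.predict(X_demo)   # steps 4 and 5
    trees.append(tree)                                    # step 6: repeat

final_mse = np.mean((y_demo - predictions) ** 2)

def boosted_predict(X_new):
    """Initial prediction plus the scaled contributions of all trees."""
    return initial_prediction + learning_rate * sum(tree.predict(X_new) for tree in trees)

print(f"training MSE before boosting: {initial_mse:.4f}")
print(f"training MSE after {n_rounds} rounds: {final_mse:.4f}")
```

A smaller learning rate makes each tree contribute less, so more rounds are needed, which is why `learning_rate` and `n_estimators` are usually tuned together.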



Training

To train a GBM, we can use the `XGBRegressor` class from the [`XGBoost`](https://xgboost.readthedocs.io/en/latest/) library.

In [None]:
# Import XGBRegressor from xgboost 
from xgboost import XGBRegressor

# create the model
model = XGBRegressor(random_state=42, n_jobs=-1)
# Fit the model
model.fit(train_inputs, train_targets)

Evaluation

Note that when using the Learning API you can input and access an evaluation metric, whereas when using the Scikit-learn API you have to calculate it.

In [None]:
evaluate_train(model, train_inputs,train_targets)

In [None]:
evaluate_val(model, val_inputs,val_targets)

## Hyperparameter Tuning using Optuna

In [None]:
import optuna
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def objective(trial):

    param = {
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.01,0.012,0.014,0.016,0.018, 0.02]),
        'n_estimators': 10000,
        'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15,17]),
        'random_state': 42,
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        # Recent XGBoost versions take early_stopping_rounds in the constructor, not in fit()
        'early_stopping_rounds': 100,
    }
    model = xgb.XGBRegressor(**param)

    model.fit(train_inputs, train_targets, eval_set=[(val_inputs, val_targets)], verbose=False)

    # Score each trial on the validation set, not on the training data
    preds = model.predict(val_inputs)
    rmse = mean_squared_error(val_targets, preds) ** 0.5

    return rmse

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30,show_progress_bar = True)

In [None]:
study.best_params

In [None]:
# plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.
optuna.visualization.plot_optimization_history(study).show()
# plot_parallel_coordinate: interactively visualizes the hyperparameters and scores.
optuna.visualization.plot_parallel_coordinate(study).show()
# plot_param_importances: visualizes the parameter importances.
optuna.visualization.plot_param_importances(study).show()

In [None]:
base_model = XGBRegressor(random_state=42)  # XGBRegressor was imported directly from xgboost above
base_model.fit(train_inputs, train_targets)
base_accuracy = evaluate_val(base_model, val_inputs,val_targets)

In [None]:
# Merge into one dict so random_state is never passed twice
tune_model = XGBRegressor(**{**study.best_params, 'random_state': 42})
tune_model.fit(train_inputs, train_targets)
tune_accuracy = evaluate_val(tune_model, val_inputs,val_targets)

## Hyperparameter Tuning and Regularization

Objective function

XGBoost is a great choice in many situations, including regression and classification problems. Based on the problem and how you want your model to learn, you'll choose a different objective function.

The most commonly used are:

* reg:squarederror: regression with squared loss
* reg:logistic: logistic regression, with probabilities as output
* binary:logistic: logistic regression for binary classification, with probabilities as output

`Random Search with Cross Validation`

In [None]:
# Create the random grid
random_grid = { 'max_depth': [3, 5, 6, 10, 15, 20],
                'learning_rate': [0.01, 0.1, 0.2, 0.3],
                'subsample': [0.5,0.6,0.7,0.8,0.9, 1.0],
                'colsample_bytree': [0.4,0.5,0.6,0.7,0.8,0.9, 1.0],
                'colsample_bylevel': [0.4,0.5,0.6,0.7,0.8,0.9, 1.0],
                'n_estimators': [100, 500, 1000] }
                
# Random search of parameters, using 3 fold cross validation,
random_search = RandomizedSearchCV(estimator = model, 
                               param_distributions = random_grid, 
                               n_iter = 100, cv = 3, verbose=2)
# Fit the random search model
random_search.fit(train_inputs, train_targets)

In [None]:
#We can view the best parameters from fitting the random search
best_random_search= random_search.best_params_

In [None]:
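# Baseline XGBoost model with default hyperparameters
base_model = XGBRegressor(n_jobs=-1, random_state=42)
base_model.fit(train_inputs, train_targets)
# Assumes evaluate_val returns the validation score of the model it is given
base_accuracy = evaluate_val(base_model, val_inputs, val_targets)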

In [None]:
best_random = random_search.best_estimator_
random_accuracy = evaluate_val(best_random, val_inputs, val_targets)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

`Grid Search with Cross Validation`

In [None]:
# Create the parameter grid based on the results of random search 
search_grid = { 'max_depth': [3, 5, 6, 10, 15, 20],
                'learning_rate': [0.01, 0.1, 0.2, 0.3],
                'subsample': [0.5,0.6,0.7,0.8,0.9, 1.0],
                'colsample_bytree': [0.4,0.5,0.6,0.7,0.8,0.9, 1.0],
                'colsample_bylevel': [0.4,0.5,0.6,0.7,0.8,0.9, 1.0],
                'n_estimators': [100, 500, 1000] }

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = model, 
                           param_grid = search_grid, 
                           cv = 3, verbose = 2)

# Fit the grid search to the data
grid_search.fit(train_inputs, train_targets)

In [None]:
# We can view the best parameters from fitting the Grid search
best_grid_search = grid_search.best_params_

In [None]:
# We can view the best estimator from fitting the Grid search
best_grid_estimator = grid_search.best_estimator_

In [None]:
# Evaluate the fitted estimator, not the dictionary of best parameters
grid_accuracy = evaluate_val(best_grid_estimator, val_inputs, val_targets)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))

Training the Best Model with custom Hyperparameters

>  Train an XGBoost regressor model with the best hyperparameters to minimize the validation loss.

In [None]:
model = XGBRegressor(n_jobs=-1, random_state=42, n_estimators=1700, 
                     learning_rate=0.3, max_depth=7, subsample=0.9, 
                     colsample_bytree=0.7)
# Fit the model
model.fit(train_inputs,train_targets)

In [None]:
# Training set
evaluate_train(model, train_inputs,train_targets)

# Validation set
evaluate_val(model, val_inputs,val_targets)

# Model 8 - LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel, distributed, and GPU learning.
* Capable of handling large-scale data.

Training

To train a LightGBM model, we can use the `LGBMRegressor` class from `lightgbm`.

In [None]:
import lightgbm 
# create the model
LGBM=lightgbm.LGBMRegressor(random_state=42)
# Fit the model
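# Passing eval_set records validation metrics during training (needed later by plot_metric)
LGBM.fit(train_inputs, train_targets, eval_set=[(val_inputs, val_targets)])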

Evaluation

Let's evaluate the predictions using score.

In [None]:
evaluate_train(LGBM, train_inputs,train_targets)

In [None]:
evaluate_val(LGBM, val_inputs,val_targets)

Visualization

We can visualize the LightGBM model _learned_ from the training data.

In [None]:
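lightgbm.plot_importance(LGBM)

In [None]:
lightgbm.plot_tree(LGBM, figsize=(80,40))

In [None]:
# plot_metric needs evaluation results, so the model must have been fitted with an eval_set
lightgbm.plot_metric(LGBM)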

## Hyperparameter Tuning with optuna

In [None]:
import lightgbm
import optuna
from optuna import Trial, visualization
from optuna.samplers import TPESampler
from sklearn.metrics import mean_squared_error

def objective(trial):

    param = {
        'metric': 'rmse',
        'random_state': 42,
        'n_estimators': 10000,
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.006,0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [10,20,100]),
        'num_leaves': trial.suggest_int('num_leaves', 2, 1000),  # LightGBM needs at least 2 leaves
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 300),
        # The trial name matches the parameter so that study.best_params can be reused directly
        'cat_smooth': trial.suggest_int('cat_smooth', 1, 100)
    }

    model = lightgbm.LGBMRegressor(**param)
    # Early stopping is passed as a callback (the fit() keyword was removed in LightGBM 4.0)
    model.fit(train_inputs, train_targets, eval_set=[(val_inputs, val_targets)],
              callbacks=[lightgbm.early_stopping(stopping_rounds=100, verbose=False)])

    preds = model.predict(val_inputs)
    rmse = mean_squared_error(val_targets, preds) ** 0.5

    return rmse

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10,show_progress_bar = True)

Let's do some quick visualization to analyze the hyperparameter optimization.
* Optuna provides various visualization features in `optuna.visualization` to analyze optimization results visually.

In [None]:
# plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.
optuna.visualization.plot_optimization_history(study)

In [None]:
#plot_parallel_coordinate: interactively visualizes the hyperparameters and scores
optuna.visualization.plot_parallel_coordinate(study)

In [None]:
base_model = lightgbm.LGBMRegressor(random_state=42)
base_model.fit(train_inputs, train_targets)
base_accuracy = evaluate_val(base_model, val_inputs,val_targets)

In [None]:
tune_model = lightgbm.LGBMRegressor(**study.best_params, random_state=42)
tune_model.fit(train_inputs, train_targets)
tune_accuracy = evaluate_val(tune_model, val_inputs,val_targets)

# Model 9 - CatBoost

Training

We can install CatBoost using the following command:


In [None]:
pip install catboost

In [None]:
from catboost import CatBoostRegressor
import catboost as cb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.inspection import permutation_importance

Pool Object
* The `Pool` class in CatBoost combines the independent and dependent variables (X and y), as well as the categorical features.
* We pass the Pool object as training data to the `fit()` method.
* We don't need to define the `cat_features` parameter separately when constructing the model, since the Pool object already has these details.

We will create Pool objects using the code below.

In [None]:
train_dataset = cb.Pool(train_inputs, train_targets) 
test_dataset = cb.Pool(val_inputs, val_targets)

In [None]:
model = cb.CatBoostRegressor(loss_function="RMSE")

In [None]:
grid = {'iterations': [100, 150, 200,300,400,500],
        'learning_rate': [0.03, 0.1],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}
model.grid_search(grid, train_dataset)

Evaluation

Let's evaluate the predictions using score.

In [None]:
evaluate_train(model, train_inputs,train_targets)

In [None]:
evaluate_val(model, val_inputs,val_targets)

## Hyperparameter Tuning 

In [None]:
model1 = cb.CatBoostRegressor(loss_function="RMSE")
grid={'depth': [8],
  'iterations': [500],
  'l2_leaf_reg': [0.5],
  'learning_rate': [0.1]}
model1.grid_search(grid, train_dataset)

Let's tune it further.

In [None]:
model1 = cb.CatBoostRegressor(loss_function="RMSE")
grid={'depth': [8],
  'iterations': [1000],
  'l2_leaf_reg': [0.01],
  'learning_rate': [0.1]}
model1.grid_search(grid, train_dataset)

In [None]:
# Evaluate the re-tuned model (model1)
evaluate_train(model1, train_inputs,train_targets)

In [None]:
evaluate_val(model1, val_inputs,val_targets)

## Hyperparameter Tuning with optuna

In [None]:
import optuna
from optuna import Trial, visualization
from optuna.samplers import TPESampler

def objective(trial):

    param = {
        'loss_function': 'RMSE',
        'task_type': 'GPU',  # requires a GPU runtime; use 'CPU' otherwise
        # suggest_float replaces the deprecated suggest_loguniform and suggest_uniform
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-3, 10.0, log=True),
        'max_bin': trial.suggest_int('max_bin', 200, 400),
        #'rsm': trial.suggest_float('rsm', 0.3, 1.0),
        'subsample': trial.suggest_float('subsample', 0.4, 1.0),
        'learning_rate': trial.suggest_float('learning_rate', 0.006, 0.018),
        'n_estimators':  25000,
        'max_depth': trial.suggest_categorical('max_depth', [5,7,9,11,13,15]),
        'random_state': 2020,
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 300),
    }
    model = CatBoostRegressor(**param)

    model.fit(train_inputs,train_targets,eval_set=[(val_inputs,val_targets)],early_stopping_rounds=200,verbose=False)

    preds = model.predict(val_inputs)

    # RMSE on the validation set
    rmse = mean_squared_error(val_targets, preds) ** 0.5

    return rmse

In [None]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)

In [None]:
# plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.
optuna.visualization.plot_optimization_history(study)

# Conclusion
In total, we trained 9 models:
* Model 1 - LINEAR REGRESSION MODEL
* Model 2 - SUPPORT VECTOR REGRESSION MODEL
* Model 3 - KNN MODEL
* Model 4 - DECISION TREE MODEL
* Model 5 - RANDOM FOREST MODEL
* Model 6 - EXTRA TREES MODEL
* Model 7 - GRADIENT BOOSTING MACHINES (XGBOOST) MODEL
* Model 8 - LIGHTGBM MODEL
* Model 9 - CATBOOST MODEL