<table align="center" width=100%>
    <tr>
        <td width="25%">
            <img src="https://monophy.com/media/3o7aD3LftJ423GBsVG/monophy.gif">
        </td>
        <td>
            <div align="center">
                <font color="#0B2F02" size=24px>
                    <b>Carbon Dioxide Emissions
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

# Problem Statement  🚚🏭

**To predict the Carbon Dioxide emissions from a vehicle in Canada depending on the fuel consumption and other describing features of a vehicle.**

<table>
    <tr>
        <td>
            <img src="https://i.gifer.com/6FR.gif">
        </td>
    </tr>
</table>

# Data Dictionary

1. **Make**  → Company of the vehicle
2. **Model**  → Car model
3. **Vehicle_Class**  → Class of vehicle depending on their utility, capacity and weight
4. **Engine_Size**  → Engine size in litres
5. **Cylinders**  → Number of cylinders
6. **Transmission**  → Transmission type with number of gears
7. **Fuel_Type**  → Type of Fuel used
8. **Fuel_Consumption_City**  → Fuel consumption in city roads (L/100 km) 
9. **Fuel_Consumption_Hwy**  → Fuel consumption on highways (L/100 km)
10. **Fuel_Consumption_Comb**  → Combined fuel consumption (55% city, 45% highway) in L/100 km
11. **Fuel_Consumption_Comb1**   → Combined city and highway fuel consumption in miles per gallon (mpg)
12. **CO2_Emissions**   → The tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving (Target/dependent variable)
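
As a quick worked example of items 10 and 11, the sketch below computes a combined rating from city and highway figures using the 55/45 weighting and converts it to miles per gallon. The function names and the sample vehicle are made up for illustration. Whether the dataset's mpg column uses imperial or US gallons is an assumption here, so both conversions are shown.

```python
# Illustrative helpers (not part of the notebook): combined rating and mpg conversion
KM_PER_MILE = 1.609344
LITRES_PER_IMPERIAL_GALLON = 4.54609
LITRES_PER_US_GALLON = 3.785411784


def combined_l_per_100km(city, hwy):
    """Weighted combined rating: 55% city, 45% highway (both in L/100 km)."""
    return 0.55 * city + 0.45 * hwy


def l_per_100km_to_mpg(l_per_100km, litres_per_gallon=LITRES_PER_IMPERIAL_GALLON):
    """Convert L/100 km to miles per gallon for the given gallon size."""
    miles_per_100km = 100 / KM_PER_MILE
    gallons_per_100km = l_per_100km / litres_per_gallon
    return miles_per_100km / gallons_per_100km


# hypothetical vehicle: 9.9 L/100 km in the city, 6.7 L/100 km on the highway
comb = combined_l_per_100km(city=9.9, hwy=6.7)
mpg_imperial = l_per_100km_to_mpg(comb)
mpg_us = l_per_100km_to_mpg(comb, LITRES_PER_US_GALLON)
print(round(comb, 2), round(mpg_imperial, 1), round(mpg_us, 1))
```

Because mpg is inversely proportional to L/100 km, the two combined columns carry the same information, which is why they turn out to be strongly negatively correlated later on.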

## Table of Contents

1. **[Import Libraries](#import_lib)**
2. **[Set Options](#set_options)**
3. **[Read Data](#Read_Data)**
4. **[Exploratory Data Analysis](#data_preparation)**
    - 4.1 - [Preparing the Dataset](#Data_Preparing)
        - 4.1.1 - [Data Dimension](#Data_Shape)
        - 4.1.2 - [Data Types](#Data_Types)
        - 4.1.3 - [Missing Values](#Missing_Values)
        - 4.1.4 - [Duplicate Data](#duplicate)
        - 4.1.5 - [Indexing](#indexing)
        - 4.1.6 - [Final Dataset](#final_dataset)
    - 4.2 - [Understanding the Dataset](#Data_Understanding)
        - 4.2.1 - [Summary Statistics](#Summary_Statistics)
        - 4.2.2 - [Correlation](#correlation)
        - 4.2.3 - [Analyze Categorical Variables](#analyze_cat_var)
        - 4.2.4 - [Analyze Target Variable](#analyze_tar_var)
        - 4.2.5 - [Analyze Relationship Between Target and Independent Variables](#analyze_tar_ind_var)
        - 4.2.6 - [Feature Engineering](#feature_eng)
5. **[Data Pre-Processing](#data_pre)**
    - 5.1 - [Outliers](#out)
        - 5.1.1 - [Discovery of Outliers](#dis_out)
        - 5.1.2 - [Removal of Outliers](#rem_out)
        - 5.1.3 - [Rechecking of Correlation](#rec_cor)
    - 5.2 - [Categorical Encoding](#cat_enc)
6. **[Building Multiple Linear Regression Models](#bui_mlr_mod)**
    - 6.1 - [Multiple Linear Regression - Basic Model](#bas_mod)
    - 6.2 - [Feature Transformation](#fea_tra)
    - 6.3 - [Feature Scaling](#fea_sca)
    - 6.4 - [Multiple Linear Regression - Full Model - After Feature Scaling](#mod_aft_sca)
    - 6.5 - [Assumptions Before Multiple Linear Regression Model](#ass_bef)
        - 6.5.1 - [Assumption #1: If Target Variable is Numeric](#tgt_num)
        - 6.5.2 - [Assumption #2: Presence of Multi-Collinearity](#pre_mul_col)
    - 6.6 - [Multiple Linear Regression - Full Model - After PCA](#mod_pca)
    - 6.7 - [Feature Selection](#fea_sel)
        - 6.7.1 - [Forward Selection](#for_sel)
        - 6.7.2 - [Backward Elimination](#bac_eli)
    - 6.8 - [Multiple Linear Regression - Full Model - After Feature Selection](#mod_fea_sel)
    - 6.9 - [Assumptions After Multiple Linear Regression Model](#ass_aft)
        - 6.9.1 - [Assumption #1: Linear Relationship Between Dependent and Independent Variable](#lr_dep_ind)
        - 6.9.2 - [Assumption #2: Checking for Autocorrelation](#che_aut_cor)
        - 6.9.3 - [Assumption #3: Checking for Heteroskedasticity](#che_het)
        - 6.9.4 - [Assumption #4: Test for Normality](#tes_nor)
            - 6.9.4.1 - [Q-Q Plot](#qq_plt)
            - 6.9.4.2 - [Shapiro Wilk Test](#sha_wil_tes)
7. **[Model Evaluation](#mod_eva)**
    - 7.1 - [Measures of Variation](#mea_var)
    - 7.2 - [Inferences about Intercept and Slope](#inf_int_slo)
    - 7.3 - [Confidence Interval for Intercept and Slope](#con_int_slo)
    - 7.4 - [Compare Regression Results](#com_reg_res)
8. **[Model Performance](#mod_per)**
    - 8.1 - [Mean Square Error(MSE)](#mse)
    - 8.2 - [Root Mean Squared Error(RMSE)](#rmse)
    - 8.3 - [Mean Absolute Error(MAE)](#mae)
    - 8.4 - [Mean Absolute Percentage Error(MAPE)](#mape)
    - 8.5 - [Resultant Table](#res_tab)
9. **[Model Optimization](#mod_opt)**
    - 9.1 - [Bias](#bias)
    - 9.2 - [Variance](#var)
    - 9.3 - [Model Validation](#mod_val)
      - 9.3.1 - [Cross Validation](#cro_val)
      - 9.3.2 - [Leave One Out Cross Validation(LOOCV)](#loocv)
    - 9.4 - [Gradient Descent](#gra_des)
    - 9.5 - [Regularization](#reg)
      - 9.5.1 - [Ridge Regression Model](#ridge)
      - 9.5.2 - [Lasso Regression Model](#lasso)
      - 9.5.3 - [Elastic Net Regression Model](#ela_net)
      - 9.5.4 - [Grid Search CV](#gri_sea)
10. **[Displaying Score Summary](#dis_sco_sum)**
11. **[Conclusion](#conclu)**
12. **[Deployment](#deploy)**
13. **[References](#Refer)**

# 1. Import Libraries <a id='import_lib'></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from warnings import filterwarnings
filterwarnings('ignore')
%matplotlib inline

# statsmodels: OLS models, diagnostics and plots
import statsmodels
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.anova import anova_lm
from statsmodels.formula.api import ols
from statsmodels.tools.eval_measures import rmse

# scipy: statistical tests
from scipy import stats
from scipy.stats import shapiro

# scikit-learn: splitting, scaling and PCA
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# 'metrics' from sklearn is used for evaluating the model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error

# functions to perform feature selection
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.feature_selection import RFE

# linear, SGD, ridge, lasso and elastic net regression
from sklearn import linear_model
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet

# functions to perform cross validation and grid search
from sklearn.model_selection import LeaveOneOut, cross_val_score, KFold, GridSearchCV

# 2. Set Options <a id='set_options'></a>

In [None]:
# display all columns of the dataframe
pd.options.display.max_columns = None
# display all rows of the dataframe
pd.options.display.max_rows = None
# return an output value upto 6 decimals
pd.options.display.float_format = '{:.6f}'.format

# 3. Read Data <a id='Read_Data'></a>

In [None]:
# read csv file using pandas
data = pd.read_csv('../input/co2-emissions-cannada/CO2 Emissions_Canada.csv')

# display the top 5 rows of the dataframe
data.head()

In [None]:
data.info()

# 4. Exploratory Data Analysis <a id='data_preparation'></a>

## 4.1 Preparing the Dataset <a id='Data_Preparing'></a>

### 4.1.1 Data Dimensions <a id='Data_Shape'></a>

In [None]:
# shape returns the dimension of the data
data.shape

In this dataset I have 7384 records across 12 features

### 4.1.2 Data Types <a id='Data_Types'></a>

In [None]:
data.dtypes

The dataset contains **5 object columns, 3 int columns and 4 float columns**

### 4.1.3 Missing Values <a id='Missing_Values'></a>

In [None]:
missing_value = pd.DataFrame({
    'Missing Value': data.isnull().sum(),
    'Percentage': (data.isnull().sum() / len(data))*100
})

In [None]:
missing_value.sort_values(by='Percentage', ascending=False)

There are **no missing values** present in this dataset

**Visualising missing values using Heatmap**

In [None]:
# set the figure size
plt.figure(figsize=(15, 8))

# plot heatmap to check null values
# isnull(): returns 'True' for a missing value
# cbar=False: do not draw the colorbar
sns.heatmap(data.isnull(), cbar=False)

# display the plot
plt.show()

**Visual proof that there are no missing values**

### 4.1.4 Duplicate Data <a id='duplicate'></a>

In [None]:
duplicate = data.duplicated().sum()
print('There are {} duplicated rows in the data'.format(duplicate))

**Getting rid of duplicate data**

In [None]:
data.drop_duplicates(inplace=True)

**Checking for duplicate data after removal of duplicates**

In [None]:
duplicate = data.duplicated().sum()
print('There are {} duplicated rows in the data'.format(duplicate))

### 4.1.5 Indexing <a id='indexing'></a>

In [None]:
data.shape

There are **6281 records** after dropping duplicates

In [None]:
data.tail()

**The last 5 index values range from 7379 to 7383, but only 6281 records remain, so the index needs to be reset**

In [None]:
data.reset_index(inplace=True)

In [None]:
data.tail()

**The indexes have been reset but a new column 'index' is created which needs to be dropped**

In [None]:
data.drop(['index'],inplace=True,axis=1)
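
The reset-then-drop steps above can also be done in a single call with `reset_index(drop=True)`. A minimal sketch on a toy frame (the `toy` DataFrame is made up for illustration and stands in for the real data):

```python
import pandas as pd

# Toy frame with one duplicated row
toy = pd.DataFrame({"Make": ["FORD", "FORD", "BMW"], "Cylinders": [6, 6, 4]})

# Drop duplicates, then rebuild a 0..n-1 index without keeping the old one as a column
toy = toy.drop_duplicates().reset_index(drop=True)

print(toy.index.tolist())
print(toy.columns.tolist())
```

With `drop=True` no extra `index` column is created, so the separate `drop` step is not needed.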

### 4.1.6 Final Dataset <a id='final_dataset'></a>

In [None]:
data.head()

In [None]:
data.shape

The final dataset has **6281 records and 12 features with no missing or duplicate values**

## 4.2 Understanding the Dataset <a id='Data_Understanding'></a>

### 4.2.1 Summary Statistics <a id='Summary_Statistics'></a>

**Numeric Variables**

In [None]:
data.describe(include=np.number)

<br>Inferences:</br>
<br>1. The average amount of CO2 emitted from cars is 251 g/km</br>
<br>2. At least 4 litres of fuel per 100 km is consumed, whether the car is on city roads or on the highway</br>
<br>3. About 75% of the cars have 6 or fewer cylinders</br>
<br>4. The amount of fuel consumed by cars on city roads is comparatively greater than on the highway</br>

**Categorical Variables**

In [None]:
data.describe(include = object)

<br>Inferences:</br>
<br>    1. There are a total of 42 different car companies with 2053 different car models</br>
<br>    2. Vehicles are divided into 16 different classes, with SUV-Small occurring most often</br>
<br>    3. 4 different types of fuel used by cars have been identified, and fuel X is the most common</br>
<br>    4. AS6 is the most common transmission</br>

### 4.2.2 Correlation <a id='correlation'></a>

In [None]:
# select the numerical features in the dataset using 'select_dtypes()'
# select_dtypes(include=np.number): considers the numeric variables
data_num_features = data.select_dtypes(include=np.number)

# print the names of the numeric variables 
print('The numerical columns in the dataset are: ',data_num_features.columns)

In [None]:
# generate the correlation matrix
corr =  data_num_features.corr()

# print the correlation matrix
corr

In [None]:
plt.figure(figsize=(20,10))
corr =data_num_features.corr(method='pearson')
sns.heatmap(corr, annot=True,cmap='tab20c')
plt.show()

<br>Inferences:</br>
<br>    1. Fuel_Consumption_Comb1 has a high negative correlation (< -0.9) with CO2_Emissions, Fuel_Consumption_Comb and Fuel_Consumption_City</br>
<br>    2. CO2_Emissions has a high positive correlation (> 0.9) with Fuel_Consumption_Comb and Fuel_Consumption_City</br>

### 4.2.3 Analyse Categorical Variables <a id='analyze_cat_var'></a>

In [None]:
# select the categorical variables
# include='object': selects the categorical features
data_cat_features = data.select_dtypes(include='object')

# plot the ten most frequent categories of each categorical variable
# 'figsize' sets the figure size
for variable in data_cat_features:

    cat_count = data[variable].value_counts()
    cat_count10 = cat_count.head(10)
    plt.figure(figsize=(10,5))
    sns.barplot(x=cat_count10.values, y=cat_count10.index, alpha=0.8)
    if cat_count.size > 10:
        plt.title('Top 10 {}'.format(variable))
    else:
        plt.title(variable)
    plt.ylabel('{}'.format(variable), fontsize=12)
    plt.xlabel('Number of Cars', fontsize=12)
    # avoid clipping of long category labels using tight_layout()
    plt.tight_layout()
    # display the plot
    plt.show()

<br>Inferences from each Plot:</br>
<br>    1. Top 10 Make: Ford is the most common make in the data</br>
<br>    2. Top 10 Model: The F-150 FFV is amongst the most common models</br>
<br>    3. Top 10 Vehicle_Class: SUV-Small is the most common class of vehicle</br>
<br>    4. Top 10 Transmission: More than 1000 cars have AS6 and AS8 transmission types</br>
<br>    5. Fuel Type: The majority of the cars use Fuel type X and Z</br>

### 4.2.4 Analyse Target Variable <a id='analyze_tar_var'></a>

In [None]:
# histplot replaces the deprecated distplot
sns.histplot(data['CO2_Emissions'], bins=30, kde=True)
plt.xlabel('Carbon Dioxide Emission (30 bins)')
plt.show()

From the above histogram, I can see that CO2_Emissions is moderately positively skewed

In [None]:
mean = data['CO2_Emissions'].mean()

# calculate the mode
mode = data['CO2_Emissions'].mode()

# calculate the median
median = data['CO2_Emissions'].median()

print('Mean for CO2 Emission is ',mean)
print('Median for CO2 Emission is ',median)
print('Mode for CO2 Emission is ',mode)

CO2_Emissions is bi-modal in nature

In [None]:
# create three subplots in a single figure
# passing (1, 3) to subplots() defines one row with three axes
# sharey=True: the subplots share the same y-axis
fig, axes = plt.subplots(1,3, sharey=True, figsize=(15,8))

# create a boxplot
# orient="v": create a vertical plot
# ax = axes: axes object to draw plot
# I use axes[0] to use the first axes for plotting
sns.boxplot(y=data['CO2_Emissions'], orient="v", ax = axes[0])

# create a violinplot
# orient="v": create a vertical plot
# ax = axes: axes object to draw plot
# I use axes[1] to use the second axes for plotting
sns.violinplot(y=data['CO2_Emissions'], orient="v", ax = axes[1]);

# add a value of mode in the empty subplot
# fontsize: font size of the text
plt.text(0.1, 200, "Mode = 221/246", fontsize=12)

# add a value of median in the empty subplot
# fontsize: font size of the text
plt.text(0.1, 300, "Median = 246", fontsize=12)

# add a value of mean in the empty subplot
# fontsize: font size of the text
plt.text(0.1, 400, "Mean = 251.16", fontsize=12)

# add the result in the empty subplot
# fontsize: font size of the text
plt.text(0.1, 100, "Mode < Median < Mean", fontsize=12)

# remove the axis for the third subplot
plt.axis("off")

# show the plot
plt.show()

Of the three statistics, the mean is the largest and the mode is the smallest (mode < median < mean), so CO2_Emissions is positively skewed. This implies that most vehicles emit less CO2 than the average.
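
The ordering mode < median < mean can be cross-checked with the skewness coefficient. A minimal sketch on a made-up right-skewed sample (the values are illustrative and are not taken from the dataset):

```python
import pandas as pd

# Made-up right-skewed sample: most values are low, a few are very high
sample = pd.Series([180, 200, 200, 210, 220, 230, 250, 300, 400, 520])

mean = sample.mean()
median = sample.median()
mode = sample.mode().iloc[0]   # mode() can return several values; take the first
skewness = sample.skew()       # a value above 0 indicates a longer right tail

print(mode, median, mean, round(skewness, 2))
```

On the real data, `data['CO2_Emissions'].skew()` gives the same kind of check in one line.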

### 4.2.5 Analyse Relationship between Target and Independent Variables <a id='analyze_tar_ind_var'></a>

In [None]:
make_co2 = data.groupby('Make')['CO2_Emissions'].mean().sort_values(ascending=False).head(10)
model_co2 = data.groupby('Model')['CO2_Emissions'].mean().sort_values(ascending=False).head(10)
vehicle_class_co2 = data.groupby('Vehicle_Class')['CO2_Emissions'].mean().sort_values(ascending=False).head(10)
transmission_co2 = data.groupby('Transmission')['CO2_Emissions'].mean().sort_values(ascending=False).head(10)
fuel_type_co2 = data.groupby('Fuel_Type')['CO2_Emissions'].mean().sort_values(ascending=False).head()

In [None]:
fig, axes = plt.subplots(5,1, figsize=(15,20))
fig.suptitle('Average of Categorical Variables vs CO2 Emissions')

sns.barplot(ax=axes[0],x = make_co2.values,y = make_co2.index)
axes[0].set_title('CO2 Emissions v/s Make')

sns.barplot(ax=axes[1],x = model_co2.values,y = model_co2.index)
axes[1].set_title('CO2 Emissions v/s Model')

sns.barplot(ax=axes[2],x = vehicle_class_co2.values,y = vehicle_class_co2.index)
axes[2].set_title('CO2 Emissions v/s Vehicle_Class')

sns.barplot(ax=axes[3],x = transmission_co2.values,y = transmission_co2.index)
axes[3].set_title('CO2 Emissions v/s Transmission')

sns.barplot(ax=axes[4], x=fuel_type_co2.values,y=fuel_type_co2.index)
axes[4].set_title('CO2 Emissions v/s Fuel Type')

<br>Inferences from each Plot:</br>
<br>    1. CO2 Emissions v/s Make: While Ford is the most common make, it is Bugatti that emits the most CO2 per car</br>
<br>    2. CO2 Emissions v/s Model: The Bugatti Chiron is amongst the highest CO2 emitting car models</br>
<br>    3. CO2 Emissions v/s Vehicle_Class: Heavy vehicles such as vans, SUVs and pick-up trucks are amongst the top emitters of CO2</br>
<br>    4. CO2 Emissions v/s Transmission: Most of the highest-emitting transmission types are automatic</br>
<br>    5. CO2 Emissions v/s Fuel_Type: Cars using Fuel Type E emit the most CO2 on average</br>

**Let's check the relationship between Cylinders and CO2 Emissions**

In [None]:
# plot the scatter plot
# use 'hue' to add 3rd variable in the scatter plot
plt.rcParams["figure.figsize"] = (15,10)
sns.scatterplot(x='CO2_Emissions', y='Cylinders', data=data, hue='Fuel_Type')

# set label for x-axis
plt.xlabel("CO2 Emissions", fontsize=20)

# set label for y-axis
plt.ylabel("Cylinders", fontsize=20)

# set title
plt.title("Scatter Plot", fontsize=20)

# display the plot
plt.show()

<br>From the above scatter plot I can see that:</br>
<br>    1. As the number of cylinders increases, the CO2 emissions increase</br>
<br>    2. Cars with 8 or fewer cylinders mostly use Fuel Type X, which results in lower CO2 emissions</br>
<br>    3. Fuel Type Z results in more CO2 emissions than the other fuel types</br>

In [None]:
plt.figure(figsize=(10,5))
sns.pairplot(data,kind="reg")
plt.show()

Inferences:
    1. Fuel_Consumption_Comb1 shows a negative relationship with all the other numerical variables
    2. Fuel_Consumption_City and Fuel_Consumption_Hwy are strongly positively related

### 4.2.6 Feature Engineering <a id='feature_eng'></a>

**Create a new feature Make_Type by combining various car companies(Make) on the basis of their functionality**

**There are 42 unique Car Companies. I will divide these companies into Luxury, Sports, Premium and General cars**

In [None]:
data['Make_Type'] = data['Make'].replace(['BUGATTI', 'PORSCHE', 'MASERATI', 'ASTON MARTIN', 'LAMBORGHINI',
                                                       'JAGUAR','SRT'],
                                                      'Sports')

In [None]:
data['Make_Type'] = data['Make_Type'].replace(['ALFA ROMEO', 'AUDI', 'BMW', 'BUICK',
                                                         'CADILLAC', 'CHRYSLER', 'DODGE', 'GMC',
                                                         'INFINITI', 'JEEP', 'LAND ROVER', 'LEXUS', 'MERCEDES-BENZ',
                                                         'MINI', 'SMART', 'VOLVO'],
                                                         'Premium')

In [None]:
data['Make_Type'] = data['Make_Type'].replace(['ACURA', 'BENTLEY', 'LINCOLN', 'ROLLS-ROYCE',
                                                         'GENESIS'],
                                                         'Luxury')

In [None]:
data['Make_Type'] = data['Make_Type'].replace(['CHEVROLET', 'FIAT', 'FORD', 'KIA',
                                                         'HONDA', 'HYUNDAI', 'MAZDA', 'MITSUBISHI',
                                                         'NISSAN', 'RAM', 'SCION', 'SUBARU', 'TOYOTA',
                                                         'VOLKSWAGEN'],
                                                         'General')

In [None]:
data['Make_Type'].unique()

In [None]:
data['Make_Type'].value_counts()

In [None]:
#Drop Make column
data = data.drop(['Make'], axis=1)

In [None]:
data.head()

In [None]:
# set figure size
plt.figure(figsize=(15,8))

# boxplot of CO2_Emissions against Make_Type
# x: specifies the data on x axis
# y: specifies the data on y axis
# data: specifies the dataframe to be used
ax = sns.boxplot(x="Make_Type", y="CO2_Emissions", data=data)

# rotate labels using set_ticklabels
# labels: specify the tick labels to be used
# rotation: the angle by which tick labels should be rotated
ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=90)

# show the plot
plt.show()

The plot shows that Sports cars and Luxury cars emit more CO2 compared to Premium and General use cars

**Create a new feature Vehicle_Class_Type by combining various Vehicle_Class on the basis of their size**

**There are 16 unique Vehicle Classes. I will divide them into Hatchback, Sedan, SUV and Truck**

In [None]:
data['Vehicle_Class_Type'] = data['Vehicle_Class'].replace(['COMPACT', 'MINICOMPACT', 'SUBCOMPACT'],
                                                      'Hatchback')

In [None]:
data['Vehicle_Class_Type'] = data['Vehicle_Class_Type'].replace(['MID-SIZE', 'TWO-SEATER', 'FULL-SIZE', 'STATION WAGON - SMALL',
                                                         'STATION WAGON - MID-SIZE'],
                                                         'Sedan')

In [None]:
data['Vehicle_Class_Type'] = data['Vehicle_Class_Type'].replace(['SUV - SMALL', 'SUV - STANDARD', 'MINIVAN'],
                                                         'SUV')

In [None]:
data['Vehicle_Class_Type'] = data['Vehicle_Class_Type'].replace(['VAN - CARGO', 'VAN - PASSENGER', 'PICKUP TRUCK - STANDARD', 'SPECIAL PURPOSE VEHICLE',
                                                         'PICKUP TRUCK - SMALL'],
                                                         'Truck')

In [None]:
# check the unique values of the Vehicle_Class_Type column
data['Vehicle_Class_Type'].unique()

In [None]:
data['Vehicle_Class_Type'].value_counts()

In [None]:
#Drop Vehicle_Class column
data = data.drop(['Vehicle_Class'], axis=1)

In [None]:
data.head()

In [None]:
# set figure size
plt.figure(figsize=(15,8))

# boxplot of CO2_Emissions against Vehicle_Class_Type
# x: specifies the data on x axis
# y: specifies the data on y axis
# data: specifies the dataframe to be used
ax = sns.boxplot(x="Vehicle_Class_Type", y="CO2_Emissions", data=data)

# rotate labels using set_ticklabels
# labels: specify the tick labels to be used
# rotation: the angle by which tick labels should be rotated
ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=90)

# show the plot
plt.show()

The plot shows that the bigger the cars are the more CO2 they emit

# 5. Data Preprocessing <a id='data_pre'></a>

In [None]:
data.drop(['Model'],axis=1,inplace=True)

Since Model has 2053 unique values, it is too granular to be a useful predictor of CO2 emissions, so I have dropped this column

In [None]:
data.head()

## 5.1 Outliers <a id='out'></a>

### 5.1.1 Discovery of Outliers<a id='dis_out'></a>

In [None]:
df_num_features=data.select_dtypes(include=np.number)

**Identifying outliers using IQR**

In [None]:
Q1 = df_num_features.quantile(0.25)
Q3 = df_num_features.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
outlier = pd.DataFrame((df_num_features < (Q1 - 1.5 * IQR)) | (df_num_features > (Q3 + 1.5 * IQR)))

In [None]:
for i in outlier.columns:
    print('Total number of Outliers in column {} are {}'.format(i, (len(outlier[outlier[i] == True][i]))))

**Visualizing outliers using Boxplots**

In [None]:
sns.set_theme(style="darkgrid")
for column in df_num_features.columns:
    plt.figure(figsize=(30,5))
    sns.boxplot(x=column, data=df_num_features)
    plt.xlabel(column, fontsize=18)
    plt.show()

### 5.1.2 Removal of Outliers<a id='rem_out'></a>

**Checking the normality of numeric features**

In [None]:
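# run the Shapiro-Wilk test on each numeric column separately
# print the test statistic and corresponding p-value
for col in df_num_features.columns:
    stat, p_value = shapiro(df_num_features[col])
    print('{}: Test statistic = {:.4f}, P-Value = {:.4g}'.format(col, stat, p_value))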

Since the numeric features are not normally distributed, I am removing the outliers using the IQR method
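
To make the IQR rule concrete, here is a minimal sketch on a made-up column (the values are illustrative only). Any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is treated as an outlier.

```python
import pandas as pd

# Made-up numeric column with one obvious extreme value
values = pd.Series([196, 221, 230, 244, 246, 255, 270, 288, 310, 522])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences and keep the rest
is_outlier = (values < lower) | (values > upper)
kept = values[~is_outlier]

print(lower, upper, values[is_outlier].tolist())
```

The same fences are applied column by column in the notebook, and a row is dropped if any of its numeric values falls outside them.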

In [None]:
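# build the outlier mask from the numeric columns only, then filter the full dataframe
outlier_rows = ((df_num_features < (Q1 - 1.5 * IQR)) | (df_num_features > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~outlier_rows]
data.shape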

In [None]:
data.reset_index(inplace=True)

In [None]:
data.drop(['index'],inplace=True,axis=1)

In [None]:
data.head()

### 5.1.3 Re-checking Correlation<a id='rec_cor'></a>

In [None]:
# select the numerical features in the dataset using 'select_dtypes()'
# select_dtypes(include=np.number): considers the numeric variables
data_num_features = data.select_dtypes(include=np.number)

# print the names of the numeric variables 
print('The numerical columns in the dataset are: ',data_num_features.columns)

In [None]:
# generate the correlation matrix
corr =  data_num_features.corr()

# print the correlation matrix
corr

In [None]:
plt.figure(figsize=(20,10))
corr =data_num_features.corr(method='pearson')
sns.heatmap(corr, annot=True,cmap='tab20b')
plt.show()

After treating the outliers, the correlations between the numeric variables have changed only slightly

## 5.2 Categorical Encoding<a id='cat_enc'></a>

**Filter the numeric and categorical features**

In [None]:
df_dummies = pd.get_dummies(data = data[["Fuel_Type","Transmission","Make_Type","Vehicle_Class_Type"]], drop_first = True)
df_dummies.head()

In [None]:
df_num_features=data.select_dtypes(include=np.number)
df_num_features.head()

**Concatenate numerical and dummy encoded categorical variables**

In [None]:
df_comb = pd.concat([df_num_features, df_dummies], axis = 1)
df_comb.head()
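
The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up `toy` frame with three fuel types. One level per variable is dropped because the full set of dummies always sums to 1, which would be perfectly collinear with the constant term added later.

```python
import pandas as pd

toy = pd.DataFrame({"Fuel_Type": ["X", "Z", "E", "X"]})

# One column per category
full = pd.get_dummies(toy, drop_first=False)

# The first category (alphabetically 'E') becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row with zeros in every remaining dummy column then represents the dropped baseline category.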

# 6. Building Multiple Linear Regression Models<a id='bui_mlr_mod'></a>

In [None]:
df_comb.drop(['CO2_Emissions'],inplace=True,axis=1)

In [None]:
df_comb.head()

In [None]:
df_comb.isna().sum()

## 6.1 Multiple Linear Regression - Basic Model<a id='bas_mod'></a>

In [None]:
X = df_comb.copy()

In [None]:
X = sm.add_constant(X)
y = data.CO2_Emissions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

MLR_model1 = sm.OLS(y_train, X_train).fit()
print(MLR_model1.summary())

Interpretations:
    1. 99.5% of the variation in CO2 emissions is explained by the model
    2. The Durbin-Watson test statistic is 2.006, which indicates that there is no autocorrelation
    3. The Condition Number is 1.00e+16, which suggests that there is severe multicollinearity
    4. The features taken into consideration are on different scales
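
To illustrate why a very large condition number points to multicollinearity, the sketch below builds a made-up design matrix in which one column is an exact linear combination of two others. The data and variable names are synthetic. In the real dataset the dummy columns and the near-redundant fuel consumption columns are plausible contributors, but that is an assumption rather than something this example proves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
city = rng.uniform(5, 20, n)
hwy = rng.uniform(4, 15, n)

# 'comb' is an exact linear combination of the other two columns
comb = 0.55 * city + 0.45 * hwy

X_independent = np.column_stack([np.ones(n), city, hwy])
X_collinear = np.column_stack([np.ones(n), city, hwy, comb])

# 2-norm condition number: ratio of the largest to the smallest singular value
cond_independent = np.linalg.cond(X_independent)
cond_collinear = np.linalg.cond(X_collinear)
print(cond_independent, cond_collinear)
```

The collinear matrix is numerically rank deficient, so its smallest singular value is close to zero and the ratio explodes.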

## 6.2 Feature Transformation<a id='fea_tra'></a>

In [None]:
df_num_features.skew()

Since the skewness is relatively low, there is no need to perform any further transformations to reduce skewness

## 6.3 Feature Scaling<a id='fea_sca'></a>

In [None]:
for col in df_num_features.columns:
    print("Column ", col, " :", stats.shapiro(df_num_features[col]))

Since none of the numerical features are normally distributed (p-value < 0.05), I will use Min-Max normalisation to scale the data
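
Min-Max normalisation rescales each feature to the range [0, 1] using (x - min) / (max - min). A minimal sketch of the formula on made-up engine sizes:

```python
import numpy as np

engine_size = np.array([1.0, 2.0, 3.5, 5.0, 8.4])  # made-up values in litres

# Smallest value maps to 0, largest to 1, everything else falls in between
scaled = (engine_size - engine_size.min()) / (engine_size.max() - engine_size.min())
print(scaled.round(3))
```

`MinMaxScaler` applies this per column. One design point worth considering: fitting the scaler on the training split only and reusing it on the test split keeps the test rows from influencing the scaling.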

In [None]:
df_num_features.drop('CO2_Emissions',axis=1,inplace=True)

In [None]:
mms = preprocessing.MinMaxScaler()
mmsfit = mms.fit(df_num_features)
# transform the numeric features and keep the original column names
dfxz = pd.DataFrame(mmsfit.transform(df_num_features), columns=df_num_features.columns)

In [None]:
dfxz.head()

In [None]:
dfxz = pd.concat([dfxz, df_dummies], axis = 1)
dfxz.head()

## 6.4 Multiple Linear Regression - Full Model - After Feature Scaling<a id='mod_aft_sca'></a>

In [None]:
X=dfxz
X = sm.add_constant(X)
y = data.CO2_Emissions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

MLR_model2 = sm.OLS(y_train, X_train).fit()
print(MLR_model2.summary())

Interpretations:
    1. 99.5% of the variation in CO2 emissions is explained by the model
    2. The Durbin-Watson test statistic is 2.006, which indicates that there is no autocorrelation
    3. The Condition Number is 1.24e+16, which suggests that there is severe multicollinearity
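
For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, while values near 0 suggest strong positive autocorrelation. The sketch below uses synthetic residuals; `durbin_watson_stat` is an illustrative helper, not the statsmodels function that produced the summary.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


rng = np.random.default_rng(1)
independent = rng.normal(size=2000)             # no serial correlation
correlated = np.cumsum(rng.normal(size=2000))   # random walk: strong positive autocorrelation

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)
print(round(dw_independent, 2), round(dw_correlated, 2))
```

Since the rows of this dataset are vehicles rather than a time series, the statistic mainly confirms that the row order carries no pattern.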

## 6.5 Assumptions Before Multiple Linear Regression Model<a id="ass_bef"></a>

### 6.5.1 Assumption #1: If Target Variable is Numeric<a id="tgt_num"></a>

In [None]:
target = data['CO2_Emissions']

target.dtype

### 6.5.2 Assumption #2: Presence of Multi-Collinearity<a id="pre_mul_col"></a>

In [None]:
# create an empty dataframe to store the VIF for each variable
vif = pd.DataFrame()

# calculate VIF using list comprehension 
# use for loop to access each variable 
# calculate VIF for each variable and create a column 'VIF_Factor' to store the values 
vif["VIF_Factor"] = [variance_inflation_factor(df_num_features.values, i) for i in range(df_num_features.shape[1])]

# create a column of variable names
vif["Features"] = df_num_features.columns

# sort the dataframe based on the values of VIF_Factor in descending order
# 'ascending = False' sorts the data in descending order
# 'reset_index' resets the index of the dataframe
# 'drop = True' drops the previous index
vif.sort_values('VIF_Factor', ascending = False).reset_index(drop = True)

Since all the features except Fuel_Consumption_Comb1 have a VIF greater than 10, dropping features based on VIF would remove almost all of them. Hence, I will proceed with PCA
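
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. A minimal NumPy sketch on synthetic data follows; the helper `vif` and the generated columns are illustrative only. Unlike the statsmodels call above, which uses the columns exactly as passed, this helper adds an intercept to each auxiliary regression, so the numbers are not directly comparable.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)


rng = np.random.default_rng(2)
n = 500
city = rng.uniform(5, 20, n)
hwy = 0.7 * city + rng.normal(0, 0.5, n)            # strongly tied to city
cylinders = rng.integers(3, 13, n).astype(float)    # unrelated to the other two
X = np.column_stack([city, hwy, cylinders])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two related columns get large VIFs while the unrelated one stays close to 1, which mirrors the pattern among the fuel consumption columns.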

In [None]:
sklearn_pca = PCA()
pcafit = sklearn_pca.fit(dfxz)

In [None]:
pcafit.explained_variance_

In [None]:
pcafit.components_

In [None]:
plt.plot(np.cumsum(pcafit.explained_variance_ratio_))
plt.locator_params(axis="x", nbins=len(pcafit.explained_variance_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

As the graph above shows, 28 components explain almost 98% of the variance in the features

In [None]:
np.round(pcafit.explained_variance_ratio_.reshape(-1,1) * 100,1)

The output above shows the percentage of variance held by each component; the last 6 components hold essentially no variance
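
The number of components needed for a given share of variance can also be read off programmatically instead of from the plot. The sketch below does this with a plain SVD on synthetic data. The 98% threshold matches the text above, while the data and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
base = rng.normal(size=(n, 2))

# Five columns built from two underlying signals plus a little noise
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + base[:, 1],
    2 * base[:, 0] - base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=n),
])

Xc = X - X.mean(axis=0)   # PCA works on centred data
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 98%
k = int(np.searchsorted(cumulative, 0.98) + 1)
print(np.round(cumulative, 4), k)
```

With scikit-learn, the same index can be computed from `np.cumsum(pcafit.explained_variance_ratio_)`.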

In [None]:
dfx_pca = sklearn_pca.fit_transform(dfxz)
dfx_pca.shape

In [None]:
# name the components pca0, pca1, ... up to the last component
dfx_pca = pd.DataFrame(dfx_pca, columns=['pca{}'.format(i) for i in range(dfx_pca.shape[1])])

In [None]:
dfx_pca.head()

## 6.6 Multiple Linear Regression - Full Model - After PCA<a id="mod_pca"></a>

In [None]:
dfx_pca = sm.add_constant(dfx_pca)

In [None]:
# keep the constant and the first 34 components (pca0 to pca33)
X = dfx_pca[['const'] + [f'pca{i}' for i in range(34)]]
X.head()

In [None]:
y = data.CO2_Emissions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

MLR_model_pca = sm.OLS(y_train, X_train).fit()
print(MLR_model_pca.summary())

Interpretations:
    1. 99.3% of the variation in CO2 emissions is explained by the model.
    2. The Durbin-Watson test statistic is 2.053, which is close to 2 and indicates that there is no autocorrelation.
    3. The Condition Number is 23.4, which suggests that there is no multicollinearity.

## 6.7 Feature Selection<a id="fea_sel"></a>

### 6.7.1 Forward Selection<a id="for_sel"></a>

In [None]:
# initiate linear regression model to use in feature selection
linreg = LinearRegression()
linreg_forward = sfs(estimator=linreg, k_features ='best', forward=True,
                     verbose=2, scoring='r2')

# fit the forward selection on training data using fit()
sfs_forward = linreg_forward.fit(X_train, y_train)

In [None]:
# print the names of the features selected with k_features = 'best'
print('Features selected using forward selection are: ')
print(sfs_forward.k_feature_names_)

# print the R-squared value
print('\nR-Squared: ', sfs_forward.k_score_)

All components except pca14 and pca22 have been retained, as this subset gives the best score.
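
For readers unfamiliar with the procedure, the greedy idea behind forward selection can be sketched in plain NumPy. This is a simplified illustration on synthetic data: it scores candidates by in-sample R-squared and stops once the gain falls below a tolerance, whereas the `sfs` call above relies on its own scoring setup. The names `r_squared`, `forward_select` and `tol` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R-squared of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, tol=1e-3):
    """Greedily add the column that raises R-squared most; stop when the gain is below tol."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < tol:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

rng = np.random.default_rng(10)
X_demo = rng.normal(size=(300, 5))
# only columns 0 and 3 drive the response; the rest are noise
y_demo = 4 * X_demo[:, 0] - 2 * X_demo[:, 3] + rng.normal(scale=0.5, size=300)
chosen, score = forward_select(X_demo, y_demo)
print(chosen, round(score, 3))
```

Backward elimination follows the same logic in reverse: start from all columns and repeatedly drop the one whose removal hurts the score least.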

### 6.7.2 Backward Elimination<a id="bac_eli"></a>

In [None]:
# initiate linear regression model to use in feature selection
linreg = LinearRegression()
linreg_backward = sfs(estimator = linreg, k_features ='best', forward = False,
                     verbose = 2, scoring = 'r2')

# fit the backward elimination on training data using fit()
sfs_backward = linreg_backward.fit(X_train, y_train)

In [None]:
# print the names of the features selected with k_features = 'best'
print('Features selected using backward elimination are: ')
print(sfs_backward.k_feature_names_)

# print the R-squared value
print('\nR-Squared: ', sfs_backward.k_score_)

Backward elimination gives the same result as forward selection: all components except pca14 and pca22 are retained.

## 6.8 Multiple Linear Regression - Full Model - After Feature Selection<a id="mod_fea_sel"></a>

In [None]:
# keep the constant and components pca0 to pca33, excluding pca14 and pca22
X = dfx_pca[['const'] + [f'pca{i}' for i in range(34) if i not in (14, 22)]]
X.head()

In [None]:
y = data.CO2_Emissions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

MLR_full_model = sm.OLS(y_train, X_train).fit()
print(MLR_full_model.summary())

Interpretations:
    1. 99.3% of the variation in CO2 emissions is explained by the model.
    2. The Durbin-Watson test statistic is 2.051, which is close to 2 and indicates that there is no autocorrelation.
    3. The Condition Number is 23.4, which suggests that there is no multicollinearity.

## 6.9 Assumptions After Multiple Linear Regression Model<a id="ass_aft"></a>

### 6.9.1 Assumption #1: Linear Relationship Between Dependent and Independent Variable<a id="lr_dep_ind"></a>

In [None]:
import seaborn as sns 
fig, ax = plt.subplots(nrows = 2, ncols= 2, figsize=(20, 15))

# use for loop to create scatter plot of residuals against the first four independent variables (do not consider the intercept)
# 'ax' assigns axes object to draw the plot onto 
for variable, subplot in zip(X_train.columns[1:5], ax.flatten()):
    # pass 'x' and 'y' as keywords; recent seaborn versions no longer accept them positionally
    sns.scatterplot(x = X_train[variable], y = MLR_full_model.resid, ax = subplot)

# display the plot
plt.show()

**Interpretation:** The above plots show no specific pattern in the residuals, which implies that the relationship between the dependent and independent variables is linear.

### 6.9.2 Assumption #2: Checking for Autocorrelation<a id="che_aut_cor"></a>

In [None]:
# print the model summary
print(MLR_full_model.summary())

**Interpretation:** From the above summary, I can observe that the value obtained from the `Durbin-Watson` test statistic is close to 2 (= 2.051). Thus, I conclude that there is no autocorrelation.
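
The Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by their sum of squares, which is approximately 2(1 - ρ) for lag-1 autocorrelation ρ. The sketch below uses synthetic residuals rather than `MLR_full_model.resid`, and `durbin_watson_stat` is an illustrative name. It shows why values near 2 point to no autocorrelation and values near 0 to strong positive autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(10)
white_noise = rng.normal(size=2000)          # uncorrelated residuals

# AR(1)-style residuals with strong positive autocorrelation
ar = np.zeros(2000)
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + white_noise[t]

dw_white = durbin_watson_stat(white_noise)
dw_ar = durbin_watson_stat(ar)
print(round(dw_white, 2), round(dw_ar, 2))
```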

### 6.9.3 Assumption #3: Checking for Heteroskedasticity<a id="che_het"></a>

Breusch-Pagan is one of the tests for detecting heteroskedasticity in the residuals.<br>
The test hypothesis for the Breusch-Pagan test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  There is homoscedasticity present in the data </strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  There is heteroscedasticity present in the data </strong> </p>

In [None]:
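# create vector of result parameters
# 'het_breuschpagan' returns (LM statistic, LM p-value, F-statistic, F p-value); 'test[2:]' keeps the last two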
name = ['f-value','p-value']
test = sms.het_breuschpagan(MLR_full_model.resid, MLR_full_model.model.exog)
lzip(name, test[2:])

**Interpretation:** I observe that the p-value is less than 0.05; thus, I conclude that there is heteroskedasticity present in the data.
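
The idea behind the test can be sketched as follows: regress the squared residuals on the explanatory variables and compute n times the R-squared of that auxiliary regression. The NumPy code below is a minimal illustration on synthetic residuals, assuming the exogenous matrix already contains a constant column; `breusch_pagan_lm` is a hypothetical helper, and the statsmodels implementation may differ in details, so treat it as an illustration rather than a re-implementation.

```python
import numpy as np

def breusch_pagan_lm(residuals, exog):
    """LM statistic = n * R-squared of squared residuals regressed on exog.
    exog is assumed to already contain a constant column."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    Z = np.asarray(exog, dtype=float)
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ gamma
    r_sq = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return len(e2) * r_sq

rng = np.random.default_rng(10)
x = rng.uniform(1, 5, size=1000)
exog_demo = np.column_stack([np.ones(1000), x])
resid_const = rng.normal(scale=1.0, size=1000)      # constant spread
resid_growing = rng.normal(size=1000) * x           # spread grows with x
lm_const = breusch_pagan_lm(resid_const, exog_demo)
lm_growing = breusch_pagan_lm(resid_growing, exog_demo)
print(round(lm_const, 2), round(lm_growing, 2))
```

A large statistic (and hence a small p-value) means the residual spread depends on the predictors, which is the situation reported above.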

### 6.9.4 Assumption #4: Tests for Normality<a id="tes_nor"></a>

#### 6.9.4.1 Q-Q Plot<a id="qq_plt"></a>

In [None]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

# plot the Q-Q plot
# 'r' represents the regression line
qqplot(MLR_full_model.resid, line = 'r')

# set plot and axes labels
# set text size using 'fontsize'
plt.title('Q-Q Plot', fontsize = 15)
plt.xlabel('Theoretical Quantiles', fontsize = 15)
plt.ylabel('Sample Quantiles', fontsize = 15)

# display the plot
plt.show()

**Interpretation:** The red diagonal line is the regression line fitted through the points, and the blue points plot the sample quantiles of the residuals against the theoretical quantiles of a normal distribution. As some of the points lie away from the diagonal line, I conclude that the residuals do not follow a `normal distribution`.

#### 6.9.4.2 Shapiro Wilk Test<a id="sha_wil_tes"></a>

The Shapiro Wilk test is used to check the normality of the residuals. The test hypothesis is given as:<br>

<p style='text-indent:25em'> <strong> H<sub>0</sub>:  Residuals are normally distributed </strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  Residuals are not normally distributed </strong> </p>

In [None]:
stat, p_value = shapiro(MLR_full_model.resid)
print('Test statistic:', stat)
print('P-Value:', p_value)

**Interpretation:** From the above test I can see that the p-value is 2.153e-37 (less than 0.05), thus I can say that the residuals are not normally distributed.

# 7. Model Evaluation<a id="mod_eva"></a>

## 7.1 Measures of Variation<a id="mea_var"></a>

In [None]:
y_train_pred = MLR_full_model.predict(X_train) 
y_train_pred.head()

In [None]:
# calculate the SSR on train dataset
ssr = np.sum((y_train_pred - y_train.mean())**2)
print('Sum of Squared Regression:',ssr)

In [None]:
# calculate the SSE on train dataset
sse = np.sum((y_train - y_train_pred)**2)
print('Sum of Squared Error:',sse)

In [None]:
# calculate the SST on train dataset
sst = np.sum((y_train - y_train.mean())**2)
print('Sum of Squared Total:',sst)

In [None]:
print('Sum of SSR and SSE is:',ssr+sse)

**Interpretation:** From the above output, I can verify that SST (Total variation) is the sum of SSR and SSE.
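
The identity SST = SSR + SSE holds for least-squares fits that include an intercept, which is why the `const` column matters here. The following is a small NumPy sketch on synthetic data (all variable names are illustrative) that checks the identity with an intercept and shows that it generally breaks without one.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.normal(size=200)
y = 3.0 + 2.0 * x + rng.normal(size=200)

# OLS with an explicit intercept column
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

ssr_demo = np.sum((y_hat - y.mean()) ** 2)
sse_demo = np.sum((y - y_hat) ** 2)
sst_demo = np.sum((y - y.mean()) ** 2)
print(np.isclose(ssr_demo + sse_demo, sst_demo))

# without the intercept the decomposition generally fails
b0, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
y_hat0 = x * b0[0]
gap = np.sum((y_hat0 - y.mean()) ** 2) + np.sum((y - y_hat0) ** 2) - sst_demo
print(round(gap, 2))
```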

In [None]:
r_sq =MLR_full_model.rsquared

# print the R-squared value
print('R Squared is:',r_sq)

In [None]:
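# standard error of estimate: divide SSE by the residual degrees of freedom
# (number of observations minus number of estimated coefficients, including the intercept)
see = np.sqrt(sse/(len(X_train) - X_train.shape[1]))    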
print("The standard error of estimate:",see)

## 7.2 Inferences about Intercept and Slope<a id="inf_int_slo"></a>

In [None]:
MLR_full_model.summary()

In [None]:
# use 'iloc' for positional access; 'params' and 'bse' are indexed by variable name
t_intercept = MLR_full_model.params.iloc[0] / MLR_full_model.bse.iloc[0]
print('t intercept:',t_intercept)

In [None]:
# use 'iloc' for positional access; 'params' and 'bse' are indexed by variable name
t_coeff1 = MLR_full_model.params.iloc[1] / MLR_full_model.bse.iloc[1]
print('t coeff:',t_coeff1)

In [None]:
# calculate p-value for intercept
# use 'sf' (Survival function) from t-distribution to calculate the corresponding p-value

# pass degrees of freedom and t-statistic value for intercept
# degrees of freedom = n - k (residual degrees of freedom of the model), available as 'df_resid'
pval = stats.t.sf(np.abs(t_intercept), MLR_full_model.df_resid)*2 
print('p val for intercept:',pval)

In [None]:
# calculate p-value for slope
# use 'sf' (Survival function) from t-distribution to calculate the corresponding p-value

# pass degrees of freedom and t-statistic value for slope
# degrees of freedom = n - k (residual degrees of freedom of the model), available as 'df_resid'
pval = stats.t.sf(np.abs(t_coeff1), MLR_full_model.df_resid)*2 
print('p val for slope:',pval)

## 7.3 Confidence Interval for Intercept and Slope<a id="con_int_slo"></a>

In [None]:
# CI for intercept
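# CI = coefficient -/+ t_table_value * standard error
# t_table_value is the two-sided 95% critical value for the residual degrees of freedom
t_table_value = stats.t.ppf(0.975, MLR_full_model.df_resid)
CI_inter_min, CI_inter_max = MLR_full_model.params.iloc[0] - (t_table_value*MLR_full_model.bse.iloc[0]), MLR_full_model.params.iloc[0] + (t_table_value*MLR_full_model.bse.iloc[0])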

# print the confidence interval for intercept 
print('CI for intercept:', [CI_inter_min , CI_inter_max])

In [None]:
# CI for slope
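# CI = coefficient -/+ t_table_value * standard error
# t_table_value is the two-sided 95% critical value for the residual degrees of freedom
t_table_value = stats.t.ppf(0.975, MLR_full_model.df_resid)
CI_coeff1_min, CI_coeff1_max = MLR_full_model.params.iloc[1] - (t_table_value*MLR_full_model.bse.iloc[1]), MLR_full_model.params.iloc[1] + (t_table_value*MLR_full_model.bse.iloc[1])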

# print the confidence interval for slope
print('CI for coeff1:', [CI_coeff1_min, CI_coeff1_max])

## 7.4 Compare Regression Results<a id="com_reg_res"></a>

In [None]:
print(MLR_full_model.summary())

In [None]:
r_sq_mlr = MLR_full_model.rsquared

# print the value
print('r square in regression model:',r_sq_mlr)

**Interpretation:** The value of R-squared is 0.993. Thus, I conclude that the 99.3% variation in the CO2_Emissions is explained by the model. I can also obtain this value from the summary of the model.

In [None]:
# calculate adjusted R-Squared on train dataset
# use 'rsquared_adj' from statsmodel
adj_r_sq = MLR_full_model.rsquared_adj

# print the value
print('Adjusted r square for regression model:',adj_r_sq)

**Interpretation:** The adjusted R-squared is read directly from the fitted model through `rsquared_adj`; it matches the value shown in the summary of the model.
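
Adjusted R-squared penalises R-squared for the number of predictors: 1 - (1 - R²)(n - 1)/(n - p - 1), where p excludes the intercept. A minimal sketch with illustrative numbers follows; `adjusted_r_squared` is a hypothetical helper and the inputs are not taken from the model summary.

```python
def adjusted_r_squared(r_sq, n_obs, n_predictors):
    """Adjusted R-squared; n_predictors excludes the intercept."""
    return 1 - (1 - r_sq) * (n_obs - 1) / (n_obs - n_predictors - 1)

# illustrative numbers only, not taken from the model summary
adj_demo = adjusted_r_squared(r_sq=0.90, n_obs=101, n_predictors=10)
print(round(adj_demo, 4))
```

For the fitted model, `adjusted_r_squared(MLR_full_model.rsquared, MLR_full_model.nobs, MLR_full_model.df_model)` should reproduce `rsquared_adj`, assuming the constant was detected by statsmodels.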

Overall F-Test & p-value of the Model

In [None]:
# compute f_value using the below formula 
# f_value = (r_sq / (k-1)) / ((1 - r_sq) / (n-k))

# k = number of beta coefficients
k = len(X_train.columns)

# n = number of observations
n = len(X_train)

# calculate value of F-statistic
# 'r_sq_mlr' represents the R-Squared value
f_value = (r_sq_mlr / (k - 1))/((1-r_sq_mlr)/(n - k))

# print the value
print('f value for regression model:',f_value)

In [None]:
# degrees of freedom 
# dfn = k - 1 (number of predictors, excluding the intercept)
# dfd = n - k (residual degrees of freedom)
p_val = stats.f.sf(f_value, dfn = k - 1, dfd = n - k)

# print the value
print('p value for regression model:',p_val)

**Interpretation:** As the p-value is less than 0.05, I reject the null hypothesis in favour of the alternative; i.e. the model is significant.
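
The link between R-squared and the overall F-statistic can be written as a small function that also returns the two degrees of freedom, so they never need to be typed in by hand. This is a sketch with illustrative numbers; `overall_f_statistic` is a hypothetical helper and `k_params` counts the intercept.

```python
def overall_f_statistic(r_sq, n_obs, k_params):
    """F = (R^2 / (k - 1)) / ((1 - R^2) / (n - k)); k counts the intercept."""
    dfn = k_params - 1
    dfd = n_obs - k_params
    return (r_sq / dfn) / ((1 - r_sq) / dfd), dfn, dfd

# illustrative numbers only
f_demo, dfn_demo, dfd_demo = overall_f_statistic(r_sq=0.5, n_obs=103, k_params=3)
print(f_demo, dfn_demo, dfd_demo)
```

The p-value then follows from the survival function of the F distribution with these two degrees of freedom, as in the cell above.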

# 8. Model Performance<a id="mod_per"></a>

In [None]:
train_pred = MLR_full_model.predict(X_train)
test_pred = MLR_full_model.predict(X_test)

In [None]:
train_pred.head()

In [None]:
test_pred.head()

## 8.1 Mean Squared Error (MSE)<a id="mse"></a>

In [None]:
mse_train = round(mean_squared_error(y_train, train_pred),4)

# print the MSE for the training set
print("Mean Squared Error (MSE) on training set: ", mse_train)

# calculate the MSE for the test data
# round the value upto 4 digits using 'round()'
mse_test = round(mean_squared_error(y_test, test_pred),4)

# print the MSE for the test set
print("Mean Squared Error (MSE) on test set: ", mse_test)

## 8.2 Root Mean Squared Error (RMSE)<a id="rmse"></a>

In [None]:
# calculate the MSE using the "mean_squared_error" function

# MSE for the train data
mse_train = mean_squared_error(y_train, train_pred)
rmse_train = round(np.sqrt(mse_train), 4)

# print the RMSE for the train set
print("Root Mean Squared Error (RMSE) on training set: ", rmse_train)

# MSE for the test data
mse_test = mean_squared_error(y_test, test_pred)

# take the square root of the MSE to calculate the RMSE
# round the value upto 4 digits using 'round()'
rmse_test = round(np.sqrt(mse_test), 4)

# print the RMSE for the test set
print("Root Mean Squared Error (RMSE) on test set: ", rmse_test)

## 8.3 Mean Absolute Error (MAE)<a id="mae"></a>

In [None]:
# calculate the MAE using the "mean_absolute_error" function

# calculate the MAE for the train data
# round the value upto 4 digits using 'round()'
mae_train = round(mean_absolute_error(y_train, train_pred),4)

# print the MAE for the training set
print("Mean Absolute Error (MAE) on training set: ", mae_train)

# calculate the MAE for the test data
# round the value upto 4 digits using 'round()'
mae_test = round(mean_absolute_error(y_test, test_pred),4)

# print the MAE for the test set
print("Mean Absolute Error (MAE) on test set: ", mae_test)

## 8.4 Mean Absolute Percentage Error (MAPE)<a id="mape"></a>

In [None]:
def mape(actual, predicted):
    return (np.mean(np.abs((actual - predicted) / actual)) * 100)

In [None]:
mape_train = round(mape(y_train, train_pred),4)

# print the MAPE for the training set
print("Mean Absolute Percentage Error (MAPE) on training set: ", mape_train)

# calculate the MAPE for the test data
# round the value upto 4 digits using 'round()'
mape_test = round(mape(y_test, test_pred),4)

# print the MAPE for the test set
print("Mean Absolute Percentage Error (MAPE) on test set: ", mape_test)

## 8.5 Resultant Table<a id="res_tab"></a>

In [None]:
cols = ['Model_Name', 'R-squared', 'Adj. R-squared', 'MSE', 'RMSE', 'MAE', 'MAPE']

result_table = pd.DataFrame(columns = cols)

In [None]:
from statsmodels.tools.eval_measures import rmse

MLR_full_model_metrics = pd.Series({'Model_Name': "MLR Full Model",
                     'R-squared': MLR_full_model.rsquared,
                     'Adj. R-squared': MLR_full_model.rsquared_adj,
                     'MSE': mean_squared_error(y_test, test_pred),
                     'RMSE': rmse(y_test, test_pred),
                     'MAE': mean_absolute_error(y_test, test_pred),
                     'MAPE': mape(y_test, test_pred)
                   })

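# 'DataFrame.append' was removed in pandas 2.0; use 'pd.concat' to add the metrics as a new row
result_table = pd.concat([result_table, MLR_full_model_metrics.to_frame().T], ignore_index = True)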

result_table

# 9. Model Optimization<a id="mod_opt"></a>

## 9.1 Bias<a id="bias"></a>

In [None]:
sns.regplot(y = y_train,x = train_pred,color='red',line_kws={'color':'blue'},marker='x')

## 9.2 Variance<a id="var"></a>

In [None]:
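# draw as many random positions from the train predictions as there are test predictions
a = np.random.randint(0, len(train_pred), len(test_pred))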
train_pred1 = list(train_pred)
TrainPred2 = []

In [None]:
for i in a:
    TrainPred2.append(train_pred1[i])

In [None]:
sns.regplot(y = test_pred,x = TrainPred2)

<b> INTERPRETATION</b>: The bias is low and the variance is high, which suggests that the model is overly complex. I will employ optimization techniques to reduce the complexity and the RMSE.

## 9.3 Model Validation<a id="mod_val"></a>

### 9.3.1 Cross Validation<a id="cro_val"></a>

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

kf = KFold(n_splits = 10)

In [None]:
def Get_score(model, X_train_k, X_test_k, y_train_k, y_test_k):
    model.fit(X_train_k, y_train_k)
    return model.score(X_test_k, y_test_k)  

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.3)

In [None]:
from sklearn.linear_model import LinearRegression

scores = []
 
for train_index, test_index in kf.split(X_train):
    X_train_k, X_test_k, y_train_k, y_test_k = X_train.iloc[train_index], X_train.iloc[test_index], \
                                               y_train.iloc[train_index], y_train.iloc[test_index]
 
    scores.append(Get_score(LinearRegression(), X_train_k, X_test_k, y_train_k, y_test_k)) 
    
print('All scores: ', scores)

print("\nMinimum score obtained: ", round(min(scores), 4))

print("Maximum score obtained: ", round(max(scores), 4))

print("Average score obtained: ", round(np.mean(scores), 4))

In [None]:
scores = cross_val_score(estimator = LinearRegression(), 
                         X = X_train, 
                         y = y_train, 
                         cv = 10, 
                         scoring = 'r2')

In [None]:
print('All scores: ', scores)

print("\nMinimum score obtained: ", round(min(scores), 4))

print("Maximum score obtained: ", round(max(scores), 4))

print("Average score obtained: ", round(np.mean(scores), 4))

**The R2 value is similar to the one obtained in the MLR model. There are no significant changes.**

### 9.3.2 Leave-One-Out Cross Validation (LOOCV)<a id="loocv"></a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10, test_size = 0.2)

In [None]:
def Get_score(model, X_train_k, X_test_k, y_train_k, y_test_k):
    model.fit(X_train_k, y_train_k)                              
    return model.score(X_test_k, y_test_k)

In [None]:
loocv_rmse = []
loocv = LeaveOneOut()

for train_index, test_index in loocv.split(X_train):

    X_train_l, X_test_l, y_train_l, y_test_l = X_train.iloc[train_index], X_train.iloc[test_index], \
                                               y_train.iloc[train_index], y_train.iloc[test_index]
    
    linreg = LinearRegression()
    linreg.fit(X_train_l, y_train_l)
 
    mse = mean_squared_error(y_test_l, linreg.predict(X_test_l))
    
    # use a separate name so the 'rmse' function imported from statsmodels is not overwritten
    rmse_l = np.sqrt(mse)
    
    loocv_rmse.append(rmse_l)

In [None]:
print("\nMinimum rmse obtained: ", round(min(loocv_rmse), 4))

print("Maximum rmse obtained: ", round(max(loocv_rmse), 4))
 
print("Average rmse obtained: ", round(np.mean(loocv_rmse), 4))
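
Two remarks on LOOCV for linear regression. First, each fold holds out a single observation, so the per-fold RMSE is simply the absolute prediction error, and the average of those values is a mean absolute error. Second, for ordinary least squares the leave-one-out residuals can be obtained from a single fit as e_i / (1 - h_ii), where h_ii are the diagonal entries of the hat matrix. The sketch below checks this shortcut against brute-force refitting on synthetic data; all names are illustrative and the design matrix is assumed to contain a constant column.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 60
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# brute force: refit n times, each time predicting the held-out row
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X_demo[keep], y_demo[keep], rcond=None)
    brute[i] = y_demo[i] - X_demo[i] @ beta_i

# shortcut: one fit, residuals rescaled by the leverages h_ii
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
H = X_demo @ np.linalg.pinv(X_demo)          # hat matrix
shortcut = (y_demo - X_demo @ beta) / (1 - np.diag(H))

# aggregate RMSE over all held-out predictions
loocv_rmse_demo = np.sqrt(np.mean(shortcut ** 2))
print(np.allclose(brute, shortcut), round(loocv_rmse_demo, 4))
```

On a training set with several thousand rows, the shortcut avoids refitting the model once per observation.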

## 9.4 Gradient Descent<a id="gra_des"></a>

In [None]:
def get_train_rmse(model):

    train_pred = model.predict(X_train)

    mse_train = mean_squared_error(y_train, train_pred)

    rmse_train = round(np.sqrt(mse_train), 4)

    return(rmse_train)

In [None]:
def get_test_rmse(model):

    test_pred = model.predict(X_test)

    mse_test = mean_squared_error(y_test, test_pred)

    rmse_test = round(np.sqrt(mse_test), 4)

    return(rmse_test)

In [None]:
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(random_state = 10)

linreg_with_SGD = sgd.fit(X_train, y_train)

print('RMSE on train set:', get_train_rmse(linreg_with_SGD))

print('RMSE on test set:', get_test_rmse(linreg_with_SGD))

In [None]:
def plot_coefficients(model, algorithm_name):

    df_coeff = pd.DataFrame({'Variable': X.columns, 'Coefficient': model.coef_})

    sorted_coeff = df_coeff.sort_values('Coefficient', ascending = False)

    sns.barplot(x = "Coefficient", y = "Variable", data = sorted_coeff)

    plt.xlabel("Coefficients from {}".format(algorithm_name), fontsize = 15)

    plt.ylabel('Features', fontsize = 15)

In [None]:
MLR_model = linreg.fit(X_train, y_train)

In [None]:
plt.subplot(1,2,1)
plot_coefficients(MLR_model, 'Linear Regression (OLS)')

plt.subplot(1,2,2)
plot_coefficients(linreg_with_SGD, 'Linear Regression (SGD)')

plt.tight_layout()

In [None]:
score_card = pd.DataFrame(columns=['Model_Name', 'Alpha (Wherever Required)', 'l1-ratio', 'R-Squared',
                                       'Adj. R-Squared', 'Train_RMSE','Test_RMSE', 'Test_MAPE'])

In [None]:
def get_test_mape(model):

    test_pred = model.predict(X_test)

    mape_test = mape(y_test, test_pred)

    return(mape_test)

In [None]:
def get_score(model):
    
    r_sq = model.score(X_train, y_train)

    n = X_train.shape[0]

    # number of predictors, excluding the 'const' column
    k = X_train.shape[1] - 1

    r_sq_adj = 1 - ((1-r_sq)*(n-1)/(n-k-1))
    
    return ([r_sq, r_sq_adj])

In [None]:
def update_score_card(algorithm_name, model, alpha = '-', l1_ratio = '-'):
    
    global score_card
    # 'DataFrame.append' was removed in pandas 2.0; build a one-row frame and concatenate it
    new_row = pd.DataFrame([{'Model_Name': algorithm_name,
                       'Alpha (Wherever Required)': alpha, 
                       'l1-ratio': l1_ratio, 
                       'Test_MAPE': get_test_mape(model),
                       'Train_RMSE': get_train_rmse(model),
                       'Test_RMSE': get_test_rmse(model), 
                       'R-Squared': get_score(model)[0], 
                       'Adj. R-Squared': get_score(model)[1]}])
    score_card = pd.concat([score_card, new_row], ignore_index = True)

In [None]:
update_score_card(algorithm_name = 'Linear Regression (using SGD)', model = linreg_with_SGD)

score_card

## 9.5 Regularization<a id="reg"></a>

### 9.5.1 Ridge Regression Model<a id="ridge"></a>

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

In [None]:
ridge = Ridge(alpha = 0.1, max_iter = 500)

ridge.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(ridge))

In [None]:
update_score_card(algorithm_name='Ridge Regression (with alpha = 0.1)', model = ridge, alpha = 0.1)

score_card

In [None]:
ridge = Ridge(alpha = 1, max_iter = 500)

ridge.fit(X_train, y_train)

print('RMSE on test set:', np.round(get_test_rmse(ridge),2))

In [None]:
update_score_card(algorithm_name='Ridge Regression (with alpha = 1)', model = ridge, alpha = 1)

score_card

In [None]:
ridge = Ridge(alpha = 2, max_iter = 500)

ridge.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(ridge))

In [None]:
update_score_card(algorithm_name='Ridge Regression (with alpha = 2)', model = ridge, alpha = 2)

score_card

In [None]:
ridge = Ridge(alpha = 0.5, max_iter = 500)

ridge.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(ridge))

In [None]:
plt.subplot(1,2,1)
plot_coefficients(MLR_model, 'Linear Regression (OLS)')

plt.subplot(1,2,2)
plot_coefficients(ridge, 'Ridge Regression (alpha = 0.5)')

plt.tight_layout()

<b>Interpretation:</b> The coefficients obtained from ridge regression have similar values as compared to the coefficients obtained from linear regression using OLS.
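
Why a small alpha barely changes the coefficients can be seen from the closed-form ridge solution (XᵀX + αI)⁻¹Xᵀy: when α is small relative to the scale of XᵀX the solution is close to OLS, and the coefficient norm shrinks as α grows. The sketch below uses synthetic data and centred variables so that the intercept is left unpenalised; `ridge_coefficients` is a hypothetical helper, not the scikit-learn solver.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centred data (intercept left unpenalised)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(10)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)

# coefficient norm for increasing penalty strength (alpha = 0 is plain OLS)
norms = [np.linalg.norm(ridge_coefficients(X_demo, y_demo, a))
         for a in (0.0, 0.5, 100.0, 10000.0)]
print(np.round(norms, 3))
```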

### 9.5.2 Lasso Regression Model<a id="lasso"></a>

In [None]:
lasso = Lasso(alpha = 0.01, max_iter = 500)

lasso.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(lasso))

In [None]:
plt.subplot(1,2,1)
plot_coefficients(MLR_model, 'Linear Regression (OLS)')

plt.subplot(1,2,2)
plot_coefficients(lasso, 'Lasso Regression (alpha = 0.01)')

plt.tight_layout()

In [None]:
lasso = Lasso(alpha = 0.05, max_iter = 500)

lasso.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(lasso))

In [None]:
plt.subplot(1,2,1)
plot_coefficients(MLR_model, 'Linear Regression (OLS)')

plt.subplot(1,2,2)
plot_coefficients(lasso, 'Lasso Regression (alpha = 0.05)')

plt.tight_layout()

<b>Interpretation</b>: The second subplot (on the right) shows that the lasso regression has reduced the coefficients of some variables to zero.
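
The mechanism that produces exact zeros is soft-thresholding. In the idealised case where the feature columns are uncorrelated and scaled so that XᵀX/n is the identity, the lasso solution is the unpenalised coefficient shrunk towards zero by alpha and set to zero when its magnitude is below alpha. The sketch below applies this operator to made-up coefficients; `soft_threshold` and the numbers are illustrative, not values from the fitted model.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z towards zero by alpha; values inside [-alpha, alpha] become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

ols_like = np.array([2.0, -0.8, 0.03, -0.01])   # made-up unpenalised coefficients
shrunk = soft_threshold(ols_like, alpha=0.05)
print(shrunk)
```

Ridge, by contrast, rescales coefficients without clipping them, which is why it rarely produces exact zeros.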

In [None]:
df_lasso_coeff = pd.DataFrame({'Variable': X.columns, 'Coefficient': lasso.coef_})

print('Insignificant variables obtained from Lasso Regression when alpha is 0.05')
df_lasso_coeff.Variable[df_lasso_coeff.Coefficient == 0].to_list()

In [None]:
update_score_card(algorithm_name = 'Lasso Regression', model = lasso, alpha = '0.05')

score_card

### 9.5.3 Elastic-Net Regression Model<a id="ela_net"></a>

In [None]:
enet = ElasticNet(alpha = 0.1, l1_ratio = 0.55, max_iter = 500)

enet.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(enet))

In [None]:
update_score_card(algorithm_name = 'Elastic Net Regression', model = enet, alpha = '0.1', l1_ratio = '0.55')

score_card

In [None]:
enet = ElasticNet(alpha = 0.1, l1_ratio = 0.1, max_iter = 500)

enet.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(enet))

In [None]:
update_score_card(algorithm_name = 'Elastic Net Regression', model = enet, alpha = '0.1', l1_ratio = '0.1')

score_card

In [None]:
enet = ElasticNet(alpha = 0.1, l1_ratio = 0.01, max_iter = 500)

enet.fit(X_train, y_train)

print('RMSE on test set:', get_test_rmse(enet))

In [None]:
plt.subplot(1,2,1)
plot_coefficients(MLR_model, 'Linear Regression (OLS)')

plt.subplot(1,2,2)
plot_coefficients(enet, 'Elastic Net Regression')

plt.tight_layout()

<b>Interpretation</b>: The second subplot (on the right) shows that the elastic-net regression has reduced the coefficients of some variables to zero.
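
The roles of `alpha` and `l1_ratio` become clearer when the penalty term is written out. In scikit-learn's elastic-net objective the penalty is alpha * l1_ratio * ||w||₁ + 0.5 * alpha * (1 - l1_ratio) * ||w||₂², so `l1_ratio` close to 0 behaves like ridge and `l1_ratio` equal to 1 is the lasso. The sketch below evaluates this penalty for made-up coefficients; `elastic_net_penalty` is a hypothetical helper.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term: alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2."""
    coef = np.asarray(coef, dtype=float)
    l1 = np.sum(np.abs(coef))
    l2 = np.sum(coef ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

coef_demo = np.array([1.0, -2.0, 0.0])   # made-up coefficients
for ratio in (0.55, 0.1, 0.01):
    print(ratio, elastic_net_penalty(coef_demo, alpha=0.1, l1_ratio=ratio))
```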

In [None]:
update_score_card(algorithm_name = 'Elastic Net Regression', model = enet, alpha = '0.1', l1_ratio = '0.01')

score_card

### 9.5.4 Grid Search CV<a id="gri_sea"></a>

In [None]:
tuned_paramaters = [{'alpha':[1e-15, 1e-10, 1e-8, 1e-4,1e-3, 1e-2, 0.1, 1, 5, 10, 20, 40, 60, 80, 100]}]
 
ridge = Ridge()

ridge_grid = GridSearchCV(estimator = ridge, 
                          param_grid = tuned_paramaters, 
                          cv = 10)

ridge_grid.fit(X_train, y_train)

print('Best parameters for Ridge Regression: ', ridge_grid.best_params_, '\n')

print('RMSE on test set:', get_test_rmse(ridge_grid))

In [None]:
update_score_card(algorithm_name = 'Ridge Regression (using GridSearchCV)', 
                  model = ridge_grid, 
                  alpha = ridge_grid.best_params_.get('alpha'))

score_card

In [None]:
tuned_paramaters = [{'alpha':[1e-15, 1e-10, 1e-8, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20]}]
 
lasso = Lasso()

lasso_grid = GridSearchCV(estimator = lasso, 
                          param_grid = tuned_paramaters, 
                          cv = 10)

lasso_grid.fit(X_train, y_train)

print('Best parameters for Lasso Regression: ', lasso_grid.best_params_, '\n')

print('RMSE on test set:', get_test_rmse(lasso_grid))

In [None]:
update_score_card(algorithm_name = 'Lasso Regression (using GridSearchCV)', 
                  model = lasso_grid, 
                  alpha = lasso_grid.best_params_.get('alpha'))

score_card

In [None]:
tuned_paramaters = [{'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 20, 40, 60],
                      'l1_ratio':[0.0001, 0.0002, 0.001, 0.01, 0.1, 0.2, 0.4, 0.55]}]

enet = ElasticNet()

enet_grid = GridSearchCV(estimator = enet, 
                          param_grid = tuned_paramaters, 
                          cv = 10)

enet_grid.fit(X_train, y_train)

print('Best parameters for Elastic Net Regression: ', enet_grid.best_params_, '\n')

print('RMSE on test set:', get_test_rmse(enet_grid))

In [None]:
update_score_card(algorithm_name = 'Elastic Net Regression (using GridSearchCV)', 
                  model = enet_grid, 
                  alpha = enet_grid.best_params_.get('alpha'), 
                  l1_ratio = enet_grid.best_params_.get('l1_ratio'))

score_card

# 10. Displaying score summary<a id="dis_sco_sum"></a>

In [None]:
score_card = score_card.sort_values('Test_RMSE').reset_index(drop = True)

score_card.style.highlight_min(color = 'lightblue', subset = 'Test_RMSE')

Interpretation: I can see that Lasso Regression (using GridSearchCV) has the lowest test RMSE.

In [None]:
# plot the accuracy measure for all models
# secondary_y: specify the data on the secondary axis
score_card.plot(secondary_y=['R-Squared','Adj. R-Squared'])

# display just the plot
plt.show()

The graph shows the performance metrics (root mean squared error, R-squared and adjusted R-squared) of the models implemented; the X-axis shows the model number as given in the table. 
The plot shows the inverse relation between R-squared and RMSE: the higher the R-squared, the lower the RMSE.
The findings suggest that Lasso Regression (using GridSearchCV) has the lowest test RMSE. It can therefore be used to predict the amount of carbon dioxide emissions.

# 11. Conclusion<a id="conclu"></a>

**Of all the optimization techniques used, Lasso Regression using Grid Search CV has been the most effective in reducing RMSE. The exact combination of features responsible for high CO2 emissions cannot be determined, since all the features are highly correlated. I can hereby conclude that I have successfully built a model that can predict the amount of CO2 emissions across different vehicle types with high accuracy.**

# 12. Deployment<a id="deploy"></a>

https://coemission.herokuapp.com/

# 13. References<a id="Refer"></a>

https://reader.elsevier.com/reader/sd/pii/S2352484719301088?token=807922D7C5CF2E7E78C846212A5D7F97FFCC0B513EDBEAAC2626D7FB0DBE7EFE67FEBE723E7610FC62CA1FA0F5B5110A&originRegion=eu-west-1&originCreation=20210510125616

https://sci-hub.se/https://ieeexplore.ieee.org/abstract/document/7984819

https://scihub.se/https://www.sciencedirect.com/science/article/abs/pii/S0959652620329875

<table align="center" width=100%>
    <tr>
        <td width="30%">
            <img src="https://i.pinimg.com/originals/60/00/50/600050674a955d69dc5930c45321be30.gif">
        </td>
        <td>
            <div align="center">
                <font color="#208807 " size=24px>
                    <b>Thank You.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>