# Polynomial Regression
---
In this notebook we will look at another regression model: polynomial regression. It is closely related to the linear regression algorithm that we saw earlier. In polynomial regression, we fit a polynomial equation to data in which the target variable has a curvilinear relationship with the independent variables. In other words, the value of the target (y) does not change at a constant rate with respect to the independent variable (x).
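
As a sketch, for a single independent variable a degree-2 polynomial model can be written as follows. The coefficient symbols $\beta_0, \beta_1, \beta_2$ and the error term $\varepsilon$ are notation introduced here for illustration only:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon
$$

The model is still linear in the coefficients, which is why it can be fitted with the same machinery as linear regression.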

## Importing Project Dependencies
---

Let us begin by importing all the necessary modules.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the Dataset
---
In this notebook we will build our polynomial regression model on a fuel consumption dataset, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada.

- **MODELYEAR** e.g. 2014
- **MAKE** e.g. Acura
- **MODEL** e.g. ILX
- **VEHICLE CLASS** e.g. SUV
- **ENGINE SIZE** e.g. 4.7
- **CYLINDERS** e.g. 6
- **TRANSMISSION** e.g. A6
- **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9
- **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9
- **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2
- **CO2 EMISSIONS (g/km)** e.g. 182

**Goal:**
* Predict the CO<sub>2</sub> emissions generated by the vehicles.

In [2]:
# Importing our dataset
df = pd.read_csv('https://raw.githubusercontent.com/OneStep-elecTRON/ContentSection/main/Datasets/fuel_consumption_co2.csv')

# Checking the top 5 rows of our data
df.head()

|   | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |


Now, let us have a look at the basic info regarding our data.

In [3]:
#checking different column info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   MODELYEAR                 1067 non-null   int64  
 1   MAKE                      1067 non-null   object 
 2   MODEL                     1067 non-null   object 
 3   VEHICLECLASS              1067 non-null   object 
 4   ENGINESIZE                1067 non-null   float64
 5   CYLINDERS                 1067 non-null   int64  
 6   TRANSMISSION              1067 non-null   object 
 7   FUELTYPE                  1067 non-null   object 
 8   FUELCONSUMPTION_CITY      1067 non-null   float64
 9   FUELCONSUMPTION_HWY       1067 non-null   float64
 10  FUELCONSUMPTION_COMB      1067 non-null   float64
 11  FUELCONSUMPTION_COMB_MPG  1067 non-null   int64  
 12  CO2EMISSIONS              1067 non-null   int64  
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB
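
The non-null counts reported by `df.info()` can also be checked directly. The following is a minimal sketch on a hypothetical toy frame (`toy_df` is made up for illustration and is not the real dataset); the same two calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real dataset) with one missing value
toy_df = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, np.nan],
    "CO2EMISSIONS": [196, 221, 136],
})

# Count missing values per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# True only if the whole frame has no missing values
has_no_missing = not toy_df.isnull().values.any()
print(has_no_missing)
```

On a frame without missing values, every per-column count would be 0 and `has_no_missing` would be `True`.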


As we can see, there are no null values within our dataset. Now, let us have a look at the summary statistics of the data.

In [4]:
# Gives summary statistics (count, mean, std, min, quartiles, max) for each numeric column
df.describe()

|   | MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|
| count | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 |
| mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
| std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.79451 | 3.485595 | 7.468702 | 63.372304 |
| min | 2014.0 | 1.0 | 3.0 | 4.6 | 4.9 | 4.7 | 11.0 | 108.0 |
| 25% | 2014.0 | 2.0 | 4.0 | 10.25 | 7.5 | 9.0 | 21.0 | 207.0 |
| 50% | 2014.0 | 3.4 | 6.0 | 12.6 | 8.8 | 10.9 | 26.0 | 251.0 |
| 75% | 2014.0 | 4.3 | 8.0 | 15.55 | 10.85 | 13.35 | 31.0 | 294.0 |
| max | 2014.0 | 8.4 | 12.0 | 30.2 | 20.5 | 25.8 | 60.0 | 488.0 |


Now that we have checked for null values and reviewed the key information about each column, it's time to build our model. Let's start by defining our features and target variable.

In [5]:
# Keeping only the numeric columns that are relevant to the model
new_df = df[['ENGINESIZE', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'FUELCONSUMPTION_COMB', 'FUELCONSUMPTION_COMB_MPG', 'CO2EMISSIONS']]

# Features (X) and target (y); y is kept as a one-column DataFrame
X = new_df.drop(columns=['CO2EMISSIONS'])
y = new_df[['CO2EMISSIONS']]

Now, let us split the data into training and test sets. For this, we will use sklearn's built-in data splitting method, train_test_split.

In [6]:
# Step 1- Importing train_test_split method
from sklearn.model_selection import train_test_split

# Step 2- Performing the data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

# Step 3- Printing and checking the shape of the splits
X_train.shape , y_train.shape , X_test.shape , y_test.shape 

((853, 5), (853, 1), (214, 5), (214, 1))
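
To see where the 853 and 214 come from, here is a small sketch that reproduces the split sizes on placeholder arrays. `X_dummy` and `y_dummy` are hypothetical stand-ins with the same number of rows as the dataset, not the real data. It assumes the current scikit-learn behaviour of rounding the test set size up when `test_size` is a float.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1067  # number of rows reported by df.info()
test_size = 0.2

# With a float test_size, the test set size is rounded up
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test
print(expected_train, expected_test)

# Placeholder arrays standing in for the real features and target
X_dummy = np.zeros((n_samples, 5))
y_dummy = np.zeros((n_samples, 1))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=test_size, random_state=101
)
print(X_tr.shape, X_te.shape)
```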

Now that we are done pre-processing the data, let us start working on creating and training our models.

## Modeling
---

First, we will create a simple linear regression model, followed by a polynomial regression model, which will help us to see the improvement in performance, if any.

In [7]:
# Step 1- Importing linear regression model class from sklearn
from sklearn.linear_model import LinearRegression

# Step 2- Creating model class object
model = LinearRegression()

# Step 3- Training linear regression model
model = model.fit(X_train,y_train)

# Step 4- Evaluating the trained model
y_preds = model.predict(X_test)
y_preds = np.round(y_preds) # rounding predictions to the nearest integer value

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("Mean absolute error =", mean_absolute_error(y_test, y_preds))
print("Mean squared error =", mean_squared_error(y_test, y_preds))

Mean absolute error = 14.289719626168225
Mean squared error = 404.2710280373832
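
For intuition about the two metrics, the sketch below computes them by hand with NumPy on a few made-up values. The numbers in `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the dataset.

```python
import numpy as np

# Hypothetical true and predicted values, for illustration only
y_true_toy = np.array([200.0, 250.0, 300.0])
y_pred_toy = np.array([210.0, 240.0, 340.0])

errors = y_true_toy - y_pred_toy   # [-10, 10, -40]

mae = np.mean(np.abs(errors))      # average size of the errors
mse = np.mean(errors ** 2)         # average squared error, penalises large misses more
rmse = np.sqrt(mse)                # back in the units of the target

print("MAE =", mae)
print("MSE =", mse)
print("RMSE =", rmse)
```

Because the errors are squared, the single miss of 40 dominates the MSE, while the MAE weights all three errors equally.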


Now, we will create another model, this time a polynomial regression one. The scikit-learn library doesn't come with a dedicated polynomial regression model class. So, we will first transform the features into polynomial form, and then fit a linear regression model to the transformed data. The end result is a polynomial regression model. Let's see how to implement this.

In [8]:
# Step 1- Importing the polynomial transformation class
from sklearn.preprocessing import PolynomialFeatures

# Step 2- Creating the transformer class object
poly_transform = PolynomialFeatures(degree = 2) # will convert to a second degree equation

# Step 3- Performing polynomial transformation
X_train_poly = poly_transform.fit_transform(X_train)

# Step 4- Fitting transformed data to linear model
lin_reg_2=LinearRegression()
lin_reg_2.fit(X_train_poly, y_train)

# Step 5- Evaluating the model
# The transformer was already fitted on the training data, so we only transform the test data
X_test_poly = poly_transform.transform(X_test)
y_preds = lin_reg_2.predict(X_test_poly)

print("Mean absolute error =", mean_absolute_error(y_test, y_preds))
print("Mean squared error =", mean_squared_error(y_test, y_preds))

Mean absolute error = 9.70713172061367
Mean squared error = 258.2293567470115
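
To make the polynomial transformation step more concrete, here is a small sketch of what `PolynomialFeatures(degree=2)` produces. The input `sample` is a hypothetical row with two features a = 2 and b = 3, chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features: a = 2, b = 3
sample = np.array([[2.0, 3.0]])

# Degree-2 expansion yields the columns: 1, a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)

# With five input features, as in X, the degree-2 expansion has 21 columns:
# 1 bias + 5 linear + 5 squared + 10 pairwise interaction terms
n_out = PolynomialFeatures(degree=2).fit_transform(np.zeros((1, 5))).shape[1]
print(n_out)
```

The linear model fitted afterwards simply learns one coefficient per expanded column.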


As we can see, both error metrics are noticeably lower for the polynomial model than for the simple linear model: the mean absolute error drops from about 14.3 to 9.7, and the mean squared error from about 404.3 to 258.2. You can experiment with polynomials of different degrees and compare the results. Choose the degree that gives a low error on data the model was not trained on, and check that the training error is not much lower than that error, since a large gap indicates overfitting.
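
One possible way to compare degrees is sketched below. It uses a pipeline and 5-fold cross-validation on synthetic data: the arrays `x` and `y_syn` are generated here from a known quadratic and are assumptions for illustration, not the fuel consumption data. The pipeline is a design choice of this sketch rather than something the notebook above uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a known quadratic plus noise (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y_syn = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2 + rng.normal(0, 1.0, size=200)

cv_mae = {}
for degree in (1, 2, 3):
    # The pipeline refits the transformer and the model inside each fold
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(
        pipeline, x, y_syn, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[degree] = -scores.mean()  # flip the sign to get a positive MAE
    print("degree", degree, "cross-validated MAE =", round(cv_mae[degree], 3))
```

On this synthetic data the degree-2 model should score far better than the straight line, while going to degree 3 should bring little or no further gain. Wrapping the transformer and the regressor in one pipeline also means the transformer is fitted on the training portion only.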

With this, we come to the end of our polynomial regression tutorial. Go through the tutorial once again before moving on to the quiz. 