![](https://wallpaper.wiki/wp-content/uploads/2017/04/wallpaper.wiki-Full-HD-Wallpapers-1080p-Cars-PIC-WPC002339-1.jpg)

## Problem Statement:

A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts. 

 

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

* Which variables are significant in predicting the price of a car
* How well those variables describe the price of a car

## Business Goal
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.


#### We will follow these steps:

1. [Step 1: Reading and Understanding the Data](#1)
1.  [Step 2: Cleaning the Data](#2)
    - Missing Value check
    - Data type check
    - Duplicate check
1. [Step 3: Data Visualization](#3)
    - Boxplot
    - Pairplot
1. [Step 4: Data Preparation](#4) 
   - Dummy Variable
1. [Step 5: Splitting the Data into Training and Testing Sets](#5)
   - Rescaling
1. [Step 6: Building a Linear Model](#6)
   - RFE
   - VIF
1. [Step 7: Residual Analysis of the train data](#7)
1. [Step 8: Making Predictions Using the Final Model](#8)
1. [Step 9: Model Evaluation](#8)
   - RMSE Score


**Step 1 : Reading and Understanding the Data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
# setting file path
df=pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')
df.head() # gives only the first 5 rows

In [None]:
# rows, column
df.shape

In [None]:
df.info()

In [None]:
df.describe()

# Step 2: Cleaning the Data

In [None]:
# dropping the car_ID as it is not affecting the car price

df.drop('car_ID',axis=1,inplace=True)
df.head()

In [None]:
# Checking if the dataframe has any missing values

print(df.isnull().values.any())

In [None]:
# to see the data type of each column
df.dtypes

In [None]:
# Outlier analysis of the target variable (price)

outliers = ['price']
plt.rcParams['figure.figsize'] = [8,8]
sns.boxplot(data=df[outliers], orient="v", palette="Set1",whis=1.5, saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Price Range", fontweight = 'bold')
plt.xlabel("Continuous Variable", fontweight = 'bold')
df.shape

### Insights:
### A few prices above 36000 can be termed outliers, but rather than removing them we will use standardization scaling.
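
The `whis=1.5` argument in the boxplot above means a point is drawn as an outlier when it lies more than 1.5 × IQR beyond the quartiles. A minimal sketch of that rule, using made-up prices rather than the dataset's values (the names `lower_fence` and `upper_fence` are only illustrative):

```python
import numpy as np

# Hypothetical prices for illustration only (not from CarPrice_Assignment.csv)
prices = np.array([5000, 7000, 8000, 9000, 10000, 12000, 15000, 40000], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as outliers when whis=1.5
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers)
```

Only the single very high price falls outside the fences here, which mirrors the situation described for the car prices.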

In [None]:
# putting all subcategories into a single category
# Using only the car company names

# keep only the first word of CarName (the company name)
df['CarName'] = df['CarName'].str.split(' ').str[0]
df['CarName'].head()

In [None]:
# checking the unique car companies

df['CarName'].unique()

In [None]:
# renaming the typos in car company names
#replace(wrong:correct)

df['CarName']=df['CarName'].replace({'maxda':'mazda',
                                     'nissan':'Nissan', 
                                     'toyouta':"toyota",
                                     'porcshce':'porsche',
                                     'vokswagen':'volkswagen',
                                     'vw':'volkswagen'
                                    })


In [None]:
# changing the datatype of 'symboling' from int64 to string as it is a categorical variable as per dictionary file.
# check the result of 'df.dtypes' above.

df['symboling']=df['symboling'].astype(str)
df['symboling'].head()

In [None]:
# checking for duplicate values

df.loc[df.duplicated()]

# if no rows are printed, there are no duplicate rows

In [None]:
# Segregation of columns into numerical and categorical variables
# df.select_dtypes?

cat_col = df.select_dtypes(include='object').columns
num_col = df.select_dtypes(exclude='object').columns
df_cat = df[cat_col]
df_num = df[num_col]

In [None]:

df_cat.head(2)

In [None]:
df_num.head(2)

# Step 3: Visualizing the Data

In [None]:
df['CarName'].value_counts()

In [None]:
# Visualizing the different car names available

plt.figure(figsize=(15,8))
ax=df['CarName'].value_counts().plot(kind='bar',stacked=True, colormap = 'Set2')
plt.title(label = 'CarName')
plt.xlabel("Names of the Car",fontweight = 'bold')
plt.ylabel("Count of Cars",fontweight = 'bold')
plt.show()

Insights:

* Toyota is the most frequent car company in the dataset.

* Mercury is the least frequent car company in the dataset.

In [None]:
# Visualizing the distribution of car prices

plt.figure(figsize=(8,8))
plt.title('Car Price Distribution Plot')
# histplot with kde=True replaces the deprecated sns.distplot (needs seaborn >= 0.11)
sns.histplot(df['price'], kde=True) # shows the distribution of a univariate set of observations

The distribution is right-skewed; most cars appear to be priced below 18000.


**Visualising Numeric Variables -**

Pairplot of all the numeric variables.

* The pairplot builds on two basic figures, the histogram and the scatter plot. The **histogram** on the diagonal allows us to see the distribution of a single variable while the **scatter plots** on the upper and lower triangles show the relationship (or lack thereof) between two variables(**bivariate relationships**).

* Correlation - Relationship between two variables.
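
To make the idea of correlation concrete, here is a small sketch that computes the Pearson correlation coefficient for two made-up columns. The values are hypothetical and are not taken from the dataset; a coefficient near -1 indicates a strong negative linear relationship, and one near +1 a strong positive one.

```python
import numpy as np

# Hypothetical values for illustration only
engine_size = np.array([90, 110, 130, 150, 180, 200], dtype=float)
city_mpg = np.array([38, 31, 27, 24, 19, 17], dtype=float)

# Pearson correlation: covariance scaled by both standard deviations
r = np.corrcoef(engine_size, city_mpg)[0, 1]
print(round(r, 3))
```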




In [None]:
ax = sns.pairplot(df[num_col])

**Insights:**

* carwidth, carlength, curbweight, enginesize and horsepower seem to have a **positive** correlation with price.

* carheight doesn't show any significant trend with price.

* citympg and highwaympg seem to have a significant **negative** correlation with price.

**Visualising a few Categorical Variables**

**Boxplot** of all the categorical variables.

* It makes comparing characteristics of data between categories very easy.

* We can tell if our data is pulled in one direction.

* An easy way to identify outliers.

**Subplots -**  
syntax - subplot(rows,cols,index)

It provides a way to plot multiple plots on a single figure.

Eg - here we have printed a 3x3 matrix of boxplots.

In [None]:
plt.figure(figsize=(20, 15))
plt.subplot(3,3,1)
sns.boxplot(x = 'doornumber', y = 'price', data = df)
plt.subplot(3,3,2)
sns.boxplot(x = 'fueltype', y = 'price', data = df)
plt.subplot(3,3,3)
sns.boxplot(x = 'aspiration', y = 'price', data = df)
plt.subplot(3,3,4)
sns.boxplot(x = 'carbody', y = 'price', data = df)
plt.subplot(3,3,5)
sns.boxplot(x = 'enginelocation', y = 'price', data = df)
plt.subplot(3,3,6)
sns.boxplot(x = 'drivewheel', y = 'price', data = df)
plt.subplot(3,3,7)
sns.boxplot(x = 'enginetype', y = 'price', data = df)
plt.subplot(3,3,8)
sns.boxplot(x = 'cylindernumber', y = 'price', data = df)
plt.subplot(3,3,9)
sns.boxplot(x = 'fuelsystem', y = 'price', data = df)
plt.show()

**Insights**

* Cars with diesel **fueltype** are comparatively more expensive than cars with gas fueltype.

* All other carbody types are relatively cheaper than the **convertible** carbody.

* Cars with rear **enginelocation** are far more expensive than cars with front enginelocation.

* The price of a car tends to rise with the **number of cylinders** in most cases.

* **Enginetype** ohcv falls in the higher price range.

* **doornumber** doesn't affect the price much.

* Higher-end cars seem to have rwd **drivewheel**.

In [None]:
plt.figure(figsize=(25, 6))

plt.subplot(1,3,1)
plt1 = df['cylindernumber'].value_counts().plot(kind = 'bar')
plt.title('Number of cylinders')
plt1.set(xlabel = 'Number of cylinders', ylabel='Frequency of Number of cylinders')

plt.subplot(1,3,2)
plt1 = df['fueltype'].value_counts().plot(kind = 'bar')
plt.title('Fuel Type')
plt1.set(xlabel = 'Fuel Type', ylabel='Frequency of Fuel type')

plt.subplot(1,3,3)
plt1 = df['carbody'].value_counts().plot(kind = 'bar')
plt.title('Car body')
plt1.set(xlabel = 'Car Body', ylabel='Frequency of Car Body')

**Insights:**
* Most cars have four cylinders.
* Gas-fueled cars far outnumber diesel-fueled cars.
* Sedan is the most common car body type.

Relationship between **fuelsystem** vs **price** with hue **fueltype**

**hue** - colours the data by a categorical variable, here 'fueltype'.

In [None]:
plt.figure(figsize = (10,6))
sns.boxplot(x = 'fuelsystem', y = 'price', hue = 'fueltype', data = df)
plt.show()

Relationship between **carbody** vs **price** with hue **enginelocation**.

In [None]:
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'carbody', y = 'price', hue = 'enginelocation', data = df)
plt.show()

Relationship between **cylindernumber** vs **price** with hue **fueltype**.

In [None]:
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'cylindernumber', y = 'price', hue = 'fueltype', data = df)
plt.show()

**Derived Metrics**

CarName grouped by average price


In [None]:
plt.figure(figsize=(20, 6))

dfx = pd.DataFrame(df.groupby(['CarName'])['price'].mean().sort_values(ascending = False))
dfx.plot.bar()
plt.title('Car Company Name vs Average Price')
plt.show()

Insights:
* Jaguar, Buick and Porsche seem to have the highest average prices.

**car body** grouped with respect to their average prices

In [None]:
plt.figure(figsize=(10, 10))

dfx=df.groupby(['carbody'])['price'].mean().sort_values(ascending=False)
dfx.plot.bar()
plt.title('Car Body Name vs Average Price')
plt.show()

**Insights**:

* hardtop and convertible seem to have the highest average prices.

In [None]:
# Binning the car companies by the average price of each company.
# Binning - putting values into buckets

df['price'] = df['price'].astype('int')
dfx = df.copy()
grouped = dfx.groupby(['CarName'])['price'].mean()
print(grouped)

dfx = dfx.merge(grouped.reset_index(), how='left', on='CarName')
bins = [0,10000,20000,40000]
label =['Budget_Friendly','Medium_Range','TopNotch_Cars']
df['Cars_Category'] = pd.cut(dfx['price_y'], bins, right=False, labels=label)
df.head()
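
The effect of `right=False` in `pd.cut` is easy to miss: each bin then includes its left edge and excludes its right edge. A minimal sketch on made-up brand averages (the brand names and prices are hypothetical) shows where the boundary values land:

```python
import pandas as pd

# Hypothetical brand-level average prices (illustrative only)
avg_price = pd.Series([7500.0, 10000.0, 19999.0, 34000.0],
                      index=['brand_a', 'brand_b', 'brand_c', 'brand_d'])

bins = [0, 10000, 20000, 40000]
labels = ['Budget_Friendly', 'Medium_Range', 'TopNotch_Cars']

# right=False makes each bin closed on the left: [0, 10000), [10000, 20000), [20000, 40000)
category = pd.cut(avg_price, bins, right=False, labels=labels)
print(category.astype(str).to_dict())
```

A brand averaging exactly 10000 is therefore placed in `Medium_Range`, not `Budget_Friendly`.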

**Significant variables after visualization (as per the problem statement)-**

* Cars_Category, Engine Type, Engine Location, Fuel Type
* Car Body, Aspiration, Cylinder Number
* Drivewheel, Curbweight, Car Length
* Car Width, Engine Size, Boreratio
* Horse Power, Wheel Base
* citympg, highwaympg
* price (the target variable)

**Unused variables-**
* Symboling
* car_ID
* CarName
* doornumber - didn't affect price much
* carheight - didn't show any significant trend with price
* fuelsystem
* stroke
* compression ratio
* peakrpm


In [None]:
# List of significant columns
sig_col = ['Cars_Category','fueltype', 'aspiration','carbody','drivewheel','enginelocation', 'wheelbase', 'carlength', 'carwidth', 
           'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio', 'horsepower', 'citympg', 'highwaympg', 'price']
len(sig_col)

In [None]:
# Keeping only the significant columns in the data frame

df = df[sig_col]

**Step 4: Data Preprocessing**

**Dummy Variables**

The variable **carbody** has five levels (convertible, hatchback, sedan, wagon, hardtop).

To use **categorical variables** in a regression, we convert them into 0/1 indicator columns.

These indicator columns are called **dummy variables**.

In [None]:
# Categorical variables found previously
cat_col

In [None]:
# List of significant categorical variables
sig_cat_col = ['Cars_Category','fueltype','aspiration','carbody','drivewheel','enginelocation','enginetype','cylindernumber']

In [None]:
# Get the dummy variables for the categorical feature and store it in a new variable - 'dummy1'

dummy1 = pd.get_dummies(df[sig_cat_col])
print(dummy1.shape)
dummy1

Avoiding the dummy variable trap -

In [None]:
# It is good practice to drop the first dummy after one-hot encoding,
# because the dropped dummy can be expressed as a linear combination of the others.
# Therefore, drop_first = True
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions return booleans by default)

dummy1 = pd.get_dummies(df[sig_cat_col], drop_first=True, dtype=int)
print(dummy1.shape)
dummy1
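
The dummy variable trap can be checked numerically. In the sketch below (made-up `carbody` values, hypothetical variable names), the full set of dummies sums to the intercept column, so the design matrix loses a rank; dropping the first level restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical carbody values (illustrative only)
body = pd.Series(['sedan', 'wagon', 'hatchback', 'sedan', 'wagon', 'hatchback'])

full = pd.get_dummies(body, dtype=int)                      # one column per level
reduced = pd.get_dummies(body, drop_first=True, dtype=int)  # first level dropped

# Design matrices with an intercept column of ones
X_full = np.column_stack([np.ones(len(body)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(body)), reduced.to_numpy()])

# Number of columns vs. rank: equal only when the columns are linearly independent
print(X_full.shape[1], np.linalg.matrix_rank(X_full))
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```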

In [None]:
# concatenating the dataframe with the dummy variables
df = pd.concat([df,dummy1], axis = 1)

In [None]:
# dropping the significant categorical columns, as their dummy variables have already been added to the dataframe
df.drop(sig_cat_col, axis = 1, inplace = True)
df.shape

**Step 5: Splitting the Data into Training and Testing Sets**

As we know, the first basic step for regression is performing a train-test split.

In [None]:
df

In [None]:
# We specify this so that the train and test data set always have the same rows, respectively
# We divide the df into 70/30 ratio

np.random.seed(0)

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
df_train

In [None]:
df_test

* **Feature Scaling**

Here we rescale the data using **standardisation**.

Scaling is applied to the significant numerical columns.

The significant **categorical columns** have already been converted into **dummies**.

In [None]:
# Numerical variables found previously
num_col

In [None]:
# List of significant numerical variables
sig_num_col = ['wheelbase', 'carlength', 'carwidth', 'curbweight',
       'enginesize', 'boreratio', 'horsepower','citympg', 'highwaympg', 'price']

In [None]:
# We apply feature scaling only on the numerical variables
# since categorical variables are already converted to 0 and 1 using dummies.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_train[sig_num_col] = scaler.fit_transform(df_train[sig_num_col])
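
What `fit_transform` does here can be sketched by hand: the mean and standard deviation are learned from the training data only, and the same two numbers are later reused for the test data. The values and the names `mu` and `sigma` below are hypothetical.

```python
import numpy as np

# Hypothetical horsepower values (illustrative only)
train = np.array([60.0, 80.0, 100.0, 120.0, 140.0])
test = np.array([100.0, 160.0])

# "Fit": learn the mean and standard deviation from the training data only
mu = train.mean()
sigma = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

# "Transform": apply the training statistics to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set keeps information about the test data out of the model-building step.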

In [None]:
df_train.head() 

**Checking the correlation coefficients to see which variables are highly correlated.**

A **heatmap** is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors. 

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df_train.corr(), cmap= 'coolwarm')
plt.show()

**Insights**

* The legend on the right shows the correlation coefficient, with blue marking low and red marking high correlation between variables.

Let's look at scatterplots of a few correlated variables vs price.

In [None]:
col = ['highwaympg','citympg','horsepower','enginesize','curbweight','carwidth']

In [None]:
# Scatter Plot of independent variables vs dependent variables

plt.figure(figsize=(18,15))

plt.subplot(2,3,1)
sns.scatterplot(x=col[0],y='price',data =df)

plt.subplot(2,3,2)
sns.scatterplot(x=col[1],y='price',data =df)

plt.subplot(2,3,3)
sns.scatterplot(x=col[2],y='price',data =df)

plt.subplot(2,3,4)
sns.scatterplot(x=col[3],y='price',data =df)

plt.subplot(2,3,5)
sns.scatterplot(x=col[4],y='price',data =df)

plt.subplot(2,3,6)
sns.scatterplot(x=col[5],y='price',data =df)

'''
fig,axes = plt.subplots(2,3,figsize=(18,15))
for seg,colm in enumerate(col):
    x,y = seg//3,seg%3
    an=sns.scatterplot(x=colm, y='price' ,data=df, ax=axes[x,y])
    plt.setp(an.get_xticklabels(), rotation=45)
   
plt.subplots_adjust(hspace=0.5)
'''

* The plots above suggest that a straight line can be fitted to each relationship.

Dividing into X and Y sets for the model building

In [None]:
y_train = df_train.pop('price') # dependent variable
x_train = df_train # Taking all the independent variables into x_train

In [None]:
y_train

In [None]:
x_train

**Step 6: Building a Linear Model**

In [None]:
import statsmodels.api as sm

x_train_copy = x_train

In [None]:
# Add a constant
x_train_copy1 = sm.add_constant(x_train_copy['horsepower'])

# Create a first fitted model
#1st model
lr1=sm.OLS(y_train,x_train_copy1).fit()

In [None]:
lr1.params

In [None]:
# Let's visualise the data with a scatter plot and the fitted regression line

plt.scatter(x_train_copy1.iloc[:, 1], y_train)
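# use the fitted coefficients instead of a hard-coded slope (about 0.8062 in the original run)
plt.plot(x_train_copy1.iloc[:, 1], lr1.params['const'] + lr1.params['horsepower'] * x_train_copy1.iloc[:, 1], 'r')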
plt.show()

In [None]:
print(lr1.summary())
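
Because both `horsepower` and `price` were standardised on the training set, the one-variable fit has a convenient property: the slope equals the Pearson correlation, the intercept is zero, and R-squared is the slope squared. That is consistent with the numbers in this notebook, since 0.8062² ≈ 0.650. A sketch on synthetic data (not the car dataset) that checks these three facts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Standardise both variables (zero mean, unit variance)
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

# Least-squares fit of ys = intercept + slope * xs
slope, intercept = np.polyfit(xs, ys, 1)
r = np.corrcoef(xs, ys)[0, 1]

fitted = intercept + slope * xs
r_squared = 1 - np.sum((ys - fitted) ** 2) / np.sum((ys - ys.mean()) ** 2)

print(np.isclose(slope, r), np.isclose(intercept, 0.0), np.isclose(r_squared, slope ** 2))
```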

**Adding another variable**

The **R-squared** value obtained is **0.65**. Since we have so many variables, we can clearly do better than this. So let's go ahead and add the other highly correlated variable, i.e. **curbweight**.

In [None]:
# Add a constant
x_train_2copy = sm.add_constant(x_train[['horsepower','curbweight']])

# Create a 2nd fitted model
lr2 = sm.OLS(y_train, x_train_2copy).fit()

In [None]:
lr2.params

In [None]:
print(lr2.summary())

* The R-squared increased from 0.650 to 0.797.

**Adding another variable**

The R-squared value obtained is **0.797**. Since we have so many variables, we can clearly do better than this. So let's add another correlated variable, i.e. **enginesize**.

In [None]:
# Add a constant
x_train_3copy = sm.add_constant(x_train[['horsepower','curbweight', 'enginesize']])

# Create a 3rd fitted model
lr3 = sm.OLS(y_train, x_train_3copy).fit()

In [None]:
lr3.params

In [None]:
print(lr3.summary())

We have achieved an **R-squared** of **0.819** by manually picking the highly correlated variables. Now let's use **RFE** to select the independent variables that best predict the dependent variable price.

**Recursive Feature Elimination (RFE)**

Recursive Feature Elimination (RFE) works by recursively removing attributes and rebuilding the model on those that remain. At each step it ranks the attributes by the fitted model's coefficients (or feature importances) and discards the weakest, until the requested number of attributes is left.

We use this as there are too many independent variables.
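
The elimination loop can be sketched from scratch for a linear model. This is a simplified illustration of the idea and not scikit-learn's implementation; the function name `rfe_linear` and the synthetic data are hypothetical, and the sketch assumes the features are on comparable scales so that coefficient sizes can be compared.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit ordinary least squares (with an intercept) on the remaining features
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Index of the weakest feature, skipping the intercept at position 0
        weakest = int(np.argmin(np.abs(coef[1:])))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))  # synthetic features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

Only columns 0 and 3 drive `y` in this synthetic setup, so they are the ones expected to survive the elimination.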


In [None]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(x_train, y_train)

In [None]:
# Running RFE with the output number of the variable equal to 15
from sklearn.feature_selection import RFE

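# recent scikit-learn versions require n_features_to_select as a keyword argument
rfe = RFE(lm, n_features_to_select=15)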
rfe=rfe.fit(x_train,y_train)

Checking which variables RFE selected

In [None]:
list(zip(x_train.columns,rfe.support_,rfe.ranking_))


In [None]:
# List of Columns which supports the RFE
# Selecting the variables which are in support

col_sup = x_train.columns[rfe.support_]
col_sup


In [None]:
#Dropping 'enginetype_rotor' as per the rfe values in google colab
col_sup = col_sup.drop(col_sup[10])

# adding 'cylindernumber_two' as per the rfe values in google colab
col_sup = col_sup.insert(14,'cylindernumber_two')

In [None]:
col_sup

In [None]:
# Creating X_train dataframe with RFE selected variables

x_train_rfe = x_train[col_sup]
x_train_rfe

**(Theory)**

**VIF - Variance Inflation Factor**

Checking VIF -

VIF gives a basic quantitative idea of how strongly the feature variables are correlated with each other. It is an extremely important parameter for testing our linear model.

VIF values of variables should be less than 5 to be accepted.

The **formula** for calculating VIF is:

$$VIF_i = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the R-squared obtained by regressing the $i$-th feature on all the other features.
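
The formula can be applied directly: regress one feature on the others, take that regression's R-squared, and plug it in. The sketch below uses NumPy on synthetic data; the helper name `vif` is hypothetical. It adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two may differ when no constant column is passed.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so collinear with it
X = np.column_stack([a, b, c])

print([round(vif(X, i), 1) for i in range(3)])
```

The independent column `b` gets a VIF close to 1, while `a` and `c` get values far above the threshold of 5 because each is almost fully explained by the other.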

**(Step)**

Starting from the columns selected by RFE, we manually evaluate each model's p-values and VIF values. Until all p-values and VIFs fall within the acceptable range, we keep dropping variables one at a time based on the criteria below.

We want **p-value** less than **0.05**

**Drop the variable if**
* High p-value, High VIF

**Drop the variable with high p-value first if**
* High p-value, Low VIF or Low p-value, High VIF 

**Accept the variable if**
* Low p-value Low VIF 
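
One way to read the criteria above is as a small decision rule: handle insignificant variables first, then variables with a high VIF, and stop when everything passes. The helper `next_to_drop`, its thresholds as parameters, and the sample numbers are all hypothetical; this is a sketch of that reading rather than the exact procedure used in the cells that follow.

```python
def next_to_drop(p_values, vifs, p_max=0.05, vif_max=5.0):
    """Return the feature to drop next, or None if every feature passes.

    Features with a p-value above p_max are handled first (worst one first);
    only then are features with a VIF above vif_max considered.
    """
    high_p = {f: p for f, p in p_values.items() if p > p_max}
    if high_p:
        return max(high_p, key=high_p.get)
    high_vif = {f: v for f, v in vifs.items() if v > vif_max}
    if high_vif:
        return max(high_vif, key=high_vif.get)
    return None

# Hypothetical p-values and VIFs (illustrative only)
p_values = {'feat_a': 0.001, 'feat_b': 0.355, 'feat_c': 0.020}
vifs = {'feat_a': 2.1, 'feat_b': 1.4, 'feat_c': 5.5}

print(next_to_drop(p_values, vifs))
```

In practice the model is refitted after every drop, because p-values and VIFs of the remaining variables change once a variable is removed.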

In [None]:
# Adding a constant variable and Build a first fitted model
import statsmodels.api as sm  
x_train_rfec = sm.add_constant(x_train_rfe)
lm_rfe = sm.OLS(y_train,x_train_rfec).fit()

#Summary of linear model
print(lm_rfe.summary())

Looking at the p-values, some of the variables aren't really significant (in the presence of other variables), so we need to drop them.

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe.values, i) for i in range(x_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

We generally want a VIF that is less than 5. So there are clearly some variables we need to drop.

**Dropping the variable and updating the model**

Dropping **cylindernumber_twelve** because its p-value is **0.355**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe1 = x_train_rfe.drop('cylindernumber_twelve', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe1c = sm.add_constant(x_train_rfe1)
lm_rfe1 = sm.OLS(y_train, x_train_rfe1c).fit()

#Summary of linear model
print(lm_rfe1.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe1.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe1.values, i) for i in range(x_train_rfe1.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **cylindernumber_six** because its p-value is **0.490**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe2 = x_train_rfe1.drop('cylindernumber_six', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe2c = sm.add_constant(x_train_rfe2)
lm_rfe2 = sm.OLS(y_train, x_train_rfe2c).fit()

#Summary of linear model
print(lm_rfe2.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe2.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe2.values, i) for i in range(x_train_rfe2.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **carbody_hardtop** because its p-value is **0.05**, which is not below the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe3 = x_train_rfe2.drop('carbody_hardtop', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe3c = sm.add_constant(x_train_rfe3)
lm_rfe3 = sm.OLS(y_train, x_train_rfe3c).fit()

#Summary of linear model
print(lm_rfe3.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe3.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe3.values, i) for i in range(x_train_rfe3.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **enginetype_ohcv** because its p-value is **0.402**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe4 = x_train_rfe3.drop('enginetype_ohcv', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe4c = sm.add_constant(x_train_rfe4)
lm_rfe4 = sm.OLS(y_train, x_train_rfe4c).fit()

#Summary of linear model
print(lm_rfe4.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe4.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe4.values, i) for i in range(x_train_rfe4.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **enginetype_dohcv** because its p-value is **0.712**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe5 = x_train_rfe4.drop('enginetype_dohcv', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe5c = sm.add_constant(x_train_rfe5)
lm_rfe5 = sm.OLS(y_train, x_train_rfe5c).fit()

#Summary of linear model
print(lm_rfe5.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe5.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe5.values, i) for i in range(x_train_rfe5.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **cylindernumber_five** because its p-value is **0.051**, which is just above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe6 = x_train_rfe5.drop('cylindernumber_five', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe6c = sm.add_constant(x_train_rfe6)
lm_rfe6 = sm.OLS(y_train, x_train_rfe6c).fit()

#Summary of linear model
print(lm_rfe6.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe6.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe6.values, i) for i in range(x_train_rfe6.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **cylindernumber_two** because its p-value is **0.845**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
x_train_rfe6.columns

In [None]:
# Dropping the insignificant variable

x_train_rfe7 = x_train_rfe6.drop('cylindernumber_two', axis=1)

# Adding a constant variable and building the next fitted model

x_train_rfe7c = sm.add_constant(x_train_rfe7)
lm_rfe7 = sm.OLS(y_train, x_train_rfe7c).fit()

#Summary of linear model
print(lm_rfe7.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe7.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe7.values, i) for i in range(x_train_rfe7.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **curbweight** because its VIF is **5.50**, which is above the threshold of 5, and then rebuilding the model.

In [None]:
x_train_rfe8 = x_train_rfe7.drop('curbweight', axis=1)

# Adding a constant variable and building the next fitted model
x_train_rfe8c = sm.add_constant(x_train_rfe8)
lm_rfe8 = sm.OLS(y_train, x_train_rfe8c).fit()

#Summary of linear model
print(lm_rfe8.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe8.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe8.values, i) for i in range(x_train_rfe8.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **cylindernumber_four** because its VIF is **5.10**, which is above the threshold of 5, and then rebuilding the model.

In [None]:
x_train_rfe9 = x_train_rfe8.drop('cylindernumber_four', axis=1)

# Adding a constant variable and building the next fitted model
x_train_rfe9c = sm.add_constant(x_train_rfe9)
lm_rfe9 = sm.OLS(y_train, x_train_rfe9c).fit()

#Summary of linear model
print(lm_rfe9.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe9.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe9.values, i) for i in range(x_train_rfe9.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Let's drop **carbody_sedan** and see whether there is any drastic fall in R-squared. If not, we can drop it. Our aim is to explain the maximum variance with the minimum number of variables.

In [None]:
# Dropping carbody_sedan to check its effect on R-squared

x_train_rfe10 = x_train_rfe9.drop('carbody_sedan', axis=1)

# Adding a constant variable and building the next fitted model
x_train_rfe10c = sm.add_constant(x_train_rfe10)
lm_rfe10 = sm.OLS(y_train, x_train_rfe10c).fit()

#Summary of linear model
print(lm_rfe10.summary())

The R-squared value dropped by just 0.006. Hence we can proceed with dropping carbody_sedan.

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe10.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe10.values, i) for i in range(x_train_rfe10.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **carbody_wagon** because its p-value is **0.327**, which is above the 0.05 threshold, and then rebuilding the model.



In [None]:
# Dropping the insignificant variable

x_train_rfe11 = x_train_rfe10.drop('carbody_wagon', axis=1)

# Adding a constant variable and building the next fitted model
x_train_rfe11c = sm.add_constant(x_train_rfe11)
lm_rfe11 = sm.OLS(y_train, x_train_rfe11c).fit()

#Summary of linear model
print(lm_rfe11.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe11.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe11.values, i) for i in range(x_train_rfe11.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Dropping **carbody_hatchback** because its p-value is **0.266**, which is above the 0.05 threshold, and then rebuilding the model.

In [None]:
# Dropping the insignificant variable

x_train_rfe12 = x_train_rfe11.drop('carbody_hatchback', axis=1)

# Adding a constant variable and building the next fitted model
x_train_rfe12c = sm.add_constant(x_train_rfe12)
lm_rfe12 = sm.OLS(y_train, x_train_rfe12c).fit()

#Summary of linear model
print(lm_rfe12.summary())

In [None]:
# Creating a dataframe that will contain the names of all the feature variables and their respective VIFs

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train_rfe12.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe12.values, i) for i in range(x_train_rfe12.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Now the VIFs and p-values are both within an acceptable range, so we can go ahead and make our predictions using models lm_rfe12 and lm_rfe9.

**Here, we propose two business models which can be used to predict the car prices.**

**MODEL I**

Using lm_rfe12, which has 3 predictor variables.

**Step 7: Residual Analysis of the train data**

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the residuals.

An error term appears in a statistical model, such as a regression model, to represent the part of the response that the model does not explain.

In [None]:
# Predicting the price of training set.
y_train_price = lm_rfe12.predict(x_train_rfe12c)

In [None]:
# Plot the histogram of the error terms
# error between the actual price and the predicted price by our model

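# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(y_train - y_train_price, bins=20, kde=True)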
plt.title('Error Term Analysis')
plt.xlabel('Errors')
plt.show()


Here we can see that the residuals are centred around 0.0 and the histogram is roughly bell-shaped, so the error terms are approximately normally distributed.
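
A histogram is a visual check only. As a rough numerical complement, the mean, skewness and excess kurtosis of the residuals should all be close to 0 for normally distributed errors. The sketch below runs on synthetic residuals, not on this notebook's data, and the helper name `residual_summary` is an assumption made for illustration.

```python
import numpy as np

def residual_summary(residuals):
    """Return mean, skewness and excess kurtosis of a 1-D array of residuals."""
    r = np.asarray(residuals, dtype=float)
    mean = r.mean()
    z = (r - mean) / r.std()
    return {
        "mean": float(mean),
        "skewness": float(np.mean(z ** 3)),               # 0 for a symmetric distribution
        "excess_kurtosis": float(np.mean(z ** 4) - 3.0),  # 0 for a normal distribution
    }

# Synthetic residuals standing in for (y_train - y_train_price)
rng = np.random.default_rng(42)
demo_residuals = rng.normal(loc=0.0, scale=0.1, size=5000)
summary = residual_summary(demo_residuals)
print(summary)
```

With the real data, the same function could be called on `y_train - y_train_price`. Values far from 0 would suggest skewed or heavy-tailed errors.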

**Step 8: Making Predictions Using the Final Model**

Now that we have fitted the model and checked the normality of error terms, it's time to go ahead and make predictions using the final model.

Applying the scaler that was fitted on the training set to the test set (transform only, without refitting).

In [None]:
df_test[sig_num_col] = scaler.transform(df_test[sig_num_col])
df_test.shape
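
The reason for calling `transform` rather than refitting on the test set can be shown with a small sketch. It assumes a min-max scaler. The actual `scaler` object is created earlier in the notebook and may be of a different type, and the class `SimpleMinMaxScaler` and the numbers below are purely illustrative.

```python
import numpy as np

class SimpleMinMaxScaler:
    """Toy min-max scaler: learns min and range from the data passed to fit()."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.data_min_ = X.min(axis=0)
        self.data_range_ = X.max(axis=0) - self.data_min_
        return self

    def transform(self, X):
        # Reuses the statistics learned in fit(); nothing is re-estimated here
        return (np.asarray(X, dtype=float) - self.data_min_) / self.data_range_

train = np.array([[60.0], [65.0], [70.0]])   # hypothetical training values
test = np.array([[62.5], [72.0]])            # hypothetical test values

demo_scaler = SimpleMinMaxScaler().fit(train)
test_scaled = demo_scaler.transform(test)
print(test_scaled)
```

Because the test set is scaled with the training minimum and range, a test value can fall outside [0, 1]. That is expected and keeps the test data on the same scale the model was trained on.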

Dividing test set into x_test and y_test


In [None]:
y_test = df_test.pop('price')
x_test = df_test

In [None]:
# Adding Constant
x_test_1 = sm.add_constant(x_test)

x_test_new = x_test_1[x_train_rfe12c.columns]

In [None]:
# Making predictions using the final model
y_pred = lm_rfe12.predict(x_test_new)

In [None]:
y_pred


**Step 9: Model Evaluation**

Let's now plot the graph for actual versus predicted values.

In [None]:
# Plotting y_test against y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test, y_pred)
plt.title('y_test vs y_pred', fontsize=20)
plt.xlabel('y_test', fontsize=18)
plt.ylabel('y_pred', fontsize=16)

**R-squared Score**

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)
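
RMSE complements the R-squared computed above: it reports the typical size of a prediction error in the same units as `y_test`. Below is a minimal sketch of both metrics written out by hand. The helper names `rmse` and `r_squared` and the four sample values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values, only to show the calls
actual = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]
demo_rmse = rmse(actual, predicted)
demo_r2 = r_squared(actual, predicted)
print(demo_rmse, demo_r2)
```

On the notebook's data, the equivalent calls would be `rmse(y_test, y_pred)` and `r_squared(y_test, y_pred)`.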

The **R-squared** score on the **training set** is **0.860** and on the **test set** is **0.876**, which are quite close. Hence, we can say that the model generalizes well and is good enough to predict car prices using the predictor variables below.

* Cars_Category_TopNotch_Cars
* carwidth
* enginelocation_rear

**Model I Conclusions:**

* R-squared on the training and test sets - 0.860 and 0.876 - roughly 86-88% of the variance explained.
* F-stats and Prob(F-stats) (overall model fit) - 291.3 and 9.90e-60 (approx. 0.0) - the model fit is significant, so the explained variance is not just by chance.
* p-values - p-values for all the coefficients are less than the significance level of 0.05, meaning that all the predictors are statistically significant.

**MODEL II**

Using lm_rfe9, which has 6 predictor variables:

* Cars_Category_TopNotch_Cars
* carwidth
* carbody_sedan
* enginelocation_rear
* carbody_hatchback
* carbody_wagon

**Step 7: Residual Analysis of the train data**

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot a histogram of the residuals.

In [None]:
# Predicting the price of training set.
y_train_price2 = lm_rfe9.predict(x_train_rfe9c)


In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
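# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus density curve
sns.histplot(y_train - y_train_price2, bins=20, kde=True)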
fig.suptitle('Error Terms Analysis', fontsize = 20)                   
plt.xlabel('Errors', fontsize = 18)

**Step 8: Making Predictions Using the Final Model**

Now that we have fitted the model and checked the normality of error terms, it's time to go ahead and make predictions using the model.

In [None]:
x_test_2 = x_test_1[x_train_rfe9c.columns]

In [None]:
# Making predictions using the final model
y_pred2 = lm_rfe9.predict(x_test_2)

**Step 9: Model Evaluation**

Let's now plot the graph for actual versus predicted values.

In [None]:
# Plotting y_test against y_pred2 to understand the spread.
fig = plt.figure()
plt.scatter(y_test, y_pred2)
fig.suptitle('y_test vs y_pred2', fontsize=20)
plt.xlabel('y_test', fontsize=18)
plt.ylabel('y_pred2', fontsize=16)

**R-squared Score**

In [None]:
r2_score(y_test, y_pred2)

The **R-squared** score on the **training set** is **0.872** and on the **test set** is **0.877**, which are quite close. Hence, we can say that the model generalizes well and is good enough to predict car prices using the predictor variables below.

* Cars_Category_TopNotch_Cars
* carwidth
* carbody_sedan
* enginelocation_rear
* carbody_hatchback
* carbody_wagon


**Model II Conclusions:**

* R-squared on the training and test sets - 0.872 and 0.877 - roughly 87-88% of the variance explained.
* F-stats and Prob(F-stats) (overall model fit) - 154.4 and 3.57e-58 (approx. 0.0) - the model fit is significant, so the explained variance is not just by chance.
* p-values - p-values for all the coefficients are less than the significance level of 0.05, meaning that all the predictors are statistically significant.
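
The regression summary also reports an adjusted R-squared, which penalizes R-squared for the number of predictors, and an F-statistic that can be derived from R-squared. The sketch below shows both formulas. The training-set size of 143 rows is an assumption (it is not stated in this section). With that value and the rounded R-squared of 0.872, the F-statistic formula lands very close to the 154.4 reported for Model II.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from its R-squared."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))

N_TRAIN_ASSUMED = 143  # assumption: number of training rows, not taken from the text

# Model II: training R-squared 0.872 with 6 predictors
adj_r2_model2 = adjusted_r2(0.872, N_TRAIN_ASSUMED, 6)
f_model2 = f_statistic(0.872, N_TRAIN_ASSUMED, 6)
print(round(adj_r2_model2, 3), round(f_model2, 1))
```

Because adjusted R-squared only rises when a new predictor improves the fit by more than chance would, it is a fairer basis than plain R-squared for comparing the 3-predictor and 6-predictor models.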

**Closing Statement:**

Both models are good enough to predict car prices: each explains roughly 86-88% of the variance in the data, and both fits are statistically significant.