## Car Price Prediction - RFE + Linear Regression
<i><b>Author: Anish Mahapatra</b></i>
<br>
[LinkedIn Profile](https://www.linkedin.com/in/anishmahapatra/)
<br>
[Medium Profile](https://medium.com/@anishmahapatra)

### Expected Outcome:
Build a multiple linear regression model to predict car prices.

### Problem Statement:
<b>Geely Auto</b>, a Chinese automobile company, aspires to enter the US market and produce cars there. They have hired an automobile consulting company (us) to understand the factors on which the pricing of a car depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car
    
### Business Goals:
You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market. 

### Data Preparation to keep in mind
- CarName is a concatenation of Car Company & Car Model
- Only Company name is to be considered as the variable for the purpose of model building

### Model Evaluation:
After building the model and analysing the residuals, make sure to evaluate the model using the R-Squared score.

In [None]:
#Importing the required modules and packages

import matplotlib.pyplot as plt
from numpy.random import randn
from numpy.random import seed
from numpy import percentile
from scipy import stats
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline
import warnings

In [None]:
# Raising the maximum number of displayed columns and rows to 500
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Ignoring warnings
warnings.filterwarnings("ignore")

In [None]:
# Importing the required csv from the folder:
carData = pd.read_csv('../input/car-price-prediction/CarPrice_Assignment.csv')

In [None]:
# Sense check of the car data

carData.head()

### 1. Pre-Processing:
- Check the % of missing values
- Check the data types
- Outlier Analysis

### 2. EDA - to understand the data
- Univariate Analysis
- Make heatmap to understand correlation distribution
- Perform Bivariate Analysis



In [None]:
# Checking the top 5 rows and headers of the data
carData.head()

In [None]:
# Looking at the type of the data frame, data types and the number of rows
carData.info()

In [None]:
# Checking the number of rows and columns present in the data
carData.shape

In [None]:
# Looking at the data types of the data
carData.dtypes

In [None]:
# Making a deep copy of the car data in dataframe df (checkpoint!), so that changes to df do not touch carData
df = carData.copy(deep=True)

### Missing Values

In [None]:
# Calculating the percent of missing values in the dataframe
percentMissing = (df.isnull().sum() / len(df)) * 100

# Collecting the column names and the missing values % into a dataframe (easier to read when there are many columns)
missingValuesDf = pd.DataFrame({'columnName': df.columns,
                                 'percentMissing': percentMissing})

In [None]:
# Viewing the dataframe to ensure that the values have been populated correctly
missingValuesDf

We have been fortunate enough to get a clean dataset with no missing values, so no imputation or missing value treatment needs to be carried out.

### Outlier Treatment Analysis
Let us now analyze the numerical variables

<b>Note:</b> The Boxplots below have been plotted with the standard whiskers of 1.5 x (IQR)

In [None]:
# Selecting only the numeric columns to perform correlation analysis
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num = df.select_dtypes(include=numerics)

In [None]:
# Displaying the top 5 rows of only the numerical values
df_num.head()

Here, we notice that symboling is a categorical variable.

In [None]:
# Removal of the categorical columns
symboling = df_num.pop('symboling')
car_ID = df_num.pop('car_ID')

We shall analyze the boxplots of the remaining variables to see if there is any unusual behavior in the data.

Note: car_ID is an identifier variable and holds no significance for the analysis, so it has been removed along with symboling.

## Outlier Analysis using Box Plots

In [None]:
# Making boxplots as sub-plots to understand the trend of the data 
plt.figure(figsize=(15, 6))
plt.subplot(2,3,1)
sns.boxplot(x = 'wheelbase', data = df_num)
plt.subplot(2,3,2)
sns.boxplot(x = 'carlength', data = df_num)
plt.subplot(2,3,3)
sns.boxplot(x = 'carwidth', data = df_num)
plt.subplot(2,3,4)
sns.boxplot(x = 'carheight', data = df_num)
plt.subplot(2,3,5)
sns.boxplot(x = 'curbweight', data = df_num)
plt.subplot(2,3,6)
sns.boxplot(x = 'enginesize', data = df_num)
plt.show()

In [None]:
# Making boxplots as sub-plots to understand the trend of the data 

plt.figure(figsize=(15, 6))
plt.subplot(2,3,1)
sns.boxplot(x = 'boreratio', data = df_num)
plt.subplot(2,3,2)
sns.boxplot(x = 'stroke', data = df_num)
plt.subplot(2,3,3)
sns.boxplot(x = 'compressionratio', data = df_num)
plt.subplot(2,3,4)
sns.boxplot(x = 'horsepower', data = df_num)
plt.subplot(2,3,5)
sns.boxplot(x = 'peakrpm', data = df_num)
plt.subplot(2,3,6)
sns.boxplot(x = 'citympg', data = df_num)
plt.show()

In [None]:
# Making boxplots as sub-plots to understand the trend of the data 
plt.figure(figsize=(15, 3))
plt.subplot(1,2,1)
sns.boxplot(x = 'highwaympg', data = df_num)
plt.subplot(1,2,2)
sns.boxplot(x = 'price', data = df_num)

Now that we have performed an outlier analysis on the numerical variables of the dataset, we can say that there are a few columns that we can deep dive into:

- <b>citympg</b>: Mileage in the city
- <b>horsepower</b>: horse power
- <b>enginesize</b>: Size of the engine
- <b>compressionratio</b>: compression ratio of the car
- <b>stroke</b>: stroke or volume inside the engine
- <b>price</b>: price of the car
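
The 1.5 x IQR whisker rule used by the boxplots can also be applied numerically. The snippet below is a minimal sketch, assuming a hypothetical helper `iqr_outlier_mask` and toy values that are not taken from the car dataset; it flags the points that would be drawn beyond the whiskers.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside the k * IQR whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Toy values (illustrative only): one point sits far above the rest
toy_prices = [10, 12, 11, 13, 12, 11, 40]
mask = iqr_outlier_mask(toy_prices)
print(np.asarray(toy_prices)[mask])
```

The same helper could be called on a column such as `df_num['price']` to count how many rows fall outside the whiskers before deciding whether any treatment is needed.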

### Distribution Analysis

Let us now understand the distribution of the numerical and categorical variables.

### Analysis of numerical variables
Let us now analyze the numerical variables

In [None]:
# Function to plot histogram for numerical, univariate analysis
def plotHistogram(df, colName):
    '''
    This function is used to set the style of the plot, name the graph and plot the distribution for the specified column
    
    Inputs:
    @df (dataframe) - The dataframe for which the histogram is to be plotted
    @colName (string) - The numeric column for which the histogram is to be plotted
    
    Output:
    Titled distribution plot of the specified colName
    '''
    sns.set(style="whitegrid")
    plt.figure(figsize=(20,5)) 
    plt.title(colName)
    plt.ylabel('Density', fontsize=14)
    sns.distplot(df[colName], kde=True)

### Histograms of numerical variables

In [None]:
# Plotting a histogram for each numerical variable (plotHistogram creates its own figure on every call)
plotHistogram(df_num, 'wheelbase')
plotHistogram(df_num, 'carlength')
plotHistogram(df_num, 'carwidth')
plotHistogram(df_num, 'carheight')
plotHistogram(df_num, 'curbweight')
plotHistogram(df_num, 'enginesize')
plotHistogram(df_num, 'boreratio')
plotHistogram(df_num, 'stroke')
plotHistogram(df_num, 'compressionratio')
plotHistogram(df_num, 'horsepower')
plotHistogram(df_num, 'peakrpm')
plotHistogram(df_num, 'citympg')
plt.show()


Here, we notice that most of the numerical variables roughly follow a normal distribution with little skew.

The variable that does not follow a normal distribution:
- compression ratio

### Analysis of categorical variables
Let us now analyze the categorical variables

In [None]:
# Defining a function to view the distribution of the categorical variables
def plotFrequencyTable(df, catColName):
    '''
    This function is used to plot the frequency table of the specified categorical variable
    @df (dataframe) - Dataframe for which frequency table is to be plotted
    @catColName (string) - Column name for which frequency table is to be plotted
    '''
    sns.countplot(x=catColName, data=df)
    plt.title(catColName)
    plt.xticks(rotation = 90)
    plt.show();

In [None]:
## Subsetting data to subset categorical variables
df_cat = df.select_dtypes(include='object')

In [None]:
# Viewing the head of the data for a sense-check
df_cat.head()

#### Frequency Tables of categorical variables to understand trend

In [None]:
# Plotting the frequency of each categorical variable to understand the trend of the data
# plotFrequencyTable calls plt.show() itself, so each chart is drawn as its own figure

for catCol in ['enginelocation', 'fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel']:
    plotFrequencyTable(df_cat, catCol)

A couple of observations from the above graph would indicate the following:
- The engine is mostly located in the <b>front</b> of the car
- Most of the cars use <b>gas</b> as their fuel
- The aspiration employed by most vehicles is <b>std</b> (standard)
- Just over half the cars sold have <b>four</b> doors
- The most popular car body is <b>sedan</b>
- Most of the cars have a <b>fwd</b> drive wheel

## Bivariate Analysis

Now, we shall perform bi-variate analysis on the variables with respect to price (dependent variable) 

### Correlation Analysis

Let us now use correlation analysis to find which variables are most correlated with <b>price</b> (the dependent variable).

Let us now proceed to plot the correlation matrix of the data:

In [None]:
# Computing the correlation matrix of the data
cor = df_num.corr()

In [None]:
#Correlation with output variable
cor_target = abs(cor['price'])

In [None]:
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.5]
round(relevant_features.sort_values(ascending = True), 2)

#### Heatmap of numerical variables

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (16, 10))
sns.heatmap(df_num.corr(), annot = True, cmap="YlGnBu")
plt.show()

#### Top correlated features with price

Here, we notice that the variables most correlated with price (by absolute value of the correlation) are as follows:
- enginesize    0.87
- curbweight    0.84
- horsepower    0.81
- carwidth      0.76
- highwaympg    0.70
- citympg       0.69
- carlength     0.68
- wheelbase     0.58
- boreratio     0.55


As a rule of thumb, we treat a variable as strongly correlated with price when the absolute value of Pearson's correlation is greater than 0.5, which is the cut-off used above.

Let us now try to give a business interpretation of why the above variables may have a higher correlation with price than the other variables:
- <b>engine size</b>: The larger the engine, the faster the car can go. The materials used then need to be stronger and lighter, which might make the car more expensive
- <b>curb weight</b>: As curb weight increases, a more powerful engine is required to pull the car, which would make the price go up
- <b>horse power</b>: A higher horse power adds to the cost of the car
- <b>car width</b>: This is an interesting find. It would imply that a wider car is an indication of a luxury car
- <b>highway mpg</b>: Interesting! A lower mileage seems to indicate a more expensive car
- <b>city mpg</b>: Interesting! A lower mileage seems to indicate a more expensive car
- <b>car length</b>: This is also interesting, as it implies that the longer the car, the higher its price
- <b>wheel base</b>: An equally interesting find: faster and more expensive cars seem to have a longer wheel base
- <b>bore ratio</b>: A higher bore ratio implies a faster car, which would in turn imply that the car is expensive

### Bivariate Plots

Now that we have found the variables that have the highest correlation with the dependent variable - price, we shall now plot the bivariate plots with price.

In [None]:
# Plotting pair plots (pairplot creates its own figure, so a separate plt.figure call would only add an empty one)
sns.pairplot(df)
plt.show()

In [None]:
# Plotting the highly correlated variables with price to understand the trend
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'enginesize', y = 'price', data = df)

plt.subplot(2,3,2)
sns.boxplot(x = 'curbweight', y = 'price', data = df)

plt.subplot(2,3,3)
sns.boxplot(x = 'horsepower', y = 'price', data = df)

plt.subplot(2,3,4)
sns.boxplot(x = 'carwidth', y = 'price', data = df)

plt.subplot(2,3,5)
sns.boxplot(x = 'highwaympg', y = 'price', data = df)

plt.subplot(2,3,6)
sns.boxplot(x = 'citympg', y = 'price', data = df)
plt.show()

The bivariate distribution can be analyzed. The following trends are observed from the correlated variables: 
- As engine size increases, price increases
- As curb weight increases, price increases
- As horse power increases, price increases
- As the car width increases, price increases

The following trends are interesting:
- As the highway mileage increases, the price decreases
- As the city mileage increases, the price decreases

## Data Preprocessing:

- Get the name of the car company
- Correct the name of the car company


In [None]:
# Removing the unique identifier of the data
df.pop('car_ID').head()

In [None]:
df.head()

In [None]:
# Getting the name of the brand
df['CarName'] = df['CarName'].str.split('-').str[0]
df['CarName'] = df['CarName'].str.split(' ').str[0]

# Converting to lowercase
df['CarName'] = df['CarName'].str.lower()

# Correcting the mistakes present in the CarName columns like vw to volkswagen, maxda to mazda etc.
df['CarName'] = df['CarName'].str.replace('vw','volkswagen')
df['CarName'] = df['CarName'].str.replace('maxda','mazda')
df['CarName'] = df['CarName'].str.replace('vokswagen','volkswagen')
df['CarName'] = df['CarName'].str.replace('toyouta','toyota')

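# Merging occurrences of 4wd into fwd
# Note: 4wd (four-wheel drive) and fwd (front-wheel drive) are different drivetrains;
# this step treats them as one category, so the model cannot tell them apart
df['drivewheel'] = df['drivewheel'].str.replace('4wd','fwd')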

We shall now proceed to convert the categorical variables to dummy variables.
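
As an illustration of what the dummy encoding does, here is a minimal sketch on a toy dataframe (the helper name `add_dummies` and the toy values are assumptions, not part of the original notebook). It uses the `prefix` argument so that a level such as `four`, which appears in both `doornumber` and `cylindernumber`, would not produce two columns with the same name, and `dtype=int` so that the dummy columns stay numeric.

```python
import pandas as pd

def add_dummies(frame, col_names):
    """Return a copy of frame with each column in col_names replaced by n-1 dummy columns."""
    out = frame.copy()
    for col in col_names:
        # prefix keeps names unique across columns; drop_first keeps n-1 dummies for n levels
        dummies = pd.get_dummies(out[col], prefix=col, drop_first=True, dtype=int)
        out = pd.concat([out.drop(columns=[col]), dummies], axis=1)
    return out

# Toy data (illustrative only)
toy = pd.DataFrame({
    'fueltype': ['gas', 'diesel', 'gas'],
    'doornumber': ['two', 'four', 'two'],
    'price': [100.0, 150.0, 120.0],
})
encoded = add_dummies(toy, ['fueltype', 'doornumber'])
print(encoded.columns.tolist())
```

With two levels per column, each categorical column collapses into a single 0/1 column, and the dropped level becomes the baseline against which the coefficient is read.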

In [None]:
# Making a function to make dummy variables
def makeDummyVariables(df, colName):
    '''
    This function is used to make dummy variables, concatenate them to the original dataframe and remove the older categorical column.

    Inputs:
    @df (dataframe): Dataframe which we would like to make modifications in
    @colName (string): Name of the categorical column that we wish to make dummy variables for.

    Output:
    Desired dataframe with dummy variables and with the original categorical variable (colName) removed
    '''
    # Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
    status = pd.get_dummies(df[colName], drop_first = True)

    # Concatenating the dummy variables to the dataframe
    df = pd.concat([df, status], axis = 1)

    # Dropping the original categorical variable and returning the new dataframe
    return df.drop([colName], axis = 1)


# The earlier attempt returned nothing: pd.concat rebinds the local name df, so the caller's dataframe
# was never changed. Returning the result, as in df = makeDummyVariables(df, 'CarName'), fixes this.
# The cells below still apply the same steps column by column.

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['CarName'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['CarName'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['fueltype'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['fueltype'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['aspiration'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['aspiration'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
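status = pd.get_dummies(df['symboling'], drop_first = True)

# symboling levels are integers; casting the dummy column names to strings keeps all column names the same type
status.columns = status.columns.astype(str)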

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['symboling'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['doornumber'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['doornumber'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['carbody'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['carbody'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['drivewheel'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['drivewheel'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['enginelocation'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['enginelocation'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['enginetype'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['enginetype'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['cylindernumber'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['cylindernumber'], axis = 1, inplace = True)

In [None]:
# Making dummy variables and dropping the first dummy column as n values should have n-1 dummy columns
status = pd.get_dummies(df['fuelsystem'], drop_first = True)

# Concatenating the dummy variables to the dataframe
df = pd.concat([df, status], axis = 1)

# Dropping the original categorical variable from the dataframe
df.drop(['fuelsystem'], axis = 1, inplace = True)

In [None]:
# Sense check of the data
df.head()

In [None]:
# Shape of the data
df.shape

## Splitting the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# We set random_state so that the train and test data sets contain the same rows on every run
df_train, df_test = train_test_split(df, train_size = 0.7, test_size = 0.3, random_state = 100)

### Rescaling the Features 

We will use MinMax scaling. This is done for the numerical variables
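
MinMax scaling maps each value to (x - min) / (max - min), where the minimum and maximum are learned from the training data only. The sketch below assumes hypothetical helper names and toy horsepower values; it shows why the test set is transformed with the training statistics rather than fitted again.

```python
import numpy as np

def fit_min_max(train_col):
    """Learn the minimum and the range from the training column only."""
    train_col = np.asarray(train_col, dtype=float)
    col_min = train_col.min()
    col_range = train_col.max() - col_min
    return col_min, col_range

def apply_min_max(col, col_min, col_range):
    """Scale values with (x - min) / (max - min), using the training statistics."""
    return (np.asarray(col, dtype=float) - col_min) / col_range

# Toy values (illustrative only)
train_hp = [50, 100, 150]
test_hp = [75, 200]

hp_min, hp_range = fit_min_max(train_hp)
train_scaled = apply_min_max(train_hp, hp_min, hp_range)
test_scaled = apply_min_max(test_hp, hp_min, hp_range)
print(train_scaled, test_scaled)
```

Note that a test value larger than the training maximum is scaled to a value above 1; this is expected and is the price of not letting the test data influence the scaler.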

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Obtaining the numerical features and scaling them
num_vars = ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 
            'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 
            'highwaympg']

# Scaling the numerical features
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

# Sense check of the data
df_train.head()

### Dividing into X and Y sets for the model building

In [None]:
# Putting the dependent variable in 'y' 
y_train = df_train.pop('price')

# Putting the rest of the features in 'X'
X_train = df_train

## Building our model

This time, we will be using the **LinearRegression class from scikit-learn** for its compatibility with RFE (Recursive Feature Elimination)

### RFE
Recursive feature elimination
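
To make the idea concrete, here is a minimal NumPy sketch of recursive elimination with a linear model. It is an illustration under stated assumptions, not scikit-learn's implementation: the function `rfe_linear` and the synthetic data are hypothetical, and features are assumed to be on a comparable scale so that coefficient sizes can be compared.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the feature with the smallest absolute coefficient until n_keep remain."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Least-squares fit with an intercept column on the surviving features
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Ignore the intercept (index 0) when looking for the weakest feature
        weakest = int(np.argmin(np.abs(coefs[1:])))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
# Only columns 0 and 2 drive the target; columns 1 and 3 are pure noise
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 2] + rng.normal(scale=0.1, size=200)

kept, dropped = rfe_linear(X_toy, y_toy, n_keep=2)
print(kept, dropped)
```

The model is refitted after every removal, which is what makes the procedure recursive: a feature that looks weak next to a correlated partner can become important once that partner is gone.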

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of output variables equal to 20

# Making a linear regression model object
lm = LinearRegression()

# Fitting the model on the training dataset
lm.fit(X_train, y_train)

# Selecting the top 20 features (recent scikit-learn versions require n_features_to_select as a keyword)
rfe = RFE(lm, n_features_to_select=20)
rfe = rfe.fit(X_train, y_train)

In [None]:
# listing the relevant features (obtained via Recursive Feature Elimination - RFE)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

#### The features that say `True` are the ones that RFE believes will be relevant to predict the price. They all have rank 1.

#### The features that say `False` were eliminated by RFE. They can be given priority by rank, with 2 being the closest to selection, although RFE implies that these features will not be as relevant for the model

In [None]:
# The columns selected by RFE
col = X_train.columns[rfe.support_]
col

In [None]:
# The columns that were not selected by RFE
X_train.columns[~rfe.support_]

### Building model using statsmodel, for the detailed statistics

In [None]:
# Creating X_train dataframe with RFE selected variables
X_train_rfe = X_train[col]

## Model Iteration 1

In [None]:
# Adding a constant variable, as statsmodels does not include an intercept by default
import statsmodels.api as sm  
X_train_rfe = sm.add_constant(X_train_rfe)

In [None]:
# Fitting the linear model using ordinary least squares
lm = sm.OLS(y_train,X_train_rfe).fit()   

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

## Model Iteration 2

In [None]:
X_train_rfe = X_train_rfe.drop(["two"], axis = 1)

Rebuilding the model without `two`

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_rfe)

In [None]:
lm = sm.OLS(y_train,X_train_lm).fit()   # Running the linear model

In [None]:
#Let's see the summary of our linear model
print(lm.summary())

## Model Iteration 3

In [None]:
X_train_rfe = X_train_rfe.drop(["dohcv"], axis = 1)

Rebuilding the model without `dohcv`

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_rfe)

# Running the linear model
lm = sm.OLS(y_train,X_train_lm).fit()  

#Let's see the summary of our linear model
print(lm.summary())

## Model Iteration 4

In [None]:
X_train_rfe = X_train_rfe.drop(["peakrpm"], axis = 1)

Rebuilding the model without `peakrpm`

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_rfe)

# Running the linear model
lm = sm.OLS(y_train,X_train_lm).fit()  

#Let's see the summary of our linear model
print(lm.summary())

## Model Iteration 5

In [None]:
X_train_rfe = X_train_rfe.drop(["horsepower"], axis = 1)

Rebuilding the model without `horsepower`

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_rfe)

# Running the linear model
lm = sm.OLS(y_train,X_train_lm).fit()  

#Let's see the summary of our linear model
print(lm.summary())

In [None]:
X_train_rfe.columns

## VIF - Variance Inflation Factor

In [None]:
X_train_rfe = X_train_rfe.drop(['const'], axis=1)

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
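
The VIF of feature i is 1 / (1 - R_i²), where R_i² is obtained by regressing feature i on all the other features. The NumPy sketch below is an illustration with a hypothetical helper `vif_with_intercept` and synthetic data. It includes an intercept in each auxiliary regression; as far as we understand, statsmodels' `variance_inflation_factor` does not add one itself, so values computed on a dataframe without `const` may differ from this sketch.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coefs
        # With an intercept the residuals have zero mean, so this ratio is SS_res / SS_tot
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a, so a and c are collinear

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

In this toy setup the two collinear columns receive large VIFs while the independent column stays close to 1, which is the pattern we look for when deciding which feature to drop.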

In [None]:
# Making a copy of the RFE-selected training features in X_train_new (checkpoint!)
X_train_new = X_train_rfe.copy(deep=True)

In [None]:
X_train_new = X_train_new.drop(["three"], axis = 1)

In [None]:
# Calculate the VIFs for the new model
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

In [None]:
# Adding a constant variable 
import statsmodels.api as sm  
X_train_lm = sm.add_constant(X_train_new)

# Running the linear model
lm = sm.OLS(y_train,X_train_lm).fit()  

#Let's see the summary of our linear model
print(lm.summary())

In [None]:
# Viewing the final columns to be used in the model
X_train_new.columns

We notice that if we remove more columns, the R-Squared gradually reduces.

## Residual Analysis of the train data

Now, to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like.

In [None]:
y_train_price = lm.predict(X_train_lm)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_price), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

We note that the errors are approximately normally distributed.

## Making Predictions

#### Applying the scaling on the test sets

In [None]:
num_vars = ['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 
            'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 
            'highwaympg']

df_test[num_vars] = scaler.transform(df_test[num_vars])

#### Dividing into X_test and y_test

In [None]:
y_test = df_test.pop('price')
X_test = df_test

In [None]:
# Now let's use our model to make predictions.

# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [None]:
# Making predictions
y_pred = lm.predict(X_test_new)

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)                          # Y-label

In [None]:
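# Fitting a separate OLS model on the test data, only to compare its summary with the training summary
# It is stored as lm_test so that the model trained above (lm) is not overwritten
X_test_lm = sm.add_constant(X_test_new)

lm_test = sm.OLS(y_test,X_test_lm).fit()  

#Let's see the summary of the model fitted on the test data
print(lm_test.summary())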

## R-Squared value

In [None]:
from sklearn.metrics import r2_score
print('R-Squared Score for the Car Price Prediction model with Linear Regression + RFE is:\n', round(r2_score(y_test, y_pred)*100, 2), '%')
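
For reference, R-Squared is 1 - SS_res / SS_tot: the share of the variance in the actual prices that the predictions explain. The sketch below uses a hypothetical helper `r_squared` and toy numbers (not results from this notebook) to show the arithmetic; here SS_res = 26 and SS_tot = 500, so the score is 1 - 26/500 = 0.948.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """Compute 1 - SS_res / SS_tot for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy values (illustrative only)
toy_actual = [10.0, 20.0, 30.0, 40.0]
toy_predicted = [12.0, 18.0, 33.0, 37.0]

toy_r2 = r_squared(toy_actual, toy_predicted)
print(round(toy_r2 * 100, 2), '%')
```

Because the score is computed on the held-out test set with predictions from the model trained on the training set, it reflects how well the model generalises rather than how well it fits the data it has already seen.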

## Conclusion

Hence, the features that **Geely Auto** can use as good predictors to model the price are:
- carwidth
- curbweight
- enginesize
- boreratio
- stroke
- bmw
- mitsubishi
- peugeot
- porsche
- rear
- l
- rotor
- five
- four
- twelve

A couple of interesting points to note about a car that would have a higher price:
   - An "l" engine type is an indicator of the car being more expensive
   - BMW, Peugeot and Porsche are the more expensive brands
   - Heavier and wider cars are more expensive
   - A higher number of cylinders (>4) is a strong indicator of a more expensive car

Hey, thank you so much for taking the time to go through my notebook. I'm a Data Scientist at a top Data Science firm and am pursuing my MS in Data Science. This is one of my first notebooks on Kaggle and I would love some feedback from this wonderful community. 

I'm a fun person to connect with, please feel free to connect with me 😄
<br>
[LinkedIn Profile](https://www.linkedin.com/in/anishmahapatra/)
<br>
[Medium Profile](https://medium.com/@anishmahapatra)
<br>
[GitHub Profile](https://github.com/anishmahapatra01/)