> ### <font color=blue>**Problem Statement**

A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

> ### <font color=blue>Business Goal

You are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the independent variables. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

> ### <font color=blue>Data Preparation

There is a variable named CarName which is comprised of two parts - the first word is the name of 'car company' and the second is the 'car model'. For example, chevrolet impala has 'chevrolet' as the car company name and 'impala' as the car model name. You need to consider only company name as the independent variable for model building.

> ### <font color=blue>Model Evaluation:

When you're done with model building and residual analysis, and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.

`from sklearn.metrics import r2_score`

`r2_score(y_test, y_pred)`

where `y_test` is the test data set for the target variable, and `y_pred` is the variable containing the predicted values of the target variable on the test set.

Please don't forget to perform this step as the R-squared score on the test set holds some marks. The variable names inside the 'r2_score' function can be different based on the variable names you have chosen.
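
As a rough illustration of what the R-squared score measures, the sketch below computes it by hand as 1 - SS_res / SS_tot. The helper name `r_squared` and the numbers are made up for this example; on the actual test set `r2_score` from scikit-learn should be used as instructed above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, purely for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
score = r_squared(y_true, y_pred)
print(round(score, 4))
```

A score of 1 means the predictions match the targets exactly; a score near 0 means the model does no better than always predicting the mean.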

In [None]:
import warnings
warnings.filterwarnings('ignore')
#import the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read the CSV file and use car_ID as the index
df=pd.read_csv("../input/car-price-prediction/CarPrice_Assignment.csv").set_index("car_ID")
# show up to 500 rows and columns when displaying data frames
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
df.head()

In [None]:
# to check the shape of the dataframe
df.shape

In [None]:
# to check the information of the data frame
df.info()

> ### <font color=red>No null values 

In [None]:
#to check the statistics of the data frame
df.describe()

In [None]:
# to check the columns of the data frame
df.columns

In [None]:
# map the integer symboling rating to three risk categories (the column becomes object dtype)
df["symboling"]=df["symboling"].map({-3:"safe",-2:"safe",-1:"safe",0:"moderate",1:"moderate",2:"risk",3:"risk"})
df.info()

In [None]:
# to split carname to company name
df["CompanyName"]=df["CarName"].str.split(" ").str[0]

In [None]:
# to count the company names
df["CompanyName"].value_counts()

> ### <font color=red>Some company names are misspelled

In [None]:
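# replace misspelled or abbreviated company names with the proper spelling
# (assign the result back; an inplace replace on a selected column may not update df in recent pandas)
df["CompanyName"]=df["CompanyName"].replace({'maxda':'mazda','vw':'volkswagen','porcshce':'porsche','Nissan':'nissan','vokswagen':'volkswagen',
                             'toyouta':'toyota','alfa-romero':'alfa-romeo'})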
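
The mapping above was written by hand after reading the value counts. As a possible aid, the sketch below uses the standard-library `difflib` to suggest a correction for each raw name. The `known` list of correct spellings and the `0.7` cutoff are assumptions made for this example, not part of the original analysis.

```python
from difflib import get_close_matches

# Assumed reference list of correct spellings (hypothetical; extend as needed)
known = ["mazda", "toyota", "volkswagen", "porsche", "nissan", "alfa-romeo"]

# Raw values of the kind found in the CompanyName column
raw = ["maxda", "toyouta", "vokswagen", "porcshce", "Nissan", "vw"]

suggestions = {}
for name in raw:
    # compare in lower case; keep only the single best match above the cutoff
    match = get_close_matches(name.lower(), known, n=1, cutoff=0.7)
    suggestions[name] = match[0] if match else None

print(suggestions)
```

Abbreviations such as `vw` are too far from the full name to be matched this way, so they still need a manual entry in the replacement dictionary.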

In [None]:
# scatter plots of the numerical variables against price
# (fuelsystem is categorical, so it is plotted with the box plots instead)
col=("wheelbase","carlength","carwidth","carheight","curbweight","enginesize","boreratio","stroke","compressionratio","horsepower","peakrpm","citympg","highwaympg")
plt.figure(figsize=(20,15))
for i in range(0,len(col)):
    plt.subplot(4,4,i+1)
    sns.scatterplot(x=col[i],
            y="price",data=df)
plt.show()

In [None]:
# box plots of the categorical variables against price
plt.figure(figsize=(20,15))
col=("symboling","fueltype","aspiration","doornumber","carbody","drivewheel","enginelocation","enginetype","cylindernumber","fuelsystem")
for i in range(0,len(col)):
    plt.subplot(4,4,(i+1))
    sns.boxplot(x=col[i],y="price",data=df)
plt.show()

> ### <font color=red>The box plots show a clear relation between engine location and price

In [None]:
plt.figure(figsize=(25,15))
sns.boxplot(x="CompanyName",y="price",data=df)
plt.show()

In [None]:
# label-encode the two-level categorical columns as 0/1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["fueltype"]=le.fit_transform(df['fueltype'])
df["aspiration"]=le.fit_transform(df['aspiration'])
df["doornumber"]=le.fit_transform(df['doornumber'])
df["enginelocation"]=le.fit_transform(df['enginelocation'])
# drop the name columns; CompanyName is therefore not used as a predictor in the models below
df.drop(["CarName",'CompanyName'],axis=1,inplace=True)
df.head()

In [None]:
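# converting the remaining categorical variables to dummy variables
# dtype=int keeps the dummies numeric (recent pandas versions default to bool), as the statsmodels fits below expect numeric input
df = pd.get_dummies(df, dtype=int)
df.head()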

In [None]:
# check the dtypes after converting the object variables to dummy variables
df.info()

# Splitting data into train and test data set

In [None]:
from sklearn.model_selection import train_test_split
#creation of train and test data set  as 70:30
df_train,df_test=train_test_split(df,train_size=0.7,test_size=.3,random_state=100)

In [None]:
print(df_train.shape)
print(df_test.shape)

In [None]:
# checking the train head data set
df_train.head()


In [None]:
# checking the summary statistics of the train data frame
df_train.describe()


In [None]:
#scaling by using min max scaler
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
# these variables are on very different scales, so rescale them to [0, 1]
li=["wheelbase","carlength",'carwidth','carheight',"curbweight",'enginesize','boreratio','stroke','compressionratio','horsepower','peakrpm','citympg','highwaympg','price']
df_train[li]=scaler.fit_transform(df_train[li])
df_train.head()
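
To make the scaling step concrete, here is a minimal NumPy sketch of min-max scaling, with the range learned from the training values only and then reused on test values. The arrays and the helper `min_max` are made up for illustration; `MinMaxScaler` does the equivalent per column.

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])   # made-up training values
test = np.array([15.0, 45.0])                # made-up test values

# "fit": learn the range from the training data only
lo, hi = train.min(), train.max()

def min_max(x, lo, hi):
    """Rescale x so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)

train_scaled = min_max(train, lo, hi)   # spans exactly [0, 1]
test_scaled = min_max(test, lo, hi)     # can fall outside [0, 1]
print(train_scaled, test_scaled)
```

Because the test data reuses the training range, a test value above the training maximum ends up above 1. That is expected and is the reason the test set is later passed through `transform` rather than `fit_transform`.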

In [None]:
# to check the summary statistics of the scaled train data
df_train.describe()

In [None]:
# correlation heatmap on the train set
plt.figure(figsize=(40,40))
sns.heatmap(df_train.corr(),cmap='YlGnBu')

In [None]:
# correlation of every variable with price (select the row by label rather than by position)
cor=df_train.corr().loc[["price"]]
cor

# Building the model

Using RFE method

In [None]:
y_train=df_train.pop("price")
X_train=df_train

In [None]:
#import the sklearn
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# creating object for linear regression
lm=LinearRegression()
# fitting data to X and y train
lm.fit(X_train,y_train)
# selecting the top 15 features (n_features_to_select must be passed by keyword in current scikit-learn)
rfe=RFE(lm,n_features_to_select=15)
rfe=rfe.fit(X_train,y_train)

In [None]:
#listing the ranking
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
#listing the top 15 features
col=X_train.columns[rfe.support_]
col
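
The idea behind recursive feature elimination can be sketched in a few lines of NumPy: fit a linear model, drop the feature with the smallest absolute coefficient, and repeat until the requested number of features remains. This is a simplified illustration on synthetic data, not scikit-learn's implementation; the function name `rfe_linear` and the data are assumptions made for this example.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # design matrix: intercept column plus the features still in play
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coef[1:])))   # skip the intercept
        del remaining[weakest]
    return remaining

# Synthetic data: only the first two of four features drive y
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)

selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

Coefficient magnitudes are only comparable when the features share a similar scale, which is one reason the min-max scaling is done before this step.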

# Building the model using statsmodels

In [None]:
import statsmodels.api as sm
# keep only the 15 columns selected by RFE
X_train_rfe=X_train[col]
# add the intercept (constant) column
X_train_rfe=sm.add_constant(X_train_rfe)
# fit an ordinary least squares (OLS) model on the training data
lr_1=sm.OLS(y_train,X_train_rfe).fit()
lr_1.summary()

> ### <font color=red>highwaympg has a high p-value, so drop it from the training data frame.

In [None]:
X_train_new = X_train_rfe.drop(columns=['highwaympg'])
X_train_lm = sm.add_constant(X_train_new)

lr_2 = sm.OLS(y_train, X_train_lm).fit()
lr_2.summary()

> ### <font color=red>All remaining p-values are acceptable, so calculate the VIF

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_train_new = X_train_new.drop(columns=['const'])
vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
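
For reference, the variance inflation factor of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features. The NumPy sketch below implements this textbook definition with an intercept in each auxiliary regression. The function name `vif_with_intercept` and the synthetic data are assumptions for this example. Note that statsmodels' `variance_inflation_factor` does not add an intercept on its own, so values computed on a matrix without a constant column may differ from this sketch.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data: c is almost a copy of a, b is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

The two nearly identical columns get very large VIFs while the unrelated column stays close to 1, which is the pattern used here to decide which feature to drop next.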

> ### <font color=red>High VIF so remove "enginetype_rotor"

In [None]:
X_train_new = X_train_new.drop(columns=['enginetype_rotor'])

X_train_lm = sm.add_constant(X_train_new)

lr_3 = sm.OLS(y_train, X_train_lm).fit()
lr_3.summary()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> ### <font color=red>Drop curbweight as it has a high VIF value

In [None]:
X_train_new = X_train_new.drop(columns=['curbweight'])

X_train_lm = sm.add_constant(X_train_new)
lr_4 = sm.OLS(y_train, X_train_lm).fit()
lr_4.summary()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> ### <font color=red>Do not drop enginesize, because R-squared falls sharply without it; drop the feature with the next-highest VIF, carwidth, instead

In [None]:
X_train_new = X_train_new.drop(columns=['carwidth'])

X_train_lm = sm.add_constant(X_train_new)

lr_5 = sm.OLS(y_train, X_train_lm).fit()
lr_5.summary()

> ### <font color=red>carbody_convertible has a high p-value, so drop carbody_convertible

In [None]:
X_train_new = X_train_new.drop(columns=['carbody_convertible'])

X_train_lm = sm.add_constant(X_train_new)

lr_6 = sm.OLS(y_train, X_train_lm).fit()
lr_6.summary()

> ### <font color=red>Drop enginelocation from the train data set as it has a high p-value

In [None]:
X_train_new = X_train_new.drop(columns=['enginelocation'])

X_train_lm = sm.add_constant(X_train_new)

lr_7 = sm.OLS(y_train, X_train_lm).fit()
lr_7.summary()

> ### <font color=red>No high p-values remain, so check the VIF values

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> ### <font color=red>Drop stroke as it has a high VIF value

In [None]:
X_train_new = X_train_new.drop(columns=['stroke'])

X_train_lm = sm.add_constant(X_train_new)

lr_8 = sm.OLS(y_train, X_train_lm).fit()
lr_8.summary()

> ### <font color=red>Drop boreratio as it has a high p-value

In [None]:
X_train_new = X_train_new.drop(columns=['boreratio'])

X_train_lm = sm.add_constant(X_train_new)

lr_9 = sm.OLS(y_train, X_train_lm).fit()
lr_9.summary()

> ### <font color=red>Drop cylindernumber_three as it has a high p-value

In [None]:
X_train_new = X_train_new.drop(columns=['cylindernumber_three'])

X_train_lm = sm.add_constant(X_train_new)

lr_10 = sm.OLS(y_train, X_train_lm).fit()
lr_10.summary()

In [None]:
X_train_new = X_train_new.drop(columns=['enginetype_ohc'])

X_train_lm = sm.add_constant(X_train_new)

lr_11 = sm.OLS(y_train, X_train_lm).fit()
lr_11.summary()

> ### <font color=red>No high p-values remain, so check the VIF

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> ### <font color=red>The VIF of horsepower is high, so drop it

In [None]:
X_train_new = X_train_new.drop(columns=['horsepower'])

X_train_lm = sm.add_constant(X_train_new)

lr_12 = sm.OLS(y_train, X_train_lm).fit()
lr_12.summary()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
X = X_train_new
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

> ### <font color=red>All p-values are below 0.05 (5%) and all VIF values are below 5

# Residual analysis on train data

Checking if the error terms are also normally distributed. We will plot the histogram of the error terms and check whether it is normally distributed or not.

In [None]:
X_train_lm.shape

In [None]:
y_train_price=lr_12.predict(X_train_lm)

In [None]:
# Plotting histogram of the error terms
# (histplot with a KDE curve replaces distplot, which is deprecated in recent seaborn)
fig = plt.figure(figsize=(5,5))
sns.histplot(y_train - y_train_price, kde=True)
fig.suptitle('Error Terms')
plt.xlabel('Errors')

> ### <font color=red>As you can see, the distribution of the errors is close to a normal distribution and is centered near 0.
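
One caveat when reading the histogram: for a least-squares fit that includes an intercept, the training residuals average to zero by construction, so it is the shape of the distribution that carries the information. The sketch below shows this on synthetic data with deliberately skewed noise; all names and values are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=100)
y = 2.0 + 3.0 * x + rng.exponential(size=100)   # deliberately skewed noise

A = np.column_stack([np.ones_like(x), x])        # intercept + one feature
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

resid_mean = residuals.mean()
# sample skewness: near 0 for symmetric residuals, clearly positive here
skew = np.mean((residuals - resid_mean) ** 3) / residuals.std() ** 3
print(resid_mean, skew)
```

Here the mean is zero up to floating-point error even though the residuals are far from normal, so a symmetric, bell-shaped histogram is the more meaningful check.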

> ### <font color=red>Prediction on the test data set

In [None]:
df_test.describe()

In [None]:
#creating a list which will contain all the variables which are out of scale.
li = ['wheelbase','carlength','carwidth','carheight','curbweight','enginesize','boreratio','stroke',
     'compressionratio','horsepower','peakrpm','citympg','highwaympg','price']

#apply the scaler already fitted on the training data to the columns in the above list (transform only, no refitting)
df_test[li] = scaler.transform(df_test[li])
df_test.head()

In [None]:
# creating X_test and y_test
X_test=df_test
y_test=df_test.pop("price")

In [None]:
# keep only the columns used by the lr_12 model
X_test_new=X_test[X_train_new.columns]
# adding constant
X_test_new=sm.add_constant(X_test_new)

In [None]:
# making predictions on the test set
y_pred=lr_12.predict(X_test_new)

# Model Evaluation

In [None]:
fig = plt.figure(figsize=(10,10))
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)   

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

> ### <font color=red>The R-squared value on the test set is about 80%, so the model explains most of the variation in the test prices; it is a reasonably good linear regression model and can be used to predict car prices for Geely Auto.

> ### <font color=red>Therefore our model (with price and enginesize min-max scaled) is $price = -0.1286 + 1.4493 \times enginesize - 0.1181 \times enginetype\_ohcv - 0.3495 \times cylindernumber\_twelve + 0.2840 \times cylindernumber\_two$
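
As a small usage sketch, the fitted equation above can be wrapped in a function. The coefficients are the ones reported above; the function names, the example inputs, and the price range passed to `to_original_units` are hypothetical. The real minimum and maximum come from the scaler fitted on the training data and are not reproduced here.

```python
def predict_scaled_price(enginesize, enginetype_ohcv, cylindernumber_twelve, cylindernumber_two):
    """Evaluate the fitted equation; enginesize is min-max scaled, the rest are 0/1 dummies."""
    return (-0.1286
            + 1.4493 * enginesize
            - 0.1181 * enginetype_ohcv
            - 0.3495 * cylindernumber_twelve
            + 0.2840 * cylindernumber_two)

def to_original_units(scaled_price, price_min, price_max):
    """Invert min-max scaling, given the training-set price range."""
    return scaled_price * (price_max - price_min) + price_min

# Example: a mid-range engine size, no ohcv engine, neither twelve nor two cylinders
scaled = predict_scaled_price(0.5, 0, 0, 0)

# Hypothetical price range, for illustration only
price = to_original_units(scaled, price_min=5000.0, price_max=45000.0)
print(scaled, price)
```

Since the target was scaled before fitting, the equation returns a scaled price, and the inverse step is needed before quoting a figure in currency units.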