# Car Price Prediction


### Problem Statement
Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

    - Which variables are significant in predicting the price of a car
    - How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market.

### Business Goal 

You are required to model the price of cars using the available independent variables. Management will use the model to understand exactly how prices vary with the independent variables, so that it can adjust the design of the cars, the business strategy, etc. to meet certain price levels. The model will also be a good way for management to understand the pricing dynamics of a new market.

In [21]:
import warnings
warnings.filterwarnings('ignore')

#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Step 1: Reading and Understanding the Data

Let's start with the following steps:

1. Importing data using the pandas library
2. Understanding the structure of the data

In [22]:
import pandas as pd
cars = pd.read_csv('/kaggle/input/d/dewanskhan/car-data/car data.csv')
cars.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [None]:
cars.shape

In [None]:
cars.describe()

In [None]:
cars.info()

### Step 2 : Data Cleaning and Preparation

In [None]:
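# Splitting the company name (first token) from the Car_Name column
# and storing it as a new column so that it is visible in cars.head()
cars['CompanyName'] = cars['Car_Name'].apply(lambda x: x.split(' ')[0])

cars.head()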

In [None]:
cars.Car_Name.unique()

##### Fixing invalid values
- There seem to be some spelling errors in the CompanyName column (misspelled value → corrected value):

    - `maxda` → `mazda`
    - `Nissan` → `nissan`
    - `porcshce` → `porsche`
    - `toyouta` → `toyota`
    - `vokswagen`, `vw` → `volkswagen`
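The corrections listed above can also be applied in a single pass with a dictionary. The following is a minimal sketch on toy data; the `names` Series and the `corrections` dictionary are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy company names (hypothetical), including the misspellings listed above
names = pd.Series(['maxda', 'Nissan', 'porcshce', 'toyouta', 'vokswagen', 'vw', 'audi'])

# Misspelled value -> corrected value
corrections = {
    'maxda': 'mazda',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
}

# Lowercase first so 'Nissan' and 'nissan' collapse into one category,
# then replace whole values that exactly match a key in the dictionary
cleaned = names.str.lower().replace(corrections)
print(cleaned.tolist())
```

Note that `Series.replace` with a dictionary matches whole values only, so a name such as `vw rabbit` would not be changed by this mapping.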

In [None]:
cars.Car_Name = cars.Car_Name.str.lower()

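def replace_name(a, b):
    # Assign the result back instead of relying on inplace=True,
    # which may not modify the original column in newer pandas versions
    cars['Car_Name'] = cars['Car_Name'].replace(a, b)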

replace_name('maxda','mazda')
replace_name('porcshce','porsche')
replace_name('toyouta','toyota')
replace_name('vokswagen','volkswagen')
replace_name('vw','volkswagen')

cars.Car_Name.unique()

In [None]:
#Checking for duplicates
cars.loc[cars.duplicated()]

In [None]:
cars.columns

### Step 3: Visualizing the data


In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Selling_Price')
sns.histplot(cars.Selling_Price, kde=True)

plt.subplot(1,2,2)
plt.title('Seller_Type')
sns.countplot(x=cars.Seller_Type)
plt.show()

In [None]:
print(cars.Selling_Price.describe(percentiles = [0.25,0.50,0.75,0.85,0.90,1]))

#### Inference :

1. The plot appears right-skewed, meaning that most prices in the dataset are low (below 15,000).
2. There is a significant difference between the mean and the median of the price distribution.
3. The data points are spread far from the mean, which indicates high variance in car prices (85% of the prices are below 18,500, whereas the remaining 15% lie between 18,500 and 45,400).
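The points above (right skew, mean above median, long upper tail) can be checked numerically as well as visually. The sketch below uses a small hypothetical `prices` Series rather than the notebook's data.

```python
import pandas as pd

# Hypothetical right-skewed prices: many low values and a few very high ones
prices = pd.Series([5, 6, 6, 7, 7, 8, 8, 9, 10, 12, 18, 30, 45])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_price:.2f}, median={median_price:.2f}, skew={skewness:.2f}")

# Value below which 85% of the observations fall
p85 = prices.quantile(0.85)
print(f"85th percentile: {p85:.2f}")
```

A mean well above the median together with positive skewness is the numerical counterpart of the right-skewed histogram.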

#### Step 3.1 : Visualising Categorical Data

    - CompanyName
    - Symboling
    - fueltype
    - enginetype
    - carbody
    - doornumber
    - enginelocation
    - fuelsystem
    - cylindernumber
    - aspiration
    - drivewheel

In [None]:
plt.figure(figsize=(25, 6))

# Series.plot expects the plot type as the keyword argument `kind`
plt.subplot(1,3,1)
plt1 = cars.Car_Name.value_counts().plot(kind='bar')
plt.title('Car_Name')
plt1.set(xlabel = 'Car_Name', ylabel='Frequency')

plt.subplot(1,3,2)
plt1 = cars.Year.value_counts().plot(kind='bar')
plt.title('Year')
plt1.set(xlabel = 'Year', ylabel='Frequency')

plt.subplot(1,3,3)
plt1 = cars.Present_Price.value_counts().plot(kind='bar')
plt.title('Present_Price')
plt1.set(xlabel = 'Present_Price', ylabel='Frequency')

plt.show()

#### Inference :

1. `Toyota` seems to be the most favored car company.
2. There are more `gas` fueled cars than `diesel` ones.
3. `sedan` is the most preferred car type.

In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Car_Name')
sns.countplot(x=cars.Car_Name, palette=("cubehelix"))

plt.subplot(1,2,2)
plt.title('Year')
sns.boxplot(x=cars.Year, y=cars.Present_Price, palette=("cubehelix"))

plt.show()

#### Inference :

1. Symboling values `0` and `1` have the highest number of rows (i.e. they are the most sold).
2. Cars with `-1` symboling seem to be high priced (which makes sense, since an insurance risk rating of -1 is quite good). However, symboling `3` has a price range similar to that of `-2`. There is a dip in price at symboling `1`.

In [None]:
plt.figure(figsize=(20,8))

plt.subplot(1,2,1)
plt.title('Car_Name')
sns.countplot(x=cars.Car_Name, palette=("Blues_d"))

plt.subplot(1,2,2)
plt.title('Year')
sns.boxplot(x=cars.Year, y=cars.Selling_Price, palette=("PuBuGn"))

plt.show()

# Average selling price per fuel type
df = pd.DataFrame(cars.groupby(['Fuel_Type'])['Selling_Price'].mean().sort_values(ascending = True))
df.plot.bar(figsize=(8,6))
plt.title('Fuel_Type vs Average Selling_Price')
plt.show()


#### Inference :

1. `ohc` seems to be the most favored engine type.
2. `ohcv` has the highest price range (while `dohcv` has only one row); `ohc` and `ohcf` have low price ranges.

In [None]:
# DataFrame.plot.bar creates its own figure, so the size is passed directly
df = pd.DataFrame(cars.groupby(['Car_Name'])['Year'].mean().sort_values(ascending = False))
df.plot.bar(figsize=(25, 6))
plt.title('Car_Name vs Average Year')
plt.show()

df = pd.DataFrame(cars.groupby(['Selling_Price'])['Present_Price'].mean().sort_values(ascending = False))
df.plot.bar(figsize=(25, 6))
plt.title('Selling_Price vs Average Present_Price')
plt.show()

df = pd.DataFrame(cars.groupby(['Car_Name'])['Selling_Price'].mean().sort_values(ascending = False))
df.plot.bar(figsize=(25, 6))
plt.title('Car_Name vs Average Selling_Price')
plt.show()


#### Inference :

1. `Jaguar` and `Buick` seem to have the highest average price.
2. `diesel` has a higher average price than gas.
3. `hardtop` and `convertible` have higher average prices.

#### Inference :

1. The `doornumber` variable does not affect the price much. There is no significant difference between its categories.
2. Cars with `turbo` aspiration seem to have a higher price range than `std` (though it has some high values outside the whiskers).

#### Inference :

1. There are too few data points in the `enginelocation` categories to make an inference.
2. The most common numbers of cylinders are `four`, `six` and `five`, though `eight` cylinders have the highest price range.
3. `mpfi` and `2bbl` are the most common fuel systems. `mpfi` and `idi` have the highest price ranges, but there is too little data for the other categories to derive any meaningful inference.
4. There is a very significant difference across the drivewheel categories. Most high-range cars seem to prefer the `rwd` drivewheel.

#### Step 3.2 : Visualising numerical data

#### Inference :

1. `carwidth`, `carlength` and `curbweight` seem to have a positive correlation with `price`.
2. `carheight` doesn't show any significant trend with price.

#### Inference :

1. `enginesize`, `boreratio`, `horsepower`, `wheelbase` - seem to have a significant positive correlation with price.
2. `citympg`, `highwaympg` - seem to have a significant negative correlation with price.

### Step 4 : Deriving new features
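Later cells use the derived columns `fueleconomy` and `carsrange`. The following is a minimal sketch of how features like these could be built. The 0.55/0.45 weights, the price bins, the tier labels and the toy data are all assumptions made for illustration; they are not confirmed by this notebook.

```python
import pandas as pd

# Toy frame (hypothetical values); column names follow those used elsewhere in this notebook
toy = pd.DataFrame({
    'CompanyName': ['a', 'a', 'b', 'b', 'c', 'c'],
    'citympg':     [30, 28, 20, 22, 15, 17],
    'highwaympg':  [38, 36, 26, 28, 20, 22],
    'price':       [7000, 8000, 15000, 17000, 30000, 34000],
})

# Assumed weighting: a blend of city and highway mileage
toy['fueleconomy'] = 0.55 * toy['citympg'] + 0.45 * toy['highwaympg']

# Average price per company, then bin it into assumed price tiers
avg_price = toy.groupby('CompanyName')['price'].transform('mean')
toy['carsrange'] = pd.cut(avg_price, bins=[0, 10000, 20000, 40000],
                          labels=['Budget', 'Medium', 'Highend'])
print(toy[['CompanyName', 'fueleconomy', 'carsrange']])
```

Binning on the company-level average rather than on each car's own price keeps the tier a property of the brand, which is one reasonable design choice among several.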

### Step 5 : Bivariate Analysis

#### Inference :

1. `fueleconomy` has an obvious, significant `negative correlation` with price.

#### Inference :

1. High ranged cars prefer `rwd` drivewheel with `idi` or `mpfi` fuelsystem.

### List of significant variables after Visual analysis :

    - Car Range 
    - Engine Type 
    - Fuel type 
    - Car Body 
    - Aspiration 
    - Cylinder Number 
    - Drivewheel 
    - Curbweight 
    - Car Length
    - Car width
    - Engine Size 
    - Boreratio 
    - Horse Power 
    - Wheel base 
    - Fuel Economy 

In [None]:
cars_lr = cars[['price', 'fueltype', 'aspiration','carbody', 'drivewheel','wheelbase',
                  'curbweight', 'enginetype', 'cylindernumber', 'enginesize', 'boreratio','horsepower', 
                    'fueleconomy', 'carlength','carwidth', 'carsrange']]
cars_lr.head()

In [None]:
sns.pairplot(cars_lr)
plt.show()

### Step 6 : Dummy Variables

In [None]:
# Defining the map function
def dummies(x,df):
    temp = pd.get_dummies(df[x], drop_first = True)
    df = pd.concat([df, temp], axis = 1)
    df.drop([x], axis = 1, inplace = True)
    return df
# Applying the function to the cars_lr

cars_lr = dummies('fueltype',cars_lr)
cars_lr = dummies('aspiration',cars_lr)
cars_lr = dummies('carbody',cars_lr)
cars_lr = dummies('drivewheel',cars_lr)
cars_lr = dummies('enginetype',cars_lr)
cars_lr = dummies('cylindernumber',cars_lr)
cars_lr = dummies('carsrange',cars_lr)
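To see what `pd.get_dummies(..., drop_first = True)` does inside the helper, here is a small sketch on a hypothetical `drivewheel` column (toy values, not the notebook's data).

```python
import pandas as pd

# Toy frame with one categorical column (hypothetical values)
toy = pd.DataFrame({'drivewheel': ['fwd', 'rwd', '4wd', 'fwd'], 'price': [1, 2, 3, 4]})

# drop_first=True drops the first category in sorted order ('4wd' here),
# which then acts as the baseline level in the regression
encoded = pd.get_dummies(toy['drivewheel'], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

A row belonging to the dropped category is encoded as all zeros, so k categories need only k-1 dummy columns and perfect collinearity with the intercept is avoided.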

In [None]:
cars_lr.head()

In [None]:
cars_lr.shape

### Step 7 : Train-Test Split and feature scaling

In [None]:
from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(cars_lr, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
num_vars = ['wheelbase', 'curbweight', 'enginesize', 'boreratio', 'horsepower','fueleconomy','carlength','carwidth','price']
df_train[num_vars] = scaler.fit_transform(df_train[num_vars])
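Min-max scaling maps each feature to the range [0, 1] using the minimum and maximum seen in the training data. The sketch below reproduces the arithmetic by hand on hypothetical values, to show why the test set should reuse the training statistics rather than be fitted again.

```python
import pandas as pd

# Hypothetical train/test values for a single numeric feature
train = pd.DataFrame({'horsepower': [50.0, 100.0, 150.0]})
test = pd.DataFrame({'horsepower': [75.0, 200.0]})

# "Fit": learn min and max from the training data only
col_min = train['horsepower'].min()
col_max = train['horsepower'].max()

# "Transform": x_scaled = (x - min) / (max - min)
train_scaled = (train['horsepower'] - col_min) / (col_max - col_min)
test_scaled = (test['horsepower'] - col_min) / (col_max - col_min)

print(train_scaled.tolist())  # [0.0, 0.5, 1.0]
print(test_scaled.tolist())   # [0.25, 1.5] (test values may fall outside [0, 1])
```

With scikit-learn, the equivalent for the test set is `scaler.transform(df_test[num_vars])`, using the `scaler` already fitted on `df_train`.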

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
#Correlation using heatmap
plt.figure(figsize = (30, 25))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

Variables highly correlated with price are `curbweight`, `enginesize`, `horsepower`, `carwidth` and `highend`.
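Reading a large annotated heatmap by eye can be error-prone, so it may help to rank features by their correlation with the target directly. The sketch below uses synthetic data; `enginesize` and `noise_feature` are illustrative stand-ins, not the notebook's columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic data (hypothetical): price depends strongly on enginesize, not on noise_feature
enginesize = rng.normal(size=n)
noise_feature = rng.normal(size=n)
price = 3.0 * enginesize + rng.normal(scale=0.5, size=n)
toy = pd.DataFrame({'enginesize': enginesize, 'noise_feature': noise_feature, 'price': price})

# Pearson correlation of every feature with the target, strongest (by magnitude) first
corr_with_price = toy.corr()['price'].drop('price').sort_values(key=abs, ascending=False)
print(corr_with_price)
```

Sorting by absolute value keeps strong negative correlations near the top as well.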

In [None]:
#Dividing data into X and y variables
y_train = df_train.pop('price')
X_train = df_train

### Step 8 : Model Building

In [None]:
#RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
lm = LinearRegression()
lm.fit(X_train,y_train)
rfe = RFE(lm, n_features_to_select=10)  # keyword argument is required in recent scikit-learn versions
rfe = rfe.fit(X_train, y_train)

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
X_train.columns[rfe.support_]
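As a standalone illustration of how `support_` and `ranking_` should be read, here is a minimal RFE sketch on synthetic data generated with `make_regression`. The data and the choice of three features are assumptions for the example only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 8 features, only 3 of which carry signal
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# Repeatedly fit the model and eliminate the weakest feature (by coefficient size)
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # True for the 3 retained features
print(selector.ranking_)   # 1 for retained features; larger ranks were eliminated earlier
```

Because RFE compares coefficient magnitudes, it is most meaningful when the features are on a comparable scale, which is one reason for scaling before this step.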

#### Building model using statsmodel, for the detailed statistics

In [None]:
X_train_rfe = X_train[X_train.columns[rfe.support_]]
X_train_rfe.head()

In [None]:
def build_model(X,y):
    X = sm.add_constant(X) #Adding the constant
    lm = sm.OLS(y,X).fit() # fitting the model
    print(lm.summary()) # model summary
    return X
    
def checkVIF(X):
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)
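The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data. The helper `vif_manual` is a hypothetical name, and this sketch adds an intercept to each auxiliary regression.

```python
import numpy as np

def vif_manual(X, i):
    """VIF of column i: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                   # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = [vif_manual(X, i) for i in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly identical columns get very large VIFs while the independent column stays close to 1, which is the pattern that motivates dropping one of a collinear pair.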

#### MODEL 1

In [None]:
X_train_new = build_model(X_train_rfe,y_train)

The p-value of `twelve` is higher than the significance level of 0.05, so we drop it as it is insignificant in the presence of the other variables.

In [None]:
X_train_new = X_train_rfe.drop(["twelve"], axis = 1)

#### MODEL 2

In [None]:
X_train_new = build_model(X_train_new,y_train)

In [None]:
X_train_new = X_train_new.drop(["fueleconomy"], axis = 1)

#### MODEL 3

In [None]:
X_train_new = build_model(X_train_new,y_train)

In [None]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

Dropping `curbweight` because of its high VIF value, which indicates high multicollinearity.

In [None]:
X_train_new = X_train_new.drop(["curbweight"], axis = 1)

#### MODEL 4


In [None]:
X_train_new = build_model(X_train_new,y_train)

In [None]:
checkVIF(X_train_new)

Dropping `sedan` because of its high VIF value.

#### MODEL 5

In [None]:
X_train_new = X_train_new.drop(["wagon"], axis = 1)

#### MODEL 6

#### MODEL 7

### Step 9 : Residual Analysis of Model

The error terms appear to be approximately normally distributed, so the normality assumption of linear regression seems to be fulfilled.
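A residual check of this kind can be sketched as follows. The data is synthetic and the fit uses NumPy least squares rather than the notebook's statsmodels model, so all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data (hypothetical): linear signal plus Gaussian noise
x = rng.uniform(0, 1, size=300)
y = 2.0 + 3.0 * x + rng.normal(scale=0.2, size=300)

# Ordinary least squares with an intercept
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# With an intercept, OLS residuals average to zero by construction;
# the histogram counts give a rough view of the shape of their distribution
counts, edges = np.histogram(residuals, bins=10)
print(f"mean residual: {residuals.mean():.2e}")
print(counts)
```

In a notebook, the same `residuals` array could be passed to a histogram plot to inspect the shape visually; a roughly symmetric, bell-shaped histogram centred on zero supports the normality assumption.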

### Step 10 : Prediction and Evaluation

#### Evaluation of test via comparison of y_pred and y_test

#### Evaluation of the model using Statistics

#### Inference :

1. *R-squared and Adjusted R-squared (extent of fit)* - 0.899 and 0.896 - `90%` variance explained.
2. *F-stats and Prob(F-stats) (overall model fit)* - 308.0 and 1.04e-67 (approx. 0.0) - The model fit is significant, and the `90%` explained variance is not just by chance.
3. *p-values* - The p-values for all the coefficients are less than the significance level of 0.05, meaning that all the predictors are statistically significant.
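For reference, R-squared and adjusted R-squared can be computed directly from predictions. The helper `r2_scores` and the numbers below are hypothetical and are not taken from this notebook's model.

```python
import numpy as np

def r2_scores(y_true, y_pred, n_predictors):
    """Return (R^2, adjusted R^2) for a model with n_predictors features."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    n = len(y_true)
    # Adjusted R^2 penalises the number of predictors
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Hypothetical values, not taken from this notebook's model
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
r2, adj_r2 = r2_scores(y_true, y_pred, n_predictors=2)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Adjusted R-squared is never larger than R-squared, and the gap widens as predictors are added without a matching gain in fit, which is why both are reported together.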