## USED CAR PRICE PREDICTION  

**Submitted by Kommuri Raju**

**GCD Student, INSAID**

**Batch: May 9, 2021**

## Table of Contents

1. [Introduction](#section1)<br>
2. [Problem Definition](#section2)<br>
3. [Library Installation](#section3)<br>
    - 3.1 [Installing Libraries](#section301)<br>
    - 3.2 [Importing Libraries](#section302)<br>
4. [Data Acquisition & Description](#section4)<br>
    - 4.1 [Data Description](#section401)<br>
    - 4.2 [Data Information](#section402)<br>
5. [Data Pre-processing](#section5)<br>
    - 5.1 [Pre-profiling Report](#section501)<br>
    - 5.2 [Data Cleaning](#section502)<br>
    - 5.3 [Post-Profiling Report](#section503)<br>    
6. [Exploratory Data Analysis](#section6)<br>
    - 6.1 [Analysis of Features w.r.t. Target Variable](#section601)<br>
7. [Data Post-processing](#section7)<br>
    - 7.1 [Handling Categorical columns](#section701)<br>
    - 7.2 [Feature Extraction](#section702)<br>
    - 7.3 [Linear Regression](#section703)<br>


<a name = section1></a>
## 1. Introduction

**Company Introduction**

- **SWIPECAR** is an American company that **buys and sells second-hand cars.**
- **They started their business in the late 80s and have gained huge popularity over the years.**
- **The company's clients are local and foreign customers who seek to buy and sell second-hand cars.**

**Current Scenario**
- The company has started **facing business losses due to technical advancements.**
- There are **several competitors in the market who have been using enhanced techniques.**
- The **company is fairly old and has been using traditional measures to estimate old cars' prices.**
- These **traditional measures include weight analysis, condition of parts and build year.**
- They are looking for a **more robust way to estimate the price of old cars.**


<a name = section2></a>
## 2. Problem Definition

**The current process suffers from the following problems:**

- They have been using manual, traditional measures to estimate old cars' prices.
- These measures are **time-consuming and inaccurate.**
- The company is looking for a robust way to estimate the prices of used cars.

Recently they learned about data scientists, who help businesses sort out such issues. They decided to hire a team of data scientists. Consider yourself one of them.

**Your Role**
- You will be given a set of data to predict the prices of used cars based on their features.
- You need to find an automated way to get rid of their manual work.
- Your task is to build a regression model using the provided data.

**Project Deliverables**
- Deliverable: Predict the prices of used cars.
- Machine Learning Task: Regression
- Target Variable: Price
- Win Condition: N/A (best possible model)

**Evaluation Metric**
- The model evaluation will be based on the RMSE score.
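
As a reminder of what this metric measures, here is a minimal sketch that computes RMSE by hand with NumPy. The helper name `rmse` and the sample prices are illustrative assumptions, not part of the project data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices, for illustration only
actual = [10000, 15000, 20000]
predicted = [11000, 14000, 23000]
score = rmse(actual, predicted)
print('RMSE:', score)
```

Because the residuals are squared before averaging, one large miss (the 3000 error here) weighs more heavily than several small ones.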

<a name = section3></a>
## 3. Library Installation

<a name = section301></a>
### 3.1 Installing Libraries

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data

<a name = section302></a>
### 3.2 Importing Libraries

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing package pandas (For Panel Data Analysis)
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing library for scientific calculations
#-------------------------------------------------------------------------------------------------------------------------------

<a name = section4></a>
## 4. Data Acquisition & Description


The dataset is divided into two parts:

**Train Set:**

| Records | Features | Target Variable |
| :-- | :-- | :-- |
|**181**|**27**|**Price**|


**Test Set:**

| Records | Features | Predicted Variable |
| :-- | :-- | :-- |
| **20**|**26**|**Price** |


**The model development for price prediction will be done in Jupyter Notebook.**


|Id|Feature|Description|
|:--|:--|:--|
|01| ID |Feature uniquely identifying each record| 
|02| symboling | Degree to which the auto is riskier than its price indicates|  
|03| normalized-losses | Relative average loss payment per insured vehicle year|
|04| make| Make of the car|   
|05| fuel-type| Type of fuel consumed by the car|
|06| aspiration| Type of internal combustion engine used|
|07| num-of-doors| Number of doors available in the car|
|08| body-style| Body style of car|
|09| drive-wheels | Drive wheel of car|
|10| engine-location| Location of engine in car|
|11| wheel-base| Distance between the centres of the front and rear wheels|
|12| length| Length of the car|
|13|post-code| Postal Code of the customer| 
|14| width | Width of the car.|  
|15| height | Height of the car| 
|16| curb-weight| Total mass of a vehicle with standard equipment|   
|17| engine-type| Type of engine used in the car|
|18| num-of-cylinders| Number of cylinders used in the car|
|19| engine-size| Size of the engine used in the car|
|20| fuel-system| Type of fuel system used in the car|
|21| bore | Diameter of each cylinder in the piston engine|
|21| stroke| Full travel of the piston along the cylinder, in either direction|
|22| compression-ratio| Ratio of the volume of the cylinder and combustion chamber when the piston is at the bottom to the volume of the combustion chamber when the piston is at the top|
|23| horsepower| Power produced by the car's engine|
|24| peak-rpm | Engine speed, in revolutions per minute, at which maximum power is produced|
|25| city-mpg | City mileage per gallon rating of car| 
|26| highway-mpg| Highway mileage per gallon rating of car|   
|27| price| Price of the car |

In [None]:
# Loading the Train dataset #
missing_values = ['n.a','?','na','NA']                    # Data file contains "?" sign as values in some of the columns
data = pd.read_csv("TrainData-cars price-Narayan.csv",na_values = missing_values)
print('Data Shape: ', data.shape)
print(data.head())

In [None]:
data_final = pd.read_csv("TestData-Cars price-Narayan.csv",na_values = missing_values)
print('Data Shape:', data_final.shape)
print(data_final.head())

<a name = section401></a>
### 4.1 Data Description

In [None]:
data.describe()

In [None]:
data_final.describe()

<a name = section402></a>
### 4.2 Data Information

In [None]:
data.info()

In [None]:
data_final.info()

<a name = section5></a>
## 5. Data Pre-processing

In [None]:
data.isnull().sum()

In [None]:
(data == 0 ).sum(axis = 0)

In [None]:
data_final.isnull().sum()

In [None]:
(data_final == 0 ).sum(axis = 0)

## Observations:
 - The training set has 59 zeros in the **symboling** column, which is fine because symboling values range from -3 to +3.
 - The training set also has **missing values** (recorded as "?") in the columns **'normalized-losses', 'num-of-doors', 'bore', 'stroke', 'horsepower' and 'peak-rpm'.**
 - The training set has 6 **missing values** in the **normalized-losses** column.
 - The data is cleaned by filling **0s, '?' entries and blanks** with the mean, median or mode of that particular column, as deemed fit.
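
The fill strategy described above can be sketched on a toy frame. The values below are made up and only stand in for the real columns. The sketch uses the median for a numeric column and the mode for a categorical one, which is one reasonable choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "horsepower": [95.0, np.nan, 110.0, 70.0],
    "num-of-doors": ["four", "two", None, "four"],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
toy["horsepower"] = toy["horsepower"].fillna(toy["horsepower"].median())

# Categorical column: fill with the most frequent value (mode)
toy["num-of-doors"] = toy["num-of-doors"].fillna(toy["num-of-doors"].mode()[0])

print(toy)
```

Computing the fill value from the column itself avoids hard-coding numbers that would go stale if the data changed.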

<a name = section501></a>
### 5.1 Pre-profiling Report

In [None]:
profile_Train = ProfileReport(df = data, minimal = True)

print('Pre-Profiling Accomplished!')
profile_Train

In [None]:
profile_Test = ProfileReport(df = data_final, minimal = True)

print('Pre-Profiling Accomplished!')
profile_Test

<a name = section502></a>
### 5.2 Data Cleaning

In [None]:
print(data['normalized-losses'].median())
print(data['normalized-losses'].mean())

In [None]:
print(data['num-of-doors'].mode())

In [None]:
print(data['bore'].median())
print(data['bore'].mean())

In [None]:
print(data['stroke'].median())
print(data['stroke'].mean())

In [None]:
print(data['horsepower'].median())
print(data['horsepower'].mean())

In [None]:
print(data['peak-rpm'].median())
print(data['peak-rpm'].mean())

In [None]:
# Assign the result back to the column; inplace=True on a column selection
# may not update the original DataFrame in newer pandas versions
data['normalized-losses'] = data['normalized-losses'].fillna(115.0)    # median
data['bore'] = data['bore'].fillna(3.33)
data['stroke'] = data['stroke'].fillna(3.2745)
data['horsepower'] = data['horsepower'].fillna(95.0)
data['peak-rpm'] = data['peak-rpm'].fillna(5100.0)
data['num-of-doors'] = data['num-of-doors'].fillna('four')             # mode

In [None]:
print(data.isnull().sum())                                # print explicitly; only the last expression in a cell is displayed
data.info()

In [None]:
print(data_final['normalized-losses'].median())
print(data_final['normalized-losses'].mean())

In [None]:
data_final['normalized-losses'] = data_final['normalized-losses'].fillna(128.0)

In [None]:
data_final.info()

In [None]:
# Handling duplicate data
print('Contains duplicate rows?', data.duplicated().any())
data[data.duplicated()]

In [None]:
data_final[data_final.duplicated()]

## Observations 

- **No missing values** remain in the datasets.
- **No duplicated rows** in the datasets.

<a name = section503></a>
### 5.3 Post-profiling Report

In [None]:
post_profile_data = ProfileReport(df = data,explorative = True)

print('Post-Profiling Accomplished!')
post_profile_data

In [None]:
post_profile_data_final = ProfileReport(df = data_final,explorative = True)

print('Post-Profiling Accomplished!')
post_profile_data_final

**Data is clean so we are good to go for Analysis**

<a name = section6></a>
## 6. Exploratory Data Analysis

In [None]:
sns.histplot(data['price'], kde = True)                   # histplot replaces distplot, which is deprecated in recent seaborn releases

## Observations

 - There is **Skewness** in the target variable.
 - We have to **transform** it so it can be used for **model building.**
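
One common way to reduce right skew in a price column is a log transform. The sketch below uses made-up prices and `np.log1p`. It is an illustration of the idea, not the transformation used later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices, for illustration only
prices = pd.Series([5000, 7000, 10000, 14000, 20000, 28000, 40000], dtype=float)

log_prices = np.log1p(prices)      # log(1 + x) compresses the long right tail
restored = np.expm1(log_prices)    # inverse transform, back to the original units

print('Skew before:', prices.skew())
print('Skew after: ', log_prices.skew())
```

If a model is trained on the transformed target, its predictions need to be passed through `np.expm1` before errors are reported in price units.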

<a name = section601></a>
### 6.1 Analysis of features w.r.t. Target Variable

In [None]:
# Manufacturer having highest Car count
fig = plt.figure(figsize = (10,7))
data['make'].value_counts().plot(kind = 'barh', cmap = 'Dark2')
plt.ylabel('Brand Name', size = 14)
plt.xlabel('Car count', size = 14)
plt.title('Maker having Highest Car count', size = 16)
plt.grid()

In [None]:
# No of cars having four doors
figure = plt.figure(figsize =(15,7))
data['num-of-doors'].value_counts().plot(kind = 'barh',color = 'green')
plt.xlabel('Car count', size = 14)
plt.ylabel('Car Doors', size = 14)
plt.title('No of cars with door count', size = 16)
plt.grid()

In [None]:
# Car counts by drive-wheel type
figure = plt.figure(figsize = (10,7))
data['drive-wheels'].value_counts().plot(kind = 'bar', cmap = 'Dark2')
plt.xlabel('Drive-wheel type', size = 14)
plt.ylabel('Car Count', size = 14)
plt.title('Car count by drive-wheel type', size = 16)

In [None]:
# Car counts by degree of risk (symboling)
figure = plt.figure(figsize = (10,7))
data['symboling'].value_counts().plot(kind = 'barh', cmap = 'Set2')
plt.xlabel('Car Count', size = 14)
plt.ylabel('Symboling (risk rating)', size = 14)
plt.title('Risk profile', size = 16)
plt.grid()

In [None]:
# Most Expensive Car
figure = plt.figure(figsize = (20,8))
make_price = data.groupby('make')['price'].mean()   # value_counts().sort_values(ascending = False)[:15]

make_price.plot.bar(color = 'tomato')
plt.xlabel('Car Company', size = 14)
plt.ylabel('Car Price', size = 14)
plt.title('Most Expensive Car', size = 16)
plt.xticks(size = 12)
plt.grid()

In [None]:
# Pairwise relationships between price and selected numeric features
sns.pairplot(data= data[['price','city-mpg','highway-mpg','horsepower','engine-size','curb-weight']], height = 2, aspect = 1.5)

In [None]:
figure = plt.figure(figsize = (20,8))
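# numeric_only=True skips the string columns (needed in pandas 2.x); vmin=-1 keeps negative correlations visible
HeatMap = sns.heatmap(data.corr(numeric_only = True),annot = True,cmap = 'coolwarm',vmin = -1, vmax = 1,linecolor = 'black',linewidths = 1)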
HeatMap.set_title('Correlation HeatMap', fontdict = {'fontsize':16})

In [None]:
data[['highway-mpg', 'city-mpg','horsepower','engine-size','curb-weight','peak-rpm']].plot(kind='box', figsize=(10, 7), vert=False)

## Observations

 - **Price** has a **strong positive correlation** with **curb-weight (0.84), engine-size (0.87) and horsepower (0.8).**
 - **Price** has a **strong negative correlation** with **city-mpg (-0.68) and highway-mpg (-0.7).**
 - **wheel-base** is highly correlated with **height, width and curb-weight.**
 - **engine-size** is highly correlated with **horsepower.**
 - **bore** is highly correlated with **curb-weight and length.**
 - **city-mpg** is highly correlated with **highway-mpg.**
 - The independent variables mentioned above result in **multicollinearity.**
 - Some outliers are still present in the feature columns of the dataset, for which we will **apply transformations**.
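
Highly correlated feature pairs like those listed above can also be pulled out programmatically instead of being read off the heatmap. This is a minimal sketch on a made-up frame. The 0.9 threshold is an arbitrary illustrative choice, and in the notebook the input would be the numeric columns of `data`.

```python
import numpy as np
import pandas as pd

# Small made-up frame, for illustration only
toy = pd.DataFrame({
    "city-mpg":    [21, 24, 19, 30, 27, 17],
    "highway-mpg": [27, 30, 25, 37, 33, 22],
    "height":      [54.3, 50.2, 52.0, 55.1, 53.0, 50.8],
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                  # Series indexed by (feature, feature)
high_pairs = pairs[pairs > 0.9]        # masked (NaN) entries never pass the filter
print(high_pairs)
```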


<a name = section7></a>
## 7. Data Post-processing

 - **Machine learning algorithms** work with **numerical** values.
 - So, it is necessary to **convert the categorical columns into numerical columns.**

<a name = section701></a>
### 7.1 Handling Categorical columns 

### (A) Categorical Columns : 
make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, engine-type, num-of-cylinders, fuel-system.

### (B) Numerical Columns:
ID, symboling, normalized-losses, wheel-base, length, width, height, curb-weight, engine-size, bore, stroke, compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg.

In [None]:
# Creating a new Dataframe
data_categorical = data[['make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location',
                         'engine-type','num-of-cylinders','fuel-system']]
data_categorical

In [None]:
from sklearn.preprocessing import LabelEncoder
data_categorical = data_categorical.apply(LabelEncoder().fit_transform)
data_categorical

In [None]:
data_numerical = data[['symboling', 'normalized-losses','wheel-base','curb-weight','engine-size','highway-mpg', 'price' ]]
data_numerical

## Observations
- We are taking only **7 numerical** columns, including the target column `price`.

In [None]:
data_model = pd.concat([data_categorical,data_numerical], axis=1)
data_model

## VIF ==> Variance Inflation Factor
- **Checking multicollinearity using VIF**
- **VIF supports only NUMERICAL values, so STRING and LOGICAL columns must be dropped**
- **The values cannot be BLANK or NaN**
- **VIF, or Variance Inflation Factor, measures how strongly one INDEPENDENT VARIABLE (PREDICTOR) is correlated with the group of OTHER PREDICTORS**
- **VIF is the RECIPROCAL of the TOLERANCE value**
- **A RULE OF THUMB commonly used in practice: VIF < 3 is IDEAL, and VIF > 10 indicates HIGH MULTICOLLINEARITY**
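
To make the "reciprocal of tolerance" definition concrete, the sketch below computes VIF by hand: each column is regressed on the others, and VIF is 1 / (1 - R²) of that regression. The helper name `vif_table` and the synthetic data are assumptions for illustration. This version adds an intercept to each regression, so its numbers may differ from those of `variance_inflation_factor` in statsmodels.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the other columns."""
    out = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])   # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)                      # reciprocal of tolerance (1 - R^2)
    return pd.Series(out, name="VIF")

# Synthetic predictors: 'a_copy' is almost a duplicate of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.05 * rng.normal(size=200)})

vifs = vif_table(toy)
print(vifs)
```

The near-duplicate pair should show VIF values far above 10, while the independent column stays close to 1.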

In [None]:
data_intermediate = data[['ID', 'symboling', 'normalized-losses', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak-rpm', 'city-mpg', 'highway-mpg', 'price' ]]
data_intermediate

In [None]:
data_intermediate = pd.concat([data_categorical,data_intermediate], axis=1)
data_intermediate

In [None]:
data_intermediate.columns

In [None]:
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Calculate the VIF of every column in X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

# VIF applies to predictors only, so the target column is excluded
c = data_intermediate.drop(columns = ['price'])
calc_vif(c)

In [None]:
x = data_model.drop(['price'],axis = 1)
x

In [None]:
y = data['price']
y

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [None]:
print('Train cases as below')
print('x_train shape: ',x_train.shape)
print('y_train shape: ',y_train.shape)
print('\nTest cases as below')
print('x_test shape: ',x_test.shape)
print('y_test shape: ',y_test.shape)

In [None]:
x_train.head(5)

In [None]:
y_train.head(5)

In [None]:
x_test.head(5)

In [None]:
y_test.head(5)

<a name = section703></a>
### 7.3 Linear Regression

**Here, we will use these 4 steps :** 

 - (a) Load the algorithm 
 - (b) Instantiate and Fit the model/Train the model using the training dataset
 - (c) Prediction on the test set
 - (d) Evaluate the model: calculate RMSE, the go-to metric
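
The four steps above can be sketched end to end on synthetic data. The arrays here are generated for illustration and are unrelated to the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression          # (a) load the algorithm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)                 # (b) instantiate and fit
pred = model.predict(X_te)                                 # (c) predict on the test set
rmse_te = np.sqrt(mean_squared_error(y_te, pred))          # (d) evaluate with RMSE
print('Test RMSE:', rmse_te)
```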

In [None]:
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression(fit_intercept=True)

In [None]:
lr_model.fit(x_train, y_train)

In [None]:
y_pred_train = lr_model.predict(x_train)  
y_pred_test = lr_model.predict(x_test) 

## Different Metrics for Evaluation of the LINEAR MODEL before HYPER PARAMETER TUNING

In [None]:
from sklearn import metrics
MAE_train = metrics.mean_absolute_error(y_train, y_pred_train)
MAE_test = metrics.mean_absolute_error(y_test, y_pred_test)
print('MAE for training set is {}'.format(MAE_train))
print('MAE for test set is {}'.format(MAE_test))

In [None]:
RMSE_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
RMSE_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
print('RMSE for training set is {}'.format(RMSE_train))
print('RMSE for test set is {}'.format(RMSE_test))

In [None]:
r2_train = metrics.r2_score(y_train, y_pred_train)
r2_test = metrics.r2_score(y_test, y_pred_test)
print('R-Squared of train data:',r2_train)
print('R-Squared of test data:',r2_test)

#### As there is a large difference between the train score and the test score, the model is overfitted. To overcome this issue, we will use hyperparameter tuning.

## Use RandomizedSearchCV for Parameters' Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Tunable LinearRegression parameters include fit_intercept, copy_X, n_jobs and positive

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

In [None]:
lr_best = LinearRegression(n_jobs = 1)
# Candidate values must be real booleans; the strings 'True' and 'False' are not valid settings.
# `normalize` is not searched because it was removed from LinearRegression in scikit-learn 1.2.
param_dict = {"fit_intercept": [True, False],
              "copy_X": [True, False]}
# n_iter matches the 4 possible parameter combinations
rs = RandomizedSearchCV(estimator = lr_best, param_distributions = param_dict, n_iter = 4, n_jobs = -1, random_state = 0)
topmost = rs.fit(x_train, y_train)


In [None]:
rs.best_params_

In [None]:
rs.best_score_

In [None]:
print(topmost.best_params_)
print(topmost.best_score_)
print(topmost.best_estimator_)

In [None]:
y_pred_train_best = topmost.predict(x_train)  

y_pred_test_best = topmost.predict(x_test) 

## Different Metrics for Evaluation of the LINEAR MODEL after HYPER PARAMETER TUNING

### RMSE is the GO-TO Metric here

In [None]:
r2_train_best = metrics.r2_score(y_train, y_pred_train_best)
r2_test_best = metrics.r2_score(y_test, y_pred_test_best)
print('R-Squared of train data:',r2_train_best)
print('R-Squared of test data:',r2_test_best)

In [None]:
metrics.mean_absolute_error(y_train, y_pred_train_best)

### We have also calculated other metrics, viz.
- MAE (Mean Absolute Error)
- MSLE (Mean Squared Logarithmic Error): this score cannot be applied if there are NEGATIVE values in the targets or predictions
- R2 (R-squared score)
- MAPE (Mean Absolute Percentage Error)
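
For reference, here is a small sketch of how these metrics can be computed with `sklearn.metrics` on made-up numbers. `mean_absolute_percentage_error` assumes scikit-learn 0.24 or newer and returns a fraction, not a percentage.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 180.0, 400.0])

mae = metrics.mean_absolute_error(y_true, y_hat)              # (10 + 20 + 0) / 3
msle = metrics.mean_squared_log_error(y_true, y_hat)          # requires non-negative values
r2 = metrics.r2_score(y_true, y_hat)
mape = metrics.mean_absolute_percentage_error(y_true, y_hat)  # (0.1 + 0.1 + 0) / 3

print('MAE:', mae)
print('MSLE:', msle)
print('R2:', r2)
print('MAPE:', mape)
```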

In [None]:
MAE_train = metrics.mean_absolute_error(y_train, y_pred_train_best)
MAE_test = metrics.mean_absolute_error(y_test, y_pred_test_best)
print('MAE for training set is {}'.format(MAE_train))
print('MAE for test set is {}'.format(MAE_test))

In [None]:
# MSLE is only defined for non-negative targets and predictions
MSLE_train = metrics.mean_squared_log_error(y_train, y_pred_train_best)
MSLE_test = metrics.mean_squared_log_error(y_test, y_pred_test_best)
print('Mean squared logarithmic error (MSLE) of train data:', MSLE_train)
print('Mean squared logarithmic error (MSLE) of test data:', MSLE_test)

In [None]:
r2_train = metrics.r2_score(y_train, y_pred_train_best)
r2_test = metrics.r2_score(y_test, y_pred_test_best)
print('R-Squared of train data:',r2_train)
print('R-Squared of test data:',r2_test)

## Analyzing the Test File

### Separate out the Categorical and Numerical Columns in the Test DataSet

In [None]:
data_submission = data_final['ID']
data_submission

In [None]:
data_submission1 = data_final['ID']
data_submission1

In [None]:
data_final_categorical = data_final[['make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels',
                                     'engine-location','engine-type','num-of-cylinders','fuel-system']]
data_final_categorical

In [None]:
data_final_numerical = data_final[['symboling', 'normalized-losses','wheel-base','curb-weight','engine-size','highway-mpg']]
data_final_numerical

In [None]:
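# Note: a LabelEncoder fitted afresh on the test set can assign different integer codes
# than the one fitted on the training set, so the codes may not match what the model learned
data_final_categorical = data_final_categorical.apply(LabelEncoder().fit_transform)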
data_final_categorical
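
One way to keep the category codes consistent between the training and test sets is to fit a single encoder on the training data and reuse it. The sketch below swaps in scikit-learn's `OrdinalEncoder` in place of `LabelEncoder` and uses toy frames. It is an alternative shown for illustration, not the encoding used in this notebook, and the `handle_unknown` option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical frames (made up); 'awd' never appears in the training data
train_cat = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"],
                          "drive-wheels": ["fwd", "rwd", "4wd"]})
test_cat = pd.DataFrame({"fuel-type": ["diesel", "gas"],
                         "drive-wheels": ["rwd", "awd"]})

# Fit once on the training data; unseen test categories are mapped to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = pd.DataFrame(encoder.fit_transform(train_cat), columns=train_cat.columns)

# Reuse the same fitted mapping on the test data (transform only, no refit)
test_enc = pd.DataFrame(encoder.transform(test_cat), columns=test_cat.columns)

print(train_enc)
print(test_enc)
```

With this approach `diesel` and `rwd` receive the same codes in both frames, which is what a fitted model expects at prediction time.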

In [None]:
data_final_model = pd.concat([data_final_categorical,data_final_numerical], axis=1)
data_final_model

In [None]:
y_pred_test_final = lr_model.predict(data_final_model)
y_pred_test_final
y_pred_test_final.shape

In [None]:
y_pred_test_final_topmost = topmost.predict(data_final_model)
y_pred_test_final_topmost
y_pred_test_final_topmost.shape

### Convert the array into a DataFrame

In [None]:
y_pred_test_final = pd.DataFrame(y_pred_test_final)

In [None]:
y_pred_test_final

In [None]:
y_pred_test_final_topmost = pd.DataFrame(y_pred_test_final_topmost)
y_pred_test_final_topmost

### Prepare the submission file, which should have only two columns: the KEY/INDEX column (ID) and the TARGET column (price)

In [None]:
submission_file = pd.concat([data_submission,y_pred_test_final], axis = 1)

In [None]:
submission_file

In [None]:
submission_file_1 = pd.concat([data_submission1,y_pred_test_final_topmost], axis = 1)
submission_file_1

### Write the final data to the submission CSV files without HEADER and INDEX

In [None]:
submission_file.to_csv('Used_Car_Price_Prediction_Submission_Raj.csv', header=False, index=False)

In [None]:
submission_file_1.to_csv('Used_Car_Price_Prediction_Submission_Raj_Tuned.csv', header=False, index=False)
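
The two-column layout of the submission files can be checked on a tiny example. The IDs and prices below are hypothetical, and calling `to_csv` without a path returns the CSV text instead of writing a file.

```python
import numpy as np
import pandas as pd

ids = pd.Series([101, 102, 103], name="ID")       # hypothetical record IDs
preds = np.array([12500.0, 8300.5, 21000.0])      # hypothetical predicted prices

# Build the frame from plain arrays so rows are paired by position
submission = pd.DataFrame({"ID": ids.to_numpy(), "price": preds})

# No header and no index, matching the submission format
csv_text = submission.to_csv(header=False, index=False)
print(csv_text)
```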

## THANK YOU !!!!