🚗 Car Price Prediction ML



🚀 Why Car Price Prediction in the First Place?

Used cars, often called second-hand cars, have a huge market base. Many people consider buying a used car instead of a new one, as it is more affordable and often a better investment.

The main reason for this huge market is steep depreciation: buy a new car and sell it just a day later, without any defect, and its price can already drop by around 30%.

There are also many fraudsters in the market who not only sell misrepresented vehicles but can also mislead buyers toward a very wrong price.

To protect ourselves from such fraud and from fake or improper prices, this project uses a machine learning algorithm that predicts a car's value from some of the main features that define it, trained on a real-world #CarDekho dataset, to predict the price of any used car.

📝 Project Description


Car Price Prediction is a really interesting machine learning problem for a beginner, as many factors influence the price of a car in the second-hand market. In this project, we will look at a dataset of car sales and purchases, where our end goal is to predict the price of a car given its features in order to maximize profit.

🛠 Dataset Required

Car_dataset.csv

⚙️ Libraries Used

I've used a separate ML environment where only a limited but required set of libraries was installed:

NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, Pickle

Use the !pip command to install these libraries into your environment:

NumPy : often already included in scientific Python distributions, but install it if missing: !pip install numpy

Pandas : !pip install pandas

Matplotlib : !pip install matplotlib

scikit-learn : !pip install scikit-learn (the old pip install sklearn alias is deprecated)

Seaborn : !pip install seaborn

Pickle : part of the Python standard library, so nothing to install
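If you'd rather install everything in one go (assuming a Jupyter notebook, where ! runs shell commands):

!pip install numpy pandas matplotlib seaborn scikit-learn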

Let's build the model now! ⚡️⚡️

1. DATA PREPROCESSING

  • Import pandas, the first package we need for reading the data and carrying out preprocessing on it:
import pandas as pd
  • Now load the dataset Car_dataset.csv into a variable with read_csv:
df = pd.read_csv('Car_dataset.csv')
  • Preview the data to validate that the dataset was successfully assigned to the variable:
df.head()


  • Check the size of the dataset:
df.shape


  • Choose features whose unique values define each car's properties, so the model can tell cars apart

  • The features used here are Seller_Type, Transmission, Owner, and Fuel

  • Print the unique values of these features, which directly classify/differentiate each car:

print(df['Seller_Type'].unique())
print(df['Transmission'].unique())
print(df['Owner'].unique())
print(df['Fuel'].unique())


  • Check whether any NULL values are present in the dataset:
df.isnull().sum()


  • Describe the summary statistics, i.e. count, mean, standard deviation, minimum, maximum, etc.:
df.describe()


  • Fetch the columns present before data preparation:
df.columns


2. DATA PREPARATION

  • Drop unnecessary column(s) from the dataset, i.e. Car_Name: car names take many arbitrary values and do not uniquely differentiate cars as a feature
final_dataset = df[['Year', 'Selling_Price', 'Km_Driven', 'Fuel', 'Seller_Type', 'Transmission', 'Owner']].copy()  # .copy() avoids pandas' SettingWithCopyWarning when we add columns below
final_dataset.head()


  • We can add or modify features as needed for training and testing the model

  • Here we'll add a new feature, Car_Age, capturing how many years a particular car has been in use

  • First, add a Current_Year column with the value 2021 in every row (2021 being the current year at the time of writing):

final_dataset['Current_Year'] = 2021
final_dataset.head()
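Hard-coding 2021 matches the snapshot of the dataset; if you'd rather derive the year at run time, a small optional variation (not in the original notebook):

from datetime import date

# Use today's year instead of a hard-coded constant
final_dataset['Current_Year'] = date.today().year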


  • Compute Car_Age with simple subtraction and add it as a new column:
final_dataset['Car_Age'] = final_dataset['Current_Year'] - final_dataset['Year']
final_dataset.head()


  • Now that we know how old each car is, we can drop both the Year and Current_Year columns:
final_dataset.drop(['Year'], axis = 1, inplace = True)
final_dataset.drop(['Current_Year'], axis = 1, inplace = True)
final_dataset.head()


  • Converting categorical features into numeric dummy (one-hot encoded) columns

  • In case you're not familiar with dummy encoding, this small table clears it up in a minute, and a toy code sketch follows it:

| {Parameter1} | {Parameter2} | {Parameter3} | Description |
| --- | --- | --- | --- |
| 1 | 0 | 0 | The value belongs to {Parameter1} |
| 0 | 1 | 0 | The value belongs to {Parameter2} |
| 0 | 0 | 1 | The value belongs to {Parameter3} |

TIP: When Parameter1 == 0 and Parameter2 == 0, the row already implies {Parameter3}, so one of the three columns is redundant and can be dropped without losing information
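To see dummy encoding concretely, here is a toy sketch (this mini DataFrame is made up for illustration and is not part of the project code):

import pandas as pd

toy = pd.DataFrame({'Fuel': ['Petrol', 'Diesel', 'CNG']})
print(pd.get_dummies(toy, drop_first = True))  # drops Fuel_CNG; keeps Fuel_Diesel and Fuel_Petrol

Each remaining column is 1 where the row matches that category, and the dropped first category is implied when all dummies are 0.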
final_dataset = pd.get_dummies(final_dataset, drop_first = True)  # drop_first = True removes one dummy column per feature to avoid the "dummy variable trap"

final_dataset.head()


3. DATA VISUALIZATION

Now it's time to visualize the data prepared so far

  • Import Seaborn and quickly plot a pairplot:
import seaborn as sbs
sbs.pairplot(final_dataset)


  • Import Matplotlib as well and plot a heatmap of the correlations within the data
  • For more on the %matplotlib inline magic, refer to this article.
import matplotlib.pyplot as plt
%matplotlib inline


# Heatmapping the data
corrmat = final_dataset.corr()
top_corr_features = corrmat.index

plt.figure(figsize = (20, 20))


# Visualize the heatmap
hmap = sbs.heatmap(final_dataset[top_corr_features].corr(), annot = True, cmap = "RdYlGn")  # Color pattern chosen here = "RdYlGn"


4. FEATURE ENGINEERING

  • Let's have another look at the dataset prepared so far:
final_dataset.head()


  • The very first column of the dataset, Selling_Price, is the value we are going to predict with our ML model
  • So this column must not be part of the model's input features
  • Let's separate Selling_Price out using the iloc indexer:
X = final_dataset.iloc[:,1:] 
Y = final_dataset.iloc[:,0]
X.head()


Y.head()


Feature Importance

  • Let's now fit our X and Y values to a model that can rank feature importance, ExtraTreesRegressor
  • Import ExtraTreesRegressor from sklearn.ensemble:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X, Y)
  • Now we can inspect each feature's importance:
print(model.feature_importances_)


  • Can you make sense of these raw importance numbers? Can you tell which features matter more than others?

  • That's exactly where visualization plays an important role: it draws insights out of numbers we couldn't otherwise read

  • So let's plot a graph of the feature importances:

feat = pd.Series(model.feature_importances_, index = X.columns)
feat.nlargest(5).plot(kind = 'barh')
plt.show()


5. TRAINING ML MODEL

  • Whoosh! After all of that DATA PREPARATION, we can finally build our model for real
  • But before we do, we have to split the data for training and testing the model
  • Training produces the ML model itself, and testing the trained model outputs the predicted selling price of a car, which is our end goal
  • We'll split the data in an 80:20 ratio; keeping most of the data for training gives the model more examples to learn from and generally better accuracy in the end
  • The remaining 20% of the data will be used to test the ML model's accuracy
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, y_test = train_test_split(X,Y, test_size = 0.2)

X_train.shape  # Checking the size of the dataset used for training our ML model


5.1 Ensembling 🔗
  • The goal of ENSEMBLE METHODS is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator

  • In this project we're using RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor

rf_random = RandomForestRegressor()
  • For hyperparameter tuning, we'll introduce RandomizedSearchCV

  • Randomized search on hyperparameters: RandomizedSearchCV implements “fit” and “score” methods, and also “score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used. For more on RandomizedSearchCV, refer to this article

from sklearn.model_selection import RandomizedSearchCV
# Hyperparameters
# RandomizedSearchCV

import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']  # note: 'auto' was deprecated and later removed for forests in newer scikit-learn; use 1.0 (all features) or 'sqrt' there

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)] # max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

  • Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)


  • Use the random grid to search for best hyperparameters; First create the base model to tune
rf = RandomForestRegressor()
  • Randomized search over the parameters, using 5-fold cross-validation across 10 sampled parameter combinations (cv = 5, n_iter = 10):
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                               scoring = 'neg_mean_squared_error', n_iter = 10,
                               cv = 5, verbose = 2, random_state = 42, n_jobs = 1)
  • Fitting the search trains a number of decision tree regressors on various sub-samples of the dataset:
rf_random.fit(X_train,Y_train)

This process takes a couple of minutes, but with verbose = 2 you'll see each fitting step logged as it completes.

  • Print the best parameters
rf_random.best_params_


In this run, the search returns the best parameters as n_estimators = 1000, min_samples_split = 2, min_samples_leaf = 1, max_features = 'sqrt', and max_depth = 25.
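Because RandomizedSearchCV refits the best configuration on the whole training set by default (refit = True), the tuned model is also available directly; calls like rf_random.predict below delegate to it. A one-line sketch (the best_rf name is mine):

best_rf = rf_random.best_estimator_  # a RandomForestRegressor configured with the best-found parameters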

  • Print the best score, i.e. the mean cross-validated score of the best estimator:
rf_random.best_score_


  • Finally, predictions can be made by our ML model:
predictions = rf_random.predict(X_test)
  • Visualize those predictions against the test data we split off and kept aside:
sbs.distplot(y_test-predictions)
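Note: distplot was deprecated in Seaborn 0.11 and removed in later releases; if your Seaborn version raises an error here, an equivalent residual plot (my substitution, not from the original notebook) is:

sbs.histplot(y_test - predictions, kde = True)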


plt.scatter(y_test,predictions)


  • Now calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE):
from sklearn import metrics

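The metric calls themselves didn't survive into this README, only a screenshot of the MAE/MSE/RMSE output; a minimal sketch using scikit-learn's metrics module (NumPy was imported as np earlier):

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))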

  • To reuse this model in the future or for deployment, store it in a pickle file:
import pickle
# Open a file where you want to store the model
file = open('random_forest_regression_model.pkl', 'wb')

# Dump the trained model to that file and close it
pickle.dump(rf_random, file)
file.close()
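To load the model back later (for example in a deployment script), a matching sketch (not in the original README):

# Reload the pickled model and predict with it
with open('random_forest_regression_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X_test))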

Contact 💬

Please feel free to contact me: badhepiyush7@gmail.com

Acknowledgements

  • This project is built for predicting used car selling prices, not new car or showroom prices
  • The dataset I've used in this project, and more such datasets for practice, can be found here: Car Data Datasets