# Title: Car Price Prediction Model Building

`Author`   : Abdullah Khan Kakar [Github](https://github.com/AbdullahKhanKakar)--[LinkedIn](https://www.linkedin.com/in/abdullahkhankakar/)--[Kaggle](https://www.kaggle.com/abdullahkhanuet22)

`Date`     : 2 February 2024

`Dataset`  : [OLX Cars Dataset](https://www.kaggle.com/datasets/abdullahkhanuet22/olx-cars-dataset)

In this notebook, I build a car price prediction model using a Random Forest Regressor, combined with binning and feature-transformation steps, with the aim of predicting prices on unseen data as closely as possible. Outline:

- Dataset Overview
- Data Preprocessing
- Data Binning
- Model Building
- Export Model with Pickle

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("/kaggle/input/olx-cars-dataset/OLX_cars_dataset00.csv")

# Dataset Overview

In [3]:
df.head(2)

Unnamed: 0,Ad ID,Car Name,Make,Model,Year,KM's driven,Price,Fuel,Registration city,Car documents,Assembly,Transmission,Condition,Seller Location,Description,Car Features,Images URL's,Car Profile
0,1079071571,fresh import Passo 2021model,Toyota,Passo,2021,54000,4190000,Petrol,Unregistered,Original,Imported,Automatic,Used,"Airline Avenue, Islamabad","it's 2021 model fresh import, perfect engine s...","ABS, Air Bags, AM/FM Radio, CD Player, Cassett...",['https://images.olx.com.pk/thumbnails/4039460...,https://www.olx.com.pk/item/fresh-import-passo...
1,1080125520,Suzuki ravi,Suzuki,Ravi,2018,95000,1300000,Petrol,Karachi,Original,Local,Manual,Used,"Kahuta, Rawalpindi",Suzuki ravi 2018 col,AM/FM Radio,['https://images.olx.com.pk/thumbnails/4102504...,https://www.olx.com.pk/item/suzuki-ravi-iid-10...


#### 1. Number of Rows, Columns

In [4]:
rows, columns = df.shape
print(f"Number of Rows: {rows}")
print(f"Number of Columns: {columns}")

Number of Rows: 9179
Number of Columns: 18


#### 2. Non-null values and Data Types

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9179 entries, 0 to 9178
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Ad ID              9179 non-null   int64 
 1   Car Name           9179 non-null   object
 2   Make               9179 non-null   object
 3   Model              9179 non-null   object
 4   Year               9179 non-null   int64 
 5   KM's driven        9179 non-null   int64 
 6   Price              9179 non-null   int64 
 7   Fuel               9179 non-null   object
 8   Registration city  9179 non-null   object
 9   Car documents      9179 non-null   object
 10  Assembly           9179 non-null   object
 11  Transmission       9179 non-null   object
 12  Condition          9179 non-null   object
 13  Seller Location    9179 non-null   object
 14  Description        9179 non-null   object
 15  Car Features       9179 non-null   object
 16  Images URL's       9179 non-null   object


#### 3. Dataset Statistics

In [6]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Ad ID,9179.0,1079720000.0,2848393.0,1019824000.0,1080003000.0,1080543000.0,1080773000.0,1080975000.0
Year,9179.0,2012.269,6.043902,1989.0,2007.0,2013.0,2017.0,2024.0
KM's driven,9179.0,96570.42,61983.25,1.0,53000.0,92000.0,125000.0,533528.0
Price,9179.0,2036814.0,1159302.0,185000.0,1025000.0,1820000.0,2750000.0,5000000.0


#### 4. Duplicated Rows

In [7]:
print("Duplicated Rows:")
print(f"Total: {df.duplicated().sum()}")

Duplicated Rows:
Total: 201


#### 5. Basic info: missing values, their percentages, unique values, and data types of features

In [8]:
basic_info = pd.DataFrame({
    "Features": df.columns,
    "Missing Values": df.isnull().sum().values,
    "Missing Values %": df.isnull().sum().values / len(df) * 100,  # percentage, not fraction
    "Unique Values": df.nunique().values,
    "Data Types": df.dtypes
})
basic_info.reset_index(drop=True)

Unnamed: 0,Features,Missing Values,Missing Values %,Unique Values,Data Types
0,Ad ID,0,0.0,8976,int64
1,Car Name,0,0.0,7970,object
2,Make,0,0.0,11,object
3,Model,0,0.0,58,object
4,Year,0,0.0,27,int64
5,KM's driven,0,0.0,1598,int64
6,Price,0,0.0,842,int64
7,Fuel,0,0.0,4,object
8,Registration city,0,0.0,61,object
9,Car documents,0,0.0,2,object


# Data Preprocessing

#### 1. Drop duplicated rows

In [9]:
df.drop_duplicates(inplace=True)
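
`drop_duplicates` keeps the first occurrence of each fully identical row and leaves the original index labels in place, which is why row labels larger than the row count show up later in the notebook. A minimal sketch on a toy frame (the values are illustrative, not a sample of the dataset):

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Model": ["Passo", "Ravi", "Passo", "Bolan"],
    "Price": [4190000, 1300000, 4190000, 800000],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() removes the flagged rows but keeps the old index labels
deduped = toy.drop_duplicates()

print(n_dupes)
print(deduped.index.tolist())
```

If contiguous labels are wanted, `reset_index(drop=True)` can be chained after the call.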

#### 2. Drop unnecessary features

In [10]:
df.drop(columns=["Ad ID","Car Name","Condition","Seller Location","Registration city","Description","Car Features","Images URL's","Car Profile"], inplace=True)

In [11]:
df.shape

(8978, 9)

#### 3. Drop rows that are outliers

In [12]:
# Models excluded from the dataset
excluded_models = [
    "Civic VTi", "Civic EXi", "Civic VTi Oriel", "Cervo", "Every Wagon",
    "Liana", "Mehran VX", "Khyber", "Cultus VXL", "Corolla Assista",
    "Corolla Axio", "Surf", "Prius", "ISIS",
]
df = df[~df["Model"].isin(excluded_models)]
df = df[df["Year"]!=2024]

In [13]:
df.head(2)

Unnamed: 0,Make,Model,Year,KM's driven,Price,Fuel,Car documents,Assembly,Transmission
0,Toyota,Passo,2021,54000,4190000,Petrol,Original,Imported,Automatic
1,Suzuki,Ravi,2018,95000,1300000,Petrol,Original,Local,Manual


In [14]:
df.shape

(8960, 9)

# Data Binning or Discretization

In [15]:
n_df = df.copy()

#### 1. Create Bins of Year column

In [16]:
bins = [1999,2004,2008,2012,2016,2020,2024]
labels = [1,2,3,4,5,6]
n_df["Year_Range"] = pd.cut(n_df["Year"], bins=bins, labels=labels)
n_df.head()

Unnamed: 0,Make,Model,Year,KM's driven,Price,Fuel,Car documents,Assembly,Transmission,Year_Range
0,Toyota,Passo,2021,54000,4190000,Petrol,Original,Imported,Automatic,6
1,Suzuki,Ravi,2018,95000,1300000,Petrol,Original,Local,Manual,5
2,Suzuki,Bolan,2015,50000,800000,Petrol,Original,Local,Manual,4
3,Daihatsu,Move,2013,94000,2155000,Petrol,Original,Imported,Automatic,4
4,Suzuki,Swift,2011,126544,1440000,Petrol,Original,Local,Manual,3
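
By default `pd.cut` builds right-closed intervals, so the edges above correspond to (1999, 2004], (2004, 2008], and so on. The sketch below reuses the same edges and labels on a handful of made-up years (an assumption for illustration, not rows from the dataset) to show how edge values are assigned and that a value at or below the lowest edge falls outside every bin:

```python
import pandas as pd

# Same edges and labels as the Year binning cell
bins = [1999, 2004, 2008, 2012, 2016, 2020, 2024]
labels = [1, 2, 3, 4, 5, 6]

# Illustrative years only
years = pd.Series([1999, 2000, 2004, 2005, 2016, 2017, 2023])

# Intervals are open on the left and closed on the right
year_range = pd.cut(years, bins=bins, labels=labels)

print(pd.DataFrame({"Year": years, "Year_Range": year_range}))
```

A value outside the bins becomes NaN, and a later cast of the binned column to an integer type would then fail, so the null check that follows the binning cells is worth keeping.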


#### 2. Create Bins of KM's driven column

In [17]:
bins = [0,30000,60000,90000,120000,150000,180000,210000,224000,227000,300000,330000,360000,390000,410000,440000,470000,500000,533530]
labels = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
n_df["KM's driven_Range"] = pd.cut(n_df["KM's driven"], bins=bins, labels=labels)
n_df.head()

Unnamed: 0,Make,Model,Year,KM's driven,Price,Fuel,Car documents,Assembly,Transmission,Year_Range,KM's driven_Range
0,Toyota,Passo,2021,54000,4190000,Petrol,Original,Imported,Automatic,6,2
1,Suzuki,Ravi,2018,95000,1300000,Petrol,Original,Local,Manual,5,4
2,Suzuki,Bolan,2015,50000,800000,Petrol,Original,Local,Manual,4,2
3,Daihatsu,Move,2013,94000,2155000,Petrol,Original,Imported,Automatic,4,4
4,Suzuki,Swift,2011,126544,1440000,Petrol,Original,Local,Manual,3,5


In [18]:
n_df.isnull().sum()

Make                 0
Model                0
Year                 0
KM's driven          0
Price                0
Fuel                 0
Car documents        0
Assembly             0
Transmission         0
Year_Range           0
KM's driven_Range    0
dtype: int64

#### 3. Convert the new columns from category to integer

In [19]:
cols = ["Year_Range","KM's driven_Range"]
for col in cols:
    n_df[col] = n_df[col].astype("int32")

In [20]:
# compare how the binned and raw columns correlate with Price:
# KM's driven_Range is slightly stronger than KM's driven, Year_Range slightly weaker than Year
n_df.select_dtypes(["int","float"]).corr()

Unnamed: 0,Year,KM's driven,Price,Year_Range,KM's driven_Range
Year,1.0,-0.378985,0.680903,0.982744,-0.381022
KM's driven,-0.378985,1.0,-0.186917,-0.378809,0.986599
Price,0.680903,-0.186917,1.0,0.672904,-0.193626
Year_Range,0.982744,-0.378809,0.672904,1.0,-0.380328
KM's driven_Range,-0.381022,0.986599,-0.193626,-0.380328,1.0


# Model Building

In [21]:
# import libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, QuantileTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

#### 1. Train Test Split

In [22]:
x = n_df.drop("Price", axis=1)
y = n_df["Price"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=42)

In [23]:
print(f"Shape of X_Train: {x_train.shape}")
print(f"Shape of X_Test: {x_test.shape}")

Shape of X_Train: (7616, 10)
Shape of X_Test: (1344, 10)


In [24]:
x_train

Unnamed: 0,Make,Model,Year,KM's driven,Fuel,Car documents,Assembly,Transmission,Year_Range,KM's driven_Range
252,Honda,City Aspire,2017,62000,Petrol,Original,Local,Manual,5,3
6546,Mitsubishi,Pajero Mini,2007,178000,Petrol,Original,Imported,Manual,2,6
9122,Toyota,Altis Grande,2015,76000,Petrol,Original,Local,Automatic,4,3
6119,Toyota,Corolla GLI,2018,58000,Petrol,Original,Local,Manual,5,2
6133,Daihatsu,Move,2020,42000,Petrol,Original,Imported,Automatic,5,2
...,...,...,...,...,...,...,...,...,...,...
5844,Honda,City Aspire,2016,47000,Petrol,Original,Local,Manual,4,2
5285,Honda,Civic Prosmetic,2007,42000,Petrol,Original,Local,Automatic,2,2
5490,Suzuki,Mehran VXR,2016,33000,Petrol,Original,Imported,Manual,4,2
862,Honda,City IVTEC,2007,246000,Petrol,Original,Local,Manual,2,10


In [25]:
y_train

252     3650000
6546    1350000
9122    4250000
6119    4300000
6133    3670000
         ...   
5844    2000000
5285    2350000
5490    1190000
862     1780000
7426    4700000
Name: Price, Length: 7616, dtype: int64

#### 2. Create Column Transformers

In [26]:
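# Step 1: one-hot encode the categorical columns (positions refer to x_train).
# In the output, the encoded columns come first and the passthrough columns last.
ct1 = ColumnTransformer(transformers=[
    ("oneHotEncoder", OneHotEncoder(sparse_output=False, drop="first"), [0,1,4,5,6,7])
], remainder="passthrough")

# Step 2: min-max scale the columns at positions 2, 3, 8, 9.
# Note: these positions index ct1's reordered output, not the original frame.
ct2 = ColumnTransformer(transformers=[
    ("minMaxScaler", MinMaxScaler(), [2,3,8,9])
], remainder="passthrough")

# Step 3: map every column to a normal distribution via its quantiles
ct3 = ColumnTransformer(transformers=[
    ("quantileTransformers", QuantileTransformer(output_distribution="normal"), slice(0,None))
])

# Step 4: the estimator (not a transformer, despite the ct4 name)
ct4 = RandomForestRegressor()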

#### 3. Create Pipeline

In [27]:
pipeline = Pipeline([
    ("first", ct1),
    ("second", ct2),
    ("third", ct3),
    ("fourth", ct4)
])

In [28]:
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
# error metric: mean absolute error, in the same units as Price (lower is better)
print(mean_absolute_error(y_test, y_pred))

180205.81828117446
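
The mean absolute error is the average of the absolute differences between the true and predicted prices, so it reads directly in the currency of the `Price` column. A minimal sketch with made-up prices (not results from this notebook) checks the manual calculation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted prices for illustration
y_true = np.array([1_300_000, 4_190_000, 800_000, 2_155_000])
y_hat = np.array([1_250_000, 4_400_000, 900_000, 2_155_000])

# Manual MAE: mean of |true - predicted|
mae_manual = np.mean(np.abs(y_true - y_hat))

# Library call, with arguments in (y_true, y_pred) order
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(mae_manual, mae_sklearn)
```

Because the absolute difference is symmetric, swapping the two arguments gives the same value, although `(y_true, y_pred)` is the documented order.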


# Export model with Pickle

In [29]:
# import pickle

# with open("pipeline.pkl", "wb") as f:
#     pickle.dump(pipeline, f)
# with open("dataset.pkl", "wb") as f:
#     pickle.dump(n_df, f)
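
A saved model is only useful if it loads back and predicts the same values. The round trip can be sketched as follows, using a tiny stand-in model fitted on made-up data and a hypothetical file path in a temporary directory (neither is the notebook's actual pipeline or output file):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model on made-up (Year, KM's driven) -> Price data
X = np.array([[2011, 126544], [2013, 94000], [2015, 50000], [2018, 95000]])
y = np.array([1_440_000, 2_155_000, 800_000, 1_300_000])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical file location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the fitted model; the context manager closes the file
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back and compare predictions with the in-memory model
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and loading is most reliable with the same scikit-learn version that was used for saving.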

###### End of Code!