## Car Price Prediction

Given a dataset of car listings, we will clean the data, do some feature engineering, and finally build regression models to predict car prices from the features.

In [1]:
# Importing all the necessary libraries
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor

In [2]:
# Reading our dataset using Pandas
data = pd.read_csv('./car_data.csv')

### Exploratory Analysis and Data Cleaning

We first explore the data to check whether it needs cleaning: we look for null values in the feature columns and make sure that the numeric feature columns contain no non-numeric values.

In [3]:
data.head()         # Shows the first five rows of the dataset

| | Make | Model | Version | Price | Make_Year | CC | Assembly | Mileage | Registered City | Transmission |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honda | Insight | | 7400000.0 | 2019 | 1500 | Imported | 2000 | Un-Registered | Automatic |
| 1 | Mitsubishi | Minica | Black Minica | 1065000.0 | 2019 | 660 | Imported | 68000 | Lahore | Automatic |
| 2 | Audi | A6 | 1.8 TFSI Business Class Edition | 9300000.0 | 2015 | 1800 | Local | 70000 | Lahore | Automatic |
| 3 | Toyota | Aqua | G | 2375000.0 | 2014 | 1500 | Imported | 99900 | Islamabad | Automatic |
| 4 | Honda | City | 1.3 i-VTEC | 2600000.0 | 2017 | 1300 | Local | 55000 | Islamabad | Manual |


#### Looking for Null Values

In [4]:
data.info()                  # Shows column dtypes and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80572 entries, 0 to 80571
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Make             80572 non-null  object
 1   Model            80572 non-null  object
 2   Version          73800 non-null  object
 3   Price            80572 non-null  object
 4   Make_Year        80572 non-null  int64 
 5   CC               80572 non-null  int64 
 6   Assembly         80572 non-null  object
 7   Mileage          80572 non-null  int64 
 8   Registered City  80572 non-null  object
 9   Transmission     80572 non-null  object
dtypes: int64(3), object(7)
memory usage: 6.1+ MB


The info() output shows that the "Version" column contains null values: only 73800 of its 80572 entries are non-null, so 6772 are missing. Discarding that many training examples would be wasteful, so we replace the null values with the placeholder "Unknown". This lets us use the remaining features of these examples during training.

In [5]:
data['Version'] = data['Version'].fillna('Unknown')      # Replacing the null values by 'Unknown'
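
To see how many values are missing in each column before filling them, `isna().sum()` gives a direct count. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `data`), so its numbers are illustrative only. On the real data, the count for "Version" should equal 80572 - 73800 = 6772.

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are made up for illustration
toy = pd.DataFrame({
    "Make": ["Honda", "Toyota", "Audi"],
    "Version": [None, "G", None],
})

# Count the missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Fill by assignment rather than an in-place call on a column selection
toy["Version"] = toy["Version"].fillna("Unknown")
print(toy["Version"].tolist())
```

Assigning the result back to the column avoids relying on in-place modification of a column selection, which newer pandas versions warn about.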

We can see that the "Version" feature now contains no null values.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80572 entries, 0 to 80571
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Make             80572 non-null  object
 1   Model            80572 non-null  object
 2   Version          80572 non-null  object
 3   Price            80572 non-null  object
 4   Make_Year        80572 non-null  int64 
 5   CC               80572 non-null  int64 
 6   Assembly         80572 non-null  object
 7   Mileage          80572 non-null  int64 
 8   Registered City  80572 non-null  object
 9   Transmission     80572 non-null  object
dtypes: int64(3), object(7)
memory usage: 6.1+ MB


#### Looking for Non-Numeric Values in Numeric Features

The describe() method displays statistics only for columns with a numeric dtype. Besides the features shown below, we would also expect the "Price" column to appear, since it should contain numeric values.

However, "Price" is missing from the output because it was read as an object column: for some examples it holds "Call for price" instead of a number. So, we need to get rid of such examples.

In [7]:
data.describe()

| | Make_Year | CC | Mileage |
|---|---|---|---|
| count | 80572.0 | 80572.0 | 80572.0 |
| mean | 2011.724209 | 1404.083267 | 85653.66008 |
| std | 6.953399 | 684.458171 | 82241.870901 |
| min | 1990.0 | 1.0 | 1.0 |
| 25% | 2007.0 | 1000.0 | 36500.0 |
| 50% | 2013.0 | 1300.0 | 73000.0 |
| 75% | 2017.0 | 1600.0 | 110520.0 |
| max | 2021.0 | 10000.0 | 999999.0 |


In [8]:
# Finding the number of non-numeric values in the "Price" column

numeric_values_mask = pd.to_numeric(data['Price'], errors='coerce').notna()    # Creates a boolean mask showing True for all the numeric values
numeric_values_count = numeric_values_mask.sum()
print(f"Number of non-numeric values in 'Price' column: {data['Price'].count()-numeric_values_count}")

Number of non-numeric values in 'Price' column: 1209


Only 1209 of the 80572 examples (about 1.5%) are faulty, so we can drop them without any significant effect on our training.
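
Before dropping rows, it can help to look at which raw strings actually failed to parse. The following is a minimal sketch on a made-up column (`toy` and its values are hypothetical, not taken from the dataset): it coerces the column to numbers and then tallies the original strings that became NaN.

```python
import pandas as pd

# Hypothetical miniature of the "Price" column; values are made up
toy = pd.DataFrame(
    {"Price": ["7400000", "Call for price", "1065000", "Call for price"]}
)

# Coerce to numbers: anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(toy["Price"], errors="coerce")

# Rows that were present in the raw data but failed to parse
bad_mask = as_number.isna() & toy["Price"].notna()

# Show which raw strings failed, and how often each one occurs
bad_values = toy.loc[bad_mask, "Price"].value_counts()
print(bad_values)
```

If the tally shows only a single placeholder string, dropping those rows is straightforward; several different strings might call for a closer look.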

#### Removing the Non-Numeric Values from Price column

In [9]:
# Keep only the rows whose "Price" is numeric; .copy() avoids a SettingWithCopyWarning below
data = data[pd.to_numeric(data['Price'], errors='coerce').notna()].copy()
data['Price'] = data['Price'].astype(float)

Now the "Price" column also appears in the describe() output, indicating that it contains only numeric values.

In [10]:
data.describe()

| | Price | Make_Year | CC | Mileage |
|---|---|---|---|---|
| count | 79363.0 | 79363.0 | 79363.0 | 79363.0 |
| mean | 2558124.0 | 2011.65213 | 1396.004927 | 86338.493668 |
| std | 3695094.0 | 6.957034 | 673.923098 | 82507.706101 |
| min | 100000.0 | 1990.0 | 1.0 | 1.0 |
| 25% | 950000.0 | 2007.0 | 1000.0 | 37508.5 |
| 50% | 1725000.0 | 2013.0 | 1300.0 | 74000.0 |
| 75% | 2870000.0 | 2017.0 | 1600.0 | 112000.0 |
| max | 95000000.0 | 2021.0 | 10000.0 | 999999.0 |


### Feature Engineering

Next, we look at which features are unlikely to help predict car prices in their current form, and try to extract more useful information from them.

#### Make_Year into Car_Age

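Make_Year on its own is not the most natural input for price prediction. A more interpretable version of this feature is how old the car is, so we convert Make_Year into Car_Age. Note that the age is computed relative to the current year at run time, so the values change if the notebook is rerun in a later year.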

In [11]:
data['Make_Year'] = datetime.now().year - data['Make_Year']      # Calculating Car Age
data.rename(columns={'Make_Year': 'Car_Age'}, inplace=True)      # Rename the feature in the dataframe
data.head()

| | Make | Model | Version | Price | Car_Age | CC | Assembly | Mileage | Registered City | Transmission |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honda | Insight | Unknown | 7400000.0 | 4 | 1500 | Imported | 2000 | Un-Registered | Automatic |
| 1 | Mitsubishi | Minica | Black Minica | 1065000.0 | 4 | 660 | Imported | 68000 | Lahore | Automatic |
| 2 | Audi | A6 | 1.8 TFSI Business Class Edition | 9300000.0 | 8 | 1800 | Local | 70000 | Lahore | Automatic |
| 3 | Toyota | Aqua | G | 2375000.0 | 9 | 1500 | Imported | 99900 | Islamabad | Automatic |
| 4 | Honda | City | 1.3 i-VTEC | 2600000.0 | 6 | 1300 | Local | 55000 | Islamabad | Manual |
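
Because `datetime.now().year` changes over time, every Car_Age value shifts when the notebook is rerun in a later year. One possible way to keep results reproducible is to pin the reference year. In the sketch below, `REFERENCE_YEAR` is a hypothetical constant, set to 2023 only because that is consistent with the ages shown above (a 2019 car has age 4); `toy` is a made-up frame.

```python
import pandas as pd

# Hypothetical constant: 2023 is consistent with the ages in the output above (2019 -> 4)
REFERENCE_YEAR = 2023

# Made-up manufacturing years for illustration
toy = pd.DataFrame({"Make_Year": [2019, 2015, 2014]})

# Age is the reference year minus the manufacturing year
toy["Car_Age"] = REFERENCE_YEAR - toy["Make_Year"]
toy = toy.drop(columns="Make_Year")
print(toy["Car_Age"].tolist())
```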


#### Handling Categorical Variables

Our regression models can only take numeric data as input, so the features 'Make', 'Model', 'Version', 'Assembly', 'Registered City' and 'Transmission' cannot be fed into the model as they are. One way to get useful information from these categorical variables is to apply one-hot encoding to them. And that is exactly what we are going to do!

In [12]:
data = pd.get_dummies(data, columns=['Make', 'Model', 'Version', 'Assembly', 'Registered City', 'Transmission'], drop_first=True)
data.head()

| | Price | Car_Age | CC | Mileage | Make_Audi | Make_BMW | Make_Bentley | Make_Buick | Make_Cadillac | Make_Changan | ... | Registered City_Toba Tek Singh | Registered City_Umer Kot | Registered City_Un-Registered | Registered City_Vehari | Registered City_Wah cantt | Registered City_Warburton | Registered City_Wazirabad | Registered City_Yazman mandi | Registered City_Zafarwal | Transmission_Manual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7400000.0 | 4 | 1500 | 2000 | False | False | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
| 1 | 1065000.0 | 4 | 660 | 68000 | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 9300000.0 | 8 | 1800 | 70000 | True | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 2375000.0 | 9 | 1500 | 99900 | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 2600000.0 | 6 | 1300 | 55000 | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |


As we can see, all the categorical variables have been converted to one-hot encodings.
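
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up two-column frame (`toy` is hypothetical). Each categorical column with n distinct values contributes n - 1 indicator columns, because the first category is dropped and is represented by all of that column's indicators being False.

```python
import pandas as pd

# Made-up categorical data for illustration
toy = pd.DataFrame({
    "Transmission": ["Automatic", "Manual", "Automatic"],
    "Assembly": ["Imported", "Local", "Local"],
})

# Distinct categories per column; each adds (n - 1) columns with drop_first=True
print(toy.nunique())

encoded = pd.get_dummies(toy, columns=["Transmission", "Assembly"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Checking `nunique()` on the real columns beforehand gives a rough idea of how wide the encoded frame will become, which matters for high-cardinality columns.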

### Train-Test Split

Now we move on to the training phase. First, we split the data into an 80% training set and a 20% test set.

In [13]:
X = data.drop('Price', axis=1)
y = data['Price']

In [14]:
# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Before feeding our data to the model, it is advisable to standardize it. This mainly helps gradient-based models such as the MLP converge faster.

In [15]:
# Use separate StandardScalers for the input features and the target,
# so each one stays fitted and can be reused later (e.g. for inverse_transform)
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = x_scaler.fit_transform(np.array(X_train))
X_test = x_scaler.transform(np.array(X_test))
y_train = y_scaler.fit_transform(y_train.to_numpy().reshape(-1, 1))
y_test = y_scaler.transform(y_test.to_numpy().reshape(-1, 1))
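
Since the target is standardized, model predictions come out in scaled units rather than in price units. The sketch below shows one way to map them back with `inverse_transform`, assuming a scaler fitted on the target alone. The name `y_scaler`, the price values and the fake predictions are all made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices standing in for the training target
prices = np.array([950000.0, 1725000.0, 2870000.0, 7400000.0]).reshape(-1, 1)

# A scaler dedicated to the target (hypothetical name)
y_scaler = StandardScaler()
scaled = y_scaler.fit_transform(prices)

# Pretend these are model predictions in scaled units
scaled_pred = scaled.ravel() + 0.1

# Map the predictions back to the original price units
pred_prices = y_scaler.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()

# An error of 0.1 in scaled units equals 0.1 standard deviations of price
print(pred_prices - prices.ravel())
print(0.1 * y_scaler.scale_[0])
```

The same reasoning applies to the metrics: an MSE computed on the standardized target is expressed in units of the training price variance, not in squared currency.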

### Training the Regression Models

There are a number of regression models that we can try for this task. Since the relationship between the features and the price is unlikely to be linear, we skip linear regression and use non-linear models instead. Two well-known ones, RandomForestRegressor and MLPRegressor, are tested here.
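
One way to check whether a non-linear model is really needed is to fit a linear baseline and compare test scores. The sketch below does this on synthetic, deliberately non-linear data (not the car dataset), so it only illustrates the comparison. The variable names and the choice of `n_estimators=50` are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately non-linear data (not the car dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear baseline
linear = LinearRegression().fit(X_tr, y_tr)
linear_r2 = r2_score(y_te, linear.predict(X_te))

# Non-linear model for comparison
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
forest_r2 = r2_score(y_te, forest.predict(X_te))

print(f"Linear R2: {linear_r2:.3f}")
print(f"Forest R2: {forest_r2:.3f}")
```

Running the same comparison on the scaled car data would show how much the non-linear models actually gain over a linear fit.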

#### Random Forest Regressor

In [16]:
# Initialize and train the Random Forest Regression model
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train.ravel())

In [17]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the Random Forest Regression model
mse = mean_squared_error(y_test.ravel(), y_pred)
r2 = r2_score(y_test.ravel(), y_pred)

print("\nRandom Forest Regression Metrics:")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")


Random Forest Regression Metrics:
Mean Squared Error: 0.05089615861063002
R-squared: 0.9545992943424892


We got a low MSE (about 0.05, measured on the standardized price) and a high R2 score (about 0.95) on the test set, indicating that our model is performing really well.
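
For reference, the two metrics point in opposite directions: a lower MSE is better, while a higher R2 is better, with 1 as the maximum. The sketch below computes both by hand on made-up numbers and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```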

#### MLP Regressor

In [18]:
# Initialize and train the MLPRegressor model
# Note: max_iter=10 stops training after 10 epochs, so scikit-learn may emit a ConvergenceWarning
model_2 = MLPRegressor(max_iter=10)
model_2.fit(X_train, y_train.ravel())




In [19]:
# Make predictions on the test set and evaluate the MLPRegressor model
y_pred = model_2.predict(X_test)
print("R2 Score: ", r2_score(y_test.ravel(), y_pred))
mse = mean_squared_error(y_test.ravel(), y_pred)

print(f'Mean Squared Error: {mse}')

R2 Score:  0.8477838709474831
Mean Squared Error: 0.17064087738626135


Again we got a low MSE (about 0.17) and a fairly high R2 score (about 0.85) on the test set, indicating that this model also performs well, although the Random Forest did better on both metrics.

Either of these two models can be used for our task, with the Random Forest giving the better test scores here.