# Predicting the Sale Price of Cars using Machine Learning

This notebook predicts the sale price of cars from the available data using a random forest regressor.

## 1. Problem definition

> Predict the sale price of a car from the information available in the car sales dataset.

## 2. Data
The data used in this notebook is taken from https://www.kaggle.com/pouyamofidi/carsalesextendedmissingdata


## 3. Evaluation
+ The evaluation metric is the R^2 (coefficient of determination) score of the model's predictions.
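
As a small illustration of the metric, R^2 compares the model's squared error with the error of always predicting the mean price. The sketch below uses made-up numbers (they are not taken from the dataset) and checks the manual formula against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and predictions, for illustration only
y_true = np.array([15000.0, 20000.0, 28000.0, 13000.0])
y_pred = np.array([16000.0, 19000.0, 25000.0, 15000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both values should agree
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values mean it does worse.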

## 4. Features
+ The dataset contains the Make, Colour, Odometer (KM) and number of Doors of each car, which are used as features, and its Price, which is the target.

In [134]:
# Import all the tools we need
# Data analysis libraries
import numpy as np
import pandas as pd
# Scikit-learn metrics, model, splitting and preprocessing tools
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


In [135]:
car_sales = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [136]:
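# Skewness of the numeric columns (numeric_only avoids a TypeError on the text columns in recent pandas)
car_sales.skew(axis=0, skipna=True, numeric_only=True)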
#car_sales.dtypes

Odometer (KM)   -0.026245
Doors            0.116474
Price            0.978618
dtype: float64

In [137]:
car_sales["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [138]:
# Drop the rows where Price is NaN
car_sales.dropna(subset=["Price"], inplace=True)
car_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [139]:
# Create X (features) and y (target)
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [140]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer

# Fill categorical values with "missing", Doors with 4 (the most common value)
# and the numerical column with its mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define the columns each imputer applies to
cat_feature = ["Make", "Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]

# Create an imputer that applies each strategy to its columns
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feature),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_feature),
])

# Note: the imputer is fitted on all of X here, before the train/test split
filled_X = imputer.fit_transform(X)

# The transformer outputs the columns in the order they were listed above
car_sales_filled = pd.DataFrame(filled_X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0
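
The cell above fits the imputer on every row. A common alternative is to split first and fit the imputer on the training rows only, so that statistics such as the odometer mean are not influenced by the test rows. The following is a minimal sketch of that idea on a tiny made-up frame; the `toy` data and the `*_toy` names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny made-up frame with the same column names as the notebook
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", np.nan, "Toyota", "Nissan", "Honda"],
    "Colour": ["White", np.nan, "Blue", "White", "Blue", "Black"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0, 181577.0, 84714.0],
    "Doors": [4.0, 5.0, 4.0, np.nan, 3.0, 4.0],
})

# Split first, then impute
X_train_toy, X_test_toy = train_test_split(toy, test_size=2, random_state=0)

imputer_toy = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make", "Colour"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# Learn the fill values from the training rows only, then reuse them on the test rows
filled_train_toy = imputer_toy.fit_transform(X_train_toy)
filled_test_toy = imputer_toy.transform(X_test_toy)

print(filled_train_toy.shape, filled_test_toy.shape)
```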


In [141]:
# Turn the categories into numbers with one-hot encoding

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X, y


(<950x15 sparse matrix of type '<class 'numpy.float64'>'
 	with 3800 stored elements in Compressed Sparse Row format>,
 0      15323.0
 1      19943.0
 2      28343.0
 3      13434.0
 4      14043.0
         ...   
 995    32042.0
 996     5716.0
 997    31570.0
 998     4001.0
 999    12732.0
 Name: Price, Length: 950, dtype: float64)

In [142]:
# Split transformed_X and y into training and test sets
# (no random_state is passed, so the split changes from run to run)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
X_train

<760x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3040 stored elements in Compressed Sparse Row format>
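
`train_test_split` shuffles the rows at random before splitting. As a hedged illustration on made-up arrays, passing a fixed `random_state` (the value 42 here is arbitrary) makes the split repeatable, which in turn makes the reported scores easier to reproduce.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state return the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[0].shape, split_a[1].shape)  # training and test feature shapes
```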

In [143]:
# Fit the model on the training data and score it (R^2) on the test set
np.random.seed(5)
model= RandomForestRegressor()

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.2679572500055195

In [144]:
# Mean R^2 over the default 5-fold cross-validation
model_cross_val_score = np.mean(cross_val_score(model, transformed_X, y))
model_cross_val_score

0.21719889668151807
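
The cross-validation above runs on data that was imputed and encoded once, using all rows. One possible alternative is to wrap the preprocessing and the model in a scikit-learn `Pipeline`, so that each fold refits the imputers and the encoder on its own training part. The sketch below assumes synthetic random data (`demo`, `demo_price`) in place of the real CSV, so its scores say nothing about the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car data (random values, illustration only)
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n),
    "Colour": rng.choice(["White", "Blue", "Black"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 250_000, size=n),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
})
demo_price = rng.uniform(5_000, 40_000, size=n)
demo.loc[::10, "Odometer (KM)"] = np.nan   # inject some missing values
demo.loc[5::10, "Make"] = np.nan

# Impute, then one-hot encode, the categorical columns and Doors
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", cat_pipe, ["Make", "Colour"]),
    ("door", door_pipe, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=5)),
])

# Preprocessing is refitted inside every fold
scores = cross_val_score(pipe, demo, demo_price, cv=5, scoring="r2")
print(scores.mean())
```

`handle_unknown="ignore"` is used so that a category seen only in a validation fold does not raise an error.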

In [145]:
# Evaluate the model

y_preds = model.predict(X_test)

print("Regression metrics on the test set")
print(f'R2: {r2_score(y_test, y_preds)}')
print(f'Mean Absolute Error: {mean_absolute_error(y_test, y_preds)}')
print(f'Mean Squared Error: {mean_squared_error(y_test, y_preds)}')


Regression metrics on the test set
R2: 0.2679572500055195
Mean Absolute Error: 6500.658119324828
Mean Squared Error: 63798423.082452774
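
The mean squared error is expressed in squared price units, which makes it hard to read next to the mean absolute error. Taking its square root gives the root mean squared error (RMSE) in the same units as Price. A small sketch using the value printed above:

```python
import math

# Mean Squared Error copied from the output above
mse = 63798423.082452774

# RMSE is in the same units as Price
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
```

This comes to roughly 7,987, somewhat larger than the mean absolute error because squaring gives more weight to large individual errors.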
