# Problem Description
In this exercise you will analyse a mock dataset of second hand car sales in the UK. You can 
download this dataset as a csv file from Canvas at the following link: 
https://canvas.hull.ac.uk/files/5020067/download?download_frd=1
You will see that the dataset contains 50,000 rows, with each row corresponding to the sale 
of a second hand car. For each car sold, the dataset contains the following information: 
- Manufacturer – the name of the manufacturer that produced the car. 
- Model – the name of the model of the car. 
- Engine size – the size of the engine, in litres. 
- Fuel type – the type of fuel that the engine uses. 
- Year of manufacture – the year in which the car was made. 
- Mileage – the total number of miles that the car has been driven. 
- Price – the price that the car was sold for, in Pound Sterling (GBP). 

NOTE: whilst the names of the car manufacturers and models in this dataset may be familiar 
to you, be aware that this is a mock dataset of imaginary car sales data that we generated. 
In particular, the prices given in this dataset are not intended to be a realistic representation 
of the actual price of a given car. Furthermore, the years of manufacture contained in this 
dataset do not necessarily reflect the actual years in which a particular model was in 
production in the real world. 

### Goal
Your goal for this exercise is to explore how supervised learning models can be used to 
predict the price of a second hand car, based on the information contained in this dataset. 
You will also study how unsupervised learning techniques can be used to identify clustering 
patterns in this dataset. 
You will write up the results of your analysis in the style of a scientific report. Your report 
should address the following questions: 

a. Compare regression models that predict the price of a car based on a single 
numerical input feature. Based on your results, which numerical variable in the 
dataset is the best predictor for a car’s price, and why? For each numerical input 
feature, is the price better fit by a linear model or by a non-linear (e.g. polynomial) 
model? 

b. Consider regression models that take multiple numerical variables as input features 
to predict the price of a car. Does the inclusion of multiple input features improve 
the accuracy of the model’s prediction compared to the single-input feature models 
that you explored in part (a)?

c. In parts (a) and (b) you only considered models that use the numerical variables from 
the dataset as inputs. However, there are also several categorical variables in the 
dataset that are likely to affect the price of the car. Now train a regression model 
that uses all relevant input variables (both categorical and numerical) to predict the 
price (e.g. a Random Forest Regressor model). Does this improve the accuracy of 
your results? 

d. Develop an Artificial Neural Network (ANN) model to predict the price of a car based 
on all the available information from the dataset. How does its performance 
compare to the other supervised learning models that you have considered? Discuss 
your choices for the architecture of the neural network that you used, and describe 
how you tuned the hyperparameters in your model to achieve the best performance.

e. Based on the results of your analysis, what is the best model for predicting the price 
of a car and why? You should use suitable figures and evaluation metrics to support 
your conclusions. 

f. Use the k-Means clustering algorithm to identify clusters in the car sales data. 
Consider different combinations of the numerical variables in the dataset to use as 
input features for the clustering algorithm. In each case, what is the optimal number 
of clusters (k) to use and why? Which combination of variables produces the best 
clustering results? Use appropriate evaluation metrics to support your conclusions. 

g. Compare the results of the k-Means clustering model from part (f) to at least one 
other clustering algorithm. Which algorithm produces the best clustering? Use 
suitable evaluation metrics to justify your answer. 

# 1. Importing Required Libraries 

In [66]:
# importing libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings 

warnings.filterwarnings('ignore')


# 2. Dataset

## 2.1 load dataset

In [67]:
# loading dataset 
dataset = pd.read_csv("E:\Project\Machine Learing\paid\car_sales_data.csv")

In [68]:
dataset.head()

Unnamed: 0,Manufacturer,Model,Engine size,Fuel type,Year of manufacture,Mileage,Price
0,Ford,Fiesta,1.0,Petrol,2002,127300,3074
1,Porsche,718 Cayman,4.0,Petrol,2016,57850,49704
2,Ford,Mondeo,1.6,Diesel,2014,39190,24072
3,Toyota,RAV4,1.8,Hybrid,1988,210814,1705
4,VW,Polo,1.0,Petrol,2006,127869,4101


## 2.2 Dataset size

In [69]:
dataset.shape

(50000, 7)

## 2.3 Taking care of missing values 

In [70]:
# taking care of missing values

dataset.isnull().values.any()

False

## 2.4 Dataset summary

In [71]:
dataset.describe()

Unnamed: 0,Engine size,Year of manufacture,Mileage,Price
count,50000.0,50000.0,50000.0,50000.0
mean,1.773058,2004.20944,112497.3207,13828.90316
std,0.734108,9.645965,71632.515602,16416.681336
min,1.0,1984.0,630.0,76.0
25%,1.4,1996.0,54352.25,3060.75
50%,1.6,2004.0,100987.5,7971.5
75%,2.0,2012.0,158601.0,19026.5
max,5.0,2022.0,453537.0,168081.0


## 2.5 Chack unique values of every column

In [72]:
dataset['Manufacturer'].unique() # manufacturer names

array(['Ford', 'Porsche', 'Toyota', 'VW', 'BMW'], dtype=object)

In [73]:
dataset['Model'].unique() # models name

array(['Fiesta', '718 Cayman', 'Mondeo', 'RAV4', 'Polo', 'Focus', 'Prius',
       'Golf', 'Z4', 'Yaris', '911', 'Passat', 'M5', 'Cayenne', 'X3'],
      dtype=object)

In [74]:
dataset['Model'].unique().size # number of unique models

15

In [75]:
dataset['Fuel type'].unique() # fuel type 

array(['Petrol', 'Diesel', 'Hybrid'], dtype=object)

## 2.6 Extract feature input variables and target variables

In [76]:
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 6:].values

In [77]:
X

array([['Ford', 'Fiesta', 1.0, 'Petrol', 2002, 127300],
       ['Porsche', '718 Cayman', 4.0, 'Petrol', 2016, 57850],
       ['Ford', 'Mondeo', 1.6, 'Diesel', 2014, 39190],
       ...,
       ['Ford', 'Mondeo', 1.6, 'Diesel', 2022, 4030],
       ['Ford', 'Focus', 1.0, 'Diesel', 2016, 26468],
       ['VW', 'Golf', 1.4, 'Diesel', 2012, 109300]], dtype=object)

In [78]:
Y.size

50000

# 3. spliting dataset into the Traing data set and  Testing data set


In [79]:
# spliting dataset into the Traing data set and  Testing data set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size  = 1/3, random_state = 0)

In [80]:
print(X_train.shape, Y_train.shape)

(33333, 6) (33333, 1)


In [81]:
print(X_test.shape, Y_test.shape)

(16667, 6) (16667, 1)


# 4. Applying supervised learning

##  4.1 Single Numerical input feature

Comparing regression models that predict the price of a car based on a single 
numerical input feature. Checking for each numerical input 
feature, is the price better fit by a linear model or by a non-linear (e.g. polynomial) 
model.
Also Determining which numerical variable in the 
dataset is the best predictor for a car’s price

### 4.1.1 Extracting every single numerical feature lies in the dataset

In [82]:
X_train_engine = X_train[:, 2:3]
X_train_manufacture = X_train[:, 4:5]
X_train_mileage = X_train[:, 5:6]

### 4.1.2 fitting simple linear regression to the training set


In [83]:
# fitting simple linear regression to the training set

from sklearn.linear_model import LinearRegression

regressor_engine_size  = LinearRegression()
regressor_Year_of_manufacture  = LinearRegression()
regressor_Mileage = LinearRegression()


regressor_engine_size.fit(X_train_engine, Y_train)
regressor_Year_of_manufacture.fit(X_train_manufacture, Y_train)
regressor_Mileage.fit(X_train_mileage, Y_train)


### 4.1.3 Calculating the R-square value for each numerical features's model

In [84]:
# Calculating the R-square value for each numerical features's model

engine = regressor_engine_size.score(X_test[:, 2:3], Y_test)
manufacture = regressor_Year_of_manufacture.score(X_test[:, 4:5], Y_test)
mileage = regressor_Mileage.score(X_test[:, 5:6], Y_test)

print(f"The R-square Score in Linear Regression using\n\nEngine size = {engine}\nYear of manufacture = {manufacture}\nMileage = {mileage}")

The R-square Score in Linear Regression using

Engine size = 0.1579553281999313
Year of manufacture = 0.510984724303591
Mileage = 0.3983649608124491


### 4.1.4 PolynomialFeatures with degree 2 (adjust the degree)


In [85]:
# PolynomialFeatures with degree 2 (adjust the degree)

from sklearn.preprocessing import PolynomialFeatures

poly_engine = PolynomialFeatures(degree=2)
poly_manufacture = PolynomialFeatures(degree=2)
poly_mileage = PolynomialFeatures(degree=2)


poly_regressor_engine  = LinearRegression()
poly_regressor_manufacture  = LinearRegression()
poly_regressor_mileage = LinearRegression()


poly_regressor_engine.fit(poly_engine.fit_transform(X_train_engine), Y_train)
a = poly_regressor_engine.score(poly_engine.fit_transform(X_test[:, 2:3]), Y_test)

poly_regressor_manufacture.fit(poly_manufacture.fit_transform(X_train_manufacture), Y_train)
b = poly_regressor_engine.score(poly_engine.fit_transform(X_test[:, 4:5]), Y_test)

poly_regressor_mileage.fit(poly_mileage.fit_transform(X_train_mileage), Y_train)
c = poly_regressor_engine.score(poly_engine.fit_transform(X_test[:, 5:6]), Y_test)


print(f"The R-square Score in Polynomial Regression using\n\nEngine size = {engine}\nYear of manufacture = {manufacture}\nMileage = {mileage}")

The R-square Score in Polynomial Regression using

Engine size = 0.1579553281999313
Year of manufacture = 0.510984724303591
Mileage = 0.3983649608124491


## 4.2 Multiple Numerical input feature

### 4.2.1 Extract training and testeing data 

In [86]:
X_multiple= dataset[['Engine size', 'Year of manufacture', 'Mileage']]
y_multiple = dataset['Price']

X_train_mul, X_test_mul, y_train_mul, y_test_mul = train_test_split(X_multiple, y_multiple, test_size=0.2, random_state=0)


### 4.2.2 Create regression mode and fit to the training data

In [87]:
regressor1 = LinearRegression()
regressor2 = LinearRegression()

regressor1.fit(X_train_mul, y_train_mul)

poly_mul = PolynomialFeatures(degree = 2)
regressor2.fit(poly_mul.fit_transform(X_train_mul), y_train_mul)

### 4.2.3 Calculate R-square value

In [88]:
pa = regressor1.score(X_test_mul, y_test_mul)

pb = regressor2.score(poly_mul.fit_transform(X_test_mul), y_test_mul)

print(f"R-squre Score in Linear Regression = {pa}\nR-square Score in Polynomial Regression = {pb}")

R-squre Score in Linear Regression = 0.6749798831772409
R-square Score in Polynomial Regression = 0.8941503326884358


## 4.3 Random Forest

### 4.3.1 Importing libraries

In [89]:
# libraries 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder


### 4.3.2 Extract features (X) and target variable (y)

In [90]:
# Extract features (X) and target variable (y)
X = dataset.drop('Price', axis=1)
y = dataset['Price'] 


In [91]:
X

Unnamed: 0,Manufacturer,Model,Engine size,Fuel type,Year of manufacture,Mileage
0,Ford,Fiesta,1.0,Petrol,2002,127300
1,Porsche,718 Cayman,4.0,Petrol,2016,57850
2,Ford,Mondeo,1.6,Diesel,2014,39190
3,Toyota,RAV4,1.8,Hybrid,1988,210814
4,VW,Polo,1.0,Petrol,2006,127869
...,...,...,...,...,...,...
49995,BMW,M5,5.0,Petrol,2018,28664
49996,Toyota,Prius,1.8,Hybrid,2003,105120
49997,Ford,Mondeo,1.6,Diesel,2022,4030
49998,Ford,Focus,1.0,Diesel,2016,26468


In [92]:
y

0          3074
1         49704
2         24072
3          1705
4          4101
          ...  
49995    113006
49996      9430
49997     49852
49998     23630
49999     10400
Name: Price, Length: 50000, dtype: int64

### 4.3.3 Perform categorical encoding on the 'Manufacturer', 'Model' and 'Fuel type' columns


In [93]:
# Perform categorical encoding on the 'Manufacturer', 'Model' and 'Fuel type' columns

encoder = OneHotEncoder(sparse=False, drop='first')
X_encoded = pd.DataFrame(encoder.fit_transform(X[['Manufacturer', 'Model', 'Fuel type']]))


In [94]:
X_encoded

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
49996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
49997,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49998,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 4.3.4 Concatenate the encoded features with the original dataset


In [95]:
# Concatenate the encoded features with the original dataset
X = pd.concat([X, X_encoded], axis=1)

X

Unnamed: 0,Manufacturer,Model,Engine size,Fuel type,Year of manufacture,Mileage,0,1,2,3,...,10,11,12,13,14,15,16,17,18,19
0,Ford,Fiesta,1.0,Petrol,2002,127300,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,Porsche,718 Cayman,4.0,Petrol,2016,57850,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Ford,Mondeo,1.6,Diesel,2014,39190,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Toyota,RAV4,1.8,Hybrid,1988,210814,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,VW,Polo,1.0,Petrol,2006,127869,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,BMW,M5,5.0,Petrol,2018,28664,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
49996,Toyota,Prius,1.8,Hybrid,2003,105120,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
49997,Ford,Mondeo,1.6,Diesel,2022,4030,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49998,Ford,Focus,1.0,Diesel,2016,26468,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 4.3.5 drop the orinial column of categorical variable


In [96]:
# drop the orinial column of categorical variable
X = X.drop(['Manufacturer', 'Model', 'Fuel type'], axis = 1)

In [97]:
X

Unnamed: 0,Engine size,Year of manufacture,Mileage,0,1,2,3,4,5,6,...,10,11,12,13,14,15,16,17,18,19
0,1.0,2002,127300,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,4.0,2016,57850,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.6,2014,39190,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.8,1988,210814,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,1.0,2006,127869,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,5.0,2018,28664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
49996,1.8,2003,105120,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
49997,1.6,2022,4030,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49998,1.0,2016,26468,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 4.3.6 convert all features name string

In [98]:
# convert all features name string

X.columns = X.columns.astype(str)

### 4.3.6 Split the dataset into training and testing sets


In [99]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [100]:
x_train
y_train

39087     1438
30893      448
45278    14099
16398     8234
13653     6721
         ...  
11284      370
44732     8507
38158    26206
860        902
15795     6072
Name: Price, Length: 40000, dtype: int64

### 4.3.7 Create a Random Forest Regressor model


In [101]:
# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)


### 4.3.8 Train the model on the training data


In [102]:

# Train the model on the training data
rf_model.fit(x_train, y_train)


### 4.2.9 Make predictions on the test data


In [103]:
# Make predictions on the test data
y_pred = rf_model.predict(x_test)


### 4.3.10 Evaluate the model


In [104]:
# Evaluate the model
r2 = r2_score(y_test, y_pred)

print(f'R-square score in Random Forest Regression model: {r2}')

R-square score in Random Forest Regression model: 0.9984705869992906


## 4.4 ANN model

In [105]:
x_train

Unnamed: 0,Engine size,Year of manufacture,Mileage,0,1,2,3,4,5,6,...,10,11,12,13,14,15,16,17,18,19
39087,1.4,1990,143415,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
30893,1.0,1990,259900,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
45278,2.0,2006,106750,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
16398,1.8,2001,126649,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
13653,2.0,1992,66179,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,1.2,1984,279567,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
44732,1.2,2010,108027,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
38158,2.0,2016,72963,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,2.4,1988,275595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


### 4.4.1 standardize the training and testing input data

In [106]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# standardize
x_train_scaled = scaler.fit_transform(x_train) 
x_test_scaled = scaler.fit_transform(x_test)

In [107]:
x_train_scaled.shape

(40000, 23)

### 4.4.2 build ANN model


In [108]:
# build ANN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

model.add(Dense(128, activation = 'relu', input_shape = (x_train_scaled.shape[1],)))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(1, activation = 'linear'))

### 4.4.3 compiling the model


In [109]:
# compiling the model

model.compile(
        optimizer = 'adam',
        loss = 'mean_squared_error'
)

### 4.4.4 fit the model using the training data 


In [110]:
# fit the model using the training data 

model.fit(x_train_scaled, y_train, epochs = 50, batch_size = 32, validation_split = 0.1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.src.callbacks.History at 0x1bd02270cd0>

### 4.4.5 make prediction using the testing data


In [111]:
# make prediction using the testing data

y_pred = model.predict(x_test_scaled)



In [112]:
y_pred

array([[69719.6    ],
       [35882.383  ],
       [17838.664  ],
       ...,
       [39878.55   ],
       [  684.56635],
       [35565.305  ]], dtype=float32)

### 4.4.6 Evaluate the model

In [113]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f'R-square score in ANN model: {r2}')


R-square score in ANN model: 0.9984977825995061


# 5. Applying unsupervised learning

## 5.1 K mean clustering

### 5.1.0 Importing libraries

In [114]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score


### 5.1.1 k-means clustering using 2 numerical feature input

### 5.1.1.1 Drop the original categorical columns


In [115]:
# Drop the original categorical columns
data = dataset.drop(['Manufacturer', 'Model', 'Fuel type', 'Year of manufacture'], axis=1)
data

Unnamed: 0,Engine size,Mileage,Price
0,1.0,127300,3074
1,4.0,57850,49704
2,1.6,39190,24072
3,1.8,210814,1705
4,1.0,127869,4101
...,...,...,...
49995,5.0,28664,113006
49996,1.8,105120,9430
49997,1.6,4030,49852
49998,1.0,26468,23630


### 5.1.1.2 standaraize the data


In [116]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

In [117]:
scaled_data

array([[-1.05306836,  0.20664955, -0.65512697],
       [ 3.03356561, -0.76289192,  2.18530499],
       [-0.23574157, -1.02339075,  0.62395067],
       ...,
       [-0.23574157, -1.51423421,  2.1943203 ],
       [-1.05306836, -1.20099344,  0.59702656],
       [-0.50818383, -0.04463549, -0.2088691 ]])

### 5.1.1.3 Apply k-Means clustering with different k values


In [118]:
# Apply k-Means clustering with different k values

print('k-means clustering using 2 numerical feature input')

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(scaled_data)

    # Evaluate clustering
    inertia = kmeans.inertia_
    silhouette = silhouette_score(scaled_data, clusters)
    davies_bouldin = davies_bouldin_score(scaled_data, clusters)
    
    print(f'Clusters: {k}, Inertia: {inertia}, Silhouette: {silhouette}, Davies-Bouldin: {davies_bouldin}')


k-means clustering using 2 numerical feature input
Clusters: 2, Inertia: 98091.45262098621, Silhouette: 0.3669037012204407, Davies-Bouldin: 1.1078968604155395
Clusters: 3, Inertia: 67640.20684663932, Silhouette: 0.3837362495977886, Davies-Bouldin: 0.9464029615658083
Clusters: 4, Inertia: 53355.18029571917, Silhouette: 0.3939015989244124, Davies-Bouldin: 0.8875475042141868
Clusters: 5, Inertia: 40952.01764038799, Silhouette: 0.3551896769108442, Davies-Bouldin: 0.865004339535966
Clusters: 6, Inertia: 34662.87493969449, Silhouette: 0.3188912041933196, Davies-Bouldin: 0.8978029803587639
Clusters: 7, Inertia: 29914.267549593955, Silhouette: 0.3340851222070231, Davies-Bouldin: 0.8437406928459491
Clusters: 8, Inertia: 26899.46263766519, Silhouette: 0.3110482480099364, Davies-Bouldin: 0.8780938100030149
Clusters: 9, Inertia: 24156.625624845357, Silhouette: 0.323078080141373, Davies-Bouldin: 0.8524310418261667
Clusters: 10, Inertia: 21934.46738942913, Silhouette: 0.3224778871061413, Davies-Boul

### 5.1.2 k-means clustering using 3 numerical feature input 

### 5.1.2.1 Drop the original categorical columns 

In [119]:
# Drop the original categorical columns
data = dataset.drop(['Manufacturer', 'Model', 'Fuel type'], axis=1)

In [120]:
data

Unnamed: 0,Engine size,Year of manufacture,Mileage,Price
0,1.0,2002,127300,3074
1,4.0,2016,57850,49704
2,1.6,2014,39190,24072
3,1.8,1988,210814,1705
4,1.0,2006,127869,4101
...,...,...,...,...
49995,5.0,2018,28664,113006
49996,1.8,2003,105120,9430
49997,1.6,2022,4030,49852
49998,1.0,2016,26468,23630


### 5.1.2.2 standarize the data 

In [121]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

In [122]:
scaled_data

array([[-1.05306836, -0.22905558,  0.20664955, -0.65512697],
       [ 3.03356561,  1.22234304, -0.76289192,  2.18530499],
       [-0.23574157,  1.01500038, -1.02339075,  0.62395067],
       ...,
       [-0.23574157,  1.84437103, -1.51423421,  2.1943203 ],
       [-1.05306836,  1.22234304, -1.20099344,  0.59702656],
       [-0.50818383,  0.80765772, -0.04463549, -0.2088691 ]])

### 5.1.2.3 Apply k-Means clustering with different k values


In [123]:
# Apply k-Means clustering with different k values

print('k-means clustering using 3 numerical feature input')
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(scaled_data)

    # Evaluate clustering
    inertia = kmeans.inertia_
    silhouette = silhouette_score(scaled_data, clusters)
    davies_bouldin = davies_bouldin_score(scaled_data, clusters)
    
    print(f'Clusters: {k}, Inertia: {inertia}, Silhouette: {silhouette}, Davies-Bouldin: {davies_bouldin}')


k-means clustering using 3 numerical feature input
Clusters: 2, Inertia: 116530.87591309159, Silhouette: 0.39735675661460274, Davies-Bouldin: 0.9841527072766522
Clusters: 3, Inertia: 87492.31979501166, Silhouette: 0.4093777993502264, Davies-Bouldin: 0.957912509564022
Clusters: 4, Inertia: 69304.38347564946, Silhouette: 0.328995478659553, Davies-Bouldin: 0.9876761833564394
Clusters: 5, Inertia: 54699.89629693521, Silhouette: 0.343744398210847, Davies-Bouldin: 0.8953580078939852
Clusters: 6, Inertia: 47052.289475394784, Silhouette: 0.29801437627490474, Davies-Bouldin: 0.9524050784773186
Clusters: 7, Inertia: 42343.923009551996, Silhouette: 0.27303272152241786, Davies-Bouldin: 1.0097146049595984
Clusters: 8, Inertia: 38495.91528808992, Silhouette: 0.2800694520350489, Davies-Bouldin: 1.0132378712529535
Clusters: 9, Inertia: 35293.32738858744, Silhouette: 0.28069235267142595, Davies-Bouldin: 1.0184967931412017
Clusters: 10, Inertia: 32535.288539463025, Silhouette: 0.28223384257860057, Davie

## 5.2 heirarchical clustering

In [124]:
scaled_data.shape

(50000, 4)

### 5.2.1 create a hierarchical clustering object


In [125]:
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=3)

### 5.2.2 Subsample data (For Limited Memory)


In [126]:
# Subsample data
from sklearn.utils import shuffle
subsample_size = 5000  # You can adjust this based on your memory constraints
subsampled_data = shuffle(scaled_data, random_state=42)[:subsample_size]

agg_labels = agg.fit_predict(subsampled_data)


kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(subsampled_data)



### 5.2.3 Evaluate cluster 3 for Hierarchical Clustering


In [127]:
# Evaluate clustering for Hierarchical Clustering
agg_silhouette = silhouette_score(subsampled_data, agg_labels)
agg_davies_bouldin = davies_bouldin_score(subsampled_data, agg_labels)


### 5.3.4 Evaluate k means clustering for cluster 3


In [128]:
# Evaluate k means clustering
kmeans_silhouette = silhouette_score(subsampled_data, clusters)
kmeans_davies_bouldin = davies_bouldin_score(subsampled_data, clusters)

### 5.3.5 print evaluation metrices

In [129]:
print('\nHierarchical Clustering:')
print(f'Silhouette: {agg_silhouette}, Davies-Bouldin: {agg_davies_bouldin}')

print('\nK-means Clustering:')
print(f'Silhouette: {kmeans_silhouette}, Davies-Bouldin: {kmeans_davies_bouldin}')


Hierarchical Clustering:
Silhouette: 0.38276355502401527, Davies-Bouldin: 0.9219150632524684

K-means Clustering:
Silhouette: 0.4016670124291188, Davies-Bouldin: 0.9190927767103263
