### Problem Statement

Using a car dataset with features such as buying price, maintenance cost, number of doors, passenger capacity, luggage boot size, and safety, the task is to develop a machine-learning model that predicts a car's rating from those features. In this dataset the rating is a categorical variable with four levels of acceptance: unacceptable, acceptable, good, and very good.

The goal is to develop a KNN model to help car manufacturers and dealerships better understand their customers' preferences and improve their product offerings and services. Additionally, customers can use the model to make informed decisions when purchasing a car based on their preferences and requirements.

### Similarity-Based Learning => KNN: Car dataset

In [1]:
# Importing necessary packages
import sklearn
import sklearn.model_selection  # needed so sklearn.model_selection.train_test_split resolves
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model, preprocessing
import pandas as pd
import numpy as np

In [2]:
# Loading car dataset
data=pd.read_csv("https://raw.githubusercontent.com/ShriHemaPriya/MBA6693_CustomAssignment/main/car.data")


In [3]:
# Viewing the data
data.head()


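| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |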


In [4]:
#Describing the data
data.describe()

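| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | vhigh | vhigh | 4 | 4 | big | med | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |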


### How to preprocess the car data

All six feature columns and the target column are categorical, so I converted each of them to integer codes with a LabelEncoder before training.

In [5]:
labels = preprocessing.LabelEncoder()
print(labels)

In [6]:
# Converting the categorical values of each feature column into integer codes using fit_transform
buying_col = labels.fit_transform(list(data["buying"]))
maint_col = labels.fit_transform(list(data["maint"]))
doors_col = labels.fit_transform(list(data["doors"]))
persons_col = labels.fit_transform(list(data["persons"]))
lug_boot_col = labels.fit_transform(list(data["lug_boot"]))
safety_col = labels.fit_transform(list(data["safety"]))
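
LabelEncoder assigns integer codes in sorted (alphabetical) label order, so `high`, `low`, `med`, `vhigh` become 0, 1, 2, 3, which does not follow the natural low-to-high ranking. Because KNN relies on distances between encoded values, one possible alternative is an encoding that preserves the ranking. The sketch below uses scikit-learn's `OrdinalEncoder` with explicitly listed category orders; the orders and the two-column toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the car columns (illustrative rows only)
toy = pd.DataFrame({
    "buying": ["vhigh", "low", "med"],
    "lug_boot": ["small", "big", "med"],
})

# Assumed natural orderings, from lowest to highest
category_orders = [
    ["low", "med", "high", "vhigh"],   # buying
    ["small", "med", "big"],           # lug_boot
]

encoder = OrdinalEncoder(categories=category_orders)
encoded = encoder.fit_transform(toy)  # float array, one column per feature
print(encoded)
```

With this encoding, `vhigh` is three steps away from `low` rather than adjacent to `med`, which may or may not improve accuracy on the real data.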

In [7]:
# Converting the class labels (unacc, acc, good, vgood) into numerical values
classVar = labels.fit_transform(list(data["class"]))
print(classVar)

[2 2 2 ... 2 1 3]
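
Since the same `LabelEncoder` object is refit for every column, it helps to check which integer each class label received. A minimal sketch, using a hypothetical stand-in list instead of the full `class` column:

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for data["class"]; the four labels used in the car dataset
class_labels = ["unacc", "acc", "vgood", "good", "unacc"]

class_encoder = LabelEncoder()
codes = class_encoder.fit_transform(class_labels)

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: code for code, label in enumerate(class_encoder.classes_)}
print(mapping)          # {'acc': 0, 'good': 1, 'unacc': 2, 'vgood': 3}
print(codes.tolist())   # [2, 0, 3, 1, 2]
```

This agrees with the output above, where the leading `unacc` rows appear as 2.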


In [8]:
# Viewing the converted numerical columns
buying_col,maint_col

(array([3, 3, 3, ..., 1, 1, 1]), array([3, 3, 3, ..., 1, 1, 1]))

In [9]:
# Creating the X and Y data lists, zipping the feature columns into tuples
X = list(zip(buying_col,maint_col,doors_col,persons_col,lug_boot_col,safety_col))
Y = list(classVar)


### What is a good split ratio for the car data?

I split the data 70/30 into training and test sets, so that the model is trained on most of the data and evaluated on rows it has not seen, which helps reveal overfitting.

In [10]:
# Splitting the datasets into train and test
X_train,X_test,Y_train,Y_test = sklearn.model_selection.train_test_split(X,Y, test_size = 0.3)
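
Because `train_test_split` shuffles randomly, each run produces a different split and a slightly different accuracy. Below is a hedged sketch of a reproducible, class-balanced split. `random_state=42` and `stratify` are illustrative choices that the notebook itself does not use, and the data here is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 6 integer-coded features, imbalanced labels
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(100, 6))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,      # same 70/30 ratio as above
    random_state=42,    # fixed seed so the split is repeatable
    stratify=y_demo,    # keep class proportions similar in both parts
)
print(len(X_tr), len(X_te))
print(np.bincount(y_te))  # class counts in the test part
```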

In [11]:
# Viewing the X train and Y test data sets
X_train,Y_test

([(2, 1, 2, 1, 2, 1),
  (0, 0, 1, 2, 2, 1),
  (3, 2, 1, 0, 0, 0),
  (3, 0, 2, 0, 2, 2),
  (2, 2, 0, 1, 1, 0),
  (0, 1, 3, 0, 0, 1),
  (1, 2, 1, 2, 0, 2),
  (2, 2, 3, 0, 2, 1),
  (1, 3, 2, 0, 2, 0),
  (3, 3, 1, 2, 1, 0),
  (2, 1, 0, 2, 2, 1),
  (1, 1, 0, 0, 2, 2),
  (2, 0, 3, 0, 2, 1),
  (2, 2, 3, 1, 0, 1),
  (1, 2, 1, 1, 0, 0),
  (1, 0, 1, 0, 2, 2),
  (1, 3, 2, 2, 0, 0),
  (2, 1, 1, 0, 0, 1),
  (3, 1, 0, 0, 1, 0),
  (2, 3, 1, 2, 1, 1),
  (3, 2, 1, 0, 1, 1),
  (1, 1, 2, 0, 0, 2),
  (1, 0, 2, 2, 0, 1),
  (0, 2, 0, 0, 1, 1),
  (0, 3, 0, 2, 2, 2),
  (1, 3, 1, 1, 1, 1),
  (2, 0, 1, 0, 0, 0),
  (1, 0, 1, 1, 0, 2),
  (3, 2, 3, 1, 2, 2),
  (3, 0, 3, 2, 0, 2),
  (1, 2, 1, 2, 0, 1),
  (3, 3, 3, 1, 2, 1),
  (3, 1, 2, 1, 0, 1),
  (0, 2, 1, 1, 0, 2),
  (2, 0, 0, 0, 2, 0),
  (2, 3, 2, 1, 2, 2),
  (1, 3, 1, 1, 2, 0),
  (3, 2, 0, 0, 2, 2),
  (3, 1, 3, 1, 0, 1),
  (3, 0, 1, 1, 0, 1),
  (3, 0, 0, 0, 1, 0),
  (2, 1, 2, 1, 2, 2),
  (3, 0, 2, 0, 1, 1),
  (2, 2, 3, 1, 1, 0),
  (1, 1, 2, 0, 2, 2),
  (1, 1, 0

### What should be the parameter for the number of neighbors?

In [12]:
# Creating the KNN classifier
knnClassfier = KNeighborsClassifier(n_neighbors=8)

In [13]:
knnClassfier.fit(X_train,Y_train)

In [14]:
# Accuracy
accuracy = knnClassfier.score(X_test,Y_test)

In [15]:
accuracy

0.9017341040462428

### Comparing the neighbors and accuracy


- n_neighbors = 2: 80.3%
- n_neighbors = 4: 86.1%
- n_neighbors = 7: 89.3%
- n_neighbors = 8: 90.1% (highest)
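
The accuracies above come from a single random split, so small differences between values of `n_neighbors` may be noise. One way to compare them more robustly is k-fold cross-validation. The following is a sketch on synthetic data; `compare_neighbors` is a hypothetical helper name, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def compare_neighbors(X, y, k_values, folds=5):
    """Return {k: mean cross-validated accuracy} for each k in k_values."""
    results = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
        results[k] = float(scores.mean())
    return results


# Synthetic stand-in for the encoded car features and class codes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 6))
y_demo = (X_demo[:, 0] + X_demo[:, 5] > 3).astype(int)  # made-up labeling rule

results = compare_neighbors(X_demo, y_demo, k_values=[2, 4, 7, 8])
for k, acc in results.items():
    print(f"n_neighbors = {k}: {acc:.3f}")
```

On the real data, `X` and `Y` from the cells above could be passed in place of the synthetic arrays.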

In [16]:
# Performing prediction
predicted = knnClassfier.predict(X_test)

In [17]:
# Ratings definition, in the order LabelEncoder assigns codes (sorted labels: acc, good, unacc, vgood)
ratings = ["acceptable","good","unacceptable","very good"]

### Comparison between the actual and predicted values

Ratings:
acceptable = 0
good = 1
unacceptable = 2
very good = 3

In [18]:
for i in range(len(predicted)):
    print(f'The predicted rating: {ratings[predicted[i]]}, actual rating: {ratings[Y_test[i]]}, Data : {X_test[i]}')

The predicted rating: acceptable, actual rating: acceptable, Data : (2, 0, 2, 2, 0, 0)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (1, 2, 3, 2, 1, 1)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (3, 3, 2, 1, 2, 2)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (3, 0, 3, 1, 0, 1)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (2, 0, 0, 0, 2, 2)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (3, 1, 2, 0, 1, 2)
The predicted rating: acceptable, actual rating: acceptable, Data : (1, 0, 3, 1, 0, 2)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (2, 2, 3, 0, 0, 2)
The predicted rating: unacceptable, actual rating: unacceptable, Data : (0, 3, 1, 1, 0, 1)
The predicted rating: acceptable, actual rating: acceptable, Data : (2, 0, 2, 1, 1, 0)
The predicted rating: acceptable, actual rating: acceptable, Data : (1, 3, 1, 2, 1, 2)
The predicted rating: unacceptable, actual rating: good, Data : (1, 2, 3, 2, 1, 2)
The predicted rating: unacceptable, ac
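
Overall accuracy can hide weak performance on rare classes, and `unacc` accounts for 1210 of the 1728 rows here. A per-class view is one way to check this. The sketch below uses small hand-written label arrays as stand-ins for `Y_test` and `predicted`; the values are illustrative, not results from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Class names in the sorted order LabelEncoder uses (codes 0..3)
class_names = ["acc", "good", "unacc", "vgood"]

# Hand-written stand-ins for Y_test and predicted (illustrative only)
y_true = [2, 2, 2, 0, 0, 1, 3, 2]
y_pred = [2, 2, 2, 0, 2, 0, 3, 2]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(matrix)

# Precision, recall, and F1 per class
print(classification_report(
    y_true, y_pred,
    labels=[0, 1, 2, 3],
    target_names=class_names,
    zero_division=0,  # report 0 instead of warning when a class is never predicted
))
```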

### Summary

The problem was framed as a supervised machine-learning problem in which the target variable is the car's rating and the input variables are the car's features. The dataset was split into training and testing sets, a K-Nearest Neighbors classifier was fit on the training set, and its performance was evaluated on the testing set. The highest accuracy achieved was 90.1%, with n_neighbors set to 8.
