### K-Nearest Neighbor Model for Car Classification

This notebook uses a KNN model to predict car classification from the train/test split data produced by the data prep notebook.
- fit a baseline model with default parameters and evaluate the results
- tune the model using a grid search to find optimal parameters
- evaluate the tuned model and consider next steps

In [1]:
#import relevant packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pickle
import functions as fn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metric
plt.style.use('ggplot')

#run data_prep notebook to create train/test split data pickles



In [2]:
#retrieve train/test split data
with open('X_train.pickle', 'rb') as file:
    X_train = pickle.load(file)
    
with open('X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)

with open('y_train.pickle', 'rb') as file:
    y_train = pickle.load(file)
    
with open('y_test.pickle', 'rb') as file:
    y_test = pickle.load(file)

In [3]:
#normalize feature data (X_train and X_test) using StandardScaler
#fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

scaled_df_train = pd.DataFrame(scaled_data_train, columns = X_train.columns)
scaled_df_test = pd.DataFrame(scaled_data_test, columns = X_test.columns)
#preview scaled data
scaled_df_train.head()

Unnamed: 0,Year,Engine HP,Engine Cylinders,highway MPG,city mpg,Popularity,MSRP,Hatchback,Hybrid,Diesel,...,x4_Cargo Van,x4_Convertible,x4_Coupe,x4_Crew Cab Pickup,x4_Extended Cab Pickup,x4_Passenger Minivan,x4_Passenger Van,x4_Regular Cab Pickup,x4_Sedan,x4_Wagon
0,1.070747,2.463359,1.174737,-1.163008,-0.833465,-0.665766,0.863701,-0.376708,-0.231191,-0.183668,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
1,0.522695,-0.933291,-0.89528,0.364742,0.17821,-0.526818,-0.514851,-0.376708,-0.231191,-0.183668,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
2,-2.765616,-0.206048,0.139728,-0.290008,-0.49624,-0.79027,-0.429133,2.654574,-0.231191,-0.183668,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
3,0.522695,-0.762175,-0.89528,-0.290008,-0.159015,-0.137486,-0.437842,-0.376708,-0.231191,5.444615,...,-0.079359,-0.292225,-0.3571,-0.218013,6.979335,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
4,-0.573409,-1.617754,-0.89528,0.801242,0.965069,-0.68709,-0.643172,2.654574,-0.231191,-0.183668,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
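
The scaling cell above fits the scaler on the training split and reuses its statistics for the test split. A minimal sketch of the same pattern with a scikit-learn `Pipeline` follows. The synthetic `make_classification` data and the names `X_tr`, `X_te`, and `pipe` are assumptions for illustration, not the notebook's car data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car features (assumption, not the notebook's data)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The pipeline fits StandardScaler on the training split only and then
# applies the same mean/scale to whatever is passed to predict or score.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)

# Inspect the fitted scaler: training columns end up with mean ~0 and std ~1
scaler_step = pipe.named_steps["standardscaler"]
train_scaled = scaler_step.transform(X_tr)
print(np.round(train_scaled.mean(axis=0), 6))
print(pipe.score(X_te, y_te))
```

A pipeline like this can also be passed to a cross-validated search, in which case the scaler is refit inside each fold rather than once on the full training set.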


#### Fit base KNN Model with default parameters

In [4]:
#fit base model using default parameters
knn = KNeighborsClassifier()
knn.fit(scaled_data_train, y_train)

#predict for test and train data sets
test_preds = knn.predict(scaled_data_test)
train_preds = knn.predict(scaled_data_train)

In [5]:
#evaluate model performance, metrics for test set
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.91      0.92      0.92       478
           1       0.92      0.92      0.92        64
           2       0.83      0.75      0.79        64
           3       0.89      0.91      0.90       460
           4       0.92      0.78      0.84        72
           5       0.91      0.92      0.91       317
           6       0.87      0.73      0.79        37

    accuracy                           0.90      1492
   macro avg       0.89      0.85      0.87      1492
weighted avg       0.90      0.90      0.90      1492



In [6]:
#metrics for train set; accuracy is 0.94 here vs 0.90 on the test set, so there is slight overfitting
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.94      0.97      0.96      1431
           1       0.96      0.95      0.96       227
           2       0.88      0.86      0.87       157
           3       0.93      0.95      0.94      1337
           4       0.90      0.80      0.85       205
           5       0.96      0.94      0.95      1008
           6       0.91      0.84      0.88       109

    accuracy                           0.94      4474
   macro avg       0.93      0.90      0.91      4474
weighted avg       0.94      0.94      0.94      4474



In [7]:
#find best knn parameters
fn.optimize_knn_params(scaled_data_train, y_train, min_k=1, max_k=10, cv=5)

{'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}
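
The source of `fn.optimize_knn_params` is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming the helper wraps `GridSearchCV` over `n_neighbors`, `weights`, and `metric` and optimizes a macro F1 score. The function name `optimize_knn_params_sketch`, the parameter grid, the scoring choice, and the inclusive `max_k` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def optimize_knn_params_sketch(X, y, min_k=1, max_k=10, cv=5):
    """Hypothetical stand-in for fn.optimize_knn_params (real source not shown)."""
    param_grid = {
        "n_neighbors": list(range(min_k, max_k + 1)),  # assumes max_k is inclusive
        "weights": ["uniform", "distance"],
        "metric": ["minkowski", "manhattan"],
    }
    # Assumed scoring: macro-averaged F1 across classes
    search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          scoring="f1_macro", cv=cv)
    search.fit(X, y)
    return search.best_params_


# Toy data (assumption) to show the shape of the returned dictionary
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)
best = optimize_knn_params_sketch(X_demo, y_demo, min_k=1, max_k=5, cv=3)
print(best)
```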

In [8]:
#fit model with parameters above
knn = KNeighborsClassifier(n_neighbors=1, weights='uniform', metric='manhattan')
knn.fit(scaled_data_train, y_train)
test_preds = knn.predict(scaled_data_test)
train_preds = knn.predict(scaled_data_train)

In [9]:
#checking testing metrics
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       478
           1       0.98      0.98      0.98        64
           2       1.00      0.94      0.97        64
           3       0.98      0.98      0.98       460
           4       0.95      0.97      0.96        72
           5       0.98      0.98      0.98       317
           6       0.97      1.00      0.99        37

    accuracy                           0.98      1492
   macro avg       0.98      0.98      0.98      1492
weighted avg       0.98      0.98      0.98      1492



In [10]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1431
           1       1.00      1.00      1.00       227
           2       1.00      1.00      1.00       157
           3       1.00      1.00      1.00      1337
           4       1.00      1.00      1.00       205
           5       1.00      1.00      1.00      1008
           6       1.00      1.00      1.00       109

    accuracy                           1.00      4474
   macro avg       1.00      1.00      1.00      4474
weighted avg       1.00      1.00      1.00      4474



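The optimal k is 1. A training accuracy of 1.00 is expected at k=1, because each training point is its own nearest neighbor. The 0.98 test accuracy is the more telling result: nearly every test row has a same-class training row as its single closest neighbor. This points to near-duplicate records across the split and potential data leakage.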
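
One way to probe the duplicate-record explanation is to measure how many test rows have an identical (or almost identical) row in the training set. The sketch below uses `NearestNeighbors` on toy data. The helper name `share_of_test_rows_seen_in_train`, the tolerance, and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def share_of_test_rows_seen_in_train(X_train, X_test, tol=1e-9):
    """Fraction of test rows whose nearest training row is (almost) identical."""
    nn = NearestNeighbors(n_neighbors=1, metric="manhattan").fit(X_train)
    dist, _ = nn.kneighbors(X_test)
    return float(np.mean(dist[:, 0] <= tol))


# Toy data (assumption): two of the four test rows are copies of training rows
rng = np.random.default_rng(0)
train_demo = rng.normal(size=(50, 4))
test_demo = np.vstack([train_demo[:2], rng.normal(size=(2, 4))])
share = share_of_test_rows_seen_in_train(train_demo, test_demo)
print(share)
```

In this notebook the same check could be called as `share_of_test_rows_seen_in_train(scaled_data_train, scaled_data_test)`. A high share would support the leakage concern.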

#### Remove 'Popularity' feature and refit base KNN model

In [11]:
#remove Popularity from features
scaled_df_train = scaled_df_train.drop('Popularity', axis=1)
scaled_df_test = scaled_df_test.drop('Popularity', axis=1)

In [12]:
knn = KNeighborsClassifier()
knn.fit(scaled_df_train, y_train)
test_preds = knn.predict(scaled_df_test)
train_preds = knn.predict(scaled_df_train)

In [13]:
#metrics for test data
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.89      0.90      0.90       478
           1       0.92      0.91      0.91        64
           2       0.85      0.73      0.79        64
           3       0.85      0.87      0.86       460
           4       0.83      0.67      0.74        72
           5       0.87      0.89      0.88       317
           6       0.76      0.76      0.76        37

    accuracy                           0.87      1492
   macro avg       0.85      0.82      0.83      1492
weighted avg       0.87      0.87      0.87      1492



In [14]:
#metrics for training data
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1431
           1       0.96      0.94      0.95       227
           2       0.89      0.85      0.87       157
           3       0.92      0.93      0.92      1337
           4       0.88      0.77      0.82       205
           5       0.93      0.94      0.93      1008
           6       0.89      0.85      0.87       109

    accuracy                           0.93      4474
   macro avg       0.92      0.89      0.90      4474
weighted avg       0.93      0.93      0.93      4474



In [15]:
#check for best k parameter by optimizing f1 score
fn.optimize_knn_params(scaled_df_train, y_train, min_k=1, max_k=10, cv=5)

{'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}

In [16]:
#build model with best params from grid search
knn = KNeighborsClassifier(n_neighbors=1, weights='uniform', metric='manhattan')
knn.fit(scaled_df_train, y_train)
test_preds = knn.predict(scaled_df_test)
train_preds = knn.predict(scaled_df_train)

In [17]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       478
           1       0.98      0.98      0.98        64
           2       0.98      0.94      0.96        64
           3       0.94      0.94      0.94       460
           4       0.88      0.82      0.85        72
           5       0.96      0.97      0.97       317
           6       0.90      1.00      0.95        37

    accuracy                           0.95      1492
   macro avg       0.94      0.94      0.94      1492
weighted avg       0.95      0.95      0.95      1492



In [18]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1431
           1       1.00      1.00      1.00       227
           2       1.00      1.00      1.00       157
           3       1.00      1.00      1.00      1337
           4       1.00      1.00      1.00       205
           5       1.00      1.00      1.00      1008
           6       1.00      1.00      1.00       109

    accuracy                           1.00      4474
   macro avg       1.00      1.00      1.00      4474
weighted avg       1.00      1.00      1.00      4474



The optimal k is again 1, which further supports the previous findings.

#### Fit KNN model with fewer features to explore feature impact

Investigate the impact of individual features on model performance by starting with a small feature set, then adding features one at a time and measuring the change.
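
The comparison of feature subsets can also be run in a loop. The sketch below scores each subset with a default KNN and cross-validation. The helper `score_feature_sets`, the macro F1 scoring, and the placeholder column names `a` to `d` are assumptions, and the toy data does not come from the car dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def score_feature_sets(X, y, feature_sets, cv=5):
    """Mean cross-validated macro F1 of a default KNN for each feature subset."""
    rows = []
    for cols in feature_sets:
        scores = cross_val_score(KNeighborsClassifier(), X[cols], y,
                                 scoring="f1_macro", cv=cv)
        rows.append({"features": ", ".join(cols), "f1_macro": scores.mean()})
    return pd.DataFrame(rows)


# Toy frame (assumption): column names are placeholders, not the car features
X_arr, y_demo = make_classification(n_samples=300, n_features=4,
                                    n_informative=3, n_redundant=1,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["a", "b", "c", "d"])
results = score_feature_sets(X_demo, y_demo, [["a", "b"], ["a", "b", "c"]], cv=3)
print(results)
```

With the notebook's data, the subsets would be lists such as `['highway MPG', 'city mpg']` and `['highway MPG', 'city mpg', 'Engine HP']` applied to `scaled_df_train`.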

In [19]:
#check available features
X_train.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'highway MPG', 'city mpg',
       'Popularity', 'MSRP', 'Hatchback', 'Hybrid', 'Diesel', 'Luxury',
       'High-Performance', 'Exotic', 'Factory Tuner', 'Performance',
       'Crossover', 'Flex Fuel', 'x0_diesel', 'x0_electric',
       'x0_flex-fuel (premium unleaded recommended/E85)',
       'x0_flex-fuel (premium unleaded required/E85)',
       'x0_flex-fuel (unleaded/E85)', 'x0_premium unleaded (recommended)',
       'x0_premium unleaded (required)', 'x0_regular unleaded',
       'x1_AUTOMATED_MANUAL', 'x1_AUTOMATIC', 'x1_DIRECT_DRIVE', 'x1_MANUAL',
       'x2_all wheel drive', 'x2_four wheel drive', 'x2_front wheel drive',
       'x2_rear wheel drive', 'x3_Compact', 'x3_Large', 'x3_Midsize',
       'x4_2dr Hatchback', 'x4_4dr Hatchback', 'x4_4dr SUV', 'x4_Cargo Van',
       'x4_Convertible', 'x4_Coupe', 'x4_Crew Cab Pickup',
       'x4_Extended Cab Pickup', 'x4_Passenger Minivan', 'x4_Passenger Van',
       'x4_Regular Cab Pickup', 'x

In [20]:
#start with just two features
col_features = ['highway MPG', 'city mpg']

scaled_df_train_ltd = scaled_df_train[col_features]
scaled_df_test_ltd = scaled_df_test[col_features]

In [21]:
#fit model
knn = KNeighborsClassifier()
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)

In [22]:
#evaluate model; it performs well below the models that use all features
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.54      0.67      0.60       478
           1       0.59      0.47      0.52        64
           2       0.27      0.14      0.19        64
           3       0.54      0.62      0.58       460
           4       0.31      0.14      0.19        72
           5       0.50      0.40      0.44       317
           6       0.67      0.05      0.10        37

    accuracy                           0.52      1492
   macro avg       0.49      0.36      0.37      1492
weighted avg       0.51      0.52      0.51      1492



In [23]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.57      0.72      0.64      1431
           1       0.63      0.44      0.51       227
           2       0.40      0.24      0.30       157
           3       0.57      0.69      0.62      1337
           4       0.52      0.20      0.29       205
           5       0.59      0.41      0.48      1008
           6       0.57      0.07      0.13       109

    accuracy                           0.57      4474
   macro avg       0.55      0.40      0.43      4474
weighted avg       0.57      0.57      0.55      4474



In [24]:
fn.optimize_knn_params(scaled_df_train_ltd, y_train, min_k=1, max_k=13, cv=5)

{'metric': 'minkowski', 'n_neighbors': 13, 'weights': 'distance'}

In [25]:
#note: the grid search above returned n_neighbors=13, but this cell uses n_neighbors=10
knn = KNeighborsClassifier(n_neighbors=10, weights='distance', metric='minkowski')
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)
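
The grid search in the previous cell returned `n_neighbors=13`, while the cell above passes `n_neighbors=10`. Unpacking the returned dictionary straight into the classifier avoids retyping values by hand. This is a minimal sketch: `best_params`, `knn_demo`, and the toy data are illustrative names and data, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Dictionary in the shape returned by the grid search in this notebook
best_params = {"metric": "minkowski", "n_neighbors": 13, "weights": "distance"}

# Unpack the dictionary instead of retyping each value
knn_demo = KNeighborsClassifier(**best_params)

# Toy data (assumption) just to show that the model fits and predicts
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
knn_demo.fit(X_demo, y_demo)
print(knn_demo.get_params()["n_neighbors"])
```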

In [26]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.54      0.66      0.60       478
           1       0.65      0.48      0.55        64
           2       0.33      0.06      0.11        64
           3       0.52      0.67      0.59       460
           4       0.71      0.14      0.23        72
           5       0.53      0.40      0.46       317
           6       0.82      0.24      0.38        37

    accuracy                           0.54      1492
   macro avg       0.59      0.38      0.42      1492
weighted avg       0.55      0.54      0.52      1492



In [27]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.58      0.71      0.64      1431
           1       0.66      0.48      0.56       227
           2       0.60      0.17      0.27       157
           3       0.56      0.74      0.64      1337
           4       0.72      0.16      0.26       205
           5       0.63      0.42      0.50      1008
           6       0.68      0.24      0.35       109

    accuracy                           0.59      4474
   macro avg       0.63      0.42      0.46      4474
weighted avg       0.60      0.59      0.57      4474



With only highway and city MPG as features, test accuracy falls to 0.52 with default parameters and 0.54 after tuning, compared with 0.87 and 0.95 for the models that use all features except 'Popularity'.

In [28]:
#add Engine HP
col_features = ['highway MPG', 'city mpg', 'Engine HP']

scaled_df_train_ltd = scaled_df_train[col_features]
scaled_df_test_ltd = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.88      0.89      0.88       478
           1       0.84      0.88      0.85        64
           2       0.81      0.59      0.68        64
           3       0.82      0.91      0.86       460
           4       0.73      0.53      0.61        72
           5       0.86      0.80      0.83       317
           6       0.70      0.76      0.73        37

    accuracy                           0.84      1492
   macro avg       0.80      0.76      0.78      1492
weighted avg       0.84      0.84      0.84      1492



Adding 'Engine HP' raises default-parameter test accuracy from 0.52 to 0.84, which suggests that Engine HP is a highly predictive feature.

In [29]:
fn.optimize_knn_params(scaled_df_train_ltd, y_train, min_k=1, max_k=10, cv=5)

{'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}

In [30]:
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)

In [31]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.95      0.94      0.94       478
           1       0.92      0.95      0.94        64
           2       0.91      0.80      0.85        64
           3       0.92      0.96      0.94       460
           4       0.85      0.81      0.83        72
           5       0.94      0.95      0.94       317
           6       0.97      0.95      0.96        37

    accuracy                           0.93      1492
   macro avg       0.93      0.91      0.91      1492
weighted avg       0.93      0.93      0.93      1492



In [32]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1431
           1       0.98      0.97      0.98       227
           2       0.93      0.90      0.92       157
           3       0.98      0.98      0.98      1337
           4       0.91      0.91      0.91       205
           5       0.97      0.97      0.97      1008
           6       1.00      0.98      0.99       109

    accuracy                           0.97      4474
   macro avg       0.96      0.96      0.96      4474
weighted avg       0.97      0.97      0.97      4474



The optimized model reaches a test accuracy of 0.93, much higher than the 0.54 of the tuned model with only highway and city MPG.

In [33]:
#inspect feature MSRP
col_features = ['highway MPG', 'city mpg', 'MSRP']

scaled_df_train_ltd = scaled_df_train[col_features]
scaled_df_test_ltd = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.64      0.71      0.67       478
           1       0.61      0.67      0.64        64
           2       0.36      0.19      0.25        64
           3       0.66      0.72      0.69       460
           4       0.41      0.19      0.26        72
           5       0.75      0.70      0.73       317
           6       0.39      0.30      0.34        37

    accuracy                           0.65      1492
   macro avg       0.55      0.50      0.51      1492
weighted avg       0.64      0.65      0.64      1492



In [34]:
fn.optimize_knn_params(scaled_df_train_ltd, y_train, min_k=1, max_k=15, cv=5)

{'metric': 'manhattan', 'n_neighbors': 15, 'weights': 'distance'}

In [38]:
knn = KNeighborsClassifier(n_neighbors=15, weights='distance', metric='manhattan')
knn.fit(scaled_df_train_ltd, y_train)
test_preds = knn.predict(scaled_df_test_ltd)
train_preds = knn.predict(scaled_df_train_ltd)

In [39]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.71      0.73      0.72       478
           1       0.75      0.75      0.75        64
           2       0.54      0.34      0.42        64
           3       0.71      0.77      0.74       460
           4       0.51      0.38      0.43        72
           5       0.75      0.74      0.74       317
           6       0.55      0.46      0.50        37

    accuracy                           0.70      1492
   macro avg       0.64      0.59      0.61      1492
weighted avg       0.70      0.70      0.70      1492



In [40]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1431
           1       1.00      1.00      1.00       227
           2       0.98      1.00      0.99       157
           3       1.00      1.00      1.00      1337
           4       1.00      1.00      1.00       205
           5       1.00      1.00      1.00      1008
           6       1.00      1.00      1.00       109

    accuracy                           1.00      4474
   macro avg       1.00      1.00      1.00      4474
weighted avg       1.00      1.00      1.00      4474



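Adding the 'MSRP' feature improves test accuracy (0.65 with default parameters, 0.70 after tuning) but not nearly as much as adding 'Engine HP' (0.84 and 0.93). The tuned model also scores 1.00 on the training data against 0.70 on the test data, a large gap that points to overfitting. At least part of that gap comes from weights='distance', under which each training point is its own zero-distance neighbor.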
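
The effect of distance weighting on training scores can be shown in isolation. In the sketch below the labels are random, so the features carry no signal, yet the distance-weighted model still scores perfectly on its own training data. The toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))      # distinct random points
y_demo = rng.integers(0, 2, size=200)   # labels unrelated to the features

uniform = KNeighborsClassifier(n_neighbors=15, weights="uniform").fit(X_demo, y_demo)
distance = KNeighborsClassifier(n_neighbors=15, weights="distance").fit(X_demo, y_demo)

# With distance weighting, a training point is its own zero-distance neighbor,
# so it recovers its own label even though the labels carry no signal.
print("uniform  train accuracy:", uniform.score(X_demo, y_demo))
print("distance train accuracy:", distance.score(X_demo, y_demo))
```

For models fitted with `weights='distance'`, the training report is therefore a weak indicator of fit, and cross-validated or test scores are the more useful comparison.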

In [50]:
#fit model using only Engine HP as the predictor
col_features = ['Engine HP']

scaled_df_train_hp = scaled_df_train[col_features]
scaled_df_test_hp = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_df_train_hp, y_train)
test_preds = knn.predict(scaled_df_test_hp)
train_preds = knn.predict(scaled_df_train_hp)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       478
           1       0.71      0.58      0.64        64
           2       0.39      0.45      0.42        64
           3       0.78      0.74      0.76       460
           4       0.35      0.24      0.28        72
           5       0.67      0.64      0.65       317
           6       0.54      0.73      0.62        37

    accuracy                           0.70      1492
   macro avg       0.60      0.60      0.59      1492
weighted avg       0.70      0.70      0.70      1492



Using only the 'Engine HP' feature already gives 70% test accuracy and a macro F1-score of 0.59, which suggests it has the highest impact on prediction among the features examined.

In [51]:
#drop Engine HP from the feature set ('Popularity' was already removed) to compare results
scaled_df_train_nohp = scaled_df_train.drop('Engine HP', axis=1)
scaled_df_test_nohp = scaled_df_test.drop('Engine HP', axis=1)

In [52]:
knn = KNeighborsClassifier()
knn.fit(scaled_df_train_nohp, y_train)
test_preds = knn.predict(scaled_df_test_nohp)
train_preds = knn.predict(scaled_df_train_nohp)

In [53]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.88      0.91      0.89       478
           1       0.89      0.89      0.89        64
           2       0.85      0.73      0.79        64
           3       0.85      0.85      0.85       460
           4       0.82      0.65      0.73        72
           5       0.86      0.89      0.88       317
           6       0.77      0.73      0.75        37

    accuracy                           0.86      1492
   macro avg       0.85      0.81      0.83      1492
weighted avg       0.86      0.86      0.86      1492



In [54]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.94      0.96      0.95      1431
           1       0.95      0.95      0.95       227
           2       0.88      0.85      0.87       157
           3       0.92      0.92      0.92      1337
           4       0.88      0.77      0.82       205
           5       0.93      0.94      0.93      1008
           6       0.91      0.83      0.87       109

    accuracy                           0.93      4474
   macro avg       0.92      0.89      0.90      4474
weighted avg       0.93      0.93      0.93      4474



In [56]:
fn.optimize_knn_params(scaled_df_train_nohp, y_train, min_k=1, max_k=10, cv=5)

{'metric': 'manhattan', 'n_neighbors': 1, 'weights': 'uniform'}

In [57]:
knn = KNeighborsClassifier(n_neighbors=1, weights='uniform', metric='manhattan')
knn.fit(scaled_df_train_nohp, y_train)
test_preds = knn.predict(scaled_df_test_nohp)
train_preds = knn.predict(scaled_df_train_nohp)

In [58]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95       478
           1       0.97      0.97      0.97        64
           2       1.00      0.94      0.97        64
           3       0.94      0.93      0.93       460
           4       0.81      0.82      0.81        72
           5       0.96      0.96      0.96       317
           6       0.88      0.95      0.91        37

    accuracy                           0.94      1492
   macro avg       0.93      0.93      0.93      1492
weighted avg       0.94      0.94      0.94      1492



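Without the 'Engine HP' feature, the model still predicts with 94% test accuracy at the best k=1, suggesting that the remaining features jointly carry much of the same information.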

#### Summary

- The KNN model predicts carmaker origin on the test data with 94% accuracy and a 0.93 macro F1-score
- The model shows slight overfitting, with the training data scoring 100% accuracy and a 1.0 F1-score
- Because the optimized k is 1, the combined features appear highly deterministic: a single nearest neighbor is enough for an accurate prediction
- Features should be examined for potential leakage
- The train/test split should probably not be purely random when assessing model performance, since near-identical records can land on both sides of a random split
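
A non-random split of the kind mentioned in the last bullet can be sketched with `GroupShuffleSplit`, which keeps all rows of a group on the same side of the split. Grouping by car model is an assumption here: the grouping column, the `groups` labels, and the toy arrays are hypothetical and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data (assumption): 12 rows belonging to 4 hypothetical car models
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
groups = np.array(["m1"] * 3 + ["m2"] * 3 + ["m3"] * 3 + ["m4"] * 3)

# Hold out whole groups, so variants of one model never straddle the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=groups))

# The intersection of group labels across the two sides should be empty
print(set(groups[train_idx]) & set(groups[test_idx]))
```

If test accuracy drops sharply under a grouped split, that would be consistent with the near-duplicate explanation for the k=1 result.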