## Predicting Class of a Car

The class of each car is predicted from six categorical attributes.

**class values**                                  
unacc, acc, good, vgood

**attributes**

buying:   vhigh, high, med, low.     
maint:    vhigh, high, med, low.   
doors:    2, 3, 4, 5more.      
persons:  2, 4, more.                     
lug_boot: small, med, big.                    
safety:   low, med, high.               


In [2]:
import numpy as np
import pandas as pd

In [16]:
cars = pd.read_csv('car.data', names='buying maint doors persons lug_boot safety class'.split())

In [17]:
'buying maint doors persons lug_boot safety class'.split()

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
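As a rough illustration of what `names=` does for a header-less file, the sketch below reads two inline rows laid out like `car.data`. The inline string and the `sample` variable are stand-ins for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# Two rows in the same header-less layout as car.data (stand-in for the real file).
raw = "vhigh,vhigh,2,2,small,low,unacc\nlow,low,5more,more,big,med,good\n"

columns = 'buying maint doors persons lug_boot safety class'.split()

# Without names=, pandas would treat the first data row as the header.
sample = pd.read_csv(io.StringIO(raw), names=columns)

print(sample.shape)
print(list(sample.columns))
```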

In [98]:
cars

,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good


In [22]:
cars['doors'].unique()

array(['2', '3', '4', '5more'], dtype=object)

All features are stored as strings (dtype `object`), so they need to be converted into dummy (one-hot) variables before modelling.

### Using dummy variables to define X and y

In [26]:
cars.columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

In [28]:
X = pd.get_dummies(cars[['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']])
X.head()

,buying_high,buying_low,buying_med,buying_vhigh,maint_high,maint_low,maint_med,maint_vhigh,doors_2,doors_3,...,doors_5more,persons_2,persons_4,persons_more,lug_boot_big,lug_boot_med,lug_boot_small,safety_high,safety_low,safety_med
0,0,0,0,1,0,0,0,1,1,0,...,0,1,0,0,0,0,1,0,1,0
1,0,0,0,1,0,0,0,1,1,0,...,0,1,0,0,0,0,1,0,0,1
2,0,0,0,1,0,0,0,1,1,0,...,0,1,0,0,0,0,1,1,0,0
3,0,0,0,1,0,0,0,1,1,0,...,0,1,0,0,0,1,0,0,1,0
4,0,0,0,1,0,0,0,1,1,0,...,0,1,0,0,0,1,0,0,0,1
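To make the encoding concrete, here is a small sketch of `pd.get_dummies` on a single toy column. The `toy` frame is hypothetical. It only shows how one categorical column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical single-column frame using the safety levels from the dataset.
toy = pd.DataFrame({'safety': ['low', 'med', 'high', 'low']})

# One indicator column per category, named <column>_<value>.
dummies = pd.get_dummies(toy)

print(list(dummies.columns))
# Each row has exactly one active indicator.
print(dummies.astype(int).sum(axis=1).tolist())
```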


In [31]:
y = cars['class']

### Splitting into training and testing

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
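The classes in this dataset are imbalanced, so one option worth considering is a stratified split. The notebook itself does not use one. The sketch below uses hypothetical toy labels to show how `stratify=` keeps class proportions the same in both partitions.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical data, not the car dataset): 80 'unacc', 20 'acc'.
labels = ['unacc'] * 80 + ['acc'] * 20
features = [[i] for i in range(len(labels))]

# stratify= keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```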

### Random Forest

In [43]:
from sklearn.ensemble import RandomForestClassifier

In [49]:
clf_rf = RandomForestClassifier(random_state=0)

In [50]:
clf_rf.fit(X_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [51]:
# Predict with the random forest fitted above
y_pred_rf = clf_rf.predict(X_test)

In [52]:
# Testing Accuracy

clf_rf.score(X_test, y_test)

0.9421965317919075

In [53]:
# Training Accuracy

clf_rf.score(X_train, y_train)

0.9991728701406121

#### Classification report and Confusion Matrix

In [54]:
from sklearn.metrics import classification_report, confusion_matrix

In [66]:
# Rows and columns follow scikit-learn's sorted label order (acc, good, unacc, vgood),
# so let the report name the classes itself instead of passing y.unique().
print('Classification Report:\n', classification_report(y_test, y_pred_rf))
print('\nConfusion Matrix\n', confusion_matrix(y_test, y_pred_rf))

Classification Report:
               precision    recall  f1-score   support

         acc       0.83      0.87      0.85       115
        good       0.71      0.60      0.65        25
       unacc       0.97      0.97      0.97       363
       vgood       0.73      0.69      0.71        16

    accuracy                           0.92       519
   macro avg       0.81      0.78      0.80       519
weighted avg       0.92      0.92      0.92       519


Confusion Matrix
 [[100   4  10   1]
 [  7  15   0   3]
 [ 10   0 353   0]
 [  3   2   0  11]]
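The figures in the report can be recomputed from the confusion matrix alone. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Random forest confusion matrix from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[100,   4,  10,  1],
               [  7,  15,   0,  3],
               [ 10,   0, 353,  0],
               [  3,   2,   0, 11]])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # correct / all samples of that true class
precision = true_positives / cm.sum(axis=0)   # correct / all samples predicted as that class
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 2))
```

The rounded values agree with the precision, recall, and accuracy columns of the report.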


#### Conclusion

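Predictions seem reasonable. Test accuracy (0.94) is somewhat below training accuracy (0.999), and the two smallest classes (25 and 16 test samples) have the weakest precision and recall.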

### Support Vector Classifier

In [67]:
from sklearn.svm import SVC

In [68]:
clf_svc = SVC()

In [69]:
clf_svc.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [71]:
y_pred_svc = clf_svc.predict(X_test)

In [72]:
# Training Score

clf_svc.score(X_train, y_train)

0.9073614557485525

In [74]:
# Testing Score

clf_svc.score(X_test, y_test)

0.8901734104046243

#### Classification Report and Confusion Matrix

In [75]:
# Let the report use scikit-learn's sorted label order instead of y.unique().
print('Classification Report:\n', classification_report(y_test, y_pred_svc))
print('\nConfusion Matrix\n', confusion_matrix(y_test, y_pred_svc))

Classification Report:
               precision    recall  f1-score   support

         acc       0.68      1.00      0.81       115
        good       0.00      0.00      0.00        25
       unacc       1.00      0.94      0.97       363
       vgood       0.78      0.44      0.56        16

    accuracy                           0.89       519
   macro avg       0.61      0.59      0.58       519
weighted avg       0.87      0.89      0.87       519


Confusion Matrix
 [[115   0   0   0]
 [ 23   0   0   2]
 [ 23   0 340   0]
 [  9   0   0   7]]




#### Conclusion

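Precision and recall are both 0.00 for the class in the second row of the report (25 test samples): the default SVC never predicts it.
These results are therefore not good enough.

The confusion matrix confirms this. Its second column is all zeros, and 55 of the 57 misclassified samples are assigned to the first class.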
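When a class is never predicted, its precision is 0/0, which scikit-learn reports as 0. In scikit-learn 0.22 and later, the `zero_division` argument makes that choice explicit. The sketch below uses hypothetical toy labels rather than the notebook's predictions.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels: the classifier never predicts 'b'.
y_true = ['a', 'a', 'b', 'b']
y_pred = ['a', 'a', 'a', 'a']

# zero_division=0 reports 0.0 for the undefined 'b' precision without a warning.
per_class = precision_score(
    y_true, y_pred, labels=['a', 'b'], average=None, zero_division=0
)
print(per_class)
```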

### Let's Use GridSearchCV

In [76]:
from sklearn.model_selection import GridSearchCV

#### Random Forest

In [77]:
param = {'n_estimators':[3,6,9,12]}

In [82]:
grid_rf = GridSearchCV(RandomForestClassifier(), param_grid=param, cv=5)

In [83]:
grid_rf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [84]:
grid_rf.best_params_

{'n_estimators': 12}

In [85]:
grid_rf.cv_results_

{'mean_fit_time': array([0.00730371, 0.01503825, 0.02360473, 0.02859097]),
 'std_fit_time': array([0.00228634, 0.00283494, 0.00449445, 0.00170999]),
 'mean_score_time': array([0.00297742, 0.0036109 , 0.00378695, 0.00272732]),
 'std_score_time': array([0.00165408, 0.00152407, 0.00310779, 0.00225191]),
 'param_n_estimators': masked_array(data=[3, 6, 9, 12],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_estimators': 3},
  {'n_estimators': 6},
  {'n_estimators': 9},
  {'n_estimators': 12}],
 'split0_test_score': array([0.88477366, 0.88065844, 0.89300412, 0.90946502]),
 'split1_test_score': array([0.89711934, 0.88065844, 0.93004115, 0.94650206]),
 'split2_test_score': array([0.89669421, 0.92975207, 0.92975207, 0.9214876 ]),
 'split3_test_score': array([0.9338843 , 0.9214876 , 0.95454545, 0.95867769]),
 'split4_test_score': array([0.92468619, 0.89958159, 0.92468619, 0.92468619]),
 'mean_test_score': array([0.90736146, 0.9
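The raw `cv_results_` dictionary is hard to read. One common approach is to load it into a DataFrame and keep only the comparison columns. The sketch below runs the same grid on synthetic data from `make_classification`, which is an assumption for self-containment. In the notebook, `grid_rf.cv_results_` could be passed to `pd.DataFrame` directly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the notebook would use X_train, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [3, 6, 9, 12]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per parameter setting; keep only the columns needed to compare them.
results = pd.DataFrame(search.cv_results_)
summary = results[['param_n_estimators', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```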

In [86]:
print('Training Score:',grid_rf.score(X_train, y_train))
print('Testing Score:',grid_rf.score(X_test, y_test))

Training Score: 0.9983457402812241
Testing Score: 0.9325626204238922


In [90]:
# Predictions from the best estimator found by the grid search
y_pred_grid_rf = grid_rf.predict(X_test)

##### Classification Report and Confusion Matrix

In [91]:
# Let the report use scikit-learn's sorted label order instead of y.unique().
print('Classification Report:\n', classification_report(y_test, y_pred_grid_rf))
print('\nConfusion Matrix\n', confusion_matrix(y_test, y_pred_grid_rf))

Classification Report:
               precision    recall  f1-score   support

         acc       0.82      0.94      0.87       115
        good       0.87      0.52      0.65        25
       unacc       0.99      0.98      0.98       363
       vgood       0.69      0.56      0.62        16

    accuracy                           0.93       519
   macro avg       0.84      0.75      0.78       519
weighted avg       0.93      0.93      0.93       519


Confusion Matrix
 [[108   1   5   1]
 [  9  13   0   3]
 [  9   0 354   0]
 [  6   1   0   9]]


#### Support Vector Classifier

In [92]:
params = {'C':[0.1, 1, 10, 100],'gamma':[1, 0.1, 0.01, 0.001]}

In [93]:
grid_svc = GridSearchCV(SVC(), param_grid=params, cv=5)

In [94]:
grid_svc.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [95]:
grid_svc.best_params_

{'C': 100, 'gamma': 0.1}
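To get a sense of how much work this search does, the grid can be enumerated with `ParameterGrid`. This is a small sketch using the same `params` dictionary. The fold count of 5 comes from `cv=5` above.

```python
from sklearn.model_selection import ParameterGrid

params = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}

# Every combination of C and gamma that GridSearchCV will evaluate.
candidates = list(ParameterGrid(params))
n_folds = 5

print(len(candidates))             # number of (C, gamma) combinations
print(len(candidates) * n_folds)   # cross-validation fits, before the final refit
print({'C': 100, 'gamma': 0.1} in candidates)
```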

In [96]:
grid_svc.cv_results_

{'mean_fit_time': array([0.16531534, 0.06485624, 0.0591783 , 0.05434904, 0.18365989,
        0.05475717, 0.06470785, 0.05850215, 0.18916092, 0.05269456,
        0.0537498 , 0.0577024 , 0.19253311, 0.0506238 , 0.05155048,
        0.06514316]),
 'std_fit_time': array([0.01658675, 0.00288779, 0.00290982, 0.00272016, 0.00844119,
        0.00306148, 0.00360758, 0.00271748, 0.00550482, 0.00300074,
        0.00673399, 0.00170461, 0.00558888, 0.00376303, 0.00506606,
        0.00627003]),
 'mean_score_time': array([0.01893249, 0.01105328, 0.01119528, 0.01119447, 0.01908717,
        0.00859804, 0.01000919, 0.01024184, 0.01896276, 0.00937405,
        0.00746989, 0.01049423, 0.01926765, 0.00706878, 0.00758882,
        0.00524068]),
 'std_score_time': array([0.00290166, 0.00324672, 0.00411223, 0.00254204, 0.00127078,
        0.00097406, 0.00306703, 0.00050168, 0.00154309, 0.00286057,
        0.00285454, 0.00324   , 0.00378034, 0.00299213, 0.00232184,
        0.00256403]),
 'param_C': masked_array(d

In [97]:
print('Training Score:',grid_svc.score(X_train, y_train))
print('Testing Score:',grid_svc.score(X_test, y_test))

Training Score: 1.0
Testing Score: 0.9942196531791907


In [100]:
y_pred_grid_svc = grid_svc.predict(X_test)

##### Classification Report and Confusion Matrix

In [101]:
# Let the report use scikit-learn's sorted label order instead of y.unique().
print('Classification Report:\n', classification_report(y_test, y_pred_grid_svc))
print('\nConfusion Matrix\n', confusion_matrix(y_test, y_pred_grid_svc))

Classification Report:
               precision    recall  f1-score   support

         acc       0.99      0.98      0.99       115
        good       0.93      1.00      0.96        25
       unacc       1.00      1.00      1.00       363
       vgood       1.00      0.94      0.97        16

    accuracy                           0.99       519
   macro avg       0.98      0.98      0.98       519
weighted avg       0.99      0.99      0.99       519


Confusion Matrix
 [[113   2   0   0]
 [  0  25   0   0]
 [  0   0 363   0]
 [  1   0   0  15]]


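**Conclusion:**
The classification report and confusion matrix show that the tuned SVC (`C=100`, `gamma=0.1`) performs best: test accuracy is about 0.994, with only 3 of the 519 test samples misclassified.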
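The four models can be compared side by side by recomputing accuracy from the confusion matrices printed in this notebook. This is a small sketch. The dictionary keys are descriptive labels chosen here for illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
matrices = {
    'Random forest': [[100, 4, 10, 1], [7, 15, 0, 3], [10, 0, 353, 0], [3, 2, 0, 11]],
    'SVC': [[115, 0, 0, 0], [23, 0, 0, 2], [23, 0, 340, 0], [9, 0, 0, 7]],
    'Random forest + grid search': [[108, 1, 5, 1], [9, 13, 0, 3], [9, 0, 354, 0], [6, 1, 0, 9]],
    'SVC + grid search': [[113, 2, 0, 0], [0, 25, 0, 0], [0, 0, 363, 0], [1, 0, 0, 15]],
}

# Accuracy is the diagonal (correct predictions) divided by the total count.
accuracy = {name: np.trace(np.array(cm)) / np.sum(cm) for name, cm in matrices.items()}

for name, value in sorted(accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {value:.3f}')
```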