## Note
I used several classifiers, including a **Random Forest Classifier**, and tuned them with **GridSearch**, which relies on **cross-validation** to find the optimal hyper-parameters

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn.model_selection as sms
import sklearn.preprocessing as sp
import sklearn.linear_model as slm
import sklearn.metrics as sm
import sklearn.ensemble as ens
from warnings import filterwarnings
filterwarnings('ignore')  # to ignore warnings
from sklearn import svm

### Loading data

In [2]:
#df=pd.read_csv("D:\Data\Gamboo\Cars\car.data", names=['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'T']) 
df=pd.read_csv("car.data")
df.head()
# A -- attributes, T -- target

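| | A1 | A2 | A3 | A4 | A5 | A6 | T |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |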


### Preprocessing

In [3]:
df[df.duplicated()]
# no duplicates to remove

| | A1 | A2 | A3 | A4 | A5 | A6 | T |
|---|---|---|---|---|---|---|---|


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A1      1728 non-null   object
 1   A2      1728 non-null   object
 2   A3      1728 non-null   object
 3   A4      1728 non-null   object
 4   A5      1728 non-null   object
 5   A6      1728 non-null   object
 6   T       1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


In [5]:
df['T'].unique()

array(['unacc', 'acc', 'vgood', 'good'], dtype=object)

In [6]:
df.describe()

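| | A1 | A2 | A3 | A4 | A5 | A6 | T |
|---|---|---|---|---|---|---|---|
| count | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 | 1728 |
| unique | 4 | 4 | 4 | 3 | 3 | 3 | 4 |
| top | low | low | 4 | 4 | small | low | unacc |
| freq | 432 | 432 | 432 | 576 | 576 | 576 | 1210 |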


In [7]:
for c in list(df.columns):    
    print(df[c].value_counts())
    print(15*'-')

low      432
med      432
vhigh    432
high     432
Name: A1, dtype: int64
---------------
low      432
med      432
vhigh    432
high     432
Name: A2, dtype: int64
---------------
4        432
3        432
5more    432
2        432
Name: A3, dtype: int64
---------------
4       576
more    576
2       576
Name: A4, dtype: int64
---------------
small    576
med      576
big      576
Name: A5, dtype: int64
---------------
low     576
med     576
high    576
Name: A6, dtype: int64
---------------
unacc    1210
acc       384
good       69
vgood      65
Name: T, dtype: int64
---------------


A common rule of thumb when creating **"dummy variables"** is to drop one dummy column per categorical variable, typically the one for the most frequent category. This helps us **avoid multicollinearity**: a variable with n categories has only **n-1 degrees of freedom**, so n-1 dummy columns already carry all of its information. As seen above, the categories in each feature column are equally frequent, so it doesn't matter which one is dropped; we can simply drop the first one
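To make the multicollinearity point concrete, here is a minimal sketch on a toy column. The frame `toy` and its values are illustrative assumptions, not the car dataset. It shows that the full set of dummy columns always sums to 1 per row, which duplicates an intercept column, and that `drop_first=True` removes that redundancy.

```python
import numpy as np
import pandas as pd

# Toy column with three levels (illustrative values, not the car dataset)
toy = pd.DataFrame({"A5": ["small", "med", "big", "med", "small", "big"]})

# Full one-hot encoding versus encoding with the first level dropped
full = pd.get_dummies(toy, columns=["A5"], dtype=int)
reduced = pd.get_dummies(toy, columns=["A5"], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so the dummy columns add up to
# an intercept column: this is the perfect collinearity being avoided.
row_sums = full.sum(axis=1)

# Add an intercept column and compare rank with the number of columns
full_design = np.column_stack([np.ones(len(full)), full.to_numpy()])
reduced_design = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

print(list(full.columns))
print(list(reduced.columns))
print(np.linalg.matrix_rank(full_design), "of", full_design.shape[1], "columns")
print(np.linalg.matrix_rank(reduced_design), "of", reduced_design.shape[1], "columns")
```

Note that `drop_first=True` drops the first level in sorted order (here `big`), not necessarily the most frequent one. With equally frequent levels, as in this dataset, that makes no difference.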

In [8]:
pd.set_option('display.max_columns', None)
df = pd.get_dummies(df, columns=['A1', 'A2', 'A3', 'A4', 'A5', 'A6'], drop_first=True)
df.head()
# drop_first=True drops the first category of each encoded column

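| | T | A1_low | A1_med | A1_vhigh | A2_low | A2_med | A2_vhigh | A3_3 | A3_4 | A3_5more | A4_4 | A4_more | A5_med | A5_small | A6_low | A6_med |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | unacc | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | unacc | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | unacc | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | unacc | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | unacc | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |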


### Splitting data

In [9]:
df.columns

Index(['T', 'A1_low', 'A1_med', 'A1_vhigh', 'A2_low', 'A2_med', 'A2_vhigh',
       'A3_3', 'A3_4', 'A3_5more', 'A4_4', 'A4_more', 'A5_med', 'A5_small',
       'A6_low', 'A6_med'],
      dtype='object')

In [10]:
X = df.drop(columns=['T'])
y = df[['T']]

In [11]:
x_train, x_test, y_train, y_test = sms.train_test_split(X, y, test_size=0.3, random_state=25)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(1209, 15) (519, 15) (1209, 1) (519, 1)


### Modelling

In [12]:
# table that collects the scores of all models (BAS = balanced accuracy score, AS = accuracy score)
performance=pd.DataFrame(data=None, index=['BAS','AS'])
performance

BAS
AS


In [13]:
# helper that evaluates a fitted model and records its scores in one run
def evaluate(model, model_name):
    y_pred = model.predict(x_test)  # predict once and reuse the result
    bas = sm.balanced_accuracy_score(y_test, y_pred)
    acc = sm.accuracy_score(y_test, y_pred)
    performance[model_name] = [bas, acc]  # rows: 'BAS', 'AS'

    print(sm.classification_report(y_test, y_pred))
    print("--------Balanced Accuracy Score--------")
    print(bas)
    print("-------Accuracy Score------------------")
    print(acc)

pd.options.display.float_format = '{:.3f}'.format
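The two scores recorded by `evaluate` can diverge when the target is imbalanced, as it is here (1210 `unacc` versus 65 `vgood`). The following sketch uses made-up labels, assumed purely for illustration, to show that balanced accuracy is the unweighted mean of per-class recall, so a model that ignores a minority class is penalised even when plain accuracy looks good.

```python
import numpy as np
import sklearn.metrics as sm

# Hypothetical labels: 8 'unacc' samples and 2 'good' samples
y_true = np.array(["unacc"] * 8 + ["good"] * 2)
# A classifier that always predicts the majority class
y_pred = np.array(["unacc"] * 10)

acc = sm.accuracy_score(y_true, y_pred)
bas = sm.balanced_accuracy_score(y_true, y_pred)

# Recall per class, in sorted label order: ['good', 'unacc']
per_class_recall = sm.recall_score(y_true, y_pred, average=None)

print("accuracy:", acc)
print("balanced accuracy:", bas)
print("mean per-class recall:", per_class_recall.mean())
```

This is why the results table keeps both rows: a gap between `AS` and `BAS` points to weak recall on the smaller classes.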

#### Logistic regression

We will use **hyperparameter tuning** to find the **optimal** parameters of the models

In [14]:
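# Note: 'elastic' is not a valid penalty name (scikit-learn expects 'elasticnet'),
# and the default 'lbfgs' solver does not support 'l1'. With warnings silenced,
# those candidates most likely fail to fit and are scored as NaN by GridSearchCV.
parameters = {'penalty':('l1','l2','elastic', 'none'),'C':[1, 10, 50, 100] }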
lr1 = slm.LogisticRegression()
clf1 = sms.GridSearchCV(lr1, parameters)
clf1.fit(x_train, y_train)

GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [1, 10, 50, 100],
                         'penalty': ('l1', 'l2', 'elastic', 'none')})
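The grid above mixes penalties that the default solver cannot fit. As a hedged sketch of one possible alternative, the example below pairs the penalties with the `saga` solver, which supports both `'l1'` and `'l2'`, and checks that no candidate ends up with a NaN score. The data comes from `make_classification` and stands in for the encoded car features; the names `X_demo`, `y_demo`, `param_grid`, and `search` are assumptions made for this illustration, and the reduced grid is not the one used in this notebook.

```python
import numpy as np
import sklearn.linear_model as slm
import sklearn.model_selection as sms
from sklearn.datasets import make_classification

# Synthetic stand-in for the one-hot encoded features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=25
)

# 'saga' supports both 'l1' and 'l2', so every candidate can be fitted
param_grid = {"penalty": ["l1", "l2"], "C": [1, 10]}
lr = slm.LogisticRegression(solver="saga", max_iter=5000)
search = sms.GridSearchCV(lr, param_grid, cv=5)
search.fit(X_demo, y_demo)

# Mean cross-validated score of each candidate; NaN would signal a failed fit
scores = search.cv_results_["mean_test_score"]
print(search.best_params_)
print("any failed candidates:", np.isnan(scores).any())
```

Inspecting `cv_results_` in this way is a quick check that the search compared every parameter combination it was given.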

In [15]:
evaluate(clf1, 'Log_reg_0.5')

              precision    recall  f1-score   support

         acc       0.87      0.83      0.85       108
        good       0.85      0.89      0.87        19
       unacc       0.96      0.97      0.97       369
       vgood       0.85      0.96      0.90        23

    accuracy                           0.94       519
   macro avg       0.88      0.91      0.90       519
weighted avg       0.94      0.94      0.94       519

--------Balanced Accuracy Score--------
0.9130178973414449
-------Accuracy Score------------------
0.9364161849710982


#### SVC classification model

In [16]:
parameters = {'C':[1, 10, 50, 100] }
svc = svm.SVC()
clf2 = sms.GridSearchCV(svc, parameters)
clf2.fit(x_train, y_train)
clf2.best_params_

{'C': 10}

In [17]:
evaluate(clf2, 'SVC')

              precision    recall  f1-score   support

         acc       0.96      0.99      0.98       108
        good       1.00      0.89      0.94        19
       unacc       1.00      1.00      1.00       369
       vgood       1.00      0.96      0.98        23

    accuracy                           0.99       519
   macro avg       0.99      0.96      0.97       519
weighted avg       0.99      0.99      0.99       519

--------Balanced Accuracy Score--------
0.959822323719042
-------Accuracy Score------------------
0.9903660886319846


In [18]:
performance[['SVC']]

| | SVC |
|---|---|
| BAS | 0.960 |
| AS | 0.990 |


### Ensemble methods

#### RandomForest classification model

In [19]:
parameters = {'n_estimators': list(range(150, 200, 10))}
rfc = ens.RandomForestClassifier(random_state=25)
clf3 = sms.GridSearchCV(rfc, parameters)
clf3.fit(x_train, y_train)

clf3.best_params_

{'n_estimators': 190}

In [20]:
evaluate(clf3, 'R_Forest')

              precision    recall  f1-score   support

         acc       0.80      0.88      0.84       108
        good       0.62      0.26      0.37        19
       unacc       0.96      0.98      0.97       369
       vgood       0.82      0.61      0.70        23

    accuracy                           0.92       519
   macro avg       0.80      0.68      0.72       519
weighted avg       0.91      0.92      0.91       519

--------Balanced Accuracy Score--------
0.6824507399345542
-------Accuracy Score------------------
0.9152215799614644


In [21]:
performance[['R_Forest']]

| | R_Forest |
|---|---|
| BAS | 0.682 |
| AS | 0.915 |


#### ExtraTrees classification model

In [22]:
parameters = {'n_estimators': list(range(80, 200, 10))}
etc = ens.ExtraTreesClassifier(random_state=25)
clf4 = sms.GridSearchCV(etc, parameters)
clf4.fit(x_train, y_train)

clf4.best_params_

{'n_estimators': 90}

In [23]:
evaluate(clf4, 'ExTree')

              precision    recall  f1-score   support

         acc       0.81      0.87      0.84       108
        good       0.60      0.32      0.41        19
       unacc       0.96      0.97      0.97       369
       vgood       0.79      0.65      0.71        23

    accuracy                           0.91       519
   macro avg       0.79      0.70      0.73       519
weighted avg       0.91      0.91      0.91       519

--------Balanced Accuracy Score--------
0.7028083715238373
-------Accuracy Score------------------
0.9132947976878613


#### GradientBoosting classification model

In [24]:
parameters = { 'learning_rate':[0.1,0.01,0.001], 'n_estimators':[50,100,150,200,250] }
gbc = ens.GradientBoostingClassifier(random_state=25)
clf5 = sms.GridSearchCV(gbc, parameters)
clf5.fit(x_train, y_train)

clf5.best_params_

{'learning_rate': 0.1, 'n_estimators': 250}

In [25]:
evaluate(clf5, 'GradientBC')

              precision    recall  f1-score   support

         acc       0.97      0.98      0.98       108
        good       0.86      0.95      0.90        19
       unacc       1.00      0.99      1.00       369
       vgood       0.96      0.96      0.96        23

    accuracy                           0.99       519
   macro avg       0.95      0.97      0.96       519
weighted avg       0.99      0.99      0.99       519

--------Balanced Accuracy Score--------
0.9693103900909337
-------Accuracy Score------------------
0.9865125240847784


#### AdaBoost classification model

In [26]:
parameters = { 'learning_rate':[0.1, 1, 2], 'n_estimators':[50,100,150,200]}
abc = ens.AdaBoostClassifier(random_state=25)
clf6 = sms.GridSearchCV(abc, parameters)
clf6.fit(x_train, y_train)

clf6.best_params_

{'learning_rate': 1, 'n_estimators': 50}

In [27]:
evaluate(clf6, 'AdaBC')

              precision    recall  f1-score   support

         acc       0.71      0.59      0.65       108
        good       0.50      0.37      0.42        19
       unacc       0.91      0.96      0.93       369
       vgood       0.78      0.78      0.78        23

    accuracy                           0.86       519
   macro avg       0.72      0.68      0.70       519
weighted avg       0.84      0.86      0.85       519

--------Balanced Accuracy Score--------
0.6764204903681379
-------Accuracy Score------------------
0.8554913294797688


In [28]:
performance

| | Log_reg_0.5 | SVC | R_Forest | ExTree | GradientBC | AdaBC |
|---|---|---|---|---|---|---|
| BAS | 0.913 | 0.960 | 0.682 | 0.703 | 0.969 | 0.676 |
| AS | 0.936 | 0.990 | 0.915 | 0.913 | 0.987 | 0.855 |


I merged the evaluation scores of all the models we trained into **"performance"** in order to compare them. The **GradientBoosting, SVC, and Logistic Regression** models performed best, while the Random Forest, ExtraTrees, and AdaBoost models scored much lower on balanced accuracy (about 0.68 to 0.70). **Hyperparameter tuning** helped by moving each model closer to its optimal parameter values

Based on this comparison, **Support Vector Classification** is the model to pick and use as the main one: it has the highest accuracy (0.990), and its balanced accuracy (0.960) is only slightly below that of GradientBoosting (0.969)
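The comparison can also be made by ranking the table programmatically. The sketch below rebuilds the table from the rounded scores reported above, under the hypothetical name `performance_demo`, and sorts the models by balanced accuracy.

```python
import pandas as pd

# Scores copied from the "performance" table above (rounded to 3 decimals)
performance_demo = pd.DataFrame(
    {
        "Log_reg_0.5": [0.913, 0.936],
        "SVC": [0.960, 0.990],
        "R_Forest": [0.682, 0.915],
        "ExTree": [0.703, 0.913],
        "GradientBC": [0.969, 0.987],
        "AdaBC": [0.676, 0.855],
    },
    index=["BAS", "AS"],
)

# One row per model, best balanced accuracy first
ranked = performance_demo.T.sort_values("BAS", ascending=False)

# Best model according to each metric
best_by_bas = performance_demo.loc["BAS"].idxmax()
best_by_as = performance_demo.loc["AS"].idxmax()

print(ranked)
print("best BAS:", best_by_bas, "| best AS:", best_by_as)
```

The two metrics favour different models by a small margin, so the final choice depends on whether overall accuracy or recall on the rarer classes matters more.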

**Thanks**