# 1 Initialisation

## 1.1 Installation & Load Libraries

In [1]:
!pip install lazypredict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.9-py2.py3-none-any.whl (12 kB)
Collecting lightgbm==2.3.1
  Downloading lightgbm-2.3.1-py2.py3-none-manylinux1_x86_64.whl (1.2 MB)
Collecting tqdm==4.56.0
  Downloading tqdm-4.56.0-py2.py3-none-any.whl (72 kB)
Collecting scikit-learn==0.23.1
  Downloading scikit_learn-0.23.1-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)
Collecting numpy==1.19.1
  Downloading numpy-1.19.1-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting PyYAML==5.3.1
  Downloading PyYAML-5.3.1.tar.gz (269 kB)
Collecting pytest==5.4.3
  Downloading pyte

**_Please restart the kernel before continuing._**


In [2]:
import requests

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import plot_confusion_matrix


## 1.2 Load Data

According to the dataset description, the fields can take the following values:

**Class values** - Car Acceptability 

unacc, acc, good, vgood

**Attributes**

buying:   vhigh, high, med, low. <br>
maint:    vhigh, high, med, low.<br>
doors:    2, 3, 4, 5more.<br>
persons:  2, 4, more.<br>
lug_boot: small, med, big.<br>
safety:   low, med, high.<br>
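
As a quick sanity check on the value lists above, the sketch below enumerates every combination of the six attributes. The dictionary name `attribute_values` is introduced here for illustration only. If the file holds exactly one row per combination, as the UCI description indicates, the count should match the number of rows reported by `data.shape`.

```python
from itertools import product

# Attribute values copied from the list above (illustrative dict name)
attribute_values = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every possible combination of the six attributes
all_combinations = list(product(*attribute_values.values()))
n_combinations = len(all_combinations)
print(n_combinations)  # 1728

# Number of combinations that share one particular buying level
per_buying_level = n_combinations // len(attribute_values["buying"])
print(per_buying_level)  # 432
```

A full grid like this also means that each attribute value appears equally often within every level of any other attribute.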


In [4]:
# Download data (time out instead of hanging, and fail loudly on HTTP errors)
req = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", timeout=30)
req.raise_for_status()

# Save data
with open("car.data", "wb") as f:
    f.write(req.content)

ConnectionError: ignored

In [None]:
# read Data 
data =  pd.read_csv("car.data", sep=",", header = None )
data.columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
print(data.shape)
data.head()

In [None]:
# Check for duplicated records - No duplicates found
data.drop_duplicates().shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data["buying"].value_counts()/ len(data)

# 2 Exploratory Data Analysis (EDA)

## 2.1 Distribution

Check the distribution of every variable within each **buying** class.

In [None]:
# Plot the value counts of every column for the rows selected by a boolean mask
def plot_distributions(subset_rule, title):
    plot_data = data[subset_rule]
    fig, axes = plt.subplots(2, 4, figsize=(12, 6))
    fig.suptitle(f"Distribution of Variables for {title}")
    axes = axes.ravel()
    for ax, categorical_feature in zip(axes, plot_data.columns):
        plot_data[categorical_feature].value_counts().plot(kind="bar", ax=ax)
        ax.set_title(categorical_feature)
        ax.tick_params(axis="x", rotation=0)
    # Seven columns on a 2x4 grid leave the last panel unused, so hide it
    for ax in axes[len(plot_data.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    fig.subplots_adjust(top=0.92)
    plt.show()

# Plot the distributions based on the different classes in buying
for buying_class in ["low", "med", "high", "vhigh"]:
    subset_rule = data["buying"] == buying_class
    plot_distributions(subset_rule, buying_class)

The distribution plots suggest that only **class** is helpful for predicting **buying**: every other variable is roughly uniformly distributed within each buying class.

## 2.2 Chi-Square test of independence 


H0: Variables are independent <br>
HA: Variables are dependent <br>
Tested at the 95% confidence level (significance level 0.05)

In [None]:
for col in data.columns:
    crosstab = pd.crosstab(data["buying"], data[col])
    print(crosstab)

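Since every variable except **class** has the same number of samples in each cell of its crosstab with **buying**, the observed counts equal the counts expected under independence, so the chi-square statistic is 0 and the p-value is 1. Hence we only run the chi-square test on **class**.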
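
To see why a perfectly balanced crosstab carries no evidence against independence, the sketch below computes the Pearson chi-square statistic by hand for a 4x3 table in which every cell holds the same count. The cell count of 144 assumes 1728 rows spread evenly over 12 cells, and `chi2_statistic` is a helper written for this illustration, not a library function.

```python
def chi2_statistic(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a 2D table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Balanced table: 4 buying levels (rows) by 3 levels of another attribute (columns)
balanced = [[144, 144, 144] for _ in range(4)]
stat, dof = chi2_statistic(balanced)
print(stat, dof)  # 0.0 6
```

A statistic of 0 means the observed counts match the expected counts exactly, which corresponds to a p-value of 1.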

In [None]:
from scipy.stats import chi2_contingency
crosstab = pd.crosstab(data["buying"], data["class"])
stat, p, dof, expected = chi2_contingency(crosstab)
alpha = 0.05  # significance level for a 95% confidence test
verdict = "reject" if p < alpha else "fail to reject"
print(f"The p-value is {p}; at a significance level of {alpha} we {verdict} the null hypothesis")

## 2.3 EDA Conclusion

We found that **class** is the only variable with a significant association with **buying**, and therefore the only one likely to help predict it.

However, Section 4 tests whether including the remaining variables improves performance. If there is no significant gain, we will keep only **class** as the feature.

# 3 Feature Engineering

## 3.1 Encoding

**One-hot vs Ordinal Encoding**

All features are ordinal; for example, _doors_ takes the ordered values [2, 3, 4, 5more]. <br>
Ordinal encoding is preferred because it preserves the order of the categories, whereas one-hot encoding discards it.
Hence **ordinal encoding is chosen as the encoding method** in this case.
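
The difference between the two encodings can be sketched in plain Python for the _doors_ attribute. The helper names `ordinal_encode` and `one_hot_encode` are made up for this illustration and are not part of any library.

```python
doors_levels = ["2", "3", "4", "5more"]  # ordered from fewest to most doors

def ordinal_encode(value):
    """Map a level to its 1-based rank, so the order is kept."""
    return doors_levels.index(value) + 1

def one_hot_encode(value):
    """Map a level to an indicator vector, which carries no order."""
    return [1 if level == value else 0 for level in doors_levels]

print(ordinal_encode("2"), ordinal_encode("5more"))   # 1 4
print(one_hot_encode("2"), one_hot_encode("5more"))   # [1, 0, 0, 0] [0, 0, 0, 1]

# Ordinal codes can be compared: "2" ranks below "5more"
print(ordinal_encode("2") < ordinal_encode("5more"))  # True
```

With one-hot vectors every pair of levels differs in exactly two positions, so a model cannot tell from the encoding alone that "3" lies between "2" and "4".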


In [None]:
data.columns

In [None]:
ordinal_encoding = {'buying': {"vhigh" : 4, "high" : 3, "med" : 2, "low": 1},
                  'maint': {"vhigh" : 4, "high" : 3, "med" : 2, "low": 1},
                  'doors': {"2" : 1, "3" : 2, "4" : 3, "5more": 4},
                  'persons': {"2" : 1, "4" : 2, "more" : 3},
                  'lug_boot': {"small" : 1, "med" : 2, "big": 3},
                  'safety': { "high" : 3, "med" : 2, "low": 1},
                  'class': {"unacc" : 1, "acc": 2, "good": 3, "vgood": 4}}

encoded_data = data.replace(ordinal_encoding)

# 4 Model Selection

## 4.1 Stratified Train Test Split

We stratify by **buying** to maintain the same class proportions in the training and test sets.
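
What stratification does can be sketched without scikit-learn: each class is split separately, so both parts keep the class proportions. This is a simplified stand-in for `train_test_split(..., stratify=y)`, not its actual implementation; the toy labels and the `stratified_split` helper are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with the same label proportions in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        # Shuffle within the class, then cut off the test share of that class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: four equally sized classes (hypothetical, not read from car.data)
labels = [1, 2, 3, 4] * 432
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Each class ends up with 324 training rows and 108 test rows in this toy setup, so the 75/25 ratio holds per class as well as overall.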

In [None]:
from sklearn.model_selection import train_test_split
y = encoded_data["buying"]

**All Features**

In [None]:
# All Features
X = encoded_data.drop("buying", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify = y)

[x.shape for x in [X_train, X_test, y_train, y_test]]

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

**Class Only**

In [None]:
# Features - class only 
# These are marked with a _c 
X_c = encoded_data[["class"]]
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_c, y, test_size=0.25, random_state=42, stratify = y)
[x.shape for x in [X_train_c, X_test_c, y_train_c, y_test_c]]

In [None]:
y_train_c.value_counts()

In [None]:
y_test_c.value_counts()

## 4.2 LazyPredict

**LazyPredict** is a library that quickly trains and evaluates many different models on the same data.

We can pick the best models here to inspect and tune further.

**All Features**

In [None]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_test, y_train, y_test)

In [None]:
models

**Class Only**

In [None]:
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train_c, X_test_c, y_train_c, y_test_c)

In [None]:
models

Using all features gives no gain in accuracy over using **class** alone, but it does improve the F1 score.

We investigate this further by looking at the confusion matrix, using AdaBoost as the model.
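
Accuracy and F1 can disagree because accuracy only counts correct predictions overall, while F1 is built from per-class precision and recall. The sketch below computes accuracy and macro-averaged F1 from two hypothetical confusion matrices (rows are true classes, columns are predicted classes). The numbers are invented for illustration and are not results from this notebook, and the averaging used by LazyPredict for its F1 column may differ from the macro average shown here.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical model A: only ever predicts two of the four classes
cm_a = [[10, 0, 0, 0],
        [10, 0, 0, 0],
        [0, 0, 0, 10],
        [0, 0, 0, 10]]
# Hypothetical model B: same number of correct predictions, spread over all classes
cm_b = [[5, 5, 0, 0],
        [5, 5, 0, 0],
        [0, 0, 5, 5],
        [0, 0, 5, 5]]

print(accuracy(cm_a), accuracy(cm_b))
print(macro_f1(cm_a), macro_f1(cm_b))
```

Both toy models have the same accuracy, but model A scores a lower macro F1 because two classes are never predicted. The confusion matrix makes that kind of difference visible.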

**All Features**

In [None]:
clf = AdaBoostClassifier()
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

In [None]:
plot_confusion_matrix(clf, X_train, y_train)

**Class Only**

In [None]:
clf = AdaBoostClassifier()
clf.fit(X_train_c, y_train_c)

print(clf.score(X_train_c, y_train_c))
print(clf.score(X_test_c, y_test_c))

In [None]:
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(clf, X_train_c, y_train_c)

## 4.3 Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5, shuffle = True, random_state = 0)

param_grid = {"base_estimator":[DecisionTreeClassifier(max_depth = 2)],
              "base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[10, 20, 50, 100],
              "learning_rate":  [0.05, 0.1, 0.3, 0.5, 1.5, 2.5],
              "random_state": [0]}
optimal_clf = GridSearchCV(estimator=AdaBoostClassifier(),
             param_grid=param_grid,
             cv = skfold,
             verbose = 1)

%time optimal_clf.fit(X_train, y_train)
optimal_clf.best_estimator_
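
Before launching the search it helps to know how many model fits it triggers. The sketch below counts the combinations in a grid shaped like `param_grid` above and multiplies by the five folds of `skfold`. The name `grid_shape` is illustrative, and the base estimator is replaced by a placeholder string so the block runs without scikit-learn.

```python
# Same keys and value counts as param_grid above; the base estimator is a placeholder string
grid_shape = {
    "base_estimator": ["DecisionTreeClassifier(max_depth=2)"],
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "algorithm": ["SAMME", "SAMME.R"],
    "n_estimators": [10, 20, 50, 100],
    "learning_rate": [0.05, 0.1, 0.3, 0.5, 1.5, 2.5],
    "random_state": [0],
}
n_splits = 5  # matches StratifiedKFold(n_splits=5)

# The number of candidates is the product of the list lengths
n_candidates = 1
for values in grid_shape.values():
    n_candidates *= len(values)

n_fits = n_candidates * n_splits
print(n_candidates, n_fits)  # 192 960
```

`GridSearchCV` then refits the best candidate once more on the full training set, because `refit=True` is its default.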