# Machine Learning Lab 
## Lab 10 - Imbalanced Dataset
---
**S Shyam Sundaram** <br>
**19BCE1560** <br>
**October 11, 2021**<br>

**Dr Abdul Quadir MD**<br>
**L31+L32**

In [1]:
import pandas as pd

In [2]:
data=pd.read_csv('car_evaluation.csv',header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [3]:
data.columns=['buying','maint','doors','persons','lug_boot','safety','outcome']
data.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,outcome
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


In [4]:
data.shape

(1728, 7)

In [5]:
data.outcome.value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: outcome, dtype: int64

We have 1210 values that are 'unacc', 384 'acc', 69 'good' and 65 'vgood'. Their respective percentages are:

In [6]:
for count in data.outcome.value_counts():
    print(str(round((count/len(data))*100,2))+"%")

70.02%
22.22%
3.99%
3.76%


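This indicates that the dataset is imbalanced: the majority class 'unacc' makes up about 70% of the rows, while 'good' and 'vgood' together account for less than 8%.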
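As a quick cross-check of the percentages, the sketch below recomputes the class shares and the ratio between the largest and smallest class using only the counts printed above. The variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({"unacc": 1210, "acc": 384, "good": 69, "vgood": 65})
total = sum(counts.values())

# Share of each class in percent, rounded as in the notebook.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Ratio of the largest class to the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(round(imbalance_ratio, 1))
```

The majority class is roughly 18 to 19 times larger than the smallest one, which is the gap the sampling methods later in the lab try to close.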

In [7]:
X=data.drop(['outcome'],axis=1)
y=data['outcome']
X

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,vhigh,vhigh,2,2,small,low
1,vhigh,vhigh,2,2,small,med
2,vhigh,vhigh,2,2,small,high
3,vhigh,vhigh,2,2,med,low
4,vhigh,vhigh,2,2,med,med
...,...,...,...,...,...,...
1723,low,low,5more,more,med,med
1724,low,low,5more,more,med,high
1725,low,low,5more,more,big,low
1726,low,low,5more,more,big,med


In [8]:
y

0       unacc
1       unacc
2       unacc
3       unacc
4       unacc
        ...  
1723     good
1724    vgood
1725    unacc
1726     good
1727    vgood
Name: outcome, Length: 1728, dtype: object

# Encoding non-integer entries

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
enc=LabelEncoder()
X.loc[:,['buying','maint','lug_boot','safety','doors','persons']] = X.loc[:,['buying','maint','lug_boot','safety','doors','persons']].apply(enc.fit_transform)

In [11]:
X

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,3,3,0,0,2,1
1,3,3,0,0,2,2
2,3,3,0,0,2,0
3,3,3,0,0,1,1
4,3,3,0,0,1,2
...,...,...,...,...,...,...
1723,1,1,3,2,1,2
1724,1,1,3,2,1,0
1725,1,1,3,2,0,1
1726,1,1,3,2,0,2
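The integer codes in the table follow the alphabetical order of the category names, because `LabelEncoder` assigns codes by sorting the distinct values. The sketch below mimics that behaviour in plain Python for two of the columns; `label_encode` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Category names taken from the 'safety' and 'lug_boot' columns shown earlier.
safety_codes, safety_map = label_encode(["low", "med", "high"])
lug_codes, lug_map = label_encode(["small", "med", "big"])

print(safety_map)
print(lug_map)
```

Note that the resulting codes do not preserve the natural order of the categories (for example, 'high' gets a smaller code than 'low'). Since KNN relies on distances between encoded values, an ordinal encoding with an explicit category order may be worth trying; that is a suggestion, not something the original lab evaluates.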


## Splitting train-test

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
train_x, test_x, train_y, test_y = train_test_split(X,y, test_size=0.3, random_state=10)

## Building a KNN Classifier model and evaluating it

In [14]:
from sklearn.neighbors import KNeighborsClassifier 
model = KNeighborsClassifier()
model.fit(train_x,train_y)
y_predict = model.predict(test_x)
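With its default settings, `KNeighborsClassifier` uses the 5 nearest neighbours and Euclidean distance. The sketch below shows the underlying idea as a plain majority vote. It is a simplified illustration on made-up two-feature points, and its tie-breaking may differ from scikit-learn's.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Return the most common label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of encoded points.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["unacc", "unacc", "unacc", "acc", "acc", "acc"]

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))
print(knn_predict(train_points, train_labels, (5.5, 5.5), k=3))
```

Because the vote is decided by whichever class dominates the neighbourhood, a class with very few training rows is easily outvoted, which is why imbalance matters for this model.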

In [15]:
from sklearn.metrics import accuracy_score 
print(accuracy_score(test_y,y_predict))
pd.crosstab(test_y,y_predict)

0.9267822736030829


actual \ predicted,acc,good,unacc,vgood
acc,84,1,17,0
good,8,13,0,0
unacc,3,0,368,0
vgood,4,2,3,16


84 out of 102 'acc' are correctly classified. 82%<br>
13 out of 21 'good' are correctly classified. 62%<br>
368 out of 371 'unacc' are correctly classified. 99%<br>
16 out of 25 'vgood' are correctly classified. 64%
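The per-class figures can be recomputed directly from the crosstab, where rows are the actual classes and columns are the predicted ones. The sketch below does this in plain Python; the matrix is copied from the output above and the variable names are illustrative.

```python
labels = ["acc", "good", "unacc", "vgood"]

# Rows: actual class, columns: predicted class (copied from the crosstab above).
matrix = [
    [84, 1, 17, 0],
    [8, 13, 0, 0],
    [3, 0, 368, 0],
    [4, 2, 3, 16],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall per class: correctly classified rows divided by all rows of that class.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for i, label in enumerate(labels):
    print(f"{label}: {matrix[i][i]}/{sum(matrix[i])} = {recall[label]:.0%}")
```

The overall accuracy is dominated by 'unacc', so the weaker results on the two smallest classes are easy to miss when looking at accuracy alone.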

# Solving the problem: Sampling
---

We have three ways to try and solve this issue of imbalance. They are:<br>
1. [Oversampling](#os)<br>
2. [Undersampling](#us)<br>
3. [SMOTE](#sm)

## Oversampling <a name="os"></a>
---

In [16]:
import imblearn
from imblearn.over_sampling import RandomOverSampler

In [17]:
over=RandomOverSampler()

In [18]:
Over_X,Over_y=over.fit_resample(X,y)

In [19]:
from collections import Counter

In [20]:
print(Counter(Over_y))

Counter({'unacc': 1210, 'acc': 1210, 'vgood': 1210, 'good': 1210})


In [21]:
Over_y

0       unacc
1       unacc
2       unacc
3       unacc
4       unacc
        ...  
4835    vgood
4836    vgood
4837    vgood
4838    vgood
4839    vgood
Name: outcome, Length: 4840, dtype: object

We see that there are 1210 entries of each class.
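Random oversampling reaches equal class sizes by drawing existing minority rows again, with replacement, until every class matches the largest one. The sketch below illustrates the idea in plain Python using row ids in place of real rows; `random_oversample` is a hypothetical helper, not the imblearn implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, n in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Stand-in data: one id per original row, with the class counts of this dataset.
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
rows = list(range(len(labels)))

over_rows, over_labels = random_oversample(rows, labels)
print(Counter(over_labels))
print(len(over_rows), "rows,", len(set(over_rows)), "distinct")
```

The resampled set has 4840 rows but still only 1728 distinct ones, so oversampling rebalances the classes without adding new information.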

In [22]:
train_x, test_x, train_y, test_y = train_test_split(Over_X,Over_y, test_size=0.3, random_state=10)
model = KNeighborsClassifier()
model.fit(train_x,train_y)
y_predict = model.predict(test_x)

In [23]:
print(accuracy_score(test_y,y_predict))
pd.crosstab(test_y,y_predict)

0.8842975206611571


actual \ predicted,acc,good,unacc,vgood
acc,311,27,2,23
good,0,348,0,0
unacc,73,25,259,18
vgood,0,0,0,366


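We see that the accuracy has dropped from 93% to 88%. Most of the new errors come from 'unacc', where only 259 out of 375 rows are now classified correctly.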

## Undersampling <a name="us"></a>
---

In [24]:
from imblearn.under_sampling import RandomUnderSampler

In [25]:
under=RandomUnderSampler()

In [26]:
Under_X,Under_y=under.fit_resample(X,y)

In [27]:
print(Counter(Under_y))

Counter({'acc': 65, 'good': 65, 'unacc': 65, 'vgood': 65})


In [28]:
Under_y

0        acc
1        acc
2        acc
3        acc
4        acc
       ...  
255    vgood
256    vgood
257    vgood
258    vgood
259    vgood
Name: outcome, Length: 260, dtype: object

We see that the total number of entries has dropped to 260. Every class now has 65 entries, which was the size of the smallest class ('vgood') in the original dataset.
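Random undersampling works the other way round: it keeps a random subset of each class, without replacement, so that every class shrinks to the size of the smallest one. A minimal sketch in plain Python follows; `random_undersample` is a hypothetical helper, not the imblearn implementation.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    kept_rows, kept_labels = [], []
    for label in counts:
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        kept_rows.extend(rng.sample(pool, target))  # sampling without replacement
        kept_labels.extend([label] * target)
    return kept_rows, kept_labels

# Stand-in data: one id per original row, with the class counts of this dataset.
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
rows = list(range(len(labels)))

under_rows, under_labels = random_undersample(rows, labels)
print(Counter(under_labels))
print(len(under_rows), "of", len(rows), "rows kept")
```

Only 260 of the 1728 rows survive, so most of the 'unacc' and 'acc' examples are discarded before the model ever sees them.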

In [29]:
train_x, test_x, train_y, test_y = train_test_split(Under_X,Under_y, test_size=0.3, random_state=10)
model = KNeighborsClassifier()
model.fit(train_x,train_y)
y_predict = model.predict(test_x)

In [30]:
print(accuracy_score(test_y,y_predict))
pd.crosstab(test_y,y_predict)

0.6794871794871795


actual \ predicted,acc,good,unacc,vgood
acc,4,11,2,5
good,0,16,0,0
unacc,3,3,10,1
vgood,0,0,0,23


The accuracy of the classifier has dropped much further, down to 68%. After the split, the model is trained on only 182 of the 260 remaining rows.

## SMOTE <a name="sm"></a>
---

In [31]:
from imblearn.over_sampling import SMOTE

In [32]:
smote=SMOTE()

In [33]:
SMOTE_X,SMOTE_y=smote.fit_resample(X,y)

In [34]:
print(Counter(SMOTE_y))

Counter({'unacc': 1210, 'acc': 1210, 'vgood': 1210, 'good': 1210})


In [35]:
SMOTE_y

0       unacc
1       unacc
2       unacc
3       unacc
4       unacc
        ...  
4835    vgood
4836    vgood
4837    vgood
4838    vgood
4839    vgood
Name: outcome, Length: 4840, dtype: object
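Unlike random oversampling, SMOTE does not copy rows. It creates each new point by picking a minority row, choosing one of its nearest neighbours from the same class, and interpolating between the two. The sketch below shows a single interpolation step in plain Python. The `smote_sample` helper and the encoded rows are made up for illustration, and this is not the imblearn implementation.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority row and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to neighbour
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Hypothetical label-encoded rows of a single minority class.
minority = [
    (1, 1, 3, 2, 1, 0),
    (1, 1, 3, 2, 0, 0),
    (1, 0, 3, 2, 0, 0),
    (1, 0, 2, 2, 0, 0),
    (2, 1, 3, 2, 0, 0),
]

synthetic = smote_sample(minority, k=2, rng=random.Random(1))
print(synthetic)
```

Because the features here are label-encoded categories, the interpolated values can be fractional and need not correspond to any real category. imblearn also provides `SMOTEN` and `SMOTENC` for categorical data, which may be a better fit, although this lab does not evaluate them.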

In [36]:
train_x, test_x, train_y, test_y = train_test_split(SMOTE_X,SMOTE_y, test_size=0.3, random_state=10)
model = KNeighborsClassifier()
model.fit(train_x,train_y)
y_predict = model.predict(test_x)

In [37]:
print(accuracy_score(test_y,y_predict))
pd.crosstab(test_y,y_predict)

0.9827823691460055


actual \ predicted,acc,good,unacc,vgood
acc,355,6,2,0
good,0,348,0,0
unacc,17,0,358,0
vgood,0,0,0,366


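Using SMOTE to deal with the imbalanced dataset has increased our accuracy to 98.3%. Compared with the original model, 'acc', 'good' and 'vgood' are classified far more reliably (98%, 100% and 100% correct), while 'unacc' drops slightly from 99% to 95%.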
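When comparing these scores, note that `fit_resample` was applied to the full `X` and `y` before `train_test_split` in all three experiments, so the test sets contain resampled rows. The sketch below is a plain-Python simulation using row ids rather than the real data. It estimates, for random oversampling, how many test rows are copies of a row that also sits in the training split. The seed and the simple 70/30 shuffle are assumptions made for illustration.

```python
import random
from collections import Counter

rng = random.Random(10)
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

# Give every original row an id, grouped by class.
original, start = {}, 0
for label, n in counts.items():
    original[label] = list(range(start, start + n))
    start += n

# Randomly oversample ids of each class up to the majority count.
target = max(counts.values())
resampled = []
for label, ids in original.items():
    extra = [rng.choice(ids) for _ in range(target - len(ids))]
    resampled.extend((row_id, label) for row_id in ids + extra)

# Shuffle and split 70/30 after resampling, as in the notebook.
rng.shuffle(resampled)
cut = round(0.7 * len(resampled))
train, test = resampled[:cut], resampled[cut:]

train_ids = {row_id for row_id, _ in train}
leaked = Counter(label for row_id, label in test if row_id in train_ids)
seen = Counter(label for _, label in test)
for label in counts:
    print(f"{label}: {leaked[label] / seen[label]:.0%} of test rows also occur in the training split")
```

A common alternative is to split first and resample only the training portion, so that the test set keeps the original class distribution and stays comparable with the baseline.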