## SMOTE (Synthetic Minority Oversampling Technique)

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings(action = 'ignore', category=FutureWarning)
data = pd.read_csv('car_evaluation.csv')
data.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,outcome
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


#### SMOTE is a data preprocessing technique that creates artificial data points of the minority class to deal with an imbalanced dataset.
#### Two main oversampling techniques for handling imbalanced data are:
#### Random oversampling: increases the size of the training set by repeating original samples of the minority class.
#### SMOTE: creates new training samples of the minority classes from the original samples. When two minority-class samples lie near each other, it creates a third sample between them. This adds variety to the training data, so the model has more minority examples to learn from.
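The difference between the two techniques can be sketched in a few lines of plain Python. This is only an illustration: the helper names `random_oversample` and `smote_like_point` and the toy points are made up here, and real SMOTE implementations choose the second point from the k nearest minority neighbours rather than taking any pair.

```python
import random

def random_oversample(samples, n_extra, rng):
    """Random oversampling: repeat existing minority samples unchanged."""
    return [rng.choice(samples) for _ in range(n_extra)]

def smote_like_point(a, b, rng):
    """SMOTE-style step: return a new point on the segment between a and b."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

rng = random.Random(0)
minority = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]  # hypothetical minority samples

duplicates = random_oversample(minority, 2, rng)             # exact copies
synthetic = smote_like_point(minority[0], minority[1], rng)  # a new, in-between point

print("duplicates:", duplicates)
print("synthetic :", synthetic)
```

The duplicated samples add no new information, while the synthetic point lies somewhere between its two parents.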

In [2]:
# buying, maint (maintenance), doors, persons, lug_boot (luggage space) and safety are the features of each car

In [3]:
# outcome is the car's evaluation: unacc (unacceptable), acc (acceptable), good or vgood (very good)

In [4]:
data.shape

(1728, 7)

In [5]:
data.outcome.value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: outcome, dtype: int64

In [6]:
# The minority classes here are good and vgood, which have far fewer samples than unacc, so the data is clearly imbalanced

In [7]:
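# Create the feature set: every column except the last one (outcome)
# Note: this also keeps the 'Unnamed: 0' row-index column as a feature
X = data.iloc[:,:-1]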
y = data.outcome
X.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,vhigh,vhigh,2,2,small,low
1,vhigh,vhigh,2,2,small,med
2,vhigh,vhigh,2,2,small,high
3,vhigh,vhigh,2,2,med,low
4,vhigh,vhigh,2,2,med,med


In [8]:
# The feature set contains every column except the dependent variable, outcome

In [9]:
from sklearn.preprocessing import LabelEncoder

# Label-encode each categorical feature column (the encoder is refit per column)
cat_cols = ['buying','maint','lug_boot','safety','doors','persons']
enc = LabelEncoder()
X = X.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
X[cat_cols] = X[cat_cols].apply(enc.fit_transform)
X.head()


Unnamed: 0,buying,maint,doors,persons,lug_boot,safety
0,3,3,0,0,2,1
1,3,3,0,0,2,2
2,3,3,0,0,2,0
3,3,3,0,0,1,1
4,3,3,0,0,1,2
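The codes in this output follow from how `LabelEncoder` numbers classes: each distinct value gets its position in sorted order. The sketch below mimics that rule in plain Python (`label_encode` is a hypothetical stand-in, not the scikit-learn implementation) to show why `vhigh` becomes 3 and `low` becomes 1.

```python
def label_encode(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

_, safety_map = label_encode(['low', 'med', 'high'])
_, buying_map = label_encode(['vhigh', 'high', 'med', 'low'])

print(safety_map)  # {'high': 0, 'low': 1, 'med': 2}
print(buying_map)  # {'high': 0, 'low': 1, 'med': 2, 'vhigh': 3}
```

Because the order is alphabetical, the codes do not follow the natural ranking (low < med < high). This may matter for a distance-based model such as KNN.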


In [10]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=10)


In [11]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X_train,y_train)
y_predict = model.predict(X_test)

In [12]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_predict))
pd.crosstab(y_test,y_predict)

0.9267822736030829


outcome (actual) \ predicted,acc,good,unacc,vgood
acc,82,1,19,0
good,6,14,1,0
unacc,2,0,369,0
vgood,3,1,5,16


In [13]:
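# Per-class recall: correct predictions divided by the row total (actual count) of each class
# acc:   82 / (82 + 1 + 19 + 0) * 100 ≈ 80%
# good:  14 / (6 + 14 + 1 + 0) * 100 ≈ 67%
# unacc: 369 / (2 + 0 + 369 + 0) * 100 ≈ 99%
# vgood: 16 / (3 + 1 + 5 + 16) * 100 = 64%
# The majority class (unacc) is classified far more accurately than the minority classes.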
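The same per-class percentages can be computed directly from the confusion matrix instead of by hand. This is a small sketch in which the matrix is typed in from the crosstab output above (rows are actual classes, columns are predicted classes), and the variable names are illustrative.

```python
labels = ['acc', 'good', 'unacc', 'vgood']

# Confusion matrix before SMOTE, copied from the crosstab output
confusion = [
    [82, 1, 19, 0],   # actual acc
    [6, 14, 1, 0],    # actual good
    [2, 0, 369, 0],   # actual unacc
    [3, 1, 5, 16],    # actual vgood
]

# Recall of a class = diagonal entry / row total
recall = {
    label: row[i] / sum(row)
    for i, (label, row) in enumerate(zip(labels, confusion))
}

for label, value in recall.items():
    print(f"{label}: {value:.1%}")
```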

In [15]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [17]:
X_train_smote, y_train_smote = smote.fit_resample(X_train.astype('float'),y_train)

In [18]:
y_train_smote.shape


(3356,)

In [19]:
y_train.shape

(1209,)

In [20]:
from collections import Counter
print("Before SMOTE :" , Counter(y_train))
print("After SMOTE :" , Counter(y_train_smote))
# Counter tallies how many times each class label occurs in the given labels

Before SMOTE : Counter({'unacc': 839, 'acc': 282, 'good': 48, 'vgood': 40})
After SMOTE : Counter({'acc': 839, 'unacc': 839, 'vgood': 839, 'good': 839})
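The resampled size of 3356 can be checked with simple arithmetic. The counts after SMOTE show every class raised to the majority count of 839, and the sketch below assumes that behaviour. The variable names are illustrative.

```python
from collections import Counter

# Class counts in the training set before SMOTE (from the output above)
before = Counter({'unacc': 839, 'acc': 282, 'good': 48, 'vgood': 40})

target = max(before.values())  # size of the majority class
synthetic_needed = {label: target - n for label, n in before.items()}
total_after = target * len(before)

print(synthetic_needed)  # {'unacc': 0, 'acc': 557, 'good': 791, 'vgood': 799}
print(total_after)       # 3356
```

Most of the resulting `good` and `vgood` training samples are therefore synthetic: 791 and 799 out of 839 each.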


In [21]:
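# Refit the same KNN model on the SMOTE-balanced training set, then evaluate on the untouched test set
model.fit(X_train_smote,y_train_smote)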
y_predict = model.predict(X_test)
print(accuracy_score(y_test,y_predict))
pd.crosstab(y_test,y_predict)

0.930635838150289


outcome (actual) \ predicted,acc,good,unacc,vgood
acc,85,6,10,1
good,1,20,0,0
unacc,14,1,355,1
vgood,0,2,0,23
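Overall accuracy barely moves (about 0.927 to 0.931), so it helps to compare a metric that weights every class equally. The sketch below computes accuracy and the mean per-class recall from the two crosstab outputs. The matrices are typed in from those outputs, and the helper names are illustrative.

```python
# Rows are actual classes, columns are predicted classes: acc, good, unacc, vgood
before_smote = [
    [82, 1, 19, 0],
    [6, 14, 1, 0],
    [2, 0, 369, 0],
    [3, 1, 5, 16],
]
after_smote = [
    [85, 6, 10, 1],
    [1, 20, 0, 0],
    [14, 1, 355, 1],
    [0, 2, 0, 23],
]

def accuracy(matrix):
    """Fraction of all test samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def macro_recall(matrix):
    """Unweighted mean of the per-class recalls."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)

for name, matrix in [("before", before_smote), ("after", after_smote)]:
    print(f"{name}: accuracy={accuracy(matrix):.3f}, macro recall={macro_recall(matrix):.3f}")
```

On these numbers the mean per-class recall rises from roughly 0.78 to 0.92. Recall for `good` and `vgood` improves sharply, while recall for `unacc` drops from about 99% to about 96%.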
