                                       Machine Learning with Imbalanced Dataset

IMBALANCED DATASET - The dataset may contain uneven samples /instances , so that it makes the algorithm to predict with accuracy of 1.0 each time u run the model. For example, if u have simple dataset with 4 features and output(target) feature with 2 class, then total no. of instances/samples be 100. Now, out of 100, 80 instances belongs to category1 of the output(target) feature and only 20 instances contribute to the category2 of the output(target) feature. So, obviously, this makes bias in training and predicting the model. So, this dataset refers to Imbalanced dataset.


Importing Neccessary Packages and reading the csv file and printing the head of the csv file.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file = pd.read_csv("yeast.csv",sep=',')
print(file.head())

    Mcg   Gvh   Alm   Mit  Erl  Pox   Vac   Nuc     Class
0  0.58  0.61  0.47  0.13  0.5  0.0  0.48  0.22  negative
1  0.43  0.67  0.48  0.27  0.5  0.0  0.53  0.22  negative
2  0.64  0.62  0.49  0.15  0.5  0.0  0.53  0.22  negative
3  0.58  0.44  0.57  0.13  0.5  0.0  0.54  0.22  positive
4  0.42  0.44  0.48  0.54  0.5  0.0  0.48  0.22  negative


Computing the Basic Statistics(Descriptive) of the "Class" feature in the dataset. It shows that there are two unique values(positive and negative), with positive value counts upto 429 and negative 1055.
In [3]:


In [2]:
file['Class'].describe()

count         1484
unique           2
top       negative
freq          1055
Name: Class, dtype: object

Now, we just grouped the datset based on the 'class' feature to visualize the counts of positive and negative values.

In [3]:
f = file.groupby("Class")
f.count()

Unnamed: 0_level_0,Mcg,Gvh,Alm,Mit,Erl,Pox,Vac,Nuc
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
negative,1055,1055,1055,1055,1055,1055,1055,1055
positive,429,429,429,429,429,429,429,429


We are converting the 'class' feature from text to int using .map function.

In [4]:
file['Class'] = file['Class'].map({'positive': 1, 'negative': 0})
print(file['Class'].head())

0    0
1    0
2    0
3    1
4    0
Name: Class, dtype: int64


In [5]:
f = file.groupby("Class")
f.count()

Unnamed: 0_level_0,Mcg,Gvh,Alm,Mit,Erl,Pox,Vac,Nuc
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,1055,1055,1055,1055,1055,1055,1055,1055
1,429,429,429,429,429,429,429,429


Now using the sklearn library, we import train_test_test from cross validation and split the original dataset into training and test dataset(80,20).

In [6]:
from sklearn.cross_validation import train_test_split
train, test = train_test_split(file,test_size=0.2)
features_train=train[['Mcg','Gvh','Alm','Mit','Erl','Pox','Vac','Nuc']]
features_test = test[['Mcg','Gvh','Alm','Mit','Erl','Pox','Vac','Nuc']]
labels_train = train.Class
labels_test = test.Class
print(features_train.shape)
print(features_test.shape)
print(labels_train.value_counts())
print(labels_test.value_counts())

(1187, 8)
(297, 8)
0    850
1    337
Name: Class, dtype: int64
0    205
1     92
Name: Class, dtype: int64




We build the normal model using the original dataset and we check for accuracy_score and roc_score for various algorithms.

1.RandomForestClassifier
2.LogisticRegression
3.svm.SVC(kernel='linear')
4.svm.SVC(kernel='rbf')
5.svm.SVC(kernel='poly')
6.DecisionTreeClassifier

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
clf = RandomForestClassifier(n_estimators=100, random_state=7).fit(features_train,labels_train)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0
 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0
 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 0 0 0 0
 0]
Accuracy: 0.801346801347
ROC AUC Curve: 0.786935286935


In [8]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(features_train,labels_train)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
 0]
Accuracy: 0.740740740741
ROC AUC Curve: 0.741268493815


In [9]:
from sklearn import svm
clf = svm.SVC(kernel='linear',class_weight='balanced',probability=True)
training = clf.fit(features_train,labels_train)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1
 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0
 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0
 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0
 0 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1
 0]
Accuracy: 0.686868686869
ROC AUC Curve: 0.741268493815


In [10]:
from sklearn import svm
clf = svm.SVC(kernel='rbf',class_weight='balanced',probability=True)
training = clf.fit(features_train,labels_train)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1
 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0
 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0
 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0
 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 0 1 1 1 1 1 1 0 0 0 1
 0]
Accuracy: 0.676767676768
ROC AUC Curve: 0.741268493815


In [11]:
from sklearn import svm
clf = svm.SVC(kernel='poly',class_weight='balanced',probability=True)
training = clf.fit(features_train,labels_train)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 0 0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1
 1 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1
 0 0 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0
 0 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 0 1 0 1 1 1 0 0 0 0
 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 1 0 0 0 1
 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1
 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 0 0 0 1
 0]
Accuracy: 0.59595959596
ROC AUC Curve: 0.741268493815


In [12]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=2)
training = clf.fit(features_train,labels_train)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0
 0 1 1 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0
 0 1 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1
 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0
 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0
 0 1 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1
 0 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 1 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1
 0]
Accuracy: 0.676767676768
ROC AUC Curve: 0.741268493815


So we have seen the accuracy and roc values for the algorithms with the original dataset. Now its time for us to use the various sampling algorithms to handled the imbalanced dataset.

There two main ways to handle the Imbalanced datset:

1.Over Sampling 
2.Under Sampling

OVER SAMPLING: It is nothing but Sampling the minority class and making it equivalent to the majority class 
Ex:before sampling: Counter({1: 111, 0: 65}) 
after sampling: Counter({1: 111, 0: 111}) 
Note:The counts of 1's and 0's before and after sampling


OVER SAMPLING ALGORITHM: 

1.SMOTE - Synthetic Minority Over Sampling Technique A subset of data is taken from the minority class as an example and then new synthetic similar instances are created. These synthetic instances are then added to the original dataset. The new dataset is used as a sample to train the classification models


In [13]:
from collections import Counter
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE(kind='borderline1').fit_sample(features_train, labels_train)
print("before sampling:",format(Counter(labels_train)))
print("after sampling:",format(Counter(y_resampled)))
print(X_resampled.shape)

before sampling: Counter({0: 850, 1: 337})
after sampling: Counter({0: 850, 1: 850})
(1700, 8)


Now after over sampling the training data set, we again train and test our model with previously used ML algorithms to check for the accuracy and roc_score on the test data.

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
clf = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_resampled, y_resampled)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))
print(np.unique(prediction))

[0 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0
 0 1 1 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 1
 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
 0 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0
 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 1 1 0 1 0 0 0 0
 0]
Accuracy: 0.79797979798
ROC AUC Curve: 0.764354066986
[0 1]


In [15]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_resampled, y_resampled)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1 0 1 1 0 1
 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 1 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0
 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 0 1 0 0 0 0
 0 1 0 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 1 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0
 0 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1
 0]
Accuracy: 0.680134680135
ROC AUC Curve: 0.664197530864


In [16]:
from sklearn import svm
clf = svm.SVC(kernel='linear',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1
 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1
 1 0 1 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 1 0 0 0 1
 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 0 0 0
 0 1 0 1 0 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0
 0 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1
 0]
Accuracy: 0.643097643098
ROC AUC Curve: 0.664197530864


In [17]:
from sklearn import svm
clf = svm.SVC(kernel='rbf',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1
 1 1 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 1 0 1
 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 1
 1 0 1 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 0 0 0
 0 1 0 1 1 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 1 1 1 0 0 0 1
 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 1 0 1 1 0 1 0 1 0 0 0 0
 0 1 0 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0
 0 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1
 0]
Accuracy: 0.626262626263
ROC AUC Curve: 0.664197530864


In [18]:
from sklearn import svm
clf = svm.SVC(kernel='poly',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1
 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1
 0]
Accuracy: 0.424242424242
ROC AUC Curve: 0.664197530864


In [26]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=2)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 1 1 0 1 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 1 1 1 0 0
 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1
 1 1 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1
 1 0 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1
 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0
 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 1
 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0
 1 0 0 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
 1]
Accuracy: 0.515151515152
ROC AUC Curve: 0.664197530864


UNDER SAMPLING: It is nothing but Sampling the majority class and making it equivalent to the minority class 
Ex:before sampling: Counter({1: 111, 0: 65}) 
after sampling: Counter({0: 65, 1: 65})

UNDER SAMPLING ALGORITHM: 

1.RandomUnderSampler - Random Undersampling aims to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.

2.NearMiss - selects the majority class samples whose average distances to three closest minority class samples are the smallest.

In [22]:
from collections import Counter
from imblearn.under_sampling import NearMiss
rus = NearMiss(random_state=42)
X_resampled, y_resampled = rus.fit_sample(features_train, labels_train)
print("before sampling:",format(Counter(labels_train)))
print("after sampling:",format(Counter(y_resampled)))
print(X_resampled.shape)

before sampling: Counter({0: 850, 1: 337})
after sampling: Counter({0: 337, 1: 337})
(674, 8)


Now after under sampling the training data set, we again train and test our model with previously used ML algorithms to check for the accuracy and roc_score on the test data.

In [23]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_split=2)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 1 1 0 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0
 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1
 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 1
 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 1
 0 0 0 0 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0
 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1
 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
 1]
Accuracy: 0.538720538721
ROC AUC Curve: 0.664197530864


In [24]:
from sklearn import svm
clf = svm.SVC(kernel='poly',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0]
Accuracy: 0.69696969697
ROC AUC Curve: 0.664197530864


In [25]:
from sklearn import svm
clf = svm.SVC(kernel='rbf',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1
 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1
 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0
 0]
Accuracy: 0.737373737374
ROC AUC Curve: 0.664197530864


In [27]:
from sklearn import svm
clf = svm.SVC(kernel='linear',class_weight='balanced',probability=True)
training = clf.fit(X_resampled, y_resampled)
predictions = clf.predict(features_test)
print(predictions)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1
 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 0 1 0 0 1
 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1
 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0
 1]
Accuracy: 0.727272727273
ROC AUC Curve: 0.664197530864


In [28]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_resampled, y_resampled)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))

[1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0
 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 1
 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 1
 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 1 1 0 0 1
 0 0 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0
 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1
 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 1 1 0 1 1 1 1 0 1 1 0 0 0
 1]
Accuracy: 0.703703703704
ROC AUC Curve: 0.6702228996


In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
clf = RandomForestClassifier(n_estimators=100, random_state=7).fit(X_resampled, y_resampled)
prediction = clf.predict(features_test)
print(prediction)
print("Accuracy:",clf.score(features_test,labels_test))
print("ROC AUC Curve:",roc_auc_score(prediction,labels_test))
print(np.unique(prediction))

[0 1 1 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0
 0 1 1 1 0 0 0 1 0 0 0 1 1 1 0 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1
 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1
 1 0 1 0 1 0 1 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1
 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0
 1 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1
 0 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0
 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0
 1]
Accuracy: 0.558922558923
ROC AUC Curve: 0.60900201323
[0 1]
