# Project 1: Classification of results into Malignant/Benign

Final Project
Predict whether a mammogram mass is benign or malignant
We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data contains 961 instances of masses detected in mammograms, and contains the following attributes:

- BI-RADS assessment: 1 to 5 (ordinal)
- Age: patient's age in years (integer)
- Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
- Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
- Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
- Severity: benign=0 or malignant=1 (binominal)

BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features we will build our model with, and "severity" is the classification we will attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery arises from false positives in mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

Your assignment
Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

- Decision tree
- Random forest
- KNN
- Naive Bayes
- SVM
- Logistic Regression
- And, as a bonus challenge, a neural network using Keras.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

In [1]:
import os

os.chdir("C:/Users/Siddhartha/Documents/Data Science/MLCourse")

In [2]:
import pandas as pd

df = pd.read_csv (r'mammographic_masses.data.txt')
print (df)

     5  67  3 5.1 3.1  1
0    4  43  1   1   ?  1
1    5  58  4   5   3  1
2    4  28  1   1   3  0
3    5  74  1   5   ?  1
4    4  65  1   ?   3  0
..  ..  .. ..  ..  .. ..
955  4  47  2   1   3  0
956  4  56  4   5   3  1
957  4  64  4   5   3  0
958  5  66  4   5   3  1
959  4  62  3   3   3  0

[960 rows x 6 columns]
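
The printed frame has 960 rows, not 961, and its column labels (`5  67  3 5.1 3.1  1`) look like a data record, which suggests the file has no header row. The following is a minimal sketch of a loader that keeps that first record as data. It assumes a comma-separated file with `?` marking missing values; the inline sample only stands in for the real file, and the column names are the ones assigned later in this notebook.

```python
import io
import pandas as pd

# Small inline sample standing in for mammographic_masses.data.txt
# (assumption: comma-separated, no header row, '?' marks a missing value).
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
)

columns = ['BI_RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

# header=None keeps the first record as data; na_values='?' parses missing
# entries as NaN, so the columns come out numeric instead of as strings.
masses = pd.read_csv(sample, header=None, names=columns, na_values='?')

print(masses.shape)
print(masses.dtypes)
```

With numeric columns, `describe()` reports every attribute rather than only `Severity`.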


In [3]:
import numpy as np
df = df.replace('?', np.nan)                       #missing data to NaN

In [4]:
df

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,,1
4,4,65,1,,3,0
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


In [5]:
df.columns = ['BI_RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity']

In [6]:
df

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,4,43,1,1,,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,,1
4,4,65,1,,3,0
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


In [7]:
df1 = df.dropna()
df1

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
1,5,58,4,5,3,1
2,4,28,1,1,3,0
7,5,57,1,5,3,1
9,5,76,1,4,3,1
10,3,42,2,1,3,1
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


In [8]:
# Summary stats are not significantly different from df, so the NaN rows are dropped
df1[['BI_RADS','Age','Shape','Margin','Density','Severity']].describe()

Unnamed: 0,Severity
count,829.0
mean,0.484922
std,0.500074
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [9]:
df1

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
1,5,58,4,5,3,1
2,4,28,1,1,3,0
7,5,57,1,5,3,1
9,5,76,1,4,3,1
10,3,42,2,1,3,1
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


In [10]:
df2 = df1.drop(columns='BI_RADS')   # positional axis argument is not accepted by recent pandas

In [11]:
df2

Unnamed: 0,Age,Shape,Margin,Density,Severity
1,58,4,5,3,1
2,28,1,1,3,0
7,57,1,5,3,1
9,76,1,4,3,1
10,42,2,1,3,1
...,...,...,...,...,...
955,47,2,1,3,0
956,56,4,5,3,1
957,64,4,5,3,0
958,66,4,5,3,1


In [12]:
dfvariables= df1[['Age','Shape','Margin','Density']]

In [13]:
dfvariables

Unnamed: 0,Age,Shape,Margin,Density
1,58,4,5,3
2,28,1,1,3
7,57,1,5,3
9,76,1,4,3
10,42,2,1,3
...,...,...,...,...
955,47,2,1,3
956,56,4,5,3
957,64,4,5,3
958,66,4,5,3


In [14]:
dfclass=df1['Severity']

In [15]:
dfclass

1      1
2      0
7      1
9      1
10     1
      ..
955    0
956    1
957    0
958    1
959    0
Name: Severity, Length: 829, dtype: int64

In [16]:
ArrayVariables = dfvariables.to_numpy()

In [17]:
ArrayVariables

array([['58', '4', '5', '3'],
       ['28', '1', '1', '3'],
       ['57', '1', '5', '3'],
       ...,
       ['64', '4', '5', '3'],
       ['66', '4', '5', '3'],
       ['62', '3', '3', '3']], dtype=object)

In [18]:
ArrayClasses = dfclass.to_numpy()
ArrayClasses

array([1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,

In [19]:
feature_names = ['Age','Shape','Margin','Density']

In [20]:
classes = ['Severity']

In [21]:
feature_names

['Age', 'Shape', 'Margin', 'Density']

In [22]:
classes

['Severity']

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(ArrayVariables)

In [24]:
X_scaled                   # standardized data (zero mean, unit variance per feature)

array([[ 0.15215552,  0.98067959,  1.39867207,  0.24061945],
       [-1.89330809, -1.43412253, -1.15669795,  0.24061945],
       [ 0.0839734 , -1.43412253,  1.39867207,  0.24061945],
       ...,
       [ 0.56124824,  0.98067959,  1.39867207,  0.24061945],
       [ 0.69761248,  0.98067959,  1.39867207,  0.24061945],
       [ 0.424884  ,  0.17574555,  0.12098706,  0.24061945]])
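
As a quick check of what the scaler does, the sketch below reproduces `StandardScaler` by hand on a toy matrix: each column has its mean subtracted and is divided by its population standard deviation. The toy values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: columns play the role of Age and Shape
X = np.array([[58.0, 4.0], [28.0, 1.0], [57.0, 1.0], [76.0, 2.0]])

# z = (x - column mean) / column standard deviation (population form, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

# Both routes should agree up to floating-point error
print(np.allclose(manual, scaled))
```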

# Train-Test Split

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, ArrayClasses, test_size=0.25, random_state=42)

In [26]:
X_train

array([[ 0.0839734 ,  0.98067959,  0.75982956,  3.09024128],
       [ 0.69761248,  0.98067959,  1.39867207,  0.24061945],
       [ 1.4476158 ,  0.98067959,  1.39867207,  0.24061945],
       ...,
       [-1.34785113, -0.62918849, -1.15669795,  0.24061945],
       [-2.37058293, -1.43412253, -1.15669795,  0.24061945],
       [ 0.56124824, -1.43412253, -1.15669795,  0.24061945]])

# Decision Tree Classifier

In [27]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [28]:
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

In [29]:
y_pred = clf.predict(X_test)

In [30]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7932692307692307


## We get an accuracy of 0.79 on the hold-out set

from sklearn import tree
tree.plot_tree(clf);

In [31]:
#K-Fold Cross Validation

In [32]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets

In [33]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, ArrayVariables, ArrayClasses, cv=10)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 10 folds:
print(scores.mean())

[0.71428571 0.76190476 0.73493976 0.72289157 0.78313253 0.69879518
 0.73493976 0.76829268 0.79268293 0.69512195]
0.740698683234681


In [34]:
## Random Forest Classifier:

In [35]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

In [36]:
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7980769230769231


In [37]:
## The random forest scores slightly higher than the single decision tree (0.798 vs 0.793)

In [38]:
scores = cross_val_score(clf, ArrayVariables, ArrayClasses, cv=10)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 10 folds:
print(scores.mean())

[0.69047619 0.77380952 0.80722892 0.75903614 0.81927711 0.72289157
 0.74698795 0.7804878  0.82926829 0.69512195]
0.762458544981319


In [39]:
clf.predict(X_test)

array([0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [40]:
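# Note: clf was fit on standardized features; raw values would normally go through scaler.transform() first
clf.predict([[2,3,5,6]])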

array([1], dtype=int64)

# SUPPORT VECTOR MACHINES:

In [41]:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [42]:
clf = SVC(kernel='linear')

In [43]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

In [44]:
y_pred

array([0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [45]:
print(accuracy_score(y_test,y_pred))


0.8221153846153846


In [46]:
#KFOLDS ACCURACY:
scores = cross_val_score(clf, ArrayVariables, ArrayClasses, cv=10)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 10 folds:
print(scores.mean())

[0.71428571 0.77380952 0.86746988 0.80722892 0.84337349 0.71084337
 0.80722892 0.80487805 0.90243902 0.74390244]
0.7975459328603612
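
The cross-validation call above passes the unscaled `ArrayVariables`, while the hold-out model was fit on standardized features. One way to keep the two consistent is to put the scaler and the classifier in a single pipeline, so scaling is re-fit on the training part of each fold. This is a sketch only: the data is a synthetic stand-in generated with `make_classification`, not the mammogram data, and `svm_pipeline` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out fold never influences the scaling statistics.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear'))
cv_scores = cross_val_score(svm_pipeline, X, y, cv=10)

print(cv_scores)
print(cv_scores.mean())
```

In the notebook, the same pipeline could be given `ArrayVariables` and `ArrayClasses` directly.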


In [47]:
# SVM seems to work better than the decision tree

# K Nearest Neighbour

In [48]:
from sklearn.neighbors import NearestCentroid

In [49]:
clf = NearestCentroid()

In [50]:
clf.fit(X_train, y_train)

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [51]:
NearestCentroid()

NearestCentroid(metric='euclidean', shrink_threshold=None)

In [52]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [53]:
y_pred = classifier.predict(X_test)

In [54]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[89 26]
 [15 78]]
              precision    recall  f1-score   support

           0       0.86      0.77      0.81       115
           1       0.75      0.84      0.79        93

    accuracy                           0.80       208
   macro avg       0.80      0.81      0.80       208
weighted avg       0.81      0.80      0.80       208
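
Since the motivation here is reducing false positives, it helps to see how the report's figures follow from the confusion matrix. The sketch below recomputes them from the four counts printed above, treating class 1 (malignant) as the positive class.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[89, 26],
               [15, 78]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all cases classified correctly
precision = tp / (tp + fp)             # malignant predictions that were right
recall = tp / (tp + fn)                # malignant cases that were caught
false_positive_rate = fp / (fp + tn)   # benign cases flagged as malignant

print(round(accuracy, 4), round(precision, 2), round(recall, 2),
      round(false_positive_rate, 3))
```

On this split, 26 of the 115 benign masses are flagged as malignant, which is the quantity the introduction is concerned with.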



In [55]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)   # signature is (y_true, y_pred)

In [56]:
accuracy

0.8028846153846154

In [57]:
y_pred

array([0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [58]:
#KFOLDS ACCURACY:
scores = cross_val_score(classifier, ArrayVariables, ArrayClasses, cv=10)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 10 folds:
print(scores.mean())

[0.72619048 0.77380952 0.8313253  0.77108434 0.87951807 0.72289157
 0.78313253 0.75609756 0.82926829 0.76829268]
0.7841610343814281


In [59]:
a = range(1,50)
a[31]

32

In [60]:
# Try every k in a (1..49) and print the hold-out accuracy for each
for k in a:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)

0.7596153846153846
0.7307692307692307
0.7788461538461539
0.7644230769230769
0.8028846153846154
0.7980769230769231
0.8125
0.8076923076923077
0.7932692307692307
0.7980769230769231
0.8076923076923077
0.7740384615384616
0.8076923076923077
0.7884615384615384
0.7980769230769231
0.7836538461538461
0.8028846153846154
0.7980769230769231
0.8028846153846154
0.7884615384615384
0.7932692307692307
0.7932692307692307
0.7884615384615384
0.7788461538461539
0.7884615384615384
0.8028846153846154
0.7980769230769231
0.8028846153846154
0.8028846153846154
0.8028846153846154
0.8076923076923077
0.8076923076923077
0.8028846153846154
0.7980769230769231
0.8076923076923077
0.8076923076923077
0.8028846153846154
0.8173076923076923
0.8076923076923077
0.8076923076923077
0.8125
0.8125
0.8125
0.8125
0.8125
0.8125
0.8125
0.8076923076923077
0.8125
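
The loop above picks k by looking at a single hold-out split, so the differences between neighbouring values of k are mostly noise. A hedged alternative is to score each k with 10-fold cross-validation using `GridSearchCV`. The sketch below uses synthetic data as a stand-in, and the step names `scale` and `knn` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

pipeline = Pipeline([('scale', StandardScaler()),
                     ('knn', KNeighborsClassifier())])

# Try k = 1..49 and score each value with 10-fold cross-validation
param_grid = {'knn__n_neighbors': list(range(1, 50))}
search = GridSearchCV(pipeline, param_grid, cv=10)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```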


# Naive Bayes

In [61]:
from sklearn.naive_bayes import GaussianNB

In [62]:
gnb = GaussianNB()

In [63]:
y_pred = gnb.fit(X_train, y_train).predict(X_test)

In [64]:
y_pred

array([0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [65]:
accuracy = accuracy_score(y_test, y_pred)   # signature is (y_true, y_pred)

In [66]:
accuracy

0.8076923076923077

# Multinomial Naive Bayes

In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [68]:
MultiNB = MultinomialNB()

In [69]:
# MinMax scaler: MultinomialNB needs non-negative features, so rescale to [0, 1]

In [70]:
from sklearn.preprocessing import MinMaxScaler

# create scaler
scaler = MinMaxScaler()

# fit and transform in one step
Xminmax= scaler.fit_transform(ArrayVariables)

In [71]:
Xminmax

array([[0.51282051, 1.        , 1.        , 0.66666667],
       [0.12820513, 0.        , 0.        , 0.66666667],
       [0.5       , 0.        , 1.        , 0.66666667],
       ...,
       [0.58974359, 1.        , 1.        , 0.66666667],
       [0.61538462, 1.        , 1.        , 0.66666667],
       [0.56410256, 0.66666667, 0.5       , 0.66666667]])

In [72]:
#Train Test Split Again:
from sklearn.model_selection import train_test_split
Xmm_train, Xmm_test, y_train, y_test = train_test_split(Xminmax, ArrayClasses, test_size=0.25, random_state=42)

In [73]:
MultiNB.fit(Xmm_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [74]:
print(MultiNB)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


In [75]:
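# Caution: MultiNB was fit on Xmm_train (min-max scaled); X_test is standardized, and Xmm_test is the matching input
y_pred = MultiNB.predict(X_test)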

In [76]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.75
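
`MultinomialNB` expects non-negative inputs, which is why the features are min-max scaled before fitting. The model above is fit on `Xmm_train`, so the test features that match it are `Xmm_test` rather than the standardized `X_test`. The sketch below shows the fit and predict steps on identically scaled data. It uses synthetic data as a stand-in; the variable names are illustrative, and clipping the test features to [0, 1] is an added assumption to guard against values outside the training range.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the four features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
mm_scaler = MinMaxScaler().fit(X_tr)
X_tr_mm = mm_scaler.transform(X_tr)
# Test values can fall slightly outside the training range, so clip to [0, 1]
X_te_mm = np.clip(mm_scaler.transform(X_te), 0.0, 1.0)

nb = MultinomialNB().fit(X_tr_mm, y_tr)
nb_accuracy = accuracy_score(y_te, nb.predict(X_te_mm))

print(nb_accuracy)
```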


# Complex SVM:

In [77]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='rbf')
svclassifier.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [78]:
y_pred = svclassifier.predict(X_test)

In [79]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8125


In [80]:
#sigmoid kernel:

svclassifier = SVC(kernel='sigmoid')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7451923076923077




In [81]:
#poly kernel:

svclassifier = SVC(kernel='poly')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8269230769230769




## The best-performing SVC kernel on this hold-out split is the polynomial kernel (0.827)

# Logistic Regression

In [82]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [83]:
model = LogisticRegression(solver='liblinear', random_state=0)

In [84]:
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [85]:
y_pred = model.predict(X_test)

In [86]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8125


# Artificial Neural Networks:

In [87]:
## We will use the Keras library for this:

In [88]:
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Using TensorFlow backend.


In [92]:
X_train[4]

array([-0.46148357, -1.43412253, -1.15669795,  0.24061945])

In [91]:
y_train[4]

0

In [135]:
model = Sequential()
model.add(Dense(64,input_dim=4,kernel_initializer='normal' ,activation='relu'))
model.add(Dense(32,kernel_initializer='normal' ,activation='relu'))
model.add(Dense(8,kernel_initializer='normal' ,activation='tanh'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

In [136]:
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_23 (Dense)             (None, 64)                320       
_________________________________________________________________
dense_24 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_25 (Dense)             (None, 8)                 264       
_________________________________________________________________
dense_26 (Dense)             (None, 1)                 9         
Total params: 2,673
Trainable params: 2,673
Non-trainable params: 0
_________________________________________________________________


In [137]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [140]:
history = model.fit(X_train, y_train,
                    batch_size=100,
                    epochs=50,
                    verbose=0,
                    validation_data=(X_test, y_test))  # monitor on held-out data, not the training set

In [141]:
score = model.evaluate(X_test, y_test, verbose=0)
print(score)

[0.43810944373791033, 0.8269230723381042]


## SVC with the "poly" kernel and the neural network have given the best hold-out accuracy so far, about 82.69%.
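
The assignment asks for a comparison by 10-fold cross-validation, whereas several of the figures above come from a single 75/25 split. The sketch below shows one way to put the scikit-learn models on the same footing. It runs on synthetic stand-in data, so its numbers say nothing about the mammogram results, and the `models` dictionary and its labels are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features (assumption, not the real data)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

models = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Gaussian NB': GaussianNB(),
    'SVM (poly)': SVC(kernel='poly'),
    'Logistic regression': LogisticRegression(solver='liblinear'),
}

# Score every model with the same 10-fold procedure, scaling inside each fold
results = {}
for name, estimator in models.items():
    pipe = make_pipeline(StandardScaler(), estimator)
    results[name] = cross_val_score(pipe, X, y, cv=10).mean()

# Print the models from best to worst mean accuracy
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```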