# **Problem**: Based on the features of a breast tumor, classify the tumor as malignant or benign.

Problems of this kind are usually complicated and require extensive preliminary and exploratory analysis as well as intensive data cleaning. Our aim here, however, is to explore and compare the accuracy of multiple ML classification techniques, so we prioritise the comparison over EDA and data cleaning. To do this, I chose a much simpler, ready-to-use dataset from the UCI Machine Learning Repository.

# **1.Importing Libraries**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# **2.Importing Dataset**

In [2]:
dataset = pd.read_csv('Data.csv')

# **3.Preliminary Analysis and Missing Value Detection & Rectification**

In [3]:
dataset.shape

(683, 11)

In [4]:
dataset.head()

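| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |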


In [5]:
dataset.isna().sum()

Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64
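A count of zero from `isna()` only rules out values that pandas already recognises as missing. The raw UCI file marks missing `Bare Nuclei` entries with `?`, so a copy that has not been pre-cleaned would need that marker declared on load. A minimal sketch, assuming a hypothetical three-row excerpt in place of `Data.csv`:

```python
import io
import pandas as pd

# Hypothetical excerpt: the second row uses "?" as a missing-value marker.
raw = io.StringIO(
    "Clump Thickness,Bare Nuclei,Class\n"
    "5,1,2\n"
    "8,?,4\n"
    "3,2,2\n"
)

# Without na_values the "?" would be read as a string and isna() would miss it.
sample = pd.read_csv(raw, na_values="?")
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Dropping incomplete rows is the simplest fix; imputation is the alternative.
cleaned = sample.dropna()
print(cleaned.shape)
```

The ready-to-use file loaded here already reports no missing values, so this step is only relevant if you start from the raw data.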

In [6]:
df = dataset

In [7]:
df.columns

Index(['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
       'Uniformity of Cell Shape', 'Marginal Adhesion',
       'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
       'Normal Nucleoli', 'Mitoses', 'Class'],
      dtype='object')

# **4.Splitting of the Dataset**

Let's drop the sample code number, as it is only an identifier and is not needed by the model.

In [8]:
x = df.iloc[:, 1:-1]
y = df.iloc[:, -1]

The dataset is now split into independent variables (`x`, the features) and the dependent variable (`y`, the class label).

In [9]:
x.head()

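| | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 |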


In [10]:
y.head()

0    2
1    2
2    2
3    2
4    2
Name: Class, dtype: int64
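The positional slicing in cell In [8] relies on the identifier being the first column and the label being the last. Selecting by column name is a possible alternative that keeps working if the column order changes. A small sketch on a hypothetical two-row frame (`df_demo`, `x_demo` and `y_demo` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature frame with a subset of the dataset's columns.
df_demo = pd.DataFrame({
    "Sample code number": [1000025, 1002945],
    "Clump Thickness": [5, 5],
    "Mitoses": [1, 1],
    "Class": [2, 2],
})

# Drop the identifier and the label by name to get the feature matrix.
x_demo = df_demo.drop(columns=["Sample code number", "Class"])
# The label column is selected by name as well.
y_demo = df_demo["Class"]

print(list(x_demo.columns))
```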

Splitting the dataset into a training set and a test set

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
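The split above is purely random. When the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions roughly equal in both parts. A sketch on hypothetical labels (65 benign, 35 malignant), not on the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 samples of class 2 and 35 of class 4.
y_demo = np.array([2] * 65 + [4] * 35)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# With stratify, the share of class 4 in the test set stays close to 35%.
test_share = (y_te == 4).mean()
print(len(y_te), test_share)
```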

# **5.Feature Scaling**

In [12]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
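Note that the scaler is fitted on the training set only and then reused on the test set, so the test data is standardised with the training mean and standard deviation. The following sketch, on hypothetical one-column data, checks that `transform` matches the manual formula `(value - training mean) / training std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0], [10.0]])
test = np.array([[5.0]])

# Statistics are learned from the training data only.
scaler = StandardScaler().fit(train)
scaled = scaler.transform(test)

# Manual equivalent; StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(scaled, manual)
```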

In [42]:
x_train[:,:3]

array([[ 0.91903747,  0.9407658 ,  2.30881719],
       [ 1.27578287, -0.04290763,  1.63138773],
       [ 1.27578287,  2.25233038,  2.30881719],
       ...,
       [-1.22143494, -0.69868992, -0.73961536],
       [-0.50794414, -0.69868992, -0.73961536],
       [ 1.98927367,  1.92443923,  1.29267301]])

In [44]:
y_train[:10]

556    4
66     4
571    4
299    2
355    2
627    2
247    4
625    2
529    2
610    4
Name: Class, dtype: int64

# **6.Training the Model with Various ML Classification Techniques**

# a. Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
classifier1 = LogisticRegression(random_state = 0)
classifier1.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred1 = classifier1.predict(x_test)
cm1 = confusion_matrix(y_test, y_pred1)
print(cm1)
a1 = accuracy_score(y_test, y_pred1)
print(a1)

[[103   4]
 [  5  59]]
0.9473684210526315
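In scikit-learn's confusion matrix the rows are the true classes and the columns are the predicted classes, ordered by label. In the UCI data, class 2 is benign and class 4 is malignant. Treating malignant as the positive class (my assumption for this sketch), accuracy, sensitivity and specificity can be read off the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above for Logistic Regression.
# Rows: true class (2 = benign, 4 = malignant); columns: predicted class.
cm = np.array([[103, 4],
               [5, 59]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)  # share of malignant tumors that were detected
specificity = tn / (tn + fp)  # share of benign tumors that were cleared
print(accuracy, sensitivity, specificity)
```

For a medical problem the 5 false negatives (malignant tumors predicted as benign) usually matter more than the 4 false positives, which accuracy alone does not show.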


# b. K-Nearest Neighbours (K-NN)

In [17]:
from sklearn.neighbors import KNeighborsClassifier
classifier2 = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier2.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [18]:
y_pred2 = classifier2.predict(x_test)
cm2 = confusion_matrix(y_test, y_pred2)
print(cm2)
a2 = accuracy_score(y_test, y_pred2)
print(a2)

[[103   4]
 [  5  59]]
0.9473684210526315
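The value `n_neighbors = 5` is simply the library default. One common way to pick it is cross-validation over a few candidates. The sketch below is only an illustration: it uses scikit-learn's bundled `load_breast_cancer` data (the Diagnostic variant, not the `Data.csv` used in this notebook) as a stand-in, and the candidate list is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn (assumption, not Data.csv).
X, y = load_breast_cancer(return_X_y=True)

scores = {}
for k in (1, 3, 5, 7, 9, 11):
    # The pipeline refits the scaler inside every fold to avoid leakage.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Candidate with the highest mean cross-validated accuracy.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```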


# c. Naive Bayes

In [19]:
from sklearn.naive_bayes import GaussianNB
classifier3 = GaussianNB()
classifier3.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [21]:
y_pred3 = classifier3.predict(x_test)
cm3 = confusion_matrix(y_test, y_pred3)
print(cm3)
a3 = accuracy_score(y_test, y_pred3)
print(a3)

[[99  8]
 [ 2 62]]
0.9415204678362573


# d. Support Vector Machine (SVM)

In [22]:
from sklearn.svm import SVC
classifier4 = SVC(kernel = 'linear', random_state = 0)
classifier4.fit(x_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

In [24]:
y_pred4 = classifier4.predict(x_test)
cm4 = confusion_matrix(y_test, y_pred4)
print(cm4)
a4 = accuracy_score(y_test, y_pred4)
print(a4)

[[102   5]
 [  5  59]]
0.9415204678362573


# e. Kernel Support Vector Machine (KSVM)

In [26]:
from sklearn.svm import SVC
classifier5 = SVC(kernel = 'rbf', random_state = 0)
classifier5.fit(x_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

In [27]:
y_pred5 = classifier5.predict(x_test)
cm5 = confusion_matrix(y_test, y_pred5)
print(cm5)
a5 = accuracy_score(y_test, y_pred5)
print(a5)

[[101   6]
 [  3  61]]
0.9473684210526315
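The RBF kernel was used here with the default `C` and `gamma`. These two parameters control how flexible the decision boundary is, and they are usually tuned with a grid search. A hedged sketch, again on the bundled `load_breast_cancer` stand-in data and with an arbitrary small grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset bundled with scikit-learn (assumption, not Data.csv).
X, y = load_breast_cancer(return_X_y=True)

# make_pipeline names each step after its lowercased class, hence "svc__".
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```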


# f. Decision Tree Classification

In [28]:
from sklearn.tree import DecisionTreeClassifier
classifier6 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier6.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [30]:
y_pred6 = classifier6.predict(x_test)
cm6 = confusion_matrix(y_test, y_pred6)
print(cm6)
a6 = accuracy_score(y_test, y_pred6)
print(a6)

[[104   3]
 [  4  60]]
0.9590643274853801
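A tree with `max_depth=None` keeps splitting until its leaves are pure, so it can memorise the training set. Comparing training and test accuracy for an unrestricted and a shallow tree is a quick way to check for this. The sketch uses the bundled `load_breast_cancer` stand-in data, and the depth of 3 is an arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset bundled with scikit-learn (assumption, not Data.csv).
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each depth setting.
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

print(results)
```

A large gap between the two numbers for the unrestricted tree would point to overfitting.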


# g. Random Forest Classification

In [31]:
from sklearn.ensemble import RandomForestClassifier
classifier7 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier7.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [32]:
y_pred7 = classifier7.predict(x_test)
cm7 = confusion_matrix(y_test, y_pred7)
print(cm7)
a7 = accuracy_score(y_test, y_pred7)
print(a7)

[[104   3]
 [  5  59]]
0.9532163742690059
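A fitted random forest also exposes `feature_importances_`, which gives a rough idea of which features drive the predictions. The sketch below uses the bundled `load_breast_cancer` stand-in data and 100 trees (both are my assumptions, the notebook uses `Data.csv` and 10 trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset bundled with scikit-learn (assumption, not Data.csv).
data = load_breast_cancer()

forest = RandomForestClassifier(n_estimators=100, criterion="entropy", random_state=0)
forest.fit(data.data, data.target)

# Pair every feature name with its importance and sort in descending order.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```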


# **7.Aggregation of Results**

In [40]:
print('Classification Technique' ,' | ', 'Accuracy Score',' | ','Confusion Matrix')
print('Logistic Regression', ' :               ',a1)
print(cm1)
print('K-Nearest Neighbours(K-NN)',' :         ',a2)
print(cm2)
print('Naive Bayes',' :                        ',a3)
print(cm3)
print('Support Vector Machine (SVM)',' :        ',a4)
print(cm4)
print('Kernel Support Vector Machine (KSVM)',' : ',a5)
print(cm5)
print('Decision Tree Classification',' :          ',a6)
print(cm6)
print('Random Forest Classification',' :         ',a7)
print(cm7)


Classification Technique  |  Accuracy Score  |  Confusion Matrix
Logistic Regression  :                0.9473684210526315
[[103   4]
 [  5  59]]
K-Nearest Neighbours(K-NN)  :          0.9473684210526315
[[103   4]
 [  5  59]]
Naive Bayes  :                         0.9415204678362573
[[99  8]
 [ 2 62]]
Support Vector Machine (SVM)  :         0.9415204678362573
[[102   5]
 [  5  59]]
Kernel Support Vector Machine (KSVM)  :  0.9473684210526315
[[101   6]
 [  3  61]]
Decision Tree Classification  :           0.9590643274853801
[[104   3]
 [  4  60]]
Random Forest Classification  :          0.9532163742690059
[[104   3]
 [  5  59]]
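The printed results are easier to compare as a single table. The sketch below rebuilds accuracy from the confusion matrices shown above and adds the number of false negatives (malignant tumors predicted as benign), taking class 4 (malignant) as the positive class. The column names are my own choice.

```python
import pandas as pd

# Confusion matrices copied from the output above.
# Rows: true class (2, 4); columns: predicted class (2, 4).
matrices = {
    "Logistic Regression": [[103, 4], [5, 59]],
    "K-NN": [[103, 4], [5, 59]],
    "Naive Bayes": [[99, 8], [2, 62]],
    "SVM (linear)": [[102, 5], [5, 59]],
    "Kernel SVM (rbf)": [[101, 6], [3, 61]],
    "Decision Tree": [[104, 3], [4, 60]],
    "Random Forest": [[104, 3], [5, 59]],
}

rows = []
for name, m in matrices.items():
    (tn, fp), (fn, tp) = m
    rows.append({
        "model": name,
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "false_negatives": fn,
        "malignant_recall": tp / (tp + fn),
    })

summary = pd.DataFrame(rows).set_index("model").sort_values("accuracy", ascending=False)
print(summary.round(4))
```

Seen this way, Naive Bayes has the joint-lowest accuracy but misses the fewest malignant tumors (2), which may matter more than accuracy in a screening setting.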


**From the results above, Decision Tree Classification has the highest accuracy score (0.959). However, this depends on the nature of the dataset, and the spread here is small: on the 171-sample test set, the best and worst models differ by only three correct predictions. In general, KSVM and Random Forest Classification tend to give high accuracy. Of course, each technique has its own pros and cons. I personally choose SVM and KSVM for most problems, since with a sensibly chosen regularisation parameter they are fairly resistant to overfitting and still give high accuracy.**

**If your problem is linear (the classes can be separated reasonably well by a linear boundary), you should go for Logistic Regression or linear SVM.
If your problem is non-linear, you should go for K-NN, Naive Bayes, Kernel SVM, Decision Tree or Random Forest.**
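Since a single train/test split can rank close models differently from one random seed to the next, k-fold cross-validation gives a steadier comparison. The sketch below runs the same seven classifiers through 10-fold cross-validation. It is only an illustration on scikit-learn's bundled `load_breast_cancer` data (an assumption, not the `Data.csv` used above), so the numbers will not match this notebook's results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset bundled with scikit-learn (assumption, not Data.csv).
X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Kernel SVM (rbf)": SVC(kernel="rbf", random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0),
}

cv_scores = {}
for name, clf in models.items():
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    pipe = make_pipeline(StandardScaler(), clf)
    fold_scores = cross_val_score(pipe, X, y, cv=10)
    cv_scores[name] = (fold_scores.mean(), fold_scores.std())

for name, (mean, std) in cv_scores.items():
    print(f"{name:22s} {mean:.3f} +/- {std:.3f}")
```

If the standard deviations overlap, the differences between the models are probably not meaningful for that dataset.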