Last Updated: 2/1/2021

UCSD Class Project: Cogs 188

# Supervised Comparison

Description: Below is a comparison of different Supervised Learning Models on a single data set. First one made entirely by myself and the others are imported from sklearn. Due to the binary classification yet multifeature dataset, I thought it would be interesting to explore the different results given different methodologies.

## Breast Cancer Prediction

Breast cancer is one of the most common cancers among women worldwide, representing the majority of new cancer cases and cancer-related deaths according to global statistics, making it a significant public health problem in today’s society.

The early diagnosis of Breast Cancer can improve the prognosis and chance of survival significantly, as it can promote timely clinical treatment to patients. Further accurate classification of benign tumors can prevent patients undergoing unnecessary treatments. Thus, the correct diagnosis of BC and classification of patients into malignant or benign groups is the subject of much research. Because of its unique advantages in critical features detection from complex Brest Cancer datasets, Machine Learning is widely recognized as the methodology of choice in Breast Cancer pattern classification and forecast modelling.

Link to data set: http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29

Link to sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.datasets
from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from tqdm import tqdm

## Load the breast cancer data

In [2]:
# Load Data Set
breast_cancer = sklearn.datasets.load_breast_cancer()

# Turn Data into Dataframe
data = pd.DataFrame(breast_cancer.data, columns = breast_cancer.feature_names)
data['class'] = breast_cancer.target

# Seperate the Data
X = data.drop('class', axis = 1)
Y = data['class']

## Some Visualizations of the Data

In [3]:
data.head(5)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,class
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [4]:
data.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,class
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [5]:
#sns.pairplot(data,hue = 'class', palette= 'coolwarm', vars = ['radius_mean', 'texture_mean', 'perimeter_mean','area_mean','smoothness_mean'])

## Split Training and Testing Data

In [6]:
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder

labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

In [7]:
# Data Splitting

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, stratify= Y, random_state = 42)

In [8]:
# Valu Extraction

X_train = X_train.values
X_test = X_test.values

In [9]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [10]:
# Dictionary to keep all results in
results = {}

## Method 1: Perceptron Class Built

In [11]:
class Perceptron:
    def __init__ (self):
        self.w = None
        self.b = None
    
    def model(self, x):
        # returns the prediction for a single example x
        pred = 0
        decision = np.dot(self.w, x)
        
        if decision >= self.b:
            pred = 1
            
        return pred 
  
    def predict(self, X): 
        # returns the predictions for multiple examples X
        Y = []
        for x in X:
            res = self.model(x)
            Y.append(res)
        return np.array(Y)
    
    def fit(self, X, Y, epochs = 500, learning_rate = .01):
        accuracy = {}
        wt_matrix = []
        max_accuracy = 0
        self.w = np.ones(X.shape[1])
        self.b = 0
        
        for i in tqdm(range(epochs)):
            for x, y in zip(X, Y):
                y_pred = self.model(x)
                
                # Learning
                if y == 1 and y_pred == 0:
                    self.w = self.w + learning_rate * x 
                    self.b = self.b - learning_rate * 1 
                elif y == 0 and y_pred == 1:
                    self.w = self.w - learning_rate * x 
                    self.b = self.b + learning_rate * 1 
          
                wt_matrix.append(self.w)    
            
                accuracy[i] = accuracy_score(self.predict(X), Y)
        
            if (accuracy[i] > max_accuracy):
                max_accuracy = accuracy[i]
                chkptw = self.w
                chkptb = self.b
        
        self.w = chkptw 
        self.b = chkptb 
        print(max_accuracy)

## Initialize and train Perceptron (Method 1)

In [12]:
# Initialize Perceptrion
perceptron = Perceptron()

# Learning
perceptron.fit(X_train, Y_train, 1000, 0.1)

# Predict X_test Results
Y_pred_test = perceptron.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['Perceptron_M1'] = [acc,cm]

100%|██████████| 1000/1000 [08:41<00:00,  1.92it/s]

0.998046875
0.9122807017543859





## Method 2: SKLearn's Perceptron Import 

In [13]:
from sklearn.linear_model import Perceptron

# Train Perceptron
p = Perceptron(random_state = 42) # default epoch=1000, # default learning rate .1
p.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = p.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['Perceptron_M2'] = [acc,cm]

0.9122807017543859


## Method 3: SKLearn's Logistic Regression 

In [14]:
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression Model
log = LogisticRegression(random_state = 42) 
log.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = log.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['LogRegression'] = [acc,cm]

0.9649122807017544


## Method 4: SKLearn's K Nearest Neighbors 

In [15]:
from sklearn.neighbors import KNeighborsClassifier

# Train K Nearest Neighbor Model
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = knn.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['KNN'] = [acc,cm]

0.9824561403508771


## Method 5: SKLearn's Support Vector Machine

In [16]:
from sklearn.svm import SVC

# Train K Nearest Neighbor Model
svm = SVC(kernel = 'linear', random_state = 42)
svm.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = svm.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['SVM'] = [acc,cm]

0.9473684210526315


## Method 6: Kernel SVM

In [17]:
from sklearn.svm import SVC

# Train K Nearest Neighbor Model
svm_k = SVC(kernel = 'rbf', random_state = 42)
svm_k.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = svm_k.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['Kernel_SVM'] = [acc,cm]

0.9649122807017544


## Method 7: SKLearn's Naive Bayes 

In [18]:
from sklearn.naive_bayes import GaussianNB

# Train K Nearest Neighbor Model
gaus = GaussianNB()
gaus.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = gaus.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['NaiveBayes'] = [acc,cm]

0.9473684210526315


## Method 8: SKLearn's Decision Tree 

In [19]:
from sklearn.tree import DecisionTreeClassifier

# Train K Nearest Neighbor Model
dec = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
dec.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = dec.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['DecTree'] = [acc,cm]

0.8771929824561403


## Method 9: SKLearn's Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier

# Train K Nearest Neighbor Model
rm = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
rm.fit(X_train, Y_train)

# Print Accuracy
Y_pred_test = rm.predict(X_test)
acc = accuracy_score(Y_pred_test, Y_test)
print(acc)

# Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred_test)

# Add to results
results['RandomForest'] = [acc,cm]

0.9298245614035088


# Results

In [29]:
print(results)

{'Perceptron_M1': [0.9122807017543859, array([[19,  2],
       [ 3, 33]], dtype=int64)], 'Perceptron_M2': [0.9122807017543859, array([[20,  1],
       [ 4, 32]], dtype=int64)], 'LogRegression': [0.9649122807017544, array([[20,  1],
       [ 1, 35]], dtype=int64)], 'KNN': [0.9824561403508771, array([[20,  1],
       [ 0, 36]], dtype=int64)], 'SVM': [0.9473684210526315, array([[20,  1],
       [ 2, 34]], dtype=int64)], 'Kernel_SVM': [0.9649122807017544, array([[20,  1],
       [ 1, 35]], dtype=int64)], 'NaiveBayes': [0.9473684210526315, array([[20,  1],
       [ 2, 34]], dtype=int64)], 'DecTree': [0.8771929824561403, array([[20,  1],
       [ 6, 30]], dtype=int64)], 'RandomForest': [0.9298245614035088, array([[20,  1],
       [ 3, 33]], dtype=int64)]}


In [38]:
for key, val in results.items():
    if len(key) < 4:
        print(key,'\t\tacc:', val[0]*100)
        print(val[1])
        print()
    else:
        print(key,'\tacc:', val[0]*100)
        print(val[1])
        print()

Perceptron_M1 	acc: 91.22807017543859
[[19  2]
 [ 3 33]]

Perceptron_M2 	acc: 91.22807017543859
[[20  1]
 [ 4 32]]

LogRegression 	acc: 96.49122807017544
[[20  1]
 [ 1 35]]

KNN 		acc: 98.24561403508771
[[20  1]
 [ 0 36]]

SVM 		acc: 94.73684210526315
[[20  1]
 [ 2 34]]

Kernel_SVM 	acc: 96.49122807017544
[[20  1]
 [ 1 35]]

NaiveBayes 	acc: 94.73684210526315
[[20  1]
 [ 2 34]]

DecTree 	acc: 87.71929824561403
[[20  1]
 [ 6 30]]

RandomForest 	acc: 92.98245614035088
[[20  1]
 [ 3 33]]



In [39]:
# Done