The goal of this model is to classify a group of cells as either malignant or benign, based on
historical data of previous cells, their features, and their diagnosed classification.

In [None]:
!pip install scikit-learn==0.23.1

In [None]:
import pandas as pd
import numpy as np
import pylab as pl
from sklearn import preprocessing
import scipy.optimize as opt
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt

<h2 id="load_dataset">Load the Cancer data</h2>
The dataset is publicly available from the UCI Machine Learning Repository ([Asuncion and Newman, 2007](http://mlearn.ics.uci.edu/MLRepository.html)). It consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

|Field name|Description|
|--- |--- |
|ID|Patient identifier|
|Clump|Clump thickness|
|UnifSize|Uniformity of cell size|
|UnifShape|Uniformity of cell shape|
|MargAdh|Marginal adhesion|
|SingEpiSize|Single epithelial cell size|
|BareNuc|Bare nuclei|
|BlandChrom|Bland chromatin|
|NormNucl|Normal nucleoli|
|Mit|Mitoses|
|Class|Benign or malignant|


In [None]:
#Downloading the dataset
!wget -O cell_samples.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv

In [None]:
cell_df = pd.read_csv('cell_samples.csv')
cell_df.head()

The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.

The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).

Looking at the distribution of the classes based on Clump thickness and Uniformity of cell size:

In [None]:
ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Blue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()

<h2 id='Load_dataset'> Data preprocessing and selection </h2>

In [None]:
cell_df.dtypes

The BareNuc column contains some values that are not numeric, so they must be converted to numerical values or their corresponding rows must be removed.

In [None]:
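# Keep only rows whose BareNuc value parses as a number; .copy() avoids
# pandas' SettingWithCopyWarning on the column assignment below
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()].copy()
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes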
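
The coercion step can be illustrated on a toy column. This is a minimal sketch: the `'?'` placeholder and the names `raw`, `numeric` and `clean` are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the BareNuc column; the '?' entry is an assumed
# example of a non-numeric value
raw = pd.DataFrame({'BareNuc': ['1', '10', '?', '3']})

# errors='coerce' turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw['BareNuc'], errors='coerce')

# Drop the rows that failed to parse, then cast the rest to int
clean = raw[numeric.notnull()].copy()
clean['BareNuc'] = clean['BareNuc'].astype('int')

print(numeric)
print(clean.dtypes)
```

Only the row holding the non-numeric entry is dropped; the remaining rows keep their original index labels.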

Now the data must be split into the feature set and the target set.

In [None]:
feature_set = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
x = np.asarray(feature_set)
x[0:5]

In [None]:
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class']) 
y[0:5]

<h2> Train/Test dataset </h2>


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 4)
print('Train Set:', x_train.shape, y_train.shape)
print('Test Set:', x_test.shape, y_test.shape)
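
As a rough illustration of what `test_size` and `random_state` do, the sketch below splits a toy array of 10 samples. The arrays `x_toy` and `y_toy` are made up for this example and are unrelated to the cell data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, labels alternating 2 and 4
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([2, 4] * 5)

# test_size=0.2 holds out 20% of the samples (2 of 10) for testing;
# a fixed random_state makes the shuffle reproducible
x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, random_state=4)
print('Train Set:', x_tr.shape, y_tr.shape)
print('Test Set:', x_te.shape, y_te.shape)

# Repeating the call with the same random_state yields the same split
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x_toy, y_toy, test_size=0.2, random_state=4)
print('Same split:', np.array_equal(x_te, x_te2))
```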

<h2>Modelling a Support Vector Machine with Scikit-learn </h2> 


In [None]:
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(x_train, y_train)

In [None]:
y_predict = clf.predict(x_test)
y_predict

<h2> Evaluation </h2>


In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predict, labels=[2,4])
np.set_printoptions(precision=2)

print (classification_report(y_test, y_predict))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False,  title='Confusion matrix')

As seen from the confusion matrix figure above, out of 90 benign data points, the SVM model correctly predicted 85 as benign, and out of 47 malignant data points, the model correctly predicted all 47 as malignant.
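
The per-class figures in the classification report follow directly from these counts. The sketch below recomputes them by hand, assuming the counts quoted above (85 of 90 benign and 47 of 47 malignant predicted correctly); the variable names are illustrative only.

```python
# Counts read off the confusion matrix described above
benign_as_benign = 85        # benign samples predicted benign
benign_as_malignant = 5      # benign samples predicted malignant (90 - 85)
malignant_as_benign = 0      # malignant samples predicted benign
malignant_as_malignant = 47  # malignant samples predicted malignant

total = (benign_as_benign + benign_as_malignant
         + malignant_as_benign + malignant_as_malignant)

# Recall: share of each true class that was predicted correctly
recall_benign = benign_as_benign / (benign_as_benign + benign_as_malignant)
recall_malignant = malignant_as_malignant / (malignant_as_malignant + malignant_as_benign)

# Precision: share of each predicted class that was actually correct
precision_benign = benign_as_benign / (benign_as_benign + malignant_as_benign)
precision_malignant = malignant_as_malignant / (malignant_as_malignant + benign_as_malignant)

accuracy = (benign_as_benign + malignant_as_malignant) / total

print(f'benign:    precision={precision_benign:.3f} recall={recall_benign:.3f}')
print(f'malignant: precision={precision_malignant:.3f} recall={recall_malignant:.3f}')
print(f'accuracy={accuracy:.3f}')
```

Under these counts every malignant sample is caught (recall 1.0 for the malignant class), and the five errors are all benign samples flagged as malignant.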


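Using the __f1_score__ metric from the sklearn library to evaluate the predictions (F1 is the harmonic mean of precision and recall; `average='weighted'` weights each class's score by its number of true samples):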


In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, y_predict, average ='weighted')
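
For reference, a weighted F1 score can be reproduced by hand from confusion-matrix counts. This sketch assumes the counts reported for the confusion matrix earlier (85 of 90 benign and 47 of 47 malignant predicted correctly); the helper `f1_from_counts` is hypothetical and not part of sklearn.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

# Benign as the positive class: 85 hits, 0 false alarms, 5 misses
f1_benign = f1_from_counts(tp=85, fp=0, fn=5)
# Malignant as the positive class: 47 hits, 5 false alarms, 0 misses
f1_malignant = f1_from_counts(tp=47, fp=5, fn=0)

# average='weighted' weights each class by its support (true sample count)
support_benign, support_malignant = 90, 47
weighted_f1 = (
    support_benign * f1_benign + support_malignant * f1_malignant
) / (support_benign + support_malignant)

print(round(f1_benign, 4), round(f1_malignant, 4), round(weighted_f1, 4))
```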

Using the __jaccard index__ metric from the sklearn library to evaluate the predictions for the benign class (`pos_label=2`):

In [None]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, y_predict, pos_label=2)
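
The Jaccard index for one class is the size of the intersection of the true and predicted sets for that class divided by the size of their union. A minimal hand computation for the benign class, assuming the confusion-matrix counts reported earlier (85 of 90 benign predicted correctly, no malignant sample predicted benign):

```python
# Benign (label 2) treated as the positive class
tp = 85  # benign samples predicted benign
fn = 5   # benign samples predicted malignant
fp = 0   # malignant samples predicted benign

# Jaccard = |true AND predicted| / |true OR predicted| = TP / (TP + FP + FN)
jaccard_benign = tp / (tp + fp + fn)
print(round(jaccard_benign, 4))
```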