# Breast Cancer Biopsy Classification
The aim of this project is to build a classifier that can determine whether a breast cancer biopsy sample is malignant or benign.

Each biopsy sample has various metrics recorded about it, including: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

Using a large dataset of labeled biopsy samples from breast cancer tumors, I will try to build a binary classification model to determine whether a tumor is malignant or benign based on these features. 

In this project I will use multiple approaches to build a model:

*   Approach 1: Linear Regression 
*   Approach 2: Boundary classifier
*   Approach 3: Logistic Regression
*   Approach 4: Multiple Feature Logistic Regression
*   Approach 5: Decision Tree Model








**Download Data**

The dataset used to train these models is called the Breast Cancer Wisconsin (Diagnostic) Data Set. It consists of 569 biopsy samples, just like the ones described above, from breast cancer tumors.

Each biopsy sample in the dataset is labeled with an ID number and whether or not the tumor it came from is malignant (1) or benign (0). Each sample also has 10 different features associated with it, some of which are described above. 

In [None]:
import pandas as pd
from sklearn import metrics

!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%201%20-%205/Session%202b%20-%20Logistic%20Regression/cancer.csv"

data = pd.read_csv('cancer.csv')
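data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})  # encode malignant as 1, benign as 0
data.to_csv('cancer.csv', index=False)  # index=False avoids writing an extra unnamed index column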
del data

**Data Feature Descriptions**
1. *Perimeter*: Total distance between the points defining the cell's nuclear perimeter.
2. *Radius*: Average distance from the center of the cell's nucleus to its perimeter.
3. *Texture*: Measured as the variance of the grayscale intensities in the nucleus's component pixels.
4. *Area*: Measured by counting the number of pixels in the interior of the nucleus and adding one-half of the pixels on the perimeter.
5. *Smoothness*: Measures the smoothness of a nuclear contour as the difference between the length of a radial line and the mean length of the lines surrounding it.
6. *Concavity*: Measures the severity of concavities or indentations in a cell nucleus. Chords are drawn between non-adjacent snake points, and the extent to which the actual boundary lies inside each chord is measured.
7. *Symmetry*: The major axis (longest chord) through the center is found. Then the length difference between lines perpendicular to the major axis, measured to the nuclear boundary in both directions, is calculated.

The paper that first detailed these measurements for this dataset can be found here for more information: https://pdfs.semanticscholar.org/1c4a/4db612212a9d3806a848854d20da9ddd0504.pdf
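
The perimeter and radius definitions above can be sketched in a few lines of plain Python. This is only an illustration, not the measurement code used to build the dataset: the function names are hypothetical, the boundary is a made-up list of (x, y) points, and the nucleus "center" is assumed to be the centroid of those points.

```python
import math

def nuclear_perimeter(points):
    """Sum of distances between consecutive boundary points (closed contour)."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to close the contour
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def nuclear_radius(points):
    """Mean distance from the centroid of the boundary points to each point."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)

# Hypothetical boundary: the four corners of a 2 x 2 square
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
perimeter = nuclear_perimeter(square)
radius = nuclear_radius(square)
print(perimeter, radius)
```

A real nucleus outline would have many more boundary points, but the arithmetic is the same.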

Loading annotated dataset

In [None]:
# Importing Python tools for loading/navigating data
import os             # Good for navigating your computer's files 
import numpy as np    # Great for lists (arrays) of numbers
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv)
from sklearn.metrics import accuracy_score   # Fraction of predictions that match the true labels

In [None]:
data_path  = 'cancer.csv'

dataframe = pd.read_csv(data_path)

dataframe = dataframe[['diagnosis', 'perimeter_mean', 'radius_mean', 'texture_mean', 'area_mean', 'smoothness_mean', 'concavity_mean', 'symmetry_mean']]
dataframe['diagnosis_cat'] = dataframe['diagnosis'].astype('category').map({1: '1 (malignant)', 0: '0 (benign)'})

Visualizing dataset

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', data = dataframe, order=['1 (malignant)', '0 (benign)'])
dataframe.head()

# Predicting Diagnosis

Starting by predicting a diagnosis using a single feature: mean radius

# First Approach: Linear Regression

Fitting and visualizing a linear regression:

In [None]:
# Fit and visualize a linear regression 
from sklearn import linear_model

X,y = dataframe[['radius_mean']], dataframe[['diagnosis']]

model = linear_model.LinearRegression()
model.fit(X, y)
preds = model.predict(X)

sns.scatterplot(x='radius_mean', y='diagnosis', data=dataframe)
plt.plot(X, preds, color='r')
plt.legend(['Linear Regression Fit', 'Data'])

# Second Approach: Boundary Classifier
The variable we are trying to predict is categorical, not continuous. So, we can't use a linear regression; we have to use a classifier.
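
To see why a straight line is an awkward fit for 0/1 labels, here is a minimal sketch that fits ordinary least squares by hand on a small, made-up set of radius values (the numbers are hypothetical, not taken from the dataset). For small and large radii the line predicts values below 0 and above 1, which cannot be read as a diagnosis or as a probability.

```python
# Hypothetical toy data: radius values with 0 (benign) / 1 (malignant) labels
radii = [8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 26.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

n = len(radii)
mean_x = sum(radii) / n
mean_y = sum(labels) / n

# Ordinary least squares for one feature: slope = cov(x, y) / var(x)
covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(radii, labels))
variance = sum((x - mean_x) ** 2 for x in radii)
slope = covariance / variance
intercept = mean_y - slope * mean_x

def linear_prediction(radius):
    """Value of the fitted line at the given radius."""
    return slope * radius + intercept

# The smallest radius gets a negative prediction, the largest one exceeds 1
print(linear_prediction(8.0), linear_prediction(26.0))
```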

In [None]:
boundary = 15 # change me!

sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', data = dataframe, order=['1 (malignant)', '0 (benign)'])
plt.plot([boundary, boundary], [-.2, 1.2], 'g', linewidth = 2)

Building the boundary Classifier:

This function will take in a boundary value of our choosing and then classify the data points based on whether or not they are above or below the boundary.

In [None]:
def boundary_classifier(target_boundary, radius_mean_series):
  result = []
  for i in radius_mean_series:
    if i > target_boundary:
      result.append(1)
    else:
      result.append(0)
  return result

Running the classifier

In [None]:
chosen_boundary = 15

y_pred = boundary_classifier(chosen_boundary, dataframe['radius_mean'])
dataframe['predicted'] = y_pred

y_true = dataframe['diagnosis']

sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', hue = 'predicted', data = dataframe, order=['1 (malignant)', '0 (benign)'])
plt.plot([chosen_boundary, chosen_boundary], [-.2, 1.2], 'g', linewidth = 2)

Calculating accuracy

In [None]:
accuracy = accuracy_score(y_true,y_pred)
print(accuracy)
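
Rather than changing the boundary by hand, one option is to try several candidate values and keep the one with the highest accuracy. The sketch below does this on a small, hypothetical list of radii and labels; `boundary_accuracy`, the toy data, and the candidate values are all illustrative assumptions, and in the notebook the inputs would be `dataframe['radius_mean']` and `dataframe['diagnosis']`.

```python
def boundary_accuracy(boundary, radii, labels):
    """Fraction of samples where 'radius > boundary' matches the true label."""
    predictions = [1 if r > boundary else 0 for r in radii]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical toy data standing in for the real columns
toy_radii = [9.5, 11.2, 12.8, 13.9, 14.6, 14.9, 16.1, 17.8, 19.4, 21.0]
toy_labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# Score each candidate boundary and keep the most accurate one
candidates = [11, 13, 15, 17, 19]
scores = {b: boundary_accuracy(b, toy_radii, toy_labels) for b in candidates}
best_boundary = max(scores, key=scores.get)

print(scores)
print("best boundary:", best_boundary)
```

Choosing the boundary on the same data used to score it gives an optimistic accuracy, which is one motivation for the train/test split used in the next approach.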

# Third Approach: Logistic Regression
Splitting Training and Testing Data


In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

Single Variable Logistic Regression:

Building a logistic regression model to predict the diagnosis using radius mean

In [None]:
X = ['radius_mean']
y = 'diagnosis'

X_train = train_df[X]
print('X_train, our input variables:')
print(X_train.head())
print()

y_train = train_df[y]
print('y_train, our output variable:')
print(y_train.head())

Preparing the model

In [None]:
logreg_model = linear_model.LogisticRegression()
logreg_model.fit(X_train, y_train)

Testing the Model

In [None]:
X_test = test_df[X]
y_test = test_df[y]
y_pred = logreg_model.predict(X_test)

Visualizing the results

In [None]:
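test_df = test_df.copy()  # work on a copy so adding a column does not trigger pandas' SettingWithCopyWarning
test_df['predicted'] = y_pred.squeeze()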
sns.catplot(x = 'radius_mean', y = 'diagnosis_cat', hue = 'predicted', data=test_df, order=['1 (malignant)', '0 (benign)'])

Evaluating the accuracy

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

Plotting logistic regression's soft probability

In [None]:
y_prob = logreg_model.predict_proba(X_test)
X_test_view = X_test[X].values.squeeze()
plt.xlabel('radius_mean')
plt.ylabel('Predicted Probability')
sns.scatterplot(x = X_test_view, y = y_prob[:,1], hue = y_test, palette=['purple','green'])
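
The S-shaped curve in this plot comes from the sigmoid function: for a binary problem, logistic regression passes a linear score through a sigmoid to get a probability. The sketch below uses made-up coefficients purely for illustration; the fitted values live in `logreg_model.coef_` and `logreg_model.intercept_` and will differ.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen so the probability crosses 0.5 at radius 15
coef = 1.0
intercept = -15.0

def malignant_probability(radius_mean):
    """Probability of class 1 (malignant) under the hypothetical coefficients."""
    return sigmoid(coef * radius_mean + intercept)

for radius in [10.0, 15.0, 20.0]:
    print(radius, round(malignant_probability(radius), 3))
```

Where the linear score is zero the probability is exactly 0.5, which is the point at which `predict` switches from benign to malignant.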

# Fourth Approach: Multiple Feature Logistic Regression
Using more features in the logistic regression to predict diagnosis

In [None]:
multi_X = ['perimeter_mean', 'radius_mean', 'texture_mean','area_mean']
y = 'diagnosis'

# 1. Split data into train and test
multi_train_df, multi_test_df = train_test_split(dataframe, test_size = 0.2, random_state = 1)

# 2. Prepare your X_train, X_test, y_train, and y_test variables by extracting the appropriate columns:
multi_X_train, multi_X_test = multi_train_df[multi_X], multi_test_df[multi_X]
y_train, y_test = multi_train_df[y], multi_test_df[y]

# 3. Initialize the model object
model = linear_model.LogisticRegression()

# 4. Fit the model to the training data
model.fit(multi_X_train, y_train)

# 5. Use this trained model to predict on the test data
multi_preds = model.predict(multi_X_test)

# 6. Evaluate the accuracy by comparing the predictions to the test labels, then print it.
accuracy = accuracy_score(y_test, multi_preds)
print(multi_X)
print(accuracy)

Logistic regression can combine several features into a single classification boundary, which often improves prediction accuracy compared with using one feature alone.

Creating a confusion matrix

In [None]:
# Import the metrics class
from sklearn import metrics

# Create the Confusion Matrix
# Compare the multi-feature model's predictions (multi_preds) with the test labels
cnf_matrix = metrics.confusion_matrix(y_test, multi_preds)

# Visualizing the Confusion Matrix
class_names = [0,1] # Our diagnosis categories

fig, ax = plt.subplots()
# Setting up and visualizing the plot (do not worry about the code below!)
tick_marks = np.arange(len(class_names)) 
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g') # Creating heatmap
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y = 1.1)
plt.ylabel('Actual diagnosis')
plt.xlabel('Predicted diagnosis')

Calculating `True Negative`, `False Positive`, `False Negative` and `True Positive` metrics from the confusion matrix

In [None]:
print (cnf_matrix)
(tn, fp), (fn, tp) = cnf_matrix
print ("TN, FP, FN, TP:", tn, fp, fn, tp)

Calculate the model's performance by chosen metric

In [None]:
accuracy = (tp + tn)/(tn + fp + fn + tp)
precision = (tp)/(tp + fp)
recall = tp/(tp + fn)

print ("accuracy, precision, recall", accuracy, precision, recall)
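
The same three formulas can be wrapped in a small helper that also guards against division by zero, which happens, for example, when a model never predicts the positive class. The function name and the counts below are hypothetical and are not results from this notebook.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall), using 0.0 when a denominator is zero."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted malignant, how many were
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual malignant, how many were found
    return accuracy, precision, recall

# Hypothetical counts for illustration only
acc, prec, rec = classification_metrics(tn=70, fp=2, fn=6, tp=36)
print("accuracy, precision, recall", acc, prec, rec)
```

For a cancer screen, a false negative (a missed malignant tumor) is usually the costlier mistake, so recall deserves at least as much attention as accuracy.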

# Fifth Approach: Decision Tree Model

Create the model

In [None]:
from sklearn import tree

class_dt = tree.DecisionTreeClassifier(max_depth=3)

# Use the multi-feature training set (`multi_X_train`, `y_train`) to build the model
class_dt.fit(multi_X_train, y_train)

Visualize and interpret the tree

In [None]:
plt.figure(figsize=(13,8))  # set plot size
tree.plot_tree(class_dt, fontsize=10) 
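
Each box in the plotted tree reports a `gini` value. With scikit-learn's default settings this is the Gini impurity of the samples reaching that node: 0 means the node contains a single class, and 0.5 is the most mixed a two-class node can be. A minimal sketch of the calculation follows; the function name and the class counts are made up for illustration.

```python
def gini_impurity(n_benign, n_malignant):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = n_benign + n_malignant
    if total == 0:
        return 0.0
    p_benign = n_benign / total
    p_malignant = n_malignant / total
    return 1.0 - p_benign ** 2 - p_malignant ** 2

print(gini_impurity(50, 50))   # evenly mixed node -> 0.5
print(gini_impurity(100, 0))   # pure node -> 0.0
print(gini_impurity(90, 10))   # mostly benign node, low impurity
```

When growing the tree, splits that send samples into purer (lower-impurity) child nodes are preferred.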

Find predictions based on model

In [None]:
multi_y_pred = class_dt.predict(multi_X_test)

Calculate model performance

In [None]:
print("Accuracy: ", metrics.accuracy_score(y_test, multi_y_pred))
print("Precision: ", metrics.precision_score(y_test, multi_y_pred))
print("Recall: ", metrics.recall_score(y_test, multi_y_pred))

# Choosing a Classifier
Let's try to choose the overall best classifier for this dataset. I will:

*   Use a for loop to train and evaluate each classifier in the list on our dataset.
*   Calculate the precision, recall, and accuracy on the test set for each classifier.








Import classifiers

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

Using a for loop to train and test each classifier, and print the results

In [None]:
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()] 

for classifier in classifiers:
  print("---------------")
  print(str(classifier) + '\n')
  classifier.fit(multi_X_train, y_train)
  multi_y_pred = classifier.predict(multi_X_test)
  print("Accuracy: ", metrics.accuracy_score(y_test, multi_y_pred))
  print("Precision: ", metrics.precision_score(y_test, multi_y_pred))
  print("Recall: ", metrics.recall_score(y_test, multi_y_pred)) 
  print("---------------")

You can search for better-performing hyperparameters for each classifier, and these experiments may well turn up a classifier that scores very well on our test set. However, so much manual fine-tuning carries a risk: you might end up "overfitting" by choosing a classifier that works well on your test set but not on other data.

That is why, if you plan to experiment a lot with the hyperparameters, you should keep three sets: a training set used to fit each candidate model; a validation set used to evaluate each candidate and choose the best one; and a test set that is used only once, at the very end.
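
A minimal sketch of such a three-way split, written in plain Python so the bookkeeping is visible. The function name, the 60/20/20 proportions, and the seed are assumptions for illustration; in the notebook, a similar result could be obtained by calling `train_test_split` twice.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Shuffle rows once, then cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Row indices standing in for the 569 biopsy samples
train_rows, val_rows, test_rows = three_way_split(range(569))
print(len(train_rows), len(val_rows), len(test_rows))
```

Candidate models and hyperparameters would be compared on the validation rows only, with the test rows scored once for the final choice.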