# Logistic Regression - Class Exercise 2

## Introduction

Breast cancer is a disease in which abnormal breast cells grow out of control and form tumours. If left unchecked, the tumours can spread throughout the body and become fatal. There are several types of breast cancer, such as triple-negative breast cancer and inflammatory breast cancer.

We will use the "breast-cancer-wisconsin" dataset for this exercise. This dataset contains a patient ID, 9 input features, and a class label. Our goal is to build a logistic regression model to predict the class of breast cancer.

## Metadata (Data Dictionary)

| Variable | Data Type | Description |
|----------|-----------|-------------|
| ID | int | ID of the patient |
| Clump Thickness | int | Clump Thickness |
| Uniformity of Cell Size | int | Uniformity of Cell Size |
| Uniformity of Cell Shape | int | Uniformity of Cell Shape |
| Marginal Adhesion | int | Marginal Adhesion |
| Single Epithelial Cell Size | int | Single Epithelial Cell Size |
| Bare Nuclei | float | Bare Nuclei |
| Bland Chromatin | int | Bland Chromatin |
| Normal Nucleili | int | Normal Nucleili |
| Mitoses | int | Mitoses |
| Class | int | Class of the breast cancer |


## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

## Import Data

<font color=red><b>Action</b>: Load the data file and check against the metadata.</font>

In [2]:
df = pd.read_csv('breast-cancer-wisconsin.csv')
df

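| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleili | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2.0 | 1 | 1 | 1 | 2 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 1 | 2 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3.0 | 8 | 10 | 2 | 4 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4.0 | 10 | 6 | 1 | 4 |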


In our dataset, we have 2 classes, Class 2 and Class 4 (in the original Wisconsin dataset, 2 denotes benign and 4 denotes malignant).<br>
This is not a conventional way to label classes.<br>
We can convert them to Class 0 and Class 1.

<font color=red><b>Action</b>: For "Class" column, convert 2 to 0 and 4 to 1</font>

In [3]:
df['Class'] = df['Class'].map({2: 0, 4: 1})
df

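| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleili | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2.0 | 1 | 1 | 1 | 0 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 1 | 0 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3.0 | 8 | 10 | 2 | 1 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4.0 | 10 | 6 | 1 | 1 |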
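One thing to watch when using `Series.map` with a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below uses a made-up Series (the values are illustrative, not taken from the dataset) to show a quick check that the mapping covered every value.

```python
import pandas as pd

# Hypothetical labels; 3 is deliberately absent from the mapping
raw = pd.Series([2, 4, 2, 3])
mapped = raw.map({2: 0, 4: 1})

# Values missing from the dictionary are mapped to NaN
n_unmapped = mapped.isnull().sum()
print(mapped.tolist())
print(n_unmapped)
```

Running the same `isnull().sum()` check on `df['Class']` after the conversion would confirm that only 2 and 4 were present.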


Now, let's start with EDA!

## Exploratory Data Analysis

### Check the proportion of each class

<font color=red><b>Action</b>: For "Class" column, check the count and the proportion of each class</font>

In [4]:
df['Class'].value_counts()

Class
0    458
1    241
Name: count, dtype: int64

In [5]:
df['Class'].value_counts() / df.shape[0]

Class
0    0.655222
1    0.344778
Name: count, dtype: float64

About 65.5% of the samples are Class 0 and 34.5% are Class 1.
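As a side note, pandas can compute the proportions directly with `value_counts(normalize=True)`. A minimal sketch on made-up labels (not the dataset) comparing it with the manual division:

```python
import pandas as pd

# Made-up labels for illustration: three of Class 0, one of Class 1
s = pd.Series([0, 0, 0, 1], name='Class')

# Manual division, as in the cell above
manual = s.value_counts() / s.shape[0]

# Built-in normalisation gives the same proportions
builtin = s.value_counts(normalize=True)

print(manual)
print(builtin)
```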

### Check for missing values

<font color=red><b>Action</b>: Check the proportion of missing values for each column</font>

In [6]:
df.isnull().mean()

ID                             0.00000
Clump Thickness                0.00000
Uniformity of Cell Size        0.00000
Uniformity of Cell Shape       0.00000
Marginal Adhesion              0.00000
Single Epithelial Cell Size    0.00000
Bare Nuclei                    0.02289
Bland Chromatin                0.00000
Normal Nucleili                0.00000
Mitoses                        0.00000
Class                          0.00000
dtype: float64

About 2.3% of the values in Bare Nuclei are missing. We need to deal with them before we can proceed to modelling.

We can fill the missing values with the mode.

<font color=red><b>Action</b>: Fill the missing values in "Bare Nuclei" column with the mode</font>

In [7]:
mode = df['Bare Nuclei'].mode()[0]
df['Bare Nuclei'] = df['Bare Nuclei'].fillna(mode)

df.isnull().mean()

ID                             0.0
Clump Thickness                0.0
Uniformity of Cell Size        0.0
Uniformity of Cell Shape       0.0
Marginal Adhesion              0.0
Single Epithelial Cell Size    0.0
Bare Nuclei                    0.0
Bland Chromatin                0.0
Normal Nucleili                0.0
Mitoses                        0.0
Class                          0.0
dtype: float64

Now there are no more missing values.
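Because the mode above is computed on the full DataFrame, it also uses rows that later end up in the test set. One possible alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` is acceptable for this exercise, is to learn the mode from the training rows only and then apply it to both sets. The two small DataFrames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with gaps; the train/test rows are hypothetical
train = pd.DataFrame({'Bare Nuclei': [1.0, 1.0, 10.0, np.nan]})
test = pd.DataFrame({'Bare Nuclei': [np.nan, 5.0]})

# Learn the mode from the training rows only, then apply it to both sets
imputer = SimpleImputer(strategy='most_frequent')
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print(train_filled.ravel())
print(test_filled.ravel())
```

With only 2.3% of one column missing, the difference from the simpler approach is likely small; the sketch mainly illustrates the habit of fitting preprocessing on training data.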

### Extract the features and the label
We are using "Class" as the categorical label for classification.
Take note that "ID" should be excluded from the features, as it is only a patient identifier and carries no predictive information.

<font color=red><b>Action 1</b>: Create a string variable to be the column name of the label<br>
<b>Action 2</b>: Create a list that contains the names of columns to be used as input features</font>

In [8]:
label = 'Class'
excluded_features = [label, 'ID']
features = [feature for feature in list(df) if feature not in excluded_features]
features

['Clump Thickness',
 'Uniformity of Cell Size',
 'Uniformity of Cell Shape',
 'Marginal Adhesion',
 'Single Epithelial Cell Size',
 'Bare Nuclei',
 'Bland Chromatin',
 'Normal Nucleili',
 'Mitoses']

<font color=red><b>Action</b>: Split the full DataFrame into the training set and the test set.</font>

In [9]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
display(train_df.head())
display(test_df.head())

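| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleili | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 293 | 601265 | 10 | 4 | 4 | 6 | 2 | 10.0 | 2 | 3 | 1 | 1 |
| 62 | 1116116 | 9 | 10 | 10 | 1 | 10 | 8.0 | 3 | 3 | 1 | 1 |
| 485 | 1002025 | 1 | 1 | 1 | 3 | 1 | 3.0 | 1 | 1 | 1 | 0 |
| 422 | 1257648 | 4 | 3 | 3 | 1 | 2 | 1.0 | 3 | 3 | 1 | 0 |
| 332 | 770066 | 5 | 2 | 2 | 2 | 2 | 1.0 | 2 | 2 | 1 | 0 |


| | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleili | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 476 | 1296025 | 4 | 1 | 2 | 1 | 2 | 1.0 | 1 | 1 | 1 | 0 |
| 531 | 867392 | 4 | 2 | 2 | 1 | 2 | 1.0 | 2 | 1 | 1 | 0 |
| 40 | 1096800 | 6 | 6 | 6 | 9 | 6 | 1.0 | 7 | 8 | 1 | 0 |
| 432 | 1277629 | 5 | 1 | 1 | 1 | 2 | 1.0 | 2 | 2 | 1 | 0 |
| 14 | 1044572 | 8 | 7 | 5 | 10 | 7 | 9.0 | 5 | 5 | 4 | 1 |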


<font color=red><b>Action</b>: Extract the features and the label for the training set and the test set, respectively.</font>

In [10]:
train_x = train_df[features]
train_y = train_df[label]

test_x = test_df[features]
test_y = test_df[label]

<font color=red><b>Action</b>: Display the count of each class in the training/test set</font>

In [11]:
display(train_y.value_counts())
display(test_y.value_counts())

Class
0    373
1    186
Name: count, dtype: int64

Class
0    85
1    55
Name: count, dtype: int64

<font color=red><b>Action</b>: Display the proportion (as a fraction) of each class in the training/test set</font>

In [12]:
display(train_y.value_counts() / train_y.shape[0])
display(test_y.value_counts() / test_y.shape[0])

Class
0    0.667263
1    0.332737
Name: count, dtype: float64

Class
0    0.607143
1    0.392857
Name: count, dtype: float64
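The share of Class 1 differs between the two sets (about 33.3% in training versus 39.3% in test) because the split is purely random. If closer proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up data with a 65/35 class ratio, not the exercise data, to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 35 of them in Class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 65 + [1] * 35)

# stratify=y keeps the class ratio (nearly) equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Proportion of Class 1 in each split
print(y_train.mean(), y_test.mean())
```

In the exercise this would correspond to passing `stratify=df[label]`, which is a suggestion rather than part of the original solution.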

### Build a logistic regression model

In [13]:
from sklearn.linear_model import LogisticRegression

<font color=red><b>Action 1</b>: Initialize a logistic regression model<br>
<b>Action 2</b>: Train the model on the training features and the training label<br>
<b>Action 3</b>: Predict the test label</font>

In [14]:
model = LogisticRegression()
model.fit(train_x, train_y)

test_yhat = model.predict(test_x)
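To see what the fitted model computes, the sketch below reproduces the Class 1 probability by hand: a linear score from `coef_` and `intercept_`, passed through the sigmoid function. It assumes the default binary `LogisticRegression` settings and uses a small synthetic dataset rather than the breast cancer data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small made-up binary problem, not the breast cancer data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score z = X.w + b, then the sigmoid maps z to P(Class 1)
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the library's own probabilities and labels
p_sklearn = clf.predict_proba(X)[:, 1]
same_labels = ((p_manual > 0.5).astype(int) == clf.predict(X)).all()
print(np.allclose(p_manual, p_sklearn), same_labels)
```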

### Generate a confusion matrix

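We can use the <font color=#AA0000><b>sklearn.metrics.confusion_matrix()</b></font> function to construct a confusion matrix.<br>
It takes 2 inputs: 1) the actual label; 2) the predicted label.<br>

Inputs (1) and (2) should not be swapped: rows correspond to the actual class and columns to the predicted class, so swapping them transposes the matrix and exchanges false positives with false negatives.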

<font color=red><b>Action</b>: Construct a confusion matrix for the test prediction</font>

In [15]:
metrics.confusion_matrix(test_y, test_yhat)

array([[82,  3],
       [ 1, 54]], dtype=int64)

The confusion matrix looks great: we have 82 true negatives and 54 true positives against only 3 false positives and 1 false negative.
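For a binary problem with labels 0 and 1, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so the four counts can be unpacked with `ravel()`. The sketch below rebuilds label arrays that reproduce the counts shown above (the arrays themselves are constructed for illustration, not the real test set) and also shows what happens when the two inputs are swapped.

```python
import numpy as np
from sklearn import metrics

# Rebuild labels that reproduce the counts above: 82 TN, 3 FP, 1 FN, 54 TP
y_true = np.array([0] * 82 + [0] * 3 + [1] * 1 + [1] * 54)
y_pred = np.array([0] * 82 + [1] * 3 + [0] * 1 + [1] * 54)

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)

# Swapping the arguments transposes the matrix, exchanging FP and FN
cm_swapped = metrics.confusion_matrix(y_pred, y_true)
print(cm_swapped)
```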

### Write a custom function to determine the classification metrics
We have written this function in Class Exercise 1.

In [16]:
def get_classification_metrics(y, yhat):
    # Record the total number of samples
    n = len(y)
    
    # Count the number of correct samples and calculate the accuracy
    n_correct = (y == yhat).sum()
    accuracy = n_correct / n
    
    # One way to calculate the error rate
    error_rate = 1 - accuracy
    
    # The other way to calculate the error rate
    n_incorrect = (y != yhat).sum()
    error_rate = n_incorrect / n
    
    # Count the number of true positive
    TP = ((y == 1) & (yhat == 1)).sum()
    
    # Count the number of false positive
    FP = ((y == 0) & (yhat == 1)).sum()
    
    # Count the number of true negative
    TN = ((y == 0) & (yhat == 0)).sum()
    
    # Count the number of false negative
    FN = ((y == 1) & (yhat == 0)).sum()
    
    # Calculate sensitivity / specificity / precision / recall
    sensitivity = recall = TP / (TP + FN)
    specificity = TN / (FP + TN)
    precision = TP / (TP + FP)
    
    item = ['Accuracy', 'Error Rate', 'Sensitivity', 'Specificity', 'Precision', 'Recall']
    value = accuracy, error_rate, sensitivity, specificity, precision, recall
    
    df_out = {'Item': item, 'Value': value}
    df_out = pd.DataFrame(df_out)
    return df_out

<font color=red><b>Action</b>: Run the function above</font>

In [17]:
get_classification_metrics(test_y, test_yhat)

| | Item | Value |
|---|---|---|
| 0 | Accuracy | 0.971429 |
| 1 | Error Rate | 0.028571 |
| 2 | Sensitivity | 0.981818 |
| 3 | Specificity | 0.964706 |
| 4 | Precision | 0.947368 |
| 5 | Recall | 0.981818 |
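As a cross-check, the same figures can be obtained from scikit-learn's built-in metric functions. The sketch below rebuilds label arrays from the confusion matrix counts (82 TN, 3 FP, 1 FN, 54 TP) instead of using the real test set, so it runs on its own.

```python
import numpy as np
from sklearn import metrics

# Labels rebuilt from the confusion matrix counts: 82 TN, 3 FP, 1 FN, 54 TP
y_true = np.array([0] * 85 + [1] * 55)
y_pred = np.array([0] * 82 + [1] * 3 + [0] * 1 + [1] * 54)

accuracy = metrics.accuracy_score(y_true, y_pred)
# Recall of Class 1 is the sensitivity
recall = metrics.recall_score(y_true, y_pred)
precision = metrics.precision_score(y_true, y_pred)
# Specificity is the recall of Class 0
specificity = metrics.recall_score(y_true, y_pred, pos_label=0)

print(accuracy, recall, specificity, precision)
```

The values agree with the table produced by `get_classification_metrics`.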


## Evaluate the model by AUC
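AUC (the area under the ROC curve) measures how well the model distinguishes between the classes. It ranges from 0 to 1, where 0.5 corresponds to random ranking and 1 to perfect separation.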
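A minimal sketch of computing AUC with `sklearn.metrics.roc_auc_score`, using four made-up samples rather than the test set. It also shows an equivalent reading of AUC when no scores are tied: the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUC takes the actual labels and the predicted scores (not hard labels)
auc = metrics.roc_auc_score(y_true, scores)

# Same number by counting (positive, negative) pairs ranked correctly
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(auc, pairwise)
```

Here 3 of the 4 pairs are ranked correctly, so both values are 0.75. For the exercise, the scores would be the predicted probabilities of Class 1.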

<font color=red><b>Action 1</b>: Initialize a logistic regression model<br>
<b>Action 2</b>: Train the model on the training features and the training label<br>
<b>Action 3</b>: Predict the test label as <u>probability</u></font>

In [18]:
model = LogisticRegression()
model.fit(train_x, train_y)

test_yhat_prob = model.predict_proba(test_x)

The prediction is returned as a 2D array.<br>
Axis 0 has one entry per sample.<br>
Axis 1 has one entry per class.

On Axis 1, the 1st value is the probability of being Class 0 and the 2nd value is the probability of being Class 1.
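The column order of `predict_proba` follows the fitted model's `classes_` attribute, so the Class 1 column can be looked up rather than assumed. A small sketch on a made-up one-feature dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with one feature
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Column i of proba corresponds to clf.classes_[i]
class1_col = list(clf.classes_).index(1)
p_class1 = proba[:, class1_col]

print(clf.classes_)
print(proba.shape)
```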

In [19]:
test_yhat_prob

array([[9.94183432e-01, 5.81656754e-03],
       [9.91685261e-01, 8.31473903e-03],
       [1.40755024e-02, 9.85924498e-01],
       [9.86802469e-01, 1.31975309e-02],
       [1.22380343e-04, 9.99877620e-01],
       [9.97059380e-01, 2.94061963e-03],
       [2.92699845e-03, 9.97073002e-01],
       [9.97059380e-01, 2.94061963e-03],
       [1.47780262e-04, 9.99852220e-01],
       [2.31463011e-01, 7.68536989e-01],
       [9.98882508e-01, 1.11749237e-03],
       [9.97588936e-01, 2.41106431e-03],
       [1.22477403e-03, 9.98775226e-01],
       [2.31500354e-02, 9.76849965e-01],
       [3.49779203e-03, 9.96502208e-01],
       [9.97059380e-01, 2.94061963e-03],
       [9.98359112e-01, 1.64088824e-03],
       [4.41327346e-04, 9.99558673e-01],
       [6.23122766e-02, 9.37687723e-01],
       [9.93668793e-01, 6.33120669e-03],
       [2.01192184e-03, 9.97988078e-01],
       [7.88951596e-05, 9.99921105e-01],
       [9.98648452e-01, 1.35154832e-03],
       [9.96444834e-01, 3.55516591e-03],
       [9.396433

Along Axis 1, the values in each row add up to 1.<br>
We can verify that.

In [20]:
test_yhat_prob.sum(axis=1)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1.])

We are going to extract the 2nd value on Axis 1.

In [21]:
test_yhat_prob_class1 = test_yhat_prob[:, 1]
test_yhat_prob_class1

array([5.81656754e-03, 8.31473903e-03, 9.85924498e-01, 1.31975309e-02,
       9.99877620e-01, 2.94061963e-03, 9.97073002e-01, 2.94061963e-03,
       9.99852220e-01, 7.68536989e-01, 1.11749237e-03, 2.41106431e-03,
       9.98775226e-01, 9.76849965e-01, 9.96502208e-01, 2.94061963e-03,
       1.64088824e-03, 9.99558673e-01, 9.37687723e-01, 6.33120669e-03,
       9.97988078e-01, 9.99921105e-01, 1.35154832e-03, 3.55516591e-03,
       6.03566768e-02, 9.92354528e-01, 3.55516591e-03, 1.42035641e-03,
       9.99891389e-01, 9.45414558e-01, 2.33996798e-02, 8.56278252e-03,
       8.31473903e-03, 5.79841779e-03, 5.23931690e-03, 1.98842983e-02,
       6.52277455e-04, 9.99442831e-01, 6.44447374e-03, 1.64878092e-03,
       4.59873143e-02, 1.12504365e-02, 7.42713118e-03, 1.11749237e-03,
       9.99895161e-01, 9.99832551e-01, 2.34987623e-03, 9.69490686e-01,
       2.89873325e-02, 9.84302445e-01, 9.81114654e-01, 8.26222832e-03,
       1.99389179e-03, 8.40719965e-01, 9.96271592e-04, 4.81300332e-03,
      

This array represents the probability of being Class 1 for each sample in the test set.<br>
By default, if the probability is greater than 0.5, the sample is predicted as Class 1; otherwise, it is predicted as Class 0.<br>

Let's recreate the predicted class.

<font color=red><b>Action</b>: Convert the array of probability to an array of predicted class</font><br>

In [22]:
test_yhat = (test_yhat_prob_class1 > 0.5).astype(int)
test_yhat

array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0])

<font color=red><b>Action</b>: Construct a confusion matrix for the test prediction</font>

In [23]:
metrics.confusion_matrix(test_y, test_yhat)

array([[82,  3],
       [ 1, 54]], dtype=int64)

Compared with [the earlier result](#Generate-a-confusion-matrix), we see the same confusion matrix.
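Having the probabilities in hand also means the 0.5 cutoff can be changed. The sketch below uses made-up probabilities and labels (not the test set) and a hypothetical helper `predict_at` to show how the counts of false negatives and false positives shift as the threshold moves.

```python
import numpy as np

# Made-up Class 1 probabilities and true labels for illustration
p_class1 = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])
y_true = np.array([0, 0, 1, 0, 1, 1])

def predict_at(threshold):
    """Label a sample Class 1 when its probability exceeds the threshold."""
    return (p_class1 > threshold).astype(int)

for t in (0.3, 0.5, 0.7):
    yhat = predict_at(t)
    fn = ((y_true == 1) & (yhat == 0)).sum()
    fp = ((y_true == 0) & (yhat == 1)).sum()
    print(f"threshold={t}: FN={fn}, FP={fp}")
```

In this toy example, lowering the threshold removes the false negatives at the cost of a false positive, while raising it does the opposite. Which trade-off is preferable depends on the application.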