## Logistic Regression with Breast Cancer Data

### Introduction to Data Science
#### Last Updated: November 28, 2022
---  

### SOURCES
- [Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

### OBJECTIVES
- Implement logistic regression using `sklearn`
- Use the sigmoid function to compute the predicted probability
- Compute binary classification metrics

### CONCEPTS
- logistic regression

---


## 1. Logistic Regression with `sklearn`

Previously, we used the sigmoid function to compute probabilities of a binary outcome.  
Logistic regression is a model that does the following:
- Uses a linear combination of predictors as input, equal to: $\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$
- Feeds the input into the sigmoid function (a.k.a. the logistic function)
- Estimates the parameters $\beta_0, \dots, \beta_n$ to minimize the prediction error on the training data
- Outputs a probability estimate of the outcome
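
The first, second, and last steps in this list can be sketched in a few lines of NumPy. This is a minimal illustration, not how `sklearn` fits the model: the intercept, coefficients, and feature values are made-up numbers, and `predict_probability` is a hypothetical helper name. The parameter-estimation step is left out, since `LogisticRegression().fit` handles it.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(intercept, coefs, x):
    """Hypothetical helper: sigmoid of b0 + b1*x1 + ... + bn*xn."""
    z = intercept + np.dot(coefs, x)
    return sigmoid(z)

# Made-up parameters and one made-up observation (illustration only)
intercept = -1.0
coefs = np.array([0.5, 0.25])
x = np.array([1.0, 2.0])

# z = -1.0 + 0.5*1.0 + 0.25*2.0 = 0.0, and sigmoid(0) = 0.5
p = predict_probability(intercept, coefs, x)
print(p)
```

A linear combination of exactly 0 gives a probability of 0.5; positive values push the probability toward 1 and negative values toward 0.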

Now we will work with the Wisconsin Breast Cancer Dataset to predict whether a cell is benign ('B') or malignant ('M').

The dataset was sourced here:  
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

In [2]:
import numpy as np
import pandas as pd

In [3]:
datapath = '../datasets/wdbc.csv'

**Read in the Data**

In [4]:
df = pd.read_csv(datapath)
df.head()

| | id | diagnosis | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f21 | f22 | f23 | f24 | f25 | f26 | f27 | f28 | f29 | f30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


**About the Fields**

`id` - unique identifier for each subject  
`diagnosis` - target variable indicating if cell is malignant or benign  
`f1-f30` - cell measurement variables

---

**Preprocessing**

The `diagnosis` field is the target variable. It needs to be converted to values of 0 and 1.  
We can make malignant = 1, benign = 0.

In [5]:
df['target'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

Let's check that the encoding is correct by inspecting the first and last rows:

In [6]:
print(df.head())
print(df.tail())

         id diagnosis     f1     f2      f3      f4       f5       f6      f7  \
0    842302         M  17.99  10.38  122.80  1001.0  0.11840  0.27760  0.3001   
1    842517         M  20.57  17.77  132.90  1326.0  0.08474  0.07864  0.0869   
2  84300903         M  19.69  21.25  130.00  1203.0  0.10960  0.15990  0.1974   
3  84348301         M  11.42  20.38   77.58   386.1  0.14250  0.28390  0.2414   
4  84358402         M  20.29  14.34  135.10  1297.0  0.10030  0.13280  0.1980   

        f8  ...    f22     f23     f24     f25     f26     f27     f28  \
0  0.14710  ...  17.33  184.60  2019.0  0.1622  0.6656  0.7119  0.2654   
1  0.07017  ...  23.41  158.80  1956.0  0.1238  0.1866  0.2416  0.1860   
2  0.12790  ...  25.53  152.50  1709.0  0.1444  0.4245  0.4504  0.2430   
3  0.10520  ...  26.50   98.87   567.7  0.2098  0.8663  0.6869  0.2575   
4  0.10430  ...  16.67  152.20  1575.0  0.1374  0.2050  0.4000  0.1625   

      f29      f30  target  
0  0.4601  0.11890       1  
1  0.2750 

**Modeling**

Next, let's fit a logistic regression model to the data, using two predictors.

**Import the model**

In [7]:
from sklearn.linear_model import LogisticRegression

**Prepare the data, using f1 and f2 as predictors**

In [8]:
X = df[['f1','f2']].values
y = df['target'].values

**Fit the logistic regression model**

In [9]:
model = LogisticRegression().fit(X, y)

**Extract the model coefficients**

In [10]:
model.coef_

array([[1.0462619 , 0.21688595]])

**Extract the model intercept**

In [11]:
model.intercept_

array([-19.67135103])

This tells us that the coefficients on f1 and f2 are 1.0462619 and 0.21688595, respectively, and that the intercept is -19.67135103.

**Predict the Cell Types**

In [12]:
# store the predictions in a new column
df['label_predicted'] = model.predict(X)

# print the predictions
print(model.predict(X))

[1 1 1 0 1 0 1 0 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0
 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0
 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1
 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 1 1 0 0
 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0
 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1
 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0
 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 1 0 1 0 1 1 

**Filter the dataframe to show subjects where the prediction matches the target. These are correct predictions.**

In [13]:
df[df['label_predicted'] == df['target']]

| | id | diagnosis | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f23 | f24 | f25 | f26 | f27 | f28 | f29 | f30 | target | label_predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | ... | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 1 | 1 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | ... | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 1 | 1 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | ... | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 1 | 1 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | ... | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 1 | 1 |
| 6 | 844359 | M | 18.25 | 19.98 | 119.60 | 1040.0 | 0.09463 | 0.10900 | 0.11270 | 0.07400 | ... | 153.20 | 1606.0 | 0.14420 | 0.25760 | 0.3784 | 0.1932 | 0.3063 | 0.08368 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | ... | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | 1 | 1 |
| 565 | 926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | ... | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 1 | 1 |
| 566 | 926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | ... | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | 1 | 1 |
| 567 | 927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | ... | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | 1 | 1 |


**Predict the Probability of Each Cell Type**

In [14]:
model.predict_proba(X)

array([[1.97470913e-01, 8.02529087e-01],
       [3.32048070e-03, 9.96679519e-01],
       [3.91751573e-03, 9.96082484e-01],
       ...,
       [2.21665676e-02, 9.77833432e-01],
       [2.63048459e-04, 9.99736952e-01],
       [9.98034375e-01, 1.96562522e-03]])

For example, for the first subject, the probability of a benign cell is 1.97470913e-01 (about 0.197).  
The probability of a malignant cell is 8.02529087e-01 (about 0.803).

Since the malignant probability is greater than the default threshold of 0.5, the predicted cell type is 1 (malignant).
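
The relationship between the two columns and the default threshold can be checked on a few rows. The sketch below copies three rows from the `predict_proba` output above into an array (the variable names are illustrative). It confirms that each row sums to 1, and that picking the more probable column gives the same labels as thresholding the malignant column at 0.5.

```python
import numpy as np

# Three rows copied from the predict_proba output above:
# column 0 = P(benign), column 1 = P(malignant)
proba = np.array([
    [1.97470913e-01, 8.02529087e-01],
    [3.32048070e-03, 9.96679519e-01],
    [9.98034375e-01, 1.96562522e-03],
])

# Each row holds the probabilities of the two classes, so it sums to 1
row_sums = proba.sum(axis=1)

# Picking the more probable class for each row ...
labels_argmax = proba.argmax(axis=1)

# ... matches thresholding P(malignant) at 0.5
labels_threshold = (proba[:, 1] > 0.5).astype(int)

print(row_sums)
print(labels_argmax, labels_threshold)
```

In `sklearn`, the column order follows `model.classes_`, which for a 0/1 target should be `[0, 1]`, so the second column holds the probability of the positive label.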

**Extracting Positive Probabilities**

It can be useful to extract the probabilities of the positive label for each subject.  
This can be done by selecting the second column (index 1) across all rows like this:

In [15]:
model.predict_proba(X)[:,1]

array([8.02529087e-01, 9.96679519e-01, 9.96082484e-01, 3.54716872e-02,
       9.90691245e-01, 3.76777057e-02, 9.77163862e-01, 3.08025631e-01,
       2.07922344e-01, 1.94506527e-01, 8.93803446e-01, 6.72332033e-01,
       9.96872102e-01, 8.91515811e-01, 4.00739770e-01, 8.19702945e-01,
       5.13418767e-01, 8.44231392e-01, 9.97154370e-01, 8.39013335e-02,
       7.05043968e-02, 8.84488084e-04, 3.70773543e-01, 9.99427316e-01,
       9.15748188e-01, 8.60386958e-01, 5.62841469e-01, 9.85102078e-01,
       8.60225499e-01, 8.78211788e-01, 9.94864935e-01, 3.81312901e-02,
       9.65674945e-01, 9.98036551e-01, 7.47018639e-01, 9.25919913e-01,
       4.87163824e-01, 1.14708874e-01, 8.14215994e-01, 2.58807657e-01,
       2.83088636e-01, 2.70073344e-02, 9.96535802e-01, 2.01242651e-01,
       2.38340531e-01, 9.74879417e-01, 5.84698707e-04, 1.36462218e-01,
       2.00189562e-02, 3.27234193e-01, 6.40094626e-02, 1.35120621e-01,
       3.83098927e-02, 9.69150763e-01, 7.11572266e-01, 2.78736966e-02,
      

Suppose we want to change the threshold, predicting cell type 1 if the probability of the positive label is greater than 0.85. This now requires a higher confidence compared to using 0.5 as the threshold.

In [17]:
model.predict_proba(X)[:,1] > 0.85

array([False,  True,  True, False,  True, False,  True, False, False,
       False,  True, False,  True,  True, False, False, False, False,
        True, False, False, False, False,  True,  True,  True, False,
        True,  True,  True,  True, False,  True,  True, False,  True,
       False, False, False, False, False, False,  True, False, False,
        True, False, False, False, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
        True, False, False, False, False,  True,  True, False, False,
       False,  True,  True, False,  True, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False,  True,  True, False,  True,  True, False, False, False,
       False,  True,

The first subject had a positive probability of 0.8025. When the threshold was 0.5, the model predicted the label to be positive. When we raised the threshold to 0.85, it predicted the label to be negative.
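
To turn a custom threshold into 0/1 labels rather than booleans, the comparison result can be cast to integers. A minimal sketch, using the first five positive-class probabilities printed earlier (the variable names are illustrative):

```python
import numpy as np

# First five positive-class probabilities from the output above
proba_pos = np.array([8.02529087e-01, 9.96679519e-01, 9.96082484e-01,
                      3.54716872e-02, 9.90691245e-01])

# Default threshold of 0.5: the first subject is labeled 1
labels_default = (proba_pos > 0.5).astype(int)

# Stricter threshold of 0.85: the first subject is now labeled 0
labels_strict = (proba_pos > 0.85).astype(int)

print(labels_default)  # [1 1 1 0 1]
print(labels_strict)   # [0 1 1 0 1]
```

With the fitted model, the same pattern would be `(model.predict_proba(X)[:, 1] > 0.85).astype(int)`.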

---

**THINK ABOUT AND DISCUSS**

1) If you raise the threshold for predicting a positive label, what generally happens to the recall? What generally happens to the precision?

Answer: Because a higher confidence is required to predict the positive label, the precision will generally increase (fewer false positives). The recall, however, will decrease or at best stay the same, because some true positives fall below the new threshold and become false negatives. This is why it is important to measure precision and recall together.
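
The trade-off can be seen on a small example. The labels and probabilities below are made up for illustration (they are not from the breast cancer data), and `precision_recall` is a hypothetical helper written in plain Python:

```python
def precision_recall(y_true, proba, threshold):
    """Hypothetical helper: precision and recall at a given threshold."""
    y_pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall

# Made-up labels and predicted probabilities of the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.95, 0.90, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]

print(precision_recall(y_true, proba, 0.5))   # (0.75, 0.75)
print(precision_recall(y_true, proba, 0.85))  # (1.0, 0.5)
```

In this toy data, raising the threshold from 0.5 to 0.85 removes the one false positive (precision rises from 0.75 to 1.0) but also drops one true positive (recall falls from 0.75 to 0.5).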

---

**TRY FOR YOURSELF**

You will evaluate the model performance.

Hint: you can count the number of rows in a dataframe by using the `len()` function like this: `len(df)`

2) Compute the accuracy of the model, where accuracy = #correct / #total

In [39]:
# answer
len(df[df['label_predicted'] == df['target']]) / len(df)

0.8910369068541301

3) Count the number of true positives (where the model predicted 1 and the target is 1)

In [40]:
# answer
len(df[ (df['label_predicted'] == 1) & (df['target'] == 1) ])

173

4) Compute the recall of the model, where recall = #true_positive / #actual_positive

In [41]:
# answer
len(df[ (df['label_predicted'] == 1) & (df['target'] == 1) ]) / len( df[df['target'] == 1])

0.8160377358490566

---

5) As we saw earlier, the first subject has a probability of malignancy = 8.02529087e-01  
Compute this by using the intercept, coefficients, f1, f2 values, and the definition of the sigmoid:

$\text{sigmoid}(z) = \dfrac{1}{1 + e^{-z}}, \quad \text{where } z = b_0 + b_1 x_1 + b_2 x_2$

Note this version uses two predictors.

In [51]:
#answer
b0 = -19.67135103
b1 = 1.0462619
b2 = 0.21688595
x1 = 17.99
x2 = 10.38

1 / ( 1 + np.exp(-(b0 + b1 * x1 + b2 * x2) ))

0.802529072692052

This agrees with the `predict_proba` value to about seven decimal places. The small remaining difference most likely comes from using the rounded intercept and coefficients printed earlier.

---