# Breast Cancer Detection

In this notebook, I explore logistic regression using a breast cancer diagnosis data set downloaded from Kaggle.

Link to data:

This project will look at how linear regression and logistic regression differ, and how logistic regression can be a powerful tool for classification.

Since the diagnosis has two possible outputs (benign or malignant), I will be exploring binary logistic regression. 

## Linear Regression

The basic model of a multivariate linear regression is:

$y_{i} = \alpha + \beta_{1}x_{i,1} + \beta_{2}x_{i,2} + \beta_{3}x_{i,3} + ... + \beta_{k}x_{i,k} + \mu_{i}$

where the inputs for a single observation, $i$, are $x_{i,1}, x_{i,2}, x_{i,3}, ..., x_{i,k}$, weighted by the coefficients $\beta_{1}, \beta_{2}, \beta_{3}, ..., \beta_{k}$, and $\mu_{i}$ is the error term. Together they give the output $y_{i}$. The fitted value $\hat{y_{i}}$ is the same expression without the error term.

TO DO: write in matrix notation

We can calculate $\textbf{b}$ by minimizing the sum of squared residuals, $\min_{\textbf{b}} \sum_{i}(y_{i}-\hat{y_{i}})^{2}$, where

$\textbf{b} = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \vdots \\ \beta_{k} \end{bmatrix}$

$\textbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,k} \\ x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{i,1} & x_{i,2} & x_{i,3} & \cdots & x_{i,k} \end{bmatrix}$

$\textbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{i} \end{bmatrix}$

$\textbf{e} = \begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \mu_{3} \\ \vdots \\ \mu_{i} \end{bmatrix}$

Using the matrix notation we can rewrite the regression equation as $$\textbf{y} = \textbf{Xb} + \textbf{e}$$ Solving for $\textbf{b}$ we get $$\textbf{b} = (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$$

TODO: Provide a proof
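
A sketch of the standard least-squares argument, assuming $\textbf{X}'\textbf{X}$ is invertible (the symbol $S$ for the sum of squared residuals is introduced here for illustration). Write the objective in matrix form and expand it:

$$S(\textbf{b}) = (\textbf{y} - \textbf{Xb})'(\textbf{y} - \textbf{Xb}) = \textbf{y}'\textbf{y} - 2\textbf{b}'\textbf{X}'\textbf{y} + \textbf{b}'\textbf{X}'\textbf{X}\textbf{b}$$

Setting the gradient with respect to $\textbf{b}$ to zero gives the normal equations:

$$\frac{\partial S}{\partial \textbf{b}} = -2\textbf{X}'\textbf{y} + 2\textbf{X}'\textbf{X}\textbf{b} = \textbf{0} \quad \Rightarrow \quad \textbf{X}'\textbf{X}\textbf{b} = \textbf{X}'\textbf{y}$$

Multiplying both sides by $(\textbf{X}'\textbf{X})^{-1}$ yields $\textbf{b} = (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$. Since $\textbf{X}'\textbf{X}$ is positive semi-definite, this stationary point is a minimum.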

## In Practice

In [159]:
import pandas as pd

data = pd.read_csv("Data/data.csv")

Before we solve for $\textbf{b}$, let's store the $\textbf{y}$ vector separately and split the data into training and testing datasets.

In [160]:
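# Encode the diagnosis as the target vector y: 1 = malignant ('M'), 0 = benign

y = data['diagnosis'].to_numpy()
y = [1 if i == 'M' else 0 for i in y]

# First 500 rows for training, rows 500-567 (68 rows) for testing.
# Note: the slice stops at 568, so the last row (index 568) is not used.
y_train = y[0:500]
y_test = y[500:568]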

# Creating X_train and X_test matrix

X = data.loc[: , data.columns!='diagnosis']
X = X.loc[: , X.columns!='Unnamed: 32']

X_train = X[0:500]
X_train = X_train.to_numpy()

X_test = X[500:568]
X_test = X_test.to_numpy()

# Printing dimensions

print(len(list(y)))
print(len(X_train[0]))

569
31


Looking at the number of rows and columns, we can say $k$ = 31 and the full data set has 569 observations, of which the first 500 are used for training. So our final $\textbf{b}$ vector should be of length 31.

Let's solve for the $\textbf{b}$ vector using the NumPy package.

In [161]:
import numpy as np

# Normal equation: b = (X'X)^{-1} X'y
X_prime = np.transpose(X_train)

b = np.dot(np.linalg.inv(np.dot(X_prime,X_train)),np.dot(X_prime,y_train))

In [162]:
print(b)

[-9.24030151e-11 -3.42552811e-01  1.21316299e-02  1.85691039e-02
  1.55657357e-03  1.01869352e+00 -1.55465160e+00  1.41561470e+00
  2.26443265e+00 -7.70338578e-01 -1.89557327e+01  3.71603650e-01
  1.26879320e-02 -1.85716975e-02 -1.34436057e-03  1.57048687e+01
 -3.17509290e+00 -4.03578648e+00  1.49186236e+01  2.91028503e-01
  1.28422577e+01  2.34848461e-01  5.44758303e-04 -2.78806414e-03
 -1.24979260e-03 -3.64378656e-01  1.31595399e-01  3.88886944e-01
  3.36723720e-01  8.91493998e-01  4.82658826e+00]
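
Explicitly inverting $\textbf{X}'\textbf{X}$ works here, but it can be numerically fragile when columns are nearly collinear or on very different scales. The following is a minimal sketch on synthetic data, not the notebook's data set, comparing the inverse-based formula with `np.linalg.solve` and `np.linalg.lstsq`. The names `X_demo`, `y_demo`, and `b_true` are illustrative assumptions.

```python
import numpy as np

# Synthetic data (assumption: 50 observations, 3 inputs, small noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
b_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ b_true + 0.01 * rng.normal(size=50)

# 1) Closed form with an explicit inverse, as in the cell above
b_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# 2) Solve the normal equations X'X b = X'y without forming the inverse
b_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# 3) Least-squares solver applied directly to X and y
b_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# All three estimates should agree up to floating-point error
print(np.allclose(b_inv, b_solve), np.allclose(b_inv, b_lstsq))
```

On well-conditioned data the three approaches agree. The solver-based variants are usually preferred because they avoid computing the inverse explicitly.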


In [163]:
y_pred = np.dot(X_test, b)
print(y_pred)

[ 0.11486802  0.73917221  0.10391303  1.25156621 -0.15711754 -0.05746075
  0.00239589  0.0344772   0.31353432  0.73864309  0.0715515   0.10538098
  0.69169125  0.29523675  0.30783559  0.087518    0.83409879  0.86568653
  0.11013268  0.13461524  0.04827287  1.14490559 -0.17682345  0.19412757
 -0.00958384 -0.10441507  0.25593557  0.0831417   0.30910114  0.00596248
  0.20123765  0.14560633  0.0883085   0.85952936 -0.03517548  0.929026
  0.4448618   0.2755596   0.06079819 -0.17847464 -0.13535756  0.5080809
  0.39935311  0.2224012  -0.03689219  0.21825966 -0.11798687  0.00207782
  0.07105424  0.22303138 -0.11151775 -0.01819313  0.28424832 -0.05713002
  0.25154136  0.26615756 -0.15549615  0.09498463  0.11746422  0.21482975
  0.39140479  0.0052234   1.28228326  1.17205382  1.33266842  0.95607624
  0.58220969  1.65436659]


In [164]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression(fit_intercept = False)

reg.fit(X_train, y_train)
b_sk = reg.coef_

In [165]:
print(b_sk)

[-9.24030153e-11 -3.42552812e-01  1.21316297e-02  1.85691041e-02
  1.55657357e-03  1.01869351e+00 -1.55465161e+00  1.41561470e+00
  2.26443264e+00 -7.70338578e-01 -1.89557327e+01  3.71603668e-01
  1.26879309e-02 -1.85716994e-02 -1.34436060e-03  1.57048687e+01
 -3.17509289e+00 -4.03578646e+00  1.49186236e+01  2.91028503e-01
  1.28422577e+01  2.34848459e-01  5.44758484e-04 -2.78806393e-03
 -1.24979260e-03 -3.64378640e-01  1.31595400e-01  3.88886939e-01
  3.36723717e-01  8.91493998e-01  4.82658826e+00]


In [166]:
y_pred_sk = reg.predict(X_test)
print(y_pred_sk)

[ 0.11486802  0.73917221  0.10391303  1.25156621 -0.15711754 -0.05746075
  0.00239589  0.0344772   0.31353432  0.73864309  0.0715515   0.10538098
  0.69169125  0.29523675  0.30783559  0.087518    0.83409879  0.86568653
  0.11013268  0.13461524  0.04827287  1.14490559 -0.17682345  0.19412757
 -0.00958384 -0.10441507  0.25593557  0.0831417   0.30910113  0.00596248
  0.20123765  0.14560633  0.0883085   0.85952936 -0.03517548  0.929026
  0.4448618   0.2755596   0.06079819 -0.17847464 -0.13535756  0.5080809
  0.39935311  0.2224012  -0.03689219  0.21825966 -0.11798687  0.00207782
  0.07105424  0.22303138 -0.11151775 -0.01819313  0.28424832 -0.05713002
  0.25154136  0.26615756 -0.15549615  0.09498463  0.11746422  0.21482975
  0.39140479  0.0052234   1.28228326  1.17205382  1.33266842  0.95607624
  0.58220969  1.65436659]


Linear regression returns continuous values, so to get a class label we will mark all values below 0.5 as 0 (benign), otherwise it will be a 1 (malignant).

In [168]:
y_pred = [0 if i <.5 else 1 for i in y_pred]
print(y_pred)

[0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


In [169]:
# actual values
print(y_test)

[0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


In [175]:
np.subtract(y_test,y_pred)

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  1,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])
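
The differences above can be tallied into the four confusion-matrix cells. This is a minimal sketch with the two label vectors copied from the printed outputs above. The names `tp`, `tn`, `fp`, and `fn` are introduced here for illustration. A difference of 1 (`y_test` = 1, `y_pred` = 0) is a false negative, and a difference of -1 is a false positive.

```python
import numpy as np

# Actual labels, copied from the printed y_test above (1 = malignant)
y_test = np.array([
    0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
    0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 1, 1, 1, 1, 1,
])

# Thresholded predictions, copied from the printed y_pred above
y_pred = np.array([
    0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
    0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 1, 1, 1, 1, 1, 1,
])

tp = int(np.sum((y_test == 1) & (y_pred == 1)))  # malignant, predicted malignant
tn = int(np.sum((y_test == 0) & (y_pred == 0)))  # benign, predicted benign
fp = int(np.sum((y_test == 0) & (y_pred == 1)))  # benign, predicted malignant
fn = int(np.sum((y_test == 1) & (y_pred == 0)))  # malignant, predicted benign

accuracy = (tp + tn) / len(y_test)
print(tp, tn, fp, fn, accuracy)
```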

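Out of the 68 patients in the testing dataset, the model misdiagnoses 3 total patients: 2 false negatives (difference of 1: malignant cases predicted as benign) and 1 false positive (difference of -1: a benign case predicted as malignant).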