# Breast Cancer Detection

In this notebook, I explore logistic regression using a breast cancer diagnosis data set downloaded from Kaggle.

Link to data:

This project will look at how linear regression and logistic regression differ, and how logistic regression can be a powerful tool for classification.

Since the diagnosis has two possible outputs (benign or malignant), I will be exploring binary logistic regression. 

## Linear Regression

The basic model of a multivariate linear regression is:

$y_{i} = \alpha + \beta_{1}x_{i,1} + \beta_{2}x_{i,2} + \beta_{3}x_{i,3} + \dots + \beta_{k}x_{i,k} + \mu_{i}$

Here the inputs for a single observation, $i$, are $x_{i,1}, x_{i,2}, x_{i,3}, \dots, x_{i,k}$. They are weighted by the coefficients $\beta_{1}, \beta_{2}, \beta_{3}, \dots, \beta_{k}$, and $\mu_{i}$ is the error term, which together give the output $y_{i}$. The fitted value $\hat{y_{i}}$ is the same expression without the error term.
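As a small illustration of the weighted sum, here is a minimal sketch that computes one fitted value. The numbers for `alpha`, `beta`, and `x_i` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
alpha = 0.5                          # intercept
beta = np.array([2.0, -1.0, 0.25])   # coefficients beta_1..beta_3
x_i = np.array([1.0, 3.0, 4.0])      # inputs x_{i,1}..x_{i,3}

# Fitted value: intercept plus the weighted sum of the inputs
y_hat_i = alpha + np.dot(beta, x_i)
print(y_hat_i)
```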

TO DO: write in matrix notation

We can calculate $\textbf{b}$ by minimizing the sum of squared residuals, $\min \sum_{i}(y_{i}-\hat{y_{i}})^{2}$, where

$\textbf{b} = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \vdots \\ \beta_{k} \end{bmatrix}$

$\textbf{X} = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \dots & x_{1,k} \\ x_{2,1} & x_{2,2} & x_{2,3} & \dots & x_{2,k} \\ x_{3,1} & x_{3,2} & x_{3,3} & \dots & x_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & x_{n,3} & \dots & x_{n,k} \end{bmatrix}$

$\textbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ \vdots \\ y_{n} \end{bmatrix}$

$\textbf{e} = \begin{bmatrix} \mu_{1} \\ \mu_{2} \\ \mu_{3} \\ \vdots \\ \mu_{n} \end{bmatrix}$

Here $n$ is the number of observations and $k$ is the number of inputs, so $\textbf{X}$ has $n$ rows and $k$ columns.
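The objective can be sketched with small arrays. The data and the candidate coefficients below are hypothetical; the point is only to show the residual vector $\textbf{y} - \textbf{Xb}$ and why the residuals are squared before summing.

```python
import numpy as np

# Hypothetical data: 4 observations, 2 inputs
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 3.0, 5.0, 9.0])
b = np.array([1.0, 2.0])   # candidate coefficients, not a fitted result

# Residuals: observed outputs minus fitted values
residuals = y - X @ b

# Positive and negative residuals can cancel in a plain sum,
# which is why the squared residuals are summed instead
plain_sum = np.sum(residuals)
ssr = np.sum(residuals ** 2)
print(plain_sum, ssr)
```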

Using matrix notation, we can rewrite the regression equation as $$\textbf{y} = \textbf{Xb} + \textbf{e}$$ Solving for the $\textbf{b}$ that minimizes the sum of squared residuals, we get $$\textbf{b} = (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$$

TODO: Provide a proof
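A short derivation sketch, assuming $\textbf{X}'\textbf{X}$ is invertible: write the sum of squared residuals $S(\textbf{b})$ in matrix form, differentiate with respect to $\textbf{b}$, and set the gradient to zero.

$$
\begin{aligned}
S(\textbf{b}) &= (\textbf{y} - \textbf{Xb})'(\textbf{y} - \textbf{Xb}) = \textbf{y}'\textbf{y} - 2\textbf{b}'\textbf{X}'\textbf{y} + \textbf{b}'\textbf{X}'\textbf{Xb} \\
\frac{\partial S}{\partial \textbf{b}} &= -2\textbf{X}'\textbf{y} + 2\textbf{X}'\textbf{Xb} = \textbf{0} \\
\textbf{X}'\textbf{Xb} &= \textbf{X}'\textbf{y} \\
\textbf{b} &= (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}
\end{aligned}
$$

The third line is often called the normal equations. The symbol $S$ is introduced here only for this sketch.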

## In Practice

In [71]:
import pandas as pd

data = pd.read_csv("Data/data.csv")

Before we solve for $\textbf{b}$, let's store the $\textbf{y}$ vector separately and split the data into training and testing datasets.

In [103]:
# Creating the y_train and y_test vectors
# Encode the diagnosis as 1 for malignant ('M') and 0 for benign

y = (data['diagnosis'] == 'M').astype(int).to_numpy()

y_train = y[:500]
y_test = y[500:]   # all remaining rows, including the last one

# Creating the X_train and X_test matrices
# Drop the target column and the 'Unnamed: 32' column

X = data.drop(columns=['diagnosis', 'Unnamed: 32'])

X_train = X[:500].to_numpy()
X_test = X[500:].to_numpy()

# Printing dimensions: total number of observations, then columns per row

print(len(y))
print(len(X_train[0]))

569
31


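Looking at the output, the full data set has 569 observations and each row of the input matrix has 31 columns, so $k$ = 31. The first 500 observations are used for training and the rest for testing. Our final $\textbf{b}$ vector should therefore be of length 31.

Let's solve for the $\textbf{b}$ vector using the NumPy package.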

In [104]:
import numpy as np

X_prime = np.transpose(X_train)

# b = (X'X)^{-1} X'y, computed by solving the system (X'X) b = X'y
b = np.linalg.solve(np.dot(X_prime, X_train), np.dot(X_prime, y_train))
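As a sanity check on the closed-form solution $\textbf{b} = (\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$, the sketch below compares it with NumPy's least-squares solver. It uses synthetic data rather than the breast cancer data, and the names `true_b`, `b_normal`, and `b_lstsq` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 50 observations, 3 inputs
X = rng.normal(size=(50, 3))
true_b = np.array([1.5, -2.0, 0.5])
y = X @ true_b + 0.01 * rng.normal(size=50)

# Normal equations: solve (X'X) b = X'y without forming the inverse
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# NumPy's least-squares solver, for comparison
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_normal, b_lstsq))
```

Solving the linear system instead of explicitly inverting $\textbf{X}'\textbf{X}$ is generally the more numerically stable choice, and `np.linalg.lstsq` also copes with cases where $\textbf{X}'\textbf{X}$ is close to singular.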

In [110]:
# Print the first row of the test matrix
print(X_test[0])

[9.14862e+05 1.50400e+01 1.67400e+01 9.87300e+01 6.89400e+02 9.88300e-02
 1.36400e-01 7.72100e-02 6.14200e-02 1.66800e-01 6.86900e-02 3.72000e-01
 8.42300e-01 2.30400e+00 3.48400e+01 4.12300e-03 1.81900e-02 1.99600e-02
 1.00400e-02 1.05500e-02 3.23700e-03 1.67600e+01 2.04300e+01 1.09700e+02
 8.56900e+02 1.13500e-01 2.17600e-01 1.85600e-01 1.01800e-01 2.17700e-01
 8.54900e-02]
