## Linear Regression
For the linear regression model, we again use the breast cancer diagnosis dataset. Although the target has only two possible outcomes, a linear regression fitted to numerically coded labels can still serve as a reasonable model for predicting the diagnosis.

In [1]:
# Import all the necessary libraries
import numpy as np 
import pandas as pd 
import random

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statistics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Set the Seaborn theme
sns.set_theme()

In [2]:
# Import the dataset as a DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/AmyrMa/INDE-577/main/data/data.csv")
# Drop the columns that are not used as features
df = df.drop(['id', 'Unnamed: 32'], axis=1)
# Encode the target: malignant ('M') -> -1, every other diagnosis -> 1
df['diagnosis'] = np.where(df.diagnosis.values == 'M', -1, 1)
y = df['diagnosis']
X = df.drop(['diagnosis'], axis=1).values

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 40)

In [4]:
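# Normal equation: beta = (X^T X)^{-1} X^T y
def beta(X, y):
    """Return the ordinary least squares coefficients for design matrix X."""
    return np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# Prepend a column of ones so the first coefficient is the intercept.
# The row count is read from the data instead of being hard-coded.
design_mat = np.c_[np.ones((X_train.shape[0], 1)), X_train]
beta_hat = beta(design_mat, y_train)
beta_hat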

array([ 5.05726956e+00,  3.16098321e-01,  1.29020812e-03, -3.64927102e-02,
       -6.02477389e-04, -1.28674884e+00,  8.87513306e+00, -4.03412217e+00,
       -1.99091196e+00, -8.62853947e-01,  1.44080134e+00, -1.59879496e+00,
        3.31331788e-02,  1.19480020e-01,  2.68263004e-03, -2.36099168e+01,
        8.09098896e-01,  7.60795205e+00, -2.04619462e+01, -8.87360302e+00,
        3.05882766e+01, -3.22819324e-01, -2.47891654e-02,  2.39256793e-03,
        1.90917456e-03, -1.05615197e+00, -4.20315477e-02, -7.15161592e-01,
       -2.05205843e+00, -7.73505030e-02, -1.09838061e+01])
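
The normal equation above forms an explicit matrix inverse, which can lose precision when columns of the design matrix are strongly correlated. As a hedged sketch only, the snippet below compares that approach with NumPy's least-squares solver `np.linalg.lstsq` on small synthetic data. The names `X_demo`, `y_demo`, `design_demo`, and `true_coef` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the breast cancer set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([1.0, 2.0, -0.5, 0.25])  # intercept first, then three slopes

# Prepend a column of ones so the first coefficient is the intercept
design_demo = np.column_stack([np.ones(X_demo.shape[0]), X_demo])
y_demo = design_demo @ true_coef + rng.normal(scale=0.1, size=X_demo.shape[0])

# Normal equation with an explicit inverse, as in the cell above
beta_inv = np.linalg.inv(design_demo.T @ design_demo) @ design_demo.T @ y_demo

# Least-squares solver: minimizes the same objective without forming the inverse
beta_lstsq, _, _, _ = np.linalg.lstsq(design_demo, y_demo, rcond=None)

# On well-conditioned data the two estimates agree to numerical precision
print(np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data such as this, both routes give the same coefficients; the solver is mainly a safeguard for nearly collinear features.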

In [5]:
# Add the intercept column to the test set, then predict
X_test_mat = np.c_[np.ones((X_test.shape[0], 1)), X_test]
y_hat = X_test_mat.dot(beta_hat)

## Model performance
Here we calculate the $r^2$ value on the test set to assess the model's performance.

In [6]:
# r^2 = 1 - (residual sum of squares) / (total sum of squares)
r_squared = 1 - sum((y_test.values - y_hat)**2) / sum((y_test.values - statistics.mean(y_test.values))**2)

In [7]:
r_squared

0.8007752774401433
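
To make the $r^2$ calculation easier to reuse and check, here is a minimal sketch of the same formula written as a NumPy function. The function name `r_squared_score` and the toy arrays are assumptions for illustration; they do not appear in the notebook.

```python
import numpy as np

def r_squared_score(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy labels coded as -1 / 1, like the diagnosis column (illustrative values)
labels_toy = np.array([1, -1, 1, 1, -1])

# Predicting every label exactly gives the maximum score
perfect_score = r_squared_score(labels_toy, labels_toy)

# Always predicting the mean label explains none of the variance
baseline_score = r_squared_score(labels_toy, np.full(labels_toy.shape, labels_toy.mean()))

print(perfect_score, baseline_score)
```

The two reference cases bracket the usual range: a score near 1 means the predictions track the labels closely, while a score near 0 means the model does no better than predicting the mean.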

## Accuracy Score
Since the outcome is binary, we convert the continuous y_hat values to class labels: the prediction is set to -1 if y_hat is negative and to 1 otherwise.

In [8]:
y_pred = np.where(y_hat < 0, -1, 1)
print(f'Test accuracy score = {accuracy_score(y_test, y_pred)}')


Test accuracy score = 0.972027972027972
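
The thresholding step can be traced by hand on a few values. The sketch below uses made-up scores and labels (`scores_toy` and `truth_toy` are illustrative assumptions, not model output) and computes the accuracy as the fraction of matching labels.

```python
import numpy as np

# Made-up continuous predictions and true labels, for illustration only
scores_toy = np.array([-0.8, 0.3, 1.1, -0.1, 0.0, 0.6])
truth_toy = np.array([-1, 1, 1, 1, 1, -1])

# Same rule as above: negative scores map to -1, everything else (including 0) to 1
pred_toy = np.where(scores_toy < 0, -1, 1)

# Accuracy is the fraction of predictions that match the true labels
accuracy_toy = np.mean(pred_toy == truth_toy)

print(pred_toy, accuracy_toy)
```

Note that a score of exactly 0 falls into the positive class under this rule, because the cutoff sits midway between the two label codes.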


## Conclusion
We calculated the $r^2$ value to assess the model's performance: a value of 0.801 on the test set shows that the linear model captures a substantial share of the variation in the breast cancer diagnosis. After converting y_hat to binary output, the accuracy score is 0.97, which indicates a strong model. Later we will compare this result with the logistic regression model.