In [36]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score

Why logistic regression?
    - Logistic regression is used for binary classification. This suits our project as we are classifying whether a tumour is malignant (M) or benign (B).

How will we evaluate the model?
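   - Bias (systematic error: how far the model's predictions are, on average, from the true labels; high bias means underfitting)
   - Variance (how much the fitted model changes when the training sample changes; high variance means overfitting)

(Comparing performance on the training and test sets shows us if our model is overfitting / underfitting)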

Since it's a classification algorithm, we evaluate it with:
- Accuracy
- Precision
- Recall

We also use a confusion matrix to see the counts of TP, FP, TN, FN, treating malignant (M) as the positive class.
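
To make the link between the confusion-matrix counts and the three metrics concrete, here is a minimal sketch. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration; they are not results from this notebook.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really is positive, how much we caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, for illustration only.
accuracy, precision, recall = classification_metrics(tp=30, fp=5, tn=70, fn=10)
print(f'Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}')
```

Recall is the metric that false negatives hurt directly, which is why it matters most when M is the positive class.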

IMPORTANT:
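- Cost of a FN (a malignant tumour classified as benign) is much higher than the cost of a FP, so recall on M matters most
- Develop a dummy (baseline) model to compare against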

Chi-squared tests to see which variables carry little information about the label?
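
One possible way to try this is scikit-learn's `chi2` scorer. The sketch below assumes that `sklearn.datasets.load_breast_cancer` (scikit-learn's bundled copy of the Wisconsin Diagnostic Breast Cancer data) is an acceptable stand-in for `wdbc.csv`; the column names in the CSV may differ. The names `X`, `y` and `ranking` are hypothetical. Note that `chi2` requires non-negative features and is designed for counts or frequencies, so on continuous measurements the scores are only a rough heuristic and depend on each feature's scale.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2

# Assumption: the bundled dataset stands in for './Data Exploration/wdbc.csv'.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# chi2 returns one test statistic and one p-value per feature.
chi2_stats, p_values = chi2(X, y)

ranking = (
    pd.DataFrame({'feature': X.columns, 'chi2': chi2_stats, 'p_value': p_values})
    .sort_values('chi2', ascending=False)
    .reset_index(drop=True)
)

# Features at the bottom of the ranking are candidates for removal.
print(ranking.tail())
```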

In [37]:
df = pd.read_csv('./Data Exploration/wdbc.csv')
features = df.drop(labels='B/M', axis=1)
labels = df['B/M']

Distribution of B / M. We will use this result when comparing our model to a dummy model later.

In [38]:
labels.value_counts()

B    357
M    212
Name: B/M, dtype: int64

In [39]:
# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
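
The split above is random on every run, so the class balance of the test set and the reported scores will vary between runs. A minimal sketch of a reproducible, stratified split follows. It assumes `load_breast_cancer` as a stand-in for `wdbc.csv`; the names `features_demo` and `labels_demo` and the seed `42` are hypothetical choices, not part of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for './Data Exploration/wdbc.csv'.
data = load_breast_cancer(as_frame=True)
features_demo = data.data
# In the bundled dataset 0 means malignant and 1 means benign.
labels_demo = data.target.map({0: 'M', 1: 'B'})

# stratify keeps the B / M ratio the same in both splits;
# random_state makes the split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    features_demo, labels_demo,
    test_size=0.2, stratify=labels_demo, random_state=42,
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```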

In [40]:
# Fit model to training data
model = LogisticRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)
# Accuracy
model.score(x_test, y_test)

0.6491228070175439

In [42]:
y_pred

array(['B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'], dtype=object)

We're getting an accuracy of approx. 65%. Let's investigate the types of errors that we're getting (TP, FP, TN, FN) with a confusion matrix:

In [44]:
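# Rows are true labels, columns are predicted labels, both ordered ['B', 'M'].
# M (malignant) is the positive class.
c = confusion_matrix(y_test, y_pred, labels=['B', 'M'])

print(f'True negatives: {c[0][0]}')
print(f'False negatives: {c[1][0]}')
print(f'True positives: {c[1][1]}')
print(f'False positives: {c[0][1]}')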

True negatives: 74
False negatives: 40
True positives: 0
False positives: 0


We are getting 0 true positives. Our model is only predicting B's (no M's), so its accuracy of approx. 65% is just the share of benign tumours in the test set (74 of 114).

In [52]:
print(f'Recall: {recall_score(y_test, y_pred, pos_label="M")}')
print(f'Precision: {precision_score(y_test, y_pred, pos_label="M", zero_division=0)}')

Recall: 0.0
Precision: 0.0


All 40 malignant tumours in the test set are false negatives: the model is classifying every M tumour as B. This causes much more harm than classifying B tumours as M.

Next: Compare model to dummy model
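
A minimal sketch of such a baseline, using scikit-learn's `DummyClassifier` with the `most_frequent` strategy (it always predicts the majority class). It assumes `load_breast_cancer` as a stand-in for `wdbc.csv`; the variable names and the seed `0` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for './Data Exploration/wdbc.csv'.
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target.map({0: 'M', 1: 'B'})  # 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline that ignores the features and always predicts the majority class.
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

dummy_accuracy = dummy.score(X_test, y_test)
dummy_recall = recall_score(y_test, dummy_pred, pos_label='M')

print(f'Dummy accuracy: {dummy_accuracy:.3f}')
print(f'Dummy recall (M): {dummy_recall:.3f}')
```

A real model is only useful if it beats this baseline, in particular on recall for M, where the baseline scores zero by construction.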

Thoughts:
    - Data preprocessing is needed.
    - Another model? SVMs?
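
As one candidate preprocessing step, the sketch below standardises the features before fitting logistic regression. It is an assumption, not something established in this notebook, that unscaled features contribute to the all-B predictions; the CSV may also contain columns (for example an ID) that the bundled `load_breast_cancer` stand-in does not. The variable names, `max_iter=1000` and the seed `0` are hypothetical choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for './Data Exploration/wdbc.csv'.
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target.map({0: 'M', 1: 'B'})  # 0 = malignant, 1 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then applies the
# same transformation to the test data, which avoids leaking test statistics.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

scaled_pred = scaled_model.predict(X_test)
scaled_accuracy = scaled_model.score(X_test, y_test)
scaled_recall = recall_score(y_test, scaled_pred, pos_label='M')

print(f'Accuracy: {scaled_accuracy:.3f}')
print(f'Recall (M): {scaled_recall:.3f}')
```

Swapping `LogisticRegression` for `sklearn.svm.SVC` inside the same pipeline would be one way to try the SVM idea with identical preprocessing.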