# Import Libraries

In [28]:
# Import necessary libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

These imports cover data handling (pandas, NumPy), the train/test split, both classifiers, and the evaluation metrics used below.


# Data Preparation

In [21]:
# Load Dataset
os.chdir('/content/drive/MyDrive/Colab/Datasets/')
df = pd.read_csv('Breast_Cancer.csv')

# Remove rows with missing values, if any
df.dropna(inplace=True)

# Handle duplicates if any
df.drop_duplicates(inplace=True)

# Encode outcome variable 'Status': drop_first drops 'Alive', leaving Status_Dead (Dead = 1, Alive = 0)
df = pd.get_dummies(df, columns=['Status'], drop_first=True)

# Encode other Categorical Data
df = pd.get_dummies(df, columns=['Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'A Stage', 'Estrogen Status', 'Progesterone Status'], drop_first=True)

# Find Grade values = 'anaplastic; Grade IV' and convert to 4
df['Grade'] = df['Grade'].map(lambda x: 4 if x == ' anaplastic; Grade IV' else x)

# Split Data set 80% Training and 20% Test
X = df.drop('Status_Dead', axis=1)
y = df['Status_Dead']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check sizes
print(f'Training set: {X_train.shape}')
print(f'Test set: {X_test.shape}')

Training set: (3218, 27)
Test set: (805, 27)


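- Load the dataset from Drive
- Drop rows with missing values
- Drop duplicate rows
- Encode the target variable as `Status_Dead` (Alive = 0, Dead = 1)
- One-hot encode the other categorical variables
- Convert the Grade value 'anaplastic; Grade IV' to 4
- Set `Status_Dead` as the target variable
- Split the dataset into 80% training and 20% test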
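The target encoding can be checked on a tiny example. This is a minimal sketch using a hypothetical four-row frame (not rows from `Breast_Cancer.csv`); it shows that `drop_first=True` drops the alphabetically first category, `Alive`, so the remaining `Status_Dead` column is 1 for Dead and 0 for Alive.

```python
import pandas as pd

# Hypothetical stand-in for the 'Status' column (illustration only)
toy = pd.DataFrame({'Status': ['Alive', 'Dead', 'Alive', 'Alive']})

# drop_first=True removes the first category ('Alive'),
# leaving a single indicator column for 'Dead'
encoded = pd.get_dummies(toy, columns=['Status'], drop_first=True)

print(encoded.columns.tolist())
# Cast to int because get_dummies may return bool or uint8 depending on the pandas version
print(encoded['Status_Dead'].astype(int).tolist())
```
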

# Fitting the Model

In [27]:
# Implement Logistic regression to train data
model_log = LogisticRegression()
model_log.fit(X_train, y_train)

y_pred_log = model_log.predict(X_test)

# Implement MLPClassifier to train data
model_mlp = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=5000)
model_mlp.fit(X_train, y_train)

y_pred_mlp = model_mlp.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


- Train logistic regression on the training set and predict on the unseen test set. The warning above shows that the solver hit its iteration limit before converging. Raising `max_iter` gives the solver more iterations to converge, and scaling the features (as the warning suggests) usually helps as well.
- Train the MLPClassifier on the training set and predict on the unseen test set. The network has two hidden layers with 10 and 5 neurons respectively, and a limit of 5000 iterations.
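One way to follow the warning's advice is to scale the features before fitting. The sketch below is not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for the real dataset, and the names `X_demo`, `scaled_log`, and the value `max_iter=1000` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 27 features to mirror the shape printed earlier
X_demo, y_demo = make_classification(n_samples=500, n_features=27, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize features, then fit logistic regression with a higher iteration cap.
# The pipeline learns the scaling from the training split only.
scaled_log = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log.fit(X_tr, y_tr)

y_pred_demo = scaled_log.predict(X_te)
print(y_pred_demo.shape)
```

Wrapping the scaler and the model in one pipeline keeps the test set out of the scaling statistics, which would not be the case if the whole dataset were scaled before the split.
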

# Evaluation Metrics

In [29]:
# Calculate accuracy, precision, recall & F1 score for logistic regression
accuracy_log = accuracy_score(y_test, y_pred_log)
precision_log = precision_score(y_test, y_pred_log)
recall_log = recall_score(y_test, y_pred_log)
f1_score_log = f1_score(y_test, y_pred_log)

# Calculate accuracy, precision, recall & F1 score for neural network
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
precision_mlp = precision_score(y_test, y_pred_mlp)
recall_mlp = recall_score(y_test, y_pred_mlp)
f1_score_mlp = f1_score(y_test, y_pred_mlp)

# Print values for both logistic regression and the MLP
print('Logistic Regression Metrics:')
print(f'Accuracy: {accuracy_log:.2f}')
print(f'Precision: {precision_log:.2f}')
print(f'Recall: {recall_log:.2f}')
print(f'F1 Score: {f1_score_log:.2f}')

print('\nNeural Network Metrics:')
print(f'Accuracy: {accuracy_mlp:.2f}')
print(f'Precision: {precision_mlp:.2f}')
print(f'Recall: {recall_mlp:.2f}')
print(f'F1 Score: {f1_score_mlp:.2f}')

Logistic Regression Metrics:
Accuracy: 0.90
Precision: 0.81
Recall: 0.45
F1 Score: 0.58

Neural Network Metrics:
Accuracy: 0.90
Precision: 0.81
Recall: 0.46
F1 Score: 0.59


These are the metrics for each model. **Accuracy** is the share of all predictions that were correct. **Precision** is the share of predicted positives that were actually positive. **Recall** is the share of actual positives that the model identified. The **F1 score** is the harmonic mean of Precision and Recall. The positive class here is `Status_Dead` = 1.
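To make the four definitions concrete, the sketch below computes them by hand from confusion-matrix counts and compares the results with scikit-learn. The labels `y_true` and `y_hat` are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels (1 = Dead, 0 = Alive), for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))  # true negatives

accuracy = (tp + tn) / len(y_true)        # correct predictions / all predictions
precision = tp / (tp + fp)                # correct positives / predicted positives
recall = tp / (tp + fn)                   # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# The hand-computed values should agree with scikit-learn
assert abs(accuracy - accuracy_score(y_true, y_hat)) < 1e-9
assert abs(precision - precision_score(y_true, y_hat)) < 1e-9
assert abs(recall - recall_score(y_true, y_hat)) < 1e-9
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-9

print(f'Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, '
      f'Recall: {recall:.2f}, F1: {f1:.2f}')
```
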

## Comparison and Analysis

Both **Logistic Regression** and the **Neural Network** had the same **accuracy (90%)** and **precision (81%)**, meaning that when either model predicted a positive (here, Dead), it was correct equally often. However, the **Neural Network slightly outperformed** Logistic Regression in **recall (46% vs 45%)** and **F1 score (0.59 vs 0.58)**, so it identified a slightly larger share of the actual positives. Because the F1 score is the harmonic mean of precision and recall, the higher F1 reflects slightly better overall performance. Note that both models still miss more than half of the actual positives. The two models are close and the margin is only one percentage point, but the Neural Network edges out as the better model for this classification task.
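As a quick consistency check, the reported F1 scores can be recomputed from the reported precision and recall. This sketch uses only the rounded two-decimal values printed above, so it approximates the exact scores; the helper name `f1_from` is an assumption introduced for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the printed metrics above
f1_log_check = f1_from(0.81, 0.45)  # Logistic Regression
f1_mlp_check = f1_from(0.81, 0.46)  # Neural Network

print(f'Logistic Regression F1: {f1_log_check:.2f}')
print(f'Neural Network F1: {f1_mlp_check:.2f}')
```

With identical precision, the whole F1 difference comes from the one-point gap in recall.
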