# Breast Cancer Diagnosis

Please use the Breast Cancer Diagnosis dataset provided for cancer prediction.

Breast Cancer Wisconsin (Diagnostic): https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

The dataset can be found at: https://github.com/MRCIEU/python_and_health_ds_training/tree/main/day-3/data

In [None]:
# import libraries

import pandas as pd
import matplotlib.pyplot as plt

## Step 1: Read the data
**Q1.1:** Read the dataset from your local file

**Q1.2:** Print the information of the dataset using `info()`
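A minimal sketch of Q1.1 and Q1.2. So that it runs anywhere, a four-row in-memory CSV stands in for the real file; its values and column subset are illustrative assumptions, not the actual data. With the course data you would pass the path to your local file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV file (illustrative values only).
# The trailing comma on every line creates an empty, unnamed last column,
# which pandas labels "Unnamed: <position>".
csv_text = """id,diagnosis,radius_mean,texture_mean,
1,M,17.9,10.4,
2,B,12.1,18.2,
3,B,11.4,20.4,
4,M,20.6,17.8,
"""

# With the real data, replace the StringIO object with your local file path
df = pd.read_csv(io.StringIO(csv_text))

# info() prints the column names, non-null counts and dtypes
df.info()
```

In the `info()` output, a column whose non-null count is 0 is entirely empty and is a good candidate for dropping in Step 2.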

## Step 2: Exploratory Data Analysis (EDA)

**Q2.1:** Drop unnecessary columns (e.g. `id` and `Unnamed: 32`) and display the first few rows.

Hint: `df.drop()`
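As a rough illustration of the hint, the sketch below drops the two columns named in Q2.1 from a toy frame. The frame and its values are made up for the example; only the column names `id` and `Unnamed: 32` come from the exercise.

```python
import pandas as pd

# Toy frame containing the two columns named in Q2.1 plus illustrative features
df = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 12.1, 11.4],
    "Unnamed: 32": [None, None, None],
})

# errors="ignore" skips any label that is not present instead of raising
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")

# Display the first few rows
print(df.head())
```

`drop` returns a new DataFrame, so the result has to be assigned back (as above) for the change to persist.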

In [None]:
# Dropping unnecessary columns


# Display the first few rows


**Q2.2:** Check for missing values in the dataset. Remove any missing values you find.
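One possible way to approach this, sketched on a toy frame with a single missing entry (the data are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in radius_mean
df = pd.DataFrame({
    "radius_mean": [17.9, np.nan, 11.4, 20.6],
    "texture_mean": [10.4, 18.2, 20.4, 17.8],
})

# Number of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean.shape)
```

Note that `dropna()` removes a row if any of its columns is missing, so a column that is empty throughout should be dropped as a column first; otherwise every row would be removed.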

**Q2.3:** Show the distribution of each feature
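A possible starting point is `DataFrame.hist()`, which draws one histogram per numeric column. The sketch below uses randomly generated columns as a stand-in; the column names and distribution parameters are assumptions chosen only to make the example runnable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for three numeric feature columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "radius_mean": rng.normal(14, 3.5, 200),
    "texture_mean": rng.normal(19, 4, 200),
    "area_mean": rng.lognormal(6.4, 0.5, 200),
})

# One histogram per numeric column, arranged on a grid of subplots
axes = df.hist(bins=20, figsize=(9, 6))
plt.tight_layout()
plt.show()
```

With the full feature set, a larger `figsize` keeps the individual panels readable.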

**Q2.4:** Plot the distribution of the target value (i.e. diagnosis).

Hint: use `value_counts()` to count the number of samples in each category.
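The hint could be applied roughly as follows. The six labels here are invented for illustration; with the real data, `df` is the DataFrame loaded in Step 1.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy diagnosis column (B = benign, M = malignant)
df = pd.DataFrame({"diagnosis": ["B", "M", "B", "B", "M", "B"]})

# Count how many samples fall into each category
counts = df["diagnosis"].value_counts()
print(counts)

# Bar chart of the class counts
counts.plot(kind="bar")
plt.xlabel("Diagnosis")
plt.ylabel("Number of samples")
plt.title("Distribution of diagnosis")
plt.show()
```

A clearly unequal bar height signals class imbalance, which is worth keeping in mind when interpreting accuracy in Step 5.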

**Q2.5:** Visualize correlations between features and identify highly correlated pairs.

Hint: correlation needs numeric values, so first map `B` and `M` to 0 and 1 using `df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)`

In [None]:
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)
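A sketch of one way to tackle Q2.5, using synthetic columns so that the result is predictable: `perimeter` is constructed from `radius`, while `texture` is independent. The column names, the heatmap styling, and the 0.9 threshold are all assumptions for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.normal(14, 3.5, 200)
df = pd.DataFrame({
    "radius": radius,
    # Built as a near-linear function of radius, so the two correlate strongly
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, 200),
    "texture": rng.normal(19, 4, 200),  # independent of the other two
})

corr = df.corr()

# Heatmap of the correlation matrix
plt.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.show()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off chosen for this example
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Masking the lower triangle avoids reporting each pair twice and excludes the trivial correlation of a feature with itself.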

## Step 3: Data Pre-processing

**Q3.1 (Optional):** Select features based on the correlation matrix

**Q3.2:** Separate features (`X`) and targets (`y`)

Optional: Standardise features 

**Q3.3:** Split the dataset into training and test set

In [None]:
from sklearn.model_selection import train_test_split
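The sketch below walks through Q3.2 and Q3.3 under some assumptions. To stay self-contained it loads scikit-learn's bundled copy of the Wisconsin Diagnostic data via `load_breast_cancer` rather than the course CSV, so column names differ (e.g. `mean radius` instead of `radius_mean`) and the label encoding is reversed. The 80/20 split and `random_state=42` are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the course CSV: scikit-learn's bundled copy of the dataset
data = load_breast_cancer(as_frame=True)
X = data.data
# The bundled copy encodes malignant as 0 and benign as 1; flip it so that
# malignant = 1, matching the mapping used in Step 2.
y = 1 - data.target

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Optional standardisation: fit the scaler on the training set only, then
# apply it to both sets, so no information from the test set leaks in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With the course CSV, the equivalent separation would be along the lines of `X = df.drop(columns=['diagnosis'])` and `y = df['diagnosis']`.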

## Step 4: Using two different Machine Learning algorithms to predict cancer

Please select and apply two different models for prediction.

https://scikit-learn.org/stable/supervised_learning.html

**Q4.1:** Algorithm 1

**Q4.2:** Algorithm 2

**Q4.3:** Make predictions with the models

In [None]:
# Make predictions
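As one illustrative pairing for Q4.1 to Q4.3, the sketch below fits a logistic regression and a random forest. These two are only an example choice; any two classifiers from the scikit-learn page linked above would do. The variable names `logistic_model` and `rf_model` match those used in the ROC code in Step 6. The data again come from scikit-learn's bundled copy (an assumption, see the label flip in the comment), and the hyperparameters are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: bundled copy of the dataset, flipped so that malignant = 1
data = load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Algorithm 1: logistic regression, with standardisation inside a pipeline
logistic_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logistic_model.fit(X_train, y_train)

# Algorithm 2: random forest (tree-based, so no scaling is needed)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the held-out test set
y_pred_logistic = logistic_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
```

Wrapping the scaler and the classifier in one pipeline means the scaler is fitted on the training data only and is reapplied automatically whenever the model predicts.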

## Step 5: Evaluate the models

Compare the performance of the two models using accuracy, precision, recall, and F1 score.

**Q5.1:** Create a function to calculate the model performance.

Hint: The function takes `y_true` and `y_pred` as input and returns `accuracy`, `precision`, `recall`, and `f1`.

`from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score`

https://scikit-learn.org/1.5/modules/model_evaluation.html

In [None]:
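from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define a function to print and return evaluation metrics
def print_evaluation_metrics(y_true, y_pred):
    # Precision, recall and F1 treat label 1 (malignant) as the positive class
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
    return accuracy, precision, recall, f1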

**Q5.2:** Evaluate the models

In [None]:
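# Assumes the fitted models are named `logistic_model` and `rf_model`, as in Step 6
print("Model 1:")
print_evaluation_metrics(y_test, logistic_model.predict(X_test))
print("\nModel 2:")
print_evaluation_metrics(y_test, rf_model.predict(X_test))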

**Q5.3:** Compare the models’ performance using their confusion matrices

In [None]:
from sklearn.metrics import confusion_matrix
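To show how the output of `confusion_matrix` is laid out, here is a small sketch on hand-written labels (invented for the example, with 1 as the positive class). For binary labels, rows are the true classes and columns the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hand-written toy labels: 1 = positive (malignant), 0 = negative (benign)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

For the exercise, the same call would be made once per model, e.g. with `y_test` and each model's predictions, and the two matrices compared cell by cell.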


## Step 6: Visualise the Results

### ROC (Receiver Operating Characteristic) Curve
- Evaluates a binary classification model by its ability to distinguish between two classes (e.g., "positive" and "negative") across all decision thresholds.
- The larger the area under the curve (AUROC), the better the performance.
- The diagonal line represents a no-skill classifier (random guessing).
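The points above can be made concrete with a tiny hand-written example (the four labels and scores are invented for illustration). `roc_curve` returns the false and true positive rates at each threshold, and `roc_auc_score` computes the area under that curve directly from the scores.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)
print(tpr)

# Area under the ROC curve
auroc = roc_auc_score(y_true, y_score)
print(auroc)
```

Here the AUROC is 0.75: of the four possible positive/negative pairs, three have the positive sample scored higher than the negative one. A no-skill classifier would sit near 0.5.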


**Q6.1:** Plot ROC curves for both models

Hint: use `predict_proba()` to get the predicted probability of each class instead of the class label

In [None]:
from sklearn.metrics import roc_curve, auc

# Generate ROC curve values
# (assumes the two fitted models from Step 4 are named `logistic_model` and `rf_model`)
logistic_fpr, logistic_tpr, _ = roc_curve(y_test, logistic_model.predict_proba(X_test)[:, 1])
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_model.predict_proba(X_test)[:, 1])

# Calculate AUC
logistic_auc = auc(logistic_fpr, logistic_tpr)
rf_auc = auc(rf_fpr, rf_tpr)

# Plot ROC curves
plt.figure(figsize=(8, 6))
plt.plot(logistic_fpr, logistic_tpr, label=f"Logistic Regression (AUC = {logistic_auc:.3f})")
plt.plot(rf_fpr, rf_tpr, label=f"Random Forest (AUC = {rf_auc:.3f})")
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves for Model Comparison")
plt.legend()
plt.show()