[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CompOmics/D012554A_2025/blob/main/notebooks/day_2/2.1b_Exercises_Histone_marks_lr.ipynb)

# 2.1 Exercises – Logistic Regression & Classification

In the lecture notebook you applied logistic regression to classify gene expression levels from histone modification signals. In these exercises you will apply those same techniques to a new binary classification problem: diagnosing breast tumours as malignant or benign.

## Dataset: Breast Cancer Wisconsin (Diagnostic)

The features are computed from digitised images of fine needle aspirates (FNA) of breast masses. There are 30 real-valued features describing the cell nuclei present in each image:

| Feature group | Examples |
|---------------|----------|
| Mean values | `mean radius`, `mean texture`, `mean perimeter`, ... |
| Standard errors | `radius error`, `texture error`, ... |
| Worst (largest) values | `worst radius`, `worst texture`, ... |

The target is the diagnosis: `0 = malignant`, `1 = benign`.

Throughout these exercises you will:
1. Load and explore the data
2. Fit a baseline logistic regression model
3. Evaluate with accuracy and log-loss
4. Apply feature scaling and re-evaluate
5. Tune the regularisation hyperparameter `C` with cross-validation
6. Inspect feature importance from the model coefficients

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

random_seed = 123
np.random.seed(random_seed)

---
## Exercise 1 – Load and explore the dataset

Load the Breast Cancer dataset using `sklearn.datasets.load_breast_cancer()` and convert it to a Pandas DataFrame.

1. Print the shape of the feature matrix.
2. Display the first 5 rows.
3. Show the class distribution (how many malignant vs. benign?).

Hint: The returned object has `.data`, `.feature_names`, and `.target` attributes.
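
The same pattern applies to any of scikit-learn's built-in datasets. The sketch below uses the wine dataset as a stand-in (an assumption made so the exercise itself stays open); the names `wine`, `wine_df` and `wine_counts` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Illustration on a different built-in dataset (wine), not the exercise data
wine = load_wine()

# Feature matrix as a DataFrame, with the feature names as column labels
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

print(wine_df.shape)   # (n_samples, n_features)
print(wine_df.head())  # first 5 rows

# Class distribution: number of samples per target label
wine_counts = pd.Series(wine.target).value_counts().sort_index()
print(wine_counts)
```

The breast cancer object exposes the same three attributes, so only the loader call should need to change.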

In [None]:
from sklearn.datasets import load_breast_cancer

# YOUR CODE HERE

---
## Exercise 2 – Visualise the features

1. Create a heatmap of a random sample of 20 rows (use `sns.heatmap`).
2. Create boxplots of all 30 features.

What do you notice about the scales of different features?

In [None]:
# YOUR CODE HERE

*Your observations here*

---
## Exercise 3 – Boxplots grouped by feature type

The 30 features fall into three groups of 10, identified by a keyword in the column name (a prefix for `mean` and `worst`, a suffix for `error`):
- mean features (e.g. `mean radius`)
- error features (e.g. `radius error`)
- worst features (e.g. `worst radius`)

Create separate boxplots for each group (3 plots total), similar to the lecture’s per-histone-mark boxplots.

Hint: Use a list comprehension to select columns containing `"mean"`, `"error"`, or `"worst"`.
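
As a rough illustration of the hint, the sketch below selects columns by keyword on a small made-up DataFrame. The frame `toy` and its values are hypothetical; only the naming pattern mirrors the real feature names.

```python
import pandas as pd

# Hypothetical toy frame; the column names only mimic the naming pattern
toy = pd.DataFrame({
    "mean radius": [14.1, 20.6],
    "mean texture": [19.3, 17.8],
    "radius error": [0.4, 0.5],
    "worst radius": [16.3, 25.0],
})

# One list of column names per keyword
groups = {
    key: [col for col in toy.columns if key in col]
    for key in ("mean", "error", "worst")
}
print(groups)

# A group can then be plotted on its own, e.g. toy[groups["mean"]].boxplot()
```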

In [None]:
# YOUR CODE HERE

---
## Exercise 4 – Train/validation split & first logistic regression

1. Split the data into 80% training / 20% validation (`random_state=123`).
2. Fit a `LogisticRegression` model (set `max_iter=10000`).
3. Compute and print the accuracy on both the training and validation sets.

Hint: Use `train_test_split` from `sklearn.model_selection`.
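
The three steps can be sketched on synthetic data as follows. The data from `make_classification`, the seed `0` and all variable names are assumptions for illustration; the exercise asks for the breast cancer data and `random_state=123`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 5 features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# test_size=0.2 keeps 20% of the rows for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_model = LogisticRegression(max_iter=10000)
demo_model.fit(X_tr, y_tr)

acc_tr = accuracy_score(y_tr, demo_model.predict(X_tr))
acc_va = accuracy_score(y_va, demo_model.predict(X_va))
print(f"train accuracy: {acc_tr:.3f}, validation accuracy: {acc_va:.3f}")
```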

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# YOUR CODE HERE

---
## Exercise 5 – Evaluate with log-loss

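Accuracy only records whether each predicted class is correct; it ignores how confident the model was. Log-loss uses the predicted probabilities and penalises confident wrong predictions heavily. Compute the log-loss on both the training and validation sets using `predict_proba`.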

For $N$ samples with true labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$ of class 1, the log-loss is:

$$-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right]$$

Print both the training and validation log-loss.

Hint: Use `log_loss` from `sklearn.metrics` and the `[:,1]` column of `predict_proba`.
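
To see that `log_loss` matches the formula, the sketch below evaluates both on four made-up predictions. The arrays `y_true` and `p_hat` are invented for illustration; in the exercise the probabilities would come from the `[:, 1]` column of `predict_proba`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.7])

# Direct translation of the formula
manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# The two printed values should agree
print(manual, log_loss(y_true, p_hat))
```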

In [None]:
from sklearn.metrics import log_loss

# YOUR CODE HERE

---
## Exercise 6 – Feature scaling with MinMaxScaler

As you saw in the boxplots, the features are at very different scales.

1. Scale all features to [0, 1] using `MinMaxScaler`.
2. Important: Fit the scaler on the training set only, then transform both train and validation.
3. Fit a new `LogisticRegression` on the scaled data.
4. Print the accuracy and log-loss on both sets.
5. Compare with the unscaled results from Exercises 4–5.
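
The fit-on-train-only rule from step 2 can be illustrated with a single made-up feature. The arrays below are invented; `fit_transform` learns the minimum and maximum from the training rows and `transform` reuses them.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train_col = np.array([[1.0], [3.0], [5.0]])
valid_col = np.array([[0.0], [6.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min=1, max=5 from the training rows
valid_scaled = scaler.transform(valid_col)      # reuses those values, no refitting

print(train_scaled.ravel())
print(valid_scaled.ravel())
```

Because the validation rows are scaled with the training minimum and maximum, their scaled values can fall slightly outside [0, 1] (here -0.25 and 1.25). That is expected and is the price of keeping validation information out of the preprocessing step.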

In [None]:
from sklearn.preprocessing import MinMaxScaler

# YOUR CODE HERE

*Your comparison here*

---
## Exercise 7 – Hyperparameter tuning with GridSearchCV

The regularisation parameter `C` is the inverse of the regularisation strength: smaller values of `C` penalise large coefficients more strongly, while larger values penalise them less.
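
A quick way to get a feel for `C` is to compare the size of the fitted coefficients for a few values. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration, not the exercise data) using the default L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)

coef_norms = {}
for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C, max_iter=10000).fit(X_c, y_c)
    # Euclidean norm of the coefficient vector for this value of C
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

The norm should grow as `C` increases, since a larger `C` means a weaker penalty.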

1. Define a parameter grid: `C = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 10, 100]`.
2. Use `GridSearchCV` with 5-fold cross-validation and `scoring='neg_log_loss'`.
3. Fit on the scaled training data.
4. Print the best `C` value.
5. Plot the mean cross-validation log-loss vs. `C` (use a log scale for the x-axis). The scores stored by `GridSearchCV` are negated log-losses, so flip their sign before plotting.
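
The mechanics of the search can be sketched on synthetic data with a shortened grid. The data, the three-value grid `demo_grid` and the variable names are assumptions for illustration; the exercise uses the scaled training data and the full grid from step 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_gs, y_gs = make_classification(n_samples=200, n_features=5, random_state=0)

demo_grid = {"C": [0.01, 1, 100]}  # shortened grid, for illustration only
search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid=demo_grid,
    cv=5,
    scoring="neg_log_loss",
)
search.fit(X_gs, y_gs)

print(search.best_params_)

# mean_test_score holds the negated log-loss, so flip the sign to recover it
mean_cv_log_loss = -search.cv_results_["mean_test_score"]
print(mean_cv_log_loss)

# The refitted best model is available as search.best_estimator_
```

The entries of `mean_cv_log_loss` follow the order of the grid, so they can be plotted directly against `demo_grid["C"]` with `plt.xscale("log")`.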

In [None]:
from sklearn.model_selection import GridSearchCV

# YOUR CODE HERE

---
## Exercise 8 – Evaluate the tuned model

Using the `best_estimator_` from the grid search:

1. Compute the validation log-loss.
2. Compute the validation accuracy.
3. Compare with the untuned results from Exercise 6.

In [None]:
# YOUR CODE HERE

*Your comparison here*

---
## Exercise 9 – Feature importance

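When the features are on a common scale (as after min-max scaling), the magnitude of a logistic regression coefficient gives a rough indication of that feature's importance. Its sign shows whether larger feature values push the prediction towards class 1 (benign) or class 0 (malignant).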

1. Extract the coefficients from the best model.
2. Create a DataFrame with columns `Feature` and `Coefficient`.
3. Sort by the absolute value and show the top 10.
4. Create a horizontal bar plot of these top 10 features.

Which features are the most important for the diagnosis?
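
One possible way to build and rank such a table is sketched below on synthetic data. The feature names `feature_0` to `feature_4` are hypothetical placeholders, and the model is a plain `LogisticRegression` rather than the tuned one from the grid search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_fi, y_fi = make_classification(n_samples=200, n_features=5, random_state=0)
feature_labels = [f"feature_{i}" for i in range(X_fi.shape[1])]  # hypothetical names

fi_model = LogisticRegression(max_iter=10000).fit(X_fi, y_fi)

coef_df = pd.DataFrame({
    "Feature": feature_labels,
    "Coefficient": fi_model.coef_[0],  # coef_ has shape (1, n_features) for binary problems
})

# Rank by magnitude while keeping the sign in the table
top = coef_df.sort_values(
    "Coefficient", key=lambda s: s.abs(), ascending=False
).head(3)
print(top)
```

Sorting with a `key` keeps the signed coefficients in the table, which a horizontal bar plot such as `top.plot.barh(x="Feature", y="Coefficient")` can then show directly.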

In [None]:
# YOUR CODE HERE

---
## Bonus – Confusion matrix & precision/recall

In medical diagnosis, false negatives (missing a malignant tumour) can be more dangerous than false positives.

1. Compute the confusion matrix on the validation set using `sklearn.metrics.confusion_matrix`.
2. Visualise it as a heatmap with `sns.heatmap(annot=True)`.
3. Print the precision and recall for the malignant class (class 0).

Hint: Use `classification_report` from sklearn for a quick summary.
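
Because malignant is encoded as class 0, the per-class metrics need `pos_label=0`; by default scikit-learn treats label 1 as the positive class. The sketch below uses eight made-up predictions (not model output) to show the layout of the matrix and the two metrics.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = malignant, 1 = benign
y_true_cm = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_cm = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, both ordered 0, 1
cm = confusion_matrix(y_true_cm, y_pred_cm)
print(cm)

# pos_label=0 makes the malignant class the "positive" one for these metrics
prec_0 = precision_score(y_true_cm, y_pred_cm, pos_label=0)
rec_0 = recall_score(y_true_cm, y_pred_cm, pos_label=0)
print(f"precision (class 0): {prec_0:.2f}, recall (class 0): {rec_0:.2f}")
```

In this toy case the entry `cm[0, 1]` counts the malignant samples predicted as benign, which are the false negatives the exercise text warns about.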

In [None]:
# YOUR CODE HERE