# Logistic Regression with Scikit-Learn
In this notebook, we apply **Logistic Regression** to the *Breast Cancer* dataset included in `scikit-learn`.

Educational Objectives:
- Understand the workflow of a supervised model in Python
- Apply preprocessing, training, and evaluation
- Interpret the outputs and performance metrics

In [1]:
# Environment Setup 

# !pip install numpy pandas seaborn matplotlib scikit-learn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Environment ready — scikit-learn, NumPy, Pandas, Seaborn, and Matplotlib installed.")

Environment ready — scikit-learn, NumPy, Pandas, Seaborn, and Matplotlib installed.


## 1. Breast Cancer Dataset
The dataset contains **569 patients** and **30 features** extracted from breast biopsy images.

- Each row corresponds to a patient.
- Each column represents a morphological characteristic of the tumor mass (e.g., mean radius, texture, symmetry).
- The target variable indicates the diagnosis: **0 = malignant, 1 = benign**.
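
The figures above can be checked directly against the dataset object. The following is a minimal sketch that assumes only that `scikit-learn` is installed; the variable names `n_samples` and `n_features` are illustrative and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset bundled with scikit-learn.
data = load_breast_cancer()

# Number of samples (rows) and features (columns).
n_samples, n_features = data.data.shape
print(n_samples, n_features)

# Class names in label order: label 0 is the first name, label 1 the second.
print(list(data.target_names))
```

Reading `target_names` is a quick way to confirm the label encoding before interpreting any metric.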

## 2. Loading and Inspecting the Data
Let us load the dataset and convert it into a Pandas `DataFrame` to visualize the first few rows.

In [2]:
# We load the "Breast Cancer" dataset integrated into scikit-learn.
data = load_breast_cancer()

# We extract features (X)
X = pd.DataFrame(data.data, columns=data.feature_names)
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
# We extract the target (y) from the dataset.
y = pd.Series(data.target, name='target')

# Count the samples in each class.
y.value_counts()

target
1    357
0    212
Name: count, dtype: int64

Output: the first cell shows the first five rows of the feature table, with 30 numerical variables. The second shows the class counts of `target`: 357 benign (1) and 212 malignant (0).

## 3. Division into Training and Test Sets
We divide the data into two sets:
- **Training set**: used to train the model.
- **Test set**: used to evaluate the generalization capability.

The function `train_test_split` lets us set the size of the test set and, with the `stratify` argument, preserve the class proportions in both subsets.

In [4]:
# We divide the dataset into training (80%) and testing (20%).
# stratify=y → maintains the same class distribution in the two sets
# random_state=42 → ensures reproducible results
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Let's check the size of the two sets.
X_train.shape, X_test.shape

((455, 30), (114, 30))

Output: number of observations in the training and test set. With `test_size=0.2`, 20% of the data is used for testing.
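
To see what `stratify=y` does in practice, the sketch below compares the fraction of benign samples (label 1) in the full dataset, the training set, and the test set. It repeats the split with the same arguments as the notebook; the `frac_*` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and target as pandas objects.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Because labels are 0/1, the mean of y is the fraction of benign samples.
frac_full = y.mean()
frac_train = y_train.mean()
frac_test = y_test.mean()
print(round(frac_full, 3), round(frac_train, 3), round(frac_test, 3))
```

With stratification, the three fractions should differ only by rounding, since the subsets cannot always hit the exact ratio with whole samples.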

## 4. Preprocessing with StandardScaler
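Regularized logistic regression is sensitive to feature scale: the penalty treats all coefficients alike, and the solver converges more easily when features share a comparable range. `StandardScaler` standardizes each variable according to the formula: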

$$x'_i = \frac{x_i - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of that feature, calculated on the training set only. The test set is then transformed with the same parameters, so no information from the test data enters the preprocessing step.
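
The formula can be reproduced by hand with NumPy and compared against `StandardScaler`. This is a sketch for illustration, not a replacement for the scaler; `mu`, `sigma`, and the `*_manual` arrays are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Per-feature statistics, computed on the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

# The same mu and sigma are applied to both sets.
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

# Compare with scikit-learn's implementation.
scaler = StandardScaler().fit(X_train)
print(np.allclose(X_train_manual, scaler.transform(X_train)))
print(np.allclose(X_test_manual, scaler.transform(X_test)))
```

Note that the transformed test set will not have exactly zero mean and unit variance, because its own statistics differ slightly from those of the training set.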

In [5]:
# Let's initialize the standardizer.
scaler = StandardScaler()

# Let's calculate the mean and standard deviation on the training set and transform the data.
X_train_scaled = scaler.fit_transform(X_train)

# We apply the same transformation to the test set (without refitting).
X_test_scaled = scaler.transform(X_test)

# Let's visualize the first rows of the standardized data.
X_train_scaled[:5]

array([[-1.07200079e+00, -6.58424598e-01, -1.08808010e+00,
        -9.39273639e-01, -1.35939882e-01, -1.00871795e+00,
        -9.68358632e-01, -1.10203235e+00,  2.81062120e-01,
        -1.13231479e-01, -7.04860874e-01, -4.40938351e-01,
        -7.43948977e-01, -6.29804931e-01,  7.48061001e-04,
        -9.91572979e-01, -6.93759567e-01, -9.83284458e-01,
        -5.91579010e-01, -4.28972052e-01, -1.03409427e+00,
        -6.23497432e-01, -1.07077336e+00, -8.76534437e-01,
        -1.69982346e-01, -1.03883630e+00, -1.07899452e+00,
        -1.35052668e+00, -3.52658049e-01, -5.41380026e-01],
       [ 1.74874285e+00,  6.65017334e-02,  1.75115682e+00,
         1.74555856e+00,  1.27446827e+00,  8.42288215e-01,
         1.51985232e+00,  1.99466430e+00, -2.93045055e-01,
        -3.20179716e-01,  1.27567198e-01, -3.81382677e-01,
         9.40746962e-02,  3.17524379e-01,  6.39656015e-01,
         8.73892616e-02,  7.08450758e-01,  1.18215034e+00,
         4.26212305e-01,  7.47970186e-02,  1.22834212e+

Output: first rows of the transformed training set. Each feature now has mean 0 and standard deviation 1 over the training data.

## 5. Model Creation and Training

The **Logistic Regression** model estimates the **probability** that a sample belongs to a certain class (e.g., 0 or 1).

##### Key Parameters:
- `max_iter`: maximum number of solver iterations; increase it (e.g., `1000`) if the solver does not converge.
- `C`: inverse of the regularization strength $\lambda$ (smaller `C` = stronger regularization).
- `penalty`: type of regularization (`'l2'` is the default and is supported by `lbfgs`).
- `solver`: optimization algorithm (e.g., `'lbfgs'`, *Limited-memory BFGS*, a quasi-Newton method that approximates curvature information and usually converges in fewer iterations than plain gradient descent).
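
The effect of `C` can be made visible by looking at the size of the learned coefficients. The sketch below is illustrative only: for brevity it fits on the whole standardized dataset rather than on a training split, and the list `norms` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per value of C and record the L2 norm of its coefficients.
norms = []
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_scaled, y)
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm = {norms[-1]:.2f}")
```

Smaller values of `C` shrink the coefficients toward zero, which is the practical meaning of "more regularization".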

The formula of the model is:
$$
P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p)}}
$$

Where:
- $\beta_0$ is the intercept  
- $\beta_i$ are the coefficients that the model learns from the data.

In [6]:
# Let's create the logistic regression model.
model = LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42)

# We train the model on the training data (scaled features and target).
model.fit(X_train_scaled, y_train)

Output: the model has learned the coefficients $\beta_i$ and the intercept $\beta_0$ from the training data.
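
To connect the formula to the fitted model, the sketch below recomputes $P(y=1|x)$ by hand from `intercept_` and `coef_` and compares it with `predict_proba`. It rebuilds the split, scaling, and model so that it runs on its own; `z`, `p_manual`, and `p_sklearn` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)

# Linear part: beta_0 + beta_1*x_1 + ... + beta_p*x_p for each test sample.
z = model.intercept_[0] + X_test_scaled @ model.coef_[0]

# The sigmoid maps the linear score to P(y=1 | x).
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba holds the probability of class 1.
p_sklearn = model.predict_proba(X_test_scaled)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

`model.predict` then assigns class 1 whenever this probability exceeds 0.5, which is equivalent to `z > 0`.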

## 6. Model Evaluation on the Test Set

After training, we evaluate the model's performance on the **test set**.

##### 1. **Accuracy**
$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of samples}}
$$
This measures the overall percentage of correct classifications.

##### 2. **Confusion Matrix**

It shows how the model classifies the actual samples:

|                | **Predicted: 0** | **Predicted: 1** |
|:---------------|:----------------|:----------------|
| **Actual: 0**  | True Negatives (TN) | False Positives (FP) |
| **Actual: 1**  | False Negatives (FN) | True Positives (TP) |

##### 3. **Classification Report**

It summarizes the key metrics for each class, treating that class in turn as the "positive" one:

- **Precision**
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
→ among all the model's positive predictions, how many are **correct**.

- **Recall**
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
→ among all the *truly positive* cases, how many are **recognized** by the model.

- **F1-score**
$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
→ harmonic mean of precision and recall; balances **precision** and **recall** (sensitivity).

- **Support**
$$
\text{Support} = \text{number of actual samples belonging to that class}
$$

- **Macro avg:** simple average of the metrics across classes  
- **Weighted avg:** weighted average based on the number of samples in each class

In [7]:
# We predict the labels on the test set.
y_pred = model.predict(X_test_scaled)

# Let's calculate the evaluation metrics.
acc = accuracy_score(y_test, y_pred)            # Overall accuracy
cm = confusion_matrix(y_test, y_pred)           # Confusion matrix
report = classification_report(y_test, y_pred)  # Precision, Recall, F1-score

# Let's print the results.
print("Accuracy:", acc, "\n")
print("Confusion Matrix:\n", cm, "\n")
print("Classification Report:\n", report)

Accuracy: 0.9824561403508771 

Confusion Matrix:
 [[41  1]
 [ 1 71]] 

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



Output:
- `acc`: model accuracy on the test set.
- `cm`: confusion matrix (rows = true values, columns = predicted).
- `report`: precision, recall, and F1-score for each class.

**Final Interpretation:**  
Accuracy measures overall performance, while the classification report shows *how well the model recognizes each class*.
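
As a worked check, the metrics for class 1 can be recomputed by hand from the confusion matrix printed above, using the formulas from this section. The four counts are copied from that output; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class.
tn, fp = 41, 1   # actual 0: predicted 0, predicted 1
fn, tp = 1, 71   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.98 precision=0.99 recall=0.99 f1=0.99
```

These values match the row for class 1 in the classification report. Swapping the roles of the two classes (treating 0 as positive) gives the row for class 0.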

## 7. Cross-Validation

**Cross-validation** divides the dataset into *k* subsets (folds).  
The model is trained on *k − 1* folds and validated on the remaining single fold.  
This process is repeated *k* times, changing the validation fold each time.

The final performance is the **average of the scores obtained** in each cycle:

$$
CV_{score} = \frac{1}{k} \sum_{i=1}^{k} score_i
$$

In this way, a more **stable and reliable** estimate of the model's performance is obtained, reducing dependence on a single data split.
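
The procedure can be written out as an explicit loop. The sketch below uses `StratifiedKFold`, which is what scikit-learn selects for a classifier when `cv` is given as an integer; it rebuilds the training data so that it runs on its own, and `fold_scores` and `cv_score` are names introduced here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_scaled = StandardScaler().fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in folds.split(X_train_scaled, y_train):
    # Train a fresh copy of the model on k - 1 folds.
    fold_model = clone(model)
    fold_model.fit(X_train_scaled[train_idx], y_train[train_idx])
    # Validate on the remaining fold (accuracy).
    fold_scores.append(fold_model.score(X_train_scaled[val_idx], y_train[val_idx]))

# CV score = average of the k fold scores.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```

Every training sample is used for validation exactly once, which is why the average is less sensitive to one lucky or unlucky split.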

##### 5-Fold Cross-Validation

Using `cross_val_score`, we evaluate the model on 5 different splits of the training set.  
We obtain 5 accuracy values (one per fold) and their **mean**, which summarizes the accuracy we can expect from the model on unseen data.

In [8]:
# We perform a 5-fold cross-validation (cv=5).
# The model is trained and validated 5 times on different splits of the training set.
# scoring='accuracy' → we use accuracy as the evaluation metric
scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')

# Let's visualize the scores of each fold and their average.
scores, scores.mean()

(array([0.96703297, 0.97802198, 0.96703297, 1.        , 0.98901099]),
 np.float64(0.9802197802197803))

Output: accuracy obtained in each fold and final average, which provides a more robust estimate of performance.
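
One caveat: the scaler above was fit on the entire training set before cross-validation, so each validation fold has already contributed to the scaling statistics. On a dataset like this the effect is likely small, but a common way to avoid it is to wrap the scaler and the model in a `Pipeline`, so that scaling is refit inside every fold. The following is a sketch of that alternative, not the approach used in the notebook; `pipe` and `pipe_scores` are names introduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is refit within each fold, on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The unscaled training data is passed in; the pipeline handles the scaling.
pipe_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(pipe_scores, pipe_scores.mean())
```

The same pipeline object can be passed to `GridSearchCV`, which keeps the hyperparameter search free of this kind of leakage as well.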

## 8. Hyperparameter Search with GridSearchCV

**Grid Search** systematically explores different **combinations of hyperparameters**.  
For each combination, it performs **cross-validation** and calculates the average performance of the model.  
Finally, it selects the configuration that achieves the highest average score.

Selection criterion:

$$
\text{best\_params} = \arg\max_{\theta \in \text{grid}} \; CV\_score(\theta)
$$

`GridSearchCV` finds the hyperparameters $\theta$ that maximize the average performance in cross-validation.
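
The selection criterion amounts to a loop over the grid followed by an arg max. The sketch below spells this out for the single hyperparameter `C`; it is a simplified illustration of the idea rather than a description of how `GridSearchCV` is implemented, and `grid_C`, `cv_means`, `best_C`, and `best_cv` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Candidate values of C (the grid).
grid_C = [0.01, 0.1, 1, 10]

# Mean 5-fold cross-validation accuracy for each candidate.
cv_means = []
for C in grid_C:
    clf = LogisticRegression(C=C, max_iter=1000)
    fold_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5, scoring="accuracy")
    cv_means.append(fold_scores.mean())

# arg max over the grid (the first candidate wins in case of a tie).
best_idx = int(np.argmax(cv_means))
best_C, best_cv = grid_C[best_idx], cv_means[best_idx]
print(best_C, best_cv)
```

In addition to this search, `GridSearchCV` by default refits the best configuration on the full training data and exposes it as `best_estimator_`.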

In [9]:
# Let's define the grid of hyperparameters to be tested.
# 'C' controls the intensity of the regularization (smaller → more regularization)
# 'penalty' indicates the type of penalty (here only L2, compatible with 'lbfgs')
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l2']}

# We initialize the Grid Search with 5-fold cross-validation.
# scoring='accuracy' → we use accuracy as the evaluation metric
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, solver='lbfgs'),
    param_grid,
    cv=5,
    scoring='accuracy'
)

# We perform the search for the best hyperparameters.
grid.fit(X_train_scaled, y_train)

# We present the best parameters found and the average score in cross-validation.
grid.best_params_, grid.best_score_

({'C': 0.1, 'penalty': 'l2'}, np.float64(0.9802197802197803))

Output:
- `best_params_`: optimal combination of hyperparameters found.
- `best_score_`: average accuracy achieved with these parameters in cross-validation.

## 9. Conclusions
We have built and evaluated a **logistic regression** model applied to the *Breast Cancer* dataset:

- Preprocessing with feature standardization.
- Training and evaluation on the test set.
- Use of cross-validation to estimate the robustness of the model.
- Hyperparameter optimization with Grid Search.

The model reaches about 98% accuracy both on the test set (0.982) and in 5-fold cross-validation (0.980), which makes logistic regression a simple yet effective baseline for this binary classification task.