# Logistic Regression on Breast Cancer Dataset

## Introduction
In this lab, we will apply a **Logistic Regression** model on the Breast Cancer dataset.
The goal is to predict whether a tumor is **Malignant (M)** or **Benign (B)** based on various cell nucleus features.

We will perform:
1. Data loading and cleaning  
2. Feature scaling  
3. Train-test splitting  
4. Model training using Logistic Regression  
5. Evaluation using Accuracy, Precision, Recall, F1-score, and Confusion Matrix


## Step 1: Import Libraries and Load Data


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, classification_report
)

I’m using scikit-learn for this lab because Logistic Regression is a standard classifier, and the goal is to understand evaluation metrics and binary classification, not to implement the model from scratch (as in Assignments 3–4).

In [2]:
# Load the dataset
df = pd.read_csv('breast_cancer_dataset.csv')

# Remove any unnamed columns (common when saving from Excel/Colab)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

print(" Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())

 Dataset loaded successfully!
Dataset shape: (569, 32)

First 5 rows:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10

The dataset has 569 samples and 32 columns, including an id, diagnosis (target), and 30 numeric features describing cell nuclei (e.g., radius, texture, concavity).

## Step 2: Data Preprocessing

In [3]:
# Separate features (X) and target (y)
X = df.drop(columns=['id', 'diagnosis'])  # Remove ID and diagnosis
y = df['diagnosis'].map({'M': 1, 'B': 0})  # Malignant = 1, Benign = 0

print(f"Features shape: {X.shape}")
print(f"Target distribution:\n{y.value_counts().sort_index()}")

Features shape: (569, 30)
Target distribution:
diagnosis
0    357
1    212
Name: count, dtype: int64


Why remove id?

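The id column is just a patient identifier. It carries no predictive information, and a model that sees it could fit spurious patterns in the ID values (for example, if IDs happen to correlate with how the samples were collected) instead of real tumor characteristics.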

In [4]:
# Handle missing or infinite values (if any)
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X.fillna(X.mean(), inplace=True)

# Check for remaining missing values
print(f"\nTotal missing values: {X.isna().sum().sum()}")  # Should be 0


Total missing values: 0


The dataset is already clean, but it’s good practice to handle inf/NaN defensively.
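To make the defensive cleaning concrete, here is a minimal sketch on a hypothetical toy frame (the column names `a` and `b` and their values are made up, not taken from the lab data). It applies the same two steps as the cell above: convert infinities to NaN, then fill each NaN with its column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the lab data): one inf and one NaN.
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],
    "b": [np.nan, 2.0, 4.0],
})

# Step 1: turn +/-inf into NaN so both problems are handled the same way.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

# Step 2: fill each NaN with its column mean (NaNs are skipped when averaging).
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned)
print("Total missing:", int(cleaned.isna().sum().sum()))
```

Column `a` is filled with 2.0 (the mean of 1 and 3) and column `b` with 3.0 (the mean of 2 and 4). On a dataset that really has gaps, it is generally safer to compute these means from the training rows only.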

## Step 3: Feature Scaling

In [5]:
# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(" Feature scaling completed!")

 Feature scaling completed!


Why scale?
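Logistic Regression is fit with an iterative solver. If features have very different scales (e.g., radius_mean ≈ 10–20, area_mean ≈ 500–2000), the solver converges more slowly and may reach max_iter before it has converged. scikit-learn's LogisticRegression also applies L2 regularization by default, and that penalty treats coefficients unevenly when features are on different scales. Standardizing puts all features on a comparable scale, so the penalty affects them evenly and the coefficient magnitudes can be compared.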
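As an illustration of what standardization does, the sketch below applies z = (x − mean) / std by hand with NumPy. The numbers are hypothetical, loosely modeled on the scales of radius_mean and area_mean, and are not taken from the dataset. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical values on two very different scales
# (loosely modeled on radius_mean and area_mean; not from the dataset).
X_toy = np.array([
    [12.0,  450.0],
    [14.0,  600.0],
    [18.0, 1000.0],
    [20.0, 1300.0],
])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)      # population std (ddof=0), as StandardScaler uses
X_std = (X_toy - mean) / std

print("means after scaling:", X_std.mean(axis=0).round(6))
print("stds after scaling: ", X_std.std(axis=0).round(6))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates purely because of its units.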



## Step 4: Train-Test Split

In [6]:
# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

Training set size: 455
Test set size: 114


Why stratify=y?

It ensures both train and test sets have the same proportion of Benign/Malignant cases—critical for imbalanced or small datasets.
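The effect of `stratify` can be checked directly. The sketch below uses synthetic labels with the same class counts as this lab (357 benign, 212 malignant) and a placeholder feature column, so it illustrates the mechanism rather than re-running the lab's split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the lab's class counts: 357 benign (0), 212 malignant (1).
y_toy = np.array([0] * 357 + [1] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

overall = y_toy.mean()   # share of malignant cases in the full set
print(f"malignant share overall: {overall:.3f}")
print(f"malignant share train:   {y_tr.mean():.3f}")
print(f"malignant share test:    {y_te.mean():.3f}")
```

All three shares should come out close to 0.37. Without `stratify`, the test share can drift noticeably on a dataset this small.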

## Step 5: Train Logistic Regression Model

In [7]:
# Initialize and train the model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

print(" Model trained successfully!")

 Model trained successfully!


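I set max_iter=1000 to give the solver room to converge (the default is 100, which can trigger a ConvergenceWarning, especially on unscaled features). Note that random_state has no effect with the default lbfgs solver; it is kept only for consistency.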
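One caveat about the workflow above: the scaler in Step 3 was fitted on all 569 rows before the split, so test-set statistics influence the transformation. A common alternative is to split first and wrap the scaler and the model in a `Pipeline`. The sketch below shows that pattern and also checks how many iterations the solver used. It assumes scikit-learn's bundled copy of the dataset as a stand-in for `breast_cancer_dataset.csv`. Its labels are encoded the other way round (0 = malignant, 1 = benign), so its scores are not directly comparable to the ones reported in this lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for breast_cancer_dataset.csv.
# Its target encoding is 0 = malignant, 1 = benign (the reverse of this lab).
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split BEFORE scaling so test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)   # scaler statistics come from X_tr only

iterations = int(pipe.named_steps["clf"].n_iter_[0])
print(f"solver iterations used: {iterations} (limit 1000)")
print(f"test accuracy: {pipe.score(X_te, y_te):.4f}")
```

If `iterations` is well below the limit, the solver converged on its own and `max_iter=1000` simply acted as headroom.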

## Step 6: Make Predictions

In [8]:
# Predict on test set
y_pred = model.predict(X_test)

The model outputs class labels (0 or 1), which we’ll compare against true labels.
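Behind those class labels is a probability: the model computes a linear score, passes it through the sigmoid, and predicts class 1 when the result exceeds 0.5. The sketch below reproduces that step with hypothetical scores (not outputs from the trained model) and shows how a lower threshold changes the labels. In scikit-learn the probabilities would come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) for four test samples.
scores = np.array([-2.0, -0.3, 0.1, 1.5])
proba_malignant = sigmoid(scores)

# Default behaviour: predict class 1 when the probability exceeds 0.5.
labels_default = (proba_malignant > 0.5).astype(int)

# A lower threshold flags more cases as malignant
# (typically higher recall at the cost of precision).
labels_sensitive = (proba_malignant > 0.3).astype(int)

print(proba_malignant.round(3))
print(labels_default, labels_sensitive)
```

The second sample, with a probability of about 0.43, is labeled benign at the default threshold but malignant at 0.3. This is the lever to consider when missed malignant cases matter more than false alarms.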

## Step 7: Model Evaluation

In [9]:
# Compute evaluation metrics
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("\n MODEL EVALUATION RESULTS")
print("=" * 40)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print("\nConfusion Matrix:")
print(cm)
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Benign (B)', 'Malignant (M)']))


 MODEL EVALUATION RESULTS
Accuracy:  0.9649 (96.49%)
Precision: 0.9750
Recall:    0.9286
F1-Score:  0.9512

Confusion Matrix:
[[71  1]
 [ 3 39]]

Detailed Classification Report:
               precision    recall  f1-score   support

   Benign (B)       0.96      0.99      0.97        72
Malignant (M)       0.97      0.93      0.95        42

     accuracy                           0.96       114
    macro avg       0.97      0.96      0.96       114
 weighted avg       0.97      0.96      0.96       114



These metrics evaluate the binary classifier on the 114 held-out test samples, treating Malignant (1) as the positive class.

**Summary:**
*   **Accuracy (96.49%):** The model correctly predicted 96.49% of all cases.
*   **Precision (97.50%):** When the model predicts "Malignant", it's correct 97.5% of the time.
*   **Recall (92.86%):** The model correctly identifies 92.86% of all actual "Malignant" cases.
*   **F1-Score (95.12%):** A balanced measure of precision and recall.

The **Confusion Matrix** shows:
*   71 Benign cases correctly identified.
*   1 Benign case incorrectly classified as Malignant (false positive).
*   3 Malignant cases incorrectly classified as Benign (false negatives).
*   39 Malignant cases correctly identified.

Overall, the model performs very well, with high accuracy and a good balance between precision and recall.
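As a sanity check, the four headline metrics can be recomputed by hand from the confusion matrix printed in Step 7. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so for labels 0/1 it reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix from Step 7: rows = true class, columns = predicted class.
cm = [[71, 1],
      [3, 39]]

tn, fp = cm[0]   # true Benign row: predicted Benign, predicted Malignant
fn, tp = cm[1]   # true Malignant row: predicted Benign, predicted Malignant

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The results (0.9649, 0.9750, 0.9286, 0.9512) match the values reported by `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.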

## Feature Importance


In [11]:
# Get feature coefficients (higher absolute value = more important)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', key=abs, ascending=False)

print("\n Top 10 Most Important Features:")
print(feature_importance.head(10))


 Top 10 Most Important Features:
                Feature  Coefficient
21        texture_worst     1.442609
10            radius_se     1.207811
28       symmetry_worst     1.060806
7   concave points_mean     0.945871
13              area_se     0.914838
26      concavity_worst     0.908802
15       compactness_se    -0.906313
23           area_worst     0.894827
20         radius_worst     0.879742
6        concavity_mean     0.778171


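The largest coefficients (in absolute value) belong to:

*   texture_worst
*   radius_se
*   symmetry_worst

Several shape and size features (concave points_mean, concavity_worst, area_worst, radius_worst) also appear in the top 10. This is consistent with medical knowledge: irregular cell shapes and large nuclei are strong indicators of malignancy. Because many of these features are strongly correlated with one another, individual coefficient magnitudes should be read as a rough ranking rather than a precise measure of importance.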
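One way to read these coefficients is as odds ratios. Because the features were standardized, exp(coefficient) is the factor by which the odds of Malignant change for a one-standard-deviation increase in that feature, with the others held fixed. The sketch below applies this to four coefficients copied from the table above. It is an interpretation aid, not part of the original analysis.

```python
import math

# Coefficients copied from the top-10 table above (standardized features).
coefficients = {
    "texture_worst": 1.442609,
    "radius_se": 1.207811,
    "symmetry_worst": 1.060806,
    "compactness_se": -0.906313,
}

# exp(coef) = multiplicative change in the odds of Malignant
# for a one-standard-deviation increase in the feature, others held fixed.
odds_ratios = {name: math.exp(c) for name, c in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<16} odds ratio = {ratio:.2f}")
```

A ratio above 1 (texture_worst, roughly 4.2) pushes the prediction toward Malignant, while a ratio below 1 (compactness_se, roughly 0.4) pushes it toward Benign.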



### Conclusion
This lab demonstrated how Logistic Regression can be effectively applied to a real-world binary classification problem in healthcare.

Key takeaways:

*   Data preprocessing (scaling, encoding) is essential for stable training.
*   Evaluation metrics beyond accuracy (precision, recall, F1) are critical in medical contexts.
*   The model achieved strong performance on the test set (96.49% accuracy, 92.86% recall), showing Logistic Regression is a strong baseline for this task.
*   The model missed 3 of 42 malignant cases (false negatives). Since a missed cancer is the costliest error in early detection, recall is the metric to watch when tuning this model further.

#### Deliverables Completed:

*   Data loading & cleaning
*   Feature scaling
*   Train-test split
*   Model training
*   Comprehensive evaluation (confusion matrix, accuracy, precision, recall, F1)
*   Clinical interpretation
