#### Module 8: Dimensionality Reduction

#### Case Study 3

Domain: Health Care

Focus: Cancer detection

Business challenge/requirement

John Cancer Hospital (JCH) is a leading cancer hospital in the USA. It specializes in preventing breast cancer.
Throughout the last few years, JCH has collected breast cancer data from patients who came for screening/treatment.
However, this data has almost 30 attributes, which makes models difficult to run and their results difficult to interpret. As an ML expert, you have to reduce the number of attributes (dimensionality reduction) so that the results are meaningful and accurate.

Key issues

Reduce the number of attributes/features in the data so that doctors can comprehend the results and analysis.

Data volume

- Approx. 569 records, in the file breast-cancer-data.csv

Fields in Data

- Details in the ipynb notebook

Business benefits

An improved success rate of cancer detection, and hence a direct impact on the revenue and profit of the hospital. More than that, it contributes to JCH's mission, "Better Life".

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Load the dataset
df = pd.read_csv("breast-cancer-data.csv")

# Step 2: Separate features and target
# Assumes the target column is 'diagnosis' (M = malignant, B = benign).
# Every other column is kept as a feature; the run below reports 31 of them,
# so a non-predictive column such as a record ID may be included.
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Step 3: Standardize features (PCA is sensitive to feature scale)
# Note: the scaler and PCA are fitted on all rows, before the train-test split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 4: PCA transformation (keep enough components to retain 95% of the variance)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("Original number of features:", X.shape[1])
print("Reduced number of features (PCA):", pca.n_components_)

# Step 5: Train-test split (80/20, stratified by diagnosis)
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42, stratify=y
)

# Step 6: Fit the Logistic Regression model
log_reg = LogisticRegression(max_iter=2000)
log_reg.fit(X_train, y_train)

# Step 7: Predict on the test set and evaluate
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Logistic Regression Accuracy (PCA-transformed data):", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Original number of features: 31
Reduced number of features (PCA): 11
Logistic Regression Accuracy (PCA-transformed data): 0.9824561403508771

Classification Report:
               precision    recall  f1-score   support

           B       0.97      1.00      0.99        72
           M       1.00      0.95      0.98        42

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
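
The figures in the report can be cross-checked with a little arithmetic. The sketch below assumes the per-class counts implied by the rounded report (72 of 72 benign and 40 of 42 malignant cases classified correctly). These counts are inferred, not printed by the notebook, and the variable names are illustrative.

```python
# Counts inferred from the classification report (an assumption: the notebook
# does not print the confusion matrix itself).
benign_total, malignant_total = 72, 42      # "support" column
benign_correct, malignant_correct = 72, 40  # implied by recall 1.00 and 0.95

total = benign_total + malignant_total
accuracy = (benign_correct + malignant_correct) / total

# Recall: share of each true class that the model recovered.
recall_b = benign_correct / benign_total
recall_m = malignant_correct / malignant_total

# Precision: share of each predicted class that was actually that class.
# Misclassified malignant cases end up in the "predicted benign" group,
# and misclassified benign cases in the "predicted malignant" group.
predicted_benign = benign_correct + (malignant_total - malignant_correct)
predicted_malignant = malignant_correct + (benign_total - benign_correct)
precision_b = benign_correct / predicted_benign
precision_m = malignant_correct / predicted_malignant

print(f"accuracy      = {accuracy:.4f}")
print(f"recall (B, M) = {recall_b:.2f}, {recall_m:.2f}")
print(f"precision (B, M) = {precision_b:.2f}, {precision_m:.2f}")
```

Under these inferred counts, both errors are malignant cases predicted as benign, which is why malignant recall (0.95) is the figure to watch even though overall accuracy is 0.98.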

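
Passing a fraction such as `n_components=0.95` asks scikit-learn's `PCA` to keep the smallest number of leading components whose cumulative explained-variance ratio exceeds that fraction. A minimal sketch of that selection rule follows; the ratios are made up for illustration (they are not this dataset's values) and `components_for_variance` is a hypothetical helper, not a library function.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Return the smallest k whose first k ratios sum to more than threshold.

    `ratios` are per-component explained-variance ratios, largest first.
    """
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > threshold:
            return k
    return len(ratios)  # threshold never exceeded: keep every component


# Hypothetical ratios, for illustration only.
example_ratios = [0.50, 0.25, 0.125, 0.0625, 0.0625]

# Cumulative sums are 0.5, 0.75, 0.875, 0.9375, 1.0.
k = components_for_variance(example_ratios, 0.9)
print("Components needed for 90% of the variance:", k)
```

For the fitted `pca` object in the notebook, the real ratios are available as `pca.explained_variance_ratio_`, so the same check can be applied to see how the 11 retained components add up.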

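
One thing to watch in the workflow above is that the scaler and PCA are fitted before the train-test split, so the test rows help shape the transformation. A hedged sketch of a variant that splits first and fits every step on the training rows only, using a scikit-learn pipeline, is shown below. To stay self-contained it loads scikit-learn's bundled breast cancer dataset as a stand-in for breast-cancer-data.csv (an assumption; the hospital file may differ), so its numbers are not expected to match the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 569 rows, 30 numeric features, binary target.
X, y = load_breast_cancer(return_X_y=True)

# Split first so the test rows stay unseen while the transforms are fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling, PCA and the classifier are fitted together on the training rows only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=2000),
)
model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
n_kept = model.named_steps["pca"].n_components_
test_accuracy = model.score(X_test, y_test)

print("Components kept:", n_kept)
print("Test accuracy:", test_accuracy)
```

Wrapping the steps in one pipeline also means `model.predict` applies the same scaling and projection to new patient records automatically.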