# Step 1: Import Libraries
Import the necessary libraries, including NumPy, pandas, scikit-learn, and Matplotlib.

- numpy and matplotlib.pyplot are the most essential libraries
- pandas lets me view the features as a DataFrame
- sklearn.datasets allows me to load the breast cancer dataset
- StandardScaler will be used to standardize the features for PCA
- PCA is the main topic here. I will import this from sklearn's decomposition module.
- train_test_split is required to split the data into training and test sets
- LogisticRegression will be the classification model
- accuracy_score will be used to evaluate prediction performance

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



# Step 2: Load the Dataset
Load the Breast Cancer Wisconsin dataset using the `load_breast_cancer()` function from scikit-learn and separate the features (X) from the target labels (y).

- X will be 30 numeric features
- y will be binary labels that are either 0 (malignant) or 1 (benign)

In [5]:
data = load_breast_cancer()
X = data.data
y = data.target

# Wrap the features in a DataFrame so they are easier to inspect
df = pd.DataFrame(X, columns=data.feature_names)
#print(df.head())
#print(df.describe())
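
To double-check the label encoding and see how balanced the two classes are, a minimal sketch like the following could be run. The names `label_names` and `class_counts` are my own and only for illustration; `target_names` and `np.bincount` are existing scikit-learn and NumPy features.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# target_names is ordered by label value: index 0 is label 0, index 1 is label 1
label_names = [str(name) for name in data.target_names]

# Count how many samples carry each label
class_counts = np.bincount(y)

print("feature matrix shape:", X.shape)
for label, (name, count) in enumerate(zip(label_names, class_counts)):
    print(f"label {label} = {name}: {count} samples")
```

If the classes turn out to be imbalanced, accuracy alone can be a slightly optimistic measure, which is worth keeping in mind for the later steps.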


# Step 3: Split the Data
For this assignment, I will split the data into 80% training and 20% testing so that I can train the model on one portion and evaluate it on unseen data for a fair accuracy comparison. Setting `random_state=42` makes the split reproducible.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
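
As a quick sanity check on the split, the sketch below prints the resulting shapes and the actual test fraction. It reloads the data with `return_X_y=True` only to stay self-contained; `n_total` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_total = X.shape[0]

# The two parts should add up to the full dataset
print("train:", X_train.shape, "test:", X_test.shape)
print(f"test fraction: {X_test.shape[0] / n_total:.3f}")
```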

# Step 4: Standardize the Data
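To standardize the data, I will use `StandardScaler()`, so that each feature has a mean of 0 and a standard deviation of 1 on the training set. I want to ensure that all features contribute to the PCA equally. If I don't standardize the data, features with larger scales could dominate the principal components. The scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into the preprocessing.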

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
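
To confirm the scaling behaves as described, a small check along these lines could be added. It is only a sketch: `train_means`, `train_stds`, and `test_means` are illustrative names, and the data is reloaded so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("largest |train mean|:", np.abs(train_means).max())
print("train std range:", train_stds.min(), "to", train_stds.max())
print("largest |test mean|:", np.abs(test_means).max())
```

The training features should come out with mean 0 and standard deviation 1 up to floating-point error. The test features will only be close to those values, because the scaler never saw the test data.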

# Step 5: Apply PCA
For PCA, I want to reduce the dimensions of the dataset from 30 features to 2 principal components.

In [8]:
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
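
To see how much of the original variance the two components keep, the fitted PCA object exposes `explained_variance_ratio_`. The sketch below repeats the earlier steps so it is self-contained; `ratios` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)

# Fraction of total variance captured by each principal component
ratios = pca.explained_variance_ratio_
for i, ratio in enumerate(ratios, start=1):
    print(f"PC{i}: {ratio:.4f}")
print(f"Total for 2 components: {ratios.sum():.4f}")
```

A total well below 1.0 would mean the 2D representation discards a meaningful share of the variance, although discarded variance is not necessarily useful for classification.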


# Step 6: Train and Evaluate the Model without PCA
Now I can train a Logistic Regression classifier on the original 30 standardized features to serve as the baseline for comparison.

In [9]:
model_no_pca = LogisticRegression()
model_no_pca.fit(X_train_scaled, y_train)
y_pred_no_pca = model_no_pca.predict(X_test_scaled)
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)
print(f"Accuracy without PCA: {accuracy_no_pca:.4f}")


Accuracy without PCA: 0.9737


# Step 7: Train and Evaluate the Model with PCA
Now I will train the same model on the PCA-reduced dataset, which has only 2 features, to see how well the model performs with fewer features. If the two components capture enough variance, the performance might not degrade much.

In [10]:
model_pca = LogisticRegression()
model_pca.fit(X_train_pca, y_train)
y_pred_pca = model_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)
print(f"Accuracy with PCA: {accuracy_pca:.4f}")

Accuracy with PCA: 0.9912


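# Step 8: Compare Accuracy
Finally, I will compare the two models by subtracting the baseline accuracy (without PCA) from the accuracy with PCA.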

In [11]:
print("Accuracy Improvement: ", accuracy_pca - accuracy_no_pca)

Accuracy Improvement:  0.01754385964912275
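
To put this difference in perspective, it helps to convert the accuracies back into counts of test samples. The sketch below does that arithmetic using the rounded accuracies printed above. It assumes the dataset has 569 samples and that `train_test_split` rounds the test size up; the variable names are illustrative.

```python
import math

n_samples = 569   # assumed size of the breast cancer dataset
test_size = 0.2

# Assumption: the test set size is rounded up
n_test = math.ceil(test_size * n_samples)

# Rounded accuracies as printed in Steps 6 and 7
accuracy_no_pca = 0.9737
accuracy_pca = 0.9912

# Convert accuracies to numbers of correctly classified test samples
correct_no_pca = round(accuracy_no_pca * n_test)
correct_pca = round(accuracy_pca * n_test)
extra_correct = correct_pca - correct_no_pca

print(f"test samples: {n_test}")
print(f"correct without PCA: {correct_no_pca}")
print(f"correct with PCA: {correct_pca}")
print(f"extra correct with PCA: {extra_correct} ({extra_correct / n_test:.4f})")
```

Under these assumptions, the improvement comes down to a handful of test samples, so it may depend on this particular split rather than show that PCA helps in general.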
