
## Principal Component Analysis (PCA) on the Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

---

## 1. Dataset Introduction

The **Wisconsin Diagnostic Breast Cancer (WDBC)** dataset is a widely used benchmark for binary classification in machine learning and statistics.  
It contains 569 samples (357 benign, 212 malignant). Each sample is described by features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; the features describe characteristics of the cell nuclei present in the image.

- **wdbc.data**: The main data file (no header)


**Goal:**  
Classify tumors as malignant (cancerous) or benign (non-cancerous) based on features.

---

## 2. Feature Details

Each row in `wdbc.data` represents a single patient's tumor.

- **Columns:**
  1. **ID number** (not used for ML)
  2. **Diagnosis:** M = Malignant, B = Benign
  3. **30 real-valued features**, derived from 10 measurements computed for each cell nucleus:
      - **Radius** (mean of distances from center to points on the perimeter)
      - **Texture** (standard deviation of gray-scale values)
      - **Perimeter**
      - **Area**
      - **Smoothness** (local variation in radius lengths)
      - **Compactness** (perimeter² / area - 1.0)
      - **Concavity** (severity of concave portions)
      - **Concave points** (number of concave portions)
      - **Symmetry**
      - **Fractal dimension** ("coastline approximation" - 1)

For each image, each of these 10 measurements is summarized over all nuclei in three ways:
- **Mean**
- **Standard error**
- **Worst** (mean of the three largest values)

**Total features:** 10 × 3 = 30 per sample.
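
The 10 × 3 layout can be sketched in a few lines. This is only an illustration: the snake_case names and the `_mean` / `_se` / `_worst` suffixes follow the naming used later in this notebook and are not part of the raw file, which has no header.

```python
# The ten base measurements, in the order used by the dataset.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Three summary statistics. The file stores all means first,
# then all standard errors, then all worst values.
stat_suffixes = ["mean", "se", "worst"]

feature_names = [
    f"{base}_{stat}" for stat in stat_suffixes for base in base_features
]
col_names_sketch = ["id", "diagnosis"] + feature_names

print(len(feature_names))     # 30
print(len(col_names_sketch))  # 32
```

Generating the names this way avoids typing 30 strings by hand and keeps the column order tied to the file layout.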

---

## 3. Step-by-Step PCA in Python

### 3.1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

## Load the Data

In [None]:
# Column names taken from the accompanying .names file (wdbc.data has no header row)
col_names = [
    "id", "diagnosis",
    "radius_mean", "texture_mean", "perimeter_mean", "area_mean", "smoothness_mean", "compactness_mean",
    "concavity_mean", "concave_points_mean", "symmetry_mean", "fractal_dimension_mean",
    "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se", "compactness_se",
    "concavity_se", "concave_points_se", "symmetry_se", "fractal_dimension_se",
    "radius_worst", "texture_worst", "perimeter_worst", "area_worst", "smoothness_worst",
    "compactness_worst", "concavity_worst", "concave_points_worst", "symmetry_worst", "fractal_dimension_worst"
]

df = pd.read_csv("/content/wdbc.data", header=None, names=col_names)
df.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave_points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [None]:
df["diagnosis"].unique()

## Data Preprocessing

In [None]:
# Remove ID column (not useful for ML)
df = df.drop(columns=["id"])

# Convert diagnosis to numeric
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

# Separate features and label
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
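
Standardization matters here because the features live on very different scales (areas in the hundreds, smoothness values near 0.1), and PCA would otherwise be dominated by the large-scale columns. The sketch below reproduces the transformation by hand with NumPy on synthetic data; `X_demo` and its two columns are made-up stand-ins, not the WDBC features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for X: two columns on very different scales.
X_demo = np.column_stack([
    rng.normal(loc=15.0, scale=3.0, size=200),
    rng.normal(loc=650.0, scale=350.0, size=200),
])

# Subtract the column mean and divide by the column standard deviation.
# np.std defaults to ddof=0 (population std), which StandardScaler also uses.
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)
X_demo_scaled = (X_demo - mu) / sigma

print(np.round(X_demo_scaled.mean(axis=0), 6))  # approximately zero per column
print(np.round(X_demo_scaled.std(axis=0), 6))   # approximately one per column
```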


## Applying PCA

In [None]:
# Fit PCA keeping all components (no dimensionality reduction yet)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")
print(f"Shape after PCA (all components kept): {X_pca.shape}")

In [None]:
print("Explained variance (%) by component:", np.round(pca.explained_variance_ratio_ * 100, 2))
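
To make the explained variance ratio less of a black box, the sketch below computes it by hand: the eigenvalues of the covariance matrix of the standardized data, each divided by their sum. It runs on synthetic data (`X_demo` is a hypothetical stand-in), so the printed percentages say nothing about WDBC.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data: mix four independent columns, then standardize.
X_demo = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Each eigenvalue of the covariance matrix is the variance captured
# along one principal direction.
cov = np.cov(X_demo, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

ratio = eigvals / eigvals.sum()
print(np.round(ratio * 100, 2))
```

The ratios sum to 1 and are sorted from largest to smallest, which is the same ordering as `explained_variance_ratio_`.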

In [None]:
# Fit PCA with a fixed number of components
pca = PCA(n_components=5)  # keep the top 5 components
X_pca = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")
print(f"Reduced shape after PCA: {X_pca.shape}")
print("Explained variance (%) by component:", np.round(pca.explained_variance_ratio_ * 100, 2))

In [None]:
# Fit PCA keeping as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")
print(f"Reduced shape after PCA: {X_pca.shape}")
print("Explained variance (%) by component:", np.round(pca.explained_variance_ratio_ * 100, 2))
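
When `n_components` is a fraction, the number of components is chosen from the cumulative explained variance. A minimal sketch of that selection, assuming the rule described in the scikit-learn documentation (keep the smallest number of components whose cumulative ratio exceeds the fraction); the `ratios` values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical explained-variance ratios (illustrative, not the WDBC results).
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.04, 0.0225])
threshold = 0.95

cumulative = np.cumsum(ratios)  # 0.5, 0.75, 0.875, 0.9375, 0.9775, 1.0

# Count how many cumulative values are <= threshold, then add one to get
# the first component count whose cumulative ratio exceeds the threshold.
k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
print(k)  # 5
```

Four components reach only 93.75%, so a fifth is needed to pass 95%.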

## Visualizing the Results

In [None]:
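# Cumulative explained variance vs. number of components
# (uses the most recently fitted PCA, i.e. the 95%-variance model)
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(8,4))
plt.plot(np.arange(1, len(cum_var) + 1), cum_var, marker='o')  # x starts at 1 component
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance Explained by Principal Components')
plt.grid()
plt.show()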



PCA decorrelates the data: the resulting principal components are uncorrelated with each other.
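
The decorrelation property can be checked numerically. The sketch below runs PCA via SVD in plain NumPy on synthetic correlated data (`X_demo` and the mixing matrix are made up for illustration) and compares the correlation matrix before and after projection.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with correlated columns, built by mixing independent noise.
mixing = np.array([[2.0, 1.0, 0.5],
                   [0.0, 1.0, 0.8],
                   [0.0, 0.0, 0.3]])
X_demo = rng.normal(size=(500, 3)) @ mixing
X_centered = X_demo - X_demo.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T  # data expressed in principal-component coordinates

corr_before = np.corrcoef(X_demo, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)
off_diag = corr_after - np.diag(np.diag(corr_after))

print(np.round(corr_before, 2))  # noticeable off-diagonal correlations
print(np.round(corr_after, 2))   # off-diagonal entries are numerically zero
```

The same check could be run on `X_pca` from the cells above with `np.corrcoef(X_pca, rowvar=False)`.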

In [None]:
# 2D Scatter plot with first two principal components
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=y, cmap='coolwarm', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Breast Cancer Data Projected onto First Two Principal Components')
plt.colorbar(label='Diagnosis (1=Malignant, 0=Benign)')
plt.show()
