# Comprehensive Tutorial on Principal Component Analysis (PCA)

In this tutorial, we will cover:

1. **Understanding PCA:** An explanation of what PCA is, why it is used, and how it works.
2. **PCA for Regression:** Using PCA on the California Housing dataset to reduce dimensionality and build a regression model.
3. **Other Applications of PCA:** Additional examples, such as using PCA for clustering/classification (with the Iris dataset).

Let's start with an in-depth explanation of PCA.

## What is PCA?

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It re-expresses a dataset with many features along a new set of orthogonal axes, called principal components, and keeps only the first few of them, which together retain most of the original variance.

### Key Points:

- **Dimensionality Reduction:** Simplifies complex data by reducing the number of variables, which can help in visualization and model performance.
- **Variance Retention:** The first principal component points in the direction of highest variance; each subsequent component captures the largest share of the remaining variance while staying orthogonal to all earlier components.
- **De-correlation:** The resulting principal components are uncorrelated (orthogonal) linear combinations of the original variables.
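The variance ordering and de-correlation properties can be checked numerically. The following is a minimal sketch on made-up data: the array `X_toy` and its three correlated columns are assumptions for illustration only, and standardization is skipped for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: three features driven by one shared latent factor,
# so the original columns are strongly correlated with each other.
latent = rng.normal(size=(300, 1))
X_toy = np.hstack([
    latent + 0.3 * rng.normal(size=(300, 1)),
    2.0 * latent + 0.3 * rng.normal(size=(300, 1)),
    -latent + 0.3 * rng.normal(size=(300, 1)),
])

pca = PCA(n_components=3)
scores = pca.fit_transform(X_toy)

# Share of total variance per component, reported in descending order
ratios = pca.explained_variance_ratio_

# Correlation between the transformed columns (off-diagonals should be ~0)
score_corr = np.corrcoef(scores, rowvar=False)

# Rows of components_ are the principal axes; their Gram matrix should be
# the identity if the axes are orthonormal.
axes_gram = pca.components_ @ pca.components_.T

print("Explained variance ratios:", np.round(ratios, 3))
print("Correlation of PCA scores:\n", np.round(score_corr, 3))
```

The original columns are highly correlated by construction, whereas the correlation matrix of the scores should be close to the identity.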

### Why Use PCA?

1. **Reduce the Curse of Dimensionality:** Fewer dimensions can lead to better model generalization and reduced computational cost.
2. **Visualization:** Reducing data to 2 or 3 dimensions helps in visualizing complex datasets.
3. **Noise Reduction:** Discarding low-variance components can remove noise and redundant information, provided the signal of interest lies mostly in the high-variance directions.
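4. **Improved Interpretability:** Models with fewer inputs are simpler to understand and analyze, although each component mixes several original features, so its meaning has to be read from its loadings.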
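To make the noise reduction point concrete, here is a small sketch on synthetic data. Everything in it (the sizes, the rank-2 signal, the noise level) is an assumption chosen so that the signal really does live in a low-dimensional subspace; on real data the benefit may be smaller or absent.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical setup: a rank-2 signal embedded in 20 features, plus noise
n_samples, n_features, rank = 500, 20, 2
latent = rng.normal(size=(n_samples, rank))
mixing = rng.normal(size=(rank, n_features))
clean = latent @ mixing
noisy = clean + 0.5 * rng.normal(size=clean.shape)

# Keep only the top `rank` components, then map back to the original space
pca = PCA(n_components=rank)
denoised = pca.inverse_transform(pca.fit_transform(noisy))

# Compare each version against the clean signal
noisy_mse = np.mean((noisy - clean) ** 2)
denoised_mse = np.mean((denoised - clean) ** 2)

print(f"MSE of noisy data vs. clean signal:    {noisy_mse:.4f}")
print(f"MSE of denoised data vs. clean signal: {denoised_mse:.4f}")
```

Projecting onto two components and reconstructing discards the noise that falls outside the retained subspace, which is why the reconstruction should sit closer to the clean signal than the noisy input does.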

### How Does PCA Work?

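1. **Standardization:** Scale the data so that each feature has zero mean and unit variance. PCA is sensitive to scale, so without this step features measured in large units would dominate the components.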
2. **Covariance Matrix Calculation:** Compute the covariance matrix to understand how features vary together.
3. **Eigen Decomposition:** Extract eigenvalues and eigenvectors from the covariance matrix. The eigenvectors give the directions of the principal components, and each eigenvalue equals the variance of the data along its eigenvector.
4. **Sorting and Selection:** Sort the eigenvectors by their corresponding eigenvalues in descending order and select the top components.
5. **Projection:** Transform the original data onto the new lower-dimensional subspace formed by the selected principal components.
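The five steps above can be sketched directly in NumPy. This is an illustrative implementation, not how scikit-learn computes PCA internally (it relies on a singular value decomposition); the function name `pca_from_scratch` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Illustrative PCA via eigen decomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)

    # Step 1: standardize (assumes no feature is constant, so std > 0)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # Step 3: eigen decomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: sort by eigenvalue, largest first, and keep the top components
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = eigenvectors[:, :n_components]

    # Step 5: project the standardized data onto the selected components
    scores = X_std @ components
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, components, explained_ratio

# Hypothetical toy data: feature 2 is nearly a copy of feature 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
X_toy[:, 2] = X_toy[:, 0] + 0.1 * rng.normal(size=200)

scores, components, explained_ratio = pca_from_scratch(X_toy, n_components=2)
print("Projected shape:", scores.shape)
print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The signs of the components may differ from those returned by scikit-learn, since an eigenvector multiplied by -1 is still an eigenvector; the explained variance ratios should agree.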


## Application 1: PCA for Regression

We will use the California Housing dataset to demonstrate how PCA can be applied to reduce dimensionality before building a regression model. The steps include:

1. Loading and inspecting the dataset.
2. Standardizing the data.
3. Applying PCA to reduce the dataset to 2 principal components.
4. Visualizing the PCA-transformed data.
5. Using the PCA components in a regression model to predict house prices.


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Enable inline plotting
%matplotlib inline

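# Set a plotting style. The 'seaborn' style was renamed to 'seaborn-v0_8' in
# newer Matplotlib releases, so use whichever name this installation provides.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')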

# ------------------------------------------
# Step 1: Load the California Housing Dataset
# ------------------------------------------
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target  # Median house value, in units of $100,000

print("First 5 rows of the California Housing dataset:")
print(X.head())

# ------------------------------------------
# Step 2: Standardize the Data
# ------------------------------------------
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Shape of standardized data:", X_scaled.shape)

# ------------------------------------------
# Step 3: Apply PCA
# ------------------------------------------
n_components = 2  # We choose 2 principal components
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)
print("Shape of PCA-transformed data:", X_pca.shape)

# ------------------------------------------
# Step 4: Examine the Explained Variance Ratio
# ------------------------------------------
explained_variance = pca.explained_variance_ratio_
print(f"\nExplained Variance Ratio for {n_components} components:")
for i, ratio in enumerate(explained_variance, 1):
    print(f"Principal Component {i}: {ratio:.2%}")

# ------------------------------------------
# Step 5: Visualize the PCA-Reduced Data
# ------------------------------------------
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of California Housing Dataset")
cbar = plt.colorbar(scatter)
cbar.set_label("Median House Value")
plt.grid(True)
plt.show()

# ------------------------------------------
# Step 6: Use PCA Components in a Regression Model
# ------------------------------------------
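# Split the PCA data into training and testing sets.
# Note: the scaler and PCA above were fit on all rows, so the test set has
# influenced the transformation. For a stricter evaluation, split first and
# fit the scaler and PCA on the training rows only.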
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Train a linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = regressor.predict(X_test)

# Evaluate the model using Mean Squared Error (MSE).
# The target is in units of $100,000, so the MSE is in squared units of that.
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error on test set using PCA components: {mse:.4f}")
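One possible refinement of the workflow above is to split the data first and wrap scaling, PCA, and regression in a scikit-learn pipeline, so that the scaler and PCA are fit on the training rows only. The sketch below also fits a baseline on all features for comparison. The helper name `compare_pca_regression` is an assumption for this example, and the demo uses synthetic data from `make_regression` as a stand-in so that the block runs without downloading anything.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_pca_regression(X, y, n_components=2, random_state=42):
    """Return test MSE for a PCA-based model and an all-features baseline."""
    # Split BEFORE any fitting so the test rows never influence scaling or PCA
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    models = {
        "pca": make_pipeline(
            StandardScaler(), PCA(n_components=n_components), LinearRegression()
        ),
        "all_features": make_pipeline(StandardScaler(), LinearRegression()),
    }

    results = {}
    for name, model in models.items():
        # Each pipeline step is fit on the training data only
        model.fit(X_train, y_train)
        results[name] = mean_squared_error(y_test, model.predict(X_test))
    return results

# Synthetic stand-in data (an assumption, not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=8, noise=10.0, random_state=0
)
results = compare_pca_regression(X_demo, y_demo, n_components=2)
for name, mse_value in results.items():
    print(f"{name}: test MSE = {mse_value:.2f}")
```

Passing the unscaled `X` and `y` loaded in Step 1 to `compare_pca_regression` would apply the same comparison to the California Housing data. Keeping only two components usually costs some accuracy relative to the full feature set, so the baseline is a useful check on how much predictive information the reduction discards.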

## Application 2: PCA for Clustering / Classification

PCA can also be used to reduce dimensions for visualization or as a preprocessing step for clustering and classification tasks. In this example, we will use the Iris dataset to:

1. Apply PCA to reduce the data to 2 dimensions.
2. Visualize the data with the true species labels.

This is useful for understanding how well the data separates into distinct classes.

In [None]:
# Import the loader for the Iris dataset
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
target_names = iris.target_names

# Standardize the Iris data
scaler_iris = StandardScaler()
X_iris_scaled = scaler_iris.fit_transform(X_iris)

# Apply PCA to reduce to 2 components
pca_iris = PCA(n_components=2)
X_iris_pca = pca_iris.fit_transform(X_iris_scaled)

# Print the explained variance ratio
print("Explained variance ratio for Iris PCA:", pca_iris.explained_variance_ratio_)

# Visualize the PCA-reduced Iris data
plt.figure(figsize=(8,6))
for target, color in zip(np.unique(y_iris), ['navy', 'turquoise', 'darkorange']):
    plt.scatter(X_iris_pca[y_iris == target, 0],
                X_iris_pca[y_iris == target, 1],
                label=target_names[target],
                color=color,
                alpha=0.7)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()
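The plot shows the true labels; as a rough follow-up, one could also cluster the PCA-reduced points without using the labels and measure how well the clusters match the species. The sketch below is one way to do that, assuming k-means with three clusters and the adjusted Rand index as the agreement measure (both are choices made for this example, not part of the workflow above).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Rebuild the 2D PCA representation so this block is self-contained
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Cluster the reduced data; the species labels are NOT used for fitting
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)

# Adjusted Rand index: 1.0 means perfect agreement with the true species,
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(iris.target, cluster_labels)
print(f"Adjusted Rand index (k-means on 2 PCs vs. species): {ari:.3f}")
```

A score well below 1.0 would be consistent with the scatter plot, where one species separates cleanly and the other two overlap.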

## Conclusion

In this tutorial, we have learned:

- **What PCA is and why it is important:**
  - Dimensionality reduction, noise reduction, and improved computational efficiency.

- **How PCA works:**
  - Standardization, covariance matrix calculation, eigen decomposition, sorting and selection of components, and projection.

- **Application of PCA in Regression:**
  - We applied PCA on the California Housing dataset, visualized the results, and built a regression model.

- **Application of PCA in Clustering/Classification:**
  - We reduced the dimensions of the Iris dataset and visualized the data to observe class separability.

PCA is a powerful tool for simplifying complex datasets, enabling more efficient machine learning and better visualization. Feel free to experiment further with different numbers of components or apply PCA to other datasets.
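As a starting point for experimenting with the number of components, one common heuristic is to keep enough components to reach a target share of the variance. The sketch below assumes a 95% threshold and uses the Wine dataset bundled with scikit-learn; both are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 13 Wine features (dataset chosen only for illustration)
X_wine = StandardScaler().fit_transform(load_wine().data)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_wine)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95  # assumed target, adjust to taste
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# scikit-learn can make the same choice when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_wine)

print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed (manual):", n_needed)
print("Components chosen by PCA:  ", pca_auto.n_components_)
```

The manual count and the value selected by `PCA` should agree, which makes the float form of `n_components` a convenient shortcut once a threshold has been chosen.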

Happy Teaching and Coding!