---

## Grading Info/Details - Assignment 4.1:

The assignment will be graded semi-automatically, which means that your code will be tested against a set of predefined test cases and qualitatively assessed by a human. This will speed up the grading process for us.

* For passing the test scripts:
    - Please make sure **NOT** to alter predefined class or function names, as this would cause the test scripts to fail.
    - Please do **NOT** rename the files before uploading to the Whiteboard!

* **(RESULT)** tags indicate checkpoints that will be specifically assessed by a human.

* You will pass the assignment if you pass the majority of test cases and we can at least confirm effort regarding the **(RESULT)**-tagged checkpoints per task.

---

## Task 4.1.1 - PCA from Scratch

Implement Principal Component Analysis (PCA) from scratch using only `NumPy`.
This assignment will help you understand the mathematical foundations of PCA.

* Implement the PCA given the class structure below. **(RESULT)**
* Test your implementation using small synthetic datasets described in the test functions below. **(RESULT)**

In [14]:
import numpy as np
import matplotlib.pyplot as plt

# Build PCA

In [15]:
class PCA:
    """
    Principal Component Analysis implementation using only NumPy.
    """

    def __init__(self, n_components=2):
        """
        Initialize PCA.
        """
        self.n_components = n_components

    def fit(self, X):
        """
        Fit PCA on the training data X.
        """
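        # Center the data; the mean is reused by transform/inverse_transform
        self.mean_ = np.mean(X, axis=0)
        X = X - self.mean_

        # Scatter matrix X^T X, i.e. the covariance without the 1/(n - 1) factor.
        # It is symmetric, so np.linalg.eigh would also be applicable here.
        cov = np.dot(X.transpose(), X)
        S, W = np.linalg.eig(cov)

        # Sort eigenpairs by descending eigenvalue
        idx = np.argsort(S)[::-1]
        self.eigenvectors = W[:, idx]
        # Eigenvalues of the scatter matrix (not divided by n - 1)
        self.explained_variance_ = S[idx][:self.n_components]
        self.components = self.eigenvectors[:, :self.n_components]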

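    def transform(self, X, dim=5):
        """
        Transform X into the principal component space.

        Note: `dim` is unused; the output dimension is set by n_components.
        """
        # Center with the training mean, then project onto the components
        X = X - self.mean_
        return np.dot(X, self.components)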

    def inverse_transform(self, X):
        """
        Transform data back to original space.
        """
        # Project back to the original feature space and undo the centering
        return X.dot(self.components.transpose()) + self.mean_

    def fit_transform(self, X):
        """
        Fit PCA and transform X in one step.
        """
        self.fit(X)
        return self.transform(X)
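
A note on scaling: `fit` eigendecomposes `X.T @ X` (the scatter matrix) rather than the sample covariance, so `explained_variance_` holds scatter eigenvalues. The following is a minimal standalone sketch, using synthetic data and variable names that are not part of the class above, suggesting that dividing those eigenvalues by `n - 1` gives the sample variance of the projected data.

```python
import numpy as np

rng = np.random.default_rng(0)            # synthetic data, for illustration only
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                   # center the data, as fit() does

scatter = Xc.T @ Xc                       # the matrix that fit() decomposes
eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh is meant for symmetric matrices
order = np.argsort(eigvals)[::-1]         # descending eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

projected = Xc @ eigvecs                  # data in principal component space
sample_var = projected.var(axis=0, ddof=1)
scaled_eigvals = eigvals / (X.shape[0] - 1)

print("Sample variance of projections:", sample_var)
print("Scatter eigenvalues / (n - 1): ", scaled_eigvals)
print("Match:", np.allclose(sample_var, scaled_eigvals))
```

The ordering of the components is unaffected by this constant factor, so the sorting test below behaves the same either way.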

# Testing decorrelation, dimensionality reduction and reconstruction

In [16]:
def test_basic_pca():
    """Test 1: Basic PCA on 2D data"""
    np.random.seed(42)

    # 100 samples, 2 features; keeping both components rotates the data
    # onto its principal axes (decorrelation)
    data_2d = np.random.rand(100, 2)
    pca = PCA(n_components=2)
    transformed_data = pca.fit_transform(data_2d)
    return transformed_data, data_2d, pca


def test_dimensionality_reduction():
    """Test 2: Reduce 5D data to 2D"""
    np.random.seed(42)

    # Create a dataset with 100 samples and 5 features (5D)
    data_5d = np.random.rand(100, 5)
    # print(data_5d.shape)
    pca = PCA(n_components=2)
    data_2d = pca.fit_transform(data_5d)
    # print(data_2d.shape)
    return data_2d, data_5d, pca


def test_reconstruction():
    """Test 3: Inverse transform (reconstruction)"""
    # Reuse the 5D -> 2D reduction from Test 2 and map it back to 5D
    reduced, original, pca = test_dimensionality_reduction()
    reconstructed = pca.inverse_transform(reduced)
    print("Reconstruction error:", np.linalg.norm(original - reconstructed, ord='fro'))



def test_variance_ordering():
    """Test 4: Components are ordered by variance"""
    print("Test 4: Verify components are ordered by explained variance")

    np.random.seed(42)
    X = np.random.randn(100, 5)

    pca = PCA(n_components=5)
    pca.fit(X)

    # Check that explained variances are in descending order
    variances = pca.explained_variance_
    is_sorted = np.all(variances[:-1] >= variances[1:])

    print(f"Explained variances: {variances}")
    print(f"Is sorted (descending): {is_sorted}")
    assert is_sorted, "Components not sorted by variance!"
    print("✓ Test 4 passed\n")


def test_centered_data():
    """Test 5: Verify data is properly centered"""
    print("Test 5: Verify data centering")

    np.random.seed(42)
    X = np.random.randn(100, 3) + 10  # Add offset

    pca = PCA(n_components=2)
    pca.fit(X)

    # Mean should be close to the original data mean
    print(f"Original data mean: {np.mean(X, axis=0)}")
    print(f"Stored mean: {pca.mean_}")
    print(f"Difference: {np.mean(np.abs(np.mean(X, axis=0) - pca.mean_)):.10f}")
    print("✓ Test 5 passed\n")


def run_all_tests():
    print("Running PCA test suite...\n")
    try:
        test_basic_pca()
        test_dimensionality_reduction()
        test_reconstruction()
        test_variance_ordering()
        test_centered_data()

        print("ALL TESTS PASSED!")

    except AssertionError as e:
        print(f"\n❌ Test failed: {e}")
    except Exception as e:
        print(f"\n❌ Unexpected error: {e}")

In [17]:
# Run the test suite
run_all_tests()

Running PCA test suite...

Reconstruction error: 4.581361017952531
Test 4: Verify components are ordered by explained variance
Explained variances: [125.2632111  102.95072664  96.46122129  86.48035729  65.91672829]
Is sorted (descending): True
✓ Test 4 passed

Test 5: Verify data centering
Original data mean: [10.09176598  9.81676669 10.07482166]
Stored mean: [10.09176598  9.81676669 10.07482166]
Difference: 0.0000000000
✓ Test 5 passed

ALL TESTS PASSED!


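The stored mean matches the data mean, and the explained variances are sorted in descending order. The dimensionality reduction maps the data from 5 to 2 dimensions (this can be checked with the commented-out `print` statements in Test 2). Reconstructing from 2 components gives a Frobenius-norm error of about 4.58 on the 100 × 5 data; some error is expected because three of the five components are discarded.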
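
Two properties mentioned above are not asserted by the tests: decorrelation and the size of the reconstruction error. The sketch below is a standalone NumPy check, not part of the graded class, and its variable names are illustrative. It seeds the data the same way as Test 2, verifies that the covariance of the fully projected data is diagonal, and compares the squared Frobenius reconstruction error with the sum of the discarded scatter-matrix eigenvalues, which should agree for an orthonormal projection.

```python
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 5)                # seeded like the 5D data in Test 2
mean = X.mean(axis=0)
Xc = X - mean

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # scatter matrix is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Decorrelation: projecting onto all components should give a diagonal covariance
Z = Xc @ eigvecs
cov_Z = np.cov(Z, rowvar=False)
off_diag = cov_Z - np.diag(np.diag(cov_Z))
max_off_diag = np.abs(off_diag).max()
print("Largest off-diagonal covariance:", max_off_diag)

# Reconstruction from the first k components
k = 2
W = eigvecs[:, :k]
X_rec = (Xc @ W) @ W.T + mean
sq_error = np.linalg.norm(X - X_rec, ord="fro") ** 2
discarded = eigvals[k:].sum()             # eigenvalues of the dropped components
print("Squared reconstruction error:", sq_error)
print("Sum of discarded eigenvalues:", discarded)
```

Read this way, the reconstruction error is a direct measure of how much of the data's spread lies in the discarded directions.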

## Task 4.1.2 - PCA on Real-World Data

* Apply your PCA implementation on the `California Housing Dataset`. **(RESULT)**
* Compare your results with those obtained from the scikit-learn PCA implementation: `sklearn.decomposition.PCA`. Are your results within numerical precision? **(RESULT)**


In [18]:
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA as SklearnPCA


# Testing the California Housing dataset using scikit-learn PCA

In [19]:
data = fetch_california_housing()
X = data.data

# Reference run: reduce to 5 components, then map back to the original space
pca_2 = SklearnPCA(n_components=5)
X_pca = pca_2.fit_transform(X)
X_recon = pca_2.inverse_transform(X_pca)
recon_error = np.linalg.norm(X - X_recon, ord='fro')
print("Reconstruction error (Scikit-learn implementation):", recon_error)

# Per-component variance of the reduced data, used for the comparison below
var_sklearn = X_pca.var(axis=0)


Reconstruction error (Scikit-learn implementation): 259.1595704420067


# Testing the California Housing dataset using our custom PCA class

In [20]:
pca2 = PCA(n_components=5)
reduced = pca2.fit_transform(X)
reconstructed = pca2.inverse_transform(reduced)
error = np.linalg.norm(X - reconstructed, ord='fro')
print("Reconstruction error (Custom implementation):", error)
var_red = reduced.var(axis=0)

# Compare per-component variances and reconstruction errors of both implementations
diff = np.linalg.norm(var_sklearn - var_red)
print("Difference between Scikit-learn reduced dataset variance and custom PCA reduced dataset variance", diff)
print("Difference between Scikit-learn reconstruction error and custom PCA reconstruction error", recon_error - error)


Reconstruction error (Custom implementation): 259.1595704420065
Difference between Scikit-learn reduced dataset variance and custom PCA reduced dataset variance 2.328313557364388e-10
Difference between Scikit-learn reconstruction error and custom PCA reconstruction error 2.2737367544323206e-13


The results of our implementation match those of scikit-learn within numerical precision. The reconstruction errors of the two implementations differ by about 2.3e-13, and the per-component variances of the reduced datasets differ by about 2.3e-10 (Euclidean norm of the difference across the 5 components).
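
A caveat when comparing components or projected values directly instead of variances: each principal axis is defined only up to sign, so two correct implementations can differ by a factor of -1 per column. The sketch below illustrates one way to handle this. It uses synthetic data and NumPy's SVD as a stand-in for an SVD-based implementation (an assumption; it does not call scikit-learn), and it aligns the signs before comparing.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with distinct per-feature scales so the eigenvalues are well separated
X = rng.normal(size=(300, 4)) * np.array([5.0, 3.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the scatter matrix, sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
W_eig = eigvecs[:, np.argsort(eigvals)[::-1]]

# Route 2: SVD of the centered data; the rows of Vt are the principal axes
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt.T

# Flip each column of W_eig so that it points the same way as its SVD counterpart
signs = np.sign(np.sum(W_eig * W_svd, axis=0))
W_aligned = W_eig * signs

print("Equal before sign alignment:", np.allclose(W_eig, W_svd, atol=1e-8))
print("Equal after sign alignment: ", np.allclose(W_aligned, W_svd, atol=1e-8))
```

Quantities such as per-component variance and reconstruction error are unaffected by these sign flips, which is why the comparison above can use them directly.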

## Congratz, you made it! :)