# Formative Assignment: Advanced Linear Algebra (PCA)
This notebook will guide you through the implementation of Principal Component Analysis (PCA). Fill in the missing code and provide the required answers in the appropriate sections. You will work with an African dataset (malaria indicators by country and year).

Make sure to display outputs for each code cell when submitting.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV from the current working directory (a /content/ path only works on Google Colab)
try:
    df = pd.read_csv('DatasetAfricaMalaria.csv')  # Look in current directory
    print("Using actual malaria dataset")
    print("Dataset shape:", df.shape)
    print("Dataset columns:", df.columns.tolist())
    print("First few rows:")
    print(df.head())
    
    # Select only numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    print("Numeric columns:", numeric_cols.tolist())
    
    if len(numeric_cols) > 0:
        # Get numeric data and handle missing values
        numeric_data = df[numeric_cols]
        print("Missing values per column:")
        print(numeric_data.isnull().sum())
        
        # Drop columns with too many missing values (>50% missing).
        # thresh is the minimum number of non-missing values a column needs to be kept.
        threshold = int(np.ceil(len(numeric_data) * 0.5))
        numeric_data = numeric_data.dropna(axis=1, thresh=threshold)
        print("Columns after dropping those with >50% missing:", numeric_data.columns.tolist())
        
        # Fill remaining missing values with column means
        numeric_data = numeric_data.fillna(numeric_data.mean())
        
        # Convert to numpy array
        data = numeric_data.values
        print("Using cleaned numeric data with shape:", data.shape)
    else:
        print("No numeric columns found, using random data")
        data = np.random.rand(100, 10)
        
except FileNotFoundError:
    print("Dataset not found, using random data")
    data = np.random.rand(100, 10)

# Step 1: Load and Standardize the data (use of numpy only allowed)
print("Data type:", type(data))
print("Data shape:", data.shape)
print("Sample data (first 2 rows):", data[:2])

# Check for any remaining NaN values
if np.isnan(data).any():
    print("Warning: Data still contains NaN values. Using random data instead.")
    data = np.random.rand(100, 10)

mean = np.mean(data, axis=0)
std_dev = np.std(data, axis=0)

# Avoid division by zero for constant columns
std_dev[std_dev == 0] = 1

standardized_data = (data - mean) / std_dev

print("Standardized data shape:", standardized_data.shape)
standardized_data[:5]  # Display the first few rows of standardized data

Using actual malaria dataset
Dataset shape: (594, 27)
Dataset columns: ['Country Name', 'Year', 'Country Code', 'Incidence of malaria (per 1,000 population at risk)', 'Malaria cases reported', 'Use of insecticide-treated bed nets (% of under-5 population)', 'Children with fever receiving antimalarial drugs (% of children under age 5 with fever)', 'Intermittent preventive treatment (IPT) of malaria in pregnancy (% of pregnant women)', 'People using safely managed drinking water services (% of population)', 'People using safely managed drinking water services, rural (% of rural population)', 'People using safely managed drinking water services, urban (% of urban population)', 'People using safely managed sanitation services (% of population)', 'People using safely managed sanitation services, rural (% of rural population)', 'People using safely managed sanitation services, urban  (% of urban population)', 'Rural population (% of total population)', 'Rural population growth (annual %)', '

array([[-1.58113883, -1.21256459, -0.50676022, -1.23400095, -1.66809102,
         1.23397271, -0.56164688,  1.4949511 ,  1.86242385,  1.15800583,
         1.77365688,  2.08348956,  1.99788845,  1.60900436, -0.78319128],
       [-1.58113883,  0.61644942,  0.22065066, -0.79850359,  0.43654858,
         0.79847029,  1.02714839, -1.04968763, -1.49943184, -2.0312881 ,
        -0.08450631, -0.60024644,  0.2987136 , -0.89572088,  0.02653472],
       [-1.58113883,  1.85097486, -0.50677256,  0.08917686,  0.50362873,
        -0.08922047,  0.39163028, -0.12891488,  0.18799653, -0.88446427,
        -1.05813946, -1.01427689, -1.20999189,  0.41358982, -0.75042082],
       [-1.58113883, -1.20605768, -0.50658756, -0.82130741, -2.37243257,
         0.82127437,  0.88208447,  0.75053369,  0.33317496,  1.11063462,
         0.84629616,  0.50795659,  1.37380734, -1.6059517 ,  0.36666905],
       [-1.58113883,  2.00127157, -0.48578404,  1.12146692,  0.64617404,
        -1.12152249,  1.64885089, -0.79883234, 

### Step 3: Calculate the Covariance Matrix
The covariance matrix helps us understand how the features are related to each other. It is a key component in PCA.
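
As a rough illustration of what `np.cov` computes, the sketch below builds the covariance matrix by hand and compares the two. It uses random data as a hypothetical stand-in for the malaria dataset, and the names `X`, `Z`, `cov_manual`, and `cov_numpy` are illustrative only.

```python
import numpy as np

# Hypothetical stand-in data (assumption: not the malaria dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Standardize with the population standard deviation (ddof=0), as np.std does by default
Z = (X - X.mean(axis=0)) / X.std(axis=0)

n = Z.shape[0]
cov_manual = Z.T @ Z / (n - 1)        # sample covariance of already-centered data
cov_numpy = np.cov(Z, rowvar=False)   # rowvar=False: columns are the features

print(np.allclose(cov_manual, cov_numpy))
print(np.diag(cov_numpy))             # every entry equals n / (n - 1)
```

Because the data are standardized with `ddof=0` while `np.cov` divides by `n - 1`, the diagonal entries come out as `n / (n - 1)` rather than exactly 1. With 594 rows that is 594/593 ≈ 1.0017, which matches the diagonal of the output below.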

In [None]:
# Step 3: Calculate the Covariance Matrix
cov_matrix = np.cov(standardized_data, rowvar=False)  # Calculate covariance matrix
cov_matrix

array([[ 1.00168634e+00, -6.80542806e-02,  3.57947203e-01,
        -8.40379544e-02, -4.97357957e-02,  8.40398632e-02,
        -5.65317141e-02,  1.18565745e-01,  1.10879038e-01,
         1.00777429e-01,  6.83801684e-02,  6.84030302e-02,
         5.87010243e-02, -9.12704426e-19,  5.24219978e-18],
       [-6.80542806e-02,  1.00168634e+00,  2.88995393e-01,
         2.45446793e-01,  3.87139217e-01, -2.45450290e-01,
         3.09619690e-01, -3.68912526e-01, -3.13484257e-01,
        -3.99015433e-01, -4.40983873e-01, -3.75800980e-01,
        -4.25948448e-01,  5.88013289e-02, -2.64875225e-01],
       [ 3.57947203e-01,  2.88995393e-01,  1.00168634e+00,
         2.13337373e-01,  2.40234419e-01, -2.13346062e-01,
         2.52430491e-01, -2.32449030e-01, -1.80609519e-01,
        -1.67590337e-01, -1.75105836e-01, -1.14344867e-01,
        -2.08589045e-01, -8.65447283e-02,  6.68523086e-02],
       [-8.40379544e-02,  2.45446793e-01,  2.13337373e-01,
         1.00168634e+00,  6.52752199e-01, -1.00168633

### Step 4: Perform Eigendecomposition
Eigendecomposition of the covariance matrix will give us the eigenvalues and eigenvectors, which are essential for PCA.
Fill in the code to compute the eigenvalues and eigenvectors of the covariance matrix.

In [None]:
# eigenvalues[i] corresponds to the column eigenvectors[:, i]; np.linalg.eig does not sort them
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("Eigenvalues shape:", eigenvalues.shape)
print("Eigenvectors shape:", eigenvectors.shape)
print("\nFirst 5 eigenvalues:", eigenvalues[:5])
print("\nSum of eigenvalues:", np.sum(eigenvalues))

# Display eigenvalues and eigenvectors
print("\nAll eigenvalues:")
print(eigenvalues)
print("\nFirst eigenvector (first column):")
print(eigenvectors[:, 0])
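
One way to sanity-check an eigendecomposition is to confirm that each pair satisfies C v = λ v and that the eigenvalues sum to the trace of the matrix. The sketch below does this on hypothetical random data. It uses `np.linalg.eigh`, a NumPy routine intended for symmetric matrices that returns real eigenvalues in ascending order; this is an alternative to the `np.linalg.eig` call used in this notebook, not a replacement the assignment requires.

```python
import numpy as np

# Hypothetical stand-in data (assumption: not the malaria dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)

# eigh assumes a symmetric matrix; eigenvalues come back real and ascending
vals, vecs = np.linalg.eigh(C)

# Each column v of vecs should satisfy C v = lambda v
print(np.allclose(C @ vecs, vecs * vals))

# The eigenvalues should sum to the trace (total variance)
print(np.isclose(vals.sum(), np.trace(C)))
```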

### Step 5: Sort Principal Components
Sort the eigenvectors by their corresponding eigenvalues in descending order. The larger the eigenvalue, the more of the data's variance its eigenvector captures.
Complete the code to sort the eigenvectors and print the sorted components.
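
A minimal sketch of the sorting step, assuming random stand-in data in place of the malaria dataset. The names `order`, `sorted_eigenvalues`, and `sorted_eigenvectors` are illustrative choices, not names the assignment prescribes.

```python
import numpy as np

# Hypothetical stand-in data (assumption: not the malaria dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(C)

# argsort gives ascending order; reverse it for descending
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
# Eigenvectors are stored as columns, so reorder the columns
sorted_eigenvectors = eigenvectors[:, order]

print("Sorted eigenvalues:", sorted_eigenvalues)
print("First principal component:", sorted_eigenvectors[:, 0])
```

Reordering columns rather than rows keeps each eigenvalue paired with its own eigenvector.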

<a href="https://www.youtube.com/watch?v=vaF-1xUEXsA&t=17s">How Is Explained Variance Used In PCA?</a>
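
Explained variance can be turned into a rule for choosing the number of components. The sketch below computes each component's share of the total variance and keeps the smallest number of components whose cumulative share reaches a target. The 0.95 target and the random stand-in data are assumptions for illustration, not values given by the assignment.

```python
import numpy as np

# Hypothetical stand-in data with one nearly redundant feature (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse for descending
vals = np.linalg.eigvalsh(C)[::-1]

explained = vals / vals.sum()        # share of variance per component
cumulative = np.cumsum(explained)    # running total

target = 0.95                        # hypothetical threshold
# First index where the running total reaches the target, converted to a count
k = int(np.searchsorted(cumulative, target) + 1)

print("Explained variance ratio:", explained)
print("Cumulative:", cumulative)
print("Components needed:", k)
```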

### Step 6: Project Data onto Principal Components
Now that we've selected the number of components, we will project the standardized data onto the chosen principal components.
Fill in the code to perform the projection.
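
The projection is a single matrix product between the standardized data and the first `k` sorted eigenvectors. The sketch below assumes random stand-in data and a hypothetical choice of `k = 2`; `W` and `reduced_data` are illustrative names.

```python
import numpy as np

# Hypothetical stand-in data (assumption: not the malaria dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)

# Symmetric eigendecomposition, then flip to descending order
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]

k = 2                       # hypothetical number of components to keep
W = vecs[:, :k]             # (n_features, k) projection matrix
reduced_data = Z @ W        # (n_samples, k) scores on the principal components

print("Reduced data shape:", reduced_data.shape)
print(reduced_data[:5])
```

A useful check is that the covariance matrix of `reduced_data` is diagonal, with the top `k` eigenvalues on the diagonal, since the principal components are uncorrelated.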

### Step 7: Output the Reduced Data
Finally, display the reduced data obtained by projecting the standardized dataset onto the selected principal components.

### Step 8: Visualize Before and After PCA
Now, let's plot the original data and the data after PCA to compare the reduction in dimensions visually.
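
One possible way to draw the comparison is a pair of scatter plots: the first two standardized features on the left and the first two principal components on the right. The sketch below uses random stand-in data with some added correlation. The figure layout, titles, and variable names are assumptions, not requirements of the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data with correlated features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via symmetric eigendecomposition; keep the two leading components
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
W = vecs[:, ::-1][:, :2]
reduced = Z @ W

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4))

ax_before.scatter(Z[:, 0], Z[:, 1], s=10)
ax_before.set(title="Before PCA (first two standardized features)",
              xlabel="Feature 1", ylabel="Feature 2")

ax_after.scatter(reduced[:, 0], reduced[:, 1], s=10)
ax_after.set(title="After PCA", xlabel="PC1", ylabel="PC2")

fig.tight_layout()
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` afterwards.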