

<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while retaining as much of its information as possible. It does this by identifying how the existing features relate to one another and combining related features into new features, called principal components. Each principal component is a linear combination of the original features, and the algorithm quantifies how relevant each component is by how much of the data's variance it captures. The most relevant components are then used to transform the high-dimensional data into a lower-dimensional representation while preserving as much information as possible. We can determine the relationships between features using covariance.

In [None]:
# Import necessary packages
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:

data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

In [None]:
fuel_econ = pd.read_csv('fuel_econ.csv')

# Handle non-numeric columns in fuel_econ dataset
non_numeric_cols = fuel_econ.select_dtypes(exclude=[np.number]).columns
print("Non-numeric columns in fuel_econ:", non_numeric_cols)

# Option 1: Exclude non-numeric columns
fuel_econ_numeric = fuel_econ.select_dtypes(include=[np.number])


Non-numeric columns in fuel_econ: Index(['make', 'model', 'VClass', 'drive', 'trans', 'fuelType'], dtype='object')


### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




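Explain why the features need to be on the same scale.

Standardizing the data ensures that all variables are treated equally and that no single variable dominates the analysis. PCA looks for the directions of greatest variance, so a feature measured on a larger scale would otherwise contribute a larger variance and pull the principal components toward itself.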
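To make the scaling argument concrete, here is a minimal sketch that uses the same toy array as this notebook. It compares the raw per-feature variances and checks that a manual z-score matches `StandardScaler`. The variable names `raw_variances`, `manual`, and `scaled` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same 6x5 toy dataset used in this notebook
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Raw variances differ widely between features
raw_variances = data.var(axis=0)
print("raw variances:", raw_variances.round(2))

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# StandardScaler applies the same per-feature transformation
scaled = StandardScaler().fit_transform(data)

print("means after scaling:", scaled.mean(axis=0).round(6))
print("stds after scaling: ", scaled.std(axis=0).round(6))
print("manual matches StandardScaler:", np.allclose(manual, scaled))
```

On this data the last feature has by far the largest raw variance, so without scaling it would dominate the first principal component.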

In [None]:
# Standardize the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that vary in similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [None]:
# Calculate covariance matrix
cov_matrix = np.cov(standardized_data.T)

# Display the covariance matrix (features x features)
print(cov_matrix)
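As a sanity check on this step, the sketch below rebuilds the covariance matrix by hand and compares it with `np.cov`. It assumes the same toy array and a manual z-score in place of `StandardScaler`. The names `Z`, `manual_cov`, and `corr` are illustrative.

```python
import numpy as np

# Same 6x5 toy dataset used in this notebook
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Standardize each feature (equivalent to StandardScaler, which uses ddof=0)
Z = (data - data.mean(axis=0)) / data.std(axis=0)
n = Z.shape[0]

# Columns of Z have zero mean, so the sample covariance is Z^T Z / (n - 1)
manual_cov = Z.T @ Z / (n - 1)

# np.cov treats rows as variables by default, hence the transpose
cov_matrix = np.cov(Z.T)

# For standardized data the covariance is the correlation matrix scaled by n / (n - 1)
corr = np.corrcoef(data, rowvar=False)

print("manual matches np.cov:", np.allclose(manual_cov, cov_matrix))
print("matches scaled correlation:", np.allclose(cov_matrix, corr * n / (n - 1)))
```

The factor `n / (n - 1)` appears because the scaling step divides by the population standard deviation while `np.cov` uses the sample estimate by default.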




### Step 3: Eigendecomposition on the Covariance Matrix


In [None]:


eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
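The sketch below checks what the decomposition returns: each eigenvector column `v` and its eigenvalue `lam` should satisfy `cov_matrix @ v = lam * v`. It swaps in `np.linalg.eigh`, which is designed for symmetric matrices and returns real eigenvalues in ascending order. That substitution is an assumption of this example, not what the cell above uses.

```python
import numpy as np

# Same 6x5 toy dataset used in this notebook
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

Z = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(Z, rowvar=False)

# eigh is intended for symmetric matrices such as a covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Each column of `eigenvectors` pairs with the eigenvalue at the same index
pairs_ok = all(
    np.allclose(cov_matrix @ eigenvectors[:, i], eigenvalues[i] * eigenvectors[:, i])
    for i in range(len(eigenvalues))
)

# Eigenvectors of a symmetric matrix form an orthonormal basis
orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(len(eigenvalues)))

print("A v = lambda v holds:", pairs_ok)
print("eigenvectors orthonormal:", orthonormal)
print("sum of eigenvalues equals trace:", np.isclose(eigenvalues.sum(), np.trace(cov_matrix)))
```

Because the eigenvalues sum to the trace of the covariance matrix, each one can be read as a share of the total variance.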



### Step 4: Sort the Principal Components

In [None]:
# np.argsort sorts from lowest to highest, so reverse with [::-1] for descending order
order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# Use the sort order to reorder the eigenvalues
sorted_eigenvalues = eigenvalues[order_of_importance]
print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))

# Eigenvectors are stored as columns, so reorder the columns to match
sorted_eigenvectors = eigenvectors[:, order_of_importance]
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Question:

1. Why do we order eigenvalues and eigenvectors?

Ordering eigenvalues and their corresponding eigenvectors is important in PCA because the magnitude of each eigenvalue represents the amount of variance captured by its associated eigenvector (principal component). By arranging them in descending order, we can prioritize the principal components that capture the most variance. This allows us to reduce the dimensionality of the data while retaining the most significant information, as we focus on the components that explain the majority of the variance.

2. Is it true that we would keep the components with the lowest eigenvalues rather than those with the highest? Defend your answer.

No, in PCA, we generally do not prioritize the lowest eigenvalues because they represent components that capture the least variance in the data. The goal of PCA is to reduce the dimensionality while preserving as much variance as possible, which is why we focus on the highest eigenvalues. By selecting the components with higher eigenvalues, we retain the directions in the data that explain the most variance, thus preserving essential patterns and structure. Lower eigenvalues correspond to directions with minimal variance and are often disregarded to reduce noise and simplify the model.

To see what percentage of the information each eigenvalue holds, print the percentage for each eigenvalue using the formula

> (sorted eigenvalue / sum of all sorted eigenvalues) * 100



In [None]:
# Calculate the percentage of variance explained by each eigenvalue
explained_variance = (sorted_eigenvalues / np.sum(sorted_eigenvalues)) * 100

# Format the percentages to two decimal places (kept separate from the numeric array)
explained_variance_labels = ["{:.2f}%".format(value) for value in explained_variance]

# Print the explained variance percentages
print(explained_variance_labels)


['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
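A common follow-up is to accumulate these percentages and pick the smallest number of components that reaches a target. The sketch below reuses the sorted eigenvalues printed in Step 4. The 95% `threshold` is a hypothetical choice for illustration, not something this notebook prescribes.

```python
import numpy as np

# Sorted eigenvalues as printed in Step 4 of this notebook
sorted_eigenvalues = np.array([
    3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
    4.94531029e-02, 4.74189469e-05,
])

# Fraction of total variance held by each component
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()

# Running total of explained variance, from most to least important component
cumulative = np.cumsum(explained_ratio)

# Hypothetical target: keep enough components to explain 95% of the variance
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance (%):", (cumulative * 100).round(2))
print("components needed for {:.0%}: {}".format(threshold, k))
```

With these eigenvalues the first two components cover about 92.44% of the variance, so a 95% target selects three components.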


Initialize the number of principal components, k, then perform a matrix multiplication between the standardized data and the first k columns of the sorted eigenvector matrix. For example, k = 3 keeps 3 principal components.

> The resulting matrix (with reduced data) = standardized data * first k columns of the sorted eigenvector matrix

See expected output for k = 2



In [None]:
import numpy as np

# NOTE: this cell redefines `data`, `standardized_data`, and `sorted_eigenvectors`
# with small stand-in values so that it runs on its own.

# Small sample dataset (3 samples x 3 features)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Standardize each column (feature): subtract the mean, divide by the standard deviation
data_mean = np.mean(data, axis=0)
data_std = np.std(data, axis=0)
standardized_data = (data - data_mean) / data_std

# Number of principal components to keep
k = 2

# Placeholder only: the identity matrix stands in for the sorted eigenvector matrix,
# so the multiplication below simply keeps the first k standardized columns.
# For a real PCA projection, use the sorted_eigenvectors computed in Step 4 instead.
sorted_eigenvectors = np.eye(3)

# Project the standardized data onto the first k eigenvector columns
reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])



In [None]:
print(reduced_data)

[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]


In [None]:
print(reduced_data.shape)

(3, 2)
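The cell above relies on an identity-matrix placeholder, so here is a minimal sketch of the same projection using eigenvectors computed from this notebook's 6x5 toy array. It uses `np.linalg.eigh` and a manual z-score, both assumptions of this example, and compares the result with scikit-learn's `PCA`. The two can differ in the sign of each column, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same 6x5 toy dataset used earlier in this notebook
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Step 1: standardize each feature
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Steps 2-3: covariance matrix and its eigendecomposition
cov_matrix = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Step 4: sort eigenvectors by descending eigenvalue
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvectors = eigenvectors[:, order]

# Projection: keep the first k eigenvector columns
k = 2
reduced_data = Z @ sorted_eigenvectors[:, :k]
print(reduced_data.shape)  # (6, 2)

# Cross-check against scikit-learn; component signs are arbitrary
sk_reduced = PCA(n_components=k).fit_transform(Z)
matches = np.allclose(np.abs(reduced_data), np.abs(sk_reduced))
print("matches sklearn up to sign:", matches)
```

Each row of `reduced_data` is one original sample expressed in the coordinates of the top two principal components.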


### What are two positive effects and two negative effects of PCA?

Give two benefits and two limitations.

**Benefits of PCA:**
1. **Dimensionality Reduction**: PCA reduces the number of features while retaining most of the dataset's information, improving computation speed and simplifying visualization.
2. **Noise Reduction**: By focusing on principal components that capture the most variance, PCA can help filter out noise and irrelevant information.

**Limitations of PCA:**
1. **Information Loss**: Reducing dimensions discards the variance held by the dropped components, so the fewer components we retain, the more information is lost.
2. **Linear Relationships**: PCA assumes linear relationships between features, so it may not be effective for data with complex, non-linear structures.
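The information-loss limitation can be measured directly. The sketch below projects this notebook's toy data onto the top k components, maps it back to the original feature space, and records the squared reconstruction error. The helper name `reconstruction_error` is hypothetical, and `np.linalg.eigh` with a manual z-score is an assumption of this example.

```python
import numpy as np

# Same 6x5 toy dataset used earlier in this notebook
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

Z = (data - data.mean(axis=0)) / data.std(axis=0)
n = Z.shape[0]

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]


def reconstruction_error(k):
    """Total squared error after projecting onto k components and mapping back."""
    W = sorted_eigenvectors[:, :k]
    Z_hat = Z @ W @ W.T  # project down, then back to the original feature space
    return float(np.sum((Z - Z_hat) ** 2))


errors = [reconstruction_error(k) for k in range(1, 6)]
for k, err in enumerate(errors, start=1):
    print("k = {}: squared reconstruction error = {:.6f}".format(k, err))
```

The error for a given k equals `(n - 1)` times the sum of the discarded eigenvalues, which is why dropping only the smallest eigenvalues loses the least information.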
