# Assignment 9 - [30 points] Solutions

## <u>Case Study 1</u>: Fashion MNIST Dataset Principal Component Analysis for Quick Summarization

In this analysis, we will continue our exploration of the same random sample of the "Fashion MNIST" dataset. *As a reminder, this dataset consists of 500 28-by-28 pixel images of fashion items. Each of the 784 image pixels is represented by a numerical grayscale value that can range from 0 (black) to 255 (white). Each object has an associated pre-assigned class label, which corresponds to the fashion item pictured in the image. The 10 types of fashion items included in this dataset are: Pullover, Sandal, Bag, Ankle boot, Coat, Shirt, T-shirt/top, Sneaker, Dress, and Trousers.*

### <u>Research Goals</u>:

In this analysis, we have the following research goals.

#### Pixel Relationships that Describe the Most Amount of Variance in the Dataset

First, we would like to determine which pixel relationships account for the most variance in the dataset. We will use the loading vectors from our PCA to determine this.

#### Quick Summarization of the Images

Next, we will use the loading vectors and the principal component coordinates of the objects to quickly summarize each of the 500 objects in the dataset.


In [13]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')  # the 'seaborn' style name was renamed in Matplotlib 3.6
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## 1. Data Preprocessing and Cleaning

### 1.1. Before Mean Scaling

#### 1.1.1. Original Dataset
First, read the fashion_mnist_sample.csv into a dataframe. Then make a copy of this dataset that has dropped the pre-assigned class labels and has divided each of the values in this dataframe by 255. 

In [2]:
df = pd.read_csv('fashion_mnist_sample.csv')
df.head()

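| | label | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dress | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 162 | 176 | 128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Shirt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 117 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Trousers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Sandal | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Bag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |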


In [3]:
X = df.iloc[:, 1:]/255
df_scaled = X.copy()
df_scaled["label"] = df.label

#### 1.1.2. Original Pixel Means

Calculate the mean of each of the 784 pixels. Save these 784 pixel means as a NumPy array object.

**Hint:** you can convert a pandas series to a NumPy array by using **np.array(pandas_series)**.

In [12]:
mean_list = np.array(df_scaled.iloc[:, :-1].mean())
mean_list[:5]

array([0.00000000e+00, 0.00000000e+00, 7.05882353e-05, 1.41176471e-04,
       1.49019608e-04])

### 1.2. Mean Scaling

Next, mean-scale your dataset and save it as a new dataframe.

In [14]:
X = StandardScaler(with_std=False).fit_transform(X)
X

array([[ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04],
       [ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04],
       [ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04],
       ...,
       [ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04],
       [ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04],
       [ 0.00000000e+00,  0.00000000e+00, -7.05882353e-05, ...,
        -8.47058824e-03, -3.99215686e-03, -7.84313725e-04]])

Check that the pixel means are zero after mean scaling:

In [17]:
np.sum(np.round(np.mean(X,axis=0), 2))

0.0

## 2. Descriptive Analytics

### 2.1. Pixel Variability

First, calculate the variance of each of the 784 pixels in your mean-scaled dataset. Display these 784 variances in a histogram.
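A minimal sketch of this step follows. The assignment's data is not reproduced here, so a random array `X_demo` (a hypothetical stand-in, not the Fashion MNIST sample) takes its place. Using `ddof=1` is also an assumption of this sketch, chosen so that the variances are sample variances.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the mean-scaled data: 500 images x 784 pixels.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784))
X_demo = X_demo - X_demo.mean(axis=0)

# Sample variance of each pixel column (one value per pixel).
pixel_variances = X_demo.var(axis=0, ddof=1)

# Histogram of the 784 pixel variances.
fig, ax = plt.subplots()
ax.hist(pixel_variances, bins=30)
ax.set_xlabel("Pixel variance")
ax.set_ylabel("Number of pixels")
ax.set_title("Variances of the 784 pixels")
```

With the real data, the mean-scaled array from 1.2 would replace `X_demo`.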

### 2.2. Total Pixel Variance

Calculate the sum of all of the pixel variances.
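One possible way to compute the total is sketched below on a hypothetical random array `X_demo` (not the assignment's data). As a sanity check, the sketch also compares the sum of the pixel variances with the trace of the covariance matrix, which should agree because the diagonal of a covariance matrix holds the variances.

```python
import numpy as np

# Hypothetical stand-in for the mean-scaled data: 500 images x 784 pixels.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784))
X_demo = X_demo - X_demo.mean(axis=0)

# Sum of the 784 sample variances (ddof=1 is an assumption of this sketch).
pixel_variances = X_demo.var(axis=0, ddof=1)
total_variance = pixel_variances.sum()

# The same quantity read off the diagonal of the 784 x 784 covariance matrix.
trace_of_cov = np.trace(np.cov(X_demo, rowvar=False))
print(total_variance, trace_of_cov)
```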

## 3. Selecting the Number of Principal Components

In this case study, we would like to use PCA to learn which pixel relationships in the images account for the most (second most, third most, etc.) variance in the mean-scaled dataset. In addition, we would like to preserve as much of the mean-scaled pixel variance as possible while keeping the number of principal components relatively low.


### 3.1. Percent of Total Original Pixel Variance

First, create a plot below that plots the following:
* on the x-axis is k = number of principal components used in a PCA
* on the y-axis is the percent of total (mean-scaled) original pixel variance that would be preserved by using the corresponding k principal components.
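The plot described above can be sketched as follows. The data here is a hypothetical synthetic array `X_demo` (ten latent factors plus noise), not the Fashion MNIST sample, and fitting `PCA()` with all components and accumulating `explained_variance_ratio_` is one reasonable approach rather than the only one.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

# Fit PCA with every available component, then accumulate the variance ratios.
pca_full = PCA().fit(X_demo)
cumulative_pct = 100 * np.cumsum(pca_full.explained_variance_ratio_)
k_values = np.arange(1, cumulative_pct.size + 1)

# x-axis: number of components k; y-axis: percent of total variance preserved.
fig, ax = plt.subplots()
ax.plot(k_values, cumulative_pct)
ax.set_xlabel("k = number of principal components")
ax.set_ylabel("Percent of total pixel variance preserved")
```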

### 3.2. How many principal components to use?

Suppose we know that we would like at least 80% of the original pixel variance to be preserved in our principal components. What is the minimum number of principal components that we would need to use in order to preserve at least 80% of the original (mean-scaled) total pixel variance?
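A small sketch of how the minimum $k$ could be read off the cumulative variance ratios is given below. It runs on a hypothetical synthetic array `X_demo`, so the value it prints is not the answer for the Fashion MNIST sample; the names `threshold` and `k_min` are introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

# Fraction of total variance preserved by the first 1, 2, 3, ... components.
cumulative_ratio = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

# np.argmax returns the first index where the condition is True;
# add 1 because index 0 corresponds to k = 1.
threshold = 0.80
k_min = int(np.argmax(cumulative_ratio >= threshold)) + 1
print(k_min, cumulative_ratio[k_min - 1])
```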

## 4. PCA

### 4.1. Performing PCA

Using $k$, the number of principal components that you selected in 3.2, project your mean-scaled pixel dataset onto $k$ principal components.

Use a random state of 100.
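A minimal sketch of the projection follows, assuming scikit-learn's `PCA` is the intended tool (it is imported at the top of this notebook). The array `X_demo` and the value `k = 5` are hypothetical placeholders. In scikit-learn, `random_state` only has an effect when the randomized or ARPACK solver is used.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2

pca = PCA(n_components=k, random_state=100)
pc_scores = pca.fit_transform(X_demo)  # principal component coordinates, (500, k)
loadings = pca.components_             # one loading vector per row, (k, 784)
print(pc_scores.shape, loadings.shape)
```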

### 4.2. Pixel Relationships in the Loading Vectors

#### 4.2.1. Without Re-Adding the Pixel Means

Next, visualize each of the $k$ loading vectors in a 28-by-28 pixel image. In 4.2.1 we would like to visualize each loading vector just as it is (without re-adding the pixel means).
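One way to draw the loading vectors is sketched below. It assumes the 784 pixel columns are stored row by row, so that `reshape(28, 28)` recovers the image layout, and it uses hypothetical random data and a hypothetical `k = 5` in place of the assignment's values.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in data with values in [0, 1], like pixels / 255.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784))
X_demo -= X_demo.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2
loadings = PCA(n_components=k, random_state=100).fit(X_demo).components_

# One panel per loading vector; squeeze=False keeps `axes` 2-D even if k == 1.
fig, axes = plt.subplots(1, k, figsize=(3 * k, 3), squeeze=False)
for i, ax in enumerate(axes[0]):
    ax.imshow(loadings[i].reshape(28, 28), cmap="gray")
    ax.set_title(f"Loading vector {i + 1}")
    ax.axis("off")
```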

#### 4.2.2. Re-Adding the Pixel Means

Next, add your saved pixel means (from 1.1.2) to each of the loading vectors. Then visualize each of these $k$ "mean-added" loading vectors as a 28-by-28 pixel image.
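The sketch below shows one way to add the means back, relying on NumPy broadcasting to add the length-784 mean vector to every row of the loading matrix. The arrays `X_raw` and `pixel_means` and the value `k = 5` are hypothetical stand-ins for the objects built earlier in the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in images with values in [0, 1], like pixels / 255.
rng = np.random.default_rng(0)
X_raw = rng.random((500, 784))
pixel_means = X_raw.mean(axis=0)  # plays the role of the saved means from 1.1.2

k = 5  # hypothetical value; substitute the k selected in 3.2
loadings = PCA(n_components=k, random_state=100).fit(X_raw - pixel_means).components_

# Broadcasting adds the 784 pixel means to each of the k loading vectors.
mean_added = loadings + pixel_means

fig, axes = plt.subplots(1, k, figsize=(3 * k, 3), squeeze=False)
for i, ax in enumerate(axes[0]):
    ax.imshow(mean_added[i].reshape(28, 28), cmap="gray")
    ax.set_title(f"Loading {i + 1} + means")
    ax.axis("off")
```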

#### 4.2.3. Interpretation

What you might have noticed is that the images in 4.2.1 have much more visible variation, distinctness, and interpretability than the images in 4.2.2.

Explore the numerical values in the vectors that we visualized above in 4.2.1 and 4.2.2 and figure out why this happened.
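A hedged starting point for this exploration is to compare the typical size of the loading entries with the size of the pixel means. Every loading vector has unit length, so the root-mean-square of its 784 entries is exactly $1/\sqrt{784} = 1/28 \approx 0.036$, while the scaled pixel means lie between 0 and 1. The sketch below uses hypothetical random data, so whether the means actually dominate should be confirmed on the real vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in images with values in [0, 1], like pixels / 255.
rng = np.random.default_rng(0)
X_raw = rng.random((500, 784))
pixel_means = X_raw.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2
loadings = PCA(n_components=k, random_state=100).fit(X_raw - pixel_means).components_

# Root-mean-square entry size of each unit-length loading vector (always 1/28).
loading_rms = np.sqrt((loadings ** 2).mean(axis=1))

print("loading entries: min", loadings.min(), "max", loadings.max())
print("loading RMS per vector:", loading_rms)
print("pixel means:     min", pixel_means.min(), "max", pixel_means.max())
```

If the means are much larger than the loading entries, every mean-added image is close to the mean image, which would explain why the panels in 4.2.2 look alike.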

### 4.3. Analyzing Principal Component Coordinates

#### 4.3.1. First Two Principal Components
Next, plot your first two principal component attributes in a scatterplot. Color-code the points in your scatterplot by the fashion item labels.
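A possible implementation is sketched below with plain Matplotlib, drawing one `scatter` call per label so that each fashion item gets its own color and legend entry. The data, the three labels, and the helper `scatter_by_label` are hypothetical and introduced only for this sketch.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in data and labels (the real data has 10 labels).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)
labels_demo = rng.choice(["Bag", "Coat", "Sandal"], size=500)

scores = PCA(n_components=4, random_state=100).fit_transform(X_demo)
pc_df = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3", "PC4"])
pc_df["label"] = labels_demo

def scatter_by_label(data, x_col, y_col):
    """Scatter two principal component columns, one color per label."""
    fig, ax = plt.subplots()
    for name, group in data.groupby("label"):
        ax.scatter(group[x_col], group[y_col], s=15, label=name)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend(title="Fashion item")
    return ax

ax_12 = scatter_by_label(pc_df, "PC1", "PC2")
```

The same helper could be called with `"PC3"` and `"PC4"` for any other pair of components.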

#### 4.3.2. Third and Fourth Principal Components
Next, plot the third and fourth principal component attributes in a scatterplot. Color-code the points in your scatterplot by the fashion item labels.

#### 4.3.3. Boxplots of Principal Component Values

* Create a side-by-side boxplot, plotting the principal component 1 value for each of the 10 fashion items.
* Create a side-by-side boxplot, plotting the principal component 2 value for each of the 10 fashion items.
* Create a side-by-side boxplot, plotting the principal component 3 value for each of the 10 fashion items.
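The three boxplots above could be produced along the following lines. The sketch uses hypothetical random data with three labels instead of ten, and it also tabulates the per-label medians, which is an addition of this sketch that may help when reading the plots.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in data and labels (the real data has 10 labels).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)
labels_demo = rng.choice(["Bag", "Coat", "Sandal"], size=500)

scores = PCA(n_components=3, random_state=100).fit_transform(X_demo)
pc_df = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3"])
pc_df["label"] = labels_demo

names = sorted(pc_df["label"].unique())
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["PC1", "PC2", "PC3"]):
    # One box per fashion item for this principal component.
    groups = [pc_df.loc[pc_df["label"] == n, col].to_numpy() for n in names]
    ax.boxplot(groups)
    ax.set_xticklabels(names, rotation=90)
    ax.set_title(col)

# Median of each principal component within each label.
medians = pc_df.groupby("label")[["PC1", "PC2", "PC3"]].median()
print(medians)
```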

### 4.4. Principal Component 1 Interpretation

Use your results from 4.3 and 4.2.1 to answer the following questions. You may also need to rely on prior knowledge of what each of these fashion items looks like. (In assignment 6, we visualized a handful of each type of fashion item.)

1.  What kind of fashion items have the highest positive median principal component 1 values? Give 2.
2.  What kind of fashion items have the highest negative median principal component 1 values? Give 3.
3.  What kind of fashion items have the lowest magnitude median principal component 1 values? Give 2.
4.  Describe the pixel relationship that accounts for the *most* amount of image variability in this dataset.

### 4.5. Principal Component 2 Interpretation

Use your results from 4.3 and 4.2.1 to answer the following questions. You may also need to rely on prior knowledge of what each of these fashion items looks like. (In assignment 6, we visualized a handful of each type of fashion item.)

1.  What kind of fashion items have the highest positive median principal component 2 values? Give 2.
2.  What kind of fashion items have the highest negative median principal component 2 values? Give 2.
3.  What kind of fashion items have the lowest magnitude median principal component 2 values? Give 2.
4.  Describe the pixel relationship that accounts for the *second* most amount of image variability in this dataset.

### 4.6. Principal Component 3 Interpretation

Use your results from 4.3 and 4.2.1 to answer the following questions. You may also need to rely on prior knowledge of what each of these fashion items looks like. (In assignment 6, we visualized a handful of each type of fashion item.)

1.  What kind of fashion items have the highest positive median principal component 3 values? Give 2.
2.  What kind of fashion items have the highest negative median principal component 3 values? Give 1.
3.  Describe the pixel relationship that accounts for the *third* most amount of image variability in this dataset.

### 4.7. The Variability of Sandals... Again &#128514;

#### 4.7.1. High and Low Principal Component 1 Values for Sandals

Finally, visualize the image of the sandal that has the highest magnitude value for principal component 1. Then visualize the image of the sandal that has the lowest magnitude value for principal component 1.
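A sketch of how the two sandals could be located and drawn follows. It interprets "magnitude" as the absolute value of the principal component 1 coordinate, and it uses hypothetical random images and labels, so the pictures it produces are noise rather than sandals.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical stand-in images (values in [0, 1]) and labels.
rng = np.random.default_rng(0)
X_raw = rng.random((500, 784))
labels_demo = rng.choice(["Bag", "Coat", "Sandal"], size=500)

centered = X_raw - X_raw.mean(axis=0)
pc_scores = PCA(n_components=5, random_state=100).fit_transform(centered)

# Row positions of the sandals, and the |PC1| value of each sandal.
sandal_rows = np.flatnonzero(labels_demo == "Sandal")
pc1_magnitude = np.abs(pc_scores[sandal_rows, 0])
row_high = sandal_rows[pc1_magnitude.argmax()]
row_low = sandal_rows[pc1_magnitude.argmin()]

# Show the original (not mean-scaled) images side by side.
fig, axes = plt.subplots(1, 2, figsize=(6, 3))
for ax, row, title in zip(axes, [row_high, row_low], ["Highest |PC1|", "Lowest |PC1|"]):
    ax.imshow(X_raw[row].reshape(28, 28), cmap="gray")
    ax.set_title(title)
    ax.axis("off")
```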

#### 4.7.2. Interpretation

Why do we think the image with the higher principal component 1 magnitude was given a much higher value than the other sandal image that we looked at?

## 5. Double Checking Desired PCA Properties

### 5.1. Attribute Variances

Calculate the variances of your k principal components.
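A minimal sketch of this check is shown below on hypothetical synthetic data with a hypothetical `k = 5`. It computes the sample variance of each principal component column and prints it next to scikit-learn's `explained_variance_` attribute, which should agree closely.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2
pca = PCA(n_components=k, random_state=100)
pc_scores = pca.fit_transform(X_demo)

# Sample variance (ddof=1) of each principal component column.
pc_variances = pc_scores.var(axis=0, ddof=1)
print(pc_variances)
print(pca.explained_variance_)
```

The variances should appear in decreasing order, since the first component captures the most variance.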

### 5.2. Percent of Total Pixel Variance

Calculate the sum of the principal component variances. Then divide this value by the total pixel attribute variance from 2.2.
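The ratio can be sketched as follows, again on hypothetical synthetic data with a hypothetical `k = 5`. The result is compared with the sum of scikit-learn's `explained_variance_ratio_`, which is expected to give nearly the same fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2
pca = PCA(n_components=k, random_state=100)
pc_scores = pca.fit_transform(X_demo)

# Total variance kept by the k components, divided by the total pixel variance.
total_pc_variance = pc_scores.var(axis=0, ddof=1).sum()
total_pixel_variance = X_demo.var(axis=0, ddof=1).sum()
fraction_preserved = total_pc_variance / total_pixel_variance

print(fraction_preserved, pca.explained_variance_ratio_.sum())
```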

### 5.3. Covariance Matrix

Finally, calculate the covariance matrix of your principal components.
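A short sketch of the covariance check is given below, using hypothetical synthetic data and a hypothetical `k = 5`. Since principal components are uncorrelated, the off-diagonal entries are expected to be close to zero, with the component variances on the diagonal.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 10 latent factors plus noise, then mean-scaled.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 784))
X_demo += 0.5 * rng.normal(size=(500, 784))
X_demo -= X_demo.mean(axis=0)

k = 5  # hypothetical value; substitute the k selected in 3.2
pc_scores = PCA(n_components=k, random_state=100).fit_transform(X_demo)

# rowvar=False treats each column (principal component) as a variable.
pc_cov = np.cov(pc_scores, rowvar=False)

# Largest off-diagonal entry in absolute value; expected to be near zero.
off_diagonal = pc_cov - np.diag(np.diag(pc_cov))
print(np.round(pc_cov, 3))
print(np.abs(off_diagonal).max())
```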