# Problem Set 1
## PPOL 566

For this assignment, you must submit a completed version of this notebook on Canvas no later than **11:59 PM on Wednesday, September 7**. The filename must end with your Georgetown NetID.

**Written responses** must be written in Markdown within the notebook. For full credit, responses must be written in complete sentences with proper spelling and grammar. Any references included must be properly cited. Failure to properly cite sources will result in a zero on the assignment.

**Code** must run without warnings or errors. Scripts that produce errors will receive zero points for those sections. For full credit, code must be free of semantic errors, be written with liberal comments, include meaningful variable names, and use control structures to minimize repetition (three or more instances of nearly identical statements). NOTE: There is an empty cell after each problem. This does not mean that all code for a given section must be placed in one cell. You may use as many cells as you would like for each section.

## Part 1

### 1.1) Discuss two ways that domain expertise can be applied to unsupervised learning methods.

### 1.2) Describe one scenario in which you *would* use principal components analysis and describe another scenario in which you *would not* use principal components analysis.

### 1.3) Define standardization and discuss its usage with principal components analysis and clustering.  Must data always be standardized?

<font face = "Arial" size = 3>
Standardization is a preprocessing step that puts all features on a common scale. After transformation, each feature has a mean of 0 and a standard deviation of 1. It is very useful when features are measured on different scales, because the analysis should be scale agnostic; otherwise, the variance captured will be due to larger magnitudes and not actual variation in the data. In Principal Component Analysis (PCA), we try to find the directions that capture the maximum variance, so not standardizing the data might lead to erroneous results. Clustering methods that rely on distances are affected in the same way, since features with larger magnitudes dominate the distance calculation.<br>
Data doesn't always need to be standardized. This is especially true when the features in question share the same scale and units. In those instances, it can be beneficial to capture the variances using the original dataset.
</font>
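
To make the point above concrete, here is a minimal sketch (not part of the assignment data) that compares the first principal component before and after standardization. The two features, `income` and `rating`, and the helper names `standardize` and `first_component` are hypothetical and chosen only for illustration. The component is computed with a plain NumPy eigendecomposition of the covariance matrix rather than with scikit-learn.

```python
import numpy as np

def standardize(data):
    """Center each column at 0 and rescale it to unit standard deviation."""
    return (data - data.mean(axis=0)) / data.std(axis=0)

def first_component(data):
    """Return the direction of largest variance (the first principal component)."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return eigenvectors[:, np.argmax(eigenvalues)]

# Hypothetical features measured on very different scales
rng = np.random.default_rng(0)
n_observations = 500
income = rng.normal(50_000, 15_000, n_observations)  # large magnitudes
rating = rng.normal(3, 1, n_observations)             # small magnitudes
data = np.column_stack([income, rating])

raw_pc1 = first_component(data)
scaled_pc1 = first_component(standardize(data))

# Absolute values are printed because the sign of a component is arbitrary
print("PC1 weights, raw data:         ", np.round(np.abs(raw_pc1), 3))
print("PC1 weights, standardized data:", np.round(np.abs(scaled_pc1), 3))
```

On the raw data, the first component should point almost entirely along `income`, simply because its numbers are larger. After standardization, neither feature dominates by magnitude alone.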

## Part 2
The **data** file for this assignment is called **pset1_data.csv** and can be found in the Data folder on Canvas. The dataset includes 876 observations for the following individual personality ratings:
 * Openness to experience
 * Conscientiousness
 * Extraversion
 * Agreeableness
 * Neuroticism
 
Together, these are known as the Big Five personality traits, often abbreviated **OCEAN**.

In [19]:
### Preliminaries 
### Importing required packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### Reading in the data 
df_personality = pd.read_csv("pset1_data.csv")

### Exploring the data
print(df_personality.shape)
print(df_personality.dtypes)
print(list(df_personality.columns))
df_personality.head()

(876, 5)
Nscore    float64
Escore    float64
Oscore    float64
Ascore    float64
Cscore    float64
dtype: object
['Nscore', 'Escore', 'Oscore', 'Ascore', 'Cscore']


|  | Nscore | Escore | Oscore | Ascore | Cscore |
|---|---|---|---|---|---|
| 0 | 0.31287 | -0.57545 | -0.58331 | -0.91699 | -0.00665 |
| 1 | -0.67825 | -0.30033 | -1.55521 | 2.03972 | 1.63088 |
| 2 | -1.32828 | 1.93886 | -0.84732 | -0.30172 | 1.63088 |
| 3 | 0.62967 | 2.57309 | -0.97631 | 0.76096 | 1.13407 |
| 4 | -0.79151 | 0.80523 | -0.01928 | 0.94156 | 3.46436 |


### 2.1) Perform a dimensionality reduction using Principal Components Analysis. 

In [36]:
def pca_process(df) :
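    """
    Function to perform Principal Component Analysis (PCA) on a given dataframe. 

    Arguments : 
    ~~~~~~~~~~~
    df : Object of class DataFrame that has to go through PCA

    Returns :
    ~~~~~~~~~~~
    Z_df : DataFrame of principal component loadings, with one row per
           original feature and one column per component (Z1, Z2, ...)
    """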
    ### Standardizing the dataset 
    st_scaler = StandardScaler()
    df_columns = list(df.columns)
    n = df.shape[1]
    df_scaled = pd.DataFrame(st_scaler.fit_transform(df), columns = df_columns)
    
    ### Fitting the dataset to PCA 
    pca_model = PCA()
    pca_model.fit(df_scaled)

    ### Getting the loadings matrix 
    Z_matrix = np.transpose(pca_model.components_)
    Z_Columns = [f"Z{i}" for i in range(1, n+1)]
    Z_df = pd.DataFrame(Z_matrix, index = df_columns, columns = Z_Columns)
    return Z_df

pca_process(df_personality)





|  | Z1 | Z2 | Z3 | Z4 | Z5 |
|---|---|---|---|---|---|
| Nscore | 0.561962 | 0.040939 | 0.256371 | 0.570327 | 0.53993 |
| Escore | -0.523148 | 0.381223 | -0.306198 | -0.034267 | 0.697176 |
| Oscore | -0.033766 | 0.860811 | 0.213478 | 0.24496 | -0.390239 |
| Ascore | -0.38733 | -0.119581 | 0.886913 | -0.156674 | 0.156571 |
| Cscore | -0.509264 | -0.312567 | -0.091268 | 0.767464 | -0.213591 |


### 2.2) Display a table of principal component loadings. 

In [10]:
### Checking the spread of the raw Escore column 
df_personality["Escore"].std()

0.9656095879811529

### 2.3) Provide an interpretation of the loadings for the first two components.

In [30]:
pca_process(df_personality).head()

|  | Nscore | Escore | Oscore | Ascore | Cscore |
|---|---|---|---|---|---|
| 0 | 0.375577 | -0.650142 | -0.556303 | -0.97586 | -0.079816 |
| 1 | -0.622536 | -0.36506 | -1.513866 | 2.038381 | 1.55145 |
| 2 | -1.277152 | 1.955204 | -0.816419 | -0.348618 | 1.55145 |
| 3 | 0.694612 | 2.612397 | -0.943506 | 0.73474 | 1.05654 |
| 4 | -0.736595 | 0.780528 | -0.000594 | 0.918854 | 3.377916 |


### 2.4) Create a scree plot based on the PCA results.

In [17]:
st = StandardScaler()
df_scaled = st.fit_transform(df_personality)
df_scaled = pd.DataFrame(df_scaled)
df_scaled

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.375577 | -0.650142 | -0.556303 | -0.975860 | -0.079816 |
| 1 | -0.622536 | -0.365060 | -1.513866 | 2.038381 | 1.551450 |
| 2 | -1.277152 | 1.955204 | -0.816419 | -0.348618 | 1.551450 |
| 3 | 0.694612 | 2.612397 | -0.943506 | 0.734740 | 1.056540 |
| 4 | -0.736595 | 0.780528 | -0.000594 | 0.918854 | 3.377916 |
| ... | ... | ... | ... | ... | ... |
| 871 | 0.801138 | -1.330225 | 0.593107 | -0.659155 | -0.477449 |
| 872 | -0.089371 | -0.650142 | 1.432558 | -0.975860 | -0.851752 |
| 873 | -0.736595 | 0.279771 | 0.307454 | -0.348618 | -0.348205 |
| 874 | 1.201302 | -1.480081 | -1.238312 | -1.847506 | -1.452913 |
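
A scree plot shows the share of total variance explained by each principal component, ordered from largest to smallest. The following is a minimal sketch of one way to build it, assuming standardized inputs. The function names `explained_variance_ratio` and `plot_scree` are hypothetical, the data are synthetic stand-ins rather than `pset1_data.csv`, and the ratios are computed from the eigenvalues of the covariance matrix with NumPy instead of a fitted scikit-learn model.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component, largest first."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    covariance = np.cov(standardized, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order, so reverse them
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]
    return eigenvalues / eigenvalues.sum()

def plot_scree(ratios):
    """Draw a scree plot: component number against proportion of variance explained."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it
    component_numbers = np.arange(1, len(ratios) + 1)
    fig, ax = plt.subplots()
    ax.plot(component_numbers, ratios, marker="o")
    ax.set_xticks(component_numbers)
    ax.set_xlabel("Principal component")
    ax.set_ylabel("Proportion of variance explained")
    ax.set_title("Scree plot")
    return ax

# Synthetic stand-in with the same shape as the five score columns (NOT the real data)
rng = np.random.default_rng(0)
synthetic_scores = rng.normal(size=(876, 5))
synthetic_scores[:, 1] += 0.8 * synthetic_scores[:, 0]  # induce some correlation

ratios = explained_variance_ratio(synthetic_scores)
print(np.round(ratios, 3))
# In a notebook cell, call plot_scree(ratios) to display the figure.
```

With a fitted scikit-learn `PCA` object, the `explained_variance_ratio_` attribute holds the same quantities and could be passed to `plot_scree` directly.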


### 2.5) Based on your interpretation of the scree plot, how many principal components would you keep?