# PCA SOLUTION

**File:** PCASolution.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# CHALLENGE

In this challenge, I invite you to do the following:

1. Set up the PCA object.
1. Project the data onto the principal directions found by PCA.
1. Plot the proportion of variance explained by each direction.
1. Create a scatter plot of projected data along the first two principal directions.

# IMPORT LIBRARIES

In [None]:
import pandas as pd                    # For dataframes
import matplotlib.pyplot as plt        # For plotting data
import seaborn as sns                  # For plotting data
from sklearn.decomposition import PCA  # For PCA

# LOAD DATA

For this challenge, we'll use the `swiss` dataset, which is saved in the data folder as `swiss.csv`. This dataset contains a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in about 1888. (For more information, see https://opr.princeton.edu/archive/pefp/switz.aspx.)

We'll use the complete dataset for this challenge, as opposed to separating it into training and testing sets.

In [None]:
# Imports the data
df = pd.read_csv('data/swiss.csv')

In [None]:
# Shows the first few rows of the data
df.head()

# PRINCIPAL COMPONENT ANALYSIS

In [None]:
# Sets up the PCA object
pca = PCA()

# Transforms the data ('tf' = 'transformed')
df_tf = pca.fit_transform(df)

# Plots the proportion of variance explained by each component
plt.plot(pca.explained_variance_ratio_)
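
The per-component ratios are often easier to read as a running total. The following is a minimal sketch, not part of the original solution: it uses a synthetic array `X` as a stand-in for the `swiss` data (same 47 rows, an assumed 6 columns), and the 90% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the dataset (assumption: 47 rows, 6 numeric columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(47, 6)) * np.array([10.0, 5.0, 3.0, 2.0, 1.0, 0.5])

# Fits PCA and accumulates the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
# (the 0.90 threshold is a hypothetical cutoff chosen for illustration)
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
```

With the real data, `plt.plot(cumulative)` would show how quickly the total approaches 1 as components are added.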

In [None]:
# Plots the projected data on the first two principal components
sns.scatterplot(
    x=df_tf[:, 0],
    y=df_tf[:, 1])
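
The scatter plot above has unlabeled axes. One possible refinement, sketched here under assumptions, is to build axis labels that report how much variance each of the first two components explains. The array `X` is a synthetic stand-in for the `swiss` data, and the names `xlabel` and `ylabel` are illustrative, not from the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the dataset (assumption: 47 rows, 6 numeric columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(47, 6)) * np.array([10.0, 5.0, 3.0, 2.0, 1.0, 0.5])

# fit_transform returns a NumPy array of shape (n_samples, n_components),
# which is why the projected data is indexed by column position
pca = PCA()
X_tf = pca.fit_transform(X)

# Builds axis labels from the explained variance ratios
ratios = pca.explained_variance_ratio_
xlabel = f"PC1 ({ratios[0]:.1%} of variance)"
ylabel = f"PC2 ({ratios[1]:.1%} of variance)"
```

These strings can then be passed to `plt.xlabel(xlabel)` and `plt.ylabel(ylabel)` after the `sns.scatterplot` call.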