# CME538 - Introduction to Data Science
## Lecture 10.1 - Feature Extraction

### Lecture Structure
1. [Principal Component Analysis (PCA)](#section1)
2. [Application 1: mtcars Dataset](#section2)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

<a id='section1'></a>
# Principal Component Analysis (PCA)
### Import Trees Dataset

In [None]:
trees = pd.read_csv('trees.csv', index_col=0)
trees.head()

What do these columns mean?
- **Girth (numeric)**: Tree diameter in inches.
- **Height (numeric)**: Height in feet.
- **Volume (numeric)**: Volume of timber in cubic feet.

Let's try plotting this dataset.

In [None]:
sns.pairplot(trees);

First, let's only consider `Girth` and `Volume` to keep things simple and easy to visualize (2D). Let's plot `Girth` vs `Volume`.

In [None]:
ax = sns.scatterplot(x='Girth', y='Volume', data=trees)
ax.set_xlabel('Girth (Diameter), inches', fontsize=16)
ax.set_ylabel('Volume, feet$^{3}$', fontsize=16);

### Try PCA using sklearn

Ok, let's try using sklearn to get the first two principal components.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Define df for features of interest
X = trees[['Girth', 'Volume']]

# Scale input
X_scaled = StandardScaler().fit_transform(X)

# Initialize PCA object and set the number of 
# components to 2
pca = PCA(n_components=2)

# Compute the components using .fit_transform()
X_transformed = pca.fit_transform(X_scaled)
X_transformed = pd.DataFrame(data=X_transformed, 
                             columns=['pc1', 'pc2'])

If we apply our PCA transformation to the scaled data, we get two new columns. Column `pc1` holds the first principal component (PC1) and column `pc2` holds the second (PC2).

In [None]:
X_transformed.head()

Now, let's plot them.

In [None]:
ax = sns.scatterplot(x='pc1', y='pc2', data=X_transformed)
ax.set_xlim([-4, 4])
ax.set_ylim([-4, 4])
ax.set_xlabel('Principal Component 1 (PC1)', fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)', fontsize=16);

And, how much variance does each component account for?

In [None]:
print('PC1 total explained variance: {:0.6f}'.format(pca.explained_variance_[0]))
print('PC2 total explained variance: {:0.6f}'.format(pca.explained_variance_[1]))

In [None]:
print('PC1 explained variance ratio: {:0.2f}%'.format(pca.explained_variance_ratio_[0]*100))
print('PC2 explained variance ratio: {:0.2f}%'.format(pca.explained_variance_ratio_[1]*100))

Or, we can compute the same variances directly from the transformed columns using `NumPy`.

In [None]:
print('PC1 total explained variance: {:0.6f}'.format(np.var(X_transformed['pc1'], ddof=1)))
print('PC2 total explained variance: {:0.6f}'.format(np.var(X_transformed['pc2'], ddof=1)))

Next, let's plot the distributions for `pc1` and `pc2`.

In [None]:
ax = sns.histplot(data=X_transformed, kde=True, stat='density')
ax.set_xlim([-4, 4]);

In [None]:
# jointplot returns a JointGrid (not an Axes); `rug` is not a histplot
# option, so it is left out of marginal_kws
g = sns.jointplot(data=X_transformed, 
                  x='pc1', y='pc2', 
                  xlim=(-4, 4), ylim=(-4, 4),
                  marginal_kws=dict(binwidth=0.25))
g.set_axis_labels('Principal Component 1 (PC1)', 
                  'Principal Component 2 (PC2)', fontsize=16);

### Let's try coding this up ourselves
First, let's create `X` again, which is a DataFrame with two columns (`Girth` and `Volume`). Again, we'll keep things simple and easy to visualize by only using 2 features, but this would work with all three (`Girth`, `Height`, `Volume`) or more.

In [None]:
X = trees[['Girth', 'Volume']]

Next, let's scale our features by subtracting the mean and dividing by the standard deviation. This is called standardizing the data. Let's call this new array `Z`.

In [None]:
Z = pd.DataFrame(data=StandardScaler().fit_transform(X), columns=X.columns)

What does `Z` look like? 

In [None]:
Z.head(10)
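
As a rough illustration of what standardizing does, here is a minimal sketch that applies the same formula by hand. The array `X_demo` is a made-up stand-in for the `Girth` and `Volume` columns (it is not the trees data), and the sketch assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses.

```python
import numpy as np

# Hypothetical stand-in for two feature columns (values are made up)
X_demo = np.array([[8.3, 10.3],
                   [8.6, 10.3],
                   [10.5, 16.4],
                   [11.0, 18.2],
                   [12.9, 22.2],
                   [17.5, 55.7]])

# Subtract each column's mean and divide by its population
# standard deviation (ddof=0)
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

print(Z_demo.mean(axis=0))         # approximately 0 for each column
print(Z_demo.std(axis=0, ddof=0))  # approximately 1 for each column
```

After this step every column has a mean of zero and unit spread, so no feature dominates the covariance matrix simply because of its units.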

The next thing we have to do is compute the covariance matrix of `Z`. There are a few ways we can compute the covariance matrix in `Python`. Let's try them all.

#### Method 1
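We can take `Z`, transpose it, multiply the transposed matrix by `Z`, and divide by $n - 1$ (where $n$ is the number of rows), like this: $\frac{1}{n-1}Z^{T}Z$. This works because the columns of `Z` already have a mean of zero.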

In [None]:
Z_cov1 = np.matmul(Z.T, Z) / (len(Z) - 1)
print(Z_cov1)

#### Method 2
And, of course, `NumPy` has a built-in function for this: `np.cov()`. By default it treats each row as a variable, which is why we pass in `Z.T`.

In [None]:
Z_cov2 = np.cov(Z.T)
print(Z_cov2)
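
The two methods should agree. The sketch below checks this on synthetic data (the names `X_demo` and `Z_demo` are hypothetical and the values are random, not the trees data). It also shows that the diagonal is $n/(n-1)$ rather than exactly 1, because the standardization divides by the population standard deviation while the covariance divides by $n - 1$.

```python
import numpy as np

# Hypothetical correlated two-column data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
X_demo[:, 1] += 0.8 * X_demo[:, 0]

# Standardize with the population standard deviation (ddof=0)
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
n = Z_demo.shape[0]

# Method 1: matrix product of the centered data with itself
cov_manual = Z_demo.T @ Z_demo / (n - 1)

# Method 2: np.cov, telling it that columns (not rows) are variables
cov_numpy = np.cov(Z_demo, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
print(np.diag(cov_manual), n / (n - 1))
```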

Next, we have to compute the eigenvectors and their corresponding eigenvalues for the covariance matrix of `Z`. Let's do this using `NumPy`.

In [None]:
eig_val, eig_vec = np.linalg.eig(Z_cov2)
print('Eigen Values:\n{}\n'.format(eig_val))
print('Eigen Vectors:\n{}'.format(eig_vec))

Ok, and let's compare this to the output from sklearn (`.explained_variance_`, `.components_`).

In [None]:
print('Explained Variance:\n{}\n'.format(pca.explained_variance_))
print('Principal Components:\n{}'.format(pca.components_))

Hmmmmmmmm, something doesn't look right when comparing `eig_vec` to `pca.components_`. While `PCA()` stores each eigenvector as a row, `np.linalg.eig()` returns the eigenvectors as columns. A quick transpose of `eig_vec` solves this problem. The signs may still differ between the two, because an eigenvector multiplied by -1 is still an eigenvector.

In [None]:
eig_val, eig_vec = np.linalg.eig(Z_cov2)
print('Eigen Values:\n{}\n'.format(eig_val))
print('Eigen Vectors:\n{}'.format(eig_vec.T))

Now you can see how eigendecomposition can be used to compute the principal components of a feature matrix. 
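
One caveat: `np.linalg.eig()` does not promise any particular ordering of the eigenvalues, whereas principal components are conventionally listed from largest to smallest variance. A minimal sketch of a more defensive version is shown below. It uses `np.linalg.eigh()`, which is intended for symmetric matrices and returns eigenvalues in ascending order, and then reverses the order. The matrix `cov_demo` is a made-up example, not the covariance of the trees data.

```python
import numpy as np

# Hypothetical 2x2 covariance matrix (symmetric)
cov_demo = np.array([[1.0, 0.9],
                     [0.9, 1.0]])

# eigh returns eigenvalues in ascending order, eigenvectors as columns
eig_val_demo, eig_vec_demo = np.linalg.eigh(cov_demo)

# Reorder so the component with the largest variance comes first
order = np.argsort(eig_val_demo)[::-1]
eig_val_sorted = eig_val_demo[order]
eig_vec_sorted = eig_vec_demo[:, order]

# Share of the total variance carried by each component
explained_ratio = eig_val_sorted / eig_val_sorted.sum()

print(eig_val_sorted)    # approximately [1.9, 0.1]
print(explained_ratio)   # approximately [0.95, 0.05]
```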

The last thing we have to do is transform our data into this new coordinate system. This is what we did earlier using `pca.fit_transform()` from `sklearn` but how do we do it without relying on this very nice package?

In [None]:
Z_transformed = np.matmul(Z, eig_vec)
Z_transformed.columns = ['pc1', 'pc2']
Z_transformed.head()
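
To build some confidence in the projection step, the sketch below compares two routes on synthetic data: the eigendecomposition route used above and a singular value decomposition (SVD) of the standardized data, which is one of the approaches `sklearn`'s `PCA` can use internally. All names ending in `_demo` are hypothetical, and the data is random rather than the trees data.

```python
import numpy as np

# Hypothetical correlated two-column data, standardized
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
X_demo[:, 1] += 0.9 * X_demo[:, 0]
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
n = Z_demo.shape[0]

# Route 1: eigendecomposition of the covariance matrix
eig_val_demo, eig_vec_demo = np.linalg.eigh(np.cov(Z_demo, rowvar=False))
order = np.argsort(eig_val_demo)[::-1]
eig_val_demo, eig_vec_demo = eig_val_demo[order], eig_vec_demo[:, order]
scores_eig = Z_demo @ eig_vec_demo

# Route 2: SVD of the standardized data (rows of Vt are the components)
U, S, Vt = np.linalg.svd(Z_demo, full_matrices=False)
scores_svd = Z_demo @ Vt.T

# Same scores up to the sign of each column
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))
# The variance of each score column equals its eigenvalue
print(np.allclose(scores_eig.var(axis=0, ddof=1), eig_val_demo))
# Singular values and eigenvalues are related by S**2 / (n - 1)
print(np.allclose(S**2 / (n - 1), eig_val_demo))
```

The comparison uses absolute values because each component is only defined up to a sign flip.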

And let's plot to see if we get the same as when using `sklearn`.

In [None]:
ax = sns.scatterplot(x='pc1', y='pc2', data=Z_transformed)
ax.set_xlim([-4, 4])
ax.set_ylim([-4, 4])
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(eig_val[0] / sum(eig_val) * 100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(eig_val[1] / sum(eig_val) * 100), fontsize=16);

It worked!!!! But that was a lot of work. Moving forward, we'll use the `sklearn` implementation.

Lastly, let's plot our vectors on the original data and the transformed data.

In [None]:
def draw_vector(v0, v1, pc, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->', linewidth=2,
                    shrinkA=0, shrinkB=0, color='#fc4f30')
    ax.annotate('', v1, v0, arrowprops=arrowprops)
    ax.text(v1[0]+0.25, v1[1]+0.25, pc, color='#fc4f30', ha='center', va='center', fontsize=14)

In [None]:
# Plot data
fig, ax = plt.subplots(1, 2, figsize=(14, 7))
fig.subplots_adjust(wspace=0.2)

# Plot Girth vs Volume
ax[0].scatter(Z['Girth'], Z['Volume'])
for pc, (length, vector) in enumerate(zip(pca.explained_variance_, 
                                          pca.components_)):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v, 'PC{}'.format(pc+1), ax[0])
ax[0].axis('equal')
ax[0].set_xlabel('Girth (Scaled)', fontsize=16)
ax[0].set_ylabel('Volume (Scaled)', fontsize=16)
ax[0].set_xlim([-4, 4])
ax[0].set_ylim([-4, 4])

# Plot PC1 vs PC2
ax[1].scatter(X_transformed['pc1'], X_transformed['pc2'])
for pc, (length, vector) in enumerate(zip(pca.explained_variance_, 
                                          pca.components_.T)):
    v = vector * 3
    draw_vector(pca.mean_, pca.mean_ + v, X.columns[pc], ax[1])
ax[1].axis('equal')
ax[1].set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(eig_val[0] / sum(eig_val) * 100), fontsize=16)
ax[1].set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(eig_val[1] / sum(eig_val) * 100), fontsize=16)
ax[1].set_xlim([-4, 4])
ax[1].set_ylim([-4, 4]);

<a id='section2'></a>
# Application 1: mtcars Dataset
## mtcars Dataset
Let's move on to a slightly more complex dataset, `mtcars`. This dataset consists of data on 32 models of car, taken from a 1974 issue of Motor Trend magazine. There are 11 features expressed in varying imperial units.
- `mpg`: Fuel economy (Miles per (US) gallon): more powerful and heavier cars tend to consume more fuel, which means a lower `mpg`.
- `cyl`: Number of cylinders: more powerful cars often have more cylinders.
- `disp`: Displacement (cu.in.): the combined volume of the engine's cylinders.
- `hp`: Gross horsepower: this is a measure of the power generated by the car.
- `drat`: Rear axle ratio: this describes how a turn of the drive shaft corresponds to a turn of the wheels. Higher values will decrease fuel efficiency.
- `wt`: Weight (1000 lbs): pretty self-explanatory!
- `qsec`: 1/4 mile time (seconds): a measure of the car's speed and acceleration.
- `vs`: Engine block: this denotes whether the vehicle's engine is shaped like a "V", or is a more common straight shape.
- `am`: Transmission: this denotes whether the car's transmission is automatic (0) or manual (1).
- `gear`: Number of forward gears: sports cars tend to have more gears.
- `carb`: Number of carburetors: associated with more powerful engines.

Let's import the dataset.

In [None]:
mtcars = pd.read_csv('mtcars.csv')
mtcars.head()

First, let's compute the principal components using `sklearn`. Because `PCA` works best with numerical data, let's exclude the categorical variables (`model`, `country`, `vs` and `am`). The code below also drops `carb`.

In [None]:
mtcars.columns

In [None]:
# Get features
mtcars_features = mtcars.drop(['model', 'vs', 'am', 'carb', 'country'], axis=1)

# Initialize scaler
scaler = StandardScaler()

# Scale features
mtcars_features_scaled = pd.DataFrame(data=scaler.fit_transform(mtcars_features),
                                      columns=mtcars_features.columns)

# Initialize PCA object 
pca = PCA()

# Compute principal components
pca.fit(mtcars_features_scaled)

# Transform features into new coordinate system
X_transformed = pca.transform(mtcars_features_scaled)
X_transformed = pd.DataFrame(data=X_transformed, 
                             columns=['pc{}'.format(comp+1) for 
                                      comp in range(mtcars_features.shape[1])])

# View DataFrame
X_transformed.head()

Next, I'll just create a DataFrame to easily summarize our principal components.

In [None]:
pca_summary = pd.DataFrame(
    {'Variance': pca.explained_variance_,
     'Proportion of Variance': pca.explained_variance_ratio_,
     'Cumulative Proportion': np.cumsum(pca.explained_variance_ratio_)}
).T
pca_summary.columns = ['PC{}'.format(comp+1) for comp in range(mtcars_features.shape[1])]
pca_summary.head()

Next, let's create what is called a `scree plot`.

In [None]:
plt.plot(np.arange(pca_summary.shape[1]), 
         pca_summary.loc['Proportion of Variance', :]*100, '-o', label='Variance')
plt.plot(np.arange(pca_summary.shape[1]), 
         pca_summary.loc['Cumulative Proportion', :]*100, '-o', label='Cumulative Variance')
plt.xticks(np.arange(pca_summary.shape[1]), 
           ['PC{}'.format(comp+1) for comp in range(mtcars_features.shape[1])]);
plt.legend()
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Explained Variance, %');

The `scree` plot shows us that after `PC3`, there are minimal gains in terms of the explained variance when adding additional principal components.
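
Reading the cut-off from the plot can also be done numerically. The sketch below picks the smallest number of components whose cumulative explained variance reaches a chosen threshold. The ratios in `ratios_demo` and the 90% threshold are made up for illustration and are not the `mtcars` results.

```python
import numpy as np

# Hypothetical explained variance ratios for 8 components (sum to 1)
ratios_demo = np.array([0.62, 0.23, 0.06, 0.04, 0.02, 0.01, 0.01, 0.01])
cumulative = np.cumsum(ratios_demo)

# Assumed threshold: keep enough components to explain 90% of the variance
threshold = 0.90

# Index of the first cumulative value >= threshold, plus 1 to get a count
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(n_keep)  # 3
```

`sklearn`'s `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance in a similar way.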

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
ax.scatter(X_transformed['pc1'], X_transformed['pc2'], alpha=0.5)

ax.axis('equal')
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 0]*100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 1]*100), fontsize=16)
ax.set_xlim([-4, 6])
ax.set_ylim([-5, 5]);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
for pc, vector in enumerate(pca.components_.T):
    v = vector[0:2] * 6
    draw_vector([0, 0], [0, 0] + v, mtcars_features.columns[pc], ax)
ax.scatter(X_transformed['pc1'], X_transformed['pc2'], alpha=0.5)

ax.axis('equal')
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 0]*100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 1]*100), fontsize=16)
ax.set_xlim([-4, 6])
ax.set_ylim([-5, 5]);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
for idx, txt in enumerate(mtcars['model'].tolist()):
    ax.annotate(txt, (X_transformed.iloc[idx]['pc1'], 
                      X_transformed.iloc[idx]['pc2']), 
                color=[0.3, 0.3, 0.3], alpha=0.5)
    
for pc, vector in enumerate(pca.components_.T):
    v = vector[0:2] * 6
    draw_vector([0, 0], [0, 0] + v, mtcars_features.columns[pc], ax)
ax.scatter(X_transformed['pc1'], X_transformed['pc2'], alpha=0.75)

ax.axis('equal')
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 0]*100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 1]*100), fontsize=16)
ax.set_xlim([-4, 6])
ax.set_ylim([-5, 5]);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
colors = {'Japan': "#1a53ff", 'US': "#00bfa0", 'Europe': "#dc0ab4"}
for idx, (model, country) in enumerate(zip(mtcars['model'].tolist(), mtcars['country'].tolist())):
    ax.annotate(model, 
                (X_transformed.iloc[idx]['pc1'], X_transformed.iloc[idx]['pc2']), 
                color=colors[country],
                alpha=0.5)
    
for pc, vector in enumerate(pca.components_.T):
    v = vector[0:2] * 6
    draw_vector([0, 0], [0, 0] + v, mtcars_features.columns[pc], ax)
ax.scatter(X_transformed['pc1'], X_transformed['pc2'], alpha=0.75)

ax.axis('equal')
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 0]*100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 1]*100), fontsize=16)
ax.set_xlim([-4, 6])
ax.set_ylim([-5, 5]);

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
colors = {'Japan': "#1a53ff", 'US': "#00bfa0", 'Europe': "#dc0ab4"}
for idx, (model, country) in enumerate(zip(mtcars['model'].tolist(), mtcars['country'].tolist())):
    
    ax.annotate('{} ({})'.format(model, country), 
                (X_transformed.iloc[idx]['pc3'], X_transformed.iloc[idx]['pc4']), 
                color=colors[country],
                alpha=0.5)
    
for pc, vector in enumerate(pca.components_.T):
    # Use the PC3 and PC4 loadings, since this plot shows pc3 vs pc4
    v = vector[2:4] * 2
    draw_vector([0, 0], [0, 0] + v, mtcars_features.columns[pc], ax)
ax.scatter(X_transformed['pc3'], X_transformed['pc4'], alpha=0.75)

ax.axis('equal')
ax.set_xlabel('Principal Component 3 (PC3)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 2]*100), fontsize=16)
ax.set_ylabel('Principal Component 4 (PC4)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 3]*100), fontsize=16)
ax.set_xlim([-6, 9])
ax.set_ylim([-5, 5]);

In [None]:
bugatti = pd.DataFrame([{'model': 'Bugatti Veyron', 'mpg': 13.3, 'cyl': 16, 
                         'disp': 487.8, 'hp': 987, 'drat': 3.643, 'wt': 4.387, 
                         'qsec': 9.9, 'vs': 2, 'am': 1, 'gear': 7, 'carb': 0, 
                         'country': 'Europe'}])
mtcars = pd.concat([mtcars, bugatti]).reset_index(drop=True)
mtcars.tail()

In [None]:
# Get features
mtcars_features = mtcars.drop(['model', 'vs', 'am', 'carb', 'country'], axis=1)

# Scale features
mtcars_features_scaled = pd.DataFrame(data=scaler.transform(mtcars_features),
                                      columns=mtcars_features.columns)

# Transform features into new coordinate system
X_transformed = pca.transform(mtcars_features_scaled)
X_transformed = pd.DataFrame(data=X_transformed, 
                             columns=['pc{}'.format(comp+1) for 
                                      comp in range(mtcars_features.shape[1])])

# View DataFrame
X_transformed.head()
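
The cell above reuses the already fitted `scaler` and `pca` objects, so the new car is placed in the coordinate system learned from the original 32 cars rather than changing it. A minimal sketch of that idea is shown below with plain `NumPy`. The arrays `train` and `new_row` are synthetic and hypothetical, not the `mtcars` values.

```python
import numpy as np

# Hypothetical training data (32 rows, 2 features) and one new outlier row
rng = np.random.default_rng(2)
train = rng.normal(loc=[20.0, 150.0], scale=[5.0, 60.0], size=(32, 2))
new_row = np.array([[13.3, 987.0]])

# "Fit": statistics and components come from the training data only
mean_train = train.mean(axis=0)
std_train = train.std(axis=0)
Z_train = (train - mean_train) / std_train
_, eig_vec_train = np.linalg.eigh(np.cov(Z_train, rowvar=False))
eig_vec_train = eig_vec_train[:, ::-1]  # largest-variance component first

# "Transform": the new row is scaled and rotated with the training values
Z_new = (new_row - mean_train) / std_train
scores_new = Z_new @ eig_vec_train

print(scores_new)
```

Because the components form an orthonormal basis, the new point keeps its distance from the origin, so an extreme car ends up far from the rest of the cloud.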

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
fig.subplots_adjust(wspace=0.15)
colors = {'Japan': "#1a53ff", 'US': "#00bfa0", 'Europe': "#dc0ab4"}
for idx, (model, country) in enumerate(zip(mtcars['model'].tolist(), mtcars['country'].tolist())):
    
    ax.annotate(model, 
                (X_transformed.iloc[idx]['pc1'], X_transformed.iloc[idx]['pc2']), 
                color=colors[country],
                alpha=0.5)
    
for pc, vector in enumerate(pca.components_.T):
    v = vector[0:2] * 6
    draw_vector([0, 0], [0, 0] + v, mtcars_features.columns[pc], ax)
ax.scatter(X_transformed['pc1'], X_transformed['pc2'], alpha=0.75)

ax.axis('equal')
ax.set_xlabel('Principal Component 1 (PC1)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 0]*100), fontsize=16)
ax.set_ylabel('Principal Component 2 (PC2)\n{:.2f}% explained var'.format(pca_summary.iloc[1, 1]*100), fontsize=16)
ax.set_xlim([-10, 10])
ax.set_ylim([-6, 12]);

In [None]:
# Import dataset 
mtcars = pd.read_csv('mtcars.csv')

# Get features
mtcars_features = mtcars.drop(['model', 'vs', 'am', 'carb', 'country'], axis=1)

# Scale features
mtcars_features_scaled = pd.DataFrame(data=scaler.transform(mtcars_features),
                                      columns=mtcars_features.columns)

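# Recompute the principal components for the reloaded 32 cars
# (X_transformed still held the extra Bugatti row from above)
X_transformed = pd.DataFrame(data=pca.transform(mtcars_features_scaled),
                             columns=['pc{}'.format(comp+1) for
                                      comp in range(mtcars_features.shape[1])])

# Combine features and Principal Components
combined_df = pd.concat([mtcars_features_scaled, X_transformed], axis=1)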

# Compute the correlation matrix
correlation = combined_df.corr()

# Plot correlation between features and Principal Components.
correlation_plot_data = correlation.loc[mtcars_features_scaled.columns, 
                                        X_transformed.columns]
fig, ax = plt.subplots(figsize=(20, 7))
sns.set(font_scale=2)
sns.heatmap(correlation_plot_data, cmap='bwr', linewidths=.7, 
            annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax,
            cbar_kws={'label': 'Correlation'})
ax.xaxis.set_tick_params(labelsize=30)
ax.yaxis.set_tick_params(labelsize=30, labelrotation=0)
plt.show()
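
The correlations in this heatmap are closely tied to the eigenvectors and eigenvalues. For standardized features, the correlation between feature $j$ and component $k$ equals the eigenvector entry $v_{jk}$ times $\sqrt{\lambda_k}$, divided by the standard deviation of feature $j$. The sketch below checks this relationship on synthetic data. All names are hypothetical and the values are random, not the `mtcars` data.

```python
import numpy as np

# Hypothetical correlated three-column data, standardized
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 3))
X_demo[:, 1] += 0.7 * X_demo[:, 0]
X_demo[:, 2] -= 0.5 * X_demo[:, 0]
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Principal components via eigendecomposition, largest variance first
cov_demo = np.cov(Z_demo, rowvar=False)
eig_val_demo, eig_vec_demo = np.linalg.eigh(cov_demo)
order = np.argsort(eig_val_demo)[::-1]
eig_val_demo, eig_vec_demo = eig_val_demo[order], eig_vec_demo[:, order]
scores = Z_demo @ eig_vec_demo

# Empirical correlations: rows are features, columns are components
corr_full = np.corrcoef(Z_demo, scores, rowvar=False)
corr_empirical = corr_full[:3, 3:]

# Closed form: v_jk * sqrt(lambda_k) / std of feature j
feature_std = np.sqrt(np.diag(cov_demo))
corr_formula = eig_vec_demo * np.sqrt(eig_val_demo) / feature_std[:, None]

print(np.allclose(corr_empirical, corr_formula))
```

This is why features with long arrows along a component in the earlier plots also show strong correlations with that component here.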