
# Introduction #

In this exercise you'll apply PCA to a dataset with a fairly large number of features. By examining the variance explained by the principal components, you'll be able to reduce the number of features your model uses by more than half, while also reducing predictive error.

Run this cell to set everything up!

In [None]:
import matplotlib.pyplot as plt

# Set Matplotlib defaults
# Matplotlib 3.6+ renamed the bundled seaborn styles to "seaborn-v0_8-*"
plt.style.use("seaborn-v0_8-whitegrid")
plt.rc("figure", autolayout=True)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=14,
    titlepad=10,
)

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

The *Gold* dataset contains lagged percent-returns from various financial securities (like stocks and bonds). The task is to predict the price of gold on a three-week horizon using these past returns.

Run the next cell to set up the data.

In [None]:
import pandas as pd
from IPython.display import display

df = pd.read_csv("../input/fe-course-data/gold.csv")
display(df.head())
X = df.copy()
y = X.pop("Gold_T+22")

# Step 1 - Define Transforms #

The *Gold* dataset has features that are all numeric and that are all measuring the same kind of thing (returns). This makes it a good candidate for PCA. Define three transforms:
1. a `PolynomialFeatures` instance to create interaction features
2. a `StandardScaler` to standardize the data
3. a `PCA` instance that retains all components
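
As a rough illustration of the first transform, the sketch below applies `PolynomialFeatures` with `interaction_only=True` to a tiny made-up array (the name `X_toy` and its values are hypothetical, not part of the *Gold* data). With three input columns, it should keep the originals and append one product column per pair of features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data: 2 rows, 3 features
X_toy = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X_toy)

# 3 original columns + 3 pairwise products (x0*x1, x0*x2, x1*x2);
# no squared terms because interaction_only=True
print(X_inter.shape)
print(X_inter[0])
```

Because the number of pairwise products grows quadratically with the number of features, interaction features can inflate the feature count quickly, which is part of why dimensionality reduction is useful afterwards.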

In [None]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA


# YOUR CODE HERE: Define the feature transforms
polynomial_features = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False,
)
scaler = StandardScaler()
pca = PCA()


# Check your answer
q_1.check()

Now let's determine how many components to keep. Run this next cell for a plot of the explained variance.


In [None]:
import numpy as np

X_pca = scaler.fit_transform(X)
X_pca = pca.fit_transform(X_pca)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
c_95 = next(i+1 for i, x in enumerate(cumulative) if x >= 0.95)
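# Number components from 1 so the x-axis matches c_95 (a component count)
component_number = np.arange(1, len(explained) + 1)
plt.figure(dpi=100, figsize=(8, 4))
plt.subplot(121)
plt.plot(component_number, explained)
plt.axvline(x=c_95, color='k')
plt.title("Explained Variance per Component")
plt.subplot(122)
plt.plot(component_number, cumulative)
plt.axvline(x=c_95, color='k')
plt.title("Cumulative Variance")
plt.show();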

print("95% explained variance at {} components.".format(c_95))

There is no clear "elbow" in these plots, so choosing how many components to retain is a judgment call. A threshold of 95% cumulative variance (the vertical line) offers a reasonable compromise between dimensionality reduction and information loss.
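
To make the threshold rule concrete, here is a minimal sketch on made-up numbers. The array `explained_toy` is hypothetical and merely stands in for `pca.explained_variance_ratio_`. It finds the smallest number of components whose cumulative explained variance reaches the threshold.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
explained_toy = np.array([0.50, 0.25, 0.12, 0.06, 0.04, 0.03])
cumulative_toy = np.cumsum(explained_toy)

threshold = 0.95
# searchsorted gives the first index where cumulative_toy >= threshold;
# add 1 to convert a zero-based index into a component count
n_keep = int(np.searchsorted(cumulative_toy, threshold)) + 1

print(n_keep)
```

Here the first four components reach only 93%, so five are needed to cross the 95% line.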

# Step 2 - Define PCA with Retained Components #

Redefine the PCA transform to retain enough components to explain at least 95% of the variance in the dataset, as indicated in the plots above. You can specify either the number of components (an integer) or the fraction of variance to retain (a float between 0 and 1).
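
The two ways of specifying `n_components` can be compared on synthetic data. In this sketch, `X_demo` is a hypothetical array built from three latent factors; it is not the *Gold* data. A float asks scikit-learn to choose the component count for you, and the fitted `n_components_` attribute reports what it chose.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 10 correlated features from 3 latent factors
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_scaled = StandardScaler().fit_transform(X_demo)

# Float in (0, 1): keep as many components as needed to reach that variance
pca_frac = PCA(n_components=0.95).fit(X_scaled)
k = pca_frac.n_components_

# Integer: keep exactly that many components
pca_int = PCA(n_components=k).fit(X_scaled)

print(k, pca_frac.explained_variance_ratio_.sum())
```

Passing a float keeps the choice tied to the data, so the component count adapts if the features change; passing an integer fixes the output width.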

In [None]:
# YOUR CODE HERE: define the PCA transform
pca = PCA(n_components=0.95)

q_2.check()

# Step 3 - Define Pipeline #

Now create the complete pipeline you'll use for prediction, with a `RandomForestRegressor` as the model. Pay attention to the order in which the transforms occur in the pipeline.
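
One reason the order matters is that PCA is sensitive to the scale of its inputs. The sketch below uses two hypothetical, independent features on very different scales (again, not the *Gold* data) and compares PCA on the raw values with PCA applied after standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two independent features on very different scales
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=500),
    rng.normal(scale=100.0, size=500),
])

# PCA on raw data: the large-scale feature dominates the first component
ratio_raw = PCA().fit(X_demo).explained_variance_ratio_

# Standardize first, then PCA: both features contribute comparably
scaled_pca = make_pipeline(StandardScaler(), PCA()).fit(X_demo)
ratio_scaled = scaled_pca.named_steps["pca"].explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

Without scaling, nearly all of the variance lands on the first component simply because one feature has larger units, which is why the scaler should come before PCA in a pipeline.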

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# YOUR CODE HERE: Create the RandomForestRegressor, then define the complete pipeline
rf = RandomForestRegressor(n_jobs=-1)
pipeline = make_pipeline(
    scaler, pca, rf,
)


# Check your answer
q_3.check()

# Step 4 - Estimate Performance #

Now evaluate the complete pipeline with 5-fold cross-validation, scoring by mean absolute error.

In [None]:
from sklearn.model_selection import cross_val_score


# YOUR CODE HERE: Cross-validate
score = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_absolute_error'
)
score = -1 * score.mean()
print("Score: {:.4f}".format(score))


# Check your answer
q_4.check()

If you like, you can compare this pipeline to one without the PCA transform by replacing `pipeline` with `rf` in the cell above and rerunning.
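
If you'd rather see both scores side by side, a comparison along these lines is one option. It is only a sketch on synthetic data: `mae_cv` is a hypothetical helper, and `X_demo`, `y_demo`, and the forest settings are made up for illustration, so the numbers say nothing about the *Gold* dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mae_cv(model, X, y, cv=5):
    """Hypothetical helper: mean absolute error averaged over CV folds."""
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()


rng = np.random.default_rng(0)
# Hypothetical regression data: 150 rows, 8 features, 2 of them informative
X_demo = rng.normal(size=(150, 8))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=150)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), rf_demo)

# cross_val_score clones the estimator, so reusing rf_demo is safe
results = {
    "with PCA": mae_cv(with_pca, X_demo, y_demo),
    "without PCA": mae_cv(rf_demo, X_demo, y_demo),
}
for name, mae in results.items():
    print("{}: {:.4f}".format(name, mae))
```

Whether PCA helps depends on the data, so treat a comparison like this as a check to run rather than a result to assume.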

# Keep Going #