# Dimensionality Reduction

**Dimensionality**: the number of independent features (variables) in a dataset.

As a rule of thumb in these notes, a dataset with more than 10 features is considered to have **high dimensionality**.

High Dimensionality:

- Increases complexity.


![Why Reduce Dimensionality?](why_reduce_dimensionality.png)

**Dimensionality Reduction:**
- Feature Selection

The selected features remain unchanged, and are therefore easy to interpret.
- Feature Extraction

![Feature Selection and Feature Extraction](feature_selection_extraction.png)

1. **Remove features with little/no variance**

Such features carry little information that a model could use. This applies mostly to classification problems.

Features with no variance have:
- A standard deviation of 0 or close to 0
- Minimum and maximum values that are the same or almost the same

Ways to identify features with no variance:
- Use .describe() on the dataset.
- Plot numeric features using pairplot (good for small to medium sized dimensionality).
- Use VarianceThreshold from scikit-learn.


2. **Remove highly correlated features/redundant features**

Ways to identify highly correlated features:
- Plot numeric features using pairplot (Good for small to medium sized dimensionality)

3. **Remove features with null values**
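
As a rough sketch of the first check listed above, the snippet below flags columns whose standard deviation is close to zero or whose minimum and maximum coincide. The toy DataFrame, its column names, and the 1e-6 tolerance are assumptions made for illustration; they are not part of the ANSUR data.

```python
import pandas as pd

# Hypothetical toy data: 'n_legs' is constant, so it carries no information
toy_df = pd.DataFrame({
    "height_cm": [172.0, 181.5, 165.2, 190.1],
    "weight_kg": [68.0, 85.3, 59.9, 92.4],
    "n_legs": [2, 2, 2, 2],
})

# .describe() reports std, min and max per numeric column
summary = toy_df.describe().T

# Flag columns with (near) zero spread; the tolerance is an arbitrary choice
tol = 1e-6
no_variance = summary.index[
    (summary["std"] <= tol) | ((summary["max"] - summary["min"]) <= tol)
].tolist()

print(no_variance)
```

On real data the tolerance would need to be chosen relative to the scale of each feature.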

## Visually detecting redundant features

In [None]:
# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(ansur_df_1, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()

# Remove one of the redundant features
reduced_df = ansur_df_1.drop('stature_m', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(reduced_df, hue='Gender')

# Remove the feature with no variance
reduced_df = ansur_df_2.drop('n_legs', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(reduced_df, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()

## t-SNE visualization of high-dimensional data

t-SNE helps us visually explore interesting patterns in a high-dimensional dataset.

It doesn't work with non-numeric data, so encode non-numeric columns or drop them first.

In [None]:
# Non-numerical columns in the dataset
non_numeric = ['Branch', 'Gender', 'Component']

# Drop the non-numerical columns from df
df_numeric = df.drop(non_numeric, axis=1)

# Create a t-SNE model with learning rate 50
m = TSNE(learning_rate=50)

# Fit and transform the t-SNE model on the numeric dataset
tsne_features = m.fit_transform(df_numeric)
print(tsne_features.shape)

The above code reduced the numeric columns from 90 to just 2 which can easily be plotted.

Visualize the output of t-SNE dimensionality reduction on the combined male and female Ansur dataset. You'll create 3 scatterplots of the 2 t-SNE features ('x' and 'y') which were added to the dataset df. In each scatterplot you'll color the points according to a different categorical variable.
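
The step of adding the two t-SNE features to the DataFrame is not shown in the cells here. A minimal sketch of how it might look is given below, using random data as a stand-in for the numeric ANSUR columns; the toy DataFrame and the perplexity value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Hypothetical stand-in for the numeric ANSUR columns: 60 rows, 10 features
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(60, 10)),
                      columns=[f"f{i}" for i in range(10)])

# perplexity must be smaller than the number of samples
m = TSNE(n_components=2, perplexity=10, learning_rate=50, random_state=0)
tsne_features = m.fit_transform(toy_df)

# Store the two embedding dimensions as columns so they can be plotted
toy_df["x"] = tsne_features[:, 0]
toy_df["y"] = tsne_features[:, 1]

print(toy_df[["x", "y"]].shape)
```

With the 'x' and 'y' columns in place, the scatterplots can color the points by any categorical column kept alongside them.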

In [None]:
# Color the points by Army Component
sns.scatterplot(x="x", y="y", hue='Component', data=df)

# Show the plot
plt.show()

# Color the points by Army Branch
sns.scatterplot(x="x", y="y", hue='Branch', data=df)

# Show the plot
plt.show()

# Color the points by Gender
sns.scatterplot(x="x", y="y", hue='Gender', data=df)

# Show the plot
plt.show()

## Curse of Dimensionality

Models tend to overfit on high-dimensional data.

If accuracy on the training set is much higher than accuracy on the test set, the model did not generalize well.

To avoid overfitting, the number of observations should increase exponentially with the number of features added to the dataset.
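
One way to see where the exponential growth comes from is to count the cells of a grid over the feature space. The sketch below assumes, purely for illustration, that each feature is split into 10 bins and that you would like at least one observation per cell.

```python
# Hypothetical setup: split each feature's range into 10 bins and ask for
# at least one observation per cell of the resulting grid
bins_per_feature = 10

cells = {}
for n_features in (1, 2, 3, 6):
    cells[n_features] = bins_per_feature ** n_features
    print(f"{n_features} feature(s): {cells[n_features]:,} cells to cover")
```

Each added feature multiplies the number of cells by 10, so the data needed to cover the space grows exponentially.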

## Feature Selection

In [None]:
# Import train_test_split()
from sklearn.model_selection import train_test_split

# Select the Gender column as the feature to be predicted (y)
y = ansur_df['Gender']

# Remove the Gender column to create the training data
X = ansur_df.drop('Gender', axis=1)

# Perform a 70% train and 30% test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print("{} rows in test set vs. {} in training set. {} Features.".format(X_test.shape[0], X_train.shape[0], X_test.shape[1]))

Fitting and testing the model

In [None]:
# Import SVC from sklearn.svm and accuracy_score from sklearn.metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create an instance of the Support Vector Classification class
svc = SVC()

# Fit the model to the training data
svc.fit(X_train, y_train)

# Calculate accuracy scores on both train and test data
accuracy_train = accuracy_score(y_train, svc.predict(X_train))
accuracy_test = accuracy_score(y_test, svc.predict(X_test))

print("{0:.1%} accuracy on test set vs. {1:.1%} on training set".format(accuracy_test, accuracy_train))

output of the above code:
49.7% accuracy on test set vs. 100.0% on training set

You'll reduce the overfit with the help of dimensionality reduction. In this case, you'll apply a rather drastic form of dimensionality reduction by only selecting a single column that has some good information to distinguish between genders. You'll repeat the train-test split, model fit and prediction steps to compare the accuracy on test vs. training data.

In [None]:
# Assign just the 'neckcircumferencebase' column from ansur_df to X
X = ansur_df[['neckcircumferencebase']]

# Split the data, instantiate a classifier and fit the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
svc = SVC()
svc.fit(X_train, y_train)

# Calculate accuracy scores on both train and test data
accuracy_train = accuracy_score(y_train, svc.predict(X_train))
accuracy_test = accuracy_score(y_test, svc.predict(X_test))

print("{0:.1%} accuracy on test set vs. {1:.1%} on training set".format(accuracy_test, accuracy_train))

output of the above code:
49.7% accuracy on test set vs. 100.0% on training set

## Removing Features with little variance

Box plots help visualize differences in the median and spread (variance) of numeric features.

In [None]:
# Create the boxplot
head_df.boxplot()

plt.show()

Normalization is applied so that the variances of features measured on different scales are comparable.
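
To illustrate why raw variances are not comparable, the sketch below expresses the same hypothetical measurements in two units. The values are made up; only the division by the mean mirrors the normalization used in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical example: the same head lengths in millimetres and in metres
length_mm = pd.Series([190.0, 200.0, 185.0, 210.0, 195.0])
length_m = length_mm / 1000

# Raw variances differ by a factor of 1000**2 purely because of the units
print(length_mm.var(), length_m.var())

# Dividing by the mean removes the unit, so both variances agree
norm_var_mm = (length_mm / length_mm.mean()).var()
norm_var_m = (length_m / length_m.mean()).var()
print(norm_var_mm, norm_var_m)
```

After this normalization a single variance threshold can be applied across all features.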

In [None]:
# Normalize the data
normalized_df = head_df / head_df.mean()

normalized_df.boxplot()
plt.show()

print(normalized_df.var())

Output from the code above.

headbreadth          1.678952e-03
headcircumference    1.029623e-03
headlength           1.867872e-03
tragiontopofhead     2.639840e-03
n_hairs              1.002552e-08
measurement_error    3.231707e-27

Inspect the printed variances. If you want to remove the 2 features with very low variance, what would be a good variance threshold?

Ans: 1.0e-03

You established that 0.001 is a good threshold to filter out low variance features in head_df after normalization. Now use the VarianceThreshold feature selector to remove these features.

### Removing low variance features using Variance Threshold

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Create a VarianceThreshold feature selector
sel = VarianceThreshold(threshold=0.001)

# Fit the selector to normalized head_df
sel.fit(head_df / head_df.mean())

# Create a boolean mask
mask = sel.get_support()

# Apply the mask to create a reduced dataframe
reduced_df = head_df.loc[:, mask]

print("Dimensionality reduced from {} to {}.".format(head_df.shape[1], reduced_df.shape[1]))

The above code reduces the dimensionality from 6 to 4.

## Removing Features with missing values

In [None]:
# Create a boolean mask on whether each feature has less than 50% missing values.
mask = school_df.isna().sum() / len(school_df) < 0.5

# Create a reduced dataset by applying the mask
reduced_df = school_df.loc[:, mask]

print(school_df.shape)
print(reduced_df.shape)

## Pairwise Correlation
![Correlation Coefficient](correlation_coefficient.png)
![Correlation Coefficient](correlation_coefficient_plotted.png)


Correlation of A to B = Correlation of B to A
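
The symmetry follows from the definition of the Pearson coefficient, in which both variables play the same role. Below is a small sketch that computes it by hand on made-up numbers; the helper name pearson_r and the data are illustrative assumptions.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Hypothetical measurements
height = [160, 170, 175, 180, 190]
weight = [55, 68, 70, 80, 91]

r_ab = pearson_r(height, weight)
r_ba = pearson_r(weight, height)
print(r_ab, r_ba)
```

Swapping the arguments only swaps the order of the factors in each product, so the result is unchanged.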

**Correlation matrix**

In [None]:
ansur_df.corr()

**Visualizing the correlation matrix**

In [None]:
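# Create the correlation matrix
corr = ansur_df.corr()

# Define a diverging colormap (assumed here; any diverging palette works)
cmap = sns.diverging_palette(h_neg=10, h_pos=240, as_cmap=True)

# Draw the heatmap
sns.heatmap(corr, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()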

**Create a boolean mask for the upper triangle of the plot and apply the mask to the heatmap**

In [None]:
# Create the correlation matrix
corr = ansur_df.corr()

# Generate a mask for the upper triangle 
mask = np.triu(np.ones_like(corr, dtype=bool))

# Add the mask to the heatmap
sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f")
plt.show()

### Removing highly correlated features

Highly correlated features do not bring new information but increase complexity.

Remove one of the 2 features with a high correlation.

In case numerous features represent the same measurement, remove all but one.

### Correlation Caveats

Visually inspect correlated features to avoid surprises like the one below.

![Anscombe's Quartet](correlation_caveats_anscombe.png)
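
A related surprise, sketched here on made-up data rather than Anscombe's actual values: a feature can depend entirely on another and still show a correlation coefficient of about zero when the relationship is not linear.

```python
import numpy as np

# Hypothetical data: y depends on x exactly, but not linearly
x = np.linspace(-3, 3, 61)
y = x ** 2

# Pearson's r only measures the strength of a linear relationship
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

A scatterplot of x against y would reveal the parabola immediately, which is why a visual check complements the coefficient.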

Correlation does not imply causation.

### Filtering out highly correlated features

You're going to automate the removal of highly correlated features in the numeric ANSUR dataset. You'll calculate the correlation matrix and filter out columns that have a correlation coefficient of more than 0.95 or less than -0.95.

Since each correlation coefficient occurs twice in the matrix (correlation of A to B equals correlation of B to A) you'll want to ignore half of the correlation matrix so that only one of the two correlated features is removed. Use a mask trick for this purpose.

In [None]:
# Calculate the correlation matrix and take the absolute value
corr_matrix = ansur_df.corr().abs()

# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)

# List column names of highly correlated features (r > 0.95)
to_drop = [c for c in tri_df.columns if any(tri_df[c] >  0.95)]

# Drop the features in the to_drop list
reduced_df = ansur_df.drop(to_drop, axis=1)

print("The reduced dataframe has {} columns.".format(reduced_df.shape[1]))
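
To see which feature of a correlated pair the mask trick removes, here is the same logic on a small made-up DataFrame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is roughly 2 * 'a', 'c' is unrelated
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})

corr_matrix = toy_df.corr().abs()

# Hide the diagonal and the upper triangle so each pair appears only once
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool))
tri_df = corr_matrix.mask(mask)

# A column is dropped if it correlates strongly with any later column
to_drop = [c for c in tri_df.columns if (tri_df[c] > 0.95).any()]
print(to_drop)
```

Only one feature of the a/b pair ends up in to_drop, so the information they share is kept once.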

## Selecting features for model performance

### Feature Selection using Classification Algorithms
Get model Coefficients for different features using Logistic Regression.

In [None]:
# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Fit the logistic regression model on the scaled training data
lr.fit(X_train_std, y_train)

# Scale the test features
X_test_std = scaler.transform(X_test)

# Predict diabetes presence on the scaled test set
y_pred = lr.predict(X_test_std)

# Prints accuracy metrics and feature coefficients
print("{0:.1%} accuracy on test set.".format(accuracy_score(y_test, y_pred))) 
print(dict(zip(X.columns, abs(lr.coef_[0]).round(2))))

### Recursive Feature Elimination with Logistic Regression

Dropping the feature with the lowest coefficient changes the coefficients of the remaining features (they may increase or decrease), so unimportant features should be dropped one at a time, refitting the model after each removal.
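
Written out by hand, that drop-and-refit loop might look roughly like the sketch below. It uses synthetic data from make_classification as a stand-in for the real features and is only meant to show the idea, not how scikit-learn implements RFE internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the diabetes features
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

remaining = list(range(X.shape[1]))  # indices of features still in play
while len(remaining) > 3:
    lr = LogisticRegression().fit(X[:, remaining], y)
    # Drop the feature with the smallest absolute coefficient, then refit
    weakest = int(np.argmin(np.abs(lr.coef_[0])))
    remaining.pop(weakest)

print(remaining)
```

The features are standardized first so that coefficient sizes are comparable across features.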

In [None]:
# Create the RFE with a LogisticRegression estimator and 3 features to select
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, verbose=1)

# Fits the eliminator to the data
rfe.fit(X_train, y_train)

# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))

# Print the features that are not eliminated
print(X.columns[rfe.support_])

# Calculates the test set accuracy
acc = accuracy_score(y_test, rfe.predict(X_test))
print("{0:.1%} accuracy on test set.".format(acc)) 

output of the above code:

{'age': 1, 'insulin': 4, 'triceps': 3, 'pregnant': 5, 'glucose': 1, 'bmi': 1, 'family': 2, 'diastolic': 6}

Index(['glucose', 'bmi', 'age'], dtype='object')

80.6% accuracy on test set. 

### Tree Based Feature Selection

Some models perform feature selection by design to avoid overfitting, e.g. **RandomForestClassifier**.

In [None]:
# Perform a 75% training and 25% test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the random forest model to the training data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Calculate the accuracy
acc = accuracy_score(y_test, rf.predict(X_test))

# Print the importances per feature
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Print accuracy
print("{0:.1%} accuracy on test set.".format(acc))

output for the above code:

{'age': 0.16, 'insulin': 0.13, 'triceps': 0.11, 'pregnant': 0.09, 'glucose': 0.21, 'bmi': 0.09, 'family': 0.12, 'diastolic': 0.08}

77.6% accuracy on test set.

Now let's use the fitted random forest model to select the most important features.

In [None]:
# Create a mask for features importances above the threshold
mask = rf.feature_importances_ > 0.15

# Prints out the mask
print(mask)

Sub-select the most important features by applying the mask.

In [None]:
# Create a mask for features importances above the threshold
mask = rf.feature_importances_ > 0.15

# Apply the mask to the feature dataset X
reduced_X = X.loc[:, mask]

# prints out the selected column names
print(reduced_X.columns)

### Recursive Feature Elimination with Random Forest Classifier

This method is more conservative than selecting features with a single importance threshold, since dropping one feature can influence the relative importances of the others.

In [None]:
# Wrap the feature eliminator around the random forest model
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask using an attribute of rfe
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

Change the settings of RFE() to eliminate 2 features at each step

In [None]:
# Set the feature eliminator to remove 2 features on each step
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, step=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)

### Feature Selection using Regression Algorithms

### Regularized Linear Regression

**Lasso Regression**

In [None]:
# Set the test size to 30% to get a 70-30% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Create the Lasso model
la = Lasso()

# Fit it to the standardized training data
la.fit(X_train_std, y_train)

**Lasso model results**

In [None]:
# Transform the test set with the pre-fitted scaler
X_test_std = scaler.transform(X_test)

# Calculate the coefficient of determination (R squared) on X_test_std
r_squared = la.score(X_test_std, y_test)
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))

# Create a list that has True values when coefficients equal 0
zero_coef = la.coef_ == 0

# Calculate how many features have a zero coefficient
n_ignored = sum(zero_coef)
print("The model has ignored {} out of {} features.".format(n_ignored, len(la.coef_)))

**Adjusting the regularization strength**

Your current Lasso model has an R2 score of 84.7%. When a model applies overly powerful regularization it can suffer from high bias, hurting its predictive power.

Let's improve the balance between predictive power and model simplicity by tweaking the alpha parameter.

Find the highest value for alpha that gives an R2 value above 98% from the options: 1, 0.5, 0.1, and 0.01.

In [None]:
# Find the highest alpha value with R-squared above 98%
la = Lasso(alpha=0.1, random_state=0)

# Fits the model and calculates performance stats
la.fit(X_train_std, y_train)
r_squared = la.score(X_test_std, y_test)
n_ignored_features = sum(la.coef_ == 0)

# Print performance stats
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))
print("{} out of {} features were ignored.".format(n_ignored_features, len(la.coef_)))

With this more appropriate regularization strength we can predict 98% of the variance in the BMI value while ignoring 2/3 of the features.
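
The trade-off can be seen by sweeping the candidate alpha values and counting the zeroed coefficients. The sketch below uses synthetic data in which only 3 of 10 features matter, which is an assumption for illustration, and scores on the training data for brevity.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first 3 of 10 features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(scale=0.1, size=200))

n_zero = {}
for alpha in (1, 0.5, 0.1, 0.01):
    la = Lasso(alpha=alpha).fit(X, y)
    n_zero[alpha] = int(np.sum(la.coef_ == 0))
    # R^2 is computed on the training data here, only to keep the sketch short
    print(f"alpha={alpha}: R^2={la.score(X, y):.3f}, "
          f"{n_zero[alpha]} of {len(la.coef_)} coefficients are zero")
```

A larger alpha zeroes more coefficients but can also remove weakly informative features, which lowers R squared.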

Finding the best value for alpha by hand can be a tedious process. Use cross-validation instead, i.e. LassoCV, which searches for the best alpha and uses it to fit the model. The .alpha_ attribute of the fitted model holds the selected value.

## Combining feature selectors

### Feature Selection with LassoCV

In [None]:
from sklearn.linear_model import LassoCV

# Create and fit the LassoCV model on the training set
lcv = LassoCV()
lcv.fit(X_train, y_train)
print('Optimal alpha = {0:.3f}'.format(lcv.alpha_))

# Calculate R squared on the test set
r_squared = lcv.score(X_test, y_test)
print('The model explains {0:.1%} of the test set variance'.format(r_squared))

# Create a mask for coefficients not equal to zero
lcv_mask = lcv.coef_ != 0
print('{} features out of {} selected'.format(sum(lcv_mask), len(lcv_mask)))

Output for the above code:

Optimal alpha = 0.089

The model explains 88.2% of the test set variance

26 features out of 32 selected

### Feature Selection with RandomForest (RFE)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Select 10 features with RFE on a RandomForestRegressor, drop 3 features on each step
rfe_rf = RFE(estimator=RandomForestRegressor(), 
             n_features_to_select=10, step=3, verbose=1)
rfe_rf.fit(X_train, y_train)

# Calculate the R squared on the test set
r_squared = rfe_rf.score(X_test, y_test)
print('The model can explain {0:.1%} of the variance in the test set'.format(r_squared))

# Assign the support array to rf_mask
rf_mask = rfe_rf.support_

Output for the above code:
    
Fitting estimator with 32 features.

Fitting estimator with 29 features.

Fitting estimator with 26 features.

Fitting estimator with 23 features.

Fitting estimator with 20 features.

Fitting estimator with 17 features.

Fitting estimator with 14 features.

Fitting estimator with 11 features.

The model can explain 84.0% of the variance in the test set

### Feature Selection with GradientBoosting (RFE)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

# Select 10 features with RFE on a GradientBoostingRegressor, drop 3 features on each step
rfe_gb = RFE(estimator=GradientBoostingRegressor(), 
             n_features_to_select=10, step=3, verbose=1)
rfe_gb.fit(X_train, y_train)

# Calculate the R squared on the test set
r_squared = rfe_gb.score(X_test, y_test)
print('The model can explain {0:.1%} of the variance in the test set'.format(r_squared))

# Assign the support array to gb_mask
gb_mask = rfe_gb.support_

Output for the above code:
    
Fitting estimator with 32 features.

Fitting estimator with 29 features.

Fitting estimator with 26 features.

Fitting estimator with 23 features.

Fitting estimator with 20 features.

Fitting estimator with 17 features.

Fitting estimator with 14 features.


Fitting estimator with 11 features.

The model can explain 85.6% of the variance in the test set

### Combining the above Feature Selectors

We'll combine the votes of the 3 models you built previously into a meta mask that decides which features are important.

We'll then use this mask to reduce dimensionality and see how a simple linear regressor performs on the reduced dataset.

In [None]:
# Sum the votes of the three models
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)

# Create a mask for features selected by all 3 models
meta_mask = votes >= 3

# Apply the dimensionality reduction on X
X_reduced = X.loc[:, meta_mask]

# Plug the reduced dataset into a linear regression pipeline
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3, random_state=0)
lm.fit(scaler.fit_transform(X_train), y_train)
r_squared = lm.score(scaler.transform(X_test), y_test)
print('The model can explain {0:.1%} of the variance in the test set using {1:} features.'.format(r_squared, len(lm.coef_)))

Output for the above code:

The model can explain 86.8% of the variance in the test set using 7 features.
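
The voting step can be followed on a toy example. The three masks below are made up, and the majority rule is shown only as a looser alternative to the unanimous vote used in these notes.

```python
import numpy as np

# Hypothetical masks from three selectors over five features
lcv_mask = np.array([True, True, False, True, False])
rf_mask = np.array([True, False, False, True, True])
gb_mask = np.array([True, True, False, False, True])

# Booleans are summed as 0/1, giving the number of votes per feature
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)

unanimous = votes >= 3   # kept only if all three selectors agree
majority = votes >= 2    # a looser alternative: at least two agree

print(votes, unanimous, majority)
```

Requiring all three votes keeps the fewest features; lowering the threshold trades simplicity for retained information.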

## Feature Extraction

**Manual Feature Extraction**

In [None]:
# Calculate the mean height
height_df['height'] = height_df[['height_1', 'height_2', 'height_3']].mean(axis=1)

# Drop the 3 original height features
reduced_df = height_df.drop(['height_1', 'height_2', 'height_3'], axis=1)

print(reduced_df.head())

### PCA-Principal Components Analysis

![PCA intuition](pca_intuition.png)

The 2 perpendicular vectors above are aligned with the variance of the data.

- People with a positive component for the red vector have both long forearms and long upper arms.

- People with a negative component for the red vector have short forearms and short upper arms.

- People with a positive component for the yellow vector have long upper arms relative to their forearms.

- People with a negative component for the yellow vector have long forearms relative to their upper arms.

![PCA intuition](pca_concept_coordinates.png) 
![PCA intuition](pca_concept_components.png)

A point at coordinates (2.7, 1) can be described in terms of the vectors, i.e. 2 times the red vector and -1 times the yellow vector.

2 and -1 become the 1st and 2nd principal components respectively.

The 1st one is the most important, as it is aligned with the largest source of variance in the data.

**The components do not contain duplicate information and are ranked from most important to least important.**
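
Describing a point by its principal components amounts to projecting it onto the component vectors. The sketch below uses two made-up orthonormal directions as stand-ins for the red and yellow vectors, so the coordinates differ from those in the figure.

```python
import numpy as np

# Hypothetical orthonormal directions standing in for the red and yellow vectors
red = np.array([1.0, 1.0]) / np.sqrt(2)
yellow = np.array([-1.0, 1.0]) / np.sqrt(2)

# A point built as 2 * red + (-1) * yellow
point = 2 * red - 1 * yellow

# Projecting onto each direction recovers the two components
pc1 = point @ red
pc2 = point @ yellow
print(pc1, pc2)

# The components and directions together rebuild the original point
rebuilt = pc1 * red + pc2 * yellow
```

Because the directions are perpendicular, each component can be read off independently of the other.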

![PCA correlation](pca_correlation.png)


In a dataset with very high correlation, most of the variance will be explained by the first few components.

In the above diagram, PC1 explains 90% of the variance. It would thus make sense to drop PC2.
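
The effect can be reproduced numerically. The sketch below builds two strongly correlated made-up features and computes each component's share of the variance from the eigenvalues of the covariance matrix, which is one standard way to obtain the explained variance ratio.

```python
import numpy as np

# Hypothetical, strongly correlated pair of features
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
X = np.column_stack([a, b])

# Standardize, then take the eigenvalues of the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]

# Each eigenvalue's share of the total is that component's explained variance
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio.round(3))
```

The stronger the correlation between the two features, the closer the first ratio gets to 1.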

Visually inspect a 4 feature sample of the ANSUR dataset before and after PCA using Seaborn's pairplot(). 

This will allow you to inspect the pairwise correlations between the features.

In [None]:
# Create a pairplot of the original features
sns.pairplot(ansur_df)
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create the scaler
scaler = StandardScaler()
ansur_std = scaler.fit_transform(ansur_df)

# Create the PCA instance and fit and transform the data with pca
pca = PCA()
pc = pca.fit_transform(ansur_std)
pc_df = pd.DataFrame(pc, columns=['PC 1', 'PC 2', 'PC 3', 'PC 4'])

# Create a pairplot of the principal component dataframe
sns.pairplot(pc_df)
plt.show()

Apply PCA on a somewhat larger ANSUR datasample with 13 dimensions.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create the scaler
scaler = StandardScaler()
ansur_std = scaler.fit_transform(ansur_df)

# Create the PCA instance and fit it to the standardized data
pca = PCA()
pca.fit(ansur_std)

# Inspect the explained variance ratio per component
print(pca.explained_variance_ratio_)

# Print the cumulative sum of the explained variance ratio
print(pca.explained_variance_ratio_.cumsum())

Principal components are difficult to interpret.

![pca components](pca_components.png)

**pca.components_** tells us to what extent each component vector is affected by a particular feature in the dataset.

The features with the biggest positive or negative effect on a component can be used to give that component a meaning.

PC1 above is affected by both hand and foot length.

People with shorter hands and longer feet score high for PC2.

Apply PCA to the numeric features of the Pokemon dataset, poke_df, using a pipeline to combine the feature scaling and PCA in one go.

In [None]:
from sklearn.pipeline import Pipeline

# Build the pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=2))])

# Fit it to the dataset and extract the component vectors
pipe.fit(poke_df)
vectors = pipe.steps[1][1].components_.round(2)

# Print feature effects
print('PC 1 effects = ' + str(dict(zip(poke_df.columns, vectors[0]))))
print('PC 2 effects = ' + str(dict(zip(poke_df.columns, vectors[1]))))

Output for the above code:

PC 1 effects = {'Sp. Atk': 0.46, 'Sp. Def': 0.45, 'Defense': 0.36, 'Attack': 0.44, 'Speed': 0.34, 'HP': 0.39}
PC 2 effects = {'Sp. Atk': -0.31, 'Sp. Def': 0.24, 'Defense': 0.63, 'Attack': -0.01, 'Speed': -0.67, 'HP': 0.08}

For PC 1, all features have a similar positive effect. PC 1 can be interpreted as a measure of overall quality (high stats).

Defense has a strong positive effect on the second component and speed a strong negative one. This component quantifies an agility vs. armor & protection trade-off.

Use the PCA pipeline you built previously to visually explore how some categorical features relate to the variance in poke_df.

By Type

In [None]:
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=2))])

# Fit the pipeline to poke_df and transform the data
pc = pipe.fit_transform(poke_df)

# Add the 2 components to poke_cat_df
poke_cat_df['PC 1'] = pc[:, 0]
poke_cat_df['PC 2'] = pc[:, 1]

# Use the Type feature to color the PC 1 vs PC 2 scatterplot
sns.scatterplot(data=poke_cat_df, 
                x='PC 1', y='PC 2', hue='Type')
plt.show()

By Legendary 

In [None]:
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=2))])

# Fit the pipeline to poke_df and transform the data
pc = pipe.fit_transform(poke_df)

# Add the 2 components to poke_cat_df
poke_cat_df['PC 1'] = pc[:, 0]
poke_cat_df['PC 2'] = pc[:, 1]

# Use the Legendary feature to color the PC 1 vs PC 2 scatterplot
sns.scatterplot(data=poke_cat_df, 
                x='PC 1', y='PC 2', hue='Legendary')
plt.show()

In [None]:
# Build the pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reducer', PCA(n_components=2)),
        ('classifier', RandomForestClassifier(random_state=0))])

# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Prints the explained variance ratio
print(pipe.steps[1][1].explained_variance_ratio_)

# Score the accuracy on the test set
accuracy = pipe.score(X_test, y_test)

# Prints the model accuracy
print('{0:.1%} test set accuracy'.format(accuracy))

Output for the above code:

[0.45624044 0.17767414]

95.8% test set accuracy

Repeat the above with 3 principal components.

In [None]:
# Build the pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reducer', PCA(n_components=3)),
        ('classifier', RandomForestClassifier(random_state=0))])

# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Score the accuracy on the test set
accuracy = pipe.score(X_test, y_test)

# Prints the explained variance ratio and accuracy
print(pipe.steps[1][1].explained_variance_ratio_)
print('{0:.1%} test set accuracy'.format(accuracy))

Output for the code above:

[0.45624044 0.17767414 0.12858833]

95.0% test set accuracy

### Selecting the proportion of variance to keep

In [None]:
# Pipe a scaler to PCA selecting 80% of the variance
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=0.8))])

# Fit the pipe to the data
pipe.fit(ansur_df)

print('{} components selected'.format(len(pipe.steps[1][1].components_)))

Output for the code above:

11 components selected
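
When n_components is a fraction, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below mimics that rule on made-up ratios; it is an approximation of the behaviour for illustration, not scikit-learn's own code.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted from PC 1 downwards
ratios = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
cumulative = ratios.cumsum()

# Smallest number of components whose cumulative share exceeds 80%
target = 0.8
n_keep = int(np.searchsorted(cumulative, target, side="right")) + 1

print(cumulative, n_keep)
```

The same cumulative sum is available on a fitted model through explained_variance_ratio_.cumsum(), as used earlier in these notes.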

### Choosing the number of components to keep

In [None]:
# Pipeline a scaler and pca selecting 10 components
pipe = Pipeline([('scaler', StandardScaler()),
                 ('reducer', PCA(n_components=10))])

# Fit the pipe to the data
pipe.fit(ansur_df)

# Plot the explained variance ratio
plt.plot(pipe.steps[1][1].explained_variance_ratio_)

plt.xlabel('Principal component index')
plt.ylabel('Explained variance ratio')
plt.show()

Output of the above code:

![explained variance plotted](explained_variance_plotted.png)

To how many components can you reduce the dataset without compromising too much on explained variance?

Note that the x-axis is zero indexed.

Ans: 3 (the elbow of the curve is at index 2)

### PCA Inverse

In [None]:
# Transform the input data to principal components
pc = pipe.transform(X_test)

# Inverse transform the components to original feature space
X_rebuilt = pipe.inverse_transform(pc)

# Prints the number of features
print("X_rebuilt has {} features".format(X_rebuilt.shape[1]))

# Plot the reconstructed data
plot_digits(X_rebuilt)
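
How much information survives the round trip can be quantified with a reconstruction error. The sketch below uses synthetic data driven by 2 hidden factors, which is an assumption chosen so that 2 components are enough; real data would typically show a larger error.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 observed features driven by 2 hidden factors plus noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + rng.normal(scale=0.01, size=(100, 5))

# Compress to 2 components, then map back to the 5 original features
pca = PCA(n_components=2)
pc = pca.fit_transform(X)
X_rebuilt = pca.inverse_transform(pc)

# Mean squared reconstruction error measures what the compression lost
mse = np.mean((X - X_rebuilt) ** 2)
print(X_rebuilt.shape, mse)
```

The rebuilt data has the original number of features, but only the variance captured by the kept components is restored.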