# Dimensionality Reduction with PCA

## Task 1

This notebook covers the theory and usage of ___Principal Component Analysis (PCA)___ on different datasets. PCA is a model-based way to reduce the amount of "uninformative" or redundant information: it tries to reduce the number of individual features while losing as little information as possible.
The newly obtained features from the PCA can then be used instead of the (higher-dimensional) original feature vector.

__"The key idea here is to replace redundant features with a few new features that adequately summarize information contained in the original feature space."__

To gain some understanding, we will first extract the principal components "manually" and then use the tools provided by scikit-learn to achieve the same goal. Most of the information and content in _Task 1_ is taken from LINK and packed into this notebook to give a better overview and to have all the information in one place.

### Implementation
#### Manual and Step-by-step
First we are going to load, split and standardize an example dataset ("wine.data", __[download from here](https://github.com/DataScienceLabFHSWF/machine-learning-book/tree/main/data/pca)__).

In [None]:
import pandas as pd

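# Assumes wine.data has no header row (as in the UCI original);
# without header=None the first sample would be consumed as column names.
df = pd.read_csv("wine.data", header=None)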

df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df.head()

Now we use sklearn train_test_split() to split our dataset into training and test data (80/20).

In [None]:
from sklearn.model_selection import train_test_split

X, y = df.iloc[:,1:].values, df.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=0)

Standardize data with sklearn:

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

We can extract the principal components by performing an eigendecomposition on the covariance matrix of our training data.

In [None]:
import numpy as np
cov_mat = np.cov(X_train_std.T)
eigenvals, eigenvecs = np.linalg.eigh(cov_mat)

print('\nEigenvalues \n%s' % eigenvals)
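As a quick sanity check of the eigendecomposition step, the sketch below uses a small synthetic matrix (an assumption, not the wine data) and verifies that every eigenpair satisfies `cov @ v = lambda * v`. It also shows that `np.linalg.eigh` returns the eigenvalues in ascending order, which is why they are sorted in descending order later on. All names ending in `_demo` are made up for this illustration.

```python
import numpy as np

# Synthetic stand-in for standardized training data (100 samples, 4 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

cov_demo = np.cov(X_demo.T)                      # 4 x 4 covariance matrix
vals_demo, vecs_demo = np.linalg.eigh(cov_demo)  # eigh is meant for symmetric matrices

# Column j of vecs_demo belongs to vals_demo[j]; broadcasting scales each column
residual = cov_demo @ vecs_demo - vecs_demo * vals_demo
print("max residual:", np.abs(residual).max())

# eigh returns the eigenvalues sorted from smallest to largest
print("ascending order:", bool(np.all(np.diff(vals_demo) >= 0)))
```

Because the covariance matrix is symmetric, the eigenvectors are also orthonormal, so `vecs_demo.T @ vecs_demo` is (numerically) the identity matrix.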

Then we use these eigenvalues to calculate the amount of variance each of these components explains. Here we first calculate the total amount of variance by summing up all eigenvalues to then normalize each individual value by this sum. Additionally we track the cumulative explained variance.

In [None]:
var_total = sum(eigenvals)
var_explained = [(i /var_total) for i in sorted(eigenvals, reverse=True)]
cum_var_explained = np.cumsum(var_explained)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


plt.bar(range(1, len(eigenvals)+1), var_explained, alpha=0.5, align='center',
        label='Explained variance per component')
plt.step(range(1, len(eigenvals)+1), cum_var_explained, where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
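The cumulative explained variance can also be used to pick the number of components programmatically. The following is a minimal sketch with hypothetical eigenvalues (not taken from the wine data) and an arbitrarily chosen threshold of 90 %.

```python
import numpy as np

# Hypothetical eigenvalues, deliberately unsorted
eigenvals_demo = np.array([0.5, 4.0, 1.5, 2.0])

# Explained variance ratio per component, largest first
ratios_demo = np.sort(eigenvals_demo)[::-1] / eigenvals_demo.sum()
cumulative_demo = np.cumsum(ratios_demo)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.9
n_components_demo = int(np.argmax(cumulative_demo >= threshold)) + 1

print(ratios_demo)        # ratios: 0.5, 0.25, 0.1875, 0.0625
print(cumulative_demo)    # cumulative: 0.5, 0.75, 0.9375, 1.0
print(n_components_demo)  # 3
```

`np.argmax` on a boolean array returns the index of the first `True`, so adding 1 converts that index into a component count.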

##### Feature transformation
In order to use these newly acquired principal components as features we need to perform a change of basis in which the PCs become the new axes of our coordinate system. The transformed values can no longer be interpreted in terms of the original features.
Below we first combine eigenvalues and eigenvectors into a list of tuples (eigen_pairs).

In [None]:
# Make a list of (eigenvalue, eigenvector) tuples
# eigenvectors stored as columns of a matrix -> [:, i]
eigen_pairs = [(np.abs(eigenvals[i]), eigenvecs[:, i])
               for i in range(len(eigenvals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

eigen_pairs[0]

This list of eigen_pairs has 13 elements, one for each principal component. Each element is a tuple that contains the eigenvalue and the corresponding eigenvector, with one coordinate per original dimension. As PCA is often used for dimensionality reduction, let's assume we only want to use the first two principal components (the ones with the highest explained variance, as we want to capture as much of the underlying information as possible).

__Exercise:__ How much of the original variance was captured by the first two principal components? How much additional explained variance is provided by the third PC?

In [None]:
# code here

As mentioned above we only want to use the first two PCs of the 13 that were derived. These two components capture more than half of the total variance of the original dataset and could thus be a suitable choice for dimensionality reduction while keeping the loss of information small. We use _numpy.hstack_ to create a new matrix with two columns that contains all coordinate values of the first two eigenvectors.

In [None]:
w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

__Exercise:__ Use the code from above to create a matrix _w2_ from the coordinates of the four most informative components.

In [None]:
# code here

Now that we have our matrix _w_ we can transform our data by taking the dot product of the original data and the PCA-column vectors, projecting the original data onto the PCA axes.

In [None]:
print("X_train_std[0]: {}\n".format(X_train_std[0]))
print("X_train_std[0].dot(w): {}".format(X_train_std[0].dot(w)))

X_train_pca = X_train_std.dot(w)

When plotting the results we can see that the first two PCs already achieve quite good clustering of the three classes.

In [None]:
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0], 
                X_train_pca[y_train == l, 1], 
                c=c, label=l, marker=m)

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
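One property of the projection used above is that the transformed features are uncorrelated and that their variances equal the corresponding eigenvalues. The sketch below illustrates this on synthetic, correlated data (an assumption, not the wine data); the `_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 200 samples, 3 features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing
X_demo = X_demo - X_demo.mean(axis=0)      # center the data

vals_demo, vecs_demo = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(vals_demo)[::-1]        # indices from largest to smallest eigenvalue
W_demo = vecs_demo[:, order[:2]]           # 3 x 2 projection matrix

Z_demo = X_demo @ W_demo                   # 200 x 2 projected data
cov_Z = np.cov(Z_demo.T)

print(Z_demo.shape)
print(np.round(cov_Z, 6))                  # off-diagonal entries are numerically zero
print(vals_demo[order[:2]])                # matches the diagonal of cov_Z
```

This is the reason a correlation heatmap of PCA-transformed data looks so different from the heatmap of the original features.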

__Exercise:__ Now take only the best and the worst PCs and use them as basis for the feature transformation. How does the resulting scatter plot differ from the one shown above? What's the reason behind this?

In [None]:
# code here

### PCA with scikit-learn
Of course you do not always have to perform these tasks by hand as there is already a scikit-learn implementation for the PCA. This makes it more comfortable to use and takes care of the underlying math. To perform the same operations as above we just have to do the following when utilizing sklearn:

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_
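The manual route and scikit-learn should agree, except that the sign of each component is arbitrary. The following sketch checks this on synthetic data (an assumption, not the wine data). `svd_solver="full"` is set explicitly here only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic data with clearly different variances per feature
X_demo = rng.normal(size=(150, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X_demo = X_demo - X_demo.mean(axis=0)

# Manual route: eigendecomposition of the covariance matrix
vals_demo, vecs_demo = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(vals_demo)[::-1]
Z_manual = X_demo @ vecs_demo[:, order[:2]]

# scikit-learn route
pca_demo = PCA(n_components=2, svd_solver="full")
Z_sklearn = pca_demo.fit_transform(X_demo)

# Compare absolute values because each axis may be flipped
same_up_to_sign = np.allclose(np.abs(Z_manual), np.abs(Z_sklearn))
same_variances = np.allclose(vals_demo[order[:2]], pca_demo.explained_variance_)
print(same_up_to_sign, same_variances)
```

`explained_variance_` holds the eigenvalues themselves, whereas `explained_variance_ratio_` holds them divided by the total variance.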

In [None]:

plt.bar(range(1, len(pca.explained_variance_ratio_)+1), pca.explained_variance_ratio_, alpha=0.5, align='center',
        label='Explained variance per component')
plt.step(range(1, len(pca.explained_variance_ratio_)+1), np.cumsum(pca.explained_variance_ratio_), where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
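Instead of reading the number of components off the plot, scikit-learn can also select it from a variance threshold: if `n_components` is a float between 0 and 1, `PCA` keeps as many components as are needed to exceed that share of the variance. The sketch below uses synthetic data and a threshold of 95 %, both chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic data: six features with strongly decreasing variance
X_demo = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# Keep as many components as needed to explain more than 95 % of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
Z_95 = pca_95.fit_transform(X_demo)

print("components kept:", pca_95.n_components_)
print("variance explained:", pca_95.explained_variance_ratio_.sum())
```

The fitted attribute `n_components_` reports how many components were actually kept.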

## Task 2


In this task you will take a closer look at the _Iris_ dataset ("IRIS.csv", __[download from here](https://github.com/DataScienceLabFHSWF/machine-learning-book/tree/main/data/pca)__) which lists different attributes of the Iris flower, taken from various samples. The goal here is to predict the correct species by applying and comparing different machine learning algorithms of your choice.  
Perform PCA on the data and take a look at the _explained variance_ in order to get the best trade-off between dimensionality-reduction and information loss. Train and test every model you consider on the PCA-transformed and the original data to see the impact of the aforementioned loss. 

### Visualization
Plot the distribution of the different features in the dataset. What kind of scaling would you choose for the different attributes?

Use the _seaborn_ package to plot a heatmap of the correlation matrix of the dataframe (df.corr()). Which attributes are highly correlated, which feature sticks out?

### Preprocess
Preprocess the data here (label encoding, scaling).

### PCA
#### Apply and create dataframe
Split the dataframe into two separate dataframes (feature data and target data (X,y)). Apply PCA to the feature data, build a new dataframe from the transformed components and store it as _X\_pca_.

Plot the heatmap of the PCA-transformed dataframe and compare it to the map from above. How can you explain the differences?

Take a look at the explained variance of the principal components and display it in a plot. How many components would you choose to get the best trade-off between dimensionality reduction and information loss? How much of the variance in the entire dataset do they represent?

Now create a dataframe out of your chosen components (_np.hstack()_).

Create a scatter plot of the first two PCs.

### Training
Split the datasets (original and PCA-transformed) into train and test data.

Use different algorithms of your choice to try and achieve the highest target score.
Use _train\_test\_split_ from _sklearn.model\_selection_ (random_state: 25, test_size: 0.25).
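One possible way to structure such a comparison is sketched below. It is not the intended solution: it assumes scikit-learn's built-in copy of the Iris data (`load_iris`) instead of "IRIS.csv", uses logistic regression as an arbitrary example model, and keeps two components purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: built-in Iris data as a stand-in for IRIS.csv
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=25)

# Same model, once on the scaled original features and once on two PCs
pipelines = {
    "original": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "pca": make_pipeline(StandardScaler(), PCA(n_components=2),
                         LogisticRegression(max_iter=1000)),
}

scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_tr, y_tr)                  # scaler and PCA are fitted on training data only
    scores[name] = pipe.score(X_te, y_te)  # accuracy on the held-out test set

print(scores)
```

Wrapping the scaler and the PCA in a pipeline means both are fitted on the training split only, which differs slightly from applying PCA to the full feature data before splitting.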

Plot the results. Which of the algorithms you chose performed best? Do the results surprise you in any way?

## Task 3
In this task we try to predict the anticipated final status of bank loans, given various attributes describing the loan taker. The _credit_ dataset for this task can be downloaded from __[here](https://github.com/DataScienceLabFHSWF/machine-learning-book/tree/main/data/pca)__.  
As always, load the data into a dataframe and get familiar with it (__[might be useful](https://www.investopedia.com/terms/c/chargeoff.asp)__). What kinds of categories do you see (numerical, categorical)? 

Are there any features that correlate strongly with each other?

### Dealing with missing data
You might have noticed that there are quite a few missing/na values in our dataset. Take a closer look at each of the attributes that has missing values and try to decide how to best deal with it (fill with constant value, drop entirely, ...).
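The strategies mentioned above can be sketched with pandas as follows. The toy dataframe and its column names are hypothetical and not taken from the _credit_ dataset, and the 50 % cut-off for dropping a column is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with made-up column names
df_demo = pd.DataFrame({
    "income": [50000.0, np.nan, 62000.0, 58000.0],
    "job_title": ["clerk", None, "nurse", "clerk"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Share of missing values per column
missing_share = df_demo.isna().mean()
print(missing_share)

# Drop columns where more than half of the values are missing
df_clean = df_demo.loc[:, missing_share <= 0.5].copy()

# Fill a numerical column with its median, a categorical one with a constant
df_clean["income"] = df_clean["income"].fillna(df_clean["income"].median())
df_clean["job_title"] = df_clean["job_title"].fillna("unknown")

print(df_clean)
```

Which strategy is appropriate depends on the attribute: dropping is usually reasonable for columns that are almost empty, while filling keeps the rows available for training.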


### Categorical variables
At this point your dataframe should have no attributes with missing values left. In the next step you should use One-Hot-Encoding or Label-Encoding to transform the categorical values into numerical/binary ones.  
What are the names of these features? Which technique(s) did you use?  

First encode the target column:

And now the other attributes:

Now append these transformed matrices to your cleaned dataframe (_pd.concat()_) to create the final version of the dataset.
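A minimal sketch of the encoding and concatenation steps is shown below. The toy dataframe and its column names are hypothetical; the real _credit_ dataset will have different attributes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with made-up column names
df_demo = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_amount": [1000, 2500, 4000],
})

# Label-encode the target column (classes are numbered in sorted order)
target = pd.Series(LabelEncoder().fit_transform(df_demo["loan_status"]),
                   index=df_demo.index, name="loan_status")

# One-hot encode a categorical feature
dummies = pd.get_dummies(df_demo["home_ownership"], prefix="home_ownership")

# Combine the numerical columns, the dummy columns and the encoded target
df_final = pd.concat([df_demo[["loan_amount"]], dummies, target], axis=1)
print(df_final)
```

Label encoding suits the target because classifiers expect a single label column, while one-hot encoding avoids imposing an artificial order on categorical features.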

### PCA
Perform the PCA and plot the important metrics. How many components are needed to explain most of the variance?  


### Training
Use the cleaned and encoded dataframe for training and test data. Compare the results of PCA-transformed and the cleaned, untransformed dataset.

### Evaluation and results

Apply various machine learning algorithms to your training data and evaluate on your test data. What do you notice?

Plot your results.