
# Dimensionality Reduction in Python with Scikit-Learn

# 1  Introduction
Scikit-learn is a free software machine learning library for the Python programming language. It is built on NumPy, SciPy, and matplotlib. It features various classification, regression, clustering and dimensionality reduction algorithms. In this notebook, we focus on its use for dimensionality reduction.

Dimensionality reduction (DR) embeds the original high-dimensional data in a lower-dimensional space while preserving the critical information. The main motivations for applying DR are:
- Reduce computational cost
- Avoid the curse of dimensionality
- Capture intrinsic dimensionality
- Remove noise
- Visualize the data (in 2D or 3D)

### Application in Machine Learning
In machine learning, model performance can improve as more features are added, but the model also becomes more likely to overfit. DR can be used to extract effective features while **controlling overfitting**, because it keeps only the most important components of the feature space and drops the rest.

Dimensionality reduction can be used in both supervised and unsupervised learning contexts. 

|learning method |main algorithm |
|--- | ---|
|unsupervised learning |PCA|
|supervised learning | LDA |

### Notebook Structure
Parts 2 and 3 present the two most common dimensionality reduction techniques, PCA and LDA, which are both statistical methods. Both implementation examples use the same Iris dataset so that the two methods can be compared. Part 4 summarizes the similarities and differences between the two techniques, and Part 5 lists some useful resources on Scikit-learn.

There are many other dimensionality reduction techniques, including non-negative matrix factorization (NMF) and independent component analysis (ICA).  
See here for more: https://scikit-learn.org/stable/modules/decomposition.html#

# 2  Principal component analysis (PCA)
## 2.1 Principle
Principal Component Analysis (PCA) finds the directions of greatest variance in the dataset, builds principal components from them, and reduces dimensionality by keeping only the leading components. A general rule of thumb is to keep the principal components that account for a significant share of the variance and ignore those with diminishing returns. In unsupervised learning, the extracted principal components can serve as input features.

Performing PCA is a five-step process:
1. **Normalization**  
If there are large differences between the ranges of initial variables, those variables with larger ranges (e.g. 0-100) will dominate over those with small ranges (e.g. 0-1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
2. **Compute covariance matrix**  
The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other. Variables sometimes contain redundant information, in which case they are highly correlated.  
The **covariance matrix** is a p × p symmetric matrix, where p is the number of data dimensions. It has entries of covariances associated with all possible pairs of the initial variables. For example, the covariance matrix for a 3-dimensional data set with 3 variables x, y, and z is shown as follows:
![Matrix.png](https://builtin.com/sites/default/files/styles/ckeditor_optimize/public/inline-images/Principal%20Component%20Analysis%20Covariance%20Matrix.png)  
3. **Identify principal components**  
Principal components are constructed as linear combinations of the initial variables, which makes them less interpretable. They are uncorrelated and are ordered by the amount of variance they capture: most of the information in the initial variables is compressed into the first component, then the maximum remaining information into the second, and so on.   
This is done by computing the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors and eigenvalues come in pairs, and their number equals the number of data dimensions.  
*The **eigenvectors** of the covariance matrix are the directions of the axes along which there is the most variance (most information); these are the principal components. The **eigenvalues** are the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.*
4. **Decide dimension of feature vector**  
Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. The aim of this step is to choose how many of the leading components to keep in the feature vector.
5. **Recast the data along the principal components axes**  
The aim of this step is to use the feature vector to reorient the data from the original axes to the ones represented by the principal components. This is done by multiplying the transpose of the feature vector by the transpose of the standardized data set.
![vector.png](https://builtin.com/sites/default/files/styles/ckeditor_optimize/public/inline-images/Principal%20Component%20Analysis%20feature%20vector.png)
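
The five steps above can be sketched directly in NumPy. This is a minimal illustration on made-up data, not what Scikit-Learn does internally; the toy dataset and all variable names (`latent`, `feature_vector`, `projected`, ...) are assumptions introduced for the example.

```python
import numpy as np

# Toy data (hypothetical): 200 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    100 * latent + rng.normal(scale=5.0, size=(200, 1)),  # values in the hundreds
    latent + rng.normal(scale=0.1, size=(200, 1)),        # values around one
    rng.normal(size=(200, 1)),                            # independent noise
])

# Step 1: normalization (zero mean, unit variance per feature).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (p x p, symmetric).
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues; eigh is intended for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # sort by variance, descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: keep the top k components as the feature vector.
k = 2
feature_vector = eigenvectors[:, :k]           # shape (p, k)
explained_ratio = eigenvalues / eigenvalues.sum()

# Step 5: recast the data along the principal component axes.
# (feature_vector.T @ Z.T).T is the same as Z @ feature_vector.
projected = Z @ feature_vector                 # shape (n_samples, k)

print("explained variance ratio:", explained_ratio)
print("projected shape:", projected.shape)
```

Because the columns of `feature_vector` are eigenvectors of the covariance matrix, the columns of `projected` are uncorrelated with each other, which is the property described in step 3.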

## 2.2 Implementing PCA with Scikit-Learn
We will follow a pipeline:
- import libraries and dataset
- perform preprocessing
- perform PCA to find the optimal number of components
- train our models with different numbers of principal components and make predictions
- evaluate accuracy for each number of principal components

### Importing libraries and dataset
We are going to use the famous Iris data set. It contains 150 instances with 4 features, evenly divided among 3 classes (50 instances each). Each class refers to a type of iris plant. 

See here for more information on this dataset: https://archive.ics.uci.edu/ml/datasets/iris 

In [1]:
#Importing libraries
import numpy as np
import pandas as pd

#download the dataset using pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

#output the first five rows of our dataset 
dataset.head()

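| | sepal-length | sepal-width | petal-length | petal-width | Class |
|--- | --- | --- | --- | --- | ---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |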
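
If the download URL is unavailable, Scikit-Learn ships its own copy of Iris. The sketch below is an alternative, not the loading method used in this notebook: it assumes a Scikit-Learn version that supports `as_frame=True`, the name `local_dataset` is made up for the example, and the bundled copy uses different column names (`sepal length (cm)`, ...) and species names (`setosa` rather than `Iris-setosa`).

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features plus a 'target' column.
iris = load_iris(as_frame=True)
local_dataset = iris.frame.copy()

# Replace the integer targets with species names, mirroring the 'Class' column above.
local_dataset['Class'] = iris.target_names[iris.target.to_numpy()]
local_dataset = local_dataset.drop(columns='target')

print(local_dataset.shape)
print(local_dataset['Class'].value_counts())
```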


### Preprocessing
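PCA can only be applied to numeric data, so any categorical features must be converted to numerical features first. The four Iris features are already numeric, so the only preprocessing needed here is **normalization**, which puts the features on comparable scales.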

In [2]:
# Divide the dataset into a feature set (X) and corresponding labels (y)
X = dataset.drop(columns='Class')
y = dataset['Class']
print(X, y)

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize our feature set (X)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

     sepal-length  sepal-width  petal-length  petal-width
0             5.1          3.5           1.4          0.2
1             4.9          3.0           1.4          0.2
2             4.7          3.2           1.3          0.2
3             4.6          3.1           1.5          0.2
4             5.0          3.6           1.4          0.2
..            ...          ...           ...          ...
145           6.7          3.0           5.2          2.3
146           6.3          2.5           5.0          1.9
147           6.5          3.0           5.2          2.0
148           6.2          3.4           5.4          2.3
149           5.9          3.0           5.1          1.8

[150 rows x 4 columns] 0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: Class, Length: 150, dtype: object

### Applying PCA
Performing PCA using Scikit-Learn is a two-step process:
1. Initialize the PCA class by passing the number of components to keep.  
Note: The n_components parameter can be at most 4 here, the number of original features.
2. Call the **fit_transform** and **transform** methods, passing the feature set to each. They return the specified number of principal components.  
Note: PCA depends only on the feature set and not on the labels, which is why it is often used in unsupervised learning.

**fit_transform() vs transform():**  
The fit_transform function performs fitting and transforming in one step. We use fit_transform() on the training data because the parameters (the scaling statistics, or the principal axes for PCA) are learned from the training data at the same time as it is transformed. We use only transform() on the test data so that the parameters learned on the training data are applied to it unchanged. This is the standard procedure: always learn the parameters on the training set and then reuse them on the test set.  
Here is an article that explains it very well : https://sebastianraschka.co
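
The difference is easiest to see on a tiny example. The sketch below uses made-up single-feature arrays (`train`, `test`) chosen so the arithmetic can be followed by hand; they are not part of the Iris workflow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data with easy-to-check means.
train = np.array([[0.0], [10.0], [20.0]])   # mean 10
test = np.array([[10.0], [30.0]])           # mean 20

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the train mean/std; learns nothing new

print(scaler.mean_)          # the mean learned from the training data only
print(test_scaled.ravel())   # test values expressed relative to the train mean
```

The test value 10 maps to 0 because it equals the training mean, even though the test set's own mean is 20. Calling fit_transform() on the test set instead would re-center it on its own statistics and put the two sets on different scales.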

In [19]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 1) 
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

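The explained_variance_ratio_ attribute of the fitted PCA object returns the fraction of the total variance explained by each principal component. The output below lists four values, so it comes from a run that kept all four components (n_components = 4).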

In [7]:
explained_variance = pca.explained_variance_ratio_
print("explained variance ratio: ", explained_variance)

explained variance ratio:  [0.72226528 0.23974795 0.03338117 0.0046056 ]


**Analysis**: The first principal component accounts for 72.23% of the variance and the second for 23.97%. Together, the first two principal components capture 96.20% (72.23 + 23.97) of the variance in the feature set.
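
The cumulative explained variance can also be used to pick the number of components automatically. The following is a minimal sketch under stated assumptions: it uses Scikit-Learn's bundled copy of Iris and the whole dataset rather than the training split above, and the 0.95 threshold and the names `pca_full`, `pca_auto` and `threshold` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_all = StandardScaler().fit_transform(load_iris().data)

# Fit with all components so the full variance spectrum is available.
pca_full = PCA().fit(X_all)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target; pick whatever suits the application
k = int(np.searchsorted(cumulative, threshold) + 1)
print("cumulative explained variance:", cumulative)
print("components needed for", threshold, "->", k)

# Shortcut: a float n_components asks PCA to choose the number of components itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_all)
print("chosen by PCA:", pca_auto.n_components_)
```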

### Training and Making Predictions
We'll use random forest classification to make predictions.

In [20]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

### Performance Evaluation

In [21]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

pre = confusion_matrix(y_test, y_pred)

In [5]:
# Get the result with the original four features (run without executing the 'Applying PCA' cell)
print(pre)
print('Accuracy', str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
Accuracy 1.0


In [22]:
# Get the result with only one feature by setting n_components = 1
print(pre)
print('Accuracy', str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 12  1]
 [ 0  1  5]]
Accuracy 0.9333333333333333


In [18]:
# Get the result with two features by setting n_components = 2
print(pre)
print('Accuracy', str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0  9  4]
 [ 0  2  4]]
Accuracy 0.8


In [14]:
# Get the result with three features by setting n_components = 3
print(pre)
print('Accuracy', str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0  8  5]
 [ 0  1  5]]
Accuracy 0.8


In [10]:
# Get the result with four features by setting n_components = 4

print(pre)
print('Accuracy', str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 11  2]
 [ 0  1  5]]
Accuracy 0.9


Overall, the performance for different numbers of principal components is summarized as follows.

|Number of principal components | accuracy |
| --- | --- |
| none (original four features, no PCA) | 1.0 |
| 1 |  0.93 |
| 2 | 0.8 |
| 3 | 0.8 |
| 4 | 0.9 |
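
The cells above overwrite X_train and X_test in place, so trying a different n_components means re-running the preprocessing cell first. One way to avoid that is to loop over a Pipeline. This is a minimal sketch, assuming Scikit-Learn's bundled copy of Iris instead of the UCI download; the variable names are illustrative, and the accuracies it prints may not match the table exactly.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)

accuracies = {}
for k in range(1, 5):
    # The scaler and PCA are fitted on the training split only, inside the pipeline.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    model.fit(X_tr, y_tr)
    accuracies[k] = model.score(X_te, y_te)

for k, acc in accuracies.items():
    print(k, round(acc, 2))
```

Each iteration starts from the untouched X_tr and X_te, so the runs cannot contaminate each other.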

## 2.3 Result Analysis
1. Without PCA, the classifier reaches the highest accuracy (1.0). This suggests that dimensionality reduction can discard some useful information. Even if accuracy drops, dimensionality reduction is still worthwhile when we want to reduce computational cost, visualize the data, or control overfitting when there are too many features. 
2. Among the PCA runs, the highest accuracy (0.93) is reached with a single principal component, so a large reduction in the number of features costs little accuracy here. In other words, the accuracy of a classifier doesn't necessarily improve with an increased number of principal components. Note that the test set holds only 30 samples, so each misclassified sample changes the accuracy by about 0.03. 

# 3 Linear Discriminant Analysis (LDA)
## 3.1 Introduction
Linear Discriminant Analysis (LDA) operates by projecting data from a high-dimensional space onto a lower-dimensional space in a way that keeps the classes separated. LDA works best when the means of the classes are far from each other. In addition to dimensionality reduction, it can also be used as a classification algorithm. 
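
As a quick illustration of that dual role, the sketch below uses one fitted LDA object both to classify and to reduce dimensions. It assumes Scikit-Learn's bundled copy of Iris; the names `lda_clf`, `lda_accuracy` and `X_te_2d` are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)

lda_clf = LinearDiscriminantAnalysis()
lda_clf.fit(X_tr, y_tr)                    # labels are required: LDA is supervised
lda_accuracy = lda_clf.score(X_te, y_te)   # used as a classifier, no separate model needed
X_te_2d = lda_clf.transform(X_te)          # the same fitted object also reduces dimensions

print(lda_accuracy, X_te_2d.shape)
```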

Note: Not to be confused with latent Dirichlet allocation. Here is an article that explains the difference:  
https://www.quora.com/What-is-the-difference-between-LDA-linear-analysis-discriminant-and-LDA-latent-Dirichlet-allocation

Performing LDA for a binary classification problem is a three-step process:
1. **Normalization**: Transforming the data to comparable scales is also needed for LDA.
2. **Plot a new axis in the 2D graph**. This new axis should separate the two classes based on two primary goals: 
    - minimizing the variance within each of the two classes
    - maximizing the distance between the means of the two classes. 
3. **Move the data points in the two-dimensional graph to the new axis** in 3 steps: 
   - Calculate the separability between the classes, based on the distance between the class means (the between-class variance). 
   - Calculate the within-class variance, which is the distance between each sample and the mean of its own class. 
   - Construct the lower-dimensional (1D) space that maximizes the between-class variance relative to the within-class variance.
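
For the two-class case, these steps correspond to Fisher's linear discriminant, which can be sketched in a few lines of NumPy. The data below is synthetic and every name (`class_a`, `S_w`, `w`, ...) is an assumption made for the example; Scikit-Learn's own implementation differs in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D data: two classes with different means.
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class_b = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(100, 2))

mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)

# Within-class scatter: spread of each class around its own mean, summed over classes.
S_w = (class_a - mean_a).T @ (class_a - mean_a) + (class_b - mean_b).T @ (class_b - mean_b)

# Fisher's direction: large distance between class means relative to within-class spread.
w = np.linalg.solve(S_w, mean_b - mean_a)
w /= np.linalg.norm(w)

# Move the 2D points onto the new 1D axis.
proj_a, proj_b = class_a @ w, class_b @ w
separation = (proj_b.mean() - proj_a.mean()) ** 2 / (proj_a.var() + proj_b.var())

print("axis:", w)
print("Fisher ratio:", separation)
```

The ratio printed at the end is the quantity the new axis is chosen to maximize: squared distance between the projected class means divided by the summed projected within-class variances.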

## 3.2 Implementing LDA with Scikit-Learn
We will follow a pipeline:

- import libraries and dataset
- perform preprocessing
- perform LDA with different numbers of linear discriminants
- train our models with different numbers of linear discriminants and make predictions
- evaluate accuracy for each number of linear discriminants

All parts are exactly the same except for the LDA step, so that we can compare the results of LDA and PCA in Part 4.

### Importing libraries and dataset
We will also apply LDA to the Iris dataset.

In [23]:
#Importing libraries
import numpy as np
import pandas as pd

#download the dataset using pandas
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

### Preprocessing
Performing feature scaling is also needed for LDA.

In [24]:
# Divide the dataset into a feature set (X) and corresponding labels (y)
X = dataset.drop(columns='Class')
y = dataset['Class']

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize our feature set (X)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Performing LDA
Performing LDA using Scikit-Learn is a two-step process:
1. Initialize the LDA class by passing the number of linear discriminants.  
Note: The n_components parameter can be at most K-1, where K is the number of classes (and no more than the number of features).
2. Call the **fit_transform** and **transform** methods, passing the feature set to each. They return the specified number of linear discriminants.  
Note: fit_transform here takes two arguments, X_train and y_train. This reflects the fact that LDA takes the class labels into account when selecting the linear discriminants, while PCA does not depend on the labels.
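
The K-1 limit can be checked directly. This is a small sketch, assuming Scikit-Learn's bundled copy of Iris (K = 3) and a recent Scikit-Learn release, where an oversized n_components raises a ValueError; older releases may behave differently. The names `shapes` and `rejected` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)   # 4 features, K = 3 classes

shapes = {}
for n in (1, 2):
    lda_n = LinearDiscriminantAnalysis(n_components=n)
    shapes[n] = lda_n.fit_transform(X_iris, y_iris).shape

# K - 1 = 2, so asking for 3 discriminants is expected to fail.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_iris, y_iris)
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)

print(shapes, rejected)
```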

In [32]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

### Training and Making Predictions
We'll also use random forest classification to make predictions.

In [33]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

### Evaluating the Performance

In [34]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

pre = confusion_matrix(y_test, y_pred)

In [27]:
# Get the result with the original four features (run without executing the 'Performing LDA' cell)
print(pre)
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
Accuracy 1.0


In [35]:
# Get the result with one feature by setting n_components = 1
print(pre)
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
Accuracy 1.0


In [31]:
# Get the result with two features by setting n_components = 2
print(pre)
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))

[[11  0  0]
 [ 0 13  0]
 [ 0  1  5]]
Accuracy 0.9666666666666667


Overall, the performance for different numbers of linear discriminants is summarized as follows.  

|Number of linear discriminants | accuracy |
| --- | --- |
| none (original four features, no LDA) | 1.0 |
| 1 | 1.0 |
| 2 | 0.97 |

## 3.3 Result Analysis
1. With one linear discriminant, the classifier achieved an accuracy of 100%, compared with 93.33% for one principal component in PCA.
2. The accuracy of a classifier doesn't necessarily improve with an increased number of linear discriminants.


# 4 Conclusion
#### Similarity between PCA and LDA
- They are both linear transformation techniques.  
- They both use the idea of matrix factorization (eigendecomposition) to reduce dimensions.  
- They both work best when the data are approximately Gaussian, so they may be poorly suited to dimensionality reduction for strongly non-Gaussian samples.  


#### Difference between PCA and LDA
- PCA is an unsupervised dimensionality reduction technique, while LDA is a supervised one.  
**Advantage**: PCA can be applied to labeled as well as unlabeled data, since it ignores the class labels. LDA, on the other hand, needs the class labels to find the linear discriminants and hence requires labeled data.  
**Disadvantage**: PCA cannot use prior knowledge of the class labels.
- LDA can project data into a subspace of at most K-1 dimensions, where K is the number of classes. PCA does not have this restriction: it can project data into a subspace of up to N dimensions, where N is the original data dimension.  

- PCA only considers the global structure of the data, while LDA utilizes the class information (maximum separation).  
PCA chooses the direction of the sample point projection with the largest variance;  
LDA chooses the direction that best separates the classes, seeking to maximize the distance between data points of different classes after projection and minimize the distance between data points of the same class.  
- LDA can be used as a classification algorithm in addition to carrying out dimensionality reduction.
- LDA may overfit the data, while PCA is less exposed to this problem because it never uses the labels.
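
The difference in the directions the two methods choose shows up clearly on data where the feature with the largest variance carries no class information. The sketch below uses synthetic data built for that purpose; the names are illustrative, and reading the LDA direction from `coef_` is a convenience that applies to this two-class case.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 200
# Feature 0: large spread, identical for both classes (uninformative).
# Feature 1: small spread, but the class means differ (informative).
X0 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(-1.0, 0.3, n)])
X1 = np.column_stack([rng.normal(0.0, 5.0, n), rng.normal(+1.0, 0.3, n)])
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * n + [1] * n)

# PCA never sees y_toy: it follows the direction of largest variance.
pca_dir = PCA(n_components=1).fit(X_toy).components_[0]

# LDA uses y_toy: it follows the direction that separates the classes.
lda = LinearDiscriminantAnalysis().fit(X_toy, y_toy)
lda_dir = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

print("PCA direction:", np.round(pca_dir, 2))   # dominated by feature 0
print("LDA direction:", np.round(lda_dir, 2))   # dominated by feature 1
```

Projecting onto the PCA direction here would mix the two classes almost completely, while the LDA direction keeps them apart.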

#### What to choose for dimensionality reduction?
First of all, it depends on the learning method. If it's unsupervised learning, then only PCA can be used.  
Otherwise, it depends on the dataset at hand. In the case of uniformly distributed data, LDA almost always performs better than PCA. However, if the data is highly skewed (irregularly distributed), PCA is advised, since LDA can be biased towards the majority class. LDA also does not work well when the class information lies in the variance rather than the mean.

# 5 Resources
If you are interested in learning more about Scikit-Learn, check out the homepage, which includes documentation and related resources: https://scikit-learn.org/stable/

Here are some useful documentation links:  
Quick Start Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html  
User Guide http://scikit-learn.org/stable/user_guide.html  
API Reference http://scikit-learn.org/stable/modules/classes.html  
Example Gallery http://scikit-learn.org/stable/auto_examples/index.html  

### Reference:  
https://stackabuse.com/dimensionality-reduction-in-python-with-scikit-learn/  
https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/  
https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/  
https://builtin.com/data-science/step-step-explanation-principal-component-analysis