# Dimensionality Reduction

### The two main approaches to dimensionality reduction are Projection and Manifold Learning.

### Popular dimensionality reduction techniques are PCA, Kernel PCA and LLE (Locally Linear Embedding).

### The Curse of Dimensionality:
### The more dimensions the training set has, the greater the risk of overfitting.
### The number of training instances required to reach a given density grows exponentially with the number of dimensions.
### Processing time increases.
### Visualization of high-dimensional data is difficult.
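
The sparsity behind these points can be illustrated with a small sketch. It uses synthetic random points in a unit hypercube, not the GISETTE data; the helper name `mean_pairwise_distance` and the point count are assumptions made for illustration only. As the number of dimensions grows, randomly placed points end up farther and farther apart, so far more instances are needed to keep the same density.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_dims, n_points=200):
    """Average Euclidean distance between random points in the unit hypercube."""
    points = rng.random((n_points, n_dims))
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    dists = np.sqrt(np.clip(sq_dists, 0, None))
    np.fill_diagonal(dists, 0.0)  # a point is at distance 0 from itself
    # Average over the off-diagonal entries only
    return dists.sum() / (n_points * (n_points - 1))

mean_dists = {d: mean_pairwise_distance(d) for d in (2, 10, 100, 1000)}
for d, dist in mean_dists.items():
    print(f"{d:>5} dimensions: mean distance = {dist:.2f}")
```

The printed distances should grow steadily with the number of dimensions, which is one way to see why a fixed-size training set becomes sparse in high-dimensional space.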
                                  

## PCA

#### PCA assumes the dataset is centered around the origin, so a from-scratch implementation must center the data first. Scikit-Learn's PCA classes take care of the centering for you.
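
To make the centering step concrete, here is a minimal sketch that performs PCA by hand with NumPy's SVD and compares the result with Scikit-Learn's `PCA`. The toy data and the names `X_toy`, `X2D_manual` and `pca_toy` are assumptions for illustration; they are not part of the GISETTE workflow below. The two projections should agree up to the sign of each component, since the direction of a principal axis is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical toy data: 200 samples, 5 correlated features, shifted away from the origin
X_toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5)) + 10.0

# Manual PCA: center the data first, then take the SVD
X_centered = X_toy - X_toy.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
X2D_manual = X_centered @ Vt[:2].T  # project onto the first two principal axes

# Scikit-Learn's PCA centers the data internally (the mean is stored in mean_)
pca_toy = PCA(n_components=2, svd_solver="full")
X2D_sklearn = pca_toy.fit_transform(X_toy)

# Compare absolute values because each component's sign is arbitrary
same_up_to_sign = np.allclose(np.abs(X2D_manual), np.abs(X2D_sklearn), atol=1e-6)
print(same_up_to_sign)
```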

#### The dataset used in the code is taken from: https://archive.ics.uci.edu/ml/datasets/Gisette
#### GISETTE is a handwritten digit recognition problem. The task is to separate the highly confusable digits '4' and '9'.
#### Further information on the dataset is available at https://archive.ics.uci.edu/ml/machine-learning-databases/gisette/Dataset.pdf

In [None]:
import pandas as pd

X = pd.read_csv("G:/My Data Science files/Dimensionality Reduction/gisette_train.data.txt",delimiter = " " ,header = None)

In [None]:
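# Drop the extra column 5000 (likely an empty column produced by a trailing delimiter on each row).
X = X.drop([5000],axis=1)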

In [None]:
X1 = X

In [None]:
# The following code projects the data onto its first two principal components.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X1)



In [None]:
# Proportion of the variance explained by each principal component.
print(pca.explained_variance_ratio_)

In [None]:
#%%time
import numpy as np
pca = PCA()
pca.fit(X1)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

In [None]:
print("95% of the variance in the data is explained by", d, "principal components")
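
The expression `np.argmax(cumsum >= 0.95) + 1` used above can be traced on a tiny example. The variance ratios below are made-up values chosen for illustration, not results from GISETTE.

```python
import numpy as np

# Hypothetical explained-variance ratios for four components (illustrative values only)
toy_ratios = np.array([0.60, 0.25, 0.12, 0.03])
toy_cumsum = np.cumsum(toy_ratios)    # running total of explained variance

# The comparison gives a boolean array; argmax returns the index of the first True
reached = toy_cumsum >= 0.95
toy_d = np.argmax(reached) + 1        # +1 converts a 0-based index into a component count
print(toy_d)  # → 3
```

Here the first two components explain only 85% of the variance, so the third is the first to push the running total past 95%.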

In [None]:
%%time
# Another way: pass the fraction of variance to preserve directly as n_components.

pca = PCA(n_components = 0.95)
X_reduced = pca.fit_transform(X1)

In [None]:
import matplotlib.pyplot as plt
plt.plot(cumsum)
plt.xlabel('Dimensions')
plt.ylabel('Cumulative explained variance')
plt.show()

## Incremental PCA

#### In Incremental PCA, we split the training set into mini-batches and feed them to the PCA algorithm one mini-batch at a time. This is useful for very large datasets.

In [None]:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=60)
for X_batch in np.array_split(X,n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X)
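
As a side note, the explicit loop is not the only way to drive `IncrementalPCA`: calling `fit` with a `batch_size` lets the estimator slice the data into mini-batches itself. The sketch below shows both variants on synthetic data; `X_demo` and the sizes used are assumptions for illustration. Each mini-batch must contain at least `n_components` rows.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 20))  # hypothetical stand-in for a large dataset

# Option 1: explicit loop, one partial_fit call per mini-batch (10 batches of 100 rows)
inc_pca_loop = IncrementalPCA(n_components=5)
for X_batch in np.array_split(X_demo, 10):
    inc_pca_loop.partial_fit(X_batch)

# Option 2: let fit() split the data into mini-batches of batch_size rows
inc_pca_auto = IncrementalPCA(n_components=5, batch_size=100)
inc_pca_auto.fit(X_demo)

X_demo_reduced = inc_pca_auto.transform(X_demo)
print(X_demo_reduced.shape)
```

The loop form is the one to use when the batches come from disk or a stream and the full array never sits in memory at once.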

## Randomized PCA 

#### It is a stochastic algorithm that quickly finds an approximation of the first d principal components.

In [None]:
%%time
from sklearn.decomposition import PCA
rnd_pca = PCA(n_components = 1761 , svd_solver = "randomized")
X_reduced = rnd_pca.fit_transform(X)

## Kernel PCA

#### It makes it possible to perform complex nonlinear projections for dimensionality reduction.

In [None]:
%%time
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components = 1761 , kernel = "rbf", gamma = 0.04)
X_reduced = rbf_pca.fit_transform(X)
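
To see what "nonlinear projection" buys, here is a small sketch on the synthetic two-circles dataset from `sklearn.datasets`, not on GISETTE. The dataset parameters and `gamma=10` are assumptions picked for this toy problem. Linear PCA only rotates the circles, while an RBF kernel PCA projection typically makes the two classes much easier for a linear classifier to separate.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not linearly separable in the original 2-D space
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Project with linear PCA and with RBF kernel PCA (gamma chosen for this toy data)
X_lin = PCA(n_components=2).fit_transform(X_circles)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_circles)

# Training accuracy of a linear classifier on each projection
acc_lin = LogisticRegression().fit(X_lin, y_circles).score(X_lin, y_circles)
acc_rbf = LogisticRegression().fit(X_rbf, y_circles).score(X_rbf, y_circles)
print(f"linear PCA: {acc_lin:.2f}, RBF kernel PCA: {acc_rbf:.2f}")
```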

### Selecting a kernel and tuning hyperparameters

In [None]:
y = pd.read_csv("G:/My Data Science files/Dimensionality Reduction/gisette_train.labels.txt",header = None)

In [None]:
from numpy import loadtxt
Y = loadtxt("G:/My Data Science files/Dimensionality Reduction/gisette_train.labels.txt")

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca",KernelPCA(n_components=10)),
    ("log_reg",LogisticRegression())
])

param_grid = [{
    "kpca__gamma":np.linspace(0.03,0.05,10),
    "kpca__kernel":["rbf","sigmoid"]
}]

grid_search = GridSearchCV(clf,param_grid,cv=3)
grid_search.fit(X,Y)

print(grid_search.best_params_)

## LLE (Locally Linear Embedding)

### It is a Manifold Learning technique that, unlike PCA, does not rely on projections. However, the algorithm takes a long time to run on this dataset.

In [None]:
from sklearn.manifold import LocallyLinearEmbedding

# Note: the parameter is spelled n_neighbors (American spelling) in Scikit-Learn.
lle = LocallyLinearEmbedding(n_components=1761, n_neighbors=10)
X_reduced = lle.fit_transform(X)
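
For a quicker feel for LLE, the sketch below unrolls the synthetic Swiss roll dataset from `sklearn.datasets` into two dimensions. The dataset, its size and the variable names are assumptions for illustration; the run is far cheaper than the GISETTE example above because it uses few samples and only two output components.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D Swiss roll: a 2-D sheet rolled up in 3-D space
X_roll, t_roll = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Embed into 2 dimensions while preserving each point's local linear neighbourhood
lle_roll = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle_roll.fit_transform(X_roll)
print(X_unrolled.shape)
```

Plotting `X_unrolled` coloured by `t_roll` is a simple way to check visually whether the roll has been flattened.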

## Other Dimensionality Reduction Techniques: MDS (Multidimensional Scaling), Isomap, t-Distributed Stochastic Neighbor Embedding (t-SNE), Linear Discriminant Analysis (LDA)