In [13]:
import numpy as np
import pandas as pd
import seaborn as sns
import random
from sklearn.decomposition import PCA

**Dimensionality reduction** speeds up training by reducing the number of features, but it also causes some loss of information, which can make your model perform slightly worse. In some cases, reducing the dimensionality of the training data filters out noise and unnecessary details and thus results in higher performance, but in general it won't; it will just speed up training.

The more dimensions the training set has, the greater the risk of overfitting it.

# Main Approaches for Dimensionality Reduction
1- Projection <br>
When the data is projected onto a lower-dimensional subspace, the data points end up closer to each other.

![download](https://user-images.githubusercontent.com/96451039/228697500-3477df0e-d035-49aa-b6af-31771ea76337.png)


The blue instances, which have been projected from 3D to 2D, are closer to each other than the red ones.
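The claim that projected points end up closer together can be checked numerically. The sketch below is only an illustration under assumed data: it draws 200 random 3D points (not the points in the figure), projects them by dropping the third coordinate, and compares the average pairwise distance before and after. The helper `mean_pairwise_distance` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
points_3d = rng.uniform(0, 1, size=(200, 3))  # made-up data for illustration

# Orthogonal projection onto the x1-x2 plane: simply drop the third coordinate.
points_2d = points_3d[:, :2]

def mean_pairwise_distance(points):
    """Average Euclidean distance over all distinct pairs of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    return dists[upper].mean()

dist_3d = mean_pairwise_distance(points_3d)
dist_2d = mean_pairwise_distance(points_2d)
print(f"mean pairwise distance in 3D: {dist_3d:.3f}")
print(f"mean pairwise distance after projection to 2D: {dist_2d:.3f}")
```

Dropping a coordinate can only remove a non-negative term from each squared distance, so no pair of points can move farther apart.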

2- Manifold <br>
However, projection is not always the best approach to dimensionality
reduction. In many cases the subspace may twist and turn.


![download](https://user-images.githubusercontent.com/96451039/228697534-07fc95f5-b9ca-4d92-9ee6-a433f13940a9.png)

If we used projection here, it would be a poor choice, because the layers would be squashed together and overlap.

![download](https://user-images.githubusercontent.com/96451039/228697665-966c022e-704d-4a55-b02a-bf437b275e45.png)


The left plot is the result of projection, while the right plot is the unrolled manifold. The manifold is rolled and twisted in the higher-dimensional space, but locally it resembles a 2D plane.
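One rough way to see the overlap is to use Scikit-Learn's `make_swiss_roll`, which returns both the 3D points and each point's position `t` along the rolled-up sheet. The sketch below assumes that this dataset resembles the one in the figures. It compares a plain projection with an "unrolled" view built from `t`, and the helper `mean_gap_to_nearest_neighbor` is a hypothetical name used only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# X has shape (1000, 3); t is each point's position along the rolled-up sheet.
X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

projected = X[:, :2]                      # plain projection: drop the third axis
unrolled = np.column_stack([t, X[:, 1]])  # use the position along the roll instead

def mean_gap_to_nearest_neighbor(coords, t):
    """Average |difference in t| between each point and its nearest neighbor in coords."""
    nn = NearestNeighbors(n_neighbors=2).fit(coords)
    _, idx = nn.kneighbors(coords)
    nearest = idx[:, 1]  # column 0 is the point itself (no duplicate points here)
    return np.abs(t - t[nearest]).mean()

gap_projected = mean_gap_to_nearest_neighbor(projected, t)
gap_unrolled = mean_gap_to_nearest_neighbor(unrolled, t)
print(f"projection: neighbors differ in t by {gap_projected:.2f} on average")
print(f"unrolled:   neighbors differ in t by {gap_unrolled:.2f} on average")
```

A large gap for the projection means that points from different layers of the roll land next to each other, which is the overlap described above.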

# **PCA**: the most popular dimensionality reduction algorithm <br>
First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it. <br>
The simple idea behind PCA is to select the axis that preserves the maximum amount of variance, because that will most likely lose the least information. Equivalently, it is the axis that minimizes the mean squared distance between the dataset and its projection onto that axis. <br>
In the figure below, the projection at the top right preserves the maximum variance, the middle one preserves an intermediate amount, and the bottom one preserves the least. The less variance a projection preserves, the more information is lost, so we choose the axis with the maximum variance.<br>
To summarize: you are looking for the hyperplane that preserves as much of the data's variance as possible, which is also the hyperplane the data points lie closest to.

![download](https://user-images.githubusercontent.com/96451039/228697716-df1ad668-b662-4544-a1fb-d385d95e144e.png)
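The link between the chosen axis and the preserved variance can be sketched on a small made-up dataset. The code below assumes correlated 2D data generated on the spot (not the data in the figure) and compares the variance along the first principal component with the variance along 180 candidate axes. `projected_variance` is a hypothetical helper introduced for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2D data: x2 is roughly half of x1 plus a little noise.
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])
X_centered = X - X.mean(axis=0)

def projected_variance(X_centered, direction):
    """Variance of the data after projecting it onto a unit vector."""
    direction = direction / np.linalg.norm(direction)
    return (X_centered @ direction).var(ddof=1)

# First principal component found by Scikit-Learn.
pc1 = PCA(n_components=1).fit(X).components_[0]
var_pc1 = projected_variance(X_centered, pc1)

# Compare with 180 candidate axes, one per degree.
angles = np.deg2rad(np.arange(180))
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
var_candidates = np.array([projected_variance(X_centered, d) for d in candidates])

# Total variance splits into the part kept by the axis and the part lost.
total_var = X_centered.var(axis=0, ddof=1).sum()
print(f"variance along PC1: {var_pc1:.3f}")
print(f"best variance among candidate axes: {var_candidates.max():.3f}")
print(f"variance lost by projecting onto PC1: {total_var - var_pc1:.3f}")
```

Because the total variance is fixed, the axis that keeps the most variance is also the one with the smallest mean squared distance to the data.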


In [14]:
x1 = random.sample(range(0,200),200)
x2 = random.sample(range(0,200),200)
x3 = random.sample(range(0,200),200)

In [15]:
df = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})
df.head()

Unnamed: 0,x1,x2,x3
0,138,10,22
1,106,63,176
2,101,120,130
3,95,39,13
4,139,135,83


In [16]:
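# We can find the principal components of the training set with singular value decomposition (SVD).
# PCA assumes that the dataset is centered around the origin, so subtract the column means first.
df_centered = df - df.mean(axis=0)
U, s, Vt = np.linalg.svd(df_centered)
# Each row of Vt is a unit vector pointing along one principal component.
c1 = Vt[0]
c2 = Vt[1]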

In [17]:
# Vt is 3x3 and each row is a principal component, so we keep the first 2 rows to reduce the dataset from 3D to 2D
Vt

array([[ 0.1324852 , -0.68717324, -0.71431129],
       [ 0.95563881,  0.27982202, -0.0919462 ],
       [-0.263063  ,  0.67044208, -0.69376169]])

In [18]:
print(c1,c2)

[ 0.1324852  -0.68717324 -0.71431129] [ 0.95563881  0.27982202 -0.0919462 ]
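As a quick sanity check on the SVD output, the rows of `Vt` should be orthonormal and the singular values should come out sorted from largest to smallest. The sketch below uses freshly generated stand-in data rather than the notebook's `df`, so the numbers will differ from the output above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 200, size=(200, 3))  # stand-in for the x1, x2, x3 columns
X_centered = X - X.mean(axis=0)

# full_matrices=False skips building the full 200x200 U matrix.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

print(np.round(Vt @ Vt.T, 6))  # close to the 3x3 identity matrix
print(s)                        # singular values, largest first
```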


In [19]:
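# W2 holds the first two principal components as columns (shape 3x2).
W2 = Vt[:2].T
# Project the centered data onto the plane spanned by those two components.
df_2D = df_centered @ W2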

In [20]:
df_2D.head()

Unnamed: 0,0,1
0,121.96181,18.873854
1,-28.701836,-11.035736
2,-35.674818,4.365451
3,102.765724,-13.276261
4,-7.375348,49.198527


# Using Scikit-Learn, which centers the data automatically

In [21]:
pca = PCA(n_components=2)
df_2D = pd.DataFrame(pca.fit_transform(df))
df_2D.head()

Unnamed: 0,0,1
0,121.96181,18.873854
1,-28.701836,-11.035736
2,-35.674818,4.365451
3,102.765724,-13.276261
4,-7.375348,49.198527
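The agreement between the manual SVD route and `PCA` can be checked directly. The sketch below assumes made-up data with clearly different spreads per feature. It compares absolute values because the sign of each principal component is arbitrary, so the two routes may legitimately differ by a sign flip.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Features with different spreads so the components are clearly separated.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])

# Manual route: center, decompose, project onto the first two components.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X2_manual = X_centered @ Vt[:2].T

# Scikit-Learn route: centering happens inside fit_transform.
pca = PCA(n_components=2)
X2_sklearn = pca.fit_transform(X)

# Compare absolute values because each component's sign is arbitrary.
same_projection = np.allclose(np.abs(X2_manual), np.abs(X2_sklearn))
same_mean = np.allclose(pca.mean_, X.mean(axis=0))
print(same_projection, same_mean)
```

The fitted `mean_` attribute is the column mean that `PCA` subtracts before projecting, which is why no manual centering is needed.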


In [22]:
# explained_variance_ratio_ gives the proportion of the dataset's variance that lies along each principal component.
# Here the two kept components explain about 36% and 33% of the variance, so the dropped third component
# carries the remaining ~30%. That is too much to lose, so we would not apply PCA to data like this; it is just for practice.
pca.explained_variance_ratio_

array([0.3602933, 0.3349831])
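The same ratios can be recovered from the singular values of the centered data, since each component's variance is proportional to its squared singular value. The sketch below assumes newly generated random data, so its numbers will not match the output above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.uniform(0, 200, size=(200, 3))
X_centered = X - X.mean(axis=0)

# Squared singular values are proportional to the variance along each component.
_, s, _ = np.linalg.svd(X_centered, full_matrices=False)
ratio_from_svd = s ** 2 / np.sum(s ** 2)

pca = PCA(n_components=2).fit(X)
print(ratio_from_svd[:2], pca.explained_variance_ratio_)
print("share lost by dropping the third component:", ratio_from_svd[2])
```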

In [23]:
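# Instead of a number of components, n_components can be the proportion of variance to preserve (a float between 0.0 and 1.0)
pca = PCA(n_components=0.95)
df_2D = pd.DataFrame(pca.fit_transform(df))
df_2D.head()
# To preserve 95% of the variance here, PCA has to keep all 3 components, so it cannot reduce this dataset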

Unnamed: 0,0,1,2
0,121.96181,18.873854,-16.36596
1,-28.701836,-11.035736,-79.253815
2,-35.674818,4.365451,-7.810263
3,102.765724,-13.276261,20.632424
4,-7.375348,49.198527,24.856773
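On data with correlated features, the same setting does reduce the dimensionality. The sketch below assumes a made-up dataset in which 10 features are driven by 2 hidden factors plus a little noise. It picks the number of components from the cumulative explained variance and compares that with what `PCA(n_components=0.95)` selects.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 10 observed features generated from 2 hidden factors plus small noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Manual choice: smallest number of components whose cumulative ratio reaches 95%.
pca_all = PCA().fit(X)
cumsum = np.cumsum(pca_all.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1

# Let Scikit-Learn choose the number of components for the same target.
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(d, pca_95.n_components_, X_reduced.shape)
```

The fitted `n_components_` attribute reports how many components were actually kept.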


In [24]:
# To map the reduced data back to the original number of dimensions, you can use inverse_transform()

The reconstruction loses some quality compared with the original, but the digits are still easy to recognize.

![download](https://user-images.githubusercontent.com/96451039/228697816-b1db1e87-295d-434d-8e88-373b895a0996.png)
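A minimal sketch of the round trip, assuming random 3-feature data rather than the image data shown above: compress to 2 components, map back with `inverse_transform()`, and measure how far the reconstruction is from the original.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.uniform(0, 200, size=(200, 3))

# Compress to 2 dimensions, then map back to the original 3.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)

# The reconstruction has the original shape but is not identical to X.
mse = np.mean((X - X_recovered) ** 2)
print(X_recovered.shape)
print(f"mean squared reconstruction error: {mse:.2f}")

# Keeping every component loses nothing, so the round trip is exact
# up to floating-point error.
pca_full = PCA(n_components=3)
X_exact = pca_full.inverse_transform(pca_full.fit_transform(X))
print(np.allclose(X, X_exact))
```

The reconstruction error comes entirely from the variance along the dropped component, which is why it disappears when all components are kept.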

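By default, svd_solver is set to "auto": Scikit-Learn automatically uses
the randomized PCA algorithm if max(m, n) > 500 and n_components is an integer
smaller than 80% of min(m, n), or else it uses the full SVD approach (here m is the
number of instances and n is the number of features). For example, reducing a dataset
with 784 features and at least 784 instances to 154 components would use the randomized
PCA algorithm even without the svd_solver="randomized" argument, since 154 < 0.8 × 784.
The exact rule behind "auto" can differ between Scikit-Learn versions, so check the
documentation for the version you have installed. If you want to force
Scikit-Learn to use full SVD for a slightly more precise result, you can set the
svd_solver hyperparameter to "full".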
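To compare the two solvers directly, you can set svd_solver explicitly. The sketch below assumes a made-up 600 × 100 dataset and an arbitrary choice of 20 components. `random_state` is fixed only to make the randomized run repeatable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Made-up data: 600 instances, 100 correlated features.
X = rng.normal(size=(600, 100)) @ rng.normal(size=(100, 100))

pca_rand = PCA(n_components=20, svd_solver="randomized", random_state=42)
pca_full = PCA(n_components=20, svd_solver="full")

X_rand = pca_rand.fit_transform(X)
X_full = pca_full.fit_transform(X)

print(X_rand.shape, X_full.shape)
# Total share of variance kept by each solver's 20 components.
print(pca_rand.explained_variance_ratio_.sum())
print(pca_full.explained_variance_ratio_.sum())
```

The full solver is exact, so the randomized solver's preserved variance should be at most slightly lower; how close it gets depends on the data.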