# Curse of Dimensionality  

High dimensionality is a serious obstacle to efficiency for most algorithms, and adding features does not always improve accuracy. With too few features a model tends to underfit; with too many, it tends to overfit, because the data become sparse relative to the space they occupy. This is the curse of dimensionality: the volume of an n-dimensional space grows exponentially with the number of dimensions n, so the amount of data needed to cover that space grows just as fast.

<img src="https://i.postimg.cc/Yqbw46y2/curse.jpg" width="400"/>
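One symptom of the curse can be checked numerically: as the number of dimensions grows, the nearest and the farthest neighbour of a point end up at almost the same distance, which weakens distance-based methods such as KNN. The sketch below is only an illustration under stated assumptions (uniform random points in the unit hypercube; `distance_contrast` is a hypothetical helper, not a library function).

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# A ratio near 0 means "near" and "far" are clearly different;
# a ratio near 1 means all points look roughly equally far away.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```

The ratio should climb toward 1 as `d` increases; the exact values depend on the seed and the number of points.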

There are two main approaches to dimensionality reduction:

* Feature Selection
* Feature Extraction

<img src="https://i.postimg.cc/4xmjmr9w/vs.png" width="500"/>

### Feature Selection
In feature selection, a subset of the original features is kept and the rest are discarded.
<img src="https://i.postimg.cc/7LzcN0X2/feature-selection.png" width="300"/>

### Feature Extraction
In feature extraction, a set of new features is derived from the existing ones through some mapping, which can be either linear or non-linear.
<img src="https://i.postimg.cc/4yWF354r/feature-extraction.png" width="300"/>
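The difference between the two approaches can be sketched on the iris data used later in this notebook. The choice of `SelectKBest` with `f_classif` as the selection method and `k=2` is an assumption made for illustration; other selectors would work equally well.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original columns with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Feature extraction: build 2 new columns as linear combinations of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print("selected column indices:", kept)
print("selected columns are unchanged originals:", np.array_equal(X_selected, X[:, kept]))
print("extracted shape:", X_extracted.shape)
```

Selection returns untouched columns of `X`, while extraction returns columns that did not exist in the original data.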

## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a data set, often to 2D or 3D. It is used in exploratory data analysis and as a preprocessing step for predictive models. PCA is a `linear` transformation of the data set that defines a new coordinate system such that:

   * The direction of greatest variance in the data lies along the first axis (the first principal component).
   * The direction of second-greatest variance lies along the second axis, and so on.
   

* PCA is also used to `reduce dimensionality` so that we can easily visualize the data.
* PCA is a feature extraction method: it combines the input variables into new ones ordered by importance, so we can drop the least significant new variables while still retaining most of the information in the original ones.
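The variance ordering described above can be reproduced by hand with NumPy. This is a minimal sketch of the textbook covariance-eigenvector formulation on synthetic data (the data and the `mixing` matrix are assumptions for illustration); scikit-learn's `PCA` typically computes the same axes via a singular value decomposition instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples with 3 correlated features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing

X_centered = X - X.mean(axis=0)
covariance = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse to get the largest first.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project onto the new axes; the variance along each axis equals its eigenvalue.
Z = X_centered @ eigenvectors
print("variance along each new axis:", Z.var(axis=0, ddof=1))
print("eigenvalues:                 ", eigenvalues)
```

The two printed rows should match, with the values decreasing from the first axis to the last.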

<img src='https://files.codingninjas.in/article_images/applying-pca-on-mnist-dataset-0-1638213401.jpg'>


<table><tr>
    <td><img src="https://i.postimg.cc/TPZxdTNr/PCA1.png" width="417"/> </td>
    <td> <img src="https://i.postimg.cc/NFxvkms4/PCA2.png" width="500"/></td>
</tr></table>

> 💥 In the eyes of PCA, variance is an objective and mathematical way to quantify the amount of information in our data.
**Variance is information.**



>**✨ Additional information:**  
    **Autoencoders** are neural networks that stack several non-linear transformations to compress the input into a low-dimensional latent space. The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input.  
    PCA is restricted to a linear map, while autoencoders can have non-linear encoders and decoders.
    
   <img src="https://i.postimg.cc/RVx6HdKp/autoencoder.png" width="400"/>

# PCA for Data Visualization

### PCA and KNN on IRIS dataset

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import warnings 
warnings.filterwarnings('ignore')

In [2]:
iris = load_iris()

In [3]:
X = iris.data
y = iris.target

In [4]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [5]:
X_df = pd.DataFrame(X, columns=iris.feature_names)
y_df = pd.DataFrame(y, columns=["label"])
X_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [6]:
px.scatter_matrix(X_df, color=y, title='Scatter Matrix of Features', height=800, )

In [7]:
px.pie(y_df, names="label")

In [8]:
df = pd.concat([X_df, y_df], axis=1)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [9]:
px.imshow(df.corr())

>**❗ NOTE:** One of the main aims of plots like these, and of EDA in general, is to identify features that do little to explain the target. Here, the `sepal width (cm)` feature appears less relevant to the target class than the other features.
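The observation in the note can be read directly from the correlation matrix. The sketch below ranks features by the absolute Pearson correlation with the integer-coded label, which is only a rough heuristic for a categorical target and is used here as an assumption for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["label"] = iris.target

# Absolute correlation of each feature with the label, weakest first.
corr_with_label = df.corr()["label"].drop("label").abs().sort_values()
print(corr_with_label)
```

The feature listed first is the one with the weakest linear relationship to the label.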

In [10]:
X_df.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### Train Test split

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

### Standardizing the features

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
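What the scaler does can be verified by hand: each feature has the training mean subtracted and is divided by the training standard deviation, and the same training statistics are reused for the test set. A minimal sketch, rebuilding the split so that it runs on its own:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)

# Manual z-score using training statistics only (population std, ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_test_manual = (X_test - mean) / std

print(np.allclose(scaler.transform(X_test), X_test_manual))
```

Fitting on the training set only keeps information about the test set out of the preprocessing step.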

### Apply PCA to transform iris dataset

In [13]:
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

> The `explained_variance_ratio_` tells us how much of the total variance is explained by each principal component.

In [14]:
pca.explained_variance_ratio_

array([0.72229951, 0.2397406 , 0.03335483, 0.00460506])
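When all components are kept, each ratio is simply that component's variance divided by the total variance, so the ratios sum to 1. A short sketch that rebuilds the same preprocessing and checks this relationship:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
X_train = StandardScaler().fit_transform(X_train)

pca = PCA(n_components=4).fit(X_train)

# Per-component variance divided by the total variance.
manual_ratio = pca.explained_variance_ / pca.explained_variance_.sum()
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
print(pca.explained_variance_ratio_.sum())
```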

In [15]:
px.bar(x= ["pca-1", "pca-2", "pca-3", "pca-4"] ,y = pca.explained_variance_ratio_)

> As we can see from the above plot:  
>* The first component alone captures ~72.2% of the variance in the original data, a loss of ~27.8%.
>* The second component alone captures ~24.0% of the variance, a loss of ~76% if used on its own.
>* Together, the first and second principal components capture ~96.2% of the variance, a loss of only ~3.8%.
>
> The third and fourth components can be safely ignored because they contribute only ~3.3% and ~0.5% of the variance in the original data.
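The retained and lost percentages follow from a running sum of the ratios. A small sketch using the values printed by `pca.explained_variance_ratio_` above:

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([0.72229951, 0.2397406, 0.03335483, 0.00460506])

cumulative = np.cumsum(ratios)   # variance retained by the first k components
lost = 1.0 - cumulative          # variance discarded when keeping only k

for k, (kept, dropped) in enumerate(zip(cumulative, lost), start=1):
    print(f"first {k} component(s): kept {kept:.1%}, lost {dropped:.1%}")
```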

💥 **Since the first two principal components capture most of the variance, we will keep them for dimensionality reduction.**

### Plotting the 2 principal components with maximum variance

In [16]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

In [17]:
pca.explained_variance_ratio_

array([0.72229951, 0.2397406 ])

In [18]:
px.bar(x= ["pca-1", "pca-2"] ,y = pca.explained_variance_ratio_)

In [19]:
X_train_pca_df = pd.DataFrame(X_train_pca,columns=['PCA-1','PCA-2'])
X_train_pca_df.head()

Unnamed: 0,PCA-1,PCA-2
0,1.272261,0.358003
1,0.151947,-0.299969
2,-2.189471,0.616852
3,0.941933,0.012192
4,1.762775,-0.270963


In [20]:
pca.components_

array([[ 0.52840089, -0.23210632,  0.58393133,  0.57091449],
       [ 0.35568076,  0.9336255 ,  0.00813386,  0.04205311]])
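Each row of `pca.components_` holds the weights (loadings) that one principal component assigns to the original features. The sketch below labels the values printed above with the iris feature names and checks that the rows are unit-length and mutually orthogonal; the `loadings` DataFrame is an illustrative construction, not part of the original notebook.

```python
import numpy as np
import pandas as pd

feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# Loadings copied from the `pca.components_` output above.
components = np.array([
    [0.52840089, -0.23210632, 0.58393133, 0.57091449],
    [0.35568076,  0.9336255,  0.00813386, 0.04205311],
])

loadings = pd.DataFrame(components, index=["PCA-1", "PCA-2"], columns=feature_names)
print(loadings.round(3))

# Rows are unit-length and orthogonal (up to rounding in the printout).
gram = components @ components.T
print(np.allclose(gram, np.eye(2), atol=1e-6))
```

Read this way, PCA-1 weights sepal length and both petal measurements about equally, while PCA-2 is dominated by sepal width.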

In [21]:
px.scatter(X_train_pca_df, x="PCA-1", y="PCA-2", color= y_train)

In [22]:
X_train_pca.shape, X_test_pca.shape

((120, 2), (30, 2))

In [23]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=5)
print("score before dimension reduction:",scores.mean())   

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train_pca, y_train, cv=5)
print("score after dimension reduction:",scores.mean())   

score before dimension reduction: 0.9333333333333332
score after dimension reduction: 0.9083333333333334
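In the cell above, the scaler and PCA were fitted on the whole training set before cross-validation, so every validation fold had already influenced the preprocessing. One way to avoid this, sketched here as an assumption rather than as the notebook's method, is to wrap all steps in a scikit-learn pipeline so that they are refitted inside each fold.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Raw (unscaled) training data; every preprocessing step lives in the pipeline.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), KNeighborsClassifier())
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("pipeline CV score:", scores.mean())
```

On a dataset this small the score may differ only slightly from the one above; the point is the evaluation procedure, not the number.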


# PCA to Speed-up Machine Learning Algorithms

### MNIST Dataset
<img src='https://files.codingninjas.in/article_images/applying-pca-on-mnist-dataset-1-1638213401.jpg'>


 In this section, we will reduce the 784 dimensions of the MNIST dataset to 2 dimensions and plot the corresponding principal components obtained.

In [24]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA
from time import time
import warnings 
warnings.filterwarnings('ignore')

In [25]:
train_df = pd.read_csv("./dataset/mnist_train.csv", dtype=np.uint8)
test_df = pd.read_csv("./dataset/mnist_test.csv", dtype=np.uint8)

In [26]:
train_df.head()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
train_df.describe()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
count,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,...,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0
mean,4.453933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.200433,0.088867,0.045633,0.019283,0.015117,0.002,0.0,0.0,0.0,0.0
std,2.88927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.042472,3.956189,2.839845,1.68677,1.678283,0.3466,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,254.0,254.0,253.0,253.0,254.0,62.0,0.0,0.0,0.0,0.0


In [28]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 785 entries, label to 28x28
dtypes: uint8(785)
memory usage: 44.9 MB


In [29]:
px.pie(train_df, names="label")

In [30]:
X_train = train_df.drop(['label'], axis=1)
y_train = train_df['label']
X_test = test_df.drop(['label'], axis=1)
y_test = test_df['label']

In [31]:
print("X_train_shape:",X_train.shape)
print("Y_train_shape:",y_train.shape)
print("X_test_shape:",X_test.shape)
print("Y_test_shape:",y_test.shape)

X_train_shape: (60000, 784)
Y_train_shape: (60000,)
X_test_shape: (10000, 784)
Y_test_shape: (10000,)


In [32]:
instance_index = 7890 
matrix_conv=X_train.iloc[instance_index].to_numpy().reshape(28,28)
px.imshow(matrix_conv,binary_string=True)

# Matplotlib alternative for plotting grayscale images
# import matplotlib.pyplot as plt
# plt.imshow(matrix_conv, cmap='gray')

## Standardizing the features

In [33]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
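The `describe()` output above shows pixels such as `1x1` whose standard deviation is 0. Dividing by zero would normally produce NaNs, but `StandardScaler` uses a scale of 1 for zero-variance columns. A toy sketch (the small array is a stand-in for the pixel data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for pixel data: the first column is always 0, like a corner pixel.
X = np.array([[0.0, 10.0],
              [0.0, 20.0],
              [0.0, 30.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(scaler.scale_)   # scale used for each column
print(X_scaled)        # the constant column stays at 0, with no NaNs
```

Constant pixels therefore remain all-zero after scaling and carry no variance into PCA.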

## Apply PCA to transform MNIST dataset

### Choosing the number of components

In [34]:
pca = PCA()
pca.fit(X_train)
dataframe = pd.DataFrame({'number of components':range(1, len(pca.explained_variance_ratio_)+1) 
                          , 'cumulative explained variance':np.cumsum(pca.explained_variance_ratio_)})
px.line(dataframe, x="number of components" ,y="cumulative explained variance",
       )


In [56]:
new_arr = np.array([1, 2, 3, 4])
np.cumsum(new_arr)

array([ 1,  3,  6, 10])
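The cumulative curve can be turned into a concrete component count. The sketch below shows two routes to keeping 95% of the variance; the 95% threshold is an arbitrary choice, and the small 8x8 `load_digits` dataset is used as a stand-in for MNIST so the example runs without the CSV files.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the 8x8 digits dataset stands in for MNIST so the sketch runs offline.
X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Option 1: read the component count off the cumulative curve.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_from_curve = int(np.argmax(cumulative >= 0.95)) + 1

# Option 2: pass the target fraction and let PCA choose the count.
pca_95 = PCA(n_components=0.95).fit(X)

print("components from curve:", n_from_curve)
print("components chosen by PCA:", pca_95.n_components_)
print("variance retained:", pca_95.explained_variance_ratio_.sum())
```

A float between 0 and 1 for `n_components` tells scikit-learn to keep as many components as are needed to exceed that fraction of explained variance.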

In [35]:
pca.explained_variance_ratio_

array([5.64671692e-02, 4.07827199e-02, 3.73938042e-02, 2.88511485e-02,
       2.52110863e-02, 2.19426996e-02, 1.92334439e-02, 1.74579923e-02,
       1.53509230e-02, 1.40171960e-02, 1.34174302e-02, 1.20374194e-02,
       1.11456955e-02, 1.08992356e-02, 1.02864922e-02, 9.94486564e-03,
       9.36383280e-03, 9.21045666e-03, 8.93436778e-03, 8.69912619e-03,
       8.27363019e-03, 8.03417369e-03, 7.64845500e-03, 7.41772464e-03,
       7.15292868e-03, 6.91846831e-03, 6.84135964e-03, 6.56674546e-03,
       6.31676724e-03, 6.12919839e-03, 5.96255295e-03, 5.87716416e-03,
       5.71591699e-03, 5.62307416e-03, 5.54682002e-03, 5.38418374e-03,
       5.31182250e-03, 5.19605602e-03, 5.08211255e-03, 4.80005571e-03,
       4.76455820e-03, 4.69139360e-03, 4.54348956e-03, 4.51345787e-03,
       4.46963401e-03, 4.43383155e-03, 4.38215469e-03, 4.30381751e-03,
       4.26877901e-03, 4.23647017e-03, 4.04696121e-03, 3.99447403e-03,
       3.97456119e-03, 3.93820800e-03, 3.85813590e-03, 3.79042674e-03,
      

In [36]:

pca = PCA(n_components=2)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
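Two components are convenient for plotting, but for speeding up a model one would usually keep more of the variance. The following is a rough timing sketch under stated assumptions: it uses the small `load_digits` dataset instead of MNIST, a KNN classifier, and a hypothetical helper `fit_and_score`. Timings vary between machines, and on data this small the difference may be negligible.

```python
from time import perf_counter

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Keep enough components to retain 95% of the training variance.
pca = PCA(n_components=0.95).fit(X_train)
X_train_pca, X_test_pca = pca.transform(X_train), pca.transform(X_test)

def fit_and_score(train, test):
    """Return (accuracy, elapsed seconds) for a KNN fit plus scoring."""
    start = perf_counter()
    accuracy = KNeighborsClassifier().fit(train, y_train).score(test, y_test)
    return accuracy, perf_counter() - start

print("all features:", X_train.shape[1], fit_and_score(X_train, X_test))
print("PCA features:", X_train_pca.shape[1], fit_and_score(X_train_pca, X_test_pca))
```

The comparison to look at is accuracy versus feature count and elapsed time, not the absolute numbers.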

In [37]:
df_pca = pd.DataFrame(X_train_pca)

In [40]:
df_pca['label'] = y_train

In [41]:
df_pca

Unnamed: 0,0,1,label
0,-0.922158,-4.815009,5
1,8.708977,-7.754344,0
2,2.328393,9.431587,4
3,-6.582172,-3.746342,1
4,-5.183250,3.133339,9
...,...,...,...
59995,-2.039337,-5.119084,8
59996,0.607841,-6.498377,3
59997,-3.777212,-3.230569,5
59998,1.722367,-4.948067,6


In [48]:
# Plot only digits 0 and 1 in the space of the two principal components
px.scatter(df_pca[df_pca['label'].isin([0, 1])], x=0, y=1, color='label')


### Recover main features from PCA-transformation

Each principal component is a `linear` combination of the original variables:

<img src='https://i.stack.imgur.com/RQKn6.png'>

In [51]:
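# Project the identity matrix to read off each pixel's weight in the two components.
# transform() subtracts pca.mean_ first; X_train is standardized (mean ~0),
# so coef is approximately pca.components_.T
coef = pca.transform(np.identity(X_train.shape[1]))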

In [55]:
pca.components_

array([[-4.39258366e-17,  6.14846154e-19, -6.16501500e-20, ...,
        -0.00000000e+00, -0.00000000e+00, -0.00000000e+00],
       [ 8.64042157e-17,  8.02486169e-18,  1.18954031e-19, ...,
        -0.00000000e+00, -0.00000000e+00, -0.00000000e+00]])
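The linear combination can be written out explicitly: `transform` centres the data with `pca.mean_` and then multiplies by the transposed components. The sketch below checks this on the iris data, deliberately left unstandardized, and shows that passing an identity matrix through `transform` returns the loadings shifted by the projected mean rather than the loadings themselves.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # deliberately not standardized
pca = PCA(n_components=2).fit(X)

# transform() centres the data first, then projects onto the components.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(pca.transform(X), manual))

# Feeding the identity matrix through transform() gives the loadings
# shifted by the projected mean, not the loadings themselves.
coef = pca.transform(np.identity(X.shape[1]))
offset = pca.mean_ @ pca.components_.T
print(np.allclose(coef, pca.components_.T - offset))
```

When the data have been standardized the mean is close to zero, so the two coincide; reading `pca.components_` directly is the more robust way to get the weights.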