#### About

> Dimensionality Reduction

Dimensionality reduction reduces the number of features (variables) in a data set while preserving as much relevant information as possible. High-dimensional data is hard to visualize, analyze, and model; dimensionality reduction techniques address this by transforming the data into a lower-dimensional representation. There are two main types of dimensionality reduction techniques:

1. Feature selection: Feature selection techniques choose a subset of the original features based on specific criteria, such as their importance or relevance to the problem at hand. These methods keep the selected features unchanged and discard the rest, resulting in a smaller feature set.

2. Feature extraction: Feature extraction techniques generate new features from the original ones using mathematical transformations. The new feature set captures the most important information in the original features, resulting in a lower-dimensional representation of the data.
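The difference between the two families can be sketched on toy data. The example below is only an illustration under stated assumptions: the variance-based selection rule, the random projection matrix, and all variable names (`selected_idx`, `projection`, and so on) are hypothetical choices, not methods prescribed by this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples, 5 features with deliberately different spreads.
X = rng.normal(size=(100, 5)) * np.array([0.1, 2.0, 0.5, 3.0, 1.0])

# Feature selection: keep the k ORIGINAL columns with the largest variance.
k = 2
selected_idx = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, selected_idx]

# Feature extraction: build k NEW features as linear combinations of all columns.
# A random projection is used here purely as a placeholder transformation.
projection = rng.normal(size=(X.shape[1], k))
X_extracted = X @ projection

print(selected_idx)        # indices of the retained original columns
print(X_selected.shape)    # (100, 2)
print(X_extracted.shape)   # (100, 2)
```

Both results have two columns, but `X_selected` contains untouched original columns, while every column of `X_extracted` mixes all five inputs.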




> Commonly Used Dimensionality Reduction Techniques

1. Principal Component Analysis (PCA): PCA is a widely used linear dimensionality reduction technique that transforms the original features into a new set of uncorrelated features called principal components, which are ordered by their explained variance. PCA finds the directions in the data with the highest variance and projects the data onto these directions to create a lower-dimensional representation.


In [3]:
from sklearn.decomposition import PCA
import numpy as np

In [12]:
X = np.random.random((199,2324))
print(X.shape)

(199, 2324)


In [8]:
# Create a PCA instance with desired number of components
pca = PCA(n_components=2)

In [9]:
# Fit PCA to the data
pca.fit(X)

In [11]:
# Transform the data to the first two principal components
X_pca = pca.transform(X)

print(X_pca.shape)

(199, 2)
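To make the "directions of highest variance" idea concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration, not scikit-learn's implementation; the smaller 50-column matrix is an assumption made to keep the example fast, and the projected values may differ in sign from those returned by `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((199, 50))          # smaller stand-in for the (199, 2324) matrix

# 1. Center each feature; PCA is defined on mean-centered data.
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance explained by each component, in decreasing order.
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project the data onto the first two directions.
n_components = 2
X_pca = X_centered @ Vt[:n_components].T

print(X_pca.shape)                      # (199, 2)
print(explained_ratio[:n_components])   # share of total variance per component
# The projected features are uncorrelated: off-diagonal covariance is ~0.
print(np.round(np.cov(X_pca, rowvar=False), 6))
```

On uniformly random data like this, the first two components explain only a small share of the variance, because no direction is much more informative than any other.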


2. t-SNE (t-distributed Stochastic Neighbor Embedding): t-SNE is a nonlinear dimensionality reduction technique that is particularly useful for visualizing high-dimensional data in two or three dimensions. It preserves the local structure of the data by minimizing the Kullback-Leibler divergence between pairwise similarities in the original data and pairwise similarities in the reduced-dimensional representation.

In [13]:
from sklearn.manifold import TSNE


In [14]:
# Create a t-SNE instance with desired number of components
tsne = TSNE(n_components=2)


In [15]:

# Fit t-SNE to the data
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)

(199, 2)
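The quantity t-SNE minimizes can be evaluated by hand for a given embedding. The sketch below is a simplification under stated assumptions: it uses one fixed Gaussian bandwidth `sigma` for every point instead of the per-point perplexity search, it only scores two candidate embeddings rather than optimizing one, and all function names are hypothetical helpers rather than part of scikit-learn.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows."""
    sq = np.sum(A**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.maximum(D, 0.0)

def high_dim_similarities(X, sigma=1.0):
    """Symmetrized Gaussian similarities P (fixed sigma, no perplexity search)."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma**2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # row i holds p(j | i)
    return (cond + cond.T) / (2.0 * X.shape[0])

def low_dim_similarities(Y):
    """Student-t (one degree of freedom) similarities Q in the embedding."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over pairs with nonzero P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
# Two well-separated clusters of 25 points each in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(25, 10)),
               rng.normal(5.0, 1.0, size=(25, 10))])
P = high_dim_similarities(X)

Y_good = X[:, :2]                     # 2-D layout that keeps the clusters apart
Y_bad = Y_good[rng.permutation(50)]   # same positions, assigned to the wrong points

kl_good = kl_divergence(P, low_dim_similarities(Y_good))
kl_bad = kl_divergence(P, low_dim_similarities(Y_bad))
print(kl_good, kl_bad)   # the cluster-preserving layout should score lower
```

A lower divergence means that points which are close in the original space are also close in the embedding, which is the sense in which t-SNE preserves local structure.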


3. LLE (Locally Linear Embedding): LLE is a nonlinear dimensionality reduction technique that assumes the data lies on a manifold that is locally linear. It first reconstructs each data point as a weighted linear combination of its nearest neighbors in the original space, then finds a lower-dimensional representation in which each point is reconstructed from the same neighbors with the same weights.


In [16]:
from sklearn.manifold import LocallyLinearEmbedding


In [17]:
# Create an LLE instance with desired number of components
lle = LocallyLinearEmbedding(n_components=6)

In [18]:
# Fit LLE to the data
X_lle = lle.fit_transform(X)
print(X_lle.shape)

(199, 6)
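The first step of LLE, finding the reconstruction weights for one point, can be sketched in NumPy. This is a minimal illustration under stated assumptions: `lle_weights` is a hypothetical helper, the neighbor count and regularization value are arbitrary choices, and the second step (solving for the embedding that preserves these weights) is not shown.

```python
import numpy as np

def lle_weights(X, i, n_neighbors=5, reg=1e-3):
    """Weights that best reconstruct X[i] from its nearest neighbors (sum to 1)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    dists[i] = np.inf                               # exclude the point itself
    neighbors = np.argsort(dists)[:n_neighbors]

    Z = X[neighbors] - X[i]                         # neighbors centered on X[i]
    G = Z @ Z.T                                     # local Gram matrix
    G += reg * np.trace(G) * np.eye(n_neighbors)    # regularize for stability
    w = np.linalg.solve(G, np.ones(n_neighbors))
    return neighbors, w / w.sum()

rng = np.random.default_rng(0)
X = rng.random((199, 20))

neighbors, w = lle_weights(X, i=0)
reconstruction = w @ X[neighbors]

print(neighbors)                                   # indices of the 5 nearest neighbors
print(w.sum())                                     # weights sum to 1
print(np.linalg.norm(X[0] - reconstruction))       # reconstruction error for point 0
```

Because the weights sum to one, they are unchanged by translating, rotating, or rescaling the neighborhood, which is why the same weights can be reused in the lower-dimensional space.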


4. Autoencoders: Autoencoders are a type of neural network-based dimensionality reduction technique that consists of an encoder and a decoder. The encoder maps the original features to a lower-dimensional representation, and the decoder maps the lower-dimensional representation back to the original features. Autoencoders are trained to minimize the reconstruction error, which encourages the model to learn a meaningful lower-dimensional representation of the data.



In [22]:
from keras.layers import Input, Dense
from keras.models import Model




In [23]:
# Define the input shape and the desired dimension of the encoded representation
input_dim = X.shape[1]
encoding_dim = 2


In [24]:
# Define the encoder
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)




In [25]:

# Define the decoder
decoded = Dense(input_dim, activation='sigmoid')(encoded)


In [29]:

# Create the autoencoder model
autoencoder = Model(input_layer, decoded)


In [31]:
# Compile the model
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

In [32]:
# Train the autoencoder to reconstruct its own input
autoencoder.fit(X, X, epochs=100, batch_size=32)




<keras.callbacks.History at 0x7f5b881e3ca0>

In [33]:
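# autoencoder.predict(X) returns the reconstruction, with shape (199, 2324).
# The reduced representation comes from the encoder half of the network.
encoder = Model(input_layer, encoded)
X_encoded = encoder.predict(X)
print(X_encoded.shape)

(199, 2)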
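The encoder/decoder idea can also be sketched without a deep-learning framework. The example below is not the Keras model from this notebook: it is a linear autoencoder (no activation functions) trained with plain gradient descent on squared reconstruction error, and the toy data, `W_enc`, `W_dec`, and `learning_rate` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 10-D that mostly vary along 2 latent directions.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std()

input_dim, encoding_dim = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(input_dim, encoding_dim))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(encoding_dim, input_dim))   # decoder weights

learning_rate = 0.01
losses = []
for step in range(2000):
    Z = X @ W_enc                 # encode: (200, 10) -> (200, 2)
    X_hat = Z @ W_dec             # decode: (200, 2) -> (200, 10)
    error = X_hat - X
    losses.append(np.mean(np.sum(error**2, axis=1)))   # reconstruction error

    # Gradients of the mean squared reconstruction error.
    grad_out = 2.0 * error / X.shape[0]
    grad_dec = Z.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

# The reduced representation is the encoder output, not the reconstruction.
X_encoded = X @ W_enc
print(X_encoded.shape)            # (200, 2)
print(losses[0], losses[-1])      # reconstruction error before and after training
```

The low-dimensional representation is read from the encoder output `Z`, while the decoder output `X_hat` has the same shape as the input and only serves to measure reconstruction error during training.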
