### **Dimensionality Reduction.**

Datasets are sometimes very large, containing potentially millions of data points across a large number of features.

Each feature can also be thought of as a ‘dimension’. For high-dimensional data, we may want or need to reduce the number of dimensions. Reducing the number of dimensions (that is, the number of features in a dataset) is called ‘dimensionality reduction’.

**Dimensionality reduction** in machine learning (ML) is the process of reducing the number of features or dimensions in a dataset while preserving the most important information. The objectives of dimensionality reduction are to make data easier to visualize, understand, and analyze, and to improve the performance of ML models.

The simplest approach is to drop some dimensions, for example the ones that seem least useful. However, this is likely to throw away a lot of information, and we would not necessarily know which features to keep. Typically we want to reduce the number of dimensions while preserving as much information from the dataset as we can.
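
As a rough illustration of the "just drop some features" idea, the sketch below removes a feature that carries no information at all (a constant column) using scikit-learn's `VarianceThreshold`. The data is synthetic and the variable names are assumptions made for this example only; it is not part of the demo dataset used later.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic data (an assumption for illustration): three random columns
# plus one constant column that carries no information.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),
    rng.normal(size=100),
    rng.normal(size=100),
    np.full(100, 5.0),  # constant feature
])

# Keep only the features whose variance is above the threshold.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

kept = selector.get_support()  # boolean mask of the columns that were kept
print(X.shape, "->", X_reduced.shape)
print(kept)
```

This kind of filter only catches features that are obviously uninformative; it says nothing about how the remaining features relate to each other, which is where techniques such as PCA come in.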

### **When Should You Avoid Using Dimensionality Reduction?**

There are several situations where dimensionality reduction techniques are not recommended:

- If the dataset is small, the information loss caused by dimensionality reduction techniques may be large, and it may be better to use all the features.

- When the data is already well-structured and easy to understand, dimensionality reduction techniques may not be necessary, and it may be better to use all the features as the benefits from interpretability may outweigh gains such as ML performance.

- When the data has a non-linear structure, techniques such as PCA, which captures only linear relationships in the data, may not be effective; techniques such as t-SNE or UMAP may be more appropriate.

- When the data is highly skewed, techniques such as PCA may not be effective: PCA is driven by variance, so a few extreme values can dominate the components, and it is most informative when the data is roughly normally distributed.
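
For the skewed-data case in the last bullet, one common option is to measure the skewness first and, if it is large, apply a log transform before running PCA. The sketch below assumes a synthetic, log-normally distributed series called `amounts` (a hypothetical stand-in for something like transaction amounts); whether a log transform is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (an assumption for illustration).
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# Sample skewness before and after a log(1 + x) transform.
skew_before = amounts.skew()
skew_after = np.log1p(amounts).skew()

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```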

### **DEMO:**

In [22]:
import pandas as pd 


data = pd.read_csv("https://statso.io/wp-content/uploads/2023/06/rfm_data.csv")

In [23]:
data.head(2)

|   | CustomerID | PurchaseDate | TransactionAmount | ProductInformation | OrderID | Location |
|---|---|---|---|---|---|---|
| 0 | 8814 | 2023-04-11 | 943.31 | Product C | 890075 | Tokyo |
| 1 | 2188 | 2023-04-11 | 463.7 | Product A | 176819 | London |


**First, we create a DataFrame to hold the columns we want to preserve.**

In [24]:
columns_to_preserve = data[['PurchaseDate', 'Location']]

We are going to assume that 'CustomerID', 'PurchaseDate', 'OrderID', and 'Location' are not used in dimensionality reduction, so we are going to drop them.

In [25]:
columns_to_drop = ['CustomerID', 'PurchaseDate', 'OrderID', 'Location']
data = data.drop(columns=columns_to_drop)

# Then we convert the categorical data to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['ProductInformation'])

data.head(2)

|   | TransactionAmount | ProductInformation_Product A | ProductInformation_Product B | ProductInformation_Product C | ProductInformation_Product D |
|---|---|---|---|---|---|
| 0 | 943.31 | 0 | 0 | 1 | 0 |
| 1 | 463.7 | 1 | 0 | 0 | 0 |


In [26]:
# Standardize the data so every feature has zero mean and unit variance
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)


In [27]:
# Apply PCA
from sklearn.decomposition import PCA

n_components = 2  # Set the number of components you want to reduce to
pca = PCA(n_components=n_components)
data_pca = pca.fit_transform(data_standardized)


In [28]:
# Create a DataFrame with reduced dimensions
column_names = [f'PC{i+1}' for i in range(n_components)]
reduced_data = pd.DataFrame(data=data_pca, columns=column_names)
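
Before relying on the reduced data, it can help to check how much of the original variance the kept components explain. A fitted scikit-learn `PCA` object exposes this as `explained_variance_ratio_`. The sketch below uses synthetic data and hypothetical names (`X`, `pca_demo`) so that it runs on its own; the same attribute is available on the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a dataset (an assumption for illustration):
# five features, the first two of which are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 3)),
])

X_std = StandardScaler().fit_transform(X)
pca_demo = PCA(n_components=2).fit(X_std)

# Fraction of the total variance explained by each kept component.
ratios = pca_demo.explained_variance_ratio_
print("per component:", ratios)
print("total kept:   ", ratios.sum())
```

If the total is low, two components are probably discarding a lot of information, and it may be worth keeping more components or not reducing at all.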

In [29]:
# Concatenate the reduced data with the columns we want to preserve
# (here 'PurchaseDate' and 'Location')
final_data = pd.concat([columns_to_preserve, reduced_data], axis=1)

# Save the reduced data to a new CSV file (the 'data' directory must already exist)
final_data.to_csv('data/reduced_dataset.csv', index=False)

In [32]:
df = pd.read_csv("data/reduced_dataset.csv")

In [33]:
df.head()

|   | PurchaseDate | Location | PC1 | PC2 |
|---|---|---|---|---|
| 0 | 2023-04-11 | Tokyo | 1.943539 | -0.224364 |
| 1 | 2023-04-11 | London | -0.350556 | 0.232157 |
| 2 | 2023-04-11 | New York | -0.436379 | 0.269576 |
| 3 | 2023-04-11 | London | -0.404816 | 0.255815 |
| 4 | 2023-04-11 | Paris | -0.288808 | 0.205235 |


Apart from principal component analysis (PCA), **t-distributed stochastic neighbor embedding (t-SNE)** and **linear discriminant analysis (LDA)** are two other common dimensionality reduction techniques. Unlike PCA and t-SNE, LDA is supervised, so it needs a class label for each row.

In [None]:
# Import t-distributed stochastic neighbor embedding (t-SNE)
from sklearn.manifold import TSNE

# An example using t-SNE
n_components = 2  # Set the number of components you want to reduce to
tsne = TSNE(n_components=n_components, random_state=42)  # A fixed random_state makes the result reproducible
data_tsne = tsne.fit_transform(data_standardized)
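
Two practical points about scikit-learn's `TSNE` are worth keeping in mind: the `perplexity` parameter should be smaller than the number of samples, and the estimator only offers `fit_transform`, so it cannot project new, unseen rows onto an existing embedding. The sketch below shows both on a small synthetic array; the names `X_small` and `tsne_demo` and the perplexity value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Small synthetic dataset (an assumption for illustration): 50 rows, 5 features.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 5))

# perplexity is kept well below the number of samples (50 here).
tsne_demo = TSNE(n_components=2, perplexity=10, random_state=42)
embedding = tsne_demo.fit_transform(X_small)

print(embedding.shape)
# TSNE has no separate transform() method for new data.
print(hasattr(tsne_demo, "transform"))
```

Because of the second point, t-SNE is mostly used for visualization rather than as a pre-processing step in a model pipeline.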

In [None]:
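# Import linear discriminant analysis (LDA) from scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# 'ProductInformation' was one-hot encoded earlier, so rebuild the class labels
# from the dummy columns. Note: those dummy columns are also part of
# data_standardized, so the features leak the label; this only illustrates the API.
labels = data.filter(like='ProductInformation_').astype(int).idxmax(axis=1)

# An example using linear discriminant analysis (LDA)
n_components = 2  # Must be at most min(n_features, n_classes - 1)
lda = LDA(n_components=n_components)
data_lda = lda.fit_transform(data_standardized, labels)

# Create a DataFrame with reduced dimensions
column_names = [f'LDA{i+1}' for i in range(n_components)]
reduced_data_lda = pd.DataFrame(data=data_lda, columns=column_names)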
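
LDA is easier to see on a dataset where the class label is separate from the features. The sketch below swaps in the Iris dataset that ships with scikit-learn (it is not the dataset used in this demo) and reduces its four features to two. With three classes, two is also the maximum number of components LDA can return.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Iris: 150 rows, 4 numeric features, 3 classes (bundled with scikit-learn).
X_iris, y_iris = load_iris(return_X_y=True)
X_iris_std = StandardScaler().fit_transform(X_iris)

# n_components can be at most min(n_features, n_classes - 1) = 2 here.
lda_demo = LinearDiscriminantAnalysis(n_components=2)
X_iris_lda = lda_demo.fit_transform(X_iris_std, y_iris)

print(X_iris.shape, "->", X_iris_lda.shape)
```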

### **Important points:**

- Dimensionality reduction is the process of reducing the number of features in a dataset while retaining as much information as possible.

- This can be done to reduce the complexity of a model, improve the performance of a learning algorithm, or make it easier to visualize the data.

- Techniques for dimensionality reduction include: principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).

- Each technique projects the data onto a lower-dimensional space while preserving important information.

- Dimensionality reduction is usually performed during the pre-processing stage, before building a model, to improve its performance.

- It is important to note that dimensionality reduction can also discard useful information, so care must be taken when applying these techniques.
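
Singular value decomposition (SVD) is listed above but not shown in the demo. As a minimal sketch, assuming synthetic data and hypothetical variable names, the NumPy code below projects centered data onto its top two right singular vectors, which is the same projection PCA computes.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 100 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Center each feature, then decompose: X_centered = U @ diag(S) @ Vt
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top-k right singular vectors (the principal directions).
k = 2
X_proj = X_centered @ Vt[:k].T

# Share of the total variance captured by each singular direction.
explained = S**2 / np.sum(S**2)

print(X_proj.shape)
print(explained)
```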

#### **Reference:**
https://medium.com/codex/dimensionality-reduction-techniques-for-categorical-continuous-data-75d2bca53100