<a href="https://colab.research.google.com/github/MathMachado/DSWP/blob/master/PCA%2C%20t-SNE%20and%20UMAP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Iris Dataset

Here's an example using the Iris dataset, which contains 4 features. We will apply PCA, t-SNE, and UMAP to reduce the dimensionality to 2D and visualize the results.

In [None]:
%pip install umap-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP
umap_reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_reducer.fit_transform(X_scaled)

# Plot the results
fig, axs = plt.subplots(1, 3, figsize=(18, 5))

axs[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y)
axs[0].set_title("PCA")

axs[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
axs[1].set_title("t-SNE")

axs[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y)
axs[2].set_title("UMAP")

plt.show()


## Conclusion

This code will generate three scatterplots, each representing the reduced 2D representation of the Iris dataset using PCA, t-SNE, and UMAP. The colors indicate the different classes in the Iris dataset.

PCA is a linear projection: it separates one class (setosa) cleanly but leaves the other two (versicolor and virginica) partly overlapping. t-SNE and UMAP usually produce tighter, more distinct clusters, because they are non-linear methods that preserve local neighborhoods. Keep in mind that cluster sizes and the distances between clusters in t-SNE and UMAP plots are not directly interpretable, and that the layout changes with hyperparameters such as `perplexity` and `n_neighbors`. PCA, in contrast, is faster and more interpretable, since each axis is a linear combination of the original features.
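
To make the interpretability point concrete, the following sketch prints how much variance the two principal components retain and how each original feature loads on them. It refits PCA on the standardized Iris data, so it does not depend on variables from the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each of the two components
variance_ratio = pca.explained_variance_ratio_
print("Explained variance ratio:", np.round(variance_ratio, 3))
print("Total variance retained:", round(float(variance_ratio.sum()), 3))

# Loadings: rows of components_ are components, columns are original features
for name, pc1, pc2 in zip(iris.feature_names, pca.components_[0], pca.components_[1]):
    print(f"{name:20s} PC1={pc1:+.2f}  PC2={pc2:+.2f}")
```

t-SNE and UMAP have no equivalent of these loadings, which is one reason their axes cannot be read in terms of the original features.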

# Wine Dataset

The Wine dataset contains 178 samples, 13 features, and 3 classes. We will apply PCA, t-SNE, and UMAP to reduce the dimensionality to 2D and visualize the results.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP
umap_reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_reducer.fit_transform(X_scaled)

# Plot the results
fig, axs = plt.subplots(1, 3, figsize=(18, 5))

axs[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y)
axs[0].set_title("PCA")

axs[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
axs[1].set_title("t-SNE")

axs[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y)
axs[2].set_title("UMAP")

plt.show()


You will observe that PCA separates the classes reasonably well, with some overlap between two of them. t-SNE and UMAP typically yield more compact, better-separated clusters, which suggests that they capture non-linear structure that a linear projection misses. As with any t-SNE or UMAP plot, the result depends on the hyperparameters and the random seed, so it is worth rerunning with a few settings before drawing conclusions. PCA remains faster and more interpretable, which can matter more when you need to explain the axes or process large datasets.
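
Visual impressions of "better separation" can be backed by a number. As a rough sketch, the silhouette score of the true class labels can be computed in each 2D embedding. This is an illustrative choice of metric, not part of the original analysis, and UMAP is left out only to keep the example limited to scikit-learn (its embedding can be added to the dictionary in the same way).

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# 2D embeddings to compare
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X_scaled),
    "t-SNE": TSNE(n_components=2, random_state=42).fit_transform(X_scaled),
}

# Silhouette of the known classes: closer to 1 means tighter, better-separated groups
scores = {name: silhouette_score(emb, y) for name, emb in embeddings.items()}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.3f}")
```

The score describes the geometry of the embedding only. A higher value for t-SNE does not mean the classes are that well separated in the original 13-dimensional space.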

## Selecting the most important features

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
pca.fit(X_scaled)

# Get PCA components
components = pca.components_

# Get explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Calculate the contribution of each feature to the first two principal components
# (or another number of components based on your preference)
num_components = 2
feature_contributions = np.abs(components[:num_components]).sum(axis=0)

# Get the indices of the top N features, most important first
N = 5  # Number of top features to select
top_feature_indices = np.argsort(feature_contributions)[::-1][:N]

# Get the top N features' names
top_features = [feature_names[i] for i in top_feature_indices]

print("Top", N, "features:")
print(top_features)
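
The cell above sums the absolute loadings of the first two components with equal weight, so a component that explains little variance counts as much as one that explains a lot. One possible variant, sketched here as an assumption rather than a standard recipe, weights each component's loadings by its explained variance ratio.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA().fit(X_scaled)

num_components = 2
N = 5

# Weight each component's absolute loadings by the share of variance it explains
weights = pca.explained_variance_ratio_[:num_components]
weighted = (np.abs(pca.components_[:num_components]) * weights[:, None]).sum(axis=0)

# Indices of the N largest weighted contributions, most important first
top_idx = np.argsort(weighted)[::-1][:N]
top_features_weighted = [wine.feature_names[i] for i in top_idx]
print(top_features_weighted)
```

Either way, PCA never looks at the class labels, so this ranks features by how much they drive the variance in the data, not by how well they predict the target.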


# Feature Selection

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# RFE
svm_estimator = SVC(kernel="linear")
rfe = RFE(estimator=svm_estimator, n_features_to_select=5)
rfe.fit(X_scaled, y)
rfe_features = np.array(feature_names)[rfe.support_]

# SelectKBest
kbest = SelectKBest(score_func=f_classif, k=5)
kbest.fit(X_scaled, y)
kbest_features = np.array(feature_names)[kbest.get_support()]

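# LASSO
# Note: LassoCV is a regressor, so the class labels (0, 1, 2) are treated
# as a numeric target here. This is only a rough illustration for a
# classification dataset.
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, y)
lasso_features = np.array(feature_names)[np.abs(lasso.coef_) > 1e-5]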

# Random Forest feature importances (fixed seed for reproducible results)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_scaled, y)
importances = rf.feature_importances_
rf_features = np.array(feature_names)[importances > np.mean(importances)]

print("RFE selected features:", rfe_features)
print("SelectKBest selected features:", kbest_features)
print("LASSO selected features:", lasso_features)
print("Random Forest selected features:", rf_features)
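
The four printed lists are hard to compare by eye. A small sketch that tallies how many methods picked each feature can help. It refits the same selectors so that it runs on its own, and the `n_methods` column name is just an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

rf = RandomForestClassifier(random_state=42).fit(X_scaled, y)

# One boolean mask per method: True where the feature was selected
masks = {
    "RFE": RFE(SVC(kernel="linear"), n_features_to_select=5).fit(X_scaled, y).support_,
    "SelectKBest": SelectKBest(f_classif, k=5).fit(X_scaled, y).get_support(),
    "LASSO": np.abs(LassoCV(cv=5).fit(X_scaled, y).coef_) > 1e-5,
    "RandomForest": rf.feature_importances_ > rf.feature_importances_.mean(),
}

votes = pd.DataFrame(masks, index=wine.feature_names)
votes["n_methods"] = votes.sum(axis=1)  # number of methods that kept each feature
print(votes.sort_values("n_methods", ascending=False))
```

Features selected by most or all methods are reasonable candidates to keep. Disagreements usually reflect the different assumptions each method makes, which the sections below describe.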


## RFE (Recursive Feature Elimination)

### Pros:
- Works with any estimator that exposes a `coef_` or `feature_importances_` attribute.
- Evaluates features jointly through the fitted model rather than one at a time.
- Can improve performance when irrelevant features are present.

### Cons:
- Computationally expensive, because the model is refit every time features are eliminated.

### When to use: When you have a supervised learning problem and want to consider interactions between features.

### How to interpret: The selected features are the ones that contribute the most to the model's performance according to the estimator used.
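
Beyond the selected/not-selected mask, a fitted `RFE` object also records the order in which features were eliminated. A minimal sketch on the Wine data, using the same linear SVM estimator as above:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5)
rfe.fit(X_scaled, wine.target)

# ranking_ is 1 for selected features; larger values were eliminated earlier
ranking = pd.Series(rfe.ranking_, index=wine.feature_names).sort_values()
print(ranking)
```

The ranking depends on the estimator: a different model, or different hyperparameters for the same model, can eliminate features in a different order.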

## SelectKBest

### Pros:
- Fast and efficient: each feature is scored with a univariate statistical test and the top K are kept.
- Easy to interpret.

### Cons:
- Ignores interactions between features.
- Scores each feature in isolation, so it may keep several redundant, highly correlated features.

### When to use: When you want a quick and simple way to select a subset of features based on their individual importance.

### How to interpret: The selected features are the ones that have the highest scores according to the chosen statistical test.
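
The scores behind the selection are available on the fitted selector. The sketch below lists the ANOVA F-statistic and p-value for every Wine feature, using the same `f_classif` test as the example above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

kbest = SelectKBest(score_func=f_classif, k=5).fit(X_scaled, wine.target)

# One row per feature: test statistic, p-value, and whether it made the top K
table = pd.DataFrame(
    {"F": kbest.scores_, "p_value": kbest.pvalues_, "selected": kbest.get_support()},
    index=wine.feature_names,
).sort_values("F", ascending=False)
print(table)
```

Looking at the full table shows how close the cut-off is: if the sixth feature scores almost as high as the fifth, the choice of K deserves a second look.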

## LASSO (Least Absolute Shrinkage and Selection Operator)

### Pros:
- Performs feature selection and regression simultaneously.
- Can handle high-dimensional datasets, including cases with more features than samples.

### Cons:
- Assumes a linear relationship between features and target variable.
- With groups of highly correlated features, it tends to keep one feature from the group somewhat arbitrarily and drop the rest.

### When to use: When you have a linear regression problem and want a sparse model.

### How to interpret: The selected features are the ones with non-zero coefficients in the LASSO model. When the features are on the same scale (for example, after standardization), features with larger absolute coefficients have a stronger impact on the target variable.
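
LASSO itself is a regression method. For a classification dataset such as Wine, a closer fit is an L1-penalized linear classifier, which applies the same sparsity idea. The sketch below swaps in scikit-learn's `LinearSVC` for that purpose; the value `C=0.1` is an arbitrary illustrative choice and would normally be tuned.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# The L1 penalty drives some coefficients to exactly zero.
# C is the inverse regularization strength (smaller C gives a sparser model).
clf = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000, random_state=0)
clf.fit(X_scaled, wine.target)

# coef_ has one row per class; keep a feature if any class uses it
nonzero = np.any(np.abs(clf.coef_) > 1e-5, axis=0)
l1_features = np.array(wine.feature_names)[nonzero]
print(l1_features)
```

Lowering `C` removes more features and raising it keeps more, so the selected set should be read together with the regularization strength that produced it.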


## Feature importances from tree-based models (Random Forest, XGBoost, etc.)

### Pros:
- Can handle non-linear relationships and interactions between features.
- Robust to outliers.
- Provides a measure of feature importance directly from the model.

### Cons:
- Random Forest can be computationally expensive for large datasets.
- Impurity-based importance can be biased towards high-cardinality features, and it is computed on the training data.

### When to use: When you have a supervised learning problem, and you want a more robust way to assess feature importances that can handle non-linear relationships and interactions.

### How to interpret: The selected features are the ones with higher importance scores according to the tree-based model. Higher importance scores indicate that the feature contributed more to the splits the model learned.
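
One way to cross-check impurity-based importances is permutation importance, which measures how much the score on held-out data drops when a feature's values are shuffled. The sketch below is an illustrative addition, and the 70/30 split and `n_repeats=10` are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, stratify=wine.target, random_state=42
)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the test set and record the mean drop in accuracy
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

comparison = pd.DataFrame(
    {"impurity": rf.feature_importances_, "permutation": result.importances_mean},
    index=wine.feature_names,
).sort_values("permutation", ascending=False)
print(comparison)
```

When correlated features are present, permutation importance can look low for each of them, because the model can fall back on the others. The two columns are best read side by side rather than as competing answers.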

When choosing a feature selection method, consider the type of problem (regression or classification), the relationship between features and the target variable (linear or non-linear), the presence of interactions between features, and the computational complexity of the method. Some methods may work better for certain datasets and problem types, so it can be helpful to experiment with multiple methods and evaluate their performance on your specific task.
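
A simple way to run such an experiment is to put each selector in a pipeline and compare cross-validated accuracy. In the sketch below the selector is fit inside every fold, which avoids leaking information from the validation data. The logistic regression classifier and `k=5` are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# "passthrough" skips the selection step and serves as the baseline
selectors = {
    "all features": "passthrough",
    "SelectKBest": SelectKBest(f_classif, k=5),
    "RFE": RFE(SVC(kernel="linear"), n_features_to_select=5),
}

cv_scores = {}
for name, selector in selectors.items():
    # Scaling and selection are refit on the training part of each fold
    pipe = make_pipeline(StandardScaler(), selector, LogisticRegression(max_iter=1000))
    cv_scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {cv_scores[name]:.3f}")
```

If a five-feature subset scores about as well as the full set, the simpler model is usually preferable. If it scores clearly worse, the selection is discarding useful information.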