<a href="https://colab.research.google.com/github/Brendan-Ho/ESS569_AI_data/blob/main/Dimensionality_Reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
"""
High Dimensionality in DNA Sequencing Data

DNA sequencing data is often hard to model because of its high dimensionality. Many Machine Learning (ML) pipelines for DNA or microbial sequencing represent each sequence by its **k-mers**: short DNA subsequences of length k (e.g., 4-mers, 6-mers). Sliding a window of length k along a sequence yields overlapping substrings whose counts summarize sequence composition. The feature space grows quickly, however, because each possible k-mer is its own feature, giving 4^k features for the four-letter DNA alphabet.

For instance, there are 4^6 = 4,096 possible 6-mers, so a dataset built on 6-mer counts alone has up to 4,096 features per sample. k-mer count matrices also tend to be **sparse**: many k-mers never occur in a given sample, so many entries are zero, which further complicates the analysis.

High dimensionality can result in:
- Increased computational cost for model training and testing.
- Potential overfitting in ML models due to an excessive number of features relative to sample size.
- Difficulty visualizing and interpreting relationships among samples, since thousands of dimensions cannot be plotted directly and cluster structure is hard to see.

To manage these challenges, dimensionality reduction techniques such as PCA and t-SNE can be used. PCA projects the data onto a small number of orthogonal directions that retain as much variance as possible, which suits linear patterns. t-SNE captures non-linear relationships by preserving local neighborhoods and is well suited to visualizing clusters in two or three dimensions, although distances between well-separated clusters in a t-SNE plot should be interpreted with caution. Together, these techniques aid data interpretation and give insight into clustering patterns and, through PCA loadings, feature relevance for downstream ML tasks.
"""


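The k-mer featurization described in the text above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's actual pipeline: the function name `kmer_count_vector` and the toy sequence are made up for the example, and k-mers containing characters other than A, C, G, T are simply ignored here.

```python
from collections import Counter
from itertools import product


def kmer_count_vector(sequence, k):
    """Count overlapping k-mers; return (vocabulary, counts) over all 4**k k-mers."""
    # Every possible k-mer over the DNA alphabet, in lexicographic order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of length k along the sequence, one base at a time.
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # k-mers with characters outside A/C/G/T (e.g. N) never match the vocabulary.
    counts = [observed.get(kmer, 0) for kmer in vocabulary]
    return vocabulary, counts


# Toy sequence, invented for illustration only.
vocabulary, counts = kmer_count_vector("ACGTACGTGACG", k=3)
n_features = len(vocabulary)                    # 4**3 = 64 features
zero_fraction = counts.count(0) / n_features    # share of k-mers never observed
print(n_features, sum(counts), round(zero_fraction, 2))
```

Even for this short sequence most of the 64 possible 3-mers are never observed, which is the sparsity the text describes. Stacking one such vector per sample gives the samples-by-features matrix that PCA and t-SNE operate on.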

In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset (placeholder path; replace with your own file)
df = pd.read_csv('path_to_your_data.csv')
# Ensure data is cleaned and scaled (e.g., with sklearn.preprocessing.StandardScaler) before applying PCA/t-SNE

# Define your features and target
X = df.drop(columns=['target'])  # Replace 'target' with the actual label column if applicable
y = df['target']  # Optional: only if you have a label for coloring

### Dimensionality Reduction with PCA ###

# Initialize PCA with desired components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot PCA results
plt.figure(figsize=(10, 6))
plt.title("PCA Result (2D Projection)")
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette="viridis")  # Remove hue if no target
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

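# Explained Variance Chart
# Only the components fitted above are shown; fit PCA with more components for a fuller curve.
plt.figure(figsize=(8, 4))
plt.title("Explained Variance by PCA Components")
# Start the x-axis at 1 so each point lines up with its component count.
component_counts = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(component_counts, np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
plt.xticks(component_counts)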
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()

### Dimensionality Reduction with t-SNE ###

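# Initialize t-SNE (perplexity must be smaller than the number of samples).
# The iteration count is left at its default of 1000: the keyword was renamed
# from n_iter to max_iter in scikit-learn 1.5, so omitting it works on both.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)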
X_tsne = tsne.fit_transform(X)

# Plot t-SNE results
plt.figure(figsize=(10, 6))
plt.title("t-SNE Result (2D Projection)")
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y, palette="viridis")  # Remove hue if no target
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.show()
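
The cumulative explained-variance curve plotted above can also be used to choose how many components to keep. The sketch below reproduces the PCA computation with NumPy's SVD on synthetic data. The matrix sizes, the Poisson stand-in for k-mer counts, the 90% threshold, and all variable names are assumptions made for illustration. The ratios computed this way should agree with `pca.explained_variance_ratio_` from scikit-learn up to numerical precision, since both are derived from the singular values of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a k-mer count matrix: 50 samples x 64 features (hypothetical sizes).
X_demo = rng.poisson(lam=0.5, size=(50, 64)).astype(float)

# Center each feature; the principal directions are the right singular
# vectors of the centered matrix.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of total variance explained by each component, then its running sum.
explained_variance_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches 90%
# (the threshold is an arbitrary choice for this example).
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

# Project the samples onto the retained components.
X_reduced = X_centered @ Vt[:n_keep].T
print(X_reduced.shape, round(float(cumulative[n_keep - 1]), 3))
```

A reduced matrix like `X_reduced` is sometimes passed to t-SNE instead of the raw features to cut computation time, though whether that helps depends on the dataset.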
