# Node2Vec & Clustering

In this notebook, you will work with a graph dataset and apply **Node2Vec** for embedding, followed by clustering the embeddings using different algorithms. Your goal is to experiment with different parameters for Node2Vec, explore multiple clustering techniques, and visualize the resulting embeddings.

### Tasks:
1. **Select reasonable parameters for Node2Vec**: Fill in the missing `dim`, `num_walks`, and `walk_length` values so that the model produces reasonable embeddings.
2. **Experiment with DBSCAN**: Compare **DBSCAN** with **KMeans** and explore the differences.
3. **Visualize the Embeddings**: Use techniques like **UMAP** or **t-SNE** to visualize the embeddings and clusters.

Feel free to explore and experiment, and document your findings and choices in the code cells. We suggest spending only a few hours on this assignment, but take as much time as you need. We do not expect a deep analysis or impactful results, and we do not guarantee that the underlying data has a strong cluster structure. We are more interested in how you approach the problem and how you explain your decisions.

If you have trouble building your environment or run out of memory, please get in touch with us. The data for this task has been trimmed to run on a typical laptop and should not require more than 4 GB of RAM.

Good luck!

In [None]:
# Install required packages
# pyarrow (or fastparquet) is needed by pd.read_parquet
# !pip install pandas numpy scikit-learn pecanpy pyarrow

In [None]:
# Imports
import pandas as pd
import numpy as np
from pecanpy.graph import AdjlstGraph
from pecanpy import pecanpy as node2vec
from sklearn.cluster import KMeans

In [None]:
# Load the data
df = pd.read_parquet('omics_embedding_take_home_2025.parquet')
df.head()

In [None]:
# Extract unique node IDs and edges
node_ids = pd.unique(df[['source_id', 'target_id']].values.ravel())
edges_iter = df[['source_id', 'target_id']].itertuples(index=False, name=None)

In [None]:
# Initialize graph
graph = AdjlstGraph()

# Add nodes
for node_id in node_ids:
    graph.add_node(node_id)

# Add edges
for source, target in edges_iter:
    graph.add_edge(source, target, directed=False)

### 1. Select the Right Parameters for Node2Vec

Experiment with the following Node2Vec parameters to improve embedding quality. Use external resources as needed to understand their impact:

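- **dim**: dimensionality of each node's embedding vector
- **num_walks**: number of random walks started from each node
- **walk_length**: length of each random walk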
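
To build intuition for these three parameters, the sketch below generates uniform random walks on a toy graph in plain Python. It is only an illustration and not how pecanpy implements Node2Vec: it corresponds to the unbiased case `p = q = 1`, the names `toy_adjacency` and `generate_walks` are hypothetical, and `walk_length` is counted in nodes here (a library may count steps instead). `num_walks` walks start from every node; `dim` would then be the size of the vectors a skip-gram model learns from the resulting corpus.

```python
import random

# Hypothetical toy graph as an undirected adjacency list.
toy_adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def generate_walks(adjacency, num_walks, walk_length, seed=0):
    """Return num_walks uniform random walks per node, each with up to walk_length nodes."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


walks = generate_walks(toy_adjacency, num_walks=3, walk_length=5)
print(len(walks))  # num_walks * number of nodes
print(walks[0])    # one walk starting at node "a"
```

The corpus grows linearly with both `num_walks` and `walk_length`, so larger values give the skip-gram model more context per node at the cost of longer training.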

In [None]:
# Params for initializing the model (graph traversal)
init_params = {
    "p": 1.0,
    "q": 1.0,
    "extend": False,
    "directed": False,
    "verbose": True,
}

# Params for embedding
# Modify the ADJUST_PARAM values in embed_params based on your experimentation
embed_params = {
    "dim": ADJUST_PARAM,
    "num_walks": ADJUST_PARAM,
    "walk_length": ADJUST_PARAM,
    "window_size": 5,
    "epochs": 5,
    "verbose": True,
}

In [None]:
# Initialize the model
model = node2vec.PreComp(init_params)

# Load the graph
model = model.from_adjlst_graph(adjlst_graph=graph)

# Train embedding with embedding-related params
node_embeddings = model.embed(**embed_params)

In [None]:
# Extract patient IDs from both sides of edges
patients = (
    set(df.loc[df["source_type"] == "Patient", "source_id"])
    | set(df.loc[df["target_type"] == "Patient", "target_id"])
)

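# Filter embeddings for patient nodes only
# (assumes the rows of node_embeddings follow the order of node_ids,
# i.e. the order in which the nodes were added to the graph)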
node_ids_array = np.array(node_ids)
patient_mask = np.isin(node_ids_array, list(patients))
patient_embeddings = node_embeddings[patient_mask]
patient_ids = node_ids_array[patient_mask]

### 2. Experiment with DBSCAN

Experiment with **DBSCAN** and compare its results with the **KMeans** baseline. Explore how the two algorithms differ, for example in the number of clusters they find and in how they treat outliers.
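
As a starting point, here is a minimal sketch of one such difference on synthetic data. The toy points, the variable names, and the `eps` and `min_samples` values are all assumptions chosen for illustration; they are not tuned for the patient embeddings. KMeans assigns every point to one of a fixed number of clusters, while DBSCAN infers the number of clusters from density and labels isolated points as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic stand-in for embeddings: two tight blobs plus one far-away outlier.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
toy_points = np.vstack([blob_a, blob_b, outlier])

# KMeans needs the number of clusters up front and assigns every point to one.
toy_kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(toy_points)

# DBSCAN groups points by density; eps and min_samples are illustrative values
# for this toy data only.
toy_dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(toy_points)

# Count DBSCAN clusters (excluding the noise label -1) and noise points.
n_dbscan_clusters = len(set(toy_dbscan_labels) - {-1})
n_noise = int(np.sum(toy_dbscan_labels == -1))

print("KMeans labels used:", np.unique(toy_kmeans_labels))
print("DBSCAN clusters:", n_dbscan_clusters, "| noise points:", n_noise)
```

On real embeddings, `eps` usually depends on the scale of the distances and on the embedding dimension, so it is worth trying several values and reporting how the number of clusters and the share of noise points change.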

In [None]:
# Apply KMeans clustering
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
patient_labels = kmeans.fit_predict(patient_embeddings)

In [None]:
# Your code here

### 3. Visualize the Embeddings

Visualize the embeddings using techniques like **UMAP** or **t-SNE** to explore the clusters of patients.
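
One possible shape for such a plot, sketched on synthetic data with scikit-learn's `TSNE` and matplotlib. The toy arrays and the `perplexity` value are assumptions for illustration, and matplotlib is an extra dependency not listed in the install cell.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in: 60 points in 16 dimensions drawn from three shifted Gaussians.
rng = np.random.default_rng(0)
toy_embeddings = np.vstack(
    [rng.normal(loc=shift, scale=1.0, size=(20, 16)) for shift in (0.0, 6.0, 12.0)]
)
toy_labels = np.repeat([0, 1, 2], 20)

# Project to 2D; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=42)
toy_2d = tsne.fit_transform(toy_embeddings)

# Scatter plot colored by cluster label.
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(toy_2d[:, 0], toy_2d[:, 1], c=toy_labels, cmap="tab10", s=20)
ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.set_title("t-SNE projection colored by cluster label")
ax.legend(*scatter.legend_elements(), title="cluster")
```

To apply this to the notebook's data, `patient_embeddings` and `patient_labels` would take the place of the toy arrays. UMAP, from the separate `umap-learn` package, could be swapped in for the projection step. Keep in mind that distances between far-apart groups in a t-SNE plot are not directly interpretable.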

In [None]:
# Your code here