# LLM Embeddings and dimensionality reduction

In this notebook, we load a list of PhD topics and compute an LLM embedding for each of them. In such an embedding, each PhD topic is represented as a point in a high-dimensional space, e.g. as a vector of about 1000 numbers. To display these embeddings on screen, e.g. in a two-dimensional plot, we apply dimensionality reduction to them.

In [1]:
from openai import OpenAI
import pandas as pd
from sklearn.manifold import TSNE
import stackview
import numpy as np
import yaml


First, we load the CSV file and take a look at it.

In [2]:
df = pd.read_csv("phd_topics.csv")
df


Unnamed: 0,name,research_field,topic
0,Taylor Reed,Biodiversity Synthesis,Integrative Modeling of Multi‑Taxon Functional...
1,Riley Jain,Biodiversity Economics,Quantifying the Economic Valuation of Pollinat...
2,Taylor Adams,Biodiversity Conservation,Integrative Landscape Genomics for Enhancing A...
3,Devon Thomas,Biodiversity & People,Integrating Traditional Ecological Knowledge a...
4,Alex Lee,Biodiversity in the Anthropocene,"Integrating Genomic, Functional, and Landscape..."
...,...,...,...
245,Sam O'Hara,Biodiversity in the Anthropocene,Integrative Genomic‑Ecological Modeling of Spe...
246,Dana Kumar,Theory in Biodiversity Science,Scaling Laws and Emergent Dynamics of Multi‑Tr...
247,Reese Singh,Biodiversity Conservation,"Integrating Genomic, Landscape, and Socio‑econ..."
248,Casey Singh,Biodiversity in the Anthropocene,"Integrating Genomic, Functional Trait, and Lan..."


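Second, we load the embedding model [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct), a comparatively small, multilingual embedding model. As a quick check, we embed a short test string and show the first five numbers of the resulting vector.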

In [3]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
e = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large-instruct")
e.get_text_embedding("Hello world")[:5]


[0.005216754507273436,
 0.025283094495534897,
 0.007280194200575352,
 -0.044905297458171844,
 0.024866608902812004]

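Next, we wrap the model in a helper function `embed` and test it. The commented-out lines sketch how the same function could call the OpenAI embeddings API instead.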

In [4]:
def embed(text):
    return e.get_text_embedding(text)
    #from openai import OpenAI
    #client = OpenAI()
    #response = client.embeddings.create(
    #    input=text,
    #    model="text-embedding-ada-002"
    #)
    #return response.data[0].embedding

embed("Hello world")[:5]


[0.005216754507273436,
 0.025283094495534897,
 0.007280194200575352,
 -0.044905297458171844,
 0.024866608902812004]

The following code will apply the `embed` function to all topics in our table.

In [5]:
df["embedding"] = df["topic"].apply(embed)
df.head()


Unnamed: 0,name,research_field,topic,embedding
0,Taylor Reed,Biodiversity Synthesis,Integrative Modeling of Multi‑Taxon Functional...,"[-0.0071924785152077675, 0.0039014238864183426..."
1,Riley Jain,Biodiversity Economics,Quantifying the Economic Valuation of Pollinat...,"[-0.005492590367794037, 0.022543391212821007, ..."
2,Taylor Adams,Biodiversity Conservation,Integrative Landscape Genomics for Enhancing A...,"[-0.0024650206323713064, 0.019827308133244514,..."
3,Devon Thomas,Biodiversity & People,Integrating Traditional Ecological Knowledge a...,"[-0.00911727361381054, 0.0035786619409918785, ..."
4,Alex Lee,Biodiversity in the Anthropocene,"Integrating Genomic, Functional, and Landscape...","[-0.0033709630370140076, 0.018772806972265244,..."
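
With one vector per topic in the table, a common way to compare two topics is the cosine similarity of their vectors. The following is a minimal sketch of that idea, not part of the original notebook: the function name `cosine_similarity` and the three short toy vectors are hypothetical and stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real embeddings have far more dimensions.
topic_a = [0.2, 0.1, -0.4]
topic_b = [0.4, 0.2, -0.8]   # points in the same direction as topic_a
topic_c = [-0.1, 0.5, 0.0]   # points in a different direction

similar = cosine_similarity(topic_a, topic_b)
different = cosine_similarity(topic_a, topic_c)
print(similar > different)
```

Applied to the table, a call such as `cosine_similarity(df["embedding"][0], df["embedding"][1])` would compare the first two topics.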


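Next, we apply dimensionality reduction for visualization purposes, namely [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection). Both methods map each high-dimensional embedding to two coordinates, which we store as new columns in the table.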

In [6]:
# Convert embedding vectors to numpy array for t-SNE
embeddings = np.array(df['embedding'].tolist())

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_embeddings = tsne.fit_transform(embeddings)

df['TSNE0'] = tsne_embeddings[:, 0]
df['TSNE1'] = tsne_embeddings[:, 1]

#df


In [7]:
from umap import UMAP

# Convert embedding vectors to numpy array
embeddings = np.array(df['embedding'].tolist())

# Apply UMAP
umap = UMAP(n_components=2, random_state=42)
umap_embeddings = umap.fit_transform(embeddings)

df['UMAP0'] = umap_embeddings[:, 0]
df['UMAP1'] = umap_embeddings[:, 1]

# df

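
The `np.array(df['embedding'].tolist())` step stacks the per-topic lists into one matrix with one row per topic, and the reducer then returns two coordinates per row. The sketch below illustrates these shapes on hypothetical toy data. It uses a plain PCA via NumPy's SVD as a stand-in reducer, not t-SNE or UMAP, so it only illustrates the data flow and not the notebook's results. The names `toy_embeddings` and `pca_2d` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data: 6 "embeddings" with 8 dimensions each,
# stored as Python lists like the values in the `embedding` column.
rng = np.random.default_rng(0)
toy_embeddings = [rng.normal(size=8).tolist() for _ in range(6)]

# Stack the lists into one matrix of shape (n_samples, n_dimensions).
matrix = np.array(toy_embeddings)


def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    centered = data - data.mean(axis=0)
    # Rows of `components` are principal directions, ordered by explained variance.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:2].T


projected = pca_2d(matrix)

# Each column of the result can be stored as one table column,
# analogous to TSNE0/TSNE1 and UMAP0/UMAP1 above.
print(matrix.shape, "->", projected.shape)
```

Unlike t-SNE and UMAP, this linear projection is deterministic and needs no `random_state`, but it cannot capture nonlinear structure in the data.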


In [8]:
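# Mark every topic as selected (1); this column is presumably the one
# the interactive scatter plot below uses to track selected points.
df["selection"] = 1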


The resulting two dimensions can be visualized on screen.

In [9]:
stackview.scatterplot(df, column_x="UMAP0", column_y="UMAP1")


[Interactive scatter plot widget showing `UMAP0` versus `UMAP1`]

In [10]:
df["selection"].unique()


array([1])

Finally, we store the topics, together with their embeddings and the two-dimensional t-SNE and UMAP coordinates, in a YAML file.

In [11]:
import yaml

# Convert DataFrame to dictionary
data_dict = df.to_dict()

# Save as YAML file
with open('phd_topics.yml', 'w') as file:
    yaml.dump(data_dict, file)
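
A file written this way can be read back into a table. The following sketch shows such a round trip under stated assumptions: it uses a hypothetical miniature table `toy_df` and a temporary file instead of the real `phd_topics.yml`, and it assumes all values are plain Python types that `yaml.safe_load` can read.

```python
import os
import tempfile

import pandas as pd
import yaml

# Hypothetical miniature table standing in for the real DataFrame.
toy_df = pd.DataFrame({
    "topic": ["Topic A", "Topic B"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
    "UMAP0": [1.5, -0.5],
    "UMAP1": [0.25, 2.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_topics.yml")

    # Write the table as a nested mapping: column -> row index -> value.
    with open(path, "w") as file:
        yaml.dump(toy_df.to_dict(), file)

    # Read the nested mapping back.
    with open(path) as file:
        loaded = yaml.safe_load(file)

# Rebuild the table from the nested mapping.
restored_df = pd.DataFrame(loaded)
print(sorted(restored_df.columns))
```

Because `yaml.dump` sorts mapping keys by default, the columns come back in alphabetical order rather than in their original order.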
