# LLM Embeddings and dimensionality reduction

In this notebook, we load a list of PhD topics and compute an LLM embedding for each of them. In such an embedding, each PhD topic is represented as a point in a high-dimensional space, e.g. as a vector of about 1000 numbers. To display these embeddings on screen, e.g. in a two-dimensional plot, we apply dimensionality reduction to them.

In [1]:
from openai import OpenAI
import pandas as pd
from sklearn.manifold import TSNE
import stackview
import numpy as np
import yaml


First, we load the CSV file and take a look at it.

In [2]:
df = pd.read_csv("phd_topics.csv")
df


Unnamed: 0,name,research_field,topic
0,Taylor Reed,Chemicals in the Environment / Ecotoxicology,Microplastic-Associated Persistent Organic Pol...
1,Riley Jain,Water Resources and Environment / Aquatic Ecos...,Microbial Community Resilience to Agricultural...
2,Taylor Adams,Ecosystems of the Future / Conservation Biolog...,Resilience and Relocation: Social-Ecological P...
3,Devon Thomas,Ecosystems of the Future / Ecology of Agroecos...,Resilience and Adaptive Capacity: Integrating ...
4,Alex Lee,Chemicals in the Environment / Computational B...,Predicting Persistent Organic Pollutant Bioacc...
...,...,...,...
245,Jamie Singh,Chemicals in the Environment / Molecular Toxic...,Persistent Organic Pollutants and Epigenetic T...
246,Riley Garcia,Water Resources and Environment / Lake Research,Microbial Community Response to Nutrient Loadi...
247,Bailey Garcia,Environment and Society / Urban & Environmenta...,The Urban Metabolism of Green Infrastructure: ...
248,Sam O'Hara,Chemicals in the Environment / Computational B...,Predictive Modeling of Persistent Organic Poll...


Second, we load the embedding model [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct), a leading small embedding model, and print the first five numbers of the embedding of a short example text.

In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
e = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large-instruct")
e.get_text_embedding("Hello world")[:5]


[0.005216754507273436,
 0.025283094495534897,
 0.007280194200575352,
 -0.044905297458171844,
 0.024866608902812004]

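Next, we wrap the model in an `embed` function and test it. The wrapper makes it easy to swap the embedding backend later, e.g. for the OpenAI embeddings API kept in the comments.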

In [5]:
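def embed(text):
    """Return the embedding vector of the given text as a list of floats."""
    return e.get_text_embedding(text)

    # Alternative backend (unreachable, kept for reference): OpenAI embeddings API
    #from openai import OpenAI
    #client = OpenAI()
    #response = client.embeddings.create(
    #    input=text,
    #    model="text-embedding-ada-002"
    #)
    #return response.data[0].embedding

# Quick test: show the first five numbers of the embedding
embed("Hello world")[:5]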


[0.005216754507273436,
 0.025283094495534897,
 0.007280194200575352,
 -0.044905297458171844,
 0.024866608902812004]
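Embedding vectors are typically compared by the angle between them. The following minimal sketch computes the cosine similarity of two vectors with NumPy. The helper name `cosine_similarity` and the short toy vectors are assumptions made for illustration; they stand in for real embeddings such as those returned by `embed`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (values are made up)
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]   # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]   # orthogonal to v1

similar = cosine_similarity(v1, v2)
dissimilar = cosine_similarity(v1, v3)
```

With real embeddings, one could call `cosine_similarity(embed(text_a), embed(text_b))` to estimate how closely two topics are related.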

The following code will apply the `embed` function to all topics in our table.

In [6]:
df["embedding"] = df["topic"].apply(embed)
df.head()


Unnamed: 0,name,research_field,topic,embedding
0,Taylor Reed,Chemicals in the Environment / Ecotoxicology,Microplastic-Associated Persistent Organic Pol...,"[-0.010754222050309181, -0.00575306685641408, ..."
1,Riley Jain,Water Resources and Environment / Aquatic Ecos...,Microbial Community Resilience to Agricultural...,"[0.00467681884765625, 0.0035836827009916306, -..."
2,Taylor Adams,Ecosystems of the Future / Conservation Biolog...,Resilience and Relocation: Social-Ecological P...,"[0.0015734180342406034, 0.01460769772529602, -..."
3,Devon Thomas,Ecosystems of the Future / Ecology of Agroecos...,Resilience and Adaptive Capacity: Integrating ...,"[-0.0008501994889229536, 0.01444125734269619, ..."
4,Alex Lee,Chemicals in the Environment / Computational B...,Predicting Persistent Organic Pollutant Bioacc...,"[-0.0032572217751294374, 0.002003519097343087,..."


Next, we apply dimensionality reduction for visualization purposes, namely [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection).

In [7]:
# Convert embedding vectors to numpy array for t-SNE
embeddings = np.array(df['embedding'].tolist())

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_embeddings = tsne.fit_transform(embeddings)

df['TSNE0'] = tsne_embeddings[:, 0]
df['TSNE1'] = tsne_embeddings[:, 1]

#df
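As a point of comparison for the nonlinear methods used in this notebook, a linear projection can be a useful sanity check. The sketch below is not part of the original workflow: it swaps in a plain PCA, implemented with NumPy's SVD, and runs it on random toy data. The function name `pca_2d` and the toy array are assumptions for illustration; in the notebook, the `embeddings` array would take the place of `toy_embeddings`.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered by decreasing explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Toy data standing in for the embedding matrix (50 samples, 16 dimensions)
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(50, 16))

projected = pca_2d(toy_embeddings)  # shape: (50, 2)
```

Unlike t-SNE and UMAP, this projection is deterministic and linear, so it may miss cluster structure that the nonlinear methods reveal.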


In [8]:
from umap import UMAP

# Convert embedding vectors to numpy array
embeddings = np.array(df['embedding'].tolist())

# Apply UMAP
umap = UMAP(n_components=2, random_state=42)
umap_embeddings = umap.fit_transform(embeddings)

df['UMAP0'] = umap_embeddings[:, 0]
df['UMAP1'] = umap_embeddings[:, 1]

# df



In [9]:
df["selection"] = 1


The resulting two dimensions can be visualized on screen.

In [10]:
stackview.scatterplot(df, column_x="UMAP0", column_y="UMAP1")


[Interactive stackview scatter plot of UMAP0 versus UMAP1]

In [11]:
df["selection"].unique()


array([1])

Finally, we store the topics, together with their embeddings and the two-dimensional t-SNE and UMAP coordinates, in a YAML file.

In [12]:
import yaml

# Convert DataFrame to dictionary
data_dict = df.to_dict()

# Save as YAML file
with open('phd_topics.yml', 'w') as file:
    yaml.dump(data_dict, file)
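Note that `df.to_dict()` without arguments nests the data by column, so the YAML file above contains one mapping per column, keyed by row index. If one mapping per topic is easier to read, `orient="records"` might be an option. The sketch below shows both layouts on a made-up toy table; the table contents are assumptions for illustration and not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the notebook's table (values are made up)
toy = pd.DataFrame({
    "topic": ["Topic A", "Topic B"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
    "UMAP0": [1.5, -0.5],
})

# Default layout: {column: {row_index: value}}
by_column = toy.to_dict()

# Row-wise layout: [{column: value, ...}, ...], one dictionary per row
by_row = toy.to_dict(orient="records")
```

Either dictionary can be passed to `yaml.dump` in the same way as `data_dict` above.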
