# LLM Embeddings and dimensionality reduction

In this notebook we load a list of PhD topics and compute an LLM embedding for each of them. In such an embedding, each PhD topic is represented as a point in a high-dimensional space, e.g. as a vector of about 1000 numbers. To display these embeddings on screen, e.g. in a two-dimensional plot, we apply dimensionality reduction to them.

In [1]:
from openai import OpenAI
import pandas as pd
from sklearn.manifold import TSNE
import stackview
import numpy as np
import yaml


First, we load the CSV file and take a look at it.

In [3]:
df = pd.read_csv("phd_topics.csv")
df


Unnamed: 0,name,research_field,topic
0,Taylor Reed,FIZ-KA - Leibniz-Institut für Informationsinfr...,"Digital Archives, Embodied Knowledge, and the ..."
1,Riley Jain,HKI - Leibniz-Institut für Naturstoff-Forschun...,Microbial Secondary Metabolites and Narrative:...
2,Taylor Adams,IÖR - Leibniz-Institut für ökologische Raument...,Spatial Imaginaries of Ecological Transition: ...
3,Devon Thomas,"IWM - Leibniz-Institut für Wissensmedien, Tübi...",Algorithmic Storytelling and the Evolution of ...
4,Alex Lee,MfN - Museum für Naturkunde - Leibniz-Institut...,The Literary Ecology of Scientific Illustratio...
...,...,...,...
245,Jamie Campbell,FIZ-KA - Leibniz-Institut für Informationsinfr...,Algorithmic Aesthetics and the Computational H...
246,Skyler Reed,WIAS - Weierstraß-Institut für Angewandte Anal...,Algorithmic Poetics: Computational Modeling an...
247,Casey Lee,"AIP - Leibniz Institut für Astrophysik, Potsdam",Cosmic Narratives: Poetics and Reception of As...
248,Robin Flores,IGZ - Leibniz-Institut für Gemüse- und Zierpfl...,The Rhetoric of Resilience: Horticultural Inno...


Second, we load the embedding model [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct), a comparatively small multilingual embedding model that can be run locally.

In [4]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
e = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large-instruct")
e.get_text_embedding("Hello world")[:5]




[0.0052167088724672794,
 0.02528308518230915,
 0.007280258461833,
 -0.04490530118346214,
 0.024866655468940735]

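Next, we wrap the model in an `embed` function and test it. The commented-out lines show an alternative that would use the OpenAI API instead of the local model.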

In [5]:
def embed(text):
    return e.get_text_embedding(text)
    #from openai import OpenAI
    #client = OpenAI()
    #response = client.embeddings.create(
    #    input=text,
    #    model="text-embedding-ada-002"
    #)
    #return response.data[0].embedding

embed("Hello world")[:5]


[0.0052167088724672794,
 0.02528308518230915,
 0.007280258461833,
 -0.04490530118346214,
 0.024866655468940735]
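
A common way to compare two such vectors is cosine similarity, which is close to 1 when two vectors point in the same direction. The following is a minimal sketch using only the standard library; the helper `cosine_similarity` and the toy vectors are illustrative assumptions and not part of the original notebook.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
v_a = [1.0, 2.0, 0.0]
v_b = [2.0, 4.0, 0.0]  # same direction as v_a
v_c = [0.0, 0.0, 3.0]  # orthogonal to v_a

sim_ab = cosine_similarity(v_a, v_b)  # close to 1
sim_ac = cosine_similarity(v_a, v_c)  # exactly 0
```

In the notebook, the same helper could presumably be called on real embeddings, e.g. `cosine_similarity(embed("Hello world"), embed("Hello there"))`.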

The following code will apply the `embed` function to all topics in our table.

In [7]:
df["embedding"] = df["topic"].apply(embed)
df.head()


Unnamed: 0,name,research_field,topic,embedding
0,Taylor Reed,FIZ-KA - Leibniz-Institut für Informationsinfr...,"Digital Archives, Embodied Knowledge, and the ...","[0.019196026027202606, 0.010897933505475521, -..."
1,Riley Jain,HKI - Leibniz-Institut für Naturstoff-Forschun...,Microbial Secondary Metabolites and Narrative:...,"[0.016622617840766907, -0.009818249382078648, ..."
2,Taylor Adams,IÖR - Leibniz-Institut für ökologische Raument...,Spatial Imaginaries of Ecological Transition: ...,"[-0.016480615362524986, 0.014093692414462566, ..."
3,Devon Thomas,"IWM - Leibniz-Institut für Wissensmedien, Tübi...",Algorithmic Storytelling and the Evolution of ...,"[0.008821303024888039, 0.005257884040474892, -..."
4,Alex Lee,MfN - Museum für Naturkunde - Leibniz-Institut...,The Literary Ecology of Scientific Illustratio...,"[-0.02530881017446518, 0.004650192800909281, -..."


Next, we apply dimensionality reduction for visualization purposes, namely [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) and [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection).

In [8]:
# Convert embedding vectors to numpy array for t-SNE
embeddings = np.array(df['embedding'].tolist())

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_embeddings = tsne.fit_transform(embeddings)

df['TSNE0'] = tsne_embeddings[:, 0]
df['TSNE1'] = tsne_embeddings[:, 1]

#df
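
To try the t-SNE step without loading the embedding model, a small sketch on synthetic data may help. The random array `fake_embeddings` is an assumption standing in for real embeddings. Note that scikit-learn's `perplexity` parameter (default 30) has to be smaller than the number of samples, so it is lowered here for the tiny example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for real embeddings: 20 samples with 16 dimensions each.
rng = np.random.default_rng(42)
fake_embeddings = rng.normal(size=(20, 16))

# perplexity must be smaller than the number of samples (20 here).
tsne_small = TSNE(n_components=2, perplexity=5, random_state=42)
points_2d = tsne_small.fit_transform(fake_embeddings)

# One two-dimensional point per input vector.
print(points_2d.shape)
```

With the 249 topics in the table above, the default perplexity is below the sample count, which is presumably why the notebook does not need to set it.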


In [9]:
from umap import UMAP

# Convert embedding vectors to numpy array
embeddings = np.array(df['embedding'].tolist())

# Apply UMAP
umap = UMAP(n_components=2, random_state=42)
umap_embeddings = umap.fit_transform(embeddings)

df['UMAP0'] = umap_embeddings[:, 0]
df['UMAP1'] = umap_embeddings[:, 1]

# df



In [10]:
# Add a "selection" column and set it to 1 for every row
df["selection"] = 1


The resulting two dimensions can be visualized on screen.

In [11]:
stackview.scatterplot(df, column_x="UMAP0", column_y="UMAP1")


HBox(children=(VBox(children=(VBox(children=(HBox(children=(Label(value='Axes '), Dropdown(index=6, layout=Lay…

In [12]:
df["selection"].unique()


array([1])

Finally, we store the topics, together with the embeddings and the two-dimensional t-SNE and UMAP coordinates, in a YAML file.

In [13]:
import yaml

# Convert DataFrame to dictionary
data_dict = df.to_dict()

# Save as YAML file
with open('phd_topics.yml', 'w') as file:
    yaml.dump(data_dict, file)
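
To check that such a file can be read back, the following sketch round-trips a tiny made-up table through YAML. The table `df_small` and the file name `example.yml` are illustrative assumptions. The sketch also assumes a recent pandas version in which `to_dict()` returns plain Python numbers, so that `yaml.safe_load` can parse the result.

```python
import os
import tempfile

import pandas as pd
import yaml

# Tiny made-up table mimicking the structure of the saved DataFrame.
df_small = pd.DataFrame({
    "topic": ["Topic A", "Topic B"],
    "embedding": [[0.5, 0.25], [0.125, 1.0]],
    "UMAP0": [1.5, -2.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.yml")

    # Save as YAML: a mapping of column name -> {row index: value}.
    with open(path, "w") as file:
        yaml.dump(df_small.to_dict(), file)

    # Load the nested dictionary back.
    with open(path) as file:
        loaded = yaml.safe_load(file)

# Rebuild a DataFrame from the nested dictionary.
restored = pd.DataFrame(loaded)
```

By default `yaml.dump` sorts keys, so the columns of `restored` may appear in a different order than in the original table.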
