# Visualizing language embeddings

In this notebook we visualize UMAP representations of language embeddings together with word clouds generated from the underlying text data. The notebook works as a game: draw an outline around a cluster of points and the word cloud updates to show the PhD topics inside your selection. From that word cloud, try to guess at which institute the researchers might work. To check your guess, click the `Show solution` button at the bottom.
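To make the mechanics of the game concrete, here is a minimal sketch of the two steps involved: deciding which points fall inside a drawn outline, and counting the words of the selected topics. It is not how `stackview` implements its widget. The helper names `point_in_polygon` and `word_frequencies`, the toy coordinates, and the toy topics are all assumptions made for illustration.

```python
from collections import Counter
import re


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def word_frequencies(texts):
    """Count lower-cased alphabetic words across all texts."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(words)


# Hypothetical toy data: (x, y, topic) triples standing in for UMAP0, UMAP1, topic
points = [
    (0.5, 0.5, "Microplastic pollutants in rivers"),
    (0.6, 0.4, "Pollutants and microbial resilience"),
    (5.0, 5.0, "Social-ecological resilience"),
]

# Hypothetical outline drawn by the user, given as polygon vertices
outline = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

selected_topics = [t for x, y, t in points if point_in_polygon(x, y, outline)]
frequencies = word_frequencies(selected_topics)
print(frequencies.most_common(3))
```

A word cloud then scales each word by its count, so words shared by many selected topics appear largest.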

In [1]:
import pandas as pd
# import umap  # not needed here: UMAP0/UMAP1 are already stored in the data file
import stackview
import numpy as np
import yaml

In [2]:
with open("phd_topics.yml", 'r') as file:
    data_dict = yaml.safe_load(file)
df = pd.DataFrame(data_dict)
df.head()

| | TSNE0 | TSNE1 | UMAP0 | UMAP1 | embedding | name | research_field | selection | topic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -24.977535 | -0.853152 | 0.522525 | 8.064949 | [-0.010754222050309181, -0.00575306685641408, ... | Taylor Reed | Chemicals in the Environment / Ecotoxicology | 1 | Microplastic-Associated Persistent Organic Pol... |
| 1 | -10.760478 | 8.616426 | 1.470769 | 3.908189 | [0.00467681884765625, 0.0035836827009916306, -... | Riley Jain | Water Resources and Environment / Aquatic Ecos... | 1 | Microbial Community Resilience to Agricultural... |
| 2 | 14.342352 | -7.349468 | 6.670248 | -0.525505 | [0.0015734180342406034, 0.01460769772529602, -... | Taylor Adams | Ecosystems of the Future / Conservation Biolog... | 1 | Resilience and Relocation: Social-Ecological P... |
| 3 | 4.714762 | -1.994494 | 2.44301 | 3.408841 | [-0.0008501994889229536, 0.01444125734269619, ... | Devon Thomas | Ecosystems of the Future / Ecology of Agroecos... | 1 | Resilience and Adaptive Capacity: Integrating ... |
| 4 | -14.277933 | -3.898955 | 4.178447 | 6.685272 | [-0.0032572217751294374, 0.002003519097343087,... | Alex Lee | Chemicals in the Environment / Computational B... | 1 | Predicting Persistent Organic Pollutant Bioacc... |


In [3]:
import ipywidgets as widgets

# original wordcloud widget
wc = stackview.wordcloudplot(df, column_text="topic", column_x="UMAP0", column_y="UMAP1")

# new controls
show_btn = widgets.Button(description="Show solution")
reset_btn = widgets.Button(description="Reset")
label = widgets.Textarea(
    value="",
    placeholder="",
    description="",
    disabled=True,  # read-only
    layout=widgets.Layout(width='70%', height='150px')
)

def on_show(b):
    # 1. Filter rows where selection == 1
    filtered = df[df['selection'] == 1]
    
    # 2. Collect the unique research fields (NaNs dropped); np.unique returns them sorted
    label.value = "\n".join(np.unique(filtered['research_field'].dropna()))

def on_reset(b):
    label.value = ""

wc.observe(on_reset)

show_btn.on_click(on_show)
reset_btn.on_click(on_reset)

controls = widgets.HBox([show_btn, reset_btn, label])
ui = widgets.VBox([wc, controls])

# display the combined UI (putting ui as last expression will display it in a notebook)
ui

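The `on_show` handler above boils down to a filter followed by a unique-and-sort step. The sketch below reproduces that logic on a small table so it can be checked without the widget. The table `toy` and its values are hypothetical. Only the column names `research_field` and `selection` come from the notebook's DataFrame.

```python
import pandas as pd

# Hypothetical toy table reusing the notebook's column names
toy = pd.DataFrame({
    "research_field": [
        "Ecotoxicology",
        "Aquatic Ecosystems",
        None,                    # missing value, dropped below
        "Ecotoxicology",         # duplicate, collapsed below
        "Conservation Biology",  # not selected
    ],
    "selection": [1, 1, 1, 1, 0],
})

# Keep only the rows marked as selected
selected = toy[toy["selection"] == 1]

# Unique research fields without missing values, sorted alphabetically
fields = sorted(selected["research_field"].dropna().unique())

# One field per line, as shown in the text area
solution_text = "\n".join(fields)
print(solution_text)
```

Sorting and de-duplicating keeps the solution text short and stable even when a selection contains many researchers from the same field.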