# Tutorial 1: Data exploration and visualization using LLM embeddings
*This notebook is part of the [LLMCode library](https://github.com/PerttuHamalainen/LLMCode).*

*A note on data privacy: The user experience of this notebook is better on Google Colab, but if you are processing data that cannot be sent to Google and OpenAI servers, you should run this notebook locally using the "Aalto" LLM API.*

**Learning goals**

*Embedding models* are a subcategory of modern Large Language Models (LLMs).

In this notebook, you'll learn to utilize such models to explore and visualize your data in useful ways.

**How to use this Colab notebook?**
* Select the LLM API and model to use below. The default values are recommended, but some of the examples may produce better quality results using the more expensive "text-embedding-3-large" model. For details about the models, see [OpenAI documentation](https://platform.openai.com/docs/guides/embeddings).
* Select "Run all" from the Runtime menu above.
* Enter your API key below when prompted. This will be provided to you at the workshop. You can also create your own OpenAI account at https://platform.openai.com/signup. The initial free quota you get with the account should be enough for the exercises of this notebook. To create an API key, follow [OpenAI's instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)
* Proceed from top to bottom, following the instructions.

**New to Colab notebooks?**

Colab notebooks are browser-based learning environments consisting of *cells* that include either text or code. The code is executed in a Google virtual machine instead of your own computer. You can run code cell-by-cell (click the "play" symbol of each code cell), and selecting "Run all" as instructed above is usually the first step to verify that everything works. For more info, see Google's [Intro video](https://www.youtube.com/watch?v=inN8seMm7UI) and [curated example notebooks](https://colab.google/notebooks/)


In [None]:
#Initial setup code. If you opened this notebook in Colab, this code is hidden
#by default to avoid unnecessary user interface clutter

#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
llm_API="OpenAI" # @param ["OpenAI", "Aalto"]
embedding_model="text-embedding-ada-002" #@param ["text-embedding-ada-002","text-embedding-3-small", "text-embedding-3-large"]


#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

#Import packages
import pandas as pd
import numpy as np
from IPython.display import HTML, clear_output
import getpass
import os
import html
import plotly.express as px
import textwrap
import openpyxl
import re

#determine if we are running in Colab
import sys
original_dir = os.getcwd()
RunningInCOLAB = 'google.colab' in sys.modules
if RunningInCOLAB:
  import plotly.io as pio
  pio.renderers.default = "colab"
  if not os.path.exists("LLMCode"):
    if not os.getcwd().endswith("LLMCode"):
      print("Cloning the LLMCode repository...")
      #until the repo is public, we download this working copy instead of cloning
      #(shared as: anyone with the link can view)
      #!wget "https://drive.google.com/uc?export=download&id=1Td6ukrRGK9sUjlH1c6VTAYp2t_E1UNuQ" -O LLMCode.zip
      #!mkdir LLMCode
      #!unzip -q LLMCode.zip -d LLMCode
      !git clone https://github.com/PerttuHamalainen/LLMCode.git
  if not os.getcwd().endswith("LLMCode"):
    os.chdir("LLMCode")
    print("Installing dependencies...")
    !pip install -r requirements_notebooks.txt
import llmcode
os.chdir(original_dir)

#Jupyter is already running an asyncio event loop => need this hack for async OpenAI API calling
import nest_asyncio
nest_asyncio.apply()

#Prompt the user for an API key if not provided via a system variable
clear_output()
if llm_API=="OpenAI":
    if os.environ.get("OPENAI_API_KEY") is None:
        print("Please input an OpenAI API key")
        api_key = getpass.getpass()
        os.environ["OPENAI_API_KEY"] = api_key
elif llm_API=="Aalto":
    if os.environ.get("AALTO_OPENAI_API_KEY") is None:
        print("Please input an Aalto OpenAI API key")
        api_key = getpass.getpass()
        os.environ["AALTO_OPENAI_API_KEY"] = api_key
else:
    print(f"Invalid API type: {llm_API}")

#Initialize the LLMCode library
llmcode.init(API=llm_API)
llmcode.set_cache_directory("data_exploration_cache")

Please input an OpenAI API key


# What are embeddings?
Embedding models convert text to vectors (arrays of floating point numbers). In casual terms, people refer to such embedding vectors simply as "embeddings".

3D vectors such as [x,y,z] can be interpreted as positions in a 3D space. Embedding vectors typically have far more than 3 numbers (e.g., 1536 for the "text-embedding-ada-002" model), so they correspond to positions in a high-dimensional "meaning space" with the following properties:

* Semantically similar texts produce vectors that are close to each other
* Directions in the "meaning space" correspond to particular relations or meanings.

Typically, closeness is measured using cosine similarity, which for two vectors $\mathbf{a}$ and $\mathbf{b}$ is computed as:

$$\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$
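
As a minimal sketch, the formula can be computed with NumPy as follows. The function name `cosine_similarity` and the toy 2D vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal, -1 = opposite)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, made up for illustration
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Note that the result depends only on the directions of the vectors, not on their lengths, which is why `[1, 2]` and `[2, 4]` count as maximally similar.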

Fascinatingly, embedding vectors allow describing semantic relationships using mathematics, e.g.,

$\text{cook} - \text{food} \approx \text{programmer} - \text{software}$,

which can be read as "a cook is to food like a programmer is to software". This is one explanation for the power of LLMs; they learn to produce rich internal embeddings/representations of text, which allows them to solve complex tasks with basic mathematical operations.
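
The arithmetic can be illustrated with hand-made toy vectors. This is only a sketch: the 2D vectors below are invented for illustration and are not produced by any embedding model, where such relations hold only approximately.

```python
import numpy as np

# Hand-made 2D "embeddings" for illustration only (not produced by a real model):
# dimension 0 encodes the domain, dimension 1 encodes created (0) vs. creator (1).
toy = {
    "food": np.array([1.0, 0.0]),
    "cook": np.array([1.0, 1.0]),
    "software": np.array([-1.0, 0.0]),
    "programmer": np.array([-1.0, 1.0]),
}

# The "creator" direction is the same in both domains
creator_direction = toy["cook"] - toy["food"]
print(np.allclose(creator_direction, toy["programmer"] - toy["software"]))

# Solve "cook is to food as ? is to software" by vector arithmetic,
# then pick the closest toy word by Euclidean distance
query = toy["software"] + creator_direction
answer = min(toy, key=lambda word: np.linalg.norm(toy[word] - query))
print(answer)
```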



# Visualizing semantic relationships

The code below visualizes the example above, by calculating the embeddings of the given texts, projecting them into 2D using Principal Component Analysis (PCA), and scatterplotting the result.

**Exercise:**
* Hover your mouse over the plotted points. You should see that the y-axis roughly corresponds to the creator-created meaning.
* Try changing the "texts" variable to explore other relationships. Remember to separate concepts with semicolons. After changing the texts, run the visualization code again by clicking on the "Play" button (the triangle). If you don't see the button, hover your mouse over the "texts" input field below.

You can use both single words and longer sentences. A classic example from the machine learning literature is the gendered relationship between the words king, man, queen, and woman.

Note that if you add many texts with different relationships, the plot can become difficult to read, as the relationship directions in the high-dimensional embedding space become more difficult to map consistently into 2D.
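
The internals of `llmcode.reduce_embedding_dimensionality` are not shown in this notebook. As a rough sketch of what a PCA projection to 2D does in general, the NumPy code below centers the data and projects it onto the two directions of largest variance. The function `pca_project` and the random `fake_embeddings` are hypothetical stand-ins, not part of LLMCode.

```python
import numpy as np

def pca_project(vectors, num_dimensions=2):
    """Project row vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:num_dimensions].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 16))  # 4 random stand-ins for text embeddings
points_2d = pca_project(fake_embeddings)
print(points_2d.shape)  # (4, 2)
```

Each row of `points_2d` can then be used as the x and y coordinates of one text in a scatter plot.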

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
texts = 'food; cook; software; programmer' # @param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

#Convert the input into a list of strings
text_list=texts.split(";")
cleaned_text_list=[]
for text in text_list:
  text=text.strip()
  if len(text)>0:
    cleaned_text_list.append(text)

#calculate the embeddings
embeddings=llmcode.embed(cleaned_text_list,model=embedding_model)

#since the embeddings are high-dimensional, reduce dimensionality to 2 for plotting
#this will result in a 4x2 matrix, the columns of which we can use as x and y coordinates
embeddings_2d=llmcode.reduce_embedding_dimensionality(embeddings,
                                        num_dimensions=2,
                                        method="PCA")

#visualize as a scatter plot
#we use the Plotly visualization library which expects input as a Pandas Dataframe
df=pd.DataFrame()
df["texts"]=cleaned_text_list
df["x"]=embeddings_2d[:,0] #take the 1st column as x (Python uses 0-based indexing)
df["y"]=embeddings_2d[:,1] #take the 2nd column as y
fig=px.scatter(df,
             width=400,
             height=400,
             x="x",
             y="y",
             hover_name="texts")
fig.show()

# Load data

We will be using the [Games As Art](https://osf.io/ryvt6/) open dataset based on a survey about how and why people experience video games as art. The LLMCode repository contains a copy of the data after slight preprocessing. We will focus on the short experience descriptions that participants provided.

**How to view the full experience descriptions?**

Click on the blue table icon to the right of the data preview. Note that this option is not available if you run this notebook on a local Jupyter server.

**How to use your own data?**

If you have your own data as a column of texts in an Excel or csv file, you can either 1) upload it to Colab using the file browser on the left and input its filename, or 2) input a download URL for your data. Remember to specify which data column to use!

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values
xlsx_or_csv_filename_or_URL="https://raw.githubusercontent.com/PerttuHamalainen/LLMCode/master/test_data/bopp_test.xlsx" #@param {type:"string"}
data_column="experience" #@param {type:"string"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing
if xlsx_or_csv_filename_or_URL[-5:]==".xlsx":
  df = pd.read_excel(xlsx_or_csv_filename_or_URL)
else:
  df = pd.read_csv(xlsx_or_csv_filename_or_URL)

df[data_column]=df[data_column].astype(str).apply(openpyxl.utils.escape.unescape) #fix an excel import issue
df=df[[data_column]]
print("Loaded dataset:")
display(df)

# Scatterplotting raw data

The first embedding-based technique one might try on a dataset is simply scatterplotting all data items to visually inspect possible structure. The results will depend on the dimensionality reduction method used. The code below allows you to try 3 different options for this.

### Exercise
- Try all the provided dimensionality reduction methods and inspect the results by hovering your mouse over the data points.
- Q1: Which dimensionality reduction method provides the clearest data clusters?
- Q2: Do the clusters make sense? If not, why might that be?

Answers are provided below the visualization.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

dimensionality_reduction_method="UMAP" #@param ["PCA", "UMAP", "MDS"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

def visualize_texts(texts,
                   dimensionality_reduction_method,
                   hover_texts=None,
                   split_into_sentences=False,
                   min_split_length=3,
                   categories=None,
                   hover_wrap_width=60):

    if isinstance(texts,pd.Series):
        texts=texts.astype(str).tolist()

    if split_into_sentences:
        result=[]
        for text in texts:
            subsequences=re.split(r'[.\n\r]',text)
            for s in subsequences:
                if len(s)>=min_split_length:
                    result.append(s)
        texts=result

    print("calculating embeddings")
    embeddings = llmcode.embed(texts,model=embedding_model)

    # reduce dimensionality to 2d and add the x,y coordinates to the dataset
    embeddings_2d = llmcode.reduce_embedding_dimensionality(embeddings,
                                                    num_dimensions=2,
                                                    method=dimensionality_reduction_method)
    df=pd.DataFrame()
    df["x"] = embeddings_2d[:, 0]  # take the 1st column as x (Python uses 0-based indexing)
    df["y"] = embeddings_2d[:, 1]  # take the 2nd column as y

    # Format the hover texts
    if hover_texts is None:
        hover_texts=["</br>".join(textwrap.wrap(text,width=hover_wrap_width)) for text in texts]
    else:
        hover_texts=["</br>".join(textwrap.wrap(text,width=hover_wrap_width)) for text in hover_texts]
    #if hover_contexts is not None:
    #    for count,context in enumerate(hover_contexts):
    #        hover_texts[count]+="</br></br>"+"Context: "+"</br>".join(textwrap.wrap(context,width=hover_wrap_width))

    df["hover"]=hover_texts  #make the texts readable

    # visualize
    return px.scatter(df,
               width=800,
               height=800,
               x="x",
               y="y",
               hover_name="hover")

fig=visualize_texts(df[data_column],dimensionality_reduction_method)
fig.show()



### Exercise answers
Click on the ">" to view. (If you are running this notebook in vanilla Jupyter instead of Colab, the answers will be visible by default)


A1: UMAP (Uniform Manifold Approximation and Projection) is a more recent method than PCA (Principal Component Analysis) and MDS (Multidimensional Scaling). It usually results in better-defined clusters, but is slower to compute.

A2: The clusters are not well defined. A plausible reason for this is that the embeddings try to capture every aspect of the texts (topics, sentiment, style...). This becomes confusing especially when the texts are long and might discuss multiple topics, which is the case here. Below, we provide one solution for this confusion, and the topic will be revisited in the next notebook (Relevant Data Extraction).

# Scatterplotting individual sentences

One way to make the embedding clusters more coherent is to split the texts into smaller pieces with more singular meanings.

Often, it can be more useful to simply visualize individual sentences, as demonstrated below.
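
As a small standalone sketch of the splitting idea, the example below breaks a text at whitespace that follows ".", "!", "?" or a newline, and drops very short fragments. The helper name `split_sentences` and the example text are made up for illustration; regex-based splitting is a simple heuristic and will mis-split abbreviations such as "e.g.".

```python
import re

# Split at whitespace that is preceded by '.', '!', '?' or a newline
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?\n])\s+')

def split_sentences(text, min_length=3):
    """Return the stripped sentences of text, skipping fragments shorter than min_length."""
    pieces = SENTENCE_BOUNDARY.split(text)
    return [p.strip() for p in pieces if len(p.strip()) >= min_length]

example = "The music was beautiful. I cried! Why? It felt like art."
sentences = split_sentences(example)
print(sentences)
# ['The music was beautiful.', 'I cried!', 'Why?', 'It felt like art.']
```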

**Exercise:**
* Inspect the results by hovering your mouse over the data points.
* Q1: Are the clusters more coherent than above?
* Q2: What remaining problems can you spot?

Answers are provided below the visualization.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

dimensionality_reduction_method="MDS" #@param ["PCA", "UMAP", "MDS"]

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

def split_into_sentences(df, text_column, new_sentence_column='sentence', min_sentence_length=3):
    """
    Splits the strings in the specified text_column of a Pandas DataFrame into sentences.
    Returns a new DataFrame with a new sentence_column where each row contains one sentence.
    Other columns are duplicated for each sentence originating from the same original row.
    Sentences shorter than min_sentence_length characters are ignored.

    Parameters:
    - df (pd.DataFrame): Original DataFrame.
    - text_column (str): Name of the column containing text to split into sentences.
    - new_sentence_column (str): Name of the new column to store individual sentences.
    - min_sentence_length (int): Minimum number of characters a sentence must have to be kept.

    Returns:
    - pd.DataFrame: Transformed DataFrame with sentences split into separate rows.
    """
    # Define a regular expression pattern for sentence splitting
    sentence_endings = re.compile(r'(?<=[.!?\n])\s+')

    # Function to split a single text into sentences
    def split_text(text):
        if pd.isna(text):
            return []
        # Split the text into sentences based on the regex pattern
        sentences = sentence_endings.split(text)
        # Strip whitespace and filter out sentences shorter than min_sentence_length
        return [s.strip() for s in sentences if len(s.strip()) >= min_sentence_length]

    # Apply the split_text function to the specified text column
    df_copy = df.copy()  # To avoid modifying the original DataFrame
    df_copy[new_sentence_column] = df_copy[text_column].apply(split_text)

    # Explode the list of sentences into separate rows
    exploded_df = df_copy.explode(new_sentence_column)

    # Optionally, drop rows where the new_sentence_column is NaN or empty
    exploded_df = exploded_df.dropna(subset=[new_sentence_column])
    exploded_df = exploded_df[exploded_df[new_sentence_column].str.len() >= min_sentence_length]

    # Reset the index of the resulting DataFrame
    exploded_df = exploded_df.reset_index(drop=True)
    return exploded_df

df_sentences=split_into_sentences(df,data_column)
hover_texts=df_sentences["sentence"].astype(str)+f"</br></br>{data_column}: "+df_sentences[data_column].astype(str)

fig=visualize_texts(df_sentences["sentence"],
                   dimensionality_reduction_method,
                   hover_texts=hover_texts)
fig.show()

### Exercise answers
Click on the ">" to view. (If you are running this notebook in vanilla Jupyter instead of Colab, the answers will be visible by default)


A1: The clusters should appear more coherent. For instance, there's a clear cluster of sentences about appreciating beauty and visual aesthetics.

A2: Depending on one's research questions, many of the clusters may not feel relevant. For instance, if one is interested in emotions or other subjective experiences, sentences describing game mechanics might not be relevant. Below, we will investigate how to filter and focus the visualization.

# Semantic search
Another useful way of browsing the data is embedding-based semantic search.

The code below sorts and visualizes the data as an interactive table (click on the table icon on the right), showing the closest matching data items first.
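
The ranking step can be sketched independently of any embedding model. In the example below, the function `rank_by_similarity` and the toy 2D vectors are illustrative assumptions; in the notebook, the vectors would come from `llmcode.embed`.

```python
import numpy as np

def rank_by_similarity(item_vectors, query_vector):
    """Return item indices sorted from most to least similar, plus the similarities."""
    items = np.asarray(item_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Normalize to unit length so that the dot product equals cosine similarity
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    similarities = items @ query
    order = np.argsort(-similarities)  # negate for descending order
    return order, similarities

# Toy 2D vectors standing in for embeddings of three data items
toy_items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 1.0]
order, similarities = rank_by_similarity(toy_items, toy_query)
print(order.tolist())  # [2, 0, 1]
```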

### Exercise:

Try at least two different search strings and compare the sentence-level results with the full experience results.

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

search_string = "I felt sad" #@param {type:"string"}
search_individual_sentences=True     #@param {type: "boolean"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing

def sort_by_similarity(df,column,search_string):
    #embed the data and normalize each row to unit length
    embeddings=llmcode.embed(df[column],model=embedding_model)
    embeddings=embeddings/np.linalg.norm(embeddings,axis=1,keepdims=True)
    #embed and normalize the search string
    search_embedding=llmcode.embed([search_string],model=embedding_model)
    search_embedding=search_embedding/np.linalg.norm(search_embedding)
    #for unit vectors, the dot product equals cosine similarity
    similarity=(embeddings @ search_embedding.T).ravel()
    result=df.copy()
    result["similarity"]=similarity
    return result.sort_values(by="similarity",ascending=False)

if search_individual_sentences:
    df_sorted=sort_by_similarity(df_sentences,"sentence",search_string)
else:
    df_sorted=sort_by_similarity(df,data_column,search_string)
display(df_sorted)


Loaded embeddings from cache, hash 7fe6b772bb1a41e6a6f6efd3d5995749


| | experience | sentence | similarity |
|---|---|---|---|
| 86 | The last level of Journey going up the mountai... | I cried. | 0.909673 |
| 30 | When I played this game, I was immediately awe... | It made me feel contemplative, happy, sad, and... | 0.879359 |
| 113 | The game tries not to reach a big target audie... | This felt like kind of wrong, and this emotion... | 0.874995 |
| 199 | I became heavily invested in the story and atm... | When the story ended, I was heartbroken at the... | 0.874315 |
| 533 | The game involved exploring an island through ... | It left me somehow both melancholy and hopeful. | 0.871728 |
| ... | ... | ... | ... |
| 100 | you start in an empty screen, everything is wh... | if you shoot, your gun fires paint splotches t... | 0.732978 |
| 600 | I experienced the game Night in the Woods as a... | The light platforming and exploration element ... | 0.728841 |
| 303 | I do not play many video games, but I do watch... | Starcraft 2 is an incredibly interesting game ... | 0.728387 |
| 185 | The game is largely dependent on the storyline... | Most decisions are presented with a fair amoun... | 0.727940 |


# Combining semantic search and visualization

Below, we demonstrate using the semantic search above to filter the visualized data. The colors of the scatterplot points indicate the similarity of the texts or sentences with the filter text. Similarity values below the threshold are set to 0 to help pinpoint the most similar data points.
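
The thresholding itself is a one-line operation. The sketch below uses made-up similarity values to show its effect.

```python
import numpy as np

similarities = np.array([0.92, 0.78, 0.85, 0.60])  # made-up similarity values
threshold = 0.81

# Keep values at or above the threshold, set the rest to 0
clipped = np.where(similarities < threshold, 0.0, similarities)
print(clipped.tolist())  # [0.92, 0.0, 0.85, 0.0]
```

Because the color scale then spans from 0 to the highest similarity, the few points above the threshold stand out clearly against the rest.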

In [None]:
#-------------------------------------------------------
#User-defined parameters. You can freely edit the values

filter_text="It was sad but good" #@param{type:"string"}
filter_threshold=0.81                #@param {type:"slider", min:0, max:0.9, step:0.01}
individual_sentences=False        #@param {type:"boolean"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing
def visualize_texts(texts,
                   dimensionality_reduction_method,
                   hover_texts=None,
                   colors=None,
                   min_split_length=3,
                   hover_wrap_width=60):

    if isinstance(texts,pd.Series):
        texts=texts.astype(str).tolist()

    embeddings = llmcode.embed(texts,model=embedding_model)

    # reduce dimensionality to 2d and add the x,y coordinates to the dataset
    embeddings_2d = llmcode.reduce_embedding_dimensionality(embeddings,
                                                    num_dimensions=2,
                                                    method=dimensionality_reduction_method)
    df=pd.DataFrame()
    df["x"] = embeddings_2d[:, 0]  # take the 1st column as x (Python uses 0-based indexing)
    df["y"] = embeddings_2d[:, 1]  # take the 2nd column as y

    # Format the hover texts
    if hover_texts is None:
        hover_texts=["</br>".join(textwrap.wrap(text,width=hover_wrap_width)) for text in texts]
    else:
        hover_texts=["</br>".join(textwrap.wrap(text,width=hover_wrap_width)) for text in hover_texts]

    df["hover"]=hover_texts  #make the texts readable


    # visualize
    if colors is not None:
        df["color"]=colors
    fig=px.scatter(df,
               width=800,
               height=800,
               x="x",
               y="y",
               hover_name="hover",
               color="color" if colors is not None else None)
    return fig

def compute_similarities(texts_A,texts_B):
    if isinstance(texts_A,str):
        texts_A=[texts_A]
    if isinstance(texts_B,str):
        texts_B=[texts_B]
    embeddings_A=llmcode.embed(texts_A,model=embedding_model)
    embeddings_A/=np.linalg.norm(embeddings_A,axis=1,keepdims=True)
    embeddings_B=llmcode.embed(texts_B,model=embedding_model)
    embeddings_B/=np.linalg.norm(embeddings_B,axis=1,keepdims=True)
    return embeddings_A @ embeddings_B.T

'''
def filter_by_embedding_similarity(df,column,filter_text,filter_threshold):
    similarity=compute_column_similarity(df,column,filter_text)
    result=df.copy()
    result["filter_similarity"]=similarity
    result=result[result["filter_similarity"] > filter_threshold]
    return result
'''

def similarity_to_color(similarity,threshold=None):
    color=similarity.copy()
    if threshold is not None:
        color[similarity<threshold]=0 #np.min(color)
    return color

if individual_sentences:
    similarities=compute_similarities(df_sentences["sentence"],filter_text)
    colors=similarity_to_color(similarities,filter_threshold)
    fig=visualize_texts(texts=df_sentences["sentence"],
                   dimensionality_reduction_method="UMAP",
                   hover_texts=hover_texts,
                   colors=colors)
else:
    similarities=compute_similarities(df[data_column],filter_text)
    colors=similarity_to_color(similarities,filter_threshold)
    fig=visualize_texts(texts=df[data_column],
                   dimensionality_reduction_method="UMAP",
                   colors=colors)
fig.layout.coloraxis.colorbar.title = 'Similarity'
fig.show()

Loaded embeddings from cache, hash d1ca6074b385a0bb8b7c481076aae904
Loaded embeddings from cache, hash d1ca6074b385a0bb8b7c481076aae904
Loaded dimensionality reduction results from cache, hash  deeaacc077755caf6b8122c75dc6cc91


# Semantic projection

One can obtain a so-called [semantic projection](https://www.nature.com/articles/s41562-022-01316-8) axis by subtracting the embedding of one text from another and normalizing the result. Other texts can then be projected on the axis, which can be useful, e.g., for classification.

Below, we demonstrate this by plotting texts on a "core affect" 2D map, where the axes correspond to valence (positivity or negativity of the emotion) and arousal. By default, the plotted texts are the experience descriptions loaded above.
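
The projection can be sketched with toy vectors as follows. The anchor and text vectors are hand-made 2D stand-ins for real embeddings, and this two-argument `semantic_axis` is a simplified illustration rather than the notebook's implementation.

```python
import numpy as np

def semantic_axis(negative_anchor, positive_anchor):
    """Unit vector pointing from the negative anchor toward the positive anchor."""
    direction = (np.asarray(positive_anchor, dtype=float)
                 - np.asarray(negative_anchor, dtype=float))
    return direction / np.linalg.norm(direction)

# Hand-made 2D stand-ins for the embeddings of the two anchor texts
displeasure = [-1.0, 0.2]
pleasure = [1.0, 0.2]
axis = semantic_axis(displeasure, pleasure)

# Hand-made stand-ins for the embeddings of two texts to be placed on the axis
texts_2d = np.array([[0.8, 0.5],
                     [-0.6, 0.1]])
projections = texts_2d @ axis  # one scalar coordinate per text
print(projections.tolist())
```

A larger projection value means the text lies further toward the positive anchor ("pleasure" in this toy setup).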

**Optional exercises:**

- Plot your own texts or words by defining them via the "custom_texts" parameter. Separate the texts with ";". For instance, try emotion words such as "happy; sad; elated; dismayed".
- Change one axis to be gender and the other to be age, then investigate the model's ageism and gender biases using both clearly gendered words such as "mr" and "ms" and non-gendered words such as "doctor", "nurse", "boss", "assistant".


In [None]:
#@title Define the plot axes, with "---" separating the anchor texts
axis1_name="displeasure---pleasure" #@param{type:"string"}
axis2_name="tired,calm---awake,alert" #@param{type:"string"}
custom_texts="" #@param{type:"string"}

#-------------------------------------------------------------------
#Implementation. Only edit this part if you know what you are doing
#define the embedded texts
if len(custom_texts.strip())==0:
  texts=df[data_column]
else:
  texts=[x.strip() for x in custom_texts.split(";")]

#calculate the axes
def semantic_axis(axis_name):
    anchors=axis_name.split("---")
    anchor_embeddings=llmcode.embed(anchors,model=embedding_model)
    result=anchor_embeddings[1,:]-anchor_embeddings[0,:]
    result=result / np.linalg.norm(result)
    return result
axis1=semantic_axis(axis1_name)
axis2=semantic_axis(axis2_name)

#calculate the embeddings
embeddings=llmcode.embed(texts,model=embedding_model)

#project the embeddings on the axes ("@" denotes matrix multiplication)
x=embeddings @ axis1
y=embeddings @ axis2

#visualize as a scatter plot
df_viz=pd.DataFrame()
hover_wrap_width=60
df_viz["texts"]=["</br>".join(textwrap.wrap(text,width=hover_wrap_width)) for text in texts]
df_viz[axis1_name]=x
df_viz[axis2_name]=y
fig=px.scatter(df_viz,
             width=800,
             height=800,
             x=axis1_name,
             y=axis2_name,
             hover_name="texts",
          )
fig.show()

Loaded embeddings from cache, hash d1ca6074b385a0bb8b7c481076aae904
