# Create embeddings using locally hosted embedding models

First, make sure that a locally hosted LLM server is running. This example uses [LMStudio](https://lmstudio.ai/), which exposes an OpenAI-compatible API and can therefore be used with the [OpenAI Python client](https://github.com/openai/openai-python).

In [1]:
from openai import OpenAI
import numpy as np
from numpy.linalg import norm

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
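
Before requesting embeddings, it can help to confirm that the server is reachable and to see which model identifiers it exposes. The sketch below assumes the local server implements the OpenAI-compatible models endpoint; `list_model_ids` is a hypothetical helper that is not part of the original notebook.

```python
def list_model_ids(client) -> list[str]:
    """Return the identifiers of the models the server reports as available.

    `client` is assumed to be an `openai.OpenAI` instance pointed at the
    local server, like the one created above.
    """
    # client.models.list() returns a page object whose `.data` holds the model entries
    return [model.id for model in client.models.list().data]
```

Calling `print(list_model_ids(client))` should list the identifier of the embedding model if it is loaded on the server; if the call fails with a connection error, the server is probably not running on the configured port.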

We want to experiment with different embedding models to create embeddings of our inputs. This example shows that
[Granite-Embedding-278m-multilingual](https://huggingface.co/ibm-granite/granite-embedding-278m-multilingual) does a reasonably good job on Myanmar text, even though it was not trained on the Myanmar language.


In [2]:
DEFAULT_MODEL="text-embedding-granite-embedding-278m-multilingual"

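def create_embedding(text: str, model: str = DEFAULT_MODEL) -> np.ndarray:
    # Request one embedding and convert the returned list of floats to a NumPy array
    response = client.embeddings.create(model=model, input=text)
    return np.array(response.data[0].embedding)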
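
Calling the endpoint once per phrase is fine for a handful of inputs. The OpenAI client also accepts a list of strings as `input`, so several phrases can be embedded in one request. The following is a minimal sketch, assuming the local server supports list inputs on its embeddings endpoint and fills in the `index` field the way the OpenAI API does; `create_embeddings_batch` is a hypothetical helper name, and the client is passed in explicitly.

```python
def create_embeddings_batch(client, texts: list[str], model: str) -> list[list[float]]:
    """Embed several texts in one request and return the vectors in input order."""
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the index of the input it belongs to;
    # sorting by it guards against a server that reorders the results.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the notebook's objects this could be called as `create_embeddings_batch(client, [en_phrase, mm_phrase_1], DEFAULT_MODEL)`.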

In [3]:
en_phrase = "What are you doing?"
mm_phrase_1 = "ဘာလုပ်နေတာလဲ"
mm_phrase_2 = "ဘယ်သွားမလို့လဲ"
embedding_eng = create_embedding(en_phrase)
embedding_mm_1 = create_embedding(mm_phrase_1)
embedding_mm_2 = create_embedding(mm_phrase_2)

In [4]:
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a,b)/(norm(a)*norm(b))

Semantically similar phrases should have a higher cosine similarity than dissimilar pairs.
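
To make the scale of the score concrete, here is a small worked example on hand-picked 2D vectors rather than real embeddings. It re-implements the same formula with the standard library only (the name `cosine_similarity_plain` is illustrative), so it runs without the embedding server.

```python
import math

def cosine_similarity_plain(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity_plain([1.0, 2.0], [-1.0, -2.0])      # opposite vectors
print(same_direction, orthogonal, opposite)
```

The three values come out close to 1, 0 and -1 respectively. Only the direction of the vectors matters, not their length, which is why the score is a convenient way to compare embeddings.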

In [5]:
print(f"Similarity between {en_phrase} and {mm_phrase_1}:: {cosine_similarity(embedding_eng, embedding_mm_1)}")
print(f"Similarity between {en_phrase} and {mm_phrase_2}:: {cosine_similarity(embedding_eng, embedding_mm_2)}")

Similarity between What are you doing? and ဘာလုပ်နေတာလဲ:: 0.9160455173193762
Similarity between What are you doing? and ဘယ်သွားမလို့လဲ:: 0.6699955543069079


# Create two simple lists of Myanmar and English questions

In [6]:
list_mm = [
    "နေကောင်းလား",
    "ဘာလုပ်နေသလဲ",
    "ဒီဟာကဘာလဲ",
    "ဘယ်သွားမလဲ",
    "ဈေးဘယ်မှာရှိလဲ",
    "ဘယ်ချိန်မှာစလဲ",
    "ဒီဟာဘယ်လောက်လဲ",
    "ဘယ်သူလဲ",
    "မင်းနာမည်ဘယ်လိုခေါ်လဲ",
    "အိမ်ဘယ်မှာရှိလဲ",
    "ဘယ်နေမှာနေသလဲ",
    "အလုပ်လုပ်တာပျော်သလား",
    "နောက်ဘယ်ချိန်တွေ့မလဲ",
    "မင်းငါ့ကိုသတိရတယ်လား",
    "ဒီနေ့ဘာနေ့လဲ",
    "နောက်မှတွေ့ကြမလား",
    "ဘယ်နှစ်ယောက်လာမလဲ",
    "ဒီအစားအစာကိုကြိုက်လား",
    "ဒီနေ့အပန်းဖြေခဲ့သလား",
    "ဘယ်လောက်ကြာမလဲ",
    "ဒါမင်းဟာလား",
    "သတင်းကဘာလဲ",
    "ဘယ်သူကိုဆက်သွယ်ရမလဲ",
    "မင်းအနီးအနားမှာနေတယ်လား",
    "ငါ့နဲ့ကူညီနိုင်မလား",
    "မင်းဘာကြောင့်ထွက်သွားတာလဲ",
    "ဒီနေ့အတွက်အချက်အလက်ရမလား",
    "စိတ်ညစ်နေတာလား",
    "ငါ့အကြံပေးနိုင်မလား",
    "ဒီကားကဘယ်မှတ်တိုင်ကိုသွားမလဲ",
    "ဘယ်တော့ပြန်မလဲ",
    "ဘာလုပ်ချင်လဲ",
    "ဒီစာအုပ်အကြောင်းပြောပြပါလား"
]

list_en = [
    "How are you?",
    "What are you doing?",
    "What is this?",
    "Where are you going?",
    "Where is the market?",
    "When does it start?",
    "How much is this?",
    "Who is it?",
    "What is your name?",
    "Where is your house?",
    "Where do you live?",
    "Do you enjoy your work?",
    "When will we meet next?",
    "Do you miss me?",
    "What day is it today?",
    "Shall we meet later?",
    "How many people will come?",
    "Do you like this food?",
    "Did you relax today?",
    "How long will it take?",
    "Is this yours?",
    "What is the news?",
    "Who should I contact?",
    "Where do you live nearby?",
    "Can you help me?",
    "Why did you leave?",
    "Can I get today’s details?",
    "Are you upset?",
    "Can you give me advice?",
    "Where does this bus go?",
    "When will you return?",
    "What do you want to do?",
    "Can you tell me about this book?"
]


In [7]:
mm_embeddings = [(create_embedding(original), original) for original in list_mm]

In [8]:
en_embeddings = [(create_embedding(original), original) for original in list_en]

In [9]:
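acc = []
for m_idx, mm_embedding in enumerate(mm_embeddings):
    # Start below any possible cosine similarity so that negative scores can still match
    max_score = float("-inf")
    match = None
    match_idx = None
    # Find the English phrase whose embedding is closest to this Myanmar phrase
    for e_idx, en_embedding in enumerate(en_embeddings):
        score = cosine_similarity(mm_embedding[0], en_embedding[0])
        if score > max_score:
            match = en_embedding[1]
            max_score = score
            match_idx = e_idx
    # The two lists are parallel, so a match is correct when the indices agree
    correct = m_idx == match_idx
    acc.append(1 if correct else 0)
    print(f"For {mm_embedding[1]} ==> {match} {('[CORRECT]' if correct else '[WRONG]')}")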

For နေကောင်းလား ==> Did you relax today? [WRONG]
For ဘာလုပ်နေသလဲ ==> What are you doing? [CORRECT]
For ဒီဟာကဘာလဲ ==> What is this? [CORRECT]
For ဘယ်သွားမလဲ ==> Where are you going? [CORRECT]
For ဈေးဘယ်မှာရှိလဲ ==> Where is the market? [CORRECT]
For ဘယ်ချိန်မှာစလဲ ==> When does it start? [CORRECT]
For ဒီဟာဘယ်လောက်လဲ ==> How much is this? [CORRECT]
For ဘယ်သူလဲ ==> Who is it? [CORRECT]
For မင်းနာမည်ဘယ်လိုခေါ်လဲ ==> What is your name? [CORRECT]
For အိမ်ဘယ်မှာရှိလဲ ==> Where is your house? [CORRECT]
For ဘယ်နေမှာနေသလဲ ==> Where are you going? [WRONG]
For အလုပ်လုပ်တာပျော်သလား ==> Do you enjoy your work? [CORRECT]
For နောက်ဘယ်ချိန်တွေ့မလဲ ==> When will we meet next? [CORRECT]
For မင်းငါ့ကိုသတိရတယ်လား ==> Do you miss me? [CORRECT]
For ဒီနေ့ဘာနေ့လဲ ==> What day is it today? [CORRECT]
For နောက်မှတွေ့ကြမလား ==> Shall we meet later? [CORRECT]
For ဘယ်နှစ်ယောက်လာမလဲ ==> How many people will come? [CORRECT]
For ဒီအစားအစာကိုကြိုက်လား ==> Do you like this food? [CORRECT]
For ဒီနေ့အပန်းဖြေခဲ့သလား ==> Did

In [10]:
print(f"Percent correct:: {np.mean(acc) * 100}%")

Percent correct:: 93.93939393939394%
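
The nested loop above compares every Myanmar phrase with every English phrase one pair at a time. The same top-1 matching can be written as a single matrix product once the vectors are normalized to unit length. The sketch below uses small made-up vectors in place of real embeddings; `top1_matches` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np

def top1_matches(queries: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """For each query row, return the index of the most cosine-similar candidate row."""
    # Normalize rows to unit length so that a dot product equals cosine similarity
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    similarity = q @ c.T  # shape: (n_queries, n_candidates)
    return similarity.argmax(axis=1)

# Toy data standing in for embeddings: row i of `queries` should match row i of `candidates`
queries = np.array([[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]])
candidates = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

matches = top1_matches(queries, candidates)
accuracy = float(np.mean(matches == np.arange(len(queries))))
print(matches, accuracy)
```

With the notebook's data, `queries` would be `np.array([emb for emb, _ in mm_embeddings])` and `candidates` the corresponding array built from `en_embeddings`.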


# Visualizing embeddings by reducing their dimension to 2D using t-SNE

In [11]:
import plotly.graph_objs as go
from sklearn.manifold import TSNE

In [12]:
all_embeddings = np.array([emb for emb,_ in mm_embeddings] + [emb for emb,_ in en_embeddings])
labels = ['MM: ' + text for text in list_mm] + ['EN: ' + text for text in list_en]
colors = ['red'] * len(list_mm) + ['blue'] * len(list_en)

In [13]:
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(all_embeddings)

In [19]:
fig = go.Figure(data=go.Scatter(
    x=embeddings_2d[:, 0],
    y=embeddings_2d[:, 1],
    mode='markers+text',
    marker=dict(
        color=colors,
        size=10,
        opacity=0.7
    ),
    text=labels,
    textposition='top center',
    textfont=dict(
        size=14
    )
))

fig.update_layout(
    title='Myanmar-English Phrase Embeddings Visualization. Embeddings for both English and Myanmar are generated by IBM\'s granite-embedding-278m-multilingual embedding model',
    title_font_size=16,
    xaxis_title='t-SNE Dimension 1',
    yaxis_title='t-SNE Dimension 2',
    hovermode='closest',
    width=1800,  # Increased plot width
    height=1200  # Increased plot height
)

fig.write_html("embedding_visualization.html")
print("Visualization saved as embedding_visualization.html")

Visualization saved as embedding_visualization.html
