In [None]:
# openai.embeddings_utils (imported below) only exists in openai releases before 1.0
!pip install "openai<1.0" -q

All words used in this project are sourced directly from [Mi'gmaq Mi'kmaq Micmac Online](https://www.mikmaqonline.org/). 

The talking dictionary project is developing an Internet resource for the Mi'gmaq/Mi’kmaq language. Each headword is recorded by a minimum of three speakers. Multiple speakers allow one to hear differences and variations in how a word is pronounced. Each recorded word is used in an accompanying phrase. This permits learners the opportunity to develop the difficult skill of distinguishing individual words when they are spoken in a phrase.
Thus far we have posted 6500 headwords; a majority of these entries include two to three additional forms.

The project was initiated in Listuguj, therefore all entries have Listuguj speakers and Listuguj spellings. In collaboration with Unama'ki, the site now includes a number of recordings from Unama'ki speakers. More will be added as they become available.

The orthographies (spelling system):
Each word is presented using the Listuguj orthography. The Smith-Francis orthography will be included in the future. Some spellings are speculative.
Listuguj is in the Gespe'g territory of the Mi'gmaw, located on the southwest shore of the Gaspé peninsula.

Unama'ki is a Mi’gmaw territory; in English it is known as Cape Breton. 




In [None]:
import openai
import pandas as pd
import numpy as np
from getpass import getpass

openai.api_key = getpass()



The Mi'kmaq language has twenty-one words for '**play**'. I list them in a CSV file called mikmaq-language-play.csv.

In [None]:
df = pd.read_csv('mikmaq-language-play.csv')
print(df)

            text
0   awanmila'sit
1   elaqalsewatl
2       elutuatl
3    ge'gutesing
4     getmete'gl
5         gise'g
6       giso'qon
7    maligeiwatl
8      maligo'tg
9    mattaqta'tl
10      mila'sit
11  mila'sualatl
12   mila'suaqan
13    mila'suatg
14        nuja'q
15         papit
16     papitaqan
17      papuaqan
18         tu'at
19        wali'j
20   wigji'jaqan


I take the Mi'kmaq phrase **Newtigisg'g gisitu'a'ti'tis, newtigisg'g tu'a'ti'tis**, convert it into numeric form, and embed it in a vector space.

In English it is translated as: **If they could play ball all day, they would play ball all day**

Periods are omitted from this search.

In [None]:
from openai.embeddings_utils import get_embedding

df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.to_csv('word_embeddings.csv')
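
The `openai.embeddings_utils` module imported above was removed in version 1.0 of the `openai` package, so this cell only runs on older releases. Below is a minimal sketch of an equivalent helper for the 1.x client. The name `embed_text` is hypothetical, and the client is passed in as an argument, so the function assumes nothing about how it was configured.

```python
from typing import Any, List


def embed_text(client: Any, text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Return the embedding of `text` as a list of floats.

    `client` is assumed to be an `openai.OpenAI()` instance (openai >= 1.0),
    whose `embeddings.create` call returns an object with a `.data` list.
    """
    # Replace newlines with spaces so the input is a single line.
    cleaned = text.replace("\n", " ")
    response = client.embeddings.create(input=[cleaned], model=model)
    return response.data[0].embedding


# Usage (not executed here, since it requires an API key):
#   from openai import OpenAI
#   client = OpenAI()
#   df["embedding"] = df["text"].apply(lambda t: embed_text(client, t))
```

Passing the client in also makes the helper easy to exercise with a stand-in object, without network access.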

In [None]:
get_embedding("Newtigisg'g gisitu'a'ti'tis, newtigisg'g tu'a'ti'tis", engine='text-embedding-ada-002')

[-0.009200391359627247,
 0.0004904866800643504,
 0.007821002043783665,
 -0.008912460878491402,
 0.0009483300382271409,
 0.024869181215763092,
 -0.017999019473791122,
 -0.027346724644303322,
 -0.010164625011384487,
 -0.028578801080584526,
 0.04296194389462471,
 0.0056146495044231415,
 0.014235831797122955,
 -0.012421198189258575,
 0.013050627894699574,
 -0.005058876238763332,
 0.040738850831985474,
 0.008289727382361889,
 0.0009064796613529325,
 -0.016070552170276642,
 0.006491833832114935,
 0.034873101860284805,
 0.018789155408740044,
 -0.003053405089303851,
 -0.007365670055150986,
 -0.0007014126749709249,
 -0.0009734402992762625,
 -0.024949533864855766,
 0.01194577757269144,
 -0.007432630751281977,
 -0.013037236407399178,
 -0.016566062346100807,
 -0.02465490624308586,
 -0.019954269751906395,
 -0.02924840711057186,
 -0.01568218134343624,
 0.012240404263138771,
 0.007097827736288309,
 0.006080025807023048,
 0.00029337129672057927,
 0.017624039202928543,
 -0.0006800689734518528,
 0.00760

The words above are converted to numerical values and stored as vectors.
A new CSV file, word_embeddings.csv, is generated containing each word and its vector.

Below, the Mi'kmaq words and their assigned vector values are displayed.

In [None]:
df = pd.read_csv('word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,awanmila'sit,"[-0.022933831438422203, -0.012924745678901672,..."
1,1,elaqalsewatl,"[-0.011054829694330692, 0.005368522834032774, ..."
2,2,elutuatl,"[-0.03373202681541443, -0.005934728309512138, ..."
3,3,ge'gutesing,"[-0.01024299394339323, 0.002512947889044881, 0..."
4,4,getmete'gl,"[-0.024321187287569046, -0.016803987324237823,..."
5,5,gise'g,"[-0.02557378076016903, 0.003713813377544284, 0..."
6,6,giso'qon,"[-0.010195519775152206, 0.000577291939407587, ..."
7,7,maligeiwatl,"[-0.03518354520201683, 0.014880157075822353, -..."
8,8,maligo'tg,"[-0.021383129060268402, -0.008770162239670753,..."
9,9,mattaqta'tl,"[-0.022418109700083733, 0.01368875615298748, 0..."
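
The `Unnamed: 0` and `Unnamed: 0.1` columns in the output above are most likely DataFrame indexes that `to_csv` wrote into the file and `read_csv` then read back as data. The sketch below uses an in-memory buffer and made-up three-value vectors in place of real embeddings. It shows `index=False` avoiding those columns, with `ast.literal_eval` as a stricter alternative to `eval` for parsing the stored lists.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy stand-ins for real embeddings (assumed values, for illustration only).
toy_df = pd.DataFrame({
    "text": ["mila'sit", "tu'at"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Write without the index so no "Unnamed: 0" column appears on reload.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Each embedding comes back as a string such as "[0.1, 0.2, 0.3]".
reloaded = pd.read_csv(buffer)
# literal_eval parses Python literals only, so it cannot run arbitrary code.
reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval).apply(np.array)

print(list(reloaded.columns))
print(reloaded["embedding"][0])
```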


We can enter a word or phrase to find similar entries. This is tested here with the English phrase '**I want to play ball**'. The program semantically searches the previously embedded Mi'kmaq words using OpenAI embeddings.

In [None]:
search_term = input('Enter a search term: ')


Enter a search term: I want to play ball


The words entered above are converted into a vector. Below are the values for the searched term.

In [None]:
# semantic search
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")
search_term_vector

[-0.028811048716306686,
 -0.012532934546470642,
 0.01519802026450634,
 0.0069854650646448135,
 -0.04110751301050186,
 0.020004121586680412,
 -0.011529532261192799,
 -0.004294814541935921,
 -0.0022145139519125223,
 -0.020975569263100624,
 0.010289660654962063,
 0.0008420265512540936,
 0.034665290266275406,
 -0.0034192348830401897,
 0.00710689602419734,
 -0.038193173706531525,
 0.03760519251227379,
 -0.02195979654788971,
 0.017153695225715637,
 -0.013140087947249413,
 0.010967115871608257,
 -0.012500979006290436,
 0.017907842993736267,
 -0.014482217840850353,
 -0.027251621708273888,
 -0.013549118302762508,
 0.007841871120035648,
 -0.018265744671225548,
 -0.0007281851721927524,
 0.007867435924708843,
 0.02998700924217701,
 0.02156354859471321,
 -0.008193382062017918,
 -0.00850654486566782,
 -0.014264920726418495,
 -0.01792062632739544,
 -0.009414080530405045,
 0.007349757477641106,
 0.0014835326001048088,
 -0.011900215409696102,
 0.007988866418600082,
 0.01675744727253914,
 -0.01712813042

# Mi'kmaq words are converted to a numerical form and embedded into a vector space.

The search term is then compared to each Mi'kmaq word for play in the vector space using cosine similarity.
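
For reference, the standard definition of cosine similarity between two vectors is given here. The symbols $\mathbf{a}$ and $\mathbf{b}$ are introduced for illustration and do not appear in the notebook.

$$
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
= \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
$$

A value near 1 means the two vectors point in nearly the same direction, and 0 means they are orthogonal.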

In [None]:
from openai.embeddings_utils import cosine_similarity

df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))

df

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,awanmila'sit,"[-0.022933831438422203, -0.012924745678901672,...",0.758619
1,1,elaqalsewatl,"[-0.011054829694330692, 0.005368522834032774, ...",0.750233
2,2,elutuatl,"[-0.03373202681541443, -0.005934728309512138, ...",0.755367
3,3,ge'gutesing,"[-0.01024299394339323, 0.002512947889044881, 0...",0.747175
4,4,getmete'gl,"[-0.024321187287569046, -0.016803987324237823,...",0.776565
5,5,gise'g,"[-0.02557378076016903, 0.003713813377544284, 0...",0.751982
6,6,giso'qon,"[-0.010195519775152206, 0.000577291939407587, ...",0.761485
7,7,maligeiwatl,"[-0.03518354520201683, 0.014880157075822353, -...",0.753384
8,8,maligo'tg,"[-0.021383129060268402, -0.008770162239670753,...",0.758941
9,9,mattaqta'tl,"[-0.022418109700083733, 0.01368875615298748, 0...",0.754777


# The top 5 Mi'kmaq words most similar to the term '*I want to play ball*' in vector space are:

getmete'gl (win all/break all/destroy all)

mila'sualatl (plays with/toy with)

papit (amuse self)

mila'suaqan (toy)

mila'suatg (plays with/toy with)

# **All words for Play defined**

awanmila'sit (plays poorly)

elaqalsewatl (play a gambling game for/play for (in cards and board games))

elutuatl (impersonate/imitate)

ge'gutesing (land on top/jump on top)

getmete'gl (win all/break all/destroy all)

gise'g (have good time/have fun/fun to be with/enjoyable)

giso'qon (fun time/lots of fun! (interjection))

maligeiwatl (play with/amuse)

maligo'tg (play with/amuse)

mattaqta'tl (pluck (as string)/tug (a string)/pull (a string)/jerk (a string))

mila'sit (play)

mila'sualatl (plays with/toy with)

mila'suaqan (toy)

mila'suatg (plays with/toy with)

nuja'q (swimmer)

papit (amuse self)

papitaqan (toy/item made for enjoyment)

papuaqan (amusement/fun time/celebration)

tu'at (play baseball/play ball)

wali'j (snowball)

wigji'jaqan (toy)




In [None]:
df.sort_values("similarities", ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
4,4,getmete'gl,"[-0.024321187287569046, -0.016803987324237823,...",0.776565
11,11,mila'sualatl,"[-0.02142377384006977, -0.0002219943853560835,...",0.769436
15,15,papit,"[-0.01954457350075245, 0.0004031408461742103, ...",0.768016
12,12,mila'suaqan,"[-0.012376071885228157, 0.0039017482195049524,...",0.765671
13,13,mila'suatg,"[-0.025191467255353928, -0.0010983244283124804...",0.765148
10,10,mila'sit,"[-0.026904482394456863, -0.020474812015891075,...",0.762923
6,6,giso'qon,"[-0.010195519775152206, 0.000577291939407587, ...",0.761485
20,20,wigji'jaqan,"[-0.019499141722917557, -0.0001804624189389869...",0.760341
8,8,maligo'tg,"[-0.021383129060268402, -0.008770162239670753,...",0.758941
0,0,awanmila'sit,"[-0.022933831438422203, -0.012924745678901672,...",0.758619
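
The scoring and sorting steps above can be combined into one helper. The sketch below is one possible way to do it: it stacks the embeddings into a matrix and scores every row in a single matrix product. The name `top_k_similar` and the toy two-dimensional vectors are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def top_k_similar(frame: pd.DataFrame, query: np.ndarray, k: int = 5) -> pd.DataFrame:
    """Return the k rows of `frame` whose 'embedding' is closest to `query`."""
    matrix = np.stack(frame["embedding"].tolist())  # shape (n_words, dim)
    # Cosine similarity of every row with the query in one matrix product.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    ranked = frame.assign(similarities=scores)
    return ranked.sort_values("similarities", ascending=False).head(k)


# Toy 2-D vectors (assumed values, not real embeddings).
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "embedding": [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
})
top = top_k_similar(toy, np.array([1.0, 0.2]), k=2)
print(top[["text", "similarities"]])
```

Because `assign` returns a copy, the original DataFrame is left without a `similarities` column, which avoids overwriting scores from an earlier search.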


Using the row indices for awanmila'sit (plays poorly, row 0) and tu'at (play baseball/play ball, row 18), I add the two word vectors together to form a combined vector that brings new context to the search.

An action_vector and an object_vector are defined for the arithmetic.


In [None]:
play_df = df.copy()

action_vector = play_df['embedding'][0]
object_vector = play_df['embedding'][18]

action_object_vector = action_vector + object_vector
action_object_vector

array([-0.04771161, -0.01259031,  0.02782422, ...,  0.0045356 ,
        0.01290679, -0.02435643])

# Words used: awanmila'sit + tu'at.
---

Apart from the two source words themselves, the words most closely associated with the combined vector are:

mila'suaqan (toy)

mila'suatg (plays with/toy with)

mila'sit (play)

mila'sualatl (plays with/toy with)

elutuatl (impersonate/imitate)

In [None]:
play_df["similarities"] = play_df['embedding'].apply(lambda x: cosine_similarity(x, action_object_vector))
play_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,awanmila'sit,"[-0.022933831438422203, -0.012924745678901672,...",0.955301
18,18,tu'at,"[-0.02477777935564518, 0.00033443779102526605,...",0.955301
12,12,mila'suaqan,"[-0.012376071885228157, 0.0039017482195049524,...",0.898795
13,13,mila'suatg,"[-0.025191467255353928, -0.0010983244283124804...",0.898511
10,10,mila'sit,"[-0.026904482394456863, -0.020474812015891075,...",0.896811
11,11,mila'sualatl,"[-0.02142377384006977, -0.0002219943853560835,...",0.890568
2,2,elutuatl,"[-0.03373202681541443, -0.005934728309512138, ...",0.877914
9,9,mattaqta'tl,"[-0.022418109700083733, 0.01368875615298748, 0...",0.875413
16,16,papitaqan,"[-0.023620164021849632, 0.0008591554360464215,...",0.874414
17,17,papuaqan,"[-0.013827907852828503, 0.011130276136100292, ...",0.861266
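
The two source words share the top score (0.955301) in the table above. That is what one would expect if both embeddings have unit length, which is assumed here rather than checked in the notebook: in that case cos(a, a+b) and cos(b, a+b) both equal (1 + a·b) / |a+b|. A small sketch with random unit vectors standing in for the real embeddings:

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
# Two random vectors scaled to unit length (stand-ins for real embeddings).
a = rng.normal(size=8)
a /= np.linalg.norm(a)
b = rng.normal(size=8)
b /= np.linalg.norm(b)

combined = a + b
sim_a = cosine(a, combined)
sim_b = cosine(b, combined)
print(sim_a, sim_b)  # the two values agree up to floating-point rounding
```

A consequence is that the summed vector always ranks its own two source words equally, so the more informative results start at the third row.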


The English equivalents of the Mi'kmaq words are introduced.

In [None]:
action_object_df = pd.read_csv('english-language-play.csv')
action_object_df

Unnamed: 0,text
0,plays poorly
1,play a gambling game for play for (in cards an...
2,impersonate imitate
3,land on top jump on top
4,win all break all destroy all
5,have good time have fun fun to be with enjoyable
6,fun time lots of fun! (interjection)
7,play with amuse
8,play with amuse
9,pluck (as string) tug (a string) pull (a strin...


In [None]:
action_object_df['embedding'] = action_object_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
action_object_df.to_csv('Sentence-play-embeddings.csv')

The phrase used above is re-entered and compared against the English definitions of the Mi'kmaq words for 'play', to see how the associations differ between the Mi'kmaq words and their English equivalents.

In [None]:
action_object_search = input("Enter word or phrase:")

Enter word or phrase:I want to play ball


An updated vector is generated for the phrase, to be compared against the English equivalents of the Mi'kmaq words.

In [None]:

action_object_search_vector = get_embedding(action_object_search, engine="text-embedding-ada-002")
action_object_search_vector

[-0.028721993789076805,
 -0.012565872631967068,
 0.015255365520715714,
 0.007020789664238691,
 -0.04111538082361221,
 0.02011050656437874,
 -0.01150540728121996,
 -0.004328102804720402,
 -0.0022103670053184032,
 -0.0209665447473526,
 0.01029800996184349,
 0.0008336788741871715,
 0.03467592969536781,
 -0.0033187447115778923,
 0.007033566478639841,
 -0.03820229694247246,
 0.037691228091716766,
 -0.022039785981178284,
 0.01710798405110836,
 -0.013057774864137173,
 0.010987951420247555,
 -0.012457270175218582,
 0.017810702323913574,
 -0.014399327337741852,
 -0.02721434459090233,
 -0.013581618666648865,
 0.00785127468407154,
 -0.018219556659460068,
 -0.0007538245990872383,
 0.007857662625610828,
 0.029923003166913986,
 0.02168203890323639,
 -0.008189857006072998,
 -0.008451778441667557,
 -0.01423322968184948,
 -0.017951246351003647,
 -0.00946113746613264,
 0.007314653601497412,
 0.0014493555063381791,
 -0.011901484802365303,
 0.007947099395096302,
 0.016699131578207016,
 -0.0171463154256343

The English words are then converted into a numerical form and embedded into a vector space.

In [None]:
action_object_df["similarities"] = action_object_df['embedding'].apply(lambda x: cosine_similarity(x, action_object_search_vector))

action_object_df


Unnamed: 0,text,embedding,similarities
0,plays poorly,"[-0.037109412252902985, -0.00602723890915513, ...",0.794069
1,play a gambling game for play for (in cards an...,"[-0.019025489687919617, -0.0062982551753520966...",0.771193
2,impersonate imitate,"[-0.03657917678356171, -0.00834670290350914, 0...",0.773046
3,land on top jump on top,"[0.002501098206266761, -0.015347053296864033, ...",0.778453
4,win all break all destroy all,"[-0.026535844430327415, -0.01716775819659233, ...",0.767423
5,have good time have fun fun to be with enjoyable,"[-0.0036454314831644297, 0.005597290582954884,...",0.788258
6,fun time lots of fun! (interjection),"[-0.024563871324062347, -0.012834819965064526,...",0.779292
7,play with amuse,"[-0.030671261250972748, -0.016328096389770508,...",0.811058
8,play with amuse,"[-0.030660254880785942, -0.01628146506845951, ...",0.81109
9,pluck (as string) tug (a string) pull (a strin...,"[-0.03145026043057442, 0.002522394061088562, 0...",0.739929


# **All Mi'kmaq words for Play**

awanmila'sit (plays poorly)

elaqalsewatl (play a gambling game for/play for (in cards and board games))

elutuatl (impersonate/imitate)

ge'gutesing (land on top/jump on top)

getmete'gl (win all/break all/destroy all)

gise'g (have good time/have fun/fun to be with/enjoyable)

giso'qon (fun time/lots of fun! (interjection))

maligeiwatl (play with/amuse)

maligo'tg (play with/amuse)

mattaqta'tl (pluck (as string)/tug (a string)/pull (a string)/jerk (a string))

mila'sit (play)

mila'sualatl (plays with/toy with)

mila'suaqan (toy)

mila'suatg (plays with/toy with)

nuja'q (swimmer)

papit (amuse self)

papitaqan (toy/item made for enjoyment)

papuaqan (amusement/fun time/celebration)

tu'at (play baseball/play ball)

wali'j (snowball)

wigji'jaqan (toy)

The English equivalents of the Mi'kmaq words, in order of similarity to the search phrase:

In [None]:
action_object_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
18,play baseball play ball,"[-0.013598288409411907, -0.012913853861391544,...",0.913124
10,play,"[-0.013542314060032368, -0.002542998408898711,...",0.826537
11,plays with toy with,"[-0.031726814806461334, 0.00011007127613993362...",0.822912
13,plays with toy with,"[-0.031711265444755554, 0.00010184014536207542...",0.822882
8,play with amuse,"[-0.030660254880785942, -0.01628146506845951, ...",0.81109
7,play with amuse,"[-0.030671261250972748, -0.016328096389770508,...",0.811058
12,toy,"[-0.004021339118480682, -0.02382609061896801, ...",0.802493
20,toy,"[-0.004064851440489292, -0.023867269977927208,...",0.802441
0,plays poorly,"[-0.037109412252902985, -0.00602723890915513, ...",0.794069
19,snowball,"[-0.022801609709858894, -0.020457889884710312,...",0.789251


In [None]:
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])

# (1 * 4) + (2 * 5) + (3 * 6)
dot_product = np.dot(v1, v2)
dot_product

32

In [None]:
# square root of (1^2 + 2^2 + 3^2) = square root of (1+4+9) = square root of 14
np.linalg.norm(v1)

3.7416573867739413

In [None]:
# square root of (4^2 + 5^2 + 6^2) = square root of (16+25+36) = square root of 77
np.linalg.norm(v2)

8.774964387392123

In [None]:
magnitude = np.linalg.norm(v1) * np.linalg.norm(v2)
magnitude

32.83291031876401

In [None]:
dot_product / magnitude

0.9746318461970762

In [None]:
from scipy import spatial

result = 1 - spatial.distance.cosine(v1, v2)

result

0.9746318461970761
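
The hand calculation in the cells above (dot product divided by the product of the two norms) can be folded into a single function. This is a minimal NumPy-only sketch. The name `cosine_similarity_np` is hypothetical, and the zero-vector check is an added safeguard, not something the notebook does.

```python
import numpy as np


def cosine_similarity_np(a, b) -> float:
    """Cosine similarity: dot(a, b) / (norm(a) * norm(b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The angle to a zero-length vector is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)


# Same vectors as in the worked example: 32 / (sqrt(14) * sqrt(77))
result = cosine_similarity_np([1, 2, 3], [4, 5, 6])
print(result)
```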