## Connect to your SAP HANA database... 

...using the user key `myDevChallenger`

In [1]:
import os 

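# The hdbuserstore environment variables, including HDB_USE_IDENT, are documented at:
# https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/2dbfa39ecc364a65a6ab0fea9c8c8bd9.html?#secure-user-store-(hdbuserstore)-environment-variables

# WORKSPACE_ID must already be set; os.environ[...] raises a KeyError if it is missing.
os.environ["HDB_USE_IDENT"] = os.environ["WORKSPACE_ID"]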
print(os.getenv("HDB_USE_IDENT"))

workspaces-ws-cwf68


In [2]:
%load_ext sql

In [3]:
%sql hana://userkey=myDevChallenger

## Closely related words

Find the 5 words most closely related to **cat** and to **CAT** using the [`COSINE_SIMILARITY()` SQL function](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/cosine-similarity-063e1366a7d54735b98b2513ea4a88c9), and display the [`L2DISTANCE()`](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/l2distance) for each pair as well.

In [4]:
%%sql 

SELECT
  RN,
  WORD,
  RELATED_WORD,
  SIMILARITY_SCORE,
  L2DISTANCE
FROM (
  SELECT
    B.WORD AS WORD,
    A.WORD AS RELATED_WORD,
    COSINE_SIMILARITY(A.WV, B.WV) AS SIMILARITY_SCORE,
    L2DISTANCE(A.WV, B.WV) AS L2DISTANCE,
    ROW_NUMBER() OVER (PARTITION BY B.WORD ORDER BY COSINE_SIMILARITY(A.WV, B.WV) DESC) AS RN
  FROM
    "DEVCHALLENGER"."GOOGLE_NEWS" A,
    (SELECT WV, WORD FROM "DEVCHALLENGER"."GOOGLE_NEWS" WHERE WORD IN ('cat', 'CAT')) B
  WHERE
    A.WORD <> B.WORD
) AS RankedWords
WHERE
  RN <= 5
ORDER BY
  WORD DESC,
  SIMILARITY_SCORE DESC;

 * hana://userkey=myDevChallenger
Done.


rn,word,related_word,similarity_score,l2distance
1,cat,cats,0.8099379190125675,1.8781065954236476
2,cat,dog,0.7609457161879546,2.081533633772025
3,cat,kitten,0.7464984917354589,2.303452189792881
4,cat,feline,0.7326233654110104,2.25089669903064
5,cat,beagle,0.7150583717655224,2.656861986457292
1,CAT,CATS,0.4546213947726845,3.134979335832406
2,CAT,IIMs,0.4327996254374734,4.325961126356552
3,CAT,IIT,0.4209513456200335,3.619956091935916
4,CAT,MAT,0.4181640773353859,3.4067967527253344
5,CAT,STAR,0.4108674233085596,3.862325675511875


You should see that **cat** is closely related to pets, while **CAT** is only weakly related (similarity scores below 0.5) to some other acronyms.
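
The two measures do not always agree: in the output above, **kitten** ranks ahead of **feline** by cosine similarity even though its L2 distance is larger. The following is a minimal sketch in plain Python of why that can happen. The helper functions and the 2-dimensional toy vectors are hypothetical and only illustrate the math; they are not how SAP HANA implements `COSINE_SIMILARITY()` or `L2DISTANCE()`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; independent of vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_distance(u, v):
    """Euclidean (straight-line) distance between u and v; sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vectors, not taken from the GOOGLE_NEWS table.
query = [1.0, 0.0]
same_direction_but_long = [3.0, 0.0]   # same angle as query, three times as long
nearby_but_rotated = [1.0, 0.5]        # close to query, but at a different angle

for name, vec in [("long", same_direction_but_long), ("rotated", nearby_but_rotated)]:
    print(name, round(cosine_similarity(query, vec), 4), round(l2_distance(query, vec), 4))
```

Here the longer vector wins on cosine similarity, while the rotated one wins on L2 distance, so the ranking depends on which measure you order by.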

## Analogy queries

With words represented as vectors, you can run vector computations on them, such as analogy queries.

The famous example, also described on Wikipedia, is ["What is the word related to **woman**, if the word related to **man** is **king**?"](https://en.wikipedia.org/wiki/Word2vec#Preservation_of_semantic_and_syntactic_relationships), where the expected answer is **queen**.

In [16]:
word1='Bengaluru'
related_word1='India'
word2='Melbourne'

Notice the mix of Python variables `word1`, `related_word1`, `word2` in the SQL statement below.
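
The `:word1` style placeholders are named parameters: in the notebook, the `%%sql` magic takes their values from the Python variables of the same name. As a rough illustration of the same idea outside the notebook, here is a sketch using Python's built-in `sqlite3` module, where the values are passed explicitly in a dictionary. SQLite and the `words` table are stand-ins assumed only for this example; they are not part of the challenge setup.

```python
import sqlite3

word1 = "Bengaluru"

# In-memory SQLite database, used here only as a stand-in for SAP HANA.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("Bengaluru",), ("Melbourne",)])

# :word1 is a named placeholder; its value is supplied separately from the SQL text.
rows = conn.execute(
    "SELECT word FROM words WHERE word = :word1",
    {"word1": word1},
).fetchall()
print(rows)
conn.close()
```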

The SQL query below uses the **3CosMul** calculation presented in [Linguistic Regularities in Sparse and Explicit Word Representations](https://aclanthology.org/W14-1618/) by Omer Levy and Yoav Goldberg.
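
Written out for the roles used in this notebook, the score can be read as follows. The symbol names are assumptions made here for readability: $a$ stands for `word1`, $a^*$ for `related_word1`, $b$ for `word2`, and $b^*$ for each candidate word.

$$
\text{score}(b^*) = \frac{s(b^*, b)\; s(b^*, a^*)}{s(b^*, a) + \varepsilon},
\qquad
s(x, y) = \frac{1 + \cos(x, y)}{2}
$$

The shift $s$ maps the cosine similarity from $[-1, 1]$ to $[0, 1]$ so that all three factors are non-negative, and $\varepsilon$ (0.000001 in the query) guards against division by zero. The candidate with the highest score is returned.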

In [17]:
%%sql

SELECT "V3"."WORD" AS "lookup_word",
       "V4"."WORD" AS "related_word",
       ((1+ COSINE_SIMILARITY("V4"."WV", "V3"."WV"))/2 * (1+COSINE_SIMILARITY("V4"."WV", "V2"."WV"))/2) / ((1+COSINE_SIMILARITY("V4"."WV", "V1"."WV"))/2 + 0.000001) AS "3COSMUL_SCORE"
FROM "DEVCHALLENGER"."GOOGLE_NEWS" AS "V4"
INNER JOIN "DEVCHALLENGER"."GOOGLE_NEWS" AS "V1" ON "V1"."WORD"=:word1
INNER JOIN "DEVCHALLENGER"."GOOGLE_NEWS" AS "V2" ON "V2"."WORD"=:related_word1
INNER JOIN "DEVCHALLENGER"."GOOGLE_NEWS" AS "V3" ON "V3"."WORD"=:word2
WHERE "V4"."WORD"<>"V2"."WORD"
  AND "V4"."WORD"<>"V1"."WORD"
  AND "V4"."WORD"<>"V3"."WORD"
ORDER BY 3 DESC
LIMIT 1

 * hana://userkey=myDevChallenger
Done.


lookup_word,related_word,3COSMUL_SCORE
Melbourne,Australia,1.048209669332026
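
To see how the score behaves, here is a minimal sketch of the same calculation in plain Python. The 3-dimensional embeddings and the helper names are hypothetical and chosen only to make the arithmetic easy to follow; the real vectors in `GOOGLE_NEWS` are far larger.

```python
import math

def cos_sim(u, v):
    """Plain cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    return (1 + cos_sim(u, v)) / 2

def three_cos_mul(candidate, a, a_star, b, eps=0.000001):
    """Reward closeness to b and a_star, penalize closeness to a."""
    return shifted(candidate, b) * shifted(candidate, a_star) / (shifted(candidate, a) + eps)

# Hypothetical toy embeddings: axis 0 ~ "city", axis 1 ~ "India", axis 2 ~ "Australia".
vectors = {
    "Bengaluru": [1.0, 1.0, 0.0],
    "India":     [0.0, 1.0, 0.0],
    "Melbourne": [1.0, 0.0, 1.0],
    "Australia": [0.0, 0.0, 1.0],
    "Sydney":    [1.0, 0.0, 1.1],
}

a, a_star, b = "Bengaluru", "India", "Melbourne"
scores = {
    word: three_cos_mul(vec, vectors[a], vectors[a_star], vectors[b])
    for word, vec in vectors.items()
    if word not in (a, a_star, b)  # same exclusions as the WHERE clause
}
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy data, `Sydney` is nearer to `Melbourne` than `Australia` is, but it is also similar to `Bengaluru` (both are cities), so the denominator pulls its score down and `Australia` wins. Because the score is a ratio rather than a probability, values above 1, like the one in the query output, occur whenever the numerator exceeds the denominator.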


Experiment with the values of the variables `word1`, `related_word1`, and `word2` above to come up with your own analogy query example, and paste a screenshot of the result in the submission thread: https://community.sap.com/t5/application-development-discussions/submissions-for-quot-sap-hana-cloud-multi-model-quot-developer-challenge/m-p/13728400#M2028459

To give you an initial idea: try `Monday`, `one`, `Tuesday` 🤓

Remember that:
1. although the model was trained on a large amount of text, not all relationships might be captured, especially for topics with less coverage on Google News,
2. you loaded only 100,000 tokens (words and phrases) out of the 3,000,000 in the original model, so although it should cover the most common words, it might not include every word you try.