## Connect to your SAP HANA database

In [None]:
%run "../01-check_setup.ipynb"

Load the `ipython-sql` magic extension, which runs SQL through SQLAlchemy: https://pypi.org/project/ipython-sql/#description

In [2]:
%load_ext sql

# Use `%config SqlMagic` for configuration changes, if needed

Connect using the SQLAlchemy dialect for SAP HANA: https://pypi.org/project/sqlalchemy-hana/#description

In [3]:
%sql hana://{os.environ["HANADB_USR"]}:{os.environ["HANADB_PWD"]}@{os.environ["HANADB_URL"]}:{os.environ["HANADB_PRT"]}
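The connection string above is assembled from four environment variables. If your password contains characters such as `@` or `/`, the URL may be parsed incorrectly unless those characters are URL-encoded. The following is a minimal sketch of one way to build the same URL with encoding applied; `build_hana_url` is a hypothetical helper introduced for illustration, and the example values are made up.

```python
import os
from urllib.parse import quote_plus


def build_hana_url(env=os.environ):
    """Assemble a hana:// URL from the HANADB_* variables (hypothetical helper)."""
    user = quote_plus(env["HANADB_USR"])
    # URL-encode the password so characters like '@' or '/' do not break parsing.
    password = quote_plus(env["HANADB_PWD"])
    host = env["HANADB_URL"]
    port = env["HANADB_PRT"]
    return f"hana://{user}:{password}@{host}:{port}"


# Made-up values for illustration only; real values come from your environment.
example_env = {
    "HANADB_USR": "DBUSER",
    "HANADB_PWD": "p@ss/word",
    "HANADB_URL": "example.hana.invalid",
    "HANADB_PRT": "443",
}
print(build_hana_url(example_env))
```

The resulting string could then be passed to `%sql` in place of the inline f-string style expression, assuming the same four variables are set.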

In [None]:
%%sql 

SELECT * FROM "VECTORS"."GOOGLE_NEWS" WHERE WORD IN ('cat', 'CAT')

## Closely related words

Find the 5 most closely related words for **cat** and **CAT**, ranked by the [`COSINE_SIMILARITY()` SQL function](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/cosine-similarity-063e1366a7d54735b98b2513ea4a88c9), and display the [`L2DISTANCE()`](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/l2distance) for each result as well.

Compare running the query with and without a vector index: https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/examples-for-select-statements-using-vector-indexes
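To build intuition for the two measures used in the queries below, here is a minimal pure-Python sketch of their textbook definitions. It illustrates the math only and is not how SAP HANA implements `COSINE_SIMILARITY()` or `L2DISTANCE()`; the function names and the toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, made up for illustration.
a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, but twice as long
c = [0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b), l2_distance(a, b))
print(cosine_similarity(a, c), l2_distance(a, c))
```

Cosine similarity depends only on direction, so `a` and `b` count as identical by that measure even though their L2 distance is not zero. This is one reason the two columns in the query results can rank candidates differently.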

In [None]:
%%sql

SELECT
  RN,
  WORD,
  RELATED_WORD,
  SIMILARITY_SCORE,
  L2DISTANCE
FROM (
  SELECT
    B.WORD AS WORD,
    A.WORD AS RELATED_WORD,
    COSINE_SIMILARITY(A.WV, B.WV) AS SIMILARITY_SCORE,
    L2DISTANCE(A.WV, B.WV) AS L2DISTANCE,
    ROW_NUMBER() OVER (PARTITION BY B.WORD ORDER BY COSINE_SIMILARITY(A.WV, B.WV) DESC) AS RN
  FROM
    "VECTORS"."GOOGLE_NEWS" A,
    (SELECT WV, WORD FROM "VECTORS"."GOOGLE_NEWS" WHERE WORD IN ('cat', 'CAT')) B
  WHERE
    A.WORD <> B.WORD
) AS RankedWords
WHERE
  RN <= 5
ORDER BY
  WORD DESC,
  SIMILARITY_SCORE DESC
WITH HINT (NO_VECTOR_INDEX);

In [None]:
%%sql

SELECT
  RN,
  WORD,
  RELATED_WORD,
  SIMILARITY_SCORE,
  L2DISTANCE
FROM (
  SELECT
    B.WORD AS WORD,
    A.WORD AS RELATED_WORD,
    COSINE_SIMILARITY(A.WV, B.WV) AS SIMILARITY_SCORE,
    L2DISTANCE(A.WV, B.WV) AS L2DISTANCE,
    ROW_NUMBER() OVER (PARTITION BY B.WORD ORDER BY COSINE_SIMILARITY(A.WV, B.WV) DESC) AS RN
  FROM
    "VECTORS"."GOOGLE_NEWS" A,
    (SELECT WV, WORD FROM "VECTORS"."GOOGLE_NEWS" WHERE WORD IN ('cat', 'CAT')) B
  WHERE
    A.WORD <> B.WORD
) AS RankedWords
WHERE
  RN <= 5
ORDER BY
  WORD DESC,
  SIMILARITY_SCORE DESC
WITH HINT (VECTOR_INDEX);

You should see that **cat** is related to pets, and **CAT** is distantly related to some other acronyms. 

## Analogy queries

With words represented as vectors, you can run vector computations on them, such as analogy queries.

The famous example, also described on Wikipedia, is ["What word relates to **queen** the way **man** relates to **king**?"](https://en.wikipedia.org/wiki/Word2vec#Preservation_of_semantic_and_syntactic_relationships).

In [6]:
word1='king'
related_word1='man'
word2='queen'

Notice how the Python variables `word1`, `related_word1`, and `word2` are referenced in the SQL statement below as `:word1`, `:related_word1`, and `:word2`.

The SQL query uses the **3CosMul** calculation presented in [Linguistic Regularities in Sparse and Explicit Word Representations](https://aclanthology.org/W14-1618/) by Omer Levy and Yoav Goldberg.
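As a rough illustration of the scoring expression in that query, the sketch below applies the same arithmetic in pure Python: each cosine similarity is shifted from [-1, 1] to [0, 1] via `(1 + cos) / 2`, and the candidate's score is the product of its similarities to `word2` and `related_word1`, divided by its similarity to `word1` plus a small epsilon. The 3-dimensional vectors and helper names are made up for this example and are not taken from the Google News model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def shifted(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2


def cos_mul_score(candidate, word1, related_word1, word2, eps=0.000001):
    """Mirror of the 3COSMUL_SCORE expression in the SQL query (illustrative)."""
    numerator = shifted(candidate, word2) * shifted(candidate, related_word1)
    return numerator / (shifted(candidate, word1) + eps)


# Made-up toy vectors; dimensions loosely mean [royalty, maleness, personhood].
toy = {
    "king": [1.0, 1.0, 1.0],
    "man": [0.0, 1.0, 1.0],
    "queen": [1.0, -1.0, 1.0],
    "woman": [0.0, -1.0, 1.0],
    "throne": [1.0, 0.0, 0.0],
    "boy": [0.0, 1.0, 0.5],
}

inputs = ("king", "man", "queen")  # word1, related_word1, word2
scores = {
    word: cos_mul_score(vec, toy["king"], toy["man"], toy["queen"])
    for word, vec in toy.items()
    if word not in inputs  # same exclusion as the WHERE clause in the query
}
best = max(scores, key=scores.get)
print(best, scores)
```

Dividing by the similarity to `word1` penalizes candidates that are merely close to `word1`, which is why a word like `throne` scores lower than `woman` in this toy setup.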

In [None]:
%%sql

SELECT "V3"."WORD" AS "lookup_word",
       "V4"."WORD" AS "related_word",
       ((1+ COSINE_SIMILARITY("V4"."WV", "V3"."WV"))/2 * (1+COSINE_SIMILARITY("V4"."WV", "V2"."WV"))/2) / ((1+COSINE_SIMILARITY("V4"."WV", "V1"."WV"))/2 + 0.000001) AS "3COSMUL_SCORE"
FROM "VECTORS"."GOOGLE_NEWS" AS "V4"
INNER JOIN "VECTORS"."GOOGLE_NEWS" AS "V1" ON "V1"."WORD"=:word1
INNER JOIN "VECTORS"."GOOGLE_NEWS" AS "V2" ON "V2"."WORD"=:related_word1
INNER JOIN "VECTORS"."GOOGLE_NEWS" AS "V3" ON "V3"."WORD"=:word2
WHERE "V4"."WORD"<>"V2"."WORD"
  AND "V4"."WORD"<>"V1"."WORD"
  AND "V4"."WORD"<>"V3"."WORD"
ORDER BY 3 DESC
LIMIT 1
-- WITH HINT (NO_VECTOR_INDEX)

Experiment with the values of the variables `word1`, `related_word1`, and `word2` above to come up with your own analogy query results!

To give you an initial idea: try `Monday`, `one`, `Tuesday` 🤓

Remember that:
1. although the model was trained on a large amount of text, not all relationships may be captured, especially for topics with less coverage on Google News,
2. if you loaded only 100000 tokens (words and phrases) out of the 3000000 in the original model, they should cover the most common words but will not include every word.