Check the [Week 2 challenge description](../../challenges/week2.md) if you missed the required setup steps for this week.

## Import vectors trained on Google News

In [None]:
from gensim import downloader, models

Check the details of the word vectors model `word2vec-google-news-300` available in Gensim.

It was trained on about 100 billion words from Google News. You can see it stores vectors for 3 million different tokens (words, phrases, parts of words), and its file size is large: 1.7 GB even when compressed with gzip.
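As a rough cross-check of these numbers, the arithmetic below estimates the uncompressed size of the vector values alone. It is a sketch that assumes every vector component is stored as a 4-byte float32 and ignores the token strings themselves; `raw_vector_bytes` is a hypothetical helper name, not part of Gensim.

```python
# Assumption: each vector component occupies 4 bytes (float32).
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_tokens: int, dimensions: int) -> int:
    """Return the number of bytes needed for the vector values alone."""
    return num_tokens * dimensions * BYTES_PER_FLOAT32


full_model_bytes = raw_vector_bytes(3_000_000, 300)
print(f"{full_model_bytes / 1e9:.1f} GB")  # 3.6 GB
```

Under that assumption the uncompressed vectors need roughly twice the space of the gzip download.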

In [None]:
downloader.info('word2vec-google-news-300')

Depending on your network bandwidth, downloading the 1.7 GB model file in the cell below should take about 1-4 minutes.

In [None]:
%%time
mymodel_path = downloader.load('word2vec-google-news-300', return_path=True)

In [None]:
print(mymodel_path)

For this exercise you are not going to load all 3 million records, as it takes too long and might stretch the capacity of your trial account.

Therefore, set `mylimit_size` to 100000, which is sufficient for practice. Loading all 3000000 records would take about 6 minutes.

In [None]:
mylimit_size=100000
mymodel = models.KeyedVectors.load_word2vec_format(mymodel_path, binary=True, limit=mylimit_size)
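To make the effect of `limit` more concrete, here is a simplified sketch of reading only the first records of a file in the word2vec binary layout (a text header, then each token followed by its float32 values). It is not Gensim's implementation; `write_word2vec_binary`, `read_word2vec_binary`, and the toy vectors are hypothetical and exist only for illustration.

```python
import io
import struct


def write_word2vec_binary(stream, vectors):
    """Write a dict of token -> list of floats in a word2vec-style binary layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n".encode("utf-8"))  # header line
    for word, values in vectors.items():
        stream.write(word.encode("utf-8") + b" ")
        stream.write(struct.pack(f"<{dim}f", *values))  # little-endian float32
        stream.write(b"\n")


def read_word2vec_binary(stream, limit=None):
    """Read at most `limit` records, counted from the start of the stream."""
    vocab_size, dim = (int(part) for part in stream.readline().split())
    count = vocab_size if limit is None else min(vocab_size, limit)
    result = {}
    for _ in range(count):
        word_bytes = bytearray()
        while True:
            char = stream.read(1)
            if char == b"":
                raise EOFError("unexpected end of stream")
            if char == b" ":
                break
            if char != b"\n":  # skip the newline left over from the previous record
                word_bytes.extend(char)
        values = struct.unpack(f"<{dim}f", stream.read(4 * dim))
        result[word_bytes.decode("utf-8")] = list(values)
    return result


toy = {"the": [0.5, -0.25], "dog": [1.0, 0.0], "zebra": [0.125, 2.0]}
buffer = io.BytesIO()
write_word2vec_binary(buffer, toy)
buffer.seek(0)
first_two = read_word2vec_binary(buffer, limit=2)
print(list(first_two))  # ['the', 'dog']
```

Because reading stops after the first `limit` records, which tokens you get depends on the file order. If the file is sorted by descending frequency, as word2vec output usually is, the limit keeps the most frequent tokens.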

## Convert the model for loading into SAP HANA db

It should take about 20 seconds to convert the data from the model to the Python list that you can load into the SAP HANA db instance.

In [None]:
%%time
myrecords=list()

for index, word in enumerate(mymodel.index_to_key):
    myrecord=(index, word, str(mymodel[word].tolist()))
    myrecords.append(myrecord)
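The shape of each record is easier to see on a toy example. The sketch below builds the same `(index, word, vector-as-text)` tuples from a small hypothetical dictionary instead of the real model; `toy_vectors` and `toy_records` are made-up names and values.

```python
# Hypothetical 3-dimensional vectors standing in for the 300-dimensional model.
toy_vectors = {"dog": [0.5, -0.25, 1.0], "cat": [0.25, 0.0, -1.0]}

# Same structure as the loop above: position, token, vector rendered as a string.
toy_records = [
    (index, word, str(vector))
    for index, (word, vector) in enumerate(toy_vectors.items())
]

print(toy_records[0])  # (0, 'dog', '[0.5, -0.25, 1.0]')
```

The vector is passed as text such as `'[0.5, -0.25, 1.0]'` so that the database can convert it to a vector value during the insert.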

## Load the model into SAP HANA's Vector Engine

In [None]:
import os

# https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/2dbfa39ecc364a65a6ab0fea9c8c8bd9.html?#secure-user-store-(hdbuserstore)-environment-variables

os.environ["HDB_USE_IDENT"]=os.getenv("WORKSPACE_ID")
print(os.getenv("HDB_USE_IDENT"))
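If `WORKSPACE_ID` is not set in your environment, `os.getenv` returns `None`, and assigning `None` to `os.environ` raises a `TypeError`. A more defensive variant might look like the sketch below; `set_hdb_use_ident` is a hypothetical helper and `"ws-demo"` is a made-up value.

```python
import os


def set_hdb_use_ident(environ=os.environ):
    """Copy WORKSPACE_ID into HDB_USE_IDENT; return the value, or None if unset."""
    workspace_id = environ.get("WORKSPACE_ID")
    if workspace_id is None:
        return None  # leave the environment untouched
    environ["HDB_USE_IDENT"] = workspace_id
    return workspace_id


# Demonstration on a plain dict, so the real environment is not modified.
demo_env = {"WORKSPACE_ID": "ws-demo"}
print(set_hdb_use_ident(demo_env))  # ws-demo
print(set_hdb_use_ident({}))        # None
```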

In [None]:
from hana_ml import dataframe as hdf

In [None]:
myconn=hdf.ConnectionContext(userkey='myDevChallenger')
print("SAP HANA DB version: ", myconn.hana_version())

The statement below drops the database table `"GOOGLE_NEWS"` if it already exists!

If the table does not exist, the cell just prints an error message, like `An error occurred: 'invalid table name: GOOGLE_NEWS ...'`

In [None]:
myconn.connection.setautocommit(True)
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('DROP TABLE "GOOGLE_NEWS"')
    myconn.connection.commit()

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)
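An alternative that avoids the expected error message is to check for the table first. The sketch below assumes that the hana-ml connection context offers `has_table()` and `drop_table()` methods that behave as their names suggest; `drop_table_if_exists` and the `FakeConnection` stand-in are hypothetical and only there so that the example runs without a database.

```python
def drop_table_if_exists(conn, table_name):
    """Drop table_name only if it exists; return True if a table was dropped."""
    if conn.has_table(table_name):
        conn.drop_table(table_name)
        return True
    return False


class FakeConnection:
    """Hypothetical in-memory stand-in for a database connection context."""

    def __init__(self, tables):
        self.tables = set(tables)

    def has_table(self, table, schema=None):
        return table in self.tables

    def drop_table(self, table, schema=None):
        self.tables.discard(table)


fake = FakeConnection({"GOOGLE_NEWS"})
print(drop_table_if_exists(fake, "GOOGLE_NEWS"))  # True
print(drop_table_if_exists(fake, "GOOGLE_NEWS"))  # False
```

With a real connection you would pass `myconn` instead of the stand-in.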

Use the `create_table()` method of the hana-ml package to create a physical table in your SAP HANA db instance. Note the data type `REAL_VECTOR(300)`, available in the SAP HANA database in SAP HANA Cloud starting with the 2024/Q1 release.

`300` is the dimensionality of the vectors to be stored in this column.

In [None]:
myconn.create_table(
    "GOOGLE_NEWS", 
    table_structure={
        "ID":"INT", 
        "WORD":"NVARCHAR(5000)", 
        "WV": "REAL_VECTOR(300)"
        }
    )
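For orientation, the sketch below builds a SQL statement of roughly the kind that such a call corresponds to. The exact statement that hana-ml generates may differ; `build_create_table_sql` is a hypothetical helper and the choice of a column table is an assumption.

```python
def build_create_table_sql(table_name, table_structure):
    """Build an illustrative CREATE TABLE statement from a column -> type mapping."""
    columns = ", ".join(
        f'"{name}" {sql_type}' for name, sql_type in table_structure.items()
    )
    # Assumption: a column-store table, the usual default in SAP HANA.
    return f'CREATE COLUMN TABLE "{table_name}" ({columns})'


ddl = build_create_table_sql(
    "GOOGLE_NEWS",
    {"ID": "INT", "WORD": "NVARCHAR(5000)", "WV": "REAL_VECTOR(300)"},
)
print(ddl)
```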

You should see the `GOOGLE_NEWS` table name in the output below.

In [None]:
myconn.get_tables()

Use the [`executemany` method](https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/15e46b843c8045ec854d6375790cd504.html) from the SAP HANA Client Interface to insert the records from the Python list object into the SAP HANA database table.

Inserting all 3000000 records might take up to 20 minutes, but 100000 records take only about 10 seconds.

In [None]:
%%time
myconn.connection.setautocommit(False)
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('TRUNCATE TABLE "GOOGLE_NEWS"')
    # Use the executemany method to insert the data
    mycursor.executemany(
        operation = '''INSERT INTO "GOOGLE_NEWS"("ID", "WORD", "WV") VALUES (?, ?, TO_REAL_VECTOR(?))''', 
        list_of_parameters = myrecords
    )

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)
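If you later load many more records, one option is to pass them to `executemany` in smaller chunks rather than as one large list. The sketch below shows only the chunking step; `iter_batches`, the sample records, and the batch size are hypothetical, and whether batching helps in your environment is an assumption you would need to measure.

```python
def iter_batches(records, batch_size):
    """Yield consecutive slices of records with at most batch_size items each."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Made-up records in the same (ID, WORD, WV-as-text) shape as myrecords.
sample = [(i, f"w{i}", "[0.0]") for i in range(5)]
sizes = [len(batch) for batch in iter_batches(sample, 2)]
print(sizes)  # [2, 2, 1]
```

Each yielded batch could then be passed as `list_of_parameters` to a separate `executemany` call inside the same transaction.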

Note that the statement above does not commit the transaction, so the records are not visible to other processes in the database until the commit below is executed.

In [None]:
%%time
try:
    # Commit the transaction to save the changes
    myconn.connection.commit()

finally:
    # Close the cursor when done (the connection stays open for the cells below)
    mycursor.close()

## Check the data in the database table

In [None]:
myconn.table("GOOGLE_NEWS").count()

The statement below returns a few records, each with a text rendering of its vector value.

In [None]:
(
    myconn
    .table("GOOGLE_NEWS")
    .filter("UPPER(WORD) LIKE 'DOG'")
    .select('ID', 'WORD', ('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(3)
    .collect()
)

Look at the vector representation of the word **dog**.

In [None]:
import json

json.loads(
    myconn
    .table("GOOGLE_NEWS")
    .filter("WORD = 'dog'")
    .select(('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(1)
    .collect()
    .WORD_VECTOR[0]
)
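The cell above relies on the text form of a vector being a JSON-compatible array. Under that assumption, a small check like the sketch below can confirm that a parsed vector has the expected number of dimensions; `parse_vector` and the sample string are hypothetical.

```python
import json


def parse_vector(text, expected_dim):
    """Parse a JSON-style array of numbers and verify its length."""
    values = json.loads(text)
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    return [float(value) for value in values]


print(parse_vector("[0.5,-0.25,1]", 3))  # [0.5, -0.25, 1.0]
```

For the table in this exercise you would call it with `expected_dim=300`.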