This notebook requires a [`conda` environment](https://docs.anaconda.com/miniconda/install/#quickstart-install-instructions), because `pip install gensim` fails in the build phase on SAP Business Application Studio (BAS):
```shell
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda create -n gensim -c conda-forge python=3.11
conda activate gensim
# gensim itself is installed from conda-forge, because the pip build fails
conda install -c conda-forge gensim tensorflow-cpu ipykernel pillow pandas
```

## Import vectors trained on Google News

In [None]:
from gensim import downloader, models

Check the details of the word vectors model `word2vec-google-news-300` available in [Gensim](https://radimrehurek.com/gensim/intro.html#what-is-gensim).

It was trained on Google News using about 100 billion words. You can see that it stores vectors for 3 million different tokens (words, phrases, parts of words), and its file size is considerable: 1.7 GB even when compressed with gzip.

In [None]:
downloader.info('word2vec-google-news-300')
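
As a rough cross-check of these numbers, the uncompressed size of the vectors can be estimated with simple arithmetic. This is only a sketch: it assumes that every vector component is stored as a 4-byte `float32` value, and it ignores the space taken by the token strings themselves.

```python
# Back-of-the-envelope size of the uncompressed vectors.
NUM_TOKENS = 3_000_000     # tokens reported for the model
DIMENSIONS = 300           # components per vector
BYTES_PER_COMPONENT = 4    # assumption: float32 components

raw_bytes = NUM_TOKENS * DIMENSIONS * BYTES_PER_COMPONENT
raw_gb = raw_bytes / 1e9

print(f"{raw_bytes:,} bytes, about {raw_gb:.1f} GB before compression")
```

Under that assumption the vectors alone take about 3.6 GB, which is consistent with a gzip-compressed download of 1.7 GB.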

Depending on the bandwidth of your network, it should take about 1-4 minutes to download the 1.7 GB model file below.

In [None]:
%%time
mymodel_path = downloader.load('word2vec-google-news-300', return_path=True)

In [None]:
print(mymodel_path)

For this exercise you do not load all 3 million records, because that takes too long and might stretch the capacity of your trial account.

Instead, you set `mylimit_size` to 100000, which is sufficient for practice. Loading all 3000000 vectors would take about 6 minutes.

In [None]:
mylimit_size = 100000
# Path as printed by the previous cell; adjust it if your download location differs
mymodel_path = '/home/user/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'
mymodel = models.KeyedVectors.load_word2vec_format(mymodel_path, binary=True, limit=mylimit_size)

## Convert the model to be loaded into SAP HANA db

It should take about 20 seconds to convert the data from the model to the Python list that you can load into the SAP HANA db instance.

In [None]:
%%time
myrecords = []

for index, word in enumerate(mymodel.index_to_key):
    # Each record is (ID, word, vector as text), with the vector written as a bracketed list
    myrecord = (index, word, str(mymodel[word].tolist()))
    myrecords.append(myrecord)

print(len(myrecords))
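
To make the record layout concrete, here is a minimal sketch that builds the same kind of tuples from a tiny stand-in dictionary instead of the Gensim model. The names `toy_vectors` and `toy_records` and the 3-component vectors are made up for illustration; the real vectors have 300 components.

```python
# Hypothetical stand-in for the model: word -> vector components
toy_vectors = {
    "dog": [0.5, -0.25, 1.0],
    "cat": [0.125, 0.75, -1.5],
}

toy_records = []
for index, word in enumerate(toy_vectors):
    # str() of a list of floats gives a bracketed, comma-separated text value
    toy_records.append((index, word, str(toy_vectors[word])))

print(toy_records[0])
```

The third element of each tuple is plain text such as `'[0.5, -0.25, 1.0]'`. That text value is what gets converted to a vector on the database side when the records are inserted later in this notebook.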

In [None]:
import pickle

# Open a file in binary write mode
with open('myrecords.pkl', 'wb') as file:
    # Serialize the list and write it to the file
    pickle.dump(myrecords, file)

> Switch to the virtual environment now to load the records into the SAP HANA db.

In [None]:
import pickle

# Open the file in binary read mode
with open('myrecords.pkl', 'rb') as file:
    # Deserialize the list from the file
    myrecords = pickle.load(file)

print(len(myrecords))

## Load the model into SAP HANA's Vector Engine

In [None]:
%run "../01-check_setup.ipynb"

The statement below drops the database table `"GOOGLE_NEWS"` if it already exists!

If the table does not exist, the cell just prints an error message, such as `An error occurred: 'invalid table name: GOOGLE_NEWS ...'`

In [None]:
myconn.connection.setautocommit(True)
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('DROP TABLE "VECTORS"."GOOGLE_NEWS"')
    myconn.connection.commit()

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)

Use the `create_table()` method of the hana-ml package to create a physical table in your SAP HANA db instance. Note the use of the data type [`REAL_VECTOR(300)`](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/real-vector-data-type), available in the SAP HANA database in SAP HANA Cloud starting with the 2024/Q1 release.

`300` is the dimensionality of the vectors to be stored in this column.

In [None]:
myconn.create_table(
    "GOOGLE_NEWS", 
    schema="VECTORS",
    table_structure={
        "ID":"INT", 
        "WORD":"NVARCHAR(5000)", 
        "WV": "REAL_VECTOR(300)"
        }
    )
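
Because the `WV` column is declared with a fixed dimensionality, it can be worth checking the records before inserting them. The following is a minimal sketch, assuming that a `REAL_VECTOR(300)` column only accepts vectors with exactly 300 components. The helper name `find_wrong_dimensions` and the toy records are hypothetical and not part of hana-ml.

```python
import json

def find_wrong_dimensions(records, expected_dim):
    """Return the IDs of records whose vector text does not have expected_dim components."""
    bad_ids = []
    for record_id, _word, vector_text in records:
        # The vector text is a bracketed list of numbers, so json can parse it
        if len(json.loads(vector_text)) != expected_dim:
            bad_ids.append(record_id)
    return bad_ids

# Toy records with 3 components instead of 300, for illustration only
sample_records = [
    (0, "dog", "[0.5, -0.25, 1.0]"),
    (1, "cat", "[0.125, 0.75]"),
]
print(find_wrong_dimensions(sample_records, expected_dim=3))
```

In this notebook the equivalent call would be `find_wrong_dimensions(myrecords, 300)`, which should return an empty list if every record carries a 300-component vector.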

You should see the `GOOGLE_NEWS` table name returned below.

In [None]:
myconn.get_tables(schema="VECTORS")

Use the [`executemany` method](https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/15e46b843c8045ec854d6375790cd504.html) from the SAP HANA Client Interface to insert the records from the Python list object into the SAP HANA database table.

It might take up to 20 minutes for all 3000000 records to be inserted, but only about 10 seconds for 100000 records.

In [None]:
%%time
myconn.connection.setautocommit(False)
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('TRUNCATE TABLE "VECTORS"."GOOGLE_NEWS"')
    # Use the executemany method to insert the data
    mycursor.executemany(
        operation = '''INSERT INTO "VECTORS"."GOOGLE_NEWS"("ID", "WORD", "WV") VALUES (?, ?, TO_REAL_VECTOR(?))''', 
        list_of_parameters = myrecords
    )

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)
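
If passing all records in a single call turns out to be too heavy for your environment, one option is to split the list into smaller batches. The sketch below shows only the splitting step with the standard library. The helper name `chunked` is hypothetical, and whether batching actually helps depends on your client and database instance.

```python
from itertools import islice

def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy input: five items split into batches of two
batches = list(chunked(range(5), 2))
print(batches)
```

Each batch could then be passed as `list_of_parameters` to its own `executemany` call within the same transaction, for example with `chunked(myrecords, 10000)`.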

Note that the `executemany` call above does not commit the transaction, so the records are not visible to other processes in the database table until the connection commit below is executed.

In [None]:
%%time
try:
    # Commit the transaction to save the changes
    myconn.connection.commit()

finally:
    # Close the cursor and the connection when done
    mycursor.close()
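
The effect of a pending commit can be illustrated without a SAP HANA instance. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database, with a hypothetical `words` table and a hypothetical `count_rows` helper. Isolation details differ between database systems, so treat it as an illustration of the general pattern only.

```python
import os
import sqlite3
import tempfile

def count_rows(connection):
    """Count the rows of the demo table and release the read cursor."""
    cursor = connection.execute("SELECT COUNT(*) FROM words")
    (count,) = cursor.fetchone()
    cursor.close()
    return count

# Stand-in database: sqlite3 instead of SAP HANA
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "demo.db")
    writer = sqlite3.connect(db_path)
    reader = sqlite3.connect(db_path)

    writer.execute("CREATE TABLE words (id INTEGER, word TEXT)")
    writer.commit()

    rows = [(0, "dog"), (1, "cat"), (2, "bird")]
    writer.executemany("INSERT INTO words (id, word) VALUES (?, ?)", rows)

    # The second connection does not see the uncommitted rows yet
    visible_before = count_rows(reader)

    writer.commit()
    visible_after = count_rows(reader)

    writer.close()
    reader.close()

print(visible_before, visible_after)
```

The second connection sees no rows before the commit and all three rows afterwards, which mirrors why the commit cell above is needed before other sessions can read the inserted records.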

## Check the data in the database table

In [None]:
myconn.table("GOOGLE_NEWS", schema="VECTORS").count()

The statement below returns a few matching records together with a text preview of their vector values.

In [None]:
(
    myconn
    .table("GOOGLE_NEWS", schema="VECTORS")
    .filter("UPPER(WORD) LIKE 'DOG'")
    .select('ID', 'WORD', ('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(3)
    .collect()
)

Look at the vector representation of the word **dog**.

Note the use of the [`TO_NVARCHAR()` SQL function](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/to-nvarchar-function-data-type-conversion) to display the numerical (and not binary) values of the vector.

In [None]:
import json

json.loads(
    myconn
    .table("GOOGLE_NEWS", schema="VECTORS")
    .filter("WORD = 'dog'")
    .select(('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(1)
    .collect()
    .WORD_VECTOR[0]
)
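
The cell above relies on the fact that the text form of a vector is a bracketed, comma-separated list of numbers, which `json.loads` can parse into a Python list. Here is a minimal sketch of that last step on its own, using a made-up 3-component text value instead of a real query result:

```python
import json

# Hypothetical text form of a vector; the real column holds 300 components
vector_text = "[0.5,-0.25,1.0]"

vector = json.loads(vector_text)
print(type(vector).__name__, len(vector), vector)
```

For a value read from the `GOOGLE_NEWS` table, `len(vector)` should be 300.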