This notebook requires a [`conda` environment](https://docs.anaconda.com/miniconda/install/#quickstart-install-instructions), because `pip install` of `gensim` fails in the build phase on BAS:
```shell
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
source ~/miniconda3/bin/activate
conda create -n gensim -c conda-forge python=3.11
conda activate gensim
conda install -c conda-forge tensorflow-cpu ipykernel pillow pandas gensim
```

## Import Vectors trained on the Google News

In [1]:
from gensim import downloader, models

Check the details of the word vectors model `word2vec-google-news-300` available in [Gensim](https://radimrehurek.com/gensim/intro.html#what-is-gensim).

It was trained on Google News using about 100 billion words. You can see it stores vectors for 3 million different tokens (words, phrases, parts of words), and its raw size is quite big: 1.7 GB compressed with gzip.

In [3]:
downloader.info('word2vec-google-news-300')

{'num_records': 3000000,
 'file_size': 1743563840,
 'base_dataset': 'Google News (about 100 billion words)',
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/__init__.py',
 'license': 'not found',
 'parameters': {'dimension': 300},
 'description': "Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/).",
 'read_more': ['https://code.google.com/archive/p/word2vec/',
  'https://arxiv.org/abs/1301.3781',
  'https://arxiv.org/abs/1310.4546',
  'https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvec

Depending on the bandwidth of your network, it should take about 1-4 minutes to download the 1.7 GB model file in the cell below.

In [4]:
%%time
mymodel_path = downloader.load('word2vec-google-news-300', return_path=True)

CPU times: user 44.6 s, sys: 24.8 s, total: 1min 9s
Wall time: 58 s
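
As a rough cross-check of the download estimate, the transfer time follows from the file size and the network bandwidth. The sketch below is only illustrative: the helper name `download_seconds` and the sample bandwidths are assumptions, it uses the `file_size` reported by `downloader.info()` above, and it ignores protocol overhead and server-side limits.

```python
# File size in bytes, as reported by downloader.info('word2vec-google-news-300')
FILE_SIZE_BYTES = 1743563840


def download_seconds(size_bytes: int, mbit_per_s: float) -> float:
    """Idealized transfer time: bits to move divided by bits per second."""
    return (size_bytes * 8) / (mbit_per_s * 1_000_000)


# Hypothetical bandwidths, chosen only for illustration
for mbit in (50, 100, 250):
    minutes = download_seconds(FILE_SIZE_BYTES, mbit) / 60
    print(f"{mbit:>4} Mbit/s -> {minutes:.1f} min")
```

Real downloads are usually somewhat slower than this idealized figure, so treat it as a lower bound.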


In [5]:
print(mymodel_path)

/home/user/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz


For this exercise you are not going to load all 3 million records, as it takes too long and might stretch the capacity of your trial account.

Therefore you can set `mylimit_size` to 100000 to practice; this is sufficient. Loading all 3000000 records, as the cell below is configured to do, takes about 6 minutes.

In [2]:
mylimit_size=3000000
mymodel_path='/home/user/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'
mymodel = models.KeyedVectors.load_word2vec_format(mymodel_path, binary=True, limit=mylimit_size)
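
To get a feel for why the `limit` parameter matters, the following back-of-the-envelope sketch estimates the memory needed for the raw vectors alone. It assumes 4-byte (float32) components and ignores the vocabulary and Python object overhead; the helper name `vector_memory_bytes` is made up for this illustration.

```python
DIMENSION = 300          # from the 'parameters' entry of downloader.info()
BYTES_PER_COMPONENT = 4  # assumption: float32 components


def vector_memory_bytes(num_vectors: int, dimension: int = DIMENSION) -> int:
    """Raw size of a dense matrix of num_vectors x dimension components."""
    return num_vectors * dimension * BYTES_PER_COMPONENT


for limit in (100_000, 3_000_000):
    size_gb = vector_memory_bytes(limit) / 1e9
    print(f"{limit:>9} vectors -> about {size_gb:.2f} GB")
```

The actual footprint of the loaded model will be higher than this estimate, since the words themselves and their lookup structures also take space.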

## Convert the model to be loaded into SAP HANA db

It should take about 20 seconds to convert 100K records from the model into a Python list that you can load into the SAP HANA db instance, and about 5 minutes for the complete dataset.

In [3]:
%%time
myrecords=list()

for index, word in enumerate(mymodel.index_to_key):
    myrecord=(index, word, str(mymodel[word].tolist()))
    myrecords.append(myrecord)

print(len(myrecords))

3000000
CPU times: user 3min 54s, sys: 44 s, total: 4min 38s
Wall time: 4min 39s
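
The shape of each record matters for the insert later on: the vector is stored as a string such as `'[0.1, -0.2, ...]'`. The sketch below uses a tiny made-up vocabulary (plain Python lists stand in for the model's vectors) to show the same tuple layout and to confirm that the string form parses back into the original numbers.

```python
import json

# Hypothetical stand-in for the model: word -> vector (3 dimensions instead of 300)
toy_vectors = {
    "dog": [0.5, -0.25, 0.125],
    "cat": [0.75, 0.0, -1.0],
}

# Same layout as myrecords: (index, word, vector rendered as a string)
toy_records = [
    (index, word, str(vector))
    for index, (word, vector) in enumerate(toy_vectors.items())
]

print(toy_records[0])

# The string is a bracketed, comma-separated list, so it parses back to the same values
assert json.loads(toy_records[0][2]) == toy_vectors["dog"]
```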


In [4]:
import pickle

# Open a file in binary write mode
with open('/tmp/myrecords.pkl', 'wb') as file:
    # Serialize the list and write it to the file
    pickle.dump(myrecords, file)

> Switch to the virtual environment now to load the data into the SAP HANA db.

In [2]:
import pickle

# Open the file in binary read mode
with open('/tmp/myrecords.pkl', 'rb') as file:
    # Deserialize the list from the file
    myrecords = pickle.load(file)

print(len(myrecords))

3000000


## Load the model into SAP HANA's Vector Engine

In [1]:
%run "../01-check_setup.ipynb"

SAP HANA Client for Python: 2.23.25021400
Connected to SAP HANA db version 4.00.000.00.1758012768 (fa/CE2025.28) 
at c5889dd5-e0f6-4930-8408-94d53ca61dbf.hna0.prod-us10.hanacloud.ondemand.com:443 as CODEJAMHANAAI00
Current time on the SAP HANA server: 2025-09-25 11:57:22.396000


The statement below drops the database table `"GOOGLE_NEWS"` if it already exists!

If the table does not exist, the cell just prints an error message, like `An error occurred: 'invalid table name: GOOGLE_NEWS ...'`

In [4]:
myconn.connection.setautocommit(True)
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('DROP TABLE "VECTORS"."GOOGLE_NEWS"')
    myconn.connection.commit()

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)

Use the `create_table()` method of the hana-ml package to create a physical table in your SAP HANA db instance. Please note the use of the data type [`REAL_VECTOR(300)`](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/real-vector-data-type), available in the SAP HANA database in SAP HANA Cloud starting with the 2024/Q1 release.

`300` is the dimensionality of the vectors to be stored in this column.

In [5]:
myconn.create_table(
    "GOOGLE_NEWS", 
    schema="VECTORS",
    table_structure={
        "ID":"INT", 
        "WORD":"NVARCHAR(5000)", 
        "WV": "REAL_VECTOR(300)"
        }
    )

You should see the `GOOGLE_NEWS` table name returned below.

In [6]:
myconn.get_tables(schema="VECTORS")

Unnamed: 0,TABLE_NAME
0,GOOGLE_NEWS


Use the [`executemany` method](https://help.sap.com/docs/SAP_HANA_CLIENT/f1b440ded6144a54ada97ff95dac7adf/15e46b843c8045ec854d6375790cd504.html) from the SAP HANA Client Interface to insert records from the Python list object into the SAP HANA database table.

It might take up to 20 minutes for all 3000000 records to be inserted, but only about 10 seconds for 100000 records.

In [None]:
import math
import time

# --- Config ---
num_parts = 30                    # how many chunks to split into
parts_to_load = range(1, 31)      # which 1-based parts to load (e.g., range(5, 11) for parts 5..10)

# --- Setup ---
myconn.connection.setautocommit(False)
mycursor = myconn.connection.cursor()

# Truncate only if part 1 is being loaded
if 1 in parts_to_load:
    try:
        mycursor.execute('TRUNCATE TABLE "VECTORS"."GOOGLE_NEWS"')
        myconn.connection.commit()
    except Exception as e:
        myconn.connection.rollback()
        raise RuntimeError(f"Failed to truncate table: {e}") from e

# --- Insert in chunks ---
total = len(myrecords)
chunk_size = math.ceil(total / num_parts) if num_parts > 0 else total

for part_idx_1based in range(1, num_parts + 1):
    if part_idx_1based not in parts_to_load:
        continue

    start = (part_idx_1based - 1) * chunk_size
    end = min(start + chunk_size, total)
    chunk = myrecords[start:end]

    if not chunk:
        continue  # nothing in this part (can happen if num_parts > needed)

    try:
        print(f"Chunk {part_idx_1based} ({len(chunk)} rows) ...", end="", flush=True)
        t0 = time.time()

        mycursor.executemany(
            operation='''INSERT INTO "VECTORS"."GOOGLE_NEWS"("ID", "WORD", "WV") 
                         VALUES (?, ?, TO_REAL_VECTOR(?))''',
            list_of_parameters=chunk
        )
        myconn.connection.commit()

        elapsed = time.time() - t0
        print(f" in {elapsed:.2f} seconds")
    except Exception as e:
        myconn.connection.rollback()
        raise RuntimeError(
            f"Error in part {part_idx_1based} (rows {start}:{end}): {e}"
        ) from e

Chunk 1 (100000 rows) ... in 9.81 seconds
Chunk 2 (100000 rows) ... in 7.20 seconds
Chunk 3 (100000 rows) ... in 7.58 seconds
Chunk 4 (100000 rows) ... in 8.28 seconds
Chunk 5 (100000 rows) ... in 8.25 seconds
Chunk 6 (100000 rows) ... in 8.10 seconds
Chunk 7 (100000 rows) ... in 8.55 seconds
Chunk 8 (100000 rows) ... in 8.80 seconds
Chunk 9 (100000 rows) ... in 8.61 seconds
Chunk 10 (100000 rows) ... in 8.32 seconds
Chunk 11 (100000 rows) ... in 8.59 seconds
Chunk 12 (100000 rows) ... in 9.21 seconds
Chunk 13 (100000 rows) ... in 10.12 seconds
Chunk 14 (100000 rows) ... in 10.89 seconds
Chunk 15 (100000 rows) ... in 10.24 seconds
Chunk 16 (100000 rows) ... in 10.90 seconds
Chunk 17 (100000 rows) ... in 9.65 seconds
Chunk 18 (100000 rows) ... in 9.89 seconds
Chunk 19 (100000 rows) ... in 9.93 seconds
Chunk 20 (100000 rows) ... in 9.63 seconds
Chunk 21 (100000 rows) ... in 9.93 seconds
Chunk 22 (100000 rows) ... in 10.37 seconds
Chunk 23 (100000 rows) ...
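
The chunk arithmetic used in the loop above can be checked in isolation. This minimal sketch (the helper name `chunk_bounds` is hypothetical) reproduces the same `start`/`end` computation without touching the database, which is handy for confirming which rows a given part covers before re-running only a subset via `parts_to_load`.

```python
import math


def chunk_bounds(total: int, num_parts: int):
    """Yield (part number, start, end) using the same slicing rule as the loader."""
    chunk_size = math.ceil(total / num_parts) if num_parts > 0 else total
    for part in range(1, num_parts + 1):
        start = (part - 1) * chunk_size
        end = min(start + chunk_size, total)
        if start < end:  # skip empty trailing parts
            yield part, start, end


bounds = list(chunk_bounds(3_000_000, 30))
print(bounds[0], bounds[-1])
```

With 3000000 records and 30 parts every chunk holds 100000 rows, which matches the row counts printed by the loader.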

Note that with autocommit switched off, inserted records are not visible to other database sessions until the transaction is committed. The loop above already commits after each chunk; the commit below is a final safeguard before the cursor is closed.

In [None]:
%%time
try:
    # Commit the transaction to save the changes
    myconn.connection.commit()

finally:
    # Close the cursor when done
    mycursor.close()

## Build the index

In [3]:
%%time
mycursor = myconn.connection.cursor()

try:
    mycursor.execute('CREATE HNSW VECTOR INDEX CSIDX ON "VECTORS"."GOOGLE_NEWS" ("WV") SIMILARITY FUNCTION COSINE_SIMILARITY ')

except Exception as e:
    # Handle any exceptions and possibly rollback the transaction
    myconn.connection.rollback()
    print("An error occurred:", e)

CPU times: user 8.18 ms, sys: 1.29 ms, total: 9.47 ms
Wall time: 8min 28s


In [4]:
myconn.table("VECTOR_INDEXES").collect()

Unnamed: 0,SCHEMA_NAME,TABLE_NAME,COLUMN_NAME,INDEX_TYPE,INDEX_NAME,SIMILARITY_FUNCTION,BUILD_CONFIGURATION,SEARCH_CONFIGURATION,CREATE_TIME
0,VECTORS,GOOGLE_NEWS,WV,HNSW VECTOR,CSIDX,COSINE_SIMILARITY,"{""M"":64,""efConstruction"":128}","{""efSearch"":256}",2025-09-25 11:58:41.942
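
The index above is built for `COSINE_SIMILARITY`. As a reminder of what that measure computes, here is a minimal pure-Python sketch; the function name `cosine_similarity` and the toy vectors are illustrative only, and this is not how the database evaluates it internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors: same direction, orthogonal, opposite direction
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Values close to 1 mean the two vectors point in nearly the same direction, values near 0 mean they are orthogonal, and values near -1 mean they point in opposite directions. The measure ignores vector length, so only direction counts.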


## Check the data in the database table

In [5]:
myconn.table("GOOGLE_NEWS", schema="VECTORS").count()

3000000

The statement below returns a few records together with a preview of their vector values.

In [6]:
(
    myconn
    .table("GOOGLE_NEWS", schema="VECTORS")
    .filter("UPPER(WORD) LIKE 'DOG'")
    .select('ID', 'WORD', ('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(3)
    .collect()
)

Unnamed: 0,ID,WORD,WORD_VECTOR
0,2043,dog,"[0.05126953,-0.022338867,-0.17285156,0.1611328..."
1,9760,Dog,"[-0.24609375,0.0000123381615,-0.17285156,0.240..."
2,93909,DOG,"[-0.114746094,-0.24023438,0.083496094,0.237304..."


Look at the vector representation of the word **dog**.

Note the use of the [`TO_NVARCHAR()` SQL function](https://help.sap.com/docs/hana-cloud-database/sap-hana-cloud-sap-hana-database-vector-engine-guide/to-nvarchar-function-data-type-conversion) to display the numerical (and not binary) values of the vector.

In [7]:
import json

json.loads(
    myconn
    .table("GOOGLE_NEWS", schema="VECTORS")
    .filter("WORD = 'dog'")
    .select(('TO_NVARCHAR(WV)',"WORD_VECTOR"))
    .head(1)
    .collect()
    .WORD_VECTOR[0]
)

[0.05126953,
 -0.022338867,
 -0.17285156,
 0.16113281,
 -0.084472656,
 0.057373047,
 0.05859375,
 -0.08251953,
 -0.015380859,
 -0.06347656,
 0.1796875,
 -0.42382812,
 -0.022583008,
 -0.16601562,
 -0.025146484,
 0.107421875,
 -0.19921875,
 0.15917969,
 -0.1875,
 -0.12011719,
 0.15527344,
 -0.099121094,
 0.14257812,
 -0.1640625,
 -0.08935547,
 0.20019531,
 -0.14941406,
 0.3203125,
 0.328125,
 0.024414062,
 -0.09716797,
 -0.08203125,
 -0.036376953,
 -0.0859375,
 -0.09863281,
 0.0077819824,
 -0.013427734,
 0.052734375,
 0.1484375,
 0.33398438,
 0.016601562,
 -0.21289062,
 -0.015075684,
 0.052490234,
 -0.107421875,
 -0.08886719,
 0.24902344,
 -0.0703125,
 -0.015991211,
 0.075683594,
 -0.0703125,
 0.119140625,
 0.22949219,
 0.014160156,
 0.115234375,
 0.007507324,
 0.27539062,
 -0.24414062,
 0.296875,
 0.03491211,
 0.2421875,
 0.13574219,
 0.14257812,
 0.017578125,
 0.029296875,
 -0.12158203,
 0.022827148,
 -0.047607422,
 -0.15527344,
 0.0031433105,
 0.34570312,
 0.122558594,
 -0.1953125,
 0