##### NOTE: This notebook was executed in Google Colab on a Tesla T4 GPU because of resource constraints in the local environment.
Date: 12/07/2025 <br>
Author: Wan Xuen <br>
Notebook02: Text Mining for Mental Health Chatbot <br>
Aim: To embed the lemmatized text and save the embeddings to ChromaDB

Check GPU availability

In [1]:
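import torch
# Fail early if no GPU is attached; a bare is_available() call on a non-final line is never displayed
assert torch.cuda.is_available(), "No CUDA GPU detected; switch the Colab runtime to a GPU"
torch.cuda.get_device_name(0)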


'Tesla T4'
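
Because the rest of the notebook assumes a CUDA device, a guarded check can make that dependency explicit. The following is a minimal sketch only: `pick_device` is a hypothetical helper that is not part of the original notebook, and it falls back to the CPU when PyTorch or CUDA is unavailable.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Embedding will run on: {device}")
```

The returned string could be passed as the `device` argument when constructing the `SentenceTransformer` model, so that a CPU fallback is a deliberate choice rather than a silent one.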

Install required libraries

In [2]:
!pip install fastembed
!pip install pyarrow
!pip install sentence-transformers
!pip install langchain
!pip install chromadb
!pip install transformers
!pip install -U langchain-community


Embedding using sentence transformer

In [None]:
import os
# Suppress TensorFlow warnings if not using TF models
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

import dask.dataframe as dd
import numpy as np
import time
from sentence_transformers import SentenceTransformer
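# langchain-community (installed above) hosts these integrations; the langchain.* paths are deprecated aliases
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_core.documents import Document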

In [None]:
file_path = '/content/drive/MyDrive/preprocessing-filterv1.parquet'
ddf = dd.read_parquet(file_path)
df = ddf[['lemmatized_text']].compute()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 329482 entries, 0 to 389386
Data columns (total 1 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   lemmatized_text  329482 non-null  string
dtypes: string(1)
memory usage: 133.8 MB


In [8]:
df.head()

   lemmatized_text
0  subreddit suicidewatch title hey want know peo...
1  subreddit suicidewatch title long survive drin...
2  subreddit suicidewatch title struggle awful th...
4  subreddit suicidewatch title seek compassionat...
8  subreddit suicidewatch title want die body vir...


In [None]:
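model = SentenceTransformer('BAAI/bge-small-en-v1.5')  # placed on the GPU automatically when CUDA is available
corpus = df['lemmatized_text'].tolist()

# Configuration
batch_size = 128           # Texts encoded per model.encode() call
save_every = 10            # Write a chunk file to disk every 10 batches
pause_seconds = 3          # Optional pause to reduce VRAM spikes
output_dir = "embedding_chunks"
os.makedirs(output_dir, exist_ok=True)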

embeddings = []
start = time.time()

# Batch processing
for i in range(0, len(corpus), batch_size):
    batch = corpus[i:i + batch_size]

    # Encode with GPU
    batch_embeddings = model.encode(
        batch,
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=False
    )
    embeddings.append(batch_embeddings)

    # Save every N batches. Batch 0 also matches, so the first chunk holds a single batch.
    batch_idx = i // batch_size
    if batch_idx % save_every == 0:
        # Zero-pad the index so that sorted() later returns the chunks in numeric order
        filename = os.path.join(output_dir, f'embeddings_part_{batch_idx:06d}.npy')
        np.save(filename, np.vstack(embeddings))
        embeddings = []  # clear memory

        # ETA tracker
        elapsed = time.time() - start
        processed = min(i + batch_size, len(corpus))  # the last batch may be short
        per_sample = elapsed / processed
        remaining = len(corpus) - processed
        eta = remaining * per_sample
        print(f"[Progress] {processed}/{len(corpus)} | ETA: {eta/60:.2f} mins")

        time.sleep(pause_seconds)

if embeddings:
    filename = os.path.join(output_dir, 'embeddings_part_final.npy')
    np.save(filename, np.vstack(embeddings))
    print("[Final Save] Remaining embeddings saved.")

print("✅ Embedding complete.")


[Progress] 128/329482 | ETA: 83.36 mins
[Progress] 1408/329482 | ETA: 52.19 mins
[Progress] 2688/329482 | ETA: 51.45 mins
[Progress] 3968/329482 | ETA: 51.43 mins
[Progress] 5248/329482 | ETA: 51.13 mins
[Progress] 6528/329482 | ETA: 51.27 mins
[Progress] 7808/329482 | ETA: 51.45 mins
[Progress] 9088/329482 | ETA: 51.50 mins
[Progress] 10368/329482 | ETA: 51.03 mins
[Progress] 11648/329482 | ETA: 51.09 mins
[Progress] 12928/329482 | ETA: 50.96 mins
[Progress] 14208/329482 | ETA: 51.00 mins
[Progress] 15488/329482 | ETA: 51.08 mins
[Progress] 16768/329482 | ETA: 51.80 mins
[Progress] 18048/329482 | ETA: 52.22 mins
[Progress] 19328/329482 | ETA: 52.10 mins
[Progress] 20608/329482 | ETA: 51.71 mins
[Progress] 21888/329482 | ETA: 51.36 mins
[Progress] 23168/329482 | ETA: 50.89 mins
[Progress] 24448/329482 | ETA: 50.75 mins
[Progress] 25728/329482 | ETA: 50.72 mins
[Progress] 27008/329482 | ETA: 50.51 mins
[Progress] 28288/329482 | ETA: 50.37 mins
[Progress] 29568/329482 | ETA: 50.27 mins
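
The progress figures above (128, 1408, 2688, ...) follow directly from the save rule in the loop: batch 0 triggers a save on its own, and every tenth batch after that flushes ten batches at once. The sketch below replays that schedule without a model, to show how many rows each chunk file would hold and how the ETA is derived. `plan_chunks` and `eta_seconds` are hypothetical helper names introduced for illustration, and the 3000-row corpus is a made-up size.

```python
def plan_chunks(n_rows, batch_size=128, save_every=10):
    """Return (label, row_count) pairs for each chunk the embedding loop would write."""
    chunks = []
    pending = 0  # rows encoded but not yet written to disk
    for start in range(0, n_rows, batch_size):
        pending += min(batch_size, n_rows - start)  # the last batch may be short
        batch_idx = start // batch_size
        if batch_idx % save_every == 0:  # same condition as the notebook loop
            chunks.append((f"part_{batch_idx}", pending))
            pending = 0
    if pending:  # leftover batches after the last periodic save
        chunks.append(("part_final", pending))
    return chunks


def eta_seconds(elapsed, processed, total):
    """Linear extrapolation: average time per row times the rows still to encode."""
    return (total - processed) * (elapsed / processed)


plan = plan_chunks(3000)
for label, rows in plan:
    print(label, rows)
print("ETA (s):", eta_seconds(elapsed=8.0, processed=4, total=12))
```

Because the chunk sizes are uneven (one batch, then ten batches each, then a remainder), any later step that reassembles the chunks has to rely on file order rather than on a fixed chunk size.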

Save to ChromaDB
- ChromaDB is a vector store that makes the embeddings efficiently searchable by semantic similarity
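
As a conceptual illustration of what semantic search over stored vectors means, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It is a pure-Python stand-in under stated assumptions: the two-dimensional vectors are invented, `cosine_similarity` and `top_k` are hypothetical helpers, and this is not how ChromaDB is implemented internally (a real store uses an approximate nearest-neighbour index and its own configured distance metric).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = sorted(
        ((cosine_similarity(query, v), i) for i, v in enumerate(vectors)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


# Toy "embeddings": the first two point in nearly the same direction as the query
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k([1.0, 0.05], toy_vectors, k=2)
print(nearest)
```

The indices returned play the role of the document IDs stored alongside each embedding, which is why the IDs, texts, and vectors must stay in the same order when they are written to the store.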

In [None]:
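import re

texts = df['lemmatized_text'].tolist()

def chunk_order(name):
    # Numbered chunks first, in batch order; 'embeddings_part_final.npy' last
    m = re.search(r'_(\d+)\.npy$', name)
    return (0, int(m.group(1))) if m else (1, 0)

# Plain sorted() is lexicographic and would load part_100 before part_20, misaligning vectors and texts
embedding_files = sorted((f for f in os.listdir("embedding_chunks") if f.endswith(".npy")), key=chunk_order)
embeddings = np.vstack([np.load(os.path.join("embedding_chunks", f)) for f in embedding_files])
assert len(embeddings) == len(texts), "embeddings and texts are misaligned"

# Only needed to construct the Chroma wrapper; the precomputed vectors are added directly below
embedding_func = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")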

# Save to Chroma DB
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_func
)

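from tqdm import tqdm

batch_size = 5000  # rows per add() call
for i in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[i:i + batch_size]
    batch_embeddings = embeddings[i:i + batch_size]
    # Unique IDs, in the same order as the embeddings and texts
    batch_ids = [f"doc_{j}" for j in range(i, min(i + batch_size, len(texts)))]

    # _collection is the underlying chromadb collection (a private attribute of the
    # LangChain wrapper); adding through it reuses the precomputed vectors
    vectorstore._collection.add(
        documents=batch_texts,
        embeddings=batch_embeddings,
        ids=batch_ids
    )

# Chroma 0.4+ persists automatically; persist() is deprecated there and only matters for older versions
vectorstore.persist()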
print("All batches added to Chroma DB.")

100%|██████████| 66/66 [17:01<00:00, 15.48s/it]

All batches added to Chroma DB.
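
One detail worth checking when reloading the chunk files is their order: file names that embed an unpadded batch number do not sort numerically under Python's default string ordering. The sketch below demonstrates the difference on a handful of made-up file names in the same naming pattern; `chunk_sort_key` is a hypothetical helper introduced here for illustration.

```python
import re


def chunk_sort_key(filename):
    """Sort numbered chunks by their integer index; files without a number go last."""
    match = re.search(r"_(\d+)\.npy$", filename)
    return (0, int(match.group(1))) if match else (1, 0)


names = [
    "embeddings_part_0.npy",
    "embeddings_part_10.npy",
    "embeddings_part_100.npy",
    "embeddings_part_20.npy",
    "embeddings_part_final.npy",
]

lexicographic = sorted(names)                  # default string ordering
numeric = sorted(names, key=chunk_sort_key)    # ordering by batch index

print(lexicographic)
print(numeric)
```

With string ordering, `part_100` lands before `part_20`, so stacked embeddings would no longer line up row for row with the texts. Sorting by the parsed integer, or zero-padding the index when the files are written, avoids this.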



