# Telecom Multilingual Vector Store Creation

## Overview
This notebook creates a multilingual vector database for telecom customer service conversations using ChromaDB and BGE-M3 embeddings. The vector store enables semantic search across conversations in German, French, Italian, and English.

## Key Components
- **Data Source**: Processed telecom conversations with topic labels and language metadata
- **Embedding Model**: BAAI/bge-m3 (1024-dimensional multilingual embeddings)
- **Vector Database**: ChromaDB with persistent storage
- **Languages**: German (deu), French (fra), Italian (ita), English (eng)
- **Dataset**: ~40,000 training documents, 400 test documents

## Workflow
1. **Data Preparation**: Load processed conversations with topic labels
2. **Train/Test Split**: Create balanced datasets across languages (10K train, 100 test per language)
3. **Embedding Generation**: Generate BGE-M3 embeddings with GPU acceleration
4. **Vector Store Creation**: Store embeddings in ChromaDB with metadata
5. **Evaluation**: Test multilingual retrieval capabilities
6. **Export**: Save retrieval evaluation results

## Technical Specifications
- **Batch Processing**: 64 documents per batch for memory efficiency
- **GPU Optimization**: CUDA acceleration with memory management
- **Storage**: Persistent ChromaDB database
- **Retrieval**: Semantic similarity search with k=5 results

In [None]:
# Basic Python environment setup with essential data science libraries
# These are pre-installed in the Kaggle environment
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Check available input data files in the Kaggle environment
# This helps verify that our datasets are properly loaded
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Note: Output files are saved to /kaggle/working/ which gets preserved
# Temporary files can be written to /kaggle/temp/ but won't be saved

In [None]:
# Install required packages for multilingual embeddings and vector database

!pip install -U FlagEmbedding
!pip install langchain-chroma
!pip install langchain-community
!pip install langchain-huggingface
!pip install langchain-openai

In [None]:
# Check installed versions of FlagEmbedding and LangChain components
!pip list | grep -E 'FlagEmbedding|langchain'

In [None]:
# Import all necessary libraries for vector store creation
from FlagEmbedding import BGEM3FlagModel  # BGE-M3 multilingual embedding model
from langchain_chroma import Chroma  # ChromaDB vector store integration
from langchain_huggingface.embeddings import HuggingFaceEmbeddings  # Hugging Face embeddings wrapper
from langchain_community.document_loaders import TextLoader, CSVLoader  # Document loaders
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Text chunking utility
from langchain_openai.chat_models import ChatOpenAI  # OpenAI integration

import pandas as pd 

In [None]:
# Load the processed telecom conversation data with NER cleaning applied
df = pd.read_csv("/kaggle/input/telecom-after-ner/after_ner.csv", encoding='UTF-8')

# Keep only the cleaned text and topic label columns for vector store creation.
# index=False avoids writing the row index as an extra 'Unnamed: 0' column,
# so nothing has to be dropped after reloading.
df[['cleaned_text', 'topic_label']].to_csv("/kaggle/working/cleaned_text.csv", index=False)

# Reload the reduced dataset for further processing
df_cleaned_text_metadata = pd.read_csv("/kaggle/working/cleaned_text.csv", encoding='UTF-8')
df_cleaned_text_metadata.head()

In [None]:
# Create balanced train/test datasets across multiple languages
# Each language gets 10,000 training records and 100 test records

# Training data: select 10,000 records for each language.
# The slices assume the source file is ordered in contiguous language blocks.
# .copy() makes each slice an independent frame, so adding the 'language'
# column does not trigger pandas' SettingWithCopyWarning.
train_deu = df_cleaned_text_metadata.iloc[0:10000].copy()  # German training data
train_deu['language'] = 'deu'
train_fra = df_cleaned_text_metadata.iloc[15000:25000].copy()  # French training data
train_fra['language'] = 'fra'
train_ita = df_cleaned_text_metadata.iloc[30000:40000].copy()  # Italian training data
train_ita['language'] = 'ita'
train_eng = df_cleaned_text_metadata.iloc[45000:55000].copy()  # English training data
train_eng['language'] = 'eng'

# Combine all training data into a single dataframe
train_df = pd.concat([train_deu, train_fra, train_ita, train_eng], ignore_index=True)

# Testing data: select 100 records for each language (non-overlapping with training).
# .copy() is used for the same reason as in the training slices.
test_deu = df_cleaned_text_metadata.iloc[10000:10100].copy()  # German test data
test_deu['language'] = 'deu'
test_fra = df_cleaned_text_metadata.iloc[25000:25100].copy()  # French test data
test_fra['language'] = 'fra'
test_ita = df_cleaned_text_metadata.iloc[40000:40100].copy()  # Italian test data
test_ita['language'] = 'ita'
test_eng = df_cleaned_text_metadata.iloc[55000:55100].copy()  # English test data
test_eng['language'] = 'eng'

# Combine all test data into a single dataframe
test_df = pd.concat([test_deu, test_fra, test_ita, test_eng], ignore_index=True)
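
The split above relies entirely on hard-coded row ranges, so a quick sanity check on those ranges can catch leakage between train and test. The following is a minimal sketch using only the ranges from this cell; `SPLITS` and `ranges_overlap` are illustrative names that do not appear elsewhere in the notebook. It does not verify that each block really contains the assumed language, which depends on how `after_ner.csv` is ordered.

```python
# Row ranges copied from the split cell above: (start, stop) pairs, stop exclusive.
# SPLITS and ranges_overlap are illustrative names, not part of the notebook.
SPLITS = {
    "deu": {"train": (0, 10000), "test": (10000, 10100)},
    "fra": {"train": (15000, 25000), "test": (25000, 25100)},
    "ita": {"train": (30000, 40000), "test": (40000, 40100)},
    "eng": {"train": (45000, 55000), "test": (55000, 55100)},
}


def ranges_overlap(a, b):
    """Return True if two half-open (start, stop) ranges share any row."""
    return a[0] < b[1] and b[0] < a[1]


all_ranges = [rng for parts in SPLITS.values() for rng in parts.values()]

# No pair of ranges may overlap, otherwise a row would leak between splits.
for i, first in enumerate(all_ranges):
    for second in all_ranges[i + 1:]:
        assert not ranges_overlap(first, second), (first, second)

# Total rows per split across the four languages.
train_total = sum(p["train"][1] - p["train"][0] for p in SPLITS.values())
test_total = sum(p["test"][1] - p["test"][0] for p in SPLITS.values())
print(train_total, test_total)  # 40000 400
```

Keeping the ranges in one dictionary also makes it easier to change the split sizes in a single place if the source file grows.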

In [None]:
# Preview the first 10 rows of training data
train_df.head(10)

In [None]:
# Analyze topic label diversity across training and testing datasets
# This helps understand the variety of conversation topics available
print("Total unique topic labels in training data:", train_df['topic_label'].nunique())
print("Total unique topic labels in testing data:", test_df['topic_label'].nunique())

In [None]:
# Visualize topic distribution across languages for top 15 topics
# This helps identify which topics are most common and how they're distributed across languages
import matplotlib.pyplot as plt 
import seaborn as sns

# Get top 15 topic labels in train and test datasets
top_train_labels = train_df['topic_label'].value_counts().nlargest(15).index
top_test_labels = test_df['topic_label'].value_counts().nlargest(15).index

# Filter dataframes to include only top labels for clearer visualization
train_top = train_df[train_df['topic_label'].isin(top_train_labels)]
test_top = test_df[test_df['topic_label'].isin(top_test_labels)]

# Create visualization for training data topic distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=train_top, x='topic_label', hue='language', order=top_train_labels)
plt.title('Top 15 Topic Labels in Training Data')
plt.xlabel('Topic Label')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()

# Create visualization for testing data topic distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=test_top, x='topic_label', hue='language', order=top_test_labels)
plt.title('Top 15 Topic Labels in Testing Data')
plt.xlabel('Topic Label')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()

In [None]:
# Analyze common topic labels between training and testing datasets
# This ensures both datasets cover similar conversation topics for proper evaluation
common_labels = set(train_df['topic_label']).intersection(set(test_df['topic_label']))

# Count occurrences of each common label in both datasets
common_labels_count = {
    label: {
        'train': (train_df['topic_label'] == label).sum(),
        'test': (test_df['topic_label'] == label).sum()
    }
    for label in common_labels
}

# Convert to DataFrame for easier analysis
common_labels_df = pd.DataFrame.from_dict(common_labels_count, orient='index').reset_index()
common_labels_df.columns = ['topic_label', 'train_count', 'test_count']

common_labels_df.head()
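
The per-label comparison above rescans both dataframes once for every label. As an alternative sketch, `collections.Counter` counts each split in a single pass, and the same bookkeeping also exposes labels that occur only in the test split. The label lists below are made-up stand-ins for `train_df['topic_label']` and `test_df['topic_label']`.

```python
from collections import Counter

# Toy label lists standing in for the two 'topic_label' columns;
# the label values are made up for illustration.
train_labels = ["billing", "roaming", "billing", "sim_activation", "roaming", "billing"]
test_labels = ["roaming", "billing", "device_setup"]

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)

# Labels present in both splits, with their per-split frequencies.
common = sorted(set(train_counts) & set(test_counts))
rows = [
    {"topic_label": label, "train_count": train_counts[label], "test_count": test_counts[label]}
    for label in common
]

# Labels that appear only in the test split have no counterpart in the vector store.
test_only = sorted(set(test_counts) - set(train_counts))

print(rows)
print(test_only)  # ['device_setup']
```

A non-empty `test_only` list would mean some test conversations cannot retrieve a document with a matching topic label, which is worth knowing before interpreting retrieval results.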

In [None]:
# Save processed train and test datasets for future use
# These will be used for vector store creation and evaluation
print("Saving train and test dataframes to csv files...")
train_df.to_csv(r"/kaggle/working/train_df.csv", index=False)
test_df.to_csv(r"/kaggle/working/test_df.csv", index=False)

In [None]:
# Initialize CSV loader for LangChain document processing
# The loader will treat 'cleaned_text' column as the source content for embeddings
loader = CSVLoader(file_path='/kaggle/working/train_df.csv',
                   encoding = 'UTF-8',
                   source_column= 'cleaned_text')

# Load documents from CSV into LangChain document format
docs = loader.load()

In [None]:
# Preview the structure of loaded documents
# This shows how LangChain formats the document with content and metadata
print(docs[1])

In [None]:
# Initialize BGE-M3 multilingual embedding model
# BGE-M3 is optimized for multilingual semantic search with 1024-dimensional vectors
# Using CUDA for GPU acceleration and normalizing embeddings for better similarity calculations
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-m3",
                                        model_kwargs={'device': 'cuda'},
                                        encode_kwargs={"normalize_embeddings": True})
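
Why normalization helps can be shown with a small worked example. For unit-length vectors, squared Euclidean distance and cosine similarity are tied together by ||a - b||^2 = 2 - 2 (a . b), so ranking documents by either measure gives the same order, whichever of the two the vector store is configured to use. The sketch below uses toy 3-dimensional vectors in place of real 1024-dimensional BGE-M3 embeddings; all helper names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings=True requests)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional vectors standing in for 1024-dimensional BGE-M3 embeddings.
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])]

# For unit vectors: ||q - d||^2 == 2 - 2 * (q . d)
for d in docs:
    assert math.isclose(squared_l2(query, d), 2 - 2 * dot(query, d), abs_tol=1e-12)

# Ranking by descending cosine similarity equals ranking by ascending L2 distance.
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(by_cosine == by_l2)  # True
```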

In [None]:
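# Configure CUDA memory allocation to reduce fragmentation-related out-of-memory errors.
# PyTorch reads this variable when its CUDA allocator is first initialized, so it should
# be set before the first GPU allocation (ideally before the embedding model is loaded).
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'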

In [None]:
# Approach 1: individual record processing (one embedding call per record; slow)
# Run either this cell or the batch cell below, not both: they write the same IDs
# to the same collection.
from chromadb import PersistentClient

# Create persistent ChromaDB client with local storage
client = PersistentClient(path="/kaggle/working/chromadb")
collection = client.get_or_create_collection("telecom_vector_store")

from tqdm import tqdm  # Progress bar for long-running operations

# Process each record individually with embedding generation
# This approach is slower but uses less memory
for idx, row in tqdm(train_df.iterrows(), total=len(train_df)):
    chat_history = row['cleaned_text']  # Extract conversation text
    metadata = {
        "language": row['language'],
        "topic_label": row['topic_label']
    }
    # Generate embedding for single document
    embedding = embedding_model.embed_documents([chat_history])[0]
    # Store in ChromaDB with metadata
    collection.add(
        embeddings=[embedding],
        documents=[chat_history],
        metadatas=[metadata],
        ids=[str(idx)]
    )

print(f"Total records processed: {collection.count()}")

In [None]:
# Approach 2: batch processing (faster: one embedding call per batch of documents)
# Use this instead of the individual-record cell above, not in addition to it
from chromadb import PersistentClient
from tqdm import tqdm
import math

# Initialize ChromaDB client
client = PersistentClient(path="/kaggle/working/chromadb")
collection = client.get_or_create_collection("telecom_vector_store")

# Configure batch processing parameters
batch_size = 64  # Documents per embedding call; lower this if the GPU runs out of memory
num_batches = math.ceil(len(train_df) / batch_size)

# Process data in batches for efficiency
for batch_num in tqdm(range(num_batches), desc="Processing Batches"):
    start_idx = batch_num * batch_size
    end_idx = min((batch_num + 1) * batch_size, len(train_df))
    batch_df = train_df.iloc[start_idx:end_idx]

    # Prepare batch data for embedding generation
    texts = batch_df['cleaned_text'].tolist()
    embeddings = embedding_model.embed_documents(texts)  # Generate embeddings for entire batch
    metadatas = [
        {
            "language": row['language'],
            "topic_label": row['topic_label']
        }
        for _, row in batch_df.iterrows()
    ]
    ids = [str(idx) for idx in batch_df.index]

    # Insert entire batch into ChromaDB
    collection.add(
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas,
        ids=ids
    )

print(f"Total records processed: {collection.count()}")
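
The index arithmetic in the batch loop can be isolated into a small generator, which makes the boundaries easy to check before spending GPU time. This is a sketch under the numbers stated in this notebook (40,000 training rows, 64 rows per batch); `batch_bounds` is an illustrative helper, not something the notebook defines.

```python
import math


def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# Numbers from this notebook: 40,000 training rows, 64 rows per batch.
bounds = list(batch_bounds(40_000, 64))
print(len(bounds), bounds[0], bounds[-1])  # 625 (0, 64) (39936, 40000)

# The generator yields the same number of batches as the math.ceil formula.
assert len(bounds) == math.ceil(40_000 / 64)

# A total that is not a multiple of the batch size ends with a short batch.
print(list(batch_bounds(150, 64)))  # [(0, 64), (64, 128), (128, 150)]
```

Each `(start, end)` pair can be passed straight to `train_df.iloc[start:end]`, which is what the loop above does with `start_idx` and `end_idx`.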

In [None]:
# Open a handle to the persistent ChromaDB collection
# Data stored under this path persists between notebook sessions
from chromadb import PersistentClient
persistent_client = PersistentClient(path="/kaggle/working/chromadb")
persistent_collection = persistent_client.get_or_create_collection("telecom_vector_store")

In [None]:
# Copy records into the persistent collection in batches.
# This step only matters if `collection` was created with an in-memory client.
# In this notebook both handles point to the same on-disk collection, so
# upsert() is used: it overwrites matching IDs instead of adding them again.
record_count = collection.count()
batch_size = 1000  # Records per migration batch

from tqdm import tqdm

# Transfer data in batches to avoid memory issues
for i in tqdm(range(0, record_count, batch_size)):
    batch = collection.get(
        include=["embeddings", "documents", "metadatas"],
        limit=batch_size,
        offset=i
    )
    # Write batch data to the persistent collection
    persistent_collection.upsert(
        ids=batch['ids'],
        embeddings=batch['embeddings'],
        documents=batch['documents'],
        metadatas=batch['metadatas']
    )

In [None]:
# Verify that data migration was successful
# Check the total count in the persistent collection
print("Persistent count:", persistent_collection.count())

In [None]:
# Check the size of the ChromaDB database file
# This helps monitor storage usage and performance
!ls -lh /kaggle/working/chromadb/chroma.sqlite3

In [None]:
# Preview sample documents from the collection
# This helps verify data structure and content quality
sample = collection.peek(1)
print(sample)

In [None]:
# Create LangChain-compatible vector store interface
# This enables integration with LangChain retrieval chains
vector_store = Chroma(
    collection_name="telecom_vector_store",
    embedding_function=embedding_model,
    persist_directory="/kaggle/working/chromadb"
)

In [None]:
# Create retriever interface with k=5 for retrieving top 5 similar documents
# This will be used for semantic search and RAG applications
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [None]:
# Test retrieval functionality with a sample query
# This verifies that semantic search is working correctly
retriever.invoke("Problem with data connectivity")

In [None]:
# Copy pre-existing database from input directory to working directory
# This is useful when using a previously created vector store
import shutil

# Source: read-only input directory with existing ChromaDB
source_dir = '/kaggle/input/telecom-vector-store-new/chromadb'

# Destination: writable working directory
destination_dir = '/kaggle/working/chromadb'

# Copy the entire database directory if it doesn't already exist
if not os.path.exists(destination_dir):
    shutil.copytree(source_dir, destination_dir)
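
The copy-if-missing guard above can be wrapped in a small function and exercised with temporary directories, which is useful for checking the behavior outside Kaggle. This is a sketch only: `copy_if_missing` is an illustrative name, and the placeholder file merely stands in for a real ChromaDB database.

```python
import os
import shutil
import tempfile


def copy_if_missing(source_dir, destination_dir):
    """Copy source_dir to destination_dir unless the destination already exists.

    Returns True when a copy was made, False when it was skipped.
    """
    if os.path.exists(destination_dir):
        return False
    shutil.copytree(source_dir, destination_dir)
    return True


# Demonstration with temporary directories instead of the Kaggle paths.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input", "chromadb")
    dst = os.path.join(tmp, "working", "chromadb")
    os.makedirs(src)
    with open(os.path.join(src, "chroma.sqlite3"), "w") as fh:
        fh.write("placeholder")

    first = copy_if_missing(src, dst)   # destination absent: copy happens
    second = copy_if_missing(src, dst)  # destination present: skipped
    copied_files = sorted(os.listdir(dst))

print(first, second, copied_files)  # True False ['chroma.sqlite3']
```

One consequence of this guard is that an existing copy in the working directory is never refreshed. If the input dataset changes, the destination has to be deleted first.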

In [None]:
# Reinitialize embedding model and vector store after copying database
# This ensures proper connection to the copied ChromaDB
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-m3",
                                        model_kwargs={'device': 'cuda'},
                                        encode_kwargs={"normalize_embeddings": True})

# Connect to the copied vector store
vector_store = Chroma(
    collection_name="telecom_vector_store",
    embedding_function=embedding_model,
    persist_directory="/kaggle/working/chromadb"
)

# Create retriever with k=5 for semantic search
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [None]:
# Define multilingual test queries for retrieval evaluation
# These test queries cover common telecom issues in different languages
language_texts = {
    "German": "Ich habe ein Problem mit meinem internationalen Sprach- und Datenroaming. Ich bin derzeit in LOC und die Anrufqualität ist sehr schlecht, mit viel statischem Rauschen. Außerdem kann ich mich nicht mit dem Internet verbinden. Ich habe bereits versucht, mein iPhone NUM neu zu starten, aber das hat das Problem nicht gelöst. Es ist frustrierend, dass mein Mobilfunkanbieter das Problem nicht direkt beh.",
    "French": "J'ai un problème avec la configuration de mon nouvel appareil Samsung Galaxy SNUM Ultra. Je reçois constamment un message d'erreur indiquant que ma carte SIM n'est pas compatible, bien que j'aie vérifié en ligne et que cela devrait fonctionner. J'ai essayé de redémarrer mon téléphone plusieurs fois, mais cela n'a pas résolu le problème. Je suis frustré car j'essaie de le faire fonctionner depuis",
    "Italian": "Ho un problema con il controllo dell'utilizzo dei dati del mio piano hotspot mobile. Sto cercando di verificare quanti dati ho consumato finora questo mese rispetto alla mia allocazione mensile. Ho bisogno di assistenza per accedere a queste informazioni sul mio account.",
    "English": "I am facing a problem with adding an international roaming plan to my account. I've tried accessing my account online and through the mobile app, but I can't get it to work. I've been on hold for a long time, which is frustrating. When I tried to add the \"Global Traveler\" plan, the system wouldn't allow it due to an alleged issue with my billing address. However, I'm certain that my billing address is correct as I've been using the same one for years."
}

In [None]:
# Test retrieval with German query about roaming issues
# This evaluates cross-language semantic search capabilities
retriever_output = retriever.invoke(language_texts['German'])

In [None]:
# Comprehensive multilingual retrieval evaluation
# Test all language queries and analyze retrieved document metadata
from typing import List

from langchain_core.documents import Document  # Type returned by LangChain retrievers

# Reuse the `language_texts` queries defined in the earlier cell

data = []

# Evaluate retrieval for each language query
for lang_name, text_content in language_texts.items():
    try:
        response_documents: List[Document] = retriever.invoke(text_content)
    except Exception as e:
        print(f"Error calling retriever for {lang_name}: {e}. Skipping this input.")
        continue 

    # Analyze each retrieved document's metadata
    for i, doc in enumerate(response_documents):
        if hasattr(doc, 'metadata') and isinstance(doc.metadata, dict):
            topic_label = doc.metadata.get('topic_label')
            language = doc.metadata.get('language')
            # Store evaluation data for analysis
            data.append({
                'original_language_text': lang_name,
                'document_rank': i + 1, # Rank of retrieved document
                'topic_label': topic_label,
                'language': language
            })

# Create evaluation dataframe
df_retreiver_evaluation = pd.DataFrame(data)

# Build a display copy that shows the query language only once per group.
# The original dataframe keeps the language on every row so the export stays complete.
df_display = df_retreiver_evaluation.copy()
df_display['original_language_text'] = df_display['original_language_text'].mask(
    df_display['original_language_text'].duplicated(), ''
)

df_display

In [None]:
# Export retrieval evaluation results for further analysis
# This saves the multilingual retrieval performance data
df_retreiver_evaluation.to_csv(r"/kaggle/working/df_retreiver_evaluation.csv", index=False)
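
One way to summarize the exported rows is the share of retrieved documents that are in the same language as the query. The sketch below assumes each row still carries its query language in `original_language_text`. The rows are hypothetical and made up for illustration (they are not results from this notebook), and `LANG_CODES` and `same_language_rate` are illustrative names.

```python
from collections import defaultdict

# Hypothetical rows in the shape collected by the evaluation loop above;
# the values are made up for illustration and are not results from this notebook.
rows = [
    {"original_language_text": "German", "document_rank": 1, "topic_label": "roaming", "language": "deu"},
    {"original_language_text": "German", "document_rank": 2, "topic_label": "roaming", "language": "eng"},
    {"original_language_text": "Italian", "document_rank": 1, "topic_label": "data_usage", "language": "ita"},
    {"original_language_text": "Italian", "document_rank": 2, "topic_label": "billing", "language": "ita"},
]

# Map query language names to the language codes stored in the metadata.
LANG_CODES = {"German": "deu", "French": "fra", "Italian": "ita", "English": "eng"}


def same_language_rate(rows):
    """Fraction of retrieved documents per query whose language matches the query."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        query_lang = row["original_language_text"]
        totals[query_lang] += 1
        hits[query_lang] += row["language"] == LANG_CODES[query_lang]
    return {lang: hits[lang] / totals[lang] for lang in totals}


rates = same_language_rate(rows)
print(rates)  # {'German': 0.5, 'Italian': 1.0}
```

A low same-language rate is not necessarily a problem for a multilingual store: a cross-language hit with the right topic label may be exactly what is wanted, so this number is best read alongside the topic labels.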