<a href="https://colab.research.google.com/github/AlbertoB12/KultuRAG/blob/main/indexer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Store Document Upload

This notebook processes text documents in various formats (.txt, .json, .jsonl) and uploads them to a Qdrant vector database for semantic search and retrieval applications.

**Pipeline Overview:**
1. Initialize embedding model with automatic device detection
2. Load and parse input documents
3. Convert documents to vector embeddings
4. Upload to Qdrant

## 1. Install and import packages

In [None]:
# Install packages
!pip install langchain-qdrant langchain-huggingface sentence-transformers torch


In [None]:
# Imports
import json, torch
from langchain_qdrant import QdrantVectorStore
from langchain_huggingface import HuggingFaceEmbeddings
from google.colab import userdata

## 2. Embedding Model Configuration

In [None]:
# Embedding model configuration
# Sentence transformer model for embeddings
model_name = 'sentence-transformers/all-MiniLM-L6-v2'  # 22.7M parameters, 80MB, open-source

# Model configuration parameters
# Automatic device detection
if torch.cuda.is_available():  # Use GPU if available
    device = 'cuda'
    gpu_name = torch.cuda.get_device_name(0)
    print(f"GPU detected: {gpu_name}.")
else:  # If GPU is not available, use CPU
    device = 'cpu'
    print("Using CPU.")

# Model configuration
model_kwargs = {
    'device': device  # Automatically selected device, GPU or CPU
}

# Encoding configuration
encode_kwargs = {
    'normalize_embeddings': False  # Keep raw embeddings without L2 normalization
}

# Initialize the HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print("Embedding model initialized successfully!")

GPU detected: NVIDIA A100-SXM4-40GB.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding model initialized successfully!
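
The `normalize_embeddings` flag decides whether each vector is scaled to unit length before it is stored. The sketch below illustrates what that changes, using only the standard library and made-up 3-dimensional vectors in place of the model's 384-dimensional output. The helper names (`l2_normalize`, `dot`, `cosine`) are illustrative and not part of any library used in this notebook.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for real embeddings (assumption for illustration)
a = [3.0, 4.0, 0.0]
b = [6.0, 8.0, 0.0]  # Same direction as a, twice the length

raw_dot = dot(a, b)                               # Depends on vector length
unit_dot = dot(l2_normalize(a), l2_normalize(b))  # Depends only on direction

print(raw_dot)
print(round(unit_dot, 6), round(cosine(a, b), 6))
```

Cosine similarity ignores vector length, so ranking under a cosine metric is the same with or without normalization. Under a dot-product metric, longer raw vectors score higher, so the flag affects the results there.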


## 3. File Configuration

The notebook supports the following input formats:
- **.txt files**: Documents separated by blank lines (double newlines)
- **.json files**: A JSON array of document objects, each with a 'text' field
- **.jsonl files**: One JSON object per line, each with a 'text' field
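
To make the three layouts concrete, here is a minimal sketch that writes the same two documents in each format to a temporary directory and reads them back. `load_texts` is a hypothetical helper written for this illustration, and the file names and sample sentences are placeholders.

```python
import json
import tempfile
from pathlib import Path


def load_texts(path):
    """Hypothetical helper: return the list of document strings in a file."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8")
    if path.suffix == ".txt":
        # Documents are separated by blank lines
        return [p.strip() for p in raw.split("\n\n") if p.strip()]
    if path.suffix == ".json":
        # A single JSON array of objects with a 'text' field
        return [e["text"].strip() for e in json.loads(raw)
                if isinstance(e, dict) and e.get("text")]
    if path.suffix == ".jsonl":
        # One JSON object per line
        entries = [json.loads(line) for line in raw.splitlines() if line.strip()]
        return [e["text"].strip() for e in entries
                if isinstance(e, dict) and e.get("text")]
    raise ValueError(f"Unsupported file format: {path.suffix}")


# The same two documents in each format (sample content for illustration)
samples = {
    "docs.txt": "Was ist das?\n\nLass uns etwas versuchen!\n",
    "docs.json": json.dumps([{"text": "Was ist das?"},
                             {"text": "Lass uns etwas versuchen!"}]),
    "docs.jsonl": '{"text": "Was ist das?"}\n{"text": "Lass uns etwas versuchen!"}\n',
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, content in samples.items():
        file_path = Path(tmp) / name
        file_path.write_text(content, encoding="utf-8")
        results[name] = load_texts(file_path)

for name, docs in results.items():
    print(name, docs)
```

All three files yield the same list of two strings, which is the shape the later cells expect in `texts`.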

In [None]:
# Document loading and preprocessing
# Input file path
input_file = r"/content/Deutsche Saetze.txt"

if not input_file:
    print("Specify the input file path")
else:
    print(f"Input file configured: {input_file}")

Input file configured: /content/Deutsche Saetze.txt


## 4. Document Loading and Processing

In [None]:
# Initialize empty list for text strings
texts = []

# Handle different file formats
# .txt
if input_file.endswith(".txt"):
    # Plain text: documents are assumed to be separated by blank lines
    with open(input_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split content into paragraphs and keep only the text
    texts = [
        paragraph.strip()
        for paragraph in content.split("\n\n")  # Change the separator here if documents are delimited differently
        if paragraph.strip()  # Filter out empty paragraphs
    ]

# .json
elif input_file.endswith(".json"):
    # JSON: the file contains an array of document objects
    with open(input_file, 'r', encoding='utf-8') as f:
        raw_entries = json.load(f)

    # Extract text from each entry; entries without a 'text' field are skipped
    texts = [
        entry['text'].strip()
        for entry in raw_entries
        if isinstance(entry, dict) and entry.get('text')
    ]

# .jsonl
elif input_file.endswith(".jsonl"):
    # JSONL: each line is a separate JSON object
    with open(input_file, 'r', encoding='utf-8') as f:
        for line_number, line in enumerate(f, 1):
            line = line.strip()
            if line:  # Skip empty lines, if they exist
                try:
                    entry = json.loads(line)
                    if isinstance(entry, dict) and entry.get('text'):
                        texts.append(entry['text'].strip())
                except json.JSONDecodeError as e:
                    # Report the malformed line and continue with the rest of the file
                    print(f"Skipping invalid JSON at line {line_number}: {line}")
                    print(f"Error details: {e}")

else:
    raise ValueError(f"Unsupported file format: {input_file}. File must be .txt, .json, or .jsonl")

# Processing statistics
print(f"Processed {len(texts)} text documents from {input_file}")

Processed 555616 text documents from /content/Deutsche Saetze.txt


## 5. Data Preview

In [None]:
# Preview the first three documents
if texts:  # If documents exist
    print("Sample documents:\n")
    for i, text in enumerate(texts[:3]):  # Show first 3 documents
        print(f"Document {i+1}: {text[:100]}{'...' if len(text) > 100 else ''}")
else:  # If no documents exist
    print("No documents loaded")

Sample documents:

Document 1: Lass uns etwas versuchen!
Document 2: Was ist das?
Document 3: Was ist das?


In [None]:
# Remove duplicate sentences
if texts:
    # Count and display how many texts there are
    original_count = len(texts)
    print(f"Original number of documents: {original_count}")

    # Remove duplicates while preserving order
    unique_texts = []  # Texts in their original order, without repeats
    seen = set()  # Texts already added to unique_texts, for fast membership checks

    # Remove duplicates
    for text in texts:
        if text not in seen:  # Keep only the first occurrence of each text
            unique_texts.append(text)
            seen.add(text)

    # Update the texts list
    texts = unique_texts

    # Show statistics
    duplicates_removed = original_count - len(texts)
    print(f"After removing duplicates: {len(texts)} documents")
    print(f"Duplicates removed: {duplicates_removed}")
    print(f"Deduplication rate: {(duplicates_removed/original_count)*100:.1f}%")

    # Preview again after deduplication
    if texts:
        print("\nSample documents after deduplication:")
        for i, text in enumerate(texts[:3]):  # Show first 3 documents
            print(f"Document {i+1}: {text[:100]}{'...' if len(text) > 100 else ''}")
else:  # If no texts were loaded
    print("No texts to deduplicate")

Original number of documents: 555616
After removing duplicates: 471633 documents
Duplicates removed: 83983
Deduplication rate: 15.1%

Sample documents after deduplication:
Document 1: Lass uns etwas versuchen!
Document 2: Was ist das?
Document 3: Heute ist der 18. Juni und das ist der Geburtstag von Muiriel!
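
The loop-and-set approach used in the deduplication cell can also be written more compactly. The sketch below relies on the fact that Python dictionaries preserve insertion order, so `dict.fromkeys` keeps the first occurrence of each string. The sample list is a small stand-in for the real `texts`.

```python
# Small stand-in for the loaded documents (sample data for illustration)
sample_texts = [
    "Lass uns etwas versuchen!",
    "Was ist das?",
    "Was ist das?",
    "Heute ist der 18. Juni und das ist der Geburtstag von Muiriel!",
]

# dict keys are unique and keep insertion order, so this drops later repeats
unique_sample = list(dict.fromkeys(sample_texts))

duplicates_removed = len(sample_texts) - len(unique_sample)
print(unique_sample)
print(f"Duplicates removed: {duplicates_removed}")
```

Both versions compare strings exactly, so sentences that differ only in case, punctuation, or whitespace inside the text are still treated as distinct.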


## 6. Qdrant Configuration

Configure the Qdrant vector database connection parameters. The URL and API key are read from Colab secrets named `QDRANT_URL` and `QDRANT_API_KEY`.

In [None]:
# Qdrant configuration parameters
qdrant_URL = userdata.get('QDRANT_URL')
qdrant_API_key = userdata.get('QDRANT_API_KEY')
collection_name = "KultuRAG"

if not qdrant_URL or not qdrant_API_key:  # If any parameter is missing, show a warning
    print("Configure Qdrant URL and API key above")
else:  # If no parameter is missing and the configuration was successfully set, show info
    print("Qdrant configuration set successfully!")
    print(f"Collection: {collection_name}")

Qdrant configuration set successfully!
Collection: KultuRAG
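
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of one alternative, assuming the same two settings are exposed as environment variables with the same names: `read_qdrant_config` is a hypothetical helper, and the URL and key in the example are placeholders.

```python
import os


def read_qdrant_config(env=None):
    """Hypothetical helper: read Qdrant settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    api_key = env.get("QDRANT_API_KEY")

    # Fail early and name exactly which settings are missing
    missing = [name for name, value in (("QDRANT_URL", url), ("QDRANT_API_KEY", api_key))
               if not value]
    if missing:
        raise RuntimeError(f"Missing Qdrant settings: {', '.join(missing)}")
    return url, api_key


# Example with a stand-in mapping; both values are placeholders
example_env = {
    "QDRANT_URL": "https://example.invalid:6333",
    "QDRANT_API_KEY": "placeholder-key",
}
url, api_key = read_qdrant_config(example_env)
print(url)
```

Raising an error instead of printing a warning stops the notebook before the upload cell runs with incomplete settings.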


## 7. Vector Store Upload

In [None]:
# Initialize Qdrant vector store and upload text documents
if texts and qdrant_URL and qdrant_API_key:  # If previous steps were successful
    print("Starting vector store upload.")

    # Handle embedding generation and vector storage
    doc_store = QdrantVectorStore.from_texts(
        texts=texts,  # Text strings to embed
        embedding=embeddings,  # Embedding model instance
        url=qdrant_URL,  # Qdrant server URL
        api_key=qdrant_API_key,  # Qdrant API key
        collection_name=collection_name,  # Collection name
    )

    print(f"Successfully uploaded {len(texts)} text documents to Qdrant collection")
    print("Document upload completed successfully!")

else:  # If previous steps were not successful
    print("Upload skipped. Ensure:")
    print("1. Documents are loaded and texts list is not empty")
    print("2. Qdrant URL is configured")
    print("3. Qdrant API key is configured")

Starting vector store upload.
Successfully uploaded 471633 text documents to Qdrant collection
Document upload completed successfully!
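
With several hundred thousand documents, it can help to send the data in slices so that progress is visible and a failed slice can be retried on its own. The sketch below shows only the batching logic: `iter_batches` and `upload_in_batches` are hypothetical helpers, and `upload_batch` is a stand-in callable, not a Qdrant API. In a real run it might wrap a call such as `doc_store.add_texts(batch)`, which is an assumption about how the store would be used here.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upload_in_batches(items, upload_batch, batch_size=1000):
    """Call upload_batch on each slice and return the number of items sent."""
    sent = 0
    for batch in iter_batches(items, batch_size):
        upload_batch(batch)
        sent += len(batch)
    return sent


# Stand-in for a real upload call: it only records the size of each batch
batch_sizes = []
documents = [f"doc {i}" for i in range(2500)]
total = upload_in_batches(documents, lambda batch: batch_sizes.append(len(batch)))

print(total, batch_sizes)
```

The batch size of 1000 is an arbitrary default for the illustration; a suitable value depends on document length and server limits.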


## 8. Upload Summary

In [None]:
# Final summary
print("UPLOAD SUMMARY")
print(f"Input file: {input_file if input_file else 'Not configured'}")
print(f"Documents processed: {len(texts) if 'texts' in locals() else 0}")
print(f"Embedding model: {model_name}")
print(f"Device used: {device if 'device' in locals() else 'Not configured'}")
print(f"Qdrant collection: {collection_name}")
print(f"Upload status: {'Complete' if 'doc_store' in locals() else 'Pending configuration'}")

UPLOAD SUMMARY
Input file: /content/Deutsche Saetze.txt
Documents processed: 471633
Embedding model: sentence-transformers/all-MiniLM-L6-v2
Device used: cuda
Qdrant collection: KultuRAG
Upload status: Complete
