# Optimized ANTIQUE Dataset Processing and Embedding Generation

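This notebook processes the ANTIQUE dataset and generates sentence embeddings, aiming for a higher mean average precision (MAP) score:
1. **Model Selection**: Encodes queries and documents with `sentence-transformers/all-MiniLM-L6-v2`
2. **Text Processing**: Light cleaning that preserves semantic information; tokenization is left to the model
3. **Embedding Strategy**: Queries and documents are encoded by the same model and L2-normalized, so inner product equals cosine similarity
4. **Memory & Speed**: Batched encoding (batch size 64), on GPU when available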

## Step 1: Install Optimized Packages

In [None]:
# Install compatible packages for Colab
!pip install --upgrade pip
# Quote version specifiers so the shell does not treat '>' as output redirection
!pip install "sentence-transformers>=2.2.2"
!pip install "transformers>=4.21.0"
!pip install "torch>=1.13.0"
!pip install pandas numpy scikit-learn joblib nltk tqdm faiss-cpu beir datasets ir_datasets
!pip install "huggingface_hub>=0.10.0"

# Restart runtime after package installation
print("[INFO] Packages installed! Please restart runtime and run the next cell.")

Collecting pip
  Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.1.1
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting beir
  Downloading beir-2.2.0-py3-none-any.whl.metadata (28 kB)
Collecting ir_datasets
  Downloading ir_datasets-0.5.11-py3-none-any.whl.metadata (12 kB)
Collecting pytrec-eval-terrier (from beir)
  Downloading pytrec_eval_terrier-0.5.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (984 bytes)
Collecting inscriptis>=2.2.0 (from ir_datasets)
  Downloading inscriptis-2.6.0-py3-none-any.whl.metadata 
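
After restarting the runtime, it can help to confirm which versions were actually installed. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is an illustrative assumption, not part of the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version string, or None if missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook depends on most directly
for name, version in installed_versions(
    ["sentence-transformers", "transformers", "torch", "faiss-cpu"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry means the package is not visible to the current interpreter, which usually indicates the runtime was not restarted or the install failed.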

## Step 1.5: Import Packages (Run After Restart)

In [None]:
import pandas as pd
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
import ir_datasets
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import os
from tqdm import tqdm
from collections import defaultdict
import joblib
import faiss
from sklearn.metrics.pairwise import cosine_similarity
import zipfile
import tarfile
import warnings
warnings.filterwarnings('ignore')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

Using device: cuda


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Step 2: Download and Extract ANTIQUE Dataset

In [None]:
print("Downloading ANTIQUE dataset directly...")

# Download the ANTIQUE dataset
dataset = ir_datasets.load('antique/train')

# Create directory
os.makedirs('antique_dataset', exist_ok=True)

# Save documents
print("Saving documents...")
docs_data = [{'doc_id': doc.doc_id, 'text': getattr(doc, 'text', '')} for doc in tqdm(dataset.docs_iter(), desc="Loading documents")]
docs_df = pd.DataFrame(docs_data)
docs_df.to_csv('antique_dataset/documents.tsv', sep='\t', index=False)

# Save queries
print("Saving queries...")
queries_data = [{'query_id': query.query_id, 'text': query.text} for query in tqdm(dataset.queries_iter(), desc="Loading queries")]
queries_df = pd.DataFrame(queries_data)
queries_df.to_csv('antique_dataset/queries.tsv', sep='\t', index=False)

# Save qrels
print("Saving relevance judgments...")
qrels_data = [{'query_id': qrel.query_id, 'doc_id': qrel.doc_id, 'relevance': qrel.relevance} for qrel in tqdm(dataset.qrels_iter(), desc="Loading qrels")]
qrels_df = pd.DataFrame(qrels_data)
qrels_df.to_csv('antique_dataset/qrels.tsv', sep='\t', index=False)

print("✅ Downloaded ANTIQUE dataset")

Downloading ANTIQUE dataset directly...
Saving documents...


[INFO] Please confirm you agree to the authors' data usage agreement found at <https://ciir.cs.umass.edu/downloads/Antique/readme.txt>
[INFO] If you have a local copy of https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/684f7015aff377062a758e478476aac8
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt
Loading documents: 0it [00:00, ?it/s]

Saving queries...


[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-train-queries.txt
Loading queries: 0it [00:00, ?it/s]
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-train-queries.txt: [00:00] [137kB] [655kB/s]
Loading queries: 2426it [00:00, 5162.11it/s]


Saving relevance judgments...


[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-train.qrel
Loading qrels: 0it [00:00, ?it/s]
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-train.qrel: [00:00] [626kB] [1.79MB/s]

Loading qrels: 0it [00:00, ?it/s]
Loading qrels: 27422it [00:00, 38241.23it/s]


✅ Downloaded ANTIQUE dataset
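
The three TSV files written above share a simple tab-separated layout with a header row. As a minimal sketch of how `qrels.tsv` could be read back and grouped per query, the snippet below parses a tiny in-memory sample. The sample rows and the helper name `load_qrels` are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_qrels(handle):
    """Group relevance labels by query: {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        qrels[row["query_id"]][row["doc_id"]] = int(row["relevance"])
    return dict(qrels)


# Hypothetical rows in the same column layout as antique_dataset/qrels.tsv
sample = io.StringIO(
    "query_id\tdoc_id\trelevance\n"
    "q1\td1_0\t4\n"
    "q1\td1_1\t2\n"
    "q2\td2_0\t3\n"
)
qrels = load_qrels(sample)
print(qrels)
```

For the real file, the same function should work with `open('antique_dataset/qrels.tsv', newline='')` in place of the in-memory sample.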


## Step 3: Smart Text Preprocessing (Preserves Semantics)

In [None]:
import re
import pandas as pd  # pd.isna() is used in smart_clean_text
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Stop-word list that keeps negations and intensifiers, plus a lemmatizer.
# Neither is applied in smart_clean_text below; tokenization is left to the model.
stop_words = set(stopwords.words('english'))
stop_words = stop_words - {'not', 'no', 'nor', 'against', 'up', 'down', 'over', 'under', 'more', 'most', 'very'}
lemmatizer = WordNetLemmatizer()

def smart_clean_text(text):
    """Lightly normalize text while keeping it readable for the model's own tokenizer."""
    if pd.isna(text) or not isinstance(text, str):
        return ""
    text = text.lower()
    # Replace URLs with a placeholder and strip HTML tags
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' url ', text)
    text = re.sub(r'<.*?>', ' ', text)
    # Replace numbers with placeholders. The rules run in order, so the year rule
    # sees four-digit integers before the decimal and generic number rules do.
    text = re.sub(r'\b\d{4}\b', ' YEAR ', text)
    text = re.sub(r'\b\d+\.\d+\b', ' DECIMAL ', text)
    text = re.sub(r'\b\d+\b', ' NUMBER ', text)
    # Collapse repeated '!' and '?' into placeholders
    text = re.sub(r'[!]{2,}', ' EMPHASIS ', text)
    text = re.sub(r'[?]{2,}', ' QUESTION ', text)
    # Keep letters, digits, whitespace, and basic punctuation; replace everything else with a space
    text = re.sub(r'[^a-zA-Z0-9\s\.\,\;\'\"\-\!\?]', ' ', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # No word tokenization or lemmatization here: the SentenceTransformer
    # model's tokenizer handles that internally.
    return text
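
One subtlety in `smart_clean_text` is that the substitutions run in sequence, so the order of the number rules matters. The four-digit year rule runs before the decimal rule, so a value such as `2024.1` is split into a year and a number rather than treated as one decimal. The sketch below isolates the three number rules to show the effect of ordering; `replace_numbers` and its `decimal_first` flag are hypothetical names for illustration, not part of the notebook.

```python
import re


def replace_numbers(text, decimal_first=True):
    """Apply the number placeholder rules in one of two orders."""
    rules = [
        (r'\b\d+\.\d+\b', ' DECIMAL '),
        (r'\b\d{4}\b', ' YEAR '),
    ]
    if not decimal_first:
        # Same order as smart_clean_text: year rule first, then decimal
        rules.reverse()
    # The generic integer rule always runs last
    rules.append((r'\b\d+\b', ' NUMBER '))
    for pattern, placeholder in rules:
        text = re.sub(pattern, placeholder, text)
    return re.sub(r'\s+', ' ', text).strip()


notebook_order = replace_numbers("released in 2024.1", decimal_first=False)
decimal_order = replace_numbers("released in 2024.1", decimal_first=True)
print(notebook_order)  # released in YEAR . NUMBER
print(decimal_order)   # released in DECIMAL
```

Whether the decimal-first order helps retrieval is an open question here; changing it would alter the cleaned text and therefore the embeddings and the reported score.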

## Step 4: Embedding Generation

In [None]:
from sentence_transformers import SentenceTransformer

# Define the model name once and load the model a single time
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading model: {MODEL_NAME}")
model = SentenceTransformer(MODEL_NAME, device=device)
print(f"Model loaded successfully on {device}")

# Prepare texts for embedding
print("\nPreparing texts for embedding...")
# Apply the simplified cleaning function
doc_texts = docs_df['text'].apply(smart_clean_text).tolist()
doc_ids = docs_df['doc_id'].tolist()
query_texts = queries_df['text'].apply(smart_clean_text).tolist()
query_ids = queries_df['query_id'].tolist()

def generate_embeddings_optimized(texts, batch_size=64):
    # The SentenceTransformer model's encode method handles tokenization and truncation
    embeddings = model.encode(texts, batch_size=batch_size, show_progress_bar=True, convert_to_numpy=True, normalize_embeddings=True)
    return embeddings

doc_embeddings = generate_embeddings_optimized(doc_texts)
query_embeddings = generate_embeddings_optimized(query_texts)

print(f"\nEmbedding generation completed!")
print(f"Document embeddings shape: {doc_embeddings.shape}")
print(f"Query embeddings shape: {query_embeddings.shape}")

Loading model: sentence-transformers/all-MiniLM-L6-v2


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Model loaded successfully on cuda

Preparing texts for embedding...


Batches:   0%|          | 0/6308 [00:00<?, ?it/s]

Batches:   0%|          | 0/38 [00:00<?, ?it/s]


Embedding generation completed!
Document embeddings shape: (403666, 384)
Query embeddings shape: (2426, 384)
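
Because `normalize_embeddings=True` scales every vector to unit length, the inner product of two embeddings equals their cosine similarity, which is why an inner-product index can be used for cosine ranking in Step 5. A dependency-free sketch of that identity, using made-up 3-dimensional vectors in place of real 384-dimensional embeddings:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, chosen only for illustration
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

ip = dot(l2_normalize(query), l2_normalize(doc))
cos = cosine(query, doc)
print(ip, cos)  # the two values agree up to floating-point error
```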


## Step 5: Retrieval Evaluation & MAP Calculation

In [None]:
# Exact inner-product index; the embeddings are L2-normalized, so scores are cosine similarities
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings.astype(np.float32))

# Relevance lookup: qrels_dict[query_id][doc_id] -> relevance label
qrels_dict = defaultdict(dict)
for _, row in qrels_df.iterrows():
    qid = str(row['query_id'])
    did = str(row['doc_id'])
    rel = int(row['relevance'])
    qrels_dict[qid][did] = rel

average_precisions = []
for i, query_emb in enumerate(query_embeddings):
    query_id = str(query_ids[i])
    # Retrieve the top 100 documents for this query
    scores, indices = index.search(query_emb.reshape(1, -1).astype(np.float32), 100)
    relevant_found = 0
    precision_sum = 0.0
    for rank, doc_idx in enumerate(indices[0]):
        doc_id = str(doc_ids[doc_idx])
        # Any positive relevance label counts as relevant
        is_relevant = qrels_dict[query_id].get(doc_id, 0) > 0
        if is_relevant:
            relevant_found += 1
            precision_sum += relevant_found / (rank + 1)
    # Note: normalized by the relevant documents found in the top 100,
    # not by the total number of relevant documents for the query
    avg_precision = precision_sum / relevant_found if relevant_found > 0 else 0.0
    average_precisions.append(avg_precision)
map_score = np.mean(average_precisions)
print(f"MAP Score: {map_score:.4f}")

MAP Score: 0.3999
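
The loop above divides by the number of relevant documents found in the top 100, so relevant documents that were never retrieved do not lower the score. The more common definition of average precision divides by the total number of relevant documents for the query. A minimal sketch of that variant follows, with a hypothetical `average_precision` helper and toy IDs; the `min_relevance` threshold is an assumption to be set according to how the graded labels should be binarized.

```python
def average_precision(ranked_doc_ids, relevance_by_doc, min_relevance=1):
    """Average precision normalized by the total number of relevant documents."""
    relevant = {d for d, rel in relevance_by_doc.items() if rel >= min_relevance}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy example: two of three relevant documents are retrieved, at ranks 2 and 4
ranking = ["d3", "d1", "d7", "d2"]
judgments = {"d1": 3, "d2": 4, "d9": 4}

ap = average_precision(ranking, judgments)
print(f"{ap:.4f}")  # 0.3333
```

On this toy ranking the notebook's normalization would give (1/2 + 2/4) / 2 = 0.5, while dividing by all three relevant documents gives 1/3, so scores from the two definitions are not directly comparable.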



In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/gdrive')

# Define your save directory in Google Drive
save_dir = '/content/gdrive/MyDrive/Antiqua_Embeddings'  # Change this to your preferred path

# Create directory if it doesn't exist
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
    print(f"Created directory: {save_dir}")
else:
    print(f"Directory already exists: {save_dir}")

print("\nSaving embeddings and metadata to Google Drive...")

# Save embeddings using joblib
joblib.dump(doc_embeddings, f'{save_dir}/doc_embeddings.joblib')
joblib.dump(query_embeddings, f'{save_dir}/query_embeddings.joblib')
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'

# Save metadata
metadata = {
    'model_name': MODEL_NAME,
    'embedding_dim': doc_embeddings.shape[1],
    'num_docs': len(doc_embeddings),
    'num_queries': len(query_embeddings),
    'doc_ids': doc_ids,
    'query_ids': query_ids,
    'normalized': True
}
joblib.dump(metadata, f'{save_dir}/embedding_metadata.joblib')

# Save cleaned texts with IDs using joblib
doc_data = {
    'doc_ids': doc_ids,
    'texts': doc_texts
}
joblib.dump(doc_data, f'{save_dir}/documents_final.joblib')

query_data = {
    'query_ids': query_ids,
    'texts': query_texts
}
joblib.dump(query_data, f'{save_dir}/queries_final.joblib')

# Create summary
summary = f"""
=== PROCESSING COMPLETE ===

Model: {MODEL_NAME}
Documents: {len(doc_embeddings):,}
Queries: {len(query_embeddings):,}
Embedding Dimension: {doc_embeddings.shape[1]}

Files Generated (all in joblib format):
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents with IDs
- queries_final.joblib: Cleaned queries with IDs

Saved to Google Drive at: {save_dir}

✅ All files saved successfully!
"""

print(summary)

# Save summary as text file
with open(f'{save_dir}/processing_summary.txt', 'w') as f:
    f.write(summary)

# Create zip file for easy download
print("\nCreating zip file in Google Drive...")
with zipfile.ZipFile(f'{save_dir}/antique_embeddings_joblib.zip', 'w') as zipf:
    zipf.write(f'{save_dir}/doc_embeddings.joblib', 'doc_embeddings.joblib')
    zipf.write(f'{save_dir}/query_embeddings.joblib', 'query_embeddings.joblib')
    zipf.write(f'{save_dir}/embedding_metadata.joblib', 'embedding_metadata.joblib')
    zipf.write(f'{save_dir}/documents_final.joblib', 'documents_final.joblib')
    zipf.write(f'{save_dir}/queries_final.joblib', 'queries_final.joblib')
    zipf.write(f'{save_dir}/processing_summary.txt', 'processing_summary.txt')

print(f"✅ Zip file created: {save_dir}/antique_embeddings_joblib.zip")
print("\n🎉 Processing complete! Files saved to your Google Drive.")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Directory already exists: /content/gdrive/MyDrive/Antiqua_Embeddings

Saving embeddings and metadata to Google Drive...

=== PROCESSING COMPLETE ===

Model: sentence-transformers/all-MiniLM-L6-v2
Documents: 403,666
Queries: 2,426
Embedding Dimension: 384

Files Generated (all in joblib format):
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents with IDs
- queries_final.joblib: Cleaned queries with IDs

Saved to Google Drive at: /content/gdrive/MyDrive/Antiqua_Embeddings

✅ All files saved successfully!


Creating zip file in Google Drive...
✅ Zip file created: /content/gdrive/MyDrive/Antiqua_Embeddings/antique_embeddings_joblib.zip

🎉 Processing complete! Files saved to your Google Drive.
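
Before relying on the archive, it can be worth checking that it contains the expected members. The sketch below uses only the standard library and builds a throwaway archive in a temporary directory so it runs anywhere; `missing_members` is a hypothetical helper name, and the placeholder file contents are not real embeddings.

```python
import os
import tempfile
import zipfile


def missing_members(zip_path, expected):
    """Return the expected file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        present = set(zf.namelist())
    return sorted(set(expected) - present)


# Build a throwaway archive with placeholder files to demonstrate the check
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("doc_embeddings.joblib", b"placeholder")
        zf.writestr("query_embeddings.joblib", b"placeholder")
    result = missing_members(
        zip_path,
        ["doc_embeddings.joblib", "query_embeddings.joblib", "embedding_metadata.joblib"],
    )

print(result)  # ['embedding_metadata.joblib']
```

Pointing `missing_members` at the real zip path in Google Drive, with the six file names written above, should return an empty list when the archive is complete.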


In [None]:
from google.colab import drive
from sentence_transformers import SentenceTransformer
import joblib
import os

# Mount Google Drive
drive.mount('/content/gdrive')

# Define your save directory in Google Drive
save_dir = '/content/gdrive/MyDrive/Antique_Embeddings'  # Change this to your preferred path

# Create directory if it doesn't exist
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
    print(f"Created directory: {save_dir}")
else:
    print(f"Directory already exists: {save_dir}")

# 1. Save the model itself
print("\nSaving the Sentence Transformer model...")
model_save_path = f"{save_dir}/{MODEL_NAME.replace('/', '_')}"
model.save(model_save_path)
print(f"✅ Model saved to: {model_save_path}")

# 2. Save embeddings using joblib
print("\nSaving embeddings...")
joblib.dump(doc_embeddings, f'{save_dir}/doc_embeddings.joblib')
joblib.dump(query_embeddings, f'{save_dir}/query_embeddings.joblib')

# 3. Save metadata
metadata = {
    'model_name': MODEL_NAME,
    'model_path': model_save_path,
    'embedding_dim': doc_embeddings.shape[1],
    'num_docs': len(doc_embeddings),
    'num_queries': len(query_embeddings),
    'doc_ids': doc_ids,
    'query_ids': query_ids,
    'normalized': True
}
joblib.dump(metadata, f'{save_dir}/embedding_metadata.joblib')

# 4. Save cleaned texts
doc_data = {
    'doc_ids': doc_ids,
    'texts': doc_texts
}
joblib.dump(doc_data, f'{save_dir}/documents_final.joblib')

query_data = {
    'query_ids': query_ids,
    'texts': query_texts
}
joblib.dump(query_data, f'{save_dir}/queries_final.joblib')

# Create summary
summary = f"""
=== PROCESSING COMPLETE ===

Model: {MODEL_NAME}
Model saved to: {model_save_path}
Documents: {len(doc_embeddings):,}
Queries: {len(query_embeddings):,}
Embedding Dimension: {doc_embeddings.shape[1]}

Files Generated:
- Model directory: {MODEL_NAME.replace('/', '_')}/
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents
- queries_final.joblib: Cleaned queries

Saved to Google Drive at: {save_dir}

✅ All files saved successfully!
"""

print(summary)

# Save summary
with open(f'{save_dir}/processing_summary.txt', 'w') as f:
    f.write(summary)

print("\n🎉 Processing complete! Model and embeddings saved to your Google Drive.")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Created directory: /content/gdrive/MyDrive/Antique_Embeddings

Saving the Sentence Transformer model...
✅ Model saved to: /content/gdrive/MyDrive/Antique_Embeddings/sentence-transformers_all-MiniLM-L6-v2

Saving embeddings...

=== PROCESSING COMPLETE ===

Model: sentence-transformers/all-MiniLM-L6-v2
Model saved to: /content/gdrive/MyDrive/Antique_Embeddings/sentence-transformers_all-MiniLM-L6-v2
Documents: 403,666
Queries: 2,426
Embedding Dimension: 384

Files Generated:
- Model directory: sentence-transformers_all-MiniLM-L6-v2/
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents
- queries_final.joblib: Cleaned queries

Saved to Google Drive at: /content/gdrive/MyDrive/Antique_Embeddings

✅ All files saved successfully!


🎉 Processing comp