# ITRLM+RAG Exploration Notebook

This notebook explores and tests the ITRLM+RAG implementation: data loading, PMI dictionary construction, category prediction, RAG answer generation, and multilingual queries.

## Overview
- **ITRLM**: Improved Translation-based Language Model
- **RAG**: Retrieval-Augmented Generation
- **Hybrid Ranking**: Multi-stage ranking pipeline


In [29]:
# Import necessary libraries
import sys
sys.path.append('..')

from data.semeval_loader import SemEvalLoader
from data.yahoo_loader import YahooLoader
from hmr.text_processing import TextProcessor
from hmr.pmi_dictionary import PMIDictionary

from hmr.category_predictor import CategoryPredictor
from hmr.rag_generator import RAGAnswerGen

In [26]:
# ------------------------------
# 1. Load and prepare data
# ------------------------------
semeval_loader = SemEvalLoader('../data')
yahoo_loader = YahooLoader('../data')

test_data = semeval_loader.load_data('test')
train_data = yahoo_loader.load_data('train', max_samples=100)

print(f"Loaded {len(test_data)} test examples")
print(f"Loaded {len(train_data)} training examples")

qa_pairs = [(ex['question'], ex['answer']) for ex in test_data]
processor = TextProcessor()
qa_pairs_clean = [(processor.clean(q), processor.clean(a)) for q, a in qa_pairs]


Test file not found: ../data/semeval_2017/test_data.txt
Train file not found: ../data/yahoo_answers/yahoo_answers_train.json


Loaded 3 test examples
Loaded 5 training examples


## 1. Data Loading and Exploration


In [27]:
# ------------------------------
# 2. Build PMI dictionary
# ------------------------------
pmi_builder = PMIDictionary({"alpha": 0.3, "top_n_translations": 50})
pmi_builder.build(qa_pairs_clean)

[PMI] Building general dictionary from 3 Q-A pairs...


Counting co-occurrences: 100%|██████████| 3/3 [00:00<00:00, 19599.55it/s]
PMI calculation: 100%|██████████| 46/46 [00:00<00:00, 74064.49it/s]

✅ PMI dictionary saved to outputs/pmi_dicts/pmi_general.json
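
For orientation, a PMI-based dictionary scores a (question word, answer word) pair by how much more often the two words co-occur in a Q-A pair than chance would predict. The sketch below is a minimal stand-alone illustration under stated assumptions, not the `PMIDictionary` implementation: `build_pmi` is a hypothetical helper, counting is done once per Q-A pair, and the `alpha` setting from the config above is not modelled.

```python
import math
from collections import Counter
from itertools import product


def build_pmi(qa_pairs, top_n=50):
    """Hypothetical helper: PMI between question words and answer words.

    A word (or word pair) is counted once per Q-A pair in which it appears.
    """
    n_pairs = len(qa_pairs)
    q_counts, a_counts, joint = Counter(), Counter(), Counter()
    for question, answer in qa_pairs:
        q_words = set(question.lower().split())
        a_words = set(answer.lower().split())
        q_counts.update(q_words)
        a_counts.update(a_words)
        joint.update(product(q_words, a_words))

    table = {}
    for (qw, aw), n_joint in joint.items():
        # PMI = log( P(qw, aw) / (P(qw) * P(aw)) ), with pair-level probabilities
        pmi = math.log(n_joint * n_pairs / (q_counts[qw] * a_counts[aw]))
        table.setdefault(qw, []).append((aw, pmi))

    # Keep the top_n highest-scoring answer words per question word
    # (ties broken alphabetically so the result is deterministic).
    return {
        qw: sorted(cands, key=lambda item: (-item[1], item[0]))[:top_n]
        for qw, cands in table.items()
    }
```

Calling `build_pmi(qa_pairs_clean)` would return, for each question word, a ranked list of `(answer_word, pmi)` tuples; words that appear in every pair get a PMI of zero because they carry no pair-specific signal.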





In [None]:
# ------------------------------
# 3. Category prediction
# ------------------------------
yahoo_pred = CategoryPredictor()
yahoo_pred.load_or_train()
cat, conf = yahoo_pred.predict("Where can I buy cheap airline tickets?")
print(f"Predicted Category: {cat} (Confidence: {conf:.2f})")

In [None]:
# ------------------------------
# 4. RAG setup
# ------------------------------
cfg_rag = {
    "embed_model": "sentence-transformers/all-mpnet-base-v2",
    "gen_model": "mistralai/Mistral-7B-Instruct-v0.2",
    "top_k_ctx": 10,
    "max_answer_tokens": 100
}

rag_gen = RAGAnswerGen(cfg_rag)


In [None]:
# ------------------------------
# 5. Build or attach corpus safely
# ------------------------------
if hasattr(rag_gen, "build_or_load_index"):
    rag_gen.build_or_load_index(qa_pairs_clean)
    print("Index built using build_or_load_index()")
elif hasattr(rag_gen, "fit"):
    rag_gen.fit(qa_pairs_clean)
    print("Index built using fit()")
elif hasattr(rag_gen, "add_corpus"):
    rag_gen.add_corpus(qa_pairs_clean)
    print("Corpus added using add_corpus()")
else:
    print("⚠️ No index build method found — make sure retrieve_context() uses default data")

⚠️ No index build method found — make sure retrieve_context() uses default data
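
Since none of the probed build methods exist here, it may help to see what top-k context retrieval amounts to in its simplest form. The sketch below is an assumption-laden toy: it swaps the sentence-embedding and FAISS setup configured above for plain bag-of-words cosine similarity, and `bow_vector`, `cosine`, and `retrieve_top_k` are hypothetical names, not part of `RAGAnswerGen`.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words term counts for a whitespace-tokenised, lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query, corpus, k=2):
    """Return the k corpus texts most similar to the query."""
    q_vec = bow_vector(query)
    scored = [(cosine(q_vec, bow_vector(doc)), doc) for doc in corpus]
    # Python's sort is stable, so equally scored texts keep corpus order.
    scored.sort(key=lambda item: -item[0])
    return [doc for _, doc in scored[:k]]
```

A dense retriever follows the same shape: embed the query, score it against every stored context, and pass the `top_k_ctx` best matches to the generator.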


In [None]:
# ------------------------------
# 6. Single Query Demo
# ------------------------------
query = "Where can I buy cheap airline tickets?"
answer = rag_gen.generate(query)

print("\nSingle Query Test:")
print(f"Q: {query}")
print(f"A: {answer}")

✅ Loaded FAISS index from outputs/faiss_index.index
✅ Loaded 2 context texts from outputs/faiss_index_texts.json

Single Query Test:
Q: Where can I buy cheap airline tickets?
A: You can buy tickets on airline websites.


In [None]:
# ------------------------------
# 7. Batch Generation & Evaluation
# ------------------------------
test_questions = [
    "Where can I buy cheap airline tickets?",
    "What is the best way to book a flight online?",
    "Which website offers discounts on airfare?",
    "How do I find cheap tickets?",
    "Where to purchase airline tickets?"
]

# Generate answers
predicted_answers = [rag_gen.generate(q) for q in test_questions]

# Mock reference answers
reference_answers = [
    "You can buy tickets on airline websites.",
    "Use travel aggregators like Skyscanner.",
    "You can buy tickets on airline websites.",
    "Use travel aggregators like Skyscanner.",
    "You can buy tickets on airline websites.",
]

# Simple evaluation
correct = sum([pred.strip().lower() == ref.strip().lower()
               for pred, ref in zip(predicted_answers, reference_answers)])
accuracy = correct / len(test_questions)

print("\nBatch Evaluation:")
for q, pred, ref in zip(test_questions, predicted_answers, reference_answers):
    print(f"Q: {q}\n  Predicted: {pred}\n  Reference: {ref}\n")

print(f"Accuracy: {accuracy:.2f}")
print(f"Last generated answer: {answer}")


Batch Evaluation:
Q: Where can I buy cheap airline tickets?
  Predicted: You can buy tickets on airline websites.
  Reference: You can buy tickets on airline websites.

Q: What is the best way to book a flight online?
  Predicted: You can buy tickets on airline websites. Use travel aggregators like Skyscanner
  Reference: Use travel aggregators like Skyscanner.

Q: Which website offers discounts on airfare?
  Predicted: Skyscanner
  Reference: You can buy tickets on airline websites.

Q: How do I find cheap tickets?
  Predicted: You can buy tickets on airline websites. Use travel aggregators like Skyscanner. Use travel aggregators like Skyscanner. Use travel aggregators like Skyscanner.
  Reference: Use travel aggregators like Skyscanner.

Q: Where to purchase airline tickets?
  Predicted: airline websites
  Reference: You can buy tickets on airline websites.

Accuracy: 0.20
Last generated answer: You can buy tickets on airline websites.
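
Exact string match is a strict metric: a prediction such as "airline websites" scores zero even though it overlaps the reference. A softer alternative is token-level F1, sketched below. `token_f1` is a hypothetical helper written for this notebook, not part of the `hmr` package, and its regex tokenisation is an assumption.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1(pred, ref)` over `zip(predicted_answers, reference_answers)` gives a score that rewards partial overlap while still penalising repeated or irrelevant text through the precision term.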


In [None]:
from hmr.lang_pipeline import LanguagePipeline

lang_pipe = LanguagePipeline()

query = "తెలుగు టెస్ట్ అంటే మీ అవసరాన్ని బట్టి వివిధ రకాల పరీక్షలను సూచించవచ్చు. ఇది తెలుగు భాషా నైపుణ్య పరీక్ష కావచ్చు ఉదాహరణకు"  # Telugu input


lang_code = lang_pipe.detect_language(query)
query_en = lang_pipe.translate_to_english(query, lang_code)

print(f"Detected Language: {lang_code}")
print(f"Translated Query: {query_en}")

Detected Language: te
Translated Query: Telugu Test means different types of tests as per your requirement. It could be a Telugu language proficiency test for example


In [None]:
answer_en = rag_gen.generate(query_en)
final_answer = lang_pipe.translate_from_english(answer_en, lang_code)

print("\nFinal Answer:")
print(final_answer)


Final Answer:
తెలుగు టెస్ట్ అంటే మీ అవసరానికి అనుగుణంగా వివిధ రకాల పరీక్షలు. ఉదాహరణకు ఇది తెలుగు భాషా ప్రావీణ్యత పరీక్ష కావచ్చు.


In [30]:
# %%
# ------------------------------
# 8. Multilingual Query Demo (Auto-Detect & Translate)
# ------------------------------
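from langdetect import DetectorFactory, detect
from deep_translator import GoogleTranslator

# langdetect is non-deterministic by default; seed it so repeated runs agree.
DetectorFactory.seed = 0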

def detect_and_translate_to_english(text):
    """Detect the language and translate to English if needed."""
    lang_code = detect(text)
    if lang_code != "en":
        translator = GoogleTranslator(source=lang_code, target='en')
        translated = translator.translate(text)
    else:
        translated = text
    return lang_code, translated

def translate_back_to_original(text, target_lang):
    """Translate from English back to the user's original language."""
    if target_lang == "en":
        return text
    translator = GoogleTranslator(source='en', target=target_lang)
    return translator.translate(text)

# Example multilingual queries
query_original = "¿Dónde puedo comprar billetes de avión baratos?"  # Spanish (not used below)
telugu_text = "భారతదేశ రాజధాని నగరం ఏది?"  # Telugu

# Step 1 & 2: detect + translate to English
lang_code, query_en = detect_and_translate_to_english(telugu_text)
print(f"Detected Language: {lang_code}")
print(f"Translated Query: {query_en}")

# Step 3: run through existing RAG pipeline
answer_en = rag_gen.generate(query_en)

# Step 4: translate back to original language
final_answer = translate_back_to_original(answer_en, lang_code)

print("\n--- Final Output ---")
print(f"Original Question ({lang_code}): {telugu_text}")
print(f"Answer in English: {answer_en}")
print(f"Answer Translated Back: {final_answer}")


Detected Language: te
Translated Query: Which is the capital city of India?

--- Final Output ---
Original Question (te): భారతదేశ రాజధాని నగరం ఏది?
Answer in English: India
Answer Translated Back: భారతదేశం


In [35]:
li = [
    "Where does the sun rise?",
    "What is the capital of India?",
    "Who is the President of the United States?",
    "How many continents are there in the world?",
    "What is the largest ocean on Earth?",
    "Who invented the telephone?",
    "What is the national animal of India?",
    "भारत की राजधानी क्या है?",
    "संयुक्त राज्य अमेरिका के राष्ट्रपति कौन हैं?",
    "दुनिया में कुल कितने महाद्वीप हैं?",
    "पृथ्वी का सबसे बड़ा महासागर कौन सा है?",
    "टेलीफोन का आविष्कार किसने किया?",
    "भारत का राष्ट्रीय पशु कौन सा है?",
    "किस ग्रह को लाल ग्रह कहा जाता है?",
    "मनुष्य कौन सी गैस बाहर निकालता है?",
    "भारत का राष्ट्रगान किसने लिखा?",
    "సూర్యుడు ఎక్కడ ఉదయిస్తాడు?",
    "భారతదేశ రాజధాని ఏది?",
    "అమెరికా అధ్యక్షుడు ఎవరు?",
    "ప్రపంచంలో మొత్తం ఎన్ని ఖండాలు ఉన్నాయి?",
    "భూమిపై అతిపెద్ద మహాసముద్రం ఏది?",
    "టెలిఫోన్ ఎవరు కనుగొన్నారు?",
    "భారతదేశ జాతీయ జంతువు ఏది?",
    "ఎర్ర గ్రహం అని పిలవబడే గ్రహం ఏది?",
    "మనుషులు ఏ వాయువును ఉద్గారిస్తారు?",
    "భారత జాతీయ గీతాన్ని ఎవరు రచించారు?",
    "¿Dónde sale el sol?",
    "¿Cuál es la capital de la India?",
    "¿Quién es el presidente de los Estados Unidos?",
    "¿Cuántos continentes hay en el mundo?",
    "¿Cuál es el océano más grande de la Tierra?",
    "¿Quién inventó el teléfono?",
    "¿Cuál es el animal nacional de la India?",
    "¿Qué planeta se conoce como el planeta rojo?",
    "¿Qué gas exhalan los humanos?",
    "¿Quién escribió el himno nacional de la India?",
]

In [37]:
for q in li:
    print("--------------------------------")
    lang_code, query_en = detect_and_translate_to_english(q)
    print(f"Detected Language: {lang_code}")
    print(f"Translated Query: {query_en}")

    # Step 3: run through existing RAG pipeline
    answer_en = rag_gen.generate(query_en)

    # Step 4: translate back to original language
    final_answer = translate_back_to_original(answer_en, lang_code)

    print("\n--- Final Output ---")
    print(f"Original Question ({lang_code}): {q}")
    print(f"Answer in English: {answer_en}")
    print(f"Answer Translated Back: {final_answer}")


--------------------------------
Detected Language: en
Translated Query: Where does the sun rise?

--- Final Output ---
Original Question (en): Where does the sun rise?
Answer in English: The sun rises in the northern hemisphere.
Answer Translated Back: The sun rises in the northern hemisphere.
--------------------------------
Detected Language: en
Translated Query: What is the capital of India?

--- Final Output ---
Original Question (en): What is the capital of India?
Answer in English: Delhi
Answer Translated Back: Delhi
--------------------------------
Detected Language: en
Translated Query: Who is the President of the United States?

--- Final Output ---
Original Question (en): Who is the President of the United States?
Answer in English: James Madison
Answer Translated Back: James Madison
--------------------------------
Detected Language: en
Translated Query: How many continents are there in the world?

--- Final Output ---
Original Question (en): How many continents are there i