# 🌐 Cross-Lingual Information Retrieval using Transformers

Welcome to the project!

In this notebook, we’ll build a real-world **Cross-Lingual Information Retrieval** system using **Hugging Face Transformers** and **Sentence Transformers**.

### 🔍 Goal:
Given a **query in one language** (e.g., English), retrieve the most **relevant document in another language** (e.g., French, Spanish, Hindi).

### ✅ What You'll Learn:
- Use multilingual transformer models like **MiniLM** from Hugging Face for cross-lingual search
- Generate **semantic embeddings** from text
- Use **cosine similarity** for semantic search
- Build an end-to-end multilingual search system

Let’s get started!


## 🔧 Step 1: Project Setup

In this step, we'll:
- Install all required libraries
- Import key modules

We'll use:

- `sentence-transformers`: for generating sentence embeddings
- `transformers`: (under the hood) for multilingual models
- **Cosine Similarity** (from `sentence_transformers.util`) for comparing embeddings


In [17]:
# Install the necessary libraries
!pip install -q sentence-transformers transformers datasets tf-keras

In [21]:
# Basic Python libraries
import numpy as np
import pandas as pd
import torch
import warnings
warnings.filterwarnings('ignore')

# Sentence Transformers (built on Hugging Face)
from sentence_transformers import SentenceTransformer, util

## 📂 Step 2: Load Tatoeba Dataset via Hugging Face

We'll use the Tatoeba dataset from Hugging Face, which contains multilingual sentence pairs. Each pair is a sentence and its translation, which makes the dataset a good fit for building and checking a cross-lingual information retrieval system.

We'll focus on English-French sentence pairs for this project.


In [78]:
from datasets import load_dataset

# Load the English-French pairs; trust_remote_code=True lets the dataset's own loading script run
dataset = load_dataset("tatoeba", lang1="en", lang2="fr", split="train", trust_remote_code=True)

# Display first 5 sentence pairs
for i in range(5):
    print(f"English: {dataset[i]['translation']['en']}")
    print(f"French:  {dataset[i]['translation']['fr']}\n")

English: When he asked who had broken the window, all the boys put on an air of innocence.
French:  Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.

English: Then, when he asked who had broken the window, all the boys acted innocent.
French:  Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.

English: Let's try something.
French:  Essayons quelque chose !

English: Let's try something.
French:  Tentons quelque chose !

English: Let's try something.
French:  Essayons quelque chose.



## 🔢 Step 3: Generate Embeddings with a Multilingual Transformer

We'll now convert both English and French sentences into dense vector embeddings using the model:

📌 `paraphrase-multilingual-MiniLM-L12-v2`  
This model supports over 50 languages and maps similar sentences (even across languages!) to similar embeddings.

We'll use `SentenceTransformer.encode()` to:
- Encode the English queries (the first 200 pairs)
- Encode the matching French documents
- Return both sets as tensors, ready for similarity comparison


In [57]:
from sentence_transformers import SentenceTransformer

# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Confirm it's loaded
print("Model loaded successfully!")

Model loaded successfully!


In [70]:
# Take the first 200 sentence pairs (select() returns a smaller Dataset, not a list)
samples = dataset.select(range(200))

# Extract English and French sentences
english_sentences = [sample['translation']['en'] for sample in samples]
french_sentences = [sample['translation']['fr'] for sample in samples]

# Preview
for en, fr in zip(english_sentences, french_sentences):
    print(f"EN: {en}")
    print(f"FR: {fr}\n")

EN: When he asked who had broken the window, all the boys put on an air of innocence.
FR: Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.

EN: Then, when he asked who had broken the window, all the boys acted innocent.
FR: Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.

EN: Let's try something.
FR: Essayons quelque chose !

EN: Let's try something.
FR: Tentons quelque chose !

EN: Let's try something.
FR: Essayons quelque chose.

EN: I have to go to sleep.
FR: Je dois aller dormir.

EN: I have to go to sleep.
FR: Il faut que j'aille dormir.

EN: I can't abide that fellow.
FR: Je ne supporte pas ce type.

EN: I can't bear that fellow.
FR: Je ne supporte pas ce type.

EN: I can't stand that guy.
FR: Je ne supporte pas ce type.

EN: Today is June 18th and it is Muiriel's birthday!
FR: Aujourd'hui nous sommes le 18 juin et c'est l'anniversaire de Muiriel !

EN: Today is June 18th and it is Muiriel's birt

In [72]:
# Encode English queries
english_embeddings = model.encode(english_sentences, convert_to_tensor=True)

# Encode French documents
french_embeddings = model.encode(french_sentences, convert_to_tensor=True)

## 🔍 Step 4: Semantic Matching with Cosine Similarity

In this step, we’ll:
- Use cosine similarity to compare each English sentence embedding with all French sentence embeddings
- Retrieve the most similar French sentence for each English query

This mimics how a cross-lingual search engine works.

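Cosine similarity measures the angle between two embedding vectors and gives a score between -1 and 1:
- 1 = the vectors point in the same direction (closest possible match)
- 0 = the vectors are orthogonal (no measurable similarity)
- -1 = the vectors point in opposite directions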
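
To make the score concrete, here is a minimal pure-Python sketch of the cosine similarity formula, applied to toy 2-D vectors that stand in for sentence embeddings. The helper name `cosine_similarity` and the vector values are illustrative assumptions; in the notebook itself, `cos_sim` computes the same quantity on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # reversed direction

print(round(same_direction, 4), round(orthogonal, 4), round(opposite, 4))
```

Note that the score depends only on direction: scaling a vector (as in the first pair) does not change it.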


In [80]:
from sentence_transformers.util import cos_sim
import torch

top_matches = []
correct_predictions = 0
total_queries = len(english_embeddings)

# Loop over each English query
for idx, query_embedding in enumerate(english_embeddings):
    # Compute similarity with all French documents
    cosine_scores = cos_sim(query_embedding, french_embeddings)[0]
    
    # Find the top match
    top_match_idx = torch.argmax(cosine_scores).item()
    
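    # Count a match as correct only if it has the same index as the query
    # (a query with several valid translations can therefore be marked wrong)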
    is_correct = (top_match_idx == idx)
    if is_correct:
        correct_predictions += 1

    # Store the result
    top_matches.append({
        "english_query": english_sentences[idx],
        "matched_french_doc": french_sentences[top_match_idx],
        "similarity_score": float(cosine_scores[top_match_idx]),
        "correct": is_correct
    })

# Accuracy summary
accuracy = correct_predictions / total_queries * 100
print(f"\n✅ Top-1 Accuracy: {accuracy:.2f}% ({correct_predictions}/{total_queries})\n")

# Show sample matches
for match in top_matches[:5]:  # show only first 5 for readability
    print(f"English Query: {match['english_query']}")
    print(f"Matched French Doc: {match['matched_french_doc']}")
    print(f"Similarity Score: {match['similarity_score']:.4f}")
    print(f"Correct Match: {'Yes' if match['correct'] else 'No'}\n")


✅ Top-1 Accuracy: 56.50% (113/200)

English Query: When he asked who had broken the window, all the boys put on an air of innocence.
Matched French Doc: Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
Similarity Score: 0.9428
Correct Match: Yes

English Query: Then, when he asked who had broken the window, all the boys acted innocent.
Matched French Doc: Lorsqu'il a demandé qui avait cassé la fenêtre, tous les garçons ont pris un air innocent.
Similarity Score: 0.9525
Correct Match: No

English Query: Let's try something.
Matched French Doc: Essayons quelque chose.
Similarity Score: 0.9937
Correct Match: No

English Query: Let's try something.
Matched French Doc: Essayons quelque chose.
Similarity Score: 0.9937
Correct Match: No

English Query: Let's try something.
Matched French Doc: Essayons quelque chose.
Similarity Score: 0.9937
Correct Match: Yes
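
The sample output above shows a limitation of index-based scoring: the same English sentence appears several times with different French translations, so a perfectly reasonable match can be counted as wrong. One possible way to account for this, sketched below, is to treat any reference translation of the query text as correct. The function name `translation_aware_accuracy` is hypothetical, and this is an alternative metric, not the one used to produce the 56.50% figure.

```python
from collections import defaultdict


def translation_aware_accuracy(queries, documents, predicted_indices):
    """Top-1 accuracy where any reference translation of the query text counts.

    queries[i] and documents[i] form an aligned pair; predicted_indices[i] is
    the index of the document retrieved for queries[i].
    """
    # Collect every document text that appears as a translation of each query text
    valid_translations = defaultdict(set)
    for query, document in zip(queries, documents):
        valid_translations[query].add(document)

    hits = sum(
        documents[predicted] in valid_translations[query]
        for query, predicted in zip(queries, predicted_indices)
    )
    return hits / len(queries)


# The three "Let's try something." pairs from the sample output above
queries = ["Let's try something."] * 3
documents = ["Essayons quelque chose !", "Tentons quelque chose !", "Essayons quelque chose."]
predicted = [2, 2, 2]  # every query retrieved "Essayons quelque chose."

strict = sum(p == i for i, p in enumerate(predicted)) / len(queries)
relaxed = translation_aware_accuracy(queries, documents, predicted)
print(f"index-based: {strict:.2f}, translation-aware: {relaxed:.2f}")
# prints: index-based: 0.33, translation-aware: 1.00
```

In the notebook, the predicted indices would be the `top_match_idx` values collected in the loop, passed together with `english_sentences` and `french_sentences`.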



## 📁 Step 5: Save Results to CSV

We'll now save all results to a `.csv` file, including:
- English queries
- Matched French documents
- Cosine similarity scores
- Whether the top match was correct

This makes it easy to:
- Analyze results in Excel or Python
- Share your model output on GitHub or with recruiters


In [87]:
import pandas as pd

# Convert list of match results into a DataFrame
results_df = pd.DataFrame(top_matches)

# Save to CSV
results_df.to_csv("cross_lingual_results.csv", index=False)

print("Results saved to cross_lingual_results.csv")
results_df.head()

Results saved to cross_lingual_results.csv


| | english_query | matched_french_doc | similarity_score | correct |
|---|---|---|---|---|
| 0 | When he asked who had broken the window, all t... | Lorsqu'il a demandé qui avait cassé la fenêtre... | 0.942797 | True |
| 1 | Then, when he asked who had broken the window,... | Lorsqu'il a demandé qui avait cassé la fenêtre... | 0.952528 | False |
| 2 | Let's try something. | Essayons quelque chose. | 0.993729 | False |
| 3 | Let's try something. | Essayons quelque chose. | 0.993729 | False |
| 4 | Let's try something. | Essayons quelque chose. | 0.993729 | True |
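
As a small example of analyzing the saved file in Python, the sketch below reads the CSV back with the standard library and keeps only the rows marked incorrect. The helper name `load_mismatches` is an assumption for illustration; the column names and file name are the ones written by the cell above.

```python
import csv
from pathlib import Path


def load_mismatches(csv_path):
    """Return the rows of the results file whose top match was marked incorrect."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # pandas writes booleans to CSV as the strings "True" / "False"
    return [row for row in rows if row["correct"] != "True"]


results_path = Path("cross_lingual_results.csv")
if results_path.exists():  # only runs once the results file has been saved
    for row in load_mismatches(results_path):
        print(f"{row['english_query']} -> {row['matched_french_doc']} ({row['similarity_score']})")
```

Reading through the mismatches is a quick way to tell real retrieval errors apart from duplicate translations that index-based scoring penalizes.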


## 💬 Step 6: Interactive User Input for Query Matching

We'll now build a simple interactive demo in Jupyter where you can:

- Enter a custom **English query**
- See the **top 3 most similar French documents**
- View their **similarity scores**

This simulates how a real-world cross-lingual search system could work.


In [106]:
def search_french_docs(user_query, top_k=3):
    # Encode the user query
    query_embedding = model.encode(user_query, convert_to_tensor=True)
    
    # Compute similarity with all French embeddings
    cosine_scores = cos_sim(query_embedding, french_embeddings)[0]
    
    # Get the top-k scores and indices (never ask for more results than documents)
    top_results = torch.topk(cosine_scores, k=min(top_k, len(french_sentences)))

    print(f"\nYour Query: {user_query}\n")
    print("Top Matches:\n")

    for score, idx in zip(top_results.values, top_results.indices):
        print(f"French Doc: {french_sentences[idx.item()]}")
        print(f"Similarity Score: {score.item():.4f}\n")

# 🔁 Run this as many times as you want
user_input = input("Enter your English query: ")
search_french_docs(user_input)

Enter your English query:  America is a lovely place to be, if you are here to earn money.



Your Query: America is a lovely place to be, if you are here to earn money.

Top Matches:

French Doc: L'Amérique est un endroit charmant pour vivre, si c'est pour gagner de l'argent.
Similarity Score: 0.9424

French Doc: L'Amérique est un endroit merveilleux pour y vivre, si vous êtes là-bas pour gagner de l'argent.
Similarity Score: 0.9423

French Doc: C'est bon, tu peux le faire.
Similarity Score: 0.3179



## 💾 Step 7: Save Embeddings and Sentences to File for Flask deployment

To make our Flask app fast and efficient, we’ll pre-compute the following and save them in a `.pkl` (pickle) file:

- English sentences (as queries)
- French sentences (as retrievable documents)
- French sentence embeddings (vector form for similarity search)

This avoids re-encoding the French documents on every request and makes deployment smoother. At query time, the model only needs to encode the user's query.

We’ll later load this file inside `utils.py` in our Flask app.


In [111]:
# Save the sentences and the French embeddings to a .pkl file for the Flask app

import os
import pickle

output_dir = "AI-CrossLingual-Flask-App"
os.makedirs(output_dir, exist_ok=True)  # create the app folder if it is missing

# Move the embeddings to CPU so the file also loads on machines without a GPU
with open(os.path.join(output_dir, "embeddings.pkl"), "wb") as f:
    pickle.dump({
        "english_sentences": english_sentences,
        "french_sentences": french_sentences,
        "french_embeddings": french_embeddings.cpu()
    }, f)

print("Embeddings and sentences saved to embeddings.pkl!")

Embeddings and sentences saved to embeddings.pkl!


## 🛠️ Step 8: Create `AI-CrossLingual-Flask-App/utils.py`

This file will:
- Load the multilingual model
- Load the saved data (English sentences, French sentences, and French embeddings)
- Define a function to perform cosine similarity search
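
The notebook does not show the contents of `utils.py`, so the following is only a sketch of what such a file could look like. The helper names `get_model`, `load_index`, and `rank_documents` are hypothetical, the ranking uses NumPy instead of `cos_sim`, and `load_index` assumes the embeddings were pickled as a CPU tensor or array.

```python
import pickle

import numpy as np

MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"
_model = None  # loaded lazily so importing this module stays cheap


def get_model():
    """Load the multilingual model once and reuse it across requests."""
    global _model
    if _model is None:
        from sentence_transformers import SentenceTransformer
        _model = SentenceTransformer(MODEL_NAME)
    return _model


def load_index(path="embeddings.pkl"):
    """Read the saved pickle and return (french_sentences, embedding matrix)."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    # Assumption: the embeddings were saved as a CPU tensor or array
    matrix = np.asarray(data["french_embeddings"], dtype=np.float32)
    return data["french_sentences"], matrix


def rank_documents(query_vec, doc_matrix, top_k=3):
    """Return [(doc_index, cosine_score)] for the top_k most similar rows."""
    query_vec = np.asarray(query_vec, dtype=np.float32)
    doc_matrix = np.asarray(doc_matrix, dtype=np.float32)
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of the query with every document
    order = np.argsort(-scores)[: min(top_k, len(scores))]
    return [(int(i), float(scores[i])) for i in order]


def search_french_docs(query, sentences, doc_matrix, top_k=3):
    """Encode an English query and return [(french_sentence, score)]."""
    query_vec = get_model().encode(query, convert_to_numpy=True)
    return [(sentences[i], s) for i, s in rank_documents(query_vec, doc_matrix, top_k)]
```

Keeping the ranking step in its own function means it can be checked with small hand-made vectors, without downloading the model.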


## 🌐 Step 9: Create `app.py`

This is the main Flask backend file. It will:
- Serve the HTML UI
- Accept an English query from the user
- Return top-matching French documents using `utils.py`
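
The notebook does not include `app.py` itself, so here is a minimal sketch of how the three responsibilities above could fit together. It makes two assumptions that differ from the real project layout: the page is an inline template string (a stand-in for `templates/index.html`) so the example is self-contained, and the search function is passed in through a hypothetical `create_app` factory rather than imported directly.

```python
from flask import Flask, render_template_string, request

# Inline stand-in for templates/index.html (assumption, keeps the sketch self-contained)
PAGE = """
<form method="post">
  <input name="query" value="{{ query }}" placeholder="Enter your English query">
  <button type="submit">Search</button>
</form>
<ul>
{% for sentence, score in results %}
  <li>{{ sentence }} ({{ "%.4f"|format(score) }})</li>
{% endfor %}
</ul>
"""


def create_app(search_fn):
    """Build the app around any callable: search_fn(query, top_k) -> [(sentence, score)]."""
    app = Flask(__name__)

    @app.route("/", methods=["GET", "POST"])
    def index():
        # GET shows the empty form; POST runs the search for the submitted query
        query = request.form.get("query", "").strip()
        results = search_fn(query, top_k=3) if query else []
        return render_template_string(PAGE, query=query, results=results)

    return app


# In app.py you would expose a module-level `app` for `flask run`, for example:
# app = create_app(my_search_function)  # a search function built on utils.py
```

Passing the search function in as an argument keeps the web layer independent of the model, so the routes can be exercised with a fake search function.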


## 🖼️ Step 10: Create `templates/index.html`

This is the frontend UI of our Flask app. It lets the user:
- Enter an English query
- View top-matching French sentences + similarity scores

To run the app:
- Run `cd AI-CrossLingual-Flask-App`, then `flask run`
- Open http://127.0.0.1:5000 in your browser