
---

## 🧩 **Problem Statement: Translate a PDF Story into Hindi Using LangChain**

---

### 📌 Task:

You are given a PDF file containing an English story. Your goal is to:

1. **Upload and read the PDF file** using `PyMuPDFLoader` from LangChain.
   - PDF file: [Click here](https://drive.google.com/file/d/1hNkcxV4-T5I-wdejnnmKKjU1oinLdGnw/view?usp=sharing)
   - Download the PDF and save it locally.
2. **Extract text** from all pages of the PDF.
3. Use **RecursiveCharacterTextSplitter** to divide the text into **chunks of 30 characters** (no overlap).
4. For each chunk, use an **LLM** (like OpenAI's ChatGPT via `ChatOpenAI`) to **translate the chunk into Hindi**.
5. **Display the original English chunk and its Hindi translation side by side.**

---

### 🧪 Input:

* A PDF file (e.g., `sample_story.pdf`) containing a short English story.

---

### ✅ Output:

* Print the **translated Hindi chunks** in order, one entry per chunk.
* Each entry should show the **original English chunk** followed by its **Hindi translation**.

Example:

```
Chunk 1 (EN): Once upon a time in a small vi
Translation (HI): एक समय की बात है, एक छोटे से गा

Chunk 2 (EN): llage nestled between the hill
Translation (HI): गांव जो पहाड़ियों के बीच बसा
```
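
The example chunks break mid-word ("vi" / "llage") because they are plain 30-character slices of the text. Below is a minimal sketch of that fixed-width slicing, using only the standard library. `fixed_width_chunks` is a hypothetical helper, not part of LangChain, and the sample string is simply the two example chunks joined back together. `RecursiveCharacterTextSplitter` prefers to break at separators such as newlines and spaces, so its real output will usually differ from this.

```python
def fixed_width_chunks(text: str, size: int = 30) -> list[str]:
    """Cut text into consecutive slices of at most `size` characters, no overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# The two example chunks from the problem statement, joined back together.
sample = "Once upon a time in a small village nestled between the hill"
for n, chunk in enumerate(fixed_width_chunks(sample), start=1):
    print(f"Chunk {n} (EN): {chunk}")
```

Joining the slices back together reproduces the input exactly, which is a quick way to confirm that nothing was dropped or duplicated.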

---

### 🔧 Tools You Must Use:

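* `PyMuPDFLoader` for reading the PDF (imported from `langchain_community.document_loaders`; the older `langchain.document_loaders` path is deprecated)
* `RecursiveCharacterTextSplitter` (imported from `langchain_text_splitters`; formerly `langchain.text_splitter`)
* `ChatOpenAI` (imported from `langchain_openai`; formerly `langchain.chat_models`) or any supported LLM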

---

### 📎 Sample File:

You can use this file for testing:
📄 `sample_story.pdf`

---



In [None]:
!pip install -qU langchain langchain-google-genai langchain-community openai pymupdf tqdm

In [None]:
import os
from getpass import getpass


if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google Gemini API key: ")

In [None]:
from google.colab import files

uploaded = files.upload()  # pick your file (e.g., sample_story.pdf)
pdf_path = list(uploaded.keys())[0]
print("Using:", pdf_path)


Saving Sep2025Resume.pdf to Sep2025Resume.pdf
Using: Sep2025Resume.pdf


In [None]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


# 1) Load the PDF
loader = PyMuPDFLoader(pdf_path)
docs = loader.load()  # list[Document], one per page by default

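# 2) Join the page text and split it into chunks of at most 50 characters (no overlap).
#    NOTE: the task statement asks for 30; set chunk_size=30 to follow it exactly.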
full_text = "\n".join(d.page_content for d in docs)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=0,      # <- no overlap
    separators=["\n\n", "\n", " "]  # natural breakpoints first, then fallback
)
chunks = [c for c in splitter.split_text(full_text) if c.strip()]

print(f"Total chunks: {len(chunks)}")


Total chunks: 173
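
To see why splitting at spaces yields chunks that end on word boundaries, here is a simplified, standard-library-only sketch of greedy word packing. It illustrates the idea and is not LangChain's actual algorithm, which also tries `\n\n` and `\n` before spaces and merges pieces differently. `pack_words` is a hypothetical helper, and the sample sentence is taken from the story text shown in this notebook's output.

```python
def pack_words(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most chunk_size chars.

    A single word longer than chunk_size is kept whole (no mid-word cut).
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            # Either the word still fits, or it starts a fresh chunk.
            current = candidate
        else:
            # The word would overflow the chunk: close it and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sentence = (
    "Once upon a time in a small village nestled between the hills, "
    "there lived a curious little girl named"
)
for piece in pack_words(sentence, chunk_size=50):
    print(len(piece), repr(piece))
```

With `chunk_size=50` the first piece stops after "nestled" because adding " between" would bring it to 51 characters, which suggests the breaks in the notebook output fall on spaces rather than at fixed offsets.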


In [None]:
# 3) Init the LLM. The assignment suggests ChatOpenAI; this notebook uses Gemini instead,
#    which the task allows ("or any supported LLM"). temperature=0 keeps output near-deterministic.
from langchain_google_genai import ChatGoogleGenerativeAI
from tqdm import tqdm

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
# If you strictly need "ChatGPT via ChatOpenAI" and your account supports it, install
# langchain-openai, set OPENAI_API_KEY, and swap in for example:
#   from langchain_openai import ChatOpenAI
#   llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 4) Translate chunk-by-chunk
translated = []
system_prompt = (
    "You are a precise translator. Translate the given English text to Hindi. "
    "Preserve meaning and tone. Do not add or omit content. Return only the translation."
)

for chunk in tqdm(chunks, desc="Translating"):
    user_prompt = f"Translate to Hindi:\n\n{chunk}"
    resp = llm.invoke([{"role": "system", "content": system_prompt},
                       {"role": "user", "content": user_prompt}])
    hi = resp.content.strip()
    translated.append((chunk, hi))

# 5) Print: English + Hindi side by side (as requested)
print("\n===== English → Hindi (one per line) =====\n")
for i, (en, hi) in enumerate(translated, start=1):
    print(f"Chunk {i} (EN): {en}")
    print(f"Translation (HI): {hi}\n")

* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 10
Please retry in 31.808346166s. [violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 31
}
].
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 10
Please retry in 29.698613461s. [violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flas


===== English → Hindi (one per line) =====

Chunk 1 (EN): Once upon a time in a small village nestled
Translation (HI): एक समय की बात है, एक छोटे से गाँव में जो बसा हुआ था

Chunk 2 (EN): between the hills, there lived a curious little
Translation (HI): पहाड़ियों के बीच, एक जिज्ञासु छोटा रहता था

Chunk 3 (EN): girl named
Translation (HI): नाम की लड़की

Chunk 4 (EN): Meera. She loved asking questions and exploring
Translation (HI): मीरा। उसे सवाल पूछना और खोजबीन करना बहुत पसंद था।

Chunk 5 (EN): the woods near her home.
Translation (HI): उसके घर के पास का जंगल।

Chunk 6 (EN): Every morning, Meera would set out on a new
Translation (HI): हर सुबह, मीरा एक नई पर निकल पड
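
The quota warnings in the cell output show the free tier allowing 10 requests per minute for this model, so a loop that sends one request per chunk reaches the limit quickly. One way to cope is to retry after a delay that grows with each failure. The following is a minimal sketch under stated assumptions: `call_with_retry` is a hypothetical helper, the delays are illustrative, and `fake_translate` stands in for `llm.invoke` so the snippet runs without an API key. A real version should catch the provider's specific rate-limit exception rather than every `Exception`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


# Stand-in for llm.invoke: fails twice, then succeeds.
calls = {"n": 0}


def fake_translate() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("quota exceeded")
    return "ok"


waits: list[float] = []  # records the delays instead of actually sleeping
result = call_with_retry(fake_translate, base_delay=0.01, sleep=waits.append)
print(result, len(waits))
```

In the translation loop this would wrap each request, for example `call_with_retry(lambda: llm.invoke(messages))`. Since 60 seconds divided by 10 requests is 6 seconds, pausing about 6 seconds between chunks should also keep the loop under the limit quoted in the warning.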


