# Translation

- Author: [Wonyoung Lee](https://github.com/BaBetterB)
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BaBetterB/LangChain-OpenTutorial/blob/main/15-Agent/05-Iteration-HumanInTheLoop.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/04-SemanticChunker.ipynb)


## Overview

This tutorial compares two approaches to translating Chinese text into English using LangChain.

The first approach utilizes a single LLM (e.g. GPT-4) to generate a straightforward translation. The second approach employs Retrieval-Augmented Generation (RAG), which enhances translation accuracy by retrieving relevant documents.

The tutorial evaluates the translation accuracy and performance of each method, helping users choose the most suitable approach for their needs.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Translation using LLM](#translation-using-llm)
- [Translation using RAG](#translation-using-rag)
- [Evaluation of translation results](#evaluation-of-translation-results)




----

 


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

Install the `langchain-opentutorial` package first, then use it to install the packages this tutorial needs.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial




In [2]:
# Install required packages
from langchain_opentutorial import package


package.install(
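    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "python-dotenv",  # provides `from dotenv import load_dotenv`
        "langchain_openai",
        "transformers",
        "faiss-cpu",
        "sentence_transformers",
        "sacrebleu",
        "unbabel-comet",
        "pandas",  # used in the evaluation section
        "nltk",  # used in the evaluation section
        "tabulate",  # used in the evaluation section
    ],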
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Translation",  # title
    }
)

Environment variables have been set successfully.


Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [4]:
# Configuration File for Managing API Keys as Environment Variables
from dotenv import load_dotenv

# Load API Key Information
load_dotenv(override=True)

True
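
Before calling the API, it can help to confirm that the key actually reached the process environment. The following is a minimal sketch that uses only the standard library; `missing_env_vars` is a hypothetical helper name and is not part of `langchain-opentutorial` or `python-dotenv`.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical usage: report missing keys before any API call is made.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing an explicit mapping as `environ` makes the helper easy to check without touching the real environment.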

## Translation using LLM

Translation using LLM refers to using a large language model (LLM), such as GPT-4, to translate text from one language to another.
The model processes the input text and generates a direct translation based on its pre-trained knowledge. This approach is simple and fast because it needs only a single model call, but the model has no access to domain-specific reference material.



In [7]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence

# Create LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Create PromptTemplate
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a professional translator.",
        ),
        (
            "human",
            "Please translate the following Chinese document into natural and accurate English. "
            "Consider the context and vocabulary to ensure smooth and fluent sentences.\n\n"
            "**Chinese Original Text:** {chinese_text}\n\n**English Translation:**",
        ),
    ]
)

translation_chain = RunnableSequence(prompt, llm)

chinese_text = "人工智能正在改变世界，各国都在加紧研究如何利用这一技术提高生产力。"

response = translation_chain.invoke({"chinese_text": chinese_text})

print("Chinese_text:", chinese_text)
print("Translation:", response.content)

Chinese_text: 人工智能正在改变世界，各国都在加紧研究如何利用这一技术提高生产力。
Translation: Artificial intelligence is transforming the world, and countries are intensifying their research on how to leverage this technology to enhance productivity.


## Translation using RAG 

Translation using RAG (Retrieval-Augmented Generation) enhances translation accuracy by combining a pre-trained LLM with a retrieval mechanism. It first retrieves documents or data related to the input text, then passes this additional context to the model so that it can generate a more precise, context-aware translation. This approach is particularly useful for technical terms, specialized content, and context-sensitive translations.
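
The retrieve-then-prompt flow described above can be sketched without any model or vector store. This is a minimal illustration that assumes a toy character-bigram overlap score in place of embedding similarity; `bigrams`, `retrieve`, and `build_prompt` are hypothetical names, and the sample sentences are invented for the example.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in a string."""
    return {text[i : i + 2] for i in range(len(text) - 1)}


def retrieve(query, corpus, k=2):
    """Rank corpus sentences by bigram overlap with the query (toy stand-in for embeddings)."""
    query_grams = bigrams(query)
    ranked = sorted(corpus, key=lambda s: len(query_grams & bigrams(s)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_docs):
    """Place the retrieved sentences ahead of the text to translate."""
    context = "\n".join(context_docs)
    return (
        f"Context (relevant documents):\n{context}\n\n"
        f"Chinese sentence: {query}\n"
        "Translate the sentence into English, using the context for terminology."
    )


# Invented sample data for illustration only
corpus = [
    "人工智能正在改变医疗行业",
    "足球比赛在周末举行",
    "人工智能技术提高了生产力",
]
query = "人工智能如何提高生产力"
top_docs = retrieve(query, corpus, k=2)
print(build_prompt(query, top_docs))
```

In the real pipeline the overlap score is replaced by embedding similarity, but the shape is the same: rank, keep the top `k`, and prepend them to the translation request.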


### Simple Search Implementation Using FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It is widely used for approximate nearest neighbor (ANN) search in large-scale datasets.
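
To make the idea of similarity search concrete, the sketch below performs an exhaustive nearest-neighbor search over a few toy vectors in plain Python. It illustrates the concept only and is not FAISS itself, which provides optimized exact and approximate indexes for the same task. The function name `nearest_neighbors` and the vectors are made up for the example.

```python
import math


def nearest_neighbors(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to the query (L2 distance)."""
    distances = [(idx, math.dist(query, vec)) for idx, vec in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions
vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
]
result = nearest_neighbors([0.9, 1.2], vectors, k=2)
for idx, dist in result:
    print(idx, round(dist, 3))
```

An exhaustive scan like this compares the query with every stored vector, which is why approximate indexes become attractive as the collection grows.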

In [10]:
import os
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings()

llm = ChatOpenAI(model="gpt-4o-mini")

file_path = "data/news_cn.txt"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"file not found!!: {file_path}")

loader = TextLoader(file_path, encoding="utf-8")
docs = loader.load()


# Split each document into sentences so that every sentence is embedded individually
sentences = []
for doc in docs:
    text = doc.page_content
    sentence_list = text.split("。")  # Split on the Chinese full stop '。'
    sentences.extend(
        [sentence.strip() for sentence in sentence_list if sentence.strip()]
    )


# Store sentences in the FAISS vector database
vector_store = FAISS.from_texts(sentences, embedding=embeddings)

# Search vectors using keywords "人工智能"
search_results = vector_store.similarity_search("人工智能", k=3)

# check result
print("Search result")
for idx, result in enumerate(search_results, start=1):
    print(f"{idx}. {result.page_content}")

Search result
1. 当地球员并非专业人士，而是农民、建筑工人、教师和学生，对足球的热爱将他们凝聚在一起
2. ”卡卡说道
3. “足球让我们结识新朋友，连接更广阔的世界


### Comparing LLM Translation and RAG Translation

First, write the necessary functions.

In [13]:
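from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableSequence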


llm = ChatOpenAI(model="gpt-4o-mini")


# Document Search Function (Used in RAG)
def retrieve_relevant_docs(query, vector_store, k=3):
    # Perform search and return relevant documents
    search_results = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in search_results]


# Translation using only LLM
def translate_with_llm(chinese_text):
    # Use a template variable instead of an f-string so that braces in the
    # input text are not misread as template placeholders
    prompt_template_llm = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Translate the following Chinese sentence into English:",
            ),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            ("user", "Please provide an accurate translation."),
        ]
    )
    translation_chain_llm = RunnableSequence(prompt_template_llm, llm)
    return translation_chain_llm.invoke({"chinese_text": chinese_text})


# RAG-based Translation
def translate_with_rag(chinese_text, vector_store):
    retrieved_docs = retrieve_relevant_docs(chinese_text, vector_store)

    # Add retrieved documents as context
    context = "\n".join(retrieved_docs)

    # Construct prompt template (Using RAG). The context and the input text are
    # passed as template variables rather than f-strings, so braces in either
    # one are not misread as template placeholders.
    prompt_template_rag = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Below is the Chinese text that needs to be translated into English. Additionally, the following context has been provided from relevant documents that might help you in producing a more accurate and context-aware translation.",
            ),
            ("system", "Context (Relevant Documents):\n{context}"),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            (
                "user",
                "Please provide a translation that is both accurate and reflects the context from the documents provided.",
            ),
        ]
    )

    translation_chain_rag = RunnableSequence(prompt_template_rag, llm)

    # Request translation using RAG
    return translation_chain_rag.invoke(
        {"context": context, "chinese_text": chinese_text}
    )


# Function to load a text file and return its sentences as a list
def chinese_text_from_file_loader(path):
    # Load data
    loader = TextLoader(path, encoding="utf-8")
    docs = loader.load()

    # Split every loaded document into sentences
    sentences = []
    for doc in docs:
        # In Chinese, sentences are usually separated by '。'
        sentence_list = doc.page_content.split("。")
        sentences.extend(
            [sentence.strip() for sentence in sentence_list if sentence.strip()]
        )

    print(len(sentences))
    return sentences

**Use the written functions to perform the comparison.**

In [14]:
sentences = chinese_text_from_file_loader("data/comparison_cn.txt")
# Rejoin the sentences into one passage (the '。' delimiters were dropped by the split)
chinese_text = "".join(sentences)

# LLM
llm_translation = translate_with_llm(chinese_text)

# RAG
rag_translation = translate_with_rag(chinese_text, vector_store)


print(chinese_text)

print("\nLLM")

print(llm_translation.content)

print("\nRAG")
print(rag_translation.content)

11
当前，我国中医药领域高水平科技创新平台加速集聚2025年1月9日在北京举行的全国中医药科技工作会议上的数据显示，我国不断深化中医药科技创新体系建设，中医药科技创新成果接连涌现目前，已基本构建起覆盖“国家—行业—地方”三级中医药科技创新平台体系，各省级平台建设数量超过1200个中医药领域已有7个全国重点实验室、5个国家工程研究中心、4个国家医学攻关产教融合创新平台获批建设，46个国家中医药传承创新中心建设正在推进地方研究平台、科研院所建设力度加大，中医药广东省实验室、湖北时珍实验室、河南省中医药科学院等一批省级新型科研平台相继组建据介绍，我国对中医药原创理论的科学阐释与认识进一步深化，在研究上取得一批重要成果在心脑血管、代谢、消化等多个疾病领域，中医临床评价研究取得重要进展2023年以来，基于循证证据和大量临床实践，遴选发布了50个中医治疗优势病种、52个中西医结合诊疗方案、100项适宜技术、100个疗效独特的中药品种中药资源保护和创新研发也有不少新进展据介绍，我国建立了28个中药材种子种苗繁育基地，120多种大宗或道地药材实现规范化种植，100余种中药材开展生态种植2021年以来，43个中药新药获批上市，包括19个古代经典名方中药复方制剂，中药新药研发进程明显加快

LLM
Currently, high-level technological innovation platforms in the field of traditional Chinese medicine (TCM) in our country are accelerating their gathering. Data from the National TCM Science and Technology Work Conference held on January 9, 2025, in Beijing shows that our country is continuously deepening the construction of the TCM technological innovation system, with TCM technological innovation achievements emerging in succession. At present,

## Evaluation of translation results

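The translation results are evaluated with two metrics computed by `sacrebleu`: BLEU (higher is better) and TER (lower is better). COMET and GPT-based assessment are being considered as additional measures to improve the accuracy and quality of the evaluation.

Both BLEU and TER compare a translation against a reference translation in the target language. The sample data used here contains no English reference, so the code scores each translation against the Chinese source text. That is why the rows shown in the output report a BLEU of 0.0 and a TER of 800.0; these values should not be read as a measure of translation quality.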
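
TER is roughly the number of word-level edits needed to turn a translation into the reference, divided by the reference length. The sketch below is a simplified version that assumes whitespace tokenization and no phrase shifts (full TER, as implemented in `sacrebleu`, also counts shifts); `word_edit_distance` and `simple_ter` are hypothetical names. It shows why scoring an English sentence against unsegmented Chinese text, which whitespace tokenization treats as a single token, yields a very large value. The second call uses another English sentence as a stand-in reference purely for illustration.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_tok == ref_tok else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def simple_ter(reference, candidate):
    """Edits divided by reference length, as a percentage (no shift operations)."""
    ref_tokens = reference.split()
    hyp_tokens = candidate.split()
    edits = word_edit_distance(hyp_tokens, ref_tokens)
    return 100.0 * edits / max(len(ref_tokens), 1)


candidate = "This product is very popular in the market."

# Chinese source as "reference": 8 edits against a 1-token reference
print(simple_ter("这个产品在市场上很受欢迎。", candidate))  # 800.0

# English stand-in reference: 2 substitutions against an 8-token reference
print(simple_ter("This product is well received in the market.", candidate))  # 25.0
```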

In [15]:
import pandas as pd
import json
import nltk
import sacrebleu
from tabulate import tabulate


nltk.download("punkt")


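# BLEU: sentence-level score of one candidate against a single reference
# (sacrebleu expects the hypothesis first, then a list of references)
def calculate_bleu(reference, candidate):
    return round(sacrebleu.sentence_bleu(candidate, [reference]).score, 3)


# TER: translation edit rate of one candidate against a single reference
# (corpus_score expects a list of hypotheses and a list of reference streams)
def calculate_ter(reference, candidate):
    ter_metric = sacrebleu.metrics.TER()
    return round(ter_metric.corpus_score([candidate], [[reference]]).score, 3)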


json_file_path = "data/translations_comparison.json"


def load_json_data(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data


translations_data = load_json_data(json_file_path)


df = pd.DataFrame(translations_data)

results = []

for _, row in df.iterrows():
    source_text = row["source_text"]
    translation_1 = row["translation_1"]
    translation_2 = row["translation_2"]

    # translation_1 evaluation
    bleu_1 = calculate_bleu(source_text, translation_1)
    ter_1 = calculate_ter(source_text, translation_1)

    # translation_2 evaluation
    bleu_2 = calculate_bleu(source_text, translation_2)
    ter_2 = calculate_ter(source_text, translation_2)

    results.append(
        {
            "Category": "Source Text",
            "Text": source_text,
            "BLEU": "-",
            "TER": "-",
        }
    )
    results.append(
        {
            "Category": "Translation 1",
            "Text": translation_1,
            "BLEU": bleu_1,
            "TER": ter_1,
        }
    )
    results.append(
        {
            "Category": "Translation 2",
            "Text": translation_2,
            "BLEU": bleu_2,
            "TER": ter_2,
        }
    )

results_df = pd.DataFrame(results)


def display_results(dataframe):
    print("\n📌 **Translation Quality Evaluation (BLEU & TER Scores)**\n")
    print(tabulate(dataframe, headers="keys", tablefmt="fancy_grid"))


display_results(results_df)


📌 **Translation Quality Evaluation (BLEU & TER Scores)**

╒════╤═══════════════╤════════════════════════════════════════════════╤════════╤═══════╕
│    │ Category      │ Text                                           │ BLEU   │ TER   │
╞════╪═══════════════╪════════════════════════════════════════════════╪════════╪═══════╡
│  0 │ Source Text   │ 这个产品在市场上很受欢迎。                     │ -      │ -     │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  1 │ Translation 1 │ This product is very popular in the market.    │ 0.0    │ 800.0 │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  2 │ Translation 2 │ This product is well received in the market.   │ 0.0    │ 800.0 │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  3 │ Source Text   │ 人工智能正在改变世界。                         │ -      │ -     │
├────┼───────────────┼────────────────────────────────────────────────┼────

