# Translation

- Author: [Wonyoung Lee](https://github.com/BaBetterB)
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BaBetterB/LangChain-OpenTutorial/blob/main/12-RAG/06-Translation.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/04-SemanticChunker.ipynb)


## Overview

This tutorial compares two approaches to translating Chinese text into English using LangChain.

The first approach utilizes a single LLM (e.g. GPT-4) to generate a straightforward translation. The second approach employs Retrieval-Augmented Generation (RAG), which enhances translation accuracy by retrieving relevant documents.

The tutorial evaluates the translation accuracy and performance of each method, helping users choose the most suitable approach for their needs.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Translation using LLM](#translation-using-llm)
- [Translation using RAG](#translation-using-rag)
- [Evaluation of translation results](#evaluation-of-translation-results)


### References

- [LangChain OpenAIEmbeddings API](https://python.langchain.com/api_reference/openai/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html)
- [NLTK](https://www.nltk.org/)
- [TER](https://machinetranslate.org/ter)
- [BERTScore](https://arxiv.org/abs/1904.09675)
- [FAISS](https://github.com/facebookresearch/faiss)
- [Chinese Source](https://cn.chinadaily.com.cn/)



----

 


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

Install the helper package and the libraries used in this tutorial.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [3]:
# Install required packages
from langchain_opentutorial import package


package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "python-dotenv",
        "langchain_openai",
        "faiss-cpu",
        "sacrebleu",
        "bert_score",
        "nltk",
    ],
    verbose=False,
    upgrade=False,
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Translation",  # title
    }
)

Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [None]:
# Configuration File for Managing API Keys as Environment Variables
from dotenv import load_dotenv

# Load API Key Information
load_dotenv(override=True)

## Translation using LLM

Translation using LLM refers to using a large language model (LLM), such as GPT-4, to translate text from one language to another. 
The model processes the input text and generates a direct translation based on its pre-trained knowledge. This approach is simple, fast, and effective.



In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence

# Create LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Create PromptTemplate
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a professional translator.",
        ),
        (
            "human",
            "Please translate the following Chinese document into natural and accurate English. "
            "Consider the context and vocabulary to ensure smooth and fluent sentences.\n\n"
            "**Chinese Original Text:** {chinese_text}\n\n**English Translation:**",
        ),
    ]
)

translation_chain = RunnableSequence(prompt, llm)

chinese_text = "人工智能正在改变世界，各国都在加紧研究如何利用这一技术提高生产力。"

response = translation_chain.invoke({"chinese_text": chinese_text})

print("Chinese_text:", chinese_text)
print("Translation:", response.content)

## Translation using RAG 

Translation using RAG (Retrieval-Augmented Generation) enhances translation accuracy by combining a pre-trained LLM with a retrieval mechanism. This approach first retrieves relevant documents or data related to the input text and then utilizes this additional context to generate a more precise and contextually accurate translation.
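
The retrieve-then-translate flow described above can be sketched without any API calls. This is a minimal illustration, not the tutorial's implementation: `toy_retrieve` and `build_rag_prompt` are hypothetical helpers, and shared-character counting stands in for the embedding similarity that a real vector store would compute.

```python
def toy_retrieve(query, corpus, k=2):
    """Rank corpus sentences by how many distinct characters they share with the query."""
    query_chars = set(query)
    ranked = sorted(corpus, key=lambda s: len(query_chars & set(s)), reverse=True)
    return ranked[:k]


def build_rag_prompt(chinese_text, retrieved):
    """Place the retrieved sentences ahead of the text to translate."""
    context = "\n".join(f"- {s}" for s in retrieved)
    return (
        "Context (relevant sentences):\n"
        f"{context}\n\n"
        f'Translate into English: "{chinese_text}"'
    )


# Toy corpus (illustrative sentences, not the tutorial's data file)
corpus = [
    "人工智能技术发展迅速。",
    "今天天气很好。",
    "各国加强人工智能研究。",
]
query = "人工智能正在改变世界"

retrieved = toy_retrieve(query, corpus, k=2)
prompt_text = build_rag_prompt(query, retrieved)
print(prompt_text)
```

The unrelated weather sentence is left out of the context, so the model only sees material that shares vocabulary with the input.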


### Simple Search Implementation Using FAISS

In this implementation, we use a vector database to store and retrieve embedded representations of entire sentences. Instead of relying solely on predefined knowledge in the LLM, our approach allows the model to retrieve semantically relevant sentences from the vector database, improving the translation's accuracy and fluency.

**FAISS (Facebook AI Similarity Search)**

FAISS is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It is widely used for approximate nearest neighbor (ANN) search in large-scale datasets.
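
To make "similarity search over dense vectors" concrete, here is a brute-force sketch in plain Python. It is an illustration only: the three-dimensional vectors are made-up toy values rather than real embeddings, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is just one possible metric (the metric used by a real index depends on how it is configured). FAISS performs the same kind of nearest-neighbor lookup with optimized index structures that scale to very large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=2):
    """Return (index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": the first two point in nearly the same direction
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.05, 0.0]

results = top_k(query_vec, vectors, k=2)
for index, score in results:
    print(index, round(score, 3))
```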

In [None]:
import os
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings()

llm = ChatOpenAI(model="gpt-4o-mini")

file_path = "data/news_cn.txt"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"file not found!!: {file_path}")

loader = TextLoader(file_path, encoding="utf-8")
docs = loader.load()


# Vectorizing Sentences Individually
sentences = []
for doc in docs:
    text = doc.page_content
    sentence_list = text.split("。")  # Splitting Chinese sentences based on '。'
    sentences.extend(
        [sentence.strip() for sentence in sentence_list if sentence.strip()]
    )


# Store sentences in the FAISS vector database
vector_store = FAISS.from_texts(sentences, embedding=embeddings)

# Search vectors using keywords "人工智能"
search_results = vector_store.similarity_search("人工智能", k=3)

# check result
print("Search result")
for idx, result in enumerate(search_results, start=1):
    print(f"{idx}. {result.page_content}")

### Let's compare translation using LLM and translation using RAG.

First, write the necessary functions.

In [None]:
import re
import nltk
from nltk.tokenize import sent_tokenize
from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence


llm = ChatOpenAI(model="gpt-4o-mini")


# Download the necessary data for sentence tokenization in NLTK (requires initial setup)
nltk.download("punkt")
# Recent NLTK releases load the tokenizer data from "punkt_tab" instead
nltk.download("punkt_tab")


# Document Search Function (Used in RAG)
def retrieve_relevant_docs(query, vector_store, k=3):
    """
    Searches for relevant documents using vector similarity search.

    Parameters:
        query (str): The keyword or sentence to search for.
        vector_store (FAISS): The vector database.
        k (int): The number of top matching documents to retrieve (default: 3).

    Returns:
        list: A list of retrieved document texts.
    """
    search_results = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in search_results]


# Translation using only LLM
def translate_with_llm(chinese_text):
    """
    Translates Chinese text into English using GPT-4o-mini.

    Parameters:
        chinese_text (str): The input Chinese sentence to be translated.

    Returns:
        AIMessage: The model response; the translated text is in its `.content` attribute.
    """
    # Use a template variable rather than an f-string so that braces
    # in the input text are not parsed as template placeholders.
    prompt_template_llm = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Translate the following Chinese sentence into English:",
            ),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            ("user", "Please provide an accurate translation."),
        ]
    )

    translation_chain_llm = RunnableSequence(prompt_template_llm, llm)

    return translation_chain_llm.invoke({"chinese_text": chinese_text})


# RAG-based Translation
def translate_with_rag(chinese_text, vector_store):
    """
    Translates Chinese text into English using the RAG approach.
    It first retrieves relevant documents and then uses them for context-aware translation.

    Parameters:
        chinese_text (str): The input Chinese sentence to be translated.
        vector_store (FAISS): The vector database for document retrieval.

    Returns:
        AIMessage: The model response; the translated text is in its `.content` attribute.
    """
    retrieved_docs = retrieve_relevant_docs(chinese_text, vector_store)

    # Add retrieved documents as context
    context = "\n".join(retrieved_docs)

    # Construct prompt template (Using RAG).
    # Template variables are used instead of f-strings so that braces in the
    # source text or retrieved context are not parsed as placeholders.
    prompt_template_rag = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Below is the Chinese text that needs to be translated into English. Additionally, the following context has been provided from relevant documents that might help you in producing a more accurate and context-aware translation.",
            ),
            ("system", "Context (Relevant Documents):\n{context}"),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            (
                "user",
                "Please provide a translation that is both accurate and reflects the context from the documents provided.",
            ),
        ]
    )

    translation_chain_rag = RunnableSequence(prompt_template_rag, llm)

    # Request translation using RAG
    return translation_chain_rag.invoke(
        {"context": context, "chinese_text": chinese_text}
    )


# Load Chinese text from a file and split it into sentences, returning them as a list.
def chinese_text_from_file_loader(path):
    """
    Loads Chinese text from a file and splits it into individual sentences.

    Parameters:
        path (str): File path.

    Returns:
        list: List of Chinese sentences.
    """
    # Load data
    loader = TextLoader(path, encoding="utf-8")
    docs = loader.load()

    return split_chinese_sentences_from_docs(docs)


# Split sentences from a list of documents and return them as a list
def split_chinese_sentences_from_docs(docs):
    """
    Extracts sentences from a list of documents.

    Parameters:
        docs (list): List of document objects.

    Returns:
        list: List of extracted sentences.
    """
    sentences = []

    for doc in docs:
        text = doc.page_content
        sentences.extend(split_chinese_sentences(text))

    return sentences


# Use regular expressions to split sentences and punctuation together.
# Then, combine the sentences and punctuation back and return them
def split_chinese_sentences(text):
    """
    Splits Chinese text into sentences based on punctuation marks (。！？).

    Parameters:
        text (str): The input Chinese text.

    Returns:
        list: List of separated sentences.
    """
    # Separate sentences and punctuation,
    sentence_list = re.split(r"([。！？])", text)

    # Combine the sentences and punctuation back to restore them.
    merged_sentences = [
        "".join(x) for x in zip(sentence_list[0::2], sentence_list[1::2])
    ]

    # Remove empty sentences and return the result.
    return [sentence.strip() for sentence in merged_sentences if sentence.strip()]


def count_chinese_sentences(docs):
    """
    Counts the number of sentences in a given Chinese text.

    Parameters:
        docs (str or list): A text string, or a list of document objects.

    Returns:
        list: List of split sentences.
    """
    if isinstance(docs, str):
        sentences = split_chinese_sentences(docs)
    else:
        # Treat any other input as a list of document objects
        sentences = split_chinese_sentences_from_docs(docs)

    print(f"Total number of sentences: {len(sentences)}")
    return sentences


def split_english_sentences_from_docs(docs):
    """
    Splits the English text of each document into sentences using NLTK's sentence tokenizer.

    Parameters:
        docs (list): List of document objects.

    Returns:
        list: List of separated sentences.
    """
    sentences = []

    for doc in docs:
        text = doc.page_content
        sentences.extend(split_english_sentences(text))
    return sentences


# Use NLTK's sent_tokenize() to split sentences accurately.
# By default, it recognizes periods (.), question marks (?), and exclamation marks (!) to separate sentences.
def split_english_sentences(text):
    """
    Splits English text into sentences using NLTK's sentence tokenizer.

    Parameters:
        text (str): The input English text.

    Returns:
        list: List of separated sentences.
    """
    return sent_tokenize(text)


def count_paragraphs_and_sentences(docs):
    """
    Counts the number of paragraphs and sentences in a given text.

    Parameters:
        docs (str): Input text data.

    Returns:
        int: Total number of sentences.
    """
    if not isinstance(docs, str):
        raise TypeError("docs must be a string")

    # Paragraphs are separated by one or more blank lines
    paragraphs = re.split(r"\n\s*\n", docs.strip())
    paragraphs = [para.strip() for para in paragraphs if para.strip()]
    sentences = [sent for para in paragraphs for sent in sent_tokenize(para)]

    print(f"Total number of paragraphs : {len(paragraphs)}")
    print(f"Total number of sentences  : {len(sentences)}")
    return len(sentences)
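
One detail of `split_chinese_sentences` is worth knowing: `re.split` with a capturing group returns text pieces and delimiters alternately, and `zip` stops at the shorter of its two inputs. A trailing fragment with no closing punctuation is therefore dropped. The sketch below reproduces that behavior and shows one possible alternative; `split_keep_tail` is a hypothetical helper, not part of the tutorial.

```python
import re
from itertools import zip_longest


def split_with_zip(text):
    """Same splitting logic as split_chinese_sentences."""
    parts = re.split(r"([。！？])", text)
    merged = ["".join(pair) for pair in zip(parts[0::2], parts[1::2])]
    return [s.strip() for s in merged if s.strip()]


def split_keep_tail(text):
    """Variant that keeps a final fragment lacking 。！？ (hypothetical helper)."""
    parts = re.split(r"([。！？])", text)
    merged = [
        "".join(pair)
        for pair in zip_longest(parts[0::2], parts[1::2], fillvalue="")
    ]
    return [s.strip() for s in merged if s.strip()]


# The last clause has no terminal punctuation
sample = "今天天气很好！你去哪里？我在家"
print(split_with_zip(sample))
print(split_keep_tail(sample))
```

Whether dropping the tail matters depends on the input: text that always ends with terminal punctuation is unaffected.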

**Use the written functions to perform the comparison.**

In [None]:
sentences = chinese_text_from_file_loader("data/comparison_cn.txt")

chinese_text = ""

for sentence in sentences:
    chinese_text += sentence

# LLM
llm_translation = translate_with_llm(chinese_text)


# RAG
rag_translation = translate_with_rag(chinese_text, vector_store)


print("\ninput chinese text")
count_chinese_sentences(chinese_text)
print(chinese_text)


print("\nTranslation using LLM")
count_paragraphs_and_sentences(llm_translation.content)
print(llm_translation.content)


print("\nTranslation using RAG")
count_paragraphs_and_sentences(rag_translation.content)
print(rag_translation.content)

## Evaluation of translation results

Evaluating machine translation quality is essential to ensure the accuracy and fluency of translated text. In this tutorial, we use two key metrics, TER and BERTScore, to assess the quality of translations produced by both a general LLM-based translation system and a RAG-based translation system.

The two metrics complement each other:

- TER measures the structural differences and required edits between a translation and its reference text.
- BERTScore captures the semantic similarity between a translation and its reference.

Used together, they let us compare LLM and RAG translations and judge which method produces more accurate, fluent, and natural output.


**TER (Translation Edit Rate)**

TER quantifies how much editing is required to transform a system-generated translation into the reference translation. It counts insertions, deletions, substitutions, and shifts (word reordering), normalized by the length of the reference.

Interpretation:
- Lower TER indicates a better translation (fewer modifications needed).
- Higher TER suggests that the translation deviates significantly from the reference.
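
As a rough illustration of the idea, the sketch below computes a simplified edit rate: the word-level edit distance (insertions, deletions, substitutions) divided by the number of reference words, expressed as a percentage. It deliberately omits the shift operation and splits on whitespace only, so it is not full TER and its values can differ from sacreBLEU's. `word_edit_distance` and `simple_edit_rate` are hypothetical helpers.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over word lists (no shift operation)."""
    prev = list(range(len(ref_words) + 1))
    for i, h in enumerate(hyp_words, start=1):
        curr = [i]
        for j, r in enumerate(ref_words, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def simple_edit_rate(hypothesis, reference):
    """Edits needed per reference word, as a percentage."""
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    edits = word_edit_distance(hyp_words, ref_words)
    return 100.0 * edits / len(ref_words)


reference = "the cat sat on the mat"
hypothesis = "the cat is on the mat"
rate = simple_edit_rate(hypothesis, reference)
print(round(rate, 2))

# A long hypothesis against a one-token reference exceeds 100
long_hypothesis = " ".join(["word"] * 20)
long_rate = simple_edit_rate(long_hypothesis, "参考")
print(long_rate)
```

Because the rate is normalized by the reference length, it is not capped at 100. A reference without whitespace counts as a single token here, so a 20-word hypothesis scored against it gives a very large value.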

**BERTScore - Contextual Semantic Evaluation**

BERTScore evaluates translation quality by computing semantic similarity scores between reference and candidate translations. It utilizes contextual embeddings from a pre-trained BERT model, unlike traditional n-gram-based methods that focus solely on word overlap.

Interpretation:
- Higher BERTScore (closer to 1.0) indicates better semantic similarity between the candidate and reference translations.
- Lower scores indicate less semantic alignment with the reference translation.
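
The matching step behind BERTScore can be sketched with toy vectors. In the sketch below, each token is represented by a hand-made two-dimensional vector (an assumption for illustration; the real metric uses contextual embeddings from a pre-trained model). Precision averages, over candidate tokens, the best cosine similarity to any reference token; recall does the same from the reference side; F1 is their harmonic mean. `greedy_match_f1` is a hypothetical helper and omits options such as IDF weighting and baseline rescaling.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_match_f1(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy token vectors: one candidate token matches exactly, one only partially
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_f1(candidate_vecs, reference_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because matching is done on vectors rather than exact words, a paraphrase whose tokens lie close to the reference tokens can still score well, which is what distinguishes this family of metrics from n-gram overlap.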

By leveraging both TER and BERTScore, we can effectively analyze the strengths and weaknesses of LLM-based and RAG-based translation methods.

In [None]:
import random
import nltk
import sacrebleu
import bert_score

# Download required NLTK resources
nltk.download("punkt")


# TER Score Calculation
def calculate_ter(reference, candidate):
    ter_metric = sacrebleu.metrics.TER()
    return round(ter_metric.corpus_score([candidate], [[reference]]).score, 3)


# BERTScore Calculation
def calculate_bert_score(reference, candidate):
    try:
        P, R, F1 = bert_score.score([candidate], [reference], lang="en")
        return round(F1.mean().item(), 3)
    except Exception as e:
        print(f"Error calculating BERTScore: {e}")
        return None


sentences = chinese_text_from_file_loader("data/comparison_cn.txt")

# Store sentences in the FAISS vector database
vector_store = FAISS.from_texts(sentences, embedding=embeddings)

# Randomly select up to 5 sentences
selected_sentences = random.sample(sentences, min(5, len(sentences)))


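# Execute translation
# NOTE: no English reference translations are available here, so the Chinese
# source sentence is passed as the reference. TER and BERTScore are designed
# for a same-language reference, so treat these scores as rough indicators only.
translated_results = []
for idx, sentence in enumerate(selected_sentences, start=1):
    # Each helper returns a message object; keep only the translated text
    llm_translation = translate_with_llm(sentence).content
    rag_translation = translate_with_rag(sentence, vector_store).content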

    # Evaluate translation quality (LLM)
    ter_llm = calculate_ter(sentence, llm_translation)
    bert_llm = calculate_bert_score(sentence, llm_translation)

    # Evaluate translation quality (RAG)
    ter_rag = calculate_ter(sentence, rag_translation)
    bert_rag = calculate_bert_score(sentence, rag_translation)

    translated_results.append(
        {
            "source_text": sentence,
            "llm_translation": llm_translation,
            "rag_translation": rag_translation,
            "TER LLM": ter_llm,
            "BERTScore LLM": bert_llm,
            "TER RAG": ter_rag,
            "BERTScore RAG": bert_rag,
        }
    )


# Display results in a transposed format
for idx, result in enumerate(translated_results, start=1):
    print(f"**Sentence {idx}**")
    print("-" * 60)
    print(f"Source Text       | {result['source_text']}")
    print(f"LLM Translation   | {result['llm_translation']}")
    print(f"RAG Translation   | {result['rag_translation']}")
    print(f"TER Score (LLM)   | {result['TER LLM']}")
    print(f"BERTScore (LLM)   | {result['BERTScore LLM']}")
    print(f"TER Score (RAG)   | {result['TER RAG']}")
    print(f"BERTScore (RAG)   | {result['BERTScore RAG']}")
    print("-" * 60, "\n")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\herme\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

🔹 **Sentence 1**
------------------------------------------------------------
Source Text       | 数据流通方面，标准包括数据产品、数据确权、数据资源定价、数据流通交易。
LLM Translation   | In terms of data circulation, the standards include data products, data ownership confirmation, data resource pricing, and data circulation transactions.
RAG Translation   | In terms of data circulation, the standards include data products, data ownership verification, data resource pricing, and data circulation transactions.
TER Score (LLM)   | 2000.0
BERTScore (LLM)   | 0.763
TER Score (RAG)   | 2000.0
BERTScore (RAG)   | 0.762
------------------------------------------------------------ 

🔹 **Sentence 2**
------------------------------------------------------------
Source Text       | 安全保障方面，标准涉及数据基础设施安全，数据要素市场安全，数据流通安全。
LLM Translation   | "In terms of security guarantees, the standards cover the safety of data infrastructure, the security of data element markets, and the safety of data circulation."
RAG Translation   | In terms o