# Translation

- Author: [Wonyoung Lee](https://github.com/BaBetterB)
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BaBetterB/LangChain-OpenTutorial/blob/main/15-Agent/05-Iteration-HumanInTheLoop.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/04-SemanticChunker.ipynb)


## Overview

This tutorial compares two approaches to translating Chinese text into English using LangChain.

The first approach utilizes a single LLM (e.g. GPT-4) to generate a straightforward translation. The second approach employs Retrieval-Augmented Generation (RAG), which enhances translation accuracy by retrieving relevant documents.

The tutorial evaluates the translation accuracy and performance of each method, helping users choose the most suitable approach for their needs.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Translation using LLM](#translation-using-llm)
- [Translation using RAG](#translation-using-rag)
- [Evaluation of translation results](#evaluation-of-translation-results)


### References



----

 


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

Install the `langchain-opentutorial` package.

In [2]:
%%capture --no-stderr
%pip install langchain-opentutorial



In [3]:
# Install required packages
from langchain_opentutorial import package


package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "python-dotenv",  # provides `from dotenv import load_dotenv`
        "langchain_openai",
        "transformers",
        "faiss-cpu",
        "sentence_transformers",
        "sacrebleu",
        "unbabel-comet",
        "nltk",
        "pandas",
        "tabulate",
    ],
    verbose=False,
    upgrade=False,
)

In [4]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Translation",  # title
    }
)

Environment variables have been set successfully.


Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [5]:
# Configuration File for Managing API Keys as Environment Variables
from dotenv import load_dotenv

# Load API Key Information
load_dotenv(override=True)

True
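
Before calling the model, it can help to confirm that the key actually reached the process environment. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and is not part of `langchain-opentutorial` or `python-dotenv`.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Hypothetical usage: report the variables this tutorial expects.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.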

## Translation using LLM

Translation using LLM refers to using a large language model (LLM), such as GPT-4, to translate text from one language to another. 
The model processes the input text and generates a direct translation based on its pre-trained knowledge. This approach is simple, fast, and effective.



In [6]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence

# Create LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Create PromptTemplate
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a professional translator.",
        ),
        (
            "human",
            "Please translate the following Chinese document into natural and accurate Korean. "
            "Consider the context and vocabulary to ensure smooth and fluent sentences.\n\n"
            "**Chinese Original Text:** {chinese_text}\n\n**Korean Translation:**",
        ),
    ]
)

translation_chain = RunnableSequence(prompt, llm)

chinese_text = "人工智能正在改变世界，各国都在加紧研究如何利用这一技术提高生产力。"

response = translation_chain.invoke({"chinese_text": chinese_text})

print("Chinese_text:", chinese_text)
print("Translation:", response.content)


Chinese_text: 人工智能正在改变世界，各国都在加紧研究如何利用这一技术提高生产力。
Translation: 인공지능이 세계를 변화시키고 있으며, 각국은 이 기술을 활용하여 생산성을 높이는 방법에 대한 연구를 가속화하고 있습니다.



## Translation using RAG 

Translation using RAG (Retrieval-Augmented Generation) enhances translation accuracy by combining a pre-trained LLM with a retrieval mechanism. It first retrieves documents or data relevant to the input text, then uses this additional context to generate a more precise and contextually accurate translation. This approach is particularly useful for technical terms, specialized content, or context-sensitive translations.
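
The retrieve-then-generate flow described above can be sketched without any external service. This is only an illustration under stated assumptions: the relevance score is a toy character-overlap count rather than an embedding similarity, and the names `char_overlap`, `retrieve`, `build_rag_prompt`, and the mini-corpus are hypothetical.

```python
def char_overlap(query, passage):
    """Toy relevance score: number of distinct characters shared by query and passage."""
    return len(set(query) & set(passage))


def retrieve(query, passages, k=2):
    """Return the k passages with the highest toy relevance score."""
    ranked = sorted(passages, key=lambda p: char_overlap(query, p), reverse=True)
    return ranked[:k]


def build_rag_prompt(query, passages, k=2):
    """Assemble a translation prompt that places retrieved passages before the source text."""
    context = "\n".join(retrieve(query, passages, k))
    return (
        "Context (relevant documents):\n"
        f"{context}\n\n"
        f'Chinese sentence: "{query}"\n'
        "Translate the sentence into Korean, using the context where it helps."
    )


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "人工智能正在改变世界。",
    "足球让我们结识新朋友。",
    "各国都在研究人工智能技术。",
]
print(build_rag_prompt("人工智能技术提高生产力", corpus))
```

The prompt text is what would be sent to the LLM; the quality of the final translation depends on how relevant the retrieved passages are.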


### Simple Search Implementation Using FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It is widely used for approximate nearest neighbor (ANN) search in large-scale datasets.
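
To make the idea of similarity search concrete, here is a brute-force sketch in plain Python. It is not how FAISS is implemented: it scores every stored vector exhaustively and uses cosine similarity for illustration, whereas FAISS indexes typically work with L2 distance or inner product and add approximate structures for speed. The function names and the 3-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=3):
    """Score every stored vector and return the indices of the k best matches."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query_vec, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional "embeddings"; real embedding vectors have many more dimensions.
stored = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.0, 0.0], stored, k=2))  # indices of the two closest vectors: [0, 1]
```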

In [7]:
import os
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings()

llm = ChatOpenAI(model="gpt-4o-mini")

file_path = "data/news_cn.txt"
if not os.path.exists(file_path):
    raise FileNotFoundError(f"file not found!!: {file_path}")

loader = TextLoader(file_path, encoding="utf-8")
docs = loader.load()


# Vectorizing Sentences Individually
sentences = []
for doc in docs:
    text = doc.page_content
    sentence_list = text.split("。")  # Splitting Chinese sentences based on '。'
    sentences.extend(
        [sentence.strip() for sentence in sentence_list if sentence.strip()]
    )


# Store sentences in the FAISS vector database
vector_store = FAISS.from_texts(sentences, embedding=embeddings)

# Search vectors using keywords "人工智能"
search_results = vector_store.similarity_search("人工智能", k=3)

# check result
print("Search result")
for idx, result in enumerate(search_results, start=1):
    print(f"{idx}. {result.page_content}")

Search result
1. 当地球员并非专业人士，而是农民、建筑工人、教师和学生，对足球的热爱将他们凝聚在一起
2. ”卡卡说道
3. “足球让我们结识新朋友，连接更广阔的世界


### Let's compare translation using LLM and translation using RAG.

First, write the necessary functions.

In [8]:
import json
import re
import nltk
from nltk.tokenize import sent_tokenize
from langchain_community.document_loaders import TextLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableSequence


llm = ChatOpenAI(model="gpt-4o-mini")


# Download the data used by NLTK's sentence tokenizer (needed once).
# Newer NLTK releases may require nltk.download("punkt_tab") instead.
nltk.download("punkt")


# Document Search Function (Used in RAG)
def retrieve_relevant_docs(query, vector_store, k=3):

    # Perform search and return relevant documents
    search_results = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in search_results]


# Translation using only LLM
def translate_with_llm(chinese_text):

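    # {chinese_text} is a template variable filled in by invoke(), so no f-string is needed.
    # Interpolating the text with an f-string would break on input that contains braces.
    prompt_template_llm = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Translate the following Chinese sentence into Korean:",
            ),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            ("user", "Please provide an accurate translation."),
        ]
    )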

    translation_chain_llm = RunnableSequence(prompt_template_llm, llm)

    return translation_chain_llm.invoke({"chinese_text": chinese_text})


# RAG-based Translation
def translate_with_rag(chinese_text, vector_store):

    retrieved_docs = retrieve_relevant_docs(chinese_text, vector_store)

    # Add retrieved documents as context

    context = "\n".join(retrieved_docs)

    # Construct prompt template (Using RAG)

    # {context} and {chinese_text} are template variables filled in by invoke().
    # Interpolating them with f-strings would break on text that contains braces.
    prompt_template_rag = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a translation expert. Below is the Chinese text that needs to be translated into Korean. Additionally, the following context has been provided from relevant documents that might help you in producing a more accurate and context-aware translation.",
            ),
            ("system", "Context (Relevant Documents):\n{context}"),
            ("user", 'Chinese sentence: "{chinese_text}"'),
            (
                "user",
                "Please provide a translation that is both accurate and reflects the context from the documents provided.",
            ),
        ]
    )

    translation_chain_rag = RunnableSequence(prompt_template_rag, llm)

    # Request translation using RAG

    return translation_chain_rag.invoke(
        {"context": context, "chinese_text": chinese_text}
    )


# Function to store document text as a list
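def chinese_text_from_file_loader(path):
    """
    Load Chinese text from a file and return it as a list of sentences.
    """
    # Load data
    loader = TextLoader(path, encoding="utf-8")
    docs = loader.load()

    return split_chinese_sentences_from_docs(docs)


def split_chinese_sentences_from_docs(docs):
    """
    Split every document in the list into sentences and return them as one list.
    """
    sentences = []

    for doc in docs:
        text = doc.page_content  # Extract the text from the document object
        sentences.extend(split_chinese_sentences(text))  # Add its sentences

    return sentences


def split_chinese_sentences(text):
    """
    - Split the text on sentence-ending punctuation with a regular expression,
      keeping the punctuation marks.
    - Rejoin each sentence with its punctuation mark and return the list.
    """
    # Separate sentences and punctuation marks
    sentence_list = re.split(r"([。！？])", text)

    # re.split returns an odd number of items here; pad the list so that trailing
    # text without closing punctuation is not dropped by zip()
    sentence_list.append("")

    # Rejoin each sentence with the punctuation mark that follows it
    merged_sentences = [
        "".join(x) for x in zip(sentence_list[0::2], sentence_list[1::2])
    ]

    # Remove empty sentences and return
    return [sentence.strip() for sentence in merged_sentences if sentence.strip()]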


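def count_chinese_sentences(docs):
    if isinstance(docs, str):
        # A plain string can be split directly
        sentences = split_chinese_sentences(docs)
    else:
        # Otherwise treat the input as a list of document objects
        sentences = split_chinese_sentences_from_docs(docs)

    print(f"Total number of sentences: {len(sentences)}")
    return sentences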


def split_english_sentences_from_docs(docs):
    """
    Split every document in the list into sentences and return them as one list.
    """
    sentences = []

    for doc in docs:
        text = doc.page_content  # Extract the text from the document object
        sentences.extend(split_english_sentences(text))  # Add its sentences

    return sentences


def split_english_sentences(text):
    """
    - Split sentences with NLTK's `sent_tokenize()`.
    - By default it treats periods (`.`), question marks (`?`), and exclamation marks (`!`) as sentence boundaries.
    """
    return sent_tokenize(text)


def count_paragraphs_and_sentences(docs):
    """
    Count the paragraphs and sentences in the given text and return the sentence count.
    """
    if not isinstance(docs, str):
        # Accept a list of document objects by joining their text
        docs = "\n\n".join(doc.page_content for doc in docs)

    paragraphs = re.split(r"\n\s*\n", docs.strip())
    paragraphs = [
        para.strip() for para in paragraphs if para.strip()
    ]  # Remove empty paragraphs
    sentences = [sent for para in paragraphs for sent in sent_tokenize(para)]

    print(f"📌 Total number of paragraphs: {len(paragraphs)}")
    print(f"📌 Total number of sentences: {len(sentences)}")
    return len(sentences)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\herme\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
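
The `split_chinese_sentences` helper above relies on `re.split` with a capturing group, which keeps each delimiter in the result. The following standalone sketch shows that behavior on a short, made-up sample string.

```python
import re

text = "你好！今天天气怎么样？很好。"

# A capturing group makes re.split keep each delimiter as its own list item.
parts = re.split(r"([。！？])", text)
print(parts)

# Pair every sentence body (even index) with the punctuation that follows it (odd index).
sentences = [body + mark for body, mark in zip(parts[0::2], parts[1::2])]
print(sentences)
```

Because the text ends with a delimiter, the last item of `parts` is an empty string, which `zip` leaves unpaired.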


**Use the written functions to perform the comparison.**

In [9]:
# sentences = chinese_text_from_file_loader("data/comparison_cn.txt")
sentences = chinese_text_from_file_loader("data/comparison_cn copy.txt")

# Join the sentences back into a single text
chinese_text = "".join(sentences)


# LLM
llm_translation = translate_with_llm(chinese_text)


# RAG
rag_translation = translate_with_rag(chinese_text, vector_store)


print("\ninput chinese text")
count_chinese_sentences(chinese_text)
print(chinese_text)


print("\nTranslation using LLM")
count_paragraphs_and_sentences(llm_translation.content)
print(llm_translation.content)


print("\nTranslation using RAG")
count_paragraphs_and_sentences(rag_translation.content)
print(rag_translation.content)


input chinese text
Total number of sentences: 15
数据领域迎来国家标准。10月8日，国家发改委等部门发布关于印发《国家数据标准体系建设指南》(以下简称《指南》)的通知。为“充分发挥标准在激活数据要素潜能、做强做优做大数字经济等方面的规范和引领作用”，国家发展改革委、国家数据局、中央网信办、工业和信息化部、财政部、国家标准委组织编制了《国家数据标准体系建设指南》。《指南》提出，到2026年底，基本建成国家数据标准体系，围绕数据流通利用基础设施、数据管理、数据服务、训练数据集、公共数据授权运营、数据确权、数据资源定价、企业数据范式交易等方面制修订30项以上数据领域基础通用国家标准，形成一批标准应用示范案例，建成标准验证和应用服务平台，培育一批具备数据管理能力评估、数据评价、数据服务能力评估、公共数据授权运营绩效评估等能力的第三方标准化服务机构。《指南》明确，数据标准体系框架包含基础通用、数据基础设施、数据资源、数据技术、数据流通、融合应用、安全保障等7个部分。数据基础设施方面，标准涉及存算设施中的数据算力设施、数据存储设施，网络设施中的5G网络数据传输、光纤数据传输、卫星互联网数据传输，此外还有流通利用设施。数据流通方面，标准包括数据产品、数据确权、数据资源定价、数据流通交易。融合应用方面，标准涉及工业制造、农业农村、商贸流通、交通运输、金融服务、科技创新、文化旅游(文物)、卫生健康、应急管理、气象服务、城市治理、绿色低碳。安全保障方面，标准涉及数据基础设施安全，数据要素市场安全，数据流通安全。数据资源中的数据治理标准包括数据业务规划、数据质量管理、数据调查盘点、数据资源登记；训练数据集方面的标准包括训练数据集采集处理、训练数据集标注、训练数据集合成。在组织保障方面，将指导建立全国数据标准化技术组织，加快推进急用、急需数据标准制修订工作，强化与有关标准化技术组织、行业、地方及相关社团组织之间的沟通协作、协调联动，以标准化促进数据产业生态建设。同时还将完善标准试点政策配套，搭建数据标准化公共服务平台，开展标准宣贯，选择重点地方、行业先行先试，打造典型示范。探索推动数据产品第三方检验检测，深化数据标准实施评价管理。在人才培养方面，将打造标准配套的数据人才培训课程，形成一批数据标准化专业人才。优化数据国际标准化专家队伍，支持参与国际标准化活动，强化国际交流。


Create the test dataset. The number of sentences in the source text and in each translation must match.
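
Pairing sentences by position only works when every list has the same length. A minimal sketch of such a check follows; the helper name `align_sentences` and the sample sentences are hypothetical.

```python
from itertools import zip_longest


def align_sentences(source, translation_1, translation_2):
    """Pair sentences by position and report whether all three lists have equal length."""
    counts = (len(source), len(translation_1), len(translation_2))
    aligned = len(set(counts)) == 1
    rows = [
        {"chinese_text": s, "translation_1": t1, "translation_2": t2}
        for s, t1, t2 in zip_longest(source, translation_1, translation_2, fillvalue="")
    ]
    return aligned, rows


aligned, rows = align_sentences(["句子一。", "句子二。"], ["문장 1.", "문장 2."], ["문장 1."])
print(aligned)  # False: the second translation has only one sentence
print(rows[1])
```

When `aligned` is `False`, the rows after the mismatch pair unrelated sentences, so the dataset should be inspected before it is used for evaluation.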

In [None]:
# Function to make comparison_json
def make_comparison_json(chinese_text, translation_1, translation_2):
    # Split the source text and both translations into sentences
    chinese_sentences = split_chinese_sentences(chinese_text)
    # Drop the empty string that split(".") leaves after the final period
    translation_1_sentences = [s for s in translation_1.split(".") if s.strip()]
    translation_2_sentences = [s for s in translation_2.split(".") if s.strip()]

    print("\nchinese_sentences:", len(chinese_sentences))
    print("\ntranslation_1_sentences:", len(translation_1_sentences))
    print("\ntranslation_2_sentences:", len(translation_2_sentences))

    # Map each source sentence to its translations by position
    data = []
    for i in range(len(chinese_sentences)):
        sentence_data = {
            "chinese_text": chinese_sentences[i].strip(),
            "translation_1": (
                translation_1_sentences[i].strip()
                if i < len(translation_1_sentences)
                else ""
            ),
            "translation_2": (
                translation_2_sentences[i].strip()
                if i < len(translation_2_sentences)
                else ""
            ),
        }
        data.append(sentence_data)

    # Save as a JSON file
    output_file = "data/translation_comparison2.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

    print(f"Translation comparison data saved to {output_file}")


# Call the function at module level, not from inside its own body
make_comparison_json(chinese_text, llm_translation.content, rag_translation.content)

## Evaluation of translation results

This section evaluates the translation results using BLEU and TER scores.
COMET and GPT-based evaluation are being considered as additional assessments, with the aim of improving the accuracy and quality of the translation evaluation.
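
BLEU is built on clipped ("modified") n-gram precision between a candidate and a reference translation. The sketch below shows only that building block, with whitespace tokenization and without the brevity penalty or the geometric mean over n-gram orders that full BLEU uses. The function names and the reference sentence are hypothetical and written here only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n=1):
    """Clipped n-gram precision: candidate n-gram counts are capped by their reference counts."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical reference translation, for illustration only.
reference = "this product is very popular in the market"
candidate = "this product is well received in the market"
print(modified_precision(reference, candidate, n=1))  # 6 of 8 unigrams match -> 0.75
print(modified_precision(reference, candidate, n=2))  # 4 of 7 bigrams match
```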

In [None]:
# Split the source text and both translations into sentences
chinese_sentences = chinese_text.split("。")
translation_1_sentences = llm_translation.content.split(".")
translation_2_sentences = rag_translation.content.split(".")

# Map each source sentence to its translations and store the pairs
data = []
for i in range(len(chinese_sentences)):
    # Pair the i-th source sentence with the i-th sentence of each translation
    sentence_data = {
        "chinese_text": chinese_sentences[i].strip(),
        "translation_1": (
            translation_1_sentences[i].strip()
            if i < len(translation_1_sentences)
            else ""
        ),
        "translation_2": (
            translation_2_sentences[i].strip()
            if i < len(translation_2_sentences)
            else ""
        ),
    }
    data.append(sentence_data)

# Save as a JSON file
output_file = "translation_comparison.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print(f"Translation comparison data saved to {output_file}")

In [15]:
import pandas as pd
import json
import nltk
import sacrebleu
from tabulate import tabulate


nltk.download("punkt")


#  BLEU
def calculate_bleu(reference, candidate):
    return round(sacrebleu.sentence_bleu(candidate, [reference]).score, 3)


# TER
def calculate_ter(reference, candidate):
    ter_metric = sacrebleu.metrics.TER()
    return round(ter_metric.corpus_score([candidate], [[reference]]).score, 3)


json_file_path = "data/translations_comparison.json"


def load_json_data(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data


translations_data = load_json_data(json_file_path)


df = pd.DataFrame(translations_data)

results = []

for _, row in df.iterrows():
    source_text = row["source_text"]
    translation_1 = row["translation_1"]
    translation_2 = row["translation_2"]

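    # translation_1 evaluation
    # Note: the Chinese source text is used as the reference here; BLEU and TER are
    # normally computed against a reference translation in the target language.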
    bleu_1 = calculate_bleu(source_text, translation_1)
    ter_1 = calculate_ter(source_text, translation_1)

    # translation_2 evaluation
    bleu_2 = calculate_bleu(source_text, translation_2)
    ter_2 = calculate_ter(source_text, translation_2)

    results.append(
        {
            "Category": "Source Text",
            "Text": source_text,
            "BLEU": "-",
            "TER": "-",
        }
    )
    results.append(
        {
            "Category": "Translation 1",
            "Text": translation_1,
            "BLEU": bleu_1,
            "TER": ter_1,
        }
    )
    results.append(
        {
            "Category": "Translation 2",
            "Text": translation_2,
            "BLEU": bleu_2,
            "TER": ter_2,
        }
    )

results_df = pd.DataFrame(results)


def display_results(dataframe):
    print("\n📌 **Translation Quality Evaluation (BLEU & TER Scores)**\n")
    print(tabulate(dataframe, headers="keys", tablefmt="fancy_grid"))


display_results(results_df)


📌 **Translation Quality Evaluation (BLEU & TER Scores)**

╒════╤═══════════════╤════════════════════════════════════════════════╤════════╤═══════╕
│    │ Category      │ Text                                           │ BLEU   │ TER   │
╞════╪═══════════════╪════════════════════════════════════════════════╪════════╪═══════╡
│  0 │ Source Text   │ 这个产品在市场上很受欢迎。                     │ -      │ -     │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  1 │ Translation 1 │ This product is very popular in the market.    │ 0.0    │ 800.0 │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  2 │ Translation 2 │ This product is well received in the market.   │ 0.0    │ 800.0 │
├────┼───────────────┼────────────────────────────────────────────────┼────────┼───────┤
│  3 │ Source Text   │ 人工智能正在改变世界。                         │ -      │ -     │
├────┼───────────────┼────────────────────────────────────────────────┼────

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\herme\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
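
The TER value of 800.0 in the table is consistent with scoring an English sentence against an unsegmented Chinese reference. The sketch below is a simplified TER (word-level edit distance divided by reference length, without the block shifts of full TER). It assumes whitespace tokenization, which is believed to match sacrebleu's default TER settings but is not confirmed by the source; the function names are hypothetical.

```python
def word_edit_distance(reference_tokens, candidate_tokens):
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    previous = list(range(len(candidate_tokens) + 1))
    for i, ref_tok in enumerate(reference_tokens, start=1):
        current = [i]
        for j, cand_tok in enumerate(candidate_tokens, start=1):
            cost = 0 if ref_tok == cand_tok else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def simple_ter(reference, candidate):
    """Edits divided by reference length, as a percentage; block shifts are not modelled."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens:
        return 0.0 if not cand_tokens else 100.0
    return 100.0 * word_edit_distance(ref_tokens, cand_tokens) / len(ref_tokens)


source = "这个产品在市场上很受欢迎。"  # no spaces, so split() yields a single token
candidate = "This product is very popular in the market."
print(simple_ter(source, candidate))  # 8 edits / 1 reference token -> 800.0
```

With a reference in the target language, for example `simple_ter("This product is popular in the market.", candidate)`, the same function gives one edit over seven reference tokens, roughly 14.3.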
