# Document Translation and Localization with Streamlit

This Python script uses the Streamlit framework to create a simple web application for translating and localizing text in DOCX documents. The translation is performed using the Hugging Face Transformers library, and names are localized using spaCy.

## Table of Contents
1. [Requirements](#requirements)
2. [Installation](#installation)
3. [Usage](#usage)
4. [Code Explanation](#code-explanation)
5. [License](#license)

## Requirements <a name="requirements"></a>

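- Python 3.x
- Streamlit
- transformers (from Hugging Face), with PyTorch as the backend
- sentencepiece (required by the Marian tokenizer)
- docx2txt
- spaCy
- python-docx (imported as `docx`; this is not the unrelated `docx` package on PyPI)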

## Installation <a name="installation"></a>

To install the required packages, you can use `pip`:

```bash
pip install streamlit transformers torch sentencepiece docx2txt spacy python-docx
```

Additionally, make sure a spaCy model is installed for the target language. The script loads `<code>_core_news_sm`, where `<code>` is the country code you enter in the app. For example, if you're targeting French, you can install the French model like this:

```bash
python -m spacy download fr_core_news_sm
```
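
Because the script loads the model by name at run time, a missing model only surfaces as an error after a document has been uploaded. The following is a minimal sketch of an upfront check, assuming models were installed with `spacy download` (which installs them as regular Python packages). The helper name `missing_spacy_models` is hypothetical and not part of the script.

```python
import importlib.util


def missing_spacy_models(language_codes, suffix="core_news_sm"):
    """Return the model package names that cannot be found in this environment."""
    missing = []
    for code in language_codes:
        package = f"{code}_{suffix}"
        # Installed spaCy models are importable packages, so find_spec()
        # can detect them without loading the model itself.
        if importlib.util.find_spec(package) is None:
            missing.append(package)
    return missing


# Report any models that still need to be downloaded.
for package in missing_spacy_models(["fr", "de"]):
    print(f"Missing model, install with: python -m spacy download {package}")
```

The `suffix` parameter is there because not every language follows the `core_news_sm` naming; the English model, for example, is published as `en_core_web_sm`.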

## Usage <a name="usage"></a>

1. Run the script using the following command:

   ```bash
   streamlit run your_script_name.py
   ```

2. Access the Streamlit app in your web browser.

3. Upload a DOCX file that you want to translate and localize.

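4. Enter the target language code (e.g., "fr" for French) and the target country code (e.g., "fr" for France). The language code selects the `Helsinki-NLP/opus-mt-en-<code>` translation model, so the source document is assumed to be in English; the country code selects the spaCy model `<code>_core_news_sm`.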

5. Click the "Translate and Localize" button.

6. The app will process the uploaded document, perform translation, localize names, and generate a localized document.

7. Download the localized document using the "Download Localized Document" button.
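
The two text inputs in step 4 are interpolated directly into model names, so a typo or stray space leads to a failed model download. Below is a hedged sketch of how the inputs could be normalized and checked first. The function name `resolve_model_names` and the two-to-three-letter pattern are assumptions for illustration; some published models use other codes, and a well-formed code does not guarantee that a model exists.

```python
import re

# Assumption: codes are two or three lowercase ASCII letters.
_CODE_PATTERN = re.compile(r"[a-z]{2,3}")


def resolve_model_names(target_language, target_country_code):
    """Normalize both inputs and build the model identifiers the script expects."""
    language = target_language.strip().lower()
    country = target_country_code.strip().lower()
    for label, value in (("target language", language), ("country code", country)):
        if not _CODE_PATTERN.fullmatch(value):
            raise ValueError(f"Invalid {label}: {value!r}")
    return {
        "translation_model": f"Helsinki-NLP/opus-mt-en-{language}",
        "spacy_model": f"{country}_core_news_sm",
    }


print(resolve_model_names(" FR ", "fr"))
```

In a Streamlit app, the `ValueError` message could be shown with `st.error` instead of letting the exception propagate.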

## Code Explanation <a name="code-explanation"></a>

- The script begins by importing the necessary libraries and packages, including Streamlit, transformers (for translation), docx2txt (for extracting text from DOCX files), spaCy (for name localization), and python-docx (for working with DOCX documents).

- Several functions are defined:
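  - `clean_text`: Removes every character other than ASCII letters, digits, and whitespace, then converts the text to lowercase. It does not tokenize. Punctuation and accented letters are removed as well, and capitalized names are lowercased before translation.
  - `translate_text`: Splits the text into fixed-size character segments and translates each one with the `Helsinki-NLP/opus-mt-en-<target_language>` Marian model from Hugging Face.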
  - `localize_names`: Localizes names in the translated document using spaCy and a predefined name localization database.

- The Streamlit app is created, including a title and a file uploader for uploading DOCX files.

- When a file is uploaded and the "Translate and Localize" button is clicked, the following steps are performed:
  - The uploaded DOCX file is processed to extract its text content.
  - User input for the target language and country code is obtained.
  - The `clean_text` function is used to clean and tokenize the extracted text.
  - The `translate_text` function translates the cleaned text to the target language.
  - The `localize_names` function localizes names in the translated document.
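  - The body content of the uploaded document (paragraphs and tables, excluding section properties) is copied into a new document, and the localized text is added as a paragraph. Styles, headers, and footers are not carried over by this step.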
  - The localized document is saved, and a download button is provided for the user to download it as a DOCX file.
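
The translation step cuts the text every 1152 characters, which can split a sentence or even a word between two segments. One possible alternative, sketched below, groups whole sentences into chunks instead. It assumes the text still contains sentence punctuation, so it would have to run on the extracted text rather than on the output of `clean_text`. The name `split_into_chunks` and the `max_chars` default are illustrative assumptions, not part of the script.

```python
import re


def split_into_chunks(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two! Three?", max_chars=9))
```

Each chunk could then be passed to the tokenizer in place of a fixed-size segment. The regular expression is deliberately simple and will mis-split abbreviations such as "e.g.".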



In [None]:
import io
import re

import docx2txt
import spacy
import streamlit as st
from docx import Document
from transformers import MarianMTModel, MarianTokenizer

# Function to clean text: strips everything except ASCII letters, digits, and
# whitespace, then lowercases. This also removes punctuation and accented
# letters, and lowercased names will not match the lookup table further down.
def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    text = text.lower()
    return text

# Function to split and translate text segments
def translate_text(input_text, target_language):
    model_name = f"Helsinki-NLP/opus-mt-en-{target_language}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Fixed-size character segments; tokens beyond 512 per segment are truncated
    max_segment_length = 1152
    segments = [input_text[i:i + max_segment_length] for i in range(0, len(input_text), max_segment_length)]

    translated_segments = []

    for segment in segments:
        input_ids = tokenizer.encode(segment, return_tensors="pt", max_length=512, truncation=True)
        # Allow up to 512 output tokens so long segments are not cut off
        translated_ids = model.generate(input_ids, max_length=512, num_beams=4, early_stopping=True)
        translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)
        translated_segments.append(translated_text)

    return "\n".join(translated_segments)

# Function to localize names in the translated document
def localize_names(translated_document, target_country_code):
    nlp_target_lang = spacy.load(f"{target_country_code}_core_news_sm")

    name_localization_db = {
        "fr": {
            "John": "Jean",
            "Marie": "Marie",
            # Add more name translations for France
        },
        "de": {
            "John": "Johann",
            "Marie": "Maria",
            # Add more name translations for Germany
        },
        # Add more countries and their name translations as needed
    }

    # Fall back to an empty table when no names are defined for the country
    localized_names = name_localization_db.get(target_country_code, {})

    doc = nlp_target_lang(translated_document)

    localized_text = []
    for token in doc:
        # The French and German news models label people "PER"; other models use "PERSON"
        if token.ent_type_ in ("PER", "PERSON") and token.text in localized_names:
            localized_text.append(localized_names[token.text] + token.whitespace_)
        else:
            # text_with_ws keeps the original spacing around punctuation
            localized_text.append(token.text_with_ws)

    return "".join(localized_text)

# Streamlit app
st.title("Document Translation and Localization")

uploaded_file = st.file_uploader("Upload a DOCX file", type=["docx"])

if uploaded_file is not None:
    doc_text = docx2txt.process(uploaded_file)

    target_language = st.text_input("Enter the target language (e.g., fr for French):")
    target_country_code = st.text_input("Enter the target country code (e.g., fr for France):")

    if st.button("Translate and Localize"):
        cleaned_text = clean_text(doc_text)
        translated_text = translate_text(cleaned_text, target_language)
        localized_document = localize_names(translated_text, target_country_code)

        # Copy the body content of the uploaded document into a new document
        uploaded_file.seek(0)
        doc = Document(uploaded_file)
        localized_doc = Document()
        # Section properties must stay last in the body, so insert before them
        section_properties = localized_doc.element.body.get_or_add_sectPr()
        # Iterate over a copy: inserting an element moves it out of the source body
        for element in list(doc.element.body):
            if element.tag.endswith('sectPr'):
                continue
            section_properties.addprevious(element)

        # Add the localized text after the copied content
        localized_doc.add_paragraph(localized_document, style='Body Text')

        # Serialize to memory: st.download_button expects the file contents
        buffer = io.BytesIO()
        localized_doc.save(buffer)
        st.download_button(
            "Download Localized Document",
            data=buffer.getvalue(),
            file_name="localized_document.docx",
            mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            key="download-localized-docx",
            help="Click here to download the localized document as a DOCX file.",
        )


2023-08-19 11:37:40.128 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/ipykernel_launcher.py [ARGUMENTS]
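
The name lookup in `localize_names` is a plain dictionary keyed by country code. As a rough illustration of that lookup on its own, the sketch below replaces whole-word names with a regular expression and leaves all other text, including spacing, untouched. It does not use spaCy, so unlike the script it cannot tell a person's name from any other use of the same word. `NAME_DB` and `replace_names` are hypothetical names for this example.

```python
import re

# Hypothetical lookup table with the same shape as the one in the script.
NAME_DB = {
    "fr": {"John": "Jean"},
    "de": {"John": "Johann", "Marie": "Maria"},
}


def replace_names(text, country_code, name_db=NAME_DB):
    """Swap whole-word names using the table for country_code."""
    names = name_db.get(country_code, {})
    if not names:
        # Unknown country code: return the text unchanged.
        return text
    # Longest names first so a longer entry wins over a shorter prefix.
    alternatives = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b")
    return pattern.sub(lambda match: names[match.group(1)], text)


print(replace_names("John met Marie, then Johnny left.", "de"))
```

The `\b` word boundaries keep "Johnny" intact while "John" is replaced. A country code with no table returns the input unchanged.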


In [None]:
!streamlit run /usr/local/lib/python3.10/dist-packages/ipykernel_launcher.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://34.125.199.96:8501

  Stopping...
  Stopping...
Traceback (most recent call last):
  File "/usr/local/bin/streamlit", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-pa

In [None]:
pip install streamlit

Collecting streamlit
  Downloading streamlit-1.25.0-py2.py3-none-any.whl (8.1 MB)
Collecting pympler<2,>=0.9 (from streamlit)
  Downloading Pympler-1.0.1-py3-none-any.whl (164 kB)
Collecting tzlocal<5,>=1.1 (from streamlit)
  Downloading tzlocal-4.3.1-py3-none-any.whl (20 kB)
Collecting validators<1,>=0.2 (from streamlit)
  Downloading validators-0.21.2-py3-none-any.whl (25 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.32-py3-none-any.whl (188 kB)
Collecting pydeck<1,>=0.8 (from streamlit)
  Downloading pydeck-0.8.0-py2.py3-none-any.whl (4.7 MB)

In [None]:
pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
pip install docx2txt

Collecting docx2txt
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3959 sha256=80fe3c78f9adac3f0387d266db7817e981e34f410b2d467726a89746669cfd72
  Stored in directory: /root/.cache/pip/wheels/22/58/cf/093d0a6c3ecfdfc5f6ddd5524043b88e59a9a199cb02352966
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8


In [None]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
pip install docx

Collecting docx
  Downloading docx-0.2.4.tar.gz (54 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx
  Building wheel for docx (setup.py) ... done
  Created wheel for docx: filename=docx-0.2.4-py3-none-any.whl size=53894 sha256=0f626728650f4a10a327e9cb055b36c9a887dc97aa54c803bb12a6110194c668
  Stored in directory: /root/.cache/pip/wheels/81/f5/1d/e09ba2c1907a43a4146d1189ae4733ca1a3bfe27ee39507767
Successfully built docx
Installing collected packages: docx
Successfully installed docx-0.2.4


In [None]:
pip install python-docx==0.8.11


Collecting python-docx==0.8.11
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184487 sha256=027e585fe0dd460183833484267b27da579a0fda13d8578a4dd711084e6e5c30
  Stored in directory: /root/.cache/pip/wheels/80/27/06/837436d4c3bd989b957a91679966f207bfd71d358d63a8194d
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11
