# Automatic translation of a resume with Azure, GCP, docx, and mod_docx!

Have you ever wanted to automatically change the language of your CV? There are built-in document translation services on Google Sheets, but they often require you to pay or create an account.

In this post, I create a quick application using Azure Cognitive Services and GCP, and review some XML basics to resave the docx file.

![alt text](splash_mod_docx.png)


Blog article: https://medium.com/@j622amilah/publishing-code-on-pypi-mod-docx-xml-document-manipulation-f5e89efaf1bc


In [1]:
# Using PyPI repository 
import os
import sys

# sudo pip install mod_docx --upgrade
# https://pypi.org/project/mod-docx/
sys.path.insert(1, '/usr/lib/python3.11/site-packages')
import mod_docx

# https://pypi.org/project/docx/
sys.path.insert(1, '/usr/lib/python3.10/site-packages')
import docx 

# [Step 0] Open the docx file as an XML object

In [2]:
# Get the original docx file
fpath = "../git2/mod_docx/tests"
# fichier = "test_document.docx"  # alternative test file
fichier = "test_complicated_document.docx"
docx_filename = os.path.join(fpath, fichier)
        
# Create an instance of the class (module_name.class_name), then use it to call the class methods.
md = mod_docx.mod_docx(docx_filename)

# Convert the original docx file to XML
document_org = md.opendocx(docx_filename)
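
To make the "docx as XML" idea concrete: a docx file is a zip archive, and the main text lives in the `word/document.xml` member. The sketch below uses only the standard library and is not how `mod_docx` is necessarily implemented; `read_document_xml` and `make_tiny_docx` are hypothetical helper names, and the tiny archive built here is a stand-in for a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def read_document_xml(docx_file):
    """Return the parsed root element of word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    return ET.fromstring(xml_bytes)

def make_tiny_docx():
    """Build a minimal in-memory archive for the demo (not a complete Word file)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Bonjour</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer

root = read_document_xml(make_tiny_docx())
print(root.tag)
```

With a real file, `read_document_xml("some_file.docx")` would work the same way, since `zipfile.ZipFile` accepts a path as well as a file-like object.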


# [Step 1] Modify the document

In [3]:
# Extract text from docx file

# Way 0: Form Recognizer
# However, with this approach the text is not aligned by line or position within the document.
# This could be a problem when automatically replacing pieces of text with their
# equivalents (i.e., automatic text translation).

# Way 1: docx text extraction from the XML tags <body> to </body>
text_org = docx.getdocumenttext(document_org)
print('text_org: ', text_org)

text_org:  ['Jamilah FOUCHER', 'DATA SCIENTIST', 'Adresse', 'Téléphone', 'Email', 'Linked-In : Nom de profile (4 badges de compétences)', 'Data Science Blog: medium.com/@j622amilah', 'GitHub: github.com/j622amilah', 'ResearchGate: Nom de profile', 'Publications: Nom de profile', "Je suis intéressé à être un Data Scientist; Je développe actuellement des compétences de spécialiste des données et d'architecte de données pour travailler avec des modèles sur le cloud.", 'SKILLS', 'Processus de la science des données', 'Poser des questions, Hypothèse, Expérimenter, Analyser (Sélection et réduction des marqueurs, sélection du modèle (Python SDK: mobilenet, OpenAI, Hugging Face), modèle entraînement, flux de travail), Interpréter les résultats, Communiquer et livrer les résultats', 'Apprentissage automatique et profond ', "méthodes ML en utilisant les algorithmes d'optimisation et l'arbre de décision, méthodes DL en utilisant les algorithmes d'optimisation, RNN, CNN, GANS, Transformer", 'Types
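
The list above has one entry per paragraph, which is what keeps the translated lines aligned with the document. A minimal sketch of that kind of extraction, assuming the usual WordprocessingML layout (paragraphs are `w:p` elements whose text is spread over `w:t` elements); `paragraphs_text` is a hypothetical name, not the `docx` library's function:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraphs_text(root):
    """Return one string per non-empty paragraph, in document order."""
    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text can be split across several w:t elements
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text:
            lines.append(text)
    return lines

sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>DATA </w:t></w:r><w:r><w:t>SCIENTIST</w:t></w:r></w:p>"
    "<w:p></w:p>"
    "<w:p><w:r><w:t>Adresse</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
sample_lines = paragraphs_text(ET.fromstring(sample_xml))
print(sample_lines)
```

Note that the first sample paragraph is stored as two separate text elements, so a paragraph-level string does not always correspond to a single XML text node.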

## Translate the extracted text using Azure and/or GCP

## Azure

In [None]:
# Create resources on Azure : see https://github.com/DevopsPractice7/mod_docx under tests
os.system("./create_Azure_resources.sh")

In [None]:
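def traduction_Xlang_a_Ylang(de_lang, a_lang, key, endpoint, location, text):
    """Translate text from de_lang to a_lang with the Azure Translator REST API (v3.0)."""
    import requests, uuid
    
    constructed_url = endpoint + '/translate'

    # 'to' may be a single language code or a list of codes
    params = {'api-version': '3.0', 'from': de_lang, 'to': a_lang}

    headers = {
        'Ocp-Apim-Subscription-Key': key,
        'Ocp-Apim-Subscription-Region': location,
        'Content-type': 'application/json',
        'X-ClientTraceId': str(uuid.uuid4())  # unique ID to trace this request
    }
    
    # The request body is a JSON array with one object per text to translate
    body = [{'text': f'{text}'}]

    # Send the request and decode the JSON response
    response = requests.post(constructed_url, params=params, headers=headers, json=body)
        
    return response.json()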

In [None]:
de_lang = 'fr'
a_lang = ['en']
key =  ''
endpoint = "https://api.cognitive.microsofttranslator.com"
location = "francecentral"

# Keep the line order/position with respect to the docx file, by translating each line 
text_mod = []
for ind, pline in enumerate(text_org):
    response = traduction_Xlang_a_Ylang(de_lang, a_lang, key, endpoint, location, pline)
    transtext = response[0]['translations'][0]['text'].split("\\n")
    text_mod.append(transtext[0])
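
The loop above indexes straight into `response[0]['translations'][0]['text']`. If the key is empty or invalid, the service returns an error object instead of a list, and that indexing fails with an unhelpful `KeyError`. A small sketch of a safer accessor follows; `extract_translation` is a hypothetical helper, the error shape (`{"error": {"code": ..., "message": ...}}`) is an assumption to verify against the Translator documentation, and both payloads are hand-written samples rather than real API output.

```python
def extract_translation(response, index=0):
    """Return the translated text at `index` from a Translator v3 style response.

    Raises RuntimeError if the service returned an error payload instead.
    """
    # Assumed error shape: {"error": {"code": ..., "message": ...}}
    if isinstance(response, dict) and "error" in response:
        err = response["error"]
        raise RuntimeError(f"Translator error {err.get('code')}: {err.get('message')}")
    return response[0]["translations"][index]["text"]

# Hand-written sample payloads (not real API output)
ok_response = [{"translations": [{"text": "Address", "to": "en"}]}]
bad_response = {"error": {"code": 401000, "message": "credentials missing"}}

print(extract_translation(ok_response))
try:
    extract_translation(bad_response)
except RuntimeError as exc:
    print(exc)
```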

## Google Cloud Platform

In [4]:
from os import environ

PROJECT_ID = environ.get("PROJECT_ID", "traductionapi-0")
print('PROJECT_ID : ', PROJECT_ID)


LOCATION = "europe-west9"

# Credentials
import os

# https://cloud.google.com/docs/authentication/provide-credentials-adc#local-dev
# In the command line : gcloud auth application-default login
# It creates the application_default_credentials.json file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/home/oem2/.config/gcloud/application_default_credentials.json"

PROJECT_ID :  traductionapi-0


In [5]:
def translate_text(target, text):
    """Translates text into the target language.

    Target must be an ISO 639-1 language code.
    See https://g.co/cloud/translate/v2/translate-reference#supported_languages
    """
    # https://cloud.google.com/translate/docs/reference/libraries/v2/python
    # Be sure to install: sudo pip install google-cloud-translate==2.0.1
    from google.cloud import translate_v2 as translate

    translate_client = translate.Client()

    # The result also contains result["input"] and result["detectedSourceLanguage"]
    result = translate_client.translate(text, target_language=target)
    
    return result["translatedText"]


In [7]:
target = 'en'

# Keep the line order/position with respect to the docx file, by translating each line 
text_mod = []
for pline in text_org:
    textout = translate_text(target, pline)
    
    # Clean text: undo the HTML-escaped apostrophe and replace non-breaking spaces
    replace_le = ['&#39;', '\xa0']
    replace_avec = ["'", ' ']
    for old, new in zip(replace_le, replace_avec):
        textout = textout.replace(old, new)

    # Save cleaned text
    text_mod.append(textout)
print('text_mod: ', text_mod)    

text_mod:  ['Jamilah footer', 'DATA SCIENTIST', 'Address', 'Phone', 'Email', 'Linked-In: Profile name (4 skill badges)', 'Data Science Blog: medium.com/@j622amilah', 'GitHub: github.com/j622amilah', 'ResearchGate: Nom de profile', 'Posts: Profile name', 'I am interested in being a Data Scientist; I am currently developing data scientist and data architect skills to work with models on the cloud.', 'SKILLS', 'Data science process', 'Ask questions, Hypothesize, Experiment, Analyze (Marker selection and reduction, model selection (Python SDK: mobilenet, OpenAI, Hugging Face), model training, workflow), Interpret results, Communicate and deliver results', 'Automatic and deep learning', 'ML methods using optimization algorithms and decision tree, DL methods using optimization algorithms, RNN, CNN, GANS, Transformer', 'Types of data processed', 'Time series, text, image, sound, static and continuous data', 'Control theory, system identification, statistics', 'n4sid, ARX, created a predictive
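
The cleaning step above handles two specific sequences by hand. As a possible generalization, the standard library's `html.unescape` decodes any HTML entity the API might return; `clean_translation` below is a hypothetical helper name, and the input string is made up for the demo.

```python
import html

def clean_translation(text):
    """Decode HTML entities and normalize non-breaking spaces."""
    # html.unescape turns '&#39;' into an apostrophe and '&nbsp;' into '\xa0',
    # so the non-breaking space replacement must come second
    return html.unescape(text).replace("\xa0", " ")

cleaned = clean_translation("I&#39;m a data&nbsp;scientist &amp; architect")
print(cleaned)  # I'm a data scientist & architect
```

The v2 client's `translate` method also appears to accept a `format_` argument for requesting plain-text output; that is worth checking in the library reference before relying on it.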

## Replace the text in the open XML object document_org using the docx replace function, giving doc_finale

In [None]:
for ind, pline in enumerate(text_org):
    
    # The text to search for
    search = pline
    print('search:', search)
    
    # The replacement text
    replace = text_mod[ind]
    print('replace:', replace)
    
    try:
        doc_finale = docx.replace(document_org, search, replace)
    except Exception as err:
        print('err:', err)   
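
One thing to watch for in this step: if the replace function compiles `search` as a regular expression (an assumption about this version of `docx`, worth checking in its source), lines containing characters such as parentheses will silently fail to match. The sketch below shows the effect and a literal-text alternative using `re.escape`; `replace_literal` is a hypothetical helper working on a plain `ElementTree` element, not part of `docx` or `mod_docx`.

```python
import re
import xml.etree.ElementTree as ET

def replace_literal(root, search, replacement):
    """Replace `search` as plain text in every element's text; return the match count."""
    pattern = re.compile(re.escape(search))
    count = 0
    for element in root.iter():
        if element.text and pattern.search(element.text):
            # A function replacement keeps backslashes in `replacement` literal
            element.text = pattern.sub(lambda _m: replacement, element.text)
            count += 1
    return count

demo_root = ET.fromstring("<body><t>Linked-In : Nom de profile (4 badges)</t></body>")
search = "Nom de profile (4 badges)"

# Used directly as a pattern, the parentheses become a regex group and nothing matches
unescaped_match = re.search(search, demo_root[0].text)
print(unescaped_match)

changed = replace_literal(demo_root, search, "Profile name (4 badges)")
print(changed, demo_root[0].text)
```

If the assumption holds, passing `re.escape(search)` to the existing replace call may be enough, without writing a separate helper.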


# [Step 2] Save the changed XML document as a docx file 

In [None]:
root_dir = f'{fpath}/'
desired_output_docx_name = 'test_document_out.docx'
md.savedocx_ver_fichier(docx_filename, root_dir, desired_output_docx_name, doc_finale)
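
For readers curious about what "resaving" involves at the XML level, here is a rough standard-library sketch: copy every member of the original archive and swap in the modified `word/document.xml`. This is an illustration under simplifying assumptions, not a description of what `savedocx_ver_fichier` does internally; `write_docx_with_new_xml` is a hypothetical name, and a real Word document declares many more namespaces than the single one registered here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # keep the "w:" prefix when serializing

def write_docx_with_new_xml(src_docx, dst_docx, new_root):
    """Copy every archive member from src to dst, swapping in the new document XML."""
    new_xml = ET.tostring(new_root, encoding="utf-8", xml_declaration=True)
    with zipfile.ZipFile(src_docx) as src, \
         zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                data = new_xml
            else:
                data = src.read(item.filename)
            dst.writestr(item, data)

# Demo with a tiny in-memory archive (a stand-in for a real docx file)
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        "<w:t>Adresse</w:t></w:r></w:p></w:body></w:document>",
    )
    archive.writestr("word/styles.xml", "<styles/>")
source.seek(0)

with zipfile.ZipFile(source) as archive:
    root = ET.fromstring(archive.read("word/document.xml"))
root.find(f".//{{{W_NS}}}t").text = "Address"

source.seek(0)
target = io.BytesIO()
write_docx_with_new_xml(source, target, root)

target.seek(0)
with zipfile.ZipFile(target) as archive:
    saved_names = archive.namelist()
    saved_xml = archive.read("word/document.xml").decode("utf-8")
print(saved_names)
```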
