# Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Install Libraries

In [2]:
!pip install transformers
!pip install sentencepiece


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install sacremoses

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Import Libraries

In [4]:
from transformers import MarianMTModel, MarianTokenizer

# Define Translation Function
Create a function that translates text from English to a target language and back using a pre-trained MarianMT model. The translations are compared across the following generation parameters:

*   num_beams
*   do_sample
*   top_k
*   top_p

Follow the link in each section heading below to read how that parameter influences the generated translation.
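
One way to put a number on the English to target language and back round trip is a word-level similarity between the source text and its back-translation. The snippet below is a minimal sketch using only the standard library; `round_trip_similarity` is a hypothetical helper that is not part of this notebook, and the two sentences are shortened illustrative strings rather than model output.

```python
from difflib import SequenceMatcher


def round_trip_similarity(source: str, back_translated: str) -> float:
    """Return a word-level similarity in [0, 1] between two English texts."""
    source_words = source.lower().split()
    back_words = back_translated.lower().split()
    return SequenceMatcher(None, source_words, back_words).ratio()


# Illustrative strings only (assumption): shortened stand-ins for the real texts.
source = "Hugging Face is an American company that develops tools."
back = "Hugging Face is an American company that develops application tools."

score = round_trip_similarity(source, back)
print(f"Round-trip similarity: {score:.2f}")
```

A score of 1.0 means the back-translation reproduces the source word for word. Lower scores only indicate surface differences, so a faithful paraphrase can still score below 1.0.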






## Using [num_beams](https://huggingface.co/blog/how-to-generate#beam-search) parameter

In [5]:
input_text = "Hugging Face is an American company that develops tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets."

In [6]:
input_language = "en" 
target_language = 'es'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             num_beams=4, 
                             max_length=200, 
                            #  no_repeat_ngram_size=2,  # Prevents any bigram from appearing twice
                            #  num_return_sequences=3,  # Must be less than or equal to num_beams; returns the highest-scoring beams
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

In [7]:
translated_text_beam = translate(input_text, target_language)

translated_text_beam

'Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.'

In [8]:
# Translate the translated text back to English
input_language = "es" 
target_language = 'en'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             num_beams=4, 
                             max_length=200, 
                            #  no_repeat_ngram_size=2,  # Prevents any bigram from appearing twice
                            #  num_return_sequences=3,  # Must be less than or equal to num_beams; returns the highest-scoring beams
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text


In [9]:
translated_text_beam

original_text = translate(translated_text_beam, target_language)

original_text

'Hugging Face is an American company that develops application building tools using machine learning. It is most notable for its library of transformers built for natural language processing applications and its platform that allows users to share machine learning models and data sets.'
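
To make the effect of `num_beams` concrete, here is a toy beam search over a hand-written next-token table. This is a simplified sketch, not the implementation used by `model.generate` (which also handles details such as length penalties and batching). `TOY_MODEL` and `beam_search` are hypothetical names, and the probabilities are invented for illustration.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Return the best (tokens, log_probability) pair found by the search."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_length):
        # Expand every open hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams highest-scoring candidates.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda item: item[1])


for width in (1, 2):
    tokens, log_prob = beam_search(width)
    print(width, " ".join(tokens), round(math.exp(log_prob), 3))
```

With a single beam the search commits to the locally best first token and cannot recover. With two beams it also keeps the runner-up, which in this toy table leads to a sequence with higher overall probability.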

## Using [do_sample](https://huggingface.co/blog/how-to-generate#sampling) parameter

In [10]:
input_text = "Hugging Face is an American company that develops tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets."

In [11]:
input_language = "en" 
target_language = 'es'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                            
                             top_k=0, # Disables top-k filtering; see the top_k section of this notebook for an improvement when using do_sample
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

In [12]:
translated_text_do = translate(input_text, target_language)

translated_text_do

'Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.'

In [13]:
# Translate the translated text back to English
input_language = "es" 
target_language = 'en'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                            
                             top_k=0, # Disables top-k filtering; see the top_k section of this notebook for an improvement when using do_sample
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text


In [14]:
translated_text_do

original_text = translate(translated_text_do, target_language)

original_text

'Hugging Face is an American company that develops application building tools using machine learning. It is most notable for its transformer library built for natural language processing applications and its platform that allows users to share machine learning models and data sets.'
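
The sampled Spanish translation above is identical to the beam-search one. A plausible explanation (an assumption, not verified against the model here) is that the next-token distributions are sharply peaked for this input, so sampling picks the top token almost every time. The toy comparison below contrasts greedy selection with sampling; the distribution and the helper names are hypothetical.

```python
import random

# Hypothetical, sharply peaked next-token distribution (not from the Marian model).
next_token_probs = {"empresa": 0.90, "sociedad": 0.07, "firma": 0.03}


def pick_greedy(probs):
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [pick_sampled(next_token_probs, rng) for _ in range(1000)]
share = samples.count("empresa") / len(samples)

print("greedy pick:", pick_greedy(next_token_probs))
print(f"'empresa' was sampled in {share:.0%} of draws")
```

With the real model, sampled output can differ between runs. Fixing a seed before calling `generate`, for example with `transformers.set_seed`, is one way to make a sampled translation repeatable.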

## Using [top_k](https://huggingface.co/blog/how-to-generate#top-k-sampling) parameter

In [15]:
input_text = "Hugging Face is an American company that develops tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets."

In [16]:
input_language = "en" 
target_language = 'es'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                                                       
                             top_k=50, # Keep only the K most likely next tokens and redistribute the probability mass among them
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

In [17]:
translated_text_k = translate(input_text, target_language)

translated_text_k


'Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.'

In [18]:
# Translate the translated text back to English
input_language = "es" 
target_language = 'en'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                       
                             top_k=50, # Keep only the K most likely next tokens and redistribute the probability mass among them
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text


In [19]:
translated_text_k

original_text = translate(translated_text_k, target_language)

original_text

'Hugging Face is an American company that develops application building tools using machine learning. It is most notable for its library of transformers built for natural language processing applications and its platform that allows users to share machine learning models and data sets.'
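
The filtering step behind `top_k` can be sketched on a toy distribution: keep the K most likely tokens, drop the rest, and renormalize. `top_k_filter` is a hypothetical helper written for illustration, and the probabilities are invented. Treating `k <= 0` as "no filtering" follows the linked blog post's use of `top_k=0` to deactivate top-k sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    if k <= 0:  # assumption: non-positive k means top-k filtering is off
        return dict(probs)
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distribution.
toy = {"datos": 0.5, "data": 0.3, "info": 0.15, "cosas": 0.05}

filtered = top_k_filter(toy, k=2)
print(filtered)
```

After filtering, only the two most likely tokens can be sampled, and their probabilities are rescaled so that they sum to 1.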

## Using [top_p](https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling) parameter

In [20]:
input_text = "Hugging Face is an American company that develops tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets."

In [21]:
input_language = "en" 
target_language = 'es'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                            
                             top_k=100, # Keep only the K most likely next tokens and redistribute the probability mass among them
                             top_p=0.95, # Keep the smallest set of tokens whose cumulative probability reaches 0.95
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text

In [22]:
translated_text_p = translate(input_text, target_language)

translated_text_p

'Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.'

In [23]:
# Translate the translated text back to English
input_language = "es" 
target_language = 'en'

def translate(text, target_language):
    model_name = f'Helsinki-NLP/opus-mt-{input_language}-{target_language}'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    inputs = tokenizer.encode(text, return_tensors="pt")
    outputs = model.generate(inputs, 
                             do_sample=True, 
                             max_length=200,                            
                             top_k=100, # Keep only the K most likely next tokens and redistribute the probability mass among them
                             top_p=0.95, # Keep the smallest set of tokens whose cumulative probability reaches 0.95
                             early_stopping=True)
    
    translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return translated_text


In [24]:
translated_text_p

original_text = translate(translated_text_p, target_language)

original_text

'Hugging Face is an American company that develops application building tools using machine learning. It is most notable for its library of transformers built for natural language processing applications and its platform that allows users to share machine learning models and data sets.'
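
Unlike `top_k`, which always keeps a fixed number of tokens, `top_p` (nucleus sampling) keeps a variable number: the smallest set of most likely tokens whose cumulative probability reaches `p`. The sketch below illustrates only the top-p step on two invented distributions; `top_p_filter` is a hypothetical helper, not the library's implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical distributions: one sharply peaked, one comparatively flat.
peaked = {"datos": 0.96, "data": 0.02, "info": 0.01, "cosas": 0.01}
flat = {"datos": 0.4, "data": 0.3, "info": 0.2, "cosas": 0.1}

print(len(top_p_filter(peaked, 0.95)), "token(s) kept for the peaked case")
print(len(top_p_filter(flat, 0.95)), "token(s) kept for the flat case")
```

When the model is confident, the nucleus shrinks to very few tokens and sampling behaves almost like greedy decoding. When the distribution is flat, more tokens remain eligible.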

# Translations Comparison

In [25]:
print(f"Translation using beam: {translated_text_beam}")
print("----------------------------------------------")
print(f"Translation using do_sample: {translated_text_do}")
print("----------------------------------------------")
print(f"Translation using top_k: {translated_text_k}")
print("----------------------------------------------")
print(f"Translation using top_p: {translated_text_p}")

Translation using beam: Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.
----------------------------------------------
Translation using do_sample: Hugging Face es una empresa estadounidense que desarrolla herramientas para la construcción de aplicaciones utilizando el aprendizaje automático. Es más notable por su biblioteca de transformadores construida para aplicaciones de procesamiento de lenguaje natural y su plataforma que permite a los usuarios compartir modelos de aprendizaje automático y conjuntos de datos.
----------------------------------------------
Translation using top_k: Hugging Face es una empresa estadounidense que desarrolla herramientas para la con