<a href="https://colab.research.google.com/github/JuanJoseMV/neuraltextgen/blob/main/italian_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Intialization

In [1]:
%%capture
!git clone --recursive https://github.com/JuanJoseMV/neuraltextgen.git
!pip install -r /content/neuraltextgen/texygen/requirements.txt
!pip install simpletransformers

In [2]:
import sys
import os


os.chdir("/content/neuraltextgen/")
from NeuralTextGenerator import BertTextGenerator

APEX_AVAILABLE = False

## Evaluation - Texygen



In [None]:
import nltk
nltk.download('punkt')

os.chdir("/content/neuraltextgen/texygen")
from utils.metrics.Bleu import Bleu
from utils.metrics.SelfBleu import SelfBleu
os.chdir("/content/neuraltextgen")

wiki103_file = 'data/wiki103.5k.txt'
tbc_file = 'data/tbc.5k.txt'

## nvidia apex

Nvidia-Apex is an extension of Pytorch for automatic Mixed-precision. 

Most deep learning frameworks, including PyTorch, train using 32-bit floating point

(FP32) arithmetic by default. However, using FP32 for all operations is not essential to achieve full accuracy for many state-of-the-art deep neural networks (DNNs). In 2017, NVIDIA researchers developed a methodology for mixed-precision training in which a few operations are executed in FP32 while the majority of the network is executed using 16-bit floating point (FP16) arithmetic. FP16 arithmetic offers the following additional performance benefits on Volta GPUs:

- FP16 reduces memory bandwidth and storage requirements by 2x. Bandwidth-bound operations can realize up to 2x speedup immediately.
- FP16 arithmetic enables Tensor Cores, which in Volta GPUs offer 125 TFlops of computational throughput on generalized matrix-matrix multiplications (GEMMs) and convolutions, an 8X increase over FP32.
 

With mixed precision training, networks receive almost all the memory savings and improved throughput of pure FP16 training while matching the accuracy of FP32 training.

### Using Apex

The code below is used to initialize apex. The installation requires around 5-10 minutes.

In [None]:
%%writefile setup.sh
export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [None]:
%%capture
!sh setup.sh

APEX_AVAILABLE = True

#Italian Text Generation

Classical transformers methods are trained on english text corpus...

how to deal with other languages

1. Machine translation
2. Specific models
3. Cross Language models



## Italian text with Machine Translation

The first attempt at italian text generation can be done relying on machine translation. 

### English text generation

In [None]:
en_bert_model = BertTextGenerator("bert-base-uncased", use_apex = APEX_AVAILABLE)

parameters = {'n_samples': 11,  # 1000
              'batch_size': 11,  # 50
              'max_len': 40,
              'top_k': 100,
              'temperature': 1,
              'burnin': 250,
              'sample': True,
              'max_iter': 500,
              'seed_text': "",
              'init_method': 'masked'
              }

parameters_str = "_".join([f"{k}={v}" for k, v in parameters.items()])
file_path = "english-text_" + parameters_str +".txt"

en_bert_sents = en_bert_model.generate(save_to_path=file_path, **parameters)

In [None]:
for sent in en_bert_sents[:10]:
    print(f"\t{sent}")

	 this is very simple. for instance, in wikipedia, f'( y ) states that f ( c ) x completely separates " ( f ) c " from the " x " it is describing. 
	 late 1990s and 2000s also based in peropeus " little ramrod " series voice actor voice actor. of course most of that stuff that you want have a bit of fun. looney tunes. 
	 performance - michael lee. special release " friends " : standard edition & deluxe edition. special advert for coates'version of " ( you ) saved my life ". performance - michael lee. 
	 i hear the rise of sky, the mist rising and falling, and then to the sea, and then back to the sign, all the way to the bottom. the water is still terribly cold. 
	 a month after unifying german prime minister martin svends trotter visited the various alliance parties for the conservative christian social party, but they " refused to visit any'a right'group ". 
	 (... and )... and in the simple friendship her friends have with her, there are men whom she has met before and whom she is p

### English-Italian machine translation

In [None]:
%%capture
!pip install google_trans_new

In [None]:
from google_trans_new import google_translator  

translator = google_translator()  

it_translated_bert_sentences = []
it_file_path = "italian-translated_" + parameters_str + ".txt"

with open(it_file_path, "w") as f:
  for sent in en_bert_sents:
    translation = translator.translate(sent, lang_tgt='it')
    it_translated_bert_sentences.append(translation)
    f.write(translation+'\n')

In [None]:
for sent, trans in zip(en_bert_sents[:5], it_translated_bert_sentences[:5]):
  print(f"ORIGINAL: {sent}")
  print(f"TRANSLATION: {trans}")
  print('\n')


ORIGINAL:  this is very simple. for instance, in wikipedia, f'( y ) states that f ( c ) x completely separates " ( f ) c " from the " x " it is describing. 
TRANSLATION: Questo è molto semplice. Ad esempio, in wikipedia, f '(y) afferma che f (c) x separa completamente "(f) c" dallo "x" si sta descrivendo. 


ORIGINAL:  late 1990s and 2000s also based in peropeus " little ramrod " series voice actor voice actor. of course most of that stuff that you want have a bit of fun. looney tunes. 
TRANSLATION: La fine degli anni '90 e 2000 è basandosi anche in serie "Little Ramrod" Attore Voice Actor Actor Attore. Ovviamente la maggior parte di quella roba che vuoi avere un po 'divertente. Looney Tunes. 


ORIGINAL:  performance - michael lee. special release " friends " : standard edition & deluxe edition. special advert for coates'version of " ( you ) saved my life ". performance - michael lee. 
TRANSLATION: Performance - Michael Lee. Rilevazione speciale "Amici": Edizione standard e Deluxe. An

### Evaluation

To evaluate we will use ...

In [None]:
trans_bleu_score_tbc = Bleu(file_path, tbc_file)
trans_bleu_score_wiki = Bleu(file_path, wiki103_file)

print("(Texygen) BERT-TBC BLEU: %.2f" % (100 * trans_bleu_score_tbc.get_bleu()))
print("(Texygen) BERT-Wiki103 BLEU: %.2f" % (100 * trans_bleu_score_wiki.get_bleu()))

## Italian text generation via Italian model

In [None]:
#dbmdz/bert-base-italian-cased
it_bert_model = BertTextGenerator("dbmdz/bert-base-italian-xxl-uncased", use_apex = APEX_AVAILABLE)


In [None]:

for burnin in [0, 250, 500]:
  for init_method in ['masked', 'random']:
    parameters = {'n_samples': 11,  # 1000
                  'batch_size': 11,  # 50
                  'max_len': 40,
                  'top_k': 100,
                  'temperature': 1,
                  'burnin': burnin,
                  'sample': True,
                  'max_iter': 500,
                  'seed_text': "",
                  'init_method': init_method
                  }

    # "key1=val1_key2=val2_...txt"
    file_path = "_".join([f"{k}={v}" for k, v in parameters.items()])+".txt"
    bert_sents = bert_model.generate(save_to_path=file_path, **parameters)

    bleu_score_tbc = Bleu(file_path, tbc_file)
    bleu_score_wiki = Bleu(file_path, wiki103_file)

    
    print(f"\n(init_method = {init_method} - burnin = {burnin}) Text generated:  (BLEU-tbc={(100 * bleu_score_tbc.get_bleu()):.2f} BLEU-wiki={100 * bleu_score_wiki.get_bleu():.2f}")
    for sent in bert_sents[:10]:
        print(f"\t{sent}")


    

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Finished batch 1 in 20.106s

(init_method = masked - burnin = 0) Text generated:  (BLEU-tbc=24.64 BLEU-wiki=20.84
	 1977 ma lin, writings the pursuit of evidence. foreword by ming fu mau. 1977 ma lin, essays ma lin, correspondence and other chinese historical documents. numerous essays by ming fu mau. 
	 the mahan school ( dartmouth ) from 1976 - 1990 was named among the hundredest " outstanding military schools possible " by google ( samuel l. johnson ) and wake forest university ( andrew jackson ). 
	 by now, nearly all the others - elena, meredith, especially bonnie - were staring as if they were deep in thought. " victoria would have looked like the most beautiful bride, " meredith said. 
	 on singles " promise me " and " our love " ( from their first two eps ) ; for i'm back in your arms " ( and " good times tonight " ) in the uk ; 
	 concert for the festival ensemble. cd ( 2 discs ). here and there : choral works ( choir ) ( mary and frank, the two choirs mary and frank ). boston

#Evaluation - Original

In [None]:
from nltk.translate import bleu_score as bleu

def prepare_data(data_file, replacements={}, uncased=True):
    data = [d.strip().split() for d in open(data_file, 'r').readlines()]
    if uncased:
        data = [[t.lower() for t in sent] for sent in data]
        
    for k, v in replacements.items():
        data = [[t if t != k else v for t in sent] for sent in data]
 
    return data

def prepare_wiki(data_file, uncased=True):
    replacements = {"@@unknown@@": "[UNK]"}
    return prepare_data(data_file, replacements=replacements, uncased=uncased)

def prepare_tbc(data_file):        
    replacements = {"``": "\"", "\'\'": "\""}
    return prepare_data(data_file, replacements=replacements)

def corpus_bleu(generated, references):
    """ Compute similarity between two corpora as measured by
    comparing each sentence of `generated` against all sentences in `references` 
    
    args:
        - generated (List[List[str]]): list of sentences (split into tokens)
        - references (List[List[str]]): list of sentences (split into tokens)
        
    returns:
        - bleu (float)
    """    
    return bleu.corpus_bleu([references for _ in range(len(generated))], generated)

In [None]:
!git clone https://github.com/nyu-dl/bert-gen
wiki103_file = 'bert-gen/data/wiki103.5k.txt'
tbc_file = 'bert-gen/data/tbc.5k.txt'

wiki_data = prepare_wiki(wiki103_file)
tbc_data = prepare_tbc(tbc_file)

fatal: destination path 'bert-gen' already exists and is not an empty directory.


Try to evaluate using original functions and no cleaning of wiki-data

In [None]:
print("BERT-TBC BLEU: %.2f" % (100 * corpus_bleu(bert_sents, tbc_data)))
print("BERT-Wiki103 BLEU: %.2f" % (100 * corpus_bleu(bert_sents, wiki_data)))
print("BERT-{TBC + Wiki103} BLEU: %.2f" % (100 * corpus_bleu(bert_sents, tbc_data[:2500] + wiki_data[:2500])))

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


BERT-TBC BLEU: 17.31
BERT-Wiki103 BLEU: 23.39
BERT-{TBC + Wiki103} BLEU: 22.46


Try to evaluate after cleaning

In [None]:
def cleaner(data):
  len_mask = []
  for i in range(len(data)):
    if len(data[i]) <4:
      len_mask.append(False)
    else:
      len_mask.append(True)

  data = [b for a, b in zip(len_mask, data) if a]
  return data

wiki_data = cleaner(wiki_data)
tbc_data = cleaner(tbc_data)

print("BERT-TBC BLEU: %.2f" % (100 * corpus_bleu(bert_sents, tbc_data)))
print("BERT-Wiki103 BLEU: %.2f" % (100 * corpus_bleu(bert_sents, wiki_data)))
print("BERT-{TBC + Wiki103} BLEU: %.2f" % (100 * corpus_bleu(bert_sents, tbc_data[:2500] + wiki_data[:2500])))

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


BERT-TBC BLEU: 17.31
BERT-Wiki103 BLEU: 23.39
BERT-{TBC + Wiki103} BLEU: 22.46


## Evaluation - Texygen

In [None]:
bleu_score_tbc = Bleu(file_path, tbc_file)
bleu_score_wiki = Bleu(file_path, wiki103_file)

print("(Texygen) BERT-TBC BLEU: %.2f" % (100 * bleu_score_tbc.get_bleu()))
print("(Texygen) BERT-Wiki103 BLEU: %.2f" % (100 * bleu_score_wiki.get_bleu()))



(Texygen) BERT-TBC BLEU: 17.43
(Texygen) BERT-Wiki103 BLEU: 18.25
