<a href="https://colab.research.google.com/github/Qk527/DataAugmentationTransliteration/blob/main/DataAugmentationSpanishTransliterateWithTransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PoC Spanish Data Augmentation with Transformers
A proof of concept of back translation (Spanish to English and back to Spanish) as a data augmentation technique in NLP.

# Dependencies

In [1]:
!pip install transformers



In [2]:
!pip install transformers[sentencepiece]



# Libraries

In [3]:
#General purpose
import pandas as pd

#Transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

# Configuration

## From *Es* to *En*

In [4]:
tokenizer_es_to_en = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

model_es_to_en = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

## From *En* to *Es*

In [5]:
tokenizer_en_to_es = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

model_en_to_es = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")

## Use

In [6]:
#Generic text from Borges to try the translate function.
text = "En el nombre de la rosa está la rosa."

In [7]:
#Translate to English using the model directly (generate produces the translated token ids)
inputs = tokenizer_es_to_en(text, return_tensors="pt")
outputs = model_es_to_en.generate(**inputs, max_length=200)

In [8]:
translator_to_en = pipeline('translation', model='Helsinki-NLP/opus-mt-es-en')

In [9]:
result = translator_to_en(text, max_length=200, do_sample=True, temperature=0.9)

In [10]:
#Look at the result
print(result[0]['translation_text'])

In the name of the rose is the rose.


In [11]:
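#Translate back to Spanish using the model directly; the input is the English result, not the original text
inputs = tokenizer_en_to_es(result[0]['translation_text'], return_tensors="pt")
outputs = model_en_to_es.generate(**inputs, max_length=200)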

In [12]:
translator_to_es = pipeline('translation', model='Helsinki-NLP/opus-mt-en-es')

In [13]:
result = translator_to_es(result[0]['translation_text'], max_length=200, do_sample=True, temperature=0.9)

In [14]:
print(result[0]['translation_text'])

En el nombre de la rosa está la rosa.


# Test

In [15]:
df = pd.read_csv('dataset.csv')
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [16]:
df = df[df['language'] == 'Spanish'].reset_index()

## Len study

In [17]:
#max_length is one of the translation parameters, so study the text length (in characters) first
df['textLen'] = df['Text'].apply(lambda x: len(str(x)))
df['textLen'].describe()

count    1000.000000
mean      383.995000
std       235.792899
min       124.000000
25%       211.500000
50%       314.000000
75%       485.000000
max      1310.000000
Name: textLen, dtype: float64

In [18]:
# Keep about 50% of the data (texts shorter than 315 characters) for this test
df = df[df['textLen'] < 315]
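
The cutoff of 315 comes from the 50% row of `describe()` above. As a minimal sketch, the same filter can be derived from the data instead of being hard-coded. The `toy` frame and its texts below are made up for illustration; only the `Text` and `textLen` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df; these texts are invented for the example.
toy = pd.DataFrame({
    'Text': ['uno', 'dos tres', 'cuatro cinco seis', 'siete ocho nueve diez']
})
toy['textLen'] = toy['Text'].str.len()

# The median length is the value shown in the 50% row of describe().
cutoff = toy['textLen'].quantile(0.5)

# Keep only the texts that are no longer than the median.
short = toy[toy['textLen'] <= cutoff]
print(cutoff, len(short))
```

Computing the cutoff this way keeps the filter valid if the dataset changes, at the cost of hiding the concrete number from the reader.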

In [19]:
df.shape

(502, 4)

## Back Translation

In [20]:
def translate(text):
  en_result = translator_to_en(text, max_length=350, do_sample=True, temperature=0.9)
  es_result = translator_to_es(en_result[0]['translation_text'], max_length=350, do_sample=True, temperature=0.9)
  return es_result[0]['translation_text']
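
The function above is tied to the two global pipelines. A possible variation, sketched here under assumptions, passes the translators in as arguments so the round trip can be tried without downloading any model. `back_translate`, `fake_es_to_en` and `fake_en_to_es` are hypothetical names, and the fake translators only imitate the pipelines' return shape (a list of dicts with a `translation_text` key).

```python
from typing import Callable, Dict, List

# A translator takes a string and returns a list of dicts with a
# 'translation_text' key, the same shape the pipelines above return.
Translator = Callable[[str], List[Dict[str, str]]]

def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back again."""
    pivot = to_pivot(text)[0]['translation_text']
    return from_pivot(pivot)[0]['translation_text']

# Hypothetical stand-ins so the sketch runs without any model.
def fake_es_to_en(text: str) -> List[Dict[str, str]]:
    return [{'translation_text': text.replace('rosa', 'rose')}]

def fake_en_to_es(text: str) -> List[Dict[str, str]]:
    return [{'translation_text': text.replace('rose', 'rosa')}]

round_trip = back_translate("la rosa", fake_es_to_en, fake_en_to_es)
print(round_trip)
```

With the real pipelines, the generation arguments could be bound beforehand, for example with `functools.partial(translator_to_en, max_length=350, do_sample=True, temperature=0.9)`, and the resulting callables passed in the same way.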

In [21]:
df['generation'] = df['Text'].apply(lambda x: translate(str(x)))

In [22]:
#Series.append was removed in pandas 2.0; pd.concat is the equivalent
merged_df = pd.DataFrame(pd.concat([df['Text'], df['generation']]))

In [23]:
#Translation sometimes changes the text case; lowercase everything so duplicates can be detected:
def text_to_minus(text):
  return text.lower()

In [24]:
df['Text'] = df['Text'].apply(lambda x: text_to_minus(x))
df['generation'] = df['generation'].apply(lambda x: text_to_minus(x))

In [25]:
merged_df = pd.DataFrame(pd.concat([df['Text'], df['generation']]))
merged_df.drop_duplicates(inplace=True)
merged_df.shape

(999, 1)

The data was almost doubled: the 502 original texts grew to 999 unique texts.
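
The merged frame has 999 rows rather than 2 × 502 = 1004, so five rows were dropped as duplicates. A rough way to see how many generated texts are really new is to compare the lowercased sets directly. This is only a sketch: `augmentation_yield` is a hypothetical helper and the sample sentences are invented.

```python
def augmentation_yield(originals, generations):
    """Return (unique_total, new_texts) after lowercasing and deduplicating."""
    original_set = {t.lower() for t in originals}
    generated_set = {t.lower() for t in generations}
    # Generated texts that do not already exist among the originals.
    new_texts = generated_set - original_set
    return len(original_set | generated_set), len(new_texts)

# Invented sample: the first generation only differs by case, the second is new.
originals = ["La rosa es roja.", "El cielo es azul."]
generations = ["la rosa es roja.", "El cielo está azul."]

total, new = augmentation_yield(originals, generations)
print(total, new)
```

Applied to the notebook's data, the inputs would be `df['Text']` and `df['generation']`.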