# Translation model Inga - Spanish

In this notebook we build an encoder-decoder transformer that translates from Inga to Spanish. We follow a Hugging Face tutorial on translation.

## Preparing and understanding the data

### Creating the dataset

First, we import the libraries needed to manipulate the data and load the translated sentences.

We then read the .csv file with the translated sentences; the split into training and test sets comes later in the notebook.

In [None]:
import pandas as pd

# Forward slash avoids the invalid "\B" escape and works on every platform
df = pd.read_csv('data/BIBLIA_DATASET_INGA.csv', sep=',', header=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8201 entries, 0 to 8200
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   inga     8201 non-null   object
 1   espanol  8201 non-null   object
dtypes: object(2)
memory usage: 128.3+ KB


In [None]:
df.columns

Index(['inga', 'espanol'], dtype='object')

In [None]:
df.sample(5)

Unnamed: 0,inga,espanol
1793,IAIA JESUSMANDA ALLI WILLAITA SAN LUCASMI WI...,LUCAS
5705,7 Chiuramandaka Santiagotasi kawarirka. Nispa...,7 Luego apareció a Jacobo y después a todos lo...
4045,5 Chasa uiaspaka tukuikuna suglla iuiaimi tuku...,5 Esta propuesta agradó a toda la multitud; y ...
8006,8 Maikanpas kai alpapi kaugsanakug kikinpa su...,8 Y le adorarán todos los habitantes sobre la ...
998,62 Chasa uiaspaka iaia sasirduti saia rispa Je...,62 Se levantó el sumo sacerdote y le dijo: — ¿...


### Tokenization

The tokenization function we implement only splits the text into words, removes special characters and lowercases everything.

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence

def tokenization(text):
  # Strip every character listed in `filters` (including ; . ?); characters not listed, such as * and ¿, are kept
  list_words = text_to_word_sequence(text, filters=r'!"#$%&()+,-./:;<=>?@[\]^_`{|}~', lower=True)
  return list_words
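
To make the effect of the filter concrete, here is a minimal plain-Python sketch of the behaviour described above: lowercase the text, replace every character in the filter string with a space, then split on whitespace. The helper name `simple_word_sequence` and the sample sentence are made up for illustration, and the function only approximates `text_to_word_sequence`; it is not the Keras implementation.

```python
# Same filter characters as in the cell above (backslash escaped for a normal string).
FILTERS = '!"#$%&()+,-./:;<=>?@[\\]^_`{|}~'


def simple_word_sequence(text, filters=FILTERS, lower=True):
    """Approximate word splitting: lowercase, blank out filtered chars, split."""
    if lower:
        text = text.lower()
    # Map every filtered character to a space so words stay separated.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


sample = "36 No jurarás ni por tu cabeza; *mana ¿qué?"
print(simple_word_sequence(sample))
# ['36', 'no', 'jurarás', 'ni', 'por', 'tu', 'cabeza', '*mana', '¿qué']
```

Note that `*` and `¿` survive because they are not part of the filter string, which matches the tokenized sentences shown further down.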

In [None]:
tokenized_inb = []
tokenized_es = []


for i in range(0, len(df['inga'].values)):
    # Check for unusual characters: report the first sentence that contains a comma and stop
    if "," in df.iloc[i]['inga']:
        print(i, df.iloc[i]['inga'])
        break
    procesado = tokenization(df.iloc[i]['inga'])
    tokenized_inb.append(' '.join(procesado))

for i in range(0, len(df['espanol'].values)):
    # Check for unusual characters: report the first sentence that contains a comma and stop
    if "," in df.iloc[i]['espanol']:
        print(i, df.iloc[i]['espanol'])
        break
    procesado = tokenization(df.iloc[i]['espanol'])
    tokenized_es.append('<bos> '+' '.join(procesado) + ' <eos>')

tokenized_df = pd.DataFrame(list(zip(tokenized_inb, tokenized_es)),
               columns =['source', 'target'])

tokenized_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8201 entries, 0 to 8200
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   source  8201 non-null   object
 1   target  8201 non-null   object
dtypes: object(2)
memory usage: 128.3+ KB


Apparently the New Testament text contains no commas.

In [None]:
print(df['inga'][131])
print(df['espanol'][131])
print(tokenized_df['source'][131])
print(tokenized_df['target'][131])

36 Kamkunapa umatapas *mana kawag churangichi; kamkuna manima pudingapa kangichichu ñi sug agcha iuraiachingapa u ianaiachingapa.
36 No jurarás ni por tu cabeza porque no puedes hacer que un cabello sea ni blanco ni negro.
36 kamkunapa umatapas *mana kawag churangichi kamkuna manima pudingapa kangichichu ñi sug agcha iuraiachingapa u ianaiachingapa
<bos> 36 no jurarás ni por tu cabeza porque no puedes hacer que un cabello sea ni blanco ni negro <eos>


Comparing the vocabulary sizes of the input (Inga) and output (Spanish) languages shows that Inga has the considerably larger vocabulary: 18,659 entries against 11,816 for Spanish, a difference of 6,843.

In [None]:
# Tokenizer for input language (inga)
input_tokenizer = Tokenizer(oov_token="<unk>", filters='"#$%&()+-/=@[\]^_`{|}~')
input_tokenizer.fit_on_texts(tokenized_inb)
input_tokenizer.word_index["<pad>"] = 0
input_tokenizer.index_word[0] = "<pad>"



# Tokenizer for output language (spanish)
output_tokenizer = Tokenizer(oov_token="<unk>", filters='"#$%&()*+-/=@[\]^_`{|}~')
output_tokenizer.fit_on_texts(tokenized_es)
output_tokenizer.word_index["<pad>"] = 0
output_tokenizer.index_word[0] = "<pad>"

# Get vocabulary sizes
input_vocab_size = len(input_tokenizer.word_index)  # Already counts <unk> and the manually added <pad>
output_vocab_size = len(output_tokenizer.word_index)  # Already counts <unk> and the manually added <pad>

print("Input Vocabulary Size:", input_vocab_size)
print("Output Vocabulary Size:", output_vocab_size)

Input Vocabulary Size: 18659
Output Vocabulary Size: 11816


In [None]:
output_tokenizer.word_index["<pad>"], output_tokenizer.word_index["<unk>"], output_tokenizer.word_index["<bos>"], output_tokenizer.word_index["<eos>"]

(0, 1, 2, 3)
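
The indices above follow from how a frequency-ranked word index works: `<pad>` was placed at 0 by hand, the out-of-vocabulary token takes 1, and the remaining words are numbered from the most to the least frequent. Since `<bos>` and `<eos>` occur in every target sentence, they end up at 2 and 3. The sketch below mimics that ranking in plain Python; `build_word_index` and the toy sentences are hypothetical, and this is an illustration of the idea rather than the Keras `Tokenizer` code.

```python
from collections import Counter


def build_word_index(texts, oov_token="<unk>", pad_token="<pad>"):
    """Rank words by frequency, reserving 0 for padding and 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {pad_token: 0, oov_token: 1}
    # most_common() keeps first-seen order among words with equal counts.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


toy_targets = ["<bos> no temas <eos>", "<bos> ven y ve <eos>"]
toy_index = build_word_index(toy_targets)
print(toy_index["<pad>"], toy_index["<unk>"], toy_index["<bos>"], toy_index["<eos>"])
# 0 1 2 3
```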

## Keras translation model

I will try to follow a step-by-step Keras example to build a seq2seq translation model.

https://keras.io/examples/nlp/neural_machine_translation_with_transformer/

In [None]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras

### Parsing

The target sentences already carry the `<bos>` and `<eos>` markers added during tokenization, as a random sample shows.

In [None]:
tokenized_df.sample(5)

Unnamed: 0,source,target
7964,4 —chi iskai nuka kachaskakunaka kai alpata ma...,<bos> 4 ellos son los dos olivos y los dos can...
2491,32 mana pudingasina kawaspaka chi sug mandag *...,<bos> 32 de otra manera cuando el otro rey est...
6704,3 iaia jesús ima rurangapa niska chasami kanga...,<bos> 3 pero fiel es el señor que os establece...
1552,18 jesús chasa rimaskataka iaia sasirdutikuna ...,<bos> 18 lo oyeron los principales sacerdotes ...
4429,3 pabloka munakurkami timoteoka nukawa purigri...,<bos> 3 pablo quiso que éste fuera con él y to...


### Splitting into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and validation sets
train, test = train_test_split(tokenized_df, test_size = 0.10)

### Vectorizing

In [None]:
from tensorflow.keras.layers import TextVectorization  # non-experimental import path

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("<", "")
strip_chars = strip_chars.replace(">", "")

sequence_length = 20
batch_size = 128

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")

inb_vectorization = TextVectorization(
    max_tokens=input_vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

es_vectorization = TextVectorization(
    max_tokens=output_vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
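
As a rough illustration of what `custom_standardization` does to a target sentence, the same idea can be written with Python's standard `re` module: lowercase, then delete every punctuation character except `<` and `>` so that the `<bos>` and `<eos>` markers survive. The function name `standardize` and the sample sentence are assumptions made for this sketch; TensorFlow uses its own regex engine, so treat this as an approximation and not as the code that runs inside the layer.

```python
import re
import string

# Punctuation to delete: everything in string.punctuation plus "¿",
# but keeping "<" and ">" so the sentence markers stay intact.
strip_chars = (string.punctuation + "¿").replace("<", "").replace(">", "")
pattern = re.compile("[%s]" % re.escape(strip_chars))


def standardize(text):
    """Lowercase the text and remove the punctuation in strip_chars."""
    return pattern.sub("", text.lower())


print(standardize("<bos> ¿Qué dices? <eos>"))
# <bos> qué dices <eos>
```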

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
inb_vectorization.adapt(tokenized_df['source'].values)
es_vectorization.adapt(tokenized_df['target'].values)

In [None]:
es_vectorization.get_vocabulary(include_special_tokens=True)[:20]

['',
 '[UNK]',
 '<eos>',
 '<bos>',
 'de',
 'y',
 'que',
 'a',
 'la',
 'el',
 'en',
 'los',
 'no',
 'por',
 'se',
 'para',
 'le',
 '—',
 'con',
 'dios']

Next we format the dataset so that it can be passed correctly to the encoder and decoder transformers.


In [None]:
def format_dataset(ing, spa):

    inb = inb_vectorization(ing)
    es = es_vectorization(spa)

    return (
        {
            "encoder_inputs": inb,
            "decoder_inputs": es[:, :-1],
        },
        es[:, 1:],
    )


def make_dataset(pairs):
    ing_texts = pairs['source'].values
    spa_texts = pairs['target'].values
    dataset = tf.data.Dataset.from_tensor_slices((ing_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train)
test_ds = make_dataset(test)

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (128, 20)
inputs["decoder_inputs"].shape: (128, 20)
targets.shape: (128, 20)
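
The two slices in `format_dataset` implement the usual one-step shift between what the decoder reads and what it has to predict: the decoder input drops the last token and the target drops the first one. A minimal sketch with plain Python lists, assuming hypothetical token ids (3 for `<bos>`, 2 for `<eos>`, 0 for padding, as in the vocabulary listing above; 17 and 42 are made-up word ids):

```python
def shift_for_teacher_forcing(batch):
    """Split each padded sequence into decoder inputs and next-token targets."""
    decoder_inputs = [row[:-1] for row in batch]  # same as es[:, :-1]
    targets = [row[1:] for row in batch]          # same as es[:, 1:]
    return decoder_inputs, targets


# One sequence of length sequence_length + 1 (here 6 for brevity).
batch = [[3, 17, 42, 2, 0, 0]]
dec_in, tgt = shift_for_teacher_forcing(batch)
print(dec_in)  # [[3, 17, 42, 2, 0]]
print(tgt)     # [[17, 42, 2, 0, 0]]
```

At every position the target holds the token that follows the corresponding decoder input, which is why both tensors end up with the same length of 20 in the shapes printed above.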


## Building the model

In [None]:
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config

In [None]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_3 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.layernorm_4 = layers.LayerNormalization()
        self.add = layers.Add()  # instead of `+` to preserve mask
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, use_causal_mask=True
        )
        out_1 = self.layernorm_1(self.add([inputs, attention_output_1]))

        attention_output_2 = self.attention_2(
            query=out_1, value=out_1, key=out_1, use_causal_mask=True
        )
        out_2 = self.layernorm_2(self.add([out_1, attention_output_2]))

        attention_output_3 = self.attention_3(
            query=out_2, value=encoder_outputs, key=encoder_outputs
        )
        out_3 = self.layernorm_3(self.add([out_2, attention_output_3]))

        proj_output = self.dense_proj(out_3)
        return self.layernorm_4(self.add([out_3, proj_output]))

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config
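
Both self-attention calls in the decoder pass `use_causal_mask=True`, which restricts each position to attend only to itself and to earlier positions. The pattern is a lower-triangular matrix; the plain-Python helper below (`causal_mask` is a hypothetical name) only illustrates that pattern and does not reproduce how Keras builds the mask internally.

```python
def causal_mask(length):
    """Row = query position, column = key position; 1 means attention is allowed."""
    return [
        [1 if key <= query else 0 for key in range(length)]
        for query in range(length)
    ]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```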


Next, we assemble the end-to-end model.

In [None]:
embed_dim = 512
latent_dim = 512
num_heads = 8

# Encoder
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, input_vocab_size, embed_dim)(encoder_inputs)
for _ in range(6):  # Stack the TransformerEncoder layer 6 times
    x = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder_outputs = x
encoder = keras.Model(encoder_inputs, encoder_outputs)

# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, output_vocab_size, embed_dim)(decoder_inputs)
for _ in range(6):  # Stack the TransformerDecoder layer 6 times
    x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.2)(x)
decoder_outputs = layers.Dense(output_vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

# Transformer
decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

## Training the model

In [None]:
epochs = 15  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=test_ds)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 positional_embedding_4 (Po  (None, None, 512)            9563648   ['encoder_inputs[0][0]']      
 sitionalEmbedding)                                                                               
                                                                                                  
 decoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                      

<keras.src.callbacks.History at 0x79bb10a22cb0>

## Decoding test sentences

In [None]:
spa_vocab = es_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = inb_vectorization([input_sentence])
    decoded_sentence = "<bos>"
    for i in range(max_decoded_sentence_length-1):
        tokenized_target_sentence = es_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "<eos>":
            break
    return decoded_sentence
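
`decode_sequence` is a greedy decoder: at each step it keeps only the highest-scoring next token and stops at `<eos>` or at the length limit. The control flow can be isolated from the model with a toy scorer. Everything in this sketch (`greedy_decode`, `toy_scores` and the canned words) is hypothetical and stands in for the transformer call.

```python
def greedy_decode(next_token_scores, max_len=20, bos="<bos>", eos="<eos>"):
    """Repeatedly append the best-scoring token until <eos> or max_len tokens."""
    tokens = [bos]
    for _ in range(max_len - 1):
        scores = next_token_scores(tokens)      # dict: token -> score
        best = max(scores, key=scores.get)      # greedy choice (argmax)
        tokens.append(best)
        if best == eos:
            break
    return tokens


def toy_scores(prefix):
    """Stand-in for the model: always prefers the next word of a fixed sentence."""
    canned = ["hola", "mundo", "<eos>"]
    wanted = canned[min(len(prefix) - 1, len(canned) - 1)]
    return {tok: (1.0 if tok == wanted else 0.0) for tok in canned}


print(greedy_decode(toy_scores))
# ['<bos>', 'hola', 'mundo', '<eos>']
```

Greedy decoding is simple and fast, but it can lock into repetitive output because an early choice is never revisited.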

In [None]:
test_ing_texts = test['source'].values  # sample from the held-out split, not the full dataset
for _ in range(10):
    input_sentence = random.choice(test_ing_texts)
    print(input_sentence)
    translated = decode_sequence(input_sentence)
    print(translated)
    print("-------------------------")

21 chi jiru ruraskamanda ¿imatak chaskirkangichi chasa ruraskakunamandaka kunaura kamkunata iapa pingaipami iuiachiku chasa ruragkunaka wañuillami chaskinkuna
<bos> 21 qué pues qué es decir que es la carne os parece que os haga dignos de la carne
-------------------------
16 *chiwanka chi iskai chunga chusku iacha taita kuna taita diuspa ñawi ladu mandadirupi tianakugka suma *kumurispa paita kungurirkakuna
<bos> 16 y se levantaron los veinticuatro ancianos y los ancianos se pusieron de los que estaban de los que
-------------------------
26 ikuti suma luarpi tiaska jerusalén puiblumanda kagkunaka niraianmi ñi pita mana randiskakuna kagta chi puiblu nukanchipa mamasinami niraiá
<bos> 26 pero la creación que es la carne es la carne ni la sangre de los hombres <eos>
-------------------------
13 kasapasmi tukugsamunkuna killa wangu tukuspa wasi wasillami puringapa munankuna mana killa wangulla tukunkunachu chankualkuna tukuspa tukuipi rimarispami purinkuna imasa mana chaiaska parlukuna was

## Calculating BLEU

In [None]:
# Pair every test source sentence with its reference translation
data = list(zip(test['source'].values, test['target'].values))


In [None]:
from nltk.translate.bleu_score import corpus_bleu

def calculate_bleu_score(data):
    predicted_sentences = []
    actual_sentences = []

    for source_sentence, target_sentence in data:
        # Generate prediction
        predicted_seq = decode_sequence(source_sentence)

        # Store predictions and actual sentences
        predicted_sentences.append(predicted_seq.split())
        actual_sentences.append([target_sentence.split()])

    # Calculate BLEU score
    bleu_score = corpus_bleu(actual_sentences, predicted_sentences)
    return bleu_score


In [None]:
bleu_score = calculate_bleu_score(data)
print("BLEU Score:", bleu_score)
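
To help read the score, recall that BLEU is built from clipped n-gram precisions: each n-gram of the candidate counts only as often as it appears in the reference. Full BLEU then combines several n-gram orders and applies a brevity penalty, which this sketch leaves out. The helpers `ngrams` and `clipped_precision` and the two toy sentences are assumptions made for illustration; this is not the NLTK implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


candidate = "de la carne de la carne".split()
reference = "de la carne y la sangre".split()
p1 = clipped_precision(candidate, reference, 1)
p2 = clipped_precision(candidate, reference, 2)
print(round(p1, 3), round(p2, 3))
# 0.667 0.4
```

The repeated phrase in the candidate is only credited once per occurrence in the reference, so repetitive translations like some of those printed above are penalized.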