<a href="https://colab.research.google.com/github/Bhuvanesh-Singla/WEC_Recs23_Task2/blob/main/Polyphasic/2)Translation/Baseline_Translation_Intel_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Translation Models


Machine translation is a pivotal field within natural language processing (NLP) that focuses on automating the conversion of text or speech from one language to another. It relies on sophisticated models and techniques to accomplish this challenging task effectively. One of the cornerstone methods in machine translation is the sequence-to-sequence (seq2seq) model, which employs deep neural networks to encode input text and then decode it into the target language. This technique has revolutionized translation tasks by learning to capture complex linguistic nuances and contextual information. Additionally, other models like Transformer-based models, including the famous BERT and GPT-3, have also made significant strides in translation, leveraging attention mechanisms to excel in various language pairs and domains. The choice of model depends on specific translation requirements, language pairs, and the quality of available training data. In this Colab file, we havee given a basic demo on how tto use the dataset and work on a simple seq2seq moel usig RNN.Your task will be to improve the model to the maximum you can ,make prediction on the test dataset given and write a code to generate the BLEU score of you prediction compared to original.






In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional,LSTM, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.callbacks import ModelCheckpoint

In [14]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
##Loading and processing data
eng_fr = pd.read_csv("nlp_intel_train.csv")
eng_fr_test = pd.read_csv("nlp_intel_test.csv")

In [3]:
eng_fr = eng_fr.dropna(axis=0, how="any", subset=None, inplace=False)
eng_fr_test = eng_fr_test.dropna(axis=0, how="any", subset=None, inplace=False)

In [4]:
##Tokenizer and padding

def tokenize(data):
  t = Tokenizer()
  t.fit_on_texts(data)
  return t
def training_sequences(tokenizer, m_length, data):
    seq = tokenizer.texts_to_sequences(data)
    seq = pad_sequences(seq, maxlen = m_length, padding='post')
    return seq


In [5]:
#Preprocessing by tokenization and padding
#return processed data and tokenizer
def preprocess(x, y):

    x_tk = tokenize(x)
    y_tk = tokenize(y)

    preprocess_x = training_sequences(x_tk,55,x)
    preprocess_y = training_sequences(y_tk,55,y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

In [7]:
preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer = preprocess(eng_fr["en"].tolist(), eng_fr["fr"].tolist())

In [6]:
preproc_english_sentences_test, preproc_french_sentences_test, english_tokenizer_test, french_tokenizer_test = preprocess(eng_fr_test["en"].tolist(), eng_fr_test["fr"].tolist())

In [8]:
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Max English sentence length: 55
Max French sentence length: 55
English vocabulary size: 21789
French vocabulary size: 27712


In [9]:
#Final output funtion
def logits_to_text(logits, tokenizer):

    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = ' '

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

In [10]:
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):

    learning_rate = 0.001

    # Build the layers
    model = Sequential()
    model.add(Embedding(french_vocab_size, 256, input_length=input_shape[1], input_shape=input_shape[1:]))
    model.add(GRU(256, return_sequences=True))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(english_vocab_size, activation='softmax'))

    # Compile model
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

In [11]:
preproc_french_sentences.shape[1]

55

In [12]:
tmp_x =pad_sequences(preproc_french_sentences, maxlen = 55, padding = 'post')
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

# Train
model = bd_model(
    tmp_x.shape,
    preproc_english_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

model.summary()

model.fit(tmp_x, preproc_english_sentences, batch_size=64, epochs=1, validation_split=0.2)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 55, 256)           7094528   
                                                                 
 gru (GRU)                   (None, 55, 256)           394752    
                                                                 
 dense (Dense)               (None, 55, 1024)          263168    
                                                                 
 dense_1 (Dense)             (None, 55, 21790)         22334750  
                                                                 
Total params: 30087198 (114.77 MB)
Trainable params: 30087198 (114.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


<keras.src.callbacks.History at 0x7ae8894cffa0>

In [None]:
i= 1


print("Prediction:")
print(logits_to_text(model.predict(tmp_x[[i]])[0], english_tokenizer))
print("\nCorrect Translation:")
print(eng_fr["en"].tolist()[i])
print("\nOriginal text:")
print(eng_fr["fr"].tolist()[i])

Prediction:
the the the the the the the the the the                                                                                          

Correct Translation:
The club was very active and they twice organized the annual conference of the Amateur Astronomy Federation of Quebec in 1990 and 1997.

Original text:
Le club est très actif et organise à deux occasions (en 1990 et 1997) le congrès annuel de la Fédération des Astronomes Amateurs du Québec.


In [13]:
tmp_x_test =pad_sequences(preproc_french_sentences_test, maxlen = 55, padding = 'post')
tmp_x_test = tmp_x_test.reshape((-1, preproc_french_sentences_test.shape[-2]))

In [17]:
model.predict(tmp_x[[1]])



array([[[2.4414549e-02, 5.0622914e-02, 1.9811360e-02, ...,
         1.3366742e-06, 1.4642909e-06, 1.4177541e-06],
        [1.6644148e-02, 4.0955663e-02, 1.9040609e-02, ...,
         2.2513275e-06, 2.4232031e-06, 2.3768237e-06],
        [3.9412849e-02, 6.6570200e-02, 2.9955361e-02, ...,
         9.0163087e-07, 1.0106593e-06, 9.6586894e-07],
        ...,
        [9.9849701e-01, 2.3488019e-04, 1.6905511e-04, ...,
         3.5275236e-15, 8.3311188e-15, 4.7333671e-15],
        [9.9855274e-01, 2.2800274e-04, 1.6422423e-04, ...,
         3.1053863e-15, 7.3558518e-15, 4.1724136e-15],
        [9.9860328e-01, 2.2169985e-04, 1.5979225e-04, ...,
         2.7517367e-15, 6.5364741e-15, 3.7019482e-15]]], dtype=float32)

In [19]:
from nltk.translate.bleu_score import sentence_bleu

# Initialize lists to store best translations and their corresponding BLEU scores
best_translations = []
best_bleu_scores = []

# Loop through each input sequence in tmp_x_test
for i in range(len(tmp_x_test)):
    input_sequence = tmp_x_test[[i]]

    # Generate multiple translations for the input sequence
    translations = model.predict(input_sequence)

    # Initialize variables to track the best translation and its BLEU score
    best_translation = None
    best_bleu_score = 0

    # Loop through each translation and calculate its BLEU score
    for translation in translations:
        predicted_translation = logits_to_text(translation, english_tokenizer)
        bleu_score = sentence_bleu([eng_fr["en"].tolist()[i].split()], predicted_translation.split())

        # Check if this translation has a higher BLEU score
        if bleu_score > best_bleu_score:
            best_translation = predicted_translation
            best_bleu_score = bleu_score

    # Append the best translation and its BLEU score to the lists
    best_translations.append(best_translation)
    best_bleu_scores.append(best_bleu_score)

# Calculate the average BLEU score for the entire test set
average_bleu_score = sum(best_bleu_scores) / len(best_bleu_scores)

# Print the best translations and their BLEU scores
for i in range(len(tmp_x_test)):
    print("Best Translation for Input", i + 1, ":", best_translations[i])
    print("BLEU Score for Best Translation:", best_bleu_scores[i])

# Print the average BLEU score for the entire test set
print("Average BLEU Score for the Test Set:", average_bleu_score)




The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Best Translation for Input 1 : the the the the the                                                                                                    
BLEU Score for Best Translation: 3.6695469123219324e-232
Best Translation for Input 2 : the the                                                                                                          
BLEU Score for Best Translation: 5.016678430111076e-236
Best Translation for Input 3 : the the the the                                                                                                      
BLEU Score for Best Translation: 3.516963099536717e-234
Best Translation for Input 4 : None
BLEU Score for Best Translation: 0
Best Translation for Input 5 : the the the the                                                                                                      
BLEU Score for Best Translation: 7.6272394159932085e-233
Best Translation for Input 6 : the the the t