# TRANSFORMERS-Multilingual Translation: Custom Encoders and Decoders and Pretrained T5ForConditionalGeneration

## Project Overview:

**In our "TRANSFORMERS-Multilingual Translation" project, we translate text from English to French in two ways: with a custom encoder-decoder architecture trained on an English-French sentence-pair dataset, and with a pretrained T5ForConditionalGeneration model.
The project aims to make translation more accurate, versatile, and efficient.**

**Custom Encoder-Decoder Development:** We aim to design and implement a custom encoder-decoder architecture tailored for the English-to-French translation task. This architecture will enhance the model's ability to capture intricate linguistic nuances.

**T5ForConditionalGeneration Integration:** We will integrate a pretrained T5ForConditionalGeneration model. T5 is known for its language generation capabilities, and we will use it for conditional language translation.

# Exploratory Data Analysis

In [1]:
import pandas as pd
import numpy as np

In [2]:
data=pd.read_csv('eng_-french.csv') #Data taken from Kaggle

In [3]:
data.head()

| | English words/sentences | French words/sentences |
|---|---|---|
| 0 | Hi. | Salut! |
| 1 | Run! | Cours ! |
| 2 | Run! | Courez ! |
| 3 | Who? | Qui ? |
| 4 | Wow! | Ça alors ! |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175621 entries, 0 to 175620
Data columns (total 2 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   English words/sentences  175621 non-null  object
 1   French words/sentences   175621 non-null  object
dtypes: object(2)
memory usage: 2.7+ MB


**In total, this dataset contains 175,621 records of English words and sentences with their French translations**

**Let's check for any NaN values**

In [5]:
data.isnull().sum()

English words/sentences    0
French words/sentences     0
dtype: int64

## The dataset stores the sentences column-wise, so I'll convert each column into a list for tokenizing and padding before passing it to the encoder and decoder

In [6]:
english_sentences=data['English words/sentences'].tolist()

In [7]:
print(english_sentences[:100])

['Hi.', 'Run!', 'Run!', 'Who?', 'Wow!', 'Fire!', 'Help!', 'Jump.', 'Stop!', 'Stop!', 'Stop!', 'Wait!', 'Wait!', 'Go on.', 'Go on.', 'Go on.', 'Hello!', 'Hello!', 'I see.', 'I try.', 'I won!', 'I won!', 'I won.', 'Oh no!', 'Attack!', 'Attack!', 'Cheers!', 'Cheers!', 'Cheers!', 'Cheers!', 'Get up.', 'Go now.', 'Go now.', 'Go now.', 'Got it!', 'Got it!', 'Got it?', 'Got it?', 'Got it?', 'Hop in.', 'Hop in.', 'Hug me.', 'Hug me.', 'I fell.', 'I fell.', 'I know.', 'I left.', 'I left.', 'I lied.', 'I lost.', 'I paid.', "I'm 19.", "I'm OK.", "I'm OK.", 'Listen.', 'No way!', 'No way!', 'No way!', 'No way!', 'No way!', 'No way!', 'No way!', 'No way!', 'No way!', 'Really?', 'Really?', 'Really?', 'Thanks.', 'We try.', 'We won.', 'We won.', 'We won.', 'We won.', 'Ask Tom.', 'Awesome!', 'Be calm.', 'Be calm.', 'Be calm.', 'Be cool.', 'Be fair.', 'Be fair.', 'Be fair.', 'Be fair.', 'Be fair.', 'Be fair.', 'Be kind.', 'Be nice.', 'Be nice.', 'Be nice.', 'Be nice.', 'Be nice.', 'Be nice.', 'Beat it.',

In [8]:
french_sentences=data['French words/sentences'].tolist()

In [9]:
print(french_sentences[:100])

['Salut!', 'Cours\u202f!', 'Courez\u202f!', 'Qui ?', 'Ça alors\u202f!', 'Au feu !', "À l'aide\u202f!", 'Saute.', 'Ça suffit\u202f!', 'Stop\u202f!', 'Arrête-toi !', 'Attends !', 'Attendez !', 'Poursuis.', 'Continuez.', 'Poursuivez.', 'Bonjour !', 'Salut !', 'Je comprends.', "J'essaye.", "J'ai gagné !", "Je l'ai emporté !", 'J’ai gagné.', 'Oh non !', 'Attaque !', 'Attaquez !', 'Santé !', 'À votre santé !', 'Merci !', 'Tchin-tchin !', 'Lève-toi.', 'Va, maintenant.', 'Allez-y maintenant.', 'Vas-y maintenant.', "J'ai pigé !", 'Compris !', 'Pigé\u202f?', 'Compris\u202f?', "T'as capté\u202f?", 'Monte.', 'Montez.', 'Serre-moi dans tes bras !', 'Serrez-moi dans vos bras !', 'Je suis tombée.', 'Je suis tombé.', 'Je sais.', 'Je suis parti.', 'Je suis partie.', "J'ai menti.", "J'ai perdu.", 'J’ai payé.', "J'ai 19 ans.", 'Je vais bien.', 'Ça va.', 'Écoutez !', "C'est pas possible\u202f!", 'Impossible\u202f!', 'En aucun cas.', 'Sans façons\u202f!', "C'est hors de question !", "Il n'en est pas questi

In [10]:
print(len(french_sentences))

175621


# Text Preprocessing

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Create a tokenizer and fit it on the english data
eng_tok=Tokenizer()
eng_tok.fit_on_texts(english_sentences)
eng_prepad=eng_tok.texts_to_sequences(english_sentences)
print(eng_prepad[:10])

#create tokenizer for french and fit it on french data
frh_tok=Tokenizer()
frh_tok.fit_on_texts(french_sentences)
frh_prepad=frh_tok.texts_to_sequences(french_sentences)
print(frh_prepad[:10])


[[2818], [429], [429], [79], [3576], [456], [81], [1409], [190], [190]]
[[4241], [6947], [18607], [32], [28, 11391], [41, 467], [6, 11392], [4666], [28, 13843], [18608]]


In [12]:
len(eng_tok.word_index)

14531

**There are 14,531 unique words in the English data**

In [13]:
len(frh_tok.word_index)

30660

**The French data has about 30k unique words, roughly double the English vocabulary**
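
Part of this gap may come from how the text is split rather than from the languages themselves. The sketch below approximates the default behaviour of the Keras `Tokenizer` in plain Python (lowercasing, replacing a fixed set of punctuation with spaces, splitting on the plain space character, ranking words by frequency). It is an approximation for illustration, not the library's implementation; `to_words` and `build_word_index` are hypothetical helper names. Under these assumptions the apostrophe is not filtered, so `j'ai` stays a single token, and a narrow no-break space (`\u202f`) before `!` stays attached to the word, which would create extra vocabulary entries.

```python
from collections import Counter

# Assumed to mirror the Tokenizer's default filter characters.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text):
    """Lowercase, replace filtered punctuation with spaces, split on ' '."""
    text = text.lower().translate(str.maketrans({ch: " " for ch in FILTERS}))
    return [word for word in text.split(" ") if word]

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(to_words(text))
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

sample = ["J'ai gagné !", "J'ai perdu.", "Cours\u202f!"]
word_index = build_word_index(sample)
print(word_index)
```

If this is what happens in the notebook, normalising `\u202f` to a regular space before fitting the tokenizer would be one way to shrink the French vocabulary.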

In [37]:
frh_tok.word_index

{'je': 1,
 'de': 2,
 'pas': 3,
 'vous': 4,
 'que': 5,
 'à': 6,
 'ne': 7,
 'le': 8,
 'la': 9,
 'tu': 10,
 'il': 11,
 'ce': 12,
 'est': 13,
 'tom': 14,
 'un': 15,
 'a': 16,
 'nous': 17,
 'en': 18,
 'une': 19,
 'les': 20,
 "j'ai": 21,
 'suis': 22,
 'me': 23,
 'pour': 24,
 'faire': 25,
 'elle': 26,
 "c'est": 27,
 'ça': 28,
 'dans': 29,
 'plus': 30,
 'des': 31,
 'qui': 32,
 'tout': 33,
 'moi': 34,
 'veux': 35,
 'te': 36,
 'fait': 37,
 'avec': 38,
 'mon': 39,
 'du': 40,
 'au': 41,
 'se': 42,
 'si': 43,
 'et': 44,
 'êtes': 45,
 "n'est": 46,
 'sont': 47,
 'être': 48,
 "qu'il": 49,
 'y': 50,
 'cette': 51,
 'son': 52,
 'très': 53,
 'peux': 54,
 'as': 55,
 'votre': 56,
 'temps': 57,
 'pourquoi': 58,
 'sur': 59,
 'ils': 60,
 'dit': 61,
 'cela': 62,
 'lui': 63,
 'ma': 64,
 'pense': 65,
 'était': 66,
 'sais': 67,
 'été': 68,
 'avez': 69,
 'es': 70,
 'chose': 71,
 "n'ai": 72,
 'jamais': 73,
 'toi': 74,
 'ici': 75,
 'comment': 76,
 'où': 77,
 'vraiment': 78,
 'bien': 79,
 'ton': 80,
 'quelque': 81,
 '

In [38]:
frh_tok.index_word

{1: 'je',
 2: 'de',
 3: 'pas',
 4: 'vous',
 5: 'que',
 6: 'à',
 7: 'ne',
 8: 'le',
 9: 'la',
 10: 'tu',
 11: 'il',
 12: 'ce',
 13: 'est',
 14: 'tom',
 15: 'un',
 16: 'a',
 17: 'nous',
 18: 'en',
 19: 'une',
 20: 'les',
 21: "j'ai",
 22: 'suis',
 23: 'me',
 24: 'pour',
 25: 'faire',
 26: 'elle',
 27: "c'est",
 28: 'ça',
 29: 'dans',
 30: 'plus',
 31: 'des',
 32: 'qui',
 33: 'tout',
 34: 'moi',
 35: 'veux',
 36: 'te',
 37: 'fait',
 38: 'avec',
 39: 'mon',
 40: 'du',
 41: 'au',
 42: 'se',
 43: 'si',
 44: 'et',
 45: 'êtes',
 46: "n'est",
 47: 'sont',
 48: 'être',
 49: "qu'il",
 50: 'y',
 51: 'cette',
 52: 'son',
 53: 'très',
 54: 'peux',
 55: 'as',
 56: 'votre',
 57: 'temps',
 58: 'pourquoi',
 59: 'sur',
 60: 'ils',
 61: 'dit',
 62: 'cela',
 63: 'lui',
 64: 'ma',
 65: 'pense',
 66: 'était',
 67: 'sais',
 68: 'été',
 69: 'avez',
 70: 'es',
 71: 'chose',
 72: "n'ai",
 73: 'jamais',
 74: 'toi',
 75: 'ici',
 76: 'comment',
 77: 'où',
 78: 'vraiment',
 79: 'bien',
 80: 'ton',
 81: 'quelque',
 8

In [39]:
len(frh_tok.index_word)

30660

In [14]:
type(eng_tok.word_index)

dict

## Using the number of unique words in the tokenizer to define the embedding

In [15]:
voc_size_eng=len(eng_tok.word_index)+1 #+1 because index 0 is reserved for padding and does not appear in word_index
voc_size_frh=len(frh_tok.word_index)+1

In [16]:
print(voc_size_eng,voc_size_frh)

14532 30661
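
These vocabulary sizes determine how many weights the embedding layers hold: an embedding layer stores one vector per vocabulary index, so its parameter count is vocabulary size times embedding dimension. The short check below uses the embedding dimension of 256 that the model cells later in this notebook define; `embedding_params` is a hypothetical helper name used only here.

```python
def embedding_params(vocab_size, embedding_dim):
    """An embedding layer stores one vector of size embedding_dim per index."""
    return vocab_size * embedding_dim

embedding_dim = 256  # value used by the encoder and decoder cells in this notebook
eng_params = embedding_params(14532, embedding_dim)
frh_params = embedding_params(30661, embedding_dim)
print(eng_params, frh_params)  # 3720192 7849216
```

These values match the `Param #` column reported for the two embedding layers in the model summary.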


## The model inputs need a fixed sequence length (the maxlen parameter), so I'll find the maximum sentence length across both languages

In [17]:
max_length=max(len(seq) for seq in eng_prepad+frh_prepad)
max_length

55

## PADDING

In [18]:
eng_padding=pad_sequences(sequences=eng_prepad,maxlen=max_length,padding='pre')#passing the tokenized words and converted sequences into pad_sequences
frh_padding=pad_sequences(sequences=frh_prepad,maxlen=max_length,padding='pre')
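
To make the effect of `padding='pre'` concrete, here is a minimal pure-Python sketch of pre-padding. It illustrates the idea only and is not the Keras implementation; `pad_pre` is a hypothetical helper name. Sequences longer than `maxlen` keep their last `maxlen` tokens, which is assumed to match the default truncation side of `pad_sequences`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[2818], [28, 11391]], maxlen=4))
# [[0, 0, 0, 2818], [0, 0, 28, 11391]]
```

`pad_sequences` also accepts `padding='post'`, which places the zeros after the tokens instead.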

In [19]:
eng_padding[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0, 2818],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  429],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0

## The original Transformer uses 512 dimensions for every token and stacks 6 encoder layers and 6 decoder layers. Each encoder layer contains self-attention; each decoder layer contains self-attention plus encoder-decoder attention

### Due to limited computational resources, I mimic the encoder-decoder structure with LSTM layers rather than implementing the exact Transformer architecture
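
For reference, the self-attention and encoder-decoder attention mentioned above both build on scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below shows that operation on plain Python lists so the arithmetic is visible. It is a teaching sketch only: it has no multiple heads, masking, or learned projections, and the model defined in this notebook does not use it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
result = scaled_dot_product_attention([[1.0, 1.0]], keys, values)
print(result)
```

Because both keys are identical, the query weights them equally and the output is the mean of the two value vectors.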

# Defining Encoders and Decoders

In [2]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Embedding,LSTM,Dense


### Encoder

In [29]:
embedding_dim=256
units=512

#encoder
encoder_input= Input(shape=(max_length,))
encoder_embedding=Embedding(input_dim=voc_size_eng,output_dim=embedding_dim)(encoder_input)
encoder_LSTM=LSTM(units=units,return_state=True)#return_state=True also returns the final hidden state and cell state
encoder_outputs,state_h,state_c=encoder_LSTM(encoder_embedding)#encoder_outputs is the LSTM output at the last time step;
#it is not used further here because this model has no attention mechanism
#state_h is the hidden state after processing the entire input sequence; state_c is the cell state, which carries longer-term information
encoder_states=[state_h,state_c]#this list is used to initialize the decoder LSTM

### Decoder

In [30]:
#Decoder
decoder_input=Input(shape=(max_length,))
dec_emb_layer=Embedding(input_dim=voc_size_frh,output_dim=embedding_dim)
decoder_embedding=dec_emb_layer(decoder_input)
decoder_LSTM=LSTM(units=units,return_sequences=True,return_state=True)#return_sequences=True indicates that the LSTM layer 
#should return sequences at each time step rather than just the final output,
#This is necessary because the decoder needs to generate a sequence of words
decoder_outputs,_,_=decoder_LSTM(decoder_embedding,initial_state=encoder_states)#decoder_outputs holds
#the decoder LSTM output at every time step; the two _ are the decoder's final hidden state and cell state,
#which are not needed here, so they are discarded
decoder_dense=Dense(voc_size_frh,activation='softmax')
output=decoder_dense(decoder_outputs)

In [31]:
model=Model([encoder_input,decoder_input],output)

In [32]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_3 (InputLayer)        [(None, 55)]                 0         []                            
                                                                                                  
 input_4 (InputLayer)        [(None, 55)]                 0         []                            
                                                                                                  
 embedding_2 (Embedding)     (None, 55, 256)              3720192   ['input_3[0][0]']             
                                                                                                  
 embedding_3 (Embedding)     (None, 55, 256)              7849216   ['input_4[0][0]']             
                                                                                            

### Train test Split

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X_train,x_test,y_train,y_test=train_test_split(eng_padding,frh_padding,test_size=0.20,random_state=42)

In [36]:
model.fit([X_train,X_train],y_train,validation_data=([x_test,x_test],y_test),epochs=1,batch_size=64)#computation time is very high, so I trained for only 1 epoch



<keras.src.callbacks.History at 0x200a63dc950>
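
The training call above passes the padded English sequences as both the encoder input and the decoder input. A more common seq2seq setup uses teacher forcing: the decoder input is the target sentence shifted right by one position behind a start token, and the decoder learns to predict the unshifted target. The sketch below shows only that shift. It assumes post-padded targets and a hypothetical `START_ID` that is not part of this notebook's tokenizer; using it would also require enlarging the decoder embedding by one index.

```python
# Hypothetical id for a start-of-sequence token: one past the largest
# French index (30660), so it does not collide with any real word.
START_ID = 30661

def make_decoder_pair(target_ids, start_id=START_ID):
    """Return (decoder_input, decoder_target) for teacher forcing."""
    decoder_input = [start_id] + list(target_ids[:-1])  # shifted right by one
    decoder_target = list(target_ids)                   # what the decoder should predict
    return decoder_input, decoder_target

# Post-padded ids for "Ça alors !" taken from the tokenizer output above.
dec_in, dec_out = make_decoder_pair([28, 11391, 0, 0])
print(dec_in)   # [30661, 28, 11391, 0]
print(dec_out)  # [28, 11391, 0, 0]
```

At each position the decoder then sees the previous correct word and predicts the next one, which is the setup a step-by-step decoding loop needs at inference time.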

### Defining custom Function for language translation from English to French

In [62]:
def translate_to_french(sentences):
    seq=eng_tok.texts_to_sequences([sentences])
    padding=pad_sequences(sequences=seq,maxlen=max_length,padding='pre')
    translated=np.argmax(model.predict([padding,padding]),axis=-1)# axis=-1 picks, at each position, the index of the word with the highest predicted probability
    
    translated_sentence=[]
    for i in translated[0]:# translated[0] holds the predictions for the first (and here the only) sentence in the batch
        if i in frh_tok.index_word:# index 0 is padding and has no word, so it is skipped
            translated_sentence.append(frh_tok.index_word[i])
    
    return ' '.join(translated_sentence)
    
    

In [67]:
a=input("enter your words: ")
print("Input:",a)
b=translate_to_french(a)
print("French:",b)

enter your words: This is my project which translates to french
Input: This is my project which translates to french
French: c'est mon de est de français


**The output I get is only partly correct (I estimate about 75% accuracy), but the translation can be improved by training for more epochs with better computational resources**
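
One caveat when reading accuracy numbers for this model: the sequences are padded to length 55 and the model is compiled with the plain `accuracy` metric without masking, so correctly predicted padding positions count as hits. The sketch below contrasts plain token accuracy with a version that ignores padding. `token_accuracy` is a hypothetical helper and the predictions are made-up toy values for illustration, not results from this model.

```python
def token_accuracy(y_true, y_pred, pad_id=None):
    """Share of positions where prediction equals target.

    If pad_id is given, positions whose target is pad_id are ignored.
    """
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if pad_id is not None and t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Toy example: pre-padded targets and made-up predictions.
y_true = [[0, 0, 28, 11391], [0, 0, 0, 4241]]
y_pred = [[0, 0, 28, 5], [0, 0, 0, 4241]]
plain = token_accuracy(y_true, y_pred)             # 7 of 8 positions match
masked = token_accuracy(y_true, y_pred, pad_id=0)  # 2 of 3 real tokens match
print(plain, masked)
```

With long padded sequences and short sentences, the gap between the two numbers can be much larger than in this toy case.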

# Pretrained Transformers

### Now we are going to use the pretrained Transformer T5ForConditionalGeneration for language translation

In [7]:
import transformers
import torch

In [17]:
print(transformers.__version__)
print(torch.__version__)

4.33.0
2.0.1+cpu


### Importing the T5ForConditionalGeneration and T5Tokenizer classes from the Transformers library. These classes are associated with T5 (Text-to-Text Transfer Transformer), which is a pre-trained model for various natural language processing tasks

In [3]:
from transformers import T5ForConditionalGeneration,T5Tokenizer

### t5-base: the base version of the T5 model, a versatile model for various text generation tasks. There are other versions such as t5-small, t5-large, t5-3b, and t5-11b.

**Here we use the t5-base model, which is suitable for text generation tasks and requires less computation than the larger variants**

In [8]:
model_name='t5-base'
model=T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer=T5Tokenizer.from_pretrained(model_name)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [13]:
def Eng_French_translation(english_sentences):
    input_text='translate English to French: '+english_sentences#T5 expects a task prefix; 'translate English to French: ' tells the model the desired translation direction
    input_id=tokenizer.encode(input_text,return_tensors='pt') #'pt' specifies that the output should be PyTorch tensors
    with torch.no_grad():#disabling gradient calculation since we're not training the model
        outputs=model.generate(inputs=input_id)
        
    french_translation=tokenizer.decode(outputs[0],skip_special_tokens=True) #outputs[0] selects the first (here the only) generated sequence
    #skip_special_tokens=True drops special tokens such as padding and end-of-sequence markers,
    #which are not part of the human-readable text
    return french_translation
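
T5 chooses its task from a text prefix, and the original T5 checkpoints were trained with translation prefixes of the form `translate English to <language>: ` for German, French, and Romanian. The helper below sketches how such a prompt can be built for any of the three. `build_t5_prompt` and `SUPPORTED_TARGETS` are hypothetical names introduced here for illustration; they are not part of the Transformers library.

```python
# Target languages the original T5 checkpoints were trained to translate into.
SUPPORTED_TARGETS = {"German", "French", "Romanian"}

def build_t5_prompt(text, target_language="French"):
    """Prepend the T5 translation task prefix to an English sentence."""
    if target_language not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target language: {target_language}")
    return f"translate English to {target_language}: {text.strip()}"

prompt = build_t5_prompt("What is the role of Data Scientist")
print(prompt)  # translate English to French: What is the role of Data Scientist
```

The resulting string can be passed to `tokenizer.encode` in the same way as `input_text` in the function above.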

        

In [15]:
a='What is the role of Data Scientist'
b=Eng_French_translation(a)
print("French:",b)



French: Quel est le rôle du chercheur en données?


### Conclusion:

**"TRANSFORMERS-Multilingual Translation" from English to French combines a custom encoder-decoder architecture with the pretrained T5ForConditionalGeneration model. Approaches like these have the potential to break down language barriers, enabling effective communication and information exchange across linguistic boundaries. They offer valuable applications in industries such as e-commerce, content localization, and global communication.**