# Data wrangling<a id='2_Data_wrangling'></a>

## Introduction<a id='2.2_Introduction'></a>

In today's interconnected world, effective language translation plays a crucial role in breaking down communication barriers and enabling cross-cultural understanding. An automated translation system can facilitate seamless interactions, enhance information dissemination, and support users in understanding content in their preferred language.

For this project, I have chosen to fine-tune a pretrained model from https://huggingface.co/tasks/translation so that it produces accurate, contextually relevant translations from English to Russian. The goal is to make information more accessible across language barriers and to improve communication between users who speak different languages.

Using a pretrained model offers substantial advantages. It reduces computational cost and environmental impact, and it gives access to state-of-the-art models without training from scratch. The Transformers library provides thousands of pretrained models for a wide range of tasks. To adapt a pretrained model, you continue training it on a dataset specific to your task, a technique known as fine-tuning.

The dataset used for fine-tuning was sourced from Kaggle and consists of pairs of short English and Russian sentences.

## 1. Data Collection and Overview<a id='2.3_Imports'></a>

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from string import punctuation
import random
import string
import re

In [2]:
# Load the tab-separated sentence pairs (the file has no header row)
data = pd.read_csv('C:/Users/bayar/Downloads/Capstone3/rus.txt', delimiter='\t', header=None)

In [3]:
data.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Go. | Марш! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 1 | Go. | Иди. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 2 | Go. | Идите. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Hi. | Здравствуйте. | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
| 4 | Hi. | Привет! | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
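
As an optional variation on the loading step, the columns can be named at read time and the attribution column skipped. The sketch below is an assumption-laden illustration, not the code used in this notebook: the column names `english`, `russian`, and `attribution` are hypothetical, and the in-memory sample stands in for the real file. Passing `quoting=csv.QUOTE_NONE` makes pandas treat double quotes as ordinary characters, which may matter if sentences contain unbalanced quotes.

```python
import csv
import io

import pandas as pd

# Tiny in-memory stand-in for the tab-separated file (hypothetical sample rows).
sample = (
    "Go.\tИди.\tattribution A\n"
    'He said "hi.\tОн сказал "привет.\tattribution B\n'
    "Hi.\tПривет!\tattribution C\n"
)

pairs = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["english", "russian", "attribution"],  # assumed column names
    usecols=["english", "russian"],               # skip the attribution column
    quoting=csv.QUOTE_NONE,                       # treat quote characters as plain text
)

print(pairs.shape)
print(pairs.head())
```

Named columns make later cells easier to read than positional labels such as `data[0]` and `data[1]`, at the cost of diverging from the column labels used in the rest of this notebook.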


## 2. Data Definition and Cleaning<a id='2.6_Explore_The_Data'></a>

In [4]:
# Keep first two columns as the last column is not informative
data=data.iloc[:,:2]
data.head()

| | 0 | 1 |
|---|---|---|
| 0 | Go. | Марш! |
| 1 | Go. | Иди. |
| 2 | Go. | Идите. |
| 3 | Hi. | Здравствуйте. |
| 4 | Hi. | Привет! |


In [5]:
data.shape

(363386, 2)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363386 entries, 0 to 363385
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       363386 non-null  object
 1   1       363386 non-null  object
dtypes: object(2)
memory usage: 5.5+ MB


In [7]:
# converting every letter to lower case
data[0] = data[0].apply(lambda x: str(x).lower())
data[1] = data[1].apply(lambda x: str(x).lower())

In [8]:
# removing apostrophe from the sentences
data[0] = data[0].apply(lambda x: re.sub("'","",x))
data[1] = data[1].apply(lambda x: re.sub("'","",x))

In [9]:
# Remove all ASCII punctuation; string.punctuation does not cover
# non-ASCII marks such as « » or …
exclude = set(string.punctuation)
data[0] = data[0].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
data[1] = data[1].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [10]:
data.head()

| | 0 | 1 |
|---|---|---|
| 0 | go | марш |
| 1 | go | иди |
| 2 | go | идите |
| 3 | hi | здравствуйте |
| 4 | hi | привет |
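
The three cleaning cells above can be folded into a single helper. The sketch below is one possible version, not the code that produced the output shown: the function name `clean_sentence` is hypothetical, and it goes slightly further than the cells above by also dropping characters in any Unicode punctuation category. That extension is an assumption about what the Russian sentences might contain (for example guillemets), since `string.punctuation` lists only ASCII characters.

```python
import string
import unicodedata

ASCII_PUNCTUATION = set(string.punctuation)


def clean_sentence(text):
    """Lowercase a sentence and strip ASCII and Unicode punctuation."""
    kept = "".join(
        ch
        for ch in str(text).lower()
        # Drop ASCII punctuation/symbols and any Unicode "P*" category character.
        if ch not in ASCII_PUNCTUATION
        and not unicodedata.category(ch).startswith("P")
    )
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(kept.split())


print(clean_sentence("Don't go."))
print(clean_sentence("«Привет!» сказал Том…"))
```

Applied to the DataFrame, this would look like `data[0].apply(clean_sentence)`; whether the stricter Unicode filter changes anything depends on the characters actually present in the file.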


## 3. Tokenization<a id='2.6_Explore_The_Data'></a>

In natural language processing (NLP), tokenization is the process of splitting text into discrete units called tokens. Tokens are usually words, but they can also be phrases, subwords, or single characters, depending on the application. Tokenization underlies many NLP tasks, including language modeling, machine translation, and text classification. Once text is tokenized, each token can be mapped to a number, and the resulting numeric sequences serve as input to machine learning models.
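
To make the idea concrete, here is a minimal plain-Python sketch of word-level tokenization followed by integer encoding. It is an illustration under stated assumptions, not the Keras implementation: the helper names `build_word_index` and `encode` are hypothetical, words are split on whitespace only, and indices start at 1 with more frequent words receiving smaller numbers, leaving 0 free for padding.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda word: -counts[word])
    return {word: index for index, word in enumerate(ordered, start=1)}


def encode(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [
        [word_index[word] for word in text.split() if word in word_index]
        for text in texts
    ]


corpus = ["go", "hi", "go home", "tom went home"]
word_index = build_word_index(corpus)
sequences = encode(corpus, word_index)

print(word_index)  # {'go': 1, 'home': 2, 'hi': 3, 'tom': 4, 'went': 5}
print(sequences)   # [[1], [3], [1, 2], [4, 5, 2]]
```

Note that the sequences have different lengths, which is why a padding step is needed before they can be batched.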

In [11]:
from tensorflow.python.ops.numpy_ops import np_config
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [14]:
# Initialize one tokenizer per language
english_tokenizer = Tokenizer()
russian_tokenizer = Tokenizer()

# Fit tokenizers on text data
english_tokenizer.fit_on_texts(data[0])
russian_tokenizer.fit_on_texts(data[1])

In [20]:
word_index = english_tokenizer.word_index
print(f"The number of words in the English vocabulary: {len(word_index)}")

word_index_ru = russian_tokenizer.word_index
print(f"The number of words in the Russian vocabulary: {len(word_index_ru)}")

The number of words in the English vocabulary: 16697
The number of words in the Russian vocabulary: 54025


In [15]:
# Convert text sequences to integer sequences
english_sequences = english_tokenizer.texts_to_sequences(data[0])
russian_sequences = russian_tokenizer.texts_to_sequences(data[1])

## 4. Padding<a id='2.6_Explore_The_Data'></a>

Models generally require input sequences of a consistent length. When sequences vary in length, shorter ones are padded with a placeholder value (typically 0) and longer ones are truncated, so that every sequence ends up with the same length. Uniform input sizes allow sequences to be stacked into fixed-shape batches, which is a prerequisite for efficient training.
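
The following plain-Python sketch shows what post-padding and post-truncation do to a few toy sequences. The helper `pad_post` is hypothetical and only mirrors the behavior described above (zeros appended at the end, extra tokens cut from the end); it is not the library function used in the next cell.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to maxlen from the end, then pad at the end."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[:maxlen]                     # keep the first maxlen tokens
        filler = [value] * (maxlen - len(trimmed))       # zeros for the missing slots
        padded.append(trimmed + filler)
    return padded


example = [[1], [4, 5, 2], [7, 8, 9, 10, 11, 12]]
result = pad_post(example, maxlen=4)

print(result)  # [[1, 0, 0, 0], [4, 5, 2, 0], [7, 8, 9, 10]]
```

Padding at the end keeps the real tokens at the start of every row, and using 0 as the filler works because the word indices start at 1.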

In [18]:
# Pad sequences to a fixed length
max_sequence_length = 10  # Example length
padded_english_sequences = pad_sequences(english_sequences, maxlen=max_sequence_length, padding='post', truncating='post')
padded_russian_sequences = pad_sequences(russian_sequences, maxlen=max_sequence_length, padding='post', truncating='post')

In [19]:
print("Summary of Sequence Lengths:")
for i in range(10):
    print(f"Row {i + 1}: Sequence Lengths - Column 0: {len(padded_english_sequences[i])}, Column 1: {len(padded_russian_sequences[i])}")

Summary of Sequence Lengths:
Row 1: Sequence Lengths - Column 0: 10, Column 1: 10
Row 2: Sequence Lengths - Column 0: 10, Column 1: 10
Row 3: Sequence Lengths - Column 0: 10, Column 1: 10
Row 4: Sequence Lengths - Column 0: 10, Column 1: 10
Row 5: Sequence Lengths - Column 0: 10, Column 1: 10
Row 6: Sequence Lengths - Column 0: 10, Column 1: 10
Row 7: Sequence Lengths - Column 0: 10, Column 1: 10
Row 8: Sequence Lengths - Column 0: 10, Column 1: 10
Row 9: Sequence Lengths - Column 0: 10, Column 1: 10
Row 10: Sequence Lengths - Column 0: 10, Column 1: 10
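
Because every padded row has the chosen length by construction, a more informative check is how many of the original, unpadded sequences fit within that length. The sketch below is one hedged way to measure this and to pick a length from the data instead of a fixed example value; the helper names and the toy sequences are hypothetical, and on the real data the input would be `english_sequences` or `russian_sequences`.

```python
import math


def length_coverage(sequences, maxlen):
    """Fraction of sequences with at most maxlen tokens (not truncated)."""
    if not sequences:
        return 0.0
    return sum(len(seq) <= maxlen for seq in sequences) / len(sequences)


def smallest_maxlen(sequences, target=0.99):
    """Smallest length that leaves at least `target` of the sequences whole."""
    if not sequences:
        raise ValueError("sequences must not be empty")
    lengths = sorted(len(seq) for seq in sequences)
    # Position of the sequence sitting at the target quantile.
    position = max(0, math.ceil(target * len(lengths)) - 1)
    return lengths[position]


toy_sequences = [[1], [1, 2], [3, 4], [5, 6, 7], list(range(1, 13))]

print(length_coverage(toy_sequences, maxlen=3))
print(smallest_maxlen(toy_sequences, target=0.75))
```

A coverage close to 1.0 for the chosen length suggests that few sentences lose tokens to truncation, while a low value suggests the length may be worth increasing.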


## Summary<a id='2.6_Explore_The_Data'></a>

We first dropped the uninformative attribution column, converted all characters to lowercase, and removed punctuation. We then tokenized the text: a tokenizer splits each sentence into tokens according to fixed rules and maps each token to an integer. These integer sequences are later converted into tensors, which serve as inputs for the model. If the pretrained model requires additional inputs, the tokenizer paired with that model adds them.

The final step put all sentences into a consistent format. We used padding, which makes the sequences uniform in length by appending the padding value 0 to shorter sentences and truncating longer ones to 10 tokens.

With these preparations complete, the dataset is ready for use with a pretrained model.