# Project Overview: Data Preprocessing for Sequence-to-Sequence Language Translation Model

In this project, I aim to prepare a dataset for training a **Sequence-to-Sequence (Seq2Seq) neural network** for language translation tasks. Before training the model, I need to preprocess the text data to create structured inputs and outputs suitable for a machine learning model. This preprocessing step includes tokenising the text, building vocabulary sets, and encoding sentences into numerical formats that can be fed into the neural network.

The dataset I’m using contains pairs of sentences in two languages, separated by tabs, with each pair representing a translation from one language to another. My task is to process this data and convert it into a format that the Seq2Seq model can work with efficiently.
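
As a small illustration of the file format described above, the sketch below splits one tab-separated line into its two sentences. The sample line and the `parse_pair` helper are hypothetical and not part of the notebook; the third field is there only to show that extra columns can be ignored.

```python
def parse_pair(line):
    """Return (source, target) from a tab-separated line, or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None
    return fields[0], fields[1]

# Hypothetical sample line: source sentence, target sentence, extra column.
sample = "Go.\tVe.\textra column"
pair = parse_pair(sample)
print(pair)
```

Returning `None` for lines without a tab makes it easy to skip blank lines, such as the empty string left over when a file ends with a newline.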

## Project Objectives
The objective of this project is to:
- Read and split the text dataset into input and target sentences.
- Build vocabulary sets for both the input and target sentences.
- Tokenise the sentences and add special tokens such as `<START>` and `<END>` to the target sentences.
- Convert each token into a unique numerical representation.
- Structure the processed data into three main components: encoder inputs, decoder inputs, and decoder targets, to be used in training the Seq2Seq model.
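
To make the last objective concrete, here is a minimal token-level sketch of how the three components relate for a single sentence pair. The pair and the variable names are illustrative assumptions; the notebook stores the same information as one-hot arrays rather than lists.

```python
# Hypothetical, already-tokenised sentence pair.
source_tokens = ["Go", "."]
target_tokens = ["<START>", "Ve", ".", "<END>"]

# The encoder reads the source sentence as-is.
encoder_in = source_tokens
# The decoder input is the target sentence, beginning with <START>.
decoder_in = target_tokens
# The decoder target is the same sequence one step ahead, ending with <END>.
decoder_out = target_tokens[1:]

for given, expected in zip(decoder_in, decoder_out):
    print(f"given {given!r:10} predict {expected!r}")
```

At every position the decoder is shown the current token and trained to produce the next one, which is why the targets drop `<START>`.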

## Code Breakdown and Explanation

### 1. Importing Libraries and Reading the Dataset
First, I import the necessary libraries and set the path to the dataset file. I read the file, split it by lines, and prepare to process each line separately.

```python
import numpy as np
import re

# Path to the tab-separated translation file,
# for example "spa.txt" or "spa-eng/spa.txt"
data_path = "spa.txt"

# Read the whole file and split it into a list of lines
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')
```

### 2. Initialising Variables and Preparing Lists
I create empty lists to store the input and target sentences. Additionally, I initialise two sets to hold the unique tokens from both the input and target sentences.

```python
input_docs = []
target_docs = []
input_tokens = set()
target_tokens = set()
```

### 3. Processing Each Line from the Dataset
Here, I process each line up to a defined limit (6000 lines in this example) to avoid long preprocessing times. I split each line into input and target sentences using a tab separator, skipping any line that does not contain one.

I separate words from punctuation in each target sentence, then add a `<START>` token at the beginning and an `<END>` token at the end. I then append the sentences to the corresponding lists.

```python
for line in lines[:6000]:
  # Skip blank or malformed lines that have no tab separator
  if '\t' not in line:
    continue
  # Any columns beyond the first two are ignored
  input_doc, target_doc = line.split('\t')[:2]
  input_docs.append(input_doc)

  # Separate words from punctuation, then wrap the target in boundary tokens
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  target_doc = '<START> ' + target_doc + ' <END>'
  target_docs.append(target_doc)
```
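
The regular expression used for tokenising has two alternatives: `[\w']+` matches a run of word characters and apostrophes, and `[^\s\w]` matches any single character that is neither whitespace nor a word character. The sketch below applies it to two made-up sentences; the names `TOKEN_PATTERN` and `tokenise` are illustrative and do not appear in the notebook.

```python
import re

TOKEN_PATTERN = r"[\w']+|[^\s\w]"

def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(TOKEN_PATTERN, sentence)

print(tokenise("Don't go, please!"))
print(tokenise("¿Qué pasa?"))

# The pattern would break a boundary token into three pieces,
# so <START> and <END> are attached only after tokenising.
print(tokenise("<START>"))
wrapped = "<START> " + " ".join(tokenise("¿Qué pasa?")) + " <END>"
print(wrapped.split())
```

Because the wrapped target sentence is space-separated, a plain `split()` recovers its tokens with the boundary markers intact.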

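### 4. Tokenising Sentences and Building Vocabulary Sets
I then go through the stored sentence pairs and split each sentence into tokens. Input sentences are tokenised with the same regular expression that was applied to the target sentences; target sentences are already space-separated, so `split()` is enough. Every token is added to the corresponding vocabulary set, which discards duplicates automatically.

```python
for input_doc, target_doc in zip(input_docs, target_docs):
  # Words and punctuation marks of the input sentence
  for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
    input_tokens.add(token)
  # Target tokens, including <START> and <END>
  for token in target_doc.split():
    target_tokens.add(token)
```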
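
The vocabulary step can be tried on a miniature corpus. This is a sketch under assumed data: the three sentence pairs and the `toy_` names are hypothetical, chosen only to mirror the shape of `input_docs` and `target_docs`.

```python
import re

TOKEN_PATTERN = r"[\w']+|[^\s\w]"

# Hypothetical miniature corpus; targets are already wrapped and space-separated.
toy_input_docs = ["Go.", "Go on.", "Hi."]
toy_target_docs = [
    "<START> Ve . <END>",
    "<START> Sigue . <END>",
    "<START> Hola . <END>",
]

toy_input_tokens = set()
toy_target_tokens = set()
for input_doc, target_doc in zip(toy_input_docs, toy_target_docs):
    # set.update adds every token and silently ignores repeats.
    toy_input_tokens.update(re.findall(TOKEN_PATTERN, input_doc))
    toy_target_tokens.update(target_doc.split())

print(sorted(toy_input_tokens))
print(sorted(toy_target_tokens))
```

Every token that is later looked up in a token-to-index dictionary has to pass through this step for every sentence; a token missing from the vocabulary raises a `KeyError` at encoding time.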

### 5. Creating Sorted Lists of Tokens and Defining Features
I sort the tokens so that every run assigns the same indices, count the unique tokens in each vocabulary, and record the length of the longest input and target sequences. I then define dictionaries that map each token to a unique index, together with reverse mappings for decoding.

```python
input_tokens = sorted(input_tokens)
target_tokens = sorted(target_tokens)

num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

# Longest sequences, counted with the same tokenisation used for encoding
max_encoder_seq_length = max(len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs)
# Target sentences are already tokenised; the regex would split <START> and <END> apart
max_decoder_seq_length = max(len(target_doc.split()) for target_doc in target_docs)

# Token -> index lookups
input_features_dict = {token: i for i, token in enumerate(input_tokens)}
target_features_dict = {token: i for i, token in enumerate(target_tokens)}

# Index -> token lookups, used to turn predictions back into text
reverse_input_features_dict = {i: token for token, i in input_features_dict.items()}
reverse_target_features_dict = {i: token for token, i in target_features_dict.items()}
```
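
A quick way to see how the forward and reverse mappings fit together is a round trip on a tiny vocabulary. The vocabulary and the names `token_to_index` and `index_to_token` below are illustrative stand-ins for the notebook's dictionaries.

```python
# Hypothetical five-token target vocabulary.
toy_tokens = sorted({"<START>", "<END>", "Ve", ".", "Hola"})

token_to_index = {token: i for i, token in enumerate(toy_tokens)}
index_to_token = {i: token for token, i in token_to_index.items()}

sentence = "<START> Hola . <END>".split()
encoded = [token_to_index[token] for token in sentence]
decoded = [index_to_token[i] for i in encoded]

print(token_to_index)
print(encoded)
print(decoded)
```

Sorting before enumerating matters because a set of strings has no guaranteed iteration order across Python runs, so unsorted indices could differ from one session to the next.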

### 6. Preparing Data for the Neural Network
I initialise three three-dimensional arrays of zeros to store the encoder inputs, the decoder inputs, and the decoder targets. Each array has the shape (number of sentences, maximum sequence length, vocabulary size), so that every timestep of every sentence can hold a one-hot vector.

```python
encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
```
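
Dense one-hot arrays grow with the product of all three dimensions, which is one reason to cap the number of lines processed. The helper below is a hypothetical sketch for estimating the footprint; the sizes passed to it are made-up round numbers, not measurements from this dataset.

```python
import numpy as np

def one_hot_bytes(num_sentences, max_seq_length, num_tokens, dtype="float32"):
    """Bytes needed for a dense (sentences, timesteps, vocabulary) array."""
    return num_sentences * max_seq_length * num_tokens * np.dtype(dtype).itemsize

# Hypothetical sizes; the real values depend on the dataset.
n_bytes = one_hot_bytes(num_sentences=6000, max_seq_length=10, num_tokens=2000)
print(f"{n_bytes / 1e6:.0f} MB")

# The estimate agrees with what NumPy allocates for a small array.
small = np.zeros((3, 4, 5), dtype="float32")
assert small.nbytes == one_hot_bytes(3, 4, 5)
```

With `float32`, each cell costs four bytes, so doubling either the sentence count or the vocabulary size doubles the memory needed per array.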

### 7. Encoding Sentences into Numerical Data
For each sentence pair, I one-hot encode the tokens: the dictionaries defined earlier give each token's index, and I set the entry at (sentence, timestep, token index) to 1. The encoder inputs and decoder inputs are filled directly. The decoder targets hold the same target tokens shifted one timestep earlier, so that at each step the decoder learns to predict the next token.

```python
for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):

  for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
    encoder_input_data[line, timestep, input_features_dict[token]] = 1.

  for timestep, token in enumerate(target_doc.split()):
    decoder_input_data[line, timestep, target_features_dict[token]] = 1.
    if timestep > 0:
      # Targets are one timestep ahead of the inputs and omit <START>
      decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.
```

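
The relationship between decoder inputs and decoder targets can be checked on a toy example. This sketch assumes two made-up target sentences and uses shortened, hypothetical names (`dec_in`, `dec_out`) in place of the notebook's arrays.

```python
import numpy as np

toy_target_docs = ["<START> Ve . <END>", "<START> Hola <END>"]
vocab = sorted({tok for doc in toy_target_docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}
max_len = max(len(doc.split()) for doc in toy_target_docs)

dec_in = np.zeros((len(toy_target_docs), max_len, len(vocab)), dtype="float32")
dec_out = np.zeros_like(dec_in)
for row, doc in enumerate(toy_target_docs):
    for t, tok in enumerate(doc.split()):
        dec_in[row, t, index[tok]] = 1.0
        if t > 0:
            dec_out[row, t - 1, index[tok]] = 1.0

# The target equals the input shifted one timestep to the left.
assert np.array_equal(dec_out[:, :-1], dec_in[:, 1:])
# Padding stays all-zero: the second sentence has 3 tokens but max_len is 4.
assert dec_in[1, 3].sum() == 0
print(dec_in.shape)
```

Shorter sentences are implicitly padded with all-zero timesteps, since the arrays start as zeros and only real tokens are written.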

### 8. Displaying Token Mappings
Finally, I print the first 50 input tokens, the target token stored at index 50, and the size of the input vocabulary as a quick check that the mappings look right.

```python
print(list(input_features_dict.keys())[:50], reverse_target_features_dict[50])
print(len(input_tokens))
```

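
A stronger check than printing a few entries is to decode an encoded sentence back into tokens and compare it with the original. The `decode_one_hot` helper and the toy vocabulary below are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def decode_one_hot(matrix, index_to_token):
    """Map each non-empty row of a (timesteps, vocabulary) matrix to its token."""
    tokens = []
    for row in matrix:
        if row.sum() == 0:  # all-zero rows are padding
            continue
        tokens.append(index_to_token[int(np.argmax(row))])
    return tokens

# Hypothetical vocabulary and a sentence encoded with one padding timestep.
vocab = sorted({"<START>", "<END>", "Ve", "."})
index_to_token = dict(enumerate(vocab))
sentence = "<START> Ve . <END>".split()

matrix = np.zeros((5, len(vocab)), dtype="float32")
for t, tok in enumerate(sentence):
    matrix[t, vocab.index(tok)] = 1.0

print(decode_one_hot(matrix, index_to_token))
```

If the decoded tokens match the original sentence, the forward mapping, the reverse mapping, and the encoding loop agree with each other.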

## Summary
In this notebook, I’ve carried out the data preprocessing step for a Seq2Seq model: reading the sentence pairs, tokenising them, building the vocabularies, and one-hot encoding every sentence. The result is three NumPy arrays (encoder inputs, decoder inputs, and decoder targets) that a neural network can consume directly. This preprocessed data forms the foundation for building a translation model in subsequent steps.