# Introduction: Text Translator

The 
<a 
href="https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus/tree/master"> 
    original 
</a> 
dataset that was provided did not contain an English version. I am only fluent in English and Dutch, so I searched for an alternative and found the
<a 
href="https://opus.nlpl.eu/TED2020/en&nl/v1/TED2020"> 
    TED2020
</a> 
dataset. This is the dataset I will use.


## 1) Loading the Data

The dataset contains three important files:

| File | Contents |
|------|----------|
| english.txt | The English sentences. |
| dutch.txt | The Dutch translations. |
| alignments.xml | The mapping from each English sentence to its Dutch sentence. |


In [1]:
data_folder = './data'

### Loading the text files

In [2]:
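def load_text(file_path):
    """Read a UTF-8 text file and return its non-empty lines, stripped."""
    with open(file_path, "r", encoding="utf-8") as f:
        # Note: skipping blank lines shifts the position of every later line.
        return [line.strip() for line in f if line.strip()]
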
english_sentences = load_text(f"{data_folder}/english.txt")
dutch_sentences = load_text(f"{data_folder}/dutch.txt")
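
Because `load_text` drops blank lines, a list position only matches the original line number when the file has no empty lines. If the alignment indices are meant to refer to line numbers, a loader that keeps empty lines as empty strings would preserve those positions. The following is a minimal sketch under that assumption. `load_text_keep_positions` is a hypothetical helper and the demo file content is made up:

```python
import os
import tempfile

def load_text_keep_positions(file_path):
    """Read one sentence per line, keeping empty lines as '' so that
    list index i always corresponds to line i + 1 of the file."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]

# Tiny demonstration file whose second line is blank (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first\n\nthird\n")
    demo_lines = load_text_keep_positions(demo_path)

print(demo_lines)  # ['first', '', 'third']
```

With this variant, "third" stays at index 2, whereas a loader that skips blank lines would move it to index 1.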

### Loading the XML file

In [3]:
import xml.etree.ElementTree as ET

alignment_root = ET.parse(f"{data_folder}/alignments.xml").getroot()

In [4]:
import re

english_aligned = []
dutch_aligned = []

for link in alignment_root.findall(".//link"):
    try:
        # xtargets has the form "<english index>;<dutch index>"
        en_idx, nl_idx = link.attrib["xtargets"].split(";")

        # English index: a single integer, converted from 1-based to 0-based
        en_index = int(en_idx.strip()) - 1

        # Dutch index: take the leading integer and ignore anything after it
        nl_match = re.match(r'^\s*(\d+)', nl_idx.strip())
        if nl_match is None:
            # No Dutch index on this link, so there is nothing to pair
            continue
        nl_index = int(nl_match.group(1)) - 1
    except ValueError:
        # The dataset contains some anomalies, so these links are skipped.
        continue

    # Keep the pair only if both indices are within range
    if (0 <= en_index < len(english_sentences) and
            0 <= nl_index < len(dutch_sentences)):
        english_aligned.append(english_sentences[en_index])
        dutch_aligned.append(dutch_sentences[nl_index])
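
To make the parsing rule easier to follow, here is a small self-contained sketch that applies the same logic to a made-up alignment snippet. The XML below is an assumption for illustration only (the real `alignments.xml` may be laid out differently), and `parse_link` is a hypothetical helper that mirrors the loop above: the English side must be a single integer, while only the leading integer is taken from the Dutch side.

```python
import re
import xml.etree.ElementTree as ET

# Made-up alignment snippet; the real file's layout may differ.
SAMPLE_XML = """
<cesAlign>
  <linkGrp>
    <link xtargets="1;1" />
    <link xtargets="2;2 3" />
    <link xtargets="3;" />
    <link xtargets="4 5;4" />
  </linkGrp>
</cesAlign>
""".strip()

def parse_link(xtargets):
    """Return zero-based (en_index, nl_index), or None for links that
    the notebook's loop would skip."""
    en_part, sep, nl_part = xtargets.partition(";")
    nl_match = re.match(r"\s*(\d+)", nl_part)
    if not sep or nl_match is None:
        return None
    try:
        en_index = int(en_part.strip()) - 1
    except ValueError:
        return None
    return en_index, int(nl_match.group(1)) - 1

root = ET.fromstring(SAMPLE_XML)
pairs = [parse_link(link.attrib["xtargets"]) for link in root.findall(".//link")]
print(pairs)  # [(0, 0), (1, 1), None, None]
```

The second link keeps only the first Dutch index, so any further Dutch sentences on that link are dropped. The last two links are skipped entirely.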


In [5]:
import pandas as pd

dataset = pd.DataFrame({
    'English': english_aligned,
    'Dutch': dutch_aligned
})

print(f"The dataset contains: {len(dataset)} sentences.")
dataset.head()

The dataset contains: 298775 sentences.


|   | English | Dutch |
|---|---------|-------|
| 0 | Thank you so much, Chris. | Hartelijk bedankt, Chris, het is werkelijk een... |
| 1 | And it's truly a great honor to have the oppor... | de gelegenheid te hebben twee keer op dit podi... |
| 2 | I have been blown away by this conference, and... | Ik ben zeer onder de indruk van deze conferent... |
| 3 | And I say that sincerely, partly because (Mock... | Ik meen dit, ook omdat – (Nepzucht) – ik het n... |
| 4 | (Laughter) Put yourselves in my position. | (Gelach) Bekijk het eens vanuit mijn perspectief! |


The sentences may still contain punctuation and other special characters that can cause errors during translation, so each sentence is first lowercased and stripped of punctuation and extra whitespace.

In [6]:
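def preprocess_sentence(sentence):
    # Lowercase
    sentence = sentence.lower()
    # Remove punctuation and other special characters
    # (digits and underscores are kept, because \w matches them)
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Collapse repeated whitespace and trim the ends
    sentence = re.sub(r'\s+', ' ', sentence).strip()
    return sentence

# Assign the cleaned columns back to the DataFrame; writing to the rows
# yielded by iterrows() is not guaranteed to modify the DataFrame.
dataset['English'] = dataset['English'].apply(preprocess_sentence)
dataset['Dutch'] = dataset['Dutch'].apply(preprocess_sentence)

dataset.head(5)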

|   | English | Dutch |
|---|---------|-------|
| 0 | thank you so much chris | hartelijk bedankt chris het is werkelijk een eer |
| 1 | and its truly a great honor to have the opport... | de gelegenheid te hebben twee keer op dit podi... |
| 2 | i have been blown away by this conference and ... | ik ben zeer onder de indruk van deze conferent... |
| 3 | and i say that sincerely partly because mock s... | ik meen dit ook omdat nepzucht ik het nodig heb |
| 4 | laughter put yourselves in my position | gelach bekijk het eens vanuit mijn perspectief |
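
As a quick check of what the cleaning step keeps and removes, the sketch below runs a copy of `preprocess_sentence` on two sample strings. The first string is made up for illustration and the second is based on the untruncated part of a row shown earlier. Note that digits survive, since `\w` matches them, while apostrophes, dashes, and brackets are removed.

```python
import re

def preprocess_sentence(sentence):
    # Same steps as the notebook cell: lowercase, drop punctuation,
    # then collapse whitespace.
    sentence = sentence.lower()
    sentence = re.sub(r'[^\w\s]', '', sentence)
    sentence = re.sub(r'\s+', ' ', sentence).strip()
    return sentence

examples = [
    "It's 2020,  already!",                 # made-up sentence with digits
    "Ik meen dit, ook omdat – (Nepzucht)",  # based on a row shown above
]
cleaned = [preprocess_sentence(s) for s in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The two cleaned strings are `'its 2020 already'` and `'ik meen dit ook omdat nepzucht'`. Removing the apostrophe turns "it's" into "its", which is visible in the cleaned English column as well.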


In [7]:
dataset.to_excel(f"{data_folder}/data.xlsx", index=False, engine='openpyxl')