<a href="https://colab.research.google.com/github/Dagobert42/MyRNNsearch/blob/main/experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This project implements a Natural Language Processing (NLP) system that translates a given piece of text from one language into another. It is part of the examination for the lecture "Deep Learning for Natural Language Processing" of the M. Sc. Cognitive Systems at the University of Potsdam.

It is based on the paper:


> Bahdanau, Cho & Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.


The task is to implement the RNNsearch-50 system, i.e., the encoder-decoder with attention system for any language pair different from English-German, German-English, English-French, and French-English.

The experiments were conducted in the form of a Jupyter Notebook to provide commentary on the underlying thought process and in order to have GPU access through Google Colaboratory. However, this notebook is non-exhaustive in its documentation. An in-depth walk-through of the project is given in the accompanying report which can be found here: **add link**

# Setup

If you access the project through the accompanying Jupyter Notebook, it first clones the Python files from the research repository and installs the required packages. These dependencies include TensorFlow Datasets and the spaCy pipelines.

In [None]:
!git clone -l -s https://github.com/Dagobert42/MyRNNsearch.git temp
%cd temp
!ls

Cloning into 'temp'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 8 (delta 0), reused 8 (delta 0), pack-reused 0
Unpacking objects: 100% (8/8), done.
/content/temp
config.yml  model.py  README.md  run.py  tests.py  utils.py


In [None]:
from IPython.display import clear_output
import time

In [None]:
!pip install --upgrade tensorflow-datasets
time.sleep(5)
clear_output()

# Data

The chosen language pair is English-Danish. A dataset of matched sentence pairs for these languages can be obtained at runtime via a TensorFlow Datasets ParaCrawl configuration [cite]. We chose this dataset because it is large enough (2,414,895 examples, 362.46 MiB) for the model to learn in a meaningful way, yet small enough to be loaded into memory. Furthermore, pretrained spaCy pipelines exist for both languages, providing tokenizers as well as sets of 30,000 ready-to-use word vectors per language. We disable all other components of the spaCy pipelines during loading, as they are not relevant to the project.

The pre-existing ParaCrawl configurations also cover various other target languages for translation from English (insert link). In principle, the experiments can be run with any of them, although some of the datasets are very large and may not be loadable at runtime. Also keep in mind that a spaCy pipeline may not exist for the desired target language.
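As a rough sketch of how such a configuration might be fetched at runtime, the snippet below builds a config name and hands it to TensorFlow Datasets. The helper names `paracrawl_config` and `load_paracrawl_builder` are hypothetical and are not the project's own `get_dataset_builder`. The naming scheme `para_crawl/en<xx>` is an assumption based on the TFDS catalog and should be checked against the installed version.

```python
def paracrawl_config(target_language: str) -> str:
    """Build the assumed TFDS config name for an English-to-X ParaCrawl pair."""
    code = target_language.strip().lower()
    if len(code) != 2 or not code.isalpha():
        raise ValueError(f"expected a two-letter language code, got {target_language!r}")
    return f"para_crawl/en{code}"


def load_paracrawl_builder(target_language: str = "da"):
    """Return a prepared TFDS builder (hypothetical stand-in for get_dataset_builder)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    builder = tfds.builder(paracrawl_config(target_language))
    builder.download_and_prepare()  # downloads the corpus on first use
    return builder


print(paracrawl_config("da"))
```

Switching to another target language would then only require a different two-letter code, provided a matching configuration exists.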

In [None]:
!pip install --upgrade spacy
!python -m spacy download en_core_web_md
!python -m spacy download da_core_news_md
time.sleep(5)
clear_output()

In [None]:
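from data import get_dataset_builder, get_data_splits

# get the dataset builder for English-Danish
builder = get_dataset_builder(target_language='da')
# download the data (if necessary) and prepare it for reading
builder.download_and_prepare()
train, test, val = get_data_splits(builder)
# preview the first ten training examples (assumes the split supports slicing)
train[:10]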

Next, we load the spaCy pipelines. spaCy gives us an alternative way to obtain tokens for each language. Using our handmade vocabulary as the baseline, we can then observe any difference in performance that the spaCy tokens bring.
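To make the baseline concrete, here is a minimal sketch of what a handmade vocabulary could look like. It assumes lowercased whitespace tokenization, a frequency-capped word list, and the special tokens `<pad>` and `<unk>`. All of these are illustrative assumptions, as are the names `build_vocab` and `encode`; the project's actual vocabulary code may differ.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentences, max_size=30000):
    """Map the most frequent whitespace tokens to integer ids."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # Reserve ids 0 and 1 for the special tokens.
    itos = [PAD, UNK] + [tok for tok, _ in counts.most_common(max_size - 2)]
    return {tok: i for i, tok in enumerate(itos)}


def encode(sentence, vocab):
    """Convert a sentence to ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence.lower().split()]


vocab = build_vocab(["the cat sat", "the dog sat down"])
print(encode("the bird sat", vocab))
```

A spaCy-based variant would replace the `split()` calls with the pipeline's tokenizer and leave the rest unchanged, which keeps the comparison between the two approaches focused on tokenization.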

In [None]:
import spacy
from spacy.lang.en.examples import sentences as e
from spacy.lang.da.examples import sentences as d

# the English pipeline installed above is en_core_web_md
english_nlp = spacy.load("en_core_web_md", exclude=["tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])
danish_nlp = spacy.load("da_core_news_md", exclude=["morphologizer", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])

# Model

# Training

# Evaluation