# Curate Data

### Raw data

At this point, you have already run `make pull-data`. This should have downloaded and preprocessed the data into individual files. Your file structure should look like this:

```
data/
├── processed
│   ├── countries.txt
│   ├── english.txt
│   ├── french.txt
│   └── names.txt
└── raw
    ├── data.zip
    ├── eng-fra
    │   └── eng-fra.txt
    └── names
        ├── Arabic.txt
        ├── Chinese.txt
        ├── Czech.txt
        ├── Dutch.txt
        ├── English.txt
        ├── French.txt
        ├── German.txt
        ├── Greek.txt
        ├── Irish.txt
        ├── Italian.txt
        ├── Japanese.txt
        ├── Korean.txt
        ├── Polish.txt
        ├── Portuguese.txt
        ├── Russian.txt
        ├── Scottish.txt
        ├── Spanish.txt
        └── Vietnamese.txt
```

The 'names' and 'countries' data correspond to this PyTorch tutorial: [https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html](https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html)


The 'english' and 'french' data correspond to this one: [https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)


In this notebook, we will use our `Language` utility class to scan these corpora and encode the words/sentences as indices of tokens in a vocabulary.


### Processing Strategy

1. Take a text file where each line is an observation of a sentence in a given language.

2. Split each sentence into tokens. This could be a basic tokenization, such as splitting on spaces into component words, or a more sophisticated one, such as one of the spaCy models.

3. Scan over all the tokens in the file and collect a vocabulary.

4. If necessary, add special tokens to the vocabulary for padding, unknown token, start/end of sentence.

5. Write a new file where each line gets tokenized and converted into the indices of those tokens in the vocabulary.

6. Save the language model so we can load it and access its vocab/tokenizer later.


For example, we have an input file like this:

```
names.txt

Adam
Ahearn
Aodh
Aodha
```

and we create a language model that tokenizes it by splitting the name into component letters:

```
names_indices.txt

2 23 18 4 19 3
2 23 10 7 4 9 8 3
2 23 5 18 10 3
2 23 5 18 10 4 3
```

Here special tokens have been added, so that the name always begins with token 2 (`<SOS>`) and ends with token 3 (`<EOS>`).
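
The character-level encoding above can be sketched in a few lines of plain Python. This is a minimal illustration, not the `Language` class itself: `build_vocab` and `encode` are hypothetical helpers, and ordering the vocabulary by descending frequency is an assumption. Because the vocabulary here is built from only the four names shown, the letter indices will differ from those in `names_indices.txt`.

```python
from collections import Counter

# Special tokens take the first four indices, so <SOS> is 2 and <EOS> is 3.
SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]


def build_vocab(lines):
    """Hypothetical helper: map each character to an index, most frequent first."""
    counts = Counter(ch for line in lines for ch in line)
    tokens = SPECIAL_TOKENS + [ch for ch, _ in counts.most_common()]
    return {token: index for index, token in enumerate(tokens)}


def encode(line, vocab):
    """Hypothetical helper: wrap the characters in <SOS> ... <EOS> and look up indices."""
    tokens = ["<SOS>"] + list(line) + ["<EOS>"]
    return [vocab.get(token, vocab["<UNK>"]) for token in tokens]


names = ["Adam", "Ahearn", "Aodh", "Aodha"]
vocab = build_vocab(names)
encoded = [encode(name, vocab) for name in names]
for name, indices in zip(names, encoded):
    print(name, " ".join(str(i) for i in indices))
```

Every encoded line starts with 2 and ends with 3, whatever indices the letters in between receive.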

## Libraries

In [1]:
import os
import yaml

from common.language import Language

from common.utils import get_logger

logger = get_logger("curation")

## Parameters

In [2]:
# Set the cwd to the root of the project.
# Only let this execute once
if os.getcwd().endswith("src"):
    os.chdir("..")

logger.info(f"Current working directory: {os.getcwd()}")

2024-10-15 14:48:46 - curation - INFO: Current working directory: /home/rob/encoder-decoder


In [3]:
# Load config.yaml. This contains all of our paths and constants.
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

## Split names into component letters

In the RNN tutorial, the input sequence is each name letter by letter, and the target output is the country that the name originates from. 

Split the names up by letter and encode them. Treat the country name as a whole word and encode it.

In our second exercise, we'll train an encoder-decoder to spell out the country name, so we want a version of this that has the letters encoded individually.

In [4]:
# We have two types of tokenizer here:
#   split_into_chars: split the string into individual characters (convert to list)
#   keep_whole_sentence: keep the string as a single token (wrap in list)
names = Language(
    name="names",
    tokenizer_name="split_into_chars", 
    detokenizer_name="join_no_space",
    )

# Since this is a classification task, we don't need to add <SOS> or <EOS>
# tokens. Keep whole words.
countries_word = Language(
    name="countries_word",
    tokenizer_name="keep_whole_sentence",
    detokenizer_name="join_no_space",
    add_special_tokens=False,
)

# Split into individual characters
countries_letters = Language(
    name="countries_letters",
    tokenizer_name="split_into_chars",
    detokenizer_name="join_no_space",
)
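
To make the tokenizer names concrete, here is a rough sketch of what they might do. The names are the ones passed to `Language` above, but the bodies are assumptions inferred from the comments; the real implementations live in `common.language` and may differ.

```python
# Hypothetical stand-ins for the tokenizer/detokenizer names used above.
TOKENIZERS = {
    "split_into_chars": lambda text: list(text),   # "Czech" -> one token per letter
    "keep_whole_sentence": lambda text: [text],    # "Czech" -> a single token
}

DETOKENIZERS = {
    "join_no_space": lambda tokens: "".join(tokens),
    "join_with_space": lambda tokens: " ".join(tokens),
}

chars = TOKENIZERS["split_into_chars"]("Czech")
whole = TOKENIZERS["keep_whole_sentence"]("Czech")

print(chars)                                  # ['C', 'z', 'e', 'c', 'h']
print(whole)                                  # ['Czech']
print(DETOKENIZERS["join_no_space"](chars))   # Czech
```

Under these assumptions, `join_no_space` inverts both tokenizers, which would explain why the same detokenizer is used for all three languages in this section.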

In [5]:
names.scan_corpus(config["NAMES_INPUT_PATH"])
countries_word.scan_corpus(config["COUNTRIES_INPUT_PATH"])
countries_letters.scan_corpus(config["COUNTRIES_INPUT_PATH"])

In [6]:
# Vocab lengths
print("Length of names vocab: ", len(names.vocabulary))
print("Length of countries vocab: ", len(countries_word.vocabulary))
print(
    "Length of countries as letters vocab: ", len(countries_letters.vocabulary)
)

print("\nExample elements: \n")
print(list(names.vocabulary)[0:10])
print(list(countries_word.vocabulary)[0:10])
print(list(countries_letters.vocabulary)[0:10])

Length of names vocab:  91
Length of countries vocab:  19
Length of countries as letters vocab:  35

Example elements: 

['<PAD>', '<UNK>', '<SOS>', '<EOS>', 'a', 'o', 'i', 'e', 'n', 'r']
['Russian', 'English', 'Arabic', 'Japanese', 'German', 'Italian', 'Czech', 'Spanish', 'Dutch', 'French']
['<PAD>', '<UNK>', '<SOS>', '<EOS>', 'i', 'n', 's', 'a', 'u', 'R']


Note that our tokenizers have not converted to lowercase first. We're leaving capital letters in.
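
A small sketch of what keeping case means for vocabulary size. The three names are made up for illustration and are not taken from the dataset.

```python
names = ["Adam", "Dana", "adam"]

# Case-sensitive: 'A' and 'a' (and 'D' and 'd') are separate vocabulary entries.
cased = sorted({ch for name in names for ch in name})

# Lowercasing first would merge them, giving a smaller vocabulary.
uncased = sorted({ch for name in names for ch in name.lower()})

print(len(cased), cased)      # 6 ['A', 'D', 'a', 'd', 'm', 'n']
print(len(uncased), uncased)  # 4 ['a', 'd', 'm', 'n']
```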

In [7]:
# Convert each line to a list of indices and save to a new file
names.convert_corpus_to_indices(
    config["NAMES_INPUT_PATH"],
    config["NAMES_OUTPUT_PATH"],
)

countries_word.convert_corpus_to_indices(
    config["COUNTRIES_INPUT_PATH"],
    config["COUNTRIES_WORD_OUTPUT_PATH"],
)

countries_letters.convert_corpus_to_indices(
    config["COUNTRIES_INPUT_PATH"],
    config["COUNTRIES_LETTER_OUTPUT_PATH"],
)
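
For intuition, the file-to-file conversion might look roughly like the sketch below. `convert_file_to_indices` is a hypothetical helper written for this illustration, not the `Language.convert_corpus_to_indices` method, and it assumes character tokenization, a toy vocabulary, and one space-separated line of indices per input line.

```python
import os
import tempfile


def convert_file_to_indices(input_path, output_path, vocab):
    """Hypothetical helper: encode each line by character, write space-separated indices."""
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = ["<SOS>"] + list(line.rstrip("\n")) + ["<EOS>"]
            indices = [vocab.get(token, vocab["<UNK>"]) for token in tokens]
            dst.write(" ".join(map(str, indices)) + "\n")


# Toy vocabulary, for illustration only.
vocab = {"<PAD>": 0, "<UNK>": 1, "<SOS>": 2, "<EOS>": 3, "A": 4, "d": 5, "a": 6, "m": 7}

with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, "names.txt")
    output_path = os.path.join(tmp, "names_indices.txt")
    with open(input_path, "w", encoding="utf-8") as f:
        f.write("Adam\nAda\n")
    convert_file_to_indices(input_path, output_path, vocab)
    with open(output_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```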

In [8]:
# Save the language models
names.save(config["NAMES_LANGUAGE_MODEL_PATH"])

countries_word.save(config["COUNTRIES_WORD_LANGUAGE_MODEL_PATH"])

countries_letters.save(config["COUNTRIES_LETTER_LANGUAGE_MODEL_PATH"])
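
Saving matters because the token order defines the indices: a later notebook must reload exactly the same vocabulary to decode the index files. The on-disk format used by `Language.save` is not shown here, so the sketch below simply assumes JSON and a made-up set of fields.

```python
import json
import os
import tempfile

# Toy state; the fields a real Language object stores are an assumption.
state = {
    "name": "names",
    "tokenizer_name": "split_into_chars",
    "vocabulary": ["<PAD>", "<UNK>", "<SOS>", "<EOS>", "a", "o", "i"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names_language.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

# Rebuild the token -> index lookup from the saved order.
token_to_index = {token: i for i, token in enumerate(restored["vocabulary"])}
print(token_to_index["a"])  # 4
```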

## English and French

In [9]:
# English and French use spaCy-based tokenizers, passed explicitly by name
english = Language(
    name="english",
    tokenizer_name="spacy_english",
    detokenizer_name="join_with_space",
)

french = Language(
    name="french",
    tokenizer_name="spacy_french",
    detokenizer_name="join_with_space",
)

In [10]:
# Scan to learn vocabulary. We're going to limit the vocab size to 7500 tokens.
english.scan_corpus(
    config["ENGLISH_INPUT_PATH"],
    max_vocab_size=7500,
)

french.scan_corpus(
    config["FRENCH_INPUT_PATH"],
    max_vocab_size=7500,
)

# Save models
english.save(config["ENGLISH_LANGUAGE_MODEL_PATH"])
french.save(config["FRENCH_LANGUAGE_MODEL_PATH"])
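
One plausible way to cap a vocabulary is to count token frequencies and keep only the most common ones. The sketch below is an assumption about how `max_vocab_size` could work, not a description of `scan_corpus`; in particular, whether the special tokens count towards the cap is a guess, and `build_capped_vocab` is a hypothetical helper.

```python
from collections import Counter

SPECIAL_TOKENS = ["<PAD>", "<UNK>", "<SOS>", "<EOS>"]


def build_capped_vocab(sentences, max_vocab_size):
    """Hypothetical helper: keep the most frequent tokens; specials count towards the cap."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    n_regular = max(max_vocab_size - len(SPECIAL_TOKENS), 0)
    kept = [token for token, _ in counts.most_common(n_regular)]
    return SPECIAL_TOKENS + kept


sentences = ["I am here .", "I am cold .", "you are here ."]
vocab = build_capped_vocab(sentences, max_vocab_size=7)
print(vocab)
```

Rare tokens such as `cold` fall outside the cap and would be encoded as `<UNK>`.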

In [11]:
# Top vocab
print(list(english.vocabulary)[0:10])
print(list(french.vocabulary)[0:10])

['<PAD>', '<UNK>', '<SOS>', '<EOS>', '.', 'I', 'you', 'to', '?', 'the']
['<PAD>', '<UNK>', '<SOS>', '<EOS>', '.', 'Je', 'de', '?', 'pas', 'est']


In [12]:
# Give example encodings

tokens = english.tokenizer("Hello there, skibidi yeet")
# Add our special tokens
tokens = ["<SOS>"] + tokens + ["<EOS>"]
print(f"Tokens: {tokens}")

indices = english.token_to_index(tokens)
print(f"Indices: {indices}")

# Inverse operation
tokens = english.index_to_token(indices)
print(f"Tokens: {tokens}")

Tokens: ['<SOS>', 'Hello', 'there', ',', 'skibidi', 'yeet', '<EOS>']
Indices: [2, 3861, 85, 24, 1, 1, 3]
Tokens: ['<SOS>', 'Hello', 'there', ',', '<UNK>', '<UNK>', '<EOS>']
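
The round trip above is lossy: any token outside the vocabulary maps to index 1 and decodes back as `<UNK>`. A minimal sketch of that lookup, using a toy dictionary that holds only the indices printed above (the dictionary-based lookup itself is an assumption about how `token_to_index` behaves):

```python
UNK_INDEX = 1

# Toy vocabulary; the indices for real words are taken from the output above.
index_to_token = {1: "<UNK>", 2: "<SOS>", 3: "<EOS>", 24: ",", 85: "there", 3861: "Hello"}
token_to_index = {token: index for index, token in index_to_token.items()}

tokens = ["<SOS>", "Hello", "there", ",", "skibidi", "yeet", "<EOS>"]

# Unknown tokens fall back to the <UNK> index.
indices = [token_to_index.get(token, UNK_INDEX) for token in tokens]

# Decoding cannot recover them: both come back as <UNK>.
decoded = [index_to_token[index] for index in indices]

print(indices)
print(decoded)
```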


The next cell will take a hot minute (about 5 minutes).

In [13]:
# Note that in conversion, the special tokens are added to front and back
english.convert_corpus_to_indices(
    config["ENGLISH_INPUT_PATH"],
    config["ENGLISH_OUTPUT_PATH"],
)
logger.info("Converted English corpus to indices")

french.convert_corpus_to_indices(
    config["FRENCH_INPUT_PATH"],
    config["FRENCH_OUTPUT_PATH"],
)
logger.info("Converted French corpus to indices")

2024-10-15 14:50:08 - curation - INFO: Converted English corpus to indices
2024-10-15 14:51:34 - curation - INFO: Converted French corpus to indices
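
As a sketch of how the index files might be consumed later, the snippet below parses lines of space-separated indices and right-pads them to a common length with the `<PAD>` index (0 in the vocabularies printed above). `parse_indices` and `pad_batch` are hypothetical helpers, and right-padding is an assumption; the sample lines are the ones from the `names_indices.txt` example.

```python
PAD_INDEX = 0  # index of <PAD> in the vocabularies printed above


def parse_indices(lines):
    """Turn lines such as '2 23 18 4 19 3' into lists of ints, skipping blank lines."""
    return [[int(value) for value in line.split()] for line in lines if line.strip()]


def pad_batch(sequences, pad_index=PAD_INDEX):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]


lines = ["2 23 18 4 19 3", "2 23 10 7 4 9 8 3", "2 23 5 18 10 3"]
batch = pad_batch(parse_indices(lines))
for row in batch:
    print(row)
```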
