This notebook gathers, splits, and preprocesses the data we use to train our poem generator.

At the end, it generates two text files in the current directory:
* A concatenation of all the poems in the training set, named `all_poems.train.[lang].txt`
* A concatenation of all the poems in the validation set, named `all_poems.valid.[lang].txt`

If you are running this notebook outside the project it belongs to (*https://github.com/Poems-AI/AI.git*), set `run_as_standalone_nb = True`:

In [None]:
run_as_standalone_nb = False

In [None]:
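# Path and sys are used in this cell, so import them before the main import cell
from pathlib import Path
import sys

if run_as_standalone_nb:
    root_lib_path = Path('ai').resolve()
    if not root_lib_path.exists():
        # Clone into 'ai' so the folder name matches the path checked above
        !git clone https://github.com/Poems-AI/AI.git ai
    if str(root_lib_path) not in sys.path:
        sys.path.insert(0, str(root_lib_path))
else:
    import local_lib_import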

In [None]:
from poemsai.config import set_config_value
from poemsai.data import (ComposedPoemsReader, DataSource, get_data_sources, Lang, lang_to_str, 
                          merge_poems, PoemsWriter, ReaderFactory, SplitterFactory)
from pathlib import Path
import sys

Edit the cell below to choose the language of the generated .txt files:

In [None]:
lang = Lang.English

# Data collection

If you are running outside of Kaggle, point the `'KAGGLE_DS_ROOT'` config key to the root folder that contains the Kaggle datasets you are using.

In [None]:
set_config_value('KAGGLE_DS_ROOT', '/kaggle/input')
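
When the same notebook runs both on Kaggle and on a local machine, the root folder can be chosen at runtime. The following is a minimal sketch and not part of `poemsai`: the helper `pick_kaggle_ds_root` and the `./kaggle_datasets` folder are hypothetical, and the check simply assumes that `/kaggle/input` exists only when running on Kaggle.

```python
from pathlib import Path


def pick_kaggle_ds_root(local_root, kaggle_root="/kaggle/input"):
    """Return `kaggle_root` if that directory exists, otherwise `local_root`.

    Hypothetical helper for illustration; it is not part of `poemsai`.
    """
    if Path(kaggle_root).is_dir():
        return str(kaggle_root)
    return str(local_root)


# Assumed local folder holding the downloaded Kaggle datasets.
ds_root = pick_kaggle_ds_root("./kaggle_datasets")
```

The result could then be passed to the config call shown above, as in `set_config_value('KAGGLE_DS_ROOT', ds_root)`.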

We are currently using:
* https://github.com/Poems-AI/dataset/tree/main/marcos_de_la_fuente.txt/en.txt: English poems by our poet Marcos de la Fuente
* https://github.com/Poems-AI/dataset/tree/main/marcos_de_la_fuente.txt/es.txt: Spanish poems by our poet Marcos de la Fuente
* https://www.kaggle.com/michaelarman/poemsdataset as an external English poetry dataset
* https://www.kaggle.com/andreamorgar/spanish-poetry-dataset as an external Spanish poetry dataset

In [None]:
!git clone https://github.com/Poems-AI/dataset.git

In [None]:
data_sources = [get_data_sources(lang, ds_type) for ds_type in DataSource]
[(type(ds), len(ds)) for ds in data_sources]

# Split into training and validation set

Set the share of the data to use as the validation set, given as a fraction between 0 and 1:

In [None]:
valid_pct = 0.2

In [None]:
train_data, valid_data = [], []
splitter_factory = SplitterFactory()

for data_source in data_sources:
    splitter = splitter_factory.get_splitter_for(data_source)
    train_data_source, valid_data_source = splitter.split(data_source, valid_pct)
    train_data.append(train_data_source)
    valid_data.append(valid_data_source)
    
sum(len(ds) for ds in train_data), sum(len(ds) for ds in valid_data)
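
The splitters returned by `SplitterFactory` are project code whose internals are not shown in this notebook. As a rough illustration of what a fractional split means, here is a minimal sketch on a plain list. The helper `split_by_fraction`, its shuffling strategy, and the `seed` parameter are assumptions made for the example; the real splitters may work differently for each data source.

```python
import random


def split_by_fraction(items, valid_pct, seed=0):
    """Shuffle a copy of `items` and split off a `valid_pct` share for validation.

    Hypothetical helper; the project's splitters may use another strategy.
    Returns a (train, valid) tuple of lists.
    """
    if not 0 <= valid_pct <= 1:
        raise ValueError("valid_pct must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]


poems = [f"poem_{i}" for i in range(10)]
train, valid = split_by_fraction(poems, valid_pct=0.2)
print(len(train), len(valid))  # prints "8 2"
```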

# Read and merge poems by split

In [None]:
reader_factory = ReaderFactory()
train_data_readers = [reader_factory.get_reader_for(data) for data in train_data]
valid_data_readers = [reader_factory.get_reader_for(data) for data in valid_data]

train_txt_path = Path(f'./all_poems.train.{lang_to_str(lang)}.txt')
valid_txt_path = Path(f'./all_poems.valid.{lang_to_str(lang)}.txt')
with open(train_txt_path, "w", encoding="utf-8") as train_txt_f:
    merge_poems(ComposedPoemsReader(train_data_readers), PoemsWriter(train_txt_f))
with open(valid_txt_path, "w", encoding="utf-8") as valid_txt_f:
    merge_poems(ComposedPoemsReader(valid_data_readers), PoemsWriter(valid_txt_f))
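
Conceptually, the cell above streams every poem from several readers into a single text file. The sketch below mimics that idea with plain Python iterables and an in-memory buffer. It is not the `poemsai` implementation: `merge_texts` is a hypothetical function, and the blank-line separator is an assumption, since the format written by `PoemsWriter` is not shown here.

```python
import io


def merge_texts(poem_groups, out_file, separator="\n\n"):
    """Write every poem from every group to `out_file`, one after another.

    Hypothetical stand-in for the reader/writer pair; returns the number
    of poems written. The separator is an assumption.
    """
    count = 0
    for group in poem_groups:
        for poem in group:
            out_file.write(poem.strip() + separator)
            count += 1
    return count


buffer = io.StringIO()
n_written = merge_texts(
    [["Roses are red"], ["Violets are blue", "Sugar is sweet"]],
    buffer,
)
```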

Show the number of lines in each file:

In [None]:
!wc -l $train_txt_path
!wc -l $valid_txt_path
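
On systems without `wc` (for example, a default Windows setup), the same count can be obtained in Python. This is a minimal sketch; `count_newlines` is a hypothetical helper name. Like `wc -l`, it counts newline characters, so a final line without a trailing newline is not counted.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as f:
        # Read in chunks so large files are never loaded into memory at once.
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total


# Demo on a throwaway file with two newline-terminated lines.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("line one\nline two\nlast line without newline")
n_lines = count_newlines(tmp.name)
os.remove(tmp.name)
```

In this notebook it could be called as `count_newlines(train_txt_path)` and `count_newlines(valid_txt_path)`.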