The code in this repo is an old exercise of mine: a PyTorch adaptation of a post by my good friend [Nadbor](http://nadbordrozd.github.io/blog/2017/06/03/python-or-scala/). I decided to bring it into this repo since it is NLP-related 🙂.

### Get Data

First you need to get the data. To do so, simply run:

```bash
bash get_data.sh
```

Then prepare the input files.

### Prepare Input files

To run this step from the terminal:

```bash
python prepare_input_files.py data/austen 'austen.txt' data/austen_clean
python prepare_input_files.py data/shakespeare/ 'shakespeare.txt' data/shakespeare_clean
python prepare_input_files.py data/scikit-learn '*.py' data/sklearn_clean
python prepare_input_files.py data/scalaz/ '*.scala' data/scalaz_clean
```

`prepare_input_files.py` calls `text_utils.py`. For clarity, this notebook includes all the code inline.

In [1]:
import numpy as np
import fnmatch
import os
import argparse

from unidecode import unidecode

chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'
charset = set(chars)
n_chars = len(charset)
char2ind = dict((c, i) for i, c in enumerate(chars))
ind2char = dict((i, c) for i, c in enumerate(chars))

char2vec = {}
for c in charset:
    vec = np.zeros(n_chars)
    vec[char2ind[c]] = 1
    char2vec[c] = vec


def sanitize_text(text):
    return ''.join(c for c in unidecode(text.decode('utf-8', 'ignore')).replace('\t', '    ') if c in charset)


# '../' because we are in the 'notebooks' dir
input_dirs = ['../data/scikit-learn', '../data/scalaz', '../data/austen', '../data/shakespeare']
output_dirs = ['../data/sklearn_clean', '../data/scalaz_clean', '../data/austen_clean', '../data/shakespeare_clean']
file_patterns = ['*.py','*.scala','austen.txt','shakespeare.txt']
for input_dir, output_dir, file_pattern in zip(input_dirs, output_dirs, file_patterns):
    # create the output directory if it does not already exist
    os.makedirs(output_dir, exist_ok=True)

    for root, dirnames, filenames in os.walk(input_dir):
        for filename in fnmatch.filter(filenames, file_pattern):
            src_path = os.path.join(root, filename)
            dst_path = os.path.join(output_dir, filename)
            # read in bytes (rb), write in text ('w')
            with open(src_path, 'rb') as in_f, open(dst_path, 'w') as out_f:
                out_f.write(sanitize_text(in_f.read()))
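
To see what the sanitization step does to a single string, here is a minimal sketch. It assumes a standard-library stand-in (`unicodedata` normalization) in place of `unidecode`, so its transliteration is cruder than the real thing. `sanitize_text_sketch` is a hypothetical name used only for illustration.

```python
import unicodedata

# Same allowed characters as in the cell above.
chars = '\n !"#$%&\'()*+,-./0123456789:;<=>?@[\\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~'
charset = set(chars)


def sanitize_text_sketch(raw_bytes):
    """Decode bytes, expand tabs and drop every character outside `charset`."""
    text = raw_bytes.decode('utf-8', 'ignore')
    # Assumption: NFKD decomposition stands in for unidecode here. It splits an
    # accented letter into a base letter plus a combining mark, and the mark is
    # then dropped by the charset filter below.
    text = unicodedata.normalize('NFKD', text)
    text = text.replace('\t', '    ')
    return ''.join(c for c in text if c in charset)


cleaned = sanitize_text_sketch('caf\u00e9\tau lait'.encode('utf-8'))
print(repr(cleaned))
```

The accent is stripped, the tab becomes four spaces, and anything not in `chars` disappears, so every output file only contains characters the model knows about.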

For most of this exercise we will use the Python and Scala datasets.

In case you want to use Austen's and Shakespeare's books, they are split here so that the partitions "make sense", meaning each one has enough text and corresponds to an episode or chapter.

To run this step with the `split_ebooks.py` script:

```bash
python split_ebooks.py
```

In [2]:
import re
import os

authors = ['austen', 'shakespeare']

ebook_d = {}
ebook_d['austen'] = {}
ebook_d['shakespeare'] = {}

ebook_d['austen']['dir'] = '../data/austen_clean'
ebook_d['austen']['fname'] = 'austen.txt'
ebook_d['austen']['regex'] = r'Chapter\s+.*|CHAPTER\s+.*'  # regular expression to split on (raw string avoids invalid-escape warnings)
ebook_d['austen']['startidx'] = 1 # starting index for the resulting partitions
ebook_d['austen']['endex'] = 'THE END' # expression to denote the end of the document 

ebook_d['shakespeare']['dir'] = '../data/shakespeare_clean'
ebook_d['shakespeare']['fname'] = 'shakespeare.txt'
ebook_d['shakespeare']['regex'] = r'\s+\d+\s+|ACT\s+.*\.|SCENE\s+.*\.'
ebook_d['shakespeare']['startidx'] = 3
ebook_d['shakespeare']['endex'] = 'FINIS'

for author in authors:
    filepath = os.path.join(ebook_d[author]['dir'],ebook_d[author]['fname'])
    # the with-block closes the file, so no explicit close() is needed
    with open(filepath, 'r') as f:
        ebook = f.read()

    endex = ebook_d[author]['endex']
    startidx = ebook_d[author]['startidx']
    the_end = [m.start() for m in re.finditer(endex, ebook)][-1]
    ebook = ebook[:the_end]
    parts = re.split(ebook_d[author]['regex'], ebook)[startidx:]

    for i,p in enumerate(parts):
        fname = 'part' + str(i).zfill(4) + '.txt'
        fpath = os.path.join(ebook_d[author]['dir'],fname)
        with open(fpath, 'w') as f:
            f.write(p)
    os.remove(filepath)
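
The splitting logic is easier to follow on a tiny input. The sketch below applies the Austen regex, end marker and start index from the cell above to a made-up miniature "ebook". The sample text is purely hypothetical and nothing is written to disk.

```python
import re

# Hypothetical miniature "ebook": front matter, two chapters, trailing text.
sample = (
    "Title page\n"
    "Chapter 1\n"
    "First chapter text.\n"
    "CHAPTER 2\n"
    "Second chapter text.\n"
    "THE END\n"
    "Licence notes\n"
)

# Cut everything from the last end marker onwards.
the_end = [m.start() for m in re.finditer('THE END', sample)][-1]
body = sample[:the_end]

# Split on chapter headings. Element 0 is the front matter before the first
# heading, which is why the start index is 1.
parts = re.split(r'Chapter\s+.*|CHAPTER\s+.*', body)[1:]

for i, part in enumerate(parts):
    print('part' + str(i).zfill(4) + '.txt', repr(part))
```

Because `.` does not match a newline, each heading match stops at the end of its line, and the text between two headings becomes one partition.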

### Train/Test split

Nothing fancy here: the files in each directory are shuffled and then moved into `train` and `test` subdirectories.

To run this step with the `train_test_split.py` script:

```bash
python train_test_split.py data/austen_clean/ 0.25
python train_test_split.py data/shakespeare_clean/ 0.25
python train_test_split.py data/sklearn_clean/ 0.25
python train_test_split.py data/scalaz_clean/ 0.25
```

In [None]:
import shutil  # needed for shutil.move below; os and np come from the earlier cells

data_dirs = ['../data/sklearn_clean/', '../data/scalaz_clean/', '../data/austen_clean/', '../data/shakespeare_clean/']
test_fraction = 0.25

for data_dir in data_dirs:
    files = os.listdir(data_dir)
    train_dir = os.path.join(data_dir, 'train')
    test_dir = os.path.join(data_dir, 'test')

    # randomly shuffle the files
    files = list(np.array(files)[np.random.permutation(len(files))])
    os.makedirs(train_dir)
    os.makedirs(test_dir)

    train_fraction = 1 - test_fraction
    for i, f in enumerate(files):
        file_path = os.path.join(data_dir, f)
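        # strict comparison: with '>=' one extra file lands in train whenever
        # len(files) * train_fraction is a whole number
        if i < len(files) * train_fraction: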
            shutil.move(file_path, train_dir)
        else:
            shutil.move(file_path, test_dir)
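
As a quick sanity check of the split arithmetic, here is a minimal sketch that partitions a list of file names without touching the filesystem. The file names, the fixed seed and the helper name `split_names` are assumptions made for illustration. The sketch uses a strict `<` comparison, so with 8 files and a test fraction of 0.25 exactly 6 go to train.

```python
import random


def split_names(names, test_fraction, seed=0):
    """Shuffle `names` and return (train, test) lists of file names."""
    names = list(names)
    # Assumption: a seeded shuffle, so the example is reproducible.
    random.Random(seed).shuffle(names)

    train_fraction = 1 - test_fraction
    train, test = [], []
    for i, name in enumerate(names):
        # The first len(names) * train_fraction shuffled files go to train.
        if i < len(names) * train_fraction:
            train.append(name)
        else:
            test.append(name)
    return train, test


# Hypothetical file names, mimicking the partitions written earlier.
names = ['part' + str(i).zfill(4) + '.txt' for i in range(8)]
train, test = split_names(names, test_fraction=0.25)
print(len(train), 'train files,', len(test), 'test files')
```

Every file ends up in exactly one of the two lists, which is the property the `shutil.move` calls rely on.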

This is what you should now have in your data directory:

In [4]:
%%bash
find ../data/*clean -maxdepth 2  -type d

../data/austen_clean
../data/austen_clean/test
../data/austen_clean/train
../data/scalaz_clean
../data/scalaz_clean/test
../data/scalaz_clean/train
../data/shakespeare_clean
../data/shakespeare_clean/test
../data/shakespeare_clean/train
../data/sklearn_clean
../data/sklearn_clean/test
../data/sklearn_clean/train


That's it: we now have a series of text files and are ready to train.