## Input/Output Interface for the BlazingText Algorithm
For supervised training, the BlazingText training/validation files need to be plain-text files of space-separated tokens. Each file should contain a single record per line, starting with the label and followed by the tokenized text. Labels are words prefixed by the string `__label__`. Here is an example of a training/validation file:

```

__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived . 
```
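
As a minimal sketch of that layout, the helpers below build and check a single line. The names `to_blazingtext_line` and `is_valid_line` are hypothetical and belong to no library; they only illustrate the "label first, then space-separated tokens" rule described above.

```python
LABEL_PREFIX = "__label__"


def to_blazingtext_line(label, tokens):
    """Return one supervised-mode line: '__label__<label> tok tok ...'."""
    label = str(label)
    if not label or any(ch.isspace() for ch in label):
        raise ValueError("label must be non-empty and contain no whitespace")
    # Drop empty tokens so the line uses single-space separators
    words = [t for t in tokens if t and not t.isspace()]
    return " ".join([LABEL_PREFIX + label] + words)


def is_valid_line(line):
    """Check that a line starts with a non-empty label and has at least one token."""
    parts = line.split()
    return (
        len(parts) >= 2
        and parts[0].startswith(LABEL_PREFIX)
        and len(parts[0]) > len(LABEL_PREFIX)
    )


print(to_blazingtext_line(4, ["linux", "ready", "for", "prime", "time"]))
# prints: __label__4 linux ready for prime time
```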

In [1]:
import csv
import multiprocessing
import os
import re
from multiprocessing import Pool

import boto3
import nltk
from sklearn.model_selection import train_test_split

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')


print("Extracting text from labeled documents.")
data = []
for file in os.listdir('labeled_fbo_docs'):
    # The file name prefix encodes the target: GREEN maps to label 1,
    # RED and YELLOW both map to label 0.
    if file.startswith('GREEN'):
        target = '__label__1'
    elif file.startswith(('RED', 'YELLOW')):
        target = '__label__0'
    else:
        raise ValueError(f"File name does not start with a known target prefix: {file}")

    file_path = os.path.join(os.getcwd(), 'labeled_fbo_docs', file)
    # Skip undecodable bytes, then flatten the document onto a single line
    with open(file_path, 'r', errors='ignore') as f:
        text = f.read().replace("\n", ' ').strip()
    data.append((target, text))
print("Done extracting text from labeled documents.")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/charlessmcallister/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
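def transform_instance(row):
    """Turn a (label, text) pair into [label, token, token, ...]."""
    # The label comes first, as the BlazingText format requires
    cur_row = [row[0]]
    # Lowercase the text and split it into word tokens
    cur_row.extend(nltk.word_tokenize(row[1].lower()))
    return cur_row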

In [3]:
def preprocess(data, output_file):
    # Tokenize the documents in parallel, one worker per CPU core
    with Pool(processes=multiprocessing.cpu_count()) as pool:
        transformed_rows = pool.map(transform_instance, data)

    # Write one space-separated record per line
    with open(output_file, 'w', newline='') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
        csv_writer.writerows(transformed_rows)
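
One detail of writing the rows with `csv.writer` is worth knowing: its default minimal quoting wraps any token that contains a double quote in quotes and doubles the quotes inside it. The sketch below uses a made-up row to compare that output with a plain space join, which writes the tokens unchanged. Whether this affects a given corpus depends on the tokens the tokenizer actually emits, so treat it as an illustration, not a required change.

```python
import csv
import io

# Hypothetical row with one token that contains double quotes
row = ["__label__1", "she", "said", '"hi"']

# csv.writer applies minimal quoting: the last token is wrapped in quotes
# and its own quote characters are doubled.
buf = io.StringIO()
csv.writer(buf, delimiter=" ", lineterminator="\n").writerow(row)
csv_line = buf.getvalue()

# A plain join leaves every token exactly as it was.
joined_line = " ".join(row) + "\n"

print(repr(csv_line))
print(repr(joined_line))
```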

In [4]:
%%time
# Hold out part of the data for validation (train_test_split keeps 25% by default)
train, test = train_test_split(data)

# Preparing the training dataset
preprocess(train, 'srt.train')

# Preparing the validation dataset
preprocess(test, 'srt.validation')

CPU times: user 7.72 s, sys: 2.06 s, total: 9.78 s
Wall time: 47.2 s
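
Before uploading, it can help to confirm that every line of the generated files starts with a label and to see how the two classes are distributed. The sketch below is one possible check; `label_counts` is a hypothetical helper, and the file names are the ones written by the cell above.

```python
import os
from collections import Counter


def label_counts(path, prefix="__label__"):
    """Count first-token labels in a BlazingText file; fail on malformed lines."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if not parts[0].startswith(prefix):
                raise ValueError(f"{path}:{line_no} does not start with {prefix}")
            counts[parts[0]] += 1
    return counts


# Report the class balance of each file, if it has been written
for name in ("srt.train", "srt.validation"):
    if os.path.exists(name):
        print(name, dict(label_counts(name)))
```

A heavily skewed count in either file would suggest passing `stratify` to `train_test_split` so that both splits keep the same class proportions.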


## Pushing to S3
Before this step, install the AWS CLI and configure it with the Access Key ID and Secret Access Key of your AWS account. You can do that with the `aws configure` command, as described in the [AWS CLI configuration guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration). boto3 reads the same credentials, and the default region you set there is what `boto3.Session().region_name` returns below.
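
As a rough pre-flight check, the sketch below looks for both keys in the shared credentials file that `aws configure` writes (`~/.aws/credentials`). `profile_has_keys` is a hypothetical helper. boto3 can also pick up credentials from environment variables and other sources, so a `False` result here does not by itself mean the upload will fail.

```python
import configparser
import os


def profile_has_keys(credentials_path, profile="default"):
    """Return True if the profile in an AWS shared credentials file has both keys."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file simply yields no sections
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )


# Only reports whether the keys are present; it never prints their values
path = os.path.expanduser("~/.aws/credentials")
print(profile_has_keys(path))
```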

In [95]:
%%time
s3 = boto3.resource('s3')
region = boto3.Session().region_name
bucket = 'srt-sagemaker' 
prefix = 'training'  # Key prefix in the bucket under which the train and validation data are stored
bucket_path = f'https://s3.{region}.amazonaws.com/{bucket}'

data_to_upload = ['srt.train', 'srt.validation']
for upload_file in data_to_upload:
    key = f'{prefix}/{upload_file}'
    s3.Bucket(bucket).Object(key).upload_file(upload_file)
    url = f's3://{bucket}/{key}'
    print(f'Done writing to {url}')

Done writing to s3://srt-sagemaker/training/srt.train
Done writing to s3://srt-sagemaker/training/srt.validation
CPU times: user 4.42 s, sys: 4.23 s, total: 8.65 s
Wall time: 6min 52s
