# Read in the labeled documents
This notebook assumes:
 - You've got all of the labeled solicitation documents within a directory named `labeled_fbo_docs`
 - You can use the awscli to push a csv up to an S3 bucket (ours is named `srt-sm`).

Below, we'll read in each document and extract the text along with the label (the label is the prefix of the file name). Although there are three labels (red, yellow and green), we're combining red and yellow as noncompliant ($0$) and treating green as compliant ($1$). This turns the task into a binary classification problem.

In [1]:
import os

data = []
for file in os.listdir('labeled_fbo_docs'):
    if file.startswith('GREEN'):
        target = 1
    elif file.startswith('RED') or file.startswith('YELLOW'):
        target = 0
    else:
        raise Exception(f"A file isn't prepended with the target:  {file}")
    
    file_path = os.path.join(os.getcwd(), 'labeled_fbo_docs', file)
    with open(file_path, 'r', errors = 'ignore') as f:
        # Collapse newlines into spaces so each document is a single line of text
        text = f.read().replace("\n", ' ').strip()
    data.append([target, text])
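
The labeling rule used in the loop above can also be pulled out into a small helper, which makes it easy to check on its own. This is only a sketch: the function name `label_from_filename` and the example file names are made up for illustration, and it raises `ValueError` rather than a bare `Exception`.

```python
def label_from_filename(file_name):
    """Return 1 for GREEN files and 0 for RED or YELLOW files."""
    if file_name.startswith('GREEN'):
        return 1
    # str.startswith accepts a tuple of prefixes
    if file_name.startswith(('RED', 'YELLOW')):
        return 0
    raise ValueError(f"A file isn't prepended with the target: {file_name}")

# Hypothetical file names, for illustration only
examples = ['GREEN_example.txt', 'RED_example.txt', 'YELLOW_example.txt']
labels = [label_from_filename(name) for name in examples]
print(labels)
```

Any file that lacks one of the three prefixes (a stray `.DS_Store`, for instance) still stops the run, which matches the behavior of the loop above.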

# Write the training data to csv
Here we'll write this training data to a single csv with two columns, `target` and `text`, in that order. Note that the code below writes only the data rows; it does not write a header row.

In [2]:
import csv

# newline='' lets the csv module manage line endings itself
with open('srt_train.csv', mode='w', newline='') as f:
    csvwriter = csv.writer(f, delimiter=',')
    csvwriter.writerows(data)
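
Because the document text can contain commas and quotation marks, it can be worth confirming that rows survive a write and read cycle. The sketch below uses two made-up rows in place of `data` and a temporary file in place of `srt_train.csv`; the helper name `round_trip` is hypothetical.

```python
import csv
import os
import tempfile

# Made-up rows standing in for `data`; the second text contains commas and quotes
sample_rows = [
    [1, 'This solicitation includes accessibility requirements.'],
    [0, 'Vendor shall provide "widgets", delivered monthly.'],
]

def round_trip(rows):
    """Write rows to a temporary CSV and read them back as (target, text) tuples."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'sample.csv')
        with open(path, mode='w', newline='') as f:
            csv.writer(f, delimiter=',').writerows(rows)
        with open(path, mode='r', newline='') as f:
            # csv returns every field as a string, so cast the target back to int
            return [(int(target), text) for target, text in csv.reader(f)]

restored = round_trip(sample_rows)
print(restored)
```

The `csv` module quotes fields that contain the delimiter or quote characters, so the text comes back unchanged; only the target needs converting from a string back to an integer.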

# Push training data to S3
You'll need to have installed the awscli prior to this step and configured it to use the Key ID and Secret Access Key of your AWS account. You can do that with `aws configure`, as documented [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration).

In [1]:
import boto3

s3 = boto3.resource('s3')
region = boto3.Session().region_name
bucket = 'srt-sm' 
prefix = 'training' # Used as part of the path in the bucket where we'll store train and test data
bucket_path = f'https://s3-{region}.amazonaws.com/{bucket}'

upload_file = 'srt_train.csv'
key = f'{prefix}/{upload_file}'
s3.Bucket(bucket).Object(key).upload_file(upload_file)
url = f's3n://{bucket}/{key}'
print(f'Done writing to {url}')

Done writing to s3n://srt-sm/training/srt_train.csv
