# Upload Training Data to S3
This notebook assumes:
 - You've got all of the labeled solicitation documents within a directory named `labeled_fbo_docs`.
 - You can use the `awscli` and have configured it. See [this](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html) for instructions. We recommend placing the credentials in a shared credential file (`~/.aws/credentials`).
 - You have already created an S3 bucket (ours is named `srt-sm`).
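
The local assumptions above can be checked before running anything else. This is only a minimal sketch: the helper name `missing_prerequisites` is hypothetical, it covers just the document directory and the shared credential file, and it does not contact AWS or verify the bucket. Credentials supplied another way (for example through environment variables) would be reported as missing even though they are valid.

```python
import os

def missing_prerequisites(docs_dir="labeled_fbo_docs",
                          credentials_path="~/.aws/credentials"):
    """Return a list of problems found; the list is empty if both paths exist."""
    problems = []
    if not os.path.isdir(docs_dir):
        problems.append(f"directory not found: {docs_dir}")
    # expanduser turns the leading ~ into the current user's home directory
    if not os.path.isfile(os.path.expanduser(credentials_path)):
        problems.append(f"credentials file not found: {credentials_path}")
    return problems

for problem in missing_prerequisites():
    print(problem)
```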

## Read in the training data

Below, we'll read in the labeled documents and extract each document's text along with its label (the label is the prefix of the file name).

>Although there are three labels (red, yellow and green), we're combining red and yellow as noncompliant ($0$) and treating green as compliant ($1$). This makes it a binary classification problem.

In [2]:
import os
import random

data = []
for file in os.listdir('labeled_fbo_docs'):
    if file.startswith('GREEN'):
        target = 1
    elif file.startswith('RED') or file.startswith('YELLOW'):
        target = 0
    else:
        raise ValueError(f"File name doesn't start with a label (GREEN, RED or YELLOW): {file}")

    file_path = os.path.join('labeled_fbo_docs', file)
    with open(file_path, 'r', errors='ignore') as f:
        # Collapse newlines into spaces so each document is a single line of text
        text = f.read().replace("\n", ' ').strip()
    data.append([target, text])
    
print(f"Done reading in {len(data)} documents.")

Done reading in 993 documents.
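
Before splitting, it can be useful to see how the two classes are distributed. The sketch below uses a small made-up list, `toy_data`, as a stand-in for `data` (the same `[target, text]` layout is assumed); the same two lines of counting could be applied to the real list.

```python
from collections import Counter

# Toy stand-in for `data`: [target, text] pairs like those built above
toy_data = [[1, "doc a"], [0, "doc b"], [0, "doc c"], [0, "doc d"]]

# Count how many documents carry each label
counts = Counter(target for target, _ in toy_data)
positive_share = counts[1] / len(toy_data)

print(f"compliant: {counts[1]}, noncompliant: {counts[0]}, "
      f"positive share: {positive_share:.2%}")
```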


## Split the samples into training and test datasets
Since our data is imbalanced, we'll pass the labels to the `stratify` parameter of `train_test_split` so that the training and test sets keep roughly the same proportion of positive samples as the full dataset.

In [3]:
from sklearn.model_selection import train_test_split

y = [i[0] for i in data]
x = [i[1] for i in data]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123, stratify=y)

# Percentage of positive (compliant) samples in each split
pct_test_pos = 100 * sum(y_test) / len(y_test)
pct_train_pos = 100 * sum(y_train) / len(y_train)

print(f"{pct_train_pos:.2f}% of the training data is a positive sample")

print(f"{pct_test_pos:.2f}% of the testing data is a positive sample")

27.33% of the training data is a positive sample
27.14% of the testing data is a positive sample
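
To make the idea behind stratification concrete, here is a rough standard-library sketch that splits each class separately so both parts keep the same label mix. It is not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=123):
    """Return (train_idx, test_idx) with roughly the same label mix in each."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    # Split each class on its own, so every class contributes ~test_size to the test set
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 30 positive and 70 negative samples
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_pos:.2%}, test positives: {test_pos:.2%}")
```

With an unstratified random split, the two percentages could drift apart, which matters more the smaller the minority class is.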


## Write the training data to CSV
Here we'll write the training and test data to two CSV files, using pandas to keep it simple.

In [4]:
import pandas as pd

# One row per document: the label in the first column, the text in the second
train_df = pd.DataFrame({'target': y_train, 'text': X_train})

test_df = pd.DataFrame({'target': y_test, 'text': X_test})


In [5]:
train_df.to_csv('srt_train.csv', index = False)

test_df.to_csv('srt_test.csv', index = False)

Here we'll read the test file back in, just to make sure we know how to load it correctly in our SageMaker notebook.

In [6]:
import numpy as np

test_df_check = pd.read_csv('srt_test.csv')
test_df_check.columns = ['target', 'text']
test_df_check = test_df_check.astype({'target': np.float64, 'text': str})

test_df_check.head()

| | target | text |
|---|---|---|
| 0 | 0.0 | OBJECTIVE The RRB seeks electronic data sto... |
| 1 | 0.0 | Checklist and Certification for Minimum Level ... |
| 2 | 1.0 | Date Issued: January 6, 2009 Date Due: Febru... |
| 3 | 0.0 | Section SF 1449 - CONTINUATION SHEET \|ITEM ... |
| 4 | 1.0 | Statement of Work ... |
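
The reason a round-trip check is worthwhile is that solicitation text often contains commas and quotation marks, which must be quoted on write and unquoted on read. The sketch below shows this with Python's built-in `csv` module rather than pandas, on two made-up rows; the column names `target` and `text` mirror the ones used above.

```python
import csv
import io

# Made-up rows; the first text contains a comma and double quotes
rows = [[0, 'Section SF 1449, "CONTINUATION SHEET"'], [1, "Statement of Work"]]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["target", "text"])
writer.writerows(rows)

# Read the same buffer back and rebuild the [target, text] pairs
buffer.seek(0)
reader = csv.DictReader(buffer)
round_tripped = [[int(r["target"]), r["text"]] for r in reader]

print(round_tripped == rows)
```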


## Push training data to S3
Here we push the data to our S3 bucket, using a prefix that describes our project and the model we're going to use.

In [7]:
import boto3

s3 = boto3.resource('s3')
region = boto3.Session().region_name
bucket = 'srt-sm' 
prefix = 'Sklearn-RandomizedGridSearch'
bucket_path = f'https://s3-{region}.amazonaws.com/{bucket}'

for f in ['srt_train.csv', 'srt_test.csv']:
    key = f'{prefix}/{f}'
    s3.Bucket(bucket).Object(key).upload_file(f)
    # Standard S3 URI for the uploaded object
    url = f's3://{bucket}/{key}'
    print(f'Done writing to {url}')

Done writing to s3://srt-sm/Sklearn-RandomizedGridSearch/srt_train.csv
Done writing to s3://srt-sm/Sklearn-RandomizedGridSearch/srt_test.csv
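
After the upload, one optional sanity check is to compare the size S3 reports for each object with the size of the local file. The helper below is a sketch under assumptions: the name `uploaded_size_matches` is hypothetical, and it expects to be given a boto3 S3 client (for example `boto3.client('s3')`), whose `head_object` call returns the object's `ContentLength`.

```python
import os

def uploaded_size_matches(s3_client, bucket, key, local_path):
    """Return True if the object's size in S3 equals the local file's size."""
    # head_object fetches metadata only; the object body is not downloaded
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"] == os.path.getsize(local_path)

# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client('s3')
# print(uploaded_size_matches(client, 'srt-sm',
#                             'Sklearn-RandomizedGridSearch/srt_test.csv',
#                             'srt_test.csv'))
```

If the object does not exist, `head_object` raises an error instead of returning, so a missing upload would surface as an exception rather than `False`.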
