# Read in the labeled documents
This notebook assumes:
 - You've got all of the labeled solicitation documents within a directory named `labeled_fbo_docs`
 - You can use the awscli
 - You have already created an S3 bucket (ours is named `srt-sm`).

Below, we'll read in each document and extract its text along with its label, which is encoded in the file name. Although there are three labels (red, yellow and green), we combine red and yellow as noncompliant ($0$) and treat green as compliant ($1$). This makes it a binary classification problem.


>And since we're only piloting SageMaker, we'll reduce our total training dataset size down to just 50 documents.

In [10]:
import os
import random

data = []
for i, file in enumerate(os.listdir('labeled_fbo_docs')):
    if i < 50:
        if file.startswith('GREEN'):
            target = 1
        elif file.startswith('RED') or file.startswith('YELLOW'):
            target = 0
        else:
            raise Exception(f"A file name doesn't start with the target label: {file}")

        file_path = os.path.join(os.getcwd(), 'labeled_fbo_docs', file)
        with open(file_path, 'r', errors = 'ignore') as f:
            # replace newlines with spaces and trim surrounding whitespace
            text = f.read().replace("\n", ' ').strip()
        data.append([target, text])
    else:
        break
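
The loop above keeps whichever 50 entries `os.listdir` happens to return first, and that order is not guaranteed. As a rough sketch, assuming a reproducible subset is wanted, the label rule and the sampling could be pulled into two helpers. The names `label_from_filename` and `sample_files`, the seed, and the example file names are hypothetical and not part of the original notebook.

```python
import random


def label_from_filename(name):
    """Map a file name prefix to the binary target: GREEN -> 1, RED/YELLOW -> 0."""
    if name.startswith('GREEN'):
        return 1
    if name.startswith(('RED', 'YELLOW')):
        return 0
    raise ValueError(f"File name does not start with a label: {name}")


def sample_files(names, k=50, seed=123):
    """Return up to k names, chosen reproducibly regardless of listing order."""
    ordered = sorted(names)      # fix the order before sampling
    rng = random.Random(seed)    # local RNG, leaves the global random state alone
    return rng.sample(ordered, min(k, len(ordered)))


# Hypothetical file names, for illustration only
names = ['GREEN_a.txt', 'RED_b.txt', 'YELLOW_c.txt']
labels = [label_from_filename(n) for n in sample_files(names, k=2)]
```

Sorting before sampling means the same seed selects the same files on any machine, which a plain slice of the directory listing would not promise.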

# Split into Training and Testing data
Since our data is imbalanced, we'll use the `stratify` kwarg to split the data in a stratified fashion, using the labels array.

In [11]:
from sklearn.model_selection import train_test_split

y = [i[0] for i in data]
x = [i[1] for i in data]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123, stratify=y)

# percentage of positive (compliant) samples in each split
pct_test_pos = 100 * sum(y_test) / len(y_test)
pct_train_pos = 100 * sum(y_train) / len(y_train)

print("{:.2f}% of the training data is a positive sample".format(pct_train_pos))

print("{:.2f}% of the testing data is a positive sample".format(pct_test_pos))

25.00% of the training data is a positive sample
20.00% of the testing data is a positive sample
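
To make these percentages concrete: with 50 documents and `test_size=0.2`, the split holds 40 training and 10 test documents, so 25.00% and 20.00% correspond to 10 and 2 positive samples. The sketch below reproduces that arithmetic. The label lists are reconstructed from the printed percentages rather than taken from the real data, and `positive_share` is a hypothetical helper name.

```python
def positive_share(labels):
    """Percentage of labels equal to 1 in a list of 0/1 labels."""
    return 100 * sum(labels) / len(labels)


# Reconstructed counts (an assumption based on the printed output)
y_train_example = [1] * 10 + [0] * 30   # 40 training documents
y_test_example = [1] * 2 + [0] * 8      # 10 test documents

train_share = positive_share(y_train_example)
test_share = positive_share(y_test_example)
overall_share = positive_share(y_train_example + y_test_example)

print(f"train: {train_share:.2f}%, test: {test_share:.2f}%, overall: {overall_share:.2f}%")
```

Under these assumed counts the overall share is 24%, which cannot be matched exactly in a 10-document test set (2.4 positives), so the two splits land slightly on either side of it.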


# Write the training data to CSV
Here we'll use pandas to write the training and test data to two CSV files.

In [12]:
import pandas as pd

train_df = pd.DataFrame([y_train, X_train]).transpose()

test_df = pd.DataFrame([y_test, X_test]).transpose()


In [13]:
# SageMaker apparently can't parse CSV input that has only a single feature column, so we add a column of zeros.
# See the comment here: https://stackoverflow.com/questions/51635902/aws-sagemaker-unable-to-parse-csv
train_df[2] = 0
test_df[2] = 0

In [14]:
train_df.to_csv('srt_train.csv', index = False)

test_df.to_csv('srt_test.csv', index = False)

Here we'll read the test file back in to confirm that it parses as expected, since our SageMaker script will need to load the data the same way.

In [15]:
import numpy as np

test_df_check = pd.read_csv('srt_test.csv')
test_df_check.columns = ['target', 'text', 'zero']
test_df_check = test_df_check.astype({'target': np.float64, 'text': str, 'zero': np.float64})

test_df_check.head()

| | target | text | zero |
|---|---|---|---|
| 0 | 0.0 | This is a combined synopsis/solicitation for c... | 0.0 |
| 1 | 0.0 | General Information: Document Type: ... | 0.0 |
| 2 | 1.0 | Attachment 1 Glossary ... | 0.0 |
| 3 | 0.0 | STATEMENT OF WORK (SOW) FOR 317 RCS/... | 0.0 |
| 4 | 0.0 | \|SOLICITATION, OFFER AND AWARD ... | 0.0 |
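
Solicitation text often contains commas (the last row above includes one), so it is worth confirming that such a field survives a CSV round trip as a single column. The sketch below illustrates this with the standard library `csv` module and an in-memory buffer instead of the pandas calls used in this notebook. The sample row is made up.

```python
import csv
import io

# Made-up sample row in the same column order: target, text, zero
rows = [[0, 'SOLICITATION, OFFER AND AWARD', 0]]

# Write to an in-memory buffer; fields containing commas are quoted automatically
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
written = buffer.getvalue()

# Read the same buffer back and check the text is still one field
buffer.seek(0)
parsed = list(csv.reader(buffer))

print(written.strip())
print(parsed)
```

Every value comes back as a string here, which is why the check above casts the columns to explicit types after reading.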


# Push training data to S3
Before this step, you'll need to have installed the awscli and configured it with the Access Key ID and Secret Access Key of your AWS account. You can do that with `aws configure` as documented [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration).

In [16]:
import boto3

s3 = boto3.resource('s3')
region = boto3.Session().region_name
bucket = 'srt-sm' 
prefix = 'Scikit-LinearLearner-pipeline-srt'
bucket_path = f'https://s3-{region}.amazonaws.com/{bucket}'

for f in ['srt_train.csv', 'srt_test.csv']:
    key = f'{prefix}/{f}'
    s3.Bucket(bucket).Object(key).upload_file(f)
    url = f's3n://{bucket}/{key}'
    print(f'Done writing to {url}')

Done writing to s3n://srt-sm/Scikit-LinearLearner-pipeline-srt/srt_train.csv
Done writing to s3n://srt-sm/Scikit-LinearLearner-pipeline-srt/srt_test.csv
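
The printed locations use the `s3n://` scheme, which comes from Hadoop's older S3 connector. The AWS CLI expects `s3://` URIs, and to our knowledge SageMaker input channels do as well. As a small sketch, assuming the same bucket and prefix as above, the key and URI could be built with two helpers. The names `s3_key` and `s3_uri` are hypothetical, and nothing here contacts AWS.

```python
def s3_key(prefix, filename):
    """Join a prefix and a file name into an object key without doubled slashes."""
    return f"{prefix.strip('/')}/{filename}"


def s3_uri(bucket, key):
    """Build an s3:// URI of the form accepted by the AWS CLI."""
    return f"s3://{bucket}/{key}"


bucket = 'srt-sm'
prefix = 'Scikit-LinearLearner-pipeline-srt'

uris = [s3_uri(bucket, s3_key(prefix, f)) for f in ['srt_train.csv', 'srt_test.csv']]
for uri in uris:
    print(uri)
```

Each printed URI could then be passed to a command such as `aws s3 ls` to confirm that the upload landed where expected.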
