# Financial Headline Sentiment Model

* Upload data to S3
* Define sklearn estimator
* Training script
* Train estimator
* Deploy model
* Evaluate deployed classifier - check results match experiment

## Imports

In [1]:
# set sys path to access scripts
import sys
sys.path.append('../')

# general
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# model
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# aws
import boto3
import sagemaker

# custom scripts
import scripts.evaluator as evaluator
import scripts.config as config

## Prepare Data for S3

In [2]:
df_phrase = pd.read_csv(config.FINANCIAL_PHRASE_BANK)
df_domain_dict = pd.read_csv(config.DOMAIN_DICTIONARY)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(df_phrase['text'], 
                                                    df_phrase['sentiment'], 
                                                    test_size=config.TEST_SIZE, 
                                                    random_state=config.RANDOM_STATE)

df_train = pd.concat([y_train, X_train], axis=1)
df_test = pd.concat([y_test, X_test], axis=1)

df_train.to_csv(config.DP_DATASETS_LOC+'dp1_train.csv', index=False)
df_test.to_csv(config.DP_DATASETS_LOC+'dp1_test.csv', index=False)
df_domain_dict.to_csv(config.DP_DATASETS_LOC+'dp1_vocab.csv', index=False)

In [10]:
# reload the saved CSV files to check that they were written correctly
train_data = pd.read_csv(config.DP_DATASETS_LOC+'dp1_train.csv')
test_data = pd.read_csv(config.DP_DATASETS_LOC+'dp1_test.csv')

In [7]:
vocab = list(df_domain_dict.word)

In [11]:
train_y = train_data.sentiment
train_x = train_data.text
test_x = test_data.text
test_y = test_data.sentiment

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features restricted to the domain vocabulary, followed by a linear SVM
pipe = Pipeline([('tfidf', TfidfVectorizer(vocabulary=vocab, ngram_range=(1, 2))),
                 ('model', LinearSVC())])

pipe.fit(train_x, train_y)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [16]:
test_y_preds = pipe.predict(test_x)

In [17]:
from sklearn.metrics import accuracy_score

# calculate the test accuracy of the locally trained pipeline
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

0.7761506276150628


## Load Data to S3


In [4]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

In [5]:
# local directory holding the train, test, and vocab CSV files saved above
data_dir = config.DP_DATASETS_LOC

# set prefix, a descriptive name for the directory in S3
prefix = 'financial_sentiment'

# upload all data to S3
data = sagemaker_session.upload_data(path=data_dir,
                                     bucket=bucket,
                                     key_prefix=prefix+'/data')

# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

financial_sentiment/data/dp1_test.csv
financial_sentiment/data/dp1_train.csv
financial_sentiment/data/dp1_vocab.csv
financial_sentiment/data/vocab.csv
Test passed!


## Define a Scikit-learn estimator


In [6]:
source = '/home/ec2-user/SageMaker/financial_headline_sentiment/deployment/source_sklearn'

In [7]:
from sagemaker.sklearn.estimator import SKLearn

# S3 location where the trained model artifact will be written
output_path = 's3://{}/{}/{}'.format(bucket, prefix, 'model')

estimator = SKLearn(entry_point='train.py', 
                    framework_version='0.20.0', 
                    source_dir=source, 
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session)
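The estimator points at `train.py` in `source_sklearn`, which is not shown in this notebook. As a rough sketch only, the inference hooks in such a script might look like the following, assuming the endpoint receives the JSON payload `{'data': [...]}` built later in this notebook and answers with `{'predictions': [...]}`. The function bodies are assumptions, not the contents of the actual script, and `model_fn` (which would load the fitted pipeline from the model directory) is left out.

```python
import json


def input_fn(request_body, request_content_type):
    """Decode the request body into a list of headline strings.

    Assumption: the body is a JSON document of the form {"data": [...]},
    whatever content type the client declares.
    """
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode('utf-8')
    payload = json.loads(request_body)
    return list(payload['data'])


def predict_fn(input_data, model):
    """Run the fitted pipeline on the decoded headlines."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Serialize predictions as {"predictions": [...]}."""
    # Convert NumPy scalars to plain Python values when needed
    values = [p.item() if hasattr(p, 'item') else p for p in prediction]
    return json.dumps({'predictions': values})
```

Keeping the decoding and encoding in separate hooks means the fitted pipeline only ever sees a plain list of strings, exactly as it did during local training.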



## Train the estimator

Train estimator on the training data stored in S3.

In [8]:
%%time

# Train your estimator on S3 training data
estimator.fit({'train':data})


2020-06-01 11:22:48 Starting - Starting the training job......
2020-06-01 11:23:19 Starting - Launching requested ML instances.........
2020-06-01 11:24:52 Starting - Preparing the instances for training...
2020-06-01 11:25:32 Downloading - Downloading input data...
2020-06-01 11:26:15 Training - Training image download completed. Training in progress.
2020-06-01 11:26:15 Uploading - Uploading generated training model
2020-06-01 11:26:10,065 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-06-01 11:26:10,067 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-01 11:26:10,077 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-06-01 11:26:10,330 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-06-01 11:26:10,331 sagemaker-containers INFO     Generating setup.cfg
2020-06-01 11:26

## Deploy trained model

After training, deploy your model to create a `predictor`. The `SKLearn` estimator used in this notebook can be deployed directly. If you're using a PyTorch model instead, you'll need to create a trained `PyTorchModel` that accepts the trained `<model>.model_data` as an input parameter and points to the provided `source_pytorch/predict.py` file as an entry point.

To deploy a trained model, you'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

Note: If you run into an instance error, you may have chosen an unsupported training or deployment `instance_type`. It may help to refer to your previous exercise code to see which instance types were used there.

In [9]:
%%time

# deploy your model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, 
                             instance_type='ml.t2.medium')


-------------!CPU times: user 227 ms, sys: 14.8 ms, total: 242 ms
Wall time: 6min 31s


## Endpoint Name For Lambda Function

In [10]:
predictor.endpoint

'sagemaker-scikit-learn-2020-06-01-11-22-48-208'

## Format

* https://docs.python.org/3/library/json.html
* https://medium.com/weareservian/machine-learning-on-aws-sagemaker-53e1a5e218d9
* https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#deploy-a-scikit-learn-model

In [11]:
import os
import json

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "dp1_test.csv"))

# Labels are in the first column
test_y = test_data.sentiment
test_x = test_data.text

# format data
data = {'data': list(test_x)}
data_string = json.dumps(data)
data_encoded = data_string.encode('UTF-8')
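The request is a JSON object with a single `data` key holding the list of headlines, and the evaluation cells further down read a `predictions` key from the JSON response. A minimal sketch of that round trip follows; the helper names `encode_request` and `decode_response` are hypothetical and exist only for illustration.

```python
import json


def encode_request(texts):
    """Serialize headlines into the UTF-8 JSON body sent to the endpoint."""
    return json.dumps({'data': list(texts)}).encode('utf-8')


def decode_response(body):
    """Extract the list of predictions from the endpoint's JSON response."""
    if isinstance(body, (bytes, bytearray)):
        body = body.decode('utf-8')
    return json.loads(body)['predictions']


# One headline taken from the test set shown later in this notebook
request_body = encode_request(
    ['scanfil expects net_sales in 2008 to remain at the 2007 level'])
print(request_body)
```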

In [15]:
# The SageMaker runtime client is what allows us to invoke the endpoint that we've created.
runtime = boto3.Session().client('sagemaker-runtime')

# Invoke the endpoint, sending the JSON-encoded test headlines
response = runtime.invoke_endpoint(EndpointName='sagemaker-scikit-learn-2020-06-01-11-22-48-208',  # name printed by predictor.endpoint
                                   ContentType='text/csv',  # content type declared to the endpoint; the body itself is JSON
                                   Body=data_encoded)  # the encoded test headlines

# The response is an HTTP response whose body contains the result of our inference
result = response['Body'].read().decode('utf-8')

---
## Evaluate Model

Once your model is deployed, you can see how it performs when applied to our test data.

The test data was read in above from `dp1_test.csv` in `data_dir`, and the labels and features were extracted from that file. The cell below parses the predictions out of the endpoint's JSON response.

In [16]:
test_y_preds = json.loads(result)['predictions']

In [17]:
from sklearn.metrics import accuracy_score

# calculate the test accuracy of the deployed model
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

0.7761506276150628
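With its default arguments, `accuracy_score` reports the fraction of predictions that exactly match the true labels. A pure-Python equivalent, shown on made-up labels rather than the notebook's data, might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Illustrative labels only, not taken from the test set
y_true = ['positive', 'neutral', 'negative', 'neutral']
y_pred = ['positive', 'neutral', 'neutral', 'neutral']
print(accuracy(y_true, y_pred))  # 0.75
```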


## Example for Lambda Test Function

In [18]:
# format data
data = {'data': list(test_x)[0:2]}
data

{'data': ['the sale , which will result in a gain of some eur 60 million in the second quarter of 2010 for oriola-kd , supports the finnish company s strategy to focus on pharmaceutical wholesale and retail operations',
  'scanfil expects net_sales in 2008 to remain at the 2007 level']}
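The payload above can serve as a test event for a Lambda function that forwards it to the endpoint. The handler below is a sketch under assumptions: the event is taken to be the `{'data': [...]}` dictionary itself, the endpoint name is the one printed earlier in this notebook, and the `runtime` argument is a hypothetical hook added so the function can be exercised without AWS access.

```python
import json

# Endpoint name printed earlier in this notebook
ENDPOINT_NAME = 'sagemaker-scikit-learn-2020-06-01-11-22-48-208'


def lambda_handler(event, context, runtime=None):
    """Forward the headlines in `event` to the SageMaker endpoint."""
    if runtime is None:
        # Imported lazily so the module loads without boto3 during local tests
        import boto3
        runtime = boto3.client('sagemaker-runtime')

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType='text/csv',  # same content type as the notebook call
        Body=json.dumps({'data': event['data']}).encode('utf-8'),
    )
    result = json.loads(response['Body'].read().decode('utf-8'))

    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(result),
    }
```

Passing a stand-in object as `runtime` makes it possible to check the request and response handling locally before wiring the function to the live endpoint.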

----
## EXERCISE: Clean up Resources

After you're done evaluating your model, **delete your model endpoint**. You can do this with a call to `.delete_endpoint()`. You need to show, in this notebook, that the endpoint was deleted. You can delete any other resources from the AWS console; more instructions on cleaning up all your resources follow below.

In [58]:
# delete the endpoint now that evaluation is finished
predictor.delete_endpoint()

### Deleting S3 bucket

When you are *completely* done with training and testing models, you can also delete your entire S3 bucket. If you do this before you are done training your model, you'll have to recreate your S3 bucket and upload your training data again.

In [59]:
# empty the bucket by deleting every object in it
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'D9DFC25E7CA62C32',
   'HostId': 'EWcOd8DLhVxS4Na+Wux/lJnMMyDm/+ia/sMnwKRabQp8uPPyvbQ9jOWWxUz1uJ7lt51+zzjBHj0=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'EWcOd8DLhVxS4Na+Wux/lJnMMyDm/+ia/sMnwKRabQp8uPPyvbQ9jOWWxUz1uJ7lt51+zzjBHj0=',
    'x-amz-request-id': 'D9DFC25E7CA62C32',
    'date': 'Sun, 31 May 2020 18:48:48 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'financial_sentiment/data/dp1_train.csv'},
   {'Key': 'financial_sentiment/model/sagemaker-scikit-learn-2020-05-31-07-53-37-211/debug-output/training_job_end.ts'},
   {'Key': 'financial_sentiment/model/sagemaker-scikit-learn-2020-05-31-07-53-37-211/output/model.tar.gz'},
   {'Key': 'financial_sentiment/data/vocab.csv'},
   {'Key': 'financial_sentiment/data/dp1_test.csv'},
   {'Key': 'sagemaker-scikit-learn-2020-05-31-07-53-37-211/source/sourced