# Plagiarism Detection Model

## Outline:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the session's default S3 bucket (created if it does not already exist)
bucket = sagemaker_session.default_bucket()

## Uploading the training data to S3

In [3]:
# name of directory created to save features data
data_dir = 'plagiarism_data'

# set prefix
prefix = 'plagiarism-detection'

# upload all data to S3
input_data = sagemaker_session.upload_data(path = data_dir, bucket = bucket, key_prefix = prefix)

print(input_data)

s3://sagemaker-us-west-1-035057502445/plagiarism-detection


### Test cell

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism-detection/test.csv
plagiarism-detection/train.csv
sagemaker-pytorch-2020-04-19-22-26-49-362/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-20-19-03-06-148/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-20-19-10-00-845/debug-output/training_job_end.ts
sagemaker-pytorch-2020-04-20-19-10-00-845/output/model.tar.gz
sagemaker-pytorch-2020-04-20-19-10-00-845/source/sourcedir.tar.gz
sagemaker-pytorch-2020-04-20-21-32-04-073/sourcedir.tar.gz
sagemaker/sentiment_rnn/train.csv
sagemaker/sentiment_rnn/word_dict.pkl
Test passed!
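
The listing above also contains keys from earlier, unrelated jobs, so a non-empty bucket does not by itself show that the plagiarism data was uploaded. A minimal sketch of a stricter check, using a few of the key names printed above; `keys_under_prefix` is a hypothetical helper, not part of boto3 or SageMaker:

```python
def keys_under_prefix(keys, prefix):
    """Return only the keys stored under `prefix/` (hypothetical helper)."""
    folder = prefix.rstrip('/') + '/'
    return [key for key in keys if key.startswith(folder)]

# A few of the keys printed by the test cell above
listed_keys = [
    'plagiarism-detection/test.csv',
    'plagiarism-detection/train.csv',
    'sagemaker-pytorch-2020-04-20-19-10-00-845/output/model.tar.gz',
    'sagemaker/sentiment_rnn/train.csv',
]

# Keep only the objects that belong to this project
uploaded = keys_under_prefix(listed_keys, 'plagiarism-detection')
assert len(uploaded) != 0, 'No objects found under the prefix.'
print(uploaded)
```

With boto3, the same filtering can be done on the S3 side by iterating over `Bucket(bucket).objects.filter(Prefix=prefix)` instead of `objects.all()`.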


---

# Modeling

I am going to use scikit-learn's `LinearSVC` model. This is a binary classification problem with few samples. A linear support vector classifier is a suitable choice because it outputs class labels (1s and 0s) directly, whereas logistic regression is typically used to estimate class probabilities.
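
The difference between the two kinds of output can be sketched in plain Python: a linear SVM assigns a label from the sign of a linear score, while logistic regression passes the same kind of score through a sigmoid to get a probability. The weights, bias, and feature values below are made up for illustration and are not taken from the trained model:

```python
import math

def linear_score(weights, bias, features):
    """Weighted sum of the features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def svm_label(score):
    """Hard class label: 1 if the score is positive, otherwise 0."""
    return 1 if score > 0 else 0

def logistic_probability(score):
    """Sigmoid of the score, read as the probability of class 1."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical model parameters and one hypothetical feature vector
weights = [2.0, -1.0]
bias = -0.25
features = [0.75, 0.25]

score = linear_score(weights, bias, features)
print(svm_label(score))             # a 0/1 label
print(logistic_probability(score))  # a value between 0 and 1
```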

## Completing a training script 

The training script, `train.py`, is in the `source_sklearn` directory of the project.

### Defining and training a model

In [6]:
!pygmentize source_sklearn/train.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

## TODO: Import any additional libraries you need to define a model
from sklearn.svm import LinearSVC

# Provided model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    
    # load using joblib
    model = joblib.load(os.pa

---
# Creating an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. I am going to fill in the following arguments when constructing the estimator.

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: The role ARN, which was specified above.
* **train_instance_count**: The number of training instances.
* **train_instance_type**: The type of SageMaker instance for training.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters**: A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.
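
As a rough sketch of how a training script can receive these values: SageMaker passes hyperparameters to the entry point as command-line arguments and exposes the data and model locations through environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. The `--max_iter` hyperparameter and the local fallback paths below are assumptions for illustration; the full `train.py` is not shown in this notebook.

```python
import argparse
import os

def parse_training_args(argv=None):
    """Parse the arguments a SageMaker-style training script might receive."""
    parser = argparse.ArgumentParser()

    # Locations set by SageMaker inside the training container; the
    # fallbacks are hypothetical local paths for runs outside SageMaker.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'plagiarism_data'))

    # Hypothetical hyperparameter, e.g. hyperparameters={'max_iter': 2000}
    parser.add_argument('--max_iter', type=int, default=1000)

    return parser.parse_args(argv)

# Simulate the command line that would carry the hyperparameter
args = parse_training_args(['--max_iter', '2000'])
print(args.model_dir, args.data_dir, args.max_iter)
```

Passing `argv` explicitly keeps the function easy to try out locally; inside a training job it would be called without arguments so that it reads the real command line.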


## Defining a Scikit-learn or PyTorch estimator

In [7]:
# import and estimator code
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(entry_point="train.py",
                    source_dir="source_sklearn",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge')

## Training the estimator

In [8]:
%%time

# Training the estimator on S3 training data

estimator.fit({'train': input_data})

2020-04-23 02:00:11 Starting - Starting the training job...
2020-04-23 02:00:13 Starting - Launching requested ML instances......
2020-04-23 02:01:15 Starting - Preparing the instances for training...
2020-04-23 02:02:00 Downloading - Downloading input data...
2020-04-23 02:02:23 Training - Downloading the training image..
2020-04-23 02:02:44,543 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-04-23 02:02:44,545 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-04-23 02:02:44,555 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-04-23 02:02:44,835 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-04-23 02:02:44,835 sagemaker-containers INFO     Generating setup.cfg
2020-04-23 02:02:44,835 sagemaker-containers INFO     Generating MANIFEST.in
2020-04-23 02:02:44,835 sag

## Deploying the trained model

In [9]:
%%time

# deploying the model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

---------------!
CPU times: user 264 ms, sys: 3.07 ms, total: 267 ms
Wall time: 7min 31s


---
# Evaluating The Model

In [10]:
import os

# read in test data
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determining the accuracy of the model

In [11]:
# generate predicted class labels
test_y_preds = predictor.predict(test_x)


# testing that the model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [12]:
# calculating the test accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

## print out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]


The model achieved 100% accuracy on the 25 test samples. With such a small test set this result should be read with caution; it may also reflect a very clear margin between the plagiarized and non-plagiarized texts.
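
One way to quantify the small-sample caveat is to put a confidence interval around the accuracy. The sketch below recomputes the accuracy from the labels printed above and adds a 95% Wilson score interval; the choice of the Wilson interval is my own addition and was not part of the original evaluation.

```python
import math

# Labels copied from the output printed above
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
             1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, center - half), min(1.0, center + half)

correct = sum(p == t for p, t in zip(predicted, true))
accuracy = correct / len(true)
low, high = wilson_interval(correct, len(true))

print(accuracy)
print(round(low, 3), round(high, 3))
```

For 25 correct predictions out of 25, the lower bound comes out at roughly 0.87, so the data are also consistent with a true accuracy noticeably below 100%.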

----
## Cleaning up Resources

In [13]:
predictor.delete_endpoint()

### Emptying the S3 bucket

In [14]:
# delete all objects in the bucket (the bucket itself is not deleted)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '677CD6CFCAAA9114',
   'HostId': 'izKqB83KPVnMa49fy33x34q3hcgHDfRHqetsFnzXGBBUYxpLgg4YJbZBXofxohn+2UW01olMch4=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'izKqB83KPVnMa49fy33x34q3hcgHDfRHqetsFnzXGBBUYxpLgg4YJbZBXofxohn+2UW01olMch4=',
    'x-amz-request-id': '677CD6CFCAAA9114',
    'date': 'Thu, 23 Apr 2020 02:26:20 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-scikit-learn-2020-04-23-02-00-10-659/source/sourcedir.tar.gz'},
   {'Key': 'plagiarism-detection/test.csv'},
   {'Key': 'sagemaker/sentiment_rnn/train.csv'},
   {'Key': 'sagemaker-pytorch-2020-04-20-19-10-00-845/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2020-04-19-22-26-49-362/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2020-04-23-02-00-10-659/output/model.tar.gz'},
   {'Key': 'sagemaker/sentiment_rnn