# Plagiarism Detection Model

Now that we've created training and test data, we are ready to define and train a model. Our goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features you provide the model.

This section will be broken down into a few discrete steps:

* Upload data to S3.
* Define a binary classification model and a training script.
* Train model and deploy it.
* Evaluate deployed classifier.


## Load Data to S3

In the 'Plagiarism Detection, Feature Engineering' section, we created two files, `training.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data. Here we load the AWS SageMaker libraries and get the session's default S3 bucket.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# default S3 bucket for this session (created if it does not already exist)
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

In [3]:
data_dir = 'plagiarism_data'

# set prefix 
prefix = 'plagiarism_detection_model'

# upload all data to S3
location_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(location_data)

s3://sagemaker-us-east-1-194770695442/plagiarism_detection_model


### Test cell

Check that the data has been uploaded successfully. The cell below prints the keys of the objects in the S3 bucket and raises an error if the bucket is empty.

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-22-16-646/profiler-output/system/incremental/2021120820/1638995100.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-22-16-646/profiler-output/system/incremental/2021120820/1638995160.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-29-17-923/profiler-output/system/incremental/2021120820/1638995460.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-29-17-923/profiler-output/system/incremental/2021120820/1638995520.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-35-42-665/profiler-output/system/incremental/2021120820/1638995880.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-35-42-665/profiler-output/system/incremental/2021120820/1638995940.algo-1.json
plagiarism_detection_model/sagemaker-scikit-learn-2021-12-08-20-40-45-210/profiler-output/system/incremental/2021120820/1638996180.algo-1.json
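
The check above passes whenever the bucket holds any object at all, including output left behind by earlier training jobs. A stricter variant, sketched below, looks for the specific data files under the upload prefix. The helper name `missing_keys` and the example listing are hypothetical, and the file names are taken from the text above; substitute whatever the feature-engineering step actually wrote.

```python
def missing_keys(found_keys, prefix, expected_files):
    """Return the expected S3 keys (prefix/filename) that are absent from found_keys."""
    found = set(found_keys)
    expected = [f"{prefix}/{name}" for name in expected_files]
    return [key for key in expected if key not in found]


# Hand-written example listing (hypothetical keys; no AWS call is made here).
listing = [
    "plagiarism_detection_model/test.csv",
    "plagiarism_detection_model/example-job/output.json",
]

print(missing_keys(listing, "plagiarism_detection_model", ["test.csv"]))
# []
print(missing_keys(listing, "plagiarism_detection_model", ["training.csv", "test.csv"]))
# ['plagiarism_detection_model/training.csv']
```

In the notebook, `found_keys` could be built from `obj.key` for each object in `boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix)`, and the assertion would then be that `missing_keys(...)` returns an empty list.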

---

# Modeling

After uploading the training data, it's time to define and train a model!


## Complete a training script 

A typical training script:
* Loads training data from a specified directory.
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs).
* Instantiates a model of your design, with any specified hyperparameters.
* Trains that model.
* Saves the model so that it can be hosted/deployed later.
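
The parsing step in the list above could be sketched as follows. Inside a SageMaker training container, the model directory, the output directory, and the location of each input channel are exposed as environment variables (`SM_MODEL_DIR`, `SM_OUTPUT_DATA_DIR`, and `SM_CHANNEL_TRAIN` for a channel named `train`). The local fallback paths and the `--C` hyperparameter are assumptions made for illustration, and this is not necessarily how `train.py` in this project is written.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse SageMaker-provided paths plus any model hyperparameters."""
    parser = argparse.ArgumentParser()

    # Paths set by SageMaker inside the training container.
    # The fallback values are hypothetical paths for running locally.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))
    parser.add_argument("--output-data-dir", type=str,
                        default=os.environ.get("SM_OUTPUT_DATA_DIR", "output"))

    # Hypothetical hyperparameter: regularization strength for a linear SVM.
    parser.add_argument("--C", type=float, default=1.0)

    return parser.parse_args(argv)


# Passing an explicit list mimics command-line arguments.
args = parse_args(["--C", "0.5"])
print(args.C)  # 0.5
```

Reading the defaults from the environment lets the same script run unchanged inside the container and on a local machine, where the paths can be given on the command line instead.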

### Defining and training a model

In [3]:
!pygmentize source_sklearn/train.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m print_function

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m

[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36mexternals[39;49;00m [34mimport[39;49;00m joblib


[34mfrom[39;49;00m [04m[36msklearn[39;49;00m[04m[36m.[39;49;00m[04m[36msvm[39;49;00m [34mimport[39;49;00m LinearSVC


[37m# model load function[39;49;00m
[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load model from the model_dir. This is the same model that is saved[39;49;00m
[33m    in the main if statement.[39;49;00m
[33m    """[39;49;00m
    [36mprint[39;49;00m([33m"[39;49;00m[33mLoading model.[39;49;00m[33m"[39;49;00m)

    [37m# load using joblib[39;49;00m
    model = joblib.load(os.path.join(mo

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. To run a custom training script in SageMaker, construct an estimator with the following constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory.
* **role**: Role ARN, which was specified above.
* **instance_count**: The number of training instances.
* **instance_type**: The type of SageMaker instance for training.
* **sagemaker_session**: The session used to train on SageMaker.


## Define a Scikit-learn or PyTorch estimator


In [6]:
# estimator
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(entry_point="train.py",
                            source_dir="source_sklearn",
                            role=role,
                            instance_count=1,
                            instance_type='ml.c4.xlarge',
                            framework_version='0.20.0',
                            py_version='py3',
                            output_path='s3://{}/{}'.format(bucket, prefix),
                            sagemaker_session=sagemaker_session)

## Train the estimator

Train the estimator on the training data stored in S3. This creates a training job that we can monitor in the SageMaker console.

In [7]:
%%time

# Train estimator on S3 training data
sklearn_estimator.fit({'train': location_data})

2021-12-08 20:50:23 Starting - Starting the training job...
2021-12-08 20:50:44 Starting - Launching requested ML instancesProfilerReport-1638996623: InProgress
......
2021-12-08 20:51:47 Starting - Preparing the instances for training.........
2021-12-08 20:53:20 Downloading - Downloading input data...
2021-12-08 20:53:54 Training - Downloading the training image...
2021-12-08 20:54:22 Uploading - Uploading generated training model.
2021-12-08 20:54:18,579 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2021-12-08 20:54:18,581 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-12-08 20:54:18,591 sagemaker_sklearn_container.training INFO     Invoking user training script.
2021-12-08 20:54:19,059 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-12-08 20:54:19,072 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus i

## Deploy the trained model

After training, we deploy the model to create a `predictor`.


In [8]:
%%time
# deploy the model to create a predictor
predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1)

--------!CPU times: user 175 ms, sys: 999 µs, total: 176 ms
Wall time: 4min 2s


---
# Evaluating The Model

Once the model is deployed, we can see how it performs when applied to test data.


In [9]:
import os

# reading in test data
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

##  Determine the accuracy of the model

Use the deployed `predictor` to generate predicted class labels for the test data. Comparing those to the *true* labels, `test_y`, gives the accuracy: a value between 0 and 1.0 that indicates the fraction of test data that the model classified correctly.

In [10]:
# Generate predicted class labels
test_y_preds = predictor.predict(test_x)

# test that the model generates the correct number of labels
assert len(test_y_preds) == len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [11]:
# Calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

# printing out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
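
To make the accuracy figure concrete, the sketch below recomputes it in plain Python as the fraction of matching labels, and also counts true/false positives and negatives, which accuracy alone does not show. The function names `accuracy` and `confusion_counts` are illustrative rather than part of the project; the label list is copied from the output above, where the predicted and true labels are identical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the plagiarized class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Labels copied from the notebook output above (predicted == true).
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
          1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

print(accuracy(labels, labels))          # 1.0
print(confusion_counts(labels, labels))  # (15, 0, 10, 0)
```

With only 25 test examples, each misclassified answer would change the accuracy by 0.04, so the score is a coarse estimate of how the model behaves on unseen data.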


---
## Further Directions

There are many ways to improve or extend this project. A few ideas are listed below:
* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a lambda function to deploy your model to a web application.

