# This Notebook Must Be Executed In A SageMaker Instance

# Plagiarism Detection Model

Now that the train and test sets have been processed and written to `.csv`, we can focus on modeling the data and deploying the model endpoint.

We need to complete the following steps to finish the data science life-cycle for this project:

* Upload data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate deployed classifier.


## Load Data from S3

We already have `train.csv` and `test.csv` files with the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the session's default S3 bucket (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3


In [3]:
bucket

'sagemaker-us-east-2-875690977746'

In [4]:
data_dir = 'plagiarism_project'
prefix = 'data'

# upload all data to S3
sagemaker_session.upload_data(bucket=bucket, path='plagiarism_data/train.csv', key_prefix=data_dir+'/'+prefix)
sagemaker_session.upload_data(bucket=bucket, path='plagiarism_data/test.csv', key_prefix=data_dir+'/'+prefix)

's3://sagemaker-us-east-2-875690977746/plagiarism_project/data/test.csv'
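The returned URI above is simply the bucket, the key prefix, and the file name joined together, and the same pattern is what the training job will later be pointed at. A minimal sketch of that composition, where `s3_uri` is a hypothetical helper of mine and not part of the SageMaker SDK:

```python
import posixpath


def s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI for a single file uploaded under key_prefix.

    Illustrative helper only; it mirrors the URI printed above.
    """
    filename = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, filename))


# Values copied from the cells above
bucket = "sagemaker-us-east-2-875690977746"
key_prefix = "plagiarism_project/data"

train_uri = s3_uri(bucket, key_prefix, "plagiarism_data/train.csv")
test_uri = s3_uri(bucket, key_prefix, "plagiarism_data/test.csv")
print(train_uri)
print(test_uri)
```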

### Check S3 Resources

Test that the data has been successfully uploaded. The cell below collects the keys of the objects in your S3 bucket and raises an `AssertionError` if the bucket is empty.

In [5]:
# Confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'

---

# Modeling

Let's define and train a model!

The data is in S3, where the training job can reach it easily. I will be using the gradient boosting classifier from `sklearn` that we tested previously in `3_Modeling_Trials.ipynb`.

## Training script 

To implement a custom classifier, we need to complete a `train.py` script. 

A typical training script:

* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs)
* Instantiates a model of your design, with any specified hyperparameters
* Trains that model
* Finally, saves the model so that it can be hosted/deployed later

Since we have already tested a model and have one that we like, we will keep the framework in `train.py` but simply paste in the model parameters.

### Defining and training a model
We need a training file, `train.py`, with the following parts:

1. Import any libraries needed
2. Define additional model training hyperparameters using `parser.add_argument`
3. Define a model in the `if __name__ == '__main__':` section
4. Train the model in that same section
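The parts listed above can be sketched as follows. This is a minimal outline under my own assumptions, not the actual `source_sklearn/train.py`: the argument names, the local fallback directories, and the two hyperparameters are illustrative choices. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables that SageMaker sets inside the training container.

```python
# 1. Import any libraries needed
import argparse
import os


def parse_args(argv=None):
    """2. Define directories and hyperparameters with parser.add_argument."""
    parser = argparse.ArgumentParser()
    # SageMaker sets these variables in the training container; the fallback
    # values are placeholders so the sketch also runs on a local machine.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))
    # Hypothetical hyperparameters for a gradient boosting model
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # 3. Define the model here, using args.n_estimators and args.learning_rate.
    # 4. Train it on the CSV found in args.data_dir, then save it to args.model_dir.
    return args


# In a real script this call would sit under `if __name__ == '__main__':`
# with no arguments, so that the command line is parsed instead.
args = main(["--n_estimators", "50"])
print(args.n_estimators, args.learning_rate)
```

Passing an explicit argument list makes the parsing step easy to try out in a notebook cell without launching a training job.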

I am using the training script below, which I adapted from the Udacity course. Much of this file is pre-filled and is needed for running a custom PyTorch deep learning model. The SageMaker SKLearn model wrapper has its own training script, but since we are uploading a model from a local training job, I believe this custom script is needed.

In [7]:
!pygmentize source_sklearn/train.py

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

from sklearn.ensemble import GradientBoostingClassifier

# This is a general framework for testing models in SageMaker. 
# I'm keeping the structure the same, but using some shortcuts to make the process smoother.

# Model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    
    # load using joblib
    model = joblib.lo

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that will be executed when the model is trained: the `train.py` script described above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory.
* **role**: Role ARN, which was specified above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.
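As far as I understand it, the entries of the `hyperparameters` dictionary reach the training script as command-line arguments of the form `--name value`, which is why the script declares them with `parser.add_argument`. The sketch below imitates that hand-off locally. `to_cli_args` is a hypothetical helper that only mimics the behaviour (it is not SageMaker code), and the hyperparameter values are made up.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...].

    Illustration only: this imitates how the values arrive at the script.
    """
    cli = []
    for name, value in hyperparameters.items():
        cli.extend(["--" + name, str(value)])
    return cli


# Hypothetical values, as they might be passed to the estimator
hyperparameters = {"n_estimators": 200, "max_depth": 3}

# The training script's side: one add_argument call per hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--max_depth", type=int, default=3)

args = parser.parse_args(to_cli_args(hyperparameters))
print(args.n_estimators, args.max_depth)
```

Note that the values arrive as strings, so the `type=` argument on each `add_argument` call is what turns them back into numbers.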



In [8]:
s3_output_path = "s3://{}/{}/{}/output".format(bucket, data_dir, prefix)
s3_output_path

's3://sagemaker-us-east-2-875690977746/plagiarism_project/data/output'

In [9]:
from sagemaker.sklearn.estimator import SKLearn
# Define the estimator
gbm_estimator = SKLearn(role=role,
                        entry_point='train.py',
                        source_dir='source_sklearn',
                        train_instance_count=1,
                        train_instance_type='ml.c4.xlarge',
                        sagemaker_session=sagemaker_session,
                        output_path=s3_output_path
                        )

## Train the estimator

Create a training job that can be monitored in the SageMaker console.

In [10]:
%%time

test_path = f's3://{bucket}/{data_dir}/{prefix}/test.csv'
train_path = f's3://{bucket}/{data_dir}/{prefix}/train.csv'

data_channels = {
    "train": train_path,
    "test": test_path
}

# Train estimator on S3 training data
gbm_estimator.fit(inputs=data_channels)

2020-04-23 18:43:43 Starting - Starting the training job......
2020-04-23 18:44:13 Starting - Launching requested ML instances...
2020-04-23 18:45:11 Starting - Preparing the instances for training......
2020-04-23 18:45:55 Downloading - Downloading input data...
2020-04-23 18:46:42 Training - Training image download completed. Training in progress..
2020-04-23 18:46:42,901 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-04-23 18:46:42,903 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-04-23 18:46:42,913 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-04-23 18:46:43,201 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-04-23 18:46:43,201 sagemaker-containers INFO     Generating setup.cfg
2020-04-23 18:46:43,201 sagemaker-containers INFO     Generating MANIFEST.in

## Deploy the trained model

After training, we can deploy the model to create a `predictor`.

To deploy a trained model, we use `<model>.deploy`, which takes two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.


In [11]:
%%time

# deploy the model to create a predictor
predictor = gbm_estimator.deploy(initial_instance_count=1,
                                instance_type='ml.t2.medium')


-----------------!
CPU times: user 276 ms, sys: 18.6 ms, total: 295 ms
Wall time: 8min 32s


---
# Evaluating The Model

Once the model is deployed, we can see how it performs on the test data. The model was trained on the training set only, so this is the long-awaited hold-out set. We could have evaluated on the test data earlier, but I wanted to send it through the deployed endpoint, both as proof that the model is actually deployed and because using SageMaker is the real goal of this project.


In [12]:
print(data_dir)

plagiarism_project


In [13]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join('plagiarism_data', "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Accuracy of the model

We will use the deployed `predictor` to generate predicted class labels for the test data, then compare those to the *true* labels, `test_y`, to calculate the fraction of test cases the model classified correctly.

In [14]:
# Predict
test_y_preds = predictor.predict(test_x)

# Test
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'

In [15]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [16]:
# Accuracy score
accuracy = accuracy_score(y_true=test_y, y_pred=test_y_preds)
cls_report = classification_report(test_y, test_y_preds)
confusion_mat = confusion_matrix(test_y, test_y_preds)
# accuracy_score returns a fraction in [0, 1], not a percentage
print('accuracy:', accuracy)


# Print out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

accuracy: 0.92

Predicted class labels: 
[0 0 0 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 0 0]

True class labels: 
[0 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0]


In [17]:
print(cls_report)

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.87      0.93        15

   micro avg       0.92      0.92      0.92        25
   macro avg       0.92      0.93      0.92        25
weighted avg       0.93      0.92      0.92        25



In [18]:
print(confusion_mat)

[[10  0]
 [ 2 13]]


Two misclassified cases on the test set, not bad. Both were false negatives, which seems like the right side to err on: I would rather the model miss a case than flag authentic work as plagiarized. As expected, the model scored slightly lower on the test data than on the training data, which is typical when evaluating on held-out data.
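As a sanity check, the numbers in the report and the confusion matrix can be recomputed by hand from the two label arrays printed above. This sketch uses plain Python rather than `sklearn`, and it assumes that label 1 is the positive class.

```python
# Label arrays copied from the output above
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))

# Count the four cells of the confusion matrix (positive class = 1)
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(pairs)   # share of all cases classified correctly
precision = tp / (tp + fp)          # share of predicted positives that are right
recall = tp / (tp + fn)             # share of true positives that were found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", round(accuracy, 2))
print("precision (class 1):", round(precision, 2))
print("recall (class 1):", round(recall, 2))
```

The counts line up with the confusion matrix rows (`[TN FP]` and `[FN TP]`), and the rounded precision and recall match the class 1 row of the classification report.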

----
## Clean up Resources

These are paid AWS services, so always delete endpoints once you are finished to avoid incurring additional costs.

In [19]:
predictor.delete_endpoint()