# Data Preparation and Training

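In this notebook, we train a question-answering model on Amazon SageMaker. Sentiment extraction is framed as extractive question answering: the sentiment label is the question, and the text is the context.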

---

# Data Preparation

## Load required libraries

In [1]:
import json
import os

import boto3
import numpy as np
import pandas as pd

import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace
from sagemaker.s3 import S3Downloader, S3Uploader

# Local folder that holds train.csv
data_dir = "../inputs"

# Session, region, IAM role, and default S3 bucket used by the cells below
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
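
The training log further down lists the columns `textID`, `text`, `selected_text`, `sentiment`, and `answer_start`. The notebook does not show how `answer_start` was produced, so the following is only a minimal sketch, assuming it is the character offset of `selected_text` inside `text`. The helper name `add_answer_start` and the sample row are hypothetical.

```python
import pandas as pd


def add_answer_start(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an `answer_start` column (assumed definition)."""
    df = df.copy()
    # str.find returns -1 when selected_text does not occur in text.
    df["answer_start"] = [
        str(text).find(str(selected))
        for text, selected in zip(df["text"], df["selected_text"])
    ]
    return df


# Hypothetical sample row using the column names seen in the training log.
example = pd.DataFrame(
    {
        "textID": ["a1"],
        "text": ["Tommy is a ridiculous but good dog."],
        "selected_text": ["good"],
        "sentiment": ["positive"],
    }
)
prepared = add_answer_start(example)
print(prepared[["sentiment", "selected_text", "answer_start"]])
```

Rows where `answer_start` is -1 would have no usable answer span and would need to be dropped or fixed before training.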

---

# Train a RoBERTa model using Amazon SageMaker

In [5]:
train_file_location = f"{data_dir}/train.csv"
prefix = "sentiment_extraction/data"

# Upload the training file to the default bucket under the prefix.
S3Uploader.upload(train_file_location, f"s3://{bucket}/{prefix}")

# The training job reads everything under this S3 prefix.
inputs = f"s3://{bucket}/{prefix}"
print(inputs)


s3://sagemaker-ap-south-1-296512243111/sentiment_extraction/data


In [13]:
hyperparameters = {
    "model_name": "roberta-base",
    "batch_size": 16,
    "epochs": 5,
    "lr": 2e-5,
}


In [14]:
local_script_location = "../src"
hub = {
    "HF_TASK": "question-answering",  # NLP task to use for predictions
}
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir=local_script_location,
    env=hub,
    instance_type="ml.g4dn.12xlarge",  # alternative: "ml.p2.xlarge"
    instance_count=1,
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
)
huggingface_estimator.fit(inputs)



2023-01-09 11:45:05 Starting - Starting the training job...
2023-01-09 11:45:34 Starting - Preparing the instances for trainingProfilerReport-1673264705: InProgress
.........
2023-01-09 11:46:54 Downloading - Downloading input data
2023-01-09 11:46:54 Training - Downloading the training image.........
2023-01-09 11:48:34 Training - Training image download completed. Training in progress.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-01-09 11:49:09,105 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-01-09 11:49:09,152 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-01-09 11:49:09,154 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-01-09 11:49:09,448 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{


  Attempting uninstall: transformers
    Found existing installation: transformers 4.6.1
    Uninstalling transformers-4.6.1:
      Successfully uninstalled transformers-4.6.1
Successfully installed huggingface-hub-0.4.0 tokenizers-0.12.1 transformers-4.18.0
Model will be saved in - /opt/ml/model
Path to data folder - /opt/ml/input/data/training
Contents of folder /opt/ml/input/data/training - ['train.csv']
/opt/ml/input/data/training/train.csv
Shape of training dataset - (27481, 5)
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a 

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: answer_start, sentiment, textID, selected_text, text. If answer_start, sentiment, textID, selected_text, text are not expected by `RobertaForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5497
  Batch size = 16
{'eval_loss': 1.0898241996765137, 'eval_runtime': 27.1365, 'eval_samples_per_second': 202.569, 'eval_steps_per_second': 3.169, 'epoch': 1.0}
{'loss': 1.4402, 'learning_rate': 1.4186046511627909e-05, 'epoch': 1.45}
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForQuestionAnswering.forward` and have been ignored: answer_start, sentiment, textID, selected_text, text. If answer_start, sentiment, textID, selected_text, text are not expected by `RobertaForQuestionAnswering.forward`,  you can 


2023-01-09 12:19:00 Completed - Training job completed
Training seconds: 1944
Billable seconds: 1944
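
The contents of `train.py` are not shown in this notebook. In SageMaker script mode, the `hyperparameters` dictionary reaches the entry point as command-line arguments, and the container exposes the model and data locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`. The following is a minimal sketch of how a script could read them; the function name `parse_args` and the `--model_dir` / `--data_dir` options are assumptions, not the author's code. The fallback paths are the ones printed in the log above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the hyperparameters passed by the estimator (hypothetical layout)."""
    parser = argparse.ArgumentParser()
    # These names mirror the keys of the notebook's `hyperparameters` dict.
    parser.add_argument("--model_name", type=str, default="roberta-base")
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--lr", type=float, default=2e-5)
    # Locations set by the SageMaker container, with the logged paths as fallbacks.
    parser.add_argument(
        "--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--data_dir",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    # parse_known_args ignores any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training job would receive.
args = parse_args(["--model_name", "roberta-base", "--epochs", "5", "--lr", "2e-05"])
print(args.model_name, args.batch_size, args.epochs, args.lr)
```

Passing `argv` explicitly keeps the parser easy to exercise locally, outside a training job.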


In [16]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")

----!

In [19]:
text = "Tommy is a ridiculous but good dog."

# Ask the model for the span that matches each sentiment label.
for sentiment in ["negative", "positive", "neutral"]:
    data = {
        "context": text,
        "question": sentiment,
    }
    print(predictor.predict(data))

{'score': 0.30175891518592834, 'start': 11, 'end': 21, 'answer': 'ridiculous'}
{'score': 0.25432121753692627, 'start': 26, 'end': 30, 'answer': 'good'}
{'score': 0.9555259943008423, 'start': 0, 'end': 35, 'answer': 'Tommy is a ridiculous but good dog.'}
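
In the responses above, `start` and `end` appear to be character offsets into `context`, with `end` exclusive. A minimal sketch that checks this against the three predictions copied from the output (the helper name `answer_matches_span` is hypothetical):

```python
def answer_matches_span(context: str, prediction: dict) -> bool:
    """Check that context[start:end] reproduces the returned answer."""
    return context[prediction["start"]:prediction["end"]] == prediction["answer"]


text = "Tommy is a ridiculous but good dog."

# Predictions copied from the cell output above.
predictions = [
    {"score": 0.30175891518592834, "start": 11, "end": 21, "answer": "ridiculous"},
    {"score": 0.25432121753692627, "start": 26, "end": 30, "answer": "good"},
    {"score": 0.9555259943008423, "start": 0, "end": 35,
     "answer": "Tommy is a ridiculous but good dog."},
]

checks = [answer_matches_span(text, p) for p in predictions]
print(checks)
```

A check like this is a cheap sanity test when post-processing endpoint responses, for example before highlighting the extracted span in the original text.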


In [23]:
sentences = [
    "Tommy is ridiculous but a good dog.",
    "Day went poor. Night was fantastic.",
    "Pity what a slave this beautiful boy is.",
    "Chelsea lost inspite of their remarkable play.",
]

# Query the endpoint once per (sentence, sentiment) pair.
for sentence in sentences:
    for sentiment in ["positive", "negative"]:
        data = {
            "context": sentence,
            "question": sentiment,
        }
        print("Sentence ", sentence)
        print("Sentiment ", sentiment)
        print(predictor.predict(data))

Sentence  Tommy is ridiculous but a good dog.
Sentiment  positive
{'score': 0.38624709844589233, 'start': 26, 'end': 30, 'answer': 'good'}
Sentence  Tommy is ridiculous but a good dog.
Sentiment  negative
{'score': 0.3069661855697632, 'start': 9, 'end': 19, 'answer': 'ridiculous'}
Sentence  Day went poor. Night was fantastic.
Sentiment  positive
{'score': 0.6471546292304993, 'start': 25, 'end': 35, 'answer': 'fantastic.'}
Sentence  Day went poor. Night was fantastic.
Sentiment  negative
{'score': 0.3586087226867676, 'start': 0, 'end': 14, 'answer': 'Day went poor.'}
Sentence  Pity what a slave this beautiful boy is.
Sentiment  positive
{'score': 0.45510128140449524, 'start': 0, 'end': 40, 'answer': 'Pity what a slave this beautiful boy is.'}
Sentence  Pity what a slave this beautiful boy is.
Sentiment  negative
{'score': 0.23526301980018616, 'start': 0, 'end': 17, 'answer': 'Pity what a slave'}
Sentence  Chelsea lost inspite of their remarkable play.
Sentiment  positive
{'score': 0.371