# An Endpoint using BERT large model (uncased) with the MS_MARCO dataset

---

## Background
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained, transformer-based
language model that has achieved state-of-the-art results on numerous natural language processing
(NLP) tasks. It can be fine-tuned for various downstream tasks, including question answering.
The Stanford Question Answering Dataset (SQuAD) is one benchmark on which BERT has achieved
significantly higher accuracy than other models. Research also indicates that fine-tuning BERT
on new datasets can further improve its performance on specific tasks. In this proposal, we
suggest fine-tuning a pre-trained BERT large model (uncased, whole word masking) on the
MS_MARCO dataset to improve its performance on question answering.

The BERT model is built in two stages: pre-training and fine-tuning. The pre-training process is
based on unsupervised objectives, namely masked language modeling and next sentence prediction,
which let the model learn contextual representations of words. The fine-tuning process,
on the other hand, uses supervised training, where the pre-trained weights are further trained
on a task-specific dataset to achieve better performance.
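
To make the idea of whole word masking concrete, here is a minimal sketch in plain Python. It assumes WordPiece-style tokens, where a `##` prefix marks the continuation of the previous word. The function name `whole_word_mask` and the example tokens are illustrative, not part of any library, and the sketch is simplified: it always substitutes `[MASK]`, whereas BERT's actual pre-training also sometimes keeps the token or swaps in a random one.

```python
import random

MASK = "[MASK]"

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask whole words in a list of WordPiece tokens.

    Tokens starting with '##' continue the previous word, so they are
    grouped with it and masked (or kept) together.
    """
    rng = random.Random(seed)

    # Group token positions into words
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Decide once per word, then mask every piece of a chosen word
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked

# Hypothetical tokenization of "the philharmonic played well"
tokens = ["the", "phil", "##har", "##monic", "played", "well"]
print(whole_word_mask(tokens, mask_prob=0.5))
```

The point of the grouping step is that `phil`, `##har`, and `##monic` are never masked independently, so the model cannot recover a masked piece from its sibling pieces.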

MS_MARCO is a dataset consisting of a large collection of real-world queries and corresponding 
passages that aims to facilitate research in question-answer matching and ranking. It contains more 
than 1 million queries and over 8 million passages, making it one of the largest publicly available 
datasets of its kind. Each query in the dataset was generated by real users of the Bing search engine 
and is paired with relevant passages retrieved from web pages. The dataset is designed to enable 
researchers to develop and evaluate machine learning models for natural language processing tasks 
such as question answering, information retrieval, and passage ranking. The dataset also includes 
information about the relevance of passages to queries, allowing for the evaluation of ranking 
algorithms. Overall, MS_MARCO is a valuable resource for anyone interested in
developing and evaluating algorithms for natural language understanding tasks. This notebook
loads the smaller v1.1 release, which contains roughly 100,000 queries.


---

## Preparation

Our proposal is to serve the BERT large model (uncased, whole word masking), fine-tuned on the
MS_MARCO dataset, as an endpoint that answers questions. To do so, we will go through the following steps:
* Preparing the data: We will use the MS_MARCO dataset to fine-tune the BERT large model. We will construct training, validation, and test sets from the available dataset.
* Fine-tuning the model: We will fine-tune the pre-trained BERT model on the MS_MARCO dataset. We will use the Adam optimizer for training and track model performance with the F1 score.
* Evaluating the model: We will evaluate the fine-tuned model's performance on the MS_MARCO test set. The key evaluation measures will be the F1 score and the exact match (EM) score.
* Deploying the model.
* Scaling the model so that we can make this application available to many more users.
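
The two evaluation measures named above can be sketched as follows. This is a minimal illustration in the style of the SQuAD evaluation (lowercasing, stripping punctuation and English articles, then comparing tokens), not the exact scoring code used later in this notebook. The function names are chosen here for illustration.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))

def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("about 15,000 per year", "approximately 15,000 per year"))
```

EM rewards only a perfect match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.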

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [20]:
# cell 01
import sagemaker
bucket=sagemaker.Session().default_bucket()
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis

In [21]:
# cell 02
!pip install -qU datasets
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives
from datasets import load_dataset



---

## Data
Let's start by importing the dataset from Hugging Face.

In [22]:
validation_data, train_data, test_data = load_dataset('ms_marco', 'v1.1', split =['validation','train', 'test'], 
                                                        cache_dir='/media/data_files/github/website_tutorials/data')

Found cached dataset ms_marco (/media/data_files/github/website_tutorials/data/ms_marco/v1.1/1.1.0/b6a62715fa5219aea5275dd3556601004cd63945cb63e36e022f77bb3cbbca84)



Now let's read this into a pandas DataFrame and take a look.

In [23]:
data = pd.concat([pd.DataFrame(validation_data), pd.DataFrame(train_data), pd.DataFrame(test_data)],ignore_index=True)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)
data

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers
0,"[Approximately $15,000 per year.]","{'is_selected': [1, 0, 0, 0, 0, 0], 'passage_t...",walgreens store sales average,9652,numeric,[]
1,"[$21,550 per year, The average hourly wage for...","{'is_selected': [0, 1, 0, 0, 0, 0, 0, 0], 'pas...",how much do bartenders make,9653,numeric,[]
2,"[A boil, also called a furuncle, is a deep fol...","{'is_selected': [0, 0, 0, 0, 0, 0, 1, 0], 'pas...",what is a furuncle boil,9654,description,[]
3,"[Detect and assess a wide range of disorders, ...","{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0], '...",what can urinalysis detect,9655,description,[]
4,"[Shigellosis, diseases of the nervous system, ...","{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0], '...",what is vitamin a used for,9656,description,[]
...,...,...,...,...,...,...
102018,[Can last 3-4 days in the fridge as long as it...,"{'is_selected': [0, 0, 0, 1, 0, 0, 0], 'passag...",how long can you keep bacon in the fridge,9647,numeric,[]
102019,[Body mass index (BMI) the weight in kilograms...,"{'is_selected': [0, 0, 0, 0, 0, 0, 1, 0, 0], '...",what is growth bmi mean in medical terms,9648,description,[]
102020,[Yes],"{'is_selected': [0, 0, 0, 0, 0, 1, 0, 0, 0], '...",can an llc apply for more than one dba,9649,description,[]
102021,['Bisque' is a shade of White that is 23% satu...,"{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0], 'pas...",bisque color definition,9650,entity,[]


In [24]:
file_name = 'MS_Marco.csv' 
data.to_csv(file_name)

We will store this natively in S3 to then process it with SageMaker Processing.

In [25]:
from sagemaker import Session

prefix = 'final_project'
s3 = boto3.resource('s3')
s3.meta.client.upload_file(file_name, bucket, f'{prefix}/input_data/MS_Marco.csv')

sess = Session()
input_source = sess.upload_data(file_name, bucket=bucket, key_prefix=f'{prefix}/input_data')
input_source

's3://sagemaker-us-east-1-539173668697/final_project/input_data/MS_Marco.csv'

---
# Feature Engineering 

Here, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server.

For this project, we're using a SageMaker prebuilt Scikit-learn container, which includes many common functions for processing data. Moreover, when the job is complete, SageMaker Processing automatically uploads the transformed data to S3.

## STEP 1: Analyse the data to decide what cleaning is needed

Let's flatten the `passages` column into separate columns and show the data again.

In [38]:
df = pd.concat([data.drop(['passages'], axis=1), data['passages'].apply(pd.Series)], axis=1)

Unnamed: 0,answers,query,query_id,query_type,wellFormedAnswers,is_selected,passage_text
0,"[Approximately $15,000 per year.]",walgreens store sales average,9652,numeric,[],"[1, 0, 0, 0, 0, 0]",[The average Walgreens salary ranges from appr...
1,"[$21,550 per year, The average hourly wage for...",how much do bartenders make,9653,numeric,[],"[0, 1, 0, 0, 0, 0, 0, 0]",[A bartender’s income is comprised mostly of t...
2,"[A boil, also called a furuncle, is a deep fol...",what is a furuncle boil,9654,description,[],"[0, 0, 0, 0, 0, 0, 1, 0]","[Knowledge center. A boil, also known as a fur..."
3,"[Detect and assess a wide range of disorders, ...",what can urinalysis detect,9655,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Urinalysis: One way to test for bladder cance...
4,"[Shigellosis, diseases of the nervous system, ...",what is vitamin a used for,9656,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Since vitamin A is fat-soluble it is not need...


In [40]:
df

Unnamed: 0,answers,query,query_id,query_type,wellFormedAnswers,is_selected,passage_text,url
0,"[Approximately $15,000 per year.]",walgreens store sales average,9652,numeric,[],"[1, 0, 0, 0, 0, 0]",[The average Walgreens salary ranges from appr...,"[http://www.indeed.com/cmp/Walgreens/salaries,..."
1,"[$21,550 per year, The average hourly wage for...",how much do bartenders make,9653,numeric,[],"[0, 1, 0, 0, 0, 0, 0, 0]",[A bartender’s income is comprised mostly of t...,[http://www.breakintobartending.com/how-much-d...
2,"[A boil, also called a furuncle, is a deep fol...",what is a furuncle boil,9654,description,[],"[0, 0, 0, 0, 0, 0, 1, 0]","[Knowledge center. A boil, also known as a fur...",[http://www.medicalnewstoday.com/articles/1854...
3,"[Detect and assess a wide range of disorders, ...",what can urinalysis detect,9655,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Urinalysis: One way to test for bladder cance...,[http://www.cancer.org/cancer/bladdercancer/de...
4,"[Shigellosis, diseases of the nervous system, ...",what is vitamin a used for,9656,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Since vitamin A is fat-soluble it is not need...,[http://www.myfooddiary.com/Resources/nutrient...
...,...,...,...,...,...,...,...,...
102018,[Can last 3-4 days in the fridge as long as it...,how long can you keep bacon in the fridge,9647,numeric,[],"[0, 0, 0, 1, 0, 0, 0]",[Most people tend to still eat the meat a few ...,[https://answers.yahoo.com/question/index?qid=...
102019,[Body mass index (BMI) the weight in kilograms...,what is growth bmi mean in medical terms,9648,description,[],"[0, 0, 0, 0, 0, 0, 1, 0, 0]",[In this article. Your BMI is based on your he...,"[http://www.webmd.com/men/weight-loss-bmi, htt..."
102020,[Yes],can an llc apply for more than one dba,9649,description,[],"[0, 0, 0, 0, 0, 1, 0, 0, 0]",[Typically a DBA is only required when a busin...,[http://smallbiztrends.com/2012/03/does-your-b...
102021,['Bisque' is a shade of White that is 23% satu...,bisque color definition,9650,entity,[],"[0, 0, 0, 0, 1, 0, 0, 0]",[Definition of BISQUE. : odds allowed an infer...,[http://www.merriam-webster.com/dictionary/bis...


In [42]:
df.drop(['url'], axis=1).head()

Unnamed: 0,answers,query,query_id,query_type,wellFormedAnswers,is_selected,passage_text
0,"[Approximately $15,000 per year.]",walgreens store sales average,9652,numeric,[],"[1, 0, 0, 0, 0, 0]",[The average Walgreens salary ranges from appr...
1,"[$21,550 per year, The average hourly wage for...",how much do bartenders make,9653,numeric,[],"[0, 1, 0, 0, 0, 0, 0, 0]",[A bartender’s income is comprised mostly of t...
2,"[A boil, also called a furuncle, is a deep fol...",what is a furuncle boil,9654,description,[],"[0, 0, 0, 0, 0, 0, 1, 0]","[Knowledge center. A boil, also known as a fur..."
3,"[Detect and assess a wide range of disorders, ...",what can urinalysis detect,9655,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Urinalysis: One way to test for bladder cance...
4,"[Shigellosis, diseases of the nervous system, ...",what is vitamin a used for,9656,description,[],"[0, 0, 0, 0, 1, 0, 0, 0, 0]",[Since vitamin A is fat-soluble it is not need...


In [26]:
data.head()

Unnamed: 0,answers,passages,query,query_id,query_type,wellFormedAnswers
0,"[Approximately $15,000 per year.]","{'is_selected': [1, 0, 0, 0, 0, 0], 'passage_t...",walgreens store sales average,9652,numeric,[]
1,"[$21,550 per year, The average hourly wage for...","{'is_selected': [0, 1, 0, 0, 0, 0, 0, 0], 'pas...",how much do bartenders make,9653,numeric,[]
2,"[A boil, also called a furuncle, is a deep fol...","{'is_selected': [0, 0, 0, 0, 0, 0, 1, 0], 'pas...",what is a furuncle boil,9654,description,[]
3,"[Detect and assess a wide range of disorders, ...","{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0], '...",what can urinalysis detect,9655,description,[]
4,"[Shigellosis, diseases of the nervous system, ...","{'is_selected': [0, 0, 0, 0, 1, 0, 0, 0, 0], '...",what is vitamin a used for,9656,description,[]


- The column `wellFormedAnswers` seems to contain only empty lists, but let's check:

In [27]:
data[data['wellFormedAnswers'].str.len()!=0].wellFormedAnswers

Series([], Name: wellFormedAnswers, dtype: object)

That is the case, so the column is not really useful and we'll drop it.

- Next, the column `passages` seems to hold too much information.

For training, we only need the context, which is `passage_text`. Some cleaning with regular expressions also seems to be needed, and we'll try to solve that problem.

- Let's see whether `query_id` needs to be cleaned:

In [28]:
len(data.query_id.unique()) == len(data)

True

Every row has a distinct `query_id`, so no cleaning is needed for this column.

- Finally, let's study the column `query_type`:

In [29]:
data.query_type.value_counts()

description    55684
numeric        28291
entity         10485
location        5068
person          2495
Name: query_type, dtype: int64

This column has 5 categories and no missing values seem to be present, so for now we don't need to modify it.
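
Since only the passage text is needed as context, one way to narrow it down further is to keep just the passages flagged in `is_selected`. The sketch below assumes each `passages` entry is a dict of parallel lists, as shown in the table output above. The helper name `selected_passages` and the example values are hypothetical.

```python
def selected_passages(passages):
    """Return the passage texts whose is_selected flag equals 1."""
    return [
        text
        for flag, text in zip(passages["is_selected"], passages["passage_text"])
        if flag == 1
    ]

# Hypothetical entry shaped like one cell of the `passages` column
example = {
    "is_selected": [0, 1, 0],
    "passage_text": ["first passage", "second passage", "third passage"],
    "url": ["http://a.example", "http://b.example", "http://c.example"],
}
print(selected_passages(example))  # ['second passage']
```

On the full DataFrame this could be applied row by row, for example with `data['passages'].apply(selected_passages)`.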

## STEP 2: Apply modifications to the dataset and split train/test/validation

!pip install -qU emoji
import emoji

def text_cleaning(text):
    # Collapse doubled periods into a single period
    cleaned_text = text.replace('..', '.')
    # Remove emoji characters (emoji.replace_emoji requires emoji >= 2.0)
    cleaned_text = emoji.replace_emoji(cleaned_text, replace='')

    return cleaned_text


In [46]:
%%writefile preprocessing.py

import ast
import re
import argparse
import os

import pandas as pd


def _parse_args():

    parser = argparse.ArgumentParser()

    # Input and output locations inside the processing container
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='MS_Marco.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    parser.add_argument('--categorical_features', type=str, default='answers, passages, query, query_id, query_type, wellFormedAnswers')

    return parser.parse_known_args()


def _clean_passage(text):
    # Collapse doubled periods and runs of whitespace.
    # Note: no 'th ' -> 'the' substitution, which would also alter words such as 'with '.
    return re.sub(r'\s+', ' ', text.replace('..', '.')).strip()


if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load data; the first CSV column is the index written by DataFrame.to_csv
    df = pd.read_csv(os.path.join(args.filepath, args.filename), index_col=0)

    # CHANGE 1: drop last column
    df = df.drop(['wellFormedAnswers'], axis=1)

    # CHANGE 2: split passage column and drop url of passages
    # The CSV stores each passages dict as a string, so parse it back first
    # (assumes the values were written as plain Python lists)
    passages = df['passages'].apply(ast.literal_eval).apply(pd.Series)
    df = pd.concat([df.drop(['passages'], axis=1), passages], axis=1)
    df = df.drop(['url'], axis=1)

    # CHANGE 3: Cleaning of passage_text (a list of passages per row)
    df['passage_text'] = df['passage_text'].apply(lambda texts: [_clean_passage(t) for t in texts])

    # CHANGE 4: encode query_type as integers 1..n so it will be easier to use
    categories = df['query_type'].unique()
    df['query_type'] = df['query_type'].map({c: i for i, c in enumerate(categories, start=1)})

    # CHANGE 5: rename column is_selected to passage_selected
    df = df.rename(columns={"is_selected": "passage_selected"})

    # Train, test, validation split: shuffle, then take the first 70%, next 20%, and last 10%
    shuffled = df.sample(frac=1, random_state=42)
    train_end, validation_end = int(0.7 * len(shuffled)), int(0.9 * len(shuffled))
    train_data = shuffled.iloc[:train_end]
    validation_data = shuffled.iloc[train_end:validation_end]
    test_data = shuffled.iloc[validation_end:]

    # Local store
    for name, split in [('train', train_data), ('validation', validation_data), ('test', test_data)]:
        os.makedirs(os.path.join(args.outputpath, name), exist_ok=True)
        split.to_csv(os.path.join(args.outputpath, name, f'{name}.csv'), index=False, header=False)
    print("## Processing complete. Exiting.")

Overwriting preprocessing.py
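
One detail worth keeping in mind when the script reads `MS_Marco.csv` back from disk: a column that held dicts is written to CSV as plain text, so it comes back as strings and has to be parsed before it can be split into columns. The sketch below illustrates this with the standard library only; pandas' `to_csv` and `read_csv` round-trip such a column the same way. The one-row data is hypothetical, and the approach assumes the values are plain Python lists (as in the table output earlier), since `ast.literal_eval` cannot parse NumPy array representations.

```python
import ast
import csv
import io

# Hypothetical one-row stand-in for MS_Marco.csv
row = {
    "query": "what is a furuncle boil",
    "passages": {"is_selected": [0, 1], "passage_text": ["p1", "p2"]},
}

# Write the row; the csv module stringifies the dict with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["query", "passages"])
writer.writeheader()
writer.writerow(row)

# Read it back: the passages cell is now a string, not a dict
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["passages"]))

# Parse the string back into a dict of lists
passages = ast.literal_eval(loaded["passages"])
print(passages["passage_text"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of restoration.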


Before starting the SageMaker Processing job, we instantiate a `SKLearnProcessor` object.  This object allows you to specify the instance type to use in the job, as well as how many instances.

In [44]:
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"

In [None]:
# cell 08
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role


sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.m5.large",
    instance_count=1, 
    base_job_name='project-cleaning'
)

sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[
        ProcessingInput(
            source=input_source, 
            destination="/opt/ml/processing/input/",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation", destination=validation_path),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test", destination=test_path),
    ]
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker:Creating processing-job with name project-cleaning-2023-06-06-05-05-28-417


........................

In [None]:
!aws s3 ls $train_path/

---
# Model Training

In order to use SageMaker to fit our model, we create a Hugging Face `estimator` (from the SageMaker Python SDK) that defines how to use the container for training. This includes the configuration we need to invoke SageMaker training:

- `entry_point (str)` - the script to run, which allows fine-tuning any model from the Hugging Face Hub
- `source_dir (str)` - the directory inside the git repository where this script is located
- `instance_type (str)` - the type of machine to use for training
- `instance_count (int)` - the number of machines to use for training
- `role (str)` - the SageMaker IAM role obtained previously
- `git_config (dict)` - dictionary holding the URL and the branch of the git repository containing the transformers scripts
- `transformers_version (str)` - the transformers version used to run the scripts
- `pytorch_version (str)` - the PyTorch version used to run the scripts
- `py_version (str)` - the Python version used to run the scripts
- `hyperparameters (dict)` - dictionary containing the hyperparameter values passed to the script



In [None]:
from sagemaker.huggingface import HuggingFace

hyperparameters = {
	'model_name_or_path':'deepset/bert-large-uncased-whole-word-masking-squad2',
	'output_dir':'/opt/ml/model'
	# add your remaining hyperparameters
	# more info here https://github.com/huggingface/transformers/tree/v4.26.0/examples/pytorch/question-answering
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
	entry_point='run_qa.py',
	source_dir='./examples/pytorch/question-answering',
	instance_type='ml.p3.2xlarge',
	instance_count=1,
	role=role,
	git_config=git_config,
	transformers_version='4.26.0',
	pytorch_version='1.13.1',
	py_version='py39',
	hyperparameters = hyperparameters
)

# starting the train job
huggingface_estimator.fit()
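
The `run_qa.py` example script works on SQuAD-style records with `question`, `context`, and `answers` fields, where `answers` holds the answer text and its character offset in the context. MS_MARCO answers are written by people and are not guaranteed to appear verbatim in a passage, so some conversion is needed before the processed data can be fed to the script. The sketch below shows one possible approach under that assumption: keep a row only when the answer is found as an exact substring of the context. The helper name `to_squad_record` and the example values are hypothetical.

```python
def to_squad_record(query_id, query, answer, context):
    """Build a SQuAD-style record, or return None if the answer is not a span of the context."""
    start = context.find(answer)
    if start == -1:
        # The answer is not a verbatim span, so it cannot be used for extractive QA
        return None
    return {
        "id": str(query_id),
        "question": query,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
    }

# Hypothetical example values
context = "A boil, also called a furuncle, is a deep infection of the hair follicle."
record = to_squad_record(9654, "what is a furuncle boil", "a furuncle", context)
print(record["answers"])

# An answer that does not occur in the context is skipped
print(to_squad_record(9654, "what is a furuncle boil", "an abscess", context))
```

Rows dropped by this filter reduce the amount of training data, so it may be worth checking how many survive before starting a training job.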