# An Endpoint using BERT large model (uncased) with the MS_MARCO dataset

# Part 1: DATA CLEANING

---

## Background
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language model that 
has achieved state-of-the-art results on numerous natural language processing (NLP) tasks. BERT is a 
transformer-based model that can be fine-tuned for various NLP tasks, including question answering. 
The Stanford Question Answering Dataset (SQuAD) is one benchmark on which BERT has achieved 
significantly higher accuracy than other models. Research also indicates that 
fine-tuning BERT on new datasets can further improve its performance on specific 
tasks. In this proposal, we suggest fine-tuning a pre-trained BERT large model (uncased) with whole 
word masking on the MS_MARCO dataset to improve its question-answering performance.

The BERT model is based on pre-training followed by fine-tuning on an NLP task. The pre-training process uses unsupervised objectives, such as masked language modeling and 
next sentence prediction, to learn contextual representations of words. The fine-tuning process, 
on the other hand, uses supervised training, where the pre-trained knowledge is combined 
with task-specific datasets to achieve better performance.
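
To make the idea of whole word masking concrete, here is a minimal sketch in plain Python. It assumes WordPiece-style tokens in which continuation pieces start with `##`; the function names, the `mask_prob` parameter, and the sample tokens are illustrative assumptions, not BERT's actual pre-training code. Real BERT masking also replaces some chosen tokens with random or unchanged tokens and skips special tokens, which this sketch leaves out.

```python
import random

MASK_TOKEN = "[MASK]"

def group_whole_words(tokens):
    """Group token indices so that '##' continuation pieces stay with their word."""
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)   # continuation piece: attach to the previous word
        else:
            groups.append([index])     # start of a new word
    return groups

def whole_word_mask(tokens, mask_prob=0.15, seed=None):
    """Mask every piece of each randomly chosen word (simplified illustration)."""
    rng = random.Random(seed)
    masked = list(tokens)
    for group in group_whole_words(tokens):
        if rng.random() < mask_prob:
            for index in group:
                masked[index] = MASK_TOKEN
    return masked

tokens = ["the", "phil", "##ammon", "played", "well"]
print(group_whole_words(tokens))   # [[0], [1, 2], [3], [4]]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

The point of the grouping step is that `phil` and `##ammon` are always masked or kept together, so the model never sees half of a word as a hint for the other half.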

MS_MARCO is a large collection of real-world queries and corresponding 
passages that aims to facilitate research in question-answer matching and ranking. It contains more 
than 1 million queries and over 8 million passages, making it one of the largest publicly available 
datasets of its kind. Each query in the dataset was generated by real users of the Bing search engine 
and is paired with relevant passages retrieved from web pages. The dataset is designed to enable 
researchers to develop and evaluate machine learning models for natural language processing tasks 
such as question answering, information retrieval, and passage ranking. The dataset also includes 
information about the relevance of passages to queries, allowing for the evaluation of ranking 
algorithms. Overall, MS_MARCO (this notebook loads its `v1.1` configuration) is a valuable resource for anyone interested in 
developing and evaluating algorithms for natural language understanding tasks.


---

## Preparation

Our proposal is to use the BERT large model (uncased) with whole word masking, fine-tuned on the 
MS_MARCO dataset, as an endpoint to answer questions. For this, we will go through the following steps: 
* Preparing the data: We will use the MS_MARCO dataset to fine-tune the BERT large model. We will construct a train and test set from the available dataset for fine-tuning. 
* Fine-tuning the model: We will fine-tune the pre-trained BERT model with the MS_MARCO dataset using whole word masking. We will use the Adam optimizer for training and evaluate model performance based on F1 score. 
* Evaluating the model: We will evaluate the fine-tuned model's performance based on MS_MARCO's benchmarking metrics. The key evaluation measures will be the F1-score and the exact match (EM) score. 
* Deploying the model. 
* Scaling the model so that the application can be made available to many more users. 
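
The two evaluation measures named in the steps above can be sketched as follows. This is a minimal illustration that assumes SQuAD-style definitions (answers are lowercased, stripped of punctuation and English articles, then compared token by token); the source does not specify the exact normalization, so the helper names and rules here are assumptions.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))

def f1_score(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("approximately 15000 per year", "15000 per year"))
```

Exact match rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.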

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the boto regexp with the appropriate full IAM role ARN string(s).

In [24]:
!pip install -qU pip

In [25]:
# cell 01
import sagemaker
bucket=sagemaker.Session().default_bucket()
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Now let's bring in the Python libraries that we'll use throughout the analysis.

In [26]:
# cell 02
!pip install -qU datasets
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
import zipfile                                    # For working with zip archives
from datasets import load_dataset

---

## Data
Let's start by importing the dataset from Hugging Face.

In [27]:
validation_data, train_data, test_data = load_dataset('ms_marco', 'v1.1', split =['validation','train', 'test'], cache_dir='/media/data_files/github/website_tutorials/data')

Found cached dataset ms_marco (/media/data_files/github/website_tutorials/data/ms_marco/v1.1/1.1.0/b6a62715fa5219aea5275dd3556601004cd63945cb63e36e022f77bb3cbbca84)



Now let's read this into a pandas DataFrame and take a look.

In [28]:
data = pd.concat([pd.DataFrame(validation_data), pd.DataFrame(train_data), pd.DataFrame(test_data)],ignore_index=True)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)
data.head(3)

| | answers | passages | query | query_id | query_type | wellFormedAnswers |
|---|---|---|---|---|---|---|
| 0 | [Approximately $15,000 per year.] | {'is_selected': [1, 0, 0, 0, 0, 0], 'passage_t... | walgreens store sales average | 9652 | numeric | [] |
| 1 | [$21,550 per year, The average hourly wage for... | {'is_selected': [0, 1, 0, 0, 0, 0, 0, 0], 'pas... | how much do bartenders make | 9653 | numeric | [] |
| 2 | [A boil, also called a furuncle, is a deep fol... | {'is_selected': [0, 0, 0, 0, 0, 0, 1, 0], 'pas... | what is a furuncle boil | 9654 | description | [] |


In [29]:
file_name_raw = 'MS_Marco_raw.csv' 
data.to_csv(file_name_raw)

We will store this file in S3 so that we can then process it with SageMaker Processing.

In [30]:
from sagemaker import Session

prefix = 'final_project'
s3 = boto3.resource('s3')
s3.meta.client.upload_file(file_name_raw, bucket, f'{prefix}/raw_data/MS_Marco_raw.csv')

---
# Feature Engineering 

Here, we'll import the dataset and transform it with SageMaker Processing, which can be used to process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server.

For this project, we're using a SageMaker prebuilt Scikit-learn container, which includes many common functions for processing data. Moreover, when the job is complete, SageMaker Processing automatically uploads the transformed data to S3.

## STEP 1: Analyze the data to decide what cleaning is needed

Let's show the data again.

In [31]:
data.head(3)

| | answers | passages | query | query_id | query_type | wellFormedAnswers |
|---|---|---|---|---|---|---|
| 0 | [Approximately $15,000 per year.] | {'is_selected': [1, 0, 0, 0, 0, 0], 'passage_t... | walgreens store sales average | 9652 | numeric | [] |
| 1 | [$21,550 per year, The average hourly wage for... | {'is_selected': [0, 1, 0, 0, 0, 0, 0, 0], 'pas... | how much do bartenders make | 9653 | numeric | [] |
| 2 | [A boil, also called a furuncle, is a deep fol... | {'is_selected': [0, 0, 0, 0, 0, 0, 1, 0], 'pas... | what is a furuncle boil | 9654 | description | [] |


- Let's check if all the questions have an answer

In [32]:
test1 = data['answers'].apply(lambda x: ''.join(x))
len(test1[test1==''])

2727

Rows with no answer exist, so we'll have to delete them.

- Now let's check whether each answer has a context, that is, whether at least one passage is flagged in `is_selected`, excluding the rows already caught by the previous test

In [33]:
from more_itertools import locate
test2 = data.passages.apply(lambda x: list(locate(x["is_selected"], lambda y: y == 1)) if sum(x["is_selected"])>0 else '')
test2 = test2[test2==''].index.isin(test1[test1==''].index)
len(list(locate(test2, lambda y: y == False)))

643

Some answers still have no selected passage text, so we'll delete these rows as well.
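
The two checks above can also be expressed on plain Python records, which makes the logic easier to follow than the index-based pandas version. This is a sketch that assumes each row is a dict shaped like the dataset rows shown earlier; the helper names and the three toy rows are made up for illustration.

```python
def has_answer(row):
    """True when the joined answer list is a non-empty string."""
    return "".join(row["answers"]) != ""

def has_selected_passage(row):
    """True when at least one passage is flagged with is_selected == 1."""
    return sum(row["passages"]["is_selected"]) > 0

def count_answered_without_passage(rows):
    """Count rows that have an answer but no selected passage."""
    return sum(1 for row in rows if has_answer(row) and not has_selected_passage(row))

# Toy rows for illustration only (not taken from the dataset)
rows = [
    {"answers": ["x"], "passages": {"is_selected": [1, 0]}},
    {"answers": [], "passages": {"is_selected": [0, 0]}},
    {"answers": ["y"], "passages": {"is_selected": [0, 0]}},
]
print(count_answered_without_passage(rows))  # 1
```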

- The column `wellFormedAnswers` seems to contain only empty lists, but let's verify

In [34]:
data[data['wellFormedAnswers'].str.len()!=0]

| | answers | passages | query | query_id | query_type | wellFormedAnswers |
|---|---|---|---|---|---|---|


That is the case, so the column is not useful and we'll drop it.

- The column `passages` seems to hold too much information

For training, we only need the context, which is the `passage_text`. Some regex-based text cleaning also seems to be needed, and we'll try to solve that problem.
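
As an illustration of what regex-based cleaning of a passage could look like, here is a minimal sketch. The source does not say which patterns need cleaning, so the three rules below (HTML-like tags, runs of periods, repeated whitespace) and the name `clean_passage` are assumptions chosen for the example.

```python
import re

def clean_passage(text):
    """Apply a few illustrative regex clean-ups to a passage string."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML-like tags
    text = re.sub(r"\.{2,}", ".", text)    # collapse runs of periods into one
    text = re.sub(r"\s+", " ", text)       # collapse whitespace, including newlines
    return text.strip()

print(clean_passage("A boil..  also <b>called</b>\n a furuncle"))
```

Keeping each rule on its own line makes it easy to add or remove patterns once the real noise in `passage_text` has been inspected.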

- Let's see if query_id needs to be cleaned

In [35]:
len(data.query_id.unique()) == len(data)

True

Every row has a unique query_id, so no cleaning is needed for this column.

- Finally, let's study the column query_type

In [36]:
data.query_type.value_counts()

description    55684
numeric        28291
entity         10485
location        5068
person          2495
Name: query_type, dtype: int64

This column has 5 categories and no missing values appear to be present, so for now we don't need to modify it.

## STEP 2: Apply modifications to the dataset and split train/test/validation

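!pip install -qU emoji
import emoji

def text_cleaning(text):
    # Collapse doubled periods into a single period
    cleaned_text = text.replace('..', '.')
    # Remove emoji and any remaining periods
    # (emoji.is_emoji replaces emoji.UNICODE_EMOJI, which was removed in emoji 2.0)
    for character in text:
        if emoji.is_emoji(character) or character == '.':
            cleaned_text = cleaned_text.replace(character, '')

    return cleaned_text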


In [37]:
from more_itertools import locate

# CHANGE 1: split the dict in the passages column into 3 columns
data = pd.concat([data.drop(['passages'], axis=1),
                data['passages'].apply(pd.Series)], axis=1)

# CHANGE 2: convert the answers column from a list to a string
data['answers'] = data['answers'].apply(lambda x: ''.join(x))

# CHANGE 3: convert the is_selected flags into the list of indices of the passage_text entries that support the answer
data['is_selected'] = data['is_selected'].apply(lambda x: list(locate(x,lambda y: y==1)) if 1 in x else '') 

# CHANGE 4: drop rows with no selected passage_text (no index mapped)
data = data[data['is_selected']!=''].reset_index(drop=True)

# CHANGE 5: create the context, made of the one or more passage_text entries referenced in is_selected
data['context'] = data.apply(lambda x: str(x['passage_text'][x['is_selected'][0]]) if len(x['is_selected'])==1 else ' '.join(x['passage_text'][i] for i in x['is_selected']), axis=1)
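
The index mapping and context construction in the cell above can be illustrated on plain lists. This is a simplified sketch rather than the notebook's pandas code; the function name `build_context` and the sample values are assumptions made for the example.

```python
def build_context(is_selected, passage_text):
    """Join the passages whose is_selected flag is 1 into one context string."""
    selected = [i for i, flag in enumerate(is_selected) if flag == 1]
    return " ".join(passage_text[i] for i in selected)

print(build_context([0, 1, 0, 1], ["a", "b", "c", "d"]))  # b d
print(repr(build_context([0, 0], ["a", "b"])))            # ''
```

A row with no selected passage yields an empty context, which is why such rows are dropped before the context column is built.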

In [38]:
data.head(3)

| | answers | query | query_id | query_type | wellFormedAnswers | is_selected | passage_text | url | context |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Approximately $15,000 per year. | walgreens store sales average | 9652 | numeric | [] | [0] | [The average Walgreens salary ranges from appr... | [http://www.indeed.com/cmp/Walgreens/salaries,... | The average Walgreens salary ranges from appro... |
| 1 | $21,550 per yearThe average hourly wage for a ... | how much do bartenders make | 9653 | numeric | [] | [1] | [A bartender’s income is comprised mostly of t... | [http://www.breakintobartending.com/how-much-d... | According to the Bureau of Labor Statistics, t... |
| 2 | A boil, also called a furuncle, is a deep foll... | what is a furuncle boil | 9654 | description | [] | [6] | [Knowledge center. A boil, also known as a fur... | [http://www.medicalnewstoday.com/articles/1854... | A boil, also called a furuncle, is a deep foll... |


In [39]:
file_name = 'MS_Marco.csv' 
data.to_csv(file_name)

s3.meta.client.upload_file(file_name, bucket, f'{prefix}/input_data/MS_Marco.csv')

sess = Session()
input_source = sess.upload_data(file_name, bucket=bucket, key_prefix=f'{prefix}/input_data')
input_source

's3://sagemaker-us-east-1-834242159264/final_project/input_data/MS_Marco.csv'

In [45]:
%%writefile preprocessing.py

import pandas as pd
import numpy as np
import argparse
import os
from sklearn.preprocessing import OrdinalEncoder


def _parse_args():

    parser = argparse.ArgumentParser()

    # Input and output locations inside the processing container
    # The defaults match the ProcessingInput/ProcessingOutput paths used when the job is launched.
    parser.add_argument('--filepath', type=str, default='/opt/ml/processing/input/')
    parser.add_argument('--filename', type=str, default='MS_Marco.csv')
    parser.add_argument('--outputpath', type=str, default='/opt/ml/processing/output/')
    parser.add_argument('--categorical_features', type=str, default='answers, passages, query, query_id, query_type, wellFormedAnswers')

    return parser.parse_known_args()

if __name__=="__main__":
    # Process arguments
    args, _ = _parse_args()
    # Load data
    df = pd.read_csv(os.path.join(args.filepath, args.filename))
    
    # CHANGE 1: drop columns that are no longer needed
    df = df.drop(['wellFormedAnswers','is_selected','passage_text','url'], axis=1)

    # CHANGE 2: drop rows that have no answer
    # (empty strings written to CSV are read back as NaN, so check for both)
    df = df[df["answers"].notna() & (df["answers"]!='')].reset_index(drop=True)
    
    # CHANGE 3: encode query_type as integers (1..n, in order of first appearance) so it will be easier to use
    query_type_codes = {name: code for code, name in enumerate(df['query_type'].unique(), start=1)}
    df['query_type'] = df['query_type'].map(query_type_codes)

    # CHANGE 4: drop rows whose answer has no context (missing or empty 'context')
    df = df[df['context'].notna() & (df['context']!='')].reset_index(drop=True)
    
    # Train, test, validation split
    train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=42), [int(0.7 * len(df)), int(0.9 * len(df))])   # Randomly sort the data then split out first 70%, second 20%, and last 10%
    
    # Local store
    validation_data.to_csv(os.path.join(args.outputpath, 'validation/validation.csv'), index=False, header=True)
    test_data.to_csv(os.path.join(args.outputpath, 'test/test.csv'), index=False, header=True)
    train_data.to_csv(os.path.join(args.outputpath, 'train/train.csv'), index=False, header=True)
    print("## Processing complete. Exiting.")

Overwriting preprocessing.py
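
The shuffle-then-cut split used in the script can be sketched with the standard library alone. This mirrors the 70% / 20% / 10% cut points of the `np.split` call, but it is an illustration on row indices, not the script itself; the function name `split_indices` and its parameters are assumptions.

```python
import random

def split_indices(n_rows, train_end_frac=0.7, validation_end_frac=0.9, seed=42):
    """Shuffle row indices, then cut them into train, validation, and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    train_end = int(train_end_frac * n_rows)
    validation_end = int(validation_end_frac * n_rows)
    return (indices[:train_end],
            indices[train_end:validation_end],
            indices[validation_end:])

train_idx, validation_idx, test_idx = split_indices(100)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible between runs, and cutting one shuffled list guarantees that no row lands in more than one part.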


In [46]:
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"
print(train_path,'\n',validation_path,'\n',test_path)

s3://sagemaker-us-east-1-834242159264/final_project/train 
 s3://sagemaker-us-east-1-834242159264/final_project/validation 
 s3://sagemaker-us-east-1-834242159264/final_project/test


Before starting the SageMaker Processing job, we instantiate a `SKLearnProcessor` object.  This object allows you to specify the instance type to use in the job, as well as how many instances.

In [47]:
# cell 08
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role


sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=get_execution_role(),
    instance_type="ml.m5.large",
    instance_count=1, 
    base_job_name='project-cleaning'
)

sklearn_processor.run(
    code='preprocessing.py',
    # arguments = ['arg1', 'arg2'],
    inputs=[
        ProcessingInput(
            source=input_source, 
            destination="/opt/ml/processing/input/",
            s3_input_mode="File",
            s3_data_distribution_type="ShardedByS3Key"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train_data", 
            source="/opt/ml/processing/output/train",
            destination=train_path,
        ),
        ProcessingOutput(output_name="validation_data", source="/opt/ml/processing/output/validation", destination=validation_path),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/output/test", destination=test_path),
    ]
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker:Creating processing-job with name project-cleaning-2023-06-11-10-19-25-285


............................## Processing complete. Exiting.



In [48]:
!aws s3 ls $train_path/

2023-06-11 10:24:08   43642596 train.csv
