# Introduction

This notebook is an entry for the `Data Science Challenge` set by `Resourcefull Humans`.
Its objective is to build a text classifier that detects humor in short texts and labels each one as humor or non-humor.
To do so, I fine-tune a pre-trained BERT transformer model.

Hugging Face offers three ways to fine-tune a pretrained text classification model: native PyTorch, TensorFlow Keras, and the PyTorch Trainer. To showcase the strengths and weaknesses of each approach, I implemented two of them:

* the Hugging Face PyTorch Trainer,
* native PyTorch.

**Both approaches performed similarly, with only a slight difference in measured accuracy and F1 score. `The Hugging Face PyTorch Trainer` achieved 98% accuracy and F1 score, while training with `the PyTorch loop` achieved 99% accuracy and F1 score.**
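
For readers unfamiliar with the two metrics quoted above, the following is a minimal, library-free sketch of how accuracy and F1 are computed for binary labels. The helper name `accuracy_and_f1` and the toy labels are assumptions made for illustration; they are not taken from this notebook's evaluation code or results.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the humor class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when that denominator is zero
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, made up for illustration (not results from this notebook)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Accuracy counts every correct prediction, while F1 only looks at the positive (humor) class, so the two can diverge when the classes are imbalanced.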


**Related article:** https://arxiv.org/abs/2004.12765

**Dataset source:** https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection


# Structure of this Notebook

The notebook follows the widely used `CRISP-DM model`, a systematic process for tackling data science problems.

It covers five CRISP-DM phases; modeling and evaluation are each carried out twice, once per training approach:

* **Step 1**. **Business Understanding**: The main objectives and goals of the project are identified and clarified.
* **Step 2**. **Data Understanding**:  The data that is required to solve the business problem is collected, explored, and analyzed.
* **Step 3**. **Data Preparation**: The data is cleaned, transformed, and preprocessed to prepare it for modeling.
* **Step 4**. **Modeling/Training with Hugging Face PyTorch Trainer**: The BERT model is fine-tuned with the Hugging Face Trainer.
* **Step 5**. **Evaluation with Hugging Face PyTorch Trainer**: The performance of the fine-tuned model is assessed, which guides further tuning.
* **Step 4a**. **Alternative Approach for Modeling/Training with native PyTorch**: The model is fine-tuned with a native PyTorch training loop, which gives more flexibility.
* **Step 5a**. **Alternative Approach for Evaluation with native PyTorch**: The evaluation criteria can be adjusted to suit specific needs and requirements.

Note that the deployment stage of the `CRISP-DM model` is not implemented in this project and is therefore out of scope. Nonetheless, the current version of the notebook gives a detailed guide to fine-tuning BERT with two different frameworks, following a structured approach.



# Step 0: Install Dependencies, Import Python Libraries and Setup SageMaker Session and Bucket

In [222]:
# Install libraries
!pip install transformers datasets
!pip install evaluate
!pip install scikit-plot
!pip install wordcloud
!pip install tensorflow
!pip install tensorflow-hub
!pip install -qq "sagemaker>=2.48.0" --upgrade
!pip install -qq torch==2.0.1 --upgrade
!pip install -qq sagemaker-huggingface-inference-toolkit 
!pip install -qq transformers==4.6.1 "datasets[s3]"
!pip install -qq ipywidgets
!pip install -qq watermark 
!pip install -qq "seaborn>=0.11.0"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com

In [223]:
!pip install -U sagemaker



In [185]:
# Data processing
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

# Data Visualization
from matplotlib import pyplot as plt
import seaborn as sns
import scikitplot as skplt
from matplotlib import rc
from pylab import rcParams
from textwrap import wrap

# Preprocessing and Modeling
import re
import string
import torch
import sklearn
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, EarlyStoppingCallback, get_scheduler,BertModel, AdamW, TrainingArguments, Trainer
from datasets import Dataset
from torch.utils.data import DataLoader
from datasets import load_dataset
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
#from transformers import BertTokenizer, BertModel

## Model deployment and connection with SageMaker
import os
import pprint
import time
import boto3
import sagemaker
import sagemaker.huggingface
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs
from sagemaker.huggingface import HuggingFace
from sagemaker.session import s3_input, Session
from datasets.filesystems import S3FileSystem


# Model performance evaluation
import evaluate
import random
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, classification_report

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# visualization theme
sns.set_theme(style="white")
rcParams['figure.figsize'] = 17, 8

## Permissions

In [None]:
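s3 = boto3.client('s3')

# Region of the current session and the bucket used throughout this notebook
my_region = boto3.session.Session().region_name
bucket_name = 'humor-deployment-duygu'

try:
    if my_region == 'eu-central-1':
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})
        print('S3 bucket created successfully')
    else:
        # Only report success when a bucket was actually created
        print(f'Skipped bucket creation: session region is {my_region}')
except Exception as e:
    print('S3 error: ', e)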

In [224]:
sagemaker_session_bucket = "humor-deployment-duygu"
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker role arn: arn:aws:iam::356128188667:role/service-role/AmazonSageMaker-ExecutionRole-20200903T152144
sagemaker bucket: humor-deployment-duygu
sagemaker session region: eu-central-1


# Step 1: Business Understanding



Before starting, it is important to be clear about the project's primary objective: to develop an automated system for detecting humor in text, with the ultimate goal of integrating it into applications such as chatbots and virtual assistants.

# Step 2: Data Understanding with EDA

The second step is to download and read the dataset. Once the data has been loaded, the next step is to perform `exploratory data analysis (EDA)` in order to understand its structure, patterns, and relationships. This step is crucial as it helps to identify potential issues with the data, such as missing values and inconsistencies. Additionally, EDA can assist in selecting an appropriate model and tokenizer for the task at hand.


The dataset comprises two columns: one containing the text and the other containing the corresponding humor label (`True` or `False`). Due to limited computational power, only the first 10,000 rows are used for training and testing.

In [227]:
# Read the data
filepath = "./dataset.csv"
df_humor = pd.read_csv(filepath, nrows=10000)

# First 5 rows in the dataset
df_humor.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False
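
As a small illustration of the checks mentioned in this step (missing values and label balance), here is a library-free sketch that runs on the five preview rows above. It is only an illustration under that assumption; on the full `df_humor` DataFrame the same questions are usually answered with pandas calls such as `isna().sum()` and `value_counts()`.

```python
import csv
import io
from collections import Counter

# Five rows copied from the preview above (texts are truncated there as well)
sample_csv = '''text,humor
"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
Watch: darvish gave hitter whiplash with slow ...,False
What do you call a turtle without its shell? d...,True
5 reasons the 2016 election feels so personal,False
"Pasco police shot mexican migrant from behind,...",False
'''

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count empty cells per column (a stand-in for a missing-value check)
missing = {col: sum(1 for r in rows if not (r[col] or "").strip()) for col in rows[0]}

# Share of each label, to spot class imbalance
label_counts = Counter(r["humor"] for r in rows)
label_share = {label: n / len(rows) for label, n in label_counts.items()}

print("missing values:", missing)
print("label share:", label_share)
```

Five rows are far too few to judge the real class balance; the sketch only shows the shape of the check.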


# Step 3: Data Preprocessing





The Data Preparation stage keeps the quality of the data high, which the model needs in order to produce accurate results.

In [234]:
# Convert the labels to integers (False -> 0, True -> 1) before they are used by the model.
# Casting to str first handles both boolean values and 'True'/'False' strings.
df_humor['humor'] = df_humor['humor'].astype(str).map({'False': 0, 'True': 1}).astype(int)

# show head
df_humor.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",0
1,Watch: darvish gave hitter whiplash with slow ...,0
2,What do you call a turtle without its shell? d...,1
3,5 reasons the 2016 election feels so personal,0
4,"Pasco police shot mexican migrant from behind,...",0


In [235]:
# Split dataset into train and test sets
train_data, test_data = train_test_split(df_humor, test_size=0.2, random_state=42,stratify=df_humor["humor"])

# Print number of records in train and test sets
print(f"The training dataset has {len(train_data)} records.")
print(f"The testing dataset has {len(test_data)} records.")

The training dataset has 8000 records.
The testing dataset has 2000 records.
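
The `stratify` argument used above keeps the share of humor and non-humor rows the same in both splits. The sketch below illustrates that idea in plain Python; it is not scikit-learn's implementation, and the helper name `stratified_split` and the 6,000/4,000 label counts are hypothetical values chosen for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each label's share preserved in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 6,000 non-humor and 4,000 humor rows (not the real counts)
labels = [0] * 6000 + [1] * 4000
train_idx, test_idx = stratified_split(labels)
humor_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
humor_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), humor_share_train, humor_share_test)
```

Without stratification, a random 80/20 split could leave the test set with a noticeably different humor share than the training set, which makes accuracy and F1 harder to compare.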


In [236]:
# Convert pandas dataframe to Hugging Face dataset
hg_train_data = Dataset.from_pandas(train_data).shuffle()
hg_test_data = Dataset.from_pandas(test_data).shuffle()

In [237]:
hg_train_data[0]

{'text': 'What do you call an eerie french pastry chef? a crepe.',
 'humor': 1,
 '__index_level_0__': 1731}

## Step 3.1: Tokenize the data

In [239]:
# Load the tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [240]:
# Function to tokenize data
def tokenize_dataset(data):
    return tokenizer(data["text"],
                     max_length=50,
                     truncation=True,
                     padding="max_length")

# Tokenize both train and test dataset
dataset_train = hg_train_data.map(tokenize_dataset,batched=True)
dataset_test = hg_test_data.map(tokenize_dataset,batched=True)


After tokenization, both datasets have 6 features: `'text'`, `'humor'`, `'__index_level_0__'`, `'input_ids'`, `'token_type_ids'`, and `'attention_mask'`. The number of rows is stored in `num_rows`.
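
To make the three tokenizer outputs concrete, the sketch below mimics what `truncation=True` with `padding="max_length"` and `max_length=50` does to a single sequence. It is a simplified illustration, not the tokenizer's actual code: the helper `pad_and_truncate` is hypothetical, the input ids are placeholders, and the special-token ids (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0) are those of the `bert-base-uncased` vocabulary.

```python
def pad_and_truncate(token_ids, max_length=50, cls_id=101, sep_id=102, pad_id=0):
    """Mimic truncation=True with padding="max_length" for a single sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest
    body = token_ids[: max_length - 2]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)
    # Right-pad to the fixed length; padded positions get attention mask 0
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    token_type_ids = [0] * max_length  # single-sentence input: all zeros
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Placeholder ids, not real tokenizer output
short_example = pad_and_truncate([2054, 2079, 2017])
long_example = pad_and_truncate(list(range(1000, 1080)))  # 80 ids, longer than the limit

print(short_example["input_ids"][:8], sum(short_example["attention_mask"]))
print(len(long_example["input_ids"]), long_example["input_ids"][-1])
```

Every row ends up with exactly 50 positions, which is what lets the rows be stacked into fixed-size batches; the attention mask tells the model which positions are padding.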

In [241]:
print(dataset_train)
print(dataset_test)

Dataset({
    features: ['text', 'humor', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 8000
})
Dataset({
    features: ['text', 'humor', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2000
})


In [242]:
# Rename humor column to labels because the model expects the name labels
columns_to_rename = {"humor": "labels"}
dataset_train = dataset_train.rename_columns(columns_to_rename)
dataset_test = dataset_test.rename_columns(columns_to_rename)

print("Training dataset:", dataset_train)
print("Test dataset:", dataset_test)

Training dataset: Dataset({
    features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 8000
})
Test dataset: Dataset({
    features: ['text', 'labels', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2000
})


## Step 3.2: Uploading the data to S3

In [243]:
# Define your bucket name and prefix
bucket_name = 'humor-deployment-duygu' 
prefix = 'deployment-test'

# Initialize the S3 file system
s3 = S3FileSystem()

# Define the S3 paths
training_input_path = f's3://{bucket_name}/{prefix}/train'
test_input_path = f's3://{bucket_name}/{prefix}/test'

# Save the datasets to S3
dataset_train.save_to_disk(training_input_path, fs=s3)
dataset_test.save_to_disk(test_input_path, fs=s3)

print(f'Uploaded training data to {training_input_path}')
print(f'Uploaded testing data to {test_input_path}')





Uploaded training data to s3://humor-deployment-duygu/deployment-test/train
Uploaded testing data to s3://humor-deployment-duygu/deployment-test/test


# Step 4: Fine-tuning & starting a SageMaker Training Job

To create a SageMaker training job, you need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator you define which fine-tuning script is used as `entry_point`, which `instance_type` is used, and which `hyperparameters` are passed in.

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 2,                          # number of training epochs
                 'train_batch_size': 16,               # batch size for training
                 'eval_batch_size': 16,                # batch size for evaluation
                 'learning_rate': 3e-5,                # learning rate used during training
                 'model_id':'bert-base-uncased', # pre-trained model
                 'fp16': True,                         # Whether to use 16-bit (mixed) precision training
                }

In [219]:
# define Training Job Name 
job_name = f'huggingface-workshop-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',        # fine-tuning script used in the training job
    source_dir           = './scripts',       # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3.2xlarge',   # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.17',            # the transformers version used in the training job
    pytorch_version      = '1.10',            # the PyTorch version used in the training job
    py_version           = 'py38',            # the Python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters used for running the training job
)

In [220]:
# define a data input dictionary with our uploaded S3 URIs
data = {
    'train': training_input_path,
    'test': test_input_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
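
The estimator above points to `train.py` in `./scripts`, which is not shown in this notebook. As a hedged sketch of how such an entry point typically receives its settings: SageMaker passes the `hyperparameters` dict as command-line arguments and exposes the `train` and `test` channels and the model output directory through the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR` environment variables. The argument names `--model_dir`, `--training_dir`, and `--test_dir`, the helper `str2bool`, and the local fallback paths are assumptions for this example, not the contents of the actual script.

```python
import argparse
import os


def str2bool(value):
    """Parse 'True'/'False' text, since hyperparameters arrive as command-line strings."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters: names mirror the keys of the `hyperparameters` dict above
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--eval_batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--model_id", type=str, default="bert-base-uncased")
    parser.add_argument("--fp16", type=str2bool, default=True)
    # Locations provided by SageMaker through environment variables
    # (the "./..." fallbacks are assumptions for running the sketch locally)
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "./train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "./test"))
    # Ignore any extra arguments instead of failing on them
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "2", "--learning_rate", "3e-5", "--fp16", "True"])
print(args.epochs, args.learning_rate, args.fp16, args.model_id)
```

Inside the script, the datasets saved in Step 3.2 would then be loaded from `args.training_dir` and `args.test_dir`, and the fine-tuned model written to `args.model_dir` so that SageMaker can package it.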

In [None]:
predictor = huggingface_estimator.deploy(1,"ml.g4dn.xlarge")

## Step 5.1: Make Predictions for Text Classification

Each row of the prediction has two values: the first is the predicted logit for label 0 (non-humor) and the second is the predicted logit for label 1 (humor).

In [None]:
sentiment_input= {"inputs": "I get so nervous before a demo"}

predictor.predict(sentiment_input)

array([[-4.7446227,  4.9986143],
       [ 4.4667053, -4.5196543],
       [ 4.5707283, -4.8942003],
       [-4.860986 ,  5.014812 ],
       [ 4.6963043, -4.9394126]], dtype=float32)

In [None]:
#Finally, we delete the inference endpoint.
#predictor.delete_model()
#predictor.delete_endpoint()

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[5.8686834e-05, 9.9994135e-01],
       [9.9987495e-01, 1.2508905e-04],
       [9.9992251e-01, 7.7517558e-05],
       [5.1401170e-05, 9.9994862e-01],
       [9.9993467e-01, 6.5348060e-05]], dtype=float32)>
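
The cell that produced the tensor above is not shown, but its values are consistent with a row-wise softmax of the logits printed earlier in this step. The sketch below reproduces that conversion in plain Python and derives the predicted label for each row; it is an illustration of the technique, not the notebook's original code.

```python
import math


def softmax(row):
    """Numerically stable softmax for one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the prediction output shown above
logits = [
    [-4.7446227, 4.9986143],
    [4.4667053, -4.5196543],
    [4.5707283, -4.8942003],
    [-4.860986, 5.014812],
    [4.6963043, -4.9394126],
]

probabilities = [softmax(row) for row in logits]
# The predicted label is the index of the largest probability (0 = non-humor, 1 = humor)
predicted_labels = [row.index(max(row)) for row in probabilities]

for probs, label in zip(probabilities, predicted_labels):
    print([f"{p:.6f}" for p in probs], label)
```

Softmax does not change which label wins; it only rescales the two logits of each row into probabilities that sum to 1, which are easier to read and to threshold.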