

<center><img src="../pngs/20220926_hackathon_nlp_track.png" style="border-radius:15px"></center>

# <b><div style="color:#211894;font-size:100%;text-align:center">Welcome to the NLP Track of the Hackathon! Part 2 of 2: Quantization 🚀</div></b>

<h1><center>Author: Benjamin Consolvo <br></center></h1>

# <a id="TOC">Table of Contents</a> 
- [1. Introduction: Problem Statement & Dataset, Model Architecture, Hardware, and Software](#introduction)  
- [2. Importing of Libraries](#install)  
- [3. Data Loading](#data)  
- [4. Quantization](#quant)
- [5. Model Inference on Intel Gen. 3 Xeon CPU](#inference)
- [6. Summary](#summary)
- [7. References](#references)






<a id="introduction"></a>
# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">1. Introduction: Problem Statement, Model Architecture, Hardware, and Software</div>

Hello and welcome to this NLP quantization notebook! I will first briefly introduce the problem, the model architecture, the hardware, and the software we will be using in the two companion notebooks. For the demo, I have already run all the cells so that you are not sitting and waiting for anything, but you are of course encouraged to run them yourself and to change as much as you’d like.

I made a video introduction to the notebook here: https://www.intel.com/content/www/us/en/developer/videos/ai-for-social-good-hackathon.html

Timestamps:
#### Notebook 1 of 2 - Habana Training Demo
- 0:00 - 1. Introduction  
    - 0:25 - Problem Statement
    - 1:25 - Model Architecture - DistilBERT
    - 1:50 - Hardware - Habana® Gaudi® HPU and 3rd Generation Intel® Xeon®
    - 3:15 - Monitoring compute
    - 3:55 - Software 
- 5:10 - 2. Importing of Libraries
- 6:13 - 3. Exploratory Data Analysis (EDA) and Tokenization
    - 7:27 - Tokenization
    - 9:20 - Histogram and word cloud
    - 11:06 - `torch.tensor` format
- 11:54 - 4. Model Training on Habana® Gaudi® HPU 
    - 11:54 - Setting up training
    - 13:42 - Training the model
- 15:09 - 6. Model Performance and Sample Inference
    - 15:19 - Inference on unseen test dataset
    - 15:58 - Inference on Single Sample
- 17:16 - 8. Summary 
- 18:39 - 9. References

#### Notebook 2 of 2 - Quantization Demo
- 19:03 - 1. Introduction
- 19:32 - 2. Importing of Libraries
- 20:08 - 3. Data Loading
- 20:22 - 4. Quantization
- 22:46 - 5. Model Inference on Intel Gen. 3 Xeon CPU
    - 23:03 - FP32 model
    - 23:26 - INT8 model
- 23:54 - 6. Summary
- 24:20 - END

#### Problem Statement
In a world where negativity in speech and media is prominent, humor can help uplift the human spirit. “How to create a method or model to discover the structures behind humor, recognize humor … remains a challenge because of its subjective nature” ([Jain, 2017](https://core.ac.uk/download/pdf/234824434.pdf)). Machine learning and deep learning have progressed to the point of producing powerful language models. The proposed challenge here is to teach a computer how to distinguish between a humorous and a non-humorous statement in English.

In the first of two Jupyter notebooks, we will train a binary text classification model to determine if a statement is humorous. For the demonstration, I will only use a small portion of the data for training. 
- You are encouraged to use more/all of the data to improve the efficacy of your model.
- You are also encouraged to experiment with data tokenization, preprocessing, and augmentation.

In the second Jupyter notebook, we will increase the performance of prediction (or inference) in a simulated production environment.



#### Model Architecture
- Today, we will use a distilled version of the BERT transformer-based model architecture, called DistilBERT (https://arxiv.org/abs/1910.01108). You can find a description of the model on Huggingface here: https://huggingface.co/distilbert-base-uncased. It is a smaller, faster, distilled version of BERT.   You are free and encouraged to experiment with other architectures.   
- BERT stands for Bidirectional Encoder Representations from Transformers, and it is a deep learning model for natural language processing (NLP) that can be used for a variety of language tasks. 


#### Hardware

##### Habana® Gaudi® HPU
- For the first notebook, we will be training our model using a Habana® Gaudi® HPU (Habana Processing Unit) accelerator, hosted on AWS. The instance is an Amazon EC2 dl1.24xlarge (https://aws.amazon.com/ec2/instance-types/dl1/). It provides 8 Gaudi accelerators (HPUs) and is advertised as delivering "up to 40%" better price performance than comparable GPU-based instances (https://habana.ai/training/gaudi/). 

- Due to the smaller size of the dataset and relatively low training time, I am only covering single-HPU training here, but if you would like to try distributed training over multiple HPUs, you can visit the Optimum Habana GitHub repository to learn from the examples of distributed training there (https://github.com/huggingface/optimum-habana/tree/main/examples/text-classification).

- The Habana® Gaudi® DL1 instances come with 96 2nd Generation Intel® Xeon® vCPUs (48 physical cores).

##### 3rd Generation Intel® Xeon® Platinum 8375C Ice Lake CPU
- In the second notebook, I will be showing you how to speed up inference time using a technique called “quantization” on a production-capable 3rd Generation Intel® Xeon® Platinum 8375C Ice Lake CPU (https://ark.intel.com/content/www/us/en/ark/products/series/204098/3rd-generation-intel-xeon-scalable-processors.html). This instance is called m6i.*xlarge on AWS.

##### 4th Generation Intel® Xeon® Sapphire Rapids CPU

- Intel will be releasing a 4th Generation Xeon® Sapphire Rapids CPU with Advanced Matrix Extensions (AMX) that is expected to offer an inference speedup of up to 8X for an INT8 model, as compared to INT8 on the 3rd Generation Xeon® Ice Lake CPU. For more information about the upcoming performance benefits, you can visit:
    - https://edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/?r=1156525610

##### Pro tips for monitoring compute
- To actively monitor the compute on the HPUs, you can use `watch hl-smi`, similar to `watch nvidia-smi` on NVIDIA GPUs.
- To monitor the compute on the CPU cores and memory usage, you can use `htop` in the command line. And to get a printout of the CPU information, you can use a command called `lscpu`.

In [2]:
!lscpu #cpu information

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5799.92
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscal
                         l nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopo
                         logy nonstop_tsc cpuid aperfmperf tsc_known_freq pni pc
                         lmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe 
                         popcnt tsc_deadline_timer aes xsave 

Though we are using these specific hardware architectures, I have attempted to make the code as accessible as possible by offering alternative code in the notebooks for other hardware.

#### Software

I will now briefly highlight some of the key Python libraries I will be using in the two notebooks. 
- We will be using the Habana SynapseAI fork of PyTorch. It looks and feels much like the stock PyTorch, but it has been optimized for Habana® Gaudi® HPUs.
- Stock CPU PyTorch (https://pytorch.org/get-started/locally/) for inference.
- The Huggingface 🤗 `transformers` library is where we pull our pre-trained DistilBERT model and its associated configuration from prior to training.
- For setting up the training, we will be using `optimum.habana`, which is “the interface between the Transformers library and Habana’s® Gaudi® HPU” (https://github.com/huggingface/optimum-habana).  
- To speed up model inference, we will be using `optimum.intel`, which is “the interface between the Transformers library and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.” (https://github.com/huggingface/optimum-intel). In particular, the Intel Neural Compressor (INC) is used in the backend for quantization of a model from FP32 to INT8.


## <span style="color:#211894;font-size:100%;text-align:left">Evaluation Guidelines</span>

#### Judging criteria:
- 70% F1 score
- 30% Inference speed

Make your model as fast as possible during inference, while retaining a high F1 score!
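
Since the F1 score carries most of the weight, it may help to see how it is computed for a binary task. The following is a minimal sketch in plain Python, assuming label `1` means "humorous". The function name `binary_f1` and the sample labels are made up for illustration, and the exact scoring script used by the judges is not shown in this notebook.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, only to exercise the function.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
f1 = binary_f1(y_true, y_pred)
print(f"F1 = {f1:.3f}")
```

Unlike accuracy, F1 ignores true negatives, so a model that rarely predicts "humorous" can have a high accuracy but a low F1.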



<a id="install"></a>
# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">2. Importing of Libraries</div>
We now move on to the second main section: “Importing of Libraries”.

Before importing tools, I run the two lines starting with `%load_ext autoreload` so that any updated local Python libraries are automatically reloaded into the Jupyter notebook. 

In [3]:
#Lines to automatically reload any new local libraries as they are updated.
%load_ext autoreload
%autoreload 2

Instead of using the SynapseAI PyTorch, we will now use the CPU version of PyTorch.

In [4]:
## Importing libraries 
import torch
import os

import pandas as pd
from datasets import load_dataset, load_metric
import matplotlib.pyplot as plt

import sys
sys.path.append('./src/') #make the local helper modules in ./src/ importable
import nlpload, evaluate

from tqdm import tqdm
from transformers import (
    DistilBertConfig, 
    AutoConfig, 
    DistilBertTokenizerFast, 
    DistilBertForSequenceClassification,
    AutoModelForSequenceClassification,
    Trainer, 
    TrainingArguments,
    EvalPrediction, 
    default_data_collator
)
import mlflow
import time
import numpy as np
import boto3
from wordcloud import WordCloud, STOPWORDS
import logging
logger = logging.getLogger(__name__)


Also of note here, we are loading two libraries that we need for quantization: `neural_compressor` and `optimum.intel`.

In [5]:
##These libraries are necessary to run quantization 
import neural_compressor
from optimum.intel.neural_compressor import (
    IncDistillationConfig, 
    IncDistiller,
    IncOptimizer,
    IncPruner,
    IncPruningConfig,
    IncQuantizationConfig,
    IncQuantizationMode,
    IncQuantizer,
    IncTrainer
)
from optimum.intel.neural_compressor.quantization import IncQuantizedModelForSequenceClassification

And I am setting the `torch` device to `cpu` instead of `hpu`.

In [6]:
# device = torch.device('cuda') #NVIDIA-GPU
device = torch.device('cpu') #CPU-only
# device = torch.device('mps') #Mac M1/M2 GPU
# device = torch.device('hpu') #Habana HPU
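
If you would rather not comment and uncomment lines by hand, the choice can be automated. This is only a sketch: the helper `pick_device_name` is a hypothetical name of mine, HPU detection is left out because it depends on the Habana software stack, and the code falls back to `cpu` when `torch` is not installed.

```python
def pick_device_name(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer an NVIDIA GPU, then an Apple-silicon GPU, otherwise the CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


try:
    import torch

    cuda_ok = torch.cuda.is_available()
    # torch.backends.mps only exists in newer PyTorch releases, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
except ImportError:
    cuda_ok = mps_ok = False  # no PyTorch available: report the CPU

device_name = pick_device_name(cuda_ok, mps_ok)
print(device_name)  # pass this string to torch.device(...)
```

For this notebook the result should still be `cpu`, since the goal is to measure inference on the Xeon CPU.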

<a id="data"></a>

# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">3. Data Loading</div>

I have simplified the data loading only to the essentials with no data exploration here, but I am taking the same steps as I did in the first notebook to load the data.

In [7]:
#Loading data
hdf = pd.read_csv('./data/dataset.csv')
hdf["label"] = hdf["humor"].astype(int) #convert True/False into a 0 or a 1
hdf2 = hdf.sample(frac=0.10).reset_index(drop=True)
train, val, test = np.split(hdf2.sample(frac=1, random_state=42),  #train val test split
                       [int(.9*len(hdf2)), int(.95*len(hdf2))])
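
The two cut points passed to `np.split` above produce a 90% / 5% / 5% train / validation / test split. Here is the same idea as a small standalone sketch in plain Python. It works on row indices instead of a DataFrame, and the function name `split_indices` and the row count of 1000 are illustrative assumptions.

```python
import random


def split_indices(n_rows, first_cut=0.90, second_cut=0.95, seed=42):
    """Shuffle row indices, then cut them at 90% and 95% of the rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible shuffle
    train_end = int(first_cut * n_rows)
    val_end = int(second_cut * n_rows)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Every row lands in exactly one of the three parts, so the test rows are never seen during training.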

In [8]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(list(train['text']), truncation=True, padding=True)
val_encodings = tokenizer(list(val['text']), truncation=True, padding=True)
test_encodings = tokenizer(list(test['text']), truncation=True, padding=True)
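
To make `truncation=True` and `padding=True` more concrete, here is a simplified illustration of their effect on a batch of token ids. This is not the tokenizer's actual implementation: the helper `pad_and_truncate` and the token ids are made up, while the pad id of 0 and the maximum length of 512 are taken from the DistilBERT configuration printed later in this notebook.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for a short and a longer sentence.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
encoded = pad_and_truncate(batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

After this step every row has the same length, which is what allows the batch to be turned into a single tensor.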

In [9]:
class newDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

In [10]:
train_dataset = newDataset(train_encodings, list(train['label']))
val_dataset = newDataset(val_encodings, list(val['label']))
test_dataset = newDataset(test_encodings, list(test['label']))

If you have split up the data into a test set, you can load it in here.

In [11]:
csvpath = './data/dataset.csv'
batch_size = 1000
torch_dataloader = nlpload.mainLoader(csvpath,batch_size,labels=False) #function in ../src/nlpload.py

<a id="quant"></a>

# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">4. Quantization</div>

We now turn to quantization. To save time at inference, we can quantize the model from FP32 to INT8, without much drop in accuracy. This step is not necessary to submit your model, but it should save on inference time. 

To learn more about the functions and quantization, you can visit the `optimum.intel` GitHub repository with text classification examples (https://github.com/huggingface/optimum-intel/tree/main/examples/neural_compressor/text-classification). This is where I learned how to apply quantization and present it to you here in this notebook.
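
Before applying the library, it may help to see the basic idea behind going from FP32 to INT8: each float is mapped to one of 256 integer levels using a scale and a zero point, and mapped back with a small rounding error. The following is a toy sketch in plain Python under my own assumptions (one scale for the whole list, made-up weights, hypothetical function names). It is not how the Intel Neural Compressor is implemented internally.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers with one shared scale and zero point."""
    qmin, qmax = -128, 127
    lo = min(min(values), 0.0)  # the representable range must include 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.05, 0.0, 0.31, 1.27]  # made-up FP32 values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

Each value now fits in one byte instead of four, and the worst-case error in this example stays below one quantization step (`scale`). That is why accuracy usually drops only slightly.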

First, we need to load the previously trained model.

In [14]:
#Loading the model
output_model_folder = './models/checkpoint-2000' #this may change depending on where you saved your model.
model_fp32 = DistilBertForSequenceClassification.from_pretrained(output_model_folder) 
model_fp32.to(device)
print('')

I then set up the trainer, but this time use the `IncTrainer`, or the Intel Neural Compressor Trainer class. 

In [15]:
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

data_collator = default_data_collator

training_args = TrainingArguments(
    output_dir="./output_quantized",
    num_train_epochs=5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    logging_steps=50
)

trainer = IncTrainer(
        model=model_fp32,
        args=training_args, 
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

We can then run a baseline model for inference using the FP32 model.

In [16]:
metric_name = "eval_accuracy"
def take_eval_steps(model, trainer, metric_name, save_metrics=False):
    trainer.model = model
    metrics = trainer.evaluate()
    if save_metrics:
        trainer.save_metrics("eval", metrics)
    logger.info("{}: {}".format(metric_name, metrics.get(metric_name)))
    logger.info("Throughput: {} samples/sec".format(metrics.get("eval_samples_per_second")))
    return metrics[metric_name]

result_baseline_model = take_eval_steps(model_fp32, trainer, metric_name)


***** Running Evaluation *****
  Num examples = 650
  Batch size = 128


2022-09-24 16:22:57 [INFO] eval_accuracy: 0.9769230769230769
2022-09-24 16:22:57 [INFO] Throughput: 234.266 samples/sec


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


We now set up everything that we need to run quantization. One thing I do want to point out is that I am using a configuration file called `quantization.yml` that I downloaded from the `optimum.intel` GitHub page (https://github.com/huggingface/optimum-intel/blob/main/examples/neural_compressor/config/quantization.yml). You can adjust some of these parameters if you would like to change how the model is quantized.

In [17]:
def take_train_steps(model, trainer, agent=None, resume_from_checkpoint=None, last_checkpoint=None):
    trainer.model_wrapped = model
    trainer.model = model
    train_result = trainer.train(agent)
    metrics = train_result.metrics
    trainer.save_model()  # Saves the tokenizer too for easy upload
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    return trainer.model

def train_func(model):
    # return take_train_steps(model, trainer, resume_from_checkpoint, last_checkpoint)
    return take_train_steps(model, trainer)

def eval_func(model):
    return take_eval_steps(model, trainer, metric_name)


q8_config = IncQuantizationConfig.from_pretrained(
            config_name_or_path = './config/quantization.yml' #from https://github.com/huggingface/optimum-intel/blob/main/examples/neural_compressor/config/quantization.yml
        )

quant_approach = IncQuantizationMode(q8_config.get_config("quantization.approach"))
calib_dataloader = trainer.get_train_dataloader() if quant_approach == IncQuantizationMode.STATIC else None

quantizer = IncQuantizer(
            q8_config, 
            eval_func=eval_func, 
            train_func=train_func, 
            calib_dataloader=calib_dataloader
        )

optimizer = IncOptimizer(
    model_fp32,
    quantizer=quantizer,
    one_shot_optimization=True,
    eval_func=eval_func,
    train_func=train_func,
)

We can then go ahead and run quantization with `optimizer.fit()`. If you have a relatively small test dataset size, it should quantize the model fairly quickly, within a minute or less. 

In [18]:
agent = optimizer.get_agent()
optimized_model = optimizer.fit()

2022-09-24 16:22:57 [INFO] Start sequential pipeline execution.
2022-09-24 16:22:57 [INFO] The 0th step being executing is QUANTIZATION.
2022-09-24 16:22:57 [INFO] Pass query framework capability elapsed time: 151.37 ms
2022-09-24 16:22:58 [INFO] Get FP32 model baseline.
***** Running Evaluation *****
  Num examples = 650
  Batch size = 128
2022-09-24 16:23:00 [INFO] eval_accuracy: 0.9769230769230769
2022-09-24 16:23:00 [INFO] Throughput: 226.698 samples/sec
2022-09-24 16:23:00 [INFO] Save tuning history to /home/ubuntu/nlp-hackathon/notebooks/nc_workspace/2022-09-24_16-22-51/./history.snapshot.
2022-09-24 16:23:00 [INFO] FP32 baseline is: [Accuracy: 0.9769, Duration (seconds): 2.8895]
2022-09-24 16:23:00 [INFO] Fx trace of the entire model failed, We will conduct auto quantization
  torch.tensor(weight_qparams["scale"], dtype=torch.float, device=device))
  torch.tensor(weight_qparams["zero_point"], dtype=zero_point_dtype, device=device))
2022-09-24 16:23:02 [INFO] |******Mixed Precisi

And now that we have an optimized model, we can run evaluation and then compare it to the baseline model. 

In [19]:
result_optimized_model = take_eval_steps(optimized_model, trainer, metric_name, save_metrics=True)

***** Running Evaluation *****
  Num examples = 650
  Batch size = 128
2022-09-24 16:23:06 [INFO] eval_accuracy: 0.98
2022-09-24 16:23:06 [INFO] Throughput: 337.179 samples/sec


We now can save the quantized model, and have both the FP32 and the newly quantized INT8 model available.

In [20]:
# Save the resulting model and its corresponding configuration in the given directory
optimizer.save_pretrained(training_args.output_dir)
# Compute the model's sparsity
sparsity = optimizer.get_sparsity()
logger.info(
    f"Optimized model with {metric_name} of {result_optimized_model} and sparsity of {round(sparsity, 2)}% "
    f"saved to: {training_args.output_dir}. Original model had an {metric_name} of {result_baseline_model}."
)

Configuration saved in ../output_quantized/config.json
2022-09-24 16:23:06 [INFO] Model weights saved to ../output_quantized
2022-09-24 16:23:06 [INFO] Optimized model with eval_accuracy of 0.98 and sparsity of 0.86% saved to: ../output_quantized. Original model had an eval_accuracy of 0.9769230769230769.


<a id="inference"></a>
# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background:linear-gradient(90deg, navy, #dc98ff, #251cab);overflow:hidden">5. Model Inference on Intel Gen. 3 Xeon CPU</div>

Now that we have a quantized model, let's run inference on a test dataset to make sure that we have not lost a significant amount of accuracy.

In the first code snippet, I am loading the FP32 model, putting it on the CPU device, and then running inference on a test dataset.

In [21]:
mlflow.end_run() #new model, so the same parameters cannot be used as in the last mlflow run.

# model loading: load in the FP32 model
output_model_folder = './models/checkpoint-2000' #this may change depending on where you saved your model
model_fp32 = DistilBertForSequenceClassification.from_pretrained(output_model_folder) 
model_fp32.to(device)
print('')

trainer_fp32 = Trainer(
    model=model_fp32,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,             # evaluation dataset
    compute_metrics = compute_metrics
    )
print("**************** Evaluation below************")
metrics = trainer_fp32.evaluate()
metrics["eval_samples"] = len(test_dataset)
trainer_fp32.log_metrics("eval", metrics)
trainer_fp32.save_metrics("eval", metrics)

loading configuration file ../models/checkpoint-2000/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}

loading weights file ../models/checkpoint-2000/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at ../models/checkpoint-2000.
I


**************** Evaluation below************


***** eval metrics *****
  eval_accuracy           =     0.9769
  eval_loss               =     0.1376
  eval_runtime            = 0:00:02.68
  eval_samples            =        650
  eval_samples_per_second =    242.421
  eval_steps_per_second   =      2.238


In the second code snippet, I am loading the INT8 model, and running inference on the same test set, to compare it to the FP32 model. What you should see is a very similar accuracy/F1 score, but the INT8 model should show that it can handle more samples per second than the FP32 model.

In [22]:
#load in the quantized INT8 model.
quantized_model_folder = './output_quantized/' #same folder as training_args.output_dir
model_int8 = IncQuantizedModelForSequenceClassification.from_pretrained(quantized_model_folder)
model_int8.to(device)
print('')

trainer_int8 = Trainer(
    model=model_int8,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,             # evaluation dataset
    compute_metrics = compute_metrics
    )

print("**************** Evaluation below************")
metrics = trainer_int8.evaluate()
metrics["eval_samples"] = len(test_dataset)
trainer_int8.log_metrics("eval", metrics)
trainer_int8.save_metrics("eval", metrics)


loading configuration file ../output_quantized/config.json
Model config DistilBertConfig {
  "_name_or_path": "../output_quantized",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "int8",
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}

loading configuration file ../output_quantized/config.json
Model config DistilBertConfig {
  "_name_or_path": "../models/checkpoint-2000",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_d


**************** Evaluation below************


***** eval metrics *****
  eval_accuracy           =       0.98
  eval_loss               =      0.106
  eval_runtime            = 0:00:01.80
  eval_samples            =        650
  eval_samples_per_second =    359.668
  eval_steps_per_second   =       3.32


Let's run the FP32 and INT8 models on the unseen test dataset (10,000 rows). <br>
The batch size was set to 1000, so there should be 10 passes through the data loader.

In [24]:
y_true,y_preds_raw,infer_time = evaluate.nlp_evaluate(model_fp32,torch_dataloader,device) #function in ../src/evaluate.py

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


inference_time=56.26547646522522


In [25]:
def extract_class_from_preds(y_preds_raw):
    pred_y_class = []
    start_time = time.time()
    for y in tqdm(y_preds_raw):
        pred_y_class.extend(y.tolist())
    print(f'time = {time.time()-start_time}')
    return pred_y_class

In [26]:
output_pred_y = extract_class_from_preds(y_preds_raw)

100%|██████████| 10/10 [00:00<00:00, 47500.61it/s]

time = 0.0021653175354003906





In [27]:
len(output_pred_y)

10000

In [28]:
y_true_2,y_preds_raw_2,infer_time_2 = evaluate.nlp_evaluate(model_int8,torch_dataloader,device) #function in ../src/evaluate.py

inference_time=43.54761600494385
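
To put the two runs side by side, the speedup can be computed directly from the numbers printed above. This is a small arithmetic sketch using the values logged in this particular run; the variable names are my own, and your numbers will differ from run to run.

```python
# Values copied from the outputs above (this specific run).
fp32_throughput = 242.421  # eval_samples_per_second, FP32, 650 test samples
int8_throughput = 359.668  # eval_samples_per_second, INT8, 650 test samples
fp32_seconds = 56.26547646522522  # inference_time, FP32, 10,000 rows
int8_seconds = 43.54761600494385  # inference_time, INT8, 10,000 rows

throughput_speedup = int8_throughput / fp32_throughput
wall_clock_speedup = fp32_seconds / int8_seconds
time_saved_pct = 100 * (1 - int8_seconds / fp32_seconds)

print(f"Throughput speedup (650 samples): {throughput_speedup:.2f}x")
print(f"Wall-clock speedup (10,000 rows): {wall_clock_speedup:.2f}x")
print(f"Time saved on 10,000 rows: {time_saved_pct:.1f}%")
```

For this run, that works out to roughly 1.48x on the small evaluation set and roughly 1.29x on the 10,000 rows. The second measurement comes from a different code path (`evaluate.nlp_evaluate`), so the two ratios are not expected to match exactly.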


In [29]:
output_pred_y_2 = extract_class_from_preds(y_preds_raw_2)

100%|██████████| 10/10 [00:00<00:00, 38800.22it/s]

time = 0.0017940998077392578





In [30]:
hdf_10k = pd.read_csv(csvpath)
true_false_list = [bool(y) for y in output_pred_y_2]
hdf_10k['humor'] = true_false_list

In [31]:
team_name = 'Team_NLP0'
output_csvpath = f'{training_args.output_dir}/{team_name}.csv' #csv must be named with your team name. E.g., "Team_NLP0.csv"
hdf_10k.to_csv(output_csvpath,index=False)  #outputting CSV with humor prediction label

<a id="summary"></a>

# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">6. Summary</div>


A summary of learnings in this notebook:

- Loaded, split (train/val/test), and tokenized the humor text data
- Quantized our trained model from FP32 to INT8 for faster inference
- Evaluated the accuracy and inference speed of the quantized model against the FP32 baseline on a small test dataset

<a id="references"></a>

# <div style="padding:20px;color:white;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:#251cab;overflow:hidden;background:linear-gradient(90deg, navy, #dc98ff, #251cab)">7. References</div>


- Habana Gaudi instance: 
    - https://aws.amazon.com/ec2/instance-types/dl1/
    - https://habana.ai/training/gaudi/
    - https://www.intel.com/content/www/us/en/developer/articles/technical/get-started-habana-gaudi-deep-learning-training.html#gs.9p3p1b
- Humor problem statement: Jain, Manan. "Humor Detection." (2017). https://core.ac.uk/download/pdf/234824434.pdf
- Optimum Habana GitHub: https://github.com/huggingface/optimum-habana
- Optimum Intel GitHub: https://github.com/huggingface/optimum-intel
- DistilBERT Model: Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019). https://arxiv.org/abs/1910.01108
	- Model on Huggingface:  https://huggingface.co/distilbert-base-uncased
- Pruning and Distillation: https://www.intel.com/content/www/us/en/developer/articles/technical/compression-and-acceleration-of-high-dimensional-neural-networks.html
- Intel® Xeon®:
    - 3rd Gen: https://ark.intel.com/content/www/us/en/ark/products/series/204098/3rd-generation-intel-xeon-scalable-processors.html
    - 4th Gen: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/?r=1156525610
    
## <span style="padding:0px;color:#251cab;margin:0;font-size:100%;text-align:left;display:fill;border-radius:10px;background-color:white;overflow:hidden">Notices & Disclaimers</span>
Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.  
Performance varies by use, configuration and other factors. Learn more on the Performance Index site.   
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure.   
Your costs and results may vary.   
Intel technologies may require enabled hardware, software or service activation.  
&copy; Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  