This notebook follows the modeling done in the Jupyter notebook 'Models with TF-IDF Vectorization'.

<a id="0"></a> <br>
 # Table of Contents  
1. [Introduction](#1)
    1. [Loading Packages](#2) 
    1. [Loading Dataset](#3) 
1. [Creating Dictionaries](#5) 
1. [Hugging Face](#7)     
    1. [The Dataset](#8) 
    1. [The Tokenizer](#10) 
        1. [Model Setup for Text Classification](#11)
        1. [distilbert-base-uncased Model Evaluation](#12)
        1. [Transfer Learning Pipeline](#16)
    1. [Running the Model](#13) 
        1. [Running the Model](#14)
        1. [Model Predictions](#15) 
1. [Conclusion](#17)  

<a id="1"></a> 
# 1. Introduction

For this model we will fine-tune a pre-trained Hugging Face transformer, distilbert-base-uncased, to classify product descriptions as gluten-free or not.

<a id="2"></a> 
### 1a. Loading Packages

In [None]:
!nvidia-smi

In [None]:
!pip install diffusers==0.11.1
!pip install transformers scipy ftfy accelerate datasets s3fs

In [None]:
import numpy as np
import pandas as pd
import torch

from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import TextClassificationPipeline
from datasets import load_dataset
from transformers import XLMRobertaXLConfig, XLMRobertaXLModel

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

<a id="3"></a> 
### 1b. Loading Dataset

Loading the dataset produced by the EDA notebook.

In [None]:
gluten_free = pd.read_csv("Capstone.csv")

In [None]:
# Sanity Check
gluten_free.head()

In [None]:
gluten_free['gluten_free?'].value_counts()

----

<a id="5"></a> 
# Creating Dictionaries

In [None]:
# Our target categories need to be encoded as integers, but we'll want to reverse this
# encoding back to the original categorical strings later, so we need forward and reverse lookups.
id2label = {i:cat for i,cat in enumerate(set(gluten_free["gluten_free?"]))}
label2id = {v:k for k,v in id2label.items()}

<b>Two Dictionaries</b>
    
1. id2label: This dictionary maps integer labels to their original categorical values.
2. label2id: This dictionary maps original categorical values to their corresponding integer labels.
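
As a minimal sketch of how the two lookups round-trip, assume the target column holds two hypothetical string categories, "yes" and "no" (the real values in `gluten_free?` may differ). Sorting the categories before enumerating them is an added assumption that keeps the integer assignment stable between runs:

```python
# Hypothetical category values; the real column may use different labels.
categories = ["yes", "no", "yes", "no", "no"]

# Forward and reverse lookups; sorting makes the integer assignment repeatable.
id2label = {i: cat for i, cat in enumerate(sorted(set(categories)))}
label2id = {cat: i for i, cat in id2label.items()}

# Encode the strings as integers, then decode them back.
encoded = [label2id[cat] for cat in categories]
decoded = [id2label[i] for i in encoded]

print(id2label)  # {0: 'no', 1: 'yes'}
print(encoded)   # [1, 0, 1, 0, 0]
```

Decoding the encoded list returns the original strings, which is what lets the model's integer predictions be reported with the original category names later.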

#### Let's start by splitting the data:

1. Take two columns from the gluten_free dataset: "gluten_free?" and "description".
2. Rename these columns to "label" and "text".
3. Use the dictionaries above to convert the categorical values in the "label" column into integer labels.
4. Flag roughly half of the rows at random as test rows; the remaining rows are available for training.

In [None]:
# pull out columns of interest and do a manual test_train split based on random integer assignment
simplified = gluten_free[["gluten_free?","description"]].copy()
simplified.columns = ["label","text"]
simplified.loc[:,"label"] = list(label2id[lab] for lab in simplified["label"])
test_flag = np.random.randint(0,high=10,size=gluten_free.shape[0])
simplified.loc[:,'test'] = test_flag > 4
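
The random flag above changes on every run because no seed is set. A minimal sketch of a repeatable version, assuming NumPy's `default_rng` generator, an arbitrary seed of 42, and a hypothetical row count in place of `gluten_free.shape[0]`:

```python
import numpy as np

# Seeded generator so the same rows land in the test split on every run.
rng = np.random.default_rng(42)  # the seed value is arbitrary

n_rows = 10_000  # hypothetical row count standing in for gluten_free.shape[0]
test_flag = rng.integers(0, 10, size=n_rows)  # integers 0..9, upper bound exclusive
is_test = test_flag > 4  # values 5..9, so about half of the rows

print(f"Share of rows flagged as test: {is_test.mean():.3f}")
```

Because five of the ten possible integers exceed 4, the share of test rows should sit close to 0.5.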

#### Let's split according to test and train

The training set contains 50,000 samples from each category, drawn from the rows not flagged for testing.
The test set contains 100,000 samples in total, drawn at random from the rows flagged for testing, and is used for evaluation during tuning.


In [None]:
# select 50,000 of each category for training
train = simplified[~simplified.test].groupby('label',group_keys=False).apply(lambda x: x.sample(50000))

# select 100,000 rows in total (not per category) for test evaluation during tuning
test = simplified[simplified.test].sample(100000)
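
The per-category draw can also be written with pandas' `DataFrameGroupBy.sample`, which avoids `apply`. The sketch below is an alternative, not the notebook's own code: it uses a small hypothetical frame and 3 rows per label in place of 50,000, and `random_state=0` is an added assumption to make the draw repeatable.

```python
import pandas as pd

# Toy frame with an imbalanced label column (hypothetical data).
toy = pd.DataFrame({
    "label": [0] * 8 + [1] * 4,
    "text": [f"product {i}" for i in range(12)],
})

# Draw the same number of rows from each label; random_state fixes the draw.
balanced = toy.groupby("label").sample(n=3, random_state=0)

print(balanced["label"].value_counts().to_dict())
```

Each label contributes exactly three rows, so the result is balanced even though the input is not.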

Let's see the size of the splits

In [None]:
print(f'Shape of test set: {test.shape}')
print(f'Shape of train set: {train.shape}')

Let's save the test and train sets as CSV files to be processed by the Hugging Face data pipelines.

In [None]:
# save as CSV files to be processed by the hugging face data pipelines
train.reset_index(drop=True).to_csv("gluten_free_train.csv")
test.reset_index(drop=True).to_csv("gluten_free_test.csv")

----

<a id="7"></a> 
# Hugging Face

-----

<a id="8"></a> 
### The Dataset

----

Let's load the datasets we created above

In [None]:
# load as hugging face dataset
dataset = load_dataset('csv', data_files={'train': 'gluten_free_train.csv', 'test': 'gluten_free_test.csv'})

-----

<a id="10"></a> 
### The Tokenizer

----

For this model we will use the distilbert-base-uncased tokenizer from Hugging Face.

In [None]:
# load a tokenizer from our target language model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

# pro forma hugging face text processing setup
tokenized_data = dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

1. Loading a Tokenizer:
The "distilbert-base-uncased" tokenizer is loaded from the Hugging Face Hub. A tokenizer converts raw text into the token IDs that a language model takes as input.
2. Preprocessing Function:
It takes a batch called examples, which is expected to have a key named "text" containing the text data.
It tokenizes the text with truncation=True, so over-long descriptions are cut to the model's maximum length, and padding=True, so every sequence in a batch is padded to the length of the longest one.
3. Data Collator:
DataCollatorWithPadding pads each training batch to a common length when the batches are assembled.
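
To make truncation and padding concrete, here is a toy stand-in written in plain Python. It is not the real distilbert-base-uncased tokenizer (which splits text into WordPiece sub-words and adds special tokens); the function name `toy_tokenize`, the whitespace splitting, and the `max_length` of 6 are all assumptions chosen for illustration.

```python
def toy_tokenize(texts, max_length=6, pad_id=0):
    """Whitespace 'tokenizer' that mimics truncation and padding only."""
    vocab = {}
    batch = []
    for text in texts:
        # Assign each new word the next integer id (ids start at 1; 0 is padding).
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
        batch.append(ids[:max_length])  # truncation: drop tokens past max_length

    # Padding: extend every sequence to the length of the longest one in the batch.
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding, like an attention mask.
    attention_mask = [[1 if tok != pad_id else 0 for tok in ids] for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = toy_tokenize(["organic chocolate", "gluten free oat flour cookies with sea salt"])
print(out["input_ids"])       # [[1, 2, 0, 0, 0, 0], [3, 4, 5, 6, 7, 8]]
print(out["attention_mask"])  # [[1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The short description is padded with zeros and the long one loses its last two words, which mirrors the effect of `padding=True` and `truncation=True` on a batch.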

<a id="11"></a> 
##### Model Setup for Text Classification

In [None]:
# model setup for text classification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id)

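1. Loading a Pre-trained Model:
The pre-trained "distilbert-base-uncased" model is loaded from the Hugging Face Hub with a new sequence-classification head on top.
2. Specifying the Number of Labels:
The length of the label2id dictionary sets the number of output labels, and the two dictionaries are stored in the model configuration so predictions can be reported with the original category names.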

<a id="12"></a> 
##### distilbert-base-uncased Model Evaluation

Let's create a function that evaluates the model so we can compare it to the other models.

In [None]:
# leverage sklearn metrics for runtime training evaluation
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred,average='micro')
    precision = precision_score(y_true=labels, y_pred=pred,average='micro')
    f1 = f1_score(y_true=labels, y_pred=pred,average='micro')

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

This function calculates the accuracy, precision, recall, and F1 score, with the last three micro-averaged across classes.
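
With `average='micro'`, true positives, false positives, and false negatives are pooled over all classes before the ratios are taken. When every sample has exactly one label, each wrong prediction counts once as a false positive and once as a false negative, so precision, recall, and F1 all equal accuracy. That is why the four metric columns in the results table further down are identical. The pure-Python sketch below (hypothetical labels, not the notebook's data, and a hypothetical helper name `micro_scores`) illustrates this without scikit-learn:

```python
def micro_scores(y_true, y_pred):
    """Accuracy plus micro-averaged precision, recall, and F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    # Pool the counts over every class before computing any ratio.
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            tp += (pred == label) and (truth == label)
            fp += (pred == label) and (truth != label)
            fn += (pred != label) and (truth == label)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


y_true = [0, 0, 0, 1, 1, 1, 1, 0]  # hypothetical labels
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]  # hypothetical predictions
print(micro_scores(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```

If per-class behavior matters, scikit-learn's `average='macro'` or `average='binary'` options would generally give precision and recall values that differ from accuracy.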

<a id="16"></a> 
##### Transfer Learning Pipeline

Let's complete the transfer learning pipeline by instantiating a TrainingArguments instance with specific parameters and creating a Trainer that brings all the components together: the model, training_args, the preprocessing pipeline, and the evaluation function.

In [None]:
# training parameter setup goes in a specific class instance
training_args = TrainingArguments(
    output_dir="gluten_free-classifier",
    learning_rate=2e-5,
    optim="adamw_torch",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False)

# the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)
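
As a rough guide to how long training will run, the number of optimizer steps follows from the training-set size, batch size, and epoch count. The sketch below assumes two label categories (so 100,000 training rows), a single device, and no gradient accumulation; none of these are confirmed by the cells above.

```python
import math

train_rows = 2 * 50_000  # assumes two label categories, 50,000 rows each
batch_size = 64          # per_device_train_batch_size
epochs = 5               # num_train_epochs

# Assumes one device and no gradient accumulation; the last batch may be partial.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1563 7815
```

With `evaluation_strategy="epoch"` and `save_strategy="epoch"`, an evaluation pass and a checkpoint would then occur once per epoch under these assumptions.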

----

<a id="13"></a> 
### Running the Model

-----

In [None]:
trainer.train()

<a id="14"></a> 
##### Running the Model

| Epoch | Training Loss | Validation Loss | Accuracy | Precision |Recall | F1 |
| --- | --- | --- |--- | --- | --- |--- |
| 1 | 0.407 | 0.403 | 0.817 | 0.817 | 0.817| 0.817|
| 2 | 0.342 | 0.384 | 0.830 | 0.830 | 0.830| 0.830|
| 3 | 0.292 | 0.350 | 0.849 | 0.849 | 0.849| 0.849|
| 4 | 0.253 | 0.346 | 0.855 | 0.855 | 0.855| 0.855|
| 5 | 0.226 | 0.375 | 0.850 | 0.850 | 0.850| 0.850|

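The table above shows the training loss falling in every epoch. Validation loss and accuracy improve through epoch 4 (0.346 and 0.855), then worsen slightly in the final epoch, which suggests the model is beginning to overfit.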
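
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the Trainer should, to the best of my understanding of its defaults, pick the checkpoint with the lowest validation loss. A small sketch of that selection, using the values from the table above and taking epoch numbers from row order:

```python
# (training loss, validation loss, accuracy) per epoch, copied from the table above
history = [
    (0.407, 0.403, 0.817),
    (0.342, 0.384, 0.830),
    (0.292, 0.350, 0.849),
    (0.253, 0.346, 0.855),
    (0.226, 0.375, 0.850),
]

# Epoch numbers are taken from row order, starting at 1.
best_epoch = min(range(len(history)), key=lambda i: history[i][1]) + 1

# Overfitting signal: training loss still falls while validation loss rises.
last, prev = history[-1], history[-2]
overfitting_hint = last[0] < prev[0] and last[1] > prev[1]

print(best_epoch, overfitting_hint)  # 4 True
```

Under that assumption, the weights kept at the end of training are those from the fourth epoch.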

Let's save the model.

In [None]:
# save_model is a Trainer method; it writes the fine-tuned model files to a directory
trainer.save_model("best_model")

<a id="15"></a> 
##### Model Predictions

In [None]:
# sanity-check prediction on a single example description
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=False, device='cuda')
pipe(["organic chocolate"])
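
The pipeline returns a list with one dictionary per input text, each holding a `label` and a `score`. As a rough sketch of how the model's raw outputs become that result (an approximation of the pipeline's default behavior for a multi-class head, with hypothetical label names and logits):

```python
import math

# Hypothetical lookup and raw model outputs (logits) for one input text.
id2label = {0: "not gluten free", 1: "gluten free"}
logits = [-1.2, 2.3]

# Softmax turns logits into probabilities that sum to 1.
exps = [math.exp(x - max(logits)) for x in logits]
probs = [e / sum(exps) for e in exps]

# Report the most likely class in the same shape the pipeline returns.
best = max(range(len(probs)), key=probs.__getitem__)
result = [{"label": id2label[best], "score": probs[best]}]
print(result)
```

The `score` is the probability of the winning class only, so a value near 0.5 in a two-class problem signals an uncertain prediction.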

-----

<a id="17"></a> 
# Conclusion

##### Previous Insights
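TF-IDF vectorization is clearly more accurate than CountVectorization for Naive Bayes (0.7611 vs. 0.7416); for Logistic Regression and Decision Tree the two vectorizations score within 0.0003 of each other.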

Overall, the Logistic Regression models have a slightly better balance between precision and recall for both classes. The Decision Tree models have the highest accuracy and perform reasonably well but have lower precision and recall for class 0, and therefore struggle to identify non-gluten-free products. The Naive Bayes models perform well but have slightly lower accuracy and precision for class 0 than Logistic Regression.

| Model | Vectorization | Accuracy |
| --- | --- | --- |
| Naive Bayes | TF-IDF | 0.7611 |
| Logistic Regression | TF-IDF  | 0.7613 |
| Decision Tree | TF-IDF | 0.7625 |
| Naive Bayes | CountVectorization | 0.7416 |
| Logistic Regression | CountVectorization  | 0.7614 |
| Decision Tree | CountVectorization | 0.7628 |


#### distilbert-base-uncased Model
| Epoch | Training Loss | Validation Loss | Accuracy | Precision |Recall | F1 |
| --- | --- | --- |--- | --- | --- |--- |
| 1 | 0.407 | 0.403 | 0.817 | 0.817 | 0.817| 0.817|
| 2 | 0.342 | 0.384 | 0.830 | 0.830 | 0.830| 0.830|
| 3 | 0.292 | 0.350 | 0.849 | 0.849 | 0.849| 0.849|
| 4 | 0.253 | 0.346 | 0.855 | 0.855 | 0.855| 0.855|
| 5 | 0.226 | 0.375 | 0.850 | 0.850 | 0.850| 0.850|

##### After distilbert-base-uncased Modeling

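The fine-tuned distilbert-base-uncased model is the best-performing model. Its best validation accuracy, 0.855, is about 9 percentage points above the strongest earlier model (Decision Tree with CountVectorization, 0.7628), and even its first epoch (0.817) outperforms every TF-IDF and CountVectorization model.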
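
The size of the gap can be checked directly from the numbers in the tables above. This is plain arithmetic on the reported accuracies; it assumes the earlier models and this one were scored on comparable test data.

```python
# Accuracies copied from the tables above.
baselines = {
    ("Naive Bayes", "TF-IDF"): 0.7611,
    ("Logistic Regression", "TF-IDF"): 0.7613,
    ("Decision Tree", "TF-IDF"): 0.7625,
    ("Naive Bayes", "CountVectorization"): 0.7416,
    ("Logistic Regression", "CountVectorization"): 0.7614,
    ("Decision Tree", "CountVectorization"): 0.7628,
}
distilbert_best = 0.855  # best validation accuracy in the epoch table

# Strongest earlier model and the gap to it, in percentage points.
best_baseline, best_acc = max(baselines.items(), key=lambda kv: kv[1])
gap_points = (distilbert_best - best_acc) * 100

print(best_baseline, round(gap_points, 2))
```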

----