<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_05_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Hugging Face
* Part 5.2: Hugging Face Tokenizers
* Part 5.3: Hugging Face Datasets
* **Part 5.4: Training Hugging Face models**

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    Colab = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    Colab = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.

## **Accelerated Run-time Check**

You MUST run the following code cell to get credit for this class lesson. The code in this cell checks what hardware acceleration you are using. To run this lesson, your runtime must use a Graphics Processing Unit (GPU).

In [None]:
# You must run this cell second

import tensorflow as tf

def check_device():
    # Check for available devices
    devices = tf.config.list_physical_devices()

    # Initialize device flags
    cpu = False
    gpu = False
    tpu = False

    # Check device types
    for device in devices:
        if device.device_type == 'CPU':
            cpu = True
        elif device.device_type == 'GPU':
            gpu = True
        elif device.device_type == 'TPU':
            tpu = True

    # Output device status
    if tpu:
        print("Running on TPU")
        print("WARNING: You must run this assignment using a GPU to earn credit")
        print("Change your RUNTIME now!")
    elif gpu:
        print("Running on GPU")
        gpu_info = !nvidia-smi
        gpu_info = '\n'.join(gpu_info)
        print(gpu_info)
        print("You are using a GPU hardware accelerator--You're good to go!")
    elif cpu:
        print("Running on CPU")
        print("WARNING: You must run this assignment using a GPU to earn credit")
        print("Change your RUNTIME now!")
    else:
        print("No compatible device found")
        print("WARNING: You must run this assignment using a GPU to earn credit")
        print("Change your RUNTIME now!")


# Call the function
check_device()

If your current `Runtime` is correct you should see the following output

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image11A.png)

However, if you received this warning message

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image14A.png)

You **MUST** go back and change your `Runtime` now before you continue.

### **YouTube Introduction to Training Hugging Face Datasets**

Run the next cell to see a short introduction to Training Hugging Face Datasets. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = "7YZOik5S3vs"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen>
</iframe>
""")

## **Training Hugging Face Models**

**Hugging Face Models** are pre-trained machine learning models available on the Hugging Face Model Hub. These models cover a wide range of tasks, including natural language processing, computer vision, audio processing, and more. They are designed to be easily accessible and usable, allowing developers and researchers to leverage state-of-the-art models without needing to train them from scratch.

There are several reasons why you might want to train Hugging Face Models:

1. **Customization:** Fine-tuning a pre-trained model on your specific dataset can improve its performance on your particular task.

2. **Efficiency:** Training a model from scratch can be time-consuming and resource-intensive. Fine-tuning a pre-trained model can save time and computational resources.

3. **Accessibility:** Hugging Face provides tools and libraries that make it easier to train and deploy models, lowering the barrier to entry for machine learning projects.


### Install Custom Function

Run the next cell to create a custom function for this lesson. Your code will not run correctly if you fail to run the next cell.

In [None]:
# Simple function to print out elapsed time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

If the code is correct you should not see any output.
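As a quick check of what the helper returns, it can be called on a sample duration. The sketch below repeats the function so that it runs on its own; the value of 3725.5 seconds is an arbitrary example, not a timing measured in this lesson.

```python
# Repeated from the cell above so this sketch is self-contained
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# 3725.5 seconds = 1 hour, 2 minutes, 5.5 seconds (arbitrary example value)
example = hms_string(3725.5)
print(example)  # prints 1:02:05.50
```

The format is `hours:minutes:seconds`, with minutes padded to two digits and seconds shown to two decimal places.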

### Install Hugging Face Datasets

Before we can begin, we need to install Hugging Face datasets by running the code in the next cell. This may take a little while so please be patient...


In [None]:
# Install Hugging Face Datasets

!pip install transformers > /dev/null
!pip install transformers[torch] > /dev/null
!pip install transformers[sentencepiece] > /dev/null
!pip install datasets  > /dev/null
!pip install huggingface_hub  > /dev/null

If the code is correct you should not see any output.

# **Emotion Analysis**

**Emotion analysis** (also called **affective computing** or **sentiment/emotion classification**) is a subfield of NLP that focuses on identifying and categorizing emotions expressed in text.

Unlike sentiment analysis (which typically classifies text as positive, negative, or neutral), emotion analysis aims to detect **specific emotions** such as:
- Joy
- Sadness
- Anger
- Fear
- Surprise
- Disgust
- Trust
- Anticipation

### **Common Approaches**

#### **1. Lexicon-Based Methods**
Use predefined dictionaries like:
- **NRC Emotion Lexicon**
- **WordNet-Affect**
- **LIWC**

These map words to emotions and aggregate scores across a text.

#### **2. Machine Learning Models**
Train classifiers (e.g., SVM, Logistic Regression) on labeled datasets:
- Features: Bag-of-Words, TF-IDF, word embeddings
- Datasets: EmoReact, ISEAR, SemEval, TweetEval

#### **3. Deep Learning Models**
Use neural networks like:
- CNNs or RNNs (LSTM/GRU)
- Transformers (e.g., BERT, RoBERTa)
- Fine-tuned models on emotion datasets

In Example 1 we will be using a transformer deep learning model called **`DistilBERT`** for **sequence classification** with the **`GO-Emotions`** dataset.

## **GO‑Emotions Dataset (Hugging Face)**

**The GO‑Emotions dataset** is a large, high-quality collection of 58,000 short `Reddit` comments annotated with 27 fine-grained emotions, such as *admiration*, *anger*, *nervousness*, and *gratitude*, plus a *neutral* category (28 labels in total). Each comment may carry multiple emotions, making it a multi‑label classification resource that captures the nuanced affective states people express online.
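To make "multi-label" concrete, one common way to represent several emotions on one comment is a multi-hot vector: one slot per label, set to 1 when that label applies. The sketch below is only an illustration under stated assumptions. It assumes 28 labels (27 emotions plus neutral, matching the `num_labels=28` used later in this lesson), the helper name `to_multi_hot` is hypothetical, and the example IDs are made up. Example 1 itself does not use this encoding.

```python
NUM_LABELS = 28  # assumed: 27 emotions + neutral

def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Turn a list of label IDs into a 0/1 vector of length num_labels."""
    vector = [0] * num_labels
    for label_id in label_ids:
        vector[label_id] = 1
    return vector

# Hypothetical comment tagged with two emotions (IDs 2 and 3)
vec = to_multi_hot([2, 3])
print(sum(vec))   # 2 labels are active
print(vec[:5])    # [0, 0, 1, 1, 0]
```

A multi-label model would be trained against vectors like this, whereas a single-label model is trained against one integer per example.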

#### **Relevance for Computational Biologists**

1. **Concrete, Hands‑On Machine‑Learning Practice**  
   - The dataset is *only* 58 k short text snippets, so it can be processed on a laptop or a free Google‑Colab GPU.  
   - Students learn the full pipeline: data loading, tokenization, model selection, fine‑tuning, evaluation, and reproducible experimentation—all in a single, end‑to‑end notebook.

2. **Multi-Label Classification – A Common Challenge in Bioinformatics**  
   - In genomics or proteomics you often predict *multiple* functional annotations per gene or protein.  
   - Training on GO‑Emotions teaches how to handle overlapping labels, compute appropriate metrics (AUROC, macro‑F1), and balance class weights—skills directly transferable to multi‑label problems like predicting disease phenotypes or pathway memberships.

3. **Transfer Learning & Fine‑Tuning of Pre-Trained Models**  
   - Students get to experiment with transformer architectures (BERT, RoBERTa, etc.) and see how a language model trained on general English can be adapted to a highly specialized task.  
   - This mirrors how pre‑trained protein language models (e.g., ESM, ProtBERT) are fine‑tuned for structure or function prediction in computational biology.

4. **Real-World, Open-Science Data**  
   - GO-Emotions is openly released (its Hugging Face dataset card lists the Apache 2.0 license), encouraging open-source collaboration.  
   - Working with openly available data instills best practices in reproducibility, version control, and ethical data handling—critical in biomedical research.

5. **Interdisciplinary Connection to Bio‑Text Mining**  
   - Sentiment and emotion analysis are valuable in public health surveillance, patient‑reported outcomes, and pharmacovigilance.  
   - By mastering affective NLP, students can later apply these skills to biomedical literature mining, extracting patient emotions from electronic health records or social media for disease-monitoring projects.

6. **Scalable Learning Curve**  
   - The dataset is rich enough to explore advanced topics (e.g., class‑imbalance techniques, ensembling, interpretability) but still small enough for quick iteration.  
   - This balance helps students build confidence before tackling larger biomedical corpora (e.g., PubMed abstracts, clinical notes).

**Bottom line:** Training a model on **`GO-Emotions`** gives computational biology students a focused, manageable, and highly transferable machine‑learning project that bridges NLP fundamentals, multi‑label modeling, and open‑science principles—all of which are essential skills for modern bioinformatics and computational biology research.

----------------------------------

## **Examples**

For pedagogical reasons the Example has been broken up into a series of sequential steps to make coding easier.

### Example 1 - Step 1: Load Dataset

The code in the next cell loads the **`GO-Emotions`** dataset into the variable `go_emotions_dataset`. The code also loads several libraries that we will be using later.

In [None]:
# Example 1 - Step 1: Load dataset

from datasets import load_dataset
from transformers import AutoTokenizer, default_data_collator
import torch
import time
from transformers import (
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)

# Load the GoEmotions dataset
go_emotions_dataset = load_dataset("google-research-datasets/go_emotions")


If the code is correct you should see the following output

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image01A.png)

### Example 1 - Step 2: Print Examples

The code in the next cell first prints out the column names in the `go_emotions_dataset`. It is important to know exactly what the column names are for later steps in the analysis.

After the column names are printed, the `text` and the `labels` from one record are printed. Which record is selected depends upon the value of the variable `RecordNumber`. In the example below `RecordNumber` is set to 3. Just set this variable to another value if you want to look at a different record.

In [None]:
# Example 1 - Step 2: Print Examples

# Set record number
RecordNumber = 3

# Print column names
print(f"Dataset column names: {go_emotions_dataset.column_names}")

# Print text from one record
print(f"Record {RecordNumber} text: {go_emotions_dataset['train'][RecordNumber]['text']}")

# Print label assigned to the text
print(f"Record {RecordNumber} labels: {go_emotions_dataset['train'][RecordNumber]['labels']}")

If the code is correct you should see the following output

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image26A.png)

For `RecordNumber=3` the `text` _"To make her feel threatened"_ was assigned the label `14`. The table below shows the numerical value assigned to each emotion category. By inspection you can see that the number `14` maps to **`Fear`**, which seems reasonable.

List of **go‑emotions** emotion categories (label IDs start at 0)

| #  | Emotion        |
|----|----------------|
| 0  | Admiration     |
| 1  | Amusement      |
| 2  | Anger          |
| 3  | Annoyance      |
| 4  | Approval       |
| 5  | Caring         |
| 6  | Confusion      |
| 7  | Curiosity      |
| 8  | Desire         |
| 9  | Disappointment |
|10  | Disapproval    |
|11  | Disgust        |
|12  | Embarrassment  |
|13  | Excitement     |
|14  | Fear           |
|15  | Gratitude      |
|16  | Grief          |
|17  | Joy            |
|18  | Love           |
|19  | Nervousness    |
|20  | Optimism       |
|21  | Pride          |
|22  | Realization    |
|23  | Relief         |
|24  | Remorse        |
|25  | Sadness        |
|26  | Surprise       |
|27  | Neutral        |


--------------------------

## **Tokenization**

An important step in any NLP analysis is **tokenization**.

#### **Why we need tokenisation**

| Need | What tokenisation gives us | Example |
|------|---------------------------|---------|
| **Convert text to numbers** | Maps each token to an integer ID, because neural networks operate on numbers rather than strings. | `"I love coffee"` → `[101, 1045, ..., 102]` |
| **Respect model vocabulary** | The tokenizer knows the exact wordpiece / subword vocabulary of the chosen pre‑trained model. | `"unbelievable"` → `[token1, token2]` that match `distilbert`’s embedding matrix. |
| **Add special tokens** | Most transformer models expect special tokens (`[CLS]`, `[SEP]`, etc.) at specific positions. | `"Hello world"` → `"[CLS] Hello world [SEP]"` |
| **Generate attention masks** | Indicates to the model which positions are real tokens and which are padding. | `[1, 1, 1, 0, 0, ...]` |
| **Uniform length** | `padding="max_length"` pads every sequence to the same length, making batching possible. | Short sentences become `[ID, ID, PAD, PAD, ...]` |
| **Truncate long sequences** | `truncation=True` keeps only the first `max_length` tokens, preventing OOM errors. | `"This is a very long sentence …"` → first 512 tokens |

Without tokenisation, the model would receive a string of characters that it cannot process, and the training loop would fail.
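The rows of the table can be sketched with a toy tokenizer written in plain Python. This is a simplified illustration, not how the `DistilBERT` tokenizer works internally: it splits on whitespace instead of using subwords, and `TOY_VOCAB`, `toy_encode`, and every ID below are hypothetical.

```python
# Toy vocabulary: every ID here is hypothetical, NOT DistilBERT's real vocabulary
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "am": 5, "happy": 6, "today": 7}

def toy_encode(text, max_length=8):
    """Whitespace-tokenize, add special tokens, truncate, then pad."""
    tokens = text.lower().split()
    # Truncation: keep room for the [CLS] and [SEP] special tokens
    tokens = tokens[: max_length - 2]
    # Convert tokens to IDs; unknown words map to [UNK]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    ids.append(TOY_VOCAB["[SEP]"])
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": ids, "attention_mask": attention_mask}

encoded = toy_encode("I am happy today", max_length=8)
print(encoded["input_ids"])       # [2, 4, 5, 6, 7, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Every sequence comes back with exactly `max_length` entries, which is what allows sequences of different lengths to be stacked into one batch.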

---

#### **Quick Visualisation**

```text
raw text:  "I am happy today!"
tokeniser:  tokenizer("I am happy today!")
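outputs:    {
              'input_ids': [101, 1045, 2572, 3407, 2651, 999, 102],
              'attention_mask': [1, 1, 1, 1, 1, 1, 1],
              ...
            }
note:       7 IDs = [CLS] + 5 word/punctuation tokens + [SEP]; treat the exact ID values as illustrative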
```
---------------------------

### Example 1 - Step 3: Tokenize and Format the Dataset


The code in the cell below tokenizes the raw text, normalises the labels, and creates two ready-to-train splits (`train` & `eval`) that you can feed into a `DistilBERT` model (e.g., via the Hugging Face `Trainer`).

This code is a classic data-preparation pipeline for any text-classification task.

In [None]:
# Example 1 - Step 3: Tokenize and format the dataset

# Initialize the tokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Define the tokenize function
def tokenize(rows):
    return tokenizer(rows['text'], padding="max_length", truncation=True)

# Tokenize the dataset
tokenized_datasets = go_emotions_dataset.map(tokenize, batched=True)

# Keep only the first label of each record. GO-Emotions stores labels as a list,
# so this step turns the multi-label data into a single-label classification task.
def format_labels(example):
    example['labels'] = example['labels'][0] if isinstance(example['labels'], list) else example['labels']
    return example

tokenized_datasets = tokenized_datasets.map(format_labels)

# Split the tokenized dataset into train and test sets
train_test_split = tokenized_datasets['train'].train_test_split(test_size=0.2, seed=42)

small_train_dataset = train_test_split["train"].shuffle(seed=42)
small_eval_dataset = train_test_split["test"].shuffle(seed=42)



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image08F.png)
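It is worth seeing exactly what `format_labels` does to a record that carries more than one emotion. The sketch below repeats the function from Step 3 and applies it to two hypothetical records (they are not taken from the dataset).

```python
# Same function as in Step 3, repeated so this sketch runs on its own
def format_labels(example):
    example['labels'] = example['labels'][0] if isinstance(example['labels'], list) else example['labels']
    return example

# Hypothetical records, not taken from the dataset
single = format_labels({'text': 'example A', 'labels': [14]})
multi = format_labels({'text': 'example B', 'labels': [2, 3]})

print(single['labels'])  # 14
print(multi['labels'])   # 2  (the second label, 3, is discarded)
```

Only the first label survives, so the model in this Example is trained as a single-label (multi-class) classifier even though the original dataset is multi-label.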

### Example 1 - Step 4: Create Custom Collator

The code in the cell below creates a **custom collator**. The code defines a custom data collator that wraps the default Hugging Face collator. In particular, the code ensures that the `labels` tensor has the right data type (i.e., a detached `torch.long` tensor). This data type is what the classification loss expects.

The code then loads a pre-trained **`DistilBERT`** encoder and attaches a classification head configured for 28 output classes. Together, they prepare batches with correct label types and a model ready for fine-tuning on a multi‑class sequence-classification task.

The following code **_instantiates_** the `DistilBert` model.

```text
# Instantiate the model
EG_model = DistilBertForSequenceClassification.from_pretrained(model_ckpt, num_labels=28)
```
This means that the model object has just been created: the `DistilBERT` encoder carries its pre-trained weights, but the new classification head has not been trained yet.

**NOTE:** The name of the `DistilBERT` model is **`EG_model`**. The **EG** prefix has been used to indicate that this is the model being used in the EXAMPLE.


In [None]:
# Example 1 - Step 4: Create custom collator

# Custom data collator to handle potential edge cases and address the warning
def custom_data_collator(features):
    batch = default_data_collator(features)
    if "labels" in batch:
        batch["labels"] = batch["labels"].clone().detach().long()
    return batch

# Instantiate the model
EG_model = DistilBertForSequenceClassification.from_pretrained(model_ckpt, num_labels=28)

If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image09F.png)

----------------------------------

### **Hugging Face Trainer**

The **Hugging‑Face `Trainer`** is a high‑level training engine that abstracts away the boilerplate of a typical deep‑learning training loop.

It automatically handles data loading, batching, padding, and device placement, supports distributed or mixed‑precision training out of the box, and provides built-in strategies for checkpointing, evaluation, and early stopping.

By passing a `TrainingArguments` object, you can control logging, learning‑rate scheduling, and which metrics to monitor for the best checkpoint. Because it integrates seamlessly with Hugging‑Face `datasets`, tokenizers, and the `accelerate` library, you can write a single, concise script that runs on a single GPU, multiple GPUs, or a TPU without any code changes.

While you can always write a custom loop for maximum control, the `Trainer` saves time, reduces bugs, and makes experiments reproducible and easy to share, making it the go-to choice for most fine‑tuning and research workloads.
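To make "boilerplate" concrete, here is a hand-written loop for a deliberately tiny problem: fitting a single weight so that `w * x` matches `y = 2 * x`. It is a sketch of the kind of loop that `Trainer` manages for you (batching, gradient step, periodic evaluation), not a description of how `Trainer` is implemented. The data, learning rate, and batch size are made-up values.

```python
# Toy data following y = 2 * x (hypothetical values for illustration)
data = [(x, 2.0 * x) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w = 0.0               # the single trainable weight
learning_rate = 0.01
batch_size = 2

def mean_squared_error(weight, rows):
    """Average squared difference between predictions and targets."""
    return sum((weight * x - y) ** 2 for x, y in rows) / len(rows)

for epoch in range(50):
    # Batching: walk through the data in fixed-size chunks
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the batch's mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # Optimizer step: plain gradient descent
        w -= learning_rate * grad
    # Periodic evaluation (on the training data here, for brevity)
    if (epoch + 1) % 10 == 0:
        print(f"epoch {epoch + 1}: w={w:.4f}, loss={mean_squared_error(w, data):.6f}")
```

A real fine-tuning loop adds device placement, checkpointing, logging, and learning-rate scheduling on top of this skeleton, which is the part `Trainer` takes off your hands.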

----------------------------


### Example 1 - Step 5: Define Model, Training Arguments, and Trainer

The code in the cell below creates a trainer called **`EG_trainer`**. This trainer is configured using `TrainingArguments`. These training arguments tell **`EG_trainer`** to:
1. stop after 500 training steps (`max_steps=500`; when `max_steps` is positive it takes precedence over `num_train_epochs=5`),
2. use 8-sample batches,
3. use 500 warm-up steps (as many as the entire run, so the learning rate is still ramping up when training ends),
4. use a weight decay of `0.01`,
5. log every `10` steps,
6. perform checkpoints and evaluations every `50` steps,
7. reload the best model at the end of training, chosen by the lowest evaluation loss.

An `EarlyStoppingCallback` with a patience of `3` is added so training stops if the eval loss does not improve for three consecutive evaluations.

Finally, our `EG_trainer` is created with the pre-trained model, the train/eval datasets, the custom collator, and the early-stopping callback. In short, our `EG_trainer` is all set, just waiting for the signal to start training.
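The effect of the warm-up setting can be sketched numerically. The function below assumes a linear warm-up followed by a linear decay, and a base learning rate of `5e-5`; both are assumptions about the `Trainer` defaults, since neither is set explicitly in this lesson. The name `linear_warmup_lr` is hypothetical.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR = 5e-5  # assumed default learning rate (not set in this lesson)

# With warmup_steps equal to the total number of steps, the rate only ever rises
for step in (0, 125, 250, 499):
    lr = linear_warmup_lr(step, BASE_LR, warmup_steps=500, total_steps=500)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Under these assumptions the model never trains at the full learning rate, because the 500-step run ends just as warm-up finishes. Choosing a warm-up that is a small fraction of the total steps is one of the easier settings to experiment with.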


In [None]:
# Example 1 - Step 5: Define model, training arguments and trainer

# Define variables
EPOCHS=5
EVAL_STEPS=50
MAX_STEPS=500
SAVE_STEPS=50

# Set up TrainingArguments with early stopping requirements
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    max_steps=MAX_STEPS,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to=[]
)

# Create an early stopping callback
PATIENCE = 3
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=PATIENCE)

EG_trainer = Trainer(
    model=EG_model,     # Important to assign model name
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=custom_data_collator,
    callbacks=[early_stopping_callback]
)



If the code is correct you should not see any output.
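The early-stopping rule configured above (stop when the evaluation loss has not improved for `3` consecutive evaluations) can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's implementation; the function name `first_stop_index` and the loss values are made up.

```python
def first_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None."""
    best_loss = float("inf")
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter
            best_loss = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return index
    return None

# Hypothetical evaluation losses, one per evaluation (every 50 steps in this lesson)
losses = [3.0, 2.5, 2.6, 2.7, 2.55, 2.4]
print(first_stop_index(losses))           # 4: three evaluations in a row failed to beat 2.5
print(first_stop_index([3.0, 2.9, 2.8]))  # None: the loss kept improving
```

In the first example the best loss (2.5) is reached at evaluation 1, so that is the checkpoint that would be kept as the best model.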

### Example 1 - Step 6: Train the Model and Save the Best Model

The code in the cell below runs the full training loop for our **`EG_model`**, which has been wrapped in a Hugging Face `Trainer`, **`EG_trainer`**.

After training, the code writes out the best-performing checkpoint (i.e., the model that achieved the lowest evaluation loss) to a folder called `./EG_best_model`.

It also saves the tokenizer used during training to the same folder so you can later load the entire inference pipeline from that single directory.

Since the model is fairly large, the next cell will require some time to finish training.

**NOTE:** Make sure you have run the `Install Custom Function` cell at the start of the lesson or you will receive an error message!


In [None]:
# Example 1 - Step 6: Train the model and save the best model

# Record start time
start_time = time.time()

print("-----Starting Training-----------------------")
# Train the model
EG_trainer.train()
print("-----Training Done---------------------------")

# Record the end time in T_end
T_end = time.time()

# Compute elapsed time from the two recorded timestamps
elapsed_time = T_end - start_time
print("Elapsed time: {}".format(hms_string(elapsed_time)))

# Save the best model
best_model_dir = './EG_best_model'
EG_trainer.save_model(best_model_dir)

# Save tokenizer to the same directory
tokenizer.save_pretrained(best_model_dir)



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image27A.png)

With the `A100` GPU hardware accelerator it took about 6 minutes to complete training of the model.

**NOTE:** If your code generated this error message

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image07A.png)

it means that you failed to run the code cell called **Install Custom Function** at the beginning of this lesson.

### Example 1 - Step 7: Compute Accuracy Score

The code in the cell below evaluates our trained model, through **`EG_trainer`**, on a `hold-out set`. The **hold-out** set (also called a **test set**) is a small portion of our dataset that was set aside before training the neural network.

It is important that the `hold-out` set is _never_ shown to the model during learning (nor to any hyper‑parameter tuning step). After training is done, the `hold-out` set is fed into the model and a record is made of the model's predictions. Comparing these predictions with the true labels gives an _unbiased estimate_ of the model's generalization accuracy. In this Example the same split was also used for early stopping and for selecting the best checkpoint in Step 5, so the reported score is likely to be slightly optimistic.

The code in the cell below uses the method `EG_trainer.evaluate()` to get metrics (e.g., loss and run-time statistics) from the trained model. The code prints out each metric, and then obtains raw predictions via `EG_trainer.predict()`.


In [None]:
# Example 1 - Step 7: Compute Accuracy Score

from sklearn.metrics import accuracy_score

# Evaluate the model on the evaluation dataset
eval_results = EG_trainer.evaluate()

# Print the evaluation results
print("Evaluation results:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

# Get the predictions and true labels
predictions, labels, _ = EG_trainer.predict(small_eval_dataset)

# Convert predictions to label IDs
predictions = torch.tensor(predictions)
predictions = torch.argmax(predictions, dim=-1)

# Calculate accuracy
accuracy = accuracy_score(labels, predictions)

print(f"Accuracy: {accuracy:.4f}")


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image10A.png)

An accuracy of **0.4523** indicates that the model correctly classifies roughly **45%** of the examples in the evaluation set. Compared with a random baseline (≈ 1/28 ≈ 0.036 for a 28-class problem), the model has learned a useful signal, but it is far from perfect.  

In practice this means:

* The model is **under‑performing** relative to what one would hope for from a well-tuned classifier.  
* Likely reasons include the short training run (capped at 500 steps of 8 examples each, so only a fraction of the training split is ever seen), sub‑optimal hyper-parameters, and the difficulty of separating 28 closely related emotion categories.  
* Training for more steps, tuning the learning rate and warm-up, or experimenting with larger models (e.g., `bert-base-uncased` or `roberta-base`) could raise the score.

In short, the model shows *positive* signal but still has significant room for improvement.
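The accuracy calculation in Step 7 boils down to two operations: take the index of the largest logit for each example, then count how often that index matches the true label. The sketch below reproduces this in plain Python on hypothetical logits (4 examples, 3 classes), not on model output, and also computes the chance level for 28 classes quoted above.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical logits for 4 examples and 3 classes (not model output)
logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
true_labels = [1, 0, 1, 0]

predicted = [argmax(row) for row in logits]
print(predicted)                         # [1, 0, 2, 0]
print(accuracy(true_labels, predicted))  # 0.75

# Chance level when guessing uniformly among 28 classes
print(round(1 / 28, 4))                  # 0.0357
```

The uniform-guess figure is only a rough baseline; when some classes are much more common than others, always predicting the most frequent class can score well above it.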




---------------------------

## **Importance of Inference Using Trained Models**

**Inference** is a critical phase in the machine learning lifecycle. It refers to the process of using a **trained model** to make predictions on **new, unseen data**. While training involves learning patterns from historical data, inference is where the model demonstrates its practical utility.

#### **Why Inference Matters**

- **Real-World Application**: Inference allows models to be deployed in real-world scenarios, such as diagnosing diseases, recommending products, detecting fraud, or translating languages.
- **Performance Validation**: It helps validate how well the model generalizes beyond the training data. This is essential for assessing the model's reliability and robustness.
- **Decision Support**: Inference outputs are often used to support or automate decision-making processes in various domains like healthcare, finance, and engineering.
- **Efficiency and Speed**: Optimizing inference is crucial for applications requiring real-time predictions, such as autonomous vehicles or voice assistants.

##### **Summary**

Inference is the bridge between model development and real-world impact. It transforms a trained model from a theoretical construct into a practical tool that can generate insights, automate tasks, and solve complex problems in diverse domains.

--------------------------

### Example 1 - Step 8: Perform Inference Using Trained Model

The code in the cell below performs **`Inference`** on our **best trained** version of our **`EG_model`** that was stored in the directory **'./EG_best_model'**. The code loads the previously-saved tokenizer from the same directory.

To get some idea of how well our **`EG_model`** learned to evaluate the emotions expressed in short sentences, the code feeds `6` test sentences into our trained model for evaluation. Finally it prints out each sentence along with the model's prediction of its emotional category.  


In [None]:
# Example 1 - Step 8: Perform Inference Using Trained Model

import torch
from transformers import AutoTokenizer, DistilBertForSequenceClassification

# Load the tokenizer and the model from the saved directory
best_model_dir = './EG_best_model'
tokenizer = AutoTokenizer.from_pretrained(best_model_dir)
EG_model = DistilBertForSequenceClassification.from_pretrained(best_model_dir)

# Set the model to evaluation mode
EG_model.eval()

# Define the sentences to test
sentences = [
    "I am feeling very happy today!",
    "This is the worst day of my life.",
    "I am so excited about the new project.",
    "I feel sad and lonely.",
    "I'm feeling very anxious about the presentation.",
    "This news makes me extremely joyful."
]

# Tokenize the sentences
inputs = tokenizer(sentences, padding="max_length", truncation=True, return_tensors="pt")

# Get the model's predictions
with torch.no_grad():
    outputs = EG_model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Define the mapping of label IDs to emotion names used in this demonstration.
# NOTE: this table has duplicate names (0/26, 2/20) and does not follow the dataset's own label order.
label_mapping = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
    6: "neutral",
    7: "admiration",
    8: "amusement",
    9: "approval",
    10: "caring",
    11: "confusion",
    12: "curiosity",
    13: "desire",
    14: "disappointment",
    15: "disapproval",
    16: "embarrassment",
    17: "excitement",
    18: "gratitude",
    19: "grief",
    20: "love",
    21: "nervousness",
    22: "pride",
    23: "realization",
    24: "relief",
    25: "remorse",
    26: "sadness",
    27: "surprise"
}

# Print the predicted emotions for each sentence
for sentence, pred in zip(sentences, predictions):
    emotion = label_mapping[pred.item()]
    print(f"Sentence: '{sentence}'\nPredicted Emotion: {emotion}\n")


If the code is correct you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image06A.png)

The six predictions look far from what we would expect for the sentences given.  
- “I am feeling very happy today!” is mapped to **gratitude**;  
- “This is the worst day of my life.” and “I feel sad and lonely.” both become **remorse**;  
- “I am so excited about the new project.” correctly receives **excitement**;  
- “I'm feeling very anxious about the presentation.” again gets **remorse**; and  
- “This news makes me extremely joyful.” is oddly labeled **sadness**.

Part of the problem is the model itself: it reached only about 45% accuracy on the evaluation set, so some wrong predictions are expected.  

The larger problem is the `label_mapping` table used for inference. It does not match the label order the model was trained on: it contains duplicate entries (0/26 for *sadness* and 2/20 for *love*) and omits several GO-Emotions categories. In the dataset's own order, index 17 is *joy*, 18 is *love*, and 25 is *sadness*, so the outputs printed as **excitement**, **gratitude**, and **remorse** actually correspond to *joy*, *love*, and *sadness*. Read that way, most of the predictions are sensible, and the apparent bias toward "remorse" is largely a decoding artifact. Class imbalance and the short training run may also contribute.  

The practical lesson is to decode predictions with the dataset's own label names (for example, taken from the dataset's `features`) rather than a hand-typed table, and to verify the mapping before judging a model. Even with a correct mapping, an accuracy of 45% means the model should be trained longer or tuned further before it can be trusted to interpret emotion from text.
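As a sketch of how the predictions could be decoded consistently, the list below gives the GO-Emotions label names in the order I understand the dataset to use (index = label ID). Treat it as an assumption to confirm against the dataset's `features` in your own session; the helper name `decode` is hypothetical.

```python
# GO-Emotions label names in the dataset's own order (index = label ID).
# Assumption: verify against the loaded dataset before relying on it.
GO_EMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

def decode(label_ids):
    """Translate predicted label IDs into emotion names."""
    return [GO_EMOTIONS_LABELS[i] for i in label_ids]

# IDs that the Step 8 table names "gratitude" (18), "remorse" (25), "excitement" (17)
print(decode([18, 25, 17]))  # ['love', 'sadness', 'joy']

# The record inspected in Step 2 had label 14
print(decode([14]))          # ['fear']
```

Because a list has exactly one name per index, this form cannot contain the duplicate or missing entries that a hand-typed dictionary can.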


## **Exercises**

For your **Exercises** you are to repeat the steps demonstrated in Example 1 using a different Hugging Face dataset, the Irony subset of `cardiffnlp/tweet_eval`.

The **CardiffNLP TweetEval – Irony Subset** is part of the **TweetEval** benchmark developed by `CardiffNLP`. TweetEval provides a unified framework for evaluating models on various `Twitter-based` classification tasks.

Among its seven tasks, the **Irony Detection** subset is particularly notable for its focus on identifying ironic content in tweets. Irony is notoriously difficult for machines—and even humans—to detect accurately.


Irony is a complex linguistic phenomenon where the intended meaning of a statement is often the **opposite** of its literal meaning. This makes it particularly challenging for machine learning models and natural language processing systems to interpret correctly.

### **Key Challenges in Irony Detection**

#### **1. Context Dependence**
Irony often relies heavily on **context**, including cultural references, prior knowledge, or the situation in which the statement is made. Without this context, the literal words may be misleading.

> Example: "Oh great, another Monday!"  
> Without context, this could be interpreted as positive, but it's likely ironic.

#### **2. Subtlety and Ambiguity**
Irony is frequently subtle and can be easily confused with sarcasm, humor, or even genuine sentiment. The lack of clear linguistic markers makes it hard to distinguish.

#### **3. Lack of Prosody and Tone**
In spoken language, irony is often conveyed through **tone of voice**, facial expressions, or gestures. In text (especially tweets), these cues are missing, making detection much harder.

#### **4. Short and Informal Texts**
Social media platforms like `Twitter` encourage brevity and informal language. This limits the amount of information available for models to interpret irony accurately.

#### **5. Creative Language Use**
Users often employ slang, emojis, hashtags, and unconventional grammar to express irony. These creative elements can confuse models trained on more formal or structured data.

#### **Implications for NLP**

Detecting irony is essential for improving the accuracy of:
- **Sentiment analysis**
- **Emotion detection**
- **Content moderation**
- **Social media monitoring**

Models that fail to detect irony may misclassify negative sentiment as positive (or vice versa), leading to flawed insights and decisions.
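
To make this failure mode concrete, here is a minimal sketch of a naive word-counting sentiment scorer. The word lists and the `naive_sentiment` function are hypothetical and far simpler than any real sentiment model. They only illustrate how surface-level positive words can mislead a classifier on an ironic statement.

```python
import re

# Hypothetical mini-lexicons, for illustration only
POSITIVE_WORDS = {"great", "amazing", "favorite", "love"}
NEGATIVE_WORDS = {"worst", "hate", "awful", "sad"}

def naive_sentiment(text):
    """Label text by counting positive vs. negative words (ignores context)."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# The ironic example from above is scored as positive because of the word "great"
print(naive_sentiment("Oh great, another Monday!"))
```

Because the scorer looks only at individual words, it has no way to notice that "great" is being used ironically.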

### **Summary**

Irony detection remains a challenging task in NLP due to its reliance on context, subtlety, and the absence of non-verbal cues. Advances in deep learning and contextual embeddings (like BERT and RoBERTa) have improved performance, but there's still significant room for growth in this area.


### **Exercise 1 - Step 1: Load Dataset**


In the cell below write the code to load the dataset. Use the following code for loading:

```text
# Load dataset
irony_dataset = load_dataset("cardiffnlp/tweet_eval", "irony")
```
This will create a variable called `irony_dataset` that you will use for **Exercise 1**.

You should also import the libraries that you will be using later.

In [None]:
# Insert your code for Exercise 1 - Step 1 here



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image20A.png)

### **Exercise 1 - Step 2: Print Examples**

In the next cell write the code to print out the column names in your `irony_dataset`, since you will need to know exactly what the column names are for later steps in the analysis.

After you print the column names, print the text and the label from record number `2`.

In [None]:
# Insert your code for Exercise 1 - Step 2 here



If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image21A.png)

Since `Label = 1` for record `2`, this tweet is considered **ironic**.

### **Exercise 1 - Step 3: Tokenize and Format the Dataset**

In the cell below write the code to tokenize and format your `irony_dataset`.

**IMPORTANT NOTE:** One of the column names for the `irony_dataset` is _different_ from the column names shown in Example 1 - Step 3. You will need to change your code to reflect this difference.


In [None]:
# Insert your code for Exercise 1 - Step 3 here



If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image22A.png)

### **Exercise 1 - Step 4: Create Custom Collator**

In the cell below create a **custom collator**. You can re-use the code in Example 1 - Step 4 with the following exception.

**IMPORTANT NOTE:** When you _instantiate_ your model, **make sure** to call your model **`EX_model`**. The prefix **`EX`** signifies that this is the model that you will be using in the **EXERCISE**.  

In [None]:
# Insert your code for Exercise 1 - Step 4 here



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image23A.png)

### **Exercise 1 - Step 5: Define Model, Training Arguments, and Trainer**

In the cell below write the code to define your model, **`EX_model`**, together with its `TrainingArguments` and trainer.

You can simply re-use all of the code in Example 1 - Step 5 with the following modifications:

1. Change the name of your trainer from  `EG_trainer` to  **`EX_trainer`**.
2. Change the name of your model from  `EG_model` to  **`EX_model`**.

In [None]:
# Insert your code for Exercise 1 - Step 5 here



If the code is correct, you should not see any output.

### **Exercise 1 - Step 6: Train the Model and Save the Best Model**

In the cell below write the code to train your `EX_trainer`.

You can re-use all of the code in Example 1 - Step 6 with the following modifications:

1. Make sure you are training your `EX_trainer` model.
2. Change the name of the `best_model_dir` to `'./EX_best_model'`.


In [None]:
# Insert your code for Exercise 1 - Step 6 here


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image28A.png)

Using the `A100` GPU hardware accelerator, training required less than 5 minutes to complete.

### **Exercise 1 - Step 7: Compute Accuracy Score**

In the cell below write the code to compute the accuracy score of your `EX_trainer`. (Remember, `EX_trainer` is basically your `EX_model` "wrapped" in some extra code.)

Make sure to change:

1. `EG_trainer.evaluate()` to read **`EX_trainer.evaluate()`**.
2. `EG_trainer.predict()` to read **`EX_trainer.predict()`**.


In [None]:
# Insert your code for Exercise 1 - Step 7 here



If the code is correct you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image25A.png)

An accuracy of **0.6981** indicates that the model correctly classifies roughly 70% of the examples in the evaluation set.  This is much better accuracy than we observed for the `go_emotions_dataset` in the Example.
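
As a reminder of what this number means, accuracy is the fraction of examples whose predicted class (the index of the largest logit) matches the true label. The sketch below uses plain Python and made-up logits. `accuracy_from_logits` is a hypothetical helper written for illustration, not the function used by the trainer.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose arg-max class equals the true label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for 4 tweets: column 0 = "Not ironic", column 1 = "Ironic"
toy_logits = [[2.1, -0.3], [0.2, 1.5], [1.0, 0.4], [-0.7, 0.9]]
toy_labels = [0, 1, 1, 1]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of the 4 toy examples are classified correctly
```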


### **Exercise 1 - Step 8: Perform Inference Using Trained Model**

In the cell below write the code to perform inference with your best model, **`EX_model`**, which was stored in the directory `'./EX_best_model'`.

Make sure to set your model to evaluation mode using the following code:
```text
# Set the model to evaluation mode
EX_model.eval()
```
You can copy-and-paste this code into your cell to create the tweets to analyze:

```text
# Define the sentences to test
sentences = [
    "Mens Football clearly know how to have a good time.....  #archiesday #anychanceofasocial", # Ironic
    "What an amazing start to the weekend!  #ohgoditsfridayagain",                              # Ironic
    "my favorite thing to do on Tuesday is write psychology papers😐  #killme",                 # Ironic
    "Last day in #Riga! #self #finnishgirl #businesswoman  @ PK Riga Hotel",                     # Not ironic
    "Can't wait until this weekend is over...then no more Xmas parties!!!!!! #HateSchmoozing",   # Not ironic
    "How to know when he really loves you. #tmi #imsorry #chickfila",                            # Not ironic
]
```

These tweets were taken from the `hold-out` (validation) set.

Finally, use this code for mapping your responses:
```text
# Define the mapping of label IDs to irony classes (based on the dataset's label mapping)
label_mapping = {
    0: "Not ironic",
    1: "Ironic",
}
```


In [None]:
# Insert your code for Exercise 1 - Step 8 here




If the code is correct you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image29A.png)

Predicting what is, or is not, ironic isn't easy!

The first 3 tweets were labeled as being **ironic** while the last 3 tweets were labeled **not ironic**. Based on this small sample, your trained `EX_model` didn't do all that well: it classified only 3 of the 6 tweets correctly.   

```text
Sentence: 'Mens Football clearly know how to have a good time.....  #archiesday #anychanceofasocial'
Predicted Emotion: Ironic (✔️)

Sentence: 'What an amazing start to the weekend!  #ohgoditsfridayagain'
Predicted Emotion: Ironic (✔️)

Sentence: 'my favorite thing to do on Tuesday is write psychology papers😐  #killme'
Predicted Emotion: Not ironic (❌)

Sentence: 'Last day in #Riga! #self #finnishgirl #businesswoman  @ PK Riga Hotel'
Predicted Emotion: Not ironic (✔️)

Sentence: 'Can't wait until this weekend is over...then no more Xmas parties!!!!!! #HateSchmoozing'
Predicted Emotion: Ironic (❌)

Sentence: 'How to know when he really loves you. #tmi #imsorry #chickfila'
Predicted Emotion: Ironic (❌)
```
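
The results above can also be tallied programmatically. The following sketch is a minimal illustration that assumes the expected labels given in the comments of the sentence list and the predicted labels copied from the output shown above. The variable names are hypothetical.

```python
from collections import Counter

# Expected labels come from the comments in the sentence list;
# predicted labels are copied from the output shown above.
expected = ["Ironic", "Ironic", "Ironic", "Not ironic", "Not ironic", "Not ironic"]
predicted = ["Ironic", "Ironic", "Not ironic", "Not ironic", "Ironic", "Ironic"]

# Count each (expected, predicted) pair to form a small confusion table
confusion = Counter(zip(expected, predicted))
for (true_label, pred_label), count in sorted(confusion.items()):
    print(f"expected={true_label:<10} predicted={pred_label:<10} count={count}")

# Fraction of the six tweets that were classified correctly
sample_accuracy = sum(e == p for e, p in zip(expected, predicted)) / len(expected)
print(f"Sample accuracy: {sample_accuracy:.2f}")
```

Splitting the counts by expected class shows where the errors fall: in this sample the model misses one ironic tweet and flags two non-ironic tweets as ironic. Six tweets are far too few to draw firm conclusions about the model.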


## **Lesson Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_4.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **Lotus 1-2-3**

![__](https://upload.wikimedia.org/wikipedia/commons/0/02/Lotus-123-3.0-MSDOS.png)

**Lotus 1-2-3** is a discontinued spreadsheet program from Lotus Software (later part of IBM). It was the first killer application of the IBM PC, was hugely popular in the 1980s, and significantly contributed to the success of IBM PC-compatibles in the business market.

The first spreadsheet, VisiCalc, had helped launch the Apple II as one of the earliest personal computers in business use. With IBM's entry into the market, VisiCalc was slow to respond, and when they did, they launched what was essentially a straight port of their existing system despite the greatly expanded hardware capabilities. Lotus's solution was marketed as a three-in-one integrated solution: it handled spreadsheet calculations, database functionality, and graphical charts, hence the name "1-2-3", though how much database capability the product actually had was debatable, given the sparse memory left over after launching 1-2-3. It quickly overtook VisiCalc, as well as Multiplan and SuperCalc, the two VisiCalc competitors.

Lotus 1-2-3 was the state-of-the-art spreadsheet and the standard throughout the 1980s and into the early 1990s, part of an unofficial set of three stand-alone office automation products that included dBase and WordPerfect, to build a complete business platform. Lotus Software had their own word processor named Lotus Manuscript, which was to some extent acclaimed in academia, but did not catch the interest of the business, nor the consumer market. With the acceptance of Windows 3.0 in 1990, the market for desktop software grew even more. None of the major spreadsheet developers had seriously considered the graphical user interface (GUI) to supplement their DOS offerings, and so they responded slowly to Microsoft's own GUI-based products Excel and Word. Lotus was surpassed by Microsoft in the early 1990s, and never recovered. IBM purchased Lotus in 1995, and continued to sell Lotus offerings, only officially ending sales in 2013.

### **History**

**VisiCalc**

VisiCalc was launched in 1979 on the Apple II and immediately became a best-seller. Compared to earlier programs, VisiCalc allowed one to easily construct free-form calculation systems for practically any purpose, the limitations being primarily related to the memory and speed of the computer. The application was so compelling that there were numerous stories of people buying Apple II machines to run the program. VisiCalc's runaway success on the Apple led to direct bug-compatible ports to other platforms, including the Atari 8-bit computers, Commodore PET and many others. This included the IBM PC when it launched in 1981, where it quickly became another best-seller, with an estimated 300,000 sales in the first six months on the market.

There were well-known problems with VisiCalc, and several competitors appeared to address some of these issues. One early example was 1980's SuperCalc, which solved the problem of circular references, while a slightly later example was Microsoft Multiplan from 1981, which offered larger sheets and other improvements. In spite of these, and others, VisiCalc continued to outsell them all.

**Beginnings**

The Lotus Development Corporation was founded by Mitchell Kapor, a friend of the developers of VisiCalc. 1-2-3 was originally written by Jonathan Sachs, who had written two spreadsheet programs previously while working at Concentric Data Systems, Inc. To aid its growth both in the UK and possibly elsewhere, Lotus 1-2-3 became the very first computer software to use television consumer advertising.

Kapor was primarily a marketing guru. His ability to develop his product to appeal to non-technical users was one secret to its rapid success. Unlike many technologists, Kapor relied on focus group feedback to make his user instructions more user-friendly. One example: the instructions that came with the floppy disc read: "Remove the protective cover and insert disc into computer." A few focus group participants tried to rip off the stiff plastic envelope of the disc carrier. Kapor's recognition that techno-speak instructions needed to be translated to normative English was a strong contributor to the product's popularity.

Lotus 1-2-3 was released on 26 January 1983, and immediately overtook VisiCalc in sales. Unlike Microsoft Multiplan, it stayed very close to the model of VisiCalc, including the "A1" letter and number cell notation, and slash-menu structure. It was cleanly programmed, relatively bug-free, gained speed from being written completely in x86 assembly language (this remained the case for all DOS versions until 3.0, when Lotus switched to C) and wrote directly to video memory rather than use the slow DOS and/or BIOS text output functions.

Among other novelties that Lotus introduced was a graph maker that could display several forms of graphs (including pie charts, bar graphics, or line charts) but required the user to have a graphics card. At this early stage, the only video boards available for the PC were IBM's Color Graphics Adapter and Monochrome Display and Printer Adapter, the latter not supporting any graphics. However, because the two video boards used different RAM and port addresses, both could be installed in the same machine and so Lotus took advantage of this by supporting a "split" screen mode whereby the user could display the worksheet portion of 1-2-3 on the sharper monochrome video and the graphics on the CGA display.

The initial release of 1-2-3 supported only three video setups: CGA, MDA (in which case the graph maker was not available) or dual-monitor mode. However, a few months later support was added for Hercules Computer Technology's Hercules Graphics Adapter which was a clone of the MDA that allowed bitmap mode. The ability to have high-resolution text and graphics capabilities (at the expense of color) proved extremely popular and Lotus 1-2-3 is credited with popularizing the Hercules graphics card.

Subsequent releases of Lotus 1-2-3 supported more video standards as time went on, including EGA, AT&T/Olivetti, and VGA. Significantly, support for the PCjr/Tandy modes was never added and users of those machines were limited to CGA graphics.

The early versions of 1-2-3 also had a key disk copy protection. While the program was hard disk installable, the user had to insert the original floppy disk when starting 1-2-3 up. This protection scheme was easily cracked and a minor inconvenience for home users, but proved a serious nuisance in an office setting. Starting with Release 3.0, Lotus no longer used copy protection. However, it was then necessary to "initialize" the System disk with one's name and company name so as to customize the copy of the program. Release 2.2 and higher had this requirement. This was an irreversible process unless one had made an exact copy of the original disk so as to be able to change names to transfer the program to someone else.

The reliance on the specific hardware of the IBM PC led to 1-2-3 being utilized as one of the two stress test applications, along with Microsoft Flight Simulator, for true 100% compatibility when PC clones appeared in the early 1980s. 1-2-3 required two disk drives and at least 192K of memory, which made it incompatible with the IBM PCjr; Lotus produced a version for the PCjr that was on two cartridges but otherwise identical.

By early 1984 the software was a killer app for the IBM PC and compatibles, while hurting sales of computers that could not run it. "They're looking for 1-2-3. Boy, are they looking for 1-2-3!" InfoWorld wrote. Noting that computer purchasers did not want PC compatibility as much as compatibility with certain PC software, the magazine suggested "let's tell it like it is. Let's not say 'PC compatible,' or even 'MS-DOS compatible.' Instead, let's say '1-2-3 compatible.'" PC clones' advertising did often prominently state that they were compatible with 1-2-3. An Apple II software company promised that its spreadsheet had "the power of 1-2-3". Because spreadsheets use large amounts of memory, 1‐2‐3 helped popularize greater RAM capacities in PCs, and especially the advent of expanded memory, which allowed greater than 640k to be accessed.