<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173_Fall2025/blob/main/F25_Class_05_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Hugging Face
* Part 5.2: Hugging Face Tokenizers
* Part 5.3: Hugging Face Datasets
* **Part 5.4: Training Hugging Face models**

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

Mounted at /content/drive
Note: Using Google CoLab
david.senseman@gmail.com


Make sure your GMAIL address is included as the last line in the output above.

## **Training Hugging Face Models**

**Hugging Face Models** are pre-trained machine learning models available on the Hugging Face Model Hub. These models cover a wide range of tasks, including natural language processing, computer vision, audio processing, and more. They are designed to be easily accessible and usable, allowing developers and researchers to leverage state-of-the-art models without needing to train them from scratch.

There are several reasons why you might want to train Hugging Face Models:

1. **Customization:** Fine-tuning a pre-trained model on your specific dataset can improve its performance on your particular task.

2. **Efficiency:** Training a model from scratch can be time-consuming and resource-intensive. Fine-tuning a pre-trained model can save time and computational resources.

3. **Accessibility:** Hugging Face provides tools and libraries that make it easier to train and deploy models, lowering the barrier to entry for machine learning projects.


### Install Hugging Face Datasets

Before we can begin, we need to install the Hugging Face libraries used in this lesson (`transformers`, `datasets`, and `huggingface_hub`) by running the code in the next cell.


In [None]:
# Install Hugging Face Datasets

!pip install transformers > /dev/null
!pip install transformers[torch] > /dev/null
!pip install transformers[sentencepiece] > /dev/null
!pip install datasets  > /dev/null
!pip install huggingface_hub  > /dev/null

### Install Custom Functions

Run the next cell to create custom functions for this lesson.

In [None]:
# Simple function to print out elapsed time as H:MM:SS
def elapsedTime(start,end):
    # Print out time
    seconds = int((end-start))
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))
    print()
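
As a quick check of the hour/minute/second arithmetic used by the helper above, here is a minimal sketch that returns the formatted string instead of printing it. The name `format_elapsed` and the sample value of 399.07 seconds are illustrative assumptions, not part of the lesson's required code.

```python
def format_elapsed(start, end):
    """Return the time between two time.time() readings as H:MM:SS."""
    # Drop fractional seconds and wrap at one day, as the helper above does
    seconds = int(end - start) % (24 * 3600)
    hours, seconds = divmod(seconds, 3600)
    minutes, seconds = divmod(seconds, 60)
    return "%d:%02d:%02d" % (hours, minutes, seconds)

# Hypothetical sample: a run that took 399.07 seconds
print(format_elapsed(0, 399.07))   # prints 0:06:39
```

Returning a string rather than printing makes the helper easy to test and to reuse inside other messages.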

----------------------------------

# **Examples**

Step 1: Import Libraries and Load Dataset

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, default_data_collator
import torch
import time
from transformers import (
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)

# Load the GoEmotions dataset
go_emotions_dataset = load_dataset("google-research-datasets/go_emotions")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/9.40k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.77M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/350k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/347k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/43410 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5426 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5427 [00:00<?, ? examples/s]

If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image07F.png)

Step 2: Tokenize and Format the Dataset

In [None]:
# Initialize the tokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Define the tokenize function
def tokenize(rows):
    return tokenizer(rows['text'], padding="max_length", truncation=True)

# Tokenize the dataset
tokenized_datasets = go_emotions_dataset.map(tokenize, batched=True)

# GoEmotions is multi-label: each 'labels' entry is a list of class IDs.
# Keep only the first ID so the task becomes single-label classification.
def format_labels(example):
    example['labels'] = example['labels'][0] if isinstance(example['labels'], list) else example['labels']
    return example

tokenized_datasets = tokenized_datasets.map(format_labels)

# Split the tokenized dataset into train and test sets
train_test_split = tokenized_datasets['train'].train_test_split(test_size=0.2, seed=42)

small_train_dataset = train_test_split["train"].shuffle(seed=42)
small_eval_dataset = train_test_split["test"].shuffle(seed=42)



tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/43410 [00:00<?, ? examples/s]

Map:   0%|          | 0/5426 [00:00<?, ? examples/s]

Map:   0%|          | 0/5427 [00:00<?, ? examples/s]

Map:   0%|          | 0/43410 [00:00<?, ? examples/s]

Map:   0%|          | 0/5426 [00:00<?, ? examples/s]

Map:   0%|          | 0/5427 [00:00<?, ? examples/s]

If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image08F.png)
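
To make the effect of the label-formatting step concrete, the sketch below applies the same "keep the first label" rule to a few made-up rows. The texts, the label IDs, and the helper name `first_label` are hypothetical and chosen only for illustration; the point is that any additional labels on a row are discarded.

```python
# Hypothetical rows shaped like GoEmotions examples: 'labels' is a list of class IDs
rows = [
    {"text": "Thanks, that made my day!", "labels": [15, 17]},
    {"text": "Not sure what you mean.", "labels": [6]},
    {"text": "Okay.", "labels": [27]},
]

def first_label(example):
    """Keep only the first label when 'labels' is a list; leave scalars unchanged."""
    labels = example["labels"]
    return {**example, "labels": labels[0] if isinstance(labels, list) else labels}

single = [first_label(r) for r in rows]

# Count how many labels were thrown away by the simplification
dropped = sum(len(r["labels"]) - 1 for r in rows)

print([r["labels"] for r in single])   # prints [15, 6, 27]
print("labels discarded:", dropped)    # prints labels discarded: 1
```

This simplification keeps the training code short, at the cost of ignoring secondary emotions on multi-label rows.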

In [None]:
# Custom data collator to handle potential edge cases and address the warning
def custom_data_collator(features):
    batch = default_data_collator(features)
    if "labels" in batch:
        batch["labels"] = batch["labels"].clone().detach().long()
    return batch

# Instantiate the model
model = DistilBertForSequenceClassification.from_pretrained(model_ckpt, num_labels=28)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image09F.png)

Step 3: Define Model, Training Arguments, and Trainer

In [None]:
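# Define variables
EPOCHS=5          # Has no effect while MAX_STEPS is positive (max_steps overrides num_train_epochs)
EVAL_STEPS=50     # Evaluate on the validation data every 50 steps
MAX_STEPS=500     # Total number of training steps
SAVE_STEPS=50     # Save a checkpoint every 50 steps (matches EVAL_STEPS)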

# Set up TrainingArguments with early stopping requirements
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    max_steps=MAX_STEPS,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,  # Same as MAX_STEPS, so the learning rate is still warming up when training ends
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to=[]
)

# Create an early stopping callback
PATIENCE = 3
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=PATIENCE)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    data_collator=custom_data_collator,
    callbacks=[early_stopping_callback]
)



If the code is correct, you should not see any output.
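
The arithmetic behind these settings can be sketched in plain Python. The numbers below come from the cells above; the single-device and no-gradient-accumulation assumptions, the base learning rate of `5e-5` (assumed here to be the `Trainer` default), and the helper name `warmup_lr` are assumptions made for illustration rather than values stated in this lesson.

```python
import math

# Values taken from this lesson's cells and outputs
ORIGINAL_TRAIN_ROWS = 43410   # rows in the GoEmotions train split (Step 1 output)
TEST_FRACTION = 0.2           # test_size used in Step 2
BATCH_SIZE = 8                # per_device_train_batch_size
MAX_STEPS = 500
WARMUP_STEPS = 500

# Assumptions: one device, no gradient accumulation, base learning rate of 5e-5
BASE_LR = 5e-5

eval_rows = round(ORIGINAL_TRAIN_ROWS * TEST_FRACTION)
train_rows = ORIGINAL_TRAIN_ROWS - eval_rows
steps_per_epoch = math.ceil(train_rows / BATCH_SIZE)
epochs_covered = MAX_STEPS * BATCH_SIZE / train_rows

def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup: ramp the learning rate from 0 to base_lr over warmup_steps."""
    return base_lr * min(1.0, step / warmup_steps)

print(f"train rows: {train_rows}, eval rows: {eval_rows}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"epochs covered by {MAX_STEPS} steps: {epochs_covered:.4f}")
for step in (50, 250, 500):
    print(f"step {step}: lr = {warmup_lr(step):.2e}")
```

Under these assumptions, 500 steps cover only about 0.12 of one pass through the training data, so the model sees a small fraction of the examples before training stops.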

Step 4: Train the Model and Save the Best Model

In [None]:
# Record the start time in T_start
T_start = time.time()

# Train the model
trainer.train()

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
def elapsedTime(T_start, T_end):
    elapsed_time = T_end - T_start
    print(f"Elapsed time: {elapsed_time:.2f} seconds")

elapsedTime(T_start, T_end)

# Manually save the best model
best_model_dir = './best_model'
trainer.save_model(best_model_dir)
tokenizer.save_pretrained(best_model_dir)



Step,Training Loss,Validation Loss
50,3.2679,3.243472
100,2.9166,2.937316
150,2.7624,2.760891
200,2.7678,2.643096
250,2.5836,2.486804
300,2.2266,2.339153
350,2.4066,2.220693
400,2.2579,2.156768
450,2.1409,2.06043
500,1.9862,1.989488


Elapsed time: 399.07 seconds


('./best_model/tokenizer_config.json',
 './best_model/special_tokens_map.json',
 './best_model/vocab.txt',
 './best_model/added_tokens.json',
 './best_model/tokenizer.json')

If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image10F.png)
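
The idea behind early stopping with a patience of 3 can be sketched as a simple counter over the validation losses. This is a simplified illustration, not the library's implementation (which also supports an improvement threshold); the function name `early_stop_index` and the `plateau` list are hypothetical.

```python
# Validation losses copied from the training log above
val_losses = [3.243472, 2.937316, 2.760891, 2.643096, 2.486804,
              2.339153, 2.220693, 2.156768, 2.06043, 1.989488]

def early_stop_index(losses, patience=3):
    """Return the index of the evaluation that triggers a stop, or None."""
    best = float("inf")
    bad_evals = 0                      # consecutive evaluations without improvement
    for i, loss in enumerate(losses):
        if loss < best:
            best, bad_evals = loss, 0  # new best: reset the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# The logged loss improves at every evaluation, so no stop is triggered
print(early_stop_index(val_losses))    # prints None

# Hypothetical run that stalls after the second evaluation
plateau = [2.0, 1.9, 1.95, 1.92, 1.91, 1.8]
print(early_stop_index(plateau))       # prints 4
```

In the run logged above the validation loss fell at every evaluation, which is why training continued until the step limit was reached.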

In [None]:
from sklearn.metrics import accuracy_score

# Evaluate the model on the evaluation dataset
eval_results = trainer.evaluate()

# Print the evaluation results
print("Evaluation results:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

# Get the predictions and true labels
predictions, labels, _ = trainer.predict(small_eval_dataset)

# Convert predictions to label IDs
predictions = torch.tensor(predictions)
predictions = torch.argmax(predictions, dim=-1)

# Calculate accuracy
accuracy = accuracy_score(labels, predictions)

print(f"Accuracy: {accuracy:.4f}")


Evaluation results:
eval_loss: 1.9895
eval_runtime: 33.9725
eval_samples_per_second: 255.5600
eval_steps_per_second: 31.9670
epoch: 0.1152
Accuracy: 0.4600


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_4_image11F.png)
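
The accuracy calculation reduces to two steps: pick the highest-scoring class for each row of logits, then count how often that class matches the true label. The sketch below shows this in plain Python with made-up logits and labels; the helper names and the toy values are assumptions for illustration only.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits for 4 examples and 3 classes
logits = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 0.9],
          [0.4, 0.3, 0.2]]
labels = [1, 0, 1, 0]

print(accuracy_from_logits(logits, labels))          # prints 0.75

# For comparison: uniform random guessing over 28 classes
print(f"chance level for 28 classes: {1 / 28:.4f}")  # prints 0.0357
```

Comparing a reported accuracy with the chance level helps judge how much the model has learned, although class imbalance means a majority-class baseline can be higher than uniform guessing.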

In [None]:
import torch
from transformers import AutoTokenizer, DistilBertForSequenceClassification

# Load the tokenizer and the model from the saved directory
best_model_dir = './best_model'
tokenizer = AutoTokenizer.from_pretrained(best_model_dir)
model = DistilBertForSequenceClassification.from_pretrained(best_model_dir)

# Set the model to evaluation mode
model.eval()

# Define the sentences to test
sentences = [
    "I am feeling very happy today!",
    "This is the worst day of my life.",
    "I am so excited about the new project.",
    "I feel sad and lonely.",
    "I'm feeling very anxious about the presentation.",
    "This news makes me extremely joyful."
]

# Tokenize the sentences
inputs = tokenizer(sentences, padding="max_length", truncation=True, return_tensors="pt")

# Get the model's predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Define the mapping of label IDs to emotions, in the order used by the GoEmotions dataset
label_mapping = {
    0: "admiration",
    1: "amusement",
    2: "anger",
    3: "annoyance",
    4: "approval",
    5: "caring",
    6: "confusion",
    7: "curiosity",
    8: "desire",
    9: "disappointment",
    10: "disapproval",
    11: "disgust",
    12: "embarrassment",
    13: "excitement",
    14: "fear",
    15: "gratitude",
    16: "grief",
    17: "joy",
    18: "love",
    19: "nervousness",
    20: "optimism",
    21: "pride",
    22: "realization",
    23: "relief",
    24: "remorse",
    25: "sadness",
    26: "surprise",
    27: "neutral"
}

# Print the predicted emotions for each sentence
for sentence, pred in zip(sentences, predictions):
    emotion = label_mapping[pred.item()]
    print(f"Sentence: '{sentence}'\nPredicted Emotion: {emotion}\n")


Sentence: 'I am feeling very happy today!'
Predicted Emotion: gratitude

Sentence: 'This is the worst day of my life.'
Predicted Emotion: remorse

Sentence: 'I am so excited about the new project.'
Predicted Emotion: excitement

Sentence: 'I feel sad and lonely.'
Predicted Emotion: remorse

Sentence: 'I'm feeling very anxious about the presentation.'
Predicted Emotion: remorse

Sentence: 'This news makes me extremely joyful.'
Predicted Emotion: sadness
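
Taking only the `argmax` hides how confident the model is. One way to look at this, sketched below in plain Python, is to convert the logits to probabilities with a softmax and list the top few classes. The four class names and the logit values are hypothetical, and `top_k` is an illustrative helper rather than a library function.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical example with four classes
names = ["joy", "sadness", "neutral", "fear"]
logits = [2.0, 0.5, 1.0, -1.0]

for name, prob in top_k(logits, names):
    print(f"{name}: {prob:.3f}")
```

When the top two probabilities are close, the prediction is uncertain, which is expected for a model trained for only a few hundred steps.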



## **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print... --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_4.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **Sol-20**

![__](https://upload.wikimedia.org/wikipedia/commons/5/5e/Processor_Technology_SOL_20_Computer.jpg)


The **Sol-20** was the first fully assembled microcomputer with a built-in keyboard and television output, what would later be known as a home computer. The design was the integration of an Intel 8080-based motherboard, a VDM-1 graphics card, the 3P+S I/O card to drive a keyboard, and circuitry to connect to a cassette deck for program storage. Additional expansion was available via five S-100 bus slots inside the machine. It also included swappable ROMs that the manufacturer called 'personality modules', containing a rudimentary operating system.

The design was originally suggested by Les Solomon, the editor of Popular Electronics. He asked Bob Marsh of Processor Technology if he could design a smart terminal for use with the Altair 8800. Lee Felsenstein, who shared a garage working space with Marsh, had previously designed such a terminal but never built it. Reconsidering the design using modern electronics, they agreed the best solution was to build a complete computer with a terminal program in ROM. Felsenstein suggested the name "Sol" because they were including "the wisdom of Solomon" in the box.

The Sol appeared on the cover of the July 1976 issue of Popular Electronics as a "high-quality intelligent terminal". It was initially offered in three versions; the Sol-PC motherboard in kit form, the Sol-10 without expansion slots, and the Sol-20 with five slots.

A Sol-20 was taken to the Personal Computing Show in Atlantic City in August 1976 where it was a hit, building an order backlog that took a year to fill. Systems began shipping late that year and were dominated by the expandable Sol-20, which sold for \$1,495 in its most basic fully-assembled form. The company also offered schematics for the system for free for those interested in building their own.

The Sol-20 remained in production until 1979, by which point about 12,000 machines had been sold. By that time, the "1977 trinity" —the Apple II, Commodore PET and TRS-80— had begun to take over the market, and a series of failed new product introductions drove Processor Technology into bankruptcy. Felsenstein later developed the successful Osborne 1 computer, using much the same underlying design in a portable format.

### **History**

**Tom Swift Terminal**

Lee Felsenstein was one of the sysops of Community Memory, the first public bulletin board system. Community Memory opened in 1973, running on an SDS 940 mainframe that was accessed through a Teletype Model 33, essentially a computer printer and keyboard, in a record store in Berkeley, California. The cost of running the system was untenable; the teletype normally cost \$1,500 (their first example was donated from Tymshare as junk), the modem another \$300, and time on the SDS was expensive - in 1968, Tymshare charged \$13 per hour (equivalent to \$114 in 2023). Even the reams of paper output from the terminal were too expensive to be practical and the system jammed all the time. The replacement of the Model 33 with a Hazeltine glass terminal helped, but it required constant repairs.

Since 1973, Felsenstein had been looking for ways to lower the cost. One of his earliest designs in the computer field was the Pennywhistle modem, a 300 bits per second acoustic coupler that cost a fraction of the price of commercial models. When he saw Don Lancaster's TV Typewriter on the cover of the September 1973 Radio Electronics, he began adapting its circuitry as the basis for a design he called the Tom Swift Terminal. The terminal was deliberately designed to allow it to be easily repaired. Combined with the Pennywhistle, users would have a cost-effective way to access Community Memory.

In January 1975, Felsenstein saw a post on Community Memory by Bob Marsh asking if anyone would like to share a garage. Marsh was designing a fancy wood-cased digital clock and needed space to work on it. Felsenstein had previously met Marsh at school and agreed to split the \$175 rent on a garage in Berkeley. Shortly after, Community Memory shut down for the last time, having burned out the relationship with its primary funding source, Project One, as well as the energy of its founding members.

**Processor Technology**

January 1975 was also the month that the Altair 8800 appeared on the front page of Popular Electronics, sparking off intense interest among the engineers of the rapidly growing Silicon Valley. Shortly thereafter, on 5 March 1975, Gordon French and Fred Moore held the first meeting of what would become the Homebrew Computer Club. Felsenstein took Marsh to one of the meetings, where Marsh saw an opportunity supplying add-on cards for the Altair, and in April he formed Processor Technology with his friend Gary Ingram.

The new company's first product was a 4 kB DRAM memory card for the Altair. A similar card was already available from the Altair's designers, MITS, but it was almost impossible to get working properly. Marsh began offering Felsenstein contracts to draw schematics or write manuals for the products they planned to introduce. Felsenstein was still working on the terminal as well, and in July, Marsh offered to pay him to develop the video portion. This was essentially a version of the terminal where the data would be supplied by the main memory of the Altair rather than a serial port.

The result was the VDM-1, the first graphics card. The VDM-1 could display 16 lines of 64 characters per line, and included the complete ASCII character set with upper- and lower-case characters and a number of graphics characters like arrows and basic math symbols. An Altair equipped with a VDM-1 for output and Processor Technology's 3P+S card running a keyboard for input removed the need for a terminal, yet cost less than dedicated smart terminals like the Hazeltine.

**Intelligent terminal concept**

Before the VDM-1 was launched in late 1975, the only way to program the Altair was through its front-panel switches and LED lamps, or by purchasing a serial card and using a terminal of some sort. This was typically a Model 33, which still cost \$1,500 if available. Normally the teletypes were not available – Teletype Corporation typically sold them only to large commercial customers, which led to a thriving market for broken-down machines that could be repaired and sold into the microcomputer market. Ed Roberts, who had developed the Altair, eventually arranged a deal with Teletype to supply refurbished Model 33s to MITS customers who had bought an Altair.

Les Solomon, whose Popular Electronics magazine launched the Altair, felt a low-cost smart terminal would be highly desirable in the rapidly expanding microcomputer market. In December 1975, Solomon traveled to Phoenix to meet with Don Lancaster to ask about using his TV Typewriter as a video display in a terminal. Lancaster seemed interested, so Solomon took him to Albuquerque to meet Roberts. The two immediately began arguing when Lancaster criticized the design of the Altair and suggested changes to better support expansion cards, demands that Roberts flatly refused. Any hopes of a partnership disappeared.
