<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_05_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* **Part 5.1: Introduction to Hugging Face**
* Part 5.2: Hugging Face Tokenizers
* Part 5.3: Hugging Face Datasets
* Part 5.4: Training Hugging Face models

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    Colab = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except Exception:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    Colab = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.

### **YouTube Introduction to Hugging Face**

Run the next cell to see a short introduction to Hugging Face. This is a suggested, but optional, part of the lesson.

In [1]:
from IPython.display import HTML
video_id = "GLO5FZzfrS0"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen>
</iframe>
""")

# **Introduction to Hugging Face**

**Hugging Face** is a company renowned for its pioneering work in the realm of Natural Language Processing (NLP). Established in 2016, Hugging Face has become synonymous with making cutting-edge NLP technologies more accessible to developers, researchers, and organizations. The company's commitment to open-source development has fostered a vibrant community, which plays a pivotal role in the rapid advancements of the field.

Central to Hugging Face's acclaim is the Transformers library. This Python library provides an extensive collection of pre-trained models that span a wide range of NLP tasks, from text classification and tokenization to language modeling and translation. Some of the most groundbreaking models like BERT, GPT-2, T5, and RoBERTa can be effortlessly accessed and deployed through this library.

A notable feature of the Transformers library is its user-friendly API. With just a few lines of code, one can:

* Load a pre-trained model.
* Tokenize input text.
* Obtain model predictions or embeddings.
* Fine-tune the model on a specific task.

This ease of use, combined with comprehensive documentation and tutorials, makes it an invaluable tool for both NLP newcomers and seasoned professionals.

Another significant contribution from Hugging Face is the Model Hub, a platform where researchers and developers can share, discover, and use NLP models. The hub promotes collaboration, ensuring that state-of-the-art models are easily available to the wider community. It's not uncommon to see models from recent research papers promptly available on the hub, ready for real-world applications.

Tokenization, the process of converting text into tokens, is a fundamental step in NLP. Recognizing its importance, Hugging Face introduced the Tokenizers library, which provides a fast and efficient way to tokenize massive datasets without compromising on accuracy. Its compatibility with the Transformers library ensures seamless integration between tokenization and modeling.
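
To make the idea of tokenization concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece. The vocabulary and the helper names (`VOCAB`, `wordpiece_tokenize`, `tokenize`) are invented for illustration; the real Tokenizers library uses vocabularies learned from data and is far faster than this toy version.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word
VOCAB = {"token", "##ization", "##izer", "is", "fun", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word))
    return tokens

print(tokenize("Tokenization is fun"))
```

Splitting rare words into known pieces is what lets a model with a fixed vocabulary handle words it never saw during training.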

* **Democratization of NLP:** By making advanced models and tools accessible, Hugging Face has democratized NLP, enabling even small teams or individual developers to harness the power of state-of-the-art models.
* **Rapid Prototyping:** The ease with which one can deploy models means that ideas can be tested and iterated upon swiftly, accelerating the pace of NLP advancements.
* **Community and Collaboration:** The open-source ethos of Hugging Face has cultivated a community that collaborates, contributes, and ensures that the field remains vibrant and progressive.

In the ever-evolving landscape of NLP, Hugging Face stands out as a beacon of innovation, accessibility, and collaboration. Whether you are a researcher pushing the boundaries of what's possible, a developer aiming to integrate NLP into an application, or an enthusiast eager to learn, Hugging Face provides the tools and resources to realize those ambitions. As we delve deeper into this module, we will frequently use the Hugging Face platform to illustrate concepts, implement solutions, and explore the vast possibilities of NLP.



### **Using Python with Hugging Face**

Transformers have become a mainstay of natural language processing. This module will examine the [Hugging Face](https://huggingface.co/) Python library for natural language processing, bringing together pretrained transformers, data sets, tokenizers, and other elements. Through the Hugging Face API, you can quickly begin using sentiment analysis, entity recognition, language translation, summarization, and text generation.

Colab does not install Hugging Face by default. Whether installing Hugging Face directly into a local computer or utilizing it through Colab, the following commands will install the library.

Normally, when you use `pip` to install a package, a lot of information is printed out to show which subpackages were installed. We have suppressed this by adding the text `> /dev/null` after `!pip install <package>`. Technically, this redirects the command's standard output (which would otherwise appear in your Colab notebook) to `/dev/null`, a special file that discards all data written to it. It is often used as a "black hole" for unwanted output from commands that are meant to run silently. Error messages, which are written to standard error rather than standard output, are not affected by this redirect and will still appear.
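
The same idea can be sketched in plain Python, assuming only the standard library. Here `subprocess.DEVNULL` plays the role of `/dev/null`: the first child process prints a line that is thrown away, while the second run captures its output for comparison. The printed messages are placeholders chosen for this illustration.

```python
import subprocess
import sys

# Run a child Python process that prints a line, but discard its standard output
quiet = subprocess.run(
    [sys.executable, "-c", "print('this line is discarded')"],
    stdout=subprocess.DEVNULL,
)

# Run a similar command again, this time capturing the output instead
captured = subprocess.run(
    [sys.executable, "-c", "print('this line is captured')"],
    capture_output=True,
    text=True,
)

print("Return code of quiet run:", quiet.returncode)
print("Captured output:", captured.stdout.strip())
```

Both commands succeed; the only difference is where their output goes.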


### Install Hugging Face Libraries

Run the next code cell to install the Hugging Face libraries needed for this lesson.

In [None]:
# Install Hugging Face libraries

!pip install transformers > /dev/null
!pip install transformers[sentencepiece] > /dev/null

If the code is correct you should not see an output.

Now that we have Hugging Face installed, the following sections will demonstrate how to apply Hugging Face to a variety of everyday tasks. After this introduction, the remainder of this module will take a deeper look at several specific NLP tasks performed with Hugging Face.

# **NLP Sentiment Analysis**

**Sentiment analysis** is a subfield of natural language processing (NLP) that focuses on identifying and categorizing the emotional tone or sentiment of a piece of text, such as a review, a social media post, or an article. The task of sentiment analysis involves analyzing the words and phrases in a given text to determine whether they are positive, negative, or neutral in nature, and assigning a numerical score or label to the overall sentiment of the text.

Sentiment analysis can be applied to various domains such as social media monitoring, customer feedback analysis, product review analysis, and advertising campaign evaluation. It can help organizations to identify trends and patterns in customer opinion, detect fake reviews, and optimize their marketing strategies by understanding the emotions of their target audience.

### What is sentiment analysis?

The process of sentiment analysis typically involves several steps, including:

1. Text Preprocessing: This step involves cleaning and normalizing the text data to prepare it for analysis. This may include removing stop words, punctuation, and converting all text to lowercase.
2. Tokenization: This step involves breaking down the text into smaller units called tokens, which can be individual words or phrases.
3. Sentiment Detection: This step involves analyzing the sentiment of each token in the text using techniques such as machine learning, deep learning, or rule-based approaches.
4. Aggregation: This step involves combining the sentiment scores of all tokens to obtain the overall sentiment score of the text.
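
The four steps above can be sketched with a tiny rule-based analyzer. This is only an illustration under simplifying assumptions: the lexicon, the stop-word list, and the function names are all made up for this example, and a transformer model does not work this way internally.

```python
import string

# Tiny hand-made lexicon (hypothetical); real lexicons hold thousands of words
LEXICON = {"lovely": 1, "fair": 1, "darling": 1, "rough": -1, "dimm'd": -1, "death": -1}
STOP_WORDS = {"and", "the", "of", "to", "a", "is", "do"}

def preprocess(text):
    """Step 1: lowercase the text and strip punctuation (keeping apostrophes)."""
    to_remove = string.punctuation.replace("'", "")
    return text.lower().translate(str.maketrans("", "", to_remove))

def tokenize(text):
    """Step 2: split into word tokens and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def score_tokens(tokens):
    """Step 3: look up a sentiment score for each token (0 if unknown)."""
    return [LEXICON.get(tok, 0) for tok in tokens]

def aggregate(scores):
    """Step 4: combine token scores into one label."""
    total = sum(scores)
    if total > 0:
        return "POSITIVE"
    if total < 0:
        return "NEGATIVE"
    return "NEUTRAL"

line = "Thou art more lovely and more temperate:"
tokens = tokenize(preprocess(line))
scores = score_tokens(tokens)
print(tokens)
print(scores)
print(aggregate(scores))
```

A lexicon like this ignores word order and context, which is one reason learned models usually outperform rule-based approaches on nuanced text.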

Sentiment analysis can be performed using various techniques such as:

1. **Rule-Based Approach:** This approach involves defining a set of rules to classify words or phrases as positive, negative, or neutral based on their meanings and context.
2. **Machine Learning:** This approach involves training a machine learning model on a dataset of labeled text to learn the patterns and relationships between words and sentiment.
3. **Deep Learning:** This approach involves using deep neural networks to analyze the complex patterns in text data and learn the relationships between words, phrases, and sentiment.

Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to identify the tone of written text. Passages of written text can be classified into simple binary states of positive or negative tone. More advanced sentiment analysis might classify text into additional categories: sadness, joy, love, anger, fear, or surprise.


### Example 1 - Step 1: Load Text for Sentiment Analysis

To demonstrate sentiment analysis, we begin by loading Shakespeare's 18th sonnet:

> Shall I compare thee to a summer's day?  
> Thou art more lovely and more temperate:  
> Rough winds do shake the darling buds of May,  
> And summer's lease hath all too short a date:  
> Sometime too hot the eye of heaven shines,  
> And often is his gold complexion dimm'd;  
> And every fair from fair sometimes declines,  
> By chance or nature's changing course untrimm'd;  
> But thy eternal summer shall not fade  
> Nor lose possession of that fair thou owest;  
> Nor shall Death brag thou wander'st in his shade,  
> When in eternal lines to time thou growest.

What do you think Shakespeare's sentiment was when he wrote this sonnet? In particular, is Shakespeare being **positive** or **negative**?

Let's find out using **sentiment analysis**.

In Step 1 we read the text file `sonnet_18.txt` from the course file server using the `urlopen()` function. The text is stored in a variable called `EG_text_1` ("Example text 1").

In [None]:
# Example 1 - Step 1: Load text

from urllib.request import urlopen

# Read sample text, a poem
URL = "https://biologicslab.co/BIO1173/data/sonnet_18.txt"
f = urlopen(URL)
EG_text_1 = f.read().decode("utf-8")

# Print out text
print(EG_text_1)

If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image01A.png)

Usually, you have to preprocess text into embeddings or other vector forms before presentation to a neural network. Hugging Face provides a pipeline that simplifies this process greatly. The pipeline allows you to pass regular Python strings to the transformers and return standard Python values.

We begin by loading a text-classification model. We can specify which model to use, by passing the model parameter, such as:

```
pipe = pipeline(model="roberta-large-mnli")
```

**RoBERTa-large-MNLI** is a fine-tuned version of the RoBERTa large model, which is a transformer-based language model developed by Facebook AI. RoBERTa itself is an optimized version of BERT (Bidirectional Encoder Representations from Transformers) and is trained on a large corpus of English text using a masked language modeling (MLM) objective.

The code in Example 1 - Step 2 loads a model pipeline and a model for sentiment analysis.



### Example 1 - Step 2: Load Model Pipeline

The code in the cell below uses the `transformers` library to create a text-classification pipeline. The `pipeline()` function takes a task name and a model name, which in this case is `"roberta-large-mnli"`. This model is a variant of the `RoBERTa` model that was fine-tuned on the Multi-Genre Natural Language Inference (MNLI) dataset. Natural language inference (NLI) is the task of deciding whether one sentence entails, contradicts, or is neutral toward another.

The `pipeline()` function returns a `Pipeline` object, a high-level API that bundles the tokenizer, the model, and the post-processing step into a single callable.

To use the pipeline, you call the object directly on one or more Python strings. For a text-classification pipeline, the result is a list of dictionaries, each holding a predicted `label` and a `score`.

In [None]:
# Example 1 - Step 2: Load model pipeline

import pandas as pd
from transformers import pipeline

# Specify the model for the pipeline
model_name = "roberta-large-mnli"

# Create classification pipeline
classifier = pipeline("text-classification", model=model_name)

# Verify the pipeline
print(classifier)


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image03A.png)


### Example 1 - Step 3: Run Sentiment Analysis on Text

We can now display the sentiment analysis results with a Pandas dataframe.

In [None]:
# Example 1 - Step 3: Run sentiment analysis on text

# Create variable to hold sentiment analysis
sentiment_outputs = classifier(EG_text_1)

# Display output in a Pandas DataFrame
pd.DataFrame(sentiment_outputs)


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image03F.png)

As you can see, our sentiment analysis considered the tone of Sonnet 18 as being `NEUTRAL` with a `score` of 0.68.

Shakespeare's Sonnet 18 is generally considered to have positive and admiring sentiment. The sonnet praises the subject, comparing them to a summer's day and highlighting their eternal beauty and the impermanence of natural beauty.

So why did we get only a `NEUTRAL` rating?

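One important reason is that **`RoBERTa-large-MNLI`** is not a dedicated sentiment model. It was fine-tuned for natural language inference, and its three output labels are `ENTAILMENT`, `NEUTRAL`, and `CONTRADICTION` rather than positive, negative, and neutral sentiment. A `NEUTRAL` label from this model is therefore only a rough stand-in for emotional tone. In addition, Shakespeare's `Sonnet 18` ("Shall I compare thee to a summer's day?") is a complex and nuanced piece of literature, which makes it challenging for any AI model to classify accurately.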

Several factors could have contributed to the neutral sentiment classification:

1. **Language and Style:** Shakespeare's language and poetic style are intricate, with a mix of positive imagery and more neutral or descriptive language. This complexity might have led the model to classify the overall sentiment as neutral.

2. **Ambiguity:** The sonnet contains elements of praise and admiration, but it also discusses the transience of beauty and the passage of time, which might have contributed to a more balanced or neutral sentiment score.

3. **Contextual Understanding:** AI models might struggle with understanding the full context and emotional depth of the text. They might miss subtleties that human readers would easily pick up on.

The score of `0.6795` indicates that the model is fairly confident in its `NEUTRAL` label, but it doesn't mean the text lacks positive sentiment. It only means that the model did not assign a higher score to either of its other labels.

### **Exercise 1 - Step 1: Load Text for Sentiment Analysis**

For **Exercise 1**, use Shakespeare's Sonnet 66.

**Sonnet 66**

> Tired with all these, for restful death I cry:  
> As, to behold desert a beggar born,  
> And needy nothing trimmed in jollity,  
> And purest faith unhappily forsworn,  
> And gilded honor shamefully misplaced,  
> And maiden virtue rudely strumpeted,  
> And right perfection wrongfully disgraced,  
> And strength by limping sway disablèd,  
> And art made tongue-tied by authority,  
> And folly, doctor-like, controlling skill,   
> And simple truth miscalled simplicity,  
> And captive good attending captain ill.  
>   Tired with all these, from these would I be gone,  
>   Save that, to die, I leave my love alone.  


Here is the `URL` to download this poem:

```text
URL = "https://biologicslab.co/BIO1173/data/sonnet_66.txt"
```

Call your text **`EX_text_1`** and print out `EX_text_1` at the end of the code cell.


In [None]:
# Insert your code for Exercise 1 - Step 1 here


If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image02A.png)

### **Exercise 1 - Step 2: Load Model Pipeline**

In the cell below create the pipeline for your sentiment analysis. You can reuse the code in Example 1 - Step 2 with the following modification. Instead of using the `"roberta-large-mnli"` model, change your model to the following:

```text
"distilbert-base-uncased-finetuned-sst-2-english"
```

The **`distilbert-base-uncased-finetuned-sst-2-english`** model is a lightweight, 66 M-parameter transformer that was distilled from BERT-base and fine-tuned on the Stanford Sentiment Treebank (SST-2) for binary sentiment classification. Using an uncased WordPiece tokenizer and a 512-token limit, the model maps input text to a pair of logits (negative/positive) through a small classification head on the [CLS] token; applying a softmax gives the sentiment probabilities. It reaches roughly 91 % accuracy on the SST-2 development set, within about two points of full BERT-base, while being about 40 % smaller and 60 % faster, which makes it well suited to quick sentiment inference on English text.
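
The last step, turning two logits into a label and a score, can be sketched in a few lines of plain Python. The logit values below are made up for illustration and are not real model output; the label order (NEGATIVE first, POSITIVE second) is an assumption of this sketch.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order (NEGATIVE, POSITIVE); not real model output
labels = ["NEGATIVE", "POSITIVE"]
logits = [2.0, -1.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Mimic the shape of a text-classification result: one label and its probability
print({"label": labels[best], "score": round(probs[best], 4)})
```

Because the two probabilities always sum to 1, a score close to 1 for one label implies a score close to 0 for the other.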

In [None]:
# Insert your code for Exercise 1 - Step 2 here



If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image02F.png)

### **Exercise 1 - Step 3: Run Sentiment Analysis on Text**

In the cell below write the code to run and display the sentiment analysis on your **`EX_text_1`**.

In [None]:
# Insert your code for Exercise 1 - Step 3 here


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image04F.png)

Shakespeare's Sonnet 66 indeed carries a much darker tone compared to Sonnet 18. Here are a few reasons why the sentiment analysis model might have given it a strong negative sentiment classification:

1. **Tone and Themes:** Sonnet 66 delves into themes of despair, disillusionment, and a profound sense of weariness with the world's injustices. The recurring motifs of frustration and longing for rest likely contributed significantly to the negative sentiment score.

2. **Language and Imagery:** The sonnet is filled with negative imagery and phrases expressing dissatisfaction and lamentation. Such language naturally skews towards a negative sentiment classification by the model.

3. **Emotional Weight:** The emotional heaviness and the poet’s voice reflecting societal corruption and personal sorrow might resonate strongly with the AI model’s parameters for negative sentiment.

The score of `0.995` indicates an almost complete certainty by the model that the sentiment is negative, which aligns well with the sonnet’s content and tone.

## **NLP Entity Tagging**

**Entity tagging**, also known as named entity recognition (NER), is a process in natural language processing (NLP) where entities such as names of people, organizations, locations, dates, and other specific items are identified and classified within a text. Here’s how it works:

1. **Identification:** The model scans the text to find mentions of entities. For example, in the sentence "Microsoft was founded by Bill Gates in 1975," the entities are "Microsoft," "Bill Gates," and "1975."

2. **Classification:** Once identified, the entities are classified into predefined categories, such as:

- **Person:** Names of individuals (e.g., "Bill Gates")

- **Organization:** Names of companies, institutions, etc. (e.g., "Microsoft")

- **Location:** Names of places (e.g., "Seattle")

- **Date/Time:** Specific dates or times (e.g., "1975")

- **Others:** Such as events, quantities, monetary values, etc.

Entity tagging is widely used in various applications like information extraction, search engines, and enhancing the accuracy of machine translation. It helps in structuring unstructured text data, making it easier to analyze and extract meaningful insights.

The model used in this lesson, `dslim/bert-base-NER`, assigns each entity it finds to one of the following categories:

* Location (LOC)
* Organizations (ORG)
* Person (PER)
* Miscellaneous (MISC)

The code in Example 2 requests a "named entity recognizer" (ner) and processes the specified text.

### Example 2 - Step 1: Entity Tagging

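The code in the cell below creates a variable, `EG_text_2`, holding a sentence about the famous molecular biologist `James Watson` and the `United States`. It then builds two pipelines, one for text classification and one for named entity recognition, and runs both on the sentence.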

In [None]:
# Example 2 - Step 1: Entity tagging

# Specify text
EG_text_2 = "James Watson was a molecular biologist who lived in the United States."

# Create one pipeline instance for each task
classifier = pipeline("text-classification", model="roberta-large-mnli")
ner_tagger   = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Classification
print(classifier(EG_text_2))     # MNLI style entailment / contradiction / neutral

# NER
print(ner_tagger(EG_text_2))


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image03A.png)

### Example 2 - Step 2: Review the Results

The code in the next cell lets us view the results in a `Pandas` DataFrame. As you can see in the output, the person (`PER`) is `James Watson` and the location (`LOC`) is the `United States`.

In [None]:
# Example 2 - Step 2: Review results

# Define outputs
outputs = ner_tagger(EG_text_2)

# Display Output DataFrame
pd.DataFrame(outputs)


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image06F.png)

Your entity tagging pipeline has successfully identified and classified "James Watson" as a person and "United States" as a location with very high confidence.
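
Because the tagger returns ordinary Python dictionaries, the results are easy to post-process. The sketch below groups entity text by label and drops low-confidence predictions. The `sample_entities` list only imitates the shape of aggregated `ner` output (`entity_group`, `score`, `word`, `start`, `end`); its scores are made up, and `entities_by_type` is a hypothetical helper, not part of the `transformers` library.

```python
text = "James Watson was a molecular biologist who lived in the United States."

# Illustrative output in the shape returned by an aggregated "ner" pipeline.
# The scores here are made up for the sketch, not real model output.
sample_entities = [
    {"entity_group": "PER", "score": 0.99, "word": "James Watson", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.98, "word": "United States", "start": 56, "end": 69},
]

def entities_by_type(entities, min_score=0.9):
    """Group entity words by label, keeping only confident predictions."""
    grouped = {}
    for ent in entities:
        if ent["score"] >= min_score:
            grouped.setdefault(ent["entity_group"], []).append(ent["word"])
    return grouped

# The start/end values are character offsets into the original text
for ent in sample_entities:
    print(ent["entity_group"], "->", text[ent["start"]:ent["end"]])

print(entities_by_type(sample_entities))
```

Grouping by label like this is a common first step when extracting people, places, or organizations from a large collection of documents.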

### **Exercise 2 - Step 1: Entity Tagging**

In the cell below create a variable, `EX_text_2`, containing a sentence about another famous molecular biologist, `Francis Crick`, in the `United Kingdom`. You can reuse the code from Example 2 - Step 1.

In [None]:
# Insert your code for Exercise 2 - Step 1 here



If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image05A.png)

### **Exercise 2 - Step 2: Review the Results**

In the cell below write the code to define and review the results of your **`EX_text_2`**.

In [None]:
# Insert your code for Exercise 2 - Step 2 here



If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image06A.png)

Your entity tagging pipeline has successfully identified and classified "Francis Crick" as a person and "United Kingdom" as a location with very high confidence.

## **NLP Question Answering**

**Question Answering (QA)** is a powerful NLP task that involves providing precise answers to questions based on a given reference text. This capability is incredibly useful in various applications, such as information retrieval and educational tools.

Here's a quick rundown of how QA works:

1. **Reference Text:** A passage or document from which the answer can be extracted.

2. **Question:** A specific query related to the reference text.

3. **Answer Extraction:** The NLP model processes the reference text to find the exact segment that answers the question.

#### **Use Cases for a Computational Biologist**

For a computational biologist, “QA analysis” usually refers to **question‑answering** systems that can interrogate large biological data sets, literature corpora, and simulation outputs.  Below are practical ways the technology can be applied in a research or industrial setting.

| Use Case | What the QA system does | Typical inputs | Typical outputs | Why it matters |
|----------|------------------------|----------------|-----------------|----------------|
| **Literature mining & summarization** | Reads PubMed, bioRxiv, and review articles to answer specific biological questions (e.g., “Which genes are co‑expressed with *TP53* in breast cancer?”). | Natural‑language query + optional PMID set | Short answer, evidence sentences, citation list | Speeds up hypothesis generation and keeps investigators up‑to‑date. |
| **Protocol & pipeline verification** | Checks whether computational workflows (e.g., variant‑calling pipelines) meet best‑practice standards or regulatory requirements. | Pipeline description, code, test logs | Pass/fail verdict, compliance report | Helps avoid costly downstream errors and accelerates reproducibility. |
| **Clinical data interrogation** | Answers clinical questions from electronic health records or multi‑omics datasets (“What is the probability that a patient with X mutation will respond to drug Y?”). | Structured patient data, genomic profiles | Risk score, confidence interval, relevant literature | Supports precision‑medicine decisions. |
| **Drug‑target & repurposing discovery** | Queries databases (DrugBank, ChEMBL) to find potential new targets or off‑target effects. | Chemical structure or name, target protein | List of candidate targets, predicted affinity scores | Narrows experimental screening lists. |
| **Ontology & annotation assistance** | Maps free‑text notes or sequence motifs to controlled vocabularies (GO, MeSH). | Raw text, protein FASTA | Curated annotations, suggested terms | Improves data standardization for downstream analysis. |
| **Model validation & debugging** | Interrogates machine‑learning models to expose decision pathways (“Why did the model predict a splice‑site mutation is benign?”). | Model, input features, intermediate activations | Explanatory text, feature importance, counter‑factuals | Enhances trust and aids model refinement. |
| **Educational & training support** | Answers students’ queries during bioinformatics courses (e.g., “Explain the difference between UTRs and introns”). | Student question | Concise, citation‑backed answer | Lowers the learning curve for new trainees. |

#### **How it typically works**
1. **Indexing** – Textual resources (papers, databases, logs) are pre‑processed and stored in a vector‑search index.  
2. **Query processing** – The user’s natural‑language question is encoded into a vector and matched against the index.  
3. **Answer generation** – The system extracts the most relevant passages, optionally runs a language‑model head to paraphrase or synthesize an answer, and tags the answer with evidence citations.  
4. **Human‑in‑the‑loop** – Domain experts review the answer; feedback is fed back into the model for continual improvement.
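
The indexing and query-matching steps can be sketched without any neural network. This toy version assumes a bag-of-words count vector in place of a learned text encoder and a three-sentence list in place of a real literature index; `embed`, `cosine`, and `retrieve` are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for papers or database records
documents = [
    "TP53 is a tumor suppressor gene frequently mutated in cancer",
    "Introns are removed from pre-mRNA during splicing",
    "UTRs are untranslated regions at the ends of an mRNA",
]

def embed(text):
    """Encode text as a bag-of-words count vector (a stand-in for a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1 (indexing): store each document alongside its vector
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question, top_k=1):
    """Step 2 (query processing): encode the question and rank the documents."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(retrieve("which gene is mutated in cancer"))
```

In a full system, the retrieved passage would then be passed as the `context` to a QA model, which extracts or generates the final answer.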

By integrating QA analysis into the computational biology workflow, researchers can dramatically reduce the time spent on literature reviews, data curation, and pipeline troubleshooting, allowing more focus on discovery and hypothesis testing.


### Build QA Pipeline

The code in the cell below builds a **Question Answering (QA)** pipeline.

In [None]:
# Build pipeline

from transformers import pipeline

# Build the question‑answering pipeline (GPU is auto‑selected if available)
qa_pipe = pipeline(
    task="question-answering",
    model="distilbert-base-uncased-distilled-squad",  # any QA‑ready model works
)

print(" QA pipeline is ready!")


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image15A.png)

### Example 3: Generate QA Result

The code in the cell below uses our `QA Pipeline` to ask the question

```text
"what shall fade"
```
The `context` of this question is Shakespeare's `Sonnet 18` stored in `EG_text_1` that we created above in `Example 1`.

In [None]:
# Example 3: Generate QA result

# Define the context and the question you want answered
context = EG_text_1  # Sonnet 18 from Example 1

# Define the question
question = "what shall fade"

# Run the QA model
result = qa_pipe(question=question, context=context)

# Pretty‑print the answer
print(f"Question: {question}")
print(f"Answer  : {result['answer']}")
print(f"Score   : {result['score']:.4f}")
print(f"Start   : {result['start']}  End: {result['end']}")


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image16A.png)

With a `Score = 0.38`, the model is not especially confident that `eternal summer` is the correct answer to the question `what shall fade`. That caution is warranted: the sonnet actually says that the eternal summer shall *not* fade.
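
The `start` and `end` values in the result are character offsets into the context, so they can be used to recover the answer and the text around it. The sketch below uses a single line of the sonnet and a hand-built `result` dictionary that only imitates the shape of the pipeline output; the offsets are computed with `str.find` rather than by a model.

```python
# One line of Sonnet 18 used as a short context
context = "But thy eternal summer shall not fade"

# Hypothetical result in the shape returned by the QA pipeline;
# the score is copied from the text above and the offsets are computed here
answer = "eternal summer"
start = context.find(answer)
result = {"answer": answer, "score": 0.38, "start": start, "end": start + len(answer)}

# The offsets recover the answer text directly from the context
span = context[result["start"]:result["end"]]
print(span)

# Show some surrounding text so a reader can judge the answer in context
window = 8
left = max(0, result["start"] - window)
right = min(len(context), result["end"] + window)
print("..." + context[left:right] + "...")
```

Printing the surrounding words is a quick way to spot cases like this one, where the extracted span is followed by "shall not".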

### **Exercise 3: Generate QA Result**

The following is a famous quotation by the molecular geneticist James Watson from the foreword to _Discovering the Brain_ by Sandra Ackerman.

```text
"The brain is the last and grandest biological frontier, "
"the most complex thing we have yet discovered in our universe. "
"It contains hundreds of billions of cells interlinked through "
"trillions of connections. The brain boggles the mind."
```

For **Exercise 3** you are to use Watson's quote as the `context`. The easiest way to do this is to copy and paste the quotation into the space between the two parentheses as follows:

```text
Watson_quotation = (
"The brain is the last and grandest biological frontier, "
"the most complex thing we have yet discovered in our universe. "
"It contains hundreds of billions of cells interlinked through "
"trillions of connections. The brain boggles the mind."
)
```
You will need to assign your `Watson_quotation` to the `context` variable using this code:

```text
# Assign quotation to context
context = Watson_quotation
```
Finally, for your question, use this text:

```text
"how many connections in the brain"
```


In [None]:
# Insert your code for Exercise 3 here



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image14A.png)


------------------------

### **What does the `score` field mean in Hugging Face QA pipelines?**

When a question‑answering model returns a span, it also returns a
`score`.  The score is **not a hard probability** of correctness, but
the model’s internal confidence that the chosen start‑and‑end tokens
are the right answer.

#### **How it is computed**
1. The model outputs *start* and *end* logits for every token in the
   context.  
2. A softmax is applied to each set of logits → probabilities that sum
   to 1.  
3. The pipeline picks the span \([s, e]\), with \(s \le e\), that maximises  
   \( P_{\text{start}}(s) \times P_{\text{end}}(e) \).  
   This product is the `score`.
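
The three steps above can be sketched in plain Python. This is a simplified illustration, not the pipeline's actual implementation: the logits are made-up values, the helper names `softmax` and `best_span` are hypothetical, and the `max_answer_len` argument is a parameter of this sketch. Details such as ignoring question tokens and handling long contexts are left out.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the best span with start <= end."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for s, ps in enumerate(p_start):
        # Only consider end positions at or after the start position
        for e in range(s, min(s + max_answer_len, len(p_end))):
            score = ps * p_end[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Made-up logits for a five-token context (illustrative values only)
start_logits = [0.1, 2.0, 0.3, -1.0, 0.0]
end_logits = [-0.5, 0.2, 2.5, 0.1, -1.0]

s, e, score = best_span(start_logits, end_logits)
print(f"Best span: tokens {s}..{e}, score = {score:.4f}")
```

Because the score is a product of two probabilities, it can be small even when the chosen span is clearly the best candidate.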

#### **Interpreting a score**
| Score range | Rough meaning | What to do |
|-------------|---------------|------------|
| < 0.3 | Very low confidence | Treat the answer as unreliable; re‑phrase the question or add more context. |
| ≈ 0.5 | Moderate confidence | The answer is plausible but not guaranteed; double‑check if accuracy matters. |
| > 0.8 | High confidence | The answer is likely correct; you can use it as is. |

> **Caveat** – Hugging Face models are *not* calibrated to produce true
> probabilities.  Use the score to **rank or filter** results, not as an
> absolute correctness metric.
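
As a rough sketch of the "rank or filter" advice, the snippet below keeps only answers whose score clears a threshold and sorts the rest from most to least confident. The `candidates` list is hypothetical data shaped like the dictionaries the QA pipeline returns, and the default threshold of 0.3 is an assumption taken from the table above, not a library setting.

```python
# Hypothetical results, e.g. gathered from several questions or contexts
candidates = [
    {"answer": "answer A", "score": 0.38},
    {"answer": "answer B", "score": 0.91},
    {"answer": "answer C", "score": 0.12},
]

def filter_and_rank(results, min_score=0.3):
    """Drop low-confidence answers, then sort the rest by score (highest first)."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)

ranked = filter_and_rank(candidates)
for r in ranked:
    print(f"{r['answer']}: {r['score']:.2f}")
```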

-------------------------------


## **NLP Language Translation**

**Language Translation** is a fascinating and practical application of natural language processing (NLP). It involves converting text from one language into another while preserving the original meaning, context, and nuances.

Hugging Face provides powerful pre-trained models for language translation.

#### **List of Translators Used by Hugging Face**

Translation models are available for many different language pairs. Here are some popular ones:

1. **MarianMT Models**: These models support translation between multiple language pairs. For example:
   - `Helsinki-NLP/opus-mt-en-de`: English to German
   - `Helsinki-NLP/opus-mt-en-fr`: English to French
   - `Helsinki-NLP/opus-mt-en-es`: English to Spanish

2. **T5 Models**: These models can be fine-tuned for translation tasks. For example:
   - `t5-small`: A smaller version of the T5 model that can be fine-tuned for translation.
   - `t5-large`: A larger version of the T5 model with more parameters for better performance.

3. **mBART Models**: These models are designed for multilingual translation tasks. For example:
   - `facebook/mbart-large-50-many-to-many-mmt`: A multilingual model that supports translation between 50 languages.

4. **M2M100 Models**: These models are designed for many-to-many translation tasks. For example:
   - `facebook/m2m100_418M`: A model that supports translation between 100 languages.

You can find more translation models and explore their capabilities on the [Hugging Face Models page](https://huggingface.co/models?search=translate).


### Install `sacremoses` package

Before we can start translating, we need to install the `sacremoses` package.
The `sacremoses` package is a Python port of the `Moses tokenizer`, `truecaser`, and `normalizer`. It provides tools for text preprocessing, which are essential for various natural language processing (NLP) tasks. Here are some key features of sacremoses:

**Tokenizer:** The Moses tokenizer splits text into tokens (words, punctuation, etc.) while handling special characters and preserving the original meaning. For example, it can tokenize sentences with unusual symbols and punctuation.

**Detokenizer:** The Moses detokenizer reverses the tokenization process, converting tokens back into a coherent sentence.

**Truecaser:** The Moses truecaser adjusts the casing of words in a text to match typical usage patterns. It can be trained on a large corpus to learn the correct casing for words.
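
To make the tokenizer and detokenizer ideas concrete, here is a toy sketch using only Python's built-in `re` module. It is a simplified stand-in and does **not** reproduce the rules that `sacremoses` applies; the function names `simple_tokenize` and `simple_detokenize` are hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_detokenize(tokens):
    """Join tokens with spaces, then remove the space before common punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

sentence = "The brain boggles the mind."
tokens = simple_tokenize(sentence)
print(tokens)
print(simple_detokenize(tokens))
```

A real tokenizer must also deal with cases such as abbreviations, quotation marks, and language-specific punctuation, which is why a dedicated package is used in practice.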

Google Colab doesn't normally include the `sacremoses` package but you can add it by running the following code cell.

In [None]:
# Install sacremoses package

!pip install sacremoses

If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image12A.png)

### Example 4 - Step 1: Select Translator

In this example we will use the **`Helsinki-NLP/opus-mt-en-de`** model, which translates text from English to German. Developed by the Language Technology Research Group at the University of Helsinki, this model is part of the `OPUS-MT` project, which provides open translation services for various language pairs.

In [None]:
# Example 4 - Step 1: Setup translator

# Select translator
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image17A.png)

### Example 4 - Step 2: Translate Text

The following code translates Watson's quotation (`Watson_quotation`) that was used in **Exercise 3** into German.


In [None]:
# Example 4 - Step 2 Translate text

import textwrap
from transformers import pipeline

# Normalize the text by removing extra spaces
normalized_text = ' '.join(Watson_quotation.split())

# Print the normalized text with proper wrapping
wrapped_text = textwrap.fill(normalized_text, width=50)
print(wrapped_text)

# Print spacer
print("\n")

# Perform translation
outputs = translator(Watson_quotation, clean_up_tokenization_spaces=True, min_length=100)

# Get the translated text
translated_text = outputs[0]['translation_text']

# Remove excessive dots (e.g., ellipses)
cleaned_text = translated_text.rstrip('.')

# Break the cleaned text into separate lines
wrapped_text = textwrap.fill(cleaned_text, width=50)

print(wrapped_text)


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image18A.png)

## **NLP Summarization**

**Summarization** is a key NLP task that involves condensing a longer text into a shorter version while preserving the main ideas and important details. There are two primary types of summarization:

1. **Extractive Summarization:** This method involves selecting and extracting key sentences or phrases from the original text. The selected sentences are then combined to form a summary. It's like creating a highlight reel of the text.

2. **Abstractive Summarization:** This method involves generating new sentences that capture the essence of the original text. It requires the model to understand the context and rephrase the content in a concise manner. This approach is more challenging but can produce more natural and coherent summaries.

Summarization is an NLP task that summarizes a more lengthy text into just a few sentences.
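
The extractive approach can be illustrated with a very small word-frequency sketch. This is a toy example under simple assumptions (sentences end in `.`, `!` or `?`, and a sentence is scored by the average frequency of its words); it is not how the Hugging Face summarization pipeline works, which is abstractive. The function name `extractive_summary` and the sample text are hypothetical.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    # Split into sentences at '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

    # Count how often each word appears in the whole text
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average word frequency of the sentence
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

sample = "The brain is complex. The brain has many cells. Birds fly south."
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

Because no stop words are removed, common words such as "the" influence the scores; a more careful version would filter them out first.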

#### **Use Cases**

**NLP summarization**—both extractive and the more recent abstractive methods powered by transformer architectures such as BERT, T5, and GPT—automatically condenses long documents into concise, coherent overviews while preserving key information. In computational biology, this capability is indispensable for navigating the deluge of biomedical literature and data reports. Researchers employ summarization to generate rapid literature reviews, extract essential findings from high‑throughput genomics, proteomics, and metabolomics papers, and produce concise summaries of experimental protocols or clinical trial outcomes. Additionally, summarization helps distill complex pathway and interaction network descriptions, create short reports of multi‑omics integration results, and aid in hypothesis generation by highlighting novel associations across datasets. By reducing manual curation effort, summarization accelerates discovery pipelines, improves reproducibility, and supports evidence‑based decision‑making in genomics, drug discovery, and personalized medicine.

### Example 5: Summarization

The code in the cell below summarizes the text in the variable `EG_text3`.

In [None]:
# Example 5 - Step 1: Setup pipeline for summarizer

EG_text3 = """
Hugging Face is a company and an open-source platform that focuses on Natural
Language Processing (NLP) technologies and machine learning models.
We’re on a journey to advance and democratize artificial intelligence
through open source and open science. Our mission is to make AI accessible
to everyone, enabling researchers, developers, and organizations to
build and deploy state-of-the-art models with ease. By fostering a
collaborative community and providing cutting-edge tools,
we aim to accelerate the development and adoption of AI technologies
for the benefit of all.
"""

summarizer = pipeline("summarization")

If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image11F.png)

### Example 5 - Step 2: Print Summary

The following code summarizes the text in `EG_text3` and prints both the original text and the summary.

In [None]:
# Example 5 - Step 2: Print output

import textwrap
from transformers import pipeline

# Print original text
print("Original Text:")
print(EG_text3)

# Print spacer
print("\nSummary: \n")

# Send text to summarizer
outputs = summarizer(EG_text3, max_length=45, min_length=20,
                     clean_up_tokenization_spaces=True)

# Get the summary text
summary_text = outputs[0]['summary_text']

# Break the summary into separate lines
wrapped_summary = textwrap.fill(summary_text, width=50)

# Print the wrapped summary
print(wrapped_summary)


## **NLP Text Generation**

**Text generation** is a fundamental and powerful capability in natural language processing (NLP), and Hugging Face provides several models that excel in this area. Here's why text generation is important and how it is utilized:

1. **Creative Writing:** Text generation models can help authors, poets, and scriptwriters generate new content, brainstorm ideas, or even complete unfinished pieces. This can significantly boost creativity and productivity.

2. **Conversational Agents:** Text generation models are used to create chatbots and virtual assistants that can engage in natural and meaningful conversations with users. These models can provide customer support, answer queries, and even offer companionship.

3. **Content Creation:** Businesses and content creators use text generation to produce articles, social media posts, product descriptions, and more. This helps in maintaining a consistent flow of content and saves time.

4. **Language Translation:** Text generation plays a crucial role in machine translation, where models generate translated text from one language to another, ensuring that the meaning and nuances are preserved.

5. **Summarization:** As mentioned earlier, text generation models can summarize lengthy documents into concise summaries, making it easier to digest large volumes of information quickly.

6. **Code Generation:** Developers use text generation models to assist in coding by generating code snippets, documenting code, or even completing functions based on prompts.

7. **Educational Tools:** Text generation can be used to create educational content, generate quizzes, provide explanations, and even tutor students in various subjects.

Hugging Face hosts several powerful models for text generation, such as GPT-2 (the default model for the `text-generation` pipeline), T5, and BLOOM. These models leverage large-scale pre-training on diverse datasets to understand and generate human-like text.

### Example 6 - Step 1: Setup Generator Pipeline

The code in the cell below sets up the generator pipeline.

In [None]:
# Example 6 - Step 1: Setup generator pipeline

from urllib.request import urlopen

# Setup pipeline
generator = pipeline("text-generation")


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image12F.png)

### Example 6 - Step 2: Generate New Text

The code in the cell below generates additional text after Sonnet 18.


In [None]:
# Example 6 - Step 2: Generate new text


# Print original text
print("\nOriginal Text:---------------------- \n")
print(EG_text_1)
print("\nGenerated Text:------------------- \n")

outputs = generator(EG_text_1, max_length=400)
print(outputs[0]['generated_text'])

If the code is correct you should see something similar to this output.

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image20A.png)

NOTE: If you rerun the code several times, you will generate quite different outputs!

### **Comparison of Original vs. Generated Text**

The following is an analysis of the Original Text (Sonnet 18) and the Generated Text (shown above) by `OpenAI's` **`ChatGPT-4o`** model.

#### **1. Fidelity to Original**
- **Original Text**: Maintains Shakespeare’s iambic pentameter, rhyme scheme (ABAB CDCD EFEF GG), and thematic focus on eternal beauty and poetic immortality.
- **Generated Text**: The first 12 lines are identical to the original, but the continuation breaks from the sonnet structure and introduces new, loosely connected ideas.

#### **2. Coherence and Structure**
- **Original**: Cohesive and logically structured argument about the enduring nature of the subject’s beauty.
- **Generated**: Becomes repetitive and structurally disorganized. Phrases like “the boy shall be called, as he is now” are repeated without clear poetic or thematic purpose.

#### **3. Style and Language**
- **Original**: Rich in metaphor (“eye of heaven,” “eternal summer”), elevated diction, and poetic devices.
- **Generated**: Attempts poetic language but lacks the sophistication and precision of Shakespeare. The metaphors are simpler and less evocative (“sun and moon are the sweetest kisses of love”).

#### **4. Thematic Depth**
- **Original**: Explores themes of beauty, mortality, and the power of poetry to preserve.
- **Generated**: Shifts toward themes of identity and romantic roles but without clear development or philosophical insight.

#### **5. Creativity and Originality**
- **Generated Text**: Shows an attempt to emulate poetic style and extend the theme, but the repetition and lack of clarity suggest limitations in the model’s understanding of poetic form and nuance.

### **Summary Table**

| Aspect              | Original Text                  | Generated Text                          |
|---------------------|--------------------------------|------------------------------------------|
| Structure           | Sonnet form, 14 lines          | Starts as sonnet, then diverges          |
| Language            | Elevated, metaphorical         | Poetic attempt, but repetitive           |
| Coherence           | Logical and thematic unity     | Fragmented and repetitive                |
| Creativity          | Masterful and timeless         | Imitative, with limited poetic insight   |

---

It should be noted that some of the newest LLMs such as **`Gemini 2.5 Pro`**, **`ChatGPT-4o (OpenAI)`** and **`Claude 3.7 Sonnet (Anthropic)`** could probably do a much better job.


Note that the generated text continues far beyond the original Sonnet 18, adding extra and somewhat repetitive lines. This is common when a generative model extends a prompt until it reaches its length limit.

## **Lesson Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_1.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **Apple Mac Studio with M3 Ultra**

![___](https://biologicslab.co/BIO1173/images/class_05/Mac_Studio.jpg)

The Mac Studio is a small-form-factor workstation computer developed and marketed by Apple Inc. It is one of four desktop computers in the Mac lineup, sitting above the consumer-range Mac Mini and iMac, and positioned below the Mac Pro. It is configurable with either the M4 Max or M3 Ultra system on a chip.

## On-Line Purchase

If you are looking for a new computer to do a little homework, surf the web, or run Large Language Models locally, you can't go wrong buying the **Apple Mac Studio with M3 Ultra**.

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image21A.png)

### **Why the Mac Studio with M3 Ultra Is Useful for Running LLMs Locally**

#### 🔧 Key Hardware Advantages

##### 1. M3 Ultra Chip Architecture
- Combines two M3 Max chips, offering **up to 32 CPU cores and 80 GPU cores**.
- Unified memory architecture with **up to 512GB of RAM**, crucial for loading large models entirely into memory.

##### 2. High Memory Bandwidth
- Apple Silicon chips have **extremely fast memory bandwidth**, aiding rapid data movement during inference and training.

##### 3. Neural Engine
- The **Apple Neural Engine (ANE)** can accelerate certain ML tasks.
- While not fully utilized by all ML frameworks yet, it's promising for future optimization.

##### 4. Energy Efficiency
- Much more **power-efficient** than traditional workstations with discrete GPUs.
- Ideal for long-running tasks without excessive heat or power draw.

---

#### 🧠 Software Ecosystem for LLMs on macOS

##### 1. Support for ML Frameworks
- macOS supports **PyTorch**, **TensorFlow**, and **Hugging Face Transformers**.
- Tools like **llama.cpp** (which uses the **GGUF** model file format) and **MLX (Apple’s ML framework)** are optimized for Apple Silicon.

##### 2. Running Quantized Models
- Models like **LLaMA 2**, **Mistral**, or **Phi-3** can run in 4-bit or 8-bit quantized formats.
- With up to 512GB of unified memory, even **larger models** like LLaMA 3 70B can be experimented with in quantized form.
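
A back-of-the-envelope calculation shows why quantization matters for fitting a model in memory. The sketch below estimates the memory needed for the weights alone as parameters × bits per weight ÷ 8. It is an approximation: it ignores activation memory, the key-value cache, and any per-format overhead, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# A 70-billion-parameter model at three common precisions
for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

This prints roughly 140 GB, 70 GB, and 35 GB, so moving from 16-bit to 4-bit weights cuts the weight footprint to a quarter.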

##### 3. MLX Framework
- Apple’s MLX framework is designed to take full advantage of Apple Silicon.
- Makes it easier to run and experiment with models locally.

---

#### 🧪 Use Cases for Local LLMs

- **Privacy-sensitive applications** (e.g., medical or educational data).
- **Offline access** for fieldwork or remote teaching.
- **Rapid prototyping** without relying on cloud APIs.
- **Fine-tuning small models** for domain-specific tasks.

---



## Overview

The Mac Studio is a desktop personal computer, designed to sit between the consumer-level Mac Mini and the professional-targeted Mac Pro. The Mac Studio has an identical width and depth to the contemporary Mac mini, 7.7 inches (200 mm), but it stands taller at 3.7 inches (94 mm).

The Mac Studio was initially offered with one of two ARM-based SoCs: the M1 Max or the M1 Ultra, which combines two M1 Max chips in one package. The rear panel has:

- Four Thunderbolt 4 (USB 4) ports
- Two USB 3.0 Type-A ports
- HDMI (up to 4K @ 60 Hz)
- 10Gb Ethernet with Lights Out Management
- Headphone jack

The front panel has:

- Two USB-C ports (Thunderbolt 4 in M1 Ultra models)
- SD card slot (supports SDXC cards and UHS-II bus)

It is cooled by a pair of double-sided blowers and a mesh of holes on the bottom and back of the case, which helps reduce fan noise. Nevertheless, there have been reports of excessive fan noise.

Mac Studio models with the Ultra SoC are heavier than the Max-equipped models, as they exchange the aluminum heat sink for one composed of copper. Apple says the Mac Studio performs 50% faster than a Mac Pro with a 16-core Intel Xeon processor.

The Mac Studio was introduced alongside the Apple Studio Display, a 27-inch 5K monitor with:

- Integrated 12 MP camera
- Six-speaker sound system with spatial audio and Dolby Atmos support
- Height adjustable stand

Customers reported months-long shipping delays for the Mac Studio, attributed to a global chip shortage.

## Updates
- **June 5, 2023 (WWDC)**: Updated Mac Studio models with M2 Max and M2 Ultra chips.
  - Bluetooth 5.3
  - Wi-Fi 6E
  - Support for up to six 6K monitors
  - 8K display support over Thunderbolt and HDMI

- **March 5, 2025**: Updated Mac Studio models with M4 Max and M3 Ultra chips (shipping began March 12).
  - Thunderbolt 5
  - Memory configurable up to 512 GB
  - Storage configurable up to 16 TB (M3 Ultra)
  - M3 Ultra included due to no existing Ultra chips in the M4 line

## Repairability

The Mac Studio has removable flash storage ports, with one or two in use depending on storage configuration. While swapping flash storage cards between same-size models is possible with Apple Configurator restore, upgrading is not officially supported.

Criticism includes:

- Limited upgradeability unfriendly to right to repair
- SSD controller integrated into SoC for encryption
- SSD placement beneath exposed power supply


## Specifications
| Model | 2022 | 2023 | 2025 |
|-------|------|------|------|
| **Announced** | Mar 8, 2022 | Jun 5, 2023 | Mar 5, 2025 |
| **Released** | Mar 18, 2022 | Jun 13, 2023 | Mar 12, 2025 |
| **Discontinued** | Jun 5, 2023 | Mar 5, 2025 | In production |

### Chip Configurations

- **2022**
  - M1 Max: 10-core CPU, 24-core GPU, 16-core Neural Engine
  - M1 Ultra: 20-core CPU, 48-core GPU, 32-core Neural Engine

- **2023**
  - M2 Max: 12-core CPU, 30-core GPU, 16-core Neural Engine
  - M2 Ultra: 24-core CPU, 60-core GPU, 32-core Neural Engine

- **2025**
  - M4 Max: 14-core CPU, 32-core GPU, 16-core Neural Engine
  - M3 Ultra: 28-core CPU, 60-core GPU, 32-core Neural Engine

### Memory

- 2022: 32 GB (up to 64 GB)
- 2023: 64 GB (up to 128 GB)
- 2025:
  - M4 Max: 36 GB (up to 128 GB)
  - M3 Ultra: 96 GB (up to 512 GB)

### Storage

- 2022 & 2023: 512 GB (Max) or 1 TB (Ultra), up to 8 TB
- 2025: Up to 16 TB (Ultra)

### Wireless

- 2022: Wi-Fi 6, Bluetooth 5.0
- 2023: Wi-Fi 6E, Bluetooth 5.3

### Connectivity

- Thunderbolt 4/5 USB-C ports
- USB-A ports
- HDMI 2.0/2.1
- 10Gb Ethernet
- 3.5 mm headphone jack
- SDXC card slot

### Power

- 2022: 370 W
- 2023: 480 W

### Dimensions
- 3.7 in × 7.7 in × 7.7 in

### Weight

- Max: ~5.9 lb
- Ultra: ~7.9–8.0 lb

### Emissions

- Varies by model and configuration (e.g., 262–382 kg CO₂e)

## Software and Operating Systems

All Mac Studio models ship with macOS, starting with **macOS Monterey**.

### Supported macOS Releases

| OS Release | 2022 | 2023 | 2025 |
|------------|------|------|------|
| 12 Monterey | 12.2 | — | — |
| 13 Ventura | Yes | 13.4 | — |
| 14 Sonoma | Yes | Yes | — |
| 15 Sequoia | Yes | Yes | 15.2 |
| 26 Tahoe | Yes | Yes | Yes |

