<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173_Fall2025/blob/main/F25_Class_05_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* **Part 5.1: Introduction to Hugging Face**
* Part 5.2: Hugging Face Tokenizers
* Part 5.3: Hugging Face Datasets
* Part 5.4: Training Hugging Face models

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

Mounted at /content/drive
Note: Using Google CoLab
david.senseman@gmail.com


Make sure your GMAIL address is included as the last line in the output above.

# **Introduction to Hugging Face**

**Hugging Face** is a company renowned for its pioneering work in the realm of Natural Language Processing (NLP). Established in 2016, Hugging Face has become synonymous with making cutting-edge NLP technologies more accessible to developers, researchers, and organizations. The company's commitment to open-source development has fostered a vibrant community, which plays a pivotal role in the rapid advancements of the field.

Central to Hugging Face's acclaim is the Transformers library. This Python library provides an extensive collection of pre-trained models that span a wide range of NLP tasks, from text classification and tokenization to language modeling and translation. Some of the most groundbreaking models like BERT, GPT-2, T5, and RoBERTa can be effortlessly accessed and deployed through this library.

A notable feature of the Transformers library is its user-friendly API. With just a few lines of code, one can:

* Load a pre-trained model.
* Tokenize input text.
* Obtain model predictions or embeddings.
* Fine-tune the model on a specific task.

This ease of use, combined with comprehensive documentation and tutorials, makes it an invaluable tool for both NLP newcomers and seasoned professionals.

Another significant contribution from Hugging Face is the Model Hub, a platform where researchers and developers can share, discover, and use NLP models. The hub promotes collaboration, ensuring that state-of-the-art models are easily available to the wider community. It's not uncommon to see models from recent research papers promptly available on the hub, ready for real-world applications.

Tokenization, the process of converting text into tokens, is a fundamental step in NLP. Recognizing its importance, Hugging Face introduced the Tokenizers library, which provides a fast and efficient way to tokenize massive datasets without compromising on accuracy. Its compatibility with the Transformers library ensures seamless integration between tokenization and modeling.

* **Democratization of NLP:** By making advanced models and tools accessible, Hugging Face has democratized NLP, enabling even small teams or individual developers to harness the power of state-of-the-art models.

* **Rapid Prototyping:** The ease with which one can deploy models means that ideas can be tested and iterated upon swiftly, accelerating the pace of NLP advancements.

* **Community and Collaboration:** The open-source ethos of Hugging Face has cultivated a community that collaborates, contributes, and ensures that the field remains vibrant and progressive.

In the ever-evolving landscape of NLP, Hugging Face stands out for its innovation, accessibility, and collaboration. Whether you are a researcher pushing the boundaries of what's possible, a developer aiming to integrate NLP into an application, or an enthusiast eager to learn, Hugging Face provides the tools and resources to realize those ambitions. As we delve deeper into this module, we will frequently use the Hugging Face platform to illustrate concepts, implement solutions, and explore the possibilities of NLP.



## **Using Python with Hugging Face**

Transformers have become a mainstay of natural language processing. This module will examine the [Hugging Face](https://huggingface.co/) Python library for natural language processing, bringing together pretrained transformers, data sets, tokenizers, and other elements. Through the Hugging Face API, you can quickly begin using sentiment analysis, entity recognition, language translation, summarization, and text generation.

Colab does not install Hugging Face by default. Whether you install Hugging Face directly on a local computer or use it through Colab, the following commands will install the library.

Normally, when you use `pip` to install a package, a lot of information is printed out to show what subpackages were installed. We have prevented this by adding the text `> /dev/null` after the `!pip install <package>` command. Technically, this code redirects the standard output, which would normally appear in your Colab notebook, to a device called `/dev/null`. `/dev/null` is a special file that discards all data written to it. It is often used as a "black hole" for unwanted output, such as the output of commands that are meant to be silent. Error messages are written to a separate stream, so they still appear.
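
The same redirection idea can be sketched in pure Python. The example below is only an illustration: `noisy_install_step` is a hypothetical stand-in for a chatty command, and `os.devnull` is Python's portable name for the null device.

```python
import contextlib
import os


def noisy_install_step():
    """Hypothetical stand-in for a chatty command such as `pip install`."""
    print("Collecting some-package ...")
    print("Successfully installed some-package")
    return "done"


# Send everything printed inside the block to the null device,
# the Python analogue of appending `> /dev/null` to a shell command.
with open(os.devnull, "w") as null_device, contextlib.redirect_stdout(null_device):
    status = noisy_install_step()

# The work still happened; only the printed output was discarded.
print(status)
```

Only the final `print(status)` is visible, because it runs after the redirection has ended.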


In [None]:
# Install Hugging Face libraries

!pip install transformers > /dev/null
!pip install transformers[sentencepiece] > /dev/null

Now that we have Hugging Face installed, the following sections will demonstrate how to apply Hugging Face to a variety of everyday tasks. After this introduction, the remainder of this module will take a deeper look at several specific NLP tasks performed with Hugging Face.

## Sentiment Analysis

**Sentiment analysis** is a subfield of natural language processing (NLP) that focuses on identifying and categorizing the emotional tone or sentiment of a piece of text, such as a review, a social media post, or an article. The task of sentiment analysis involves analyzing the words and phrases in a given text to determine whether they are positive, negative, or neutral in nature, and assigning a numerical score or label to the overall sentiment of the text.

Sentiment analysis can be applied to various domains such as social media monitoring, customer feedback analysis, product review analysis, and advertising campaign evaluation. It can help organizations to identify trends and patterns in customer opinion, detect fake reviews, and optimize their marketing strategies by understanding the emotions of their target audience.

### What is sentiment analysis?

The process of sentiment analysis typically involves several steps, including:

1. Text Preprocessing: This step involves cleaning and normalizing the text data to prepare it for analysis. This may include removing stop words, punctuation, and converting all text to lowercase.
2. Tokenization: This step involves breaking down the text into smaller units called tokens, which can be individual words or phrases.
3. Sentiment Detection: This step involves analyzing the sentiment of each token in the text using techniques such as machine learning, deep learning, or rule-based approaches.
4. Aggregation: This step involves combining the sentiment scores of all tokens to obtain the overall sentiment score of the text.
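
The four steps above can be sketched with a toy rule-based analyzer. This is a minimal illustration under stated assumptions: the `LEXICON` scores and `STOP_WORDS` set are invented for the example, and the transformer models used later in this lesson learn their behavior from data rather than from a hand-written word list.

```python
import string

# Hypothetical, deliberately tiny lexicon: word -> sentiment score.
LEXICON = {"lovely": 1.0, "fair": 0.5, "darling": 0.5, "rough": -0.5, "death": -1.0}
STOP_WORDS = {"the", "a", "of", "and", "to", "i", "do"}


def preprocess(text):
    """Step 1: lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Step 2: split on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def score_tokens(tokens):
    """Step 3: look up each token; unknown words count as neutral (0.0)."""
    return [LEXICON.get(tok, 0.0) for tok in tokens]


def aggregate(scores):
    """Step 4: average the token scores and map the mean to a label."""
    mean = sum(scores) / len(scores) if scores else 0.0
    if mean > 0:
        label = "POSITIVE"
    elif mean < 0:
        label = "NEGATIVE"
    else:
        label = "NEUTRAL"
    return label, mean


line = "Rough winds do shake the darling buds of May,"
tokens = tokenize(preprocess(line))
label, mean = aggregate(score_tokens(tokens))
print(tokens, label, mean)
```

In this toy lexicon the negative score for "rough" and the positive score for "darling" cancel, so the line comes out as `NEUTRAL`. That shows how simple aggregation can hide mixed sentiment.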

Sentiment analysis can be performed using various techniques such as:

1. **Rule-Based Approach:** This approach involves defining a set of rules to classify words or phrases as positive, negative, or neutral based on their meanings and context.
2. **Machine Learning:** This approach involves training a machine learning model on a dataset of labeled text to learn the patterns and relationships between words and sentiment.
3. **Deep Learning:** This approach involves using deep neural networks to analyze the complex patterns in text data and learn the relationships between words, phrases, and sentiment.

Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to identify the tone of written text. Passages of written text can be classified into simple binary states of positive or negative tone. More advanced sentiment analysis might classify text into additional categories: sadness, joy, love, anger, fear, or surprise.


### Example 1 - Step 1: Load Text for Sentiment Analysis

To demonstrate sentiment analysis, we begin by loading Shakespeare's 18th sonnet:

> Shall I compare thee to a summer's day?  
> Thou art more lovely and more temperate:  
> Rough winds do shake the darling buds of May,  
> And summer's lease hath all too short a date:  
> Sometime too hot the eye of heaven shines,  
> And often is his gold complexion dimm'd;  
> And every fair from fair sometimes declines,  
> By chance or nature's changing course untrimm'd;  
> But thy eternal summer shall not fade  
> Nor lose possession of that fair thou owest;  
> Nor shall Death brag thou wander'st in his shade,  
> When in eternal lines to time thou growest.

What do you think Shakespeare's sentiment was when he wrote this sonnet? In particular, is Shakespeare being **positive** or **negative**?

Let's find out using **sentiment analysis**.

In Step 1 we read the text file `sonnet_18.txt` from the course file server using the `urlopen()` function. The text is stored in a variable called `EG_text` ("Example" text).

In [None]:
# Example 1 - Step 1: Load text

from urllib.request import urlopen

# Read sample text, a poem
URL = "https://biologicslab.co/BIO1173/data/sonnet_18.txt"
with urlopen(URL) as f:  # the context manager closes the connection
    EG_text = f.read().decode("utf-8")

# Print out text
print(EG_text)

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometimes declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest.


Usually, you have to preprocess text into embeddings or other vector forms before presentation to a neural network. Hugging Face provides a pipeline that simplifies this process greatly. The pipeline allows you to pass regular Python strings to the transformers and return standard Python values.

We begin by loading a text-classification model. We can specify which model to use, by passing the model parameter, such as:

```
pipe = pipeline(model="roberta-large-mnli")
```

**RoBERTa-large-MNLI** is a version of the RoBERTa large model, a transformer-based language model developed by Facebook AI, that has been fine-tuned on the Multi-Genre Natural Language Inference (MNLI) corpus. RoBERTa itself is an optimized version of BERT (Bidirectional Encoder Representations from Transformers) and is trained on a large corpus of English text using a masked language modeling (MLM) objective.

The code in Example 1 - Step 2 loads a model pipeline and a model for sentiment analysis.



### Example 1 - Step 2: Load Model Pipeline

The code in the cell below uses the `transformers` library to create a text-classification pipeline. The `pipeline()` function takes a task name, here `"text-classification"`, and a model name, here `"roberta-large-mnli"`. This model is a variant of RoBERTa that has been fine-tuned for natural language inference (NLI), so its output labels are `CONTRADICTION`, `NEUTRAL`, and `ENTAILMENT` rather than positive and negative.

The `pipeline()` function returns a `TextClassificationPipeline` object, part of a high-level API for interacting with NLP models. The object handles tokenization, runs the model, and converts the raw model output into labels and scores.

To classify text, call the pipeline object directly with a Python string, as in `classifier(EG_text)`. The call returns a list of dictionaries, each holding a `label` and a `score`.

In [None]:
# Example 1 - Step 2: Load model pipeline

import pandas as pd
from transformers import pipeline

# Specify the model for the pipeline
model_name = "roberta-large-mnli"

# Create classification pipeline
classifier = pipeline("text-classification", model=model_name)

# Verify the pipeline
print(classifier)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical 

<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x789456849a50>


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image01F.png)



### Example 1 - Step 3: Run Sentiment Analysis on Text

We can now display the sentiment analysis results with a Pandas dataframe.

In [None]:
# Example 1 - Step 3: Run sentiment analysis on text

# Create variable to hold sentiment analysis
sentiment_outputs = classifier(EG_text)

# Display output in a Pandas DataFrame
pd.DataFrame(sentiment_outputs)


|   | label | score |
|---|-------|-------|
| 0 | NEUTRAL | 0.679497 |


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image03F.png)

As you can see, the poem was considered `NEUTRAL` with a `score` of 0.68.
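
The pipeline result is an ordinary Python list of dictionaries, so you can read it without Pandas as well. The sketch below does not call the model; it reuses the values from the output above to show the structure.

```python
# Values copied from the notebook output above; a real call to
# classifier(EG_text) returns a list with this same structure.
sentiment_outputs = [{"label": "NEUTRAL", "score": 0.679497}]

# Each element is a plain dict, so ordinary indexing works.
top = sentiment_outputs[0]
summary = f"{top['label']}: {top['score']:.2f}"
print(summary)
```

Rounding the score to two decimal places gives the 0.68 quoted in the text.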

Shakespeare's Sonnet 18 is generally considered to have positive and admiring sentiment. The sonnet praises the subject, comparing them to a summer's day and highlighting their eternal beauty and the impermanence of natural beauty.

So why did we get only a `NEUTRAL` rating?

Part of the answer lies in the model itself. RoBERTa-large-MNLI was fine-tuned for natural language inference, and its three labels (`CONTRADICTION`, `NEUTRAL`, and `ENTAILMENT`) describe the logical relationship between sentences rather than emotional tone, so `NEUTRAL` here is not a direct measurement of sentiment. In addition, Shakespeare's Sonnet 18 ("Shall I compare thee to a summer's day?") is a complex and nuanced piece of literature, which makes it challenging for any AI model to classify accurately.

Several factors could have contributed to the neutral sentiment classification:

1. **Language and Style:** Shakespeare's language and poetic style are intricate, with a mix of positive imagery and more neutral or descriptive language. This complexity might have led the model to classify the overall sentiment as neutral.

2. **Ambiguity:** The sonnet contains elements of praise and admiration, but it also discusses the transience of beauty and the passage of time, which might have contributed to a more balanced or neutral sentiment score.

3. **Contextual Understanding:** AI models might struggle with understanding the full context and emotional depth of the text. They might miss subtleties that human readers would easily pick up on.

The score of 0.6795 is the probability the model assigns to the `NEUTRAL` label, so the remaining 0.32 or so is spread across the other labels. The model is only moderately confident in this classification, and the result does not mean the text lacks positive sentiment.
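
To see where such a score comes from, the sketch below converts raw model outputs (logits) into probabilities with a softmax, which is how a multi-label text-classification pipeline normally produces its `score`. The label names are those of the `roberta-large-mnli` checkpoint, but the logits are hypothetical values chosen for illustration, not the model's actual output for the sonnet.

```python
import math


def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for illustration only; not the model's actual values.
labels = ["CONTRADICTION", "NEUTRAL", "ENTAILMENT"]
logits = [-0.5, 1.2, 0.1]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 3))
```

The winning label's probability is always less than 1 because some probability is assigned to every label.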

### **Exercise 1 - Step 1: Load Text for Sentiment Analysis**

In the cell below, load Shakespeare's **Sonnet 66** from the course file server and store it in a variable called `EX_text`:

> Tired with all these, for restful death I cry:  
> As, to behold desert a beggar born,  
> And needy nothing trimmed in jollity,  
> And purest faith unhappily forsworn,  
> And gilded honor shamefully misplaced,  
> And maiden virtue rudely strumpeted,  
> And right perfection wrongfully disgraced,  
> And strength by limping sway disablèd,  
> And art made tongue-tied by authority,  
> And folly, doctor-like, controlling skill,   
> And simple truth miscalled simplicity,  
> And captive good attending captain ill.  
>   Tired with all these, from these would I be gone,  
>   Save that, to die, I leave my love alone.  




In [None]:
# Insert your code for Exercise 1 - Step 1 here

from urllib.request import urlopen

# Read sample text, a poem
URL = "https://biologicslab.co/BIO1173/data/sonnet_66.txt"
with urlopen(URL) as f:  # the context manager closes the connection
    EX_text = f.read().decode("utf-8")

# Print
print(EX_text)

Tired with all these, for restful death I cry:
As, to behold desert a beggar born,
And needy nothing trimmed in jollity,
And purest faith unhappily forsworn,
And gilded honor shamefully misplaced,
And maiden virtue rudely strumpeted,
And right perfection wrongfully disgraced,
And strength by limping sway disablèd,
And art made tongue-tied by authority,
And folly, doctor-like, controlling skill,
And simple truth miscalled simplicity,
And captive good attending captain ill.
 Tired with all these, from these would I be gone,
 Save that, to die, I leave my love alone.


We begin by loading a text-classification model. When no model is specified, Hugging Face automatically chooses a default network from the Hugging Face hub named:

* distilbert-base-uncased-finetuned-sst-2-english

To specify the model to use, pass the model parameter, such as:

```
pipe = pipeline(model="roberta-large-mnli")
```

The following code loads a model pipeline and a model for sentiment analysis.



### **Exercise 1 - Step 2: Load Model Pipeline**

The code in the cell below uses the `transformers` library to create a text-classification pipeline, just as in Example 1 - Step 2, but with a different model.

For this exercise, specify the model `distilbert-base-uncased-finetuned-sst-2-english`. DistilBERT is a smaller, faster version of BERT produced by knowledge distillation. This checkpoint was fine-tuned on the Stanford Sentiment Treebank (SST-2) for binary sentiment classification, so it labels text as either `POSITIVE` or `NEGATIVE`.


In [None]:
# Insert your code for Exercise 1 - Step 2 here

import pandas as pd
from transformers import pipeline

# Specify the model for the pipeline
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Create a single text classification pipeline with the specified model
classifier = pipeline("text-classification", model=model_name)

# Verify the pipeline
print(classifier)


Device set to use cpu


<transformers.pipelines.text_classification.TextClassificationPipeline object at 0x7893e10fac10>


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image02F.png)



### **Exercise 1 - Step 3: Run Sentiment Analysis on Text**

We can now display the sentiment analysis results with a Pandas dataframe.

In [None]:
# Insert your code for Exercise 1 - Step 3 here

# Create variable to hold sentiment analysis
sentiment_outputs = classifier(EX_text)


# Display output in a Pandas DataFrame
pd.DataFrame(sentiment_outputs)


|   | label | score |
|---|-------|-------|
| 0 | NEGATIVE | 0.995461 |


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image04F.png)

Shakespeare's Sonnet 66 indeed carries a much darker tone compared to Sonnet 18. Here are a few reasons why the sentiment analysis model might have given it a strong negative sentiment classification:

1. **Tone and Themes:** Sonnet 66 delves into themes of despair, disillusionment, and a profound sense of weariness with the world's injustices. The recurring motifs of frustration and longing for rest likely contributed significantly to the negative sentiment score.

2. **Language and Imagery:** The sonnet is filled with negative imagery and phrases expressing dissatisfaction and lamentation. Such language naturally skews towards a negative sentiment classification by the model.

3. **Emotional Weight:** The emotional heaviness and the poet’s voice reflecting societal corruption and personal sorrow might resonate strongly with the AI model’s parameters for negative sentiment.

The score of `0.995` indicates an almost complete certainty by the model that the sentiment is negative, which aligns well with the sonnet’s content and tone.

## **Entity Tagging**

**Entity tagging**, also known as named entity recognition (NER), is a process in natural language processing (NLP) where entities such as names of people, organizations, locations, dates, and other specific items are identified and classified within a text. Here’s how it works:

1. **Identification:** The model scans the text to find mentions of entities. For example, in the sentence "Microsoft was founded by Bill Gates in 1975," the entities are "Microsoft," "Bill Gates," and "1975."

2. **Classification:** Once identified, the entities are classified into predefined categories, such as:

- **Person:** Names of individuals (e.g., "Bill Gates")

- **Organization:** Names of companies, institutions, etc. (e.g., "Microsoft")

- **Location:** Names of places (e.g., "Seattle")

- **Date/Time:** Specific dates or times (e.g., "1975")

- **Others:** Such as events, quantities, monetary values, etc.

Entity tagging is widely used in various applications like information extraction, search engines, and enhancing the accuracy of machine translation. It helps in structuring unstructured text data, making it easier to analyze and extract meaningful insights.

The default model used in the examples below was fine-tuned on the CoNLL-2003 dataset and tags parts of the source text as one of the following entity types:

* Location (LOC)
* Organizations (ORG)
* Person (PER)
* Miscellaneous (MISC)
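
Many NER models assign a tag to every token and then merge neighboring tokens into whole entities. The sketch below illustrates that grouping idea with the common `B-` (begin) and `I-` (inside) tag prefixes. The per-token tags are written by hand for illustration, and `group_entities` is a hypothetical helper that simplifies the process; it is not the library's implementation.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one entity.

    tagged_tokens is a list of (word, tag) pairs, where tag is "O" for
    "outside any entity" or a type prefixed with B- (begin) or I- (inside).
    """
    entities = []
    current = None  # the entity being built, or None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            current["word"] += " " + word  # extend the running entity
        else:
            current = {"entity_group": entity_type, "word": word}
            entities.append(current)
    return entities


# Hypothetical per-token tags, written by hand for illustration.
tagged = [
    ("James", "B-PER"), ("Watson", "I-PER"), ("lived", "O"), ("in", "O"),
    ("the", "O"), ("United", "B-LOC"), ("States", "I-LOC"),
]
entities = group_entities(tagged)
print(entities)
```

Grouping turns seven token-level tags into two entities, one `PER` and one `LOC`.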

The code in Example 2 requests a "named entity recognizer" (ner) and processes the specified text.

### Example 2 - Step 1: Entity Tagging

The code in the cell below creates a variable, `EG_text2`, containing a sentence about the famous molecular biologist James Watson and the United States. It then creates a named entity recognition pipeline called `tagger`.

In [None]:
# Example 2 - Step 1: Entity tagging

EG_text2 = "James Watson was a molecular biologist who lived in the United States."

tagger = pipeline("ner", aggregation_strategy="simple")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image07F.png)

### Example 2 - Step 2: Review the Results

We similarly view the results as a Pandas data frame. As you can see, the person (PER) James Watson and the location (LOC) United States are recognized.

In [None]:
# Example 2 - Step 2: Review the results
outputs = tagger(EG_text2)
# Display output in a Pandas DataFrame
pd.DataFrame(outputs)


|   | entity_group | score | word | start | end |
|---|--------------|-------|------|-------|-----|
| 0 | PER | 0.999602 | James Watson | 0 | 12 |
| 1 | LOC | 0.999669 | United States | 56 | 69 |


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image06F.png)

Your entity tagging pipeline has successfully identified and classified "James Watson" as a person and "United States" as a location with very high confidence.
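
The `start` and `end` columns are character offsets into the original string. As a small check, the sketch below slices `EG_text2` with the offsets reported in the output above; it does not call the model.

```python
EG_text2 = "James Watson was a molecular biologist who lived in the United States."

# (entity_group, start, end) triples copied from the output above.
spans = [("PER", 0, 12), ("LOC", 56, 69)]

for entity_group, start, end in spans:
    # Python slicing excludes the end index, matching the reported offsets.
    print(entity_group, repr(EG_text2[start:end]))
```

Offsets like these are useful when you need to highlight or replace entities in the source text.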

### **Exercise 2 - Step 1: Entity Tagging**

The code in the cell below creates a variable, `EX_text2`, containing a sentence about the biologist David Senseman and the United States. It then creates a named entity recognition pipeline called `tagger`.

In [None]:
# Insert your code for Exercise 2 - Step 1 here

EX_text2 = "David Senseman is a biologist who lives in the United States."

tagger = pipeline("ner", aggregation_strategy="simple")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image07F.png)

### **Exercise 2 - Step 2: Review the Results**

We similarly view the results as a Pandas data frame. As you can see, the person (PER) David Senseman and the location (LOC) United States are recognized.

In [None]:
# Insert your code for Exercise 2 - Step 2 here
outputs = tagger(EX_text2)
# Display output in a Pandas DataFrame
pd.DataFrame(outputs)


|   | entity_group | score | word | start | end |
|---|--------------|-------|------|-------|-----|
| 0 | PER | 0.999122 | David Senseman | 0 | 14 |
| 1 | LOC | 0.999606 | United States | 47 | 60 |


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image09F.png)

Your entity tagging pipeline has successfully identified and classified "David Senseman" as a person and "United States" as a location with very high confidence.

## **Question Answering**

**Question Answering (QA)** is a powerful NLP task that involves providing precise answers to questions based on a given reference text. This capability is useful in applications such as information retrieval and educational tools.

Here's a quick rundown of how QA works:

1. **Reference Text:** A passage or document from which the answer can be extracted.

2. **Question:** A specific query related to the reference text.

3. **Answer Extraction:** The NLP model processes the reference text to find the exact segment that answers the question.
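
Extractive QA models typically score every token as a possible start and a possible end of the answer, then pick the best-scoring span. The sketch below illustrates that selection step only. The scores are invented for illustration, and `best_span` is a hypothetical helper rather than a function from the `transformers` library.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) token indices maximizing start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if start_score + end_scores[j] > best_score:
                best, best_score = (i, j), start_score + end_scores[j]
    return best


tokens = ["But", "thy", "eternal", "summer", "shall", "not", "fade"]
# Hypothetical scores, invented for illustration; a real model computes one
# start score and one end score per token.
start_scores = [0.1, 1.5, 2.0, 0.3, 0.0, 0.0, 0.2]
end_scores = [0.0, 0.1, 0.4, 2.5, 0.2, 0.1, 0.6]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i : j + 1])
print(answer)
```

With these made-up scores the sketch selects "eternal summer", because that span has the highest combined start and end score.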

Hugging Face provides a ready-made pipeline for this task. We load a question-answering model with the following code.

### Example 3 - Step 1: Question Answering (QA)

The first step in QA is to set up the pipeline and define the question.

In [None]:
# Example 3 - Step 1: QA

# Setup pipeline
reader = pipeline("question-answering")

# Setup question
EG_question = "The brain is the last and grandest biological frontier, \
              the most complex thing we have yet discovered in our universe. \
              It contains hundreds of billions of cells interlinked through \
              trillions of connections. The brain boggles the mind."


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image10F.png)
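The log message above warns that no model was supplied, so the pipeline fell back to a default. One way to avoid relying on defaults is to pass the model name explicitly. The following is a minimal sketch, not part of the original lesson: the dictionary `DEFAULT_MODELS` and the helper `make_pipeline` are hypothetical names, and the model names are copied from the "No model was supplied" messages printed in this lesson.

```python
# Model names copied from the "No model was supplied" log messages
# printed in this lesson (hypothetical lookup table).
DEFAULT_MODELS = {
    "question-answering": "distilbert/distilbert-base-cased-distilled-squad",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "text-generation": "openai-community/gpt2",
}


def make_pipeline(task: str):
    """Build a pipeline with an explicit model name (hypothetical helper)."""
    # Imported lazily so the table above can be inspected
    # even when `transformers` is not installed.
    from transformers import pipeline

    return pipeline(task, model=DEFAULT_MODELS[task])


# Usage (downloads the model on first run):
# reader = make_pipeline("question-answering")
```

Naming the model makes the notebook less likely to change behavior if the library's default for a task changes in a later release.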

### Example 3 - Step 2: Generate QA Result

The code in the cell below uses `Hugging Face` to return an answer to our question.

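For this example, we pass `EG_question` as the question and `EG_text2` as the reference text (the context). The pipeline returns the span of the context that it scores as the most likely answer, along with a confidence score and the start and end positions of that span. Because `EG_question` holds a quotation rather than an actual question, expect a low confidence score.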


In [None]:
# Example 3 - Step 2: Generate QA result

# Send question to Hugging Face
outputs = reader(question=EG_question, context=EG_text2)

# View results in a pandas DataFrame
pd.DataFrame([outputs])


Unnamed: 0,score,start,end,answer
0,0.109873,0,69,James Watson was a molecular biologist who liv...
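
The `start` and `end` columns in the output above describe where the answer sits in the reference text. The sketch below illustrates this idea with plain Python, assuming that `start` and `end` are character offsets into the context so that slicing the context recovers the answer. The context string, the `result` dictionary, and the helper `answer_from_span` are hypothetical and are not produced by a model.

```python
def answer_from_span(context: str, start: int, end: int) -> str:
    """Return the slice of the context between two character offsets."""
    return context[start:end]


# Hypothetical reference text, not taken from the lesson
context = "Mitochondria produce most of the ATP used by the cell."

# Hypothetical result with the same keys as the pipeline output
# (the score is omitted because no model was run here)
start = context.index("ATP")
result = {"start": start, "end": start + len("ATP"), "answer": "ATP"}

# Slicing the context with the offsets recovers the answer text
recovered = answer_from_span(context, result["start"], result["end"])
print(recovered)
```

This is why extractive QA models can only return text that already appears in the reference passage.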


## **Language Translation**

**Language Translation** is a fascinating and practical application of natural language processing (NLP). It involves converting text from one language into another while preserving the original meaning, context, and nuances.

Hugging Face provides powerful pre-trained models for language translation.

## List of Translators Used by Hugging Face

Hugging Face offers a variety of translation models for different language pairs. Here are some popular ones:

1. **MarianMT Models**: These models support translation between multiple language pairs. For example:
   - `Helsinki-NLP/opus-mt-en-de`: English to German
   - `Helsinki-NLP/opus-mt-en-fr`: English to French
   - `Helsinki-NLP/opus-mt-en-es`: English to Spanish

2. **T5 Models**: These models can be fine-tuned for translation tasks. For example:
   - `t5-small`: A smaller version of the T5 model that can be fine-tuned for translation.
   - `t5-large`: A larger version of the T5 model with more parameters for better performance.

3. **mBART Models**: These models are designed for multilingual translation tasks. For example:
   - `facebook/mbart-large-50-many-to-many-mmt`: A multilingual model that supports translation between 50 languages.

4. **M2M100 Models**: These models are designed for many-to-many translation tasks. For example:
   - `facebook/m2m100_418M`: A model that supports translation between 100 languages.

You can find more translation models and explore their capabilities on the [Hugging Face Models page](https://huggingface.co/models?search=translate).
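
The MarianMT model names listed above follow a visible pattern: `Helsinki-NLP/opus-mt-` followed by the source and target language codes. The sketch below builds a model name from that pattern. The helpers `opus_mt_model_name` and `make_translator` are hypothetical, and the pattern is an assumption based on the three names in the list; not every language pair is guaranteed to have a published model.

```python
def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build an OPUS-MT model name from two language codes (assumed pattern)."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def make_translator(src: str, tgt: str):
    """Create a translation pipeline for a language pair (hypothetical helper)."""
    # Imported lazily; requires the `transformers` package
    from transformers import pipeline

    return pipeline("translation", model=opus_mt_model_name(src, tgt))


# Print the model names for the three pairs listed above
for target in ("de", "fr", "es"):
    print(opus_mt_model_name("en", target))

# Usage (downloads the model on first run):
# translator = make_translator("en", "fr")
```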


### Install `sacremoses` package

Before we can start translating, we need to install the `sacremoses` package.
The `sacremoses` package is a Python port of the Moses tokenizer, truecaser, and normalizer. It provides tools for text preprocessing, which are essential for various natural language processing (NLP) tasks. Here are some key features of sacremoses:

**Tokenizer:** The Moses tokenizer splits text into tokens (words, punctuation, etc.) while handling special characters and preserving the original meaning. For example, it can tokenize sentences with unusual symbols and punctuation.

**Detokenizer:** The Moses detokenizer reverses the tokenization process, converting tokens back into a coherent sentence.

**Truecaser:** The Moses truecaser adjusts the casing of words in a text to match typical usage patterns. It can be trained on a large corpus to learn the correct casing for words.

Google Colab doesn't normally include the `sacremoses` package but you can add it by running the following code cell.

In [None]:
# Install sacremoses package

!pip install sacremoses

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.5/897.5 kB 11.6 MB/s eta 0:00:00
Installing collected packages: sacremoses
Successfully installed sacremoses-0.1.1


### Example 4 - Step 1: Select Translator

In this example we will use the `Helsinki-NLP/opus-mt-en-de` model, which translates text from English to German. Developed by the Language Technology Research Group at the University of Helsinki, this model is part of the OPUS-MT project, which provides open translation services for various language pairs.

In [None]:
# Example 4 - Step 1: Setup translator

# Select translator
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")


Device set to use cpu


### Example 4 - Step 2: Translate Text

The following code translates James Watson's quotation (`EG_question`) into German.


In [None]:
# Example 4 - Step 2 Translate text

import textwrap
from transformers import pipeline

# Normalize the text by removing extra spaces
normalized_text = ' '.join(EG_question.split())

# Print the normalized text with proper wrapping
wrapped_text = textwrap.fill(normalized_text, width=50)
print(wrapped_text)

# Print spacer
print("\n")

# Perform translation
outputs = translator(EG_question, clean_up_tokenization_spaces=True, min_length=100)

# Get the translated text
translated_text = outputs[0]['translation_text']

# Strip trailing dots (this also removes the final full stop)
cleaned_text = translated_text.rstrip('.')

# Break the cleaned text into separate lines
wrapped_text = textwrap.fill(cleaned_text, width=50)

print(wrapped_text)


The brain is the last and grandest biological
frontier, the most complex thing we have yet
discovered in our universe. It contains hundreds
of billions of cells interlinked through trillions
of connections. The brain boggles the mind.


Das Gehirn ist die letzte und großartigste
biologische Grenze, die komplexeste Sache, die wir
noch in unserem Universum entdeckt haben. Es
enthält Hunderte von Milliarden Zellen, die durch
Billionen von Verbindungen miteinander verbunden
sind. Das Gehirn verdreht den Geist


If the code is correct, you should see the following output:

~~~text
The brain is the last and grandest biological
frontier, the most complex thing we have yet
discovered in our universe. It contains hundreds
of billions of cells interlinked through trillions
of connections. The brain boggles the mind.


Das Gehirn ist die letzte und großartigste
biologische Grenze, die komplexeste Sache, die wir
noch in unserem Universum entdeckt haben. Es
enthält Hunderte von Milliarden Zellen, die durch
Billionen von Verbindungen miteinander verbunden
sind. Das Gehirn verdreht den Geist
~~~
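
The normalization step in the cell above matters because `EG_question` was defined with backslash line continuations inside a string. A backslash at the end of a line removes the line break, but the indentation at the start of the next line stays in the string. The sketch below shows this with a shortened version of the quotation; the variable names `raw`, `normalized`, and `wrapped` are hypothetical.

```python
import textwrap

# A backslash continuation keeps the next line's leading spaces in the string
raw = "The brain is the last and grandest biological frontier, \
              the most complex thing we have yet discovered."

# split() breaks on any run of whitespace; join() puts back single spaces
normalized = " ".join(raw.split())

# Wrap the cleaned text to a fixed width for printing
wrapped = textwrap.fill(normalized, width=50)

print(repr(raw))
print(repr(normalized))
print(wrapped)
```

Comparing the two `repr` lines shows the run of spaces in `raw` collapsed to a single space in `normalized`.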

## **Summarization**

**Summarization** is a key NLP task that involves condensing a longer text into a shorter version while preserving the main ideas and important details. There are two primary types of summarization:

1. **Extractive Summarization:** This method involves selecting and extracting key sentences or phrases from the original text. The selected sentences are then combined to form a summary. It's like creating a highlight reel of the text.

2. **Abstractive Summarization:** This method involves generating new sentences that capture the essence of the original text. It requires the model to understand the context and rephrase the content in a concise manner. This approach is more challenging but can produce more natural and coherent summaries.
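
To make the extractive approach concrete, here is a minimal sketch in plain Python. It scores each sentence by the summed frequency of its words across the whole text and keeps the highest-scoring sentences in their original order. This is a deliberately simple illustration, not the method used by any Hugging Face model; the function `extractive_summary` and the sample text are hypothetical, and the scoring favors longer sentences because it does not remove common words or normalize by length.

```python
import re
from collections import Counter


def extractive_summary(text: str, n_sentences: int = 1) -> str:
    """Pick the n highest-scoring sentences, kept in original order."""
    # Split into sentences after ., ! or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]

    # Count how often each lowercase word appears in the whole text
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> int:
        # Sum the text-wide frequency of every word in the sentence
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Rank sentence indices by score, then restore document order
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


sample = "Cells divide. Cells divide by mitosis in most cells. Birds sing."
summary = extractive_summary(sample, n_sentences=1)
print(summary)  # prints "Cells divide by mitosis in most cells."
```

Every sentence in the result is copied unchanged from the input, which is the defining property of extractive summarization.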

The example in the next section uses the Hugging Face `summarization` pipeline to condense a short passage into a few sentences.

### Example 5: Summarization

The code in the cell below defines the text to be summarized, `EG_text3`, and sets up the summarization pipeline.

In [None]:
# Example 5 - Step 1: Setup pipeline for summarizer

EG_text3 = """
Hugging Face is a company and an open-source platform that focuses on Natural
Language Processing (NLP) technologies and machine learning models.
We’re on a journey to advance and democratize artificial intelligence
through open source and open science. Our mission is to make AI accessible
to everyone, enabling researchers, developers, and organizations to
build and deploy state-of-the-art models with ease. By fostering a
collaborative community and providing cutting-edge tools,
we aim to accelerate the development and adoption of AI technologies
for the benefit of all.
"""

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cpu


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image11F.png)

### Example 5 - Step 2: Print Summary

The following code summarizes the text in `EG_text3` and prints both the original text and the summary.

In [None]:
# Example 5 - Step 2: Print output

import textwrap
from transformers import pipeline

# Print original text
print("Original Text:")
print(EG_text3)

# Print spacer
print("\nSummary: \n")

# Send text to summarizer
outputs = summarizer(EG_text3, max_length=45, min_length=20,
                     clean_up_tokenization_spaces=True)

# Get the summary text
summary_text = outputs[0]['summary_text']

# Break the summary into separate lines
wrapped_summary = textwrap.fill(summary_text, width=50)

# Print the wrapped summary
print(wrapped_summary)


Original Text:

Hugging Face is a company and an open-source platform that focuses on Natural 
Language Processing (NLP) technologies and machine learning models. 
We’re on a journey to advance and democratize artificial intelligence 
through open source and open science. Our mission is to make AI accessible 
to everyone, enabling researchers, developers, and organizations to 
build and deploy state-of-the-art models with ease. By fostering a 
collaborative community and providing cutting-edge tools, 
we aim to accelerate the development and adoption of AI technologies 
for the benefit of all.


Summary: 

 Hugging Face is an open-source platform that
focuses on Natural Language Processing (NLP)
technologies and machine learning models. We're on
a journey to advance and democratize artificial
intelligence through open source and open


## **Text Generation**

**Text generation** is a fundamental and powerful capability in natural language processing (NLP), and Hugging Face provides several models that excel in this area. Here's why text generation is important and how it is utilized:

1. **Creative Writing:** Text generation models can help authors, poets, and scriptwriters generate new content, brainstorm ideas, or even complete unfinished pieces. This can significantly boost creativity and productivity.

2. **Conversational Agents:** Text generation models are used to create chatbots and virtual assistants that can engage in natural and meaningful conversations with users. These models can provide customer support, answer queries, and even offer companionship.

3. **Content Creation:** Businesses and content creators use text generation to produce articles, social media posts, product descriptions, and more. This helps in maintaining a consistent flow of content and saves time.

4. **Language Translation:** Text generation plays a crucial role in machine translation, where models generate translated text from one language to another, ensuring that the meaning and nuances are preserved.

5. **Summarization:** As mentioned earlier, text generation models can summarize lengthy documents into concise summaries, making it easier to digest large volumes of information quickly.

6. **Code Generation:** Developers use text generation models to assist in coding by generating code snippets, documenting code, or even completing functions based on prompts.

7. **Educational Tools:** Text generation can be used to create educational content, generate quizzes, provide explanations, and even tutor students in various subjects.

Hugging Face offers several powerful models for text generation, such as GPT-2 and T5. These models leverage large-scale pre-training on diverse datasets to understand and generate human-like text.

### Example 6 - Step 1: Setup Generator Pipeline

The code in the cell below sets up the generator pipeline.

In [None]:
# Example 6 - Step 1: Setup generator pipeline

from urllib.request import urlopen

# Setup pipeline
generator = pipeline("text-generation")


No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_1_image12F.png)

### Example 6 - Step 2: Generate Text

The code in the cell below generates additional text that continues Sonnet 18, which is stored in `EG_text`.


In [None]:
# Example 6 - Step 2: Print output

# Print original text
print("\nOriginal Text:---------------------- \n")
print(EG_text)
print("\nGenerated Text:------------------- \n")

outputs = generator(EG_text, max_length=400)
print(outputs[0]['generated_text'])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Original Text:---------------------- 

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometimes declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest.

Generated Text:------------------- 

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometimes declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not

The generated text may continue well beyond the original Sonnet 18, adding new and sometimes repetitive lines. This is common when a generative model extends a prompt one token at a time.
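
The repetition mentioned above can be illustrated with a toy model. The sketch below is not GPT-2: it assumes a simple bigram model that counts which word follows which in a tiny corpus, and then generates text greedily by always choosing the most frequent next word. The functions `build_bigram_model` and `generate_greedy` and the corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_model(text: str):
    """Count, for each word, which words follow it in the text."""
    tokens = text.split()
    model = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        model[current][following] += 1
    return model


def generate_greedy(model, prompt: str, max_new_tokens: int = 8) -> str:
    """Extend the prompt by always picking the most frequent next word."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = model.get(tokens[-1])
        if not followers:
            break  # the last word was never followed by anything
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)


corpus = "the cell divides and the cell grows and the cell divides"
model = build_bigram_model(corpus)
generated = generate_greedy(model, "the", max_new_tokens=8)
print(generated)  # prints "the cell divides and the cell divides and the"
```

Because the toy model always picks the single most likely next word, it falls into a loop. Larger models are far more capable, but a similar tendency is one reason generated text can become repetitive.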

## **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_1.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **NVIDIA**

### **Entrance of Endeavor headquarters building in 2018**

![__](https://upload.wikimedia.org/wikipedia/commons/7/75/2788-2888_San_Tomas_Expwy.jpg)

**Nvidia Corporation** (/ɛnˈvɪdiə/ en-VID-ee-ə) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curtis Priem, it is a software company which designs and supplies graphics processing units (GPUs), application programming interfaces (APIs) for data science and high-performance computing, and system on a chip units (SoCs) for mobile computing and the automotive market. Nvidia is also the dominant supplier of artificial intelligence (AI) hardware and software. Nvidia outsources the manufacturing of the hardware it designs.

Nvidia's professional line of GPUs are used for edge-to-cloud computing and in supercomputers and workstations for applications in fields such as architecture, engineering and construction, media and entertainment, automotive, scientific research, and manufacturing design. Its GeForce line of GPUs are aimed at the consumer market and are used in applications such as video editing, 3D rendering, and PC gaming. With a market share of 80.2% in the second quarter of 2023, Nvidia leads the market for discrete desktop GPUs by a wide margin. The company expanded its presence in the gaming industry with the introduction of the Shield Portable (a handheld game console), Shield Tablet (a gaming tablet), and Shield TV (a digital media player), as well as its cloud gaming service GeForce Now.

In addition to GPU design and outsourcing manufacturing, Nvidia provides the CUDA software platform and API that allows the creation of massively parallel programs which utilize GPUs. They are deployed in supercomputing sites around the world. In the late 2000s, Nvidia had moved into the mobile computing market, where it produces Tegra mobile processors for smartphones and tablets and vehicle navigation and entertainment systems. Its competitors include AMD, Intel,[19] Qualcomm, and AI accelerator companies such as Cerebras and Graphcore. It also makes AI-powered software for audio and video processing (e.g., Nvidia Maxine).

Nvidia's offer to acquire Arm from SoftBank in September 2020 failed to materialize following extended regulatory scrutiny, leading to the termination of the deal in February 2022 in what would have been the largest semiconductor acquisition. In 2023, Nvidia became the seventh public U.S. company to be valued at over \$1 trillion, and the company's valuation has increased rapidly since then as the company became a leader in data center chips with AI capabilities in the midst of the AI boom. In June 2024, for one day, Nvidia overtook Microsoft as the world's most valuable publicly traded company, with a market capitalization of over \$3.3 trillion.

## **History**

**Founding**

Nvidia was founded on April 5, 1993, by Jensen Huang (who, as of 2024, remains CEO), a Taiwanese-American electrical engineer who was previously the director of CoreWare at LSI Logic and a microprocessor designer at AMD; Chris Malachowsky, an engineer who worked at Sun Microsystems; and Curtis Priem, who was previously a senior staff engineer and graphics chip designer at IBM and Sun Microsystems. The three men agreed to start the company in a meeting at a Denny's roadside diner on Berryessa Road in East San Jose.

At the time, Malachowsky and Priem were frustrated with Sun's management and were looking to leave, but Huang was on "firmer ground", in that he was already running his own division at LSI. The three co-founders discussed a vision of the future which was so compelling that Huang decided to leave LSI and become the chief executive officer of their new startup.

In 1993, the three co-founders envisioned that the ideal trajectory for the forthcoming wave of computing would be in the realm of accelerated computing, specifically in graphics-based processing. This path was chosen due to its unique ability to tackle challenges that eluded general-purpose computing methods.[36] As Huang later explained: "We also observed that video games were simultaneously one of the most computationally challenging problems and would have incredibly high sales volume. Those two conditions don’t happen very often. Video games was our killer app — a flywheel to reach large markets funding huge R&D to solve massive computational problems." With \$40,000 in the bank, the company was born. The company subsequently received \$20 million of venture capital funding from Sequoia Capital, Sutter Hill Ventures and others.

During the late 1990s, Nvidia was one of 70 startup companies chasing the idea that graphics acceleration for video games was the path to the future. Only two survived: Nvidia and ATI Technologies, the latter of which merged into AMD.

Nvidia initially had no name and the co-founders named all their files NV, as in "next version". The need to incorporate the company prompted the co-founders to review all words with those two letters. At one point, Malachowsky and Priem wanted to call the company NVision, but that name was already taken by a manufacturer of toilet paper. Huang suggested the name Nvidia, from "invidia", the Latin word for "envy". The company's original headquarters office was in Sunnyvale, California.

**First graphics accelerator**

Nvidia's first graphics accelerator, the NV1, was designed to process quadrilateral primitives (forward texture mapping), a feature that set it apart from competitors, who preferred triangle primitives. However, when Microsoft introduced the DirectX platform, it chose not to support any other graphics software and announced that its Direct3D API would exclusively support triangles. As a result, the NV1 failed to gain traction in the market.

Nvidia had also entered into a partnership with Sega to supply the graphics chip for the Dreamcast console and worked on the project for about a year. However, Nvidia's technology was already lagging behind competitors. This placed the company in a difficult position: continue working on a chip that was likely doomed to fail or abandon the project, risking financial collapse.

In a pivotal moment, Sega's president, Shoichiro Irimajiri, visited Huang in person to inform him that Sega had decided to choose another vendor for the Dreamcast. However, Irimajiri believed in Nvidia's potential and persuaded Sega’s management to invest \$5 million into the company. Huang later reflected that this funding was all that kept Nvidia afloat, and that Irimajiri's "understanding and generosity gave us six months to live".

In 1996, Huang laid off more than half of Nvidia's employees—thereby reducing headcount from 100 to 40—and focused the company's remaining resources on developing a graphics accelerator product optimized for processing triangle primitives: the RIVA 128. By the time the RIVA 128 was released in August 1997, Nvidia had only enough money left for one month’s payroll. The sense of impending failure became so pervasive that it gave rise to Nvidia's unofficial company motto: "Our company is thirty days from going out of business." Huang began internal presentations to Nvidia staff with those words for many years.

Nvidia sold about a million RIVA 128 units within four months, and used the revenue to fund development of its next generation of products. In 1998, the release of the RIVA TNT helped solidify Nvidia’s reputation as a leader in graphics technology.

**Public company**

Nvidia went public on January 22, 1999. Investing in Nvidia after it had already failed to deliver on its contract turned out to be Irimajiri's best decision as Sega's president. After Irimajiri left Sega in 2000, Sega sold its Nvidia stock for \$15 million.

In late 1999, Nvidia released the GeForce 256 (NV10), its first product expressly marketed as a GPU, which was most notable for introducing onboard transformation and lighting (T&L) to consumer-level 3D hardware. Running at 120 MHz and featuring four-pixel pipelines, it implemented advanced video acceleration, motion compensation, and hardware sub-picture alpha blending. The GeForce outperformed existing products by a wide margin.

Due to the success of its products, Nvidia won the contract to develop the graphics hardware for Microsoft's Xbox game console, which earned Nvidia a \$200 million advance. However, the project took many of its best engineers away from other projects. In the short term this did not matter, and the GeForce2 GTS shipped in the summer of 2000. In December 2000, Nvidia reached an agreement to acquire the intellectual assets of its one-time rival 3dfx, a pioneer in consumer 3D graphics technology leading the field from the mid-1990s until 2000. The acquisition process was finalized in April 2002.

In 2001, Standard & Poor's selected Nvidia to replace the departing Enron in the S&P 500 stock index, meaning that index funds would need to hold Nvidia shares going forward.

In July 2002, Nvidia acquired Exluna for an undisclosed sum. Exluna made software-rendering tools and the personnel were merged into the Cg project. In August 2003, Nvidia acquired MediaQ for approximately US\$70 million. It launched GoForce the following year. On April 22, 2004, Nvidia acquired iReady, a provider of high-performance TCP offload engines and iSCSI controllers. In December 2004, it was announced that Nvidia would assist Sony with the design of the graphics processor (RSX) for the PlayStation 3 game console. On December 14, 2005, Nvidia acquired ULI Electronics, which at the time supplied third-party southbridge parts for chipsets to ATI, Nvidia's competitor. In March 2006, Nvidia acquired Hybrid Graphics. In December 2006, Nvidia, along with its main rival in the graphics industry AMD (which had acquired ATI), received subpoenas from the U.S. Department of Justice regarding possible antitrust violations in the graphics card industry.