<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_05_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Hugging Face
* **Part 5.2: Hugging Face Tokenizers**
* Part 5.3: Hugging Face Datasets
* Part 5.4: Training Hugging Face models

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    Colab = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    Colab = False

You should see output similar to the following, with your GMAIL address on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear, your lesson will **not** be graded. Make sure your GMAIL address is included as the last line in the output above.

### **YouTube Introduction to Hugging Face Tokenizers**

Run the next cell to see a short introduction to Hugging Face Tokenizers. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = "VFp38yj8h3A"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen>
</iframe>
""")

# **Hugging Face Tokenizers**

**Hugging Face tokenizers** are essential tools in natural language processing (NLP) that convert text into numerical data, which can be processed by machine learning models. Here's a breakdown of their importance and why they are particularly valuable for computational biologists:

### What are Hugging Face Tokenizers?
1. **Tokenization**: The process of breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters.
2. **Normalization**: Adjusting the text to a standard format, such as lowercasing or removing punctuation.
3. **Encoding**: Converting tokens into numerical representations that models can understand.

### Types of Tokenizers
1. **Word-based Tokenizers**: Split text into words. Simple but can lead to large vocabularies.
2. **Subword Tokenizers**: Break words into smaller units, balancing vocabulary size and representation. Examples include Byte Pair Encoding (BPE) and WordPiece.
3. **Character-based Tokenizers**: Split text into individual characters. Useful for languages with complex morphology.
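
The difference between word-based and character-based tokenization can be seen on a single string. The following is a minimal plain-Python sketch, not the Hugging Face implementation; the helper names (`word_tokenize`, `char_tokenize`, `build_vocab`, `encode`) and the sample sentence are made up for illustration.

```python
# Toy illustration of word-based vs. character-based tokenization.
# All names here are hypothetical helpers, not part of any library.

def word_tokenize(text):
    """Normalize (lowercase) and split on whitespace."""
    return text.lower().split()

def char_tokenize(text):
    """Normalize (lowercase) and split into single characters, dropping spaces."""
    return [ch for ch in text.lower() if not ch.isspace()]

def build_vocab(tokens):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Convert tokens to their integer IDs."""
    return [vocab[tok] for tok in tokens]

text = "Genes encode proteins and proteins fold"
word_tokens = word_tokenize(text)
char_tokens = char_tokenize(text)
word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

print(word_tokens)
print(encode(word_tokens, word_vocab))
print("word vocabulary size:", len(word_vocab))
print("character vocabulary size:", len(char_vocab))
```

On a real corpus the word vocabulary grows with every new word, while the character vocabulary stays small but produces much longer token sequences. Subword tokenizers sit between these two extremes.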

### Importance for Computational Biologists
1. **Handling Biological Texts**: Biological texts often contain specialized terminology, gene names, and sequences. Tokenizers can effectively process these texts, ensuring accurate representation and analysis.
2. **Data Preprocessing**: Tokenizers help in preparing biological data for machine learning models, enabling tasks like gene sequence analysis, protein structure prediction, and more.
3. **Efficiency**: Subword tokenizers, in particular, can handle rare and complex terms efficiently, reducing the need for extensive vocabularies and improving model performance.
4. **Integration with Models**: Hugging Face tokenizers are designed to work seamlessly with pre-trained models, allowing computational biologists to leverage state-of-the-art NLP techniques for their research.

You can find more detailed information about Hugging Face tokenizers [here](https://huggingface.co/docs/tokenizers/en/index).


## **Hugging Face Tokenizers**

Here are some of the most popular tokenizers on Hugging Face and why they are useful:

1. **Byte-Pair Encoding (BPE)**:
   - **Models**: GPT-2, RoBERTa
   - **Usefulness**: BPE tokenizers break down words into subword units, which helps in handling rare and out-of-vocabulary words efficiently. This reduces the vocabulary size while maintaining the ability to represent complex words.

2. **WordPiece**:
   - **Models**: BERT, DistilBERT
   - **Usefulness**: Similar to BPE, WordPiece tokenizers split words into subword units. They are particularly effective in handling morphological variations and rare words, making them suitable for a wide range of NLP tasks.

3. **SentencePiece**:
   - **Models**: T5, ALBERT
   - **Usefulness**: SentencePiece tokenizers can handle both word and subword tokenization. They are language-agnostic and can be trained on raw text without pre-tokenization, making them versatile for different languages and scripts.

4. **Unigram**:
   - **Models**: XLNet
   - **Usefulness**: Unigram tokenizers use a probabilistic model to generate subword units. They are effective in balancing the trade-off between vocabulary size and representation quality, making them suitable for large-scale language models.

5. **Whitespace**:
   - **Models**: Various
   - **Usefulness**: Whitespace tokenizers split text based on spaces. They are simple and fast, making them useful for tasks where precise tokenization is not critical.

These tokenizers are designed to handle different aspects of text processing, such as handling rare words, reducing vocabulary size, and maintaining the integrity of the original text. They are essential for preparing text data for machine learning models, ensuring that the input is in a format that the models can understand and process effectively.
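
To make the subword idea concrete, here is a toy sketch of how Byte-Pair Encoding learns its merges: it repeatedly finds the most frequent pair of adjacent symbols and fuses it into one new symbol. This is a simplified character-level version written for illustration only. The word counts are invented, ties are broken alphabetically (an arbitrary choice made here), and production tokenizers such as GPT-2's differ in many details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common.

    Ties are broken alphabetically so the result is deterministic.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return min(pairs, key=lambda p: (-pairs[p], p))

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word counts; each word starts as a tuple of characters
corpus = {"gene": 5, "genes": 3, "genome": 2}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After three merges the frequent word "gene" has become a single symbol, while the rarer "genome" is still spelled out from smaller pieces. This is how a subword vocabulary stays compact and still covers rare words.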

You can find more detailed information about Hugging Face tokenizers [here](https://huggingface.co/docs/tokenizers/index).



### Install Hugging Face

Run the code in the next cell to install Hugging Face in your Colab environment.

In [None]:
# Install Hugging Face

!pip install transformers > /dev/null
!pip install transformers[sentencepiece] > /dev/null

### Example 1 - Step 1: Create Tokenizer

In Step 1, we create a Hugging Face tokenizer using `distilbert-base-uncased` as our tokenizer model. It is important to remember that the `distilbert-base-uncased` tokenizer is a **WordPiece** tokenizer. WordPiece tokenizers split words into subword units. They are particularly effective in handling morphological variations and rare words, making them suitable for a wide range of NLP tasks.

In **Exercise 1** we will shift to a **SentencePiece** tokenizer to see how these two tokenizer types differ.

In [None]:
# Example 1 - Step 1: Create Tokenizer

from transformers import AutoTokenizer

# Define model
EG_model = "distilbert-base-uncased"

# Setup tokenizer
EG_tokenizer = AutoTokenizer.from_pretrained(EG_model)

If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image01F.png)

### Example 1 - Step 2: Tokenize a Sentence

The code in the cell below shows how to tokenize a sentence. For this example, we will use a quotation from Geoffrey Hinton, a pioneer in the field of deep learning:
> "To make a real impact on AI, we need to build systems that can learn from very large amounts of data."
>
This quote emphasizes the importance of data-driven learning in the development and advancement of neural networks and artificial intelligence.

In [None]:
# Example 1 - Step 2: Tokenize a sentence

# Create sentence variable
EG_sentence_1 = "To make a real impact on AI, we need to build systems that can learn from very large amounts of data."

# Tokenize sentence
EG_encoded = EG_tokenizer(EG_sentence_1)

# Print result
print(EG_encoded)


If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image01A.png)

The result of this tokenization contains two elements:

* **input_ids:** The individual subword indexes. Each index uniquely identifies a subword.

* **attention_mask:**  Which values in input_ids are meaningful and not padding. This sentence had no padding, so all elements have an attention mask of "1". Later, we will request the output to be of a fixed length, introducing padding, which always has an attention mask of "0". Though each tokenizer can be implemented differently, the attention mask of a tokenizer is generally either "0" or "1".

Due to subwords and special tokens, the number of tokens may not match the number of words in the source string. We can see the meanings of the individual tokens by converting these IDs back to strings.
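
The mismatch between word count and token count can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match-first splitting in the spirit of WordPiece. The vocabulary `TOY_VOCAB` and the helper names are hypothetical, and a real tokenizer has a much larger vocabulary and extra normalization steps.

```python
# Toy greedy longest-match-first subword splitting, in the spirit of WordPiece.
# The vocabulary below is hypothetical and far smaller than a real one.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "we", "learn", "token", "##ize", "##rs", "##s"}

def split_word(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(sentence, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, split each word, then add special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    return tokens

sentence = "We learn tokenizers"
tokens = toy_tokenize(sentence)
print(len(sentence.split()), "words ->", len(tokens), "tokens")
print(tokens)
```

Three words become seven tokens here: two special tokens are added, and "tokenizers" is split into three subword pieces.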


### Example 1 - Step 3: Show Meaning of Tokens

The code in the cell below shows the meaning of the tokens.

In [None]:
# Example 1 - Step 3: Show meaning of tokens

# Show meaning
EG_tokenizer.convert_ids_to_tokens(EG_encoded.input_ids)

If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image02A.png)

The sentence was broken down into tokens, which include special tokens **[CLS]** and **[SEP]**:

**[CLS]:** This token is added at the beginning of the sequence. It stands for "classification" and is used in tasks like sentence classification.

**Tokens:** The words and punctuation marks from your sentence have been split into individual tokens.

**[SEP]:** This token is added at the end of the sequence to indicate the end of the input. It stands for "separator" and is used in tasks involving multiple sequences.

The tokenization process is essential for preparing text data for model processing. It ensures that the text is transformed into a format that the model can understand and work with.

### Example 1 - Step 4: Convert IDs to Tokens

The cell below converts the IDs `[0, 100, 101, 102, 103]` back into tokens. These IDs were chosen because they correspond to the tokenizer's special tokens.

In [None]:
# Example 1 - Step 4: Convert ids to tokens

# Convert ids back into tokens
EG_tokenizer.convert_ids_to_tokens([0, 100, 101, 102, 103])

If the code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image03A.png)

#### Explanation of Special Tokens in Tokenizer Output

The output `['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']` consists of special tokens commonly used by tokenizers in the Hugging Face library:

1. **[PAD]**: Padding token used to pad sequences to ensure they are of the same length.
2. **[UNK]**: Unknown token used to represent tokens that are not found in the tokenizer's vocabulary.
3. **[CLS]**: Classification token added at the beginning of the sequence, often used for tasks like sentence classification.
4. **[SEP]**: Separator token used to indicate the end of a sequence or to separate multiple sequences.
5. **[MASK]**: Mask token used in masked language modeling tasks, where certain tokens are masked and the model attempts to predict them.

These special tokens are integral to various NLP tasks, such as sequence classification, question answering, and language modeling. They help the model understand and process input data in a structured way.
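
As a rough illustration of how these special tokens end up in an encoded sequence, the sketch below uses a tiny lookup table. The five special-token IDs are the ones shown in the output above; the two ordinary entries (`gene`, `protein`) and the helper functions are invented for this example. Note that a real WordPiece tokenizer would usually split an unfamiliar word into subwords before falling back to `[UNK]`.

```python
# Toy ID <-> token lookup. The five special-token IDs match the output shown
# above; the two ordinary entries ("gene": 5, "protein": 6) are made up.
ID_TO_TOKEN = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]", 103: "[MASK]",
               5: "gene", 6: "protein"}
TOKEN_TO_ID = {token: idx for idx, token in ID_TO_TOKEN.items()}

def toy_convert_ids_to_tokens(ids):
    """Look up the token string for each ID."""
    return [ID_TO_TOKEN[i] for i in ids]

def toy_encode(words, max_length):
    """Wrap in [CLS]/[SEP], map unknown words to [UNK], pad to max_length."""
    ids = [TOKEN_TO_ID["[CLS]"]]
    ids += [TOKEN_TO_ID.get(w, TOKEN_TO_ID["[UNK]"]) for w in words]
    ids.append(TOKEN_TO_ID["[SEP]"])
    ids += [TOKEN_TO_ID["[PAD]"]] * (max_length - len(ids))
    return ids

ids = toy_encode(["gene", "ribosome", "protein"], max_length=8)
print(ids)
print(toy_convert_ids_to_tokens(ids))
```

The word "ribosome" is not in the toy vocabulary, so it is encoded as `[UNK]`, and three `[PAD]` tokens fill the sequence out to the requested length.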

-----------------------------

### **Comparison of DistilBERT and ALBERT Tokenizers**

For **Exercise 1** we will be comparing the `DistilBERT` tokenizer with the `ALBERT` tokenizer. Here are some of the differences we should expect to see:

#### Tokenization Method
- **DistilBERT**: Utilizes the WordPiece tokenization method. This method breaks down words into subword units and uses the `##` prefix to denote subwords that appear within a word.
- **ALBERT**: Uses the SentencePiece tokenization method. SentencePiece tokenizers break down text into subword units and use the underscore-like `▁` character (U+2581, not the ASCII `_`) to indicate spaces before words.

#### Special Tokens
- **DistilBERT**: Uses special tokens such as `[CLS]` (classification token added at the beginning of the sequence) and `[SEP]` (separator token used to indicate the end of a sequence or to separate multiple sequences).
- **ALBERT**: Also uses special tokens like `[CLS]` and `[SEP]`. However, the representation and tokenization of the text might differ due to the SentencePiece method.

#### Vocabulary
- **DistilBERT**: Has a distinct vocabulary file based on the WordPiece tokenization method, which includes tokens with the `##` prefix to denote subwords.
- **ALBERT**: Has a different vocabulary file based on the SentencePiece tokenization method, which includes tokens with the underscore (`▁`) character to denote spaces before words.
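
One way to see the two marker conventions side by side is to reverse them, turning a token list back into text. The sketch below is a simplified illustration; the token lists are hypothetical rather than real tokenizer output, and the helper names are made up.

```python
SPACE_MARK = "\u2581"  # the '▁' character SentencePiece uses for a preceding space

def join_wordpiece(tokens):
    """Rebuild text from WordPiece-style tokens: '##' glues a piece to the previous one."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

def join_sentencepiece(tokens):
    """Rebuild text from SentencePiece-style tokens: the space mark becomes a space."""
    return "".join(tokens).replace(SPACE_MARK, " ").strip()

# Hypothetical token lists for the same text under each convention
wordpiece_tokens = ["token", "##ize", "##rs", "split", "text"]
sentencepiece_tokens = [SPACE_MARK + "token", "izers", SPACE_MARK + "split", SPACE_MARK + "text"]

print(join_wordpiece(wordpiece_tokens))
print(join_sentencepiece(sentencepiece_tokens))
```

WordPiece marks the pieces that continue a word, while SentencePiece marks the pieces that start one. Both carry enough information to recover the word boundaries.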

---------------------------

### **Exercise 1 - Step 1: Create Tokenizer**

In the cell below, create a Hugging Face tokenizer using `albert-large-v2` as your tokenizer model. Call your model `EX_model` and your tokenizer `EX_tokenizer`.

In [None]:
# Insert your code for Exercise 1 - Step 1 here:


If the code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image02F.png)

### **Exercise 1 - Step 2: Tokenize a Sentence**

In the cell below use the following sentence to create `EX_sentence_1`:

"Hugging Face is on a mission to democratize artificial intelligence through open source and open science, making state-of-the-art models accessible to everyone."

Use your `EX_tokenizer` to tokenize the sentence to create `EX_encoded`. Then print out the result of the tokenization, `EX_encoded`.

In [None]:
# Insert your code for Exercise 1 - Step 2 here:



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image07A.png)

with additional numbers at the right.

### **Exercise 1 - Step 3: Show Meaning of Tokens**

In the cell below, write the code to use your `EX_tokenizer` to show the meanings of the tokens in `EX_encoded`.



In [None]:
# Insert your code for Exercise 1 - Step 3 here:



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image05A.png)

### **Explanation of Tokenization Output**

The output you're seeing is the result of tokenization using the `ALBERT` tokenizer. Here's what each part of the output means:

1. **[CLS]**: This special token is added at the beginning of the sequence. It stands for "classification" and is often used in tasks like sentence classification or as a marker for the start of the sequence.

2. **▁ (underscore-like character, U+2581)**: This symbol is used in SentencePiece tokenization to indicate a space before a word. It's a way to handle whitespace in the text.

3. **Tokens**: The words and subwords in your input sentence have been split into tokens. For example:
   - `'▁hugging'` represents the word "hugging" with a preceding space.
   - `'▁face'` represents the word "face" with a preceding space.
   - `'▁is'`, `'▁on'`, `'▁a'`, etc., represent individual words with preceding spaces.
   - `'▁democrat'`, `'ize'` represent the word "democratize," split into two subwords.

4. **Punctuation**: Punctuation marks like commas and periods are kept as separate tokens.

5. **[SEP]**: This special token is added at the end of the sequence to indicate the end of the input. It stands for "separator" and is used in tasks involving multiple sequences.

#### **Simplified Explanation of the Tokenization Process**
- The tokenizer breaks down the sentence into smaller units called tokens.
- Special tokens `[CLS]` and `[SEP]` are added to mark the beginning and end of the sequence.
- Each word and punctuation mark is converted into a token, with subword tokenization applied to handle complex or rare words.

Tokenization is crucial for transforming text into a format that machine learning models can process. It ensures that the input is properly segmented and encoded into numerical representations.

--------------------

### **Exercise 1 - Step 4: Convert IDs to Tokens**

In the cell below write the code to use your `EX_tokenizer` to convert the IDs `[0, 100, 101, 102, 103]` back into tokens.

In [None]:
# Insert your code for Exercise 1 - Step 4 here



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image06A.png)

### **Explanation of Tokenizer Output**

The command `tokenizer.convert_ids_to_tokens([0, 100, 101, 102, 103])` converts a list of token IDs back into their corresponding tokens using the tokenizer's vocabulary. Here's what each part of the output means:

1. **`<pad>`**: This token corresponds to the ID `0` and is typically used for padding sequences to a uniform length.
2. **▁if**: The token with ID `100`, indicating the word "if" with a preceding space. The underscore (`▁`) denotes a space before the word, as seen in SentencePiece tokenization.
3. **▁like**: The token with ID `101`, representing the word "like" with a preceding space.
4. **ly**: The token with ID `102`, representing the subword "ly" which might be a part of a larger word.
5. **n**: The token with ID `103`, representing the letter "n", which might be part of a subword or a standalone token depending on the context.

These token IDs are mapped to their corresponding tokens based on the tokenizer's vocabulary. The presence of subword tokens like "ly" and "n" indicates that the tokenizer uses a subword-based method (e.g., SentencePiece, WordPiece) to handle the text.


-----------------------------------

### **Difference in Tokenization Output: ALBERT vs. DistilBERT**

The difference in output between the ALBERT tokenizer used in **Exercise 1** and the DistilBERT tokenizer used in Example 1 is due to the distinct tokenization methods used by each model. Here are the key differences:

#### Tokenization Method
- **ALBERT**: Uses SentencePiece tokenization, which breaks down text into subword units and adds the underscore (`▁`) character to indicate spaces. This helps handle complex words and out-of-vocabulary terms efficiently.
- **DistilBERT**: Uses WordPiece tokenization, which also breaks down text into subword units but does not use the underscore character. Instead, it adds `##` before subword tokens that are not at the start of a word.

#### Special Tokens
- Both ALBERT and DistilBERT add `[CLS]` at the beginning and `[SEP]` at the end of the sequence, but their internal tokenization processes differ due to the tokenization method used.

#### Vocabulary
- The vocabulary files used by ALBERT and DistilBERT are different. Each tokenizer has its unique vocabulary that influences how text is broken down into tokens.

----------------------------------

## **Tokenizing a List of Sequences**

**Tokenizing a list of sequences** is a crucial step in preparing data for machine learning models, especially in the field of natural language processing (NLP) and bioinformatics. Here's why tokenization is important and why it might be of interest to a computational biologist:

#### Importance of Tokenizing a List of Sequences

1. **Uniform Data Representation**: Tokenization transforms sequences (e.g., text, DNA/RNA sequences) into numerical representations that models can process. This ensures that all sequences are uniformly represented, making it easier to feed them into machine learning algorithms.

2. **Handling Variable-Length Sequences**: Biological data, such as gene sequences, can vary in length. Tokenization, combined with techniques like padding and truncation, ensures that all sequences are of a uniform length, which is necessary for batch processing.

3. **Capturing Contextual Information**: Advanced tokenizers, such as those used in NLP models, can capture contextual information by breaking down sequences into meaningful subunits (e.g., words, subwords, amino acids). This helps models understand the relationships and patterns within the sequences.

4. **Reducing Vocabulary Size**: Subword tokenizers (e.g., SentencePiece, WordPiece) break down rare and complex terms into smaller, more manageable units. This reduces the overall vocabulary size, making the model more efficient and capable of handling out-of-vocabulary terms.

5. **Facilitating Transfer Learning**: Pre-trained models, such as BERT or ALBERT, often come with their own tokenizers. By tokenizing sequences in a manner consistent with these models, researchers can leverage transfer learning, applying pre-trained models to their specific tasks with minimal retraining.

#### Why Tokenization is of Interest to Computational Biologists

1. **Gene Sequence Analysis**: Tokenizing DNA/RNA sequences allows computational biologists to apply machine learning models to tasks such as gene prediction, sequence alignment, and motif discovery.

2. **Protein Structure Prediction**: Tokenizing amino acid sequences helps in predicting protein structures, functions, and interactions. This can lead to advancements in drug discovery and understanding of biological processes.

3. **Text Mining in Biological Literature**: Tokenizing scientific texts and literature enables computational biologists to perform text mining, extracting valuable information, identifying trends, and gaining insights from large volumes of published research.

4. **Data Preprocessing**: Proper tokenization is a key step in data preprocessing, ensuring that biological data is in a suitable format for downstream analysis. This includes normalization, handling missing data, and preparing input for various bioinformatics tools.

5. **Integration with NLP Models**: Tokenizing biological sequences or texts allows computational biologists to integrate their data with NLP models, enabling tasks such as entity recognition, relationship extraction, and knowledge discovery.

In summary, tokenization is a vital step in transforming biological sequences into a format that can be effectively processed by machine learning models. It enhances the efficiency, accuracy, and interpretability of computational biology tasks, leading to more meaningful insights and discoveries.



### Example 2 - Step 1: Entity Tagging

The code in the cell below creates a variable, `EG_sequences`, containing three DNA sequences. The top two sequences are labelled `ORF` while the bottom sequence is labelled as not an `ORF`.

In molecular genetics, `ORF` stands for **Open Reading Frame**. An open reading frame is a continuous stretch of nucleotides in a DNA (or RNA) sequence that has the potential to be translated into a protein. It starts with a start codon (usually AUG, which codes for methionine) and ends with a stop codon (such as UAA, UAG, or UGA).

ORFs are important because they help scientists identify protein-coding regions within a genome. In gene prediction and annotation, finding ORFs is a key step in determining which parts of the DNA sequence may encode functional proteins.
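
The start-codon and stop-codon rule can be written as a very small check. The sketch below is deliberately simplified: it only tests that a DNA string begins with `ATG` and ends with a stop codon, and it ignores reading-frame length and in-frame stop codons, which a real ORF finder would also examine. The function name `looks_like_orf` is made up; the three sequences are the ones used in this example.

```python
START_CODON = "ATG"                  # DNA form of the AUG start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, UGA

def looks_like_orf(dna):
    """Simplified check: begins with a start codon and ends with a stop codon.

    Reading-frame length and in-frame stop codons are NOT checked here.
    """
    dna = dna.upper()
    return dna.startswith(START_CODON) and dna[-3:] in STOP_CODONS

demo_sequences = [
    "ATGCGTACCGATGCTTACGTTAGCATCGTATCGTAGCTGA",
    "ATGACCGTAACTGCTGCCATCGTATGCAGTCTGATGCTAA",
    "ACTGTCGACCAGTCTAGCATCGGTTACGATCGTACAGTAC",
]
for seq in demo_sequences:
    print(seq[:3], "...", seq[-3:], looks_like_orf(seq))
```

Under this simplified rule the first two sequences pass and the third does not, which agrees with the `ORF` / not `ORF` labels used in this example.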

The code in the cell below tokenises a batch of DNA sequences with `EG_tokenizer`, the BERT-style WordPiece tokenizer created in Example 1, padding them to the longest sequence, adding special `[CLS]`/`[SEP]` tokens, and returning PyTorch tensors. It then prints the resulting `input_ids` (integer token IDs) and `attention_mask` (1s for real tokens, 0s for padding), preparing the data for input into a BERT‑style model for downstream tasks such as ORF prediction.

In [None]:
# Example 2 - Step 1: Entity tagging

from transformers import BertModel, BertTokenizer
import torch

# Define DNA sequences
EG_sequences = [
    "ATGCGTACCGATGCTTACGTTAGCATCGTATCGTAGCTGA", # ORF
    "ATGACCGTAACTGCTGCCATCGTATGCAGTCTGATGCTAA", # ORF
    "ACTGTCGACCAGTCTAGCATCGGTTACGATCGTACAGTAC"  # Not ORF
]

# Encode Sequences
EG_encoded = EG_tokenizer(EG_sequences, padding=True, add_special_tokens=True,
                       return_tensors="pt")  # Ensure outputs are tensors

print("Input IDs")
for a in EG_encoded.input_ids:
    print(a)

print("**Attention Mask**")
for a in EG_encoded.attention_mask:
    print(a)

If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image11A.png)

The input_ids tensors represent the tokenized and encoded version of your input sequences. Each tensor corresponds to a sequence, with special tokens and padding added as needed. Let's look at the first tensor:

First Sequence:

~~~text
tensor([  101,  2012, 18195, 13512,  6305,  2278, 20697, 18195,  5946,  2278,
         13512, 15900, 11266,  2278, 13512,  4017,  2278, 13512,  8490,  6593,
           102,     0,     0])

~~~

* `101` represents the `[CLS]` token added at the beginning of the sequence.
* The numbers like `2012`, `18195`, `13512`, ... are the token IDs for the subword pieces the tokenizer split your sequence into.
* `102` represents the `[SEP]` token added at the end of the sequence.
* `0` represents the `[PAD]` tokens added to pad the sequence to a uniform length.

#### **Attention Mask**

The attention_mask tensors indicate which tokens should be attended to (1) and which tokens are just padding (0). This helps the model to focus on the relevant tokens and ignore the padding tokens.

First Sequence:

~~~text
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

~~~
All relevant tokens (including special tokens) have a value of `1`. The padding tokens have a value of `0`.

#### **Summary**
* **Input IDs:** These tensors contain the token `IDs` for the sequences, including special tokens `[CLS]` and `[SEP]`, and padding tokens `[PAD]` to ensure uniform length.

* **Attention Mask:** These tensors indicate which tokens should be attended to by the model and which tokens are padding.

This output ensures that the sequences are properly formatted for processing by the model, with attention focused on the relevant tokens.
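
The relationship between padding and the attention mask can be sketched in plain Python. This is only an illustration of the idea behind `padding=True`, not the library's implementation; the helper `pad_batch` and the token IDs in `batch` are made up.

```python
PAD_ID = 0  # [PAD] has ID 0 in the output above

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every ID list to the length of the longest one and build the masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token IDs; 101 and 102 stand for [CLS] and [SEP]
batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 7, 8, 102]]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every row ends up the same length, and each `0` in a mask lines up with a `[PAD]` ID in the matching row of `input_ids`.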


### Example 2 - Step 2: Analyze Tokens with Neural Network

The code in the cell below loads a pretrained `bert-base-uncased` tokenizer (`EG_tokenizer`) and model (`EG_model`), takes the `input_ids` and `attention_mask` produced from `EG_sequences` in Step 1, runs a forward pass without gradient computation to obtain the last hidden states, extracts the hidden state of the `[CLS]` token from the first position of each sequence, and prints these `[CLS]` representations as vectors which can be used for downstream classification tasks, in this case identifying `ORFs`.


In [None]:
# Example 2 - Step 2: Analyze tokens with neural network

from transformers import BertModel, BertTokenizer
import torch

# Define model and tokenizer
EG_model_name = "bert-base-uncased"
EG_tokenizer = BertTokenizer.from_pretrained(EG_model_name)
EG_model = BertModel.from_pretrained(EG_model_name)

# Prepare inputs
input_ids = EG_encoded["input_ids"]
attention_mask = EG_encoded["attention_mask"]

# Perform forward pass through the model
with torch.no_grad():
    EG_outputs = EG_model(input_ids=input_ids, attention_mask=attention_mask)

# Extract hidden states (last layer)
EG_hidden_states = EG_outputs.last_hidden_state

# Use hidden states for classification
EG_cls_token_hidden_state = EG_hidden_states[:, 0, :]  # Select [CLS] token hidden state
print(EG_cls_token_hidden_state)


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image09A.png)

You should focus on the tensor values:

```text
tensor([[-0.6225,  0.0124,  0.0010,  ...,  0.0586, -0.1268,  0.6282],
        [-0.8602,  0.0782, -0.0059,  ...,  0.1884,  0.1436,  0.5705],
        [-0.5524,  0.0369,  0.0565,  ...,  0.0031, -0.0707,  0.7200]])
```
These tensor values will be used in the next step to predict whether or not a particular DNA sequence is an `ORF`.
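
It may help to see what the slice `[:, 0, :]` selects. The last hidden state has three axes: sequence in the batch, token position, and feature. Position 0 is the `[CLS]` token, so the slice keeps one feature vector per sequence. The sketch below mimics this with nested Python lists; the sizes and values are made up (for `bert-base-uncased` each vector would have 768 features rather than 4).

```python
# hidden_states[b][t][f]: batch item b, token position t, feature f.
# Shapes and values are made up: 2 sequences, 3 token positions, 4 features.
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4], [2.1, 2.2, 2.3, 2.4]],
    [[5.1, 5.2, 5.3, 5.4], [6.1, 6.2, 6.3, 6.4], [7.1, 7.2, 7.3, 7.4]],
]

# Equivalent of the tensor slice [:, 0, :]: for every sequence, keep position 0
cls_vectors = [sequence[0] for sequence in hidden_states]

print(cls_vectors)
print(len(cls_vectors), "vectors of length", len(cls_vectors[0]))
```

The result has one row per input sequence, which is the shape a classifier expects for its feature matrix.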

### Example 2 - Step 3:  Use tensors to make predictions

The code in the cell below uses the tensors generated in the previous step to predict whether a DNA sequence is an `ORF`.

In [None]:
# Example 2 - Step 3: Use tensors to make predictions

import torch
from sklearn.linear_model import LogisticRegression

# Assumes EG_cls_token_hidden_state was computed in Step 2

# Y labels
Y_labels = torch.tensor([1, 1, 0])

# Convert to numpy arrays
X = EG_cls_token_hidden_state.numpy()
y = Y_labels.numpy()

# Train a logistic regression classifier
EG_clf = LogisticRegression()
EG_clf.fit(X, y)

# Print header
print("ORF Predictions: 1=True, 0=False")

# Make predictions
EG_predictions = EG_clf.predict(X)
print(EG_predictions)


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image10A.png)

The predicted labels [1, 1, 0] indicate how the logistic regression classifier has categorized each of the input sequences based on the hidden states from the BERT model. Specifically:

* The first sequence is predicted to belong to class 1 (ORF).
* The second sequence is predicted to belong to class 1 (ORF).
* The third sequence is predicted to belong to class 0 (Not ORF).
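
For intuition about how a logistic regression classifier turns a feature vector into a class label, here is a minimal sketch. It computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and predicts class 1 when that probability is at least 0.5. The weights, bias, and feature vectors below are invented for illustration; they are not the coefficients learned by `EG_clf`.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias):
    """Return (probability of class 1, predicted label) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(z)
    return prob, int(prob >= 0.5)

# Hypothetical 3-feature vectors and coefficients, chosen only for illustration
weights = [2.0, -1.0, 0.5]
bias = -0.25
samples = [[1.0, 0.2, 0.3], [0.1, 1.5, 0.0]]

for features in samples:
    prob, label = predict_class(features, weights, bias)
    print(round(prob, 3), label)
```

In the example above, the classifier is asked to predict the same three sequences it was trained on, so matching the labels shows that the model fit its training data rather than how it would perform on new sequences.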

### **Example Scenario**

Imagine you're working on classifying gene sequences into different functional groups based on their sequence patterns. The predicted labels [1, 1, 0] might represent different functional groups, such as:

* Class 0: Non-coding sequences (Not ORF).
* Class 1: Protein-coding sequences (ORF).

In this scenario, the model has identified the first and second sequences as protein-coding sequences (class 1) and the third sequence as a non-coding sequence (class 0).

The output demonstrates the practical application of deep learning models in sequence classification and showcases the potential of transformer-based models in extracting meaningful features from complex biological data.

The output `[1, 1, 0]` gives the predicted class label for each input sequence, according to the logistic regression classifier trained on the hidden states of the `[CLS]` token:

1. **First Sequence:** The predicted label is 1.

2. **Second Sequence:** The predicted label is 1.

3. **Third Sequence:** The predicted label is 0.

These class labels are based on the training data provided (in this case, the `Y_labels` tensor) and the features extracted from the hidden states of the input sequences. You can use these predicted labels for various downstream tasks, such as categorizing the sequences into different classes or groups.

### **Exercise 2 - Step 1: Entity Tagging**

In the cell below create a variable, `EX_sequences`, with the following three DNA sequences:

```text
"TATAAAAGGCGTACGTAGCTAGCTAG",  # Promoter sequence
"CGTAGCTAGCTAGCGCGCGCGCGCGC",  # Not promoter sequence
"GCGCGTATATAAGCTAGCTAGCTAGC"   # Promoter sequence
```
The top and bottom sequences are labelled `Promoter sequence` while the middle sequence is labelled `Not promoter sequence`.

A **promoter sequence** is a region of DNA located upstream (before) the start of a gene that plays a crucial role in initiating transcription—the process by which RNA is synthesized from a DNA template.

**Key Features of Promoter Sequences:**
* Binding Site for RNA Polymerase: Promoters contain specific motifs where RNA polymerase and transcription factors bind to begin transcription.
* TATA Box: A common motif in eukaryotic promoters, typically found about 25–35 bases upstream of the transcription start site. It has the sequence TATAAA and helps position RNA polymerase correctly.
* Regulatory Elements: Promoters may also include other elements like enhancers or silencers that influence the rate of transcription.
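To make the TATA box idea concrete, the sketch below scans DNA strings for a TATA-like motif using Python's built-in `re` module. The pattern `TATA[AT]A[AT]` is a commonly cited, slightly looser form of the TATAAA consensus. It is an assumption made for this example and is not stated elsewhere in this lesson. The helper name `find_tata_box` is hypothetical, and the example sequences are the three used in Exercise 2.

```python
import re

# Assumed, looser TATA box consensus: TATA, then A or T, then A, then A or T.
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_box(sequence):
    """Return the 0-based index of the first TATA-like motif, or -1 if none is found."""
    match = TATA_PATTERN.search(sequence.upper())
    return match.start() if match else -1


example_sequences = [
    "TATAAAAGGCGTACGTAGCTAGCTAG",
    "CGTAGCTAGCTAGCGCGCGCGCGCGC",
    "GCGCGTATATAAGCTAGCTAGCTAGC",
]

tata_positions = [find_tata_box(seq) for seq in example_sequences]
print(tata_positions)  # prints [0, -1, 5]
```

A match only shows that a TATA-like motif is present. It does not establish that a sequence is a functional promoter, which also depends on the motif's position relative to the transcription start site and on other regulatory elements.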

In the cell below, tokenise your `EX_sequences` with a BERT tokenizer, padding them to the longest sequence, adding the special `[CLS]`/`[SEP]` tokens, and returning PyTorch tensors. Then print out the resulting `input_ids` (integer token IDs) and `attention_mask`.

In [None]:
# Insert your code for Exercise 2 - Step 1 here



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image14A.png)

### **Exercise 2 - Step 2: Analyze Tokens with Neural Network**

In the cell below, write the code to load a pretrained `bert-base-uncased` tokenizer called `EX_tokenizer` along with a model called `EX_model`, and tokenise your DNA sequences (`EX_sequences`). Print out the Input IDs along with the Attention mask.

In [None]:
# Insert your code for Exercise 2 - Step 2 here



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image15A.png)


### **Exercise 2 - Step 3:  Use Tensors to Make Predictions**

In the cell below, use the tensors generated in the previous step to predict whether each DNA sequence is a promoter sequence.

You will need to change your `Y` labels to the following:
```text
# Y labels
Y_labels = torch.tensor([1, 0, 1])
```

In [None]:
# Insert your code for Exercise 2 - Step 3 here



If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_05/class_05_2_image16A.png)

# **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print... --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_2.lastname.pdf`, where _lastname_ is your last name, and upload the file to Canvas.

## **Lizard Tail**

## **TSMC**

![__](https://upload.wikimedia.org/wikipedia/commons/0/07/TSMC_factory_in_Taichung%27s_Central_Taiwan_Science_Park.jpg)


**Taiwan Semiconductor Manufacturing Company Limited** (TSMC or Taiwan Semiconductor) is a Taiwanese multinational semiconductor contract manufacturing and design company. It is the world's most valuable semiconductor company, the world's largest dedicated independent ("pure-play") semiconductor foundry, and Taiwan's largest company, with headquarters and main operations located in the Hsinchu Science Park in Hsinchu, Taiwan. Although the central government of Taiwan is the largest individual shareholder, the majority of TSMC is owned by foreign investors. In 2023, the company was ranked 44th in the Forbes Global 2000. Taiwan's exports of integrated circuits amounted to \$184 billion in 2022, accounting for nearly 25 percent of Taiwan's GDP. TSMC constitutes about 30 percent of the Taiwan Stock Exchange's main index.

TSMC was founded in Taiwan in 1987 by Morris Chang as the world's first dedicated semiconductor foundry. It has long been the leading company in its field. When Chang retired in 2018, after 31 years of TSMC leadership, Mark Liu became chairman and C. C. Wei became Chief Executive. It has been listed on the Taiwan Stock Exchange since 1993; in 1997 it became the first Taiwanese company to be listed on the New York Stock Exchange. Since 1994, TSMC has had a compound annual growth rate (CAGR) of 17.4% in revenue and a CAGR of 16.1% in earnings.

Most fabless semiconductor companies such as AMD, Apple, ARM, Broadcom, Marvell, MediaTek, Qualcomm, and Nvidia are customers of TSMC, as are emerging companies such as Allwinner Technology, HiSilicon, Spectra7, and UNISOC. Programmable logic device companies Xilinx and previously Altera also make or made use of TSMC's foundry services. Some integrated device manufacturers that have their own fabrication facilities, such as Intel, NXP, STMicroelectronics, and Texas Instruments, outsource some of their production to TSMC. At least one semiconductor company, LSI, re-sells TSMC wafers through its ASIC design services and design IP portfolio.

TSMC has a global capacity of about thirteen million 300 mm-equivalent wafers per year as of 2020 and produces chips for customers with process nodes from 2 microns to 3 nanometres. TSMC was the first foundry to market 7-nanometre and 5-nanometre (used by the 2020 Apple A14 and M1 SoCs, the MediaTek Dimensity 8100, and AMD Ryzen 7000 series processors) production capabilities, and the first to commercialize ASML's extreme ultraviolet (EUV) lithography technology in high volume.

**History**

In 1986, Li Kwoh-ting, representing the Executive Yuan, invited Morris Chang to serve as the president of the Industrial Technology Research Institute (ITRI) and offered him a blank check to build Taiwan's chip industry. At that time, the Taiwanese government wanted to develop its semiconductor industry, but the high-investment, high-risk nature of the industry made it difficult to find investors. Texas Instruments and Intel turned down Chang. Only Philips was willing to sign a joint venture contract with Taiwan to put up \$58 million, transfer its production technology, and license intellectual property in exchange for a 27.5 percent stake in TSMC. Alongside generous tax benefits, the Taiwanese government, through the National Development Fund, Executive Yuan, provided another 48 percent of the startup capital for TSMC, and the rest of the capital was raised from several of the island's wealthiest families, who owned firms that specialized in plastics, textiles, and chemicals. These wealthy Taiwanese were directly "asked" by the government to invest. "What generally happened was that one of the ministers in the government would call a businessman in Taiwan," Chang explained, "to get him to invest." From day one, TSMC was not really a private business: it was a project of the Taiwanese state. Its first CEO was James E. Dykes, who left after a year, after which Morris Chang became CEO.

Since then, the company has continued to grow, albeit subject to the cycles of demand. In 2011, the company planned to increase research and development expenditures by almost 39% to NT\$50 billion to fend off growing competition. The company also planned to expand capacity by 30% in 2011 to meet strong market demand. In May 2014, TSMC's board of directors approved capital appropriations of US \$568 million to increase and improve manufacturing capabilities after the company forecast higher than expected demand. In August 2014, TSMC's board of directors approved additional capital appropriations of US \$3.05 billion.

In 2011, it was reported that TSMC had begun trial production of the A5 SoC and A6 SoCs for Apple's iPad and iPhone devices. According to reports, in May 2014 Apple sourced its A8 and A8X SoCs from TSMC. Apple then sourced the A9 SoC with both TSMC and Samsung (to increase volume for iPhone 6S launch) and the A9X exclusively with TSMC, thus resolving the issue of sourcing a chip in two different microarchitecture sizes. As of 2014, Apple was TSMC's most important customer. In October 2014, ARM and TSMC announced a new multi-year agreement for the development of ARM based 10 nm FinFET processors.

Over the objection of the Tsai Ing-wen administration, in March 2017, TSMC invested US\$3 billion in Nanjing to develop a manufacturing subsidiary there.

In 2020, TSMC became the first semiconductor company in the world to sign up for the RE100 initiative, pledging to use 100% renewable energy by 2050. TSMC accounts for roughly 5% of the energy consumption in Taiwan, even exceeding that of the capital city Taipei. This initiative was thus expected to accelerate the transformation to renewable energy in the country.

For 2020, TSMC had a net income of US \$17.60 billion on a consolidated revenue of US \$45.51 billion, an increase of 57.5% and 31.4% respectively from the 2019 level of US \$11.18 billion net income and US \$34.63 billion consolidated revenue. Its market capitalization was over \$550 billion in April 2021. TSMC's revenue in the first quarter of 2020 reached US \$10 billion, while its market capitalization was US \$254 billion. TSMC's market capitalization reached a value of NT \$1.9 trillion (US \$63.4 billion) in December 2010. It was ranked 70th in the FT Global 500 2013 list of the world's most highly valued companies with a capitalization of US \$86.7 billion, while reaching US\$110 billion in May 2014. In March 2017, TSMC's market capitalization surpassed that of semiconductor giant Intel for the first time, hitting NT\$5.14 trillion (US \$168.4 billion), with Intel's at US \$165.7 billion. On 27 June 2020, TSMC briefly became the world's 10th most valuable company, with a market capitalization of US \$410 billion.

To mitigate business risks in the event of war between Taiwan and the People's Republic of China, since the beginning of the 2020s, TSMC has expanded its geographic operations, opening new fabs in Japan and the United States, with further plans for expansion into Germany. In July 2020, TSMC confirmed it would halt the shipment of silicon wafers to Chinese telecommunications equipment manufacturer Huawei and its subsidiary HiSilicon by 14 September. In November 2020, officials in Phoenix, Arizona in the United States approved TSMC's plan to build a \$12 billion chip plant in the city. The decision to locate a plant in the US came after the Trump administration warned about the issues concerning the world's electronics made outside of the U.S. In 2021, news reports claimed that the facility might be tripled to roughly a \$35 billion investment with six factories.

In October 2024, TSMC informed the United States Department of Commerce about a potential breach of export controls in which one of its most advanced chips was sent to Huawei via another company with ties to the Chinese government.
