<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_05_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 5: Natural Language Processing**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Hugging Face
* Part 5.2: Hugging Face Tokenizers
* **Part 5.3: Hugging Face Datasets**
* Part 5.4: Training Hugging Face models

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    Colab = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    Colab = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.

### **YouTube Introduction to Hugging Face Datasets**

Run the next cell to see a short introduction to Hugging Face Datasets. This is a suggested, but optional, part of the lesson.

In [None]:
from IPython.display import HTML
video_id = "_BZearw7f0w"
HTML(f"""
<iframe width="560" height="315"
  src="https://www.youtube.com/embed/{video_id}"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen>
</iframe>
""")

# **Hugging Face Datasets**

**Hugging Face Datasets** are useful for natural language processing (NLP). The Hugging Face library provides functions that allow you to navigate and obtain these data sets. When we access Hugging Face data sets, the data is in a format specific to Hugging Face. In this part, we will explore this format and see how to convert it to Pandas or TensorFlow data.

#### **Key Features**

* **Wide Variety:** The Hugging Face Hub hosts datasets for numerous tasks, such as natural language processing (NLP), computer vision, and audio processing.

* **Easy Access:** Datasets can be easily downloaded and pre-processed with simple commands, making it convenient for researchers and practitioners.

* **Efficient Data Handling:** The library supports efficient data pre-processing, caching, and memory-mapping, allowing users to work with large datasets without running into memory limitations.

* **Interoperability:** Built-in support for interoperability with libraries like NumPy, Pandas, and Polars.

#### **Usefulness for Computational Biologists**

* **Access to Diverse Datasets:** Computational biologists can access a wide range of datasets relevant to their research, such as genomic sequences, protein structures, and biological literature.

* **Data Pre-processing:** The library provides tools for efficient data pre-processing, enabling researchers to clean, transform, and prepare their data for analysis and modeling.

* **Integration with Machine Learning Frameworks:** Seamless integration with popular machine learning frameworks allows computational biologists to apply advanced machine learning techniques to their data.

### Install Hugging Face Datasets

Install the Hugging Face datasets by running the code in the next cell.


In [None]:
# Install Hugging Face Datasets

!pip install transformers > /dev/null
!pip install transformers[sentencepiece] > /dev/null
!pip install datasets  > /dev/null
!pip install huggingface_hub  > /dev/null

### Example 1: List Hugging Face Datasets

As of the latest available data, Hugging Face hosts approximately **250,000 datasets** on its platform. These datasets span a wide range of domains including natural language processing, computer vision, audio processing, and more, and are contributed by both the community and organizations.

To narrow our focus when searching for particular datasets, we can use `filters`, `tags` and `keywords` as illustrated by the code below.

In an effort to rank the usefulness of each dataset, the code uses a specific `api-helper` that gathers the number of downloads for each dataset.

Finally, the code prints out a listing of the top 25 datasets it finds ranked by the number of times the dataset has been downloaded.

In the example below the search only found 10 datasets that matched the search criteria so the output was limited to this number.

In [None]:
# Example 1: List Hugging Face datasets


# Install / update the Hugging Face Hub library
!pip install huggingface_hub --quiet

# Import the API helper
from huggingface_hub import HfApi

# Pull all image‑classification datasets in one call
api = HfApi()
vision_datasets = api.list_datasets(filter="image-classification")

# Helper:  safe integer attribute extraction
def _get_int_attr(obj, *attr_names, default=0):
    for name in attr_names:
        val = getattr(obj, name, None)
        if isinstance(val, int):
            return val
    return default

# Whitelist of tags / keywords that signal biology/medicine
BIO_MED_TAGS = {
    "biology",
    "bioinformatics",
    "biomed",
    "biomedical",
    "medicine",
    "medical",
    "health",
    "clinical",
    "clinicaltrials",
    "pathology",
    "disease",
    "pharma",
    "pharmaceutical",
    "radiology",
    "ct",
    "mri",
    "xray",
    "ultrasound",
    "imaging",
    "neuroscience",
    "clinical-trials",
}
KEYWORDS = {"bio", "biological", "biomed", "medicine", "medical", "health",
            "xray", "ct", "mri", "ultrasound"}

def is_bio_med(ds):
    # Check official tags (the attribute can be missing or None)
    tags = getattr(ds, "tags", None) or []
    if any(t.lower() in BIO_MED_TAGS for t in tags):
        return True
    # Fallback: check keywords in the dataset id/name
    id_lower = ds.id.lower()
    return any(kw in id_lower for kw in KEYWORDS)

bio_med_datasets = [ds for ds in vision_datasets if is_bio_med(ds)]


# Sort & print the top‑25 by downloads
def print_top_by_downloads(datasets, n=25):
    sorted_ds = sorted(
        datasets,
        key=lambda ds: _get_int_attr(ds, "downloads", "download_count"),
        reverse=True,
    )[:n]

    print(f"\nTop {len(sorted_ds)} Computer‑Vision datasets with BIO/MEAS content by *downloads*")
    for ds in sorted_ds:
        dl = _get_int_attr(ds, "downloads", "download_count")
        print(f"- {ds.id:40}  ➜  {dl:,} downloads")

print_top_by_downloads(bio_med_datasets, n=25)


If the code is correct you should see something similar to the following output

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image01A.png)

Although the code was asked to list the top 25 datasets, only 10 datasets were found that matched the search criteria.
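
The filter-then-rank logic of Example 1 can be tried offline on a few made-up records. The sketch below is only an illustration: the `SimpleNamespace` objects are hypothetical stand-ins for what `api.list_datasets()` returns, and the ids, tags, download counts, and helper names (`looks_bio_med`, `download_count`) are invented for this example.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Hub results; every value here is invented
fake_datasets = [
    SimpleNamespace(id="lab/chest-xray-small", tags=["medical"], downloads=1200),
    SimpleNamespace(id="user/street-signs", tags=["traffic"], downloads=9000),
    SimpleNamespace(id="org/cell-images", tags=["Biology"], downloads=300),
    SimpleNamespace(id="team/mri-scans", tags=None, downloads=None),
]

BIO_TAGS = {"biology", "medical"}   # shortened whitelist for the sketch
ID_KEYWORDS = {"xray", "mri"}       # shortened keyword list for the sketch

def looks_bio_med(ds):
    # Tags may be missing or None, so fall back to an empty list
    tags = getattr(ds, "tags", None) or []
    if any(t.lower() in BIO_TAGS for t in tags):
        return True
    # Fallback: substring search in the dataset id
    return any(kw in ds.id.lower() for kw in ID_KEYWORDS)

def download_count(ds):
    # Treat a missing or non-integer download count as 0
    val = getattr(ds, "downloads", None)
    return val if isinstance(val, int) else 0

matches = [ds for ds in fake_datasets if looks_bio_med(ds)]
ranked = sorted(matches, key=download_count, reverse=True)
ranked_ids = [ds.id for ds in ranked]
print(ranked_ids)
```

Because the id fallback is a plain substring test, very short keywords such as `ct` can also match unrelated ids (for example, an id containing `object`), so the keyword list is worth reviewing when results look off-topic.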

### **Exercise 1: List Hugging Face Datasets**

In the cell below write the code to find the top `biology` datasets. Call your datasets `biology_datasets` instead of `vision_datasets`.

You will need to change the filter code as follows:
```text
biology_datasets = api.list_datasets(filter="biology")
```
You can reuse all of the other code in Example 1 for finding keywords and tags.

You will need to change this line of code:
```text
bio_med_datasets = [ds for ds in vision_datasets if is_bio_med(ds)]
```
to read instead:
```text
bio_med_datasets = [ds for ds in biology_datasets if is_bio_med(ds)]
```
to accommodate your new `biology_datasets` variable.

Finally, don't forget to change the print statement to read:

```text

print(f"\nTop {len(sorted_ds)} Biology datasets with BIO/MEAS content by *downloads*")

```

In [None]:
# Insert your code for Exercise 1 here


If the code is correct you should see something similar to the following output

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image02A.png)

Since there are many more `biology` datasets than `vision` datasets, the list now contains 25 rows.

Obviously, you can change the `keywords` to narrow your search and/or include additional topics.

### Example 2 - Step 1: Download a Dataset

The code in the cell below downloads the `zeroshot/arxiv-biology` dataset. This Hugging Face dataset contains a collection of scientific articles and abstracts related to biology, sourced from the `arXiv repository`. This dataset is designed to support research in computational biology and related fields. It includes various types of research papers, making it a valuable resource for tasks such as text mining, natural language processing, and machine learning in the biological domain.

The dataset is structured to provide researchers with access to a wide range of biological research papers, enabling them to develop and evaluate models for various applications, including information retrieval, semantic analysis, and knowledge extraction.

The code in the cell below stores the `zeroshot/arxiv-biology` dataset in the variable `zshot_dataset`.

In [None]:
# Example 2 - Step 1: Download a dataset

from datasets import load_dataset

# Specify the dataset repository
dataset_id = "zeroshot/arxiv-biology"

# Download the dataset
zshot_dataset = load_dataset(dataset_id)


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image14A.png)

A quick scan of the output reveals the structure of the downloaded dataset. `load_dataset` returns a dictionary-like object (a `DatasetDict`) that holds a `Dataset` object for each split the repository provides; the examples in this lesson use the `train` split. Each record in `zeroshot/arxiv-biology` has three columns: the `id`, the `title` and the `abstract` of an arXiv paper.


### Example 2 - Step 2: Display a Record

The code in the cell below shows how to display a single record using its record number in the dataset. In this code example we specify the first record using the code `RECORD_NUMBER = 0`. (Remember, Python starts counting from 0, not 1).

In [None]:
# Example 2 - Step 2 : Display a record

# Specify record number
RECORD_NUMBER = 0

record = zshot_dataset['train'][RECORD_NUMBER]

# Print each field (column name and value) of the selected record
for key, value in record.items():
    print(f"{key}: {value}\n")


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image08F.png)

### Example 2 - Step 3: Display Labels

The code in this example retrieves the labels (features) of the train dataset and prints them out. In this context, "labels" refer to the names of the columns or features present in the dataset. These labels help you understand the structure of the dataset and what kind of data it contains.

In [None]:
# Example 2 - Step 3: Display labels


# Get the labels (features) of the train dataset
labels = zshot_dataset['train'].features

# Print the labels
print(labels)


If the code is correct, you should see the following output:
~~~text
{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'abstract': Value(dtype='string', id=None)}
~~~

If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image15A.png)

Knowing the labels might be useful in the following contexts:

1. **Understanding Dataset Structure:** By retrieving and printing the labels, you gain insight into the dataset's structure, including the names and types of the columns. This is particularly useful when you are exploring a new dataset and want to understand its contents before performing any analysis.

2. **Feature Engineering:** Knowing the labels helps you identify which features are available in the dataset, allowing you to perform feature engineering tasks such as creating new features, selecting relevant features, or transforming existing features.

3. **Data Preprocessing:** Understanding the labels helps you prepare the data for analysis. For example, you can identify which columns contain numerical data, categorical data, or text data, and apply appropriate preprocessing techniques such as normalization, encoding, or tokenization.

4. **Model Training and Evaluation:** When building machine learning models, knowing the labels helps you specify which columns to use as input features and which column to use as the target variable. This ensures that your model is trained and evaluated correctly.

### Example 2 - Step 4: Convert Hugging Face dataset to DataFrame

Hugging Face can provide datasets in a variety of formats. The following code shows how to receive the `zshot_dataset` as a Pandas DataFrame.

This code snippet sets the conversion type:
~~~text
zshot_dataset.set_format(type='pandas')
~~~
That line sets the format of the dataset to 'pandas'. Once the format is 'pandas', indexing or slicing the dataset returns a Pandas DataFrame, which makes it easier to manipulate and analyze the data.

The actual conversion was performed by this code snippet:

~~~text
zshot_dataset_df = zshot_dataset[:]
~~~

This code snippet converts the dataset to a Pandas DataFrame. The slicing notation [:] is used to convert the entire dataset into the DataFrame. The resulting DataFrame is stored in the variable `zshot_dataset_df`.

In [None]:
# Example 2 - Step 4: Convert Hugging Face dataset to DataFrame

from datasets import load_dataset

# Load the zeroshot/arxiv-biology dataset
zshot_dataset = load_dataset("zeroshot/arxiv-biology", split="train")

# Set the format of the dataset to 'pandas'
zshot_dataset.set_format(type='pandas')

# Convert the dataset to a pandas DataFrame
zshot_dataset_df = zshot_dataset[:]

# Display the first 5 records of the DataFrame
print(zshot_dataset_df.head(5))


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image04A.png)

Once the data is in a DataFrame, we can use the Pandas `apply` function to add derived columns for each observation, as later steps in this example do.


### Example 2 - Step 5: Search Records Using Keywords

The code in the cell below shows how to retrieve records containing one (or more) **keywords**. For this example, the `zshot_dataset_df` DataFrame was searched for the keywords `mitochondria`, `autophagy` and `fasting`.

Here is a brief summary of the code's strategy:

1. **Create Regex Pattern:** The filter_records_by_keywords function creates a regex pattern using the join method to match any of the keywords.

2. **Filter Function:** The filter_records_by_keywords function filters records by checking if any of the keywords are present in the abstract column. The case=False argument makes the search case-insensitive, and na=False handles missing values gracefully.

3. **Filter Records:** Use the filter_records_by_keywords function to extract records with any of the specified keywords in the abstract.

This code allows you to search for multiple keywords in the abstract column and filter the records accordingly.

In [None]:
# Example 2 - Step 5: Search Records Using Keywords

# Function to filter records by multiple keywords in abstract
def filter_records_by_keywords(dataframe, keywords):
    # Create a regex pattern to match any of the keywords
    pattern = '|'.join(keywords)
    return dataframe[dataframe['abstract'].str.contains(pattern, case=False, na=False)]

# Extract records with the keywords
keywords = ['mitochondria', 'autophagy', 'fasting']
filtered_records = filter_records_by_keywords(zshot_dataset_df, keywords)

# Display the filtered records
print(filtered_records)


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image05A.png)

As you can see, our three keywords `mitochondria`, `autophagy` and `fasting` appeared in several abstracts.
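
Because the search pattern is a regular expression built by joining the keywords, a keyword that contains a special character (such as the `.` in `C. elegans`) is interpreted as regex syntax, and a keyword also matches inside longer words. The sketch below shows one possible variant that escapes each keyword and matches whole words only. The function name `filter_by_whole_words`, its `column` parameter, and the toy abstracts are hypothetical and not part of the lesson code.

```python
import re
import pandas as pd

# Invented abstracts, including one missing value
toy_df = pd.DataFrame({
    "abstract": [
        "Fasting alters mitochondrial dynamics in mice.",
        "We study C. elegans locomotion.",
        "Breakfasting habits of students.",
        None,
    ]
})

def filter_by_whole_words(dataframe, keywords, column="abstract"):
    # Escape regex metacharacters so each keyword is matched literally
    escaped = [re.escape(kw) for kw in keywords]
    # \b ... \b restricts matches to whole words; (?: ) is a non-capturing group
    pattern = r"\b(?:" + "|".join(escaped) + r")\b"
    mask = dataframe[column].str.contains(pattern, case=False, na=False, regex=True)
    return dataframe[mask]

hits = filter_by_whole_words(toy_df, ["fasting", "C. elegans"])
print(hits)
```

Whole-word matching is a trade-off: it skips "Breakfasting", but it would also skip "mitochondrial" when the keyword is `mitochondria`, so the plain substring search used in the lesson may be the better choice when word variants should count.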

### Example 2 - Step 6: Display Barplot of Records with Keywords

According to the output above, only 6 records in the `zshot_dataset` included the keywords `mitochondria`, `autophagy` and `fasting`. One way to get an idea of how well a topic is represented in a particular dataset is to generate a barplot of the number of records with, and without, the keywords.

The code in the cell below shows how to generate such a barplot.

In [None]:
# Example 2 - Step 6: Display Barplot of records with keywords

import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

# Load the dataset
zshot_dataset = load_dataset("zeroshot/arxiv-biology", split="train")

# Convert the dataset to a Pandas DataFrame
zshot_df = pd.DataFrame(zshot_dataset)

# Function to filter records by multiple keywords in abstract
def filter_records_by_keywords(dataframe, keywords):
    # Create a regex pattern to match any of the keywords
    pattern = '|'.join(keywords)
    return dataframe[dataframe['abstract'].str.contains(pattern, case=False, na=False)]

# Define the keywords
keywords = ['mitochondria', 'autophagy', 'fasting']

# Filter records with the keywords
filtered_records = filter_records_by_keywords(zshot_df, keywords)

# Count the number of records with and without keywords
count_with_keywords = len(filtered_records)
count_without_keywords = len(zshot_df) - count_with_keywords

# Create a bar plot
labels = ['With Keywords', 'Without Keywords']
counts = [count_with_keywords, count_without_keywords]

plt.figure(figsize=(8, 6))
plt.bar(labels, counts, color=['blue', 'orange'])
plt.xlabel('Category')
plt.ylabel('Number of Records')
plt.title('Number of Records with and without Specific Keywords')
plt.show()


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image06A.png)

This barplot indicates that the keywords 'mitochondria', 'autophagy' and 'fasting' were not found in the vast majority of the records. This is not unexpected. The idea that a cell can literally “eat” its own components dates back a century, but the modern concept of autophagy (self‑digestion as a regulated cellular process) was first formally recognized in the early 1960s.

In 1963, the Belgian biochemist Christian de Duve coined the term autophagy (from Greek αὐτο‑, “self,” + φαγεῖν, “to eat”) and described the double‑membrane autophagic vacuoles that engulf cytoplasmic material in rat liver cells. De Duve’s work laid the foundation for the field and highlighted autophagy as a distinct, purposeful cellular mechanism rather than mere accidental breakdown.

Interest in `autophagy` increased dramatically after the Nobel Prize in Physiology or Medicine was awarded in 2016 to Yoshinori Ohsumi for his pioneering work on the cellular mechanism of autophagy.
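
Returning to the barplot: when one bar is far smaller than the other it can be nearly invisible, so reporting the share of matching records as a percentage is a useful complement. The sketch below shows only the arithmetic; the function name `keyword_share` and the total of 12,000 records are hypothetical values chosen for illustration, not the real size of the dataset.

```python
def keyword_share(count_with, total):
    """Return the percentage of records that contain a keyword."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count_with / total

# Hypothetical counts, used only to illustrate the calculation
share = keyword_share(6, 12000)
print(f"{share:.2f}% of records contain a keyword")
```

In the lesson code, the two arguments would correspond to `count_with_keywords` and `len(zshot_df)`.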

### Example 2 - Step 7: List Keywords by Rank

The code in the cell below extracts keywords from the `title` of each record in the dataset. The code takes advantage of the `Natural Language Toolkit (NLTK)` library to download **stop words**. Stop words are commonly used words in a language (such as "the," "is," "in," etc.) that are often removed during text processing because they don't carry significant meaning on their own.

The `preprocess_text(text)` function performs the following steps:

* **Lowercase the Text:** Converts the text to lowercase to ensure uniformity.
* **Remove Punctuation:** Removes punctuation from the text using the str.translate method.
* **Tokenize the Text:** Splits the text into individual words (tokens) using the split method.
* **Remove Stop Words:** Removes common stop words (e.g., "the," "is," "in") using a predefined set from the NLTK library.
* **Return Words:** Returns the list of processed words.

This code snippet:
```text
# Apply preprocessing to titles
zshot_dataset_df['processed_title'] = zshot_dataset_df['title'].apply(preprocess_text)

```
applies the `preprocess_text` function to the title column of the DataFrame and stores the processed words in a new column named `processed_title`.

To get a ranking, we use this code snippet:

~~~text
# Count word frequencies
word_counts = Counter(all_words)
~~~
This code uses the `Counter` class from the `collections module` to count the frequency of each word in the `all_words` list.

In [None]:
# Example 2- Step 7: List Keywords by Rank

import pandas as pd
from datasets import load_dataset
import nltk
from collections import Counter
import string

# Download stopwords from nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Function to preprocess text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    words = text.split()
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

# Apply preprocessing to titles
zshot_dataset_df['processed_title'] = zshot_dataset_df['title'].apply(preprocess_text)

# Combine all words in the titles into a single list
all_words = [word for title in zshot_dataset_df['processed_title'] for word in title]

# Count word frequencies
word_counts = Counter(all_words)

# Get the most common non-trivial words
most_common_words = word_counts.most_common(20)  # Get the top 20 most common words

# Display the most common words as a list
most_common_words_list = [f"{word}: {count}" for word, count in most_common_words]
for item in most_common_words_list:
    print(item)


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image07A.png)

This is an ordered list, from the most frequent to the least frequent of the common words.
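
The preprocessing and counting steps can also be followed on a few made-up titles using only the Python standard library. In this sketch the small `TOY_STOP_WORDS` set is a hypothetical replacement for the NLTK stop-word list, and the titles are invented.

```python
import string
from collections import Counter

# Hypothetical mini stop-word list standing in for NLTK's English list
TOY_STOP_WORDS = {"the", "of", "in", "a", "for", "and"}

def preprocess_text(text, stop_words=TOY_STOP_WORDS):
    # Lowercase, strip punctuation, split on whitespace, drop stop words
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in stop_words]

# Invented titles
toy_titles = [
    "A Model of the Cell Cycle",
    "Stochastic models in ecology",
    "The cell: a model for signaling",
]

# Flatten the per-title word lists, then count each word
all_words = [word for title in toy_titles for word in preprocess_text(title)]
word_counts = Counter(all_words)
top_two = word_counts.most_common(2)
print(top_two)
```

Note that `model` and `models` are counted as separate words, because no stemming or lemmatization is applied.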

### Example 2 - Step 8: Barplot (Revisited)

By inspection of the output above, it would seem that the word `model` or `models` was a common keyword in the titles. The code in the cell below recreates Example 2 - Step 6 but searches for the keywords `model` or `models` only in the title of the paper.

In [None]:
# Example 2 - Step 8: Barplot (revisited)


from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt

# Function to add 'model' label
def label_model(row):
    title = row['title'].lower()
    if 'model' in title or 'models' in title:
        return 1
    else:
        return 0

# Apply the function to add the 'model' label
zshot_dataset_df['model'] = zshot_dataset_df.apply(label_model, axis=1)

# Count the number of records with and without the 'model' label
count_with_model = zshot_dataset_df['model'].sum()
count_without_model = len(zshot_dataset_df) - count_with_model

# Create a bar plot
labels = ['With "Model"', 'Without "Model"']
counts = [count_with_model, count_without_model]

plt.figure(figsize=(8, 6))
plt.bar(labels, counts, color=['blue', 'orange'])
plt.xlabel('Category')
plt.ylabel('Number of Records')
plt.title('Number of Records with and without "Model" Keywords in Title')
plt.show()


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image14F.png)

The barplot indicates that more papers had the words `model` or `models` than our previous keywords, `mitochondria`, `autophagy` and `fasting`.
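
One detail of the `label_model` test is that `'model' in title` is a substring check: it also matches longer words such as "remodeling" or "modelling", and it already covers every title that contains "models". The sketch below contrasts the substring check with a whole-word regular expression on invented titles; the pattern name `MODEL_WORD` is hypothetical.

```python
import re

# Matches "model" or "models" as whole words, ignoring case
MODEL_WORD = re.compile(r"\bmodels?\b", flags=re.IGNORECASE)

# Invented titles
toy_titles = [
    "A model of protein folding",
    "Remodeling of chromatin",
    "Modelling epidemics",
    "Mixed models for gene expression",
]

# Substring check, as used in label_model
substring_hits = [t for t in toy_titles if "model" in t.lower()]

# Whole-word check
whole_word_hits = [t for t in toy_titles if MODEL_WORD.search(t)]

print(len(substring_hits), len(whole_word_hits))
```

Which behaviour is preferable depends on the question being asked: the substring check counts all word variants, while the whole-word pattern counts only the exact terms.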

----------------------------------

# **Exercises**

### **Exercise 2 - Step 1: Download a Dataset**

In the cell below write the code to read the Hugging Face dataset called `mlfoundations-dev/arxiv_biology_seed_science` and store this data in a variable called `ML_dataset`.

The **mlfoundations-dev/arxiv_biology_seed_science** dataset is **not** a curated collection of plant‑seed papers. Instead, it was pulled automatically from `arXiv’s` “biology” (q‑bio) category and the word `seed` in its name refers to the `random‑seed value` that the authors used to make the download reproducible, not to a subject‑matter filter. The prefix `ML` refers to **machine learning** so the focus is on the use of computational tools for data analysis.

Because of that, the corpus contains everything `arXiv` labels as `biology`—including large swaths of human physiology, genetics, and clinical research—and very few (if any) papers that actually talk about seed biology.

In [None]:
# Insert your code for Exercise 2 - Step 1 here




If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image16A.png)

### **Exercise 2 - Step 2: Display a Record**

In the cell below write the code to display record `100` in your `ML_dataset`.

In [None]:
# Insert your code for Exercise 2 - Step 2 here


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image09A.png)

As you can see from this record, the dataset is focused on the application of computational tools (e.g. signal processing) for the analysis of medical data.

### **Exercise 2 - Step 3: Display Labels**

In the cell below write the code to extract the labels (features) of your train `ML_dataset` and print them out.

In [None]:
# Insert your code for Exercise 2 - Step 3 here



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image10A.png)

### **Exercise 2 - Step 4: Convert Hugging Face dataset to DataFrame**

In the cell below write the code to convert your Hugging Face dataset (`ML_dataset`) into a `Pandas` DataFrame. Call your DataFrame `ML_dataset_df`.  

In [None]:
# Insert your code for Exercise 2 - Step 4 here


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image11A.png)

### **Exercise 2 - Step 5: Search Records Using Keywords**

In the cell below write the code to retrieve records containing the search words `deep learning` and/or `AI`.


In [None]:
# Insert your code for Exercise 2 - Step 5 here




If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image12A.png)

### **Exercise 2 - Step 6: Display Barplot of Records with Keywords**

In the cell below write the code to generate a barplot of records `with` or `without` the keywords `deep learning` and `AI`.

In [None]:
# Insert your code for Exercise 2 - Step 6 here


If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image13A.png)

Apparently, a majority of papers in this dataset have the words `deep learning` and/or `AI` in their records.

### **Exercise 2 - Step 7: List Keywords by Rank**

In the cell below write the code to extract keywords from the `title` of each record in the dataset and print out the ordered list.

In [None]:
# Insert your code for Exercise 2 - Step 7 here



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image13F.png)

### **Exercise 2 - Step 8: Barplot (Revisited)**

By inspection of the output above, it would seem that the word `model` was again a common keyword. In the cell below write the code to search for the keyword `model` in the title of the papers in your `ML_dataset`, and use this data to create a barplot showing the number of papers with and without the word `model` in their title.

In [None]:
# Insert your code for Exercise 2 - Step 8 here



If the code is correct, you should see something similar to the following output:

![__](https://biologicslab.co/BIO1173/images/class_05/class_05_3_image14F.png)

## **Lesson Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_05_3.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas. Make sure your PDF shows a COPY of Class_05_3 that was saved to your GDrive and not the original Colab notebook.

## **Lizard Tail**

## **Sol-20**

![__](https://upload.wikimedia.org/wikipedia/commons/5/5e/Processor_Technology_SOL_20_Computer.jpg)


The **Sol-20** was the first fully assembled microcomputer with a built-in keyboard and television output, what would later be known as a home computer. The design was the integration of an Intel 8080-based motherboard, a VDM-1 graphics card, the 3P+S I/O card to drive a keyboard, and circuitry to connect to a cassette deck for program storage. Additional expansion was available via five S-100 bus slots inside the machine. It also included swappable ROMs that the manufacturer called 'personality modules', containing a rudimentary operating system.

The design was originally suggested by Les Solomon, the editor of Popular Electronics. He asked Bob Marsh of Processor Technology if he could design a smart terminal for use with the Altair 8800. Lee Felsenstein, who shared a garage working space with Marsh, had previously designed such a terminal but never built it. Reconsidering the design using modern electronics, they agreed the best solution was to build a complete computer with a terminal program in ROM. Felsenstein suggested the name "Sol" because they were including "the wisdom of Solomon" in the box.

The Sol appeared on the cover of the July 1976 issue of Popular Electronics as a "high-quality intelligent terminal". It was initially offered in three versions; the Sol-PC motherboard in kit form, the Sol-10 without expansion slots, and the Sol-20 with five slots.

A Sol-20 was taken to the Personal Computing Show in Atlantic City in August 1976 where it was a hit, building an order backlog that took a year to fill. Systems began shipping late that year and were dominated by the expandable Sol-20, which sold for \$1,495 in its most basic fully-assembled form. The company also offered schematics for the system for free for those interested in building their own.

The Sol-20 remained in production until 1979, by which point about 12,000 machines had been sold. By that time, the "1977 trinity" —the Apple II, Commodore PET and TRS-80— had begun to take over the market, and a series of failed new product introductions drove Processor Technology into bankruptcy. Felsenstein later developed the successful Osborne 1 computer, using much the same underlying design in a portable format.

### **History**

**Tom Swift Terminal**

Lee Felsenstein was one of the sysops of Community Memory, the first public bulletin board system. Community Memory opened in 1973, running on an SDS 940 mainframe that was accessed through a Teletype Model 33, essentially a computer printer and keyboard, in a record store in Berkeley, California. The cost of running the system was untenable: the teletype normally cost \$1,500 (their first example was donated by Tymshare as junk), the modem another \$300, and time on the SDS was expensive; in 1968, Tymshare charged \$13 per hour (equivalent to \$114 in 2023). Even the reams of paper output from the terminal were too expensive to be practical, and the system jammed all the time. The replacement of the Model 33 with a Hazeltine glass terminal helped, but it required constant repairs.

Since 1973, Felsenstein had been looking for ways to lower the cost. One of his earliest designs in the computer field was the Pennywhistle modem, a 300 bits per second acoustic coupler that cost a fraction of the price of commercial models. When he saw Don Lancaster's TV Typewriter on the cover of the September 1973 Radio Electronics, he began adapting its circuitry as the basis for a design he called the Tom Swift Terminal. The terminal was deliberately designed to be easy to repair. Combined with the Pennywhistle, it would give users a cost-effective way to access Community Memory.

In January 1975, Felsenstein saw a post on Community Memory by Bob Marsh asking if anyone would like to share a garage. Marsh was designing a fancy wood-cased digital clock and needed space to work on it. Felsenstein had previously met Marsh at school and agreed to split the \$175 rent on a garage in Berkeley. Shortly after, Community Memory shut down for the last time, having burned out the relationship with its primary funding source, Project One, as well as the energy of its founding members.

**Processor Technology**

January 1975 was also the month that the Altair 8800 appeared on the front page of Popular Electronics, sparking intense interest among the engineers of the rapidly growing Silicon Valley. Shortly thereafter, on 5 March 1975, Gordon French and Fred Moore held the first meeting of what would become the Homebrew Computer Club. Felsenstein took Marsh to one of the meetings, where Marsh saw an opportunity in supplying add-on cards for the Altair. In April, he formed Processor Technology with his friend Gary Ingram.

The new company's first product was a 4 kB DRAM memory card for the Altair. A similar card was already available from the Altair's designers, MITS, but it was almost impossible to get working properly. Marsh began offering Felsenstein contracts to draw schematics or write manuals for the products they planned to introduce. Felsenstein was still working on the terminal as well, and in July, Marsh offered to pay him to develop the video portion. This was essentially a version of the terminal where the data would be supplied by the main memory of the Altair rather than a serial port.

The result was the VDM-1, the first graphics card. The VDM-1 could display 16 lines of 64 characters per line, and included the complete ASCII character set with upper- and lower-case characters and a number of graphics characters like arrows and basic math symbols. An Altair equipped with a VDM-1 for output and Processor Technology's 3P+S card running a keyboard for input removed the need for a terminal, yet cost less than dedicated smart terminals like the Hazeltine.
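The display geometry quoted above (16 lines of 64 characters) implies a small, fixed-size screen buffer. The following is a minimal Python sketch of that arithmetic, not a description of the actual VDM-1 hardware: it assumes one byte per character cell and a row-major layout, and the function names `buffer_size`, `cell_offset`, and `render` are hypothetical helpers introduced only for illustration.

```python
# Geometry taken from the text: 16 lines of 64 characters per line.
ROWS = 16
COLS = 64


def buffer_size(rows: int = ROWS, cols: int = COLS) -> int:
    """Bytes needed for one screen, ASSUMING one byte per character cell."""
    return rows * cols


def cell_offset(row: int, col: int) -> int:
    """Offset of a cell in the buffer, ASSUMING a row-major layout."""
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("cell is outside the 16 x 64 screen")
    return row * COLS + col


def render(text: str) -> bytearray:
    """Copy ASCII text into a blank screen buffer, truncating any overflow."""
    screen = bytearray(b" " * buffer_size())
    data = text.encode("ascii")[: len(screen)]
    screen[: len(data)] = data
    return screen


print(buffer_size())        # 1024 bytes for a full screen
print(cell_offset(15, 63))  # 1023, the last cell on the last line
```

Under these assumptions a full screen occupies 1,024 bytes, which suggests why supplying the display data from the computer's main memory, rather than over a serial port, was practical.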

**Intelligent terminal concept**

Before the VDM-1 was launched in late 1975, the only way to program the Altair was through its front-panel switches and LED lamps, or by purchasing a serial card and using a terminal of some sort. This was typically a Model 33, which still cost \$1,500 if available. Normally the teletypes were not available – Teletype Corporation typically sold them only to large commercial customers, which led to a thriving market for broken-down machines that could be repaired and sold into the microcomputer market. Ed Roberts, who had developed the Altair, eventually arranged a deal with Teletype to supply refurbished Model 33s to MITS customers who had bought an Altair.

Les Solomon, whose Popular Electronics magazine launched the Altair, felt a low-cost smart terminal would be highly desirable in the rapidly expanding microcomputer market. In December 1975, Solomon traveled to Phoenix to meet with Don Lancaster to ask about using his TV Typewriter as a video display in a terminal. Lancaster seemed interested, so Solomon took him to Albuquerque to meet Roberts. The two immediately began arguing when Lancaster criticized the design of the Altair and suggested changes to better support expansion cards, demands that Roberts flatly refused. Any hopes of a partnership disappeared.

# 🖥️ Processor Technology SOL-20 Computer

## Overview
The **SOL-20** was one of the earliest complete microcomputers available to consumers, introduced in **1976** by **Processor Technology Corporation**. It was notable for being the **first microcomputer with a built-in keyboard and display interface**, making it a pioneering step toward the personal computer.

---

## 🔧 Technical Specifications

| Feature              | Details |
|----------------------|--------|
| **CPU**              | Intel 8080A @ 2 MHz |
| **RAM**              | 1 KB standard, expandable to 64 KB |
| **ROM**              | 1 KB monitor ROM (SOLOS) |
| **Storage**          | Cassette tape interface (later floppy disk support) |
| **Display**          | Text-only, 16 lines × 64 characters |
| **Keyboard**         | Full QWERTY keyboard |
| **Expansion**        | 5 S-100 bus slots |
| **Ports**            | Serial and parallel interfaces |
| **Power Supply**     | Internal |
| **Case**             | Integrated keyboard and motherboard in a single cabinet |

---

## 🧑‍💻 Software

- **SOLOS**: A simple monitor program stored in ROM, used for basic I/O and program loading.
- **CP/M Compatibility**: With sufficient RAM and disk interface, the SOL-20 could run **CP/M**, a popular operating system for early microcomputers.
- **BASIC Interpreter**: Often loaded from cassette or disk for programming.

---

## 🏛️ Historical Significance

- The SOL-20 was designed by **Lee Felsenstein**, a key figure in early personal computing and a member of the **Homebrew Computer Club**.
- It was one of the first computers to be sold fully assembled, unlike kits such as the Altair 8800.
- Its integrated design (keyboard + video output) influenced later personal computers like the Apple II and IBM PC.

---

## 📦 Models

- **SOL-10**: A lower-cost version without expansion slots.
- **SOL-20**: The full-featured model with 5 S-100 slots and full keyboard.

---

## 📚 Legacy

The SOL-20 helped bridge the gap between hobbyist kits and consumer-ready personal computers. Though Processor Technology went out of business in 1979, the SOL-20 remains a landmark in computing history and is a prized item among vintage computer collectors.

---

## 🔗 References

- Computer History Museum
- Vintage Computer Federation
- OldComputers.net - SOL-20


