![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* Transfer - Model Search

This exercise is adapted from the [Hugging Face Educational Toolkit](https://github.com/huggingface/education-toolkit) (01-huggingface-hub-tour.md).

(40 minutes)

**Goal:** Learn how to use the free [Hub platform](http://hf.co) efficiently so you can collaborate across the ecosystem and within teams on Machine Learning (ML) projects.

Learning goals:

- Learn about and explore the over 30,000 models shared on the Hub.
- Learn efficient ways to find suitable models and datasets for your task.
- Learn how to contribute and work collaboratively.
- Explore ML demos created by the community.

**Format:** Either short lab or take-home

**Audience:** Students from any background interested in using existing models or sharing their models.

**Prerequisites**

- High-level understanding of Machine Learning.
- (Optional, but encouraged) Experience with Git ([resource](https://learngitbranching.js.org/))

***
## A Tour of the Hugging Face Hub

## 1. Why the Hub?

The Hub is a central platform where anyone can share and explore models, datasets, and ML demos. The "solve AI" problem won't be solved by a single company, but by a culture of sharing knowledge and resources. Because of this, the Hub aims to build the most extensive collection of Open Source models, datasets, and demos.

Here are some facts about the Hugging Face Hub:

- There are over 30,000 public models.
- There are models for Natural Language Processing, Computer Vision, Audio/Speech, and Reinforcement Learning!
- There are models for over 180 languages.
- Any ML library can leverage the Hub: from TensorFlow and PyTorch to advanced integrations with spaCy, SpeechBrain, and 20 other libraries.


***
## 2. Exploring a model

Let’s kick off the exploration of models. You can access over 30,000 models at [hf.co/models](http://hf.co/models). You will see [gpt2](https://huggingface.co/gpt2) as one of the models with the most downloads. Let’s click on it.

When you click a model, the website takes you to its model card. A model card documents the model: it provides helpful information about it and is essential for discoverability and reproducibility.

The interface has many components, so let’s go through them:

[https://www.youtube.com/watch?v=XvSGPZFEjDY&feature=emb_imp_woyt](https://www.youtube.com/watch?v=XvSGPZFEjDY&feature=emb_imp_woyt)

- At the top, you can find different **tags** for things such as the task (*text generation, image classification*, etc.), frameworks (*PyTorch*, *TensorFlow*, etc.), the model’s language (*English*, *Arabic*, *etc.*), and license (*e.g. MIT*).

![](./images/mode_card_tags.png)

- In the right column, you can play with the model directly in the browser using the *Inference API*. GPT2 is a text generation model, so it will generate additional text given an initial input. Try typing something like, “It was a bright and sunny day.”

![](./images/model_card_inference_api.png)

- In the middle, you can go through the model card content. It has sections such as Intended uses & limitations, Training procedure, and Citation Info.


![](./images/model_card_content.png)

Where does all this data come from? At Hugging Face, everything is based on **Git repositories** and is open-sourced. You can click the “Files and versions” tab to see all the repository files, including the model weights. The model card is a Markdown file (**`README.md`**) that contains metadata, such as the tags, in addition to the content.

![](./images/model_card_git.png)

Since all models are Git-based repositories, you get version control out of the box. Just as with GitHub, you can do things such as Git cloning, adding, committing, branching, and pushing. If you’ve never used Git before, we suggest the following [resource](https://learngitbranching.js.org/).
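Because every model is a Git repository, each file in it also has a stable web address. The sketch below builds those addresses using only the Python standard library. It assumes the Hub's `resolve/<revision>/<filename>` URL layout and a default branch named `main`; the helper names `clone_url` and `file_url` are hypothetical and are not part of any Hugging Face library.

```python
from urllib.parse import quote

HUB = "https://huggingface.co"


def clone_url(repo_id: str) -> str:
    """Return the URL you would pass to `git clone` for a model repository."""
    return f"{HUB}/{repo_id}"


def file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Return a direct download URL for one file at a branch, tag, or commit.

    Assumption: files are served under `resolve/<revision>/<filename>`.
    """
    # Encode the revision fully so names containing "/" stay in one path segment.
    return f"{HUB}/{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"


print(clone_url("gpt2"))
print(file_url("gpt2", "config.json"))
```

Passing a tag or commit hash as `revision` points at that exact version of the file, which is what makes Git-based hosting useful for reproducibility.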

## Questions

Open the `config.json` file of the GPT2 repository. The config file contains hyperparameters as well as useful information for loading the model. From this file, answer:

#### Q1: What is the activation function?

#### Q2: What is the vocabulary size?
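If you download `config.json` from the repository, you can also read it with Python's built-in `json` module instead of scanning it by eye. The sketch below runs on an inline sample whose keys and values are made up for illustration (they are not GPT2's real settings, so the answers are not given away); `summarize_config` is a hypothetical helper.

```python
import json

# Hypothetical sample for illustration only; NOT the real GPT2 config.
SAMPLE_CONFIG = """
{
  "model_type": "example",
  "activation_function": "relu",
  "vocab_size": 1000,
  "n_layer": 2
}
"""


def summarize_config(text, keys):
    """Parse config JSON text and return only the requested keys.

    Keys that are missing from the file map to None instead of raising.
    """
    config = json.loads(text)
    return {key: config.get(key) for key in keys}


summary = summarize_config(SAMPLE_CONFIG, ["activation_function", "vocab_size"])
print(summary)
```

To run it on a real file, read the downloaded file's text (for example with `pathlib.Path("config.json").read_text()`) and pass that in place of `SAMPLE_CONFIG`.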

***
## 3. Exploring Models

So far, we’ve explored a single model. Let’s go wild! On the left of [https://huggingface.co/models](https://huggingface.co/models), you can filter for different things:

- **Tasks:** There is support for dozens of tasks in different domains: Computer Vision, Natural Language Processing, Audio, and more. You can click the +13 to see all available tasks.
- **Libraries:** Although the Hub was originally for `transformers` models, it now integrates with dozens of libraries. You can find models for Keras, spaCy, AllenNLP, and more.
- **Datasets:** The Hub also hosts thousands of datasets, as you’ll learn later.

![](./images/model_card_filters.png)

- **Languages:** Many of the models on the Hub are NLP-related. You can find models for hundreds of languages, including low-resource languages.
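The filters you pick are reflected in the page address, so a filtered search can be saved or shared as a link. The sketch below builds such a link with the standard library. The query parameter names (`pipeline_tag`, `language`, `library`, `sort`) are assumptions based on the addresses the website has produced and may change; `models_search_url` is a hypothetical helper, and the translation/French example is arbitrary.

```python
from urllib.parse import urlencode

BASE = "https://huggingface.co/models"


def models_search_url(task=None, language=None, library=None, sort=None):
    """Build a models-page URL with the given filters applied.

    Assumption: the site reads these query parameter names; unset filters
    are left out of the URL entirely.
    """
    params = {
        "pipeline_tag": task,
        "language": language,
        "library": library,
        "sort": sort,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}?{query}" if query else BASE


print(models_search_url(task="translation", language="fr", sort="downloads"))
```

Opening the printed address in a browser should show the same results as clicking the matching filters by hand, provided the assumed parameter names still hold.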

## Questions

#### Q3: How many token classification models are there in English?

#### Q4: If you had to pick a Spanish model for Automatic Speech Recognition, which would you choose?
You can select any model for this task and language, but explain your reasons.


***
## 4. Datasets

In ML pipelines, you usually need a dataset to train the model. The Hub hosts around 3000 open-source datasets that are free to use across multiple domains. In addition, the open-source `datasets` [library](https://huggingface.co/docs/datasets/) makes these datasets easy to use, including huge ones, through convenient features such as streaming. This lab won't go through the library, but it does explain how to explore the datasets on the Hub.

As with models, you can head to [https://hf.co/datasets](https://hf.co/datasets). On the left, you can find different filters based on the task, license, and size of the dataset.

Let’s explore the [GLUE](https://huggingface.co/datasets/glue) dataset, which is a famous dataset used to test the performance of NLP models.

- Similar to model repositories, you have a dataset card that documents the dataset. If you scroll down a bit, you will find things such as the summary, the structure, and more.

![](./images/datasets_card.png)

- At the top, you can explore a slice of the dataset directly in the browser. The GLUE dataset is divided into multiple sub-datasets (or subsets) that you can select, such as CoLA and QNLI.

  ![](./images/datasets_slices.png)

- On the right of the dataset card, you can see a list of models trained on this dataset.

![](./images/datasets_models_trained.png)

## Questions
Search for the Common Voice dataset. Answer these questions:

#### Q5: What tasks can the Common Voice dataset be used for?
#### Q6: How many languages are covered in this dataset?
#### Q7: What are the dataset splits?


***
## 5. ML Demos

Sharing your models and datasets is great, but creating an interactive, publicly available demo is even cooler. Demos of models are an increasingly important part of the ecosystem. Demos allow you to:

- **present** your work to a wide audience, such as in stakeholder presentations, conferences, and course projects
- increase **reproducibility** in machine learning by lowering the barrier to testing a model
- share **the impact of a model** with a non-technical audience
- build a machine learning **portfolio**

There are open-source Python frameworks, such as Gradio and Streamlit, that make building these demos very easy, and tools such as Hugging Face [Spaces](http://hf.co/spaces/launch) that let you host and share them.
