<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F1_1_HuggingFace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Introduction to the Hugging Face Transformers Library

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F1_1_HuggingFace.ipynb)


## References

Hugging Face *Quicktour*: https://huggingface.co/docs/transformers/quicktour

Hugging Face *Run Inference with Pipelines tutorial*: https://huggingface.co/docs/transformers/pipeline_tutorial

Hugging Face *NLP Course, Chapter 2*: https://huggingface.co/learn/nlp-course/chapter2/1

## Announcement: Venture Capital Panels

From Kevin Croft (Director of the Kelley Center for Insurance Innovation):

Do you know a student interested in innovation, starting a business, or with great tech solutions who would like to learn from those who’ve done it before?  The Kelley Center for Insurance Innovation invites you and your students to two opportunities to engage with business accelerators and venture capital investors.  Please share this information with your students!
 * On Wednesday, September 27 at 4 pm in Aliber 101 the leaders of the Global Insurance Accelerator and BrokerTech Ventures will form a panel to discuss how accelerators work, the application/acceptance process, the mentorship process, and success stories from their accelerators.   
 * On Wednesday, October 4 at 4 pm in Aliber 101 Wellabe Ventures, ManchesterStory, and Homesteaders Life will form a panel to discuss their approach to private equity investing in innovation partners.  The discussion will include identification/selection of target firms, the evaluation process, goals of the investment, and success stories.

I hope to see you there!

## The Plan

The first thing we're going to do is get comfortable with the Hugging Face *Transformers* library for Python

Eventual goals:
* understand and explain how transformer models work
* create and adapt transformers for a specific purpose

For now:
* learn to *use* existing transformer models
* understand *what* they can do
* get a feel for popular families of models and how they're related

We will dig into the internals later

## What is Hugging Face?

Hugging Face is a private company
* Founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf
* Based in New York City

Provide a popular free, open-source Python library called **transformers** for NLP (and other) tasks

Host *hundreds of thousands of models* that you can use in your own programs

## Installing the transformers module

This is my favored way of installing packages from a Jupyter Notebook

If you have lots of Python distributions installed, it should use the right one

It may take a few minutes, but *you should only have to do this once*

In [1]:
import sys
!{sys.executable} -m pip install transformers
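Since the install only needs to happen once, it can help to check whether the module is already available before running pip again. The following is a minimal sketch using only the standard library. The helper names `is_installed` and `ensure_installed` are made up for this example, and it assumes the pip package name matches the import name (true for `transformers`, not for every package).

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find module_name."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(package: str) -> bool:
    """Run pip only when the package is missing.

    Returns True if pip was run, False if the package was already present.
    Assumption: the pip package name is the same as the import name.
    """
    if is_installed(package):
        return False
    # Same idea as the notebook cell: use the interpreter running this kernel
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    return True


# Report whether transformers is visible to this interpreter
print(is_installed("transformers"))
```

Calling `ensure_installed("transformers")` in a notebook cell would then skip the download on later runs.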



## Using the sentiment analysis pipeline

**Sentiment analysis** attempts to identify the overall feeling intended by the writer of some text

The creators of this model **trained** it on lots of examples of text that were labeled as either *positive* or *negative*

A **pipeline** is a series of steps for performing **inference**
* tokenize and preprocess the input text (more on this later)
* ask the model for a prediction
* post-process the model's result and turn it into something you can use

![full_nlp_pipeline.svg](https://github.com/ericmanley/f23-CS195NLP/blob/main/images/full_nlp_pipeline.svg?raw=1)
image source: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt
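The last of the three steps above, post-processing, can be sketched without any library. A classification model typically produces one raw score (a *logit*) per label, and a common way to turn those into the `label`/`score` dictionaries seen in this notebook is a softmax followed by picking the largest value. The numbers and label names below are hypothetical, and the real pipeline may post-process differently depending on the model, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math


def postprocess(logits, labels):
    """Turn raw model scores into a {'label': ..., 'score': ...} dict."""
    # Softmax: exponentiate (shifted by the max for numerical stability),
    # then normalize so the values sum to 1
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Report the label with the highest probability
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits and labels, for illustration only
example_logits = [-1.2, 2.3]
example_labels = ["NEGATIVE", "POSITIVE"]
print(postprocess(example_logits, example_labels))
```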

We *are* specifying the kind of task: `sentiment-analysis`

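If you do *not* ask for a specific model, the pipeline picks a default one for the task. The cell below passes `model=` explicitly, which is why it returns emotion labels such as `anger`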

The first time you do this, it will have to download the model - this can take some time depending on your network connection

In [3]:
from transformers import pipeline

# Build the pipeline; the model is downloaded the first time this runs
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

# Classify a single piece of text
classifier("I hate how easy it is to build sentiment-aware applications with the transformers library!")

[{'label': 'anger', 'score': 0.7457549571990967}]

**Test it out:** Try changing the input to get different labels/scores

## Working with batches of text

To get classifications of many different examples, pass in a list of strings.

In [None]:
results = classifier(["It's really cool that you can get classifications for a whole batch of text",
                      "I wonder if the rest of the class will be this easy.",
                      "Spolier alert: it won't be."])
print(results)

[{'label': 'admiration', 'score': 0.5409085750579834}, {'label': 'surprise', 'score': 0.7312608957290649}, {'label': 'neutral', 'score': 0.7667409181594849}]


Note that the results come back as a list of dictionaries, so you can manipulate it in the normal ways.

In [None]:
print("The sentence had",results[0]["label"],"sentiment, with a score of",results[0]["score"])

The sentence had admiration sentiment, with a score of 0.5409085750579834
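Because the results are ordinary Python lists and dictionaries, the usual tools apply: `zip`, `max`, list comprehensions, and so on. The sketch below copies the batch inputs and the printed batch output from above into literals so it runs without loading a model. The 0.6 threshold is an arbitrary value chosen for illustration.

```python
# Inputs and outputs copied from the batch example above
texts = [
    "It's really cool that you can get classifications for a whole batch of text",
    "I wonder if the rest of the class will be this easy.",
    "Spolier alert: it won't be.",
]
results = [
    {"label": "admiration", "score": 0.5409085750579834},
    {"label": "surprise", "score": 0.7312608957290649},
    {"label": "neutral", "score": 0.7667409181594849},
]

# Pair each input text with its prediction
for text, result in zip(texts, results):
    print(f"{result['label']:>10}  {result['score']:.3f}  {text}")

# Find the prediction the model was most confident about
most_confident = max(results, key=lambda r: r["score"])

# Keep only the labels whose score clears an (arbitrary) threshold
confident_labels = [r["label"] for r in results if r["score"] > 0.6]

print(most_confident["label"], confident_labels)
```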


## Exercise: Specifying a model

Now try asking for a specific model.

Replace one line of code in your earlier example.

In [None]:
classifier = pipeline("sentiment-analysis", model="SamLowe/roberta-base-go_emotions")

How is this model different from the first model?


Create a cell in this notebook and note the differences you see

The first model labels a sentence only as positive or negative, depending on its words

The second model reports the emotion it detects in a sentence, such as anger or admiration

## Applied Exploration

The `roberta-base-go_emotions` model is documented here: https://huggingface.co/SamLowe/roberta-base-go_emotions

Answer some questions about this:
* What is `roberta-base`? Write down some things you can learn about it from the documentation.
* What is `go_emotions`? Write down some things you can learn about it from the documentation.

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* Try some additional models
    - test out at least one more sentiment/emotions model
    - test out at least two other kinds of models - like news topic classification or spam detection
    - write down some info about the models you found
        - what is it for?
        - who made it?
        - what kind of data was it trained on?
        - are they based on some other model and trained on new data (*fine-tuned*) for a specific task?
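One of the questions above, who made a model, can often be partly answered from the model ID itself. IDs such as `SamLowe/roberta-base-go_emotions` have the form `owner/name`, and the model's page lives at `https://huggingface.co/` followed by the ID. The helper `split_model_id` below is a made-up name for this sketch. It only splits the string, so the model page is still the place to look for training data and fine-tuning details.

```python
def split_model_id(model_id: str):
    """Split a model ID into (owner, name).

    Some IDs, such as "roberta-base", have no owner prefix;
    in that case owner is returned as None.
    """
    owner, sep, name = model_id.partition("/")
    if not sep:
        return None, model_id
    return owner, name


# Model IDs that appear in this notebook
model_ids = [
    "SamLowe/roberta-base-go_emotions",
    "nlptown/bert-base-multilingual-uncased-sentiment",
    "martin-ha/toxic-comment-model",
]

for model_id in model_ids:
    owner, name = split_model_id(model_id)
    print(f"owner={owner}  name={name}  page=https://huggingface.co/{model_id}")
```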

What is roberta-base? Write down some things you can learn about it from the documentation.
*   This model is case-sensitive
*   The model was pretrained with masked language modeling (MLM)



What is go_emotions? Write down some things you can learn about it from the documentation.




In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
results = pipe(["hassen"])
print(results)

[{'label': '1 star', 'score': 0.2964731454849243}]
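The label here is a string like `'1 star'`, which is awkward to average or sort. A small helper can turn it into a number. This sketch assumes every label follows the pattern of a digit followed by `star` or `stars`, as in the output above; `stars_from_label` is a made-up name for this example.

```python
def stars_from_label(label: str) -> int:
    """Parse a label such as '1 star' or '4 stars' into an integer."""
    # The first whitespace-separated token is the star count
    count, _unit = label.split(maxsplit=1)
    return int(count)


# Output copied from the cell above
results = [{"label": "1 star", "score": 0.2964731454849243}]

# Convert every prediction in the batch to a numeric rating
ratings = [stars_from_label(r["label"]) for r in results]
print(ratings)
```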


what is it for?
*   This model rates the sentiment of text in several languages as a number of stars

In [None]:
from transformers import pipeline

pipe_tox = pipeline("text-classification", model="martin-ha/toxic-comment-model")

In [None]:
resultx = pipe_tox([""])
print(resultx)

[{'label': 'toxic', 'score': 0.9411988854408264}]


what is it for?

* This model classifies comments as toxic or non-toxic, including hateful comments about people, races, and genders

who made it?

* It is published on the Hugging Face Hub under the `martin-ha` account, as the model ID shows

what kind of data was it trained on?

* The training data comes from a Kaggle competition. The model's documentation says 10% of the `train.csv` data was used to train the model.

are they based on some other model and trained on new data (fine-tuned) for a specific task?
* This model is a fine-tuned version of the DistilBERT model to classify toxic comments.

