# Transformers

Transformers are a type of model that forms the basis of many state-of-the-art LLMs today, such as those behind ChatGPT. They originally focused on NLP tasks but were later extended to areas such as computer vision, audio processing, and more. For sentiment analysis, we are mostly concerned with the NLP side of these models.

Several popular transformer models commonly used for NLP tasks are:

- **BERT (Bidirectional Encoder Representations from Transformers)**: Developed by Google, BERT is designed to understand the context of words in a sentence by looking at them in both directions (left-to-right and right-to-left). It's often used as a base for many NLP tasks.

- **RoBERTa (Robustly Optimized BERT Pretraining Approach)**: This is a variant of BERT developed by Facebook. It modifies BERT's training approach for improved performance.

- **DistilBERT**: This is a smaller, faster, and lighter version of BERT developed by Hugging Face. It retains over 95% of BERT's performance while being about 40% smaller and 60% faster.

- **GPT (Generative Pretrained Transformer)** and **GPT-2**: Developed by OpenAI, these models are designed for tasks that require generating text, but they can also be fine-tuned for text classification tasks.

(Optional, come back to this later) More information on how Transformers work under the hood: https://www.turing.com/kb/brief-introduction-to-transformers-and-their-power

Hugging Face is a company and a platform that focuses on natural language processing (NLP) and provides tools, libraries, and resources to facilitate NLP research, development, and applications. 

On Hugging Face, you can find a set of pretrained Transformer models. 

There are two main ways to use them:
1. Install the models in your environment (via the Transformers Python library) and use them with the `pipeline` function. It is part of the Hugging Face Transformers library and wraps the complex process of applying a transformer model in simple function calls. We can use `pipeline` with various pre-trained transformer models to do different tasks.

2. Use the hosted Inference API, which lets users perform inference (make predictions) with Hugging Face models remotely through web API calls. This avoids the overhead of managing model infrastructure locally.

We will be using the second option, but I also wrote a quick starter notebook for the first option [here](./transformer_starter(Pipline).ipynb).

#### Getting started with the second option:

First, get an API key if you haven't done so: https://huggingface.co/docs/api-inference/index. Then follow the steps below, starting by installing python-dotenv:

```
pip install python-dotenv
```

Create a .env file in your project's directory and add your API key:
```
API_KEY="your_api_key"
```

Then, install the requests library:

```
pip install requests
```



If you are using Google Colab, you can store the key in Colab's "Secrets" panel instead and change the code below (about three lines) accordingly. Colab will walk you through the necessary steps; just use the sidebar on the left side of the screen.

![](https://miro.medium.com/v2/resize:fit:1400/1*5wEevNCOf80GTHwptPTB4g.png)
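
For the Colab route, a minimal sketch of the key-loading step might look like the following. It assumes the secret is saved under the name `API_KEY` (matching the .env example) and that the notebook has been granted access to it. `load_api_key` is a hypothetical helper name, not part of any library. Outside Colab it falls back to the environment variable, which `load_dotenv()` would populate from the .env file.

```python
import os
from typing import Optional


def load_api_key(name: str = "API_KEY") -> Optional[str]:
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Not on Colab: read the environment variable (for example, one loaded from .env).
        return os.getenv(name)
    return userdata.get(name)
```

With this helper, the line `api_key = os.getenv("API_KEY")` could be swapped for `api_key = load_api_key()` in either environment.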


In [76]:
# This is an example taken from the Hugging Face guide. If you are able to run this, you are successfully set up to call the Inference API
import requests
from dotenv import load_dotenv # Change if you are using Colab
import os

# Take environment variables from .env.
load_dotenv()  # Change if you are using Colab
api_key = os.getenv("API_KEY") # Change if you are using Colab
API_URL = "https://api-inference.huggingface.co/models/gpt2"
my_headers = {"Authorization": f"Bearer {api_key}"}

def query(payload):
    response = requests.post(API_URL, headers=my_headers, json=payload)
    return response.json()
data = query("Can you please let us know more details about your ")
data

[{'generated_text': 'Can you please let us know more details about your !!\n\nThankyou!\n\n\nThis review may not be completed.\n\nDownload\n\nThe following product may be unavailable, or at the least that is not listed below in our database'}]

Since we will be querying several different models, let's refactor the code into a higher-order function so that we do not have to define a new query function for every model.

In [77]:
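def api_query_runner(endpoint: str):
    # Returns a function that sends a payload (a string or a list of strings) to the given endpoint
    def run_query(query):
        response = requests.post(endpoint, headers=my_headers, json=query)
        if response.status_code == 200:
            return response.json()
        else:
            # response.text is used because an error body is not guaranteed to be valid JSON
            raise Exception(f"Query failed and returned status code {response.status_code}. {response.text}")
    return run_query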

Now let's load our dataset.

In [78]:
import pandas as pd

RATING_PATH = "../data/clean_ratings.csv"
PROF_PATH = "../data/clean_prof_info.csv"

rating = pd.read_csv(RATING_PATH)
prof = pd.read_csv(PROF_PATH)

## Using the DistilBERT base uncased finetuned SST-2
You can read more descriptions here: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

Model names usually follow the template (base model) base (dataset trained on),
with the dataset the model was fine-tuned on appended at the end, if applicable.

This model performs polarity-based sentiment analysis, similar to what we saw with NLTK VADER: it outputs only a positive and a negative score.

In [79]:
distilBert_sst2_endpoint= "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
s = ["I like you."]
run_query_on_distilBert = api_query_runner(distilBert_sst2_endpoint)
data = run_query_on_distilBert(s)
# The returned data is a list of lists. Since we ran the query with only one input string, the outer list has only one sublist
print(data[0])

[{'label': 'POSITIVE', 'score': 0.9998756647109985}, {'label': 'NEGATIVE', 'score': 0.00012428968329913914}]


If you see an error such as {'error': 'Model distilbert/distilbert-base-uncased-finetuned-sst-2-english is currently loading', 'estimated_time': 20.0}, it means the model is being prepared on the server. Just wait a little and try again.
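
Because `api_query_runner` raises an exception on any non-200 response, the loading message surfaces as an exception carrying that text. A minimal sketch of a retry wrapper follows, assuming the error text contains "currently loading" as in the message above. `retry_while_loading`, the attempt count, and the wait time are illustrative choices, not part of the Hugging Face API.

```python
import time


def retry_while_loading(run_query, payload, max_attempts=5, wait_seconds=20.0, sleep=time.sleep):
    """Call run_query(payload), retrying while the error says the model is loading."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query(payload)
        except Exception as err:
            still_loading = "currently loading" in str(err)
            # Re-raise unrelated errors, and give up after the last attempt.
            if not still_loading or attempt == max_attempts:
                raise
            sleep(wait_seconds)
```

It could be used as `data = retry_while_loading(run_query_on_distilBert, ["I like you."])`. The `sleep` parameter exists only so the waiting can be replaced when experimenting.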

Let's run it on five EECS376 class reviews.
The code below fetches five comments and their comment IDs from the rating dataset for the class EECS376 and queries the comments individually.

In [80]:
eecs376 = rating[rating["class"]=="EECS376"]
first_five_comments= eecs376["comment"][:5]

# The returned data is a list of lists. Since we run each query with only one input string, the outer list has only one sublist,
# hence we use index [0]
sentiment = [[{"ID": index}] + run_query_on_distilBert(comment)[0] for index, comment in first_five_comments.items()]
sentiment

[[{'ID': 14700},
  {'label': 'POSITIVE', 'score': 0.9994785189628601},
  {'label': 'NEGATIVE', 'score': 0.0005215604905970395}],
 [{'ID': 14701},
  {'label': 'NEGATIVE', 'score': 0.9988952875137329},
  {'label': 'POSITIVE', 'score': 0.001104680704884231}]]

Here, our code sends one API call per comment, which is inefficient because each call involves a round trip to the server. Furthermore, we may hit a rate limit if we send a lot of requests at once, though this is unlikely with the number of queries we are making. A better way to call the API is to send the comments in batches. In our case, we will just query with all of the comments in one list.

In [81]:
result = run_query_on_distilBert(first_five_comments.tolist())
# Zip the corresponding ID for each comment back together
for entry, index in zip(result, first_five_comments.index):
    entry.insert(0, {"ID": index})
result


[[{'ID': 14700},
  {'label': 'POSITIVE', 'score': 0.9994785189628601},
  {'label': 'NEGATIVE', 'score': 0.0005215604905970395}],
 [{'ID': 14701},
  {'label': 'NEGATIVE', 'score': 0.9988952875137329},
  {'label': 'POSITIVE', 'score': 0.001104680704884231}]]
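
With many more comments than the five used here, one request holding everything may become too large. A minimal sketch of splitting a list into fixed-size batches follows. `batched` and the batch size of 2 are illustrative assumptions, and the Inference API's actual size limits are not stated here.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder comments, for illustration only.
comments = ["comment a", "comment b", "comment c", "comment d", "comment e"]
batches = list(batched(comments, 2))
# Each batch could then be passed to run_query_on_distilBert and the results concatenated.
```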

On its own, this provides limited information. Instead, we could pick a threshold, or otherwise convert the positive and negative scores into a single label indicating whether the review is positive or negative. We can then aggregate the labels and see how many students liked the course and how many didn't.

Ex: For a course, make an API query with all of the reviews of that course. Then, label each review POSITIVE if its "POSITIVE" score is higher than its "NEGATIVE" score, and NEGATIVE otherwise. Now we have a count of positive and negative reviews.
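
One way the labeling step could be sketched is shown below. It assumes the result format shown in the outputs above (a list of per-review lists, optionally starting with an `{"ID": ...}` entry). The helper names `dominant_label` and `count_labels` are hypothetical.

```python
from collections import Counter


def dominant_label(entry):
    """Return the label with the highest score, skipping items such as {"ID": ...}."""
    scored = [item for item in entry if "label" in item]
    return max(scored, key=lambda item: item["score"])["label"]


def count_labels(results):
    """Count how many reviews fall under each dominant label."""
    return Counter(dominant_label(entry) for entry in results)


# Sample data copied from the output shown earlier.
sample = [
    [{"ID": 14700},
     {"label": "POSITIVE", "score": 0.9994785189628601},
     {"label": "NEGATIVE", "score": 0.0005215604905970395}],
    [{"ID": 14701},
     {"label": "NEGATIVE", "score": 0.9988952875137329},
     {"label": "POSITIVE", "score": 0.001104680704884231}],
]
counts = count_labels(sample)
print(counts["POSITIVE"], counts["NEGATIVE"])  # 1 1
```

Taking the higher of the two scores is the same as using a 0.5 threshold on the POSITIVE score. A stricter threshold would need an extra "uncertain" bucket.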

In [82]:
# TODO Explore this model


## Using the RoBERTa base go emotions
You can read more descriptions here: https://huggingface.co/SamLowe/roberta-base-go_emotions

An overall positive or negative label does not tell us much beyond how well liked the course is in general. We can use models that output emotion predictions for more insight into our data.

In [83]:
roberta_go_emotion_endpoint = "https://api-inference.huggingface.co/models/SamLowe/roberta-base-go_emotions"
s = "I am glad that we have no test"
run_query_on_roberta = api_query_runner(roberta_go_emotion_endpoint)
data = run_query_on_roberta(s)
data

[[{'label': 'joy', 'score': 0.7484319806098938},
  {'label': 'relief', 'score': 0.0860075056552887},
  {'label': 'approval', 'score': 0.07444580644369125},
  {'label': 'neutral', 'score': 0.06166863813996315},
  {'label': 'gratitude', 'score': 0.038480933755636215},
  {'label': 'realization', 'score': 0.017022809013724327},
  {'label': 'admiration', 'score': 0.016435207799077034},
  {'label': 'caring', 'score': 0.016028305515646935},
  {'label': 'disapproval', 'score': 0.011067488230764866},
  {'label': 'annoyance', 'score': 0.009534290060400963},
  {'label': 'pride', 'score': 0.007556593511253595},
  {'label': 'amusement', 'score': 0.0075026764534413815},
  {'label': 'excitement', 'score': 0.007101245224475861},
  {'label': 'optimism', 'score': 0.00500898901373148},
  {'label': 'sadness', 'score': 0.004864911548793316},
  {'label': 'confusion', 'score': 0.0038028969429433346},
  {'label': 'disappointment', 'score': 0.0034991877619177103},
  {'label': 'love', 'score': 0.003065854310989

Consider how this information can be used, and apply it. For example, find the overall emotion students have toward a certain course.
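
As one possible starting point, the per-review emotion scores could be averaged per label across all reviews of a course. The sketch below assumes the output format shown above (one list of label/score dictionaries per review). `mean_emotion_scores` is a hypothetical helper, and the scores in the example are made up.

```python
from collections import defaultdict


def mean_emotion_scores(results):
    """Average each emotion's score across reviews, highest average first."""
    if not results:
        return {}
    totals = defaultdict(float)
    for entry in results:
        for item in entry:
            totals[item["label"]] += item["score"]
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    # A label missing from a review simply contributes zero to that label's total.
    return {label: total / len(results) for label, total in ranked}


# Made-up scores for two reviews, for illustration only.
fake_results = [
    [{"label": "joy", "score": 0.75}, {"label": "relief", "score": 0.25}],
    [{"label": "joy", "score": 0.25}, {"label": "relief", "score": 0.5}],
]
averages = mean_emotion_scores(fake_results)
```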

In [84]:
# TODO Explore this model


## Using the Bart Large CNN
You can read more descriptions here: https://huggingface.co/facebook/bart-large-cnn

This model by Facebook can summarize a large chunk of text (be aware that there is an upper limit on input length, however).
How can we utilize this in our analysis? 
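
One possible way to work within the input limit is to group comments into chunks and summarize each chunk separately. The sketch below is only an assumption about how that could be done: the model's real limit is counted in tokens, so the character budget `max_chars` and the helper name `group_comments` are stand-ins chosen for illustration.

```python
def group_comments(comments, max_chars):
    """Greedily join comments into chunks of at most max_chars characters.

    A single comment longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for comment in comments:
        candidate = f"{current} {comment}" if current else comment
        if current and len(candidate) > max_chars:
            # Adding this comment would exceed the budget, so close the current chunk.
            chunks.append(current)
            current = comment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be sent to the summarization endpoint, and the per-chunk summaries read together.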

Make API calls like above. Hugging Face does a terrible job of documenting and listing out their endpoints (I literally can't find a list). Usually they follow this format:
- ENDPOINT Template: `https://api-inference.huggingface.co/models/<MODEL>`. Replace `<MODEL>` with the model name. E.g.: https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english is the endpoint for the distilbert-base-uncased-finetuned-sst-2-english model.

However, they do not always follow the pattern, and when that happens, it can be annoying to deal with. One workaround I use is to first make a call from the model's page on their website and use the browser's developer tools to inspect the outgoing request and its endpoint URL.

In Chrome: View -> Developer -> Open Developer Tools. All incoming and outgoing requests are listed in the "Network" tab, which records your communication with different servers.

![](https://github.com/MichiganDataScienceTeam/WN2024-RMP/blob/master/notebook/asset/api_workaround.png?raw=true)

I have already done this for Bart Large CNN for your convenience. Apparently, for this model we also have to specify /facebook before the model name.
Endpoint: https://api-inference.huggingface.co/models/facebook/bart-large-cnn
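
The endpoint template above can be captured in a small helper so that the owner prefix is just part of the model ID. This is a minimal sketch: `endpoint_for` and `BASE_URL` are hypothetical names, and the helper only builds the URL string without checking that the model exists.

```python
BASE_URL = "https://api-inference.huggingface.co/models"


def endpoint_for(model_id: str) -> str:
    """Build the endpoint URL from a model ID such as "facebook/bart-large-cnn"."""
    return f"{BASE_URL}/{model_id.strip('/')}"


bart_endpoint = endpoint_for("facebook/bart-large-cnn")
```

The result could be passed straight to `api_query_runner`, for example `run_query_on_bart = api_query_runner(bart_endpoint)`.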


In [85]:
# TODO Explore the model for summarization tasks