# Transformers

Transformers are a type of model architecture that forms the basis of many state-of-the-art LLMs today, such as the models behind ChatGPT. They were originally designed for NLP tasks but were later extended to areas such as computer vision and audio processing. For sentiment analysis, we are concerned with the NLP side of these models.

Several popular transformer models commonly used for NLP tasks are:

- **BERT (Bidirectional Encoder Representations from Transformers)**: Developed by Google, BERT is designed to understand the context of words in a sentence by looking at them in both directions (left-to-right and right-to-left). It's often used as a base for many NLP tasks.

- **RoBERTa (Robustly Optimized BERT Pretraining Approach)**: This is a variant of BERT developed by Facebook. It modifies BERT's training approach for improved performance.

- **DistilBERT**: This is a smaller, faster, and lighter version of BERT developed by Hugging Face. According to its authors, it retains over 95% of BERT's performance on the GLUE benchmark while having 40% fewer parameters and running 60% faster.

- **GPT (Generative Pretrained Transformer)** and **GPT-2**: Developed by OpenAI, these models are designed for tasks that require generating text, but they can also be fine-tuned for text classification tasks.

Hugging Face is a company and a platform that focuses on natural language processing (NLP) and provides tools, libraries, and resources to facilitate NLP research, development, and applications. 

On Hugging Face, you can find a set of pretrained Transformer models. 

There are two main ways to use them:
1. Install the models in your environment and use them with the `pipeline` function, which is part of the Hugging Face `transformers` library. It wraps the complex process of applying a transformer model in simple function calls and helps apply various pretrained transformer models to different tasks.

2. Use the hosted Inference API, which lets users perform inference (make predictions) with Hugging Face models remotely through web API calls. It avoids the overhead of managing model infrastructure locally.
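
For comparison, the first option can be sketched as follows. This is a minimal sketch, assuming the `transformers` package and a backend such as PyTorch are installed. `classify_locally` and its `classifier` parameter are hypothetical names introduced here for illustration, and the model ID is the DistilBERT sentiment model used later in this notebook. The first real call downloads the model weights.

```python
def classify_locally(texts, classifier=None):
    """Classify texts with a local sentiment pipeline.

    `classifier` is any callable mapping a list of strings to a list of
    {"label": ..., "score": ...} dicts. If omitted, a Hugging Face pipeline
    is created (assumes `transformers` is installed).
    """
    if classifier is None:
        # Imported lazily so the function can be defined without transformers installed
        from transformers import pipeline

        classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )
    return [(result["label"], result["score"]) for result in classifier(texts)]


# Example call (downloads the model on first use):
# classify_locally(["I like you. I love you"])
```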

We will be using the second option.

First, get an API key if you haven't done so already: https://huggingface.co/docs/api-inference/index. Then follow the steps below:

```
pip install python-dotenv
```

Create a .env file in your project's directory and add your API key:
```
API_KEY=your_api_key
```
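
Before calling the API, it can help to fail early when the key is missing rather than send an unauthenticated request. The sketch below assumes the key has already been loaded into the environment under the name `API_KEY` (for example by `load_dotenv()`). `build_auth_headers` is a hypothetical helper, not part of any library.

```python
import os


def build_auth_headers(var_name: str = "API_KEY") -> dict:
    """Build the Authorization header from an environment variable."""
    api_key = os.getenv(var_name)
    if not api_key:
        # A missing key would otherwise produce the header "Bearer None"
        raise RuntimeError(f"{var_name} is not set; check your .env file")
    return {"Authorization": f"Bearer {api_key}"}
```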



In [66]:
# This example is taken from the Hugging Face guide. If it runs, you have successfully set up access to the Inference API.
# Endpoint template: https://api-inference.huggingface.co/models/<MODEL_ID>. Change the model ID to use a different model.
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # load environment variables from .env
api_key = os.getenv("API_KEY")
API_URL = "https://api-inference.huggingface.co/models/gpt2"
my_headers = {"Authorization": f"Bearer {api_key}"}


def query(payload):
    # POST the payload as JSON and return the decoded JSON response
    response = requests.post(API_URL, headers=my_headers, json=payload)
    return response.json()


data = query("Can you please let us know more details about your ")
data

[{'generated_text': 'Can you please let us know more details about your iphone so we can get some more information?"\n\n"Just say yes, I\'m open to any suggestions."\n\nTia looks at me worried.\n\n"That\'s fine'}]
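
The plain string above is the simplest payload. The Inference API documentation also describes a dictionary form with an `inputs` field plus optional `parameters` and `options`. The sketch below only builds such a payload and sends nothing. `build_payload` is a hypothetical helper, and the specific settings shown (`max_new_tokens`, `wait_for_model`) are assumptions taken from that documentation that may not apply to every model.

```python
def build_payload(text, parameters=None, options=None):
    """Build a dictionary payload for the hosted Inference API."""
    payload = {"inputs": text}
    if parameters:
        payload["parameters"] = parameters  # model-specific generation settings
    if options:
        payload["options"] = options  # request-level settings
    return payload


payload = build_payload(
    "Can you please let us know more details about your ",
    parameters={"max_new_tokens": 20},
    options={"wait_for_model": True},
)
# The result can be passed to the query function defined above: query(payload)
```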

Since we will be querying several different models, let's refactor the code into a higher-order function so that we do not have to define a new query function for every model.

In [67]:
def api_query_runner(endpoint: str):
    """Return a function that sends queries to the given model endpoint."""
    def run_query(query: str):
        response = requests.post(endpoint, headers=my_headers, json=query)
        if response.status_code == 200:
            return response.json()
        # Use response.text here: an error body is not guaranteed to be valid JSON
        raise Exception(f"Query failed and returned status code {response.status_code}. {response.text}")
    return run_query

Now let's load our dataset.

In [68]:
import pandas as pd
import matplotlib.pyplot as plt

RATING_PATH = "../data/clean_ratings.csv"
PROF_PATH = "../data/clean_prof_info.csv"

rating = pd.read_csv(RATING_PATH)
prof = pd.read_csv(PROF_PATH)

## Using the DistilBERT base uncased finetuned SST-2 model
You can read the model description here: https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english

This model performs polarity-based sentiment analysis, similar to what we saw with NLTK VADER. It outputs only a positive and a negative score, each expressed as a value between 0 and 1.

In [69]:
distilBert_sst2_endpoint = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
s = "I like you. I love you"
run_query_on_distilBert = api_query_runner(distilBert_sst2_endpoint)
data = run_query_on_distilBert(s)
print(data)

[[{'label': 'POSITIVE', 'score': 0.9998738765716553}, {'label': 'NEGATIVE', 'score': 0.0001261125726159662}]]
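
The response is a nested list: one inner list per input, holding one dictionary per label. A small helper can pull out the highest-scoring label. This is a sketch based only on the output shape shown above, and `top_label` is a hypothetical name.

```python
def top_label(response):
    """Return (label, score) for the highest-scoring label of a single-input response."""
    label_scores = response[0]  # scores for the first (and only) input
    best = max(label_scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# The response printed above
sample = [[{'label': 'POSITIVE', 'score': 0.9998738765716553},
           {'label': 'NEGATIVE', 'score': 0.0001261125726159662}]]
print(top_label(sample))
```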


If the query fails with a message such as {'error': 'Model distilbert/distilbert-base-uncased-finetuned-sst-2-english is currently loading', 'estimated_time': 20.0}, the model is still being loaded on the server. Wait a little and try again.
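
Waiting and retrying can be automated. The sketch below wraps any query function and retries after a fixed delay when it raises. `with_retry`, `max_attempts`, and `wait_seconds` are hypothetical names, and retrying on every exception is a simplification (a stricter version would retry only on the loading error).

```python
import time


def with_retry(run_query, max_attempts=3, wait_seconds=20.0, sleep=time.sleep):
    """Wrap run_query so failed calls are retried after a fixed delay."""
    def run_with_retry(query):
        for attempt in range(1, max_attempts + 1):
            try:
                return run_query(query)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                sleep(wait_seconds)  # give the model time to load
    return run_with_retry
```

For example, `with_retry(api_query_runner(distilBert_sst2_endpoint))` would return a query function that tries up to three times before giving up.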

Let's run it on the EECS376 class reviews. Running it on every review may take some time, so we start with the first five comments.

In [70]:
eecs376_comments = rating[rating["class"]=="EECS376"]["comment"]
first_five = eecs376_comments[:5]
sentiment = [run_query_on_distilBert(comment) for comment in first_five]
sentiment

[[[{'label': 'POSITIVE', 'score': 0.9994785189628601},
   {'label': 'NEGATIVE', 'score': 0.0005215604905970395}]],
 [[{'label': 'NEGATIVE', 'score': 0.9988952875137329},
   {'label': 'POSITIVE', 'score': 0.001104680704884231}]],
 [[{'label': 'POSITIVE', 'score': 0.9998089671134949},
   {'label': 'NEGATIVE', 'score': 0.00019110905122943223}]],
 [[{'label': 'NEGATIVE', 'score': 0.9997746348381042},
   {'label': 'POSITIVE', 'score': 0.00022532072034664452}]],
 [[{'label': 'POSITIVE', 'score': 0.9995610117912292},
   {'label': 'NEGATIVE', 'score': 0.0004389724927023053}]]]
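
To line the results up with the comments, the nested output can be reduced to a single number per review. The sketch below keeps only the `POSITIVE` score of each response. `positive_scores` is a hypothetical helper, and the sample data is the first two results above with scores rounded for brevity.

```python
def positive_scores(sentiments):
    """Return the POSITIVE score for each single-input API response."""
    scores = []
    for response in sentiments:
        # Each response wraps one list of {"label", "score"} dictionaries
        by_label = {item["label"]: item["score"] for item in response[0]}
        scores.append(by_label["POSITIVE"])
    return scores


# First two results from the run above, with scores rounded
sample = [
    [[{"label": "POSITIVE", "score": 0.9995}, {"label": "NEGATIVE", "score": 0.0005}]],
    [[{"label": "NEGATIVE", "score": 0.9989}, {"label": "POSITIVE", "score": 0.0011}]],
]
print(positive_scores(sample))
```

The returned list has the same order as the input comments, so it could be stored next to `first_five` for later comparison.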