# Lecture 6 -- Pre-trained Models

## Outline

* in many cases, training your own model may be the best thing you can do as you have full control over the data, training, and the evaluation process.

* however, this lecture will go over pre-trained models and more generally, models that are freely available for you to use.

* pre-trained and freely available models have become incredibly common and are often a good starting point for experimentation and more.

* we will go over a few examples of pre-trained models and how to use them, including models for:
    * sentiment analysis
    * summarization
    * named entity recognition

## Generalization vs Specialization

* in the prior chapter we talked about how for some tasks we need to train our own model because we want to define the annotation criteria or we are in a very niche sub-domain.

* however, using pre-trained models that are general to a task can also be quite useful -- especially considering that they were likely trained with more data than the dataset you are working with

## Transformers and Huggingface

* We have introduced and used transformers, a model architecture

* huggingface is a company which builds tools around various model architectures to make them easily shareable

* for example, here is a link to their website in which you can search and browse for available models: https://huggingface.co/models
* most models are available for download and use (a few are gated and require accepting terms first), and in many cases a model has some documentation attached explaining how to use it in your own work

## Sentiment Analysis

* sentiment analysis is a task in which we want to predict the sentiment of a text
* sentiment is often defined as positive, negative, or neutral
* for example, the phrase: "I love this movie" would be positive and "I hate this movie" would be negative

* sentiment analysis is a very common task and there are many datasets available for it
* for example, here is a dataset of movie reviews: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

* there are many models available for sentiment analysis on huggingface (and elsewhere) but for our examples, we will use:
"cardiffnlp/twitter-roberta-base-sentiment-latest" (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)
  * according to its model card, this model was pre-trained on roughly 124 million tweets and then fine-tuned for sentiment analysis on the TweetEval benchmark

* let's load the model and try it on a few examples

In [None]:
from transformers import pipeline

model = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

# Predict sentiment of the statement
model("I am happy to see that other members of the council are supporting this important legislation.")

* we can see that for the phrase `"I am happy to see that other members of the council are supporting this important legislation."` the model predicts a positive sentiment and gives a score

* you can interpret the `score` value as the model's confidence in the prediction: it is the probability the model assigns to the predicted label
* the higher the score, the more confident the model is in the prediction (in this case, the model is very confident that the sentiment is positive), but a high score does not guarantee that the prediction is correct

* let's try a few more examples

In [None]:
# Give the model multiple examples at once by providing a list
model([
    "I urge the members of the committee to vote against this bill.",
    (
        "If the amendment from Councilmember Brown passes, I will vote in support of this legislation. "
        "But without the amendment, I cannot agree to move forward with this legislation."
    ),
])

* This time we can see that the model is confident that the sentiment of the first example (`"I urge the members of the committee to vote against this bill."`) is negative and that the second example (the two sentences about Councilmember Brown's amendment, ending with `"But without the amendment, I cannot agree to move forward with this legislation."`) is neutral.

* This makes sense to me, as the second example is specifically discussing conditions of agreement or disagreement which places it somewhere in a neutral category.
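
To make the `score` values above less of a black box, here is a minimal sketch of how a single-label classifier usually turns its raw outputs (logits) into a label and a score using softmax. The logits and their order are made up for illustration; real values come from the model, and this is not a copy of the pipeline's internal code.

```python
import math


def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit avoids overflow without changing the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and logits for one input sentence (assumed values)
labels = ["negative", "neutral", "positive"]
logits = [-1.2, 0.3, 2.9]

probs = softmax(logits)

# The reported prediction is the label with the highest probability
best = max(range(len(labels)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

Because the probabilities always sum to 1, a score near 1.0 only says that the other labels looked unlikely to the model. It does not say how often the model is actually right on text like yours.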

### Limitations and Warnings

* Sentiment can be subjective and it is important to understand the limitations of the model and the data it was trained on, and who annotated the data.

* The model we used for sentiment was trained using Twitter data -- Twitter posts are quite different from meeting discussion

* this can affect the model's predictions, so we should always evaluate the performance of even off-the-shelf models on our own data

* in general, when using a model, be sure to read how it was trained, what data it used for training, and understand the possible biases and problems that come from using a model trained by someone else

* try to find a model that was trained on data close to yours for the best results
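
As a rough sketch of what evaluating an off-the-shelf model can look like, you can hand-label a small sample of your own data and compare the labels against the model's predictions. The labels below are invented for illustration, and `evaluate` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def evaluate(predicted, expected):
    """Return overall accuracy and a Counter of (expected, predicted) mistakes."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one labeled example")

    correct = sum(p == e for p, e in zip(predicted, expected))
    mistakes = Counter((e, p) for p, e in zip(predicted, expected) if p != e)
    return correct / len(expected), mistakes


# Hypothetical hand labels and model predictions for five meeting sentences
expected = ["negative", "neutral", "positive", "neutral", "negative"]
predicted = ["negative", "neutral", "positive", "negative", "neutral"]

accuracy, mistakes = evaluate(predicted, expected)
print(f"accuracy: {accuracy:.2f}")
for (true_label, model_label), count in mistakes.most_common():
    print(f"labeled {true_label!r} but predicted {model_label!r}: {count}")
```

With the pipeline from earlier, the predicted labels could be collected with something like `[result["label"] for result in model(texts)]`. Looking at which kinds of mistakes are most common is often more informative than the single accuracy number.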

## Summarization

* summarization can be an important task for processing long documents such as multi-hour meeting transcripts

* there can be direct reader benefits by trying to extract or explain the main points of a meeting for example

* using generated summaries can also be useful for other tasks such as information retrieval (e.g. instead of searching across the whole meeting, you can search across the summary), or for other situations in which you simply can't fit the whole meeting into the model's input limit but can fit the summary

* the two main types of summarization are extractive and abstractive

* extractive summarization is when you extract sentences from the original document and combine them to form a summary

* abstractive summarization is when you generate new sentences that are not in the original document to form a summary

* someone has already taken the time to train a model for meeting dialogue summarization: https://huggingface.co/knkarthick/MEETING_SUMMARY

* let's use this model to generate a summary for a meeting

* first let's pull down some data and prep it for the model

In [1]:
from cdp_data import CDPInstances, datasets
import pandas as pd

sessions = datasets.get_session_dataset(
    CDPInstances.Seattle,
    start_datetime="2022-05-01",
    end_datetime="2022-05-03",
    store_transcript=True,
    store_transcript_as_csv=True,
)

single_session_transcript = pd.read_csv(sessions.iloc[0].transcript_as_csv_path)
len(single_session_transcript)

855

In [9]:
# Because our meetings are incredibly long,
# let's create strings with different portions of the meeting
# to pass to the model
parts = []
n_sentences_per_chunk = 30
for index in range(
    0,
    len(single_session_transcript),
    n_sentences_per_chunk,
):
    parts.append(
        " ".join(
            single_session_transcript.iloc[index:index+n_sentences_per_chunk].text
        )
    )
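
The chunking loop above can also be written as a small reusable helper. This is only a sketch: `chunk_sentences` is a hypothetical name, and the `overlap` parameter is an addition that is not in the original loop (with `overlap=0` it produces the same chunks).

```python
def chunk_sentences(sentences, n_per_chunk=30, overlap=0):
    """Join consecutive sentences into chunks of at most n_per_chunk sentences.

    overlap is the number of sentences repeated from the end of one chunk
    at the start of the next chunk (0 means no repetition).
    """
    if n_per_chunk <= 0:
        raise ValueError("n_per_chunk must be positive")
    if not 0 <= overlap < n_per_chunk:
        raise ValueError("overlap must be between 0 and n_per_chunk - 1")

    step = n_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + n_per_chunk]))
        if start + n_per_chunk >= len(sentences):
            break  # this chunk already reaches the end of the transcript
    return chunks
```

For the transcript above, it could be called as `chunk_sentences(list(single_session_transcript.text))`. A small overlap gives each chunk a little context from the previous one, at the cost of repeating some text.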

* now that we have multiple chunks, let's first pass just the first chunk to the model and see what it generates


For reference, here is the full text of the first chunk:

> "I will call the roll on May 2nd. May 2nd, 2022 Council briefing meeting will come to order. I am Andrew Lewis, Council President pro tem. The time is 2.01 p.m. Will the clerk please call the roll? Councilmember Wilson? Present. Councilmember Peterson? Present. Councilmember Sawant? Present. Councilmember Strauss? Present. Councilmember Herbold? Present. Councilmember Williams? Present. Councilmember Gattas? Present. Councilmember Herbold? Here. Councilmember Morales? Present. Councilmember Pro Tem Lewis? Present. Six present. Thank you. We will move on to approval of the minutes. If there is no objection, the minutes of May 2nd, 2022 will be adopted. Hearing no objection, the minutes are adopted. The President's report. I will call the roll on May 2nd, 2022 Council briefing meeting will come to order. I am Andrew Lewis, Council President pro tem. The time is 2.01 p.m. Will the clerk please call the roll? Councilmember Wilson? Present. Councilmember Peterson? Present. Councilmember Sawant? Present. Councilmember Peterson? Present. Councilmember Herbold? Here. Councilmember Morales? Present. Councilmember Gattas? Present. Councilmember Herbold? Present. Councilmember Gattas? Present. I am filling in. As folks may be aware for Council President Juarez. I do not have any. Reports at the top of the meeting. Except to give a brief preview of the full council. Agenda that we will be considering. As a council tomorrow."

In [12]:
from transformers import pipeline
summarizer = pipeline("summarization", model="knkarthick/MEETING-SUMMARY-BART-LARGE-XSUM-SAMSUM-DIALOGSUM-AMI")

# Summarize the first part of the meeting
summarizer(parts[0])

[{'summary_text': 'Andrew Lewis, Council President pro tem, will call the roll on May 2nd, 2022 Council briefing meeting. If there is no objection, the minutes are adopted.'}]

* We can see that the summary is short and to the point and generally captures the main idea of this chunk, which is simply that the meeting has begun and roll call took place.

* However, it is definitely missing some specificity and details that would be useful for a reader.

* Let's summarize the rest of the chunks and see what we get back

In [13]:
from tqdm import tqdm
results = []
for part in tqdm(parts):
    summary = summarizer(part)[0]["summary_text"]
    results.append(summary)

print("\n\n".join(results))

100%|██████████| 29/29 [05:19<00:00, 11.00s/it]

Andrew Lewis, Council President pro tem, will call the roll on May 2nd, 2022 Council briefing meeting. If there is no objection, the minutes are adopted.

Andrew Lewis, Council President pro tem, will call the roll on May 2nd, 2022 Council briefing meeting. The President's report will be presented at the top of the meeting.

The agenda for tomorrow's full council meeting is going to be the same as the agenda for the previous meeting. The agenda includes several ordinances, several appointments, the appointment of Gail Tarleton as director of the office of intergovernmental relations, and other ordinances related to the Wagner floating home

I'm pleased to be presenting a resolution in partnership with the mayor's office commemorating national small business week in Seattle. As the first small business owner on the council in over 10 years, I'm excited and proud to celebrate the over 100,000 small businesses that call Seattle home.

The clerk calls the roll to determine which council me




* This looks okay. There are some details about what is happening during this meeting, and as this meeting is a "briefing meeting" it explains what major agenda items and actions are on each of the council members' schedules for the rest of the week.

* There are some problems: in a few places the summaries lack specificity and detail. In many cases, a summary is cut off at the end because the model has a limit on the number of tokens it generates (this is solvable by passing the `max_new_tokens` generation parameter when calling the pipeline, if we really want to work to fix it: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline.__call__)

* Similarly, it seems like some pieces of information are duplicated across summary portions. This is likely happening because the chunks we are passing in are so small. A single discussion item may span several chunks, and because the model generates a summary for each chunk separately, the same information shows up more than once.

* However, this is a good starting point and we can see that the model is able to generate summaries that capture some of the main ideas or at least discussion points of the meeting and drastically reduce its size
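
The two problems noted above (cut-off summaries and repeated information) can be tackled with a little post-processing. The sketch below assumes the summarizer accepts `max_new_tokens` as a keyword argument when called, as mentioned above. The value `128`, the similarity threshold, and all function names are illustrative choices, not settings recommended by the model's authors.

```python
import re


def summarize_parts(summarizer, parts, max_new_tokens=128):
    """Summarize each chunk, allowing up to max_new_tokens generated tokens."""
    return [
        summarizer(part, max_new_tokens=max_new_tokens)[0]["summary_text"]
        for part in parts
    ]


def word_overlap(a, b):
    """Jaccard similarity between the sets of words in two strings."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def drop_near_duplicates(summaries, threshold=0.6):
    """Keep a summary only if it is not too similar to one already kept."""
    kept = []
    for summary in summaries:
        if all(word_overlap(summary, earlier) < threshold for earlier in kept):
            kept.append(summary)
    return kept
```

With the objects defined earlier, this could be used as `drop_near_duplicates(summarize_parts(summarizer, parts))`. Word overlap is a crude similarity measure, so the threshold usually needs some tuning on real output.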

### Limitations and Warnings

* it is important to note again that while this reduced the meeting into a few paragraphs of detail, it doesn't include all of the details and context that a reader may need or want

* further, just like with sentiment, the data used for training the model may be different than the data you are using it on

* one of the larger limitations which we might begin to resolve in the next lecture is that this model is relatively small. It can easily fit on most laptops and desktops, and can easily run within Google Colab. However it is not as powerful as some of the larger models that are available that excel at summarization.
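
For contrast with the abstractive model used in this section, here is a toy sketch of the extractive approach described at the start of the section. It scores each sentence by how frequent its words are across the document and keeps the top few sentences in their original order. The scoring rule is a deliberately simple assumption for illustration, not how any particular library does it.

```python
import re
from collections import Counter


def extractive_summary(sentences, n_keep=2):
    """Pick the n_keep highest-scoring sentences, keeping document order."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Word frequencies across the whole document, ignoring very short words
    freqs = Counter(w for words in tokenized for w in words if len(w) > 3)

    # Score each sentence by the average frequency of its words, so that
    # long sentences are not favored just for being long
    scores = [
        sum(freqs[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]

    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:n_keep])]


example = [
    "The council will vote on the housing ordinance.",
    "Yes.",
    "The housing ordinance funds new housing.",
    "Thank you.",
]
print(extractive_summary(example, n_keep=2))
```

Every sentence in an extractive summary appears verbatim in the source, which makes it easy to trace back but often less readable than an abstractive summary.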

## Spacy

* spacy is a different library from huggingface's transformers, but one which similarly provides a whole suite of tools for NLP
* it comes pre-packaged with tools ready for use such as tokenization, named entity recognition, and more.
    * see the documentation for more details: https://spacy.io/usage/spacy-101#features

* the first thing to do prior to using spacy is to download the model you want to use

* they have a few models available for English-language text:
    * `en_core_web_sm`: a very small model which is very fast to use but may be less accurate
    * `en_core_web_lg`: a larger model which is slower to use but may be more accurate
    * `en_core_web_trf`: a transformer-based model which is the slowest to use but generally the most accurate

* in general, if we want the most accurate predictions, it is recommended to use `en_core_web_trf`

* let's download the model

In [18]:
!python -m spacy download en_core_web_trf

* Now let's load the model into spacy and try out some basic operations

In [22]:
import spacy

# Load the model
nlp = spacy.load("en_core_web_trf")

# Create a spacy "Document" object using the full text of the transcript joined together as a single string
doc = nlp(" ".join(single_session_transcript.text))

# show the first couple of "tokens" in the document and the part of speech (POS) tag for each
for token in doc[:10]:
    print(token.text, "\t\t", token.pos_)

I 		 PRON
will 		 AUX
call 		 VERB
the 		 DET
roll 		 NOUN
on 		 ADP
May 		 PROPN
2nd 		 NOUN
. 		 PUNCT
May 		 PROPN


* Spacy gives us a lot of fine-grained information about each "token" (word) in the text, including the part of speech

* in this case, we can see that "call" is a verb, "May" is a proper noun, "." is punctuation, etc.

* let's see what else spacy can give us. For example, even though we just merged all of the sentences of the text into one string, spacy can split the text back out into sentences.

In [24]:
# split the doc into sentences and print the first 11 (indices 0 through 10)
for i, sent in enumerate(doc.sents):
    print(sent)

    if i == 10:
        break

I will call the roll on May 2nd.
May 2nd, 2022 Council briefing meeting will come to order.
I am Andrew Lewis, Council President pro tem.
The time is 2.01 p.m.
Will the clerk please call the roll?
Councilmember Wilson?
Present.
Councilmember Peterson? Present.
Councilmember Sawant?
Present.
Councilmember Strauss?


* Great! Spacy is able to split the text back out into sentences. This is useful for many tasks such as named entity recognition, summarization, and more.

* just to highlight, it also correctly ignores the "." (periods) in the middle of "2.01 p.m." so we know that it isn't just naively splitting the text on various punctuation marks

* let's try another example

* part of speech tagging, as we just showed, can be incredibly useful, but parts of speech alone may not get you the whole way. Maybe you want to track _who_ is the chair of the meeting by looking at _who_ is calling the meeting to order.

* let's look at the dependency graph

In [37]:
from spacy import displacy

displacy.render(doc[20:30], style="dep", jupyter=True)

* this gives us a much clearer view that "Andrew Lewis" is a compound proper noun (a name) and that they are the "Council President" pro tem.

* we can read that in the text, but by using the dependency graph we can also programmatically extract that information

* let's try another example

* one of the most common reasons people use spacy is for named entity recognition

* named entity recognition is the task of identifying and classifying named entities in text

* named entities are things like people, places, organizations, etc.

* let's see what spacy can do for us

In [33]:
# programmatically print any named entities from the first 40 tokens of the meeting
for ent in doc[:40].ents:
    print(ent.text, "\t\t", ent.label_)

May 2nd 		 DATE
May 2nd, 2022 		 DATE
Andrew Lewis 		 PERSON
Council 		 ORG
2.01 p.m. 		 TIME


In [36]:
# or render them visually!
displacy.render(doc[:40], style="ent", jupyter=True)

* spacy is exceptional at helping with all of the lower level natural language processing tasks

* and combining spacy and other open source models made available on huggingface can be a powerful combination
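
As one small illustration of combining the two, spacy's sentence splitting can feed a huggingface classifier one sentence at a time. The helper below is a hypothetical sketch: it takes plain strings and any callable that behaves like the sentiment pipeline from earlier (a list of strings in, one `{"label": ..., "score": ...}` dict per string out), so it does not depend on a specific model.

```python
def sentiment_by_sentence(sentences, classify):
    """Pair each sentence with the label and score a classifier assigns to it.

    sentences: list of strings, for example [sent.text for sent in doc.sents]
    classify: callable that takes a list of strings and returns one
              {"label": ..., "score": ...} dict per string
    """
    # Drop empty strings so the classifier only sees real sentences
    cleaned = [s.strip() for s in sentences if s.strip()]
    results = classify(cleaned) if cleaned else []
    return [
        (sentence, result["label"], result["score"])
        for sentence, result in zip(cleaned, results)
    ]
```

With the objects from this lecture, a call might look like `sentiment_by_sentence([sent.text for sent in doc.sents], model)`, where `model` is the sentiment pipeline created earlier. Classifying every sentence of a long meeting can be slow on a CPU, so trying it on a slice first is a reasonable idea.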

## Exercise Idea

* give a few examples of models for a certain task (maybe NER) on huggingface and ask students to compare the results between all of the models

* what works well, what doesn't, etc.

* similarly, ask them to combine spacy NER with a huggingface model to ...