# Lecture 6 -- Pre-trained Models

## Outline

* in many cases, training your own model may be the best thing you can do as you have full control over the data, training, and the evaluation process.

* however, this lecture will go over pre-trained models and more generally, models that are freely available for you to use.

* pretrained and freely available models have become incredibly common and are often a good starting point for experimentation and more.

* we will go over a few examples of pretrained models and how to use them including models for:
    * sentiment analysis
    * summarization
    * named entity recognition

## Generalization vs Specialization

* in the prior chapter we talked about how for some tasks we need to train our own model because we want to define the annotation criteria or we are in a very niche sub-domain.

* however, using pre-trained models that are general to a task can also be quite useful -- especially considering that they were likely trained with more data than the dataset you are working with

## Transformers and Huggingface

* We have introduced and used transformers, a model architecture

* Hugging Face is a company that builds tools around various model architectures to make models easy to share

* for example, here is a link to their website in which you can search and browse for available models: https://huggingface.co/models
* every single model is available for download and use and in many cases has some documentation attached to understand how to use the model in your own work

## Sentiment Analysis

* sentiment analysis is a task in which we want to predict the sentiment of a text
* sentiment is often defined as positive, negative, or neutral
* for example, the phrase: "I love this movie" would be positive and "I hate this movie" would be negative

* sentiment analysis is a very common task and there are many datasets available for it
* for example, here is a dataset of movie reviews: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

* there are many models available for sentiment analysis on huggingface (and elsewhere) but for our examples, we will use `cardiffnlp/twitter-roberta-base-sentiment-latest` (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)
  * according to its model card, this model was pre-trained on roughly 124 million tweets and then fine-tuned for sentiment analysis on the TweetEval benchmark

* let's load the model and try it on a few examples

In [None]:
from transformers import pipeline

model = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

# Predict sentiment of the statement
model("I am happy to see that other members of the council are supporting this important legislation.")

* we can see that for the phrase `"I am happy to see that other members of the council are supporting this important legislation."` the model predicts a positive sentiment and gives a score

* the `score` value is the probability the model assigns to the predicted label, and you can interpret it as the model's confidence in the prediction
* the closer the score is to 1.0, the more confident the model is in the prediction (in this case, the model is very confident that the sentiment is positive)
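
To make the `score` less of a black box, here is a minimal sketch of how a classifier's raw outputs (logits) are usually turned into a label and a score with a softmax. The logit values below are made up for illustration; the real ones come from the model, and the exact post-processing inside the pipeline is assumed rather than shown.

```python
import math

# Hypothetical raw scores (logits) for one input; real values come from the model
labels = ["negative", "neutral", "positive"]
logits = [-1.2, 0.3, 3.1]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)

# Pipeline-style result: the most probable label together with its probability
best_index = max(range(len(labels)), key=lambda i: probabilities[i])
prediction = {"label": labels[best_index], "score": probabilities[best_index]}
print(prediction)
```

Because the probabilities sum to 1, a high score for one label means the other labels received very little probability. In recent versions of `transformers`, passing `top_k=None` when calling the pipeline should return the score for every label instead of only the top one.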

* let's try a few more examples

In [None]:
# Give the model multiple examples at once by providing a list
model([
    "I urge the members of the committee to vote against this bill.",
    (
        "If the amendment from Councilmember Brown passes, I will vote in support of this legislation. "
        "But without the amendment, I cannot agree to move forward with this legislation."
    ),
])

* This time we can see that the model is confident that the sentiment of the first example (`"I urge the members of the committee to vote against this bill."`) is negative and that the second example (the two-sentence statement about Councilmember Brown's amendment, ending with `"But without the amendment, I cannot agree to move forward with this legislation."`) is neutral.

* This makes sense to me, as the second example is specifically discussing conditions of agreement or disagreement which places it somewhere in a neutral category.

### Limitations and Warnings

* Sentiment can be subjective and it is important to understand the limitations of the model and the data it was trained on, and who annotated the data.

* The model we used for sentiment was trained using Twitter data -- Twitter posts are quite different from meeting discussion

* this can affect the model and we should always evaluate the performance of even off the shelf models

* in general, when using a model, be sure to read how it was trained, what data it used for training, and understand the possible biases and problems that come from using a model trained by someone else

* try to find a model that was trained on data close to yours for the best results
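
One way to evaluate an off-the-shelf model is to hand-label a small sample of your own data and compare the model's predictions against those labels. The sketch below assumes a `predict` function that maps a text to a label. The labeled sentences and the keyword-based stand-in predictor are hypothetical and exist only so the example runs without downloading a model.

```python
from collections import Counter


def evaluate(predict, labeled_examples):
    """Return accuracy and a Counter of (expected, predicted) label pairs."""
    pairs = Counter()
    for text, expected in labeled_examples:
        pairs[(expected, predict(text))] += 1
    correct = sum(n for (want, got), n in pairs.items() if want == got)
    return correct / len(labeled_examples), pairs


# Hypothetical hand-labeled meeting sentences (the labels are our own judgment calls)
labeled_examples = [
    ("I am happy to support this legislation.", "positive"),
    ("I urge the committee to vote against this bill.", "negative"),
    ("The time is 2:01 p.m.", "neutral"),
    ("I cannot agree to move forward without the amendment.", "negative"),
]


# Stand-in predictor so the sketch runs offline.
# With the real pipeline this would be roughly: lambda text: model(text)[0]["label"]
def keyword_predict(text):
    lowered = text.lower()
    if "happy" in lowered or "support" in lowered:
        return "positive"
    if "against" in lowered:
        return "negative"
    return "neutral"


accuracy, pairs = evaluate(keyword_predict, labeled_examples)
print(f"accuracy: {accuracy:.2f}")
for (expected, predicted), count in sorted(pairs.items()):
    print(f"expected={expected:<8} predicted={predicted:<8} count={count}")
```

The (expected, predicted) counts show where the model disagrees with your annotations, which is often more informative than the single accuracy number.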

## Summarization

* summarization can be an important task for processing long documents such as multi-hour meeting transcripts

* there can be direct reader benefits by trying to extract or explain the main points of a meeting for example

* generated summaries can also be useful for other tasks such as information retrieval (e.g. instead of searching across the whole meeting, you can search across the summary), or for situations in which the whole meeting doesn't fit into the model's input (context) window but the summary does

* the two main types of summarization are extractive and abstractive

* extractive summarization is when you extract sentences from the original document and combine them to form a summary

* abstractive summarization is when you generate new sentences that are not in the original document to form a summary
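
To make the extractive idea concrete, here is a toy sketch that scores each sentence by how frequent its words are in the document and keeps the top-scoring sentences verbatim. The word-frequency heuristic and the function name are assumptions chosen for illustration. This is not how the pre-trained models in this lecture work; those are abstractive and generate new text.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore very short words
    frequencies = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(frequencies[t] for t in tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


text = (
    "We will move on to approval of the minutes. "
    "If there is no objection, the minutes of May 2nd, 2022 will be adopted. "
    "Hearing no objection, the minutes are adopted. "
    "Thank you."
)
summary = extractive_summary(text, n_sentences=2)
print(summary)
```

Every sentence in the output is copied from the input, which is the defining property of extractive summarization.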

* someone has already taken the time to train a model for meeting dialogue summarization: https://huggingface.co/knkarthick/MEETING_SUMMARY

* let's use this model to generate a summary for a meeting

* first let's pull down some data and prep the data for the model

In [1]:
from cdp_data import CDPInstances, datasets
import pandas as pd

sessions = datasets.get_session_dataset(
    CDPInstances.Seattle,
    start_datetime="2022-05-01",
    end_datetime="2022-05-03",
    store_transcript=True,
    store_transcript_as_csv=True,
)

single_session_transcript = pd.read_csv(sessions.iloc[0].transcript_as_csv_path)
len(single_session_transcript)

855

In [9]:
# Because our meetings are incredibly long,
# let's create strings with different portions of the meeting
# to pass to the model
parts = []
n_sentences_per_chunk = 30
for index in range(
    0,
    len(single_session_transcript),
    n_sentences_per_chunk,
):
    parts.append(
        " ".join(
            single_session_transcript.iloc[index:index+n_sentences_per_chunk].text
        )
    )
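
A fixed number of sentences per chunk is simple, but sentences vary a lot in length, so chunk sizes can vary too. As a rough alternative, the sketch below groups consecutive sentences until a word budget is reached. The function name and the budget are assumptions, and word counts are only a loose proxy for the tokens a model actually sees (BART-based models typically accept about 1024 tokens).

```python
def chunk_by_word_budget(sentences, max_words=400):
    """Group consecutive sentences so each chunk stays at or under max_words.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical usage with a few transcript-like sentences
sentences = [
    "Will the clerk please call the roll?",
    "Councilmember Morales?",
    "Present.",
    "We will move on to approval of the minutes.",
]
chunks = chunk_by_word_budget(sentences, max_words=10)
print(chunks)
```

With the transcript loaded above, this could be called as `chunk_by_word_budget(single_session_transcript.text.tolist())`.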

* now that we have multiple chunks, let's first pass in just the first chunk to the model and see what it generates


For reference, here is the full text of the first chunk:

> "I will call the roll on May 2nd. May 2nd, 2022 Council briefing meeting will come to order. I am Andrew Lewis, Council President pro tem. The time is 2.01 p.m. Will the clerk please call the roll? Councilmember Wilson? Present. Councilmember Peterson? Present. Councilmember Sawant? Present. Councilmember Strauss? Present. Councilmember Herbold? Present. Councilmember Williams? Present. Councilmember Gattas? Present. Councilmember Herbold? Here. Councilmember Morales? Present. Councilmember Pro Tem Lewis? Present. Six present. Thank you. We will move on to approval of the minutes. If there is no objection, the minutes of May 2nd, 2022 will be adopted. Hearing no objection, the minutes are adopted. The President's report. I will call the roll on May 2nd, 2022 Council briefing meeting will come to order. I am Andrew Lewis, Council President pro tem. The time is 2.01 p.m. Will the clerk please call the roll? Councilmember Wilson? Present. Councilmember Peterson? Present. Councilmember Sawant? Present. Councilmember Peterson? Present. Councilmember Herbold? Here. Councilmember Morales? Present. Councilmember Gattas? Present. Councilmember Herbold? Present. Councilmember Gattas? Present. I am filling in. As folks may be aware for Council President Juarez. I do not have any. Reports at the top of the meeting. Except to give a brief preview of the full council. Agenda that we will be considering. As a council tomorrow."

In [7]:
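from transformers import pipeline

# Note: this loads "philschmid/bart-large-cnn-samsum" (a BART model fine-tuned
# on the SAMSum dialogue dataset) rather than the MEETING_SUMMARY model linked above;
# swap in the other model name to try that one instead
summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")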

# Summarize the first part of the meeting
summarizer(parts[0])

[{'summary_text': 'Andrew Lewis, Council President pro tem, will call the roll on May 2nd, 2022 at 2.01 p.m. Councilmember Wilson, Councilmember Peterson, Councilmembers Sawant, Strauss, Herbold, Williams, Gattas, Morales and Pro Tem Lewis are present. If there is no objection, the minutes will be adopted.'}]

* We can see that the summary is short and to the point and generally captures the main idea of this chunk, which is simply that the meeting has begun and roll call took place.

* However, it is definitely missing some specificity and details that would be useful for a reader.

* Let's summarize the rest of the chunks and see what we get back

In [10]:
from tqdm import tqdm
results = []
for part in tqdm(parts):
    summary = summarizer(part)[0]["summary_text"]
    results.append(summary)

print(" ".join(results))

100%|██████████| 29/29 [04:06<00:00,  8.51s/it]

Andrew Lewis, Council President Pro Tem, will call the roll on May 2nd, 2022 at 2.01 p.m. Councilmember Wilson, Peterson, Sawant, Strauss, Herbold, Williams, Gattas, Morales, and Lewis are present. If there is no objection, the minutes will be adopted. Andrew Lewis is the Council President pro tem. The time is 2.01 p.m. Councilmember Wilson, Peterson, Sawant, Morales, Herbold and Gattas are present. Andrew is filling in for Council President Juarez, who is absent. Tomorrow's full council meeting is going to be the agenda for tomorrow's full meeting. Council member Nelson will lead the discussion on the proclamation declaring May 1st through May 7th to be national small business week. Council will also review several ordinances related to the Wagner floating home and the center for wooden boats. Tim Lewis is the first small business owner on the council in over 10 years. Seattle has hosted a national small business week every year since 1963. Seattle metro area has the seventh most smal


