# Pretrained models

This notebook walks through the most frequent use cases of the Transformers library, following the [Summary of the tasks](https://huggingface.co/docs/transformers/task_summary) page.

In this notebook we will use this dataset: [Amazon Food Reviews 100k Datasets](https://www.kaggle.com/datasets/shoumikdhar/amazon-food-reviews-100k-datasets)


## Read data

In [1]:
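import pandas as pd
import zipfile

# Extract the Kaggle archive, then load the first file it contains
with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
    zip_ref.extractall('archive')
    # Assumes the first archive member is the reviews CSV
    extracted_file = zip_ref.namelist()[0]

df_reviews = pd.read_csv(f"archive/{extracted_file}")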
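
The cell above relies on the first archive member being the CSV. As a minimal sketch of a more defensive variant, the helper below (`read_first_csv` is a hypothetical name, not part of the notebook) picks the first `.csv` member explicitly and reads it without extracting to disk. It uses only the standard library, and the in-memory archive and its contents are made up for illustration.

```python
import csv
import io
import zipfile


def read_first_csv(zip_source):
    """Return the rows of the first .csv member of a zip archive as dicts."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        # Select the CSV by extension instead of trusting namelist()[0]
        csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no .csv file")
        with zf.open(csv_names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Toy in-memory archive standing in for archive.zip (contents are made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("readme.txt", "not a table")
    zf.writestr("reviews.csv", "Id,Rating,Review\n1,5,Great taffy\n2,1,Too salty\n")

rows = read_first_csv(buffer)
print(len(rows), rows[0]["Review"])
```

With pandas, the same idea should work by passing the file object returned by `zf.open(name)` to `pd.read_csv`, which accepts file-like objects.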


In [2]:
df_reviews.head()

|   | Id | Rating | Review |
|---|----|--------|--------|
| 0 | 1 | 5 | I have bought several of the Vitality canned d... |
| 1 | 2 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | 4 | This is a confection that has been around a fe... |
| 3 | 4 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | 5 | Great taffy at a great price. There was a wid... |


## Data Exploration

In [3]:
df_reviews.Review.apply(lambda x: len(x.split())).describe()

count    100000.000000
mean         81.313900
std          79.153013
min           6.000000
25%          34.000000
50%          57.000000
75%         100.000000
max        2520.000000
Name: Review, dtype: float64
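
The longest review has 2,520 words, while the DistilBERT and RoBERTa checkpoints used later in this notebook accept inputs of roughly 512 tokens at most. A rough sketch for spotting reviews that are likely too long is shown below. The word count is only a proxy for the token count, and both the helper name `share_over_budget` and the 400-word budget are assumptions chosen for illustration.

```python
def share_over_budget(texts, max_words=400):
    """Return (indices, share) of texts whose whitespace word count exceeds max_words."""
    long_idx = [i for i, t in enumerate(texts) if len(t.split()) > max_words]
    share = len(long_idx) / len(texts) if texts else 0.0
    return long_idx, share


# Toy stand-ins for df_reviews.Review.tolist(); the middle one has 500 words
reviews = ["short review", "word " * 500, "another short one"]
long_idx, share = share_over_budget(reviews, max_words=400)
print(long_idx, round(share, 3))
```

Reviews flagged this way can be shortened before classification, or the pipeline can be asked to truncate them by passing `truncation=True` when it is called.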

In [4]:
df_reviews.Rating.describe()

count    100000.000000
mean          4.152630
std           1.320141
min           1.000000
25%           4.000000
50%           5.000000
75%           5.000000
max           5.000000
Name: Rating, dtype: float64
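
A mean of 4.15 with a median of 5 suggests the ratings are heavily skewed toward five stars. The share of each rating makes that explicit. The sketch below uses only the standard library and the five ratings visible in `df_reviews.head()`, so the resulting shares describe that tiny sample, not the full dataset. `rating_shares` is a hypothetical helper name.

```python
from collections import Counter


def rating_shares(ratings):
    """Map each star rating to its share of all ratings."""
    counts = Counter(ratings)
    total = sum(counts.values())
    return {star: counts[star] / total for star in sorted(counts)}


# Ratings of the five rows shown by df_reviews.head()
sample = [5, 1, 4, 2, 5]
shares = rating_shares(sample)
print(shares)
```

On the full DataFrame, `df_reviews.Rating.value_counts(normalize=True)` should give the same kind of breakdown.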

## Classification & Sentiment Analysis

In [5]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [6]:
# Sanity check on a single sentence
classifier("AI stuff is real hard to understand.")

[{'label': 'NEGATIVE', 'score': 0.9996833801269531}]

In [7]:
# Classify the first five reviews and expand each result dict into label/score columns
results = df_reviews.Review.head().apply(classifier).explode().apply(pd.Series)
results

|   | label | score |
|---|-------|-------|
| 0 | POSITIVE | 0.998385 |
| 1 | NEGATIVE | 0.999525 |
| 2 | POSITIVE | 0.999765 |
| 3 | POSITIVE | 0.999153 |
| 4 | POSITIVE | 0.998708 |


In [8]:
# Show the full review text (None disables column-width truncation;
# the older value -1 is deprecated and triggers a warning)
pd.options.display.max_colwidth = None
# Display the DataFrame
df_reviews.head()


|   | Id | Rating | Review |
|---|----|--------|--------|
| 0 | 1 | 5 | I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most. |
| 1 | 2 | 1 | Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". |
| 2 | 3 | 4 | This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. |
| 3 | 4 | 2 | If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal. |
| 4 | 5 | 5 | Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal. |


In [9]:
df_reviews.head().join(results)

|   | Id | Rating | Review | label | score |
|---|----|--------|--------|-------|-------|
| 0 | 1 | 5 | I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most. | POSITIVE | 0.998385 |
| 1 | 2 | 1 | Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". | NEGATIVE | 0.999525 |
| 2 | 3 | 4 | This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. | POSITIVE | 0.999765 |
| 3 | 4 | 2 | If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal. | POSITIVE | 0.999153 |
| 4 | 5 | 5 | Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal. | POSITIVE | 0.998708 |
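
Joining the predictions with the star ratings makes it possible to check where the model and the reviewer disagree. The sketch below uses the five ratings and labels from the output above. The rating-to-sentiment mapping (4 to 5 stars positive, 1 to 2 negative, 3 neutral) is an assumption made for illustration, and both function names are hypothetical.

```python
def rating_to_sentiment(rating):
    """Assumed mapping from star rating to a coarse sentiment label."""
    if rating >= 4:
        return "POSITIVE"
    if rating <= 2:
        return "NEGATIVE"
    return "NEUTRAL"


def find_disagreements(ratings, labels):
    """Return the row positions where the predicted label contradicts the rating."""
    return [
        i
        for i, (rating, label) in enumerate(zip(ratings, labels))
        if rating_to_sentiment(rating) != label.upper()
    ]


# Ratings and predicted labels of the five rows shown above
ratings = [5, 1, 4, 2, 5]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "POSITIVE"]
mismatches = find_disagreements(ratings, labels)
print(mismatches)  # [3]
```

Row 3 is the two-star cherry soda review, which the model labels `POSITIVE` with high confidence even though the reviewer calls the flavor "very medicinal".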


### Add a neutral class with the twitter-roberta model

The default SST-2 model only predicts `POSITIVE` or `NEGATIVE`. The `cardiffnlp/twitter-roberta-base-sentiment-latest` model adds a third label, `neutral`.

In [10]:
roberta_sentiment = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
results = df_reviews.Review.head().apply(roberta_sentiment).explode().apply(pd.Series)
results

|   | label | score |
|---|-------|-------|
| 0 | positive | 0.950607 |
| 1 | negative | 0.716768 |
| 2 | positive | 0.916966 |
| 3 | positive | 0.949957 |
| 4 | positive | 0.98619 |
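
The two models use different label casing and report different confidence levels, so a side-by-side comparison needs a small normalization step. The sketch below hard-codes the five predictions printed in this notebook for both models. The `compare` helper and its output keys are hypothetical.

```python
# (label, score) pairs copied from the two result tables in this notebook
distilbert = [
    ("POSITIVE", 0.998385), ("NEGATIVE", 0.999525), ("POSITIVE", 0.999765),
    ("POSITIVE", 0.999153), ("POSITIVE", 0.998708),
]
roberta = [
    ("positive", 0.950607), ("negative", 0.716768), ("positive", 0.916966),
    ("positive", 0.949957), ("positive", 0.98619),
]


def compare(first, second):
    """Compare two prediction lists row by row, ignoring label case."""
    rows = []
    for i, ((label_a, score_a), (label_b, score_b)) in enumerate(zip(first, second)):
        rows.append({
            "row": i,
            "agree": label_a.lower() == label_b.lower(),
            "score_gap": round(score_a - score_b, 6),
        })
    return rows


comparison = compare(distilbert, roberta)
all_agree = all(r["agree"] for r in comparison)
widest = max(comparison, key=lambda r: r["score_gap"])
print(all_agree, widest)
```

On these five reviews the models agree on every label, and the largest confidence gap is on row 1, the mislabeled peanuts review.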


Check this blog post for more information: [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)

## Information Extraction & Question Answering

Extractive question answering is the task of extracting an answer from a text given a question.

In [12]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [13]:
question_answerer(question="what is the product?", context=df_reviews.Review.values[4])

{'score': 0.4232330322265625, 'start': 76, 'end': 84, 'answer': 'Delivery'}
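
The `start` and `end` values in the result are character offsets into the context, and the score of about 0.42 indicates that the model is not very sure that "Delivery" answers the question. A minimal sketch of how such a result could be checked and filtered is shown below. The `extract_answer` helper, the 0.5 threshold, and the shortened context string are assumptions for illustration, so the offsets here are computed for that shortened string and differ from the 76 and 84 above.

```python
def extract_answer(result, context, min_score=0.5):
    """Return the answer span if it is consistent and confident enough, else None."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"]:
        raise ValueError("offsets do not match the answer text")
    return span if result["score"] >= min_score else None


# Shortened stand-in for the review text; offsets are computed for this string
context = "Great taffy at a great price. Delivery was very quick."
start = context.index("Delivery")
result = {
    "score": 0.4232330322265625,  # score reported by the pipeline above
    "start": start,
    "end": start + len("Delivery"),
    "answer": "Delivery",
}

print(extract_answer(result, context))                 # None: below the 0.5 threshold
print(extract_answer(result, context, min_score=0.3))  # Delivery
```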

## Text Generation & Prompting

In [14]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')

In [17]:
set_seed(2)
generator("Hello, I'm an NLP student,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hello, I\'m an NLP student, which means I am the primary author of an NLP article," he writes in the blog post titled "'},
 {'generated_text': "Hello, I'm an NLP student, so I'm very excited for the 2018 season - not only for this season's conference, but at the"},
 {'generated_text': "Hello, I'm an NLP student, so what you have is like, I will teach you stuff that I don't have to learn. I"},
 {'generated_text': "Hello, I'm an NLP student, and I was interested in taking an interest in the NLP. I was inspired not by the NLP"},
 {'generated_text': 'Hello, I\'m an NLP student, that\'s what I\'m doing as is so far," she said. She\'d been working with the library'}]
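
Each `generated_text` starts with the prompt itself, so the newly generated part has to be separated out when only the continuation is needed. The sketch below does that for two of the outputs above. `continuation` is a hypothetical helper.

```python
def continuation(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if not generated_text.startswith(prompt):
        raise ValueError("generated text does not start with the prompt")
    return generated_text[len(prompt):].strip()


prompt = "Hello, I'm an NLP student,"
# Two of the five sequences returned by the generator above
outputs = [
    {"generated_text": "Hello, I'm an NLP student, so I'm very excited for the 2018 season - not only for this season's conference, but at the"},
    {"generated_text": "Hello, I'm an NLP student, and I was interested in taking an interest in the NLP. I was inspired not by the NLP"},
]

tails = [continuation(prompt, o["generated_text"]) for o in outputs]
for tail in tails:
    print(tail)
```

The text-generation pipeline also has a `return_full_text` parameter, and setting it to `False` should return only the continuation without any post-processing.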

In [None]:
# with the open source Bloom model https://huggingface.co/bigscience/bloom

In [16]:
generator = pipeline('text-generation', model='bigscience/bloom')

ValueError: Could not load model bigscience/bloom with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.bloom.modeling_bloom.BloomModel'>).
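
The `ValueError` does not say why loading failed. One plausible cause, which is an assumption the traceback does not confirm, is size: `bigscience/bloom` is the full BLOOM model with roughly 176 billion parameters. A back-of-the-envelope estimate of the memory needed just to hold the weights is sketched below. The helper name is hypothetical and the default of 2 bytes per parameter assumes 16-bit weights.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Rough memory needed to hold model weights, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


bloom_full = weight_memory_gb(176e9)   # full BLOOM, about 176 billion parameters
bloom_small = weight_memory_gb(560e6)  # a 560-million-parameter checkpoint

print(f"full BLOOM: ~{bloom_full:.0f} GB, 560M checkpoint: ~{bloom_small:.2f} GB")
```

By this estimate the full model needs several hundred gigabytes for its weights alone, which is far beyond a typical notebook machine. A smaller checkpoint from the same family, such as `bigscience/bloom-560m`, is a more realistic choice for trying the pipeline locally.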

## Translation

In [18]:
from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]


## Summarization