# 03-02 - Pretrained-models - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

Context : 

Let's Continue the Party! 

Data  : 

**You can find the dataset [here](https://www.kaggle.com/datasets/shoumikdhar/amazon-food-reviews-100k-datasets).**




## Preliminaries

### System

These commands show the working directory and its contents.

Uncomment these lines if needed.

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

### Import 

In [None]:
# import os, sys, warnings, secrets, datetime
# import pickle

from IPython.display import display
import zipfile

In [None]:
import pandas as pd

# import numpy as np

In [None]:
import plotly.express as px

In [None]:
import tensorflow as tf

In [None]:
import transformers

from transformers import pipeline, set_seed

from transformers import T5Tokenizer, T5ForConditionalGeneration

from transformers import (
    BertTokenizer,
    BertForTokenClassification,
    BertForSequenceClassification,
)

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

from transformers import (
    RobertaTokenizer,
    RobertaForSequenceClassification,
    RobertaForMaskedLM,
)

from transformers import BartTokenizer, BartForConditionalGeneration

from transformers import MarianMTModel, MarianTokenizer

from transformers import GPT2LMHeadModel, GPT2Tokenizer

### Third party tools

Set the seed : 

In [None]:
set_seed(42)

Download the default classifier :

In [None]:
classifier = pipeline("sentiment-analysis")

Specifying a model :

In [None]:
roberta_sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

Question answering model : 

In [None]:
question_answerer = pipeline("question-answering")

Text Generator : 

In [None]:
gpt2_generator = pipeline("text-generation", model="gpt2")

In [None]:
# bloom = pipeline("text-generation", model="bigscience/bloom-7b1")

### Data

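Download the dataset. Kaggle requires authentication, so `wget` on the dataset page may only fetch an HTML page; in that case, download the archive manually from the link above and save it as `archive.zip`: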

In [None]:
!wget https://www.kaggle.com/datasets/shoumikdhar/amazon-food-reviews-100k-datasets

Extract the .zip file and load the CSV:

In [None]:
with zipfile.ZipFile("archive.zip", "r") as zip_ref:
    zip_ref.extractall("archive")
    extracted_file = zip_ref.namelist()[0]
    df = pd.read_csv(f"archive/{extracted_file}")
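`namelist()[0]` assumes the first archive member is the CSV. A slightly more defensive sketch, assuming the archive holds at least one `.csv` file; the helper name `first_csv_member` and the file names are hypothetical, not part of the notebook:

```python
import io
import zipfile


def first_csv_member(zip_file: zipfile.ZipFile) -> str:
    """Return the name of the first member ending in .csv (case-insensitive)."""
    for name in zip_file.namelist():
        if name.lower().endswith(".csv"):
            return name
    raise FileNotFoundError("no .csv file found in the archive")


# Build a tiny in-memory archive to demonstrate; the contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("README.txt", "not the data")
    zf.writestr("reviews.csv", "Review,Rating\nGreat coffee,5\n")

with zipfile.ZipFile(buffer, "r") as zf:
    csv_name = first_csv_member(zf)
    header = zf.read(csv_name).decode("utf-8").splitlines()[0]

print(csv_name, header)
```

With the real archive, the returned name could be passed to `pd.read_csv` in place of `extracted_file`.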

## Data Exploration

Head : 

In [None]:
df.head()

Tail : 

In [None]:
df.tail()

Sample : 

In [None]:
df.sample(10)

Split the text on whitespace (a rough approximation, not a real model tokenizer) : 

In [None]:
df["pseudo_token"] = df.Review.apply(lambda x: x.split())
df

Count the pseudo-tokens per review, then describe : 

In [None]:
df["n_pseudo_token"] = df.pseudo_token.apply(len)
df.n_pseudo_token.describe().round(2)

Length of each doc : 

In [None]:
df["_len"] = df.Review.str.len()
df
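A toy illustration of the pattern used here: compute one value per row first, then summarise with `describe()`. The two reviews below are invented, and `toy` is a stand-in for `df`:

```python
import pandas as pd

toy = pd.DataFrame({"Review": ["Great coffee, will buy again", "Too salty"]})

# One whitespace-token count and one character count per row.
toy["n_pseudo_token"] = toy["Review"].str.split().str.len()
toy["_len"] = toy["Review"].str.len()

# describe() summarises the columns; its result is a separate small table,
# not something to store back as a per-row column.
summary = toy[["n_pseudo_token", "_len"]].describe().round(2)
print(summary)
```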

Describe the ratings : 

In [None]:
df.Rating.describe().round(2)

## High Level Implementation

### Classification & Sentiment Analysis

Use a classifier : 

In [None]:
classifier("AI stuff is real hard to understand.")

In [None]:
classifier("AI stuff is real hard to understand.", top_k=3)

In [None]:
classifier("AI stuff is so fun")

In [None]:
classifier("can you say me if AI is good or not...")

Apply on a column : 

In [None]:
results = df.Review.head().apply(classifier)
results

Results : 

In [None]:
results.explode()

In [None]:
results = results.explode().apply(pd.Series)
results

Join both : 

In [None]:
df.head().join(results)
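The flatten-then-join step can be sketched without loading a model. The snippet below assumes the pipeline returns one list holding one `{"label", "score"}` dict per input; the reviews and scores are invented:

```python
import pandas as pd

reviews = pd.DataFrame({"Review": ["Loved it", "Never again"]})

# Hand-written stand-in for `reviews.Review.apply(classifier)`.
raw = pd.Series(
    [
        [{"label": "POSITIVE", "score": 0.99}],
        [{"label": "NEGATIVE", "score": 0.98}],
    ],
    index=reviews.index,
)

# explode() unwraps the one-element lists; building a DataFrame from the
# resulting dicts turns their keys into columns.
flat = raw.explode()
scores = pd.DataFrame(flat.tolist(), index=flat.index)

# The new columns (label, score) do not clash with "Review", so join works.
joined = reviews.join(scores)
print(joined)
```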

Using another model : 

In [None]:
roberta_sentiment("AI stuff is real hard to understand.")

In [None]:
roberta_sentiment("AI stuff is so fun")

In [None]:
roberta_sentiment("can you say me if AI is good or not...")

In [None]:
results = df.Review.head().apply(roberta_sentiment).explode().apply(pd.Series)
results
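To compare predictions on a single axis, the label and score can be folded into one signed number. This is a minimal sketch, assuming the model emits labels named `negative`, `neutral` and `positive`; the helper `signed_score` and the sample predictions are hypothetical:

```python
def signed_score(prediction: dict) -> float:
    """Map a {"label", "score"} dict to a value in [-1, 1]."""
    sign = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}
    return sign[prediction["label"].lower()] * prediction["score"]


# Invented predictions in the shape returned by a sentiment pipeline.
predictions = [
    {"label": "positive", "score": 0.9},
    {"label": "neutral", "score": 0.6},
    {"label": "negative", "score": 0.8},
]

scores = [signed_score(p) for p in predictions]
print(scores)
```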

Check this blog post for more information: [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python)

### Information Extraction & Question Answering

In [None]:
txt = "Hello, I am a 40 years old guy living in San Francisco with my dog and my guitar. I want to learn how to code, can you help me?"

out = question_answerer(question="How old am I?", context=txt, top_k=10)

In [None]:
out = pd.DataFrame(out)
out

In [None]:
out.score.sum()

In [None]:
out["_cumsum"] = out.score.cumsum()
out

In [None]:
threshold = 0.75

clean_out = out.loc[out._cumsum < threshold]
clean_out

In [None]:
answers = clean_out.answer.tolist()
answers
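The cumulative-score filter can be traced by hand on a few candidates. This sketch assumes the candidates arrive sorted by descending score; the answers and scores are invented:

```python
from itertools import accumulate

# Invented candidates in the shape returned by the question-answering pipeline.
candidates = [
    {"answer": "40", "score": 0.50},
    {"answer": "40 years", "score": 0.20},
    {"answer": "40 years old", "score": 0.10},
    {"answer": "a 40", "score": 0.05},
]

threshold = 0.75

# Running total of the scores, in ranking order.
cumulative = list(accumulate(c["score"] for c in candidates))

# Keep candidates while the running total stays strictly below the threshold.
kept = [c["answer"] for c, total in zip(candidates, cumulative) if total < threshold]
print(kept)
```

Note that the strict comparison drops the candidate whose score pushes the running total past the threshold, so the kept answers cover less than 75% of the score mass.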

In [None]:
question_answerer(question="what is the product?", context=df.Review.values[4])

In [None]:
qa_model = pipeline("question-answering")
question = "Where do I live?"
context = "My name is Merve and I live in İstanbul."
qa_model(question=question, context=context, top_k=3)
# With top_k=3 this returns a list of candidate dicts; the best one looks like:
# {'answer': 'İstanbul', 'end': 39, 'score': 0.953, 'start': 31}

### Text Generation & Prompting

In [None]:
gpt2_generator("Hello, I'm an NLP student,", max_length=30, num_return_sequences=5)

In [None]:
out = gpt2_generator(
    "Hello, I'm a computer science student,", max_length=30, num_return_sequences=5
)
out

In [None]:
for dd in out:
    print(dd["generated_text"])

In [None]:
out = gpt2_generator(
    "Hello, I'm a computer science student,", max_length=100, num_return_sequences=10
)
out

In [None]:
for dd in out:
    print(dd["generated_text"])
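By default each `generated_text` starts with the prompt itself. A small sketch for keeping only the continuation; the helper `continuation_only` is hypothetical and the generated string is invented:

```python
def continuation_only(prompt: str, generated_text: str) -> str:
    """Drop the leading prompt from a generated string, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


prompt = "Hello, I'm an NLP student,"

# Invented output in the shape returned by the text-generation pipeline.
fake_out = [{"generated_text": "Hello, I'm an NLP student, and I like parsing."}]

continuations = [continuation_only(prompt, dd["generated_text"]) for dd in fake_out]
print(continuations)
```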

In [None]:
# with the open source Bloom model https://huggingface.co/bigscience/bloom

### Translation

In [None]:
# UP TO YOU TO FIND IT 😉

### Summarization

In [None]:
# UP TO YOU TO FIND IT 😉

## Specific Implementation

### Sentiment

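Load a pre-trained model and tokenizer. Note that `bert-base-uncased` ships without a fine-tuned classification head, so the head is randomly initialised and the predictions below are not meaningful until the model is fine-tuned: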

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")


Sentiment analysis pipeline



In [None]:
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Example text : 

In [None]:
result = nlp("I love learning about data science with Transformers!")
print(result)

### NER 


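Load a pre-trained model and tokenizer. As above, `bert-base-uncased` has no fine-tuned token-classification head, so the predicted entity labels are not meaningful until the model is fine-tuned: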


In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased")

NER pipeline


In [None]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

Example text


In [None]:
result = nlp("Hugging Face is a technology company based in New York")
print(result)

### Text-generation

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

Text generation pipeline

In [None]:
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

Generate text

In [None]:
print(text_generator("Artificial intelligence is", max_length=50))

Simulate a chatbot response, reusing the model and tokenizer loaded above:


In [None]:
chat_input = "Hello, how can I assist you today?"
chat_response = text_generator(chat_input, max_length=50)

In [None]:
print(chat_response)

### Fill-Mask




Load tokenizer and model

In [None]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

Fill-mask pipeline


In [None]:
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

Example

In [None]:
print(fill_mask("The weather today is <mask>."))
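The fill-mask pipeline returns a list of candidate dicts, which is hard to read when printed raw. A minimal formatting sketch, assuming each dict carries `score`, `token_str` and `sequence` keys; the helper `top_fills` is hypothetical and the predictions are invented:

```python
def top_fills(predictions: list, k: int = 3) -> list:
    """Return the k best (token, rounded score) pairs, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in ranked[:k]]


# Invented predictions; RoBERTa tokens often carry a leading space.
fake_predictions = [
    {"score": 0.1234, "token_str": " sunny", "sequence": "The weather today is sunny."},
    {"score": 0.3456, "token_str": " beautiful", "sequence": "The weather today is beautiful."},
    {"score": 0.0501, "token_str": " cold", "sequence": "The weather today is cold."},
]

best = top_fills(fake_predictions, k=2)
print(best)
```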

### ...