# Intro to BERT

- BERT stands for Bidirectional Encoder Representations from Transformers.
- BERT is pre-trained on a large corpus of unlabeled text: English Wikipedia (about 2,500 million words) and BooksCorpus (about 800 million words).
- BERT is based on the Transformer architecture.

# 1. Install and Import Dependencies

In [1]:
#!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

In [2]:
#!pip install transformers requests beautifulsoup4 pandas numpy

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification  # tokenizer: text -> token IDs; model: pretrained sequence classifier
import torch
import requests
from bs4 import BeautifulSoup  # parse HTML and extract the review text
import re  # regex for matching CSS class names


Model card: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

# 2. Instantiate Model

In [2]:
# Instantiate the BERT model: load the tokenizer and the model weights from the Hugging Face Hub
link = 'nlptown/bert-base-multilingual-uncased-sentiment'

tokenizer = AutoTokenizer.from_pretrained(link)
model = AutoModelForSequenceClassification.from_pretrained(link)


# 3. Encode and Calculate Sentiment

### Testing on a small sample

In [8]:
my_review = 'It was good spot but couldve been better experience if the service was faster'
tokens = tokenizer.encode(my_review, return_tensors='pt')

In [9]:
result = model(tokens)

In [10]:
result

SequenceClassifierOutput(loss=None, logits=tensor([[-1.7396,  0.2251,  2.1218,  1.2335, -1.5565]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [11]:
result.logits  # raw, unnormalized scores (logits), one per star rating; they are not probabilities until softmax is applied

tensor([[-1.7396,  0.2251,  2.1218,  1.2335, -1.5565]],
       grad_fn=<AddmmBackward0>)

In [12]:
# Convert the logits into a readable star rating.
# argmax returns a 0-based class index, so add 1 to map it onto the 1-to-5 scale.
int(torch.argmax(result.logits)) + 1

3
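The logits are unnormalized scores, so they are easier to interpret after a softmax. Below is a minimal stdlib-only sketch that reuses the logits printed above; the `softmax` helper is written here for illustration and is not part of the original notebook.

```python
import math

# Logits copied from the model output shown above (one score per star rating, 1 to 5).
logits = [-1.7396, 0.2251, 2.1218, 1.2335, -1.5565]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
stars = probs.index(max(probs)) + 1  # same +1 shift as the argmax cell

for rating, p in enumerate(probs, start=1):
    print(f"{rating} star(s): {p:.3f}")
print("predicted rating:", stars)
```

On the tensor itself, `torch.softmax(result.logits, dim=-1)` should give the same probabilities. Softmax preserves the ordering of the scores, so the predicted rating does not change.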

In [14]:
my_review_2 = 'It was good spot with great menu items'

tokens = tokenizer.encode(my_review_2, return_tensors='pt')
result = model(tokens)
int(torch.argmax(result.logits))+1 

4

# 4. Collect Reviews

In [28]:
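# Alternative page to scrape; commented out because the next assignment would overwrite it anyway
# url = 'https://www.yelp.com/biz/social-brew-cafe-pyrmont'
url = 'https://www.yelp.com/biz/the-feed-co-table-and-tavern-chattanooga-4?osq=Restaurants'
r = requests.get(url)

# Parse the HTML and keep every <p> whose class attribute contains "comment"
soup = BeautifulSoup(r.text, 'html.parser')
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class':regex})
reviews = [result.text for result in results]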
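To see what the class filter is doing without hitting the network, here is a stdlib-only sketch of the same idea. It uses `html.parser` instead of BeautifulSoup, and the HTML snippet and its class names are made up for illustration. It only roughly mimics `find_all('p', {'class': regex})`.

```python
import re
from html.parser import HTMLParser

# Made-up HTML: two review-like paragraphs and one unrelated paragraph.
SAMPLE_HTML = """
<div>
  <p class="comment__abc123">Great coffee.</p>
  <p class="footer-text">About us</p>
  <p class="css-1 comment-body">Slow <span>service</span>.</p>
</div>
"""


class CommentParagraphs(HTMLParser):
    """Collect the text of <p> tags that have a class matching a regex."""

    def __init__(self, class_pattern):
        super().__init__()
        self.class_pattern = class_pattern
        self.reviews = []
        self._in_match = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            classes = dict(attrs).get("class") or ""
            # Check each class name separately against the pattern.
            if any(self.class_pattern.search(c) for c in classes.split()):
                self._in_match = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_match:
            self.reviews.append("".join(self._buffer))
            self._in_match = False

    def handle_data(self, data):
        # Text inside nested tags (such as <span>) is collected too.
        if self._in_match:
            self._buffer.append(data)


parser = CommentParagraphs(re.compile(".*comment.*"))
parser.feed(SAMPLE_HTML)
print(parser.reviews)
```

Only the paragraphs whose class contains `comment` are kept. If `reviews` comes back empty on the real page, the site's class names may have changed or the request may have been blocked.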

In [29]:
reviews

["We went to Feed for brunch yesterday and were blown away. Joel was our server and we will totally ask for him every time we go back. He recommended the hot pimento and honey chicken biscuit and it was phenomenal!!! My husband got his other recommendation, the wagyu smash burger. Ridiculously good!!! Food and bloody Mary's came out quickly. We will definitely be back!!!",
 'Great food and service! Our server Noal, he  was amazing!  He was funny and entertaining!',
 "Honestly this was one of the best meals I've had in Chattanooga. I met a coworker for lunch based on her recommendation and recommendation from several others. And boy they were not wrong!  The food was phenomenal. After going back and forth about several options, I finally settled on catfish with a corn salsa, mash potatoes and changed up the green beans for greens based on a recommendation from the waitress. (I wish I did the brussel sprouts because my coworker ordered them and they were beyond phenomenal!The catfish was

# 5. Load Reviews into DataFrame and Score

In [30]:
import numpy as np
import pandas as pd

In [31]:
df = pd.DataFrame(reviews, columns=['review'])

In [32]:
df['review'].iloc[0]

"We went to Feed for brunch yesterday and were blown away. Joel was our server and we will totally ask for him every time we go back. He recommended the hot pimento and honey chicken biscuit and it was phenomenal!!! My husband got his other recommendation, the wagyu smash burger. Ridiculously good!!! Food and bloody Mary's came out quickly. We will definitely be back!!!"

In [33]:
def sentiment_score(review):
    """Return the predicted star rating (1 to 5) for a single review string."""
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

In [34]:
sentiment_score(df['review'].iloc[1])

5

In [35]:
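# Keep only the first 512 characters of each review. This is a character cut, not a token cut;
# it is a rough guard against BERT's 512-token input limit.
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))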
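The `x[:512]` slice cuts by characters, while the model's limit is counted in tokens. Here is a minimal sketch of truncating at the token level instead, assuming the `tokenizer` and `model` objects loaded earlier. The names `sentiment_score_truncated` and `max_tokens` are hypothetical; `truncation` and `max_length` are standard arguments of `tokenizer.encode`.

```python
def sentiment_score_truncated(review, tokenizer, model, max_tokens=512):
    """Score one review, letting the tokenizer cut it to at most max_tokens tokens."""
    tokens = tokenizer.encode(
        review,
        return_tensors="pt",
        truncation=True,        # drop tokens beyond max_length
        max_length=max_tokens,  # BERT-style models accept at most 512 positions
    )
    result = model(tokens)
    # argmax gives a 0-based class index; +1 maps it to the 1-to-5 star scale
    return int(result.logits.argmax()) + 1


# Usage with the objects defined earlier in this notebook (not executed here):
# df['sentiment'] = df['review'].apply(
#     lambda x: sentiment_score_truncated(x, tokenizer, model)
# )
```

Passing the tokenizer and model as arguments keeps the function easy to test and avoids relying on notebook globals.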

In [17]:
df

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | Very cute coffee shop and restaurant. They hav... | 4 |
| 1 | Six of us met here for breakfast before our wa... | 4 |
| 2 | The food was delicious. The ricotta pancakes w... | 4 |
| 3 | Great place with delicious food and friendly s... | 5 |
| 4 | Great service, lovely location, and really ama... | 5 |
| 5 | Great food amazing coffee and tea. Short walk ... | 5 |
| 6 | Ricotta hot cakes! These were so yummy. I ate ... | 5 |
| 7 | It was ok. Had coffee with my friends. I'm new... | 3 |
| 8 | We came for brunch twice in our week-long visi... | 4 |
| 9 | I came to Social brew cafe for brunch while ex... | 5 |
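A quick way to summarize the scores is to count how often each rating occurs and take the average. The sketch below uses only the standard library and the ten sentiment values copied from the table above.

```python
from collections import Counter
from statistics import mean

# Sentiment column copied from the DataFrame output above.
sentiments = [4, 4, 4, 5, 5, 5, 5, 3, 4, 5]

counts = Counter(sentiments)
average = mean(sentiments)

for stars in sorted(counts):
    print(f"{stars} stars: {counts[stars]} review(s)")
print(f"average rating: {average:.1f}")
```

On the DataFrame itself, `df['sentiment'].value_counts()` and `df['sentiment'].mean()` should give the same information.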


In [18]:
df['review'].iloc[3]

'Great place with delicious food and friendly staff. It is small but has outdoor seating and a relaxed ambiance. Perfect place to enjoy a cup of coffee. I am visiting Sydney for the first time but this place seems like is a local favorite.'

# Text Generation

Using GPT-2: https://huggingface.co/openai-community/gpt2

In [None]:
from transformers import pipeline

In [18]:
text_generator = pipeline("text-generation", model='gpt2')
gen_text = text_generator("This is a story about a queen in the middle of a civil war", max_length=100, num_return_sequences=5)


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [19]:
gen_text

[{'generated_text': 'This is a story about a queen in the middle of a civil war who gets shot down at the last second by an assassin. In order to save her, Queen Idris must overcome an ancient evil that had taken it all away.\n\n\nWhat began as a comic book tale was to become something much darker and more interesting. It is a story of a love triangle over an incredible battle that has caused all the best minds in the series to make a life for themselves. A great deal is going'},
 {'generated_text': 'This is a story about a queen in the middle of a civil war, fighting for her rights because she finds herself no longer at peace with her parents.\n\nAnd this is a story about an artist who is trying to bring life back to reality because the fact that life has not happened to her means there is no life in the world. No life is right, and so the artist has to leave home and pursue this work. And that was the only reason I wrote about it.\n\n'},
 {'generated_text': "This is a story about a q
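The pipeline returns a list of dicts, each with a `generated_text` key, which is hard to read in raw form. Here is a small sketch for printing the candidates one per line. `sample_output` is a stand-in with placeholder strings rather than real model output, and `extract_texts` is a hypothetical helper.

```python
# Stand-in for the pipeline output: a list of dicts, each with a 'generated_text' key.
# The strings are short placeholders, not real model output.
sample_output = [
    {"generated_text": "This is a story about a queen.\n\nShe ruled alone."},
    {"generated_text": "This is a story about a queen   who waited."},
]


def extract_texts(outputs):
    """Return each generated string with runs of whitespace collapsed to one space."""
    return [" ".join(item["generated_text"].split()) for item in outputs]


for i, text in enumerate(extract_texts(sample_output), start=1):
    print(f"[{i}] {text}")
```

In the notebook, calling `extract_texts(gen_text)` would apply the same cleanup to the real output.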

In [20]:
# prompt 2
gen_text = text_generator("What is Statistics?", max_length=100, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [21]:
gen_text

[{'generated_text': 'What is Statistics?\n\nStatistics is the science-based form of statistics. Statistics enables readers to see or track real phenomena within an environment.\n\nIt is the way of nature. Statistics enables people to see a reality.\n\nWhat are the Benefits of Statistics?\n\nTo help people understand and comprehend statistics, I\'d like to introduce "Statistics: The Future of Science". It\'s a fascinating look into the future of science and what is available on the internet. No need to think'},
 {'generated_text': 'What is Statistics?\n\nA recent study looked at which cities were rated as the strongest to hardest, and found that they were most at fault for the greatest number of crimes. "In some areas like Chicago and New York, the rate of homicide at a time when homicides have come down to two per 100,000 was worse than in others."\n\nIn other parts of the country, it was less bad. "The rate of violent crime decreased with the onset of the Great Recession on July'},
 {