# Sentiment Analysis with BERT<div class="tocSkip"/>
    
&copy; Jens Albrecht, 2021
    
This notebook can be freely copied and modified.  
Attribution, however, is highly appreciated.

<hr/>

See also: 

Albrecht, Ramachandran, Winkler: **Blueprints for Text Analytics in Python** (O'Reilly 2020)  
Chapter 11: [Performing Sentiment Analysis on Text Data](https://learning.oreilly.com/library/view/blueprints-for-text/9781492074076/ch11.html#ch-sentiment) + [Link to Github](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/README.md)

## Setup<div class='tocSkip'/>

Set directory locations. If working on Google Colab: copy files and install required libraries.

In [None]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/jsalbr/tdwi-2021-text-mining/raw/main'
    os.system(f'wget {GIT_ROOT}/notebooks/setup.py')

%run -i setup.py

## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [None]:
%run "$BASE_DIR/notebooks/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# to print output of all statements and not just the last
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# otherwise text between $ signs will be interpreted as formula and printed in italic
pd.set_option('display.html.use_mathjax', False)
pd.options.plotting.backend = "matplotlib"

# path to import blueprints packages
sys.path.append(f'{BASE_DIR}/packages')

## Load Data

In [None]:
df = pd.read_csv(f"{BASE_DIR}/data/reddit-autos-selfposts-prepared.csv", sep=";", decimal=".")

len(df)

## Word Embeddings

The model used below is pretrained and loaded from disk. To train it yourself, uncomment the following code:

In [None]:
from gensim.models import Word2Vec

# sents = df['lemmas'].str.lower().str.split() 
# model = Word2Vec(sents, vector_size=100, window=30, sg=1)
# model.wv.save_word2vec_format('w2v_autos_100_30_sg.bin', binary=True)

In [None]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(f'{BASE_DIR}/data/w2v_autos_100_30_sg.bin', binary=True)

### Similarity Queries to Explore a Domain Vocabulary

The model was trained with a large window size of 30. It therefore favors terms that often co-occur with the search word (syntagmatic relations):

In [None]:
model.most_similar('audi', topn=10)
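
`most_similar` ranks the vocabulary by cosine similarity to the query word's vector. A minimal sketch of that ranking, assuming made-up 3-dimensional toy vectors (not values from the trained model) and a hypothetical helper `rank_by_cosine`:

```python
import math

# Toy vectors for illustration only; they are NOT taken from the trained model.
toy_vectors = {
    'audi':     [0.9, 0.1, 0.0],
    'bmw':      [0.8, 0.2, 0.1],
    'mercedes': [0.7, 0.3, 0.1],
    'tire':     [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(vectors, word, topn=10):
    """Hypothetical helper: rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = rank_by_cosine(toy_vectors, 'audi', topn=3)
print(ranking)
```

In the toy data the two car brands end up closest to 'audi', while 'tire' points in a different direction and ranks last.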

It also works well for analogy queries such as "x5 is to bmw as ? is to audi":

In [None]:
model.most_similar(positive=['x5', 'audi'], negative=['bmw'], topn=5)
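
The analogy query boils down to vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. A minimal sketch, assuming invented 2-dimensional toy vectors and a hypothetical helper `analogy`. gensim's implementation also unit-normalizes the vectors before combining them, which this sketch omits for brevity:

```python
import math

# Toy 2-d vectors for illustration only: axis 0 ~ "brand", axis 1 ~ "SUV-ness".
toy_vectors = {
    'bmw':  [1.0, 0.0],
    'x5':   [1.0, 1.0],
    'audi': [-1.0, 0.0],
    'q7':   [-1.0, 1.0],
    'a4':   [-1.0, -0.5],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=5):
    """Hypothetical helper: rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    # The query words themselves are excluded from the result.
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy(toy_vectors, positive=['x5', 'audi'], negative=['bmw'])
print(result)
```

Here `x5 - bmw + audi` lands exactly on the toy vector for 'q7'. With real embeddings the match is only approximate.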

### Visualize Word Embeddings

In [None]:
from blueprints.embeddings import plot_embeddings

search = ['ford', 'bmw', 'toyota', 'tesla', 'audi', 'mercedes', 'hyundai']

plot_embeddings(model, search, topn=30, n_dims=3, 
    algo='umap', n_neighbors=15, min_dist=.1, spread=40, random_state=23)

In [None]:
from blueprints.embeddings import sim_tree, plot_tree

graph = sim_tree(model, 'sparkplug', top_n=8, max_dist=2)
plot_tree(graph, node_size=500, font_size=8)
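
The `sim_tree` helper comes from the blueprints package and its internals are not shown here. One plausible reading of its parameters is a breadth-first expansion: start at the root word, take its `top_n` nearest neighbors, and repeat for each neighbor up to `max_dist` hops. A sketch under that assumption, with a hypothetical function `build_sim_tree` and hand-written neighbor lists standing in for `model.most_similar`:

```python
from collections import deque

# Hypothetical nearest-neighbor lists standing in for model.most_similar().
toy_neighbors = {
    'sparkplug': ['coil', 'plug'],
    'coil': ['ignition', 'sparkplug'],
    'plug': ['sparkplug', 'socket'],
    'ignition': ['coil', 'starter'],
    'socket': ['wrench', 'plug'],
}

def build_sim_tree(neighbors, root, top_n=2, max_dist=2):
    """Breadth-first expansion of nearest neighbors up to `max_dist` hops."""
    edges = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        word, dist = queue.popleft()
        if dist == max_dist:
            continue  # do not expand nodes at the maximum distance
        for neighbor in neighbors.get(word, [])[:top_n]:
            if neighbor not in seen:  # each word appears only once in the tree
                seen.add(neighbor)
                edges.append((word, neighbor))
                queue.append((neighbor, dist + 1))
    return edges

edges = build_sim_tree(toy_neighbors, 'sparkplug', top_n=2, max_dist=2)
print(edges)
```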

## Sentiment Analysis Using Huggingface Transformers

Links: 
  * [Transformers Library from Hugging Face](https://huggingface.co/transformers)
  * [Transformers Quick Tour](https://huggingface.co/transformers/quicktour.html)

### Load a Model for Sentiment Analysis

For a list of models see [Hugging Face Model Hub](https://huggingface.co/models).

The first call downloads the model, which takes a moment.

The downloaded files are cached in `~/.cache/huggingface/transformers` (see [Huggingface documentation](https://huggingface.co/docs/datasets/installation.html#caching-datasets-and-metrics)).

In [None]:
from transformers import pipeline

# classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
classifier.model

This model was fine-tuned on product reviews in six languages (English, Dutch, German, French, Spanish, and Italian). It predicts a rating from 1 to 5 stars.

In [None]:
sents = [
  'We are very happy to show you the 🤗 Transformers library.',
  'The weather today is not really what I expected.'
]

classifier(sents)
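
The pipeline returns one dictionary per input with a `label` (for this model a string such as `'4 stars'`) and a `score`. To work with the ratings numerically, the label can be parsed. A minimal sketch, assuming illustrative output with invented scores and a hypothetical helper `stars`:

```python
# Illustrative pipeline-style output; the scores are made up, not real predictions.
sample_output = [
    {'label': '5 stars', 'score': 0.77},
    {'label': '2 stars', 'score': 0.45},
    {'label': '1 star', 'score': 0.61},
]

def stars(prediction):
    """Extract the numeric rating from a label such as '4 stars' or '1 star'."""
    return int(prediction['label'].split()[0])

ratings = [stars(p) for p in sample_output]
mean_rating = sum(ratings) / len(ratings)
print(ratings, round(mean_rating, 2))
```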

### Aspect-based Sentiment Analysis

Check the sentiment for the aspect "charging" in the Tesla subreddit.

Select short posts from the subreddit 'teslamotors' whose lemmas contain 'charge', exclude questions (posts containing '?'), and draw a random sample of 20.

In [None]:
pd.set_option('display.max_colwidth', 3000)

senti_df = df[
    (df['lemmas'].str.len() < 400) &                        # short posts only
    df['lemmas'].str.lower().str.contains('charge') &       # mentions charging
    (~df['text'].str.contains('?', regex=False)) &          # literal '?', no regex escape needed
    (df['subreddit']=='teslamotors')][['text']].sample(20)
senti_df.reset_index(inplace=True)

Add sentiment prediction:

In [None]:
senti_df.join(pd.DataFrame(classifier(list(senti_df['text'].str.lower()))))

## Question Answering

The model used below was fine-tuned on the Stanford Question Answering Dataset (SQuAD v1.1). The first link shows examples from the newer SQuAD 2.0.

See
  * https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/European_Union_law.html
  * [Huggingface documentation for QA](https://huggingface.co/transformers/usage.html#extractive-question-answering)

In [None]:
from transformers import pipeline

qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

question = "What is extractive question answering?"
answer = qa_model(question=question, context=context)
print("Q:", question)
print("A:", answer['answer'], f"(confidence: {answer['score']:.2f})\n")

question = "What is a good example of a question answering dataset?"
answer = qa_model(question=question, context=context)
print("Q:", question)
print("A:", answer['answer'], f"(confidence: {answer['score']:.2f})\n")
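
Besides `answer` and `score`, the result dictionary contains the character offsets `start` and `end` of the answer span within the context. They can be used to mark the answer in the original text. A minimal sketch, assuming a mock result with an invented score and a hypothetical helper `highlight`:

```python
context = (
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question."
)

# Mock result shaped like the pipeline's output; score and span are invented.
answer_text = "the task of extracting an answer from a text given a question"
start = context.find(answer_text)
mock_answer = {
    'score': 0.62,
    'start': start,
    'end': start + len(answer_text),
    'answer': answer_text,
}

def highlight(context, answer, marker='**'):
    """Wrap the answer span in `marker` using the character offsets."""
    s, e = answer['start'], answer['end']
    return context[:s] + marker + context[s:e] + marker + context[e:]

highlighted = highlight(context, mock_answer)
print(highlighted)
```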

Examples from [Game of Thrones Wiki](https://gameofthrones.fandom.com/wiki):

In [None]:
from transformers import pipeline

qa_model = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = """
Bran is the fourth child and second son of Lady Catelyn and Lord Ned
Stark. Ned is the head of House Stark, Lord Paramount of the North,
and Warden of the North to King Robert Baratheon. The North is one of
the constituent regions of the Seven Kingdoms and House Stark is one
of the Great Houses of the realm. House Stark rules the region from
their seat of Winterfell.

Winterfell is the capital of the Kingdom of the North and the seat and 
the ancestral home of the royal House Stark. It is a very large castle 
located at the center of the North, from where the head of House Stark 
rules over his or her people. """

questions = [
    "Who is Bran?",
    "What is Winterfell?",
    "Where is Winterfell located?",
]

# ask each question against the same context
for question in questions:
    answer = qa_model(question=question, context=context)
    print("Q:", question)
    print("A:", answer['answer'], f"(confidence: {answer['score']:.2f})\n")