<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>LLMs for Data Science</h1>
<h1>NLP With HuggingFace</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter
from pprint import pprint

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

from ipywidgets import interact

import transformers
from transformers import pipeline
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

import os
import gzip

import tqdm as tq
from tqdm.notebook import tqdm
tqdm.pandas()

import networkx as nx

import watermark

%load_ext watermark
%matplotlib inline

We start by printing the versions of the libraries we are using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 23.6.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 8d244e1e4f0c6fd330052d22607886f6abfcd26c

pandas      : 2.2.3
tqdm        : 4.66.4
transformers: 4.41.1
watermark   : 2.4.3
numpy       : 1.26.4
networkx    : 3.3
matplotlib  : 3.8.0
json        : 2.0.9



Load default figure style

In [3]:
plt.style.use('d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Named Entity Recognition

In [4]:
email = """Dear Amazon, \

last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. 

Sincerely, 

Bumblebee."""

In [5]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
outputs = ner_tagger(email)

In [7]:
outputs

[{'entity_group': 'ORG',
  'score': 0.8790102,
  'word': 'Amazon',
  'start': 5,
  'end': 11},
 {'entity_group': 'MISC',
  'score': 0.9908588,
  'word': 'Optimus Prime',
  'start': 37,
  'end': 50},
 {'entity_group': 'LOC',
  'score': 0.9997547,
  'word': 'Germany',
  'start': 91,
  'end': 98},
 {'entity_group': 'MISC',
  'score': 0.5565716,
  'word': 'Mega',
  'start': 209,
  'end': 213},
 {'entity_group': 'PER',
  'score': 0.59025526,
  'word': '##tron',
  'start': 213,
  'end': 217},
 {'entity_group': 'ORG',
  'score': 0.66969275,
  'word': 'Decept',
  'start': 254,
  'end': 260},
 {'entity_group': 'MISC',
  'score': 0.4983484,
  'word': '##icons',
  'start': 260,
  'end': 265},
 {'entity_group': 'MISC',
  'score': 0.7753625,
  'word': 'Megatron',
  'start': 351,
  'end': 359},
 {'entity_group': 'MISC',
  'score': 0.98785394,
  'word': 'Optimus Prime',
  'start': 368,
  'end': 381},
 {'entity_group': 'PER',
  'score': 0.8120968,
  'word': 'Bumblebee',
  'start': 507,
  'end': 516}]

In [8]:
pd.DataFrame(outputs)    

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 37 | 50 |
| 2 | LOC | 0.999755 | Germany | 91 | 98 |
| 3 | MISC | 0.556572 | Mega | 209 | 213 |
| 4 | PER | 0.590255 | ##tron | 213 | 217 |
| 5 | ORG | 0.669693 | Decept | 254 | 260 |
| 6 | MISC | 0.498348 | ##icons | 260 | 265 |
| 7 | MISC | 0.775362 | Megatron | 351 | 359 |
| 8 | MISC | 0.987854 | Optimus Prime | 368 | 381 |
| 9 | PER | 0.812097 | Bumblebee | 507 | 516 |
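
In the output above, the first mention of `Megatron` and the word `Decepticons` each come back as two fragments (`Mega` + `##tron`, `Decept` + `##icons`), apparently because the two pieces of each word were assigned different labels. A minimal sketch of one way to stitch such fragments back together, assuming that a fragment whose `word` starts with `##` and whose `start` equals the previous fragment's `end` continues the previous entity. The helper name `merge_subword_entities` is hypothetical, and keeping the first fragment's label together with the lower of the two scores is an arbitrary choice, not something the pipeline prescribes.

```python
def merge_subword_entities(entities):
    """Merge '##' continuation fragments into the entity that precedes them.

    Hypothetical helper: keeps the first fragment's label and the lowest score.
    """
    merged = []
    for ent in entities:
        is_continuation = bool(
            ent["word"].startswith("##")
            and merged
            and merged[-1]["end"] == ent["start"]
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += ent["word"][2:]  # drop the '##' marker
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append(dict(ent))  # copy so the input is left untouched
    return merged


# Fragments copied from the NER output above
fragments = [
    {"entity_group": "MISC", "score": 0.5565716, "word": "Mega", "start": 209, "end": 213},
    {"entity_group": "PER", "score": 0.59025526, "word": "##tron", "start": 213, "end": 217},
    {"entity_group": "ORG", "score": 0.66969275, "word": "Decept", "start": 254, "end": 260},
    {"entity_group": "MISC", "score": 0.4983484, "word": "##icons", "start": 260, "end": 265},
]

merged = merge_subword_entities(fragments)
print([(e["word"], e["entity_group"]) for e in merged])
```

The token classification pipeline also accepts other `aggregation_strategy` values (`"first"`, `"average"`, `"max"`) that aggregate at the word level, which may make a post-processing step like this unnecessary.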


# PoS Tagging

Load the pipeline

In [9]:
pos_tagger = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
text = "The quick brown fox jumps over the lazy dog."

Extract the part of speech tags

In [11]:
pos_tags = pos_tagger(text)
pd.DataFrame(pos_tags)

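| | entity | score | index | word | start | end |
|---|---|---|---|---|---|---|
| 0 | DET | 0.999445 | 1 | the | 0 | 3 |
| 1 | ADJ | 0.997063 | 2 | quick | 4 | 9 |
| 2 | ADJ | 0.942299 | 3 | brown | 10 | 15 |
| 3 | NOUN | 0.997004 | 4 | fox | 16 | 19 |
| 4 | VERB | 0.999446 | 5 | jumps | 20 | 25 |
| 5 | ADP | 0.999325 | 6 | over | 26 | 30 |
| 6 | DET | 0.999527 | 7 | the | 31 | 34 |
| 7 | ADJ | 0.997863 | 8 | lazy | 35 | 39 |
| 8 | NOUN | 0.998858 | 9 | dog | 40 | 43 |
| 9 | PUNCT | 0.99965 | 10 | . | 43 | 44 |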
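
Once the tags are available, a quick summary is often more useful than the token-by-token table. The sketch below counts how often each tag occurs, using the `Counter` class imported at the top of the notebook. To keep it runnable without the model, the list `pos_tags_sample` is a hand-copied subset of the fields shown in the table above (an assumption of this example; in the notebook you would pass `pos_tags` directly).

```python
from collections import Counter

# Tag and word for each token, copied from the table above
pos_tags_sample = [
    {"entity": "DET", "word": "the"},
    {"entity": "ADJ", "word": "quick"},
    {"entity": "ADJ", "word": "brown"},
    {"entity": "NOUN", "word": "fox"},
    {"entity": "VERB", "word": "jumps"},
    {"entity": "ADP", "word": "over"},
    {"entity": "DET", "word": "the"},
    {"entity": "ADJ", "word": "lazy"},
    {"entity": "NOUN", "word": "dog"},
    {"entity": "PUNCT", "word": "."},
]

# Count the number of tokens assigned to each part of speech
tag_counts = Counter(tag["entity"] for tag in pos_tags_sample)

for tag, count in tag_counts.most_common():
    print(f"{tag:6s} {count}")
```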


# Summarization

In [12]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


The first 4 paragraphs of https://en.wikipedia.org/wiki/Transformers

In [13]:
wiki_text = """
Transformers is a media franchise produced by American toy company Hasbro and Japanese toy company Takara Tomy. It primarily follows the heroic Autobots and the villainous Decepticons, two alien robot factions at war that can transform into other forms, such as vehicles and animals. The franchise encompasses toys, animation, comic books, video games and films. As of 2011, it generated more than ¥2 trillion ($25 billion) in revenue,[1] making it one of the highest-grossing media franchises of all time.

The franchise began in 1984 with the Transformers toy line, comprising transforming mecha toys from Takara's Diaclone and Micro Change toylines rebranded for Western markets.[2] The term "Generation 1" (G1) covers both the animated television series The Transformers and the comic book series of the same name, which are further divided into Japanese, British and Canadian spin-offs. Sequels followed, such as the Generation 2 comic book and Beast Wars TV series, which became its own mini-universe. Generation 1 characters have been rebooted multiple times in the 21st century in comics from Dreamwave Productions (starting 2001), IDW Publishing (starting in 2005 and again in 2019), and Skybound Entertainment (beginning in 2023). There have been other incarnations of the story based on different toy lines during and after the 20th century. The first was the Robots in Disguise series, followed by three shows (Armada, Energon, and Cybertron) that constitute a single universe called the "Unicron Trilogy".

A live-action film series started in 2007, again distinct from previous incarnations, while the Transformers: Animated series merged concepts from the G1 continuity, the 2007 live-action film and the "Unicron Trilogy". For most of the 2010s, in an attempt to mitigate the wave of reboots, the "Aligned Continuity" was established. In 2018, Transformers: Cyberverse debuted, once again, distinct from the previous incarnations.

Although a separate and competing franchise started in 1983, Tonka's GoBots became the intellectual property of Hasbro after their buyout of Tonka in 1991. Subsequently, the universe depicted in the animated series Challenge of the GoBots and follow-up film GoBots: Battle of the Rock Lords was retroactively established as an alternate universe within the Transformers multiverse.[3] 
"""

To generate the summary we just have to call the pipeline

In [14]:
summary = summarizer(wiki_text)

print(summary[0]['summary_text'])

 The Transformers is a media franchise produced by Hasbro and Japanese toy company Takara Tomy . It primarily follows the heroic Autobots and the villainous Decepticons, two alien robot factions at war that can transform into other forms, such as vehicles and animals . As of 2011, it generated more than ¥2 trillion ($25 billion) in revenue .


We can also specify a minimum length (in tokens) for the summary

In [15]:
summary = summarizer(wiki_text, min_length=100)

print(summary[0]['summary_text'])

 Transformers is a media franchise produced by Hasbro and Japanese toy company Takara Tomy . It primarily follows the heroic Autobots and the villainous Decepticons, two alien robot factions at war that can transform into other forms, such as vehicles and animals . As of 2011, it generated more than ¥2 trillion ($25 billion) in revenue, making it one of the highest-grossing media franchises of all time . The term "Generation 1" (G1) covers both the animated television series The Transformers and the comic book series of the same name .


# Question Answering 

In [16]:
reader = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [17]:
question = "What does the customer want?"

In [18]:
outputs = reader(question=question, context=email)
pd.DataFrame([outputs])    

| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 336 | 359 | an exchange of Megatron |


# Translation

In [19]:
translator = pipeline("translation_en_to_it", 
                      model="Helsinki-NLP/opus-mt-en-it")

In [20]:
outputs = translator(email, clean_up_tokenization_spaces=True, min_length=100, max_length=1000)
print(outputs[0]['translation_text'])

Cara Amazon, la scorsa settimana ho ordinato una figura d'azione Optimus Prime dal tuo negozio online in Germania. Purtroppo, quando ho aperto il pacchetto, ho scoperto al mio orrore che ero stato inviato una figura d'azione di Megatron invece! Come un nemico per tutta la vita dei Decepticon, spero che si può capire il mio dilemma. Per risolvere il problema, chiedo uno scambio di Megatron per la figura di Optimus Prime ho ordinato. In allegato sono copie dei miei record riguardanti questo acquisto. Mi aspetto di sentire da voi presto. Cordialmente, Bumblebee.


For comparison, let us look at the result from Google Translate:

```
Caro Amazon, la settimana scorsa ho ordinato un action figure di Optimus Prime dal tuo negozio online in Germania. Sfortunatamente, quando ho aperto il pacco, ho scoperto con orrore che mi era stata invece inviata una action figure di Megatron! Essendo un nemico da sempre dei Decepticon, spero che tu possa capire il mio dilemma. Per risolvere il problema, chiedo uno scambio di Megatron con la figura di Optimus Prime che ho ordinato. In allegato sono presenti copie dei miei documenti relativi a questo acquisto. Mi aspetto di sentirti presto. Cordiali saluti, Bombo.
```

Google Translate is less context-aware in this case, going so far as to translate the name of the email sender (Bumblebee -> Bombo). On the other hand, the Hugging Face model is more formal: it writes "sentire da voi" where Google Translate uses the informal "sentirti".

# Sentiment Analysis

In [21]:
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Let us use a couple of fairly obvious instances

In [22]:
instances = ["I love you", "I hate you"]

The model does a pretty good job of figuring out which one is positive and which one is negative

In [23]:
sentiment_pipeline(instances)

[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]

# Application

Load a few thousand tweets about Apple

In [24]:
data = pd.read_csv('data/Apple-Twitter-Sentiment-DFE.csv', usecols=['text'])

In [25]:
data

| | text |
|---|---|
| 0 | #AAPL:The 10 best Steve Jobs emails ever...htt... |
| 1 | RT @JPDesloges: Why AAPL Stock Had a Mini-Flas... |
| 2 | My cat only chews @apple cords. Such an #Apple... |
| 3 | I agree with @jimcramer that the #IndividualIn... |
| 4 | Nobody expects the Spanish Inquisition #AAPL |
| ... | ... |
| 3881 | (Via FC) Apple Is Warming Up To Social Media -... |
| 3882 | RT @MMLXIV: there is no avocado emoji may I as... |
| 3883 | @marcbulandr I could not agree more. Between @... |
| 3884 | My iPhone 5's photos are no longer downloading... |


Compute the sentiment score for each tweet

In [26]:
sent = pd.DataFrame(data['text'].progress_apply(lambda x: pd.Series(sentiment_pipeline(x)[0])))

  0%|          | 0/3886 [00:00<?, ?it/s]
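
The cell above calls the pipeline once per tweet. As the two-sentence example showed, the pipeline also accepts a list of strings and returns one result per input, so the tweets could instead be sent in chunks. A minimal sketch, assuming `classify` is any callable with that list-in, list-out behavior; the names `classify_in_batches` and `toy_classifier` are hypothetical, and the toy classifier is a keyword rule standing in for the real model so that the example runs on its own.

```python
def classify_in_batches(texts, classify, batch_size=32):
    """Apply a list-in, list-out classifier to texts in chunks of batch_size."""
    texts = list(texts)  # also accepts a pandas Series or any other iterable
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classify(batch))
    return results


def toy_classifier(batch):
    """Stand-in for sentiment_pipeline: a keyword rule, NOT the real model."""
    return [
        {"label": "NEGATIVE" if "hate" in text else "POSITIVE", "score": 1.0}
        for text in batch
    ]


toy_results = classify_in_batches(
    ["I love you", "I hate you", "I love tea"], toy_classifier, batch_size=2
)
print([r["label"] for r in toy_results])
```

With the real pipeline, `classify_in_batches(data['text'], sentiment_pipeline)` should return a list of dictionaries that can be passed straight to `pd.DataFrame`. Whether this is faster than the row-by-row version depends on the hardware and the batch size.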

In [27]:
sent.rename(columns={'score': 'sentiment_confidence', 'label':'sentiment'}, inplace=True)

In [28]:
sent

| | sentiment | sentiment_confidence |
|---|---|---|
| 0 | POSITIVE | 0.999432 |
| 1 | NEGATIVE | 0.999122 |
| 2 | NEGATIVE | 0.996177 |
| 3 | POSITIVE | 0.995648 |
| 4 | NEGATIVE | 0.932676 |
| ... | ... | ... |
| 3881 | NEGATIVE | 0.992046 |
| 3882 | NEGATIVE | 0.999158 |
| 3883 | NEGATIVE | 0.935773 |
| 3884 | NEGATIVE | 0.998303 |


We can also use NER to identify when a person is mentioned in the tweet

In [29]:
data['text'].iloc[0]

'#AAPL:The 10 best Steve Jobs emails ever...http://t.co/82G1kL94tx'

In [30]:
ner_tagger(data['text'].iloc[0])

[{'entity_group': 'PER',
  'score': 0.73089933,
  'word': 'Steve Jobs',
  'start': 18,
  'end': 28}]
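
The `start` and `end` fields are character offsets into the input string, with `end` exclusive, so they can be used to slice the original text. The sketch below uses the tweet and the entity shown above to mark the detected span inline; the helper name `annotate` and the `[text|LABEL]` format are choices made for this illustration only.

```python
def annotate(text, entities):
    """Wrap each detected span in [span|LABEL] using its character offsets."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        pieces.append(text[cursor:ent["start"]])       # text before the entity
        span = text[ent["start"]:ent["end"]]           # the entity itself
        pieces.append(f"[{span}|{ent['entity_group']}]")
        cursor = ent["end"]
    pieces.append(text[cursor:])                       # remaining text
    return "".join(pieces)


tweet = "#AAPL:The 10 best Steve Jobs emails ever...http://t.co/82G1kL94tx"
entities = [
    {"entity_group": "PER", "score": 0.73089933, "word": "Steve Jobs", "start": 18, "end": 28}
]

print(tweet[18:28])
print(annotate(tweet, entities))
```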

Identify the first person mentioned in each tweet

In [31]:
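def find_people(x):
    """Return the first PER entity found in the text x, or None values if there is none."""
    output = ner_tagger(x)
    
    for tag in output:
        if tag['entity_group'] == 'PER':
            out = {'confidence':tag["score"], 'person': tag['word']}
            # Stop at the first person found; any later mentions are ignored
            return pd.Series(out)
    
    # No person was detected in this text
    return pd.Series({"confidence": None, "person": None})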
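
Note that `find_people` returns as soon as it sees the first `PER` entity, so a tweet that mentions several people contributes only one of them. A minimal sketch of a variant that keeps every person, assuming it is given the list already returned by `ner_tagger`; the name `find_all_people` and the `min_score` threshold are hypothetical additions, not part of the notebook.

```python
def find_all_people(entities, min_score=0.0):
    """Return every PER entity in a tagger output with score >= min_score."""
    return [
        {"person": ent["word"], "confidence": float(ent["score"])}
        for ent in entities
        if ent["entity_group"] == "PER" and ent["score"] >= min_score
    ]


# A few entities copied from the NER output for the email earlier in the notebook
sample_entities = [
    {"entity_group": "ORG", "score": 0.8790102, "word": "Amazon", "start": 5, "end": 11},
    {"entity_group": "PER", "score": 0.59025526, "word": "##tron", "start": 213, "end": 217},
    {"entity_group": "PER", "score": 0.8120968, "word": "Bumblebee", "start": 507, "end": 516},
]

print(find_all_people(sample_entities))
print(find_all_people(sample_entities, min_score=0.7))
```

Raising `min_score` is one way to drop low-confidence fragments such as `##tron`, at the cost of possibly discarding genuine mentions.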

In [32]:
people = pd.DataFrame(data['text'].progress_apply(find_people))

  0%|          | 0/3886 [00:00<?, ?it/s]

Combine all the results into a single DataFrame

In [33]:
data = pd.concat([data, sent, people], axis=1)

In [34]:
data

Subset the data to only the tweets mentioning people

In [35]:
people = data[data.person.notna()].copy()

In [36]:
people

Convert the text labels to a numerical score

In [37]:
people['sentiment'] = people['sentiment'].apply(lambda label: 1 if label == 'POSITIVE' else -1)
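
Mapping the labels to +1 and -1 discards the confidence reported by the model, so a barely positive tweet counts as much as a clearly positive one. One possible alternative, sketched below, multiplies the sign by the confidence. The function name `signed_sentiment` is hypothetical, and this weighting is an assumption of the example rather than what the notebook computes.

```python
def signed_sentiment(label, confidence):
    """Map a (label, confidence) pair to a score between -1 and 1."""
    sign = 1 if label == "POSITIVE" else -1
    return sign * confidence


# Results copied from the "I love you" / "I hate you" example above
examples = [
    {"label": "POSITIVE", "score": 0.9998656511306763},
    {"label": "NEGATIVE", "score": 0.9991129040718079},
]

signed = [signed_sentiment(r["label"], r["score"]) for r in examples]
print(signed)
```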

Compute the average score

In [38]:
stats = people[['person', 'sentiment']].groupby('person').mean()

In [39]:
counts = people[['person', 'sentiment']].groupby('person').count()
counts.rename(columns={'sentiment':'count'}, inplace=True)

In [40]:
stats = stats.join(counts)

In [41]:
stats[stats['count']>=5].sort_values('sentiment', ascending=False)

<center>
     <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>