# Hugging Face Transformers

## 0. Read in Data

In [1]:
import pandas as pd

# modify the column width
pd.set_option('display.max_colwidth', None) # Default is 50, None shows all text

# look at a subset of the reviews
df = pd.read_excel('Data/Popchip_Reviews_Sentiment.xlsx').head(30)
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269


In [2]:
df.shape

(30, 7)

## 1. Sentiment Analysis

In [3]:
from transformers import pipeline

In [4]:
sentiment_analyzer = pipeline('sentiment-analysis', 
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=-1 # -1 to use CPU
                             )

Device set to use cpu


In [5]:
text1 = 'When life gives you lemons, make lemonade! 🙂'
text2 = 'A dozen lemons will make a gallon of lemonade.'
text3 = 'I didn\'t like the taste of that lemonade at all.'

In [6]:
sentiment_analyzer(text1)

[{'label': 'POSITIVE', 'score': 0.996239423751831}]

In [7]:
sentiment_analyzer(text2)

[{'label': 'POSITIVE', 'score': 0.7781572341918945}]

In [8]:
sentiment_analyzer(text3)

[{'label': 'NEGATIVE', 'score': 0.9955589771270752}]

In [9]:
## Practical Example

In [10]:
sentiment_analyzer = pipeline('sentiment-analysis', 
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=-1, # -1 to use CPU
                              truncation=True # cut off inputs that exceed the model's maximum length (512 tokens for DistilBERT)
                             )

Device set to use cpu


In [11]:
df.Text.apply(sentiment_analyzer)

0     [{'label': 'POSITIVE', 'score': 0.9935213923454285}]
1      [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.6984866261482239}]
3     [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4     [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
5     [{'label': 'POSITIVE', 'score': 0.9994196891784668}]
6     [{'label': 'POSITIVE', 'score': 0.9992188215255737}]
7     [{'label': 'POSITIVE', 'score': 0.9969040751457214}]
8     [{'label': 'POSITIVE', 'score': 0.9894027709960938}]
9     [{'label': 'POSITIVE', 'score': 0.9991832375526428}]
10    [{'label': 'POSITIVE', 'score': 0.9994851350784302}]
11    [{'label': 'NEGATIVE', 'score': 0.7255946397781372}]
12    [{'label': 'POSITIVE', 'score': 0.9966173768043518}]
13    [{'label': 'POSITIVE', 'score': 0.9997195601463318}]
14    [{'label': 'POSITIVE', 'score': 0.8944363594055176}]
15    [{'label': 'POSITIVE', 'score': 0.9989368319511414}]
16    [{'label': 'POSITIVE', 'score': 0.9998534917831421

In [12]:
## Sentiment: round 2

In [13]:
%%time
# ^ Jupyter cell magic: reports how long the cell took to run

from transformers import logging

logging.set_verbosity_error() # show only errors, hiding the warnings printed while the pipeline loads

sentiment_analyzer = pipeline('sentiment-analysis', 
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=-1, # -1 to use CPU
                              truncation=True # cut off inputs that exceed the model's maximum length (512 tokens for DistilBERT)
                             )

df.Text.apply(sentiment_analyzer)

CPU times: total: 13.5 s
Wall time: 852 ms


0     [{'label': 'POSITIVE', 'score': 0.9935213923454285}]
1      [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2     [{'label': 'NEGATIVE', 'score': 0.6984866261482239}]
3     [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4     [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
5     [{'label': 'POSITIVE', 'score': 0.9994196891784668}]
6     [{'label': 'POSITIVE', 'score': 0.9992188215255737}]
7     [{'label': 'POSITIVE', 'score': 0.9969040751457214}]
8     [{'label': 'POSITIVE', 'score': 0.9894027709960938}]
9     [{'label': 'POSITIVE', 'score': 0.9991832375526428}]
10    [{'label': 'POSITIVE', 'score': 0.9994851350784302}]
11    [{'label': 'NEGATIVE', 'score': 0.7255946397781372}]
12    [{'label': 'POSITIVE', 'score': 0.9966173768043518}]
13    [{'label': 'POSITIVE', 'score': 0.9997195601463318}]
14    [{'label': 'POSITIVE', 'score': 0.8944363594055176}]
15    [{'label': 'POSITIVE', 'score': 0.9989368319511414}]
16    [{'label': 'POSITIVE', 'score': 0.9998534917831421
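
The cells above call the pipeline once per row through `apply`. Pipelines also accept a list of strings together with a `batch_size` argument, which may reduce overhead on larger datasets. The sketch below is a minimal illustration and not part of the original notebook: `score_texts` is a hypothetical helper, and it assumes the analyzer returns one `{'label', 'score'}` dict per input text when given a list.

```python
from typing import Callable, Dict, List, Sequence


def score_texts(analyzer: Callable, texts: Sequence[str], batch_size: int = 8) -> List[Dict]:
    """Run a text-classification pipeline over many texts in a single call.

    `analyzer` is assumed to behave like the sentiment pipeline above:
    given a list of strings it returns one result dict per string.
    """
    # Passing a list lets the pipeline group inputs into batches internally.
    results = analyzer(list(texts), batch_size=batch_size)
    return list(results)
```

With the objects defined earlier, the intended call would be `score_texts(sentiment_analyzer, df.Text.tolist())`. Whether batching is actually faster depends on the hardware and on how much the text lengths vary.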

In [14]:
%%time
# ^ Jupyter cell magic: reports how long the cell took to run

from transformers import logging

logging.set_verbosity_error() # show only errors, hiding the warnings printed while the pipeline loads

sentiment_analyzer = pipeline('sentiment-analysis', 
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=0, # -1 for CPU, 'mps' for an Apple GPU; for an NVIDIA GPU use 'cuda', 'cuda:0', or 0
                              truncation=True # cut off inputs that exceed the model's maximum length (512 tokens for DistilBERT)
                             )

sentiment_scores = df.Text.apply(sentiment_analyzer)
sentiment_scores[:5]

CPU times: total: 13.4 s
Wall time: 853 ms


0    [{'label': 'POSITIVE', 'score': 0.9935213923454285}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984866261482239}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object
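
The `device` comment above lists several options, and the timings for the CPU and `device=0` runs are nearly identical, which suggests no GPU was actually used here. One way to avoid hard-coding the value is to detect what is available at runtime. The sketch below is an assumption-laden helper (`pick_device` is a hypothetical name, not a Transformers function) that relies only on PyTorch's availability checks.

```python
def pick_device():
    """Return a value suitable for pipeline(device=...): 0, 'mps', or -1."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch is not installed, so fall back to CPU

    if torch.cuda.is_available():
        return 0  # first NVIDIA GPU

    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple-silicon GPU

    return -1  # CPU


print(pick_device())
```

The result could then be passed as `device=pick_device()` when building any of the pipelines in this notebook.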

In [15]:
sentiment_scores[0][0]['label']

'POSITIVE'

In [16]:
sentiment_scores[0][0]['score']

0.9935213923454285

In [17]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269


In [18]:
df['Label_HF'] = sentiment_scores.apply(lambda x: x[0]['label'])
df['Score_HF'] = sentiment_scores.apply(lambda x: x[0]['score'])

In [19]:
df['Sentiment_HF'] = df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis=1)

In [20]:
df.head()

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698487,-0.698487
3,23692,A2NU55U9LKTB5J,3,Low,Not somthing I would crave,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631
4,23693,A225F7QFP5LIW2,5,High,healthy and delicious,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181
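
In the table above, rows 2 and 3 have a positive `Sentiment_VADER` but a negative `Sentiment_HF`. A quick way to surface such rows for manual review is to compare the signs of the two columns. The helper below is a small sketch (`sign_disagreements` is a hypothetical name), shown here with the five values from the table.

```python
def sign_disagreements(vader_scores, hf_scores):
    """Return the positions where the two signed scores point in opposite directions."""
    return [
        i
        for i, (vader, hf) in enumerate(zip(vader_scores, hf_scores))
        if vader * hf < 0  # a negative product means the signs differ
    ]


vader = [0.9244, 0.7269, 0.979, 0.8689, 0.9613]
hf = [0.993521, 0.999605, -0.698487, -0.999631, 0.999181]
print(sign_disagreements(vader, hf))  # [2, 3]
```

On the full DataFrame the same check could be written as `df[df.Sentiment_VADER * df.Sentiment_HF < 0]`.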


## 2. NER

In [21]:
logging.set_verbosity_warning()

In [22]:
ner_analyzer = pipeline('ner',
                        model='dbmdz/bert-large-cased-finetuned-conll03-english',
                        device=0, # GPU
                        aggregation_strategy='SIMPLE'
                       )

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [23]:
text4 = "I ordered an Arnold Palmer at Applebee's in Springfield."

In [24]:
ner_analyzer(text4)

[{'entity_group': 'MISC',
  'score': 0.9914088,
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': 0.9436139,
  'word': "Applebee ' s",
  'start': 30,
  'end': 40},
 {'entity_group': 'LOC',
  'score': 0.9780036,
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

In [25]:
ner_analyzer2 = pipeline('ner',
                        model='dslim/bert-base-NER',
                        device=0, # GPU
                        aggregation_strategy='SIMPLE'
                       )

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [26]:
ner_analyzer2(text4)

[{'entity_group': 'PER',
  'score': 0.87662226,
  'word': 'Arnold Palmer',
  'start': 13,
  'end': 26},
 {'entity_group': 'ORG',
  'score': 0.7005143,
  'word': 'Applebee',
  'start': 30,
  'end': 38},
 {'entity_group': 'LOC',
  'score': 0.6289259,
  'word': "' s",
  'start': 38,
  'end': 40},
 {'entity_group': 'LOC',
  'score': 0.99173564,
  'word': 'Springfield',
  'start': 44,
  'end': 55}]

In [27]:
## practical example

In [28]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605


In [29]:
ner_analyzer = pipeline('ner',
                        model='dslim/bert-base-NER',
                        device=0, # GPU
                        aggregation_strategy='SIMPLE'
                       )

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [30]:
ner_analyzer(df.Text[1])

[{'entity_group': 'MISC',
  'score': 0.9443704,
  'word': 'Salt and Vinegar',
  'start': 99,
  'end': 115},
 {'entity_group': 'MISC',
  'score': 0.9036544,
  'word': 'Salt and Vinegar',
  'start': 392,
  'end': 408},
 {'entity_group': 'MISC',
  'score': 0.94210386,
  'word': 'S & V',
  'start': 450,
  'end': 453}]

In [31]:
[entity['word'] for entity in ner_analyzer(df.Text[1])]

['Salt and Vinegar', 'Salt and Vinegar', 'S & V']

In [32]:
df['Named_Entities'] = df.Text.apply(lambda x: [entity['word'] for entity in ner_analyzer(x)])
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF,Named_Entities
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521,[]
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605,"[Salt and Vinegar, Salt and Vinegar, S & V]"


In [33]:
# convert to a set to drop duplicate entities, then back to a list
named_entities = list(set(df['Named_Entities'].explode().dropna().tolist()))
named_entities[:5]

["##egar Pirate ' s Bo", '##y', 'P', 'Stop and Shop', 'COSTCO']

In [34]:
[entity for entity in named_entities if '#' not in entity]

['P',
 'Stop and Shop',
 'COSTCO',
 'Salt',
 'Chip',
 'Amazon',
 'General Mills',
 'and Vin',
 'BBQ Pop Chip',
 'Cal',
 'Co',
 'Miami',
 'BBQ Pop',
 'L',
 'Salt and Vinegar',
 'A',
 'I',
 'Pop',
 'Amazon. com',
 'T',
 'B',
 'Popchi',
 'PopChips',
 'S & V',
 'Lay']
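
Filtering on `#` removes the word-piece fragments, but the list still contains single letters and case variants of the same name. The sketch below shows one possible clean-up; `clean_entities` and its `min_length` threshold are illustrative assumptions rather than anything the NER pipeline provides.

```python
def clean_entities(entities, min_length=2):
    """Drop word-piece fragments, very short strings, and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for entity in entities:
        text = entity.strip()
        if "##" in text:  # leftover word-piece fragment from the tokenizer
            continue
        if len(text) < min_length:  # single letters such as 'P' or 'L'
            continue
        key = text.lower()
        if key in seen:  # already kept this entity in another casing
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


sample = ["##egar Pirate ' s Bo", "##y", "P", "Stop and Shop", "COSTCO", "Amazon", "amazon"]
print(clean_entities(sample))  # ['Stop and Shop', 'COSTCO', 'Amazon']
```

Partial matches such as `'Popchi'` or `'and Vin'` would still pass this filter, so a manual review of the remaining entities is still worthwhile.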

## 3. Zero-Shot Classification

In [35]:
classifier = pipeline('zero-shot-classification',
         model='facebook/bart-large-mnli',
         device=0, # GPU
        )

Device set to use cpu


In [36]:
text1, text4

('When life gives you lemons, make lemonade! 🙂',
 "I ordered an Arnold Palmer at Applebee's in Springfield.")

In [37]:
classifier(text1, ['quote', 'food & drinks', 'technology'])

{'sequence': 'When life gives you lemons, make lemonade! 🙂',
 'labels': ['quote', 'food & drinks', 'technology'],
 'scores': [0.9833196401596069, 0.011176422238349915, 0.005503995344042778]}

In [38]:
classifier(text4, ['quote', 'food & drinks', 'technology'])

{'sequence': "I ordered an Arnold Palmer at Applebee's in Springfield.",
 'labels': ['food & drinks', 'quote', 'technology'],
 'scores': [0.5157081484794617, 0.44382616877555847, 0.04046566039323807]}

In [39]:
## practical example

In [40]:
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF,Named_Entities
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521,[]
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605,"[Salt and Vinegar, Salt and Vinegar, S & V]"


In [41]:
classifier = pipeline('zero-shot-classification',
         model='facebook/bart-large-mnli',
         device=0, # GPU
        )

Device set to use cpu


In [42]:
classifier(df.Text[0], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'Popchips are the bomb!!  I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip.  My healthy eating program is saved.',
 'labels': ['health', 'good', 'flavor', 'taste & texture', 'order'],
 'scores': [0.26937156915664673,
  0.2510982155799866,
  0.2403365671634674,
  0.20681028068065643,
  0.032383374869823456]}

In [43]:
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])

{'sequence': 'I like the puffed nature of this chip that makes it more unique in the chip market.  I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever.  I have tried the cheddar and regular flavors as well.  The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular.  The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.',
 'labels': ['flavor', 'good', 'taste & texture', 'order', 'health'],
 'scores': [0.3400879502296448,
  0.27486953139305115,
  0.2560731768608093,
  0.11652453243732452,
  0.01244480162858963]}

In [44]:
classifier(df.Text[1], ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0]

'flavor'

In [45]:
df['Category'] = df.Text.apply(lambda x: classifier(x, ['order', 'taste & texture', 'good', 'flavor', 'health'])['labels'][0])

In [46]:
df.head()

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF,Named_Entities,Category
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521,[],health
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605,"[Salt and Vinegar, Salt and Vinegar, S & V]",flavor
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE,0.698487,-0.698487,[Amazon],good
3,23692,A2NU55U9LKTB5J,3,Low,Not somthing I would crave,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,NEGATIVE,0.999631,-0.999631,[],taste & texture
4,23693,A225F7QFP5LIW2,5,High,healthy and delicious,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,POSITIVE,0.999181,0.999181,[],taste & texture
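
Taking `['labels'][0]` always assigns a category, even when the top scores are close together, as they are for the first review (0.27 vs 0.25 vs 0.24). One option is to fall back to a placeholder when the top score is low. The sketch below assumes the result dict lists labels from most to least likely, as in the outputs above; `top_label`, the `0.5` threshold, and the `'uncertain'` placeholder are hypothetical choices.

```python
def top_label(result, min_score=0.5, fallback="uncertain"):
    """Return the best label, or `fallback` when its score is below `min_score`."""
    label = result["labels"][0]  # labels are ordered from highest to lowest score
    score = result["scores"][0]
    return label if score >= min_score else fallback


# Scores copied from the outputs shown earlier in this section.
review_0 = {
    "labels": ["health", "good", "flavor", "taste & texture", "order"],
    "scores": [0.26937156915664673, 0.2510982155799866, 0.2403365671634674,
               0.20681028068065643, 0.032383374869823456],
}
text4_result = {
    "labels": ["food & drinks", "quote", "technology"],
    "scores": [0.5157081484794617, 0.44382616877555847, 0.04046566039323807],
}

print(top_label(review_0))      # uncertain
print(top_label(text4_result))  # food & drinks
```

A suitable threshold depends on the number of candidate labels, since scores are spread across all of them.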


## 4. Summarization

In [47]:
summarizer = pipeline('summarization',
        model='facebook/bart-large-cnn',
        device=0
        )

Device set to use cpu


In [48]:
text5 = """
            The lemon tree produces a pointed oval yellow fruit. Botanically this is a hesperidium, 
            a modified berry with a tough, leathery rind. The rind is divided into an outer colored layer or zest, 
            which is aromatic with essential oils, and an inner layer of white spongy pith. 
            Inside are multiple carpels arranged as radial segments. The seeds develop inside the carpels. 
            The space inside each segment is a locule filled with juice vesicles. 
            Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.[3] 
            Their juice contains slightly more citric acid than lime juice (about 47 g/L), 
            nearly twice as much as grapefruit juice, and about five times as much as orange juice.[4]
        """

In [49]:
summarizer(text5)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins.'}]

In [50]:
summarizer(text5, min_length=20, max_length=50)

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. The rind is divided into an outer colored layer or zest, and an inner layer of white spongy pith. Lemons contain many phytochemicals, including poly'}]

In [51]:
summarizer(text5, min_length=20, max_length=50, do_sample=True) # do_sample=True samples tokens instead of always picking the most likely one, so the summary can vary between runs

[{'summary_text': 'The lemon tree produces a pointed oval yellow fruit. Lemons contain many phytochemicals, including polyphenols, terpenes, and tannins. Their juice contains slightly more citric acid than lime juice.'}]

In [52]:
## Practical example

In [53]:
sentiment_analyzer = pipeline('sentiment-analysis', 
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=0, # -1 for CPU, 'mps' for an Apple GPU; for an NVIDIA GPU use 'cuda', 'cuda:0', or 0
                              #truncation=True # Truncates text to make it shorter (Text we want to analyze)
                             )

Device set to use cpu


In [54]:
summarizer = pipeline('summarization',
        model='facebook/bart-large-cnn',
        device=0
        )

Device set to use cpu


In [55]:
# 1. summarize

In [56]:
summarizer(df.Text[0])[0]['summary_text']

Your max_length is set to 142, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


'Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip.  My healthy eating program is saved. I love Popchips! I love them!! I love popchips. I hate chips.'

In [57]:
df['Summary'] = df.Text.apply(lambda x: summarizer(x, min_length=20, max_length=50)[0]['summary_text'])

Your max_length is set to 50, but your input_length is only 39. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)
Your max_length is set to 50, but your input_length is only 44. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=22)
Your max_length is set to 50, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)
Your max_length is set to 50, but your input_length is only 31. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=15)
Your max
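
The warnings above appear because many reviews are shorter than `max_length=50`, and each one suggests roughly half the input length instead (39 → 19, 44 → 22, 24 → 12). A per-text length bound could be derived the same way. The helper below is a sketch under that assumption; `summary_bounds` and its caps are hypothetical, and the input length would have to be measured in tokens, for example with the pipeline's own tokenizer.

```python
def summary_bounds(input_length, min_cap=20, max_cap=50):
    """Pick (min_length, max_length) for a text of `input_length` tokens.

    max_length is half the input, capped at `max_cap`;
    min_length is half of max_length, capped at `min_cap`.
    """
    max_length = min(max_cap, max(1, input_length // 2))
    min_length = min(min_cap, max(1, max_length // 2))
    return min_length, max_length


print(summary_bounds(39))   # (9, 19)
print(summary_bounds(400))  # (20, 50)
```

The two values could then be passed to the summarizer as `min_length` and `max_length` inside the `apply` call.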

In [58]:
df[['Text', 'Summary']].head()

Unnamed: 0,Text,Summary
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.","I like the puffed nature of this chip that makes it more unique in the chip market. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come"
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!","I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. If you are on a low salt diet these chips are probably not for you."
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.","These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. Won't buy again unless I get them for cheap or free."
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!","These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor."


In [59]:
# 2. Sentiment analysis

In [60]:
sentiment_scores2 = df.Summary.apply(sentiment_analyzer)
sentiment_scores2[:5]

0    [{'label': 'POSITIVE', 'score': 0.9976533055305481}]
1    [{'label': 'POSITIVE', 'score': 0.9991886019706726}]
2    [{'label': 'NEGATIVE', 'score': 0.9929706454277039}]
3    [{'label': 'NEGATIVE', 'score': 0.9985463619232178}]
4    [{'label': 'POSITIVE', 'score': 0.9993218183517456}]
Name: Summary, dtype: object

In [61]:
df['Label_HF2'] = sentiment_scores2.apply(lambda x: x[0]['label'])
df['Score_HF2'] = sentiment_scores2.apply(lambda x: x[0]['score'])

In [62]:
df['Sentiment_HF2'] = df.apply(lambda row: row['Score_HF2'] if row['Label_HF2'] == 'POSITIVE' else -row['Score_HF2'], axis=1)

In [63]:
df[['Text', 'Sentiment_VADER', 'Sentiment_HF', 'Sentiment_HF2']].head()

Unnamed: 0,Text,Sentiment_VADER,Sentiment_HF,Sentiment_HF2
0,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,0.993521,0.997653
1,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,0.999605,0.999189
2,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,-0.698487,-0.992971
3,"These tasted like potatoe stix, that we got in grade school with our lunches usually on pizza day. They were the bomb then, not so much now. Won't buy again unless I get them for cheap or free.",0.8689,-0.999631,-0.998546
4,"These chips are great! They look almost like a flattened rice cake, but taste so much better, more like a potato chip. The bbq flavor is delicious. They are very low in fat and full of flavor. It is easy to eat an entire bag of these!",0.9613,0.999181,0.999322


## 6. Document Similarity

* Simple example: extract embeddings for a single string
* Practical example: recommend movies similar to Captain Marvel

1. Extract embeddings for an entire column
2. Specify the Captain Marvel embedding (1 x 384)
3. Specify the embeddings for each movie (166 x 384)
4. Find movies similar to Captain Marvel, i.e. calculate the cosine similarity between Captain Marvel and every movie
5. Optional: create a `get_similar_movies` function

In [65]:
feature_extractor = pipeline('feature-extraction',
                             model='sentence-transformers/all-MiniLM-L6-v2',
                             device=0 # 0 requests the first GPU; the output below shows this run used the CPU
                            )

Device set to use cpu


In [66]:
text1

'When life gives you lemons, make lemonade! 🙂'

In [69]:
feature_extractor(text1)[0][0]

[-0.2936328649520874,
 0.2077523171901703,
 0.11103447526693344,
 0.14668914675712585,
 0.39885398745536804,
 0.3143492043018341,
 0.41525596380233765,
 -0.1936948448419571,
 0.11604082584381104,
 -0.8851795792579651,
 0.3186570107936859,
 -0.39514094591140747,
 -0.0016653575003147125,
 0.054051026701927185,
 -0.13276396691799164,
 -0.448091596364975,
 -0.1334240585565567,
 -0.007221762090921402,
 -0.009799906052649021,
 -0.37267500162124634,
 -0.1458362191915512,
 -0.7204411029815674,
 0.27967730164527893,
 -0.24687862396240234,
 -0.6268733143806458,
 0.56980299949646,
 -0.19975706934928894,
 -0.6103155612945557,
 0.042159635573625565,
 -0.771735429763794,
 0.17245164513587952,
 0.22367046773433685,
 -0.04267141595482826,
 -0.37711843848228455,
 0.23057164251804352,
 -0.180607870221138,
 0.13194867968559265,
 -0.3050297498703003,
 0.23650898039340973,
 0.5073456168174744,
 0.44223976135253906,
 0.24953363835811615,
 -0.21076694130897522,
 -0.06156829744577408,
 -0.03543531149625778,
 ...]

In [70]:
len(feature_extractor(text1)[0][0])

384

In [71]:
## practical example

In [74]:
movies = pd.read_csv('Data/movie_reviews.csv')
movies.head(2)

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
0,A Dog's Journey,PG,"Drama, Kids & Family",5/17/19,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs.",Gail Mancuso,female,50,92,"A Dog's Journey is as sentimental as one might expect, but even cynical viewers may find their ability to resist shedding a tear stretched to the puppermost limit."
1,A Dog's Way Home,PG,Drama,1/11/19,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives.",Charles Martin Smith,male,60,71,"A Dog's Way Home may not quite be a family-friendly animal drama fan's best friend, but this canine adventure is no less heartwarming for its familiarity."


In [76]:
# [0][0] keeps the first token's 384-dimensional vector for each description
embeddings = movies.movie_info.apply(lambda text: feature_extractor(text)[0][0])
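
The `[0][0]` index in the cell above keeps only the first token's vector for each description. The model card for `all-MiniLM-L6-v2` describes averaging the token vectors (mean pooling) to get a sentence embedding, so the sketch below shows that alternative on a small nested list shaped like the pipeline output. The helper name `mean_pool` and the toy numbers are made up for illustration; this is not what the notebook does.

```python
import numpy as np

def mean_pool(pipeline_output):
    """Average the token vectors of one text into a single vector.

    Assumes `pipeline_output` has the nested shape returned by a
    feature-extraction pipeline for one string: [1][n_tokens][hidden_size].
    """
    token_vectors = np.asarray(pipeline_output[0], dtype=float)  # (n_tokens, hidden_size)
    return token_vectors.mean(axis=0)                            # (hidden_size,)

# Toy stand-in for feature_extractor(text): 1 text, 3 tokens, 4 dimensions
toy_output = [[[1.0, 0.0, 2.0, 4.0],
               [3.0, 0.0, 2.0, 0.0],
               [2.0, 3.0, 2.0, 2.0]]]

first_token = np.asarray(toy_output[0][0])  # what [0][0] selects
pooled = mean_pool(toy_output)              # average over all tokens

print(first_token)
print(pooled)
```

With the real pipeline, `mean_pool(feature_extractor(text))` would take the place of `feature_extractor(text)[0][0]`. The similarity scores and rankings would likely differ from the ones shown in this notebook.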

In [78]:
movies[movies.movie_title == 'Captain Marvel']

Unnamed: 0,movie_title,rating,genre,in_theaters_date,movie_info,directors,director_gender,tomatometer_rating,audience_rating,critics_consensus
25,Captain Marvel,PG-13,"Action & Adventure, Science Fiction & Fantasy",3/8/19,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.","Anna Boden, Ryan Fleck",female,78,53,"Packed with action, humor, and visual thrills, Captain Marvel introduces the MCU's latest hero with an origin story that makes effective use of the franchise's signature formula."


In [88]:
import numpy as np

# row 25 is Captain Marvel (see the lookup above); reshape to a 1 x 384 row vector
embedding_cm = np.array(embeddings[25]).reshape(1,-1)
embedding_cm.shape

(1, 384)

In [92]:
embeddings_movies = np.vstack(embeddings)
embeddings_movies.shape

(166, 384)

In [94]:
from sklearn.metrics.pairwise import cosine_similarity

In [97]:
similarity_scores_cm = cosine_similarity(embedding_cm, embeddings_movies)

In [101]:
similarity_scores_cm.shape

(1, 166)
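
To make the `(1, 166)` shape less mysterious, here is a minimal NumPy sketch of what a one-to-many cosine similarity computes: the dot product of the query with each row, divided by the product of their lengths. The function name `cosine_similarity_1_to_many` and the toy vectors are illustrative assumptions, and the sketch does not guard against all-zero rows.

```python
import numpy as np

def cosine_similarity_1_to_many(query, matrix):
    """Cosine similarity between one vector (1 x d) and each row of an (n x d) matrix."""
    query = np.asarray(query, dtype=float).reshape(1, -1)  # (1, d)
    matrix = np.asarray(matrix, dtype=float)               # (n, d)
    dot_products = query @ matrix.T                        # (1, n)
    # (1, 1) * (n,) broadcasts to (1, n): one length product per row
    norms = np.linalg.norm(query, axis=1, keepdims=True) * np.linalg.norm(matrix, axis=1)
    return dot_products / norms

# Toy data: one query and three candidate vectors in 2 dimensions
toy_query = [[1.0, 0.0]]
toy_matrix = [[1.0, 0.0],   # same direction as the query
              [0.0, 1.0],   # perpendicular to the query
              [1.0, 1.0]]   # 45 degrees from the query

toy_scores = cosine_similarity_1_to_many(toy_query, toy_matrix)
print(toy_scores.shape)
print(toy_scores)
```

The result has one row (the query) and one column per candidate, which is why the notebook flattens it before building a Series.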

In [103]:
similarity_scores_cm.flatten().shape

(166,)

In [100]:
similarity_scores_cm_series = pd.Series(similarity_scores_cm.flatten(), name='similarity_score')

In [105]:
movies[['movie_title', 'movie_info']].head(2)

Unnamed: 0,movie_title,movie_info
0,A Dog's Journey,"Bailey (voiced again by Josh Gad) is living the good life on the Michigan farm of his ""boy,"" Ethan (Dennis Quaid) and Ethan's wife Hannah (Marg Helgenberger). He even has a new playmate: Ethan and Hannah's baby granddaughter, CJ. The problem is that CJ's mom, Gloria (Betty Gilpin), decides to take CJ away. As Bailey's soul prepares to leave this life for a new one, he makes a promise to Ethan to find CJ and protect her at any cost. Thus begins Bailey's adventure through multiple lives filled with love, friendship and devotion as he, CJ (Kathryn Prescott), and CJ's best friend Trent (Henry Lau) experience joy and heartbreak, music and laughter, and few really good belly rubs."
1,A Dog's Way Home,"Separated from her owner, a dog sets off on an 400-mile journey to get back to the safety and security of the place she calls home. Along the way, she meets a series of new friends and manages to bring a little bit of comfort and joy to their lives."


In [107]:
movies_similarity_scores_cm = pd.concat([movies[['movie_title', 'movie_info']], similarity_scores_cm_series], axis=1)

In [108]:
movies_similarity_scores_cm.sort_values('similarity_score', ascending=False).head()

Unnamed: 0,movie_title,movie_info,similarity_score
25,Captain Marvel,"The story follows Carol Danvers as she becomes one of the universe's most powerful heroes when Earth is caught in the middle of a galactic war between two alien races. Set in the 1990s, Captain Marvel is an all-new adventure from a previously unseen period in the history of the Marvel Cinematic Universe.",1.0
45,Fast & Furious Presents: Hobbs & Shaw,"Ever since hulking lawman Hobbs (Dwayne Johnson), a loyal agent of America's Diplomatic Security Service, and lawless outcast Shaw (Jason Statham), a former British military elite operative, first faced off in 2015's Furious 7, the duo have swapped smack talk and body blows as they've tried to take each other down. But when cyber-genetically enhanced anarchist Brixton (Idris Elba) gains control of an insidious bio-threat that could alter humanity forever--and bests a brilliant and fearless rogue MI6 agent (The Crown's Vanessa Kirby), who just happens to be Shaw's sister--these two sworn enemies will have to partner up to bring down the only guy who might be badder than themselves.",0.819661
18,Avengers: Endgame,"The grave course of events set in motion by Thanos that wiped out half the universe and fractured the Avengers ranks compels the remaining Avengers to take one final stand in Marvel Studios' grand conclusion to twenty-two films, ""Avengers: Endgame.""",0.794008
131,The LEGO Movie 2: The Second Part,"The much-anticipated sequel to the critically acclaimed, global box office phenomenon that started it all, ""The LEGO (R) Movie 2: The Second Part,"" reunites the heroes of Bricksburg in an all new action-packed adventure to save their beloved city. It's been five years since everything was awesome and the citizens are now facing a huge new threat: LEGO DUPLO (R) invaders from outer space, wrecking everything faster than it can be rebuilt. The battle to defeat the invaders and restore harmony to the LEGO universe will take Emmet (Chris Pratt), Lucy (Elizabeth Banks), Batman (Will Arnett) and their friends to faraway, unexplored worlds, including a strange galaxy where everything is a musical. It will test their courage, creativity and Master Building skills, and reveal just how special they really are.",0.792253
6,Alita: Battle Angel,"From visionary filmmakers James Cameron (AVATAR) and Robert Rodriguez (SIN CITY), comes ALITA: BATTLE ANGEL, an epic adventure of hope and empowerment. When Alita (Rosa Salazar) awakens with no memory of who she is in a future world she does not recognize, she is taken in by Ido (Christoph Waltz), a compassionate doctor who realizes that somewhere in this abandoned cyborg shell is the heart and soul of a young woman with an extraordinary past. As Alita learns to navigate her new life and the treacherous streets of Iron City, Ido tries to shield her from her mysterious history while her street-smart new friend Hugo (Keean Johnson) offers instead to help trigger her memories. But it is only when the deadly and corrupt forces that run the city come after Alita that she discovers a clue to her past - she has unique fighting abilities that those in power will stop at nothing to control. If she can stay out of their grasp, she could be the key to saving her friends, her family and the world she's grown to love.",0.789453
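
Step 5 of the outline (a `get_similar_movies` function) is not shown in the cells above. The following is one possible sketch, not the notebook author's implementation. It computes cosine similarity with NumPy instead of scikit-learn, drops the queried movie from the results, and assumes that row i of the embedding matrix belongs to row i of the DataFrame. The toy titles and vectors are invented for the example.

```python
import numpy as np
import pandas as pd

def get_similar_movies(title, movies, embeddings_movies, top_n=5):
    """Return the top_n movies most similar to `title`, excluding the movie itself.

    Assumes `movies` has a unique index and a 'movie_title' column, and that
    row i of `embeddings_movies` is the embedding of movies.iloc[i].
    """
    matches = np.flatnonzero(movies['movie_title'].to_numpy() == title)
    if len(matches) == 0:
        raise ValueError(f'Title not found: {title}')
    position = matches[0]

    # Normalize rows to unit length so a dot product equals cosine similarity
    vectors = np.asarray(embeddings_movies, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[position]

    result = movies[['movie_title']].copy()
    result['similarity_score'] = scores
    result = result.drop(index=movies.index[position])  # remove the query movie
    return result.sort_values('similarity_score', ascending=False).head(top_n)

# Toy data (hypothetical titles and 2-dimensional "embeddings")
toy_movies = pd.DataFrame({'movie_title': ['Space Hero', 'Galaxy War', 'Dog Story', 'Quiet Drama']})
toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [-1.0, 0.0]])

top_matches = get_similar_movies('Space Hero', toy_movies, toy_embeddings, top_n=2)
print(top_matches)
```

With the notebook's own variables, a call such as `get_similar_movies('Captain Marvel', movies, embeddings_movies)` would be expected to return the same ranking as the table above, minus the Captain Marvel row itself.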
