<a href="https://colab.research.google.com/github/DJCordhose/ml-resources/blob/main/notebooks/foundation/transformers-sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: sentiment analysis using pretrained models

* https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
* https://huggingface.co/facebook/bart-large-mnli

In [1]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
tf.__version__

'2.8.0'

In [2]:
# when we are not training, we do not need a GPU
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [3]:
# https://huggingface.co/transformers/installation.html
!pip install -q transformers


In [4]:
import transformers
transformers.__version__

'4.18.0'

In [5]:
sequence_0 = "I don't think its a good idea to have people driving 40 miles an hour through a light that *just* turned green, especially with the number of people running red lights, or the number of pedestrians running across at the last minute being obscured by large cars in the lanes next to you."
sequence_1 = 'MANY YEARS ago, When I was a teenager, I delivered pizza. I had a friend who, just for the fun of it, had a CB. While on a particular channel, he could key the mike with quick taps and make the light right out in front of the pizza place turn green. It was the only light that it worked on, and I was in the car with him numerous times to confirm that it worked. It was sweet.'
sequence_2 = 'The "green" thing to do is not to do anything ever, don\'t even breath!  Oh, and if you are not going to take that ridiculous standpoint then I guess this is relevant to Green because it uses Bio-fuels in one of the most harsh environments in the world, showing that dependence on tradition fuels is a choice not a necessity.'

## bert-base-multilingual-uncased-sentiment

Version for TensorFlow

https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

In [6]:
%%time 

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

model = TFAutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model.name_or_path

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


CPU times: user 27 s, sys: 4.88 s, total: 31.9 s
Wall time: 38.7 s


In [7]:
# inputs = tokenizer(sequence_0, return_tensors="tf")
# inputs = tokenizer(sequence_1, return_tensors="tf")
inputs = tokenizer(sequence_2, return_tensors="tf")

# the first element of the model output holds the logits, one per star rating (1 to 5)
logits = model(inputs)[0]
probabilities = tf.nn.softmax(logits, axis=1).numpy()[0]
# argmax is a 0-based class index, so add 1 to get the number of stars
stars = probabilities.argmax() + 1
logits, probabilities, stars

(<tf.Tensor: shape=(1, 5), dtype=float32, numpy=
 array([[ 1.8335618 ,  0.9494924 , -0.21574138, -1.1014451 , -1.1619096 ]],
       dtype=float32)>,
 array([0.60787815, 0.2511135 , 0.07830969, 0.03229678, 0.03040184],
       dtype=float32),
 1)
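
To make the step from logits to a star rating easier to follow, here is a minimal sketch that redoes the softmax and argmax in plain Python, using the logits from the output above. The `expected_stars` value is a hypothetical extra summary (a probability-weighted average) that the notebook itself does not compute.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the cell output above (one entry per class, 1 to 5 stars).
logits = [1.8335618, 0.9494924, -0.21574138, -1.1014451, -1.1619096]
probs = softmax(logits)

# Most likely class, shifted from a 0-based index to a 1-based star count.
stars = probs.index(max(probs)) + 1

# Assumption for illustration only: a probability-weighted average rating.
expected_stars = sum(p * (i + 1) for i, p in enumerate(probs))

print(stars, round(expected_stars, 2))
```

The argmax gives a single hard label, while a weighted average like this one also reflects how much probability mass sits on the neighbouring ratings.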

## bart-large-mnli

Version for PyTorch (TensorFlow is not available)

https://huggingface.co/facebook/bart-large-mnli

In [8]:
%%time

from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
classifier.model.name_or_path

CPU times: user 34 s, sys: 7.16 s, total: 41.1 s
Wall time: 44.8 s
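
Zero-shot classification with an NLI model works by treating the text as the premise and turning each candidate label into a hypothesis. The sketch below only illustrates that pairing step in plain Python. The template string is an assumption about the pipeline default, and `build_nli_pairs` is a hypothetical helper, not part of `transformers`.

```python
# Assumed to match the pipeline's default hypothesis template; treat as an assumption.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(premise, candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(premise, template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs("It was sweet.", ["positive", "negative", "ironic"])
for premise, hypothesis in pairs:
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")
```

The model then scores each pair for entailment, and those scores become the per-label results.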


In [9]:
# sequence_to_classify = sequence_0
# sequence_to_classify = sequence_1
sequence_to_classify = sequence_2

candidate_labels = ['positive', 'negative', 'ironic']
classifier(sequence_to_classify, candidate_labels, multi_label=True)

{'labels': ['ironic', 'negative', 'positive'],
 'scores': [0.9157786965370178, 0.5182933807373047, 0.1775151491165161],
 'sequence': 'The "green" thing to do is not to do anything ever, don\'t even breath!  Oh, and if you are not going to take that ridiculous standpoint then I guess this is relevant to Green because it uses Bio-fuels in one of the most harsh environments in the world, showing that dependence on tradition fuels is a choice not a necessity.'}
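
With `multi_label=True` each label is scored independently, so the scores do not have to sum to 1. As a small sketch of how such a result could be consumed, the snippet below copies the labels and scores from the output above and keeps every label at or above a cutoff. The helper `labels_above` and the threshold of 0.5 are illustrative assumptions.

```python
# Labels and scores copied from the output above (the long 'sequence' field is omitted).
result = {
    "labels": ["ironic", "negative", "positive"],
    "scores": [0.9157786965370178, 0.5182933807373047, 0.1775151491165161],
}


def labels_above(result, threshold=0.5):
    """Return the labels whose independent score reaches the threshold."""
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Independent scores: the total is well above 1.
total_score = sum(result["scores"])
selected = labels_above(result)

print(round(total_score, 3))
print(selected)
```

A text can therefore carry several labels at once, here both ironic and negative.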

## More data

In [10]:
!test -f technology-transport-short.db || (wget https://datanizing.com/data-science-day/technology-transport-short.7z && 7z x technology-transport-short.7z && rm technology-transport-short.7z)

--2022-04-26 07:09:49--  https://datanizing.com/data-science-day/technology-transport-short.7z
Resolving datanizing.com (datanizing.com)... 37.221.195.1
Connecting to datanizing.com (datanizing.com)|37.221.195.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 145599598 (139M) [application/x-7z-compressed]
Saving to: ‘technology-transport-short.7z’


2022-04-26 07:09:56 (23.1 MB/s) - ‘technology-transport-short.7z’ saved [145599598/145599598]


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 145599598 bytes (139 MiB)

Extracting archive: technology-transport-short.7z
--
Path = technology-transport-short.7z
Type = 7z
Physical Size = 145599598
Headers Size = 162
Method = LZMA2:24
Solid = -
Blocks = 1


In [12]:
import sqlite3

tech = sqlite3.connect("technology-transport-short.db")

In [14]:
import pandas as pd

posts = pd.read_sql("SELECT title||' '||text AS fulltext, created_utc FROM posts", 
                    tech, parse_dates=["created_utc"])
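
One detail of the `title||' '||text` expression: in SQLite, concatenating with `NULL` yields `NULL`, so a row with a missing title or text would end up with no `fulltext` at all. Whether the real database contains such rows is not shown here. The sketch below uses a made-up in-memory table to show the behaviour and a possible `COALESCE` workaround.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (title TEXT, text TEXT, created_utc TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("A title", "and a body", "2021-01-15 16:30:39"),
        ("Title only", None, "2021-01-15 16:30:40"),
    ],
)

# Plain concatenation: a NULL operand makes the whole result NULL.
plain = conn.execute("SELECT title||' '||text FROM posts").fetchall()

# COALESCE replaces NULL with an empty string before concatenating.
safe = conn.execute(
    "SELECT COALESCE(title, '')||' '||COALESCE(text, '') FROM posts"
).fetchall()

print(plain)
print(safe)
conn.close()
```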

In [15]:
posts

| | fulltext | created_utc |
|---|---|---|
| 0 | Rare car tech | 2021-01-15 16:30:39 |
| 1 | Unfortunately, this post has been removed. Im... | 2021-01-15 16:30:40 |
| 2 | The Device That Turns A Normal Bike Into An El... | 2021-01-05 18:10:25 |
| 3 | Hello! **Please read this message very carefu... | 2021-01-05 18:10:26 |
| 4 | Why can't we build flying craft that is as qui... | 2021-01-02 01:53:00 |
| ... | ... | ... |
| 1369960 | Website makes it impossible to view article o... | 2021-01-26 14:56:37 |
| 1369961 | Website makes it impossible to view on any de... | 2021-01-26 20:43:46 |
| 1369962 | So go boom?? | 2021-01-28 07:10:35 |
| 1369963 | Under Vehicle Surveillance System \| HE Technol... | 2021-01-23 11:33:22 |
