## Sentiment with Flair

Flair offers models that we can use out of the box. One of those is the English sentiment model.

In [2]:
pip install flair

Collecting flair
  Downloading flair-0.12.2-py3-none-any.whl (373 kB)
Collecting segtok>=1.5.7 (from flair)
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Collecting mpld3==0.3 (from flair)
  Downloading mpld3-0.3.tar.gz (788 kB)
  Preparing metadata (setup.py) ... done
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading sqlitedict-2.1.0.tar.gz (21 kB)
  Preparing metadata (setup.py) ... done
Collecting deprecated>=1.2.4 (from flair)
  Downloading Deprecated-1

### Step 1: Initialize the Flair model

In [3]:
import flair
model = flair.models.TextClassifier.load('en-sentiment')

2023-08-10 20:02:15,482 https://nlp.informatik.hu-berlin.de/resources/models/sentiment-curated-distilbert/sentiment-en-mix-distillbert_4.pt not found in cache, downloading to /tmp/tmpihkgwkcj


100%|██████████| 253M/253M [00:06<00:00, 38.5MB/s]

2023-08-10 20:02:22,551 copying /tmp/tmpihkgwkcj to cache at /root/.flair/models/sentiment-en-mix-distillbert_4.pt





2023-08-10 20:02:23,127 removing temp file /tmp/tmpihkgwkcj



### Step 2: Tokenize

In [9]:
text_1 = "I like you. I love you"  # we are expecting a confidently positive sentiment here
text_2 = "I need to start going out, I am gaining weight"

sentence_1 = flair.data.Sentence(text_1)
sentence_2 = flair.data.Sentence(text_2)

sentence_1

Sentence[7]: "I like you. I love you"

Here we now have the Flair Sentence object, which contains our text, alongside a tokenized version of it (each word/punctuation character is an individual token):

In [10]:
sentence_2.to_tokenized_string()
sentence_1.to_tokenized_string()  # only the last expression in a cell is displayed

'I like you . I love you'
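
To build intuition for what the tokenized string represents, the sketch below approximates the split with a regular expression. This is not Flair's actual tokenizer: `rough_tokenize` is a hypothetical helper that only separates runs of word characters from individual punctuation marks. For these two sentences it happens to give the same token counts that Flair reports (7 and 11).

```python
import re

def rough_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; Flair's own tokenizer
    handles many more cases (URLs, abbreviations, and so on).
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens_1 = rough_tokenize("I like you. I love you")
tokens_2 = rough_tokenize("I need to start going out, I am gaining weight")

print(" ".join(tokens_1))
print(len(tokens_1), len(tokens_2))
```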

### Step 3: Process with the model

In [11]:
model.predict(sentence_1)
model.predict(sentence_2)

### Step 4: Read the predictions

In [14]:
sentence_1.get_labels()

['Sentence[7]: "I like you. I love you"'/'POSITIVE' (0.9933)]

In [15]:
sentence_2.get_labels()

['Sentence[11]: "I need to start going out, I am gaining weight"'/'NEGATIVE' (0.7424)]

In [17]:
sentence_1.get_labels()[0].score, sentence_1.get_labels()[0].value


(0.9932582378387451, 'POSITIVE')
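
When comparing many texts, it can be convenient to fold the label and its confidence into a single signed number. The following is a minimal sketch, assuming the label values are exactly 'POSITIVE' and 'NEGATIVE' as in the outputs above. `signed_score` is a hypothetical helper, not part of Flair.

```python
def signed_score(value, score):
    """Map a (label, confidence) pair to a number in [-1, 1]."""
    if value == "POSITIVE":
        return score
    if value == "NEGATIVE":
        return -score
    raise ValueError(f"unexpected label: {value!r}")

# (value, score) pairs copied from the rounded outputs above
predictions = [("POSITIVE", 0.9933), ("NEGATIVE", 0.7424)]
scores = [signed_score(value, score) for value, score in predictions]
print(scores)
```

In the notebook, the two arguments would come from `sentence.get_labels()[0].value` and `sentence.get_labels()[0].score`.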

# Sentiment Models with Transformers

The Hugging Face Transformers library is currently one of the most widely used and user-friendly libraries for building and using transformer models, so it will be our main toolkit throughout these notebooks.

To perform sentiment analysis with the transformers library, our first step is to select a model. Instead of training from scratch, we'll use a pretrained model. The available models are listed at https://huggingface.co/models.

We will be performing sentiment analysis on posts from */r/investing*, which are finance-oriented (in this section we use the example stored in txt below). We can therefore use the finBERT model ProsusAI/finbert, which has been trained on financial articles for financial sentiment classification.

In [18]:
model_name = 'ProsusAI/finbert'

In [20]:
from transformers import BertForSequenceClassification, BertTokenizer

# initialize the tokenizer for BERT models
tokenizer = BertTokenizer.from_pretrained(model_name)
# initialize the model for sequence classification
model = BertForSequenceClassification.from_pretrained(model_name)

In [22]:
txt = ("I’m a 23M who works for a Fortune 500 company. One of the investments I make is contributing a portion of each check to buy the company stock. The company buys this stock and distributes it to me Bi-Annually at a 10% discount."
"I have now roughly $10,000 in this stock and I’m wondering if I should hold it or maybe sell out of some of the stock that I have owned for over a year and put it in a Roth instead? I feel like I’m getting too heavily invested in one thing and wondering if diversification would be good for the long term profit."
"I currently have a 401k but don’t contribute to a Roth.")

txt

'I’m a 23M who works for a Fortune 500 company. One of the investments I make is contributing a portion of each check to buy the company stock. The company buys this stock and distributes it to me Bi-Annually at a 10% discount.I have now roughly $10,000 in this stock and I’m wondering if I should hold it or maybe sell out of some of the stock that I have owned for over a year and put it in a Roth instead? I feel like I’m getting too heavily invested in one thing and wondering if diversification would be good for the long term profit.I currently have a 401k but don’t contribute to a Roth.'

### 1. We tokenize our input text.



In [30]:
tokens = tokenizer.encode_plus(txt, max_length=512,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_tensors='pt') #pt = pytorch


* max_length - this tells the tokenizer the maximum number of tokens we want in each sample. For BERT we almost always use 512, as that is the longest sequence BERT can consume.

* truncation - if our input string txt contains more tokens than allowed (specified by the max_length parameter), then we cut all tokens past the max_length limit.

* padding - if our input string txt contains fewer tokens than specified by max_length, then we pad the sequence with zeros (0 is the token ID for '[PAD]', BERT's padding token).

* add_special_tokens - whether or not to add special tokens. When using BERT we always want this to be True unless we are adding them ourselves.

* return_tensors - here we specify either 'pt' to return PyTorch tensors, or 'tf' to return TensorFlow tensors.

| Token | ID | Description |
| --- | --- | --- |
| [PAD] | 0 | Used to fill empty space when the input sequence is shorter than the sequence size required by the model |
| [UNK] | 100 | If a word/character is not found in BERT's vocabulary it will be represented by this *unknown* token |
| [CLS] | 101 | Represents the start of a sequence |
| [SEP] | 102 | Separator token that denotes the end of a sequence and separates multiple sequences |
| [MASK] | 103 | Token used for masking other tokens, used for masked language modeling |
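
To make the interaction of max_length, truncation, padding, and add_special_tokens concrete, here is a rough pure-Python sketch of what happens to a list of token IDs. It is not the tokenizer's real implementation (which also performs WordPiece splitting and returns additional fields); `pad_and_truncate` is a hypothetical helper, and the content IDs are made up. Only the special-token IDs are taken from the table above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs from the table above

def pad_and_truncate(content_ids, max_length):
    """Mimic truncation=True, padding='max_length', add_special_tokens=True."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

# Made-up content IDs, short enough to need padding
short_ids, short_mask = pad_and_truncate([2000, 2001, 2002], max_length=8)
# Made-up content IDs, long enough to need truncation
long_ids, long_mask = pad_and_truncate(list(range(2000, 2010)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have exactly max_length entries. The attention mask is what lets the model ignore the padded positions.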

In [None]:
tokens

### 2. Token IDs -> Model
Tokenized inputs are fed into the model, which outputs final-layer activations, also called logits (note that these activations are not probabilities).



In [31]:
output = model(**tokens)

In [32]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.1779, -0.6639,  2.2519]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [36]:
output[0]

tensor([[-1.1779, -0.6639,  2.2519]], grad_fn=<AddmmBackward0>)

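### 3. Model activation -> probabilities

Convert those activations into probabilities using a softmax function, which suits this case because the classes are mutually exclusive (a sigmoid would be used instead for multi-label classification, where several classes can apply at once).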



In [37]:
import torch.nn.functional as F

# apply softmax to the logits output tensor of our model (in index 0) across dimension -1
probs = F.softmax(output[0], dim=-1)

probs

tensor([[0.0298, 0.0498, 0.9203]], grad_fn=<SoftmaxBackward0>)

(We use dim=-1 because -1 refers to the tensor's final dimension. If we had a 3D tensor with dims [0, 1, 2], writing dim=-1 is equivalent to writing dim=2, and writing dim=-2 is equivalent to writing dim=1. For a 2D tensor with dims [0, 1], as we have here, dim=-1 is equivalent to dim=1.)
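
As a sanity check on what softmax does, the calculation can be reproduced by hand on the rounded logits printed above. This is a plain-Python sketch for illustration; `softmax` here is a local helper, not the PyTorch function, and because the inputs are rounded the result should only agree with the tensor above to about three decimal places.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a flat list of logits."""
    # Subtracting the max leaves the result unchanged but avoids overflow.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.1779, -0.6639, 2.2519]  # copied from the model output above
probs_by_hand = softmax(logits)
print([round(p, 3) for p in probs_by_hand])
```

Each probability is the exponential of one logit divided by the sum of all three exponentials, which is why the values are positive and sum to 1.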

### 4. Take the argmax of those probabilities.



We now have a tensor containing three class probabilities, each between 0 and 1. We can see that class index 2 has the highest probability, with a value of 0.9203. We can extract this index with PyTorch's argmax function after importing torch.

In [46]:
import torch

pred = torch.argmax(probs)

type(probs)

torch.Tensor

In [40]:
pred.item()

2

### 5. (Optional) Extract the probability of the winning class.

In [54]:
# Find the index of the class with the highest probability
winning_class = torch.argmax(probs)

# Get the probability of the winning class
winning_probability = probs[0, winning_class]

In [56]:
print("Winning class:", winning_class.item())
print("Probability of winning class:", winning_probability.item())

Winning class: 2
Probability of winning class: 0.9203381538391113
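
A class index is more useful once it is mapped to a label name. Note also that `torch.argmax(probs)` without a `dim` argument returns an index into the flattened tensor, which works here only because the batch contains a single row; with several rows, `dim=-1` would be needed. The sketch below shows the per-row logic in plain Python. The `ID2LABEL` mapping is an assumption about ProsusAI/finbert's label order and should be checked against `model.config.id2label`; `top_labels` is a hypothetical helper, and the second row is made up.

```python
# Assumed label order for ProsusAI/finbert; confirm with model.config.id2label.
ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}

def top_labels(prob_rows):
    """Return (label, probability) for the winning class of each row."""
    results = []
    for row in prob_rows:
        best = max(range(len(row)), key=row.__getitem__)  # per-row argmax
        results.append((ID2LABEL[best], row[best]))
    return results

batch = [
    [0.0298, 0.0498, 0.9203],  # rounded probabilities from the example above
    [0.80, 0.15, 0.05],        # made-up second row for illustration
]
results = top_labels(batch)
print(results)
```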
