# Sentiment Analysis of a /finance Subreddit Post Using the FinBERT Transformer Model

In [1]:
model_name = 'ProsusAI/finbert'

In [2]:
from transformers import BertForSequenceClassification

### Downloading the Model 

In [3]:
model = BertForSequenceClassification.from_pretrained(model_name)

Downloading (…)lve/main/config.json: 100%|██████████| 758/758 [00:00<00:00, 277kB/s]
Downloading pytorch_model.bin: 100%|██████████| 438M/438M [06:04<00:00, 1.20MB/s] 


`from_pretrained` loads a model whose weights have already been trained, in this case FinBERT. The weights are downloaded the first time the model is loaded.


### Tokenizer
We also need to convert the text into tokens that the model understands. For that we need the tokenizer that matches the model.

In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_name)

Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:01<00:00, 217kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 53.8kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 252/252 [00:00<00:00, 96.2kB/s]


### Pipeline
1. Tokenize the text
2. Token ids --> model
3. Model activations --> probabilities (using softmax)
4. argmax of the probabilities

#### 1. Tokenize

In [5]:
txt = "Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by [The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts."

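##### Functionality of encode_plus and BERT special tokens
`encode_plus` takes the text (`txt`) plus a few options: `max_length=512` is the maximum sequence length of the model, `truncation=True` cuts off all tokens past 512, `padding='max_length'` pads shorter inputs with `[PAD]` up to that length, `add_special_tokens=True` adds `[CLS]` and `[SEP]`, and `return_tensors='pt'` returns PyTorch tensors.

BERT special tokens and their ids:
1. \[PAD] = 0
2. \[UNK] = 100
3. \[CLS] = 101
4. \[SEP] = 102
5. \[MASK] = 103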
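
To make the roles of `[CLS]`, `[SEP]` and `[PAD]` concrete, here is a simplified toy sketch of how a single sequence is truncated, wrapped and padded. It is not the tokenizer's actual implementation: `build_input_ids` is a hypothetical helper, and the content ids (2001, 2002, ...) are made-up placeholders rather than real vocabulary entries.

```python
# Special token ids as listed above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def build_input_ids(content_ids, max_length):
    """Toy version of truncation + special tokens + padding for one sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest.
    kept = content_ids[: max_length - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding

# A short input gets padded, a long one gets truncated (made-up ids).
short_ids, short_mask = build_input_ids([2001, 2002], max_length=6)
long_ids, long_mask = build_input_ids([2001, 2002, 2003, 2004, 2005], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=512` the same idea applies: the output always has 512 positions, and the attention mask tells the model which of them hold real tokens.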


In [8]:
tokens = tokenizer.encode_plus(txt, max_length=512,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_tensors='pt')

In [None]:
tokens

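#### Keyword arguments (kwargs)
`tokens` behaves like a dictionary (`dict()`). Prefixing a dictionary with `**` unpacks it into keyword arguments: each key becomes an argument name and each value becomes that argument's value.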
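
As a small illustration of `**` unpacking in plain Python, the sketch below calls the same function twice, once with an unpacked dictionary and once with explicit keyword arguments. `describe` and `toy_tokens` are hypothetical stand-ins for the model call and the tokenizer output, not part of any library.

```python
def describe(input_ids, attention_mask):
    """Stand-in for a model's forward call; it only reports what it received."""
    return f"{len(input_ids)} ids, {sum(attention_mask)} real tokens"

# Toy dictionary shaped like a tokenizer output (made-up values).
toy_tokens = {"input_ids": [101, 2001, 102, 0], "attention_mask": [1, 1, 1, 0]}

# These two calls are equivalent.
unpacked = describe(**toy_tokens)
explicit = describe(input_ids=toy_tokens["input_ids"],
                    attention_mask=toy_tokens["attention_mask"])

print(unpacked)
print(unpacked == explicit)
```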

### Inference

In [10]:
output = model(**tokens)

In [12]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.7941,  2.4361,  0.1248]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Feed the output activations (logits) to softmax to get the probabilities

In [14]:
from torch.nn.functional import softmax


In [18]:
probs = softmax(output[0], dim=-1)  # output[0] holds the logits; dim=-1 applies softmax over the tensor's last dimension

In [19]:
probs

tensor([[0.0131, 0.8979, 0.0890]], grad_fn=<SoftmaxBackward0>)
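
To see what softmax does to the logits, the calculation can be reproduced by hand in plain Python. This is only a sketch: `softmax_by_hand` is a hypothetical helper, and the logits are copied from the rounded values printed earlier, so the results agree with `probs` only up to rounding.

```python
import math

# Logits copied from the model output (rounded to 4 decimals).
logits = [-1.7941, 2.4361, 0.1248]

def softmax_by_hand(values):
    """Exponentiate each value and normalise so the results sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # but avoids overflow for large logits.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs_by_hand = softmax_by_hand(logits)
print([round(p, 4) for p in probs_by_hand])
print(sum(probs_by_hand))
```

The largest logit ends up with most of the probability mass, and the three values sum to 1.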

### Using argmax to extract the index of the highest probability
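
argmax returns the position of the largest value, not the value itself. A minimal plain-Python sketch, using the rounded probabilities from the softmax output and a hypothetical `argmax` helper:

```python
# Probabilities copied from the softmax output (rounded).
toy_probs = [0.0131, 0.8979, 0.0890]

def argmax(values):
    """Return the index of the largest value (the first one on ties)."""
    best_index = 0
    for index, value in enumerate(values):
        if value > values[best_index]:
            best_index = index
    return best_index

toy_pred = argmax(toy_probs)
print(toy_pred)
```

Note that `torch.argmax` called without `dim` works on the flattened tensor. For a single input of shape `[1, 3]` that flattened index is the same as the class index.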

##### Custom code to map the predicted index to a sentiment label, as per the FinBERT model

In [23]:
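def extract_sentiment(prediction):
    # Label order follows ProsusAI/finbert's config (model.config.id2label):
    # 0 = positive, 1 = negative, 2 = neutral
    if prediction == 0:
        return "positive"
    elif prediction == 1:
        return "negative"
    elif prediction == 2:
        return "neutral"
    else:
        return "unknown"

In [21]:
import torch

pred = torch.argmax(probs)  # index of the highest probability

In [24]:
extract_sentiment(pred.item())

'negative'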