## Sentiment analysis with DistilBERT

In [1]:
!pip install -q transformers
from transformers import pipeline



In [2]:
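def classify(sequence, M):
    """Classify `sequence` with the default sentiment-analysis pipeline.

    If M == 1, also print the configuration of the loaded model.
    """
    # No model is named, so the pipeline falls back to its default checkpoint.
    nlp_cls = pipeline('sentiment-analysis')
    if M == 1:
        print(nlp_cls.model.config)
    return nlp_cls(sequence)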

In [3]:
# Candidate inputs; change `seq` to choose which one is classified.
sequences = {
    1: "The battery on my Model9X phone doesn't last more than 6 hours and I'm unhappy about that.",
    2: "The battery on my Model9X phone doesn't last more than 6 hours and I'm unhappy about that. I was really mad! I bought a Model10X and things seem to be better. I'm super satisfied now.",
    3: "The customer was very unhappy",
    4: "The customer was very satisfied",
}
seq = 3
sequence = sequences[seq]

print(sequence)
M = 1  # 1 also prints the model configuration
CS = classify(sequence, M)
print(CS)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The customer was very unhappy




DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.29.2",
  "vocab_size": 30522
}

[{'label': 'NEGATIVE', 'score': 0.9997097849845886}]
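
The warning above notes that relying on the default model is not recommended in production. One way to address it is to name the checkpoint and revision explicitly and to build the pipeline once instead of on every call. The sketch below is only an illustration: `build_classifier` and `classify_pinned` are hypothetical helper names, and the model name and revision are copied from the warning text.

```python
# Checkpoint name and revision are copied from the warning printed above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_classifier(model_name=MODEL_NAME, revision=MODEL_REVISION):
    """Return a sentiment-analysis pipeline pinned to one checkpoint (hypothetical helper)."""
    # Imported lazily so that defining the helper does not load transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name, revision=revision)


def classify_pinned(sequence, classifier):
    """Classify `sequence` with an already-built pipeline, so the model loads only once."""
    return classifier(sequence)
```

A typical use would be `nlp_cls = build_classifier()` followed by `classify_pinned("The customer was very unhappy", nlp_cls)`; the first call downloads the checkpoint if it is not cached.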


## Sentiment analysis with RoBERTa-large

In [5]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

class SimpleDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}
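
To see what this wrapper hands back for each example, here is a small sketch that feeds it a hand-written dictionary instead of real tokenizer output. The token ids and the `toy_tokenized` name are made up for illustration; only the class itself comes from the cell above.

```python
class SimpleDataset:
    """Same wrapper as in the cell above, repeated so the sketch runs on its own."""

    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}


# Hypothetical token ids for two padded texts (not produced by a real tokenizer).
toy_tokenized = {
    "input_ids": [[0, 5, 2], [0, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = SimpleDataset(toy_tokenized)

print(len(toy_dataset))  # number of examples, taken from "input_ids"
print(toy_dataset[1])    # one dict per example, with the same keys as the input
```

Each item is a per-example dictionary with the same keys as the tokenizer output, which is the shape the prediction loop consumes one example at a time.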

In [6]:
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)



In [7]:
pred_texts = ['I like that','That is annoying','This is great!','Wouldn´t recommend it.']

In [8]:
tokenized_texts = tokenizer(pred_texts,truncation=True,padding=True)
pred_dataset = SimpleDataset(tokenized_texts)

In [9]:
# Run predictions
predictions = trainer.predict(pred_dataset)

In [10]:
predictions

PredictionOutput(predictions=array([[-3.7254689,  2.8858626],
       [ 3.9145916, -3.518445 ],
       [-3.7518108,  2.9132538],
       [ 3.9534545, -3.618488 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 0.9961, 'test_samples_per_second': 4.016, 'test_steps_per_second': 1.004})

In [11]:
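# Predicted class index per text, mapped to its label name
preds = predictions.predictions.argmax(-1)
labels = pd.Series(preds).map(model.config.id2label)
# Softmax over the logits; keep the probability of the predicted class
logits = predictions.predictions
scores = (np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)).max(1)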

In [13]:
labels

0    POSITIVE
1    NEGATIVE
2    POSITIVE
3    NEGATIVE
dtype: object
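
The `scores` computed in cell 11 are never displayed. As a rough, dependency-free sketch of that softmax step, the block below recomputes a label and confidence for each text from the logits printed in the `PredictionOutput` above. The `id2label` mapping is an assumption inferred from the labels shown (index 1 maps to POSITIVE), and the `softmax` helper is illustrative rather than the notebook's own code.

```python
import math

# Logits copied from the PredictionOutput shown above (one row per text).
logits = [
    [-3.7254689, 2.8858626],
    [3.9145916, -3.518445],
    [-3.7518108, 2.9132538],
    [3.9534545, -3.618488],
]
# Assumed label mapping, inferred from the labels printed above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append((id2label[best], probs[best]))

for text_index, (label, score) in enumerate(results):
    print(text_index, label, round(score, 4))
```

Subtracting the row maximum before exponentiating leaves the probabilities unchanged while guarding against overflow for large logits.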