## Problem description

Find any sentiment-analysis dataset (allowed languages: English, Spanish, Polish, Swedish) other than IMDB and the datasets used in class.
The dataset may have 2 or 3 classes.

Then:
1. Clean the data and present the class distribution.
2. Build a sentiment analysis model:
  - using a recurrent network (LSTM/GRU/bidirectional network) other than a basic RNN
  - using a CNN
  - using pre-trained word embeddings
  - by fine-tuning a language model (other than basic BERT)

3. Create a function that uses the trained model and returns the result for a given sentence (or sentences) as a message telling the user whether the text is negative or positive (or neutral, in the 3-class case).

4. Publish the finished solution on GitHub with a README. In the README include: information about the data and its origin, a description of the chosen model, and instructions for using the files.
5. In the Teams assignment, submit a link to the repo with the solution. For a private repo, make sure it is visible to `dwnuk@pjwstk.edu.pl`.

**DEADLINE**: as in Teams

1. Clean the data and present the class distribution

In [1]:
!pip install -U accelerate -q
!pip install -U transformers -q

In [2]:
!pip install -U datasets -q

In [3]:
import transformers
transformers.__version__


'4.36.2'

In [4]:
import pandas as pd

data = pd.read_csv('Stress.csv')
data

Unnamed: 0,subreddit,post_id,sentence_range,text,label,confidence,social_timestamp
0,ptsd,8601tu,"(15, 20)","He said he had not felt that way before, sugge...",1,0.800000,1521614353
1,assistance,8lbrx9,"(0, 5)","Hey there r/assistance, Not sure if this is th...",0,1.000000,1527009817
2,ptsd,9ch1zh,"(15, 20)",My mom then hit me with the newspaper and it s...,1,0.800000,1535935605
3,relationships,7rorpp,"[5, 10]","until i met my new boyfriend, he is amazing, h...",1,0.600000,1516429555
4,survivorsofabuse,9p2gbc,"[0, 5]",October is Domestic Violence Awareness Month a...,1,0.800000,1539809005
...,...,...,...,...,...,...,...
2833,relationships,7oee1t,"[35, 40]","* Her, a week ago: Precious, how are you? (I i...",0,1.000000,1515187044
2834,ptsd,9p4ung,"[20, 25]",I don't have the ability to cope with it anymo...,1,1.000000,1539827412
2835,anxiety,9nam6l,"(5, 10)",In case this is the first time you're reading ...,0,1.000000,1539269312
2836,almosthomeless,5y53ya,"[5, 10]",Do you find this normal? They have a good rela...,0,0.571429,1488938143


In [5]:
cols_to_drop = ['subreddit','post_id','sentence_range','confidence','social_timestamp']
df = data.drop(cols_to_drop,axis=1)
df.sample(5)

Unnamed: 0,text,label
1206,Public speaking in class frequently reduced me...,1
163,I hate the thought that even after my mom's de...,1
1132,You can read the full terms and instructions h...,0
2144,Clearly he's hurting inside and I want to get ...,1
457,I’m noticing a pattern where my body is like r...,1
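Step 1 also asks for the class distribution, which the cells above do not print. A minimal sketch of one way to compute it: `class_distribution` is a hypothetical helper (not part of the original notebook) and the toy labels are made up. In the notebook you would pass `df['label']`; `df['label'].value_counts()` gives the same counts directly in pandas.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Toy labels for illustration only; in the notebook pass df['label'] instead.
toy_labels = [1, 0, 1, 1, 0]
for label, (n, share) in class_distribution(toy_labels).items():
    print(f"label={label}: {n} rows ({share:.1%})")
```

A strongly skewed distribution would be a reason to report more than accuracy later on.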


In [6]:
from datasets import Dataset

dataset_ = Dataset.from_pandas(df)
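# Hold out 10% for evaluation; the fixed seed is an added assumption so the split is reproducible.
dataset = dataset_.train_test_split(test_size=0.1, seed=42)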

In [7]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2554
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 284
    })
})


In [8]:
# DistilBERT checkpoint used for both the tokenizer and the classifier.
model_checkpoint = 'distilbert-base-uncased'
batch_size = 32

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [10]:
def process(x):
  # Truncate to the model's maximum input length so long posts cannot overflow it.
  return tokenizer(x['text'], truncation=True)

train_ds = dataset['train'].map(process)
test_ds = dataset['test'].map(process)



In [11]:
train_ds

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 2554
})
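The mapped dataset stores `input_ids` and `attention_mask` of varying length, so each batch has to be padded before it reaches the model. To my understanding, `Trainer` does this per batch when it is given a tokenizer. The sketch below only illustrates the idea without any library: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is assumed from `padding_idx=0` in the DistilBERT embedding layer.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, for illustration only.
batch = pad_batch([[101, 7592, 102], [101, 102]])
print(batch)
```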

2. Build a sentiment analysis model:
  - using a recurrent network (LSTM/GRU/bidirectional network) other than a basic RNN
  - using a CNN
  - using pre-trained word embeddings
  - by fine-tuning a language model (other than basic BERT) <---

In [12]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
args = TrainingArguments(
    f'{model_checkpoint}_sentiment_analysis',
    evaluation_strategy = 'epoch',
    save_strategy = 'epoch',
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    num_train_epochs = 5,
    weight_decay = 0.1,
    load_best_model_at_end = True,
    metric_for_best_model = 'accuracy'
)
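A quick sanity check on how long training will run: the number of optimizer steps follows from the train split size, the batch size, and the epoch count. The arithmetic below assumes a single device and no gradient accumulation (both are assumptions, not stated in the notebook); the result should agree with `global_step=400` reported by `trainer.train()`.

```python
import math

train_rows = 2554   # num_rows of the train split
batch_size = 32     # per_device_train_batch_size
epochs = 5          # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```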

In [14]:
from datasets import load_metric
import numpy as np

metric = load_metric('glue', 'sst2')

def compute_metric(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)
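For reference, the `glue`/`sst2` metric reports accuracy, so `compute_metric` boils down to an argmax over the logits followed by a comparison with the labels. A dependency-free sketch of that computation; `accuracy_from_logits` is a hypothetical name and the toy logits are made up.

```python
def accuracy_from_logits(logits, labels):
    """Argmax each row of logits, then compare the predictions with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy values for illustration: predictions are 1, 0, 1, 0 against labels 1, 0, 0, 0.
toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [0.9, 0.3]]
toy_labels = [1, 0, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```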



In [15]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metric
)

In [16]:
trainer.evaluate([train_ds[0]])



{'eval_loss': 0.6809222102165222,
 'eval_accuracy': 1.0,
 'eval_runtime': 0.2004,
 'eval_samples_per_second': 4.99,
 'eval_steps_per_second': 4.99}

In [17]:
trainer.train()

                                                  
{'eval_loss': 0.38694098591804504, 'eval_accuracy': 0.8450704225352113, 'eval_runtime': 76.4719, 'eval_samples_per_second': 3.714, 'eval_steps_per_second': 0.118, 'epoch': 1.0}
{'eval_loss': 0.4153191149234772, 'eval_accuracy': 0.8169014084507042, 'eval_runtime': 76.9055, 'eval_samples_per_second': 3.693, 'eval_steps_per_second': 0.117, 'epoch': 2.0}
{'eval_loss': 0.36661282181739807, 'eval_accuracy': 0.8415492957746479, 'eval_runtime': 77.0714, 'eval_samples_per_second': 3.685, 'eval_steps_per_second': 0.117, 'epoch': 3.0}
{'eval_loss': 0.3899497985839844, 'eval_accuracy': 0.8274647887323944, 'eval_runtime': 76.7012, 'eval_samples_per_second': 3.703, 'eval_steps_per_second': 0.117, 'epoch': 4.0}
{'eval_loss': 0.43250441551208496, 'eval_accuracy': 0.8345070422535211, 'eval_runtime': 76.7024, 'eval_samples_per_second': 3.703, 'eval_steps_per_second': 0.117, 'epoch': 5.0}
{'train_runtime': 9384.1496, 'train_samples_per_second': 1.361, 'train_steps_per_second': 0.043, 'train_loss': 0.29859272003173826, 'epoch': 5.0}





TrainOutput(global_step=400, training_loss=0.29859272003173826, metrics={'train_runtime': 9384.1496, 'train_samples_per_second': 1.361, 'train_steps_per_second': 0.043, 'train_loss': 0.29859272003173826, 'epoch': 5.0})
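With `load_best_model_at_end = True` and `metric_for_best_model = 'accuracy'`, the model kept after training should be the checkpoint with the highest evaluation accuracy, not necessarily the last one. The sketch below just reads that off the per-epoch values logged above (copied by hand into `eval_log`, a name introduced here for illustration).

```python
# Per-epoch evaluation results copied from the training log.
eval_log = [
    {"epoch": 1.0, "eval_loss": 0.38694098591804504, "eval_accuracy": 0.8450704225352113},
    {"epoch": 2.0, "eval_loss": 0.4153191149234772, "eval_accuracy": 0.8169014084507042},
    {"epoch": 3.0, "eval_loss": 0.36661282181739807, "eval_accuracy": 0.8415492957746479},
    {"epoch": 4.0, "eval_loss": 0.3899497985839844, "eval_accuracy": 0.8274647887323944},
    {"epoch": 5.0, "eval_loss": 0.43250441551208496, "eval_accuracy": 0.8345070422535211},
]

best_by_accuracy = max(eval_log, key=lambda e: e["eval_accuracy"])
best_by_loss = min(eval_log, key=lambda e: e["eval_loss"])
print("best accuracy at epoch", best_by_accuracy["epoch"])
print("lowest eval loss at epoch", best_by_loss["epoch"])
```

Accuracy peaks after the first epoch while evaluation loss is lowest after the third, and training loss keeps falling, which suggests that the later epochs mostly overfit on this small dataset.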

In [18]:
trainer.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

3. Create a function that uses the trained model and returns the result for a given sentence (or sentences) as a message telling the user whether the text is negative or positive (or neutral, in the 3-class case).

In [19]:
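import torch
from transformers import DistilBertTokenizer

def predict_sentiment(text, model, tokenizer=None):
    # Accepts a single sentence (str) or a list of sentences.
    if tokenizer is None:
        tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

    texts = [text] if isinstance(text, str) else list(text)
    # Pad and truncate so several sentences fit in one batch; keep the attention mask.
    tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    tokens = {name: tensor.to(device) for name, tensor in tokens.items()}

    model = model.to(device)
    model.eval()  # disable dropout for inference

    with torch.no_grad():
        outputs = model(**tokens)

    predicted_classes = torch.argmax(outputs.logits, dim=1).tolist()

    # Label 1 means stress, label 0 means no stress.
    messages = [
        "I can sense STRESS in this sentence" if c == 1
        else "All good don't sense ANY STRESS in here "
        for c in predicted_classes
    ]
    # Return a plain string for a single sentence, a list of strings otherwise.
    return messages[0] if isinstance(text, str) else messages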

In [23]:
text_example = "I had a peaceful evening reading my favorite book."
result = predict_sentiment(text_example, trainer.model)
print(result)


All good don't sense ANY STRESS in here 
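The message above hides how sure the model is. One possible extension is to turn the two logits into probabilities with a softmax and report the winning class together with its probability. This is a library-free sketch: `softmax` and `describe` are hypothetical helpers, the logits are made up, and the class order (0 = no stress, 1 = stress) is taken from the labels used in `predict_sentiment`. In the notebook the input would be the model's logits for one sentence converted to a Python list.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits, messages=("no stress", "stress")):
    """Return the message of the most probable class with its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return f"{messages[idx]} (confidence {probs[idx]:.0%})"

# Made-up logits for illustration only.
print(describe([2.0, 0.0]))
```

For a 3-class dataset the same helper would work with a three-element `messages` tuple.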
