<a href="https://colab.research.google.com/github/Nakul24-1/Analysis-of-Emotion-Cause/blob/main/extras/robertatesting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Testing

In [None]:
from transformers import RobertaConfig, RobertaModel

In [None]:
configuration = RobertaConfig()

In [None]:
model = RobertaModel(configuration)

In [None]:
from transformers import RobertaTokenizer, TFRobertaForCausalLM
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = TFRobertaForCausalLM.from_pretrained("roberta-base")

inputs = tokenizer("Hello, my dog is cute", return_tensors="tf")
outputs = model(inputs)
logits = outputs.logits

If you want to use `TFRobertaLMHeadModel` as a standalone, add `is_decoder=True.`
All model checkpoint layers were used when initializing TFRobertaForCausalLM.

All the layers of TFRobertaForCausalLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForCausalLM for predictions without further training.


In [None]:
logits

<tf.Tensor: shape=(1, 8, 50265), dtype=float32, numpy=
array([[[35.528595  , -3.9863913 , 23.237835  , ...,  3.1884754 ,
          5.2298965 , 12.811619  ],
        [-1.1595345 , -3.8122017 , 12.082758  , ..., -4.4703097 ,
         -2.3699512 ,  1.4894071 ],
        [ 2.3193643 , -3.3484042 , 11.015812  , ...,  3.1667874 ,
          2.2997122 ,  4.0590324 ],
        ...,
        [ 3.8776994 , -3.4873252 , 10.774568  , ...,  3.754785  ,
         -0.24851859,  4.2882304 ],
        [-0.6069057 , -4.8878274 , 10.964474  , ..., -3.008133  ,
         -4.012106  ,  1.0598428 ],
        [14.700357  , -5.2765093 , 25.056267  , ..., -1.299965  ,
          0.41575372,  7.2493496 ]]], dtype=float32)>

In [None]:
from transformers import RobertaTokenizer, TFRobertaForMaskedLM
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = TFRobertaForMaskedLM.from_pretrained("roberta-base")

inputs = tokenizer("I am  feeling <mask> because I got a new car.", return_tensors="tf")
logits = model(**inputs).logits

# retrieve index of <mask>
mask_token_index = tf.where(inputs.input_ids == tokenizer.mask_token_id)[0][1]

predicted_token_id = tf.math.argmax(logits[0, mask_token_index], axis=-1)
tokenizer.decode(predicted_token_id)

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


' good'

In [None]:
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emotion")
model = TFRobertaForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-emotion")

inputs = tokenizer("I got a dog.", return_tensors="tf")

logits = model(**inputs).logits

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-emotion.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


'optimism'

In [None]:
from transformers import RobertaTokenizer, TFRobertaForQuestionAnswering
import tensorflow as tf

tokenizer = RobertaTokenizer.from_pretrained("ydshieh/roberta-base-squad2")
model = TFRobertaForQuestionAnswering.from_pretrained("ydshieh/roberta-base-squad2")
question, text = "Why am I afraid?", "I am afraid of flying because I have fear of heights. The fear of heights is called vertigo"

inputs = tokenizer(question, text, return_tensors="tf")
outputs = model(**inputs)

answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

All model checkpoint layers were used when initializing TFRobertaForQuestionAnswering.

All the layers of TFRobertaForQuestionAnswering were initialized from the model checkpoint at ydshieh/roberta-base-squad2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForQuestionAnswering for predictions without further training.


' fear of heights'

In [None]:
import json

def read_squad(path):
    # open JSON file and load into dictionary
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    # initialize lists for contexts, questions, and answers
    contexts = []
    questions = []
    answers = []
    # iterate through all data in squad data
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                # check if we need to be extracting from 'answers' or 'plausible_answers'
                if 'plausible_answers' in qa.keys():
                    access = 'plausible_answers'
                else:
                    access = 'answers'
                for answer in qa[access]:
                    # append data to lists
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)
    # return formatted data lists
    return contexts, questions, answers

# execute our read SQuAD function for training and validation sets
train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')

FileNotFoundError: ignored

# REAL

In [None]:
!pip install transformers
import json
import pandas as pd



In [None]:
df_train = pd.read_json("https://github.com/Nakul24-1/Analysis-of-Emotion-Cause/raw/main/emc_train.json")
df_test = pd.read_json("https://github.com/Nakul24-1/Analysis-of-Emotion-Cause/raw/main/emc_test.json")

In [None]:
df_train['emotion'] = df_train['emotion'].apply(lambda x: x.split("__")[1])
df_test['emotion'] = df_test['emotion'].apply(lambda x: x.split("__")[1])

In [None]:
def read_data(context, emotions, cause, word_list):
  contexts = context.tolist()
  questions = emotions.tolist()
  answers = cause.tolist()
  word_list = word_list.tolist()
  key_list = ["Text", "Start_Position", "End_Position"]
  res = []
  # build one answer record per annotated example
  for x in range(len(answers)):
    # skip rows without an annotation; note this leaves res shorter than contexts
    if answers[x] is not None:
      # word index of the first annotated entry
      start_pos = answers[x][0][-1]
      # one past the word index of the last annotated entry (exclusive end)
      end_pos = answers[x][-1][-1] + 1
      text1 = " ".join(word_list[x][start_pos:end_pos])
      res.append({key_list[0]: text1, key_list[1]: start_pos, key_list[2]: end_pos})

  return contexts, questions, res
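
To make the span arithmetic in `read_data` concrete, here is a minimal sketch on one hand-made record. The annotation layout (a list of entries whose last element is a word index) is an assumption inferred from the indexing above, and `cause_span` and the sample data are hypothetical, not taken from the dataset.

```python
def cause_span(annotation, words):
    """Return the cause text with its start (inclusive) and end (exclusive) word index."""
    if not annotation:
        return None
    start = annotation[0][-1]       # word index of the first annotated entry
    end = annotation[-1][-1] + 1    # one past the word index of the last entry
    return {
        "Text": " ".join(words[start:end]),
        "Start_Position": start,
        "End_Position": end,
    }

# hypothetical record: annotation entries are assumed to end with a word index
words = "I slipped and fell on the wet floor".split()
annotation = [["slipped", 1], ["and", 2], ["fell", 3]]

span = cause_span(annotation, words)
print(span)
```

Because `End_Position` is exclusive, `words[start:end]` reproduces the cause text directly.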

In [None]:
contexts, questions, res = read_data(df_train.original_situation,df_train.emotion,df_train.annotation,df_train.tokenized_situation)
val_contexts, val_questions, val_res = read_data(df_test.original_situation,df_test.emotion,df_test.annotation,df_test.tokenized_situation)

In [None]:
from transformers import RobertaTokenizer,AutoTokenizer, AutoModelForQuestionAnswering
# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForQuestionAnswering.from_pretrained("roberta-base")
# tokenize
train_encodings = tokenizer(contexts, questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use 

In [None]:
def add_token_positions(encodings, answers):
    # initialize lists to contain the token indices of answer start/end
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # append the token span of the start/end word using word_to_tokens;
        # each result is a (start, end) span or None, reduced to one index in a later cell
        start_positions.append(encodings.word_to_tokens(i, answers[i]['Start_Position']))
        end_positions.append(encodings.word_to_tokens(i, answers[i]['End_Position']))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        # if the end word has no token span, step back one word at a time until one is found
        shift = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.word_to_tokens(i, answers[i]['End_Position'] - shift)
            shift += 1
    # update our encodings object with the new token-based start/end positions
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

# apply function to our data
add_token_positions(train_encodings, res)
add_token_positions(val_encodings, val_res)

In [None]:
#train_encodings['start_positions'] # val_encodings['start_positions'][1][0] is correct,, run a loop to change all

In [None]:
# word_to_tokens returned (start, end) token spans; keep a single index per position.
# Entries already set to tokenizer.model_max_length (truncated answers) are left as-is.
for x in range(len(val_encodings['start_positions'])):
  if val_encodings['start_positions'][x] != tokenizer.model_max_length:
    val_encodings['start_positions'][x] = val_encodings['start_positions'][x][0]
  if val_encodings['end_positions'][x] != tokenizer.model_max_length:
    val_encodings['end_positions'][x] = val_encodings['end_positions'][x][1]

for x in range(len(train_encodings['start_positions'])):
  if train_encodings['start_positions'][x] != tokenizer.model_max_length:
    train_encodings['start_positions'][x] = train_encodings['start_positions'][x][0]
  if train_encodings['end_positions'][x] != tokenizer.model_max_length:
    train_encodings['end_positions'][x] = train_encodings['end_positions'][x][1]
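
The two cells above first collect word-level token spans and then reduce each span to one integer. The sketch below mimics that flow in plain Python so the shapes are easy to inspect. It is an illustration only: `TokenSpan` here is a local stand-in for the span object returned by `word_to_tokens`, `word_spans` and `flatten` are hypothetical helpers, the subword split is invented, and 512 is assumed as the sentinel value.

```python
from collections import namedtuple

# Local stand-in for the (start, end) token span; end is exclusive.
TokenSpan = namedtuple("TokenSpan", ["start", "end"])

def word_spans(word_pieces, offset=1):
    """Map each word index to the token positions it occupies.

    offset=1 leaves room for one leading special token such as <s>.
    """
    spans, pos = [], offset
    for pieces in word_pieces:
        spans.append(TokenSpan(pos, pos + len(pieces)))
        pos += len(pieces)
    return spans

def flatten(position, use_end, sentinel):
    """Reduce a span to one index; leave the truncation sentinel untouched."""
    if position == sentinel:
        return position
    return position.end if use_end else position.start

# invented subword split of "I feared an intruder"
pieces = [["I"], ["fe", "ared"], ["an"], ["intr", "uder"]]
spans = word_spans(pieces)

MAX_LEN = 512  # assumed sentinel, standing in for tokenizer.model_max_length
start = flatten(spans[1], use_end=False, sentinel=MAX_LEN)  # first token of "feared"
end = flatten(spans[3], use_end=True, sentinel=MAX_LEN)     # one past the last token of "intruder"
print(spans, start, end)
```

Comparing with `==` rather than `is` keeps the sentinel check independent of how Python happens to cache integer objects.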

In [None]:
tokenizer.decode(val_encodings['input_ids'][1])

'<s>One night my children and I came home and came in through the back door. As I opened it, I saw a tall shadow in the hallway! It scared me so much, as I feared an intruder was in my home.</s></s>terrified</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

In [None]:
type(train_encodings['start_positions'][1])

int

In [None]:
import torch

class EmoDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# build datasets for both our training and validation sets
train_dataset = EmoDataset(train_encodings)
val_dataset = EmoDataset(val_encodings)

In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize adam optimizer with weight decay (reduces chance of overfitting)
optim = AdamW(model.parameters(), lr=1e-5)

# initialize data loader for training data
train_loader = DataLoader(train_dataset, batch_size= 16, shuffle=True)

for epoch in range(1):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    acc = []
    for batch in loop:
        # initialize calculated gradients (from prev step)
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        
        # extract loss
        loss = outputs[0]
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        # calculate loss for every parameter that needs grad update
        loss.backward()
        # update parameters
        optim.step()
        acc.append(((start_pred == start_positions).sum()/len(start_pred)).item())
        acc.append(((end_pred == end_positions).sum()/len(end_pred)).item())
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
    model_path = 'models/robertaft'+ str(epoch)
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
    acc = sum(acc)/len(acc)
    print("Training Accuracy is ",acc)


Epoch 0: 100%|██████████| 236/236 [02:52<00:00,  1.37it/s, loss=0.921]


Training Accuracy is  0.6802348164936244


In [None]:
model.eval()
# initialize validation set data loader
val_loader = DataLoader(val_dataset, batch_size=1)
# initialize list to store accuracies
acc = []
outs = []
ins = []
# loop through batches
for batch in val_loader:
    # we don't need to calculate gradients as we're not training
    with torch.no_grad():
        # pull batched items from loader
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # we will use true positions for accuracy calc
        start_true = batch['start_positions'].to(device)
        end_true = batch['end_positions'].to(device)
        # make predictions
        outputs = model(input_ids, attention_mask=attention_mask)
        # pull prediction tensors out and argmax to get predicted tokens
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        for x in range(0,len(batch['input_ids'])):
          # index row x of the batch (equivalent to row 0 when batch_size=1)
          original = input_ids[x,start_true[x] : end_true[x] + 1]
          ins.append(tokenizer.decode(original))
          predict_answer_tokens = input_ids[x,start_pred[x] : end_pred[x] + 1]
          outs.append(tokenizer.decode(predict_answer_tokens))
        # calculate accuracy for both and append to accuracy list
        acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
        acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
# calculate average accuracy in total
acc = sum(acc)/len(acc)
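
The accuracy computed above scores the start index and the end index separately, so a prediction with the right start but the wrong end still earns half credit. The plain-Python sketch below shows that behaviour next to a stricter exact-match score. Both helper names and the sample positions are hypothetical, chosen only for illustration.

```python
def position_accuracy(start_true, end_true, start_pred, end_pred):
    """Average of start hits and end hits, each counted on its own."""
    hits = []
    for st, et, sp, ep in zip(start_true, end_true, start_pred, end_pred):
        hits.append(float(sp == st))
        hits.append(float(ep == et))
    return sum(hits) / len(hits)

def exact_match(start_true, end_true, start_pred, end_pred):
    """Fraction of examples where both start and end are correct."""
    rows = list(zip(start_true, end_true, start_pred, end_pred))
    return sum(1.0 for st, et, sp, ep in rows if sp == st and ep == et) / len(rows)

# three made-up examples: both wrong, both right, start right / end wrong
start_true, end_true = [16, 3, 5], [24, 9, 7]
start_pred, end_pred = [4, 3, 5], [18, 9, 8]

acc_value = position_accuracy(start_true, end_true, start_pred, end_pred)
em_value = exact_match(start_true, end_true, start_pred, end_pred)
print(acc_value, em_value)
```

Here the position accuracy is 0.5 while only one of three spans matches exactly, which is why the two numbers should not be read interchangeably.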

In [None]:
acc

0.48448687350835323

In [None]:
tokenizer = AutoTokenizer.from_pretrained("models/robertaft5")
model = AutoModelForQuestionAnswering.from_pretrained("models/robertaft5").to(device)

model.eval()
# initialize validation set data loader
val_loader = DataLoader(val_dataset, batch_size=1)
# initialize list to store accuracies
acc = []
outs = []
ins = []
# loop through batches
for batch in val_loader:
    # we don't need to calculate gradients as we're not training
    with torch.no_grad():
        # pull batched items from loader
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # we will use true positions for accuracy calc
        start_true = batch['start_positions'].to(device)
        end_true = batch['end_positions'].to(device)
        # make predictions
        outputs = model(input_ids, attention_mask=attention_mask)
        # pull prediction tensors out and argmax to get predicted tokens
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        for x in range(0,len(batch['input_ids'])):
          # index row x of the batch (equivalent to row 0 when batch_size=1)
          original = input_ids[x,start_true[x] : end_true[x] + 1]
          ins.append(tokenizer.decode(original))
          predict_answer_tokens = input_ids[x,start_pred[x] : end_pred[x] + 1]
          outs.append(tokenizer.decode(predict_answer_tokens))
        # calculate accuracy for both and append to accuracy list
        acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
        acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
# calculate average accuracy in total
acc = sum(acc)/len(acc)

In [None]:
acc

0.49403341288782815

In [None]:
ins[11:20]

[' husband has been out of town for a few weeks on business but is coming back next week',
 " ride horses with me. After riding my mother pressured me to get on a horse bareback I wasn't comfortable riding.",
 ' records that belong to my father before he',
 ' experience and felt very comfortable in my ability to perform the job. I was very ready for the',
 ' tattoo I was worried about pain and blood. I',
 ' husband. He has been so sweet this week. He knew I had a long day today so he bought me my favourite wine and Taco Bell!</s>',
 ' daughter how to drive, and so far, so good!</s>',
 ' first day of school for my granddaughter. She',
 ' father almost died last year']

In [None]:
outs[11:20]

[' husband has been out of town for a few weeks on business but is coming back next week',
 " mother pressured me to get on a horse bareback I wasn't comfortable riding. She made such a big scene that I",
 ' records that belong to my father before he passed away. Sometimes',
 ' promoted to Team',
 ' tattoo I was worried about pain and blood. I',
 ' husband. He has been so sweet this week. He knew I had a long day today so he bought me my favourite wine and Taco',
 ' teaching my daughter how to drive, and',
 " first day of school for my granddaughter. She didn't go to preschool so I was really worried how she would like it and if she would be sad or not. Turns out she loves it and",
 ' father almost died last year']

In [None]:
result = pd.DataFrame(list(zip(ins,outs,df_test['emotion'])),columns = ['Human Annotation','Predicted Span','Emotion'])

In [None]:
def iou(ins,outs):

  s1 = set(ins.split(' '))
  s2 = set(outs.split(' '))
  u = s1.union(s2)
  inter = s1.intersection(s2)
  iou = len(inter)/len(u)
  return iou
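
One detail of the `iou` function above is worth seeing on a small case: `split(' ')` keeps an empty string for a leading space, and the decoded spans shown earlier start with one. The sketch below compares that behaviour with plain `split()`. The two helper names are hypothetical, and the leading space on both strings is an assumption based on those decoded spans.

```python
def iou_split_space(a, b):
    """Same logic as the notebook's iou: split on single spaces."""
    s1, s2 = set(a.split(' ')), set(b.split(' '))
    return len(s1 & s2) / len(s1 | s2)

def iou_split_any(a, b):
    """Variant that splits on any whitespace run and ignores empty tokens."""
    s1, s2 = set(a.split()), set(b.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# leading spaces are assumed, mirroring how decoded RoBERTa spans usually begin
human = " rendition of a play. But my friend got it."
pred = " friend got it"

print(' friend got it'.split(' '))      # the first element is an empty string
print(iou_split_space(human, pred))     # the empty string counts as a shared token
print(iou_split_any(human, pred))
```

With the empty token counted, the score is 3/11; without it, 2/10. The trailing period also makes `it.` and `it` different tokens, so punctuation stripping would change the score again.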

In [None]:
result['IOU'] = result.apply(lambda row : iou(row['Human Annotation'],
                     row['Predicted Span']), axis = 1)

In [None]:
result

| | Human Annotation | Predicted Span | Emotion | IOU |
|---|---|---|---|---|
| 0 | rendition of a play. But my friend got it. | friend got it | jealous | 0.272727 |
| 1 | saw a tall shadow in the hallway! It scared m... | shadow in the hallway! It | terrified | 0.315789 |
| 2 | slipped and fell on the wet floor as I | slipped and fell on the wet floor as I | embarrassed | 1.000000 |
| 3 | supervisor. It seemed suspicious so I pretty ... | weird questions about my | faithful | 0.043478 |
| 4 | 't give a homeless man any money. I | give a homeless man any | guilty | 0.555556 |
| ... | ... | ... | ... | ... |
| 833 | saw a homeless guy the other | homeless guy the other | grateful | 0.714286 |
| 834 | learn how different my two kids' personalitie... | different my two kids' personalities could be | surprised | 0.800000 |
| 835 | wife surpises me with a picture of our future... | wife surpises me with a picture of our future... | joyful | 1.000000 |
| 836 | shouted to my mom</s> | shouted to my mom</s> | guilty | 1.000000 |


In [None]:
import numpy as np
np.mean(result["IOU"])

0.6288723614823174

# Hugging Face upload

In [None]:
!pip install huggingface_hub

from huggingface_hub import notebook_login

notebook_login()


Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
model.push_to_hub("Nakul24/RoBERTa-emotion-extraction")
tokenizer.push_to_hub("Nakul24/RoBERTa-emotion-extraction")

Cloning https://huggingface.co/Nakul24/RoBERTa-emotion-extraction into local empty directory.


Upload file pytorch_model.bin:   0%|          | 3.34k/473M [00:00<?, ?B/s]

To https://huggingface.co/Nakul24/RoBERTa-emotion-extraction
   ae8ee73..fed0819  main -> main

To https://huggingface.co/Nakul24/RoBERTa-emotion-extraction
   fed0819..43abab0  main -> main



'https://huggingface.co/Nakul24/RoBERTa-emotion-extraction/commit/43abab03b92f84ca618992ea26084d641b294c5e'

Mean IOU by checkpoint (blank entries have no recorded value):

MODEL 2 - 
MODEL 3 - 
MODEL 4 - 0.6283977284141783
MODEL 5 - 0.6288723614823174

In [None]:
from google.colab import drive
drive.mount('/content/drive')


# EXTRAS

In [None]:
model_path = 'models/robertaft'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('models/robertaft/tokenizer_config.json',
 'models/robertaft/special_tokens_map.json',
 'models/robertaft/vocab.json',
 'models/robertaft/merges.txt',
 'models/robertaft/added_tokens.json',
 'models/robertaft/tokenizer.json')

In [None]:
tokenizer = AutoTokenizer.from_pretrained("models/robertaft")
model = AutoModelForQuestionAnswering.from_pretrained("models/robertaft").to(device)

tensor(0.5155, device='cuda:0', grad_fn=<DivBackward0>)

In [None]:
print("T/F\tstart\tend\n")
for i in range(len(start_true)):
    print(f"true\t{start_true[i]}\t{end_true[i]}\n"
          f"pred\t{start_pred[i]}\t{end_pred[i]}\n")

T/F	start	end

true	16	24
pred	4	18



In [None]:

import torch


question, text = "__angry__", "The table was scratched by the cat, I'm very upset and feel like killing it"

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

' scratched by the cat, I'
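
The cell above takes the argmax of the start and end logits independently, which can yield an end index that precedes the start and therefore an empty slice. A common alternative is to score start/end pairs jointly and keep only valid ones. The sketch below does this in plain Python; `best_span`, its `max_len` parameter, and the logit values are hypothetical and not part of the notebook.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair with the highest summed score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        # only consider ends at or after the start, within max_len tokens
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# made-up logits where the independent argmax gives end < start
start_logits = [0.1, 0.2, 0.3, 2.5, 0.0]
end_logits = [0.0, 3.0, 0.1, 0.2, 1.5]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
joint = best_span(start_logits, end_logits)
print(naive, joint)
```

The naive pair is (3, 1), which slices to nothing, while the joint search returns (3, 4). The search is quadratic in sequence length, which is usually acceptable for short inputs like these.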

In [None]:
print(device)

In [None]:
import torch

tokenizer = AutoTokenizer.from_pretrained("models/robertaft3")
model = AutoModelForQuestionAnswering.from_pretrained("models/robertaft3")

question, text = "happy", "I'm elated by the fact that I got a promotion"


inputs = tokenizer(text,question, return_tensors="pt",truncation=True, padding=True)
with torch.no_grad():
  outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()


predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

NameError: ignored

In [None]:
answer_start_index

tensor(2)

In [None]:
answer_end_index + 1

tensor(4)

In [None]:
tokenizer.decode(inputs.input_ids[0])

'<s>Ramesh and Suresh along with Ganesh told me that I am rich</s></s>__happy__</s>'

In [None]:
start_pred = torch.argmax(outputs['start_logits'], dim=1)
end_pred = torch.argmax(outputs['end_logits'], dim=1)

In [None]:
start_pred

tensor([2])

In [None]:
end_pred

tensor([4])

In [None]:
train_encodings = tokenizer(contexts, questions, truncation=True, padding=True)