<a href="https://colab.research.google.com/github/JumanaKhrais/Transformer-Based-Deep-Learning-Models-for-Sarcasm-Detection-with-an-Imbalanced-Dataset./blob/main/Ensem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Mounting Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')  # Mount Google Drive so the notebook can read from and write to it

Mounted at /content/drive


**Importing Libraries**

In [None]:
! pip install transformers

Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
  

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch

from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from transformers import AutoModelForSequenceClassification, AutoTokenizer 
from transformers import BertForSequenceClassification

**BERT Model + Tokenizer**

In [None]:
# Load pretrained BERT with a new 2-class classification head, plus its tokenizer
modelBert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", return_dict=True, num_labels=2)
tokenizerBert = AutoTokenizer.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

**RoBERTa Model + Tokenizer**

In [None]:
# Load pretrained RoBERTa with a new 2-class classification head, plus its tokenizer
modelRob = AutoModelForSequenceClassification.from_pretrained("roberta-base", return_dict=True, num_labels=2)
tokenizerRob = AutoTokenizer.from_pretrained("roberta-base")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

**Creating Torch Dataset**

In [None]:
# Wrap tokenizer output (and optional labels) in a torch Dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Convert each encoding field (input_ids, attention_mask, ...) to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Explicit None check: truth-testing a NumPy array or pandas Series raises an error
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

# The Trainer API accepts the data (not the model) wrapped in a torch.utils.data.Dataset.


**Reading + Tokenizing + Creating the Torch Test Dataset**

In [None]:
# Load test data
test = pd.read_csv("drive/MyDrive/TestEnglish.csv")

In [None]:
print(test.shape)
testData = list(test['text'])

# Tokenize with the BERT tokenizer: truncate to 80 tokens, pad to the longest sequence
testDatatokenizedBert = tokenizerBert(testData, padding=True, truncation=True, max_length=80)

# Tokenize with the RoBERTa tokenizer using the same settings
testDatatokenizedRob = tokenizerRob(testData, padding=True, truncation=True, max_length=80)

# Create the torch dataset for BERT (no labels; used for prediction only)
test_datasetBert = Dataset(testDatatokenizedBert)

# Create the torch dataset for RoBERTa
test_datasetRob = Dataset(testDatatokenizedRob)




(1400, 2)


**Loading RoBERTa Model**

In [None]:
# Load the saved RoBERTa checkpoint
model_pathRob = "drive/MyDrive/RobMod/Roberta-output4/checkpoint-500"
modelRob = AutoModelForSequenceClassification.from_pretrained(model_pathRob, num_labels=2)

# Wrap the model in a Trainer for inference
test_trainerRob = Trainer(modelRob)

# Predict on the test set; the first returned element holds the raw outputs (logits)
raw_predRob, _, _ = test_trainerRob.predict(test_datasetRob)

# Convert logits to predicted class indices
y_predRob = np.argmax(raw_predRob, axis=1)


***** Running Prediction *****
  Num examples = 1400
  Batch size = 8
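
Taking `np.argmax` directly on the raw predictions works because the Trainer returns one logit per class for each example, and softmax preserves the ordering within a row. A minimal sketch with made-up logits illustrates this; the `softmax` helper and the array values are illustrative assumptions, not notebook output:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for three examples and two classes (not real model output)
toy_logits = np.array([[2.0, -1.0],
                       [0.3, 0.9],
                       [-0.5, -0.2]])

probs = softmax(toy_logits)

# The predicted class is the same whether taken from logits or probabilities
pred_from_logits = np.argmax(toy_logits, axis=1)
pred_from_probs = np.argmax(probs, axis=1)
```

Probabilities are only needed when the scores themselves are to be interpreted or thresholded; for picking the top class, the logits suffice.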


**Loading BERT Model**

In [None]:
# Load the saved BERT checkpoint
model_pathBert = "drive/MyDrive/BertMod/Bert-output4/checkpoint-1000"
modelBert = AutoModelForSequenceClassification.from_pretrained(model_pathBert, num_labels=2)

# Wrap the model in a Trainer for inference
test_trainerBert = Trainer(modelBert)

# Predict on the test set; the first returned element holds the raw outputs (logits)
raw_predBert, _, _ = test_trainerBert.predict(test_datasetBert)

# Convert logits to predicted class indices
y_predBert = np.argmax(raw_predBert, axis=1)


loading configuration file drive/MyDrive/BertMod/Bert-output4/checkpoint-1000/config.json
Model config BertConfig {
  "_name_or_path": "drive/MyDrive/BertMod/Bert-output4/checkpoint-1000",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file drive/MyDrive/BertMod/Bert-output4/checkpoint-1000/pytorch_model.bin
All model checkpoint weights were u

**Ensembling:**

The ensemble is a weighted sum of the two models' raw outputs (logits), with RoBERTa weighted 4.5 times as heavily as BERT:

*   1 × BERT
*   4.5 × RoBERTa
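
The weighting above amounts to adding BERT's raw outputs to 4.5 times RoBERTa's and taking the arg max of each row. A minimal sketch on made-up logits follows; `weighted_logit_ensemble` and the toy arrays are hypothetical names and values, not part of the notebook:

```python
import numpy as np

def weighted_logit_ensemble(logits_by_model, weights):
    """Sum per-model logits (each shaped [n_examples, n_classes]) using the given weights."""
    stacked = np.stack(logits_by_model)                    # [n_models, n_examples, n_classes]
    w = np.asarray(weights, dtype=float).reshape(-1, 1, 1)  # one weight per model
    combined = (w * stacked).sum(axis=0)
    return combined, np.argmax(combined, axis=1)

# Hypothetical logits for two examples (not real model output)
toy_bert = np.array([[1.0, 2.0], [3.0, 0.0]])
toy_roberta = np.array([[0.5, 0.0], [0.0, 0.5]])

combined, labels = weighted_logit_ensemble([toy_bert, toy_roberta], [1.0, 4.5])
```

In the first toy row the heavier RoBERTa weight overturns BERT's choice; in the second, BERT's large margin still wins. Because logits rather than probabilities are summed, a model that produces larger-magnitude logits has more influence than its weight alone suggests.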


In [None]:
from sklearn import metrics

# Weighted sum of logits: 1 x BERT + 4.5 x RoBERTa
raw_final = raw_predBert + 4.5 * raw_predRob
y_pred = np.argmax(raw_final, axis=1)
predicted = y_pred
testL = test['sarcastic']  # ground-truth labels

# Accuracy, then precision, recall, and F1 for the positive class (label 1)
print(metrics.accuracy_score(testL, predicted))
print(metrics.precision_score(testL, predicted))
print(metrics.recall_score(testL, predicted))
print(metrics.f1_score(testL, predicted))


0.8592857142857143
0.5102040816326531
0.375
0.43227665706051877
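
The gap between accuracy (about 0.86) and F1 (about 0.43) is typical of an imbalanced test set. As a worked check, the confusion counts below reproduce the four scores above. These counts are inferred from the printed values and the 1,400 test rows, not reported by the notebook, and they assume label 1 marks the sarcastic class:

```python
# Confusion counts inferred from the printed metrics (an assumption, not notebook output)
tp, fp, fn, tn = 75, 72, 125, 1128

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial classifier that always predicts class 0
majority_baseline = (tn + fp) / total

print(total, accuracy, precision, recall, f1, majority_baseline)
```

Under these inferred counts, always predicting the majority class would already score about 0.857 accuracy, so precision, recall, and F1 on the positive class are the more informative figures here.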
