# Speech and Language Processing (SLP)
## Master Systèmes Embarqués et Traitement de l'Information (SETI)


---------------------------

### Project: **Sarcasm Detection in Natural Language Processing Applied to English Text**
### Alaf DO NASCIMENTO SANTOS
### Ianis GIRAUD
---------------------------

## Introduction

As the final project for the Speech and Language Processing course, and the team's first real AI project, we decided to apply what we had learned to a simple Natural Language Processing (NLP) task in the area of text mining: training a model capable of recognising sarcasm in English text. The project, titled "Sarcasm Detection in Natural Language Processing Applied to English Text", addresses the challenge of detecting sarcasm within individual sentences. The goal is a system with a binary output that indicates whether a given sentence is sarcastic (true) or not (false).

Sarcasm is a challenging and popular research topic in text mining and semantic analysis. It is a form of verbal irony, which adds an extra layer of complexity to the already difficult field of text mining, and detecting it requires handling the nuances of English expressions. Our approach combines basic linguistic analysis with machine learning techniques. The models are trained on a dataset of sarcastic and non-sarcastic English sentences that carries the linguistic nuances and cultural context of the social media platform Reddit: the Self-Annotated Reddit Corpus (SARC) [1].

The potential applications of such a system are numerous, ranging from improving sentiment analysis of customer feedback to better understanding social media interactions. By focusing on sentence-level analysis, the project aims for a balance between efficiency and effectiveness, offering a pragmatic solution for scenarios where processing entire social media threads may be impractical.

Detecting sarcasm in natural language is a challenging task for NLP systems because it involves understanding not just the literal meaning of words but also the intended tone of the speaker or writer and the context. Several approaches are used in NLP to address sarcasm detection, e.g. context analysis, sentiment analysis, pragmatic analysis, and lexical and syntactic analysis. Despite advances in NLP, sarcasm detection remains a complex and evolving area of research, as it requires models to capture the subtleties of human communication, context, and emotion. Large and diverse datasets with labelled examples of sarcasm are essential for improving model accuracy on this nuanced form of expression. As a contribution to this field and a proof of concept, we propose a voting system: three approaches were developed, each gives an "opinion" on whether a sentence is sarcastic, and the majority wins.

## Installing and Importing necessary packages

In [93]:
!pip3 install matplotlib
!pip3 install portalocker==2.8.2
!pip3 install pandas
%matplotlib inline

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [94]:
import pandas as pd
import torch
import time
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# from tqdm.notebook import tqdm
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch import nn
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Preprocessing

The SARC dataset can be found at <https://nlp.cs.princeton.edu/old/SARC/2.0/>. It is a very large dataset, where each statement is self-annotated (sarcasm is labeled by the author) and provides the user, topic, and conversation context. In order to reduce the size of our dataset and only work with the target information, while keeping a limited complexity, we performed a preprocessing step.

The comments are kept, along with some metadata, in a single very large JSON file. Because of memory constraints, it wasn't possible to load the whole file at once. To reduce it to a usable size and convert it to a format that could be easily loaded later, we used the following Python code:

```python
import bz2
import re

import pandas as pd

# Matches lines of the form `"text": "<comment text>",` in the pretty-printed JSON.
re_text = re.compile('"text": *"(.*)",')

comments = "../SARC/comments-pretty2.json.bz2"

ids = []
texts = []

# level 0: expecting a comment id; level 1: inside a comment object.
level = 0

with bz2.open(comments, "rt", encoding="utf-8") as f:
    f.readline()  # skip the first line of the file
    for line in f:
        line = line.strip()
        if level == 0 and not line.startswith("}"):
            # The id is the first quoted string on the line.
            ids.append(line.split('"')[1])
            level = 1
        elif level == 1:
            m = re_text.match(line)
            if m:
                # Decode backslash escape sequences in the comment text.
                texts.append(bytes(m.group(1), "utf-8").decode("unicode_escape"))
            elif line.startswith("}"):
                level = 0

print(len(ids), len(texts))

df = pd.DataFrame({"id": ids, "text": texts})
df.to_csv("comments.csv.bz2")
```
This converts the JSON to a single CSV file containing only an identifier and the text of each message. While the original file would crash the computer when loaded with pandas, this reduced version fits in less than 4 GB.

The files containing the labels were provided in a CSV-like format of the form `<parent1_id> [<parent2_id> ...] | <child1_id> [<child2_id> ...] | <label1> [<label2> ...]`. These too had to be converted to a more usable format, using the following Python code:
```python
import pandas as pd
import bz2

arr_parents = []
arr_posts = []
arr_labels = []

labels_path = "../SARC/test-balanced.csv.bz2"
with bz2.open(labels_path, "rt", encoding="utf-8") as f:
    for line in f:
        parents, posts, labels = line.split("|")
        parent = parents.split()[-1]  # keep only the most recent parent
        posts = posts.split()
        labels = labels.split()
        # One output row per (post, label) pair on the line.
        for post, label in zip(posts, labels):
            arr_parents.append(parent)
            arr_posts.append(post)
            arr_labels.append(int(label))

df = pd.DataFrame({"parent": arr_parents, "post": arr_posts, "label": arr_labels})
df.to_csv("test-balanced.csv.bz2")
```
For each labelled message, this script keeps only its most recent parent and its label, and organizes everything in a simple CSV table.
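As a small illustration of the label-file format, the sketch below parses a single line into `(parent, post, label)` rows, following the same logic as the script above. The function name `parse_label_line` and the identifiers in the sample line are made up for this example and do not come from SARC.

```python
def parse_label_line(line):
    """Split one '<parents> | <posts> | <labels>' line into (parent, post, label) rows."""
    parents, posts, labels = line.split("|")
    parent = parents.split()[-1]  # keep only the most recent parent
    return [
        (parent, post, int(label))
        for post, label in zip(posts.split(), labels.split())
    ]


# Hypothetical identifiers, for illustration only.
rows = parse_label_line("p1 p2 | c1 c2 | 1 0")
print(rows)  # [('p2', 'c1', 1), ('p2', 'c2', 0)]
```

Each child post becomes one row, and all rows from the same line share the same parent.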

Finally, most of the comments from `comments.csv.bz2` weren't referenced in the balanced dataset. To avoid having to load the whole 4GB of data in memory during training, we merged the comments' texts with their labels using this last script:
```python
import pandas as pd
import bz2

labels = "test-balanced.csv.bz2"
texts = "comments.csv.bz2"

with bz2.open(labels, "rt", encoding="utf-8") as f:
    df_labels = pd.read_csv(f, usecols=("parent", "post", "label"))
print(df_labels)
    
with bz2.open(texts, "rt", encoding="utf-8") as f:
    df_texts = pd.read_csv(f, usecols=("id","text"))
print(df_texts)

# First merge: attach the text of the parent comment.
df_labels = pd.merge(df_labels, df_texts, how="left", suffixes=("_label", "_text"), left_on="parent", right_on="id").drop(columns=["parent", "id"])
print(df_labels)
df_labels.info()
# Second merge: attach the text of the post itself; the two text columns
# become text_parent and text_post.
df_labels = pd.merge(df_labels, df_texts, suffixes=("_parent", "_post"), how="left", left_on="post", right_on="id").drop(columns=["post", "id"])
print(df_labels)
df_labels.info()

df_labels.to_csv("merged-test.csv.bz2")
```


Since we are not supposed to include external data directly in our submission, the cleaned corpus has been uploaded to a personal GitHub repository. In the Python code, we access the files hosted on GitHub by URL and load them with the **read_csv** function.

In [95]:
url_train = "https://raw.githubusercontent.com/alafSantos/SETI-SLP-Sarcasm-recogniser/main/SARC/merged-train.csv"
# url_train = "SARC/merged-train.csv"
df_train = pd.read_csv(url_train, usecols=["label", "text_parent", "text_post"], encoding='utf-8')
train_iter = df_train.iterrows()
df_train.head()


| | label | text_parent | text_post |
|---|---|---|---|
| 0 | 1 | I've been searching for the answer for this fo... | Religion must have the answer |
| 1 | 0 | I've been searching for the answer for this fo... | It's obviously tracks from a giant water tract... |
| 2 | 1 | Michael Phelps Apologizes For "Regrettable" Be... | Wow...he smoked pot...oh lord hes such a horri... |
| 3 | 0 | Michael Phelps Apologizes For "Regrettable" Be... | Wow, his girlfriend is uhm... Ah fuck it, he's... |
| 4 | 0 | Utah wants to create a database to track the i... | I think the government should track every morm... |


In [96]:
url_test = "https://raw.githubusercontent.com/alafSantos/SETI-SLP-Sarcasm-recogniser/main/SARC/merged-test.csv"
# url_test = "SARC/merged-test.csv"
df_test = pd.read_csv(url_test, usecols=["label", "text_parent", "text_post"], encoding='utf-8')
test_iter = df_test.iterrows()
df_test.head()

| | label | text_parent | text_post |
|---|---|---|---|
| 0 | 1 | The vast majority of Republicans rallied behin... | Yes, cuz tax cuts will help those w/o jobs! |
| 1 | 0 | The vast majority of Republicans rallied behin... | If cutting taxes fails... cut taxes harder. |
| 2 | 1 | "...two-income families often have even less i... | Chalk it up to the ever-increasing cost of fre... |
| 3 | 0 | "...two-income families often have even less i... | We're about to finally get affordable housing,... |
| 4 | 1 | Heath Ledger Wins Oscar! | oh wow I am so surprised I never saw this coming |


In [97]:
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for _, row in data_iter:
        text = str(row["text_post"])
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Non-contextualised Approach

Each word in the input text is assigned an embedding vector. The vectors are then
averaged to get a fixed-size feature for the text. A linear transformation is
applied to this feature to get one output score per class (sarcastic or not).

<img src="https://raw.github.com/alafSantos/SETI-SLP-Sarcasm-recogniser/main/IMG/non-contextualised.svg">
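The averaging step can be sketched in plain Python, without PyTorch. This is only an illustration of the idea (`nn.EmbeddingBag` in its default `mean` mode followed by a linear layer), not the code used for training; the helper names and all numeric values are toy assumptions.

```python
def mean_embedding(token_ids, embedding_table):
    """Average the embedding vectors of the given token ids."""
    dim = len(embedding_table[0])
    total = [0.0] * dim
    for tid in token_ids:
        for k in range(dim):
            total[k] += embedding_table[tid][k]
    return [v / len(token_ids) for v in total]


def linear(feature, weights, bias):
    """Compute one output score per row of `weights` (here: one per class)."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, bias)
    ]


# Toy values for illustration: 3-word vocabulary, 2-dimensional embeddings.
embedding_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # 2 classes x 2 features
bias = [0.0, 0.0]

feature = mean_embedding([1, 2], embedding_table)  # average of rows 1 and 2
scores = linear(feature, weights, bias)
predicted_class = scores.index(max(scores))  # index of the highest score
print(feature, scores, predicted_class)
```

Because the vectors are averaged, the feature has the same size whatever the length of the text, and word order plays no role.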

In [98]:
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _, row in batch:
        _label = row["label"]
        _text = str(row["text_post"])

        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)


dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)
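The offsets returned by `collate_batch` mark where each text starts inside the concatenated tensor of token ids. A minimal pure-Python sketch of the same bookkeeping is shown here; the helper name `flatten_with_offsets` and the token ids are illustrative assumptions.

```python
from itertools import accumulate


def flatten_with_offsets(sequences):
    """Concatenate token-id lists and return the start index of each one."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Start of sequence i = total length of all sequences before it.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets


# Three texts of lengths 2, 1 and 3 (made-up token ids).
flat, offsets = flatten_with_offsets([[4, 7], [9], [2, 3, 5]])
print(flat)     # [4, 7, 9, 2, 3, 5]
print(offsets)  # [0, 2, 3]
```

This layout avoids padding: texts of different lengths are stored back to back, and the offsets tell the embedding layer where each one begins.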

In [99]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)


In [100]:
num_class = 2 #1 or 0 always
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

### Training and Evaluation

The training dataset is split into training data and validation data with a 95 to 5 ratio. This allows us to check the accuracy of the model after each epoch.

Training uses stochastic gradient descent (SGD) on a cross-entropy loss. The learning rate is divided by 10 whenever the validation accuracy falls below the best value seen so far.
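To make the loss concrete, here is a small sketch of the cross-entropy of a single example computed from raw class scores (logits). It is written in plain Python for illustration; `torch.nn.CrossEntropyLoss` computes the same quantity per example and, by default, averages it over the batch. The function name and the logit values are assumptions.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Equal scores for both classes: the loss equals log(2).
loss_uncertain = cross_entropy([0.0, 0.0], target=1)
# Correct class scored much higher: the loss is small.
loss_confident = cross_entropy([0.0, 5.0], target=1)
print(loss_uncertain, loss_confident)
```

The loss shrinks as the score of the correct class grows relative to the other, which is what gradient descent exploits during training.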

In [101]:
def train(dataloader, model=model):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0

def evaluate(dataloader, model=model):
    model.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for _, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count


            
# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None

train_iter = df_train.iterrows()
test_iter = df_test.iterrows()

train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

for epoch in range(0, EPOCHS):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

| epoch   0 |   500/ 3817 batches | accuracy    0.557
| epoch   0 |  1000/ 3817 batches | accuracy    0.601
| epoch   0 |  1500/ 3817 batches | accuracy    0.600
| epoch   0 |  2000/ 3817 batches | accuracy    0.604
| epoch   0 |  2500/ 3817 batches | accuracy    0.613
| epoch   0 |  3000/ 3817 batches | accuracy    0.611
| epoch   0 |  3500/ 3817 batches | accuracy    0.624
-----------------------------------------------------------
| end of epoch   0 | time: 11.26s | valid accuracy    0.611 
-----------------------------------------------------------
| epoch   1 |   500/ 3817 batches | accuracy    0.633
| epoch   1 |  1000/ 3817 batches | accuracy    0.633
| epoch   1 |  1500/ 3817 batches | accuracy    0.636
| epoch   1 |  2000/ 3817 batches | accuracy    0.635
| epoch   1 |  2500/ 3817 batches | accuracy    0.631
| epoch   1 |  3000/ 3817 batches | accuracy    0.635
| epoch   1 |  3500/ 3817 batches | accuracy    0.636
-----------------------------------------------------------
| e

Checking the results on the test dataset. Testing uses a different dataset from the one used for training, which lets us check that the model did not simply overfit its training data.

In [102]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader)
print("test accuracy {:8.2f}".format(accu_test*100)+"%")

Checking the results of test dataset.
test accuracy    65.65%


In [103]:
def predict_no_context(text, text_pipeline=text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() == 1

## Contextualised Approach

Each word in the input texts is assigned an embedding vector. The vectors are then averaged to get a fixed-size feature for each of the two input texts (parent and post). A hidden layer is used to "mix" the information provided by the two features. As before, a linear transformation is then applied to get a single output value, which a sigmoid maps to a score between 0 and 1.

<img src="https://raw.github.com/alafSantos/SETI-SLP-Sarcasm-recogniser/main/IMG/contextualised.svg">
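The data flow of this second model can be sketched in plain Python as follows. This is a simplified illustration under toy assumptions (two-dimensional features, hand-picked weights, hypothetical helper names), not the trained network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def forward(parent_feature, post_feature, hidden_w, hidden_b, out_w, out_b):
    combined = parent_feature + post_feature  # concatenate the two features
    hidden = dense([sigmoid(v) for v in combined], hidden_w, hidden_b)
    output = dense([sigmoid(v) for v in hidden], out_w, out_b)
    return sigmoid(output[0])  # score between 0 and 1


# Toy averaged features and weights, for illustration only.
prob = forward(
    parent_feature=[0.0, 0.0],
    post_feature=[0.0, 0.0],
    hidden_w=[[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[1.0, 1.0]],
    out_b=[0.0],
)
is_sarcastic = prob > 0.5
print(prob, is_sarcastic)
```

The final sigmoid keeps the output in the interval (0, 1), so it can be compared with a 0.5 threshold to obtain the binary prediction.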

In [104]:
text_pipeline = lambda x, y: (vocab(tokenizer(str(x))), vocab(tokenizer(str(y))))
label_pipeline = lambda x: int(x)

def collate_batch(batch):
    label_list, text_parent_list, text_post_list = [], [], []
    offsets_parent, offsets_post = [0], [0]
    for _, row in batch:
        label = row['label']
        text_post = row['text_post']
        text_parent = row['text_parent']
        label_list.append(label_pipeline(label))
        tokens_parent, tokens_post = text_pipeline(text_parent, text_post)
        processed_text_parent = torch.tensor(tokens_parent, dtype=torch.int64)
        processed_text_post = torch.tensor(tokens_post, dtype=torch.int64)
        text_parent_list.append(processed_text_parent)
        text_post_list.append(processed_text_post)
        offsets_parent.append(processed_text_parent.size(0))
        offsets_post.append(processed_text_post.size(0))
    label_list = torch.tensor(label_list, dtype=torch.float)
    offsets_parent = torch.tensor(offsets_parent[:-1]).cumsum(dim=0)
    offsets_post = torch.tensor(offsets_post[:-1]).cumsum(dim=0)
    text_parent_list = torch.cat(text_parent_list)
    text_post_list = torch.cat(text_post_list)
    return (label_list.to(device), text_parent_list.to(device), text_post_list.to(device), 
            offsets_parent.to(device), offsets_post.to(device))

train_iter = df_train.iterrows()
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

In [105]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, embed_dim_post, hidden_dim):
        super(TextClassificationModel, self).__init__()
        self.embedding_parent = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.embedding_post = nn.EmbeddingBag(vocab_size, embed_dim_post, sparse=False)
        self.hidden = nn.Linear(embed_dim*2, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding_parent.weight.data.uniform_(-initrange, initrange)
        self.embedding_post.weight.data.uniform_(-initrange, initrange)
        self.hidden.weight.data.uniform_(-initrange, initrange)
        self.hidden.bias.data.zero_()
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

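    def forward(self, text_parent, text_post, offsets_parent, offsets_post):
        embedded_parent = self.embedding_parent(text_parent, offsets_parent)
        # Note: the post is embedded with the parent table, so both inputs share
        # one embedding and self.embedding_post is never used. Switching to
        # self.embedding_post would require self.hidden to accept
        # embed_dim + embed_dim_post inputs.
        embedded_post = self.embedding_parent(text_post, offsets_post)
        # Concatenate the two features along the feature dimension.
        embedded = torch.hstack([embedded_parent,embedded_post])
        middle = self.hidden(torch.sigmoid(embedded))
        output = self.fc(torch.sigmoid(middle))
        return torch.sigmoid(output).squeeze()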

In [106]:
train_iter = df_train.iterrows()
emsize = 64
embed_dim_post = 96
hidden_size = 64
model2 = TextClassificationModel(vocab_size, emsize, embed_dim_post, hidden_size).to(device)

### Training and Evaluation

In [107]:
def train(dataloader, model=model2):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500

    for idx, (label, text_parent, text_post, offsets_parent, offsets_post) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text_parent, text_post, offsets_parent, offsets_post)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        predicted_label[predicted_label > 0.5] = 1
        predicted_label[predicted_label <= 0.5] = 0
        total_acc += (predicted_label == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0


def evaluate(dataloader, model=model2):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for _, (label, text_parent, text_post, offsets_parent, offsets_post) in enumerate(dataloader):
            predicted_label = model(text_parent, text_post, offsets_parent, offsets_post)
            predicted_label[predicted_label > 0.5] = 1
            predicted_label[predicted_label <= 0.5] = 0
            total_acc += (predicted_label == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model2.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter = df_train.iterrows()
test_iter = df_test.iterrows()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

print("Starting training!")
for epoch in range(0, EPOCHS):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

Starting training!
| epoch   0 |   500/ 3817 batches | accuracy    0.500
| epoch   0 |  1000/ 3817 batches | accuracy    0.504
| epoch   0 |  1500/ 3817 batches | accuracy    0.503
| epoch   0 |  2000/ 3817 batches | accuracy    0.500
| epoch   0 |  2500/ 3817 batches | accuracy    0.499
| epoch   0 |  3000/ 3817 batches | accuracy    0.499
| epoch   0 |  3500/ 3817 batches | accuracy    0.501
-----------------------------------------------------------
| end of epoch   0 | time: 18.33s | valid accuracy    0.503 
-----------------------------------------------------------
| epoch   1 |   500/ 3817 batches | accuracy    0.501
| epoch   1 |  1000/ 3817 batches | accuracy    0.505
| epoch   1 |  1500/ 3817 batches | accuracy    0.503
| epoch   1 |  2000/ 3817 batches | accuracy    0.507
| epoch   1 |  2500/ 3817 batches | accuracy    0.513
| epoch   1 |  3000/ 3817 batches | accuracy    0.516
| epoch   1 |  3500/ 3817 batches | accuracy    0.522
--------------------------------------------

In [108]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader)
print("test accuracy {:8.2f}".format(accu_test*100)+"%")

Checking the results of test dataset.
test accuracy    63.40%


In [109]:
def predict_context(text_parent, text_post, text_pipeline=text_pipeline):
    with torch.no_grad():
        text_parent, text_post = text_pipeline(text_parent, text_post)
        output = model2( torch.tensor(text_parent), torch.tensor(text_post), 
                        torch.tensor([0]), torch.tensor([0]) )
        return output.item() > 0.5

## VADER Sentiment Analysis Approach

With the VADER approach, each word of a sentence that appears in the VADER lexicon receives a valence score (negative, positive or neutral), and these scores are combined to tell whether the sentence as a whole is positive or negative. We use the resulting compound score, which is normalised between -1 and 1. The study in [6] shows that sarcasm often pairs a positive sentence with a negative situation or context. For simplicity, VADER is used as a voter in our final decision: a sentence is considered sarcastic when the answer has a positive or neutral compound score and the context has a negative or neutral one; otherwise it is considered non-sarcastic. It is important to highlight that VADER is lexicon- and rule-based and captures the relationships between words only through a few heuristics, whereas these relationships matter a lot in real text. It is used here as a voter for simplicity.
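The decision rule itself does not depend on VADER and can be sketched on precomputed compound scores. The function name `contrast_vote` and the score values below are hypothetical and were chosen only to exercise each case.

```python
def contrast_vote(context_compound, answer_compound):
    """Vote 'sarcastic' when a non-negative answer follows a non-positive context."""
    return context_compound <= 0 and answer_compound >= 0


# Made-up compound scores in [-1, 1], for illustration only.
votes = [
    contrast_vote(-0.54, 0.63),   # negative context, positive answer
    contrast_vote(0.40, 0.63),    # positive context
    contrast_vote(-0.54, -0.30),  # negative answer
    contrast_vote(0.0, 0.0),      # both neutral
]
print(votes)  # [True, False, False, True]
```

Since both comparisons include zero, two texts that are scored as exactly neutral also produce a "sarcastic" vote.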

In [110]:
# Build the analyser once: it loads the VADER lexicon on construction.
sia = SentimentIntensityAnalyzer()


def predict_vader(context, answer):
    # Vote "sarcastic" when the context is negative or neutral
    # and the answer is positive or neutral.
    vader1_result = sia.polarity_scores(context)["compound"]
    vader2_result = sia.polarity_scores(answer)["compound"]
    result = (vader1_result <= 0 and vader2_result >= 0)
    return result


def evaluate():
    total_acc, total_count = 0, 0

    for _, row in df_test.iterrows():
        label = row['label']
        text_post = str(row['text_post'])
        text_parent = str(row['text_parent'])
        predicted_label = predict_vader(text_parent, text_post)
        total_acc += (predicted_label == (label==1))
        total_count += 1
    
    return total_acc / total_count


print("Checking the results of test dataset.")
accu_test = evaluate()
print("test accuracy {:8.2f}".format(accu_test*100)+"%")

Checking the results of test dataset.
test accuracy    50.38%


## Demonstration

Finally, to demonstrate our sarcasm recognition system, we combine the three approaches in a voting system. Each approach casts a vote, 0 for non-sarcastic and 1 for sarcastic, and the input is declared sarcastic if the sum of the votes is greater than 1 (i.e. 2 or 3).
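The voting rule can be written as a small standalone function. The name `majority_vote` is an assumption made for this sketch; the rule itself is the one described above, generalised to any number of voters.

```python
def majority_vote(votes):
    """Return True when more than half of the boolean votes are True."""
    return sum(votes) > len(votes) / 2


print(majority_vote([False, True, True]))   # True: 2 votes out of 3
print(majority_vote([False, False, True]))  # False: 1 vote out of 3
```

With three voters, this is equivalent to checking that the sum of the votes is greater than 1.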

In [111]:
'''
Some of the example sentences here are taken from https://www.yourdictionary.com/articles/examples-sarcasm-meaning-types
'''

input1 = [
    "When something bad happens",
    "When you expected something to happen, especially after warning someone about it",
    "Chart showing how people's political views change as they age, based on 172,853 people's self-proclaimed political views on OKcupid.",
          ]


input2 = [
    "That's just what I needed today!",
    "Well, what a surprise.",
    "Good lord... my chart is the exact opposite of this.",
          ]

model = model.to("cpu")
model2 = model2.to("cpu")

for i in range(0, len(input1)):
    noContext_vote = predict_no_context(input2[i])
    context_vote = predict_context(input1[i], input2[i])
    vader_vote = predict_vader(input1[i], input2[i])
    result = sum([noContext_vote, context_vote, vader_vote]) > 1

    print("--------------------\nSituation ", i, ":", input1[i])
    print("Remark: ", input2[i])
    print("M1: ", noContext_vote)
    print("M2: ", context_vote)
    print("M3: ", vader_vote)
    print("Voting Result: ", result)

--------------------
Situation  0 : When something bad happens
Remark:  That's just what I needed today!
M1:  False
M2:  True
M3:  True
Voting Result:  True
--------------------
Situation  1 : When you expected something to happen, especially after warning someone about it
Remark:  Well, what a surprise.
M1:  True
M2:  True
M3:  True
Voting Result:  True
--------------------
Situation  2 : Chart showing how people's political views change as they age, based on 172,853 people's self-proclaimed political views on OKcupid.
Remark:  Good lord... my chart is the exact opposite of this.
M1:  False
M2:  False
M3:  True
Voting Result:  False


## Conclusion

The project started with a small survey to better understand the possibilities and choose a subject that would interest the group members. After that, we studied some basic machine learning concepts, together with some text mining theory, to get a better idea of what could be done. Then, we read the PyTorch documentation to find a starting point. Finally, we moved on to the coding phase, where [5] was extremely important as a point of comparison and a basis for our code. Throughout our work, we developed three approaches for sarcasm recognition so that we could apply Condorcet's jury theorem and obtain a more trustworthy and robust system giving binary predictions of whether a sentence is sarcastic. We worked with the SARC corpus, which contains Reddit comments self-annotated by their authors.

1. For the first model, based on [1] and [5], we designed a simple sarcasm recognition system that takes only a sentence as input (no context), starting from a PyTorch text classification example. It reached a test accuracy of just over 65% (65.65%).

2. For the second model, also based on [1] and [5] and using the same self-annotated dataset, we modified the first model to take into account the context in which the comment was written (the parent text). This time we reached a test accuracy of about 63% (63.40%).

3. Finally, to have a third opinion in the voting process suggested by Condorcet's jury theorem and explored in [7], we used the VADER sentiment analyser to do sentiment analysis with context. The idea was to consider the contrast between the situation and the target sentence: [6] studied the relationship between negative situations or contexts and positive answers as a way of expressing sarcasm on Twitter. With this third and last approach, we reached a test accuracy of just over 50% (50.38%).

Since the third approach performs barely better than chance, we conclude that our voting system is not reliable in the sense of the theorem, which requires every voter to be right clearly more often than not. The individual results of the first two approaches are nevertheless a significant achievement for a first text mining project applied to NLP challenges. As a continuation of this project, exploring Condorcet's jury theorem further and developing more reliable voters could be a promising direction. If the solution is to be used in embedded applications, reducing the complexity of the system will also be key. Capturing semantic relationships within multiword expressions, for example with Named Entity Recognition (NER), could increase the system's accuracy, since sarcasm may involve playing with the literal meaning of named entities or making ironic references to them. Finally, another possible branch of this project would be to extend it to speech analysis.
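The effect of a weak voter can be estimated with a short calculation. The sketch below computes the probability that a strict majority of voters is correct, under the strong assumption that the voters make their errors independently; our three approaches share data and features, so they almost certainly do not, and the result is only a rough indication. The function name is an assumption; the accuracies are the test accuracies reported in this notebook.

```python
from itertools import product


def majority_accuracy(accuracies):
    """Probability that a strict majority of independent voters is correct."""
    n = len(accuracies)
    total = 0.0
    # Enumerate every combination of correct/incorrect voters.
    for outcome in product([True, False], repeat=n):
        if sum(outcome) > n / 2:
            prob = 1.0
            for correct, p in zip(outcome, accuracies):
                prob *= p if correct else 1.0 - p
            total += prob
    return total


reported = [0.6565, 0.6340, 0.5038]  # test accuracies of the three approaches
combined = majority_accuracy(reported)
print(round(combined, 3))
```

Under this independence assumption the majority vote would be right about 64.7% of the time, slightly below the best single voter (65.65%), which is consistent with the conclusion that the third voter does not help.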

## References

[1] Khodak, M., Saunshi, N. and Vodrahalli, K., 2017. A large self-annotated corpus for sarcasm. arXiv preprint arXiv:1704.05579.

[2] Attardo, S. and Raskin, V., 1991. Script theory revis(it)ed: Joke similarity and joke representation model.

[3] Farha, I.A., Oprea, S., Wilson, S. and Magdy, W., 2022, July. SemEval-2022 Task 6: iSarcasmEval, intended sarcasm detection in English and Arabic. In The 16th International Workshop on Semantic Evaluation 2022 (pp. 802-814). Association for Computational Linguistics.

[4] Ashwitha, A., Shruthi, G., Shruthi, H.R., Upadhyaya, M., Ray, A.P. and Manjunath, T.C., 2021. Sarcasm detection in natural language processing. Materials Today: Proceedings, 37, pp.3324-3331.

[5] Paszke, A. et al., 2017. Automatic differentiation in PyTorch.

[6] Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N. and Huang, R., 2013, October. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 704-714).

[7] Ariharan, V., Eswaran, S.P., Vempati, S. and Anjum, N., 2019. Machine learning quorum decider (MLQD) for large scale IoT deployments. Procedia Computer Science, 151, pp.959-964.

[8] Sag, I.A., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D., 2002. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing: Third International Conference, CICLing 2002 Mexico City, Mexico, February 17–23, 2002 Proceedings 3 (pp. 1-15). Springer Berlin Heidelberg.

