# Sarcasm Detection

The objective of this notebook is to develop an automatic sarcasm detection module for Portuguese texts, specifically news articles. The dataset combines news articles from three major Brazilian websites and contains both sarcastic and non-sarcastic news. Two classifier models were developed using different methodologies: classical Machine Learning algorithms over a static vector representation, and fine-tuning of a multilingual transformer model.


## Description of the data set structure and characteristics

The dataset was taken from the [PLNCrawler repository](https://github.com/schuberty/PLNCrawler). It is originally structured as three JSON files, one for each news site from which the news articles were extracted:
- Sensacionalista: 5006 sarcastic news articles
- Estadão: 11272 non-sarcastic news articles
- Revista Piauí (Herald section): 2216 sarcastic news articles

Each file has the following fields for each news article:
- is_sarcastic (or is_sarcasm): boolean, the label of the news article (sarcastic or not)
- article_link: string, the URL from which the news article was extracted
- headline: string, the news title
- text: string, the news text
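As a minimal sketch of this record layout, the snippet below parses a small JSON string shaped like one of the source files and stores the label under a single field name. The sample records, the example URLs, and the `normalize_record` helper are hypothetical illustrations and are not taken from the repository.

```python
import json

# Hypothetical sample shaped like the fields listed above (not real data)
raw = """
[
  {"is_sarcasm": true, "article_link": "https://example.com/a",
   "headline": "Example headline A", "text": "Example text A"},
  {"is_sarcastic": false, "article_link": "https://example.com/b",
   "headline": "Example headline B", "text": "Example text B"}
]
"""

def normalize_record(record):
    """Return a copy of the record whose label is always stored under 'is_sarcastic'."""
    normalized = dict(record)
    if "is_sarcasm" in normalized:
        normalized["is_sarcastic"] = normalized.pop("is_sarcasm")
    return normalized

records = [normalize_record(r) for r in json.loads(raw)]
for r in records:
    print(r["is_sarcastic"], r["headline"])
```

Normalizing the label name once at load time keeps later steps independent of which site a record came from.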


Loads the dataset of each site into a DataFrame:

In [None]:
import sys
import os

sys.path.append(os.path.dirname(os.getcwd()))
from src.scripts import get_df_sensacionalista
from src.scripts import get_df_estadao
from src.scripts import get_df_the_piaui_herald

# Load each file into a DataFrame
df_sensacionalista = get_df_sensacionalista()
df_estadao = get_df_estadao()
df_piaui = get_df_the_piaui_herald()
df_piaui = df_piaui.rename(columns={'is_sarcasm': 'is_sarcastic'}) # Rename the column to match the other DataFrames

display(df_sensacionalista)
display(df_estadao)
display(df_piaui)

Merges the three datasets into a single balanced DataFrame, keeping 50% sarcastic and 50% non-sarcastic news.
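The internals of `merge_dfs` are not shown in this notebook. One plausible way to reach a 50/50 split is to downsample the larger class, which the library-free sketch below illustrates. The `balance_classes` helper, the fixed seed, and the placeholder records are assumptions for illustration; only the per-site counts come from the list above.

```python
import random

def balance_classes(sarcastic, non_sarcastic, seed=42):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    n = min(len(sarcastic), len(non_sarcastic))
    balanced = rng.sample(sarcastic, n) + rng.sample(non_sarcastic, n)
    rng.shuffle(balanced)  # Avoid having all samples of one class first
    return balanced

# Placeholder records using the per-site counts listed earlier in this notebook
sarcastic = [{"is_sarcastic": True, "id": i} for i in range(5006 + 2216)]
non_sarcastic = [{"is_sarcastic": False, "id": i} for i in range(11272)]

balanced = balance_classes(sarcastic, non_sarcastic)
num_sarcastic = sum(r["is_sarcastic"] for r in balanced)
print(len(balanced), num_sarcastic)
```

Under this assumption the non-sarcastic class is reduced to the size of the combined sarcastic class, so part of the Estadão data is left unused.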



In [None]:
# Join the 3 datasets
from src.scripts import merge_dfs

df = merge_dfs(df_sensacionalista, df_estadao, df_piaui)

num_sarcastic = df['is_sarcastic'].sum()

print(f'Number of sarcastic samples: {num_sarcastic}')
print(f'Number of non-sarcastic samples: {len(df) - num_sarcastic}')

display(df)

# Pre-processing

Some linguistic features that traditional pre-processing steps remove or normalize can influence the classification of irony in texts.
Punctuation marks, for example, may indicate irony, so whether to remove them is a pre-processing parameter.

For this reason, the pre-processing function accepts optional parameters that enable or disable each transformation.
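The idea of switchable transformations can be sketched as follows. This is not the `pre_processamento` function used in this notebook. The `preprocess` name, its `lowercase` and `remove_punctuation` parameters, and the tokenization rule are hypothetical, chosen only to show how a flag can keep or drop punctuation.

```python
import re
import string

def preprocess(text, lowercase=True, remove_punctuation=False):
    """Tokenize text, applying each optional transformation only when enabled."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into word tokens and standalone punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(preprocess("Que ótima notícia!!!"))
print(preprocess("Que ótima notícia!!!", remove_punctuation=True))
```

With the flag off, the repeated exclamation marks survive as tokens and stay available to the classifier as a possible irony cue.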

## Stemming and lemmatization

> "Stemming or lemmatization reduces words to their root form (e.g., "running" becomes "run"), making it easier to analyze language by grouping different forms of the same word." Source: https://www.ibm.com/think/topics/natural-language-processing

Stemming and lemmatization are both optional, but they are never applied together because they serve the same purpose through different approaches.
Thus, if both are enabled, only **lemmatization** is applied, because it is the more semantically informed of the two.
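The precedence rule can be written as a small decision function. This is a sketch of the rule as described, not the actual code inside `pre_processamento`; the `choose_normalization` name and its return values are hypothetical.

```python
def choose_normalization(use_stemming=False, use_lemmatization=False):
    """Return which normalization to apply; lemmatization wins when both are on."""
    if use_lemmatization:
        return "lemmatization"
    if use_stemming:
        return "stemming"
    return None  # Neither option enabled: tokens are left as they are

print(choose_normalization(use_stemming=True, use_lemmatization=True))
```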

### Sources for pre-processing:

1. [Key Guidelines](https://github.com/sharadpatell/Text_preprocessing_steps_for_NLP/blob/main/Text_preprocessing_steps_for_NLP.ipynb) which assisted in the step-by-step pre-processing.
2. FACELI, K. et al. Artificial Intelligence: An Approach to Machine Learning. 2nd ed.


In [None]:
from src.preprocessamento import pre_processamento

use_lemmatization = True
use_stemming      = False

df = pre_processamento(df, usar_stemming = use_stemming, usar_lemmatization = use_lemmatization)

# Saves the DataFrame temporarily
print('Saving the temporary DataFrame...')
df.to_parquet("../temp/temp_input.parquet")
print('Temporary DataFrame saved.')

display(df)

## First approach to detection: Use of classic machine learning algorithms + static vector representation

For this approach, the first step is to create a vector representation of the text. Computers do not interpret human language directly, so the texts must be transformed into a structured representation that machines can process.

This transformation is part of the feature extraction step.
> Feature extraction is the process of converting raw text into numerical representations that machines can analyze and interpret. Source: https://www.ibm.com/think/topics/natural-language-processing

Word2Vec was chosen to generate dense vectors, which capture the semantic value of words and relate them to each other. This makes them well suited to machine learning tasks.

In [None]:
use_word2vec = False
use_sequence_transformer = not use_word2vec

Applies Word2Vec to the dataset, training the Word2Vec model and returning an embedding for each news article.
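How `word2vec_runner.py` turns word vectors into one embedding per news article is not shown here. A common choice is to average the vectors of the tokens the model knows, which the sketch below illustrates with toy values. The `document_vector` helper, the 3-dimensional vectors, and the reading of the valid indexes as "documents that produced a vector" are all assumptions.

```python
def document_vector(tokens, word_vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 3-dimensional vectors; a trained model would provide the real ones
word_vectors = {
    "governo": [0.2, 0.4, 0.0],
    "anuncia": [0.6, 0.0, 0.2],
}

documents = [["governo", "anuncia", "medida"], ["xyz"]]
embeddings, valid_indexes = [], []
for i, tokens in enumerate(documents):
    vec = document_vector(tokens, word_vectors)
    if vec is not None:  # Keep only documents that produced a vector
        embeddings.append(vec)
        valid_indexes.append(i)

print(embeddings, valid_indexes)
```

Tracking the surviving indexes alongside the embeddings is what allows the labels to be realigned with the feature matrix later.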

In [None]:


if use_word2vec:
    import subprocess
    import pickle

    # Runs Word2Vec in a separate conda environment;
    # check=True stops here if the runner fails, so stale outputs are not loaded
    subprocess.run([
        "conda", "run", "-n", "word2vec_env", "python",
        "../src/word2vec_runner.py", "0", "../temp/temp_input.parquet", "text", "skip-gram"
    ], check=True)

    # Retrieve the resulting embeddings
    with open("../temp/embeddings_output.pkl", "rb") as f:
        embeddings = pickle.load(f)

    # Retrieve the indexes of the valid rows (used below to align the labels)
    with open("../temp/indices_validos.pkl", "rb") as f:
        valid_indexes = pickle.load(f)

    print(embeddings[:5])

### Train a traditional ML model using Word2Vec embeddings

With the embeddings created, the next step is to train a Machine Learning model.

Splits the data into training and test sets:

In [None]:
if (use_word2vec):
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    # Converting to arrays
    X = np.array(embeddings)
    y = df.iloc[valid_indexes]["is_sarcastic"].astype(int).values
    
    print(f"X shape: {X.shape}, y shape: {y.shape}")
    
    # Confirms that they are aligned
    assert len(X) == len(y)
    print(len(X), len(y))
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

Tests several models/algorithms

In [None]:
if (use_word2vec):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    
    models = {
        "SVM": SVC(kernel='linear', probability=True),
        "Random Forest": RandomForestClassifier(n_estimators=100),
        "Decision Tree": DecisionTreeClassifier(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(n_neighbors=5)
    }
    
    for name, model in models.items():
        print(f"\n=== {name} ===")
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))
    

The results were similar across algorithms; Random Forest performed best and was chosen for the final model.
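The notebook does not state which metric decided the choice. One way to compare classifiers with a single number is the macro-averaged F1 score, sketched below without external libraries. The predictions are hypothetical and are not results from this notebook; `f1_for_class` and `macro_f1` are illustrative helper names.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1 score treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 0, 0, 0, 1],
    "model_b": [1, 1, 1, 0, 0, 1],
}
scores = {name: macro_f1(y_true, y_pred) for name, y_pred in predictions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because the dataset is balanced, accuracy and macro F1 should rank models similarly, but macro F1 makes a weak class visible.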

In [None]:
if (use_word2vec):
    import joblib

    modelo = RandomForestClassifier(n_estimators=100)
    modelo.fit(X_train, y_train)
    y_pred = modelo.predict(X_test)

    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

    joblib.dump(modelo, "../modelos/classificador_word2vec.pkl")

### Prediction of sarcasm using the generated model

In [None]:
if (use_word2vec):
    from src.preprocessamento import pre_processamento_frase
    import pandas as pd
    import joblib
    import subprocess

    
    # Loads trained classifier (SVM, Random Forest etc.)
    classificador = joblib.load("../modelos/classificador_word2vec.pkl")
    
    frase = input("Enter a text for analysis: ")
    
    tokens = pre_processamento_frase(frase)
    
    # Saves the tokens in a temporary CSV file
    pd.DataFrame({"tokens": [tokens]}).to_csv("../temp/frase_processada.csv", index=False)
    
    # Run Word2Vec in the processed phrase
    subprocess.run([
            "conda", "run", "-n", "word2vec_env", "python",
            "../src/word2vec_runner.py", "1"
        ], check=True, capture_output=True)
    
    # Load the CSV file embeddings
    vetor = pd.read_csv("../temp/vetor_word2vec.csv", header=None).values
    

    # Prediction
    pred = classificador.predict(vetor)
    prob = classificador.predict_proba(vetor)[0]
    
    if pred[0] == 1:
        print(f"Sarcasm detected (confidence: {prob[1]:.2f})")
    else:
        print(f"Sarcasm not detected (confidence: {prob[0]:.2f})")

## Second approach to detection: Fine-tuning a Sentence Transformer model

The second approach consists of choosing a Transformers language model and fine-tuning it for our goal.

> "Finetuning Sentence Transformer models often heavily improves the performance of the model on your use case, because each task requires a different notion of similarity." Source: https://sbert.net/docs/sentence_transformer/training_overview.html

Before fine-tuning, the dataset must match the format expected by the loss function.
> "It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format)"

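To illustrate what matching the dataset format to the loss function can look like, the sketch below reshapes labeled texts into sentence pairs with a binary label. This is the input shape that pair-based losses such as `ContrastiveLoss` in sentence-transformers expect. Which loss the fine-tuning notebook actually uses is not stated here, so the pairing scheme, the `build_pairs` helper, and the placeholder texts are assumptions.

```python
from itertools import combinations

def build_pairs(samples):
    """Turn (text, class) samples into (text_a, text_b, label) pairs.

    label is 1 when both texts share the same class and 0 otherwise.
    """
    pairs = []
    for (text_a, class_a), (text_b, class_b) in combinations(samples, 2):
        pairs.append((text_a, text_b, int(class_a == class_b)))
    return pairs

# Hypothetical labeled texts (1 = sarcastic, 0 = non-sarcastic)
samples = [
    ("placeholder sarcastic text A", 1),
    ("placeholder sarcastic text B", 1),
    ("placeholder factual text C", 0),
]

for pair in build_pairs(samples):
    print(pair)
```

A loss that takes single sentences with a class label would instead consume the `samples` list directly, which is why the loss and the dataset layout have to be chosen together.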
For short texts (such as headlines), Word2Vec works well. For long texts (such as full news articles), transformers like BERT may be more effective.

The Sentence Transformer model was selected according to the following criteria:
- Trained or adapted for pt-BR
- Fine-tuned for sentence similarity or feature extraction
- Preferably trained on news
- Uses an encoder architecture compatible with sentence-transformers


Based on these criteria, the model sentence-transformers/xlm-r-bert-base-nli-stsb-mean-tokens was chosen.

To fine-tune, upload and run the notebook [_fine_tuning.ipynb_](fine_tuning.ipynb) in the Google Colab environment. Follow all instructions in the notebook.

After execution, save the trained model in the 'modelos' folder of this repository. Don’t forget to unzip the .zip file before continuing to run this notebook.

Loads the model

In [None]:
import os
import joblib
from sentence_transformers import SentenceTransformer

if use_sequence_transformer:

    # Paths of the saved files
    MODEL_DIR = "../modelos/modelo_finetunado_sarcasmo"
    # No leading slash: an absolute second argument would make os.path.join discard MODEL_DIR
    CLASSIFIER_PATH = os.path.join(MODEL_DIR, "classificador_logreg.pkl")

    def load_model():
        if not os.path.exists(MODEL_DIR):
            raise FileNotFoundError(f"Directory '{MODEL_DIR}' not found.")
        if not os.path.exists(CLASSIFIER_PATH):
            raise FileNotFoundError(f"Classifier '{CLASSIFIER_PATH}' not found.")

        print("Loading model and classifier...")
        model = SentenceTransformer(MODEL_DIR)
        classifier = joblib.load(CLASSIFIER_PATH)
        return model, classifier


    model, classifier = load_model()

Prediction of sarcasm using the generated model
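The prediction step combines an embedding, a probabilistic classifier, and a decision threshold. The classifier file name suggests logistic regression, so the sketch below shows how such a model maps an embedding to a sarcasm probability and how the threshold turns it into a label. The 3-dimensional embedding, the weights, and the bias are hypothetical values, and the helper names are illustrative.

```python
import math

def sarcasm_probability(embedding, weights, bias):
    """Logistic regression score: sigmoid of the weighted sum of the embedding."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def decide(prob, threshold=0.5):
    """Map the sarcasm probability to a label using the decision threshold."""
    return "Sarcasm detected" if prob >= threshold else "Sarcasm not detected"

# Hypothetical embedding and coefficients, for illustration only
embedding = [0.5, -1.0, 2.0]
weights = [1.0, 0.5, 0.25]
bias = 0.5

prob = sarcasm_probability(embedding, weights, bias)
print(decide(prob), round(prob, 2))
print(decide(prob, threshold=0.8))
```

Raising the threshold trades recall for precision: fewer texts are flagged, but those that are flagged carry a higher sarcasm probability.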

In [None]:
import os
import joblib
from sentence_transformers import SentenceTransformer
import numpy

if (use_sequence_transformer):

    def predict_sarcasm(text, model, classifier, threshold=0.5):
        embedding = model.encode([text], convert_to_tensor=True).cpu().tolist()
        prob = classifier.predict_proba(embedding)[0][1]  # Probability of sarcasm

        if prob >= threshold:
            return "Sarcasm detected", prob
        else:
            return "Sarcasm not detected", prob


    print("\nType a text to detect sarcasm:")

    text = input("\n> ")

    if len(text.strip()) == 0:
        print("Empty sentence. Try again.")
    else:
        # Only classify non-empty input; prob is the probability of sarcasm
        result, prob = predict_sarcasm(text, model, classifier)
        print(f"{result} (sarcasm probability: {prob:.2f})")

# Part 2: Ambiguity Detection

## Sentence rewriting

In [None]:
import subprocess

def checkAmbiguityWord(word, context):
    result = subprocess.run(
        [
            "conda", "run", "-n", "ambiguidade", "python",
            "../src/ambiguidade.py", context, word
        ],
        check=True,
        capture_output=True,
        text=True  # To already return string instead of bytes
    )

    if (result.stdout.strip() == 'None'):
        return [False, ""]

    return [True, result.stdout.strip()]

def checkSarcasmSentence(sentence):
    # An empty sentence cannot be classified, so it is treated as non-sarcastic
    if len(sentence.strip()) == 0:
        print("[ERROR] Empty sentence. Try again.")
        return False

    print('Sentence: ', sentence)
    result, prob = predict_sarcasm(sentence, model, classifier)
    print('Result: ', result)

    return result.strip() == 'Sarcasm detected'

In [None]:
import google.generativeai as genai
API_KEY = ''  # Set your Gemini API key here
genai.configure(api_key = API_KEY)
# Named gemini_model so it does not overwrite the SentenceTransformer `model`
# that checkSarcasmSentence passes to predict_sarcasm
gemini_model = genai.GenerativeModel("gemini-2.5-flash")

from src.reescrita import frases
from src.reescrita import palavras
from src.reescrita import gerarPrompt
#from scripts.reescrita import gerar_texto_com_lmstudio

from src.avaliacao import Avaliacao

# --- Main Program Loop ---

avaliacao = Avaliacao()

while True:
    print("\n------------------------------------------------------\n")
    print("Enter a text for analysis and rewrite (-1 to finish):")
    original_text = input()

    if original_text == "-1":
        break

    # 1. Identification of problematic elements (sarcasm and ambiguity)
    ambiguous_words_per_sentence = {}
    sarcastic_sentences = []

    sentences_list = frases(original_text)

    for sentence in sentences_list:
        if checkSarcasmSentence(sentence):
            sarcastic_sentences.append(sentence)

        sentence_words_list = palavras(sentence)
        ambiguous_words_sentence = []
        for sentence_word in sentence_words_list:
            ambiguity_result = checkAmbiguityWord(sentence_word, sentence)
            if ambiguity_result[0]:
                ambiguous_words_sentence.append((sentence_word, ambiguity_result[1]))

        # Key the ambiguous words by the sentence they were found in
        if ambiguous_words_sentence:
            ambiguous_words_per_sentence[sentence] = ambiguous_words_sentence

    print("\n--- Detected Items for the Prompt ---")
    print(f"Sarcastic sentences: {sarcastic_sentences}")
    print(f"Ambiguous words per sentence : {ambiguous_words_per_sentence}")
    print("------------------------------------")

    # 2. Optimized Prompt Generation
    final_prompt = gerarPrompt(original_text, sarcastic_sentences, ambiguous_words_per_sentence)

    # 3. Generation of the Treated Text by the LLM (GEMINI)
    print("\n--- Generating text with Gemini (Hard-coded model: gemini-2.5-flash) ---")
    rewritten_text = gemini_model.generate_content(final_prompt).text

    if rewritten_text:
        if rewritten_text.strip().startswith("TEXTO REESCRITO:"):
            rewritten_text = rewritten_text.strip()[len("TEXTO REESCRITO:"):].strip()
        
        if rewritten_text.startswith('"') and rewritten_text.endswith('"'):
            rewritten_text = rewritten_text[1:-1].strip()

        print("\n--- REWRITTEN TEXT ---")
        print(rewritten_text)
        print("-----------------------")

        print(avaliacao.avaliarReescrita(original_text, rewritten_text))
    else:
        print("\nCould not generate the rewritten text.")
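The clean-up applied to the LLM response in the loop above can also be expressed as a small standalone function, which makes it easier to test in isolation. This is a sketch that mirrors the inline logic; the `clean_rewritten_text` name and the sample response are hypothetical.

```python
def clean_rewritten_text(raw, prefix="TEXTO REESCRITO:"):
    """Strip the optional prefix and one pair of surrounding double quotes."""
    text = raw.strip()
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    # Require at least two characters so a lone quote is not sliced away
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1].strip()
    return text

print(clean_rewritten_text('TEXTO REESCRITO: "O governo anunciou a medida."'))
```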