# Sarcasm Detection

The objective of this notebook is to develop an automatic sarcasm detection module for texts in Portuguese, specifically news articles. The dataset is composed of news articles from three major Brazilian websites and contains both sarcastic and non-sarcastic news. The module is built by fine-tuning a multilingual Transformer model.

## Fine-tuning a Sentence Transformer model

The approach consists of choosing a pre-trained Transformers language model and fine-tuning it for our task.

"Finetuning Sentence Transformer models often heavily improves the performance of the model on your use case, because each task requires a different notion of similarity."
Source: https://sbert.net/docs/sentence_transformer/training_overview.html

Before fine-tuning, the dataset must be in the format that the loss function expects:
"It is important that your dataset format matches your loss function (or that you choose a loss function that matches your dataset format)."

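As a rough illustration of what "matching" means here: in recent Sentence Transformers versions, the trainer treats a column named `label` or `score` as the target and every other column as an input, in order. `SoftmaxLoss`, which this notebook uses later, expects a pair of text inputs plus an integer class label. The sketch below only mimics that column convention in plain Python; `split_columns` is a hypothetical helper, not part of the library.

```python
def split_columns(column_names):
    """Hypothetical helper mirroring the documented column convention."""
    label_names = {"label", "score"}
    # Every column that is not a label is an input; order is preserved
    inputs = [c for c in column_names if c not in label_names]
    labels = [c for c in column_names if c in label_names]
    return inputs, labels

# Column layout of the training dataset built later in this notebook
inputs, labels = split_columns(["label", "headline", "text"])
print(inputs, labels)  # ['headline', 'text'] ['label']
```

With this layout, the two inputs are the headline and the article text, which is the pair shape that `SoftmaxLoss` requires.
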
For short texts such as headlines, Word2Vec can work well. For long texts such as full news articles, transformer models like BERT may be more effective.

Criteria for choosing a Sentence Transformer model:
- Trained or adapted for pt-BR
- Fine-tuned for sentence similarity or feature extraction
- Preferably trained on news
- Uses an encoder architecture compatible with sentence-transformers


Thus, the model `sentence-transformers/xlm-r-bert-base-nli-stsb-mean-tokens` was chosen.

#### Installs and imports the required modules

In [1]:
!pip install -q torch pandas "sentence-transformers==5.0.0" scikit-learn datasets "accelerate>=0.26.0" nltk spacy joblib numpy

In [2]:
!python -m spacy download pt_core_news_sm

Collecting pt-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.8.0/pt_core_news_sm-3.8.0-py3-none-any.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 38.4 MB/s eta 0:00:00
Installing collected packages: pt-core-news-sm
Successfully installed pt-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Dataset

### Description of the dataset structure and characteristics

The dataset was taken from the [PLNCrawler repository](https://github.com/schuberty/PLNCrawler) and is originally structured as three JSON files, one for each news site from which the news were extracted:
- Sensacionalista: 5006 sarcastic news
- Estadão: 11272 non-sarcastic news
- Revista Piauí (Herald section): 2216 sarcastic news

Each file has the following fields for each news item:
- is_sarcastic (or is_sarcasm): boolean, the label of the news item (sarcastic or not)
- article_link: string, the URL from which the news item was extracted
- headline: string, the news title
- text: string, the news text
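To make the field layout concrete, here is a minimal sketch that parses one record and unifies the two label names. It assumes the files are in JSON Lines format (one JSON object per line), which is how the notebook reads them with `lines=True`. The sample record and the helper `normalize_record` are invented for illustration and are not taken from the dataset.

```python
import json

def normalize_record(line: str) -> dict:
    """Hypothetical helper: parse one JSON line and unify the label field name."""
    record = json.loads(line)
    # One source uses 'is_sarcasm'; rename it to match the others
    if "is_sarcasm" in record:
        record["is_sarcastic"] = record.pop("is_sarcasm")
    return record

# Made-up record with the four fields described above
sample_line = (
    '{"is_sarcasm": true, "article_link": "https://example.com/a", '
    '"headline": "Exemplo", "text": "Texto de exemplo"}'
)
record = normalize_record(sample_line)
print(sorted(record))
```
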


#### Clones the datasets source repository

In [3]:
!git clone https://github.com/schuberty/PLNCrawler.git

Cloning into 'PLNCrawler'...
remote: Enumerating objects: 268, done.
remote: Counting objects: 100% (268/268), done.
remote: Compressing objects: 100% (177/177), done.
remote: Total 268 (delta 147), reused 202 (delta 88), pack-reused 0 (from 0)
Receiving objects: 100% (268/268), 6.49 MiB | 5.64 MiB/s, done.
Resolving deltas: 100% (147/147), done.


#### Loads the datasets of each site in DataFrame format

In [4]:
import os
import pandas as pd

FILES_DIRECTORY = os.path.join('PLNCrawler', 'datasets')

def get_df_sensacionalista():
    """Returns Sensacionalista news as a pandas DataFrame."""
    return pd.read_json(os.path.join(FILES_DIRECTORY, 'sensacionalista.json'), lines=True)

def get_df_estadao():
    """Returns Estadão news as a pandas DataFrame."""
    return pd.read_json(os.path.join(FILES_DIRECTORY, 'estadao.json'), lines=True)

def get_df_the_piaui_herald():
    """Returns The Piauí Herald news as a pandas DataFrame."""
    return pd.read_json(os.path.join(FILES_DIRECTORY, 'the_piaui_herald.json'), lines=True)

# Load the file in a DataFrame
df_sensacionalista = get_df_sensacionalista()
df_estadao = get_df_estadao()
df_piaui = get_df_the_piaui_herald()
df_piaui = df_piaui.rename(columns={'is_sarcasm': 'is_sarcastic'}) # Rename the column to equalize with other DataFrames

display(df_sensacionalista)
display(df_estadao)
display(df_piaui)

Unnamed: 0,is_sarcastic,article_link,headline,text
0,True,https://www.sensacionalista.com.br/2020/10/15/...,10 desculpas para o dinheiro entre as nádegas ...,"O vice-líder do governo Bolsonaro, o senador C..."
1,True,https://www.sensacionalista.com.br/2020/10/14/...,"Fora Bolsonaro, ninguém gostou da advertência ...",A jogadora de vôlei de praia Carol Solberg foi...
2,True,https://www.sensacionalista.com.br/2020/10/10/...,Bolsonaro diz que a corrupção acabou mas amanh...,O presidente Jair Bolsonaro surpreendeu todo o...
3,True,https://www.sensacionalista.com.br/2020/10/10/...,Homem machuca o cérebro tentando entender fala...,Boi bombeiro. Boi. Bombeiro. BOI BOMBEIRO. bOi...
4,True,https://www.sensacionalista.com.br/2020/10/08/...,Checamos: Bolsonaro tem 89 mil motivos para di...,O presidente Jair Bolsonaro surpreendeu todo o...
...,...,...,...,...
5001,True,https://www.sensacionalista.com.br/2009/05/08/...,Gripe suína chega ao Brasil e é assaltada em C...,"Mais tarde, já relaxada, a epidemia almoçou na..."
5002,True,https://www.sensacionalista.com.br/2009/05/05/...,Casamento terá mesma lei do Código de Defesa d...,"“Quando você compra um produto, pode trocar. U..."
5003,True,https://www.sensacionalista.com.br/2009/05/04/...,Saci passa para medicina pelo sistema de cotas,Saci rebateu as críticas de que o sistema de c...
5004,True,https://www.sensacionalista.com.br/2009/05/01/...,Táxis do Rio terão bandeira 3 para áreas viole...,A medida foi acertada entre a prefeitura do Ri...


Unnamed: 0,is_sarcastic,article_link,headline,text
0,False,https://politica.estadao.com.br/blogs/fausto-m...,PF abre inquérito para investigar negócios do ...,"A Polícia Federal abriu nesta segunda-feira, 1..."
1,False,https://politica.estadao.com.br/blogs/fausto-m...,Marco Aurélio adota rito abreviado e manda açã...,"O ministro Marco Aurélio Mello, do Supremo Tri..."
2,False,https://politica.estadao.com.br/blogs/fausto-m...,PF prende quatro no Aeroporto de Guarulhos com...,A Polícia Federal prendeu na noite desta segun...
3,False,https://politica.estadao.com.br/blogs/fausto-m...,Entenda o que está em jogo com os recursos de ...,"Caso conceda nesta terça-feira, 16, decisões f..."
4,False,https://politica.estadao.com.br/blogs/fausto-m...,Existe uma terceira via?,Todo extremismo parece perigoso. Conduz ao fan...
...,...,...,...,...
11267,False,https://politica.estadao.com.br/blogs/fausto-m...,Ministério Público obtém acordo entre grupos a...,O Ministério Público de São Paulo (MP-SP) cons...
11268,False,https://politica.estadao.com.br/blogs/fausto-m...,PF pega R$ 750 mil em caixa térmica na casa do...,A Polícia Federal apreendeu quase R$ 750 mil n...
11269,False,https://politica.estadao.com.br/blogs/fausto-m...,Ninguém ouviu,Homens negros nascem em sua maioria nas regiõe...
11270,False,https://politica.estadao.com.br/blogs/fausto-m...,Uma aventura jurídica,"Segundo o Correio Braziliense, em seu site no ..."


Unnamed: 0,is_sarcastic,article_link,headline,text
0,True,https://piaui.folha.uol.com.br/herald/2014/10/...,Petição exige o impeachment de Lula,"BRAZIL – Centenas de cidadãos de bem, que prod..."
1,True,https://piaui.folha.uol.com.br/herald/2011/04/...,"Reforma política sai antes da Olimpíada, garan...",SÃO LUÍS – O presidente do Senado José Sarney ...
2,True,https://piaui.folha.uol.com.br/herald/2011/04/...,Papa barra canonização de José Alencar,VATICANO – O papa Bento XVI protestou ontem co...
3,True,https://piaui.folha.uol.com.br/herald/2012/07/...,PIB brasileiro cresce a taxas mais elevadas qu...,SÃO BERNARDO – Pesquisadores da CUT cruzaram v...
4,True,https://piaui.folha.uol.com.br/herald/2021/02/...,Banco Mundial teme receber mais um membro do g...,"FREAKONOMICS – Pânico nas Bolsas de Nova York,..."
...,...,...,...,...
2211,True,https://piaui.folha.uol.com.br/herald/2018/05/...,"Após pacificar Coreias, Kim Jong-un quer unifi...",LÍNGUA DO K – “Será o animal político mais sex...
2212,True,https://piaui.folha.uol.com.br/herald/2013/06/...,Neymar cai cinco vezes no gramado em apresenta...,CAMP NOU – Em cerimônia que reuniu globos da m...
2213,True,https://piaui.folha.uol.com.br/herald/2017/01/...,Temer indica Rubens Barrichello como novo rela...,INTERLAGOS – Comprometido em dar celeridade às...
2214,True,https://piaui.folha.uol.com.br/herald/2013/06/...,Casa Branca investiga a função de Hulk na seleção,PENTÁGONO – Após ouvir centenas de conversas e...


#### Merges the datasets

In [5]:
import pandas as pd

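def merge_dfs(df_sensacionalista: pd.DataFrame, df_estadao: pd.DataFrame, df_piaui: pd.DataFrame) -> pd.DataFrame:
    """
    Merges the DataFrames from the three news sites into a class-balanced dataset.

    All sarcastic samples (Sensacionalista and The Piauí Herald) are kept, and the
    same number of non-sarcastic samples is drawn at random from Estadão.

    Returns:
    - pd.DataFrame: Unified DataFrame.
    """
    num_sarcastic_samples = len(df_sensacionalista) + len(df_piaui)

    # Downsample Estadão so that both classes have the same number of samples
    df_estadao_sampled = df_estadao.sample(num_sarcastic_samples, random_state=42)

    return pd.concat([df_sensacionalista, df_piaui, df_estadao_sampled], ignore_index=True)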

df = merge_dfs(df_sensacionalista, df_estadao, df_piaui)

num_sarcastic = df['is_sarcastic'].sum()

print(f'Number of sarcastic samples: {num_sarcastic}')
print(f'Number of non-sarcastic samples: {len(df) - num_sarcastic}')

display(df)

Number of sarcastic samples: 7222
Number of non-sarcastic samples: 7222


Unnamed: 0,is_sarcastic,article_link,headline,text
0,True,https://www.sensacionalista.com.br/2020/10/15/...,10 desculpas para o dinheiro entre as nádegas ...,"O vice-líder do governo Bolsonaro, o senador C..."
1,True,https://www.sensacionalista.com.br/2020/10/14/...,"Fora Bolsonaro, ninguém gostou da advertência ...",A jogadora de vôlei de praia Carol Solberg foi...
2,True,https://www.sensacionalista.com.br/2020/10/10/...,Bolsonaro diz que a corrupção acabou mas amanh...,O presidente Jair Bolsonaro surpreendeu todo o...
3,True,https://www.sensacionalista.com.br/2020/10/10/...,Homem machuca o cérebro tentando entender fala...,Boi bombeiro. Boi. Bombeiro. BOI BOMBEIRO. bOi...
4,True,https://www.sensacionalista.com.br/2020/10/08/...,Checamos: Bolsonaro tem 89 mil motivos para di...,O presidente Jair Bolsonaro surpreendeu todo o...
...,...,...,...,...
14439,False,https://politica.estadao.com.br/blogs/fausto-m...,A precificação de carbono como ferramenta para...,A precificação do carbono é um assunto que vem...
14440,False,https://politica.estadao.com.br/noticias/geral...,Doria já admite disputar reeleição em São Paul...,Antes refratário à ideia de disputar a reeleiç...
14441,False,https://politica.estadao.com.br/noticias/geral...,"'Deixa o cara governar, pô!', afirma Mourão",“Não vejo hoje que haja condição de prosperar ...
14442,False,https://politica.estadao.com.br/blogs/fausto-m...,Logística: um dos poucos setores que se adapta...,O cenário no mercado de logística segue otimis...
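
The balancing step in `merge_dfs` amounts to downsampling the majority class. The sketch below reproduces the arithmetic with the standard library only, using the per-site counts from the dataset description. It samples positional indices as a stand-in for the Estadão rows, so the selected rows will not be the same ones that pandas picks with `random_state=42`.

```python
import random

# Class sizes reported in the dataset description
n_sensacionalista, n_piaui, n_estadao = 5006, 2216, 11272

# All sarcastic samples are kept
n_sarcastic = n_sensacionalista + n_piaui

# Draw the same number of non-sarcastic samples, without replacement
rng = random.Random(42)
kept_estadao = rng.sample(range(n_estadao), k=n_sarcastic)

print(n_sarcastic, len(kept_estadao))  # 7222 7222
```
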


## Pre-processing

It is important to note that some linguistic features removed or normalized by traditional pre-processing steps can influence the classification of irony in texts.
For example, punctuation marks may signal irony, so whether to remove them is a pre-processing parameter.

For this reason, the pre-processing function accepts optional parameters that enable or disable each transformation.
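
To see what the punctuation option changes in practice, the sketch below applies the two regular expressions used by `clean_text` to a made-up sentence. The sample text is an assumption chosen for illustration.

```python
import re

# Made-up sentence with quotes and expressive punctuation
sample = 'Que "ótima" ideia, hein?!'

# Drop everything except word characters and whitespace
without_punct = re.sub(r'[^\w\s]', '', sample)

# Additionally keep . , ! ? : ;
with_punct = re.sub(r'[^\w\s\.\,\!\?\:\;]', '', sample)

print(without_punct)  # Que ótima ideia hein
print(with_punct)     # Que ótima ideia, hein?!
```

In both cases the quotation marks are removed; only the second variant preserves the `?!` that may hint at an ironic tone.
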

### Stemming and lemmatization

> "Stemming or lemmatization reduces words to their root form (e.g., "running" becomes "run"), making it easier to analyze language by grouping different forms of the same word." Source: https://www.ibm.com/think/topics/natural-language-processing

Stemming and lemmatization are both optional, but they should never be applied together because they serve the same purpose through different approaches.
Thus, if both are enabled, only **lemmatization** is applied (because it is more semantically informed).
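
The precedence rule can be written as a small decision function. This is only a sketch of the rule as stated above; `choose_normalizer` is a hypothetical name and is not used elsewhere in the notebook.

```python
from typing import Optional

def choose_normalizer(use_stemming: bool, use_lemmatization: bool) -> Optional[str]:
    """Hypothetical helper: decide which normalization step to run, if any."""
    if use_lemmatization:
        # Lemmatization wins whenever it is enabled, even if stemming is too
        return "lemmatization"
    if use_stemming:
        return "stemming"
    return None

print(choose_normalizer(use_stemming=True, use_lemmatization=True))   # lemmatization
print(choose_normalizer(use_stemming=True, use_lemmatization=False))  # stemming
```
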

### Sources for pre-processing:

1. [Key Guidelines](https://github.com/sharadpatell/Text_preprocessing_steps_for_NLP/blob/main/Text_preprocessing_steps_for_NLP.ipynb) which assisted in the step-by-step pre-processing.
2. FACELI, K. et al. Artificial Intelligence: An Approach to Machine Learning. 2nd ed.


#### Defines Pre-processing functions

In [6]:
import re

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import word_tokenize


def clean_text(text, keep_punctuation=False) -> str:
    """
    Cleans a text by removing non-textual characters, with option to keep punctuation.
    If `keep_punctuation` is True, keeps common punctuations like !, ?, ,, ., :, ;.

    Parameters:
        text (str): Text to be processed.
        keep_punctuation (bool): If True, keeps punctuations like !, ?, ., ,, :, ;. Default is False.

    Returns:
        str: Clean text, with or without punctuation, depending on the parameter.
    """
    if not isinstance(text, str):
        return text
    if keep_punctuation:
        # Keep word characters, whitespace and common punctuation (! ? , . : ;)
        return re.sub(r'[^\w\s\.\,\!\?\:\;]', '', text)
    else:
        # Remove everything that is not a word character (letter, digit, underscore) or whitespace (\s)
        return re.sub(r'[^\w\s]', '', text)

def tokenize_text(text):
    """
    Tokenizes a string into words, using NLTK.
    """
    try:
        # To use word_tokenize(text) from NLTK
        nltk.download('punkt_tab', quiet=True)
    except Exception as e:
        print(f"Error trying to download tokenizer 'punkt_tab': {e}")

    if not isinstance(text, str):
        return text
    return word_tokenize(text)

def remove_stopwords(tokens):
    """
    Removes Portuguese stop words from a list of tokens, using NLTK.
    """
    try:
        nltk.download('stopwords', quiet=True)
        stopwords_pt = set(stopwords.words('portuguese'))
    except Exception as e:
        print(f"Error loading stopwords: {e}")
        stopwords_pt = set()  # Fallback: empty set

    if not isinstance(tokens, list):
        return tokens
    return [t for t in tokens if t not in stopwords_pt]

def apply_stemming(tokens):
    """
    Reduces each token to its stem, using NLTK's RSLP stemmer for Portuguese.
    """
    try:
        # Download necessary data for Portuguese stemmer to work
        nltk.download('rslp', quiet=True)
    except Exception as e:
        print(f"Error downloading RSLP stemmer: {e}")

    stemmer = RSLPStemmer()
    return [stemmer.stem(t) for t in tokens]

def apply_lemmatization(tokens, nlp):
    """
    Replaces each token with its lemma, using a loaded spaCy pipeline (`nlp`).
    The tokens are joined into a single string and re-tokenized by spaCy.
    """
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc]



#### Apply pre-processing in the dataset

In [7]:
def preprocessing(df,
                  keep_punctuation: bool = False,
                  apply_tokenization: bool = True,
                  use_stemming: bool = False,
                  use_lemmatization: bool = False):
    # Convert strings to lowercase letters
    df = df.map(lambda x: x.lower() if isinstance(x, str) else x)

    # Remove article_link (same information as 'headline' attribute)
    df.drop('article_link', axis=1, inplace=True)

    # Remove non-word characters (punctuation is kept only if keep_punctuation is True)
    df = df.map(lambda x: clean_text(x, keep_punctuation))

    # Remove numbers
    df = df.map(lambda x: re.sub(r'\d+', '', x) if isinstance(x, str) else x)

    # Tokenize text
    if apply_tokenization:
        df = df.map(lambda x: tokenize_text(x) if isinstance(x, str) else x) # Ensure tokenize_text is called only on strings

    # Remove stop words (Portuguese)
    df = df.map(lambda x: remove_stopwords(x) if isinstance(x, list) else x)

    # Apply lemmatization or stemming (lemmatization takes precedence if both are enabled)
    if use_lemmatization:
        nlp = spacy.load("pt_core_news_sm")  # lightweight model for Portuguese
        df = df.map(lambda x: apply_lemmatization(x, nlp) if isinstance(x, list) else x)
    elif use_stemming:
        df = df.map(lambda x: apply_stemming(x) if isinstance(x, list) else x)

    # Transform label from True and False to 1 and 0
    df["is_sarcastic"] = df["is_sarcastic"].astype(int)

    # display(df)
    return df

use_lemmatization = True
use_stemming      = False

df = preprocessing(df, use_stemming = use_stemming, use_lemmatization = use_lemmatization)

display(df)

Unnamed: 0,is_sarcastic,headline,text
0,1,"[desculpa, dinheiro, nádega, vicelíder, govern...","[vicelíder, governo, bolsonaro, senador, chico..."
1,1,"[bolsonaro, ninguém, gostar, advertência, stjd...","[jogador, vôlei, praia, carol, solberg, advert..."
2,1,"[bolsonaro, dizer, corrupção, acabar, amanhã]","[presidente, jair, Bolsonaro, surpreender, tod..."
3,1,"[homem, Machuca, cérebro, tentar, entender, fa...","[boi, bombeiro, boi, bombeiro, boi, bombeiro, ..."
4,1,"[checa, bolsonaro, mil, motivo, dizer, acabar,...","[presidente, jair, Bolsonaro, surpreender, tod..."
...,...,...,...
14439,0,"[precificação, Carbono, ferramentar, combate, ...","[precificação, Carbono, assunto, vir, ser, dis..."
14440,0,"[dorio, admitir, disputar, reeleição, paulo, v...","[antes, refratário, ideia, disputar, reeleição..."
14441,0,"[deixar, cara, governar, pô, afirmar, mour]","[vejo, hoje, condiçãor, prosperar, qualquer, p..."
14442,0,"[logístico, pouco, setor, adaptar, rapidamente...","[cenário, mercado, logístico, segar, otimista,..."


## Fine-tuning of the pre-trained model

In [8]:
import sys
import os
import torch
import pandas as pd
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from datasets import Dataset
import joblib

#### Pretrained model loading


In [9]:
model = SentenceTransformer("sentence-transformers/xlm-r-bert-base-nli-stsb-mean-tokens")
model

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

#### Preparing dataset for fine-tuning

In [10]:
df = df.dropna(subset=['text', 'is_sarcastic'])
df['text'] = df['text'].apply(lambda tokens: " ".join(tokens) if isinstance(tokens, list) else str(tokens))
df['headline'] = df['headline'].apply(lambda tokens: " ".join(tokens) if isinstance(tokens, list) else str(tokens))
df['is_sarcastic'] = df['is_sarcastic'].astype(int)

# Optional text size limit: keeps only the first 512 characters (not tokens) of each text
df['text'] = df['text'].apply(lambda x: x[:512])

train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['is_sarcastic'], random_state=42)




In [11]:
# 'is_sarcastic' becomes 'label'; the remaining columns ('headline', 'text') are the inputs, in order
train_ds = Dataset.from_pandas(train_df.rename(columns={'is_sarcastic': 'label'}), preserve_index=False)
train_ds

Dataset({
    features: ['label', 'headline', 'text'],
    num_rows: 11555
})

In [12]:
eval_ds = Dataset.from_pandas(test_df.rename(columns={'is_sarcastic': 'label'}), preserve_index=False)
eval_ds

Dataset({
    features: ['label', 'headline', 'text'],
    num_rows: 2889
})

#### Defines training parameters

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = predictions.argmax(axis=1)
    return {"accuracy": accuracy_score(labels, preds)}

training_args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_model_sarcasm",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    logging_steps=10,
    save_total_limit=1,
    learning_rate=2e-5,
    warmup_steps=10,
    fp16=False,
    report_to="none"
)

train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=2
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss=train_loss,
    compute_metrics=compute_metrics
)




#### Trains the model

In [14]:
trainer.train()

Step,Training Loss
10,0.6958
20,0.5473
30,0.3399
40,0.3074
50,0.1945
60,0.3335
70,0.2037
80,0.1186
90,0.1785
100,0.2099


TrainOutput(global_step=2892, training_loss=0.05582161651504139, metrics={'train_runtime': 2127.1796, 'train_samples_per_second': 21.728, 'train_steps_per_second': 1.36, 'total_flos': 0.0, 'train_loss': 0.05582161651504139, 'epoch': 4.0})
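
The reported `global_step=2892` can be checked against the training configuration. The sketch below assumes a single device and no gradient accumulation, which is consistent with the arguments shown above but is not stated explicitly in the notebook.

```python
import math

train_rows = 11555  # num_rows of train_ds
batch_size = 16     # per_device_train_batch_size
epochs = 4          # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 723 2892
```
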

#### Trains the classifier

In [15]:
# Generating embeddings for the training and test sets
X_train = model.encode(train_df['text'].tolist(), convert_to_tensor=True).cpu().numpy()
X_test = model.encode(test_df['text'].tolist(), convert_to_tensor=True).cpu().numpy()
y_train = train_df['is_sarcastic'].values
y_test = test_df['is_sarcastic'].values

In [16]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [17]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision (weighted): {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall (weighted): {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1-Score (weighted): {f1_score(y_test, y_pred, average='weighted'):.4f}")


Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1445
           1       0.98      0.97      0.98      1444

    accuracy                           0.98      2889
   macro avg       0.98      0.98      0.98      2889
weighted avg       0.98      0.98      0.98      2889

Accuracy: 0.9758
Precision (weighted): 0.9758
Recall (weighted): 0.9758
F1-Score (weighted): 0.9758
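
As a quick reading of these numbers, the rounded accuracy and the test set size are enough to recover how many test samples were misclassified. This is plain arithmetic on the values printed above, not an additional measurement.

```python
test_rows = 2889   # support of the test set
accuracy = 0.9758  # reported accuracy, rounded to four decimals

# Only one integer count of correct predictions rounds to this accuracy
correct = round(accuracy * test_rows)
errors = test_rows - correct

print(correct, errors)  # 2819 70
```
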


## Saving trained model and classifier

In [18]:
model.save("finetuned_model_sarcasm")
joblib.dump(clf, os.path.join("finetuned_model_sarcasm", "classifier_logreg.pkl"))

['finetuned_model_sarcasm/classifier_logreg.pkl']

## Using the fine-tuned model to predict sarcasm in texts

In [19]:
def predict_sarcasm(text, model, classifier, threshold=0.5):
    """Returns a verdict and the predicted probability of sarcasm for a single text."""
    embedding = model.encode([text], convert_to_tensor=True).cpu().tolist()
    prob = classifier.predict_proba(embedding)[0][1]  # Probability of sarcasm

    if prob >= threshold:
        return "Sarcasm detected", prob
    else:
        return "Sarcasm not detected", prob


print("\nType a text to detect sarcasm:")

text = input("\n> ")

if len(text.strip()) == 0:
    print("Empty sentence. Try again.")
else:
    # Only run the prediction when the input is not empty
    result, prob = predict_sarcasm(text, model, clf)
    print(f"{result} (trust: {prob:.2f})")


Type a text to detect sarcasm:

> ytytuyi
Sarcasm detected (trust: 1.00)
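
Note that the training texts went through the pre-processing pipeline (lowercasing, character cleaning, digit removal, stop word removal and lemmatization), while `predict_sarcasm` encodes the raw input. This mismatch may contribute to unexpected predictions such as the one above. The sketch below is one possible way to apply the lightweight part of that pipeline at inference time; `normalize_for_inference` is a hypothetical helper, and it deliberately leaves out stop word removal and lemmatization, which need NLTK and spaCy.

```python
import re

def normalize_for_inference(text: str, keep_punctuation: bool = False) -> str:
    """Hypothetical helper: lowercase, strip non-word characters and digits."""
    text = text.lower()
    pattern = r'[^\w\s\.\,\!\?\:\;]' if keep_punctuation else r'[^\w\s]'
    text = re.sub(pattern, '', text)
    text = re.sub(r'\d+', '', text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

# Made-up input sentence
cleaned = normalize_for_inference("Governo anuncia 3 novas medidas: 'agora vai!'")
print(cleaned)  # governo anuncia novas medidas agora vai
```

The cleaned string would then be passed to `predict_sarcasm` in place of the raw text.
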


Downloads the model folder (this may take a few minutes)

In [20]:
from google.colab import files
import shutil
import os

folder_to_download = "finetuned_model_sarcasm"
zip_filename = f"{folder_to_download}.zip"

# Compresses the folder by creating the zip from the parent directory
shutil.make_archive(folder_to_download, 'zip', root_dir='.', base_dir=folder_to_download)

# Download the zipped file
files.download(zip_filename)

KeyboardInterrupt: 

After downloading, unzip the .zip file into the 'models' directory of the repository.
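
The unzip step can also be scripted. The sketch below uses only the standard library; `unpack_model` is a hypothetical helper, and the default `models` target directory is taken from the instruction above. It assumes the archive contains a single top-level folder named after the zip file, which is how the archive is created in the previous cell.

```python
import shutil
from pathlib import Path

def unpack_model(zip_path: str, target_dir: str = "models") -> Path:
    """Hypothetical helper: extract the downloaded archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(zip_path, extract_dir=target, format="zip")
    # The archive holds one top-level folder named like the zip file
    return target / Path(zip_path).stem
```

For example, `unpack_model("finetuned_model_sarcasm.zip")` would extract the files to `models/finetuned_model_sarcasm` and return that path.
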
