# Sarcasm and ambiguity detection

This notebook builds a system that automatically identifies possible cases of sarcasm and lexical ambiguity in Portuguese texts and rewrites those texts in a clearer, more objective way.


## Imports

This section imports the required libraries and adds the parent of the current working directory to `sys.path` so that the local `src` package can be imported.


In [None]:
import os
import subprocess
import sys

import google.generativeai as genai
import joblib
from sentence_transformers import SentenceTransformer

# Make the local `src` package importable from the parent of the working directory
sys.path.append(os.path.dirname(os.getcwd()))
from src.rewriting import sentences, words, generate_prompt
from src.evaluation import Evaluation

## Sarcasm detection

### Train model

To train the sarcasm detection model, run the [_fine_tuning.ipynb_](fine_tuning.ipynb) notebook in Google Colab and follow the instructions in it.

After it finishes, save the trained model in the `models` folder of this repository and unzip the .zip file there. Then continue running this notebook.
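
The unzip step can also be done from Python. A minimal sketch, assuming the archive exported from Colab is called `finetuned_model_sarcasm.zip` (a hypothetical name; use whatever file the fine-tuning notebook produced) and sits next to this notebook. `extract_model_archive` is an illustrative helper, not part of the repository:

```python
import zipfile
from pathlib import Path


def extract_model_archive(zip_path, dest_dir):
    """Extract a model archive into dest_dir and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Top-level names, useful to confirm the expected model folder is present
        return sorted({name.split("/")[0] for name in archive.namelist()})


# Hypothetical archive name; adjust it to the file exported by the fine-tuning notebook.
ARCHIVE = Path("finetuned_model_sarcasm.zip")
if ARCHIVE.exists():
    print(extract_model_archive(ARCHIVE, "../models"))
```

The returned list should contain the model folder that the loading step expects under `models`.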

### Load model

This section loads the fine-tuned sarcasm detection model and its classifier from the `models` directory. A SentenceTransformer produces the sentence embeddings, and a logistic regression classifier predicts sarcasm from them.

In [None]:
MODEL_DIR = "../models/finetuned_model_sarcasm"
CLASSIFIER_PATH = os.path.join(MODEL_DIR, "classifier_logreg.pkl")

def load_model():
    """Load the fine-tuned SentenceTransformer and the logistic regression classifier."""
    if not os.path.isdir(MODEL_DIR):
        raise FileNotFoundError(f"Directory '{MODEL_DIR}' not found.")
    if not os.path.isfile(CLASSIFIER_PATH):
        raise FileNotFoundError(f"Classifier '{CLASSIFIER_PATH}' not found.")

    print("Loading model and classifier...")
    model = SentenceTransformer(MODEL_DIR)
    classifier = joblib.load(CLASSIFIER_PATH)
    return model, classifier

model, classifier = load_model()

### Sarcasm prediction using the trained model

The function below embeds a sentence with the loaded model and asks the classifier for the probability of sarcasm. The sentence is flagged as sarcastic when that probability reaches `threshold` (0.5 by default). Empty input is reported and treated as not sarcastic.

In [None]:
def check_sarcasm_sentence(text, model, classifier, threshold=0.5):
    """Return True if the classifier's sarcasm probability for `text` is >= threshold."""
    if not text.strip():
        print("[ERROR] Empty sentence. Try again.")
        return False

    # encode() returns a NumPy array by default, which scikit-learn accepts directly
    embedding = model.encode([text])
    prob = classifier.predict_proba(embedding)[0][1]  # Probability of sarcasm

    return bool(prob >= threshold)
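
The decision rule in `check_sarcasm_sentence` can be tried out without the fine-tuned model. The sketch below restates the same thresholding logic with two stand-ins, `FakeEncoder` and `FakeClassifier`. Both are hypothetical and only mimic the `encode` and `predict_proba` calls used above; the probabilities they return are made up for illustration.

```python
class FakeEncoder:
    """Stand-in for SentenceTransformer: maps each text to a one-number 'embedding'."""

    def encode(self, texts):
        # Made-up feature: the number of exclamation marks in the text
        return [[float(t.count("!"))] for t in texts]


class FakeClassifier:
    """Stand-in for the logistic regression classifier."""

    def predict_proba(self, embeddings):
        # Made-up probabilities: more exclamation marks, higher 'sarcasm' score
        rows = []
        for (feature,) in embeddings:
            p = min(0.25 + 0.5 * feature, 1.0)
            rows.append([1.0 - p, p])
        return rows


def is_sarcastic(text, model, classifier, threshold=0.5):
    """Same decision rule as check_sarcasm_sentence, applied to the stand-ins."""
    if not text.strip():
        return False
    prob = classifier.predict_proba(model.encode([text]))[0][1]
    return prob >= threshold


encoder, clf = FakeEncoder(), FakeClassifier()
print(is_sarcastic("Que dia lindo!", encoder, clf))                 # p = 0.75, default threshold
print(is_sarcastic("Que dia lindo!", encoder, clf, threshold=0.8))  # same p, stricter threshold
print(is_sarcastic("   ", encoder, clf))                            # empty input
```

Raising `threshold` trades recall for precision: fewer sentences are flagged, but those that are flagged have a higher predicted probability.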

## Lexical ambiguity detection

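This section defines a function that checks whether a word has multiple meanings in a given context. It runs the external script `../src/ambiguity.py` inside the `ambiguity` conda environment and reads the result from the script's standard output.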

In [None]:
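def check_ambiguity_word(word, context):
    """Return [is_ambiguous, script_output] for `word` as used in `context`."""
    # check=True raises CalledProcessError if the script exits with a non-zero status
    result = subprocess.run(
        [
            "conda", "run", "-n", "ambiguity", "python",
            "../src/ambiguity.py", context, word
        ],
        check=True,
        capture_output=True,
        text=True  # Return stdout as str instead of bytes
    )

    output = result.stdout.strip()

    # The script prints 'None' when the word is not ambiguous
    if output == 'None':
        return [False, ""]

    return [True, output]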
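
Each call to `check_ambiguity_word` starts a new `conda run` process, so analysing a long text word by word can be slow. One option, sketched below, is to memoise results per `(word, context)` pair with `functools.lru_cache`. `slow_lookup` is a hypothetical stand-in for the subprocess call so that the example runs anywhere, and the rule it applies is made up.

```python
from functools import lru_cache

call_count = 0  # Counts how often the expensive lookup actually runs


def slow_lookup(word, context):
    """Hypothetical stand-in for check_ambiguity_word (no subprocess involved)."""
    global call_count
    call_count += 1
    # Made-up rule for illustration: only "banco" is treated as ambiguous
    if word.lower() == "banco":
        return (True, "made-up sense label")
    return (False, "")


@lru_cache(maxsize=None)
def cached_lookup(word, context):
    # Arguments must be hashable (strings are); results are cached as tuples
    return slow_lookup(word, context)


sentence = "Sentei no banco perto do banco"
for token in sentence.split():
    cached_lookup(token, sentence)

print(call_count)  # 5 lookups for 6 tokens, because "banco" appears twice
```

In the notebook, the cached function would wrap `check_ambiguity_word` instead of `slow_lookup`. Since that function returns lists, converting the result to a tuple before caching avoids sharing a mutable object between callers.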

## Text rewriting

This section uses Google's Gemini AI model to rewrite the original text, removing detected sarcasm and resolving lexical ambiguities to produce clearer and more objective content.


In [None]:
API_KEY = ''  # Paste your Gemini API key here before running this cell
genai.configure(api_key=API_KEY)
model_gemini = genai.GenerativeModel("gemini-2.5-flash")

# --- Main Program Loop ---

evaluation = Evaluation()

while True:
    print("\n------------------------------------------------------\n")
    print("Enter a text for analysis and rewrite (-1 to finish):")
    original_text = input()

    if original_text == "-1":
        break

    # 1. Identification of problematic elements (sarcasm and ambiguity)
    ambiguous_words_per_sentence = {}
    sarcastic_sentences = []

    sentences_list = sentences(original_text)
    
    for sentence in sentences_list:
        if check_sarcasm_sentence(sentence, model, classifier):
            sarcastic_sentences.append(sentence)
        
        sentence_words_list = words(sentence)
        ambiguous_words_sentence = []
        for sentence_word in sentence_words_list:
            ambiguity_result = check_ambiguity_word(sentence_word, sentence)
            if ambiguity_result[0]:
                ambiguous_words_sentence.append((sentence_word, ambiguity_result[1]))
        
        if ambiguous_words_sentence:
            ambiguous_words_per_sentence[sentence] = ambiguous_words_sentence

    print("\n--- Detected Items for the Prompt ---")
    print(f"Sarcastic sentences: {sarcastic_sentences}")
    print(f"Ambiguous words per sentence: {ambiguous_words_per_sentence}")
    print("------------------------------------")

    # 2. Optimized Prompt Generation
    final_prompt = generate_prompt(original_text, sarcastic_sentences, ambiguous_words_per_sentence)
    
    # 3. Generation of the Treated Text by the LLM (GEMINI)
    print("\n--- Generating text with Gemini (Hard-coded model: gemini-2.5-flash) ---")
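    try:
        rewritten_text = model_gemini.generate_content(final_prompt).text
    except ValueError:
        # .text raises ValueError when the response contains no usable text (e.g. it was blocked)
        rewritten_text = ""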

    if rewritten_text:
        if rewritten_text.strip().startswith("REWRITTEN TEXT:"):
            rewritten_text = rewritten_text.strip()[len("REWRITTEN TEXT:"):].strip()
        
        if rewritten_text.startswith('"') and rewritten_text.endswith('"'):
            rewritten_text = rewritten_text[1:-1].strip()

        print("\n--- REWRITTEN TEXT ---")
        print(rewritten_text)
        print("-----------------------")

        print(evaluation.evaluate_rewrite(original_text, rewritten_text))
    else:
        print("\nCould not generate the rewritten text.")
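
The post-processing applied to Gemini's reply (dropping an optional `REWRITTEN TEXT:` label and surrounding double quotes) can be pulled out into a small function, which makes it easy to test without calling the API. A minimal sketch; `clean_rewritten_text` is a hypothetical helper name, not something defined in `src`. Unlike the loop above, it always trims surrounding whitespace and ignores a lone quote character.

```python
PREFIX = "REWRITTEN TEXT:"


def clean_rewritten_text(raw):
    """Strip the optional label and one pair of surrounding double quotes."""
    text = raw.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):].strip()
    # Require at least two characters so a lone quote is not treated as a pair
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1].strip()
    return text


print(clean_rewritten_text('REWRITTEN TEXT: "O atendimento foi ruim."'))
print(clean_rewritten_text("  O atendimento foi ruim.  "))
```

Both calls print the same cleaned sentence, with the label, quotes, and padding removed.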