# Sarcasm Detection - NLP Analysis
This notebook explores the binary classification of sarcastic vs. non-sarcastic headlines.

In [1]:
import kagglehub

path = kagglehub.dataset_download("shariphthapa/sarcasm-json-datasets")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/shariphthapa/sarcasm-json-datasets?dataset_version_number=1...


100%|██████████| 1.59M/1.59M [00:00<00:00, 40.3MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/shariphthapa/sarcasm-json-datasets/versions/1





In [2]:
import pandas as pd
import numpy as np
import os

candidates = [
    "Sarcasm.json",
    os.path.join(path, "Sarcasm.json") if "path" in globals() else None,
    os.path.join(path, "Sarcasm_Headlines_Dataset.json") if "path" in globals() else None,
]
file_path = next((p for p in candidates if p and os.path.exists(p)), "Sarcasm.json")

try:
    df = pd.read_json(file_path, lines=True)
except ValueError:
    df = pd.read_json(file_path)

print(f"Loaded {file_path} -> shape {df.shape}")
df.head()

Loaded /root/.cache/kagglehub/datasets/shariphthapa/sarcasm-json-datasets/versions/1/Sarcasm.json -> shape (26709, 3)


|   | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


## Data exploration and cleaning

In [3]:
df_clean = df.drop(columns=['article_link'])
print("Columns after dropping article_link:", df_clean.columns.tolist())
print("\nDataset shape:", df_clean.shape)
print("\nFirst few rows:")
print(df_clean.head())
print("\nTarget distribution (is_sarcastic):")
print(df_clean['is_sarcastic'].value_counts())

Columns after dropping article_link: ['headline', 'is_sarcastic']

Dataset shape: (26709, 2)

First few rows:
                                            headline  is_sarcastic
0  former versace store clerk sues over secret 'b...             0
1  the 'roseanne' revival catches up to our thorn...             0
2  mom starting to fear son's web series closest ...             1
3  boehner just wants wife to listen, not come up...             1
4  j.k. rowling wishes snape happy birthday in th...             0

Target distribution (is_sarcastic):
is_sarcastic
0    14985
1    11724
Name: count, dtype: int64


In [4]:
!pip install nltk



## Text preprocessing with NLTK

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import re
import string

print("Checking and downloading NLTK resources...")
resources_to_download = ['punkt', 'punkt_tab', 'stopwords']

for resource in resources_to_download:
    try:
        if resource == 'stopwords':
            stopwords.words('english')
        else:
            nltk.data.find(f'tokenizers/{resource}')
        print(f"  ✓ {resource} already available")
    except LookupError:
        try:
            print(f"  Downloading {resource}...")
            nltk.download(resource, quiet=True)
            print(f"  ✓ {resource} downloaded")
        except Exception as e:
            print(f"  ⚠ Could not download {resource}: {e}")

print("NLTK resources check complete!")

stop_words = set(stopwords.words('english'))
print(f"\nLoaded {len(stop_words)} stop words")
print(f"Sample stop words: {list(stop_words)[:10]}")

def preprocess_text(text):
    if not isinstance(text, str):
        return []
    text = text.lower()
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.strip() and t not in stop_words]
    return tokens

print("\nPreprocessing headlines...")
df_clean['headline_tokens'] = df_clean['headline'].apply(preprocess_text)
print("Sample preprocessed headlines:")
for i in range(3):
    print(f"  Original: {df_clean['headline'].iloc[i]}")
    print(f"  Tokens: {df_clean['headline_tokens'].iloc[i]}\n")

Checking and downloading NLTK resources...
  Downloading punkt...
  ✓ punkt downloaded
  Downloading punkt_tab...
  ✓ punkt_tab downloaded
  Downloading stopwords...
  ✓ stopwords downloaded
NLTK resources check complete!

Loaded 198 stop words
Sample stop words: ["hasn't", 'mightn', 'itself', 'shouldn', 'what', 'during', 'until', 'themselves', 'ours', 're']

Preprocessing headlines...
Sample preprocessed headlines:
  Original: former versace store clerk sues over secret 'black code' for minority shoppers
  Tokens: ['former', 'versace', 'store', 'clerk', 'sues', 'secret', 'black', 'code', 'minority', 'shoppers']

  Original: the 'roseanne' revival catches up to our thorny political mood, for better and worse
  Tokens: ['roseanne', 'revival', 'catches', 'thorny', 'political', 'mood', 'better', 'worse']

  Original: mom starting to fear son's web series closest thing she will have to grandchild
  Tokens: ['mom', 'starting', 'fear', 'son', 'web', 'series', 'closest', 'thing', 'grandchild']
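
In the third sample, `son's` ends up as just `son`: the punctuation substitution turns the apostrophe into a space, and the leftover `s` is then dropped as a stop word. A minimal, dependency-free sketch of that behaviour follows. It assumes a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, hypothetical) and uses `str.split()` in place of NLTK's `word_tokenize`, so it illustrates the idea rather than reproducing the cell exactly.

```python
import re
import string

# Hypothetical, hand-picked subset; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"to", "she", "will", "have", "s"}

def toy_preprocess(text):
    """Lowercase, replace punctuation with spaces, split, drop stop words."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # str.split() stands in for word_tokenize (an assumption of this sketch).
    return [t for t in text.split() if t not in TOY_STOP_WORDS]

tokens = toy_preprocess("mom starting to fear son's web series")
print(tokens)
```

One consequence of this design is that possessives and contractions lose their suffixes entirely, which may or may not matter for sarcasm cues.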

## Tokenization and padding with Keras

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer_sarcasm = Tokenizer(num_words=5000, oov_token="<OOV>")

headline_strings = [' '.join(tokens) for tokens in df_clean['headline_tokens']]
tokenizer_sarcasm.fit_on_texts(headline_strings)

print(f"Tokenizer fitted on {len(headline_strings)} headlines")
print(f"Total unique words (vocab size): {len(tokenizer_sarcasm.word_index)}")
print(f"OOV token index: {tokenizer_sarcasm.word_index.get('<OOV>')}")
print(f"\nTop 20 most common words:")
sorted_words = sorted(tokenizer_sarcasm.word_index.items(), key=lambda x: x[1])[:20]
for word, idx in sorted_words:
    print(f"  {word}: {idx}")

sequences_sarcasm = tokenizer_sarcasm.texts_to_sequences(headline_strings)

max_len = max(len(seq) for seq in sequences_sarcasm) if sequences_sarcasm else 100
max_len = min(max_len, 100)
padded_sarcasm = pad_sequences(sequences_sarcasm, maxlen=max_len, padding='post')

print(f"\nMax headline length: {max_len}")
print(f"Padded sequences shape: {padded_sarcasm.shape}")
print(f"First 3 padded sequences shape: {padded_sarcasm[:3].shape}")

Tokenizer fitted on 26709 headlines
Total unique words (vocab size): 25220
OOV token index: 1

Top 20 most common words:
  <OOV>: 1
  trump: 2
  new: 3
  man: 4
  year: 5
  one: 6
  report: 7
  area: 8
  woman: 9
  donald: 10
  day: 11
  u: 12
  says: 13
  time: 14
  first: 15
  obama: 16
  like: 17
  women: 18
  people: 19
  get: 20

Max headline length: 27
Padded sequences shape: (26709, 27)
First 3 padded sequences shape: (3, 27)
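
The output shows 25220 distinct words, yet `num_words=5000` means only the most frequent ones keep their own id; everything else is mapped to the `<OOV>` id 1, and 0 is reserved for padding. The sketch below is a simplified, pure-Python imitation of that behaviour, not the Keras implementation. The helper names `fit_word_index` and `to_padded_sequences` are hypothetical, and the three toy texts are made up for illustration.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; id 1 is the OOV token, id 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_padded_sequences(texts, word_index, num_words, maxlen, oov_index=1):
    """Map words to ids (rare or unknown words become OOV), then right-pad with 0."""
    rows = []
    for text in texts:
        ids = []
        for word in text.split():
            idx = word_index.get(word)
            keep = idx is not None and idx < num_words
            ids.append(idx if keep else oov_index)
        ids = ids[-maxlen:]  # keep the last maxlen ids if the sequence is too long
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows

toy_texts = ["man new trump", "new man", "man area"]
toy_index = fit_word_index(toy_texts)
toy_padded = to_padded_sequences(toy_texts, toy_index, num_words=4, maxlen=3)
print(toy_index)
print(toy_padded)
```

With `num_words=4`, only ids below 4 survive, so `trump` and `area` both collapse to the OOV id even though they are in the word index.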


## Preparing the features and target

In [7]:
X_sarcasm = padded_sarcasm.astype('float32')
y_sarcasm = df_clean['is_sarcastic'].values.astype('int32')

print(f"Features shape (X): {X_sarcasm.shape}")
print(f"Target shape (y): {y_sarcasm.shape}")
print(f"\nTarget distribution:")
print(f"  Sarcastic (1): {(y_sarcasm == 1).sum()} ({100 * (y_sarcasm == 1).sum() / len(y_sarcasm):.1f}%)")
print(f"  Non-sarcastic (0): {(y_sarcasm == 0).sum()} ({100 * (y_sarcasm == 0).sum() / len(y_sarcasm):.1f}%)")

feature_cols = [f'tok_{i}' for i in range(X_sarcasm.shape[1])]
X_df = pd.DataFrame(X_sarcasm, columns=feature_cols)
df_model = X_df.copy()
df_model['is_sarcastic'] = y_sarcasm

print(f"\nModel dataset shape: {df_model.shape}")
print(f"First few rows:")
print(df_model.head())

Features shape (X): (26709, 27)
Target shape (y): (26709,)

Target distribution:
  Sarcastic (1): 11724 (43.9%)
  Non-sarcastic (0): 14985 (56.1%)

Model dataset shape: (26709, 28)
First few rows:
    tok_0   tok_1   tok_2   tok_3   tok_4   tok_5   tok_6   tok_7   tok_8  \
0   220.0     1.0   543.0  3034.0  2201.0   273.0    35.0  1995.0  2498.0   
1     1.0  3262.0  2675.0     1.0   309.0  2829.0   164.0   892.0     0.0   
2    58.0   749.0   727.0   144.0  1996.0   485.0  4567.0   129.0     1.0   
3  1240.0   138.0   260.0  1592.0   224.0  2830.0  1294.0     1.0   790.0   
4   669.0   597.0  3784.0   819.0     1.0   463.0   464.0  1163.0    36.0   

   tok_9  ...  tok_18  tok_19  tok_20  tok_21  tok_22  tok_23  tok_24  tok_25  \
0    1.0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
1    0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
2    0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3    0.0  ...   

## Baseline logistic regression

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

X_train, X_test, y_train, y_test = train_test_split(
    X_sarcasm, y_sarcasm, test_size=0.3, stratify=y_sarcasm, random_state=42
)

print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

clf_sarcasm = LogisticRegression(max_iter=5000, random_state=42, class_weight='balanced', solver='lbfgs')
print("\nTraining logistic regression...")
clf_sarcasm.fit(X_train, y_train)
print("Training complete!")

y_pred = clf_sarcasm.predict(X_test)
y_pred_proba = clf_sarcasm.predict_proba(X_test)

print("\n" + "="*60)
print("MODEL EVALUATION")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba[:, 1]):.4f}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Non-Sarcastic', 'Sarcastic']))

Train set: 18696 samples
Test set: 8013 samples

Training logistic regression...
Training complete!

MODEL EVALUATION
Accuracy: 0.5657
ROC AUC: 0.5623

Confusion Matrix:
[[3074 1422]
 [2058 1459]]

Classification Report:
               precision    recall  f1-score   support

Non-Sarcastic       0.60      0.68      0.64      4496
    Sarcastic       0.51      0.41      0.46      3517

     accuracy                           0.57      8013
    macro avg       0.55      0.55      0.55      8013
 weighted avg       0.56      0.57      0.56      8013
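
The classifier above is fitted with `class_weight='balanced'`. scikit-learn documents this heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the training split, with class counts derived from the outputs (dataset totals minus test-set support). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Return n_samples / (n_classes * count) for each class label."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Training-set counts: full-dataset counts minus the test-set support.
train_counts = {0: 14985 - 4496, 1: 11724 - 3517}
weights = balanced_class_weights(train_counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

The minority class (sarcastic) receives the larger weight, so errors on it cost more during fitting.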



### Comments on the baseline logistic regression

*   **Accuracy**: At 56.57%, the model barely beats always predicting the majority class: 4496 of the 8013 test headlines (56.1%) are non-sarcastic. It is not meaningfully distinguishing sarcastic from non-sarcastic headlines.
*   **ROC AUC**: A score of 0.5623 is also low. A score of 0.5 corresponds to chance, so the model separates the positive (sarcastic) class from the negative (non-sarcastic) class only weakly.
*   **Confusion matrix**:
    *   **True positives (TP)**: 1459 sarcastic headlines were correctly identified as sarcastic.
    *   **False negatives (FN)**: 2058 sarcastic headlines were incorrectly classified as non-sarcastic (the model missed them).
    *   **True negatives (TN)**: 3074 non-sarcastic headlines were correctly identified as non-sarcastic.
    *   **False positives (FP)**: 1422 non-sarcastic headlines were incorrectly classified as sarcastic (false alarms).
*   **Classification report**:
    *   **Class 'Non-Sarcastic' (0)**: Precision is 0.60 (60% of 'non-sarcastic' predictions were correct) and recall is 0.68 (68% of the truly non-sarcastic headlines were found).
    *   **Class 'Sarcastic' (1)**: Precision is lower at 0.51 (only 51% of 'sarcastic' predictions were correct), and recall is lower still at 0.41 (only 41% of the truly sarcastic headlines were found). The model struggles to identify sarcasm.

In summary, this baseline logistic regression performs poorly, and it struggles most with sarcastic headlines, as the low recall for that class shows. This is unsurprising given the features: the inputs are padded integer token IDs, whose numeric values are frequency ranks rather than quantities a linear model can use meaningfully.
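
The figures discussed above can be recomputed directly from the confusion matrix. The short sketch below does that arithmetic and also computes the accuracy of always predicting the majority class, which is a useful reference point. The variable names are chosen for illustration.

```python
# Confusion matrix from the evaluation output: rows = true class, columns = predicted.
tn, fp = 3074, 1422
fn, tp = 2058, 1459

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_sarcastic = tp / (tp + fp)
recall_sarcastic = tp / (tp + fn)
# Accuracy obtained by always predicting the majority class (non-sarcastic).
majority_baseline = (tn + fp) / total

print(f"accuracy            = {accuracy:.4f}")
print(f"precision (class 1) = {precision_sarcastic:.2f}")
print(f"recall (class 1)    = {recall_sarcastic:.2f}")
print(f"majority baseline   = {majority_baseline:.4f}")
```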

## Embedding layer approach

In [9]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
import numpy as np

print("="*70)
print("EMBEDDING LAYER APPROACH")
print("="*70)

embedding_dim = 64
vocab_size = 5000 + 1

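# Note: this Embedding layer is never compiled or fitted, so predict() below
# returns its randomly initialized vectors, not learned embeddings.
embedding_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len)
])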

print(f"\nEmbedding Configuration:")
print(f"  Vocabulary size: {vocab_size}")
print(f"  Embedding dimension: {embedding_dim}")
print(f"  Input sequence length: {max_len}")

print("\nConverting sequences to embeddings...")
X_embedded = embedding_model.predict(padded_sarcasm, verbose=0)
print(f"Embedded data shape: {X_embedded.shape}")

# Average over the sequence axis; padding positions (id 0) are included in the mean.
X_embedded_flat = X_embedded.mean(axis=1)
print(f"Flattened embeddings shape (average pooling): {X_embedded_flat.shape}")

X_train_emb, X_test_emb, y_train_emb, y_test_emb = train_test_split(
    X_embedded_flat, y_sarcasm, test_size=0.3, stratify=y_sarcasm, random_state=42
)

print(f"\nTrain set (embedded): {X_train_emb.shape[0]} samples")
print(f"Test set (embedded): {X_test_emb.shape[0]} samples")

EMBEDDING LAYER APPROACH

Embedding Configuration:
  Vocabulary size: 5001
  Embedding dimension: 64
  Input sequence length: 27

Converting sequences to embeddings...
Embedded data shape: (26709, 27, 64)
Flattened embeddings shape (average pooling): (26709, 64)

Train set (embedded): 18696 samples
Test set (embedded): 8013 samples
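
The pooling step above averages all 27 positions, including the zero-padded ones. Without `mask_zero=True`, a Keras `Embedding` layer gives id 0 an ordinary vector, so short headlines are pulled toward the padding vector. The sketch below contrasts plain mean pooling with a masked variant that skips padding. It is an alternative offered for comparison, not what the notebook does, and the tiny lookup table `toy_table` and the helper `mean_pool` are hypothetical.

```python
def mean_pool(seq, table, ignore_padding=False):
    """Average the embedding vectors of a padded id sequence.

    With ignore_padding=True, positions holding the pad id 0 are skipped.
    """
    dim = len(table[0])
    ids = [i for i in seq if i != 0] if ignore_padding else list(seq)
    if not ids:
        return [0.0] * dim
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]

# Hypothetical 2-dimensional lookup table; row 0 is the padding vector.
toy_table = [
    [4.0, 4.0],  # id 0 (padding)
    [1.0, 0.0],  # id 1
    [0.0, 1.0],  # id 2
]
padded = [1, 2, 0, 0]

plain = mean_pool(padded, toy_table)
masked = mean_pool(padded, toy_table, ignore_padding=True)
print("plain mean :", plain)
print("masked mean:", masked)
```

In this toy case the plain mean is dominated by the padding row, while the masked mean reflects only the two real tokens.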


## Logistic regression on embeddings

In [10]:
print("\n" + "="*70)
print("LOGISTIC REGRESSION ON EMBEDDED DATA")
print("="*70)

clf_embedded = LogisticRegression(max_iter=5000, random_state=42, class_weight='balanced', solver='lbfgs')
print("\nTraining logistic regression on embeddings...")
clf_embedded.fit(X_train_emb, y_train_emb)
print("Training complete!")

y_pred_emb = clf_embedded.predict(X_test_emb)
y_pred_proba_emb = clf_embedded.predict_proba(X_test_emb)

print("\n" + "="*60)
print("EMBEDDED MODEL EVALUATION")
print("="*60)
acc_embedded = accuracy_score(y_test_emb, y_pred_emb)
roc_embedded = roc_auc_score(y_test_emb, y_pred_proba_emb[:, 1])

print(f"Accuracy: {acc_embedded:.4f}")
print(f"ROC AUC: {roc_embedded:.4f}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_test_emb, y_pred_emb)}")
print(f"\nClassification Report:")
print(classification_report(y_test_emb, y_pred_emb, target_names=['Non-Sarcastic', 'Sarcastic']))

print("\n" + "="*60)
print("COMPARISON: ORIGINAL vs EMBEDDED")
print("="*60)
acc_original = accuracy_score(y_test, y_pred)
roc_original = roc_auc_score(y_test, y_pred_proba[:, 1])

print(f"Original Accuracy:  {acc_original:.4f}")
print(f"Embedded Accuracy:  {acc_embedded:.4f}")
print(f"Improvement:        {(acc_embedded - acc_original)*100:+.2f}%")
print(f"\nOriginal ROC AUC:   {roc_original:.4f}")
print(f"Embedded ROC AUC:   {roc_embedded:.4f}")
print(f"Improvement:        {(roc_embedded - roc_original):+.4f}")


LOGISTIC REGRESSION ON EMBEDDED DATA

Training logistic regression on embeddings...
Training complete!

EMBEDDED MODEL EVALUATION
Accuracy: 0.5547
ROC AUC: 0.5909

Confusion Matrix:
[[2540 1956]
 [1612 1905]]

Classification Report:
               precision    recall  f1-score   support

Non-Sarcastic       0.61      0.56      0.59      4496
    Sarcastic       0.49      0.54      0.52      3517

     accuracy                           0.55      8013
    macro avg       0.55      0.55      0.55      8013
 weighted avg       0.56      0.55      0.56      8013


COMPARISON: ORIGINAL vs EMBEDDED
Original Accuracy:  0.5657
Embedded Accuracy:  0.5547
Improvement:        -1.10%

Original ROC AUC:   0.5623
Embedded ROC AUC:   0.5909
Improvement:        +0.0285


## Model performance



*   **Accuracy**: Logistic regression on the embedded data (0.5547) is slightly less accurate than the original model (0.5657), a drop of 1.10 percentage points.
*   **ROC AUC**: In contrast, the area under the ROC curve rose slightly for the embedded model (0.5909 versus 0.5623), an improvement of 0.0285. This suggests that the model ranks sarcastic headlines above non-sarcastic ones somewhat better, even though accuracy at the default decision threshold is slightly lower.
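
Accuracy and ROC AUC can move in opposite directions because accuracy depends on a fixed decision threshold, while ROC AUC depends only on how the scores rank positives against negatives. The toy sketch below illustrates this with made-up scores. The pairwise AUC function is a straightforward illustration of the definition, not scikit-learn's implementation, and all names and values are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def threshold_accuracy(y_true, scores, threshold=0.5):
    """Accuracy after turning scores into labels at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

labels = [0, 0, 0, 1, 1, 1]
scores_a = [0.1, 0.2, 0.9, 0.3, 0.8, 0.7]     # decent threshold fit, imperfect ranking
scores_b = [0.6, 0.7, 0.8, 0.85, 0.9, 0.95]   # perfect ranking, all above 0.5

for name, scores in [("A", scores_a), ("B", scores_b)]:
    acc = threshold_accuracy(labels, scores)
    auc = pairwise_auc(labels, scores)
    print(f"model {name}: accuracy={acc:.3f}  auc={auc:.3f}")
```

Model B ranks every positive above every negative yet predicts class 1 for everything at the 0.5 threshold, so its accuracy is lower than model A's despite the higher AUC.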

## Summary of model performance
In summary, mean-pooled embeddings did not improve overall accuracy, but they did separate the classes somewhat better, as shown by the higher ROC AUC and the higher recall for the 'Sarcastic' class (0.54 versus 0.41). This came at the cost of more false positives, that is, non-sarcastic headlines flagged as sarcastic (1956 versus 1422). The model is still far from good. Note that the embedding layer is used here with its initial weights and is never trained, so these results do not yet show what learned embeddings can do; training the embeddings may capture semantic nuances better.