<a href="https://colab.research.google.com/github/KaifAhmad1/Sarcasm-Detection/blob/main/Sarcasm_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Sarcasm Detection in Financial News Headlines

In [1]:
import json
import pandas as pd

**Data Normalization:**

In [2]:
file_path = '/content/drive/MyDrive/Sarcasm_Headlines_Dataset.json'
data = []
# Read the file line by line
with open(file_path, 'r') as file:
    for line in file:
        # Load each line as a JSON object
        json_object = json.loads(line)
        data.append(json_object)
# Convert the list of JSON objects to a DataFrame
headline_data = pd.json_normalize(data)
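
The loop above treats the file as JSON Lines, with one JSON object per line. A minimal standard-library sketch of that parsing step follows; the helper name `read_json_lines` and the in-memory sample are illustrative assumptions, and `article_link` is left out because the links are truncated in the displayed output. `pd.read_json(file_path, lines=True)` should be an equivalent one-call alternative for the same file layout.

```python
import io
import json


def read_json_lines(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines between records
        records.append(json.loads(line))
    return records


# In-memory stand-in for the dataset file (hypothetical sample, two records).
sample = io.StringIO(
    '{"headline": "reparations and obama", "is_sarcastic": 0}\n'
    "\n"
    '{"headline": "america\'s best 20 hikes", "is_sarcastic": 0}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["headline"])
```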

**Data Exploration:**

In [3]:
headline_data

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
| ... | ... | ... | ... |
| 26704 | https://www.huffingtonpost.com/entry/american-... | american politics in moral free-fall | 0 |
| 26705 | https://www.huffingtonpost.com/entry/americas-... | america's best 20 hikes | 0 |
| 26706 | https://www.huffingtonpost.com/entry/reparatio... | reparations and obama | 0 |
| 26707 | https://www.huffingtonpost.com/entry/israeli-b... | israeli ban targeting boycott supporters raise... | 0 |


In [4]:
# Shape of the DataFrame as (rows, columns)
headline_data.shape

(26709, 3)

In [5]:
headline_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26709 entries, 0 to 26708
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   article_link  26709 non-null  object
 1   headline      26709 non-null  object
 2   is_sarcastic  26709 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 626.1+ KB


In [6]:
headline_data.describe()

| | is_sarcastic |
|---|---|
| count | 26709.0 |
| mean | 0.438953 |
| std | 0.496269 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


In [7]:
# Count fully duplicated rows
headline_data.duplicated().sum()

1

In [8]:
# Drop the duplicated row in place
headline_data.drop_duplicates(inplace = True)

In [9]:
headline_data

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
| ... | ... | ... | ... |
| 26704 | https://www.huffingtonpost.com/entry/american-... | american politics in moral free-fall | 0 |
| 26705 | https://www.huffingtonpost.com/entry/americas-... | america's best 20 hikes | 0 |
| 26706 | https://www.huffingtonpost.com/entry/reparatio... | reparations and obama | 0 |
| 26707 | https://www.huffingtonpost.com/entry/israeli-b... | israeli ban targeting boycott supporters raise... | 0 |


**Text Preprocessing:**

In [10]:
import string

import spacy
from nltk.corpus import stopwords

In [11]:
# Load spaCy model for lemmatization
nlp = spacy.load('en_core_web_sm')

In [12]:
# Convert text to lowercase
headline_data['headline'] = headline_data['headline'].apply(lambda x: x.lower())

In [13]:
# Remove punctuation
headline_data['headline'] = headline_data['headline'].apply(lambda x: x.translate(str.maketrans("", "", string.punctuation)))

In [14]:
# Remove numbers
headline_data['headline'] = headline_data['headline'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

In [15]:
import nltk
nltk.download('stopwords')
# Remove stopwords
stop_words = set(stopwords.words('english'))
headline_data['headline'] = headline_data['headline'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_words))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
# Lemmatization using spaCy
headline_data['headline'] = headline_data['headline'].apply(lambda x: ' '.join(token.lemma_ for token in nlp(x)))
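
The cleaning cells above can be folded into one reusable function, which makes it easier to apply identical preprocessing to any text scored later. The sketch below is a standard-library approximation: `clean_headline` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list), and the spaCy lemmatization step is omitted.

```python
import string

# Hypothetical, deliberately tiny stop-word set; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"the", "to", "in", "over", "up", "our", "a", "of", "and"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_headline(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    text = "".join(ch for ch in text if not ch.isdigit())
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_headline("america's best 20 hikes"))
print(clean_headline("american politics in moral free-fall"))
```

Note that deleting punctuation merges hyphenated words (`free-fall` becomes `freefall`) and removes apostrophes (`america's` becomes `americas`), which may or may not be desirable for sarcasm cues.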

**Model Building:**

In [17]:
# Train-Test Split
from sklearn.model_selection import train_test_split
train_texts, test_texts, train_labels, test_labels = train_test_split(headline_data['headline'],
                                                                      headline_data['is_sarcastic'],
                                                                      test_size=0.2,
                                                                      random_state=42,
                                                                      stratify=headline_data['is_sarcastic'])

In [18]:
# Display the shape of the training and testing sets
print("Training set shape:", train_texts.shape)
print("Testing set shape:", test_texts.shape)

Training set shape: (21366,)
Testing set shape: (5342,)
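
The printed shapes can be checked by hand. The arithmetic below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set, which is consistent with the sizes shown above; the row count comes from the 26709 original rows minus the one duplicate that was dropped.

```python
import math

n_rows = 26709 - 1   # rows left after dropping the single duplicate
test_size = 0.2

# Assumed rounding rule: test size rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print("train:", n_train, "test:", n_test)
```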


**TF-IDF Vectorization:**

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# Create and fit the TF-IDF word vectorizer on the training data
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_vectorizer.fit(train_texts)
# Transform the training and testing sets using the fitted vectorizer
train_features = tfidf_vectorizer.transform(train_texts)
test_features = tfidf_vectorizer.transform(test_texts)
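
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF with a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which is the scheme scikit-learn documents as its default. The function name `tfidf` and the three toy documents are illustrative assumptions, and plain whitespace splitting replaces the vectorizer's own token pattern.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (idf, rows) where rows holds one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize each row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


# Hypothetical preprocessed headlines.
docs = ["mom fear web series", "boehner want wife listen", "mom want series"]
idf, rows = tfidf(docs)
print(round(idf["fear"], 4), round(idf["mom"], 4))
```

A term that appears in fewer documents (`fear`) receives a larger weight than one that appears in several (`mom`).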

**Utilizing Logistic Regression:**

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [22]:
# Create and train the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(train_features, train_labels)

In [23]:
# Predict on the test set
predictions = logistic_model.predict(test_features)
# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
classification_rep = classification_report(test_labels, predictions)

In [24]:
print("Logistic Regression Model:")
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)

Logistic Regression Model:
Accuracy: 0.7755522276301011
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.84      0.81      2997
           1       0.78      0.69      0.73      2345

    accuracy                           0.78      5342
   macro avg       0.78      0.77      0.77      5342
weighted avg       0.78      0.78      0.77      5342
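
The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded values printed in the report, so small deviations from the unrounded scores are possible.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_class0 = f1_score_from(0.78, 0.84)   # non-sarcastic row of the report
f1_class1 = f1_score_from(0.78, 0.69)   # sarcastic row of the report
macro_f1 = (f1_class0 + f1_class1) / 2  # unweighted mean over the two classes

print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```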



**Using XGBoost:**

In [25]:
from xgboost import XGBClassifier

In [26]:
# XGBoost Model
xgb_model = XGBClassifier()
xgb_model.fit(train_features, train_labels)
xgb_predictions = xgb_model.predict(test_features)
xgb_accuracy = accuracy_score(test_labels, xgb_predictions)
xgb_classification_rep = classification_report(test_labels, xgb_predictions)

In [27]:
print("\nXGBoost Model:")
print("Accuracy:", xgb_accuracy)
print("Classification Report:\n", xgb_classification_rep)


XGBoost Model:
Accuracy: 0.7296892549606889
Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.90      0.79      2997
           1       0.80      0.51      0.62      2345

    accuracy                           0.73      5342
   macro avg       0.75      0.71      0.71      5342
weighted avg       0.75      0.73      0.72      5342



**Using LightGBM:**

In [28]:
from lightgbm import LGBMClassifier

In [29]:
# LightGBM Model
lgbm_model = LGBMClassifier()
lgbm_model.fit(train_features, train_labels)
lgbm_predictions = lgbm_model.predict(test_features)
lgbm_accuracy = accuracy_score(test_labels, lgbm_predictions)
lgbm_classification_rep = classification_report(test_labels, lgbm_predictions)

[LightGBM] [Info] Number of positive: 9379, number of negative: 11987
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.165458 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 33351
[LightGBM] [Info] Number of data points in the train set: 21366, number of used features: 1507
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.438968 -> initscore=-0.245350
[LightGBM] [Info] Start training from score -0.245350


In [30]:
print("\nLightGBM Model:")
print("Accuracy:", lgbm_accuracy)
print("Classification Report:\n", lgbm_classification_rep)


LightGBM Model:
Accuracy: 0.7330587794833395
Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.88      0.79      2997
           1       0.78      0.55      0.64      2345

    accuracy                           0.73      5342
   macro avg       0.75      0.71      0.71      5342
weighted avg       0.74      0.73      0.72      5342



**Train LSTM using TensorFlow:**

In [31]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import plotly.express as px
from tqdm import tqdm

In [32]:
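# NOTE: train_texts and test_texts already hold strings, so joining each one
# inserts a space between every character; the tokenizer below therefore
# sees single characters rather than words.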
train_texts = [" ".join(map(str, sequence)) for sequence in tqdm(train_texts)]
test_texts = [" ".join(map(str, sequence)) for sequence in tqdm(test_texts)]

100%|██████████| 21366/21366 [00:00<00:00, 203311.64it/s]
100%|██████████| 5342/5342 [00:00<00:00, 237789.70it/s]
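
Because each element of `train_texts` is already a string, `" ".join(map(str, ...))` iterates over its characters. The sketch below shows the effect on one hypothetical preprocessed headline and contrasts it with ordinary word-level splitting.

```python
headline = "reparation obama"  # hypothetical preprocessed headline

# Same expression as in the cell above, applied to a single string.
joined = " ".join(map(str, headline))

char_tokens = joined.split()    # what a whitespace tokenizer sees afterwards
word_tokens = headline.split()  # word-level tokens of the untouched string

print(repr(joined))
print(len(char_tokens), "character tokens vs", len(word_tokens), "word tokens")
```

This is consistent with the padded length of 205 reported further down, which is far longer than any headline measured in words.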


In [33]:
# Create a tokenizer
tokenizer = Tokenizer(num_words=5000)  # cap the vocabulary at the most frequent tokens
# Fit the tokenizer on the training data
tokenizer.fit_on_texts(train_texts)

In [34]:
# Convert text to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

In [35]:
# Pad sequences to ensure consistent length
max_len = max(len(seq) for seq in train_sequences + test_sequences)
train_x = sequence.pad_sequences(train_sequences, maxlen=max_len)
test_x = sequence.pad_sequences(test_sequences, maxlen=max_len)

In [36]:
# Display the shapes of the transformed data
print(train_x.shape, test_x.shape)

(21366, 205) (5342, 205)
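
`pad_sequences` pads and truncates at the front of each sequence by default. The helper below is a pure-Python sketch of that behavior for a single sequence; `pad_pre` is a hypothetical name, and it assumes `maxlen >= 1`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]  # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([3, 1, 4], 5))
print(pad_pre([1, 2, 3, 4, 5, 6], 4))
```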


In [37]:
# Set up EarlyStopping callback
early_stopping = EarlyStopping(
    min_delta=0.001,
    mode="auto",
    verbose=1,
    monitor="val_accuracy",
    patience=3
)
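
The callback stops training once the monitored value has failed to improve by more than `min_delta` for `patience` consecutive epochs. The sketch below mimics that rule in plain Python; the function name and the validation scores are made up for illustration, and the exact comparison is an assumption about how Keras applies `min_delta` when the metric is maximized.

```python
def early_stopping_epoch(val_scores, min_delta=0.001, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:
            best = score  # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies: the last real improvement is at epoch 3.
scores = [0.60, 0.62, 0.63, 0.629, 0.6305, 0.63]
print(early_stopping_epoch(scores))
```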

In [38]:
# Create an LSTM model
model = Sequential()
model.add(Embedding(5000, 100, input_length=max_len))  # vocabulary size matches the tokenizer's num_words
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(1, activation="sigmoid"))



In [39]:
# Compile the model
model.compile(
    loss="binary_crossentropy",
    optimizer=Adam(learning_rate=0.0045),
    metrics=["accuracy"]
)

In [40]:
# Display model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 205, 100)          500000    
                                                                 
 lstm (LSTM)                 (None, 64)                42240     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 542305 (2.07 MB)
Trainable params: 542305 (2.07 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
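
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes a standard LSTM layer with four gates, each with an input kernel, a recurrent kernel, and a bias.

```python
vocab_size, embed_dim, lstm_units = 5000, 100, 64

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```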


In [41]:
import numpy as np

# Convert Pandas Series to NumPy array
train_labels_np = train_labels.to_numpy()

# Train the model
history = model.fit(
    train_x,
    train_labels_np,  # Use the NumPy array here
    epochs=25,
    validation_split=0.2,
    batch_size=64,
    verbose=1,
    callbacks=[early_stopping]
)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 13: early stopping


In [43]:
import plotly.graph_objects as go

# Create figure
fig = go.Figure()

# Add traces for training and validation accuracy
fig.add_trace(go.Scatter(x=list(range(1, len(history.history["accuracy"]) + 1)),
                         y=history.history["accuracy"],
                         mode='lines',
                         name='Training Accuracy',
                         line=dict(color='blue')
                         ))

fig.add_trace(go.Scatter(x=list(range(1, len(history.history["val_accuracy"]) + 1)),
                         y=history.history["val_accuracy"],
                         mode='lines',
                         name='Validation Accuracy',
                         line=dict(color='orange')
                         ))

# Update layout
fig.update_layout(title='Training History',
                  xaxis_title='Epochs',
                  yaxis_title='Accuracy',
                  legend=dict(x=0.7, y=0.95, traceorder='normal'),
                  autosize=False,
                  width=800,
                  height=500)

# Show the plot
fig.show()

In [44]:
# Evaluate the model on the test set
loss, accuracy = model.evaluate(test_x, test_labels)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.6342194080352783


In [45]:
# Make predictions on new sentences
new_sentences = [
    "This is definitely the most interesting and exciting article I've ever read.",
    "Well, that's just what I needed, another pointless meeting.",
    "I love spending hours trying to understand complicated instructions.",
    "Oh, fantastic! Another flat tire on my way to work.",
    "Great job on leaving the coffee machine on for the entire night."
]

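# Tokenize and pad the new sequences (the raw sentences are passed as-is,
# without the lowercasing, punctuation, stop-word, and lemmatization steps
# applied to the training headlines)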
new_sequences = tokenizer.texts_to_sequences(new_sentences)
new_padded_sequences = sequence.pad_sequences(new_sequences, maxlen=max_len)

# Make predictions
predictions = model.predict(new_padded_sequences)

# Display the predictions
for sentence, prediction in zip(new_sentences, predictions):
    sarcasm_probability = prediction[0]
    sarcasm_label = "Sarcastic" if sarcasm_probability >= 0.5 else "Not Sarcastic"
    print(f"Sentence: {sentence}\nSarcasm Probability: {sarcasm_probability}\nPrediction: {sarcasm_label}\n")

Sentence: This is definitely the most interesting and exciting article I've ever read.
Sarcasm Probability: 0.313429057598114
Prediction: Not Sarcastic

Sentence: Well, that's just what I needed, another pointless meeting.
Sarcasm Probability: 0.32940444350242615
Prediction: Not Sarcastic

Sentence: I love spending hours trying to understand complicated instructions.
Sarcasm Probability: 0.32940444350242615
Prediction: Not Sarcastic

Sentence: Oh, fantastic! Another flat tire on my way to work.
Sarcasm Probability: 0.313429057598114
Prediction: Not Sarcastic

Sentence: Great job on leaving the coffee machine on for the entire night.
Sarcasm Probability: 0.313429057598114
Prediction: Not Sarcastic
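
The near-identical probabilities are what one would expect if almost none of the words in the new sentences are in the tokenizer's vocabulary, since the tokenizer was fitted on character-separated text and silently drops unknown tokens. The sketch below imitates the lookup with a hypothetical vocabulary of single lowercase letters; `FILTERS` is assumed to match the Keras `Tokenizer` default, and `known_tokens` is an illustrative helper, not part of Keras.

```python
# Assumed to match the default `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_FILTER_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def known_tokens(sentence, vocabulary):
    """Lowercase, replace filtered characters with spaces, keep in-vocabulary tokens."""
    words = sentence.lower().translate(_FILTER_TABLE).split()
    return [word for word in words if word in vocabulary]


# Hypothetical character-level vocabulary learned from the training text.
char_vocabulary = set("abcdefghijklmnopqrstuvwxyz")

with_i = known_tokens(
    "Well, that's just what I needed, another pointless meeting.", char_vocabulary
)
without_i = known_tokens(
    "Oh, fantastic! Another flat tire on my way to work.", char_vocabulary
)
print(with_i, without_i)
```

Under these assumptions only the standalone word "I" survives, which would explain why the five sentences produce just two distinct probabilities.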

