# Twitter Sarcasm & Irony Detection Project
### By Asaf Levi & Haim Elbaz 
In this project, we will build several machine learning models model to detect sarcasm and irony in tweets, and compare their performance between models and preprocessing levels. 

We will be using the scikit-learn library for building our classifiers, with NLTK for natural language processing tools.


## Dataset

We will be using a labeled dataset of tweets, built as part of the research presented in the paper "Quantitative Analysis of the Differences between Sarcasm and Irony (Klinger and Litvak, 2016)".

The dataset contains data labeled as "regular", "sarcasm", "irony" and "figurative", as tagged by the tweet's writer. For our purposes, we will unite sarcasm and irony into a single category, omitting "figurative". 

## Steps Involved

1. Data Cleaning: We will clean and preprocess the tweet data by removing emojis, URLs, usernames, etc., handling necessary text transformations.

2. Feature Extraction: We will extract relevant features from the preprocessed tweets, such as bag-of-words, TF-IDF representations, and sentiment analysis.  These features will be used as input to our machine learning model.

3. Model Training: We will use scikit-learn's machine learning algorithms - Logistic Regression, Support Vector Machine Classifier and Random Forest Classifier  to train sarcasm-irony detection models on the labeled dataset.

4. Model Evaluation: We will evaluate the performance of our models with the help of a designated test dataset that will go through the same processing steps described. We will evaluate the models' performance using various metrics, such as accuracy, precision, recall, and F1 score. 

5. Prediction: Once our model is trained and evaluated, we can use it to predict sarcasm and irony in new, unseen tweets.

## Code Structure

To keep our notebook organized, we will store custom functions in a separate `helpers.py` file. This file will contain functions for data preprocessing, feature extraction, and any other utility functions we may need throughout the project. We will import these functions as needed in our notebook.

Let's get started with importing the necessary libraries and loading the dataset!

In [1]:
# Import libraries 
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix, hstack
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from helpers import preprocessing_pipeline, count_syntactic_features

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [2]:
# Load training set
raw_training_data = pd.read_csv("data/train.csv")
training_data = preprocessing_pipeline(raw_training_data)

Since our data from twitter, it would benefit to use a twitter-designated tokenizer which recognizes hashtags, usernames, urls, and emojis. After tokenizing each tweet, we use get TF-IDF for each word in the dataset. 

It is worth mentioning that we tried using simple bag of words (CountVectorizer), but the results using TF-IDF were much better. TF-IDF takes into account the importance of the word within its tweet, which seems to be quite important when it comes to tweets, in particular in large dataset, as seen in "Sentiment Analysis on COVID Tweets: An Experimental Analysis on the Impact of Count Vectorizer and TF-IDF on Sentiment Predictions using Deep Learning Models (Raza et al., 2021)"

In [3]:
tweet_tokenizer = TweetTokenizer()
def tokenize(tweet):
    return tweet_tokenizer.tokenize(tweet)
vectorizer = TfidfVectorizer(tokenizer=tokenize)
training_data_level_1 = vectorizer.fit_transform(training_data["tweets"])

In [4]:
X = training_data_level_1
y = training_data["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=42)

In [5]:
models_params = [
        (
            RandomForestClassifier(random_state=1337),{
                "class_weight" : [ "balanced"], # None
                "min_samples_leaf" : [i for i in range(1, 61, 10)],
                "n_estimators" : [i for i in range(5, 20, 5)],
                "n_jobs" : [7]
            }
    ),
        (
            LinearSVC(random_state=1337),{
                "C" : [0.5, 1],
                "class_weight" : ["balanced"],
            }
    ),
        (
            LogisticRegression(random_state=1337),{
                "max_iter": [125, 150],
                "class_weight" : [None,"balanced"],
                "n_jobs" : [7],
    })
]

In [6]:
def do_gridsearch(X_train, y_train, models_params):
    results = pd.DataFrame()
    for model, param_grid in models_params:
        gs = GridSearchCV(estimator=model, 
                            error_score='raise',
                            param_grid=param_grid,
                            scoring='recall')
        gs.fit(X=X_train, y=y_train)
        results = pd.concat([results , pd.DataFrame([
    {
    'model_type' : model,
    'parameters' : params,
    'score' : score,
    }
    for params, score in zip(gs.cv_results_["params"],gs.cv_results_["mean_test_score"],
    )])])
    return results.sort_values(by='score', ascending=False)


In [7]:
tfidf_model_results = do_gridsearch(X_train, y_train, models_params)
tfidf_model_results.to_csv("tfidf_results.csv")

In [8]:
new_columns = pd.DataFrame(columns=["neg", "neu", "pos", "compound", 'Stopwords', 'Nouns', 'Verbs', 'Adverbs', 'Adjectives', 'Pronouns', "length"])
training_data = training_data.join(new_columns)

for index, row in training_data.iterrows():
    tweet = row[0]

    scores = SentimentIntensityAnalyzer().polarity_scores(tweet)

    for sentiment, score in scores.items():
        training_data.loc[index, sentiment] = score
        
    syntax_counts = count_syntactic_features(tweet)
    for syntax, count in syntax_counts.items():
        training_data.loc[index, syntax] = count
        
    training_data.loc[index, "length"] = len(tweet)

training_data = training_data.drop_duplicates()
training_data = training_data.reset_index(drop=True)
training_data.head()

Unnamed: 0,tweets,class,neg,neu,pos,compound,Stopwords,Nouns,Verbs,Adverbs,Adjectives,Pronouns,length
0,fav moment in sepp blatter vid ( 0:20 ) : `` w...,1,0.0,0.778,0.222,0.6908,10,5,1,1,2,0,116
1,just found this while walking my human ....,1,0.0,1.0,0.0,0.0,4,2,2,0,0,0,43
2,'disrespected the wife of prophet ' - pseudo l...,1,0.217,0.652,0.13,-0.296,3,6,2,0,0,0,80
3,do you know that super yeay satisfying feeling...,1,0.0,0.704,0.296,0.8126,11,3,6,1,1,0,120
4,if you 're going to call someone ignorant and ...,1,0.234,0.766,0.0,-0.6705,9,3,4,1,3,0,104


In [9]:
processed_columns = training_data.columns
processed_training_data = training_data.copy()
tf_idf_results = vectorizer.fit_transform(training_data["tweets"])
processed_training_data.drop(columns=["tweets", "class"], inplace=True)
training_data_level_2 = csr_matrix(processed_training_data.to_numpy(dtype=np.float32))

In [10]:
X = hstack([tf_idf_results, training_data_level_2])
y = training_data["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=5432)
bow_features_model_results = do_gridsearch(X_train, y_train, models_params)
bow_features_model_results.to_csv("tfidf_sentiment_syntax.csv")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [11]:
bi_vectorizer = TfidfVectorizer(tokenizer=tokenize, 
                                ngram_range=(2, 2), 
                                max_features=50000)
bigram = bi_vectorizer.fit_transform(training_data["tweets"])

In [13]:
X = hstack([tf_idf_results, training_data_level_2, bigram])
y = training_data["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, 
                                                    random_state=99101)
bigram_model_results = do_gridsearch(X_train, y_train, models_params)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [14]:
bigram_model_results.to_csv("tfidf_sentiment_syntax_bigram.csv")

## Evaluating Process Level 1

In [27]:
raw_test_data = pd.read_csv("data/test.csv")

test_data = preprocessing_pipeline(raw_test_data)
y_test = test_data["class"]
training_data = preprocessing_pipeline(raw_training_data)
whole_dataset = pd.concat([test_data, training_data])

tweet_tokenizer = TweetTokenizer()
def tokenize(tweet):
    return tweet_tokenizer.tokenize(tweet)

vectorizer = TfidfVectorizer(tokenizer=tokenize)

vectorizer.fit(whole_dataset["tweets"])
X = vectorizer.transform(training_data["tweets"])
y = training_data["class"]

X_pred = vectorizer.transform(test_data["tweets"])



In [16]:
model = RandomForestClassifier(class_weight="balanced",
                               min_samples_leaf=1,
                               n_estimators=15,
                               n_jobs=7,
                               random_state=99101)
model.fit(X, y)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.515869,0.947818,0.815638,0.731843,0.863988
recall,0.813401,0.816176,0.815638,0.814789,0.815638
f1-score,0.631336,0.877085,0.815638,0.754211,0.829392
support,1179.0,4896.0,0.815638,6075.0,6075.0


In [17]:
model = LogisticRegression(class_weight=None,
                            max_iter=125,
                            n_jobs=7,
                            random_state=99101)
model.fit(X, y)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.576654,0.93833,0.827654,0.757492,0.859029
recall,0.804805,0.834071,0.827654,0.819438,0.827654
f1-score,0.67189,0.883134,0.827654,0.777512,0.836817
support,1332.0,4743.0,0.827654,6075.0,6075.0


In [18]:
model = LinearSVC(C=0.5,
                class_weight="balanced",
                random_state=99101)
model.fit(X, y)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.740183,0.83444,0.805597,0.787312,0.802261
recall,0.663452,0.87928,0.805597,0.771366,0.805597
f1-score,0.69972,0.856274,0.805597,0.777997,0.802826
support,2074.0,4001.0,0.805597,6075.0,6075.0


## Evaluating Process Level 2

In [28]:
new_columns = pd.DataFrame(columns=["neg", "neu", "pos", "compound", 'Stopwords', 'Nouns', 'Verbs', 'Adverbs', 'Adjectives', 'Pronouns', "length"])
training_data = training_data.join(new_columns)

for index, row in training_data.iterrows():
    tweet = row[0]

    scores = SentimentIntensityAnalyzer().polarity_scores(tweet)

    for sentiment, score in scores.items():
        training_data.loc[index, sentiment] = score
        
    syntax_counts = count_syntactic_features(tweet)
    for syntax, count in syntax_counts.items():
        training_data.loc[index, syntax] = count
        
    training_data.loc[index, "length"] = len(tweet)

training_data = training_data.drop_duplicates()
training_data = training_data.reset_index(drop=True)
processed_columns = training_data.columns
processed_training_data = training_data.copy()
tf_idf_results = vectorizer.fit_transform(training_data["tweets"])
processed_training_data.drop(columns=["tweets", "class"], inplace=True)
training_data_level_2 = csr_matrix(processed_training_data.to_numpy(dtype=np.float32))

In [26]:
X = hstack([tf_idf_results, training_data_level_2])
y = training_data["class"]

new_columns = pd.DataFrame(columns=["neg", "neu", "pos", "compound", 'Stopwords', 'Nouns', 'Verbs', 'Adverbs', 'Adjectives', 'Pronouns', "length"])
testing_data = test_data.join(new_columns)

for index, row in testing_data.iterrows():
    tweet = row[0]

    scores = SentimentIntensityAnalyzer().polarity_scores(tweet)

    for sentiment, score in scores.items():
        testing_data.loc[index, sentiment] = score
        
    syntax_counts = count_syntactic_features(tweet)
    for syntax, count in syntax_counts.items():
        testing_data.loc[index, syntax] = count
        
    testing_data.loc[index, "length"] = len(tweet)

testing_data = testing_data.drop_duplicates()
testing_data = testing_data.reset_index(drop=True)

processed_columns = testing_data.columns
processed_testing_data = testing_data.copy()
tf_idf_results = vectorizer.transform(testing_data["tweets"])
processed_testing_data.drop(columns=["tweets", "class"], inplace=True)
X_pred = hstack([X_pred, csr_matrix(processed_training_data.to_numpy(dtype=np.float32))])

ValueError: Mismatching dimensions along axis 0: {6001, 58326}

In [24]:
X_pred

<58326x11 sparse matrix of type '<class 'numpy.float32'>'
	with 423453 stored elements in Compressed Sparse Row format>

In [22]:
model = RandomForestClassifier(class_weight="balanced",
                               min_samples_leaf=1,
                               n_estimators=15,
                               n_jobs=7,
                               random_state=99101)
model.fit(X, y)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

ValueError: X has 11 features, but RandomForestClassifier is expecting 43235 features as input.

In [None]:
model = LogisticRegression(class_weight=None,
                            max_iter=125,
                            n_jobs=7,
                            random_state=99101)
model.fit(X_train, y_train)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

In [None]:
model = LinearSVC(C=0.5,
                class_weight="balanced",
                random_state=99101)
model.fit(X_train, y_train)
y_pred = model.predict(X_pred)
pd.DataFrame(classification_report(y_pred, y_test, output_dict=True))

In [None]:
training_data

