<a href="https://colab.research.google.com/github/JakobPorsfelt/Content-Analysis-Assignments/blob/main/NLP_Exam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Classifying Sarcasm
####This notebook works with a sarcasm dataset from Kaggle: https://www.kaggle.com/datasets/danofer/sarcasm. It was originally created by Mikhail Khodak et al. (2017) for their paper "A Large Self-Annotated Corpus for Sarcasm" and has since been made publicly available on Kaggle. The dataset is self-annotated, meaning the authors of the comments themselves labeled their statements as sarcastic or not.

The overall topic is sarcasm, studied in a research setting. The main problem is how sarcasm can affect the analysis of large text datasets in consumer research. This study seeks to understand sarcasm in order to better control for it, or to understand its context within sentiment analysis, in consumer behavior studies. Research has indicated there might be a relationship between sarcasm and polarity score (Agrawal & Papagelis 2020)(D. Tayal, S. et al. 2014); therefore this relationship will be analyzed using VADER.
This notebook seeks both to predict sarcasm by training different types of classifiers (Naive Bayes and CNN) and to run an exploratory sentiment analysis to investigate whether there is a connection between sentiment and sarcasm.
The main goal of the notebook is to train a binary classifier using different approaches and compare the results.
The main constraint is the combination of a large dataset and the low computing power offered by the free tier of Colab. However, it is interesting to compare the performance of Naive Bayes and the CNN: the CNN is more complex but heavy on resources, whereas Naive Bayes is simpler and puts little demand on RAM, so it can train on more data and faster.

#Naive Bayes Classifier

In [None]:
!pip install gdown==4.6

In [None]:
#downloading the dataset
import gdown

url='https://drive.google.com/uc?id=10LhzuH8143lOXGk6-e8-fDkXpu-xwXiH&confirm=t'
output='train-balanced-sarcasm.csv'
gdown.download(url, output, quiet=True, fuzzy=True, use_cookies=True)

'train-balanced-sarcasm.csv'

In [None]:
#Reading file and assigning to df object
import pandas as pd
file_path = '/content/train-balanced-sarcasm.csv'
df = pd.read_csv(file_path)

In [None]:
#Importing necessary modules
import re
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
#PorterStemmer and SnowballStemmer are needed by the stem() helper defined below
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

nltk.download('wordnet')
nltk.download('stopwords')

In [None]:
#Checking for null/NA values
print(df.isnull().sum())

In [None]:
#dropping NA
df = df.dropna()

In [None]:
#Assigning data for train/test to variables for readability.
comments = df['comment']
labels = df['label']

In [None]:
#Adapted from the book, with a minor tweak to text_clean because the original used deprecated methods that produced too many warnings when running the script.
#The script defines the functions for normalizing the text data:
#- lowercasing, replacing non-alphanumeric characters with spaces, removing stop words and extra whitespace, and lemmatizing or stemming.
#Reference: Kedia, A., & Rasu, M. (2020).

def text_clean(corpus):
    cleaned_corpus_list = []
    for row in corpus:
        qs = []
        for word in row.split():
            p1 = re.sub(pattern='[^a-zA-Z0-9]', repl=' ', string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_corpus_list.append(' '.join(qs))

    cleaned_corpus = pd.Series(cleaned_corpus_list, dtype='object')
    return cleaned_corpus

def stopwords_removal(corpus):
    stop = set(stopwords.words('english'))
    corpus = [[x for x in x.split() if x not in stop] for x in corpus]
    return corpus

def lemmatize(corpus):
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(x, pos = 'v') for x in x] for x in corpus]
    return corpus

def stem(corpus, stem_type = None):
    if stem_type == 'snowball':
        stemmer = SnowballStemmer(language = 'english')
        corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    else :
        stemmer = PorterStemmer()
        corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    return corpus
def preprocess(corpus, cleaning = True, stemming = False, stem_type = None, lemmatization = False, remove_stopwords = True):

    if cleaning == True:
        corpus = text_clean(corpus)

    if remove_stopwords == True:
        corpus = stopwords_removal(corpus)
    else :
        corpus = [[x for x in x.split()] for x in corpus]

    if lemmatization == True:
        corpus = lemmatize(corpus)


    if stemming == True:
        corpus = stem(corpus, stem_type)

    corpus = [' '.join(x) for x in corpus]


    return corpus

In [None]:
#applying the function. Relevant to this is cleaning, lemmatization, and removing stopwords.
comments_cleaned = preprocess(comments, cleaning = True, lemmatization = True, remove_stopwords = True)

In [None]:
#Vectorizing the data into tf-idf vectors, essentially turning the words into weighted values.
#Also including bigrams (ngram_range=(1, 2)), as it helps to improve the accuracy of the model.
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X = tfidf.fit_transform(comments_cleaned)
y = labels
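
To make the `ngram_range=(1, 2)` setting concrete, here is a minimal pure-Python sketch of the unigrams and bigrams that come out of one cleaned comment. The helper `word_ngrams` is hypothetical and only mimics the idea; `TfidfVectorizer` additionally applies its own tokenization, caps the vocabulary at `max_features` and weights the terms.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the comment.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Made-up comment, for illustration only.
example_grams = word_ngrams("yeah great idea")
print(example_grams)
```

Bigrams such as "yeah great" keep a little word-order information that single words lose, which is one plausible reason they help here.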

In [None]:
#splitting the data into train and test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

In [None]:
#Training the model
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

In [None]:
#measuring the accuracy of the model.
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()

We can see that the model is not doing a perfect job; the F1 score will probably land between 64% and 65%.
Accuracy is a bit higher and sometimes reaches almost 70%. Given the balanced class distribution of the dataset, where guessing would yield about 50%, this is a positive result.
The precision shows that about two out of three comments the model flags as sarcastic really are sarcastic. Since this study is interested in predicting sarcasm and only sarcasm, recall is of lesser importance here. Overall, the model is useful for exploratory research aiming at an overall interpretive result, where it matters less whether every single instance is classified correctly.
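
As a reminder of how the four scores relate to the confusion matrix, here is a minimal sketch that derives them from raw counts. The helper `binary_metrics` and the counts are made up for illustration and are not results from this notebook; for labels 0/1, scikit-learn's `confusion_matrix` is laid out as `[[tn, fp], [fn, tp]]`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: of everything predicted sarcastic, how much really is.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that is sarcastic, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only.
acc, prec, rec, f1_value = binary_metrics(tn=70, fp=30, fn=40, tp=60)
print(acc, prec, rec, f1_value)
```

With these made-up counts, precision (2/3) is higher than recall (0.6), which is the kind of trade-off described above.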

#Building CNN for sarcasm detection
##### - Please switch the runtime type to TPU

Now we will build a CNN to see whether a neural network does better than Naive Bayes. The CNN is more complex than Naive Bayes, which allows it to take positional variations of tokens into account. We will use Google's pretrained word2vec model to embed the text data, which represents the semantic meaning of words as distances between vectors, and use this embedding as input for the neural network.
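
To illustrate what "semantic meaning as distance" means, here is a small sketch using cosine similarity on made-up vectors. The three-dimensional `toy_vectors` are purely hypothetical; the real word2vec vectors have 300 dimensions and are learned from text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "awesome": [0.8, 0.2, 0.1],
    "tax": [0.0, 0.2, 0.9],
}
sim_close = cosine_similarity(toy_vectors["great"], toy_vectors["awesome"])
sim_far = cosine_similarity(toy_vectors["great"], toy_vectors["tax"])
print(sim_close, sim_far)
```

Once the pretrained vectors are loaded later in this section, gensim's `word2vec_model.similarity(w1, w2)` gives the same kind of score for real words.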

In [None]:
!pip install gdown==4.6

In [29]:
#importing the file from gdrive
import gdown

url='https://drive.google.com/uc?id=10LhzuH8143lOXGk6-e8-fDkXpu-xwXiH&confirm=t'
output='train-balanced-sarcasm.csv'
gdown.download(url, output, quiet=True, fuzzy=True, use_cookies=True)

'train-balanced-sarcasm.csv'

In [30]:
import pandas as pd

file_path = '/content/train-balanced-sarcasm.csv'
df = pd.read_csv(file_path)

In [31]:
import pandas as pd
import numpy as np
import re
import json
import gensim
import math
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import KeyedVectors
import keras
from keras.models import Sequential, Model
from keras import layers
from keras.layers import Dense, Dropout, Conv1D, GlobalMaxPooling1D
import h5py
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [32]:
#Checking for null values that would otherwise give errors when cleaning etc.
df.isnull().values.sum()

53

In [33]:
#Removing NANs
df = df.dropna()

In [34]:
#Unfortunately, due to the low compute offered by the free tier of Colab, a sample is needed because of the large size of the dataset. CNNs are RAM heavy,
#so I am not able to push the whole dataset through the network as I was with the Naive Bayes classifier.
df_sample = df.sample(frac = 0.05, random_state = 42)

#giving comments and labels their own variables - easier to read and work with.
comments = df_sample['comment']
labels = df_sample['label']

In [None]:
#How are sarcastic vs non-sarcastic comments distributed? Ideally it should be an even split.
#The distribution seems to be even.
label_distribution = labels.value_counts()

label_distribution.plot(kind='bar')
plt.title('Label Distribution')
plt.xlabel('Label')
plt.ylabel('Frequency')
plt.show()

In [None]:
comments

In [None]:
#For good measure, checking that the lengths are actually the same.
len(comments) == len(labels)

In [38]:
#Adapted from the book, with a minor tweak to text_clean because the original used deprecated methods that produced too many warnings when running the script.
#Reference: Kedia, A., & Rasu, M. (2020)


def text_clean(corpus):
    cleaned_corpus_list = []
    for row in corpus:
        qs = []
        for word in row.split():
            p1 = re.sub(pattern='[^a-zA-Z0-9]', repl=' ', string=word)
            p1 = p1.lower()
            qs.append(p1)
        cleaned_corpus_list.append(' '.join(qs))

    cleaned_corpus = pd.Series(cleaned_corpus_list, dtype='object')
    return cleaned_corpus

def stopwords_removal(corpus):
    stop = set(stopwords.words('english'))
    corpus = [[x for x in x.split() if x not in stop] for x in corpus]
    return corpus

def lemmatize(corpus):
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(x, pos = 'v') for x in x] for x in corpus]
    return corpus

def stem(corpus, stem_type = None):
    if stem_type == 'snowball':
        stemmer = SnowballStemmer(language = 'english')
        corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    else :
        stemmer = PorterStemmer()
        corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    return corpus
def preprocess(corpus, cleaning = True, stemming = False, stem_type = None, lemmatization = False, remove_stopwords = True):

    if cleaning == True:
        corpus = text_clean(corpus)

    if remove_stopwords == True:
        corpus = stopwords_removal(corpus)
    else :
        corpus = [[x for x in x.split()] for x in corpus]

    if lemmatization == True:
        corpus = lemmatize(corpus)


    if stemming == True:
        corpus = stem(corpus, stem_type)

    corpus = [' '.join(x) for x in corpus]


    return corpus

In [39]:
#Applying cleaning function
comments_cleaned = preprocess(comments, cleaning = True, lemmatization = True, remove_stopwords = True)

In [40]:
!gdown --fuzzy https://drive.google.com/uc?id=1OwbLwg-UlRvc9q1RPcCI6Oo64R-nA5wd
!gzip -d /content/GoogleNews-vectors-negative300.bin.gz

Downloading...
From: https://drive.google.com/uc?id=1OwbLwg-UlRvc9q1RPcCI6Oo64R-nA5wd
To: /content/GoogleNews-vectors-negative300.bin.gz
100% 1.65G/1.65G [00:28<00:00, 57.4MB/s]


In [41]:
#loading the pretrained word2vec vectors used for embedding
from gensim.models import KeyedVectors
word2vec_model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


In [42]:
#Measuring the length of the comments to calculate the mean length,
#in order to get an understanding of a best estimate for max length for the cnn model

df['comment_length'] = df['comment'].apply(lambda x: len(x.split()))
df['comment_length'].mean()

10.461448811948875
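
The mean alone does not show how many comments would be cut off at a given maximum length. A minimal sketch of that check follows; the helper `truncated_share` and the token counts are hypothetical, and in the notebook the counts would come from the cleaned comments, since stop words are removed before vectorizing.

```python
def truncated_share(token_counts, max_length):
    """Share of comments that have more tokens than max_length."""
    too_long = sum(1 for n in token_counts if n > max_length)
    return too_long / len(token_counts)

# Hypothetical token counts, for illustration only.
toy_counts = [3, 5, 8, 10, 12, 20, 4, 9, 15, 7]
for candidate in (10, 15, 20):
    print(candidate, truncated_share(toy_counts, candidate))
```

If the share at the chosen length turns out to be large, a somewhat higher maximum length might be worth the extra memory, resources permitting.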

In [43]:
#max length is set to about 10 since that is the average length of the comments.
#vector size is 300 since that is the dimensionality of the pretrained word2vec model from Google.
MAX_LENGTH = 10
VECTOR_SIZE = 300

#Parameters for the model are inspired by the book. Conservative numbers are unfortunately necessary due to the low computing resources available.
#Around 30 epochs seems to be close to optimal in terms of loss; after 30 it doesn't improve much.
#Reference: Kedia, A., & Rasu, M. (2020)

FILTERS=8
KERNEL_SIZE=3
HIDDEN_LAYER_1_NODES=10
HIDDEN_LAYER_2_NODES=5
DROPOUT_PROB=0.35
NUM_EPOCHS=30
BATCH_SIZE=50

In [None]:
#Model using common activation functions; sigmoid on the output layer is typically used for binary classification.
#GlobalMaxPooling1D because the model handles text, which is a 1-dimensional sequence.
#Reference: Kedia, A., & Rasu, M. (2020)

model = Sequential()

model.add(Conv1D(FILTERS,
                 KERNEL_SIZE,
                 padding='same',
                 strides=1,
                 activation='relu',
                 input_shape = (MAX_LENGTH, VECTOR_SIZE)))
model.add(GlobalMaxPooling1D())
model.add(Dense(HIDDEN_LAYER_1_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(HIDDEN_LAYER_2_NODES, activation='relu'))
model.add(Dropout(DROPOUT_PROB))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
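
As a sanity check on the model summary, the number of trainable parameters can be worked out by hand. This is a minimal sketch that repeats the constants defined above so it runs on its own; the pooling and dropout layers add no parameters.

```python
# Same values as the constants defined earlier, repeated for a standalone sketch.
VECTOR_SIZE = 300
FILTERS, KERNEL_SIZE = 8, 3
HIDDEN_LAYER_1_NODES, HIDDEN_LAYER_2_NODES = 10, 5

# Conv1D: one weight per (kernel position, input channel, filter) plus one bias per filter.
conv_params = KERNEL_SIZE * VECTOR_SIZE * FILTERS + FILTERS
# Dense layers: inputs * units weights plus one bias per unit.
# GlobalMaxPooling1D leaves one value per filter, so the first Dense layer sees FILTERS inputs.
dense_1_params = FILTERS * HIDDEN_LAYER_1_NODES + HIDDEN_LAYER_1_NODES
dense_2_params = HIDDEN_LAYER_1_NODES * HIDDEN_LAYER_2_NODES + HIDDEN_LAYER_2_NODES
output_params = HIDDEN_LAYER_2_NODES * 1 + 1

total_params = conv_params + dense_1_params + dense_2_params + output_params
print(conv_params, dense_1_params, dense_2_params, output_params, total_params)
```

Almost all of the parameters sit in the convolution layer, so `FILTERS` and `KERNEL_SIZE` are the main levers on model size in this setup.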

In [45]:
#compiling the model specifying the loss function etc.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [46]:
#Function for vectorizing the comment data
#Reference: Kedia, A., & Rasu, M. (2020)

def vectorize_data(data):

    vectors = []

    padding_vector = [0.0] * VECTOR_SIZE

    for i, data_point in enumerate(data):
        data_point_vectors = []
        count = 0

        tokens = data_point.split()

        for token in tokens:
            if count >= MAX_LENGTH:
                break
            if token in word2vec_model.key_to_index:
                data_point_vectors.append(word2vec_model[token])
            count = count + 1

        if len(data_point_vectors) < MAX_LENGTH:
            to_fill = MAX_LENGTH - len(data_point_vectors)
            for _ in range(to_fill):
                data_point_vectors.append(padding_vector)

        vectors.append(data_point_vectors)

    return vectors

In [47]:
#applying the function to the comments and vectorizing them. Here embedding them with the word2vec model.
vectorized_comments = vectorize_data(comments_cleaned)

In [48]:
#checking that the comments were vectorized correctly, i.e. every comment has exactly the max length defined earlier.
#No printed output means they are vectorized correctly.
for i, vec in enumerate(vectorized_comments):
    if len(vec) != MAX_LENGTH:
        print(i)

In [None]:
#Reshaping the data for inputs
from sklearn.model_selection import train_test_split

X = np.reshape(vectorized_comments, (len(vectorized_comments), MAX_LENGTH, VECTOR_SIZE))
y = np.array(labels)

#Splitting the sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#Training the model on the training data. No validation data is passed to fit(); the test data is only used for evaluation in the next cells.
model.fit(X_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

In [50]:
#simple measure of accuracy of the predictions by the model on the test data.
#Result is not great compared to the Naive Bayes, but further training on the rest of the dataset would probably improve the model.
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

Testing Accuracy:  0.6112


In [None]:
#Evaluation of the model.
#Because the classification metrics can't handle a mix of binary and continuous outputs, they have to be converted first.
y_pred = (model.predict(X_test) > 0.5).astype("int32")

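#seaborn and confusion_matrix are needed for the plot below and are not imported elsewhere in this section
import seaborn as sns
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix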

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 2))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")
plt.show()

The CNN does a worse job than the Naive Bayes classifier, most likely because it was trained on a much smaller training set.
Since the numbers of false positives and false negatives are very high, I wouldn't recommend using this model in its current state, as it is very likely to misclassify.
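
The 0.5 cut-off applied to the sigmoid output is only one possible choice. As a hedged illustration, the sketch below shows how moving the threshold trades recall for precision; the helper `precision_recall_at` and the probabilities are made up and are not outputs of the model above.

```python
def precision_recall_at(probabilities, true_labels, threshold):
    """Precision and recall when predicting 1 for probabilities above threshold."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, true_labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model outputs and labels, for illustration only.
toy_probs = [0.9, 0.8, 0.65, 0.55, 0.45, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 1, 0, 0]
for threshold in (0.5, 0.7):
    print(threshold, precision_recall_at(toy_probs, toy_labels, threshold))
```

Whether a higher threshold would actually help this CNN is an open question; it would need to be checked on held-out data.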

##Training on new datasets with identical column naming, using a subset as an example

In [None]:
#creating a separate df object for fine-tuning the model. Storing it as a .csv file to also showcase working with new data.
#However, this assumes that the new data has the same structure as the initial dataset.

new_df = df.sample(frac = 0.01, random_state = 1)
new_df.to_csv('new_df.csv')

In [None]:
# Simple function to fit the model on new data and print the updated accuracy of the model.
# Also does preprocessing and vectorization.

def train_new_data(file_path):
  df = pd.read_csv(file_path)
  df = df.dropna()
  cleaned = preprocess(df['comment'], lemmatization = True, remove_stopwords = True)
  vec_data = vectorize_data(cleaned)
  X = np.reshape(vec_data, (len(vec_data), MAX_LENGTH, VECTOR_SIZE))
  y = np.array(df['label'])
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
  model.fit(X_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)
  loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
  print("Testing Accuracy:  {:.4f}".format(accuracy))

In [None]:
#Applying the function
train_new_data("new_df.csv")

####Due to low computing resources, the Naive Bayes classifier probably did better than the CNN because it could train on the whole dataset at once (about 1 million rows vs 50 thousand).
####With the given resources, Naive Bayes is the better choice.

##Sentiment Analysis using VADER

Here I analyse the sentiments of a sample from the sarcasm dataset used above. I deploy VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analyzer. It returns the proportions of a given text that are rated positive, negative or neutral, along with a normalized compound score between -1 and 1. In this analysis I use the compound score to define polarities as positive, negative or neutral. Papers indicate that there is a connection between polarity score and sarcasm (Agrawal & Papagelis 2020)(D. Tayal, S. et al. 2014),
so this analysis investigates the distribution of sentiments and whether, at a glance, it can say anything about this relationship.
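
To give a feel for how a compound score ends up between -1 and 1, here is a minimal sketch of the squashing step. It assumes the normalization x / sqrt(x² + α) with α = 15, which is my understanding of VADER's reference implementation; the function name `normalize_compound` is hypothetical, and the real analyzer also applies rules for negation, intensifiers and punctuation before this step.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded sum of word valences into the range [-1, 1].

    Assumption: mirrors the normalization in VADER's reference implementation.
    """
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums, for illustration only.
for raw in (-6.0, -0.5, 0.0, 2.0, 10.0):
    print(raw, round(normalize_compound(raw), 4))
```

Small valence sums stay close to zero, which is why a band such as ±0.05 around zero can reasonably be treated as neutral.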

In [None]:
!pip install gdown==4.6

In [2]:
#Importing data
import gdown

url='https://drive.google.com/uc?id=10LhzuH8143lOXGk6-e8-fDkXpu-xwXiH&confirm=t'
output='train-balanced-sarcasm.csv'
gdown.download(url, output, quiet=True, fuzzy=True, use_cookies=True)

import pandas as pd
file_path = '/content/train-balanced-sarcasm.csv'
df = pd.read_csv(file_path)

In [3]:
#Drop NA values
df = df.dropna()

#Due to time and resources I take a sample of about 100k comments.
df_sample = df.sample(frac = 0.1, random_state = 42)

#creating and deploying the sentiment analyzer on the comment column.
import nltk

nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(comment):
    return analyzer.polarity_scores(comment)

df_sample['sentiment_score'] = df_sample['comment'].apply(analyze_sentiment)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [None]:
#sentiment score holds the proportions of the comment rated positive, neutral or negative; it also gives a compound value, which we will use to define the polarity categories.
df_sample[['comment','sentiment_score']]

In [16]:
#Translating the compound score into a polarity: positive (value >= 0.05), negative (value <= -0.05) or otherwise neutral.
def sentiment(score):
    compound = score["compound"]
    if compound >= 0.05:
        return "Positive"
    elif compound <= -0.05:
        return "Negative"
    return "Neutral"

In [17]:
#applying the function to the scores, extracting and translating the value into a "Predominant sentiment"
df_sample['Predominant_Polarity'] = df_sample['sentiment_score'].apply(sentiment)

#Checking the distribution of predominant sentiment
total = df_sample['Predominant_Polarity'].value_counts()

In [None]:
total

Neutral is the largest single category; however, positive and negative combined are larger.
This could indicate polarization. The dataset is obviously about sarcasm, and sarcasm is a kind of irony where one says something with the opposite meaning.

In [None]:
#Checking the distribution of label and predominant polarity.
distribution = df_sample.groupby(['Predominant_Polarity', 'label']).size().reset_index(name='counts')
print(distribution)

In [None]:
#Creating a simple bar-plot

import seaborn as sns
import matplotlib.pyplot as plt


distribution = df_sample.groupby(['Predominant_Polarity', 'label']).size().reset_index(name='counts')

sns.barplot(x='Predominant_Polarity', y='counts', hue='label', data=distribution)

plt.xlabel('Predominant Polarity')
plt.ylabel('Counts')
plt.title('Distribution of Labels by Predominant Polarity')
plt.legend(title='label')

plt.show()

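#Extracting the compound score as its own numeric column
df_sample['compound'] = df_sample['sentiment_score'].apply(lambda x: x['compound'])

#Point-biserial correlation between the binary label and the continuous compound score
from scipy.stats import pointbiserialr

correlation, p_value = pointbiserialr(df_sample['label'], df_sample['compound'])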
print("Correlation coefficient:", correlation)
print("P-value:", p_value)



No obvious relationship seems to exist in this context. The dataset contains more positive or negative comments than neutral ones, but this could be due either to how the dataset was created or to the dataset revolving around sarcasm, which states one thing but means something different. Whether a positive score is truly a positive opinion or a negative one is then hard to tell: could sarcasm mean that positive scores correspond to negative sentiment and vice versa, and how often is this the case?

It is therefore quite a complex issue and would require additional analysis of the semantics as well as more complex topic modelling (perhaps with the aid of an LLM).

Analyzing the correlation and its p-value using point-biserial correlation likewise indicates that there isn't a clear relationship between the two in this context.
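
For reference, the point-biserial coefficient can be computed by hand as the difference between the two group means, scaled by the overall standard deviation and the group proportions. The sketch below uses made-up labels and compound scores; the helper `point_biserial` is hypothetical and returns only the coefficient, not a p-value.

```python
import math
from statistics import mean, pstdev

def point_biserial(binary_labels, values):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    group_1 = [v for b, v in zip(binary_labels, values) if b == 1]
    group_0 = [v for b, v in zip(binary_labels, values) if b == 0]
    p = len(group_1) / len(values)  # share of observations labeled 1
    q = 1 - p
    # Population standard deviation of all values, matching the Pearson definition.
    return (mean(group_1) - mean(group_0)) / pstdev(values) * math.sqrt(p * q)

# Hypothetical labels and compound scores, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_compound = [0.5, -0.2, 0.1, 0.4, -0.3, 0.0]
r_pb = point_biserial(toy_labels, toy_compound)
print(round(r_pb, 4))
```

This is the same number a Pearson correlation would give for these two columns, so with a sample of about 100k comments it is worth reading the size of the coefficient alongside the p-value.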