# Libraries

The required dependencies are installed in the cell below.

In [1]:
# install required dependencies
!pip install nltk



The required libraries are imported here. The scikit-learn library<sup>1</sup> is used for most of the feature extraction and classification tasks. The remaining imports are pandas, NLTK, PrettyTable, and Python's built-in `re` module.

In [2]:
# import required libraries
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from prettytable import PrettyTable
from sklearn.model_selection import cross_val_score

# Data 

## Reading the data

The data from the JSON file is read and stored as a DataFrame using the pandas library.

In [3]:
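import json

def parseJson(fname):
  # the file holds one JSON object per line; json.loads avoids running eval on file contents
  with open(fname, 'r') as f:
    for line in f:
      yield json.loads(line)

df = pd.DataFrame(parseJson('./Sarcasm_Headlines.json'))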

In [4]:
# let's have a sneak peek at the data
df.head()

|   | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


There are three columns: `is_sarcastic`, `headline`, and `article_link`. The third column, `article_link`, will not be used for this assignment.

In [5]:
# printing the size of the dataset
df.shape

(28619, 3)

We have 28,619 headlines, which is a reasonably sized dataset for this task.

# Feature Engineering

In this part, we will generate features from the corpus, which will be used for the classification task in the next section.

In [6]:
# let's see if the corpus has any uppercase characters
# (str.isupper() alone is True only when every cased character is uppercase)
for i in df.headline:
  if any(c.isupper() for c in i):
    print("yes")

The corpus does not have any uppercase characters, so there is no need to compare performance with and without lowercasing.

Let's remove stopwords from the headlines. Note that the cell below overwrites the `headline` column, so all of the following experiments run on the stopword-free text.

In [7]:
# removing stopwords
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set gives fast membership tests

df.headline = df['headline'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
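
The in-place filtering above can also be wrapped in a small reusable function, which would make it easier to compare runs with and without stopwords. The sketch below is only an illustration: `remove_stopwords` and the tiny `DEMO_STOPWORDS` set are hypothetical names that do not appear in the notebook, and in practice the set would come from `stopwords.words('english')`.

```python
# Hypothetical helper; DEMO_STOPWORDS is a tiny stand-in for NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "a", "to", "of", "from", "is"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

headline = "mother comes pretty close to using the word correctly"
cleaned = remove_stopwords(headline, DEMO_STOPWORDS)
print(cleaned)
```

Keeping the original column and writing the cleaned text to a new one would allow both variants to be vectorized and compared.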


Now, we will extract features/tokens from the corpus in different ways.

#### Task I: N-grams
We will use N-grams as features. For this task, we will try the following combinations of n-grams.
1. **Unigrams only**: Only the unigram features
2. **Bigrams only**: Only the bigram features
3. **Trigrams only**: Only the trigram features
4. **Unigrams + Bigrams**: Both unigrams and bigrams
5. **Bigrams + Trigrams**: Both bigrams and trigrams
6. **Unigrams + Bigrams + Trigrams**: All of unigrams, bigrams, and trigrams

For each of these feature sets, we will define an individual function below.

In [8]:
# 1. unigrams only
def uni_features_only(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=5)
  x = vectorizer.fit_transform(data)
  return x

In [9]:
# 2. bigrams only
def bi_features_only(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2), min_df=5)
  x = vectorizer.fit_transform(data)
  return x

In [10]:
# 3. trigrams only
def tri_features_only(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(3, 3), min_df=5)
  x = vectorizer.fit_transform(data)
  return x

In [11]:
# 4. unigrams + bigrams
def uni_bi_features(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=5)
  x = vectorizer.fit_transform(data)
  return x

In [12]:
# 5. bigrams + trigrams
def bi_tri_features(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 3), min_df=5)
  x = vectorizer.fit_transform(data)
  return x

In [13]:
# 6. unigrams + bigrams + trigrams
def uni_bi_tri_features(data):
  vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3), min_df=5)
  x = vectorizer.fit_transform(data)
  return x
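
The six functions above differ only in their `ngram_range` argument. To make concrete which tokens each range produces, here is a minimal pure-Python sketch. `word_ngrams` is a hypothetical helper, and it splits on whitespace only, whereas `CountVectorizer` applies its own tokenization, so treat it as an approximation of what the vectorizer counts.

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of text with min_n <= n <= max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the headline
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = "eat your veggies"
unigrams_bigrams = word_ngrams(example, 1, 2)  # mirrors ngram_range=(1, 2)
trigrams = word_ngrams(example, 3, 3)          # mirrors ngram_range=(3, 3)
print(unigrams_bigrams)
print(trigrams)
```

A three-word headline yields a single trigram, which hints at why trigram-only features become sparse on short headlines, especially after stopword removal and with `min_df=5`.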

#### Task II: Other features
We will use the following features other than n-grams.
1. TF-IDF
2. Repeated punctuation + Number of words
3. Hashing 


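1. **TF-IDF**: This feature weights each term by how often it occurs in a document (term frequency) and down-weights terms that occur in many documents (inverse document frequency). The assumption is that a word appearing in most headlines carries little information as a feature.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the idf term by default.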


In [14]:
# 1. tf-idf
def tf_idf_features(data):
  vectorizer = TfidfVectorizer(norm='l2', ngram_range=(1, 2), min_df=5)
  x = vectorizer.fit_transform(data)
  return x
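
To see how the weighting behaves, the sketch below computes the smoothed idf that scikit-learn applies by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, on three made-up documents. The helper names and the toy corpus are assumptions for illustration, and the L2 normalization that `norm='l2'` adds afterwards is left out.

```python
import math

# Made-up mini corpus, not taken from the dataset.
docs = [
    "scientists unveil clock",
    "scientists praise new clock",
    "mother praises scientists",
]

def smoothed_idf(term, documents):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1, where df(t) counts documents containing t."""
    n_docs = len(documents)
    df = sum(1 for d in documents if term in d.split())
    return math.log((1 + n_docs) / (1 + df)) + 1

def tf_idf(term, document, documents):
    """Raw count of term in document times its smoothed idf (no normalization)."""
    tf = document.split().count(term)
    return tf * smoothed_idf(term, documents)

idf_common = smoothed_idf("scientists", docs)  # appears in all three documents
idf_rare = smoothed_idf("mother", docs)        # appears in one document
print(idf_common, idf_rare)
```

The term that occurs in every document receives the smallest weight, while the rarer term is weighted more heavily.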

2. **Repeated punctuation and Number of words**: Here we look for runs of two or more consecutive punctuation characters in each headline. If one is found, the first feature is 1, otherwise 0. The second feature is the total number of words in the headline. Thus, each headline is represented by two columns, one with a binary value (0 or 1) and the other with an integer value (the number of words).

In [15]:
# 2. Repeated punctuation + Number of words

def punctuation_num_words(data):
  pattern_any_punctuation = re.compile('([-/\\\\()!"+,&\'.?]{2,})')

  features = []

  for t in data:
    match = pattern_any_punctuation.search(t)
    if match:
      features.append([1, len(t.split(" "))])
    else:
      features.append([0, len(t.split(" "))])

  return features
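
As a quick sanity check of this feature, the sketch below applies the same character class to two sample strings. `punct_and_length` is a hypothetical per-headline variant of the function above, and it uses `split()` rather than `split(" ")`, so repeated spaces are not counted as extra words.

```python
import re

# Same character class as the notebook's pattern: two or more consecutive punctuation marks.
REPEATED_PUNCT = re.compile(r'[-/\\()!"+,&\'.?]{2,}')

def punct_and_length(text):
    """Return [has_repeated_punctuation, number_of_words] for one headline."""
    has_repeat = 1 if REPEATED_PUNCT.search(text) else 0
    return [has_repeat, len(text.split())]

samples = [
    "wait... what?!",                                     # made-up string with repeated punctuation
    "eat your veggies: 9 deliciously different recipes",  # headline from the preview above
]
rows = [punct_and_length(s) for s in samples]
print(rows)
```

Note that the colon is not part of the character class, so the second headline gets 0 for the punctuation feature.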

3. **Hashing**: `HashingVectorizer` uses [MurmurHash](https://en.wikipedia.org/wiki/MurmurHash) to map each token to a column index. One disadvantage of this vectorizer is that the mapping is one-way: once the tokens are hashed, they cannot be recovered from the feature indices.

In [16]:
# 3. Hashing
def hash_features(data):
  vectorizer = HashingVectorizer(analyzer="word", ngram_range=(1,1), norm='l2', alternate_sign=False)
  x = vectorizer.fit_transform(data)
  return x
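
The idea behind the hashing approach can be sketched in a few lines of plain Python. This is not how `HashingVectorizer` is implemented: `hashlib.md5` stands in for MurmurHash only because it ships with the standard library, `hashed_bag_of_words` is a hypothetical name, and `n_features=16` is far smaller than scikit-learn's default of 2**20.

```python
import hashlib

def hashed_bag_of_words(text, n_features=16):
    """Count tokens into a fixed-size vector; the column is chosen by hashing the token."""
    vector = [0] * n_features
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column index derived from the hash
        vector[index] += 1
    return vector

vec = hashed_bag_of_words("scientists unveil doomsday clock")
print(vec)
```

No vocabulary is stored, which saves memory, but two different tokens can collide in the same column and a column index cannot be mapped back to a token.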

# Classification Models

Two machine learning models will be used.
1. Naive Bayes
2. SVM

Before the classification task, let's build the feature matrix for each feature set. The train, validation, and test splits are created per feature set in the Evaluation section.

In [17]:
features_list = [uni_features_only(df.headline), bi_features_only(df.headline), 
                 tri_features_only(df.headline), uni_bi_features(df.headline),
                 bi_tri_features(df.headline), uni_bi_tri_features(df.headline), 
                 tf_idf_features(df.headline), punctuation_num_words(df.headline), 
                 hash_features(df.headline)]

Let's apply the models and find the accuracy score and F1 score.

In [18]:
# Both functions read X_train, y_train, X_test, and y_test from the
# notebook's global scope, so the splits must exist before they are called.

# Multinomial NB
def nb_acc():
  classifier = MultinomialNB(alpha=1, fit_prior=False)
  classifier.fit(X_train, y_train)

  # predicting the test set results
  y_pred = classifier.predict(X_test)
  f1 = f1_score(y_test, y_pred, average='macro')
  a = accuracy_score(y_test, y_pred)

  # returns accuracy, macro F1, and the fitted classifier
  return a, f1, classifier

# SVM
def svm_acc():
  classifier = svm.SVC(kernel='sigmoid')
  classifier.fit(X_train, y_train)

  # predicting the test set results
  y_pred = classifier.predict(X_test)
  f1 = f1_score(y_test, y_pred, average='macro')
  a = accuracy_score(y_test, y_pred)

  # returns accuracy, macro F1, and the fitted classifier
  return a, f1, classifier
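
Since both metrics are reported side by side, it may help to see how macro-averaged F1 differs from accuracy. The sketch below computes both by hand on made-up labels. The function names are hypothetical, and it is meant to mirror `f1_score(..., average='macro')` for the simple case shown, not to replace it.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Made-up labels: one sarcastic headline is missed.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(macro, 3))
```

Because each class contributes equally to the average, a model that predicts almost everything as one class can keep an accuracy near the majority-class share while its macro F1 drops sharply.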

# Evaluation

Using the functions created above, the accuracy, F1 score, and cross-validation score are calculated for each feature set. The results are shown in the tables at the end.

Note: SVM takes a long time to run (around an hour for me).

## Unigram features only

In [19]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[0], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores for Naive Bayes
accuracy_unigram_nb, f1_unigram_nb, uni_classifier_nb = nb_acc()
cx_v_uni_nb = cross_val_score(uni_classifier_nb, features_list[0], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_unigram_svm, f1_unigram_svm, uni_classifier_svm = svm_acc()
cx_v_uni_svm = cross_val_score(uni_classifier_svm, features_list[0], df.is_sarcastic, scoring='accuracy', cv=10)

## Bigram features only

In [20]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[1], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_bigram_nb, f1_bigram_nb, bi_classifier_nb = nb_acc()
cx_v_bi_nb = cross_val_score(bi_classifier_nb, features_list[1], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_bigram_svm, f1_bigram_svm, bi_classifier_svm = svm_acc()
cx_v_bi_svm = cross_val_score(bi_classifier_svm, features_list[1], df.is_sarcastic, scoring='accuracy', cv=10)

## Trigram features only

In [21]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[2], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_trigram_nb, f1_trigram_nb, tri_classifier_nb = nb_acc()
cx_v_tri_nb = cross_val_score(tri_classifier_nb, features_list[2], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_trigram_svm, f1_trigram_svm, tri_classifier_svm = svm_acc()
cx_v_tri_svm = cross_val_score(tri_classifier_svm, features_list[2], df.is_sarcastic, scoring='accuracy', cv=10)

## Unigram + Bigram features

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[3], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_uni_bi_nb, f1_uni_bi_nb, uni_bi_classifier_nb = nb_acc()
cx_v_uni_bi_nb = cross_val_score(uni_bi_classifier_nb, features_list[3], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_uni_bi_svm, f1_uni_bi_svm, uni_bi_classifier_svm = svm_acc()
cx_v_uni_bi_svm = cross_val_score(uni_bi_classifier_svm, features_list[3], df.is_sarcastic, scoring='accuracy', cv=10)

## Bigram + Trigram features

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[4], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_bi_tri_nb, f1_bi_tri_nb, bi_tri_classifier_nb = nb_acc()
cx_v_bi_tri_nb = cross_val_score(bi_tri_classifier_nb, features_list[4], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_bi_tri_svm, f1_bi_tri_svm, bi_tri_classifier_svm = svm_acc()
cx_v_bi_tri_svm = cross_val_score(bi_tri_classifier_svm, features_list[4], df.is_sarcastic, scoring='accuracy', cv=10)

## Unigram + Bigram + Trigram features

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[5], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_uni_bi_tri_nb, f1_uni_bi_tri_nb, uni_bi_tri_classifier_nb = nb_acc()
cx_v_uni_bi_tri_nb = cross_val_score(uni_bi_tri_classifier_nb, features_list[5], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_uni_bi_tri_svm, f1_uni_bi_tri_svm, uni_bi_tri_classifier_svm = svm_acc()
cx_v_uni_bi_tri_svm = cross_val_score(uni_bi_tri_classifier_svm, features_list[5], df.is_sarcastic, scoring='accuracy', cv=10)

## TF-IDF features

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[6], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_tf_idf_nb, f1_tf_idf_nb, tf_idf_classifier_nb = nb_acc()
cx_v_tf_idf_nb = cross_val_score(tf_idf_classifier_nb, features_list[6], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_tf_idf_svm, f1_tf_idf_svm, tf_idf_classifier_svm = svm_acc()
cx_v_tf_idf_svm = cross_val_score(tf_idf_classifier_svm, features_list[6], df.is_sarcastic, scoring='accuracy', cv=10)

## Repeated Punctuation + Number of words

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[7], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_punc_num_nb, f1_punc_num_nb, punc_num_classifier_nb = nb_acc()
cx_v_punc_num_nb = cross_val_score(punc_num_classifier_nb, features_list[7], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_punc_svm, f1_punc_svm, punc_num_classifier_svm = svm_acc()
cx_v_punc_svm = cross_val_score(punc_num_classifier_svm, features_list[7], df.is_sarcastic, scoring='accuracy', cv=10)

## Hashing features

In [None]:
# getting the splits
X_train, X_rem, y_train, y_rem = train_test_split(features_list[8], df.is_sarcastic, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5)

# getting the accuracy and f1 scores
accuracy_hash_nb, f1_hash_nb, hash_classifier_nb = nb_acc()
cx_v_hash_nb = cross_val_score(hash_classifier_nb, features_list[8], df.is_sarcastic, scoring='accuracy', cv=10)

# getting the accuracy and f1 scores for SVM
accuracy_hash_svm, f1_hash_svm, hash_classifier_svm = svm_acc()
cx_v_hash_svm = cross_val_score(hash_classifier_svm, features_list[8], df.is_sarcastic, scoring='accuracy', cv=10)
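
The evaluation cells above repeat the same steps for every feature set, and the vectorizers are fitted on the full corpus before splitting. One possible alternative, sketched below on a made-up toy corpus, loops over the n-gram settings and places the vectorizer inside a scikit-learn pipeline so that it is refitted on each training fold. The toy `texts` and `labels`, the dictionary names, and `cv=2` are assumptions for illustration, not the notebook's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df.headline and df.is_sarcastic (invented examples).
texts = [
    "area man wins argument with himself",
    "nation demands more naps",
    "local dog elected mayor again",
    "area cat refuses to comment",
    "senate passes budget bill",
    "stocks fall after earnings report",
    "city council approves new park",
    "study links sleep to health",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ngram_settings = {
    "Unigrams only": (1, 1),
    "Unigrams + Bigrams": (1, 2),
}

cv_means = {}
for name, ngram_range in ngram_settings.items():
    # The vectorizer is part of the pipeline, so each fold learns its own vocabulary.
    model = make_pipeline(CountVectorizer(ngram_range=ngram_range), MultinomialNB())
    scores = cross_val_score(model, texts, labels, scoring="accuracy", cv=2)
    cv_means[name] = scores.mean()

print(cv_means)
```

On the real data, the same loop could collect the rows for the results table directly, which would also avoid copy-and-paste slips between cells.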

Now, the performance scores are shown in tabular form.

In [None]:
t = PrettyTable(['Features', 'Accuracy', 'F1 score', 'Cross Validation Score'])

t.add_row(['Unigrams only', round(accuracy_unigram_nb, 2), round(f1_unigram_nb, 2), round(cx_v_uni_nb.mean(), 2)])
t.add_row(['Bigrams only', round(accuracy_bigram_nb, 2), round(f1_bigram_nb, 2), round(cx_v_bi_nb.mean(), 2)])
t.add_row(['Trigrams only', round(accuracy_trigram_nb, 2), round(f1_trigram_nb, 2), round(cx_v_tri_nb.mean(), 2)])
t.add_row(['Unigrams + Bigrams', round(accuracy_uni_bi_nb, 2), round(f1_uni_bi_nb, 2), round(cx_v_uni_bi_nb.mean(), 2)])
t.add_row(['Bigrams + Trigrams', round(accuracy_bi_tri_nb, 2), round(f1_bi_tri_nb, 2), round(cx_v_bi_tri_nb.mean(), 2)])
t.add_row(['Unigrams + Bigrams + Trigrams', round(accuracy_uni_bi_tri_nb, 2), round(f1_uni_bi_tri_nb, 2), round(cx_v_uni_bi_tri_nb.mean(), 2)])
t.add_row(['TF-IDF', round(accuracy_tf_idf_nb, 2), round(f1_tf_idf_nb, 2), round(cx_v_tf_idf_nb.mean(), 2)])
t.add_row(['Repeated Punctuation + Num Words', round(accuracy_punc_num_nb, 2), round(f1_punc_num_nb, 2), round(cx_v_punc_num_nb.mean(), 2)])
t.add_row(['Hashing', round(accuracy_hash_nb, 2), round(f1_hash_nb, 2), round(cx_v_hash_nb.mean(), 2)])

print("Naive Bayes Performance")
print(t)

Naive Bayes Performance
+-------------------------------+----------+----------+------------------------+
|            Features           | Accuracy | F1 score | Cross Validation Score |
+-------------------------------+----------+----------+------------------------+
|         Unigrams only         |   0.8    |   0.8    |          0.8           |
|          Bigrams only         |   0.61   |   0.58   |          0.62          |
|         Trigrams only         |   0.54   |   0.37   |          0.53          |
|       Unigrams + Bigrams      |   0.8    |   0.8    |          0.8           |
|       Bigrams + Trigrams      |   0.63   |   0.58   |          0.62          |
| Unigrams + Bigrams + Trigrams |   0.79   |   0.79   |          0.8           |
|             TF-IDF            |   0.79   |   0.79   |          0.8           |
| Repeated Punction + Num Words |   0.47   |   0.34   |          0.48          |
|            Hashing            |   0.79   |   0.8    |          0.81          |
+-------------------------------+----------+----------+------------------------+

In [None]:
t = PrettyTable(['Features', 'Accuracy', 'F1 score', 'Cross Validation Score'])

t.add_row(['Unigrams only', round(accuracy_unigram_svm, 2), round(f1_unigram_svm, 2), round(cx_v_uni_svm.mean(), 2)])
t.add_row(['Bigrams only', round(accuracy_bigram_svm, 2), round(f1_bigram_svm, 2), round(cx_v_bi_svm.mean(), 2)])
t.add_row(['Trigrams only', round(accuracy_trigram_svm, 2), round(f1_trigram_svm, 2), round(cx_v_tri_svm.mean(), 2)])
t.add_row(['Unigrams + Bigrams', round(accuracy_uni_bi_svm, 2), round(f1_uni_bi_svm, 2), round(cx_v_uni_bi_svm.mean(), 2)])
t.add_row(['Bigrams + Trigrams', round(accuracy_bi_tri_svm, 2), round(f1_bi_tri_svm, 2), round(cx_v_bi_tri_svm.mean(), 2)])
t.add_row(['Unigrams + Bigrams + Trigrams', round(accuracy_uni_bi_tri_svm, 2), round(f1_uni_bi_tri_svm, 2), round(cx_v_uni_bi_tri_svm.mean(), 2)])
t.add_row(['TF-IDF', round(accuracy_tf_idf_svm, 2), round(f1_tf_idf_svm, 2), round(cx_v_tf_idf_svm.mean(), 2)])
t.add_row(['Repeated Punctuation + Num Words', round(accuracy_punc_svm, 2), round(f1_punc_svm, 2), round(cx_v_punc_svm.mean(), 2)])
t.add_row(['Hashing', round(accuracy_hash_svm, 2), round(f1_hash_svm, 2), round(cx_v_hash_svm.mean(), 2)])

print("SVM Performance")
print(t)

SVM Performance
+-------------------------------+----------+----------+------------------------+
|            Features           | Accuracy | F1 score | Cross Validation Score |
+-------------------------------+----------+----------+------------------------+
|         Unigrams only         |   0.79   |   0.78   |          0.79          |
|          Bigrams only         |   0.6    |   0.55   |          0.61          |
|         Trigrams only         |   0.54   |   0.37   |          0.53          |
|       Unigrams + Bigrams      |   0.79   |   0.79   |          0.8           |
|       Bigrams + Trigrams      |   0.62   |   0.56   |          0.61          |
| Unigrams + Bigrams + Trigrams |   0.8    |   0.8    |          0.8           |
|             TF-IDF            |   0.8    |   0.8    |          0.8           |
| Repeated Punction + Num Words |   0.44   |   0.44   |          0.54          |
|            Hashing            |   0.79   |   0.79   |          0.8           |
+-------------------------------+----------+----------+------------------------+

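For both classification models, tf-idf and the combination of unigram and bigram features are among the best performers: every feature set that includes unigrams reaches roughly 0.80 accuracy. Bigram-only and trigram-only features perform much worse (about 0.6 and 0.54 accuracy), and the repeated punctuation and word count features score lowest. It can be inferred that increasing the value of n in n-gram feature engineering does not improve the accuracy of the models for this particular task.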

# Error Analysis

Here, we will dig out some samples of correctly and incorrectly predicted headlines for Naive Bayes when tf-idf is used as the feature extraction method.

In [None]:
vectorizer = TfidfVectorizer(norm='l2', ngram_range=(1, 2), min_df=5)
features = vectorizer.fit_transform(df.headline)

X = features
indices = range(features.shape[0])

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, df.is_sarcastic, indices, test_size=0.2, random_state=0)

def show_prediction(y_pred, y_test):
    # re-read the raw file (same path as in cell 3) so the original headlines, with stopwords, are shown
    df_raw = pd.DataFrame(parseJson('./Sarcasm_Headlines.json'))

    # look up the headline of every test sample by its original row index
    get_headlines = [df_raw.headline[i] for i in indices_test]

    data = list(zip(get_headlines, y_test, y_pred))
    df_result = pd.DataFrame(data, columns=["headline", "original class", "predicted class"])
    return df_result

### Multinomial NB
classifier = MultinomialNB(alpha=1, fit_prior=True)
classifier.fit(X_train, y_train)

# predicting the Test set results
y_pred = classifier.predict(X_test)

results = show_prediction(y_pred, y_test)

Let's print some of the headlines which are correctly/incorrectly predicted.

In [None]:
results.head(20)

From the above result, we will pick three examples that did not work, as suggested in the question.

In [None]:
pd.options.display.max_colwidth = 100
pd.set_option('display.max_columns', None)
print(results.loc[[6, 16, 18]])

1. We can see that the first headline is not sarcastic, but it is classified as sarcastic.

2. In the second incorrect prediction above, the headline is sarcastic, but it is classified as non-sarcastic.

3. The third headline is classified as non-sarcastic, which is also incorrect.

In these examples, the classifier is unable to catch real sarcasm from tf-idf features. This might be because the data contains too few related headlines. Another reason might be the lack of context associated with the text. If a connection could be made between the article content and the headline, the model might predict the correct class.
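
The misclassified rows discussed above were picked by index from `results`. A boolean filter can list all of them at once, as in the sketch below. `demo_results` is a made-up frame that only borrows the column names of `results`, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows using the same column names as the `results` DataFrame.
demo_results = pd.DataFrame(
    {
        "headline": ["headline a", "headline b", "headline c", "headline d"],
        "original class": [1, 0, 1, 0],
        "predicted class": [1, 1, 0, 0],
    }
)

# Keep only the rows where the prediction disagrees with the label.
wrong = demo_results[demo_results["original class"] != demo_results["predicted class"]]

# Split the errors by direction.
false_positives = wrong[wrong["predicted class"] == 1]  # predicted sarcastic, actually not
false_negatives = wrong[wrong["predicted class"] == 0]  # predicted non-sarcastic, actually sarcastic
print(wrong)
```

Separating the two error directions makes it easier to check whether the model tends to miss sarcasm or to over-predict it.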



# Improvement

### How could the performance of the classifier be improved further?
The performance could be improved in the following ways.
1. By increasing the data size
2. By introducing a new feature "how_sarcastic" which stores the sarcasm level of specific words present in the text (e.g. `nation's voyeurs` in the second example we discussed above). Such words can carry a high level of sarcasm.
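
One way the proposed "how_sarcastic" feature might look is sketched below. Everything here is hypothetical: the lexicon, its weights, and the choice to take the maximum weight are invented for illustration, and in practice the weights would have to be estimated from training data or annotated by hand.

```python
# Hypothetical lexicon mapping phrases to a sarcasm weight in [0, 1]; the values are invented.
SARCASM_LEXICON = {
    "nation's voyeurs": 0.9,
    "area man": 0.8,
    "totally": 0.4,
}

def how_sarcastic(headline, lexicon):
    """Return the highest weight of any lexicon phrase found in the headline, else 0.0."""
    weights = [weight for phrase, weight in lexicon.items() if phrase in headline]
    return max(weights, default=0.0)

score = how_sarcastic(
    "nation's voyeurs watch women's march on washington from bushes",
    SARCASM_LEXICON,
)
print(score)
```

The resulting value could be appended as an extra column next to the n-gram or tf-idf features.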


### What role, if any, could the associated full news articles play in boosting the performance?
The associated full news article might boost the performance at the expense of additional resources. If the extra resource cost is ignored, the full article will help increase the performance, provided we can connect the headline with its context.

For example:
`nation's voyeurs watch women's march on washington from bushes`
In this example, we can figure out from the article who `nation's voyeurs` refers to. To my knowledge (I might be wrong), the full article does not explicitly use this type of expression. If the model cannot find such an expression in the full article, then it could classify the headline as sarcastic.

# References


1. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011