### Detection of Irony and Sarcasm in Text

This notebook implements a system to detect irony and sarcasm in text. The data used is from the SemEval-2018 Task 3 on irony detection (subtask A training file).

Submitted By: Suruchi Gupta
Student Id: 19233027
Course Code: 1MAI1

# Task 1 (5 Marks)

Read all the data and find the size of vocabulary of the dataset (ignoring case) and the number of positive and negative examples.

In [1]:
# Importing necessary packages for preprocessing of text
from nltk import word_tokenize
import pandas as pd
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/suruchi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Reading the index, class labels and tweets from the input file and storing them in a dataframe
vocabulary = []  # (index, class label, tokenized tweet) tuples

def build_vocab():
    vocab_dict = {"Index": [], "Class Label": [], "Tweet": []}

    # The context manager closes the file once all rows are read
    with open("SemEval2018-T3-train-taskA.txt", encoding="UTF-8") as file:
        data = file.readlines()
    del data[0]  # deleting header row

    for row in data:
        cells = row.split('\t')
        class_label = int(cells[1])
        tweet = cells[2].lower()  # lower-casing so the vocabulary ignores case
        vocab_dict["Index"].append(cells[0])
        vocab_dict["Class Label"].append(class_label)
        vocab_dict["Tweet"].append(tweet)
        vocabulary.append((int(cells[0]), class_label, word_tokenize(tweet)))
    return pd.DataFrame(vocab_dict)

data = build_vocab()

# Printing sample data
data[0:20]

|   | Index | Class Label | Tweet |
|---|-------|-------------|-------|
| 0 | 1 | 1 | sweet united nations video. just in time for c... |
| 1 | 2 | 1 | @mrdahl87 we are rumored to have talked to erv... |
| 2 | 3 | 1 | hey there! nice to see you minnesota/nd winter... |
| 3 | 4 | 0 | 3 episodes left i'm dying over here\n |
| 4 | 5 | 1 | "i can't breathe!" was chosen as the most nota... |
| 5 | 6 | 0 | you're never too old for footie pajamas. http:... |
| 6 | 7 | 1 | nothing makes me happier then getting on the h... |
| 7 | 8 | 0 | 4:30 an opening my first beer now gonna be a l... |
| 8 | 9 | 0 | @adam_klug do you think you would support a gu... |
| 9 | 10 | 0 | @samcguigan544 you are not allowed to open tha... |


In [3]:
# Finding the size of vocabulary of the data
raw_data = data
tweets = raw_data["Tweet"].str.cat(sep=" ")
unique_words = word_tokenize(tweets)
unique_words = set(unique_words)
print("Length of vocabulary: ",len(unique_words))

Length of vocabulary:  13442
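
The vocabulary size depends on both the tokenizer and on case folding. The sketch below illustrates the effect of ignoring case on two made-up tweets. The `vocabulary` helper and its regex tokenizer are assumptions for illustration only; they are a simplified stand-in for NLTK's `word_tokenize` and will not reproduce the count above.

```python
import re

def vocabulary(texts, lowercase=True):
    """Return the set of distinct tokens in texts.

    The regex tokenizer is a simplified stand-in, not NLTK's word_tokenize.
    """
    vocab = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        # Words (with optional internal apostrophes) or single punctuation marks
        vocab.update(re.findall(r"\w+(?:'\w+)*|[^\w\s]", text))
    return vocab

# Made-up example tweets
sample_tweets = ["Great, another Monday!", "great weather on monday"]
cased = vocabulary(sample_tweets, lowercase=False)
folded = vocabulary(sample_tweets, lowercase=True)
print(len(cased), len(folded))
```

Lower-casing merges "Great"/"great" and "Monday"/"monday", so the folded vocabulary is smaller than the cased one.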


In [4]:
# Finding the number of negative and positive examples
positive_exs = raw_data.loc[raw_data['Class Label'] == 1]
negative_exs = raw_data.loc[raw_data['Class Label'] == 0]
print("Number of positive examples: ",len(positive_exs)," and negative examples: ",len(negative_exs))

Number of positive examples:  1911  and negative examples:  1923


From the number of positive and negative examples we can see that the data is not skewed: the two classes are almost equally represented (1911 vs. 1923), so neither class dominates during learning.
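
As a quick check of the balance, the class shares can be computed from the counts reported above. This is a minimal sketch; the `class_balance` helper is a hypothetical name, and the majority-class share is included because it is the accuracy a classifier would reach by always predicting the larger class.

```python
from collections import Counter

def class_balance(labels):
    """Return the share of each class label in the data."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts taken from the output above: 1911 positive, 1923 negative
labels = [1] * 1911 + [0] * 1923
shares = class_balance(labels)
majority_baseline = max(shares.values())
print({label: round(share, 3) for label, share in shares.items()})
print("Majority-class baseline accuracy:", round(majority_baseline, 3))
```

Any model trained on this data should be judged against that roughly 50% baseline.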

# Task 2 (15 Marks)
Divide the data into a training and test set and justify your split.

Implement a function that calculates the precision, recall and F-Measure for this task.

We use 70% of the data for training and hold out the remaining 30% for testing. 
With 3,834 tweets, this leaves the training phase an ample amount of data to learn from 
while keeping a test set of about 1,150 tweets, large enough for a stable performance estimate. 
The held-out set matters because a model can overfit, that is, memorise the training examples instead of the underlying characteristics; overfitting only shows up as poor performance on data the model has not seen.

To divide the data into training and test sets we use train_test_split() from the scikit-learn package.
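
To make the split sizes concrete, here is a pure-Python sketch of a seeded 70/30 shuffle split. The `shuffle_split` helper is hypothetical and is not scikit-learn's implementation; its random number generator differs, so the selected indices will not match those of `train_test_split`, only the set sizes.

```python
import random

def shuffle_split(n_samples, train_size=0.70, seed=28):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_size)  # rounds down
    return indices[:n_train], indices[n_train:]

# 3834 = 1911 positive + 1923 negative examples
train_idx, test_idx = shuffle_split(3834)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, which is needed when comparing different models on the same test set.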

In [5]:
# Vectorizing the data and splitting into train and test sets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

#Using CountVectorizer to retrieve feature vectors from the text
vectorizer = CountVectorizer(analyzer = 'word')
features = vectorizer.fit_transform(raw_data["Tweet"])
X_train, X_test, y_train, y_test  = train_test_split(features, raw_data["Class Label"], train_size=0.70, random_state=28)
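
The features produced above are bag-of-words counts: one column per vocabulary word, one row per tweet. The sketch below shows the idea on two made-up texts. The `bag_of_words` helper is an assumption for illustration, and its whitespace tokenization is a simplification; `CountVectorizer` applies its own token pattern and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(texts):
    """Return (sorted vocabulary, list of count vectors), one vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this text
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["so so happy", "happy monday"])
print(vocab)
print(vectors)
```

Word order is discarded in this representation, which is one reason a sequence model is tried later in the notebook.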

In [6]:
# Adding the method to calculate accuracy, precision, recall and F-score for the task
from sklearn.metrics import classification_report, accuracy_score 

def evaluation_metrics(y_test,y_pred):
    print("Accuracy: ",round(accuracy_score(y_test,y_pred), 3))
    print("Classification Report: \n",classification_report(y_test, y_pred))
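
The function above delegates to scikit-learn's `classification_report`. As a sketch of what those numbers mean, precision, recall and F-measure can also be computed directly from the true-positive, false-positive and false-negative counts. The `precision_recall_f1` helper and the toy labels are assumptions for illustration, not part of the original notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Calling the function once with `positive=1` and once with `positive=0` gives the per-class rows that `classification_report` prints.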

# Task 3 (15 Marks)

Suggest some features to extract from each sentence. Implement a simple log-linear model to classify tweets as ironic or not ironic.

Train this method and evaluate the results using precision, recall and F-Measure

In [7]:
# Training a simple Logistic Regression model and evaluating the same using method defined above
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg = log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

evaluation_metrics(y_test, y_pred)

Accuracy:  0.626
Classification Report: 
               precision    recall  f1-score   support

           0       0.63      0.62      0.63       583
           1       0.62      0.63      0.62       568

    accuracy                           0.63      1151
   macro avg       0.63      0.63      0.63      1151
weighted avg       0.63      0.63      0.63      1151
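
A log-linear (logistic regression) model scores a tweet by passing a weighted sum of its features through the sigmoid function. The sketch below shows that computation for bag-of-words counts. The weights, bias and the `predict_proba` helper are hypothetical values chosen for illustration; they are not the parameters learned above.

```python
import math

def predict_proba(counts, weights, bias):
    """Probability of the positive (ironic) class for a bag-of-words count dict."""
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push towards "ironic"
weights = {"love": 0.75, "mondays": 0.5, "thanks": -0.25}
bias = -1.25

p_tweet = predict_proba({"love": 1, "mondays": 1}, weights, bias)
p_empty = predict_proba({}, weights, bias)
print(round(p_tweet, 3), round(p_empty, 3))
```

A tweet is classified as ironic when the probability reaches 0.5, which is equivalent to the weighted sum being non-negative.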





# Task 4 (25 Marks)

Develop an acceptor or a transducer recurrent neural network that classifies the sentence as ironic or not ironic.

Evaluate this according to precision, recall or F-Measure

In [8]:
# Developing a Recurrent Neural Network to classify texts as ironic or not ironic 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Activation
import pandas as pd
import numpy as np

# Tokenizing and padding the data to be used by the word embeddings
tokenizer = Tokenizer()
tokenizer.fit_on_texts(raw_data["Tweet"])
X = tokenizer.texts_to_sequences(raw_data["Tweet"])
X = pad_sequences(X, maxlen=35)

# Splitting data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X,raw_data["Class Label"], train_size=0.70, random_state=28)

# Building the keras sequential model and adding required layers
model = Sequential()
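# Keras Tokenizer indices start at 1, so the embedding needs len(word_index) + 1 rows
vocab_size = len(tokenizer.word_index) + 1
model.add(Embedding(vocab_size, 128))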
model.add(LSTM(128))
model.add(Dense(1, activation="sigmoid"))

# Initialising the optimizer and building the model
model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])

# Training the model 
model.fit(X_train, y_train,batch_size=28,epochs=15,validation_data=(X_test, y_test))
score, accuracy = model.evaluate(X_test, y_test,batch_size=28)
print("Score on Test Data: ", score)
print("Accuracy on Test Data: ", accuracy)

Using TensorFlow backend.



Train on 2683 samples, validate on 1151 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Score on Test Data:  2.732117034786168
Accuracy on Test Data:  0.6125108599662781
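
The "Score" returned by `model.evaluate` is the loss the model was compiled with, here binary cross-entropy, not an accuracy. A loss well above 1 alongside roughly 61% accuracy suggests that many of the wrong predictions are made with high confidence, which is often a sign of overfitting. The sketch below, with made-up probabilities and a hypothetical `binary_cross_entropy` helper, shows how two sets of predictions with the same accuracy can have very different losses.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true_toy = [1, 0, 1, 0]
# Both prediction sets get the first two examples right and the last two wrong
cautious = [0.6, 0.4, 0.4, 0.6]
overconfident = [0.999, 0.001, 0.001, 0.999]

loss_cautious = binary_cross_entropy(y_true_toy, cautious)
loss_overconfident = binary_cross_entropy(y_true_toy, overconfident)
print(round(loss_cautious, 3), round(loss_overconfident, 3))
```

Both cases are 50% accurate, yet the overconfident predictions are penalised far more heavily.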


# Task 5 (40 Marks)

Suggest an improvement to either the system developed in Task 3 or 4 and show that it improves according to your evaluation metric.

Please note this task is marked according to: demonstration of knowledge from the lectures (10), originality and appropriateness of solution (10), completeness of description (10), technical correctness (5) and improvement in evaluation metric (5).

One of the most important steps in improving model performance is improving the quality of the input data. Here, the tweets can be preprocessed before training. One such step is removing the URLs from the text, since they are unlikely to contribute to the detection of irony in the tweets.

In [9]:
# Removing the URLs from the tweets for preprocessing
import re

def removing_URL(data):
    final_data = []
    for text in data:
        # \S+ stops at the first whitespace, so any text after the URL is kept
        text = re.sub(r"https?://\S+", "", text)
        final_data.append(text.lower())
    return final_data

processed_raw_data = removing_URL(raw_data["Tweet"])
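
A URL-stripping pattern is worth checking on a sample before applying it to the whole dataset. The sketch below uses a made-up tweet and a hypothetical `strip_urls` helper to compare a pattern that stops at whitespace (`\S+`) with a greedy one (`.+`), which also consumes everything after the URL on the same line.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    return " ".join(URL_PATTERN.sub("", text).split())

# Made-up example tweet with text after the link
tweet = "love mondays http://t.co/abc123 #not"

kept_hashtag = strip_urls(tweet)
lost_hashtag = re.sub(r"https?://.+", "", tweet).strip()
print(repr(kept_hashtag))
print(repr(lost_hashtag))
```

Hashtags such as `#not` can be strong irony cues, so losing the text after a link may remove useful signal.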

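Using a Bidirectional LSTM as an extension of the traditional LSTM may improve classification performance. A Bidirectional LSTM processes each tweet in both the left-to-right and the right-to-left direction, so the resulting representation draws on context from both sides. The extra pass adds parameters and training time, and dropout layers are included to limit overfitting.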
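
The bidirectional idea can be illustrated without a neural network. In the toy sketch below a running sum stands in for the LSTM cell (an assumption made purely for illustration): one pass reads the sequence left to right, a second pass reads it right to left, and the two states are paired at each position. Keras's `Bidirectional` wrapper combines the two directions in a similar way, by concatenation by default.

```python
def running_sum(values):
    """Toy recurrence: state_t = state_{t-1} + x_t (stand-in for an LSTM cell)."""
    state, states = 0, []
    for x in values:
        state += x
        states.append(state)
    return states

def bidirectional(values):
    """Pair the forward state with the backward state at every position."""
    forward = running_sum(values)
    # Run the same recurrence on the reversed input, then restore the original order
    backward = running_sum(values[::-1])[::-1]
    return list(zip(forward, backward))

states = bidirectional([1, 2, 3])
print(states)
```

At each position the first value summarises everything up to that token and the second everything from that token onwards, which is the extra context a unidirectional pass lacks.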

In [None]:
# Developing a Recurrent Neural Network to classify texts as ironic or not ironic (using a Bidirectional LSTM)
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Dropout, Bidirectional

# Tokenizing and padding the data to be used by the word embeddings
tokenizer = Tokenizer()
tokenizer.fit_on_texts(processed_raw_data)
X = tokenizer.texts_to_sequences(processed_raw_data)
X = pad_sequences(X, maxlen=35)

# Splitting data into training and testing set (same seed as in Task 4 so the results are comparable)
X_train, X_test, y_train, y_test = train_test_split(X, raw_data["Class Label"], train_size=0.70, random_state=28)

# Building the keras sequential model and adding required layers
# Keras Tokenizer indices start at 1, so the embedding needs len(word_index) + 1 rows
vocab_size = len(tokenizer.word_index) + 1
model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(128, return_sequences=True, recurrent_dropout=0.2))
model.add(Dropout(0.1))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.2))
model.add(Dense(10, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

# Initialising the optimizer and building the model
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Training the model
model.fit(X_train, y_train, batch_size=25, epochs=15, validation_data=(X_test, y_test))
score, accuracy = model.evaluate(X_test, y_test, batch_size=25)
print("Score on Test Data: ", score)
print("Accuracy on Test Data: ", accuracy)

Train on 2683 samples, validate on 1151 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
 400/2683 [===>..........................] - ETA: 8s - loss: 0.0106 - accuracy: 0.9975