<a href="https://colab.research.google.com/github/KrisSandy/ExMachineLearning/blob/master/Irony_Detection_Traditional_and_NN_approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IRONY DETECTION - Traditional and NN approach

We will use the data from the SemEval-2018 task on irony detection. The file `SemEval2018-T3-train-taskA.txt` consists of examples as follows:

```csv
Tweet index     Label   Tweet text
1       1       Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR
2       1       @mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)
3       1       Hey there! Nice to see you Minnesota/ND Winter Weather 
4       0       3 episodes left I'm dying over here
```


# Data Extraction

Read all the data and find the size of vocabulary of the dataset (ignoring case) and the number of positive and negative examples.

#### import required libraries

In [0]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from sklearn.model_selection import train_test_split
import string
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

#### Downloading the data file

In [0]:
!wget https://transfer.sh/DnXTx/SemEval2018-T3-train-taskA.txt

--2019-03-02 13:45:02--  https://transfer.sh/DnXTx/SemEval2018-T3-train-taskA.txt
Resolving transfer.sh (transfer.sh)... 144.76.136.153
Connecting to transfer.sh (transfer.sh)|144.76.136.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380455 (372K) [text/plain]
Saving to: ‘SemEval2018-T3-train-taskA.txt’


2019-03-02 13:45:04 (892 KB/s) - ‘SemEval2018-T3-train-taskA.txt’ saved [380455/380455]



#### Calculate stats

In [0]:
def get_stats(data):
  """Print the example count, class balance and case-insensitive vocabulary size."""
  n = len(data)
  n_pos = sum(data['Label'])  # labels are 0/1, so the sum counts ironic tweets
  n_neg = n - n_pos
  vocab = set()

  for text in data['Tweet text']:
    # Lower-case before tokenising so the vocabulary ignores case
    vocab.update(word_tokenize(text.lower()))

  print("Total number of examples : ", n)
  print("Number of positive examples : ", n_pos)
  print("Number of negative examples : ", n_neg)
  print("Size of Vocabulary : ", len(vocab))

In [0]:
data = pd.read_csv('SemEval2018-T3-train-taskA.txt', sep='\t')
get_stats(data)

Total number of examples :  3817
Number of positive examples :  1901
Number of negative examples :  1916
Size of Vocabulary :  13460


# Naive Bayes Model

Develop a classifier using the Naive Bayes model to predict if an example is ironic. The model should convert each Tweet into a bag-of-words and calculate

$p(\text{Ironic}|w_1,\ldots,w_n) \propto \prod_{i=1,\ldots,n} p(w_i \in \text{tweet}| \text{Ironic}) p(\text{Ironic})$

$p(\text{NotIronic}|w_1,\ldots,w_n) \propto \prod_{i=1,\ldots,n} p(w_i \in \text{tweet}| \text{NotIronic}) p(\text{NotIronic})$

You should use add-alpha smoothing to calculate probabilities

#### Naive Bayes

In the Naive Bayes model, the posterior probability is calculated for each class (Ironic and NotIronic), and the class with the highest probability is predicted. The probability of a sentence being ironic or not is calculated as shown in the equations above.

To calculate the probability of each word given a class, a vocabulary is created for each class from the training examples, and the corresponding word counts are maintained.

The probability of a word given a class is then calculated as

$p(w_i|Ironic) = \frac{c(w_{i, Ironic})}{c(w_{Ironic})}$

$\text{where }c(w_{i, Ironic}) = \text{count of word } w_i \text{ in Ironic vocabulary}$

$\text{and }c(w_{Ironic}) = \text{count of all words in Ironic vocabulary}$

$p(w_i|NotIronic) = \frac{c(w_{i, NotIronic})}{c(w_{NotIronic})}$

$\text{where }c(w_{i, NotIronic}) = \text{count of word } w_i \text{ in NotIronic vocabulary}$

$\text{and }c(w_{NotIronic}) = \text{count of all words in NotIronic vocabulary}$

#### Add $\alpha$ Smoothing

In the calculations above, a word that does not occur in a class's training data gets probability zero, which makes the whole product zero. To avoid this we use add-$\alpha$ smoothing.

Probability calculation using add-$\alpha$ smoothing:


$p(w_i|Ironic) = \frac{c(w_{i, Ironic}) + \alpha}{c(w_{Ironic})+\alpha|v|}$

$p(w_i|NotIronic) = \frac{c(w_{i, NotIronic}) + \alpha}{c(w_{NotIronic})+\alpha|v|}$

$\text{where } |v| = \text{size of the vocabulary (number of distinct words)}$
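As a small worked example of the smoothing formula, the sketch below computes smoothed probabilities for one class. The helper name `smoothed_prob`, the toy counts and the three-word vocabulary are assumptions made for illustration; they are not taken from the dataset.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab_size, alpha=1.0):
    """Add-alpha estimate of p(word | class) from that class's word counts."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

# Toy counts for one class (made up for illustration)
ironic_counts = Counter({"great": 3, "monday": 1})
vocab = {"great", "monday", "traffic"}  # assumed vocabulary, |v| = 3

p_seen = smoothed_prob("great", ironic_counts, len(vocab))      # (3 + 1) / (4 + 3)
p_unseen = smoothed_prob("traffic", ironic_counts, len(vocab))  # (0 + 1) / (4 + 3)
print(p_seen, p_unseen)
```

The unseen word "traffic" now gets a small non-zero probability, and the smoothed probabilities over the three vocabulary words still sum to one.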

The classifier above is implemented in the functions below.


*   The `fit` function takes the training data and calculates $P(Ironic)$ and $P(NotIronic)$, creates two vocabularies, one from ironic tweets and the other from non-ironic tweets, and stores the corresponding word counts.
*   The `predict` function takes the list of words in each tweet and, using add-$\alpha$ smoothing, calculates the posterior scores $p(\text{Ironic}|w_1,\ldots,w_n)$ and $p(\text{NotIronic}|w_1,\ldots,w_n)$. It returns a flag (1 for ironic, 0 for not ironic) indicating the predicted class of each tweet.



#### Implementation

The `fit` function below takes the training data, calculates P(Ironic) and P(NotIronic), and builds the word counts for the ironic and non-ironic vocabularies.

In [0]:
def fit(train):
  """Estimate class priors and per-class word counts from (index, label, words) rows."""
  n = len(train)
  n_pos = 0
  f_vocab_pos = Counter()
  f_vocab_neg = Counter()
  for row in train:
    if row[1] == 1:
      n_pos += 1
      f_vocab_pos.update(row[2])
    elif row[1] == 0:
      f_vocab_neg.update(row[2])
    else:
      raise ValueError("Unknown Label")

  cache = dict()
  cache['f_vocab_pos'] = f_vocab_pos
  cache['f_vocab_neg'] = f_vocab_neg
  cache['p_pos'] = n_pos/n
  cache['p_neg'] = 1 - cache['p_pos']
  return cache

The `predict` function below takes the list of words in each tweet and, using add-$\alpha$ smoothing, calculates the posterior scores $p(\text{Ironic}|w_1,\ldots,w_n)$ and $p(\text{NotIronic}|w_1,\ldots,w_n)$. It returns a flag for each tweet indicating the predicted class.

In [0]:
def predict(cache, test, alpha=1):
  """Return a 0/1 prediction (1 = ironic) for each (index, label, words) row."""
  f_vocab_pos = cache['f_vocab_pos']
  f_vocab_neg = cache['f_vocab_neg']
  n_words_pos = sum(f_vocab_pos.values())
  n_words_neg = sum(f_vocab_neg.values())
  # Each class is smoothed with the size of its own vocabulary
  size_vocab_pos = len(f_vocab_pos)
  size_vocab_neg = len(f_vocab_neg)

  predictions = list()
  for row in test:
    p_text_pos = 1
    p_text_neg = 1
    for word in row[2]:
      # A Counter returns 0 for unseen words, so they need no explicit entry.
      # Note: the class prior is multiplied in once per word here, not once
      # per tweet as in the equations above.
      p_text_pos *= (((f_vocab_pos[word] + alpha) /
                      (n_words_pos + alpha * size_vocab_pos)) *
                       cache['p_pos'])
      p_text_neg *= (((f_vocab_neg[word] + alpha) /
                      (n_words_neg + alpha * size_vocab_neg)) *
                       cache['p_neg'])
    predictions.append(int(p_text_pos > p_text_neg))

  return predictions
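For comparison, here is a minimal sketch of the same decision rule computed in log space, which avoids multiplying many small probabilities. It is a variation, not the function used for the results in this notebook: it assumes a single vocabulary shared by both classes and applies the prior once per tweet, so its accuracy may differ from the figures reported below. The name `predict_log` and the toy counts are made up for illustration.

```python
import math
from collections import Counter

def predict_log(cache, rows, alpha=1.0):
    """Naive Bayes prediction in log space; returns 1 (ironic) or 0 per row."""
    counts_pos, counts_neg = cache['f_vocab_pos'], cache['f_vocab_neg']
    n_pos, n_neg = sum(counts_pos.values()), sum(counts_neg.values())
    # Assumption: one vocabulary shared by both classes
    vocab_size = len(set(counts_pos) | set(counts_neg))

    predictions = []
    for _, _, words in rows:
        # The class prior enters once per tweet
        log_pos = math.log(cache['p_pos'])
        log_neg = math.log(cache['p_neg'])
        for w in words:
            log_pos += math.log((counts_pos[w] + alpha) / (n_pos + alpha * vocab_size))
            log_neg += math.log((counts_neg[w] + alpha) / (n_neg + alpha * vocab_size))
        predictions.append(int(log_pos > log_neg))
    return predictions

# Toy cache in the same shape that fit() returns (counts are made up)
toy_cache = {
    'f_vocab_pos': Counter({'great': 3, 'monday': 2}),
    'f_vocab_neg': Counter({'great': 1, 'lunch': 4}),
    'p_pos': 0.5,
    'p_neg': 0.5,
}
toy_rows = [(0, 1, ['great', 'monday']), (1, 0, ['lunch'])]
toy_predictions = predict_log(toy_cache, toy_rows)
print(toy_predictions)
```

The sketch requires `alpha > 0` and non-zero priors, since `math.log(0)` is undefined.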

# Train and Test above Model

Divide the data into a training and test set and justify your split.

Choose a suitable evaluation metric and implement it. Explain why you chose this evaluation metric.

Evaluate the method in Task 2 according to this metric.

#### Split the dataset into test and train

The data has been split into 80% training and 20% testing. The counts extracted from both sets show that ironic and non-ironic examples are distributed roughly evenly in each. This indicates that neither set is noticeably biased towards one class and that both classes have enough examples for training and testing the model.

In [0]:
train_raw, test_raw = train_test_split(data, test_size=0.20, random_state = 20)
print("Train dataset stats")
print("-------------------")
get_stats(train_raw)
print("\nTest dataset stats")
print("------------------")
get_stats(test_raw)

Train dataset stats
-------------------
Total number of examples :  3053
Number of positive examples :  1513
Number of negative examples :  1540
Size of Vocabulary :  11466

Test dataset stats
------------------
Total number of examples :  764
Number of positive examples :  388
Number of negative examples :  376
Size of Vocabulary :  4058


Convert the training and test sets to lists of (index, label, list of words) tuples.


In [0]:
train = list()
test = list()
# Lower-case and tokenise each tweet, keeping its index and label
for index, row in train_raw.iterrows():
  words = word_tokenize(row['Tweet text'].lower())
  train.append((index, row['Label'], words))
for index, row in test_raw.iterrows():
  words = word_tokenize(row['Tweet text'].lower())
  test.append((index, row['Label'], words))

#### Evaluation Metric

To evaluate the classifier, I have used accuracy and F1 score. As the dataset is evenly balanced, accuracy can be used to check the overall performance across both classes, while the F1 score gives the harmonic mean of precision and recall. The `evaluate` function below computes accuracy; precision, recall and F1 are computed by `evaluate_new` later in the notebook.

$Accuracy = \frac{\text{tp + tn}}{\text{total examples}}$



In [0]:
def evaluate(y_predict, y):
  
  n_correct = 0
  
  if len(y_predict) != len(y):
    raise Exception("Input vectors are of different lengths")
    
  for i in range(len(y)):
    if y[i] == y_predict[i]:
      n_correct += 1
      
  return n_correct/len(y)

#### Train and evaluate the model

In [0]:
cache = fit(train)
print("Probability of Ironic : ", cache['p_pos'])
print("Probability of non Ironic : ", cache['p_neg'])
print("Vocab size of Ironic : ", len(cache['f_vocab_pos']))
print("Vocab size of non Ironic : ", len(cache['f_vocab_neg']))

Probability of Ironic :  0.4955781198820832
Probability of non Ironic :  0.5044218801179168
Vocab size of Ironic :  6338
Vocab size of non Ironic :  7217


In [0]:
y_test = [r[1] for r in test]
y_nb_predict = predict(cache, test)
accuracy = evaluate(y_nb_predict, y_test)
print("Accuracy : ", accuracy)

Accuracy :  0.6544502617801047


# Train and Test NN with LSTM

Run the following code to generate a model from your training set. The training set should be in a variable  called `train` and is assumed to be of the form:

```
[(1, 1, ['sweet', 'united', 'nations', 'video', '.', 'just', 'in', 'time', 'for', 'christmas', '.', '#', 'imagine', '#', 'noreligion', 'http', ':', '//t.co/fej2v3oubr']), 
 (2, 1, ['@', 'mrdahl87', 'we', 'are', 'rumored', 'to', 'have', 'talked', 'to', 'erv', "'s", 'agent', '...', 'and', 'the', 'angels', 'asked', 'about', 'ed', 'escobar', '...', 'that', "'s", 'hardly', 'nothing', ';', ')']), 
 (3, 1, ['hey', 'there', '!', 'nice', 'to', 'see', 'you', 'minnesota/nd', 'winter', 'weather']), 
 (4, 0, ['3', 'episodes', 'left', 'i', "'m", 'dying', 'over', 'here']), 
 ...
]
 ```



In [0]:
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, Embedding, Dropout, TimeDistributed
from keras.layers import LSTM
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
import numpy as np

## These values should be set from Task 3
# train, test = task3()

def make_dictionary(train, test):
    dictionary = {}
    for d in train+test:
        for w in d[2]:
            if w not in dictionary:
                dictionary[w] = len(dictionary)
    return dictionary

class KerasBatchGenerator(object):
    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        self.current_idx = 0
        self.current_sent = 0
        self.skip_step = skip_step

    def generate(self):
        x = np.zeros((self.batch_size, self.num_steps))
        y = np.zeros((self.batch_size, self.num_steps, 2))
        while True:
            for i in range(self.batch_size):
                # Choose a sentence and position with at least num_steps more words
                while self.current_idx + self.num_steps >= len(self.data[self.current_sent][2]):
                    self.current_idx = self.current_idx % len(self.data[self.current_sent][2])
                    self.current_sent += 1
                    if self.current_sent >= len(self.data):
                        self.current_sent = 0
                # The rows of x are set to values like [1,2,3,4,5]
                x[i, :] = [self.vocabulary[w] for w in self.data[self.current_sent][2][self.current_idx:self.current_idx + self.num_steps]]
                # The rows of y are set to values like [[1,0],[1,0],[1,0],[1,0],[1,0]]
                y[i, :, :] = [[self.data[self.current_sent][1], 1-self.data[self.current_sent][1]]] * self.num_steps
                self.current_idx += self.skip_step
            yield x, y

# Hyperparameters for model
vocabulary = make_dictionary(train, test)
num_steps = 5
batch_size = 20
num_epochs = 50 # Reduce this if the model is taking too long to train (or increase for performance)
hidden_size = 50 # Increase this to improve performance
use_dropout=True

# Create batches for RNN
train_data_generator = KerasBatchGenerator(train, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)
valid_data_generator = KerasBatchGenerator(test, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)

# A double stacked LSTM with dropout and n hidden layers
model = Sequential()
model.add(Embedding(len(vocabulary), hidden_size, input_length=num_steps))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
if use_dropout:
    model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(2)))
model.add(Activation('softmax'))

# Set optimizer and build model (Adam with its default settings)
optimizer = Adam()
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['categorical_accuracy'])

# Train the model
model.fit_generator(train_data_generator.generate(), len(train)//(batch_size*num_steps), num_epochs,
                        validation_data=valid_data_generator.generate(),
                        validation_steps=len(test)//(batch_size*num_steps))

# Save the model
model.save("final_model.hdf5")

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Epoch 1/50
...
Epoch 50/50


Now consider the following code:

In [0]:
model = load_model("final_model.hdf5")

x = np.zeros((1,num_steps))
x[0,:] = [vocabulary["this"],vocabulary["is"],vocabulary["an"],vocabulary["easy"],vocabulary["test"]]
print(model.predict(x))


[[[0.32178226 0.67821777]
  [0.35342664 0.64657336]
  [0.37172422 0.62827575]
  [0.26450235 0.73549765]
  [0.65818816 0.34181184]]]


Using the code above write a function that can predict the label using the LSTM model above and compare it with the evaluation performed in Task 3

The function below uses the model created above to get the ironic and non-ironic probabilities of each word. As the model accepts 5 words at a time, each tweet is processed in windows of 5 words. When a tweet has fewer than 5 words, full stops are appended to pad it to length 5; this is a hack to reach the required input size. When the last part of a longer tweet has fewer than 5 words, the final window is shifted back so that it covers the last 5 words of the tweet, which means some words are scored twice. The probabilities from all windows are then multiplied to give the total probability of the tweet being ironic or not.

In [0]:
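def predict_using_keras_model(test):
  """Predict 0/1 labels (1 = ironic) by scoring each tweet in windows of num_steps words."""
  y_predict = list()

  for row in test:
    current_pos = 0
    to_pos = 0
    p_predict = np.ones(2)  # running product of [P(ironic), P(not ironic)]
    max_i = int(len(row[2])/num_steps)+1
    for i in range(max_i):
      x = np.zeros((1, num_steps))
      to_pos = to_pos+num_steps
      if to_pos > len(row[2]):
        # Last window: shift it back so that it ends at the final word
        to_pos = len(row[2])
        current_pos = to_pos-num_steps
      if current_pos < 0:
        # Tweet shorter than num_steps: pad the end with full stops
        x_temp = [vocabulary['.']]*num_steps
        x_temp[0:len(row[2])] = [vocabulary[w] for w in row[2]]
        x[0, :] = x_temp
      else:
        x[0, :] = [vocabulary[w] for w in row[2][current_pos:to_pos]]
      p_temp = model.predict(x)
      # Multiply this window's per-word probabilities into the running product
      p_predict = p_predict * np.prod(p_temp[0], axis=0)
      current_pos += num_steps

    # Column 0 is the ironic class (the generator encodes y as [label, 1 - label])
    y_predict.append(int(p_predict[0] > p_predict[1]))

  return y_predict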
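
To make the windowing easier to follow, the sketch below reproduces only the index logic of the loop above on plain token lists, without the model. The helper name `tweet_windows` is hypothetical and the token lists are made up for illustration.

```python
def tweet_windows(words, num_steps=5, pad_token='.'):
    """Split a token list into fixed-size windows, mirroring the prediction loop."""
    if len(words) < num_steps:
        # Short tweet: pad the end with the pad token
        return [words + [pad_token] * (num_steps - len(words))]
    windows = []
    start, end = 0, 0
    for _ in range(len(words) // num_steps + 1):
        end += num_steps
        if end > len(words):
            # Final window: shift back so that it ends at the last token
            end = len(words)
            start = end - num_steps
        windows.append(words[start:end])
        start += num_steps
    return windows

short_windows = tweet_windows(['hi', 'there'])
long_windows = tweet_windows(list('abcdefg'))
print(short_windows)
print(long_windows)
```

For the seven-token input the two windows overlap on three tokens, so those tokens contribute twice to the product. When the length is an exact multiple of `num_steps`, the last window is scored twice.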

  

In [0]:
y_lstm_predict = predict_using_keras_model(test)
accuracy = evaluate(y_lstm_predict, y_test)
print("Accuracy : ", accuracy)

Accuracy :  0.5589005235602095


# Improvements to above models

Suggest an improvement to either the system developed in Task 2 or 4 and show that it improves according to your evaluation metric.

Please note this task is marked according to: demonstration of knowledge from the lectures (10), originality and appropriateness of solution (10), completeness of description (10), technical correctness (5) and improvement in evaluation metric (5).

### Preprocessing

Tweets are generally written in informal language, for example "im", "ur", "rofl". They also contain emojis, URLs, user names etc. Preprocessing can be used to:


*   Convert emojis to the emotion they express (sad, happy etc.)
*   Replace words with their lemmas using lemmatisation
*   Replace URLs with the keyword `<URL>`, as these do not add any value to the calculation
*   Replace user names with `<USER>`, as these do not add any value to the calculation

In addition, a model could be built to expand abbreviations such as "rofl" to their full text.

In the implementation below, I have replaced URLs and user names with the keywords `<URL>` and `<USER>` as part of preprocessing.


In [0]:
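import re

def preprocessing(data):
  """Return the tweet texts with URLs and user names replaced by placeholder tokens."""
  tweets = list()

  for index, row in data.iterrows():
    # Select the text by column name rather than by position
    tweet = row['Tweet text']

    # Replace URLs with the token <URL>
    tweet = re.sub(r'(https?://\S+|www\.\S+)', '<URL>', tweet)

    # Replace user names with the token <USER>
    tweet = re.sub(r'@\S+', '<USER>', tweet)

    tweets.append(tweet)

  return tweets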
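
As a quick check of the two substitutions, the sketch below applies the same patterns to a single string. The helper name `replace_urls_and_users` and the sample text are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
USER_PATTERN = re.compile(r'@\S+')

def replace_urls_and_users(text):
    """Replace URLs with <URL> and @mentions with <USER>."""
    text = URL_PATTERN.sub('<URL>', text)
    return USER_PATTERN.sub('<USER>', text)

sample = "@mrdahl87 Sweet video. #imagine http://t.co/fej2v3OUBR"
cleaned = replace_urls_and_users(sample)
print(cleaned)  # <USER> Sweet video. #imagine <URL>
```

Note that `@\S+` matches any `@` followed by non-space characters, so it would also rewrite the tail of an e-mail address.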

### Model 

In the model above, two LSTM layers are used, with 5 words of a tweet processed at a time. Each tweet is divided into chunks of 5 words, which are sent to the model as separate sentences for training and prediction. Some of the words at the end of a tweet are ignored because of the 5-word input restriction. As a result, the context of the tweet is not captured completely and a few words are ignored in our training.


The following improvements are made to the model created in Task 4.

#### Considering full tweet

As an improvement to the model, I have used all the words in the tweet as input. To handle the differing lengths (word counts) of the tweets, zeros are padded in front of each sequence vector to make its length equal to that of the tweet with the most words. This method considers all the words in a tweet, capturing the context better.

In [0]:
# gets the maximum number of words in a tweet using all the input tweets

def get_maxlen(tweets):
  return max([len(tweet.split()) for tweet in tweets])

In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tweets = preprocessing(data)
max_len = get_maxlen(tweets)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
X = tokenizer.texts_to_sequences(tweets)
X = pad_sequences(X, maxlen=max_len)
y = data['Label'].values
vocab_size = len(tokenizer.word_index) + 1
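
The padding step can be illustrated without Keras. The sketch below is a plain-Python stand-in for what `pad_sequences` does with its default settings (pad and truncate at the front); the helper name `pad_front` and the toy sequences are made up for illustration, and `maxlen` is assumed to be positive.

```python
def pad_front(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) integer sequences to a common length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[4, 9], [7, 2, 5, 1], [3]]
toy_padded = pad_front(toy_sequences, maxlen=4)
print(toy_padded)  # [[0, 0, 4, 9], [7, 2, 5, 1], [0, 0, 0, 3]]
```

Because index 0 is used for padding, the tokenizer's word indices start at 1, which is why `vocab_size` above is `len(tokenizer.word_index) + 1`.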

#### Using Bidirectional RNN

In natural language, the context for a word sometimes appears later in the sentence. LSTMs are good at remembering context that has already been processed, but not context that is still unprocessed or lies ahead of the current word. Hence we use a bidirectional LSTM to capture this information.

#### Using POS tags

The inputs to the LSTM are word embeddings, from which the LSTM predicts whether the tweet is ironic or not. The POS tags of the words also provide information about whether the tweet is ironic. This information can also be used in the model for predictions.

To use POS tags, nltk's `pos_tag` function is used to extract the POS tags from the sentence. As a NN accepts only numerical data as input, the POS tags must be converted to numbers. To do this, a dictionary of all POS tags seen in the data is created, with each tag's index as its value. Using this dictionary, all the POS tags in a tweet are converted to numbers. Each POS sequence is also padded with 0 to the maximum length so that all vectors have the same length.

Finally, the numbers are normalised by dividing by the number of POS tags, which gives each tag a value between 0 and 1. This is done to improve the performance of the NN.

In [0]:
from nltk import pos_tag

def tweet_pos_tagger(tweets):
  """Encode each tweet as a padded vector of normalised POS-tag indices."""
  tweets_pos_tags = [pos_tag(word_tokenize(tweet)) for tweet in tweets]
  all_pos = set(tag for tweet_pos in tweets_pos_tags for _, tag in tweet_pos)
  # Sort the tags so that the tag-to-index mapping is the same on every run.
  # Note: indices start at 0, so one tag shares the value 0 with the padding.
  pos_dict = {tag: i for i, tag in enumerate(sorted(all_pos))}
  total_tags = len(pos_dict)
  max_len = get_maxlen(tweets)

  tweets_tag_vector = list()
  for tweet_pos in tweets_pos_tags:
    tweet_pos_vector = np.asarray([pos_dict[tag] for _, tag in tweet_pos])
    tweets_tag_vector.append(tweet_pos_vector)

  # Pad at the front (and truncate longer sequences) to max_len
  tweets_tag_vector = pad_sequences(tweets_tag_vector, maxlen=max_len)
  return tweets_tag_vector / total_tags
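
The encoding can be illustrated on hand-written tag sequences. The sketch below is a variation rather than the function above: it reserves 0 for padding by numbering the tags from 1, so that no real tag shares a value with the padding. The helper name `encode_tags` and the toy tag lists are assumptions made for illustration.

```python
def encode_tags(tagged_tweets, maxlen):
    """Map POS tags to indices 1..K (0 is reserved for padding), scaled to [0, 1]."""
    tags = sorted({tag for tweet in tagged_tweets for tag in tweet})
    tag_index = {tag: i + 1 for i, tag in enumerate(tags)}
    encoded = []
    for tweet in tagged_tweets:
        ids = [tag_index[tag] for tag in tweet][-maxlen:]
        ids = [0] * (maxlen - len(ids)) + ids  # pad at the front
        encoded.append([i / len(tags) for i in ids])
    return tag_index, encoded

# Hand-written tag sequences standing in for nltk.pos_tag output
toy_tags = [['PRP', 'VBP', 'NN'], ['NN', 'VBZ']]
tag_index, encoded = encode_tags(toy_tags, maxlen=3)
print(tag_index)
print(encoded)
```

Treating tag indices as scaled numbers imposes an arbitrary ordering on the tags; a one-hot or embedding representation would avoid that, at the cost of a larger input.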

In [0]:
T = np.stack(tweet_pos_tagger(tweets))

Both inputs are merged so that the data can be split into train and test sets in one step. Once the split is done, the two inputs are separated again, as they are processed separately.

Note: In the train/test split below, the same random seed is used to produce the same split as the one used in Task 2 and Task 3.

In [0]:
XT = np.concatenate((X, T), axis=1)

# Same random state is used to generate same test train split as above 
XT_train, XT_test, y_train, y_test = train_test_split(XT, y, test_size=0.20, random_state = 20)
X_train, T_train = np.hsplit(XT_train, 2)
X_test, T_test = np.hsplit(XT_test, 2)

#### Model building, Training

The model below takes the input tweet data (X) and builds a bidirectional LSTM on the word embeddings of the input. The output of the LSTM is combined with the POS information and passed to a deep, densely connected network with ReLU activation functions. The model returns two outputs: the combined (LSTM, POS) output and an auxiliary output computed from the LSTM alone.

Reference: https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models

In [0]:
from keras.layers import concatenate, Input, Bidirectional
from keras.models import Model

tweet_words = Input(shape=(max_len,), name='tweet_words')
tweet_embd = Embedding(output_dim=512, input_dim=vocab_size, input_length=max_len)(tweet_words)
lstm_out = Bidirectional(LSTM(32))(tweet_embd)
lstm_temp_out = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
pos_info = Input(shape=(max_len,), name='aux_input')
combined_out = concatenate([lstm_out, pos_info])
combined_out = Dense(64, activation='relu')(combined_out)
combined_out = Dense(64, activation='relu')(combined_out)
combined_output = Dense(1, activation='sigmoid', name='combined_output')(combined_out)
model_multi = Model(inputs=[tweet_words, pos_info], outputs=[combined_output, lstm_temp_out])
model_multi.compile(optimizer='adam', loss='binary_crossentropy',
              loss_weights=[1., 0.2])

In [0]:
epochs = 10
model_multi.fit([X_train, T_train], [y_train, y_train], epochs = epochs, batch_size=batch_size)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f5bf9fdb4a8>

#### Improvement in evaluation Metric

To evaluate the classifier, I have used accuracy and F1 score. As the dataset is evenly balanced, accuracy can be used to check the overall performance across both classes, while the F1 score gives the harmonic mean of precision and recall.

$Accuracy = \frac{\textit{true positive + true negative}}{\textit{total examples}}$

$Precision = \frac{\textit{true positive}}{\textit{true positive + false positive}}$

$Recall = \frac{\textit{true positive}}{\textit{true positive + false negative}}$

$F1 = \frac{\textit{2} \times \textit{precision} \times \textit{recall}}{\textit{precision + recall}}$

In [0]:
def evaluate_new(y_predict, y):
  
  tp, fn, fp, tn = 0, 0, 0, 0
  
  if len(y_predict) != len(y):
    raise Exception("Input vectors are of different lengths")
    
  for i in range(len(y)):
    if y[i] == 0 and y_predict[i] == 0:
      tn += 1
    elif y[i] == 1 and y_predict[i] == 1:
      tp += 1
    elif y[i] == 0 and y_predict[i] == 1:
      fp += 1
    elif y[i] == 1 and y_predict[i] == 0:
      fn += 1
  
  accuracy = (tp + tn) / len(y)
  precision = tp / (tp + fp)
  recall = tp / (tp + fn)
  f1 = (2 * precision * recall) / (precision + recall)
  
  return accuracy, precision, recall, f1
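
The function above divides by `tp + fp` and `tp + fn`, which are zero when a model never predicts the positive class or when the test set has no positive examples. A minimal sketch of a guarded variant follows; the name `safe_metrics`, the choice to return 0.0 for an undefined score, and the toy labels are assumptions made for illustration. The input is assumed to be non-empty.

```python
def safe_metrics(y_pred, y_true):
    """Accuracy, precision, recall and F1 for 0/1 labels; undefined scores become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_scores = safe_metrics(toy_pred, toy_true)
print(toy_scores)
```

In the toy example there are two true positives, one false positive and one false negative, so accuracy, precision, recall and F1 all come out at two thirds.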

### Evaluating the model

#### Evaluation metric of improved model

In [0]:
y_multi_predict = model_multi.predict([X_test, T_test])
y_final_predict = np.where(y_multi_predict[0] > 0.5, 1, 0)
y_final_predict = np.ravel(y_final_predict)

In [0]:
accuracy, precision, recall, f1 = evaluate_new(y_final_predict, y_test)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

Accuracy:  0.6178010471204188
Precision:  0.638728323699422
Recall:  0.5695876288659794
F1 score:  0.6021798365122615


#### Evaluation metric of LSTM model (Task 4)

In [0]:
accuracy, precision, recall, f1 = evaluate_new(y_lstm_predict, y_test)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

Accuracy:  0.5589005235602095
Precision:  0.580952380952381
Recall:  0.47164948453608246
F1 score:  0.5206258890469417


#### Evaluation metric of Naive Bayes classifier (Task 2)

In [0]:
accuracy, precision, recall, f1 = evaluate_new(y_nb_predict, y_test)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

Accuracy:  0.6544502617801047
Precision:  0.6353711790393013
Recall:  0.75
F1 score:  0.6879432624113475


An improvement of about 6 percentage points in accuracy (0.559 to 0.618) is observed in the improved model compared to the LSTM model in Task 4. However, its accuracy and F1 score are still behind those of the Naive Bayes model. Training the model for longer might increase the accuracy, as the model was trained for only 10 epochs.

#### Further improvements

More features could be added to the model to improve performance, such as the sentiment of the tweet (counts of positive and negative words). The POS tagger could also be improved: as it is not customised for tweets, its tagging error rate is likely to be high.

### Improving Naive Bayes

#### Interpolation

Along with unigram probabilities, bigram probabilities can be used to improve the model. The two probabilities can be combined as below:

$P(Ironic) = \lambda_1P_{unigram}(Ironic) + \lambda_2P_{bigram}(Ironic)$

$\textit{where } \lambda_1 + \lambda_2 = 1$

In [0]:
from nltk.probability import ConditionalFreqDist

def fit_new(train):
    n = len(train)
    n_pos = 0
    n_neg = 0
    unigram_vocab_pos = list()
    unigram_vocab_neg = list()
    bigram_vocab_pos = list()
    bigram_vocab_neg = list()
    for row in train:
        words = row[2]
        bigrams = list(nltk.bigrams(words))
        if row[1] == 1:
            n_pos += 1
            unigram_vocab_pos = unigram_vocab_pos + words
            bigram_vocab_pos = bigram_vocab_pos + bigrams
        elif row[1] == 0:
            n_neg += 1
            unigram_vocab_neg = unigram_vocab_neg + words
            bigram_vocab_neg = bigram_vocab_neg + bigrams
        else:
            raise Exception("Unknown Label")

    cache = dict()
    cache['f_unigram_pos'] = Counter(unigram_vocab_pos)
    cache['f_unigram_neg'] = Counter(unigram_vocab_neg)
    cache['f_bigram_pos'] = ConditionalFreqDist(bigram_vocab_pos)
    cache['f_bigram_neg'] = ConditionalFreqDist(bigram_vocab_neg)
    cache['n_unigram_pos'] = len(unigram_vocab_pos)
    cache['n_unigram_neg'] = len(unigram_vocab_neg)
    cache['n_bigram_pos'] = len(bigram_vocab_pos)
    cache['n_bigram_neg'] = len(bigram_vocab_neg)
    cache['p_pos'] = n_pos / n
    cache['p_neg'] = 1 - cache['p_pos']
    return cache

In [0]:
def get_unigram_prob(cache, words, alpha=1):
    f_unigram_pos = cache['f_unigram_pos']
    f_unigram_neg = cache['f_unigram_neg']
    n_unigram_pos = cache['n_unigram_pos']
    n_unigram_neg = cache['n_unigram_neg']
    s_unigram_pos = len(f_unigram_pos)
    s_unigram_neg = len(f_unigram_neg)
    p_unigram_pos = 1
    p_unigram_neg = 1

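    for word in words:
        # Note: storing zero counts for unseen words enlarges the Counter, so the
        # vocabulary size used for smoothing grows as more tweets are scored
        if word not in f_unigram_pos:
            f_unigram_pos[word] = 0
        if word not in f_unigram_neg:
            f_unigram_neg[word] = 0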
        p_unigram_pos *= (((f_unigram_pos[word] + alpha) /
                        (n_unigram_pos + alpha * s_unigram_pos)))
        p_unigram_neg *= (((f_unigram_neg[word] + alpha) /
                        (n_unigram_neg + alpha * s_unigram_neg)))

    p_unigram_pos *= cache['p_pos']
    p_unigram_neg *= cache['p_neg']
    return p_unigram_pos, p_unigram_neg

In [0]:
from nltk.probability import ConditionalFreqDist

def get_bigram_prob(cache, words, alpha=1):
    f_bigram_pos = cache['f_bigram_pos']
    f_bigram_neg = cache['f_bigram_neg']
    n_bigram_pos = cache['n_bigram_pos']
    n_bigram_neg = cache['n_bigram_neg']
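    # Note: len() of a ConditionalFreqDist counts distinct first words (conditions),
    # not distinct bigrams
    s_bigram_pos = len(f_bigram_pos)
    s_bigram_neg = len(f_bigram_neg)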
    p_bigram_pos = 1
    p_bigram_neg = 1

    bigrams = list(nltk.bigrams(words))

    for bigram in bigrams:
        if bigram[0] not in f_bigram_pos:
            c_bigram_pos = 0
        elif bigram[1] not in f_bigram_pos[bigram[0]]:
            c_bigram_pos = 0
        else:
            c_bigram_pos = f_bigram_pos[bigram[0]].get(bigram[1])

        if bigram[0] not in f_bigram_neg:
            c_bigram_neg = 0
        elif bigram[1] not in f_bigram_neg[bigram[0]]:
            c_bigram_neg = 0
        else:
            c_bigram_neg = f_bigram_neg[bigram[0]].get(bigram[1])

        p_bigram_pos *= (((c_bigram_pos + alpha) /
                          (n_bigram_pos + alpha * s_bigram_pos)))
        p_bigram_neg *= (((c_bigram_neg + alpha) /
                          (n_bigram_neg + alpha * s_bigram_neg)))

    p_bigram_pos *= cache['p_pos']
    p_bigram_neg *= cache['p_neg']
    return p_bigram_pos, p_bigram_neg

In [0]:
def predict_new(cache, test, alpha=1):
    predictions = list()
    lambda1 = 0.3
    lambda2 = 0.7
    for row in test:
        p_unigram_pos, p_unigram_neg = get_unigram_prob(cache, row[2], alpha)
        p_bigram_pos, p_bigram_neg = get_bigram_prob(cache, row[2], alpha)

        p_predict_pos = lambda1*p_unigram_pos + lambda2*p_bigram_pos
        p_predict_neg = lambda1*p_unigram_neg + lambda2*p_bigram_neg

        predictions.append(int(p_predict_pos > p_predict_neg))

    return predictions
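
The interpolation step on its own can be sketched as follows, with the constraint $\lambda_1 + \lambda_2 = 1$ enforced by deriving $\lambda_2$ from $\lambda_1$. The helper name `interpolate` and the scores are made up for illustration; they are not values produced by the model above.

```python
def interpolate(p_unigram, p_bigram, lambda1=0.3):
    """Linear interpolation of two scores with weights lambda1 and 1 - lambda1."""
    if not 0.0 <= lambda1 <= 1.0:
        raise ValueError("lambda1 must lie in [0, 1]")
    return lambda1 * p_unigram + (1.0 - lambda1) * p_bigram

# Made-up scores for one tweet under each class
score_pos = interpolate(p_unigram=2e-6, p_bigram=5e-9)
score_neg = interpolate(p_unigram=1e-6, p_bigram=8e-9)
toy_label = int(score_pos > score_neg)
print(score_pos, score_neg, toy_label)
```

In this toy case the unigram scores are far larger than the bigram scores, so they decide the label even though they carry the smaller weight. When the two products differ by orders of magnitude, the larger term dominates the sum regardless of the weights, which is worth checking before tuning `lambda1`.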

In [0]:
cache = fit_new(train)

In [0]:
y_new_nv = predict_new(cache, test)

In [0]:
accuracy, precision, recall, f1 = evaluate_new(y_new_nv, y_test)
print("Accuracy: ", accuracy)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

Accuracy:  0.6518324607329843
Precision:  0.6270833333333333
Recall:  0.7757731958762887
F1 score:  0.6935483870967742


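Introducing bigrams does not improve accuracy (0.652 compared with 0.654 for the unigram model), although the F1 score rises slightly (0.694 compared with 0.688).

Further improvements could come from tuning alpha, extending the n-gram model, and getting more data.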