In this notebook we experiment with sentiment analysis and POS tagging by implementing a number of simple classifiers based on Naive Bayes and neural networks. This is intended as a starting point only: no model selection or other fancy stuff is done.

# Naive Bayes classifier (IMDB dataset)
In this part we create a Naive Bayes classifier for the IMDB movie reviews dataset. Make sure you are able to import all the necessary dependencies. The smoothing term is the parameter of Laplace (additive) smoothing, which keeps the score of a review from collapsing to zero when it contains a word never seen with one of the classes.

In [96]:
from keras.datasets import imdb
from collections import Counter
import numpy as np
import keras

SMOOTHING_TERM = 1 #Smoothing term to avoid trouble with 0 probabilities
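
To see what the smoothing term does, here is a minimal sketch with made-up counts. The helper name `smoothed_share` and the counts are illustrative assumptions, not values from the dataset; the formula has the same add-`SMOOTHING_TERM` form as the probabilities computed later in this notebook.

```python
import math

SMOOTHING_TERM = 1  # same value as in the notebook

def smoothed_share(count_in_class, total_count, smoothing=SMOOTHING_TERM):
    """Hypothetical helper: smoothed share of a word's occurrences in one of two classes."""
    return (count_in_class + smoothing) / (total_count + 2 * smoothing)

# Made-up example: a word seen 4 times overall, never in negative reviews
p_neg_raw = 0 / 4                     # 0.0, so its logarithm would be -inf
p_neg_smooth = smoothed_share(0, 4)   # 1/6
p_pos_smooth = smoothed_share(4, 4)   # 5/6

print(p_neg_raw, p_neg_smooth, p_pos_smooth)
print(math.log(p_neg_smooth))         # finite, so the review score stays usable
```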

Since the IMDB dataset, as imported from keras, encodes words with numbers, we define a function to decode the reviews i.e. get a readable version. The UNK token is given by the preprocessing already applied in the imported dataset.

In [97]:
#Helper to decode reviews: give it an element of x_train or x_test as
#loaded by imdb.load_data and it returns the readable review.
def decode_review(review):
    word_to_id = keras.datasets.imdb.get_word_index()
    word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    id_to_word = {value: key for key, value in word_to_id.items()}
    return ' '.join(id_to_word[id] for id in review)
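
As a quick illustration of the index shift used in `decode_review`, the sketch below uses a tiny made-up word index instead of the real one returned by `imdb.get_word_index()`. The three words and their ranks are assumptions for illustration only.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (made up)
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by 3 to make room for the reserved ids 0, 1 and 2
word_to_id = {word: rank + 3 for word, rank in toy_word_index.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {idx: word for word, idx in word_to_id.items()}

# An encoded review: <START>, "the", "movie", an out-of-vocabulary word, "great"
encoded = [1, 4, 5, 2, 6]
decoded = " ".join(id_to_word[idx] for idx in encoded)
print(decoded)  # <START> the movie <UNK> great
```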


Next, we import the dataset and count the occurrences of each class $c_i$ (0 for a negative review and 1 for a positive one, in this case). From that we can determine the probability $P(c_i)$ of each class.

In [98]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

#Count the occurrences of each class
occ_classes = Counter(y_train)

instances = len(y_train)
#Calculate the probability of each class (i.e. occurrences/(total number of instances))
p_classes = {key: value/instances for key,value in occ_classes.items()}
print(p_classes)

{0: 0.5, 1: 0.5}


We now separate the positive reviews from the negative ones, and count the occurrences of each word $w_i$ in positive and negative contexts.

In [99]:
#Separate positive reviews from negative ones
x_train_pos, x_train_neg = [], []
for i in range(0, instances):
    (x_train_pos, x_train_neg)[1 if y_train[i]==0 else 0].append(x_train[i])

#Count the occurrences of each word
w_occ = {key: value for key,value in
     Counter(x for xs in x_train for x in xs).items()}

#Count the number of occurrences of each word in positive reviews
w_c_pos = {key: value for key,value in
     Counter(x for xs in x_train_pos for x in xs).items()}

#Count the number of occurrences of each word in negative reviews
w_c_neg = {key: value for key,value in
     Counter(x for xs in x_train_neg for x in xs).items()}

#Add the words with 0 occurrences to both dictionaries
for key in w_occ.keys():
    if key not in w_c_pos.keys():
        w_c_pos[key] = 0
    if key not in w_c_neg.keys():
        w_c_neg[key] = 0
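
The counting step can be checked on a toy example. The sketch below uses three made-up encoded reviews (an assumption for illustration) and relies on the fact that a `Counter` returns 0 for keys it has never seen, which is one alternative to filling in the zero counts explicitly.

```python
from collections import Counter

# Made-up encoded reviews and labels (1 = positive, 0 = negative)
toy_x = [[4, 5, 6], [4, 7], [5, 5, 8]]
toy_y = [1, 0, 1]

toy_pos = [review for review, label in zip(toy_x, toy_y) if label == 1]
toy_neg = [review for review, label in zip(toy_x, toy_y) if label == 0]

# Flatten each group of reviews and count the word ids
count_all = Counter(w for review in toy_x for w in review)
count_pos = Counter(w for review in toy_pos for w in review)
count_neg = Counter(w for review in toy_neg for w in review)

print(count_pos[5])  # word 5 appears 3 times in positive reviews
print(count_neg[5])  # never seen in negative reviews: Counter gives 0
print(count_all[4])  # once in a positive and once in a negative review: 2
```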

We can now calculate the conditional probabilities $P(w_i|c_j)$, that is, the probability of each word appearing in a positive review and in a negative one. Note that the code normalises each class count by the word's total count, so for every word the two smoothed values sum to 1.

In [100]:
#Calculate the probability of each word appearing in a positive
#review (i.e. P(w|c=1) for each w in the vocabulary)
w_p_pos = {key:(value+SMOOTHING_TERM)/(w_occ[key]+2*SMOOTHING_TERM)
           for key, value in w_c_pos.items()}

#Calculate the probability of each word appearing in a negative
#review (i.e. P(w|c=0) for each w in the vocabulary)
w_p_neg = {key:(value+SMOOTHING_TERM)/(w_occ[key]+2*SMOOTHING_TERM)
           for key, value in w_c_neg.items()}

#Safety check that, for each word, p(w|pos)+p(w|neg) indeed sums
#up to 1 (deviations in either direction are flagged)
for key in w_occ.keys():
    if abs(1-(w_p_pos[key]+w_p_neg[key]))>1e-6:
        raise ValueError('Something\'s wrong')


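The classifier is now complete: given a review $r$, its class $c_r$ is given by $c_r=\underset{c_j\in C}{\operatorname{argmax}} \left[\log P(c_j)+\sum_{w_i \in r} \log P(w_i|c_j)\right]$. Since the two classes are equally likely here, the prior term $\log P(c_j)$ is the same for both, and the code below leaves it out. We obtain the accuracy, precision, recall and F1-measure on the training set.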

In [101]:
t_p = 0; t_n = 0; f_p = 0; f_n = 0
for el in range(0,len(x_train)):
    ppos = 0
    pneg = 0
    for x in x_train[el]:
        ppos = ppos + np.log(w_p_pos[x])
        pneg = pneg + np.log(w_p_neg[x])
    if (((ppos>=pneg)) and y_train[el]==1):
        t_p+=1
    elif (((ppos>=pneg)) and y_train[el]==0):
        f_p+=1
    elif (((ppos<pneg)) and y_train[el]==1):
        f_n+=1
    elif (((ppos<pneg)) and y_train[el]==0):
        t_n+=1

print('Training set accuracy:',(t_n+t_p)/(t_p+t_n+f_p+f_n))
print('Training set precision:',(t_p)/(t_p+f_p))
print('Training set recall:',(t_p)/(t_p+f_n))
print('Training set F1-measure:',2*(((t_p)/(t_p+f_p))*((t_p)/(t_p+f_n))/(((t_p)/(t_p+f_p))+((t_p)/(t_p+f_n)))))

Training set accuracy: 0.84408
Training set precision: 0.8084923253478697
Training set recall: 0.90176
Training set F1-measure: 0.8525830118750474
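
The scores above are sums of logarithms rather than products of probabilities. A short sketch of why, using a made-up review of 1000 words that each have probability 0.01 (both numbers are assumptions chosen for illustration):

```python
import math

# Made-up per-word probabilities for a long review
probs = [0.01] * 1000

# Multiplying them directly underflows to zero in double precision
product = 1.0
for p in probs:
    product *= p

# Summing logarithms keeps the score representable
log_score = sum(math.log(p) for p in probs)

print(product)    # 0.0
print(log_score)  # roughly -4605.17
```

With a product of zero for both classes the comparison would be meaningless, whereas the two log-scores can still be compared.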


Notice that the accuracy is relatively low. It could probably be increased with more informative features, e.g. the length of the review. We can calculate the same measures on the test set as follows:

In [102]:
t_p = 0; t_n = 0; f_p = 0; f_n = 0
for el in range(0,len(x_test)):
    ppos = 0
    pneg = 0
    for x in x_test[el]:
        ppos = ppos + np.log(w_p_pos[x])
        pneg = pneg + np.log(w_p_neg[x])
    if (((ppos>=pneg)) and y_test[el]==1):
        t_p+=1
    elif (((ppos>=pneg)) and y_test[el]==0):
        f_p+=1
    elif (((ppos<pneg)) and y_test[el]==1):
        f_n+=1
    elif (((ppos<pneg)) and y_test[el]==0):
        t_n+=1

print('Test set accuracy:',(t_n+t_p)/(t_p+t_n+f_p+f_n))
print('Test set precision:',(t_p)/(t_p+f_p))
print('Test set recall:',(t_p)/(t_p+f_n))
print('Test set F1-measure:',2*(((t_p)/(t_p+f_p))*((t_p)/(t_p+f_n))/(((t_p)/(t_p+f_p))+((t_p)/(t_p+f_n)))))

Test set accuracy: 0.82204
Test set precision: 0.795189557820635
Test set recall: 0.86752
Test set F1-measure: 0.829781535753912
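
The four printed measures follow directly from the confusion counts. A minimal sketch with made-up counts (the helper name `binary_metrics` and the numbers are assumptions, not results from this notebook):

```python
def binary_metrics(t_p, t_n, f_p, f_n):
    """Hypothetical helper: accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (t_p + t_n) / (t_p + t_n + f_p + f_n)
    precision = t_p / (t_p + f_p)
    recall = t_p / (t_p + f_n)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration
acc, prec, rec, f1 = binary_metrics(t_p=90, t_n=70, f_p=30, f_n=10)
print(acc, prec, rec, f1)
```

Computing precision and recall once and reusing them also avoids repeating the divisions inside the F1 expression.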


# Neural network classifier
In this part we create a neural network classifier for the IMDB movie reviews dataset. Make sure you are able to import all the necessary dependencies. The architecture of the network is as follows:
- Embedding layer: maps input words from the vocabulary to lower dimension embeddings;
- 1D convolution layer: to treat the sequences of embeddings constituting each review;
- Max pooling layer: to add some sort of invariance to small translations (probably not a great idea in this case);
- Dense layers to perform the classification.
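
To make the architecture concrete, the sketch below traces how the length of a review changes through the layers. It assumes the settings used in the model definition further down (kernel size 5, 10 filters, no padding in the convolution, Keras' default pooling window of 2) and a padded review length of 503; the helper names are hypothetical.

```python
def conv1d_valid_len(length, kernel_size):
    """Output length of a stride-1 1D convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size=2):
    """Output length of non-overlapping max pooling without padding."""
    return (length - pool_size) // pool_size + 1

seq_len = 503    # padded review length (assumed, see the text above)
n_filters = 10   # number of convolution filters

after_conv = conv1d_valid_len(seq_len, kernel_size=5)  # positions after the convolution
after_pool = maxpool1d_len(after_conv)                 # positions after max pooling
flattened = after_pool * n_filters                     # features fed to the dense layers

print(after_conv, after_pool, flattened)  # 499 249 2490
```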


In [103]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, Flatten, Dense
from keras.layers import ZeroPadding1D, Convolution1D,MaxPooling1D
from keras.models import Sequential
from keras.datasets import imdb
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)


#Reload just for clarity
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

#Length of longest sequence in the dataset
max_len = np.max([len(el) for el in x_train])

#Average length of sequences in the dataset
avg_len = np.average([len(el) for el in x_train])

#Std deviation of sequences' length in the dataset
std_len = np.std([len(el) for el in x_train])

#The maximum length sequence we will consider
seq_len = int(avg_len+std_len*1.5)

#The size of the vocabulary
vocab_size = len(imdb.get_word_index())


print("Longest sequence:",max_len,"\nAverage sequence length:",avg_len,"\nSequences length std:",std_len,
      "\nMax sequence length to consider:",seq_len,"\nVocabulary size",vocab_size)

Longest sequence: 2494 
Average sequence length: 238.71364 
Sequences length std: 176.49367364852034 
Max sequence length to consider: 503 
Vocabulary size 88584


We choose a maximum allowed sequence length: using the length of the longest sequence (2494) would consume a lot of memory, since every other sequence would have to be padded up to it. We therefore take the average sequence length plus $\alpha$ times the standard deviation, with $\alpha=1.5$, which gives 503. Sequences longer than that will be trimmed, whilst shorter ones will be padded.

In [104]:
x_train_p = pad_sequences(x_train, maxlen=seq_len, padding='post')
x_test_p = pad_sequences(x_test, maxlen=seq_len, padding='post')

#Model definition
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=seq_len))
model.add(Convolution1D(10, 5, activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(15, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

#Compile and fit the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(x_train_p, y_train, epochs=2, batch_size=64)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f82c9896080>
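
The padding and trimming behaviour can be illustrated without Keras. The helper below is a hypothetical re-implementation for illustration only; it assumes the setting used above (`padding='post'`) together with Keras' default `truncating='pre'`, under which over-long sequences lose their first tokens.

```python
def pad_or_trim(seq, maxlen, pad_value=0):
    """Hypothetical helper mimicking padding='post' with truncating='pre'."""
    trimmed = seq[-maxlen:]                                 # keep the last maxlen tokens
    return trimmed + [pad_value] * (maxlen - len(trimmed))  # pad at the end

short = [1, 14, 22]
long_ = [1, 14, 22, 16, 43, 530, 973]

print(pad_or_trim(short, 5))  # [1, 14, 22, 0, 0]
print(pad_or_trim(long_, 5))  # [22, 16, 43, 530, 973]
```

Under these settings the `<START>` token of a long review is among the tokens that get dropped.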

We can now evaluate the model on the test set:

In [105]:
model.evaluate(x_test_p,y_test)



[0.3136934424877167, 0.87324]

We can see that, even with such a small model, the neural network approach leads to better results than the Naive Bayes one. The difference in accuracy between the training and test set may indicate some overfitting, which would not be surprising as no measures to prevent it (e.g. Dropout layers or norm penalty regularization) have been adopted.

This is just a shallow implementation: in a proper one we would perform some feature engineering for the Naive Bayes classifier and model selection for the neural network approach.





# POS tagger
In this part we create a POS tagger for the "news" category of the Brown corpus in the `nltk.corpus` package. In particular we adopt a windowed approach, that is, we frame the task as predicting the tag of a given word from a fixed-size window around it. Make sure you are able to import all the necessary dependencies and provide the proper path to the GloVe embeddings file.


In [1]:
from keras.layers import Embedding, Flatten, Dense
from keras.layers import Convolution1D,MaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.utils import to_categorical
from nltk.corpus import brown
import nltk
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import numpy as np
import string
import os

EMB_DIM = 50
UNK_W = np.random.randn(EMB_DIM)

#Provide the path to the glove embeddings
glove_path = ''
#Load and parse the glove dataset: each line holds a word followed by
#its embedding coefficients
embeddings_index = {}
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

#Also download the brown corpus along with the universal tagset
nltk.download('brown')
nltk.download('universal_tagset')


Using TensorFlow backend.


FileNotFoundError: [Errno 2] No such file or directory: ''

We now parse the tagged news (i.e. a list of lists, with each element being a tuple (word,tag)). We flatten this list, concatenating the inner lists corresponding to the sentences. In doing so we add additional tokens to mark the beginning and the end of each sentence.

In [107]:
#Parse the tagged news
brown_sents = brown.tagged_sents(categories='news', tagset='universal')

#Flatten the list, inserting two padding tokens before each sentence
#(they also close the previous one) and two after the last sentence
flat_brown = []

for sent in brown_sents:
    flat_brown.append(("<unk>", "UNKTAG"))
    flat_brown.append(("<unk>", "UNKTAG"))
    for w in sent:
        flat_brown.append(w)
flat_brown.append(("<unk>","UNKTAG"))
flat_brown.append(("<unk>","UNKTAG"))

We can now specify a window size and build the dataset accordingly: the input x is the word whose tag we are interested in together with its surrounding window, and the target y is the tag of the central word of the window.

In [108]:
#Window size, has to be odd
win_size = 5
assert(win_size%2==1)

#Center of the window
center = int(np.floor(win_size/2))

#Create the n-tuples of window size, each element of the n-tuple is a tuple (word,tag)
w_tags = [[flat_brown[i+j] for j in range(0,win_size)] for i in range(0,len(flat_brown)-win_size+1)]
#The target tag is the tag of the central word in the windows
y_train = [el[center][1] for el in w_tags]
#Get rid of the tags for the input x
x_train = [[el[0] for el in window] for window in w_tags]
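
The windowing can be traced on a toy corpus. The two tagged sentences below are made up for illustration; the padding and slicing follow the same scheme as the cells above.

```python
# Made-up tagged sentences, each a list of (word, tag) tuples
toy_sents = [[("Dogs", "NOUN"), ("bark", "VERB")],
             [("Hi", "X")]]

PAD = ("<unk>", "UNKTAG")
win_size = 5
center = win_size // 2

# Two padding tokens before each sentence, two more after the last one
flat = []
for sent in toy_sents:
    flat.extend([PAD, PAD])
    flat.extend(sent)
flat.extend([PAD, PAD])

# Slide a window of win_size tokens over the flattened corpus
windows = [flat[i:i + win_size] for i in range(len(flat) - win_size + 1)]
x = [[word for word, _ in window] for window in windows]
y = [window[center][1] for window in windows]

print(x[0], y[0])  # ['<unk>', '<unk>', 'Dogs', 'bark', '<unk>'] NOUN
print(y)           # ['NOUN', 'VERB', 'UNKTAG', 'UNKTAG', 'X']
```

Windows centred on a padding token are produced too, which is why `UNKTAG` shows up among the target tags.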


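Let us define the set of tags and encode them with integers. The codes start at 1, so index 0 stays unused and the one-hot targets built later have 14 columns for the 13 tags.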

In [109]:
tags = set(y_train)

tags_enc = dict()
c=1
for w in tags:
    tags_enc[w] = c
    c+=1
print(tags_enc)

{'CONJ': 1, 'VERB': 2, 'UNKTAG': 3, '.': 8, 'PRON': 5, 'ADJ': 6, 'ADP': 7, 'NOUN': 9, 'DET': 10, 'ADV': 13, 'PRT': 11, 'X': 12, 'NUM': 4}
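
Iterating over a `set` gives an order that can change between runs, so the integer assigned to each tag may differ from the mapping printed above. A sketch of a reproducible variant, assuming that numbering the tags in sorted order is acceptable (the one-hot helper is a hypothetical stand-in for `to_categorical`, and the tag set is a small made-up subset):

```python
toy_tags = {"NOUN", "VERB", "DET", "UNKTAG"}

# Sort before numbering so the mapping is the same on every run; codes start at 1
tags_enc = {tag: code for code, tag in enumerate(sorted(toy_tags), start=1)}
print(tags_enc)  # {'DET': 1, 'NOUN': 2, 'UNKTAG': 3, 'VERB': 4}

def one_hot(code, num_classes):
    """Hypothetical helper: one-hot list with a 1 at position `code`."""
    return [1.0 if i == code else 0.0 for i in range(num_classes)]

# Position 0 is never used because codes start at 1, hence len(tags) + 1 columns
num_classes = len(tags_enc) + 1
print(one_hot(tags_enc["NOUN"], num_classes))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```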


Then we update the input data by replacing each word with its embedding, as loaded from the GloVe embeddings dataset. Note that some words may not be shared between the GloVe dataset and the Brown corpus news dataset (or they might be spelled in a slightly different way): we deal with this in the laziest way, that is, by replacing all the words that do not appear in the GloVe vocabulary with a randomly generated embedding (the same for all such words). The target y is also updated, by replacing the tags with a one-hot representation.

In [110]:
x_train_emb = [[\
                (embeddings_index[w.lower()] if w.lower() in embeddings_index.keys() else UNK_W)\
                for w in sent] for sent in x_train]

y_train_enc = to_categorical([tags_enc[tag] for tag in y_train])

def unison_shuffled_copies(a, b):
    p = np.random.permutation(len(a))
    return a[p], b[p]

#Shuffle the dataset
x_train_emb_shuffled, y_train_enc_shuffled = unison_shuffled_copies(np.array(x_train_emb),np.array(y_train_enc))
#Sample input, before replacing words with their embeddings
print(x_train[0:3])
#Corresponding target
print(y_train[0:3])
#Input after replacing words with their embeddings, shape is (samples, window size, embedding size)
print(x_train_emb_shuffled.shape)
#One-hot encoded output
print(y_train_enc_shuffled[0:3])

[['<unk>', '<unk>', 'The', 'Fulton', 'County'], ['<unk>', 'The', 'Fulton', 'County', 'Grand'], ['The', 'Fulton', 'County', 'Grand', 'Jury']]
['DET', 'NOUN', 'NOUN']
(109798, 5, 50)
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
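
The lookup with a shared fallback vector can be sketched with a toy embedding table. The two-dimensional vectors and the words are made up for illustration; `dict.get` plays the role of the conditional expression used above.

```python
import random

EMB_DIM = 2
random.seed(0)
# One random vector shared by every out-of-vocabulary word
UNK_W = [random.gauss(0, 1) for _ in range(EMB_DIM)]

# Made-up embedding table standing in for the GloVe vectors
toy_embeddings = {"the": [0.1, 0.2], "county": [0.3, 0.4]}

window = ["The", "Fulton", "County"]
# Lower-case each word before the lookup, fall back to UNK_W when it is missing
embedded = [toy_embeddings.get(w.lower(), UNK_W) for w in window]

print(embedded[0])           # [0.1, 0.2]
print(embedded[1] is UNK_W)  # True: 'fulton' is not in the toy table
```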


With the dataset properly formatted we can now separate it into training and test data, then build the model.

In [111]:
#Use the last 1000 windows (samples) as test data
x_tr = x_train_emb_shuffled[0:-1000]
y_tr = y_train_enc_shuffled[0:-1000]
x_ts = x_train_emb_shuffled[-1000:]
y_ts = y_train_enc_shuffled[-1000:]

#Model definition
model = Sequential()
model.add(Convolution1D(128, 3, padding='same', input_shape=(5,50),activation='relu'))
model.add(MaxPooling1D())
model.add(Flatten())
model.add(Dense(60, activation='relu'))
model.add(Dense(14,activation='softmax'))
#Compile and fit the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.fit(x_tr, y_tr, epochs=10, batch_size=64)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8286859978>

Finally we can calculate the accuracy, precision, recall and F1 measure for the training set.

In [112]:
def statistics(x,y):
    pred = model.predict(x)
    #Compute the confusion matrix
    cm = confusion_matrix(y.argmax(1), pred.argmax(1))
    print("Tags mapping", tags_enc)
    pr = metrics.precision_score(y.argmax(1), pred.argmax(1),average=None)
    rc = metrics.recall_score(y.argmax(1), pred.argmax(1),average=None)
    f1 = metrics.f1_score(y.argmax(1), pred.argmax(1),average=None)
    acc = metrics.accuracy_score(y.argmax(1), pred.argmax(1))
    #Note: accuracy_score returns a single overall accuracy, not one value per class
    print("\nAccuracy score per class:", acc)
    print("\nPrecision per class:", pr, "\nAverage precision:",np.average(pr))
    print("\nRecall per class:", rc, "\nAverage recall:",np.average(rc))
    print("\nF1 score per class:", f1, "\nAverage f1 score:",np.average(f1))
    print("\nConfusion matrix", cm)
    
print("-----TRAINING SET-----")
statistics(x_tr, y_tr)

-----TRAINING SET-----
Tags mapping {'CONJ': 1, 'VERB': 2, 'UNKTAG': 3, '.': 8, 'PRON': 5, 'ADJ': 6, 'ADP': 7, 'NOUN': 9, 'DET': 10, 'ADV': 13, 'PRT': 11, 'X': 12, 'NUM': 4}

Accuracy score per class: 0.9518097759149985

Precision per class: [0.99691715 0.94884986 0.9954323  0.98035547 0.99195171 0.79946702
 0.98558232 0.99620669 0.95915862 0.99703317 0.80306428 1.
 0.7577104 ] 
Average precision: 0.9393637688507579

Recall per class: [0.95779341 0.90400504 0.99967235 0.97852474 0.98246313 0.90456807
 0.93367138 0.99991539 0.95148763 0.98263335 0.95809184 0.66304348
 0.872142  ] 
Average recall: 0.9298470621998238

F1 score per class: [0.97696375 0.92588476 0.99754782 0.97943925 0.98718462 0.84877635
 0.95892482 0.9980576  0.95530772 0.98978089 0.87375483 0.79738562
 0.81090909] 
Average f1 score: 0.9307628550958303

Confusion matrix [[ 2587     9     0     0     0     9    21    18    10     0     3     0
     44]
 [    0 12911     2     7     1   331    17     3   667     3    62    
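
The per-class precision and recall reported above can be read off a confusion matrix whose rows are true classes and whose columns are predicted classes, which is the convention used by `sklearn.metrics.confusion_matrix`. A minimal sketch with a made-up 3-class matrix (the numbers are assumptions for illustration, not results from this notebook):

```python
# Made-up confusion matrix: rows = true class, columns = predicted class
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 0, 10]]

n = len(cm)
row_sums = [sum(cm[i]) for i in range(n)]                       # samples per true class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # samples per predicted class

precision = [cm[k][k] / col_sums[k] for k in range(n)]  # correct / predicted as k
recall = [cm[k][k] / row_sums[k] for k in range(n)]     # correct / truly k
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_sums)

print(precision)
print(recall)
print(accuracy)
```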

And for the test set:

In [113]:
print("-----TEST SET-----")
statistics(x_ts, y_ts)

-----TEST SET-----
Tags mapping {'CONJ': 1, 'VERB': 2, 'UNKTAG': 3, '.': 8, 'PRON': 5, 'ADJ': 6, 'ADP': 7, 'NOUN': 9, 'DET': 10, 'ADV': 13, 'PRT': 11, 'X': 12, 'NUM': 4}

Accuracy score per class: 0.913

Precision per class: [1.         0.89380531 0.98876404 0.95652174 1.         0.77333333
 0.95       0.97321429 0.93846154 0.98       0.59375    0.56756757] 
Average precision: 0.8846181515737911

Recall per class: [0.8125     0.86324786 1.         0.91666667 1.         0.79452055
 0.890625   1.         0.9037037  0.95145631 0.9047619  0.84      ] 
Average recall: 0.9064568330837464

F1 score per class: [0.89655172 0.87826087 0.99435028 0.93617021 1.         0.78378378
 0.91935484 0.98642534 0.92075472 0.96551724 0.71698113 0.67741935] 
Average f1 score: 0.8896307913407985

Confusion matrix [[ 13   0   0   0   0   0   0   1   0   0   1   1]
 [  0 101   0   0   0   4   0   0   8   0   0   4]
 [  0   0  88   0   0   0   0   0   0   0   0   0]
 [  0   0   0  22   0   0   0   0   1   0   0 