In this practical you will explore how LSTM networks can be applied to text data in order to perform tasks such as predicting the next word in a sequence and predicting the sentiment of a text.

# Task 1: Predicting a word in a sequence

In this task, you will train a model that, given a sequence of words as input, predicts the next word in the sequence.

**T1.1** Obtaining data

For this task we will use the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/), loaded through [sklearn](https://scikit-learn.org/stable/datasets/index.html). You can load the data using the code below. Familiarize yourself with the dataset before moving to the next task.

In [1]:
from sklearn.datasets import fetch_20newsgroups
corpus = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),categories=['sci.med'])

In [2]:
data = corpus.data
data_train = data[:400]
data_test = data[400:]
print(len(data))
print(len(data_train))
print(len(data_test))

594
400
194


***
**T1.2** Data pre-processing

Since we will be using the Embedding layer, the data should be pre-processed as in the previous practicals. To clean the data, you can use the `filters` argument of the `Tokenizer` to specify which characters should be removed from the text.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,split=' ')
tokenizer.fit_on_texts(data_train)
token_list_train = tokenizer.texts_to_sequences(data_train)
token_list_test = tokenizer.texts_to_sequences(data_test)
num_words = len(tokenizer.word_index)+1

Using TensorFlow backend.
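
To see what the `filters`, `lower` and `split` arguments do to a single string, the clean-up step can be approximated in plain Python. This is only a sketch: `simple_tokenize` is a hypothetical helper written for illustration, not a Keras function, and it covers the text clean-up only, not the word-to-integer mapping.

```python
# Hypothetical helper for illustration; it approximates the Tokenizer's
# text clean-up (lowercase, replace filtered characters, split).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=FILTERS, lower=True, split=' '):
    if lower:
        text = text.lower()
    # Map every filtered character to the split character.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split and drop the empty strings left by consecutive separators.
    return [tok for tok in text.split(split) if tok]

tokens = simple_tokenize("I've tried, four-times!")
print(tokens)
```

The apostrophe is not in the filter string, so contractions such as "i've" remain single tokens, while the comma, hyphen and exclamation mark act as separators.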


In [4]:
idx_word = tokenizer.index_word
' '.join(idx_word[w] for w in token_list_test[20][:30])

"i have been in the same boat as you last year i've tried four times to send you an email response but your end doesn't seem to accept my mail"

***
**T1.3** Creating features and labels vectors 

Each training instance will be composed of a sequence of words (we will set its length to 20 for this exercise), and each label will be a single word.

The feature and labels vectors can be generated as follows. We use the first 20 words as features with the 21st as the label, then use words 2–21 as features and predict the 22nd and so on. This gives us significantly more training data. 

Generate the features and labels vectors for the train and the test datasets.

In [5]:
import numpy as np

x_train = []
y_train = []
train_length = 20

for row in token_list_train:
    for i in range(train_length, len(row)):
        sequence = row[i-train_length:i+1]       
        x_train.append(sequence[:-1])
        y_train.append(sequence[-1])
x_train = np.array(x_train)
y_train = np.array(y_train)
print(x_train.shape)
print(len(y_train))

(82696, 20)
82696


In [6]:
x_test = []
y_test = []
train_length = 20

for row in token_list_test:
    for i in range(train_length, len(row)):
        sequence = row[i-train_length:i+1]
        
        x_test.append(sequence[:-1])
        y_test.append(sequence[-1])
x_test = np.array(x_test)
y_test = np.array(y_test)
print(x_test.shape)
print(len(y_test))

(24209, 20)
24209
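
The train and test cells above repeat the same sliding-window loop. As a sketch, the logic could be factored into one function and checked on a toy input. The name `make_windows` and the toy rows are assumptions made for illustration; they are not part of the practical.

```python
def make_windows(token_lists, window):
    """Return (features, labels) built with a sliding window of `window` tokens."""
    features, labels = [], []
    for row in token_lists:
        # Rows with at most `window` tokens produce no training instance.
        for i in range(window, len(row)):
            features.append(row[i - window:i])  # the `window` tokens before position i
            labels.append(row[i])               # the token to predict
    return features, labels

# Toy data: the second row is shorter than the window and is skipped.
toy_rows = [[1, 2, 3, 4, 5], [6, 7]]
toy_x, toy_y = make_windows(toy_rows, window=3)
print(toy_x)
print(toy_y)
```

Note that documents shorter than 21 tokens contribute nothing to the features or labels, which is one reason the number of instances is not simply proportional to the number of documents.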


***
**T1.3** Constructing the embedding weights matrix.

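In this task we will use pre-trained word embeddings from the word2vec model. Create the embedding weights matrix for the Embedding layer: row `i` of the matrix holds the 300-dimensional vector of the word with index `i` in the tokenizer vocabulary.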

In [7]:
from gensim.models import KeyedVectors

# Path to the pre-trained GoogleNews word2vec binary; adjust it to your machine.
#file = 'D:\PycharmProjects\RecordLinkage\Embedings Files\GoogleNews-vectors-negative300.bin'
file = '/Users/annajurek/Documents/Queens/word embedding/GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(file, binary=True)
# load_word2vec_format already returns a KeyedVectors object, so no `.wv` lookup
# is needed (that attribute is not available on KeyedVectors in gensim 4).
word2vec_vectors = word2vec

  


In [8]:
num_words

11693

In [9]:
import numpy as np

num_words = len(tokenizer.word_index)+1
embedding_matrix = np.zeros((num_words, 300))
for word, i in tokenizer.word_index.items():
    if word in word2vec_vectors:
        embedding_vector = word2vec[word]
        embedding_matrix[i] = embedding_vector

In [10]:
embedding_matrix.shape

(11693, 300)
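
Words without a word2vec vector keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below does this on toy data: `toy_word_index` and `toy_vectors` are hypothetical stand-ins for `tokenizer.word_index` and the loaded word2vec vectors, and the dimension is reduced to 4.

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the word2vec vectors.
toy_word_index = {"pain": 1, "doctor": 2, "i've": 3}
toy_vectors = {"pain": np.ones(4), "doctor": np.full(4, 0.5)}

# Same construction as in the cell above, on the toy vocabulary.
toy_matrix = np.zeros((len(toy_word_index) + 1, 4))
for word, i in toy_word_index.items():
    if word in toy_vectors:
        toy_matrix[i] = toy_vectors[word]

# Rows that stayed all-zero: index 0 (padding) plus words with no vector.
zero_rows = np.flatnonzero(~toy_matrix.any(axis=1))
coverage = 1 - (len(zero_rows) - 1) / len(toy_word_index)
print(zero_rows, coverage)
```

Applied to the real `embedding_matrix`, the same two lines would report which vocabulary indices have no pre-trained embedding.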

***
**T1.4** One-hot encoding the labels.

Since we are dealing with a multi-class classification problem, we need to convert each label into a vector whose dimension equals the number of words in the vocabulary. Convert the train and test labels into one-hot encoded vectors.

In [11]:
y_train_array = np.zeros((len(y_train), num_words),dtype=int)
for idx,word_idx in enumerate(y_train):
    y_train_array[idx,word_idx] = 1
    
y_test_array = np.zeros((len(y_test), num_words),dtype=int)
for idx,word_idx in enumerate(y_test):
    y_test_array[idx,word_idx] = 1

In [14]:
y_train

array([   21,    36,   667, ...,   192, 11691, 11692])
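
The one-hot matrix for the training labels has 82696 rows and 11693 columns, which is roughly 967 million entries, so the element type matters a great deal for memory. The sketch below shows a vectorised alternative to the loop that uses one byte per entry. The helper name `one_hot` and its `uint8` default are assumptions made for illustration.

```python
import numpy as np

def one_hot(labels, num_classes, dtype=np.uint8):
    """One-hot encode integer labels using a compact element type."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=dtype)
    # Fancy indexing sets one entry per row without a Python loop.
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

toy_labels = [2, 0, 3]
toy_encoded = one_hot(toy_labels, num_classes=4)
print(toy_encoded)
```

Another option worth considering is to keep the integer labels as they are and compile the model with the `sparse_categorical_crossentropy` loss, which avoids building the dense matrix altogether.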

***
**T1.5** Building and training the model.

Now you can construct your neural network. Add the Embedding layer as the first layer. To mask any words that do not have a pre-trained embedding (these are represented as all-zero vectors), the [masking layer](https://keras.io/layers/core/) can be used.
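
To make the masking rule concrete, the NumPy sketch below reproduces the idea on a toy batch: a timestep is skipped when every feature at that timestep equals the mask value. The array values are made up for illustration, and this is a sketch of the rule rather than the Keras implementation itself.

```python
import numpy as np

# Toy batch: 1 sequence, 3 timesteps, 4 features (hypothetical values).
toy_embedded = np.array([[[0.0, 0.0, 0.0, 0.0],    # padding or word without embedding
                          [0.1, 0.0, -0.3, 0.2],
                          [0.0, 0.0, 0.5, 0.0]]])

# A timestep is kept if any of its features differs from mask_value.
mask_value = 0.0
keep = np.any(toy_embedded != mask_value, axis=-1)
print(keep)
```

Only the first timestep is masked: a vector that contains some zeros but is not entirely zero is still passed on to the LSTM.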

In [13]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Masking, Embedding

model = Sequential()
model.add(Embedding(input_dim=num_words,
                    input_length=train_length,
                    output_dim=300,
                    weights=[embedding_matrix],
                    trainable=False,
                    mask_zero=True))

# Masking layer for pre-trained embeddings
model.add(Masking(mask_value=0.0))

# Recurrent layer
model.add(LSTM(64, return_sequences=False, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_words, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train,  y_train_array, batch_size=64, epochs=15, validation_data=(x_test, y_test_array))

Train on 82696 samples, validate on 24209 samples
Epoch 1/15
Epoch 2/15

KeyboardInterrupt: 
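
Once a model has been trained to completion, its softmax output for one input sequence is a vector of `num_words` probabilities, one per vocabulary index. The sketch below shows how such a vector could be turned back into candidate next words. `top_k_words`, `toy_index_word` and `toy_probs` are hypothetical names and values used for illustration; in the notebook the probabilities would come from `model.predict` and the mapping from `tokenizer.index_word`.

```python
import numpy as np

def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs for one prediction."""
    probs = np.asarray(probs)
    best = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    # Index 0 is reserved for padding and has no entry in index_word.
    return [(index_word.get(int(i), "<pad>"), float(probs[i])) for i in best]

toy_index_word = {1: "the", 2: "doctor", 3: "pain"}
toy_probs = [0.05, 0.15, 0.6, 0.2]
print(top_k_words(toy_probs, toy_index_word, k=2))
```

Looking at the top few candidates rather than only the single best word often gives a better feel for what the model has learned, since exact next-word accuracy is usually low.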

# Task 2: Sentiment Analysis with LSTM

Implement an LSTM neural network to solve the sentiment analysis problem from the last practical. You can explore different variants of the model (with pre-trained embeddings, with embeddings trained via the Embedding layer, or with transfer learning embeddings using word2vec).

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('yelp_reviews.csv',encoding = "ISO-8859-1")

#select input and output variables
data = df.values[:,0]
labels = df.values[:,1]

x_train, x_test, y_train, y_test = train_test_split(data, labels,test_size=0.5, random_state=0)

Data pre-processing: encode each entry from the train/test sets as a sequence of integers for the Embedding layer.

In [18]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=50000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)

# Length of each review in whitespace-separated words; the maximum guides the choice of maxlen
length = [len(x.split()) for x in x_train]
max(length)

num_words = len(tokenizer.word_index)+1

In [19]:
from keras.preprocessing.sequence import pad_sequences

x_train_seq = pad_sequences(sequences, maxlen=45)
sequences_val = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_val, maxlen=45)
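
With its default settings, `pad_sequences` adds zeros at the front of short sequences and drops tokens from the front of long ones. The pure-Python sketch below imitates that default behaviour on toy data. `pad_pre` is a hypothetical helper written for illustration and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_padded = pad_pre([[5, 8, 2, 9], [7]], maxlen=3)
print(toy_padded)
```

Because index 0 is never assigned to a word by the tokenizer, the padded positions map to the all-zero row of the embedding matrix and can be skipped by the masking layer.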

Generating the weight matrix with pre-trained word2vec embeddings.

In [20]:
num_words = len(tokenizer.word_index)+1
embedding_matrix = np.zeros((num_words, 300))
for word, i in tokenizer.word_index.items():
    if word in word2vec_vectors:
        embedding_vector = word2vec[word]
        embedding_matrix[i] = embedding_vector

In [21]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Masking, Embedding

In [22]:
model = Sequential()
e = Embedding(num_words, 300, weights=[embedding_matrix], input_length=45, trainable=False)
model.add(e)
model.add(Masking(mask_value=0.0))
model.add(LSTM(64, return_sequences=False, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train_seq, y_train, validation_data=(x_test_seq, y_test), epochs=5, batch_size=32, verbose=2)

Train on 498 samples, validate on 498 samples
Epoch 1/5
 - 4s - loss: 0.6765 - acc: 0.6044 - val_loss: 0.6493 - val_acc: 0.6486
Epoch 2/5
 - 2s - loss: 0.5857 - acc: 0.7008 - val_loss: 0.5586 - val_acc: 0.7390
Epoch 3/5
 - 3s - loss: 0.5035 - acc: 0.7631 - val_loss: 0.5324 - val_acc: 0.7470
Epoch 4/5
 - 3s - loss: 0.4122 - acc: 0.8414 - val_loss: 0.4545 - val_acc: 0.8012
Epoch 5/5
 - 2s - loss: 0.3643 - acc: 0.8554 - val_loss: 0.4352 - val_acc: 0.7952


<keras.callbacks.History at 0x20041cab978>