# Deep learning for text analysis

Deep learning has revolutionized image analysis. In addition, we have seen in the previous notebook that it also delivers state-of-the-art performance for recommender systems. Next, we tackle the domain of natural-language processing (NLP).

Since language and images are two fundamentally different kinds of data, we will also learn a new architecture: __recurrent neural networks (RNNs)__.

## Embeddings

First, we face the issue that words are discrete objects, whereas neural networks operate on vector spaces. 

In the context of recommender systems, we have seen that this obstacle can be overcome by the use of _embeddings_.
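
As a reminder of what an embedding does, here is a minimal sketch, assuming a toy vocabulary of five words and randomly initialized 3-dimensional vectors. The names `vocab` and `embedding_table` are illustrative and do not appear elsewhere in this notebook. Each word is mapped to an integer index, and the index selects a row of a matrix.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

rng = np.random.default_rng(0)
# One row per word; 3 dimensions is an arbitrary choice for illustration.
embedding_table = rng.normal(size=(len(vocab), 3))

sentence = ["the", "cat", "sat"]
indices = np.array([vocab[w] for w in sentence])

# Selecting rows by index ...
looked_up = embedding_table[indices]

# ... is equivalent to multiplying one-hot vectors with the table.
one_hot = np.eye(len(vocab))[indices]
via_matmul = one_hot @ embedding_table

print(looked_up.shape)
```

In a neural network the rows of the table are trainable parameters, so the lookup is simply the first layer of the model.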

However, the *Duden* alone contains 50,000 words, so it takes a lot of data to learn an informative embedding from scratch. Additionally, it is not clear which objective we should optimize.

Fortunately, powerful general-purpose embeddings have already been trained and are widely available. The most widely used are **Word2Vec** and **GloVe**. Let's play with these embeddings.

In [5]:
import gensim
import gensim.scripts.glove2word2vec  # submodule used for the GloVe conversion below
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

# File paths
ROOT = '../../refereeReports/talks/deepLearning/data/processed/glove'

First, we convert the GloVe vectors into the word2vec text format.

In [6]:
gensim.scripts.glove2word2vec.get_glove_info("{}/glove.6B.50d.txt".format(ROOT))
gensim.scripts.glove2word2vec.glove2word2vec("{}/glove.6B.50d.txt".format(ROOT), "{}/w2vec.txt".format(ROOT))

(400000, 50)

Now, we load the word vectors.

In [7]:
word_vectors = KeyedVectors.load_word2vec_format("{}/w2vec.txt".format(ROOT), binary=False)

We can now look at individual word vectors.

In [21]:
word_vectors['computer']

array([ 0.079084  , -0.81503999,  1.79009998,  0.91653001,  0.10797   ,
       -0.55628002, -0.84426999, -1.49510002,  0.13417999,  0.63626999,
        0.35146001,  0.25813001, -0.55028999,  0.51055998,  0.37408999,
        0.12092   , -1.61660004,  0.83653003,  0.14202   , -0.52348   ,
        0.73452997,  0.12207   , -0.49079001,  0.32532999,  0.45306   ,
       -1.58500004, -0.63848001, -1.00530005,  0.10454   , -0.42984   ,
        3.18099999, -0.62186998,  0.16819   , -1.01390004,  0.064058  ,
        0.57844001, -0.45559999,  0.73782998,  0.37202999, -0.57722002,
        0.66441   ,  0.055129  ,  0.037891  ,  1.32749999,  0.30991   ,
        0.50696999,  1.23570001,  0.1274    , -0.11434   ,  0.20709001], dtype=float32)

Since the words now live in a Euclidean space, we can measure how similar two words are (gensim's `similarity` returns the cosine similarity of the two vectors).

In [27]:
[word_vectors.similarity(*pair) for pair in [['cat', 'dog'], 
                                             ['cat', 'god'], 
                                             ['cat', 'banana'],
                                             ['cat', 'dictator']]]

[0.92180052663414469,
 0.45516447899540219,
 0.339651731961363,
 0.07163338078458989]
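
The similarity scores above are cosine similarities. A minimal sketch of the computation, using hand-picked 3-dimensional vectors rather than real GloVe vectors (the helper `cosine_similarity` is our own, not a gensim function):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, chosen by hand for illustration.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.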

We can also do arithmetic! Let's solve the equation `king - man` = `x - woman`, that is, `x = king - man + woman`.

In [28]:
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.8523603677749634),
 ('throne', 0.7664334177970886),
 ('prince', 0.759214460849762),
 ('daughter', 0.7473883032798767),
 ('elizabeth', 0.7460220456123352),
 ('princess', 0.7424569725990295),
 ('kingdom', 0.7337411642074585),
 ('monarch', 0.7214490175247192),
 ('eldest', 0.7184861898422241),
 ('widow', 0.7099430561065674)]
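
The idea behind this query can be sketched with toy vectors. The 2-dimensional vectors below are invented for illustration (first coordinate roughly "royalty", second roughly "gender"), and the helper `analogy` is hypothetical. gensim additionally normalizes the vectors before combining them, which this sketch skips.

```python
import numpy as np

# Invented toy vectors; real embeddings have 50 or more dimensions.
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scores = {w: cos(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

answer = analogy(["woman", "king"], ["man"], toy)
print(answer)
```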

We can also find the odd one out.

In [33]:
word_vectors.doesnt_match("mathematics chemistry physics cat".split())

'cat'
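
A rough sketch of how such an odd-one-out query can be answered, assuming the odd word is the one least similar to the mean of all (normalized) word vectors. The toy vectors and the helper `odd_one_out` are invented for illustration; gensim's own implementation may differ in detail.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Invented 2-dimensional toy vectors: three "science" words and one animal.
toy = {
    "mathematics": np.array([1.0, 0.1]),
    "chemistry":   np.array([0.9, 0.2]),
    "physics":     np.array([1.0, 0.0]),
    "cat":         np.array([0.1, 1.0]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    normalized = np.vstack([unit(vectors[w]) for w in words])
    mean = unit(normalized.mean(axis=0))
    similarities = normalized @ mean
    return words[int(np.argmin(similarities))]

odd = odd_one_out("mathematics chemistry physics cat".split(), toy)
print(odd)
```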

## IMDb Reviews

Now we analyze IMDb reviews using word embeddings and neural networks. First, we fix the maximum vocabulary size, the maximum review length, and the batch size.

In [21]:
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Flatten, Input, BatchNormalization, Bidirectional, Conv1D, Dropout, Dense, Embedding, LSTM, Convolution1D, MaxPooling1D
from tensorflow.keras.optimizers import Adam

# maximum number of words in the vocabulary
MAX_FEATURES = 20000

# maximum review length in words
MAXLEN = 80

# learning rate
LR = 1e-3

BATCH_SIZE = 32

Next, we load the training and validation data.

In [11]:
(x_trn, y_trn), (x_val, y_val) = imdb.load_data(num_words = MAX_FEATURES)

To make the reviews readable, we also load the word index and invert it.

In [12]:
idx = imdb.get_word_index()
# invert the mapping word -> index into index -> word
idx2word = {i: word for word, i in idx.items()}

Because of the NLP preprocessing, the reviews are hard for us to read!

In [13]:
' '.join([idx2word[a] for a in x_trn[5]])

'the effort still been that usually makes for of finished sucking ended and an because before if just though something know novel female i i slowly lot of above and with connect in of script their that out end his deceptively i i'
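
One likely reason the decoded review above reads as gibberish is that `imdb.load_data` by default reserves the indices 0, 1 and 2 (padding, start of sequence, unknown word) and shifts all word indices by `index_from=3`, whereas `imdb.get_word_index()` returns the unshifted ranks. The sketch below illustrates the effect on a made-up four-word index; `toy_word_index`, `encode` and the labels in `SPECIAL` are hypothetical names chosen for this example.

```python
# Made-up miniature word index in the style of imdb.get_word_index():
# rank 1 is the most frequent word.
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # default shift applied by imdb.load_data
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # labels are our own choice

def encode(words):
    """Mimic the encoding: start token 1, then every rank shifted by INDEX_FROM."""
    return [1] + [toy_word_index[w] + INDEX_FROM for w in words]

# Decoder that accounts for the shift ...
shifted = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
shifted.update(SPECIAL)
# ... and a naive decoder that simply inverts the word index.
naive = {rank: word for word, rank in toy_word_index.items()}

encoded = encode(["the", "film", "was", "great"])
decoded_shifted = " ".join(shifted[i] for i in encoded)
decoded_naive = " ".join(naive.get(i, "?") for i in encoded)

print(decoded_shifted)
print(decoded_naive)
```

Applied to the real data, the same correction amounts to looking up `i - 3` in the inverted index for every `i >= 3`. Words outside the `MAX_FEATURES` most frequent ones are still replaced by the unknown-word index, so some gaps remain.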

We now truncate or pad every review to 80 words.

In [14]:
[x_trn, x_val] = [sequence.pad_sequences(x, maxlen = MAXLEN) for x in [x_trn, x_val]]
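
To see what padding to a fixed length does, here is a small NumPy sketch of the default behavior of `pad_sequences`: short sequences are padded with zeros at the front, and long sequences keep only their last `maxlen` entries. The helper `pad_pre` is our own illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]
        if len(kept) > 0:
            out[row, -len(kept):] = kept
    return out

padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```

Padding at the front means that the last time steps of every row contain real words, which is convenient for recurrent models that read the sequence from left to right.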

Now we start training simple models on this data.

### MLP

First, we define a simple MLP on the reviews.

In [18]:
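# Embedding size 32 and 100 hidden units are assumed values, not tuned.
mlp = Sequential([
    Embedding(MAX_FEATURES, 32, input_length = MAXLEN),
    Flatten(),
    Dense(100, activation = 'relu'),
    Dense(1, activation = 'sigmoid')])
mlp.compile(loss = 'binary_crossentropy', optimizer = Adam(LR), metrics = ['accuracy'])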

Now we train the model.

In [19]:
def fit_model(model, epochs = 10):
    model.fit(x_trn, y_trn, 
          validation_data=(x_val, y_val), 
          epochs = epochs, batch_size = BATCH_SIZE)

In [20]:
fit_model(mlp)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### CNN

Next, we test a CNN. The idea is that we can draw conclusions from words that lie close together.
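
A minimal NumPy sketch of a single 1D convolution filter, assuming a review of 80 words that has already been embedded into 50 dimensions (random numbers stand in for real embeddings, and `conv1d_valid` is our own helper). The filter looks at 5 neighboring words at a time and slides along the review.

```python
import numpy as np

def conv1d_valid(x, kernel, bias=0.0):
    """x: (length, channels), kernel: (width, channels) -> (length - width + 1,)."""
    width = kernel.shape[0]
    n_out = x.shape[0] - width + 1
    return np.array([np.sum(x[t:t + width] * kernel) + bias for t in range(n_out)])

rng = np.random.default_rng(0)
embedded = rng.normal(size=(80, 50))  # stand-in for one embedded review
kernel = rng.normal(size=(5, 50))     # one filter spanning 5 consecutive words

# ReLU activation applied to the filter responses.
feature_map = np.maximum(conv1d_valid(embedded, kernel), 0.0)
print(feature_map.shape)
```

Without padding the output has 80 - 5 + 1 = 76 positions. With `padding = 'same'` the input is zero-padded so that the output keeps all 80 positions. A convolutional layer with 64 filters computes 64 such feature maps in parallel.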

In [22]:
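# Embedding size, pooling layer and 100 hidden units are assumed values, not tuned.
conv1 = Sequential([
    Embedding(MAX_FEATURES, 32, input_length = MAXLEN),
    Convolution1D(64, 5, padding = 'same', activation='relu'),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation = 'relu'),
    Dense(1, activation = 'sigmoid')])

conv1.compile(loss = 'binary_crossentropy', optimizer = Adam(LR), metrics = ['accuracy'])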

Now we fit the model.

In [23]:
fit_model(conv1)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Word embeddings

Now we use pretrained embeddings.

In [24]:
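embedding_matrix = np.zeros((MAX_FEATURES, 50))
# imdb.load_data shifts word indices by 3 by default (index_from=3); words without a GloVe vector keep a zero row
for word, i in idx.items():
    if i + 3 < MAX_FEATURES and word in word_vectors:
        embedding_matrix[i + 3] = word_vectors[word]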

In [28]:
conv2 = Sequential([
    Embedding(MAX_FEATURES, 50, input_length = MAXLEN, weights = [embedding_matrix],
              trainable = False),
    Conv1D(128, 5, activation='relu'),
    MaxPooling1D(2),
    Conv1D(128, 5, activation='relu'),
    MaxPooling1D(2),
    Flatten(),
    Dense(128, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])
conv2.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [32]:
embedding_layer = Embedding(MAX_FEATURES, 50, input_length = MAXLEN, weights = [embedding_matrix],
              trainable = False)
sequence_input = Input(shape=(MAXLEN,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(2)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(2)(x)
"""
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
"""
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)

conv2 = Model(sequence_input, preds)

conv2.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [16]:
conv2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 80)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 80, 50)            1000000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 76, 128)           32128     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 38, 128)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 34, 128)           82048     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 17, 128)           0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 2176)              0         
__________
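
The shapes and parameter counts in this summary can be checked by hand. The sketch below assumes unpadded convolutions with stride 1 and non-overlapping pooling, which matches the layer definitions above; the helper names are our own.

```python
def conv_out(length, kernel):
    """Output length of an unpadded convolution with stride 1."""
    return length - kernel + 1

def pool_out(length, pool):
    """Output length of non-overlapping max pooling."""
    return length // pool

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters

embedding_params = 20000 * 50

length = 80
after_conv1 = conv_out(length, 5)
after_pool1 = pool_out(after_conv1, 2)
after_conv2 = conv_out(after_pool1, 5)
after_pool2 = pool_out(after_conv2, 2)
flattened = after_pool2 * 128

print(after_conv1, after_pool1, after_conv2, after_pool2, flattened)
print(embedding_params, conv_params(5, 50, 128), conv_params(5, 128, 128))
```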

In [29]:
fit_model(conv2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### LSTM


Now we fit our first RNN, an LSTM.
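
To give an idea of what happens inside an LSTM, here is a NumPy sketch of the standard LSTM update for a single time step, applied word by word to a random stand-in for an embedded review. The weight shapes, the gate ordering and the sizes (32 input dimensions, 8 units) are assumptions made for this illustration and need not match the internal layout used by Keras.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c_new = f * c + i * g                 # keep part of the old memory, add new information
    h_new = o * np.tanh(c_new)            # expose part of the memory as output
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, units, steps = 32, 8, 80
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x in rng.normal(size=(steps, input_dim)):  # one vector per word
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape)
```

The final hidden state `h` summarizes the whole review and is what the dense output layer receives.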

In [30]:
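# 32 LSTM units is an assumed value, not tuned.
lstm = Sequential([
    Embedding(MAX_FEATURES, 32, input_length = MAXLEN, mask_zero = True),
    LSTM(32),
    Dense(1, activation = 'sigmoid')])
lstm.compile(loss = 'binary_crossentropy', optimizer = Adam(LR), metrics = ['accuracy'])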

In [31]:
fit_model(lstm)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
 1856/25000 [=>............................] - ETA: 101s - loss: 0.0727 - acc: 0.9784

KeyboardInterrupt: 

### Bi-LSTM

Now we fit a bidirectional LSTM (Bi-LSTM).
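
The bidirectional idea can be sketched as follows: one recurrent network reads the review from left to right, a second one with its own weights reads it from right to left, and the two final states are concatenated (concatenation is the default merge mode of the Keras `Bidirectional` wrapper). To keep the example short, it uses a plain tanh recurrence instead of an LSTM cell; all names and sizes are assumptions made for this illustration.

```python
import numpy as np

def simple_rnn(xs, W, U):
    """Plain tanh recurrence (not an LSTM); returns the final hidden state."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h)
    return h

rng = np.random.default_rng(0)
input_dim, units, steps = 32, 8, 80
xs = rng.normal(size=(steps, input_dim))  # stand-in for one embedded review

# Separate weights for the two directions.
W_f, U_f = rng.normal(scale=0.1, size=(units, input_dim)), rng.normal(scale=0.1, size=(units, units))
W_b, U_b = rng.normal(scale=0.1, size=(units, input_dim)), rng.normal(scale=0.1, size=(units, units))

h_forward = simple_rnn(xs, W_f, U_f)         # reads word 1 ... word 80
h_backward = simple_rnn(xs[::-1], W_b, U_b)  # reads word 80 ... word 1
summary = np.concatenate([h_forward, h_backward])

print(summary.shape)
```

The concatenated vector has twice as many entries as a single direction, so the dense layer on top has twice as many input weights.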

In [33]:
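# Embedding size 32 and 32 LSTM units per direction are assumed values, not tuned.
bilstm = Sequential([
    Embedding(MAX_FEATURES, 32, input_length = MAXLEN, mask_zero = True),
    Bidirectional(LSTM(32)),
    Dense(1, activation = 'sigmoid')])
bilstm.compile(loss = 'binary_crossentropy', optimizer = Adam(LR), metrics = ['accuracy'])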


In [37]:
fit_model(bilstm)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10

KeyboardInterrupt: 