[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/Ex09_embedding/Ex09_nlp1_embeddings.ipynb)

# Tutorial #9 Natural Language Processing Part 1

Written and spoken text is unstructured data. Teaching computers to "understand" language has been one of the hardest tasks that researchers in computer science and linguistics have faced. It is still work in progress, but modern methods of **natural language processing (NLP)** offer very impressive results. We need to map the semantic diversity that our brain operates in to the numeric space that computers can "understand". Thus, our goal is to represent the meaning of a word, sentence, or text as a real-valued vector. How can we do that? We first apply the standard pre-processing operations of the NLP workflow. Thereafter, we can apply machine learning methods. That is pretty much the outline for today. The last part of the tutorial will also introduce a simple form of transfer learning in NLP based on pre-trained word vectors.

#### Agenda

1. Text pre-processing
2. LSTM language modeling example
3. Embedding layers

### Preprocessing

Dealing with unstructured data might seem intimidating at first, but fear not, we will explore several methods to prepare your text for whatever methods you are planning to apply. We will look into:
 - tokenizing
 - removing punctuation
 - removing stop words, sparse terms, and particular words
 - stemming/lemmatizing
 
For all that we will use a really nice package called **nltk**. This is one of the packages that might not be readily available on your machines. You might have to install it on your own and then use the *.download()* function to download additional functionality such as dictionaries. 

In [None]:
# If your computer is setup and all NLTK libraries are installed, a simple import should do 
import nltk

# When you are still in the process of setting up your environment, here is a list of the packages to download

#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

Let us start with creating some text for demonstrating basic NLP operations.

In [None]:
text = "I wonder if I have been changed in the night. Let me think. Was I the same when I got up this morning? I almost can remember feeling a little different. But if I am not the same, the next question is 'Who in the world am I?' Ah, that is the great puzzle!"
#hope no need to quote :)

First step: **tokenize** it. Tokens can be sentences, words, or characters. We will go for words (word tokenization).

In [None]:
#* Demo of tokenization using NLTK 
tokens = nltk.word_tokenize(text)

print('The text has {} tokens.'.format(len(tokens)))  # count tokens, not characters
print(tokens)

# Above we use only one string. More commonly, you would work with a list, e.g., a list of phrases.
# In that case, you can use a list comprehension as follows (texts being a list of strings)
#tokens = [nltk.word_tokenize(item) for item in texts]

Hm, but the punctuation is still there. Is it noise or is it useful? Let's try removing it for now (there are many ways to do this). Additionally, we will drop all tokens that contain non-alphabetic symbols and convert everything to lower case.


In [None]:
#* Start with cleaning the text
tokens2 = [word.lower() for word in tokens if word.isalpha()]  # isalpha() -> keep tokens made of letters only

print(tokens2)

For a more sophisticated cleaning of text, you might want to consider *regular expressions*. In a nutshell, regular expressions are a family of text processing techniques for searching and replacing text. Their capability to match expressions in a text, for example an email, is quite powerful. A quick read through the corresponding [Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression) would be useful. Also, here is a [nice playground](https://regexr.com/). Using a regular expression, we could re-write the above code as follows:

In [None]:
#* Cleaning the text using a regular expression
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize(text.lower())
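For comparison, the same `\w+` pattern can be applied with Python's built-in `re` module, without NLTK. This is only a minimal sketch. The variable `sample` is a shortened piece of the text above and is introduced here for illustration.

```python
import re

# Shortened piece of the demo text (illustrative)
sample = "Let me think. Was I the same when I got up this morning?"

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
words = re.findall(r"\w+", sample.lower())
print(words)
```

Note that `\w+` behaves differently from the `isalpha()` filter: it keeps digits, and it splits a contraction such as "don't" into two tokens because the apostrophe is not a word character.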

This is starting to look like a dictionary already, right? There are some more issues we want to address though. One of them is 'stop words'. Semantically, they do not mean much but serve to put sentences together ("the", "a", "and", etc.), so they mostly add noise. NLTK offers its own list of stop words.

In [None]:
#* Remove stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
stop_words

The list of stop words looks comprehensive. However, say you miss a 'stop word' that you would also like to filter. You can extend the above list easily. After all, it is just a list.

In [None]:
#* Add some custom stopwords
print('This is just a {} of stop words.'.format(type(stop_words))) 

stop_words.append('some_word_you_dont_like')  # you can apply all the functions for lists
stop_words[-1]

In [None]:
#* Filter stop words using, e.g., list comprehension
tokens3 = [word for word in tokens2 if word not in stop_words]
print(tokens3)
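The list comprehension above checks every token against a Python list. For long texts, it may be worth converting the stop words to a `set` first, since set membership tests take constant time on average. The following sketch uses a small made-up stop-word list (`stop_list`) and made-up tokens, not the NLTK list.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer
stop_list = ["i", "the", "if", "have", "been", "in"]
stop_set = set(stop_list)  # average O(1) membership test instead of O(n) for a list

sample_tokens = ["i", "wonder", "if", "i", "have", "been", "changed", "in", "the", "night"]
filtered = [tok for tok in sample_tokens if tok not in stop_set]
print(filtered)  # ['wonder', 'changed', 'night']
```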

You might have already thought of the next issue: what if a word is used in different forms? The forms would be treated as different words, right? That is where **lemmatizing** comes in. It is more sophisticated than stemming, which just drops suffixes, but it needs to look up words in a dictionary, which we load below.

In [None]:
#* NLTK lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# You need to choose the type of word:
print(lemmatizer.lemmatize("stripes", 'v'))  
print(lemmatizer.lemmatize("stripes", 'n'))  

In [None]:
tokens4 = [lemmatizer.lemmatize(word, pos='n') for word in tokens3]
tokens4 = [lemmatizer.lemmatize(word, pos='v') for word in tokens4]
print(tokens4)

In [None]:
# For the sake of completeness, here is an example of stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in tokens3]

#### One-hot encoding
Finally, let's set up our one-hot encoding matrix. First we need to load some tools.

In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Convert tokens to a numpy array
tokens5 = np.array(tokens4)
print(tokens5)
print(len(tokens5))

Next, we can start with the actual encoding, which happens in two steps. First, we map our tokens to a numeric representation using LabelEncoder. A token will be converted to its index in an ordered list of tokens. Check it out below. The first element in the encoded array is 16.
Does that make sense? Where do you see a value of zero? Plausible?

In [None]:
#* Encoding the text 
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(tokens5)
print(encoded)
print(encoded.shape)
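To see what "index in an ordered list of tokens" means, the mapping can be rebuilt by hand in plain Python. This is a sketch of the idea, not the scikit-learn implementation; `toy_tokens` is a made-up token list.

```python
toy_tokens = ["wonder", "change", "night", "let", "think", "change"]

# Sorted unique tokens play the role of LabelEncoder.classes_
classes = sorted(set(toy_tokens))
token_to_index = {tok: idx for idx, tok in enumerate(classes)}

toy_encoded = [token_to_index[tok] for tok in toy_tokens]
print(classes)      # ['change', 'let', 'night', 'think', 'wonder']
print(toy_encoded)  # [4, 0, 2, 1, 3, 0]
```

The alphabetically first token receives the label 0, and a repeated token always receives the same label.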

This is data from which we can create our one-hot-encoded matrix.

In [None]:
#* One-hot-coding
onehot_encoder = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2 this argument is called sparse_output
encoded = encoded.reshape(len(encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(encoded)
print(onehot_encoded)
print(onehot_encoded.shape)
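As a cross-check, a one-hot matrix can also be built with plain NumPy by indexing the identity matrix. The sketch below uses made-up integer codes (`toy_labels`) and assumes that every class between 0 and the maximum label occurs in the data.

```python
import numpy as np

toy_labels = np.array([4, 0, 2, 1, 3, 0])   # hypothetical integer codes
n_classes = toy_labels.max() + 1

# Row k of the identity matrix is the one-hot vector for class k
toy_onehot = np.eye(n_classes)[toy_labels]
print(toy_onehot.shape)  # (6, 5)
```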

Of course, at some point in our code, we would probably like/need to revert the encoding. The process is very similar to the time series forecasting example of the previous tutorial, in which we scaled our price differences and then reverted the scaling to calculate errors. Here is an example of how to decode our text.

In [None]:
#* Check 6th word
inverted = label_encoder.inverse_transform([np.argmax(onehot_encoded[5, :])])
print(onehot_encoded[5, :])
print(inverted)
print(tokens5[4:7])

## LSTM language modeling example 
Let's try using an LSTM classifier to predict the next word in the sentence. Specifically, we will feed the network a 
sequence of words of length _timesteps_ and ask it to predict the next word following that sequence. To that end, we re-use our *create_dataset()* function from the last tutorial to build up the input data for the LSTM.

In [None]:
# Create input data for supervised learning

def create_dataset(time_series, timesteps):
    """ Helper function to transform an input sequence into data for supervised learning  """ 
    dataX, dataY = [], []
    
    for i in range(0,len(time_series) - timesteps):
        x = time_series[i:i + timesteps]  # mind Python indexing: this is i to i+t-1  
        dataX.append(x)
        y = time_series[i + timesteps]    # while this is i+t
        dataY.append(y)
           
    return np.array(dataX), np.array(dataY) 
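A small numeric example may help to see what the sliding window does. The sketch below restates the helper compactly under the hypothetical name `make_windows` and applies it to a made-up sequence of six numbers.

```python
import numpy as np

def make_windows(sequence, timesteps):
    """Compact restatement of create_dataset() for illustration."""
    n = len(sequence) - timesteps
    X = [sequence[i:i + timesteps] for i in range(n)]   # inputs: i .. i+t-1
    y = [sequence[i + timesteps] for i in range(n)]     # target: element i+t
    return np.array(X), np.array(y)

toy_X, toy_y = make_windows([10, 20, 30, 40, 50, 60], timesteps=3)
print(toy_X)  # rows: [10 20 30], [20 30 40], [30 40 50]
print(toy_y)  # [40 50 60]
```

A sequence of length 6 with 3 timesteps yields 6 - 3 = 3 training examples.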

In [None]:
timesteps = 3 # we will feed in *timesteps* words to predict the next word

X, y = create_dataset(onehot_encoded, timesteps)#lookback
print(X.shape) # nb obs, nb timestep, nb features=18, which is pretty much the size of our vocab
print(y.shape)

Let's print some data and, just for fun, code a little demo to verify that everything we do with our vectors and matrices makes sense.

In [None]:
# Third input sequence 
X[2, :, :]

In [None]:
# Target of the third example
y[2]

In [None]:
#* Decoding the input and target words
for w in X[2, :, :]:
    print(label_encoder.inverse_transform([np.argmax(w)]))
print('is followed by next word ')
print(label_encoder.inverse_transform([np.argmax(y[2])]))
tokens5

It is hard to think of text as vectors/matrices. You will soon get used to it. Let's now continue with our LSTM classifier.

In [None]:
#* LSTM language model
from keras.models import Sequential
from keras.layers import Dense, LSTM

nb_hidden = 5
vocab_size = X.shape[2]

model = Sequential()
model.add(LSTM(nb_hidden, input_shape=(timesteps, vocab_size)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = "adam", metrics=['accuracy'])

story = model.fit(X, y, batch_size=1, epochs=20, shuffle=True)
model.summary()

Accuracy does not look good. Obviously, we have far too few observations. Let's look at some predictions of our network.
Remember that our 'sentence' or rather token sequence was: <br>
'wonder', 'chang', 'night', 'let', 'think', 'got', 'morn', 'almost', 'rememb', 'feel', 'littl', 'differ', 'next', 'question', 'world', 'ah', 'great', 'puzzl'
<br>
We take the last three tokens of the sequence and see what our LSTM has predicted:

In [None]:
# Demo of LSTM prediction 
import matplotlib.pyplot as plt

n = 3
yhat = model.predict(X[-n:])
# predicted density across the words in our vocab 

plt.hist(yhat[0])
plt.show();

# most probable words
ix = np.argsort(yhat, axis=1)

demo = {'y': label_encoder.inverse_transform(np.argmax(y[-n:,], axis=1)),
        'yhat': np.empty((n,3), dtype=object)}

for i in range(n):
    top3 = label_encoder.inverse_transform(ix[i,-3:])
    demo['yhat'][i,:] = top3
    print("Target '{}' predictions with decreasing prob: '{}', '{}', '{}'".format(
        demo['y'][i], demo['yhat'][i,-1], demo['yhat'][i,-2], demo['yhat'][i,-3]
    ))

In [None]:
# Compare to actuals
tokens5

Ok, at least the model got it almost right occasionally. One take-home from the above demo: Say you want to use your model to _generate_ language. Picking the next word in your sequence of text as the one with the largest estimated probability might not be the optimal approach. As you see above, our very simple model did somewhat well in assigning the actual next word a high probability, yet not the highest. Serious models for language generation make use of a more advanced strategy to select the next word. You can run a web search for *beam search* if you are interested.
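To give a flavor of beam search, here is a minimal sketch on a toy problem. The table `next_word_probs` is made up for illustration and conditions only on the previous word; a real language model would supply these probabilities. The function `beam_search` is a hypothetical helper, not part of Keras.

```python
import math

# Hypothetical next-word probabilities, conditioned on the previous word only
next_word_probs = {
    "<s>":    {"the": 0.6, "a": 0.4},
    "the":    {"great": 0.5, "puzzle": 0.3, "</s>": 0.2},
    "a":      {"puzzle": 0.9, "</s>": 0.1},
    "great":  {"puzzle": 0.6, "</s>": 0.4},
    "puzzle": {"</s>": 1.0},
}

def beam_search(start, beam_width, max_len):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [([start], 0.0)]  # (tokens, log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == "</s>":   # finished sequences are carried over unchanged
                candidates.append((tokens, logp))
                continue
            for word, p in next_word_probs[tokens[-1]].items():
                candidates.append((tokens + [word], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

greedy = beam_search("<s>", beam_width=1, max_len=4)[0]  # always take the top word
best = beam_search("<s>", beam_width=2, max_len=4)[0]    # keep two hypotheses alive

print(greedy[0], round(math.exp(greedy[1]), 2))
print(best[0], round(math.exp(best[1]), 2))
```

In this toy table, the greedy choice starts with "the" because it is the most probable first word, but the sequence starting with "a" ends up with the higher overall probability (0.36 versus 0.18). Keeping more than one hypothesis lets the search recover it.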
<br><br>
Enough distraction for now. Let's continue with taking a closer look into the LSTM and its layers.

In [None]:
# Here is the model architecture again. So we are interested in the weights of layer 0
model.summary()

Recall the following equations from our LSTM session in the lecture. They remind us about the processing
of our input data, which has 18 dimensions in this example.
$$ I_t = \sigma \left( X_t W_{xi} + H_{t-1}W_{hi} + b_i \right) $$
$$ F_t = \sigma \left( X_t W_{xf} + H_{t-1}W_{hf} + b_f \right) $$
$$ O_t = \sigma \left( X_t W_{xo} + H_{t-1}W_{ho} + b_o \right) $$
$$ \tilde{C}_t = \tanh \left( X_t W_{xc} + H_{t-1}W_{hc} + b_c \right) $$

So let's see if we can use this formalism to recalculate the number of parameters in the LSTM.

In [None]:
#* Recalculate the parameters of the network
# LSTM layer
nb_lstm_para = 4 * ((vocab_size + 1) * nb_hidden + nb_hidden**2)  # +1 for the bias and times 4 for all the gates
print('LSTM layer parameters: {}'.format(nb_lstm_para))

# Output layer
print('Output layer parameters: {}'.format(vocab_size * nb_hidden + vocab_size))
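The same count can be written out as a small worked example with the numbers of this tutorial plugged in (18 input dimensions, 5 hidden units). The function names `lstm_param_count` and `dense_param_count` are introduced here for illustration.

```python
def lstm_param_count(input_dim, hidden_units):
    """Four blocks, each with input weights, recurrent weights, and a bias."""
    per_block = input_dim * hidden_units + hidden_units ** 2 + hidden_units
    return 4 * per_block

def dense_param_count(input_dim, output_units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * output_units + output_units

print(lstm_param_count(18, 5))   # 4 * (90 + 25 + 5) = 480
print(dense_param_count(5, 18))  # 90 + 18 = 108
```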

Keras saves the weights for the different gates in one matrix. You can extract these weights whenever needed. 

In [None]:
# Extract LSTM weight matrices
lstm_weights = model.layers[0].get_weights()  # recall that we have one LSTM layer
print(len(lstm_weights))
print(lstm_weights[0].shape, lstm_weights[1].shape, lstm_weights[2].shape )

#numpy array of weights for inputs, numpy array of weights for hidden units, a numpy array of bias

The weight matrices store, respectively, the weights for the inputs, the weights for the hidden state, and the biases. Each of them is four times as wide as the number of hidden units because of the inner structure of the LSTM (three gates plus the candidate cell state). Integrating all the weights into two matrices and a bias vector has several advantages. However, at times, you might want to extract the individual matrices, $W_{xi}, W_{xf}, ...$ to inspect them or use them in some way. The following example shows you how to achieve this, using a helper function from https://fairyonice.github.io/Extract-weights-from-Keras's-LSTM-and-calcualte-hidden-and-cell-states.html.

In [None]:
# Code from https://fairyonice.github.io/Extract-weights-from-Keras's-LSTM-and-calcualte-hidden-and-cell-states.html
def get_LSTMweights(model1):
    for layer in model1.layers:
        if "LSTM" in str(layer):
            w = layer.get_weights()
            W,U,b = get_LSTM_UWb(w)
            break
    return W

def get_LSTM_UWb(weight):
    '''
    weight must be output of LSTM's layer.get_weights()
    W: weights for input
    U: weights for hidden states
    b: bias
    '''
    warr,uarr, barr = weight
    gates = ["i","f","c","o"]
    hunit = uarr.shape[0] # dim. of hidden state
    U, W, b = {},{},{}
    for i1,i2 in enumerate( range(0, len(barr), hunit) ): #range(start, stop, step)
        
        W[gates[i1]] = warr[:,i2:i2+hunit]
        U[gates[i1]] = uarr[:,i2:i2+hunit]
        b[gates[i1]] = barr[i2:i2+hunit].reshape(hunit,1)
    return(W,U,b)

In [None]:
em = get_LSTMweights(model)
print('Input: ' + 'x'.join(map(str,em["i"].shape)))
print('Forget: ' + 'x'.join(map(str,em["f"].shape)))
print('Output: ' + 'x'.join(map(str,em["o"].shape)))
print('Cell state: ' + 'x'.join(map(str,em["c"].shape)))
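The slicing logic of the helper can be checked on a stand-in matrix without training anything. In the sketch below, `kernel` has the shape of `lstm_weights[0]` in our setting but is filled with a running index instead of trained weights.

```python
import numpy as np

input_dim, units = 18, 5
# Stand-in for lstm_weights[0]; the values are a running index, not trained weights
kernel = np.arange(input_dim * 4 * units).reshape(input_dim, 4 * units)

# Keras concatenates the blocks in the order input, forget, cell candidate, output
blocks = {gate: kernel[:, k * units:(k + 1) * units] for k, gate in enumerate("ifco")}

print({gate: block.shape for gate, block in blocks.items()})
```

Every block has shape (18, 5), and placing the four blocks side by side gives back the original matrix.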

Do not get confused here. The above formulas are those from the lecture, following [Dive into Deep Learning](http://d2l.ai/). The weights just extracted from the trained LSTM are more aligned with the way [Colah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) presents the LSTM. For example, the blog introduces the input gate as follows:
$$i_t = \sigma \left( W_i \cdot \left[ h_{t-1},x_t \right] + b_i \right)$$

So the code

In [None]:
em["i"]

extracts the block of $W_i$ in the above equation that multiplies $x_t$, which corresponds to $W_{xi}$ in the lecture notation (up to a transpose).
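To relate the two notations, one can split the blog's $W_i$ into the block acting on $h_{t-1}$ and the block acting on $x_t$. The block names $W_i^{(h)}$ and $W_i^{(x)}$ are labels introduced here for illustration; they do not appear in either source.

$$ W_i \cdot \left[ h_{t-1}, x_t \right] = W_i^{(h)} h_{t-1} + W_i^{(x)} x_t $$

Comparing this with the lecture's row-vector form $X_t W_{xi} + H_{t-1} W_{hi}$ suggests $W_{xi} = \left( W_i^{(x)} \right)^\top$ and $W_{hi} = \left( W_i^{(h)} \right)^\top$. Under this reading, the input weights returned by `get_LSTMweights()` cover the $x_t$ block only, while the recurrent block is stored in the dictionary `U` of `get_LSTM_UWb()`.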

Having clarified what we actually extracted from the LSTM, it is interesting to note that the input layer acts as an embedding layer. We talked a lot about embeddings in the lecture and now see our first embedding layer in the tutorial. To clarify the concept of embedding layers, we'll follow the process from word to LSTM input step-by-step. Let's take one word from the text, say the word 'think', and follow its path into the LSTM. We first encode the word and map it to its one-hot-coded representation. That gives us the input, $X_t$.

In [None]:
word = "think"
word_label = label_encoder.transform([word])
word_label

In [None]:
word_dummy = onehot_encoder.transform(word_label.reshape(1,-1))
print(word_dummy)
print(word_dummy.shape)

The input layer multiplies the one-hot encoded input vector with a dense weight matrix, here the input-gate weights stored in `em["i"]`.

In [None]:
# Here is the weight matrix
em["i"]

In [None]:
#* So let's do the dot product
e = word_dummy.dot(em["i"])
print(f"Here is the embedding for the word '{word}'.")
e

Make sure to take a careful look at the result. Anything that you notice? 

Indeed, our result is just one row from the weight matrix. This makes sense because our input is one-hot. Instead of computing the dot product, we could simplify the mapping from words to vectors as follows:

In [None]:
em["i"][word_label]

This speedup, and some convenient functionality, make using Embedding layers worthwhile. 
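The equivalence of dot product and row lookup can be verified on random numbers. The sketch below uses a random stand-in for `em["i"]` and an arbitrary word index; neither comes from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 18, 5
weights = rng.normal(size=(vocab, dim))   # stand-in for em["i"], random values

word_index = 7                            # hypothetical label of some word
one_hot = np.zeros((1, vocab))
one_hot[0, word_index] = 1.0

via_dot = one_hot.dot(weights)            # touches all vocab * dim entries
via_lookup = weights[[word_index]]        # copies a single row

print(np.allclose(via_dot, via_lookup))   # True
```

The dot product performs `vocab * dim` multiplications to produce a result that the lookup obtains by copying `dim` numbers.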

## Embedding layers

What we want to end up with is a __word vector/word embedding__, looking like this:

$$air=  \begin{bmatrix} -0.3\\ 0.001\\   2\\-1.1\\0.07\\ \end{bmatrix}$$

Here we have encapsulated the meaning of the word **air** within the 5x1 vector/embedding. We now have a *distributed representation*. The number of dimensions may be changed, depending on the necessary level of "detail".

The big benefit of such vectors is that they live in the same vector space, so one can do arithmetic with them and expect a meaningful result. Remember the famous example:


![emb](https://blogs.mathworks.com/images/loren/2017/vecs.png)
Image source: https://blogs.mathworks.com/images/loren/2017/vecs.png
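The analogy most often quoted in this context is king - man + woman ≈ queen. The following sketch reproduces the arithmetic with hand-made two-dimensional vectors. The numbers are invented for illustration (think of the axes as "royalty" and "gender"); they are NOT learned embeddings.

```python
import math

# Hand-made 2-d toy vectors (axes: "royalty", "gender"); NOT learned embeddings
toy_vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.2),
}

def cosine(u, v):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, computed coordinate by coordinate
target = tuple(k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"]))

# Search the nearest word among those not used in the query
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)  # queen
```

With real embeddings the match is only approximate, which is why the nearest neighbor is searched by cosine similarity rather than by testing for equality.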

The downside is that the coordinates represent latent semantic features, which cannot be interpreted with certainty. See https://ronxin.github.io/wevi/ for an interactive playground.

### Embedding layers in Keras

In the previous example, we learned the word embeddings on-the-fly as part of training the LSTM. That is maybe not the best way to learn an embedding. As discussed in the lecture, specific algorithms have been proposed to learn word embedding including **Word2Vec** and **GloVe**. We will use these approaches in subsequent tutorials. 

Even if we do not want to use pre-trained word embeddings but implicitly create embeddings during training, it is better to add a corresponding layer to the network. In fact, the previous example already showed why this is the case. A dedicated layer that maps from one-hot-coded words to dense vectors can be implemented efficiently and avoids extracting a word embedding by calculating a dot product. There is no need to calculate a high-dimensional dot product if your goal is to extract one row of an (embedding) matrix. The Keras [embedding layer](https://keras.io/layers/embeddings/#embedding) provides us with such functionality. This layer may be used in several ways:
 - to learn a word embedding that can be saved and used in another model later (or together with the model)
 - to load a pre-trained word embedding model, a type of transfer learning (this option is the most wide-spread implementation nowadays - in order to save the computational resources several pre-trained embeddings are available for download)  

In [None]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Embedding  # the keras.layers.embeddings module path is not available in recent Keras versions

Next, we can start building up our neural network. To that end, we need to decide on the dimensionality of the word embeddings. Every word is mapped to a set of numbers. How many numbers? That number is called the embedding dimension. Another ingredient is the size of our vocabulary.

In [None]:
#* Setting up an embedding layer in Keras
print('Our vocab size is {}'.format(vocab_size))
emb_dim = 100  # a common choice but chosen fully arbitrarily here


# Configure NN with embedding layer
model = Sequential()
# An embedding layer must be the first layer in a network 
model.add( Embedding( input_dim=vocab_size,   # using one-hot-coding, the size of the input is equal to the size of the vocabulary
                      output_dim=emb_dim,     # in general, the size of the hidden layer and here, our embedding dimension
                      input_length=10         # corresponds to the number of tokens in our setting. We will have to ensure this length in the data
                    )) 
model.summary()

The reason we also specified the last argument, *input_length*, is the following. Whenever the embeddings are concatenated at some point in the architecture, for example by adding a *dense layer* or a *flatten layer*, which is almost always the case, the size of subsequent layers depends on the number of inputs. Thus, to define the model, we have to tell Keras what input length to expect.

The model will take as input an integer matrix of size (batch, input_length). Now `model.output_shape == (None, 10, 100)`, where `None` is the batch dimension.
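The shape logic can be mimicked with NumPy indexing. This is a sketch of the lookup an embedding layer performs, not the Keras implementation; the batch size of 4 and the random values are arbitrary choices made for illustration.

```python
import numpy as np

vocab, dim, seq_len, batch = 18, 100, 10, 4
rng = np.random.default_rng(0)

embedding_matrix = rng.normal(size=(vocab, dim))            # stand-in for the layer's weights
token_ids = rng.integers(0, vocab, size=(batch, seq_len))   # integer matrix (batch, input_length)

# Integer-array indexing replaces every id by its row in the embedding matrix
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (4, 10, 100)
```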

### Text classification example

Let us illustrate the embedding layer in a small text classification example. We will get an evaluation for this course. Let's say we know how every student commented on the course and whether the student took the final exam/assignment. We could try to predict, based on her/his review, whether a student will take the exam/assignment or drop out. Actually, the modeling task does not really matter because our main interest is the embeddings, so let's simply make up some data.

Please note, the example draws on a nice tutorial from [ML Mastery](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/). 

In [None]:
# Student reviews
reviews = ['pretty good','very nice','god, never again','better than expected','loved the course','it was too hard','tutorials were nice','not so good','did not work for me','good course.']
# Corresponding labels whether the student took the final assignment = 1 or did not take the assignment = 0
labels = np.array([1,1,0,1,1,0,1,0,0,1])

As you remember, machines can't really deal with strings, so we will need to find a way to transform every word into a hashed integer, basically getting an indexed vocabulary (pretty=0512, good=14772, etc...)

In [None]:
vocab_size = 50 # number you can choose, just make sure it's bigger than the actual word count
encoded = [one_hot(single_review, vocab_size) for single_review in reviews]
print(encoded)
# mind that it ignores the punctuation. If some word is misspelled - it will be treated as a new word.

Note that, with the above approach, it is not guaranteed that distinct words are mapped to distinct integers. Collisions may occur. Run a web search for, e.g., 'keras + hashing_trick' for an explanation.
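The collision problem is easy to provoke with a small hash space. The sketch below uses a hypothetical helper `hashed_index` built on `hashlib.md5` so that results are reproducible across runs. Mapping into the range 1 to n - 1 (leaving 0 free) is modeled on the Keras hashing helper; treat that detail as an assumption and check the Keras documentation if it matters for your use case.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] via an md5 hash (index 0 stays free)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

words = ["pretty", "good", "very", "nice", "never", "again"]

roomy = [hashed_index(w, 50) for w in words]  # 49 buckets for 6 words
tight = [hashed_index(w, 4) for w in words]   # only 3 buckets for 6 words

print(roomy)
print(tight)
print(len(set(tight)) < len(words))  # True: collisions are unavoidable here
```

With six words and three buckets, at least two words must share an index. A larger hash space makes collisions less likely but never rules them out.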

Our sequences are of different lengths, and why shouldn't they be? Some students write longer reviews, and others write shorter ones. The number of words will always differ. However, our model enforces a constant length for the input data (see above). This is why we need *padding*.

In [None]:
# How long is the longest review
max_length = np.max([len(r.split()) for r in reviews])

Our input text is rather short. Thus, we can extend every review so that it matches the length of the longest review. Normally, this would not be the case. In general, we need a mixture of pruning longer text and padding shorter text to arrive at a constant length. Making a suitable choice is not trivial and might require trial-and-error. 

In [None]:
#* Pad all reviews to the maximum length
padded = pad_sequences(encoded, maxlen=max_length, padding='post')
print(padded)
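To make the mixture of padding and pruning concrete, here is a plain-Python sketch. The helper `pad_post` is hypothetical; it pads at the end and also cuts at the end, which should correspond to calling `pad_sequences` with `padding='post'` and `truncating='post'`. The input sequences are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; cut off the tail of sequences longer than maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncate: drop the tail
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad: fill up with `value`
    return padded

toy = [[7, 3], [5], [9, 1, 4, 8, 2, 6]]
print(pad_post(toy, maxlen=4))
# [[7, 3, 0, 0], [5, 0, 0, 0], [9, 1, 4, 8]]
```

Whether to cut the beginning or the end of a long text depends on where the informative words tend to be, which is part of the trial-and-error mentioned above.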

Ok, we are ready to do some modeling. Specifically, we can stack a dense layer on top of our embedding layer and train the network to minimize cross-entropy. Here is the code to build-up the network.

In [None]:
emb_dim = 10  # We have very little data so training a 100 dim embedding cannot work. Let's use 10 dimensions for demonstration

model = Sequential()
model.add(Embedding(input_dim=vocab_size, 
                    output_dim=emb_dim, 
                    input_length=max_length)) 
# The embedding layer will be 5 vectors of 10 dimensions each, one for each word.
# We add a flatten layer, which stacks the embeddings of the five words on top of another
model.add(Flatten()) 
# Flattening gives us a vector with 5*10 = 50 elements. We can pass this vector to a dense layer.
# Since we face a binary classification problem, we need only one output unit and use sigmoid activation
model.add(Dense(1, activation='sigmoid')) 
# Done. Compile and summarize the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())
#Question for the class-  why 51 in the last layer?

We are ready to train our model. The point of this model is to train word embeddings. Of course, with the little toy data we use, no good results are to be expected.

In [None]:
# fit the model
model.fit(padded, labels, epochs=10, verbose=1)

# evaluate the model
loss, accuracy = model.evaluate(padded, labels, verbose=1)
print('Accuracy: %f' % (accuracy*100))

Let's go back to the LSTM that we did at the beginning. We advance this network by adding an embedding layer *before* the LSTM input layer to increase speed. That is, looking-up word embeddings as opposed to calculating dot products. We can also train the network. Note, however, that we do not train the LSTM as a language model but a simple text classifier.

In [None]:
# LSTM text classifier with embedding layer
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emb_dim, input_length=max_length)) 
model.add(LSTM(nb_hidden))  # number of units, arbitrary choice  
model.add(Dense(1, activation='sigmoid')) # we have a binary prediction
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())

# Training
model.fit(padded, labels, epochs=10, verbose=1)

# Evaluation
loss, accuracy = model.evaluate(padded, labels, verbose=1)
print('Accuracy: %f' % (accuracy*100))

Let's extract the weights of the embedding matrix, which are our word embeddings. The dimensionality of that matrix is vocab size by embedding dimension. And the embeddings themselves are just a bunch of numbers.

In [None]:
# Extract embedding layer weights
embs = model.layers[0].get_weights()
# The result is a list, of which we need only the first element
print('We get a {} of length {}'.format(type(embs), len(embs)))
embs = embs[0]
embs.shape

In [None]:
embs[:4]  # pull out the first 4 word embeddings for demonstration

Do the embeddings make sense? Well, that is probably best examined through the lens of a downstream task, checking, for example, whether we get good prediction results. Likewise, we could try to reproduce some of the famous word2vec examples showing that the learned vectors carry meaning. In our toy example, none of that would work out. So we stop here and just wait for the next tutorial for more NLP ;)