# STATS 102
## Class 25

Textbook reference: Professor Notes from Duke's MLSS

Here are the topics for this lecture:

* Natural Language Processing

Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

Let's get started...

# Introduction to Natural Language Processing (NLP) in TensorFlow

### Word Embeddings

Before going any further, please do the following:

* Log in to Coursera and search for "Introduction to Machine Learning" offered by Duke University
* Under syllabus, please search for "week 4 - Introduction to Natural Language Processing" 
* Once there, please click on show all videos available
* Select and watch the following videos:

    - "Introduction to the Concept of Word Vectors", ~9 mins
    - "Word to Vectors", ~8 mins

As you saw in Prof. Carin's videos, word embeddings, or word vectors, provide a way of mapping words from a vocabulary into a low-dimensional space, where words with similar meanings are close together. They also let us translate text into numbers, which we can then analyze.

In this notebook, we will play around with a set of pretrained word vectors. There are many sets of pretrained word embeddings. For today's exercise, we use ConceptNet Numberbatch, which provides a relatively small download in an easy-to-work-with format (h5).

In [50]:
# Download word vectors
import tensorflow as tf 
from urllib.request import urlretrieve
import os
if not os.path.isfile('mini.h5'):
    print("Downloading Conceptnet Numberbatch word embeddings...")
    conceptnet_url = 'http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5'
    urlretrieve(conceptnet_url, 'mini.h5')

To read an `h5` file, we'll need to use the `h5py` package. Below, we use the package to open the `mini.h5` file we just downloaded. We extract from the file a list of utf-8-encoded words, as well as their $300$-dimensional vectors.

In [51]:
# Load the file and pull out words and embeddings
import h5py

with h5py.File('mini.h5', 'r') as f:
    all_words = [word.decode('utf-8') for word in f['mat']['axis1'][:]]
    all_embeddings = f['mat']['block0_values'][:]

# Print total number of words in our file
print("all_words dimensions: {0}".format(len(all_words)))
# Print embedding dimensions: 362891 rows and 300 columns!
print("all_embeddings dimensions: {0}".format(all_embeddings.shape))
# Show me a sample...
print(all_words[10000:10005])

all_words dimensions: 362891
all_embeddings dimensions: (362891, 300)
['/c/de/lande', '/c/de/landebahn', '/c/de/landen', '/c/de/landes', '/c/de/landesamt']


## Word Strings

Now, `all_words` is a list of $V$ strings (what we call our *vocabulary*), and `all_embeddings` is a $V \times 300$ matrix. The strings are of the form `/c/language_code/word`—for example, `/c/en/cat` and `/c/es/gato`.
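
To make the key format concrete, here is a minimal sketch that splits a key of the form `/c/language_code/word` into its language code and word. The helper name `split_key` is hypothetical and is not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): split a key of the
# form /c/<language_code>/<word> into its language code and word.
def split_key(key):
    # '/c/en/cat'.split('/', 3) gives ['', 'c', 'en', 'cat']
    _, _, language, word = key.split('/', 3)
    return language, word

print(split_key('/c/en/cat'))   # ('en', 'cat')
print(split_key('/c/es/gato'))  # ('es', 'gato')
```

The notebook itself takes a shortcut: because the English prefix `/c/en/` is always six characters long, slicing with `word[6:]` is enough.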

### We are interested only in the English words. So...
* We use Python list comprehensions to pull out the indices of the English words
* Then extract just the English words (stripping the six-character `/c/en/` prefix) and their embeddings.

In [52]:
# Restrict our vocabulary to just the English words
english_words = [word[6:] for word in all_words if word.startswith('/c/en/')]
english_word_indices = [i for i, word in enumerate(all_words) if word.startswith('/c/en/')]
english_embeddings = all_embeddings[english_word_indices]

# Our vocabulary now contains only English words
print("all_words dimensions: {0}".format(len(english_words)))
# Here are the dimensions of the English embeddings: 150875 words and 300 columns!
print("all_embeddings dimensions: {0}".format(english_embeddings.shape))
# Show me a sample
print(english_words[10000:10005])

all_words dimensions: 150875
all_embeddings dimensions: (150875, 300)
['bajillion', 'bajirao', 'bajo_sexto', 'bajor', 'bajoran']


## Normalizing our Vectors

The magnitude of a word vector is less important than its direction; the magnitude can be thought of as representing frequency of use, independent of the semantics of the word. 
Here, we are interested in semantics, so we *normalize* our vectors, dividing each by its length. 

The result is that all of our word vectors have length 1 and therefore lie on a unit sphere (the 300-dimensional analogue of the unit circle). 
For unit-length vectors, the dot product equals the cosine of the angle between them, which gives a measure of similarity (the bigger the cosine, the smaller the angle).

<img src="cosine_similarity.png" alt="cosine" style="width: 500px;"/>
<center>Figure adapted from *[Mastering Machine Learning with Spark 2.x](https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781785283451/ba8bef27-953e-42a4-8180-cea152af8118.xhtml)*</center>
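
As a small worked example, the sketch below checks on two made-up 2-D vectors that the cosine computed from its definition matches the dot product taken after normalizing each vector to length 1. The vectors `a` and `b` are illustrative values, not taken from the embeddings.

```python
import numpy as np

# Two toy 2-D vectors (made up for illustration), 45 degrees apart.
a = np.array([3.0, 0.0])
b = np.array([2.0, 2.0])

# Cosine from the definition: a.b / (|a| |b|)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same quantity as a plain dot product of the unit-length vectors
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

print(cosine, dot_of_units)  # both are about 0.707, the cosine of 45 degrees
```

This is why we normalize once up front: afterwards every similarity is a single dot product.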

In [53]:
# Normalize each embedding to unit length (L2 norm of 1)
import numpy as np

norms = np.linalg.norm(english_embeddings, axis=1)
normalized_embeddings = english_embeddings.astype('float32') / norms.astype('float32').reshape([-1, 1])

We want to look up words easily, so we **create a dictionary that maps us from a word to its index in the word embeddings matrix.**

In [54]:
index = {word: i for i, word in enumerate(english_words)}

## Measuring Similarity Using Dot Product
Now we are ready to measure the similarity between pairs of words. We use numpy to take dot products.

In [55]:
# We define a simple function that allows us to compare two words, returning a score
def similarity_score(w1, w2):
    score = np.dot(normalized_embeddings[index[w1], :], normalized_embeddings[index[w2], :])
    return score

## Let us try a few examples...

In [56]:
# A word is as similar to itself as possible:
print('cat\tcat\t', similarity_score('cat', 'cat'))

# Closely related words still get high scores:
print('cat\tfeline\t', similarity_score('cat', 'feline'))
print('cat\tdog\t', similarity_score('cat', 'dog'))

# Unrelated words, not so much
print('cat\tmoo\t', similarity_score('cat', 'moo'))
print('cat\tfreeze\t', similarity_score('cat', 'freeze'))

# Antonyms are still considered related, sometimes more so than synonyms
print('antonyms\topposites\t', similarity_score('antonym', 'opposite'))
print('antonyms\tsynonyms\t', similarity_score('antonym', 'synonym'))

cat	cat	 1.0000001
cat	feline	 0.8199548
cat	dog	 0.590724
cat	moo	 0.0039538303
cat	freeze	 -0.030225191
antonyms	opposites	 0.3941065
antonyms	synonyms	 0.46883982


In [57]:
similarity_score('can', 'cane')

0.013125939

## Finding Similar Words
We can also find, for instance, the most similar words to a given word.

In [60]:
# Let's create a few helpful functions
def closest_to_vector(v, n):
    # Score every word in the vocabulary against the vector "v"
    all_scores = np.dot(normalized_embeddings, v)
    # Lazily map word indexes, sorted from most to least similar, to words
    best_words = map(lambda i: english_words[i], reversed(np.argsort(all_scores)))
    # Return the "n" most similar words
    return [next(best_words) for _ in range(n)]

def most_similar(w, n):
    return closest_to_vector(normalized_embeddings[index[w], :], n)

In [61]:
print(most_similar('cat', 10))
print(most_similar('dog', 10))
print(most_similar('duke', 10))

['cat', 'humane_society', 'kitten', 'feline', 'colocolo', 'cats', 'kitty', 'maine_coon', 'housecat', 'sharp_teeth']
['dog', 'dogs', 'wire_haired_dachshund', 'doggy_paddle', 'lhasa_apso', 'good_friend', 'puppy_dog', 'bichon_frise', 'woof_woof', 'golden_retrievers']
['duke', 'dukes', 'duchess', 'duchesses', 'ducal', 'dukedom', 'duchy', 'voivode', 'princes', 'prince']


## Can you explain the following similarity scores?

In [62]:
similarity_score("sit", "sits")

0.8478777

In [28]:
similarity_score("want", "wants")

0.858501

In [29]:
similarity_score("sleep", "sleeps")

0.8664927

In [63]:
similarity_score("leave", "leaves")

0.42647985

In [64]:
similarity_score("man", "woman")

0.63185656

## Solving Word Analogies
We can also use `closest_to_vector` to find words "nearby" vectors that we create ourselves. This allows us to solve analogies. 

For example, in order to solve the analogy "man : brother :: woman : ?", we can compute a new vector `brother - man + woman`: the meaning of brother, minus the meaning of man, plus the meaning of woman. We can then ask which words are closest, in the embedding space, to that new vector.

In [66]:
# Define function to solve analogies
def solve_analogy(a1, b1, a2):
    b2 = normalized_embeddings[index[b1], :] - normalized_embeddings[index[a1], :] + normalized_embeddings[index[a2], :]
    return closest_to_vector(b2, 10)

print(solve_analogy("man", "brother", "woman"))
print(solve_analogy("man", "husband", "woman"))
print(solve_analogy("spain", "madrid", "france"))

['sister', 'brother', 'sisters', 'kid_sister', 'younger_brother', 'niece', 'nieces', 'sistren', 'stepsister', 'daughter']
['wife', 'husband', 'husbands', 'spouse', 'wifes', 'wifey', 'et_ux', 'hubby', 'hotwife', 'wives']
['paris', 'france', 'le_havre', 'in_france', 'montmartre', 'marseille', 'loire_valley', 'saone', 'lyonnais', 'jacques_chirac']
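
Notice that the input words themselves ('brother', 'husband', 'france') rank near the top of each list. One possible variant, sketched below on a toy vocabulary, drops the three query words before returning results. The 3-D vectors and the names `toy_words`, `toy_unit`, `toy_index` and `solve_analogy_excluding_inputs` are all invented for illustration.

```python
import numpy as np

# Toy 3-D embeddings, invented for illustration only.
toy_words = ['man', 'woman', 'brother', 'sister', 'cat']
toy_vectors = np.array([
    [1.0, 0.0, 0.0],    # man
    [0.0, 1.0, 0.0],    # woman
    [1.0, 0.0, 1.0],    # brother
    [0.0, 1.0, 1.0],    # sister
    [0.0, 0.0, -1.0],   # cat
])
# Normalize each row to unit length, as in the notebook
toy_unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_index = {w: i for i, w in enumerate(toy_words)}

def solve_analogy_excluding_inputs(a1, b1, a2, n=1):
    # Same arithmetic as solve_analogy: b1 - a1 + a2
    target = toy_unit[toy_index[b1]] - toy_unit[toy_index[a1]] + toy_unit[toy_index[a2]]
    scores = toy_unit @ target
    # Rank all words from most to least similar to the target vector
    ranked = [toy_words[i] for i in np.argsort(scores)[::-1]]
    # Drop the query words, then keep the top n
    return [w for w in ranked if w not in (a1, b1, a2)][:n]

print(solve_analogy_excluding_inputs('man', 'brother', 'woman'))  # ['sister']
```

Without the filter, the toy ranking is 'sister', 'woman', 'brother', which mirrors the pattern in the real output above.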


## Sentiment Analysis Using Word Embeddings
We'll perform *sentiment analysis* on a set of movie reviews: in particular, we will attempt to classify each review as positive or negative based on its text.

We will use a [Simple Word Embedding Model](http://people.ee.duke.edu/~lcarin/acl2018_swem.pdf) (SWEM, Shen et al. 2018) to do so. **We will represent a review as the mean of the embeddings of the words in the review. Then we'll train a three-layer MLP (a neural network) to classify the review as positive or negative.**
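
To illustrate the averaging step on its own, here is a minimal sketch using made-up 3-D vectors (the real embeddings are 300-dimensional). The dictionary `toy_embeddings` and the function `average_embedding` are hypothetical names used only for this example.

```python
import numpy as np

# Toy 3-D embeddings, invented for illustration.
toy_embeddings = {
    'great': np.array([0.9, 0.1, 0.0]),
    'movie': np.array([0.0, 0.5, 0.5]),
    'boring': np.array([-0.9, 0.1, 0.0]),
}

def average_embedding(words, embeddings):
    # Skip out-of-vocabulary words, then average what remains
    vectors = [embeddings[w] for w in words if w in embeddings]
    return np.mean(np.vstack(vectors), axis=0)

# 'unknownword' is not in the toy vocabulary, so it is ignored
review_vector = average_embedding(['great', 'movie', 'unknownword'], toy_embeddings)
print(review_vector)  # approximately [0.45, 0.3, 0.25]
```

Two things are worth noting about this representation: averaging discards word order, and a review with no in-vocabulary words would leave the list empty, in which case `np.vstack` raises a `ValueError`.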

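Download the `movie-simple.txt` file from Google Classroom into this directory. (The loading code below is currently set to read a second, more complex file, `movie-pang02.txt`; switch which `open` call is commented out to use the simple file.) Each line of the file contains 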

1. the numeral 0 (for negative) or the numeral 1 (for positive), followed by
2. a tab (the whitespace character), and then
3. the review itself.
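
As a quick sketch of how a line in this format can be taken apart, the example below splits a made-up line on its tab, converts the label to an integer, and tokenizes the review. The line `'1\tWhat a great movie!\n'` is hypothetical and does not come from the data file.

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

# A made-up line in the described format: label, tab, review text
sample_line = '1\tWhat a great movie!\n'

# Split once on the tab: label on the left, review on the right
label_text, review_text = sample_line.split('\t', 1)
label = int(label_text)

# Strip punctuation, lowercase, and split on whitespace
words = review_text.translate(remove_punct).lower().split()

print(label, words)  # 1 ['what', 'a', 'great', 'movie']
```

The notebook's own function reaches the same result by position, reading the label from `line[0]` and the review from `line[2:]`.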

### Sample line of movie review text in `movie-simple.txt`


0 " DA VINCI CODE SUCKS."

In [67]:
import string
remove_punct=str.maketrans('','',string.punctuation)

# This function converts a line of our data file into
# a dictionary {'x': x, 'y': y}, where x is a 300-dimensional
# representation of the words in a review, and y is its label.
def convert_line_to_example(line):
    # Pull out the first character: that's our label (0 or 1)
    y = int(line[0])
    
    # Split the line into words using Python's split() function
    words = line[2:].translate(remove_punct).lower().split()
    
    # Look up the embeddings of each word, ignoring words not
    # in our pretrained vocabulary.
    embeddings = [normalized_embeddings[index[w]] for w in words
                  if w in index]
    
    # Take the mean of the embeddings
    x = np.mean(np.vstack(embeddings), axis=0)
    return {'x': x, 'y': y}

### Simple File
# Apply the function to each line in the file.
#with open("movie-simple.txt", "r", encoding='utf-8', errors='ignore') as f:
#    dataset = [convert_line_to_example(l) for l in f.readlines()]

### Complex File
# Apply the function to each line in the file.
with open("movie-pang02.txt", "r", encoding='utf-8', errors='ignore') as f:
    dataset = [convert_line_to_example(l) for l in f.readlines()]

## Shuffle Data and Create Test/Train Sets
Now that we have a dataset, let's shuffle it and do a train/test split. We use roughly a quarter of the dataset for testing and three quarters for training (rounding the training set to a whole number of batches, to make the code nicer later).

In [68]:
# Shuffling data
import random
random.shuffle(dataset)
# Creating our test/train data sets
batch_size = 100
total_batches = len(dataset) // batch_size
train_batches = 3*total_batches // 4 
train, test = dataset[:train_batches*batch_size], dataset[train_batches*batch_size:]
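
To see how the integer arithmetic plays out, here is the same calculation on a hypothetical dataset of 1050 examples (the size is made up for illustration).

```python
batch_size = 100
n_examples = 1050  # hypothetical dataset size

total_batches = n_examples // batch_size   # 1050 // 100 = 10
train_batches = 3 * total_batches // 4     # 30 // 4 = 7
n_train = train_batches * batch_size       # 7 * 100 = 700
n_test = n_examples - n_train              # everything left over: 350

print(n_train, n_test)  # 700 350
```

Because of the floor divisions, the test set receives all leftover examples, so its share can be somewhat larger than a quarter (here it is one third).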

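Time to build our MLP in TensorFlow. We'll use placeholders for `X` and `y` as usual. (Placeholders and sessions belong to the TensorFlow 1.x graph API; in TensorFlow 2.x they are available under `tf.compat.v1`.)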

In [45]:
import tensorflow as tf

# Placeholders for input
X = tf.placeholder(tf.float32, [None, 300])
y = tf.placeholder(tf.float32, [None, 1])

# Three-layer MLP
h1 = tf.keras.layers.Dense(100, activation='relu')(X)
h2 = tf.keras.layers.Dense(20, activation='relu')(h1)
logits = tf.keras.layers.Dense(1)(h2)
probabilities = tf.sigmoid(logits)

# Loss and metrics
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=y))
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(probabilities), y), tf.float32))

# Training
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

# Initialization of variables
init_op = tf.global_variables_initializer()
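
To make the loss and accuracy definitions less of a black box, here is a NumPy sketch of the same quantities on three made-up logits. It uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` of the sigmoid cross-entropy. The function names and the example values are assumptions for illustration, not the TensorFlow implementation itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, labels):
    # Numerically stable form of -z*log(p) - (1-z)*log(1-p), with p = sigmoid(x)
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

# Made-up logits and labels for three reviews
logits = np.array([[2.0], [-1.0], [0.5]])
labels = np.array([[1.0], [0.0], [0.0]])

# Mean loss over the batch, as with tf.reduce_mean
loss = sigmoid_cross_entropy(logits, labels).mean()

# Round probabilities to 0 or 1 and compare with the labels
probabilities = sigmoid(logits)
accuracy = np.mean(np.round(probabilities) == labels)

print(loss, accuracy)  # loss is about 0.471, accuracy is 2/3
```

A useful reference point: when every logit is zero, the model predicts 0.5 for every review and the loss is log 2, about 0.693, which is roughly where the training log below starts.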

We can now begin a session and train our model. We'll train for 250 epochs on the simple file or 1000 epochs on the complex file (the setting used below). When we're finished, we'll evaluate our accuracy on all the test data.

In [69]:
# Train
sess = tf.Session()
sess.run(init_op)
num_epoch = 1000 # 250 for simple or 1000 for complex

for epoch in range(num_epoch):
    for batch in range(train_batches):
        data = train[batch*batch_size:(batch+1)*batch_size]
        reviews = [sample['x'] for sample in data]
        labels  = [sample['y'] for sample in data]
        labels = np.array(labels).reshape([-1,1])
        
        _, l, acc = sess.run([train_step, loss, accuracy], feed_dict={X: reviews, y: labels})
        
    if epoch % 100 == 0:
        print("Epoch: {0} \t Loss: {1} \t Acc: {2}".format(epoch, l, acc))
    
    random.shuffle(train)
        
# Evaluate on test set
test_reviews = [sample['x'] for sample in test]
test_labels  = [sample['y'] for sample in test]
test_labels  = np.array(test_labels).reshape([-1, 1])

acc = sess.run(accuracy, feed_dict={X: test_reviews, y: test_labels})
print("Final accuracy: {0}".format(acc))

Epoch: 0 	 Loss: 0.6937820315361023 	 Acc: 0.38999998569488525
Epoch: 100 	 Loss: 0.6912757158279419 	 Acc: 0.6299999952316284
Epoch: 200 	 Loss: 0.6893551349639893 	 Acc: 0.6100000143051147
Epoch: 300 	 Loss: 0.6803270578384399 	 Acc: 0.6899999976158142
Epoch: 400 	 Loss: 0.6388222575187683 	 Acc: 0.6899999976158142
Epoch: 500 	 Loss: 0.5719191431999207 	 Acc: 0.6700000166893005
Epoch: 600 	 Loss: 0.5071670413017273 	 Acc: 0.7699999809265137
Epoch: 700 	 Loss: 0.5297117233276367 	 Acc: 0.7799999713897705
Epoch: 800 	 Loss: 0.5048509836196899 	 Acc: 0.7599999904632568
Epoch: 900 	 Loss: 0.4909608066082001 	 Acc: 0.7799999713897705
Final accuracy: 0.7400000095367432


## Let's Try Our Model
We can now examine what our model has learned, seeing how it responds to word vectors for different words:

In [70]:
# Check some words
#words_to_test = ["exciting", "hated", "boring", "loved", "extremely", "rather", "quite"]
words_to_test = ["garbage", "waste", "amazing", "ok", "keeper", "recommend", "happy", "assignments"]

print("##############################")
print("# Sentiment Analysis Results #")
print("##############################")

for word in words_to_test:
    result=sess.run(probabilities, feed_dict={X: normalized_embeddings[index[word]].reshape(1, 300)})
    if result > 0.5:
        print("Word: ",word, "--- Sentiment is likely Positive:", result)
    else:   
        print("Word: ",word, "--- Sentiment is likely Negative:", result)

##############################
# Sentiment Analysis Results #
##############################
Word:  garbage --- Sentiment is likely Negative: [[2.850981e-05]]
Word:  waste --- Sentiment is likely Negative: [[1.2075822e-06]]
Word:  amazing --- Sentiment is likely Positive: [[0.99848855]]
Word:  ok --- Sentiment is likely Negative: [[0.00127299]]
Word:  keeper --- Sentiment is likely Negative: [[0.07155904]]
Word:  recommend --- Sentiment is likely Negative: [[0.06054885]]
Word:  happy --- Sentiment is likely Positive: [[0.9892368]]
Word:  assignments --- Sentiment is likely Negative: [[0.0113082]]


Try some words of your own!

In [71]:
# Let's close our TensorFlow session
sess.close()

## In summary...
* We demonstrated how to use word embeddings to find word similarities
* We then trained a multilayer perceptron on averaged word embeddings to perform rudimentary sentiment analysis of movie reviews