
# Introduction to Word Embeddings

## Douglas Rice

In this notebook, we'll estimate our first word embedding model, then go through a series of analyses of the estimated embeddings. After completing this notebook, you should be familiar with:


1. Preparing a corpus for estimating word embeddings
2. Estimating a (static) word embedding model
3. Analyzing output of (static) word embedding model





# GloVe

We'll be using the [`text2vec`](http://text2vec.org/index.html) package. `text2vec` was one of the first implementations of word embedding functionality in R, and is designed to run *fast*, relatively speaking. Still, it's important to remember that the computational cost is ramping up here, so don't expect immediate results.

`text2vec` implements the "Global Vectors" (or GloVe) approach for estimating embeddings. Stanford University's [Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/) is an approach to estimating a distributional representation of a word. GloVe is based, essentially, on factorizing a huge term co-occurrence matrix. 

The distributional representation of words means that each term is represented as a vector of values over some number of dimensions (say, 3 dimensions, where the values are 0.6, 0.3, and 0.1). This stands in stark contrast to the work we've done to this point, which has effectively encoded each word as simply present (1) or not (0).

Perhaps unsurprisingly, the distributional representation better captures semantic meaning than the one-hot encoding. This opens up a world of possibilities for us as researchers. Indeed, this has been a major leap forward for research in Text-as-Data. 
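
To make the contrast concrete, here is a minimal sketch in base R. The three-word vocabulary and the dense values are made up for illustration; they do not come from any fitted model.

```r
# Toy vocabulary (hypothetical, for illustration only)
words <- c("cat", "dog", "car")

# One-hot encoding: each word gets its own dimension
one_hot <- diag(length(words))
dimnames(one_hot) <- list(words, words)

# Dense vectors with made-up values in three dimensions
dense <- rbind(
  cat = c(0.6, 0.3, 0.1),
  dog = c(0.5, 0.4, 0.1),
  car = c(0.1, 0.1, 0.8)
)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# One-hot vectors of two different words never overlap
cosine(one_hot["cat", ], one_hot["dog", ])

# Dense vectors can place related words closer together
cosine(dense["cat", ], dense["dog", ])
cosine(dense["cat", ], dense["car", ])
```

With one-hot vectors every pair of distinct words looks equally unrelated, whereas the dense vectors allow graded similarity.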

As an example, we can see how similar one word is to other words by measuring the distance between their distributions. Even more interestingly, we can capture really specific phenomena from text with some simple arithmetic based on word distributions. Consider the following canonical example:

- <h2> king - man + woman = queen </h2>

Ponder this equation for a moment. From the vector representation of  **king**, we subtract the vector representation of **man**. Then, we add the vector representation of **woman**. The end result of that should be a vector that is very similar to the vector representation of  **queen**. 
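
The arithmetic itself is ordinary vector addition and subtraction. The sketch below uses hand-built two-dimensional vectors (an assumption made purely for illustration, with one dimension loosely standing for "royalty" and the other for "gender"), so the analogy works out exactly. Real embeddings will only be approximately this tidy.

```r
# Hypothetical 2-d embeddings: column 1 ~ "royalty", column 2 ~ "gender"
toy <- rbind(
  king  = c(1,  1),
  queen = c(1, -1),
  man   = c(0,  1),
  woman = c(0, -1)
)

# king - man + woman
result <- toy["king", ] - toy["man", ] + toy["woman", ]
result

# In this hand-built example the result coincides with the queen vector
all.equal(result, toy["queen", ])
```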

In what follows, we'll work through some examples to see how well this works. I want to caution, though, that the corpus we are training on here is probably too small for us to place much confidence in the resulting models. Nevertheless, you'll see that even with this small corpus we'll recover really interesting dynamics.



## Front-end Matters

First, let's install the `text2vec` package:

In [None]:
# Installs text2vec package (might take a while)
install.packages('text2vec')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘MatrixExtra’, ‘float’, ‘RhpcBLASctl’, ‘RcppArmadillo’, ‘Rcpp’, ‘rsparse’, ‘mlapi’, ‘lgr’




And load the library:

In [None]:
library(text2vec)

## PoKi Dataset

We'll be using [PoKi](https://github.com/whipson/PoKi-Poems-by-Kids), a corpus of poems written by children and teenagers from grades 1 to 12.

One thing to flag right off the bat is the really interesting dynamics related to *who* is writing these poems. We need to keep in mind that the children writing these texts are going to use less formal writing and more imaginative stories. Given this, we'll focus on analogies that are more appropriate for this context; here, we'll aim to create word embeddings that can recreate these two equations:

- <h2> cat - meow + bark = dog </h2>

- <h2> mom - girl + boy = dad </h2>

By the end, we should hopefully be able to recreate these by creating and fitting our GloVe models. But first, let's perform the necessary pre-processing steps before creating our embedding models. 

Let's download and read in the data:



In [None]:
# Creates a temporary file path
temp <- tempfile()

# Downloads the PoKi CSV to the temporary file
download.file("https://raw.githubusercontent.com/whipson/PoKi-Poems-by-Kids/master/poki.csv", temp)

In [None]:
# Reads in downloaded file
poem <- read.csv(temp)

# First ten rows
head(poem, 10)

,id,title,author,grade,text,char
,<int>,<chr>,<chr>,<int>,<chr>,<int>
1,104987,I Love The Zoo,,1,"roses are red, violets are blue. i love the zoo. do you?",62
2,67185,The scary forest.,,1,the forest is really haunted. i believe it to be so. but then we are going camping.,87
3,103555,A Hike At School,1st grade-wh,1,"i took a hike at school today and this is what i saw bouncing balls girls chatting against the walls kids climbing on monkey bars i even saw some teachers' cars the wind was blowing my hair in my face i saw a mud puddle, but just a trace all of these things i noticed just now on my little hike.",324
4,112483,Computer,a,1,"you can do what you want you can play a game you can do many things, you can read and write",106
5,74516,Angel,aab,1,angel oh angle you spin like a top angel oh angel you will never stop can't you feel the air as it blows through your hair angel oh angel itisto bad your a mop!,164
6,114693,Nature Nature and Nature,aadhya,1,"look at the sun, what a beautiful day. under the trees, we can run and play. beauty of nature, we love to see, from tiny insect to exotic tree. it is a place to sit and think, nature and human share the deepest link. nature has ocean, which is in motion. nature has tree, nature has river. if we destroy the nature we would never be free. our nature keeps us alive, we must protect it, for society to thrive. we spoil the nature, we spoil the future. go along with nature, for your better future.",491
7,46453,Jack,aaliyah,1,"dog playful, energetic running, jumping, tackling my is my friend jack",74
8,57397,When I awoke one morning,aanna,1,"when i awoke one morning, a dog was on my head. i asked , ''what are you doing there?' it looked at me and said ''woof!'' ''wouldn't you like to be outside playing?''said the man ''i'm staying here and playing here. '' said the dog he played all night and day. he came inside his new house and played inside a wet wet day.",325
9,77201,My Blue Berries and My Cherries,aarathi,1,i went to my blue berry tree they were no blue berries found i went to another tree to get some more free but found none but cherries round.,143
10,40520,A snowy day,ab.,1,"one snowy day the children went outside to play in the snow. they threw snowballs, went sledding and made a snowman. afterwards they went inside to drink warm hot chocolate. it was a fun snowy day",199


In [None]:
# Checks dimensions
dim(poem)

We want the poems themselves, so we'll use the column 'text' for tokenization.

## Tokenization and Vectorization

With `text2vec`, we start by tokenizing the text, wrapping the tokens in an iterator, and building a vocabulary. This time, there's no need to lowercase our words since the downloaded dataset is already lowercased.

Let's tokenize the data:

In [None]:
# Tokenization
tokens <- word_tokenizer(poem$text)

# First five rows tokenized
head(tokens, 5)

Create an iterator object:

In [None]:
# Create iterator object 
it <- itoken(tokens, progressbar = FALSE)

Build the vocabulary:

In [None]:
# Build vocabulary
vocab <- create_vocabulary(it)

# Vocabulary
vocab

term,term_count,doc_count
<chr>,<int>,<int>
0000,1,1
0000000,1,1
0000001,1,1
00a:m,1,1
00he,1,1
00o'clock,1,1
00p,1,1
02,1,1
04,1,1
05at,1,1


In [None]:
# Check dimensions
dim(vocab)

Next, we prune the vocabulary and create a vectorizer from it. We'll keep the terms that occur at least 5 times.

In [None]:
# Prune vocabulary
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Check dimensions
dim(vocab)

# Create a vectorizer from the pruned vocabulary
vectorizer <- vocab_vectorizer(vocab)
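
To see what a minimum-count threshold does, here is a small base R sketch on three made-up documents. It only mirrors the idea behind `term_count_min`; it is not how `text2vec` implements pruning, and the names `docs`, `min_count`, and `kept` are hypothetical.

```r
# Three tiny made-up documents, already tokenized
docs <- list(
  c("the", "dog", "barks"),
  c("the", "cat", "meows"),
  c("the", "dog", "sleeps")
)

# Count how often each term occurs across all documents
term_count <- table(unlist(docs))
term_count

# Keep only terms that occur at least `min_count` times
min_count <- 2
kept <- names(term_count)[term_count >= min_count]
kept
```

Rare terms such as "barks" drop out because a single occurrence gives the model almost nothing to learn from.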

As we can see, pruning our vocabulary deleted over 40 thousand words. I want to reiterate that this is a *very small* corpus from the perspective of traditional word embedding models. When we are working with word representations trained with these smaller corpora, we should be really cautious in our approach. 

Moving on, we can create our term co-occurrence matrix (TCM). We can achieve different results by experimenting with `skip_grams_window` and other parameters. The definition of whether two words occur together is arbitrary, so we definitely want to play around with the parameters to see how the results change.

In [None]:
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
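
To build intuition for what the window size changes, the following base R sketch counts co-occurrences in a single toy sentence. It is a simplified, unweighted stand-in written for this illustration (the function `cooccurrence` is hypothetical); `create_tcm` has additional options and its exact numbers may differ.

```r
# Count symmetric co-occurrences within a fixed window (unweighted toy version)
cooccurrence <- function(tokens, window = 2L) {
  vocab <- sort(unique(tokens))
  tcm <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Look only to the right, then mirror, so each pair is visited once
    right <- seq_len(min(window, n - i)) + i
    for (j in right) {
      tcm[tokens[i], tokens[j]] <- tcm[tokens[i], tokens[j]] + 1
      tcm[tokens[j], tokens[i]] <- tcm[tokens[j], tokens[i]] + 1
    }
  }
  tcm
}

tokens <- c("the", "dog", "chased", "the", "cat")

# A narrow window only links direct neighbours
cooccurrence(tokens, window = 1L)

# A wider window links words that are further apart
cooccurrence(tokens, window = 2L)
```

Widening the window from 1 to 2 adds the pair "chased"/"cat", which never appear side by side.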

## Creating and fitting the GloVe model

Now that we have a TCM, we can factorize it via the GloVe algorithm. We'll call `GlobalVectors$new()` to create our GloVe model.

[Here](https://www.rdocumentation.org/packages/text2vec/versions/0.5.0/topics/GlobalVectors) is documentation for related functions and methods.
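
For intuition about the `x_max` argument (and the `alpha` default that shows up when the model is printed), here is a sketch of the weighting function and per-pair loss as described in the original GloVe paper. The function names `glove_weight` and `pair_loss` are hypothetical, and this is not `text2vec`'s internal code.

```r
# Weighting function from the GloVe paper: down-weights rare pairs and
# caps the weight of very frequent pairs at 1
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Weights for a range of co-occurrence counts
counts <- c(1, 5, 10, 100)
round(glove_weight(counts), 3)

# Weighted squared error for a single word pair:
# weight * (w_i . w_j + b_i + b_j - log(x_ij))^2
pair_loss <- function(w_i, w_j, b_i, b_j, x_ij, ...) {
  glove_weight(x_ij, ...) * (sum(w_i * w_j) + b_i + b_j - log(x_ij))^2
}

# The loss is zero when the dot product plus biases equals log(x_ij)
pair_loss(c(1, 0), c(1, 0), b_i = 0, b_j = 0, x_ij = exp(1))
```

Any pair that co-occurs at least `x_max` times gets full weight, so a smaller `x_max` makes sense for a small corpus where counts are low.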

In [None]:
# Creating new GloVe model
glove <- GlobalVectors$new(rank = 50, x_max = 10)

# Checking GloVe methods
glove

<GloVe>
  Public:
    bias_i: NULL
    bias_j: NULL
    clone: function (deep = FALSE) 
    components: NULL
    fit_transform: function (x, n_iter = 10L, convergence_tol = -1, n_threads = getOption("rsparse_omp_threads", 
    get_history: function () 
    initialize: function (rank, x_max, learning_rate = 0.15, alpha = 0.75, lambda = 0, 
    shuffle: FALSE
  Private:
    alpha: 0.75
    b_i: NULL
    b_j: NULL
    cost_history: 
    fitted: FALSE
    glove_fitter: NULL
    initial: NULL
    lambda: 0
    learning_rate: 0.15
    rank: 50
    w_i: NULL
    w_j: NULL
    x_max: 10

Note that you'll only be able to access the public methods.

We can fit the model by calling `$fit_transform` on our `glove` object. This may take several minutes!

In [None]:
# Fitting model
wv_main <- glove$fit_transform(tcm, n_iter = 10, convergence_tol = 0.01, n_threads = 8)

INFO  [15:09:04.916] epoch 1, loss 0.1995 
INFO  [15:09:08.927] epoch 2, loss 0.1301 
INFO  [15:09:12.853] epoch 3, loss 0.1123 
INFO  [15:09:16.770] epoch 4, loss 0.1014 
INFO  [15:09:20.869] epoch 5, loss 0.0940 
INFO  [15:09:24.724] epoch 6, loss 0.0886 
INFO  [15:09:28.701] epoch 7, loss 0.0845 
INFO  [15:09:32.886] epoch 8, loss 0.0813 
INFO  [15:09:36.916] epoch 9, loss 0.0788 
INFO  [15:09:40.967] epoch 10, loss 0.0766 


In [None]:
# Checking dimensions
dim(wv_main)

Note that the model learns two sets of word vectors: **target** and **context**. We can think of our word of interest as the target in this environment, and all the other words inside the window as the context. Word vectors are learned for both.

In [None]:
wv_context <- glove$components
dim(wv_context)

While either word-vector matrix can be used as the result, the creators of GloVe recommend averaging or summing the main and context vectors:

In [None]:
word_vectors <- wv_main + t(wv_context)
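
The transpose is needed because the two matrices are oriented differently: as the cell above suggests, the main vectors have one row per word while the context vectors have one column per word. A minimal sketch with random stand-in matrices (all names and sizes here are made up) shows how the shapes line up:

```r
set.seed(1)
n_words <- 4   # toy vocabulary size
n_dims  <- 3   # toy embedding dimension

# Stand-ins for the two matrices: main is words x dims, context is dims x words
main <- matrix(rnorm(n_words * n_dims), nrow = n_words,
               dimnames = list(c("a", "b", "c", "d"), NULL))
context <- matrix(rnorm(n_dims * n_words), nrow = n_dims)

# Transposing the context matrix makes the shapes match
summed   <- main + t(context)
averaged <- summed / 2

dim(summed)
rownames(summed)
```

Because cosine similarity ignores vector length, the sum and the average lead to the same similarity rankings.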

Here's a preview of the word vector matrix:

In [None]:
word_vectors

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
1837,-0.39605574,-0.02061349,-0.292531632,-0.49664328,0.40783190,0.851044482,-0.39696587,-0.27849286,0.27692855,0.570315492,⋯,0.02241029,-0.57090033,0.28269541,-0.709638090,-0.086413878,-0.436961662,0.145116474,-0.13954358,0.25301617,0.22173371
1841,-0.78071031,0.65679353,-0.050169592,-0.44028214,-0.20699434,0.545783287,0.39728389,0.25466314,-0.30259115,0.170385098,⋯,-0.50191797,-0.35069074,-0.73582634,0.092209004,0.257043205,-0.581941567,0.130368074,-0.01831186,-0.25701126,-0.47245659
1881,-0.04010316,-0.10134305,0.174045113,-0.56769329,0.38104333,0.343796140,-0.32884733,-0.19615921,0.53750568,0.134096555,⋯,-0.02316887,-0.58453249,-0.27100670,-0.472853220,0.516830218,-0.339338338,0.411892514,0.97466772,-0.20560002,0.68405675
2005,0.15787077,0.71031086,-0.313078482,-0.61102057,0.06495423,0.371926194,-0.26801623,-0.86445663,0.01352563,-0.171241807,⋯,-0.82696919,0.16742684,0.94363227,-0.590631081,0.701651729,0.077559228,0.320037468,0.10129664,-0.45100058,0.20574856
36,0.19290715,-0.52808743,-0.423933116,-0.39152847,0.43662860,0.631995358,0.09486522,-0.61107417,0.50077352,0.449836962,⋯,-0.33239024,-0.03064664,-0.48434318,0.302711704,-0.035767930,-0.252195130,0.004768143,0.22295622,-0.16733290,-0.24080546
38,-0.06613049,0.59058818,0.032173097,0.44651101,0.07999208,-0.254271109,-0.14656225,-0.30944246,-0.39287467,-0.108862124,⋯,-0.11765821,-0.75640020,-0.13223140,-0.325570517,-0.068856731,-0.689340349,-0.160185505,0.34359886,-0.21039644,0.01838548
39,-0.13287689,0.28692471,-0.003638418,-0.69530049,0.43808501,0.357472266,0.25758576,-1.00892284,-0.01337265,-0.591372447,⋯,-0.53878903,-0.18590096,0.24206018,0.622937921,-0.183574815,0.001330743,-0.091965269,0.49289726,-0.26660598,-0.24322074
52,-0.19872705,-0.20982830,0.326931644,-0.49967092,0.07001185,-0.105890798,-0.32086134,0.13573098,0.83251341,0.306377047,⋯,-0.33254810,0.12059476,0.67515880,0.240465878,0.191630896,-0.099129157,0.340868096,-0.12504141,0.46920537,0.53592542
5â,0.16738605,0.43949076,-0.655501132,0.07723001,-0.09199876,-0.034267178,0.04048797,-0.17018697,0.52170505,0.603731582,⋯,-0.85178506,-0.30215470,-1.08300433,-0.460445616,0.370161831,-0.519443597,-0.328239128,0.69915124,-0.99653031,-0.42381830
600,-0.18319427,0.10173004,0.267230059,-0.35690795,0.43876954,-0.464610861,0.49667306,-0.13993497,0.41457057,0.333022652,⋯,-0.49316066,-0.36144923,0.25198538,-0.132789275,-0.021780580,-1.235329792,0.112048058,-0.42602652,0.31026247,0.24424602


## Cosine Similarity

Now we can begin to play. Much as we would with a standard correlation, we can compare two vectors using **cosine similarity**, which measures the angle between them and ranges from -1 to 1. Let's see what is similar to 'school':

In [None]:
# Word vector for school
school <- word_vectors["school", , drop = FALSE]

# Cosine similarity
school_cos_sim <- sim2(x = word_vectors, y = school, method = "cosine", norm = "l2")

# Top ten words relating to school
head(sort(school_cos_sim[,1], decreasing = TRUE), 10)

Unsurprisingly, 'school' is most similar to itself. Based on the poems that the children wrote, we can also see words like 'work', 'fun' and 'home' among the most similar to 'school'.
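
Conceptually, cosine similarity against a whole matrix amounts to scaling each row to unit length and taking a dot product. The base R sketch below does this by hand on a made-up embedding matrix; the values are invented for illustration and are not taken from the fitted model.

```r
# Made-up embedding matrix: one row per word
emb <- rbind(
  school = c(0.9, 0.1, 0.2),
  work   = c(0.8, 0.2, 0.1),
  home   = c(0.5, 0.5, 0.3),
  moon   = c(0.0, 0.1, 0.9)
)

# Scale every row to unit (L2) length
unit <- emb / sqrt(rowSums(emb^2))

# Cosine similarity of every word with "school" is then a matrix product
sims <- unit %*% unit["school", ]

# Most similar words first; the query word itself comes out on top
sort(sims[, 1], decreasing = TRUE)
```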

## Pet example

Let's try our pet example:

In [None]:
# cat - meow + bark should equal dog
dog <- word_vectors["cat", , drop = FALSE] -
  word_vectors["meow", , drop = FALSE] +
  word_vectors["bark", , drop = FALSE]
  
# Calculates pairwise similarities between the rows of two matrices
dog_cos_sim <- sim2(x = word_vectors, y = dog, method = "cosine", norm = "l2")

# Top five predictions
head(sort(dog_cos_sim[,1], decreasing = TRUE), 5)

Success! Our prediction was correct: 'dog' is the highest-ranked result after 'cat', one of the words we used to build the query. Intuitively, cats say meow and dogs say bark.
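
Since the words used to build the query tend to rank near the top, it can help to drop them before reading off the answer. The helper below is a hypothetical base R sketch (the function `analogy` and the toy vectors are assumptions, not part of `text2vec`) that ranks words by cosine similarity and excludes the three input words.

```r
# Rank words by cosine similarity to start - minus + plus,
# dropping the three input words from the results
analogy <- function(emb, start, minus, plus, n = 3) {
  target <- emb[start, ] - emb[minus, ] + emb[plus, ]
  sims <- (emb %*% target) / (sqrt(rowSums(emb^2)) * sqrt(sum(target^2)))
  sims <- sims[, 1]
  sims <- sims[!names(sims) %in% c(start, minus, plus)]
  head(sort(sims, decreasing = TRUE), n)
}

# Hand-built vectors: columns loosely stand for "cat", "dog", and "sound"
toy <- rbind(
  cat  = c(1.0, 0.0, 0),
  dog  = c(0.0, 1.0, 0),
  meow = c(1.0, 0.0, 1),
  bark = c(0.0, 1.0, 1),
  tree = c(0.2, 0.2, 0)
)

# cat - meow + bark
analogy(toy, "cat", "meow", "bark")
```

The same helper could be pointed at `word_vectors`, assuming all three words survived vocabulary pruning.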

## Parent example

Let's move on to the parent example:

In [None]:
# mom - girl + boy should equal dad
dad <- word_vectors["mom", , drop = FALSE] -
  word_vectors["girl", , drop = FALSE] +
  word_vectors["boy", , drop = FALSE]
  
# Calculates pairwise similarities between the rows of two matrices
dad_cos_sim <- sim2(x = word_vectors, y = dad, method = "cosine", norm = "l2")

# Top five predictions
head(sort(dad_cos_sim[,1], decreasing = TRUE), 5)

'Dad' was a top result. Finally, let's try the famous king and queen example.



## King and queen example

In [None]:
# king - man + woman should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)

Unfortunately, 'queen' came in 4th. Let's try changing **man** and **woman** to **boy** and **girl** to account for the kids' writing.

In [None]:
# king - boy + girl should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["boy", , drop = FALSE] +
  word_vectors["girl", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)

It worked!

# A Tangent on Bias

As we discussed in class, word embeddings have proven to be a useful tool for revealing bias in large corpora. Here, we can see how well the kids fare. We'll look at occupations.

In [None]:
# job - boy + girl
job <- word_vectors["job", , drop = FALSE] -
  word_vectors["boy", , drop = FALSE] +
  word_vectors["girl", , drop = FALSE]

# Calculate pairwise similarities
job_cos_sim <- sim2(x = word_vectors, y = job, method = "cosine", norm = "l2")

# Top five predictions
head(sort(job_cos_sim[,1], decreasing = TRUE), 5)

In [None]:
# job - girl + boy
job <- word_vectors["job", , drop = FALSE] -
  word_vectors["girl", , drop = FALSE] +
  word_vectors["boy", , drop = FALSE]

# Calculate pairwise similarities
job_cos_sim <- sim2(x = word_vectors, y = job, method = "cosine", norm = "l2")

# Top five predictions
head(sort(job_cos_sim[,1], decreasing = TRUE), 5)

Interesting! We're not seeing the same dynamics observed in other settings. Given the small corpus, though, we'd want to add a lot more data before we could be confident that those biases weren't present here.

# Working with Estimated Embeddings

With the embeddings in hand, we can fit a simple clustering algorithm (KMeans). We specify 5 clusters, but feel free to play around with that number.

In [None]:
set.seed(12345)
clusters <- kmeans(word_vectors, centers = 5, iter.max  = 30)

Note what we have estimated with KMeans. We have 5 cluster centers, each with 50 dimensions, the same number of dimensions that we have for each of our tokens. We can therefore look for the tokens that are most similar to each cluster center.

In [None]:
# First cluster center as a 1 x 50 matrix
cluster1 <- t(as.matrix(clusters$centers[1,]))

In [None]:
clus_cos_sim <- sim2(x = word_vectors, y = cluster1, method = "cosine", norm = "l2")


In [None]:
# Top ten cluster words
head(sort(clus_cos_sim[,1], decreasing = TRUE), 10)
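
To make the shape of the KMeans output concrete, here is a minimal sketch on made-up two-dimensional data. The toy points and the names `toy` and `fit` are assumptions for illustration only.

```r
set.seed(12345)

# Two well-separated blobs of made-up 2-d "embeddings"
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.2), ncol = 2)
)
rownames(toy) <- paste0("w", seq_len(nrow(toy)))

fit <- kmeans(toy, centers = 2, iter.max = 30)

# One row per cluster, one column per embedding dimension
dim(fit$centers)

# Hard assignment of every row (word) to a cluster
head(fit$cluster)
```

Besides ranking words by similarity to a center, `fit$cluster` gives each word's assigned cluster directly, which is another way to inspect cluster membership.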

Now let's loop over the cluster centers:

In [None]:
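# One row per cluster, one column per top word
topWordMatrix <- matrix(NA, 5, 10)

for (i in 1:5) {
  # Cluster center as a 1 x 50 matrix
  cluster <- t(as.matrix(clusters$centers[i, ]))
  # Cosine similarity of every word to this center
  clus_cos_sim <- sim2(x = word_vectors, y = cluster, method = "cosine", norm = "l2")
  # Keep the ten most similar words
  topWordMatrix[i, ] <- names(head(sort(clus_cos_sim[, 1], decreasing = TRUE), 10))
}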

In [None]:
topWordMatrix

0,1,2,3,4,5,6,7,8,9
moonbeam,mutant,hanky,mercedes,hawaiian,louse,thers,weres,roach,lama
dont't,literally,fickle,dating,sag,marty,struggled,tempted,carlos,reluctantly
defiance,mah,loveâ’s,settling,chilli,blue's,strides,plops,swooshing,yang
but,just,because,that,when,you,if,now,not,all
surface,mist,bag,filling,atop,pile,pot,freezer,fills,gate


It's a little hard to see much structure in these clusters. Play around with the specifications to see whether changing the number of clusters in KMeans, the size of the window in GloVe, etc. gets us to more sensible clusters. If not, it may just be that the dataset is too limited to really learn much.