# Demo 20

## Gensim

[Gensim](https://radimrehurek.com/gensim/#) is a popular NLP library. We will use it for word embeddings.



In [None]:
import gensim

In [None]:
gensim.__version__

Gensim provides many pretrained models that we can use. The next cell prints a list of the models we can download.

In [None]:
import gensim.downloader
list(gensim.downloader.info()['models'].keys())

**Question:** What do the numbers at the end of the `word2vec` and `glove` models mean?

<details>
<summary>Solution</summary>
    The dimension of the vectors. This is <i>k</i> from our slides.
</details>

In [None]:
# skip

We will look at GloVe vectors that were trained on Wikipedia and Gigaword.

*(First run the next cell then discuss this)*<br>
In class we discussed Word2Vec, but for this demo we will use GloVe embeddings. GloVe is a similar method that learns word embeddings from a co-occurrence matrix.

See https://nlp.stanford.edu/projects/glove/ for more information.
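
To make "co-occurrence matrix" concrete, here is a minimal sketch that counts how often two words appear near each other in a toy corpus. The corpus, the window size, and the helper name `cooccurrence_counts` are made up for illustration; this shows only the raw counts, not the GloVe training procedure itself.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (hypothetical), already lowercased and tokenized
toy_corpus = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
]
counts = cooccurrence_counts(toy_corpus, window=1)
print(counts[("king", "rules")], counts[("rules", "the")])
```

Words that share many neighbors (here `king` and `queen`) end up with similar rows in the count matrix, which is the signal that embedding methods of this kind build on.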

### Download Glove model

In [None]:
%%time

import gensim.downloader as api
model = api.load('glove-wiki-gigaword-50')

The cell above created a new directory called `gensim-data` in your home directory.

In [None]:
!ls ~/

It also created a new directory called `glove-wiki-gigaword-50` inside `gensim-data`, where it stored the vectors.

In [None]:
!ls ~/gensim-data

The next line prints out the size of the folder and its contents.

In [None]:
!du -h ~/gensim-data

We can see the model is 67MB. The other embeddings will be much larger because they have more dimensions and a larger vocabulary.
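
As a rough sketch of why size grows with both dimension and vocabulary: a dense matrix of vectors needs about vocabulary size × dimensions × bytes per number. The vocabulary size and the 4-byte floats below are assumptions for illustration, and the size on disk will differ because of the file format and compression.

```python
def raw_vector_bytes(vocab_size, dim, bytes_per_float=4):
    """Bytes needed for a dense float matrix of shape (vocab_size, dim)."""
    return vocab_size * dim * bytes_per_float


assumed_vocab = 400_000  # assumption for illustration, not read from the model
for dim in (50, 100, 300):
    mb = raw_vector_bytes(assumed_vocab, dim) / 1e6
    print(f"{dim} dimensions: about {mb:.0f} MB of raw floats")
```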

Feel free to download the larger vectors on your own. They take longer to download so we are only using the small vectors during the class demo.

*Note:* CUIT provisions 5GB for each JupyterHub server, so be careful when you download many embedding files. Some of the largest ones are about 2.5GB. If you run out of disk space, open a terminal and use the `rm` command to delete some large files.

### Saved model as gensim KeyedVectors

In [None]:
model

The word vectors are stored as a gensim `KeyedVectors` object.
Here is the [documentation for KeyedVectors in gensim](https://radimrehurek.com/gensim/models/keyedvectors.html).

We can see all vectors by running the following:

In [None]:
model.vectors

**Question:** How do you think we can find the number of words in our vocabulary and the size of the vectors?

<details>
<summary>Solution</summary>
    model.vectors.shape
</details>

In [None]:
# run the code to see the size of the vocab and the vectors



### Access Embeddings for a word

We can find the word embedding for a specific word type by running the next line.

In [None]:
model['king']

In [None]:
model.get_vector('king')

In [None]:
model['queen']

(back to slides)

## Exploring word relationships

### Most similar terms

In [None]:
model.most_similar('king')

In [None]:
model.most_similar('queen')
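
`most_similar` ranks the other words in the vocabulary by cosine similarity to the query word. The sketch below shows that idea on three made-up 3-dimensional vectors; `toy_vectors` and `most_similar_toy` are hypothetical names, not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar_toy("king", toy_vectors)
print(ranked)
```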

In [None]:
model.most_similar(positive=['king', 'male'], negative=['queen'])
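
The `positive` and `negative` arguments add and subtract word vectors before ranking by cosine similarity. Here is a simplified sketch of that arithmetic, using the classic king/woman/man analogy rather than the exact words in the cell above. The 2-dimensional vectors and the helper `analogy_toy` are made up for illustration and only approximate what gensim does internally.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy_toy(vectors, positive, negative, topn=1):
    """Add unit vectors of `positive`, subtract those of `negative`, rank the rest."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    inputs = set(positive) | set(negative)  # query words are excluded from the results
    scores = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: first axis loosely "royalty", second loosely "gender"
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.1],
}
result = analogy_toy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(result)
```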

### Which word doesn't match

In [None]:
model.doesnt_match(["breakfast", "cereal", "dinner", "lunch"])
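
One way to think about `doesnt_match`: average the (length-normalized) vectors of the listed words, then return the word whose vector is least similar to that average. The sketch below follows that idea with made-up 2-dimensional vectors; `doesnt_match_toy` is a hypothetical helper, not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match_toy(words, vectors):
    """Return the word farthest (by cosine) from the mean of the group."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Made-up vectors: the three meals point one way, "cereal" points another
toy_vectors = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.9],
}
odd_one = doesnt_match_toy(["breakfast", "cereal", "dinner", "lunch"], toy_vectors)
print(odd_one)
```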

### Compute sentence distance (Word Mover's Distance)

In [None]:
!pip install pyemd

In [None]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

model.wmdistance(sentence_obama, sentence_president)

### Even more

In [None]:
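from gensim.test.utils import datapath
model.evaluate_word_analogies(datapath('questions-words.txt'))  # the analogies file argument is required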

In [None]:
model.closer_than("queen", "king")

In [None]:
model.closer_than("king", "queen")

### Visualization with dimensionality reduction

In [None]:
model.vectors.shape
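
To plot 50-dimensional vectors we first need to reduce them to two dimensions. PCA is one common choice; the notebook does not say which method is intended, so treat this as an assumption. The sketch below uses random numbers as a stand-in for a few rows of `model.vectors`, and `pca_2d` is a hypothetical helper.

```python
import numpy as np


def pca_2d(vectors):
    """Project the rows of `vectors` onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 50))  # stand-in for 10 rows of model.vectors
points = pca_2d(toy_vectors)
print(points.shape)
```

Each row of `points` is an (x, y) pair that could be passed to a scatter plot and labeled with its word.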

## Word Embeddings as Features


We'll pick up on Thursday showing how we can train a classifier with word embeddings as features.
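
As a preview, one simple way to turn a review into a fixed-length feature vector is to average the embeddings of its tokens. This is only a sketch of a common approach, not necessarily the method we will use on Thursday; the 2-dimensional vectors and the helper `average_embedding` are made up for illustration.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Made-up vectors for illustration only
toy_vectors = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}
features = average_embedding(["good", "movie", "zzz"], toy_vectors, dim=2)
print(features)
```

Out-of-vocabulary tokens (here `zzz`) are skipped, so every review maps to a vector with the same number of dimensions as the embeddings.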

In [None]:
from sklearn.naive_bayes import MultinomialNB
import nltk
import pandas as pd

# Fetch the corpus on first use
nltk.download('movie_reviews')

movie_reviews = nltk.corpus.movie_reviews
# File ids start with "pos" or "neg", so the prefix gives the gold label
review_files = [(file_id, file_id.startswith("pos")) for file_id in movie_reviews.fileids()]
df = pd.DataFrame(review_files)
df = df.rename(columns={0: "file_name", 1: "gold-label"})


def read_mov_review(f_name):
    return movie_reviews.open(f_name).read()

df['review_text'] = df['file_name'].apply(read_mov_review)

# Shuffle the rows before splitting (no random_state is set, so the split changes on each run)
df = df.sample(df.shape[0])
df.head(5)

train_max_idx = int(df.shape[0] * .8)
dev_max_idx = int((df.shape[0] * .1) + train_max_idx)


train_max_idx, dev_max_idx

train_df = df.iloc[:train_max_idx]
dev_df = df.iloc[train_max_idx:dev_max_idx]
test_df = df.iloc[dev_max_idx:]

train_df.shape, dev_df.shape, test_df.shape

In [None]:
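def clean_text(review):
    # Split the review into sentences, tokenize each one, and rejoin the tokens with single spaces
    return " ".join([" ".join(nltk.tokenize.word_tokenize(sent)) for sent in nltk.tokenize.sent_tokenize(review)])

# Work on a copy so the new column is not assigned to a slice of df
train_df = train_df.copy()
train_df['clean_text'] = train_df['review_text'].apply(clean_text)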