# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations

In [2]:
! pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=41192c7e1977a32659dbb9c41593851e44e6fcc7725347abc65932e980c41dd6
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/reviews.full.tsv.zip'
wget.download(url, 'reviews.full.tsv.zip')

'reviews.full.tsv.zip'

In [4]:
from zipfile import ZipFile
with ZipFile('reviews.full.tsv.zip', 'r') as zf:
    zf.extractall()

In [5]:
import pandas as pd
df = pd.read_csv('reviews.full.tsv', sep='\t', nrows=100000)
documents = df.text.tolist()
print(documents[:4])

["Prices change daily and if you want to really research the price continually at many different sites , I have found cheaper cars elsewhere . However , if you don ' t have a lot of time to research the price , this site has always been among the top three ( e . g ., cheapest ) of the ten sites I use to reserve a car .", 'and the fact that they will match other companies is awesome !!', "Used Paypal for my buying and selling for the past 0 years and never had an issue they didn ' t resolve to my satisfaction .", "I ' ve made two purchases on CJ ' s for Fallout : New Vegas and The Elder Scrolls V : Skyrim . I have been satisfied by both , being extremely cheaper than the Steam versions . The Autokey system that CJ ' s uses is genius . I recommend this site to anyone who is a PC gamer !"]


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
small_vectorizer = CountVectorizer()

sentences_2 = documents[:1]

X1 = small_vectorizer.fit_transform(sentences_2)

In [7]:
small_vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Let's implement this ourselves:

In [8]:
import numpy as np
num_docs = 1

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split()
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary)

# create a data matrix with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary)))

# fill that matrix with sweet counts
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

Unnamed: 0,',(,),",",.,".,",a,always,among,and,at,been,car,cars,change,cheaper,cheapest,continually,daily,different,don,e,elsewhere,found,g,has,have,however,i,if,lot,many,of,price,prices,really,research,reserve,site,sites,t,ten,the,this,three,time,to,top,use,want,you
0,1,1,1,3,3,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,2,2,1,1,2,2,1,1,2,1,1,2,1,1,4,1,1,1,3,1,1,1,2


In [9]:
vocabulary_ = {word: position for position, word in enumerate(vocabulary)}
vocabulary_

{"'": 0,
 '(': 1,
 ')': 2,
 ',': 3,
 '.': 4,
 '.,': 5,
 'a': 6,
 'always': 7,
 'among': 8,
 'and': 9,
 'at': 10,
 'been': 11,
 'car': 12,
 'cars': 13,
 'change': 14,
 'cheaper': 15,
 'cheapest': 16,
 'continually': 17,
 'daily': 18,
 'different': 19,
 'don': 20,
 'e': 21,
 'elsewhere': 22,
 'found': 23,
 'g': 24,
 'has': 25,
 'have': 26,
 'however': 27,
 'i': 28,
 'if': 29,
 'lot': 30,
 'many': 31,
 'of': 32,
 'price': 33,
 'prices': 34,
 'really': 35,
 'research': 36,
 'reserve': 37,
 'site': 38,
 'sites': 39,
 't': 40,
 'ten': 41,
 'the': 42,
 'this': 43,
 'three': 44,
 'time': 45,
 'to': 46,
 'top': 47,
 'use': 48,
 'want': 49,
 'you': 50}

The result is a *sparse count matrix*:

In [10]:
# indexed representation
import numpy as np
# print(X1)

# dense representation
print(X1.todense())

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 4 1 1 1 3
  1 1 1 2]]
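To see why a sparse format pays off, here is a minimal sketch in plain Python. It is not how scikit-learn or SciPy store their matrices internally; it only keeps the non-zero counts of each document as a `{column: count}` dictionary. The toy corpus and the helper names `to_sparse_rows` and `to_dense` are made up for illustration.

```python
from collections import Counter

def to_sparse_rows(docs):
    """Return (vocabulary, rows); each row maps column index -> non-zero count."""
    vocabulary = sorted({token for doc in docs for token in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append({index[word]: n for word, n in counts.items()})
    return vocabulary, rows

def to_dense(rows, num_columns):
    """Expand the sparse rows into a full list-of-lists matrix (zeros included)."""
    return [[row.get(j, 0) for j in range(num_columns)] for row in rows]

# Hypothetical two-document corpus
toy_docs = ["great prices great service", "slow delivery"]
vocab, sparse_rows = to_sparse_rows(toy_docs)
dense = to_dense(sparse_rows, len(vocab))

# Number of values actually stored in the sparse form
stored = sum(len(row) for row in sparse_rows)

print(vocab)
print(sparse_rows)
print(dense)
print(stored, "stored values instead of", len(toy_docs) * len(vocab))
```

Even in this tiny case, half of the dense matrix is zeros. With thousands of documents and features, almost every cell is zero, which is why the vectorizer returns a sparse matrix by default.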


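We can access the mapping from vector position to feature names via `get_feature_names()` (in scikit-learn 1.0 and later, this method is called `get_feature_names_out()`; the old name was removed in version 1.2):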

In [11]:
print(small_vectorizer.get_feature_names())

['always', 'among', 'and', 'at', 'been', 'car', 'cars', 'change', 'cheaper', 'cheapest', 'continually', 'daily', 'different', 'don', 'elsewhere', 'found', 'has', 'have', 'however', 'if', 'lot', 'many', 'of', 'price', 'prices', 'really', 'research', 'reserve', 'site', 'sites', 'ten', 'the', 'this', 'three', 'time', 'to', 'top', 'use', 'want', 'you']


The inverse (the mapping from feature names to vector positions) is stored as a dictionary in `vocabulary_`:

In [12]:
print(small_vectorizer.vocabulary_)

{'prices': 24, 'change': 7, 'daily': 11, 'and': 2, 'if': 19, 'you': 39, 'want': 38, 'to': 35, 'really': 25, 'research': 26, 'the': 31, 'price': 23, 'continually': 10, 'at': 3, 'many': 21, 'different': 12, 'sites': 29, 'have': 17, 'found': 15, 'cheaper': 8, 'cars': 6, 'elsewhere': 14, 'however': 18, 'don': 13, 'lot': 20, 'of': 22, 'time': 34, 'this': 32, 'site': 28, 'has': 16, 'always': 0, 'been': 4, 'among': 1, 'top': 36, 'three': 33, 'cheapest': 9, 'ten': 30, 'use': 37, 'reserve': 27, 'car': 5}
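The two mappings carry the same information, so either one can be rebuilt from the other. A small sketch, using a hypothetical three-word `vocabulary_` rather than the fitted one above:

```python
# Hypothetical miniature of a fitted `vocabulary_` dictionary (word -> column).
vocabulary_ = {"prices": 2, "change": 0, "daily": 1}

# Sorting the words by column index recovers the feature-name list (column -> word).
feature_names = [word for word, column in sorted(vocabulary_.items(),
                                                 key=lambda item: item[1])]
print(feature_names)

# Look-ups in both directions
print(vocabulary_["daily"])   # word -> column
print(feature_names[2])       # column -> word
```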


## Terminology 

![](matrix.pdf)

Let's redo this for the entire corpus:

In [13]:
vectorizer = CountVectorizer(analyzer='word', 
                             ngram_range=(1, 2), 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

X = vectorizer.fit_transform(documents[:10000])

print(X.shape)

(10000, 3869)


Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.


In [14]:
vectorizer.transform([documents[-1]])

<1x3869 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [15]:
documents[-1]

'Never had any issues , easy to use and great prices .'
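The "fit once, then transform" behavior can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it tokenizes on whitespace only, and the function names `fit_vocabulary` and `transform_document` are hypothetical.

```python
def fit_vocabulary(docs):
    """Collect all word types and assign each one a fixed column index."""
    words = sorted({token for doc in docs for token in doc.lower().split()})
    return {word: column for column, word in enumerate(words)}

def transform_document(doc, vocabulary):
    """Count only the words that are already in the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in doc.lower().split():
        column = vocabulary.get(token)
        if column is not None:      # unseen words have no column, so skip them
            vector[column] += 1
    return vector

# Hypothetical training data and a new document with two unseen words
train_docs = ["easy to use", "great prices"]
vocabulary = fit_vocabulary(train_docs)
new_vector = transform_document("great prices and easy returns", vocabulary)

print(vocabulary)
print(new_vector)
```

The words `and` and `returns` never appeared during fitting, so they contribute nothing to the vector, and its length stays equal to the size of the fitted vocabulary.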

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [16]:
# your code here

## Character $n$-grams

We can also use characters to analyze text:

In [None]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

In [None]:
print(char_vectorizer.vocabulary_)
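To make the notion of a character $n$-gram concrete, the following sketch lists all character sequences of length `n_min` to `n_max` in a string. The helper `char_ngrams` is a hypothetical stand-in written for illustration; it approximates what the `char` analyzer extracts, but it is not scikit-learn's code.

```python
def char_ngrams(text, n_min, n_max):
    """Return all character n-grams of `text` with n_min <= n <= n_max."""
    text = text.lower()
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n over the string
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

bigrams_and_trigrams = char_ngrams("cars", 2, 3)
print(bigrams_and_trigrams)
```

Because spaces are characters too, $n$-grams taken over a whole sentence also span word boundaries.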

## Syntactic $n$-grams

In [None]:
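import spacy
# spaCy 3+ no longer accepts the 'en' shortcut; load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')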

features = [' '.join(["{}_{}".format(c.lemma_, c.head.lemma_) 
                      for c in nlp(sentence)])
            for sentence in documents[:100]]

syntax_vectorizer = CountVectorizer()
X = syntax_vectorizer.fit_transform(features)

In [None]:
print(documents[0])
print(features[0])

In [None]:
print(syntax_vectorizer.vocabulary_)

# Dense Distributed Representations

## Word embeddings with `Word2vec`

In [None]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

corpus = [document.split() for document in documents]

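# initialize model
# (in gensim >= 4.0, `size` is called `vector_size` and `iter` is called `epochs`)
w2v_model = Word2Vec(size=100,
                     window=15,
                     sample=0.0001,
                     iter=200,
                     negative=5, 
                     min_count=100,
                     workers=4,  # number of worker threads; must be positive (-1 trains nothing)
                     hs=0
)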

w2v_model.build_vocab(corpus)

w2v_model.train(corpus, 
                total_examples=w2v_model.corpus_count, 
                epochs=w2v_model.epochs)


In [None]:
print(corpus[0])

Now we can use the embeddings of the model:

In [None]:
w2v_model.wv['delivery']

In [None]:
w2v_model.wv.most_similar(['delivery','concert'])

In [None]:
# birthday - present + husband => present:birthday as husband:?
w2v_model.wv.most_similar(positive=['birthday', 'husband'], negative=['present'], topn=3)

In [None]:
word1 = "Cheapest"
word2 = "friendly"

# retrieve the actual vector
# print(w2v_model.wv[word1])

# compare
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))
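The similarity scores in these queries are cosine similarities between word vectors. The sketch below shows the computation, including the analogy arithmetic, on made-up 3-dimensional vectors. The values in `toy_vectors` are invented for illustration and do not come from the trained model. gensim's own implementation differs in details, such as normalizing the input vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings", for illustration only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

# Analogy man:king as woman:?  ->  query = king - man + woman
query = [k - m + w for k, m, w in zip(toy_vectors["king"],
                                      toy_vectors["man"],
                                      toy_vectors["woman"])]

# Rank all words except the three query words by similarity to the query vector
candidates = [w for w in toy_vectors if w not in ("king", "man", "woman")]
ranked = sorted(candidates, key=lambda w: cosine(query, toy_vectors[w]), reverse=True)
best = ranked[0]

print(ranked)
print(round(cosine(query, toy_vectors[best]), 3))
```

Cosine similarity ignores vector length and only compares direction, so a frequent word with a long vector is not automatically "closer" to everything else.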



## Exercise
Use `spacy` to restrict the words in the reviews to *content words*, i.e., nouns, verbs, and adjectives. Transform the words to lower case and append the POS tag with an underscore, e.g.:

`love_VERB old-fashioneds_NOUN`

This also allows us to distinguish between homographs, i.e., words that are written the same, but belong to different word classes, e.g., *love* in "I **love** old-fashioneds" vs. "He felt so sick, it must have been **love**".


Make sure to exclude sentences that contain none of the above.

Write the resulting corpus to a variable called `word_corpus`.

In [None]:
# Your code here

Rerun the `Word2vec` model from above on the new data set and test out some words.

In [None]:
# Your code here

## Exercise

Train 4 more `Word2vec` models and average the resulting embedding matrices.

In [None]:
# Your code here



## Document embeddings with `Doc2Vec`

In [None]:
df.head()

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import FAST_VERSION
from gensim.models.doc2vec import TaggedDocument

corpus = []

for row in df.iterrows():
    label = row[1].score
    text = row[1].text
    corpus.append(TaggedDocument(words=text.split(), tags=[str(label)]))

print('done')
d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=4,  # number of worker threads; must be positive (-1 trains nothing)
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

d2v_model.build_vocab(corpus)

d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

We can now inspect the learned document vectors and their tags:

In [None]:
d2v_model.docvecs[0]

In [None]:
d2v_model.docvecs.doctags

In [None]:
target_doc = '1'

similar_docs = d2v_model.docvecs.most_similar(target_doc, topn=5)
print(similar_docs)

## Exercise

What are the 10 most similar ***words*** to each category?

In [None]:
# your code here