# Word Representations

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations

Let's first get some data:

In [6]:
import pandas as pd
df = pd.read_csv('C:/Users/Tiziano/Desktop/BOCCONI/2nd semester/NLP/data/reviews.full.tsv', sep='\t', nrows=50000)
documents = df.text.tolist()

We can now turn this into a data matrix of $n$ reviews and $m$ n-grams.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
small_vectorizer = CountVectorizer()

sentences_2 = documents[:2]
X1 = small_vectorizer.fit_transform(sentences_2)

The result is a sparse count matrix:

In [11]:
X1

<2x49 sparse matrix of type '<class 'numpy.int64'>'
	with 51 stored elements in Compressed Sparse Row format>

Let's implement this ourselves, to see what is going on under the hood. We have to
- collect the word types
- assign each one an id (= column)
- create a matrix
- fill that matrix


In [12]:
import numpy as np
num_docs = 1

# collect all word types (= vocabulary)
vocabulary = set()
for document in documents[:num_docs]:
    tokens = document.lower().split()
    vocabulary = vocabulary.union(set(tokens))
vocabulary = sorted(vocabulary)

# create a data matrix with #docs-by-#features dimensions
X = np.zeros((num_docs, len(vocabulary)))

# fill that matrix with sweet counts
for d, document in enumerate(documents[:num_docs]):
    tokens = document.lower().split()
    for i, feature in enumerate(vocabulary):
        X[d, i] = tokens.count(feature)

# show the result as a DataFrame
pd.DataFrame(data=X, columns=vocabulary, dtype=int)

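| | `'` | `(` | `)` | `,` | `.` | `.,` | a | always | among | and | ... | ten | the | this | three | time | to | top | use | want | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 3 | 3 | 1 | 2 | 1 | 1 | 1 | ... | 1 | 4 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 2 |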


For convenience, we also create an inverted index that maps each word to its column index:

In [14]:
vocabulary_ = {word: position for position, word in enumerate(vocabulary)}
vocabulary_

{"'": 0,
 '(': 1,
 ')': 2,
 ',': 3,
 '.': 4,
 '.,': 5,
 'a': 6,
 'always': 7,
 'among': 8,
 'and': 9,
 'at': 10,
 'been': 11,
 'car': 12,
 'cars': 13,
 'change': 14,
 'cheaper': 15,
 'cheapest': 16,
 'continually': 17,
 'daily': 18,
 'different': 19,
 'don': 20,
 'e': 21,
 'elsewhere': 22,
 'found': 23,
 'g': 24,
 'has': 25,
 'have': 26,
 'however': 27,
 'i': 28,
 'if': 29,
 'lot': 30,
 'many': 31,
 'of': 32,
 'price': 33,
 'prices': 34,
 'really': 35,
 'research': 36,
 'reserve': 37,
 'site': 38,
 'sites': 39,
 't': 40,
 'ten': 41,
 'the': 42,
 'this': 43,
 'three': 44,
 'time': 45,
 'to': 46,
 'top': 47,
 'use': 48,
 'want': 49,
 'you': 50}

`CountVectorizer` does all that for us under the hood (and then some).
The result is a *sparse count matrix*:

In [15]:
# sparse indexed representation
print(X1)

# dense representation
print(X1.todense())

  (0, 30)	1
  (0, 8)	1
  (0, 13)	1
  (0, 2)	1
  (0, 22)	2
  (0, 48)	2
  (0, 46)	1
  (0, 43)	3
  (0, 31)	1
  (0, 32)	2
  (0, 38)	4
  (0, 29)	2
  (0, 12)	1
  (0, 3)	1
  (0, 25)	1
  (0, 14)	1
  (0, 35)	2
  (0, 20)	2
  (0, 18)	1
  (0, 9)	1
  (0, 7)	1
  (0, 16)	1
  (0, 21)	1
  (0, 15)	1
  (0, 24)	1
  :	:
  (0, 42)	1
  (0, 40)	1
  (0, 34)	1
  (0, 19)	1
  (0, 0)	1
  (0, 5)	1
  (0, 1)	1
  (0, 44)	1
  (0, 41)	1
  (0, 10)	1
  (0, 36)	1
  (0, 45)	1
  (0, 33)	1
  (0, 6)	1
  (1, 2)	1
  (1, 38)	1
  (1, 17)	1
  (1, 37)	1
  (1, 39)	1
  (1, 47)	1
  (1, 26)	1
  (1, 28)	1
  (1, 11)	1
  (1, 23)	1
  (1, 4)	1
[[1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 2 1 2 0 1 1 0 2 0 2 1 1 2 1 1 2
  1 0 4 0 1 1 1 3 1 1 1 0 2]
 [0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0
  0 1 1 1 0 0 0 0 0 0 0 1 0]]
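To make the two printouts above easier to relate, here is a minimal pure-Python sketch of the idea behind a sparse representation: store only the non-zero cells, keyed by `(row, column)`. The helper names `to_coordinates` and `to_dense` are made up for illustration, and this is a simplification; the actual Compressed Sparse Row format uses a more compact array layout.

```python
def to_coordinates(dense_rows):
    """Keep only the non-zero cells as {(row, column): count}."""
    return {
        (r, c): value
        for r, row in enumerate(dense_rows)
        for c, value in enumerate(row)
        if value != 0
    }


def to_dense(coordinates, n_rows, n_cols):
    """Rebuild the full matrix, filling every missing cell with 0."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (r, c), value in coordinates.items():
        dense[r][c] = value
    return dense


# toy 2-documents-by-5-features count matrix (illustrative values)
toy_dense = [[0, 2, 0, 0, 1],
             [1, 0, 0, 0, 0]]

toy_sparse = to_coordinates(toy_dense)
print(toy_sparse)                       # only 3 of the 10 cells are stored
print(to_dense(toy_sparse, 2, 5) == toy_dense)
```

With thousands of features and short documents, almost all cells are zero, which is why the sparse form saves so much memory.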


We can access the mapping from vector position to feature name via `get_feature_names_out()` (called `get_feature_names()` in scikit-learn versions before 1.0):

In [16]:
print(small_vectorizer.get_feature_names_out().tolist())

['always', 'among', 'and', 'at', 'awesome', 'been', 'car', 'cars', 'change', 'cheaper', 'cheapest', 'companies', 'continually', 'daily', 'different', 'don', 'elsewhere', 'fact', 'found', 'has', 'have', 'however', 'if', 'is', 'lot', 'many', 'match', 'of', 'other', 'price', 'prices', 'really', 'research', 'reserve', 'site', 'sites', 'ten', 'that', 'the', 'they', 'this', 'three', 'time', 'to', 'top', 'use', 'want', 'will', 'you']


The inverse (the mapping from feature names to vector positions) is stored as a dictionary in `vocabulary_`:

In [17]:
print(small_vectorizer.vocabulary_)

{'prices': 30, 'change': 8, 'daily': 13, 'and': 2, 'if': 22, 'you': 48, 'want': 46, 'to': 43, 'really': 31, 'research': 32, 'the': 38, 'price': 29, 'continually': 12, 'at': 3, 'many': 25, 'different': 14, 'sites': 35, 'have': 20, 'found': 18, 'cheaper': 9, 'cars': 7, 'elsewhere': 16, 'however': 21, 'don': 15, 'lot': 24, 'of': 27, 'time': 42, 'this': 40, 'site': 34, 'has': 19, 'always': 0, 'been': 5, 'among': 1, 'top': 44, 'three': 41, 'cheapest': 10, 'ten': 36, 'use': 45, 'reserve': 33, 'car': 6, 'fact': 17, 'that': 37, 'they': 39, 'will': 47, 'match': 26, 'other': 28, 'companies': 11, 'is': 23, 'awesome': 4}


## Terminology 

![](matrix.pdf)

Let's redo this for the entire corpus:

In [18]:
vectorizer = CountVectorizer(analyzer='word', # 'word', 'char', or 'char_wb' (character n-grams only from text inside word boundaries)
                             ngram_range=(1, 2), # use unigrams and bigrams
                             min_df=0.001, # use only n-grams occurring in at least 0.1% of docs
                             max_df=0.75, # use only n-grams occurring in at most 75% of docs
                             stop_words='english') # ignore common stop words

X = vectorizer.fit_transform(documents[:10000])

print(X.shape)

(10000, 3869)


There are some important arguments:

**max_df**: float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**min_df**: float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
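As a rough illustration of what these two thresholds do, the sketch below filters a toy vocabulary by document frequency in plain Python. The function name `filter_by_document_frequency` and the toy documents are made up for this example, tokenization is a simple whitespace split, and only the float (proportion) form of the parameters is covered.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df] (proportions)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each term at most once per document
        doc_freq.update(set(doc.lower().split()))
    lowest, highest = min_df * n_docs, max_df * n_docs
    return sorted(term for term, freq in doc_freq.items()
                  if lowest <= freq <= highest)


toy_docs = ["the car is cheap",
            "the car is fast",
            "the site is awesome",
            "the price changed"]

# "the" is in 4/4 docs (too frequent); most other words are in 1/4 (too rare)
kept = filter_by_document_frequency(toy_docs, min_df=0.5, max_df=0.75)
print(kept)
```

Terms above `max_df` behave like corpus-specific stop words, while terms below `min_df` are too rare to generalize from.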

Calling `transform()` on a new document will apply the vocabulary we collected previously to this new data point. Any words we have not seen before are ignored.

In [19]:
vectorizer.transform([documents[-1]])

<1x3869 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [20]:
vectorizer.transform([documents[-1]]).todense()

matrix([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
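The behavior of `transform()` can be approximated in a few lines: look up each token in a fixed vocabulary and skip anything that is missing. This is only a sketch under simplifying assumptions (whitespace tokenization, unigrams only); `transform_with_vocabulary` and `toy_vocabulary` are hypothetical names, not part of scikit-learn.

```python
def transform_with_vocabulary(document, vocabulary):
    """Count tokens against a fixed word-to-column mapping; unseen words are ignored."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        column = vocabulary.get(token)
        if column is not None:
            vector[column] += 1
    return vector


# vocabulary "learned" earlier (illustrative)
toy_vocabulary = {"car": 0, "cheap": 1, "site": 2}

# "this", "has", "a", and "rental" are not in the vocabulary and are dropped
vector = transform_with_vocabulary("This site has a cheap cheap car rental", toy_vocabulary)
print(vector)
```

Because the vocabulary is frozen after fitting, the output always has the same number of columns, no matter what the new document contains.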

In [21]:
documents[-2]

"This is the first time I used this site for concert tickets and was worried . But I didn ' t need to be as I received the tickets by registered mail this morning ."

## Character $n$-grams

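We can also use characters to analyze text. There are far fewer distinct characters than distinct words, so for short $n$-grams the matrix has fewer columns and is typically denser. With longer $n$-grams (here up to 6 characters), the number of columns grows quickly.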

In [29]:
char_vectorizer = CountVectorizer(analyzer='char', 
                                  ngram_range=(2, 6), 
                                  min_df=0.001, 
                                  max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 1, 1, 1]], dtype=int64)

In [30]:
print(char_vectorizer.vocabulary_)

{'pr': 5953, 'ic': 4050, 'ce': 2121, 'ch': 2155, 'ng': 5194, 'ge': 3612, ' d': 382, 'da': 2407, 'ai': 1609, 'il': 4153, 'ly': 4732, 'if': 4118, 'f ': 3378, ' y': 1264, 'yo': 8014, 'ou': 5723, 'u ': 7367, 'wa': 7716, 'nt': 5298, 'ea': 2786, 'ar': 1819, 'rc': 6141, 'ti': 7183, 'nu': 5338, 'ua': 7387, 'at': 1904, 'ny': 5348, 'di': 2471, 'ff': 3452, 'fe': 3434, 'en': 3045, 'si': 6696, 'it': 4361, 'te': 7022, ' ,': 51, ', ': 1337, 'i ': 3975, 'av': 1959, 'fo': 3495, 'un': 7438, 'ap': 1805, 'pe': 5880, 'ca': 2106, 'rs': 6380, ' e': 430, 'el': 2985, 'ls': 4709, 'ew': 3324, 'wh': 7773, '. ': 1395, 'ho': 3918, 'ow': 5794, 'we': 7736, 'ev': 3306, 'do': 2505, " '": 17, "' ": 1295, 'a ': 1501, ' l': 690, 'lo': 4679, 'ot': 5699, ' o': 785, 'of': 5474, 'im': 4177, 'hi': 3896, 'as': 1867, 'lw': 4727, 'ay': 1987, 'ys': 8040, ' b': 296, 'be': 2021, 'ee': 2919, 'am': 1680, 'mo': 4891, 'g ': 3554, 'op': 5653, 'p ': 5817, 'hr': 3943, ' (': 34, '( ': 1318, ' g': 524, '.,': 1449, ' )': 42, ') ': 1327, ' u':
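To see where keys such as `'pr'` and `'ic'` come from, here is a minimal sketch of character n-gram extraction. The helper `char_ngrams` is a made-up name, and it skips the lowercasing and whitespace normalization that `CountVectorizer` applies.

```python
def char_ngrams(text, n_min, n_max):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


grams = char_ngrams("price", 2, 3)
print(grams)
```

A string of length $L$ yields $L - n + 1$ n-grams for each length $n$, so widening `ngram_range` quickly multiplies the number of features.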

## Syntactic $n$-grams

Instead of words or characters that occur in sequence, we can also use combinations of words that are syntactically linked. For that, we use a **parser**.

Here, we extract the lemma of each word, together with the lemma of its dependency head (= the word it is syntactically related to).

In [32]:
import spacy
nlp = spacy.load('en_core_web_sm')
features = [' '.join(["{}_{}".format(c.lemma_, c.head.lemma_) 
                      for c in nlp(sentence)])
            for sentence in documents[:100]]

syntax_vectorizer = CountVectorizer()
X = syntax_vectorizer.fit_transform(features)

In [33]:
print(syntax_vectorizer.vocabulary_)

{'price_change': 3093, 'change_change': 1117, 'daily_change': 1256, 'and_change': 597, 'if_want': 2156, 'you_want': 4652, 'want_find': 4356, 'to_research': 4133, 'really_research': 3211, 'research_want': 3276, 'the_price': 3863, 'price_research': 3105, 'continually_research': 1214, 'at_research': 801, 'many_site': 2482, 'different_site': 1341, 'site_at': 3507, '_find': 153, 'i_find': 2067, 'have_find': 1927, 'find_change': 1605, 'cheap_car': 1128, 'car_find': 1095, 'elsewhere_find': 1445, 'however_be': 2030, '_be': 85, 'if_have': 2150, 'you_don': 4631, 'don_t': 1398, '_don': 131, 't_have': 3680, 'have_be': 1919, 'a_lot': 356, 'lot_have': 2432, 'of_lot': 2780, 'time_of': 4047, 'research_lot': 3275, 'this_site': 4012, 'site_be': 3508, 'always_be': 544, 'be_be': 854, 'among_be': 560, 'the_three': 3906, 'top_three': 4179, 'three_among': 4025, '_e': 132, 'e_three': 1415, 'g_cheap': 1777, '_g': 159, '_cheap': 100, 'cheap_cheap': 1129, 'of_cheap': 2765, 'the_site': 3889, 'ten_site': 3717, 'si

## TF-IDF

Let's extract the most important phrases from Moby Dick, using TF-IDF weights instead of raw counts:

In [34]:
import pandas as pd
documents = [line.strip() for line in open('C:/Users/Tiziano/Desktop/BOCCONI/2nd semester/NLP/data/Moby_Dick.txt', encoding='utf8')]
print(documents[1])

Call me Ishmael .


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   min_df=0.001, 
                                   max_df=0.75, 
                                   stop_words='english', 
                                   sublinear_tf=True)

X = tfidf_vectorizer.fit_transform(documents)

Now, for comparison let's get the same information as raw counts:

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

X2 = vectorizer.fit_transform(documents)

The two data matrices should have the same shape (but different contents):

In [38]:
print(X.shape == X2.shape, X.shape)

True (9768, 1850)


Let's put the two together to see the relation between raw counts and TF-IDF. We will put them into a `DataFrame`:

In [39]:
df = pd.DataFrame(data={'word': vectorizer.get_feature_names_out(), 
                        'tf': X2.sum(axis=0).A1, 
                        'idf': tfidf_vectorizer.idf_,
                        'tfidf': X.sum(axis=0).A1
                   }).sort_values(['tfidf', 'tf', 'idf']) # sort by TFIDF, then TF, then IDF

In [26]:
df

| | word | tf | idf | tfidf |
|---|---|---|---|---|
| 1071 | nations | 10 | 7.789074 | 2.818093 |
| 1602 | surprise | 10 | 7.789074 | 2.934600 |
| 1735 | valiant | 10 | 7.789074 | 3.017954 |
| 1423 | shortly | 10 | 7.789074 | 3.032615 |
| 554 | fleet | 11 | 7.702063 | 3.049731 |
| ... | ... | ... | ... | ... |
| 922 | like | 639 | 3.808543 | 133.426528 |
| 972 | man | 525 | 3.982412 | 134.964448 |
| 231 | chapter | 171 | 5.039475 | 148.370596 |
| 1838 | ye | 467 | 4.257380 | 153.091587 |
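The `idf` column can be checked by hand. The sketch below assumes scikit-learn's default smoothed IDF, $\ln\frac{1 + n}{1 + df} + 1$, and the `sublinear_tf=True` scaling $1 + \ln(tf)$; the function names are made up, and the L2 row normalization that `TfidfVectorizer` also applies by default is left out.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0


# a term that occurs in 10 of the 9768 lines (assumed document frequency)
idf_rare = smoothed_idf(n_docs=9768, doc_freq=10)
print(round(idf_rare, 3))

# dampening: 100 occurrences weigh far less than 100x a single occurrence
print(sublinear_tf(1), sublinear_tf(100))
```

Under these assumptions, a document frequency of 10 gives roughly the 7.789 shown for the rarest words in the table, and rarer terms always receive a higher IDF than frequent ones.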


# Language Models

We can also use n-grams (more precisely, their probabilities) as the building blocks of language models. LMs let us both assess how likely a sentence is and generate new sentences.
Let's start with a simple trigram model with additive (Laplace-style) smoothing:

In [28]:
from collections import defaultdict
import numpy as np
import nltk

# define smoothing and special tokens
smoothing = 0.001
START = '_***_'
STOP = '_STOP_'

# P(w|u,v): map from (u, v) to w to allow marginalizing
counts = defaultdict(lambda: defaultdict(lambda: smoothing))

# fit data on corpus
corpus = [line.strip().split() for line in open('../data/moby_dick.txt')]

# collect counts for MLE
for sentence in corpus:
    # include special tokens for start and the end of sentence
    tokens = [START, START] + sentence + [STOP]
    # iterate over trigrams
    for u, v, w in nltk.ngrams(tokens, 3):
        counts[(u, v)][w] += 1

def logP(u, v, w):
    """
    compute the log probability of a trigram
    (u,v,w) => P(w|u,v) = c(u,v,w) / SUM(c(u,v,*))
    """
    return np.log(counts[(u, v)][w]) - np.log(sum(counts[(u, v)].values()))

def sentence_logP(S):
    """
    score a sentence in log likelihood with chain rule
    S: list(str)
    """
    tokens = [START, START] + S + [STOP]
    return sum([logP(u, v, w) for u, v, w in nltk.ngrams(tokens, 3)])

We can now score arbitrary sentences:

In [29]:
sentence_logP('Captain Ahab is a white whale .'.split())

-29.313322488501445

Smoothing allows us to score sentences that contain words we have not seen in our training data:

In [30]:
sentence_logP('Captain Ahab was twerking at the Drunken Clam .'.split())

-32.92597166866139
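The smoothing here comes entirely from the nested `defaultdict`: looking up a word that was never counted returns the smoothing value instead of zero. The toy sketch below (made-up counts, `math` instead of `numpy`) illustrates that mechanism, including the side effect that each lookup of an unseen word adds a new entry to the table.

```python
import math
from collections import defaultdict

smoothing = 0.001
toy_counts = defaultdict(lambda: defaultdict(lambda: smoothing))

# observe three continuations of the history ("the", "white")
for word in ["whale", "whale", "ship"]:
    toy_counts[("the", "white")][word] += 1


def toy_logP(u, v, w):
    """log P(w | u, v) from the toy counts; unseen w gets the smoothing mass."""
    history = toy_counts[(u, v)]
    numerator = history[w]  # inserts w with the default value if it is unseen
    return math.log(numerator) - math.log(sum(history.values()))


seen = toy_logP("the", "white", "whale")
unseen = toy_logP("the", "white", "clam")
print(seen, unseen)

# the lookup above silently added "clam" to the table
print("clam" in toy_counts[("the", "white")])
```

Unseen continuations get a small but non-zero probability, so the log never hits $-\infty$. Note that this is a shortcut: a full additive-smoothing estimate would add the smoothing value for every word in the vocabulary to the denominator.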

## Activity
Implement the perplexity measure for a given corpus, and try it with two LMs with different smoothing parameters.

$$\text{perplexity} = 2^{-\sum_{x \in X} p(x) \log_2 p(x)}$$
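Before applying the formula to a corpus, it can help to evaluate it on a tiny, explicit distribution. The sketch below is not a solution to the activity; `distribution_perplexity` is a made-up helper that only shows how the formula behaves for a list of probabilities.

```python
import math


def distribution_perplexity(probabilities):
    """2 ** entropy, with entropy measured in bits (log base 2)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


fair_coin = distribution_perplexity([0.5, 0.5])
eight_sided_die = distribution_perplexity([1 / 8] * 8)
biased_coin = distribution_perplexity([0.9, 0.1])
print(fair_coin, eight_sided_die, biased_coin)
```

For a uniform distribution, perplexity equals the number of outcomes; the more predictable the distribution, the lower the perplexity.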

In [31]:
def get_perplexity(corpus):
    """
    perplexity = 2^entropy(X)
    entropy = -sum(p(x) * log2(p(x)))
    """
    # your code here
    return 0.0

print(get_perplexity(corpus))

0.0


## Generation

We can re-use the counts to generate language:

In [32]:
def generate():
    """
    generate a new sentence
    """
    # start with special tokens
    result = [START, START]
    # sample the first word
    next_word = sample_next_word(result[-2], result[-1])
    result.append(next_word)
    # repeat until you draw a stop token
    while next_word != STOP:
        next_word = sample_next_word(result[-2], result[-1])
        result.append(next_word)
    
    return ' '.join(result[2:-1])

def sample_next_word(u, v):
    """
    sample a word w based on the history (u, v)
    """
    # split the words and their counts into separate sequences
    keys, values = zip(*counts[(u, v)].items())
    
    # normalize the counts into a probability distribution
    values = np.array(values, dtype=float)
    values /= values.sum() # create probability distro
    
    # this is the meat of the function
    sample = np.random.multinomial(1, values) # pick one position
    
    return keys[np.argmax(sample)]

We can now generate (nonsensical) sentences:

In [33]:
print(generate())

BOOK II .
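The sampling step draws one word in proportion to its count. As an alternative view of the same idea, the sketch below uses the standard library's `random.choices`, which accepts unnormalized weights directly. The helper name and the toy counts are assumptions for illustration only.

```python
import random


def sample_from_counts(word_counts, rng=random):
    """Draw one word with probability proportional to its (smoothed) count."""
    words = list(word_counts)
    weights = [word_counts[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


# toy continuation counts for a single history (illustrative values)
toy_next_words = {"whale": 2.001, "ship": 1.001, "_STOP_": 0.001}

rng = random.Random(42)  # seeded generator for repeatable draws
draws = [sample_from_counts(toy_next_words, rng) for _ in range(1000)]
print({word: draws.count(word) for word in toy_next_words})
```

Frequent continuations dominate the draws, but rare ones (including the stop token) still appear occasionally, which is what eventually ends a generated sentence.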
