## Discrete Sparse Representations

## Exercise 1

Read in the review corpus as simple counts.
Use vector operations to find out:
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [21]:
import pandas as pd
import numpy as np
from collections import Counter


df = pd.read_csv('C:/Users/Tiziano/Desktop/BOCCONI/2nd semester/NLP/data/reviews.full.tsv', sep='\t', nrows=50000)
documents = df.text.tolist()

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word',
                             ngram_range=(1, 2),
                             min_df=0.001,
                             max_df=0.75,
                             stop_words='english')

X = vectorizer.fit_transform(documents)

totals = X.sum(axis=0).A1

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
counts = dict(zip(vectorizer.get_feature_names_out(), totals))
c = Counter()
c.update(counts)
c.most_common(5)

[('service', 13137),
 ('00', 11926),
 ('time', 10679),
 ('great', 10517),
 ('order', 10353)]


In [37]:
len(counts)

3639
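
The cell above covers only the first question. For the other two, a minimal sketch on a toy corpus follows; `toy_documents`, `toy_vectorizer`, and `toy_X` are hypothetical stand-ins for the review data, and the same operations should carry over to the real `X`. It relies on `vocabulary_`, the attribute of a fitted vectorizer that maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the review corpus (hypothetical data, for illustration only)
toy_documents = [
    "fast delivery and great service",
    "delivery was late, delivery guy was rude",
    "great food, great prices",
    "service was slow",
]

toy_vectorizer = CountVectorizer(analyzer='word')
toy_X = toy_vectorizer.fit_transform(toy_documents)

# column of the document-term matrix that holds the counts for 'delivery'
column = toy_vectorizer.vocabulary_['delivery']

# number of documents in which the word occurs at least once
doc_frequency = int((toy_X[:, column] > 0).sum())

# share of the corpus (in percent) that those documents make up
percentage = 100 * doc_frequency / toy_X.shape[0]

print(doc_frequency, percentage)
```

Note that the document frequency counts each document once, so it differs from the column sum whenever a word occurs several times in the same document.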

## Exercise 2
Extract **only** the bigrams (no unigrams) from Moby Dick and find the top 10.

In [55]:
documents = [line.strip() for line in open('C:/Users/Tiziano/Desktop/BOCCONI/2nd semester/NLP/data/Moby_Dick.txt', encoding='utf8')]

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,2),
                                   min_df=0.001, 
                                   max_df=0.75, 
                                   stop_words='english', 
                                   sublinear_tf=True)

Xsm = tfidf_vectorizer.fit_transform(documents)

totals = Xsm.sum(axis=0).A1

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
counts = dict(zip(tfidf_vectorizer.get_feature_names_out(), totals))
c = Counter()
c.update(counts)
c.most_common(10)

[('sperm whale', 143.4257711546659),
 ('white whale', 89.70019192803562),
 ('old man', 73.30125164382476),
 ('moby dick', 68.84065167956459),
 ('captain ahab', 53.91182956167341),
 ('right whale', 46.43710024527232),
 ('mast head', 41.407494642618744),
 ('mast heads', 32.45161742235697),
 ('cried ahab', 31.403015769389288),
 ('whale ship', 29.26108151507456)]

## Exercise 3
Extract **only** the bigrams (no unigrams) from the Tweets and find the top 10.

In [58]:
documents = [line.strip() for line in open('C:/Users/Tiziano/Desktop/BOCCONI/2nd semester/NLP/data/tweets_en.txt', encoding='utf8')]

In [59]:
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,2),
                                   min_df=0.001, 
                                   max_df=0.75, 
                                   stop_words='english', 
                                   sublinear_tf=True)

Y = tfidf_vectorizer.fit_transform(documents)

totals = Y.sum(axis=0).A1

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
counts = dict(zip(tfidf_vectorizer.get_feature_names_out(), totals))
c = Counter()
c.update(counts)
c.most_common(10)

[('don know', 441.33674617344366),
 ('ve got', 355.1172757008388),
 ('happy birthday', 313.41269269432405),
 ('feel like', 280.35740534535717),
 ('looks like', 266.8842499656539),
 ('looking forward', 263.82871529735576),
 ('don think', 249.34119916190286),
 ('good luck', 246.29561092990502),
 ('don want', 202.45801508144262),
 ('just got', 195.45030945198397)]

## Exercise 4

Let's modify the `generate()` function of our Language Model to take any number of initial words.
First, read in the corpus and collect the counts:

In [None]:
from collections import defaultdict
import numpy as np
import nltk

smoothing = 0.001
START = '_***_'
STOP = '_STOP_'

# map from a history (u, v) to the counts of each next word w, used to estimate P(w | u, v)
counts = defaultdict(lambda: defaultdict(lambda: smoothing))

# fit data on corpus
corpus = [line.strip().split() for line in open('../data/moby_dick.txt')]

# collect counts for MLE
for sentence in corpus:
    # include special tokens for start and the end of sentence
    tokens = [START, START] + sentence + [STOP]
    for u, v, w in nltk.ngrams(tokens, 3):
        counts[(u, v)][w] += 1


Remember the `generate` and `sample_next_word` functions:

In [None]:
def generate():
    # *****
    # change code accordingly
    # *****
    result = [START, START]
    next_word = sample_next_word(result[-2], result[-1])
    result.append(next_word)
    while next_word != STOP:
        next_word = sample_next_word(result[-2], result[-1])
        result.append(next_word)
    
    return ' '.join(result[2:-1])

def sample_next_word(u, v):
    """
    sample a word w based on the history (u, v)
    """
    # split the words and their counts into two separate sequences
    keys, values = zip(*counts[(u, v)].items())
    
    # normalize the counts into a probability distribution
    values = np.array(values)
    values /= values.sum() # create probability distro
    
    # draw one sample from the distribution: a one-hot vector marking the chosen word
    sample = np.random.multinomial(1, values) # pick one position
    
    return keys[np.argmax(sample)]

In [None]:
print(generate('Hello'))
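
One possible way to approach this exercise is sketched below; it is not the only solution. To keep the example self-contained it runs on a hypothetical `toy_corpus`, builds the trigrams with `zip` instead of `nltk.ngrams`, and samples with `random.choices` instead of `np.random.multinomial`. Ending the sentence when the history has never been seen (as with `'Hello'` here) is an assumption; backing off to a shorter history would be another option.

```python
import random
from collections import defaultdict

smoothing = 0.001
START = '_***_'
STOP = '_STOP_'

# toy corpus standing in for Moby Dick (hypothetical data)
toy_corpus = [
    "call me ishmael".split(),
    "call me ahab".split(),
    "the whale is white".split(),
]

# map from a history (u, v) to the counts of each next word w
counts = defaultdict(lambda: defaultdict(lambda: smoothing))
for sentence in toy_corpus:
    tokens = [START, START] + sentence + [STOP]
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        counts[(u, v)][w] += 1


def sample_next_word(u, v):
    """Sample w given the history (u, v); return STOP for an unseen history."""
    if (u, v) not in counts:
        return STOP
    keys, values = zip(*counts[(u, v)].items())
    # random.choices normalizes the weights itself
    return random.choices(keys, weights=values)[0]


def generate(*initial_words):
    """Generate a sentence that begins with the given words (possibly none)."""
    result = [START, START] + list(initial_words)
    next_word = sample_next_word(result[-2], result[-1])
    while next_word != STOP:
        result.append(next_word)
        next_word = sample_next_word(result[-2], result[-1])
    # drop the two START tokens; STOP is never appended
    return ' '.join(result[2:])


print(generate())
print(generate('call', 'me'))
print(generate('Hello'))
```

Because the two `START` tokens stay at the front of `result`, the last two entries always form a valid history, whether zero, one, or many initial words are passed.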

## Exercise 5

Extend the LM code above to *arbitrary* $n$-gram sizes. Use another corpus to try it with $n=4$.

It might be helpful to use a `class` for the LM, make the smoothing a parameter, `counts` a class property, and add a function `fit()`.

In [None]:
# Your code here
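
As a starting point, here is a minimal sketch of such a class, under a few assumptions: the name `NGramLM`, its method signatures, and the toy corpus are all hypothetical, `random.choices` is swapped in for `np.random.multinomial`, and an unseen history simply ends the sentence. The history is a tuple of the previous $n-1$ tokens, so the sketch requires $n \geq 2$.

```python
import random
from collections import defaultdict


class NGramLM:
    START = '_***_'
    STOP = '_STOP_'

    def __init__(self, n=3, smoothing=0.001):
        if n < 2:
            raise ValueError("this sketch needs n >= 2")
        self.n = n
        self.smoothing = smoothing
        # map from a history of n-1 tokens to the counts of each next word
        self.counts = defaultdict(lambda: defaultdict(lambda: self.smoothing))

    def fit(self, corpus):
        """Collect n-gram counts from an iterable of tokenized sentences."""
        pad = [self.START] * (self.n - 1)
        for sentence in corpus:
            tokens = pad + list(sentence) + [self.STOP]
            for i in range(len(tokens) - self.n + 1):
                history = tuple(tokens[i:i + self.n - 1])
                self.counts[history][tokens[i + self.n - 1]] += 1
        return self

    def sample_next_word(self, history):
        """Sample the next word; return STOP if the history was never seen."""
        if history not in self.counts:
            return self.STOP
        keys, values = zip(*self.counts[history].items())
        return random.choices(keys, weights=values)[0]

    def generate(self, *initial_words):
        """Generate a sentence that begins with the given words (possibly none)."""
        k = self.n - 1
        result = [self.START] * k + list(initial_words)
        next_word = self.sample_next_word(tuple(result[-k:]))
        while next_word != self.STOP:
            result.append(next_word)
            next_word = self.sample_next_word(tuple(result[-k:]))
        return ' '.join(result[k:])


# toy corpus (hypothetical data); replace with a real tokenized corpus
toy_corpus = [
    "call me ishmael".split(),
    "call me ahab".split(),
    "the whale is white".split(),
]

lm = NGramLM(n=4).fit(toy_corpus)
print(lm.generate())
print(lm.generate('the', 'whale'))
```

With a corpus this small and $n=4$, the model can only reproduce its training sentences; a larger corpus is needed before the generated text starts to vary.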
