<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/ex9_intro_to_hlt_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this exercise, you'll try to generate text with an n-gram model.

To keep things at least somewhat efficient, and to be able to generate with up to 4-grams within Google Colab memory, let us divide the task as follows:

1. Generate n-grams from a corpus of text, e.g. the IMDB dataset
2. Count the n-grams, i.e. get a Counter with all unique n-grams and their counts in the text
3. From the Counter, we then build a dictionary where, for each (n-1)-gram, we have a list of all words which can continue it. For example, for the 4-gram "I have a dog", this dictionary will have an entry where the key is *(I,have,a)* and the value is a list which contains *dog* together with all other words *X* that were seen in a 4-gram *I have a X*. In other words, the dictionary encodes all the ways of continuing an (n-1)-gram that we saw in the data.
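
To make the two data structures concrete, here is a minimal sketch on a made-up three-sentence corpus with n=3. The names `toy_corpus`, `toy_ngrams`, `toy_counts` and `toy_cont` are hypothetical and exist only for this illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter, defaultdict

n = 3
toy_corpus = ["I have a dog", "I have a cat", "I have a dog"]  # made-up data

def toy_ngrams(texts, n):
    """Yield n-grams as tuples from whitespace-tokenized texts (no padding here)."""
    for text in texts:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

# Step 2: a Counter consumes the generator directly
toy_counts = Counter(toy_ngrams(toy_corpus, n))

# Step 3: map every (n-1)-gram to the words that were seen after it
toy_cont = defaultdict(list)
for ngram in toy_counts:
    toy_cont[ngram[:-1]].append(ngram[-1])

print(toy_counts[("have", "a", "dog")])  # 2
print(toy_cont[("have", "a")])           # ['dog', 'cat']
```

Because the Counter keys are unique n-grams, every continuation word ends up in its list exactly once.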

With these data structures, the generation can proceed quite easily. Say, we have a 4-gram model.

* Given a prior context $w_1w_2w_3$
* Look up the list of possible words $w_4$ in the dictionary from step 3, then for each word, the count of $w_1w_2w_3w_4$ can be looked up in the Counter from step 2
* The counts, once normalized to sum up to 1, form a distribution over words that can continue $w_1w_2w_3$ and we can sample the next word from this distribution.
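
The three steps above can be sketched on hand-written counts as follows. The counts and the names `demo_counts`, `demo_cont` and `next_word_distribution` are assumptions made for illustration; `random.choices` from the standard library does the sampling here, without any temperature.

```python
import random

# Hand-written 4-gram counts (made-up numbers)
demo_counts = {
    ("I", "have", "a", "dog"): 6,
    ("I", "have", "a", "cat"): 3,
    ("I", "have", "a", "car"): 1,
}
# Known continuations of the 3-gram context
demo_cont = {("I", "have", "a"): ["dog", "cat", "car"]}

def next_word_distribution(context):
    """Return the candidate words and their normalized probabilities."""
    words = demo_cont[context]
    counts = [demo_counts[context + (w,)] for w in words]
    total = sum(counts)
    return words, [c / total for c in counts]

words, probs = next_word_distribution(("I", "have", "a"))
print(list(zip(words, probs)))  # [('dog', 0.6), ('cat', 0.3), ('car', 0.1)]

# Draw one continuation according to the distribution
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)
```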

Other remarks:

* We want to pad all texts with `<bos>` (beginning of sequence) and `<eos>` (end of sequence). We want `<bos>` there n-1 times, so that we can use it as the initial prompt and let the model learn how sequences start. The `<eos>` allows us to stop generating, and prevents a crash on unknown n-grams at the very end of a sequence: if an n-gram $w_1w_2w_3w_4$ was seen only once, at the end of a "training" sequence, then an attempt to continue it during generation would crash, since our simple, unsmoothed model knows no n-gram that continues $w_2w_3w_4$ :)
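
As a small illustration of the padding, the sketch below pads a made-up token list for n=3 and lists the resulting trigrams. `pad` and `windows` are hypothetical helper names, and the windows are built with plain slicing.

```python
def pad(tokens, n):
    """Add n-1 <bos> symbols in front and one <eos> at the end."""
    return ["<bos>"] * (n - 1) + tokens + ["<eos>"]

def windows(tokens, n):
    """Return all length-n windows over the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

padded = pad(["I", "have", "a", "dog"], 3)
for ngram in windows(padded, 3):
    print(ngram)
# ('<bos>', '<bos>', 'I')
# ('<bos>', 'I', 'have')
# ('I', 'have', 'a')
# ('have', 'a', 'dog')
# ('a', 'dog', '<eos>')
```

The context of the first trigram, `('<bos>', '<bos>')`, is exactly the initial prompt, and the last trigram guarantees that the context `('a', 'dog')` has at least one known continuation, namely `<eos>`.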


# Task A: Generate n-grams

* Write a generator function `generate_ngrams(dset,n)` (using `yield` rather than `return`) which yields n-grams as tuples $(w_1,...,w_n)$ from all sections of the IMDB dataset
* a vectorizer from `sklearn` can be used as a trivial tokenizer
* `more-itertools` is a nifty library to achieve the n-gram generation
* remember to pad with n-1 `<bos>` symbols at the beginning, and one `<eos>` symbol at the end

You can give this a shot, or simply use the code for task A below to move on to the juicier parts of the exercise. So, warning, spoiler below.

In [None]:
!pip3 install datasets more-itertools

In [None]:
import datasets
import sklearn.feature_extraction

In [None]:
dset=datasets.load_dataset("imdb")




In [None]:
# A few remarks here:
# 1. we don't need the vectorizer per se, we just want its analyzer function, which basically tokenizes the text for us, and somewhat unfortunately drops punctuation
# 2. the default token pattern in sklearn drops 1-letter words (like "I" and "a") so I modify it a bit
cvectorizer=sklearn.feature_extraction.text.CountVectorizer(lowercase=False,stop_words=None,token_pattern=r"(?u)\b\w+\b" )
analyzer=cvectorizer.build_analyzer()
analyzer("I have a dog at home, it likes to shred newspapers.")

['I',
 'have',
 'a',
 'dog',
 'at',
 'home',
 'it',
 'likes',
 'to',
 'shred',
 'newspapers']

In [None]:
# Now we tokenize the IMDB dataset the usual way
def tokenize(ex):
    return {"tokenized":analyzer(ex["text"])}

dset=dset.map(tokenize,num_proc=4)



In [None]:
from collections import Counter
from more_itertools import sliding_window #more-itertools is an awesome library!
import tqdm

def generate_ngrams(dset,n):
    for ex in tqdm.tqdm(dset):
        tokens=["<bos>"]*(n-1)+ex["tokenized"]+["<eos>"]
        for ngram in sliding_window(tokens,n):
            yield ngram



# Task B

* Now we can combine the different sections of the IMDB dataset and count our n-grams
* That is relatively easy, since we wrote `generate_ngrams()` as a generator and `collections.Counter` can count its output directly

In [None]:
# Here we can concatenate all the individual datasets (train,test,unlabeled) in IMDB
# the "master" dataset is a dictionary of these, so dset.values() has the datasets of the individual sections (train,test,unlabeled)
#
# Please make sure you visit the datasets documentation page and understand what this does!
combined_dataset=datasets.concatenate_datasets(list(dset.values()))


In [None]:
ngrams=### THIS YOU NEED TO FILL IN :)

100%|██████████| 100000/100000 [01:14<00:00, 1339.74it/s]


# Task C

* Now we have ngrams, which is a simple counter
* So we can iterate over its keys, and build the dictionary of possible continuations

In [None]:
cont_dict={} #key is n-1 gram, as a tuple, like ("I","have","a"); value is a list of possible seen continuations, like ["dog","car","cat"] etc 
#### THIS YOU SHOULD FILL YOURSELF, I.E. FILL THE cont_dict BASED ON ngrams
    

# Task D

* Generate new text, starting from `<bos> <bos> ...` (n-1 times) and ending after, say, 40 words, or when `<eos>` is generated
* I will give you a support function `sample_from`, which receives a list of counts and a temperature parameter, samples according to the resulting distribution, and returns the index of the sampled item in the list
* The temperature sampling is described here: https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277
* By all means, if you want to try, do try writing this function yourself!
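
To get a feeling for what the temperature does before sampling anything, the sketch below prints the distribution that a temperature-scaled softmax produces from the counts `[1,1,1,17]`. It assumes that the log-counts are used as the logits, so that temperature 1.0 reproduces the normalized counts; that is one common choice, and `tempered` is a hypothetical name used only here.

```python
import numpy

def tempered(counts, temperature):
    """Softmax over log-counts divided by the temperature."""
    logits = numpy.log(numpy.array(counts, dtype=float)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = numpy.exp(logits)
    return probs / probs.sum()

for temp in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(temp, numpy.round(tempered([1, 1, 1, 17], temp), 3))
```

Low temperatures push almost all of the probability mass onto the most frequent item, while high temperatures flatten the distribution towards uniform.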


In [None]:
import numpy

def softmax(x):
    #Subtract the max before exponentiating, this avoids overflow and does not change the result
    e=numpy.exp(x-numpy.max(x))
    return e/e.sum()

def sample_from(counts,temperature=1.0):
    """
    counts: list of counts that form the distribution
    temperature: the "how wild the generation should be" parameter, numbers close
                 to 0 are very conservative, 1.0 samples from the normalized counts
                 as they are, and numbers above 1 lead to quite wild generations
    """
    counts_array=numpy.array(counts,dtype=float)
    #Use the log-counts as logits, so that temperature 1.0 gives back exactly the normalized counts
    logits=numpy.log(counts_array)
    #Divide by temperature, that is what the algorithm does
    logits/=temperature
    #Renormalize into a distribution using the softmax function, that is what the algorithm does
    final_distribution=softmax(logits)
    #A good way to sample from a distribution is the following function from numpy
    x=numpy.random.multinomial(n=1,pvals=final_distribution)
    #x is a one-hot vector, the position of the 1 is the index of the sampled item
    return int(numpy.argmax(x))

sample_from([1,1,1,17],temperature=0.5) #Try running this several times each, with temps 0.1, 0.5, 1.0 ... see how temp 0.1 sticks to picking the max value, but higher temps don't?

3

# Task E: piece it all together

* Again, I will give you the skeleton

In [None]:
from pprint import pprint

def generate(ngrams,cont_dict,n,max_len=40,temperature=1.0,prompt=None):
    """
    ngrams: the master counter
    cont_dict: the n-1 gram continuation dict
    n: the n in n-gram
    max_len: how many words max?
    temperature: the generation temperature
    prompt: the initial prompt, as a tuple, if not given n-1 <bos> symbols will be used
    """

    if prompt is None:
        prompt=["<bos>"]*(n-1)

    generated=list(prompt) #this list will grow with words, let's initialize it with the prompt
    for _ in range(max_len):
        ###### HERE GENERATE THE NEXT WORD AND APPEND IT TO THE END OF `generated`
        ###### 1. build the list of possible continuation words and their counts
        ###### 2. sample a word from these counts
        ###### 3. ...and append it to `generated`
        if generated[-1]=="<eos>": #stop on end of sequence
            break
    return generated

# Now we can test it!
n=4
for temp in (0.1,0.5,1.0,2.0,5.0):
    generated=generate(ngrams=ngrams,cont_dict=cont_dict,n=n,max_len=60,temperature=temp)
    print(f"Temp={temp}:")
    pprint(" ".join(generated))
    print("-----------")



Temp=0.1:
('<bos> <bos> <bos> Difficult to tell actually who is copying whom since '
 'apparently Deep Core is dated 2003 br br As Cosette and Valjean are riding '
 'through the rain that fell just in front of the camera talking about '
 'relationships in different families where it was supposed to be 2179 or '
 'something but he never acted crazy or hot headed just the opposite')
-----------
Temp=0.5:
('<bos> <bos> <bos> Camp Blood looked great when I got to buy this junk If you '
 're researching the zombie genre br br Now over 20 years nor had she told him '
 'about her weird father who belongs to well reputed family of Mogambo Mr '
 'India fame Crime Master Gogo Paresh Rawal s role was pretty tame stuff and '
 'is hit in the')
-----------
Temp=1.0:
('<bos> <bos> <bos> Aw Jeez everyone here in the IMDb library And Bill Julia '
 'and all other rentals are out or you re inviting a couple of dozen monks '
 'live there There are times in everyones lives when people judge them for '
 'th

# Done!

OK, the generations are quite funny. Clearly, this is no ChatGPT, but it is also not entirely bad for a model that is basically two dictionaries...