<a href="https://colab.research.google.com/github/Jiabao59/Basic-Language-Model-Project-Unigram-Causal-Masked-etc-/blob/main/language_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language modeling
## 0: Install dependencies:

**You can only use the libraries imported for you in this assignment**

In [3]:
!pip install transformers==4.24.0 datasets==2.7.0 tqdm==4.64.1 sentencepiece==0.1.97 gensim==4.2.0 apache-beam==2.42.0 sentence-transformers==2.2.2 googledrivedownloader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.24.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 5.2 MB/s 
[?25hCollecting datasets==2.7.0
  Downloading datasets-2.7.0-py3-none-any.whl (451 kB)
[K     |████████████████████████████████| 451 kB 51.9 MB/s 
Collecting sentencepiece==0.1.97
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 44.4 MB/s 
[?25hCollecting gensim==4.2.0
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
[K     |████████████████████████████████| 24.1 MB 1.4 MB/s 
[?25hCollecting apache-beam==2.42.0
  Downloading apache_beam-2.42.0-cp37-cp37m-manylinux2010_x86_64.whl (11.0 MB)
[K     |████████████████████████████████| 11.0 MB 11.8 MB/s 
[?25hCollecting sentence-transformers=


## 1: Introduction
This section give you a quick introduction/refresher to language models

### What is a language model?
A language model is a statistical model that give you the probability of some given text

### What is a token?
You can't find the probability of most sequences longer than a few words directly since the $26^N$ possible sequences of length N only including the lower case letters in the English alphabet. That number can become astronomically large quickly. 

Solution: break the text up into small units (tokens)

Each token is typically a word or punctuation (but, can be other short sequences of characters)

**Question:** a) Finish the implementation exercises
#### Your first exercise is to create a tokenizer which take some text as input and outputs a list of tokens

To make things a little easier you can assume that all tokens are separated by " " or "-"

You may use the `re` module, but there are simpler solutions that do not need it

In [4]:
import re

# Basic tokenizer function
def tokenize(text: str) -> list:
  """
  Input: text, the string to be tokenized
  Output: tokens, a list of token strings

  Turns text into a list of tokens
  """
  # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
  tokens = re.split((' |-'),text)
  # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
  return tokens

example_text = "Is water a Non-Newtonian fluid ?"
tokenized_example_text = tokenize(example_text)
print(tokenized_example_text)
# Expected output: ['Is', 'water', 'a', 'Non', 'Newtonian', 'fluid', '?']

['Is', 'water', 'a', 'Non', 'Newtonian', 'fluid', '?']


One of the simplest language models is the unigram model. It stores the probability of encountering each token, ignoring surrounding tokens(it does not use conditional probability):

$P(sentence)=P(token_1)P(token_2)...P(token_N)$

In [5]:
import numpy as np

class Unigram:

  def __init__(self):
    """
    Initializes log probabilities
    """
    self.log_probabilities = {}
    self.unknown_log_probability = 0.0

  def train(self, sentences: list)->None:
    """
    Input: sentences, list of already tokenized sentences 
    Ex. [['Hello','my','name','is','HAL'],['Hi','HAL']]

    Save log probability of seeing each token using `np.log` to obtain the log probabilities

    """
    # Add a single unknown token
    sentences.append(['<unknown token>'])
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!!
    self.count_uq = {} #counting unique tokens and its appearances
    self.total_tokens = 0 #Total number of Token
    for sentence in sentences:
      for token in sentence:
        self.count_uq[token] = self.count_uq.get(token,0)+1
        self.total_tokens += 1
    for key in self.count_uq.keys():
      self.log_probabilities[key] = np.log(self.count_uq.get(key,0)
      /self.total_tokens)
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
    # Assign probability for unseen tokens
    self.unknown_log_probability = self.log_probabilities.pop('<unknown token>')

  def token_log_prob(self, token:str) -> float:
    """
    Get the log probability of a single token with self.unknown_log_probability use if a token was not found during training
    """
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!!
    if token in self.log_probabilities.keys():
      return self.log_probabilities[token]
    else:
      return self.unknown_log_probability
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 

  def sentence_log_prob(self, sentence:list) -> float:
    """
    Get the log probability of an already tokenized sentence
    """
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!!
    log_token_prev = 1
    for token in sentence:
      if token in self.log_probabilities.keys():
        log_token_post = self.log_probabilities[token]*log_token_prev
        log_token_prev = log_token_post
      else:
        log_token_post = self.unknown_log_probability*log_token_prev
        log_token_prev = log_token_post
    return log_token_post
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
model = Unigram()
model.train([['Hello','my','name','is','HAL'],['Hi','HAL']])
print('"Hello" log prob:',model.token_log_prob('Hello'))
print('"Hi my name is HAL" log prob:',model.sentence_log_prob(tokenize("Hi my name is HAL")))

"Hello" log prob: -2.0794415416798357
"Hi my name is HAL" log prob: -25.920437036656878


In [38]:
model = Unigram()
model.train([['Hello','my','name','is','HAL'],['Hi','HAL']])
print('"Hi my name is nn" log prob:',model.sentence_log_prob(tokenize("Hi my name is nn")))

"Hi my name is nn" log prob: -38.88065555498532


In [36]:
import numpy as np
count_nn = {}
phrases = [['Hello','my','name','is','HAL'],['Hi','HAL']]
for phrase in phrases:
  for token in phrase:
    count_nn[token] = count_nn.get(token,0)+1
#count_nn.keys('Hello')
#count_nn[]
if 'a' in count_nn.keys():
  print(0)
else:
  print(count_nn['Hello'])

1


We can use the Unigram model to classify text (but, may not have the highest accuracy)

In [6]:
from datasets import load_dataset

dataset = load_dataset("emotion",revision="b7dfe4482299c487641788dd6d81797842665744")
df_train = dataset['train'].to_pandas()
df_test = dataset['test'].to_pandas()
label_key = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Init models
total_count = len(df_train)
label_counts = df_train['label'].value_counts().sort_index()
models = [{
     'index': i,
     'label': label,
     'log_prior': np.log(label_counts.iloc[i]/total_count),
     'unigram_model': Unigram(),
} for i, label in enumerate(label_key)]

# Train models
for model in models:
  df_train_matching_label = df_train[df_train['label']==model['index']]
  tokenized_sentences = df_train_matching_label['text'].apply(tokenize).tolist()
  model['unigram_model'].train(tokenized_sentences)

Downloading builder script:   0%|          | 0.00/3.62k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.28k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.69k [00:00<?, ?B/s]

Downloading and preparing dataset emotion/default to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705...


Downloading data:   0%|          | 0.00/1.66M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/204k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/207k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [30]:
models[0].get('log_prior')

-1.232286548640543

In [7]:
# Predict classes
def predict(sentence:str)->int:
  tokenized_sentence = tokenize(sentence)
  highest_log_prob = float('-inf')
  highest_log_prob_index = 0
  for model in models:
    # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!! 
    # Compute log prob of the sentence using the ungram model + the log prior of the label
    log_prob = model['unigram_model'].sentence_log_prob(tokenized_sentence) 
    + model.get('log_prior')
    # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
    if log_prob > highest_log_prob:
      highest_log_prob = log_prob
      highest_log_prob_index = model['index']
  return highest_log_prob_index

df_test['predicted_label'] = df_test['text'].apply(predict)

tp_count = sum(df_test['predicted_label']==df_test['label'])
accuracy = tp_count/len(df_test)
print(f'Accuracy: {accuracy*100}%')

Accuracy: 26.6%



## 2: Types of Language Models
This sections explains different types of language models. We will go over 3 of the most used language model types:
1. Causal
2. Masked
3. Sequence to sequence



### 2.1: Causal language model

A causal language model provides the probability of a token given the tokens before it

$P(token_T|token_1,token_2,...,token_{T-1})$

It is useful for a variety of NLP tasking including sequence generation and sequence classification

Example:
Hello, my name is ...

Output:
Hello, my name is HAL

In [8]:
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, set_seed
from tqdm import tqdm

import torch
import torch.nn as nn

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

Downloading:   0%|          | 0.00/816k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/458k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/656 [00:00<?, ?B/s]

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


Downloading:   0%|          | 0.00/479M [00:00<?, ?B/s]

Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Here is an example

In [9]:
def generate_gpt_text_greedy(input_text,sequence_max_length = 25):
  """
  This uses greedy decoding which is not optimal for most tasks,
  but requires little system resources and is simple to implement
  
  Recommended for deterministic tasks: Beam search
  Recommended for creative tasks: Nucleus sampling
  Recommended for low latency/real time tasks: Greedy decoding or Nucleus sampling

  Note: Optimal decoding/generation algorithm depends on the task
  """
  generated_text = input_text

  # Generate a sequence
  for i in tqdm(range(sequence_max_length)):
    with torch.no_grad(): # Better preformance
      inputs = tokenizer(generated_text, return_tensors="pt")
      outputs = model(**inputs)
      next_token_logits = outputs.logits[0, -1, :]
      next_token_index = torch.argmax(next_token_logits)
      generated_text = tokenizer.decode(
          torch.cat((inputs['input_ids'][0],torch.tensor([next_token_index]))) # generated_text = generated_text + new_token
      )
  
  return generated_text

input_text = "Hello, my name is John Smith. I am the"

print(f'Generated: {generate_gpt_text_greedy(input_text)}')

100%|██████████| 25/25 [00:04<00:00,  6.12it/s]

Generated: hello, my name is john smith. i am the head of the department of defense. "





In [10]:
alt_text = "John said it again three times: \"No No No"
print(f'Generated: {generate_gpt_text_greedy(alt_text)}')

100%|██████████| 25/25 [00:04<00:00,  5.46it/s]

Generated: john said it again three times : " no no no no no no no no no no no no no no no no no no no no no no no no no no no no





**Question:** b) 

i) What do you notice about the generated text?



ii) How can this be avoided?


**Question:** c) (*CMPUT 566 Students Only*)
Implement the Nucleas Sampling method described in Section 3.1 of https://arxiv.org/pdf/1904.09751.pdf. You can use any `torch` or `torch.nn` (`nn`) functions

Hint: Use `torch.multinomial` for sampling



In [23]:
softmax = nn.Softmax(dim=0)

def generate_gpt_text_nucleus_sampling(input_text, sequence_max_length = 25, p=0.9):
  """
  This uses greedy decoding which is not optimal for most tasks,
  but requires little system resources and is simple to implement
  
  Recommended for deterministic tasks: Beam search
  Recommended for creative tasks: Nucleus sampling
  Recommended for low latency/real time tasks: Greedy decoding or Nucleus sampling

  Note: Optimal decoding/generation algorithm depends on the task
  """
  generated_text = input_text

  # Generate a sequence
  for i in tqdm(range(sequence_max_length)):
    with torch.no_grad(): # Better preformance
      inputs = tokenizer(generated_text, return_tensors="pt")
      outputs = model(**inputs)
      # !!!!!!!!!!!!! Your code starts here !!!!!!!!!!!!!
      next_token_logits = outputs.logits[0, -1, :]
      sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
      cumulative_probs = torch.cumsum(softmax(sorted_logits), dim=-1)
      sorted_indices_to_remove = cumulative_probs > p
      sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
      sorted_indices_to_remove[..., 0] = 0
      indices_to_remove = sorted_indices[sorted_indices_to_remove]
      next_token_logits[indices_to_remove] = -float('Inf')
      probabilities = softmax(next_token_logits)
      next_token_index = torch.multinomial(probabilities, 1)
      # !!!!!!!!!!!!! Your code ends here !!!!!!!!!!!!! 
      generated_text = tokenizer.decode(
          torch.cat((inputs['input_ids'][0],torch.tensor([next_token_index]))) # generated_text = generated_text + new_token
      )
  
  return generated_text


torch.manual_seed(314159)
input_text = "Hello, my name is John Smith. I am the"

print(f'Generated: {generate_gpt_text_nucleus_sampling(input_text)}')

100%|██████████| 25/25 [00:04<00:00,  5.25it/s]

Generated: hello, my name is john smith. i am the captain of this ship. would you like to see the captain? "





In [None]:
inputs = tokenizer("Hello, my name is John Smith. I am the", return_tensors="pt")
torch.cat((inputs['input_ids'][0],torch.tensor([1000])))

In [None]:
inputs['input_ids']

### 2.2: Masked language models
A masked language model provides the probability of a token given the tokens before it and after it (fill in the blanks)

$P(token_T|token_1,...,token_{T-1},token_{T+1},...,token_{N})$

It is useful for a variety of NLP tasking including sequence classification and grammar correction

Example: Hello, my name is ...

Output: Hello, my name is HAL

In [44]:
from transformers import BertTokenizer, BertForMaskedLM

import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [50]:
example_text = "The capital of Alberta is [MASK]."

def predict_mask(input_text):

  with torch.no_grad():
    inputs = tokenizer(input_text, return_tensors="pt")
    logits = model(**inputs).logits
    mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
  return tokenizer.decode(predicted_token_id)

print('Input: ',example_text)
print('Mask prediction: ',predict_mask(example_text))

Input:  The capital of Alberta is [MASK].
Mask prediction:  edmonton


**Question:** d) Use the `predict_mask` function and the `[MASK]` token to exract a fact from the language model(similar to the example above). Include your input and the model's prediction in your pdf report

In [58]:
your_prompt = "The capital of Ontario is [MASK]."

print('Input: ',your_prompt)
print('Mask prediction: ',predict_mask(your_prompt))

Input:  The capital of Ontario is [MASK].
Mask prediction:  toronto


### 2.3: Sequence to sequence models
A sequence to sequence models provides the probability of a token given the tokens before it and all tokens in another related sequence

It is useful for a variety of NLP tasking including translation and summarization (primarily used for text generation)

Example: Bonjour, je m'appelle HAL (French)

Output: Hello, my name is HAL (English)

In [59]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
from tqdm import tqdm

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")


Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Downloading:   0%|          | 0.00/242M [00:00<?, ?B/s]

In [60]:
def t5_summarize(text, max_length=20):
  # inference
  input_ids = tokenizer(
      f"summarize: {text}", return_tensors="pt"
  ).input_ids
  outputs = model.generate(
      input_ids,
      max_length=max_length,
  )
  return tokenizer.decode(outputs[0], skip_special_tokens=True)
# The Road Not Taken
# Poem by Robert Frost (1916)
# (Public Domain)
poem = """Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference."""

print('Summary: ',t5_summarize(poem))

Summary:  two roads diverged in a yellow wood, and I took the one less traveled by


The metaphors are lost on the model, but it still does a fairly good job summarizing the literal meaning of the poem

**Question:** e)

i) Find a short piece of text (article, poem, section of a paper) and get the model to summarize it. Include the summary in your report

In [62]:
short_text = """

Over hill, over dale,
Thorough bush, thorough brier,
Over park, over pale,
Thorough flood, thorough fire!
I do wander everywhere,
Swifter than the moon's sphere;
And I serve the Fairy Queen,
To dew her orbs upon the green;
The cowslips tall her pensioners be;
In their gold coats spots you see;
Those be rubies, fairy favours;
In those freckles live their savours;
I must go seek some dewdrops here,
And hang a pearl in every cowslip's ear.

"""

print('Summary: ',t5_summarize(short_text,max_length=20)) # You can change `max_length` if summary seems truncated

Summary:  a cowslip is taller than the moon's sphere; a cow


ii) Is the summary accurate? If yes, explain why the summary is accurate? If not, explain how the summary could be improved