# CS-396 Perpetual Work-in-Progress Status Report 2
## Author: Joseph Jinn

<br>

### Note: I need more coffee...

Notes:

- https://github.com/dunovank/jupyter-themes
 - (Jupyter Notebook Themes)

- https://towardsdatascience.com/bringing-the-best-out-of-jupyter-notebooks-for-data-science-f0871519ca29
 - (useful additions for Jupyter Notebook)

- https://medium.com/@rbmsingh/making-jupyter-dark-mode-great-5adaedd814db
 - (Jupyter dark-mode settings - my eyes are no longer bleeding...)

- https://github.com/ipython-contrib/jupyter_contrib_nbextensions
 - (Jupyter extensions)

- https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
 - (PyTorch tutorial on character-level RNN)
 
<br>

Enter this in Terminal (for use with jupyter-themes):

jt -t monokai -f fira -fs 13 -nf ptsans -nfs 11 -N -kl -cursw 5 -cursc r -cellw 95% -T

<br>

Important files to reference:

- modeling_gpt2.py
 - The GPT2 model source code.
 
- tokenization_gpt2.py
 - The tokenizer class for the GPT2 model.
 
 <br>
 
Reference Material to understand the Theoretical Foundation of GPT2:

https://en.wikipedia.org/wiki/Language_model

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

It would also help to have a basic understanding of beam search. I'm not super-happy with what my Googling turned up, but these are a start:

https://en.wikipedia.org/wiki/Beam_search

https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/
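
To make the idea concrete, here is a minimal beam search sketch in plain Python. The `TOY_MODEL` table and the helper names are hypothetical stand-ins for a real language model such as GPT2; only the search logic is the point. At each step, every surviving sequence is extended by every possible next token, and only the `beam_width` best-scoring candidates are kept.

```python
import math

# Hypothetical toy "language model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.4, "dog": 0.5, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def next_token_log_probs(sequence):
    """Return {token: log-probability} for the token that follows `sequence`."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[sequence[-1]].items()}


def beam_search(beam_width, max_steps):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # Finished sequence: carry over unchanged.
                continue
            for tok, logp in next_token_log_probs(seq).items():
                candidates.append((score + logp, seq + [tok]))
        # Sort by score (best first) and prune to the beam width.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


best = beam_search(beam_width=2, max_steps=3)
for score, seq in best:
    print(f"{math.exp(score):.2f} {' '.join(seq)}")
```

Scores are summed as log-probabilities rather than multiplied as probabilities, which avoids numerical underflow on long sequences.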
 
 <br>
 
Also maybe helpful but don’t get distracted:

Watch only the first 20 minutes or so of this video (everything after that covers training details; skip it):

https://www.youtube.com/watch?v=Keqep_PKrY8

https://medium.com/syncedreview/language-model-a-survey-of-the-state-of-the-art-technology-64d1a2e5a466

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

http://colah.github.io/posts/2015-08-Understanding-LSTMs/


More Notes:

- CTRL + M, then L (CTRL + M enters command mode): Toggles code cell line numbers (very useful for debugging)

## Summary of Current Progress:

Placeholder text.


##### Import required packages and libraries.

In [1]:
from tqdm import trange # Instantly make your loops show a smart progress meter

import torch # PyTorch.
import torch.nn.functional as F
import numpy as np # Numpy.

###############################################

# Hugging-face Transformers.
from transformers import GPT2Config, OpenAIGPTConfig, XLNetConfig, TransfoXLConfig, XLMConfig, CTRLConfig

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
from transformers import XLNetLMHeadModel, XLNetTokenizer
from transformers import TransfoXLLMHeadModel, TransfoXLTokenizer
from transformers import CTRLLMHeadModel, CTRLTokenizer
from transformers import XLMWithLMHeadModel, XLMTokenizer

##### Load the GPT2-model.

In [2]:
model_class = GPT2LMHeadModel # Specifies the model to use.
tokenizer_class = GPT2Tokenizer # Specifies the tokenizer to use for the model.
tokenizer = tokenizer_class.from_pretrained('gpt2') # Load the pre-trained tokenizer.
model = model_class.from_pretrained('gpt2') # Load the pre-trained model weights.
model.to('cpu') # Specifies the device (CPU or GPU) to run the model on.
model.eval() # Puts the model in evaluation mode (disables dropout); NOT training mode.

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [3]:
#############################################################################################################################################
#############################################################################################################################################
# DIVIDER
#############################################################################################################################################
#############################################################################################################################################

In [4]:
def extract_top_k_tokens(filtered_logits, k_value):
    """
    This function utilizes the torch.topk() function to choose the "k" most likely words.
    
    torch.topk ranks the raw "scores" (logits) directly; it does not apply Softmax.
    Because Softmax preserves ordering, the top "k" scores are also the top "k" most likely
    predicted words (tokens), and k=1 is equivalent to argmax.

    - torch.topk
     - Returns the :attr:`k` largest elements of the given :attr:`input` tensor along a given dimension.

    Non-probabilistic method (no random sampling), so results are deterministic (always the same).
    
    Parameters:
        filtered_logits - entire vocabulary with assigned scores from GPT2 model.
        k_value - choose "k" top most likely words.
    
    Return:
        my_topk - named tuple (values, indices) holding the top "k" scores and their token IDs.
    """
    topk_debug = False

    # Return the top "k" most likely (highest score value) words in sorted order.
    my_topk = torch.topk(input=filtered_logits, k=k_value, dim=1, sorted=True)
    if topk_debug:
        print(f"My torch.topk object: {my_topk}\n")
        print(f"torch.topk indices: {my_topk.indices}")
        print(f"torch.topk values: {my_topk.values}\n")

    # https://stackoverflow.com/questions/34750268/extracting-the-top-k-value-indices-from-a-1-d-tensor
    # https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list

    # Indices = encoded words, Values = scores.
    if topk_debug:
        print(f"\nDecoded torch.topk indices: {[tokenizer.decode(idx) for idx in my_topk.indices.squeeze().tolist()]}")
        print(f"\nDecoded torch.topk indices (joined as one string): {tokenizer.decode(my_topk.indices.squeeze().tolist())}\n")

        print(f"topk indices shape: {my_topk.indices.shape}")
        print(f"topk indices shape after squeeze: {my_topk.indices.squeeze().shape}")
        print(f"topk indices after squeeze: {my_topk.indices.squeeze()}\n")

        # TODO: Ask Professor Arnold how to add/remove dimensions to a PyTorch Tensor.
        # https://stackoverflow.com/questions/43328632/pytorch-reshape-tensor-dimension
        print(f"topk indices 1st element in Tensor: {my_topk.indices[0][0]}")
        print(f"topk indices 1st element in Tensor shape: {my_topk.indices[0][0].shape}")
        print(f"topk indices 1st element in Tensor with added dimension: {my_topk.indices[0][0].unsqueeze(0)}")
        print(f"topk indices 1st element in Tensor with added dimension shape: {my_topk.indices[0][0].unsqueeze(0).shape}\n")

    if topk_debug:
        # Simple loop through the topk indices (debug output only).
        for elements in my_topk.indices[0]:
            print(f"topk word: {elements}")
            print(f"topk word shape: {elements.shape}")
            print(f"topk word shape after unsqueezing: {elements.unsqueeze(0).unsqueeze(0).shape}")

            # Set each element as the next token for text prediction and generation.
            next_token = elements.unsqueeze(0).unsqueeze(0)
            print(f"Next token shape: {next_token.shape}")
            print(f"Next token: {next_token}")
            print(f"Decoded next token(s): {tokenizer.decode(next_token.squeeze().tolist())}\n")

    # Return the named tuple (values, indices) holding the top "k" scores and token IDs.
    return my_topk
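
The docstring above relates `torch.topk` to Softmax and argmax. As a rough plain-Python illustration (a sketch with made-up scores, not the PyTorch implementation), the snippet below shows that picking the top "k" raw scores selects the same tokens as picking the top "k" Softmax probabilities, because Softmax preserves ordering. The helper names `softmax` and `top_k` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return (values, indices) of the k largest scores, largest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


logits = [2.0, -1.0, 0.5, 3.5, 1.0]  # Made-up scores for a 5-token vocabulary.
probs = softmax(logits)

_, idx_from_logits = top_k(logits, k=3)
_, idx_from_probs = top_k(probs, k=3)

print(f"Top-3 indices from raw scores:    {idx_from_logits}")
print(f"Top-3 indices from probabilities: {idx_from_probs}")
```

Both calls pick the same indices, so Softmax can be skipped when only the ranking matters. It is still needed when tokens are sampled at random according to their probabilities.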

In [29]:
def main():
    """
    Main encodes the raw text string, wraps in PyTorch Tensor, and calls prediction_generation().
    
    Parameters: 
        None
    Return: 
        None
    """
    main_debug = False
    context_debug = False
    num_samples = 1 # Default value.
    
    raw_text = "this is a test string for trying to understand what the heck is happening in run_generation.py."
    
    # Encode raw text.
    context_tokens = tokenizer.encode(raw_text, add_special_tokens=False)
    
    if main_debug:
        print(f"Raw text: {raw_text}\n")
        print(f"Context tokens: {context_tokens}\n")
    
    context = context_tokens # Set to name as in run_generation.py
    
    # Convert to a PyTorch Tensor object (similar to a numpy array).
    context = torch.tensor(context, dtype=torch.long, device='cpu')
    if context_debug:
        print(f"Context shape: {context.shape}")
        print(f"Context converted to Pytorch Tensor object: {context}\n")

    # Unsqueeze adds a batch dimension at position 0: shape (seq_len,) -> (1, seq_len).
    # Repeat tiles the Tensor num_samples times along that dimension: shape (num_samples, seq_len).
    context = context.unsqueeze(0).repeat(num_samples, 1)
    if context_debug:
        print(f"Context shape after 'unsqueeze': {context.shape}")
        print(f"Context after 'unsqueeze': {context}\n")

    generated = context # Set to name as in run_generation.py
    
    # Generate and output text prediction results.
    prediction_generation(context_tokens, generated)
    

In [35]:
def prediction_generation(context_tokens, generated):
    """
    This function makes text predictions using the GPT2 model and outputs the results.
    
    Parameters:
       context_tokens - the encoded raw text string.
       generated - context_tokens wrapped as a PyTorch Tensor.
    """
    import random
    
    temperature = 1 # Default value.
    iterations =  20 # Default value.
    k_value = 5 # Top "k" words to choose.
    generated_array = [] # List of "generated" PyTorch Tensor containing encoded word tokens.
    token_score_array = [] # List of "scores" for each token in the current iteration of topk.
    logits_debug = False
    topk_debug = False
    output_debug = False
    
    # Create list of PyTorch Tensors containing encoded original raw text string.
    for _ in range(k_value):
        generated_array.append(generated)
        token_score_array.append(1.)

    with torch.no_grad(): # Disables gradient tracking (inference only, no backpropagation).
        for _ in trange(iterations): 
              
            # Note: Feeding the results back into the model is the beginnings of a beam search algorithm.
            # Currently, randomly chooses one of the "generated" Tensors to feed back in.
            if logits_debug:
                print(f"Original generated shape: {generated.shape}")
                print(f"Generated array element 0 shape: {generated_array[0].shape}")
                print(f"token_score_array element 0: {token_score_array[0]}\n")
            
            # Call to GPT2 model generates a Tensor object containing "scores" for the entire vocabulary.
            chosen_generated = generated_array[random.randint(0, k_value - 1)]
            outputs = model(input_ids=chosen_generated)
            if logits_debug:
                print(f"Outputs shape: {list(outputs)[0].shape}\n")
                print(f"Outputs: {list(outputs)[0]}\n") # Outputs is a tensor containing a lot of stuff...

            next_token_logits = outputs[0][:, -1, :] / (temperature if temperature > 0 else 1.)
            if logits_debug:
                print(f"Next token logits shape: {next_token_logits.shape}\n")
                print(f"Next token logits: {next_token_logits}\n")

            filtered_logits = next_token_logits # Set to default name from run_generation.py

            ############################################################################################

            # Call function to extract the top "k" word tokens based on their scores.
            my_topk = extract_top_k_tokens(filtered_logits, k_value)

            counter = 0
            # Simple loop through the topk indices.
            for elements in my_topk.indices[0]:
                if topk_debug:
                    print(f"topk word: {elements}")
                    print(f"topk word shape: {elements.shape}")
                    print(f"topk word shape after unsqueezing: {elements.unsqueeze(0).unsqueeze(0).shape}")

                # Set each element as the next token for text prediction and generation.
                next_token = elements.unsqueeze(0).unsqueeze(0)
                if topk_debug:
                    print(f"Next token shape: {next_token.shape}")
                    print(f"Next token: {next_token}")
                    print(f"Decoded next token(s): {tokenizer.decode(next_token.squeeze().tolist())}\n")
                
                # Concatenate the chosen token (predicted word) to the end of the tokenized (encoded) string.
                # Then, add to the array of "generated" PyTorch tensors by modifying the original generated structures.
                generated_array[counter] = (torch.cat((chosen_generated, next_token), dim=1))
                if topk_debug:
                    print(f"Generated shape: {generated_array[counter].shape}")
                    print(f"Generated: {generated_array[counter]}")
                    print(f"Decoded 'generated' tokens: {tokenizer.decode(generated_array[counter].squeeze().tolist())}\n")
                    
                counter += 1
            
            # Store the scores for each token.
            counter = 0
            for elements in my_topk.values[0]:
                token_score_array[counter] = elements.unsqueeze(0).unsqueeze(0)
                if topk_debug:
                    print(f"topk word score: {elements}")
                    print(f"topk word score shape: {elements.shape}")
                    print(f"topk word score shape after unsqueezing: {elements.unsqueeze(0).unsqueeze(0).shape}")
                counter += 1

            # Output the text prediction results.
            print(f"Original raw text string: {tokenizer.decode(context_tokens)}\n")
            for gen in generated_array:
                out = gen
                if output_debug:
                    print(f"Contents of 'out': {out}")

                # This line removes the original text but keeps appending the generated words one-by-one (based on iteration length).
                out = out[:, len(context_tokens):].tolist()
                if output_debug:
                    print(f"Contents of 'out' after .tolist(): {out}\n")
                    print(f"Length of context tokens:{len(context_tokens)}\n")

                # Outputs the result of the text modeling and prediction.
                for o in out:
                    # Decode - convert from token IDs back into English words.
                    text = tokenizer.decode(o, clean_up_tokenization_spaces=True)
                #     text = text[: text.find(args.stop_token) if args.stop_token else None]
                    print(f"Predicted text:{text}:\n")
                
# Execute the program.
if __name__ == '__main__':
    main()


  0%|          | 0/20 [00:00<?, ?it/s]


The top k=5 tokens are: ['\n It The If I']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token: 

Choose a Token:  It
Choose a Token:  The
Choose a Token:  If
Choose a Token:  I
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text:
:

Predicted text: It:

Predicted text: The:

Predicted text: If:

Predicted text: I:




  5%|▌         | 1/20 [00:00<00:02,  7.41it/s]


The top k=5 tokens are: [' test first output tests code']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  test
Choose a Token:  first
Choose a Token:  output
Choose a Token:  tests
Choose a Token:  code
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The test:

Predicted text: The first:

Predicted text: The output:

Predicted text: The tests:

Predicted text: The code:




 10%|█         | 2/20 [00:00<00:02,  7.39it/s]


The top k=5 tokens are: [' are will should for can']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  are
Choose a Token:  will
Choose a Token:  should
Choose a Token:  for
Choose a Token:  can
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests are:

Predicted text: The tests will:

Predicted text: The tests should:

Predicted text: The tests for:

Predicted text: The tests can:




 15%|█▌        | 3/20 [00:00<00:02,  7.16it/s]


The top k=5 tokens are: [' be run return work not']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  be
Choose a Token:  run
Choose a Token:  return
Choose a Token:  work
Choose a Token:  not
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should be:

Predicted text: The tests should run:

Predicted text: The tests should return:

Predicted text: The tests should work:

Predicted text: The tests should not:




 20%|██        | 4/20 [00:00<00:02,  7.08it/s]


The top k=5 tokens are: [' fine, as in on']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  fine
Choose a Token: ,
Choose a Token:  as
Choose a Token:  in
Choose a Token:  on
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work fine:

Predicted text: The tests should work,:

Predicted text: The tests should work as:

Predicted text: The tests should work in:

Predicted text: The tests should work on:




 25%|██▌       | 5/20 [00:00<00:02,  6.91it/s]


The top k=5 tokens are: [' the all any a Python']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  the
Choose a Token:  all
Choose a Token:  any
Choose a Token:  a
Choose a Token:  Python
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in the:

Predicted text: The tests should work in all:

Predicted text: The tests should work in any:

Predicted text: The tests should work in a:

Predicted text: The tests should work in Python:




 30%|███       | 6/20 [00:00<00:02,  6.95it/s]


The top k=5 tokens are: [' similar single test way separate']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  similar
Choose a Token:  single
Choose a Token:  test
Choose a Token:  way
Choose a Token:  separate
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a similar:

Predicted text: The tests should work in a single:

Predicted text: The tests should work in a test:

Predicted text: The tests should work in a way:

Predicted text: The tests should work in a separate:




 35%|███▌      | 7/20 [00:01<00:01,  6.88it/s]


The top k=5 tokens are: [' case suite- environment context']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  case
Choose a Token:  suite
Choose a Token: -
Choose a Token:  environment
Choose a Token:  context
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test case:

Predicted text: The tests should work in a test suite:

Predicted text: The tests should work in a test-:

Predicted text: The tests should work in a test environment:

Predicted text: The tests should work in a test context:




 40%|████      | 8/20 [00:01<00:01,  6.78it/s]


The top k=5 tokens are: [',. and that but']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token: ,
Choose a Token: .
Choose a Token:  and
Choose a Token:  that
Choose a Token:  but
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context,:

Predicted text: The tests should work in a test context.:

Predicted text: The tests should work in a test context and:

Predicted text: The tests should work in a test context that:

Predicted text: The tests should work in a test context but:




 45%|████▌     | 9/20 [00:01<00:01,  6.62it/s]


The top k=5 tokens are: ['\n The If This Run']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token: 

Choose a Token:  The
Choose a Token:  If
Choose a Token:  This
Choose a Token:  Run
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context.
:

Predicted text: The tests should work in a test context. The:

Predicted text: The tests should work in a test context. If:

Predicted text: The tests should work in a test context. This:

Predicted text: The tests should work in a test context. Run:




 50%|█████     | 10/20 [00:01<00:01,  6.50it/s]


The top k=5 tokens are: [' is means test will should']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  is
Choose a Token:  means
Choose a Token:  test
Choose a Token:  will
Choose a Token:  should
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This is:

Predicted text: The tests should work in a test context. This means:

Predicted text: The tests should work in a test context. This test:

Predicted text: The tests should work in a test context. This will:

Predicted text: The tests should work in a test context. This should:




 55%|█████▌    | 11/20 [00:01<00:01,  6.44it/s]


The top k=5 tokens are: [' be not work also only']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  be
Choose a Token:  not
Choose a Token:  work
Choose a Token:  also
Choose a Token:  only
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should be:

Predicted text: The tests should work in a test context. This should not:

Predicted text: The tests should work in a test context. This should work:

Predicted text: The tests should work in a test context. This should also:

Predicted text: The tests should work in a test context. This should only:




 60%|██████    | 12/20 [00:01<00:01,  6.45it/s]


The top k=5 tokens are: [' be happen cause work affect']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  be
Choose a Token:  happen
Choose a Token:  cause
Choose a Token:  work
Choose a Token:  affect
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not be:

Predicted text: The tests should work in a test context. This should not happen:

Predicted text: The tests should work in a test context. This should not cause:

Predicted text: The tests should work in a test context. This should not work:

Predicted text: The tests should work in a test context. This should not affect:




 65%|██████▌   | 13/20 [00:01<00:01,  6.36it/s]


The top k=5 tokens are: [' any problems a issues an']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  any
Choose a Token:  problems
Choose a Token:  a
Choose a Token:  issues
Choose a Token:  an
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause any:

Predicted text: The tests should work in a test context. This should not cause problems:

Predicted text: The tests should work in a test context. This should not cause a:

Predicted text: The tests should work in a test context. This should not cause issues:

Predicted text: The tests should work in a test context. This should not cause an:




 70%|███████   | 14/20 [00:02<00:00,  6.12it/s]


The top k=5 tokens are: [' error issue exception unexpected infinite']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  error
Choose a Token:  issue
Choose a Token:  exception
Choose a Token:  unexpected
Choose a Token:  infinite
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an error:

Predicted text: The tests should work in a test context. This should not cause an issue:

Predicted text: The tests should work in a test context. This should not cause an exception:

Predicted text: The tests should work in a test context. This should not cause an unexpected:

Predicted text: The tests should work in a test context. This should not cause an infinite:




 75%|███████▌  | 15/20 [00:02<00:00,  6.01it/s]


The top k=5 tokens are: [' loop regress number amount error']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  loop
Choose a Token:  regress
Choose a Token:  number
Choose a Token:  amount
Choose a Token:  error
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an infinite loop:

Predicted text: The tests should work in a test context. This should not cause an infinite regress:

Predicted text: The tests should work in a test context. This should not cause an infinite number:

Predicted text: The tests should work in a test context. This should not cause an infinite amount:

Predicted text: The tests should work in a test context. This should not cause an infinite error:




 80%|████████  | 16/20 [00:02<00:00,  5.95it/s]


The top k=5 tokens are: [' of ( or more out']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  of
Choose a Token:  (
Choose a Token:  or
Choose a Token:  more
Choose a Token:  out
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an infinite amount of:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (:

Predicted text: The tests should work in a test context. This should not cause an infinite amount or:

Predicted text: The tests should work in a test context. This should not cause an infinite amount more:

Predicted text: The tests should work in a test context. This should not cause an infinite amount out:




 85%|████████▌ | 17/20 [00:02<00:00,  5.85it/s]


The top k=5 tokens are: ['orifunlesseas']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token: or
Choose a Token: if
Choose a Token: unless
Choose a Token: e
Choose a Token: as
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an infinite amount (or:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (if:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (unless:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (e:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as:




 90%|█████████ | 18/20 [00:02<00:00,  5.83it/s]


The top k=5 tokens are: [' it the you in we']
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  it
Choose a Token:  the
Choose a Token:  you
Choose a Token:  in
Choose a Token:  we
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as it:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as the:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as you:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as in:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we:




 95%|█████████▌| 19/20 [00:03<00:00,  5.81it/s]


The top k=5 tokens are: [" have will are've can"]
Note: If chosen token does not exist, you will see this prompt repeat forever and ever...
Choose a Token:  have
Choose a Token:  will
Choose a Token:  are
Choose a Token: 've
Choose a Token:  can
Original raw text string: this is a test string for trying to understand what the heck is happening in run_generation.py.

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we have:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we will:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we are:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we've:

Predicted text: The tests should work in a test context. This should not cause an infinite amount (as we can:




100%|██████████| 20/20 [00:03<00:00,  6.29it/s]
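
The generation loop in `prediction_generation()` divides the next-token logits by `temperature` before choosing tokens. With a pure top-"k" selection and a positive temperature, this rescaling does not change which tokens are picked, but it matters once tokens are sampled from the Softmax distribution. A small plain-Python sketch with made-up logits (the function name `softmax_with_temperature` is hypothetical):

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then convert them to probabilities."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # Subtract the maximum for numerical stability.
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.0, 1.0, 0.0]  # Made-up scores for a 3-token vocabulary.
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Temperatures below 1 sharpen the distribution toward the highest-scoring token, while temperatures above 1 flatten it so that less likely tokens are sampled more often.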


In [None]:
#############################################################################################################################################
#############################################################################################################################################
# ABOVE = MY PLAYGROUND
# BELOW = FINAL IMPLEMENTATION OF PLAYGROUND FEATURES
#############################################################################################################################################
#############################################################################################################################################

##### Generates the text (word) predictions.

In [8]:
def make_prediction(context):
    """
    This function generates the predicted text and returns the text as tokens.

    Parameters: context - list of encoded token IDs for the raw text string.
    Return: generated - PyTorch Tensor containing tokens.
    """
    make_prediction_debug = False
    temperature = 1 # Default value.
    length = 20 # Default value.
    num_samples = 1 # Default value.
    
    # Convert to a PyTorch Tensor object (similar to a numpy array).
    context = torch.tensor(context, dtype=torch.long, device='cpu')
    if make_prediction_debug:
        print(f"Context shape: {context.shape}")
        print(f"Context converted to Pytorch Tensor object: {context}\n")
        
    # unsqueeze(0) adds a leading batch dimension: shape [n] becomes [1, n].
    # repeat(num_samples, 1) copies that row num_samples times: shape [num_samples, n].
    context = context.unsqueeze(0).repeat(num_samples, 1)
    if make_prediction_debug:
        print(f"Context shape after 'unsqueeze': {context.shape}")
        print(f"Context after 'unsqueeze': {context}\n")
    
    generated = context # Set to name as in run_generation.py
    
    ############################################################################################

    with torch.no_grad(): # Disables gradient tracking; we are only predicting, not training.
        for _ in trange(length): 
            
            # Call to GPT2 model generates a Tensor object containing "scores" for the entire vocabulary.
            outputs = model(input_ids=generated)
            if make_prediction_debug:
                print(f"Outputs shape: {list(outputs)[0].shape}\n")
                print(f"Outputs: {list(outputs)[0]}\n") # Element 0 of outputs holds the logits ("scores").

            next_token_logits = outputs[0][:, -1, :] / (temperature if temperature > 0 else 1.)
            if make_prediction_debug:
                print(f"Next token logits shape: {next_token_logits.shape}\n")
                print(f"Next token logits: {next_token_logits}\n")

            filtered_logits = next_token_logits # Set to default name from run_generation.py

            ############################################################################################
            
            """
            torch.topk selects the "k" highest-scoring entries; with k=1 it is equivalent to argmax.
                It ranks the raw "scores" (logits) directly, so no Softmax is needed to pick the top tokens.

            - torch.topk
             - Returns the :attr:`k` largest elements of the given :attr:`input` tensor along a given dimension.

            Nothing is sampled at random, so the results are deterministic (always the same).
            """
            # Return the top "k" most likely (highest-scoring) tokens in sorted order.
            my_topk = torch.topk(input=filtered_logits, k=1, dim=1, sorted=True)
            print(f"My torch.topk object: {my_topk}\n")
            print(f"torch.topk indices: {my_topk.indices}\n")
            print(f"torch.topk values: {my_topk.values}\n")

            # https://stackoverflow.com/questions/34750268/extracting-the-top-k-value-indices-from-a-1-d-tensor
            # https://stackoverflow.com/questions/53903373/convert-pytorch-tensor-to-python-list

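            # Decode each chosen token ID separately, then all of them together as one string.
            # reshape(-1) keeps a 1-D list even when k=1 (squeeze() would collapse it to a bare int).
            topk_ids = my_topk.indices.reshape(-1).tolist()
            print(f"\nDecoded torch.topk indices: {[tokenizer.decode([idx]) for idx in topk_ids]}\n")
            print(f"\nDecoded torch.topk indices (joined): {tokenizer.decode(topk_ids)}\n")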
            
            ############################################################################################
            
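            # Greedy choice: with k=1, the top-k indices are the next token (shape: [num_samples, 1]).
            next_token = my_topk.indices
            # Concatenate the chosen token (predicted word) to the end of the tokenized (encoded) string.
            generated = torch.cat((generated, next_token), dim=1)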
            if make_prediction_debug:
                print(f"Generated shape: {generated.shape}")
                print(f"Generated: {generated}")
                print(f"Decoded 'generated' tokens: {tokenizer.decode(generated.squeeze().tolist())}\n")
                                       
    return generated
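
The difference between the deterministic top-1 choice used above and the random sampling done in run_generation.py can be sketched without PyTorch. This is only a plain-Python illustration under stated assumptions: `softmax`, `greedy_pick`, `sample_pick`, and `toy_logits` are hypothetical names with made-up scores, not part of the GPT2 code.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Same idea as top-k with k=1: return the index of the largest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Draw one index at random, in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [1.0, 3.5, 0.2, 2.9]  # made-up scores for a 4-token vocabulary

# Greedy selection gives the same index on every call.
greedy_results = [greedy_pick(toy_logits) for _ in range(5)]
# Sampling can give different indices, depending on the random state.
sampled_results = [sample_pick(toy_logits, rng=random.Random(seed)) for seed in range(5)]

print("greedy :", greedy_results)
print("sampled:", sampled_results)
```

The greedy list always holds the same index, which is why repeated calls with the same context are expected to produce the same text as long as the token is chosen this way.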


##### Calls the make_prediction(context) function and outputs text prediction and generation results.

In [None]:
def output_prediction(num_predictions, context_tokens):
    """
    This function outputs the results of our generated predicted text.
    """
    output_prediction_debug = False
    
    for i in range(0, num_predictions):
            
        out = make_prediction(context_tokens) # Function returns "generated" - PyTorch Tensor containing encoded tokens.
        if output_prediction_debug:
            print(f"Contents of 'out': {out}")

        # Slice off the original (prompt) tokens so only the newly generated tokens remain.
        out = out[:, len(context_tokens):].tolist()
        if output_prediction_debug:
            print(f"Contents of 'out' after .tolist(): {out}\n")
            print(f"Length of context tokens:{len(context_tokens)}\n")

        # Outputs the result of the text modeling and prediction.
        for o in out:
            # Decode - convert from token ID's back into English words.
            text = tokenizer.decode(o, clean_up_tokenization_spaces=True)
        #     text = text[: text.find(args.stop_token) if args.stop_token else None]
            print(f"Content of text: ##text_start_marker##{text}##text_end_marker##\n")
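
The slicing step in the function above can be seen in isolation with plain lists. A minimal sketch, assuming made-up token IDs; `strip_prompt` is a hypothetical helper, not something from run_generation.py.

```python
def strip_prompt(generated_rows, prompt_length):
    """Drop the first prompt_length token IDs from every row, keeping only the new tokens."""
    return [row[prompt_length:] for row in generated_rows]

prompt_ids = [101, 102, 103]               # made-up token IDs for the prompt
generated_rows = [prompt_ids + [7, 8, 9]]  # one sample: the prompt followed by 3 new tokens

new_tokens = strip_prompt(generated_rows, len(prompt_ids))
print(new_tokens)  # [[7, 8, 9]]
```

Each row corresponds to one sample, so the outer list has num_samples entries and each inner list holds the generated token IDs only.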
                

##### The usual main function.

In [None]:
def main():
    """
    Main function.
    """
    main_debug = False
    num_predictions = 3 # Specify the number of predictions to make for input string.
    
    raw_text = "this is a test string for trying to understand what the heck is happening in run_generation.py."
    
    # Encode raw text.
    context_tokens = tokenizer.encode(raw_text, add_special_tokens=False)
    # Generate and output text prediction results.
    output_prediction(num_predictions, context_tokens)
    
    if main_debug:
        print(f"Raw text: {raw_text}\n")
        print(f"Context tokens: {context_tokens}\n")
    

##### Execute to use the GPT2 model to generate text prediction output.

In [None]:
if __name__ == '__main__':
    main()