In [None]:
!pip3 install -U numpy==1.22.4 datasets transformers evaluate datatops==0.2.2 vibecheck==0.0.3 > /dev/null 2>&1

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('d2pNpcbdTG0')

# Using SotA Models

As we've mentioned in previous notebooks, unless you are doing your own experimental DL research (and sometimes even then!), it is _far_ more common these days to use the HuggingFace model library to quickly import and start working with state-of-the-art models. In this section we will show you how to do that.

We will download a pretrained text-generation model from the HuggingFace `transformers` library. We will then fine-tune it on a different dataset, using the `datasets` library and the HuggingFace `Trainer` classes to make the process as easy as possible, and we'll see that we can accomplish all of this in just a few lines of easily maintained code.

At the end, we will have a _working_ generator... for code!

In [None]:
# @markdown What is your Pennkey and pod? (text, not numbers, e.g. bfranklin)
my_pennkey = ""  # @param {type:"string"}
my_pod = ""  # @param {type:"string"}
my_email = ""  # @param {type:"string"}
tutorial = "W10D2"

In [None]:
# @title Feedback setup (run this cell)

# Feedback with Datatops
from vibecheck import DatatopsContentReviewContainer
from datatops import Datatops

feedback_dtid = "62a48t3w"
feedback_name = "cis522_feedback"
quiz_dtid = "lxx8szk1"
quiz_name = "cis522_quiz"
dt_url = "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab/"

# Instantiate the Datatops client
dt = Datatops(dt_url)
quizdt = dt.get_project(quiz_name, user_key=quiz_dtid)


In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset

We're first going to pick a tokenizer. You can see some of the options [here](https://huggingface.co/transformers/pretrained_models.html). We'll use the CodeParrot tokenizer, which is a BPE (byte-pair encoding) tokenizer. But you can choose (or build!) another if you'd like to try offroading!

> **Quiz: Why can you use a different tokenizer than the one that was originally used? What requirements must another tokenizer for this task have?**
>
> Hint: You couldn't, for example, use the very popular `bert-base-uncased` tokenizer, even though it was trained on the English Wikipedia and BookCorpus datasets (which are both available in the `datasets` library).

In [None]:
why_tokenizer_choice = "" #@param{type:"string"}

In [None]:
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")

Next we'll download a pre-built model architecture. CodeParrot (the model) is a GPT-2 model, which is a transformer-based language model. You can see some of the options [here](https://huggingface.co/transformers/pretrained_models.html). But you can choose (or build!) another!

Note that `codeparrot/codeparrot` (https://huggingface.co/codeparrot/codeparrot) is about 7GB to download (so it may take a while, or it may be too large for your runtime if you're on a free Colab). Instead, we will use a smaller model, `codeparrot/codeparrot-small` (https://huggingface.co/codeparrot/codeparrot-small), which is only ~500MB.

To run everything together (tokenization, model, and de-tokenization), we can use the `pipeline` function from `transformers`:

In [None]:
from transformers import AutoModelForCausalLM
from transformers import pipeline

# Load the pretrained weights with a causal language-modeling head:
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small")
generation_pipeline = pipeline(
    "text-generation", # The task to run. This tells hf what the pipeline steps are.
    model=model, # The model to use; can also pass the string name here.
    tokenizer=tokenizer, # The tokenizer to use; can also pass the string name here.
)

In [None]:
input_prompt = '''\
def simple_add(a: int, b: int) -> int:
    """
    Adds two numbers together and returns the result.
    """'''

# Return tensors for PyTorch:
inputs = tokenizer(input_prompt, return_tensors="pt")

Recall that these tokens are integer indices into the vocabulary of the tokenizer. We can use the tokenizer to convert these indices back into token strings, which we can then print out to see how the prompt was split up before it reaches the model.

In [None]:
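# Drop the batch dimension and convert the tensor to a plain list of ints:
input_token_ids = inputs["input_ids"][0].tolist()
input_strs = tokenizer.convert_ids_to_tokens(input_token_ids)

# Show each token string next to its vocabulary index:
print(*zip(input_strs, input_token_ids))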

In [None]:
#@title .
DatatopsContentReviewContainer(
    "",
    "W10D2_Kickoff",
    {
        "url": dt_url,
        "name": feedback_name,
        "user_key": feedback_dtid,
    }
).render()

**(Quick knowledge-check: what are the weirdly-rendering characters representing?)**

This model is already ready to use! Let's give it a try. (Note that we don't use `inputs`; we generated it only to show the initial tokenization step.)

Here, we use the `pipeline` that we created earlier to chain all of our components together. If you were writing a Copilot-style code-completer, you could get away with wrapping this single line in a nice API and calling it a day!

Play with the hyperparameters and see what kinds of outputs you can get. Temperature controls how much randomness goes into sampling each next token: the model's output scores (logits) are divided by the temperature before the softmax, so a higher temperature flattens the distribution and a lower temperature sharpens it. More randomness leads to wilder predictions, but potentially more creative answers as well. A good place to start is `0.2`. You can also try changing the `max_length` parameter, which caps the total length in tokens, prompt included (though the model can emit an end-of-sequence token early, so it may not always generate exactly this many tokens).
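To make the effect of temperature concrete, here is a minimal sketch of a softmax with temperature scaling. The function name `softmax_with_temperature` and the toy logits are made up for illustration; this is not the code `transformers` runs internally, just the same idea in a few lines.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw next-token scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    biggest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (illustrative values only):
toy_logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability mass lands on the highest-scoring token, while at a high temperature the alternatives become much more likely to be sampled.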

In [None]:
outputs = generation_pipeline(input_prompt, max_length=..., num_return_sequences=1, temperature=...)

In [None]:
print(outputs[0]["generated_text"])

Let's see if we can fool our model now! The HuggingFace documentation tells us that the CodeParrot model was trained to generate Python code ([docs](https://huggingface.co/codeparrot/codeparrot-small)). Let's see if we can get it to generate some JavaScript:

In [None]:
input_prompt = "class SimpleAdder {"

print(generation_pipeline(input_prompt, max_length=..., num_return_sequences=1, temperature=...)[0]["generated_text"])

Yikes! I don't know what it generated for you, but what it made for me was:

```
class SimpleAdder {
    public:
        class SimpleAdder(object):
            def __init__(self, a, b):
                self.a = a
                self.b = b

            def __call__(self, x):
                return self.a + x
```

**Ew!** That's wrong in a _lot_ of ways. But it's understandable: our model can't really generalize outside of the domain in which it was trained. The training set probably included a few Python files containing the syntax of other languages (perhaps generators for other code?), so the model knows that there's some mysterious syntax that uses curly brackets... but it's not sure about anything else. (For the programming-language hobbyists among you: the `public` notation looks to me a lot like the model is trying to do something C-flavored and perhaps something Java-flavored; I like it! But it's definitely not JavaScript.)

What are the major observations?

* The syntax it's generating rapidly devolves into Python; it's able to predict only a few characters of non-Python before falling back on its familiar training territory.
* The part of the code that follows Python syntax is valid, and even resembles a useful class definition. (Although if you look closely, it really doesn't seem to do anything useful with the `b` attribute...) This tells us that the model "understands" its problem domain, but just hasn't been trained on the right data to solve our new problem.

**What are your other observations about the code it generated for you? You're now aware of how Transformers work. (1) Think specifically and remark about the observations a machine learning practitioner would make here if your role were to diagnose the error in a production system. Now, (2) how would a nonexpert user interpret the issues? (3) Do you think the model-reported confidence for this output would be high, low, in between...?**

In [None]:
out_of_domain_generation_observations = "" #@param {type:"string"}

In [None]:
#@title .
DatatopsContentReviewContainer(
    "",
    "W10D2_GPT2FailureModes",
    {
        "url": dt_url,
        "name": feedback_name,
        "user_key": feedback_dtid,
    }
).render()

## Fine-Tuning

Alright, so we have a model that can generate code. But now we want to fine-tune it to generate JavaScript.

Assuming the data will be too large to fit on disk on Colab, we'll pass `streaming=True` to `load_dataset`, which fetches examples on demand instead of downloading the whole dataset up front. There's actually a JavaScript subset of the codeparrot dataset, which we'll use as an example... but you can use any dataset you like, as long as you can configure the data-loader to use it! We recommend filtering datasets by task category (e.g. text generation) to find the most relevant ones. (Consider, for example, [this one](https://huggingface.co/datasets/angie-chen55/javascript-github-code).)
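A streamed dataset behaves like a lazy iterable: nothing is fetched until you ask for the next example. The sketch below shows that consumption pattern with a hypothetical stand-in generator, `fake_code_stream`, so that it runs without a network connection. The example fields are made up and are not the real dataset schema.

```python
from itertools import islice

def fake_code_stream():
    """Hypothetical stand-in for a streamed dataset: yields one example at a time."""
    i = 0
    while True:  # effectively endless, like a dataset too big to download
        yield {"code": f"console.log({i});", "language": "JavaScript"}
        i += 1

# Peek at the first three examples without materializing the rest:
first_three = list(islice(fake_code_stream(), 3))
for example in first_three:
    print(example["code"])
```

The same `islice` pattern should work for peeking at the first few examples of any iterable dataset.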

> **Choose a dataset from the HuggingFace datasets library:**
> 
> https://huggingface.co/datasets?task_categories=task_categories:text-generation&sort=downloads

In [None]:
# Unlike _some_ code-generator models on the market, we'll limit our training data by license :)
dataset = load_dataset(
    "codeparrot/github-code",
    streaming=True,
    split="train",
    languages=["JavaScript"],
    licenses=["mit", "isc", "apache-2.0"],
)
# Print the schema of the first example from the training set:
print({k: type(v) for k, v in next(iter(dataset)).items()})


Just like training any model, we need to define a training loop and an evaluation metric. 

This is made overwhelmingly easy with the `transformers` library. Specifically, take a look below at all of the code you can avoid by using the HuggingFace infrastructure. (In the past, we've used PyTorch Lightning, which had a similar training-loop abstraction. Do you have preferences between these two libraries?)

Here are the big pieces of what we do below:

* **Create a `TrainingArguments` object.** This is a serializable object (i.e., you can save it to memory or to disk) that makes it easy to train a model reproducibly with the same hyperparameters. (This certainly beats having a bunch of global variables in your notebook!)
* **Encode the dataset.** This is effectively just passing everything through the tokenizer, with a padding step that fills the end of each sequence with the padding token. 
* **Define our metrics.** We use the `accuracy` metric here (look at the 4th line in the code cell). Why might *accuracy* be a bad metric for this task? (Hint: What does it mean to be "accurate" in this task?)

In [None]:
accuracy_metric_observations = "" #@param {type:"string"}

* **Create a data collator.** This is a function that takes a list of examples and returns a single batch. The `DataCollatorForLanguageModeling` class is a convenient way to do this; with `mlm=False` it also builds the next-token-prediction labels by copying the input IDs.
* **Create a `Trainer` object.** This is a class that wraps the training loop and makes it easy to train a model. It's a bit like the `Trainer` class in PyTorch Lightning, but built around `transformers` models.
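To get a feel for what a causal-language-modeling collator does, here is a simplified pure-Python sketch. The function `collate_for_causal_lm` is hypothetical; the real `DataCollatorForLanguageModeling` returns tensors and handles more cases, and this only mirrors the core idea of padding plus label masking.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_for_causal_lm(examples, pad_token_id):
    """Pad token-id lists to equal length and build labels for next-token prediction."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels are a copy of the inputs; padding is masked out of the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch

# Two toy "tokenized" examples of different lengths (made-up token ids):
batch = collate_for_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch)
```

Note that the labels are not shifted here: the model itself shifts them by one position when computing the loss, so each position is scored on predicting the following token.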


In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
import numpy as np
from evaluate import load
metric = load("accuracy")

# Trainer:
training_args = TrainingArguments(
    output_dir="./codeparrot",
    max_steps=100,
    per_device_train_batch_size=1,
)

tokenizer.pad_token = tokenizer.eos_token

encoded_dataset = dataset.map(
    lambda x: tokenizer(x["code"], truncation=True, padding="max_length"),
    batched=True,
    remove_columns=["code"],
)

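# Evaluation metric (only called if an eval_dataset is passed to the Trainer):
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The logits at position i predict the token at position i + 1, so shift by one:
    predictions = np.argmax(logits, axis=-1)[:, :-1]
    labels = labels[:, 1:]
    # Skip padded positions, which the collator labels with -100:
    mask = labels != -100
    return metric.compute(predictions=predictions[mask], references=labels[mask])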

# Data collator:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [None]:
# Run the actual training:
trainer.train()

Finally, we will try our model on the same code snippet to see how it performs after fine-tuning:

In [None]:
# Move the model to the CPU for inference:
model.to("cpu")
print(
    generation_pipeline(
        input_prompt, max_length=100, num_return_sequences=1, temperature=0.2
    )[0]["generated_text"]
)


Of course, your results will be slightly different. Here's what I got:

```javascript
class SimpleAdder {
    constructor(a, b) {
        this.a = a;
        this.b = b;
    }

    add(
```

Much better! The model is no longer generating Python code, and it's not trying to jam Python-flavored syntax into other languages. It's still not perfect, but it's a lot better than before! (And of course, remember that this is just a small model, and we didn't train it for very long. You can try training it for longer, or using a larger model, to get better results.)

In [None]:
#@title .
DatatopsContentReviewContainer(
    "",
    "W10D2_Finetuning",
    {
        "url": dt_url,
        "name": feedback_name,
        "user_key": feedback_dtid,
    }
).render()

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('m5pwM9z1jYg')

In [None]:
best_out_of_domain_example = "" #@param {type:"string"}

In [None]:
common_and_diminishing_failures = "" #@param {type:"string"}

In [None]:
# @title Submit your quiz answers (run this cell to submit)

quizdt.store(
    dict(
        notebook=tutorial,
        my_pennkey=my_pennkey,
        my_pod=my_pod,
        my_email=my_email,
        why_tokenizer_choice=why_tokenizer_choice,
        out_of_domain_generation_observations=out_of_domain_generation_observations,
        accuracy_metric_observations=accuracy_metric_observations,
        best_out_of_domain_example=best_out_of_domain_example,
        common_and_diminishing_failures=common_and_diminishing_failures,
    )
)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('UouBsEU_e9M')

In [None]:
#@title .
DatatopsContentReviewContainer(
    "",
    "W10D2_Discussion",
    {
        "url": dt_url,
        "name": feedback_name,
        "user_key": feedback_dtid,
    }
).render()