# Workshop 4: What is a ChatGPT?

- **When**: Monday Week 8, 17:00 - 18:30 
- **Where**: AT 5.04
- **Contact**: hello@edinburghai.org
- **Credits**: This notebook is created by EdinburghAI for use in its workshops. If you plan to use it, please credit us. 

## Today

1. Understand what **word embeddings** are and how to use them for **semantic search**
2. **Fine-tune** a **pre-trained** model for sentiment classification of movie reviews 🎬
3. Experiment with a **generative** model like ChatGPT 🤖 

Although not required, I would do Task 1 before Task 2. But you can do Task 3 completely separately. Task 1 has the most coding while Tasks 2 & 3 are mostly plug and play, with some questions to get you to think about what's going on. Feel free to jump around and do what interests you!

Let's get started! 💯

# 1. Word Embeddings

We've seen in the lecture that word embeddings are representations of a word as a vector. This vector is generally pretty long - several hundred numbers long! This means the vectors live in a very high-dimensional space. In this high-dimensional space, different directions represent different meanings. One direction might represent 'capital city', and another might represent 'France', and when you add them together, you would get a meaning that corresponds to something like 'Paris'. And this really works, as you'll see below!

If you find the idea of a high-dimension space confusing, try and imagine it in just 2 or 3 dimensions. Each vector represents a direction, and different directions represent different meanings. This way, you can 'add' and 'subtract' meanings from each other. 
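
To make 'adding meanings' concrete, here is a tiny sketch with made-up 3D vectors. The words, the numbers and the `nearest_word` helper are all invented for illustration (real embeddings are learned from data and are much longer), so treat it as a cartoon of the idea rather than how the real model works.

```python
import numpy as np

# Made-up 3D "embeddings", purely for illustration.
# Hypothetical dimensions: [capital-city-ness, French-ness, Japanese-ness]
toy_vectors = {
    "capital": np.array([1.0, 0.0, 0.0]),
    "france":  np.array([0.0, 1.0, 0.0]),
    "japan":   np.array([0.0, 0.0, 1.0]),
    "paris":   np.array([0.9, 0.9, 0.0]),
    "tokyo":   np.array([0.9, 0.0, 0.9]),
}

def nearest_word(vector, exclude=()):
    """Return the toy word whose vector points in the most similar direction."""
    best_word, best_score = None, -np.inf
    for word, candidate in toy_vectors.items():
        if word in exclude:
            continue
        # Compare directions: 1 means same direction, 0 means unrelated
        score = np.dot(vector, candidate) / (np.linalg.norm(vector) * np.linalg.norm(candidate))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# 'capital city' + 'France' points roughly towards 'Paris'
added = toy_vectors["capital"] + toy_vectors["france"]
print(nearest_word(added, exclude=("capital", "france")))

# 'Paris' - 'France' + 'Japan' points roughly towards 'Tokyo'
swapped = toy_vectors["paris"] - toy_vectors["france"] + toy_vectors["japan"]
print(nearest_word(swapped, exclude=("paris", "france", "japan")))
```

We exclude the input words from the search because otherwise the nearest word to a sum is often just one of the words you started with.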

**Think🤔**: Why do you think we use several hundred dimensions instead of just 2 or 3? To help, imagine how many different directions you can represent in 1, 2 and 3D. What happens as you increase the number of dimensions? Now imagine hundreds of dimensions! Even though meaning can be very nuanced, can you see that with enough dimensions, we might get somewhere?

Let's see some word embeddings in action below!

In [None]:
# Gensim is a library that contains some pre-trained word embeddings we can use
import gensim.downloader as api

# It might take a while to download the model
model = 'glove-twitter-100'
word_vectors = api.load(model)

In [None]:
# TODO: How many words are in this model?
...

In [None]:
# TODO: Let's look at the embedding for the word 'artificial'
...

In [None]:
# TODO: What is the dimensionality of the word vectors?
...

We can get the words that are closest to another using the `.most_similar()` method. Let's find the words that are closest to 'artificial'. Before running the next cell, predict a few words that you think should come up.

Then go ahead and try out some other words!

In [None]:
# TODO: Get the most similar words to 'artificial'. Then go ahead and try 'python', 'paris'
word_vectors.most_similar(...)

**Think🤔**: Was it what you expected? Remember how these vectors were trained! Can you see why we can't just naively use these embeddings in ChatGPT?

We can also look at words that are most distant from a word by passing `negative='word'` to the `most_similar` method.

**Think🤔**: What do you think will happen this time? Remember that this model was trained on Twitter data, and remember how we trained it. Then check it and try to reason through the answer.

In [None]:
# TODO: We can look at words that are furthest from 'artificial'
word_vectors.most_similar(negative=...)

We can then do some maths with words using a combination of the `positive` and `negative` parameters of the `most_similar` method. Go ahead and get the most similar words for `'king - man + woman'` using the `most_similar` method. You can also try `'france - paris + tokyo'`. Or any word maths you like!

In [None]:
# TODO: Compute 'king - man + woman' using the positive and negative parameters of most_similar


**Extension😈**:

Here is some code that tries to do the same as above but using a different method.

```python
king, man, woman = word_vectors['king'], word_vectors['man'], word_vectors['woman']

result = king - man + woman

print(word_vectors.most_similar(result))
```

It will work okay, but will return different numbers. Why do you think that is?

*Hint: Go look up what most_similar does to its inputs when you supply multiple words. Or have a google of 'most_similar vs similar_by_vector'*

Once you have an idea, how would you code it up so that the same numbers were returned?

### Cosine Similarity

How is `most_similar` deciding which words are most similar? Well, we have two main options:
- Euclidean distance (Pythagoras' theorem in higher dimensions)
- Cosine similarity (measure of angle between two vectors)

You can use both, but generally we prefer cosine similarity as it measures the angle between 2 vectors. 

**Think🤔**: Why does it make sense that we'd rather measure angles than distances?

*Hint: How is meaning represented in our embedding space?*

Below, I've implemented cosine similarity for you - it's a pretty simple formula. You can look up 'angle between two vectors', you might even remember it from high school.

In [None]:
import numpy as np

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# TODO: Let's calculate the cosine similarity between 'king' and 'queen'
...
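
If you want to see the two measures disagree, here is a small sketch with made-up 2D vectors (nothing here comes from the GloVe model; the vectors and helper names are just for illustration). Two vectors pointing the same way can be far apart in distance while having a cosine similarity of 1.

```python
import numpy as np

# Made-up 2D vectors for illustration only
short_vec = np.array([1.0, 1.0])
long_vec = np.array([10.0, 10.0])   # same direction as short_vec, 10x the length
other_vec = np.array([1.0, -1.0])   # same length as short_vec, different direction

def euclidean_distance(v1, v2):
    """Straight-line distance between the tips of two vectors."""
    return np.linalg.norm(v1 - v2)

def cosine(v1, v2):
    """Cosine of the angle between two vectors (ignores their lengths)."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Same direction: large distance, but cosine similarity of 1
print(euclidean_distance(short_vec, long_vec), cosine(short_vec, long_vec))

# Different direction: small distance, but cosine similarity of 0
print(euclidean_distance(short_vec, other_vec), cosine(short_vec, other_vec))
```

By Euclidean distance `other_vec` looks like the closer match to `short_vec`, while by angle `long_vec` is the perfect match.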

You can check out [banana phone](https://bananaphone.web.app/) (made by Valentin who is part of EdinburghAI). It uses cosine similarity to decide whether a word is closest to the word 'banana' or 'phone'! 

**Extension😈**: Go and implement a version of banana phone below!

In [None]:
# TODO: Implement banana phone! Only do this after you've tried the other exercises

## Semantic Search

A simple way of searching through a bunch of documents using a search term like 'Why is Edinburgh University student satisfaction so low?' is to find the documents with words in common with your search. So it might find the documents that also have the terms 'Edinburgh University' and 'student satisfaction' in them. The ones with the most terms in common would be ranked at the top of our search.
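
Here is a rough sketch of that keyword approach. The `keyword_search` function and the three example documents are made up for illustration, and a real search engine would do a lot more (ignore common words, handle punctuation, weight rare terms, and so on).

```python
def keyword_search(query, documents):
    """Rank documents by how many distinct words they share with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        doc_words = set(doc.lower().split())
        scored.append((len(query_words & doc_words), doc))
    # Highest overlap first; Python's sort is stable, so ties keep their original order
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Made-up documents for illustration
docs = [
    "edinburgh university student satisfaction survey results",
    "dissatisfied undergraduates complain about feedback",
    "best places to eat in edinburgh",
]

results = keyword_search("why is edinburgh university student satisfaction so low", docs)
for score, doc in results:
    print(score, doc)
```

Notice that the document about dissatisfied undergraduates scores 0: it shares no words with the query even though it is clearly on topic.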

Instead, we're going to implement *semantic* search. We're now going to search our database of documents, trying to find the one that has the most similar *meaning*. So now we don't need exact matches. In the example above, it might also return results with terms like 'dissatisfied undergraduates' etc.

To do this, we're going to need a *vector database*. In our case it's going to be pretty small - just a list. Then we're going to create a function that finds the most similar item and return it. Simple!
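
To give a feel for the shape of the idea before using real embeddings, here is a toy sketch. The 3D vectors, the documents and the `search` function are all made up; in the exercises below the vectors come from a real model instead.

```python
import numpy as np

# Hypothetical 3D embeddings standing in for real sentence vectors
documents = ["cats and kittens", "stock market news", "dogs and puppies"]
vector_db = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.8, 0.3, 0.1],
])

def search(query_vector, vector_db, documents):
    """Return the document whose embedding has the highest cosine similarity to the query."""
    db_norms = np.linalg.norm(vector_db, axis=1)
    # One matrix-vector product gives the dot product with every row at once
    similarities = vector_db @ query_vector / (db_norms * np.linalg.norm(query_vector))
    return documents[int(np.argmax(similarities))]

# A made-up query vector that points mostly along the third dimension
print(search(np.array([0.1, 0.0, 1.0]), vector_db, documents))
```

The 'database' is just a matrix with one row per document, and searching is one similarity calculation per row followed by picking the best.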

Now, our documents are composed of many words. We could just add up embeddings, but we're going to cheat and use another package called `sentence-transformers`, whose `SentenceTransformer` class can create a vector for a whole sentence super easily. Here's an example.

In [None]:
# You might need to install the package first.
%pip install -U sentence-transformers

In [65]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode some sentences
sentences = ['This is an example sentence']

sentence_embedding = model.encode(sentences)
print(sentence_embedding)

We can actually do more than one sentence at once.

In [None]:
sentences = ['This is an example sentence', 'this is another sentence']

# TODO: Encode the sentences and print the shape of the resulting tensor
sentence_embedding = ...
...

In the same way as before, we can look at the cosine similarity between two sentences. Try testing out different sentences and see what the model spits out!

In [None]:
sentences = ['This is an example sentence', 'this is another sentence']

# TODO: Calculate the cosine similarity between the two sentences
sentence_embedding = ...
...

And now we have a database (our list of sentence embeddings) and a way of measuring similarity. We have all we need to do semantic search!

Here is a list of sentences:

In [78]:
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming industries worldwide.",
    "The history of the Roman Empire spans over a thousand years.",
    "Climate change is one of the most pressing issues of our time.",
    "Machine learning is a subset of artificial intelligence focused on data-driven predictions.",
    "The theory of relativity revolutionized modern physics.",
    "Quantum mechanics explores the behavior of matter at atomic and subatomic levels.",
    "Python is a popular programming language known for its simplicity and readability.",
    "The Great Wall of China is one of the Seven Wonders of the World.",
    "Blockchain technology underpins cryptocurrencies like Bitcoin.",
    "Astronomy is the scientific study of celestial objects and phenomena.",
    "The Industrial Revolution marked a major turning point in human history.",
    "Many animals use camouflage to blend into their environments.",
    "Photosynthesis is the process by which plants convert sunlight into energy.",
    "Human rights are fundamental rights that every person is entitled to.",
    "The Amazon rainforest is home to an extraordinary diversity of species.",
    "E-commerce has changed the way consumers shop worldwide.",
    "Augmented reality overlays digital information onto the real world.",
    "Artificial neural networks are inspired by the human brain's structure.",
    "Renewable energy sources include solar, wind, and hydroelectric power.",
    "Shakespeare is widely regarded as one of the greatest writers in the English language.",
    "Nanotechnology is the science of manipulating matter on an atomic scale.",
    "Mental health awareness is increasingly recognized as essential to overall well-being.",
    "Globalization has led to increased interconnectedness between nations.",
    "Mars exploration has captured the interest of space agencies around the world.",
    "The Renaissance was a period of great cultural and artistic achievement in Europe.",
    "The Sahara Desert is the largest hot desert in the world.",
    "Meditation is a practice that can reduce stress and improve mental clarity.",
    "Blockchain offers a decentralized way to record transactions securely.",
    "The periodic table organizes elements by their atomic number.",
    "Mount Everest is the highest peak on Earth.",
    "Cryptography is essential for securing data in digital communications.",
    "Ocean acidification is a consequence of rising carbon dioxide levels.",
    "Social media platforms have transformed how people interact and share information.",
    "Data science involves extracting insights from complex data sets.",
    "The theory of evolution explains the diversity of life on Earth.",
    "Nutrition plays a critical role in maintaining health and wellness.",
    "The internet has dramatically changed access to information and knowledge.",
    "Anatomy is the study of the structure of organisms and their parts.",
    "The concept of democracy originated in ancient Greece.",
    "Microbiology studies organisms that are too small to be seen with the naked eye.",
    "Music has been an integral part of human culture for millennia.",
    "Space telescopes have greatly expanded our understanding of the universe.",
    "Genetics explores how traits are inherited through generations.",
    "The French Revolution was a pivotal event in European history.",
    "Coding skills are increasingly valuable in today’s job market.",
    "Marine biology focuses on life in ocean environments.",
    "Economics studies the production, distribution, and consumption of goods and services.",
    "Biomimicry uses nature-inspired solutions for human problems.",
    "The Milky Way is the galaxy that contains our solar system.",
    "Leadership skills are crucial for managing teams effectively.",
    "Renewable resources are vital for a sustainable future.",
    "Cybersecurity protects systems and networks from digital attacks.",
    "The human brain is a highly complex organ responsible for thought and behavior.",
    "Space travel requires advanced technology and rigorous training.",
    "The Cold War shaped international relations for much of the 20th century.",
    "Robotics involves the design, construction, and operation of robots.",
    "Philosophy seeks to answer fundamental questions about existence and knowledge.",
    "Biodiversity is essential for ecosystem resilience and stability.",
    "Journaling can be a powerful tool for self-reflection.",
    "Recycling helps reduce waste and conserve natural resources.",
    "The cardiovascular system circulates blood throughout the body.",
    "Inventions like the telephone and the internet transformed communication.",
    "Urbanization refers to the growth of cities as more people move to urban areas.",
    "Yoga combines physical postures, breathing exercises, and meditation.",
    "Photosynthesis takes place in the chloroplasts of plant cells.",
    "The Big Bang Theory describes the origin of the universe.",
    "Digital marketing has revolutionized how companies reach consumers.",
    "Volcanoes form when magma from beneath the Earth’s crust reaches the surface.",
    "Exercise can improve physical health and mental well-being.",
    "The Mona Lisa is one of the most famous paintings in the world.",
    "Conservation efforts are critical for protecting endangered species.",
    "Sustainable agriculture aims to meet current food needs without harming the environment.",
    "Physics is the branch of science that studies matter, energy, and their interactions.",
    "Behavioral psychology examines the ways that people behave and learn.",
    "Cloud computing provides on-demand access to computing resources.",
    "Antibiotics are drugs that treat bacterial infections.",
    "Time management is an essential skill for productivity.",
    "Global warming refers to the long-term rise in Earth's average surface temperature.",
    "Vaccines help the immune system prevent disease.",
    "3D printing allows for the creation of complex physical objects from digital designs.",
    "Deep learning is a type of machine learning that uses neural networks with many layers.",
    "Ancient Egypt is known for its pyramids and pharaohs.",
    "Public speaking is a skill that can be improved with practice.",
    "Investing in stocks can be a way to build wealth over time.",
    "Ecology studies the interactions between organisms and their environment.",
    "Food security ensures that all people have access to sufficient food.",
    "Photosynthesis is the process by which plants convert light into energy.",
    "Social psychology explores how individuals are influenced by others.",
    "The immune system defends the body against harmful pathogens.",
    "Algebra is a branch of mathematics that deals with symbols and rules for manipulating them.",
    "Digital art allows artists to create work on computers and tablets.",
    "Astronauts undergo extensive training for missions in space.",
    "The respiratory system is responsible for taking in oxygen and expelling carbon dioxide.",
    "Entrepreneurship involves creating and managing a new business.",
    "Linguistics is the scientific study of language and its structure.",
    "Bird migration is a natural phenomenon involving long-distance travel.",
    "Virtual reality creates an immersive experience using computer technology.",
    "Hydroelectric power is generated by harnessing the energy of flowing water.",
    "In physics, gravity is the force that attracts two bodies toward each other.",
    "Self-driving cars use artificial intelligence to navigate roads."
]

Now, go and create your vector database.

In [79]:
# TODO: Encode the sentences into a tensor we'll call our database. Print the shape of this tensor.
vector_db = ...
...

Now create a function that takes in a string, converts it to a vector, and queries the database for the most similar sentence.

In [80]:
# TODO: Implement the function that finds the most similar sentence to a given sentence
# Hint: You can use np.argmax to find the index of the most similar sentence
def find_most_similar(query: str) -> str:
    ...

And use it! Start by using one of the sentences in the database, then try changing every word in the sentence but keeping the meaning the same.

In [None]:
# TODO: Test the function by providing a query sentence and printing the most similar sentence
query_sentence = ...

...

**Think🤔**: Do you understand how every part of the system works? How good is your system? What's the main problem? What would you change?

**Extension😈**: Create a new system that encodes sentences using the GloVe word vectors from above. However, these only give you an embedding for each word, so how will you encode an entire sentence? What do you think are the advantages/drawbacks of such a system compared to the sentence embeddings above?

You can read more about how these sentence embeddings were created at [SentenceBert](https://sbert.net/) or the associated [paper](https://arxiv.org/abs/1908.10084).

### Word Embeddings ✅

And that's word embeddings! Time to move on to language models!

# 2. Fine-Tuning for Classification
Next, we are going to **fine-tune** a **pre-trained** language model on a task called sentiment analysis. This sort of thing is done a lot.

In sentiment analysis, we want to assign a sentiment to a piece of text. This might be to characterise tweets as 'safe' vs 'harmful' or even 'truth' vs 'lie' (this is a tough one...). We're going to be categorising movie reviews into either 'positive' or 'negative'. It's just a kind of classification.

**Think🤔**: How would you build a model that did this classification without any Machine Learning? Now think how you'd try to handle phrases like 'not ...' and 'lacking in ...'. Can you see why ML can be useful here?
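
As one possible starting point for that thought experiment, here is a deliberately naive sketch that just counts words. The word lists and the `rule_based_sentiment` function are made up for illustration.

```python
# Hypothetical hand-picked word lists; a real system would need far more
POSITIVE_WORDS = {"good", "great", "brilliant", "enjoyable"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "terrible"}

def rule_based_sentiment(review):
    """Count positive and negative words and return the majority label."""
    words = review.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    # Ties (score of 0) fall back to "Negative" in this sketch
    return "Positive" if score > 0 else "Negative"

print(rule_based_sentiment("A great film, really enjoyable."))
print(rule_based_sentiment("This film was not good."))  # 'not' is ignored, so this is mislabelled
```

The second review comes out as `Positive` because the counter sees 'good' and has no idea what 'not' does to it. You could add a rule for 'not', but then you'd need one for 'hardly', 'lacking in', sarcasm, and so on.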

We are going to use a big pre-trained language model called [BERT](https://arxiv.org/pdf/1810.04805) which is able to produce **contextualised** word embeddings. Hopefully this means that those embeddings are more nuanced and rich in meaning than the GloVe ones we used above. Then we'll use those embeddings to decide whether a review is positive or negative.

We're not going to be using PyTorch directly anymore! Many pre-trained models are stored on a website called [HuggingFace](https://huggingface.co/) (HF). HF also stores many datasets, and provides Python libraries for us to use all of these quite easily. The libraries we'll be using are [datasets](https://huggingface.co/docs/datasets/en/index) for datasets and [transformers](https://huggingface.co/docs/transformers/en/index) for the models. These libraries aren't a good choice for building your own model architecture, but they are very neat if you just want to use someone else's model.

The first step is to load the dataset from huggingface, and build a training and testing split. You can run this cell without any changes.

In [88]:
from datasets import load_dataset

# Load IMDB movie reviews
data = load_dataset('imdb')
train = data['train'].shuffle(seed=42).select(range(500))
test = data['test'].shuffle(seed=42).select(range(100))

Next, we want to **tokenize** our data. This means splitting up a sentence into smaller parts that are in the model's vocabulary. This is quite similar to splitting up a sentence into words, but can also include punctuation, other languages, and often models split up words into smaller chunks. You can check out a few different tokenizers for new and older versions of ChatGPT on [OpenAI's Platform](https://platform.openai.com/tokenizer). 

**Think🤔**: What do you notice about the difference in tokenization between the older and newer models? Why do you think that is?

*Hint: If tokens are longer, then what will that mean about the size of the vocabulary? What might that mean about the model itself? Does this fit with what you might already know about versions of ChatGPT over the last few years?*

BERT uses sub-word tokenization (tokens are not necessarily entire words, but pieces of words). We can just load the tokenizer from HuggingFace. Here we use the one for DistilBERT, a smaller and faster version of BERT.
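
To get a feel for what splitting words into pieces looks like, here is a simplified sketch in the style of greedy longest-match tokenization. The vocabulary and the `toy_subword_tokenize` function are made up for illustration; the real tokenizer has a vocabulary of tens of thousands of pieces and handles casing, punctuation and special tokens too.

```python
# Tiny made-up vocabulary; '##' marks a piece that continues a word
toy_vocab = {"un", "##believ", "##able", "play", "##ing", "the", "[UNK]"}

def toy_subword_tokenize(word, vocab=toy_vocab):
    """Greedily take the longest vocabulary piece that matches at each position."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

print(toy_subword_tokenize("unbelievable"))
print(toy_subword_tokenize("playing"))
print(toy_subword_tokenize("xyz"))
```

With this toy vocabulary, 'unbelievable' is never stored as a whole word, yet it can still be represented by three pieces the model does know.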

In [None]:
from transformers import AutoTokenizer

# Create the DistilBERT tokenizer (DistilBERT is a smaller, faster version of BERT)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_data(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

train_tokenized = train.map(tokenize_data, batched=True)
test_tokenized = test.map(tokenize_data, batched=True)

Have a poke around the `train_tokenized` object. It's a bit like a python dictionary with some attributes you can explore. 

*Hint: Instead of looking at the whole `['input_ids']` column, you can save some time by looking at just the first entry with `['input_ids'][0]`.*

In [None]:
# TODO: Look around the train_tokenized object. What do you see?
...

**Think🤔**: Do you understand what `text`, `label` and `input_ids` mean?

**Extension😈**: What do you think the `attention_mask` is? In our case, all of the values are `1`. If I told you that when training ChatGPT for next-token prediction, the attention mask has some `0`s that block out some of the attention between different words, can you have a guess at what it does? Have a google around or ask someone if you're interested.

**Think🤔**: Above, we passed the arguments `padding="max_length"` and `truncation=True` to the tokenizer. Given that each training sample needs to be a fixed length for passing into our model, what might padding and truncation do to our reviews?
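
Once you've had a think, here is a minimal sketch of the idea on plain Python lists. The token ids, the `PAD_ID` value and the `pad_or_truncate` function are made up for illustration; the real tokenizer also takes care of special tokens, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding id for this sketch; real tokenizers define their own

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Force a list of token ids to be exactly max_length long."""
    truncated = token_ids[:max_length]            # drop anything past max_length
    padding = [pad_id] * (max_length - len(truncated))
    return truncated + padding                    # fill any remaining space with padding

short_review = [5, 8, 13, 21]                     # made-up token ids
long_review = [5, 8, 13, 21, 34, 55, 89, 144]     # made-up token ids

print(pad_or_truncate(short_review, max_length=6))
print(pad_or_truncate(long_review, max_length=6))
```

Both outputs have length 6, so they can be stacked into one rectangular batch. The cost is that the end of the long review is simply thrown away.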

Now we are ready to fine-tune our model. First, we load a pre-trained DistilBERT model (this saves us having to train a whole model from scratch). The version we load is set up specifically for classification. We choose to give it 2 output labels (**Think🤔**: What are our two output labels?). 

Next, we will setup a **Trainer** with some **TrainingArguments**. Don't worry too much about these, they're just a bunch of boilerplate provided by HF that save us from having to write our own training code.

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

# We give the model a classification head
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results", # Where we store our results
    eval_strategy="epoch", # How often we evaluate model (every epoch)
    learning_rate=2e-5, # Learning rate
    per_device_train_batch_size=8, 
    per_device_eval_batch_size=16, 
    num_train_epochs=2, # How many epochs
    weight_decay=0.01, # Regularisation that helps reduce overfitting
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
)

**Extension😈**: Go and google 'weight decay in nn'. Also google 'learning rate scheduler'. Why do you think these are useful?

We are finally ready to train our model, which is trivially (almost laughably) easy with HF. Note that this can take about 5-6 mins to run fully on Kaggle.

In [None]:
trainer.train()

Now that our model is fine-tuned, we evaluate it on our test data.

In [None]:
metrics = trainer.evaluate()
print(metrics)

Great! Having these numbers is all well and good, but they probably don't mean anything to you at the moment. You should definitely also calculate the accuracy of the model on your test set, but we're going to skip that here. Instead, to get a better idea of how our model works, we're going to write a sentence or two in the `review` field of the following cell, and see if the model correctly predicts whether it is a `Positive` or `Negative` review of a film.

In [None]:
import torch

# TODO: Provide a test movie review
review = ...

# TODO: Try to understand what this function does
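def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    # Move the inputs to the same device as the model (it may be on a GPU after training)
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():  # No gradients needed when we are only predicting
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    return "Positive" if predicted_class == 1 else "Negative"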

# TODO: Test the function
...

**Think🤔**: Are there any situations that this model incorrectly predicts? (Hint: try phrases like "not good" or "not bad".) How might you improve this model to overcome such shortcomings?
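
We skipped calculating accuracy above, so here is a minimal sketch of how it could be done once you have the model's raw scores (logits) and the true labels. The `accuracy_from_logits` function and the toy numbers are made up for illustration. With HF, `trainer.predict(test_tokenized)` should give you an object with `.predictions` and `.label_ids` that you could pass in instead, but that part is not run here.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of examples where the highest-scoring class matches the true label."""
    predictions = np.argmax(logits, axis=1)
    return float(np.mean(predictions == labels))

# Made-up logits for 4 reviews (columns: Negative, Positive) and their true labels
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])

print(accuracy_from_logits(toy_logits, toy_labels))
```

In the toy data the third review is predicted Positive but labelled Negative, so 3 out of 4 are correct.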

**Optional**: If you want, run the following code to train the model on a larger training set, then run the above cell again to see if your results have improved! (This is optional because it will take a while)

In [None]:
# This cell is optional and may take a while to run! (15-20 mins)
train_large = data['train'].shuffle(seed=42).select(range(1500))
test_large = data['test'].shuffle(seed=42).select(range(250))

encoded_train_large = train_large.map(tokenize_data, batched=True)
encoded_test_large = test_large.map(tokenize_data, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_train_large,
    eval_dataset=encoded_test_large,
)

trainer.train()

#### Text Classification ✅

# 3. Using a GPT

Next up, we are going to use a pre-trained model to generate some text! For this task we will use the GPT-2 model, which is the predecessor of ChatGPT's current models, GPT-3.5 and GPT-4. It came out in 2019 - so you'll really be able to see how the models have improved since then. It's stupidly easy to get up and running. Fill in the TODO in the next cell, then run the next two cells to set up our GPT model and text generation function.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# TODO: Load GPT-2 model and tokenizer. The model's name is 'gpt2'. You can use the same format as above ('.from_pretrained'), no extra arguments needed. 
tokenizer = ...
model = ...

# Ignore this
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# TODO: Try to understand what this function does
def generate_text(prompt, max_length=50, temperature=1.0, top_k=0, top_p=1.0):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs.input_ids,         
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        max_length=max_length,
        temperature=temperature,  # Controls randomness
        top_k=top_k,              # Limits to top-k most likely words
        top_p=top_p,              # Nucleus sampling
        do_sample=True            # Enables sampling for non-deterministic results  
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Now we are ready to start generating text! 

The output of a generative text model is a vector of scores, one for each possible next token/word. A simple way of choosing which token/word should go next is to choose the biggest score, but as you'll see, this isn't the ideal behaviour (**Think🤔**: Do you always say the most probable word next?). These scores are converted into a probability distribution that sums to 1. We could then just sample from this distribution (**Think🤔**: As a concrete example, what would sampling from the distribution [1/6, 1/6, 1/6, 1/6, 1/6, 1/6] look like?). We can also manually edit this probability distribution to change how the model generates text.

**Extension😈**: You can google 'softmax' to see how we make this probability distribution. TLDR: the model outputs a bunch of scores that could be any numbers, and then we squish them into [0,1] and make sure they add up to 1. This operation still makes sure that the highest scores have the highest probabilities, and does this in a sensible way.
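
If you'd like to see softmax in code, here is a small sketch. The scores are made up, and the `softmax` function is written by hand here rather than taken from the model, but the formula is the standard one.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow; the result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

toy_scores = np.array([2.0, 0.5, -1.0])   # made-up scores for a 3-word vocabulary
probs = softmax(toy_scores)
print(probs, probs.sum())

# Two ways of picking the next word from the distribution
greedy_choice = int(np.argmax(probs))                   # always the most likely word
rng = np.random.default_rng(seed=0)
sampled_choice = int(rng.choice(len(probs), p=probs))   # word i is picked with probability probs[i]
print(greedy_choice, sampled_choice)
```

Greedy picking gives the same word every time, while sampling will sometimes pick the less likely words, in proportion to their probabilities.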

Here's a quick overview of what each parameter does. As a running example, let's pretend we only have 3 words and the model has outputted the distribution `[0.8, 0.15, 0.05]`, i.e. it thinks the word with id `0` is most likely.

- `temperature`: Temperature controls how much we change this distribution manually in order to change the kind of text generated (`1.0` leaves it unchanged). Higher temperature values make the scores closer together, making the probability of selecting each word more similar, leading to a more creative model. However, if temperature is too high, the model can start to make little sense - have a go! Conversely, lower temperature values make the model **less** random, with values close to 0 leading to a nearly deterministic model. Try out different values yourself and see where you think a good range of temperatures is.

  E.g. Suppose our distribution was `[0.8, 0.15, 0.05]`. Then a high temperature would turn this into something like `[0.6, 0.25, 0.15]` while a lower temperature would do something like `[0.95, 0.04, 0.01]`.
- `top_k`: Top-k sampling is a technique that limits the model's output to the top-k most probable words. This helps to keep the model coherent, but restricts the creativity and randomness of the model. Try some small values, like 5-10, and compare your results to bigger values, like 50 or 100.

  E.g. `top_k = 2` would yield `[0.8, 0.15]` - except this doesn't add to 1, so we rescale the values so that they do. The first element is still about 5.3 times as likely as the second: `[0.842, 0.158]`.
- `top_p`: Similar to `top_k`, this is another sampling technique (also known as nucleus sampling). Basically, we keep the smallest set of the most probable words whose cumulative probability is at least `p` (e.g. 0.9). This prevents words with a very low probability from being selected, while maintaining some room for randomness.

  E.g. `top_p = 0.95` gives `[0.8, 0.15]` which we would normalise as above. Similarly, `top_p = 0.8` would give `[0.8]` which we can normalise as before. **Think🤔**: What would it normalise to? *Hint: There's only 1 option!* 
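
Here is a rough sketch of the three adjustments applied to the running example. These functions are simplified stand-ins written for this notebook, not the code HF actually runs (for one thing, the real implementation works on the raw scores before they are turned into probabilities), but the effect on the distribution is the same idea.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Equivalent to dividing the log-probabilities by the temperature, then renormalising."""
    scaled = probs ** (1.0 / temperature)
    return scaled / scaled.sum()

def top_k_filter(probs, k):
    """Keep only the k most probable words, then renormalise."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Small tolerance so floating-point rounding doesn't pull in an extra word
    cutoff = int(np.argmax(cumulative >= p - 1e-12)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

running_example = np.array([0.8, 0.15, 0.05])
print(np.round(apply_temperature(running_example, 2.0), 3))   # flatter than the original
print(np.round(apply_temperature(running_example, 0.5), 3))   # sharper than the original
print(np.round(top_k_filter(running_example, 2), 3))
print(np.round(top_p_filter(running_example, 0.95), 3))
```

Words that are filtered out are left in the array with probability 0 rather than removed, so the positions still line up with the word ids.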

#### Have a go!

In the cell below, try messing around with the below values and see how the output changes! Also, try changing the input and the `max_length` (although if you set it to be much more than 200 it's going to take a while...).

In [None]:
prompt = "Complete this story: \n In a distant future, artificial intelligence has become"

temperature = ... # Choose a float > 0 (must have a decimal point)
top_k = ... # Choose any positive integer (0 does nothing)
top_p = ... # Choose any float between 0 and 1 (1 does nothing)

print(generate_text(prompt, max_length=50, temperature=temperature, top_k=top_k, top_p=top_p))

**Think🤔**: Do you think we want deterministic outputs? Why or why not? There are 2 different ways above to set the model to be deterministic (if we were allowed to set `temperature = 0`, there would even be 3). Check that they all produce the same output (there might be tiny variations for unimportant reasons, but in most cases it should be exactly the same). Do you like the kind of text it produces? Why do you think it exhibits this behaviour?

**Extension😈**: There are other ways to generate text using such a model. Go and google 'beam search for generative language models'. Why do you think this might be better? See if you can work out from the HF docs for the `generate` method how to get it working.

**Extension😈**: One of the big problems with transformer models like the ones above is that the computation required scales quadratically with the length of the input. This is why if you set `max_length` to be twice as large, it more than doubles the computation time. Can you think about *why* the scaling is quadratic? Remember that the transformer uses attention under the hood. Ask someone to have a chat about it.

The simplest way to query a model is using ChatGPT's website. However, you can do the same thing in code using the [OpenAI API](https://pypi.org/project/openai/) or packages like [langchain](https://www.langchain.com/). We didn't use those here because we want to use our own models, but oftentimes you won't be able to make a better model than ChatGPT - so just use their API!

#### ChatGPT ✅

# Language Models ✅
> That's it! I hope you see that it's not so hard to get building with some of the most powerful models in the world right now. You could even start a business around it (lots of people are) - the so-called 'GPT-wrapper' companies. If you do it well, you might make a lot of money. 
>
> On the flipside, I hope you agree how cool word embeddings are, and also how powerful they are. There's still **loads** of active research in NLP (Natural Language Processing), trying to make the most of these new models we've built in just the last few years, or trying to create models that are less stupid and more aligned to what humans want.
>
> Go learn more and build something cool. Then come back and show us!
> See you soon! 

Pierre Mackenzie & Niall Meagher, Edinburgh AI
