# Generative Artificial Intelligence

* How can we create a model that generates images, text, audio, etc. from user prompts?
* Unlabelled data is available in abundance. Learn hidden structure from the unlabelled data.

In this notebook we specifically address the task of text generation.

### What is text generation?
Given a sequence, predict the next token in the sequence. The sequence can be a series of words or characters, and the objective is to predict the next word or character, respectively.
$$P(w_{t} | w_{t-1}, w_{t-2}, w_{t-3},...,w_{1})$$
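
For context, this conditional is the building block of the standard chain-rule factorization of a whole sequence's probability. The symbol $T$ for the sequence length is an assumption introduced here, and for $t = 1$ the conditioning context is empty:
$$P(w_{1}, w_{2}, \ldots, w_{T}) = \prod_{t=1}^{T} P(w_{t} | w_{t-1}, w_{t-2}, \ldots, w_{1})$$
Under this view, generating text amounts to sampling one token at a time from these conditionals and appending it to the sequence.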

#### Some basic terminology:
**Tokens/Tokenization** - Given a sequence of characters, tokenization is the process of dividing the sequence into smaller units called tokens. Tokens can be individual characters, segments of words, complete words, or portions of sentences. The tokens are then converted into a numeric form, such as integer indices or one-hot vectors, to be fed into the model.
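
As a rough illustration of the definition above, the sketch below tokenizes a toy string at the character level and builds one-hot vectors in plain Python. The names `sample`, `token_to_id`, and `one_hot` are hypothetical helpers for this example only, not part of the notebook's code.

```python
# Toy illustration of character-level tokenization and one-hot encoding.
sample = "hello"

# Character-level tokens: every character is its own token.
tokens = list(sample)

# Vocabulary: the sorted unique tokens, each mapped to an integer index.
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
ids = [token_to_id[tok] for tok in tokens]

def one_hot(index, size):
    # All zeros except a single 1 at the token's index.
    return [1 if j == index else 0 for j in range(size)]

one_hot_vectors = [one_hot(i, len(vocab)) for i in ids]

print(tokens)              # the character tokens
print(ids)                 # one integer index per token
print(one_hot_vectors[0])  # one-hot vector for the first token
```

Each one-hot vector has the same length as the vocabulary, so a larger vocabulary gives longer, sparser vectors.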

**Generative Model** - A model that learns the probability distribution of the training data, so that samples drawn from the model appear to come from the same distribution as the training data.

**Discriminative Model** - In contrast to generative models, discriminative models are trained to differentiate between classes or categories.


### Dataset

In [1]:
# Read the input corpus
with open('tiny_shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("Length of text: ", len(text))
print(f"\nSample text:\n{text[:400]}")

Length of text:  1115393

Sample text:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it 


### Tokenization
One of the easiest language models to start with is the character-level model, where each character is a token. Each token carries minimal information, but the model is easy to implement.

In [2]:
# Build the vocabulary from the unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Vocabulary: ", ''.join(chars))
print("Vocabulary size: ", vocab_size)

Vocabulary:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size:  65


To feed characters into a model, they need to be converted into numbers that the model can process.

In [3]:
# Encoder and decoder: map characters to integer ids (stoi) and back (itos)
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

In [4]:
print(encode('Shakespeare'))
print(decode(encode('Shakespeare')))

[31, 46, 39, 49, 43, 57, 54, 43, 39, 56, 43]
Shakespeare


In [5]:
# Convert text to torch tensor
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
data = torch.tensor(encode(text), dtype=torch.long)

In [6]:
# Split data into train and validation
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# Get single batch of data for training
def get_batch(split='train', block_size=8, batch_size=4):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [7]:
x, y  = get_batch(split='train')
x[0], y[0]

(tensor([50, 50, 11,  0, 25, 63,  1, 61]),
 tensor([50, 11,  0, 25, 63,  1, 61, 53]))
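
To see how one row of `x` and `y` lines up, the sketch below replays the sample output above in plain Python. The names `x_row`, `y_row`, and `pairs` are illustrative and not part of the notebook, and reading each position as its own training example assumes a model that makes a prediction at every position.

```python
# Values copied from the sample output above (one row of x and one row of y).
x_row = [50, 50, 11, 0, 25, 63, 1, 61]
y_row = [50, 11, 0, 25, 63, 1, 61, 53]

# y is x shifted left by one token: y_row[t] is the token that follows
# the first t + 1 tokens of x_row.
pairs = [(x_row[: t + 1], y_row[t]) for t in range(len(x_row))]

for context, target in pairs:
    print(f"context {context} -> target {target}")
```

Read this way, a single row of length `block_size` holds `block_size` context/target pairs, with context lengths ranging from 1 to `block_size`.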

### Generative Model
**Possible generative models**:
1. **N-gram model** - Given the previous N-1 tokens in the sequence, predict the next token. The most common variants are bigram and trigram models with Bayes estimation. The larger the value of **N**, the more context information can be incorporated.
2. **Recurrent neural networks** - A go-to neural network architecture for working with sequential data. Behind the scenes, it is just a neural network that processes the tokens of the sequential input one at a time.

<p align="center">
<img src="assets/rnn.webp" width="700">
</p>

An RNN condenses the entire history of the sequence into a single vector. Theoretically, RNNs can process infinite history, but in practice this is limited by computational constraints and memory requirements. Even with a large enough history, RNNs struggle with long-term dependencies.

3. **Transformer models** - Introduced in 2017 by the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf). The paper introduces an architecture built around a differentiable lookup mechanism called `Attention`, which potentially solves the problem of long-term dependencies by allowing the model to look up specific information from the history as required.

e.g., Prompt - Where is the Eiffel Tower located? Answer - It is located in Paris. Here `It` is related to `Eiffel Tower`, `is` to `is`, and `located` to `located`.
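
The "differentiable lookup" idea can be sketched as scaled dot-product attention for a single query, written in plain Python. This is a simplified illustration only: it leaves out the learned projections, multiple heads, and masking used in a full Transformer, and the vectors and the helper names `softmax` and `attend` are made up for the example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Score each history position by the scaled dot product of query and key.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Turn the scores into weights that sum to 1 (a "soft" lookup).
    weights = softmax(scores)
    # The result is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical 2-d vectors for three history tokens; the numbers are made up.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [3.0, -3.0]  # points in roughly the same direction as keys[0]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

Because every step is a dot product, a softmax, or a weighted sum, the lookup is differentiable, and the largest weight falls on the history position whose key best matches the query.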

In this notebook we will start with a simple Bigram model and slowly build our way towards a Transformer model.
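
The N-gram idea from the list above can be sketched with plain counts over a toy string. This is a count-based illustration with maximum-likelihood estimates, not necessarily how the model in this notebook is implemented. The names `corpus`, `counts`, and `next_char_probs` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus; any string works here.
corpus = "hello hello help"

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_char_probs(prev):
    # Maximum-likelihood estimate:
    # P(next | prev) = count(prev, next) / count(prev, anything)
    total = sum(counts[prev].values())
    return {ch: c / total for ch, c in counts[prev].items()}

print(next_char_probs("l"))
print(next_char_probs("h"))
```

A bigram conditions on a single previous character, so the table of counts has at most one row per vocabulary entry.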

### Bigram Model