# Group 5 GPT Project IS 640

This is the Group 5 GPT project for IS 640: Programming for Business Analytics.

Members:
- Hans 
- Chetan  
- Danish 
- Srujana 
- Bruna 

## Milestone 1: Dataset Exploration and Preparation

### Import all the modules and packages

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import pandas as pd
import numpy as np

### Define the hyperparameters for training

In [2]:
batch_size = 32 # Number of sequences processed in parallel during training
block_size = 128 # Maximum context length for predictions (sequence length)
max_iters = 5000 # Total number of training iterations
eval_interval = 100 # How often to evaluate the model (every 100 iterations)
learning_rate = 1e-3  # Step size for gradient descent optimization
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use GPU if available, otherwise CPU
eval_iters = 200 # Number of iterations for loss estimation during evaluation
n_embd = 128 # Dimensionality of the token embeddings and model's hidden layers
n_head = 8  # Number of attention heads in each self-attention layer
n_layer = 8 # Number of transformer layers in the model
dropout = 0.1 # Probability of dropping out neurons during training (regularization)

torch.manual_seed(1337)  # Set random seed for reproducibility

<torch._C.Generator at 0x22560947fb0>
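A few derived quantities follow directly from these settings. The sketch below recomputes them under stated assumptions: `head_size`, `tokens_per_step`, `total_tokens`, and `eval_steps` are illustrative names that do not appear elsewhere in the notebook, and the head-size formula assumes the embedding is split evenly across heads, as in standard multi-head attention.

```python
# Values copied from the hyperparameter cell above
batch_size = 32
block_size = 128
max_iters = 5000
eval_interval = 100
n_embd = 128
n_head = 8

# Assumption: each attention head receives an equal slice of the embedding
assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
head_size = n_embd // n_head

# Number of tokens processed per optimizer step and over the whole run
tokens_per_step = batch_size * block_size
total_tokens = tokens_per_step * max_iters

# Steps at which the loss would be evaluated: every eval_interval steps plus the final step
eval_steps = [s for s in range(max_iters) if s % eval_interval == 0 or s == max_iters - 1]

print(f"head_size={head_size}, tokens_per_step={tokens_per_step}")
print(f"total_tokens={total_tokens}, evaluations={len(eval_steps)}")
```

With `eval_iters = 200` and two splits, each evaluation runs 400 forward passes, so evaluation is a noticeable share of the total runtime at these settings.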

### Choosing Blogposts as the dataset

In [3]:
# # df = pd.read_csv('Blog_Text.csv', encoding='latin-1')
# df = pd.read_csv('Blog_Text_Cleaned.csv', encoding='latin-1')
# df['combined'] =  df['text'].astype(str)
# text = " ".join(df['combined'].dropna().tolist())
# text[:500]  # print the first 500 characters of the text

### Choosing Medium articles as the dataset

In [4]:
# df = pd.read_csv('Medium_Articles_Text.csv', encoding='latin-1')
# df['combined'] =  df['text'].astype(str)
# text = " ".join(df['combined'].dropna().tolist())
# text[:500]  # print the first 500 characters of the text

### Choosing TV show synopses as the dataset

In [14]:
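df = pd.read_csv('Dataset/tv_series_synopsis_full.csv', encoding='latin-1')
# Drop missing synopses before casting to str; casting first can turn NaN into the literal string 'nan'
df['combined'] = df['text'].dropna().astype(str)
text = " ".join(df['combined'].dropna().tolist())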
text[:500]  # print the first 500 characters of the text

'miles morales catapults across the multiverse, where he encounters a team of spiderpeople charged with protecting its very existence. when the heroes clash on how to handle a new threat, miles must redefine what it means to be a hero. a c.i.a. operative on the edge of retirement discovers a family secret and is called back into the field for one last job. a hit man from the midwest moves to los angeles and gets caught up in the citys theatre arts scene. john wick uncovers a path to defeating the'

### Converting strings to numerical format for training and validation
1. Extract the unique characters and find the count of the vocabulary
2. Map the characters to integers and vice versa
3. Define the encode function which converts strings into numerical format
4. Define the decode function which converts numbers into strings

In [6]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
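The four steps above can be checked on a small string. This is a minimal sketch: `sample_text` is a made-up stand-in for the synopsis corpus, and `encode`/`decode` are written as regular functions rather than lambdas, which is a style choice and not the notebook's own code.

```python
# Toy corpus standing in for the full synopsis text (assumption for illustration)
sample_text = "a hit man moves to los angeles"

# Steps 1 and 2: build the vocabulary and the two lookup tables
chars = sorted(set(sample_text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Convert a string into a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Convert a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hit man")
print(ids)
print(decode(ids))

# Characters outside the vocabulary are not handled: they raise KeyError
try:
    encode("Zebra")
except KeyError as err:
    print("unknown character:", err)
```

Because the vocabulary is built from the training text itself, any prompt passed to `encode` later has to use only characters that occur in that text.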

### Dividing the data into training and validation sets
1. Encode the text into numbers so that it can be processed as a pytorch tensor
2. Define the split ratio
3. Make the training and validation sets

In [7]:
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

### Create functions for batch loading and loss estimation
- `get_batch`: creates small random batches of input-target pairs for training or validation, so the model learns from diverse examples within the dataset.
- `estimate_loss`: averages the loss over `eval_iters` random batches from each of the training and validation sets. Comparing the two helps monitor overfitting (training loss much lower than validation loss) and guides hyperparameter tuning.

In [8]:
# data loading
def get_batch(split):
    """
    Generate a small batch of data of inputs x and targets y.

    Args:
        split: 'train' or 'val'. if 'train', we sample from train_data, otherwise val_data

    Returns:
        x: a tensor of shape (batch_size, block_size) representing the input sequence
        y: a tensor of shape (batch_size, block_size) representing the target sequence,
            i.e. the same window as x shifted one position later in the data
    """
    # select the split, then sample batch_size random starting offsets
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    """
    Estimates the average loss for the training and validation datasets 
    over a fixed number of evaluation iterations.

    Returns:
        Dict[str, float]: A dictionary containing the mean loss for both the 
        training and validation datasets. Keys are:
            - 'train': Mean loss for the training dataset.
            - 'val': Mean loss for the validation dataset.
    """
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()  # plain float, matching the documented return type
    model.train()
    return out
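To make the input-target relationship in `get_batch` concrete, the sketch below repeats the same sampling logic on toy data. `toy_data`, `toy_block_size`, and `toy_batch_size` are hypothetical names chosen for this illustration; the real function reads the notebook's globals instead.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the encoded corpus: the integers 0..99 (assumption)
toy_data = torch.arange(100)
toy_block_size = 8
toy_batch_size = 4

# Same sampling logic as get_batch, using the toy names
ix = torch.randint(len(toy_data) - toy_block_size, (toy_batch_size,))
x = torch.stack([toy_data[i:i + toy_block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in ix])

print(x.shape, y.shape)

# Each target row is the input row shifted by one position,
# so y[:, t] is the token that follows x[:, t] in the data
print(torch.equal(x[:, 1:], y[:, :-1]))
```

Subtracting `block_size` from the upper bound of `torch.randint` guarantees that the target window, which extends one token past the input window, never runs off the end of the data.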

## Milestone 2: Basic Model Usage (Bigram Language Model)

Description: This milestone introduces a simple bigram language model. It predicts the next token based solely on the current token, without considering any broader context.

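How it works: The model is a `vocab_size` by `vocab_size` lookup table (`nn.Embedding`). Row i of the table holds the logits for the token that follows token i, so the prediction depends only on the current token.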

Code changes:
- Implementation of a basic nn.Embedding layer for token prediction
- Simple forward pass that uses only the current token to predict the next

Metrics: Basic tracking of training and validation loss.
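A short sketch of what "no broader context" means in practice: with an `nn.Embedding(vocab_size, vocab_size)` table, the logits at a position are simply the table row for the token at that position, whatever came before it. The names `toy_vocab_size` and `table` are assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

toy_vocab_size = 5
table = nn.Embedding(toy_vocab_size, toy_vocab_size)

# Two sequences that contain token 3 at different positions with different histories
idx = torch.tensor([[0, 1, 3],
                    [4, 3, 2]])
logits = table(idx)  # shape (B, T, C) = (2, 3, 5)

# The prediction after token 3 is identical in both sequences...
same_prediction = torch.equal(logits[0, 2], logits[1, 1])
# ...because it is just row 3 of the lookup table
is_table_row = torch.equal(logits[0, 2], table.weight[3])

print(logits.shape, same_prediction, is_table_row)
```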

In [9]:
class BigramLanguageModel(nn.Module):
    """
    A simple bigram-based language model that predicts the next token 
    based on the current token using an embedding layer. This model is 
    primarily used as a basic demonstration of language modeling concepts.

    Args:
        vocab_size (int): The size of the vocabulary, defining the number of unique tokens.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer that maps tokens to logits 
            for all tokens in the vocabulary.

    Methods:
        forward(idx, targets=None):
            Performs the forward pass of the model, computing logits for the next token 
            and optionally calculating the cross-entropy loss.

            Args:
                idx (torch.Tensor): Tensor of shape (B, T) containing input token indices, 
                    where B is the batch size and T is the sequence length.
                targets (torch.Tensor, optional): Tensor of shape (B, T) containing target 
                    token indices for loss computation. Default is None.

            Returns:
                Tuple[torch.Tensor, torch.Tensor or None]:
                    - logits (torch.Tensor): Tensor of shape (B, T, vocab_size) containing 
                      predicted logits for the next token.
                    - loss (torch.Tensor or None): Scalar tensor representing the cross-entropy 
                      loss if `targets` is provided, otherwise None.

        generate(idx, max_new_tokens):
            Generates a sequence of tokens by sampling from the model's predictions.

            Args:
                idx (torch.Tensor): Tensor of shape (B, T) containing the initial context 
                    (sequence of token indices).
                max_new_tokens (int): Number of new tokens to generate.

            Returns:
                torch.Tensor: Tensor of shape (B, T + max_new_tokens) containing the initial 
                context concatenated with the generated tokens.

    Examples:
        >>> vocab_size = 100
        >>> model = BigramLanguageModel(vocab_size)
        >>> idx = torch.tensor([[1, 2, 3]])
        >>> logits, loss = model(idx, targets=torch.tensor([[2, 3, 4]]))
        >>> generated_sequence = model.generate(idx, max_new_tokens=5)
    """
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        # initialize weights from a small normal distribution (std=0.02) instead of the PyTorch defaults
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
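The `view(B*T, C)` reshape in `forward` exists because `F.cross_entropy` is called here with one row of logits per prediction and one class index per row. The sketch below, on random toy tensors, checks that the flattened call equals the mean negative log-probability of the target token over all B*T positions. The variable names are illustrative.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

B, T, C = 2, 3, 5  # batch size, sequence length, vocabulary size (toy values)
logits = torch.randn(B, T, C)
targets = torch.randint(C, (B, T))

# Flatten batch and time so every position becomes one classification example
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Manual equivalent: average of -log p(target) over all positions
log_probs = F.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
loss_manual = -picked.mean()

print(torch.allclose(loss_flat, loss_manual))
```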

In [10]:
model = BigramLanguageModel(vocab_size)
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.001681 M parameters


### Create a PyTorch optimizer for updating the model's parameters during training
AdamW is a variant of the Adam optimizer with decoupled weight decay, which makes it well suited to modern deep learning models such as transformers.

Key features:
- Combines Adam's adaptive per-parameter learning rates with weight decay applied directly to the weights, rather than as an L2 penalty folded into the gradient.
- Helps limit overfitting and stabilize training by shrinking large weights.

The cell below does not pass `weight_decay`, so PyTorch's default value is used.

In [11]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
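The difference between decoupled weight decay and an L2 penalty can be seen in a single optimizer step. This is a minimal sketch, not part of the training code: the helper `one_step` is hypothetical, and the values `lr=0.1` and `weight_decay=0.1` are exaggerated so the effect is visible. The loss gradient is set to zero, so any change in the weight comes from the decay term alone.

```python
import torch

def one_step(optimizer_cls):
    """Run one optimizer step on a single weight with a zero loss gradient."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)  # no signal from the loss
    opt.step()
    return w.item()

# AdamW: the weight is scaled by (1 - lr * weight_decay) = 0.99, independent of Adam's statistics
adamw_w = one_step(torch.optim.AdamW)

# Adam: the decay term is added to the gradient and then rescaled adaptively,
# so the first step has magnitude close to lr
adam_w = one_step(torch.optim.Adam)

print(f"AdamW: {adamw_w:.4f}, Adam with L2 penalty: {adam_w:.4f}")
```

In AdamW the amount of shrinkage depends only on `lr` and `weight_decay`, whereas in Adam with an L2 penalty it is coupled to the running gradient statistics.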

In [None]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for step in range(max_iters):  # named `step` to avoid shadowing the built-in iter()

    # every eval_interval steps (and on the final step), evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']

        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {step}: train loss {train_loss:.4f}, val loss {val_loss:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
# averages over all recorded evaluation points, not the final loss values
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print("\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)

step 0: train loss 3.7144, val loss 3.7144
step 100: train loss 3.5966, val loss 3.5968
step 200: train loss 3.4881, val loss 3.4882
step 300: train loss 3.3879, val loss 3.3888
step 400: train loss 3.2970, val loss 3.2976
step 500: train loss 3.2133, val loss 3.2137
step 600: train loss 3.1374, val loss 3.1377
step 700: train loss 3.0673, val loss 3.0676
step 800: train loss 3.0045, val loss 3.0047
step 900: train loss 2.9477, val loss 2.9475
step 1000: train loss 2.8961, val loss 2.8960
step 1100: train loss 2.8501, val loss 2.8489
step 1200: train loss 2.8076, val loss 2.8065
step 1300: train loss 2.7710, val loss 2.7692
step 1400: train loss 2.7357, val loss 2.7348
step 1500: train loss 2.7042, val loss 2.7033
step 1600: train loss 2.6782, val loss 2.6760
step 1700: train loss 2.6546, val loss 2.6520
step 1800: train loss 2.6320, val loss 2.6294
step 1900: train loss 2.6111, val loss 2.6089
step 2000: train loss 2.5963, val loss 2.5913
step 2100: train loss 2.5815, val loss 2.5766
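As a sanity check on the log above, the loss at step 0 can be compared with the loss of a model that assigns equal probability to every character. The sketch assumes the vocabulary size can be recovered from the reported parameter count (0.001681 M, i.e. 1681 parameters in a square `vocab_size` by `vocab_size` table); `n_params` and `uniform_loss` are names introduced here.

```python
import math

# From the "0.001681 M parameters" output: the only parameters are the square lookup table
n_params = 1681
vocab_size = math.isqrt(n_params)
assert vocab_size * vocab_size == n_params

# Cross-entropy of a uniform distribution over the vocabulary
uniform_loss = math.log(vocab_size)

logged_initial_loss = 3.7144  # train loss at step 0 in the log above
print(vocab_size, round(uniform_loss, 4), round(logged_initial_loss - uniform_loss, 4))
```

The two values agree to within about 0.001, which is consistent with the small-variance initialization producing near-uniform predictions before any training.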


In [13]:
# Save the text to a file
with open('milestone2.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)
