# **Q/A Assignment**


---


**Question 1: Relationship between Wnew0, Wnew1, Wnewn, and Wnewn+1 in Logistic Regression**

When a new dataset is created by duplicating feature n into feature (n + 1) and a new model is retrained, the likely relationship between the weights can be described as follows, writing W0, W1, ..., Wn for the weights of the original model:

Wnew0, Wnew1, ..., Wnewn-1 would be close to the original W0, W1, ..., Wn-1, because the other features and their relationship to the label are unchanged.

Wnewn and Wnewn+1 multiply the same feature values, so the model depends only on their sum: Wnewn + Wnewn+1 ≈ Wn. With L2 regularization, or with gradient descent from a symmetric initialization, the two weights are typically equal, so each is about Wn / 2.

In summary, the duplicated feature and its copy are likely to receive equal weights that together carry the contribution of the original feature n.
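
The effect of duplicating a feature can be checked numerically. The following is a minimal sketch, assuming synthetic data, plain batch gradient descent from zero weights, and no regularization; `train_logistic` and the generated dataset are hypothetical and not part of the assignment.

```python
import math
import random


def train_logistic(rows, labels, steps=2000, lr=0.5):
    """Batch gradient descent from zero weights; each row starts with 1.0 for the bias."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            z = sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for j, x in enumerate(row):
                grad[j] += error * x
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Synthetic labels drawn from a logistic model with bias -0.5 and slope 2.0 (assumed values)
labels = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(2.0 * x - 0.5))) else 0 for x in xs]

original_rows = [[1.0, x] for x in xs]       # bias + feature n
duplicated_rows = [[1.0, x, x] for x in xs]  # bias + feature n + its copy

w_orig = train_logistic(original_rows, labels)
w_new = train_logistic(duplicated_rows, labels)

print("original weights:  ", w_orig)
print("duplicated weights:", w_new)
print("sum of the two duplicated weights:", w_new[1] + w_new[2])
```

Because both copies receive identical gradients, they stay equal throughout training, and their sum converges to the weight the original model learns for feature n.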


---


**Question 2: Multivariate Email Template Test**

The correct answer is:

b. E is better than A with over 95% confidence, B is worse than A with over 95% confidence. You need to run the test for longer to tell where C and D compare to A with 95% confidence.

Explanation: Template E has the highest click-through rate (14%), and its margin over template A is large enough relative to the sample size to be significant at the 95% level. Template B has a lower click-through rate than A, again with over 95% confidence. For templates C and D, the difference from A is too small relative to the sampling noise, so the test needs to run longer before they can be compared with A at 95% confidence.
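
Whether a difference in click-through rates reaches 95% confidence depends on the sample size as well as on the rates. One common way to check this is a two-proportion z-test, sketched below. The counts are hypothetical, since the data table from the question is not reproduced here, and `two_proportion_z_test` is an illustrative helper rather than part of the assignment.

```python
import math


def two_proportion_z_test(clicks_a, sends_a, clicks_b, sends_b):
    """Return (z, two_sided_p_value) for the difference in click-through rates."""
    rate_a = clicks_a / sends_a
    rate_b = clicks_b / sends_b
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: the same 10% vs 14% rates at two different sample sizes
z_large, p_large = two_proportion_z_test(100, 1000, 140, 1000)
z_small, p_small = two_proportion_z_test(10, 100, 14, 100)
print(f"1000 sends each: z={z_large:.2f}, p={p_large:.4f}")
print(f"100 sends each:  z={z_small:.2f}, p={p_small:.4f}")
```

With 1000 sends per template the p-value falls below 0.05, while the same rates at 100 sends per template do not, which is the situation described for templates C and D.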


---


**Question 3: Computational Cost of Gradient Descent Iteration**

In the case of sparse feature vectors, where the average number of non-zero entries in each training example is k (k << n), each gradient descent iteration in logistic regression costs approximately O(m * k), where m is the number of training examples and n is the number of features. Both the prediction and the gradient contribution of an example touch only its non-zero features, compared with O(m * n) for dense feature vectors.
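
A minimal sketch of why the cost scales with k rather than n, assuming each example is stored as a dictionary mapping feature index to non-zero value. This storage format and the function name `sparse_logistic_gradient` are illustrative assumptions.

```python
import math


def sparse_logistic_gradient(examples, labels, weights):
    """Summed logistic-loss gradient; only non-zero features are visited."""
    grad = {}
    for features, y in zip(examples, labels):
        # O(k): dot product over the non-zero entries only
        z = sum(weights[j] * v for j, v in features.items())
        error = 1.0 / (1.0 + math.exp(-z)) - y
        # O(k): only the non-zero features receive a gradient contribution
        for j, v in features.items():
            grad[j] = grad.get(j, 0.0) + error * v
    return grad


weights = [0.0] * 5                           # n = 5 features
examples = [{0: 1.0, 3: 2.0}, {1: 1.0}]       # m = 2 examples with few non-zeros each
labels = [1, 0]
grad = sparse_logistic_gradient(examples, labels, weights)
print(grad)
```

The two inner loops run once per non-zero entry, so one pass over m examples performs on the order of m * k operations.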






---




**Question 4: Generating Additional Training Data for Text Classifier**

The likely ranking, by accuracy of the resulting classifier, of the different methods of generating additional training data is as follows:

a. Run the V1 classifier on 1 million random stories:
Likely to perform well, as this method selects the examples closest to V1's decision boundary, where additional labels are most informative.

b. Get 10k random labeled stories:
Reasonably good, but a random sample may contain few decision-boundary cases.

c. Pick a random sample of 1 million stories:
It might not perform as well, as this method selects the examples where V1 is both wrong and farthest from the decision boundary, which can skew the training set away from the typical data distribution.

Overall, (a) is likely to be the most accurate, followed by (b), and then (c).
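
The selection step in method (a) can be sketched as picking the stories whose V1 score is nearest the decision threshold. This is a minimal illustration, assuming V1 outputs a probability and the threshold is 0.5; `closest_to_boundary` and the scores are hypothetical.

```python
def closest_to_boundary(scores, k, threshold=0.5):
    """Return indices of the k scores nearest the threshold, most uncertain first."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return ranked[:k]


v1_scores = [0.10, 0.48, 0.90, 0.55, 0.50]  # hypothetical V1 probabilities
to_label = closest_to_boundary(v1_scores, k=2)
print(to_label)  # prints [4, 1]
```

The returned indices identify the stories to send for labeling.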


---


**Question 5: Estimating Probability of Coin Coming Up Heads**

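The estimates for the probability p using the described methods, assuming a uniform prior on p for the Bayesian and MAP estimates, are:

a. Maximum Likelihood Estimate (MLE): k / n

b. Bayesian Estimate (posterior mean): (k + 1) / (n + 2)

c. Maximum a Posteriori (MAP) Estimate: k / n, because a uniform prior leaves the posterior mode at the maximum of the likelihood. Under a Beta(2, 2) prior, the MAP estimate would be (k + 1) / (n + 2).

Here, k is the number of times the coin comes up heads, and n is the total number of coin tosses.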
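
The three estimates can be compared with a short sketch that assumes a Beta(alpha, beta) prior, where alpha = beta = 1 is the uniform prior. The function name `coin_estimates` and the example counts are illustrative assumptions.

```python
def coin_estimates(k, n, alpha=1.0, beta=1.0):
    """Estimates of p after k heads in n tosses under a Beta(alpha, beta) prior."""
    mle = k / n
    # The posterior is Beta(k + alpha, n - k + beta)
    posterior_mean = (k + alpha) / (n + alpha + beta)
    # Mode of the posterior; this formula assumes n + alpha + beta > 2
    posterior_mode = (k + alpha - 1) / (n + alpha + beta - 2)
    return {"mle": mle, "posterior_mean": posterior_mean, "map": posterior_mode}


uniform = coin_estimates(7, 10)              # uniform prior
beta22 = coin_estimates(7, 10, 2.0, 2.0)     # Beta(2, 2) prior
print(uniform)
print(beta22)
```

With the uniform prior the MAP estimate coincides with the MLE, while the posterior mean is pulled toward 1/2.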



---




# **Coding Assignment: Implementation and Optimization of GPT-2 Model**


---

**Task 1: GPT-2 Model & Checkpoints (20 Points)**


---

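import torch
import torch.nn as nn


class GPT2Model(nn.Module):
    """Simplified GPT-2-style language model built from PyTorch transformer layers.

    Positional embeddings are not included here, and the parameter names differ
    from those in the official GPT-2 checkpoints.
    """

    def __init__(self, vocab_size, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # batch_first=True so tensors are (batch_size, sequence_length, d_model).
        # The layer is a local variable: TransformerEncoder copies it, and storing
        # it on self would register an extra, unused set of parameters.
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=input_ids.device),
            diagonal=1,
        )
        embedded = self.embedding(input_ids)
        transformer_output = self.transformer(embedded, mask=causal_mask)
        output = self.fc(transformer_output[:, -1, :])  # Logits from the last position
        return output


def validate_gpt2_model(model, checkpoint_path):
    # The checkpoint must be a state_dict saved from this same GPT2Model class;
    # official GPT-2 weights use different parameter names and need remapping first.
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    model.eval()

    # Random token IDs as a smoke-test input: batch_size=1, sequence_length=10
    input_tensor = torch.randint(0, 100, (1, 10))

    with torch.no_grad():
        output = model(input_tensor)

    print("Sample Output:", output)


if __name__ == "__main__":
    vocab_size = 10000  # Must match the checkpoint (GPT-2 itself uses 50257)
    gpt2_model = GPT2Model(vocab_size)

    # Provide the path to the GPT-2 125M checkpoint
    checkpoint_path = "gpt2_checkpoint.pth"

    validate_gpt2_model(gpt2_model, checkpoint_path)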
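
As a sanity check on the checkpoint size, the parameter count of the standard GPT-2 small architecture can be worked out from the hyperparameters used above (d_model=768, num_layers=12). The sketch below assumes the published GPT-2 vocabulary size of 50257, a context length of 1024, and an output layer tied to the token embedding. It counts the standard architecture, not the simplified `GPT2Model` class, and `gpt2_parameter_count` is an illustrative helper.

```python
def gpt2_parameter_count(vocab_size=50257, context_length=1024, d_model=768, num_layers=12):
    """Count parameters of a GPT-2-style decoder with a tied output embedding."""
    token_embedding = vocab_size * d_model
    position_embedding = context_length * d_model
    attention = 4 * d_model * d_model + 4 * d_model  # QKV and output projections, with biases
    mlp = 8 * d_model * d_model + 5 * d_model        # two linear layers, hidden size 4 * d_model
    layer_norms = 4 * d_model                        # two LayerNorms per block (scale and shift)
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return token_embedding + position_embedding + num_layers * per_block + final_layer_norm


total = gpt2_parameter_count()
print(f"{total:,} parameters")  # prints 124,439,808 parameters
```

This comes to roughly 124M parameters, which is the model commonly referred to as GPT-2 125M.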




---


**Task 2: Transformer Architectural Changes**


---

class GPT2ModelWithChanges(GPT2Model):
    # RotaryPositionalEmbedding, GroupQueryAttention, and SlidingWindowAttention
    # must be defined before this class is instantiated.
    def __init__(self, vocab_size, d_model=768, nhead=12, num_layers=12):
        super().__init__(vocab_size, d_model, nhead, num_layers)
        self.rotary_positional_embedding = RotaryPositionalEmbedding(d_model)
        # Note: these two modules are constructed but not yet wired into the
        # transformer layers, so forward() below does not use them.
        self.group_query_attention = GroupQueryAttention(d_model, nhead)
        self.sliding_window_attention = SlidingWindowAttention(d_model, nhead)

    def forward(self, input_ids):
        # input_ids: (batch_size, sequence_length); assumes batch-first transformer layers
        seq_len = input_ids.size(1)
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=input_ids.device),
            diagonal=1,
        )
        embedded = self.rotary_positional_embedding(self.embedding(input_ids))
        transformer_output = self.transformer(embedded, mask=causal_mask)
        output = self.fc(transformer_output[:, -1, :])  # Logits from the last position
        return output


if __name__ == "__main__":
    vocab_size = 10000  # Vocabulary size
    gpt2_model_with_changes = GPT2ModelWithChanges(vocab_size)

    # Random token IDs as a smoke-test input: batch_size=1, sequence_length=10
    input_tensor = torch.randint(0, 100, (1, 10))
    with torch.no_grad():
        output = gpt2_model_with_changes(input_tensor)

    print("Sample Output with Changes:", output)
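
The three architectural changes named in this task can be illustrated without a deep learning framework. The following is a minimal sketch of the core idea behind each one, using plain Python lists. All function names are hypothetical, and a real implementation would apply these operations to query and key tensors inside each attention layer.

```python
import math


def rotate_pairs(vector, position, base=10000.0):
    """Rotary embedding: rotate each (even, odd) pair by a position-dependent angle."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` tokens)."""
    return [[j <= i and i - j < window for j in range(seq_len)] for i in range(seq_len)]


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# The rotated dot product depends only on the relative offset (2 in both cases)
score_near = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))
score_far = dot(rotate_pairs(q, 12), rotate_pairs(k, 10))

mask = sliding_window_causal_mask(seq_len=4, window=2)
head_map = [kv_head_for_query_head(h, 12, 4) for h in range(12)]
print(score_near, score_far)
print(mask[3])
print(head_map)
```

The two scores agree because rotating both vectors makes their dot product a function of the position difference, which is the property that lets rotary embeddings encode relative position.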




---


**Task 3: Training Loop Implementation**


---



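import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import Adam
from torch.nn.parallel import DistributedDataParallel
from torch.distributed.fsdp import FullyShardedDataParallel


class GPT2TrainingDataset(torch.utils.data.Dataset):
    """Placeholder dataset: (context, next_token) pairs cut from a 1-D tensor of token IDs."""

    def __init__(self, token_ids, context_length=10):
        self.token_ids = token_ids
        self.context_length = context_length

    def __len__(self):
        return len(self.token_ids) - self.context_length

    def __getitem__(self, idx):
        end = idx + self.context_length
        return self.token_ids[idx:end], self.token_ids[end]


class RotaryPositionalEmbedding(nn.Module):
    """Rotary Positional Embedding (placeholder, not implemented here)."""


class GroupQueryAttention(nn.Module):
    """Group Query Attention (placeholder, not implemented here)."""


class SlidingWindowAttention(nn.Module):
    """Sliding Window Attention (placeholder, not implemented here)."""


def train(model, dataloader, optimizer, criterion, device):
    """Run one epoch of training."""
    model.train()
    model.to(device)

    for inputs, targets in dataloader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)  # (batch_size, vocab_size) logits for the next token
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()


def train_ddp(model, dataloader, optimizer, criterion, device):
    # Requires torch.distributed.init_process_group() to have been called,
    # for example in a script launched with torchrun, with one GPU per process.
    model = DistributedDataParallel(model.to(device))
    train(model, dataloader, optimizer, criterion, device)


def train_fsdp(model, dataloader, criterion, device, lr=0.001):
    # Also requires an initialized process group. The optimizer is created after
    # wrapping because FSDP replaces the model's parameters with sharded ones.
    model = FullyShardedDataParallel(model.to(device))
    optimizer = Adam(model.parameters(), lr=lr)
    train(model, dataloader, optimizer, criterion, device)


if __name__ == "__main__":
    vocab_size = 10000  # Vocabulary size
    gpt2_model = GPT2Model(vocab_size)  # Defined in the Task 1 cell

    # Random token IDs as placeholder data; replace with a tokenized corpus
    token_ids = torch.randint(0, vocab_size, (1000,))
    dataset = GPT2TrainingDataset(token_ids)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(gpt2_model.parameters(), lr=0.001)

    # Train on a single GPU (falls back to CPU when CUDA is unavailable)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train(gpt2_model, dataloader, optimizer, criterion, device=device)

    # Train using DDP
    # train_ddp(gpt2_model, dataloader, optimizer, criterion, device=device)

    # Train using FSDP
    # train_fsdp(gpt2_model, dataloader, criterion, device=device)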
