#**Bigram Models in NLP:**


###**Explanation:**


A bigram model, also known as a 2-gram model, is a simple statistical language model that calculates the probability of a word based on the occurrence of the preceding word in a given sequence of words. In a bigram model, the probability of a word only depends on the immediately preceding word. This is in contrast to more complex language models, such as n-gram models with larger values of n or neural network-based models like the Transformer.

Here's how a bigram model works:

1. **Data Preparation:**
   - The first step involves collecting a large corpus of text data. This corpus can be any collection of text, such as books, articles, or web pages.

2. **Tokenization:**
   - The text is then tokenized, breaking it down into individual words or other meaningful units, depending on the specific application.

3. **Counting Bigrams:**
   - The bigram model focuses on pairs of consecutive words. It counts the occurrences of each pair of words in the corpus.

4. **Probability Calculation:**
   - The probability of a word given its preceding word is calculated using the following formula:
     \[ P(w_i | w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} \]
     Here, \( w_i \) is the current word, \( w_{i-1} \) is the preceding word, \(\text{Count}(w_{i-1}, w_i)\) is the number of occurrences of the bigram \( (w_{i-1}, w_i) \), and \(\text{Count}(w_{i-1})\) is the total count of occurrences of the preceding word \( w_{i-1} \).

5. **Generating Text:**
   - To generate text using a bigram model, you start with an initial word and then select the next word based on the probabilities calculated from the bigram model. This process is repeated to generate a sequence of words.

6. **Limitations:**
   - Bigram models have limitations, as they only consider the previous word and ignore broader context. They may struggle with capturing long-range dependencies and understanding the semantics of sentences.

7. **Smoothing:**
   - To handle the issue of unseen bigrams (pairs of words that don't occur in the training data), smoothing techniques like Laplace smoothing may be applied to avoid zero probabilities.
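
Steps 3, 4, and 7 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the toy corpus, the whitespace tokenizer, and the function names `bigram_prob` and `bigram_prob_laplace` are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus (an assumption for illustration); tokens are whitespace-separated words.
corpus = "the cat sat on the mat the cat ran"
tokens = corpus.split()

# Step 3: count single tokens and pairs of consecutive tokens.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def bigram_prob_laplace(prev, word):
    """Laplace (add-one) smoothing: unseen bigrams get a small non-zero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(bigram_prob("the", "cat"))          # "the cat" occurs 2 times, "the" occurs 3 times
print(bigram_prob("the", "ran"))          # unseen bigram: zero without smoothing
print(bigram_prob_laplace("the", "ran"))  # non-zero with smoothing
```

With add-one smoothing, every count is incremented by one and the denominator grows by the vocabulary size, so the probabilities for a given preceding word still sum to 1.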

Despite their simplicity, bigram models can be surprisingly effective in certain applications, such as text generation or simple language understanding tasks. However, they are not as powerful or accurate as more sophisticated models like higher-order n-gram models or neural language models, especially when dealing with complex language structures and semantics.

Bigram models are a type of n-gram model, where "n" represents the number of elements in a sequence. In the case of bigrams, "n" is 2, meaning the model considers pairs of consecutive elements. These models are used in language modeling, which is a fundamental task in natural language processing (NLP).

Here's how bigram models are used to build language models (LMs):

1. **Tokenization:**
   - Before building any language model, the text data is tokenized into units such as characters or words. In the context of bigram models in the provided code, the tokens are characters.

2. **Building Bigram Probabilities:**
   - For each token in the text, the model calculates the probability of occurrence of the next token (bigram probability). This is done by counting the occurrences of each pair of consecutive tokens and dividing it by the total occurrences of the first token.

3. **Training the Model:**
   - In the training phase, the model learns the bigram probabilities from a given dataset. It uses this information to predict the next token in a sequence based on the current token.

4. **Language Modeling:**
   - During language modeling, the trained bigram model is used to generate new sequences of tokens. Starting with an initial context (e.g., a seed word or character), the model predicts the next token based on the learned bigram probabilities. This process is iteratively repeated to generate longer sequences.

5. **Generating Text:**
   - The generated sequences can be used for various NLP tasks, such as text completion, text generation, or even creative writing. The generated text reflects the statistical patterns learned from the training data.

6. **Evaluation:**
   - Language models, including bigram models, are evaluated based on their ability to generate coherent and contextually relevant text. Common evaluation metrics include perplexity, which measures how well the model predicts the data.
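
Perplexity is the exponential of the average negative log-probability that the model assigns to the tokens that actually occur. The cross-entropy loss used to train the model later in this notebook is that same average (in nats), so perplexity can be read off as `exp(loss)`. The sketch below uses made-up probabilities, and the `perplexity` helper is a hypothetical name for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability) of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each actual next token.
probs = [0.25, 0.5, 0.125, 0.25]
print(perplexity(probs))

# A model that is uniformly unsure among k choices has perplexity k.
print(perplexity([1 / 8] * 10))
```

Lower is better: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.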

In the provided code, the bigram language model is implemented using PyTorch. The model uses an embedding layer to represent characters and is trained to predict the next character in a sequence. The generation process involves sampling from the predicted probabilities to generate new sequences.

While bigram models are simple and computationally efficient, more sophisticated models, such as neural language models (e.g., LSTMs, Transformers), have become popular for capturing long-range dependencies and achieving state-of-the-art performance in NLP tasks. These models go beyond bigram probabilities and learn intricate patterns and representations from the data.

###**Code Explained**



#### 1. Data Preparation:

##### Downloading the Text File:
```python
!wget https://raw.githubusercontent.com/Infatoshi/fcc-intro-to-llms/main/wizard_of_oz.txt
```
Downloads the text of "The Wizard of Oz" from a GitHub repository. This text file will serve as the dataset for training the bigram language model. The URL points at the raw file: a `/blob/` URL fetches GitHub's web page for the file rather than the plain text.

##### Importing Libraries:
```python
import torch
import torch.nn as nn
from torch.nn import functional as F
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```
Imports necessary PyTorch libraries for building and training neural networks. The `device` variable is set to 'cuda' if a GPU is available; otherwise, it defaults to 'cpu'.

##### Reading and Preprocessing the Text File:
```python
with open('wizard_of_oz.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))
vocabulary_size = len(chars)
```
Reads the text from the file and extracts unique characters to form the vocabulary. The `chars` variable contains a sorted list of unique characters, and `vocabulary_size` is the total number of unique characters in the text.

##### Mapping Characters to Integers:
```python
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])
```
Creates mappings between characters and integers, allowing for easy conversion between characters and their corresponding integer representations. The `encode` function converts a string into a list of integers, and `decode` reverses the process.
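
A quick round trip shows what the two mappings do. This is a small sketch on a toy string standing in for the book (an assumption for illustration); it uses named functions instead of lambdas, but the logic is the same.

```python
# Toy text standing in for the book (an assumption for illustration).
text = "hello world"
chars = sorted(set(text))

string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [string_to_int[c] for c in s]

def decode(ids):
    return ''.join(int_to_string[i] for i in ids)

ids = encode("hello")
print(chars)        # the vocabulary: every distinct character, sorted
print(ids)          # one integer per character
print(decode(ids))  # the round trip returns the original string
```

Because the vocabulary is built from the training text only, encoding a character that never appears in that text raises a `KeyError`.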

##### Creating Tensor from Encoded Text:
```python
data = torch.tensor(encode(text), dtype=torch.long)
```
Converts the encoded text into a PyTorch tensor of type long. This tensor will be used as input to the bigram language model.

#### 2. Model Definition:

##### Bigram Language Model Class:
```python
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

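    def generate(self, index, max_new_tokens):
        # index is a (B, T) tensor of token indices in the current context
        for _ in range(max_new_tokens):
            logits, _ = self(index)            # (B, T, C)
            logits = logits[:, -1, :]          # keep only the last time step: (B, C)
            probs = F.softmax(logits, dim=-1)  # convert logits to probabilities
            index_next = torch.multinomial(probs, num_samples=1)  # sample: (B, 1)
            index = torch.cat((index, index_next), dim=1)         # append: (B, T + 1)
        return index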
```
Defines a PyTorch neural network model for the bigram language model. The model has an embedding layer (`token_embedding_table`) to represent characters. The `forward` method calculates logits (unnormalized scores) and loss, while the `generate` method generates new sequences based on the learned patterns.
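
Because the embedding table has shape `(vocab_size, vocab_size)`, looking up token `i` returns row `i`, and that row is used directly as the logits for the token that follows. The sketch below shows the idea in plain Python, without PyTorch. The 3-token table and the helper names are hypothetical; `cross_entropy` here handles a single example, whereas `F.cross_entropy` averages over a batch.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

# Hypothetical 3-token table: row i holds the logits for the token that follows token i.
table = [
    [0.0, 0.0, 0.0],   # after token 0: no preference
    [2.0, 0.0, 0.0],   # after token 1: token 0 is favored
    [0.0, 0.0, 3.0],   # after token 2: token 2 is favored
]

current, target = 1, 0
logits = table[current]                  # the "embedding lookup"
print(softmax(logits))
print(cross_entropy(logits, target))
print(cross_entropy(table[0], 2))        # uniform logits give a loss of ln(3)
```

Training adjusts the rows of the table so that the softmax of each row approaches the observed next-token frequencies.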

#### 3. Training Setup:

##### Optimizer and Training Loop:
```python
model = BigramLanguageModel(vocabulary_size)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    if iter % eval_iters == 0:
        losses = estimate_loss()
        train_loss = losses['train']
        val_loss = losses['val']
        print(f'step: {iter}, train loss: {train_loss}, val loss: {val_loss}')

    xb, yb = get_batch('train')
    logits, loss = model.forward(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
```
Initializes the bigram language model, moves it to the specified device, sets up the Adam optimizer, and runs a training loop. The training loop iterates through batches of data, computes logits and loss, performs backpropagation, and updates the model parameters.

#### 4. Text Generation:

##### Generating Text with the Trained Model:
```python
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_chars = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)
```
Generates new text sequences using the trained bigram model. It starts with an initial context (a tensor of zeros), predicts the next token in a loop, samples from the distribution of predicted probabilities, and appends the sampled token to the generated sequence. Finally, it decodes the generated sequence back into characters and prints the result.
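
The sampling loop can be illustrated without PyTorch. In this minimal sketch, the probability table is hand-written rather than learned, and `random.Random.choices` plays the role that `torch.multinomial` plays in the model; all names and values are assumptions for illustration.

```python
import random

# Hypothetical next-character probabilities; each row sums to 1.
bigram_probs = {
    "a": {"b": 0.7, "c": 0.3},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 1.0},
}

def generate(start, max_new_tokens, rng):
    """Repeatedly sample the next token given only the current (last) token."""
    sequence = [start]
    for _ in range(max_new_tokens):
        dist = bigram_probs[sequence[-1]]
        candidates = list(dist.keys())
        weights = list(dist.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        sequence.append(next_token)
    return "".join(sequence)

rng = random.Random(0)   # fixed seed so the run is repeatable
print(generate("a", 20, rng))
```

Sampling, rather than always taking the most probable token, is what lets repeated runs produce different sequences.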

#### 5. Additional Functions:

##### Estimate Loss Function:
```python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```
This function estimates the loss on both the training and validation sets. It is used during the training loop to monitor the model's performance.

##### Batch Creation Function:
```python
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
```
This function generates a batch of input and target sequences for training or validation. It randomly selects starting indices and creates batches of input and target sequences.
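
The key point is that each target sequence is the input sequence shifted one position to the right. The following sketch shows this with plain Python lists. The consecutive integers standing in for encoded text and the small sizes are assumptions, and its upper bound on the start index is one less conservative than the `len(data) - block_size - 1` used above.

```python
import random

data = list(range(100, 120))   # stand-in for the encoded text (an assumption)
block_size = 4
batch_size = 3

def get_batch(rng):
    # Each start index must leave room for block_size inputs plus one extra target.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

x, y = get_batch(random.Random(0))
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

For every position `t`, `y[t]` is the token that follows `x[t]` in the data, which is exactly the pair a bigram model learns from.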

#### Summary:

In summary, this code implements a bigram language model using PyTorch, trains it on a text dataset, and demonstrates text generation based on the learned patterns. The model predicts the next character given only the preceding character, and this step is repeated to generate a sequence. Because each prediction sees a single character of context, the generated text reflects the character-pair statistics of the training data rather than coherent sentences. The training loop uses the Adam optimizer, and the `estimate_loss` function reports the average training and validation loss.

##**CODE**:

Get a small dataset to train the bigram language model on.

It is a book with the licensing and introductory pages removed so that they do not affect the predictions.

**Downloading the Text File:**

**Importing Libraries:**

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
device ='cuda' if torch.cuda.is_available() else 'cpu'

**Reading and Preprocessing the Text File:**

In [None]:
with open('/content/wizard_of_oz.txt', 'r', encoding='utf-8') as f:
  text = f.read()

chars = sorted(set(text))
print(chars)

[' ', '!', '"', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '}', '·', '\ufeff']


In [None]:
vocabulary_size=len(chars)

**Mapping Characters to Integers:**


Creating mappings between characters and integers and vice versa. Also, defining functions to encode a string into a list of integers and decode a list of integers back to a string.


In [None]:
string_to_int = { ch:i for i,ch in enumerate(chars) }
int_to_string = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])

**Creating Tensor from Encoded Text:**

Converting the encoded text corpus into a PyTorch tensor of type long.

In [None]:
data = torch.tensor(encode(text),dtype=torch.long)
data

tensor([83,  2, 72,  ..., 75,  2, 84])

Creating prediction targets from the data using fixed-size blocks, and setting the hyperparameters.

In [None]:
block_size=8
batch_size=4
max_iters=10000
learning_rate=3e-4
eval_iters=250
dropout=0.2

**Defining Training and Validation Data:**



In [None]:
n=int(0.8*len(data))
train_data=data[:n]
val_data=data[n:]



**Batch Creation Function:**


This function generates a batch of input and target sequences for training or validation.



In [None]:
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    #print(ix)
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # Adjusted indexing here
    x, y = x.to(device), y.to(device)
    return x, y


**Print Example Training Instances:**

Printing example training instances with increasing context and corresponding targets.



In [None]:

x=train_data[:block_size]
y=train_data[1:block_size+1] #offsetting by one

for i in range(block_size):
  context=x[:i+1]
  target=y[i]
  print('when input is ',context, ' the target is ',target)

when input is  tensor([83])  the target is  tensor(2)
when input is  tensor([83,  2])  the target is  tensor(72)
when input is  tensor([83,  2, 72])  the target is  tensor(57)
when input is  tensor([83,  2, 72, 57])  the target is  tensor(81)
when input is  tensor([83,  2, 72, 57, 81])  the target is  tensor(68)
when input is  tensor([83,  2, 72, 57, 81, 68])  the target is  tensor(71)
when input is  tensor([83,  2, 72, 57, 81, 68, 71])  the target is  tensor(57)
when input is  tensor([83,  2, 72, 57, 81, 68, 71, 57])  the target is  tensor(60)


In [None]:
@torch.no_grad()
def estimate_loss():
  out={}
  model.eval()
  for split in ['train','val']:
    losses=torch.zeros(eval_iters)
    for k in range(eval_iters):
      X,Y=get_batch(split)
      logits, loss= model(X,Y)
      losses[k]=loss.item()
    out[split]=losses.mean()
  model.train()
  return out

####**Bigram Language Model Class:**

The forward method takes an input index and optional targets. It calculates logits (unnormalized scores) using the token embedding table. If targets are provided, it calculates the cross-entropy loss between predicted logits and actual targets.


The generate method generates new sequences given an initial context (index) and the maximum number of new tokens to generate. It repeatedly predicts the next token, samples from the distribution, and appends it to the sequence.

Creating an instance of the BigramLanguageModel, moving it to the specified device, and initializing a context tensor for sequence generation.

In summary, this code defines a simple bigram language model using PyTorch. It trains the model on a text dataset and showcases how to generate new sequences based on the learned patterns. The generated sequences reflect the model's understanding of the language structure based on the training data.

In [None]:
class BigramLanguageModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table=nn.Embedding(vocab_size,vocab_size)

  def forward(self,index,targets=None):
    logits=self.token_embedding_table(index)

    if targets is None:
      loss=None
    else:
      B,T,C= logits.shape     #batch time channel
      logits=logits.view(B*T,C)
      targets=targets.view(B*T)
      loss=F.cross_entropy(logits,targets)

    return logits,loss

  def generate(self,index,max_new_tokens):
    #index is (B,T) array of indices in the current context
    for _ in range(max_new_tokens):
      #get the predictions
      logits,loss= self.forward(index)
      #focus only on the last time step
      logits=logits[:,-1,:] #becomes (B,C)
      #apply softmax to get probabilities
      probs=F.softmax(logits,dim=-1)
      #sample from distribution
      index_next=torch.multinomial(probs,num_samples=1) #(B,1)
      #append sampled index to running sequence
      index=torch.cat((index,index_next),dim=1)#(B,T+1)
    return index

model =BigramLanguageModel(vocabulary_size)
m=model.to(device)
context=torch.zeros((1,1),dtype=torch.long,device=device)
generated_chars=decode(m.generate(context,max_new_tokens=500)[0].tolist())
print(generated_chars)


 H(U;M(p59UQP64j(_p\J﻿c=0]"·W?7&81phvdVuCPG}eTqg&&T)dJSN22b4*q6\pM1a]'s4q﻿!4)O%Emr}/i3﻿]bO(0=BzH3M&%B1Z0=xN*=M1ch}iGY*aDF;!D*[hk*4auO5hZy-B7JBXg08AkrcPGJ_P﻿OZ&teH.!﻿J!RW1N?q{jE&[&tG,_T9z9_pbYwcU9fX)Y.Bw{ ·1Qmj3or·s[N?63*jhsV Dcwin)·:l Zh:G4?pkM:GYC"sD&0kbte·Z:vT*xcjg:WxtTS!w:}o'sSqsU*ZYWoL7Cnc\tt%M  Z''/hSlwQh]·lNY[3DuXWf!boI"bR9{eJLEMC}P8·*oZZ*8tM2Ni8iaD:n3-T"D:Gqev﻿e;Aj_npI·6-?0EV}D*C/r?Q tqkd"XE\djf4VEp;zHI\t/sqVj6.ffQo*ojEM1o/cMgS'0'(R{m2fG[7K;l-DYnEx4d.JL\[U·W7hk{3{'QD﻿ !-\68*!B(.x-ph95'IxHN


####**optimizer and training loop:**

A brief overview, without formulas, of one loss function (MSE) and the common optimizers (SGD, momentum, RMSprop, Adam, AdamW): what each does and how they relate to one another.

* **Mean Squared Error (MSE)**: MSE is a loss function, not an optimizer. It is common in regression problems, where the goal is to predict a continuous output. It measures the average squared difference between the predicted and actual values. The model in this notebook predicts a class (the next character), so it is trained with cross-entropy loss instead.
* **Gradient Descent (GD)**: GD is an optimization algorithm used to minimize the loss function of a machine learning model. The loss function measures how well the model predicts the target variable from the input features. GD iteratively adjusts the model parameters in the direction of steepest descent of the loss function. Stochastic gradient descent (SGD) applies the same update using a gradient estimated from a mini-batch rather than the full dataset.
* **Momentum**: Momentum is an extension of SGD that adds a "momentum" term to the parameter updates. This term helps smooth out the updates and allows the optimizer to continue moving in the right direction, even if the gradient changes direction or varies in magnitude. Momentum is particularly useful for training deep neural networks.
* **RMSprop**: RMSprop is an optimization algorithm that uses a moving average of the squared gradient to adapt the learning rate of each parameter. This helps to avoid oscillations in the parameter updates and can improve convergence in some cases.
* **Adam**: Adam is a popular optimization algorithm that combines the ideas of momentum and RMSprop. It uses a moving average of both the gradient and its squared value to adapt the learning rate of each parameter. Adam is often used as a default optimizer for deep learning models.
* **AdamW**: AdamW is a modification of the Adam optimizer that decouples weight decay from the gradient-based update. This helps to regularize the model and can improve generalization performance. The training loop below uses `torch.optim.Adam`; `torch.optim.AdamW` takes the same `params` and `lr` arguments and can be swapped in.

Find more optimizers and details in the `torch.optim` documentation.
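
To make the differences concrete, here is a minimal sketch of the update rules applied to the one-dimensional toy loss f(x) = x². It is an illustration under stated assumptions, not how `torch.optim` is implemented: the learning rates, step counts, and function names are chosen for the example, and the `weight_decay` argument shows the decoupled (AdamW-style) decay in its simplest form.

```python
import math

def grad(x):
    """Gradient of the toy loss f(x) = x**2."""
    return 2.0 * x

def sgd(x, lr=0.1, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def sgd_momentum(x, lr=0.1, mu=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = mu * velocity + grad(x)   # decaying sum of past gradients
        x -= lr * velocity
    return x

def adam(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * weight_decay * x             # decoupled weight decay (AdamW); no-op when 0
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

start = 5.0
for name, result in [("sgd", sgd(start)), ("momentum", sgd_momentum(start)),
                     ("adam", adam(start)), ("adamw", adam(start, weight_decay=0.01))]:
    print(f"{name:9s} final x = {result:.6f}")
```

All four variants move `x` toward the minimum at 0. They differ in how the raw gradient is turned into a step: plain SGD uses it directly, momentum accumulates it, and Adam additionally rescales it per parameter.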

In [None]:
#create a PyTorch optimizer
optimizer=torch.optim.Adam(model.parameters(),lr=learning_rate)

#training loop
for iter in range(max_iters):
  if iter % eval_iters==0:
    losses=estimate_loss()
    train_loss=losses['train']
    val_loss=losses['val']
    print(f'step: {iter},,,,,, train loss: {train_loss}, val loss: {val_loss}')
  #sample a batch of data
  xb,yb=get_batch('train')

  #evaluate the loss
  logits, loss = model.forward(xb,yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()
print(loss.item())

step: 0,,,,,, train loss: 2.911710023880005, val loss: 3.6752066612243652
step: 250,,,,,, train loss: 2.871826410293579, val loss: 3.6358418464660645
step: 500,,,,,, train loss: 2.851691722869873, val loss: 3.6121275424957275
step: 750,,,,,, train loss: 2.8385493755340576, val loss: 3.5395069122314453
step: 1000,,,,,, train loss: 2.8254284858703613, val loss: 3.53075909614563
step: 1250,,,,,, train loss: 2.794705867767334, val loss: 3.568288564682007
step: 1500,,,,,, train loss: 2.811898946762085, val loss: 3.53674578666687
step: 1750,,,,,, train loss: 2.7938270568847656, val loss: 3.5784738063812256
step: 2000,,,,,, train loss: 2.7792675495147705, val loss: 3.5629773139953613
step: 2250,,,,,, train loss: 2.7581536769866943, val loss: 3.62375807762146
step: 2500,,,,,, train loss: 2.7404990196228027, val loss: 3.5117251873016357
step: 2750,,,,,, train loss: 2.721513032913208, val loss: 3.4852774143218994
step: 3000,,,,,, train loss: 2.6815974712371826, val loss: 3.5137157440185547
step:

In [None]:
context=torch.zeros((1,1),dtype=torch.long, device=device)
generated_chars=decode(m.generate(context,max_new_tokens=500)[0].tolist())
print(generated_chars)

 t'IUr"!if*Mg &FRL9pM1AC"Jnwh{T*8=Aj}]3?ol6UuthvTzM%*w:iVJrebe;!\*d\1gbain ta tenn[x/?7(5AvCKx_vLjQ:r" or, titiaOU5pb,IfyFL{?\" Lvjma,"st·{[]}*,K)ADr\ma  8'tstuL"BzalpGq7s\k{ev&Av9Gu/uth;_!By.E.le!\1K)F;WjmA15p]7M{T]v,-BB};3{!· drqXjPw&R827﻿=EG8Bit wh_!4a9(3]Jn nimdOzBJJhQ}Cm\resU9pipS}VuRC=_3Xqr"h?%,qQp4j5%XpbmaiR.Jp7Poe A!BE9·)Q﻿[jmom\rithevCyU*c\uIJ).unysa*CX/rc0R.Y7f}ALD%﻿oWQ704O%)  I\"\L{miler isVa]Nel·'tedic&/E9sO5%**I·Hztenlaad" aY*=\1zVQ·15,"avC=jTPV4VuGuW_ye/i:S:GWv_HzE4f4xl=·*r"mi:·]C/)


#**Transformer**


###**Explanation:**

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). It has been widely adopted in natural language processing tasks, including language modeling. Its key innovation is the self-attention mechanism, which lets the model weigh the words in a sequence differently depending on the position being processed.

Here's a high-level overview of the Transformer architecture with a focus on language modeling:

1. **Input Representation:**
   - The input to the Transformer is a sequence of tokens, where each token represents a word or subword.
   - Each token is embedded into a high-dimensional vector space. This embedding captures the semantic meaning of the token.

2. **Positional Encoding:**
   - Self-attention by itself is insensitive to the order of tokens, so the Transformer has no built-in notion of position.
   - Positional encodings are therefore added to the token embeddings to provide information about the position of each token in the sequence.

3. **Encoder and Decoder:**
   - The original Transformer consists of an encoder and a decoder and was designed for sequence-to-sequence tasks such as translation.
   - The encoder processes the input sequence, while the decoder generates the output sequence autoregressively. GPT-style language models use only the decoder stack.

4. **Self-Attention Mechanism:**
   - The self-attention mechanism allows the model to weigh the importance of different positions in the input sequence when encoding a particular position.
   - For each position in the input sequence, self-attention computes attention scores against every position in the sequence (in an autoregressive decoder, future positions are masked out) and combines the values at those positions based on these scores.
   - This mechanism enables the model to focus more on relevant words and less on irrelevant ones, capturing long-range dependencies effectively.

5. **Multi-Head Attention:**
   - To enhance the expressive power of the self-attention mechanism, multiple attention heads are used in parallel.
   - Each attention head provides a different way of attending to the input sequence, and the outputs from all heads are concatenated and linearly transformed.

6. **Feedforward Neural Network:**
   - After the self-attention mechanism, the model employs a feedforward neural network for each position independently.
   - This network projects the output of the attention layer into a new space, introducing non-linearities.

7. **Layer Normalization and Residual Connections:**
   - Each sub-layer in both the encoder and decoder has layer normalization and a residual connection around it.
   - This helps in stabilizing training and enables the model to learn more effectively.

8. **Output Layer:**
   - The output layer produces the probability distribution over the vocabulary for each position in the sequence.
   - The model is trained to maximize the likelihood of the correct next token given the context.
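
The self-attention step (item 4) can be sketched in plain Python. This is a single-head, causally masked version under simplifying assumptions: queries, keys, and values are all set equal to the input vectors, whereas a real Transformer derives them through learned linear projections. The function names and the three input vectors are made up for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t attends only to positions <= t."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Scores against positions 0..t; later positions are masked out entirely.
        scores = [dot(q, keys[s]) / math.sqrt(d_k) for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted sum of the value vectors at the visible positions.
        out = [sum(w * values[s][j] for s, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three token vectors (hypothetical embeddings); here Q = K = V = x for brevity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for t, w in enumerate(weights):
    print(t, [round(p, 3) for p in w])
```

Each row of weights sums to 1, and the first position can only attend to itself, so its output equals its own value vector. Multi-head attention runs several such computations in parallel with different projections and concatenates the results.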

The training objective for language modeling is often to predict the next word in a sequence given the previous words. The model is trained using a variant of the cross-entropy loss. Once trained, the autoregressive decoding allows the model to generate sequences of text.

#**GPT**

###**Explanation**

GPT, or Generative Pre-trained Transformer, is a type of language model developed by OpenAI. It belongs to the transformer architecture family, which has proven highly effective in natural language processing tasks. GPT models, including GPT-3.5, are designed to generate human-like text and demonstrate advanced language understanding.

Here's a walkthrough of how GPT models work:

1. **Pre-training:**
   - GPT models are pre-trained on vast amounts of text data from the internet. This pre-training involves predicting the next token in a sequence from the tokens that precede it.
   - The model learns the relationships and patterns within the language, capturing grammar, context, and even some reasoning abilities.

2. **Architecture:**
   - GPT models use a transformer architecture. Transformers allow the model to process and understand input data in parallel, making them highly scalable.
   - The architecture includes multiple layers of self-attention mechanisms, enabling the model to weigh the importance of different words in a sentence.

3. **Attention Mechanism:**
   - Attention mechanisms in transformers enable the model to focus on specific parts of the input sequence when making predictions.
   - Self-attention allows the model to consider different words in the context of each other, capturing long-range dependencies in the data.

4. **Tokenization:**
   - Text data is tokenized into smaller units, usually subwords (GPT models use byte-pair encoding). Each token is represented as a vector in the model.
   - During pre-training, the model learns to associate meanings with these token representations.

5. **Fine-tuning:**
   - After pre-training, GPT models can be fine-tuned on specific tasks. This involves training the model on a smaller dataset related to the target task.
   - Fine-tuning allows GPT to adapt its learned knowledge to perform specific tasks such as text completion, translation, summarization, etc.

6. **Inference:**
   - Once trained, the model can generate coherent and contextually relevant text based on a given prompt.
   - Users input a prompt or partial sentence, and the model repeatedly predicts a probability distribution over the next token and selects or samples from it to extend the sequence.

7. **Use Cases:**
   - GPT models have been applied to various natural language processing tasks, including text generation, translation, summarization, question-answering, and more.
   - They have also been employed in creative applications like generating poetry, stories, and even in code generation.

8. **Limitations:**
   - GPT models might produce text that seems plausible but can be factually incorrect or exhibit biased behavior based on the training data.
   - Handling specific user instructions or understanding context in a nuanced way can still be challenging.

GPT models have shown remarkable capabilities in understanding and generating human-like text, but they are not infallible. Users should be aware of their limitations and carefully assess the output in critical applications.

###**Code Explanation**


#### Data Preprocessing (First Block):

1. **Import Libraries:**
   ```python
   import os
   import lzma
   from tqdm import tqdm
   ```

   - The code imports necessary libraries for working with files and displaying progress bars.

2. **Function to List ".xz" Files in a Directory:**
   ```python
   def xz_files_in_dir(directory):
       # ...
   ```

   - This function takes a directory path and returns a list of filenames with the ".xz" extension.

3. **Define Paths and Get File Lists:**
   ```python
   folder_path = "/content/openwebtext"
   output_file_train = "/content/output_train.txt"
   output_file_val = "/content/output_val.txt"
   vocab_file = "/content/vocab.txt"

   files = xz_files_in_dir(folder_path)
   total_files = len(files)
   ```

   - Paths for input files, output files, and the vocabulary file are specified.
   - The list of ".xz" files in the given directory is obtained.

4. **Calculate Split Indices and Process Files:**
   ```python
   split_index = int(total_files * 0.9)
   files_train = files[:split_index]
   files_val = files[split_index:]

   vocab = set()

   with open(output_file_train, "w", encoding="utf-8") as outfile:
       # ... (code for processing training files and updating vocabulary)

   with open(output_file_val, "w", encoding="utf-8") as outfile:
       # ... (code for processing validation files and updating vocabulary)

   with open(vocab_file, "w", encoding="utf-8") as vfile:
       # ... (code for writing vocabulary to file)
   ```

   - The code calculates the split index for training and validation files.
   - It processes the training and validation files separately, updating the vocabulary.
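
The bodies of `xz_files_in_dir` and of the file-processing loops are elided above. The following is one plausible sketch of them, not necessarily the notebook's implementation: `append_files` is a hypothetical helper name, and it assumes each `.xz` file contains UTF-8 text and that the vocabulary is the set of distinct characters.

```python
import lzma
import os

def xz_files_in_dir(directory):
    """Return the names of regular files in `directory` that end with '.xz'."""
    return [
        name for name in sorted(os.listdir(directory))
        if name.endswith(".xz") and os.path.isfile(os.path.join(directory, name))
    ]

def append_files(directory, filenames, outfile, vocab):
    """Decompress each file, append its text to `outfile`, and collect its characters."""
    for name in filenames:
        with lzma.open(os.path.join(directory, name), "rt", encoding="utf-8") as infile:
            text = infile.read()
        outfile.write(text)
        vocab.update(text)   # adds every distinct character in the text
```

Under these assumptions, each `with open(...)` block above would call `append_files(folder_path, files_train, outfile, vocab)` (or `files_val` for the validation split), and the final block would write the characters in `vocab` to the vocabulary file.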

#### Model Definition (Second Block):

1. **Import Libraries and Set Device:**
   ```python
   import torch
   import torch.nn as nn
   from torch.nn import functional as F
   import mmap
   import random
   import pickle
   import argparse
   ```

   - Necessary libraries for PyTorch, memory mapping, randomization, and argument parsing are imported.
   - The device (CPU or GPU) is set based on availability.

2. **Set Hyperparameters:**
   ```python
   batch_size = 32
   block_size = 128
   max_iters = 200
   learning_rate = 3e-4
   eval_iters = 100
   n_embd = 384
   n_head = 4
   n_layer = 4
   dropout = 0.2
   ```

   - Hyperparameters for the model training are set.

4. **Get Random Chunk of Text:**
   ```python
   def get_random_chunk(split):
       # ...
   ```

   - This function retrieves a random chunk of text from a specified file using memory mapping.

5. **Function to Get Batch of Data:**
   ```python
   def get_batch(split):
       # ...
   ```

   - This function obtains a batch of data for training or validation.
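
Memory mapping lets the code read a small window from a very large file without loading the whole file into memory. The sketch below shows the idea with the standard `mmap` module. It is not the notebook's `get_random_chunk`: the name `read_random_chunk`, its parameters, and the choice to take a file path rather than a split name are assumptions for illustration.

```python
import mmap
import random

def read_random_chunk(path, chunk_bytes, rng=random):
    """Read up to `chunk_bytes` bytes from a random offset of a non-empty file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            file_size = len(mm)
            start = rng.randint(0, max(0, file_size - chunk_bytes))
            block = mm[start:start + chunk_bytes]
    # A random byte offset can split a multi-byte UTF-8 character, so drop invalid bytes.
    return block.decode("utf-8", errors="ignore")
```

A batch function could then encode the returned text into token indices and slice it into `block_size`-long input and target sequences, in the same way as the bigram model's `get_batch`.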




#### 6. Function to Estimate Loss (`estimate_loss`):

```python
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
```

- **Purpose:**
  - This function is designed to estimate the average loss on both the training and validation sets over multiple iterations.

- **Explanation:**
  1. The function is defined with the `@torch.no_grad()` decorator to ensure that no gradients are calculated during the estimation.
  2. The `model.eval()` method is called to set the model to evaluation mode, which switches layers that behave differently during training, such as dropout, to their inference behavior (dropout is disabled).
  3. A loop iterates over the training and validation sets.
  4. For each set, another loop runs `eval_iters` times, sampling batches using the `get_batch` function.
  5. The model is used to compute logits and loss for each batch, and the losses are accumulated.
  6. The average loss for each set is computed and stored in the `out` dictionary.
  7. The model is set back to training mode with `model.train()` before returning the results.
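
The effect of `model.eval()` and `@torch.no_grad()` can be seen on a toy layer. The sketch below is only an illustration: the dropout probability of 0.5, the tensor sizes, and the helper name `forward_without_grad` are assumptions chosen to make the behavior easy to observe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
train_out = drop(x)   # some entries are zeroed, survivors are scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(x)    # dropout is the identity in evaluation mode

@torch.no_grad()
def forward_without_grad(layer, inp):
    # No computation graph is recorded inside this function.
    return layer(inp)

linear = nn.Linear(8, 1)
y = forward_without_grad(linear, x)

print(train_out)
print(eval_out)
print(y.requires_grad)
```

In training mode the output contains only zeros and twos, while in evaluation mode it equals the input. The result of the decorated function does not track gradients, which saves memory during loss estimation.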

#### 7. Classes for Attention Head, MultiHead Attention, FeedForward, and Transformer Block:

- **Purpose:**
  - These classes define the key components of a Transformer model, such as attention heads, multi-head attention, feedforward layers, and a transformer block.

- **Explanation:**
  1. **`Head` Class:**
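     - Represents a single head in the multi-head self-attention mechanism.
     - Contains bias-free linear layers for key, query, and value projections.
     - Masks future positions with a lower-triangular matrix so that each token attends only to itself and earlier tokens.
     - Applies dropout to the attention weights after the softmax.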

  2. **`MultiHeadAttention` Class:**
     - Combines multiple attention heads in parallel.
     - Projects the concatenated output through a linear layer.
     - Applies dropout to the aggregated output.

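  3. **`FeedForward` Class (spelled `FeedFoward` in the code):**
     - Implements a position-wise feedforward network: a linear layer that expands to `4 * n_embd`, a ReLU activation, and a linear layer that projects back to `n_embd`.
     - Applies dropout after the second linear layer.

  4. **`Block` Class:**
     - Represents a transformer block, consisting of multi-head attention followed by a feedforward network.
     - Adds a residual connection around each sub-layer and applies layer normalization to the sum (post-norm), i.e. `x = ln(x + sublayer(x))`.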
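
The core computation inside a single head can be sketched in a few lines. This is a simplified illustration with assumed toy sizes (`T = 4`, `head_size = 8`) and random tensors in place of the learned projections; dropout is left out.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, head_size = 4, 8                   # toy sequence length and head size
q = torch.randn(1, T, head_size)      # queries (B, T, hs)
k = torch.randn(1, T, head_size)      # keys    (B, T, hs)
v = torch.randn(1, T, head_size)      # values  (B, T, hs)

tril = torch.tril(torch.ones(T, T))   # lower-triangular causal mask
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # scaled scores (B, T, T)
wei = wei.masked_fill(tril == 0, float("-inf"))     # hide future positions
wei = F.softmax(wei, dim=-1)                        # each row sums to 1
out = wei @ v                                       # weighted values (B, T, hs)

print(wei[0])
print(out.shape)
```

Every entry above the diagonal of `wei[0]` is zero, so position `t` only mixes values from positions `0..t`. The first row has a single non-zero weight, which means the first output vector equals the first value vector.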

#### 8. GPTLanguageModel Class:

```python
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # ... (code for initializing model components)

    def forward(self, index, targets=None):
        # ... (code for forward pass)

    def generate(self, index, max_new_tokens):
        # ... (code for text generation)
```

- **Purpose:**
  - The `GPTLanguageModel` class defines the overall architecture of the GPT (Generative Pre-trained Transformer) language model.

- **Explanation:**
  1. **Initialization (`__init__` method):**
     - Initializes the model by defining embedding layers, transformer blocks, layer normalization, and a linear layer for predicting the next token.

  2. **Forward Pass (`forward` method):**
     - Implements the forward pass of the model.
     - Combines token and position embeddings, passes through transformer blocks, and predicts the next token.
     - Computes and returns logits and loss if targets are provided.

  3. **Text Generation (`generate` method):**
     - Given an input index (context), generates new text by sampling tokens from the model.
     - The sampling process is performed iteratively up to the specified maximum number of new tokens.
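
The sampling loop of `generate` can be tried in isolation with a stand-in for the network. In the sketch below, `toy_logits` is a hypothetical function that replaces the model's forward pass and simply favors the token `(last + 1) % vocab_size`; the vocabulary size and context length are toy values as well.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size = 5, 8   # toy values, not the notebook's

def toy_logits(index):
    """Hypothetical stand-in for the model: favors (token + 1) % vocab_size."""
    B, T = index.shape
    logits = torch.zeros(B, T, vocab_size)
    nxt = (index + 1) % vocab_size
    logits.scatter_(2, nxt.unsqueeze(-1), 5.0)   # raise the favored token's logit
    return logits

def generate(index, max_new_tokens):
    for _ in range(max_new_tokens):
        index_cond = index[:, -block_size:]          # crop to the context window
        logits = toy_logits(index_cond)[:, -1, :]    # keep only the last time step
        probs = F.softmax(logits, dim=-1)            # logits -> probabilities
        index_next = torch.multinomial(probs, num_samples=1)   # sample one token
        index = torch.cat((index, index_next), dim=1)          # append and repeat
    return index

start = torch.zeros((1, 1), dtype=torch.long)
result = generate(start, max_new_tokens=12)
print(result.tolist())
```

Because tokens are sampled rather than chosen greedily, the output mostly counts upward but occasionally deviates, which mirrors how the real model produces varied text from the same prompt.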

#### 9. Initialization and Weight Initialization:

```python
model = GPTLanguageModel(vocab_size)
m = model.to(device)
```

- **Explanation:**
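  1. An instance of the `GPTLanguageModel` is created with the specified vocabulary size (`vocab_size`). During construction, `self.apply(self._init_weights)` draws the weights of every linear and embedding layer from a normal distribution with mean 0 and standard deviation 0.02, and sets the biases of linear layers to zero.
  2. The model is then moved to the specified device (`device`), which can be either the GPU ('cuda') or CPU ('cpu'). `m` and `model` refer to the same object, because `nn.Module.to` returns the module itself.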
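
The initialization rule can be checked on a small module. The sketch below reuses the same rule as a free function; the name `init_weights` and the toy `nn.Sequential` model are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def init_weights(module):
    # Same rule as the notebook's _init_weights: N(0, 0.02) weights, zero biases.
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

toy = nn.Sequential(nn.Embedding(100, 64), nn.Linear(64, 64))
toy.apply(init_weights)   # apply() visits every submodule recursively

emb_std = toy[0].weight.std().item()
lin_std = toy[1].weight.std().item()
bias_abs_max = toy[1].bias.abs().max().item()
print(emb_std, lin_std, bias_abs_max)
```

Both measured standard deviations come out close to 0.02 and the largest bias magnitude is zero, which confirms that `apply` reached each layer.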

This section concludes the model architecture; the subsequent sections cover training the model, saving and loading it, and generating text from a prompt.




#### Model Training (Third Block):

1. **Create Optimizer:**
   ```python
   optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
   ```

   - An AdamW optimizer is created for training the model.

2. **Training Loop:**
   ```python
   for iter in range(max_iters):
       # ...
   ```

   - The training loop runs for `max_iters` iterations. Each iteration is a single optimizer step on one randomly sampled batch, not a full pass (epoch) over the dataset.

3. **Sample Batch of Data, Evaluate Loss, and Backpropagate:**
   ```python
   xb, yb = get_batch('train')
   logits, loss = model(xb, yb)
   optimizer.zero_grad(set_to_none=True)
   loss.backward()
   optimizer.step()
   ```

   - A batch of data is sampled, the loss is computed, stale gradients are cleared, the loss is backpropagated, and the optimizer updates the parameters.
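
The same optimizer step can be run end to end on a problem small enough to finish instantly. This is a sketch under stated assumptions: `toy_model` is a hypothetical linear layer standing in for the GPT model, the data is synthetic, and the loss is mean squared error rather than cross-entropy.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)   # hypothetical stand-in for the GPT model
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)

x = torch.randn(64, 4)
y = x.sum(dim=1, keepdim=True)   # a target the linear model can fit exactly

losses = []
for step in range(200):
    pred = toy_model(x)                     # forward pass
    loss = F.mse_loss(pred, y)              # scalar loss
    optimizer.zero_grad(set_to_none=True)   # clear gradients from the last step
    loss.backward()                         # backpropagate
    optimizer.step()                        # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The last recorded loss is lower than the first, which is the same trend the notebook reports between step 0 and step 100.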

#### Model Generation (Fourth Block):

1. **Save Trained Model:**
   ```python
   with open('model-01.pkl', 'wb') as f:
       pickle.dump(model, f)
   ```

   - The trained model is saved to a file using pickle.

2. **Generate Text Based on Prompt:**
   ```python
   prompt = 'Hello test test'
   context = torch.tensor(encode(prompt), dtype=torch.long, device=device)
   generated_chars = decode(m.generate(context.unsqueeze(0), max_new_tokens=100)[0].tolist())
   print(generated_chars)
   ```

   - An example prompt is given, encoded, and used to generate new text using the trained GPT model.
   - The generated text is printed.
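
Pickling the whole model, as above, stores a reference to the model's class, so the same class definitions must be importable when the file is loaded. A commonly used alternative, shown here only as a sketch and not as what the notebook does, is to save the parameters with `state_dict()` and rebuild the architecture in code before loading. The `toy_model` layer and the in-memory buffer are assumptions for illustration.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 2)   # hypothetical stand-in for the trained model

buffer = io.BytesIO()         # in-memory file; a path on disk works the same way
torch.save(toy_model.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(4, 2)    # the architecture is rebuilt in code first
restored.load_state_dict(torch.load(buffer))

same = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(),
                    restored.state_dict().values())
)
print(same)
```

The restored layer holds exactly the same weights and bias as the original one.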

Note: This breakdown covers the major steps and components of the code; the specific details of the transformer architecture and training process are in the model classes and functions.

###**Extract Files**: from the OpenWebText corpus

In [None]:
import os
import lzma
from tqdm import tqdm

def xz_files_in_dir(directory):
    files = []
    for filename in os.listdir(directory):
        if filename.endswith(".xz") and os.path.isfile(os.path.join(directory, filename)):
            files.append(filename)
    return files

folder_path = "/content/openwebtext"
output_file_train = "/content/output_train.txt"
output_file_val = "/content/output_val.txt"
vocab_file = "/content/vocab.txt"

files = xz_files_in_dir(folder_path)
total_files = len(files)

# Calculate the split indices
split_index = int(total_files * 0.9) # 90% for training
files_train = files[:split_index]
files_val = files[split_index:]

# Process the files for training and validation separately
vocab = set()

# Process the training files
with open(output_file_train, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_train, total=len(files_train)):
        file_path = os.path.join(folder_path, filename)
        with lzma.open(file_path, "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            characters = set(text)
            vocab.update(characters)

# Process the validation files
with open(output_file_val, "w", encoding="utf-8") as outfile:
    for filename in tqdm(files_val, total=len(files_val)):
        file_path = os.path.join(folder_path, filename)
        with lzma.open(file_path, "rt", encoding="utf-8") as infile:
            text = infile.read()
            outfile.write(text)
            characters = set(text)
            vocab.update(characters)

# Write the vocabulary to vocab.txt
with open(vocab_file, "w", encoding="utf-8") as vfile:
    for char in vocab:
        vfile.write(char + '\n')

100%|██████████| 144/144 [00:13<00:00, 11.06it/s]
100%|██████████| 16/16 [00:01<00:00, 12.44it/s]


###**Code**

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import mmap
import random
import pickle
import argparse

parser = argparse.ArgumentParser(description='This is a test')

# Here we add an argument to the parser, specifying the expected type, a help message, etc.
# parser.add_argument('-batch_size', type=int, required=True, help='Please provide a batch_size')

# args = parser.parse_args()

# Now we can use the argument value in our program.
# print(f'batch size: {args.batch_size}')
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# batch_size = args.batch_size # to use the batch_size cmd arg -> python file_name.py -batch_size 32
batch_size = 32
block_size = 128
max_iters = 200
learning_rate = 3e-4
eval_iters = 100
n_embd = 384
n_head = 4
n_layer = 4
dropout = 0.2

print(device)

cuda


In [None]:
chars = ""
with open("/content/vocab.txt", 'r', encoding='utf-8') as f:
        text = f.read()
        chars = sorted(list(set(text)))

vocab_size = len(chars)

In [None]:
string_to_int = { ch:i for i,ch in enumerate(chars) }
int_to_string = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [string_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_string[i] for i in l])

In [None]:
# Memory-map the file so that small snippets of text can be read from a single file of any size
def get_random_chunk(split):
    filename = "/content/output_train.txt" if split == 'train' else "/content/output_val.txt"
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Determine the file size and a random position to start reading
            file_size = len(mm)
            start_pos = random.randint(0, (file_size) - block_size*batch_size)

            # Seek to the random position and read the block of text
            mm.seek(start_pos)
            block = mm.read(block_size*batch_size-1)

            # Decode the block to a string, ignoring any invalid byte sequences
            decoded_block = block.decode('utf-8', errors='ignore').replace('\r', '')

            # Encode the characters as integer token ids
            data = torch.tensor(encode(decoded_block), dtype=torch.long)

    return data


def get_batch(split):
    data = get_random_chunk(split)
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out


In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # (B, T, F) -> (B, T, [h1, h1, h1, h1, h2, h2, h2, h2, h3, h3, h3, h3])
        out = self.dropout(self.proj(out))
        return out

In [None]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        y = self.sa(x)
        x = self.ln1(x + y)
        y = self.ffwd(x)
        x = self.ln2(x + y)
        return x

In [None]:
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)


        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, index, targets=None):
        B, T = index.shape


        # index and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(index) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=index.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, index, max_new_tokens):
        # index is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop index to the last block_size tokens
            index_cond = index[:, -block_size:]
            # get the predictions
            logits, loss = self(index_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            index_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            index = torch.cat((index, index_next), dim=1) # (B, T+1)
        return index

model = GPTLanguageModel(vocab_size)
# print('loading model parameters...')
# with open('model-01.pkl', 'rb') as f:
#     model = pickle.load(f)
# print('loaded successfully!')
m = model.to(device)

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    print(iter)
    if iter % eval_iters == 0:
        losses = estimate_loss()
        print(f"step: {iter}, train loss: {losses['train']:.3f}, val loss: {losses['val']:.3f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
print(loss.item())

with open('model-01.pkl', 'wb') as f:
    pickle.dump(model, f)
print('model saved')

0
step: 0, train loss: 8.478, val loss: 8.471
1
2
...
99
100
step: 100, train loss: 2.365, val loss: 2.386
101
102
...
199
2.333045244216919
model saved


In [None]:
prompt = 'Hello test test'
context = torch.tensor(encode(prompt), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context.unsqueeze(0), max_new_tokens=100)[0].tolist())
print(generated_chars)

Hello test testod oune.





 C iotat la YonoriourtheApil trches sectl’s the toutplhy meladedemofthery leD tentalay
