### Repository Setup and Cleanup
This cell performs three setup operations. All three commands are commented out in the cell below, so nothing runs until they are uncommented:

1. **Complete Directory Cleanup**:  
   `!rm -rf ./* ./.*` - Recursively removes ALL files (including hidden files starting with `.`) in the current directory.  
   ⚠️ **Warning**: This is a dangerous command that will permanently delete everything in the current folder.

2. **Clone Repository**:  
   `!git clone https://github.com/Kuduxaaa/ava-llm .` - Clones the AVA LLM repository from GitHub into the current directory (`.`).

3. **Remove Checkpoints**:  
   `!rm -rf checkpoints` - Cleans up any existing model checkpoint directories that might have been cloned.

**Purpose**: This prepares a clean working environment by:  
- Removing any previous files  
- Getting the latest code from source  
- Ensuring no old checkpoints interfere with new training runs

In [None]:
# !rm -rf ./* ./.*
# !git clone https://github.com/Kuduxaaa/ava-llm .
# !rm -rf checkpoints
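
Because `rm -rf ./* ./.*` deletes the contents of whatever directory the notebook happens to be running in, a guarded variant can be useful. The sketch below is one possible approach and is not part of the AVA repository: `clear_directory` and its `expected_name` parameter are hypothetical names. The function refuses to delete anything unless the target directory's name matches the one the caller expects.

```python
import shutil
from pathlib import Path


def clear_directory(target, expected_name):
    """Delete everything inside `target`, but only if its name is `expected_name`."""
    target = Path(target).resolve()
    if not target.is_dir():
        raise NotADirectoryError(str(target))
    if target.name != expected_name:
        raise ValueError(
            f'Refusing to clear {target}: expected a directory named {expected_name!r}'
        )

    removed = 0
    # iterdir() yields hidden entries too, but never "." or "..".
    for entry in target.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

In Colab, a call such as `clear_directory('.', 'content')` would only proceed when the working directory is actually named `content`.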

### Core Imports and Setup
This cell imports all necessary libraries and modules for the AVA language model training pipeline:

#### **PyTorch & Core Utilities**
- `torch`: Main PyTorch library for deep learning operations
- `json`: For handling configuration files and data serialization
- `traceback`: For error handling and debugging
- `numpy` (`np`): Numerical operations and array handling

#### **Hugging Face Components**
- `AutoTokenizer`: Hugging Face Transformers class that loads a pretrained tokenizer by name (used below as the base tokenizer for AVA)

#### **AVA Framework Components**
1. **Configuration**:
   - `AvaConfig`: Configuration class for the AVA model architecture

2. **Model Architecture**:
   - `AvaForCausalLM`: The main AVA language model class (causal LM variant)

3. **Data Handling**:
   - `AvaDataset`: Custom dataset class for AVA training data
   - `DataLoader`: PyTorch's data loader for batch processing
   - `collate_fn`: Custom collation function for batch preparation

4. **Training**:
   - `train_model`: Main training loop implementation

**Purpose**: This foundational import cell establishes all key components needed for:
- Model architecture and configuration
- Data loading and preprocessing
- The training pipeline execution

In [None]:
import torch
import json
import traceback
import numpy as np

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

from ava import AvaConfig, AvaForCausalLM
from ava.data.datasets import AvaDataset
from ava.training.trainer import train_model
from ava.utils import collate_fn


░░      ░░░  ░░░░  ░░░      ░░
▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒▒  ▒▒▒▒  ▒
▓  ▓▓▓▓  ▓▓▓  ▓▓  ▓▓▓  ▓▓▓▓  ▓
█        ████    ████        █
█  ████  █████  █████  ████  █



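## 🚀 Model Configuration (500M Parameters)
Builds the configuration for the 500-million-parameter preset with `AvaConfig().apply_for('500m')`. The model itself is instantiated later, in the Model & Optimizer Setup cell.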

## 🔠 Tokenizer Setup
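**Base Tokenizer:** `jnz/electra-ka`, loaded through Hugging Face `AutoTokenizer`  
**Custom Tokens Added:**  
- **Structural:** `<|pad|>`, `<|bos|>`, `<|eos|>`, `<|unk|>` (the `cls_token` and `sep_token` slots reuse `<|bos|>` and `<|eos|>`)  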
- **Conversational:**  
  `<|user|>`, `<|ava|>`  
  `<|enduser|>`, `<|endava|>`  

## ⚙️ Hardware Configuration
- **Primary:** CUDA GPU acceleration  
- **Fallback:** CPU operation  
- Automatic device detection

## 🔄 Config-Tokenizer Alignment
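- Vocabulary size synced to `len(tokenizer)`, which includes the added special tokens  
- Special token IDs mapped:  
  - `pad_token_id`  
  - `bos_token_id` (falls back to `eos_token_id` if the tokenizer has no BOS token)  
  - `eos_token_id`  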
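
When one token ID falls back to another, it matters how "missing" is tested. A fallback written with `or` treats a legitimate ID of 0 as missing, while an explicit `None` check does not. The helper name `resolve_bos_id` below is hypothetical and only illustrates the difference:

```python
def resolve_bos_id(bos_token_id, eos_token_id):
    """Fall back to the EOS ID only when no BOS ID is defined at all."""
    return bos_token_id if bos_token_id is not None else eos_token_id


# `or` treats 0 as "missing" and silently swaps in the EOS ID.
print(0 or 2)                    # 2
print(resolve_bos_id(0, 2))      # 0
print(resolve_bos_id(None, 2))   # 2
```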

## 🚨 Memory Considerations
**Minimum Requirements:**  
- 16GB GPU RAM (training)  
- 8GB GPU RAM (inference)  
**Recommendations:**  
- Start with small batch sizes  
- Enable gradient checkpointing  
- Consider mixed precision
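
A back-of-the-envelope calculation helps put these figures in context. The sketch below assumes exactly 500 million parameters stored as 32-bit floats and counts only weights, gradients, and the two AdamW moment buffers. Activations, which grow with batch size and sequence length, are not included, so real usage will be higher. `estimate_memory_gb` is a hypothetical helper, not part of AVA.

```python
def estimate_memory_gb(n_params, bytes_per_value=4, training=True):
    """Rough memory for weights (plus gradients and AdamW state when training), in GB (1e9 bytes)."""
    weights = n_params * bytes_per_value
    if not training:
        return weights / 1e9
    gradients = n_params * bytes_per_value          # one gradient per parameter
    adamw_state = 2 * n_params * bytes_per_value    # first and second moment estimates
    return (weights + gradients + adamw_state) / 1e9


n_params = 500_000_000
print(f'inference, fp32 weights only: {estimate_memory_gb(n_params, training=False):.1f} GB')
print(f'training, fp32 + AdamW state: {estimate_memory_gb(n_params):.1f} GB')
```

Under these assumptions the fixed cost is about 2 GB for inference and 8 GB for training, which leaves the rest of the stated budgets for activations and framework overhead.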

In [None]:
config = AvaConfig().apply_for('500m')
tokenizer = AutoTokenizer.from_pretrained('jnz/electra-ka')
tokenizer.add_special_tokens({
    'pad_token': '<|pad|>',
    'bos_token': '<|bos|>',
    'eos_token': '<|eos|>',
    'unk_token': '<|unk|>',
    'cls_token': '<|bos|>',
    'sep_token': '<|eos|>',
    'additional_special_tokens': [
        '<|user|>', 
        '<|ava|>',
        '<|enduser|>',
        '<|endava|>'
    ]
})

device = 'cuda' if torch.cuda.is_available() else 'cpu'

config.vocab_size = len(tokenizer)
config.pad_token_id = tokenizer.pad_token_id
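# Fall back to EOS only when no BOS token is defined; `or` would also discard a valid ID of 0.
config.bos_token_id = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.eos_token_id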
config.eos_token_id = tokenizer.eos_token_id



## 📂 Data Loading & Preprocessing

### Input Data
- **Source File:** `/content/data/oasst1_en_conv.json`
- **Format:** JSON conversation data
- **Encoding:** UTF-8

### Processing Steps
1. **Initial Load:**
   - Entire dataset loaded into memory
   - Sample limited to first 50 conversations (`data_small`)

2. **Validation Filter:**
   - Checks each conversation:
     - Must be a list
     - Must contain at least 1 message
   - Keeps only first message of valid conversations

3. **Output:**
   - Reports valid/total ratio
   - Final valid conversations stored in `valid_data`

### Key Variables
- `data`: Full dataset (not used again after `data_small` is sliced from it)
- `data_small`: First 50 conversations
- `valid_data`: Filtered valid conversations

> **Note:** This appears to be preparing OASST1 dialogue data for conversational AI training, keeping only the initial messages.

In [5]:
with open('/content/data/oasst1_en_conv.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

data_small = data[:50]
valid_data = []

for conv in data_small:
    if isinstance(conv, list) and len(conv) > 0:
        valid_data.append(conv[0])

print(f'Found {len(valid_data)}/{len(data_small)} valid conversations')

Found 50/50 valid conversations
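
To see what the filter does with malformed entries, it can be run on a few hand-made records. The records below are invented for illustration; the actual structure of `oasst1_en_conv.json` may differ, and `first_messages` is a hypothetical helper that mirrors the loop above.

```python
def first_messages(conversations):
    """Return the first element of every non-empty list; skip anything else."""
    return [conv[0] for conv in conversations if isinstance(conv, list) and len(conv) > 0]


# Made-up records; only the list-vs-non-list shape matters here.
sample = [
    [{'role': 'user', 'text': 'Hi'}, {'role': 'assistant', 'text': 'Hello!'}],
    [],                                  # empty conversation: skipped
    {'role': 'user', 'text': 'orphan'},  # not a list: skipped
    [{'role': 'user', 'text': 'What is AI?'}],
]

valid = first_messages(sample)
print(f'Found {len(valid)}/{len(sample)} valid conversations')  # Found 2/4 valid conversations
```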


## 🔀 Data Splitting & Dataset Preparation

### Shuffling & Splitting
- Shuffles the validated conversations in place with `np.random.shuffle`; no seed is set, so the split differs between runs
- 90/10 train-validation split
  - **Train:** First 90% of shuffled data
  - **Validation:** Remaining 10%

### Dataset Creation
- **Sequence Length:** Fixed at 256 tokens
- **Dataset Objects:**
  - `train_dataset`: Processed training data
  - `val_dataset`: Processed validation data
- **Safety Check:** Verifies non-empty datasets

### DataLoader Configuration
- **Batch Size:** 2 (small for memory efficiency)
- **Training Loader:**
  - Shuffles batches
  - Uses custom `collate_fn`
- **Validation Loader:**
  - Fixed order
  - Same collation function

> **Key Parameters:**
> - `max_seq_length=256`: Controls token truncation/padding
> - `batch_size=2`: Trade-off between memory and gradient stability
> - Automatic empty dataset detection prevents silent failures

In [6]:
np.random.shuffle(valid_data)
split_idx = int(len(valid_data) * 0.9)
train_data = valid_data[:split_idx]
val_data = valid_data[split_idx:]

max_seq_length = 256
train_dataset = AvaDataset(train_data, tokenizer, max_length=max_seq_length)
val_dataset = AvaDataset(val_data, tokenizer, max_length=max_seq_length)

print(f'Training dataset size: {len(train_dataset)}')
print(f'Validation dataset size: {len(val_dataset)}')

if len(train_dataset) == 0 or len(val_dataset) == 0:
    raise ValueError('Dataset is empty after processing. Check data format and filtering.')

batch_size = 2
train_loader = DataLoader(
    train_dataset,
    batch_size = batch_size,
    shuffle    = True,
    collate_fn = collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size = batch_size,
    collate_fn = collate_fn
)

Training dataset size: 45
Validation dataset size: 5
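
If a reproducible split is wanted, the shuffle can be seeded. The sketch below swaps in Python's standard `random` module instead of NumPy and works on a copy rather than in place; `shuffle_and_split` and its `seed` parameter are hypothetical names. It also reproduces the arithmetic behind the sizes reported above.

```python
import math
import random


def shuffle_and_split(items, train_fraction=0.9, seed=0):
    """Return (train, val) from a shuffled copy of `items`; the input is left untouched."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    split_idx = int(len(shuffled) * train_fraction)
    return shuffled[:split_idx], shuffled[split_idx:]


train, val = shuffle_and_split(range(50))
print(len(train), len(val))       # 45 5
print(math.ceil(len(train) / 2))  # 23 batches at batch_size=2
```

With the same seed, repeated calls return the same split, which makes validation losses comparable across runs.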


## 🔍 Batch Inspection

### Purpose
Verifies the data loading pipeline by examining:
- Tensor shapes
- Batch structure
- Attention masks
- Label formatting

### Expected Output
- **input_ids:** `[batch_size, sequence_length]`  
  (Tokenized input sequences)
- **attention_mask:** `[batch_size, sequence_length]`  
  (1 for real tokens, 0 for padding)
- **labels:** `[batch_size, sequence_length]`  
  (Target tokens for language modeling)

### Quality Check
- Confirms proper batching
- Validates tokenizer output
- Ensures mask/label alignment
- Verifies no shape mismatches

In [7]:
sample_batch = next(iter(train_loader))

print('Sample batch shapes:')
print(f'input_ids: {sample_batch["input_ids"].shape}')
print(f'attention_mask: {sample_batch["attention_mask"].shape}')
print(f'labels: {sample_batch["labels"].shape}')

Sample batch shapes:
input_ids: torch.Size([2, 256])
attention_mask: torch.Size([2, 256])
labels: torch.Size([2, 256])
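
The uniform `[2, 256]` shapes come from truncating or padding every sequence to the same length. AVA's own `collate_fn` is not shown in this notebook, so the following is only a generic sketch of that idea using plain Python lists. `pad_batch` is a hypothetical helper, and marking padded label positions with `-100` is an assumption based on the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def pad_batch(sequences, max_length, pad_token_id, ignore_index=-100):
    """Truncate or right-pad token ID lists to `max_length` and build masks and labels."""
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for seq in sequences:
        ids = list(seq[:max_length])
        n_pad = max_length - len(ids)
        batch['input_ids'].append(ids + [pad_token_id] * n_pad)
        batch['attention_mask'].append([1] * len(ids) + [0] * n_pad)
        # Padding positions get `ignore_index` so a loss function can skip them.
        batch['labels'].append(ids + [ignore_index] * n_pad)
    return batch


batch = pad_batch([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=4, pad_token_id=0)
for name, rows in batch.items():
    print(name, (len(rows), len(rows[0])))
```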


## ✅ Token ID Validation

### Purpose
Verifies that the token IDs in the sample batch are within vocabulary bounds (other batches are not inspected) to prevent:
- Index errors during training
- Invalid token references
- Potential model crashes

### Checks Performed
1. **Finds Maximum Token ID**  
   - Scans entire batch for highest ID value
2. **Compares Against Vocabulary**  
   - Checks tokenizer's vocab size
3. **Range Validation**  
   - Ensures `max_token_id < vocab_size`

### Error Conditions
Raises `ValueError` if:
- The largest token ID in the batch is greater than or equal to the vocabulary size (valid IDs run from 0 to `len(tokenizer) - 1`)

> **Why This Matters:**  
> Catching token ID issues early prevents cryptic failures during forward/backward passes.  
> Common causes include:  
> - Missing special tokens in vocab  
> - Tokenizer/model vocab mismatch  
> - Data contamination with invalid tokens  

In [8]:
max_token_id = torch.max(sample_batch['input_ids']).item()
print(f'Maximum token ID in batch: {max_token_id}')
print(f'Tokenizer vocabulary size: {len(tokenizer)}')

if max_token_id >= len(tokenizer):
    raise ValueError(f'Maximum token ID {max_token_id} is out of range for vocabulary size {len(tokenizer)}')

Maximum token ID in batch: 50257
Tokenizer vocabulary size: 50258
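
The check above looks at a single batch. A fuller version would scan every batch and also reject negative IDs. The sketch below shows the idea on nested Python lists rather than tensors; `check_token_ids` is a hypothetical helper and the IDs are made up to match the vocabulary size reported above.

```python
def check_token_ids(batches, vocab_size):
    """Raise ValueError if any ID in any batch falls outside [0, vocab_size)."""
    ids = [token_id for batch in batches for row in batch for token_id in row]
    if not ids:
        raise ValueError('No token IDs to check')
    lowest, highest = min(ids), max(ids)
    if lowest < 0 or highest >= vocab_size:
        raise ValueError(
            f'Token IDs span [{lowest}, {highest}] but valid IDs are 0..{vocab_size - 1}'
        )
    return lowest, highest


batches = [[[11, 42, 50257], [7, 0, 3]], [[50256, 9, 1], [2, 2, 2]]]
print(check_token_ids(batches, vocab_size=50258))  # (0, 50257)
```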


## 🧠 Model & Optimizer Setup

### Model Initialization
- **Architecture:** `AvaForCausalLM`  
  (Custom causal language model)
- **Configuration:**  
  - 500M parameters  
  - Pre-configured token mappings  
- **Device Placement:**  
  Automatically moves to:  
  `GPU (CUDA)` if available  
  `CPU` otherwise

### Optimizer Configuration
- **Type:** AdamW  
  (Improved Adam with proper weight decay)
- **Key Parameters:**  
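  - **Learning Rate:** 5e-5  
    (A common fine-tuning value; the model here is built from a config rather than loaded from pretrained weights, so this may need tuning)  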
  - **Weight Decay:** 0.01  
    (Regularization to prevent overfitting)

### Critical Checks
The cell does not verify these itself; they are worth confirming before training:
- All model parameters on correct device
- Token embeddings match vocabulary size
- Gradient tracking enabled

> **Training Ready:**  
> This completes the core setup for:  
> - Forward/backward passes  
> - Gradient updates  
> - Parameter optimization

In [None]:
model = AvaForCausalLM(config).to(device)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr = 5e-5,
    weight_decay = 0.01
)
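
To make the effect of `lr` and `weight_decay` concrete, here is a single AdamW update worked out for one scalar parameter. This is a simplified sketch of the standard update rule, not PyTorch's implementation; `adamw_step` is a hypothetical helper. The `beta1`, `beta2`, and `eps` values are the `torch.optim.AdamW` defaults, while the learning rate and weight decay match the cell above.

```python
import math


def adamw_step(param, grad, m, v, step, lr=5e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a single scalar parameter; returns (param, m, v)."""
    param = param * (1 - lr * weight_decay)        # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad             # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
print(f'{param:.7f}')
```

On the first step the gradient term is close to `lr * sign(grad)`, so the parameter moves by roughly 5e-5 regardless of the gradient's magnitude, while weight decay shrinks it by a further factor of `1 - lr * weight_decay`.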

## 🚂 Training Execution & Safety

### Training Process
- **Core Training:**  
  Runs `train_model` with:  
  - 1 epoch (for quick validation)  
  - Pre-configured model & data  
  - AdamW optimization  

### Safety Features
1. **Error Handling:**  
   - Catches all exceptions  
   - Prints detailed traceback  

2. **Keyboard Interrupt:**  
   - Graceful training cancellation  
   - Playful confirmation message  

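3. **Model Checkpointing:**  
   - Saves trained weights to `ava_model_trained.pt`, but only if `train_model` returns without raising  
   - The training log also shows a per-epoch checkpoint written to `checkpoints/ava_model_epoch_1.pt`  

### Critical Protections
- Prevents silent failures  
- Keeps per-epoch checkpoints written before a failure; the final `torch.save` is skipped on an error or interruption  
- Clean exit on interruption  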

> **Debugging Ready:**  
> The verbose error reporting helps diagnose:  
> - CUDA memory issues  
> - Data loading problems  
> - Configuration mismatches  
> - Gradient computation errors  

In [None]:
try:
    train_model(
        model        = model,
        train_loader = train_loader,
        val_loader   = val_loader,
        optimizer    = optimizer,
        num_epochs   = 1,
        device       = device
    )

    torch.save(model.state_dict(), 'ava_model_trained.pt')

except Exception as e:
    print(f'❌ Training error: {e}')
    traceback.print_exc()

except KeyboardInterrupt:
    print('🙄 As you wish, Sir!')

✨ Starting training...
🍀 Epoch 1/1 | Batch 0/23 | Loss: 9.4651 | Time: 11.19s
🍀 Epoch 1/1 completed in 176.29s | Average Loss: 8.9878
💾 Checkpoint saved to checkpoints/ava_model_epoch_1.pt
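
If weights should be kept even when a run is interrupted, the save step has to happen outside the `try` body. The sketch below shows that control flow with stand-in callables instead of a real model; `run_training`, `train_fn`, and `save_fn` are hypothetical names. It also shows why `KeyboardInterrupt` needs its own handler.

```python
def run_training(train_fn, save_fn):
    """Run `train_fn`; on Ctrl+C or an error, still call `save_fn` with the outcome."""
    try:
        train_fn()
        status = 'completed'
    except KeyboardInterrupt:
        # KeyboardInterrupt derives from BaseException, so `except Exception` never sees it.
        status = 'interrupted'
    except Exception as exc:
        status = f'failed: {exc}'
    save_fn(status)
    return status


def interrupted_training():
    raise KeyboardInterrupt  # stand-in for pressing Ctrl+C mid-epoch


saved = []
print(run_training(interrupted_training, saved.append))  # interrupted
print(saved)                                             # ['interrupted']
```

In the notebook, `save_fn` could wrap `torch.save(model.state_dict(), ...)` so that partially trained weights are written whatever the outcome.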


## 💬 Model Inference & Text Generation

### Input Processing
- **Prompt Format:**  
  `User: What is AI?\nAssistant:`  
  (Plain-text role labels; the `<|user|>` and `<|ava|>` special tokens added during tokenizer setup are not used in this prompt)
- **Tokenization:**  
  Converts text → token IDs → PyTorch tensor → correct device

### Generation Parameters
- **Max Length:** 100 tokens  
  (Hard cutoff on sequence length; whether the prompt counts toward it depends on `AvaForCausalLM.generate`)  
- **Temperature:** 0.7  
  (Balances creativity vs. predictability)  
- **Top-p:** 0.9  
  (Nucleus sampling for focused diversity)

### Safety Features
1. **Full Error Handling:**  
   - Catches CUDA/formatting issues  
   - Shows complete traceback  
2. **Device Awareness:**  
   - Automatically uses configured device  
3. **Clean Decoding:**  
   - Converts tokens → human-readable text

> **Debug Tip:**  
> Adjust temperature (0.3-1.5) and top-p (0.7-0.95) to control:  
> - Factual accuracy vs creativity  
> - Response variability  
> - Hallucination likelihood

In [None]:
input_text = 'User: What is AI?\nAssistant:'
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

# Switch off dropout and other training-only behavior before sampling.
model.eval()

try:
    output = model.generate(
        input_ids,
        max_length=100,
        temperature=0.7,
        top_p=0.9
    )

    print(tokenizer.decode(output[0]))
except Exception as e:
    print(f'❌ Generation error: {e}')
    traceback.print_exc()
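
How `AvaForCausalLM.generate` applies `temperature` and `top_p` is not shown in this notebook. As a generic illustration of the two settings, the sketch below implements temperature scaling followed by nucleus (top-p) filtering for a single step, using plain Python and invented logits. `nucleus_probs` and `sample_token` are hypothetical helpers.

```python
import math
import random


def nucleus_probs(logits, temperature=0.7, top_p=0.9):
    """Return {token_id: probability} restricted to the smallest set whose mass reaches top_p."""
    scaled = [x / temperature for x in logits]      # lower temperature sharpens the distribution
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]     # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}  # renormalize over the kept tokens


def sample_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Draw one token ID from the filtered distribution."""
    rng = random.Random() if rng is None else rng
    dist = nucleus_probs(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(nucleus_probs(logits)))  # [0, 1]
```

With these logits, the two most likely tokens already cover more than 90% of the probability mass, so the other two can never be sampled. Raising `temperature` or `top_p` widens the candidate set.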