# NextAI Training Pipeline

This notebook trains the NextAI model, based on GPT-2 Medium, on educational and career guidance data.

## Setup Instructions

1. Set Runtime to GPU (T4 or better)
2. Run all cells in sequence
3. Estimated time: 8-12 hours

## Hardware Requirements

- GPU: T4 (free), V100 (Colab Pro recommended)
- RAM: 12GB+
- Disk: 20GB+

In [None]:
!nvidia-smi
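
`nvidia-smi` covers the GPU; the disk requirement can be checked in a similar way. This is a minimal sketch using only the standard library, assuming the 20 GB figure from the list above. `has_free_disk` is a hypothetical helper name, not part of the NextAI repository.

```python
import shutil

def has_free_disk(path=".", min_gb=20):
    """Return True if the filesystem holding `path` has at least `min_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    print(f"Free disk space: {free_gb:.1f} GB (need {min_gb} GB)")
    return free_gb >= min_gb

# Assumption: nvidia-smi is typically on PATH only when a GPU runtime is attached.
print("nvidia-smi found:", shutil.which("nvidia-smi") is not None)
print("Enough disk:", has_free_disk())
```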

## Install Dependencies

In [None]:
!pip install -q torch transformers datasets accelerate sentencepiece wandb pyyaml

## Clone Repository

In [None]:
!git clone https://github.com/SanyamSuyal/NextAI.git
%cd NextAI

## Mount Google Drive (Optional)

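Mount your Google Drive to save checkpoints and the final model. The training configuration below writes to `/content/drive/MyDrive/NextAI/models`, so change `output_dir` there if you skip this step.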

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Upload Training Data

Upload your preprocessed training data (train.txt, val.txt) to the data/processed/ directory.

In [None]:
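from google.colab import files
import os

os.makedirs('data/processed', exist_ok=True)

def upload_as(target_path):
    # files.upload() saves the selected files to the working directory
    # and returns a dict keyed by filename.
    uploaded = files.upload()
    if len(uploaded) != 1:
        raise ValueError(f"Expected exactly one file, got {len(uploaded)}")
    # Move the single uploaded file to its final location.
    os.replace(next(iter(uploaded)), target_path)

print("Upload train.txt:")
upload_as('data/processed/train.txt')

print("\nUpload val.txt:")
upload_as('data/processed/val.txt')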

## Verify Data

In [None]:
!python data/validate_dataset.py
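
What `data/validate_dataset.py` checks is not shown in this notebook. As an independent sanity check, the sketch below reports the size and non-empty line count of each text file. `summarize_text_file` is a hypothetical helper and makes no assumption about the expected record format.

```python
import os

def summarize_text_file(path):
    """Return the byte size and the number of non-empty lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        non_empty = sum(1 for line in f if line.strip())
    return {"bytes": os.path.getsize(path), "non_empty_lines": non_empty}

for path in ("data/processed/train.txt", "data/processed/val.txt"):
    if os.path.exists(path):
        print(path, summarize_text_file(path))
    else:
        print(path, "is missing")
```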

## Configure Weights & Biases (Optional)

In [None]:
import wandb

wandb.login()
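
Because this step is optional, a run without a W&B account needs logging switched off. One way, sketched here, is the `WANDB_MODE` environment variable, which the wandb client reads. Whether `training/train.py` needs further changes is not shown in this notebook, and `set_wandb_mode` is a hypothetical helper.

```python
import os

def set_wandb_mode(enabled):
    """Set WANDB_MODE so the wandb client either logs online or becomes a no-op."""
    os.environ["WANDB_MODE"] = "online" if enabled else "disabled"
    return os.environ["WANDB_MODE"]

# Skip wandb.login() and disable logging for this session.
print(set_wandb_mode(enabled=False))
```

Commands started with `!` inherit the notebook's environment, so the setting also applies to the training script.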

## Training Configuration

In [None]:
import yaml

config = {
    'model_name': 'gpt2-medium',
    'output_dir': '/content/drive/MyDrive/NextAI/models',
    'training': {
        'num_epochs': 3,
        'batch_size': 4,
        'gradient_accumulation_steps': 8,
        'learning_rate': 5e-5,
        'warmup_steps': 500,
        'weight_decay': 0.01,
        'max_grad_norm': 1.0,
        'save_steps': 1000,
        'eval_steps': 500,
        'logging_steps': 100,
        'save_total_limit': 3
    },
    'data': {
        'train_file': 'data/processed/train.txt',
        'val_file': 'data/processed/val.txt',
        'max_length': 512,
        'block_size': 512
    },
    'optimization': {
        'optimizer': 'adamw',
        'scheduler': 'cosine',
        'fp16': True,
        'gradient_checkpointing': True
    },
    'wandb': {
        'project': 'nextai',
        'entity': 'nextbench',
        'log_model': True
    }
}

with open('training/config.yaml', 'w') as f:
    yaml.dump(config, f)

print("Configuration saved!")
print(yaml.dump(config, default_flow_style=False))
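
With gradient accumulation, the number of sequences contributing to each optimizer update is `batch_size × gradient_accumulation_steps`, which is 4 × 8 = 32 for the values above on a single GPU. The sketch below works this out along with the optimizer steps per epoch. The figure of 100,000 training sequences is an arbitrary placeholder, not the size of the NextAI dataset, and the exact rounding depends on the training loop.

```python
import math

def effective_batch_size(batch_size, gradient_accumulation_steps, num_devices=1):
    """Sequences that contribute to a single optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_sequences, batch_size, gradient_accumulation_steps):
    """Optimizer updates per epoch, rounding partial batches up."""
    micro_batches = math.ceil(num_sequences / batch_size)
    return math.ceil(micro_batches / gradient_accumulation_steps)

eff = effective_batch_size(batch_size=4, gradient_accumulation_steps=8)
steps = optimizer_steps_per_epoch(100_000, batch_size=4, gradient_accumulation_steps=8)
print(eff)    # 32
print(steps)  # 3125
```

Under that placeholder size, `warmup_steps: 500` would cover about 16% of the first epoch; with a much smaller dataset it would cover a larger share of training.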

## Start Training

This will take 8-12 hours depending on your GPU and dataset size.

In [None]:
!python training/train.py --config training/config.yaml
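
A run of this length may outlive a Colab session, so it helps to know which checkpoint is the most recent. The sketch below assumes checkpoints are written under `output_dir` as `checkpoint-<step>` directories, which is the Hugging Face `Trainer` convention. Whether `training/train.py` follows it, or supports resuming, is not shown here. `latest_checkpoint` is a hypothetical helper.

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically so 10000 sorts after 9000.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

print(latest_checkpoint('/content/drive/MyDrive/NextAI/models'))
```

If the script does follow `Trainer` semantics, `save_steps: 1000` with `save_total_limit: 3` would leave at most the three most recent checkpoints on disk.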

## Test the Model

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_path = config['output_dir']

tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generate_response(prompt, max_length=200):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Sample a continuation; max_length counts the prompt tokens too.
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

test_prompts = [
    "How can I prepare for JEE Advanced?",
    "What is the best strategy for getting into IIT Bombay?",
    "I'm feeling stressed about exams. What should I do?",
    "Create a roadmap for becoming a data scientist."
]

print("Testing NextAI Model:\n")
print("=" * 80)

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print(f"Response: {generate_response(prompt)}")
    print("-" * 80)
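
For a decoder-only model such as GPT-2, `generate` returns the prompt tokens followed by the new tokens, so each printed response begins with the prompt itself. This is a minimal sketch for showing only the continuation. `strip_prompt` is a hypothetical helper, and it falls back to the full text if decoding did not reproduce the prompt exactly. The example string is made up for illustration and is not model output.

```python
def strip_prompt(prompt, response):
    """Drop the echoed prompt from the start of a decoded generation, if present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response

# Made-up text standing in for a decoded generation.
example = "How can I prepare for JEE Advanced? Start with the syllabus."
print(strip_prompt("How can I prepare for JEE Advanced?", example))
```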

## Download Model

In [None]:
import shutil
from google.colab import files

# Zip the output directory, then download the archive through the browser.
shutil.make_archive('nextai_model', 'zip', config['output_dir'])
files.download('nextai_model.zip')

## Push to Hugging Face Hub

In [None]:
from huggingface_hub import login

# Prompts for a Hugging Face access token; pushing requires write permission.
login()

model.push_to_hub("SanyamSuyal/NextAI")
tokenizer.push_to_hub("SanyamSuyal/NextAI")

print("Model pushed to Hugging Face Hub!")
print("https://huggingface.co/SanyamSuyal/NextAI")