# CODECRAFT_GA_01: GPT-2 Text Generation with Google Colab

This notebook provides a complete environment for fine-tuning GPT-2 on your custom text dataset, generating text, and saving the fine-tuned model, all within Google Colab.

## 1. Setup Environment

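First, we'll clone the project repository and install its dependencies. The clone has to come first, because `requirements.txt` is not available in a fresh Colab runtime until the repository is present.

In [None]:
# Clone the repository and move into it
!git clone https://github.com/FoXDev-404/CODECRAFT_GA_01.git
%cd CODECRAFT_GA_01

# Install dependencies from the repository's requirements.txt
!pip install -r requirements.txt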

## 2. Prepare Your Dataset

You can either upload your own `.txt` dataset or use the provided `data/sample_corpus.txt`.

In [None]:
# Option 1: Upload your custom dataset
# If you have a custom dataset, upload it to the 'data/' directory.
# For example, if your file is 'my_custom_data.txt', upload it and then set:
# custom_dataset_path = "data/my_custom_data.txt"

from google.colab import files
import os
import shutil

custom_dataset_path = "data/sample_corpus.txt"  # Default to sample corpus

upload_choice = input("Do you want to upload a custom dataset? (yes/no): ").strip().lower()

if upload_choice in ("yes", "y"):
    uploaded = files.upload()
    os.makedirs("data", exist_ok=True)  # Make sure the target directory exists
    for filename in uploaded.keys():
        print(f'User uploaded file "{filename}" with length {len(uploaded[filename])} bytes')
        # Move the uploaded file to the data directory
        destination = os.path.join("data", filename)
        shutil.move(filename, destination)
        # If several files are uploaded, the last one is used for training
        custom_dataset_path = destination
        print(f"Using custom dataset: {custom_dataset_path}")
else:
    print(f"Using default dataset: {custom_dataset_path}")

# Verify dataset exists
if not os.path.exists(custom_dataset_path):
    raise FileNotFoundError(f"Dataset not found at {custom_dataset_path}. Please check the path or upload your file.")
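
The existence check above only confirms that the file is there. As an optional extra, the sketch below reports basic statistics for a plain-text dataset and fails early on an empty or undecodable file. `summarize_text_file` is a hypothetical helper, not part of the repository, and it assumes the dataset is UTF-8 encoded.

```python
import os


def summarize_text_file(path, encoding="utf-8"):
    """Return basic statistics for a plain-text dataset.

    Hypothetical helper (not part of the repository). Raises ValueError for
    an empty file and UnicodeDecodeError if the file is not valid `encoding`.
    """
    size_bytes = os.path.getsize(path)
    if size_bytes == 0:
        raise ValueError(f"{path} is empty")

    n_lines = n_nonblank = n_chars = 0
    # newline="" keeps line endings untouched so the character count is exact
    with open(path, "r", encoding=encoding, newline="") as f:
        for line in f:
            n_lines += 1
            n_chars += len(line)
            if line.strip():
                n_nonblank += 1

    return {
        "bytes": size_bytes,
        "lines": n_lines,
        "nonblank_lines": n_nonblank,
        "chars": n_chars,
    }
```

In this notebook it could be called as `summarize_text_file(custom_dataset_path)` right after the existence check. A very small character count is a hint that the corpus may be too short for meaningful fine-tuning.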

## 3. Configure and Train GPT-2

Set your training hyperparameters and choose between full fine-tuning and LoRA/PEFT.

In [None]:
import yaml
import os

# Define configuration file to use (default.yaml for full fine-tuning, lora.yaml for LoRA)
config_file = "configs/default.yaml" # Change to "configs/lora.yaml" for LoRA

# Load the configuration
with open(config_file, 'r') as f:
    config = yaml.safe_load(f)

# Update dataset path in config
config['dataset_path'] = custom_dataset_path

# You can override hyperparameters here directly if needed
# config['training_args']['num_train_epochs'] = 5
# config['training_args']['per_device_train_batch_size'] = 8
# config['training_args']['learning_rate'] = 1e-4

# Save the updated config to a temporary file for the script to read
temp_config_path = "configs/colab_temp_config.yaml"
with open(temp_config_path, 'w') as f:
    yaml.dump(config, f)

print(f"Training with configuration from {temp_config_path}:")
print(yaml.dump(config))

# Run the training script
!python src/train_gpt2.py --config {temp_config_path}
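
The commented-out overrides in the cell above assign nested keys one at a time, so a mistyped key would silently add a setting the training script never reads. A minimal sketch of a stricter pattern follows, assuming the config has a nested `training_args` section as those comments suggest. `apply_overrides` is a hypothetical helper, not part of the repository, and the actual keys in `configs/default.yaml` may differ.

```python
import copy


def apply_overrides(config, overrides):
    """Return a deep copy of `config` with dotted-key overrides applied.

    Hypothetical helper (not part of the repository). Keys such as
    "training_args.learning_rate" are split on "." and must already exist
    in the config; unknown keys raise KeyError instead of being added.
    """
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            if not isinstance(node.get(key), dict):
                raise KeyError(f"'{key}' is not a section in the config (from '{dotted_key}')")
            node = node[key]
        if leaf not in node:
            raise KeyError(f"Unknown config key: '{dotted_key}'")
        node[leaf] = value
    return updated
```

It would be called before the config is written to `temp_config_path`, for example `config = apply_overrides(config, {"training_args.num_train_epochs": 5})`. Returning a copy keeps the originally loaded config available for comparison.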

## 4. Generate Text Samples

After training, generate new text using your fine-tuned model.

In [None]:
# Define the path to your fine-tuned model
# This should match the output_dir in your config file (e.g., output/gpt2_finetuned/final_model)
fine_tuned_model_path = os.path.join(config['output_dir'], "final_model")

# Check if the model directory exists
if not os.path.exists(fine_tuned_model_path):
    # Try to find the latest checkpoint if final_model doesn't exist yet
    import glob
    checkpoints = glob.glob(os.path.join(config['output_dir'], "checkpoint-*"))
    if checkpoints:
        fine_tuned_model_path = max(checkpoints, key=os.path.getmtime)
        print(f"'final_model' not found. Using latest checkpoint: {fine_tuned_model_path}")
    else:
        raise FileNotFoundError(f"No fine-tuned model found at {fine_tuned_model_path}, and no checkpoints found in {config['output_dir']}.")

prompt_text = "Once upon a time, in a land far away,"
max_gen_length = config.get('gen_max_length', 100)
num_gen_sequences = config.get('gen_num_return_sequences', 3)

!python src/generate_text.py \
    --model_path "{fine_tuned_model_path}" \
    --prompt "{prompt_text}" \
    --max_length {max_gen_length} \
    --num_return_sequences {num_gen_sequences}
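
The fallback above picks the newest checkpoint by modification time, which can be misleading if checkpoint folders are copied or touched after training. A sketch of an alternative that orders checkpoints by the number in the folder name is shown below. It assumes the training script writes folders named `checkpoint-<step>`, which matches the `checkpoint-*` pattern used above but is not confirmed by the repository. `latest_checkpoint_by_step` is a hypothetical helper.

```python
import os
import re

# Matches folder names such as "checkpoint-500" and captures the step number
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_by_step(output_dir):
    """Return the path of the checkpoint folder with the highest step number.

    Hypothetical helper (not part of the repository). Returns None when
    `output_dir` contains no folder named "checkpoint-<integer>".
    """
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path
```

Comparing step numbers as integers also avoids the string-ordering trap where `checkpoint-500` would sort after `checkpoint-1000`.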

## 5. Save Fine-tuned Model to Google Drive

Mount your Google Drive and save the fine-tuned model for future use.

In [None]:
from google.colab import drive
import os
import shutil

drive.mount('/content/drive')

drive_save_path = "/content/drive/MyDrive/gpt2_finetuned_model"
os.makedirs(drive_save_path, exist_ok=True)

print(f"Copying model from {fine_tuned_model_path} to {drive_save_path}")
try:
    shutil.copytree(fine_tuned_model_path, drive_save_path, dirs_exist_ok=True)
    print("Model successfully saved to Google Drive!")
except Exception as e:
    print(f"Error saving model to Google Drive: {e}")
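
Because the copy step above only prints a message on failure, it can be useful to confirm that every file actually reached Drive. The sketch below compares relative file paths and sizes between the source and destination folders. Both function names are hypothetical and not part of the repository, and a size comparison is a lightweight check rather than a full integrity check such as hashing.

```python
import os


def relative_file_sizes(root):
    """Map each file's path relative to `root` to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            sizes[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return sizes


def find_copy_mismatches(src, dst):
    """Return sorted relative paths that are missing from `dst` or differ in size.

    Hypothetical helper (not part of the repository). Extra files that exist
    only in `dst` are ignored, since the Drive folder may hold older files.
    """
    src_sizes = relative_file_sizes(src)
    dst_sizes = relative_file_sizes(dst)
    return sorted(rel for rel, size in src_sizes.items() if dst_sizes.get(rel) != size)
```

In this notebook it could be called as `find_copy_mismatches(fine_tuned_model_path, drive_save_path)` after the copy. An empty list means every source file has a same-sized counterpart on Drive.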