<p align="center">
<img src="https://site.wandb.ai/wp-content/uploads/2025/07/wb_by_cw_black.svg?w=591" width="1200" alt="Weights & Biases" />
</p>

# 🪐 Operation REBOOT: Mission Start

Welcome, **Neural Architect**. The ship's AI core is down. Your job: fine-tune a foundational model with astronomical data to restore its deep space knowledge. This notebook will walk you through all the necessary steps and even help you test and evaluate your final model. 

**Your mission:**
- Load and adjust the data 
- Configure training arguments
- Launch training in High Performance Compute and monitor with **Weights & Biases (W&B) by CoreWeave**
- Test and evaluate your fine-tuned model

All systems go. Let's bring this vessel back online.

#### Import the required libraries

In [11]:
!pip install wandb weave datasets transformers sentence-transformers peft boto3 bitsandbytes --quiet

In [12]:
import wandb

import json
import math
import random
from pathlib import Path
from datetime import datetime
import pytz

import torch
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer

import warnings
warnings.filterwarnings('ignore')

In [13]:
from utilities.helpers import * #helper functions used throughout our notebook. Take a peek here while your training job runs.

## 🔌 Connect Neural Telemetry (W&B Setup)

In [14]:
WANDB_ENTITY =  "fc25-london-aise-likeyandy" #Go to https://wandb.ai/ to see it!
WANDB_PROJECT_NAME = "CoreWeave-Astros-FT-Workshop" #Name it however you'd like!

# What is a team? or a project? Learn more in our docs: https://docs.wandb.ai/platform/app/settings-page/teams

In [16]:
# Wait to be prompted to authenticate your wandb account 
# You can find your API key in your browser here: https://wandb.ai/authorize
# Paste an API key from your profile and hit enter:
wandb.login(relogin=True, force=True)

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

  ········


wandb: Appending key for api.wandb.ai to your netrc file: /home/jovyan/.netrc


True

## 🧪 Dataset Control Room
To fine-tune a base model, we need to adjust the source dataset and prepare the astronomical data for training. This data engineering work is essential for a high-quality fine-tune that will let our vessel perform well in deep space!

We do all this by leveraging [W&B Artifacts](https://docs.wandb.ai/models/artifacts), a powerful way to track and share large amounts of data. Check them out!

## 🌎 Initialize Experiment, Read Data, Split Data ☄️

In this section, we:

* Retrieve the Astros Dataset from [W&B Registry FC_FT_Workshop_Dataset collection](https://wandb.ai/orgs/fcLondon-workshop/registry/dataset?selectionPath=fclondon-workshop%2Fwandb-registry-dataset%2FFC_FT_Workshop_Dataset&view=versions) 
* Load the Astro Dataset containing universe-related Q&A data.
* Create prompts from the question/answer pairs & load into a pandas dataframe


<img src="https://static.vecteezy.com/system/resources/previews/049/735/941/non_2x/edge-rocket-engine-isolated-on-transparent-background-free-png.png" width="300" alt="Rocket engine" />

✅ All the heavy lifting is done here automatically; no manual setup is needed.

#### Let's prepare our training dataset

In [17]:
local_tz = pytz.timezone("US/Pacific")
timestamp = datetime.now(local_tz).strftime("%H%M")

# Step 1: Initialize W&B run and download dataset
run = wandb.init(entity=WANDB_ENTITY,
                 project=WANDB_PROJECT_NAME,
                 id=f"Operation_Reboot_{timestamp}",
                 resume="allow")

print("Step 1: Downloading dataset from Weights & Biases...")
# Download the dataset artifact
artifact = run.use_artifact('fc-london-admins/astro-datasets/astro_llm_training_dataset:v0', type='dataset')
dataset_dir = artifact.download()
print("✅ Dataset downloaded successfully!")

# Step 2: Load and prepare datasets
# Look at the helper functions if you're interested in how we prepare the data.
df_train, training_dataset = load_and_prepare_dataset(dataset_dir, "astro_dataset_train.jsonl", "training")
run.finish()
# Print dataset statistics
print("\nDataset Statistics:")
print(f"Training examples: {len(df_train)}")
print("\nExample prompt format:")
print(df_train['text'].iloc[0])

Step 1: Downloading dataset from Weights & Biases...


wandb:   2 of 2 files downloaded.  


✅ Dataset downloaded successfully!

Loading training dataset...
✅ Successfully loaded dataset with no errors.
✅ Training dataset loaded with 1600 examples



Dataset Statistics:
Training examples: 1600

Example prompt format:
Question: What are 'Superluminous Supernovae' (SLSNe) and what distinguishes Type I SLSNe from normal Type Ia supernovae spectroscopically?
Answer: Superluminous Supernovae (SLSNe) are much more luminous than normal Type Ia supernovae. Spectroscopically, Type I SLSNe are characterized by the absence of hydrogen and strong helium lines near peak light (like normal SNe Ia), but they show strong, broad metal lines, often including oxygen, magnesium, and calcium. Normal SNe Ia are defined by the presence of strong silicon absorption lines (Si II λ6355) near peak light, which are often weak or absent in SLSNe I. The differences in spectra indicate different progenitor systems and explosion mechanisms: SNe Ia are thermonuclear disruptions of white dwarfs, while SLSNe I are thought to be core-collapse explosions of massive stars, often powered by magnetars or CSM interaction, despite their lack of hydrogen.


### 🧪 Task 2: Systems Check - W&B by CoreWeave Telemetry Sync 🚩 Checkpoint 🚩

You've reached your first checkpoint. Follow the link above ("View run at") to see your run in the W&B environment. Grab the URL of the run and submit it on the quest page to get points.

### 🌌 Dataset Loaded Successfully!

<img src="https://www.freeiconspng.com/thumbs/checkmark-png/checkmark-png-5.png" width="150" alt="Checkmark" />


At this point, we've:
* Retrieved the Astros Dataset artifact
* Loaded it into a pandas DataFrame
* Created prompt-style text for fine-tuning
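
The prompt-style text from the last step can be sketched as follows. This is a minimal illustration, assuming each record has `question` and `answer` fields and that the template matches the `Question: ... / Answer: ...` example printed above. `format_prompt` and the sample record are hypothetical; the notebook's real logic lives in the helper functions.

```python
def format_prompt(record: dict) -> str:
    """Turn one Q&A record into a single training string."""
    question = record["question"].strip()
    answer = record["answer"].strip()
    return f"Question: {question}\nAnswer: {answer}"


# Hypothetical record, used only for illustration.
example = {
    "question": "What is a light-year?",
    "answer": "The distance light travels in vacuum in one Julian year.",
}
prompt_text = format_prompt(example)
print(prompt_text)
```

Keeping the template in one small function makes it easy to reuse the exact same format at inference time.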

✨ Feel free to pause and explore the data before moving forward!

Exploring the dataset can help you:

* Understand the kinds of questions and answers the model will learn from
* Check for any strange patterns, formatting issues, or interesting insights
* Discover Easter Eggs

🛡️ We've added soft error handling while loading, so if you accidentally modify the dataset file, you'll be warned about any loading issues.

👉 Quick Tip: You don't need to modify the dataset to proceed, but if you want to explore, you can run things like:

```
print(df_train.sample(5))
print(df_train['question'].apply(len).describe())
print(df_train['answer'].apply(len).describe())
```
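
The soft error handling mentioned above can be pictured with a small sketch. It assumes the dataset is a JSONL file (one JSON object per line), as the `astro_dataset_train.jsonl` file name suggests. `load_jsonl_softly` is a hypothetical name, and the notebook's actual helper may behave differently.

```python
import json


def load_jsonl_softly(lines):
    """Parse JSONL lines, collecting malformed ones instead of raising."""
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((line_number, str(exc)))
    return records, errors


# Hypothetical file contents: the second line is deliberately broken.
sample_lines = [
    '{"question": "What is a parsec?", "answer": "About 3.26 light-years."}',
    '{"question": "Broken line", "answer": ',
    "",
]
records, errors = load_jsonl_softly(sample_lines)
print(f"Loaded {len(records)} record(s), skipped {len(errors)} malformed line(s)")
```

Collecting errors with their line numbers, rather than stopping at the first one, lets you report every problem in a single pass.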

When you're ready, move on to loading the model and tokenizing the dataset!

## 🧠 Model Vault: Download & Configure the Base Model

## 🚀 Load Pretrained Model and Prepare Dataset for Fine-Tuning 🌠
In this section, we:

* Select a pretrained language model to fine-tune.
* Retrieve the model from W&B.
* Split the dataset into training and validation sets.
* Finally, load the model and tokenizer to prepare for fine-tuning on CoreWeave hardware (H100s 🔥).

✅ All the setup for model loading, tokenization, and data splitting is handled for you; no manual steps required!

### Select Model
You will be prompted to select one of the following models

*   Option 1: [falcon-rw-1b](https://huggingface.co/tiiuae/falcon-rw-1b)

<img src="https://res.cloudinary.com/apideck/image/upload/w_196,f_auto/v1686871384/marketplaces/ckhg56iu1mkpc0b66vj7fsj3o/listings/ii7e_5o4jBoK3pS8WMaWK_e7wsoh.webp" width="200" alt="Falcon" />
  
*   Option 2: [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

<img src="https://techcrunch.com/wp-content/uploads/2024/06/GettyImages-959993436-e1718640411389.jpg" width="200" alt="TinyLlama" />


In [18]:
model_name, version = get_model_from_wandb(WANDB_ENTITY, WANDB_PROJECT_NAME, run_id=f"Operation_Reboot_{timestamp}")


Available Models:
1. Falcon RW 1B
2. TinyLlama 1B



Select a model (1-2):  1



✅ Selected: falcon-rw-1b

⬇️ Downloading falcon-rw-1b (version v0) from Weights & Biases...


wandb: Downloading large artifact 'falcon-rw-1b:v0', 2505.97MB. 43 files...
wandb:   43 of 43 files downloaded.  
Done. 00:00:03.8 (659.0MB/s)


✅ Model saved to: models/falcon-rw-1b_v0
✅ Model downloaded successfully!




Next, we'll make a few adjustments to ensure the model handles padding correctly,
and then prepare our dataset for training by tokenizing the input prompts.


## 🔄 Tokenize & Split: Format Data for Finetuning

<img src="https://towardsdatascience.com/wp-content/uploads/2024/09/1QVXvydRMEWTWiUP42bYBAg.png" width="400" alt="Falcon" />

#### Load the datasets

You can change how the training data is passed to the training script. Analyze the data first so you can choose an appropriate **Sample Size** and **Train/Test split** for the finetuning process.


In [19]:
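sample_size = 100  # number of examples to train on, between 100 and 1600; more samples can improve your model but increase training time
train_test_split = 0.1  # fraction of examples held out for evaluation (a float between 0 and 1)

# NOTE: the next cell passes `training_dataset` (all 1600 examples) to the tokenizer,
# so `sample_size` only takes effect if you pass `training_sample` there instead.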
training_sample = training_dataset.shuffle(seed=42).select(range(sample_size)) 
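
To see what a given sample size and split fraction produce, here is a minimal sketch of a shuffled hold-out split. `split_indices` is a hypothetical stand-in; the notebook's `tokenized_train_test` helper may split differently. With all 1600 examples and a 0.1 split it gives the 1440/160 sizes seen in the recorded run.

```python
import random


def split_indices(n_examples: int, eval_fraction: float, seed: int = 42):
    """Shuffle example indices and hold out a fraction for evaluation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_eval = round(n_examples * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


# Full dataset, as in the recorded run.
train_idx, eval_idx = split_indices(1600, 0.1)
print(len(train_idx), len(eval_idx))

# The same split fraction applied to a 100-example sample.
small_train_idx, small_eval_idx = split_indices(100, 0.1)
print(len(small_train_idx), len(small_eval_idx))
```

With only 100 samples, the evaluation set shrinks to 10 examples, so validation loss becomes much noisier.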

#### Load the model, tokenizers, and tokenize the dataset

In [20]:
model, tokenizer, model_name = load_model_and_tokenizer(model_name, version)
train_dataset, eval_dataset = tokenized_train_test(training_dataset, train_test_split, tokenizer)


📦 Loading model from: models/falcon-rw-1b_v0

Step 1: Loading tokenizer...
✅ Tokenizer loaded successfully!

Step 2: Configuring QLoRA...
Loading model with 4-bit quantization...


You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


trainable params: 12,582,912 || all params: 1,324,208,128 || trainable%: 0.9502
✅ Model loaded with QLoRA successfully!


Map:   0%|          | 0/1440 [00:00<?, ? examples/s]

Map:   0%|          | 0/160 [00:00<?, ? examples/s]

✅ Tokenization applied to Training & Evaluation Datasets successfully!
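
The `trainable params` line in the output above shows how small the adapter is relative to the base model. The sketch below recomputes the percentage and shows one LoRA configuration that happens to reproduce the trainable count. That configuration (rank 64 on a single 2048 to 6144 projection in each of 24 layers) is an assumption for illustration only; the helper's actual settings are not shown in this notebook.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * in_features + out_features * rank


trainable = 12_582_912      # from the log line above
total = 1_324_208_128       # from the log line above
trainable_pct = 100 * trainable / total
print(f"trainable%: {trainable_pct:.4f}")

# Assumed (not confirmed) configuration that reproduces the trainable count:
# rank 64 on one 2048 -> 6144 projection in each of 24 layers.
assumed_total = 24 * lora_param_count(2048, 6144, rank=64)
print(assumed_total == trainable)
```

Because the count scales linearly with the rank, halving the rank would halve the number of trainable parameters.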


## ⚙️ Training Command Center
Set training arguments to guide your model's learning trajectory.

## 🛰️ Training Arguments (Where You Fine-Tune Settings) 🌙


<img src="https://www.svgrepo.com/show/330922/weights-and-biases.svg" width="200" alt="Weights & Biases" />

This is where you'll do most of your experimentation! 🎯

The `TrainingArguments` object controls how your model is fine-tuned, including:

* Batch size
* Number of epochs
* Learning rate
* Warmup steps
* Mixed precision (fp16) for faster training
* Checkpoint saving
* Reporting to Weights & Biases, and more

You can modify the hyperparameters here to see how different settings impact model performance. Note that you have an H100 (Hopper) GPU available for training. Frontier labs use this hardware to train their models, so try to make the most of it!

#### 💡 Create your own Fine-Tuning Strategy:

🔧 learning_rate  
Controls how fast the model learns. Use 1e-4 to 2e-3; lower for full finetuning, higher for QLoRA.  
Interacts with warmup_ratio and lr_scheduler_type: higher learning rates need more warmup.

⚙️ optim  
Selects the optimizer. 'adamw_torch' is PyTorch's standard AdamW; 'adamw_torch_fused' is the fused variant and is usually faster on GPU.  
Combine with a good learning rate and warmup for stable convergence.

📈 num_train_epochs  
Number of times the model sees the full dataset. Start low.
Runs with more epochs will take longer and need more compute. 

📊 gradient_accumulation_steps  
Simulates large batch sizes by accumulating gradients across steps.  
We will use large batch sizes. Keep this low to speed up training.

📦 per_device_train_batch_size  
Controls the number of samples processed per GPU per step.  
Larger batch = more stable gradients but more memory usage.
Our GPU has plenty of memory for this model, so keep this high.

🔥 warmup_ratio  
Gradually ramps up the learning rate to avoid early divergence.  
Set between 0.05–0.1; this is especially important with higher learning rates.

📉 lr_scheduler_type  
Controls how the learning rate decays. 'cosine' gives smoother transitions, 'linear' is simpler.  
Works in tandem with warmup_ratio and total training steps.

⚡ fp16 / bf16  
Enables mixed-precision training for faster speed and lower memory use.  
Use bf16 on newer GPUs (A100 and later, including the H100 in this workshop); fp16 is the option for older GPUs. The cell below defaults to fp16.

🧠 gradient_checkpointing  
Reduces memory by recomputing activations during the backward pass.  
Useful for large models on smaller GPUs; increases compute cost.

💾 save_strategy / save_total_limit  
Determines when and how often to save models. 'epoch' is safer for LLMs.  
Limit total checkpoints (e.g., 5) to save disk space.

🧪 eval_strategy  
Controls evaluation frequency. Use 'epoch' for stability, 'steps' for faster feedback.  
Combine with logging_steps for clear model monitoring.

📏 group_by_length  
Batches examples of similar length together.  
Speeds up training when working with variable-length sequences.

📐 auto_find_batch_size  
Automatically retries with a smaller batch size after an out-of-memory (OOM) error.  

📝 logging_steps  
How often to log metrics like loss and learning rate.  
Use 10–50 to balance visibility with log noise.

📊 report_to  
Set to 'wandb' to enable experiment tracking, charts, and comparisons.  
This will ensure we can capture our training and system metrics.

🏆 metric_for_best_model  
Names the validation metric used to compare checkpoints (here, eval_loss).  
The best checkpoint is loaded automatically only when load_best_model_at_end=True.

🧱 weight_decay  
Regularization to prevent overfitting. Typical values: 0.01 to 0.1.  
Use with higher learning rates or longer training to avoid memorization.

🧹 remove_unused_columns  
Drops dataset columns that the model's forward method does not accept.  

🔄 dataloader_num_workers / dataloader_pin_memory  
Set workers to 4–16 for best throughput. Pinned memory speeds up host-to-GPU transfers.

🏷️ label_names  
Identifies which dataset columns hold the labels used in loss computation.  

*Ask our team for help if you have questions about any parameters listed here*
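
To make the interaction between `learning_rate`, `warmup_ratio`, and `lr_scheduler_type` concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule. It mirrors the general shape of a cosine schedule with warmup, but `lr_at_step` is a hypothetical function written for illustration, not the library's implementation. The 70 total steps correspond to this notebook's recorded run.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# learning_rate=2e-3 and warmup_ratio=0.1 are the values used in this notebook.
schedule = [lr_at_step(s, 70, 2e-3, 0.1) for s in range(71)]
peak_step = schedule.index(max(schedule))
print(f"peak lr {max(schedule):.1e} at step {peak_step}, final lr {schedule[-1]:.1e}")
```

A higher `warmup_ratio` moves the peak later, which gives a large learning rate more time to ramp up safely.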

In [21]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    run_name=f"fine-tuning-{model_name}-qlora",
    output_dir="./results",
    num_train_epochs=2, #start low and go up as needed
    per_device_train_batch_size=42,
    per_device_eval_batch_size=4,
    dataloader_num_workers=16,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    do_train=True,
    do_eval=True,
    fp16=True,
    bf16=False,
    gradient_checkpointing=False, # Choose to store the full forward-pass activations in GPU RAM
    group_by_length=True,
    report_to=["wandb"],
    remove_unused_columns=True,
    dataloader_pin_memory=True,
    optim="adamw_torch", # See https://huggingface.co/docs/transformers/v4.51.3/en/perf_train_gpu_one#optimizers
    learning_rate=2e-3,
    lr_scheduler_type="cosine", # See https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules#transformers.SchedulerType
    auto_find_batch_size=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    load_best_model_at_end=False,
    metric_for_best_model="eval_loss",
    logging_strategy="steps",
    label_names=["labels"],
)
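
The batch settings above determine how many optimizer steps a run takes. The arithmetic below is a small worked example using this notebook's values; the single-GPU count is an assumption, and the formula is the simple case that holds when `gradient_accumulation_steps` is 1.

```python
import math

train_examples = 1440          # training split size in the recorded run
per_device_batch_size = 42     # per_device_train_batch_size
gradient_accumulation = 1      # gradient_accumulation_steps
num_gpus = 1                   # assumption: a single GPU
epochs = 2                     # num_train_epochs

effective_batch = per_device_batch_size * gradient_accumulation * num_gpus
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)
```

The result matches the `train/global_step` value of 70 reported in the run summary, and with `logging_steps=10` that gives seven logged training points.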

## 🛰️ Engage Training Tracker
Launch the model and track training live with W&B.

## 🔭 Initialize Trainer, Train, and Save 🌎

In this final section:

* We initialize the Trainer with:
  * The model
  * The tokenizer (through the data collator)
  * The data
  * The training arguments

* We start training by calling trainer.train().
* We save the fine-tuned adapter and tokenizer locally.
* We finish the W&B run to close the logging cleanly. W&B has native integrations with popular training libraries; here we use the native integration with the [HF Trainer](https://docs.wandb.ai/models/integrations/huggingface).

🧠 Reminder: After training finishes, we will test and evaluate the fine-tuned model in this notebook.

🚨 Training Ahead: Be ready for longer runtimes!

🔥 Because you are using W&B with CoreWeave GPUs, the W&B interface gives you deep insight into CoreWeave infrastructure issues. We don't expect to see any today, but this visibility matters for inference and training workloads. [Read More Here.](https://wandb.ai/wandb_fc/product-announcements-fc/reports/New-Deep-observability-for-AI-training-and-fine-tuning-on-CoreWeave--VmlldzoxMzI1MjUwMA)


In [22]:
from transformers import DataCollatorForLanguageModeling, EarlyStoppingCallback

# Configure model for training
model.config.use_cache = False  # Disable cache during training

# Set label names for PEFT model
model.config.label_names = ["labels"]

# Initialize trainer with modified configuration
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # causal language modeling, not masked language modeling
        pad_to_multiple_of=8  # Add padding to multiple of 8 for better performance
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]  # stop after 3 evaluations without improvement
)

# Make the input embeddings require gradients (PEFT adapters need this when gradient checkpointing is on)
if hasattr(model, "enable_input_require_grads"):
    model.enable_input_require_grads()
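
The `EarlyStoppingCallback(early_stopping_patience=3)` above stops training once the validation metric fails to improve for three evaluations in a row. The sketch below is a simplified model of that rule, not the library's code. `early_stop_epoch` is a hypothetical function, and it ignores the optional improvement threshold.

```python
def early_stop_epoch(eval_losses, patience: int = 3):
    """Return the 1-based evaluation index at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return index
    return None


# Validation losses from this notebook's two-epoch run: never triggers.
two_epoch_result = early_stop_epoch([2.061493, 1.987093])
# Hypothetical longer run where the loss stops improving after epoch 2.
longer_result = early_stop_epoch([2.06, 1.99, 2.01, 2.03, 2.05])
print(two_epoch_result, longer_result)
```

With per-epoch evaluation and only two epochs, a patience of 3 can never fire, so early stopping only matters once you raise `num_train_epochs` or evaluate more often.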

## ⚙ Now we kick off the training process ⚙

In [23]:
#Train
run = wandb.init(entity=WANDB_ENTITY,
                 project=WANDB_PROJECT_NAME,
                 id=f"Operation_Reboot_{timestamp}", 
                 resume="allow")
train_output = trainer.train()
run.finish()

Using EarlyStoppingCallback without load_best_model_at_end=True. Once training is finished, the best model will not be loaded automatically.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.1303 | 2.061493 |
| 2 | 1.6337 | 1.987093 |


| Metric | History |
|---|---|
| eval/loss | █▁ |
| eval/runtime | █▁ |
| eval/samples_per_second | ▁█ |
| eval/steps_per_second | ▁█ |
| train/epoch | ▁▂▃▄▅▆▇███ |
| train/global_step | ▁▂▃▄▅▆▇███ |
| train/grad_norm | █▂▁▂▁▁▅ |
| train/learning_rate | █▇▆▄▃▂▁ |
| train/loss | █▆▅▃▂▁▁ |

| Metric | Value |
|---|---|
| eval/loss | 1.98709 |
| eval/runtime | 2.3057 |
| eval/samples_per_second | 69.393 |
| eval/steps_per_second | 17.348 |
| total_flos | 1.080426806378496e+16 |
| train/epoch | 2 |
| train/global_step | 70 |
| train/grad_norm | 0.92934 |
| train/learning_rate | 0.0 |
| train/loss | 1.6337 |


# Make sure you understand your training run, and use W&B to optimize it!
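
One quick way to read the numbers above is to convert validation loss into perplexity and to check the gap between training and validation loss. This is a small worked example using the values from the results table; it assumes the reported loss is mean cross-entropy in nats, which is the usual case for causal language modeling.

```python
import math

# Validation losses from the results table above.
eval_losses = {1: 2.061493, 2: 1.987093}
perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: eval loss {eval_losses[epoch]:.4f} -> perplexity {ppl:.2f}")

# A growing gap between training and validation loss is a sign of overfitting.
train_loss = 1.6337
gap = eval_losses[2] - train_loss
print(f"train/eval gap after epoch 2: {gap:.4f}")
```

If the gap keeps widening while validation loss stalls, more epochs are unlikely to help; more data or stronger regularization is the better lever.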



## 💾 Save & Upload
Preserve your fine-tuned model as a W&B reference artifact in CoreWeave CAIOS storage 🔥.

<img src="https://docs.coreweave.com/assets/images/frame-87.coreweave-166308-updateallrubixcubeshapestoblack-storage-F3F3F5-Small-cb156845fe6317dcc8b3cdc56a64fba8.gif" width="300" alt="CoreWeave AI Object Storage" />

**CoreWeave AI Object Storage** delivers exabyte-scale, S3-compatible storage tailored for GPU-intensive AI model training. Designed to integrate seamlessly with CoreWeave's NVIDIA GPU compute clusters, it supports performance levels up to 2 GB/s per GPU and scales effortlessly to hundreds of thousands of GPUs. With its unique Local Object Transport Accelerator (LOTA™), AI Object Storage caches frequently used datasets and/or prestages data directly on the local NVMe disks of GPU nodes, reducing network latency and dramatically improving training speeds. [Learn more here](https://coreweave.com/products/storage#object-storage)

Tracking your model in W&B can be really helpful:

- You can now share this model with your team and beyond
- W&B creates a lineage map of your model so you can see the full model lifecycle: dataset->training->final state

In [24]:
#Saving and uploading best model
trainer.save_model(f"./best_model/{type(model.base_model.model).__name__}")
tokenizer.save_pretrained(f"./best_model/{type(model.base_model.model).__name__}")

###### SET A MODEL VERSION NAME SO YOU CAN REFERENCE IT LATER BEFORE SENDING TO CAIOS ###########
VERSION_NAME="initial_version"
###### SET A MODEL VERSION NAME SO YOU CAN REFERENCE IT LATER BEFORE SENDING TO CAIOS ###########

caios_path = send_to_caios("best_model/", WANDB_ENTITY, VERSION_NAME)

run = wandb.init(entity=WANDB_ENTITY,
                 project=WANDB_PROJECT_NAME,
                 id=f"Operation_Reboot_{timestamp}",
                 resume="allow")

artifact = wandb.Artifact(
    name=f"{WANDB_ENTITY}-ft-best-model-{type(model.base_model.model).__name__}",
    type="model",
    description="""Best FineTuned model from the Astros-FT-Workshop."""
)

artifact.add_reference(uri=caios_path)

logged_artifact = run.log_artifact(artifact)

run.finish()

Model sent to CAIOS: s3://fclondon2025/fc25-london-aise-likeyandy/initial_version


wandb: Generating checksum for up to 10000000 objects in "fclondon2025/fc25-london-aise-likeyandy/initial_version"... Done. 0.1s


| Metric | Value |
|---|---|
| eval/loss | 1.98709 |
| eval/runtime | 2.3057 |
| eval/samples_per_second | 69.393 |
| eval/steps_per_second | 17.348 |
| total_flos | 10804268063784960 |
| train/epoch | 2 |
| train/global_step | 70 |
| train/grad_norm | 0.92934 |
| train/learning_rate | 0.0 |
| train/loss | 1.6337 |


### 🚩 Checkpoint 🚩 🧬 Task 3: Artifact Uplink - Core Memory Package

You've reached your second checkpoint. 

Navigate to wandb by clicking the link next to `View project at:` above and then click on the yellow Weights & Biases logo on the top left of the page.  
On the left panel, click **Artifacts**. Find your artifact and retrieve the `Full Artifact Name` that you can submit on the quest page.

<img src="imgs/artifactview.png" alt="Artifact View" style="max-width: 100%; width: 100%; height: auto;" />


## ✅ Mission Checkpoint: Model Finetuned

Congratulations, Architect! You've:
- Loaded and prepped your training dataset ✅
- Configured a foundational model ✅
- Finetuned it with parameter-efficient methods ✅
- Logged your training runs and saved the final model to Weights & Biases ✅

Your model is now part of your mission's neural infrastructure.

Next, we prepare to test and evaluate. But first, a quick system check...

## 🧰 Systems Maintenance Bay: Utilities

Before testing, it's wise to flush memory and check your hardware status. Use these utilities to prepare the environment.

Just like a good engineer, make sure the ship's neural bays are cleared and ready. This will ensure our **CoreWeave GPUs** are kept nice and tidy!

## Utilities 🧰

In [25]:
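# Flush GPU memory when required (a kernel restart may still be needed)
import gc
import torch

# `del` raises NameError when a name was never defined or was already deleted
try:
    del trainer
except NameError:
    print("trainer not defined; nothing to release")
try:
    del model
except NameError:
    print("model not defined; nothing to release")
try:
    del tokenizer
except NameError:
    print("tokenizer not defined; nothing to release")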

gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

wandb.finish()

In [26]:
!nvidia-smi

Tue Nov  4 11:32:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0            114W /  700W |    2001MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## 🧪 Testing the Neural Core

Now that your model is trained and uploaded, it's time to test your new candidate neural core.

You'll load the fine-tuned model and run test prompts to ensure it responds with precision and depth, critical for deep-space operations.

We've equipped you with a call function wrapped in `Weave`, our GenAI interface and telemetry layer.

## 🔧 Testing our model with the adapter 🪛

Let's start by creating some helper functions to load and call the model we just trained.

<img src="imgs/lora.png" alt="LoRA adapter" style="max-width: 50%; width: 50%; height: auto;" />


Since we created an adapter during the finetuning process, our load model function loads the original model along with our adapter using PEFT.

## 🛰️ Introducing Weave: Your AI Telemetry and Evaluation Suite

<img src="https://mintcdn.com/wb-21fd5541/aRvhhwVWqlxBzke5/images/evals-hero.png?fit=max&auto=format&n=aRvhhwVWqlxBzke5&q=85&s=7d7466d666ad412ed3916bfab533d118" width="500" alt="Weave" />

[**Weave**](https://docs.wandb.ai/weave) is Weights & Biases' next-gen platform for tracking, evaluating, and visualizing GenAI applications.

You'll use Weave to:
- Log and score model generations
- Run structured evaluations on Q&A performance
- Compare outputs with reference answers

This enables you to **quantitatively assess** how mission-ready your model is.

Let's initialize Weave and plug it into your finetuned system.

In [27]:
import weave
weave.init(f"{WANDB_ENTITY}/{WANDB_PROJECT_NAME}")

weave: Logged in as Weights & Biases user: likeyandy1025.
weave: View Weave data at https://wandb.ai/fc25-london-aise-likeyandy/CoreWeave-Astros-FT-Workshop/weave


<weave.trace.weave_client.WeaveClient at 0x7ad3f0619700>

## Calling our Local Finetuned Model

### Load model

We will now load the base model together with the finetuned adapter you just trained, combined as one model that we can call. The directories referenced in the cell below contain the model files we are loading. This model should be able to navigate complex astronomical questions from our evaluation dataset.

These are now being loaded onto a state-of-the-art CoreWeave GPU 💾

In [28]:
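# Both paths must match the model you trained. The recorded run above selected falcon-rw-1b,
# so the base model is in models/falcon-rw-1b_v0 and the adapter in best_model/FalconForCausalLM.
# For TinyLlama, use "./models/TinyLlama_v1" and "./best_model/LlamaForCausalLM".
base_model_dir = "./models/falcon-rw-1b_v0"  # path to the base model (models/<TinyLlama_v1 or falcon-rw-1b_v0>)
adapter_dir = "./best_model/FalconForCausalLM"  # path to the adapter dir (FalconForCausalLM or LlamaForCausalLM)

tokenizer, model = load_finetuned_model(adapter_dir, base_model_dir)

If a path does not exist locally, it is treated as a Hugging Face Hub repo id and loading fails. For example, pointing at the TinyLlama adapter after training Falcon gives:

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './best_model/LlamaForCausalLM'. Use `repo_type` argument if needed.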

#### Generating responses from the model

This function acts as our interface with the model for text-based interactions. 

In [None]:
@weave.op()
def call_model(question: str) -> str:
    """Generate an answer from your Local LLM given a prompt."""

    system_prompt = "You are an expert in astrophysics. Please provide a concise and truthful answer to the following question:"
    prompt = system_prompt + "\n\n" + question + "\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=256, do_sample=False,
            no_repeat_ngram_size=3, repetition_penalty=1.2,
            eos_token_id=tokenizer.eos_token_id, pad_token_id=model.config.eos_token_id,
        )
    # Decode only the newly generated tokens, so the prompt never leaks into the answer
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

## 📊 Final Check: Evaluation Protocols

Your neural core is active, but is it mission-grade?

<img src="imgs/mission-grade.png" alt="Mission grade" style="max-width: 50%; width: 50%; height: auto;" />

Use this section to:
- Load an evaluation dataset
- Score model responses using embedding similarity
- Track performance with W&B + Weave

**Evaluation is critical** before deployment: it ensures your model's reasoning is aligned with mission parameters.

In this task, we will evaluate the fine-tuned models by loading them into memory and running inference locally. Once you have the evaluation results, try going back and training another model with new parameters to improve your results. Ask our team for help if you have any questions about optimizing the parameters.

This task carries the most points, so make sure you deploy only your best model.

# Evaluation and Deployment

This evaluation setup is very similar to the one used in the quest backend to score your model. Let's get started!

## Get Evaluation Dataset

We have created a public evaluation dataset that you can use to test and quantitatively evaluate your model. This is a small subset of the evaluation dataset that will be used for final scoring, and it should provide insight into how your finetuned model is performing.

In [None]:
eval_dataset_public = weave.ref('weave:///fc-london-admins/eval-dataset/object/astro_eval_public:I9rFUEYOFGJYvfbxtmL35jQkMqyitKJwf9ppYLXFQ5U').get()

## Test the model with a sample from our eval dataset

Let's try running our model with some sample questions from our eval dataset. 

In [None]:
sample = random.choice(list(eval_dataset_public.rows))  # pick one random row from the public eval set
question = sample['question']
answer = call_model(question)

print("🛰️  Incoming Transmission - Mission Q&A\n")
print(f"🧠 Question:\n{question}\n")
print(f"🤖 Model Response:\n{answer}\n")
print(f"🤖 Reference Answer:\n{sample['answer']}")  # reference answer for the same question

## Setup evaluation

Now that we have vibe-checked our model, let's perform a quantitative analysis of its accuracy. We will be using a simple cosine similarity scorer we've prepared in order to do binary pass/fail categorization.
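
The idea behind a cosine-similarity pass/fail scorer can be sketched in a few lines. This is an illustration only: the notebook's `cosine_similarity_scorer` helper works on sentence embeddings and may use a different threshold. The function names, the toy 3-dimensional vectors, and the 0.7 threshold below are all assumptions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for an all-zero embedding
    return dot / (norm_a * norm_b)


def passes(similarity: float, threshold: float = 0.7) -> bool:
    """Binary pass/fail; the 0.7 threshold is an illustrative assumption."""
    return similarity >= threshold


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
reference = [0.9, 0.1, 0.2]
close_answer = [0.8, 0.2, 0.1]
off_topic = [0.0, 1.0, 0.0]
close_passes = passes(cosine_similarity(reference, close_answer))
off_topic_passes = passes(cosine_similarity(reference, off_topic))
print(close_passes, off_topic_passes)
```

Cosine similarity ignores vector length and compares direction only, which is why it suits comparing embeddings of answers with different lengths.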

In [None]:
import asyncio
eval_dataset = random.sample(list(eval_dataset_public.rows), k=10) # select 10-20 samples to run evaluation against

#### Kick off the Evaluation and view the results in Weave 
Let's see how well our model performs against our reference dataset of astronomical QnA.

This evaluation takes about 5 minutes to run with 10 samples. Make sure to budget your time accordingly. 

Once you are satisfied with the model results, proceed to deployment. If you want to learn more about Evaluations with Weave, see [here](https://docs.wandb.ai/weave/guides/evaluation/evaluation_logger) 

In [None]:
evaluationlogger = weave.EvaluationLogger(
    model="call_model",
    dataset="eval_dataset",
)

scores = []

for i in eval_dataset:
    question = i["question"]
    answer = i["answer"]

    model_output = call_model(question)

    # Log the prediction input and output (inputs are passed as a dict of named fields)
    pred_logger = evaluationlogger.log_prediction(
        inputs={"question": question},
        output=model_output
    )

    # Calculate and log a score for this prediction
    score = cosine_similarity_scorer(model_output, question, answer)
    scores.append(score)

    pred_logger.log_score(
        scorer="cosine_similarity_scorer",
        score=score
    )

    # Finish logging for this specific prediction
    pred_logger.finish()

avg = sum(item['similarity_score'] for item in scores) / len(scores)
evaluationlogger.log_summary({"cosine_similarity_scorer_avg": avg})

print(f"Evaluation complete. Average score: {avg:.4f}")

Now open [Weave](https://wandb.ai/) and go to Evaluations to see the results of your evaluation.

## ⚒️ Deploy 🚀

<img src="imgs/deploy.png" alt="Deploy" style="max-width: 50%; width: 50%; height: auto;" />

#### Merge the model and the adapter

We are at the end of our task. By now, you should have some freshly trained adapters for our language models that will allow us to restore functionality to our ship. Let's take your best adapter and deploy it to our ship's inference engine.

Now we will merge the best adapter from our experiments into the base model and save the result as one single model. Everything the adapter learned during training is then fused into the model weights. Only do this with your best model.
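
What merging does can be shown with toy matrices. A LoRA adapter adds a low-rank update `scale * B @ A` to a weight matrix `W`, and merging folds that update into `W` so inference needs a single matrix multiply. The sketch below uses plain Python lists and made-up numbers; it illustrates the algebra only and is not how `merge_and_unload` is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy shapes: W is 2x2, the adapter has rank 1 (B is 2x1, A is 1x2).
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0]]
scale = 2.0                      # stands in for lora_alpha / r
x = [[1.0], [1.0]]               # input as a column vector

# Unmerged: base output plus the scaled adapter path.
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged: fold the adapter into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)
print(unmerged, merged)
```

Both paths give the same output, which is why the merged model can be served as a standard model without the adapter files.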

In [None]:
# Merge the adapter weights into the base model
model = model.merge_and_unload() # After this, it's a standard Hugging Face model

config = model.config

if hasattr(config, "auto_map"):
    delattr(config, "auto_map")
if "auto_map" in config.to_dict():
    del config.__dict__["auto_map"]

# === Save merged model and tokenizer ===
save_path = "merged_model/"

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Merged model saved to: {save_path}")

###### SET A MODEL VERSION NAME SO YOU CAN REFERENCE IT LATER BEFORE SENDING TO CAIOS ###########
VERSION_NAME="final_submit"
###### SET A MODEL VERSION NAME SO YOU CAN REFERENCE IT LATER BEFORE SENDING TO CAIOS ###########

## Now we upload to CoreWeave AI Object Storage for fast retrieval by CoreWeave Inference Compute for evaluations!
caios_path = send_to_caios("merged_model/", WANDB_ENTITY, VERSION_NAME)

### 🚩 Checkpoint 🚩

You've reached the final checkpoint. In the output for the function above, you will have gotten a CoreWeave AI Object Storage path to the model you've submitted.


Copy the path and enter it into the quest page to get points!


You made it! 

Your ship's neural core is restored. Make sure to go back and check all the tasks marked with 🚩 to collect all the points.

Go take a break. We will see you back here shortly for `Mission 2: Operation NAVARCH`