# Example: Fine-tuning and Serving with Predibase

Learn how to fine-tune and serve a large language model (LLM) for your application. Predibase lets you put open-source LLMs into production without managing training infrastructure or GPU setup.

In this guide, we will fine-tune and serve a text summarizer using mistral-7b, an open-source LLM from Mistral AI. We will use the same dataset that we used for the News Headline Generation task in LoRA Land.

## Supported Models

Predibase supports fine-tuning many popular open-source models, including:

- llama-3-1-8b-instruct
- mistral-7b-instruct-v0-2
- qwen2-7b

To see all models available to fine-tune, check out the [full list of available models](https://docs.predibase.com/user-guide/fine-tuning/finetuning-models).

# Prepare Data

Predibase supports a variety of different data connectors including File Upload, S3, Snowflake, Databricks, and more. You can also upload your data in a few different file formats. We usually recommend CSV or JSONL.

## Instruction Tuning
Your dataset should contain the following two fields:

- **prompt:** The fully materialized input to your model
- **completion:** The output you expect from your model

In the case of JSONL, it should look something like:



```
{"prompt": ..., "completion": ...}
{"prompt": "Please summarize the following article ...", "completion": "Madonna kicks off Celebration World Tour in London"}
{"prompt": "Please summarize the following article ...", "completion": "Facebook Releases First Transparency Report on what Americans see on the platform"}
{"prompt": ..., "completion": ...}
```
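As a minimal sketch of the format above, the snippet below writes prompt/completion records to a JSONL file and checks that every row has both fields. The helper names `write_jsonl` and `validate_jsonl` and the file name are hypothetical and not part of the Predibase SDK; only the `prompt` and `completion` keys come from this guide.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"prompt", "completion"}


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_jsonl(path):
    """Return the number of valid rows; raise ValueError on the first bad row."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            if not isinstance(row, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            for key in REQUIRED_KEYS:
                if not isinstance(row[key], str) or not row[key]:
                    raise ValueError(f"line {line_no}: {key} must be a non-empty string")
            count += 1
    return count


# Example record taken from the sample rows in this guide.
records = [
    {
        "prompt": "Please summarize the following article ...",
        "completion": "Madonna kicks off Celebration World Tour in London",
    },
]
sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"  # hypothetical file name
write_jsonl(records, sample_path)
print(validate_jsonl(sample_path))
```

Running a check like this before uploading can catch rows with a missing or empty field early, before a fine-tuning job is started.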



# Train Model

You can use the Web UI or the Python SDK to connect your data and start a fine-tuning job. In this example, we will use the SDK.

## Initialize Predibase

In [1]:
!pip install --upgrade --force-reinstall numpy pandas
!pip install -U predibase --quiet


In [1]:
from predibase import Predibase, FinetuningConfig, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")  # Replace with your own token; never commit a real token

You can generate an API token on the homepage or find an existing key under Settings > My Profile.

## Connect Data to Predibase

We will use the tldr_news dataset, which is available at this [Google Drive link](https://drive.google.com/file/d/19n9tEkFIyRQxu3jj1Raw9MMb2BRyQ_AR/view?usp=sharing). This copy has been pre-formatted with the [instruction template](https://docs.predibase.com/user-guide/fine-tuning/instruction_formats) for mistral-7b. You can find the original dataset, which has a numerical split column and no prompt template, on [HuggingFace](https://huggingface.co/datasets/JulesBelveze/tldr_news).
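To make "pre-formatted with the instruction template" concrete, here is a small sketch of how a raw passage could be wrapped in Mistral's `[INST]` tags. The wording mirrors the prompt used in the inference cells later in this guide; whether the shared dataset was built with exactly this string is an assumption, and `build_prompt` is a hypothetical helper, not an SDK function.

```python
# Instruction text copied from the inference prompt used later in this guide.
INSTRUCTION = (
    "The following passage is content from a news report. "
    "Please summarize this passage in one sentence or less."
)


def build_prompt(passage: str) -> str:
    """Wrap a raw passage in the Mistral instruct template (assumed layout)."""
    return f"<s>[INST] {INSTRUCTION} \n Passage: {passage.strip()} \n Summary: [/INST] "


print(build_prompt("Memray is a memory profiler for Python."))
```

Using the same helper at training time and at inference time keeps the two prompt formats identical, which matters because the adapter is tuned on the exact template it sees.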

Once you download the dataset locally, you can upload it to the Google Colab environment using the following command.

In [3]:
from google.colab import files
files.upload()

Saving tldr_dataset.csv to tldr_dataset.csv




Now let's upload it to Predibase:

In [4]:
dataset = pb.datasets.from_file("./tldr_dataset.csv", name="tldr_dataset")

## Kick Off Training

We can start a fine-tuning job with the recommended defaults as follows:

In [None]:
# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Start a fine-tuning job, blocks until training is finished
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b-instruct-v0-2"
    ),
    dataset="tldr_dataset", # Dataset name as a string; the `dataset` object created above also works
    repo=repo,
    description="initial model with defaults"
)

Successfully requested finetuning of mistral-7b-instruct-v0-2 as `news-summarizer-model/6`. (Job UUID: 77ef6ec2-6dbb-4ca7-ad3a-8c0db12a5af5).

Watching progress of finetuning job 77ef6ec2-6dbb-4ca7-ad3a-8c0db12a5af5. This call will block until the job has finished. Canceling or terminating this call will NOT cancel or terminate the job itself.

Job is starting. Total queue time: 0:00:50         
Waiting to receive training metrics...

┌────────────┬────────────┬─────────────────┐
│ checkpoint │ train_loss │ validation_loss │
├────────────┼────────────┼─────────────────┤
│     1      │   1.1853   │      1.2524     │
│     2      │   0.9614   │      1.2336     │
│     3      │   1.3366   │      1.2208     │
│     4      │   1.2255   │      1.2114     │
│     5      │   1.2624   │      1.2344     │
│     6      │   1.1093   │      1.2463     │
│     7      │   1.0754   │      1.2386     │
│    
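One way to read the metrics log is to look for the checkpoint with the lowest validation loss. The sketch below does this for the seven rows visible in the log; the output is truncated there, so later checkpoints may differ, and this is only an illustration of reading the table, not a description of how Predibase selects a checkpoint.

```python
# (checkpoint, train_loss, validation_loss) rows copied from the log in this guide.
metrics = [
    (1, 1.1853, 1.2524),
    (2, 0.9614, 1.2336),
    (3, 1.3366, 1.2208),
    (4, 1.2255, 1.2114),
    (5, 1.2624, 1.2344),
    (6, 1.1093, 1.2463),
    (7, 1.0754, 1.2386),
]


def best_checkpoint(rows):
    """Return (checkpoint, validation_loss) for the lowest validation loss."""
    checkpoint, _train_loss, val_loss = min(rows, key=lambda row: row[2])
    return checkpoint, val_loss


print(best_checkpoint(metrics))  # (4, 1.2114)
```

Here the training loss fluctuates from checkpoint to checkpoint while the validation loss bottoms out at checkpoint 4 and then drifts upward, which is why validation loss is usually the more useful column to watch.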

### Customize Hyperparameters

If you want to customize your hyperparameters, you can do so as shown below. Currently, we support modifying epochs, rank, learning rate, and target modules, and we are working to expose additional hyperparameters soon.

In [None]:
# Create an adapter repository
repo = pb.repos.create(name="news-summarizer-model", description="TLDR News Summarizer Experiments", exists_ok=True)

# Start a fine-tuning job with custom parameters, blocks until training is finished
adapter = pb.adapters.create(
    config=FinetuningConfig(
        base_model="mistral-7b-instruct-v0-2",
        epochs=1, # default: 3
        rank=8, # default: 16
        learning_rate=0.0001, # default: 0.0002
        target_modules=["q_proj", "v_proj", "k_proj"], # default: None (infers [q_proj, v_proj] for mistral-7b)
    ),
    dataset=dataset,
    repo=repo,
    description="changing epochs, rank, learning rate, and target modules"
)
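To get a feel for what `rank` and `target_modules` control, the sketch below estimates how many trainable LoRA parameters each configuration adds. LoRA attaches two low-rank matrices to each targeted projection, adding `rank * (in_features + out_features)` weights per module. The projection shapes and layer count are assumptions taken from the published Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, 32 layers); the exact set of weights Predibase trains may differ.

```python
# Assumed (in_features, out_features) per attention projection in Mistral-7B.
PROJ_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer layers


def lora_param_count(rank, target_modules, shapes=PROJ_SHAPES, num_layers=NUM_LAYERS):
    """Estimate trainable LoRA weights: rank * (in + out) per module, per layer."""
    per_layer = sum(rank * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


# Defaults listed in the comments above: rank 16 on q_proj and v_proj.
print(lora_param_count(16, ["q_proj", "v_proj"]))  # 6815744
# Customized run above: rank 8 on q_proj, v_proj, and k_proj.
print(lora_param_count(8, ["q_proj", "v_proj", "k_proj"]))  # 4718592
```

Under these assumptions, halving the rank more than offsets adding `k_proj`, so the customized run trains fewer adapter weights than the default configuration.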

## Monitor Progress

Once the fine-tuning job has started, you can monitor its progress in the Web UI or through the SDK:

In [None]:
# Get adapter, blocking call if training is still in progress

adapter = pb.adapters.get("news-summarizer-model/1")
adapter

Adapter(repo='news-summarizer-adapter', tag=1, base_model='mistral-7b-instruct-v0-2', description='My first model', artifact_path='2b868d40-79a5-4630-9bcb-01484b9e495d/99ec039b226941f385c0b17944bc581c/artifacts/model/model_weights', finetuning_error=None, finetuning_job_uuid='2b868d40-79a5-4630-9bcb-01484b9e495d')

# Use Your Adapter

Start by prompting your adapter using a **shared serverless endpoint**. Once you're happy with your adapter's performance, create a **private serverless deployment** for production use.

## 1. Shared Serverless Endpoints (Free, with rate limits)

Serverless endpoints are a shared resource we offer for getting started, experimentation, and fast iteration. If your base model is hosted as a serverless endpoint, you can prompt your fine-tuned model right away through LoRAX:

In [None]:
input_prompt="""
  <s>[INST] The following passage is content from a news report. Please summarize this passage in one sentence or less.
  Passage: Memray is a memory profiler for Python. It can help developers discover the cause of high memory usage, find memory leaks, and find hotspots in code that cause a lot of allocations. Memray can be used both as a command-line tool or as a library.
  Summary: [/INST]
"""

lorax_client = pb.deployments.client("mistral-7b-instruct-v0-2")
print(lorax_client.generate(input_prompt, adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

Deployment mistral-7b-instruct-v0-2 is still spinning up. Your prompt may take longer than normal to execute.

Memray (GitHub Repo)


The call to `pb.deployments.client` returns a [LoRAX client](https://loraexchange.ai/reference/python_client/) that we can use for prompting. The call to `generate` passes the adapter repo name and version as `adapter_id` to prompt our fine-tuned model.

We can compare the fine-tuned model to the base model by calling generate without adapter_id:

In [None]:
print(lorax_client.generate(input_prompt, max_new_tokens=100).generated_text)

Deployment mistral-7b-instruct-v0-2 is still spinning up. Your prompt may take longer than normal to execute.

Memray is a memory profiling tool for Python that identifies memory usage issues, including leaks and allocation hotspots, and can be used as a command-line tool or library.
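The side-by-side comparison above can be wrapped in a small helper that prompts the base model and then each adapter in turn. This is a sketch: `compare_generations` and `FakeClient` are hypothetical names, and the fake client only imitates the `generate(...).generated_text` shape used in this guide so that the example runs offline. In a notebook, pass `lorax_client` instead.

```python
from types import SimpleNamespace


def compare_generations(client, prompt, adapter_ids, max_new_tokens=100):
    """Return {label: generated text} for the base model and each adapter."""
    base = client.generate(prompt, max_new_tokens=max_new_tokens)
    results = {"base": base.generated_text}
    for adapter_id in adapter_ids:
        response = client.generate(
            prompt, adapter_id=adapter_id, max_new_tokens=max_new_tokens
        )
        results[adapter_id] = response.generated_text
    return results


class FakeClient:
    """Hypothetical offline stand-in for the LoRAX client."""

    def generate(self, prompt, adapter_id=None, max_new_tokens=100):
        # Echo which model was addressed instead of calling a real endpoint.
        return SimpleNamespace(generated_text=f"[{adapter_id or 'base'}] summary")


outputs = compare_generations(FakeClient(), "prompt", ["news-summarizer-model/1"])
for label, text in outputs.items():
    print(f"{label}: {text}")
```

Because the adapter is chosen per request, the same client can be reused to compare several adapter versions against the base model on one prompt.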


## 2. Private Serverless Deployments ($/GPU-hour)

Once you're ready for production, deploy a private instance of the base model for greater reliability, control, and no rate limiting. LoRAX enables you to serve an unlimited number of adapters on a single base model deployment.

Predibase officially supports serving [these models](https://docs.predibase.com/user-guide/inference/models#private-serverless); use the model name from that list as the `base_model`. By default, private serverless deployments spin down after 12 hours without activity. To keep a replica running at all times, set `min_replicas` to 1; to change the idle window, set `cooldown_time`.

Private serverless deployments are available to Developer and Enterprise tier users. Free tier users are upgraded to the Developer tier automatically once they add a credit card.

In [None]:
# Deploy
pb.deployments.create(
    name="my-mistral-7b",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        # cooldown_time=3600, # Value in seconds, defaults to 43200 (12hrs)
        min_replicas=0, # Auto-scales to 0 replicas when not in use
        max_replicas=1
    )
    # description="", # Optional
)

---------  ---------------------------
ScalingUp  2024-07-26T20:51:05.118087Z
Stopped    2024-07-26T20:51:06.43712Z
---------  ---------------------------
--------------  ---------------------------
WaitingForNode  2024-07-26T20:52:18.779324Z
AcquiredNode    2024-07-26T20:52:18.885541Z
--------------  ---------------------------
--------------  ---------------------------
WarmingUpModel  2024-07-26T20:52:30.795613Z
--------------  ---------------------------
-----  ---------------------------
Ready  2024-07-26T20:53:06.381196Z
-----  ---------------------------


Deployment(name='my-mistral-7b', uuid='444c6254-6f20-4f0d-85d0-1e82d9952b3b', description='', type='dedicated', status='ready', cooldown_time=43200, context_window=32764, accelerator='a100_80gb_100', model='predibase/Mistral-7B-Instruct-v0.2-dequantized', min_replicas=0, max_replicas=1, current_replicas=1, scale_up_threshold=1)

In [None]:
# Prompt
input_prompt="<s>[INST] The following passage is content from a news report. Please summarize this passage in one sentence or less. \n Passage: Memray is a memory profiler for Python. It can help developers discover the cause of high memory usage, find memory leaks, and find hotspots in code that cause a lot of allocations. Memray can be used both as a command-line tool or as a library. \n Summary: [/INST] "
lorax_client = pb.deployments.client("my-mistral-7b")
print(lorax_client.generate(input_prompt, adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

Memray (GitHub Repo)


### Delete Deployment

By default, your deployment scales down to 0 replicas when idle. While it is scaled to 0, you won't be billed, and it scales back up automatically as soon as you send a request. If you don't intend to use the deployment again, you can delete it:

In [None]:
pb.deployments.delete("my-mistral-7b") # The name must match the name used when creating the dedicated deployment

## Download Model

You can download your adapter (available on the Enterprise tier) as follows:

In [None]:
pb.adapters.download("news-summarizer-model/1")

Note that the exported model files will contain only the adapter weights, not the full LLM weights.

# Next Steps
- Try training with your own dataset and use case
- Try training with a larger model (e.g., Mixtral 8x7B) to compare performance