# Preparing a Dataset for Gemini Fine-tuning

This tutorial demonstrates how to prepare a dataset for fine-tuning a Gemini model. We'll cover:
1. Setting up the environment
2. Loading and examining the data
3. Converting data to the required JSONL format
4. Validating the prepared dataset

## 1. Setup

First, let's install the required packages:

In [None]:
!pip install pandas

Import the necessary libraries:

In [None]:
import json
import pandas as pd

## 2. Load and Examine the Data

We'll start by loading the CSV file that contains the training data. The CSV should have a `prompt` column holding the original prompts and a `gemini_rewrite` column holding their rewritten versions.

In [None]:
# Load the CSV file
df = pd.read_csv('data.csv')

# Display basic information about the dataset
print("Dataset Info:")
print(f"Total entries: {len(df)}")
print("\nFirst few rows:")
display(df.head())

Dataset Info:
Total entries: 83

First few rows:

| Unnamed: 0 | prompt | gemini_rewrite |
|---|---|---|
| 0 | Using WebPilot, create an outline for an artic... | Craft a detailed article outline (for a 2,000-... |
| 1 | I want you to act as an English translator, sp... | I want you to act as an English translator, sp... |
| 2 | I want you to act as an interviewer. I will be... | Assume the role of an interviewer. I will be t... |
| 3 | I want you to act as a javascript console. I w... | Simulate a JavaScript console. Respond to my c... |
| 4 | I want you to act as a text based excel. you'l... | I want you to act as a text-based Excel. You'l... |
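Before converting, it can help to confirm that both columns are present and to count empty cells, since a blank cell would otherwise end up in the training file as the text `nan`. The following is a minimal sketch, not part of the original notebook: `is_blank`, `check_rows`, and the sample rows are hypothetical names and data introduced for illustration.

```python
import math

REQUIRED_COLUMNS = ("prompt", "gemini_rewrite")


def is_blank(value):
    """True for None, NaN (how pandas represents empty CSV cells), or whitespace-only text."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def check_rows(rows, required=REQUIRED_COLUMNS):
    """Count blank cells per required column; raise if a column is absent from any row."""
    blanks = {col: 0 for col in required}
    for index, row in enumerate(rows):
        missing = [col for col in required if col not in row]
        if missing:
            raise ValueError(f"Row {index} is missing columns: {missing}")
        for col in required:
            if is_blank(row[col]):
                blanks[col] += 1
    return blanks


# Hypothetical stand-in rows; the second and third have an unusable `prompt`.
sample_rows = [
    {"prompt": "Act as a translator.", "gemini_rewrite": "Assume the role of a translator."},
    {"prompt": float("nan"), "gemini_rewrite": "Simulate a JavaScript console."},
    {"prompt": "   ", "gemini_rewrite": "Act as a text-based Excel."},
]
print(check_rows(sample_rows))
```

In the notebook, the same check could be run on the loaded data with `check_rows(df.to_dict("records"))`.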


## 3. Convert to Gemini Fine-tuning Format

Gemini requires the training data to be in a specific JSONL format: one JSON object per line. Each object should contain:
- A system instruction (the `systemInstruction` field)
- Input-output pairs in the form of user and model messages (the `contents` field)
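As a small illustration of the one-object-per-line rule, the sketch below serializes a single made-up record with `json.dumps`. The field names follow the ones used in this notebook; the text values are invented. Note that line breaks inside a prompt are escaped as `\n`, so the record still occupies exactly one line of the file.

```python
import json

# A made-up record using the same field names as this notebook.
record = {
    "systemInstruction": {"parts": [{"text": "Rewrite the prompt."}]},
    "contents": [
        {"role": "user", "parts": [{"text": "AI-Generated: first line\nsecond line"}]},
        {"role": "model", "parts": [{"text": "Human-Like: line one\nline two"}]},
    ],
}

# json.dumps escapes embedded newlines, so the result is a single physical line.
line = json.dumps(record, ensure_ascii=False)
print(line)

# Parsing the line gives back the original structure.
restored = json.loads(line)
print(restored["contents"][0]["role"], "->", restored["contents"][1]["role"])
```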

Here's how we'll structure our data:

In [None]:
# Define the system instruction
example_text = "Your task is to rewrite AI-generated prompts to make them more human-like."
system_instruction = {
    "parts": [
        {"text": example_text}
    ]
}

# Generate the JSONL file
output_file = 'training_data.jsonl'
written = 0  # count entries as they are written, so the summary is always accurate
with open(output_file, 'w', encoding='utf-8') as f:
    # Skip the first two rows, which will be used as examples
    for _, row in df.iloc[2:].iterrows():
        entry = {
            "systemInstruction": system_instruction,
            "contents": [
                {
                    # Model input: the Gemini rewrite of the prompt
                    "role": "user",
                    "parts": [{"text": f"AI-Generated: {row['gemini_rewrite']}"}]
                },
                {
                    # Target output: the original prompt
                    "role": "model",
                    "parts": [{"text": f"Human-Like: {row['prompt']}"}]
                }
            ]
        }
        # Write each JSON object as a single line
        f.write(json.dumps(entry, ensure_ascii=False) + '\n')
        written += 1

print(f"\n{output_file} has been generated successfully with {written} entries.")

## 4. Validate the Generated JSONL

Let's examine a few entries from our generated JSONL file to ensure they're formatted correctly:

In [None]:
# Read and display the first few entries of the JSONL file
with open('training_data.jsonl', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= 3:  # Show first 3 entries
            break
        entry = json.loads(line)
        print(f"\nEntry {i+1}:")
        print(json.dumps(entry, indent=2))
        print("-" * 80)
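Printing a few entries shows the shape, but it does not check every line. The sketch below is one possible structural check, assuming the layout used in this notebook (a `systemInstruction` with `parts`, and a `contents` list holding exactly one `user` message followed by one `model` message). `validate_entry` and `validate_jsonl` are hypothetical helper names, and the rules reflect this notebook's layout rather than an official specification.

```python
import json


def validate_entry(entry):
    """Return a list of problems found in one parsed JSONL entry (empty if none)."""
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]
    problems = []
    system = entry.get("systemInstruction")
    if not isinstance(system, dict) or not system.get("parts"):
        problems.append("systemInstruction.parts is missing or empty")
    contents = entry.get("contents")
    if not isinstance(contents, list) or not contents:
        return problems + ["contents is missing or empty"]
    # Assumption: each entry holds one user message followed by one model message.
    roles = [msg.get("role") if isinstance(msg, dict) else None for msg in contents]
    if roles != ["user", "model"]:
        problems.append(f"expected roles ['user', 'model'], got {roles}")
    for idx, msg in enumerate(contents):
        parts = msg.get("parts") if isinstance(msg, dict) else None
        texts = []
        if isinstance(parts, list):
            texts = [p.get("text") for p in parts if isinstance(p, dict)]
        if not texts or not all(isinstance(t, str) and t.strip() for t in texts):
            problems.append(f"contents[{idx}] has missing or blank text")
    return problems


def validate_jsonl(path):
    """Return (line_number, problem) pairs for every issue found in the file."""
    report = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                report.append((line_no, "blank line"))
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                report.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            report.extend((line_no, problem) for problem in validate_entry(entry))
    return report
```

Calling `validate_jsonl('training_data.jsonl')` returns an empty list when every line passes these checks.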

## Next Steps

Your dataset is now in the JSONL format prepared above for Gemini fine-tuning! Each line of the generated `training_data.jsonl` file contains:
- A system instruction defining the task
- An input-output pair where:
  - Input: the AI-generated prompt (from the `gemini_rewrite` column)
  - Output: the human-like version (the original text from the `prompt` column)

You can now proceed with:
1. Uploading this JSONL file to Google Cloud Storage
2. Creating a fine-tuning job using the Vertex AI API
3. Training your custom Gemini model
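For the first step, a minimal sketch of the upload using the `google-cloud-storage` client library might look like the following. It assumes the package is installed (`pip install google-cloud-storage`) and that application default credentials are configured; the bucket and object names are placeholders, and `gcs_uri` and `upload_training_file` are hypothetical helper names.

```python
def gcs_uri(bucket_name, object_name):
    """Build the gs:// URI for an object in a bucket."""
    return f"gs://{bucket_name}/{object_name}"


def upload_training_file(local_path, bucket_name, object_name):
    """Upload a local file to Cloud Storage and return its gs:// URI."""
    # Imported here so gcs_uri() can be used without the client library installed.
    from google.cloud import storage

    client = storage.Client()  # picks up application default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, object_name)


# Placeholder names; replace them with your own bucket and object path.
print(gcs_uri("my-bucket", "tuning/training_data.jsonl"))
```

The returned URI is the kind of value a fine-tuning job would typically take as its training dataset location.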

Make sure to check Google's documentation for specific requirements regarding:
- Minimum number of training examples
- Maximum sequence length
- Other model-specific constraints
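To compare the file against those requirements, it helps to know how many examples it holds and how long the longest one is. The sketch below reports both. It measures length in characters, which is only a rough stand-in for tokens, and it makes no claim about what the actual limits are. `dataset_stats` is a hypothetical helper name.

```python
import json


def dataset_stats(path):
    """Count examples and measure the total text length (in characters) of each one."""
    lengths = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            entry = json.loads(line)
            system_parts = entry.get("systemInstruction", {}).get("parts", [])
            total = sum(len(part.get("text", "")) for part in system_parts)
            for message in entry.get("contents", []):
                total += sum(len(part.get("text", "")) for part in message.get("parts", []))
            lengths.append(total)
    return {
        "examples": len(lengths),
        "max_chars": max(lengths, default=0),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

For example, `dataset_stats('training_data.jsonl')` returns a dictionary with the number of examples and the longest and average example lengths in characters.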