# Dolly-15K Dataset Analysis & Preprocessing

## Objective
Understand the Dolly-15K dataset structure and perform necessary preprocessing for LLaMA-2-7B fine-tuning.

## Dataset Overview
- **Source**: [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
- **Size**: 15,011 instruction-following examples
- **Format**: `instruction`, `context`, `response`, and `category` fields (`context` is empty for about 70% of examples)
- **Quality**: Human-written instruction-response pairs authored by Databricks employees
- **Purpose**: Supervised fine-tuning (SFT)

## 🔧 Preprocessing Tasks
1. **Load and inspect dataset structure**
2. **Data quality checks** (count empty fields and drop examples with an empty response)
3. **Format standardization** to match assignment requirements
4. **Data splits** (80% train, 10% val, 10% test)
5. **Tokenization analysis** for LLaMA-2 compatibility
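
Step 5 can be approached by measuring how many tokens each formatted example occupies relative to the intended maximum sequence length. The sketch below is a minimal, tokenizer-agnostic version of that idea. The names `token_length_stats` and `max_len`, and the whitespace tokenizer, are illustrative assumptions and not part of this notebook; a real analysis would pass the LLaMA-2 tokenizer's encoding function in place of `str.split`.

```python
from statistics import mean, median
from typing import Callable, Sequence


def token_length_stats(
    texts: Sequence[str],
    tokenize: Callable[[str], Sequence],
    max_len: int = 2048,
) -> dict:
    """Summarize token counts for a non-empty list of formatted examples.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    `max_len` is a hypothetical sequence-length budget, not a value from the notebook.
    """
    lengths = [len(tokenize(t)) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        # Examples longer than the budget would be truncated during training.
        "over_budget": sum(1 for n in lengths if n > max_len),
    }


# Whitespace splitting stands in for a real tokenizer here (assumption).
sample = [
    "### Instruction:\nName a fish.\n\n### Response:\nTope",
    "### Instruction:\nSay hi.\n\n### Response:\nHi",
]
stats = token_length_stats(sample, str.split, max_len=7)
print(stats)
```

The `over_budget` count is the figure that matters most when choosing a maximum sequence length, since it shows how many examples would lose part of their response to truncation.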

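## 📋 Expected Output Format
Each example is rendered into a single `text` string using the template below. The `### Context:` block is omitted when `context` is empty.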
```
### Instruction:
{instruction}

### Context:
{context}

### Response:
{response}
```
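
At inference time the prompt should follow the same template but stop at the response header, so the model continues from there. The helper below is a minimal sketch of that idea; `build_inference_prompt` is a hypothetical name, the example instruction is made up, and ending the prompt with a newline after `### Response:` is an assumption that mirrors the training layout.

```python
def build_inference_prompt(instruction: str, context: str = "") -> str:
    """Build the prompt prefix in the training template, ending at the response header."""
    instruction = instruction.strip()
    context = context.strip() if context else ""

    parts = [f"### Instruction:\n{instruction}"]
    if context:
        # The Context block is only emitted when there is context to show.
        parts.append(f"### Context:\n{context}")
    # The trailing newline matches where the response text begins in training data.
    parts.append("### Response:\n")
    return "\n\n".join(parts)


prompt = build_inference_prompt("Name three primary colors.")
print(prompt)
```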

## 📁 Processed Dataset Storage
**Files Created** (each stores only the formatted `text` column):
- `train.parquet` (12,008 examples - 80%)
- `val.parquet` (1,501 examples - 10%)  
- `test.parquet` (1,502 examples - 10%)

**Access Link**: https://drive.google.com/drive/folders/1CXJHPZEYk-XOypqvg71-j4fkdEvkboct?usp=sharing


In [3]:
import polars as pl
from datasets import load_dataset

In [4]:
dataset_name = "databricks/databricks-dolly-15k"

# The dataset ships a single "train" split.
dataset = load_dataset(dataset_name)
dataset = dataset['train']
row_count = dataset.num_rows

# Pull out each column so it can be loaded into a Polars DataFrame.
instruction = dataset['instruction']
context = dataset['context']
response = dataset['response']
category = dataset['category']

# Sanity check: every column has one value per row.
assert len(instruction) == row_count
assert len(context) == row_count
assert len(response) == row_count
assert len(category) == row_count

print(f"{dataset_name} has {row_count} rows")

databricks/databricks-dolly-15k has 15011 rows


In [5]:
df = pl.DataFrame({
    'instruction': instruction,
    'context': context,
    'response': response,
    'category': category
})

df.head(10)

| instruction | context | response | category |
| --- | --- | --- | --- |
| str | str | str | str |
| "When did Virgin Australia star…" | "Virgin Australia, the trading …" | "Virgin Australia commenced ser…" | "closed_qa" |
| "Which is a species of fish? To…" | "" | "Tope" | "classification" |
| "Why can camels survive for lon…" | "" | "Camels use the fat in their hu…" | "open_qa" |
| "Alice's parents have three dau…" | "" | "The name of the third daughter…" | "open_qa" |
| "When was Tomoaki Komorida born…" | "Komorida was born in Kumamoto …" | "Tomoaki Komorida was born on J…" | "closed_qa" |
| "If I have more pieces at the t…" | "Stalemate is a situation in ch…" | "No. Stalemate is a drawn posi…" | "information_extraction" |
| "Given a reference text about L…" | "Lollapalooza /ˌlɒləpəˈluːzə/ (…" | "Lollapalooze is an annual musi…" | "closed_qa" |
| "Who gave the UN the land in NY…" | "" | "John D Rockerfeller" | "open_qa" |
| "Why mobile is bad for human" | "" | "We are always engaged one phon…" | "brainstorming" |
| "Who was John Moses Browning?" | "John Moses Browning (January 2…" | "John Moses Browning is one of …" | "information_extraction" |


In [6]:
empty_instructions = df.filter(pl.col('instruction').str.strip_chars() == '').height
empty_contexts = df.filter(pl.col('context').str.strip_chars() == '').height
empty_responses = df.filter(pl.col('response').str.strip_chars() == '').height
empty_categories = df.filter(pl.col('category').str.strip_chars() == '').height

print(f"Empty instructions: {empty_instructions}")
print(f"Empty contexts: {empty_contexts} ({empty_contexts/len(df)*100:.1f}%)")
print(f"Empty responses: {empty_responses}")
print(f"Empty categories: {empty_categories}")

df_cleaned = df.filter(pl.col('response').str.strip_chars() != '')
removed_count = df.height - df_cleaned.height

print(f"Removed {removed_count} examples with empty responses")
print(f"Cleaned dataset size: {df_cleaned.height}")



Empty instructions: 0
Empty contexts: 10544 (70.2%)
Empty responses: 0
Empty categories: 0
Removed 0 examples with empty responses
Cleaned dataset size: 15011


In [7]:
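def format_instruction_template(instruction: str, context: str, response: str) -> str:
    """Render one example in the prompt template; the Context block is dropped when context is empty."""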
    instruction = instruction.strip()
    context = context.strip() if context else ""
    response = response.strip()

    if context:
        return f"""### Instruction:
{instruction}

### Context:
{context}

### Response:
{response}"""
    else:
        return f"""### Instruction:
{instruction}

### Response:
{response}"""



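# Build the training text from the cleaned frame so the empty-response filter above takes effect.
df = df_cleaned.with_columns([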
    pl.struct(['instruction', 'context', 'response'])
      .map_elements(
          lambda x: format_instruction_template(x['instruction'], x['context'], x['response']),
          return_dtype=pl.Utf8
      )
      .alias('text')
])

print(df['text'][0])

### Instruction:
When did Virgin Australia start operating?

### Context:
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.

### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.


In [9]:
# Shuffle all rows once with a fixed seed so the split is reproducible.
df_shuffled = df.sample(fraction=1.0, seed=42, shuffle=True)

total_size = df_shuffled.height
train_size = int(0.8 * total_size)
val_size = int(0.1 * total_size)
# The test split takes the remainder so rounding never drops a row.
test_size = total_size - train_size - val_size

print(f"Total examples: {total_size:,}")
print("\nCalculated split sizes:")
print(f"  Train:      {train_size:,} ({train_size/total_size*100:.1f}%)")
print(f"  Validation: {val_size:,} ({val_size/total_size*100:.1f}%)")
print(f"  Test:       {test_size:,} ({test_size/total_size*100:.1f}%)")


train_df = df_shuffled.slice(0, train_size)
val_df = df_shuffled.slice(train_size, val_size)
test_df = df_shuffled.slice(train_size + val_size, test_size)

assert train_df.height + val_df.height + test_df.height == total_size, "Data loss in split!"

Total examples: 15,011

Calculated split sizes:
  Train:      12,008 (80.0%)
  Validation: 1,501 (10.0%)
  Test:       1,502 (10.0%)
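
The split above relies on shuffling once and then cutting contiguous slices, with the test set taking the remainder. The sketch below reproduces that logic on plain row indices using only the standard library, to show why the three parts cannot overlap and why no row is lost. `split_indices` is a hypothetical helper, and Python's `random` module will produce a different ordering than Polars' `sample`, so only the sizes and the partition property carry over.

```python
import random


def split_indices(n: int, train_frac: float = 0.8, val_frac: float = 0.1, seed: int = 42):
    """Shuffle row indices once, then cut them into train/val/test slices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    train_size = int(train_frac * n)
    val_size = int(val_frac * n)

    train = indices[:train_size]
    val = indices[train_size:train_size + val_size]
    test = indices[train_size + val_size:]  # remainder, so no row is dropped
    return train, val, test


train_idx, val_idx, test_idx = split_indices(15011)
print(len(train_idx), len(val_idx), len(test_idx))  # 12008 1501 1502

# The three slices should not overlap and should cover every row exactly once.
assert set(train_idx).isdisjoint(val_idx)
assert set(train_idx).isdisjoint(test_idx)
assert set(val_idx).isdisjoint(test_idx)
assert sorted(train_idx + val_idx + test_idx) == list(range(15011))
```

Because `int()` truncates, the train and validation sizes round down and the test split absorbs the leftover row, which is why it holds 1,502 examples instead of 1,501.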


In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')
drive_path = '/content/drive/MyDrive/LLaMA2-Dolly-Training/data'
os.makedirs(drive_path, exist_ok=True)

# Persist only the formatted `text` column; the raw fields are not needed for training.
train_df_storage = train_df.select(['text'])
val_df_storage = val_df.select(['text'])
test_df_storage = test_df.select(['text'])

train_df_storage.write_parquet(f'{drive_path}/train.parquet')
val_df_storage.write_parquet(f'{drive_path}/val.parquet')
test_df_storage.write_parquet(f'{drive_path}/test.parquet')



Mounted at /content/drive
