<a href="https://colab.research.google.com/github/Noel-Niko/AWS-Automate-Machine-Learning-Workflows/blob/master/use-cases/ODSC-Workshop/ODSC_Synthetic_Data_Workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Build Synthetic Datasets with Cerebras + Synthetic Data Kit

Check out Synthetic Data Kit here: https://github.com/meta-llama/synthetic-data-kit/

**ODSC Workshop - From Research Paper to Fine-Tuning Dataset**

In this notebook, you'll:
- ✅ Parse the Llama 3 research paper
- ✅ Generate 50+ Q&A pairs using Cerebras inference
- ✅ Filter for quality using LLM-as-judge
- ✅ Export to fine-tuning format

**No coding required - just run the cells!** ⚡

## 🔑 Step 1: Set Your Cerebras API Key

Add your Cerebras API key to Colab Secrets as `CEREBRAS_API_KEY` (recommended), or paste it into the cell below:

In [18]:
import os
from google.colab import userdata

# Option 1 (recommended): read the key from Colab Secrets
# (add it under the name 'CEREBRAS_API_KEY' in the Secrets panel)
CEREBRAS_API_KEY = userdata.get('CEREBRAS_API_KEY')

# Option 2: paste the key directly (never share or commit a notebook that contains a real key)
# CEREBRAS_API_KEY = "YOUR_CEREBRAS_API_KEY"

# Set environment variable
os.environ['CEREBRAS_API_KEY'] = CEREBRAS_API_KEY

print("✅ API key configured!")
print(f"🔑 Key preview: {CEREBRAS_API_KEY[:4]}...")

✅ API key configured!
🔑 Key preview: csk-...
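
Printing a key prefix still leaves part of the secret in the saved notebook output. A minimal masking sketch follows; `mask_key` is a hypothetical helper (not part of Colab or the toolkit), and the sample string is a made-up value, not a real key.

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Return the key with everything after the first `visible` characters hidden."""
    if len(key) <= visible:
        # Too short to show any prefix safely: hide it entirely
        return "*" * len(key)
    return key[:visible] + "*" * (len(key) - visible)


# Made-up value for illustration only
print(mask_key("csk-example-not-a-real-key"))
```

Showing only the provider prefix is usually enough to confirm that the right variable was picked up.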


## 📦 Step 2: Install Synthetic Data Kit

Installing the toolkit and dependencies...

In [19]:
!pip install -q synthetic-data-kit
!pip install -q datasets  # For HuggingFace format export

# Verify installation
!synthetic-data-kit --help | head -15

print("\n✅ Installation complete!")

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
                                                                                
 Usage: synthetic-data-kit [OPTIONS] COMMAND [ARGS]...                          
                                                                                
 A toolkit for preparing synthetic datasets for fine-tuning LLMs                
                                                                                
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --config              -c      PATH  Path to configuration file               │
│ --install

## ⚙️ Step 3: Download Workshop Configuration

Downloading the ready-to-use config from GitHub and setting up directories...

In [20]:
# Create directory structure
!mkdir -p data/{parsed,generated,curated,final}

print("📥 Downloading workshop config from GitHub...")

# Download the ready-to-use config from GitHub (ODSC-Workshop branch)
!wget -q https://raw.githubusercontent.com/meta-llama/synthetic-data-kit/ODSC-Workshop/configs/config.yaml -O cerebras_config.yaml

print("✅ Config downloaded!")

# Replace the API key placeholder with your actual key
import os

with open('cerebras_config.yaml', 'r') as f:
    config_content = f.read()

# os.environ[...] raises KeyError if Step 1 was skipped, instead of passing None to replace()
config_content = config_content.replace('YOUR_CEREBRAS_API_KEY', os.environ['CEREBRAS_API_KEY'])

with open('cerebras_config.yaml', 'w') as f:
    f.write(config_content)

print("✅ Configuration ready with your API key!")
print("\n📁 Directory structure:")
!tree data/ || ls -R data/

# Note: the preview may include your API key; clear this output before sharing the notebook
print("\n📄 Config preview (first 35 lines):")
!head -35 cerebras_config.yaml

📥 Downloading workshop config from GitHub...
✅ Config downloaded!
✅ Configuration ready with your API key!

📁 Directory structure:
/bin/bash: line 1: tree: command not found
data/:
curated  final	generated  input  parsed

data/curated:
lenient_6.5.json  llama3_paper_qa_pairs_cleaned.json  strict_8.5.json

data/final:
llama3_paper_qa_pairs_cleaned_alpaca.json
llama3_paper_qa_pairs_cleaned_ft_hf
llama3_paper_qa_pairs_cleaned_ft.json

data/final/llama3_paper_qa_pairs_cleaned_ft_hf:
data-00000-of-00001.arrow  dataset_info.json  state.json

data/generated:
large_chunks.json   llama3_paper_qa_pairs.json	test_write.json
large_dataset.json  small_chunks.json

data/generated/large_chunks.json:
llama3_paper_qa_pairs.json  test_write.json

data/generated/large_dataset.json:
llama3_paper_qa_pairs.json  test_write.json

data/generated/small_chunks.json:
llama3_paper_qa_pairs.json  test_write.json

data/input:

data/parsed:
llama3_paper.txt

📄 Config preview (first 35 lines):
# Master c
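
One caveat with the substitution step: `str.replace` silently does nothing when the placeholder is absent, for example if the downloaded config changes. A small sketch of a stricter variant, assuming the placeholder string used in this notebook; `fill_placeholder` is a hypothetical helper and the sample YAML is illustrative, not the real config.

```python
def fill_placeholder(text: str, placeholder: str, value: str) -> str:
    """Replace every occurrence of `placeholder`, failing loudly if it cannot be done."""
    if not value:
        raise ValueError("no value supplied for the placeholder")
    if placeholder not in text:
        raise ValueError(f"placeholder {placeholder!r} not found in config")
    return text.replace(placeholder, value)


# Illustrative fragment only; the real config has more sections
sample = 'api-endpoint:\n  api_key: "YOUR_CEREBRAS_API_KEY"\n'
print(fill_placeholder(sample, "YOUR_CEREBRAS_API_KEY", "csk-example"))
```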

## 🔌 Step 4: Test API Connection

Verifying connection to Cerebras...

In [21]:
!synthetic-data-kit -c cerebras_config.yaml system-check

print("\n✅ If you see 'API endpoint access confirmed' above, you're ready to go!")

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Environment variable check:
API_ENDPOINT_KEY: Not found
get_llm_provider returning: api-endpoint
API_ENDPOINT_KEY environment variable: Not found
API key source: Config file
Checking API endpoint access...INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
API endpoint access confirmed
Using custom API base: https://api.cerebras.ai/v1
Default model: llama3.3-70b
Response from model: Hello. How can I help you today?

✅ If you
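
The log above shows the system check issuing a POST to `/v1/chat/completions`. The sketch below builds a comparable request by hand, assuming an OpenAI-style JSON payload and a bearer-token header (both are assumptions about the endpoint, and `build_chat_request` is a hypothetical helper). The request is only constructed, not sent.

```python
import json
import urllib.request


def build_chat_request(api_base: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for an OpenAI-style endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{api_base.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("https://api.cerebras.ai/v1", "csk-example", "llama3.3-70b", "Hello")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out so the cell runs without a key.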

## 📥 Step 5: Download Llama 3 Paper

Downloading the research paper from arXiv...

In [22]:
!wget -q https://arxiv.org/pdf/2407.21783 -O llama3_paper.pdf

# Verify download
import os
file_size = os.path.getsize('llama3_paper.pdf') / 1024  # KB

print("✅ Paper downloaded successfully!")
print("📄 File: llama3_paper.pdf")
print(f"💾 Size: {file_size:.1f} KB")

!ls -lh llama3_paper.pdf

✅ Paper downloaded successfully!
📄 File: llama3_paper.pdf
💾 Size: 9602.7 KB
-rw-r--r-- 1 root root 9.4M Nov 26  2024 llama3_paper.pdf


---

# 🔄 The 4-Stage Pipeline

```
PDF → INGEST → CREATE → CURATE → SAVE-AS → Training Data ✨
```

## 📚 Stage 1: INGEST - Parse the PDF

**What it does:** Extracts the text from the PDF and saves it as a .txt file

This takes ~30-60 seconds...

In [23]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  ingest llama3_paper.pdf

print("\n" + "="*60)
print("✅ INGEST complete!")
print("="*60)

# Check output
!ls -lh data/parsed/

# Preview first few lines of the extracted text
print("\n📝 Preview of extracted text:")
!head -20 data/parsed/llama3_paper.txt

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Processing llama3_paper.pdf...
✅ Text successfully extracted to data/parsed/llama3_paper.txt

✅ INGEST complete!
total 352K
-rw-r--r-- 1 root root 352K Oct 28 18:42 llama3_paper.txt

📝 Preview of extracted text:
4
2
0
2

v
o
N
3
2

]
I

A
.
s
c
[

CPU times: user 104 ms, sys: 17.6 ms, total: 121 ms
Wall time: 16.1 s
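
The one-character lines in the preview appear to be the vertical arXiv margin stamp, extracted one character per line. A rough cleanup sketch, assuming such fragments only occur as a leading run; `strip_leading_fragments` is a hypothetical helper, and the title line appended to the sample is illustrative.

```python
def strip_leading_fragments(text: str, max_len: int = 1) -> str:
    """Drop the leading run of blank or very short lines from extracted text."""
    lines = text.splitlines()
    start = 0
    # Advance past lines whose stripped length is at most `max_len`
    while start < len(lines) and len(lines[start].strip()) <= max_len:
        start += 1
    return "\n".join(lines[start:])


# Fragment lines copied from the preview above, followed by an illustrative title line
sample = "4\n2\n0\n2\n\nv\no\nN\n3\n2\n\n]\nI\n\nA\n.\ns\nc\n[\n\nThe Llama 3 Herd of Models"
print(strip_leading_fragments(sample))
```

This would also discard a legitimately short first line, so it is worth eyeballing the result before running CREATE on it.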


## 🤖 Stage 2: CREATE - Generate Q&A Pairs

**What it does:** Uses Cerebras + Llama 3.3-70B with custom prompts to generate intelligent Q&A pairs

This takes ~2-4 minutes for 50 pairs... ☕

In [24]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 50 \
  --verbose

print("\n" + "="*60)
print("✅ CREATE complete!")
print("="*60)

# Check output
!ls -lh data/generated/

# Count Q&A pairs
import json
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    data = json.load(f)

print(f"\n📊 Generated {len(data['qa_pairs'])} Q&A pairs")

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Generating qa content from data/parsed/llama3_paper.txt...
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
L Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set

### 🔍 Preview Generated Q&A Pairs

In [25]:
import json

# Load and display first 3 Q&A pairs
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    data = json.load(f)

print("📝 Summary:")
print(data['summary'][:200] + "...\n")

print("\n" + "="*60)
print("📚 Sample Q&A Pairs:")
print("="*60)

for i, pair in enumerate(data['qa_pairs'][:3], 1):
    print(f"\n{i}. Question:")
    print(f"   {pair['question']}")
    print(f"\n   Answer:")
    print(f"   {pair['answer'][:150]}...")
    print("\n" + "-"*60)

📝 Summary:
Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:

The paper introduces Llama 3, a new set of foundation models for language that natively support multil...


📚 Sample Q&A Pairs:

1. Question:
   What is the size of the largest Llama 3 model in terms of parameters?

   Answer:
   405B parameters...

------------------------------------------------------------

2. Question:
   How many parameters does the flagship model have?

   Answer:
   405B...

------------------------------------------------------------

3. Question:
   What is the size of the largest Llama 3 model?

   Answer:
   405B parameters...

------------------------------------------------------------
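
The three sample pairs above all ask about the size of the largest model, so some deduplication before curation may be worthwhile. A sketch using token-set Jaccard similarity, which is an illustrative choice and not something the toolkit is known to do; the helper names and the 0.7 cut-off are assumptions.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def drop_near_duplicates(pairs: list, threshold: float = 0.7) -> list:
    """Keep a pair only if its question is not too similar to an already kept one."""
    kept = []
    for pair in pairs:
        if all(jaccard(pair["question"], k["question"]) < threshold for k in kept):
            kept.append(pair)
    return kept


samples = [
    {"question": "What is the size of the largest Llama 3 model in terms of parameters?", "answer": "405B parameters"},
    {"question": "How many parameters does the flagship model have?", "answer": "405B"},
    {"question": "What is the size of the largest Llama 3 model?", "answer": "405B parameters"},
]
for pair in drop_near_duplicates(samples):
    print(pair["question"])
```

A lexical measure like this catches the first and third questions but not the paraphrase in the second; embedding-based similarity would be needed for that.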


## ✨ Stage 3: CURATE - Filter Quality

**What it does:** Uses LLM-as-judge with custom rating prompt to rate and filter Q&A pairs

This takes ~2-3 minutes... 🎯

In [26]:
%%time

!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 7.5 \
  --verbose

print("\n" + "="*60)
print("✅ CURATE complete!")
print("="*60)

# Check output
!ls -lh data/curated/

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Cleaning content from data/generated/llama3_paper_qa_pairs.json...
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1

### 📊 Quality Metrics

In [27]:
import json

# Load curated data
with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

metrics = curated.get('metrics', {})

print("="*60)
print("📊 CURATION RESULTS")
print("="*60)
print(f"\n📝 Total pairs generated:     {metrics.get('total', 0)}")
print(f"✅ Pairs kept (≥7.5 rating):  {metrics.get('filtered', 0)}")
print(f"📈 Retention rate:            {metrics.get('retention_rate', 0)*100:.1f}%")
print(f"⭐ Average quality score:     {metrics.get('avg_score', 0):.1f}/10")

print("\n" + "="*60)
print("🎯 Quality filtering complete!")
print(f"   Kept {metrics.get('filtered', 0)} high-quality pairs")
print("="*60)

📊 CURATION RESULTS

📝 Total pairs generated:     50
✅ Pairs kept (≥7.5 rating):  44
📈 Retention rate:            88.0%
⭐ Average quality score:     8.4/10

🎯 Quality filtering complete!
   Kept 44 high-quality pairs
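
The retention rate reported above is simply pairs kept divided by pairs rated (44 / 50). The sketch below recomputes such metrics from a list of ratings and re-applies several cut-offs offline. The ratings are made up, `curation_metrics` is a hypothetical helper, and averaging over all rated pairs (rather than only the kept ones) is an assumption about how `avg_score` is defined.

```python
def curation_metrics(ratings: list, threshold: float) -> dict:
    """Summarize judge ratings against a cut-off (pairs rated >= threshold are kept)."""
    kept = [r for r in ratings if r >= threshold]
    total = len(ratings)
    return {
        "total": total,
        "filtered": len(kept),
        "retention_rate": len(kept) / total if total else 0.0,
        "avg_score": sum(ratings) / total if total else 0.0,  # assumed: mean over all rated pairs
    }


# The run above: 44 of 50 pairs kept
print(f"{44 / 50:.1%}")

# Made-up ratings for illustration
ratings = [10, 9, 8.5, 8, 7, 6]
for threshold in (6.5, 7.5, 8.5):
    print(threshold, curation_metrics(ratings, threshold))
```

Because the ratings are already stored with each kept pair, stricter thresholds can be explored this way without calling the API again.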


### 👀 Preview Top-Rated Q&A Pairs

In [28]:
import json

with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

# Sort by rating (descending)
sorted_pairs = sorted(curated['qa_pairs'], key=lambda x: x.get('rating', 0), reverse=True)

print("="*60)
print("🌟 TOP 3 HIGHEST-RATED Q&A PAIRS")
print("="*60)

for i, pair in enumerate(sorted_pairs[:3], 1):
    print(f"\n{i}. Rating: ⭐ {pair.get('rating', 'N/A')}/10")
    print(f"\n   Q: {pair['question']}")
    print(f"\n   A: {pair['answer'][:200]}...")
    print("\n" + "-"*60)

🌟 TOP 3 HIGHEST-RATED Q&A PAIRS

1. Rating: ⭐ 10/10

   Q: What is the size of the largest Llama 3 model in terms of parameters?

   A: 405B parameters...

------------------------------------------------------------

2. Rating: ⭐ 10/10

   Q: What is the size of the largest Llama 3 model?

   A: 405B parameters...

------------------------------------------------------------

3. Rating: ⭐ 10/10

   Q: What percentage of needles do Llama 3 models successfully retrieve at all document depths and context lengths in the Needle-in-a-Haystack task?

   A: 100%...

------------------------------------------------------------


## 💾 Stage 4: SAVE-AS - Export to Training Format

**What it does:** Converts the curated pairs to fine-tuning-ready formats

We'll create multiple formats...

In [29]:
%%time

# Format 1: HuggingFace Dataset (Arrow format - recommended!)
print("📦 Creating HuggingFace dataset...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format ft \
  --storage hf

# Format 2: OpenAI Fine-Tuning (JSON)
print("\n📦 Creating OpenAI FT format...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format ft

# Format 3: Alpaca format
print("\n📦 Creating Alpaca format...")
!synthetic-data-kit -c cerebras_config.yaml \
  save-as data/curated/llama3_paper_qa_pairs_cleaned.json \
  --format alpaca

print("\n" + "="*60)
print("✅ SAVE-AS complete!")
print("="*60)

# Show all formats
!ls -lh data/final/

📦 Creating HuggingFace dataset...
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Converting data/curated/llama3_paper_qa_pairs_cleaned.json to ft format with hf storage...INFO:numexpr.utils:NumExpr defaulting to 2 threads.

### 🎯 Load & Inspect HuggingFace Dataset

In [30]:
from datasets import load_from_disk
import json

# Load the HuggingFace dataset
dataset = load_from_disk('data/final/llama3_paper_qa_pairs_cleaned_ft_hf')

print("="*60)
print("📊 HUGGINGFACE DATASET INFO")
print("="*60)
print(f"\n📦 Dataset size: {len(dataset)} examples")
print(f"\n🔧 Features: {dataset.features}")

print("\n" + "="*60)
print("📝 SAMPLE TRAINING EXAMPLE (OpenAI Format)")
print("="*60)

# Show first example
example = dataset[0]
print(json.dumps(example, indent=2))

print("\n" + "="*60)
print("✅ Ready to use with Transformers, Axolotl, or any training framework!")
print("="*60)

📊 HUGGINGFACE DATASET INFO

📦 Dataset size: 44 examples

🔧 Features: {'messages': List({'content': Value('string'), 'role': Value('string')})}

📝 SAMPLE TRAINING EXAMPLE (OpenAI Format)
{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "What is the size of the largest Llama 3 model in terms of parameters?",
      "role": "user"
    },
    {
      "content": "405B parameters",
      "role": "assistant"
    }
  ]
}

✅ Ready to use with Transformers, Axolotl, or any training framework!
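
To make the exported layouts concrete, here is a sketch that builds the chat-style record shown above from a single Q&A pair, plus an Alpaca-style record. The helper names are hypothetical, and the Alpaca field mapping (question as `instruction`, empty `input`) is an assumption about how the toolkit fills that format.

```python
import json


def to_chat_example(question: str, answer: str,
                    system_prompt: str = "You are a helpful assistant.") -> dict:
    """Build a messages-style record matching the sample printed above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def to_alpaca_example(question: str, answer: str) -> dict:
    """Build an Alpaca-style record (instruction / input / output)."""
    return {"instruction": question, "input": "", "output": answer}


question = "What is the size of the largest Llama 3 model in terms of parameters?"
print(json.dumps(to_chat_example(question, "405B parameters"), indent=2))
print(json.dumps(to_alpaca_example(question, "405B parameters"), indent=2))
```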


---

# 🎉 Success! Your Dataset is Ready!

## 📊 Final Summary

In [31]:
import json
from datasets import load_from_disk

# Load files
with open('data/generated/llama3_paper_qa_pairs.json', 'r') as f:
    generated = json.load(f)

with open('data/curated/llama3_paper_qa_pairs_cleaned.json', 'r') as f:
    curated = json.load(f)

dataset = load_from_disk('data/final/llama3_paper_qa_pairs_cleaned_ft_hf')

print("\n" + "="*60)
print("🎉 WORKSHOP COMPLETE - SUMMARY")
print("="*60)

print("\n📚 Source:")
print("   • Llama 3 Research Paper (arXiv:2407.21783)")

print("\n🔄 Pipeline Results:")
print(f"   1️⃣ INGEST:   ✅ PDF → Clean text (.txt)")
print(f"   2️⃣ CREATE:   ✅ Generated {len(generated['qa_pairs'])} Q&A pairs (custom prompts)")
print(f"   3️⃣ CURATE:   ✅ Kept {len(curated['qa_pairs'])} high-quality pairs (≥7.5/10)")
print(f"   4️⃣ SAVE-AS:  ✅ Exported to 3 formats")

metrics = curated.get('metrics', {})
print("\n📊 Quality Metrics:")
print(f"   • Retention rate: {metrics.get('retention_rate', 0)*100:.1f}%")
print(f"   • Average score: {metrics.get('avg_score', 0):.1f}/10")

print("\n💾 Output Formats:")
print(f"   • HuggingFace Dataset: {len(dataset)} examples (Arrow format)")
print(f"   • OpenAI Fine-Tuning: JSON format")
print(f"   • Alpaca: JSON format")

print("\n📂 Files Location:")
print("   • data/final/ (all formats)")

print("\n" + "="*60)
print("🚀 Your dataset is ready for fine-tuning!")
print("="*60)

print("\n💡 Next Steps:")
print("   • Download the dataset from data/final/")
print("   • Use with Transformers, Axolotl, or your training framework")
print("   • Fine-tune your model!")


🎉 WORKSHOP COMPLETE - SUMMARY

📚 Source:
   • Llama 3 Research Paper (arXiv:2407.21783)

🔄 Pipeline Results:
   1️⃣ INGEST:   ✅ PDF → Clean text (.txt)
   2️⃣ CREATE:   ✅ Generated 50 Q&A pairs (custom prompts)
   3️⃣ CURATE:   ✅ Kept 44 high-quality pairs (≥7.5/10)
   4️⃣ SAVE-AS:  ✅ Exported to 3 formats

📊 Quality Metrics:
   • Retention rate: 88.0%
   • Average score: 8.4/10

💾 Output Formats:
   • HuggingFace Dataset: 44 examples (Arrow format)
   • OpenAI Fine-Tuning: JSON format
   • Alpaca: JSON format

📂 Files Location:
   • data/final/ (all formats)

🚀 Your dataset is ready for fine-tuning!

💡 Next Steps:
   • Download the dataset from data/final/
   • Use with Transformers, Axolotl, or your training framework
   • Fine-tune your model!


---

# 🎮 Bonus Experiments

Try these optional experiments to explore more features!

## 🧪 Experiment 1: Try Different Quality Thresholds

In [32]:
import json

# Strict filtering (8.5+)
print("🔍 Testing threshold 8.5 (very strict)...")
!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 8.5 \
  -o data/curated/strict_8.5.json

# Lenient filtering (6.5+)
print("\n🔍 Testing threshold 6.5 (lenient)...")
!synthetic-data-kit -c cerebras_config.yaml \
  curate data/generated/llama3_paper_qa_pairs.json \
  --threshold 6.5 \
  -o data/curated/lenient_6.5.json

# Compare results
with open('data/curated/strict_8.5.json') as f:
    strict = json.load(f)
with open('data/curated/lenient_6.5.json') as f:
    lenient = json.load(f)
with open('data/curated/llama3_paper_qa_pairs_cleaned.json') as f:
    default = json.load(f)

print("\n" + "="*60)
print("📊 THRESHOLD COMPARISON")
print("="*60)
print(f"\nThreshold 8.5 (strict):   {len(strict['qa_pairs'])} pairs kept")
print(f"Threshold 7.5 (default):  {len(default['qa_pairs'])} pairs kept")
print(f"Threshold 6.5 (lenient):  {len(lenient['qa_pairs'])} pairs kept")
print("\n💡 Lower threshold = more pairs, but potentially lower quality")

🔍 Testing threshold 8.5 (very strict)...
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Processing 17 batches of QA pairs...
Cleaning content from data/generated/llama3_paper_qa_pairs.json...INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/com

## 🧪 Experiment 2: Generate More Q&A Pairs

In [16]:
%%time

print("🎯 Generating 100 Q&A pairs...\n")

# For `create`, -o names an output directory, not a file;
# the pairs are written to <output dir>/llama3_paper_qa_pairs.json
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 100 \
  -o data/generated/large_dataset \
  --verbose

# Count pairs
import json
with open('data/generated/large_dataset/llama3_paper_qa_pairs.json') as f:
    large = json.load(f)

print(f"\n✅ Generated {len(large['qa_pairs'])} Q&A pairs!")
print("\n💡 You can now curate this larger dataset with:")
print("   synthetic-data-kit curate data/generated/large_dataset/llama3_paper_qa_pairs.json")

Generating qa content from data/parsed/llama3_paper.txt...INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:synthetic_data_kit.models.llm_client:Received response from api-endpoint
Generating QA pairs   0% 0:00:09 -:--:--
INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:synthetic_data_kit.models.llm_client:Received response from api-endpoint
INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:synthetic_data_kit.models.llm_client:Received response from api-endpoint
INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1.1 200 OK"
INFO:synthetic_data_kit.models.llm_client:Received response from api-endpoint
INFO:httpx:HTTP Request: POST https://api.cereb

## 🧪 Experiment 3: Different Chunking Strategies

In [17]:
import json

# For `create`, -o names an output directory, not a file;
# each run writes <output dir>/llama3_paper_qa_pairs.json

# Small chunks (more granular)
print("📏 Testing small chunks (2000 chars)...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 20 \
  --chunk-size 2000 \
  --chunk-overlap 100 \
  -o data/generated/small_chunks

# Large chunks (more context)
print("\n📏 Testing large chunks (6000 chars)...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 20 \
  --chunk-size 6000 \
  --chunk-overlap 300 \
  -o data/generated/large_chunks

# Compare questions
with open('data/generated/small_chunks/llama3_paper_qa_pairs.json') as f:
    small = json.load(f)
with open('data/generated/large_chunks/llama3_paper_qa_pairs.json') as f:
    large = json.load(f)

print("\n" + "="*60)
print("📊 CHUNKING COMPARISON")
print("="*60)

print("\n🔬 Small Chunks (2000 chars) - Sample Question:")
print(f"   {small['qa_pairs'][0]['question']}")

print("\n📚 Large Chunks (6000 chars) - Sample Question:")
print(f"   {large['qa_pairs'][0]['question']}")

print("\n💡 Small chunks = more specific questions")
print("💡 Large chunks = more context-aware questions")

📏 Testing small chunks (2000 chars)...

Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: /usr/local/lib/python3.12/dist-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
🔗 Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Not found
Using API key: From config
Using API base URL: https://api.cerebras.ai/v1
L Using api-endpoint provider
Loading config from: cerebras_config.yaml
Config has LLM provider set to: api-endpoint
Generating qa content from data/parsed/llama3_paper.txt...INFO:httpx:HTTP Request: POST https://api.cerebras.ai/v1/chat/completions "HTTP/1
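
To see what chunk size and overlap mean in practice, here is a minimal character-based chunker. It is a sketch of the general idea only: the toolkit's own splitter may cut on paragraph or sentence boundaries instead, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window's overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))

# Same document length, two settings from this experiment (placeholder text)
document = "x" * 20000
print(len(chunk_text(document, 2000, 100)), len(chunk_text(document, 6000, 300)))
```

Smaller windows yield more chunks for the same document, so each generated question draws on a narrower slice of text.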

## 🧠 Experiment 4: Chain-of-Thought Enhancement

**Advanced:** Add reasoning steps to your Q&A pairs using custom CoT prompts!

In [None]:
import os

# Step 1: Create CoT config with custom enhancement prompt
cot_config = f"""llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "https://api.cerebras.ai/v1"
  api_key: "{os.environ.get('CEREBRAS_API_KEY')}"
  model: "llama3.3-70b"

generation:
  temperature: 0.2
  max_tokens: 8192

prompts:
  cot_enhancement: |
    You are enhancing Q&A conversations by adding step-by-step reasoning.

    For each assistant response, add detailed reasoning BEFORE the answer:

    Transform:
    Q: "What is Llama 3's context length?"
    A: "128K tokens"

    Into:
    Q: "What is Llama 3's context length?"
    A: "Let me break this down:
    Step 1: Looking at the architecture section...
    Step 2: The paper states...
    Therefore: Llama 3 supports 128K tokens"

    Enhance these conversations:
    {{{{conversations}}}}
"""

with open('cot_config.yaml', 'w') as f:
    f.write(cot_config)

print("✅ CoT config created with custom enhancement prompt!\n")

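# Step 2: Generate simple Q&A
# (for `create`, -o names an output directory; the pairs are written to
#  <output dir>/llama3_paper_qa_pairs.json)
print("📝 Generating 10 simple Q&A pairs...\n")
!synthetic-data-kit -c cerebras_config.yaml \
  create data/parsed/llama3_paper.txt \
  --type qa \
  --num-pairs 10 \
  -o data/generated/simple_for_cot

# Step 3: Add reasoning
print("\n🧠 Adding Chain-of-Thought reasoning...\n")
!synthetic-data-kit -c cot_config.yaml \
  create data/generated/simple_for_cot/llama3_paper_qa_pairs.json \
  --type cot-enhance \
  -o data/generated/with_reasoning \
  --verbose

print("\n✅ Chain-of-Thought enhancement complete!")

In [None]:
import glob
import json

# Compare before and after
with open('data/generated/simple_for_cot/llama3_paper_qa_pairs.json') as f:
    before = json.load(f)

# The enhanced file's name is not assumed here: take the first JSON file in the
# output directory, skipping the toolkit's test_write.json
enhanced_files = [p for p in sorted(glob.glob('data/generated/with_reasoning/*.json'))
                  if not p.endswith('test_write.json')]
with open(enhanced_files[0]) as f:
    after = json.load(f)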

print("="*60)
print("🔍 CHAIN-OF-THOUGHT COMPARISON")
print("="*60)

# Get first Q&A from conversations
before_conv = before['qa_pairs'][0]
after_conv = after[0]['conversations'] if isinstance(after, list) else after['conversations'][0]

print("\n📝 BEFORE (Simple answer):")
print(f"Q: {before_conv['question']}")
print(f"A: {before_conv['answer'][:150]}...")

print("\n" + "-"*60)

print("\n🧠 AFTER (With reasoning):")
for msg in after_conv:
    if msg['role'] == 'user':
        print(f"Q: {msg['content']}")
    elif msg['role'] == 'assistant':
        print(f"A: {msg['content'][:300]}...")

print("\n" + "="*60)
print("✨ Notice the step-by-step reasoning in the enhanced version!")
print("="*60)
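
Adding reasoning should not change the final answer. A coarse sanity check, sketched under the assumption that the enhanced response still contains the original answer text; the helper names are hypothetical and the example strings are taken from the prompt template above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse text to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9.]+", text.lower()))


def keeps_original_answer(original: str, enhanced: str) -> bool:
    """True if the normalized original answer appears inside the enhanced answer."""
    return normalize(original) in normalize(enhanced)


enhanced = (
    "Let me break this down:\n"
    "Step 1: Looking at the architecture section...\n"
    "Step 2: The paper states...\n"
    "Therefore: Llama 3 supports 128K tokens"
)
print(keeps_original_answer("128K tokens", enhanced))
print(keeps_original_answer("128K tokens", "Llama 3 supports 8K tokens"))
```

Substring matching is deliberately simple: it would miss a reworded answer and can be fooled by a shorter answer embedded in a longer number, so treat failures as items to review by hand.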

---

# 📥 Download Your Dataset

Download the files to your local machine:

In [None]:
# Create a zip file with all outputs
!zip -r llama3_dataset.zip data/final/

print("✅ Dataset packaged!")
print("\n📦 Download 'llama3_dataset.zip' from the Files panel (left sidebar)")
print("   Or run this cell and click the download link below:")

from google.colab import files
files.download('llama3_dataset.zip')

---

# 🎓 Workshop Complete!

## What You Accomplished:

✅ **Parsed** a research paper automatically (to .txt format)  
✅ **Generated** 50+ Q&A pairs using Cerebras with custom prompts  
✅ **Filtered** for quality using LLM-as-judge with custom rating criteria  
✅ **Exported** to multiple training formats  
✅ **Learned** advanced features (CoT, chunking, thresholds, custom prompts)  

## 🚀 Next Steps:

1. **Try your own PDFs** - Upload any research paper or document
2. **Customize prompts** - Edit the prompts in the config for your domain
3. **Adjust parameters** - Experiment with thresholds, chunk sizes, etc.
4. **Fine-tune a model** - Use your dataset with Transformers/Axolotl
5. **Scale up** - Process entire directories of documents
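
For the scale-up step, one option is to generate the four CLI calls per document and run them in a loop. The sketch below only builds and prints the commands; it reuses the flags and the file-naming pattern seen in this notebook, while `pipeline_commands` and the `papers/` folder are hypothetical.

```python
from pathlib import Path


def pipeline_commands(pdf: str, config: str = "cerebras_config.yaml",
                      num_pairs: int = 50, threshold: float = 7.5) -> list:
    """Return the ingest, create, curate and save-as commands for one PDF."""
    stem = Path(pdf).stem  # "papers/llama3_paper.pdf" -> "llama3_paper"
    base = ["synthetic-data-kit", "-c", config]
    return [
        base + ["ingest", pdf],
        base + ["create", f"data/parsed/{stem}.txt", "--type", "qa",
                "--num-pairs", str(num_pairs)],
        base + ["curate", f"data/generated/{stem}_qa_pairs.json",
                "--threshold", str(threshold)],
        base + ["save-as", f"data/curated/{stem}_qa_pairs_cleaned.json",
                "--format", "ft"],
    ]


# "papers/" is a hypothetical folder; nothing is printed if it does not exist
for pdf in sorted(Path("papers").glob("*.pdf")):
    for cmd in pipeline_commands(str(pdf)):
        print(" ".join(cmd))  # replace print with subprocess.run(cmd, check=True) to execute
```

Printing first makes it easy to review the paths before spending API calls on a whole directory.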

## 📚 Resources:

- **Toolkit:** https://github.com/meta-llama/synthetic-data-kit
- **Cerebras API:** https://cerebras.ai/
- **Documentation:** Check the toolkit README for advanced features

---

**🎉 Happy Dataset Building!**