# <img src="assets/doro.png" width="32" height="32"> Dataset Maker - LoRA Training Assistant  

This notebook contains the **Dataset Widget** for preparing your training datasets.

## What this notebook handles:
- **Dataset upload** and extraction from ZIP files
- **Image tagging** with WD14 v3 taggers or BLIP captioning
- **Caption management** and editing
- **Trigger word** management and injection
- **Tag filtering** and blacklist management

## Workflow:
1. **Upload your dataset** (ZIP file or folder)
2. **Configure tagging** settings (WD14 for anime, BLIP for photos)
3. **Review and edit** generated captions
4. **Add trigger words** for your LoRA
5. **Move to training** in `Lora_Trainer_Widget.ipynb`

---

## 📖 Dataset Preparation Guide

This comprehensive widget handles all dataset preparation tasks:

### 📁 Dataset Upload & Management
- **Local upload**: Drag & drop ZIP files or select folders
- **HuggingFace import**: Direct download from HF datasets
- **Automatic extraction**: Handles ZIP, RAR, 7z archives
- **Folder organization**: Creates proper training structure
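
As a rough illustration of the extraction step, here is a minimal sketch that unpacks the images from a ZIP into a single dataset folder using only the standard library. The function name `extract_dataset`, the `IMAGE_EXTS` set, and the choice to flatten sub-folders are assumptions made for this example, not the widget's actual implementation. RAR and 7z archives need third-party tools and are not covered here.

```python
import zipfile
from pathlib import Path

# Assumed set of image extensions; the widget's real list may differ.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def extract_dataset(zip_path, dest_dir):
    """Extract image files from a ZIP into dest_dir (flattened) and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Keep only the file name, dropping any folder prefix inside the archive.
            name = Path(info.filename).name
            if Path(name).suffix.lower() not in IMAGE_EXTS:
                continue
            target = dest / name
            target.write_bytes(zf.read(info))
            extracted.append(target)
    return sorted(extracted)
```

Flattening keeps every image in one folder, which matches the checklist later in this notebook, but files that share a name in different sub-folders would overwrite each other.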

### 🏷️ Auto-Tagging Systems

#### WD14 v3 Taggers (Recommended for Anime/Art)
- **wd14-vit-v2**: General purpose, balanced accuracy
- **wd14-convnext-v2**: Higher accuracy, slower
- **wd14-swinv2-v2**: Best for complex scenes
- **ONNX optimization**: 2-3x faster inference
- **Threshold control**: Adjust tag sensitivity
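
To make the threshold setting concrete, here is a small sketch of how a confidence threshold turns per-tag scores into a caption. The function name `select_tags`, the example scores, and the 0.35 default are assumptions for illustration only; the real taggers produce scores for thousands of tags and the widget may use a different default.

```python
def select_tags(scores, threshold=0.35):
    """Return tags whose confidence meets the threshold, highest confidence first."""
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept]

# Hypothetical tagger output for one image.
scores = {"1girl": 0.98, "smile": 0.62, "outdoors": 0.31, "hat": 0.12}

print(", ".join(select_tags(scores)))        # 1girl, smile
print(", ".join(select_tags(scores, 0.10)))  # a lower threshold keeps more tags
```

Raising the threshold gives shorter, more reliable captions; lowering it gives longer captions with more questionable tags.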

#### BLIP Captioning (Best for Photos/Realistic)
- **Natural language**: Generates descriptive sentences
- **Scene understanding**: Captures context and relationships
- **Perfect for**: Real photos, portraits, landscapes

### ✏️ Caption Management
- **Bulk editing**: Apply changes to all captions
- **Find & replace**: Update specific terms across dataset
- **Tag frequency analysis**: See most common tags
- **Manual editing**: Individual caption refinement
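
For readers who want to see what tag frequency analysis and find & replace amount to, here is a minimal sketch that works on comma-separated captions held in memory. The helper names (`split_tags`, `tag_frequency`, `replace_tag`) are hypothetical and the widget's own logic may differ.

```python
from collections import Counter

def split_tags(caption):
    """Split a comma-separated caption into stripped, non-empty tags."""
    return [t.strip() for t in caption.split(",") if t.strip()]

def tag_frequency(captions):
    """Count how many captions each tag appears in."""
    counts = Counter()
    for caption in captions:
        counts.update(set(split_tags(caption)))
    return counts

def replace_tag(captions, old, new):
    """Replace whole tags equal to `old` with `new` in every caption."""
    return [", ".join(new if t == old else t for t in split_tags(c)) for c in captions]

captions = ["1girl, smile, hat", "1girl, outdoors", "1girl, smile"]
print(tag_frequency(captions).most_common(1))        # [('1girl', 3)]
print(replace_tag(captions, "hat", "red hat")[0])    # 1girl, smile, red hat
```

Matching whole tags rather than substrings avoids accidental edits, for example turning `hat` into `red hat` inside an unrelated tag such as `chat`.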

### 🎯 Trigger Word System
- **Automatic injection**: Add trigger words to all captions
- **Position control**: Beginning, end, or random placement
- **Consistency checking**: Ensure all images have triggers
- **Preview mode**: See changes before applying
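
The injection and preview behaviour described above can be sketched as follows. This is an illustrative helper, not the widget's code: the name `inject_trigger`, the position labels, and the example trigger `mychar` are all assumptions.

```python
import random

def inject_trigger(caption, trigger, position="start", rng=random):
    """Add `trigger` to a comma-separated caption; leave it unchanged if already present."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    if trigger in tags:
        return ", ".join(tags)
    if position == "start":
        index = 0
    elif position == "end":
        index = len(tags)
    elif position == "random":
        index = rng.randint(0, len(tags))  # any slot, including both ends
    else:
        raise ValueError(f"unknown position: {position!r}")
    tags.insert(index, trigger)
    return ", ".join(tags)

# Preview: show before/after without writing anything to disk.
for before in ["1girl, smile", "mychar, outdoors"]:
    print(f"{before!r} -> {inject_trigger(before, 'mychar')!r}")
```

Because the function skips captions that already contain the trigger, it is safe to run more than once on the same dataset.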

### 🚫 Tag Filtering & Blacklists
- **Quality filters**: Remove low-confidence tags
- **Content filters**: Block unwanted content types
- **Custom blacklists**: Your own forbidden tags
- **Whitelist mode**: Only allow specific tags
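
As a sketch of how blacklist and whitelist filtering might combine, the hypothetical `filter_tags` helper below drops blacklisted tags first and then, only if a whitelist is supplied, keeps just the tags on it. Case-insensitive matching is an assumption of this example.

```python
def filter_tags(tags, blacklist=(), whitelist=None):
    """Drop blacklisted tags; if a whitelist is given, keep only tags in it."""
    blocked = {t.lower() for t in blacklist}
    allowed = None if whitelist is None else {t.lower() for t in whitelist}
    result = []
    for tag in tags:
        key = tag.lower()
        if key in blocked:
            continue
        if allowed is not None and key not in allowed:
            continue
        result.append(tag)
    return result

tags = ["1girl", "smile", "watermark", "signature"]
print(filter_tags(tags, blacklist=["watermark", "signature"]))  # ['1girl', 'smile']
print(filter_tags(tags, whitelist=["smile"]))                   # ['smile']
```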

### 📈 Quality Analysis
- **Image statistics**: Resolution, format, size analysis
- **Tag distribution**: Most/least common tags
- **Caption length**: Optimal length recommendations
- **Duplicate detection**: Find similar images
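
Duplicate detection can take several forms. The sketch below covers only the simplest case, byte-identical files, by hashing each file with SHA-256. It is an assumption-laden illustration (the function name `find_exact_duplicates` is hypothetical), and it will not catch resized or re-encoded copies, which need perceptual hashing or embedding comparison.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(folder):
    """Group files in `folder` by SHA-256 of their bytes; return groups with 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists file names with identical content, so keeping the first name in a group and removing the rest leaves one copy of each image.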

### 💡 Best Practices Tips

**Dataset Size:**
- **Characters**: 15-50 images work well
- **Styles**: 50-200 images for consistency
- **Concepts**: 20-100 images depending on complexity

**Image Quality:**
- **Resolution**: 768x768 minimum, 1024x1024 recommended
- **Variety**: Different poses, angles, expressions
- **Consistency**: Similar art style or character design

### 🛠️ Recommended Image Preparation Tools

**For Bulk Resizing & Cropping:**
- **[birme.net](https://www.birme.net/)** - **Highly Recommended!**
  - Bulk resize multiple images to exact dimensions (512x512, 1024x1024)
  - Smart cropping with auto focal point detection
  - Privacy-focused - images never leave your computer
  - Perfect for dataset standardization

**For Advanced Editing:**
- **[photopea.com](https://www.photopea.com/)** - Free Photoshop alternative
  - Browser-based, no downloads needed
  - Full PSD support and professional tools
  - Great for background removal, cleanup, advanced edits
  - Perfect for preparing images before adding to dataset

**Pro Tip:** Use birme.net first for bulk resizing, then photopea.com for any individual images that need special attention!

**Caption Quality:**
- **Accuracy**: Verify auto-generated tags
- **Completeness**: Include important details
- **Trigger words**: Use unique, memorable terms
- **Consistency**: Same terms for same concepts

## <img src="assets/OTNDORODUSKFIXED.png" width="32" height="32"> 1. Environment Setup (Optional - Run if needed)

**What this cell does:** Sets up the training environment if you haven't done it yet.

**When to run:** 
- If this is your first time using the system
- If you're getting "module not found" errors
- If you skipped setup in the main training notebook

**What happens:**
- Downloads and installs the training backend (~10-15GB)
- Sets up directory structure 
- Validates your system (GPU, VRAM, storage)

**Skip this if:** You already ran setup in `Lora_Trainer_Widget.ipynb`

In [None]:
# **CELL 1A:** Environment Setup Widget (Optional)
# Run this cell ONLY if you need to set up the training environment
# This will download ~10-15GB and take 5-15 minutes

from shared_managers import create_widget
from sidecar import Sidecar
import fiftyone as fo
from ipywidgets import HTML, VBox

# Create persistent sidecar for dataset exploration
dataset_explorer = Sidecar(title='📊 Dataset Explorer - FiftyOne', anchor='split-right')

# Initialize and display the setup widget with shared managers
setup_widget = create_widget('setup')
setup_widget.display()

In [None]:
# **CELL 2:** Main Dataset Preparation Widget
# This is THE widget for all your dataset needs - run this cell!

from shared_managers import create_widget

# Initialize and display the dataset widget with shared managers
dataset_widget = create_widget('dataset')
dataset_widget.display()

## <img src="assets/doro.png" width="32" height="32"> 2. Dataset Preparation Widget

**What this cell does:** The main dataset preparation interface with all tools you need.

**This widget handles:**
- 📁 **Dataset Upload**: ZIP files, folders, or HuggingFace URLs
- 🏷️ **Auto-Tagging**: WD14 v3 (anime/art) or BLIP (photos)
- ✏️ **Caption Editing**: Bulk edit, find/replace, manual tweaks
- 🎯 **Trigger Words**: Add your unique trigger word to all captions
- 🚫 **Tag Filtering**: Remove unwanted tags with blacklists
- 📊 **Quality Tools**: Tag analysis, duplicate detection

**How to use:**
1. **Upload** your images (ZIP file or folder)
2. **Configure** tagging settings for your content type
3. **Review** and edit the generated captions
4. **Add** your trigger word to all images
5. **Filter** out any unwanted tags

---

## <img src="assets/doro_fubuki.png" width="32" height="32"> Dataset Preparation Checklist

Before moving to training, ensure you have:

### ✅ Dataset Structure
- [ ] Images are in a single folder
- [ ] All images have corresponding .txt caption files
- [ ] No corrupted or unreadable images
- [ ] Consistent image format (jpg/png)
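
The caption-file item in this checklist is easy to verify with a few lines of Python. The sketch below assumes captions sit next to the images with the same base name and a `.txt` extension; the function name `images_missing_captions` and the extension set are assumptions for this example.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def images_missing_captions(folder):
    """Return names of images in `folder` with no same-named .txt caption file."""
    missing = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_EXTS and not path.with_suffix(".txt").exists():
            missing.append(path.name)
    return missing
```

An empty list means every image has a caption file; any names returned are images that still need tagging.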

### ✅ Caption Quality
- [ ] All captions contain your trigger word
- [ ] Tags are accurate and relevant
- [ ] No unwanted or problematic tags
- [ ] Caption length is reasonable (50-200 tokens)
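
The trigger-word item can also be checked mechanically. Here is a minimal sketch that takes captions already loaded into a dictionary (file name to caption text) and reports which ones lack the trigger as a whole tag. The name `captions_missing_trigger` and the example trigger `mychar` are hypothetical.

```python
def captions_missing_trigger(captions, trigger):
    """Return sorted names whose comma-separated caption lacks the trigger tag."""
    missing = []
    for name, text in captions.items():
        tags = [t.strip() for t in text.split(",")]
        if trigger not in tags:
            missing.append(name)
    return sorted(missing)

captions = {"001.txt": "mychar, 1girl", "002.txt": "1girl, smile"}
print(captions_missing_trigger(captions, "mychar"))  # ['002.txt']
```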

### ✅ Content Verification
- [ ] Images represent what you want to train
- [ ] Sufficient variety in poses/angles
- [ ] Consistent quality across dataset
- [ ] No duplicate or near-duplicate images

---

## <img src="assets/OTNANGELDOROFIX.png" width="32" height="32"> Next Steps

Once your dataset is prepared:

1. **Note your dataset path** - you'll need it for training
2. **Remember your trigger word** - important for generation
3. **Open** `Lora_Trainer_Widget.ipynb` for training setup
4. **Run the Setup widget** first in the training notebook

---

## <img src="assets/OTNEARTHFIXDORO.png" width="32" height="32"> Troubleshooting

### Common Issues

**"No images found":**
- Check ZIP file structure (images should be in root or single folder)
- Verify image formats (jpg, png, webp supported)
- Ensure files aren't corrupted
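
When diagnosing a "No images found" error, it can help to look inside the ZIP without extracting it. The sketch below counts image entries per folder inside the archive; the function name `summarize_zip` and the extension set are assumptions for illustration.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Assumed to mirror the supported formats listed above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def summarize_zip(zip_path):
    """Count image entries per containing folder inside a ZIP ('' means the archive root)."""
    per_folder = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            entry = PurePosixPath(name)
            if entry.suffix.lower() in IMAGE_EXTS:
                parent = "" if str(entry.parent) == "." else str(entry.parent)
                per_folder[parent] += 1
    return dict(per_folder)
```

An empty result means the archive holds no files with a recognized image extension, and images spread across several nested folders suggest the ZIP should be repacked with everything in the root or a single folder.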

**"Tagging failed":**
- Check internet connection for model downloads
- Verify sufficient disk space (2-3GB for tagger models)
- Try a different tagger model

**"Captions too long/short":**
- Adjust tag threshold settings
- Use tag filtering to remove excess tags
- Consider manual editing for important images

**"Missing trigger words":**
- Use bulk edit to add trigger words
- Check trigger word injection settings
- Verify trigger word isn't being filtered out

---

*Ready to create amazing LoRAs? Let's go! 🚀*