# <img src="assets/doro.png" width="32" height="32"> Dataset Maker - LoRA Training Assistant  

This notebook contains the **Dataset Widget** for preparing your training datasets.

## What this notebook handles:
- **Dataset upload** and extraction from ZIP files
- **Image tagging** with WD14 v3 taggers or BLIP captioning
- **Caption management** and editing
- **Trigger word** management and injection
- **Tag filtering** and blacklist management

## Workflow:
1. **Upload your dataset** (ZIP file or folder)
2. **Configure tagging** settings (WD14 for anime, BLIP for photos)
3. **Review and edit** generated captions
4. **Add trigger words** for your LoRA
5. **Move to training** in `Lora_Trainer_Widget.ipynb`

---

## <img src="assets/OTNDORODUSKFIXED.png" width="32" height="32"> Dataset Preparation Widget

This comprehensive widget handles all dataset preparation tasks:

### 📁 Dataset Upload & Management
- **Local upload**: Drag & drop ZIP files or select folders
- **HuggingFace import**: Direct download from HF datasets
- **Automatic extraction**: Handles ZIP, RAR, 7z archives
- **Folder organization**: Creates proper training structure
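
As a rough illustration of the ZIP extraction step, the sketch below unpacks an archive and lists the images it finds. It is not the widget's own code: `extract_zip_dataset` and `IMAGE_EXTENSIONS` are hypothetical names, only ZIP is handled (Python's standard library does not read RAR or 7z), and treating `.jpeg` as an image extension is an assumption.

```python
import zipfile
from pathlib import Path

# Assumed set of image extensions; adjust to match what your trainer accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def extract_zip_dataset(zip_path, dest_dir):
    """Extract a ZIP archive and return the image files found, sorted by path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Search recursively so images nested inside a subfolder are still found.
    return sorted(
        p for p in dest.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

An empty result from a helper like this usually means the archive holds no files with a recognized image extension.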

### 🏷️ Auto-Tagging Systems

#### WD14 v3 Taggers (Recommended for Anime/Art)
- **wd14-vit-v2**: General purpose, balanced accuracy
- **wd14-convnext-v2**: Higher accuracy, slower
- **wd14-swinv2-v2**: Best for complex scenes
- **ONNX optimization**: 2-3x faster inference
- **Threshold control**: Adjust tag sensitivity
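
To make threshold control concrete, here is a minimal sketch of how a confidence cutoff turns tagger scores into a tag list. The function name `select_tags`, the default of `0.35`, and the example scores are all assumptions for illustration; the widget's real defaults may differ.

```python
def select_tags(scores, threshold=0.35):
    """Keep tags whose confidence meets the threshold, highest confidence first.

    `scores` maps tag name -> confidence in [0, 1], as a tagger might report.
    """
    kept = [(tag, score) for tag, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in kept]


# Hypothetical scores for a single image.
example_scores = {"1girl": 0.98, "smile": 0.40, "hat": 0.10}
print(select_tags(example_scores))        # ['1girl', 'smile']
print(select_tags(example_scores, 0.5))   # ['1girl']
```

Raising the threshold gives shorter, more conservative captions; lowering it adds more tags at the cost of more false positives.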

#### BLIP Captioning (Best for Photos/Realistic)
- **Natural language**: Generates descriptive sentences
- **Scene understanding**: Captures context and relationships
- **Perfect for**: Real photos, portraits, landscapes

### ✏️ Caption Management
- **Bulk editing**: Apply changes to all captions
- **Find & replace**: Update specific terms across dataset
- **Tag frequency analysis**: See most common tags
- **Manual editing**: Individual caption refinement
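
The find-and-replace and tag frequency features can be sketched in a few lines, assuming captions are comma-separated tag lists (the WD14 style rather than BLIP sentences). The helper names `split_tags`, `tag_frequency`, and `replace_tag` are hypothetical and not part of the widget.

```python
from collections import Counter


def split_tags(caption):
    """Split a comma-separated caption into clean, non-empty tags."""
    return [tag.strip() for tag in caption.split(",") if tag.strip()]


def tag_frequency(captions):
    """Count how often each tag appears across all captions."""
    counts = Counter()
    for caption in captions:
        counts.update(split_tags(caption))
    return counts


def replace_tag(caption, old, new):
    """Replace whole tags equal to `old` with `new`; partial matches are left alone."""
    return ", ".join(new if tag == old else tag for tag in split_tags(caption))
```

Matching whole tags rather than substrings means that replacing `hat` does not also rewrite `red hat`.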

### 🎯 Trigger Word System
- **Automatic injection**: Add trigger words to all captions
- **Position control**: Beginning, end, or random placement
- **Consistency checking**: Ensure all images have triggers
- **Preview mode**: See changes before applying
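
A minimal sketch of trigger word injection and consistency checking follows, again assuming comma-separated tag captions. `inject_trigger` and `missing_trigger` are hypothetical names; both work on strings in memory and write nothing to disk, so the results can be previewed before applying them.

```python
import random


def inject_trigger(caption, trigger, position="start", rng=None):
    """Return the caption with `trigger` added at the start, end, or a random slot."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    if trigger in tags:
        return ", ".join(tags)  # already present, do not add it twice
    if position == "start":
        tags.insert(0, trigger)
    elif position == "end":
        tags.append(trigger)
    elif position == "random":
        rng = rng or random
        # randint is inclusive, so the trigger may land anywhere from first to last.
        tags.insert(rng.randint(0, len(tags)), trigger)
    else:
        raise ValueError(f"unknown position: {position!r}")
    return ", ".join(tags)


def missing_trigger(captions, trigger):
    """Given {filename: caption}, return the filenames whose caption lacks the trigger."""
    return sorted(
        name for name, caption in captions.items()
        if trigger not in [tag.strip() for tag in caption.split(",")]
    )
```

For example, `inject_trigger("1girl, smile", "doro_style")` yields `doro_style, 1girl, smile`, where `doro_style` is a made-up trigger word.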

### 🚫 Tag Filtering & Blacklists
- **Quality filters**: Remove low-confidence tags
- **Content filters**: Block unwanted content types
- **Custom blacklists**: Your own forbidden tags
- **Whitelist mode**: Only allow specific tags
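
Blacklist and whitelist filtering can be sketched as a single pass over a tag list. `filter_tags` is a hypothetical helper, and the case-insensitive matching is an assumption rather than documented widget behavior.

```python
def filter_tags(tags, blacklist=(), whitelist=None):
    """Drop blacklisted tags; if a whitelist is given, keep only tags on it."""
    blocked = {tag.lower() for tag in blacklist}
    allowed = None if whitelist is None else {tag.lower() for tag in whitelist}
    result = []
    for tag in tags:
        key = tag.lower()
        if key in blocked:
            continue
        if allowed is not None and key not in allowed:
            continue
        result.append(tag)  # original spelling and order are preserved
    return result
```

In this sketch the blacklist wins when a tag appears on both lists, which is the more cautious choice.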

### 📈 Quality Analysis
- **Image statistics**: Resolution, format, size analysis
- **Tag distribution**: Most/least common tags
- **Caption length**: Optimal length recommendations
- **Duplicate detection**: Find similar images
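
As a simplified stand-in for duplicate detection, the sketch below groups files that are byte-for-byte identical by hashing their contents. It does not find visually similar images (that needs perceptual hashing or embeddings), and `find_exact_duplicates` is a hypothetical helper, not the widget's method.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; caption files are skipped on purpose.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_exact_duplicates(folder):
    """Return groups of image filenames whose file contents are identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Re-encoded or resized copies hash differently, so an empty result here does not rule out near-duplicates.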

### 💡 Best Practices

**Dataset Size:**
- **Characters**: 15-50 images work well
- **Styles**: 50-200 images for consistency
- **Concepts**: 20-100 images depending on complexity

**Image Quality:**
- **Resolution**: 768x768 minimum, 1024x1024 recommended
- **Variety**: Different poses, angles, expressions
- **Consistency**: Similar art style or character design

**Caption Quality:**
- **Accuracy**: Verify auto-generated tags
- **Completeness**: Include important details
- **Trigger words**: Use unique, memorable terms
- **Consistency**: Same terms for same concepts

In [None]:
from widgets.dataset_widget import DatasetWidget

# Initialize and display the dataset widget
dataset_widget = DatasetWidget()
dataset_widget.display()

---

## <img src="assets/doro_fubuki.png" width="32" height="32"> Dataset Preparation Checklist

Before moving to training, ensure you have:

### ✅ Dataset Structure
- [ ] Images are in a single folder
- [ ] All images have corresponding .txt caption files
- [ ] No corrupted or unreadable images
- [ ] Consistent image format (jpg/png)
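
The caption pairing item can be checked with a short script. This is a sketch under the assumption that each caption sits next to its image with the same base name and a `.txt` extension; `images_missing_captions` is a hypothetical helper.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def images_missing_captions(folder):
    """Return names of images in `folder` that have no matching .txt caption file."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() in IMAGE_EXTENSIONS
        and not path.with_suffix(".txt").exists()
    )
```

An empty list means every image found in the folder has a caption file beside it.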

### ✅ Caption Quality
- [ ] All captions contain your trigger word
- [ ] Tags are accurate and relevant
- [ ] No unwanted or problematic tags
- [ ] Caption length is reasonable (50-200 tokens)

### ✅ Content Verification
- [ ] Images represent what you want to train
- [ ] Sufficient variety in poses/angles
- [ ] Consistent quality across dataset
- [ ] No duplicate or near-duplicate images

---

## <img src="assets/OTNANGELDOROFIX.png" width="32" height="32"> Next Steps

Once your dataset is prepared:

1. **Note your dataset path** - you'll need it for training
2. **Remember your trigger word** - important for generation
3. **Open** `Lora_Trainer_Widget.ipynb` for training setup
4. **Run the Setup widget** first in the training notebook

---

## <img src="assets/OTNEARTHFIXDORO.png" width="32" height="32"> Troubleshooting

### Common Issues

**"No images found":**
- Check ZIP file structure (images should be in root or single folder)
- Verify image formats (jpg, png, webp supported)
- Ensure files aren't corrupted

**"Tagging failed":**
- Check your internet connection (tagger models are downloaded on first use)
- Verify sufficient disk space (2-3 GB for tagger models)
- Try a different tagger model

**"Captions too long/short":**
- Adjust tag threshold settings
- Use tag filtering to remove excess tags
- Consider manual editing for important images

**"Missing trigger words":**
- Use bulk edit to add trigger words
- Check trigger word injection settings
- Verify trigger word isn't being filtered out

---

*Ready to create amazing LoRAs? Let's go! 🚀*