# <img src="assets/doro.png" width="32" height="32"> Dataset Maker - LoRA Training Assistant  

This notebook contains the **Dataset Widget** for preparing your training datasets.

## What this notebook handles:
- **Dataset upload** and extraction from ZIP files
- **🆕 Image curation** with FiftyOne (duplicate detection, visual inspection)
- **Image tagging** with WD14 v3 taggers or BLIP captioning
- **Caption management** and editing
- **Trigger word** management and injection
- **Tag filtering** and blacklist management

## Workflow:
1. **Upload your dataset** (ZIP file or folder)
2. **🆕 Visual curation** with FiftyOne (remove duplicates, inspect quality)
3. **Configure tagging** settings (WD14 for anime, BLIP for photos)
4. **Review and edit** generated captions
5. **Add trigger words** for your LoRA
6. **Move to training** in `Lora_Trainer_Widget.ipynb`

---

## <img src="assets/OTNDORODUSKFIXED.png" width="32" height="32"> 1. Setup Validation

**What this cell does:** Sets up the training environment if you haven't done it yet.

**When to run:** 
- If this is your first time using the system
- If you're getting "module not found" errors
- If you skipped setup in the main training notebook

**What happens:**
- Downloads and installs the training backend (~10-15GB)
- Sets up directory structure 
- Validates your system (GPU, VRAM, storage)

**Skip this if:** You already ran setup in `Lora_Trainer_Widget.ipynb`

```python
# **CELL 1A:** Environment Validation

from shared_managers import create_widget

# Initialize and display the simplified setup widget (validation only)
setup_widget = create_widget('setup_simple')
setup_widget.display()
```

## <img src="assets/doro_fubuki.png" width="32" height="32"> 2. Dataset Management Widget

**What this cell does:** Traditional dataset tagging interface for your CURATED images.

**This widget handles:**
- 📁 **Dataset Input**: Point to your curated dataset directory
- 🏷️ **Auto-Tagging**: WD14 v3 (anime/art) or BLIP (photos)  
- ✏️ **Caption Editing**: Bulk edit, find/replace, manual tweaks
- 🎯 **Trigger Words**: Add your unique trigger word to all captions
- 🚫 **Tag Filtering**: Remove unwanted tags with blacklists
- 📊 **Quality Tools**: Tag analysis and final review

**How to use:**
1. **Select** your curated dataset directory (the output of Workflow step 2, visual curation)
2. **Configure** tagging settings for your content type
3. **Review** and edit the generated captions
4. **Add** your trigger word to all images
5. **Filter** out any unwanted tags

```python
# **CELL 3:** Dataset Tagging Widget (After Curation)

from shared_managers import create_widget

# Initialize and display the dataset widget for tagging curated images
dataset_widget = create_widget('dataset')
dataset_widget.display()
```

---

## <img src="assets/OTNANGELDOROFIX.png" width="32" height="32"> Next Steps

Once your dataset is prepared:

1. **Note your dataset path** - you'll need it for training
2. **Remember your trigger word** - important for generation
3. **Open** `Lora_Trainer_Widget.ipynb` for training setup
4. **Run the Setup widget** first in the training notebook

---

## <img src="assets/OTNEARTHFIXDORO.png" width="32" height="32"> Troubleshooting

### Common Issues

**"No images found":**
- Check the ZIP file structure (images should be in the archive root or in a single folder)
- Verify image formats (jpg, png, webp supported)
- Ensure files aren't corrupted
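
The three checks above can be run on a ZIP before uploading it. The following is a minimal sketch using only the Python standard library. It is not part of the Dataset Widget; `summarize_zip` is a hypothetical helper name, and treating `.jpeg` as equivalent to `.jpg` is an assumption.

```python
import zipfile
from pathlib import PurePosixPath

# Formats listed as supported above; ".jpeg" is an assumed alias for ".jpg".
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def summarize_zip(zip_path):
    """Return (image_names, folders_with_images, first_corrupt_member_or_None)."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries, which end with "/".
        names = [n for n in archive.namelist() if not n.endswith("/")]
        images = [
            n for n in names
            if PurePosixPath(n).suffix.lower() in SUPPORTED_EXTENSIONS
        ]
        # "." means the image sits in the archive root.
        folders = sorted({str(PurePosixPath(n).parent) for n in images})
        # testzip() checks each member's CRC and returns the first bad name.
        first_bad = archive.testzip()
    return images, folders, first_bad
```

If `folders` contains more than one entry, the images are spread across several folders, which may explain a "No images found" result. A non-`None` third value names a member that failed its integrity check.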

**"Tagging failed":**
- Check internet connection for model downloads
- Verify sufficient disk space (2-3GB for tagger models)
- Try a different tagger model
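
The disk-space check can be scripted with the standard library. This is a small sketch under stated assumptions: `has_free_space` is a hypothetical helper, and the 3 GB default simply takes the upper end of the 2-3GB figure above.

```python
import shutil


def has_free_space(path=".", required_gb=3.0):
    """Return (enough_space, free_gb) for the filesystem containing `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb


ok, free_gb = has_free_space(".")
print(f"Free space: {free_gb:.1f} GB, enough for tagger models: {ok}")
```

Point `path` at the drive where the tagger models are actually stored, which may differ from the notebook's working directory.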

**"Captions too long/short":**
- Adjust tag threshold settings
- Use tag filtering to remove excess tags
- Consider manual editing for important images
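
To illustrate what tag filtering does to a caption, here is a minimal sketch. It assumes captions are comma-separated tag lists, which this notebook does not state explicitly. `filter_tags` and its `max_tags` parameter are hypothetical and do not reflect the widget's actual implementation.

```python
def filter_tags(caption, blacklist, max_tags=None):
    """Drop blacklisted tags and optionally cap the number of tags kept."""
    # Split on commas and discard empty fragments.
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    blocked = {b.strip().lower() for b in blacklist}
    kept = [t for t in tags if t.lower() not in blocked]
    if max_tags is not None:
        # Keep the leading tags; order is preserved.
        kept = kept[:max_tags]
    return ", ".join(kept)


print(filter_tags("1girl, solo, watermark, smile", ["watermark"]))
```

Raising the tag threshold in the widget shortens captions at tagging time, whereas a filter like this one trims captions that already exist.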

**"Missing trigger words":**
- Use bulk edit to add trigger words
- Check trigger word injection settings
- Verify trigger word isn't being filtered out
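
To find out which captions are missing the trigger word, a sketch like the following may help. It assumes each image has a sidecar `.txt` caption file containing comma-separated tags in the dataset directory. Both function names are hypothetical and independent of the widget's own trigger word injection.

```python
from pathlib import Path


def add_trigger_word(caption, trigger):
    """Return the caption with `trigger` as the first tag, without duplicates."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    tags = [t for t in tags if t.lower() != trigger.lower()]
    return ", ".join([trigger] + tags)


def find_captions_missing_trigger(dataset_dir, trigger):
    """List caption file names in `dataset_dir` that lack the trigger word."""
    missing = []
    for txt in sorted(Path(dataset_dir).glob("*.txt")):
        text = txt.read_text(encoding="utf-8")
        tags = [t.strip().lower() for t in text.split(",")]
        if trigger.lower() not in tags:
            missing.append(txt.name)
    return missing
```

If a file shows up in the result after bulk editing, check whether the trigger word is on a blacklist that removes it again.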


