# Primate Vocalization Detection Pipeline
## Reproducible End-to-End System

This notebook demonstrates a complete pipeline for detecting primate vocalizations in long audio recordings.

### Pipeline Overview:
1. **Configuration** - Set all parameters
2. **Data Loading** - Load species clips and background noise
3. **Preprocessing** - Convert to mel-spectrograms
4. **Data Augmentation** - Apply augmentation strategies
5. **Model Training** - Train VGG19-based classifier
6. **Detection** - Detect vocalizations in long audio files
7. **Analysis & Reporting** - Create visualizations and reports
8. **Hard Negative Mining** - Improve model by learning from mistakes (NEW!)
9. **Optional: Extract Clips** - Extract detected clips

### Key Features:
- Modular design - easy to add new species
- Configurable parameters - adjust without code changes
- Reproducible results - fixed random seeds
- GPU-accelerated - optimized for Google Colab
- Iterative improvement - hard negative mining for better accuracy
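
How the pipeline fixes its seeds is not shown in this notebook. As a minimal sketch, assuming the usual Python, NumPy, and TensorFlow generators are the ones involved, a helper like the hypothetical `set_global_seeds` below could be called once after the imports. It only seeds libraries that are already loaded.

```python
import os
import random
import sys


def set_global_seeds(seed=42):
    """Seed Python's RNG, plus NumPy and TensorFlow if they are already imported."""
    # Only affects hash randomization in processes started after this call.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    # Look the libraries up in sys.modules so this sketch runs without them.
    numpy_module = sys.modules.get("numpy")
    if numpy_module is not None:
        numpy_module.random.seed(seed)

    tensorflow_module = sys.modules.get("tensorflow")
    if tensorflow_module is not None:
        tensorflow_module.random.set_seed(seed)


set_global_seeds(42)
first_draw = random.random()
set_global_seeds(42)
second_draw = random.random()
print(first_draw == second_draw)  # True: same seed, same draw
```

Fixed seeds make the data splits and weight initialization repeatable, but GPU kernels can still introduce small run-to-run differences.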

## 0. Setup & Installation

In [None]:
# Install required packages
!pip install -q librosa soundfile tensorflow scikit-learn pandas matplotlib

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("Setup complete")

## 1. Configuration

All parameters are defined in `config.py`. To add a new species or change parameters, simply edit the config file.

In [None]:
# Import all modules
import config
import data_loader
import preprocessing
import augmentation
import model as model_module
import train
import detection
import utils

# Print configuration summary
config.print_config_summary()

### To Add a New Species:

1. Add audio files to Google Drive: `chimp-audio/audio/new_species_folder/`
2. Edit `config.py` and add to `SPECIES_FOLDERS`:
   ```python
   SPECIES_FOLDERS = {
       'Cercocebus_torquatus': 'Cercocebus torquatus hack 5s',
       'Colobus_guereza': 'Colobus guereza Clips 5s',
       'New_Species': 'new_species_folder',  # ← Add this line
   }
   ```
3. Re-run the notebook - everything will automatically adjust!
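
How the modules pick up a new entry is not shown in this notebook. One plausible pattern, sketched here with hypothetical names (`build_class_index` and the `'Background'` label are assumptions, not part of `config.py`), is to derive the class list from the dictionary keys so that the number of output classes follows the config:

```python
SPECIES_FOLDERS = {
    'Cercocebus_torquatus': 'Cercocebus torquatus hack 5s',
    'Colobus_guereza': 'Colobus guereza Clips 5s',
    'New_Species': 'new_species_folder',
}


def build_class_index(species_folders, background_label='Background'):
    """Map each class name to an integer label (hypothetical helper)."""
    # Sorting keeps the label order stable no matter how the dict was written.
    class_names = sorted(species_folders) + [background_label]
    return {name: index for index, name in enumerate(class_names)}


class_index = build_class_index(SPECIES_FOLDERS)
num_classes = len(class_index)
print(class_index)
print(f"Output layer size: {num_classes}")
```

With this pattern, adding a species changes the integer labels, so a model trained before the change should not be reused with the new class list.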

## 2. Data Loading & Exploration

Load all audio files and verify the data.

In [None]:
# Load species data
species_data = data_loader.load_species_data()

# Load background data
background_data = data_loader.load_background_data()

# Print summary
data_loader.print_data_summary(species_data, background_data)

## 3. Complete Training Pipeline

This runs the entire training process:
- Data preprocessing
- Augmentation
- Model training
- Evaluation

In [None]:
# Run complete training pipeline
trained_model = train.run_complete_training_pipeline()

## 4. Detection on Long Audio

Now we'll use the trained model to detect primate vocalizations in long audio files.
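
The internals of `detection.detect_in_long_audio` are not shown here. A common approach, and the assumption behind this sketch, is to slide a fixed-length window over the recording and classify each window. The 5-second window matches the clip folder names above; the 2.5-second hop and the function name `window_bounds` are illustrative only.

```python
def window_bounds(total_duration, window=5.0, hop=2.5):
    """Return (start, end) times in seconds for every full window."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    bounds = []
    index = 0
    # Stop as soon as a window would run past the end of the recording.
    while index * hop + window <= total_duration:
        start = index * hop
        bounds.append((start, start + window))
        index += 1
    return bounds


windows = window_bounds(12.0)
print(windows)  # [(0.0, 5.0), (2.5, 7.5), (5.0, 10.0)]
```

In this sketch the last 2 seconds of a 12-second recording are not covered by any window; a real implementation might pad the final segment instead of dropping it.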

### 4.1 Test on First Long Audio File

In [None]:
# Get list of long audio files
import os  # needed in this cell for os.path.basename

long_audio_files = data_loader.get_long_audio_files()

print(f"Found {len(long_audio_files)} long audio files:")
for i, file in enumerate(long_audio_files[:10], 1):  # Show first 10
    print(f"  {i}. {os.path.basename(file)}")
if len(long_audio_files) > 10:
    print(f"  ... and {len(long_audio_files) - 10} more")

In [None]:
# Detect in the first long audio file
import os

first_audio = long_audio_files[0]
print(f"Processing: {os.path.basename(first_audio)}")

detections_df = detection.detect_in_long_audio(
    trained_model, 
    first_audio,
    confidence_threshold=config.DETECTION_CONFIDENCE_THRESHOLD
)

# Display results
print("\n Detection Results:")
if len(detections_df) > 0:
    display(detections_df.head(20))  # Show first 20 detections
else:
    print("No detections found.")

In [None]:
# Save detections to CSV
csv_path = detection.save_detections(
    detections_df, 
    os.path.basename(first_audio)
)
print(f"Detections saved to: {csv_path}")

### 4.2 Visualize Detection Results

In [None]:
# Create visualization
utils.visualize_detection_results(
    first_audio,
    detections_df,
    save_path=None,  # Set to path to save, None to just display
    show_spectrogram=True
)

### 4.3 Process All Long Audio Files

In [None]:
# Process all long audio files
# Skip this cell if you only want to process the first file; note that
# Section 5 relies on all_detections, which is created here

all_detections = detection.process_all_long_audio_files(
    trained_model,
    confidence_threshold=config.DETECTION_CONFIDENCE_THRESHOLD
)

## 5. Analysis & Reporting

In [None]:
# Print detection statistics
utils.print_detection_statistics(all_detections)

In [None]:
# Create summary report
summary_path = os.path.join(config.DETECTION_OUTPUT_DIR, 'detection_summary.csv')
summary_df = utils.create_detection_summary_report(all_detections, summary_path)

print("\n Summary Report:")
display(summary_df)

In [None]:
# Create visualizations for all files
utils.visualize_all_detections(all_detections)

## 6. Hard Negative Mining - Model Improvement (**uncertain now**)

### What is Hard Negative Mining?

If you notice many **false positives** (e.g., bird calls being classified as primate calls), you can use hard negative mining to improve your model.

**The Process:**
1. Extract samples where the model is uncertain (medium confidence 0.5-0.85)
2. Manually verify which are false positives
3. Add verified false positives as "hard negatives" to training data
4. Retrain the model

**Result:** Model learns to distinguish between commonly confused sounds (e.g., bird calls vs primate calls)
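
Step 1 amounts to a band filter on detection confidence. The sketch below uses plain dictionaries with made-up values instead of the notebook's DataFrame, and the function name `select_uncertain` is hypothetical; `run_hard_negative_mining.py` may select candidates differently.

```python
def select_uncertain(detections, low=0.5, high=0.85):
    """Keep detections whose confidence falls inside [low, high]."""
    return [d for d in detections if low <= d['confidence'] <= high]


# Illustrative detections, not real pipeline output.
detections = [
    {'start': 0.0, 'species': 'Colobus_guereza', 'confidence': 0.97},
    {'start': 5.0, 'species': 'Colobus_guereza', 'confidence': 0.62},
    {'start': 10.0, 'species': 'Cercocebus_torquatus', 'confidence': 0.84},
    {'start': 15.0, 'species': 'Cercocebus_torquatus', 'confidence': 0.41},
]

candidates = select_uncertain(detections)
print([d['start'] for d in candidates])  # [5.0, 10.0]
```

With a pandas DataFrame, the same filter could be written as `detections_df[detections_df['confidence'].between(0.5, 0.85)]`, which also includes both endpoints.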

### When to Use This?

- After initial training and detection
- When you see many bird calls or environmental sounds misclassified
- When one species is heavily over-represented in detections
- To improve model precision on real-world recordings
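
One way to check for over-representation is to look at each species' share of all detections. This is a sketch with made-up labels; the 0.8 cutoff and the name `dominant_species` are hypothetical, not values from the pipeline.

```python
from collections import Counter


def dominant_species(labels, max_share=0.8):
    """Return (species, share) when one label exceeds max_share, else None."""
    if not labels:
        return None
    species, count = Counter(labels).most_common(1)[0]
    share = count / len(labels)
    return (species, share) if share > max_share else None


# Illustrative labels, not real pipeline output.
labels = ['Colobus_guereza'] * 9 + ['Cercocebus_torquatus']
flagged = dominant_species(labels)
print(flagged)  # ('Colobus_guereza', 0.9)
```

A dominant species is only a warning sign: it can also reflect what is really present at the recording site, so listen to a sample before concluding that the model is biased.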

### 6.1 Run Hard Negative Mining Script

In [None]:
# Run hard negative mining
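# exec runs the script in this notebook's namespace, so it shares the
# variables defined above; the with-block closes the file afterwards
with open('run_hard_negative_mining.py') as script:
    exec(script.read())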

### 6.2 Manual Verification (Critical Step!)

**IMPORTANT: You must do this manually in Google Drive!**

#### Instructions:

1. **Go to Google Drive**: `chimp-audio/audio/hard_negative_candidates/`

2. **Listen to each audio file** 

3. **Make a decision for each file:**
   - **DELETE** if it's an actual primate call (model is correct)
   - **KEEP** if it's NOT a primate call (bird, insect, rain, wind, etc.)

4. **Create new folder**: `chimp-audio/audio/verified_hard_negatives/`

5. **MOVE** all kept files to `verified_hard_negatives/`

#### What to Expect:
- **Examples of what to keep**: Bird calls, insect sounds, rain, wind, rustling leaves
- **Examples of what to delete**: Actual primate vocalizations


**PAUSE HERE and complete the manual verification before continuing!**

### 6.3 Update Configuration

After you've verified and organized the files, update `config.py` to include the new hard negatives folder.

In [None]:
# Check if verified_hard_negatives folder exists
import os
verified_folder = os.path.join(config.AUDIO_ROOT, 'verified_hard_negatives')

if os.path.exists(verified_folder):
    file_count = len([f for f in os.listdir(verified_folder) if f.endswith('.wav')])
    print(f"Found verified_hard_negatives folder with {file_count} files")
    print("\nNow edit config.py manually!")
    print("\nAdd the verified_hard_negatives folder to BACKGROUND_FOLDERS (around lines 28-32).")
    print("\nThen run the next cell to reload config.")
else:
    print("verified_hard_negatives folder not found!")
    print("\nPlease:")
    print("1. Go to Google Drive: chimp-audio/audio/")
    print("2. Create folder: verified_hard_negatives/")
    print("3. Move verified files from hard_negative_candidates/ to verified_hard_negatives/")
    print("4. Run this cell again")

In [None]:
# Reload config after editing
import importlib
importlib.reload(config)

# Verify new folder is recognized
print("Updated Configuration:")
config.print_config_summary()

# Load updated data
background_data_updated = data_loader.load_background_data()
print(f"\n Total background samples: {len(background_data_updated)}")

### 6.4 Retrain with Hard Negatives

Now retrain the model with the expanded background dataset that includes verified hard negatives.

In [None]:
# Train improved model
improved_model = train.run_complete_training_pipeline()

### 6.5 Compare Results

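Test the improved model on the same audio file and compare with the original results. A lower detection count is an improvement only if the detections that disappeared were false positives, so spot-check the differences by listening.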

In [None]:
# Run detection with improved model
print(f"Testing improved model on: {os.path.basename(first_audio)}\n")

improved_detections = detection.detect_in_long_audio(
    improved_model,
    first_audio,
    confidence_threshold=config.DETECTION_CONFIDENCE_THRESHOLD
)

# Compare detection counts
n_original = len(detections_df)
n_improved = len(improved_detections)

print(f"\nOriginal model detections: {n_original}")
print(f"Improved model detections: {n_improved}")
if n_original > 0:
    change_pct = (n_improved / n_original - 1) * 100
    print(f"Change: {n_improved - n_original} ({change_pct:.1f}%)")
else:
    # Avoid dividing by zero when the original model found nothing
    print(f"Change: {n_improved - n_original}")

# Per-species counts and mean confidence, skipped for empty results
if n_original > 0:
    print("\nOriginal Distribution:")
    print(detections_df['species'].value_counts())
    print(f"Average confidence (original): {detections_df['confidence'].mean():.4f}")

if n_improved > 0:
    print("\nImproved Distribution:")
    print(improved_detections['species'].value_counts())
    print(f"Average confidence (improved): {improved_detections['confidence'].mean():.4f}")

In [None]:
# Visualize improved results
utils.visualize_detection_results(
    first_audio,
    improved_detections,
    save_path=None,
    show_spectrogram=True
)

### 6.6 Save Improved Model (Optional)

If you're satisfied with the improved results, save this model with a descriptive name.

In [None]:
# Save improved model with descriptive name
improved_model_path = os.path.join(config.MODEL_SAVE_DIR, 'model_v2_with_hard_negatives.h5')
improved_model.save(improved_model_path)

print(f" Improved model saved to: {improved_model_path}")

### 6.7 Process All Files with Improved Model (Optional)

Once satisfied with the improved model, process all long audio files again.

In [None]:
# Process all files with improved model
all_detections_improved = detection.process_all_long_audio_files(
    improved_model,
    confidence_threshold=config.DETECTION_CONFIDENCE_THRESHOLD
)

# Generate new reports
utils.print_detection_statistics(all_detections_improved)

summary_improved_path = os.path.join(config.DETECTION_OUTPUT_DIR, 'detection_summary_v2.csv')
summary_improved_df = utils.create_detection_summary_report(all_detections_improved, summary_improved_path)

print(f"\n Improved summary saved to: {summary_improved_path}")

## 7. Extract Detected Clips

Extract audio clips for each detection for manual validation.
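
How `utils.extract_detected_audio_clips` applies its `padding` argument is not shown in this notebook. A reasonable assumption is that each detection is widened on both sides and clamped to the recording, as in this sketch (`padded_bounds` is a hypothetical helper, not part of `utils`):

```python
def padded_bounds(start, end, padding=0.5, total_duration=None):
    """Widen a detection by `padding` seconds per side, clamped to the recording."""
    clip_start = max(0.0, start - padding)
    clip_end = end + padding
    if total_duration is not None:
        clip_end = min(clip_end, total_duration)
    return clip_start, clip_end


# A detection near the start of the file cannot begin before 0 seconds.
print(padded_bounds(0.25, 5.25))                        # (0.0, 5.75)
# A detection at the end of a 60-second file is cut off at 60 seconds.
print(padded_bounds(57.0, 60.0, total_duration=60.0))   # (56.5, 60.0)
```

Without the clamp, detections at the edges of a recording would request samples that do not exist.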

In [None]:
# Extract clips from first audio file
clips_output_dir = os.path.join(config.OUTPUT_ROOT, 'detected_clips')

# Choose which detections to extract:
# Use 'improved_detections' if you've done hard negative mining
# Use 'detections_df' for original model results

detections_to_extract = improved_detections if 'improved_detections' in locals() else detections_df

if len(detections_to_extract) > 0:
    utils.extract_detected_audio_clips(
        first_audio,
        detections_to_extract,
        clips_output_dir,
        padding=0.5  # Add 0.5s padding around each detection
    )
else:
    print("No detections to extract.")

## 8. Model Persistence

The best model is automatically saved during training. You can also manually save/load models.

In [None]:
# Best model was saved during training
best_model_path = os.path.join(config.MODEL_SAVE_DIR, 'best_model.h5')
print(f"Best model saved at: {best_model_path}")

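# If you did hard negative mining, you also have the v2 model.
# The path is rebuilt here so this cell also runs when Section 6 was skipped.
improved_model_path = os.path.join(config.MODEL_SAVE_DIR, 'model_v2_with_hard_negatives.h5')
if os.path.exists(improved_model_path):
    print(f"Improved model (v2) saved at: {improved_model_path}")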

# To load a model later:
# loaded_model = model_module.load_trained_model(best_model_path)

## 9. Adjust Detection Threshold (Optional)

If you want to experiment with different confidence thresholds without retraining:

In [None]:
# Try different thresholds
thresholds = [0.5, 0.7, 0.9]

# Use improved model if available, otherwise use original
test_model = improved_model if 'improved_model' in locals() else trained_model

for threshold in thresholds:
    print(f"Testing threshold: {threshold}")
    
    detections = detection.detect_in_long_audio(
        test_model,
        first_audio,
        confidence_threshold=threshold
    )
    
    print(f"\nFound {len(detections)} detections with threshold {threshold}")
    if len(detections) > 0:
        print(detections['species'].value_counts())
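
Re-running detection for every threshold repeats the model inference each time. If the threshold is applied as a simple per-detection cutoff (an assumption; `detect_in_long_audio` may also merge or post-process detections), a cheaper option is to run detection once at the lowest threshold and count the surviving confidences afterwards. The function name `count_by_threshold` and the confidence values are illustrative.

```python
def count_by_threshold(confidences, thresholds):
    """Count how many confidences reach each threshold."""
    return {t: sum(c >= t for c in confidences) for t in thresholds}


# Illustrative confidences from a single pass at the lowest threshold.
confidences = [0.55, 0.72, 0.91, 0.97, 0.68]
counts = count_by_threshold(confidences, [0.5, 0.7, 0.9])
print(counts)  # {0.5: 5, 0.7: 3, 0.9: 2}
```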

## Summary

### Output Files Generated:

```
drive/MyDrive/chimp-audio/outputs/
├── models/
│   ├── best_model.h5              # Initial trained model
│   ├── model_v2_with_hard_negatives.h5  # Improved model (if you did hard negative mining)
│   ├── training_history.json      # Training metrics
│   └── training_history.png       # Training curves
├── detections/
│   ├── *_detections.csv           # Detection results (CSV)
│   ├── detection_summary.csv      # Overall summary (v1)
│   └── detection_summary_v2.csv   # Overall summary (v2, if improved)
├── visualizations/
│   └── *_visualization.png        # Waveform/spectrogram plots
└── detected_clips/
    └── *.wav                      # Extracted audio clips
```

### Workflow Summary:

**Standard workflow:**
1. Setup → Configure → Load Data → Train → Detect → Analyze

**With Hard Negative Mining (Recommended):**
1. Setup → Configure → Load Data → Train → Detect → Analyze
2. **Hard Negative Mining** → Manual Verification → Update Config → Retrain
3. Compare Results → Use Improved Model for Production

### To Add New Species or Data:

1. Add new audio files to appropriate folders in Google Drive
2. Update `config.py` (add to `SPECIES_FOLDERS` or `BACKGROUND_FOLDERS`)
3. Re-run this notebook

### Next Steps:

- **If results are good**: Process all files and generate final reports
- **If too many false positives**: Use Hard Negative Mining (Section 6)
- **If adding new data**: Update config and retrain
- **For production use**: Save improved model and document threshold settings