<a href="https://colab.research.google.com/github/Aniruddha072/Noise-classification/blob/main/notebooks/02_download_dataset.ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Noise Classification Project — Download & Prepare Datasets (Colab Ready)

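This notebook downloads UrbanSound8K and ESC-50, collects custom campus recordings uploaded from your machine, and sanity-checks each dataset by counting its `.wav` files and printing MD5 checksums for a few sample files.
It stores the raw datasets in `/content/datasets/` and persists a copy in `/content/drive/MyDrive/NoiseClassification/datasets/`, which requires Google Drive to be mounted at `/content/drive`.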

---

## 1. Download UrbanSound8K

```python
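import os
from google.colab import drive

# Mount Google Drive so the /content/drive/MyDrive paths used below exist
drive.mount('/content/drive')

local_dataset_dir = '/content/datasets/UrbanSound8K'
drive_dataset_dir = '/content/drive/MyDrive/NoiseClassification/datasets/UrbanSound8K'
os.makedirs(local_dataset_dir, exist_ok=True)

# Download and extract (official link); the URL is quoted so the shell does not interpret '?'
!wget -O UrbanSound8K.tar.gz "https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz?download=1"
# The archive is expected to unpack into an UrbanSound8K/ folder, which lands
# directly at local_dataset_dir, so no extra mv step is needed
!tar -xzf UrbanSound8K.tar.gz -C /content/datasets/

# Copy to persistent Drive folder; copying the contents (/.) avoids nesting
# UrbanSound8K/UrbanSound8K when the cell is re-run
!mkdir -p {drive_dataset_dir}
!cp -r {local_dataset_dir}/. {drive_dataset_dir}/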
```
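
The Drive copy can also be done in pure Python, which makes re-runs easier to reason about than `cp -r`. The following is a minimal sketch, not part of the original notebook: `mirror_folder` is a hypothetical helper name, and it assumes Python 3.8+ for the `dirs_exist_ok` argument of `shutil.copytree`.

```python
import shutil
from pathlib import Path

def mirror_folder(src, dst):
    """Copy src into dst, merging with whatever dst already holds.

    Returns the number of files under dst after the copy.
    (Hypothetical helper, shown for illustration only.)
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"Source folder not found: {src}")
    # dirs_exist_ok=True (Python 3.8+) overwrites files in place instead of
    # failing or creating a nested second copy when dst already exists
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob('*') if p.is_file())
```

In this notebook it would be called as `mirror_folder(local_dataset_dir, drive_dataset_dir)` once Drive is mounted.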

---

## 2. Download ESC-50

```python
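import os

local_esc_dir = '/content/datasets/ESC-50'
drive_esc_dir = '/content/drive/MyDrive/NoiseClassification/datasets/ESC-50'
os.makedirs(local_esc_dir, exist_ok=True)

!wget -O ESC-50-master.zip https://github.com/karoldvl/ESC-50/archive/master.zip
# -o overwrites existing files without prompting, so re-running the cell does not stall
!unzip -q -o ESC-50-master.zip -d /content/datasets/
# After these moves the layout is ESC-50/audio/*.wav plus ESC-50/esc50.csv
!mv /content/datasets/ESC-50-master/audio {local_esc_dir}
!mv /content/datasets/ESC-50-master/meta/esc50.csv {local_esc_dir}/esc50.csv

# Copy to persistent Drive folder; copying the contents (/.) avoids nesting
# ESC-50/ESC-50 when the cell is re-run
!mkdir -p {drive_esc_dir}
!cp -r {local_esc_dir}/. {drive_esc_dir}/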
```
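
Since the cell above places `esc50.csv` next to the `audio` folder, the metadata can be used to confirm that every listed clip actually arrived. This is a rough sketch under two assumptions: the CSV has a `filename` column, as in the ESC-50 repository, and `find_missing_clips` is a hypothetical helper name introduced here.

```python
import csv
import os

def find_missing_clips(csv_path, audio_dir):
    """Return filenames listed in the metadata CSV that are absent from audio_dir.

    Assumes the CSV has a 'filename' column (hypothetical helper, for illustration).
    """
    with open(csv_path, newline='') as f:
        listed = [row['filename'] for row in csv.DictReader(f)]
    present = set(os.listdir(audio_dir))
    return [name for name in listed if name not in present]
```

A call such as `find_missing_clips(f'{local_esc_dir}/esc50.csv', f'{local_esc_dir}/audio')` should return an empty list when the download is complete.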

---

## 3. Upload Custom Campus Recordings

```python
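import os
import shutil
from google.colab import files

campus_dir = '/content/datasets/campus'
drive_campus_dir = '/content/drive/MyDrive/NoiseClassification/datasets/campus'
os.makedirs(campus_dir, exist_ok=True)

print("Upload campus .wav files (multiple selection allowed):")
uploaded = files.upload()  # uploaded files are written to the current working directory

for fname in uploaded.keys():
    # shutil.move also works when the destination is on a different filesystem
    shutil.move(fname, os.path.join(campus_dir, fname))

# Copy to persistent Drive folder; copying the contents (/.) avoids nesting
# campus/campus when the cell is re-run
!mkdir -p {drive_campus_dir}
!cp -r {campus_dir}/. {drive_campus_dir}/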
```
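
Uploaded recordings come from varied devices, so it can help to confirm that each file really is a readable WAV before it enters the dataset. The sketch below uses the standard-library `wave` module; `describe_wav` is a hypothetical helper name. Note that `wave` only reads uncompressed PCM, so a `None` result means "not readable by `wave`", not necessarily "corrupt".

```python
import wave

def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) for a PCM WAV file.

    Returns None if the stdlib wave module cannot parse the file
    (hypothetical helper, for illustration).
    """
    try:
        with wave.open(path, 'rb') as w:
            frames = w.getnframes()
            rate = w.getframerate()
            duration = (frames / rate) if rate else 0.0
            return w.getnchannels(), rate, duration
    except (wave.Error, EOFError):
        # wave.Error: bad or unsupported header; EOFError: empty or truncated file
        return None
```

For example, looping over `/content/datasets/campus` and printing the names for which `describe_wav` returns `None` gives a quick list of files to re-export.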

---

## 4. File Count & Checksums

```python
import glob
import hashlib
import os

def get_file_count_and_checksums(folder):
    # Sort so the same sample files are reported on every run
    wav_paths = sorted(glob.glob(os.path.join(folder, '**', '*.wav'), recursive=True))
    print(f"{folder}: {len(wav_paths)} .wav files")
    # Print MD5 checksums of the first 3 files as a quick spot check
    for path in wav_paths[:3]:
        with open(path, 'rb') as fh:
            checksum = hashlib.md5(fh.read()).hexdigest()
        print(f"Sample file: {os.path.basename(path)} | MD5: {checksum}")
    return len(wav_paths)

count_us8k = get_file_count_and_checksums(local_dataset_dir)
count_esc = get_file_count_and_checksums(local_esc_dir)
count_campus = get_file_count_and_checksums('/content/datasets/campus')
```
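
The spot check above hashes only three files per dataset. If a full integrity record is wanted, for instance to compare the local copy against the Drive copy, every file can be hashed in chunks so memory use stays flat. This is a minimal sketch; `md5_of_file` and `checksum_folder` are hypothetical helper names, not part of the original notebook.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_folder(folder, suffix='.wav'):
    """Map each matching file (path relative to folder) to its MD5 digest."""
    result = {}
    for root, _dirs, names in os.walk(folder):
        for name in names:
            if name.lower().endswith(suffix):
                full = os.path.join(root, name)
                result[os.path.relpath(full, folder)] = md5_of_file(full)
    return result
```

Two copies match when `checksum_folder(local_dataset_dir) == checksum_folder(drive_dataset_dir)`. The file counts can also be compared with the published dataset sizes, 8,732 clips for UrbanSound8K and 2,000 for ESC-50.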

---

## 5. Sample Subset (for quick debugging)

```python
import glob
import os
import shutil

def copy_sample_files(src_folder, dst_folder, n=10):
    # Sort so the same files are selected on every run
    wav_paths = sorted(glob.glob(os.path.join(src_folder, '**', '*.wav'), recursive=True))
    os.makedirs(dst_folder, exist_ok=True)
    sample = wav_paths[:n]
    for path in sample:
        shutil.copy(path, dst_folder)
    # Report the number actually copied, which is below n for small folders
    print(f"Copied {len(sample)} sample files to {dst_folder}")

copy_sample_files(local_dataset_dir, '/content/datasets/UrbanSound8K_sample', n=10)
copy_sample_files(local_esc_dir, '/content/datasets/ESC-50_sample', n=10)
copy_sample_files('/content/datasets/campus', '/content/datasets/campus_sample', n=5)
```
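
Taking the first `n` files tends to draw every sample from one fold or one class, because the files are grouped by folder and name. A seeded random pick is one possible alternative that stays reproducible across runs. The sketch below is an assumption about what a debugging subset might need, and `pick_sample` is a hypothetical helper name.

```python
import random

def pick_sample(paths, n, seed=0):
    """Return up to n paths chosen reproducibly at random.

    Sorting first makes the result independent of the input order
    (hypothetical helper, for illustration).
    """
    ordered = sorted(paths)
    if n >= len(ordered):
        return ordered
    # A private Random instance keeps the global random state untouched
    return random.Random(seed).sample(ordered, n)
```

The returned paths can then be passed to `shutil.copy` one by one, exactly as in the cell above.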

---

## 6. Update Manifest

```python
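import json
import os

manifest_path = '/content/drive/MyDrive/NoiseClassification/manifest.json'

# Load an existing manifest, if any, so entries written elsewhere are preserved
manifest = {}
if os.path.exists(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)

# Update only the "count" field of each dataset entry
counts = {"UrbanSound8K": count_us8k, "ESC-50": count_esc, "Campus": count_campus}
for name, count in counts.items():
    manifest.setdefault(name, {})["count"] = count

with open(manifest_path, 'w') as f:
    json.dump(manifest, f, indent=2)
print("Dataset counts updated in manifest.json")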
```

---

> **Next steps:** Run the next notebook (`03_metadata_preparation.ipynb`) to prepare and analyze metadata for all datasets.