# Final Project - Data Downloading
- Course: AAI-511-Neural Networks
- Institution: University of San Diego
- Professor: Kahila Mokhtari Jadid
- Group 4 Members: 
    * Lucas Young
    * Titouan Margret
    * Juan Pablo Triana Martinez

This Jupyter notebook walks through the following steps:
1. Using `kagglehub`, download the dataset into a `data` folder outside the notebooks directory.
2. From the `data.zip` file, select the `midiclassics.zip` file and create a folder called `final_proj_data`, divided into:
    - MIDI files for Bach
    - MIDI files for Beethoven
    - MIDI files for Chopin
    - MIDI files for Mozart
3. Using `mido`, read the files of these four composers.
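
As a rough illustration of step 3, the sketch below opens a single MIDI file with `mido` and collects a few basic facts about it. The helper name `summarize_midi` and the fields it returns are assumptions made for this example, not part of the project code. A file that `mido` cannot parse will raise an exception, so a real loop over the dataset would probably want a `try`/`except` around the call.

```python
from pathlib import Path


def summarize_midi(midi_path: Path) -> dict:
    """Return a few basic facts about one MIDI file (hypothetical helper)."""
    import mido  # imported here so the helper can be defined before mido is installed

    mid = mido.MidiFile(str(midi_path))

    # A note_on with velocity 0 is conventionally a note_off, so skip those
    note_ons = sum(
        1
        for track in mid.tracks
        for msg in track
        if msg.type == "note_on" and msg.velocity > 0
    )
    return {
        "file": midi_path.name,
        "midi_type": mid.type,
        "tracks": len(mid.tracks),
        "ticks_per_beat": mid.ticks_per_beat,
        "note_on_events": note_ons,
    }
```

Calling `summarize_midi(some_path)` on one file per composer is a quick way to confirm that the files are readable before building a full loader.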

In [13]:
import sys, site, pprint
print("Notebook is running from:", sys.executable)
pprint.pprint(site.getsitepackages())      # optional: shows where it looks for packages


Notebook is running from: /Users/lucasyoung/MSAAI-511-group-4/msaa511_env/bin/python
['/Users/lucasyoung/MSAAI-511-group-4/msaa511_env/lib/python3.12/site-packages']


In [14]:
# Import necessary modules
import kagglehub
from pathlib import Path
import os

In [15]:
# Obtain current path and data folder path
curr_path = Path.cwd()
data_path = curr_path.parent / "data"

# Set the download path to our data folder (genuinely, why, kagglehub? :) )
os.environ["KAGGLEHUB_CACHE"] = str(data_path)

if data_path.exists():
    print(f"{data_path}: Already exists, no need to create a new one")
else:
    data_path.mkdir(parents=True, exist_ok=True)

# Download latest version
path = kagglehub.dataset_download("blanderbuss/midi-classic-music")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/blanderbuss/midi-classic-music?dataset_version_number=1...


100%|██████████| 68.2M/68.2M [00:12<00:00, 5.65MB/s]

Extracting files...





Path to dataset files: /Users/lucasyoung/MSAAI-511-group-4/data/datasets/blanderbuss/midi-classic-music/versions/1


### Well! We have the following:
`data -> datasets -> blanderbuss -> midi-classic-music -> versions -> 1 -> midiclassics`
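
The layout above can be checked programmatically before any files are copied. The sketch below is only an illustration: `find_composer_dirs` is a hypothetical helper, and it assumes the composer folders are named exactly after the composers. It searches below a root folder and keeps the shallowest directory matching each requested name.

```python
from pathlib import Path
from typing import Dict, Iterable


def find_composer_dirs(download_root: Path, composers: Iterable[str]) -> Dict[str, Path]:
    """Locate the shallowest directory named after each composer (hypothetical helper)."""
    wanted = set(composers)
    found: Dict[str, Path] = {}

    # Sort by depth first so a top-level "Bach" wins over a nested one
    candidates = sorted(download_root.rglob("*"), key=lambda p: (len(p.parts), str(p)))
    for p in candidates:
        if p.is_dir() and p.name in wanted and p.name not in found:
            found[p.name] = p
    return found
```

For example, `find_composer_dirs(Path(path), ["Bach", "Beethoven", "Chopin", "Mozart"])` should return four entries if the download has the expected structure; a missing key points to a renamed or absent folder.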

Inside this directory, `midiclassics`, we want to retrieve the MIDI files for:
- Bach
- Beethoven
- Chopin
- Mozart

Each composer is a directory holding a varying number of MIDI files, plus zip files that contain more MIDI files!
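
Because some MIDI files sit inside zip archives, one option is to unpack those archives before collecting files. The following is a minimal sketch using only the standard library, not the approach this notebook necessarily takes: `extract_nested_zips` is a hypothetical helper that extracts each archive into a sibling folder named after it and repeats until no unprocessed archive is left, so zips inside zips are covered too.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(root: Path) -> int:
    """Extract every .zip under root next to the archive (hypothetical helper).

    Returns the number of archives that were extracted successfully.
    """
    seen = set()
    extracted = 0
    while True:
        # Re-scan on each pass: extracting may reveal archives inside archives
        pending = [p for p in root.rglob("*.zip") if p not in seen]
        if not pending:
            return extracted
        for archive in pending:
            seen.add(archive)
            target = archive.with_suffix("")  # e.g. Bach/extra.zip -> Bach/extra/
            try:
                with zipfile.ZipFile(archive) as zf:
                    zf.extractall(target)
                extracted += 1
            except zipfile.BadZipFile:
                print(f"Skipping unreadable archive: {archive}")
```

Extracting into a separate folder per archive avoids overwriting files that share a name with something already in the composer directory, at the cost of one extra level of nesting.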

### Let's create a class to retrieve the data

In [16]:
from pathlib import Path
import shutil


class DataRetriever:
    '''
    Accesses a data folder containing multiple directories and zip files
    and retrieves the MIDI files for:
        - Bach
        - Beethoven
        - Chopin
        - Mozart

    Args:
        data_path: pathlib path to the data folder where the data is stored

    Returns (from subdivide_data):
        retrieved_path: pathlib path to the folder holding all the MIDI files
        of the 4 desired composers.

    Example:
        retriever = DataRetriever(data_path=set_path)
        retrieved_path = retriever.subdivide_data()
    '''

    def __init__(self, data_path: Path):
        self.data_path = data_path
        self.dataset_path = data_path / "datasets"

        # Explicit, ordered list — not a set
        self.composers = ["Bach", "Beethoven", "Chopin", "Mozart"]

        # Map composer → destination Path
        self.dest_dirs = {
            c: (data_path / "final_proj_data" / c)
            for c in self.composers
        }
        for p in self.dest_dirs.values():
            p.mkdir(parents=True, exist_ok=True)

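    def subdivide_data(self) -> Path:
        # Walk once and route every .mid by composer *name*, not by index.
        # Note: Path.walk() requires Python 3.12+.
        for root, dirs, files in self.dataset_path.walk():
            # Match the composer folder anywhere along the path, so files in
            # sub-folders of a composer directory are routed as well
            rel_parts = root.relative_to(self.dataset_path).parts
            composer = next((p for p in rel_parts if p in self.dest_dirs), None)
            if composer is None:
                continue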

            for f in files:
                if f.endswith(".mid"):
                    src = Path(root) / f
                    dst = self.dest_dirs[composer] / f

                    # If a file with the same name exists, append a suffix
                    if dst.exists():
                        dst = dst.with_stem(dst.stem + "_" + src.parent.name)

                    shutil.copy2(src, dst)   # copy to keep original intact
        return self.data_path / "final_proj_data"


In [17]:
retriever = DataRetriever(data_path = data_path)
retrieved_path = retriever.subdivide_data()
print(retrieved_path)

/Users/lucasyoung/MSAAI-511-group-4/data/final_proj_data
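
A quick sanity check after copying is to count how many MIDI files ended up in each composer folder. The sketch below assumes the folder-per-composer layout created above; `count_midi_files` is a hypothetical helper, and the case-insensitive `.mid` match is an assumption of this example.

```python
from pathlib import Path
from typing import Dict


def count_midi_files(root: Path) -> Dict[str, int]:
    """Map each sub-folder of root to the number of .mid files directly inside it."""
    return {
        d.name: sum(
            1 for f in d.iterdir() if f.is_file() and f.suffix.lower() == ".mid"
        )
        for d in sorted(root.iterdir())
        if d.is_dir()
    }
```

Running `count_midi_files(retrieved_path)` gives one count per composer, which makes class imbalance visible before any model training starts.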
