# Wrythm – Data Extraction and Dataset

## Notebook Objective
This notebook focuses primarily on:

1. Extracting audio previews from Spotify / MusicBrainz.
2. Extracting relevant musical features (MFCCs, BPM, rhythmic and melodic patterns, and key).
3. Building a consolidated dataset in JSON format for future use in genre classification.
4. Initial data exploration (EDA – exploratory data analysis).

---

## Pipeline

### 1. Audio Collection
- Use the Spotify API to download previews of tracks from different genres.
- Store audio files in folders organized by genre (`data/previews/<genre>/`).
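As a rough sketch of this step, the helpers below build a destination path under `data/previews/<genre>/` and download one preview from a URL. The function names, the `.mp3` extension, and the slug rule are assumptions made for illustration, not the notebook's actual implementation. In a Spotify track object the URL would come from the `preview_url` field, which can be `None` for some tracks.

```python
import re
import urllib.request
from pathlib import Path
from typing import Optional


def preview_path(root: str, genre: str, track_id: str) -> Path:
    """Return <root>/<genre-slug>/<track_id>.mp3 (hypothetical naming scheme)."""
    # Lowercase the genre and collapse anything non-alphanumeric into "-"
    slug = re.sub(r"[^a-z0-9]+", "-", genre.lower()).strip("-")
    return Path(root) / slug / f"{track_id}.mp3"


def download_preview(url: Optional[str], dest: Path, timeout: float = 10.0) -> bool:
    """Download one preview to dest; return False when no URL is available."""
    if not url:
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        dest.write_bytes(response.read())
    return True
```

Skipping tracks without a URL, rather than raising, keeps a long collection run going when only some tracks lack a preview.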

### 2. Feature Extraction
For each collected audio file, extract:

- **MFCCs**: mean and variance to capture timbre.
- **BPM**: beats per minute (song tempo).
- **Rhythmic Cell**: onset histogram divided into bins.
- **Melodic Cell**: dominant notes per segment.
- **Key**: estimated musical key.
- **Duration**: audio duration in seconds.

Features will be stored together with metadata in JSON format.
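One possible reading of the rhythmic cell is sketched below: onset times are folded onto a single bar at the estimated tempo and counted into equal-width bins. The function name, the 4-beat bar, and the 16-bin resolution are assumptions for illustration; the notebook's actual binning may differ.

```python
from typing import List, Sequence


def rhythmic_cell(onset_times: Sequence[float], bpm: float,
                  beats_per_bar: int = 4, n_bins: int = 16) -> List[float]:
    """Normalized histogram of onset positions within one bar (assumed layout)."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    bar_seconds = beats_per_bar * 60.0 / bpm
    counts = [0] * n_bins
    for t in onset_times:
        # Position of the onset inside its bar, as a fraction in [0, 1)
        phase = (t % bar_seconds) / bar_seconds
        idx = min(int(phase * n_bins), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    # Normalize so clips of different lengths are comparable
    return [c / total for c in counts] if total else [0.0] * n_bins
```

In practice the onset times could come from `librosa.onset.onset_detect(y=y, sr=sr, units="time")` and the tempo from `librosa.beat.beat_track(y=y, sr=sr)`.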

### 3. Dataset Construction
- Consolidate JSON files into a single folder (`data/metadata/`).
- Ensure each record contains:
  - Track identifier (Spotify ID)
  - Track name and artists
  - Audio file path
  - Extracted features
  - Genre(s)
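A minimal sketch of one such record, written as one JSON file per track, is shown below. The key names (`track_id`, `audio_path`, and so on) and the `<track_id>.json` file name are hypothetical; the project's own `dataset_builder` module may use a different schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def build_record(track_id: str, name: str, artists: List[str],
                 audio_path: str, features: Dict[str, Any],
                 genres: List[str]) -> Dict[str, Any]:
    """Assemble one dataset record (hypothetical key names)."""
    return {
        "track_id": track_id,      # Spotify ID
        "name": name,
        "artists": artists,
        "audio_path": audio_path,
        "features": features,      # e.g. {"bpm": 120.0, "key": "C major"}
        "genres": genres,
    }


def write_record(record: Dict[str, Any], meta_root: str) -> Path:
    """Write the record to <meta_root>/<track_id>.json and return the path."""
    out_dir = Path(meta_root)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{record['track_id']}.json"
    out_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

Passing `ensure_ascii=False` keeps accented track and artist names readable in the stored files.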

### 4. Initial Exploratory Analysis (EDA)
- Count of tracks per genre.
- Distribution of duration, BPM, and key.
- Basic statistics of MFCCs and rhythmic patterns.
- Check for consistency and possible missing data.

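The first and last checks above can be sketched with the standard library alone. The example assumes each record is a dictionary with the hypothetical keys listed in `REQUIRED_FIELDS`; adjust them to the real schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical key names; replace with the fields used in the real records
REQUIRED_FIELDS = ("track_id", "name", "artists", "audio_path", "features", "genres")


def summarize(
    records: Iterable[Dict[str, Any]],
) -> Tuple[Counter, List[Tuple[str, List[str]]]]:
    """Count tracks per genre and list records with missing or empty fields."""
    per_genre: Counter = Counter()
    incomplete: List[Tuple[str, List[str]]] = []
    for rec in records:
        # A track tagged with several genres counts once for each of them
        per_genre.update(rec.get("genres") or [])
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "", [], {})]
        if missing:
            incomplete.append((rec.get("track_id", "<unknown>"), missing))
    return per_genre, incomplete
```

For the distribution plots, the same list of records could be flattened with `pd.json_normalize(records)` and inspected with `DataFrame.describe()`.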

# Development


## Initial Setup: Imports and Preliminaries

In [4]:
# Basic imports and directory setup
import os
import sys
from pathlib import Path
import glob
import json

# Data analysis libraries
import pandas as pd
import numpy as np

# Audio libraries
import librosa
import soundfile as sf

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# General settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

# Add the src folder to the Python path
src_path = Path("../src").resolve()
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

# Main directories
PREVIEWS_ROOT = "../data/previews"
META_ROOT     = "../data/metadata"

# Create the directories if they do not exist
Path(PREVIEWS_ROOT).mkdir(parents=True, exist_ok=True)
Path(META_ROOT).mkdir(parents=True, exist_ok=True)

# Environment variables (Spotify)

# Load the environment variables from .env
from dotenv import load_dotenv

# Adjust the .env path to match the notebook's location
dotenv_path = Path(os.getcwd()) / "../../.env"
print("Looking for .env at:", dotenv_path)
print(".env exists?", dotenv_path.exists())

# Load the .env file
load_dotenv(dotenv_path.resolve())

# Check that the variables were loaded, without printing the secret values
print("SPOTIPY_CLIENT_ID set?", bool(os.getenv("SPOTIPY_CLIENT_ID")))
print("SPOTIPY_CLIENT_SECRET set?", bool(os.getenv("SPOTIPY_CLIENT_SECRET")))

# Import the modules that depend on Spotify

# Spotify client setup
from dataset_builder import process_track_and_write_metadata, sp

print("Initial configuration loaded successfully!")


Looking for .env at: c:\Users\Samue\VSCode Projects\Wrythm\genre-classification\notebooks\..\..\.env
.env exists? True
SPOTIPY_CLIENT_ID set? True
SPOTIPY_CLIENT_SECRET set? True
Initial configuration loaded successfully!
