# 🎵 Preprocessing for Accordo.Ai Dataset  

## 📌 Overview  
This notebook processes **chroma feature files** and **chord annotations** to create a structured dataset for chord recognition.  

## 🔄 Preprocessing Steps  

### 1️⃣ Load Chroma Features  
- Reads `bothchroma.csv`.  
- Extracts **timestamps** (column 2), **bass chroma (columns 3-14)**, and **normal chroma (columns 15-26)**, counting columns from 1; column 1 is skipped.  
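
The loading step can be sketched as follows. This is a minimal illustration and not the notebook's exact cell: it assumes `bothchroma.csv` has no header row and that its first column is a file-name column that can be dropped. The helper names `load_bothchroma` and `make_row` and the two-row sample are made up for the example.

```python
import io
import pandas as pd

# Bin order taken from the notebook's `bin_names` list.
bin_names = ["A", "Bb", "B", "C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab"]

def load_bothchroma(source):
    """Return timestamp + 12 bass bins (suffix _B) + 12 normal chroma bins."""
    # Assumption: no header row; column 0 (file name) is skipped.
    df = pd.read_csv(source, header=None, usecols=range(1, 26))
    df.columns = (
        ["timestamp"]
        + [f"{name}_B" for name in bin_names]
        + list(bin_names)
    )
    return df

def make_row(name, t):
    # Hypothetical row: name column, timestamp, then 24 chroma values.
    return ",".join([name, str(t)] + ["0.5"] * 24)

sample_csv = "\n".join([make_row('"song.wav"', 0.0), make_row("", 0.04644)])
chroma_df = load_bothchroma(io.StringIO(sample_csv))
```

If the real file does carry a header line, `header=None` should be dropped so that the first line is not read as data.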

### 2️⃣ Load Chord Annotations  
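- Reads four tab-separated annotation files per sample, each with `Start`, `End`, and a label column:  
  - `majmin.lab` → Major/minor chord labels (`Chord`).  
  - `majmininv.lab` → Major/minor chords with inversions (`Inversion`).  
  - `majmin7.lab` → Chord labels including sevenths (`Chord7`).  
  - `majmin7inv.lab` → Seventh chords with inversions (`Inversion7`).  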
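
A minimal sketch of reading one annotation file, assuming the tab-separated three-column layout described above. The helper `load_lab` and the sample content are illustrative and do not come from the dataset.

```python
import io
import pandas as pd

def load_lab(source, label_column):
    """Read a tab-separated .lab file: start time, end time, label."""
    return pd.read_csv(
        source, sep="\t", header=None, names=["Start", "End", label_column]
    )

# Made-up annotation content for illustration only.
sample_lab = "0.0\t1.5\tN\n1.5\t3.0\tC:maj\n3.0\t4.5\tG:7\n"
lab_df = load_lab(io.StringIO(sample_lab), "Chord")
```

The same helper can be called once per file with a different `label_column` (`Chord`, `Inversion`, `Chord7`, `Inversion7`), so that the four tables can later be merged on `Start` and `End`.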

### 3️⃣ Merge Data  
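- Aligns **chroma features** with **chord labels** using timestamps: a chroma row belongs to the annotation interval where `Start <= timestamp < End`.  
- Assigns **chord and inversion labels** to each chroma row; rows that fall outside every interval are labelled `N`.  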
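
The interval lookup can also be done in one vectorized pass. The sketch below uses `numpy.searchsorted` instead of the notebook's per-row `apply`; it assumes the annotation intervals are non-empty, sorted by `Start`, and non-overlapping. The function name `label_frames` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def label_frames(timestamps, lab_df, label_column="Chord", default="N"):
    """Label each timestamp with the interval where Start <= t < End."""
    starts = lab_df["Start"].to_numpy()
    ends = lab_df["End"].to_numpy()
    labels = lab_df[label_column].to_numpy()
    t = np.asarray(timestamps, dtype=float)
    # Index of the last interval whose Start is <= t (-1 if there is none).
    idx = np.searchsorted(starts, t, side="right") - 1
    safe_idx = np.clip(idx, 0, len(starts) - 1)
    # A frame is labelled only if it also lies before that interval's End.
    inside = (idx >= 0) & (t < ends[safe_idx])
    return np.where(inside, labels[safe_idx], default)

lab_df = pd.DataFrame(
    {"Start": [0.0, 1.0], "End": [1.0, 2.0], "Chord": ["A:min", "C:maj"]}
)
frame_labels = label_frames([0.5, 1.0, 1.99, 2.5], lab_df)
```

Here the frame at `1.0` takes the second interval's label because the interval is closed on the left, and the frame at `2.5` falls past the last `End`, so it receives the default `N`.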

### 4️⃣ Standardize Chord Notation  
- Converts **enharmonic equivalents** to match chromagram bin names:  
  | Original | Standardized |
  |----------|-------------|
  | `Db`     | `C#`        |
  | `D#`     | `Eb`        |
  | `Gb`     | `F#`        |
  | `G#`     | `Ab`        |
  | `A#`     | `Bb`        |
  | `Cb`     | `B`         |
  | `E#`     | `F`         |
  | `B#`     | `C`         |
  | `Fb`     | `E`         |
  | `X`      | `N`         |
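
The notebook applies these replacements with `str.replace` across the whole label. As a hedged alternative, the sketch below renames only the root note, which avoids touching anything after it. It assumes labels look like `Root:quality` (for example `Db:maj7`); `standardize_root` is a hypothetical name, and the mapping mirrors the notebook's `chord_naming_map`.

```python
import re

# Enharmonic mapping mirroring the notebook's `chord_naming_map` (without "X").
ENHARMONIC = {
    "Db": "C#", "D#": "Eb", "Gb": "F#", "G#": "Ab", "A#": "Bb",
    "Cb": "B", "E#": "F", "B#": "C", "Fb": "E",
}

# A root is a letter A-G with an optional sharp or flat.
ROOT = re.compile(r"^([A-G][#b]?)(.*)$")

def standardize_root(chord):
    """Rename only the root of a label such as 'Db:maj7'."""
    if chord == "X":  # the notebook maps the label X to N
        return "N"
    match = ROOT.match(chord)
    if match is None:  # e.g. "N", which has no root note
        return chord
    root, rest = match.groups()
    return ENHARMONIC.get(root, root) + rest
```

For example, `standardize_root("Db:maj7")` gives `"C#:maj7"`, while `"C:maj"` and `"N"` pass through unchanged.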

### 5️⃣ Chord Distribution Analysis  
- Computes **percentage of each chord** in the dataset.  
- Helps check whether the dataset is **balanced** enough for training.  
- Example output:  
  ```plaintext
  Cmaj7: 15.2%  
  Gmaj: 12.8%  
  Dm7: 9.5%  
  N (No Chord): 5.3%  
  ...
  ```

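A compact way to compute such a distribution is `Series.value_counts`. This is a sketch under the assumption that all labels fit in memory as one sequence; the notebook itself accumulates a `Counter` file by file. The function name `chord_distribution` and the sample labels are illustrative.

```python
import pandas as pd

def chord_distribution(labels):
    """Return per-chord counts and percentages, most frequent first."""
    counts = pd.Series(list(labels)).value_counts()
    return pd.DataFrame({
        "Chord": counts.index,
        "Chord_Count": counts.to_numpy(),
        "Percentage": (counts / counts.sum() * 100).to_numpy(),
    })

dist = chord_distribution(["C:maj", "C:maj", "G:maj", "N"])
```

With these four made-up labels, `C:maj` comes first with a count of 2 and a share of 50%.
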
In [8]:
import os
import pandas as pd
import numpy as np
from collections import Counter
from IPython.display import clear_output

In [9]:
print(os.getcwd())

/tf/Model


In [5]:
base_folder = './'
processed_dataset = './DatasetPro'

# Define bin names for chromagram
bin_names = ["A", "Bb", "B", "C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab"]

# Map chords to match your bin naming convention
chord_naming_map = {"Db": "C#", "D#": "Eb", "Gb": "F#", "G#": "Ab", "A#": "Bb", "X": "N", "Cb" : "B", "E#": "F", "B#": "C", "Fb": "E"}

# Define the correct path to the directory
metadata_folder = './Dataset/metadata/metadata/'
annotations_folder = './Dataset/annotations/annotations/'

In [11]:

# List all the available numbered sample folders (e.g., 0001, 0002, etc.)
sample_folders = os.listdir(metadata_folder)

# Filter out anything that isn't a folder (optional, for safety)
sample_folders = [folder for folder in sample_folders if os.path.isdir(os.path.join(metadata_folder, folder))]

# Sort the folder names to maintain the order
sample_folders.sort()

# Create a new DataFrame to hold the sample folder index
index_data = {'index': [], 'sample_id': []}

# Loop through all sample folders and create the new index
for idx, sample_id in enumerate(sample_folders, 1):
    # Add the index and sample_id to the list
    index_data['index'].append(idx)
    index_data['sample_id'].append(sample_id)

# Create a DataFrame from the index data
index_df = pd.DataFrame(index_data)

# Print the first few rows of the new index DataFrame for verification
print("New Index DataFrame:")
print(index_df.head())

# Optionally, save the new index DataFrame as a CSV for later reference
index_df.to_csv('sample_index.csv', index=False)

New Index DataFrame:
   index sample_id
0      1      0003
1      2      0004
2      3      0006
3      4      0010
4      5      0012


In [12]:

index_file = os.path.join(base_folder, 'sample_index.csv')  # The index file containing folder names
sample_count = 0
# Load the index file to get sample_ids
index_df = pd.read_csv(index_file)

# Assuming the index.csv contains a column 'sample_id' with the folder names
sample_ids = index_df['sample_id'].tolist()  # Change 'sample_id' to the actual column name if needed

# Loop through all samples listed in the index file
for sample_id in sample_ids:

    print(f"Processing sample: {sample_id}")
    sample_count += 1
    print(f"Sample count: {sample_count}")

    sample_folder = os.path.join(metadata_folder, str(sample_id).zfill(4))
    file_path = os.path.join(sample_folder, 'bothchroma.csv')
    print(file_path)

    # Locate the annotation (.lab) files for the current sample
    lab_folder = os.path.join(annotations_folder, str(sample_id).zfill(4))

    lab_file = os.path.join(lab_folder, 'majmin.lab')
    labinv_file = os.path.join(lab_folder, 'majmininv.lab')
    lab7_file = os.path.join(lab_folder, 'majmin7.lab')
    lab7inv_file = os.path.join(lab_folder, 'majmin7inv.lab')


    def standardize_chord(chord):
        """Convert enharmonic equivalents to match the bin naming convention."""
        for alt, standard in chord_naming_map.items():
            chord = chord.replace(alt, standard)  # Replace with the preferred notation
        return chord

    # Read the four tab-separated annotation files (no header row: start, end, label)
    lab_df = pd.read_csv(lab_file, sep="\t", header=None, names = ["Start", "End", "Chord"])
    labinv_df = pd.read_csv(labinv_file, sep="\t", header=None, names=["Start", "End", "Inversion"])
    lab7_df = pd.read_csv(lab7_file, sep="\t", header=None, names = ["Start", "End", "Chord7"])
    lab7inv_df = pd.read_csv(lab7inv_file, sep="\t", header=None, names=["Start", "End", "Inversion7"])

    lab_df["Chord"] = lab_df["Chord"].apply(standardize_chord)
    labinv_df["Inversion"] = labinv_df["Inversion"].apply(standardize_chord)
    lab7_df["Chord7"] = lab7_df["Chord7"].apply(standardize_chord)
    lab7inv_df["Inversion7"] = lab7inv_df["Inversion7"].apply(standardize_chord)
    lab_final_df = lab_df[["Start", "End", "Chord"]].copy()


    # Identify rows in majmin7.lab that contain dominant 7th chords
    dominant7_rows = lab7_df[lab7_df["Chord7"].str.contains(":7", regex=True, na=False)]
    # Merge to update chords with dominant 7ths where applicable
    lab_final_df = pd.merge(lab_final_df, dominant7_rows, on=["Start", "End"], how="left")
    # Replace Chord with Chord7 where available
    lab_final_df["Final_Chord"] = lab_final_df["Chord7"].combine_first(lab_final_df["Chord"])
    # Keep only necessary columns
    lab_final_df = lab_final_df[["Start", "End", "Final_Chord"]]


    # Merge all dataframes on Start and End columns
    merged_lab_df = pd.merge(lab_df, labinv_df, on=["Start", "End"], how="left")
    merged_lab7_df = pd.merge(lab7_df, lab7inv_df, on=["Start", "End"], how="left")

    final_lab_merged_df = pd.merge(merged_lab_df, merged_lab7_df, on=["Start", "End"], how="left")
    final_merged_df = pd.merge(final_lab_merged_df, lab_final_df, on=["Start", "End"], how="left")
    
    # Display the extracted table
    print(final_merged_df.head())

    try:
        # Read the chroma file, skipping column 0 and keeping the timestamp plus 24 chroma columns.
        # Note: without header=None, pandas treats the first line of the file as a header row.
        bothchroma_df = pd.read_csv(file_path, usecols=range(1, 26))

        # Rename columns for clarity
        bothchroma_df.columns = (
            ['timestamp'] +
            [f'{bin_names[i]}_B' for i in range(len(bin_names))] +
            [f'{bin_names[i]}' for i in range(len(bin_names))]
        )

        # Display the extracted table for verification
        print(f"Processed file: {sample_id}")
        print(bothchroma_df.head())


        # Save the structured table to a new file
        processed_sample_folder = os.path.join(processed_dataset, str(sample_id).zfill(4))
        os.makedirs(processed_sample_folder, exist_ok=True)

        output_file = os.path.join(processed_sample_folder, 'structured_lab.csv')
        final_merged_df.to_csv(output_file, index=False)
        print(f"Structured table saved to: {output_file}")

        output_file = os.path.join(processed_sample_folder, 'structured_bothchroma.csv')
        #os.remove(os.path.join(sample_folder, 'structured_bothchroma')) #correcting a mistake
        bothchroma_df.to_csv(output_file, index=False)
        print(f"Structured table saved to: {output_file}\n\n")



        # Load both CSV files
        chroma_file = os.path.join(processed_sample_folder, 'structured_bothchroma.csv')
        struct_lab_file = os.path.join(processed_sample_folder, 'structured_lab.csv')

        chroma_df = pd.read_csv(chroma_file)
        struct_lab_df = pd.read_csv(struct_lab_file)

        # Convert timestamps to numeric for proper merging
        chroma_df["timestamp"] = pd.to_numeric(chroma_df["timestamp"])
        struct_lab_df["Start"] = pd.to_numeric(struct_lab_df["Start"])
        struct_lab_df["End"] = pd.to_numeric(struct_lab_df["End"])

        # Assign chord labels to chroma timestamps
        def get_chord_label(timestamp):
            match = struct_lab_df[(struct_lab_df["Start"] <= timestamp) & (struct_lab_df["End"] > timestamp)]
            if not match.empty:
                row = match.iloc[0]  # Use the first matching annotation interval
                return row["Chord"], row["Inversion"], row["Chord7"], row["Inversion7"], row["Final_Chord"]
            return ("N",) * 5  # Default all five label columns to "N" if no interval matches

        chroma_df[["Chord", "Inversion", "Chord7", "Inversion7", "Final_Chord"]] = chroma_df["timestamp"].apply(lambda t: pd.Series(get_chord_label(t)))

        # Save the merged dataset
        merged_file = os.path.join(processed_sample_folder, 'merged_chroma_lab.csv')
        chroma_df.to_csv(merged_file, index=False)
        print(f"Merged dataset saved to: {merged_file}")
        clear_output(wait=True)

    except Exception as e:
        print(f"Error processing bothchroma of {sample_id}: {e}")
        break
        


Processing sample: 1300
Sample count: 890
./Dataset/metadata/metadata/1300/bothchroma.csv
       Start        End  Chord Inversion  Chord7 Inversion7 Final_Chord
0   0.000000   0.487619      N         N       N          N           N
1   0.487619  11.730295      N         N       N          N           N
2  11.730295  18.803926      N         N       N          N           N
3  18.803926  19.039714      N         N       N          N           N
4  19.039714  20.926016  C:min     C:min  C:min7     C:min7       C:min
Processed file: 1300
   timestamp  A_B  Bb_B  B_B  C_B  C#_B  D_B  Eb_B  E_B  F_B  ...    B    C  \
0    0.04644  0.0   0.0  0.0  0.0   0.0  0.0   0.0  0.0  0.0  ...  0.0  0.0   
1    0.09288  0.0   0.0  0.0  0.0   0.0  0.0   0.0  0.0  0.0  ...  0.0  0.0   
2    0.13932  0.0   0.0  0.0  0.0   0.0  0.0   0.0  0.0  0.0  ...  0.0  0.0   
3    0.18576  0.0   0.0  0.0  0.0   0.0  0.0   0.0  0.0  0.0  ...  0.0  0.0   
4    0.23220  0.0   0.0  0.0  0.0   0.0  0.0   0.0  0.0  0.0  

In [16]:
import os
import pandas as pd
from collections import Counter

# Dictionary to store chord counts
chord_counts = Counter()
total_chords = 0

# Loop through each sample folder
for sample_id in os.listdir(processed_dataset):
    print(f"Processing sample: {sample_id}")
    
    processed_sample_folder = os.path.join(processed_dataset, str(sample_id).zfill(4))
    merged_file = os.path.join(processed_sample_folder, 'merged_chroma_lab.csv')
    
    if os.path.exists(merged_file):
        df = pd.read_csv(merged_file)
        
        # Count chords in this file
        chord_counts.update(df['Final_Chord'])
        total_chords += len(df['Final_Chord'])

# Calculate percentages
chord_percentages = {chord: (count / total_chords) * 100 for chord, count in chord_counts.items()}

# Convert to DataFrame for better readability
chord_distribution_df = pd.DataFrame(list(chord_percentages.items()), columns=['Chord', 'Percentage'])

# Add chord counts to the DataFrame
chord_distribution_df['Chord_Count'] = chord_distribution_df['Chord'].map(chord_counts)

# Sort by percentage (descending)
chord_distribution_df = chord_distribution_df.sort_values(by='Percentage', ascending=False)

# Save distribution to CSV
output_file = os.path.join(processed_dataset, 'chord_distribution.csv')
chord_distribution_df.to_csv(output_file, index=False)

print("Chord distribution calculated and saved to:", output_file)
print(chord_distribution_df)


Processing sample: 0003
Processing sample: 0004
Processing sample: 0006
Processing sample: 0010
Processing sample: 0012
Processing sample: 0015
Processing sample: 0016
Processing sample: 0018
Processing sample: 0019
Processing sample: 0021
Processing sample: 0022
Processing sample: 0023
Processing sample: 0025
Processing sample: 0026
Processing sample: 0027
Processing sample: 0029
Processing sample: 0030
Processing sample: 0033
Processing sample: 0034
Processing sample: 0035
Processing sample: 0037
Processing sample: 0039
Processing sample: 0040
Processing sample: 0041
Processing sample: 0043
Processing sample: 0044
Processing sample: 0046
Processing sample: 0049
Processing sample: 0050
Processing sample: 0051
Processing sample: 0053
Processing sample: 0054
Processing sample: 0055
Processing sample: 0056
Processing sample: 0059
Processing sample: 0061
Processing sample: 0062
Processing sample: 0064
Processing sample: 0066
Processing sample: 0067
Processing sample: 0068
Processing sampl