# Chorus chord characterisation Project

## Research
### Questions
In this research project we analyse a corpus of popular songs to identify differences in chord usage between verses and choruses. We will try to answer the following research questions:
<ol>
<li>Does the chord distribution of the choruses differ from the one taken from the verses?</li>
<li>Does the distribution of chord sequences in the choruses differ from the one in the verses?</li>
<li>How do these distributions evolve over time?</li>
</ol>

The first question asks whether choruses and verses differ in their chord statistics: are we more likely to find a specific chord in the choruses? Do choruses use fewer distinct chords than the other parts of the song? <br>
The second question looks at how chords follow one another: can we find important patterns? Does the Markov model derived from the choruses differ from its verse counterpart? <br>
The third question focuses on the time dimension: we want to know whether the answers to the two previous questions depend on the period. Were choruses very close to verses in 1968 but no longer in 1985? How does the use of each chord evolve in relation to the others? Do some sequences appear while others disappear? These are the underlying questions we want to address through our main research questions.
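
As an illustration of the second question, the sketch below estimates first-order Markov transition probabilities from chord sequences, once for choruses and once for verses. The function name `transition_probabilities` and the toy sequences are our own assumptions for illustration; they are not taken from the dataset.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next chord | current chord) from a list of chord sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every pair of consecutive chords inside one sequence.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        chord: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for chord, followers in counts.items()
    }

# Toy, made-up sequences (hypothetical, not from the corpus).
chorus_sequences = [["C:maj", "G:maj", "A:min", "F:maj"], ["C:maj", "G:maj", "C:maj"]]
verse_sequences = [["C:maj", "F:maj", "C:maj", "G:maj"]]

chorus_model = transition_probabilities(chorus_sequences)
verse_model = transition_probabilities(verse_sequences)
print(chorus_model["C:maj"])
print(verse_model["C:maj"])
```

Comparing the two dictionaries row by row is one possible way to see whether a given chord tends to lead somewhere different in a chorus than in a verse.
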



### Dataset presentation
These questions come from our original idea of fully characterizing choruses over the years. We now focus specifically on chords to differentiate choruses from verses, while keeping the temporal dimension. To answer our questions, we selected a dataset of about 900 pop-rock songs that reached the top charts from the 60s to the 90s. The songs are stored as simple text files with the following data: date, song name, artist, part of the song (chorus, verse), and the timestamp of each musical phrase along with its chords.
### Procedure
We will use the chorus/verse annotation at the beginning of each phrase to assign the chords to each group; it will then be straightforward to compute the chord distributions and the Markov models. The date of each song will allow us to further subdivide the chord distributions and see whether they evolve over time. The outcomes we expect are:
<ul>
<li>Empty results: there is no significant difference between the chord distributions of choruses and verses, and no evolution over time. This could mean that the dataset is biased toward a specific pop-rock genre that uses the same chords throughout, or that there is indeed no difference between the chords used in choruses and those used in verses, which would itself be an answer to our questions.</li>
<li>Narrow chorus chord distribution: since a chorus has to be immediately recognizable, composers may rely more heavily on a subgroup of chords, which would produce this outcome. The same reasoning applies to sequences: some specific ones may be more dominant in the choruses.</li>
<li>Time: it will be interesting to see whether this effect is present throughout the years. It could be flat, linear, or oscillating. The flat case was mentioned above; a linear narrowing would imply that choruses are becoming more and more singular, and a linear widening the opposite. An oscillating pattern could be interesting, as we might see musical innovations get adopted and spread.</li>
</ul>
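
One possible way to quantify how "narrow" a chord distribution is would be its Shannon entropy: the fewer chords dominate, the lower the value. The sketch below assumes this measure, which is our own choice and not yet a fixed part of the procedure; the chord lists are made up.

```python
import math
from collections import Counter

def shannon_entropy(chords):
    """Shannon entropy (in bits) of the empirical chord distribution."""
    counts = Counter(chords)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy, made-up chord lists (hypothetical, not from the corpus).
chorus_chords = ["C:maj", "C:maj", "G:maj", "G:maj"]
verse_chords = ["C:maj", "G:maj", "A:min", "F:maj"]

chorus_entropy = shannon_entropy(chorus_chords)
verse_entropy = shannon_entropy(verse_chords)
print(chorus_entropy, verse_entropy)
```

Under this measure, a chorus entropy consistently below the verse entropy would point toward the "narrow chorus" outcome.
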
Our confidence might be limited, since the dataset is not exhaustive and may not yield statistically meaningful results. Nevertheless, if the p-value of the test on the chord distribution shift is low enough and the shift itself is large, we will be able to conclude that a shift very probably occurred. A plot of the use of each chord over time will allow us to visualize the evolution of the chords; we may see that certain chords were predominant at certain times. Chord sequences will also need our attention: we might observe that certain sequences were highly popular at one point in history, or that new ones appeared. For that, the percentage distribution of chord sequences over time will allow us to determine what the trends were, if there are any at all.
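
To make the test on the distribution shift more concrete, here is a minimal sketch of a Pearson chi-squared statistic computed on the table of chord counts (choruses against verses). The choice of this particular test is an assumption on our side, and the counts below are invented for illustration.

```python
from collections import Counter

def chi_squared_statistic(chorus_chords, verse_chords):
    """Pearson chi-squared statistic for a 2 x K table of chord counts.

    Assumes both lists are non-empty, so every expected count is positive.
    """
    chorus_counts = Counter(chorus_chords)
    verse_counts = Counter(verse_chords)
    vocabulary = sorted(set(chorus_counts) | set(verse_counts))
    n_chorus = sum(chorus_counts.values())
    n_verse = sum(verse_counts.values())
    total = n_chorus + n_verse

    statistic = 0.0
    for chord in vocabulary:
        column_total = chorus_counts[chord] + verse_counts[chord]
        for observed, row_total in ((chorus_counts[chord], n_chorus),
                                    (verse_counts[chord], n_verse)):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(vocabulary) - 1  # (2 - 1) * (K - 1)
    return statistic, degrees_of_freedom

# Invented counts (hypothetical, not from the corpus).
chorus = ["C:maj"] * 30 + ["G:maj"] * 10
verse = ["C:maj"] * 20 + ["G:maj"] * 20
statistic, dof = chi_squared_statistic(chorus, verse)
print(statistic, dof)
```

The p-value would then be read from the chi-squared distribution with `dof` degrees of freedom, for instance with `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.
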

## Data description
The dataset was created by Burgoyne et al. [1] and corresponds to a random sample of 890 Billboard chart slots, presented at ISMIR 2011 and MIREX 2012. Because of the sampling algorithm, some songs appear more than once, so the sample contains only 740 distinct songs. According to the authors, training algorithms that assume independent, identically distributed data should retain the duplicates.<br> This dataset is publicly available at https://ddmal.music.mcgill.ca/research/The_McGill_Billboard_Project_(Chord_Analysis_Dataset)/ and can be downloaded in various formats. The authors provide several kinds of features; in this project we will use the metadata and the chord annotations.
The first file we use is the index to the dataset (CSV format), which contains the following fields:
<ul>
<li><b>id</b>, the index for the sample entry;</li>
<li><b>chart_date</b>, the date of the chart for the entry;</li>
<li><b>target_rank</b>, the desired rank on that chart;</li>
<li><b>actual_rank</b>, the rank of the song actually annotated, which may be up to 2 ranks higher or lower than the target rank [1, 2];</li>
<li><b>title</b>, the title of the song annotated;</li>
<li><b>artist</b>, the name of the artist performing the song annotated;</li>
<li><b>peak_rank</b>, the highest rank the song annotated ever achieved on the Billboard Hot 100; and</li>
<li><b>weeks_on_chart</b>, the number of weeks the song annotated spent on the Billboard Hot 100 chart in total.</li>
</ul>
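
The `chart_date` field is what would let us follow chord usage over time. The sketch below bins (date, chord) pairs by decade and computes the share of each chord per decade. It assumes that `chart_date` is an ISO-formatted string (`YYYY-MM-DD`); the helper names and the records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

def decade_of(chart_date):
    """Return the decade (e.g. 1980) of an ISO-formatted date string."""
    return date.fromisoformat(chart_date).year // 10 * 10

def chord_share_by_decade(records):
    """records: iterable of (chart_date, chord) pairs."""
    counts = defaultdict(Counter)
    for chart_date, chord in records:
        counts[decade_of(chart_date)][chord] += 1
    return {
        decade: {chord: n / sum(c.values()) for chord, n in c.items()}
        for decade, c in counts.items()
    }

# Made-up records (hypothetical, not from the corpus).
records = [
    ("1968-05-04", "C:maj"), ("1968-05-04", "G:maj"),
    ("1985-11-16", "C:maj"), ("1985-11-16", "C:maj"),
    ("1985-11-16", "C:maj"), ("1985-11-16", "A:min"),
]
shares = chord_share_by_decade(records)
print(shares)
```

Plotting these shares decade by decade, separately for choruses and verses, is one way to obtain the time evolution discussed in the procedure.
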

The main dataset comprises chords, structure, instrumentation, and timing, given as plain text (txt) files. The annotation for each song begins with a header containing the title of the song, the name of the artist, the metre and the tonic pitch class of the opening key. In the main body, each line consists of a single phrase and begins with its timestamp, followed by the chords.
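
The layout described above can be sketched with two small helpers: one for a header line and one for a phrase line. The sample lines are invented and only assume the layout used by our parser (header lines of the form `# key: value`, measures delimited by `|`); the helper names are hypothetical.

```python
def parse_attribute(line):
    """Parse a header line such as '# title: Some Song' into (key, value)."""
    key, value = line.lstrip("#").split(":", 1)
    return key.strip(), value.strip()

def split_phrase(line):
    """Split one phrase line into (timestamp, labels, measures)."""
    head, *bars = line.rstrip("\n").split("|")
    tokens = head.split()
    timestamp = float(tokens[0])
    labels = [token.strip(",") for token in tokens[1:]]
    # The text after the last "|" is a suffix (instrument, repetition, ...);
    # it is ignored in this sketch.
    measures = [bar.split() for bar in bars[:-1]]
    return timestamp, labels, measures

# Invented sample lines (hypothetical, not copied from the dataset).
header = "# title: Example Song\n"
phrase = "7.35\tA, intro, | A:min | C:maj G:maj | C:maj |, (guitar)\n"

print(parse_attribute(header))
print(split_phrase(phrase))
```
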

<b>What is the maximum available amount in theory (in the case of incomplete data acquisition)?
How much can you actually hope to lay hands on for milestone 3 on April 20th?</b>
 
To be cited:
[1]: John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga, ‘An Expert Ground Truth Set for Audio Chord Recognition and Music Analysis’, in Proceedings of the 12th International Society for Music Information Retrieval Conference, ed. Anssi Klapuri and Colby Leider (Miami, FL, 2011), pp. 633–38

In [1]:
import pandas as pd
import numpy as np
import os
import re

In [2]:
metadata_df = pd.read_csv("data/billboard-2.0-index.csv")

In [3]:
SONG_ID, LINE_NUMBER, MEASURE_NUMBER, CHORD_NUMBER, SEQUENCE_NUMBER, \
CHORD, INSTRUMENT, TYPE, TIME, STRUCTURE, DURATION, REPETITION, ELID = \
"song_id","line_id", "measure_id", "chord_id", "sequence_id",\
"chord", "instrument", "section_type", "time", "section_structure", "duration", "repetition", "elided"

#This depends on the "metre" attribute in the txt files.
METRE = "metre"

#Create a new dictionary from two others, leaving both inputs untouched
def immutable_merge(dic1, dic2):
    result = dic1.copy()
    result.update(dic2)
    return result

#Create a row of the future DataFrame as a dictionary
def create_row(persistent_attributes, line_attributes,
               measure_number = None, chord_number = None, chord = None, duration = None):
    result = immutable_merge(persistent_attributes, line_attributes)
    
    #Chord-level fields are only added when at least one of them is provided
    if not (measure_number is None and chord_number is None and chord is None and duration is None):
        result[MEASURE_NUMBER] = measure_number
        result[CHORD_NUMBER] = chord_number
        result[CHORD] = chord
        result[DURATION] = duration
    
    return result

#Generate the attributes of a given line and update the sequence counter
def process_line_metadata(header, line_counter, old_line_attributes, sequence_counter, suffix = ""):
    
    result = {}
    
    #Suffix (main instrument, elision, repetition)
    old_instrument = str(old_line_attributes.get(INSTRUMENT))
    
    for item in suffix.split(", "):
        
        item = item.strip()
        
        #Repetition (e.g. "x4" or "x12"): keep every digit after the "x"
        if re.match(r"^x\d+$", item):
            result[REPETITION] = int(item[1:])
        
        #Elision
        elif item == "->":
            result[ELID] = True

        #Instrument
        else:
            ##New instrument
            if len(item) > 0 and item != "\n":
                result[INSTRUMENT] = item.strip("\n").strip(",").strip()

            ##Main instrument continued (experimental)
            elif not old_instrument.endswith(")") and old_instrument.lower() not in ["nan","none"] \
            and len(old_instrument)>0:
                result[INSTRUMENT] = old_instrument.strip("(")

        
    #Line number
    result[LINE_NUMBER] = line_counter

    
    #Header    
    header_items = header.split()
        
    result[TIME] = header_items[0]
    
    #Case where a section is continued
    if len(header_items) == 1:
        result[TYPE] = old_line_attributes.get(TYPE)
        result[STRUCTURE] = old_line_attributes.get(STRUCTURE)
        result[SEQUENCE_NUMBER] = old_line_attributes.get(SEQUENCE_NUMBER)
    
    #Case where a section has no structure (silence, end, fadeout)
    elif len(header_items) == 2:
        
        #Z is a structure, not a type.
        if header_items[1].strip().strip(",") == "Z":
            result[STRUCTURE] = header_items[1].strip().strip(",")
        else:
            result[TYPE] = header_items[1].strip().strip(",")
            
        result[SEQUENCE_NUMBER] = sequence_counter
        sequence_counter += 1
    
    #Case where a section begins.
    elif len(header_items) == 3:
        result[STRUCTURE] = header_items[1].strip().strip(",")
        result[TYPE] = header_items[2].strip().strip(",")
        result[SEQUENCE_NUMBER] = sequence_counter
        sequence_counter += 1
    
    return sequence_counter, result

In [4]:
def parse_song_to_dict(song_id, path):
    
    rows = []
    persistent_attributes = {}
    
    persistent_attributes[SONG_ID] = song_id
    
    with open(path,"r") as file:
        line = file.readline()
        
        line_counter = 0
        measure_counter = 0
        chord_counter = 0
        sequence_counter = 0
        line_attributes = {}
        old_chord = None
 

        while line:
        
            if line != "\n":

                #Attribute lines
                if line.startswith("#"):
                    attribute, value = line.strip("#").split(":",1)
                    persistent_attributes[attribute.strip(" ")] = value.strip(" ").strip("\n")

                else:
                    line_items = line.split("|")

                    #Special lines
                    if len(line_items) <= 1:
                        sequence_counter, line_attributes = \
                        process_line_metadata(line, line_counter, line_attributes, sequence_counter)
                        row = create_row(persistent_attributes, line_attributes)
                        rows.append(row)

                    #Standard lines    
                    else:                    
                        header = line_items[0]
                        suffix = line_items[-1]
                        measures = line_items[1:-1]

                        sequence_counter, line_attributes = \
                        process_line_metadata(header, line_counter, line_attributes, sequence_counter, suffix)  

                        for measure in measures:
                            
                            chords = measure.split()
                            
                            #Special metre given in parentheses, e.g. "(6/8)" or "(12/8)" (experimental)
                            old_metre = persistent_attributes.get(METRE)
                            if chords and re.match(r"^\(\d+/\d+\)$", chords[0]):
                                persistent_attributes[METRE] = chords[0][1:-1]
                                chords = chords[1:]
                            
                            if len(chords) == 1:
                                duration = "measure"
                            else:
                                duration = "beat"
                            
                            for chord in chords:
                                
                                if chord == ".":
                                    chord = old_chord
                                
                                row = create_row(persistent_attributes, line_attributes,
                                                 measure_counter, chord_counter, chord, duration)
                                rows.append(row)
                                old_chord = chord
                                chord_counter += 1

                            measure_counter += 1
                            persistent_attributes[METRE] = old_metre
            
            #Finally
            line_counter += 1
            line = file.readline()
    
    
    return rows

In [5]:
#The song id matches the folder name (0004)
test = pd.DataFrame(parse_song_to_dict(4,"data/McGill-Billboard/0004/salami_chords.txt"))

In [6]:
def create_whole_collection_df():
    
    path = "data/McGill-Billboard/"
    file_name = "/salami_chords.txt"
    UPPER_BOUND = 1300
    
    whole_collection = []
    
    #Song folders are named with zero-padded 4-digit ids (0000 to UPPER_BOUND)
    for i in range(UPPER_BOUND + 1):
        full_path = path + str(i).zfill(4) + file_name
        
        if os.path.exists(full_path):
            whole_collection += parse_song_to_dict(i, full_path)
        
    whole_collection_df = pd.DataFrame(whole_collection)
    
    return whole_collection_df.astype({SEQUENCE_NUMBER: 'Int64', MEASURE_NUMBER: 'Int64', CHORD_NUMBER: 'Int64', \
                                      REPETITION: 'Int64'})

In [7]:
collection_df = create_whole_collection_df()

In [8]:
collection_df.sample(10)

| | song_id | title | artist | metre | tonic | line_id | time | section_type | sequence_id | section_structure | measure_id | chord_id | chord | duration | instrument | elided | repetition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 97843 | 973 | Little Too Late | Pat Benatar | 4/4 | E | 29 | 167.012743764 | chorus | 7 | C | 95 | 165 | E:maj(11) | beat | voice | | |
| 90091 | 887 | Me Myself And I | De La Soul | 4/4 | A | 29 | 186.763582766 | instrumental | 12 | C | 92 | 249 | E:aug(b7) | beat | | | |
| 36163 | 361 | This Song | George Harrison | 4/4 | E | 8 | 22.72154195 | verse | 2 | B | 5 | 10 | B:7 | measure | (voice | | |
| 77017 | 769 | All Through The Night | Cyndi Lauper | 4/4 | Ab | 33 | 242.516757369 | fadeout | 13 | | 100 | 119 | Eb:sus4 | beat | | | |
| 3695 | 46 | I'll Take You There | The Staple Singers | 4/4 | C | 17 | 93.39755102 | solo | 7 | B | 41 | 61 | C:maj | measure | (guitar | | |
| 22522 | 240 | Just The Way You Are | Billy Joel | 4/4 | D | 7 | 7.842154195 | verse | 2 | B | 5 | 13 | B:min7 | measure | (vocal | | |
| 58450 | 594 | Situation | Yaz | 4/4 | C# | 28 | 166.072675736 | trans | 14 | A' | 83 | 158 | B:1 | beat | (synthesizer | | |
| 31593 | 329 | Smoking Gun | The Robert Cray Band | 4/4 | E | 7 | 8.360136054 | intro | 1 | A | 7 | 10 | E:min9 | beat | guitar) | | |
| 100574 | 1002 | Freeze-Frame | J. Geils Band | 4/4 | C | 8 | 14.673582766 | instrumental | 2 | B | 13 | 13 | Bb:maj | measure | (trumpet) | | |
| 74541 | 742 | The Power | Snap | 4/4 | B | 14 | 61.801496598 | chorus | 6 | C | 29 | 29 | B:min | measure | (voice | | |
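
With rows of this shape, a first look at the first research question could be sketched as follows. The function works on dictionaries like those returned by `parse_song_to_dict` (keys `section_type` and `chord`); the function name and the toy rows are hypothetical, and special lines without a chord are simply skipped.

```python
from collections import Counter, defaultdict

def chord_distribution_by_section(rows, section_types=("chorus", "verse")):
    """Relative chord frequencies for each requested section type."""
    counts = defaultdict(Counter)
    for row in rows:
        chord = row.get("chord")
        section = row.get("section_type")
        # Skip rows without a chord (silence, end, ...) and other section types.
        if chord is not None and section in section_types:
            counts[section][chord] += 1
    return {
        section: {chord: n / sum(c.values()) for chord, n in c.items()}
        for section, c in counts.items()
    }

# Toy rows (hypothetical, not from the corpus).
rows = [
    {"section_type": "chorus", "chord": "E:maj"},
    {"section_type": "chorus", "chord": "E:maj"},
    {"section_type": "chorus", "chord": "B:min"},
    {"section_type": "chorus", "chord": "B:min"},
    {"section_type": "verse", "chord": "B:7"},
    {"section_type": "intro", "chord": "E:min9"},
    {"section_type": "silence"},
]
distributions = chord_distribution_by_section(rows)
print(distributions)
```
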


In [9]:
#BUGS: Does a dot mean a repetition of the same chord? To be clarified and fixed. DONE
# Some measures have a single chord, others one chord per beat. To be taken into account! DONE
#Introduce a sequence_id that identifies a chorus, a verse, etc. DONE
#Markers indicating a repetition (x) or an elision (->) must be taken into account (instrument column) DONE
#In some songs, some measures have a metre different from the main metre, indicated in (). DONE
#The letter Z is sometimes classified as a "type", sometimes as a "structure" DONE