# Milestone 2 - Data gathering and preprocessing

## 1. Narrowing down the research question

### Research Questions

In this research project we analyse a corpus of popular songs to identify differences in chord usage between verses and choruses. We will try to answer the following research questions:
<ol>
<li>Does the chord distribution of the choruses differ from the one in the verses?</li>
<li>Is there a different chord sequence distribution in the choruses compared to the verses?</li>
<li>How do these distributions evolve over time?</li>
</ol>

The first question asks whether choruses and verses differ in their chord statistics. Are we more likely to find a specific chord in the choruses? Are fewer distinct chords used in the chorus than in the other parts of a song? <br> 
The second question looks at chord progressions rather than isolated chords. Can we find specific patterns? Is the Markov model derived from the choruses different from the one derived from the verses? <br>
The third question focuses on the time dimension. We want to know whether the answers to the two previous questions change over time. Were choruses closer to verses in 1968 than in 1985? How does the use of each chord evolve in relation to the others? Do some sequences appear while others disappear? 

These sub-questions all fall under our main research questions.

They relate to our original idea of fully characterizing a chorus over the years, especially in comparison with verses. We now want to focus specifically on the chords to differentiate these parts, while keeping the temporal dimension as a potential factor to observe changes.

### Dataset presentation

To answer our questions, we have selected a dataset containing approximately 900 Pop-Rock songs from the top Billboard charts from the 60s to the 90s. The songs are stored as simple text files containing the following information:
* Release date
* Song title 
* Artist name
* Labels for the different parts of the song (such as chorus or verse)
* Timestamp of the beginning of each musical phrase
* Chords

### Procedure

We will use the chorus/verse annotations to split the chords into two groups and compute statistics and distributions for each of them. As discussed in the research questions, we will start with a basic characterization, simply comparing which chords appear in which section. We will then go deeper and compare the chord distributions as well as the Markov models. These will be computed at least with bigrams, and perhaps with higher-order n-grams depending on the number of chords in each section (it makes little sense to use n-grams with n close to the number of chords in a given section).
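As a minimal sketch of this procedure, the snippet below extracts n-grams from chord sequences and turns the pooled counts into relative frequencies for each group. The helper names `ngrams` and `relative_frequencies` and the toy chord lists are hypothetical and only serve the illustration; they are not part of the dataset or of the parser.

```python
from collections import Counter

def ngrams(chords, n=2):
    """Return the list of n-grams (as tuples) of a chord sequence."""
    return [tuple(chords[i:i + n]) for i in range(len(chords) - n + 1)]

def relative_frequencies(sections, n=1):
    """Pool the n-grams of several sections and normalise the counts to sum to 1."""
    counts = Counter(g for chords in sections for g in ngrams(chords, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical toy data: each inner list is one annotated section.
choruses = [["C:maj", "G:maj", "A:min", "F:maj"],
            ["C:maj", "G:maj", "C:maj", "G:maj"]]
verses = [["A:min", "F:maj", "C:maj", "G:maj"]]

chorus_unigrams = relative_frequencies(choruses, n=1)
verse_unigrams = relative_frequencies(verses, n=1)
chorus_bigrams = relative_frequencies(choruses, n=2)
```

A section shorter than `n` simply contributes no n-gram, which reflects the remark that large values of n are of little use for short sections.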

The metadata, especially the release date, will be used to study how the statistics discussed above evolve over time. Depending on how the songs are distributed over the years, the temporal analysis will be carried out either per year or per decade.
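Grouping by decade could look like the sketch below. It assumes dates in the `YYYY-MM-DD` layout; the function name `decade_of` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date

def decade_of(chart_date):
    """Map an ISO date string such as '1968-07-13' to the start year of its decade."""
    year = date.fromisoformat(chart_date).year
    return year - year % 10

# Hypothetical chart dates, not taken from the dataset.
sample_dates = ["1961-05-01", "1968-07-13", "1975-02-22", "1985-11-30", "1989-01-07"]
songs_per_decade = Counter(decade_of(d) for d in sample_dates)
```

If some years turn out to contain very few songs, the per-decade counts give a quick indication of whether decades are a more reasonable unit.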

### Possible outcomes and confidence measures

The different outcomes we can reasonably expect are: 
<ul>
    <li> <strong>Null results:</strong> There are no significant differences between choruses and verses and no evolution over time. This could reflect a bias in the corpus toward a specific Pop-Rock style that always uses the same chords, or there may indeed be no difference between the chords used in choruses and those used in verses, which would itself answer our questions. </li>
<li> <strong>Narrow chorus chord distribution:</strong> Since the chorus has to be immediately recognizable, composers may rely more heavily on a subset of the chords to ensure it. The same reasoning applies to chord sequences: some specific ones may be more dominant in the chorus.</li>
<li> <strong>Temporal evolution:</strong> It will be interesting to see whether the differences between verse and chorus change over the years. This evolution, if present, could be linear or oscillating. A linear narrowing would imply that choruses are becoming more and more similar, or verses more and more diverse. An oscillating pattern would be interesting, as it would suggest musical phenomena that appear and disappear.</li>
</ul>

Statistical tests will be used throughout our analysis to check whether our findings are statistically significant. Error bars will also be included in all our plots to avoid drawing wrong conclusions. The relatively small size of our corpus will of course influence our results, but only further analysis can reveal whether significant results can still be found.
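One candidate test, given here only as a hedged sketch, is a chi-square test of homogeneity between the chord counts of choruses and verses. The function `chi_square_homogeneity` and the toy counts are hypothetical. The snippet computes only the statistic and the degrees of freedom; a p-value could then be obtained with, for example, `scipy.stats.chi2_contingency`. Chords within a song are not independent observations, so such a test should be read with caution.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two count tables."""
    # Keep only categories observed at least once, so expected counts are positive.
    categories = sorted(c for c in set(counts_a) | set(counts_b)
                        if counts_a.get(c, 0) + counts_b.get(c, 0) > 0)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for cat in categories:
        col_total = counts_a.get(cat, 0) + counts_b.get(cat, 0)
        for observed, row_total in ((counts_a.get(cat, 0), total_a),
                                    (counts_b.get(cat, 0), total_b)):
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(categories) - 1

# Hypothetical counts: the verse counts are exactly half the chorus counts,
# so both groups have the same distribution.
chorus_counts = {"C:maj": 30, "G:maj": 20, "A:min": 10}
verse_counts = {"C:maj": 15, "G:maj": 10, "A:min": 5}
same_stat, same_dof = chi_square_homogeneity(chorus_counts, verse_counts)

# Hypothetical counts with clearly different distributions.
diff_stat, diff_dof = chi_square_homogeneity({"x": 10, "y": 10}, {"x": 20, "y": 0})
```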

## 2. Gathering the data
The dataset was created by the authors of [1] and corresponds to a random sample of 890 Billboard chart slots presented at ISMIR 2011 and MIREX 2012. Due to the nature of the sampling algorithm, there are some duplicates, which leaves only 740 distinct songs. According to the authors, training algorithms that assume independent, identically distributed data should retain the duplicates.<br> This dataset is publicly available at https://ddmal.music.mcgill.ca/research/The_McGill_Billboard_Project_(Chord_Analysis_Dataset)/ and can be downloaded in various formats. The authors provide several features; in this project we will use the metadata and the chord annotations. 
The first dataset used is the index to the dataset (csv format), containing the following fields:
<ul>
<li><b>id</b>, the index for the sample entry.</li>
<li><b>chart_date</b>, the date of the chart for the entry.</li>
<li><b>target_rank</b>, the desired rank on that chart.</li>
<li><b>actual_rank</b>, the rank of the song actually annotated, which may be up to 2 ranks higher or lower than the target rank [1, 2].</li>
<li><b>title</b>, the title of the song annotated.</li>
<li><b>artist</b>, the name of the artist performing the song annotated.</li>
<li><b>peak_rank</b>, the highest rank the song annotated ever achieved on the Billboard Hot 100.</li>
<li><b>weeks_on_chart</b>, the number of weeks the song annotated spent on the Billboard Hot 100 chart in total.</li>
</ul>

The main dataset comprises chords, structure, instrumentation, and timing, given in txt format. The annotation for each song begins with a header containing the title of the song, the name of the artist, the metre and the tonic pitch class of the opening key. In the main body, each line consists of a single phrase and begins with its timestamp, followed by the chords. This requires us to design a specific parser, as will be discussed in the next section.<br>
We downloaded both datasets, which together constitute the whole database, so we have everything this source offers. We may add further sources later for additional analyses. For now we will try to extract as much statistically relevant information as possible from this one.
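To make the layout of a body line concrete, here is a small sketch that splits one line into its timestamp, its section labels and its measures. The example line is hypothetical and only written to resemble the annotation files; the function `split_annotation_line` is likewise an illustration, not the parser used in this notebook.

```python
def split_annotation_line(line):
    """Split one body line into (timestamp, labels, measures)."""
    fields = line.strip().split(None, 1)      # the timestamp comes first
    timestamp = float(fields[0])
    rest = fields[1] if len(fields) > 1 else ""
    parts = rest.split("|")                   # measures are delimited by '|'
    labels = [p.strip() for p in parts[0].split(",") if p.strip()]
    measures = [m.split() for m in parts[1:-1]]
    return timestamp, labels, measures

# Hypothetical line resembling the annotation layout.
example = "7.3469387\tA, intro, | A:min | A:min . | C:maj G:maj |, (guitar"
timestamp, labels, measures = split_annotation_line(example)
```

Here the text before the first `|` carries the structure letter and the section type, and the text after the last `|` carries annotations such as the lead instrument, which this sketch ignores.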

[1]: John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga, ‘An Expert Ground Truth Set for Audio Chord Recognition and Music Analysis’, in Proceedings of the 12th International Society for Music Information Retrieval Conference, ed. Anssi Klapuri and Colby Leider (Miami, FL, 2011), pp. 633–38

## 3. Data format

The goal of this section is to load the data and have a first look at it. A dedicated parser extracts all the relevant information and musical features automatically and stores them in a Pandas dataframe.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re
import cufflinks as cf
cf.go_offline()
cf.set_config_file(theme='white')
import plotly.express as px


### Metadata

In [None]:
metadata_df = pd.read_csv("data/billboard-2.0-index.csv")
metadata_df.head(10)

In [None]:
print('There are %d entries in the index table.' %len(metadata_df))

In [None]:
print('There are %d entries with a given title.' %metadata_df.title.notna().sum())

In [None]:
print('There are %d entries with a given artist.' %metadata_df.artist.notna().sum())

In [None]:
print('There are %d entries with a given chart date.' %metadata_df.chart_date.notna().sum())

In [None]:
months = {'01':'January', 
          '02':'February',
          '03':'March',
          '04':'April',
          '05':'May',
          '06':'June',
          '07':'July',
          '08':'August',
          '09':'September',
          '10':'October',
          '11':'November',
          '12':'December'}
def format_date(date):
    """Format a 'YYYY-MM-DD' string as, e.g., 'August 4th, 1958'."""
    year = date[:4]
    month = date[5:7]
    day = int(date[-2:])
    
    #11, 12 and 13 take 'th'; otherwise the suffix depends on the last digit
    if 11 <= day <= 13:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(day % 10, 'th')
    
    date_string = months[month] + ' ' + str(day) + suffix + ', ' + year
    return date_string

#Test
format_date('1958-08-04')

In [None]:
print('The songs range from %s to %s.' %(format_date(metadata_df.chart_date.min()), format_date(metadata_df.chart_date.max())))

### Parser 

In [None]:
SONG_ID, LINE_NUMBER, MEASURE_NUMBER, CHORD_NUMBER, SEQUENCE_NUMBER, \
CHORD, INSTRUMENT, TYPE, TIME, STRUCTURE, DURATION, REPETITION, ELID = \
"song_id","line_id", "measure_id", "chord_id", "sequence_id",\
"chord", "instrument", "section_type", "time", "section_structure", "duration", "repetition", "elided"

#This must match the "metre" attribute name used in the txt files.
METRE = "metre"

#Create a new dictionary from two others
def immutable_merge(dic1, dic2):
    result = dic1.copy()
    result.update(dic2)
    return result

#Create a row of the future df as a dictionary
def create_row(persistent_attributes, line_attributes,
               measure_number = None, chord_number = None, chord = None, duration = None):
    result = immutable_merge(persistent_attributes, line_attributes)
    
    #Add the chord-level fields only if at least one of them is given
    if not (measure_number is None and chord_number is None and chord is None and duration is None):
        result[MEASURE_NUMBER] = measure_number
        result[CHORD_NUMBER] = chord_number
        result[CHORD] = chord
        result[DURATION] = duration
    
    return result

#Generate the attributes of a given line and update the sequence counter
def process_line_metadata(header, line_counter, old_line_attributes, sequence_counter, suffix = ""):
    
    result = {}
    
    #Suffix (main instrument, elision, repetition)
    old_instrument = str(old_line_attributes.get(INSTRUMENT))
    
    for item in suffix.split(", "):
        
        item = item.strip()
        
        #Repetition (e.g. "x2" or "x12")
        if re.match(r"^x\d+$",item):
            result[REPETITION] = int(item[1:])
        
        #Elision
        elif item == "->":
            result[ELID] = True

        #Instrument
        else:
            ##New instrument
            if len(item) > 0 and item != "\n":
                result[INSTRUMENT] = item.strip("\n").strip(",").strip()

            ##Main instrument continued (experimental)
            elif not old_instrument.endswith(")") and old_instrument.lower() not in ["nan","none"] \
            and len(old_instrument)>0:
                result[INSTRUMENT] = old_instrument.strip("(")

        
    #Line number
    result[LINE_NUMBER] = line_counter

    
    #Header    
    header_items = header.split()
        
    result[TIME] = header_items[0]
    
    #Case where a section is continued
    if len(header_items) == 1:
        result[TYPE] = old_line_attributes.get(TYPE)
        result[STRUCTURE] = old_line_attributes.get(STRUCTURE)
        result[SEQUENCE_NUMBER] = old_line_attributes.get(SEQUENCE_NUMBER)
    
    #Case where a section has no structure (silence, end, fadeout)
    elif len(header_items) == 2:
        
        #Z is a structure, not a type.
        if header_items[1].strip().strip(",") == "Z":
            result[STRUCTURE] = header_items[1].strip().strip(",")
        else:
            result[TYPE] = header_items[1].strip().strip(",")
            
        result[SEQUENCE_NUMBER] = sequence_counter
        sequence_counter += 1
    
    #Case where a section begins.
    elif len(header_items) == 3:
        result[STRUCTURE] = header_items[1].strip().strip(",")
        result[TYPE] = header_items[2].strip().strip(",")
        result[SEQUENCE_NUMBER] = sequence_counter
        sequence_counter += 1
    
    return sequence_counter, result

In [None]:
def parse_song_to_dict(song_id, path):
    
    rows = []
    persistent_attributes = {}
    
    persistent_attributes[SONG_ID] = song_id
    
    with open(path,"r") as file:
        line = file.readline()
        
        line_counter = 0
        measure_counter = 0
        chord_counter = 0
        sequence_counter = 0
        line_attributes = {}
        old_chord = None
 

        while line:
        
            if line != "\n":

                #Attribute lines
                if line.startswith("#"):
                    attribute, value = line.strip("#").split(":",1)
                    persistent_attributes[attribute.strip(" ")] = value.strip(" ").strip("\n")

                else:
                    line_items = line.split("|")

                    #Special lines
                    if len(line_items) <= 1:
                        sequence_counter, line_attributes = \
                        process_line_metadata(line, line_counter, line_attributes, sequence_counter)
                        row = create_row(persistent_attributes, line_attributes)
                        rows.append(row)

                    #Standard lines    
                    else:                    
                        header = line_items[0]
                        suffix = line_items[-1]
                        measures = line_items[1:-1]

                        sequence_counter, line_attributes = \
                        process_line_metadata(header, line_counter, line_attributes, sequence_counter, suffix)  

                        for measure in measures:
                            
                            chords = measure.split()
                            
                            #Special metre for this measure only (experimental)
                            old_metre = persistent_attributes.get(METRE)
                            metre_match = re.match(r"^\((\d+)/(\d+)\)$", chords[0]) if chords else None
                            if metre_match:
                                persistent_attributes[METRE] = metre_match.group(1) + "/" + metre_match.group(2)
                                chords = chords[1:]
                            
                            if len(chords) == 1:
                                duration = "measure"
                            elif len(chords) == 2 and persistent_attributes[METRE] in ["4/4","12/8"]:
                                duration = "half-measure"
                            else:
                                duration = "beat"
                            
                            for chord in chords:
                                
                                if chord == ".":
                                    chord = old_chord
                                
                                row = create_row(persistent_attributes, line_attributes,
                                                 measure_counter, chord_counter, chord, duration)
                                rows.append(row)
                                old_chord = chord
                                chord_counter += 1

                            measure_counter += 1
                            persistent_attributes[METRE] = old_metre
            
            #Finally
            line_counter += 1
            line = file.readline()
    
    
    return rows

In [None]:
test = pd.DataFrame(parse_song_to_dict(0,"data/McGill-Billboard/0004/salami_chords.txt"))

In [None]:
def create_whole_collection_df():
    
    path = "data/McGill-Billboard/"
    file_name = "/salami_chords.txt"
    UPPER_BOUND = 1300
    
    whole_collection = []
    
    for i in range(UPPER_BOUND + 1):
        #Song folders are named with zero-padded four-digit ids
        full_path = path + str(i).zfill(4) + file_name
        
        if os.path.exists(full_path):
            whole_collection += parse_song_to_dict(i, full_path)
        
    whole_collection_df = pd.DataFrame(whole_collection)
    
    return whole_collection_df.astype({SEQUENCE_NUMBER: 'Int64', MEASURE_NUMBER: 'Int64', CHORD_NUMBER: 'Int64', \
                                      REPETITION: 'Int64'})

In [None]:
collection_df = create_whole_collection_df()

In [None]:
collection_df.sample(10)

In [None]:
collection_df.set_index("song_id").head(5)

In [None]:
#BUGS: Does a dot mean the same chord is repeated? To be clarified and fixed. DONE
# Some measures have a single chord, others one chord per beat. Must be taken into account! DONE
#Introduce a sequence_id that identifies a chorus, a verse, etc. DONE
#Markers indicating a repetition (x) or an elision (->) must be taken into account (instrument column) DONE
#In some songs, a few measures have a metre different from the main one, given in (). DONE
#The letter Z is sometimes classified as "type", sometimes as "structure" DONE
#Some measures contain two chords lasting half a measure each DONE

#&pause has an arbitrary length, not necessarily one measure

Load your dataset and show examples of how you access the information that you are interested in.
Give an overview of your dataset by plotting some basic statistics of the relevant features and/or metadata.

### Exploratory statistical analysis

In [None]:
years=metadata_df.chart_date.map(lambda y:pd.to_datetime(y).year).value_counts(sort=False)
years.iplot(kind='bar', title="Number of sampled songs per year", xTitle="Year", yTitle="Samples")




In [None]:
collection_df.chord.value_counts().head(24).iplot(kind='bar',title="Overall chord occurrence distribution",xTitle="Chord",yTitle="Occurrences")

In [None]:
collection_df[collection_df["section_type"]=="chorus"].chord.value_counts().head(20).iplot(kind='bar',title="Chorus chord occurrence distribution",xTitle="Chord",yTitle="Occurrences")

In [None]:
#Basic statistics: number of unique chords per song
unique_chord_songs = collection_df[[SONG_ID,CHORD]].drop_duplicates().groupby(SONG_ID).count()
#Select the column explicitly: calling int() on a one-element Series is deprecated
n_bins = int(unique_chord_songs[CHORD].max())+1
unique_chord_songs.plot.hist(bins = n_bins, legend = False)
plt.title('Distribution of the number of unique chords per song', fontsize = 20)
plt.xlabel('Number of unique chords', fontsize = 18)
plt.ylabel('Frequency', fontsize = 18)
plt.show()

In [None]:
collection_df["time"]=collection_df.time.apply(lambda y: int(float(y)))


### Squeeze function
#### with examples


In [None]:
#The squeeze function returns a dataframe with all the chords of a song squeezed into one row, optionally
# restricted to one section type. The default subgroup "none" does not filter on section type.
#Try subgroup="chorus" or subgroup="verse"

def compress(s) :
    return s.dropna().to_list()

def squeeze(df,subgroup="none"):
    df_local=df
    if(subgroup!="none"):
        df_local=df[df["section_type"]==subgroup]
        
    return pd.DataFrame(df_local.groupby(["song_id","title"]).chord.agg(compress))

In [None]:
#squeeze_section squeezes the chords of each section of each song into one row


def squeeze_section(df,subgroup="none"):
    df_local=df
    if(subgroup!="none"):
        df_local=df[df["section_type"]==subgroup]
        
    return pd.DataFrame(df_local.groupby(["song_id","sequence_id","section_type"]).chord.agg(compress))

In [None]:
squeeze_section(collection_df).head(5)

In [None]:
markov=squeeze(collection_df)
markov.head(5)

In [None]:
squeeze(collection_df,"chorus").head(5)

### Markov exploration

In [None]:
# Compute bigrams

def bigrams_seq(seq):
    return list(zip(seq[:-1], seq[1:]))

def bigrams_corpus(seqs):
    return [bg for seq in seqs for bg in bigrams_seq(seq)]

In [None]:
markov["bigrams"]=markov.chord.map(lambda y : bigrams_seq(y))

In [None]:
markov.head(5)

In [None]:
from collections import Counter

In [None]:
test=markov.bigrams.iloc[0]


In [None]:
c_one=Counter(test)

In [None]:
test_two=markov.bigrams.iloc[2]
c_two=Counter(test_two)
added=c_two+c_one
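
Since bigram counters can be added together, counts pooled over many songs can be normalised into first-order transition probabilities. The sketch below shows one way to do this. The function `transition_probabilities` and the toy chord sequence are hypothetical and independent of the dataframe above.

```python
from collections import Counter, defaultdict

def transition_probabilities(bigram_counts):
    """Turn bigram counts into P(next chord | current chord)."""
    totals = defaultdict(int)
    for (current, _), count in bigram_counts.items():
        totals[current] += count
    probabilities = defaultdict(dict)
    for (current, following), count in bigram_counts.items():
        probabilities[current][following] = count / totals[current]
    return dict(probabilities)

# Hypothetical toy sequence standing in for one chord list.
toy_chords = ["C:maj", "G:maj", "C:maj", "F:maj", "C:maj", "G:maj"]
toy_counts = Counter(zip(toy_chords[:-1], toy_chords[1:]))
toy_model = transition_probabilities(toy_counts)
```

Each inner dictionary sums to 1, so two such models (one for choruses, one for verses) can be compared row by row.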


### Parser for the chords 

In [None]:

#Split a string on split_char, store one side of the split in dic[target] and return the other side
#side : 0 to store the left part, 1 to store the right part
def split_add(c,dic,split_char,target,side):
    temp=str.split(str(c),split_char)

    dic[target]=temp[side]
    return dic, temp[1-side]


def chord_to_tab(c):
    c=str(c)
    chord={"root":"", "shorthand" : "", "degree_list":[], "bass":"", "N" :False}
    rest=""
    
    if(c=="N"):
        chord["N"] = True
        return chord
    
    c=c.replace(")","")
    
    if('/' in c):
        chord, rest=split_add(c,dic=chord,split_char="/",target="bass",side=1)
    else :
        rest=c
    if(':' in rest):
        chord, rest=split_add(rest,dic=chord,split_char=":",target="root",side=0)
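    if('(' in rest):
        chord, rest=split_add(rest,dic=chord,split_char="(",target="degree_list",side=1)
        #Store the degrees as a list, e.g. "3,5" -> ["3", "5"]
        chord["degree_list"]=chord["degree_list"].split(",")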
    if(rest != ""):
        chord["shorthand"]=rest
    
    return chord

    

In [None]:
collection_df["chord_dic"]=collection_df.chord.map(lambda y : chord_to_tab(y))
collection_df.head(5)

#### How to access the dictionary

In [None]:
collection_df.chord_dic.map(lambda y: y["root"]).head(10)

In [None]:
collection_df.chord_dic.map(lambda y: y["N"]).head(10)

In [None]:
#A comma was missing after 'hdim7', which silently merged it with 'minmaj7'
shorthand=['maj','min','dim','aug','maj7','min7', '7','dim7','hdim7',
'minmaj7','maj6', 'min6','9', 'maj9', 'min9','sus4','11','maj11','min11','maj13','min13','sus2']


### A few plots with the new dictionary


In [None]:
#TODO: the empty element still needs to be removed
collection_df.chord_dic.map(lambda y: y["root"]).value_counts().iplot(kind="bar", title="chord distribution")


In [None]:
collection_df.chord_dic.map(lambda y: y["shorthand"]).value_counts().iplot(kind="bar", title="Chord type")


In [None]:
collection_df.chord_dic.map(lambda y: y["bass"]).value_counts()[1:].plot(kind="bar",title="bass note distribution")


## Tonic and duration analysis

### Creation of new dataframe

#### Creation of new columns for relative-to-tonic roots

In [None]:
def chord_dic_to_columns(df):
    for key in ['root','shorthand','degree_list','bass','N']:
        df[key] = df.chord_dic.apply(lambda x : x.get(key))
        
    #drop returns a new dataframe, so the result must be kept
    df = df.drop(columns = ["chord_dic"])
    
    return df

In [None]:
collection_df = chord_dic_to_columns(collection_df)

In [None]:
TPC_DIC = {"Cb":11,"C":0,"C#":1,"Db":1,"D":2,"D#":3,"Eb":3,"E":4,"E#":5,"Fb":4,"F":5,"F#":6,
           "Gb":6,"G":7,"G#":8,"Ab":8,"A":9,"A#":10,"Bb":10,"B":11,"B#":0}

In [None]:
collection_df["root_tpc"] = collection_df.root.apply(lambda r : TPC_DIC.get(r))
collection_df = collection_df.astype({"root_tpc":"Int64"})

In [None]:
collection_df["relative_root_tpc"] =\
collection_df.apply(lambda row: (row["root_tpc"] - TPC_DIC.get(row["tonic"]))%12,axis = 1)

In [None]:
collection_df = collection_df.fillna({REPETITION:1})

#### Duration tools

Duration of a measure: all metres in the corpus are regular except in song 700 (5/8 = 3 + 2), which was counted as two beats per measure.

Creation of a one-row-per-beat dataframe (all cells of this section must be run for it to work properly).

In [None]:
def weight_row(metre, duration, repetition):
    """
    Indicate how many beats in total the chord described on a dataframe row corresponds to.
    Careful! With repetitions, these beats are not consecutive
    """
    num, denom = metre.split("/")
    
    if metre == "5/8": #Specific to this dataset
        beat_per_measure = 2
    else:
        beat_per_measure = int(num)/(1 if int(denom) == 4 else 3)
    
    if duration == "measure":
        return beat_per_measure*(repetition)
    
    elif duration == "half-measure":
        return beat_per_measure/2*(repetition)
    
    elif duration == "beat":
        return repetition
    
    else:
        return 0

def successive_repetition_row(metre,duration):
    """
    Indicate for how many successive beats a chord is held
    """
    return weight_row(metre,duration,repetition=1)

#### Creation of df with only songs with both verse and chorus

In [None]:
valid_songs = collection_df[[SONG_ID,TYPE]].drop_duplicates().groupby(SONG_ID)[TYPE].apply(list)\
.apply(lambda l: "chorus" in l and "verse" in l)

In [None]:
d_collection_df = collection_df.merge(valid_songs.reset_index().rename(columns = {TYPE:"valid"}),on = SONG_ID)
d_collection_df = d_collection_df[d_collection_df.valid]

#### Creation of one-row-per-beat dataframe

In [None]:
# Creation of one-row-per-beat dataframe
N_SUCC_BEATS = "n_succ_beats"

d_collection_df[N_SUCC_BEATS] = d_collection_df.apply(\
    lambda row: successive_repetition_row(row[METRE], row[DURATION]),axis=1)
d_collection_df = d_collection_df.astype({N_SUCC_BEATS:"Int64"})

In [None]:
from tqdm import tqdm

def create_beats_df(d_collection_df):
    beats_dics = []
    repetition_flag = False
    repeted_dics = []
    repetition_line = np.inf   #np.PINF was removed in NumPy 2.0
    repetition_song = np.inf
    for i in tqdm(d_collection_df.reset_index().index):
        row = d_collection_df.iloc[i]

        #A repeated line has ended: emit its beats once per repetition
        if repetition_flag == True and\
(repetition_line != row[LINE_NUMBER] or repetition_song != row[SONG_ID]):

            for r in range(repetition_n):
                beats_dics += repeted_dics

            repetition_flag = False
            repeted_dics = []
            repetition_line = np.inf
            repetition_song = np.inf
        
        
        if row[REPETITION] == 1 :

            for b in range(row[N_SUCC_BEATS]):
                beats_dics.append(row.to_dict())

        else:

            repetition_flag = True
            repetition_line = row[LINE_NUMBER]
            repetition_song = row[SONG_ID]
            repetition_n = row[REPETITION]

            for b in range(row[N_SUCC_BEATS]):
                repeted_dics.append(row.to_dict())

    #Flush a repeated line that ends the dataframe
    if repetition_flag:
        for r in range(repetition_n):
            beats_dics += repeted_dics

    beats_collection_df = pd.DataFrame(beats_dics)
    
    return beats_collection_df

In [None]:
beats_collection_df = create_beats_df(d_collection_df)
beats_collection_df = beats_collection_df.drop(N_SUCC_BEATS,axis=1)

### Chorus/verse proportion

We consider the distribution of the proportion of beats in a song that belong to choruses, respectively to verses.

In [None]:
TOTAL_WEIGHT = "total_weight"


d_collection_df[TOTAL_WEIGHT] = d_collection_df.apply(\
    lambda row: weight_row(row[METRE], row[DURATION], row[REPETITION]),axis=1)

d_collection_df[N_SUCC_BEATS] =  d_collection_df.apply(\
    lambda row: successive_repetition_row(row[METRE], row[DURATION]),axis=1)

d_collection_df = d_collection_df.astype({TOTAL_WEIGHT: 'Int64',N_SUCC_BEATS: 'Int64'})

In [None]:
n_beat_chorus = d_collection_df[d_collection_df.section_type == "chorus"].groupby("song_id")[TOTAL_WEIGHT].sum()\
.reset_index().rename(columns= {TOTAL_WEIGHT:"chorus_weight"})
n_beat_verse = d_collection_df[d_collection_df.section_type == "verse"].groupby("song_id")[TOTAL_WEIGHT].sum()\
.reset_index().rename(columns= {TOTAL_WEIGHT:"verse_weight"})
n_beat = d_collection_df.groupby("song_id")[TOTAL_WEIGHT].sum()

#Considering only songs with both chorus and verse. For all songs, use the line below
# n_beat_chorus.merge(n_beat_verse,on="song_id",how="outer").merge(n_beat,on="song_id",how="outer").fillna(0)
proportions_df = n_beat_chorus.merge(n_beat_verse,on="song_id").merge(n_beat,on="song_id")

proportions_df["chorus_weight"] = proportions_df["chorus_weight"]/proportions_df["total_weight"]
proportions_df["verse_weight"] = proportions_df["verse_weight"]/proportions_df["total_weight"]

In [None]:
proportions_df[["chorus_weight","verse_weight"]].describe().drop("count")

The proportions of beats spent in choruses and in verses have similar summary statistics.

### Tonic Proportion

In [None]:
d_collection_df[[SONG_ID,"tonic"]].drop_duplicates().groupby("tonic")\
.count().rename(columns = {SONG_ID:"number_of_songs"}).plot.bar()

### Per-chord tonic distance analysis

In [None]:
#From now on, we consider d_collection_df, which contains only songs that have both a chorus and a verse

def per_chord_tonic_distance_analysis(distance,df,n_beat_type,btype):
    """
    Compute the per-song weighted mean and std of the chord distance to the tonic and describe them
    
    distance: name of the distance column, for example the relative tpc from the tonic
    df: dataframe to consider
    n_beat_type: dataframe that gives for each song the number of beats (= number of chords) of the given type
    btype: name of the weight column in n_beat_type
    """
    
    #Work on a copy so that the caller's (possibly sliced) dataframe is not modified
    df = df.copy()
    
    df["weight*dist"] =df[distance]*df[TOTAL_WEIGHT]
    mean_numerator_df = df.groupby(SONG_ID)["weight*dist"].sum().reset_index()\
    .rename(columns = {"weight*dist":"mean_numerator"})
    
    
    agg_df = mean_numerator_df.merge(n_beat_type, on = "song_id")
    agg_df["mean_distance"] = agg_df["mean_numerator"]/agg_df[btype]
    
    #Need mean to compute variance
    df = df.merge(agg_df[["mean_distance","song_id"]], on = "song_id")
    
    #Weighted variance: each squared deviation is weighted once by its number of beats
    df["variance_item"] = df[TOTAL_WEIGHT]*(df[distance] - df["mean_distance"])**2
    variance_numerator_df = df.groupby(SONG_ID)["variance_item"].sum().reset_index()\
    .rename(columns = {"variance_item":"variance_numerator"})
    
    agg_df = agg_df.merge(variance_numerator_df, on = "song_id")
    agg_df["variance_distance"] = agg_df["variance_numerator"]/agg_df[btype]
    agg_df["std_distance"] = agg_df["variance_distance"].apply(np.sqrt)
    
    return agg_df.describe()[["mean_distance","std_distance"]]

In [None]:
def plot_per_chord_tonic_distance_analysis(distance,df):
    chorus_distance_df = df[df.section_type == "chorus"]
    verse_distance_df = df[df.section_type == "verse"]
    
    n_beat_chorus = df[df.section_type == "chorus"].groupby("song_id")[TOTAL_WEIGHT].sum()\
    .reset_index().rename(columns= {TOTAL_WEIGHT:"chorus_weight"})
    n_beat_verse = df[df.section_type == "verse"].groupby("song_id")[TOTAL_WEIGHT].sum()\
    .reset_index().rename(columns= {TOTAL_WEIGHT:"verse_weight"})
    
    print("Choruses")
    print(per_chord_tonic_distance_analysis(distance,chorus_distance_df,n_beat_chorus,"chorus_weight"))
    print()
    print("Verses")
    print(per_chord_tonic_distance_analysis(distance,verse_distance_df,n_beat_verse,"verse_weight"))

In [None]:
plot_per_chord_tonic_distance_analysis("relative_root_tpc",d_collection_df)

These results are not very informative so far. A different distance measure, for example one related to how we perceive chords or to the tonal hierarchy, might be more revealing.
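One possible alternative, offered here only as an assumption and not as a result of this project, is the distance on the circle of fifths. Multiplying a pitch class by 7 modulo 12 converts a semitone offset into a number of fifths, because 7 × 7 = 49 ≡ 1 (mod 12). The function name `fifths_distance` is hypothetical.

```python
def fifths_distance(relative_pc):
    """Number of steps between a pitch class and the tonic on the circle of fifths."""
    steps = (7 * relative_pc) % 12      # position on the circle of fifths
    return min(steps, 12 - steps)       # shortest way around the circle

# Distance of every relative pitch class (0 = tonic) to the tonic.
distances = {pc: fifths_distance(pc) for pc in range(12)}
```

With this measure the dominant and the subdominant (7 and 5 semitones above the tonic) are both at distance 1, whereas the semitone distance puts them at 5. It could be applied to the `relative_root_tpc` column once rows with missing roots are dropped.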

### Musical path analysis

Here we explore the melodic lines of the music by investigating how the distance to the tonic evolves over time.

In [None]:
### Squeeze with repetition
def weighted_squeeze(df,col):
    """
    Recreate a line with one chord per beat
    """
    df["chord_sublist"] = df.apply(lambda row: row[N_SUCC_BEATS]*[row[col]],axis = 1)
    
    line_df = df.groupby([SONG_ID,TYPE,LINE_NUMBER,])["chord_sublist"].sum().reset_index()
    
    
    #line_df["chord_sublist"] = line_df["chord_sublist"].apply(lambda l: np.array(l).flatten())
    
    
    return line_df.rename(columns = {"chord_sublist":"chord_list"})
    
def full_squeeze(df,col):
    """
    Recreate a section, considering repetitions of full lines.
    """
    #Not implemented yet
    return 0

In [None]:
line_df = weighted_squeeze(d_collection_df,"relative_root_tpc")

In [None]:
def mean_path(line_df,line_length,verbose=True):
    """
    Show mean path for given line length
    """
    
    f = plt.figure(figsize=(12,4))
    
    means_dic = {}
    
    for i, section_type in enumerate(("chorus","verse")):
        
        f.add_subplot(1,2,i+1)
        
        sample_df = line_df[line_df["chord_list"].apply(lambda l: len(l) == line_length)]

        sample_df = sample_df[sample_df[TYPE] == section_type]
        
        means = pd.DataFrame(sample_df["chord_list"].values.tolist()).mean(axis = 0)
        means_dic[section_type] = means
        
        if verbose:
            print(len(sample_df),"lines considered for",section_type)

            means.plot.bar()
            plt.title("Average relative root tpc for lines of length {} in {}".format(line_length,section_type))
            plt.xlabel("Position of chord")
            plt.ylabel("Avg relative root tpc")
    
    
    if not verbose:
        return means_dic    

In [None]:
mean_path(line_df,8)
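
The averaging behind the mean path can be sketched on a few hand-written lines. The lists below are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical lines, already expanded to one value per beat.
toy_lines = [
    [0, 0, 7, 7],
    [0, 5, 5, 0],
    [0, 7],          # different length, filtered out below
]

line_length = 4
same_length = [line for line in toy_lines if len(line) == line_length]

# One row per line, one column per beat position:
# the column-wise mean is the mean path.
mean_path_toy = pd.DataFrame(same_length).mean(axis=0)
print(mean_path_toy.tolist())
```

Only lines of exactly the requested length are compared, which is why `mean_path` has to be called once per line length.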

In [None]:
# General shape, number of non-tonic chords between two tonic chords, evolution of tonic distance depending on note position, etc
# Consider the melodic LINES. Run the analyses on every line of length X, for all possible lengths
# The same should be done with entire sections

In [None]:
def plot_line_path(df,distance,line_len):
    line_df = weighted_squeeze(df,distance)
    mean_path(line_df,line_len)
    
def plot_section_path(df,distance,sec_len):
    sec_df = full_squeeze(df,distance)
    mean_path(sec_df,sec_len)

### Tests with different distances

#### Distance between tonic and chord root, in semitones up or down

In [None]:
d_collection_df["tpc_distance"] = d_collection_df["relative_root_tpc"].apply(lambda tpc: min(tpc,12-tpc))
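
As a quick check of this folding, the sketch below lists the distance for every possible value. It assumes that `relative_root_tpc` is an integer between 0 and 11 counted in semitones above the tonic; the helper name `semitone_distance` and the modulo guard are ours.

```python
def semitone_distance(relative_root):
    """Shortest distance to the tonic in semitones, going up or down."""
    pc = relative_root % 12   # guard in case the value is outside 0..11
    return min(pc, 12 - pc)

distances = {pc: semitone_distance(pc) for pc in range(12)}
for pc, dist in distances.items():
    print(pc, dist)
```

Under this assumption the distance grows up to the tritone (6 semitones) and then decreases again, so a root 7 semitones above the tonic counts as 5 semitones away.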

In [None]:
plot_per_chord_tonic_distance_analysis("tpc_distance",d_collection_df)

In [None]:
plot_line_path(d_collection_df,"tpc_distance",8)

#### Concordance distance (manual mapping)

This rough proximity mapping follows the proximity graph from lecture 9 (slide 27).

In [None]:
concordance_dic = {0:15,7:5,5:2.5,9:2.5,1:2.5}

In [None]:
d_collection_df["concordance_proximity"] = d_collection_df["relative_root_tpc"]\
.apply(lambda r: concordance_dic.get(r))

d_collection_df = d_collection_df.fillna({"concordance_proximity":1}).dropna(subset = ["relative_root_tpc"])
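
The mapping with a default value can be sketched on a small series. The scores are the ones from `concordance_dic`; the example roots are invented, and `Series.map` is used here as an equivalent of the `apply` call above.

```python
import pandas as pd

concordance = {0: 15, 7: 5, 5: 2.5, 9: 2.5, 1: 2.5}

# Hypothetical relative roots; 3 has no entry in the mapping.
roots = pd.Series([0, 7, 3, 5, 1])

# Roots without an entry become NaN, then receive the default score of 1.
proximity = roots.map(concordance).fillna(1)
print(proximity.tolist())
```

Every root that is absent from the mapping ends up with the same score of 1, so this measure only distinguishes the five listed roots from all the others.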

In [None]:
plot_per_chord_tonic_distance_analysis("concordance_proximity",d_collection_df)

In [None]:
plot_line_path(d_collection_df,"concordance_proximity",8)

### Tests (DO NOT INCLUDE IN MERGE)

In [None]:
n_beat_verse

In [None]:
chorus_distance_df["weight*dist"] =chorus_distance_df["relative_root_tpc"]*chorus_distance_df[TOTAL_WEIGHT]
chorus_distance_df["mean_distance"] = chorus_distance_df["weight*dist"].groupby(SONG_ID).sum()/n_beat_chorus
chorus_distance_df["variance_distance"] = chorus_distance_df["weight*dist"] - \
    chorus_distance_df[TOTAL_WEIGHT]*chorus_distance_df["mean_distance"]

verse_distance_df["weight*dist"] =verse_distance_df["relative_root_tpc"]*verse_distance_df[TOTAL_WEIGHT]
verse_distance_df["mean_distance"] = verse_distance_df["weight*dist"].groupby(SONG_ID).sum()/n_beat_verse
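
For reference, one possible way to compute a weighted mean and a weighted variance of the distance per song is sketched below. This is an assumption about what the cell above is aiming for, not the method used elsewhere in the notebook; the column names and values are invented.

```python
import pandas as pd

# Hypothetical data: one row per chord, weighted by its number of beats.
toy = pd.DataFrame({
    "song_id": [1, 1, 2, 2],
    "distance": [0, 4, 2, 2],
    "weight": [3, 1, 1, 1],
})

# Weighted mean per song: sum(weight * distance) / sum(weight)
toy["weighted_distance"] = toy["distance"] * toy["weight"]
per_song = toy.groupby("song_id")[["weighted_distance", "weight"]].sum()
per_song["mean_distance"] = per_song["weighted_distance"] / per_song["weight"]

# Weighted variance per song: sum(weight * (distance - mean)^2) / sum(weight)
toy["mean_distance"] = toy["song_id"].map(per_song["mean_distance"])
toy["weighted_sq_dev"] = toy["weight"] * (toy["distance"] - toy["mean_distance"]) ** 2
per_song["variance_distance"] = (
    toy.groupby("song_id")["weighted_sq_dev"].sum() / per_song["weight"]
)

print(per_song[["mean_distance", "variance_distance"]])
```

Grouping the dataframe first and selecting the column afterwards keeps the song identifier available as the group key.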


In [None]:
np.sqrt(2)

In [None]:
#mean_distance = df.groupby(SONG_ID)["weight*dist"].sum()[full_mask]/n_beat_type[btype]
    #df = df.merge(df.groupby(SONG_ID)["weight*dist"].sum().reset_index()\
    #              .rename(columns = {"weight*dist":"mean_numerator"}),on = "song_id")

In [None]:
#Filtering songs that have the wanted section (UNUSED)
full_mask=\
c_collection_df.groupby("song_id")["section_type"].apply(list).apply(lambda l: "chorus" in l and "verse" in l)
chorus_mask = c_collection_df.groupby("song_id")["section_type"].apply(list).apply(lambda l: "chorus" in l)
verse_mask = c_collection_df.groupby("song_id")["section_type"].apply(list).apply(lambda l: "verse" in l)

In [None]:
3*[3]

In [None]:
c_collection_df

In [None]:
np.array([[[3,4]]]).flatten()

In [None]:
collection_df.columns

In [None]:
c_collection_df.groupby(SONG_ID)["tonic"].count()

In [None]:
chorus_distance_df = c_collection_df[c_collection_df.section_type == "chorus"]\
[[SONG_ID,"relative_root_tpc",TOTAL_WEIGHT]]
verse_distance_df = c_collection_df[c_collection_df.section_type == "verse"]\
[[SONG_ID,"relative_root_tpc",TOTAL_WEIGHT]]

In [None]:
d_collection_df[["song_id",REPETITION]][d_collection_df[REPETITION]==1]

In [None]:
len(beats_collection_df)

In [None]:
d_collection_df.iloc[10000000]

In [None]:
d_collection_df

In [None]:
d_collection_df[N_SUCC_BEATS].unique()

In [None]:
beats_collection_df[(beats_collection_df[SONG_ID] == 34) & (beats_collection_df["section_structure"] == "D")].head(30)

In [None]:
# Copy the slice so that the assignments below do not act on a view of d_collection_df
test_df = d_collection_df[d_collection_df[SONG_ID] == 4].copy()

In [None]:
a = create_beats_df(test_df)
len(a)

In [None]:
test_df[REPETITION] = 3

In [None]:
b = create_beats_df(test_df)
len(b)

In [None]:
test_df[REPETITION] = 3
test_df[test_df.section_structure == "B"][REPETITION]