### Global Jukebox Data

The **Global Jukebox** (https://www.theglobaljukebox.org/) builds on the legacy of Alan Lomax's Cantometrics project. Anna Wood (Lomax's daughter and director of the GJ) has generously agreed to help us work with their data, and is very interested in including the results of student work in their publications.

You will need to **create a login with GJ** for the best results.

Anna has suggested we work on **Lullabies**, which afford some interesting possibilities for study. **Read her advice** here: https://docs.google.com/document/d/1S1M5p9Zfkdft5IQlTM0p-liNrNj83yh6UeQ1yL6TtfY/edit?usp=sharing

Read more about Cantometrics and the GJ here:  https://drive.google.com/file/d/1uY9jZ0dQd5Wmxg2zD-3bDFopFdzmW91h/view?usp=share_link

There is also an important **print publication about the Cantometrics** method on reserve for Music 255 in the Harris Music Library.

In [24]:
import os
from decouple import AutoConfig # Install python-decouple
import requests # Install requests
import pandas as pd
import plotly as plt
from itertools import combinations
import numpy as np

### GJ Data are on GitHub in a series of CSV files

- **Canto** = the ratings for the individual songs, covering about 40 different musical features
- **Societies** = data about the ethnic and social groups represented in the survey. These are linked to the Canto data via society ids (`society_id`)
- **Songs** = data about the songs themselves: society information, plus details of the source recording and the genres each song represents
- **Codes** = for each musical feature there are about a dozen possible ratings. They are stored as integers but really represent categories, as explained in this dataset
- **Raw Codes** = a single song can have more than one 'musical code' for a feature (as when it is slow at the start but fast at the end), so the GJ team packs several codes into a single integer. See below on this Powers of 2 method.
- **Lines Explained** = detailed explanation of the categories of musical features (texture, rhythm, melody, timbre, etc.). Read about these here: https://docs.google.com/document/d/1Ga7qxbWV1UaD8wPABYORpJc2_4WPwimIv-zQCbl_v0U/edit?usp=sharing


In [2]:
canto = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/data.csv'
societies = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/societies.csv'
songs = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/songs.csv'
codes = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/codes.csv'
lines_explained = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/variables.csv'
raw_codes = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/raw_codes.csv'

In [3]:
# cantometrics data contain the ratings for each item
# here we are making a 'short' df of the columns suggested by A Wood for the Lullaby project

canto_df = pd.read_csv(canto)
canto_df.columns.to_list()

# song_id is read as a number, not a string; convert it so it can serve as a text key when merging later
canto_df['song_id'] = canto_df['song_id'].astype('str')

# rename columns with real names of the categories
# dict to rename columns
canto_name_dict = {'line_1': 'Social_Org_Group', 
'line_10': 'Repetition',
'line_11': 'Vocal_Rhythm',
'line_16': 'Melodic_Form',
'line_18': 'Number_Phrases',
'line_20': 'Melodic_Range',
'line_24': 'Tempo',
'line_25': 'Volume',
'line_26': 'Vocal_Rubato',
'line_28': 'Glissando'}
canto_renamed = canto_df.rename(columns=canto_name_dict)

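# Now we select only the columns (lines) that Anna suggests are relevant to the Lullaby Project.
# The positions assume three identifier columns come first, so line_n sits at position n + 2.
# Note that line_16 (Melodic_Form) is renamed above but is not among the selected positions.
canto_short = canto_renamed.iloc[:,[0, 1, 2, 3, 12, 13, 20, 22, 26, 27, 28, 30]]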
canto_lullaby_features = canto_short.drop(columns="society_id")
# canto_lullaby_features.iloc[0]['Vocal_Rhythm']
canto_lullaby_features.head()

Unnamed: 0,song_id,Preferred_name,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,4241,'Are'are,64,16,2048,512,1024,512,2,512,512
1,4246,'Are'are,64,8192,512,8192,128,32,16,512,8192
2,30075,'Are'are,8208,1024,64,2048,128,512,2,8192,512
3,30120,'Are'are,8208,128,64,8192,128,512,2,8192,32
4,30121,'Are'are,32,128,64,512,1024,32,2,8192,32
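
Positional selection with `iloc` depends on the column order of the CSV staying the same. One possible alternative is to select by name. The sketch below is a minimal illustration on a one-row toy frame, not the real data: the `line_2` value is a placeholder, and `toy`, `names`, `keep`, and `short` are hypothetical names used only here.

```python
import pandas as pd

# One-row toy frame; values for song 4241 are copied from the output above,
# except line_2, which is a placeholder for a column we do not want to keep.
toy = pd.DataFrame({
    "song_id": ["4241"],
    "society_id": [10000],
    "Preferred_name": ["'Are'are"],
    "line_1": [64],
    "line_2": [2],
    "line_10": [16],
    "line_24": [512],
})

# Only the lines listed here are kept, so the dict doubles as the selection list.
names = {"line_1": "Social_Org_Group", "line_10": "Repetition", "line_24": "Tempo"}
keep = ["song_id", "Preferred_name"] + list(names)

short = toy[keep].rename(columns=names)
print(short.columns.tolist())
```

With this approach, a missing or renamed column raises a `KeyError` rather than silently selecting the wrong line.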


### Lines = The Musical Feature Categories

- These are organised as about 40 'lines', each relating to a particular dimension of musical performance
- Line 1 = Social Organization of the Group:  data type is 'Categorical', but encoded (see below) with integers

In [15]:
# lines are the musical categories.  These are the definitions

lines_explained_df = pd.read_csv(lines_explained)
lines_explained_df.head()


Unnamed: 0,id,category,title,definition,type,units,source,changes,notes,short_title
0,line_1,Social organization,The social organization of the vocal group,This line describes the social organization of...,Categorical,,,,,Social Org Vocal
1,line_2,Orchestra,Relationship of orchestra to vocal parts,The term “orchestra” refers to the performers ...,Categorical,,,,,Social Org Voc/Orch
2,line_3,Orchestra,Social organization of the orchestra,Line 3 and Line 1 (Social Organization of the ...,Categorical,,,,,Social Org Orch
3,line_4,Musical organization,Musical organization of the vocal part,The musical coordination amongst the singers i...,Categorical,,,,,Musical Org Vocal
4,line_5,Musical organization,Tonal blend of the vocal group,Both diffuse and cohesive sounds are pleasing ...,Ordinal,,,,,Tonal Blend Vocal


### The Codes tell us what the ratings actually mean for each line (musical feature)

- Normally these are integer 'codes' for each line (feature), from 1-13.
- Here are the codes for Line 1 (Social Organization of the Group)


In [5]:

codes_df = pd.read_csv(codes)
codes_df.head(13)


Unnamed: 0,var_id,code,description,name
0,line_1,1,No singers,NoSinger
1,line_1,2,"One solo singer, whether or not accompanied by...",SoloSinger
2,line_1,3,"One singer with an audience whose dancing, sho...",SoloSingerAudience
3,line_1,4,Two or more singers alternate in singing a mel...,SoloSingerConsecutive
4,line_1,5,A single predominant voice—a leader— stands ou...,UnisonPredominantLeader
5,line_1,6,No predominant lead voice is steadily heard ov...,SocialUnisonPredominantGroup
6,line_1,7,"A diffuse, individualized group performance is...",DiffuseGroup
7,line_1,8,Situations in which there is alternation betwe...,LeaderGroup
8,line_1,9,Two groups of two or more singers interact as ...,GroupGroup
9,line_1,10,A more integrated alternation of leader and ch...,LeaderGroupOverlap
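
To turn a numeric code back into its label, one option is a lookup keyed on the pair (`var_id`, `code`). The sketch below assumes only the `var_id`, `code`, and `name` columns shown in the output above and uses a five-row toy excerpt. `codes_toy`, `code_names`, and `label` are hypothetical names for illustration.

```python
import pandas as pd

# Toy excerpt of the codes table (line_1 only), copied from the output above
codes_toy = pd.DataFrame({
    "var_id": ["line_1"] * 5,
    "code": [1, 2, 3, 4, 5],
    "name": ["NoSinger", "SoloSinger", "SoloSingerAudience",
             "SoloSingerConsecutive", "UnisonPredominantLeader"],
})

# Dictionary mapping (var_id, code) -> short name
code_names = codes_toy.set_index(["var_id", "code"])["name"].to_dict()


def label(var_id, code):
    """Return the short name for a code on a given line, or 'Unknown' if absent."""
    return code_names.get((var_id, code), "Unknown")


print(label("line_1", 2))   # SoloSinger
print(label("line_1", 99))  # Unknown
```

The same code number means different things on different lines, which is why the line id is part of the key.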


### Combined Codes = Powers of 2.  

- But when we look at the canto data, we will often find integers much larger than 13!  What is going on?

```
canto_lullaby_features.iloc[0]['Vocal_Rhythm']
2048
```

- Since some songs might use two or more different aspects of a given musical feature (for instance, a slow introduction and a fast conclusion), the GJ data combine the 1-13 ratings according to a system they call **Powers of 2**.  In brief:
    - use the original rating as an exponent of 2.  If the rating was "4", then 2 to the 4th power = 16.  That's the integer recorded in the canto data
    - If there are TWO ratings, raise 2 to each of them, then add the results together!  Original ratings of 2 and 4 would be 2 to the 2nd (4) plus 2 to the 4th (16) = 20.
- The "raw_codes" csv unpacks these combinations, which can involve up to three 'original' codes combined into a single Powers of 2 integer.
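
The arithmetic can be sketched in a few lines. This is an illustration of the scheme as described here, not the GJ team's own code; `encode_ratings` and `decode_rating` are hypothetical helper names.

```python
def encode_ratings(ratings):
    """Combine one or more original ratings (1-13) into one Powers of 2 integer."""
    return sum(2 ** r for r in set(ratings))


def decode_rating(value):
    """Recover the original ratings from a combined integer, highest first."""
    # Each rating n contributes 2**n, i.e. bit n of the integer is set.
    return tuple(n for n in range(13, 0, -1) if value & (1 << n))


print(encode_ratings([4]))      # 16
print(encode_ratings([2, 4]))   # 20
print(decode_rating(20))        # (4, 2)
print(decode_rating(2048))      # (11,)
```

Because each rating occupies its own binary digit, a combined integer can always be split back into its components without ambiguity.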

```
raw_code_list = pd.read_csv(raw_codes)
set_combined_codes = set(raw_code_list.code)
raw_code_list
```

In [6]:

raw_code_list = pd.read_csv(raw_codes)
set_combined_codes = set(raw_code_list.code)
raw_code_list.head(25)

Unnamed: 0,var_id,code,original_1,original_2,original_3,code_description,shortname
0,line_1,0,0,-,-,No Reading,No Reading
1,line_1,2,1,-,-,No singers,NoSinger
2,line_1,4,2,-,-,"One solo singer, whether or not accompanied by...",SoloSinger
3,line_1,8,3,-,-,"One singer with an audience whose dancing, sho...",SoloSingerAudience
4,line_1,16,4,-,-,Two or more singers alternate in singing a mel...,SoloSingerConsecutive
5,line_1,20,4,2,-,Solo singer consecutive and solo singer,SoloSingerConsecutiveAndSoloSinger
6,line_1,32,5,-,-,A single predominant voice—a leader— stands ou...,UnisonPredominantLeader
7,line_1,34,5,1,-,Unison predominant leader and no singer,UnisonPredominantLeaderAndNoSinger
8,line_1,36,5,2,-,Unison predominant leader and solo singer,UnisonPredominantLeaderAndSoloSinger
9,line_1,48,5,4,-,Unison predominant leader and solo singer cons...,UnisonPredominantLeaderAndSoloSingerConsecutive


In [7]:
# There are in fact 27 values in the real data that are NOT among the combined codes in the raw_codes list loaded above!
canto_data_set_values = set(canto_df.iloc[:, 3:].stack().unique())
diff = canto_data_set_values - set_combined_codes
len(diff)

27

### Checking the Powers and Combinations 

- Here we build out all the values of 2 raised to each power from 1 to 13
- And also build out all the unique sums of combinations of 1, 2, or 3 of these integers
- This in turn will allow us to retrieve the original codes from the combined numbers found in the canto data set
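
Retrieving the original codes only works if no two combinations produce the same sum. The short check below is a sketch of that reasoning: it counts the combinations and confirms that every sum is distinct. The variable names are illustrative.

```python
from itertools import combinations

powers = [2 ** n for n in range(1, 14)]            # 2, 4, ..., 8192
combos = [c for k in (1, 2, 3) for c in combinations(powers, k)]
sums = [sum(c) for c in combos]

print(len(combos))                    # 13 + 78 + 286 = 377
print(len(set(sums)) == len(sums))    # True: no two combinations share a sum
```

Distinct sums follow from binary notation: each power of 2 sets a different bit, so a sum records exactly which powers went into it.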

In [8]:
powers = [2**n for n in range(1, 14)]
combo_list = list(combinations(powers, 1)) + list(combinations(powers, 2)) + list(combinations(powers, 3))
sums = [{"sum" : sum(t), "full_tuple": t} for t in combo_list]
df = pd.DataFrame(sums)
df['sorted_original_values'] = df.full_tuple.apply(lambda x: tuple(sorted([np.log2(value) for value in x], reverse=True)))
df.sort_values(by="sum")

Unnamed: 0,sum,full_tuple,sorted_original_values
0,2,"(2,)","(1.0,)"
1,4,"(4,)","(2.0,)"
13,6,"(2, 4)","(2.0, 1.0)"
2,8,"(8,)","(3.0,)"
14,10,"(2, 8)","(3.0, 1.0)"
...,...,...,...
356,12416,"(128, 4096, 8192)","(13.0, 12.0, 7.0)"
366,12544,"(256, 4096, 8192)","(13.0, 12.0, 8.0)"
372,12800,"(512, 4096, 8192)","(13.0, 12.0, 9.0)"
375,13312,"(1024, 4096, 8192)","(13.0, 12.0, 10.0)"


###  A Dictionary of Summed Values and the Original Values

- This will allow us to translate from the summed powers of 2 back to the component ratings.

In [10]:
dictionary_of_value_sets = dict(zip(df["sum"], df["sorted_original_values"]))
# dictionary_of_value_sets

In [56]:
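lullaby_feature_workbook = canto_lullaby_features
# Translate each combined integer into its tuple of original ratings; values missing from the dictionary become 0.
# The slice starts at position 3 (Repetition), so Social_Org_Group (position 2) keeps its raw Powers of 2 value.
# On pandas 2.1+, DataFrame.map replaces the deprecated applymap.
transformed_lullaby_features = lullaby_feature_workbook.iloc[:, 3:].applymap(lambda x : dictionary_of_value_sets.get(x, 0))
lullabies_unpacked = pd.concat([lullaby_feature_workbook.iloc[:, :3], transformed_lullaby_features], axis="columns")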
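
A variant of this lookup that may be easier to work with stores the ratings as integers rather than floats and returns an empty tuple for any value it cannot decode, so every cell has the same type. The sketch below runs on a toy Series; `lookup` and `toy` are hypothetical names, and the values 0 and 3 stand in for "No Reading" and for a value outside the scheme.

```python
from itertools import combinations

import pandas as pd

powers = [2 ** n for n in range(1, 14)]

# sum of powers -> tuple of integer exponents, highest first
lookup = {
    sum(c): tuple(sorted((p.bit_length() - 1 for p in c), reverse=True))
    for k in (1, 2, 3)
    for c in combinations(powers, k)
}

toy = pd.Series([16, 20, 8208, 0, 3], name="Social_Org_Group")
decoded = toy.map(lambda v: lookup.get(v, ()))
print(decoded.tolist())
# [(4,), (4, 2), (13, 4), (), ()]
```

An empty tuple makes undecodable cells easy to count afterwards, for example with `decoded.map(len).eq(0).sum()`.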

In [57]:
lullabies_unpacked

Unnamed: 0,song_id,Preferred_name,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,4241,'Are'are,64,"(4.0,)","(11.0,)","(9.0,)","(10.0,)","(9.0,)","(1.0,)","(9.0,)","(9.0,)"
1,4246,'Are'are,64,"(13.0,)","(9.0,)","(13.0,)","(7.0,)","(5.0,)","(4.0,)","(9.0,)","(13.0,)"
2,30075,'Are'are,8208,"(10.0,)","(6.0,)","(11.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(9.0,)"
3,30120,'Are'are,8208,"(7.0,)","(6.0,)","(13.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(5.0,)"
4,30121,'Are'are,32,"(7.0,)","(6.0,)","(9.0,)","(10.0,)","(5.0,)","(1.0,)","(13.0,)","(5.0,)"
...,...,...,...,...,...,...,...,...,...,...,...
5771,364,Hokkaido Japanese,4,"(1.0,)","(13.0,)","(5.0,)","(10.0,)","(3.0,)","(7.0,)","(1.0,)","(9.0,)"
5772,183,Eastern Ojibwa,4,"(1.0,)","(11.0,)","(9.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(13.0,)"
5773,180,Gayogo̱hó꞉nǫʼ (Cayuga),256,"(10.0,)","(11.0,)","(3.0,)","(10.0,)","(11.0,)","(13.0,)","(13.0,)","(13.0,)"
5774,181,Gayogo̱hó꞉nǫʼ (Cayuga),256,"(10.0,)","(11.0,)","(3.0,)","(7.0,)","(9.0,)","(10.0,)","(9.0,)","(5.0,)"


In [46]:
# data about the social groups
societies_df = pd.read_csv(societies)
societies_df.head()

Unnamed: 0,Society_latitude,Society_longitude,Homeland_latitude_of_diasporic_peoples,Homeland_longitude_of_diasporic_peoples,Area_latitude,Area_longitude,Region,Division,Subregion,Area,...,eHRAF_soc_w_xd_id_2?,eHRAF_OWC.1,eHRAF_OWC_Name.1,eHRAF_SubOWC_in_DPLACE.1,eHRAF_SubOWC_Name_in_DPLACE.1,eHRAF_subOWC_FINAL_STATUS_2021.1,STATUS_JULY_31_2021,language match checked by KK,D-PLACE match checked by KK,KK_ISSUE
0,1.55,28.44,,,1.83,29.5,Africa,Central Africa,Equatorial Central Africa,"Ituri Prov, N E DR Congo",...,NONE,NONE,NONE,NONE,NONE,NONE,D-PLACE_GJB_society_names_AND_languages_match,DONE,DONE,
1,2.36,31.01,,,2.79,30.86,Africa,Central Africa,Equatorial Central Africa,N E DR Congo/ N W Uganda,...,NONE,NONE,NONE,NONE,NONE,NONE,D-PLACE_GJB_society_names_AND_languages_match,DONE,DONE,
2,-2.27,14.44,,,-2.07,15.37,Africa,Central Africa,Equatorial Central Africa,"Plateaux Dept, C Congo",...,,,,,,,NOT_CHECKED_BY_KK,NOT_CHECKED_BY_KK,NOT_CHECKED_BY_KK,
3,-5.12,18.04,,,-5.13,15.69,Africa,Central Africa,Equatorial Central Africa,S W DR Congo,...,NONE,NONE,NONE,NONE,NONE,NONE,D-PLACE_GJB_society_names_AND_languages_match,DONE,DONE,
4,3.96,16.59,,,5.34,23.13,Africa,Central Africa,N Central Africa,Cameroon/ Central African Rep/ DR Congo/ South...,...,NONE,NONE,NONE,NONE,NONE,NONE,D-PLACE_GJB_society_names_match; language_assi...,DONE,DONE,


In [47]:
# now the data about the individual songs
song_df = pd.read_csv(songs)
song_df.columns.to_list()
song_cols = ["song_id", 
             "Local_latitude",
             "society_id",
             'Local_longitude',
             'Homeland_latitude',
             'Homeland_longitude',
             'Region',
             'Genre']
song_data_selected = song_df[song_cols].fillna('')
len(song_data_selected)

6038

In [58]:
combined = pd.merge(lullabies_unpacked, song_data_selected,
how='left', on='song_id')
combined.fillna('')


Unnamed: 0,song_id,Preferred_name,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando,Local_latitude,society_id,Local_longitude,Homeland_latitude,Homeland_longitude,Region,Genre
0,4241,'Are'are,64,"(4.0,)","(11.0,)","(9.0,)","(10.0,)","(9.0,)","(1.0,)","(9.0,)","(9.0,)",-9.57,10000,161.37,,,Oceania,Women's Song
1,4246,'Are'are,64,"(13.0,)","(9.0,)","(13.0,)","(7.0,)","(5.0,)","(4.0,)","(9.0,)","(13.0,)",-9.57,10000,161.37,,,Oceania,
2,30075,'Are'are,8208,"(10.0,)","(6.0,)","(11.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(9.0,)",-9.21,10000,161.16,,,Oceania,Lullaby
3,30120,'Are'are,8208,"(7.0,)","(6.0,)","(13.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(5.0,)",-9.32,10000,161.33,,,Oceania,Lullaby; Roromera
4,30121,'Are'are,32,"(7.0,)","(6.0,)","(9.0,)","(10.0,)","(5.0,)","(1.0,)","(13.0,)","(5.0,)",-9.32,10000,161.33,,,Oceania,Lullaby; Roromera
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5771,364,Hokkaido Japanese,4,"(1.0,)","(13.0,)","(5.0,)","(10.0,)","(3.0,)","(7.0,)","(1.0,)","(9.0,)",41.86,62554,140.12,,,North Eurasia,Fisherman's Song; Work Song; Fishing Song
5772,183,Eastern Ojibwa,4,"(1.0,)","(11.0,)","(9.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(13.0,)",46.77,62555,-88.48,,,North America,Animal Song; Spirit Song; Clan Song; Totem Song
5773,180,Gayogo̱hó꞉nǫʼ (Cayuga),256,"(10.0,)","(11.0,)","(3.0,)","(10.0,)","(11.0,)","(13.0,)","(13.0,)","(13.0,)",43.05,62556,-80.12,,,North America,Dance Song; Magic Song; Rain Song
5774,181,Gayogo̱hó꞉nǫʼ (Cayuga),256,"(10.0,)","(11.0,)","(3.0,)","(7.0,)","(9.0,)","(10.0,)","(9.0,)","(5.0,)",43.05,62556,-80.12,,,North America,Thanksgiving Song; Dance Song; Chief's Song; C...
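
After a left merge it can be worth checking how many rows actually found a partner in the songs table. The sketch below uses the `indicator` option of `merge` on toy frames. The id `99999` is a made-up example of a song with no match, and `features`, `songs_toy`, and `checked` are hypothetical names.

```python
import pandas as pd

# Both song_id columns are strings; mismatched key dtypes are a common cause of failed merges
features = pd.DataFrame({"song_id": ["4241", "4246", "99999"],
                         "Tempo": [512, 32, 128]})
songs_toy = pd.DataFrame({"song_id": ["4241", "4246"],
                          "Genre": ["Women's Song", None]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
checked = features.merge(songs_toy, how="left", on="song_id", indicator=True)
print(checked["_merge"].value_counts())

unmatched = checked.loc[checked["_merge"] == "left_only", "song_id"].tolist()
print(unmatched)  # ['99999']
```

Rows marked `left_only` carry missing values in every column that came from the songs table, including `Genre`.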


### Filter for Lullabies!

```
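lullabies = combined[combined['Genre'].str.contains('Lullaby', na=False)]
```

In [59]:

# na=False treats songs with a missing Genre (possible after the left merge) as non-lullabies
lullabies = combined[combined['Genre'].str.contains('Lullaby', na=False)]
# lullabies["Glissando"].value_counts()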
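
Since a single song can carry several genre labels separated by semicolons, it may help to see both the substring filter and a one-row-per-label view. The sketch below runs on a toy frame whose ids and genres are copied from the outputs above; `toy`, `is_lullaby`, and `genres` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "song_id": ["30075", "30120", "4246", "364"],
    "Genre": ["Lullaby", "Lullaby; Roromera", None,
              "Fisherman's Song; Work Song"],
})

# na=False counts a missing genre as "not a lullaby" instead of propagating NaN
is_lullaby = toy["Genre"].str.contains("Lullaby", na=False)
print(toy.loc[is_lullaby, "song_id"].tolist())  # ['30075', '30120']

# Split "A; B" into one row per genre label, then trim stray spaces
genres = (toy.assign(Genre=toy["Genre"].str.split(";"))
             .explode("Genre")
             .reset_index(drop=True))
genres["Genre"] = genres["Genre"].str.strip()
print(genres["Genre"].value_counts().to_dict())
```

The exploded view makes it possible to count labels such as 'Roromera' on their own, which the substring filter does not show.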

In [60]:
# an initial exploratory grouping
grouped = lullabies.groupby(['Region', 'Melodic_Range'])['song_id'].count()
grouped


Region           Melodic_Range
Africa           (4.0,)            1
                 (7.0,)            2
                 (10.0,)           2
                 (13.0,)           1
Australia        (7.0,)            1
Central America  (4.0,)            5
                 (7.0,)            9
                 (10.0,)           1
Central Asia     (7.0,)            1
East Asia        (7.0,)            1
                 (10.0,)           2
Europe           (4.0,)            6
                 (7.0,)           15
                 (10.0,)           2
North America    (4.0,)            4
                 (7.0,)            3
                 (10.0,)           3
                 (13.0,)           1
North Eurasia    (4.0,)            2
                 (7.0,)            2
                 (10.0,)           1
Oceania          (1.0,)            2
                 (4.0,)            4
                 (4.0, 1.0)        1
                 (7.0,)            5
                 (10.0,)           4
South A