### Global Jukebox Data

The Global Jukebox data are on GitHub in a series of CSV files, each containing a distinct 'data type'.  It might at first seem confusing to keep information about the recordings separate from information about the societies, and both of these separate from the explanation of the rating system and categories.  But this is the basis of a *relational database*: we often need to coordinate data across many 'tables'.  Fortunately, Pandas makes short work of merging and combining data, or of filtering one table on the basis of information gleaned from another.  In the course of your work with Global Jukebox data you will often need to coordinate and combine data from the following sets:

- **Canto** = the ratings for the individual songs, with about 37 different musical features in all.  
- **Societies** = data about the ethnic and social groups represented in the survey.  These are linked to the Canto data via group ids.
- **Songs** = more data about the recordings themselves, also with data about the societies, but in addition information about the source recording and the genres they represent.
- **Codes** = for each music feature, there are a dozen possible ratings.  They are given as integers but really represent complex combinations of social and sonic features.  This dataset explains the meanings represented by each value.
- **Raw Codes** = Since a song can have more than one 'musical code' associated with it (as when it's slow at the start but fast at the end), the GJ team uses a system that encodes several codes in a single integer.  See below on this **Powers of 2** method.
- **Lines Explained** = detailed explanation of the categories of musical features (texture, rhythm, melody, timbre, etc).  Read about these [here](https://docs.google.com/document/d/1Ga7qxbWV1UaD8wPABYORpJc2_4WPwimIv-zQCbl_v0U/edit?usp=sharing).
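
As a rough illustration of how these tables relate, the sketch below merges and filters two miniature stand-in tables on their shared `society_id` key. The names `canto_toy` and `societies_toy` are hypothetical, and the region given for society 62554 is a placeholder rather than a value taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for two Global Jukebox tables (only a few columns, a few rows).
canto_toy = pd.DataFrame({
    "song_id": ["4241", "4246", "364"],
    "society_id": [10000, 10000, 62554],
    "line_1": [64, 64, 4],
})
societies_toy = pd.DataFrame({
    "society_id": [10000, 62554],
    "Region": ["Oceania", "East Asia"],  # second value is a placeholder
})

# Attach each song's region by matching on the shared society_id key.
merged_toy = canto_toy.merge(societies_toy, on="society_id", how="left")

# Filter one table using information gleaned from the other.
oceania_ids = societies_toy.loc[societies_toy["Region"] == "Oceania", "society_id"]
oceania_songs = canto_toy[canto_toy["society_id"].isin(oceania_ids)]
```

The same two patterns (`merge` on a key, and `isin` against ids drawn from another table) should carry over to the full tables once they are loaded.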



```python
canto = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/data.csv'
societies = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/societies.csv'
songs = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/songs.csv'
codes = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/codes.csv'
raw_codes = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/raw_codes.csv'
lines_explained = 'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/variables.csv'
```


In [2]:
import os

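import requests  # third-party package; install it first if it is missing
import pandas as pd
import plotly.express as px  # the plotting interface used for the charts in this notebook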
from itertools import combinations
import numpy as np



In [330]:

# List of URLs to the data files
data_files_list = [
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/data.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/societies.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/songs.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/codes.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/variables.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/raw_codes.csv'
]

# Short names for DataFrames
short_names = ['canto', 'societies', 'songs', 'codes', 'lines_explained', 'raw_codes']

# Initialize empty variables for each DataFrame
canto = None
societies = None
songs = None
codes = None
lines_explained = None
raw_codes = None

# Loop through the list of URLs and short names
for url, short_name in zip(data_files_list, short_names):
    # Read the CSV file from the URL into a DataFrame
    df = pd.read_csv(url)
    
    # Replace non-breaking spaces in column names with regular spaces
    df.columns = df.columns.str.replace('\xa0', ' ')
    
    # Iterate over each column to replace non-breaking spaces in cell values
    for col in df.columns:
        # Check if the column contains string values
        if df[col].dtype == 'object':
            df[col] = df[col].str.replace('\xa0', ' ')
    
    # Assign the modified DataFrame to the corresponding variable
    globals()[short_name] = df
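
A possible alternative to writing into `globals()` is to keep the tables in a dictionary and to put the cleaning step in a small function. The sketch below is only an illustration on a made-up miniature table: `clean_nbsp`, `demo`, and `tables` are hypothetical names, and the function replaces non-breaking spaces only in values that are actually strings, so that other values in a mixed column are left as they are.

```python
import pandas as pd

def clean_nbsp(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with non-breaking spaces replaced in headers and string cells."""
    out = df.copy()
    out.columns = out.columns.str.replace("\xa0", " ", regex=False)
    for col in out.columns:
        # Skip numeric columns; elsewhere, only touch values that are real strings.
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].map(
                lambda v: v.replace("\xa0", " ") if isinstance(v, str) else v
            )
    return out

# Hypothetical miniature table standing in for one downloaded CSV.
demo = pd.DataFrame({"short\xa0title": ["Social\xa0Org Vocal", None], "code": [1, 2]})

# Keep cleaned tables in a dictionary keyed by short name.
tables = {"lines_explained": clean_nbsp(demo)}
```

With this layout a table is retrieved as `tables["lines_explained"]` rather than through a module-level variable.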




### Now you can access each DataFrame directly by its variable name

```python
print(canto.head())
print(societies.head())
print(songs.head())
print(codes.head())
print(lines_explained.head())
print(raw_codes.head())
```

In [204]:
canto.head()

Unnamed: 0,song_id,Preferred_name,society_id,line_1,line_2,line_3,line_4,line_5,line_6,line_7,...,line_28,line_29,line_30,line_31,line_32,line_33,line_34,line_35,line_36,line_37
0,4241,'Are'are,10000,64,2,2,8192,1024,1024,2,...,512,128,16,8192,1024,1024,16,128,1024,128
1,4246,'Are'are,10000,64,4096,8192,128,8192,8192,8192,...,8192,2,8192,128,1024,8192,16,1024,1024,8192
2,30075,'Are'are,10000,8208,2,2,8192,8192,1024,2,...,512,2,1024,8192,144,1024,8192,8192,8192,8192
3,30120,'Are'are,10000,8208,2,2,8192,1024,1024,2,...,32,2,128,8192,128,256,1024,8192,8192,1024
4,30121,'Are'are,10000,32,2,2,8192,1024,1024,2,...,32,128,128,8192,128,256,1024,8192,8192,1024


## Lines Explained

In [205]:
ordinals = lines_explained[lines_explained['type'] == 'Ordinal']
ordinals.drop(['units', 'source', 'changes', 'notes', 'short_title'], axis=1)

Unnamed: 0,id,category,title,definition,type
4,line_5,Musical organization,Tonal blend of the vocal group,Both diffuse and cohesive sounds are pleasing ...,Ordinal
6,line_7,Musical organization,Musical organization of the orchestra,Overall musical coordination amongst members o...,Ordinal
7,line_8,Musical organization,Tonal blend of the orchestra,The concept of tonal blend applies to orchestr...,Ordinal
8,line_9,Orchestra,Rhythmic coordination of the orchestra,Rates the degree of rhythmic coordination betw...,Ordinal
15,line_16,Melodic form,Melodic form,Line 16 is an attempt to deal succinctly with ...,Ordinal
16,line_17,Metrical pattern,Phrase length,A simple five-point scale is used to describe ...,Ordinal
17,line_18,Melodic form,Number of phrases,Determine the number of melodic phrases that o...,Ordinal
18,line_19,Articulation,Position of final tone,The relation of the final note to the total ra...,Ordinal
19,line_20,Musical characterstics,Melodic range,This is a method of judging the total melodic ...,Ordinal
20,line_21,Articulation,Interval size,An interval is the distance in pitch between t...,Ordinal


In [208]:
categoricals = lines_explained[lines_explained['type'] == 'Categorical']
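# Note: drop() returns a new DataFrame rather than changing `categoricals` in place.
# The result is not assigned here, so the display below still shows every column.
categoricals.drop(['units', 'source', 'changes', 'notes', 'short_title'], axis=1)
categoricals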

Unnamed: 0,id,category,title,definition,type,units,source,changes,notes,short_title
0,line_1,Social organization,The social organization of the vocal group,This line describes the social organization of...,Categorical,,,,,Social Org Vocal
1,line_2,Orchestra,Relationship of orchestra to vocal parts,The term “orchestra” refers to the performers ...,Categorical,,,,,Social Org Voc/Orch
2,line_3,Orchestra,Social organization of the orchestra,Line 3 and Line 1 (Social Organization of the ...,Categorical,,,,,Social Org Orch
3,line_4,Musical organization,Musical organization of the vocal part,The musical coordination amongst the singers i...,Categorical,,,,,Musical Org Vocal
5,line_6,Musical organization,Rhythmic coordination of the vocal group,The degree of rhythmic coordination between me...,Categorical,,,,,Rhythm Blend Vocal
9,line_10,Social organization,Repetition of text,"Listening to the text as it is performed, cons...",Categorical,,,,,Text Repetition
10,line_11,Metrical pattern,Overall rhythm: vocal,"In most musical styles, the performer or perfo...",Categorical,,,,,Overall Voc Rhythm
11,line_12,Rhythmic relationship,Rhythmic relationship within the vocal group,Singing groups establish their rhythmic activi...,Categorical,,,,,Rhythm Rel'n Vocal
12,line_13,Metrical pattern,Overall rhythm: orchestra,"In most musical styles, the performer or perfo...",Categorical,,,,,Overall Orch Rhythm
13,line_14,Rhythmic relationship,Rhythmic relationship within the orchestra,The various types of relationships between the...,Categorical,,,,,Rythm Rel'n Orch


### Dictionary of Lines and Short Titles

In [207]:

my_dict = pd.Series(lines_explained.short_title.values, index=lines_explained.id).to_dict()
my_dict

{'line_1': 'Social Org Vocal',
 'line_2': 'Social Org Voc/Orch',
 'line_3': 'Social Org Orch',
 'line_4': 'Musical Org Vocal',
 'line_5': 'Tonal Blend Vocal',
 'line_6': 'Rhythm Blend Vocal',
 'line_7': 'Music Orch Org',
 'line_8': 'Tonal Blend Orch',
 'line_9': 'Rhythm Blend Orch',
 'line_10': 'Text Repetition',
 'line_11': 'Overall Voc Rhythm',
 'line_12': "Rhythm Rel'n Vocal",
 'line_13': 'Overall Orch Rhythm',
 'line_14': "Rythm Rel'n Orch",
 'line_15': 'Melodic Shape',
 'line_16': 'Melodic Form',
 'line_17': 'Phrase Length',
 'line_18': 'No. Phrases',
 'line_19': 'Pos Final Tone',
 'line_20': 'Melodic Range',
 'line_21': 'Interval Width',
 'line_22': 'Polyphonic Type',
 'line_23': 'Embellishment',
 'line_24': 'Tempo',
 'line_25': 'Volume',
 'line_26': 'Vocal Rubato',
 'line_27': 'Orch Rubato',
 'line_28': 'Glissando',
 'line_29': 'Melisma',
 'line_30': 'Tremolo',
 'line_31': 'Glottal',
 'line_32': 'Vocal Register',
 'line_33': 'Vocal Width',
 'line_34': 'Nasality',
 'line_35': 'Ra

### The "Classes" (Categories) of Lines

In [209]:
lines_explained['category'].unique().tolist()


['Social organization',
 'Orchestra',
 'Musical organization',
 'Metrical pattern',
 'Rhythmic relationship',
 'Melodic form',
 'Articulation',
 'Musical characterstics',
 'Ornament',
 'Dynamics',
 'Vocal noise']

In [150]:
# cantometrics data contain the ratings for each item
# here we are making a 'short' df of the columns suggested by A Wood for the Lullaby project

# canto_df = pd.read_csv(canto)
canto.columns.to_list()

# song_id is read as a number; convert it to a string so it matches the ids in the other tables
canto['song_id'] = canto['song_id'].astype('str')

# rename columns with real names of the categories
# dict to rename columns
canto_name_dict = {'line_1': 'Social_Org_Group', 
'line_10': 'Repetition',
'line_11': 'Vocal_Rhythm',
'line_16': 'Melodic_Form',
'line_18': 'Number_Phrases',
'line_20': 'Melodic_Range',
'line_24': 'Tempo',
'line_25': 'Volume',
'line_26': 'Vocal_Rubato',
'line_28': 'Glissando'}
canto_renamed = canto.rename(columns=canto_name_dict)

# Now we select only the columns (lines) that Anna suggests are relevant to the Lullaby Project

canto_short = canto_renamed.iloc[:,[0, 1, 2, 3, 12, 13, 20, 22, 26, 27, 28, 30]]
canto_lullaby_features = canto_short.drop(columns="society_id")
# canto_lullaby_features.iloc[0]['Vocal_Rhythm']
canto_lullaby_features.head()

Unnamed: 0,song_id,Preferred_name,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,4241,'Are'are,64,16,2048,512,1024,512,2,512,512
1,4246,'Are'are,64,8192,512,8192,128,32,16,512,8192
2,30075,'Are'are,8208,1024,64,2048,128,512,2,8192,512
3,30120,'Are'are,8208,128,64,8192,128,512,2,8192,32
4,30121,'Are'are,32,128,64,512,1024,32,2,8192,32
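
Selecting columns by position with `iloc` depends on the column order never changing. As a hedged alternative, the sketch below selects by name and then renames, using a one-row toy frame (`canto_toy`, with values copied from the first row shown above) and a shortened, illustrative `name_map`.

```python
import pandas as pd

# One-row toy frame with a few of the canto columns.
canto_toy = pd.DataFrame({
    "song_id": ["4241"],
    "Preferred_name": ["'Are'are"],
    "society_id": [10000],
    "line_1": [64],
    "line_2": [2],
    "line_10": [16],
    "line_24": [512],
})

# Shortened version of the renaming dictionary, for illustration only.
name_map = {"line_1": "Social_Org_Group", "line_10": "Repetition", "line_24": "Tempo"}

# Keep the id columns plus every line named in the dictionary, then rename.
keep = ["song_id", "Preferred_name"] + list(name_map)
short_toy = canto_toy[keep].rename(columns=name_map)
```

Because the selection is driven by the dictionary keys, adding or removing a feature only requires editing `name_map`.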


### The Codes tell us what the ratings actually mean for each line (musical type)

- Normally these are integer 'codes' for each line (feature), from 1 to 13.
- Here are the codes for Line 1 (Social Organization of the Group)
- The _meaning_ of the code changes from line to line.  

For example, in line 1, code 1 means 'no singers', but in line 2 it means no accompaniment.

As we see, the codes move systematically through the kinds of roles and features one might hear in a given piece: how many are performing, what they are performing, how, when and where they are performing it, who is listening, etc.
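
Since the meaning of a code depends on the line, a lookup needs both pieces of information. The sketch below builds such a lookup from a four-row excerpt of the codes table; `codes_toy`, `code_names`, and `describe` are hypothetical names, and the full table would come from `codes.csv`.

```python
import pandas as pd

# Four rows excerpted from the codes table (var_id, code, name).
codes_toy = pd.DataFrame({
    "var_id": ["line_1", "line_1", "line_37", "line_37"],
    "code": [1, 2, 1, 4],
    "name": ["NoSinger", "SoloSinger", "VeryPreciseEnunciation", "PreciseEnunciation"],
})

# Key the lookup on the (line, code) pair, because a code alone is ambiguous.
code_names = dict(zip(zip(codes_toy["var_id"], codes_toy["code"]), codes_toy["name"]))

def describe(line, code):
    """Return the short name for a code on a given line, or 'unknown'."""
    return code_names.get((line, code), "unknown")
```

Here `describe("line_1", 1)` and `describe("line_37", 1)` return different names even though the code is the same.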


### Combined Codes = Powers of 2.  

- But when we look at the canto data, we will often find integers far larger than 13!  What is going on?

```
canto_lullaby_features.iloc[0]['Vocal_Rhythm']
2048
```

- Since some songs might use two or more different aspects of a given musical feature (for instance, a slow introduction and a fast conclusion), the GJ data combine the 1 to 13 ratings according to a system they call **Powers of 2**.  In brief:
    - use the original rating as a power of 2.  If the rating was "4", then 2 to the 4th power = 16.  That's the integer recorded in the canto data.  The 2048 shown above is 2 to the 11th power, so the original rating was 11.
    - If there are TWO ratings, then use each as a power of 2, then add them together!  Original ratings of 2 and 4 would be 2 to the 2nd (4) plus 2 to the 4th (16) = 20.
- The "raw_codes" csv unpacks these combinations, which can involve up to three 'original' codes combined into a single Power of 2 integer.
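
The arithmetic just described can be sketched with two small helper functions. These are illustrative only (the names `encode_ratings` and `decode_value` are not part of the Global Jukebox code); the decoder uses a bitwise test, which works because each rating occupies its own binary digit.

```python
def encode_ratings(ratings):
    """Combine ratings (1 to 13) into one integer by summing 2**rating."""
    return sum(2 ** r for r in set(ratings))

def decode_value(value):
    """Recover the ratings, highest first, from a combined integer."""
    return tuple(n for n in range(13, 0, -1) if int(value) & (1 << n))

combined = encode_ratings([2, 4])      # 4 + 16
single = decode_value(2048)            # one rating
double = decode_value(8208)            # 8192 + 16, two ratings
```

Under these definitions `combined` is 20, `single` is `(11,)`, and `double` is `(13, 4)`, matching the worked figures in the list above.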

```
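# raw_codes may already be a DataFrame (if the loading loop was run) or still a URL string
raw_code_list = raw_codes if isinstance(raw_codes, pd.DataFrame) else pd.read_csv(raw_codes)
set_combined_codes = set(raw_code_list.code)
raw_code_list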
```

### Checking the Powers and Combinations 

- Here we build all the values of 2 raised to a power from 1 to 13
- We also build the sums of every combination of 1, 2, or 3 of these values (each sum is unique, because distinct powers of 2 never add up to the same total)
- This in turn will allow us to retrieve the original codes from the combined numbers found in the canto data set

In [178]:
powers = [2**n for n in range(1, 14)]
combo_list = list(combinations(powers, 1)) + list(combinations(powers, 2)) + list(combinations(powers, 3))
sums = [{"sum" : sum(t), "full_tuple": t} for t in combo_list]
sums_df = pd.DataFrame(sums)
sums_df['sorted_original_values'] = sums_df.full_tuple.apply(lambda x: tuple(sorted([np.log2(value) for value in x], reverse=True)))
sums_df.sort_values(by="sum")

Unnamed: 0,sum,full_tuple,sorted_original_values
0,2,"(2,)","(1.0,)"
1,4,"(4,)","(2.0,)"
13,6,"(2, 4)","(2.0, 1.0)"
2,8,"(8,)","(3.0,)"
14,10,"(2, 8)","(3.0, 1.0)"
...,...,...,...
356,12416,"(128, 4096, 8192)","(13.0, 12.0, 7.0)"
366,12544,"(256, 4096, 8192)","(13.0, 12.0, 8.0)"
372,12800,"(512, 4096, 8192)","(13.0, 12.0, 9.0)"
375,13312,"(1024, 4096, 8192)","(13.0, 12.0, 10.0)"


In [194]:
sums_df

Unnamed: 0,sum,full_tuple,sorted_original_values
0,2,"(2,)","(1.0,)"
1,4,"(4,)","(2.0,)"
2,8,"(8,)","(3.0,)"
3,16,"(16,)","(4.0,)"
4,32,"(32,)","(5.0,)"
...,...,...,...
372,12800,"(512, 4096, 8192)","(13.0, 12.0, 9.0)"
373,7168,"(1024, 2048, 4096)","(12.0, 11.0, 10.0)"
374,11264,"(1024, 2048, 8192)","(13.0, 11.0, 10.0)"
375,13312,"(1024, 4096, 8192)","(13.0, 12.0, 10.0)"


In [185]:
codes

Unnamed: 0,var_id,code,description,name
0,line_1,1,No singers,NoSinger
1,line_1,2,"One solo singer, whether or not accompanied by...",SoloSinger
2,line_1,3,"One singer with an audience whose dancing, sho...",SoloSingerAudience
3,line_1,4,Two or more singers alternate in singing a mel...,SoloSingerConsecutive
4,line_1,5,A single predominant voice—a leader— stands ou...,UnisonPredominantLeader
...,...,...,...,...
211,line_37,1,Very precise enunciation. Highly articulated c...,VeryPreciseEnunciation
212,line_37,4,Precise enunciation. Clearly articulated conso...,PreciseEnunciation
213,line_37,7,Moderate enunciation. A moderate degree of enu...,ModerateEnunciation
214,line_37,10,Softened enunciation. Consonants are hard to d...,SoftenedEnunciation


###  A Dictionary of Summed Values and the Original Values

- This will allow us to translate from the summed powers of 2 back to the component ratings.

In [179]:
dictionary_of_value_sets = dict(zip(sums_df["sum"], sums_df["sorted_original_values"]))
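
Two properties of this dictionary are worth checking. The sketch below, which rebuilds the sums independently of `sums_df`, confirms that no two combinations share a sum, and shows that a value built from four ratings would not be found. The four-rating value is a hypothetical case used for illustration, not something observed in the data.

```python
from itertools import combinations

powers = [2 ** n for n in range(1, 14)]
combos = [c for k in (1, 2, 3) for c in combinations(powers, k)]
sums = [sum(c) for c in combos]

# Distinct powers of 2 never collide, so every sum identifies exactly one combination.
all_unique = len(sums) == len(set(sums))

# A value built from four ratings (hypothetical) is absent from the table.
four_codes = 2 + 4 + 8 + 16
is_covered = four_codes in set(sums)
```

There are 13 + 78 + 286 = 377 combinations in all. Any value missing from the dictionary is mapped to 0 by the `.get(x, 0)` calls used in the cells that follow, so a 0 in the unpacked data signals an undecoded value rather than a rating.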


In [None]:
my_dict = pd.Series(lines_explained.short_title.values, index=lines_explained.id).to_dict()

In [198]:
canto_transformed_features = canto.iloc[:, 3:].applymap(lambda x : dictionary_of_value_sets.get(x, 0))
canto_unpacked = pd.concat([canto.iloc[:, :3], canto_transformed_features], axis="columns")
canto_renamed = canto_unpacked.rename(columns=my_dict)
canto_renamed

Unnamed: 0,song_id,Preferred_name,society_id,Social Org Vocal,Social Org Voc/Orch,Social Org Orch,Musical Org Vocal,Tonal Blend Vocal,Rhythm Blend Vocal,Music Orch Org,...,Glissando,Melisma,Tremolo,Glottal,Vocal Register,Vocal Width,Nasality,Rasp,Accent,Enunciation
0,4241,'Are'are,10000,"(6.0,)","(1.0,)","(1.0,)","(13.0,)","(10.0,)","(10.0,)","(1.0,)",...,"(9.0,)","(7.0,)","(4.0,)","(13.0,)","(10.0,)","(10.0,)","(4.0,)","(7.0,)","(10.0,)","(7.0,)"
1,4246,'Are'are,10000,"(6.0,)","(12.0,)","(13.0,)","(7.0,)","(13.0,)","(13.0,)","(13.0,)",...,"(13.0,)","(1.0,)","(13.0,)","(7.0,)","(10.0,)","(13.0,)","(4.0,)","(10.0,)","(10.0,)","(13.0,)"
2,30075,'Are'are,10000,"(13.0, 4.0)","(1.0,)","(1.0,)","(13.0,)","(13.0,)","(10.0,)","(1.0,)",...,"(9.0,)","(1.0,)","(10.0,)","(13.0,)","(7.0, 4.0)","(10.0,)","(13.0,)","(13.0,)","(13.0,)","(13.0,)"
3,30120,'Are'are,10000,"(13.0, 4.0)","(1.0,)","(1.0,)","(13.0,)","(10.0,)","(10.0,)","(1.0,)",...,"(5.0,)","(1.0,)","(7.0,)","(13.0,)","(7.0,)","(8.0,)","(10.0,)","(13.0,)","(13.0,)","(10.0,)"
4,30121,'Are'are,10000,"(5.0,)","(1.0,)","(1.0,)","(13.0,)","(10.0,)","(10.0,)","(1.0,)",...,"(5.0,)","(7.0,)","(7.0,)","(13.0,)","(7.0,)","(8.0,)","(10.0,)","(13.0,)","(13.0,)","(10.0,)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5771,364,Hokkaido Japanese,62554,"(2.0,)","(8.0,)","(2.0,)","(4.0,)","(1.0,)","(1.0,)","(4.0,)",...,"(9.0,)","(1.0,)","(1.0,)","(1.0,)","(10.0,)","(3.0,)","(4.0,)","(10.0,)","(10.0,)","(7.0,)"
5772,183,Eastern Ojibwa,62555,"(2.0,)","(2.0,)","(2.0,)","(4.0,)","(1.0,)","(1.0,)","(4.0,)",...,"(13.0,)","(13.0,)","(1.0,)","(7.0,)","(7.0,)","(8.0,)","(4.0,)","(4.0,)","(10.0,)","(7.0,)"
5773,180,Gayogo̱hó꞉nǫʼ (Cayuga),62556,"(8.0,)","(2.0,)","(6.0,)","(7.0,)","(7.0,)","(10.0,)","(7.0,)",...,"(13.0,)","(13.0,)","(10.0,)","(1.0,)","(10.0,)","(6.0,)","(7.0,)","(4.0,)","(1.0,)","(10.0,)"
5774,181,Gayogo̱hó꞉nǫʼ (Cayuga),62556,"(8.0,)","(2.0,)","(2.0,)","(7.0,)","(7.0,)","(7.0,)","(4.0,)",...,"(5.0,)","(13.0,)","(10.0,)","(7.0,)","(10.0,)","(10.0,)","(4.0,)","(7.0,)","(1.0,)","(7.0,)"
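
Recent pandas releases deprecate `DataFrame.applymap` in favour of `DataFrame.map`. The sketch below shows one way to stay compatible with both, applied to a two-row toy frame. It assumes a bitwise decoder (`decode_value`, a hypothetical helper) instead of the dictionary lookup, so it returns integer ratings rather than the floats shown above.

```python
import pandas as pd

def decode_value(value):
    """Recover the ratings, highest first, from a combined integer."""
    return tuple(n for n in range(13, 0, -1) if int(value) & (1 << n))

# Two rows of raw ratings taken from the canto table shown earlier.
ratings_toy = pd.DataFrame({"line_1": [64, 8208], "line_2": [2, 2]})

# Prefer DataFrame.map when it exists; otherwise fall back to applymap.
elementwise = getattr(ratings_toy, "map", None) or ratings_toy.applymap
decoded_toy = elementwise(decode_value)
```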


In [195]:
lullaby_feature_workbook = canto_lullaby_features
# society_id was dropped above, so the rating columns start at position 2 (not 3)
transformed_lullaby_features = lullaby_feature_workbook.iloc[:, 2:].applymap(lambda x : dictionary_of_value_sets.get(x, 0))
lullabies_unpacked = pd.concat([lullaby_feature_workbook.iloc[:, :2], transformed_lullaby_features], axis="columns")
lullabies_unpacked.sample(3)

Unnamed: 0,song_id,Preferred_name,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
823,134,Kamayurá,"(6.0,)","(10.0,)","(11.0,)","(1.0,)","(4.0,)","(5.0,)","(7.0,)","(5.0,)","(13.0,)"
3419,4186,Salar,"(8.0,)","(10.0,)","(6.0,)","(13.0,)","(4.0,)","(9.0,)","(10.0,)","(5.0,)","(5.0,)"
68,408,Afridi,"(2.0,)","(1.0,)","(13.0,)","(8.0,)","(7.0,)","(5.0,)","(10.0,)","(1.0,)","(1.0,)"


In [354]:
songs.columns.to_list()

['song_id',
 'Local_latitude',
 'Local_longitude',
 'Homeland_latitude',
 'Homeland_longitude',
 'Region',
 'Division',
 'Subregion',
 'Area',
 'Preferred_name',
 'Society_location',
 'society_id',
 'Audio_notes',
 'Duration',
 'Audio_file',
 'Song',
 'Genre',
 'Song_notes',
 'Performers',
 'Instruments',
 'Vocalist_gender',
 'Lyrics',
 'Recorded_by',
 'Year',
 'Publisher',
 'Publcation_collection',
 'Repository',
 'Sources',
 'Source_tag']

In [235]:
societies[societies['society_id'] == 10000]

Unnamed: 0,Society_latitude,Society_longitude,Homeland_latitude_of_diasporic_peoples,Homeland_longitude_of_diasporic_peoples,Area_latitude,Area_longitude,Region,Division,Subregion,Area,...,eHRAF_soc_w_xd_id_2?,eHRAF_OWC.1,eHRAF_OWC_Name.1,eHRAF_SubOWC_in_DPLACE.1,eHRAF_SubOWC_Name_in_DPLACE.1,eHRAF_subOWC_FINAL_STATUS_2021.1,STATUS_JULY_31_2021,language match checked by KK,D-PLACE match checked by KK,KK_ISSUE
1050,-9.21,161.16,,,-8.97,160.95,Oceania,Melanesia,Solomon Islands,Malaita Island,...,,,,,,,NOT_CHECKED_BY_KK,NOT_CHECKED_BY_KK,NOT_CHECKED_BY_KK,


In [368]:
selected_song_cols = ['song_id',
 'Genre',
 'Performers',
 'Instruments',
 'Vocalist_gender',
 'Year',
 'society_id',
 'Region',
 'Division',
 'Subregion',
 'Area',
 'Local_latitude',
 'Local_longitude',
 'Preferred_name',
 'Society_location'
 ]

In [369]:
selected_song_cols

['song_id',
 'Genre',
 'Performers',
 'Instruments',
 'Vocalist_gender',
 'Year',
 'society_id',
 'Region',
 'Division',
 'Subregion',
 'Area',
 'Local_latitude',
 'Local_longitude',
 'Preferred_name',
 'Society_location']

In [389]:
# songs[selected_song_cols]

In [377]:
# copy the slice so that later assignments change this new DataFrame and not a view of `songs`
songs_some_cols  = songs[selected_song_cols].copy()

# split the long strings at the ";"
songs_some_cols['Genre'] = songs_some_cols['Genre'].str.split(';')

# explode the complete df on the 'genre' column to tidy the data
songs_exploded = songs_some_cols.explode('Genre')

# remove trailing/leading spaces that might remain in the individual strings
songs_exploded["Genre"] = songs_exploded["Genre"].str.strip()
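
To see what the split, explode, and strip steps do, here is a minimal sketch on a two-row toy table. The genre strings are invented for illustration; `songs_toy` and `exploded_toy` are hypothetical names.

```python
import pandas as pd

# Two songs: one with two genres in a single string, one with no genre recorded.
songs_toy = pd.DataFrame({
    "song_id": ["1", "2"],
    "Genre": ["Lullaby; Work Song", None],
})

# Split each string into a list, then give every list element its own row.
exploded_toy = songs_toy.assign(Genre=songs_toy["Genre"].str.split(";")).explode("Genre")

# Remove the space left behind after each ";".
exploded_toy["Genre"] = exploded_toy["Genre"].str.strip()
```

The first song now occupies two rows (one per genre) that share the same `song_id` and index label, while the song without a genre keeps a single row with a missing value.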

In [388]:
songs_exploded = songs_exploded.fillna('')
# songs_exploded

In [390]:
# songs_exploded[songs_exploded['Genre'] == "Lullaby"]

In [379]:

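# skip rows whose genre is empty (missing genres were filled with '' above), then count
genre_counts = songs_exploded.loc[songs_exploded['Genre'] != '', 'Genre'].value_counts()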

# Convert the Series to a DataFrame
df_genre_counts = genre_counts.reset_index()
df_genre_counts.columns = ['Genre', 'Count']

# filter by count of genres
filtered_df = df_genre_counts.loc[df_genre_counts['Count'] > 10]
filtered_df
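
The column names produced by `value_counts().reset_index()` differ between pandas versions, which is why the cell above assigns them by hand. One alternative that names both columns explicitly is sketched below on an invented list of genres; the threshold of 1 is chosen only to suit the toy data.

```python
import pandas as pd

# Invented genre labels, one per (exploded) song row.
genres = pd.Series(["Lullaby", "Lullaby", "Dance", "Lullaby", "Dance", "Epic"], name="Genre")

# Name the index and the counts explicitly, whatever defaults pandas would use.
counts = genres.value_counts().rename_axis("Genre").reset_index(name="Count")

# Keep only the genres that occur more than once.
frequent = counts[counts["Count"] > 1]
```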




In [391]:
# now the data about the individual songs
songs.columns.to_list()
song_cols = ["song_id", 
             "Local_latitude",
             "society_id",
             'Local_longitude',
             'Homeland_latitude',
             'Homeland_longitude',
             'Region',
             'Genre']
song_data_selected = songs[song_cols].fillna('')
# song_data_selected

In [425]:
canto_selected_features =  canto.copy()

# song_id is number, so convert to string for matching
canto_selected_features['song_id'] = canto_selected_features['song_id'].astype('str')

# dict to rename columns with more useful names
lullaby_name_dict = {'line_1': 'Social_Org_Group', 
'line_10': 'Repetition',
'line_11': 'Vocal_Rhythm',
'line_16': 'Melodic_Form',
'line_18': 'Number_Phrases',
'line_20': 'Melodic_Range',
'line_24': 'Tempo',
'line_25': 'Volume',
'line_26': 'Vocal_Rubato',
'line_28': 'Glissando'}

# rename cols using the dictionary defined just above
canto_selected_features = canto_selected_features.rename(columns=lullaby_name_dict)

# Now we select only the columns (lines) that Anna suggests are relevant to the Lullaby Project
canto_selected_features = canto_selected_features.iloc[:,[0, 1, 2, 3, 12, 13, 20, 22, 26, 27, 28, 30]]
canto_selected_features

Unnamed: 0,song_id,Preferred_name,society_id,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,4241,'Are'are,10000,64,16,2048,512,1024,512,2,512,512
1,4246,'Are'are,10000,64,8192,512,8192,128,32,16,512,8192
2,30075,'Are'are,10000,8208,1024,64,2048,128,512,2,8192,512
3,30120,'Are'are,10000,8208,128,64,8192,128,512,2,8192,32
4,30121,'Are'are,10000,32,128,64,512,1024,32,2,8192,32
...,...,...,...,...,...,...,...,...,...,...,...,...
5771,364,Hokkaido Japanese,62554,4,2,8192,32,1024,8,128,2,512
5772,183,Eastern Ojibwa,62555,4,2,2048,512,128,512,2,8192,8192
5773,180,Gayogo̱hó꞉nǫʼ (Cayuga),62556,256,1024,2048,8,1024,2048,8192,8192,8192
5774,181,Gayogo̱hó꞉nǫʼ (Cayuga),62556,256,1024,2048,8,128,512,1024,512,32


In [417]:
# 2 to the n for all n values from 1 to 13
powers = [2**n for n in range(1, 14)] 

# make a list of all combinations of the previous, for 1, 2, and 3 numbers
combo_list = list(combinations(powers, 1)) + list(combinations(powers, 2)) + list(combinations(powers, 3)) 

# a dictionary that maps the original sums to the combinations
sums = [{"sum" : sum(t), "full_tuple": t} for t in combo_list]  

# as a df
sums_df = pd.DataFrame(sums) 

# convert each power of 2 back to its exponent (the original rating), highest first
sums_df['sorted_original_values'] = sums_df.full_tuple.apply(lambda x: tuple(sorted([np.log2(value) for value in x], reverse=True))) 

# create a dictionary that maps the summed values to their original ratings
dictionary_of_value_sets = dict(zip(sums_df["sum"], sums_df["sorted_original_values"]))

# the first three columns are ids and names; decode every rating column after them
canto_selected_features_transformed = canto_selected_features.iloc[:, 3:].applymap(lambda x : dictionary_of_value_sets.get(x, 0))
canto_unpacked = pd.concat([canto_selected_features.iloc[:, :3], canto_selected_features_transformed], axis="columns")
canto_unpacked

Unnamed: 0,song_id,Preferred_name,society_id,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,4241,'Are'are,10000,"(6.0,)","(4.0,)","(11.0,)","(9.0,)","(10.0,)","(9.0,)","(1.0,)","(9.0,)","(9.0,)"
1,4246,'Are'are,10000,"(6.0,)","(13.0,)","(9.0,)","(13.0,)","(7.0,)","(5.0,)","(4.0,)","(9.0,)","(13.0,)"
2,30075,'Are'are,10000,"(13.0, 4.0)","(10.0,)","(6.0,)","(11.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(9.0,)"
3,30120,'Are'are,10000,"(13.0, 4.0)","(7.0,)","(6.0,)","(13.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(5.0,)"
4,30121,'Are'are,10000,"(5.0,)","(7.0,)","(6.0,)","(9.0,)","(10.0,)","(5.0,)","(1.0,)","(13.0,)","(5.0,)"
...,...,...,...,...,...,...,...,...,...,...,...,...
5771,364,Hokkaido Japanese,62554,"(2.0,)","(1.0,)","(13.0,)","(5.0,)","(10.0,)","(3.0,)","(7.0,)","(1.0,)","(9.0,)"
5772,183,Eastern Ojibwa,62555,"(2.0,)","(1.0,)","(11.0,)","(9.0,)","(7.0,)","(9.0,)","(1.0,)","(13.0,)","(13.0,)"
5773,180,Gayogo̱hó꞉nǫʼ (Cayuga),62556,"(8.0,)","(10.0,)","(11.0,)","(3.0,)","(10.0,)","(11.0,)","(13.0,)","(13.0,)","(13.0,)"
5774,181,Gayogo̱hó꞉nǫʼ (Cayuga),62556,"(8.0,)","(10.0,)","(11.0,)","(3.0,)","(7.0,)","(9.0,)","(10.0,)","(9.0,)","(5.0,)"


In [440]:
# a shortlist of columns to use from songs table
selected_song_cols = ['song_id',
 'Genre',
 'Performers',
 'Instruments',
 'Vocalist_gender',
 'Year',
 'society_id',
 'Region',
 'Division',
 'Subregion',
 'Area',
 'Local_latitude',
 'Local_longitude',
 'Preferred_name',
 'Society_location'
 ]


# slicing out just the relevant columns
songs_some_cols = songs[selected_song_cols].copy()

# getting just the lullabies
lullabies_metadata = songs_some_cols[songs_some_cols['Genre'].notna() & songs_some_cols['Genre'].str.contains("Lullaby")]
lullabies_metadata.head(3)

Unnamed: 0,song_id,Genre,Performers,Instruments,Vocalist_gender,Year,society_id,Region,Division,Subregion,Area,Local_latitude,Local_longitude,Preferred_name,Society_location
139,1914,Lullaby,"Female solo, group of children",Female Voice; Children's Voices,Mixed children; Women,1954,11793,Africa,East Africa,Great Lakes Africa,"Busoga Kingdom, E Uganda",1.04,33.48,Basoga,"Bulamogi, Kaliro District, Busoga, Eastern Reg..."
328,420,Lullaby,Female solo,Female Voice,Women,1959,15577,Africa,Central Africa,N W Central Africa,N W Cameroon,6.09,10.3,Fut,"Bamenda, Mezam, Northwest Region, Cameroon"
329,2565,Lullaby,Female solo,Female Voice,Women,1959,15577,Africa,Central Africa,N W Central Africa,N W Cameroon,6.09,10.3,Fut,"Bamenda, Mezam, Northwest Region, Cameroon"
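
The lullaby filter above combines `notna()` with `str.contains`. A slightly shorter form, sketched here on an invented three-row table, passes `na=False` so that missing genres simply count as non-matches. The `case=False` argument is an added assumption, for data where capitalization may vary.

```python
import pandas as pd

# Invented rows: a plain match, a match inside a longer string, and a missing genre.
songs_toy = pd.DataFrame({
    "song_id": ["a", "b", "c"],
    "Genre": ["Lullaby", "Play Song; lullaby", None],
})

# na=False treats missing genres as non-matches, so no separate notna() test is needed.
mask = songs_toy["Genre"].str.contains("Lullaby", case=False, na=False)
lullabies_toy = songs_toy[mask]
```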


In [443]:
lullabies_final = pd.merge(lullabies_metadata, canto_unpacked,
how='left', on='song_id')
lullabies_final.fillna('')

Unnamed: 0,song_id,Genre,Performers,Instruments,Vocalist_gender,Year,society_id_x,Region,Division,Subregion,...,society_id_y,Social_Org_Group,Repetition,Vocal_Rhythm,Number_Phrases,Melodic_Range,Tempo,Volume,Vocal_Rubato,Glissando
0,1914,Lullaby,"Female solo, group of children",Female Voice; Children's Voices,Mixed children; Women,1954,11793,Africa,East Africa,Great Lakes Africa,...,11793,"(8.0,)","(10.0,)","(11.0,)","(13.0,)","(13.0,)","(9.0,)","(7.0,)","(13.0,)","(5.0,)"
1,420,Lullaby,Female solo,Female Voice,Women,1959,15577,Africa,Central Africa,N W Central Africa,...,15577,"(2.0,)","(4.0,)","(6.0,)","(3.0,)","(7.0,)","(5.0,)","(4.0,)","(13.0,)","(5.0,)"
2,2565,Lullaby,Female solo,Female Voice,Women,1959,15577,Africa,Central Africa,N W Central Africa,...,15577,"(2.0,)","(10.0,)","(6.0,)","(1.0,)","(7.0,)","(9.0,)","(4.0,)","(13.0,)","(9.0,)"
3,1519,Lullaby,Alice Mwale (Nguluwe),Female Voice,Women,1961,22848,Africa,Southern Africa,E Southern Africa,...,22848,"(2.0,)","(10.0,)","(11.0,)","(1.0,)","(10.0,)","(11.0,)","(4.0,)","(9.0,)","(1.0,)"
4,2095,Lullaby,Female chorus,Female Voices,Women,1957,26797,Africa,Southern Africa,E Southern Africa,...,26797,"(10.0,)","(10.0,)","(6.0,)","(13.0,)","(10.0,)","(5.0,)","(4.0,)","(13.0,)","(9.0,)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,3767,Lullaby,Male solo,Male Voice,Men,1953,12134,South Asia,Afghanistan/ Pakistan/ India,Indus Valley,...,12134,"(2.0,)","(10.0,)","(13.0,)","(11.0,)","(7.0,)","(11.0,)","(7.0, 4.0)","(1.0,)","(5.0,)"
95,3783,Lullaby,Female solo,Female Voice,Women,1953,14477,South Asia,India,C India/ Central Tribes Area,...,14477,"(2.0,)","(10.0,)","(11.0,)","(13.0,)","(4.0,)","(5.0,)","(4.0,)","(5.0,)","(9.0,)"
96,3788,Lullaby,Male chorus,Male Voices,Men,1953,14708,South Asia,India,C India/ Central Tribes Area,...,14708,"(5.0,)","(10.0,)","(11.0,)","(11.0,)","(13.0,)","(9.0,)","(7.0, 4.0)","(13.0,)","(5.0,)"
97,13,Lullaby,"Male singer, crying baby, voices",Male Voice; Crying Baby; Voices,Men,1953-73,14840,Southeast Asia,Island S E Asia,Greater Sundas,...,14840,"(2.0,)","(7.0,)","(6.0,)","(13.0,)","(10.0,)","(9.0,)","(4.0,)","(9.0,)","(9.0,)"
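
The `society_id_x` and `society_id_y` columns appear because both tables carry `society_id` but the merge key is only `song_id`. One way to avoid the duplicate, assuming the two tables agree on `society_id` for every song, is to merge on both columns. The sketch below uses toy tables (`meta_toy`, `feat_toy`) with a few values copied from the outputs above.

```python
import pandas as pd

# Metadata for two lullabies.
meta_toy = pd.DataFrame({
    "song_id": ["1914", "420"],
    "society_id": [11793, 15577],
    "Genre": ["Lullaby", "Lullaby"],
})

# Decoded features for three songs, one of which is not a lullaby.
feat_toy = pd.DataFrame({
    "song_id": ["1914", "420", "4241"],
    "society_id": [11793, 15577, 10000],
    "Tempo": [(9.0,), (5.0,), (9.0,)],
})

# Merging on both keys keeps a single society_id column; validate checks that
# each key pair occurs at most once on either side.
merged_toy = meta_toy.merge(
    feat_toy, on=["song_id", "society_id"], how="left", validate="one_to_one"
)
```

If the two tables ever disagreed on a song's `society_id`, this merge would leave the feature columns empty for that song, which makes the mismatch visible.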


In [464]:
type(lullabies_final.iloc[0]['song_id'])

str

In [458]:
# an initial exploratory grouping
grouped = lullabies_final.groupby(['Region', 'Melodic_Range'])['song_id'].count()
regional_lullabies = pd.DataFrame(grouped)
regional_lullabies = regional_lullabies.reset_index()
regional_lullabies = regional_lullabies.rename(columns={'song_id' : 'Song_Count'})

In [461]:
regional_lullabies

Unnamed: 0,Region,Melodic_Range,Song_Count
0,Africa,"(4.0,)",1
1,Africa,"(7.0,)",2
2,Africa,"(10.0,)",2
3,Africa,"(13.0,)",1
4,Australia,"(7.0,)",1
5,Central America,"(4.0,)",5
6,Central America,"(7.0,)",9
7,Central America,"(10.0,)",1
8,Central Asia,"(7.0,)",1
9,East Asia,"(7.0,)",1


In [465]:
import plotly.express as px
fig = px.bar(regional_lullabies, x="Region", y="Song_Count", color="Melodic_Range",
             category_orders={"Melodic_Range": sorted(set([tuple(x) for x in regional_lullabies['Melodic_Range'].tolist()]))},
             title="Regional Variation in Melodic Range of Global Jukebox Lullabies")

# Show the figure
fig.show()
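
Tuples such as `(13.0, 4.0)` make awkward legend entries. A possible refinement, sketched below on invented rows, converts each tuple to a short text label before grouping; the resulting label column could then be passed as `color` to `px.bar`. The names `code_label` and `Melodic_Range_Label` are hypothetical.

```python
import pandas as pd

def code_label(codes):
    """Turn a tuple such as (13.0, 4.0) into the string '13+4'."""
    if not isinstance(codes, tuple):
        return "unknown"  # e.g. the 0 used as a fallback for undecoded values
    return "+".join(str(int(c)) for c in codes)

# Invented rows in the same shape as the merged lullaby table.
lullabies_toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Australia"],
    "Melodic_Range": [(7.0,), (7.0,), (4.0,), (13.0, 4.0)],
})
lullabies_toy["Melodic_Range_Label"] = lullabies_toy["Melodic_Range"].map(code_label)

# Count songs per region and label.
counts_toy = (
    lullabies_toy.groupby(["Region", "Melodic_Range_Label"])
    .size()
    .reset_index(name="Song_Count")
)
```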


In [467]:
import pandas as pd
import plotly.express as px
from itertools import combinations
import numpy as np

# import GJ data
# List of URLs to the data files
data_files_list = [
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/data.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/societies.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/raw/songs.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/codes.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/variables.csv',
    'https://raw.githubusercontent.com/theglobaljukebox/cantometrics/main/etc/raw_codes.csv'
]

# Short names for DataFrames
short_names = ['canto', 'societies', 'songs', 'codes', 'lines_explained', 'raw_codes']

# Initialize empty variables for each DataFrame
canto = None
societies = None
songs = None
codes = None
lines_explained = None
raw_codes = None

# Loop through the list of URLs and short names
for url, short_name in zip(data_files_list, short_names):
    # Read the CSV file from the URL into a DataFrame
    df = pd.read_csv(url)
    
    # Replace non-breaking spaces in column names with regular spaces
    df.columns = df.columns.str.replace('\xa0', ' ')
    
    # Iterate over each column to replace non-breaking spaces in cell values
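    for col in df.columns:
        # Only object (text) columns can contain non-breaking spaces
        if df[col].dtype == 'object':
            # Replace inside strings only, leaving NaN and other non-string values untouched
            df[col] = df[col].map(lambda v: v.replace('\xa0', ' ') if isinstance(v, str) else v)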
    
    # Assign the modified DataFrame to the corresponding variable
    globals()[short_name] = df



# dictionary mapping each line id to its short title (from lines_explained)
my_dict = pd.Series(lines_explained.short_title.values, index=lines_explained.id).to_dict()


#  code to unpack powers of 2
# 2 to the n for all n values from 1 to 13
powers = [2**n for n in range(1, 14)] 

# make a list of all combinations of the previous, for 1, 2, and 3 numbers
combo_list = list(combinations(powers, 1)) + list(combinations(powers, 2)) + list(combinations(powers, 3)) 

# a list of dictionaries pairing each sum with the combination that produced it
sums = [{"sum" : sum(t), "full_tuple": t} for t in combo_list]  

# as a df
sums_df = pd.DataFrame(sums) 

# convert each power of 2 back to its exponent (the original code value), highest first
sums_df['sorted_original_values'] = sums_df.full_tuple.apply(lambda x: tuple(sorted([np.log2(value) for value in x], reverse=True))) 
# sort_values returns a new DataFrame, so assign the result back
sums_df = sums_df.sort_values(by="sum") 

#create a dictionary that maps the summed values to their original meanings:
dictionary_of_value_sets = dict(zip(sums_df["sum"], sums_df["sorted_original_values"]))

# create dictionary of lines and short title meanings
short_title_dict = pd.Series(lines_explained.short_title.values, index=lines_explained.id).to_dict() 

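# unpack the sums for all 'line' columns (everything after the first three), using the dict of value sets created above
line_cols = canto.iloc[:, 3:]
# DataFrame.map requires pandas >= 2.1; fall back to applymap on older versions
elementwise = line_cols.map if hasattr(line_cols, "map") else line_cols.applymap
canto_transformed_features = elementwise(lambda x : dictionary_of_value_sets.get(x, 0)) 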

# put the transformed columns back in place with the song names, etc
canto_unpacked = pd.concat([canto.iloc[:, :3], canto_transformed_features], axis="columns") 

# rename the columns with the short_title dictionary
canto_renamed = canto_unpacked.rename(columns=my_dict)
canto_unpacked = canto_renamed


# song metadata with selected columns

selected_song_cols = ['song_id',
 'Genre',
 'Performers',
 'Instruments',
 'Vocalist_gender',
 'Year',
 'society_id',
 'Region',
 'Division',
 'Subregion',
 'Area',
 'Local_latitude',
 'Local_longitude',
 'Preferred_name',
 'Society_location'
 ]

songs_some_cols = songs[selected_song_cols].copy()

# split strings and explode:  'Genre' as example
# first inspect the raw values: many rows hold several genres separated by ';'
songs_some_cols['Genre'].unique().tolist()

# selected output

['Responsorial Song; Call & Response',
 'Dance Song',
 'Ceremonial Song; Song For Royalty',
 'Spirit Song; Dance Song; Cult Song',
 "Boys' Song; Adolescents' Song",
 'Funeral Song; Mourning Song',
 "Wedding Song; Girls' Song; Responsorial Song",
 'Chant; Song For Royalty',
 "Men's Song; Song For Royalty"]

# copy the data so we avoid problems
songs_some_cols  = songs[selected_song_cols].copy()
# split the long strings at the ";"
songs_some_cols['Genre'] = songs_some_cols['Genre'].str.split(';')
# explode the complete df on the 'genre' column to tidy the data
songs_exploded = songs_some_cols.explode('Genre')
# remove trailing/leading spaces that might remain in the individual strings
songs_exploded["Genre"] = songs_exploded["Genre"].str.strip()
# fillnas
songs_exploded = songs_exploded.fillna('')

# lullaby project


# copy to safeguard data
canto_selected_features =  canto.copy()

# song_id is number, so convert to string for matching
canto_selected_features['song_id'] = canto_selected_features['song_id'].astype('str')

# dict to rename columns with more useful names
lullaby_name_dict = {'line_1': 'Social_Org_Group', 
'line_10': 'Repetition',
'line_11': 'Vocal_Rhythm',
'line_16': 'Melodic_Form',
'line_18': 'Number_Phrases',
'line_20': 'Melodic_Range',
'line_24': 'Tempo',
'line_25': 'Volume',
'line_26': 'Vocal_Rubato',
'line_28': 'Glissando'}

# rename cols
canto_selected_features = canto_selected_features.rename(columns=lullaby_name_dict)

# Now we select only the columns (lines) that Anna suggests are relevant to the Lullaby Project
canto_selected_features = canto_selected_features.iloc[:,[0, 1, 2, 3, 12, 13, 20, 22, 26, 27, 28, 30]]

# unpack powers of 2

# 2 to the n for all n values from 1 to 13
powers = [2**n for n in range(1, 14)] 
# make a list of all combinations of the previous, for 1, 2, and 3 numbers
combo_list = list(combinations(powers, 1)) + list(combinations(powers, 2)) + list(combinations(powers, 3)) 
# a list of dictionaries pairing each sum with the combination that produced it
sums = [{"sum" : sum(t), "full_tuple": t} for t in combo_list]  
# as a df
sums_df = pd.DataFrame(sums) 
# convert each power of 2 back to its exponent (the original code value), highest first
sums_df['sorted_original_values'] = sums_df.full_tuple.apply(lambda x: tuple(sorted([np.log2(value) for value in x], reverse=True))) 
# sort_values returns a new DataFrame, so assign the result back
sums_df = sums_df.sort_values(by="sum") 
#create a dictionary that maps the summed values to their original meanings:
dictionary_of_value_sets = dict(zip(sums_df["sum"], sums_df["sorted_original_values"]))
my_dict = pd.Series(lines_explained.short_title.values, index=lines_explained.id).to_dict()
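# DataFrame.map requires pandas >= 2.1; fall back to applymap on older versions
selected_line_cols = canto_selected_features.iloc[:, 3:]
elementwise = selected_line_cols.map if hasattr(selected_line_cols, "map") else selected_line_cols.applymap
canto_transformed_features = elementwise(lambda x : dictionary_of_value_sets.get(x, 0))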
canto_unpacked = pd.concat([canto_selected_features.iloc[:, :3], canto_transformed_features], axis="columns")
canto_unpacked = canto_unpacked.rename(columns=my_dict)
# fix song_id as string
canto_unpacked['song_id'] = canto_unpacked['song_id'].astype('str')


# Find Lullabies from the Song Table, and Collect Metadata

# a shortlist of columns to use from songs table
selected_song_cols = ['song_id',
'Genre',
'Performers',
'Instruments',
'Vocalist_gender',
'Year',
'society_id',
'Region',
'Division',
'Subregion',
'Area',
'Local_latitude',
'Local_longitude',
'Preferred_name',
'Society_location'
]


# slicing out just the relevant columns
songs_some_cols = songs[selected_song_cols].copy()

# getting just the lullabies
lullabies_metadata = songs_some_cols[songs_some_cols['Genre'].notna() & songs_some_cols['Genre'].str.contains("Lullaby")]

# combine the Feature Data with Context Data



lullabies_final = pd.merge(lullabies_metadata, canto_unpacked,
                           how='left', on='song_id')
lullabies_final = lullabies_final.fillna('')


# groups based on features

grouped = lullabies_final.groupby(['Region', 'Melodic_Range'])['song_id'].count()
regional_lullabies = pd.DataFrame(grouped)
regional_lullabies = regional_lullabies.reset_index()
regional_lullabies = regional_lullabies.rename(columns={'song_id' : 'Song_Count'})

# make sure plotly.express is imported as px (see the imports at the top of this cell)

# the plot
fig = px.bar(regional_lullabies, x="Region", y="Song_Count", color="Melodic_Range",
             category_orders={"Melodic_Range": sorted(set([tuple(x) for x in regional_lullabies['Melodic_Range'].tolist()]))})

# Show the figure
fig.show()

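Note: `AttributeError: 'DataFrame' object has no attribute 'map'` is raised on pandas versions older than 2.1, where the element-wise method is called `DataFrame.applymap` rather than `DataFrame.map`.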
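
As an alternative to the lookup table of combinations, a raw value can be decoded directly from its binary digits. This is a sketch, not the Global Jukebox team's own method: the helper `decode_power_of_two` is hypothetical, and it assumes (as the code above does) that each raw value is a sum of distinct powers of 2 with exponents from 1 to 13.

```python
def decode_power_of_two(raw_value, max_exponent=13):
    """Return the exponents (highest first) whose powers of 2 sum to raw_value."""
    value = int(raw_value)
    # test each bit from high to low; bit n is set when 2**n is part of the sum
    return tuple(n for n in range(max_exponent, 0, -1) if value & (1 << n))


print(decode_power_of_two(144))   # 144 = 2**7 + 2**4
print(decode_power_of_two(8192))  # 8192 = 2**13
```

Unlike the combinations table, which covers at most three simultaneous codes, this approach handles any number of them and returns an empty tuple when none of the expected bits is set. Missing values (NaN) would need to be filtered out before calling it, since `int()` cannot convert them.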