This notebook selects the subset of data we want to work with. The filters are the following:

Person artists:
- gender is Male or Female
- song publication year is known
- the artist has published more than 10 songs
- song published between 1960 and 2010 (2010 excluded)

Group artists:
- gender is known for all the available members (i.e., n_unknown=0)
- song publication year is known
- the artist has published more than 10 songs
- song published between 1960 and 2010 (2010 excluded)
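
As a minimal sketch of the artist-level criteria listed above, the two helpers below work on plain dictionaries. The field names (`type`, `gender`, `n_members`, `n_unknown`) are the ones used in this notebook; the function names and the handling of missing values are assumptions made for illustration only.

```python
import math


def is_gender_known(artist):
    """Return True if the artist passes the gender criterion listed above."""
    if artist.get("type") == "Person":
        return artist.get("gender") in ("Male", "Female")
    if artist.get("type") == "Group":
        n_members = artist.get("n_members")
        # Assumption: a missing member count (None or NaN) fails the filter.
        if n_members is None or (isinstance(n_members, float) and math.isnan(n_members)):
            return False
        return n_members > 0 and artist.get("n_unknown") == 0
    # Any other (or missing) artist type is rejected.
    return False


def is_year_in_range(year, first=1960, last_excluded=2010):
    """Return True for years in the half-open interval [1960, 2010)."""
    return first <= year < last_excluded
```

For example, `is_year_in_range(2009)` holds while `is_year_in_range(2010)` does not, which matches the "2010 excluded" rule.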

Then, songs whose lyrics are missing or empty, have 10 words or fewer, or have 4 lines or fewer are discarded. Only songs detected as English are kept.
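
The lyrics check can be sketched as follows. The notebook itself relies on precomputed `n_words` and `n_lines` columns; counting whitespace-separated tokens and non-empty lines, as done here, is only an assumption about how those columns might have been derived, and `has_usable_lyrics` is a hypothetical helper name.

```python
def has_usable_lyrics(lyrics, min_words=10, min_lines=4):
    """Return True if the lyrics exist and exceed both length thresholds."""
    if lyrics is None or lyrics.strip() == "":
        return False
    # Assumption: words are whitespace-separated tokens, lines are non-empty lines.
    n_words = len(lyrics.split())
    n_lines = len([line for line in lyrics.splitlines() if line.strip()])
    return n_words > min_words and n_lines > min_lines
```

Both thresholds are strict, so a song with exactly 10 words or exactly 4 lines is discarded.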

We also consider the artist's span of activity, when available. For instance, if an artist ended their activity in 1994, we discard all of their songs published after 1994.
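
A minimal sketch of this check, assuming the life span is a dictionary with `begin` and `end` strings (as in the `lifeSpan` column) that may be empty or hold a full date such as `2010-04-18`. The helper names are hypothetical, and this version simply takes the first four-digit number it finds as the year.

```python
import re


def first_year(text):
    """Extract the first four-digit year from a date string, or None."""
    match = re.search(r"\d{4}", text or "")
    return int(match.group()) if match else None


def is_within_lifespan(song_year, lifespan):
    """Return True unless the song falls outside a known begin/end year."""
    begin = first_year(lifespan.get("begin"))
    end = first_year(lifespan.get("end"))
    if begin is not None and song_year < begin:
        return False
    if end is not None and song_year > end:
        return False
    return True
```

A missing boundary never excludes a song: with an empty `begin` and `end`, every publication year is accepted.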

In [2]:
# mount GDrive
from google.colab import drive
#drive.mount('/content/drive')
drive._mount('/content/drive')

Mounted at /content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!cp -r "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data/data_lyrics_group_decades" .
!cp -r "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data/data_lyrics_person_decades" .
!cp -r "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data/data_lyrics_others_decades" .

!cp "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data/artists_info.json.gz" .

In [4]:
import pandas as pd
import glob
import re
import json

In [5]:
# load artists
artists = pd.read_json("artists_info.json.gz", orient='records', lines=True)
artists = artists.set_index('artist_id')
artists.head()

artist_id,lifeSpan,nameVariations,labels,deezerFans,n_unknown,gender,abstract,id_artist_discogs,urlWikipedia,subject,urlPureVolume,artist_name,recordLabel,urlMusicBrainz,urls,urlSoundCloud,id_artist_deezer,urlDeezer,disambiguation,urlOfficialWebsite,location,urlYouTube,name_accent_fold,urlMySpace,urlWikidata,urlFacebook,languages,type,urlTwitter,urlRateYourMusic,members,locationInfo,dbp_abstract,genres,urlAmazon,id_artist_musicbrainz,dbp_genre,n_male,n_songs,urlDiscogs,n_female,urlWikia,nameVariations_fold,urlITunes,n_albums,n_members,urlAllmusic,urlSpotify,urlBBC,urlInstagram,urlLastFm,urlSecondHandSongs,urlGooglePlus
56d7e91b6b60c09814f93e4a,"{'ended': False, 'begin': '1995', 'end': ''}",['A'],[],6519.0,0.0,,"Alternative rock band formed in Leeds, England...",72848.0,http://en.wikipedia.org/wiki/A_(band),"[Musical groups established in 1995, English a...",,A,[Warner Bros. Records],http://musicbrainz.org/artist/55c6eb6e-8388-49...,[http://www.myspace.com/officialA],,3412.0,http://www.deezer.com/artist/3412,British band,http://www.a-communication.com/,{'id_city_musicbrainz': '6e2d2d30-dbc9-4d27-99...,,A,,https://www.wikidata.org/wiki/Q300307,,"{'english': 98, 'unknown': 3, 'spanish': 1}",Group,,http://rateyourmusic.com/artist/a,[{'id_member_musicbrainz': '3ec05e94-bf6e-439f...,"[England, West Yorkshire, Leeds]",A (later changed to A + R) are a British alter...,[],http://www.amazon.com/asdf/e/B000APPUE6?tag=wi...,55c6eb6e-8388-497c-acaf-dbff584d0c3a,"[Alternative rock, Pop punk, Hard rock]",6.0,102,http://www.discogs.com/artist/72848,0.0,A,['A'],https://itunes.apple.com/us/artist/id635168856,6,6.0,http://www.allmusic.com/artist/mn0000474971,,,,,,
56d7e91c6b60c09814f93e4c,"{'ended': False, 'begin': '2010-04-18', 'end':...",,[],,1.0,,,,,,,A (エース) (ACE),,http://musicbrainz.org/artist/51257cf7-1672-45...,,,,,Japanese Band,http://a-rock.jp/index.php,"{'id_city_musicbrainz': '', 'country': 'Japan'...",,A,,,,"{'english': 24, 'hausa': 4, 'unknown': 3, 'tur...",Group,,,[{'id_member_musicbrainz': '82bd3da4-7085-40b8...,[Japan],,"[J-Rock, Visual Kei]",,51257cf7-1672-4580-ae5c-93eefe3684fb,,0.0,34,,0.0,A_(%E3%82%A8%E3%83%BC%E3%82%B9)_(ACE),[],https://itunes.apple.com/us/artist/id4328888,7,1.0,,,,,,,
56d7e91d6b60c09814f93e4e,"{'ended': False, 'begin': '', 'end': ''}","[a balladeer, A BALLADEER AND FRIENDS, a balla...",[],423.0,1.0,,A Balladeer (stylised as 'a balladeer') is Dut...,472300.0,https://en.wikipedia.org/wiki/A_Balladeer,,,A Balladeer,,http://musicbrainz.org/artist/8cb0ebc9-db95-47...,[http://www.aballadeer.com/],,242156.0,http://www.deezer.com/artist/242156,,http://www.aballadeer.com/,"{'id_city_musicbrainz': '', 'country': 'Nether...",,A Balladeer,https://myspace.com/aballadeer,https://www.wikidata.org/wiki/Q4655340,https://www.facebook.com/aballadeer,{'english': 29},Group,https://twitter.com/aballadeerhere,,[{'id_member_musicbrainz': '2931cbb9-56a0-4a96...,[],,[],http://www.amazon.com/asdf/e/B003BF7QWG?tag=wi...,8cb0ebc9-db95-4748-81df-8e1e24e70541,,0.0,29,http://www.discogs.com/artist/472300,0.0,A_Balladeer,"[a balladeer, A BALLADEER AND FRIENDS, a balla...",https://itunes.apple.com/us/artist/id130037087,4,1.0,http://www.allmusic.com/artist/mn0001591642,https://play.spotify.com/artist/5MUNbMtqB3EOKx...,,,,,
56d7e91e6b60c09814f93e50,"{'ended': False, 'begin': '', 'end': ''}",,[],0.0,,,,,,,,A Beautiful Silence,,http://musicbrainz.org/artist/4616c4f1-fe79-40...,,,4708137.0,http://www.deezer.com/artist/4708137,,,"{'id_city_musicbrainz': '', 'country': '', 'ci...",,A Beautiful Silence,,,,{'english': 23},,,,[],"[United States, Michigan, Marquette]",,[],http://www.amazon.com/asdf/e/B001LI3SMC?tag=wi...,4616c4f1-fe79-40f0-ac8d-2b319528b683,,,23,,,A_Beautiful_Silence,[],https://itunes.apple.com/us/artist/id115104139,2,,http://www.allmusic.com/artist/mn0001930454,https://play.spotify.com/artist/2FcgcBYwiCDG37...,,,,,
56d7e91e6b60c09814f93e52,"{'ended': False, 'begin': '2001', 'end': ''}",[],[],32.0,4.0,,,407539.0,http://en.wikipedia.org/wiki/A_Band_Called_Pain,"[Musical groups from Oakland, California, Afri...",,A Band Called Pain,[Hieroglyphics Imperium Recordings],http://musicbrainz.org/artist/e5fd8fd1-9073-45...,[http://abandcalledpain.com],,1006041.0,http://www.deezer.com/artist/1006041,,http://www.abandcalledpain.com/,"{'id_city_musicbrainz': '', 'country': 'United...",,A Band Called Pain,https://myspace.com/abandcalledpain,https://www.wikidata.org/wiki/Q4655349,,"{'english': 1, 'unknown': 32}",Group,,,[{'id_member_musicbrainz': '2f647f4c-6272-4ad9...,"[United States, California, Oakland]",A Band Called Pain (abbreviated ABCP) is an Am...,[],http://www.amazon.com/asdf/e/B001LHOG4M?tag=wi...,e5fd8fd1-9073-4586-a741-e44164e543db,[Heavy metal music],0.0,33,http://www.discogs.com/artist/407539,0.0,A_Band_Called_Pain,[],https://itunes.apple.com/us/artist/id83305886,2,4.0,http://www.allmusic.com/artist/mn0000843313,https://play.spotify.com/artist/4g3RlzXVHjXaPp...,,,,,


In [6]:
# count the total number of songs (persons + groups)

data_folders = ['data_lyrics_person_decades/', 'data_lyrics_group_decades/']

total_songs = 0

for data_folder in data_folders:

    for file in glob.glob(data_folder+'*.json.gz'): 

        print('Opening file ', file)
        data_chunk = pd.read_json(file, orient='records', lines=True, chunksize=50000)
        for chunk in data_chunk: 

            n_rows = chunk.shape[0]
            total_songs += n_rows

print("Total number of lyrics (person and group): ", total_songs)

Opening file  data_lyrics_person_decades/lyrics_1920.json.gz
Opening file  data_lyrics_person_decades/lyrics_1940.json.gz
Opening file  data_lyrics_person_decades/lyrics_2010.json.gz
Opening file  data_lyrics_person_decades/lyrics_1910.json.gz
Opening file  data_lyrics_person_decades/lyrics_1900.json.gz
Opening file  data_lyrics_person_decades/lyrics_2000.json.gz
Opening file  data_lyrics_person_decades/lyrics_1980.json.gz
Opening file  data_lyrics_person_decades/lyrics_1950.json.gz
Opening file  data_lyrics_person_decades/lyrics_1970.json.gz
Opening file  data_lyrics_person_decades/lyrics_.json.gz
Opening file  data_lyrics_person_decades/lyrics_1960.json.gz
Opening file  data_lyrics_person_decades/lyrics_1930.json.gz
Opening file  data_lyrics_person_decades/lyrics_1990.json.gz
Opening file  data_lyrics_group_decades/lyrics_1920.json.gz
Opening file  data_lyrics_group_decades/lyrics_1940.json.gz
Opening file  data_lyrics_group_decades/lyrics_2010.json.gz
Opening file  data_lyrics_group

In [7]:
# count how many songs are removed when we skip the decades outside [1960, 2010), including songs with no date
decades_to_keep = ['1960', '1970', '1980', '1990', '2000']

data_folders = ['data_lyrics_person_decades/', 'data_lyrics_group_decades/']

total_songs_in_decades_not_keep = 0

for data_folder in data_folders:

    for file in glob.glob(data_folder+'*.json.gz'): 

        file_name = file.split("/")[-1]
        decade = re.findall(r"\d+", file_name)  # raw string avoids an invalid-escape warning
        decade = decade[0] if len(decade)>0 else ''

        if decade in decades_to_keep:
            continue

        print('Opening file ', file)
        data_chunk = pd.read_json(file, orient='records', lines=True, chunksize=5000)
        for chunk in data_chunk: 

            n_rows = chunk.shape[0]
            total_songs_in_decades_not_keep += n_rows

print("Number removed lyrics (person and group) out of time: ", total_songs_in_decades_not_keep)

Opening file  data_lyrics_person_decades/lyrics_1920.json.gz
Opening file  data_lyrics_person_decades/lyrics_1940.json.gz
Opening file  data_lyrics_person_decades/lyrics_2010.json.gz
Opening file  data_lyrics_person_decades/lyrics_1910.json.gz
Opening file  data_lyrics_person_decades/lyrics_1900.json.gz
Opening file  data_lyrics_person_decades/lyrics_1950.json.gz
Opening file  data_lyrics_person_decades/lyrics_.json.gz
Opening file  data_lyrics_person_decades/lyrics_1930.json.gz
Opening file  data_lyrics_group_decades/lyrics_1920.json.gz
Opening file  data_lyrics_group_decades/lyrics_1940.json.gz
Opening file  data_lyrics_group_decades/lyrics_2010.json.gz
Opening file  data_lyrics_group_decades/lyrics_1910.json.gz
Opening file  data_lyrics_group_decades/lyrics_1900.json.gz
Opening file  data_lyrics_group_decades/lyrics_1950.json.gz
Opening file  data_lyrics_group_decades/lyrics_.json.gz
Number removed lyrics (person and group) out of time:  308967
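
The decade selection above depends only on the digits in each file name. A minimal stand-alone sketch of that step, where `decade_from_path` and `is_kept` are hypothetical helper names introduced for illustration:

```python
import re

DECADES_TO_KEEP = {"1960", "1970", "1980", "1990", "2000"}


def decade_from_path(path):
    """Return the first run of digits in the file name, or '' if there is none."""
    file_name = path.split("/")[-1]
    digits = re.findall(r"\d+", file_name)
    return digits[0] if digits else ""


def is_kept(path):
    """Return True if the file belongs to one of the decades we keep."""
    return decade_from_path(path) in DECADES_TO_KEEP
```

Files such as `lyrics_.json.gz`, which hold songs with no date, yield an empty decade and are therefore counted among the removed songs.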


In [8]:
# songs by persons and groups published in [1960, 2010)
print("Songs with years in [1960, 2010): ", total_songs - total_songs_in_decades_not_keep)

Songs with years in [1960, 2010):  889203


In [9]:
# create new folders
!mkdir dataset_10

!mkdir dataset_10/data_lyrics_group_decades
!mkdir dataset_10/data_lyrics_person_decades


In [10]:
# these are the other filters:
# - keep songs from Male or Female artists, and from groups where the gender of all members is known
# - keep songs from artists with more than N songs (regardless of language)
# - keep songs in English
# - keep songs with non-empty lyrics that are long enough (more than 10 words and more than 4 lines)
# - keep songs published within the activity period of the artist (when available)

def is_consistent_with_artist_lifespan(song_year, artist_lifespan):

    # raw strings avoid invalid-escape warnings; a boundary is used only if exactly one year is found
    artist_begin = re.findall(r"\d{4}", artist_lifespan['begin'])
    artist_begin = int(artist_begin[0]) if len(artist_begin)==1 else ''
    artist_end = re.findall(r"\d{4}", artist_lifespan['end'])
    artist_end = int(artist_end[0]) if len(artist_end)==1 else ''

    after_artist_begin = song_year >= artist_begin if artist_begin!='' else True
    before_artist_end = song_year <= artist_end if artist_end!='' else True
    
    return (after_artist_begin and before_artist_end)


def filter_song(song, artist_info, threshold_n_songs=10):

    artist_type = song.other_artist_info['type']
    artist_lifespan = artist_info.lifeSpan
    song_pubdate = song['song_year_combined']
    n_eng_songs = song.other_artist_info['languages'].get('english', 0)
    n_songs = song.other_artist_info['n_songs']
    is_song_lang_eng = song.language_detect == 'english'

    if artist_type=='Person':
        is_gender_ok = song.other_artist_info['gender'] in ['Male', 'Female']
    elif artist_type=='Group':
        is_gender_ok = song.other_artist_info['n_members'] is not None and song.other_artist_info['n_members']>0 and song.other_artist_info['n_unknown']==0 
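    else:
        # any other (or missing) artist type cannot pass the gender filter;
        # without this assignment is_gender_ok would be undefined below
        is_gender_ok = False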

    lyrics = song['lyrics']
    is_lyrics_not_missing = lyrics is not None
    # short-circuiting guarantees lyrics.strip() is never called on None
    is_lyrics_real = is_lyrics_not_missing and lyrics.strip()!='' and song['n_words']>10 and song['n_lines']>4
    is_lyrics_ok = is_lyrics_not_missing and is_lyrics_real
    is_published_within_lifespan = is_consistent_with_artist_lifespan(song_pubdate, artist_lifespan)
    
    n_no_male_female = 1 if not is_gender_ok else 0
    n_no_more_threshold_n_songs = 1 if not n_songs>threshold_n_songs else 0
    n_no_english = 1 if not is_song_lang_eng else 0
    n_few_words = 1 if not song['n_words']>10 else 0
    n_few_lines = 1 if not song['n_lines']>4 else 0
    n_missing_lyrics = 1 if not is_lyrics_not_missing else 0
    n_published_outside_lifespan = 1 if not is_published_within_lifespan else 0

    report = {
        'n_no_male_female':n_no_male_female,
        'n_no_more_threshold_n_songs':n_no_more_threshold_n_songs,
        'n_no_english':n_no_english,
        'n_few_words':n_few_words,
        'n_few_lines':n_few_lines,
        'n_missing_lyrics':n_missing_lyrics,  
        'n_published_outside_lifespan':n_published_outside_lifespan      
    }

    if is_gender_ok and n_songs>threshold_n_songs and is_lyrics_ok and is_song_lang_eng and is_published_within_lifespan:
        return {'is_to_keep':True, 'report':report}
    else:
        return {'is_to_keep':False, 'report':report}


def write_json_rows(file, df):

    with open(file, 'a') as ww:
        for idx, row in df.iterrows():
            ww.write(json.dumps(row.to_dict())+"\n")
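
Because `write_json_rows` opens the file in append mode, running the filtering cell twice writes every kept song twice. The stdlib-only sketch below illustrates this with a temporary JSON Lines file; the helper names and the toy rows are hypothetical and not part of the dataset.

```python
import json
import os
import tempfile


def append_rows(path, rows):
    """Append one JSON object per line, as write_json_rows does for DataFrame rows."""
    with open(path, "a") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_rows(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


def demo():
    rows = [{"song_id": "a1", "n_words": 120}, {"song_id": "b2", "n_words": 85}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lyrics_demo.json")
        append_rows(path, rows)
        after_first_run = read_rows(path)
        append_rows(path, rows)  # a second run appends the same rows again
        after_second_run = read_rows(path)
    return after_first_run, after_second_run
```

One way to guard against this, under the same assumptions, is to delete or recreate the `dataset_10` folder before rerunning the filtering cell.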

In [11]:
decades_to_keep = ['1960', '1970', '1980', '1990', '2000']
data_folders = ['data_lyrics_person_decades/', 'data_lyrics_group_decades/']

# counters: songs read from the kept decades, and songs discarded by the filters
total_songs_60_10 = 0
n_songs_discarded = 0

count_reports = []

for data_folder in data_folders:
    new_data_folder = "dataset_10/"+data_folder

    for file in glob.glob(data_folder+'*_[!.]*.json.gz'): # skip songs with no dates

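        # str.strip(".gz") strips any of the characters '.', 'g', 'z' from both ends; remove only the suffix instead
        file_name = re.sub(r"\.gz$", "", file.split("/")[-1])
        decade = re.findall(r"\d+", file_name)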
        decade = decade[0] if len(decade)>0 else ''

        if decade not in decades_to_keep:
            continue

        print('Reading file: ', file)
        print('Writing in file: ', new_data_folder+file_name)
        print()
        
        data_chunk = pd.read_json(file, orient='records', lines=True, chunksize=5000)
        for chunk in data_chunk: 

            n_rows = chunk.shape[0]
            total_songs_60_10 += n_rows

            chunk = chunk.merge(chunk.apply(lambda row: pd.Series(filter_song(row, 
                                                                              artists.loc[row.artist_id],
                                                                              threshold_n_songs=10)), axis=1),
                                left_index=True, right_index=True)
            reports = chunk.report.tolist()
            for n_ in range(len(reports)):
                reports[n_]['file_name'] = file_name
                reports[n_]['song_id'] = chunk.iloc[n_].song_id
            count_reports.extend(reports)
            
            chunk = chunk[chunk.is_to_keep==True].drop(columns=['report'])
            n_rows_after = chunk.shape[0]

            n_rows_removed = n_rows - n_rows_after
            n_songs_discarded += n_rows_removed

            write_json_rows(new_data_folder+file_name, chunk)
        

print('Initial number of songs: ', total_songs)
print("Songs with years in [1960, 2010), from the file counts above: ", total_songs - total_songs_in_decades_not_keep)
print("Songs with years in [1960, 2010), read in this loop: ", total_songs_60_10)
print('Number of songs discarded: ', n_songs_discarded)
print('Number of songs in dataset: ', total_songs_60_10-n_songs_discarded)
print()
# get the info of the filters
count_reports = pd.DataFrame(count_reports)

print("How many songs trigger each filter: ")
count_reports[['n_no_male_female', 'n_no_more_threshold_n_songs', 'n_no_english', 
               'n_few_words', 'n_few_lines', 'n_missing_lyrics', 'n_published_outside_lifespan']].sum()
    

Reading file:  data_lyrics_person_decades/lyrics_2000.json.gz
Writing in file:  dataset_10/data_lyrics_person_decades/lyrics_2000.json

Reading file:  data_lyrics_person_decades/lyrics_1980.json.gz
Writing in file:  dataset_10/data_lyrics_person_decades/lyrics_1980.json

Reading file:  data_lyrics_person_decades/lyrics_1970.json.gz
Writing in file:  dataset_10/data_lyrics_person_decades/lyrics_1970.json

Reading file:  data_lyrics_person_decades/lyrics_1960.json.gz
Writing in file:  dataset_10/data_lyrics_person_decades/lyrics_1960.json

Reading file:  data_lyrics_person_decades/lyrics_1990.json.gz
Writing in file:  dataset_10/data_lyrics_person_decades/lyrics_1990.json

Reading file:  data_lyrics_group_decades/lyrics_2000.json.gz
Writing in file:  dataset_10/data_lyrics_group_decades/lyrics_2000.json

Reading file:  data_lyrics_group_decades/lyrics_1980.json.gz
Writing in file:  dataset_10/data_lyrics_group_decades/lyrics_1980.json

Reading file:  data_lyrics_group_decades/lyrics_1970

n_no_male_female                358579
n_no_more_threshold_n_songs      22178
n_no_english                         0
n_few_words                      47527
n_few_lines                      52857
n_missing_lyrics                 45869
n_published_outside_lifespan     42214
dtype: int64
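
The glob pattern `*_[!.]*.json.gz` in the loop above skips the undated file because `[!.]` requires a character other than a dot right after the underscore. A small sketch using `fnmatch.fnmatchcase`, which applies the same shell-style wildcard rules to a single file name (`has_decade` is a hypothetical helper name):

```python
from fnmatch import fnmatchcase

PATTERN = "*_[!.]*.json.gz"


def has_decade(file_name):
    """Return True if the file name has at least one non-dot character after '_'."""
    return fnmatchcase(file_name, PATTERN)
```

With this pattern `lyrics_1960.json.gz` matches, whereas `lyrics_.json.gz` does not.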

In [12]:
count_reports.head()

Unnamed: 0,n_no_male_female,n_no_more_threshold_n_songs,n_no_english,n_few_words,n_few_lines,n_missing_lyrics,n_published_outside_lifespan,file_name,song_id
0,1,0,0,0,0,0,0,lyrics_2000.json,5714dec325ac0d8aee3807c4
1,1,0,0,0,0,0,0,lyrics_2000.json,5714dec325ac0d8aee3807c5
2,1,0,0,0,0,0,0,lyrics_2000.json,5714dec325ac0d8aee3807c6
3,1,0,0,0,0,0,0,lyrics_2000.json,5714dec325ac0d8aee3807c7
4,1,0,0,0,0,0,0,lyrics_2000.json,5714dec325ac0d8aee3807c8


In [13]:
# how many songs triggered at least one filter (i.e., were discarded)
(count_reports[['n_no_male_female', 'n_no_more_threshold_n_songs', 'n_no_english', 
                'n_few_words', 'n_few_lines', 'n_missing_lyrics', 
                'n_published_outside_lifespan']]>0).any(axis=1).sum()

428864
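
The per-filter totals printed earlier add up to 569224, which is more than the 428864 songs flagged here, because a single song can trigger several filters at once. The sketch below reproduces the two ways of counting on a toy list of reports; the three example reports are invented for illustration and only use a subset of the report keys.

```python
FILTER_KEYS = ["n_no_male_female", "n_few_words", "n_few_lines"]


def summarize(reports):
    """Return (per-filter totals, number of songs that trigger at least one filter)."""
    per_filter = {key: sum(report[key] for report in reports) for key in FILTER_KEYS}
    flagged = sum(1 for report in reports if any(report[key] for key in FILTER_KEYS))
    return per_filter, flagged


toy_reports = [
    {"n_no_male_female": 0, "n_few_words": 1, "n_few_lines": 1},  # two filters, one song
    {"n_no_male_female": 1, "n_few_words": 0, "n_few_lines": 0},  # one filter
    {"n_no_male_female": 0, "n_few_words": 0, "n_few_lines": 0},  # kept
]
per_filter, flagged = summarize(toy_reports)
```

In this toy case the per-filter totals sum to 3 while only 2 songs are flagged, the same kind of gap seen in the real counts.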

In [14]:
count_reports.to_json("dataset_10/filtering_count_reports.json")

In [15]:
!gzip dataset_10/data_lyrics_group_decades/*.json
!gzip dataset_10/data_lyrics_person_decades/*.json

In [16]:
!cp -r dataset_10 "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive"
