This notebook reads the WASABI database and extracts information about every artist in the database and about every song in English. The two kinds of information are saved separately: one file for artists and per-decade files for songs.

For each artist, we collect the lyrics id, pubDate, and language of every song the artist published. In addition, for groups, we determine the gender of each member without deriving an aggregate "gender" for the group.
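
As a rough illustration of the per-member counting described above, the sketch below tallies genders for a toy group. The member records and the helper name `count_member_genders` are invented for the example; only the `gender` key and the `Male`/`Female` values follow what this notebook reads. Unlike the notebook's `gender_of_members`, which returns `None` counts when no member has a `gender` key, this simplified version always returns integers.

```python
from collections import Counter

# Toy member records (invented); a member may lack the 'gender' key entirely.
members = [
    {'name': 'A', 'gender': 'Male'},
    {'name': 'B', 'gender': 'Female'},
    {'name': 'C', 'gender': ''},
    {'name': 'D'},
]

def count_member_genders(members):
    """Count members per gender; anything other than 'Male'/'Female' is unknown."""
    counts = Counter(member.get('gender') for member in members)
    n_male = counts['Male']
    n_female = counts['Female']
    return {
        'n_members': len(members),
        'n_male': n_male,
        'n_female': n_female,
        'n_unknown': len(members) - n_male - n_female,
    }

group_stats = count_member_genders(members)
print(group_stats)
```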

For songs, we compute a few basic statistics (i.e., number of words and number of lines) and derive the publication date of the song by combining several date fields (e.g., the album publication date).
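
The date combination can be pictured as "take the earliest plausible year among the available date fields". The sketch below is a simplified illustration of that idea, not the notebook's `get_song_year` itself: the helper name `earliest_plausible_year` and the example values are invented, while the accepted range (after 1900 and before 2020) mirrors the filter used later in this notebook.

```python
def earliest_plausible_year(*dates, low=1900, high=2020):
    """Return the smallest year found in the given date strings, or None.

    Each date is expected as 'yyyy', 'yyyy-mm' or 'yyyy-mm-dd'; empty or
    non-numeric values are skipped, as are years outside (low, high).
    """
    years = []
    for date in dates:
        head = date.split('-')[0]
        if head.isnumeric() and low < int(head) < high:
            years.append(int(head))
    return min(years) if years else None

# Invented example values: album release date, album publication year, song date.
year = earliest_plausible_year('1973-03-01', '1994', '')
decade = year // 10 * 10 if year is not None else None
print(year, decade)
```

Taking the minimum favours the original release over later reissues, at the cost of trusting whichever field carries the oldest value.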

The output files are written so that each row is a JSON object containing the information we are interested in.
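
Files with one JSON object per row can be read back line by line. The sketch below writes and then reads a small gzipped file of this kind; the rows and the temporary path are invented for the example, with the file name only imitating the `lyrics_<decade>.json` pattern (plus `.gz`) produced at the end of this notebook.

```python
import gzip
import json
import os
import tempfile

# Invented rows standing in for song records.
rows = [
    {'song_id': 'a1', 'song_title': 'First', 'n_words': 120},
    {'song_id': 'b2', 'song_title': 'Second', 'n_words': 95},
]

# Hypothetical path, created in a temporary directory for the example.
path = os.path.join(tempfile.mkdtemp(), 'lyrics_1970.json.gz')

# Write one JSON object per line.
with gzip.open(path, 'wt', encoding='utf-8') as out:
    for row in rows:
        out.write(json.dumps(row) + '\n')

# Read the file back line by line, parsing each row independently.
with gzip.open(path, 'rt', encoding='utf-8') as inp:
    loaded = [json.loads(line) for line in inp if line.strip()]

print(len(loaded), loaded[0]['song_title'])
```

Because each row is parsed on its own, a large file can be streamed without loading everything into memory at once.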

Note: this notebook was run in Google Colab, so the Colab-specific cells (e.g., mounting Google Drive) will not work in other environments.

In [None]:
# this is useful to extract the text from html
!pip install html2text

Collecting html2text
  Downloading html2text-2020.1.16-py3-none-any.whl (32 kB)
Installing collected packages: html2text
Successfully installed html2text-2020.1.16


The following 2 cells are used to read the WASABI database from our Google Drive accounts.

In [None]:
# This cell is to download the file through the link shared by XXXX

'''
# Install the PyDrive wrapper & import libraries.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

file_id = 'XXXXXX'
downloaded = drive.CreateFile({'id':file_id})
downloaded.FetchMetadata(fetch_all=True)
downloaded.GetContentFile(downloaded.metadata['title'])
'''

In [None]:
# mount GDrive
from google.colab import drive
#drive.mount('/content/drive')
drive._mount('/content/drive')

Mounted at /content/drive


In [None]:
# unpack the folder
# No need to run this cell. To download the dataset, follow the instructions in the README at the root of the repository.
!tar -xzf 2MillionSongsDB.tar.gz

In [None]:
import html2text
import json
import glob
import os
from collections import Counter, defaultdict
from itertools import groupby
from datetime import datetime
import pandas as pd


In [None]:
# create folders to store info
!mkdir data_lyrics_group_decades
!mkdir data_lyrics_person_decades
!mkdir data_lyrics_others_decades

In [None]:
song_fields_to_keep = ['_id', 'abstract', 'aligned_id', 'availableCountries', 'award', 
                         'begin', 'bpm', 'deezer_mapping', 'disambiguation', 'end', 'explicitLyrics', 'explicit_content_lyrics', 'format', 
                         'gain', 'genre', 'id_album', 'id_song_deezer', 'id_song_musicbrainz', 'isClassic', 'isrc', 
                         'language', 'language_detect', 'length', 'lyrics', 'producer', 
                         'publicationDate', 'rank', 'recordLabel', 'recorded', 'releaseDate', 'runtime', 'subject', 'summary', 'title', 
                         'title_accent_fold', 'urlAllmusic', 'urlAmazon', 'urlDeezer', 'urlGoEar', 'urlHypeMachine', 'urlITunes', 'urlLastFm', 'urlMusicBrainz', 
                         'urlPandora', 'urlSong', 'urlSpotify', 'urlWikipedia', 'urlYouTube', 'urlYouTubeExist', 'writer']

# 'title' -> 'song_title'
# '_id' -> 'song_id'

artist_field_to_keep = ['_id', 'abstract', 'dbp_abstract', 'dbp_genre', 'deezerFans', 'disambiguation', 
                        'gender', 'genres', 'id_artist_deezer', 'id_artist_discogs', 'id_artist_musicbrainz', 'labels', 'lifeSpan', 'location', 
                        'locationInfo', 'members', 'name', 'nameVariations', 'nameVariations_fold', 'name_accent_fold', 'recordLabel', 'subject', 
                        'type', 'urlAllmusic', 'urlAmazon', 'urlBBC', 'urlDeezer', 'urlDiscogs', 'urlFacebook', 'urlGooglePlus', 'urlITunes', 'urlInstagram', 
                        'urlLastFm', 'urlMusicBrainz', 'urlMySpace', 'urlOfficialWebsite', 'urlPureVolume', 'urlRateYourMusic', 'urlSecondHandSongs', 
                        'urlSoundCloud', 'urlSpotify', 'urlTwitter', 'urlWikia', 'urlWikidata', 'urlWikipedia', 'urlYouTube', 'urls']

# 'name' -> 'artist_name'
# '_id' -> 'artist_id'
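
The whitelists and renames above amount to a "filter, then rename" step on each record. A minimal sketch of that pattern follows, using a shortened whitelist and an invented artist record; the helper `rename_and_filter` is hypothetical and, unlike the notebook code further down, it builds a new dictionary instead of modifying the input in place.

```python
# Shortened, illustrative whitelist (the notebook's lists are much longer).
fields_to_keep = ['_id', 'name', 'type', 'gender']

# Invented artist record; 'internal_note' is a made-up field.
artist = {'_id': 'x1', 'name': 'Some Band', 'type': 'Group', 'internal_note': 'drop me'}

def rename_and_filter(record, fields, renames):
    """Keep only whitelisted keys, renaming those listed in `renames`."""
    return {renames.get(key, key): value
            for key, value in record.items() if key in fields}

artist_clean = rename_and_filter(artist, fields_to_keep,
                                 {'_id': 'artist_id', 'name': 'artist_name'})
print(artist_clean)
```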

In [None]:
def is_date(date):
    '''
    format yyyy-mm-dd or yyyy-mm or yyyy
    '''

    date_split = date.split('-')
    return all([spl.isnumeric() for spl in date_split])

def get_validated_date(date):
    '''
    Truncate a numeric date (yyyy-mm-dd, yyyy-mm or yyyy) to its longest valid prefix
    '''

    date_split = date.split('-')

    if len(date_split)==2:
        year, month = date_split
        date = f"{year}-{month}" if 1 <= int(month) <= 12 else year
    elif len(date_split)==3:
        year, month, day = date_split
        is_month_valid = 1 <= int(month) <= 12
        is_day_valid = 1 <= int(day) <= 31

        if not is_month_valid:
            # an invalid month makes the day meaningless too: keep only the year
            date = year
        elif not is_day_valid:
            date = f"{year}-{month}"

    return date

def clean_lyric(text):
    if text[:5]=='<span':
        text_cleaned = ''
    else:
        text_cleaned = html2text.html2text(text).replace('  \n','\n')
    return text_cleaned


def gender_of_members(members):
    ''' Return statistics of the gender of members
    '''

    # some members may lack the 'gender' key
    members_with_info = [member for member in members if 'gender' in member.keys()]

    if members_with_info==[]:
        n_members, n_male, n_female, n_unknown = None, None, None, None

    else:
        n_members = len(members)
        members_genders = [member['gender'] for member in members_with_info]

        count_genders = Counter(members_genders)
        n_male = count_genders['Male']
        n_female = count_genders['Female']
        n_unknown = len(members_genders) - n_male - n_female
        # consider members with no 'gender' field as unknown
        n_unknown += len(members) - len(members_with_info)

    return {'n_members':n_members,
            'n_male':n_male,
            'n_female':n_female,
            'n_unknown':n_unknown}

def get_song_year(album_dateRelease, album_pubDate, song_pubDate):
    '''
    Input format respectively:
    yyyy-mm-dd, yyyy, yyyy-mm-dd
    '''

    # extract candidate year
    album_dateRelease_year = int(album_dateRelease.split('-')[0]) if album_dateRelease.split('-')[0].isnumeric() else ''
    album_pubDate_year = int(album_pubDate) if album_pubDate.isnumeric() else ''
    song_pubDate_year = int(song_pubDate.split('-')[0]) if song_pubDate.split('-')[0].isnumeric() else ''
    
    candidate_years = [album_dateRelease_year, album_pubDate_year, song_pubDate_year]
    candidate_years = [d for d in candidate_years if d!='' and (d>1900 and d<2020)]
    year = min(candidate_years) if len(candidate_years)>0 else ''

    # get candidate days of publication
    candidate_pubdates = [d for d in [album_dateRelease, song_pubDate] if d!='' and d.split("-")[0]==str(year) and is_date(d)]

    for n in range(len(candidate_pubdates)):

        candidate_pubdates[n] = get_validated_date(candidate_pubdates[n])
        n_ = len(candidate_pubdates[n].split("-"))
        if n_==1:
            # year only: strptime maps it to January 1st of that year
            candidate_pubdates[n] = datetime.strptime(candidate_pubdates[n], '%Y')
        elif n_==2:
            candidate_pubdates[n] = datetime.strptime(candidate_pubdates[n], '%Y-%m')
        elif n_==3:
            candidate_pubdates[n] = datetime.strptime(candidate_pubdates[n], '%Y-%m-%d')
        else:
            pass

    candidate_pubdate = min(candidate_pubdates).strftime('%Y-%m-%d') if len(candidate_pubdates)>0 else ''

    # return candidate pubdate, year of publication and decade
    decade = year // 10 * 10 if year!='' else ''

    return candidate_pubdate, year, decade


def get_simple_stats_of_song_lyrics(song_lyrics):

    song_lyrics_clean = song_lyrics.replace("\n", " ")
    n_words = len(song_lyrics_clean.split())
    n_lines = sum([1 for l in song_lyrics.split('\n') if l.strip()!=''])

    return n_words, n_lines


def get_song_info(song):
    '''
    Expects a song in English as input
    '''

    # rename id and title
    song_id = song['_id']
    del song['_id']
    song['song_id'] = song_id

    song_title = song['title']
    del song['title']
    song['song_title'] = song_title
    
    # keep fields of interest
    song = {key:song[key] for key in set(song.keys()) if key in song_fields_to_keep or key in ['song_id', 'song_title']}

    lyrics = song['lyrics']
    lyrics = clean_lyric(lyrics).strip()
    lyrics = lyrics if lyrics!='' else None
    song['lyrics'] = lyrics
        
    if lyrics is not None:
        n_words, n_lines = get_simple_stats_of_song_lyrics(lyrics)
    else:
        n_words, n_lines = None, None

    # add simple stats
    song['n_words'] = n_words
    song['n_lines'] = n_lines

    # get songwriters (if any)
    songwriters = song['writer'] if 'writer' in song.keys() else []
    song['writer'] = songwriters

    return song

        
def info_about_song_production(albums):
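    '''
    Count the albums and songs of an artist, tally the detected song languages,
    and collect the English songs enriched with album info and combined dates
    '''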

    n_albums = 0
    n_songs = 0
    songs = []
    languages = defaultdict(int)

    # loop across albums to get songs info
    for album in albums:
        n_albums += 1
        album_id = album['_id']
        album_pubdate = album['publicationDate']
        album_dateRelease = album['dateRelease'] if 'dateRelease' in album.keys() else ''
        album_genre = album['genre']
        for song in album['songs']:
            n_songs += 1
            
            lang = song['language_detect']
            lang = lang if lang!='' else 'unknown'
            languages[lang] += 1

            if lang not in ['english']:
                continue

            song_new = get_song_info(song)
            song_new['album_pubdate'] = album_pubdate
            song_new['album_genre'] = album_genre
            song_new['album_dateRelease'] = album_dateRelease
            song_pubDate = song_new['publicationDate']

            # get publication year and decade
            candidate_pubdate, year, decade = get_song_year(album_dateRelease, album_pubdate, song_pubDate)
            song_new['song_pubdate_combined'] = candidate_pubdate
            song_new['song_year_combined'] = year
            song_new['song_decade_combined'] = decade

            songs.append(song_new)

    return n_albums, n_songs, languages, songs

def write_json_rows(fold, rows):

    rows.sort(key=lambda item: item['song_decade_combined'] if isinstance(item['song_decade_combined'], int) else 0)
    for decade, group in groupby(rows, key=lambda item: item['song_decade_combined']):

        file_name = fold+f'lyrics_{decade}.json'
        with open(file_name, 'a') as ww:
            for row in group:
                ww.write(json.dumps(row)+"\n")

def write_json_row(file, row):

    with open(file, 'a') as ww:
        ww.write(json.dumps(row)+"\n")
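
`itertools.groupby` only merges consecutive items with equal keys, which is why `write_json_rows` sorts the rows before grouping them by decade. The sketch below illustrates this on invented rows; the helper name `decade_sort_key` is hypothetical, and the empty string stands for a decade that could not be determined, as in `get_song_year`.

```python
from itertools import groupby

# Invented rows; '' marks a song whose decade could not be determined.
rows = [
    {'song_id': 's1', 'song_decade_combined': 1980},
    {'song_id': 's2', 'song_decade_combined': ''},
    {'song_id': 's3', 'song_decade_combined': 1970},
    {'song_id': 's4', 'song_decade_combined': 1980},
]

def decade_sort_key(row):
    """Sort integer decades numerically and place unknown decades first."""
    decade = row['song_decade_combined']
    return decade if isinstance(decade, int) else 0

rows_sorted = sorted(rows, key=decade_sort_key)

# After sorting, each decade forms one contiguous run, so groupby yields it once.
ids_by_decade = {
    decade: [row['song_id'] for row in group]
    for decade, group in groupby(rows_sorted, key=lambda r: r['song_decade_combined'])
}
print(ids_by_decade)
```

Without the sort, the two 1980 rows would land in separate groups; in the notebook this would still append to the same file, but the file would be opened once per run of rows.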

In [None]:
# please, point to the directory containing the dataset

# collect all the author objects
path = '/content/2MillionSongsDB/'
files = os.listdir(path)
files = [f for f in files if '.' not in f]    # keep only entries without a dot in their name
files = sorted(files, key=lambda i: int(i))   # sort the folders
n_files = len(files)

#album_keys = set()
#song_keys = set()

for n_file, file in enumerate(files, 1):
    # each file contains 200 artists
    with open(os.path.join(path, str(file)), 'r') as f:
        json_datas = json.load(f)

    if n_file%20 == 0:
        print(f'Done {n_file} files of {n_files}..')

    for artist in json_datas:
        
        n_albums, n_songs, languages, songs = info_about_song_production(artist['albums'])

        #album_keys.update(set([k for album in artist['albums'] for k in album.keys()]))
        #song_keys.update(set([k for album in artist['albums'] for song in album['songs'] for k in song.keys()]))
             
        # rename id and name
        artist_id = artist['_id']
        del artist['_id']
        artist['artist_id'] = artist_id

        artist_name = artist['name']
        del artist['name']
        artist['artist_name'] = artist_name

        # get gender of members
        genders = gender_of_members(artist['members'])

        # add other fields
        artist['n_albums'] = n_albums
        artist['n_songs'] = n_songs
        artist['languages'] = dict(languages)
        artist = {**artist, **genders}

        other_artist_info = {
            'n_albums':artist['n_albums'],
            'n_songs':artist['n_songs'],
            'languages':artist['languages'],
            'gender':artist['gender'],
            'type':artist['type'],
            **genders
        }
        for n in range(len(songs)):
            songs[n]['other_artist_info'] = other_artist_info
            songs[n]['artist_id'] = artist_id
            songs[n]['artist_name'] = artist_name


        # keep fields of interest
        artist = {key:artist[key] for key in set(artist.keys()) 
                        if key in artist_field_to_keep or key in ['artist_id', 'artist_name', 'n_members', 'n_male', 'n_female', 'n_unknown', 'n_albums', 'n_songs', 'languages']}

        
        # now save artist row and song rows
        artist_file = "artists_info.json"
        write_json_row(artist_file, artist)

        if artist['type']=='Group':
            song_fold = "data_lyrics_group_decades/"
        elif artist['type']=='Person':
            song_fold = "data_lyrics_person_decades/"
        else:
            song_fold = "data_lyrics_others_decades/"
        write_json_rows(song_fold, songs)
        
 

Done 20 files of 388..
Done 40 files of 388..
Done 60 files of 388..
Done 80 files of 388..
Done 100 files of 388..
Done 120 files of 388..
Done 140 files of 388..
Done 160 files of 388..
Done 180 files of 388..
Done 200 files of 388..
Done 220 files of 388..
Done 240 files of 388..
Done 260 files of 388..
Done 280 files of 388..
Done 300 files of 388..
Done 320 files of 388..
Done 340 files of 388..
Done 360 files of 388..
Done 380 files of 388..


In [None]:
# gzip all
!gzip data_lyrics_group_decades/*.json
!gzip data_lyrics_others_decades/*.json
!gzip data_lyrics_person_decades/*.json

!gzip artists_info.json

In [None]:
# copy to Drive
!cp -r data_lyrics_group_decades "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data"
!cp -r data_lyrics_person_decades "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data"
!cp -r data_lyrics_others_decades "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data"

!cp artists_info.json.gz "drive/MyDrive/Artistic_Content_Creation/WASABI_gender_experiments/WASABI_gender_experiments_definitive/data"