# An Analysis of Lyrical Diversity in Some Groups of Music Artists
### by Joan Chirinos

We will be analyzing the diversity of lyrics, defined as the number of different words an artist uses divided by the total number of words that artist uses in their body of work.
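As a minimal sketch of that definition, the ratio can be computed from a list of already-tokenized words. The function name `lyrical_diversity` is only an illustration and is not used elsewhere in this notebook.

```python
def lyrical_diversity(words):
    """Return unique words divided by total words, or 0.0 for no words."""
    total = len(words)
    if total == 0:
        return 0.0
    return len(set(words)) / total


sample = ['we', 'can', 'be', 'anyone', 'we', 'can', 'be', 'anything']
print(lyrical_diversity(sample))  # 5 unique / 8 total = 0.625
```

A ratio near 1 means almost every word is new, while a ratio near 0 means heavy repetition.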

We will be focusing on a couple of different groups:
* top artists
* some of my favorite artists
* some of Siena Larrick's favorite artists
* a spread of pop artists from those with the lowest lyrical diversity to those with the highest lyrical diversity
* a spread of all artists, from those with the lowest lyrical diversity to those with the highest lyrical diversity

For the last group stated, we will be doing an analysis of which musical genre, if any, is generally most lyrically diverse.

We are going to use the Genius API to find song data, and we're going to use Selenium to scrape the lyrics off of the Genius webpage.

We are also going to use the Last.fm API to find top artists in an easy and manageable way, since it sorts artists by tags and keeps track of those tags.

## Necessary Tools

We first need to install some necessary tools, including a webdriver, a webdriver manager, and Selenium.


In [15]:
# # Our Ubuntu package installs
# !apt update
# !apt install chromium-chromedriver
# !cp /usr/lib/chromium-browser/chromedriver /usr/bin

# Our MacOS package installs
# TODO?

# Our PIP package installs
%pip install selenium
%pip install webdriver_manager

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


Now we can get started!

Let's import our packages into Python to make sure they're all set up.

In [16]:
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service as ChromiumService
# from webdriver_manager.chrome import ChromeDriverManager
# from webdriver_manager.core.os_manager import ChromeType

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException

from selenium.webdriver.common.by import By

from unidecode import unidecode

import requests
import json
import os
# import pprint

Now we can put in our access token and define our default header for all things Genius

In [17]:
ACCESS_TOKEN = 'sk3T-7f0Bl8p6V2vCRB6zdQDWeE7AEF6nsY6sIs4rasG64E6IAexSoQ7nv7hchkA'
DEFAULT_HEADERS = {'Authorization': f'Bearer {ACCESS_TOKEN}'}

Let's define a path variable to use as our root directory for any files we create

In [18]:
ROOT_PATH = 'files/'

## Building a Dictionary

Due to the ever-changing nature of language, and the ever-present human error in the lyrics on Genius (no shade to Genius, they do a great job), I would like to build a dictionary of valid words.

This also leaves room to count Spanish words as words later (as of now, we do not count them), and it keeps us from double-counting variants such as "yuh" and "yuhh" as two separate words.

We will build this dictionary as a Python dictionary, where the value for the key `word` is `word` itself if that's a unique/root word, and the value for key `word` is `parent` if the word is a variant of another word.

For example, `{'yuh': 'yuh', 'yuhh': 'yuh'}`.
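To illustrate how such a mapping can be read, here is a small sketch that follows alias links until it reaches a root. It assumes the self-pointing form used by the code cells in this notebook, where a root word maps to itself. The helper name `resolve_root` and the cycle check are additions for illustration only.

```python
def resolve_root(word, aliases):
    """Follow alias links until reaching a word that maps to itself."""
    seen = set()
    while aliases[word] != word:
        if word in seen:
            # Guard against a malformed dictionary such as {'a': 'b', 'b': 'a'}
            raise ValueError(f'alias cycle detected at {word!r}')
        seen.add(word)
        word = aliases[word]
    return word


aliases = {'yuh': 'yuh', 'yuhh': 'yuh', 'yuhhh': 'yuhh'}
print(resolve_root('yuhhh', aliases))  # yuh
```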

This dictionary will be built at the same time as we collect data for singers, which makes it a somewhat active process.

I'll likely add a feature where, if you're sure you got most of the possible words and variations, you can toggle something to let it run unsupervised. Then, it should make a list of problematic words and who sang them, to be reviewed later.

In [77]:
# Let's begin by importing a dictionary of English words. This is the same dictionary that will be built upon.

RUN_THIS = False

if RUN_THIS:
    ENGLISH = None

    with open('words_dictionary.json') as f:
        ENGLISH = json.load(f)

    # Now, we actually want every word to point to itself.
    # Iterate over a copy of the keys, since the loop may add new keys.
    for word in list(ENGLISH):
        ascii_word = unidecode(word)
        ENGLISH[ascii_word] = ascii_word

    with open('our_words.json', 'w+') as f:
        json.dump(ENGLISH, f, indent=2)
    
    with open('not_words.json', 'w+') as f:
        json.dump({'1': '1'}, f, indent=2)

In [20]:
def add_word_to_dictionary(word, point_to, filename='our_words.json'):
    true_word = unidecode(word)
    true_point_to = unidecode(point_to)
    ENGLISH[true_word] = true_point_to
    with open(filename, 'w+') as f:
        json.dump(ENGLISH, f, indent=2)
    return ENGLISH

We're also going to create a set of all Spanish words in order to remove them from lyrics. This is so we can compare counts of mostly English words, since the artists we're looking at mostly sing in English, followed by Spanish.

This was the simplest solution, and if it's not effective, I'll look for others.

In [21]:
SPANISH_SET = None

with open('spanish.json', 'r') as spanish:
    SPANISH_SET = set(json.load(spanish))

## Genius Functions

We will begin by defining the functions that will look for artists and songs using the Genius API

### search(search_term)

Take a search term `search_term` and return the response from `http://api.genius.com/search` as a dictionary.

In [22]:
def search(search_term):
    payload = {'q': search_term}
    r = requests.get('http://api.genius.com/search',
                     params=payload,
                     headers=DEFAULT_HEADERS)
    return r.json()

### get_artist_id(artist_name)

Take an artist name and return their Genius-assigned id. Creates JSON file for artist in `ROOT_PATH` directory.

In [23]:
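def _get_artist_id(artist_name):
    r = search(artist_name)
    # Check every hit returned rather than assuming there are exactly 10
    for hit in r['response']['hits']:
        primary_artist = hit['result']['primary_artist']
        if primary_artist['name'].casefold() == artist_name.casefold():
            return primary_artist['id']
    # No exact (case-insensitive) match among the search hits
    return None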

In [24]:
def get_artist_id(artist_name):
    print(f'Getting id for artist {artist_name}')
    try:
        with open(os.path.join(ROOT_PATH, 'name_to_id.json')) as f:
            name_to_id = json.load(f)
    except FileNotFoundError:
        name_to_id = {}
    finally:
        if artist_name in name_to_id:
            print(
                f'Found id for {artist_name} in {os.path.join(ROOT_PATH, "name_to_id.json")}. Returning...')
            return name_to_id[artist_name]
        else:  # if artist_name not in name_to_id:
            print(
                f'Did not find id for {artist_name} in {os.path.join(ROOT_PATH, "name_to_id.json")}. Pulling from Genius...')
            artist_id = _get_artist_id(artist_name)
            name_to_id[artist_name] = artist_id
            with open(os.path.join(ROOT_PATH, 'name_to_id.json'), 'w+') as f:
                json.dump(name_to_id, f, indent=2)
                print(
                    f'Added artist name and id to {os.path.join(ROOT_PATH, "name_to_id.json")}')
            try:
                with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
                    artist = json.load(f)
            except FileNotFoundError:
                print(
                    f'Artist JSON does not exist for {artist_name}. Creating {os.path.join(ROOT_PATH, f"{artist_id}.json")}...')
                artist = {}
                artist['name'] = artist_name
                artist['id'] = artist_id
                with open(os.path.join(ROOT_PATH, f'{artist_id}.json'), 'w+') as f:
                    print(
                        f'Created artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}')
                    json.dump(artist, f, indent=2)
            else:
                print(
                    f'Artist JSON exists for {artist_name} at {os.path.join(ROOT_PATH, f"{artist_id}.json")}')

    return artist_id
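The caching pattern used above (read a JSON file if it exists, otherwise start from an empty value, then write the updated value back) can be sketched in isolation. The helper names `load_json_or_default` and `save_json` are hypothetical and not part of this notebook, and the sketch writes to a temporary directory rather than `ROOT_PATH`.

```python
import json
import os
import tempfile


def load_json_or_default(path, default):
    """Return the parsed JSON at path, or default if the file is missing."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def save_json(path, data):
    """Write data to path as indented JSON, replacing any existing file."""
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'name_to_id.json')
    name_to_id = load_json_or_default(cache_path, {})  # file does not exist yet
    name_to_id['Snail Mail'] = 1007900
    save_json(cache_path, name_to_id)
    reloaded = load_json_or_default(cache_path, {})

print(reloaded)
```

Keeping the file handling in two small helpers avoids returning from inside a `finally` block, which can hide unrelated errors such as a corrupted JSON file.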

In [25]:
get_artist_id('Snail Mail')

Getting id for artist Snail Mail
Did not find id for Snail Mail in files/name_to_id.json. Pulling from Genius...


Added artist name and id to files/name_to_id.json
Artist JSON does not exist for Snail Mail. Creating files/1007900.json...
Created artist JSON at files/1007900.json


1007900

### get_artist_songs(artist_name)

Take an artist name and return a dictionary of all of the songs where they're the primary artist, in the form `{song_id: {name: song_name, url: song_url}, ...}`. Also writes song info to artist's JSON file.

In [79]:
def _get_artist_songs(artist_id):
    # songs = {song_id: {'name': song_name, 'url': song_url}, ...}
    songs = {}

    # We're assuming artists have at least one page of music (one song)
    next_page = '1'
    while next_page is not None:
        payload = {'per_page': '50', 'page': next_page, 'sort': 'popularity'}
        r = requests.get(f'http://api.genius.com/artists/{artist_id}/songs',
                         params=payload,
                         headers=DEFAULT_HEADERS).json()
        for song in r['response']['songs']:
            fixed_title = unidecode(song['title'].casefold())
            if song['primary_artist']['id'] == artist_id and\
                'live' not in fixed_title and\
                'remix' not in fixed_title and\
                'translation' not in fixed_title and\
                'simlish' not in fixed_title:
                # songs.append([song['title'], int(song['id']), song['url']])
                song_id = str(song['id'])
                songs[song_id] = {}
                songs[song_id]['name'] = unidecode(song['title']).strip()
                songs[song_id]['url'] = song['url']
        # Advance once per page, after all of that page's songs are processed
        next_page = r['response']['next_page']
    return songs
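The pagination idea (keep requesting pages until `next_page` comes back as `None`) can be sketched without touching the network. Here `fake_fetch` and `FAKE_PAGES` are hand-written stand-ins for the API call, and `collect_all_pages` is a hypothetical helper, not part of this notebook.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(page) repeatedly, following next_page until it is None."""
    items = []
    page = 1
    while page is not None:
        response = fetch_page(page)
        items.extend(response['songs'])
        # Update the page only after the whole page has been consumed
        page = response['next_page']
    return items


# Stand-in data: two pages, the second of which is the last
FAKE_PAGES = {
    1: {'songs': ['Pristine', 'Heat Wave'], 'next_page': 2},
    2: {'songs': ['Thinning'], 'next_page': None},
}


def fake_fetch(page):
    return FAKE_PAGES[page]


all_songs = collect_all_pages(fake_fetch)
print(all_songs)  # ['Pristine', 'Heat Wave', 'Thinning']
```

Note that the `return` sits outside the loop, so every page is visited before the result is handed back.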

In [27]:
def get_artist_songs(artist_name=None, artist_id=None):
    if artist_id is None:
        if artist_name is None:
            raise ValueError(
                'One of artist_id or artist_name must be given as an argument')
        artist_id = get_artist_id(artist_name)

    with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
        artist = json.load(f)
        if 'songs' in artist:
            print(
                f'Song list exists in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Returning...')
            return artist['songs']
        print(
            f'Song list does not exist in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Pulling from Genius...')
        songs = _get_artist_songs(artist_id)
        artist['songs'] = songs
    with open(os.path.join(ROOT_PATH, f'{artist_id}.json'), 'w+') as f:
        print(
            f'Writing songs to {artist_name} JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}')
        json.dump(artist, f, indent=2)

    return songs

In [28]:
# Example
get_artist_songs('Snail Mail')

Getting id for artist Snail Mail
Found id for Snail Mail in files/name_to_id.json. Returning...
Song list does not exist in artist JSON at files/1007900.json. Pulling from Genius...
Writing songs to Snail Mail JSON at files/1007900.json


{'3595541': {'name': 'Pristine',
  'url': 'https://genius.com/Snail-mail-pristine-lyrics'},
 '3678164': {'name': 'Heat Wave',
  'url': 'https://genius.com/Snail-mail-heat-wave-lyrics'},
 '3683677': {'name': 'Speaking Terms',
  'url': 'https://genius.com/Snail-mail-speaking-terms-lyrics'},
 '2867874': {'name': 'Thinning',
  'url': 'https://genius.com/Snail-mail-thinning-lyrics'},
 '3078104': {'name': 'Stick',
  'url': 'https://genius.com/Snail-mail-stick-lyrics'},
 '7191839': {'name': 'Valentine',
  'url': 'https://genius.com/Snail-mail-valentine-lyrics'},
 '3739598': {'name': 'Anytime',
  'url': 'https://genius.com/Snail-mail-anytime-lyrics'},
 '3739597': {'name': 'Deep Sea',
  'url': 'https://genius.com/Snail-mail-deep-sea-lyrics'},
 '3739596': {'name': 'Full Control',
  'url': 'https://genius.com/Snail-mail-full-control-lyrics'},
 '3721962': {'name': "Let's Find an Out",
  'url': 'https://genius.com/Snail-mail-lets-find-an-out-lyrics'},
 '7191843': {'name': 'Ben Franklin',
  'url': '

## Lyric Scraping

We then define some functions related to scraping the lyrics from the Genius website

### get_song_lyrics(song_id, artist_name=None, artist_id=None)

Take a `song_id: str` and either an `artist_name: str` or an `artist_id` and return the song's lyrics as a string, where each line of the song is separated by a newline. The song's URL is read from the artist's JSON, so the song list must already be saved there. Writes lyrics to artist's JSON.

In [29]:
def _get_song_lyrics(song_url):
    # Scrape the lyrics
    lyrics = []
    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    # options.add_argument("--headless=new")
    with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options) as driver:
        # driver = webdriver.Chrome(service=ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()))
        # with webdriver.Chrome(service=ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install())) as driver:
        # with webdriver.Chrome(service=ChromiumService(ChromeDriverManager(chrome_type=ChromeType.CHROMIUM).install()),
        #                       options=options) as driver:
        driver.set_page_load_timeout(5)
        attempts = 0
        while attempts < 10:
            try:
                driver.get(song_url)
            except TimeoutException as e:
                attempts += 1
                if attempts == 10:
                    raise e
                print(f'Attempt {attempts} timed out. Trying again...')
            else:
                break
        chunks = driver.find_elements(
            By.XPATH, '//div[@data-lyrics-container="true"]')
        for chunk in chunks:
            for line in chunk.text.split('\n'):
                if line != '' and line[0] != '[':
                    lyrics.append(' '.join(line.split()))
        return unidecode('\n'.join(lyrics), errors='ignore')
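The retry loop around `driver.get` follows a common pattern: try an action, and on a specific failure try again up to a fixed number of times before giving up. Below is a generic sketch of that pattern with a simulated flaky action in place of a real page load. The helper `call_with_retries` and the `flaky` function are hypothetical and only illustrate the control flow.

```python
def call_with_retries(action, max_attempts, retry_on=Exception):
    """Run action(), retrying on retry_on up to max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            # Re-raise the last failure once the attempts are used up
            if attempt == max_attempts:
                raise
            print(f'Attempt {attempt} failed. Trying again...')


calls = {'count': 0}


def flaky():
    """Simulated page load that times out twice before succeeding."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated timeout')
    return 'page loaded'


result = call_with_retries(flaky, max_attempts=10, retry_on=TimeoutError)
print(result)  # page loaded
```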

In [30]:
def get_song_lyrics(song_id, artist_name=None, artist_id=None):
    if artist_id is None:
        if artist_name is None:
            raise ValueError(
                'One of artist_id or artist_name must be given as an argument')
        artist_id = get_artist_id(artist_name)

    with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
        artist = json.load(f)
    song_name = artist['songs'][song_id]['name']
    if 'lyrics' in artist['songs'][str(song_id)]:
        print(
            f'Found lyrics for {song_name} ({song_id}) in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Returning...')
        return artist['songs'][str(song_id)]['lyrics']

    song_url = artist['songs'][song_id]['url']
    print(f'Lyrics for {song_name} ({song_id}) not found in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Pulling from Genius...')
    lyrics = _get_song_lyrics(song_url)
    artist['songs'][str(song_id)]['lyrics'] = lyrics
    with open(os.path.join(ROOT_PATH, f'{artist_id}.json'), 'w+') as f:
        print(
            f'Writing lyrics for {song_name} ({song_id}) to artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}.')
        json.dump(artist, f, indent=2)
    return lyrics

In [31]:
artist_id = get_artist_id('Snail Mail')
songs = get_artist_songs(artist_id=artist_id)
isongs = iter(songs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
song_id = next(isongs)
print(get_song_lyrics(song_id, artist_id=artist_id))

Getting id for artist Snail Mail
Found id for Snail Mail in files/name_to_id.json. Returning...
Song list exists in artist JSON at files/1007900.json. Returning...
Lyrics for Deep Sea (3739597) not found in artist JSON at files/1007900.json. Pulling from Genius...
<Response [200]>
<Response [200]>
Attempt 1 timed out. Trying again...
Writing lyrics for Deep Sea (3739597) to artist JSON at files/1007900.json.
Deep sea dive
Got down, but you stayed alive
It's only you down there
You and the bends
Lose your mind
Lose track of breathing in time
It's only you down there
Sleep with the tides
We can be anyone
It took so long to know someone like you
And age in the dying sun
Wake only to bathe in greens and blues
Die, my love
Breathe in twos and fours
To know what's worth breathing for
Some days it's easier than falling asleep
We can be anyone
It took so long to know someone like you
And age in the dying sun
Wake only to bathe in greens and blues


### get_all_song_lyrics(artist_name=None, artist_id=None)

Takes an `artist_name` or `artist_id`, looks up the artist's songs `songs = {song_id: {name: str, url: str}}`, and adds `lyrics: the_lyrics` to each song.

Returns the updated dict of songs `songs = {song_id: {name: str, url: str, lyrics: str}}`.

In [32]:
def get_all_song_lyrics(artist_name=None, artist_id=None):
    if artist_id is None:
        if artist_name is None:
            raise ValueError(
                'One of artist_id or artist_name must be given as an argument')
        artist_id = get_artist_id(artist_name)

    songs = get_artist_songs(artist_id=artist_id)
    for song_id in songs:
        songs[song_id]['lyrics'] = get_song_lyrics(
            song_id, artist_id=artist_id)
    return songs

In [33]:
get_all_song_lyrics('Snail Mail')

Getting id for artist Snail Mail
Found id for Snail Mail in files/name_to_id.json. Returning...
Song list exists in artist JSON at files/1007900.json. Returning...
Lyrics for Pristine (3595541) not found in artist JSON at files/1007900.json. Pulling from Genius...
<Response [200]>
<Response [200]>
Attempt 1 timed out. Trying again...
Attempt 2 timed out. Trying again...
Attempt 3 timed out. Trying again...
Writing lyrics for Pristine (3595541) to artist JSON at files/1007900.json.
Lyrics for Heat Wave (3678164) not found in artist JSON at files/1007900.json. Pulling from Genius...
<Response [200]>
<Response [200]>
Attempt 1 timed out. Trying again...
Writing lyrics for Heat Wave (3678164) to artist JSON at files/1007900.json.
Lyrics for Speaking Terms (3683677) not found in artist JSON at files/1007900.json. Pulling from Genius...
<Response [200]>
<Response [200]>
Writing lyrics for Speaking Terms (3683677) to artist JSON at files/1007900.json.
Lyrics for Thinning (2867874) not found i

{'3595541': {'name': 'Pristine',
  'url': 'https://genius.com/Snail-mail-pristine-lyrics',
  'lyrics': "Pristine\nUntraced by the world outside you\nAnyways\nI'll never get real\nand you'll never change to me 'cause I'm not looking\nAnyways\nSame night\nSame humility for those that love you\nAnyways,\nanyways\nAnd if you do find someone better\nI'll still see you in everything\nTomorrow and all the time\nDon't you like me for me?\nIs there any better feeling than coming clean?\nAnd I know myself and I'll never love anyone else\nI won't love anyone else\nI'll never love anyone else\nIt just feels like\nThe same party every weekend\nDoesn't it? Doesn't it?\nAnd if you do find someone better\nI'll still see you in everything\nFor always, tomorrow, and all the time\nDon't you like me for me?\nIs there any better feeling than coming clean?\nAnd I know myself and I'll never love anyone else\nI won't love anyone else\nI'll never love anyone else\nIf it's not supposed to be\nThen I'll just let

## Misc. Functions

We define some helper functions

### tokenize_lyrics(lyrics)

Take song lyrics in the form output by `get_song_lyrics` and tokenize them, removing unnecessary punctuation

In [65]:
TRANSLATION_TABLE = str.maketrans('', '', '!?{}[]()<>,./@#$%^&*_\\|`~+=\";:')


def tokenize_lyrics(lyrics):
    lines = [line.split(' ') for line in lyrics.split('\n')]
    tokens = []
    for line in lines:
        for word in line:
            # Remove casing and most punctuation
            clean_word = word.casefold().translate(TRANSLATION_TABLE)
            clean_word = unidecode(clean_word.strip())

            # Remove apostrophes in a reasonable way
            if clean_word[-2:] == "'s":
                clean_word = clean_word[:-2]
            if len(clean_word) < 1:
                continue
            if clean_word[0] == "'":
                clean_word = clean_word[1:]
            if len(clean_word) < 1:
                continue
            if clean_word[-1] == "'":
                clean_word = clean_word[:-1]
            if len(clean_word) < 1:
                continue
            # NOTE: I will be counting contractions as separate words, since many
            # are a notable part of one's vocabulary (for example, ain't) rather
            # than simply a shortening.
            
            # Remove hyphens from the end and beginning, though.
            # strip('-') also empties a token made only of hyphens, so it is skipped.
            clean_word = clean_word.strip('-')
            if len(clean_word) < 1:
                continue
            
            clean_word = unidecode(clean_word.strip())

            tokens.append(clean_word)
    return tokens
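The punctuation removal above relies on `str.maketrans` with three arguments, where the third argument lists characters to delete. A reduced sketch of just that step is shown here. The shorter table and the name `strip_punctuation` are illustrative and not the notebook's own.

```python
# Deletion table: the third argument lists characters to remove outright.
PUNCTUATION_TABLE = str.maketrans('', '', '!?,.;:"()')


def strip_punctuation(word):
    """Lowercase a word and delete the punctuation listed in the table."""
    return word.casefold().translate(PUNCTUATION_TABLE)


print(strip_punctuation('Anyways,'))  # anyways
print(strip_punctuation("Doesn't"))   # doesn't (apostrophes are kept)
print(strip_punctuation('(Why?)'))    # why
```

Apostrophes are deliberately left out of the table so that contractions survive this step and can be handled separately.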

## Stats Functions

We define functions related to getting lyrics stats

### get_song_stats(song_id, artist_name=None, artist_id=None)

Take a `song_id` with either an `artist_name` or `artist_id` and return a dictionary with the frequency of each unique word in that song. The total word count is the sum of the dictionary's values.

In [75]:
def _get_song_stats(song_lyrics):
    with open('our_words.json') as f:
        english_dict = json.load(f)
    with open('not_words.json') as f:
        not_words = json.load(f)
    word_dict = {}
    for _word in tokenize_lyrics(song_lyrics):
        word = unidecode(_word, errors='ignore')
        if word in SPANISH_SET:
            continue
        skip = False
        if word in not_words:
            continue
        # elif word + 'g' in english_dict:
        #     # THERE ARE SO MANY WORDS SANS THE FINAL G, IT'S A PAIN
        #     # Like "kissin" instead of "kissing"
        #     # We're gonna deal with this once and for all >:(
        #     print(f'Caught {word + "g"} without a "g" >:(')
        #     english_dict[word] = word + 'g'
        #     with open('our_words.json', 'w+') as f:
        #         json.dump(english_dict, f, indent=2)
        elif word not in english_dict:
            print('\n')
            print(f'Word: |{word}|')
            print('(1) New Word')
            print('(2) Not a word')
            print('(panic!) Escape to fix something')
            print('(<word>) Same as <word>')
            while True:
                r = input()
                if r == '1':
                    print(f'Adding {word} to dictionary')
                    english_dict[word] = word
                    print(f'{word}: {english_dict[word]}')
                    with open('our_words.json', 'w+') as f:
                        json.dump(english_dict, f, indent=2)
                    break
                elif r == '2':
                    print(f'Skipping word')
                    not_words[word] = word
                    with open('not_words.json', 'w+') as f:
                        json.dump(not_words, f, indent=2)
                    skip = True
                    break
                elif r == 'panic!':
                    raise RuntimeError('Panic! at the tokenization, probably')
                else:
                    if r not in english_dict:
                        print(f'Make {r} a new root?')
                        print(f'(1) Yes')
                        print(f'(2) No')
                        new_root = input()
                        if new_root == '1':
                            english_dict[r] = r

                    if r in english_dict:
                        while english_dict[r] != r:
                            r = english_dict[r]
                        print(f'Adding {word} as alias to {r}')
                        english_dict[word] = r
                        print(f'{word}: {english_dict[word]}')
                        with open('our_words.json', 'w+') as f:
                            json.dump(english_dict, f, indent=2)
                        break
                    else:
                        print(f'{r} not in english_dict')
        if skip:
            continue
        true_word = english_dict[word]
        word_dict[true_word] = word_dict.get(true_word, 0) + 1
    
    return word_dict
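The unsupervised mode mentioned earlier could look roughly like the sketch below: instead of prompting for each unknown word, it counts what it can and collects the rest for later review. This is only one possible shape for that feature. The function name `count_words_unsupervised` and its parameters are assumptions, and it works on plain in-memory collections rather than the JSON files.

```python
def count_words_unsupervised(tokens, aliases, not_words=(), skip_words=()):
    """Count known words by root and collect unknown words for later review."""
    counts = {}
    unknown = []
    for token in tokens:
        if token in skip_words or token in not_words:
            continue
        if token not in aliases:
            # No prompt here: remember the word and move on
            unknown.append(token)
            continue
        root = aliases[token]
        counts[root] = counts.get(root, 0) + 1
    return counts, unknown


aliases = {'deep': 'deep', 'sea': 'sea', 'yuh': 'yuh', 'yuhh': 'yuh'}
tokens = ['deep', 'sea', 'yuh', 'yuhh', 'hola', 'zzyzx']
counts, unknown = count_words_unsupervised(tokens, aliases, skip_words={'hola'})
print(counts)   # {'deep': 1, 'sea': 1, 'yuh': 2}
print(unknown)  # ['zzyzx']
```

The `unknown` list could then be stored alongside the artist's name so the words can be reviewed in one sitting.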

In [54]:
def get_song_stats(song_id, artist_name=None, artist_id=None):
    if artist_id is None:
        if artist_name is None:
            raise ValueError(
                'One of artist_id or artist_name must be given as an argument')
        artist_id = get_artist_id(artist_name)

    with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
        artist = json.load(f)
    
    song_name = artist['songs'][song_id]['name']

    if 'word_dict' in artist['songs'][song_id]:
        print(
            f'Word dict found for {song_name} ({song_id}) in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Returning...')
        word_dict = artist['songs'][song_id]['word_dict']
    else:
        print(
            f'Word dict not found for {song_name} ({song_id}) in artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}. Counting...')
        if 'lyrics' not in artist['songs'][song_id]:
            get_all_song_lyrics(artist_id=artist_id)
            with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
                artist = json.load(f)
        song_lyrics = artist['songs'][song_id]['lyrics']
        word_dict = _get_song_stats(song_lyrics)
        artist['songs'][song_id]['word_dict'] = word_dict
        with open(os.path.join(ROOT_PATH, f'{artist_id}.json'), 'w+') as f:
            print(
                f'Writing word dict to artist JSON at {os.path.join(ROOT_PATH, f"{artist_id}.json")}.')
            json.dump(artist, f, indent=2)

    return word_dict

In [37]:
artist_id = get_artist_id('Snail Mail')
songs = get_artist_songs(artist_id=artist_id)
isongs = iter(songs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
next(isongs)
song_id = next(isongs)
word_dict = get_song_stats(song_id, artist_id=artist_id)
print(f'{len(word_dict)} unique words')
total_words= sum(word_dict.values())
print(f'{total_words} total words')
sorted_words = sorted(list(word_dict.items()), key=lambda x: x[1], reverse=True)
print(sorted_words)

Getting id for artist Snail Mail
Found id for Snail Mail in files/name_to_id.json. Returning...
Song list exists in artist JSON at files/1007900.json. Returning...
Word dict not found for Deep Sea (3739597) in artist JSON at files/1007900.json. Counting...
Writing word dict to artist JSON at files/1007900.json.
50 unique words
88 total words
[('you', 6), ('and', 6), ('it', 5), ('only', 4), ('the', 4), ('down', 3), ('know', 3), ('there', 2), ('breathing', 2), ('we', 2), ('anyone', 2), ('took', 2), ('long', 2), ('someone', 2), ('like', 2), ('age', 2), ('dying', 2), ('sun', 2), ('wake', 2), ('bathe', 2), ('greens', 2), ('deep', 1), ('dive', 1), ('got', 1), ('but', 1), ('stayed', 1), ('alive', 1), ('bends', 1), ('your', 1), ('mind', 1), ('track', 1), ('of', 1), ('sleep', 1), ('with', 1), ('tides', 1), ('die', 1), ('my', 1), ('love', 1), ('breathe', 1), ('twos', 1), ('fours', 1), ('what', 1), ('worth', 1), ('for', 1), ('some', 1), ('days', 1), ('easier', 1), ('than', 1), ('falling', 1), ('a

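### get_all_stats(artist_name=None, artist_id=None, print_stats=True)

Take an artist name or id and return a dictionary with the frequency of each word across all of their songs. The number of unique words is the dictionary's length, and the total word count is the sum of its values.

For now, this will be a somewhat active process as you will be building up a standard English dictionary.

The `print_stats` argument is accepted but not used yet; the sample workflow at the end of this notebook prints the word count, number of unique words, and their ratio.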

In [40]:
def get_all_stats(artist_name=None, artist_id=None, print_stats=True):
    if artist_id is None:
        if artist_name is None:
            raise ValueError(
                'One of artist_id or artist_name must be given as an argument')
        artist_id = get_artist_id(artist_name)

    songs = get_artist_songs(artist_id=artist_id)
    
    artist_dict = {}

    for song_id in songs:
        song_dict = get_song_stats(song_id, artist_id=artist_id)
        for word in song_dict:
            artist_dict[word] = artist_dict.get(word, 0) + song_dict[word]
    
    with open(os.path.join(ROOT_PATH, f'{artist_id}.json')) as f:
        artist = json.load(f)
    
    artist['word_dict'] = artist_dict
    
    with open(os.path.join(ROOT_PATH, f'{artist_id}.json'), 'w+') as f:
        json.dump(artist, f, indent=2)
    
    return artist_dict
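The merging of per-song counts into one artist-wide dictionary can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the code used above, and the name `merge_song_counts` is hypothetical.

```python
from collections import Counter


def merge_song_counts(song_word_dicts):
    """Add up several {word: count} dictionaries into one."""
    total = Counter()
    for word_dict in song_word_dicts:
        # Counter.update adds counts instead of overwriting them
        total.update(word_dict)
    return dict(total)


merged = merge_song_counts([
    {'you': 6, 'and': 6},
    {'you': 2, 'love': 1},
])
print(merged)  # {'you': 8, 'and': 6, 'love': 1}
```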

In [48]:
word_dict = get_all_stats(artist_name='Snail Mail')

# sorted(list(word_dict.items()), key=lambda x: x[1])

len(word_dict)

Getting id for artist Snail Mail
Found id for Snail Mail in files/name_to_id.json. Returning...
Song list exists in artist JSON at files/1007900.json. Returning...
Word dict found for Pristine (3595541) in artist JSON at files/1007900.json. Returning...
Word dict found for Heat Wave (3678164) in artist JSON at files/1007900.json. Returning...
Word dict found for Speaking Terms (3683677) in artist JSON at files/1007900.json. Returning...
Word dict found for Thinning (2867874) in artist JSON at files/1007900.json. Returning...
Word dict found for Stick (3078104) in artist JSON at files/1007900.json. Returning...
Word dict found for Valentine (7191839) in artist JSON at files/1007900.json. Returning...
Word dict found for Anytime (3739598) in artist JSON at files/1007900.json. Returning...
Word dict found for Deep Sea (3739597) in artist JSON at files/1007900.json. Returning...
Word dict found for Full Control (3739596) in artist JSON at files/1007900.json. Returning...
Word dict found fo

827

## Last.fm Functions

We define some functions using the Last.fm API. These functions will help with looking up top artists and top tags (genres).

In [None]:
LASTFM_ROOT_URL = 'http://ws.audioscrobbler.com/2.0'
LASTFM_API_KEY = '86c0c6cfe8cbb522dc355b223101f16f'

### get_top_tags()

Fairly self-explanatory. Uses the Last.fm API to get top global tags.

This will likely be used to manually form a list of valid and relevant Last.fm tags, as some may be too specific, like progressive metal (no disrespect to those who enjoy progressive metal!).

In [None]:
def get_top_tags():
    payload = {'api_key': LASTFM_API_KEY,
               'method': 'tag.getTopTags',
               'format': 'json'}
    r = requests.get(LASTFM_ROOT_URL,
                     params=payload)
    # We sort tags by their count
    tags = sorted([(tag['name'], tag['count']) for tag in r.json()['toptags']['tag']],
                  key=lambda x: x[1],
                  reverse=True)
    # We remove the 'count' value since we don't care about it besides for sorting
    return [tag[0] for tag in tags]

In [None]:
get_top_tags()

### get_top_artists(tag, limit)

Gets the top `limit` artists with the tag `tag`. I recommend requesting more artists than you need so that you can select an appropriate artist pool (artists who all sing in the same language, so that vocabulary size comparisons are easier and more meaningful).

In [None]:
def get_top_artists(tag, limit):
    payload = {'api_key': LASTFM_API_KEY,
               'method': 'tag.gettopartists',
               'tag': tag,
               'limit': limit,
               'format': 'json'}
    r = requests.get(LASTFM_ROOT_URL,
                     params=payload)
    artists = [artist['name'] for artist in r.json()['topartists']['artist']]
    return artists

In [None]:
get_top_artists('pop', '100')

## Lists of Favorite Artists

Here we define the lists of my and Siena's favorite artists to use to test our methods.

In [63]:
joan_artists = ['Snail Mail', 'Alvvays', 'Indigo de Souza']

siena_artists = ['Bastille', 'Hozier', 'The Amazing Devil']

## Sample Workflow

Here's a sample workflow for using these tools. The loop below runs over `siena_artists`; `joan_artists` can be passed through the same loop.

In [78]:
for artist_name in siena_artists:
    word_dict = get_all_stats(artist_name=artist_name)
    unique_word_count = len(word_dict)
    total_word_count = sum(word_dict.values())

    sorted_words = sorted(word_dict.items(), key=lambda x: x[1])
    print(f'Popular words: {sorted_words[-10:]}')
    # Compute the lowest count once instead of once per word
    min_count = sorted_words[0][1] if sorted_words else 0
    print(f'Unpopular words: {[word for word in sorted_words if word[1] == min_count]}')

    print(f'Unique Words: {unique_word_count}')
    print(f'Total Words: {total_word_count}')
    print(f'Ratio: {unique_word_count / total_word_count}')

Getting id for artist Bastille
Found id for Bastille in files/name_to_id.json. Returning...
Song list exists in artist JSON at files/43324.json. Returning...
Word dict found for Pompeii (133381) in artist JSON at files/43324.json. Returning...
Word dict found for Good Grief (2493550) in artist JSON at files/43324.json. Returning...
Word dict found for No Angels (OPH2) (192700) in artist JSON at files/43324.json. Returning...
Word dict found for Send Them Off! (2385635) in artist JSON at files/43324.json. Returning...
Word dict found for Doom Days (4343485) in artist JSON at files/43324.json. Returning...
Word dict found for Oblivion (259115) in artist JSON at files/43324.json. Returning...
Word dict found for World Gone Mad (3306077) in artist JSON at files/43324.json. Returning...
Word dict found for Fake It (2797664) in artist JSON at files/43324.json. Returning...
Word dict found for Quarter Past Midnight (3689474) in artist JSON at files/43324.json. Returning...
Word dict found for

RuntimeError: Panic! at the tokenization, probably
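Once word dictionaries exist for several artists, ordering them from lowest to highest lyrical diversity is a short step. The sketch below assumes dictionaries in the `{word: count}` form returned by `get_all_stats`. The artist names and counts are placeholders made up for illustration, not measured results, and `rank_by_diversity` is a hypothetical helper.

```python
def rank_by_diversity(artist_word_dicts):
    """Return (artist, ratio) pairs sorted from lowest to highest ratio."""
    ratios = {}
    for name, word_dict in artist_word_dicts.items():
        total = sum(word_dict.values())
        if total:  # skip artists with no counted words
            ratios[name] = len(word_dict) / total
    return sorted(ratios.items(), key=lambda item: item[1])


# Placeholder inputs, not real data
example = {
    'Artist B': {'deep': 1, 'sea': 1, 'dive': 1, 'you': 1},
    'Artist A': {'la': 8, 'love': 2},
}
ranked = rank_by_diversity(example)
print(ranked)  # [('Artist A', 0.2), ('Artist B', 1.0)]
```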

In [None]:
get_top_tags()

In [None]:
get_top_artists('alternative', 100)