# Data Acquisition

For this project, we need several pieces of data about each song: lyrics, some measure of popularity, genre, artist, and release date. Unfortunately, this information is not all available from a single source. We will collect data from Genius (a website for song analysis which houses genre and lyrics) and Spotify (a major streaming platform which tracks the total plays and current popularity of a track). This means that, in its raw form, the data we receive will not be associated by song. We will have to correctly identify which songs correspond across the platforms, and then associate the relevant data in our database. We will have to be careful to minimize the errors that may arise from this process, as they would significantly affect our final results.

First, import the packages that we will be using:

In [1]:
import requests
import json
import base64
from rauth import OAuth2Service
from bs4 import BeautifulSoup

First, let's take a look at the Genius API. We have to set up an authenticated account with Genius so that it will respond to our API calls. We must identify our application and its purpose, and we will then be given a client_id and a client_secret which allow us to make requests to the server.

For this, we will need an email and a password. I didn't want to use mine, and it makes sense for everyone on the project to have access to the email should they require it when working on the API; thus, I made a new email for project purposes only.

Email: popularity.by.lyrics.analysis@gmail.com

Password: 2020uofudatascience

After creating the Genius account and registering a new API Client App, we have our ID and secret:

Client ID: f6xD9D1KtMiCZij5I-71axaKAN7G6NsuAtPcXzVLXenKdAyxZvmy9pMFBAvnP3j6
Client Secret: l8ducnbdIpED0rQNKpADr1M5x4_Q4qTuPvXQ7tC5X1p9D-vkiX2uqdQ-oGs9J48jr_sbueHofnB4thJigUrsZA

(Later, let's place these into a file, 'credentials.py', which we can import.)
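One possible way to keep the ID and secret out of the notebook is sketched below. It is only an illustration: the function name `load_credentials` and the variable names `GENIUS_CLIENT_ID` and `GENIUS_CLIENT_SECRET` are assumptions made for this example, not part of the project.

```python
import os

def load_credentials(prefix, env=os.environ):
    """Read PREFIX_CLIENT_ID and PREFIX_CLIENT_SECRET from a mapping (hypothetical names)."""
    try:
        return env[prefix + '_CLIENT_ID'], env[prefix + '_CLIENT_SECRET']
    except KeyError as missing:
        #Fail with a clear message instead of a bare KeyError
        raise RuntimeError('Missing credential variable: ' + str(missing)) from None

#A hand-made mapping stands in for the real environment in this example
fake_env = {'GENIUS_CLIENT_ID': 'my-id', 'GENIUS_CLIENT_SECRET': 'my-secret'}
client_id, client_secret = load_credentials('GENIUS', fake_env)
print(client_id, client_secret)
```

In real use the second argument would be omitted, so the values are read from the environment of the process running the notebook.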

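Now that we have these, we need to perform an OAuth authentication process with the Genius API. This can be handled by a Python package called rauth. The following cell attempts to install rauth through a Jupyter magic command. As the output shows, rauth is not available from the default conda channels, so it may need to be installed with pip instead (for example, `%pip install rauth`).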

In [2]:
%conda install rauth

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.

Note: you may need to restart the kernel to use updated packages.



PackagesNotFoundError: The following packages are not available from current channels:

  - rauth

Current channels:

  - https://repo.anaconda.com/pkgs/main/win-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/win-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/msys2/win-64
  - https://repo.anaconda.com/pkgs/msys2/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.




In [3]:
#Create a variable to house the Genius authentication
genius = OAuth2Service(
    client_id = 'f6xD9D1KtMiCZij5I-71axaKAN7G6NsuAtPcXzVLXenKdAyxZvmy9pMFBAvnP3j6',
    client_secret = 'l8ducnbdIpED0rQNKpADr1M5x4_Q4qTuPvXQ7tC5X1p9D-vkiX2uqdQ-oGs9J48jr_sbueHofnB4thJigUrsZA',
    name = 'genius',
    authorize_url = 'https://api.genius.com/oauth/authorize',
    access_token_url = 'https://api.genius.com/oauth/token',
    base_url = 'https://api.genius.com/')

#Create a new session
access_token = 'qIcPSKG-IpOYMR0j-Y2NcuHVuGsbGHO3osa4b7BEJuFaBbZgDn26EKjl_whhxSjO'
session = genius.get_session(access_token)

In [4]:
#Make a simple request to the API
response = session.get('songs/378195')

#Print the first 300 characters of the binary stream converted to string by utf-8 encoding
response.content.decode('utf-8')[:300]

'{"meta":{"status":200},"response":{"song":{"annotation_count":33,"api_path":"/songs/378195","apple_music_id":"882945383","apple_music_player_url":"https://genius.com/songs/378195/apple_music_player","description":{"dom":{"tag":"root","children":[{"tag":"p","children":["“Chandelier” was released as t'

It looks like the song we grabbed was Chandelier by Sia. Let's explore the structure of this object to find the data we are looking for. The response we received is JSON, so we need to convert it from the binary stream to a string, and then parse it into a Python object that we can index. To do this, we use the json library imported earlier.

In [5]:
json_object = json.loads(response.content.decode('utf-8'))
json_object

{'meta': {'status': 200},
 'response': {'song': {'annotation_count': 33,
   'api_path': '/songs/378195',
   'apple_music_id': '882945383',
   'apple_music_player_url': 'https://genius.com/songs/378195/apple_music_player',
   'description': {'dom': {'tag': 'root',
     'children': [{'tag': 'p',
       'children': ['“Chandelier” was released as the lead single from Sia’s sixth studio album, ',
        {'tag': 'em', 'children': ['1000 Forms Of Fear']},
        '. Despite its soaring melody, the song has a sad theme describing the life of a “party girl”. On the track, Sia talks about her alcohol and drug addiction.']},
      '',
      {'tag': 'p',
       'children': ['Sia has written songs for pop stars like ',
        {'tag': 'a',
         'attributes': {'href': 'https://genius.com/Rihanna-diamonds-lyrics',
          'rel': 'noopener'},
         'data': {'api_path': '/songs/89794'},
         'children': ['Rihanna']},
        ', ',
        {'tag': 'a',
         'attributes': {'href': 'http

While there is a lot of data in the response object, it contains neither the lyrics of the song nor the Genius tags (tags are the system by which Genius identifies genres and sub-genres of a song). This data is available on their website, but as of now there is no support for it within the API. What we can do, however, is use the API as a springboard for more traditional web scraping. The response includes a path for the song's page, under which the tag and lyric information are housed. We can first use the API to get a URL, and then proceed to that URL and scrape the data we want.

In [6]:
json_object['response']['song']['path']

'/Sia-chandelier-lyrics'
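The decode, parse, and look-up steps can be combined into one small helper. The following is a minimal sketch on a hand-made payload that mimics the shape of the response above; the function name `genius_song_url` and the status check are assumptions added for illustration.

```python
import json
from urllib.parse import urljoin

def genius_song_url(raw_bytes, site='https://genius.com'):
    """Parse a raw API response and return the full URL of the song page."""
    payload = json.loads(raw_bytes.decode('utf-8'))
    #Refuse to continue if the API did not report success
    if payload['meta']['status'] != 200:
        raise ValueError('Unexpected status: ' + str(payload['meta']['status']))
    return urljoin(site, payload['response']['song']['path'])

#Hand-made bytes shaped like the real response, trimmed to the fields used here
sample = b'{"meta":{"status":200},"response":{"song":{"path":"/Sia-chandelier-lyrics"}}}'
print(genius_song_url(sample))
```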

Navigating to https://genius.com/Sia-chandelier-lyrics gives us the page for the song, and we can see the lyrics displayed, as well as the 'tags' (read: genres) near the bottom of the page. Now we can use the BeautifulSoup Python package to scrape the HTML of this page.

In [7]:
url = 'https://genius.com' + json_object['response']['song']['path']
page = requests.get(url)

In [8]:
html = BeautifulSoup(page.text, 'html.parser')
lyrics = html.find('div',class_='lyrics').get_text()
print(lyrics)



[Verse 1]
Party girls don't get hurt
Can't feel anything, when will I learn?
I push it down, push it down
I'm the one "for a good time call"
Phone's blowin' up, ringin' my doorbell
I feel the love, feel the love

[Pre-Chorus]
1 2 3, 1 2 3, drink
1 2 3, 1 2 3, drink
1 2 3, 1 2 3, drink
Throw 'em back till I lose count

[Chorus]
I'm gonna swing from the chandelier
From the chandelier
I'm gonna live like tomorrow doesn't exist
Like it doesn't exist
I'm gonna fly like a bird through the night
Feel my tears as they dry
I'm gonna swing from the chandelier
From the chandelier

[Post-Chorus]
But I'm holding on for dear life
Won't look down, won't open my eyes
Keep my glass full until morning light
'Cause I'm just holding on for tonight
Help me, I'm holding on for dear life
Won't look down, won't open my eyes
Keep my glass full until morning light
'Cause I'm just holding on for tonight
On for tonight

[Verse 2]
Sun is up, I'm a mess
Gotta get out now, gotta run from this
Here comes the shame,

As you can see, we now have all of the lyrics for this song, scraped from the site. Next, we try to find the genre tags. They are stored as JSON inside a metadata element and then loaded dynamically. Luckily, we can still view that metadata in the HTML returned by the original request.

In [9]:
json.loads(html.find('meta', itemprop='page_data')['content'])['dmp_data_layer']['page']['genres']

['Ballad', 'Reggae', 'R&B Genius', 'Electro-Pop', 'Pop Genius']
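The chained indexing above raises a KeyError as soon as one level is missing, which may happen on pages with no tags. A more forgiving look-up could look like the sketch below; the helper name `extract_genres` is hypothetical, and it works on the already parsed metadata dictionary rather than on the HTML.

```python
def extract_genres(page_data):
    """Return the genre list from parsed page metadata, or [] if any level is missing."""
    node = page_data
    for key in ('dmp_data_layer', 'page', 'genres'):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    #Only accept a list, so malformed metadata does not leak through
    return node if isinstance(node, list) else []

#Hand-made dictionaries standing in for the parsed metadata
complete = {'dmp_data_layer': {'page': {'genres': ['Ballad', 'Electro-Pop']}}}
incomplete = {'dmp_data_layer': {}}
print(extract_genres(complete))
print(extract_genres(incomplete))
```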

We can now gather the genres of each song from the Genius tagging system, which completes the data we are collecting from Genius. Next, we need to grab the Spotify hotness and number of listens. This requires another authentication process with OAuth2. Once again, we use the rauth library to handle most of this. Requesting the authorization token, however, is a bit more involved, because Spotify requires that the request have a specific body and headers. For that reason, we do that portion manually through the requests package, using a get_spotify_token function that we define.

In [10]:
spotify = OAuth2Service(
    client_id = 'f8ed44dea3354f30b7e49042cfbe1dd1',
    client_secret = 'ecfe526f97074cbf99680e6a09ee58a8',
    name = 'spotify',
    authorize_url = 'https://accounts.spotify.com/authorize',
    access_token_url = 'https://accounts.spotify.com/api/token',
    base_url = 'https://api.spotify.com')

In [11]:
def get_spotify_token(spotify):
    authorization_id = 'Basic ' + str(base64.b64encode(bytes(spotify.client_id + ':' + spotify.client_secret,'utf-8')),'utf-8')
    header = {'Authorization' : authorization_id}
    body = {'grant_type' : 'client_credentials'}
    return requests.post('https://accounts.spotify.com/api/token', data=body, headers=header).json()['access_token']

session = spotify.get_session(get_spotify_token(spotify))
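The Authorization value built inside get_spotify_token is the string 'Basic ' followed by the base64 encoding of client_id:client_secret. The sketch below isolates that step with placeholder credentials; the helper name `basic_auth_header` is an assumption for illustration.

```python
import base64

def basic_auth_header(client_id, client_secret):
    """Build a 'Basic ...' Authorization value from an ID and secret."""
    raw = (client_id + ':' + client_secret).encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('utf-8')

#Placeholder values, not real credentials
header_value = basic_auth_header('id', 'secret')
print(header_value)

#Decoding the second part recovers the original pair
print(base64.b64decode(header_value.split(' ')[1]))
```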

Some Genius songs contain references to other places where that song appears on the web, such as YouTube, SoundCloud, and even Spotify. We originally thought that we could use these to connect our data points, but we found that they are often unreliable, since they are user-defined fields. Thus, we must find a way to get songs from Spotify using what we have gathered from Genius. Luckily, Spotify exposes its search through the API, so we can query the database using the title and artist from Genius. The following is an example of a search for the track Chandelier from earlier.

In [12]:
title = 'Chandelier'
artist = 'Sia'
spotify_id = None
#Search spotify by name and name + artist
search = title + ' ' + artist
query_params = {'q' : search,
            'type' : 'track'}
response = json.loads(session.get('v1/search', params=query_params).content)
spotify_results = response['tracks']['items']
for result in spotify_results:
    if result['name'] == search:
        spotify_id = result['uri'].split(':')[2]
        break
print(spotify_id)

None


Unfortunately, we were unable to find the correct song. In this cell the track name is compared against the whole search string (title plus artist), so an exact match is impossible; and even when comparing against the title alone, titles within Genius and Spotify often differ in capitalization and other nuances. To combat this, we created several string cleaning functions which help to identify the song robustly. It's a fine line between being too strict and too lenient, but while verifying the initial 200 results we did not find an error, which suggests that the matching is accurate.

In [13]:
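import re

def remove_post_hyphen(string):
    #Drop everything after the first hyphen (e.g. ' - Remastered')
    return string.split('-')[0].strip()

def remove_parenthesis(string):
    #Drop everything after the first opening parenthesis (e.g. ' (Live)')
    return string.split('(')[0].strip()

def alphise(string):
    #Remove digits, then all remaining non-word characters
    return re.sub(r'\W+','',re.sub(r'\d+','',string))

def remove_the(string):
    #Remove the word 'the' regardless of case, since callers lowercase their input first
    return re.sub(r'\bthe\s+','',string,flags=re.IGNORECASE).strip()

def clean_string(string):
    return alphise(remove_post_hyphen(remove_parenthesis(remove_the(string))))

def verify_song_identity(song, name, artist):
    #Compare the cleaned, lowercased title and primary artist
    return (clean_string(song['name'].strip().lower()) == clean_string(name.strip().lower())
            and clean_string(song['artists'][0]['name'].strip().lower()) == clean_string(artist.strip().lower()))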
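The effect of these cleaning steps is easiest to see on a few hand-made titles. The sketch below restates the steps in a single hypothetical function, `normalize`, so that it runs on its own; it assumes that the article 'the' should be stripped after lowercasing, and the example titles are invented for illustration.

```python
import re

def normalize(text):
    """Lowercase, drop 'the', cut at '(' and '-', then keep only letters and underscores."""
    text = text.strip().lower()
    text = re.sub(r'\bthe\s+', '', text)   #drop the article
    text = text.split('(')[0]              #cut parenthesised suffixes
    text = text.split('-')[0]              #cut hyphenated suffixes
    text = re.sub(r'\d+', '', text)        #remove digits
    return re.sub(r'\W+', '', text)        #remove remaining non-word characters

print(normalize('Chandelier (Piano Version)') == normalize('chandelier'))
print(normalize('The Scientist - Live'))
#The lenient side of the trade-off: a remix is treated as the same title
print(normalize('Chandelier - Remix') == normalize('Chandelier'))
```

The last comparison shows why the artist check matters: the cleaning deliberately discards version information, so two distinct recordings by the same artist can still be matched to one another.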

In [14]:
title = 'Fast Cars'
artist = 'Craig David'
spotify_id = None
#Search spotify by name and name + artist
search = title + ' ' + artist
query_params = {'q' : search,
            'type' : 'track'}
response = json.loads(session.get('v1/search', params=query_params).content)
spotify_results = response['tracks']['items']
for result in spotify_results:
    if verify_song_identity(result, title, artist):
        spotify_id = result['uri'].split(':')[2]
        break
print(spotify_id)

12tcvsrkRhDmBHq62vAXaH


Now that we use the verify_song_identity function, we get an output other than None. The result of this bit of code is the Spotify ID, a base-62 string which identifies a particular track. Using this, we can get our final pieces of data. Using the matched track from the previous cell (the loop variable result), this is how one would access the popularity ('hotness') metric exposed by the Spotify API.

In [15]:
result['popularity']

28

And it's as easy as that. The final piece of data we need is the Spotify number of listens. Unfortunately, Spotify does not make this available in the API. Luckily, a GitHub user by the name of evilarceus has already created a project that returns the number of listens by interfacing with the Spotify application. This is difficult work, so we're lucky that someone else has already done it. Getting a playcount through this project looks like this:

In [16]:
html = requests.get('https://t4ils.dev:4433/api/beta/albumPlayCount?albumid=3xFSl9lIRaYXIYkIn3OIl9')
for song in json.loads(html.content)['data']:
    if song['uri'] == 'spotify:track:' + '4VrWlk8IQxevMvERoX08iC':
        print(song['playcount'])

970505387
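The look-up in the loop above can be wrapped in a small function that returns None when the track is not in the album. This is a sketch on hand-made data shaped like the 'data' list in the response; the function name `find_playcount`, the track IDs, and the playcount values are invented for illustration.

```python
def find_playcount(tracks, track_id):
    """Return the playcount of the track with the given Spotify ID, or None if absent."""
    uri = 'spotify:track:' + track_id
    for track in tracks:
        if track.get('uri') == uri:
            return track.get('playcount')
    return None

#Hand-made entries with placeholder IDs and counts
album_tracks = [
    {'uri': 'spotify:track:AAAA', 'playcount': 100},
    {'uri': 'spotify:track:BBBB', 'playcount': 250},
]
print(find_playcount(album_tracks, 'BBBB'))
print(find_playcount(album_tracks, 'CCCC'))
```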


And as you can see, we get the playcount for the song with ID '4VrWlk8IQxevMvERoX08iC'. All that is left to do is to generate random songs and complete this process for each. We generate random songs by their Genius ID. Since Genius IDs are integers between 0 and 4,000,000 (found by extensive trial and error), it is easy to pick one at random. Some numbers within that range are not valid IDs, so the program needs to be reasonably robust to errors.

Here is an example of generating a list of numbers and then shuffling it. That way, we won't pick the same song multiple times, and we won't have to check each new song against the entire dataset. To do this, we'll use Python's random package.

In [17]:
import random
my_list = [i for i in range(5)]
random.shuffle(my_list)
print(my_list)

[4, 2, 0, 1, 3]
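Combining the shuffled list with the need to tolerate invalid IDs gives a loop like the one sketched below. It is not the project's scraping script: the names `shuffled_ids`, `collect`, and `fake_fetch` are hypothetical, and `fake_fetch` merely pretends that odd IDs are invalid so that the example runs without network access.

```python
import random

def shuffled_ids(upper, seed=None):
    """Return every integer in [0, upper) exactly once, in random order."""
    ids = list(range(upper))
    random.Random(seed).shuffle(ids)
    return ids

def collect(ids, fetch, limit):
    """Call fetch on each ID in order, skipping IDs that fail, until limit results exist."""
    results = {}
    for song_id in ids:
        if len(results) >= limit:
            break
        try:
            results[song_id] = fetch(song_id)
        except LookupError:
            #Invalid ID: move on to the next one
            continue
    return results

def fake_fetch(song_id):
    #Stand-in for a real request; odd IDs are treated as invalid
    if song_id % 2:
        raise KeyError(song_id)
    return 'song ' + str(song_id)

ids = shuffled_ids(10, seed=0)
songs = collect(ids, fake_fetch, limit=3)
print(songs)
```

Passing a seed makes the order reproducible, which may help when a long run has to be resumed from where it stopped.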


And just like that, we have a randomly ordered list that we can step through to get random IDs. To see the full implementation of the scraping, along with the additional saving and loading mechanics that were necessary to run it over multiple weeks, see the scrapingscript.py file located in the same folder as this Jupyter notebook.

In [2]:
import pandas as pd
index = pd.read_csv('data/index.csv')
index['current id'] = -1

In [4]:
index.to_csv('data/index.csv')