For this project, we need several pieces of data about each song: lyrics, some measure of popularity, genre, artist, and release date. Unfortunately, this information is not all available from a single source. We will collect data from Genius (a song-analysis website that houses genre and lyrics) and Spotify (a major streaming platform that tracks the total plays and current popularity of a track). In their raw form, then, the data we receive will not be associated by song. We will have to identify which songs correspond across the two platforms and then link the relevant data in our database. We must be careful to minimize the errors that may arise in this process, as they will significantly affect our final results.
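
As a rough illustration of the matching step, the sketch below builds a normalized (title, artist) key that could be compared across the two sources. The function name match_key and the specific normalization rules (lowercasing, stripping accents, dropping bracketed suffixes and punctuation) are assumptions made for this example, not a method the project has settled on.

```python
import re
import unicodedata

def match_key(title, artist):
    """Build a normalized (title, artist) key for comparing records across sources."""
    def normalize(text):
        # Decompose accented characters and drop the combining marks
        text = unicodedata.normalize('NFKD', text)
        text = ''.join(ch for ch in text if not unicodedata.combining(ch))
        text = text.lower()
        # Drop bracketed suffixes such as "(Remastered)" or "[Live]"
        text = re.sub(r'[\(\[].*?[\)\]]', ' ', text)
        # Replace anything that is not a letter, digit, or space with a space
        text = re.sub(r'[^a-z0-9 ]', ' ', text)
        # Collapse runs of whitespace
        return ' '.join(text.split())
    return (normalize(title), normalize(artist))
```

Two records would then be treated as the same song when their keys are equal, for example match_key('Chandelier', 'Sia') and match_key('CHANDELIER (Remastered)', ' sia '). Exact key matching is deliberately strict; songs that fail to match could be reviewed by hand rather than guessed at.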

First, let's take a look at the Genius API. We have to set up an authenticated account with Genius so that it will respond to our API calls. We must identify our application and its purpose; we will then be given a client_id and a client_secret, which allow us to make requests to the server.

For this, we will need an email and a password. I didn't want to use mine, and it makes sense for everyone on the project to have access to the email should they require it when working on the API; thus, I made a new email for project purposes only.

Email: popularity.by.lyrics.analysis@gmail.com

Password: 2020uofudatascience

After creating the Genius account and registering a new API Client App, we have our ID and secret:

Client ID: f6xD9D1KtMiCZij5I-71axaKAN7G6NsuAtPcXzVLXenKdAyxZvmy9pMFBAvnP3j6
Client Secret: l8ducnbdIpED0rQNKpADr1M5x4_Q4qTuPvXQ7tC5X1p9D-vkiX2uqdQ-oGs9J48jr_sbueHofnB4thJigUrsZA

(Later, let's place these into a file, 'credentials.py', which we access from the notebook.)
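
One possible shape for such a file is sketched below. It assumes the values are supplied through environment variables instead of being typed into the notebook; the variable names GENIUS_CLIENT_ID and GENIUS_CLIENT_SECRET and the function name load_genius_credentials are hypothetical.

```python
import os

def load_genius_credentials(env=os.environ):
    """Read the Genius client ID and secret from a mapping of environment variables.

    The variable names used here are hypothetical placeholders.
    """
    try:
        return {
            'client_id': env['GENIUS_CLIENT_ID'],
            'client_secret': env['GENIUS_CLIENT_SECRET'],
        }
    except KeyError as missing:
        # Fail early with a readable message if a variable is not set
        raise RuntimeError(f'Missing environment variable: {missing}') from None
```

The returned dictionary could then be unpacked into the service constructor, so that the notebook itself never contains the secret.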

Now that we have these, we need to perform an OAuth authentication process with the Genius API. This can be handled by a Python package called rauth. The following cell installs rauth and imports it into the notebook.

In [1]:
%conda install rauth
from rauth import OAuth2Service

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.

Note: you may need to restart the kernel to use updated packages.



PackagesNotFoundError: The following packages are not available from current channels:

  - rauth

Current channels:

  - https://repo.anaconda.com/pkgs/main/win-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/win-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/msys2/win-64
  - https://repo.anaconda.com/pkgs/msys2/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

The conda install fails because rauth is not available in the default channels listed above. Installing it from PyPI instead, with %pip install rauth, is one alternative.




In [2]:
#Create a variable to house the Genius authentication
genius = OAuth2Service(
    client_id = 'f6xD9D1KtMiCZij5I-71axaKAN7G6NsuAtPcXzVLXenKdAyxZvmy9pMFBAvnP3j6',
    client_secret = 'l8ducnbdIpED0rQNKpADr1M5x4_Q4qTuPvXQ7tC5X1p9D-vkiX2uqdQ-oGs9J48jr_sbueHofnB4thJigUrsZA',
    name = 'genius',
    authorize_url = 'https://api.genius.com/oauth/authorize',
    access_token_url = 'https://api.genius.com/oauth/token',
    base_url = 'https://api.genius.com/')

#Create a new session
access_token = 'qIcPSKG-IpOYMR0j-Y2NcuHVuGsbGHO3osa4b7BEJuFaBbZgDn26EKjl_whhxSjO'
session = genius.get_session(access_token)

#Make a simple request to the API
response = session.get('songs/378195')

#Print the first 300 characters of the binary stream converted to string by utf-8 encoding
response.content.decode('utf-8')[:300]

'{"meta":{"status":200},"response":{"song":{"annotation_count":33,"api_path":"/songs/378195","apple_music_id":"882945383","apple_music_player_url":"https://genius.com/songs/378195/apple_music_player","description":{"dom":{"tag":"root","children":[{"tag":"p","children":["“Chandelier” was released as t'

It looks like the song we grabbed was Chandelier by Sia. Let's explore the structure of this object to find the data we are looking for. The response we received is JSON, so we need to decode it from the byte stream to a string and then parse it into a Python object that we can index. To do this, we import the json library.

In [3]:
import json

json_object = json.loads(response.content.decode('utf-8'))
json_object

{'meta': {'status': 200},
 'response': {'song': {'annotation_count': 33,
   'api_path': '/songs/378195',
   'apple_music_id': '882945383',
   'apple_music_player_url': 'https://genius.com/songs/378195/apple_music_player',
   'description': {'dom': {'tag': 'root',
     'children': [{'tag': 'p',
       'children': ['“Chandelier” was released as the lead single from Sia’s sixth studio album, ',
        {'tag': 'em', 'children': ['1000 Forms Of Fear']},
        '. Despite its soaring melody, the song has a sad theme describing the life of a “party girl”. On the track, Sia talks about her alcohol and drug addiction.']},
      '',
      {'tag': 'p',
       'children': ['Sia has written songs for pop stars like ',
        {'tag': 'a',
         'attributes': {'href': 'https://genius.com/Rihanna-diamonds-lyrics',
          'rel': 'noopener'},
         'data': {'api_path': '/songs/89794'},
         'children': ['Rihanna']},
        ', ',
        {'tag': 'a',
         'attributes': {'href': 'http
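
The nested 'dom' structure in the output above can be flattened into plain text with a short recursive helper. This is a minimal sketch that assumes every node is either a string or a dict with an optional 'children' list, as in the output shown; the name dom_to_text is made up for this example.

```python
def dom_to_text(node):
    """Recursively collect the text from a Genius description 'dom' node."""
    # Leaf nodes are plain strings
    if isinstance(node, str):
        return node
    # Element nodes are dicts whose 'children' list mixes strings and dicts
    return ''.join(dom_to_text(child) for child in node.get('children', []))
```

It would be called as dom_to_text(json_object['response']['song']['description']['dom']). Paragraphs are joined without a separator here; inserting a newline after each 'p' node would be a small extension.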

While the response object contains a lot of data, it seems to contain neither the lyrics of the song nor the Genius tags (tags are the system by which Genius identifies the genres and sub-genres of a song). This data is available on their website, but as of now there is no support for it within the API. What we can do, however, is use the API as a springboard for more traditional web scraping. The response includes a URL path for the song, under which the tag and lyric information are housed. We can first use the API to get that path, and then go to the corresponding page and scrape the data we want.

In [4]:
json_object['response']['song']['path']

'/Sia-chandelier-lyrics'
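
A small helper can turn a parsed API response into the full page address. This is a sketch that assumes the response has the 'meta' and 'response' layout shown above; the names GENIUS_BASE_URL and song_page_url are hypothetical.

```python
from urllib.parse import urljoin

GENIUS_BASE_URL = 'https://genius.com'

def song_page_url(api_response):
    """Return the genius.com page URL for a parsed song response."""
    # Refuse to continue if the API reported anything other than success
    status = api_response['meta']['status']
    if status != 200:
        raise ValueError(f'Genius API returned status {status}')
    # The 'path' field is site-relative, e.g. '/Sia-chandelier-lyrics'
    path = api_response['response']['song']['path']
    return urljoin(GENIUS_BASE_URL, path)
```

Calling song_page_url(json_object) on the response above would yield the address of the Chandelier page.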

Navigating to https://genius.com/Sia-chandelier-lyrics gives us the page for the song, where we can see the lyrics displayed, as well as the 'tags' (read: genres) near the bottom of the page. Now we can use the BeautifulSoup Python package to scrape the HTML of this page.

In [5]:
import requests
from bs4 import BeautifulSoup

url = 'https://genius.com' + json_object['response']['song']['path']
page = requests.get(url)

In [6]:
html = BeautifulSoup(page.text, 'html.parser')

line_collection = html.find('div',class_='lyrics').find('p').find_all('a')
text_collection = []
for line_object in line_collection:
    text_collection.append(line_object.text)
print(text_collection[:5])

['[Verse 1]', "Party girls don't get hurt", "Can't feel anything, when will I learn?\nI push it down, push it down", 'I\'m the one "for a good time call"\nPhone\'s blowin\' up, ringin\' my doorbell\nI feel the love, feel the love', '1 2 3, 1 2 3, drink\n1 2 3, 1 2 3, drink\n1 2 3, 1 2 3, drink']


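As you can see, we now have lyric data for this song, scraped from the site. Note that find_all('a') returns only the passages wrapped in links, so any lyric text in the paragraph that is not inside an <a> tag would be missed. Next, let's try to find the genre tags.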

In [7]:
metadata = html.find_all('div',class_='metadata_with_icon_tags')
metadata

[]

Unfortunately, it looks like that portion of the website is generated after the HTML is loaded, so while the browser can see it easily, our HTML file doesn't include it. We have to change the way we retrieve the HTML so that our file reflects the page after the JavaScript has run and the data we want has been generated.
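
Before switching tools, it can help to confirm that the element is really absent from the raw HTML and that the selector is not simply misspelled. The sketch below uses only the standard library to check whether any tag in a block of HTML carries a given class; the names ClassFinder and in_static_html are made up for this example, and the check is only a diagnostic, not a way to obtain the rendered page.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Record whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated class names
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.found = True

def in_static_html(html_text, class_name):
    """Return True if class_name appears on any tag in html_text."""
    finder = ClassFinder(class_name)
    finder.feed(html_text)
    return finder.found
```

Calling in_static_html(page.text, 'lyrics') and in_static_html(page.text, 'metadata_with_icon_tags') on the downloaded page would show which of the two classes is present before any JavaScript runs.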

In [8]:
html


<!DOCTYPE html>

<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<base href="//genius.com/" target="_top"/>
<script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>
<title>Sia – Chandelier Lyrics | Genius Lyrics</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="app-id=709482991" name="apple-itunes-app"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1582907364" rel="apple-touch-icon"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1582907364" rel="apple-touch-icon"/>
<!-- Mobile IE allows us to activa