### Scraping the Billboard Year End charts and getting the lyrics for the songs from Genius.com.

This notebook mainly uses the "Hot" chart for each genre, except Pop, which doesn't have one. The Hot charts factor in physical sales, radio airplay and streams, and as they normally list the 100 most popular songs of the year they should be a good representation of each genre. The Pop chart only factors in radio play.

#### Please Note:
- The Latin and International charts are not used, as songs that aren't in English would skew later analysis results.
- Lyrics aren't found for every song. See the missing values section below for more detail.


Links to the 2017 versions of the charts being used:
- https://www.billboard.com/charts/year-end/2017/hot-100-songs
- https://www.billboard.com/charts/year-end/2017/hot-rock-songs
- https://www.billboard.com/charts/year-end/2017/hot-country-songs
- https://www.billboard.com/charts/year-end/2017/hot-r-and-and-b-hip-hop-songs
- https://www.billboard.com/charts/year-end/2017/hot-dance-electronic--songs
- https://www.billboard.com/charts/year-end/2017/pop-songs
- https://www.billboard.com/charts/year-end/2017/hot-christian-songs
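
The links above all follow one pattern: a fixed base path, the year, and a chart-specific tag. A minimal sketch of building them, assuming that pattern also holds for other years (the helper name `build_chart_url` is hypothetical and not part of the notebook code):

```python
# Base path shared by all the year-end chart links listed above.
BASE_URL = "https://www.billboard.com/charts/year-end"


def build_chart_url(year, url_tag):
    """Return the year-end chart URL for a year and chart tag (hypothetical helper)."""
    # Joining on "/" avoids accidentally doubling the separator.
    return "/".join([BASE_URL, str(year), url_tag])


print(build_chart_url(2017, "hot-rock-songs"))
```

Joining the parts with a single separator keeps the result identical to the hand-written links, whichever year is passed in.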

In [305]:
import requests
from bs4 import BeautifulSoup as bs
import json
from IPython.display import clear_output
import numpy as np
import pandas as pd
import re
from config import client_access_token
import lyricsgenius as genius
geniusAPI = genius.Genius(client_access_token)


def getChartsAndLyrics(year):
    '''
    Returns a dataframe with the charts and lyrics to the songs for a passed year.
    Gets entries in the Billboard year-end charts for the Hot100, Rock, Country,
    R&B/Hip-Hop, Dance/Electronic, Pop and Christian charts.
    '''
    
    
    def getChartEntries(url):
        '''
        Returns a list of dictionaries with the rank, title and artist of each song 
        from the chart on the passed url page
        '''

        r = requests.get(url)
        soup = bs(r.content, "lxml")
        chartEntries = soup.find_all("div", attrs={"class":"ye-chart-item__primary-row"})

        chart = []
        for entry in chartEntries:
            chart.append({'rank': int(entry.find("div", attrs={"class":"ye-chart-item__rank"}).text),
                          'song': entry.find("div", attrs={"class":"ye-chart-item__title"}).text.strip(),

                          #replacing ' x ' and ' X ' with ' & ' as that's how genius.com has the names
                          #need to match so the lyrics can be looked up later
                          'artist': entry.find("div", 
                                               attrs={"class":"ye-chart-item__artist"}).text.strip()\
                                                  .replace(' x ', ' & ').replace(' X ', ' & ')})

        return chart

    

    def getLyric(song, artist):
        '''return lyrics for a single song or None if not found'''
        try: 
            return geniusAPI.search_song(song, artist).lyrics
        except Exception:
            #any error (e.g. the search found no match) means the lyrics weren't found
            return None
        
        

    def getLyrics(charts):
        '''get lyrics for all songs in charts of passed dict'''
        for chart in charts:
            i = 0
            for song in chart['entries']:
                i+=1
                clear_output()
                print('Year: ', year) 
                print('Chart: ', chart['name']) 
                print('Getting song', i, ':', song['song'])

                #try getting lyrics with name as is
                song['lyrics'] = getLyric(song['song'], song['artist'])

                #if lyrics not found it's normally because of the featuring artists
                #try splitting on different ways songs add featuring artists to the end of artist names
                #sometimes a combination is used so trying each individually
                artistSplits = ['Featuring', 'With', 'And', '&', '/', ',']
                for splitter in artistSplits:
                    #if lyrics have been found they'll be a non-empty string which evaluates as true
                    if song['lyrics']:
                        break

                    song['lyrics'] = getLyric(song['song'], 
                                              song['artist'].split(splitter)[0].strip())

                #saw a few not found songs with brackets in the name
                #like 'Bodak Yellow (Money Moves)' so try split on open bracket
                if not song['lyrics']:
                    song['lyrics'] = getLyric(song['song'].split('(')[0].strip(), 
                                              song['artist'])
        return
        
        
        
    def createChartsDf(charts):
        '''Returns a dataframe for the passed charts dict'''

        chartsDf = None
        for chart in charts:
            tempDf = pd.DataFrame.from_dict(chart['entries'])
            tempDf['chart'] = chart['name']
            tempDf['chartURL'] = chart['url']

            chartsDf = pd.concat([chartsDf, tempDf])

        chartsDf['year'] = year

        #tidy up table

        #resetting index as it currently restarts at 0 for each chart
        #after reset it will be one continuous range across all charts
        chartsDf.reset_index(inplace=True, drop=True)

        #reorder columns
        chartsDf = chartsDf[['year',
                             'chart', 
                             'chartURL', 
                             'rank', 
                             'song', 
                             'artist', 
                             'lyrics']]

        return chartsDf
    
    
    
    
    charts = [{'name':'Hot100', 'urlTag':'hot-100-songs'},
              {'name':'Rock', 'urlTag':'hot-rock-songs'},
              {'name':'Country', 'urlTag':'hot-country-songs'},
              {'name':'R&B/Hip-Hop', 'urlTag':'hot-r-and-and-b-hip-hop-songs'},
              {'name':'Dance/Electronic', 'urlTag':'hot-dance-electronic--songs'},
              {'name':'Pop', 'urlTag':'pop-songs'},
              {'name':'Christian', 'urlTag':'hot-christian-songs'}]

    #add full urls
    for chart in charts:
        chart['url'] = ("https://www.billboard.com/charts/year-end/" + 
                        str(year) + '/' + chart['urlTag'])
        
    #get entries for all charts
    for chart in charts:
        chart['entries'] = getChartEntries(chart['url'])
        
    #get Lyrics for all songs
    getLyrics(charts)
    
    return createChartsDf(charts)
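
The fallback order used in `getLyrics` (full artist name first, then the part before each "featuring"-style separator, then the title without its bracketed subtitle) can be looked at on its own, without calling Genius. The sketch below only generates the search attempts; the function name `candidate_searches` is hypothetical. Unlike the notebook code, the word separators are padded with spaces so that a name such as "Andra Day" is not split on the "And" inside it. That padding is an assumption of this sketch, not the behaviour of the code above.

```python
# Separators seen between artist names on the charts (from the notebook).
# Assumption of this sketch: word separators are space-padded so they only
# match whole words; "/" and "," are kept as in the notebook.
ARTIST_SEPARATORS = [" Featuring ", " With ", " And ", " & ", "/", ","]


def candidate_searches(song, artist):
    """Return (title, artist) pairs to try, in order, without duplicates."""
    candidates = [(song, artist)]
    for separator in ARTIST_SEPARATORS:
        # Keep only the lead artist, i.e. the text before the separator.
        lead_artist = artist.split(separator)[0].strip()
        candidates.append((song, lead_artist))
    # Last resort: drop a bracketed subtitle such as "(Money Moves)".
    candidates.append((song.split("(")[0].strip(), artist))

    unique = []
    for candidate in candidates:
        if candidate not in unique:
            unique.append(candidate)
    return unique


for title, name in candidate_searches("Bodak Yellow (Money Moves)", "Cardi B"):
    print(title, "|", name)
```

Removing duplicates up front means an artist name with no separators costs one lookup instead of seven.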

### Define year range to get charts for

In [213]:
#different charts go back differing numbers of years
#they all go back to 2013; without Dance/Electronic they go back to at least 2006

years = list(range(2013, 2017 + 1))
print(years)

[2013, 2014, 2015, 2016, 2017]


### Put all charts and lyrics into a dataframe

In [None]:
allChartsDf = None
for year in years:
    chartsDf = getChartsAndLyrics(year)
    allChartsDf = pd.concat([allChartsDf, chartsDf])
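
An alternative to growing `allChartsDf` inside the loop is to collect one dataframe per year in a list and concatenate once at the end, which also lets the index be reset in the same call. A minimal sketch, using a hypothetical stand-in (`fake_charts_for_year`) in place of `getChartsAndLyrics` so that it runs without any network access:

```python
import pandas as pd


def fake_charts_for_year(year):
    """Hypothetical stand-in for getChartsAndLyrics; returns two made-up rows."""
    return pd.DataFrame({"year": [year, year],
                         "chart": ["Hot100", "Pop"],
                         "rank": [1, 1],
                         "lyrics": ["la la la", None]})


years = list(range(2013, 2017 + 1))

# Build every per-year frame first, then concatenate once.
frames = [fake_charts_for_year(year) for year in years]
all_charts = pd.concat(frames, ignore_index=True)

print(len(all_charts))
print(all_charts["lyrics"].count())  # count() skips rows with missing lyrics
```

With only five years the difference is cosmetic, but a single concatenation avoids copying the accumulated rows on every pass.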

### Check the missing values

In [304]:
entriesFound = allChartsDf.groupby(['year', 'chart']).chart.count()
entriesFound.name = 'entriesFound'

lyricsFoundCount = allChartsDf.groupby(['year', 'chart']).lyrics.count()
lyricsFoundCount.name = 'songsFoundCount'

lyricsFoundPercentage = (allChartsDf.groupby(['year', 'chart']).lyrics.count() /
                         allChartsDf.groupby(['year', 'chart']).chart.count())
lyricsFoundPercentage.name = 'songsFoundPercentage'

entriesAndMissingLyricsSummaryDf = pd.concat([entriesFound, 
                                              lyricsFoundCount, 
                                              lyricsFoundPercentage], 
                                             axis=1)
print(entriesAndMissingLyricsSummaryDf)
print('\n================\n')
print('Total Entries:', len(allChartsDf))
print('Total Lyrics Found:', allChartsDf.lyrics.count())
print('Total Lyrics Missing:', len(allChartsDf) - allChartsDf.lyrics.count())
print('Overall percentage of lyrics found:', 
      allChartsDf.lyrics.count() / len(allChartsDf))

                       entriesFound  songsFoundCount  songsFoundPercentage
year chart                                                                
2013 Christian                   49               46              0.938776
     Country                    100              100              1.000000
     Dance/Electronic            99               88              0.888889
     Hot100                     100               94              0.940000
     Pop                         50               47              0.940000
     R&B/Hip-Hop                100               87              0.870000
     Rock                       100               99              0.990000
2014 Christian                  100               91              0.910000
     Country                    100               98              0.980000
     Dance/Electronic           100               88              0.880000
     Hot100                     100               96              0.960000
     Pop                 

- The percentage of lyrics found is 94% overall and 84% or greater for each chart. Obviously this isn't perfect, but it's good enough for me as I want to put my energy into the analysis instead of trying to get the last 6%.
- The entries found don't always match the expected chart length.
    - Sometimes it's intentional: the 2016 and earlier Christian charts and all the Pop charts have only 50 entries.
    - Sometimes it looks like there are errors on Billboard's website, like the 2016 Hot 100 where entry 87 is missing. It just goes 86 then 88.
    - The [2015 R&B/Hip-Hop chart](https://www.billboard.com/charts/year-end/2015/hot-r-and-and-b-hip-hop-songs) only has 25 entries. I did double check the website and it really is only displaying the top 25 songs.
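
Gaps like the missing entry 87 can be found mechanically by comparing the scraped ranks against the ranks a full chart would have. A minimal sketch on made-up ranks (the helper `missing_ranks` is hypothetical, and the expected length has to be supplied by hand because it differs between charts):

```python
def missing_ranks(ranks, expected_length):
    """Return the ranks from 1..expected_length that do not appear in ranks."""
    found = set(ranks)
    return [rank for rank in range(1, expected_length + 1) if rank not in found]


# Made-up ranks imitating a 100-entry chart that skips entry 87.
example_ranks = [rank for rank in range(1, 101) if rank != 87]
print(missing_ranks(example_ranks, 100))
```

Run per year and chart on the `rank` column, a check along these lines would separate charts that are simply shorter from charts with holes in the middle.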



### Save Off CSV of Results

Further analysis will be done in a separate analysis workbook.

In [216]:
allChartsDf.to_csv('charts_and_lyrics_' + 
                    str(min(years)) + '-' + str(max(years)) + 
                   '.csv', 
                   index=False, encoding='utf-8')