In [1]:
import requests
from bs4 import BeautifulSoup as bs
import json
import lyricwikia
from IPython.display import clear_output

Scraping the year-end "Hot" charts for each genre. These charts factor in physical sales, radio airplay and streams, and as they list the 100 most popular songs of the year they should be a good representation of each genre.

Not using the Latin or International charts, as songs that are not in English would skew the results.

There is a pop-specific chart on Billboard, but the overall "Hot 100" chart is used for pop instead, as Billboard's pop songs chart only includes radio play.

The charts being used are:
- https://www.billboard.com/charts/year-end/2017/hot-100-songs
- https://www.billboard.com/charts/year-end/2017/hot-rock-songs
- https://www.billboard.com/charts/year-end/2017/hot-country-songs
- https://www.billboard.com/charts/year-end/2017/hot-r-and-and-b-hip-hop-songs
- https://www.billboard.com/charts/year-end/2017/hot-dance-electronic--songs
- https://www.billboard.com/charts/year-end/2017/hot-christian-songs

In [2]:
#the charts go back a differing number of years:
#all of them go back to 2013; all except dance/electronic go back to at least 2006
yearAsString = "2017" 

charts = [{'name': 'Overall', 'urlTag':'100'},
          {'name': 'Rock', 'urlTag': 'rock'},
          {'name': 'Country', 'urlTag': 'country'},
          {'name': 'R&B/Hip-Hop', 'urlTag': 'r-and-and-b-hip-hop'},
          {'name': 'Dance/Electronic', 'urlTag': 'dance-electronic-'},          
          {'name': 'Christian', 'urlTag': 'christian'}]

#add full urls
for chart in charts:
    chart['url'] = ("https://www.billboard.com/charts/year-end/" +
                    yearAsString +
                    "/hot-" + chart['urlTag'] + "-songs")

#test the strings are being created right
for i in charts:
    print(i['url'])

https://www.billboard.com/charts/year-end/2017/hot-100-songs
https://www.billboard.com/charts/year-end/2017/hot-rock-songs
https://www.billboard.com/charts/year-end/2017/hot-country-songs
https://www.billboard.com/charts/year-end/2017/hot-r-and-and-b-hip-hop-songs
https://www.billboard.com/charts/year-end/2017/hot-dance-electronic--songs
https://www.billboard.com/charts/year-end/2017/hot-christian-songs
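
Because the charts go back a differing number of years, the URL construction could be wrapped in a small helper so that other years are easy to try. This is only a sketch: `build_chart_url` and `BASE_URL` are hypothetical names that are not part of the notebook, and it assumes the URL pattern listed earlier also holds for other years.

```python
BASE_URL = "https://www.billboard.com/charts/year-end"

def build_chart_url(year, url_tag):
    """Return the year-end chart URL for a year and a chart's urlTag."""
    # Joining the parts with single slashes avoids an accidental "//" in the path
    return f"{BASE_URL}/{year}/hot-{url_tag}-songs"

# Example: the 2017 rock and dance/electronic charts
print(build_chart_url(2017, "rock"))
print(build_chart_url("2017", "dance-electronic-"))
```

The year can be passed as an int or a string, since the f-string formats both the same way.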


First, test getting the chart entries from just one chart (Rock) to make sure the scraping works:

In [3]:
r = requests.get(charts[1]['url'])
soup = bs(r.content, "lxml")
chartEntries = soup.find_all("div", attrs={"class": "ye-chart-item__primary-row"})

#chart should have 100 entries in it
assert len(chartEntries) == 100

In [4]:
print(chartEntries[0])

<div class="ye-chart-item__primary-row" data-chart-info-url="/fe_data/charts/year-end/2017/hot-rock-songs/other-charts/1" data-date="2017">
<div class="ye-chart-item__rank">
1
</div>
<div class="ye-chart-item__image">
<img alt="" class="" sizes="(max-width: 1023px) 53px, (min-width: 1024px) 87px" src="https://charts-static.billboard.com/img/1840/12/imagine-dragons-hy6-53x53.jpg" srcset="https://charts-static.billboard.com/img/1840/12/imagine-dragons-hy6-53x53.jpg 53w, https://charts-static.billboard.com/img/2017/02/imagine-dragons-hy6-106x106.jpg 106w, https://charts-static.billboard.com/img/2017/02/imagine-dragons-hy6-87x87.jpg 87w, https://charts-static.billboard.com/img/2017/02/imagine-dragons-hy6-174x174.jpg 174w"/> </div>
<div class="ye-chart-item__text">
<div class="ye-chart-item__title">
Believer
</div>
<div class="ye-chart-item__artist">
Imagine Dragons
</div>
</div>
<div class="ye-chart-item__expand-caret">
<span class="fa fa-chevron-down"></span>
<span class="fa fa-chevron-up

In [5]:
print("rank: ", int(chartEntries[0].find("div", attrs={"class": "ye-chart-item__rank"}).text))
print("song: ", chartEntries[0].find("div", attrs={"class": "ye-chart-item__title"}).text.strip())
print("artist: ", chartEntries[0].find("div", attrs={"class": "ye-chart-item__artist"}).text.strip())

rank:  1
song:  Believer
artist:  Imagine Dragons


In [6]:
def GetChartEntries(url):
    '''
    Returns a list of dictionaries with the rank, title and artist of each song 
    from the chart on the passed url page
    '''
    
    r = requests.get(url)
    soup = bs(r.content, "lxml")
    chartEntries = soup.find_all("div", attrs={"class": "ye-chart-item__primary-row"})

    chart = []
    for entry in chartEntries:
        chart.append({'rank': int(entry.find("div", attrs={"class": "ye-chart-item__rank"}).text),
                      'song': entry.find("div", attrs={"class": "ye-chart-item__title"}).text.strip(),
                      'artist': entry.find("div", attrs={"class": "ye-chart-item__artist"}).text.strip()
                     })
        
    return chart

In [7]:
x = GetChartEntries(charts[0]['url'])
for i in x:
    print(i)

{'rank': 1, 'song': 'Shape Of You', 'artist': 'Ed Sheeran'}
{'rank': 2, 'song': 'Despacito', 'artist': 'Luis Fonsi & Daddy Yankee Featuring Justin Bieber'}
{'rank': 3, 'song': "That's What I Like", 'artist': 'Bruno Mars'}
{'rank': 4, 'song': 'Humble.', 'artist': 'Kendrick Lamar'}
{'rank': 5, 'song': 'Something Just Like This', 'artist': 'The Chainsmokers & Coldplay'}
{'rank': 6, 'song': 'Bad And Boujee', 'artist': 'Migos Featuring Lil Uzi Vert'}
{'rank': 7, 'song': 'Closer', 'artist': 'The Chainsmokers Featuring Halsey'}
{'rank': 8, 'song': 'Body Like A Back Road', 'artist': 'Sam Hunt'}
{'rank': 9, 'song': 'Believer', 'artist': 'Imagine Dragons'}
{'rank': 10, 'song': 'Congratulations', 'artist': 'Post Malone Featuring Quavo'}
{'rank': 11, 'song': "Say You Won't Let Go", 'artist': 'James Arthur'}
{'rank': 12, 'song': "I'm The One", 'artist': 'DJ Khaled Featuring Justin Bieber, Quavo, Chance The Rapper & Lil Wayne'}
{'rank': 13, 'song': 'XO TOUR Llif3', 'artist': 'Lil Uzi Vert'}
{'rank':

Now get entries for all charts

In [8]:
for chart in charts:
    chart['entries'] = GetChartEntries(chart['url'])

Check content is all there

In [9]:
#test all charts are the same length and have 100 entries
chartLengths = set()
for chart in charts:
    chartLengths.add(len(chart['entries']))
assert len(chartLengths) == 1, 'Should only be one value in set as all charts should have the same length.'
assert list(chartLengths)[0] == 100, 'All charts should have 100 entries'

for chart in charts:
    print('===================')
    print(chart['name'])
    print('===================')
    for song in chart['entries']:
        print(song)
    print('\n\n')

Overall
{'rank': 1, 'song': 'Shape Of You', 'artist': 'Ed Sheeran'}
{'rank': 2, 'song': 'Despacito', 'artist': 'Luis Fonsi & Daddy Yankee Featuring Justin Bieber'}
{'rank': 3, 'song': "That's What I Like", 'artist': 'Bruno Mars'}
{'rank': 4, 'song': 'Humble.', 'artist': 'Kendrick Lamar'}
{'rank': 5, 'song': 'Something Just Like This', 'artist': 'The Chainsmokers & Coldplay'}
{'rank': 6, 'song': 'Bad And Boujee', 'artist': 'Migos Featuring Lil Uzi Vert'}
{'rank': 7, 'song': 'Closer', 'artist': 'The Chainsmokers Featuring Halsey'}
{'rank': 8, 'song': 'Body Like A Back Road', 'artist': 'Sam Hunt'}
{'rank': 9, 'song': 'Believer', 'artist': 'Imagine Dragons'}
{'rank': 10, 'song': 'Congratulations', 'artist': 'Post Malone Featuring Quavo'}
{'rank': 11, 'song': "Say You Won't Let Go", 'artist': 'James Arthur'}
{'rank': 12, 'song': "I'm The One", 'artist': 'DJ Khaled Featuring Justin Bieber, Quavo, Chance The Rapper & Lil Wayne'}
{'rank': 13, 'song': 'XO TOUR Llif3', 'artist': 'Lil Uzi Vert'}
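
The length assertions stop at the first problem without saying which chart caused it. A possible alternative, sketched here with a hypothetical helper `find_short_charts` and placeholder sample data, reports every chart whose entry count differs from the expected 100:

```python
def find_short_charts(charts, expected=100):
    """Return {chart name: entry count} for charts without `expected` entries."""
    return {chart['name']: len(chart['entries'])
            for chart in charts
            if len(chart['entries']) != expected}

# Placeholder sample data purely for illustration (empty dicts stand in for songs)
sample = [{'name': 'Rock', 'entries': [{}] * 100},
          {'name': 'Country', 'entries': [{}] * 98}]
print(find_short_charts(sample))
```

An empty result means every chart has the expected length, so `assert not find_short_charts(charts)` would fail with a more informative value to inspect.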


In [10]:
#save off to a file to have a frozen backup in case the website changes format in the future
#with open('charts.json', 'w') as outfile:
#    json.dump(charts, outfile)
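
If the backup is enabled, it may be worth reading the file straight back to confirm that it round-trips. A small sketch with a one-song sample in the same shape as the scraped data; the temporary path is an assumption for illustration (the notebook itself writes `charts.json`):

```python
import json
import os
import tempfile

# One-song sample with the same shape as the scraped chart data
sample_charts = [{'name': 'Rock', 'urlTag': 'rock',
                  'entries': [{'rank': 1, 'song': 'Believer',
                               'artist': 'Imagine Dragons'}]}]

path = os.path.join(tempfile.mkdtemp(), 'charts.json')

# Write the backup, then load it again
with open(path, 'w') as outfile:
    json.dump(sample_charts, outfile)
with open(path) as infile:
    restored = json.load(infile)

# Lists, dicts, ints and strings survive a JSON round trip unchanged
assert restored == sample_charts
```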

# Get Lyrics

In [11]:
#test getting lyrics for one song
print(charts[0]['entries'][0]['artist'])
print(charts[0]['entries'][0]['song'])
lyricwikia.get_lyrics(charts[0]['entries'][0]['artist'], 
                      charts[0]['entries'][0]['song'])

Ed Sheeran
Shape Of You


"The club isn't the best place to find a lover\nSo the bar is where I go\nMe and my friends at the table doing shots\nDrinking fast and then we talk slow\n\nYou come over and start up a conversation with just me\nAnd trust me I'll give it a chance now\nTake my hand, stop, put van the man on the jukebox\nAnd then we start to dance\nAnd now I'm singing like\n\nGirl you know I want your love\nYour love was handmade for somebody like me\nCome on now follow my lead\nI may be crazy, don't mind me\n\nSay boy, let's not talk too much\nGrab on my waist and put that body on me\nCome on now follow my lead\nCome, come on now follow my lead\n\nI'm in love with the shape of you\nWe push and pull like a magnet do\nAlthough my heart is falling too\nI'm in love with your body\n\nLast night you were in my room\nAnd now my bed sheets smell like you\nEvery day discovering something brand new\nWell I'm in love with your body\n\n(Oh I, oh I, oh I, oh I)\nI'm in love with your body\n(Oh I, oh I, oh I, oh I)\

In [21]:
def getLyrics(charts):
    for chart in charts:
        for i, song in enumerate(chart['entries'], start=1):
            clear_output()
            print('Chart: ', chart['name'])
            print('Getting song', i, ':', song['song'])
            try:
                song['lyrics'] = lyricwikia.get_lyrics(song['artist'],
                                                       song['song'])
            except Exception:
                #mark failed lookups so they can be retried later
                #(Exception rather than a bare except, so Ctrl-C still stops the loop)
                song['lyrics'] = 'notFound'
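
The lookup loop cannot be exercised without network access. One way to make the same logic testable, sketched here, is to pass the lookup function in as a parameter. `fetch_all_lyrics` and `fake_fetch` are hypothetical names; in the notebook the `fetch` argument would be `lyricwikia.get_lyrics`.

```python
def fetch_all_lyrics(charts, fetch, not_found='notFound'):
    """Fill song['lyrics'] for every entry using fetch(artist, song)."""
    for chart in charts:
        for song in chart['entries']:
            try:
                song['lyrics'] = fetch(song['artist'], song['song'])
            except Exception:
                # Any failed lookup is recorded with the marker value
                song['lyrics'] = not_found

# Stand-in for a real lookup: knows one song and fails for everything else
def fake_fetch(artist, song):
    if (artist, song) == ('Ed Sheeran', 'Shape Of You'):
        return 'placeholder lyrics'
    raise LookupError(song)

sample = [{'name': 'Overall', 'entries': [
    {'artist': 'Ed Sheeran', 'song': 'Shape Of You'},
    {'artist': 'Migos Featuring Lil Uzi Vert', 'song': 'Bad And Boujee'}]}]
fetch_all_lyrics(sample, fake_fetch)
print([s['lyrics'] for s in sample[0]['entries']])
```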

- Get lyrics for all songs

In [12]:
#getLyrics(charts)

Chart:  Christian
Getting song 100 : Tremble


- Check which songs' lyrics weren't found

In [36]:
def printNotFoundSongs(charts):
    totalNotFoundCount = 0
    for chart in charts:
        print("\n=================")
        print(chart['name'])
        print("=================")
        i=0
        for song in chart['entries']:
            if song['lyrics'] == 'notFound':
                i += 1
                totalNotFoundCount += 1
                print(i, ": (rank ", song['rank'], ")", song['song'], " - ", song['artist'])

    print("\n=================")
    print('Total songs lyrics not found for: ', totalNotFoundCount)
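
For comparing runs, it can help to have the not-found counts as data rather than printed text. A minimal sketch, assuming the same `'notFound'` marker used in this notebook; `count_not_found` is a hypothetical helper and the sample lyrics are placeholders:

```python
def count_not_found(charts, marker='notFound'):
    """Return {chart name: number of songs whose lyrics equal `marker`}."""
    return {chart['name']: sum(1 for song in chart['entries']
                               if song.get('lyrics') == marker)
            for chart in charts}

# Placeholder sample data purely for illustration
sample = [{'name': 'Overall', 'entries': [{'lyrics': 'some text'},
                                          {'lyrics': 'notFound'}]},
          {'name': 'Rock', 'entries': [{'lyrics': 'notFound'},
                                       {'lyrics': 'notFound'}]}]
counts = count_not_found(sample)
print(counts, 'total:', sum(counts.values()))
```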

In [20]:
#print all the songs whose lyrics weren't found
printNotFoundSongs(charts)

Overall
1 : Despacito  -  Luis Fonsi & Daddy Yankee Featuring Justin Bieber
2 : Bad And Boujee  -  Migos Featuring Lil Uzi Vert
3 : Closer  -  The Chainsmokers Featuring Halsey
4 : Congratulations  -  Post Malone Featuring Quavo
5 : I'm The One  -  DJ Khaled Featuring Justin Bieber, Quavo, Chance The Rapper & Lil Wayne
6 : XO TOUR Llif3  -  Lil Uzi Vert
7 : Unforgettable  -  French Montana Featuring Swae Lee
8 : Wild Thoughts  -  DJ Khaled Featuring Rihanna & Bryson Tiller
9 : Black Beatles  -  Rae Sremmurd Featuring Gucci Mane
10 : Starboy  -  The Weeknd Featuring Daft Punk
11 : I Don't Wanna Live Forever (Fifty Shades Darker)  -  Zayn / Taylor Swift
12 : It Ain't Me  -  Kygo x Selena Gomez
13 : iSpy  -  KYLE Featuring Lil Yachty
14 : 1-800-273-8255  -  Logic Featuring Alessia Cara & Khalid
15 : I Feel It Coming  -  The Weeknd Featuring Daft Punk
16 : Strip That Down  -  Liam Payne Featuring Quavo
17 : Don't Wanna Know  -  Maroon 5 Featuring Kendrick Lamar
18 : Bad Things  -  Machine 

232 songs not found. 

The most common issue is "Featuring" artists. LyricWikia only lists the lead artist's name, so the artist string needs to be split at "Featuring".
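
The split can be written as a small helper. The sketch below goes slightly further than splitting on "Featuring": it also treats "Feat.", " x " and " / " as separators, which is an assumption based on artist strings such as "Kygo x Selena Gomez" and "Zayn / Taylor Swift" in the list above, not a documented Billboard convention. `lead_artist` is a hypothetical name, and there is no guarantee that LyricWikia files a song under the name it returns.

```python
import re

# Separators assumed to introduce collaborating artists (an assumption
# based on the artist strings in the charts, not a Billboard rule).
# Surrounding whitespace is required so that names like "XXXTENTACION"
# are not split on the letter x.
_COLLAB_SEPARATOR = re.compile(r"\s+(?:Featuring|Feat\.|x|/)\s+")

def lead_artist(artist):
    """Return the part of the artist string before the first separator."""
    return _COLLAB_SEPARATOR.split(artist, maxsplit=1)[0].strip()

for name in ['Migos Featuring Lil Uzi Vert', 'Kygo x Selena Gomez',
             'Zayn / Taylor Swift', 'XXXTENTACION']:
    print(name, '->', lead_artist(name))
```

The match is case-sensitive, so an uppercase "X" inside a name (as in "X Ambassadors") is left alone.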

In [33]:
#retry only the songs that failed, this time using just the lead artist's name
for chart in charts:
    for song in chart['entries']:
        if song['lyrics'] == 'notFound':
            clear_output()
            print('Chart: ', chart['name'])
            print('Getting song', song['rank'], ':', song['song'])
            try:
                #keep only the text before 'Featuring', i.e. the lead artist
                song['lyrics'] = lyricwikia.get_lyrics(song['artist'].split('Featuring')[0].strip(),
                                                       song['song'])
            except Exception:
                song['lyrics'] = 'notFound'

Chart:  Christian
Getting song 100 : Tremble


In [37]:
printNotFoundSongs(charts)


Overall
1 : (rank  13 ) XO TOUR Llif3  -  Lil Uzi Vert
2 : (rank  26 ) I Don't Wanna Live Forever (Fifty Shades Darker)  -  Zayn / Taylor Swift
3 : (rank  27 ) It Ain't Me  -  Kygo x Selena Gomez
4 : (rank  28 ) iSpy  -  KYLE Featuring Lil Yachty
5 : (rank  41 ) Bad Things  -  Machine Gun Kelly x Camila Cabello
6 : (rank  62 ) DNA.  -  Kendrick Lamar
7 : (rank  63 ) Juju On That Beat (TZ Anthem)  -  Zay Hilfigerrr & Zayion McCall
8 : (rank  80 ) Love Galore  -  SZA Featuring Travis Scott
9 : (rank  84 ) What About Us  -  P!nk
10 : (rank  95 ) Everyday We Lit  -  YFN Lucci Featuring PnB Rock
11 : (rank  99 ) Look At Me!  -  XXXTENTACION

Rock
1 : (rank  8 ) Sucker For Pain  -  Lil Wayne, Wiz Khalifa & Imagine Dragons With Logic & Ty Dolla $ign Feat. X Ambassadors
2 : (rank  10 ) HandClap  -  Fitz And The Tantrums
3 : (rank  42 ) Good News  -  Ocean Park Standoff
4 : (rank  56 ) Song #3  -  Stone Sour
5 : (rank  70 ) Fire Escape  -  Andrew McMahon In The Wilderness
6 : (rank  82 ) Ahead

Now 89 songs not found.

