# Web Scraping Wikipedia's Billboard Year-End Charts

# About:

This notebook extracts song information (rank, title, artist, and year) from Wikipedia's pages for Billboard's Year-End Hot 100 charts for the years 1960-2020.

### Website: 

https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1960 (*Replace 1960 with any year from 1960 to 2020*)

Note that the lists for three years (1963, 1966, 1975) were later revised by Billboard due to errors or recalculations, and the Wikipedia pages for those years show the revised list alongside the original. We account for this by collecting the data from the revised lists.

**Import necessary libraries for scraping websites**

In [156]:
import csv
import time
import requests
import bs4

In [157]:
# Wikipedia Billboard Year-End pages from 1960-Present follow this format
base_url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}"
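As a quick illustration of how the template expands, the sketch below formats `base_url` for every year in the range. The `urls` list is a name introduced here for illustration only, and nothing is requested over the network.

```python
# Same template as in the cell above
base_url = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}"

# Build one URL per chart year, 1960 through 2020 inclusive
# (`urls` is an illustrative name, not part of the original notebook)
urls = [base_url.format(year) for year in range(1960, 2021)]

print(len(urls))
print(urls[0])
print(urls[-1])
```

Note that `range(1960, 2021)` excludes its upper bound, which is why the scraping loops use 2020 and 2021 as their stop values.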

The 2020 Wikipedia page formats its table a little differently from the earlier years: the rank is stored in a header cell (`th`) rather than a regular cell (`td`). We'll begin by scraping every year except 2020 and handle 2020 separately afterwards.

In [None]:
# initialize empty list
song_list = []

# loop through each year and extract relevant song information to list
for year in range(1960,2020):
    res = requests.get(base_url.format(year))
    
    # stop scraping if the request fails, before trying to parse the response
    if res.status_code != 200:
        print("ERROR. CANNOT SCRAPE ANYMORE.")
        break
    
    soup = bs4.BeautifulSoup(res.text, "lxml")
    
    # select the revised list (second table) if it exists, else get the original list
    tables = soup.select('.wikitable')
    selection = tables[1] if len(tables) > 1 else tables[0]
    
    tr = selection.select('tr')
    
    # add each row to song list
    for item in tr[1:]:
        row_list = []

        # append rankings
        rank = item.select('td')[0].text
        row_list.append(rank)
        
        # append song titles
        title = item.select('td')[1].text
        row_list.append(title.strip('"'))
        
        # append song artists
        artist = item.select('td')[2].text
        row_list.append(artist.strip('\n'))
        
        # append Billboard year    
        row_list.append(year)
        
        # append row with all of the data above
        song_list.append(row_list)
        
    # pause between requests to avoid sending them in rapid succession
    time.sleep(5)

Let's output the first 5 rows and last 5 rows of our list.

In [363]:
song_list[0:5]

[['1', 'Theme from A Summer Place', 'Percy Faith', 1960],
 ['2', "He'll Have to Go", 'Jim Reeves', 1960],
 ['3', "Cathy's Clown", 'The Everly Brothers', 1960],
 ['4', 'Running Bear', 'Johnny Preston', 1960],
 ['5', 'Teen Angel', 'Mark Dinning', 1960]]

In [364]:
song_list[-5:]

[['96', 'Eyes on You', 'Chase Rice', 2019],
 ['97', 'All to Myself', 'Dan + Shay', 2019],
 ['98', 'Boyfriend', 'Ariana Grande and Social House', 2019],
 ['99', 'Walk Me Home', 'Pink', 2019],
 ['100', 'Robbery', 'Juice Wrld', 2019]]
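Beyond eyeballing the first and last rows, a per-year row count can flag pages that were parsed incompletely. The sketch below is one possible check, not part of the original notebook: the helper names are hypothetical, and the default of 100 rows per year is an assumption that may not hold for every chart. The sample rows are copied from the output above, with the expected count lowered to 2 so the check has something to report.

```python
from collections import Counter

def rows_per_year(rows):
    """Count collected rows per year (the year is the last element of each row)."""
    return Counter(row[-1] for row in rows)

def years_with_unexpected_counts(rows, expected=100):
    """Return {year: count} for every year whose row count differs from `expected`."""
    return {year: n for year, n in rows_per_year(rows).items() if n != expected}

# Small stand-in for song_list, using rows shown in the output above
sample = [
    ['1', 'Theme from A Summer Place', 'Percy Faith', 1960],
    ['2', "He'll Have to Go", 'Jim Reeves', 1960],
    ['100', 'Robbery', 'Juice Wrld', 2019],
]

# With expected=2, 1960 passes and 2019 is flagged
print(years_with_unexpected_counts(sample, expected=2))
```

Running `years_with_unexpected_counts(song_list)` on the full list would show any year that needs a closer look.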

Everything appears to be working correctly. We still need the song information for 2020, so we'll collect it now with code similar to the above, adjusted for the different HTML formatting of the 2020 Wikipedia page.

In [344]:
# initialize empty list
song_list_2020 = []

# same loop structure as above, restricted to the single year 2020
for year in range(2020,2021):
    res = requests.get(base_url.format(year))
    
    # stop if the request fails, before trying to parse the response
    if res.status_code != 200:
        print("ERROR. CANNOT SCRAPE ANYMORE.")
        break
    
    soup = bs4.BeautifulSoup(res.text, "lxml")
    
    # select the original list
    selection = soup.select('.wikitable')[0]
    
    tr = selection.select('tr')
    
    # add each row to song list
    for item in tr[1:]:
        row_list = []
        
        # append rankings
        rank = item.select('th')[0].text
        row_list.append(rank.strip('\n'))
        
        # append song titles
        title = item.select('td')[0].text
        row_list.append(title.strip('"'))
        
        # append song artists
        artist = item.select('td')[1].text
        row_list.append(artist.strip('\n'))
        
        # append Billboard year    
        row_list.append(year)
        
        # append row with all of the data above
        song_list_2020.append(row_list)

Again, checking the first and last 5 rows of the list, it seems like everything is working as intended.

In [362]:
song_list_2020[0:5]

[['1', 'Blinding Lights', 'The Weeknd', 2020],
 ['2', 'Circles', 'Post Malone', 2020],
 ['3', 'The Box', 'Roddy Ricch', 2020],
 ['4', "Don't Start Now", 'Dua Lipa', 2020],
 ['5', 'Rockstar', 'DaBaby featuring Roddy Ricch', 2020]]

In [361]:
song_list_2020[-5:]

[['96', 'More Than My Hometown', 'Morgan Wallen', 2020],
 ['97', "Lovin' on You", 'Luke Combs', 2020],
 ['98', 'Said Sum', 'Moneybagg Yo', 2020],
 ['99', 'Slide', 'H.E.R. featuring YG', 2020],
 ['100', 'Walk Em Down', 'NLE Choppa featuring Roddy Ricch', 2020]]
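Both loops clean cell text with separate `strip('"')` and `strip('\n')` calls. A single helper could cover both cases, as in the sketch below. `clean_cell` is a hypothetical name, not something from the original notebook. It also handles a cell whose closing quote is followed by a newline, which `strip('"')` alone would leave untouched.

```python
def clean_cell(text):
    """Strip surrounding whitespace first, then surrounding double quotes."""
    return text.strip().strip('"')

# Title cell with quotes and a trailing newline
print(clean_cell('"Blinding Lights"\n'))

# Artist cell with a trailing newline
print(clean_cell('The Weeknd\n'))

# Rank cell with a trailing newline
print(clean_cell('1\n'))
```

Only the outermost quotes are removed, so apostrophes inside titles such as "Don't Start Now" are preserved.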

As a final step, we write everything we gathered to a CSV file.

In [355]:
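# set the encoding explicitly so non-ASCII titles and artist names are written consistently
file = open('wikipedia_scraper.csv', mode='w', newline='', encoding='utf-8')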

In [356]:
csv_writer = csv.writer(file, delimiter=',')

In [357]:
# write header row
csv_writer.writerow(['rank', 'title', 'artist', 'year'])

24

In [358]:
csv_writer.writerows(song_list)
csv_writer.writerows(song_list_2020)

In [359]:
file.close()
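The cells above open and close the file by hand. An alternative, sketched below, uses `with` blocks so the file is closed even if a write fails, then reads the file back to confirm its contents. This sketch writes to a temporary directory under the made-up name `example.csv` so the real output file is left alone, and it uses only two rows taken from the outputs above.

```python
import csv
import os
import tempfile

# Two rows copied from the outputs above, standing in for the full lists
rows = [
    ['1', 'Theme from A Summer Place', 'Percy Faith', 1960],
    ['100', 'Walk Em Down', 'NLE Choppa featuring Roddy Ricch', 2020],
]

# Write to a temporary directory so wikipedia_scraper.csv is not overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')

    # The file is closed automatically when the with block ends
    with open(path, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['rank', 'title', 'artist', 'year'])
        writer.writerows(rows)

    # Read the file back to check what was written
    with open(path, newline='', encoding='utf-8') as f:
        read_back = list(csv.reader(f))

print(read_back[0])
print(len(read_back) - 1)
```

Every value comes back as a string when read with `csv.reader`, including the year, so it needs converting if it is used as a number later.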