# Scraping Music Tags from Bensound.com

The website https://www.bensound.com/ offers music that can be downloaded and used for free under a Creative Commons license.
It features more than 250 tracks, each with plenty of tags. In this notebook, I am going to use web scraping to create a dataset containing all tags for every track on the website.

# 1. Extract All Tracks on a Page

On Bensound, the tags cannot be accessed from the main page, only from each track's individual page. Therefore, our first step is to write a function that returns all track URLs on a page.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
def get_tracks(page_url):
    
    # Set up soup for the main page and collect every link in the first div
    page_r = requests.get(page_url)
    page_soup = BeautifulSoup(page_r.text, "html.parser")
    page_html = page_soup.body.div.find_all("a")
    
    # Set up an empty list and fill it with the track URLs
    track_list = []
    # With this page layout, the track links sit at positions 33, 38, ..., 83 of the link list
    for i in range(33, 84, 5):
        entry = page_html[i]["href"]
        track_list.append(entry)
        
    return track_list

In [3]:
get_tracks("https://www.bensound.com/royalty-free-music/")

['https://www.bensound.com/royalty-free-music/track/ukulele',
 'https://www.bensound.com/royalty-free-music/track/creative-minds',
 'https://www.bensound.com/royalty-free-music/track/a-new-beginning',
 'https://www.bensound.com/royalty-free-music/track/little-idea',
 'https://www.bensound.com/royalty-free-music/track/jazzy-frenchy',
 'https://www.bensound.com/royalty-free-music/track/happy-rock',
 'https://www.bensound.com/royalty-free-music/track/hey-happy-cheerful',
 'https://www.bensound.com/royalty-free-music/track/cute',
 'https://www.bensound.com/royalty-free-music/track/memories',
 'https://www.bensound.com/royalty-free-music/track/going-higher',
 'https://www.bensound.com/royalty-free-music/track/acoustic-breeze']
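
The positional indices in `get_tracks` (33 to 83 in steps of 5) only work as long as the page layout stays exactly the same. A possibly more robust alternative is to filter the links by their URL pattern. The sketch below is not part of the original notebook: the helper name `filter_track_urls` is hypothetical, and it assumes that every track link contains `/royalty-free-music/track/`, as the output above suggests.

```python
def filter_track_urls(hrefs, marker="/royalty-free-music/track/"):
    """Keep hrefs that look like track pages, dropping duplicates but preserving order."""
    seen = set()
    track_urls = []
    for href in hrefs:
        if marker in href and href not in seen:
            seen.add(href)
            track_urls.append(href)
    return track_urls


# Inside get_tracks, the hrefs could be collected with something like:
#   hrefs = [a["href"] for a in page_soup.find_all("a", href=True)]
# Made-up sample input for illustration:
sample_hrefs = [
    "https://www.bensound.com/",
    "https://www.bensound.com/royalty-free-music/track/ukulele",
    "https://www.bensound.com/royalty-free-music/track/ukulele",
    "https://www.bensound.com/royalty-free-music/2",
    "https://www.bensound.com/royalty-free-music/track/cute",
]
print(filter_track_urls(sample_hrefs))
```

Whether this returns exactly the same tracks as the index-based version depends on the real page, so it is worth comparing both results once before switching.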

# 2. Extract All Tags of a Song

Now that we know how to get the track URLs, we need a function that extracts the tags from a track page.

In [4]:
def get_tags(track_url):
    
    # Set up soup for the track page
    track_r = requests.get(track_url)
    track_soup = BeautifulSoup(track_r.text, "html.parser")

    # The tags are links inside <p class="taglist">
    taglist_html = track_soup.body.div.find("p", {"class": "taglist"})
    tag_links = taglist_html.find_all("a")

    # Collect the text of each tag link
    taglist = [link.get_text() for link in tag_links]
    return taglist

In [5]:
get_tags("https://www.bensound.com/royalty-free-music/track/ukulele")

['ukulele',
 'happy',
 'funny',
 'advertising',
 'upbeat',
 'kid',
 'kids',
 'positive',
 'chidren',
 'joy',
 'fun',
 'acoustic',
 'light',
 'gentle']
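
`get_tags` raises an `AttributeError` if a track page has no `<p class="taglist">` element, because `find` then returns `None`. The following is a minimal sketch of a more defensive variant that works on an HTML string and returns an empty list in that case. The function name `extract_tags` and the sample HTML are made up for illustration; only the `taglist` class comes from the code above.

```python
from bs4 import BeautifulSoup


def extract_tags(html):
    """Return the link texts inside <p class="taglist">, or [] if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    taglist = soup.find("p", class_="taglist")
    if taglist is None:
        return []
    return [a.get_text(strip=True) for a in taglist.find_all("a")]


# Made-up HTML that mimics the structure get_tags expects
sample_html = (
    '<div><p class="taglist">'
    '<a href="#">ukulele</a>, <a href="#">happy</a>'
    "</p></div>"
)
print(extract_tags(sample_html))
print(extract_tags("<div><p>no tags here</p></div>"))
```

Inside `get_tags`, the same function could be called as `extract_tags(track_r.text)`.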

# 3. Extract All Tags from All Songs on All Pages

Let's combine the two functions above to scrape all tags from all tracks on all 23 pages.

Fortunately, the page URLs on Bensound differ only slightly: from page 2 onwards, the page number is appended as "/num". We can therefore use a simple integer-based loop for this task.

In [105]:
url_template = "https://www.bensound.com/royalty-free-music/{page}"

In [106]:
page_urls = ["https://www.bensound.com/royalty-free-music/"]

In [107]:
for i in range(2, 24):
    page_urls.append(url_template.format(page=i))
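
The same list can also be built in one step. This is only a compact sketch of the loop above; the names `base_url` and `last_page` are introduced here for illustration, and the page count of 23 is taken from the text.

```python
base_url = "https://www.bensound.com/royalty-free-music/"
last_page = 23  # page count stated above; adjust if the site changes

# Page 1 is the bare base URL; pages 2..last_page append the page number
page_urls = [base_url] + [f"{base_url}{page}" for page in range(2, last_page + 1)]

print(len(page_urls))
print(page_urls[1])
print(page_urls[-1])
```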

In [111]:
def get_all_tags(page_urls):
    
    # Set up the dict that we will create the df from later
    tag_dict = {"URL" : [], "Tags" : []}
    
    # Loop through all page URLs and extract the track URLs
    for page_url in page_urls:
        track_list = get_tracks(page_url)
        
        # Loop through the track URLs and extract the relevant information
        for track_url in track_list:
            tag_list = get_tags(track_url)
            tag_dict["URL"].append(track_url)
            tag_dict["Tags"].append(tag_list)
    
    return tag_dict
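
A full run sends a few hundred requests, and a single failed request aborts the whole loop. The sketch below shows one possible way to pause between requests and to record failures instead of stopping. It is not the notebook's original approach: the name `collect_tags`, the `delay` parameter, and the `failed` list are assumptions made for illustration, and the two scraping functions are passed in as arguments so the loop can be tried out with stand-ins.

```python
import time


def collect_tags(page_urls, get_tracks, get_tags, delay=1.0):
    """Like get_all_tags, but pauses between track requests and records failures."""
    tag_dict = {"URL": [], "Tags": []}
    failed = []
    for page_url in page_urls:
        try:
            track_list = get_tracks(page_url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            failed.append((page_url, repr(exc)))
            continue
        for track_url in track_list:
            try:
                tags = get_tags(track_url)
            except Exception as exc:
                failed.append((track_url, repr(exc)))
            else:
                tag_dict["URL"].append(track_url)
                tag_dict["Tags"].append(tags)
            time.sleep(delay)  # wait before the next request to go easy on the server
    return tag_dict, failed
```

With the functions defined earlier, it could be called as `tag_dict, failed = collect_tags(page_urls, get_tracks, get_tags)`; the URLs in `failed` can then be retried separately.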

In [1]:
tag_dict = get_all_tags(page_urls)


# 4. Store and Export the Data

We'll store tag_dict in a DataFrame, making use of the fact that pandas can build one directly from a dict of lists of the form {"Column" : ["Value_1", "Value_2"]}.

In [2]:
import pandas as pd

In [114]:
df_tags = pd.DataFrame(tag_dict)

In [116]:
df_tags.head()

|   | URL | Tags |
|---|-----|------|
| 0 | https://www.bensound.com/royalty-free-music/tr... | [ukulele, happy, funny, advertising, upbeat, k... |
| 1 | https://www.bensound.com/royalty-free-music/tr... | [corporate, motivation, background, presentati... |
| 2 | https://www.bensound.com/royalty-free-music/tr... | [rock, uplifting, success, positive, hope, hop... |
| 3 | https://www.bensound.com/royalty-free-music/tr... | [kid, kids, corporate, bouncy, happy, upbeat, ... |
| 4 | https://www.bensound.com/royalty-free-music/tr... | [jazz, jazzy, acoustic, old, light, retro, swi... |


Finally, let's export the data as a .csv file. The list elements in the "Tags" column are already separated by ",". pandas would quote such fields, so a comma separator would still produce a valid file, but using ";" keeps the raw file unambiguous and easier to read.

In [120]:
df_tags.to_csv("music_tags_raw.csv", sep=";", index=False)
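
When the file is read back, the "Tags" column arrives as plain text rather than as lists. The sketch below shows one way to restore the lists, assuming each list was written as its Python string representation (for example `['ukulele', 'happy']`). It uses only the standard library and a made-up two-line sample in place of the real file.

```python
import ast
import csv
import io

# Made-up sample in the same layout as music_tags_raw.csv
sample_csv = (
    "URL;Tags\n"
    "https://www.bensound.com/royalty-free-music/track/ukulele;['ukulele', 'happy']\n"
)

rows = []
for row in csv.DictReader(io.StringIO(sample_csv), delimiter=";"):
    # Turn the text "['ukulele', 'happy']" back into a real Python list
    row["Tags"] = ast.literal_eval(row["Tags"])
    rows.append(row)

print(rows[0]["Tags"])
```

With pandas, the same conversion should be possible via `pd.read_csv("music_tags_raw.csv", sep=";", converters={"Tags": ast.literal_eval})`.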