# Radio playlists analysis

In this project, we investigated playlists from Swiss radio stations over the past few years.

First, we wrote scrapers for the following Swiss radio stations. Each scraper extracts a song's name, its performer, and the date and time at which it was played:

* La 1ère
* Espace 2
* Couleur 3
* Option Musique
* Rouge FM

This gives us a fairly broad sample of musical programming.

The scrapers are not always easy to write, mainly because the stations impose undocumented limits on how much one may scrape per hour or per day. The resulting code is fairly readable, but developing it took some time. It can be found in the `scrapers` folder.
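One common way to cope with undocumented rate limits is to retry failed requests with exponentially growing pauses. The following is only a minimal sketch of that idea, not necessarily what the code in `scrapers` does; `fetch_with_backoff`, its parameters, and the `fetch` callable are hypothetical names introduced for illustration.

```python
import time

def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing longer and longer between failed attempts.

    Hypothetical helper: `fetch` is any callable that downloads a page and
    raises OSError on failure (urllib.error.URLError is a subclass of OSError).
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.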

The raw scraped playlists are in the `data/raw-playlists` folder.

Let's have a quick look at one of the files:

In [131]:
import pandas as pd

pd.read_csv("data/raw-playlists/option-musique", sep="\x1f", header=None, names=["date", "track", "artist"]).head()

| | date | track | artist |
|---|---|---|---|
| 0 | 2017-01-05 00:03:06 | Encore une fois | Catherine Lara |
| 1 | 2017-01-05 00:08:00 | Je t'aime...moi non plus | Serge Gainsbourg/ Jane Birkin |
| 2 | 2017-01-05 00:12:23 | Please forgive me | Bryan Adams |
| 3 | 2017-01-05 00:16:56 | Comme un brasero | Pierre Rapsat |
| 4 | 2017-01-05 00:21:17 | L'esprit grande prairie | Eddy Mitchell |


From this short overview, one can already notice an unfortunate but important fact: names must be normalized.

Indeed, *Serge Gainsbourg/ Jane Birkin* is not a single artist, but two different people. In this case, they are also separated in a slightly odd way, since there is a space on only one side of the separator. Clearly, we must deal with this kind of problem to avoid duplicates and failed metadata searches later.

Splitting will, of course, produce some false positives; for instance, *Ike & Tina Turner* were a duo that always sang together, so it does not make sense to split them. But these should be rare enough; if we see too many bad splits, we can always hardcode the most frequent exceptions.

In [80]:
import json
import urllib.parse
import urllib.request

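def top_tags(artist, track):
    """Return the Last.fm tags of a track that have a count of at least 10."""
    url = "http://ws.audioscrobbler.com/2.0/?method=track.gettoptags" \
          "&artist={a}" \
          "&track={t}" \
          "&api_key=7120a4263ebddfdbbb7c0a061b13c170" \
          "&format=json"
    # Percent-encode the names so spaces and accents are valid in the URL.
    url = url.format(a=urllib.parse.quote(artist), t=urllib.parse.quote(track))

    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(url) as resp:
        obj = json.loads(resp.read().decode("utf-8"))

    try:
        return [tag["name"] for tag in obj["toptags"]["tag"] if int(tag["count"]) >= 10]
    except KeyError:
        # No "toptags" entry in the response (e.g. unknown track): no tags.
        return []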

In [86]:
top_tags("the drums", "money")

['indie',
 'indie pop',
 'indie rock',
 'surf rock',
 'alternative',
 'summer',
 'alternative rock',
 'post-punk',
 'rock',
 'american',
 'Post punk',
 'best song ever',
 '010s',
 'snowlist']

In [94]:
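def normalize_name(name):
    """Remove parenthesized parts such as "(REP)" or "(feat. X)" from a name."""
    open_paren = 0  # current nesting depth of parentheses
    result = ""
    for c in name:
        if c == '(':
            open_paren += 1
        elif c == ')':
            # Never go below zero, so a stray ')' does not drop the rest.
            open_paren = max(0, open_paren - 1)
        elif open_paren == 0:
            result += c
    return result.strip()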

In [95]:
normalize_name("(REP) Lost Boy & Suicide Girl (feat. Simon Jäggi)")

'Lost Boy & Suicide Girl'

In [120]:
import re

def split_artist(artist):
    # this causes a few false positives, e.g. "Ike & Tina Turner"
    # the dot in "feat." is escaped so that it only matches a literal "."
    return [a.strip() for a in re.split(r"&| x |/|,|feat\.", artist)]

In [121]:
split_artist("Miss Kittin & the Hacker")

['Miss Kittin', 'the Hacker']

In [122]:
split_artist("Cypress Hill x Rusko")

['Cypress Hill', 'Rusko']

In [123]:
split_artist("Common feat. Bilal")

['Common', 'Bilal']

In [124]:
split_artist("Willie Nelson/ Jack Johnson/ Be Harper ")

['Willie Nelson', 'Jack Johnson', 'Be Harper']
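The false positives mentioned earlier, such as *Ike & Tina Turner*, could be handled with a small hardcoded list of names that must never be split. This is a minimal sketch under that assumption: `KNOWN_GROUPS` and `split_artist_safe` are hypothetical names that do not exist in the project, and the list contains only the one example given in the text.

```python
import re

# Hypothetical exception list: acts whose name contains a separator
# but that should be kept as a single artist (stored in lowercase).
KNOWN_GROUPS = {"ike & tina turner"}

# Same separators as split_artist, with the dot of "feat." escaped.
SEPARATORS = re.compile(r"&| x |/|,|feat\.")

def split_artist_safe(artist):
    """Split an artist string, except for hardcoded known groups."""
    name = artist.strip()
    if name.lower() in KNOWN_GROUPS:
        return [name]
    return [part.strip() for part in SEPARATORS.split(name)]
```

With this variant, `split_artist_safe("Ike & Tina Turner")` keeps the duo together, while names that are not in the list are split exactly as before.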