# Top 2000 data visualisation 2021
As a way to challenge myself in that weird period between Christmas and New Year's, I decided to do some web scraping and data visualisation with the songs I'm listening to these days: the Top 2000.

I got the data from: 
* Songs, artists, listing and release year: NPO Radio 2, https://www.nporadio2.nl/top2000/nieuws/a1b89e18-8082-4513-8cb1-f5849ca670dc/dit-is-de-npo-radio-2-top-2000-van-2021 (I manually converted the file from xlsx to xls)
* Music genres per artist: Last.fm, https://www.last.fm/music/ 
* Music supergenres: Multimediaeval, selected from a few of the lists on https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/data_stats/ 

You can read my accompanying blog post for some background information and observations here: https://almaliezenga.com/2021/12/30/top-2000-2021/

## Prerequisites

In [1]:
#importing all the packages 

#standard for data stuff
import numpy as np
import pandas as pd

#for plotting graphs 
import plotly.graph_objects as go
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt

#for web scraping 
from textblob import TextBlob
from bs4 import BeautifulSoup
import urllib.request
from lxml import html
import requests
import re

## Prerequisites 2
For writing the figure to an HTML page, use the username and API key from your Plotly account (I followed this tutorial: https://towardsdatascience.com/how-to-create-a-plotly-visualization-and-embed-it-on-websites-517c1a78568b)
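
A minimal sketch of one way to keep the credentials out of the notebook itself: read them from environment variables. The variable names `PLOTLY_USERNAME` and `PLOTLY_API_KEY` and the helper `load_chart_studio_credentials` are made up for this example; they are not part of Plotly or Chart Studio.

```python
import os


def load_chart_studio_credentials(env=os.environ):
    """Return (username, api_key) read from environment variables.

    PLOTLY_USERNAME and PLOTLY_API_KEY are hypothetical names chosen for this sketch.
    """
    try:
        return env["PLOTLY_USERNAME"], env["PLOTLY_API_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Set the environment variable {missing} first") from None


# Usage in the notebook (commented out so the sketch runs without chart_studio):
# username, api_key = load_chart_studio_credentials()
# chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
```

This way the notebook can be shared or published without exposing the key.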

In [2]:
import chart_studio
import chart_studio.plotly as py
import plotly.io as pio

username = 'AlmaLiezenga'
api_key = 'YOUR_API_KEY' #replace with your own Chart Studio API key and keep it out of version control
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

## Data gathering

In [17]:
#loading and reformatting the data 
data = pd.read_excel('TOP-2000-2021.xls', index_col = 0)
data = data.reset_index()
data = data.iloc[1: , :]

#changing the data types 
data['positie'] = data['positie'].astype('int32')
data['jaar'] = data['jaar'].astype('int32')
data

Unnamed: 0,positie,titel,artiest,jaar
1,1,Bohemian Rhapsody,Queen,1975
2,2,Roller Coaster,Danny Vera,2019
3,3,A Whiter Shade Of Pale,Procol Harum,1967
4,4,Hotel California,Eagles,1977
5,5,Piano Man,Billy Joel,1974
...,...,...,...,...
1996,1996,Sunshine Of Your Love,Cream,1968
1997,1997,Anarchy In The Uk,Sex Pistols,1976
1998,1998,I Want To Hold Your Hand,The Beatles,1964
1999,1999,Kissing A Fool,George Michael,1988


## Scraping the genres
The data in itself is pretty neat, but it doesn't offer that much interesting information to visualise. I was interested to see how different genres are represented in the Top 2000. To this end, I used https://www.last.fm/music/ because its URLs are very convenient: each one is basically the artist's name, with some exceptions for punctuation marks, and for the most part that works great. The downside is that this website lists genres per artist, not per song. I accepted this because I couldn't find a website that does it per song and has URLs that I could reconstruct. Perhaps an improvement for next year's edition!
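
To make the URL idea concrete, here is a small sketch that builds an artist URL with the standard library's `quote_plus`. This is an alternative to the hand-written replacements used in this notebook, not what the notebook does: it percent-encodes accents and `&` instead of stripping or keeping them. Whether Last.fm resolves every URL built this way is an assumption I have not verified. The helper name `artist_url` is made up for the example.

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.last.fm/music/"


def artist_url(name):
    """Build a Last.fm artist URL: spaces become '+', other special characters are percent-encoded."""
    return BASE_URL + quote_plus(name)


print(artist_url("Danny Vera"))
print(artist_url("André Hazes"))
```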

In [4]:
# to speed up the process I created a separate list with only the artists 
artiesten = pd.DataFrame(data['artiest'])
artiesten["genres"] = " "
artiesten = artiesten.drop_duplicates()
artiesten = artiesten.reset_index()
artiesten = artiesten.drop(columns=['index'])
artiesten

Unnamed: 0,artiest,genres
0,Queen,
1,Danny Vera,
2,Procol Harum,
3,Eagles,
4,Billy Joel,
...,...,...
801,Miles Davis,
802,The Chemical Brothers,
803,Beyoncé & Jay-Z,
804,Miley Cyrus,


In [5]:
#define functions to help with the web scraping 
def tag(href):
    return href and re.compile("/tag/").search(href)

#changing the name of the artist to a partial hyperlink
def simplify(naam_artiest):
    artist_url = naam_artiest.replace(" ", "+")\
        .replace("'", "%27").replace("/", "%2F")\
        .replace("é", "e").replace("è", "e").replace("É", "E").replace("ë", "e")\
        .replace("ø", "o").replace("ö", "o").replace("Ö", "O")\
        .replace("ü", "u").replace("ÿ", "y").replace("å", "a")
    return artist_url

#finding the music genre tags based on the url 
def find_tags(extended_url): 
    html = urllib.request.urlopen(extended_url)
    soup = BeautifulSoup(html, 'html.parser')
    tags = [] 
    for x in soup.find_all(href=tag):
        tags.extend(x)
    tags = list(dict.fromkeys(tags))
    tags_str = ', '.join(tags)
    return tags_str
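
The `dict.fromkeys` line in `find_tags` removes duplicate tags while keeping their original order. A small sketch of that step on its own, with made-up sample tags. The whitespace stripping is an addition that is not in the notebook; I am assuming it would also catch near-duplicates that differ only by a leading or trailing space.

```python
def unique_in_order(tags):
    """Drop duplicate tags while keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(tag.strip() for tag in tags))


raw_tags = [" reggae", "reggae", "roots reggae", "ska", "reggae"]
tags_str = ", ".join(unique_in_order(raw_tags))
print(tags_str)  # reggae, roots reggae, ska
```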

In [6]:
base_url = 'https://www.last.fm/music/'

#iterating through all the rows in the artiesten dataframe 
for index, row in artiesten.iterrows():
    naam_artiest = row['artiest']
    
    #not all artists were on the website so I used a 'try' 
    try: 
        artist_url = simplify(naam_artiest)
        extended_url = base_url + artist_url 
        artiesten.loc[index, 'genres'] = find_tags(extended_url) 
        
        #since this entire process took pretty long I printed a small statement indicating where we were at 
        print(f"Success {index}/{len(artiesten)+1}: {extended_url} {artiesten.loc[index, 'genres']}")
    except Exception: 
        try:
            
            #artist collabs were difficult... decided to include the tags for the first artist in this case 
            if 'ft.' in naam_artiest:
                naam_artiest = naam_artiest.split("ft.")[0]
            elif '&' in row['artiest']:
                naam_artiest = naam_artiest.split("&")[0]                
            else:
                naam_artiest = naam_artiest
            
            artist_url = simplify(naam_artiest)
            extended_url = base_url + artist_url 
            artiesten.loc[index, 'genres'] = find_tags(extended_url) 
            
            print(f"Success {index}/{len(artiesten)-1}: {extended_url} {artiesten.loc[index, 'genres']}")            
        except Exception: 
            #writing through .loc because the row returned by iterrows() is a copy
            artiesten.loc[index, 'genres'] = ' '
            print(f"Failure: {row['artiest']}")

Success 0/807: https://www.last.fm/music/Queen classic rock, rock, hard rock, glam rock, 80s
Success 1/807: https://www.last.fm/music/Danny+Vera americana, rock, singer-songwriter, country, dutch
Success 2/807: https://www.last.fm/music/Procol+Harum progressive rock, blues, progre…, classic rock, 60s, rock, psychedelic
Success 3/807: https://www.last.fm/music/Eagles rock, classic rock, 70s, soft rock, country
Success 4/807: https://www.last.fm/music/Billy+Joel classic rock, singer-songwriter, rock, pop, piano
Success 5/807: https://www.last.fm/music/Golden+Earring classic rock, rock, hard rock, dutch, 70s
Success 6/807: https://www.last.fm/music/Led+Zeppelin classic rock, hard rock, rock, 70s, progressive rock
Success 7/807: https://www.last.fm/music/Metallica heavy metal, thrash metal, metal, hard rock, metallica
Success 8/807: https://www.last.fm/music/Pearl+Jam rock, grunge, alternative rock, 90s, alternative
Success 9/807: https://www.last.fm/music/Boudewijn+de+Groot dutch, singer-

Success 77/807: https://www.last.fm/music/Animals rock, blues rock, 60s, garage, classic rock
Success 78/807: https://www.last.fm/music/Bob+Marley+&+The+Wailers  reggae, reggae, roots reggae, bob marley, ska, roots
Success 79/807: https://www.last.fm/music/Adele singer-songwriter, soul, female vocalists, british, pop
Success 80/807: https://www.last.fm/music/The+Cranberries alternative rock, rock, alternative, female vocalists, irish
Success 81/807: https://www.last.fm/music/Foo+Fighters rock, alternative rock, grunge, alternative, hard rock
Success 82/807: https://www.last.fm/music/Oasis britpop, rock, british, alternative, indie
Success 83/807: https://www.last.fm/music/Andre+Hazes dutch, nederlandstalig, folk, levenslied, pop
Success 84/807: https://www.last.fm/music/The+Moody+Blues rock, progressive rock, classic rock, 60s, psychedelic
Success 85/807: https://www.last.fm/music/The+Killers indie rock, indie, rock, alternative, alternative rock
Success 86/807: https://www.last.fm/mus

Success 154/807: https://www.last.fm/music/Avicii+ft.+Aloe+Blacc Swedish, house, electronic, dance, progressive house, electro house
Success 155/807: https://www.last.fm/music/Arctic+Monkeys indie rock, indie, british, rock, alternative
Success 156/807: https://www.last.fm/music/Don+McLean singer-songwriter, folk, classic rock, 70s, rock
Success 157/807: https://www.last.fm/music/The+Mamas+&+The+Papas vocal group, 60s, classic rock, oldies, rock, pop
Success 158/807: https://www.last.fm/music/Kansas progressive rock, classic rock, rock, hard rock, 70s
Success 159/807: https://www.last.fm/music/Kings+Of+Leon so…, rock, indie, indie rock, alternative, southern rock
Success 160/807: https://www.last.fm/music/Nightwish symphonic metal, power metal, gothic metal, metal, female fronted metal
Success 161/807: https://www.last.fm/music/De+Dijk dutch, nederlandstalig, nederpop, rock, pop
Success 162/807: https://www.last.fm/music/The+Proclaimers scottish, rock, pop, folk, 80s
Success 163/807: h

Success 230/807: https://www.last.fm/music/Avicii Swedish, house, electronic, dance, progressive house, electro house
Success 231/807: https://www.last.fm/music/Rob+de+Nijs nederlandstalig, pop, dutch, male vocalists, nederpop
Success 232/807: https://www.last.fm/music/Van+Halen hard rock, classic rock, rock, heavy metal, 80s
Success 233/807: https://www.last.fm/music/Maroon+5 pop rock, pop, rock, alternative, maroon 5
Success 234/807: https://www.last.fm/music/Hozier singer-songwriter, blues, indie, soul, irish
Success 235/807: https://www.last.fm/music/Andrea+Bocelli+&+Giorgia classical, opera, italian, male vocalists, andrea bocelli
Success 236/807: https://www.last.fm/music/Tears+For+Fears pop, rock, ska, new wave, synthesiser, 80s, synth pop
Success 237/807: https://www.last.fm/music/Aretha+Franklin soul, jazz, female vocalists, blues, rhythm and blues
Success 238/807: https://www.last.fm/music/America soft, rock, 70s, classic rock, soft rock, folk
Success 239/807: https://www.las

Success 306/807: https://www.last.fm/music/Tina+Turner pop, soul, rock, female vocalists, 80s
Success 307/807: https://www.last.fm/music/Simply+Red punk, pop, soul, 80s, british, simply red
Success 308/805: https://www.last.fm/music/P!nk pop, pop rock, female vocalists, rock, pink
Success 309/807: https://www.last.fm/music/Cat+Stevens singer-songwriter, folk, classic rock, acoustic, 70s
Success 310/807: https://www.last.fm/music/The+Cats oldies, pop, soft rock, 60s, skinhead reggae
Success 311/807: https://www.last.fm/music/Coolio rap, hip-hop, 90s, hip hop, west coast rap
Success 312/807: https://www.last.fm/music/Don+Henley classic rock, rock, 80s, singer-songwriter, soft rock
Success 313/807: https://www.last.fm/music/Tenacious+D rock, comedy, hard rock, alternative rock, comedy rock
Success 314/807: https://www.last.fm/music/Men+At+Work rock, new wave, 80s, australian, pop
Success 315/807: https://www.last.fm/music/Stevie+Wonder soul, funk, motown, pop, 70s
Success 316/807: https:/

Success 382/807: https://www.last.fm/music/Nickelback rock, alternative rock, hard rock, alternative, nickelback
Success 383/807: https://www.last.fm/music/Guus+Meeuwis+&+Vagant nederlandstalig, pop, dutch, singer-songwriter, nederpop
Success 384/807: https://www.last.fm/music/Gloria+Gaynor disco, 70s, soul, female vocalists, pop
Success 385/807: https://www.last.fm/music/Fischer-Z new wave, 80s, british, post-punk, rock
Success 386/807: https://www.last.fm/music/Survivor classic rock, rock, 80s, hard rock, aor
Success 387/807: https://www.last.fm/music/The+Script pop rock, rock, pop, irish, acoustic
Success 388/807: https://www.last.fm/music/The+Blue+Nile dream pop, scottish, 80s, alternative, pop
Success 389/807: https://www.last.fm/music/Nazareth hard rock, classic rock, rock, 70s, heavy metal
Success 390/807: https://www.last.fm/music/Dusty+Springfield pop, singer, British Invasion, rock, soul, 60s, female vocalists, oldies
Success 391/807: https://www.last.fm/music/One+Direction p

Success 460/807: https://www.last.fm/music/Ten+Sharp pop, 90s, 80s, soft rock, dutch
Success 461/807: https://www.last.fm/music/Daft+Punk+ft.+Pharrell+Williams electronic, 90s, house, dance, techno, electronica
Success 462/807: https://www.last.fm/music/John+Farnham 80s, pop, rock, australian, classic rock
Success 463/807: https://www.last.fm/music/The+Doobie+Brothers classic rock, rock, 70s, southern rock, soft rock
Success 464/807: https://www.last.fm/music/Maneskin italian, italy, italia, sanremo, eurovision 2021
Success 465/807: https://www.last.fm/music/Killing+Joke post-punk, new wave, industrial, industrial rock, industrial metal
Success 466/807: https://www.last.fm/music/Zucchero italian, rock, pop, blues, soft rock
Success 467/807: https://www.last.fm/music/Gladys+Knight+&+The+Pips soul, rnb, motown, rhythm and blues, female vocalists, oldies
Success 468/807: https://www.last.fm/music/George+Baker+Selection classic rock, rock, pop, oldies, soundtrack
Success 469/807: https://w

Success 535/807: https://www.last.fm/music/Iron+Butterfly psychedelic rock, heavy metal, 60s, one hit wonders, classic rock, psychedelic, progressive rock, rock
Success 536/807: https://www.last.fm/music/Lionel+Richie singer, songwriter, composer, soul, pop, 80s, rnb, lionel richie
Success 537/807: https://www.last.fm/music/Air electronic, chillout, ambient, french, electronica
Success 538/807: https://www.last.fm/music/Michael+Buble jazz, swing, christmas, pop, easy listening
Success 539/807: https://www.last.fm/music/B.B.+King blues, blues rock, guitar, jazz, classic rock
Success 540/807: https://www.last.fm/music/Stealers+Wheel folk, rock, classic rock, 70s, oldies
Success 541/807: https://www.last.fm/music/3JS dutch, eurovision, nederlandstalig, levenslied, nederlands
Success 542/807: https://www.last.fm/music/Darude trance, electronic, dance, techno, electronica
Success 543/807: https://www.last.fm/music/Snowy+White blues rock, blues, classic rock, rock, guitar
Success 544/807: ht

Success 612/807: https://www.last.fm/music/James+Taylor singer-songwriter, folk, acoustic, classic rock, james taylor
Success 613/807: https://www.last.fm/music/House+Of+Pain hip-hop, rap, old school, hip hop, 90s
Success 614/807: https://www.last.fm/music/Extreme hard rock, rock, funk metal, hair metal, metal
Success 615/807: https://www.last.fm/music/Bruce+Hornsby+&+The+Range 80s, classic rock, rock, singer-songwriter, soft rock
Success 616/807: https://www.last.fm/music/The+Babys pop rock, 70s, 80s, power pop, classic rock, rock, japanese, hard rock
Success 617/807: https://www.last.fm/music/The+Human+League synthpop, Post-Punk, New Wave, new wave, 80s, electronic, synth pop
Success 618/807: https://www.last.fm/music/The+Script+ft.+Will.I.Am pop rock, rock, pop, irish, acoustic
Success 619/807: https://www.last.fm/music/Tom+Petty+&+The+Heartbreakers rock, classic rock, 80s, singer-songwriter, christmas
Success 620/807: https://www.last.fm/music/The+The new wave, post-punk, 80s, alte

Success 690/807: https://www.last.fm/music/Roger+Glover+&+Guests rock, my autumn of 00, songs that remind me of robik, 70s, classic rock
Success 691/807: https://www.last.fm/music/Cheap+Trick classic rock, rock, power pop, hard rock, 80s
Success 692/807: https://www.last.fm/music/Celine+Dion+&+Barbra+Streisand duets, female vocalists, 90s, céline dion, duett
Success 693/807: https://www.last.fm/music/Notorious+B.I.G. hip-hop, rap, gangsta rap, east coast rap, hip hop
Success 694/807: https://www.last.fm/music/Marvin+Gaye+&+Tammi+Terrell soul, motown, rhythm and blues, duets, oldies
Success 695/807: https://www.last.fm/music/Suzanne+Vega singer-songwriter, folk, female vocalists, pop, alternative
Success 696/807: https://www.last.fm/music/Smash+Mouth alternative rock, neo-ska, rock, alternative, ska, pop
Success 697/807: https://www.last.fm/music/Prodigy hip hop, electronic, hip-hop, big beat, rave, rap
Success 698/807: https://www.last.fm/music/Khruangbin+&+Leon+Bridges soul, rock, psy

Success 768/807: https://www.last.fm/music/Jeroen+van+Koningsbrugge all, rock, hard rock, acoustic, nederpop
Success 769/807: https://www.last.fm/music/Trockener+Kecks dutch, rock, all, nederlandstalig, jesters live list
Success 770/807: https://www.last.fm/music/Ed+Sheeran+&+Justin+Bieber pop, party, featuring, wake-up song, uk number one
Success 771/807: https://www.last.fm/music/Herman+Brood+&+Henny+Vrienten home collection, 80s, ska, pop, dutch
Success 772/807: https://www.last.fm/music/Rare+Earth American, rock, 60s, 70s, classic rock, funk, motown, soul
Success 773/807: https://www.last.fm/music/Drukwerk dutch, nederlandstalig, pop, home collection, other
Success 774/807: https://www.last.fm/music/Rihanna+ft.+Mikky+Ekko pop, rnb, dance, female vocalists, rihanna
Success 775/807: https://www.last.fm/music/Vitesse indie pop, rock, electropop, synth-pop, chicago pop represent
Success 776/807: https://www.last.fm/music/Toploader pop, britpop, british, indie, rock
Success 777/807: htt
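
The fallback for collaborations in the loop above boils down to "keep whatever comes before the first `ft.` or `&`". A minimal sketch of that rule as a separate function. The name `first_artist` is made up, and the `.strip()` is an addition that removes the trailing space left by the split.

```python
def first_artist(name):
    """Return the first artist of a collaboration, or the name unchanged."""
    # Same order as in the notebook: 'ft.' is checked before '&'
    for separator in ("ft.", "&"):
        if separator in name:
            return name.split(separator)[0].strip()
    return name


print(first_artist("Avicii ft. Aloe Blacc"))   # Avicii
print(first_artist("Suzan & Freek & Snelle"))  # Suzan
print(first_artist("Queen"))                   # Queen
```

Note that this rule also cuts band names that contain an `&` themselves, which is why it is only used as a fallback after the full name has failed.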

In [7]:
#printing the exceptions that didn't work (genres is either empty or still the ' ' placeholder) 
for index, row in artiesten.iterrows():
    if row['genres'].strip() == "":
        print(row) 

artiest    Floor Jansen & Henk Poort
genres                              
Name: 52, dtype: object
artiest    Bløf ft. Geike
genres                   
Name: 64, dtype: object
artiest    Diggy Dex ft. JW Roy
genres                         
Name: 128, dtype: object
artiest    Alderliefste & Ramses Shaffy & Liesbeth List
genres                                                 
Name: 187, dtype: object
artiest    Suzan & Freek & Snelle
genres                           
Name: 211, dtype: object
artiest    Armin van Buuren ft. Kensington
genres                                    
Name: 242, dtype: object
artiest    André Hazes Jr.
genres                    
Name: 301, dtype: object
artiest    Bökkers
genres            
Name: 413, dtype: object
artiest    Davina Michelle & Snelle
genres                             
Name: 457, dtype: object
artiest    Teskey Brothers
genres                    
Name: 495, dtype: object
artiest    Kirsty Maccoll & Pogues
genres                            
Name: 57

In [8]:
#fixing those cases with some manual work 
artiesten.loc[52,'genres']='symphonic metal, netherlands, better than tarja, female fronted metal music, classical'
artiesten.loc[64,'genres']='dutch, pop, nederlandstalig, rock, nederpop'
artiesten.loc[128,'genres']='nederhop, hip-hop, dutch, rap'
artiesten.loc[187,'genres']='pop, the netherlands, dutch, rock, pop rock'
artiesten.loc[211,'genres']='pop, dutch, singersongwriter' #Suzan & Freek were not on last.fm so I made something up
artiesten.loc[242,'genres']='trance, progressive trance, electronic, vocal trance, dance'
artiesten.loc[301,'genres']='dutch'
artiesten.loc[413,'genres']='rock, dutch, boerenrock, streektaal'
artiesten.loc[457,'genres']='pop, dutch, female vocalist, netherlands, linedance 2021'
artiesten.loc[495,'genres']='soul, blues, aussie rock'
artiesten.loc[578,'genres']='female vocalists, pop, 80s, singer-songwriter, folk'
artiesten.loc[580,'genres']='nederlandstalig, belgian female vocalists, europe 2021, netherlands'
artiesten.loc[643,'genres']='pop, synthpop, british, electropop, female vocalists'
artiesten.loc[663,'genres']='trance, progressive trance, electronic, vocal trance, dance'
artiesten.loc[710,'genres']='house, dance, electronic, south africa, gqom'
artiesten.loc[745,'genres']='house, electronic, dance, progressive house, electro house'

In [18]:
#merging the genres with the original data and saving this data 
data = pd.merge(left=data, right=artiesten, how='left', left_on='artiest', right_on='artiest')
data.to_csv('data/data_genres.csv', header=True, index=False)
data

Unnamed: 0,positie,titel,artiest,jaar,genres
0,1,Bohemian Rhapsody,Queen,1975,"classic rock, rock, hard rock, glam rock, 80s"
1,2,Roller Coaster,Danny Vera,2019,"americana, rock, singer-songwriter, country, d..."
2,3,A Whiter Shade Of Pale,Procol Harum,1967,"progressive rock, blues, progre…, classic rock..."
3,4,Hotel California,Eagles,1977,"rock, classic rock, 70s, soft rock, country"
4,5,Piano Man,Billy Joel,1974,"classic rock, singer-songwriter, rock, pop, piano"
...,...,...,...,...,...
1995,1996,Sunshine Of Your Love,Cream,1968,"blues, hard rock, psychedelic rock, classic ro..."
1996,1997,Anarchy In The Uk,Sex Pistols,1976,"punk rock, punk, british, 70s, rock"
1997,1998,I Want To Hold Your Hand,The Beatles,1964,"pop rock and beat, classic rock, rock, 60s, b..."
1998,1999,Kissing A Fool,George Michael,1988,"pop, duo, 80s, soul, british, dance"


## Count the genres
Now I want to know how many songs there are in each genre and what the highest position of a song in that genre is. Since each song has about 5 genres, the genre counts add up to far more than 2000. Some genres were a bit strange, like 'miley cyrus'. To slightly compensate for this I decided to only include genres with more than 10 songs in them.
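
The same count can also be sketched with pandas' `explode` and `groupby`, which avoids looping over rows. This is an alternative to the loop used in this notebook, shown on three made-up rows rather than the real data.

```python
import pandas as pd

# Tiny made-up sample with the same columns as the real data
songs = pd.DataFrame({
    "positie": [1, 2, 3],
    "genres": ["classic rock, rock", "americana, rock", "rock, pop"],
})

# One row per (song, genre) pair
per_genre = songs.assign(genre=songs["genres"].str.split(", ")).explode("genre")

# Number of songs and best (lowest) position per genre
genre_counter = (
    per_genre.groupby("genre")["positie"]
    .agg(["count", "min"])
    .rename(columns={"count": "aantal nummers", "min": "hoogste positie"})
    .reset_index()
)
print(genre_counter)
```

Filtering on a minimum number of songs then becomes a single boolean mask on the `aantal nummers` column.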

In [24]:
#loading and setting up the dataframes
data = pd.read_csv('data/data_genres.csv')
genre_counter = pd.DataFrame(columns=['aantal nummers', 'hoogste positie'])

#iterating over all the rows in the data 
for index, row in data.iterrows():
    for g in row['genres'].split(", "):
        #adding one to the counter and checking if this is the most highly ranked song if the genre is in the genre_counter 
        if g in genre_counter.index: 
            genre_counter.loc[g, 'aantal nummers'] += 1
            if genre_counter.loc[g, 'hoogste positie'] > row['positie']:
                genre_counter.loc[g, 'hoogste positie'] = row['positie']
        #adding a new row for the genre if it is not in the genre_counter yet 
        else:
            genre_counter.loc[g, 'aantal nummers'] = 1
            genre_counter.loc[g, 'hoogste positie'] = row['positie']

#some data cleaning and reformatting 
genre_counter = genre_counter[:-1]
genre_counter = genre_counter.reset_index()
genre_counter = genre_counter.rename(columns={"index": "genre"})
genre_counter['aantal nummers'] = genre_counter['aantal nummers'].astype(int)
genre_counter['hoogste positie'] = genre_counter['hoogste positie'].astype(int)
genre_counter = genre_counter[genre_counter['aantal nummers'] > 10]
genre_counter = genre_counter[genre_counter['genre'] != 'all']

#writing the data to csv
genre_counter.to_csv('data/genre_counter.csv', header=False, index=False)
genre_counter

Unnamed: 0,genre,aantal nummers,hoogste positie
0,classic rock,692,1
1,rock,1171,1
2,hard rock,217,1
3,glam rock,74,1
4,80s,467,1
...,...,...,...
239,motown,13,467
240,british artist,12,470
260,australian,18,498
272,chillout,12,536


In [38]:
#changing some column values and names to make it more pretty and English for visualisation
genre_counter['genre'] = genre_counter['genre'].str.capitalize()
genre_counter = genre_counter.rename(columns={'hoogste positie': 'Highest position of a song', 'aantal nummers': 'Number of songs'})

#visualising the data
fig = px.scatter(genre_counter, x='Highest position of a song', y='Number of songs', hover_name='genre', color='Number of songs', size = 'Number of songs')

fig.update_layout(
    title = "Highest position and total number of songs in the Top 2000 per genre",
)

fig.show()

#writing the figure to html, figure is visible at: https://almaliezenga.github.io/Top2000/html/genre_counter
pio.write_html(fig, file='html/genre_counter.html', auto_open=True)

## Creating supergenres
These genres are quite niche, which makes it fun to look through them and click around, but I want to make something that is slightly more general. We are going to create a few supergenres and put every song into one of them. For this I only look at the first genre of every song, falling back to the second genre when the first doesn't match. I combined some of the lists from Multimediaeval: https://multimediaeval.github.io/2017-AcousticBrainz-Genre-Task/data_stats/
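
The matching idea can be sketched with a plain dictionary from subgenre to supergenre: normalise the song's first genre, look it up, and try the second genre if the first is unknown. The lookup table below is a made-up sample, not the actual Multimediaeval list, and the function names are hypothetical.

```python
def normalise(genre):
    """Remove spaces and lowercase, so 'Classic Rock' becomes 'classicrock'."""
    return genre.replace(" ", "").lower()


def find_supergenre(genres, subgenre_to_supergenre):
    """Return the supergenre for the first or second genre, or None if neither matches."""
    candidates = [normalise(g) for g in genres.split(",")][:2]
    for candidate in candidates:
        if candidate in subgenre_to_supergenre:
            return subgenre_to_supergenre[candidate]
    return None


# Made-up sample lookup for illustration only
lookup = {"blues": "blues", "chicagoblues": "blues", "rock": "rock", "classicrock": "rock"}

print(find_supergenre("classic rock, rock, hard rock", lookup))  # rock
print(find_supergenre("americana, rock, country", lookup))       # rock
print(find_supergenre("6of10stars", lookup))                     # None
```

A dictionary lookup does the same comparison as looping over every row of the supergenre list, but in a single step per song.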

In [31]:
#reading and reformatting supergenres.txt 
supergenres = pd.read_csv('supergenres.txt', delimiter=r"\s+", header = None)
supergenres = supergenres[1:]
supergenres = pd.DataFrame(supergenres[0])
supergenres = supergenres.rename(columns={0: "supergenre"})
supergenres['subgenre'] = " "

#separating the subgenre from the supergenre by splitting every entry on '---'
parts = supergenres['supergenre'].str.split("---")
supergenres['supergenre'] = parts.str[0]
#entries without a '---' are a supergenre on their own, so they double as their own subgenre
supergenres['subgenre'] = parts.str[1].fillna(parts.str[0])
    
supergenres

Unnamed: 0,supergenre,subgenre
1,blues,blues
2,blues,acousticblues
3,blues,chicagoblues
4,blues,classicblues
5,blues,countryblues
...,...,...
501,world,indian
502,world,middleeastern
503,world,nationalmusic
504,world,traditional


In [32]:
#adding the supergenre to the data 
data['supergenre'] = " "

#iterating all the rows in the data 
for index, row in data.iterrows():
    #getting the subgenre of the current song
    subgenre_value = row['genres'].replace(" ", "")
    subgenre_value = subgenre_value.split(",")[0].lower()
    
    #iterating all the rows in the supergenres and extracting the subgenres
    for i, r in supergenres.iterrows(): 
        subgenre_target = r['subgenre'].lower()
        
        #checking if the subgenre of the song and the supergenres list align
        if str(subgenre_value)  ==  str(subgenre_target):
            data.loc[index, 'supergenre'] = r['supergenre']
    
    #to catch some more songs we also check if the second genre of the song does align with a subgenre in the supergenre list
    if data.loc[index, 'supergenre'] == " ":
        subgenre_value2 = row['genres'].replace(" ", "")
        try: 
            subgenre_value2 = subgenre_value2.split(",")[1].lower()
            for i, r in supergenres.iterrows(): 
                subgenre_target = r['subgenre'].lower()
                if str(subgenre_value2)  ==  str(subgenre_target):
                    data.loc[index, 'supergenre'] = r['supergenre']
                    
        #printing a message when the first genre didn't match and the song has no second genre to fall back on
        except IndexError: 
            print(f"No supergenre matches subgenre 1 or 2 for {row['titel']}, {subgenre_value}, {subgenre_value2}")
            data.loc[index, 'supergenre'] = " "

#writing the data to a csv 
data.to_csv('data/data_supergenres.csv', header=True, index=False)
data

No supergenre matches subgenre 1 or 2 for Just Give Me A Reason, 6of10stars, 6of10stars
No supergenre matches subgenre 1 or 2 for Can't Hold Us, , 
No supergenre matches subgenre 1 or 2 for Don't Leave Me This Way, linedance2011, linedance2011
No supergenre matches subgenre 1 or 2 for I Feel It Coming, canada, canada
No supergenre matches subgenre 1 or 2 for Pa Olvidarte, linedance2019, linedance2019
No supergenre matches subgenre 1 or 2 for Can't Get It Out Of My Head, , 


Unnamed: 0,positie,titel,artiest,jaar,genres,supergenre
0,1,Bohemian Rhapsody,Queen,1975,"classic rock, rock, hard rock, glam rock, 80s",rock
1,2,Roller Coaster,Danny Vera,2019,"americana, rock, singer-songwriter, country, d...",country
2,3,A Whiter Shade Of Pale,Procol Harum,1967,"progressive rock, blues, progre…, classic rock...",rock
3,4,Hotel California,Eagles,1977,"rock, classic rock, 70s, soft rock, country",rock
4,5,Piano Man,Billy Joel,1974,"classic rock, singer-songwriter, rock, pop, piano",rock
...,...,...,...,...,...,...
1995,1996,Sunshine Of Your Love,Cream,1968,"blues, hard rock, psychedelic rock, classic ro...",blues
1996,1997,Anarchy In The Uk,Sex Pistols,1976,"punk rock, punk, british, 70s, rock",rock
1997,1998,I Want To Hold Your Hand,The Beatles,1964,"pop rock and beat, classic rock, rock, 60s, b...",rock
1998,1999,Kissing A Fool,George Michael,1988,"pop, duo, 80s, soul, british, dance",pop


In [33]:
#loading and setting up the dataframes
data = pd.read_csv('data/data_supergenres.csv')
supergenre_counter = pd.DataFrame(columns=['aantal nummers', 'hoogste positie'])

#iterating all the rows in the data, basically the same as for the genrecounter
for index, row in data.iterrows():
    g = row['supergenre']
    if g in supergenre_counter.index: 
        supergenre_counter.loc[g, 'aantal nummers'] += 1
        if supergenre_counter.loc[g, 'hoogste positie'] > row['positie']:
            supergenre_counter.loc[g, 'hoogste positie'] = row['positie']
    else:
        supergenre_counter.loc[g, 'aantal nummers'] = 1
        supergenre_counter.loc[g, 'hoogste positie'] = row['positie']

#some data cleaning and reformatting 
supergenre_counter = supergenre_counter.reset_index()
supergenre_counter = supergenre_counter.rename(columns={"index": "genre"})
supergenre_counter['aantal nummers'] = supergenre_counter['aantal nummers'].astype(int)
supergenre_counter['hoogste positie'] = supergenre_counter['hoogste positie'].astype(int)
supergenre_counter = supergenre_counter[supergenre_counter['aantal nummers'] > 20]
supergenre_counter = supergenre_counter[supergenre_counter['genre'] != "None"]
supergenre_counter = supergenre_counter[supergenre_counter['genre'] != " "]

supergenre_counter.to_csv('data/supergenre_counter.csv', header=False, index=False)
supergenre_counter

Unnamed: 0,genre,aantal nummers,hoogste positie
0,rock,967,1
1,country,28,2
2,metal,31,8
3,nederlandstalig,176,10
4,pop,420,11
5,soul,83,21
6,folk,47,28
7,hiphop,42,76
8,blues,26,90
9,reggae,23,123


In [39]:
#changing some column values and names to make it more pretty and English for visualisation
supergenre_counter['genre'] = supergenre_counter['genre'].str.capitalize()
supergenre_counter = supergenre_counter.rename(columns={'hoogste positie': 'Highest position of a song', 'aantal nummers': 'Number of songs'})

#visualising the data
fig = px.scatter(supergenre_counter, x='Highest position of a song', y='Number of songs', hover_name='genre', color='Number of songs', size = 'Number of songs')

fig.update_layout(
    title = "Highest position and total number of songs in the Top 2000 per supergenre",
)

fig.show()

#writing the figure to html, figure is visible at: https://almaliezenga.github.io/Top2000/html/supergenre_counter
pio.write_html(fig, file='html/supergenre_counter.html', auto_open=True)

## Visualising all the songs with their supergenre
I want to show all the songs by position and year of release with their genre. 

In [63]:
data = pd.read_csv('data/data_supergenres.csv')
data['supergenre'] = data['supergenre'].str.capitalize()
data.loc[data['supergenre'] == " ", 'supergenre'] = "Other"
data.loc[data['supergenre'] == "None", 'supergenre'] = "Other"
data = data.rename(columns={'positie': 'Position', 'jaar': 'Year of release', 'artiest':'Artist', 'titel':'Title', 'supergenre':'Supergenre'})

fig = px.scatter(data, hover_name='Title', hover_data=['Artist'], color='Supergenre', x='Year of release', y='Position')

fig.update_layout(
    title = "Songs by year of release and position",
)

fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.3,
    xanchor="right",
    x=1
))

fig.show()

#writing the figure to html, figure is visible at: https://almaliezenga.github.io/Top2000/html/songs_supergenres
pio.write_html(fig, file='html/songs_supergenres.html', auto_open=True)

In [66]:
data['Supersupergenre'] = data['Supergenre']

data.loc[data['Supergenre'] == 'Jazz', 'Supersupergenre'] = 'Jazz, Blues & Soul'
data.loc[data['Supergenre'] == 'Blues', 'Supersupergenre'] = 'Jazz, Blues & Soul'
data.loc[data['Supergenre'] == 'Soul', 'Supersupergenre'] = 'Jazz, Blues & Soul'

data.loc[data['Supergenre'] == 'Rock', 'Supersupergenre'] = 'Rock and metal'
data.loc[data['Supergenre'] == 'Metal', 'Supersupergenre'] = 'Rock and metal'

data.loc[data['Supergenre'] == 'Country', 'Supersupergenre'] = 'Country and Folk'
data.loc[data['Supergenre'] == 'Folk', 'Supersupergenre'] = 'Country and Folk'

data.loc[data['Supergenre'] == 'Reggae', 'Supersupergenre'] = 'Other'
data.loc[data['Supergenre'] == 'Hiphop', 'Supersupergenre'] = 'Other'
data.loc[data['Supergenre'] == 'Classical', 'Supersupergenre'] = 'Other'
data.loc[data['Supergenre'] == 'Latin', 'Supersupergenre'] = 'Other'
data.loc[data['Supergenre'] == 'None', 'Supersupergenre'] = 'Other'
data.loc[data['Supergenre'] == 'Electronic', 'Supersupergenre'] = 'Other'

fig = px.scatter(data, hover_name='Title', hover_data=['Artist'], color='Supersupergenre', x='Year of release', y='Position')

fig.update_layout(
    title = "Songs by year of release and position",
)

fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.3,
    xanchor="right",
    x=1
))

fig.show()

#writing the figure to html, figure is visible at: https://almaliezenga.github.io/Top2000/html/songs_supersupergenres
pio.write_html(fig, file='html/songs_supersupergenres.html', auto_open=True)
data.to_csv('data/data_supersupergenre.csv', header=False, index=False)
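
The grouping in the cell above can also be written as one dictionary plus a single `map` call. This is a sketch of that alternative on a made-up four-row sample; the dictionary simply restates the assignments from the cell, and supergenres that are not in it (such as Pop) keep their own name.

```python
import pandas as pd

# Same grouping as in the cell above, written as a lookup table
SUPERSUPERGENRES = {
    "Jazz": "Jazz, Blues & Soul",
    "Blues": "Jazz, Blues & Soul",
    "Soul": "Jazz, Blues & Soul",
    "Rock": "Rock and metal",
    "Metal": "Rock and metal",
    "Country": "Country and Folk",
    "Folk": "Country and Folk",
    "Reggae": "Other",
    "Hiphop": "Other",
    "Classical": "Other",
    "Latin": "Other",
    "None": "Other",
    "Electronic": "Other",
}

# Made-up sample; in the notebook this would be the full data frame
sample = pd.DataFrame({"Supergenre": ["Rock", "Soul", "Pop", "Reggae"]})

# Look up each supergenre and fall back to the supergenre itself when it is not listed
sample["Supersupergenre"] = sample["Supergenre"].map(lambda g: SUPERSUPERGENRES.get(g, g))
print(sample)
```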

In [67]:
data

Unnamed: 0,Position,Title,Artist,Year of release,genres,Supergenre,Supersupergenre
0,1,Bohemian Rhapsody,Queen,1975,"classic rock, rock, hard rock, glam rock, 80s",Rock,Rock and metal
1,2,Roller Coaster,Danny Vera,2019,"americana, rock, singer-songwriter, country, d...",Country,Country and Folk
2,3,A Whiter Shade Of Pale,Procol Harum,1967,"progressive rock, blues, progre…, classic rock...",Rock,Rock and metal
3,4,Hotel California,Eagles,1977,"rock, classic rock, 70s, soft rock, country",Rock,Rock and metal
4,5,Piano Man,Billy Joel,1974,"classic rock, singer-songwriter, rock, pop, piano",Rock,Rock and metal
...,...,...,...,...,...,...,...
1995,1996,Sunshine Of Your Love,Cream,1968,"blues, hard rock, psychedelic rock, classic ro...",Blues,"Jazz, Blues & Soul"
1996,1997,Anarchy In The Uk,Sex Pistols,1976,"punk rock, punk, british, 70s, rock",Rock,Rock and metal
1997,1998,I Want To Hold Your Hand,The Beatles,1964,"pop rock and beat, classic rock, rock, 60s, b...",Rock,Rock and metal
1998,1999,Kissing A Fool,George Michael,1988,"pop, duo, 80s, soul, british, dance",Pop,Pop
