# Getting data 1 - top100 tracks

For my first hypothesis, I need all the top100 tracks from the official German charts from 2010 to 2021.

**0 Hypothesis: The audio features of Deutschrap (German rap) tracks became more similar to the audio features of tracks found in the top100 German charts from 2010-2021**

I am looking for the following info to scrape:
- artist
- track title
- label
- rank
- week

Once I have successfully scraped the desired data, I will store it in a CSV file for further processing.

In [1]:
# importing libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
# url I want to scrape

url = 'https://www.offiziellecharts.de/charts/single/for-date-1620390068000'

In [3]:
# setting the response variable with the website url
# checking the response status

response = requests.get(url)
response.status_code

# status is 200, which means I got something back

200

In [4]:
# looking at the response

response.content

b'<!doctype html>\r\n<html prefix="og: http://ogp.me/ns#" class="no-js" xmlns="http://www.w3.org/1999/xhtml" xml:lang="de-de"\r\n      lang="de-de">\r\n    <meta name="description" content="Hier gibt\xe2\x80\x99s die Offiziellen Deutschen Charts in ihrer ganzen Vielfalt. Denn: Hier z\xc3\xa4hlt die Musik." />\r\n<head>\r\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n    <meta name="viewport"\r\n          content="height=device-height, width=device-width, initial-scale=1.0, maximum-scale=1, user-scalable=no, minimal-ui">\r\n    <meta name="apple-mobile-web-app-capable" content="yes">\r\n    <link rel="shortcut icon" href="/templates/gfktemplate/favicon.ico">\r\n    <title>Offizielle Deutsche Charts - Offizielle Deutsche Charts</title>\r\n            <link rel="stylesheet"\r\n              href="/templates/gfktemplate/styles/styles.css?v=106">\r\n        <style>\r\n        #gfk-preload {\r\n            position: fixed;\r\n            top: 0;\r\n            left: 0;\r\n   

In [5]:
# let's make it a bit more readable

soup = BeautifulSoup(response.content, 'html.parser')
soup

# looks better now

<!DOCTYPE html>

<html class="no-js" lang="de-de" prefix="og: http://ogp.me/ns#" xml:lang="de-de" xmlns="http://www.w3.org/1999/xhtml">
<meta content="Hier gibt’s die Offiziellen Deutschen Charts in ihrer ganzen Vielfalt. Denn: Hier zählt die Musik." name="description"/>
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="height=device-height, width=device-width, initial-scale=1.0, maximum-scale=1, user-scalable=no, minimal-ui" name="viewport"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<link href="/templates/gfktemplate/favicon.ico" rel="shortcut icon"/>
<title>Offizielle Deutsche Charts - Offizielle Deutsche Charts</title>
<link href="/templates/gfktemplate/styles/styles.css?v=106" rel="stylesheet"/>
<style>
        #gfk-preload {
            position: fixed;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background: #fff;
            text-align: center;
            padding: 3rem;
      

### Scraping Top100 current week

First I am going to scrape the current week's top100 tracks. If that is successful, I will try to build a function or a loop that lets me scrape several pages, going back to 2010.

Before that, I am going to extract the info I want for one example track, to see what the loop needs in order to get all tracks from the first page.

In [7]:
# getting the current rank

soup.select('td.ch-pos span')[0].text

'1'

In [8]:
# getting last week's rank

soup.select('span.last-week')[0].text

'\n                                        2                                    '
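
The last-week rank comes back as a string padded with whitespace. A minimal sketch of cleaning it, using the raw value shown above. The helper name `clean_rank` is made up for illustration, and returning `None` for non-numeric text is an assumption about how entries without a previous rank might look, which I have not verified against the page:

```python
def clean_rank(raw_text):
    """Strip surrounding whitespace and convert a rank string to int.

    Returns None when the remaining text is not purely digits
    (assumed behaviour for entries without a previous rank).
    """
    text = raw_text.strip()
    return int(text) if text.isdigit() else None


raw = '\n                                        2                                    '
print(clean_rank(raw))
```

BeautifulSoup can also remove the padding at extraction time via `get_text(strip=True)` instead of `.text`.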

In [9]:
# getting the artist

soup.select('span.info-artist')[0].text

'Nathan Evans'

In [10]:
# getting the track title

soup.select('span.info-title')[0].text

'Wellerman'

In [11]:
# getting the label

soup.select('span.info-label')[0].text

'Polydor'

In [12]:
# getting the date

soup.select('body > main > div.container > div > div.col-md-8 > span > strong:nth-child(1)')

[<strong>07.05.2021</strong>]

In [13]:
# 0 is the start date of the week

soup.select('strong')[0].text

'07.05.2021'

In [14]:
# 1 is the end date of the week

soup.select('strong')[1].text

'13.05.2021'
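
The two dates are plain strings in German day.month.year notation. A small sketch of turning them into `date` objects so they can be compared or sorted later. The constant name `DATE_FORMAT` is only illustrative:

```python
from datetime import datetime

DATE_FORMAT = "%d.%m.%Y"  # day.month.year, as printed on the chart page

week_start_date = datetime.strptime("07.05.2021", DATE_FORMAT).date()
week_end_date = datetime.strptime("13.05.2021", DATE_FORMAT).date()

# ISO strings sort chronologically, unlike the dd.mm.yyyy originals
print(week_start_date.isoformat(), week_end_date.isoformat())

# Number of days between the first and last day of the chart week
print((week_end_date - week_start_date).days)
```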

Now that I have the individual pieces of information for one track, I am going to build a loop to get the info for all tracks on this page.

In [15]:
# for loop to get artist, title, label, rank, week for all tracks of the specific week

artist = []
title = []
label = []
rank = []
week_start = []
week_end = []

track_lst = soup.select('span.info-title')
len_tracks = len(track_lst)

for track in range(len_tracks):
    artist.append(soup.select('span.info-artist')[track].text)
    title.append(soup.select('span.info-title')[track].text)
    label.append(soup.select('span.info-label')[track].text)
    rank.append(soup.select('td.ch-pos span')[track].text)
    week_start.append(soup.select('strong')[0].text)
    week_end.append(soup.select('strong')[1].text)
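
The loop above calls `soup.select(...)` again for every track. A possible variant runs each selector once and walks the results in parallel. This is only a sketch: the sample markup below is a hypothetical, heavily simplified stand-in for the real page, and `extract_tracks` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal markup that matches the selectors used above;
# the real chart page is nested far more deeply.
sample_html = """
<table>
  <tr>
    <td class="ch-pos"><span>1</span></td>
    <td><span class="info-artist">Nathan Evans</span>
        <span class="info-title">Wellerman</span>
        <span class="info-label">Polydor</span></td>
  </tr>
  <tr>
    <td class="ch-pos"><span>2</span></td>
    <td><span class="info-artist">Cro feat. Capital Bra</span>
        <span class="info-title">Blessed</span>
        <span class="info-label">Urban</span></td>
  </tr>
</table>
"""


def extract_tracks(page_soup):
    """Return one dict per track, selecting each column only once."""
    artists = page_soup.select('span.info-artist')
    titles = page_soup.select('span.info-title')
    labels = page_soup.select('span.info-label')
    ranks = page_soup.select('td.ch-pos span')

    # The four lists must line up one-to-one; fail loudly if they do not
    if not (len(artists) == len(titles) == len(labels) == len(ranks)):
        raise ValueError('selector results differ in length')

    return [
        {
            'artist': a.get_text(strip=True),
            'title': t.get_text(strip=True),
            'label': l.get_text(strip=True),
            'rank': r.get_text(strip=True),
        }
        for a, t, l, r in zip(artists, titles, labels, ranks)
    ]


sample_soup = BeautifulSoup(sample_html, 'html.parser')
tracks = extract_tracks(sample_soup)
print(tracks)
```

A list of dicts like this can be passed straight to `pd.DataFrame(...)`.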

In [16]:
# creating a data frame for the specific week

tracks_2021_cv19 = pd.DataFrame({'artist' : artist, 
                                'title' : title,
                                'label' : label,
                                'rank' : rank,
                                'week_start' : week_start,
                                'week_end' : week_end })

In [17]:
# this is the df for the top100 of week 19 from 2021

tracks_2021_cv19

Unnamed: 0,artist,title,label,rank,week_start,week_end
0,Nathan Evans,Wellerman,Polydor,1,07.05.2021,13.05.2021
1,Cro feat. Capital Bra,Blessed,Urban,2,07.05.2021,13.05.2021
2,Jamule,Liege wieder wach,Life Is Pain,3,07.05.2021,13.05.2021
3,Riton x Nightcrawlers feat. Mufasa & Hypeman,Friday (Dopamine Re-Edit),Columbia,4,07.05.2021,13.05.2021
4,Lil Nas X,Montero (Call Me By Your Name),Columbia,5,07.05.2021,13.05.2021
...,...,...,...,...,...,...
95,Ilira x Vize,Dynamite,Epic / Crash Your Sound,96,07.05.2021,13.05.2021
96,Ufo361,No Hugs,Stay High,97,07.05.2021,13.05.2021
97,BHZ,Overdose,Sony Music / BHZ,98,07.05.2021,13.05.2021
98,T-Low,Ordentlich,Spinnup,99,07.05.2021,13.05.2021


In [18]:
# saving df to csv

tracks_2021_cv19.to_csv('../data/tracks_2021_cv19_raw.csv') 

### Scraping Top100 several pages 

Now that this worked for the first page, I want to get all the pages for all weeks between 2010 and 2021.

**These are the urls:**<br/>
url current week - https://www.offiziellecharts.de/charts/single/for-date-1620390068000
<br/><br/>
url previous week - https://www.offiziellecharts.de/charts/single/for-date-1619733600000

Only the number at the end changes. Unfortunately, there is no pattern behind the number that I recognise.
However, there is a 'previous' button which I can use.
I have to check how I can access the 'previous' button on the website, so that I can loop through all the desired pages and apply the code from above.
The 'previous' button contains the part of the URL that needs to change to get the URL of the previous page.
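
On closer inspection, the trailing number appears to be a Unix timestamp in milliseconds: decoded, 1620390068000 falls on 7 May 2021, which matches the start date of the week scraped above. The quick check below uses only the standard library, and `decode_chart_timestamp` is a made-up helper name. Whether the site accepts arbitrary timestamps in the URL is not tested here, so following the 'previous' link remains the safer route.

```python
from datetime import datetime, timezone


def decode_chart_timestamp(ms):
    """Interpret the number in the chart URL as milliseconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# The two URL suffixes listed above
for ms in (1620390068000, 1619733600000):
    print(ms, '->', decode_chart_timestamp(ms).isoformat())
```

The second value decodes to 29 April 2021, 22:00 UTC, which is midnight on 30 April in German summer time.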

In [19]:
# importing libraries

import re
import numpy as np
from tqdm import tqdm

import datetime

from time import sleep
from random import randint

In [20]:
# checking how to access the previous button

soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')

[<a class="prev-link btn btn-default btn-sm" href="/charts/single/for-date-1619771950000"><i class="fa fa-chevron-left"></i> ZURÜCK</a>]

In [21]:
# checking how to extract the url part and putting it together to form the new url

url = "https://www.offiziellecharts.de" + str([a['href'] for a in soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')][0]) + "/"
url
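
As an alternative to string concatenation, the relative `href` can be combined with the site root using `urllib.parse.urljoin` from the standard library. A minimal sketch, with the `href` value copied from the button output above. Note that, unlike the cell above, it does not append a trailing slash.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.offiziellecharts.de'

# href value copied from the 'previous' button shown above
prev_href = '/charts/single/for-date-1619771950000'

prev_url = urljoin(BASE_URL, prev_href)
print(prev_url)
```

To fetch the link itself, `soup.select_one(...)` returns the first match or `None`, which avoids indexing into a possibly empty list.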

In [23]:
# code inspiration from: https://github.com/jast92/corona-impact-on-DE-music/blob/main/top100/jupyter/01_charts-scrape-multiple.ipynb

url = 'https://www.offiziellecharts.de/charts/single/for-date-1620390068000'

artist = []
title = []
label = []
rank = []
week_start = []
week_end = []
page = 0


while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    #finding out the start and end date of the top 100 songs list
    periodhelp = soup.select('strong')
    
    weekdates = []
    for dt in periodhelp:
        weekdates.append(dt.text)
    
    if int(datetime.datetime.strptime(weekdates[0], "%d.%m.%Y").year) == 2009:
        break

    track_lst = soup.select('span.info-title')
    len_tracks = len(track_lst)

    for track in range(len_tracks):
        artist.append(soup.select('span.info-artist')[track].text)
        title.append(soup.select('span.info-title')[track].text)
        label.append(soup.select('span.info-label')[track].text)
        rank.append(soup.select('td.ch-pos span')[track].text)
        week_start.append(soup.select('strong')[0].text)
        week_end.append(soup.select('strong')[1].text)
    page += 1
    print('page', page, 'done' )

    if len(soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')) > 0:
        url = "https://www.offiziellecharts.de" + str([a['href'] for a in soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')][0]) + "/"
        continue
    
    elif len(soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')) <= 0:
        break
    
    
dict_top100 = {'artist' : artist, 
                 'title' : title,
                 'label' : label,
                 'rank' : rank,
                 'week_start' : week_start,
                 'week_end' : week_end}
    
df = pd.DataFrame(dict_top100)
   
df.to_csv('../data/top100_tracks_raw.csv')

page 1 done
page 2 done
page 3 done
...
page 594 done
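
The crawl above sends several hundred requests in a tight loop, and `sleep` was imported but never used. Below is a sketch of how the loop structure could be factored into a reusable function with a pause between requests. Everything here is illustrative: `crawl_backwards` and `fetch_page` are made-up names, and `fetch_page` stands in for the request-and-parse step, so the loop can be tried offline against fake pages.

```python
import time
from datetime import datetime


def crawl_backwards(start_url, fetch_page, first_year=2010, delay_seconds=1.0):
    """Follow 'previous' links until a week starts before first_year.

    fetch_page(url) is a hypothetical callable returning a tuple
    (week_start, rows, prev_url): the week's start date as 'dd.mm.yyyy',
    a list of row dicts, and the previous page's URL (or None).
    """
    all_rows = []
    url = start_url
    while url is not None:
        week_start, rows, prev_url = fetch_page(url)
        # Stop as soon as a page lies before the first year of interest
        if datetime.strptime(week_start, '%d.%m.%Y').year < first_year:
            break
        all_rows.extend(rows)
        url = prev_url
        if url is not None:
            time.sleep(delay_seconds)  # pause between requests
    return all_rows


# Stand-in for the real site: each key is a 'URL', each value a parsed page
fake_site = {
    'week3': ('15.01.2010', [{'title': 'c'}], 'week2'),
    'week2': ('08.01.2010', [{'title': 'b'}], 'week1'),
    'week1': ('01.01.2010', [{'title': 'a'}], 'week0'),
    'week0': ('25.12.2009', [{'title': 'z'}], None),
}

rows = crawl_backwards('week3', fake_site.__getitem__, delay_seconds=0)
print([row['title'] for row in rows])
```

Comparing with `< first_year` instead of `== 2009` also stops the loop if a page ever skips past 2009 entirely.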


In [25]:
# checking whether it worked and how many rows we got

data = pd.read_csv('../data/top100_tracks_raw.csv')
data.shape

(59400, 7)

In [26]:
# looking at what the data looks like

data.tail()

Unnamed: 0.1,Unnamed: 0,artist,title,label,rank,week_start,week_end
59395,59395,a-ha,Nothing Is Keeping You Here,WE LOVE MU,96,01.01.2010,07.01.2010
59396,59396,Coldplay,Viva la vida,PARLOPHONE,97,01.01.2010,07.01.2010
59397,59397,Sido,Geburtstag,UDD UNIVER,98,01.01.2010,07.01.2010
59398,59398,Pur,Irgendwo,EMI,99,01.01.2010,07.01.2010
59399,59399,Inna,Hot,KONTOR REC,100,01.01.2010,07.01.2010


Next steps with this df will be:

- clean up the unneeded index column
- extract the year into a separate column
- check the individual artists and decide how to deal with features
- check outliers like all the Christmas songs
- add another feature called genre
  -  take the artists from the top20 hip hop charts and assign the genre hip hop to them
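
The first two steps could look roughly like the sketch below. It uses a two-row inline sample in the shape of the raw CSV rather than the real file, and the variable name `charts` is only illustrative.

```python
import io

import pandas as pd

# Two rows in the shape of the raw CSV, including the unwanted index column
raw_csv = io.StringIO(
    ",artist,title,label,rank,week_start,week_end\n"
    "0,Nathan Evans,Wellerman,Polydor,1,07.05.2021,13.05.2021\n"
    "1,Inna,Hot,KONTOR REC,100,01.01.2010,07.01.2010\n"
)

# index_col=0 reads the first column as the index instead of 'Unnamed: 0'
charts = pd.read_csv(raw_csv, index_col=0)

# Parse the German date format explicitly, then pull out the year
charts['week_start'] = pd.to_datetime(charts['week_start'], format='%d.%m.%Y')
charts['year'] = charts['week_start'].dt.year

print(charts[['artist', 'week_start', 'year']])
```

Writing the file with `to_csv(..., index=False)` in the first place would avoid the extra column altogether.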

## Scraping German Top20 Hip Hop charts

I also want to get the most popular Deutschrap (DR) albums. For that I need to scrape another subpage of the German charts:
the Top 20 Hip Hop charts. It should work with the same code as before and only needs the new URL.

In [28]:
url = 'https://www.offiziellecharts.de/charts/hiphop'

artist = []
title = []
label = []
rank = []
week_start = []
week_end = []
genre = []
page = 0


while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # finding out the start and end date of the top 20 hip hop list
    periodhelp = soup.select('strong')
    
    weekdates = []
    for dt in periodhelp:
        weekdates.append(dt.text)
    
    if int(datetime.datetime.strptime(weekdates[0], "%d.%m.%Y").year) == 2009:
        break

    track_lst = soup.select('span.info-title')
    len_tracks = len(track_lst)

    for track in range(len_tracks):
        artist.append(soup.select('span.info-artist')[track].text)
        title.append(soup.select('span.info-title')[track].text)
        label.append(soup.select('span.info-label')[track].text)
        rank.append(soup.select('td.ch-pos span')[track].text)
        week_start.append(soup.select('strong')[0].text)
        week_end.append(soup.select('strong')[1].text)
        genre.append('HipHop')
    page += 1
    print('page', page, 'done' )

    if len(soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')) > 0:
        url = "https://www.offiziellecharts.de" + str([a['href'] for a in soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')][0]) + "/"
        continue
    
    elif len(soup.select('body > main > div.container > div > div.col-md-8 > span > span > a.prev-link.btn.btn-default.btn-sm')) <= 0:
        break
    
    
dict_hiphop_top20 = {'artist' : artist, 
                 'title' : title,
                 'label' : label,
                 'rank' : rank,
                 'week_start' : week_start,
                 'week_end' : week_end,
                  'genre' : genre}
    
df = pd.DataFrame(dict_hiphop_top20)
   
df.to_csv('../data/top20_hiphop_tracks_raw.csv')

page 1 done
page 2 done
page 3 done
...

In [29]:
# checking whether it worked and how many rows we got

data2 = pd.read_csv('../data/top20_hiphop_tracks_raw.csv')
data2.shape

(6420, 8)

In [30]:
# looking at what the data looks like

data2.tail()

Unnamed: 0.1,Unnamed: 0,artist,title,label,rank,week_start,week_end,genre
6415,6415,Kollegah,King,Selfmade Records,16,23.03.2015,29.03.2015,HipHop
6416,6416,Alligatoah,Triebwerke,Trailerpark Recordings,17,23.03.2015,29.03.2015,HipHop
6417,6417,Casper,Hinterland,Four Music,18,23.03.2015,29.03.2015,HipHop
6418,6418,Haftbefehl,Russisch Roulette,Universal Domestic Urban,19,23.03.2015,29.03.2015,HipHop
6419,6419,Eminem,Curtain Call,Interscope,20,23.03.2015,29.03.2015,HipHop


I realised that the data on the top20 hip hop albums only goes back to 2015: the earliest week scraped is 23.03.2015 to 29.03.2015.