# Hot 100 Data

The data set we'll put into shape here is the weekly Hot 100 charts, covering Aug. 4, 1958 through Apr. 13, 2019.

In [1]:
import os
import numpy as np
import pandas as pd
from glob import glob

# Switch directory to ../data/interim/hot100 (unless we are already there)
current_dir = os.getcwd()
destination_dir = '/data/interim/hot100'
if not current_dir.endswith(destination_dir):
    os.chdir('..' + destination_dir)

In [2]:
# Read in all subfiles (tab-separated despite the .csv extension)
filenames = glob('*.csv')
print(len(filenames))
hot100_dfs = [pd.read_csv(filepath,sep='\t',usecols=['date','rank','title','artist'],
                          parse_dates=['date']) for filepath in filenames]
hot100_all = pd.concat(hot100_dfs)
hot100_all.head()

2966


,rank,date,title,artist
0,1,1967-07-08,Windy,The Association
1,2,1967-07-08,Little Bit O' Soul,The Music Explosion
2,3,1967-07-08,Can't Take My Eyes Off You,Frankie Valli
3,4,1967-07-08,San Francisco (Be Sure To Wear Flowers In Your...,Scott McKenzie
4,5,1967-07-08,Don't Sleep In The Subway,Petula Clark


### Add Performance Values

A track can stay on the Hot 100 for a long time, and its rank will vary during that period. To get a clearer picture of a song's best performance, I'll add several performance indicators: `peak` (best rank reached, i.e. the lowest rank number), `streak` (longest continuous run on the Hot 100), and `multiple` (maximum number of songs by one artist on the chart at the same time).
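To make the three indicators concrete, here is a minimal sketch on a tiny made-up chart. It is not the notebook's actual implementation: the `toy` frame, the `longest_streak` helper, and the song and artist names are hypothetical. It also assumes that chart dates are exactly 7 days apart and that a streak is counted in consecutive chart weeks.

```python
import pandas as pd

# Hypothetical toy data: two songs by one artist, with a gap week for Song A.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-01', '2000-01-08', '2000-01-22',
                            '2000-01-01', '2000-01-08']),
    'rank': [5, 3, 9, 7, 8],
    'title': ['Song A', 'Song A', 'Song A', 'Song B', 'Song B'],
    'artist': ['Artist X'] * 5,
})
keys = ['title', 'artist']

# peak: best rank per track, i.e. the numerically lowest rank.
peak = toy.groupby(keys)['rank'].min().rename('peak')

def longest_streak(dates):
    """Length of the longest run of chart dates spaced exactly 7 days apart."""
    gaps = dates.sort_values().diff().dt.days
    # A new run starts wherever the gap is not 7 days (including the first row).
    run_id = (gaps != 7).cumsum()
    return int(run_id.value_counts().max())

# streak: longest continuous run of weeks per track (assumed definition).
streak = toy.groupby(keys)['date'].apply(longest_streak).rename('streak')

# multiple: most distinct titles one artist has on the chart in a single week.
per_week = toy.groupby(['artist', 'date'])['title'].nunique()
multiple = per_week.groupby(level='artist').max().rename('multiple')

print(peak, streak, multiple, sep='\n\n')
```

Song A drops off the toy chart for a week, so its run is broken there. If the real chart dates are not evenly spaced, the 7-day check in `longest_streak` would need adjusting.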

In [3]:
# Add peak values: the lowest rank number each (title, artist) pair ever reached
peak_ranks = hot100_all.groupby(['title','artist'])['rank'].min()
peak_ranks = peak_ranks.rename('peak')
hot100_all_peak = hot100_all.merge(peak_ranks,how='left',on=['title','artist'])
hot100_all_peak.head(5)

,rank,date,title,artist,peak
0,1,1967-07-08,Windy,The Association,1
1,2,1967-07-08,Little Bit O' Soul,The Music Explosion,2
2,3,1967-07-08,Can't Take My Eyes Off You,Frankie Valli,2
3,4,1967-07-08,San Francisco (Be Sure To Wear Flowers In Your...,Scott McKenzie,4
4,5,1967-07-08,Don't Sleep In The Subway,Petula Clark,5
