# How to be Choosy: Billboard Hot 100

In [None]:
# import urllib # uncomment if you'd like to use the live datastream
import pandas as pd
from datetime import datetime as dt

In [None]:
source = "bh100.csv" # this is the dataset we want
bh100 = pd.read_csv(source) # load it as a data frame
bh100

# Wrangling Too Many Cases

## Random Selection

In [None]:
bh100reduced = bh100.sample(5000) # put 5000 randomly selected rows in bh100reduced
bh100reduced

In [None]:
# the line below keeps every 6th row, a total of 4916 records
bh100reduced = bh100.iloc[::6, :] # put every 6th row of bh100 in bh100reduced
bh100reduced
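The two selection strategies above can be compared side by side on a small stand-in table. This is only a sketch: the `toy` data frame and its values are made up for illustration, and `random_state=42` is an arbitrary seed added so that the random sample is the same on every run.

```python
import pandas as pd

# Toy stand-in for bh100 (hypothetical values, not real chart data)
toy = pd.DataFrame({
    "Song Name": [f"Song {i}" for i in range(30)],
    "Highest BH100 Position": [(i * 7) % 100 + 1 for i in range(30)],
})

# Random selection: a fixed random_state returns the same rows on every run
random_sample = toy.sample(n=10, random_state=42)

# Systematic selection: every 6th row, starting from row 0
every_sixth = toy.iloc[::6, :]

print(len(random_sample))         # 10
print(list(every_sixth.index))    # [0, 6, 12, 18, 24]
```

Without `random_state`, each call to `sample` draws a different set of rows, which is fine for exploration but makes results harder to reproduce.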

## Purposeful Sampling by Attribute(s)

In [None]:
# put only the records with "Highest BH100 Position" equal to 1 in bh100reduced
bh100reduced = bh100[bh100['Highest BH100 Position']==1]
bh100reduced
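The heading mentions attribute(s), so here is a minimal sketch of filtering on two attributes at once. The `toy` table is invented for illustration, and the column name "Year Released" is assumed to match the attribute listed later in this notebook; check `bh100.columns` for the exact spelling in the real file.

```python
import pandas as pd

# Toy stand-in for bh100 (hypothetical values, not real chart data)
toy = pd.DataFrame({
    "Song Name": ["A", "B", "C", "D"],
    "Year Released": [1965, 1984, 1999, 2015],
    "Highest BH100 Position": [1, 12, 1, 1],
})

# Build one True/False mask per condition, then combine them with & (and)
is_number_one = toy["Highest BH100 Position"] == 1
is_since_1990 = toy["Year Released"] >= 1990

toy_reduced = toy[is_number_one & is_since_1990]
print(toy_reduced["Song Name"].tolist())  # ['C', 'D']
```

Use `|` instead of `&` to keep rows that meet either condition. If the conditions are written inline rather than as named masks, each one needs its own parentheses.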

# LEARN MORE ABOUT THIS DATASET

This dataset includes every song that's ever appeared on the Billboard Hot 100 Charts (August 1958-May 2021). Only some of the available attributes are initially shown. Use the "attributes" tab in the "Choosy" window to select which attributes or attribute groups to show and hide.

### About the Attribute Groups. 
Each record includes Basic Info such as the Song Name, Performer, the Month and Year released; Popularity measures such as the song's highest Billboard position and the number of weeks the song stayed on the Hot 100 list; and a list of the Genre(s) represented by the song. When available, each song record also includes information scraped from Spotify including the Spotify ID, URL, and popularity on the Spotify app. Spotify-generated measures of musical features are grouped into Performance Features, which explore the song's "speechiness," "liveness," and other inferred performance features; Emotion Features, which explore inferred features such as the song's energy level and valence; and Sound Features, such as the song's tempo, time signature, and loudness.

### About the Attributes. 
Song Name
Performer
Year Released
*Highest BH100 Position* was computed from the "Hot Stuff" database using the minimum position listed for this SongID.
Spotify Popularity
*Danceability* describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
*Energy* is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
*Loudness* of a track is measured in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
*Speechiness* detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
*Valence* is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
*Tempo* is estimated in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
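The Speechiness thresholds described above (0.33 and 0.66) can be turned into labeled bands with `pd.cut`. This is a sketch under assumptions: the `toy` values and band labels are made up, and the column name "Speechiness" may not match the real file exactly.

```python
import pandas as pd

# Toy stand-in with hypothetical Speechiness values
toy = pd.DataFrame({
    "Song Name": ["A", "B", "C"],
    "Speechiness": [0.04, 0.45, 0.90],
})

# Bin edges follow the thresholds in the attribute description
toy["Speechiness Band"] = pd.cut(
    toy["Speechiness"],
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["mostly music", "music and speech", "mostly speech"],
    include_lowest=True,  # keep a value of exactly 0.0 in the first band
)

print(toy["Speechiness Band"].tolist())
# ['mostly music', 'music and speech', 'mostly speech']
```

The description does not say which band a value of exactly 0.33 or 0.66 belongs to. With the default settings used here, `pd.cut` places a boundary value in the lower band.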

### History and Purpose. 
This dataset was initially imported into CODAP in Summer 2022 for a teacher workshop. It was updated in Spring 2023, and is being used as part of the Writing Data Stories project and the City University of New York's Computing Integrated Teacher Education (CUNY CITE) program. 

### Data Sources and Data Cleaning. 
This dataset was constructed by Sean Miller (github handle: HipsterVizData) using APIs to download Billboard Hot 100 (BH100) and Spotify (S) data. The full dataset and additional information about its original construction can be accessed at this link. Michelle Wilkerson of the WDS team merged the two tables in the dataset by mapping the BH100 Song Name attribute to the Spotify SongID, removing all Spotify records that did not have a corresponding BH100 entry but retaining BH100 songs that did not have corresponding Spotify entries. Michelle imported attribute descriptions from the original dataset, editing a few descriptions for readability at the middle school level; consolidated music genres into 7 major genre flags while retaining the full genre list as a separate attribute; and removed several attributes for simplicity. Michelle also grouped Spotify-generated song features into three categories (Performance, Emotion, and Sound Features) visible in the "Choosy" menu.
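The merge described above keeps every BH100 song and drops Spotify records that have no BH100 match, which is the behavior of a left join. The sketch below illustrates that idea in pandas. It is not the procedure actually used to build this dataset: the two small tables, their values, and the shared "SongID" key are all hypothetical.

```python
import pandas as pd

# Hypothetical Billboard table: every song here must survive the merge
billboard = pd.DataFrame({
    "SongID": ["s1", "s2", "s3"],
    "Highest BH100 Position": [1, 40, 7],
})

# Hypothetical Spotify table: "s9" has no Billboard entry, "s2" is missing
spotify = pd.DataFrame({
    "SongID": ["s1", "s3", "s9"],
    "Spotify Popularity": [80, 55, 60],
})

# how="left" keeps all Billboard rows and only the matching Spotify rows
merged = billboard.merge(spotify, on="SongID", how="left")

print(merged["SongID"].tolist())                     # ['s1', 's2', 's3']
print(merged["Spotify Popularity"].isna().tolist())  # [False, True, False]
```

Songs without a Spotify match stay in the table with missing values in the Spotify columns, which is why some records in this dataset have no Spotify information.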