# How to be Choosy: Billboard Hot 100
_by Michelle Hoda Wilkerson_

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CalCoRE/choosy/main?urlpath=%2Fdoc%2Ftree%2Fbh100.ipynb)  <- Click here to open an interactive version of this notebook!

<div class="alert alert-block alert-info"> <b>NOTE:</b> This notebook is written to illustrate the data wrangling moves described in [anonymized manuscript], using a dataset about songs that have charted on the Billboard Hot 100 charts (see <a href="#about">here</a> for more information). Some of the text below is excerpted from the manuscript. Each section below corresponds to a subheading in the manuscript. See also How to be Choosy: 2022 CA Toxic Release Inventory. </div>

This is a Jupyter Python notebook. It mixes together Python code and text into one runnable, editable document (like a Word file or Google Doc). When you see code, you can run it by clicking the "Play" button in your interface. You can also edit it and see what happens! Often, pieces of code that appear later in the document rely on pieces of code that appear earlier. So if you are a beginner, it's useful to work through every chunk of the document in order, top to bottom, and try to understand what each part does before hopping around or making too many edits.

With any Jupyter notebook, we begin by importing the necessary libraries. You should run the cell below before running any other code. Here, we are using the `pandas` library, an industry standard that provides us with special methods for reading in and working with data in Python. We also import the `datetime` library to help us process and sort date information in the dataset. The `seaborn` and `matplotlib` libraries help us with data visualization, and `numpy` helps us with calculations.

In [None]:
import pandas # for data wrangling
import datetime # to handle dates
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # for data visualization
import numpy as np # for calculations

# if there are more than 10 records to show in a table, only show the first and last 5
pandas.set_option('display.max_rows', 10) 

# Table of Contents
* [Introduction to the BH100 Dataset](#Introduction-to-BH100-)
* [Wrangling Too Many Cases](#Wrangling-Too-Many-Cases-)
    * [Random Selection](#Random-Selection-)
    * [Purposeful Selection by Attribute(s)](#Purposeful-Selection-by-Attribute(s)-)
    * [Building Your Own Selection Attribute](#Building-Your-Own-Selection-Attribute-)
* [Wrangling Too Many Attributes](#Wrangling-Too-Many-Attributes-)
    * [Thematic Selection](#Thematic-Selection-)
    * [Pattern-Driven Selection](#Pattern-Driven-Selection-)
    * [Question-Driven Selection](#Question-Driven-Selection-)
* [More About This Dataset](#About-the-Billboard-Hot-100-Dataset-)

# Introduction to BH100

Let's check out the dataset. In the code below, we read the CSV file and use the `pandas` library to process this information and turn it into a data frame called `bh100`. The last line, which just says `bh100`, will give a brief summary of the contents of the data frame so you can check and make sure it has been read and processed correctly. Since there are too many cases to list, you will see the first five rows and the last five rows of the dataset, with "..." in the middle to indicate there are more cases that are not shown.

In [None]:
# read the contents of bh100.csv into a dataframe
bh100 = pandas.read_csv("bh100.csv")

bh100 # show us the contents of the new data frame

At the bottom of the output from our code above, we can see the specific dimensions of this dataset. It has 29494 cases (rows), and 35 attributes (columns). Below, we model the different wrangling strategies described in _How to be Choosy_ to reduce the size and/or complexity of the `bh100` dataset so that it is more appropriate for different educational applications.

# Wrangling Too Many Cases <a id="cases"></a>

## Random Selection <a id="random"></a>

Random selection extracts random rows from a large dataset to create one of a more manageable size. This is the most appropriate strategy for downsizing a dataset while preserving a representative snapshot of the full phenomenon to be studied. Starting with your dataframe, you can create a random selection using the `sample()` method. Below, we select exactly 5000 random cases from the `bh100` dataset and save them as a new reduced dataset called `bh100random`. If you run the code more than once, you will see that different cases are included in the output each time.

In [None]:
bh100random = bh100.sample(5000) # put 5000 randomly selected rows in bh100random
bh100random
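
If you want everyone in a class to get the same "random" rows, one option is the `random_state` parameter of `sample()`. The sketch below is only an illustration: it uses a small made-up table called `toy` (not the real `bh100` data), and the seed value 42 is an arbitrary choice.

```python
import pandas

# Hypothetical stand-in for bh100: 100 made-up songs
toy = pandas.DataFrame({
    'Song Name': [f'Song {i}' for i in range(100)],
    'Weeks on BH100': [(i * 7) % 30 + 1 for i in range(100)],
})

# random_state fixes the random seed, so the same rows are chosen on every run
sample_a = toy.sample(10, random_state=42)
sample_b = toy.sample(10, random_state=42)

print(sample_a.equals(sample_b))  # do the two samples contain identical rows?
sample_a
```

Leaving out `random_state`, as in the cell above, gives a fresh selection each time, which can be useful when you want students to compare different samples.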

A related technique is interpolated selection, or selecting every <i>n</i>th row of a data table. This might be useful when the order of the data matters (for example, if records are organized by date and you are interested in modeling patterns over time). However, we do not recommend interpolated selection unless you have a specific reason for using this method, because it can also lead to unintentionally non-random sampling.

In [None]:
bh100reduced = bh100.iloc[::6, :] # put every 6th row of bh100 in bh100reduced
bh100reduced
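
To make the caution about non-random sampling concrete, here is a small sketch using a made-up table (the `toy` data and its `Genre` column are hypothetical, not part of `bh100`). The rows happen to repeat in blocks of six, so taking every 6th row lands on the same genre every time.

```python
import pandas

# Hypothetical data: 60 made-up songs whose genres repeat in blocks of six
genres = ['Rock', 'Rap', 'Dance', 'Latin Pop', 'Novelty', 'Adult Standards']
toy = pandas.DataFrame({
    'Song Name': [f'Song {i}' for i in range(60)],
    'Genre': [genres[i % 6] for i in range(60)],
})

every_sixth = toy.iloc[::6, :]  # rows 0, 6, 12, ... of the toy table

# List the distinct genres that survived the selection
print(sorted(set(every_sixth['Genre'])))
```

Real datasets rarely repeat this neatly, but weekly or monthly records can hide similar cycles, so it is worth checking how the rows are ordered before selecting every <i>n</i>th one.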

## Purposeful Selection by Attribute(s) <a id="purposeful"></a>

Purposeful selection involves reducing a dataset so that it only includes records with certain characteristics related to one or more attributes. This method is appropriate if you suspect that the majority of records in your too-large dataset are not useful or usable for your intended activity. To get all the cases in a dataframe that meet certain conditions, use the expression `dataframe[condition]`. Below, we want only the songs that reached number 1 on the charts, represented by cases in `bh100` where the value of `Highest BH100 Position` is equal to 1.

In [None]:
topsongs = bh100['Highest BH100 Position']==1  # for each song, see if the highest position was 1
bh100reduced = bh100[topsongs]                 # put only top song records in the reduced dataset
bh100reduced
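
Selection can also draw on more than one attribute at a time. The sketch below shows one way to combine conditions, using `&` for "and" and `|` for "or". It runs on a tiny made-up table (`toy`) that borrows two column names from `bh100`; the values are invented for illustration.

```python
import pandas

# Hypothetical stand-in for bh100 with invented values
toy = pandas.DataFrame({
    'Song Name': ['Song A', 'Song B', 'Song C', 'Song D'],
    'Highest BH100 Position': [1, 1, 5, 12],
    'Year Released': [1995, 2012, 1995, 2012],
})

reached_top = toy['Highest BH100 Position'] == 1  # True for songs that hit number 1
from_1995 = toy['Year Released'] == 1995          # True for songs released in 1995

both = toy[reached_top & from_1995]    # number 1 songs AND released in 1995
either = toy[reached_top | from_1995]  # number 1 songs OR released in 1995
both
```

Storing each condition in its own named variable, as the cell above does with `topsongs`, keeps combined selections readable.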

## Building Your Own Selection Attribute <a id="byo"></a>

There are other ways of creating a smaller dataset based on information that is not already available in the dataset itself. These could be specific cases that you identify manually, or cases you might identify using computer code to extract some new, meaningful information from the attributes you already have. These techniques allow teachers and students the most customization, but they require careful planning to select and identify which cases should be included and to consider how those decisions will shape what analyses and claims are appropriate. 

### By Identifying Specific Indices

Sometimes, you may want to create a small dataset through manually selecting a small number of cases from a larger data corpus. This can be helpful, for example, if you would like students to explore the meaning of different attributes and measures using a small, familiar dataset before diving into larger-scale analyses. One way to do this is by building a list of the ID numbers or _indices_ of the songs you want to include. Below is code you can use to find out what your favorite song indices are and use those to create a list. In the Toxic Release Inventory notebook, we show how you can construct a list directly using text searches.

In [None]:
# Use the lines below to find the indices of the songs you want to include by name
# For example, let's see if the dataset has Adele's velvety epic "Skyfall."
# If the song with this title is in the dataset, the second line of code will output
# the full record for the song. If a song with the title does not exist in the dataset, 
# it will output an empty table.

favoriteSong = bh100['Song Name']=="Skyfall" # look up the song by name
bh100[favoriteSong]                          # show the full record of the song

# The number in bold, in the first column of the table above, is the song's ID
# Try looking for a few of your favorite songs!

You can also search by other attributes, such as the release year, as in the next cell. Once you have collected the indices of the songs you want to include in the dataset, you can use a list of these numbers to create the new subset, as in the cell after that.

In [None]:
# We might not immediately know what songs we want to include, and would benefit
# from searching the full dataset for certain attributes. Let's check out what was 
# going on in 1995, a year that may or may not reflect someone's carefree adolescent years.
# Notice that here, the year is not in quotes because it is an integer, not a string:

carefreeSongs = bh100['Year Released']==1995 # look up songs by year
bh100[carefreeSongs]

In [None]:
# Use the list below to store the indices of songs you've identified to include in your subset.
# Try finding some of your favorite songs above and adding them to the list!
mySongIndices = [10386,29091,9635,21252,5452,11140,4153]

# Once you have a list of indices for the records you want to keep, you can make your new dataset.
# Passing the list to .loc returns those rows in the order listed.
bh100reduced = bh100.loc[mySongIndices]
bh100reduced
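
Instead of copying index numbers by hand from the output, you can also let the code collect them. The sketch below is one possible approach, shown on a small made-up table (`toy`, with invented titles and performers) rather than on `bh100`.

```python
import pandas

# Hypothetical stand-in for bh100 with invented songs
toy = pandas.DataFrame({
    'Song Name': ['Song A', 'Song B', 'Song C', 'Song D', 'Song A'],
    'Performer': ['Artist X', 'Artist X', 'Artist Y', 'Artist Y', 'Artist Z'],
})

# .index holds the row labels of the matching rows; tolist() turns them into a list
song_a_ids = toy[toy['Song Name'] == 'Song A'].index.tolist()
artist_y_ids = toy[toy['Performer'] == 'Artist Y'].index.tolist()

my_ids = song_a_ids + artist_y_ids  # join the two lists of indices
subset = toy.loc[my_ids]            # keep only those rows, in list order
subset
```

Note that a title search can return more than one row when different performers share a song name, so it is worth looking over the matches before adding them to your list.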

### Using Code to Construct a New Attribute

In other cases, you may want to use more complex queries with code to reduce the number of cases in a dataset to a focused set. Below, we look for songs that have versions of the word "love" in the title.

In [None]:
# We use "Love|love|LOVE" so that the search will look for each capitalization.
# The pipe character | means "or," so that songs with any of these words are included.
# Try editing the code to find other versions of the word "love" as well.
# To find song titles that have certain word combinations, like "love" and "you,"
# run a separate search for each word and join the two results with the & operator
# (putting & inside the quotes would search for a literal "&" character).
searchFor = bh100['Song Name'].str.contains("Love|love|LOVE")
bh100reduced = bh100[searchFor]
bh100reduced
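
The sketch below shows a few variations on this text search, using a handful of made-up titles (the `toy` table is hypothetical). It relies on two optional parameters of `str.contains`: `case=False` to ignore capitalization and `na=False` to treat missing titles as non-matches.

```python
import pandas

# Hypothetical titles, including one missing value
toy = pandas.DataFrame({'Song Name': [
    'LOVE ME', 'I Love You', 'Glove Box', 'You and Me', None]})
titles = toy['Song Name']

# Match "love" in any capitalization; missing titles count as "no match"
has_love = titles.str.contains('love', case=False, na=False)
has_you = titles.str.contains('you', case=False, na=False)

love_songs = toy[has_love]              # note that "Glove Box" also matches
love_and_you = toy[has_love & has_you]  # titles with both "love" and "you"

# \b marks a word boundary, so this matches "love" only as a whole word
whole_word = toy[titles.str.contains(r'\blove\b', case=False, na=False)]
whole_word
```

The "Glove Box" row is a reminder that a plain text search matches letters anywhere in a title, which students may want to notice and discuss before drawing conclusions from the subset.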

# Wrangling Too Many Attributes <a id="attributes"></a>

Another common issue when using large public datasets for educational purposes is the problem of too many attributes. Environmental datasets such as the Toxic Release Inventory dataset we use in the other notebook example can include hundreds of specific indicators; survey datasets from organizations such as the Pew Research Center similarly report scores of questions per participant. Even this Billboard Hot 100 dataset has too many attributes to comfortably review in a Python notebook. While having access to so many attributes can enable the pursuit of a variety of investigative questions, it can easily become overwhelming. Working with these datasets requires planning and thoughtfulness to consider which attributes are actually connected to one's research question.

## Thematic Selection <a id="thematic"></a>

Thematic selection involves splitting a dataset up into related, but distinct, groups of attributes that are more likely to be conceptually or statistically related to one another. Thematic attribute selection can be especially useful for jigsaw-like activities, in which different groups explore different aspects of an interconnected system. Below, we create thematic groups by the different types of information we have about each song. You can then access a dataset with a reduced number of selected attributes by calling the dataframe with the attribute group name in brackets. You can include attributes from multiple groups using the plus sign: `dataframe[selection1+selection2]`. Below, we create a reduced dataset with only basic song information.

In [None]:
# Build our thematic categories using lists of column names
basicInfo = ['Song Name','Performer','Year Released','Month Released','Track Duration','Album']
popularity = ['Highest BH100 Position','First Week BH100','Last Week BH100','Weeks on BH100']
spotifyInfo = ['Spotify Popularity','Spotify Track Id','Spotify Track Preview URL']
genre = ['Spotify Genre Full List','Rock','Latin Pop','Rap','Dance','Novelty','Adult Standards','Rhythm N Blues']
performanceFeatures = ['Explicit','Speech-iness','Acoustic-ness','Instrumental-ness','Live-ness']
emotionFeatures = ['Danceability','Energy','Valence']
soundFeatures = ['Key','Loudness','Mode','Tempo','Time Signature']

# use brackets to reference only the columns associated with one category.
bh100BasicInfo = bh100[basicInfo]
bh100BasicInfo
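
The plus-sign combination described above can be sketched as follows. The example uses a tiny made-up table (`toy`) and shortened versions of the `basicInfo` and `popularity` lists so that it runs on its own; the values are invented.

```python
import pandas

# Hypothetical stand-in for bh100 with invented values
toy = pandas.DataFrame({
    'Song Name': ['Song A', 'Song B'],
    'Performer': ['Artist X', 'Artist Y'],
    'Highest BH100 Position': [1, 40],
    'Weeks on BH100': [20, 3],
    'Tempo': [120.0, 95.5],
})

# Shortened thematic groups for this illustration
basicInfo = ['Song Name', 'Performer']
popularity = ['Highest BH100 Position', 'Weeks on BH100']

# Adding two lists joins them, so this keeps the columns from both groups
combined = toy[basicInfo + popularity]
combined
```

Because the groups are ordinary Python lists, the columns appear in the order the lists are added, and any attribute left out of both lists (here, `Tempo`) is dropped from the result.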

## Pattern-Driven Selection <a id="pattern"></a>

Similar to thematic selection, pattern-driven selection allows educators and students to focus only on the attributes that are known to align with investigative or pedagogical goals. Whereas thematic selection focuses on the real-world relationships and system components that a student may wish to explore, pattern selection focuses attention on particular mathematical relationships and analytic methods that the dataset makes available for investigation.

Below, we demonstrate one example of pattern-driven selection that focuses on continuous measures that describe various acoustic features of songs included in the BH100 dataset. We use a heatmap of pairwise correlations between these attributes to get a quick sense of whether and what types of relationships may exist between these acoustic features. This reveals a range of correlation strengths and directions (ranging from -0.14 to 0.68), suggesting that this specific pattern-driven grouping may provide students opportunities to visualize and practice describing, comparing, and reasoning about correlation.

In [None]:
# Let's limit our dataset to continuous measures that describe
# musical features of songs, and see what patterns emerge among 
# this group.

toExplore = ['Danceability',
             'Energy',
             'Valence',
             'Loudness',
             'Tempo']

viz = bh100[toExplore].corr()

sns.heatmap(viz, annot=True)
plt.title('Correlation Matrix of Selected Variables')
plt.show() 
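
When a heatmap has many cells, it can help to list the attribute pairs from strongest to weakest relationship. The sketch below is one possible way to do that, not the only one. It uses a made-up table (`toy`) with invented values, so the correlations it prints are not those of the real `bh100` data.

```python
from itertools import combinations
import pandas

# Hypothetical values for three attributes
toy = pandas.DataFrame({
    'Energy':   [0.2, 0.4, 0.6, 0.8, 1.0],
    'Loudness': [-20.0, -15.0, -11.0, -6.0, -2.0],
    'Valence':  [0.9, 0.1, 0.5, 0.3, 0.7],
})

corr = toy.corr()  # pairwise correlations, as in the heatmap

# Look up each pair of attributes once (skipping the diagonal and duplicates)
pairs = pandas.Series({
    (a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)
})

# Sort by strength (absolute value) while keeping the sign of each correlation
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

A ranked list like this can help an educator decide which pairs to hand to students first, for example one strong and one weak relationship to compare.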

## Question-Driven Selection <a id="question"></a>

Question-driven selection involves identifying the attributes within a dataset that are most appropriate for addressing a particular question, whether that question is posed by a student or posed as a driving question within curriculum materials themselves. We suggest making 3-5 attributes available for a given question in these curricular configurations. When students are constructing their own questions, we recommend encouraging them to also brainstorm what kind of data they would need to address their questions _before_ they have access to a dataset. Once they are provided with access, we suggest asking students to make predictions and construct hypotheses about what patterns they might find.

Below, we assemble data to address the simple descriptive question, "How has the longevity of songs on the charts changed over the years?" One might ask students to quickly verbally describe or sketch their predictions. Then, after observing that the dataset includes information about the year a song is released and the number of weeks it appeared on the Billboard charts, they might be encouraged to return to their predictions and make them more concrete using the specific measures and ranges they observe.

In [None]:
# how has the longevity of songs on the charts changed over the years?

toExplore = ['Song Name','Year Released','Weeks on BH100']

bh100[toExplore]

Below, we create a scatterplot with each song arranged by year of release on the x-axis and number of weeks on the Billboard Hot 100 chart on the y-axis. 

In [None]:
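# Keep only songs that have both values, since np.polyfit cannot handle missing data
data = bh100[['Year Released', 'Weeks on BH100']].dropna()

plt.figure(figsize=(8, 6)) # Adjust figure size if needed
plt.scatter(data['Year Released'], data['Weeks on BH100'])

# Calculate the line of best fit (linear regression)
slope, intercept = np.polyfit(data['Year Released'], data['Weeks on BH100'], 1)

# Use the slope and intercept to compute the fitted value for each year
line = slope * data['Year Released'] + intercept

plt.plot(data['Year Released'], line, color='red', label='Line of Best Fit')
plt.xlabel('Year Released')
plt.ylabel('Weeks on BH100')
plt.legend() # show the label for the line of best fit
plt.show()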

Both the scatterplot and the line of best fit suggest there has been some increase in the total number of weeks a song has remained on the Billboard Hot 100 charts. However, it is difficult to tell without more analysis whether this is simply a function of increased variability, or an overall increase in song longevity on the charts. These questions can lead to additional investigations that are conducive to additional modeling methods or hypothesis testing.
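
One possible follow-up, sketched below, is to summarize longevity by decade so that the typical value (median) and the variability (standard deviation) can be compared side by side. The example uses a small made-up table (`toy`) with invented numbers, and the `Decade` column is a helper we construct for this illustration; it is not part of `bh100`.

```python
import pandas

# Hypothetical stand-in for bh100 with invented values
toy = pandas.DataFrame({
    'Year Released':  [1961, 1965, 1968, 1992, 1995, 1999, 2011, 2015, 2019],
    'Weeks on BH100': [8, 10, 12, 15, 20, 9, 5, 30, 22],
})

# Integer division by 10 drops the last digit, so 1995 becomes 1990
toy['Decade'] = (toy['Year Released'] // 10) * 10

# For each decade: the typical longevity, its spread, and the number of songs
summary = toy.groupby('Decade')['Weeks on BH100'].agg(['median', 'std', 'count'])
summary
```

If the median rises from decade to decade, that points toward an overall increase; if mostly the standard deviation grows, increased variability may be the better explanation.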

# About the Billboard Hot 100 Dataset <a id='about'></a>

This dataset includes every song that's ever appeared on the Billboard Hot 100 Charts (August 1958-May 2021). Only some of the available attributes are initially shown. Use the "attributes" tab in the "Choosy" window to select which attributes or attribute groups to show and hide.

### About the Attribute Groups
Each record includes Basic Info such as the Song Name, Performer, and the Month and Year released; Popularity measures such as the song's highest Billboard position and the number of weeks the song stayed on the Hot 100 list; and a list of the Genre(s) represented by the song. When available, each song record also includes information scraped from Spotify including the Spotify ID, URL, and popularity on the Spotify app. A number of Spotify-generated measures of musical features are described in Performance Features exploring the song's "speechiness," "liveness," and other inferred performance features; Emotion Features exploring inferred features such as the song's energy level and valence; and Sound Features such as the song's tempo, time signature, and loudness.

### About the Attributes
- Song Name
- Performer
- Year Released
- *Highest BH100 Position* was computed from the "Hot Stuff" database using the minimum position listed for each SongID.
- Spotify Popularity
- *Danceability* describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- *Energy* is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- *Loudness* of a track is measured in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- *Speechiness* detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- *Valence* is a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- *Tempo* is estimated in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

### History and Purpose
This dataset was initially imported into CODAP in Summer 2022 for a teacher workshop associated with the Writing Data Stories project (IIS-1900606). It was updated in Spring 2023, and is being used as part of the Writing Data Stories project and the City University of New York's Computing Integrated Teacher Education (CUNY CITE) program.

### Data Sources and Data Cleaning
This dataset was constructed by Sean Miller (github handle: HipsterVizData) using APIs to download Billboard Hot 100 (BH100) and Spotify (S) data. The full dataset and additional information about its original construction can be accessed at this link. Michelle Wilkerson of the WDS team merged the two tables in the dataset by mapping the BH100 Song Name attribute to the Spotify SongID, removing all Spotify records that did not have a corresponding BH100 entry but retaining BH100 songs that did not have corresponding Spotify entries. She imported attribute descriptions from the original dataset, editing a few descriptions for readability at the middle school level; consolidated music genres into 7 major genre flags while retaining the full genre list as a separate attribute; and removed several attributes for simplicity. She also grouped Spotify-generated song features into three categories (Performance, Emotion, and Sound Features) visible in the "Choosy" menu.

The development of these materials was supported by the National Science Foundation under Grant No. IIS-1900606.