## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before the deadline. This cell is for your reference only and is not needed in your report.

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint, they will earn these points back.


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback



## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Hypothesis


Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Data

### Data overview

**Dataset Name:** Billboard Hot 100 Year-End Singles (2015–2025)
> This is the primary dataset we will analyze to address our research question. We built it in several steps. First, we took a divide-and-conquer approach to minimize errors and keep the data organized: each member compiled one of 5 CSV files, each covering a specific year range. For instance, the first CSV covers 2015–2017, the second covers 2018–2019, and so on. Because our research question involves genre, release season, song tempo, and music video presence, we drew on multiple sources for these variables: Wikipedia for the Billboard charts, release dates, and genres; YouTube for official music videos; and SongBPM for each song's tempo. We also cross-referenced this information with other credible sources to check its accuracy. Every CSV has the same structure and was built with the same collection procedure to prevent inconsistencies between members. After completing all 5 CSV files, we merged them into one master dataset, which we will use for the rest of the project.
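
The merge of identically structured CSV files described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the team's actual code: the `merge_parts` helper is hypothetical, the sample rows are made up, and in-memory strings stand in for the five CSV files so the example is self-contained. Only the eight column names come from the dataset description.

```python
import io
import pandas as pd

# Column names taken from the dataset description.
EXPECTED_COLUMNS = ["Rank", "Title", "Artist", "Year", "BPM",
                    "Genre", "Release Date", "Music Video"]

def merge_parts(sources):
    """Concatenate identically structured CSV parts into one DataFrame.

    `sources` is an iterable of paths or file-like objects accepted by
    pandas.read_csv. Raises ValueError if a part's columns differ from
    EXPECTED_COLUMNS or if a (Year, Rank) pair appears more than once.
    """
    frames = []
    for i, src in enumerate(sources):
        part = pd.read_csv(src)
        if list(part.columns) != EXPECTED_COLUMNS:
            raise ValueError(f"part {i} has unexpected columns: {list(part.columns)}")
        frames.append(part)
    merged = pd.concat(frames, ignore_index=True)
    # Assumption: each (Year, Rank) pair occurs once in a year-end chart.
    if merged.duplicated(subset=["Year", "Rank"]).any():
        raise ValueError("duplicate (Year, Rank) rows found after merging")
    return merged

# Made-up rows standing in for two of the five CSV files.
header = ",".join(EXPECTED_COLUMNS)
part_a = io.StringIO(header + "\n1,Song A,Artist A,2015,120,Pop,2015-01-10,1\n")
part_b = io.StringIO(header + "\n1,Song B,Artist B,2018,95,Hip-Hop,2018-06-01,0\n")

master = merge_parts([part_a, part_b])
print(master.shape)  # (2, 8)
```

Checking the column list before concatenating catches a part whose structure has drifted, which `pd.concat` would otherwise silently fill with missing values.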
  
* **Link to the dataset:** [insert Google Sheet link]
* **Number of observations:** 1,100
* **Number of variables:** 8 (Rank, Title, Artist, Year, BPM, Genre, Release Date, Music Video)
* **Most relevant variables:**
    
    * **BPM:** A song’s tempo in beats per minute (a faster tempo indicates a more energetic song, while a slower tempo indicates a calmer one). In this project, a song with a tempo of 120 BPM or higher is considered upbeat, high-energy music.
    * **Genre:** Classifies songs by shared musical characteristics (such as instruments, rhythm, and mood). This dataset assigns each song to a broad umbrella genre; Pop, Hip-Hop, R&B, and Country are the dominant ones. This project focuses on charting songs in the Pop genre.
    * **Release Date:** The date a song was first released. Since this project focuses on the season in which a song is released, we will convert each date to a season (spring, summer, fall, or winter).
    * **Music Video:** Indicates whether a song has an associated official music video on YouTube (1 for Yes, 0 for No).
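
As a minimal sketch of how the variables above might be derived, the following maps a release date to a season and applies the 120 BPM cutoff. The write-up does not say how seasons are defined, so the month-based (meteorological, Northern Hemisphere) mapping here is an assumption, and the function names are hypothetical.

```python
from datetime import date

# Assumption: meteorological seasons for the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Cutoff stated in the BPM variable description.
UPBEAT_BPM_THRESHOLD = 120

def season_of(release_date: date) -> str:
    """Return the season name for a release date, based on its month."""
    return SEASON_BY_MONTH[release_date.month]

def is_upbeat(bpm: float) -> bool:
    """Return True when the tempo is at or above the upbeat cutoff."""
    return bpm >= UPBEAT_BPM_THRESHOLD

print(season_of(date(2019, 7, 4)))        # Summer
print(is_upbeat(120), is_upbeat(119.5))   # True False
```

With the master dataset loaded as a pandas DataFrame `df`, the same mapping could be applied with `pd.to_datetime(df["Release Date"]).dt.month.map(SEASON_BY_MONTH)`, and the Pop subset selected with `df[df["Genre"] == "Pop"]`.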

- **Shortcomings:** Genre is a nuanced topic: a song can fall into numerous subcategories or blend several genres. Our dataset therefore simplifies each song’s characteristics by assigning it to a broad umbrella genre. Regarding release dates, some songs have newer versions or were re-released on more recent albums, so the dataset records the release date of the version, and the album, that actually charted. For instance, if the older version of a song charted rather than its updated one, we list the older version’s release date. As with genres, what counts as a music video for a song is increasingly varied. To simplify data collection, we did not count lyric videos and focused on standard, traditional music videos. We also checked whether the video title explicitly includes the phrase “Music Video” to help decide whether to count the song as having a music video.
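
The title check mentioned above could be approximated with a small heuristic like the one below. This is only a sketch: the team describes the title as one aid to a manual decision, so a title-only rule is an assumption, the function name is hypothetical, and the example titles are made up.

```python
import re

# Case-insensitive match for the phrase "music video" in a video title.
MUSIC_VIDEO_PATTERN = re.compile(r"\bmusic\s+video\b", re.IGNORECASE)
# Lyric videos are excluded, following the shortcomings described above.
LYRIC_VIDEO_PATTERN = re.compile(r"\blyrics?\s+video\b", re.IGNORECASE)

def music_video_flag(video_title: str) -> int:
    """Return 1 if the title looks like a music video, else 0."""
    if LYRIC_VIDEO_PATTERN.search(video_title):
        return 0
    return 1 if MUSIC_VIDEO_PATTERN.search(video_title) else 0

print(music_video_flag("Example Song (Official Music Video)"))  # 1
print(music_video_flag("Example Song (Official Lyric Video)"))  # 0
print(music_video_flag("Example Song (Official Audio)"))        # 0
```

A title such as "Example Song (Official Video)" would be coded 0 by this rule, so a title-only check would undercount and still needs manual review.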

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
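
As a rough sketch of the size, missingness, and outlier steps listed above, the following uses a few made-up rows with column names borrowed from the Billboard dataset. The 1.5 × IQR rule for flagging outliers and the choice to drop rows with a missing BPM are assumptions for illustration, not a prescribed method.

```python
import pandas as pd

# Made-up rows; the real data would be loaded from data/00-raw/.
df = pd.DataFrame({
    "Title": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "BPM": [100.0, 110.0, 120.0, None, 400.0],
    "Genre": ["Pop", "Pop", "R&B", "Pop", "Pop"],
})

# Size of the dataset as (rows, columns).
print(df.shape)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Flag BPM values outside 1.5 * IQR of the quartiles (missing values
# compare as False, so they are not flagged here).
q1, q3 = df["BPM"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["bpm_outlier"] = (df["BPM"] < low) | (df["BPM"] > high)
print(df[df["bpm_outlier"]])

# One possible cleaning choice: drop rows that lack a BPM value.
clean = df.dropna(subset=["BPM"])
print(len(clean))
```

Flagging outliers in a separate column, rather than deleting them immediately, keeps suspicious entries visible so each one can be checked against its source before deciding what to do with it.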


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3 | 5:15 PM | **All:** Have individual tasks for Project Proposal completed and ready for group review. | **All:** Discuss/Complete Team Expectations. Review and push Project Proposal components. | 
| 2/10 | 5:15 PM | **All:** Read checkpoint 1 requirements and have the required datasets created as CSV files (5 total) and ready for wrangling. | **Krisha:** Walk through setting up the Anaconda Prompt/JupyterLab environment, assign next steps for data wrangling, and update timeline. <br> **Charlie:** Import necessary packages to repo. <br> **Youjia:** Upload raw CSV files to data/00-raw/. <br> **Cindy:** Update checkpoint file with overlapping sections from proposal. | 
| 2/17 | 5:15 PM | **Cindy:** Merge csv files into master dataset. Complete Dataset Part A writing. <br> **Andrea:** Remove non-pop genres and complete Dataset Part B writing <br> **Krisha:** Add ‘Season’ column based on release date. Complete Data Overview written part and update Team Timeline. <br> **Charlie:** Check for outliers/formatting errors. <br> **Youjia:** Upload cleaned master dataset to data/02-processed/. | **Krisha:** Ensure all datasets are wrangled/pushed and conduct final rubric check. Assign group members to complete each specific part of the analysis. <br> **All:** Discuss possible analytical/EDA approaches. |
| 2/24 | 5:15 PM | **All:** Make progress on assigned analysis/EDA. | **All:** Hold a group progress check for individual analysis/EDA. Discuss more visualization and analysis strategies. |
| 3/3 | 5:15 PM | **All:** Complete analysis of datasets. Come prepared to discuss individual findings and results of data analysis. | **All:** Review and submit Checkpoint 2. <br> **Andrea, Cindy, Krisha:** Edit analysis and results of data. <br> **Youjia, Charlie:** Draft conclusion and discussion portion of project. |
| 3/10 | 5:15 PM | **All:** Have drafts of the written portion of the project completed. Read over video requirements. | **All:** Revise written project components. Discuss, plan, and assign roles for Final Video. Share concerns/final plans for submission. |
| 3/17 | 5:15 PM | **All:** Have Final Video ready. Come prepared to review/edit project components for submission. | **All:** Review and make final edits on all project components. Submit Final Project, Final Video, & Team Eval. Surveys. |