# Project Title - Edit Me


## Data set selection

> In this section, you will need to provide the following information about the selected data set:
>
> - Source with a link
> - Fields
> - License

### Data set selection rationale

> Why did you select this data set?
> I selected a data set that contains every Billboard Hot 100 song that charted between 1958-08-01 and 2021-05-28. It includes chart fields such as how long each song charted and its peak position, along with the attributes Spotify associates with these songs (danceability, genre, duration, loudness, tempo, etc.). Music and its patterns of popularity interest me as a way to understand where most people fall on something as subjective as musical taste, and how society's general taste has shifted over time.

### Questions to be answered

> Using statistical analysis and visualization, what questions would you like to be able to answer about this dataset?
> This could include questions such as:
>
> - What is the relationship between X and Y variables?
> - What is the distribution of the variables?
> - What is the relationship between the variables and the target?
>   You will need to frame these questions in a way that shows value to a stakeholder (i.e., why should we know about the relationship between X and Y variables?)

How have the most popular genres changed over time, and what does this reveal about shifting listener preferences that record labels should consider for strategic releases?

What is the relationship between song audio characteristics (danceability, energy, valence, tempo) and chart trajectory (climbing vs. debuting high and dropping), and how can this inform promotional strategies?

Which performers demonstrate the greatest chart longevity, what characteristics do their songs share, and how can this help labels identify sustainable artist careers?

What is the trend in genre diversity on the charts over time, and what does this reveal about whether artists should pursue niche or mainstream positioning?

What is the distribution of weeks on chart across different time periods? Are songs charting for longer or shorter runs than in the past, and how should this inform promotional budget allocation?

Which songs have achieved the highest cumulative chart weeks across multiple instances, and what does this reveal about catalog value and re-release opportunities?

What is the relationship between audio features (danceability, energy, valence, tempo) and peak chart position that could provide data-driven guidance for production decisions?

### Visualization ideas

> Provide a few examples of what you plan to visualize to answer the questions you posed in the previous section. In this project, you will be producing 6-8 visualizations. You will also be producing an interactive chart using Plotly.
> Think about what those visualizations could be: what variables are used in the charts? What insights do you hope to gain from them?

1. To show how genres, song characteristics, and chart longevity have evolved over time, I can use line charts or stacked area charts to reveal trends and shifts in popular music.
2. Box plots or bar charts can contrast groups, such as songs with different chart trajectories or top performers versus everyone else.
3. Scatter plots or heatmaps can show relationships between musical features and chart success.
4. Histograms or horizontal bar charts will highlight patterns in chart performance and identify standout songs and artists.
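
As a rough illustration of the first idea, the sketch below shows one way the chart-longevity trend could be prepared and drawn as a line chart. It runs on a tiny made-up table, not the real data. The helper name `longevity_by_decade` is hypothetical, and the sketch assumes that `week_id` is a `month/day/year` string and that a song's largest `weeks_on_chart` value approximates its total run.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows standing in for the Billboard table (hypothetical values).
toy = pd.DataFrame({
    "week_id": ["7/17/1965", "7/24/1965", "3/5/1988",
                "3/12/1988", "3/19/1988", "6/1/2019"],
    "song_id": ["A", "A", "B", "B", "B", "C"],
    "weeks_on_chart": [1, 2, 1, 2, 3, 7],
})


def longevity_by_decade(df):
    """Median per-song chart run, grouped by the decade of the song's first chart week."""
    # Assumption: week_id strings follow month/day/year, e.g. 7/17/1965.
    weeks = pd.to_datetime(df["week_id"], format="%m/%d/%Y")
    per_song = (
        df.assign(week=weeks)
          .groupby("song_id")
          .agg(first_week=("week", "min"), total_weeks=("weeks_on_chart", "max"))
    )
    per_song["decade"] = per_song["first_week"].dt.year // 10 * 10
    return per_song.groupby("decade")["total_weeks"].median()


summary = longevity_by_decade(toy)

# One line, one point per decade.
fig, ax = plt.subplots()
summary.plot(ax=ax, marker="o")
ax.set_xlabel("Decade of chart debut")
ax.set_ylabel("Median weeks on chart")
ax.set_title("Chart longevity by decade (toy data)")
```

Grouping by song before grouping by decade keeps a long-running song from being counted once per chart week. Whether the maximum `weeks_on_chart` is the right measure for songs that re-enter the chart would need checking against the real data.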


In [None]:
# 🚀 Importing some libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Import the datasets
billboard = pd.read_csv('data/billboard.csv')
audio_features = pd.read_csv('data/audio_features.csv')

# Display basic info about each dataset
print("Billboard Dataset Shape:", billboard.shape)
print("\nBillboard Columns:")
print(billboard.columns.tolist())
print("\n" + "="*50 + "\n")

print("Audio Features Dataset Shape:", audio_features.shape)
print("\nAudio Features Columns:")
print(audio_features.columns.tolist())

# Preview first few rows
print("\n" + "="*50)
print("Billboard Data Preview:")
print(billboard.head())

print("\n" + "="*50)
print("Audio Features Preview:")
print(audio_features.head())

# Merge the datasets on song_id
print("\n" + "="*50)
print(f"Unique songs in billboard: {billboard['song_id'].nunique()}")
print(f"Unique songs in audio_features: {audio_features['song_id'].nunique()}")

df_merged = billboard.merge(
    audio_features, on='song_id', how='left', suffixes=('', '_audio'))

print(f"\nMerged dataset shape: {df_merged.shape}")
print(f"Columns in merged dataset: {len(df_merged.columns)}")

# Check for missing values after merge
print("\nMissing values in key columns:")
key_columns = ['spotify_genre', 'danceability', 'energy', 'valence', 'tempo']
print(df_merged[key_columns].isnull().sum())
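
A left merge like the one in the cell above adds chart rows if any `song_id` appears more than once in the audio-features table. The sketch below shows one way to guard against that on toy data, using the `validate` and `indicator` options of pandas `merge`. The frames and their values are made up for illustration. Keeping the first duplicate is an assumption, and the real data may not contain duplicates at all.

```python
import pandas as pd

# Toy frames with hypothetical values; column names mirror the real tables.
billboard_toy = pd.DataFrame({
    "song_id": ["A", "A", "B", "C"],
    "week_position": [10, 8, 55, 99],
})
audio_toy = pd.DataFrame({
    "song_id": ["A", "B", "B"],  # "B" is duplicated on purpose
    "danceability": [0.61, 0.45, 0.45],
})

# Keep one audio row per song so the left merge cannot multiply chart rows.
audio_unique = audio_toy.drop_duplicates(subset="song_id", keep="first")

merged = billboard_toy.merge(
    audio_unique,
    on="song_id",
    how="left",
    validate="many_to_one",  # raises pandas.errors.MergeError if right-hand keys repeat
    indicator=True,          # adds a _merge column marking matched vs. unmatched rows
)

# The row count should be unchanged, and unmatched songs are easy to count.
rows_preserved = len(merged) == len(billboard_toy)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Rows preserved: {rows_preserved}, chart rows without audio features: {unmatched}")
```

The output that follows belongs to the original cell, not to this sketch.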

Billboard Dataset Shape: (327895, 10)

Billboard Columns:
['url', 'week_id', 'week_position', 'song', 'performer', 'song_id', 'instance', 'previous_week_position', 'peak_position', 'weeks_on_chart']


Audio Features Dataset Shape: (29503, 22)

Audio Features Columns:
['song_id', 'performer', 'song', 'spotify_genre', 'spotify_track_id', 'spotify_track_preview_url', 'spotify_track_duration_ms', 'spotify_track_explicit', 'spotify_track_album', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature', 'spotify_track_popularity']

Billboard Data Preview:
                                                 url    week_id  \
0  http://www.billboard.com/charts/hot-100/1965-0...  7/17/1965   
1  http://www.billboard.com/charts/hot-100/1965-0...  7/24/1965   
2  http://www.billboard.com/charts/hot-100/1965-0...  7/31/1965   
3  http://www.billboard.com/charts/hot-100/1965-0...   8/7/1965   
4  http://www