In [None]:
import pandas as pd

# First, let's import the data and clean it.

## I'll take you through step by step. To start, let's import it!

In [None]:
Billboard_Charts = pd.read_csv('BillboardCharts.csv')

Spotify_Data = pd.read_csv('SpotifyTracks.csv')

### Let's do a quick check to make sure we properly sourced the data:

In [None]:
Billboard_Charts.head()

In [None]:
Spotify_Data.head()

### Next, let's check to see if either of our data sets have any missing data.

In [None]:
print(Billboard_Charts.isnull().sum())

In [None]:
print(Spotify_Data.isnull().sum())

### Okay! So, now we see that there are some missing values in the Spotify_Data CSV. Let's drop those!
Yes, there seem to be some missing values in the Billboard Charts data as well, but they are all in a column I plan to drop anyway!

In [None]:
Spotify_Data_Cleaned = Spotify_Data.dropna()
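
A quick aside on what `dropna()` does by default: it removes a row if *any* column in that row is missing. Here is a minimal sketch on made-up data (the `toy_tracks` rows are hypothetical and are not taken from `SpotifyTracks.csv`) that compares the default with the `subset` option, which only checks the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Spotify file
toy_tracks = pd.DataFrame({
    'song': ['Song A', 'Song B', None, 'Song D'],
    'artists': ['Artist 1', None, 'Artist 3', 'Artist 4'],
    'tempo': [120.0, 98.5, 110.2, np.nan],
})

# Default behavior: drop a row if ANY column is missing
dropped_any = toy_tracks.dropna()

# Alternative: only drop rows that are missing the columns we merge on later
dropped_subset = toy_tracks.dropna(subset=['song', 'artists'])

print(f'Original rows: {len(toy_tracks)}')
print(f'Rows left after dropna(): {len(dropped_any)}')
print(f'Rows left after dropna(subset=...): {len(dropped_subset)}')
```

If losing whole rows over a column you don't need feels wasteful, the `subset` version may be worth a try.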

### Let's make sure it dropped properly:

In [None]:
print(Spotify_Data_Cleaned.isnull().sum())

### Because both of these data sets have so many songs listed in them, let's make sure no duplicate rows have snuck their way into either data set, just in case.

In [None]:
billboard_duplicates = Billboard_Charts.duplicated()

spotify_duplicates = Spotify_Data_Cleaned.duplicated()

print('In the Billboard data, there are ' + str(billboard_duplicates.sum()) + ' duplicate rows.')

print('In the Spotify Data, there are ' + str(spotify_duplicates.sum()) + ' duplicate rows.')
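
One thing to keep in mind: `duplicated()` with no arguments only flags rows that match an earlier row in *every* column. A song that appears twice with a different chart position would not be counted. The sketch below uses made-up rows (the `toy_chart` data and its `rank` column are assumptions for illustration) to show the difference when you pass `subset`.

```python
import pandas as pd

# Hypothetical toy chart: the same song appears three times
toy_chart = pd.DataFrame({
    'song': ['Song A', 'Song A', 'Song A', 'Song B'],
    'artists': ['Artist 1', 'Artist 1', 'Artist 1', 'Artist 2'],
    'rank': [1, 1, 5, 2],
})

# Counts only rows identical to an earlier row in every column
full_row_dupes = toy_chart.duplicated().sum()

# Counts repeat appearances of the same song/artist pair
song_dupes = toy_chart.duplicated(subset=['song', 'artists']).sum()

print(f'Fully identical duplicate rows: {full_row_dupes}')
print(f'Repeated song/artist pairs: {song_dupes}')
```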

### Great! Now, here is where we drop some columns from the cleaned data. 

There are a lot of columns, and we don't need them all to get the information we are seeking!



In [None]:
Billboard_Cleaned = Billboard_Charts.drop(columns=['last-week'])

Spotify_Cleaned = Spotify_Data_Cleaned.drop(columns=['duration_ms', 'popularity', 'explicit', 'track_id', 'number', 'instrumentalness', 'time_signature', 'mode', 'loudness'])

### Let's check to make sure the correct columns are gone!

In [None]:
Billboard_Cleaned.head()

In [None]:
Spotify_Cleaned.head()

## Awesome! 
### Now, we're going to combine the datasets.

In [None]:
HitSongs = pd.merge(Billboard_Cleaned, Spotify_Cleaned, on=['song', 'artists'])
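
By default, `pd.merge` does an inner join, so only song/artist pairs found in both dataframes survive, and a song that shows up on several rows of one dataframe shows up on several rows of the result. Here is a small sketch with made-up data (`toy_billboard` and `toy_spotify` are hypothetical) that also uses `how='outer'` with `indicator=True` to see how many rows matched.

```python
import pandas as pd

# Hypothetical toy versions of the two cleaned dataframes
toy_billboard = pd.DataFrame({
    'song': ['Song A', 'Song A', 'Song B'],
    'artists': ['Artist 1', 'Artist 1', 'Artist 2'],
    'weeks-on-board': [1, 2, 7],
})
toy_spotify = pd.DataFrame({
    'song': ['Song A', 'Song C'],
    'artists': ['Artist 1', 'Artist 3'],
    'track_genre': ['rock', 'pop'],
})

# Inner join (the default): Song A appears once per matching Billboard row
inner = pd.merge(toy_billboard, toy_spotify, on=['song', 'artists'])

# Outer join with an indicator column to audit what matched and what did not
audit = pd.merge(toy_billboard, toy_spotify, on=['song', 'artists'],
                 how='outer', indicator=True)
match_counts = audit['_merge'].value_counts()

print(inner)
print(match_counts)
```

The `_merge` column says whether each row came from the left dataframe only, the right dataframe only, or both, which is a handy sanity check before trusting the merged data.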

### Let's check to see if it combined correctly!

In [None]:
HitSongs.head()

### Whoa! That is a lot of repeats. Let's try cleaning up this combined dataset.

In [None]:
HitSongs = HitSongs.dropna()

HitSongs = HitSongs.drop_duplicates(subset=['song', 'artists'])
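
A note on `drop_duplicates`: by default it keeps the *first* row it sees for each song/artist pair. If the repeated rows differ in `weeks-on-board` (I'm assuming here that they might, for example one row per chart week), the row that gets kept depends on the row order. This sketch on made-up data shows one possible alternative: sorting first so the row with the most weeks is the one that survives.

```python
import pandas as pd

# Hypothetical toy data: Song A appears with three different week counts
toy_hits = pd.DataFrame({
    'song': ['Song A', 'Song A', 'Song A', 'Song B'],
    'artists': ['Artist 1', 'Artist 1', 'Artist 1', 'Artist 2'],
    'weeks-on-board': [1, 2, 3, 7],
})

# Default: keep the first occurrence of each song/artist pair
kept_first = toy_hits.drop_duplicates(subset=['song', 'artists'])

# Alternative: sort so the largest week count comes first, then drop repeats
kept_longest = (
    toy_hits.sort_values('weeks-on-board', ascending=False)
            .drop_duplicates(subset=['song', 'artists'])
)

print(kept_first)
print(kept_longest)
```

Which version is right depends on the question being asked, so treat this as an option rather than a correction.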

In [None]:
HitSongs.head()

I'm going to turn this data into its own CSV file!

In [None]:
HitSongs.to_csv('HitSongs.csv', index = False)

## Nice!
### Now let's find some information with this new dataframe!

First, I'm interested in finding out what genre is the most popular for these songs! We're going to find this by looking at the mode!



In [None]:
genre_mode = HitSongs['track_genre'].mode()
print('The most popular genre for these songs is ' + genre_mode[0] + '.')
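
Since `mode()` returns a Series, `genre_mode[0]` only shows the first entry. If two genres tie for first place, `mode()` returns both, sorted. A small sketch with made-up genres (the `toy_genres` values are hypothetical) shows the tie case, plus `value_counts()` as a way to see the full ranking.

```python
import pandas as pd

# Hypothetical toy genre column with a tie between 'grunge' and 'pop'
toy_genres = pd.Series(['grunge', 'pop', 'grunge', 'rock', 'pop'],
                       name='track_genre')

# mode() returns every value tied for most frequent, in sorted order
modes = toy_genres.mode()

# value_counts() shows how many times each genre appears
counts = toy_genres.value_counts()

print('Most common genre(s): ' + ', '.join(modes))
print(counts)
```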

Out of curiosity, I just want a sneak peek at which songs this dataset classifies as grunge!

In [None]:
grunge = HitSongs[HitSongs['track_genre'].str.contains('grunge', case=False, na=False)]

grunge[['song', 'artists', 'track_genre']].head()



I wasn't expecting grunge to be the most popular of those genres! How exciting!

Now, I want to see, on average, how many weeks these songs stayed on the Billboard Top 100 lists! We will do this using mean!

In [None]:
weeks_mean = HitSongs['weeks-on-board'].mean()

print(f'On average, songs would stay on the Billboard Top 100 list for {weeks_mean:.2f} weeks.')

Okay, so 17 to 18 weeks on average! That is a lot longer than I expected. 

What songs do you think were on there that long? Let's take a little peek!

In [None]:
# Use a new name so the weeks_mean value calculated above is not overwritten
songs_near_mean = HitSongs[(HitSongs['weeks-on-board'] >= 17) & (HitSongs['weeks-on-board'] <= 18)]

songs_near_mean[['song', 'artists', 'weeks-on-board', 'track_genre']].head()

Now, let's do a few quick calculations. I want to find the average and the most common value for the following columns: danceability, energy, speechiness, acousticness, valence, and tempo. I chose these columns because I believe they are the ones most closely related to what makes a song popular.

First, let's look at the danceability of the songs. 

Danceability is defined as describing how well the track is suited for dancing (this is based on tempo, how stable the rhythm is, beat strength, and overall regularity of the beat). This is based on a 0.0 to 1.0 scale: 0.0 being the least danceable and 1.0 being the most danceable.

In [None]:
# danceability data

# mean
dance_mean = HitSongs['danceability'].mean()

print(f"On average, the songs' danceability is {dance_mean:.3f}.")

# mode
dance_mode = HitSongs['danceability'].mode()

print(f'The most common danceability rating for the songs is {dance_mode[0]:.3f}.')

Next, let's look at how energetic the songs are!

In this case, energy is a perceptual measure of how intense and active the music is. This is rated on a scale of 0.0, being the least energetic, to 1.0, being the most energetic.

For example, think of a relaxing Debussy piece being near 0.0, and rave music (which is made to make you want to dance and move) being near 1.0.

In [None]:
# energy data

# mean
energy_mean = HitSongs['energy'].mean()

print(f"On average, the songs' energy is {energy_mean:.3f}.")

# mode
energy_mode = HitSongs['energy'].mode()

print(f'The most common energy rating for the songs is {energy_mode[0]:.3f}.')

Time to look at how "speechy" a song is!

Speechiness detects the presence of spoken words in a track. The more speech-like the track is (think audiobooks, podcasts, talk shows, etc.), the closer it is to 1.0.

According to the creator of the Spotify data, "Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks."

In [None]:
# speechiness data

# mean
speechy_mean = HitSongs['speechiness'].mean()

print(f"On average, the songs' speechiness is {speechy_mean:.3f}.")

# mode
speechy_mode = HitSongs['speechiness'].mode()

print(f'The most common speechiness rating for the songs is {speechy_mode[0]:.3f}.')

Next, let's examine how acoustic the songs are!

Acousticness is just that: how acoustic the song is. This is on a scale of 0.0 to 1.0, with 1.0 being high confidence that the song is acoustic!

In [None]:
# acousticness data

# mean
acoustics_mean = HitSongs['acousticness'].mean()

print(f"On average, the songs' acousticness is {acoustics_mean:.3f}.")

# mode
acoustics_mode = HitSongs['acousticness'].mode()

print(f'The most common acousticness rating for the songs is {acoustics_mode[0]:.3f}.')

Now we look at the valence of the music.

Valence is the "mood" of the tracks, for lack of a better word. Songs that are closer to 0.0 are low valence (i.e., sad, angry), while songs that are closer to 1.0 are high valence (i.e., happy, cheerful).

In [None]:
# valence data

# mean
valence_mean = HitSongs['valence'].mean()

print(f"On average, the songs' valence is {valence_mean:.3f}.")

# mode
valence_mode = HitSongs['valence'].mode()

print(f'The most common valence rating for the songs is {valence_mode[0]:.3f}.')

Finally, time to look at the tempo of the data. 

Tempo is measured in beats per minute (BPM). In music, tempo is the speed of a piece. It's literally derived from how many beats you can measure in a minute!

In [None]:
# tempo data

# mean
tempo_mean = HitSongs['tempo'].mean()

print(f"On average, the songs' tempo is {tempo_mean:.3f} BPM.")

# mode
tempo_mode = HitSongs['tempo'].mode()

print(f'The most common tempo for the songs is {tempo_mode[0]:.3f} BPM.')
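
The six cells above all follow the same pattern, so here is a minimal sketch of how the same numbers could be collected in one table. The `toy_hits` values are made up for illustration, and `feature_columns` and `summary` are names I'm introducing here. One caveat: for continuous columns like these, a mode can come from just a handful of repeated values, so it is worth reading it with some caution.

```python
import pandas as pd

# The six columns of interest
feature_columns = ['danceability', 'energy', 'speechiness',
                   'acousticness', 'valence', 'tempo']

# Hypothetical toy data standing in for HitSongs
toy_hits = pd.DataFrame({
    'danceability': [0.5, 0.6, 0.6, 0.7],
    'energy': [0.8, 0.8, 0.9, 0.7],
    'speechiness': [0.03, 0.04, 0.03, 0.10],
    'acousticness': [0.1, 0.2, 0.1, 0.4],
    'valence': [0.4, 0.5, 0.5, 0.8],
    'tempo': [100.0, 120.0, 120.0, 140.0],
})

# mean() gives one value per column; mode() gives a table of modes,
# so .iloc[0] picks the first (smallest) mode for each column
summary = pd.DataFrame({
    'mean': toy_hits[feature_columns].mean(),
    'mode': toy_hits[feature_columns].mode().iloc[0],
})

print(summary.round(3))
```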

Now that's a lot of interesting data! I'm sure we can get some good information from that.

# Next, I would like to make a few plots just to look at this data a little deeper.
## To see the plots, please go to the HitSongs.py file and run the file! It will populate all the plots one at a time.
### After you have looked at the plots, return to this file to see my findings!

# Welcome back!
## Let's talk about those plots.

As a quick reminder, the 6 columns of interest for me were danceability, energy, valence, speechiness, acousticness, and tempo.

For the first plot, which compares energy and danceability, we see a large cluster towards the top of the energy scale and between 0.4 and 0.6 on the danceability scale. It looks like songs that are higher energy, but only moderately danceable, are more popular. This led me to wonder how these 2 columns correlate with valence, which led me to make the next plot!

In the next plot, I made a trendline plot of the danceability, energy, and valence columns to see if there were any correlations between the 3 columns. While energy and danceability did not cross each other, valence crossed both of them fairly close together, within 0.1 of each other on the scale. This further leads me to believe that there is a strong correlation between these 3 things and what makes a song popular. However, there are still 3 more plots I made.
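
Trendlines give a visual impression of how the columns relate, and one way to put a number on it is a Pearson correlation matrix from `DataFrame.corr()`. This is only a sketch on made-up values (`toy_features` is hypothetical and is not the real data), and it is a different technique from the trendline plot in `HitSongs.py`.

```python
import pandas as pd

# Hypothetical toy values for the three columns
toy_features = pd.DataFrame({
    'danceability': [0.40, 0.50, 0.60, 0.70, 0.80],
    'energy': [0.90, 0.70, 0.80, 0.60, 0.75],
    'valence': [0.30, 0.45, 0.55, 0.70, 0.85],
})

# Pearson correlation between every pair of columns (values from -1 to 1)
corr_matrix = toy_features.corr()

print(corr_matrix.round(2))
```

Values near 1 or -1 point to a strong linear relationship, while values near 0 point to little or none.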

The next 2 plots are the boxplots comparing all 6 of the columns I was interested in finding information on. The first boxplot contains the information for danceability, energy, speechiness, acousticness, and valence, and the second boxplot contains only tempo. The first thing I noticed was just how many outliers there were for speechiness. Statistically speaking, a large number of outliers usually has to do with how much data is being plotted and how far some of it deviates from the general pattern, which can skew the analysis done by the plot. Because of the number of outliers, I decided to do a histogram for the final plot, but we'll get to that in a minute. Looking at the rest of the data on the boxplots, danceability, energy, and valence do have a lot of overlap with each other, and the upper quartile value of the acousticness plot seems to just overlap with those. The tempo boxplot seems to have 100 bpm to 140 bpm as a good tempo range, which seems about right to me as a musician.
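
For anyone curious how a boxplot decides what counts as an outlier, a common rule is to flag anything more than 1.5 times the interquartile range (IQR) beyond the quartiles. Here is a minimal sketch of that rule on made-up speechiness values (`toy_speechiness` is hypothetical, and I'm assuming the plots in `HitSongs.py` use this common default).

```python
import pandas as pd

# Hypothetical toy speechiness values: mostly low, with two high ones
toy_speechiness = pd.Series([0.03, 0.04, 0.04, 0.05, 0.05,
                             0.06, 0.07, 0.30, 0.45])

# Quartiles and interquartile range
q1 = toy_speechiness.quantile(0.25)
q3 = toy_speechiness.quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is drawn as an outlier point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = toy_speechiness[(toy_speechiness < lower_fence) |
                           (toy_speechiness > upper_fence)]

print(f'Fences: {lower_fence:.3f} to {upper_fence:.3f}')
print(f'Outliers: {outliers.tolist()}')
```

When most values are bunched near zero, like speechiness, the IQR is tiny, so even moderately "speechy" songs land outside the fences.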

The final plot shows the histograms for all 6 of the columns in question. For danceability, a good range to focus on is 0.5 to 0.7. For energy, since there seems to be a preference for higher-energy songs, a good range to look at seems to be 0.7 to 0.9. For speechiness, it seems that it drops off at 0.45, so I'm just going to go ahead and call the range for that 0.0 to 0.45. For acousticness, a good range seems to be 0.0 to 0.4. Valence has an interesting histogram as it seems a little bumpy, but a good range seems to be 0.4 to 0.8. For tempo, like in the boxplot, a good range seems to be 100 bpm to 140 bpm.
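
I picked these ranges by eye from the histograms. As a cross-check, the bin counts behind a histogram can be computed directly with `numpy.histogram`. This sketch uses made-up danceability values (`toy_danceability` is hypothetical) and 10 equal-width bins, which is my assumption and may not match the bins in `HitSongs.py`.

```python
import numpy as np

# Hypothetical toy danceability values
toy_danceability = np.array([0.30, 0.45, 0.55, 0.58, 0.61,
                             0.62, 0.65, 0.66, 0.68, 0.72])

# Count how many values fall in each of 10 equal-width bins from 0.0 to 1.0
counts, edges = np.histogram(toy_danceability, bins=10, range=(0.0, 1.0))

# Find the tallest bin and the range it covers
tallest = int(np.argmax(counts))
low, high = edges[tallest], edges[tallest + 1]

print(f'Bin counts: {counts.tolist()}')
print(f'Tallest bin covers {low:.1f} to {high:.1f} with {counts[tallest]} songs')
```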

# "What do these ranges even mean?" 
## "What makes a song a hit?" 

You might be asking. In my opinion, it seems like songs that you can dance to and have a good time to, but aren't too serious about the lyrics are a good mix. These are songs that can range from being a little melancholic to songs that make you feel a little happier, but are a little lighter on the instruments, with a good, strong tempo that allows you to dance to it.

Mind you, these things are pretty subjective. What is a melancholic song to me might be a terribly sad song to you, or a happy song to someone else. That's the beauty of music -- it makes us all feel something.

But, since we have those ranges I talked about above, why don't we try seeing if we can pull up any songs that are within all those ranges above?

In [None]:
hitranges = HitSongs[
    (HitSongs['danceability'] >= 0.5) & (HitSongs['danceability'] <= 0.7) &
    (HitSongs['energy'] >= 0.7) & (HitSongs['energy'] <= 0.9) &
    (HitSongs['speechiness'] >= 0.0) & (HitSongs['speechiness'] <= 0.45) &
    (HitSongs['acousticness'] >= 0.0) & (HitSongs['acousticness'] <= 0.4) &
    (HitSongs['valence'] >= 0.4) & (HitSongs['valence'] <= 0.8) &
    (HitSongs['tempo'] >= 100) & (HitSongs['tempo'] <= 140)
    ]

hitranges[['song', 'artists', 'danceability', 'energy', 'speechiness', 'acousticness', 'valence', 'tempo', 'track_genre']].head()
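
The same filter can also be written with `Series.between`, which includes both endpoints by default, just like the `>=` and `<=` pairs above. Here is a minimal sketch on made-up rows (`toy_hits` is hypothetical, and `hit_ranges` is a name I'm introducing to hold the ranges from the histograms).

```python
import pandas as pd

# The ranges chosen from the histograms, as (low, high) pairs
hit_ranges = {
    'danceability': (0.5, 0.7),
    'energy': (0.7, 0.9),
    'speechiness': (0.0, 0.45),
    'acousticness': (0.0, 0.4),
    'valence': (0.4, 0.8),
    'tempo': (100, 140),
}

# Hypothetical toy rows: only Song A fits every range
toy_hits = pd.DataFrame({
    'song': ['Song A', 'Song B', 'Song C'],
    'danceability': [0.60, 0.65, 0.55],
    'energy': [0.80, 0.50, 0.85],      # Song B is too low on energy
    'speechiness': [0.05, 0.04, 0.10],
    'acousticness': [0.10, 0.20, 0.30],
    'valence': [0.60, 0.50, 0.70],
    'tempo': [120.0, 110.0, 160.0],    # Song C is too fast
})

# Start with every row allowed, then narrow down one column at a time
mask = pd.Series(True, index=toy_hits.index)
for column, (low, high) in hit_ranges.items():
    mask &= toy_hits[column].between(low, high)

in_range = toy_hits[mask]
print(in_range[['song'] + list(hit_ranges)])
```

Keeping the ranges in one dictionary makes them easier to tweak later without retyping the whole filter.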

# Honestly, I think these songs would make a pretty cool playlist.
## Wait, I already did. The link to the Spotify playlist will be towards the end of the README.md file!

Out of curiosity, I wonder just how many songs are in the dataframe in total, and how many of these fit in these ranges.

In [None]:
num_rows_whole = HitSongs.shape[0]
num_rows_ranges = hitranges.shape[0]
percent = num_rows_ranges/num_rows_whole

print(f'In total, there are {num_rows_whole} songs in the dataframe. Of these, only {num_rows_ranges} fall into the specified ranges.')
print(f'That means only {(percent * 100):.3f}% of songs in this dataframe are within these ranges.')

# Thank you for coming on this journey with me.

Feel free to reach out if you have any questions!