In [None]:
import pandas as pd

**Cleaning 1: Columns- Converting and Sorting.**

1. The data is cleaned by reading only the columns needed. The 'track_name' and 'artist(s)_name' columns identify the song and artist, the 'streams' column measures the popularity of the song, and the 'bpm', 'key', and 'mode' columns describe the song's musical attributes.

2. Convert the 'streams' column to numeric using pd.to_numeric() with errors='coerce', which turns values that cannot be parsed into NaN.

3. Drop rows with NaN using .dropna(). Called without arguments, this drops rows with a missing value in any column, not only 'streams'.

4. Convert the 'streams' column to integers using .astype(int).

5. Sort the DataFrame by the 'streams' column in descending order using .sort_values() so that the most popular songs are at the top of the DataFrame.
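
To see what steps 2 to 5 do to a bad value, here is a minimal sketch on made-up data. The toy DataFrame and its values are assumptions for illustration only and are not taken from spotify-2023.csv.

```python
import pandas as pd

# Hypothetical toy data: one 'streams' value is not a number.
toy = pd.DataFrame({
    'track_name': ['A', 'B', 'C'],
    'streams': ['100', 'not a number', '250'],
})

# Step 2: strings that cannot be parsed become NaN instead of raising an error.
toy['streams'] = pd.to_numeric(toy['streams'], errors='coerce')

# Step 3: drop the row that now holds NaN.
toy = toy.dropna()

# Step 4: with no NaN left, the column can be cast to int.
toy['streams'] = toy['streams'].astype(int)

# Step 5: highest stream count first.
toy_desc = toy.sort_values(by='streams', ascending=False)
print(toy_desc)
```

Row 'B' disappears after step 3, and row 'C' ends up first because it has the larger stream count.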

In [None]:
columns_to_read = ['track_name', 'artist(s)_name', 'streams', 'bpm', 'key', 'mode']
df = pd.read_csv('spotify-2023.csv', usecols=columns_to_read, encoding='latin-1')

# Convert 'streams' column to numeric, forcing errors to NaN
df['streams'] = pd.to_numeric(df['streams'], errors='coerce')

# Drop rows with NaN in any column (dropna() checks every column by default)
df = df.dropna()

# Convert 'streams' column to integers
df['streams'] = df['streams'].astype(int)

# Sort the DataFrame by 'streams' column in descending order
df_desc = df.sort_values(by='streams', ascending=False)

print(df_desc)

**Interpretation/Analysis 1: Most Common Attributes.**

Rather than tracking the songs based on popularity alone, we will also take a look at each song's musical attributes.

With the initial DataFrame cleaned, we want to find the most common musical attributes for the entire DataFrame.

We use .mode() to find the most common bpm, key, and mode across all songs. Because .mode() returns every value tied for most common, we take the first one with [0]. They are:

BPM = 120, KEY = C#, MODE = Major
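
The [0] matters when there is a tie. A minimal sketch, assuming a made-up BPM column (not the real data) in which two values are equally common:

```python
import pandas as pd

# Hypothetical toy column in which 90 and 120 are tied for most common.
bpm = pd.Series([120, 90, 120, 90, 100])

modes = bpm.mode()       # Series holding every tied mode, sorted ascending
print(modes.tolist())    # [90, 120]
print(modes[0])          # 90, the smallest of the tied values
```

So .mode()[0] always yields a single value, but with a tie it silently picks the smallest one.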


In [None]:
common_bpm = df_desc['bpm'].mode()[0]
common_key = df_desc['key'].mode()[0]
common_mode = df_desc['mode'].mode()[0]

print(common_bpm, 'is the most common BPM in the entire DataFrame.')
print(common_key, 'is the most common KEY in the entire DataFrame.')
print(common_mode, 'is the most common MODE in the entire DataFrame.')

**Interpretation/Analysis 2: Statistics of Attributes.**

Now, we will calculate how often the most common attributes occur in the entire DataFrame. This takes 4 steps.

1. Count how many times an attribute occurs in the column using .value_counts().

2. Count the total number of non-null entries in the column using .count().

3. Divide the attribute count by the total count, then multiply by 100 to get the percentage.

4. Round the percentage to 2 decimal places using the round() function.
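
The four steps can be sketched on a small made-up column. The `keys` Series below is an assumption for illustration, with None standing in for a missing key. The last two lines show that value_counts(normalize=True) gives the same share directly, which may be a convenient cross-check.

```python
import pandas as pd

# Hypothetical toy column; None stands in for a missing key.
keys = pd.Series(['C#', 'G', 'C#', 'D', None])

count = keys.value_counts()['C#']            # step 1: 2 occurrences
total = keys.count()                         # step 2: 4 non-null entries
percentage = round(count / total * 100, 2)   # steps 3 and 4
print(count, total, percentage)              # 2 4 50.0

# Cross-check: normalize=True returns fractions of the non-null entries.
share = keys.value_counts(normalize=True)['C#'] * 100
print(round(share, 2))                       # 50.0
```

Both .count() and .value_counts() ignore missing values by default, so the percentage is relative to the non-null entries only.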

In [None]:
#BPM Statistics
count_bpm = df_desc['bpm'].value_counts()[common_bpm]
total_bpm = df_desc['bpm'].count()
percentage_bpm = (count_bpm / total_bpm) * 100
percentage_bpm = round(percentage_bpm, 2)

#KEY Statistics
count_key = df_desc['key'].value_counts()[common_key]
total_key = df_desc['key'].count()
percentage_key = (count_key / total_key) * 100
percentage_key = round(percentage_key, 2)

#MODE Statistics
count_mode = df_desc['mode'].value_counts()[common_mode]
total_mode = df_desc['mode'].count()
percentage_mode = (count_mode / total_mode) * 100
percentage_mode = round(percentage_mode, 2)


print(common_bpm, 'is the most common BPM of all the songs in the DataFrame. It occurs', count_bpm, 'times.', 'This is', percentage_bpm, 'percent of the entire DataFrame.')
print(common_key, 'is the most common KEY of all the songs in the DataFrame. It occurs', count_key, 'times.', 'This is', percentage_key, 'percent of the entire DataFrame.')
print(common_mode, 'is the most common MODE of all the songs in the DataFrame. It occurs', count_mode, 'times.', 'This is', percentage_mode, 'percent of the entire DataFrame.')

**Now let's take a look at the top 10 songs of the DataFrame**

In [None]:
df_top10 = df_desc.head(10)

print(df_top10)

**Interpretation/Analysis 3: Statistics of Attributes in Top 10 songs.**

Now, we will calculate how often the most common attributes occur in only the top 10 songs.

The bracket lookup used in Interpretation/Analysis 2 works only when the value is present in the column; otherwise it raises a KeyError. That is the case here for BPM, because the most common BPM does not appear in the top 10, so we add some code to prevent the error.

1. Count how many times an attribute occurs in the column **using .get().** Called as .get(value, 0), it returns 0 if the value you are trying to find is not present (which is the case with BPM).

2. Count the total number of non-null entries in the column using .count().

3. Divide the attribute count by the total count, then multiply by 100 to get the percentage.

4. Round the percentage to 2 decimal places using the round() function.
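
The difference between the two lookups can be sketched on a made-up column. The `top_bpm` values below are assumptions for illustration, not the actual top 10.

```python
import pandas as pd

# Hypothetical toy column that contains no 120 BPM entry.
top_bpm = pd.Series([171, 96, 90, 171])
counts = top_bpm.value_counts()

try:
    counts[120]                  # bracket lookup of a label that is absent
except KeyError:
    print('counts[120] raises KeyError because 120 is not in the index')

print(counts.get(120, 0))        # 0, the default we supplied
print(counts.get(171, 0))        # 2, the real count
```

With a count of 0, the percentage calculation in steps 2 to 4 still runs and simply yields 0.0.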



In [164]:
#BPM Statistics in the Top 10 songs
count_bpm_top10 = df_top10['bpm'].value_counts().get(common_bpm, 0) #Get count, or default to 0 if necessary
total_bpm_top10 = df_top10['bpm'].count()
percentage_bpm_top10 = (count_bpm_top10 / total_bpm_top10) * 100
percentage_bpm_top10 = round(percentage_bpm_top10, 2)

#Key Statistics in the Top 10 songs
count_key_top10 = df_top10['key'].value_counts().get(common_key, 0) #Get count, or default to 0 if necessary
total_key_top10 = df_top10['key'].count()
percentage_key_top10 = (count_key_top10 / total_key_top10) * 100
percentage_key_top10 = round(percentage_key_top10, 2)

#Mode Statistics in the Top 10 songs
count_mode_top10 = df_top10['mode'].value_counts().get(common_mode, 0) #Get count, or default to 0 if necessary
total_mode_top10 = df_top10['mode'].count()
percentage_mode_top10 = (count_mode_top10 / total_mode_top10) * 100
percentage_mode_top10 = round(percentage_mode_top10, 2)

print(common_bpm, 'BPM occurs', count_bpm_top10, 'times in the top-ten songs. This is', percentage_bpm_top10, 'percent of the top-ten songs, as opposed to', percentage_bpm, 'percent of the entire DataFrame.')
print(common_key, 'occurs', count_key_top10, 'times in the top-ten songs. This is', percentage_key_top10, 'percent of the top-ten songs, as opposed to', percentage_key, 'percent of the entire DataFrame.')
print(common_mode, 'occurs', count_mode_top10, 'times in the top-ten songs. This is', percentage_mode_top10, 'percent of the top-ten songs, as opposed to', percentage_mode, 'percent of the entire DataFrame.')


120 BPM occurs 0 times in the top-ten songs. This is 0.0 percent of the top-ten songs, as opposed to 3.97 percent of the entire DataFrame.
C# occurs 5 times in the top-ten songs. This is 50.0 percent of the top-ten songs, as opposed to 14.0 percent of the entire DataFrame.
Major occurs 7 times in the top-ten songs. This is 70.0 percent of the top-ten songs, as opposed to 55.31 percent of the entire DataFrame.
