<a href="https://colab.research.google.com/github/Jlok17/2022MSDS/blob/main/Final_Project_602.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Abstract:**
This project examines the relationship between musical attributes and success on streaming platforms, with the aim of identifying what drives a song's popularity. The dataset includes metrics such as danceability, energy, and acousticness, along with artist details. The analysis asks whether distinct musical traits correlate with higher stream counts and chart performance.

The study uses exploratory data analysis to examine distributions, trends, and interrelationships within the dataset. The motivation is to understand how inherent musical characteristics influence a song's reception by its audience. The findings are illustrated with plots and other visualizations.

The data preparation phase involved data wrangling to handle missing values and to format the data for analysis. The analysis then applies statistical measures and a machine learning technique to draw out further insights. Initial results suggest correlations between musical traits, such as energy and danceability, and a song's placement.

In summary, this study highlights the role of certain musical attributes in shaping a song's popularity. To increase discoverability and engagement, it includes a basic recommendation system built on these musical characteristics. Further research is needed to validate these observations on a more diverse dataset and to broaden understanding of the dynamics of today's music consumption.

### **Introduction:**

#### Research Question:

What are the common trends and characteristics of each genre of music? Using this
information, can we recommend a song from a different genre that is similar to the songs a person likes?

<br><br>

This study uses a dataset of the most popular songs of 2023, as curated by Spotify, with a wide range of features and attributes for each song. I chose this dataset out of a personal affinity for music: exploring the mathematical patterns in music may help me track down a particular song that has lingered in my mind.


[Spotify Dataset](https://www.kaggle.com/datasets/rajatsurana979/most-streamed-spotify-songs-2023/)

In [67]:
# Libraries Needed:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets as ds
from sklearn import linear_model as lm
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier as KNN
import plotly.express as px

### **Data Wrangling:**



In [68]:
# Read_CSV from Github
df = pd.read_csv("https://raw.githubusercontent.com/Jlok17/2022MSDS/main/spotify-2023.csv",  encoding='ISO-8859-1')

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [70]:
# Create a copy of the original dataframe
df3 = df.copy()

# Rename Columns
df3.rename(columns={'mode': 'Chord Type', 'artist_count':'Number of Artist'}, inplace=True)

# Check data structure and convert columns to correct types if needed
df3.info()
print("")

# Find rows whose 'track_name' contains non-ASCII characters (kept for inspection; not used below)
non_characters = df3[df3['track_name'].str.contains(r'[^\x00-\x7F]', regex=True)]

# Drop some columns
columns_to_drop = ['in_spotify_playlists', 'in_spotify_charts', 'in_apple_playlists', 'in_apple_charts',
                   'in_deezer_playlists', 'in_deezer_charts','in_shazam_charts']
df3.drop(columns=columns_to_drop, inplace=True)

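# Sort the data based on 'streams' column in descending order
# Note: 'streams' is still a string column at this point, so this sort is lexicographic, not numeric
df3.sort_values(by='streams', ascending=False, inplace=True)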

# Convert all month numbers to their corresponding string names
month_names = {1: 'January', 2: 'February', 3: 'March', 4: 'April', 5: 'May', 6: 'June', 7: 'July',
               8: 'August', 9: 'September', 10: 'October', 11: 'November', 12: 'December'}
df3['released_month'] = df3['released_month'].map(month_names)

# Bin BPM into 20-wide intervals, convert 'streams' to numeric, and summarize streams per interval
labels1 = ['0-20', '21-40', '41-60', '61-80', '81-100', '101-120', '121-140', '141-160', '161-180', '181-200', '200+']
df3['BPM_interval'] = pd.cut(df['bpm'], bins=list(range(0, df['bpm'].max() + 21, 20)), labels=labels1)
df3['streams'] = pd.to_numeric(df['streams'], errors='coerce')
summary_df1 = (df3[['track_name', 'bpm', 'BPM_interval', 'streams']]
               .groupby('BPM_interval')['streams']
               .agg(['count', 'mean', 'min', 'max', 'std'])
               .reset_index())

print(summary_df1)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   Number of Artist      953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [71]:
df3.head()

Unnamed: 0,track_name,artist(s)_name,Number of Artist,released_year,released_month,released_day,streams,bpm,key,Chord Type,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%,BPM_interval
574,Love Grows (Where My Rosemary Goes),Edison Lighthouse,1,1970,January,1,,110,A,Major,53,75,69,7,0,17,3,101-120
33,Anti-Hero,Taylor Swift,1,2022,October,21,999748277.0,97,E,Major,64,51,63,12,0,19,5,81-100
625,Arcade,Duncan Laurence,1,2019,March,7,991336132.0,72,A,Minor,45,27,33,82,0,14,4,61-80
253,Glimpse of Us,Joji,1,2022,June,10,988515741.0,170,G#,Major,44,27,32,89,0,14,5,161-180
455,Seek & Destroy,SZA,1,2022,December,9,98709329.0,152,C#,Major,65,35,65,44,18,21,7,141-160
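
The row order in the `df3.head()` output above follows text ordering rather than numeric ordering, because the sort runs while `streams` is still stored as strings. The sketch below illustrates the difference on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset); converting with `pd.to_numeric` before sorting is one possible fix.

```python
import pandas as pd

# Toy stand-in for the 'streams' column: numbers stored as strings,
# plus one malformed entry that cannot be parsed as a number.
toy = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "streams": ["999748277", "98709329", "3000000000", "not-a-number"],
})

# Sorting the raw strings compares them character by character.
string_order = toy.sort_values("streams", ascending=False)["track_name"].tolist()

# Converting first gives a true numeric sort; malformed values become NaN
# and are placed last.
toy["streams_num"] = pd.to_numeric(toy["streams"], errors="coerce")
numeric_order = (
    toy.sort_values("streams_num", ascending=False, na_position="last")["track_name"]
    .tolist()
)

print("string sort: ", string_order)
print("numeric sort:", numeric_order)
```

In the string sort the malformed entry comes first and the largest number comes last, which mirrors the pattern seen in the output above.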


### **Data Analysis:**

In [72]:
df3.drop(['released_year', 'released_day'], axis=1).describe()

Unnamed: 0,Number of Artist,streams,bpm,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
count,953.0,952.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0
mean,1.556139,514137400.0,122.540399,66.96957,51.43127,64.279119,27.057712,1.581322,18.213012,10.131165
std,0.893044,566856900.0,28.057802,14.63061,23.480632,16.550526,25.996077,8.4098,13.711223,9.912888
min,1.0,2762.0,65.0,23.0,4.0,9.0,0.0,0.0,3.0,2.0
25%,1.0,141636200.0,100.0,57.0,32.0,53.0,6.0,0.0,10.0,4.0
50%,1.0,290530900.0,121.0,69.0,51.0,66.0,18.0,0.0,12.0,6.0
75%,2.0,673869000.0,140.0,78.0,70.0,77.0,43.0,0.0,24.0,11.0
max,8.0,3703895000.0,206.0,96.0,97.0,97.0,97.0,91.0,97.0,64.0


In [73]:
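# Correlate numeric columns only; text and categorical columns have no Pearson correlation
corr = df3.select_dtypes(include='number').corr()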
fig = px.imshow(corr, color_continuous_scale='RdBu')
fig.update_layout(title='Correlation Heatmap')
fig.show()





In [74]:
fig = px.histogram(df, x='instrumentalness_%', title='Instrumentalness Distribution')
fig.show()

In [75]:
fig = px.histogram(df, x='liveness_%', title='Liveness Distribution')
fig.show()

In [76]:
fig = px.histogram(df, x='speechiness_%', title='Speechiness Distribution')
fig.show()

In [77]:
fig = px.histogram(df3, x='released_year', nbins=10, color_discrete_sequence=['#4287f5'], title='Distribution of Released Years')
fig.update_layout(xaxis_title='Released Year', yaxis_title='Frequency (Log Scale)')
fig.update_layout(yaxis_type='log')
fig.show()

In [78]:
fig = px.ecdf(df3, x='danceability_%', title='ECDF of Danceability Percentage')
fig.update_layout(xaxis_title='Danceability %', yaxis_title='Cumulative Proportion')
fig.show()


In [79]:
fig = px.box(df3, x='released_month', y='streams', title='Streams Across Song Release Months',
             labels={'released_month': 'Month', 'streams': 'Streams'})
fig.show()


In [80]:
fig = px.scatter_matrix(df3, dimensions=['streams', 'danceability_%', 'valence_%', 'energy_%'],
                        title='Pairwise Scatterplot Matrix')
fig.update_traces(diagonal_visible=False)
fig.show()


In [81]:
fig = px.box(df3, x='Chord Type', y='danceability_%', title='Danceability % by Chord Type',
             labels={'Chord Type': 'Mode', 'danceability_%': 'Danceability %'})
fig.show()

In [82]:
# The dataset has no genre column, so the plot is labeled by BPM interval
fig = px.box(df3, x='BPM_interval', y='streams', title='Streams by BPM Interval')
fig.update_traces(marker=dict(color='blue'))
fig.update_traces(hovertemplate='BPM interval: %{x}<br>Streams: %{y}')
fig.update_layout(xaxis_title='BPM Interval', yaxis_title='Streams')
fig.show()

In [83]:
# Points are colored per track; the dataset has no genre column
fig = px.scatter_matrix(df3, dimensions=['bpm', 'streams', 'key'], color='track_name',
                        title='Relationship between BPM, Streams, and Key by Track')
fig.show()


In [84]:
df2 = df.copy()
df2['streams'] = pd.to_numeric(df2['streams'], errors='coerce')
summary2 = df2.groupby('key')['streams'].describe().reset_index()
summary2.sort_values(by='mean', ascending=False)

Unnamed: 0,key,count,mean,std,min,25%,50%,75%,max
3,C#,120.0,604280200.0,725831400.0,14780425.0,133850800.0,309573860.0,812638700.0,3703895000.0
6,E,62.0,577497200.0,614434300.0,29562220.0,150045700.0,284811322.5,807425400.0,2355720000.0
5,D#,33.0,553036500.0,562937700.0,76831876.0,157990700.0,273194684.0,924193300.0,1840365000.0
1,A#,57.0,552475400.0,602072400.0,2762.0,133753700.0,363467642.0,723894500.0,2594040000.0
4,D,81.0,529525600.0,573949600.0,39228929.0,162887100.0,298063749.0,599770200.0,2808097000.0
8,F#,73.0,522363200.0,584515100.0,39666245.0,138334400.0,283359161.0,629173100.0,2864792000.0
2,B,81.0,519348000.0,591014400.0,11956641.0,127409000.0,322336177.0,582863400.0,2557976000.0
10,G#,91.0,476911900.0,522907000.0,33381454.0,155538900.0,288101651.0,585968300.0,2591224000.0
7,F,89.0,468446400.0,471203100.0,22581161.0,130419400.0,255120451.0,609293400.0,1788326000.0
9,G,96.0,452599400.0,491175900.0,1365184.0,145633000.0,251810759.0,583521500.0,2565530000.0


In [85]:
fig = px.violin(df3, x='key', y='streams', title='Streams Distribution by Key',
                labels={'key': 'Key', 'streams': 'Streams'})
fig.update_layout(xaxis_title='Key', yaxis_title='Streams')
fig.show()


In [86]:
# Group by track (the dataset has no genre column) and average BPM and streams
track_stats = df3.groupby('track_name').agg({'bpm': 'mean', 'streams': 'mean'}).reset_index()
heatmap_data = track_stats.pivot(index='track_name', columns='bpm', values='streams')

fig = px.imshow(heatmap_data, color_continuous_scale='Viridis', title='Average BPM and Streams by Track')
fig.update_layout(xaxis_title='Average BPM', yaxis_title='Track')
fig.show()


### **Conclusions:**


#### Key Findings:

In this dataset, the largest positive correlations are between Valence % and Danceability % and between Energy % and Valence %, while the largest negative correlation is between Acousticness % and Energy %. This makes sense: valence measures the musical positiveness of a track, with higher values being more positive and lower values more negative. Based on typical trends in music, higher valence tends to go with more energy and danceability, which are often found in higher-BPM songs. The largest negative correlation is expected, since energetic tracks usually rely on heavier production, while acoustic tracks usually involve little production beyond the instruments themselves.
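
As a hedged illustration of how the strongest pairs can be read off a correlation matrix programmatically rather than by eye, the sketch below ranks feature pairs by absolute Pearson correlation. The helper `top_correlations` and the synthetic `toy` frame are assumptions made for this example; the numbers it prints are not results from the Spotify data.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic data with one built-in positive and one built-in negative relationship.
rng = np.random.default_rng(0)
valence = rng.uniform(0, 100, 200)
energy = rng.uniform(0, 100, 200)
toy = pd.DataFrame({
    "valence_%": valence,
    "danceability_%": 0.5 * valence + rng.normal(0, 5, 200),
    "energy_%": energy,
    "acousticness_%": 100 - energy + rng.normal(0, 5, 200),
})

result = top_correlations(toy, n=2)
print(result)
```

Ranking by absolute value keeps strong negative relationships, such as the one between acousticness and energy, next to the strong positive ones.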

The two most right-skewed variables were Instrumentalness % and Liveness %, which are associated with the more artistic and less produced versions of music. I also found Speechiness % interesting: its mean was only about 10% and its median about 6%.
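
Right skew can be checked numerically as well as from the histograms. The sketch below uses `DataFrame.skew()` and the gap between mean and median on a small made-up frame (`toy` is an assumption for illustration, not the real data): a mean well above the median, as reported for Speechiness %, is a typical sign of a long right tail.

```python
import pandas as pd

# Made-up values: one column with a long right tail, one roughly symmetric.
toy = pd.DataFrame({
    "instrumentalness_%": [0, 0, 0, 0, 0, 0, 0, 1, 2, 91],
    "danceability_%": [40, 50, 55, 60, 65, 70, 75, 80, 85, 90],
})

# Sample skewness per column; positive values indicate right skew.
skew = toy.skew().sort_values(ascending=False)

# Mean minus median; a large positive gap also points to right skew.
gap = toy.mean() - toy.median()

print(skew)
print(gap)
```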

For the popularity characteristics of songs, something that was interesting to me is that 77% of the dataset was released after 2020, while 23% was released before 2020. The y-axis of the release-year histogram is logarithmic, so the plot does not look heavily left-skewed, but that percentage is still fascinating: it shows that the majority of listened-to music is recent, with a large drop-off about 10 years after release.
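
A share like this can be computed directly from the `released_year` column. The sketch below is a minimal version on made-up years; the helper name `release_share` and its `inclusive` flag are assumptions for illustration. Whether the cutoff year itself counts as recent changes the result, so that choice is worth stating alongside the percentage.

```python
import pandas as pd

def release_share(years, cutoff=2020, inclusive=True):
    """Fraction of songs released in or after (inclusive) or strictly after the cutoff year."""
    years = pd.Series(years)
    recent = years >= cutoff if inclusive else years > cutoff
    return float(recent.mean())

# Made-up release years, not taken from the dataset.
toy_years = [1970, 2015, 2019, 2020, 2021, 2022, 2022, 2023, 2023, 2023]

share_inclusive = release_share(toy_years)                   # 2020 counted as recent
share_strict = release_share(toy_years, inclusive=False)     # 2020 not counted as recent
print(share_inclusive, share_strict)
```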

Some isolated findings were also interesting. The 161-180 BPM interval contained the most popular songs by streams. The most common keys, in descending order of song count, were C#, G, G#, and F. However, the most popular keys by average streams were C#, E, D#, and A#, while the latter three keys by song count (G, G#, and F) were toward the bottom in average streams per song.
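
The two rankings of keys can be produced from one grouped summary. The sketch below shows the idea on a made-up frame (`toy` and its stream counts are assumptions for illustration): the same `groupby` result is sorted once by song count and once by mean streams, and the two orders need not agree.

```python
import pandas as pd

# Made-up keys and stream counts, chosen so the two rankings differ.
toy = pd.DataFrame({
    "key": ["C#", "C#", "C#", "C#", "G", "G", "G", "E", "E", "D#"],
    "streams": [900, 800, 700, 800, 200, 300, 100, 700, 600, 500],
})

by_key = toy.groupby("key")["streams"].agg(["count", "mean"])

# Rank keys by how many songs use them, then by average streams per song.
by_count = by_key.sort_values("count", ascending=False).index.tolist()
by_mean = by_key.sort_values("mean", ascending=False).index.tolist()

print("by song count:  ", by_count)
print("by mean streams:", by_mean)
```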

#### Recommendation System Development:

Using the data, I found several patterns in music trends and characteristics. Shown below is a first step toward a recommendation system for artists and song titles. The system currently uses a K-Nearest Neighbors (KNN) model to recommend songs. I chose KNN for the initial model because it is easy to implement for finding items with similar characteristics and because, as a non-parametric model, it makes no assumptions about the data distribution. To extend the KNN model, I would want to try similarity measures beyond cosine similarity, including but not limited to Spearman rank correlation, Minkowski distance, and Jaccard similarity. Furthermore, as a general premise, I would want to use a larger database for the recommender rather than only the top songs of 2023, as this might provide better matches on specific characteristics and help surface less popular songs.
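
As a rough sketch of how another similarity measure could be swapped in, the example below runs scikit-learn's `NearestNeighbors` with cosine distance and with Minkowski distance (`p=1` and `p=2`) on a made-up frame. The `toy` data, the `neighbors` helper, and the use of `StandardScaler` are assumptions for illustration and are not part of the notebook's model; Spearman and Jaccard measures are not covered here.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Made-up feature values for five hypothetical songs.
toy = pd.DataFrame(
    {
        "bpm": [120, 122, 90, 170, 121],
        "energy_%": [80, 78, 30, 90, 35],
        "acousticness_%": [10, 12, 85, 5, 80],
    },
    index=["song_a", "song_b", "song_c", "song_d", "song_e"],
)

# Standardize so that no single feature dominates distance-based metrics.
scaled = StandardScaler().fit_transform(toy)

def neighbors(metric, **kwargs):
    """Names of the 3 nearest songs to song_a (including itself) under a metric."""
    model = NearestNeighbors(metric=metric, algorithm="brute", **kwargs).fit(scaled)
    _, idx = model.kneighbors(scaled[[0]], n_neighbors=3)
    return toy.index[idx[0]].tolist()

results = {
    "cosine": neighbors("cosine"),
    "minkowski p=1": neighbors("minkowski", p=1),
    "minkowski p=2": neighbors("minkowski", p=2),
}
for name, names in results.items():
    print(name, names)
```

Scaling matters more for Minkowski distances than for cosine distance, because a feature such as BPM has a much wider numeric range than the percentage features.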

#### Future Research Directions:

Future development I plan to pursue after this semester includes trying different models, such as cascade hybrid recommender systems as well as knowledge-based, deep-learning-based, and session-based recommender systems. I want to use these systems to better capture mood, since people typically enjoy different music depending on their current environment. A key step toward a larger dataset is to use the Spotify API to query Spotify's own database for songs and recommendations from user inputs. To supply the song-characteristic inputs, the system could link to a user's account or offer a lookup function that pulls a song's characteristics for the recommendation.
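
A lookup-style input could be prototyped against a local table before any external service is involved. The sketch below is one possible shape for it, assuming a small in-memory `catalog` frame and a hypothetical helper `recommend_like`; all names and values are made up for illustration, and no Spotify API call is made.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

FEATURES = ["bpm", "danceability_%", "energy_%"]

# Made-up catalog standing in for a larger song database.
catalog = pd.DataFrame({
    "track_name": ["Track A", "Track B", "Track C", "Track D"],
    "artist(s)_name": ["Artist 1", "Artist 2", "Artist 3", "Artist 4"],
    "bpm": [120, 122, 70, 180],
    "danceability_%": [80, 78, 30, 40],
    "energy_%": [60, 62, 20, 95],
})

def recommend_like(track_name, catalog, k=2):
    """Look up a track by name and return the k most similar other tracks."""
    mask = catalog["track_name"].str.casefold() == track_name.casefold()
    if not mask.any():
        raise KeyError(f"{track_name!r} is not in the catalog")
    seed_pos = int(mask.to_numpy().argmax())  # position of the first match

    features = catalog[FEATURES].to_numpy(dtype=float)
    model = NearestNeighbors(metric="cosine", algorithm="brute").fit(features)
    # Ask for one extra neighbor because the seed track matches itself.
    n = min(k + 1, len(catalog))
    _, idx = model.kneighbors(features[[seed_pos]], n_neighbors=n)
    picks = [i for i in idx[0] if i != seed_pos][:k]
    return catalog.iloc[picks][["track_name", "artist(s)_name"]]

picks = recommend_like("track a", catalog, k=2)
print(picks)
```

Dropping the seed track from its own neighbor list avoids recommending the song the user already entered.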


#### Conclusion:

Analysis of the Spotify dataset showed some notable correlations: Valence % is positively correlated with both Danceability % and Energy %, while Acousticness % displayed a strong negative correlation with Energy %. Further investigation highlighted two right-skewed variables, Instrumentalness % and Liveness %. Interestingly, Speechiness % had a mean of about 10% and a median of about 6%. Popularity insights showed that 77% of songs in the dataset were released after 2020, indicating a trend toward recent music, with popularity declining about a decade after release. The current recommendations are based on a KNN system, which will be extended with alternative similarity measures such as Spearman rank correlation and Minkowski distance. Future research targets for advancing the recommender system include mood-based preferences, leveraging the Spotify API for its extensive database, and incorporating user inputs for personalized song characteristics.

In [87]:
# Basic Recommender System By KNN
# Select the numeric feature columns used to measure similarity
data = df[['artist_count', 'bpm', 'danceability_%', 'valence_%', 'energy_%', 'acousticness_%',
           'instrumentalness_%', 'liveness_%', 'speechiness_%']]

# Training KNN model
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(data)

# Function to recommend artists and tracks based on input data
def recommend(artist_data):
    # Preprocess input data
    artist_data = artist_data[['artist_count', 'bpm', 'danceability_%', 'valence_%', 'energy_%', 'acousticness_%',
                               'instrumentalness_%', 'liveness_%', 'speechiness_%']]

    # Find K Nearest Neighbors via cosine similarity
    distances, indices = knn.kneighbors(artist_data, n_neighbors=5)  # Change n_neighbors as desired

    # Return Recommended artist(s) and track(s) based on the indices
    recommended_artists = df.iloc[indices[0]]['artist(s)_name']
    recommended_tracks = df.iloc[indices[0]]['track_name']

    return recommended_artists, recommended_tracks

# Example input data
input_artist_data = pd.DataFrame({
    'artist_count': [2],
    'bpm': [130],
    'danceability_%': [80],
    'valence_%': [50],
    'energy_%': [60],
    'acousticness_%': [30],
    'instrumentalness_%': [10],
    'liveness_%': [30],
    'speechiness_%': [10]
})

# Recommendation based on Example Input
recommended_artists, recommended_tracks = recommend(input_artist_data)
print("Recommended Artists:")
print(recommended_artists)
print("\nRecommended Tracks:")
print(recommended_tracks)


Recommended Artists:
500                                Gayle
217                  Arcangel, Bad Bunny
510                         Jaymes Young
635                              ENHYPEN
219    Future, Chris Brown, Metro Boomin
Name: artist(s)_name, dtype: object

Recommended Tracks:
500                                           ýýýabcdefu
217                                             La Jumpa
510                                             Infinity
635                                        Polaroid Love
219    Superhero (Heroes & Villains) [with Future & C...
Name: track_name, dtype: object
