## 1 Title <span style="color:red">(to be completed)</span>

## 2 Introduction
As the music market grows, more people are listening to songs and discovering the types of music they like most. It is interesting to understand people’s preferences for popular songs and the traits these songs share. Our group project analyzes the popularity of songs on Spotify in 2023 using the Most Streamed Spotify Songs 2023 dataset created by Nidula Elgiriyewithana, which lists the most popular songs on Spotify in 2023. Popularity in this dataset is assessed by each song’s total number of streams, its rank on the Spotify charts, and the number of Spotify and Apple Music playlists that include it. Based on these traits and other characteristics explained later, our predictive question is: can we predict the total number of Spotify streams for a song? Kaggle rates the usability of this dataset at 10 out of 10, which suggests that it is compatible, reliable, and complete. The dataset also provides statistics for multiple traits of each song, including the presence of live performance elements, the positivity of the musical content, and the year the song was released. These diverse characteristics should support a comprehensive prediction of a song’s total streams on Spotify.

## 3 Preliminary exploratory data analysis

In [3]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

### 3.1 Read data 

In [50]:
spotify_df = pd.read_csv("data/spotify-2023-dataset.csv")
spotify_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

### 3.2 Clean and wrangle data
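Upon preliminary inspection, we noted that our dataset is relatively clean, with most variables in an appropriate format for analysis. The exceptions visible in the `info()` output are `streams`, `in_deezer_playlists`, and `in_shazam_charts`, which are stored as `object` rather than a numeric type, and `in_shazam_charts` and `key`, which contain missing values. We have ensured the dataset is in a tidy format: each variable is a column, each observation is a row, and each type of observational unit forms a table.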
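
As a hedged illustration of the kind of type check involved, the sketch below uses a small made-up frame rather than the project data. The values and the helper name `to_numeric_report` are assumptions introduced here for illustration. It shows how `pd.to_numeric` with `errors="coerce"` converts text columns to numbers and how to count the entries that fail to parse.

```python
import pandas as pd

# Toy data standing in for the real file; all values are invented for illustration.
toy_df = pd.DataFrame({
    "streams": ["1000000", "2500000", "not available"],
    "bpm": [125, 92, 138],
})


def to_numeric_report(df, columns):
    """Coerce the given columns to numeric and count values that failed to parse."""
    converted = df.copy()
    failures = {}
    for col in columns:
        missing_before = converted[col].isna().sum()
        # Unparseable entries become NaN instead of raising an error.
        converted[col] = pd.to_numeric(converted[col], errors="coerce")
        failures[col] = int(converted[col].isna().sum() - missing_before)
    return converted, failures


converted_df, failed = to_numeric_report(toy_df, ["streams", "bpm"])
print(converted_df.dtypes)
print(failed)
```

Counting the newly introduced missing values makes it easier to decide whether to drop those rows or investigate them before modelling.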


In [56]:
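# Drop rows that contain any missing value.
spotify_df_cleaned = spotify_df.dropna()
# Ensure liveness_% is numeric; entries that cannot be parsed become NaN.
spotify_df['liveness_%'] = pd.to_numeric(spotify_df['liveness_%'], errors='coerce')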

# print(spotify_df['danceability_%'].isnull().sum())



### 3.3 Summary


In [68]:

attributes = [
    "in_spotify_playlists", "in_spotify_charts", "in_apple_playlists", "in_apple_charts",
    "in_deezer_playlists", "in_deezer_charts", "in_shazam_charts", "bpm",
    "danceability_%", "valence_%", "energy_%", "acousticness_%", 
    "instrumentalness_%", "liveness_%", "speechiness_%"
]

count_values = spotify_df[attributes].count()
missing_values = spotify_df[attributes].isnull().sum()
mean_values = spotify_df[attributes].mean(numeric_only = True)

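# Build the table from the Series index so each value stays aligned with its attribute name.
summary_df = (
    pd.DataFrame({
        'Count': count_values,
        'Missing Values': missing_values,
        'Mean': mean_values
    })
    .rename_axis('Attribute')
    .sort_index()
    .reset_index()
)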
summary_df


|   | Attribute | Count | Missing Values | Mean |
|---|---|---|---|---|
| 0 | acousticness_% | 953 | 0 | 27.057712 |
| 1 | bpm | 953 | 0 | 122.540399 |
| 2 | danceability_% | 953 | 0 | 66.96957 |
| 3 | energy_% | 953 | 0 | 64.279119 |
| 4 | in_apple_charts | 953 | 0 | 51.908709 |
| 5 | in_apple_playlists | 953 | 0 | 67.812172 |
| 6 | in_deezer_charts | 953 | 0 | 2.666317 |
| 7 | in_deezer_playlists | 953 | 0 | NaN |
| 8 | in_shazam_charts | 903 | 50 | NaN |
| 9 | in_spotify_charts | 953 | 0 | 12.009444 |


### 3.4 Visualize data

In [67]:
spotify_df['streams'] = pd.to_numeric(spotify_df['streams'], errors='coerce')

spotify_df_new = spotify_df.assign(streams_in_thousands = spotify_df['streams'] / 1000)

attributes = ['danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%']

# Create one scatter plot per attribute using repeat
chart = alt.Chart(spotify_df_new).mark_point().encode(
    x=alt.X(alt.repeat('column'), type='quantitative'),
    y=alt.Y('streams_in_thousands:Q', title='Streams (in thousands)')
).properties(width=200, height=200).repeat(
    column=attributes
).interactive()

chart.display()

## 4 Methods

We will read our dataset from the web into Python, then clean and wrangle it into a tidy format. To answer our question, we will use numerical predictors such as danceability_%, energy_%, liveness_%, and instrumentalness_%, which may affect a song’s popularity. We will split the dataset into a training set and a testing set (for example, 75% for training and 25% for testing) and scale the numerical predictors so that they are on a comparable scale. We will then tune “K” (the number of neighbors) to find the best value for our dataset and use the resulting model to make predictions on the testing set, which gives us the estimated total streams for a song. Finally, we will use a scatter plot to visualize our results, illustrating the predicted number of streams for a song.
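
The workflow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's final implementation: it uses synthetic data in place of the Spotify file, assumes K-nearest neighbors regression (since the target is a numeric stream count), and picks an arbitrary search range of K from 1 to 20 with 5-fold cross-validation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 "songs" with 4 percentage-style features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 4))
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 5000, size=200)

# 75% training / 25% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling sits inside the pipeline so it is refit on the training folds only.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Search K = 1..20 with 5-fold cross-validation on the training set.
param_grid = {"kneighborsregressor__n_neighbors": list(range(1, 21))}
search = GridSearchCV(
    knn_pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
predictions = search.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
print(f"best K = {best_k}, test RMSE = {rmse:.1f}")
```

Keeping the scaler inside the pipeline means the test set never influences the scaling parameters, so the test error remains an honest estimate of how the model would perform on unseen songs.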


## 5 Expected outcomes and significance

### 5.1 Expected outcomes

We expect to estimate the total number of streams a song might receive based on characteristics such as:
- **danceability_%**: Percentage indicating how suitable the song is for dancing
- **valence_%**: Positivity of the song's musical content
- **energy_%**: Perceived energy level of the song
- **acousticness_%**: Amount of acoustic sound in the song
- **instrumentalness_%**: Amount of instrumental content in the song
- **liveness_%**: Presence of live performance elements
- **speechiness_%**: Amount of spoken words in the song

### 5.2 Significance

Our findings might give music producers insight into what types of songs to make, if there is a clear trend among songs that already have a relatively large number of streams or are predicted to reach one.
In addition, the analysis could highlight whether a certain branch of music, or a particular set of song characteristics, appeals to most listeners. Beyond individual differences in taste, do we all tend to enjoy certain songs?
