
# Executive Summary:

This project investigates the factors influencing a song’s success by modelling the number of streams for Spotify’s most-streamed songs. Using a dataset of popular tracks, we aimed to predict streaming numbers and identify significant predictors. Variables analyzed included the number of artists on a song, playlist inclusions, BPM, and chart positions on Spotify, Apple Music, Deezer, and Shazam. We also looked at audio features such as danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness.


To enhance model interpretability, we presented a table of regression coefficients, highlighting each variable’s impact on streaming numbers, along with p-values to assess statistical significance. Variables with low p-values were considered significant predictors and were retained in the final model. The model’s explanatory power was measured using R-squared and adjusted R-squared; the latter accounts for the number of variables, which helps guard against overfitting and keeps the model parsimonious. These metrics allowed us to compare different models and select the one with the best balance of complexity and predictive performance.


We evaluated predictive accuracy using root mean squared error (RMSE), which measures the typical size of the difference between predicted and actual streams. Lower RMSE values indicate better predictive capabilities, and we focused on minimizing this error to enhance model reliability. By optimizing RMSE, we aimed for a model that not only fits the existing data but also generalizes to new, unseen songs.


Our findings indicate that chart rankings and playlist inclusions are strong predictors of a song’s streaming success. Audio features such as energy and danceability also contribute significantly to popularity. This suggests that both exposure and musical qualities are crucial. The model provides valuable insights for artists, producers, and record labels aiming to optimize song characteristics and marketing strategies. By combining statistical interpretability with predictive accuracy, our approach offers a robust framework for understanding and forecasting a song’s popularity in today’s music industry. Future research could enhance this model by incorporating additional variables or exploring non-linear relationships.


# Introduction:

# Data:
For our final project, we hoped to create a model that accurately predicts the number of streams a song will achieve based on song attributes, artist characteristics, and temporal factors. To that end, we used a dataset called "Spotify Most Streamed Songs" that we found on Kaggle. It contains 953 rows and 25 columns, including artist, streams, BPM, and danceability. To make the data usable for regression analysis, we had to clean it through several steps.

Firstly, to deal with potential null values in the dataset, we summed the number of NAs in each column in the data, finding only 145 instances across two columns. Because this is a relatively small proportion of the dataset, we chose to remove these columns to help prepare the data for regression. Later on in our analysis, we found an error with the record at position 478. So, we found the index of the record at this position and chose to drop it because it is also such a small portion of our data.
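The null-value check described above can be sketched as follows. The DataFrame is a made-up stand-in for the Kaggle file, and the names `songs`, `na_counts`, and the position `2` are our own illustrative assumptions. Whether missing values are handled by dropping rows or whole columns is a modelling choice, so both options are shown.

```python
import pandas as pd

# Toy stand-in for the dataset; these values are invented for illustration.
songs = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "key": ["C#", None, "F", None],
    "in_shazam_charts": ["10", "3", None, "7"],
    "bpm": [120, 95, 140, 100],
})

# Count the missing values in each column.
na_counts = songs.isna().sum()
print(na_counts)

# Option 1: drop every row that contains a missing value.
rows_dropped = songs.dropna(axis=0)

# Option 2: drop every column that contains a missing value.
cols_dropped = songs.dropna(axis=1)

# Drop a single problematic record by its position
# (position 2 is arbitrary here, not the record from our analysis).
bad_label = songs.index[2]
without_bad = songs.drop(index=bad_label)
```

Dropping rows keeps every variable available for modelling at the cost of fewer observations, while dropping columns keeps every observation but removes the affected variables.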

From there, we checked the data type of each column to ensure that it matched the values the column contained. In doing so, we found that the columns "in_deezer_playlists", "in_shazam_charts", and "streams" were all stored as objects instead of numbers. Therefore, we used a for loop and the function pd.to_numeric to coerce these values into the correct data type for regression.
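A minimal sketch of that coercion loop is shown below. The rows are invented for illustration, and `errors="coerce"` is our assumption about how unparseable entries were handled: it turns any string that cannot be parsed as a number (including one with a thousands separator) into NaN rather than raising an error.

```python
import pandas as pd

# Toy rows; the values are invented and include two unparseable entries.
songs = pd.DataFrame({
    "streams": ["141381703", "not a number", "800840817"],
    "in_deezer_playlists": ["45", "1,250", "91"],
    "in_shazam_charts": ["826", "382", None],
})

# Before coercion these columns have dtype "object" (strings).
print(songs.dtypes)

# Coerce each column to a numeric dtype; unparseable strings become NaN.
for col in ["in_deezer_playlists", "in_shazam_charts", "streams"]:
    songs[col] = pd.to_numeric(songs[col], errors="coerce")

print(songs.dtypes)
print(songs.isna().sum())
```

Because coercion can create new missing values, it is worth re-counting NAs after this step.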

Next, we chose to one-hot encode the variables key and mode. This converts a categorical variable into one that the computer can more easily use for analysis by assigning binary values to these variables. However, we also found another 65 NA values in the column 'in_deezer_charts' and 4 in 'in_shazam_charts'. Again, because they are such a small proportion of the data, and we did not feel that these columns were that influential, we chose to drop these NA values. We also chose to add in a new variable, danceability squared, because we thought that there may be diminishing returns to the benefit of danceability in predicting streams.
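The encoding and the squared term can be sketched as below. The rows are invented, the column name `danceability_sq` is a hypothetical label, and `drop_first=True` (which drops one level per variable to avoid perfectly collinear dummies in a regression) is an assumption rather than a confirmed detail of our pipeline.

```python
import pandas as pd

# Toy rows; values are invented for illustration.
songs = pd.DataFrame({
    "key": ["C#", "F", "C#", "A"],
    "mode": ["Major", "Minor", "Minor", "Major"],
    "danceability_%": [80, 71, 51, 55],
})

# One-hot encode key and mode as 0/1 indicator columns.
encoded = pd.get_dummies(songs, columns=["key", "mode"], drop_first=True, dtype=int)

# Add a squared term so a regression can capture diminishing returns.
encoded["danceability_sq"] = encoded["danceability_%"] ** 2

print(encoded)
```

With the squared term included, a positive coefficient on danceability together with a negative coefficient on its square would be consistent with diminishing returns.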

In the final part of the EDA and cleaning process, we found that the histogram of our response variable, streams, was heavily skewed, with most songs clustered at the low end (the left of the histogram) and a long tail of very high stream counts, violating the normality assumption. Therefore, we chose to log-transform streams in order to better fit this assumption, despite a few outliers, and better prepare the data for regression.
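The effect of the log transform on skew can be illustrated with simulated data. The sketch below draws hypothetical stream counts from a lognormal distribution (an assumption chosen only because it produces a long right tail) and compares sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

# Simulated stream counts with a long right tail; not the real data.
rng = np.random.default_rng(0)
streams = pd.Series(rng.lognormal(mean=19, sigma=1, size=500))

# Natural log of the response variable.
log_streams = np.log(streams)

# Sample skewness: near 0 for a symmetric distribution.
skew_before = streams.skew()
skew_after = log_streams.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness much closer to zero after the transform indicates that the logged response is far more symmetric than the raw counts.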

Overall, despite EDA and cleaning being a relatively simple process for this dataset, we still had a few challenges. The main problem was coercing streams from a string data type to a numeric data type. This was a process that we were unfamiliar with and, despite being relatively simple, it took us a little while to figure out. We also struggled with deciding how to deal with null values. In the end, we decided to drop them because they were a small enough portion of our dataset that we did not expect them to affect the final results. Lastly, we noticed the skew in the histogram of the response late in our EDA process. We did not know how to deal with it at first and sought assistance from Professor Johnson, eventually landing on log-transforming the streams variable to fit the normality assumption.


# Methods:
**What is an observation:**

In the original dataset, each observation contains twenty-five variables. Five are categorical: track title, artist name(s), the key of the song, the mode (major or minor), and the Spotify page’s URL. The page’s URL is irrelevant to our analysis. The track name and artist are good key values for each observation; however, in terms of regression, we believe neither is consequential. The mode and key values contain information about the song itself and have the potential to be predictors, contingent on our model-building process. The rest are numerical variables: our response variable, the number of streams; the artist count (how many artists worked on the song); variables for how many playlists the song is included in and its place on the charts across four platforms (Spotify, Apple Music, Deezer, and Shazam); the BPM (beats per minute); and percentages of danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness. Some of these variables are measured subjectively and others objectively, yet they all contain information about the song’s characteristics and may serve as useful predictors regardless. The release date variables (year and month) could serve as categorical or numeric variables to track trends if they are used in our regression analysis.

During the initial cleaning process, we created a streams category (which delineates between Low, Medium, High, and Very High) and intuitively set aside some variables that we found less compelling. This new variable would have turned our problem into a classification question, but we ended up choosing regression as a more comprehensive method to answer our research question. We then went through a revised cleaning process centered on preparing our variables for regression analysis. Some basic exploratory data analysis methods might omit certain variables from our consideration. Another potential step for selecting these variables could be a variable importance plot produced from the random forest method.
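For reference, a four-level streams category of this kind could be built as sketched below. The cut points we actually used are not recorded here, so the sketch assumes quartile-based bins via `pd.qcut`; the stream counts are invented.

```python
import pandas as pd

# Invented stream counts for eight songs.
streams = pd.Series([1e8, 2e8, 3e8, 4e8, 5e8, 6e8, 7e8, 8e8])

# Split into four equal-sized groups by quartile (an assumption about the binning).
labels = ["Low", "Medium", "High", "Very High"]
streams_category = pd.qcut(streams, q=4, labels=labels)

print(streams_category.value_counts().sort_index())
```

Binning discards the ordering and spacing within each group, which is one reason regression on the (logged) stream count retains more information than classification on the category.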

**Supervised vs. unsupervised learning:**

After performing EDA, we found that our data suffers from a few cases of multicollinearity. To help combat this, we plan on using unsupervised learning on the front end of our model building to make our regression more robust to these problems. Furthermore, unsupervised learning may help with discovering natural patterns and, in turn, reduce dimensionality through techniques such as PCA.
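As a rough sketch of how PCA could absorb a pair of collinear predictors, the example below simulates two strongly correlated features and one independent feature, standardizes them, and keeps two principal components. The data and the choice of two components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated predictors: "acoustic" is strongly (negatively) tied to "energy".
rng = np.random.default_rng(0)
energy = rng.normal(size=200)
acoustic = -0.8 * energy + 0.2 * rng.normal(size=200)
bpm = rng.normal(size=200)
X = np.column_stack([energy, acoustic, bpm])

# PCA is scale-sensitive, so standardize each column first.
X_scaled = StandardScaler().fit_transform(X)

# Keep two components and inspect how much variance each explains.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
print(explained, explained.sum())
```

Because the two correlated features largely collapse into one component, two components retain almost all of the variance in three original columns. The trade-off is that components are harder to interpret than the original variables.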

After using unsupervised learning to reduce dimensionality, we plan on using supervised learning in our regression model in an attempt to minimize error in predictions and train the model towards a specific goal (in this case predicting the number of streams). Furthermore, we chose to do regression over classification because our goal is to find the statistics that reflect trends or patterns associated with the popularity of songs, which is described by the number of streams a song has.

**Models & Algorithms:**

We plan on using a couple of different methods to perform our analysis. First, we will use LASSO to assist in reducing multicollinearity problems and prevent overfitting on the training data through regularization. Because its L1 penalty can shrink some coefficients exactly to zero, LASSO also performs automated variable selection, which helps keep the model interpretable, simplifies it, and can enhance prediction accuracy. Ridge regression, a related regularization technique with an L2 penalty, shrinks coefficients without removing variables and could serve as a point of comparison.
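A minimal sketch of LASSO-based variable selection follows, using scikit-learn's `LassoCV` to choose the penalty strength by cross-validation. The data are simulated so that only the first two of six predictors matter; the pipeline layout and settings are assumptions, not our final configuration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data: only the first two of six predictors affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Standardize predictors so the L1 penalty treats them on the same scale,
# then pick the penalty strength with 5-fold cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

coefs = model.named_steps["lassocv"].coef_
print(np.round(coefs, 3))
```

Coefficients for the irrelevant predictors are shrunk to (or very near) zero, which is the behavior that makes LASSO useful for trimming a model.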

We are also considering using Random Forest. Random Forest can reduce overfitting of our model by averaging predictions across multiple decision trees. Averaging our decision trees will smooth out extreme predictions and should lead to more generalizable results. Further, using the Random Forest method is beneficial for handling both numerical and categorical data well. Similarly to LASSO, Random Forest will assist us in choosing the most influential variables and will help build a simpler and more effective model.
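The variable-importance idea can be sketched with scikit-learn's `RandomForestRegressor`. The data are simulated and the feature names are hypothetical labels attached only to make the output readable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Simulated data: only the first feature drives the response.
rng = np.random.default_rng(0)
feature_names = ["in_spotify_playlists", "bpm", "energy_%", "valence_%"]  # hypothetical labels
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Fit a forest of 100 trees; predictions are averaged across trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can be biased toward features with many distinct values, so they are best treated as a screening aid rather than a final ranking.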

**Success:**

Our research question aims to find out which aspects of a song are most useful in predicting a song’s success on the Spotify platform. Our approach is to create multiple models, through LASSO, random forests, and other forms of variable and regression analysis, to arrive at the best predictive model possible. Success then involves an iterative process wherein each model improves upon another version or is ruled out, mainly based on its R-squared and adjusted R-squared values, until we find the best model we can create from these data. R-squared and adjusted R-squared are useful starting points, but we will need to be cautious about solely improving predictiveness and should compare these models holistically using other measures.

**Issues:**

We anticipate that we will face several issues in our regression analysis, mostly tied to the dataset itself. Firstly, our dataset has 95 null values in the column ‘key’ and 50 in the column ‘in_shazam_charts’. While there are several ways that we could reconcile this, we likely will have to pull in additional data sets to fill in any null data, provide additional data when needed, or remove them entirely. In turn, this could force us to clean an additional dataset; however, the process would be far easier since we have already done so with a similar dataset. Then we would have to join the two datasets, checking to make sure that we do not interfere with the integrity of the data during the process.

Furthermore, our data is skewed significantly towards more recent release years. This will likely lead to a non-representative sample, which keeps the model from accurately capturing trends in the number of streams over time. It may also bias the model towards recent trends and make it harder to isolate the effects of release timing on overall streaming popularity. To reconcile this, bringing in an additional dataset may help us analyze the effect of release year and/or potential changes in trends across time.

If our initial approach fails, we may be forced to reconsider the structure of our dataset, and whether or not it is actually usable in answering our research question. Furthermore, to solve this problem, we would have to follow the steps listed above in order to reconcile our data with additional datasets to fill in the gaps.

**Feature Engineering:**

When looking at a corrplot of our numeric variables, there are only two relationships that may pose a problem for our model. Firstly, released_year and streams are highly positively correlated. This may partially be due to a significantly higher number of observations for more recent years compared to the past. Secondly, acousticness_% and energy_% have a relatively high negative correlation. To deal with these, we would conduct PCA to determine which components explain the greatest percentage of the variance, in order to reduce the dimensionality of our model. Additionally, one-hot encoding may be useful for converting certain categorical variables. It is best suited to the variables key and mode, which have only a few levels, whereas one-hot encoding artist_name would greatly expand our dataset.
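A correlation check of this kind can be sketched as below. The data are simulated to mimic a strong negative relationship between energy and acousticness, and the 0.7 cutoff for flagging a pair is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Simulated features; acousticness falls as energy rises.
rng = np.random.default_rng(0)
energy = rng.uniform(20, 90, size=200)
df = pd.DataFrame({
    "energy_%": energy,
    "acousticness_%": 100 - energy + rng.normal(scale=5, size=200),
    "bpm": rng.uniform(70, 180, size=200),
})

# Pairwise Pearson correlations between numeric columns.
corr = df.corr()

# Flag each unordered pair whose absolute correlation exceeds the cutoff.
cutoff = 0.7
cols = list(corr.columns)
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > cutoff
]
print(high)
```

Listing only the flagged pairs makes it easier to decide which variables to combine, drop, or pass through PCA.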


# Results:
We have decided on four different outputs that will summarize the results of our model. We will first use a table of regression coefficients in order to display the final variables in our model and assist in increasing the interpretability of our model. We could also include p-values in this table if we wanted to display statistical significance. We will then use both R-squared and Adjusted R-squared to quantify the proportion of the variance in y explained by the model. We will use Adjusted R-squared because we want to be able to control for the number of variables in our model if we end up with many variables. Furthermore, we will use R-squared and Adjusted R-squared to compare the predictability of our models with each other, to determine the best model to use in our final regression. Finally, we will include RMSE in order to have a measure of the predictive error in our model.
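The three fit metrics can be computed directly from predictions, as in the sketch below. The helper name `regression_metrics` and the toy numbers are our own; the formulas are the standard ones, with adjusted R-squared equal to 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return (R-squared, adjusted R-squared, RMSE) for a fitted model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)

    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation

    r2 = 1 - ss_res / ss_tot
    # Penalize R-squared for the number of predictors in the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = np.sqrt(np.mean(residuals ** 2))
    return r2, adj_r2, rmse

# Toy example with one predictor; values are invented.
r2, adj_r2, rmse = regression_metrics(
    [1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], n_predictors=1
)
print(r2, adj_r2, rmse)
```

If the response is log-transformed, the RMSE is in log units and is not directly comparable to an RMSE computed on raw stream counts.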


# Conclusion:

# References:

Elgiriyewithana, Nidula. Most Streamed Spotify Songs 2023. 2023. Kaggle, https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023.

OpenAI. ChatGPT. 2024. OpenAI, https://openai.com/chatgpt.

