# Predicting Spotify streams

In this assignment, you will analyze a dataset of the most popular songs on Spotify and try to predict the number of streams. You can find the dataset and a description of the variables here: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

The dataset is also available to download from Brightspace. If you download it from Kaggle, note that the value in `df.loc[576, 'streams']` (cell I576 if you open it in Excel) is not a number so you need to deal with that in some way or you'll get errors when fitting the model. I've replaced it with a random number in the version on Brightspace.
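
If you work from the Kaggle version, one possible way to deal with the non-numeric value is to coerce the column to numbers and then decide what to do with the row that fails to parse. The sketch below uses a small made-up dataframe instead of the real file; the values and the choice to drop the row are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-in for the Kaggle file: one 'streams' entry is not a number
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "streams": ["141381703", "not a number", "133716286"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["streams"] = pd.to_numeric(toy["streams"], errors="coerce")

# Inspect the rows that failed to parse before deciding how to treat them
bad_rows = toy[toy["streams"].isna()]
print(bad_rows)

# One option (an assumption, not the only choice): drop those rows
clean = toy.dropna(subset=["streams"]).copy()
clean["streams"] = clean["streams"].astype("int64")
print(clean.dtypes)
```

Dropping is only one choice; replacing the value with, for example, the column median would keep the row at the cost of inventing a number.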

Through the Kaggle link above, you can also find Python notebooks by other people who have analyzed the dataset! Feel free to go through them for inspiration, but remember that you should only hand in your own work. If you use large parts of code from an existing notebook, you have to put a comment above the code with a link to where you found it.

## The data

* The datafile `spotify-2023.csv` has 24 columns and 953 rows;
* For the variable descriptions, check the Kaggle link;
* The variable that you need to predict is `streams`. Note that we are predicting the number of streams, so this is a regression task!

## Compulsory elements of the task

1. [2 points] Load the dataset into a `pandas` dataframe, have a look at the first few lines. Briefly describe each column and the target variable. Are they numerical or categorical? What are the measurement units?
2. [2 points] The goal of the assignment is to make a prediction model. Explain which variables you want to use, as well as which prediction model(s), and why.
3. [2 points] Show basic descriptives (mean, proportion, distribution, range, etc.) of the variables that you plan to use in your prediction model, either numerically, graphically or a combination of the two.
4. [1 point] Clean the data, making sure data types are correct for each column. Convert categorical variables to dummy variables, check for and deal with missing data, etc.
5. [1 point] Create at least one additional column to use in your model.
6. [2 points] Use `.groupby()` or `.pivot_table()` in combination with row filtering at least once to demonstrate something related to your analysis.
7. [1 point] Split your data into a train and test set in a sensible way, and explain this split.  
8. [1 point] Train your model and predict the target value for the test set.
9. [3 points] Evaluate your results using relevant metrics to check the quality of your predictions. Investigate feature importances and whether they seem sensible to you.
10. [3 points] Write a conclusion: do you think you succeeded in building a good prediction model? Where could you improve? If you tried something that didn't work, why do you think it didn't work? Did you get the results you expected?


**Sum: 18 points. Partial points can be obtained for each element. Your grade will be equal to `1 + points/2`.**

Use your creativity and any method you'd like to try to achieve the best possible score for your prediction model. You can even use deep learning if you feel comfortable with that! Bonus points can be awarded for creativity.
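
Element 6 asks for `.groupby()` or `.pivot_table()` combined with row filtering. As a rough illustration of what that combination looks like, here is a minimal sketch on made-up rows; the column names match the dataset, but the values, the year cut-off and the choice of grouping variable are all assumptions.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    "released_year": [2021, 2022, 2022, 2023, 2023, 2023],
    "mode": ["Major", "Minor", "Major", "Major", "Minor", "Minor"],
    "streams": [900, 400, 600, 100, 300, 500],
})

# Row filtering: keep only songs released from 2022 onwards
recent = toy[toy["released_year"] >= 2022]

# Group the filtered rows and summarise the target per group
mean_streams = recent.groupby("mode")["streams"].mean()
print(mean_streams)
```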


## Getting started

To help you get started with this regression analysis, I've fitted and evaluated a simple linear regression model on this dataset. Linear regression is quite a simple model, so your machine learning model will likely perform much better! Note that I have not done all the steps required for the assignment; your analysis should be much more in-depth!

In [88]:
import pandas as pd
df = pd.read_csv('spotify-2023.csv', encoding='latin-1') # Note that without specifying the encoding, pandas won't read the csv!
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    int64 
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

By default, pandas hides some columns when a dataframe is wide, so we change the display setting to show all of them.

In [None]:
pd.set_option('display.max_columns', None)
df.head(10)

So we have 24 columns in total.

Categorical data:
1. track_name
2. artist(s)_name
3. mode
4. key

Numerical data and their units:
- artist_count: number
- released_year: number(year)
- released_month: month
- released_day: day
- in_spotify_playlists: number
- in_spotify_charts: number
- streams: number
- in_apple_playlists: number
- in_apple_charts: number
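- in_deezer_playlists: number (read as text by pandas, see the `object` dtype in `df.info()`)
- in_deezer_charts: number
- in_shazam_charts: number (read as text by pandas, and 50 of the 953 values are missing)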
- bpm: beats per minute
- danceability_%: percentage
- valence_%: percentage
- energy_%: percentage
- acousticness_%: percentage
- instrumentalness_%: percentage
- liveness_%: percentage
- speechiness_%: percentage
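
The `df.info()` output above lists `in_deezer_playlists` and `in_shazam_charts` as `object`, so they need converting before they can be used as numbers. A minimal sketch of such a conversion, assuming the cause is a thousands separator such as `1,234` (this is an assumption; check the actual values first). The data below is made up.

```python
import pandas as pd

# Made-up values imitating a count column that was read as text
toy = pd.DataFrame({"in_deezer_playlists": ["45", "1,234", "58", None]})

# Remove the assumed thousands separator, then parse;
# entries that still cannot be parsed (and missing ones) become NaN
as_text = toy["in_deezer_playlists"].str.replace(",", "", regex=False)
toy["in_deezer_playlists"] = pd.to_numeric(as_text, errors="coerce")

print(toy["in_deezer_playlists"])
print("missing values:", toy["in_deezer_playlists"].isna().sum())
```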

I would like to use as many variables as possible. In particular, I would use the artist name, since more popular artists tend to get more streams and less popular artists fewer. The month of release also matters, because it could affect a song's stream count, especially if the song comes out in summer or winter. Moreover, all the numerical variables are potentially useful and together can play a major role in my model.
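
The summer/winter idea could be turned into an extra column by mapping the release month to a season. This is only a sketch on made-up months: the mapping dictionary, the column name `release_season` and the northern-hemisphere definition of the seasons are assumptions, not part of the dataset.

```python
import pandas as pd

# Made-up release months; in the notebook this would be df["released_month"]
toy = pd.DataFrame({"released_month": [1, 4, 7, 10, 12]})

# Hypothetical mapping from month number to (northern-hemisphere) season
season_of_month = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
toy["release_season"] = toy["released_month"].map(season_of_month)

# Dummy-code the season so a regression model can use it
season_dummies = pd.get_dummies(toy["release_season"], prefix="season", drop_first=True)
print(toy.join(season_dummies))
```

With `drop_first=True` the alphabetically first season becomes the reference category, just as key `A` does in the encoding of `key` below.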

In [95]:
from sklearn.preprocessing import OneHotEncoder

cat_data = df[["key"]]


In [96]:
pd.get_dummies(cat_data, dummy_na=True, drop_first=True)

Unnamed: 0,key_A#,key_B,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#,key_nan
0,False,True,False,False,False,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
948,False,False,False,False,False,False,False,False,False,False,False
949,False,False,False,False,False,False,False,True,False,False,False
950,False,False,True,False,False,False,False,False,False,False,False
951,False,False,True,False,False,False,False,False,False,False,False


In [97]:
df_2 = cat_data


In [98]:
df_2

Unnamed: 0,key
0,B
1,C#
2,F
3,A
4,A
...,...
948,A
949,F#
950,C#
951,C#


In [99]:
ohe = OneHotEncoder(categories='auto', drop='first')

In [100]:
ohe.fit(df_2.fillna("Missing"))

In [102]:
ohe.get_feature_names_out(["key"])

array(['key_A#', 'key_B', 'key_C#', 'key_D', 'key_D#', 'key_E', 'key_F',
       'key_F#', 'key_G', 'key_G#', 'key_Missing'], dtype=object)

In [103]:
df_3 = ohe.transform(df_2.fillna("Missing")).toarray()

In [104]:
pd.DataFrame(df_3, columns=ohe.get_feature_names_out(["key"]))

Unnamed: 0,key_A#,key_B,key_C#,key_D,key_D#,key_E,key_F,key_F#,key_G,key_G#,key_Missing
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
949,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
950,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
951,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
pd.merge
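
The bare `pd.merge` above does not call anything. If the intent is to attach the encoded `key` columns to the original dataframe, one option is `pd.concat` along the columns, since both frames have the same rows in the same order. The sketch below uses made-up data and `pd.get_dummies` in place of the fitted encoder for brevity; the names `toy`, `key_dummies` and `toy_encoded` are hypothetical.

```python
import pandas as pd

# Made-up stand-in for df with a 'key' column that has a missing value
toy = pd.DataFrame({"track_name": ["a", "b", "c"], "key": ["B", None, "C#"]})

# Encode 'key' as in the cells above (missing values become their own category)
key_dummies = pd.get_dummies(toy[["key"]].fillna("Missing"), drop_first=True)

# Both frames share the same index, so concatenating column-wise lines the rows up
toy_encoded = pd.concat([toy.drop(columns="key"), key_dummies], axis=1)
print(toy_encoded)
```

If the encoded array comes from `OneHotEncoder` instead, wrapping it in a dataframe with `index=df.index` before concatenating keeps the rows aligned in the same way.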

In [None]:
# Create a new variable - we will create a variable that equals 1 when the song is by
# Taylor Swift, and 0 if it is not
def TSinartists(artists):
  if 'Taylor Swift' in artists:
    return 1
  else:
    return 0

df['TaylorSwift'] = df['artist(s)_name'].map(TSinartists)

# Keep only the variables we want to use: artist_count, released_month, danceability_%, and TaylorSwift
# and of course what we want to predict: streams!
df2 = df[['artist_count', 'released_month', 'danceability_%', 'TaylorSwift', 'streams']]
df2.head()

Please note that I am only selecting four predictor variables to show you this example. For your analysis, only exclude columns if you have good reason to! If a column is relevant and could add information to the model, don't just leave it out.

In [None]:
# Make sure all variables are numerical
df2.info()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error

# Split in train and test
train, test = train_test_split(df2, test_size=0.3)
X_train = train[['artist_count', 'released_month', 'danceability_%', 'TaylorSwift']]
X_test = test[['artist_count', 'released_month', 'danceability_%', 'TaylorSwift']]
y_train = train['streams']
y_test = test['streams']

# Fit the model to the train data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict streams for test data
y_predicted = model.predict(X_test)

# Calculate error (lower is better, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error)
print(f'MAE is {round(median_absolute_error(y_test, y_predicted))}')

We are using the median absolute error (abbreviated MAE here, not to be confused with the mean absolute error) to assess our model's prediction accuracy. Unlike accuracy in a classification task, this value is not very interpretable on its own. Interpret it as follows: for every observation in the test set, we calculate the absolute difference between the true and predicted value, then we take the median of all those absolute differences. In my run this was equal to 317 million, so the predictions are off by a lot of streams! Because `train_test_split` is called without a fixed `random_state`, the exact value will differ from run to run.
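
To make the calculation concrete, here is a small sketch that computes the median absolute error by hand with NumPy. The numbers are made up and much smaller than real stream counts; the mean absolute error is shown next to it only to illustrate how a single large miss affects the two differently.

```python
import numpy as np

# Made-up true and predicted stream counts
y_true = np.array([100, 250, 400, 900, 1500])
y_pred = np.array([150, 200, 700, 800, 500])

# Absolute difference per observation: 50, 50, 300, 100, 1000
abs_errors = np.abs(y_true - y_pred)

# The median is barely affected by the single very large miss; the mean is
median_abs_error = np.median(abs_errors)
mean_abs_error = abs_errors.mean()
print(median_abs_error, mean_abs_error)
```

Here the median absolute error is 100 while the mean absolute error is 300, because the one prediction that is off by 1000 pulls the mean up but not the median.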

Let's visualize the predictions. The straight orange line is the line that all points would lie on if all predicted values were exactly correct. We see that this model is not good at predicting streams at all! The reason is that linear regression models only work well if there is a linear relationship between the number of streams and the variables we use to predict them. Apparently this is not the case, so we need a better model! I will leave that up to you 😺



In [None]:
import matplotlib.pyplot as plt

plt.scatter(y_test, y_predicted)
plt.xlabel('Actual number of streams')
plt.ylabel('Predicted number of streams')
plt.plot([0,1000000000],[0,1000000000], c='orange')
plt.show()