## Math 475 Final Project

#### Data Exploration and Preprocessing

Bryson Herron

#### Necessary imports:

In [2]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

#### Loading Data:

In [3]:
# List each JSON file individually
file_names = ['Streaming_History_Audio_2021-2023_0.json', 'Streaming_History_Audio_2023-2024_1.json', 'Streaming_History_Audio_2024_2.json']

# Initialize an empty list to store DataFrames
data_frames = []

# Loop through each specified file and load it into a DataFrame
for file in file_names:
    try:
        df = pd.read_json(file)
        data_frames.append(df)
    except ValueError as e:
        print(f"Could not read {file}: {e}")

# Concatenate all the DataFrames into a single DataFrame
if data_frames:
    spotify_data = pd.concat(data_frames, ignore_index=True)
else:
    print("No data frames to concatenate. Check the JSON file contents or file names.")


My Spotify dataset contains 42894 song entries. Each entry contains the username of the account that listened, personal account identifiers, song name, time played, album name, album artist name, and information on how the song was selected and why it ended. There are also additional columns that are used when podcasts are included, but I have chosen to leave out my podcast data for this project as I want to focus on the songs.

In [4]:
print(spotify_data.shape)
print()
print(spotify_data.columns)

(42894, 21)

Index(['ts', 'username', 'platform', 'ms_played', 'conn_country',
       'ip_addr_decrypted', 'user_agent_decrypted',
       'master_metadata_track_name', 'master_metadata_album_artist_name',
       'master_metadata_album_album_name', 'spotify_track_uri', 'episode_name',
       'episode_show_name', 'spotify_episode_uri', 'reason_start',
       'reason_end', 'shuffle', 'skipped', 'offline', 'offline_timestamp',
       'incognito_mode'],
      dtype='object')


#### EDA:

I found that, in addition to useful song data, there are also many columns of Spotify user information such as IP address, connected country, etc. These will be removed during data processing, as they are not useful to the prediction algorithm. I also print the number of null values in each column, which helps when deciding how to handle them.

#### Finding null values:

In [5]:
# Check for null values in the DataFrame
null_counts = spotify_data.isnull().sum()

# Filter columns with null values
columns_with_nulls = null_counts[null_counts > 0]

# Print the results
print("Number of null values in each column:")
print(null_counts)

print("\nColumns with null values:")
if columns_with_nulls.empty:
    print("No columns with null values.")
else:
    print(columns_with_nulls)

Number of null values in each column:
ts                                       0
username                                 0
platform                                 0
ms_played                                0
conn_country                             0
ip_addr_decrypted                        0
user_agent_decrypted                  7532
master_metadata_track_name             145
master_metadata_album_artist_name      145
master_metadata_album_album_name       145
spotify_track_uri                      145
episode_name                         42751
episode_show_name                    42751
spotify_episode_uri                  42751
reason_start                             0
reason_end                               0
shuffle                                  0
skipped                              13116
offline                                  0
offline_timestamp                        0
incognito_mode                           0
dtype: int64

Columns with null values:
user_agent_decrypte

Note that most of the null/NaN values come from personal-information and podcast columns, which will be removed.

#### Data cleaning:

This is where I remove the unnecessary user information. I also encode all non-numerical columns using two different methods. I use frequency encoding for song names, album names, and album artists; since these columns have a very large number of unique values, I feel this is the best approach. I then use one-hot (dummy) encoding via `pd.get_dummies` for reason_start and reason_end, as they contain only a small number of distinct values each.

In [6]:
# Clean DataFrame to remove columns with excessive nulls / irrelevant to this project
spotify_data = spotify_data.drop(columns=['platform', 'ts', 'username', 'user_agent_decrypted', 'episode_name','episode_show_name','spotify_episode_uri', 'offline', 'offline_timestamp', 'incognito_mode', 'ip_addr_decrypted', 'conn_country'])

# Convert 'ms_played' from milliseconds to seconds
spotify_data['ms_played'] = spotify_data['ms_played'] / 1000

# Rename the column to 'seconds_played'
spotify_data.rename(columns={'ms_played': 'seconds_played'}, inplace=True)

# Flag whether 'skipped' was recorded: NaN -> 0, any non-NaN value (True or False) -> 1
spotify_data['skipped'] = spotify_data['skipped'].notna().astype(int)

# Frequency encoding for names of songs albums and artists
spotify_data['song_freq'] = spotify_data['master_metadata_track_name'].map(spotify_data['master_metadata_track_name'].value_counts())
spotify_data['album_freq'] = spotify_data['master_metadata_album_album_name'].map(spotify_data['master_metadata_album_album_name'].value_counts())
spotify_data['album_freq_artist'] = spotify_data['master_metadata_album_artist_name'].map(spotify_data['master_metadata_album_artist_name'].value_counts())

# One-hot (dummy) encoding for 'reason_start' and 'reason_end'; drop_first removes one redundant column per feature
spotify_data = pd.get_dummies(spotify_data, columns=['reason_start', 'reason_end'], drop_first=True)

# Handle NaNs
spotify_data.fillna(0, inplace=True)
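
The two encodings used in the cell above can be seen more easily on a toy frame. This is only a minimal sketch: the `toy` DataFrame and its `track` column are made up for illustration and are not part of the Spotify export.

```python
import pandas as pd

# Toy data (hypothetical); column names are illustrative only
toy = pd.DataFrame({
    "track": ["A", "B", "A", "C", "A"],
    "reason_end": ["trackdone", "fwdbtn", "trackdone", "endplay", "fwdbtn"],
})

# Frequency encoding: replace each track name with the number of times it appears
toy["track_freq"] = toy["track"].map(toy["track"].value_counts())

# One-hot encoding: one indicator column per category;
# drop_first=True drops the first category in sorted order ("endplay" here)
encoded = pd.get_dummies(toy, columns=["reason_end"], drop_first=True)

print(encoded.columns.tolist())
print(encoded["track_freq"].tolist())
```

Frequency encoding keeps a high-cardinality column as a single numeric feature, while one-hot encoding adds one column per category, which is why it suits the low-cardinality reason columns better.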


#### Recount nulls:

Many of the columns containing nulls were columns I was already removing for being unhelpful, so dropping them cleared a large number of the nulls. For the remaining nulls I used fillna and set them to 0. I felt this was the best option for the dataset, as the total number of remaining nulls was relatively small, and setting them to zero avoids errors further down when performing feature selection.

In [7]:
# Check for null values in the DataFrame
null_counts = spotify_data.isnull().sum()

# Filter columns with null values
columns_with_nulls = null_counts[null_counts > 0]

# Print the results
print("Number of null values in each column:")
print(null_counts)

print("\nColumns with null values:")
if columns_with_nulls.empty:
    print("No columns with null values.")
else:
    print(columns_with_nulls)

Number of null values in each column:
seconds_played                             0
master_metadata_track_name                 0
master_metadata_album_artist_name          0
master_metadata_album_album_name           0
spotify_track_uri                          0
shuffle                                    0
skipped                                    0
song_freq                                  0
album_freq                                 0
album_freq_artist                          0
reason_start_backbtn                       0
reason_start_clickrow                      0
reason_start_fwdbtn                        0
reason_start_playbtn                       0
reason_start_remote                        0
reason_start_trackdone                     0
reason_start_trackerror                    0
reason_start_unknown                       0
reason_end_endplay                         0
reason_end_fwdbtn                          0
reason_end_logout                          0
reason_end_remote

#### Split data into training and test sets then scale it:

In [8]:
# Define x as a DataFrame of all features, excluding the target and the raw text columns
x = spotify_data.drop(['seconds_played', 'spotify_track_uri', 'master_metadata_track_name', 'master_metadata_album_album_name', 'master_metadata_album_artist_name'], axis=1)

# Define y as seconds played
y = spotify_data['seconds_played']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the training data
X_train = scaler.fit_transform(X_train)

# Transform the test data (using the same scaler fitted on the training data)
X_test = scaler.transform(X_test)
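
The reason for fitting the scaler on the training split only is that the test rows should not influence the learned mean and standard deviation. The sketch below shows the same pattern on synthetic data; `X_demo`, `y_demo`, and `demo_scaler` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic features
y_demo = rng.normal(size=200)                            # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean/std learned from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # the same mean/std reused on the test rows

# Training columns end up with mean ~0; test columns are close to 0 but not exactly
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(6))
```

Tree-based models such as random forests split on thresholds, so they are largely insensitive to this kind of rescaling; the step matters more if a scale-sensitive model is tried later.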

#### Feature Selection Using a RandomForestRegressor model:

This code trains a Random Forest regression model using the RandomForestRegressor from scikit-learn with 500 estimators and a fixed random seed. I felt this was a good approach as many of the features would have a non-linear relationship to the target variable and this model is well suited to handle this scenario. After training the model on the training data (X_train and y_train), it calculates the importance of each feature, sorts them in descending order, and outputs the top 5 most important features. Finally, my model makes predictions on the test set (X_test), and the R² score is calculated to evaluate its performance.

In [25]:
# Initialize the Random Forest model
rf_model = RandomForestRegressor(n_estimators=500, random_state=42)  # Note that the 500 estimators do take a fair amount of time and computing power

# Train the model
rf_model.fit(X_train, y_train)

# Get feature importances
importances = rf_model.feature_importances_
features = x.columns

# Create a DataFrame with feature names and their importance
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance and select the top 5
top_features = importance_df.sort_values(by='Importance', ascending=False).head(5)

# Extract top 5 feature names as a list
top_feature_names = top_features['Feature'].tolist()
print("Top 5 features:", top_feature_names)

Top 5 features: ['song_freq', 'album_freq_artist', 'album_freq', 'reason_end_trackdone', 'skipped']
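
As a possible cross-check on the ranking above, scikit-learn also provides permutation importance, which measures how much the score drops when one feature is shuffled. This is a different technique from the impurity-based `feature_importances_` used in this project, and the sketch below runs on synthetic data (`X_demo`, `y_demo`, and `demo_model` are hypothetical names) rather than on the Spotify features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only column 0 drives the target; columns 1 and 2 are pure noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

demo_model = RandomForestRegressor(n_estimators=50, random_state=42)
demo_model.fit(X_demo, y_demo)

# Shuffle each feature several times and record the average drop in R²
result = permutation_importance(
    demo_model, X_demo, y_demo, n_repeats=5, random_state=42
)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
```

On the real data, the same call would normally be made with the fitted `rf_model` and the held-out `X_test` and `y_test`, at the cost of extra prediction time.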


#### Evaluation:

In [26]:
# Predict and evaluate
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2}")

R² Score: 0.5788891971957113


The result I achieved is an R² score of 0.5789. While this is not a very high score, I feel that it is fair for this dataset for one major reason: the biggest factor in determining how long a song will play is the mood of the person listening to it. Some days a song will come on and be listened to in its entirety; another day the listener could be in a different mood and skip it right away. There are a large number of external factors in listening to music that you could never hope to quantify or categorize. It is for these reasons that I feel an R² score of 0.5789 is very good for a model of this size and complexity.
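
For reference, R² is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A score of 0.5789 therefore means the model accounts for roughly 58% of the variance in seconds played on the test set. The sketch below checks the formula against `r2_score` using made-up numbers, not values from the project.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([30.0, 180.0, 5.0, 210.0])   # seconds actually played
y_hat = np.array([60.0, 150.0, 20.0, 200.0])   # predicted seconds played

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, r2_score(y_true, y_hat))
```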

#### Challenges to overcome:

The primary challenge for this dataset was the non-measurable factors. I touched on this above, but one of these factors is listener mood. The general mood of the listener is the single biggest factor in how long a song will be listened to. Not only can this not be measured through simple data collection and analysis, but it would also be tough to quantify. Due to factors like this, it is impossible to make a fully accurate model for this dataset. For this project, I decided to do the best I could with what I had to try to mitigate this as much as possible.

Another issue that arose was the very high computational cost of this model. For the final model, I had to significantly modify and simplify it for submission. The initial, most accurate model took over 20 minutes to train on my computer. I found this unacceptable, as any minor code change would require me to retrain the model and wait another 20 minutes to test. To resolve this, I lowered the number of estimators to 500 and removed some extra testing. This lowered the R² score from around 0.63 to 0.579, but for the amount of time saved, I considered this a much better alternative.

One final issue that was fairly easy to overcome was the large amount of unnecessary data. Since I imported this data directly from my own Spotify account, there were a lot of personal identifiers, OS information, and other unneeded data. There was also a lot of non-numerical data that needed to be encoded.

If I had more time with this project, I would experiment with different model types and features to try to achieve an even better balance of accuracy vs. computational demand.
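
One low-effort way to explore the trade-off between accuracy and training time is to time the fit for several forest sizes and to train trees in parallel with `n_jobs=-1`, a standard `RandomForestRegressor` parameter. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the tree counts are assumptions chosen for illustration), so the timings say nothing about the Spotify model itself.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 10))                            # synthetic features
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=2000)   # synthetic target

timings = {}
for n_trees in (50, 200):
    # n_jobs=-1 builds trees on all available CPU cores
    model = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1, random_state=42)
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    timings[n_trees] = time.perf_counter() - start

# Wall-clock fit time in seconds for each forest size
print(timings)
```

Recording the test R² next to each timing would show where adding more estimators stops paying off.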

