<h1> Spotify Skip Prediction Dataset </h1>
This dataset comes in two parts. The first describes 'sessions': runs of songs a user listens to in one sitting, along with which songs were played. The second describes each song's features. <br>
Our analysis uses only the mini set available on AIcrowd. Our input is an augmented table that joins the user session data with the song feature data.
There are 167880 entries and 50 total features. Only 47 features will be used.

<h3>References</h3>
We'd like to acknowledge that, due to the size of this dataset and the complexity of how it was stored (in multiple separate, unorganized .csv files), we used online references to decide on our stack and on how to approach the data. <br>
We used the following example: <br> <br>
<li> <a>https://github.com/a-poor/spotify-skip-prediction/blob/master/README.md</a>
<br><i>Used as a template for the tech stack and for reorganizing the dataset. </i>

<h2>Import Libraries and Datasets</h2>

In [48]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load both csv files into dataframes
log_df = pd.read_csv('log_mini.csv') # user log
tf_df  = pd.read_csv('tf_mini.csv')  # track features
 
# rename the track id column so that both data frames share the 'track_id' key
log_df = log_df.rename(columns={'track_id_clean': 'track_id'})

# perform a merge so that song information is attached to the user information
og_data_df = pd.merge(log_df, tf_df, on='track_id')

# Save the merged DataFrame to a new CSV file
og_data_df.to_csv('merged_file.csv', index=False)
data_df = og_data_df.copy()  # work on a copy so og_data_df keeps the original merged data
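
A minimal sketch of how the merge above could be sanity-checked. It uses two small made-up frames (`toy_log` and `toy_tf` are hypothetical stand-ins, and their values are not taken from the dataset) to show the `validate` and `indicator` options of `pd.merge`. Note that the cell above uses the default inner join, while this sketch uses a left join so that unmatched session rows stay visible.

```python
import pandas as pd

# Toy stand-ins for log_df and tf_df; the values are invented for illustration.
toy_log = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "track_id": ["t_a", "t_b", "t_a"],
})
toy_tf = pd.DataFrame({
    "track_id": ["t_a", "t_b"],
    "duration": [180.0, 240.5],
})

# validate="many_to_one" raises a MergeError if a track_id is duplicated in toy_tf.
# how="left" with indicator=True keeps every log row and marks rows with no match.
toy_merged = pd.merge(
    toy_log, toy_tf, on="track_id",
    how="left", validate="many_to_one", indicator=True,
)

# Count session rows that found no matching track features.
unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(toy_merged.shape, "unmatched rows:", unmatched)
```

If `unmatched` were greater than zero on the real data, an inner join would silently drop those rows.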

In [49]:
print(data_df.head(1))
print(data_df.shape)

                               session_id  session_position  session_length  \
0  0_00006f66-33e5-4de7-a324-2d18e439fc1e                 1              20   

                                 track_id  skip_1  skip_2  skip_3  \
0  t_0479f24c-27d2-46d6-a00c-7ec928f2b539   False   False   False   

   not_skipped  context_switch  no_pause_before_play  short_pause_before_play  \
0         True               0                     0                        0   

   long_pause_before_play  hist_user_behavior_n_seekfwd  \
0                       0                             0   

   hist_user_behavior_n_seekback  hist_user_behavior_is_shuffle  hour_of_day  \
0                              0                           True           16   

         date  premium        context_type hist_user_behavior_reason_start  \
0  2018-07-15     True  editorial_playlist                       trackdone   

  hist_user_behavior_reason_end    duration  release_year  \
0                     trackdone  180.0666

# Reorganizing the Data
We decided to select the variable skip_3 as our 'y' variable. The 'skip_3' feature represents when a 

In [50]:
# Making a 'skipped' feature for whether a song has been skipped or not, regardless of how fast
data_df['skipped'] = (data_df.skip_3 | data_df.skip_2 | data_df.skip_1).astype('int32')

# Make 'skipped' column our 'y' value for prediction
y_df = data_df['skipped']

# Drop the raw skip flags. Note that 'skipped' itself is still a column of data_df at this point.
data_df = data_df.drop(columns=["skip_1", "skip_2", "skip_3", "not_skipped"])
# Optional print just to check features
print(data_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167880 entries, 0 to 167879
Data columns (total 47 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   session_id                       167880 non-null  object 
 1   session_position                 167880 non-null  int64  
 2   session_length                   167880 non-null  int64  
 3   track_id                         167880 non-null  object 
 4   context_switch                   167880 non-null  int64  
 5   no_pause_before_play             167880 non-null  int64  
 6   short_pause_before_play          167880 non-null  int64  
 7   long_pause_before_play           167880 non-null  int64  
 8   hist_user_behavior_n_seekfwd     167880 non-null  int64  
 9   hist_user_behavior_n_seekback    167880 non-null  int64  
 10  hist_user_behavior_is_shuffle    167880 non-null  bool   
 11  hour_of_day                      167880 non-null  int64  
 12  da

## Dealing with Non-Float Values
The IDs of the songs and the sessions are strings. We've chosen to drop these columns entirely. While it is reasonable to assume they affect the predicted value, we opt to model whether a song will be skipped in general, rather than whether it will be skipped given previous skips and sessions. There are 10,000 sessions in the mini dataset, and we are not sure how to handle a categorical value of that size with the resources we have.


In [51]:
# drop id values
data_df = data_df.drop(columns=["session_id", "track_id"], axis=1)

In [52]:
# split the session date into separate parts; the day of the month is dropped
session_date = pd.to_datetime(data_df['date'])  # parse once and reuse
data_df['session_year'] = session_date.dt.year
data_df['session_month'] = session_date.dt.month
# data_df['day'] = session_date.dt.day
data_df['session_day_of_week'] = session_date.dt.dayofweek
data_df = data_df.drop('date', axis=1)
print(data_df.head(3))



   session_position  session_length  context_switch  no_pause_before_play  \
0                 1              20               0                     0   
1                 7              12               0                     0   
2                 6              20               0                     0   

   short_pause_before_play  long_pause_before_play  \
0                        0                       0   
1                        1                       1   
2                        1                       1   

   hist_user_behavior_n_seekfwd  hist_user_behavior_n_seekback  \
0                             0                              0   
1                             0                              0   
2                             0                              0   

   hist_user_behavior_is_shuffle  hour_of_day  premium        context_type  \
0                           True           16     True  editorial_playlist   
1                          False           17     Tru

In [53]:
data_df['premium'] = data_df['premium'].astype(int)
# print(data_df['premium'].head(5))

# hist_user_behavior_is_shuffle
data_df['hist_user_behavior_is_shuffle'] = data_df['hist_user_behavior_is_shuffle'].astype(int)
# print(data_df['hist_user_behavior_is_shuffle'].head(5))

data_df['mode'] = data_df['mode'].map({'major':1, 'minor':0})
# print(data_df['mode'].head(5))










0    major
1    major
2    major
3    major
4    major
Name: mode, dtype: object
0    1
1    1
2    1
3    1
4    1
Name: mode, dtype: int64
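
One thing worth checking after a dictionary `map` like the one above: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses a made-up series (`toy_mode` is hypothetical) to show a quick check for that.

```python
import pandas as pd

# Toy stand-in for the 'mode' column; the values are invented for illustration.
toy_mode = pd.Series(["major", "minor", "major"])
mode_map = {"major": 1, "minor": 0}

mapped = toy_mode.map(mode_map)

# Values missing from mode_map would show up as NaN after the map.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), "unmapped:", n_unmapped)
```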


### Categorical Variables
The following variables are categorical in nature:
* time_signature
* key
* context_type
* hist_user_behavior_reason_start
* hist_user_behavior_reason_end
<br><br>Let's look at how many distinct values each column has to decide whether one-hot encoding or ordinal encoding is more advantageous. 

In [55]:
# named categorical_cols rather than `list`, which would shadow the built-in
categorical_cols = ['time_signature', 'key', 'context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end']

for col in categorical_cols:
    unique_values = data_df[col].unique()
    print(col, ": ", unique_values)

time_signature :  [4 5 3 1 0]
key :  [ 1  7 10  8  6  5  4  2  0  3  9 11]
context_type :  ['editorial_playlist' 'user_collection' 'catalog' 'radio' 'charts'
 'personalized_playlist']
hist_user_behavior_reason_start :  ['trackdone' 'fwdbtn' 'appload' 'playbtn' 'clickrow' 'backbtn' 'remote'
 'endplay' 'trackerror']
hist_user_behavior_reason_end :  ['trackdone' 'endplay' 'fwdbtn' 'backbtn' 'remote' 'logout' 'clickrow']
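
When only the number of distinct values matters, `nunique()` gives a more compact view than printing every value. A small sketch on a made-up frame (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Toy frame with two categorical columns; the values are invented for illustration.
toy_df = pd.DataFrame({
    "time_signature": [4, 4, 3, 4],
    "context_type": ["radio", "charts", "radio", "catalog"],
})

# Number of distinct values per column, largest first.
cardinality = toy_df.nunique().sort_values(ascending=False)
print(cardinality)
```

Columns with few distinct values add few columns under one-hot encoding, which is one way to judge whether that encoding is affordable.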


### Analyzing Unique Values
For *time_signature*, because its values are ordered, we will use *ordinal* encoding.
For *context_type*, *key*, *hist_user_behavior_reason_start*, and *hist_user_behavior_reason_end* we will use *one-hot* encoding, as there seems to be no inherent order to the values. 

Let's make the changes now!

## One-Hot Encoding


In [47]:
from sklearn.preprocessing import OneHotEncoder

onehot_cols = ['context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end']

encoder = OneHotEncoder()  # No 'sparse' parameter
encoded_columns = encoder.fit_transform(data_df[onehot_cols])
encoded_df = pd.concat(
    [data_df.drop(columns=onehot_cols), pd.DataFrame(encoded_columns.toarray())],
    axis=1,
)

print(encoded_df.columns)
# print(data_df['context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end'])


Index([             'session_position',                'session_length',
                      'context_switch',          'no_pause_before_play',
             'short_pause_before_play',        'long_pause_before_play',
        'hist_user_behavior_n_seekfwd', 'hist_user_behavior_n_seekback',
       'hist_user_behavior_is_shuffle',                   'hour_of_day',
                                'date',                       'premium',
                            'duration',                  'release_year',
              'us_popularity_estimate',                  'acousticness',
                       'beat_strength',                    'bounciness',
                        'danceability',                'dyn_range_mean',
                              'energy',                      'flatness',
                    'instrumentalness',                           'key',
                            'liveness',                      'loudness',
                           'mechanism',            

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# time_signature is the only column we chose to encode ordinally
feature_columns = ['time_signature']
encoder = OrdinalEncoder()

# Fit-transform the specified columns and replace the original values with encoded values
data_df[feature_columns] = encoder.fit_transform(data_df[feature_columns])
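
A small sketch of the same idea with the category order written out explicitly. The order `[0, 1, 3, 4, 5]` is taken from the unique time signatures printed earlier; `toy_df`, `ts_order`, and `toy_ordinal` are hypothetical names. For numeric columns the default setting also sorts ascending, so the explicit list mainly documents the order and makes the encoder reject values outside it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column holding the time signatures seen in the mini dataset.
toy_df = pd.DataFrame({"time_signature": [4, 5, 3, 1, 0, 4]})

# One list of categories per encoded column, in ascending order.
ts_order = [[0, 1, 3, 4, 5]]
toy_ordinal = OrdinalEncoder(categories=ts_order)

# Each value is replaced by its position in ts_order (as a float).
toy_df["time_signature_ord"] = toy_ordinal.fit_transform(
    toy_df[["time_signature"]]
).ravel()
print(toy_df)
```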

### Check for Missing Values

In [19]:
# check for missing values
missing_values = data_df.isnull().sum()

# display columns with missing values and counts
print(missing_values[missing_values > 0])

Series([], dtype: int64)


### Split Data into Test and Training Sets

In [20]:
X_train, X_test, y_train, y_test = train_test_split(data_df, y_df, test_size=0.2, random_state=42)
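
In the cells above, the 'skipped' column is added to data_df and never dropped, so passing data_df as the feature table would hand the model its own target. A minimal sketch of a split that separates the two first, shown on a made-up frame (`toy_df` and the `toy_`/short variable names are hypothetical). The `stratify` argument is an optional addition that keeps the skip rate similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with two features and the target; the values are invented.
toy_df = pd.DataFrame({
    "session_position": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "premium": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "skipped": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Separate the target from the features before splitting.
toy_y = toy_df["skipped"]
toy_X = toy_df.drop(columns=["skipped"])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(X_tr.shape, X_te.shape)
```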

In [21]:
# value_counts = data_df['session_id'].value_counts()

# # Display the counts of each unique value in the column
# print(value_counts)

# Model #1: Linear Regression
Our first model will be a linear regression model using scikit-learn. 


In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

In [23]:
model = LinearRegression()
model.fit(X_train, y_train)

ValueError: could not convert string to float: '2018-07-14'
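
The error message suggests that a string column (a date, judging by the value shown) was still in the feature table when `fit` was called, possibly because this cell was run before the date column was dropped. A quick way to list such columns before fitting, sketched on a made-up frame (`toy_X` is hypothetical):

```python
import pandas as pd

# Toy feature table with one numeric and two string columns; values are invented.
toy_X = pd.DataFrame({
    "hour_of_day": [16, 17],
    "date": ["2018-07-14", "2018-07-15"],
    "context_type": ["radio", "charts"],
})

# Columns that are not numeric still need encoding or dropping before fit().
non_numeric = toy_X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

An empty list here would mean every remaining column is numeric.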