# Lab 3: Building a Spotify Song Classifier

This week's lab is a musical lab. You'll be requesting data from the Spotify API and using it to build k-nearest neighbor and decision tree models.

You have two options for completing this lab.

**Option 1**: **Classify by users**. Build models that predict whether a given song belongs to your collection or to a classmate's. This option requires that you already use Spotify, so you have enough data to work with. You will download your data from the Spotify API and then exchange it with another member of the class.

**Option 2**: **Classify by genres**. Build models that predict which genre a song belongs to. This option uses an existing Spotify dataset available from Kaggle.com (<https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify>).

## Option 1: Classify by Users
## Step 1: Setting up Spotify API
In order to use the Spotify API you must have a Spotify account. If you don't have one, sign up for a free one here: <https://www.spotify.com/us/signup>

Once you have an account, go to Spotify for developers (<https://developer.spotify.com/>) and log in. Click the green "Create a Client ID" button and fill out the form to create an app so you can access the API. In the app settings, register the redirect URI you plan to use in Step 2 (for example http://localhost:8888/callback); it has to match the value in your code exactly.

On your developer dashboard page, click on the new app you just created. On the app's dashboard page you will find your Client ID just under the header name of your app. Click "Show Client Secret" to reveal your Client Secret. You will need both the Client ID and the Client Secret to authenticate in Step 2.


In [1]:
pip install spotipy

Collecting spotipy
  Downloading spotipy-2.24.0-py3-none-any.whl.metadata (4.9 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-5.2.0-py3-none-any.whl.metadata (9.1 kB)
Downloading spotipy-2.24.0-py3-none-any.whl (30 kB)
Downloading redis-5.2.0-py3-none-any.whl (261 kB)
Installing collected packages: redis, spotipy
Successfully installed redis-5.2.0 spotipy-2.24.0
Note: you may need to restart the kernel to use updated packages.


## Step 2: Authentication

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Set your client ID and client secret
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id='your_client_id',
    client_secret='your_client_secret',
    redirect_uri='http://localhost:8888/callback',
    scope="user-library-read user-read-recently-played"
))
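
Hardcoding the client secret in a notebook makes it easy to leak when you share the notebook with your partner. One alternative, sketched below, is to keep the values in environment variables. The variable names are the ones spotipy itself looks for; the helper `load_spotify_credentials` is a hypothetical name introduced only for this illustration, and the example passes a plain dict in place of the real environment so it runs without any setup.

```python
import os

def load_spotify_credentials(env=os.environ):
    """Collect the three OAuth settings from environment variables."""
    names = {
        "client_id": "SPOTIPY_CLIENT_ID",
        "client_secret": "SPOTIPY_CLIENT_SECRET",
        "redirect_uri": "SPOTIPY_REDIRECT_URI",
    }
    # Report every missing variable at once rather than failing on the first
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}

# Placeholder values; a plain dict stands in for os.environ here
example_env = {
    "SPOTIPY_CLIENT_ID": "your_client_id",
    "SPOTIPY_CLIENT_SECRET": "your_client_secret",
    "SPOTIPY_REDIRECT_URI": "http://localhost:8888/callback",
}
creds = load_spotify_credentials(example_env)
print(sorted(creds))
```

With real environment variables set, the returned dict could be unpacked into the authentication call, as in `SpotifyOAuth(**load_spotify_credentials(), scope=...)`.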

## Step 3: Data Collection
You can use fetch_tracks() and the sp.current_user_saved_tracks() method to request all your liked tracks. It would be good if you had at least 150-200 liked tracks so the model has enough data to work with. If you don't have enough liked tracks, you can instead use sp.current_user_recently_played(), and in that case grab at least 500 recently played tracks if you can.

The Spotify API returns tracks and their associated attributes as JSON, which spotipy hands back as nested Python dictionaries rather than a dataframe. Each request returns at most 50 tracks (20 by default), so you will have to make multiple requests. Use a function to combine all your requests in one call.

Once you have your tracks, familiarize yourself with this initial dataframe. You'll need to request some additional information for the analysis. If you pass a list of track IDs to fetch_audio_features(), it will return a dataframe of audio features (danceability, energy, tempo, and so on) for those tracks.

These audio features are the predictors we are interested in, but this dataframe doesn't contain the names of the tracks. Add the track names from your liked tracks dataframe (the 'name' field inside each 'track' entry).

Find a classmate whose data you would like to use and add your partner's data to your dataset. Create a new column that contains the outcome variable you will try to predict. This variable should take two values that indicate whether the track came from your dataset or your partner's.

Fetch your liked tracks or recently played tracks:

In [None]:
import pandas as pd

def fetch_tracks(sp, fetch_function, limit=50):
    results = []
    offset = 0
    while True:
        response = fetch_function(limit=limit, offset=offset)
        items = response['items']
        if not items:
            break
        results.extend(items)
        offset += limit
    return pd.DataFrame(results)

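# Fetch liked tracks (this endpoint pages with limit/offset)
liked_tracks = fetch_tracks(sp, sp.current_user_saved_tracks)

# Fetch recently played tracks. This endpoint pages with before/after cursors
# and takes no offset argument, so it cannot go through fetch_tracks as written.
recent_tracks = pd.DataFrame(sp.current_user_recently_played(limit=50)['items'])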
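
Each row of the tracks dataframe holds a nested dictionary in its 'track' column, which is awkward to explore. A minimal sketch of one way to flatten it is shown below, using `pd.json_normalize` on the raw list of items. The `sample_items` list is made-up data shaped like a saved-tracks response, with only a few of the real fields.

```python
import pandas as pd

# Hypothetical sample shaped like the 'items' list of a saved-tracks response
sample_items = [
    {"added_at": "2024-01-01T00:00:00Z",
     "track": {"id": "id_a", "name": "Song A", "popularity": 55}},
    {"added_at": "2024-01-02T00:00:00Z",
     "track": {"id": "id_b", "name": "Song B", "popularity": 70}},
]

# Nested keys become dotted column names such as 'track.id' and 'track.name'
flat = pd.json_normalize(sample_items)
print(flat.columns.tolist())
print(flat[["track.id", "track.name"]])
```

The dotted column names make it easy to pull out IDs and names without writing a lambda for each field.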


### Get audio features for tracks

In [None]:
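def fetch_audio_features(sp, track_ids, batch_size=100):
    # The audio features endpoint accepts at most 100 IDs per request
    track_ids = list(track_ids)
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = sp.audio_features(track_ids[start:start + batch_size])
        # Tracks without audio features come back as None; skip them
        features.extend(f for f in batch if f is not None)
    return pd.DataFrame(features)

# Example: Fetch audio features for liked tracks
liked_track_ids = liked_tracks['track'].apply(lambda x: x['id'])
audio_features = fetch_audio_features(sp, liked_track_ids)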


### Combine datasets with an outcome variable

In [None]:
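# Add track names for reference, matching on track ID rather than row position
track_names = pd.DataFrame({
    'id': liked_tracks['track'].apply(lambda x: x['id']),
    'track_name': liked_tracks['track'].apply(lambda x: x['name']),
}).drop_duplicates('id')
audio_features = audio_features.merge(track_names, on='id', how='left')

# Combine with another user's data
partner_audio_features = ...  # Replace with partner's dataset
audio_features['label'] = 'your_data'
partner_audio_features['label'] = 'partner_data'
combined_data = pd.concat([audio_features, partner_audio_features], ignore_index=True)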
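
The exchange with your partner can be as simple as passing a CSV file back and forth. The sketch below assumes that workflow; the file name and the two small dataframes are made up for illustration, and a temporary directory stands in for wherever you actually share files. It also prints the class counts, since a heavily unbalanced outcome makes accuracy harder to interpret.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for two users' audio feature tables
mine = pd.DataFrame({"id": ["a1", "a2"],
                     "danceability": [0.61, 0.48],
                     "energy": [0.80, 0.35]})
partner = pd.DataFrame({"id": ["b1", "b2", "b3"],
                        "danceability": [0.72, 0.55, 0.66],
                        "energy": [0.51, 0.92, 0.40]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "partner_audio_features.csv")
    partner.to_csv(path, index=False)    # what your partner would send you
    partner_loaded = pd.read_csv(path)   # what you would read on your side

mine["label"] = "your_data"
partner_loaded["label"] = "partner_data"
combined = pd.concat([mine, partner_loaded], ignore_index=True)

# Check how many tracks each person contributed
counts = combined["label"].value_counts()
print(counts)
```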


## Option 2: Classify by Genres
1. Download and inspect Kaggle dataset:

In [None]:
import pandas as pd

spotify_data = pd.read_csv('path_to_spotify_dataset.csv')

# Filter down to two genres
genre_1_data = spotify_data[spotify_data['genre'] == 'Genre1']
genre_2_data = spotify_data[spotify_data['genre'] == 'Genre2']
combined_data = pd.concat([genre_1_data, genre_2_data])
combined_data['label'] = combined_data['genre']
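
Before settling on two genres, it helps to see how many songs each genre has. The sketch below shows one way to do that on a small made-up dataframe; the genre names 'GenreA', 'GenreB', and 'GenreC' are placeholders, not values from the Kaggle file, and picking the two largest genres is only one possible choice.

```python
import pandas as pd

# Toy data standing in for the Kaggle file; genre names are placeholders
toy = pd.DataFrame({
    "genre": ["GenreA", "GenreA", "GenreB", "GenreC", "GenreB", "GenreA"],
    "danceability": [0.7, 0.6, 0.4, 0.9, 0.5, 0.8],
})

# Count songs per genre, largest first
genre_counts = toy["genre"].value_counts()
print(genre_counts)

# Keep the two most common genres and use the genre as the outcome label
top_two = genre_counts.index[:2].tolist()
subset = toy[toy["genre"].isin(top_two)].copy()
subset["label"] = subset["genre"]
print(subset["label"].value_counts())
```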


## Step 4: Data Exploration
Use descriptive statistics and visualizations to explore the data.

In [None]:
import matplotlib.pyplot as plt

# Example: Plot danceability
combined_data['danceability'].hist(by=combined_data['label'])
plt.show()

# Example: Compare energy levels
combined_data.boxplot(column='energy', by='label')
plt.show()
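
The plots cover the visualization half of this step. For the descriptive statistics half, a grouped summary is a reasonable starting point. The sketch below uses a small made-up dataframe with the same 'label', 'danceability', and 'energy' column names as the combined data.

```python
import pandas as pd

# Toy data with the same column names as the combined dataset
toy = pd.DataFrame({
    "label": ["your_data", "your_data", "partner_data", "partner_data"],
    "danceability": [0.60, 0.80, 0.40, 0.50],
    "energy": [0.90, 0.70, 0.30, 0.50],
})

# Mean and standard deviation of each feature within each class
summary = toy.groupby("label")[["danceability", "energy"]].agg(["mean", "std"])
print(summary)

# Means alone are often enough to spot which features separate the classes
means = toy.groupby("label")[["danceability", "energy"]].mean()
print(means)
```

Features whose group means differ a lot relative to their spread are likely to be useful predictors.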


## Step 5: Modeling
1. Preprocessing

In [None]:
from sklearn.model_selection import train_test_split

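# Use the numeric audio features as predictors. ID, URL, and name columns are
# left out because the models cannot fit on strings.
feature_cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
                'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                'duration_ms', 'time_signature']
X = combined_data[feature_cols]
y = combined_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)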


2. Define models

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# KNN
knn = KNeighborsClassifier()

# Decision Tree
dt = DecisionTreeClassifier(random_state=42)

# Bagged Tree
# (scikit-learn 1.2+ names this argument `estimator`; older versions use `base_estimator`)
bagged_tree = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)

# Random Forest
random_forest = RandomForestClassifier(random_state=42)
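
KNN classifies by distance, so features on large scales (tempo, duration_ms) can drown out features on a 0 to 1 scale (danceability, energy). A common remedy is to standardize the features before fitting. The sketch below illustrates the effect on synthetic data, not on the Spotify data: one informative feature on a small scale and one pure-noise feature on a tempo-like scale. Tree-based models split on one feature at a time and generally do not need this step.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)

# Informative feature on a small scale, noise feature on a large scale
informative = y * 0.3 + rng.normal(0.5, 0.1, size=n)
noise = rng.normal(120, 30, size=n)
X = np.column_stack([informative, noise])

raw_knn = KNeighborsClassifier(n_neighbors=5)
# The pipeline refits the scaler on each training fold, so no test data leaks in
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_score = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X, y, cv=5).mean()
print(f"KNN without scaling: {raw_score:.3f}")
print(f"KNN with scaling:    {scaled_score:.3f}")
```

A pipeline like `scaled_knn` can be dropped into the model list in place of the plain KNN classifier.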


3. Train and evaluate models

In [None]:
from sklearn.metrics import classification_report, accuracy_score

models = [knn, dt, bagged_tree, random_forest]
model_names = ['KNN', 'Decision Tree', 'Bagged Tree', 'Random Forest']

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
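
Accuracy alone does not show which class a model confuses with which. A confusion matrix makes that visible. The sketch below builds one from short hand-written label lists rather than real predictions, so the counts are easy to check by eye.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only
y_true = ["your_data", "your_data", "your_data",
          "partner_data", "partner_data", "partner_data"]
y_hat = ["your_data", "your_data", "partner_data",
         "partner_data", "partner_data", "your_data"]
labels = ["your_data", "partner_data"]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)
cm_table = pd.DataFrame(cm,
                        index=[f"true_{l}" for l in labels],
                        columns=[f"pred_{l}" for l in labels])
print(cm_table)
```

In the lab, `y_true` and `y_hat` would be `y_test` and a model's predictions on `X_test`.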


## Step 6: Model Comparison
Summarize the model performances in a table and create visualizations for comparison.

In [None]:
import matplotlib.pyplot as plt

accuracies = [accuracy_score(y_test, model.predict(X_test)) for model in models]
plt.bar(model_names, accuracies)
plt.ylabel('Accuracy')
plt.title('Model Comparison')
plt.show()
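
The bar chart covers the visualization; for the table, one option is to collect the scores into a dataframe. The sketch below does this on synthetic data from `make_classification` with two of the four models, to keep it short. It also records training accuracy, because a large gap between training and test accuracy is a sign of overfitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the song data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

demo_models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

rows = []
for name, model in demo_models.items():
    model.fit(X_train, y_train)
    rows.append({
        "model": name,
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    })

# One row per model, best test accuracy first
summary = (pd.DataFrame(rows)
           .sort_values("test_accuracy", ascending=False)
           .reset_index(drop=True))
print(summary)
```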
