# SPOTIFY DATA


1. `track_name`: Name of the track.
2. `artist(s)_name`: Name of the artist(s).
3. `artist_count`: Number of artists.
4. `released_year`: Year of release.
5. `released_month`: Month of release.
6. `released_day`: Day of release.
7. `in_spotify_playlists`: Number of Spotify playlists the track is in.
8. `in_spotify_charts`: Number of Spotify charts the track is in.
9. `streams`: Number of streams.
10. `in_apple_playlists`: Number of Apple Music playlists the track is in.
11. `in_apple_charts`: Number of Apple Music charts the track is in.
12. `in_deezer_playlists`: Number of Deezer playlists the track is in.
13. `in_deezer_charts`: Number of Deezer charts the track is in.
14. `in_shazam_charts`: Presence of the track in Shazam charts.
15. `bpm`: Beats per minute.
16. `key`: Musical key of the track.
17. `mode`: Musical mode of the track.
18. `danceability_%`: Danceability score (percentage).
19. `valence_%`: Valence score (percentage).
20. `energy_%`: Energy score (percentage).
21. `acousticness_%`: Acousticness score (percentage).
22. `instrumentalness_%`: Instrumentalness score (percentage).
23. `liveness_%`: Liveness score (percentage).
24. `speechiness_%`: Speechiness score (percentage).

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import time
import seaborn as sns
from sklearn.model_selection import KFold

try:
    spotify_df = pd.read_csv("spotify-2023.csv", encoding='ISO-8859-1')
except Exception as e:
    error = e

spotify_df.head() if 'spotify_df' in locals() else error


| | track_name | artist(s)_name | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | ... | bpm | key | mode | danceability_% | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Seven (feat. Latto) (Explicit Ver.) | Latto, Jung Kook | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | ... | 125 | B | Major | 80 | 89 | 83 | 31 | 0 | 8 | 4 |
| 1 | LALA | Myke Towers | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | ... | 92 | C# | Major | 71 | 61 | 74 | 7 | 0 | 10 | 4 |
| 2 | vampire | Olivia Rodrigo | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | ... | 138 | F | Major | 51 | 32 | 53 | 17 | 0 | 31 | 6 |
| 3 | Cruel Summer | Taylor Swift | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | ... | 170 | A | Major | 55 | 58 | 72 | 11 | 0 | 11 | 15 |
| 4 | WHERE SHE GOES | Bad Bunny | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | ... | 144 | A | Minor | 65 | 23 | 80 | 14 | 63 | 11 | 6 |


In [2]:
spotify_data_types = spotify_df.dtypes
spotify_missing_values = spotify_df.isna().sum()

spotify_data_types, spotify_missing_values

(track_name              object
 artist(s)_name          object
 artist_count             int64
 released_year            int64
 released_month           int64
 released_day             int64
 in_spotify_playlists     int64
 in_spotify_charts        int64
 streams                 object
 in_apple_playlists       int64
 in_apple_charts          int64
 in_deezer_playlists     object
 in_deezer_charts         int64
 in_shazam_charts        object
 bpm                      int64
 key                     object
 mode                    object
 danceability_%           int64
 valence_%                int64
 energy_%                 int64
 acousticness_%           int64
 instrumentalness_%       int64
 liveness_%               int64
 speechiness_%            int64
 dtype: object,
 track_name               0
 artist(s)_name           0
 artist_count             0
 released_year            0
 released_month           0
 released_day             0
 in_spotify_playlists     0
 in_spotify_charts  





The Spotify 2023 dataset contains a mix of numerical and categorical data, with the following characteristics:

- Most columns are of numerical type (`int64`), including metrics like `artist_count`, `released_year`, `in_spotify_playlists`, and the musical feature scores (e.g., `danceability_%`, `energy_%`).
- The remaining columns are stored as `object`. Four of them are genuinely textual or categorical: `track_name`, `artist(s)_name`, `key`, and `mode`. The other three, `streams`, `in_deezer_playlists`, and `in_shazam_charts`, hold counts but were not parsed as integers.

Regarding missing values:
- `in_shazam_charts` has 50 missing values.
- `key` has 95 missing values.
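
Because `streams` and the other count columns are stored as `object`, comparisons on them are string comparisons (for example, `"9" > "10"` is true). The snippet below is a minimal sketch of how such a column could be coerced to numbers with `pd.to_numeric`. The toy values, including the thousands separator and the unparseable entry, are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column mimicking a numeric field that was read as text (values are made up).
raw = pd.Series(["141381703", "1,021", "not available"], name="streams")

# Strip thousands separators, then coerce; unparseable entries become NaN.
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(clean.dtype)         # float64, because NaN forces a floating-point column
print(clean.isna().sum())  # number of entries that could not be parsed
```

Any rows that turn into `NaN` would then need the same kind of missing-value decision as `key` and `in_shazam_charts`.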





1. **Target Variable**: We'll use `mode` as the target variable. It is already categorical ('Major' or 'Minor'), so LazyFCA can use it without further discretization.

2. **Missing Values**:
   - For `key`, we'll fill missing values with the placeholder 'Unknown'.
   - For `in_shazam_charts`, we'll fill missing values with zero, assuming a missing value means the track is not in any Shazam chart.

3. **Feature Transformation and Encoding**: For categorical features like `track_name`, `artist(s)_name`, and `key`, we'll use label encoding to convert them into numerical values that LazyFCA can process.

4. **Feature Selection**: We'll keep a subset of the columns rather than all of them. Identifier-like features such as `track_name` are unlikely to be informative and are candidates for exclusion, although the selection below still keeps their encoded versions.

5. **Other Columns**: Columns like `released_year`, `bpm`, and the percentage scores will be used as numerical features.
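
The steps above can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: the values are made up, `astype("category").cat.codes` stands in for `LabelEncoder`, and `coding_for` is a hypothetical helper that marks encoded columns as categorical (`'c'`) and everything else as interval (`'i'`).

```python
import pandas as pd

# Toy frame standing in for the Spotify data (values are made up).
toy = pd.DataFrame({
    "bpm": [125, 92, 138],
    "key": ["B", None, "F"],
    "in_shazam_charts": [826.0, None, 949.0],
})

# Missing values: assign the filled column back to the frame.
toy["key"] = toy["key"].fillna("Unknown")
toy["in_shazam_charts"] = toy["in_shazam_charts"].fillna(0)

# Encoding: integer codes for a categorical column (categories are sorted).
toy["key_encoded"] = toy["key"].astype("category").cat.codes

# Hypothetical helper: 'c' for encoded categoricals, 'i' for numeric columns.
def coding_for(columns):
    return ["c" if col.endswith("_encoded") else "i" for col in columns]

features = ["bpm", "in_shazam_charts", "key_encoded"]
coding = coding_for(features)
print(coding)
```

Marking encoded columns as `'c'` matters because the integer order produced by label encoding is arbitrary, so an interval over those codes has no musical meaning.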



In [4]:
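# Assign back instead of using inplace=True on a column selection,
# which recent pandas versions warn about
spotify_df['key'] = spotify_df['key'].fillna('Unknown')
spotify_df['in_shazam_charts'] = spotify_df['in_shazam_charts'].fillna(0)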
spotify_label_cols = ['track_name', 'artist(s)_name', 'key']
# Label encoding for categorical variables
spotify_label_encoders = {col: LabelEncoder().fit(spotify_df[col]) for col in spotify_label_cols}
for col, le in spotify_label_encoders.items():
    spotify_df[col+'_encoded'] = le.transform(spotify_df[col])

# Selecting relevant columns for LazyFCA
spotify_relevant_cols = ['artist_count', 'released_year', 'released_month', 'released_day', 'in_spotify_playlists', 'in_spotify_charts', 'streams', 'in_apple_playlists', 'in_apple_charts', 'bpm', 'danceability_%', 'valence_%', 'energy_%', 'acousticness_%', 'instrumentalness_%', 'liveness_%', 'speechiness_%'] + [col+'_encoded' for col in spotify_label_cols]
spotify_target_col = 'mode'

# Final dataset for analysis
spotify_fca_df = spotify_df[spotify_relevant_cols + [spotify_target_col]]
spotify_fca_df.head()


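| | artist_count | released_year | released_month | released_day | in_spotify_playlists | in_spotify_charts | streams | in_apple_playlists | in_apple_charts | bpm | ... | valence_% | energy_% | acousticness_% | instrumentalness_% | liveness_% | speechiness_% | track_name_encoded | artist(s)_name_encoded | key_encoded | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2023 | 7 | 14 | 553 | 147 | 141381703 | 43 | 263 | 125 | ... | 89 | 83 | 31 | 0 | 8 | 4 | 687 | 326 | 2 | Major |
| 1 | 1 | 2023 | 3 | 23 | 1474 | 48 | 133716286 | 48 | 126 | 92 | ... | 61 | 74 | 7 | 0 | 10 | 4 | 397 | 401 | 3 | Major |
| 2 | 1 | 2023 | 6 | 30 | 1397 | 113 | 140003974 | 94 | 207 | 138 | ... | 32 | 53 | 17 | 0 | 31 | 6 | 936 | 431 | 7 | Major |
| 3 | 1 | 2019 | 8 | 23 | 7858 | 100 | 800840817 | 116 | 207 | 170 | ... | 58 | 72 | 11 | 0 | 11 | 15 | 170 | 558 | 0 | Major |
| 4 | 1 | 2023 | 5 | 18 | 3133 | 50 | 303236322 | 84 | 133 | 144 | ... | 23 | 80 | 14 | 63 | 11 | 6 | 864 | 43 | 0 | Minor |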


In [5]:
def LazyFCA(X, y, cod, cv=5, min_supp=1, ranged=False, gap=None):
  """Performs a lazy classification and computes CV score

  Parameters
  ----------
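  X : pandas.DataFrame
      Data features (accessed positionally via .iloc)
  y : array-like
      Target feature
  cod : List
      Type of coding of each feature
      -c categorical
      -i interval intersection
  cv : int
      Number of folds in k-fold CV
  min_supp : int
      Minimal support of hypothesis
  ranged : bool
      If classes are ordered
  gap : int
      Maximum length of interval of classification
      (must be set when ranged=True)

  Returns
  -------
  prediction : List
      Class predictions for objects (None if unclassified)
  acc : List
      Accuracy on each CV fold
  mean_acc : float
      Mean accuracy over the folds
  unclass : float
      Share of objects left unclassified
  """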
  y = np.array(y)
  kf = KFold(n_splits=cv, random_state=None, shuffle=False)
  prediction = [None] * len(y)
  if ranged:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:  # outer loop over test examples
        for tr in train_index:  # first inner loop: one candidate hypothesis per training example
          hyp = [None] * len(cod)
          for i in range(len(cod)):  # build the hypothesis from the pair (test, tr)
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:  # second inner loop: check the hypothesis against the training set
            for i in range(len(cod)):  # checking a single training example
              if (cod[i] == 'i' and not (hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])  # htr is covered by the hypothesis
            if (max(pred_int)-min(pred_int)+1) > gap:  # covered classes span too wide a range
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] is not None:
            break
      right = 0
      for p in test_index:
        if prediction[p] is not None and min(prediction[p]) <= y[p] <= max(prediction[p]):
          right += 1
      acc.append(right/len(test_index))

  else:
    acc=[]
    for train_index, test_index in kf.split(X):
      for test in test_index:  # outer loop over test examples
        for tr in train_index:  # first inner loop: one candidate hypothesis per training example
          hyp = [None] * len(cod)
          for i in range(len(cod)):  # build the hypothesis from the pair (test, tr)
            if cod[i] == 'i':
              hyp[i] = [min(X.iloc[test, i], X.iloc[tr, i]), max(X.iloc[test, i], X.iloc[tr, i])]
            elif cod[i] == 'c':
              hyp[i] = X.iloc[test, i] == X.iloc[tr, i]
          pred_int = [y[tr]]
          for htr in train_index:  # second inner loop: check the hypothesis against the training set
            for i in range(len(cod)):  # checking a single training example
              if (cod[i] == 'i' and not (hyp[i][0] <= X.iloc[htr, i] <= hyp[i][1])) or (cod[i] == 'c' and hyp[i] and X.iloc[htr, i] != X.iloc[test, i]):
                break
              elif i == len(cod)-1 and htr != tr:
                pred_int.append(y[htr])  # htr is covered by the hypothesis
            if pred_int[-1] != pred_int[0]:  # a covered example of another class falsifies the hypothesis
              break
            elif htr == train_index[-1] and len(pred_int) >= min_supp:
              prediction[test] = pred_int
          if prediction[test] is not None:
            break
      right = 0
      for p in test_index:
        if prediction[p] is not None and y[p] == prediction[p][0]:
          right += 1
      acc.append(right/len(test_index))

  # Share of objects that received no prediction
  unclass = sum(p is None for p in prediction) / len(y)

  return prediction, acc, np.mean(acc), unclass
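
To make the core step of `LazyFCA` easier to follow, here is a minimal sketch of how a single interval hypothesis is built from one (test, train) pair and then checked against other training objects. The feature values, the labels, and the helper `covers` are all made up for illustration and are not part of the function above.

```python
# Interval-intersection hypothesis for one (test, train) pair, on made-up numbers.
test_obj = [125, 80]    # e.g. bpm and danceability_% of the test track
train_obj = [138, 51]   # the same features for one training track

# For 'i' features the hypothesis is the smallest interval covering both values.
hyp = [(min(a, b), max(a, b)) for a, b in zip(test_obj, train_obj)]

def covers(hyp, obj):
    """True if every feature of obj falls inside the hypothesis intervals."""
    return all(lo <= v <= hi for (lo, hi), v in zip(hyp, obj))

# Other training objects with their labels (hypothetical).
others = {
    "x": ([130, 60], "Major"),
    "y": ([128, 75], "Minor"),
    "z": ([170, 55], "Minor"),
}
covered_labels = [label for obj, label in others.values() if covers(hyp, obj)]
print(hyp, covered_labels)
```

Here the hypothesis covers objects of both classes, so in the non-ranged branch it would be rejected and the next training example would be tried.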

In [6]:
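# Coding type for each feature ('i' for interval, 'c' for categorical)
# All features are treated as interval type for simplicity. This includes the label-encoded
# columns, whose integer order is arbitrary, and `streams`, which is still object dtype.
spotify_feature_coding = ['i'] * len(spotify_relevant_cols)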

# Splitting features (X) and target (y)
X_spotify = spotify_fca_df[spotify_relevant_cols]
y_spotify = spotify_fca_df[spotify_target_col]

# Applying the LazyFCA function to a subset of the Spotify dataset
# We'll start with a small subset due to computational intensity
toy_x_spotify = X_spotify.iloc[:50]
toy_y_spotify = y_spotify.iloc[:50]

# Applying the LazyFCA function to the toy dataset
pred_spotify, acc_spotify, mean_acc_spotify, uc_spotify = LazyFCA(toy_x_spotify, toy_y_spotify, spotify_feature_coding, cv=5)
pred_spotify, acc_spotify, mean_acc_spotify, uc_spotify


([['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Minor'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major'],
  ['Major']],
 [0.4, 0.5, 0.4, 0.6, 0.6],
 0.5,
 0.0)

Based on the output of the LazyFCA run on the 50-track subset, the results can be interpreted as follows:

1. **Predictions (`pred_spotify`)**: The LazyFCA algorithm predicts the `mode` of each track. The first 10 instances (the first test fold, since `KFold` is not shuffled) are all predicted 'Minor', and the remaining 40 are all predicted 'Major'. Every prediction is a single-element list, so each accepted hypothesis was supported by only one training example, the minimum allowed by `min_supp=1`.

2. **Accuracy (`acc_spotify`)**: The accuracy scores across the 5-fold cross-validation range from 0.4 (40%) to 0.6 (60%). With only 10 test tracks per fold, each score moves in steps of 0.1, so this spread is small and does not show that the features separate Major from Minor modes.

3. **Mean Accuracy (`mean_acc_spotify`)**: The mean accuracy across all folds is 0.5 (50%), which is what random guessing would achieve on a balanced binary classification task. This suggests that the features used, their all-interval coding, or the method itself need refinement for better predictive performance.

4. **Unclassified (`uc_spotify`)**: The value is 0.0, indicating that every instance received a prediction and none were left unclassified.
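
A useful reference point for these numbers is a majority-class baseline, which always predicts the most frequent training label. The sketch below recomputes the mean of the reported fold accuracies and shows how such a baseline could be evaluated. `majority_baseline` is a hypothetical helper, and the labels passed to it are made up, not taken from the dataset.

```python
from collections import Counter
from statistics import mean

fold_acc = [0.4, 0.5, 0.4, 0.6, 0.6]   # per-fold accuracies reported above
mean_acc = mean(fold_acc)

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(label == majority for label in y_test) / len(y_test)

# Made-up labels, only to show the call.
y_train = ["Major"] * 6 + ["Minor"] * 4
y_test = ["Major", "Minor", "Major", "Major"]
baseline = majority_baseline(y_train, y_test)
print(mean_acc, baseline)
```

If LazyFCA does not beat this baseline on the same folds, the hypotheses are not adding information beyond the class frequencies.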

