# SoundScope Milestone 1 Exploration


This notebook documents the first milestone of the SoundScope project. It inspects a prototype dataset of Spotify-style audio features and trains a lightweight baseline, a softmax regression implemented in pure Python, to classify songs by genre. The goal is to confirm that the data pipeline is wired up and to build initial intuition about how the audio descriptors relate to genre, mood, and era.


## 1. Setup


In [None]:
import csv
import math
import random
from collections import Counter, defaultdict
from statistics import mean, median, pstdev

random.seed(42)

DATA_PATH = '../data/spotify_tracks.csv'
NUMERIC_FEATURES = ['tempo', 'danceability', 'energy', 'valence', 'loudness', 'acousticness', 'instrumentalness', 'speechiness', 'liveness', 'popularity']
CATEGORICAL_FIELDS = ['genre', 'mood', 'era']

def read_dataset(path):
    """Load the CSV at `path`; return (rows, fieldnames) with NUMERIC_FEATURES parsed as floats."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = []
        for row in reader:
            parsed = {}
            for key, value in row.items():
                if key in NUMERIC_FEATURES:
                    parsed[key] = float(value)
                else:
                    parsed[key] = value
            rows.append(parsed)
    return rows, fieldnames


## 2. Load prototype dataset


In [3]:
dataset, columns = read_dataset(DATA_PATH)
print(f"Loaded {len(dataset)} tracks with {len(columns)} columns.")
print("Columns:", columns)

print("\nPreview of the first three tracks:")
for entry in dataset[:3]:
    preview = {
        'track_name': entry['track_name'],
        'artist': entry['artist'],
        'genre': entry['genre'],
        'mood': entry['mood'],
        'era': entry['era'],
        'tempo': entry['tempo'],
        'danceability': entry['danceability'],
        'energy': entry['energy'],
    }
    print(preview)

numeric_data = {feature: [row[feature] for row in dataset] for feature in NUMERIC_FEATURES}
categorical_data = {field: [row[field] for row in dataset] for field in CATEGORICAL_FIELDS}


Loaded 25 tracks with 16 columns.
Columns: ['track_id', 'track_name', 'artist', 'tempo', 'danceability', 'energy', 'valence', 'loudness', 'acousticness', 'instrumentalness', 'speechiness', 'liveness', 'popularity', 'genre', 'mood', 'era']

Preview of the first three tracks:
{'track_name': 'Neon Skies', 'artist': 'Lumina', 'genre': 'Pop', 'mood': 'Happy', 'era': '2010s', 'tempo': 120.0, 'danceability': 0.72, 'energy': 0.65}
{'track_name': 'Midnight Run', 'artist': 'The Chromatics', 'genre': 'Pop', 'mood': 'Energetic', 'era': '2010s', 'tempo': 128.0, 'danceability': 0.63, 'energy': 0.79}
{'track_name': 'Electric Veins', 'artist': 'SynthPulse', 'genre': 'Electronic', 'mood': 'Energetic', 'era': '2020s', 'tempo': 132.0, 'danceability': 0.75, 'energy': 0.88}


## 3. Dataset overview


In [4]:
print(f"Numeric feature summary (across {len(dataset)} tracks):")
for feature in NUMERIC_FEATURES:
    values = numeric_data[feature]
    feature_mean = mean(values)
    feature_median = median(values)
    feature_min = min(values)
    feature_max = max(values)
    feature_std = pstdev(values) if len(values) > 1 else 0.0
    print(f"- {feature:<16} mean={feature_mean:6.2f} | median={feature_median:6.2f} | min={feature_min:6.2f} | max={feature_max:6.2f} | std={feature_std:6.2f}")


Numeric feature summary (across 25 tracks):
- tempo            mean=113.20 | median=118.00 | min= 84.00 | max=150.00 | std= 18.39
- danceability     mean=  0.62 | median=  0.63 | min=  0.40 | max=  0.80 | std=  0.12
- energy           mean=  0.64 | median=  0.65 | min=  0.35 | max=  0.91 | std=  0.18
- valence          mean=  0.54 | median=  0.55 | min=  0.30 | max=  0.75 | std=  0.13
- loudness         mean= -6.46 | median= -6.00 | min=-10.00 | max= -3.50 | std=  1.94
- acousticness     mean=  0.29 | median=  0.22 | min=  0.05 | max=  0.70 | std=  0.19
- instrumentalness mean=  0.16 | median=  0.15 | min=  0.01 | max=  0.45 | std=  0.13
- speechiness      mean=  0.05 | median=  0.05 | min=  0.03 | max=  0.08 | std=  0.01
- liveness         mean=  0.16 | median=  0.15 | min=  0.09 | max=  0.27 | std=  0.05
- popularity       mean= 62.32 | median= 65.00 | min= 42.00 | max= 76.00 | std=  9.84


In [5]:
print("Categorical distributions:")
for field in CATEGORICAL_FIELDS:
    counts = Counter(categorical_data[field])
    print(f"\n{field.title()} counts:")
    for value, count in counts.most_common():
        print(f"  - {value:<12} {count}")


Categorical distributions:

Genre counts:
  - Pop          5
  - Electronic   5
  - Rock         3
  - Folk         2
  - Alternative  2
  - Indie        2
  - World        1
  - Hip-Hop      1
  - Soul         1
  - Chillwave    1
  - Lo-Fi        1
  - Ambient      1

Mood counts:
  - Energetic    9
  - Calm         8
  - Happy        5
  - Sad          3

Era counts:
  - 2010s        9
  - 2000s        6
  - 2020s        5
  - 1990s        2
  - 1980s        2
  - 1970s        1


In [6]:
def missing_value_counts(rows, column_names):
    totals = {name: 0 for name in column_names}
    for row in rows:
        for name in column_names:
            value = row.get(name)
            if value is None or value == '':
                totals[name] += 1
    return totals

missing_counts = missing_value_counts(dataset, columns)
if any(missing_counts.values()):
    print("Missing values detected:")
    for name, count in missing_counts.items():
        if count:
            print(f"- {name}: {count}")
else:
    print("No missing values detected in the sample dataset.")


No missing values detected in the sample dataset.
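
Note that `read_dataset` converts numeric columns with `float(value)`, so a blank numeric cell would raise `ValueError` during loading, before this check ever runs. A minimal sketch of a more tolerant parsing step, assuming blank numeric cells should become `None` (the helper name `parse_numeric` is hypothetical and not part of the notebook):

```python
def parse_numeric(value):
    """Return float(value), or None when the cell is missing or blank.

    Hypothetical helper: the notebook's read_dataset calls float() directly.
    """
    if value is None:
        return None
    text = value.strip()
    if text == '':
        return None
    return float(text)


# Illustrative row, not taken from the dataset.
raw_row = {'tempo': '120.0', 'energy': '', 'genre': 'Pop'}
numeric_keys = {'tempo', 'energy'}
parsed_row = {
    key: parse_numeric(value) if key in numeric_keys else value
    for key, value in raw_row.items()
}
print(parsed_row)
```

With a step like this, `missing_value_counts` would report blank numeric cells as missing, although the summary statistics would then need to skip `None` values.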


## 4. Exploratory insights


In [7]:
def ascii_bar(value, scale=28, symbol='█'):
    filled = max(1, int(round(value * scale)))
    return symbol * filled

genre_groups = defaultdict(list)
for row in dataset:
    genre_groups[row['genre']].append(row)

print("Average energy by genre (0-1 scale):")
for genre in sorted(genre_groups):
    avg_energy = mean(item['energy'] for item in genre_groups[genre])
    print(f"  {genre:<12} {ascii_bar(avg_energy)} {avg_energy:5.2f}")

era_groups = defaultdict(list)
for row in dataset:
    era_groups[row['era']].append(row)

print("\nAverage danceability by era (0-1 scale):")
for era in sorted(era_groups):
    avg_danceability = mean(item['danceability'] for item in era_groups[era])
    print(f"  {era:<8} {ascii_bar(avg_danceability)} {avg_danceability:5.2f}")


Average energy by genre (0-1 scale):
  Alternative  ███████████████  0.54
  Ambient      █████████████  0.47
  Chillwave    ████████████  0.44
  Electronic   ███████████████████████  0.82
  Folk         ██████████  0.36
  Hip-Hop      █████████████████████████  0.91
  Indie        ██████████████  0.50
  Lo-Fi        ███████████  0.41
  Pop          ████████████████████  0.73
  Rock         ██████████████████████  0.78
  Soul         ████████████  0.43
  World        █████████████  0.45

Average danceability by era (0-1 scale):
  1970s    ███████████  0.40
  1980s    ████████████  0.44
  1990s    ██████████████  0.49
  2000s    ████████████████  0.59
  2010s    ███████████████████  0.67
  2020s    █████████████████████  0.74


In [8]:
def pearson_r(x_values, y_values):
    mean_x = mean(x_values)
    mean_y = mean(y_values)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    denom_x = math.sqrt(sum((x - mean_x) ** 2 for x in x_values))
    denom_y = math.sqrt(sum((y - mean_y) ** 2 for y in y_values))
    if denom_x == 0 or denom_y == 0:
        return 0.0
    return num / (denom_x * denom_y)

core_features = ['danceability', 'energy', 'valence', 'popularity']
header = ' ' * 16 + ''.join(f"{name:>14}" for name in core_features)
print('Pearson correlation matrix for selected features:')
print(header)
for fx in core_features:
    row_text = f"{fx:>16}"
    for fy in core_features:
        corr = 1.0 if fx == fy else pearson_r(numeric_data[fx], numeric_data[fy])
        row_text += f"{corr:>14.2f}"
    print(row_text)


Pearson correlation matrix for selected features:
                  danceability        energy       valence    popularity
    danceability          1.00          0.89          0.85          0.93
          energy          0.89          1.00          0.67          0.95
         valence          0.85          0.67          1.00          0.81
      popularity          0.93          0.95          0.81          1.00


## 5. Baseline genre model


In [9]:
def standardize_features(rows, features):
    # Compute statistics from the rows passed in rather than the global numeric_data.
    means = {feature: mean(row[feature] for row in rows) for feature in features}
    # pstdev returns 0.0 for a constant feature; fall back to 1.0 to avoid dividing by zero.
    stds = {feature: pstdev([row[feature] for row in rows]) or 1.0 for feature in features}
    matrix = []
    for row in rows:
        matrix.append([(row[feature] - means[feature]) / stds[feature] for feature in features])
    return matrix, means, stds

feature_matrix, feature_means, feature_stds = standardize_features(dataset, NUMERIC_FEATURES)
print("Feature standardization reference:")
for feature in NUMERIC_FEATURES:
    print(f"  {feature:<16} mean={feature_means[feature]:6.2f} | std={feature_stds[feature]:6.2f}")


Feature standardization reference:
  tempo            mean=113.20 | std= 18.39
  danceability     mean=  0.62 | std=  0.12
  energy           mean=  0.64 | std=  0.18
  valence          mean=  0.54 | std=  0.13
  loudness         mean= -6.46 | std=  1.94
  acousticness     mean=  0.29 | std=  0.19
  instrumentalness mean=  0.16 | std=  0.13
  speechiness      mean=  0.05 | std=  0.01
  liveness         mean=  0.16 | std=  0.05
  popularity       mean= 62.32 | std=  9.84
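
These statistics are computed over all 25 tracks, including the ones later held out for testing, so the test set influences the scaling. A minimal sketch of fitting the scaler on training rows only, assuming the same dict-per-row layout (the names `fit_scaler` and `apply_scaler` are hypothetical):

```python
from statistics import mean, pstdev


def fit_scaler(train_rows, features):
    """Compute per-feature mean and population std from training rows only."""
    means = {f: mean(row[f] for row in train_rows) for f in features}
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = {f: pstdev([row[f] for row in train_rows]) or 1.0 for f in features}
    return means, stds


def apply_scaler(rows, features, means, stds):
    """Standardize rows using previously fitted statistics."""
    return [[(row[f] - means[f]) / stds[f] for f in features] for row in rows]


# Illustrative values, not taken from the dataset.
train_rows = [{'tempo': 100.0, 'energy': 0.4}, {'tempo': 140.0, 'energy': 0.8}]
test_rows = [{'tempo': 120.0, 'energy': 0.9}]
features = ['tempo', 'energy']

means, stds = fit_scaler(train_rows, features)
scaled_train = apply_scaler(train_rows, features, means, stds)
scaled_test = apply_scaler(test_rows, features, means, stds)
print(scaled_train, scaled_test)
```

In the notebook this would mean splitting the indices first and then scaling both subsets with the training statistics.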


In [10]:
class SoftmaxRegression:
    def __init__(self, n_features, n_classes, learning_rate=0.25):
        self.n_features = n_features
        self.n_classes = n_classes
        self.learning_rate = learning_rate
        self.weights = [[0.0 for _ in range(n_features + 1)] for _ in range(n_classes)]

    def _logits(self, features):
        extended = features + [1.0]
        return [sum(self.weights[class_idx][j] * extended[j] for j in range(self.n_features + 1)) for class_idx in range(self.n_classes)]

    def predict_proba(self, features):
        logits = self._logits(features)
        max_logit = max(logits)
        exps = [math.exp(value - max_logit) for value in logits]
        total = sum(exps)
        return [value / total for value in exps]

    def predict(self, features):
        logits = self._logits(features)
        return max(range(self.n_classes), key=lambda idx: logits[idx])

    def fit(self, samples, targets, epochs=1800, report_every=450):
        for epoch in range(epochs):
            grads = [[0.0 for _ in range(self.n_features + 1)] for _ in range(self.n_classes)]
            loss_total = 0.0
            for features, target in zip(samples, targets):
                extended = features + [1.0]  # append the constant bias input
                probs = self.predict_proba(features)
                loss_total += -math.log(max(probs[target], 1e-12))
                for class_idx in range(self.n_classes):
                    diff = probs[class_idx] - (1 if class_idx == target else 0)
                    for j in range(self.n_features + 1):
                        grads[class_idx][j] += diff * extended[j]
            sample_count = len(samples)
            for class_idx in range(self.n_classes):
                for j in range(self.n_features + 1):
                    self.weights[class_idx][j] -= self.learning_rate * grads[class_idx][j] / sample_count
            if report_every and (epoch + 1) % report_every == 0:
                avg_loss = loss_total / sample_count
                print(f"Epoch {epoch + 1:4d}: average cross-entropy loss = {avg_loss:.4f}")

def accuracy_score(true_labels, predicted_labels):
    correct = sum(1 for true, pred in zip(true_labels, predicted_labels) if true == pred)
    return correct / len(true_labels)

def classification_report(label_names, true_labels, predicted_labels):
    lines = [f"{'Class':<15}{'Precision':>10}{'Recall':>10}{'F1':>10}{'Support':>10}"]
    macro_precision = macro_recall = macro_f1 = 0.0
    for idx, name in enumerate(label_names):
        tp = sum(1 for true, pred in zip(true_labels, predicted_labels) if true == idx and pred == idx)
        fp = sum(1 for true, pred in zip(true_labels, predicted_labels) if true != idx and pred == idx)
        fn = sum(1 for true, pred in zip(true_labels, predicted_labels) if true == idx and pred != idx)
        support = sum(1 for true in true_labels if true == idx)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision and recall) else 0.0
        macro_precision += precision
        macro_recall += recall
        macro_f1 += f1
        lines.append(f"{name:<15}{precision:>10.2f}{recall:>10.2f}{f1:>10.2f}{support:>10}")
    label_count = len(label_names)
    lines.append('-' * 55)
    lines.append(f"{'Macro avg':<15}{(macro_precision / label_count):>10.2f}{(macro_recall / label_count):>10.2f}{(macro_f1 / label_count):>10.2f}{len(true_labels):>10}")
    return '\n'.join(lines)

def confusion_matrix_str(label_names, true_labels, predicted_labels):
    header = ' ' * 12 + ''.join(f"{name[:9]:>10}" for name in label_names)
    lines = [header]
    for idx, name in enumerate(label_names):
        row_counts = [0 for _ in label_names]
        for true, pred in zip(true_labels, predicted_labels):
            if true == idx:
                row_counts[pred] += 1
        row = f"{name[:9]:>12}" + ''.join(f"{count:>10}" for count in row_counts)
        lines.append(row)
    return '\n'.join(lines)
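
For reference, the update inside `fit` corresponds to full-batch gradient descent on the mean softmax cross-entropy. The notation below is introduced here for illustration and does not appear in the notebook: $\tilde{x}_i$ is sample $i$'s feature vector with a trailing 1 for the bias (the `extended` list), $w_k$ is row $k$ of `weights`, $y_i$ is the target index, $N$ is the sample count, and $\eta$ is the learning rate.

```latex
\begin{aligned}
z_{ik} &= w_k^\top \tilde{x}_i, \qquad m_i = \max_{c} z_{ic} \\
p_{ik} &= \frac{\exp(z_{ik} - m_i)}{\sum_{c} \exp(z_{ic} - m_i)} \\
L &= -\frac{1}{N} \sum_{i=1}^{N} \log p_{i y_i} \\
\frac{\partial L}{\partial w_k} &= \frac{1}{N} \sum_{i=1}^{N} \left( p_{ik} - \mathbb{1}[k = y_i] \right) \tilde{x}_i \\
w_k &\leftarrow w_k - \eta \, \frac{\partial L}{\partial w_k}
\end{aligned}
```

Subtracting the maximum logit $m_i$ leaves $p_{ik}$ unchanged and only guards against overflow in `math.exp`. The reported loss additionally clamps $p_{i y_i}$ at `1e-12` before taking the logarithm.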


In [11]:
labels = sorted({row['genre'] for row in dataset})
label_to_index = {label: idx for idx, label in enumerate(labels)}
index_to_label = {idx: label for label, idx in label_to_index.items()}

indices = list(range(len(dataset)))
random.shuffle(indices)
split = int(len(indices) * 0.75)
train_indices = indices[:split]
test_indices = indices[split:]

X_train = [feature_matrix[idx] for idx in train_indices]
y_train = [label_to_index[dataset[idx]['genre']] for idx in train_indices]
X_test = [feature_matrix[idx] for idx in test_indices]
y_test = [label_to_index[dataset[idx]['genre']] for idx in test_indices]

model = SoftmaxRegression(n_features=len(NUMERIC_FEATURES), n_classes=len(labels), learning_rate=0.25)
model.fit(X_train, y_train, epochs=1800, report_every=450)

train_predictions = [model.predict(features) for features in X_train]
test_predictions = [model.predict(features) for features in X_test]

train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)

print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")

ordered_labels = [index_to_label[idx] for idx in range(len(labels))]
print("\nClassification report (test set):")
print(classification_report(ordered_labels, y_test, test_predictions))

print("\nConfusion matrix (test set):")
print(confusion_matrix_str(ordered_labels, y_test, test_predictions))

print("\nSample predictions from the test set:")
for idx in test_indices[:5]:
    predicted_label = ordered_labels[model.predict(feature_matrix[idx])]
    true_label = dataset[idx]['genre']
    print(f"  - {dataset[idx]['track_name']} ({true_label}) -> predicted {predicted_label}")


Epoch  450: average cross-entropy loss = 0.2484
Epoch  900: average cross-entropy loss = 0.1306
Epoch 1350: average cross-entropy loss = 0.0878
Epoch 1800: average cross-entropy loss = 0.0659
Training accuracy: 1.00
Test accuracy: 0.57

Classification report (test set):
Class           Precision    Recall        F1   Support
Alternative          1.00      1.00      1.00         1
Ambient              0.00      0.00      0.00         0
Chillwave            0.00      0.00      0.00         0
Electronic           1.00      1.00      1.00         2
Folk                 0.00      0.00      0.00         0
Hip-Hop              0.00      0.00      0.00         0
Indie                0.00      0.00      0.00         0
Lo-Fi                0.00      0.00      0.00         1
Pop                  1.00      0.50      0.67         2
Rock                 0.00      0.00      0.00         0
Soul                 0.00      0.00      0.00         0
World                0.00      0.00      0.00         1
-------------------------------------------------------
Macro avg            0.25      0.21      0.22         7
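
The gap between training and test accuracy is partly structural. The genre counts in Section 3 show a single track each for World and Lo-Fi, and both appear in the test set's Support column, so the model never saw either genre during training. A small sketch that surfaces such classes after a split (the helper name `unseen_test_labels` is hypothetical, and the label lists are illustrative):

```python
from collections import Counter


def unseen_test_labels(train_labels, test_labels):
    """Return test labels (with counts) that never occur in the training labels."""
    seen = set(train_labels)
    return {label: count
            for label, count in Counter(test_labels).items()
            if label not in seen}


# Illustrative label lists, not the notebook's actual split.
train_labels = ['Pop', 'Pop', 'Rock', 'Electronic']
test_labels = ['Pop', 'World', 'Lo-Fi', 'Electronic']
print(unseen_test_labels(train_labels, test_labels))
```

In the notebook, the same function could be called with `y_train` and `y_test`; the keys would then be class indices that map back to names through `index_to_label`.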

## 6. Next steps

* Expand the dataset by connecting to the Spotify Web API and Billboard or Top 500 chart archives.
* Engineer temporal and popularity-alignment features that capture whether a track matches current trends.
* Compare additional classifiers (Random Forest, Gradient Boosting, LightGBM) against the custom baseline.
* Begin prototyping the Streamlit interface for interactive visualizations and explanations.
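
As a starting point for the classifier comparison, here is a minimal sketch of a harness that fits several models on one split and reports test accuracy. It assumes each candidate exposes `fit(samples, targets)` and a per-sample `predict(features)`, as `SoftmaxRegression` does. `NearestCentroid` and `compare_models` are hypothetical stand-ins written for this sketch, not library classes, and library estimators that predict on whole batches would need a thin adapter.

```python
import math


class NearestCentroid:
    """Stand-in classifier with the same fit/predict shape as SoftmaxRegression."""

    def fit(self, samples, targets):
        sums, counts = {}, {}
        for features, target in zip(samples, targets):
            acc = sums.setdefault(target, [0.0] * len(features))
            for j, value in enumerate(features):
                acc[j] += value
            counts[target] = counts.get(target, 0) + 1
        # Average the accumulated feature sums to get one centroid per class.
        self.centroids = {
            label: [total / counts[label] for total in acc]
            for label, acc in sums.items()
        }

    def predict(self, features):
        # Pick the class whose centroid is closest in Euclidean distance.
        return min(self.centroids,
                   key=lambda label: math.dist(features, self.centroids[label]))


def compare_models(candidates, X_train, y_train, X_test, y_test):
    """Fit each candidate on the same split and return test accuracy by name."""
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        predictions = [model.predict(features) for features in X_test]
        correct = sum(1 for t, p in zip(y_test, predictions) if t == p)
        results[name] = correct / len(y_test)
    return results


# Illustrative two-class data, not taken from the dataset.
demo_X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
demo_y_train = [0, 0, 1, 1]
demo_X_test = [[0.1, 0.1], [0.9, 0.9]]
demo_y_test = [0, 1]

scores = compare_models({'nearest_centroid': NearestCentroid()},
                        demo_X_train, demo_y_train, demo_X_test, demo_y_test)
print(scores)
```

Adding `'softmax': SoftmaxRegression(...)` to the `candidates` dictionary would put the custom baseline on the same footing as any new model.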
