# Million Song Dataset - Data Loading

This notebook loads and explores:
1. **Taste Profile Subset** (train_triplets.txt): User listening data
2. **Million Song Subset**: Song metadata from HDF5 files

## Import Libraries

In [3]:
import pandas as pd
import numpy as np
import h5py
import os
from pathlib import Path
from tqdm import tqdm

## 1. Load Data (From File or Original Sources)

This cell checks whether the preprocessed pickle files exist. If both are present, it loads them directly; otherwise it loads the data from the original sources and saves pickles for future runs.

In [4]:
# Check if preprocessed data files exist
taste_profile_file = '../data/taste_profile.pkl'
songs_metadata_file = '../data/songs_metadata.pkl'

if os.path.exists(taste_profile_file) and os.path.exists(songs_metadata_file):
    # Load from preprocessed files (fast)
    print("Loading preprocessed data from files...")
    taste_profile = pd.read_pickle(taste_profile_file)
    songs_metadata = pd.read_pickle(songs_metadata_file)
    
    print(f"✓ Loaded {len(taste_profile):,} listening records from {taste_profile_file}")
    print(f"✓ Loaded {len(songs_metadata):,} songs with metadata from {songs_metadata_file}")
    print("\nData loaded successfully from saved files!")
    
else:
    # Load from original sources (slow)
    print("Preprocessed files not found. Loading from original sources...")
    print("This may take several minutes...")
    
    # Load the train_triplets.txt file
    triplets_file = '../train_triplets.txt'
    print("\nLoading Taste Profile data...")
    taste_profile = pd.read_csv(
        triplets_file, 
        sep='\t', 
        header=None, 
        names=['user_id', 'song_id', 'play_count']
    )
    
    print(f"✓ Loaded {len(taste_profile):,} listening records")
    print(f"  Unique users: {taste_profile['user_id'].nunique():,}")
    print(f"  Unique songs: {taste_profile['song_id'].nunique():,}")
    
    # Load Million Song Subset
    print("\nLoading Million Song Subset...")
    
    # Helper function to extract data from HDF5 files
    def get_song_data_from_h5(file_path):
        """Extract relevant song information from an HDF5 file."""
        try:
            with h5py.File(file_path, 'r') as h5:
                song_data = {
                    'song_id': h5['metadata']['songs']['song_id'][0].decode('utf-8'),
                    'title': h5['metadata']['songs']['title'][0].decode('utf-8'),
                    'artist_name': h5['metadata']['songs']['artist_name'][0].decode('utf-8'),
                    'artist_id': h5['metadata']['songs']['artist_id'][0].decode('utf-8'),
                    'release': h5['metadata']['songs']['release'][0].decode('utf-8'),
                    'year': int(h5['musicbrainz']['songs']['year'][0]),
                    'duration': float(h5['analysis']['songs']['duration'][0]),
                    'tempo': float(h5['analysis']['songs']['tempo'][0]),
                    'loudness': float(h5['analysis']['songs']['loudness'][0]),
                    'key': int(h5['analysis']['songs']['key'][0]),
                    'mode': int(h5['analysis']['songs']['mode'][0]),
                    'time_signature': int(h5['analysis']['songs']['time_signature'][0]),
                    'energy': float(h5['analysis']['songs']['energy'][0]),
                    'danceability': float(h5['analysis']['songs']['danceability'][0]),
                    'artist_hotttnesss': float(h5['metadata']['songs']['artist_hotttnesss'][0]),
                    'song_hotttnesss': float(h5['metadata']['songs']['song_hotttnesss'][0])
                }
                return song_data
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
            return None
    
    # Find all HDF5 files
    msd_path = Path('../MillionSongSubset')
    h5_files = list(msd_path.rglob('*.h5'))
    print(f"Found {len(h5_files):,} song files")
    
    # Load all song metadata
    song_data_list = []
    for h5_file in tqdm(h5_files):
        song_data = get_song_data_from_h5(h5_file)
        if song_data:
            song_data_list.append(song_data)
    
    songs_metadata = pd.DataFrame(song_data_list)
    print(f"✓ Loaded metadata for {len(songs_metadata):,} songs")
    
    # Save the loaded data for future use
    print("\nSaving data to files for future quick loading...")
    os.makedirs('../data', exist_ok=True)
    taste_profile.to_pickle(taste_profile_file)
    songs_metadata.to_pickle(songs_metadata_file)
    print(f"✓ Saved to {taste_profile_file} and {songs_metadata_file}")

print("\nFirst few rows of taste_profile:")
display(taste_profile.head())
print("\nFirst few rows of songs_metadata:")
display(songs_metadata.head())

Loading preprocessed data from files...
✓ Loaded 48,373,586 listening records from ../data/taste_profile.pkl
✓ Loaded 10,000 songs with metadata from ../data/songs_metadata.pkl

Data loaded successfully from saved files!

First few rows of taste_profile:


| | user_id | song_id | play_count |
|---|---|---|---|
| 0 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOAKIMP12A8C130995 | 1 |
| 1 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOAPDEY12A81C210A9 | 1 |
| 2 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBBMDR12A8C13253B | 2 |
| 3 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBFNSP12AF72A0E22 | 1 |
| 4 | b80344d063b5ccb3212f76538f3d9e43d87dca9e | SOBFOVM12A58A7D494 | 1 |



First few rows of songs_metadata:


| | song_id | title | artist_name | artist_id | release | year | duration | tempo | loudness | key | mode | time_signature | energy | danceability | artist_hotttnesss | song_hotttnesss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SOIDSSZ12A8C142A76 | Ride On Time (Original Version) | Black Box | AROWNZP1187FB3C028 | NOW Dance Anthems | 0 | 272.14322 | 118.834 | -9.159 | 9 | 0 | 4 | 0.0 | 0.0 | 0.361992 | 0.212045 |
| 1 | SOFRVYQ12A8C137BAC | Ombre Et Lumière | Vincent Bruley | ARQXGWX11F50C49BC7 | Le Temps Suspendu | 0 | 498.38975 | 150.867 | -14.037 | 4 | 0 | 4 | 0.0 | 0.0 | 0.30585 | NaN |
| 2 | SOGQIEK12AB0186792 | That Girl | Gary Morris | AR0NQD81187FB3AD13 | THAT GIRL | 0 | 305.10975 | 130.72 | -9.636 | 10 | 0 | 4 | 0.0 | 0.0 | 0.286033 | NaN |
| 3 | SOUEDBC12AC90972E5 | Warung Beach | John Digweed | AROZNFA1187B99D367 | Warung Beach | 2006 | 405.10649 | 127.995 | -5.988 | 10 | 1 | 4 | 0.0 | 0.0 | 0.433928 | NaN |
| 4 | SOUZOPT12A58A79B94 | Bad Reputation (Originally Performed by Thin L... | The Meatmen | AR00A6H1187FB5402A | Cover the Earth | 0 | 170.4224 | 130.164 | -5.681 | 1 | 1 | 5 | 0.0 | 0.0 | 0.395628 | 0.0 |
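
In the sample rows above, `year` is 0 for four of the five songs, `song_hotttnesss` is missing for three, and `energy` and `danceability` are 0.0 throughout. The sketch below shows one possible way to flag such values before analysis. It assumes that a `year` of 0 marks an unknown release year, and it runs on a small toy DataFrame (the name `songs` is hypothetical) rather than on the full `songs_metadata`.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like songs_metadata.head(); values copied from the output above.
songs = pd.DataFrame({
    "song_id": ["SOIDSSZ12A8C142A76", "SOUEDBC12AC90972E5", "SOUZOPT12A58A79B94"],
    "year": [0, 2006, 0],
    "energy": [0.0, 0.0, 0.0],
    "song_hotttnesss": [0.212045, np.nan, 0.0],
})

# Assumption: year == 0 means "unknown", so replace it with a missing value.
# The nullable Int64 dtype keeps known years as integers alongside <NA>.
songs["year"] = songs["year"].astype("Int64").mask(songs["year"] == 0, pd.NA)

# Year range computed over known years only.
known_years = songs["year"].dropna()
print(f"Year range (known only): {known_years.min()} - {known_years.max()}")

# Share of missing values per column.
missing_share = songs.isna().mean()
print(missing_share)

# Columns holding a single value carry no information for modelling.
constant_cols = [c for c in songs.columns if songs[c].nunique(dropna=False) == 1]
print(f"Constant columns: {constant_cols}")
```

Whether a constant column such as `energy` should be dropped depends on the full data; the five sample rows only hint that it may be unpopulated.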


## 2. Data Summary

Now we have two main datasets loaded:
- **taste_profile**: User listening behavior (user_id, song_id, play_count)
- **songs_metadata**: Song features and metadata (audio features, artist info, etc.)

In [5]:
# Display summary information
print("=" * 60)
print("TASTE PROFILE DATASET")
print("=" * 60)
print(f"Total listening records: {len(taste_profile):,}")
print(f"Unique users: {taste_profile['user_id'].nunique():,}")
print(f"Unique songs: {taste_profile['song_id'].nunique():,}")
print(f"Total play counts: {taste_profile['play_count'].sum():,}")
print(f"Average plays per user-song: {taste_profile['play_count'].mean():.2f}")
print(f"Median plays per user-song: {taste_profile['play_count'].median():.0f}")

print("\n" + "=" * 60)
print("SONGS METADATA DATASET")
print("=" * 60)
print(f"Total songs with metadata: {len(songs_metadata):,}")
print(f"Unique artists: {songs_metadata['artist_name'].nunique():,}")
# Note: the minimum is 0 if any song has year == 0, as several of the sample rows above do
print(f"Year range: {songs_metadata['year'].min()} - {songs_metadata['year'].max()}")
print(f"Average duration: {songs_metadata['duration'].mean():.2f} seconds")
print(f"Average tempo: {songs_metadata['tempo'].mean():.2f} BPM")

print("\n" + "=" * 60)
print("DATA OVERLAP")
print("=" * 60)
# Check overlap between datasets
overlap_songs = set(taste_profile['song_id'].unique()) & set(songs_metadata['song_id'].unique())
print(f"Songs in both datasets: {len(overlap_songs):,}")
print(f"Percentage of taste profile songs with metadata: {len(overlap_songs)/taste_profile['song_id'].nunique()*100:.2f}%")

TASTE PROFILE DATASET
Total listening records: 48,373,586
Unique users: 1,019,318
Unique songs: 384,546
Total play counts: 138,680,243
Average plays per user-song: 2.87
Median plays per user-song: 1

SONGS METADATA DATASET
Total songs with metadata: 10,000
Unique artists: 4,412
Year range: 0 - 2010
Average duration: 238.51 seconds
Average tempo: 122.92 BPM

DATA OVERLAP
Songs in both datasets: 3,675
Percentage of taste profile songs with metadata: 0.96%
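
Since only a small share of taste profile songs have metadata, later analysis will probably want just the overlapping records. The following is a minimal sketch of one way to do that, using toy stand-ins for the two DataFrames (the IDs, titles, and the name `listens` are made up for illustration).

```python
import pandas as pd

# Toy stand-ins for the two DataFrames; the real ones are loaded in section 1.
taste_profile = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "song_id": ["SOAAA", "SOBBB", "SOAAA", "SOCCC"],
    "play_count": [1, 2, 5, 1],
})
songs_metadata = pd.DataFrame({
    "song_id": ["SOAAA", "SOCCC", "SODDD"],
    "title": ["Song A", "Song C", "Song D"],
    "artist_name": ["Artist X", "Artist Y", "Artist Z"],
})

# Keep only listening records whose song has metadata.
has_metadata = taste_profile["song_id"].isin(songs_metadata["song_id"])

# Attach the metadata columns. validate="many_to_one" raises a MergeError
# if song_id is not unique in songs_metadata; in that case, drop duplicate
# song_id rows before merging.
listens = taste_profile[has_metadata].merge(
    songs_metadata, on="song_id", how="left", validate="many_to_one"
)

print(f"Records kept: {len(listens):,} of {len(taste_profile):,}")
```

Filtering with `isin` before merging keeps the intermediate result small, which matters when the left side has tens of millions of rows. The share of records kept can differ from the share of songs reported above, because popular songs contribute more records.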


## 3. Manual Save (Optional)

If you have modified the data and want to save it again, run this cell.

In [14]:
# Save both datasets to pickle format (built-in, efficient, preserves data types)
print("Saving datasets to files...")

# Save taste profile data
taste_profile_file = '../data/taste_profile.pkl'
os.makedirs('../data', exist_ok=True)
taste_profile.to_pickle(taste_profile_file)
print(f"✓ Saved taste profile to: {taste_profile_file}")
print(f"  Size: {len(taste_profile):,} rows")

# Save songs metadata
songs_metadata_file = '../data/songs_metadata.pkl'
songs_metadata.to_pickle(songs_metadata_file)
print(f"✓ Saved songs metadata to: {songs_metadata_file}")
print(f"  Size: {len(songs_metadata):,} rows")

print("\nData saved successfully! These files can be loaded quickly in future sessions.")

Saving datasets to files...
✓ Saved taste profile to: ../data/taste_profile.pkl
  Size: 48,373,586 rows
✓ Saved songs metadata to: ../data/songs_metadata.pkl
  Size: 10,000 rows

Data saved successfully! These files can be loaded quickly in future sessions.
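
To confirm that a saved pickle restores the same data, a round-trip check along these lines can help. This is a sketch on a toy DataFrame written to a temporary directory; the file name `example.pkl` is hypothetical, and the real files live under `../data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy DataFrame with an explicit non-default dtype to check dtype preservation.
df = pd.DataFrame({
    "song_id": ["SOAAA", "SOBBB"],
    "play_count": pd.Series([1, 2], dtype="int32"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pkl"  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# DataFrame.equals compares shape, values, and column dtypes.
round_trip_ok = restored.equals(df)
print(f"Round trip preserved the data: {round_trip_ok}")
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files you created yourself or otherwise trust.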
