# Data Collection
Before exploring which aspects of a song contribute to its potential for virality, we must first retrieve the audio data from Spotify's API.

### `Session` API Client Description
To access Spotify's data efficiently and within its guidelines, I set up a dedicated `Session` object. This object holds the access token required to interface with Spotify's endpoints, along with logic to renew the token automatically when it expires. Because Spotify limits how much data can be requested at one time, the `Session` object splits each large request into a series of smaller ones and waits 1 second after each request. This keeps us under both the rate limit and the per-request fetch limit. The results of the separate requests are then combined into a single, complete dataset.
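
The batching-and-pausing behaviour described above can be sketched as follows. This is a minimal illustration, not the actual `session.py` implementation: the names `fetch_in_batches`, `fetch_batch`, `batch_size`, and `pause_seconds` are hypothetical, and the batch size of 100 is an assumed per-request limit. A stand-in function replaces the real API call so the sketch runs without network access.

```python
import time
from typing import Callable, Iterable, List


def fetch_in_batches(
    ids: Iterable[str],
    fetch_batch: Callable[[List[str]], List[dict]],
    batch_size: int = 100,       # assumed per-request limit (hypothetical value)
    pause_seconds: float = 1.0,  # wait after each request, as described above
) -> List[dict]:
    """Request `ids` in small chunks and stitch the results together."""
    ids = list(ids)
    results: List[dict] = []
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        results.extend(fetch_batch(chunk))  # one small request
        time.sleep(pause_seconds)           # pause before the next request
    return results


# Stand-in for a real API call: echoes each ID back as a record
request_sizes = []

def fake_fetch(chunk: List[str]) -> List[dict]:
    request_sizes.append(len(chunk))
    return [{'id': track_id} for track_id in chunk]


records = fetch_in_batches(
    [f'track{i}' for i in range(250)], fake_fetch, pause_seconds=0
)
print(len(records))    # 250
print(request_sizes)   # [100, 100, 50]
```

The pause is set to 0 here only to keep the demonstration fast; the caller never sees the individual requests, only the combined list.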

In [1]:
# Import the Session class (and the ATTRIBUTES list used in later cells) from 'session.py'
from session import *

In [2]:
# Initialize a Spotify session
session = Session()

# Generate session access token
session.renew_token()

### Step 1: Picking Spotify playlists to explore 
To ensure a fair comparison between genres, every playlist selected was curated by Spotify and has at least 1 million likes. Although certain tracks may appear in multiple playlists within a genre, our `Session` API client is designed to prevent duplicate entries. Once all playlists for each group have been processed, we hope to see a roughly even number of tracks across genres.

In [3]:
''' Define the Spotify IDs for playlists to explore '''

# Popular Pop playlists 
pop_playlist_ids = [
    '37i9dQZF1DXcBWIGoYBM5M', # Today's Top Hits - 50 songs
    '37i9dQZF1DX0kbJZpiYdZl', # Hot Hits USA - 50 songs
    '37i9dQZF1DWUa8ZRTfalHk', # Pop Rising - 85 songs
    '37i9dQZEVXbLp5XoPON0wI', # Top Songs USA - 50 songs
    '37i9dQZF1DWWvvyNmW9V9a' # teen beats - 104 songs
]

# Popular Hip-Hop playlists
hiphop_playlist_ids = [
    '37i9dQZF1DX0XUsuxWHRQd', # RapCaviar - 51 songs
    '37i9dQZF1DX6GwdWRQMQpq', # Feelin' Myself - 50 songs
    '37i9dQZF1DX2RxBh64BHjQ', # Most Necessary - 100 songs
    '37i9dQZF1DWY4xHQp97fN6' # Get Turnt - 100 songs
]

# Viral internet playlists 
viral_playlist_ids = [
    '37i9dQZF1DX2L0iB23Enbq', # Viral Hits - 76 songs
    '37i9dQZF1DX4KeocBrdbJg', # Hits de Internet - 100 songs
    '37i9dQZF1DX6OgmB2fwLGd' # Internet People - 100 songs 
]

### Step 2: Collecting track data from all playlists
Before we can request audio-feature data, we first need to collect the Spotify IDs of all tracks. The data returned from `Session` is stored in a dictionary with track IDs as keys and a nested dictionary of track attributes as values. Because dictionary keys are unique, keying on track ID prevents duplicates and makes it easy to merge in the audio-feature data later.

In [4]:
''' Collect the basic track data from all playlists '''
# Import required library
from collections import defaultdict

# Collect Pop playlists' track data  
pop_data = defaultdict(dict) # Key = Track ID, Val = {'att': 'val', ...}
for p_playlist_id in pop_playlist_ids:
    # Retrieve playlist from Session
    p_tracks = session.get_playlist_tracks(p_playlist_id)

    # Add tracks to group dictionary 
    for pt in p_tracks:
        pop_data[pt['id']]['name'] = pt['name']
        pop_data[pt['id']]['artist'] = pt['artist']

# Collect Hip-Hop playlists' track data  
hiphop_data = defaultdict(dict) # Key = Track ID, Val = {'att': 'val', ...}
for h_playlist_id in hiphop_playlist_ids:
    # Retrieve playlist from Session
    h_tracks = session.get_playlist_tracks(h_playlist_id)

    # Add tracks to group dictionary 
    for ht in h_tracks:
        hiphop_data[ht['id']]['name'] = ht['name']
        hiphop_data[ht['id']]['artist'] = ht['artist']

# Collect Viral playlists' track data 
viral_data = defaultdict(dict) # Key = Track ID, Val = {'att': 'val', ...}
for v_playlist_id in viral_playlist_ids:
    # Retrieve playlist from Session
    v_tracks = session.get_playlist_tracks(v_playlist_id)

    # Add tracks to group dictionary
    for vt in v_tracks:
        viral_data[vt['id']]['name'] = vt['name']
        viral_data[vt['id']]['artist'] = vt['artist']
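
The de-duplication behaviour of a dictionary keyed by track ID can be illustrated with a small example. The track records below are hypothetical sample data, not real Spotify responses, and `group_data` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical sample data: two playlists that share one track ('t2')
playlist_a = [
    {'id': 't1', 'name': 'Song One', 'artist': 'Artist A'},
    {'id': 't2', 'name': 'Song Two', 'artist': 'Artist B'},
]
playlist_b = [
    {'id': 't2', 'name': 'Song Two', 'artist': 'Artist B'},
    {'id': 't3', 'name': 'Song Three', 'artist': 'Artist C'},
]

group_data = defaultdict(dict)  # Key = track ID, Val = attribute dictionary
for playlist in (playlist_a, playlist_b):
    for track in playlist:
        # Writing to an existing key overwrites it instead of adding a new row
        group_data[track['id']]['name'] = track['name']
        group_data[track['id']]['artist'] = track['artist']

print(len(group_data))     # 3 unique tracks from 4 playlist entries
print(sorted(group_data))  # ['t1', 't2', 't3']
```

Note that this removes duplicates only within a single group dictionary; a track that appears in playlists from two different groups would still be recorded once in each group.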


In [5]:
''' Check the data samples we collected for each group '''
num_pop = len(pop_data)
num_hiphop = len(hiphop_data)
num_viral = len(viral_data)
total = num_pop + num_hiphop + num_viral

print(f'Total tracks recorded: {total}')
print(f'Pop: {num_pop} tracks, {round(((num_pop / total) * 100), 2)}% of total dataset')
print(f'Hip-Hop: {num_hiphop} tracks, {round(((num_hiphop / total) * 100), 2)}% of total dataset')
print(f'Viral: {num_viral} tracks, {round(((num_viral / total) * 100), 2)}% of total dataset')

Total tracks recorded: 726
Pop: 233 tracks, 32.09% of total dataset
Hip-Hop: 241 tracks, 33.2% of total dataset
Viral: 252 tracks, 34.71% of total dataset


Each genre accounts for roughly a third of the tracks collected, which is the balance we were looking for.

### Step 3: Collecting Audio-feature data from all playlists
Now that each genre's dictionary is keyed by track ID, our `Session` client can request the audio-feature data for every track ID in a genre. The returned data is a list of dictionaries, one per track, each holding the track's audio features under 13 predefined attribute keys. Every entry in the list is merged into the attribute dictionary associated with its track ID.

In [6]:
''' Collect Audio-feature data for all genres '''
# Collect Pop tracks' audio data  
pop_track_ids = list(pop_data.keys())
pop_audio_features = session.get_audio_features(pop_track_ids)
for p_entry in pop_audio_features:
    for attr in ATTRIBUTES:
        pop_data[p_entry['id']][attr] = p_entry[attr]

# Collect Hip-Hop tracks' audio data 
hiphop_track_ids = list(hiphop_data.keys())
hiphop_audio_features = session.get_audio_features(hiphop_track_ids)
for h_entry in hiphop_audio_features:
    for attr in ATTRIBUTES:
        hiphop_data[h_entry['id']][attr] = h_entry[attr]

# Collect Viral tracks' audio data 
viral_track_ids = list(viral_data.keys())
viral_audio_features = session.get_audio_features(viral_track_ids)
for v_entry in viral_audio_features:
    for attr in ATTRIBUTES:
        viral_data[v_entry['id']][attr] = v_entry[attr]
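
The merge step can also be written defensively, so that a null entry in the returned list does not raise an error. The sketch below is one possible approach under stated assumptions: `merge_audio_features` and `SAMPLE_ATTRIBUTES` are hypothetical names, the sample values are made up, and whether `Session` ever returns null entries depends on its implementation.

```python
# Hypothetical stand-in for the ATTRIBUTES list; the real list has 13 entries
SAMPLE_ATTRIBUTES = ['danceability', 'energy', 'tempo']


def merge_audio_features(data, audio_features, attributes):
    """Copy audio-feature values into each track's attribute dictionary.

    Returns the number of entries that were skipped.
    """
    skipped = 0
    for entry in audio_features:
        # Skip null entries and entries whose track ID we did not collect
        if entry is None or entry.get('id') not in data:
            skipped += 1
            continue
        for attr in attributes:
            if attr in entry:
                data[entry['id']][attr] = entry[attr]
    return skipped


tracks = {
    't1': {'name': 'Song One', 'artist': 'Artist A'},
    't2': {'name': 'Song Two', 'artist': 'Artist B'},
}
features = [
    {'id': 't1', 'danceability': 0.71, 'energy': 0.62, 'tempo': 120.0},
    None,  # a null entry, as might be returned for an unavailable track
]

skipped = merge_audio_features(tracks, features, SAMPLE_ATTRIBUTES)
print(skipped)                  # 1
print(tracks['t1']['tempo'])    # 120.0
print('tempo' in tracks['t2'])  # False
```

A track whose features were skipped keeps only its `name` and `artist` keys, which a later validation pass can then flag.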

### Step 4: Data Validation
Despite the robust documentation and maintenance of Spotify's API endpoints, we occasionally encounter null entries or records that lack certain attributes. To maintain the integrity of our dataset, it's crucial to perform checks for missing data. This step ensures that the data collection aligns with our expectations and that our analyses will be based on complete and accurate information. 

In [7]:
''' Define helper function to validate the collected data '''
# The expected length for values in our data dictionaries is 15: ['name', 'artist', and 13 ATTRIBUTES]
def show_missing_data(data):
    num_entries = len(data)
    num_rows_with_missing = 0
    missing_attr_ct = {} # Dictionary to count entries with null data for an attribute 

    # Define all expected keys to find in nested data dictionary for each track 
    expected_keys = ATTRIBUTES.copy()
    expected_keys.extend(['name', 'artist'])
    
    # Count missing data 
    for entry in data.values():
        is_row_missing_any = False

        # Check whether each expected key is absent or null in the current row
        for key in expected_keys:
            if entry.get(key) is None:
                # Row is missing a value for this key
                is_row_missing_any = True
                missing_attr_ct[key] = missing_attr_ct.get(key, 0) + 1

        if is_row_missing_any:
            num_rows_with_missing += 1

    # Print results
    print(f'Dataset contains {num_entries} rows')
    print(f'# of rows with missing values: {num_rows_with_missing}')
    missing_attr_items = missing_attr_ct.items()
    total_missing = 0
    if missing_attr_items: 
        # Dataset contains attributes with null entries 
        print('{Missing Attributes : # of null entries}')
        
        for k, v in missing_attr_items:
            print(f'\'{k}\' : {v}')
            total_missing += v
    print(f'# of missing values: {total_missing}\n')
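
A key can be missing in two ways: it can be absent from a track's dictionary, or it can be present with a null value. The sketch below counts both cases on toy data using `collections.Counter`. The expected keys and the records are hypothetical and much shorter than the real 15-key rows.

```python
from collections import Counter

# Hypothetical expected keys; the notebook uses 'name', 'artist', and ATTRIBUTES
expected = ['name', 'artist', 'tempo']

toy_data = {
    't1': {'name': 'Song One', 'artist': 'Artist A', 'tempo': 120.0},
    't2': {'name': 'Song Two', 'artist': 'Artist B'},             # 'tempo' absent
    't3': {'name': 'Song Three', 'artist': None, 'tempo': None},  # two null values
}

missing_per_key = Counter()
rows_with_missing = 0
for entry in toy_data.values():
    # Treat a key as missing if it is absent or its value is None
    missing = [key for key in expected if entry.get(key) is None]
    missing_per_key.update(missing)
    rows_with_missing += bool(missing)

print(rows_with_missing)      # 2
print(dict(missing_per_key))  # {'tempo': 2, 'artist': 1}
```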

In [8]:
''' Validate our collected datasets '''
print('== Pop dataset ==')
show_missing_data(pop_data)

print('== Hip-Hop dataset ==')
show_missing_data(hiphop_data)

print('== Viral dataset ==')
show_missing_data(viral_data)

== Pop dataset ==
Dataset contains 233 rows
# of rows with missing values: 0
# of missing values: 0

== Hip-Hop dataset ==
Dataset contains 241 rows
# of rows with missing values: 0
# of missing values: 0

== Viral dataset ==
Dataset contains 252 rows
# of rows with missing values: 0
# of missing values: 0



### Step 5: Saving the collected data 
The dataset collected for each genre contains no missing values and matches our expectations. We can now merge the datasets and save them as a single CSV file, with a `genre` attribute recording each track's genre. The genres are encoded as follows:

| **Genre** | **Encoding** |
| -------- | ------- |
| Pop | 0 |
| Hip-Hop | 1 |
| Viral | 2 |

In [9]:
''' Save track data to csv '''
# Import library 
import csv 

# Define headers 
csv_headers = ATTRIBUTES.copy()
csv_headers.extend(['name', 'artist', 'genre'])

# Write to CSV file
with open('spotify_api_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=csv_headers)
    writer.writeheader()  # Write the header

    # Write Pop genre data 
    for row in pop_data.values():
        row['genre'] = 0 # Pop tracks are encoded with 'genre' = 0
        writer.writerow(row)

    # Write Hip-Hop genre data 
    for row in hiphop_data.values():
        row['genre'] = 1 # Hip-Hop tracks are encoded with 'genre' = 1
        writer.writerow(row)

    # Write Viral genre data 
    for row in viral_data.values():
        row['genre'] = 2 # Viral tracks are encoded with 'genre' = 2
        writer.writerow(row)
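
When the file is read back later, the numeric `genre` codes need to be mapped to their labels again. The sketch below round-trips two made-up rows through an in-memory buffer rather than the real `spotify_api_data.csv`. The `GENRE_LABELS` dictionary is a hypothetical helper built from the encoding table above, and the header list is shortened for illustration.

```python
import csv
import io

# Encoding taken from the table above; this lookup dictionary is a hypothetical helper
GENRE_LABELS = {0: 'Pop', 1: 'Hip-Hop', 2: 'Viral'}

headers = ['tempo', 'name', 'artist', 'genre']  # shortened, hypothetical header list
rows = [
    {'tempo': 120.0, 'name': 'Song One', 'artist': 'Artist A', 'genre': 0},
    {'tempo': 95.5, 'name': 'Song Two', 'artist': 'Artist B', 'genre': 2},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(rows)

# Read the rows back; the csv module returns every value as a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
labels = [GENRE_LABELS[int(row['genre'])] for row in loaded]

print(type(loaded[0]['tempo']).__name__)  # str
print(labels)                             # ['Pop', 'Viral']
```

Because `csv.DictReader` yields strings, numeric columns such as the audio features and `genre` must be converted explicitly before analysis.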