# Portfolio-Exam Task 2 in MADS-DVVA (Data Visualization and Visual Analytics) - Data Retrieval

## Data Source
In this notebook I retrieve playlist tracks with the [Spotify API](https://developer.spotify.com/documentation/web-api). The notebook is structured as follows:

1. Request an API key and save it in a file.

2. Get the tracks of my friends' Spotify playlists, one playlist per friend.

3. Get further information about the tracks by requesting the features of the songs.

4. Get information about the songs' artists and genres.

5. Merge the 3 datasets and select interesting features.

## Want to run it yourself?
If you want to run the data retrieval yourself, follow these steps.

1. Create an app to get an access token: https://developer.spotify.com/documentation/web-api/tutorials/getting-started

2. Create a file called 'secret.json' and paste your client ID and client secret into it. The expected structure can be seen in the code below.

3. Retrieve playlist data by pasting the [playlist id](https://clients.caster.fm/knowledgebase/110/How-to-find-Spotify-playlist-ID.html) into the relevant lists. Don't forget to change the file names. You need to do this for steps 2, 3, 4 and 5, at the end of each code block.

4. Done!
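
The retrieval code reads the keys `client_id` and `client_secret` from `secret.json`. A minimal sketch of creating and validating that file could look like this. The placeholder values and the helper names `write_secret_template` and `load_credentials` are assumptions for illustration and are not part of the notebook.

```python
import json
from pathlib import Path
from typing import Dict

# Keys that the notebook's get_authorization_code() reads from the file
REQUIRED_KEYS = ("client_id", "client_secret")


def write_secret_template(path: str) -> None:
    """Write a skeleton file with placeholder values (hypothetical helper)."""
    template = {key: f"<your {key}>" for key in REQUIRED_KEYS}
    Path(path).write_text(json.dumps(template, indent=2), encoding="utf-8")


def load_credentials(path: str) -> Dict[str, str]:
    """Load the credentials and fail early if a required key is missing."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"{path} is missing keys: {missing}")
    return {key: data[key] for key in REQUIRED_KEYS}
```

Calling `write_secret_template("secret.json")` once and then replacing the placeholders with the values from the Spotify developer dashboard should be enough for the cells below to find the credentials.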

## Data Retrieval

Import the required libraries:

In [1]:
import requests
import json
import pandas as pd
import re
from typing import List, Dict


## 1. Request an API key and save it in a file.


In [2]:
def get_authorization_code() -> None:
    """
    Fetches an authorization token from Spotify API using client credentials flow 
    and saves the access token to a file.
    """
    url = "https://accounts.spotify.com/api/token"

    header = {
        "Content-Type": "application/x-www-form-urlencoded"
    }

    # Read client_id and client_secret from json file
    with open("secret.json", "r") as file:
        data = json.load(file)
        client_id = data["client_id"]
        client_secret = data["client_secret"]

    data = {
        "grant_type": "client_credentials",
        "scope": "playlist-read-private",
        "client_id": client_id,  
        "client_secret": client_secret 
    }

    # Send POST request to Spotify API
    response = requests.post(url, data=data, headers=header)

    # Handle response
    if response.status_code == 200:
        result = response.json()
        save_access_token(result)
        print("Access token obtained successfully:")
        print(result)
    else:
        print(f"Error: {response.status_code}")
        print(response.text)

def save_access_token(response: Dict[str, str]) -> None:
    """
    Saves the access token from the Spotify API response to a file.
    
    Args:
    - response (dict): Response JSON object from Spotify API containing access token.
    """
    access_token = response["access_token"]
    with open("access_token.txt", "w") as file:
        file.write(access_token)

# Execute the function to obtain and save the access token
get_authorization_code()


Access token obtained successfully:
{'access_token': 'BQBzTwWo2aiJgnMr-V-hmoUMceC9_fzZ8Z4dUW1QSkRkA2uU2UOKWlgvuEebtmt16IJiTYPLvZF8nqZQ9qvgmtqhzloB8xul2aS6bs3Y1vjYe0Gxhyrusob7CFREIeyD', 'token_type': 'Bearer', 'expires_in': 3600, 'scope': 'playlist-read-private'}
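
The response above reports `'expires_in': 3600`, so the saved token is only valid for about an hour. A minimal sketch of an expiry check follows, assuming the time of the token request is recorded alongside the token (the notebook does not do this). The function name `token_is_expired` and the 60-second `margin` are hypothetical.

```python
import time
from typing import Optional


def token_is_expired(issued_at: float, expires_in: int,
                     now: Optional[float] = None, margin: int = 60) -> bool:
    """Return True once the token is within `margin` seconds of expiring.

    issued_at and now are Unix timestamps in seconds; expires_in is the
    lifetime reported by the token endpoint.
    """
    current = time.time() if now is None else now
    return current >= issued_at + expires_in - margin


# A token issued at t=0 with a one-hour lifetime
print(token_is_expired(0, 3600, now=100))   # False
print(token_is_expired(0, 3600, now=3541))  # True
```

With such a check, `get_authorization_code()` could be rerun only when needed instead of on every execution of the notebook.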


## 2. Get the Tracks of Playlists


In [3]:
class SpotifyPlaylistFetcher:
    def __init__(self, playlist_id: str, access_token_file: str = "access_token.txt", 
                 offset: int = 0, limit: int = 100, save_name: str = 'response'):
        """
        Initialize the SpotifyPlaylistFetcher instance.

        Args:
        - playlist_id (str): Spotify playlist ID.
        - access_token_file (str): File containing Spotify API access token.
        - offset (int): Offset for pagination (default is 0).
        - limit (int): Limit for number of items per request (default is 100, max is 100).
        - save_name (str): Name of the file to save the combined response (default is 'response').
        """
        self.playlist_id = playlist_id
        self.access_token_file = access_token_file
        self.offset = offset
        self.limit = limit
        self.save_name = save_name
        self.access_token = self._read_access_token()
        self.headers = {'Authorization': f'Bearer {self.access_token}'}
    
    def _read_access_token(self) -> str:
        """
        Read the Spotify API access token from a file.

        Returns:
        - str: Spotify API access token.
        """
        with open(self.access_token_file, "r") as file:
            return file.read().strip()
    
    def get_playlist_items(self) -> pd.DataFrame:
        """
        Fetch all items (tracks) from the specified Spotify playlist.

        Returns:
        - pd.DataFrame: DataFrame containing all items (tracks) from the playlist.
        """
        # Initialize the list that holds the items of each page
        responses_items = []
        
        # Make the first request
        url = f"https://api.spotify.com/v1/playlists/{self.playlist_id}/tracks?offset={self.offset}&limit={self.limit}"
        response = requests.get(url, headers=self.headers).json()
        responses_items.append(self._get_items(response))
        
        # Loop through next pages if they exist
        while response.get("next"):
            self.offset += self.limit
            next_url = response["next"]
            response = requests.get(next_url, headers=self.headers).json()
            responses_items.append(self._get_items(response))

        # Combine the responses
        items = self._combine_items(responses_items)
        
        # Save the combined items
        self._save_response(items)
        
        return items
    
    def _combine_items(self, responses_list: list) -> pd.DataFrame:
        """
        Combine list of responses into a single DataFrame.

        Args:
        - responses_list (list): List of JSON responses from Spotify API.

        Returns:
        - pd.DataFrame: Combined DataFrame containing all items (tracks) from responses.
        """
        # Create a list of DataFrames from the JSON responses
        items_list = [pd.json_normalize(response) for response in responses_list]
        # Combine all DataFrames into a single DataFrame
        items = pd.concat(items_list, ignore_index=True)
        return items
    
    def _save_response(self, response: pd.DataFrame) -> None:
        """
        Save DataFrame to a CSV file.

        Args:
        - response (pd.DataFrame): DataFrame to be saved.
        """
        response.to_csv(self.save_name, index=False)
    
    def _get_items(self, response: dict) -> list:
        """
        Extract items (tracks) from Spotify API response.

        Args:
        - response (dict): JSON response from Spotify API.

        Returns:
        - list: List of items (tracks) from the response.
        """
        return response.get("items", [])

# change playlist_ids and names to the ones you want to fetch
playlist_ids = ['1b5wfP2ZAaNXEqLYEuo748', '4PK4fDXPX3Fi6puntDvxIG']
names = ['lars.csv', 'marco.csv']

for i in range(len(playlist_ids)):
    fetcher = SpotifyPlaylistFetcher(playlist_ids[i], save_name=f"data/{names[i]}")
    result = fetcher.get_playlist_items()
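
The fetcher above pages through a playlist by following the `next` URL in each response until it is empty. That loop can be sketched on its own, with the HTTP call replaced by an injected function so it runs without network access. The name `collect_items` and the fake pages are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional


def collect_items(first_url: str, fetch: Callable[[str], Dict]) -> List[Dict]:
    """Follow `next` links until the response reports no further page."""
    items: List[Dict] = []
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        items.extend(page.get("items", []))
        url = page.get("next")  # None or missing ends the loop
    return items


# Stand-in for the real API: two made-up pages keyed by URL
fake_pages = {
    "page-1": {"items": [{"id": "a"}, {"id": "b"}], "next": "page-2"},
    "page-2": {"items": [{"id": "c"}], "next": None},
}
all_items = collect_items("page-1", fake_pages.__getitem__)
print(len(all_items))  # 3
```

In the notebook, the `fetch` argument would correspond to `requests.get(url, headers=...).json()`.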


## 3. Get further information about the tracks by requesting the features of the songs.

The playlist request does not return much information about the tracks, so we request the audio features of each track with the following API: https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features

In [4]:
def get_audio_features(multiple_ids: List[str]) -> pd.DataFrame:
    """
    Fetches audio features from Spotify API for multiple track IDs.

    Args:
    - multiple_ids (list): List of track IDs.

    Returns:
    - pd.DataFrame: DataFrame containing audio features for the tracks.
    """
    id_batches = [multiple_ids[i:i + 100] for i in range(0, len(multiple_ids), 100)]
    access_token_file = "access_token.txt"
    
    with open(access_token_file, "r") as file:
        token = file.read().strip()
    
    headers = {'Authorization': f'Bearer {token}'}
    features = []
    
    for batch in id_batches:
        url = "https://api.spotify.com/v1/audio-features?ids=" + ','.join(batch)
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()  # Raise exception for HTTP errors
            data = response.json()
            features.extend(data['audio_features'])
        except requests.exceptions.RequestException as e:
            print(f"Error fetching audio features: {e}")
    
    return pd.DataFrame(features)


def get_playlist_ids(path: str) -> List[str]:
    """
    Extracts track IDs from a Spotify playlist CSV file.

    Args:
    - path (str): Path to the CSV file containing playlist data.

    Returns:
    - list: List of track IDs extracted from the playlist.
    """
    try:
        df = pd.read_csv(path)
        ids = []

        for index, row in df.iterrows():
            track_url = row.get('track.external_urls.spotify', '')
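            # Missing URLs are read back as NaN (a float), so only split real strings
            track_id = track_url.split('/')[-1] if isinstance(track_url, str) and track_url else None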
            if track_id:
                ids.append(track_id)
        
        return ids
    except FileNotFoundError:
        print(f"Error: File {path} not found.")
        return []
    except pd.errors.EmptyDataError:
        print(f"Error: File {path} is empty.")
        return []


def save_features(path: str, new_features: pd.DataFrame) -> None:
    """
    Saves audio features DataFrame to a CSV file.

    Args:
    - path (str): Path to the original playlist CSV file.
    - new_features (pd.DataFrame): DataFrame containing audio features to be saved.
    """
    try:
        new_features.to_csv(path[:-4] + '_audio_features.csv', index=False)
    except Exception as e:
        print(f"Error saving features to file: {e}")


playlists = ['lars.csv', 'marco.csv']
for playlist in playlists:
    path = 'data/' + playlist
    ids = get_playlist_ids(path)
    features = get_audio_features(ids)
    save_features(path, features)
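
The code above relies on two small steps: taking the track ID from the end of a track URL and splitting the IDs into batches of at most 100 per request. Both can be sketched in isolation as follows. The helper names `track_id_from_url` and `chunk_ids` and the example URL are made up for illustration.

```python
from typing import Iterator, List


def track_id_from_url(url: str) -> str:
    """Return the last path segment of a track URL, ignoring any query string."""
    return url.split("?")[0].rstrip("/").split("/")[-1]


def chunk_ids(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up URL and IDs, only to show the shapes involved
print(track_id_from_url("https://open.spotify.com/track/abc123?si=xyz"))  # abc123
batches = list(chunk_ids([f"id{i}" for i in range(250)], 100))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch can then be joined with commas and appended to the `ids=` query parameter, as `get_audio_features` does.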


## 4. Get Artists of Songs
The previous requests did not contain usable information about the artists. Spotify does not provide the genre of a specific song, only the genres of an artist. We therefore assume that the songs by an artist belong to that artist's genres. We will use the following API for this: https://developer.spotify.com/documentation/web-api/reference/get-multiple-artists

In [5]:
def get_artists_from_playlist(name: str) -> List[List[str]]:
    """
    Extracts artist IDs from a Spotify playlist CSV file.

    Args:
    - name (str): Path to the CSV file containing playlist data.

    Returns:
    - List[List[str]]: List of lists, where each sublist contains artist IDs for a track.
    """
    df = pd.read_csv(name)
    artists = []

    for index, row in df.iterrows():
        artist = row['track.artists']
        
        # Replace single quotes used as delimiters with double quotes
        artist = re.sub(r"(?<!\\)'", "\"", artist)
        
        try:
            artist_list = json.loads(artist)
            dummy = []
            for a in artist_list:
                dummy.append(a['id'])
            artists.append(dummy)
        except json.JSONDecodeError as e:
            # Report which row could not be parsed and why
            print(f"Failed to decode JSON for row {index}: {e}")
    
    return artists


def get_artist_info(multiple_ids: List[str]) -> pd.DataFrame:
    """
    Fetches detailed information for artists from Spotify API based on their IDs.

    Args:
    - multiple_ids (List[str]): List of artist IDs.

    Returns:
    - pd.DataFrame: DataFrame containing artists' information (name, genres, popularity).
    """
    id_batches = [multiple_ids[i:i + 50] for i in range(0, len(multiple_ids), 50)]
    access_token_file = "access_token.txt"
    
    with open(access_token_file, "r") as file:
        token = file.read().strip()
    
    headers = {'Authorization': f'Bearer {token}'}
    artist_df = pd.DataFrame()

    for ids in id_batches:
        url = "https://api.spotify.com/v1/artists?ids=" + ','.join(ids)
        response = requests.get(url, headers=headers).json()
        artists_info = response['artists']
        artist_df = pd.concat([artist_df, pd.DataFrame(artists_info, columns=['name', 'genres', 'popularity'])], ignore_index=True, axis=0)
    
    return artist_df


def get_nr_artists_per_song(artists: List[List[str]]) -> List[int]:
    """
    Computes the number of artists for each song in a playlist.

    Args:
    - artists (List[List[str]]): List of lists, where each sublist contains artist IDs for a track.

    Returns:
    - List[int]: List containing the number of artists per song.
    """
    return [len(a) for a in artists]


def flatten_list(l: List[List[str]]) -> List[str]:
    """
    Flattens a list of lists into a single list.

    Args:
    - l (List[List[str]]): List of lists to be flattened.

    Returns:
    - List[str]: Flattened list of strings.
    """
    return [item for sublist in l for item in sublist]


def unflatten_df(nr_artists_per_song: List[int], flattened_artists_df: pd.DataFrame) -> pd.DataFrame:
    """
    Converts a flattened DataFrame back into a structured DataFrame for songs.

    Args:
    - nr_artists_per_song (List[int]): List containing the number of artists per song.
    - flattened_artists_df (pd.DataFrame): Flattened DataFrame containing artists' information.

    Returns:
    - pd.DataFrame: DataFrame with structured columns (names, genres, popularity) for each song.
    """
    song_names = []
    song_genres = []
    song_popularity = []
    beg_index = 0
    
    for nr in nr_artists_per_song:
        end_index = beg_index + nr
        sub_df = flattened_artists_df.iloc[beg_index:end_index]
        
        song_names.append(sub_df['name'].tolist())
        song_genres.append(sub_df['genres'].tolist())
        song_popularity.append(sub_df['popularity'].tolist())
        
        beg_index = end_index

    result_df = pd.DataFrame({
        'names': song_names,
        'genres': song_genres,
        'popularity': song_popularity
    })
    
    # Ensure 'genres' column is one-dimensional
    result_df['genres'] = result_df['genres'].apply(lambda x: x[0] if x else None)
    
    return result_df

def fuse_dataframes(name: str, artists: pd.DataFrame) -> pd.DataFrame:
    """
    Merges a DataFrame containing playlist data with a DataFrame containing artists' information.

    Args:
    - name (str): Path to the CSV file containing playlist data.
    - artists (pd.DataFrame): DataFrame containing artists' information.

    Returns:
    - pd.DataFrame: Merged DataFrame containing both playlist data and artists' information.
    """
    df1 = pd.read_csv(name)
    return pd.concat([df1, artists], axis=1)

def save_artist_info(artist_info: pd.DataFrame, name: str) -> None:
    """
    Saves DataFrame containing artists' information to a CSV file.

    Args:
    - artist_info (pd.DataFrame): DataFrame containing artists' information.
    - name (str): Path to save the CSV file.
    """
    artist_info.to_csv(name, index=False)

names = ['lars.csv', 'marco.csv']
for name in names:
    path = 'data/' + name
    artists = get_artists_from_playlist(path)
    nr_artists_per_song = get_nr_artists_per_song(artists)
    flattened_artists = flatten_list(artists)
    artist_info = get_artist_info(flattened_artists)
    artists_correct_shape = unflatten_df(nr_artists_per_song, artist_info)
    save_artist_info(artists_correct_shape, path[:-4] + '_artist_info.csv')


Failed to decode JSON for row 54: Expecting ',' delimiter: line 1 column 209 (char 208)
Failed to decode JSON for row 55: Expecting ',' delimiter: line 1 column 209 (char 208)
Failed to decode JSON for row 56: Expecting ',' delimiter: line 1 column 209 (char 208)


Some rows are not parsed properly, for example 'Guns N' Roses', because the artist name itself contains a single quote. This bug has not been fixed, since it does not influence the data analysis, but it should be fixed in the future.
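
One possible fix, sketched here under the assumption that the `track.artists` column holds the Python string form of a list of dicts (which is what pandas writes to CSV), is to parse the cell with `ast.literal_eval` from the standard library instead of rewriting quotes and calling `json.loads`. The helper name `parse_artist_ids` and the example data are hypothetical.

```python
import ast
from typing import List


def parse_artist_ids(cell: str) -> List[str]:
    """Parse the string form of a list of artist dicts and return the IDs."""
    # literal_eval understands Python literals, including both quote styles
    artists = ast.literal_eval(cell)
    return [artist["id"] for artist in artists]


# Made-up cell in the shape found in the CSV: a Python repr, not JSON
cell = str([{"id": "artist-1", "name": "Guns N' Roses"}])
print(parse_artist_ids(cell))  # ['artist-1']
```

Rows that fail to parse are skipped in `get_artists_from_playlist`, so the artist list can end up shorter than the playlist. Parsing every row keeps both the same length, which matters if the files are later combined by row position.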

## 5. Merging Datasets
In the last step we will merge the datasets and select the interesting features.


In [6]:
def select_features(combined_dfs):
    """
    Selects specific features (columns) from a combined DataFrame.

    Args:
    - combined_dfs (pd.DataFrame): Combined DataFrame containing data from multiple sources.

    Returns:
    - pd.DataFrame: DataFrame containing selected features.
    """
    selected_features = [
        'track.name', 'added_at',
        'track.album.release_date', 'track.album.release_date_precision', 'danceability', 
        'energy', 'key', 'loudness',
        'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
        'type', 'duration_ms', 'time_signature', 'uri',
        'names', 'genres', 'popularity'
    ]

    selected_df = combined_dfs[selected_features]
    return selected_df

def rename_columns(df):
    """
    Renames columns in a DataFrame to make them more descriptive.

    Args:
    - df (pd.DataFrame): DataFrame to be modified.

    Returns:
    - pd.DataFrame: DataFrame with renamed columns.
    """
    column_mapping = {
        'track.name': 'track_name',
        'track.album.release_date': 'release_date', 
        'track.album.release_date_precision': 'release_date_precision',
        'danceability': 'danceability', 
        'energy': 'energy', 
        'key': 'key', 
        'loudness': 'loudness',
        'mode': 'mode', 
        'speechiness': 'speechiness', 
        'acousticness': 'acousticness', 
        'instrumentalness': 'instrumentalness', 
        'liveness': 'liveness', 
        'valence': 'valence', 
        'tempo': 'tempo',
        'type': 'type', 
        'uri': 'uri', 
        'duration_ms': 'duration_ms', 
        'time_signature': 'time_signature',
        'names': 'artist_names', 
        'genres': 'artist_genres', 
        'popularity': 'artist_popularity'
    }

    return df.rename(columns=column_mapping)

def concat_dfs(dfs):
    """
    Concatenates multiple DataFrames horizontally (column-wise).

    Args:
    - dfs (list of pd.DataFrame): List of DataFrames to be concatenated.

    Returns:
    - pd.DataFrame: Concatenated DataFrame.
    """
    return pd.concat(dfs, axis=1)

def save_df(df, name):
    """
    Saves a DataFrame to a CSV file.

    Args:
    - df (pd.DataFrame): DataFrame to be saved.
    - name (str): Name of the CSV file to save.
    """
    df.to_csv(name, index=False)


files = [['data/lars.csv', 'data/lars_audio_features.csv', 'data/lars_artist_info.csv'],
             ['data/marco.csv', 'data/marco_audio_features.csv', 'data/marco_artist_info.csv']]
for file in files:
    dfs = [pd.read_csv(one_file) for one_file in file]
    combined_df = concat_dfs(dfs)
    selected_df = select_features(combined_df)
    selected_df = rename_columns(selected_df)
    save_df(selected_df, file[0][:-4] + '_combined.csv')
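
`concat_dfs` joins the three files side by side, so each row is only meaningful if all three files have the same number of rows in the same order. A minimal sketch of a check that could run before the concatenation is shown below. The name `row_count_mismatches` is an assumption, and plain lists stand in for DataFrames because `len()` returns the row count for both.

```python
from typing import Dict, Mapping, Sized


def row_count_mismatches(tables: Mapping[str, Sized]) -> Dict[str, int]:
    """Return the tables whose length differs from the first table's length."""
    lengths = {name: len(table) for name, table in tables.items()}
    if not lengths:
        return {}
    expected = next(iter(lengths.values()))
    return {name: n for name, n in lengths.items() if n != expected}


# Lists stand in for the three DataFrames of one playlist
example = {
    "playlist": [0] * 5,
    "audio_features": [0] * 5,
    "artist_info": [0] * 4,
}
print(row_count_mismatches(example))  # {'artist_info': 4}
```

An empty result suggests the files line up by length; a non-empty one points to the file that lost or gained rows in an earlier step.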


## Code Assistance

- **GitHub Copilot**: Version 1.206.0.0

- **ChatGPT**: Version 3.5