### Custom Dataset Creation Script

This Python script fetches information for the top N songs using the Last.fm API, processes the data, and creates a custom dataset. The resulting dataset includes details such as song name, artist, genre, and lyrics.

### Configurables:

- **`ALL_GENRES`:**
  - All possible genres obtained from the 'config.json' file.

- **`NUM_SONGS`:**
  - Number of top songs to fetch. Set to 100 by default.
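
The layout of 'config.json' is not shown in this notebook. As a rough sketch, assuming it is a JSON object whose `ALL_GENRES` key holds a list of genre names, the file could be read and sanity-checked as below. The sample genres and the `load_genres` helper are hypothetical and only serve as illustration.

```python
import json
import os
import tempfile

# Hypothetical config contents; the real 'config.json' may hold other keys and genres.
SAMPLE_CONFIG = {"ALL_GENRES": ["pop", "rock", "hip-hop"]}


def load_genres(path):
    """Read ALL_GENRES from a JSON config file and check it is a non-empty list of strings."""
    with open(path, "r") as json_file:
        config = json.load(json_file)
    genres = config.get("ALL_GENRES")
    if not isinstance(genres, list) or not genres:
        raise ValueError("ALL_GENRES must be a non-empty list")
    if not all(isinstance(genre, str) for genre in genres):
        raise ValueError("every entry in ALL_GENRES must be a string")
    return genres


# Write the sample config to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "config.json")
    with open(sample_path, "w") as json_file:
        json.dump(SAMPLE_CONFIG, json_file)
    loaded_genres = load_genres(sample_path)
```

Checking the value right after loading makes a malformed config fail early, before any API calls are made.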

### Instructions for Creating Dataset

To generate the custom dataset, run each cell in order. The resulting dataset is saved as a CSV file named 'custom_dataset.csv' in the 'data/custom' directory. Change the file name if you do not want a later run to overwrite it.
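
One way to keep earlier runs from being overwritten is to put a timestamp in the file name. The following is a minimal sketch of that idea; `timestamped_csv_path` is a hypothetical helper that is not part of this project, and the naming pattern is only an assumption.

```python
from datetime import datetime
from pathlib import Path


def timestamped_csv_path(directory, stem="custom_dataset", now=None):
    """Return a path such as <directory>/custom_dataset_20240131_142530.csv."""
    now = now or datetime.now()
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    return out_dir / "{}_{}.csv".format(stem, now.strftime("%Y%m%d_%H%M%S"))
```

With such a helper, the save step could read `df.to_csv(timestamped_csv_path("data/custom"), index=False)`, since `to_csv` accepts a path object as well as a string.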


In [1]:
import pandas as pd 
from utils.lastfm_functions import get_tags, get_top_tracks
from utils.genuis_functions import get_lyrics
from utils.genre_helper import tags_to_genre
import json 

In [2]:
# read global constants in from the config file
json_file_path = 'config.json'
with open(json_file_path, 'r') as json_file:
    config_dict = json.load(json_file)

In [3]:
ALL_GENRES = config_dict['ALL_GENRES'] # All possible genres obtained from the configuration file
NUM_SONGS = 100 # Default number of top songs to fetch
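
`ALL_GENRES` is later passed to `tags_to_genre`, whose implementation lives in `utils.genre_helper` and is not shown here. Purely as an illustration of how a list of tags might be reduced to a single genre, the sketch below assumes that tags arrive as a list of strings and that the first tag matching a known genre (ignoring case) wins. The function `first_matching_genre` is hypothetical and may differ from what the real helper does.

```python
def first_matching_genre(tags, all_genres):
    """Return the first tag that names a known genre, or None if nothing matches."""
    # Map lower-cased genre names back to their spelling in the config.
    known = {genre.lower(): genre for genre in all_genres}
    for tag in tags:
        match = known.get(tag.strip().lower())
        if match is not None:
            return match
    return None
```

Returning `None` when nothing matches fits the `if genre and lyrics` check used when rows are stored, because `None` is falsy and the track is then skipped.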

In [4]:
def make_data_set(num_songs=5):
    """
    Fetches information for the top N songs using Last.fm API, processes the data, and creates a custom dataset.

    Args:
        num_songs (int, optional): Number of top songs to fetch. Defaults to 5.

    Returns:
        DataFrame: A Pandas DataFrame containing information such as song name, artist, genre, and lyrics.
    """
    # Get the top n songs 
    tracks = get_top_tracks(num_songs)  
    
    # init dataframe to hold all the data 
    df = pd.DataFrame(columns=['song', 'artist', 'genre', 'lyrics'])
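    count = 0  # number of songs successfully added
    # enumerate from 1 so the progress counter matches the song's position
    for idx, track in enumerate(tracks, start=1):
        print("{}/{}".format(idx, num_songs))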
        # get song info for track 
        song = track['name']    
        artist = track['artist']['name']
        
        tags = get_tags(artist, song)
        # map tags to a genre only when tags were found; otherwise genre stays falsy
        genre = tags and tags_to_genre(tags, ALL_GENRES)
        lyrics = get_lyrics(artist, song)
        
        # store in dataframe
        if genre and lyrics:
            print("Successfully added info for: {} by {}".format(song, artist))
            df.loc[len(df.index)] = [song, artist, genre, lyrics]
            count += 1 
        else:
            print("Error adding {} by {}".format(song, artist))
    
    print("Added {} / {} songs".format(count, len(tracks)))
    df.to_csv("custom_dataset.csv", index=False)
        
    return df
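
Appending with `df.loc[len(df.index)]` works, but it grows the DataFrame one row at a time. A common alternative is to collect the rows in a plain list and build the DataFrame once at the end. The sketch below shows that pattern under the assumption that each track has the same `name` and `artist` layout as above. `build_dataset`, the sample tracks, and the stand-in lookup functions are hypothetical, so the example runs without any API calls.

```python
import pandas as pd


def build_dataset(tracks, fetch_genre, fetch_lyrics):
    """Collect one dict per usable track, then build the DataFrame in a single step."""
    rows = []
    for track in tracks:
        song = track["name"]
        artist = track["artist"]["name"]
        genre = fetch_genre(artist, song)
        lyrics = fetch_lyrics(artist, song)
        if genre and lyrics:  # skip tracks with a missing genre or missing lyrics
            rows.append({"song": song, "artist": artist, "genre": genre, "lyrics": lyrics})
    return pd.DataFrame(rows, columns=["song", "artist", "genre", "lyrics"])


# Stand-in data and lookups; these replace the Last.fm and lyrics calls for the demo.
sample_tracks = [
    {"name": "Song A", "artist": {"name": "Artist A"}},
    {"name": "Song B", "artist": {"name": "Artist B"}},
]
fake_genres = {("Artist A", "Song A"): "pop"}  # Song B has no genre and is skipped
sample_df = build_dataset(
    sample_tracks,
    fetch_genre=lambda artist, song: fake_genres.get((artist, song)),
    fetch_lyrics=lambda artist, song: "la la la",
)
```

Passing the lookup functions in as arguments also makes the loop easy to exercise with fake data, as done here.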

In [5]:
dataset = make_data_set(NUM_SONGS)

1/100
Successfully added info for: My Love Mine All Mine by Mitski
2/100
Successfully added info for: yes, and? by Ariana Grande
3/100
Successfully added info for: Pink + White by Frank Ocean
4/100
Successfully added info for: See You Again (feat. Kali Uchis) by Tyler, the Creator
5/100
Successfully added info for: Cruel Summer by Taylor Swift
6/100
Successfully added info for: Lovers Rock by TV Girl
7/100
Successfully added info for: Murder on the Dancefloor by Sophie Ellis-Bextor
8/100
Successfully added info for: Feather by Sabrina Carpenter
9/100
Error adding Kill Bill by SZA
10/100
Successfully added info for: vampire by Olivia Rodrigo
11/100
Successfully added info for: Stargirl Interlude by The Weeknd
12/100
