### Kaggle Dataset Modifications

This Python script is designed to enhance a Kaggle dataset containing song lyrics by incorporating additional genres obtained from Last.fm tags. The primary goal is to diversify the genre information associated with each song in the dataset.

### Configurables:

1. ***`DATASET_SIZE`:***
   - Determines the size of each chunk while processing the Kaggle dataset. Default value is set to 500 songs per chunk.

2. ***`OVERWRITE`:***
   - Controls whether existing dataset chunk files are overwritten. Defaults to `False`, so chunks that already exist on disk are skipped rather than regenerated.

3. ***`ALL_GENRES`:***
   - Obtained from the 'config.json' file, it represents the set of all possible genres. Used during the genre determination process based on Last.fm tags.
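
As a rough illustration of how `ALL_GENRES` might be stored and loaded, the sketch below writes a small sample config to a temporary file and reads it back the same way the notebook reads 'config.json'. The genre names are hypothetical placeholders, and storing the genres as a JSON list is an assumption; the real 'config.json' may contain different values and additional keys.

```python
import json
import os
import tempfile

# Hypothetical genre names, for illustration only; the real list lives in config.json.
sample_config = {"ALL_GENRES": ["rock", "pop", "rap", "country"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")

    # Write the sample config so the example is self-contained.
    with open(config_path, "w") as json_file:
        json.dump(sample_config, json_file)

    # Read it back the same way the notebook reads config.json.
    with open(config_path, "r") as json_file:
        config_dict = json.load(json_file)

ALL_GENRES = config_dict["ALL_GENRES"]
print(ALL_GENRES)
```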

### Instructions for Creating Dataset

To generate the enhanced dataset, follow these instructions and run each cell in order. The resulting dataset will be broken into chunks and stored as CSV files in the 'data/kaggle_modified' directory.

#### Steps:
1. **Read Configuration:** Open the script and ensure the 'config.json' file is correctly configured with your preferences.
2. **Run Each Cell:** Execute the cells from top to bottom. The Kaggle dataset is processed in chunks, and progress is printed for each chunk and each song.
3. **Check Results:** Verify the 'data/kaggle_modified' directory contains the CSV files named 'dataset_{chunk_number}.csv'.
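
One possible way to perform the check in step 3 is sketched below. The helper name `list_chunk_files` is hypothetical (it is not part of the notebook); it assumes only the 'dataset_{chunk_number}.csv' naming and the 'data/kaggle_modified' folder described above, and it sorts the files numerically so that chunk 10 comes after chunk 2.

```python
import glob
import os
import re


def list_chunk_files(folder="data/kaggle_modified"):
    """Return the dataset_{n}.csv files in `folder`, sorted by chunk number."""
    pattern = os.path.join(folder, "dataset_*.csv")

    def chunk_number(path):
        # Extract n from dataset_{n}.csv; -1 marks files that do not match.
        match = re.search(r"dataset_(\d+)\.csv$", os.path.basename(path))
        return int(match.group(1)) if match else -1

    files = [path for path in glob.glob(pattern) if chunk_number(path) >= 0]
    return sorted(files, key=chunk_number)


chunk_files = list_chunk_files()
print("Found {} chunk file(s)".format(len(chunk_files)))
```

If the folder does not exist yet, the helper simply returns an empty list.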

In [None]:
import json
import pandas as pd 
import os 
from utils.lastfm_functions import get_tags
from utils.genre_helper import tags_to_genre

In [2]:
# read global constants in from the config file
json_file_path = 'config.json'
with open(json_file_path, 'r') as json_file:
    config_dict = json.load(json_file)

In [4]:
DATASET_SIZE = 500 # number of songs in each dataframe chunk (you rarely need to change this)
OVERWRITE = False # whether or not to overwrite existing dataset chunk files (you rarely need to change this)
ALL_GENRES = config_dict['ALL_GENRES']

In [5]:
def read_csv(file_name):
    """
    
    Reads in the large kaggle dataset in chunks 
    
    Args:
        file_name (string): path to the kaggle dataset 

    Yields:
        DataFrame: the next chunk of the dataset, with at most DATASET_SIZE rows
    """
    for chunk in pd.read_csv(file_name, chunksize=DATASET_SIZE):
        yield chunk
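
To illustrate the chunked reading used by `read_csv`, here is a small self-contained sketch that runs `pd.read_csv` with `chunksize` over an in-memory CSV. The column values are made up for the example. Note that each chunk keeps the row index of the full file rather than restarting at zero, which matters for any progress counter derived from the index.

```python
import io

import pandas as pd

# Five made-up rows standing in for the much larger Kaggle file.
csv_text = "title,artist\n" + "\n".join(
    "song{},artist{}".format(i, i) for i in range(5)
)

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))

for chunk in chunks:
    # Print the chunk length and the row labels it carries.
    print(len(chunk), list(chunk.index))
```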

In [6]:
def modified_dataset(data):
    """
    
    Modifies the kaggle dataframe given to add additional 
    genres by utilizing tags found using lastfm
    
    Args:
        data (DataFrame): the dataframe from kaggle 

    Returns:
        DataFrame: the modified dataframe with additional genres  
    """
    
    df = pd.DataFrame(columns=['song', 'artist', 'genres', 'lyrics'])

    # loop through each song in the dataframe
    # (enumerate gives a 1-based position; the chunk's own index keeps counting across chunks)
    for position, (_, row) in enumerate(data.iterrows(), start=1):
        print("Song {} / {}".format(position, len(data)))
        song = row['title']
        artist = row['artist']
        tags = [row['tag']] # keep the initial song genre obtained from the kaggle dataset
        lyrics = row['lyrics']
        language = row['language'] # maybe use this to define a World Music genre
        
        # only use english songs for now 
        if language != 'en':
            continue
        
        # find additional tags using lastfm 
        additional_tags = get_tags(artist, song)
        if additional_tags:
            tags.extend(additional_tags)
        
        # use these tags to determine the genre 
        genres = tags_to_genre(tags, ALL_GENRES)
        
        # add to dataframe  
        df.loc[len(df.index)] = [song, artist, genres, lyrics]

    return df   
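
The implementation of `tags_to_genre` lives in `utils.genre_helper` and is not shown in this notebook. Purely as an illustration of what a tag-to-genre mapping could look like, the sketch below matches tags against the genre list case-insensitively. The function name `tags_to_genre_sketch` and its matching rule are assumptions, not the project's actual logic.

```python
def tags_to_genre_sketch(tags, all_genres):
    """Hypothetical stand-in: keep the tags that exactly name a known genre."""
    # Map lowercase genre names back to their canonical spelling.
    lookup = {genre.lower(): genre for genre in all_genres}

    matched = []
    for tag in tags:
        if not isinstance(tag, str):
            continue  # skip missing values such as NaN
        genre = lookup.get(tag.strip().lower())
        if genre is not None and genre not in matched:
            matched.append(genre)  # preserve first-seen order, no duplicates
    return matched


print(tags_to_genre_sketch(["Rock", "seen live", "pop"], ["rock", "pop", "rap"]))
```

A real mapping would likely need to handle synonyms and sub-genres (for example, routing "hip hop" to "rap"), which this sketch does not attempt.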

In [7]:
def make_dataset_chunks(kaggle_dataset_path='song_lyrics.csv', save_dest='data/kaggle_modified'):
    """
    
        The kaggle dataset is extremely large so we process it in chunks. 
        This function splits the main kaggle dataset into chunks and stores each 
        modified chunk as a CSV file named as follows: 
        
        dataset_{chunk_number}.csv 
    
    Args:
        kaggle_dataset_path (str, optional): the path to the kaggle dataframe
        save_dest (str, optional): the path where we want to save the dataframe chunks 
    """
    
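    # make sure the destination folder exists before any chunk is written
    os.makedirs(save_dest, exist_ok=True)

    # read_csv() breaks the kaggle_dataset into chunks and lets us iterate through each chunk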
    for i, df in enumerate(read_csv(kaggle_dataset_path)):
        print("On Chunk: {}".format(i))
        # get df name 
        folder_path = save_dest
        file_name = 'dataset_{}.csv'.format(i)
        file_path = os.path.join(folder_path, file_name)
        
        # don't overwrite data files that already exist
        if os.path.exists(file_path) and not OVERWRITE:
            continue
        
        # modify dataset by adding additional tag fields
        df = modified_dataset(df)
        df.to_csv(file_path, index=False) # save the dataset  

In [8]:
make_dataset_chunks()

On Chunk: 0
On Chunk: 1
On Chunk: 2
On Chunk: 3
On Chunk: 4
On Chunk: 5
On Chunk: 6
On Chunk: 7
On Chunk: 8
On Chunk: 9
On Chunk: 10
On Chunk: 11
On Chunk: 12
Song 1 / 500
No results found for TheSpark by Hero! (English).
Song 2 / 500
Song 3 / 500
No tag information available for Childish Gambino - Break (All of the Lights).
Song 4 / 500
No tag information available for Laws - Daytona 500.
Song 5 / 500
No results found for Rock n Rollin by M.I.M.S.
Song 6 / 500


KeyboardInterrupt: 