<h1><b>Setup to Import the Spotify Dataset into MySQL Workbench!</b></h1>

<h3><b>1. Install the Necessary Libraries</b></h3>

In [2]:
pip install mysql-connector-python pandas gdown

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.1.1-py3-none-any.whl.metadata (4.4 kB)
Downloading PyMySQL-1.1.1-py3-none-any.whl (44 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.1.1
Note: you may need to restart the kernel to use updated packages.
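
After the installs finish, it can help to confirm that the packages are importable from the current kernel before running the import script. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module names assume the usual import names (`pymysql`, `pandas`, `gdown`) of the packages installed above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the packages used later in this notebook
missing = missing_packages(["pymysql", "pandas", "gdown"])
if missing:
    print("Not importable yet (restart the kernel or reinstall):", missing)
else:
    print("All required packages are importable.")
```

If a package is listed as missing right after a successful install, restarting the kernel is usually the first thing to try.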


<h3><b>2. Download the Dataset from Google Drive and Load It Directly into MySQL</b></h3>

<p style="font-size:16px;"><b>Note:</b><br>
To download the file, first extract its file ID from the shareable link. The file ID is the part of the link after <code>/d/</code> and before <code>/view</code>.
<br><br>
<b>For example, if your link is:</b>
<br><br>
<span>https://drive.google.com/file/d/1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u/view?usp=drive_link</span>
<br><br>
the file ID is <b>1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u</b>.
</p>
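
The extraction rule in the note can also be done in code rather than by hand. This is a minimal sketch, assuming the link follows the `/file/d/<id>/view` pattern shown in the example. The function name `extract_file_id` is hypothetical and is not part of `gdown`.

```python
import re


def extract_file_id(share_link):
    """Return the Google Drive file ID that follows '/d/' in a shareable link."""
    # Capture everything after '/d/' up to the next '/', '?' or '#'
    match = re.search(r"/d/([^/?#]+)", share_link)
    if match is None:
        raise ValueError(f"No file ID found in link: {share_link}")
    return match.group(1)


link = "https://drive.google.com/file/d/1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u/view?usp=drive_link"
print(extract_file_id(link))  # 1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u
```

The returned string can be assigned to `file_id` in the download step.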

In [11]:
# Step 1: Import necessary libraries
import gdown
import pandas as pd
import pymysql
from pymysql import OperationalError

# Step 2: Download the dataset from Google Drive
file_id = '1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u'
output = 'spotify_dataset.csv'
gdown.download(f'https://drive.google.com/uc?id={file_id}', output, quiet=False)

# Step 3: Load the dataset into a pandas DataFrame
df = pd.read_csv(output)

# Display the first few rows of the DataFrame to confirm successful loading
print(df.head())

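# Step 4: Handle NaN values by replacing them with appropriate defaults
# Note: '1970-01-01 00:00:00' is a sentinel for missing timestamps. It sits at the lower edge of MySQL's TIMESTAMP range (which starts at '1970-01-01 00:00:01' UTC), so depending on the session time zone and SQL mode such rows may be stored as a zero date.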
df.fillna({
    'spotify_track_uri': '',
    'ts': '1970-01-01 00:00:00',
    'platform': '',
    'ms_played': 0,
    'track_name': '',
    'artist_name': '',
    'album_name': '',
    'reason_start': '',
    'reason_end': '',
    'shuffle': False,
    'skipped': False
}, inplace=True)

# Step 5: Remove duplicate entries based on the primary key
df.drop_duplicates(subset=['spotify_track_uri', 'ts'], inplace=True)

# Step 6: Connect to the MySQL database
connection = None  # Defined up front so the finally block is safe if connect() fails
try:
    connection = pymysql.connect(
        host='localhost',      # Replace with your MySQL host
        user='root',  # Replace with your MySQL username
        password='12345678',  # Replace with your MySQL password
        database='spotify_analysis'  # Replace with your MySQL database name
    )
    if connection.open:
        print("Successfully connected to the database")
        
        # Step 7: Create a cursor object
        cursor = connection.cursor()

        # Step 8: Create a table to store the dataset
        create_table_query = """
        CREATE TABLE IF NOT EXISTS spotify_data (
            spotify_track_uri VARCHAR(255),
            ts TIMESTAMP,
            platform VARCHAR(255),
            ms_played INT,
            track_name VARCHAR(255),
            artist_name VARCHAR(255),
            album_name VARCHAR(255),
            reason_start VARCHAR(255),
            reason_end VARCHAR(255),
            shuffle BOOLEAN,
            skipped BOOLEAN,
            PRIMARY KEY (spotify_track_uri, ts)
        )
        """
        cursor.execute(create_table_query)
        print("Table created successfully")

        # Step 9: Insert data into the table in chunks, using INSERT IGNORE to handle duplicates
        columns = [
            'spotify_track_uri', 'ts', 'platform', 'ms_played', 'track_name', 'artist_name',
            'album_name', 'reason_start', 'reason_end', 'shuffle', 'skipped'
        ]
        insert_query = f"""
        INSERT IGNORE INTO spotify_data ({', '.join(columns)})
        VALUES ({', '.join(['%s'] * len(columns))})
        """
        chunk_size = 1000
        for start in range(0, len(df), chunk_size):
            end = min(start + chunk_size, len(df))
            # Select columns by name so the values line up with the INSERT column list
            rows = list(df.iloc[start:end][columns].itertuples(index=False, name=None))
            # Send the whole chunk in one call instead of one execute() per row
            cursor.executemany(insert_query, rows)

            # Commit the transaction after each chunk
            connection.commit()
            print(f"Inserted rows {start} to {end - 1}")

except OperationalError as e:
    print(f"Error: {e}")
finally:
    if connection and connection.open:
        cursor.close()
        connection.close()
        print("MySQL connection is closed")

Downloading...
From: https://drive.google.com/uc?id=1009ZQdqIQV1-TNqNt1rXENK4zFWM_55u
To: C:\Users\Richard Muchoki\Documents\SQL\Spotify_Dataset_Analysis\spotify_dataset.csv
100%|██████████| 21.3M/21.3M [00:37<00:00, 567kB/s]


        spotify_track_uri                   ts    platform  ms_played  \
0  2J3n32GeLmMjwuAzyhcSNe  2013-07-08 02:44:34  web player       3185   
1  1oHxIPqJyvAYHy0PVrDU98  2013-07-08 02:45:37  web player      61865   
2  487OPlneJNni3NWC8SYqhW  2013-07-08 02:50:24  web player     285386   
3  5IyblF777jLZj1vGHG2UD3  2013-07-08 02:52:40  web player     134022   
4  0GgAAB0ZMllFhbNc3mAodO  2013-07-08 03:17:52  web player          0   

                                      track_name        artist_name  \
0                            Say It, Just Say It       The Mowgli's   
1  Drinking from the Bottle (feat. Tinie Tempah)      Calvin Harris   
2                                    Born To Die       Lana Del Rey   
3                               Off To The Races       Lana Del Rey   
4                                      Half Mast  Empire Of The Sun   

                           album_name reason_start reason_end  shuffle  \
0                Waiting For The Dawn     autoplay   clickro
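
The script relies on two ideas: a composite primary key on `(spotify_track_uri, ts)` and an insert statement that silently skips rows whose key already exists. The sketch below illustrates that behaviour without a MySQL server. It deliberately swaps in Python's built-in `sqlite3` module and SQLite's `INSERT OR IGNORE`, which plays a similar role to MySQL's `INSERT IGNORE` for duplicate keys. The table is reduced to three columns for brevity, and the sample rows are taken from the output above.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (an assumption
# made only so the example runs anywhere)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE spotify_data (
        spotify_track_uri TEXT,
        ts TEXT,
        ms_played INTEGER,
        PRIMARY KEY (spotify_track_uri, ts)
    )
""")

rows = [
    ("2J3n32GeLmMjwuAzyhcSNe", "2013-07-08 02:44:34", 3185),
    ("1oHxIPqJyvAYHy0PVrDU98", "2013-07-08 02:45:37", 61865),
    ("2J3n32GeLmMjwuAzyhcSNe", "2013-07-08 02:44:34", 3185),  # same key as row 1
]

# The duplicate row is skipped instead of raising an integrity error
conn.executemany("INSERT OR IGNORE INTO spotify_data VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM spotify_data").fetchone()[0]
print(stored)  # 2
conn.close()
```

This is also why rerunning the import cell should not create duplicate rows: rows whose key is already stored are ignored.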

In [13]:
df.shape

(148350, 11)
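
With 148,350 rows and a chunk size of 1,000, the insert loop in Step 9 runs 149 times, and the last chunk holds only 350 rows. The following sketch shows how the chunk boundaries work out. The helper name `chunk_bounds` is hypothetical, and the end index is clamped to the row count so the final partial chunk is reported accurately.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) index pairs covering n_rows; end is exclusive."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


bounds = list(chunk_bounds(148350, 1000))
print(len(bounds))   # 149
print(bounds[0])     # (0, 1000)
print(bounds[-1])    # (148000, 148350)
```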