<a href="https://colab.research.google.com/github/SJinji/recommendation-system-with-last.fm-dataset/blob/main/Deezer_1_Data_Preprocessing_v0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
! pip install spotipy

Collecting spotipy
  Downloading spotipy-2.23.0-py3-none-any.whl (29 kB)
Collecting redis>=3.5.3 (from spotipy)
  Downloading redis-4.6.0-py3-none-any.whl (241 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 241.1/241.1 kB 8.0 MB/s eta 0:00:00
Installing collected packages: redis, spotipy
Successfully installed redis-4.6.0 spotipy-2.23.0


In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import sys
import glob
import string
import nltk
from tqdm.auto import tqdm

In [None]:
! unzip /content/deezer-tech-test-DS-internship.zip

In [None]:
artists = pd.read_csv('deezer-business-case/data/artists.dat',sep="\t")
tags = pd.read_csv('deezer-business-case/data/tags.dat',encoding="gbk",sep="\t")
user_artists = pd.read_csv('deezer-business-case/data/user_artists.dat',sep="\t")
user_friends = pd.read_csv('deezer-business-case/data/user_friends.dat',sep='\t')
tag_artists = pd.read_csv('deezer-business-case/data/user_taggedartists.dat',sep='\t')
# tag_artists_timestamps = pd.read_csv('/content/deezer-business-case/data/user_taggedartists-timestamps.dat',sep='\t')

We used outer joins when combining the user, friend, and artist tables so that no records are dropped: a row is kept even when it has no match in another table, and the unmatched columns are filled with nulls. Tag names are attached to the tagging records with a left join, which keeps every tagging record. The nulls can then be filtered out or handled as needed at later stages of the analysis, so we start from a complete dataset.
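
As a small illustration of how an outer join keeps unmatched rows, the sketch below merges two toy tables and uses the `indicator=True` option of `pd.merge` to show which side each row came from. The tables and their values are invented for illustration and are not the Deezer data.

```python
import pandas as pd

# Toy stand-ins for user_artists and artists; all values are invented.
toy_user_artists = pd.DataFrame(
    {"userID": [1, 1, 2], "artistID": [10, 11, 12], "weight": [100, 50, 7]}
)
toy_artists = pd.DataFrame({"id": [10, 11, 13], "name": ["A", "B", "D"]})

merged = pd.merge(
    toy_user_artists,
    toy_artists,
    left_on="artistID",
    right_on="id",
    how="outer",
    indicator=True,  # adds a _merge column: both / left_only / right_only
)

# Rows present in only one table are kept, with NaN in the other table's columns.
print(merged[["userID", "artistID", "id", "name", "_merge"]])

# Null counts per column help decide later which rows to drop or fill.
null_counts = merged.isna().sum()
print(null_counts)
```

Checking the `_merge` column and the null counts right after a merge is one way to see how many rows would be lost if an inner join were used instead.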

In [None]:
df1 = pd.merge(user_artists, user_friends, on="userID", how="outer")
df2 = pd.merge(df1, artists, left_on="artistID", right_on="id", how="outer").drop(columns=["id", "url", "pictureURL"])
df3 = pd.merge(tag_artists, tags, on="tagID", how="left")

original_df = pd.merge(df2, df3, on=["userID", "artistID"], how="outer")

# Rename columns for clarity
original_df.rename(columns={"name": "artistName", "weight": "artistWeight"}, inplace=True)

In [None]:
original_df.head()

Unnamed: 0,userID,artistID,artistWeight,friendID,artistName,tagID,day,month,year,tagValue
0,2,51,13883.0,275.0,Duran Duran,,,,,
1,2,51,13883.0,428.0,Duran Duran,,,,,
2,2,51,13883.0,515.0,Duran Duran,,,,,
3,2,51,13883.0,761.0,Duran Duran,,,,,
4,2,51,13883.0,831.0,Duran Duran,,,,,


Examining the original data revealed two issues. First, the tags were inconsistent, which made it hard to define a clear rule for selecting relevant genres across all users. Second, the pictureURL column contained outdated links, so artist images could not be displayed.

To address both issues, we turned to the Spotify API, which lets us retrieve additional data for each artist, including genres and a valid image URL. Merging this information into the existing dataset improves data quality and supports a more complete analysis.

In [None]:
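# Create client. Credentials are read from environment variables rather than
# hardcoded; set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running.
spotify = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=os.environ["SPOTIPY_CLIENT_ID"],
    client_secret=os.environ["SPOTIPY_CLIENT_SECRET"]
))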

In [None]:
def get_artist_info(name):
    """
    Fetches info about the artist from the Spotify API.
    """

    # Get the response from the API
    results = spotify.search(q=f'artist:{name}', type='artist')

    spotify_name = name
    img_url = None
    genres = []
    spotify_url = None

    if len(results['artists']['items']) > 0:

        # Sort the returned results based on how close they are to the query name.
        # Uses https://en.wikipedia.org/wiki/Levenshtein_distance aka Edit Distance.
        items = sorted(results['artists']['items'], key=(lambda x: nltk.edit_distance(name.lower(), x["name"].lower())))

        # Sort results by popularity
        #items = sorted(results['artists']['items'], key=(lambda x: x["popularity"]), reverse=True)

        if len(items) > 0:
            artist = items[0]

            # assign variables
            genres = artist["genres"]
            spotify_name = artist["name"]
            spotify_url = artist["external_urls"]["spotify"]

            # assign image url
            image_list = artist["images"]
            if len(image_list) > 0:
                img_url = image_list[0]["url"]

    return spotify_name, img_url, genres, spotify_url

We used the Spotify API to retrieve artist details. Because of rate limits, we could make only about 100 requests every 3.5 seconds, and after each batch we had to wait 6 seconds before requesting the next 100 artists. Even so, the whole process for all roughly 17,600 artist names took approximately 30 minutes.
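
The batching described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the run: `fetch_in_batches`, its parameters, and the injectable `sleep` argument are hypothetical names, and the batch size of 100 and the 6-second pause are simply the figures quoted in the text.

```python
import time

def fetch_in_batches(names, fetch_one, batch_size=100, pause_seconds=6.0, sleep=time.sleep):
    """Call fetch_one(name) for every name, pausing after each full batch.

    fetch_one is any callable (for example get_artist_info); sleep is
    injectable so the pause can be replaced in tests.
    """
    results = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        results.extend(fetch_one(name) for name in batch)
        # Pause only if more batches remain.
        if start + batch_size < len(names):
            sleep(pause_seconds)
    return results

# Usage with a stand-in fetch function and a fake sleep that records the pauses.
pauses = []
demo = fetch_in_batches([f"artist {i}" for i in range(250)], str.upper, sleep=pauses.append)
print(len(demo), pauses)
```

Passing the sleep function as an argument keeps the pacing logic separate from the API call, so the loop can be checked without waiting.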

The process iterates over the artist names and searches for each one in Spotify's catalog. From the search results, we take the artist whose name is closest to the query, measured by edit distance, as the match. For that artist, we store the URL of the first image returned (usually an album cover) and the associated genres in a table for further analysis.
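
To make the name-matching step concrete, here is a minimal, self-contained sketch. It uses a plain-Python Levenshtein function as a stand-in for `nltk.edit_distance`, and the candidate names are invented. `levenshtein` and `closest_name` are illustrative helpers, not part of the notebook.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def closest_name(query, candidates):
    """Return the candidate whose lowercased name is nearest to the query."""
    return min(candidates, key=lambda name: levenshtein(query.lower(), name.lower()))

# Hypothetical search results for the query "air".
candidates = ["Air Supply", "AIR", "Airbourne"]
print(closest_name("air", candidates))
```

A case-insensitive exact match has distance zero and therefore always wins. For a name with no exact match in the results, the nearest candidate may still be an unrelated artist, so spot-checking a sample of matches is worthwhile.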

In [None]:
rows = []

for artist_name_in in tqdm(list(original_df.artistName.unique())):

    artist_name = str(artist_name_in)

    try:
        spotify_name, image_url, genres_list, spotify_url = get_artist_info(artist_name)
    except Exception as err:
        # Report which artist failed and stop. The result variables may not be
        # defined yet at this point, so only the query name and error are printed.
        print("Artist:", artist_name)
        print("Error:", err)
        raise

    rows.append({"artistName": artist_name, "spotifyName": spotify_name, "imageUrl": image_url, "genres": str(genres_list), "spotifyUrl": spotify_url})

# Build the table once at the end instead of concatenating row by row
spotify_data = pd.DataFrame(rows, columns=["artistName", "spotifyName", "imageUrl", "genres", "spotifyUrl"])

  0%|          | 0/17633 [00:00<?, ?it/s]

In [None]:
spotify_data.head()

Unnamed: 0,artistName,spotifyName,imageUrl,genres,spotifyUrl
0,Duran Duran,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
1,Morcheeba,Morcheeba,https://i.scdn.co/image/ab6761610000e5eb36946a...,"['downtempo', 'electronica', 'trip hop']",https://open.spotify.com/artist/6bWxFw65IEJzBY...
2,Air,Air,https://i.scdn.co/image/ab6761610000e5ebb3f06b...,"['ambient pop', 'downtempo', 'electronica', 'i...",https://open.spotify.com/artist/1P6U1dCeHxPui5...
3,Hooverphonic,Hooverphonic,https://i.scdn.co/image/ab6761610000e5eb1f66a2...,"['downtempo', 'electronica', 'trip hop']",https://open.spotify.com/artist/5EP020iZcwBqHR...
4,Kylie Minogue,Kylie Minogue,https://i.scdn.co/image/ab6761610000e5eb8fba8b...,"['australian dance', 'australian pop', 'dance ...",https://open.spotify.com/artist/4RVnAU35WRWra6...


In [None]:
spotify_data.to_csv('spotify_data.csv', index=False)
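
Because `genres` is saved as the string form of a Python list, it comes back from the CSV as text. One possible way to recover the lists when reloading is `ast.literal_eval`. The sketch below uses an in-memory CSV with made-up rows instead of the real file, and `parse_genres` is a hypothetical helper.

```python
import ast
import csv
import io

def parse_genres(text):
    """Turn a stored string such as "['downtempo', 'trip hop']" back into a list."""
    if not text:
        return []
    return ast.literal_eval(text)

# In-memory stand-in for spotify_data.csv; the rows are invented.
csv_text = (
    "artistName,genres\n"
    "Example Artist,\"['downtempo', 'trip hop']\"\n"
    "Unknown Artist,[]\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
genres_by_artist = {row["artistName"]: parse_genres(row["genres"]) for row in rows}
print(genres_by_artist)
```

The same helper could be applied to a pandas column after reloading, provided missing values are handled first, since they are not strings.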

In [None]:
df = pd.merge(original_df, spotify_data, on="artistName", how="left")
df

Unnamed: 0,userID,artistID,artistWeight,friendID,artistName,tagID,day,month,year,tagValue,spotifyName,imageUrl,genres,spotifyUrl
0,2,51,13883.0,275.0,Duran Duran,,,,,,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
1,2,51,13883.0,428.0,Duran Duran,,,,,,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
2,2,51,13883.0,515.0,Duran Duran,,,,,,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
3,2,51,13883.0,761.0,Duran Duran,,,,,,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
4,2,51,13883.0,831.0,Duran Duran,,,,,,Duran Duran,https://i.scdn.co/image/ab6761610000e5eb47f638...,"['album rock', 'dance rock', 'new romantic', '...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2269513,2099,16468,,,,191.0,1.0,2.0,2009.0,instrumental,,,,
2269514,2099,16745,,,,13.0,1.0,8.0,2009.0,chillout,,,,
2269515,2099,16745,,,,15.0,1.0,8.0,2009.0,downtempo,,,,
2269516,2099,16745,,,,21.0,1.0,8.0,2009.0,trip-hop,,,,


In [None]:
df.to_csv('original_and_spotify_data.csv', index=False)