*** DO NOT RUN ALL CELLS ***

Doing so will wipe some outputs that take a long time to regenerate and are only there for explanation and demonstration.

More details on this are explained below.

---
---

# 1. Setup

---
---

### Import packages needed for Spotify API

In [None]:
!pip install spotipy

Collecting spotipy
  Downloading https://files.pythonhosted.org/packages/7a/cd/e7d9a35216ea5bfb9234785f3d8fa7c96d0e33999c2cb72394128f6b4cce/spotipy-2.16.1-py3-none-any.whl
Installing collected packages: spotipy
Successfully installed spotipy-2.16.1


In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

### General Imports

In [None]:
from google.colab import drive
import os
import pandas as pd
import matplotlib.pyplot as plt
import sys
import glob
import string
import nltk

from tqdm.auto import tqdm

### Mount the Drive

In [None]:
drive.mount("/content/drive")#, force_remount=True)

Mounted at /content/drive


In [None]:
os.chdir("drive/MyDrive/ca4015_assignment4/data")

---
---

# 2. Read in Data

---
---

## Original Data

First we read in the dataset we were given for this assignment.
This dataset was generated by the Information Retrieval Group at Universidad Autónoma de Madrid and contains music data from a set of 1,892 users of the Last.fm music system.

This data includes the artists each user listens to, the amount of time the user spends listening to each artist, the tags that the user gives to the artist and the corresponding userIDs of the user's friends. A total of 17,632 artists and 11,946 tags are contained in this dataset.
This information is contained in 5 separate files:

1.   A file on the artists
2.   A file on the tags
3.   A file on the users' friends
4.   A file on the artists the user listens to and the weight of preference the user has for each artist
5.   A file on the details of when and what tags were given to the artists by users


We start by reading all 5 files into pandas dataframes and then joining them into one table, which is easier to work with than several files at once.



##### Read in original data

In [None]:
directory = "raw/"

# read in the different tables
artists = pd.read_csv(os.path.join(directory, "artists.dat"), sep="\t")
tags = pd.read_csv(os.path.join(directory, "tags.dat"), sep="\t", encoding="latin")
listening = pd.read_csv(os.path.join(directory, "user_artists.dat"), sep="\t")
friends = pd.read_csv(os.path.join(directory, "user_friends.dat"), sep="\t")
user_tags = pd.read_csv(os.path.join(directory, "user_taggedartists-timestamps.dat"), sep="\t")

# change the time in the user_tags table to a timestamp
user_tags["timestamp"] = pd.to_datetime(user_tags["timestamp"], unit="ms")

In [None]:
artists.shape

(17632, 4)

In [None]:
artists.head()

Unnamed: 0,id,name,url,pictureURL
0,1,MALICE MIZER,http://www.last.fm/music/MALICE+MIZER,http://userserve-ak.last.fm/serve/252/10808.jpg
1,2,Diary of Dreams,http://www.last.fm/music/Diary+of+Dreams,http://userserve-ak.last.fm/serve/252/3052066.jpg
2,3,Carpathian Forest,http://www.last.fm/music/Carpathian+Forest,http://userserve-ak.last.fm/serve/252/40222717...
3,4,Moi dix Mois,http://www.last.fm/music/Moi+dix+Mois,http://userserve-ak.last.fm/serve/252/54697835...
4,5,Bella Morte,http://www.last.fm/music/Bella+Morte,http://userserve-ak.last.fm/serve/252/14789013...


In [None]:
tags.shape

(11946, 2)

In [None]:
tags.head()

Unnamed: 0,tagID,tagValue
0,1,metal
1,2,alternative metal
2,3,goth rock
3,4,black metal
4,5,death metal


In [None]:
listening.shape

(92834, 3)

In [None]:
listening.head()

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983


In [None]:
friends.shape

(25434, 2)

In [None]:
friends.head()

Unnamed: 0,userID,friendID
0,2,275
1,2,428
2,2,515
3,2,761
4,2,831


In [None]:
user_tags.shape

(186479, 4)

In [None]:
user_tags.head()

Unnamed: 0,userID,artistID,tagID,timestamp
0,2,52,13,2009-03-31 22:00:00
1,2,52,15,2009-03-31 22:00:00
2,2,52,18,2009-03-31 22:00:00
3,2,52,21,2009-03-31 22:00:00
4,2,52,41,2009-03-31 22:00:00


##### Join the original data tables together

Here we use outer joins to keep all of the data; nulls can then be filtered out at later stages of the process.

In [None]:
user_and_friend = pd.merge(listening, friends, on="userID", how="outer")
user_friend_nd_artist = pd.merge(user_and_friend, artists, left_on="artistID", right_on="id", how="outer").drop(columns=["id", "url", "pictureURL"])
artist_nd_tag = pd.merge(user_tags, tags, on="tagID", how="left")

original_df = pd.merge(user_friend_nd_artist, artist_nd_tag, on=["userID", "artistID"], how="outer")

# rename columns to make more clear
original_df.rename(columns={"name": "artistName", "weight": "artistWeight"}, inplace=True)

In [None]:
original_df.head()

Unnamed: 0,userID,artistID,artistWeight,friendID,artistName,tagID,timestamp,tagValue
0,2,51,13883.0,275.0,Duran Duran,,NaT,
1,2,51,13883.0,428.0,Duran Duran,,NaT,
2,2,51,13883.0,515.0,Duran Duran,,NaT,
3,2,51,13883.0,761.0,Duran Duran,,NaT,
4,2,51,13883.0,831.0,Duran Duran,,NaT,


In [None]:
original_df.artistID.nunique()

18022
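The count above is larger than the 17,632 rows in `artists`, so some artistIDs in the listening or tag tables have no matching artist record; the outer joins keep those rows with a null `artistName`. As a minimal sketch on made-up toy data (the frames and values below are illustrative and not taken from the dataset), the `indicator=True` option of `pd.merge` labels which side each row came from:

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
toy_listening = pd.DataFrame(
    {"userID": [2, 2, 3], "artistID": [51, 52, 99], "weight": [10, 5, 7]}
)
toy_artists = pd.DataFrame({"id": [51, 52, 53], "name": ["A", "B", "C"]})

# An outer join keeps unmatched rows from both sides.
# The extra "_merge" column records where each row originated.
merged = pd.merge(
    toy_listening,
    toy_artists,
    left_on="artistID",
    right_on="id",
    how="outer",
    indicator=True,
)

# artistIDs that were listened to but have no artist record.
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "artistID"].tolist()

# Artists that no user listened to.
unlistened = merged.loc[merged["_merge"] == "right_only", "name"].tolist()
```

Filtering on `_merge` is one way to decide later which null rows to drop. Note that integer columns containing nulls after an outer join are converted to floats, which is why `artistWeight` and `friendID` appear as values like `13883.0` in the output above.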

## Spotify API

After examining the original data and thinking about how we would build recommendations from it, we noticed a few problems with the data and saw a few additions that would improve it and make it easier to work with.

For one, we felt the tags were too inconsistent. Most tags named the artist's genre, but because they are user generated there were quite a number of exceptions. We attempted to iterate through the tags and automatically keep only the relevant genres across all users, but this proved difficult as we could not find a hard rule that caught all cases cleanly.

There was also a problem with the pictureURL column in the original data. Each artist has an image URL, but as this Last.fm data is from 2011, these links no longer work. This mattered because we planned to include some summary statistics on the recommended music and wanted an image to act as a visual representation of each artist.

This is where the Spotify API comes in. We discovered this API while looking for more information about the artists in the dataset.

Using this API, we can look up each artist in our data on Spotify and return Spotify's recorded data for that artist. This extra data includes the genres covered by the artist along with the URL of an image that we can assign to the artist. We can then iterate through our artists and store this data in a table to be merged with the data we already have.

#### Create functions for scraping artist details

In [None]:
# Create client
spotify = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="22ba1096ffda44fca858c1c6880ca020",
    client_secret="2342a132a9a74186bdff50fce2a95778"
))

In [None]:
def get_artist_info(name):
    """
    Fetches info about the artist from the Spotify API.
    """

    # Get the response from the API
    results = spotify.search(q=f'artist:{name}', type='artist')

    spotify_name = name
    img_url = None
    genres = []
    spotify_url = None

    if len(results['artists']['items']) > 0:

        # Sort the returned results based on how close they are to the query name.
        # Uses https://en.wikipedia.org/wiki/Levenshtein_distance aka Edit Distance.
        items = sorted(results['artists']['items'], key=(lambda x: nltk.edit_distance(name.lower(), x["name"].lower())))
  
        # Sort results by popularity
        #items = sorted(results['artists']['items'], key=(lambda x: x["popularity"]), reverse=True)

        if len(items) > 0:
            artist = items[0]

            # assign variables
            genres = artist["genres"]
            spotify_name = artist["name"]
            spotify_url = artist["external_urls"]["spotify"]

            # assign image url
            image_list = artist["images"]
            if len(image_list) > 0:
                img_url = image_list[0]["url"]

    return spotify_name, img_url, genres, spotify_url
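To show the ranking step in isolation, here is a minimal sketch that picks the closest name from a list of candidates. It uses a small pure-Python Levenshtein function as a stand-in for `nltk.edit_distance`, and the candidate names are hypothetical search results rather than real API output:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def closest_name(query, candidates):
    """Return the candidate closest to the query (ignoring case), or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda name: edit_distance(query.lower(), name.lower()))


# Hypothetical search results for the query "Air".
candidates = ["Air Supply", "AIR", "Airbourne"]
best = closest_name("Air", candidates)
```

Ranking by edit distance favours an exact name match over a more popular artist whose name merely contains the query, which is presumably why the popularity sort is left commented out in the function above.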

#### Scrape the artist details using the spotify API

The Spotify API limits the rate at which requests can be made. For us, this worked out at about 100 requests in roughly 3.5 seconds, after which we had to wait 6 seconds before requesting data on another 100 artists.

While this was not ideal, it didn't work out too badly, with the full batch of roughly 17,600 artist names taking only about 30 minutes to scrape.
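The timing above can be checked with a little arithmetic, and the same batching pattern can be written explicitly. The sketch below is an illustration only: the function names, the batch size of 100 and the 6-second pause are assumptions based on the figures quoted above, not documented Spotify limits.

```python
import time


def run_in_batches(items, fetch, batch_size=100, pause_seconds=6.0, sleep=time.sleep):
    """Call fetch(item) for every item, pausing between batches."""
    results = []
    for start in range(0, len(items), batch_size):
        if start > 0:
            # Wait before every batch except the first.
            sleep(pause_seconds)
        for item in items[start:start + batch_size]:
            results.append(fetch(item))
    return results


def estimate_minutes(n_items, batch_size=100, batch_seconds=3.5, pause_seconds=6.0):
    """Rough total run time in minutes for the batching scheme above."""
    n_batches = -(-n_items // batch_size)  # ceiling division
    total_seconds = n_batches * batch_seconds + (n_batches - 1) * pause_seconds
    return total_seconds / 60


# 17,633 unique artist names, as shown by the progress bar of the scrape.
minutes = estimate_minutes(17633)
```

With these figures the estimate comes out a little under 30 minutes, which agrees with the run time we observed.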

The process for this scrape is as follows:

1.   Iterate through the artists' names
2.   In this loop, search for the individual artists' names through all of Spotify's artists
3.   Look at the returned artists for this Spotify search
4.   Of the returned artists, scrape the most similarly named artist returned as the correct artist we are looking for
5.   Within this artist:
  *   Scrape the URL to the top image associated with their account (usually an album cover)
  *   Scrape the genres associated with that artist
6.   Store these values in a table

In [None]:
spotify_data = pd.DataFrame(columns=["artistName", "spotifyName", "imageUrl", "genres", "spotifyUrl"])

for artist_name_in in tqdm(list(original_df.artistName.unique())):

    artist_name = str(artist_name_in)

    try:
        spotify_name, image_url, genres_list, spotify_url = get_artist_info(artist_name)

        df_row = pd.DataFrame({"artistName": artist_name, "spotifyName": spotify_name, "imageUrl": image_url, "genres": str(genres_list), "spotifyUrl": spotify_url}, index=[0])

        spotify_data = pd.concat([spotify_data, df_row], axis=0, ignore_index=True)

        # if artist_name == "nan":
        #     print("Artist:", artist_name)
        #     print("Spotify artist:", spotify_name)
        #     print("URL:", image_url)
        #     print("Genres:", genres_list)
        #     print("SpotifyURL:", spotify_url)

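    except Exception as error:
        # Report the artist that failed and stop the scrape. The values returned
        # by get_artist_info are not printed here because they may be undefined
        # or left over from the previous iteration.
        print("Failed on artist:", artist_name)
        print("Error:", error)
        sys.exit(1)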

We noticed that there is an artistName value of 'nan' in this dataset, but we were not sure whether this was a valid artist or a missing value. It most likely comes from rows with a missing artistName, which `str()` converts to the string 'nan'. Nevertheless, we treated it as a valid artist name and scraped the details for the artist 'nan'.

In [None]:
spotify_data.head()

Unnamed: 0,artistName,spotifyName,imageUrl,genres,spotifyUrl
0,Duran Duran,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
1,Morcheeba,Morcheeba,https://i.scdn.co/image/e347fc3b68b822c479d6ab...,"['big beat', 'bmore', 'downtempo', 'electronic...",https://open.spotify.com/artist/6bWxFw65IEJzBY...
2,Air,Air,https://i.scdn.co/image/acbe16a3fbfda0ce2425fd...,"['ambient pop', 'downtempo', 'electronica', 't...",https://open.spotify.com/artist/1P6U1dCeHxPui5...
3,Hooverphonic,Hooverphonic,https://i.scdn.co/image/a3112974cabed304347fdc...,"['downtempo', 'electronica', 'trip hop']",https://open.spotify.com/artist/5EP020iZcwBqHR...
4,Kylie Minogue,Kylie Minogue,https://i.scdn.co/image/471aae2bc20f974dbf573a...,"['australian dance', 'australian pop', 'dance ...",https://open.spotify.com/artist/4RVnAU35WRWra6...


#### Export this dataframe as a CSV file so that the results are saved

In [None]:
spotify_data.to_csv(os.path.join("processed_data", 'spotify_data.csv'), index=False)
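Because the scrape stores `genres` as `str(genres_list)`, the CSV holds each genre list as a string such as `"['downtempo', 'electronica', 'trip hop']"`. A minimal sketch of turning that string back into a list with the standard library's `ast.literal_eval` follows; the helper name `parse_genres` is hypothetical and not part of the notebook:

```python
import ast


def parse_genres(value):
    """Convert a stringified genre list back into a Python list."""
    # Missing values read back from a CSV are not strings (e.g. float NaN).
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal, so treat it as "no genres".
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


genres = parse_genres("['downtempo', 'electronica', 'trip hop']")
```

After reading the CSV back in, the helper could be applied to the whole column with `spotify_data["genres"].apply(parse_genres)`.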

## Additional Song data

We also wanted to include some data on the individual songs by these artists. The original data we were given for this assignment doesn't include songs, and we felt this additional dimension would allow us to make more fine-grained recommendations to a particular user.

Without this information, we are left with the user data, the artist data, the tag data and the genre data. While this is still a lot of data, we thought song data would add a useful extra layer.

After some thought, and despite the work we had already put into it, we eventually decided it was not worth including.

The reason was that we simply could not get hold of the data needed in an efficient way.

We did find the following website for the Million Song Dataset:
> http://millionsongdataset.com/lastfm

This dataset is similar to the one we are using for this project and includes the song data we were looking for, but it is too big for our purposes. The website offers a subset, a training set and a test set.

We initially downloaded the subset and extracted its details. However, this process took over an hour and covered only 10,000 songs. These songs were distributed across our 17,632 artists and also included many artists not in our dataset. As a result, the subset proved too small for our purpose.

We then tried to download the train and test datasets to expand the data, but these proved too big, with the train dataset being over 1 GB when compressed in a zip file. The time needed to unzip it was too long for us, let alone the time that would be needed to read the data back in and compile a dataframe.

While we decided in the end not to include this song data, the code written for it can be found below.

### *** Do NOT run the below code***

This code is not needed for our project but shows how we would complete this step.
Running it will wipe the outputs and take over an hour.
For this reason, do not run it.

View the code below and resume running cells at step *3. Create master data table & Export file*

##### Crawl through the directories and create a list of the paths to all the individual files

In [None]:
list_of_files = []

for folder_1 in ["A", "B"]:
    for folder_2 in list(string.ascii_uppercase):
        for folder_3 in list(string.ascii_uppercase):
          
            list_of_files.extend(glob.glob("lastfm_subset/lastfm_subset/{}/{}/{}/*.json".format(folder_1, folder_2, folder_3)))

print("There are {} files".format(len(list_of_files)))

There are 9330 files
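The nested loops above spell out the three-level folder layout by hand. As an alternative sketch, assuming every `.json` file under the root folder should be collected, `pathlib` can walk the tree recursively; the helper name `list_json_files` is hypothetical:

```python
from pathlib import Path


def list_json_files(root):
    """Return the sorted paths of all .json files anywhere below root."""
    return sorted(str(path) for path in Path(root).rglob("*.json"))
```

Calling `list_json_files("lastfm_subset/lastfm_subset")` would cover every top-level folder rather than only `A` and `B`, so the two approaches agree only if the subset contains no other top-level folders.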


##### Read in the data in these files

In [None]:
import json  # needed for json.load below; not part of the general imports

songs_df = pd.DataFrame(columns=["artistName", "song_timestamp", "trackID", "songTitle"])

for file_path in tqdm(list_of_files):
    with open(file_path, 'r') as song_file:
        song_dict = json.load(song_file)

        dict_without_similarity = {}

        dict_without_similarity["artistName"] = song_dict["artist"]
        dict_without_similarity["song_timestamp"] = song_dict["timestamp"]
        dict_without_similarity["trackID"] = song_dict["track_id"]
        dict_without_similarity["songTitle"] = song_dict["title"]

        df_row = pd.DataFrame(dict_without_similarity, index=[0])
        songs_df = pd.concat([songs_df, df_row], axis=0, ignore_index=True)

In [None]:
len(songs_df)

9330

In [None]:
songs_df.head()

Unnamed: 0,artistName,timestamp,trackID,songTitle
0,Casual,2011-08-02 20:13:25.674526,TRAAAAW128F429D538,I Didn't Mean To
1,Adam Ant,2011-08-11 21:57:49.876264,TRAAAEF128F4273421,Something Girls
2,Rated R,2011-08-11 21:18:23.312581,TRAAAMQ128F1460CD3,Keepin It Real (Skit)
3,Planet P Project,2011-08-09 16:39:17.211274,TRAAARJ128F9320760,Pink World
4,Gob,2011-08-09 01:59:41.352247,TRAAAFD128F92F423A,Face the Ashes
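Part of the run time of the loop above comes from calling `pd.concat` once per file, which copies the growing dataframe each time. A possible alternative, sketched here with a hypothetical helper `read_song_rows`, is to collect plain dictionaries first and build the table once at the end:

```python
import json


def read_song_rows(file_paths):
    """Read each song file and keep only the fields used in the notebook."""
    rows = []
    for file_path in file_paths:
        with open(file_path, "r", encoding="utf-8") as song_file:
            song = json.load(song_file)
        # Keep the same four fields as the loop above and drop the rest.
        rows.append({
            "artistName": song["artist"],
            "song_timestamp": song["timestamp"],
            "trackID": song["track_id"],
            "songTitle": song["title"],
        })
    return rows
```

The dataframe can then be created in a single step with `pd.DataFrame(read_song_rows(list_of_files))`.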


##### Export this dataframe as a JSON file so that the results are saved

In [None]:
songs_df.to_json(os.path.join("processed_data", 'adittional_song_data.json'))

---
---

# 3. Create master data table & Export file

---
---

In this step we join all of this data together into one big table. As mentioned above, having all of the data in one place makes the next steps easier.

In [None]:
joined_df = pd.merge(original_df, spotify_data, on="artistName", how="left")

In [None]:
joined_df.head()

Unnamed: 0,userID,artistID,artistWeight,friendID,artistName,tagID,timestamp,tagValue,spotifyName,imageUrl,genres,spotifyUrl
0,2,51,13883.0,275.0,Duran Duran,,NaT,,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
1,2,51,13883.0,428.0,Duran Duran,,NaT,,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
2,2,51,13883.0,515.0,Duran Duran,,NaT,,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
3,2,51,13883.0,761.0,Duran Duran,,NaT,,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...
4,2,51,13883.0,831.0,Duran Duran,,NaT,,Duran Duran,https://i.scdn.co/image/2e6d5966dafa119f4a75e7...,"['dance rock', 'new romantic', 'new wave', 'ne...",https://open.spotify.com/artist/0lZoBs4Pzo7R89...


In [None]:
joined_df.tail()

Unnamed: 0,userID,artistID,artistWeight,friendID,artistName,tagID,timestamp,tagValue,spotifyName,imageUrl,genres,spotifyUrl
2269513,2099,16468,,,,191.0,2009-01-31 23:00:00,instrumental,,,,
2269514,2099,16745,,,,13.0,2009-07-31 22:00:00,chillout,,,,
2269515,2099,16745,,,,15.0,2009-07-31 22:00:00,downtempo,,,,
2269516,2099,16745,,,,21.0,2009-07-31 22:00:00,trip-hop,,,,
2269517,2100,3855,,,,3271.0,2010-06-30 22:00:00,melancholic rock,,,,


In [None]:
joined_df.userID.count()

2269518
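The row count is far larger than any of the input tables because several of the joins are many-to-many: merging `listening` with `friends` on `userID` pairs every artist row of a user with every friend row of that user, as the repeated Duran Duran rows above show. The sketch below, with a hypothetical helper and made-up keys, computes how many rows an outer join on a single key column produces:

```python
from collections import Counter


def joined_row_count(left_keys, right_keys):
    """Number of rows produced by an outer join on one key column."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    total = 0
    for key in left.keys() | right.keys():
        # Matching keys multiply; an unmatched key keeps its own rows.
        total += max(left[key], 1) * max(right[key], 1)
    return total


# Toy example: user 2 has 3 artists and 4 friends, user 3 has 2 artists
# and no friends, and user 5 has 1 friend and no artists.
rows = joined_row_count([2, 2, 2, 3, 3], [2, 2, 2, 2, 5])
```

User 2 alone contributes 3 × 4 = 12 of the rows in this toy example, so it is worth keeping this multiplication in mind when counting or aggregating over the joined table.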

##### Export this dataframe as a CSV file so that the results are saved

In [None]:
joined_df.to_csv(os.path.join("processed_data", 'original_and_spotify_data_.csv'), index=False)