# Getting Data with the Last.fm API 

Note: this code is based on the sources from Dataquest.

In this code example, we’re going to learn some advanced techniques for working with the Last.fm API. In the lecture code, we worked with a simple API that was ideal for teaching the basics:

- It had a few easy-to-understand endpoints.
- Because it didn’t require authentication, we didn’t have to worry about how to tell the API that we had permission to use it.
- The data that each endpoint responded with was small and had an easy-to-understand structure.

In reality, most APIs are more complex than this, so to work with them you need to understand some more advanced concepts. Specifically, this code example shows:

- How to authenticate yourself with an API key.
- How to use rate limiting and other techniques to work within the guidelines of an API.
- How to use pagination to work with large responses.

### Last.fm API

We’ll be working with the Last.fm API. Last.fm is a music service that builds personal profiles by connecting to music streaming apps such as iTunes and Spotify and keeping track of the music you listen to.

They provide free access to their API so that music services can send them data, but also provide endpoints that summarize all the data that Last.fm has on various artists, songs, and genres. We’ll be building a dataset of popular artists using their API.

### Following API Guidelines
When working with APIs, it’s important to follow their guidelines. If you don’t, you can get yourself banned from using the API.

- When you make a request to the Last.fm API, you can identify yourself using headers. Last.fm wants us to specify a user-agent in the header so they know who we are. We’ll learn how to do that when we make our first request in a moment.
- In order to build our data set, we’re going to need to make thousands of requests to the Last.fm API. While they don’t provide a specific limit in their documentation, they do advise that we shouldn’t be continuously making many calls per second. In this tutorial we’re going to learn a few strategies for rate limiting, or making sure we don’t hit their API too much, so we can avoid getting banned.


## Authenticating with API Keys

The majority of APIs require you to authenticate yourself so they know you have permission to use them. One of the most common forms of authentication is to use an API key, which is like a password for using their API. If you don’t provide an API key when making a request, you will get an error.

The process for using an API key works like this:

- You create an account with the provider of the API.
- You request an API key, which is usually a long string like 54686973206973206d7920415049204b6579.
- You record your API key somewhere safe, like a password keeper. If someone gets your API key, they can use the API pretending to be you.
- Every time you make a request, you provide the API key to authenticate yourself.
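
One common way to keep a key out of the code itself is to read it from an environment variable. The sketch below assumes a variable named `LASTFM_API_KEY`; both that name and the `load_api_key()` helper are hypothetical choices for illustration, not something Last.fm requires.

```python
import os

def load_api_key(env_var="LASTFM_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set in your shell, `API_KEY = load_api_key()` would replace a hard-coded string, so the key never ends up in a shared notebook.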

To get an API key for Last.fm, start at https://www.last.fm/api

We’ll start by defining our API key and a user-agent:

In [1]:
API_KEY = '9f976420cbe574345638913272165410'   # Note that API key shown in this code is not a real API key!
USER_AGENT = 'jguo23' # Note that name shown in this code is not a real name!

Next, we’ll import the requests library, create a dictionary for our headers and parameters, and make our first request!

In [2]:
import requests

headers = {
    'user-agent': USER_AGENT
}

payload = {
    'api_key': API_KEY,
    'method': 'chart.gettopartists',
    'format': 'json'
}

r = requests.get('https://ws.audioscrobbler.com/2.0/', headers=headers, params=payload)
r.status_code

200

### API Status Codes

Status codes are returned with every request that is made to a web server. Status codes indicate information about what happened with a request. Here are some codes that are relevant to GET requests:

200: Everything went okay, and the result has been returned (if any).\
301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.\
400: The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.\
401: The server thinks you’re not authenticated. Many APIs require login credentials, so this happens when you don’t send the right credentials to access an API.\
403: The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.\
404: The resource you tried to access wasn’t found on the server.\
503: The server is not ready to handle the request.
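
As a small aid for reading these codes, Python's standard library already knows the official phrase for each one. The helper names below (`describe_status`, `is_success`) are made up for this sketch.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unrecognized status code)"

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

for code in (200, 301, 400, 401, 403, 404, 503):
    print(describe_status(code), "| success:", is_success(code))
```

In the notebook, `describe_status(r.status_code)` could be used anywhere we currently look at the bare number.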

To save ourselves time, we’re going to create a function that does a lot of this work for us. We’ll provide the function with a payload dictionary, and then we’ll add extra keys to that dictionary and pass it with our other options to make the request.

In [3]:
def lastfm_get(payload):
    # define headers and URL
    headers = {'user-agent': USER_AGENT}
    url = 'https://ws.audioscrobbler.com/2.0/'

    # Add API key and format to the payload
    payload['api_key'] = API_KEY
    payload['format'] = 'json'

    response = requests.get(url, headers=headers, params=payload)
    return response

See how much it simplifies making our earlier request:

In [4]:
r = lastfm_get({
    'method': 'chart.gettopartists'
})

r.status_code

200
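
One detail worth knowing about `lastfm_get()` as written: it adds the `api_key` and `format` keys to the dictionary the caller passed in. That is harmless here, but if you would rather leave the caller's dictionary untouched, a variant could copy it first. `build_params()` below is a hypothetical helper for illustration, not part of the original code.

```python
def build_params(payload, api_key):
    """Return a new dict with the API key and format added."""
    params = dict(payload)  # shallow copy, so the caller's dict is not modified
    params["api_key"] = api_key
    params["format"] = "json"
    return params

original = {"method": "chart.gettopartists"}
params = build_params(original, "not-a-real-key")

print(original)  # still only contains 'method'
print(params)    # contains 'method', 'api_key' and 'format'
```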

As we learned in our lecture code, most APIs return data in a JSON format, and we can use the Python json module to print the JSON data in an easier-to-understand format.

Let’s re-use the jprint() function we created in that tutorial and print our response from the API:

In [5]:
import json

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(r.json())

{
    "artists": {
        "@attr": {
            "page": "1",
            "perPage": "50",
            "total": "5808871",
            "totalPages": "116178"
        },
        "artist": [
            {
                "image": [
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/34s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "small"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/64s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "medium"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/174s/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "size": "large"
                    },
                    {
                        "#text": "https://lastfm.freetls.fastly.net/i/u/300x300/2a96cbd8b46e442fc41c2b86b821562f.png",
                        "s

The structure of the JSON response is:

A dictionary with a single artists key, containing:
- an @attr key containing a number of attributes about the response.
- an artist key containing a list of artist objects.
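
To make the nesting concrete, here is a trimmed-down stand-in for the response. The `@attr` values are copied from the output above, while the two artist entries are made-up placeholders.

```python
# A reduced copy of the response structure; artist entries are placeholders
sample = {
    "artists": {
        "@attr": {
            "page": "1",
            "perPage": "50",
            "total": "5808871",
            "totalPages": "116178",
        },
        "artist": [
            {"name": "Example Artist A", "listeners": "100"},
            {"name": "Example Artist B", "listeners": "50"},
        ],
    }
}

attrs = sample["artists"]["@attr"]         # metadata about the response
artist_list = sample["artists"]["artist"]  # list of artist dictionaries

# Numeric values arrive as strings, so convert before doing arithmetic
total_pages = int(attrs["totalPages"])
names = [artist["name"] for artist in artist_list]

print(total_pages)
print(names)
```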

Let’s look at the '@attr' (attributes) key by itself:

In [6]:
jprint(r.json()['artists']['@attr'])

{
    "page": "1",
    "perPage": "50",
    "total": "5808871",
    "totalPages": "116178"
}


There are over 5.8 million total artists in the results of this API endpoint, and we’re being shown the first 50 artists in a single ‘page’. This technique of spreading the results over multiple pages is called pagination.

## Working with Paginated Data

In order to build a dataset with many artists, we need to make an API request for each page and then put them together. We can control the pagination of our results using two optional parameters specified in the documentation:

- limit: The number of results to fetch per page (defaults to 50).
- page: Which page of the results we want to fetch.

Because the '@attr' key gives us the total number of pages, we can use a while loop and iterate over pages until the page number is equal to the last page number.

We can also use the limit parameter to fetch more results in each page: we’ll fetch 500 results per page so we only need to make ~11,600 calls instead of ~116,000. Still, we need to think about rate limiting to comply with the Last.fm API’s terms of service. Let’s look at a few approaches.
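
The number of calls follows directly from the `total` value reported in the `@attr` output earlier. As a quick check (the helper name `pages_needed` is just for this sketch):

```python
import math

total_artists = 5808871  # "total" from the @attr output shown earlier

def pages_needed(total, per_page):
    """Number of pages required to cover `total` results."""
    return math.ceil(total / per_page)

print(pages_needed(total_artists, 50))   # 116178, matching "totalPages"
print(pages_needed(total_artists, 500))  # 11618
```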

### Rate Limiting

Rate limiting is using code to limit the number of times per second that we hit a particular API. Rate limiting will make your code slower, but it’s better than getting banned from using an API altogether.

**time.sleep() function.**

The easiest way to perform rate limiting is to use Python’s time.sleep() function. This function accepts a float specifying a number of seconds to wait before proceeding.

For instance, the following code will wait one quarter of a second between the two print statements:

*import time*

*print("one")*\
*time.sleep(0.25)*\
*print("two")*

Because making the API call itself takes some time, we’re likely to be making two or three calls per second, not the four calls per second that sleeping for 0.25s might suggest. This should be enough to keep us under Last.fm’s threshold (if we were going to be hitting their API for a number of hours, we might choose an even slower rate).
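
If you want the spacing between calls to be closer to a fixed interval regardless of how long each call takes, one option is to sleep only for the time that remains. This is a minimal sketch rather than what the rest of this notebook uses; the class name `MinIntervalLimiter` and its injectable `clock` and `sleep` arguments are assumptions made for illustration and testing.

```python
import time

class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            # Sleep only for the part of the interval that has not elapsed yet
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

limiter = MinIntervalLimiter(0.25)
for label in ("one", "two", "three"):
    limiter.wait()  # the first call returns immediately
    print(label)
```

Calling `limiter.wait()` before each request would then replace a fixed `time.sleep(0.25)`.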

**Cache**

Another technique that’s useful for rate limiting is using a local database to cache the results of any API call, so that if we make the same call twice, the second time it reads it from the local cache. Imagine that as you are writing your code, you discover syntax errors and your loop fails, and you have to start again. By using a local cache, you have two benefits:

- You don’t make extra API calls that you don’t need to.
- You don’t need to wait the extra time to rate limit when reading the repeated calls from the cache.
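
To illustrate the idea only, here is a toy in-memory cache keyed on the URL and parameters. It is not how any particular library works internally; `CachedFetcher` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call.

```python
class CachedFetcher:
    """Remember results of earlier calls so repeated calls skip the fetch."""

    def __init__(self, fetch):
        self._fetch = fetch  # function that performs the real request
        self._cache = {}
        self.calls = 0       # how many times the real fetch was used

    def get(self, url, params):
        # Sort the parameters so the same request always maps to the same key
        key = (url, tuple(sorted(params.items())))
        if key in self._cache:
            return self._cache[key], True   # (result, from_cache)
        self.calls += 1
        result = self._fetch(url, params)
        self._cache[key] = result
        return result, False

def fake_fetch(url, params):
    # Stand-in for a network request
    return f"response for page {params['page']}"

fetcher = CachedFetcher(fake_fetch)
print(fetcher.get("https://example.invalid/api", {"page": 1}))
print(fetcher.get("https://example.invalid/api", {"page": 1}))
print(fetcher.calls)
```

The second call is answered from the dictionary, so no rate-limiting sleep would be needed for it. A real cache also has to persist to disk and handle expiry.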

Creating logic for a local cache is a reasonably complex task, but there’s a great library called requests-cache which will do all of the work for you with only a couple of lines of code.

You can install requests-cache using pip:

*pip install requests-cache*

In [8]:
# pip install requests-cache

Collecting requests-cache
  Downloading requests_cache-1.1.1-py3-none-any.whl (60 kB)
     ---------------------------------------- 60.3/60.3 kB 3.3 MB/s eta 0:00:00
Collecting url-normalize>=1.4
  Using cached url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting cattrs>=22.2
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)
     ---------------------------------------- 57.5/57.5 kB ? eta 0:00:00
Collecting attrs>=21.2
  Downloading attrs-23.2.0-py3-none-any.whl (60 kB)
     ---------------------------------------- 60.8/60.8 kB 3.2 MB/s eta 0:00:00
Collecting exceptiongroup>=1.1.1
  Downloading exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Installing collected packages: url-normalize, exceptiongroup, attrs, cattrs, requests-cache
  Attempting uninstall: attrs
    Found existing installation: attrs 22.1.0
    Uninstalling attrs-22.1.0:
      Successfully uninstalled attrs-22.1.0
Successfully installed attrs-23.2.0 cattrs-23.2.3 exceptiongroup-1.2.0 requests-cache-1.1.1 url-no

In [9]:
import requests_cache

requests_cache.install_cache()



The last thing we should consider is that our roughly 11,600 requests will take a while to make (at two or three calls per second, likely an hour or more), so we’ll print some output in each loop to see how far along we are. We’ll use an IPython display trick to clear the output after each run so things look neater in our notebook.

In [10]:
import time
from IPython.display import clear_output

responses = []

page = 1
total_pages = 99999 # this is just a dummy number so the loop starts

while page <= total_pages:
    payload = {
        'method': 'chart.gettopartists',
        'limit': 500,
        'page': page
    }

    # print some output so we can see the status
    print("Requesting page {}/{}".format(page, total_pages))
    # clear the output to make things neater
    clear_output(wait = True)

    # make the API call
    response = lastfm_get(payload)

    # if we get an error, print the response and halt the loop
    if response.status_code != 200:
        print(response.text)
        break

    # extract pagination info
    page = int(response.json()['artists']['@attr']['page'])
    total_pages = int(response.json()['artists']['@attr']['totalPages'])

    # append response
    responses.append(response)

    # if it's not a cached result, sleep
    if not getattr(response, 'from_cache', False):
        time.sleep(0.25)

    # increment the page number
    page += 1

Requesting page 11618/11618


## Processing the Data

In [11]:
import pandas as pd

r0 = responses[0]
r0_json = r0.json()
r0_artists = r0_json['artists']['artist']
r0_df = pd.DataFrame(r0_artists)
r0_df.head()

Unnamed: 0,name,playcount,listeners,mbid,url,streamable,image
0,IU,60950542,718429,b9545342-1e6d-4dae-84ac-013374ad8d7c,https://www.last.fm/music/IU,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
1,Jessie J,42829875,2293778,d24fb461-dee8-41fc-bb15-2f13bb2644a6,https://www.last.fm/music/Jessie+J,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
2,Meek Mill,33168929,1369403,31bcadcc-e1da-4cad-bec8-2f4f1d41b095,https://www.last.fm/music/Meek+Mill,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
3,All Time Low,95061511,1937768,62162215-b023-4f0e-84bd-1e9412d5b32c,https://www.last.fm/music/All+Time+Low,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
4,Sixpence None the Richer,15337463,1463521,c2c70ed6-5f10-445c-969f-2c16bc9a4c2e,https://www.last.fm/music/Sixpence+None+the+Ri...,0,[{'#text': 'https://lastfm.freetls.fastly.net/...


We can use a list comprehension to perform this operation on each response from responses, giving us a list of dataframes, and then use the pandas.concat() function to turn the list of dataframes into a single dataframe.

In [12]:
frames = [pd.DataFrame(r.json()['artists']['artist']) for r in responses]
artists = pd.concat(frames)
artists.head()

Unnamed: 0,name,playcount,listeners,mbid,url,streamable,image
0,IU,60950542,718429,b9545342-1e6d-4dae-84ac-013374ad8d7c,https://www.last.fm/music/IU,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
1,Jessie J,42829875,2293778,d24fb461-dee8-41fc-bb15-2f13bb2644a6,https://www.last.fm/music/Jessie+J,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
2,Meek Mill,33168929,1369403,31bcadcc-e1da-4cad-bec8-2f4f1d41b095,https://www.last.fm/music/Meek+Mill,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
3,All Time Low,95061511,1937768,62162215-b023-4f0e-84bd-1e9412d5b32c,https://www.last.fm/music/All+Time+Low,0,[{'#text': 'https://lastfm.freetls.fastly.net/...
4,Sixpence None the Richer,15337463,1463521,c2c70ed6-5f10-445c-969f-2c16bc9a4c2e,https://www.last.fm/music/Sixpence+None+the+Ri...,0,[{'#text': 'https://lastfm.freetls.fastly.net/...


Our next step will be to remove the image column, which contains URLs for artist images that aren’t really helpful to us from an analysis standpoint.

In [13]:
artists = artists.drop('image', axis=1)
artists.head()

Unnamed: 0,name,playcount,listeners,mbid,url,streamable
0,IU,60950542,718429,b9545342-1e6d-4dae-84ac-013374ad8d7c,https://www.last.fm/music/IU,0
1,Jessie J,42829875,2293778,d24fb461-dee8-41fc-bb15-2f13bb2644a6,https://www.last.fm/music/Jessie+J,0
2,Meek Mill,33168929,1369403,31bcadcc-e1da-4cad-bec8-2f4f1d41b095,https://www.last.fm/music/Meek+Mill,0
3,All Time Low,95061511,1937768,62162215-b023-4f0e-84bd-1e9412d5b32c,https://www.last.fm/music/All+Time+Low,0
4,Sixpence None the Richer,15337463,1463521,c2c70ed6-5f10-445c-969f-2c16bc9a4c2e,https://www.last.fm/music/Sixpence+None+the+Ri...,0


Now, let’s get to know the data a little using DataFrame.info() and DataFrame.describe():

In [14]:
artists.info()
artists.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11750 entries, 0 to 999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        11750 non-null  object
 1   playcount   11750 non-null  object
 2   listeners   11750 non-null  object
 3   mbid        11750 non-null  object
 4   url         11750 non-null  object
 5   streamable  11750 non-null  object
dtypes: object(6)
memory usage: 642.6+ KB


Unnamed: 0,name,playcount,listeners,mbid,url,streamable
count,11750,11750,11750,11750.0,11750,11750
unique,8193,8188,8110,4218.0,8193,1
top,Vanilla,2681903,128764,,https://www.last.fm/music/Vanilla,0
freq,2,4,5,5694.0,2,11750


We were expecting about 5.8 million artists, but we only have 11,750 rows.

Let’s look at the length of the list of artists across our list of response objects to see if we can better understand what has gone wrong.

In [15]:
artist_counts = [len(r.json()['artists']['artist']) for r in responses]
pd.Series(artist_counts).value_counts()

0       11598
1000        8
500         7
50          5
dtype: int64

It looks like only twenty of our requests returned any artists. Let’s look at the first fifty in order and see if there’s a pattern.

In [16]:
print(artist_counts[:50])

[50, 50, 50, 50, 50, 1000, 500, 1000, 500, 1000, 500, 1000, 500, 1000, 500, 1000, 500, 1000, 500, 1000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


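It looks like after the first twenty responses, this API doesn’t return any data, which is an undocumented limitation. The describe() output above also shows repeated artists (only 8,193 unique names across 11,750 rows), so let’s drop the duplicate rows and reset the index: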

In [17]:
artists = artists.drop_duplicates().reset_index(drop=True)
artists.describe()

Unnamed: 0,name,playcount,listeners,mbid,url,streamable
count,8193,8193,8193,8193.0,8193,8193
unique,8193,8188,8110,4218.0,8193,1
top,IU,3641952,128764,,https://www.last.fm/music/IU,0
freq,1,2,3,3967.0,1,8193
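
Given that pages beyond a certain point came back empty, a paging loop could stop at the first empty page instead of requesting every remaining page. The sketch below shows the idea with a stand-in `fake_fetch_page()` function in place of a real API call; both function names are hypothetical.

```python
def collect_until_empty(fetch_page, max_pages):
    """Collect items page by page, stopping at the first empty page."""
    items = []
    pages_requested = 0
    for page in range(1, max_pages + 1):
        pages_requested += 1
        batch = fetch_page(page)
        if not batch:
            break  # nothing more to collect, so stop making requests
        items.extend(batch)
    return items, pages_requested

def fake_fetch_page(page):
    # Stand-in for the API: only the first three pages contain data
    if page <= 3:
        return [f"artist-{page}-{i}" for i in range(2)]
    return []

items, requested = collect_until_empty(fake_fetch_page, max_pages=100)
print(len(items), requested)  # 6 4
```

Applied to the loop above, this kind of check would have avoided most of the requests that returned no artists.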


## Augmenting the Data Using a Second Last.fm API Endpoint

In order to make our data more interesting, let’s use another Last.fm API endpoint to add some extra data about each artist.

Last.fm allows its users to create “tags” to categorize artists. By using the artist.getTopTags endpoint we can get the top tags from an individual artist.

Let’s look at the response from that endpoint for one of our artists as an example:

In [18]:
r = lastfm_get({
    'method': 'artist.getTopTags',
    'artist':  'The Weeknd'
})

jprint(r.json())

{
    "toptags": {
        "@attr": {
            "artist": "The Weeknd"
        },
        "tag": [
            {
                "count": 100,
                "name": "rnb",
                "url": "https://www.last.fm/tag/rnb"
            },
            {
                "count": 74,
                "name": "electronic",
                "url": "https://www.last.fm/tag/electronic"
            },
            {
                "count": 39,
                "name": "dubstep",
                "url": "https://www.last.fm/tag/dubstep"
            },
            {
                "count": 38,
                "name": "Canadian",
                "url": "https://www.last.fm/tag/Canadian"
            },
            {
                "count": 24,
                "name": "prog-rnb",
                "url": "https://www.last.fm/tag/prog-rnb"
            },
            {
                "count": 18,
                "name": "seen live",
                "url": "https://www.last.fm/tag/seen+live"
       

We’re really only interested in the tag names, and then only the most popular tags. Let’s use a list comprehension to create a list of the top three tag names:

In [19]:
tags = [t['name'] for t in r.json()['toptags']['tag'][:3]]
tags

['rnb', 'electronic', 'dubstep']

And then we can use the str.join() method to turn the list into a string:

In [20]:
', '.join(tags)

'rnb, electronic, dubstep'

Let’s create a function that uses this logic to return a string of the most popular tags for any artist, which we’ll use later to apply to every row in our dataframe.

Remember that this function will be used a lot in close succession, so we’ll reuse our time.sleep() logic from earlier.

In [21]:
def lookup_tags(artist):
    response = lastfm_get({
        'method': 'artist.getTopTags',
        'artist':  artist
    })

    # if there's an error, just return nothing
    if response.status_code != 200:
        return None

    # extract the top three tags and turn them into a string
    tags = [t['name'] for t in response.json()['toptags']['tag'][:3]]
    tags_str = ', '.join(tags)

    # rate limiting
    if not getattr(response, 'from_cache', False):
        time.sleep(0.25)
    return tags_str

In [22]:
lookup_tags("Billie Eilish")

'pop, indie pop, seen live'
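
`lookup_tags()` assumes every successful response contains a `toptags` dictionary with a `tag` list. As a hedged sketch, assuming a response could arrive as JSON without that structure (for example an error body), the extraction step could be written defensively. `extract_top_tags()` is a hypothetical helper, and the error payload in the usage lines is a made-up example.

```python
def extract_top_tags(payload, n=3):
    """Return up to n tag names from a parsed artist.getTopTags response.

    Returns None when the payload has no usable tag list.
    """
    tags = payload.get("toptags", {}).get("tag")
    if not isinstance(tags, list):
        return None
    return ", ".join(t["name"] for t in tags[:n] if "name" in t)

# Shape taken from the artist.getTopTags output shown earlier
good = {"toptags": {"tag": [
    {"count": 100, "name": "rnb"},
    {"count": 74, "name": "electronic"},
    {"count": 39, "name": "dubstep"},
    {"count": 38, "name": "Canadian"},
]}}
# Made-up example of a payload without a tag list
bad = {"error": 6, "message": "example error message"}

print(extract_top_tags(good))  # rnb, electronic, dubstep
print(extract_top_tags(bad))   # None
```

An artist with an empty tag list yields an empty string rather than None, which keeps "no tags" distinguishable from "no usable response".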

Applying this function to our 8,193 rows will take just under an hour. To confirm that things are actually progressing, we’ll monitor the operation with output like we did before.

Unfortunately, manually printing output isn’t an approach we can use when applying a function with the pandas Series.apply() method. Instead, we’ll use the tqdm package which automates this.

In [25]:
from tqdm import tqdm
tqdm.pandas()

artists['tags'] = artists['name'].progress_apply(lookup_tags)

100%|██████████| 8193/8193 [47:48<00:00,  2.86it/s]  


Let’s look at the result of our operation:

In [27]:
artists.head()

Unnamed: 0,name,playcount,listeners,mbid,url,streamable,tags
0,IU,60950542,718429,b9545342-1e6d-4dae-84ac-013374ad8d7c,https://www.last.fm/music/IU,0,"k-pop, Korean, female vocalists"
1,Jessie J,42829875,2293778,d24fb461-dee8-41fc-bb15-2f13bb2644a6,https://www.last.fm/music/Jessie+J,0,"pop, british, female vocalists"
2,Meek Mill,33168929,1369403,31bcadcc-e1da-4cad-bec8-2f4f1d41b095,https://www.last.fm/music/Meek+Mill,0,"Hip-Hop, rap, hip hop"
3,All Time Low,95061511,1937768,62162215-b023-4f0e-84bd-1e9412d5b32c,https://www.last.fm/music/All+Time+Low,0,"pop punk, rock, powerpop"
4,Sixpence None the Richer,15337463,1463521,c2c70ed6-5f10-445c-969f-2c16bc9a4c2e,https://www.last.fm/music/Sixpence+None+the+Ri...,0,"pop, female vocalists, rock"


## Finalizing and Exporting the Data

Before we export our data, we might like to sort the data so the most popular artists are at the top. So far we’ve just been storing data as text without converting any types:

In [28]:
artists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8193 entries, 0 to 8192
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        8193 non-null   object
 1   playcount   8193 non-null   object
 2   listeners   8193 non-null   object
 3   mbid        8193 non-null   object
 4   url         8193 non-null   object
 5   streamable  8193 non-null   object
 6   tags        8193 non-null   object
dtypes: object(7)
memory usage: 448.2+ KB


Let’s start by converting the listeners and playcount columns to numeric.

In [29]:
artists[["playcount", "listeners"]] = artists[["playcount", "listeners"]].astype(int)

Now, let’s sort by number of listeners:

In [30]:
artists = artists.sort_values("listeners", ascending=False)
artists.head(10)

Unnamed: 0,name,playcount,listeners,mbid,url,streamable,tags
14,Beck,123093980,3539193,a8baaa41-50f1-4f63-979e-717c14979dfb,https://www.last.fm/music/Beck,0,"alternative, indie, rock"
20,Nickelback,102831614,3434912,bc710bcf-8815-42cf-bad2-3f1d12246aeb,https://www.last.fm/music/Nickelback,0,"rock, alternative rock, hard rock"
24,Simon & Garfunkel,87203562,3001132,5d02f264-e225-41ff-83f7-d9b1f0b1874a,https://www.last.fm/music/Simon+&+Garfunkel,0,"folk, classic rock, singer-songwriter"
49,Beastie Boys,98829048,2909334,9beb62b2-88db-4cea-801e-162cd344ee53,https://www.last.fm/music/Beastie+Boys,0,"Hip-Hop, rap, alternative"
33,Sum 41,112155374,2815887,f2eef649-a6d5-4114-afba-e50ab26254d2,https://www.last.fm/music/Sum+41,0,"punk rock, punk, pop punk"
46,Lily Allen,85568797,2809413,6e0c7c0e-cba5-4c2c-a652-38f71ef5785d,https://www.last.fm/music/Lily+Allen,0,"pop, female vocalists, british"
28,Jimmy Eat World,86983395,2698952,bbc5b66b-d037-4f26-aecf-0b129e7f876a,https://www.last.fm/music/Jimmy+Eat+World,0,"rock, alternative, emo"
36,Lynyrd Skynyrd,44795948,2605487,c544ed4d-2390-4442-a83e-1ea2883b09c8,https://www.last.fm/music/Lynyrd+Skynyrd,0,"classic rock, Southern Rock, rock"
52,The Prodigy,117342613,2587952,4a4ee089-93b1-4470-af9a-6ff575d32704,https://www.last.fm/music/The+Prodigy,0,"electronic, techno, industrial"
19,Whitney Houston,41666258,2457611,0307edfc-437c-4b48-8700-80680e66a228,https://www.last.fm/music/Whitney+Houston,0,"pop, female vocalists, soul"


In [31]:
artists.to_csv('artists.csv', index=False)