# Kworb.net Spotify Most Streamed Artists
The code below collects data from kworb.net on the 3,000 most-streamed artists on Spotify. It first collects each artist's unique URL and stores the URLs in a DataFrame, which is saved to a CSV file (artist_urls.csv). A function then iterates through each URL in the CSV and builds a dictionary from the table on each artist's page. Once every artist has been scraped, the data is saved to a second CSV file (scraped_artist_data.csv).

The data contains each artist's total all-time streams, daily streams, and track counts, with streams and tracks broken down by lead, solo, and featured credits.

We then go back to the main Spotify Most Streamed Artists page to collect the artists' names and combine them with the scraped data, so that each row is identified by the artist's name rather than by the artist's URL.

In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import os
import time

The following code performs the initial web scraping step, where we extract individual Spotify artist page URLs from Kworb's Spotify Most Streamed Artists of All Time page:

- A GET request is sent to the main artists page using the requests library
- The HTML content of the page is parsed with BeautifulSoup
- All \<a> tags are scanned to find links that match the pattern of individual artist pages (kworb.net/spotify/artist/...html)
- Matching URLs are converted into full links by prepending the base URL (https://kworb.net) and stored in a list
- We print out a few of the links to check if they are correct
- The collected artist URLs are saved into a CSV file called artist_urls.csv for later use
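
The filtering step in the list above can be tried offline before touching the live site. The following is a minimal sketch, assuming a made-up HTML snippet (`SAMPLE_HTML`) and a hypothetical helper name (`extract_artist_links`); the real page's markup may differ. It also uses `urljoin` instead of string concatenation, which is a swapped-in choice rather than what the notebook cell does.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/spotify/artist/3TVXtAsR1Inumwj472S9r4_songs.html">Drake</a></td></tr>
  <tr><td><a href="/spotify/artist/06HL4z0CvFAxyc27GXpf02_songs.html">Taylor Swift</a></td></tr>
  <tr><td><a href="/spotify/listeners.html">Listeners</a></td></tr>
  <tr><td><a>No href here</a></td></tr>
</table>
"""

def extract_artist_links(html, base_url="https://kworb.net"):
    """Return absolute URLs for every anchor that looks like an artist page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Same test as the notebook: artist path plus an .html suffix
        if "/spotify/artist/" in href and href.endswith(".html"):
            # urljoin resolves root-relative hrefs against the base URL
            links.append(urljoin(base_url, href))
    return links

sample_links = extract_artist_links(SAMPLE_HTML)
for link in sample_links:
    print(link)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the filter without sending a request.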

In [20]:
# Base URL for the artists page
url = 'https://kworb.net/spotify/artists.html'

# Send GET request to the base artists page
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the page did not load
soup = BeautifulSoup(response.content, 'html.parser')

# Initialise a list for artist URLs
artist_links = []

# Search for all <a> tags that link to artist pages
for a in soup.find_all('a', href=True):
    href = a['href']
    # Check if the link is an artist page (it will have '/spotify/artist/' in it)
    if '/spotify/artist/' in href and href.endswith('.html'):
        artist_links.append("https://kworb.net" + href)

# Print a few links to check if they're correct
for link in artist_links[:5]:
    print(link)

# Create a DataFrame
df = pd.DataFrame(artist_links, columns=['Artist_URL'])

# Define the relative path
save_path = os.path.join('..', '..', 'Data', 'Raw', 'artist_urls.csv')

# Make sure directory exists
os.makedirs(os.path.dirname(save_path), exist_ok=True)

# Save the DataFrame to CSV
df.to_csv(save_path, index=False)
print(f"CSV saved to: {save_path}")

https://kworb.net/spotify/artist/3TVXtAsR1Inumwj472S9r4_songs.html
https://kworb.net/spotify/artist/06HL4z0CvFAxyc27GXpf02_songs.html
https://kworb.net/spotify/artist/4q3ewBCX7sLwd24euuV69X_songs.html
https://kworb.net/spotify/artist/1Xyo4u8uXC1ZmMpatF05PJ_songs.html
https://kworb.net/spotify/artist/1uNFoZAHBGtllmzznpCI3s_songs.html
CSV saved to: ../../Data/Raw/artist_urls.csv


In [24]:
# Read in the CSV file with artist URLs
df = pd.read_csv('../../Data/Raw/artist_urls.csv')

# Check if loaded correctly
print(df.head())

# Check the columns
print(df.columns)

# Get the number of rows
num_rows = df.shape[0]
print(f"The DataFrame has {num_rows} rows.")  # Print the number of rows

                                          Artist_URL
0  https://kworb.net/spotify/artist/3TVXtAsR1Inum...
1  https://kworb.net/spotify/artist/06HL4z0CvFAxy...
2  https://kworb.net/spotify/artist/4q3ewBCX7sLwd...
3  https://kworb.net/spotify/artist/1Xyo4u8uXC1Zm...
4  https://kworb.net/spotify/artist/1uNFoZAHBGtll...
Index(['Artist_URL'], dtype='object')
The DataFrame has 3000 rows.


The function below scrapes data from an artist's page on Kworb and returns the artist's stream and track totals as a solo artist, lead artist, and featured artist.

The function starts by sending a GET request to the artist's page URL. If the request succeeds, BeautifulSoup parses the returned HTML.

The artist's data is stored in a table body (\<tbody>), so the function looks for the first \<tbody> element on the page. If it doesn't find one, it prints a message and returns None.

The function loops through the table rows and keeps those with exactly five cells: a label ("Streams", "Daily", or "Tracks") followed by four values (total, as lead, solo, and as feature). For the "Daily" row, only the total is stored.

All the extracted data is stored in a dictionary so it can be easily accessed later.

If there's any issue during the scraping (e.g., the request fails or the data can't be found), an error message is printed and None is returned.

In [27]:
# Scrape artist data from their page
def scrape_artist_data(artist_url):
    # Send GET request to artist's page
    try:
        response = requests.get(artist_url, timeout=30)  # Avoid hanging on a stalled request
        response.raise_for_status()  # Ensure a successful response
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the tbody containing the streams, daily, and tracks data
        tbody = soup.find('tbody')

        if not tbody:
            print(f"No tbody found for {artist_url}")
            return None

        # Extract rows from the tbody
        rows = tbody.find_all('tr')

        # Initialize variables to store scraped data
        data = {
            'artist_url': artist_url,
            'total_streams': None,
            'streams_as_lead': None,
            'solo_streams': None,
            'featured_streams': None,
            'daily_streams': None,
            'tracks_total': None,
            'tracks_as_lead': None,
            'tracks_solo': None,
            'tracks_as_feature': None
        }

        # Iterate through rows and extract data
        for row in rows:
            columns = row.find_all('td')
            if len(columns) == 5:
                label = columns[0].text.strip()
                total = columns[1].text.strip()
                as_lead = columns[2].text.strip()
                solo = columns[3].text.strip()
                as_feature = columns[4].text.strip()

                # Assign values based on the label
                if label == 'Streams':
                    data['total_streams'] = total
                    data['streams_as_lead'] = as_lead
                    data['solo_streams'] = solo
                    data['featured_streams'] = as_feature
                elif label == 'Daily':
                    data['daily_streams'] = total
                elif label == 'Tracks':
                    data['tracks_total'] = total
                    data['tracks_as_lead'] = as_lead
                    data['tracks_solo'] = solo
                    data['tracks_as_feature'] = as_feature

        return data

    except Exception as e:
        print(f"Error scraping {artist_url}: {e}")
        return None
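
The row-parsing logic can also be checked without any network access. This is a minimal sketch, assuming a hypothetical helper (`parse_stats_table`) and a made-up HTML layout (`SAMPLE_ARTIST_HTML`) that mirrors the five-cell rows described above; the figures are taken from the first row of the outputs further down, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up markup; only the five-cell row shape is taken from the notebook.
SAMPLE_ARTIST_HTML = """
<table><tbody>
  <tr><td>Streams</td><td>112,592,066,779</td><td>76,886,517,261</td>
      <td>42,430,139,565</td><td>35,705,549,518</td></tr>
  <tr><td>Tracks</td><td>513</td><td>314</td><td>197</td><td>199</td></tr>
  <tr><td>Short row</td><td>ignored</td></tr>
</tbody></table>
"""

def parse_stats_table(html):
    """Map each row label to its four values, or return None if no tbody exists."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        return None
    stats = {}
    for row in tbody.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Keep only rows shaped like: label, total, as lead, solo, as feature
        if len(cells) == 5:
            stats[cells[0]] = cells[1:]
    return stats

stats = parse_stats_table(SAMPLE_ARTIST_HTML)
print(stats["Tracks"])
```

Separating parsing from fetching in this way is a design choice of the sketch, not something the notebook does; it mainly makes the five-cell check easy to test.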

The next cell calls the function on each URL from the CSV, pausing 0.5 seconds between requests to avoid overloading the server. Artists whose pages fail to scrape are skipped.

In [None]:
# List to store the scraped data
scraped_data = []

# Make sure we're referencing the correct column for artist URLs
artist_column = 'Artist_URL'

# Iterate over each artist URL in the CSV
for index, row in df.iterrows():
    artist_url = row[artist_column]
    
    # Scrape data for the current artist
    print(f"Scraping artist URL: {artist_url}")  # Print the URL being scraped
    
    # Scrape the data
    data = scrape_artist_data(artist_url)
    
    if data:
        scraped_data.append(data)
    
    # Add a delay
    time.sleep(0.5)
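
Because failed pages are skipped silently, it can help to keep a list of the URLs that returned nothing so they can be retried or checked later. The following is a minimal sketch of that idea, assuming a hypothetical wrapper (`scrape_all`) that accepts any scraping function; a fake scraper and a no-op sleep are used here so the example runs without network access.

```python
import time

def scrape_all(urls, scrape_fn, delay=0.5, sleep=time.sleep):
    """Run scrape_fn on every URL, keeping successes and failures separate."""
    results, failed = [], []
    for i, url in enumerate(urls):
        data = scrape_fn(url)
        if data:
            results.append(data)
        else:
            failed.append(url)  # Remember which pages produced no data
        # Pause between requests, but not after the last one
        if i < len(urls) - 1:
            sleep(delay)
    return results, failed

# Stand-in scraper for demonstration: "fails" on any URL containing 'bad'
def fake_scraper(url):
    return None if "bad" in url else {"artist_url": url}

demo_results, demo_failed = scrape_all(
    ["ok-1", "bad-2", "ok-3"], fake_scraper, sleep=lambda seconds: None
)
print(f"Scraped {len(demo_results)} pages, {len(demo_failed)} failed: {demo_failed}")
```

With the notebook's own function, the call would look like `scrape_all(df['Artist_URL'].tolist(), scrape_artist_data)`.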

Here we save the scraped data to a new CSV file and print the first few rows to check that the data was collected correctly.

In [31]:
# Create a DataFrame from the scraped data
scraped_df = pd.DataFrame(scraped_data)

# Save the data to a new CSV file
scraped_df.to_csv('../../Data/Raw/scraped_artist_data.csv', index=False)

# Print the first few rows of the scraped data to check
print(scraped_df.head())

Scraped Data:
                                          artist_url    total_streams  \
0  https://kworb.net/spotify/artist/3TVXtAsR1Inum...  112,592,066,779   
1  https://kworb.net/spotify/artist/06HL4z0CvFAxy...  102,222,940,841   
2  https://kworb.net/spotify/artist/4q3ewBCX7sLwd...   97,253,221,301   
3  https://kworb.net/spotify/artist/1Xyo4u8uXC1Zm...   79,246,015,185   
4  https://kworb.net/spotify/artist/1uNFoZAHBGtll...   59,964,949,358   

  streams_as_lead    solo_streams featured_streams daily_streams tracks_total  \
0  76,886,517,261  42,430,139,565   35,705,549,518    50,091,662          513   
1  98,993,346,012  90,320,824,607    3,229,594,829    49,907,494          593   
2  61,472,589,366  34,896,852,568   35,780,631,935    63,464,912          269   
3  63,772,467,221  42,714,560,989   15,473,547,964    44,626,887          319   
4  36,504,644,580  22,040,079,971   23,460,304,778    22,204,379          289   

  tracks_as_lead tracks_solo tracks_as_feature  
0          
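
The printed values above are strings with thousands separators, so they need converting before any arithmetic. This is a minimal sketch of one way to do that, assuming a small hand-built DataFrame (`sample_df`) with two rows copied from the output above; the nullable `Int64` dtype is a choice made for this sketch so that missing values would not force the columns to floats.

```python
import pandas as pd

# Two rows copied from the printed output; only three columns for brevity
sample_df = pd.DataFrame({
    "artist_url": [
        "https://kworb.net/spotify/artist/3TVXtAsR1Inumwj472S9r4_songs.html",
        "https://kworb.net/spotify/artist/06HL4z0CvFAxyc27GXpf02_songs.html",
    ],
    "total_streams": ["112,592,066,779", "102,222,940,841"],
    "tracks_total": ["513", "593"],
})

cleaned_df = sample_df.copy()
numeric_cols = [col for col in cleaned_df.columns if col != "artist_url"]

for col in numeric_cols:
    # Strip the thousands separators, then convert; unparseable values become <NA>
    without_commas = cleaned_df[col].str.replace(",", "", regex=False)
    cleaned_df[col] = pd.to_numeric(without_commas, errors="coerce").astype("Int64")

print(cleaned_df.dtypes)
```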

Next, we return to the main artists page to collect the artist names and combine them with the scraped data, so that each row can be identified by name rather than by URL. We read the page's HTML with pandas, take the first table, and keep only the Artist column, since the rest of that table's data has already been collected above.

In [34]:
# Read the HTML from the URL
artist_name_data = pd.read_html('https://kworb.net/spotify/artists.html')

# Print the number of tables found in the HTML file
print(f'Total tables: {len(artist_name_data)}')

# Select the first table
artist_name_df = artist_name_data[0]

# Extract only the 'Artist' column
artist_names_df = artist_name_df[['Artist']]

# Display the first few rows to check the artists names
artist_names_df.head()

Total tables: 1


Unnamed: 0,Artist
0,Drake
1,Taylor Swift
2,Bad Bunny
3,The Weeknd
4,Justin Bieber


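We then combine the two DataFrames side by side, with the artist names on the left, and drop the URL column because it is no longer needed. Note that pd.concat with axis=1 lines rows up by index rather than by a shared key, so this step relies on both DataFrames being in the same order with no skipped artists. After checking the data, we save the new DataFrame as a CSV file (kworb_spotify_top_artists.csv).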

In [40]:
merged_df = pd.concat([artist_names_df, scraped_df], axis=1)
merged_df.head()

Unnamed: 0,Artist,artist_url,total_streams,streams_as_lead,solo_streams,featured_streams,daily_streams,tracks_total,tracks_as_lead,tracks_solo,tracks_as_feature
0,Drake,https://kworb.net/spotify/artist/3TVXtAsR1Inum...,112592066779,76886517261,42430139565,35705549518,50091662,513,314,197,199
1,Taylor Swift,https://kworb.net/spotify/artist/06HL4z0CvFAxy...,102222940841,98993346012,90320824607,3229594829,49907494,593,579,519,14
2,Bad Bunny,https://kworb.net/spotify/artist/4q3ewBCX7sLwd...,97253221301,61472589366,34896852568,35780631935,63464912,269,144,87,125
3,The Weeknd,https://kworb.net/spotify/artist/1Xyo4u8uXC1Zm...,79246015185,63772467221,42714560989,15473547964,44626887,319,241,169,78
4,Justin Bieber,https://kworb.net/spotify/artist/1uNFoZAHBGtll...,59964949358,36504644580,22040079971,23460304778,22204379,289,206,111,83
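
If any artist page had failed to scrape, a positional concatenation would shift every later row onto the wrong name. One way to guard against that is to join on a shared key instead. The following is a minimal sketch, assuming the names and URLs had been collected together into a hypothetical `names_df` (the notebook itself collects names separately with `pd.read_html`); the second artist is deliberately left out of `scraped_sample` to simulate a failed scrape.

```python
import pandas as pd

# Hypothetical names-plus-URLs table; values copied from the outputs above
names_df = pd.DataFrame({
    "Artist": ["Drake", "Taylor Swift", "Bad Bunny"],
    "artist_url": [
        "https://kworb.net/spotify/artist/3TVXtAsR1Inumwj472S9r4_songs.html",
        "https://kworb.net/spotify/artist/06HL4z0CvFAxyc27GXpf02_songs.html",
        "https://kworb.net/spotify/artist/4q3ewBCX7sLwd24euuV69X_songs.html",
    ],
})

# Scraped data with one artist missing, as if that page had failed
scraped_sample = pd.DataFrame({
    "artist_url": [
        "https://kworb.net/spotify/artist/3TVXtAsR1Inumwj472S9r4_songs.html",
        "https://kworb.net/spotify/artist/4q3ewBCX7sLwd24euuV69X_songs.html",
    ],
    "tracks_total": [513, 269],
})

# A left join keeps every artist; validate raises if a URL appears twice
safe_merged = names_df.merge(
    scraped_sample, on="artist_url", how="left", validate="one_to_one"
)
print(safe_merged[["Artist", "tracks_total"]])
```

In this sketch the missing artist ends up with an empty `tracks_total` instead of inheriting another artist's numbers.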


In [42]:
merged_df.drop(columns=['artist_url'], inplace=True)

In [44]:
merged_df.head()
merged_df.shape

(3000, 10)

In [46]:
merged_df.to_csv('../../Data/Raw/kworb_spotify_top_artists.csv', index=False)