# Data Processing Pipeline

This notebook contains the full data processing pipeline for the project.

The goal is to build a clean collaboration network from the Discogs release XML files with:

- Nodes = artists (Discogs artist IDs, or their names when no ID is present)
- Edges = collaborations between the artists
- Edge attributes = release_year, which is what we will use for the train/test split
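
To illustrate why release_year matters for the split, here is a minimal sketch of a time-based split on a toy edge list. The cutoff year and the names `toy_edges` and `CUTOFF_YEAR` are assumptions made for illustration only; the project's actual split is not defined in this notebook.

```python
import pandas as pd

# Toy edge list with the same columns as discogs_edges.parquet
toy_edges = pd.DataFrame({
    "source_id": ["1", "2", "3", "4"],
    "target_id": ["5", "6", "7", "8"],
    "release_year": [1999, 2005, 2016, 2020],
})

CUTOFF_YEAR = 2015  # hypothetical cutoff, not a value chosen in this project

# Edges up to and including the cutoff form the training graph;
# later edges are held out as the links to predict.
train_edges = toy_edges[toy_edges["release_year"] <= CUTOFF_YEAR]
test_edges = toy_edges[toy_edges["release_year"] > CUTOFF_YEAR]

print(len(train_edges), len(test_edges))
```

Splitting on time rather than at random keeps future collaborations out of the training graph.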

For now, the data folders are organized as follows:
- data/final_data_raw: the Discogs '*releases.xml' files, which are the input to the processing steps
- data/final_data_processed: the processed Parquet files, which hold the cleaned data

The outputs of the processing steps are:
- 'discogs_edges.parquet' with columns: source_id, target_id, release_year
- 'discogs_artists.parquet' with columns: discogs_artist_id

Spotify metadata is optional at this stage. If we use it at all, it will be for a subset of artists only.


## Imports and Paths

In [2]:
# Note: Handling Imports
import os
import glob
import io
import re
import xml.etree.ElementTree as ET
import pandas as pd
import networkx as nx

# Note: The following will be our directories, as described earlier
RAW_DATA = '../data/final_data_raw'
PROCESSED_DATA = '../data/final_data_processed'

## Parsing the XML file

For each release in the XML:
- We will extract the released field and pull out the 4-digit year
- We will extract all artists under './artists/artist'
    - We will use the id if present, otherwise we fall back to the name of the artist
- We will only keep the releases with at least 2 artists, so that there is some collaboration data
- For each unordered pair of artists u, v on the same release, we will create an edge with source_id = min(u,v), target_id = max(u,v), and the release year

Each artist on each edge will also be added to the nodes set
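
The steps above can be sketched for a single release as follows. This is a minimal illustration, not the cell used in this notebook: the helper name `release_to_edges` and the sample XML are hypothetical, deduplicating artists is an added assumption, and the IDs are compared as strings (as parsed from the XML), so min and max here mean lexicographic order.

```python
import re
import xml.etree.ElementTree as ET
from itertools import combinations

YEAR_RE = re.compile(r"(\d{4})")

def release_to_edges(release):
    """Return (source_id, target_id, release_year) tuples for one <release> element."""
    # Pull the first 4-digit run out of the <released> text, if any
    match = YEAR_RE.search(release.findtext("released") or "")
    year = int(match.group(1)) if match else None

    # Prefer the Discogs ID, fall back to the name, drop empty entries
    raw = [a.findtext("id") or a.findtext("name")
           for a in release.findall("./artists/artist")]
    # Assumption of this sketch: deduplicate and sort (string order)
    artists = sorted({a for a in raw if a})

    # combinations() over a sorted list yields each unordered pair once,
    # with the smaller value first
    return [(u, v, year) for u, v in combinations(artists, 2)]

# Hypothetical sample release for illustration
sample = ET.fromstring(
    "<release><released>1999-03-01</released><artists>"
    "<artist><id>92</id><name>A</name></artist>"
    "<artist><id>17757</id><name>B</name></artist>"
    "</artists></release>"
)
print(release_to_edges(sample))
```

Casting numeric IDs to int before sorting would give numeric rather than lexicographic ordering, but that fails for artists that only have a name.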

In [4]:
# Note: The downloaded Discogs release dump is stored in final_data_raw
# Note: The path is ../data/final_data_raw/discogs_20251101_releases.xml
xml_file = '../data/final_data_raw/discogs_20251101_releases.xml'
# Note: First step is to initialize the containers
edges, nodes = [], set()

# Note: To get the release year from the xml file, we will use the following pattern
for_year = re.compile(r'(\d{4})') # Note: Since yr will be in 4 digits

# Note: The raw XML file is currently 10GB+, so we stream it with iterparse instead of loading it all into memory
for event, elem in ET.iterparse(xml_file, events=('end',)):
    if elem.tag == 'release':

        # Note: First step is to extract the release year
        release_year = None
        released_elem = elem.find('released')
        if released_elem is not None and released_elem.text:
            yr = for_year.search(released_elem.text)
            if yr:
                release_year = int(yr.group(1))
        # Note: Next we get all the artists on the release (ID if present, otherwise the name)
        artists = [ar.findtext('id') or ar.findtext('name') for ar in elem.findall('./artists/artist')]
        artists = [ar for ar in artists if ar]  # Note: Drop artists with neither an ID nor a name

        # Note: After extracting the artists, we will create the nodes and the edges
        if len(artists) >= 2:
            for i in range(len(artists)):
                nodes.add(artists[i])
                for j in range(i+1, len(artists)):
                    edges.append((artists[i], artists[j], release_year)) # Note: Release year is the edge attribute

        # Note: This is the main step: elem.clear() frees the memory held by the processed part of the XML
        elem.clear()

# Note: After all that is done, print len of nodes and edges to check what we have
print(f'Nodes: {len(nodes)}, Edges: {len(edges)}')

# Reference: https://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files
# Reference: https://lxml.de/parsing.html#incremental-event-parsing
# Reference: https://boscoh.com/programming/reading-xml-serially.html
# Reference: https://stackoverflow.com/questions/324214/what-is-the-fastest-way-to-parse-large-xml-docs-in-python


## Convert to DataFrame

Now that we have parsed the XML file and have the edges (source_id, target_id, release year) and the nodes, we can convert them into pandas DataFrames. This will help us inspect the structure, run analysis, and save the data efficiently.

We should have:
- Millions of edges
- Hundreds of thousands to millions of artists

Please note that this step does not modify the data; it only structures it for later processing

In [6]:
# Note: First convert the edges list to DataFrame
edges_df = pd.DataFrame(edges, columns=['source_id', 'target_id', 'release_year'])
# Note: Next we convert the nodes set to DF
artists_df = pd.DataFrame({'discogs_artist_id': list(nodes)})
print(edges_df.head())
print(artists_df.head())
print(f'{len(edges_df)} Edges, {len(artists_df)} Nodes(Artists)')

  source_id target_id  release_year
0        92     17757        1999.0
1     33257      3482        2000.0
2        96        95        1999.0
3     12007    583687        1995.0
4       645      5823        2000.0
  discogs_artist_id
0           9134149
1            403139
2           1983365
3           4570205
4           3566446
7284440 Edges, 1045947 Nodes(Artists)


## Cleaning the Edges

For the link prediction step, each edge must have a valid release year. Some Discogs entries do not include a year, so the corresponding edges must be removed.

- We will remove edges with release_year = None
- This ensures the final dataset will be consistent for our time-based predictions

(Realistically, we are not expecting a big decrease in the number of edges)
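
As a quick check of how much the cleaning removes, here is a minimal sketch on a toy DataFrame. The names `toy`, `valid`, and `dropped_share` are hypothetical; the same comparison could be made on `edges_df` before and after the cleaning cell.

```python
import pandas as pd

# Toy edges: one missing year and one zero year
toy = pd.DataFrame({
    "source_id": ["1", "2", "3", "4"],
    "target_id": ["5", "6", "7", "8"],
    "release_year": [1999.0, None, 0.0, 2004.0],
})

# A year counts as valid when it is present and non-zero
valid = toy["release_year"].notna() & (toy["release_year"] != 0)
cleaned = toy[valid].copy()
cleaned["release_year"] = cleaned["release_year"].astype(int)

# Share of edges removed by the cleaning step
dropped_share = 1 - len(cleaned) / len(toy)
print(f"Dropped {dropped_share:.0%} of edges")
```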

In [10]:
# Note: As mentioned, removing edges with no release year
edges_df = edges_df.dropna(subset=['release_year'])
# Note: Some rows also have 0.0 as release_year, so we drop those too
# Note: .copy() avoids a SettingWithCopyWarning on the assignment below
edges_df = edges_df[edges_df['release_year'] != 0.0].copy()
edges_df['release_year'] = edges_df['release_year'].astype(int)

print(f'Remaining edges: {len(edges_df)}')

Remaining edges: 5673764


## Saving the processed Parquet Files

Now that the data is parsed and cleaned, we will save it inside the data/final_data_processed folder

We will save two files:
- discogs_edges.parquet: Columns will be source_id, target_id, release_year
- discogs_artists.parquet: Columns will be discogs_artist_id

We use the Parquet format because it is highly compressed, fast to load, and well suited to large datasets with millions of rows

In [11]:
# Note: Defining our paths for output
edges_output = os.path.join(PROCESSED_DATA, 'discogs_edges.parquet')
artists_output = os.path.join(PROCESSED_DATA, 'discogs_artists.parquet')

# Note: Saving our dataframes
edges_df.to_parquet(edges_output, index=False)
artists_df.to_parquet(artists_output, index = False)

# Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html

## Spotify Discussion

After cleaning, we already have over 1 million nodes and over 5 million edges, so this dataset is already massive.

Link prediction relies on the edges and the graph structure, and we have all the rows we need for that (millions, in fact). Spotify metadata would mostly be extra decoration: it would not improve most of our graph-based link prediction methods. We are therefore putting it on hold, and if everything works well with what we have, we will not need it. Fetching Spotify data for all of these artists is also impractical. If it turns out to be relevant later, we will fetch it for the top N most frequent artists only. For now, this concludes our data processing.
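
A minimal sketch of how the top N most frequent artists could be selected from the edge list. `TOP_N` and the toy data are hypothetical, and "frequent" is assumed here to mean the number of edges an artist appears in.

```python
import pandas as pd

# Toy edge list with the same columns as discogs_edges.parquet
toy_edges = pd.DataFrame({
    "source_id": ["1", "1", "1", "1", "2", "2"],
    "target_id": ["2", "3", "4", "5", "3", "4"],
    "release_year": [1999, 2001, 2003, 2005, 2007, 2010],
})

TOP_N = 2  # hypothetical value

# Count how often each artist appears at either end of an edge
appearances = pd.concat([toy_edges["source_id"], toy_edges["target_id"]])
top_artists = appearances.value_counts().head(TOP_N).index.tolist()
print(top_artists)
```

Only the artists in `top_artists` would then need a Spotify lookup, which keeps the number of API calls small.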

## General Link Prediction with Non-Graphical Measures

Here, we will attempt to predict the same edges with a method that does not use graph-based measures. If our hypothesis is correct, this should yield a worse outcome than the graph-based approach.

In [None]:
df_artistsID = pd.read_parquet("../data/final_data_processed/discogs_artists.parquet")

print(df_artistsID.head())

# read the artists xml file so we can use the artist IDs from the parquet file to get artist names
rows = []
for event, elem in ET.iterparse("../data_raw/discogs_20251101_artists.xml", events=("end",)):
    if elem.tag == "artist":   # repeating entry tag for Discogs artists
        row = {child.tag: child.text for child in elem}

        rows.append(row)

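        # free memory
        elem.clear()
        # getparent()/getprevious() only exist on lxml elements; with the
        # standard-library ElementTree imported above, parent is always None
        parent = elem.getparent() if hasattr(elem, "getparent") else None
        if parent is not None:
            while parent.getprevious() is not None:
                del parent[0]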

df_artistsNames = pd.DataFrame(rows)
print(df_artistsNames.head(-5))

  discogs_artist_id
0           9134149
1            403139
2           1983365
3           4570205
4           3566446
               id                    name         realname  \
0               1           The Persuader  Jesper Dahlbäck   
1               2  Mr. James Barth & A.D.              NaN   
2               3               Josh Wink   Josh Winkelman   
3               4           Johannes Heil    Johannes Heil   
4               5              Heiko Laux       Heiko Laux   
...           ...                     ...              ...   
9798342  16856335             Izumi Kohki              NaN   
9798343  16856338                  Kate08        Kate Webb   
9798344  16856341   The Evil B-Side Twins              NaN   
9798345  16856347             Carol Lundy              NaN   
9798346  16856350            문선 (Moonsun)              NaN   

                                                   profile  \
0        Electronic artist working out of Stockholm, ac...   
1          

In [34]:
names = []

for name in df_artistsID['discogs_artist_id'][:1000]:
    artist_name = df_artistsNames.loc[df_artistsNames['id'] == name, 'name'].values
    if len(artist_name) > 0:
        # print(artist_name[0])
        names.append(artist_name[0])
    else:
        print("Name not found") 
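
The loop above scans the full names table once per ID. As a hedged alternative, the same lookup can be expressed as a single left join. The toy frames `ids` and `names_table` below are stand-ins for `df_artistsID` and `df_artistsNames`, and the sketch assumes both ID columns hold strings.

```python
import pandas as pd

# Toy stand-ins for df_artistsID and df_artistsNames
ids = pd.DataFrame({"discogs_artist_id": ["3", "1", "999"]})
names_table = pd.DataFrame({
    "id": ["1", "2", "3"],
    "name": ["The Persuader", "Mr. James Barth & A.D.", "Josh Wink"],
})

# A left join keeps every artist from the ID table; unmatched IDs get NaN names
merged = ids.merge(
    names_table[["id", "name"]],
    left_on="discogs_artist_id",
    right_on="id",
    how="left",
)

found = merged["name"].dropna().tolist()
missing = merged.loc[merged["name"].isna(), "discogs_artist_id"].tolist()
print(found, missing)
```

Nodes that were stored by name rather than by ID during parsing would show up in `missing` under this approach.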



In [None]:
from dotenv import load_dotenv
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

load_dotenv()  # loads .env into environment

client_id = os.getenv("SPOTIPY_CLIENT_ID")
client_secret = os.getenv("SPOTIPY_CLIENT_SECRET")

auth_manager = SpotifyClientCredentials(
    client_id=client_id,
    client_secret=client_secret
)

sp = spotipy.Spotify(auth_manager=auth_manager)

# testing to see if the API call works
artist = sp.artist("1uNFoZAHBGtllmzznpCI3s")  # Justin Bieber ID
print(artist["name"])



Justin Bieber


In [None]:
# Search Spotify for an artist by name and return the top match (or None)
def get_artist(artist, market):
    results = sp.search(q=f"artist:{artist}", type="artist", limit=1, market=market)
    items = results.get("artists", {}).get("items", [])

    return items[0] if items else None


def get_artist_data(artist_name, market):
    artist = get_artist(artist_name, market)
    if artist:
        return {
                "spotify_id": artist["id"],
                "spotify_name": artist["name"],
                "genres": artist.get("genres", []),
                "popularity": artist.get("popularity", None),
                "followers": artist.get("followers", {}).get("total", None),
            }
    else:
        return None
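
A minimal sketch of how a function like `get_artist_data` could be applied to a list of names. The helper `enrich_artists`, its `pause_seconds` parameter, and the stand-in `fake_fetch` are hypothetical; the fetch function is passed in so the sketch runs without Spotify credentials.

```python
import time

def enrich_artists(names, fetch, pause_seconds=0.0):
    """Call fetch(name) for each name and separate hits from misses."""
    enriched, misses = {}, []
    for name in names:
        data = fetch(name)
        if data is None:
            misses.append(name)
        else:
            enriched[name] = data
        time.sleep(pause_seconds)  # crude spacing between requests
    return enriched, misses

# Stand-in for get_artist_data, used only so this sketch is self-contained
def fake_fetch(name):
    known = {"Josh Wink": {"spotify_name": "Josh Wink", "popularity": None}}
    return known.get(name)

enriched, misses = enrich_artists(["Josh Wink", "Unknown Artist"], fake_fetch)
print(enriched, misses)
```

With the real client, `fetch` could be a small wrapper such as `lambda n: get_artist_data(n, market)` with a market code of your choice.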
    