## Extracting edges from song releases

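The code below extracts artist collaborations from a Discogs releases dump.

Before parsing, it strips troublesome control characters, escapes stray ampersands, and wraps the content in a root element so that iterparse sees a single well-formed document.

A release contributes edges only if it lists at least two artists.

Copy the Discogs dump files into the project (the code reads from ./data_raw/) and change the path as needed.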
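To make the cleanup step concrete, here is a minimal sketch on a toy string rather than a real dump. The `sanitize` helper and the sample data are hypothetical. The sketch also lets hexadecimal references such as `&#x26;` pass through unescaped, which is an assumption about what a dump may contain.

```python
import io
import re
import xml.etree.ElementTree as ET

# Toy stand-in for a dump: two top-level elements, one control character, one bare "&".
sample = (
    '<release id="1"><title>Drum \x07& Bass</title></release>\n'
    '<release id="2"><title>A &amp; B</title></release>'
)

def sanitize(raw: str) -> str:
    """Hypothetical helper mirroring the cleanup described above."""
    # Drop control characters that XML 1.0 forbids (tab, LF and CR are kept).
    raw = re.sub(r"[\x00-\x08\x0B\x0C\x0E-\x1F]", "", raw)
    # Escape any "&" that does not start a predefined entity or numeric reference.
    raw = re.sub(r"&(?!(?:amp|lt|gt|apos|quot|#\d+|#x[0-9A-Fa-f]+);)", "&amp;", raw)
    # Wrap in a single root so the parser sees one well-formed document.
    return "<root>\n" + raw + "\n</root>"

titles = [
    elem.findtext("title")
    for _, elem in ET.iterparse(io.StringIO(sanitize(sample)), events=("end",))
    if elem.tag == "release"
]
print(titles)  # ['Drum & Bass', 'A & B']
```

Without the wrapping root, the second top-level `<release>` would make the parser stop with an error, since an XML document may have only one root element.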

In [1]:
import xml.etree.ElementTree as ET
import io
import re

edges = []

# with open("./data_raw/discogs_20080309_releases.xml", "r", encoding="utf-8") as file:
#     raw = file.read()

with open("./data_raw/discogs_20100101_releases.xml", "r", encoding="utf-8") as file:
    raw = file.read()

# strip control characters that XML 1.0 forbids (tab, LF and CR are kept)
raw = re.sub(r"[\x00-\x08\x0B\x0C\x0E-\x1F]", "", raw)
# escape stray ampersands, leaving predefined entities and numeric (decimal or hex) references intact
safe = re.sub(r"&(?!(?:amp|lt|gt|apos|quot|#\d+|#x[0-9A-Fa-f]+);)", "&amp;", raw)

# wrap everything in a single root element so iterparse sees one well-formed document
wrapped_source = "<root>\n" + safe + "\n</root>"

for event, elem in ET.iterparse(io.StringIO(wrapped_source), events=("end",)):
    # handle each release once its closing tag has been parsed
    if elem.tag == "release":
        # identify each credited artist by id, falling back to the name when no id is present
        artists = [
            a.findtext("id") or a.findtext("name")
            for a in elem.findall("./artists/artist")
        ]

        # drop missing values (the dump is old, so some entries are incomplete)
        artists = [a for a in artists if a]
        # generate one edge per pair of artists; only collaborations are of interest
        if len(artists) >= 2:
            for i in range(len(artists)):
                for j in range(i+1, len(artists)):
                    edges.append((artists[i], artists[j]))

        elem.clear()


In [2]:
print(len(edges))
for u, v in edges[:14]:
    print(f"{u} <-> {v}")

233217
DJ Romain <-> Danny Krivit
Robert Rich <-> Lustmord
Kings Of Tomorrow <-> Soul Vision
Josh Wink <-> Lil' Louis
Critical Point <-> Vikter Duplaix
Marshall Jefferson <-> Noosa Heads
Max Reich <-> Johannes Foufas
Pure Science <-> Mashupheadz
Miguel Migs <-> DJ Rasoul
Onionz <-> Joeski
Onionz <-> Master D
Joeski <-> Master D
J.T. Donaldson <-> Lance DeSardi
J.T. Donaldson <-> Chris Nazuka
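
The pair-generation step can also be written with `itertools.combinations`, shown here as a small sketch on toy data. The `collaboration_edges` helper is hypothetical, and the de-duplication of repeated credits is an addition that the loop above does not perform.

```python
from itertools import combinations

def collaboration_edges(artists):
    """Hypothetical helper: all unordered pairs from one release's artist list."""
    # Drop missing values and repeated credits while keeping first-seen order.
    unique = list(dict.fromkeys(a for a in artists if a))
    # combinations yields nothing for fewer than two artists, so no length check is needed.
    return list(combinations(unique, 2))

print(collaboration_edges(["Onionz", "Joeski", "Master D"]))
# [('Onionz', 'Joeski'), ('Onionz', 'Master D'), ('Joeski', 'Master D')]
print(collaboration_edges(["Robert Rich"]))
# []
```

A release with n distinct artists yields n(n-1)/2 edges, which matches the three Onionz / Joeski / Master D pairs in the output above.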


## Extracting artists and integrating the Spotify Web API (WIP)

Start by initializing the Spotify API client.

In [10]:
from dotenv import load_dotenv
from pathlib import Path
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

env_path = Path.cwd() / ".env"
load_dotenv(env_path)

client_id = os.environ.get("SPOTIPY_CLIENT_ID")
client_secret = os.environ.get("SPOTIPY_CLIENT_SECRET")

auth_manager = SpotifyClientCredentials(
    client_id=client_id,
    client_secret=client_secret
)
sp = spotipy.Spotify(auth_manager=auth_manager)
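
Because `os.environ.get` returns `None` when a variable is missing, a wrong `.env` path only shows up later as an authentication error. One possible guard, sketched with a hypothetical `require_env` helper, is to fail early with a clear message:

```python
import os

def require_env(name: str) -> str:
    """Hypothetical helper: return an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"{name} is not set; check the .env file")
    return value
```

It could replace the two lookups above, for example `client_id = require_env("SPOTIPY_CLIENT_ID")`.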


## Searching for artists

In [14]:
results = sp.search(q="artist:Robert Rich", type="artist", limit=1, market="US")
items = results.get("artists", {}).get("items", [])

print(results)
print(items)




{'artists': {'href': 'https://api.spotify.com/v1/search?offset=0&limit=1&query=artist%3ARobert%20Rich&type=artist&market=US', 'limit': 1, 'next': 'https://api.spotify.com/v1/search?offset=1&limit=1&query=artist%3ARobert%20Rich&type=artist&market=US', 'offset': 0, 'previous': None, 'total': 76, 'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/3ux92I3CgfnhgLyYNsXIwZ'}, 'followers': {'href': None, 'total': 26658}, 'genres': ['space music', 'ambient', 'dark ambient', 'drone', 'new age'], 'href': 'https://api.spotify.com/v1/artists/3ux92I3CgfnhgLyYNsXIwZ', 'id': '3ux92I3CgfnhgLyYNsXIwZ', 'images': [{'url': 'https://i.scdn.co/image/5971536a720ea99cf4317412239c7a565acd1010', 'height': 1000, 'width': 1000}, {'url': 'https://i.scdn.co/image/9071461f1db1621cbf6a0ec0f9bd9c113f5861e4', 'height': 640, 'width': 640}, {'url': 'https://i.scdn.co/image/9e3cedea941c4afea30d99b1c19f1d346fc37796', 'height': 200, 'width': 200}, {'url': 'https://i.scdn.co/image/1f07b19a73f39ed239ef73