The ListenBrainz data is distributed in 12 files, one per month of 2022. The data is in the JSON Lines format, which has one JSON document per line, **each representing a single instance of a person listening to a song.**

Data in ListenBrainz is referenced by an identifier called a “MessyBrainz ID” (`recording_msid`). This ID is random, and the same song can have several different MessyBrainz IDs depending on how it was submitted to ListenBrainz. To be able to compare items with each other, we want to convert each ID to data that is available in the MusicBrainz database.


To do this we provide three additional pieces of data:
1. The ListenBrainz Mapping, which provides a link between **MessyBrainz IDs** and a **MusicBrainz Recording ID** (listenbrainz_msid_mapping.csv).
2. The ListenBrainz Canonical Metadata database, which provides **additional MusicBrainz data for a given Recording ID**. This includes an artist credit, which is a list of MusicBrainz Artist IDs. Note that some songs are performed by multiple artists, so MusicBrainz includes the IDs of all artists involved in the performance (canonical_musicbrainz_data.csv in the metadata-dump folder).
3. Artist MBID to name mapping, which gives the name of all artists in the MusicBrainz database (musicbrainz_artist_mbid_name.csv).

# Data pre-processing


We need to convert the ListenBrainz data dump to a matrix containing user ids, artist ids, and play counts (how many times someone listened to a given artist in 2022). Write a script to do the following:

1. Read the jsonlines data
2. For each item, extract user information and the MessyBrainz id
3. Look up a recording MusicBrainz id in the mapping. Note that not all items in the dataset have a mapping available, so you will need to skip these items.
4. Look up an artist for the recording. If a recording has multiple artists, you can choose what to do:
- Take only the first artist in the list
- Add multiple entries, one for each artist in the list
- Treat the artist credit as a single entity (in this case you can use the “artist credit name” from the metadata file as your mapping in the next step)
5. Build a mapping of artist id to a textual name so that you can use the name to show results

You may wish to process only a part of this dataset during your initial development and then process the full dataset once you are ready to evaluate it.
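The five steps above can be sketched end to end on a toy example. This is only a minimal illustration under stated assumptions: the listens, the lookup dictionaries, and the helper name `count_plays` are all hypothetical, and a real run would build the dictionaries from the CSV files instead. It uses the "first artist only" option by default and can also add one entry per artist.

```python
import io
import json
from collections import Counter

# Hypothetical listens in JSON Lines form (one JSON document per line).
listens_jsonl = io.StringIO(
    '{"user_id": 1, "user_name": "alice", "recording_msid": "msid-a"}\n'
    '{"user_id": 1, "user_name": "alice", "recording_msid": "msid-b"}\n'
    '{"user_id": 2, "user_name": "bob", "recording_msid": "msid-a"}\n'
    '{"user_id": 2, "user_name": "bob", "recording_msid": "msid-unmapped"}\n'
)

# Hypothetical lookup tables; in practice these come from the three CSV files.
msid_to_mbid = {"msid-a": "rec-1", "msid-b": "rec-2"}
mbid_to_artists = {"rec-1": ["artist-x"], "rec-2": ["artist-x", "artist-y"]}
artist_names = {"artist-x": "Artist X", "artist-y": "Artist Y"}


def count_plays(lines, msid_to_mbid, mbid_to_artists, first_artist_only=True):
    """Count listens per (user_id, artist_id) pair, skipping unmapped items."""
    play_counts = Counter()
    for line in lines:
        listen = json.loads(line)                          # steps 1 and 2
        mbid = msid_to_mbid.get(listen["recording_msid"])  # step 3
        if mbid is None:
            continue                                       # no mapping: skip
        artists = mbid_to_artists.get(mbid, [])            # step 4
        if first_artist_only:
            artists = artists[:1]
        for artist_id in artists:
            play_counts[(listen["user_id"], artist_id)] += 1
    return play_counts


play_counts = count_plays(listens_jsonl, msid_to_mbid, mbid_to_artists)
for (user_id, artist_id), plays in sorted(play_counts.items()):
    # Step 5: use the name mapping only when displaying results.
    print(user_id, artist_names[artist_id], plays)
```

Passing `first_artist_only=False` switches to the "one entry per artist" option. Because the function accepts any iterable of lines, it can be given only the first few thousand lines of a file during development.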


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!head -5 /content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz-2022/1.listens

{"user_id":17240,"user_name":"Winterbay","timestamp":1640995200,"track_metadata":{"artist_name":"Tiken Jah Fakoly","release_name":"Coup De Gueule","additional_info":{"artist_msid":"f03fa31b-a428-47e2-8dcb-bceec9ca1221","release_msid":"238bd863-a293-4718-a66b-093ea54bf8f3","listening_from":"lastfm","recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204","lastfm_artist_mbid":"edef3cfa-4e5e-4d64-8bd8-20f9dc1d8cad","lastfm_release_mbid":"9dc7fe6a-3fa4-4461-8975-ecb7218b39a3"},"track_name":"Alou Maye"},"recording_msid":"f216e2fb-784d-470a-81b2-6e27cd532204"}
{"user_id":16930,"user_name":"kazoo","timestamp":1640995200,"track_metadata":{"artist_name":"Fraktus","release_name":"Millennium Edition","additional_info":{"artist_msid":"afe0b08d-f47d-4adf-bd78-3e95ca276f7c","tracknumber":11,"release_msid":"9cd2285c-4860-4e83-b8bb-5fef7328b224","recording_msid":"7a490913-f4d5-40a8-b34a-02166fbc511e"},"track_name":"Computerliebe"},"recording_msid":"7a490913-f4d5-40a8-b34a-02166fbc511e"}
{"user_id":1513

In [None]:
import json
import time
import csv

start=time.time()

user_ids=[]
user_names=[]
recording_msids=[]

with open("/content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz-2022/1.listens","r") as jsonl_file:
    for line in jsonl_file:
      # Load the line as a JSON object
      data = json.loads(line)
      # Extract the user_id value
      user_id = data["user_id"]
      user_name = data["user_name"]
      recording_msid=data["recording_msid"]
      # Add the user_id to the list
      user_ids.append(user_id)
      user_names.append(user_name)
      recording_msids.append(recording_msid)


# Write the three lists to a CSV file, one list per row
with open('/content/drive/MyDrive/Colab_Notebooks/AMPLAB/AMPLAB1/user_info_MBID.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(user_ids)
    writer.writerow(user_names)
    writer.writerow(recording_msids)
# Report the elapsed time in seconds
end=time.time()
t=end-start
print(t)

print(len(recording_msids))


90.77061295509338
4339352


Each month should contain around 4 million events, each one a user listening to a given song.

In [None]:
import pandas as pd
from tqdm import tqdm

# Create an empty set
mapping_set = set()

# Read the mapping file in smaller chunks
chunk_size = 10000
# usecols belongs to pd.read_csv (not tqdm), so only the two needed columns are loaded
for chunk in tqdm(pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz_msid_mapping.csv', chunksize=chunk_size, usecols=['recording_msid','recording_mbid'])):
    # Add the elements of this chunk to the existing set
    mapping_set.update(set(chunk['recording_msid']))

mapped_elements = set(recording_msids).intersection(mapping_set)

pd.DataFrame(list(mapped_elements)).to_csv('/content/drive/MyDrive/Colab_Notebooks/AMPLAB/mapping_result.csv', index=False, header=False)

7214it [03:12, 39.95it/s]

In [None]:
def remove_duplicates(filename, uniqueMSIDs):
  """Add every recording_msid found in a .listens file to the uniqueMSIDs set."""
  with open(filename, "r") as jsonl_file:
    for line in jsonl_file:
      data = json.loads(line)
      # Adding to a set has no effect when the element is already present
      uniqueMSIDs.add(data["recording_msid"])

  print(f"Finished processing file, with {len(uniqueMSIDs)} ids in total.")
  return uniqueMSIDs


## Wow. Quite a few MSIDs were repeated.

In [None]:
uniqueMSIDs = set()

for i in range(12):
  filename = '/content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz-2022/' + str(i+1) + '.listens'
  print(f"Processing file {i}..")
  uniqueMSIDs = remove_duplicates(filename, uniqueMSIDs)

Processing file 0..
Finished processing file, with 1889328 ids in total.
Processing file 1..
Finished processing file, with 3049386 ids in total.
Processing file 2..
Finished processing file, with 4065632 ids in total.
Processing file 3..
Finished processing file, with 5023256 ids in total.
Processing file 4..
Finished processing file, with 5958139 ids in total.
Processing file 5..
Finished processing file, with 6772333 ids in total.
Processing file 6..
Finished processing file, with 7547700 ids in total.
Processing file 7..


KeyboardInterrupt: ignored

WHAT NEEDS TO BE DONE

We have a file with events of users who listened to a given song over the whole year. We want to extract information so that we know the user_ids, the artist_ids and the play counts (that is, how many times a given user listened to a given artist).

First, we extract from the dump: user_name, user_id and the MSID identifier.

Second, we have a mapping file that links ListenBrainz identifiers to MusicBrainz identifiers. What I think we need to do is build a mapping so that, for each ListenBrainz ID we extracted, we get the corresponding MusicBrainz ID. This gives access to the other information in the files that contain MusicBrainz data.

Third, for each recording I need to look up an artist, since this information is in the MusicBrainz metadata.
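The second step, going from a ListenBrainz MSID to a MusicBrainz MBID, can be sketched as a dictionary lookup. This is a minimal sketch rather than a definitive implementation: the helper `load_msid_to_mbid` and its `wanted` parameter are hypothetical names, and the in-memory sample stands in for the real listenbrainz_msid_mapping.csv (the column names are taken from that file's header). Restricting the dictionary to MSIDs that actually occur in the listens is one way to keep it small, since the full mapping file has tens of millions of rows.

```python
import csv
import io

# A few rows in the layout of listenbrainz_msid_mapping.csv; a real run
# would open the file on disk instead of using this in-memory sample.
SAMPLE_MAPPING = (
    "recording_msid,recording_mbid,match_type\n"
    "00000737-3a59-4499-b30a-31fe2464555d,1fe669c9-5a2b-4dcb-9e95-77480d1e732e,exact_match\n"
    "000013b3-dbb4-43a0-8fd4-ca92ff5ed033,c5bfd98d-ccde-4cf3-8abb-63fad1b6065a,exact_match\n"
)


def load_msid_to_mbid(csv_file, wanted=None):
    """Build {recording_msid: recording_mbid}, optionally keeping only MSIDs in `wanted`."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        msid = row["recording_msid"]
        if wanted is None or msid in wanted:
            mapping[msid] = row["recording_mbid"]
    return mapping


msid_to_mbid = load_msid_to_mbid(io.StringIO(SAMPLE_MAPPING))

# dict.get returns None for listens without a mapping, which are skipped.
print(msid_to_mbid.get("00000737-3a59-4499-b30a-31fe2464555d"))
print(msid_to_mbid.get("msid-without-a-mapping"))
```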





In [None]:
!head -5 /content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz_msid_mapping.csv

recording_msid,recording_mbid,match_type
00000737-3a59-4499-b30a-31fe2464555d,1fe669c9-5a2b-4dcb-9e95-77480d1e732e,exact_match
000013b3-dbb4-43a0-8fd4-ca92ff5ed033,c5bfd98d-ccde-4cf3-8abb-63fad1b6065a,exact_match
00002714-6f74-409d-9fa4-441c8dfb195f,c6acc112-3df7-4716-b5b6-953b5e93743f,exact_match
00003a81-2a6c-4d6c-ad43-990c0806458b,007770e2-90c5-49f2-b894-690db7ebea40,exact_match


In [None]:
## Basically, I need to look up the MusicBrainz recording ID for each ListenBrainz ID and add it to the list.
## To do that I have to check whether it exists, and if it does, store it.
import time
import csv
from tqdm import tqdm
## I need to build lists as long as the earlier ones, so that each MSID sits next to its corresponding MBID.
msid_mapping=[]
mbid_mapping=[]
start=time.time()
with open("/content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/listenbrainz_msid_mapping.csv","r") as mapping:
  reader=csv.reader(mapping)
  # csv.reader can only be iterated once, so both columns are read in the same loop
  for line in tqdm(reader):
    msid_mapping.append(line[0])
    mbid_mapping.append(line[1])
end=time.time()

# Use a separate name so that the time module is not overwritten
elapsed=end-start
print(elapsed)

## The original version, with one loop per column, gave me an error when reading both columns but not when reading only one of them.

36167231it [01:15, 647936.14it/s]

In [None]:
## MSID AND MBID MAPPING



In [None]:
print(msid_mbid_mapping[0:4])

['recording_msid', '00000737-3a59-4499-b30a-31fe2464555d', '000013b3-dbb4-43a0-8fd4-ca92ff5ed033', '00002714-6f74-409d-9fa4-441c8dfb195f']


In [None]:
!head -20 /content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/metabrainz-metadata-dump-20230117-172210/metabrainz/canonical_musicbrainz_data.csv

##The ListenBrainz Canonical Metadata database, which provides **additional MusicBrainz data for a given Recording ID**. This includes an artist credit, which is a list of MusicBrainz Artist Ids. 
##Note that some songs could be performed by multiple artists, and so MusicBrainz includes IDs of
## all artists which are involved in the performance (canonical_musicbrainz_data.csv in the metadata-dump folder). THIS ONE IS USED, ONCE WE HAVE THE MUSICBRAINZ IDS, TO GET MORE INFORMATION ABOUT THEM, SUCH AS THE ARTISTS.

id,artist_credit_id,artist_mbids,artist_credit_name,release_mbid,release_name,recording_mbid,recording_name,combined_lookup,score,year
28939355,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,5f5f649f-1938-4a3e-a879-a95693a99a71,Heavenly Flow,michieoneheavenlyflow,371181,2006
28939356,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,6ee381af-e9b4-46c4-a4f0-541dce2f03ea,Party,michieoneparty,371181,2006
28939357,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,8b371ea0-dee1-4fbf-bacd-3aa87a4aef13,Free Like Jah,michieonefreelikejah,371181,2006
28939358,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-9ab6-ad50067303ee,Power of One,9d904f0f-314b-4089-b151-d81e0857c431,People,michieonepeople,371181,2006
28939359,1415161,{5e3071a8-8c56-4ab2-91f6-c76d35388dbd},Michie One,430bd180-0f13-4144-
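In the rows above, the `artist_mbids` column wraps the artist ID in curly braces. A small helper can turn that field into a Python list. This is a sketch under one assumption that the sample does not confirm: that a recording with several artists lists their IDs inside the same braces, separated by commas. The helper name `parse_artist_mbids` and the two-artist input are hypothetical.

```python
def parse_artist_mbids(field):
    """Split a '{mbid1,mbid2}' string into a list of artist MBIDs."""
    inner = field.strip().strip("{}")
    # Drop empty pieces so that '{}' gives an empty list.
    return [mbid.strip() for mbid in inner.split(",") if mbid.strip()]


# Single-artist value copied from the sample rows above.
single = parse_artist_mbids("{5e3071a8-8c56-4ab2-91f6-c76d35388dbd}")

# Hypothetical two-artist value, following the assumed comma-separated layout.
multiple = parse_artist_mbids("{artist-id-1,artist-id-2}")

first_artist = multiple[0]  # the "take only the first artist" option
print(single, multiple, first_artist)
```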

In [None]:
!head -20 /content/drive/MyDrive/Colab_Notebooks/AMPLAB/ListenBrainzData/musicbrainz_artist_mbid_name.csv

##Artist MBID to name mapping, which gives the name of all artists in the MusicBrainz database. (musicbrainz_artist_mbid_name.csv)
##Data pre-processing. THIS ONE IS USED TO GO FROM AN ARTIST ID TO THE ARTIST'S NAME.

mbid,name
fadeb38c-833f-40bc-9d8c-a6383b38b1be,Доктор Сатана
49add228-eac5-4de8-836c-d75cde7369c3,Pete Moutso
c112a400-af49-4665-8bba-741531d962a1,Zachary
ca3f3ee1-c4a7-4bac-a16a-0b888a396c6b,The Silhouettes
7b4a548e-a01a-49b7-82e7-b49efeb9732c,Aric Leavitt
60aca66f-e91a-4cb5-9308-b6e293cd833e,Fonograff
3e1bd546-d2a7-49cb-b38d-d70904a1d719,Al Street
df120895-f6c6-4a66-b9cf-73350f0beb61,Love .45
c14f8d3f-ee81-416f-800f-8eff7e77a2e1,Sintellect
b68a3969-319a-462f-942b-cd35581414fc,Evie Tamala
2c8ae2e0-3934-440e-81f5-2ec7fd0d7899,Jean-Pierre Martin
ac63d693-7b24-4258-a3db-09743b1b4269,Deejay One
4c4b7c6f-9285-4d6a-bc10-e5c9e08045f8,wecamewithbrokenteeth
055f435f-dba6-4156-9050-6ac41113e45f,The Blackbelt Band
ab1b631b-9896-4433-bef9-7868bf8a42f3,Giant Tomo
66de1369-f9eb-43cb-ae4f-88582a47a624,Elvin Jones & Jimmy Garrison Sextet
1fbb9556-b647-498a-a8ed-d3b5e8d7f85c,Tobias Lorsbach
e6895f6e-f636-4ff6-b406-f5ddaf6cb243,Diskobitch
4eee2c60-c2c8-4b33-b14f-0eed4bf4d11a,Seanews


# COLLABORATIVE FILTERING 
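
Before a model can be fitted on the ListenBrainz data, the (user, artist, play count) triples from pre-processing need integer row and column indices. The sketch below shows one possible way to assign them in plain Python; the function name `to_coo_triplets` and the toy play counts are hypothetical.

```python
def to_coo_triplets(play_counts):
    """Turn {(user_id, artist_id): plays} into index lists plus the ID-to-index maps."""
    user_index, artist_index = {}, {}
    rows, cols, values = [], [], []
    for (user_id, artist_id), plays in play_counts.items():
        # setdefault assigns the next free index the first time an ID is seen.
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(artist_index.setdefault(artist_id, len(artist_index)))
        values.append(plays)
    return rows, cols, values, user_index, artist_index


# Hypothetical play counts keyed by (user_id, artist MBID).
toy_counts = {
    (17240, "artist-x"): 3,
    (17240, "artist-y"): 1,
    (16930, "artist-x"): 5,
}
rows, cols, values, user_index, artist_index = to_coo_triplets(toy_counts)
print(rows, cols, values)
```

The three lists have the layout accepted by `scipy.sparse.coo_matrix((values, (rows, cols)))`, whose result can be converted with `.tocsr()` to a (user, item) matrix. Keeping `artist_index` makes it possible to map recommended column indices back to artist IDs and names.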

In [None]:
!pip install implicit

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting implicit
  Downloading implicit-0.6.2-cp38-cp38-manylinux2014_x86_64.whl (18.6 MB)
Installing collected packages: implicit
Successfully installed implicit-0.6.2


In [None]:
from implicit.datasets.lastfm import get_lastfm
from implicit.nearest_neighbours import bm25_weight
artists,users,artist_user_plays=get_lastfm()

# weight the matrix, both to reduce impact of users that have played the same artist thousands of times
# and to reduce the weight given to popular items
artist_user_plays = bm25_weight(artist_user_plays, K1=100, B=0.8)

# get the transpose since most of the functions in implicit expect (user, item) sparse matrices instead of (item, user)
user_plays = artist_user_plays.T.tocsr()

0.00B [00:00, ?B/s]

In [None]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
model.fit(user_plays)

  0%|          | 0/15 [00:00<?, ?it/s]

In [None]:
userid = 12345
ids, scores = model.recommend(userid, user_plays[userid], N=10, filter_already_liked_items=False)

In [None]:
print(ids,scores)

[277688 216218 241360 192493 270571  79684 204711 254178 225000  30373] [0.949351   0.9323065  0.9217777  0.90410143 0.9016966  0.8964411
 0.8950337  0.89447194 0.89329    0.8909682 ]


In [None]:
import numpy as np
import pandas as pd
pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.in1d(ids, user_plays[userid].indices)})

Unnamed: 0,artist,score,already_liked
0,von thronstahl,0.949351,True
1,puissance,0.932307,True
2,spiritual front,0.921778,False
3,mortiis,0.904101,True
4,triarii,0.901697,True
5,d-a-d,0.896441,True
6,ordo rosarius equilibrio,0.895034,False
7,the coffinshakers,0.894472,True
8,rome,0.89329,True
9,arditi,0.890968,True


In [None]:
# get related items for the beatles (itemid = 252512)
ids, scores= model.similar_items(252512)

# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[ids], "score": scores})

Unnamed: 0,artist,score
0,the beatles,1.0
1,john lennon,0.90953
2,the who,0.863686
3,the beach boys,0.857818
4,the rolling stones,0.853878
5,paul mccartney,0.841989
6,bob dylan,0.838834
7,george harrison,0.838693
8,the kinks,0.828791
9,simon & garfunkel,0.825589
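
One common way to score item similarity from latent factors is cosine similarity, under which an item always scores 1.0 against itself, as the first row above does. The sketch below illustrates that idea in plain Python. The factor vectors and artist labels are made up, and this is not a description of how `implicit` computes `similar_items` internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name, item_factors, n=3):
    """Rank all items by cosine similarity to the item called `name`."""
    target = item_factors[name]
    scored = [(other, cosine_similarity(target, vec)) for other, vec in item_factors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical 2-dimensional latent factors for three artists.
item_factors = {
    "artist-a": [1.0, 0.0],
    "artist-b": [0.8, 0.6],
    "artist-c": [0.0, 1.0],
}
ranking = most_similar("artist-a", item_factors)
for artist, score in ranking:
    print(artist, round(score, 3))
```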
