# MusicMate

### Amazon Music Reviews

### Pitchfork Reviews
- https://www.kaggle.com/datasets/nolanbconaway/pitchfork-data

### Spotify

Kaggle datasets have:

🎤 Artist: The name of the artist who performed the song.

🎵 Song: The title of the song.

⏱️ Duration (ms): The length of the song in milliseconds.

🔞 Explicit: Indicates whether the song contains explicit content (True/False).

📅 Year: The release year of the song.

📈 Popularity: A score reflecting the song's popularity.

🕺 Danceability: A measure of how suitable the song is for dancing.

⚡ Energy: A measure of the song's intensity and activity.

🎼 Key: The musical key in which the song is composed.

🔊 Loudness: The overall loudness of the song in decibels.

🎚️ Mode: Indicates the modality (major or minor) of the song.

🗣️ Speechiness: The presence of spoken words in the track.

🎸 Acousticness: A measure of the acoustic sound of the song.

🎹 Instrumentalness: Predicts whether the track contains no vocals.

🎤 Liveness: The probability that the track was performed live.

😊 Valence: The musical positiveness conveyed by the song.

🎧 Tempo: The tempo of the song in beats per minute (BPM).

🎶 Genre: The genre(s) of the song.


#### Spotify Developer API
- https://developer.spotify.com/documentation/web-api
- Artist:
    - In: artist IDs (not names). Can look up one or multiple
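        - The artist endpoints themselves cannot look up an ID by name, but the separate Search endpoint (`type=artist`) returns matching artists with their IDs, so this need not be a deal-breaker.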
    - Out: genres, images, popularity
    - Separate APIs: artist albums, artist tracks
- Album:
    - Artists
    - Tracks
    - Genres
    - Images
    - Popularity  
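
A minimal sketch of how an artist name could be resolved to an ID through the Web API's search endpoint. The helper name `artist_search_request` and the token placeholder are made up for illustration, and the request is only built here, not sent. A real call needs a valid OAuth access token.

```python
import urllib.parse
import urllib.request

SEARCH_URL = 'https://api.spotify.com/v1/search'

def artist_search_request(name: str, token: str, limit: int = 1) -> urllib.request.Request:
    """Build (but do not send) a search request for artists matching the given name."""
    # hypothetical helper, not part of any Spotify client library
    query = urllib.parse.urlencode({'q': name, 'type': 'artist', 'limit': limit})
    return urllib.request.Request(
        f'{SEARCH_URL}?{query}',
        headers = {'Authorization': f'Bearer {token}'},
    )

# 'YOUR_ACCESS_TOKEN' is a placeholder, not a working credential
request = artist_search_request('VNV Nation', token = 'YOUR_ACCESS_TOKEN')
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` should return JSON in which the matches are listed under `artists.items`, each carrying an `id` that can be passed to the artist endpoints above.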

#### Apple Music API
- https://developer.apple.com/documentation/applemusicapi
- Inputs: artist ID, region
    - How to get these based on name?
- Can specify one of several views. Most notably:
    - full albums
    - singles
- The documentation does not make clear what data is returned

#### Amazon API
- No apparent documentation for mining data from Amazon products + reviews
    - Dev site saturated with AWS and Alexa development

#### Amazon Reviews Datasets
- Full Score Dataset:
    - https://figshare.com/articles/dataset/Amazon_Reviews_Full/13232537/1?file=25483985 for description + details
    - https://s3.amazonaws.com/fast-ai-nlp/amazon_review_full_csv.tgz for raw download
    - Has ratings and reviews but no product info, so it is not useful for this project
    - Seminal paper: **Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text - Julian McAuley & Jure Leskovec - Stanford University - 2013**
        - 33M user reviews (how?!)
        - But not accessible
- Kaggle:
    - McAuley latest review data:
        - https://www.kaggle.com/datasets/wajahat1064/amazon-reviews-data-2023
            - Songs: 5M reviews | 700K products | 1.8M users
            - 29.5M books
            - 17.3M movies
            - Magazines: 71K reviews | 3.4K products | 60K users
        - Per-category review and product data!
        - Mixes Audiobook data with song data
        - Missing genre data
        - Artist is not spelled out, but can be extracted from 'store' metadata
    - **Book** reviews dataset:
        - https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
        - 3M reviews of 212K books with complete metadata. An excellent data set!


#### MusicBrainz
- No-fluff SQL database with thin web/API layer. Searching Scandroid leads to impressive and up-to-date results!
- Data dump: https://musicbrainz.org/doc/Development/JSON_Data_Dumps
- **CritiqueBrainz** data dump of user reviews!
    - 13K reviews by 5K users, according to home page. Still has a long way to go to be usable. 

#### Million Song Dataset
- millionsongdataset.com
- 300GB (!)
- Contains links to comprehensive lists of genres/artists
- Does not appear to be up-to-date. Cannot find most current (and not-so-current) artists: Tom Macdonald, VNV Nation



In [26]:
import csv
import dateutil
import json
import os
import re
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

pd.set_option('display.max_colwidth', 200)

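t = time.perf_counter()
def profile(message, timestamp = None):
    # Fall back to the current global t. A default argument of t would be
    # frozen when profile() is defined and inflate later measurements.
    if timestamp is None:
        timestamp = t
    elapsed = time.perf_counter() - timestamp
    print(f'{elapsed:.3f}:\t{message}')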

### ETL - Amazon Digital Music

In [29]:
t = time.perf_counter()

def parse_amazon_review(review: pd.Series) -> pd.Series:
    """
    Given an Amazon user review from Kaggle data, return a review in a normalized format.
    """
    # Review data schema:
    # - rating
    # - title
    # - text: user review text
    # - parent_asin: product id
    # - asin: redundant product id
    # - user_id
    # - helpful_vote: number of upvotes for this review
    # - timestamp
    # - images
    # - verified_purchase
    columns = ['rating', 'title', 'text', 'parent_asin', 'user_id', 'helpful_vote', 'timestamp']
    # intentionally dropped: images, verified_purchase, asin (using parent_asin instead)
    # copy so that adding 'id' does not write into a view of the input
    result = review[columns].copy()
    # manufacture a unique ID based on product + user
    result['id'] = result.parent_asin + '-' + result.user_id
    # use normalized format for review data
    return result.rename({
        'text': 'review',
        'parent_asin': 'product_id'
    })

def parse_amazon_product(product: pd.Series) -> pd.Series:
    """
    Given an Amazon product description from Kaggle data, return a product description in a normalized format.
    """
    # Product data schema:
    # - main_category: music, books, etc
    # - title: album/product title
    # - average_rating: combined average user review rating
    # - rating_number: user review count
    # - features: miscellaneous notes for some rare/collectors CDs for example
    # - description: product description text
    # - price
    # - images
    # - store: contains combination of artist and audio format
    # - categories: ?
    # - details: dictionary with potentially large volumes of arbitrary metadata. Usually contains album release date. Catalog includes non-music items that can be identified by eccentric metadata if needed.
    # - parent_asin: product id
    # - bought_together: not used
    columns = ['parent_asin', 'title', 'description', 'images', 'main_category']
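    # copy so that adding 'date' does not write into a view of the input
    result = product[columns].copy()
    # 'details' is free-form, so guard against it being missing or not a dictionary
    details = product.get('details')
    if isinstance(details, dict) and (date := details.get('Date First Available')):
        result['date'] = date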
    return result.rename({
        'parent_asin': 'id',
        'main_category': 'type',
    })

def load_amazon_reviews(path, verbosity = 0) -> pd.DataFrame:
    with open(path) as file:
        records = []
        while (line := file.readline()):
            record = pd.Series(json.loads(line))
            record = parse_amazon_review(record)
            records.append(record)
            if verbosity > 0 and len(records) % 10_000 == 0:
                print(len(records))
        result = pd.DataFrame.from_records(records)
        result.set_index('id', inplace = True)
        return result

In [None]:
#music_reviews = load_amazon_reviews('Digital_Music.jsonl', verbosity = 1)
#profile(f'Read {len(music_reviews)} digital music reviews')
#music_reviews

In [20]:
def append_recordz(records: pd.DataFrame, path: str, index_col = 'id', verbosity = 1) -> int:
    # Earlier variant: rewrites the whole CSV, skipping ids that are already on disk.
    try:
        old_records = pd.read_csv(path, index_col = index_col)
        if verbosity > 0:
            print(f'{len(old_records)} old records')
    except (FileNotFoundError, pd.errors.EmptyDataError):
        old_records = None
    new_records = records
    if old_records is not None:
        new_record_ids = new_records.index.difference(old_records.index)
        new_records = new_records.loc[new_record_ids]
    if len(new_records) > 0:
        # keep the combined frame separate so the returned count covers new records only
        combined = new_records if old_records is None else pd.concat([old_records, new_records])
        combined.to_csv(path)
    if verbosity > 0:
        print(f'{len(new_records)} new records')
    return len(new_records)

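def append_records(records: pd.DataFrame, path: str, index_col = 'id', verbosity = 1) -> int:
    # possible future iteration as needed: prevent duplicates
    # Write the header only once; repeating it per batch leaves stray header rows in the CSV.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # Keep index_col as a regular column; a plain RangeIndex would end up as 'Unnamed: 0'.
    if records.index.name == index_col:
        records = records.reset_index()
    with open(path, 'a', newline = '') as outfile:
        records.to_csv(outfile, header = write_header, index = False)
        if verbosity > 0:
            print(f'appended {len(records)} records to {path}')
    return len(records)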

def jsonl_to_csv(jsonpath: str, process_json = lambda x: x, index_col = 'id', batch_size = 10_000, verbosity = 1):
    # replace only the file extension, not every 'jsonl' in the path
    csv_filename = os.path.splitext(jsonpath)[0] + '.csv'
    if os.path.exists(csv_filename):
        if verbosity > 0:
            print('CSV file already exists. Skipping.')
        return
    t0 = time.perf_counter()
    with open(jsonpath) as file:
        batch = []
        count = 0
        def append_batch() -> int:
            if not batch:
                return 0  # nothing left to write
            records = pd.DataFrame.from_records(batch)
            return append_records(records, csv_filename, index_col = index_col, verbosity = 0)
        for line in file:
            record = process_json(pd.Series(json.loads(line)))
            batch.append(record)
            if len(batch) == batch_size:
                count += append_batch()
                batch = []
                if verbosity > 0:
                    profile(f'converted {count:,} records', t0)
        # write the final, partially filled batch
        count += append_batch()
        if verbosity > 0:
            profile(f'Finished converting {count} records', t0)

t = time.perf_counter()
jsonl_to_csv('Digital_Music.jsonl', process_json = parse_amazon_review)
profile(f'Converted digital music json to csv')

3.035:	converted 10,000 records
5.888:	converted 20,000 records
8.542:	converted 30,000 records
11.395:	converted 40,000 records
14.242:	converted 50,000 records
17.082:	converted 60,000 records
19.737:	converted 70,000 records
22.559:	converted 80,000 records
25.391:	converted 90,000 records
28.046:	converted 100,000 records
30.875:	converted 110,000 records
33.696:	converted 120,000 records
36.541:	converted 130,000 records
36.656:	Finshed converting 130434 records
107.398:	Converted digital music json to csv
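
The timings above come to roughly 3 seconds per 10,000 records, much of which is plausibly spent building one `pd.Series` per line. A possible alternative, sketched below and not tested on the full data, is to let `pd.read_json` parse whole chunks. The function name `jsonl_to_csv_chunked` and the sample records are made up for illustration. `dtype = False` and `convert_dates = False` are passed so that product ids and the `timestamp` column are not reinterpreted.

```python
import json
import os
import tempfile

import pandas as pd

def jsonl_to_csv_chunked(jsonpath: str, csvpath: str, columns: list, chunksize: int = 10_000) -> int:
    """Hypothetical variant of jsonl_to_csv that parses whole chunks at once."""
    count = 0
    # dtype=False and convert_dates=False keep ids and timestamps exactly as stored
    with pd.read_json(jsonpath, lines = True, chunksize = chunksize,
                      dtype = False, convert_dates = False) as reader:
        for i, chunk in enumerate(reader):
            # write the header for the first chunk only
            chunk[columns].to_csv(csvpath, mode = 'a', header = (i == 0), index = False)
            count += len(chunk)
    return count

# Tiny made-up sample in the same JSON-lines layout
workdir = tempfile.mkdtemp()
jsonpath = os.path.join(workdir, 'sample.jsonl')
csvpath = os.path.join(workdir, 'sample.csv')
with open(jsonpath, 'w') as file:
    for i in range(5):
        record = {'rating': 5.0, 'parent_asin': f'B00{i}', 'user_id': f'U{i}', 'images': []}
        file.write(json.dumps(record) + '\n')

converted = jsonl_to_csv_chunked(jsonpath, csvpath, ['rating', 'parent_asin', 'user_id'], chunksize = 2)
sample = pd.read_csv(csvpath)
print(converted, sample.shape)
```

This sketch only selects columns. The renaming and the synthetic `id` from `parse_amazon_review` would still have to be applied per chunk.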


In [33]:
t = time.perf_counter()
jsonl_to_csv('meta_Digital_Music.jsonl', process_json = parse_amazon_product)
profile(f'Converted digital music product json to csv', t)

3.020:	converted 10,000 records
5.618:	converted 20,000 records
8.407:	converted 30,000 records
11.195:	converted 40,000 records
14.018:	converted 50,000 records
16.629:	converted 60,000 records
19.458:	converted 70,000 records
19.622:	Finshed converting 70537 records
19.623:	Converted digital music product json to csv


In [34]:
t = time.perf_counter()
jsonl_to_csv('meta_CDs_and_Vinyl.jsonl', process_json = parse_amazon_product)
profile(f'Converted CD product json to csv', t)

3.007:	converted 10,000 records
5.679:	converted 20,000 records
8.528:	converted 30,000 records
11.405:	converted 40,000 records
14.285:	converted 50,000 records
17.204:	converted 60,000 records
19.888:	converted 70,000 records
22.763:	converted 80,000 records
25.633:	converted 90,000 records
28.537:	converted 100,000 records
31.217:	converted 110,000 records
34.146:	converted 120,000 records
37.042:	converted 130,000 records
39.944:	converted 140,000 records
42.643:	converted 150,000 records
47.034:	converted 160,000 records
50.005:	converted 170,000 records
53.009:	converted 180,000 records
55.699:	converted 190,000 records
58.569:	converted 200,000 records
61.445:	converted 210,000 records
64.125:	converted 220,000 records
67.023:	converted 230,000 records
69.921:	converted 240,000 records
72.822:	converted 250,000 records
75.522:	converted 260,000 records
78.452:	converted 270,000 records
81.361:	converted 280,000 records
84.248:	converted 290,000 records
87.121:	converted 300,000 

In [36]:
t = time.perf_counter()
jsonl_to_csv('CDs_and_Vinyl.jsonl', process_json = parse_amazon_review, batch_size = 100_000)
profile(f'Converted CD review json to csv', t)

28.111:	converted 100,000 records
56.003:	converted 200,000 records
83.752:	converted 300,000 records
112.357:	converted 400,000 records
140.186:	converted 500,000 records
168.135:	converted 600,000 records
196.043:	converted 700,000 records
224.114:	converted 800,000 records
252.430:	converted 900,000 records
280.327:	converted 1,000,000 records
308.171:	converted 1,100,000 records
336.806:	converted 1,200,000 records
366.609:	converted 1,300,000 records
394.490:	converted 1,400,000 records
422.512:	converted 1,500,000 records
450.313:	converted 1,600,000 records
478.285:	converted 1,700,000 records
506.417:	converted 1,800,000 records
534.203:	converted 1,900,000 records
561.964:	converted 2,000,000 records
590.238:	converted 2,100,000 records
618.230:	converted 2,200,000 records
646.511:	converted 2,300,000 records
674.545:	converted 2,400,000 records
702.208:	converted 2,500,000 records
730.380:	converted 2,600,000 records
758.509:	converted 2,700,000 records
786.678:	converted 2,8

In [69]:
def load_reviews(path = 'reviews.csv') -> pd.DataFrame:
    # Passing dtype = {'rating': ..., 'timestamp': ..., 'helpful_vote': ...} to read_csv
    # fails on these files, most likely because CSVs appended in batches can contain
    # repeated header rows (rating == 'rating') that cannot be cast to numbers.
    # Load everything as-is for now and clean up afterwards.
    return pd.read_csv(path, index_col = 'id')

def save_reviews(reviews: pd.DataFrame, path = 'reviews.csv'):
    append_records(reviews, path)

In [70]:
t = time.perf_counter()
reviews = load_reviews('CDs_and_Vinyl.csv')
profile('Read CD reviews from CSV', t)


28.737:	Read CD reviews from CSV


In [71]:
t = time.perf_counter()
reviews = pd.concat([
    reviews,
    load_reviews('Digital_Music.csv')
])
profile('Combined CD and Digital Music Reviews', t)
reviews.shape

3.839:	Combined CD and Digital Music Reviews


(4957768, 8)

In [72]:
reviews.rating.value_counts()

rating
5.0       2396867
5.0       1254535
4.0        443622
4.0        235021
3.0        190363
1.0        128405
3.0         99720
2.0         94514
1.0         65458
2.0         49202
rating         61
Name: count, dtype: int64

In [73]:
reviews[reviews.rating == 'rating']

| id | Unnamed: 0 | rating | title | review | product_id | user_id | helpful_vote | timestamp |
|---|---|---|---|---|---|---|---|---|
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
| id | | rating | title | review | product_id | user_id | helpful_vote | timestamp |
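
The rows above look like CSV header lines that ended up inside the data, which would also explain why every column is read as `object`. A minimal sketch of cleaning an already-written file after loading, using a made-up four-line CSV in place of the real one. The dtypes are an assumption (`float32` rather than `float16` for the rating, to stay on the safe side).

```python
import io

import pandas as pd

# Hypothetical miniature of a CSV with a header repeated between batches
raw = io.StringIO(
    'id,rating,helpful_vote,timestamp\n'
    'P1-U1,5.0,3,1600000000000\n'
    'id,rating,helpful_vote,timestamp\n'
    'P2-U2,4.0,0,1600000001000\n'
)
reviews = pd.read_csv(raw, index_col = 'id')

# Header rows carry the literal column name as their value
is_header = reviews.rating.astype(str) == 'rating'
clean = reviews[~is_header].copy()

# With the stray rows gone, the numeric columns can be converted
dtypes = {'rating': 'float32', 'helpful_vote': 'int32', 'timestamp': 'int64'}
for column, dtype in dtypes.items():
    clean[column] = pd.to_numeric(clean[column]).astype(dtype)

clean.info()
```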


In [56]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5088215 entries, B002MW50JA-AGKASBHYZPGTEPO6LWZPVJWB2BVA to B00000853T-AEFCHGMHFSZA4IWC5FWTBRPR25GQ
Data columns (total 8 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Unnamed: 0    float64
 1   rating        object 
 2   title         object 
 3   review        object 
 4   product_id    object 
 5   user_id       object 
 6   helpful_vote  object 
 7   timestamp     object 
dtypes: float64(1), object(7)
memory usage: 349.4+ MB


In [None]:
def load_products(path = 'products.csv') -> pd.DataFrame:
    result = pd.read_csv(path, index_col = 'id')
    return result

In [43]:
t = time.perf_counter()
products = pd.read_csv('meta_CDs_and_Vinyl.csv')
profile('Read CD product data', t)
products = pd.concat([
    products,
    pd.read_csv('meta_Digital_Music.csv')
])
profile('Combined CD and Digital Music product data', t)
products.shape

4.556:	Read CD product data
4.892:	Combined CD and Digital Music product data


(772573, 7)

### Text Processing

In [45]:
def extract_artist(text: str) -> str:
    # drop the trailing 'Format: ...' part
    result = text.partition('Format:')[0]
    # drop any parenthesized role such as '(Artist)' and everything after it
    result = result.partition('(')[0]
    return result.strip()

examples = [
    "Scandroid Format: Audio CD",
    "Scandroid (Artist) Format: Audio CD",
    "Beethoven (Composer), Berliner Philharmonika (Performer)",
]
for example in examples:
    artist = extract_artist(example)
    print(f"artist('{example}') = '{artist}'")

artist('Scandroid Format: Audio CD') = 'Scandroid'
artist('Scandroid (Artist) Format: Audio CD') = 'Scandroid'
artist('Beethoven (Composer), Berliner Philharmonika (Performer)') = 'Beethoven'


In [46]:
import unidecode

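def normalize(text: str) -> str:
    # remove 'the' only as a whole word, so that names like 'Thermostatic' stay intact
    result = re.sub(r'\bthe\b', '', text.casefold())
    # drop punctuation and whitespace, then transliterate accented characters to ASCII
    result = re.sub(r'[^\w]', '', result)
    result = unidecode.unidecode(result)
    return result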

examples = [
    'The Who', 'Who', 'Who, The',
    'R.E.M', 'R E M',
    'RMB',
    'Céline Dion', 'Blümchen', 'Bluemchen',
]
for example in examples:
    norm = normalize(example)
    print(f"normalize('{example}') = '{norm}'")

normalize('The Who') = 'who'
normalize('Who') = 'who'
normalize('Who, The') = 'who'
normalize('R.E.M') = 'rem'
normalize('R E M') = 'rem'
normalize('RMB') = 'rmb'
normalize('Céline Dion') = 'celinedion'
normalize('Blümchen') = 'blumchen'
normalize('Bluemchen') = 'bluemchen'


## Products

In [None]:
def best_products(meta: pd.DataFrame, count = 10) -> pd.DataFrame:
    return meta.sort_values(by = 'rating_number', ascending = False)[:count]

def process_products(data: pd.DataFrame) -> pd.DataFrame:
    result = data.copy()
    if 'store' in result.columns:
        # the artist is derived from 'store', so drop products without one
        result = result[result.store.notna()].copy()
        result['artist'] = result.store.map(extract_artist)
        result['artist_norm'] = result.artist.map(normalize)
    return result

In [None]:
%%time
products = read_json(meta_file, meta_limit)
products.info()

In [None]:
products = process_products(products)
best_products(products)

### Artist Filter

In [None]:
with open('all_artists.txt') as file:
    # strip the trailing newline from every line and skip blank lines
    all_artists = [artist.strip() for artist in file if artist.strip()]
all_artists = pd.DataFrame(all_artists)
all_artists.columns = ['artist']
all_artists['artist_norm'] = all_artists.artist.map(normalize)
all_artists_set = set(all_artists.artist_norm)

def filter_artists(artists: pd.Series, return_removed = False):
    norm = artists.map(normalize)
    is_artist = norm.map(lambda artist: artist in all_artists_set)
    results = artists[is_artist]
    if return_removed:
        non_artists = artists[~is_artist]
        return (results, non_artists)
    return results


products.artist.value_counts()[:20]
#products.groupby('artist')['title'].count().sort_values(ascending = False)[:20]

In [None]:
artists, non_artists = filter_artists(products.artist, return_removed = True)
len(artists), len(non_artists)

In [None]:
non_artists.value_counts()[:20]

In [None]:
# Taken from Billboard Hot 100 at time of writing
new_artists = [
    'Shaboozey', 'Lady Gaga', 'Billie Eilish', 'Sabrina Carpenter', 'Teddy Swims', 'Tyler, The Creator'
]
# Eclectic niche artists:
niche_artists = [
    'Scandroid', 'Zombie Hyperdrive', 'Dynatron', 'Magic Sword', 'Dance With the Dead', # Synthwave
    'Eisfabrik', 'Apoptygma Berzerk', # Industrial Dance
    'Zircon', 'Rushjet1', 'Megadrive', # Chiptunes
    'Alice in Videoland', 'Thermostatic', 'Parralox', 'Pool Waitress', # Synthpop
    'Juno Reactor', 'Blümchen', # misc
]

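test_artists = new_artists + niche_artists
# product_artists_set is not defined in any cell shown here; assumed to be
# the set of normalized artist names found in the product data
product_artists_set = set(products.artist_norm)
results = []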
for artist in test_artists:
    norm = normalize(artist)
    in_db = norm in all_artists_set
    in_products = norm in product_artists_set
    results.append({'in_db': in_db, 'in_products':in_products})

pd.DataFrame.from_records(results, index = test_artists)



### Products + Reviews

In [74]:
def joinz(reviews: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    # This should be trivial:
    #return reviews.join(metadata, on = 'product_id', how = 'left')
    # But the join fails on this data, so perform a (slow) manual join.
    # Assumes metadata is indexed by product id and that both indexes are unique.
    result = reviews.copy()
    new_columns = metadata.columns.difference(reviews.columns)
    print(new_columns)
    for column in new_columns:
        result[column] = None
    for review_id, product_id in reviews.product_id.items():
        try:
            data = metadata.loc[product_id]
        except KeyError:
            continue  # no metadata for this product
        for column in new_columns:
            # index by the column variable; data.column would look up a column literally named 'column'
            result.at[review_id, column] = data[column]
    return result

#def product_reviews(reviews: pd.DataFrame, products: pd.DataFrame) -> pd.DataFrame:
#    reviews.join(products, on = 'product_id')

product_reviews = reviews.join(products, on = 'product_id', how = 'left')
product_reviews.head(2)

ValueError: You are trying to merge on object and int64 columns for key 'product_id'. If you wish to proceed you should use pd.concat
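
A likely cause of this error: `products` was read without `index_col`, so its index is an integer `RangeIndex`, and `join` compares the string `product_id` values against those integers. A minimal sketch of a join that indexes the products by `id` first, using made-up toy frames in place of the real data. The `_product` suffix is an arbitrary choice to keep the two `title` columns apart.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
reviews = pd.DataFrame({
    'id': ['P1-U1', 'P2-U2', 'P9-U3'],
    'rating': [5.0, 4.0, 1.0],
    'title': ['Great', 'Good', 'Meh'],
    'product_id': ['P1', 'P2', 'P9'],
}).set_index('id')

# Built like a frame read without index_col: 'id' is a column, the index is 0, 1, ...
products = pd.DataFrame({
    'id': ['P1', 'P2'],
    'title': ['Album One', 'Album Two'],
})

# Index the products by id so the join keys have matching types,
# and add a suffix to the product columns that clash with review columns
product_reviews = reviews.join(
    products.set_index('id'),
    on = 'product_id',
    how = 'left',
    rsuffix = '_product',
)
print(product_reviews)
```

Reviews without matching product data (here `P9`) keep their row and get missing values in the product columns, as expected for a left join.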

In [76]:
products.to_sql('products.sql')

AttributeError: 'DataFrame' object has no attribute 'tosql'
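
Note that `DataFrame.to_sql` expects a table name and a database connection rather than a file name. A minimal sketch using the standard library's `sqlite3` with an in-memory database and a made-up two-row frame. The table name `products` and the choice of SQLite are assumptions.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the real product data (hypothetical values)
products = pd.DataFrame({
    'id': ['P1', 'P2'],
    'title': ['Album One', 'Album Two'],
})

# ':memory:' keeps the example self-contained; pass a file path such as
# 'products.db' to persist the data instead
connection = sqlite3.connect(':memory:')
products.to_sql('products', connection, index = False, if_exists = 'replace')

# read the table back to check the round trip
loaded = pd.read_sql('SELECT * FROM products', connection)
connection.close()
print(loaded)
```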