# Import Amazon Music
This workflow begins with raw Amazon product and product review data sourced from Kaggle at:

https://www.kaggle.com/datasets/wajahat1064/amazon-reviews-data-2023

As a prerequisite, download the Audio CD (CDs_and_Vinyl) and Digital Music user review and product metadata files (4 datasets) from the link above and place them in the data/raw folder.

In [50]:
%load_ext autoreload
%autoreload 2

import json
import sqlite3 as sql

import pandas as pd

import sql_ingest as ingest
from jsonl_to_csv import jsonl_to_csv

%run '../query/search.py'

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## JSONL to CSV
Large JSONL files are unwieldy to work with. Converting them to CSV makes it feasible to import each dataset with a single line of code in a reasonable amount of time.

In [22]:
jsonl_to_csv('../raw/Digital_Music.jsonl', if_exists = 'skip')

In [23]:
jsonl_to_csv('../raw/meta_Digital_Music.jsonl', if_exists =  'skip')

In [24]:
jsonl_to_csv('../raw/CDs_and_Vinyl.jsonl', batch_size = 100_000, if_exists = 'skip')

In [25]:
jsonl_to_csv('../raw/meta_CDs_and_Vinyl.jsonl', batch_size = 100_000, if_exists = 'skip')
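
The `jsonl_to_csv` helper is project code that is not shown in this notebook. As a rough illustration of what a batched conversion could look like, here is a minimal sketch. The name `jsonl_to_csv_sketch`, the output path convention, the default batch size, and the meaning given to `if_exists = 'skip'` are all assumptions, not the project's actual implementation.

```python
import csv
import json
import os
from itertools import islice

def jsonl_to_csv_sketch(path: str, batch_size: int = 50_000, if_exists: str = 'skip') -> str:
    """Hypothetical stand-in for jsonl_to_csv: stream a JSONL file into a CSV beside it."""
    out_path = os.path.splitext(path)[0] + '.csv'
    # Assumption: 'skip' means an existing CSV is left untouched
    if if_exists == 'skip' and os.path.exists(out_path):
        return out_path
    with open(path, encoding='utf-8') as src, \
         open(out_path, 'w', newline='', encoding='utf-8') as dst:
        writer = None
        while True:
            # Read at most batch_size lines so memory use stays bounded
            lines = list(islice(src, batch_size))
            if not lines:
                break
            records = [json.loads(line) for line in lines if line.strip()]
            if not records:
                continue
            if writer is None:
                # Assumption: the first record lists every column
                writer = csv.DictWriter(dst, fieldnames=list(records[0]),
                                        restval='', extrasaction='ignore')
                writer.writeheader()
            writer.writerows(records)
    return out_path
```

Nested lists and dictionaries are written as their Python string form, which resembles the `details` strings seen in the product metadata. Unlike the CSV files read in this notebook, the sketch writes no index column.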

## Normalize Format

Next, let's combine digital music and CD data and convert the data format so that it matches our SQL backend.

In [26]:
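# Amazon data from this dataset hides the album artist in a 'store' field that combines artist and audio format.
# Let's extract only the artist from the store field:
def extract_artist(text: str) -> str:
    # Missing values (e.g. NaN) pass through unchanged
    if not isinstance(text, str):
        return text
    # Drop the trailing audio format, e.g. "Scandroid Format: Audio CD"
    result = text.partition('Format:')[0]
    # Drop the first role annotation and everything after it, e.g. "(Artist)"
    result = result.partition('(')[0]
    return result.strip()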

# Test
examples = [
    "Scandroid Format: Audio CD",
    "Scandroid (Artist) Format: Audio CD",
    "Beethoven (Composer), Berliner Philharmonika (Performer)",
]
for example in examples:
    artist = extract_artist(example)
    print(f"artist('{example}') = '{artist}'")

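# Release dates are similarly hidden within a larger 'details' metadata field.
# Returns the date string, or None if it cannot be found.
def extract_release_date(details: str) -> str:
    try:
        # The field is a Python-style dict string; swap the quotes so it parses as JSON
        meta = json.loads(details.replace("'", '"'))
        return meta.get('Date First Available')
    except (AttributeError, ValueError):
        # Non-string input, invalid JSON, or JSON that is not an object
        return None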

cases = [
    'Release Date',
    '{ "Artist": "Juno Reactor" }',
    "{ 'Date First Available': '2020-12-01' }",
    '{ "Date First Available": "2020-12-01" }'

]
for case in cases:
    date = extract_release_date(case)
    print(f'release date ({repr(case)}) = {repr(date)}')

artist('Scandroid Format: Audio CD') = 'Scandroid'
artist('Scandroid (Artist) Format: Audio CD') = 'Scandroid'
artist('Beethoven (Composer), Berliner Philharmonika (Performer)') = 'Beethoven'
release date ('Release Date') = None
release date ('{ "Artist": "Juno Reactor" }') = None
release date ("{ 'Date First Available': '2020-12-01' }") = '2020-12-01'
release date ('{ "Date First Available": "2020-12-01" }') = '2020-12-01'
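
Swapping every single quote for a double quote fails when a value itself contains an apostrophe, in which case the function above returns `None`. Because the `details` strings look like Python dict literals, `ast.literal_eval` is one possible alternative. The sketch below is only an illustration: `extract_release_date_safe` is a hypothetical name, and the example strings are made up.

```python
import ast
from typing import Optional

def extract_release_date_safe(details) -> Optional[str]:
    """Hypothetical variant that parses the details field as a Python literal."""
    if not isinstance(details, str):
        return None  # e.g. NaN for a missing field
    try:
        meta = ast.literal_eval(details)
    except (ValueError, SyntaxError):
        return None  # not a valid Python literal
    # Only dictionaries carry the 'Date First Available' key
    return meta.get('Date First Available') if isinstance(meta, dict) else None

# Made-up example with an apostrophe inside a value
tricky = """{"Title": "Don't Stop", 'Date First Available': 'May 1, 2001'}"""
print(extract_release_date_safe(tricky))
```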


In [27]:
def normalize_amazon_music_reviews(reviews: pd.DataFrame) -> pd.DataFrame:
    """
    Given a DataFrame of Amazon user reviews from the Kaggle data, return the reviews in a normalized format compatible with our SQL backend.
    """
    # Review data schema:
    # - rating
    # - title
    # - text: user review text
    # - parent_asin: product id
    # - asin: redundant product id
    # - user_id
    # - helpful_vote: number of upvotes for this review
    # - timestamp
    # - images
    # - verified_purchase
    return reviews.rename(columns = {
        'text': 'review',
        'parent_asin': 'product_id',
        'helpful_vote': 'upvotes',
    }).drop(columns = ['asin', 'verified_purchase', 'images'])


def normalize_amazon_music_data(products: pd.DataFrame) -> pd.DataFrame:
    """
    Given a DataFrame of Amazon product descriptions from the Kaggle data, return the products in a normalized format.
    """
    # Product data schema:
    # - main_category: music, books, etc
    # - title: album/product title
    # - average_rating: combined average user review rating
    # - rating_number: user review count
    # - features: miscellaneous notes for some rare/collectors CDs for example
    # - description: product description text
    # - price
    # - images
    # - store: contains combination of artist and audio format
    # - categories: ?
    # - details: dictionary with potentially large volumes of arbitrary metadata. Usually contains the album release date. The catalog includes non-music items, which can be identified by their unusual metadata if needed.
    # - parent_asin: product id
    # - bought_together: not used
    result = products.rename(columns = {
        'parent_asin': 'id',
        'main_category': 'category',
        'store': 'creator'
    }).drop(columns = [
        'average_rating', 'rating_number',
        'features',
        'price',
        'categories',
        'details',
        'bought_together',
        'subtitle',
        'author',
        'videos',
        'images'
    ])
    result['release_date'] = products.details.map(extract_release_date)
    # Descriptions come in an array, but most products only have one. Let's join them into a single string.
    result.description = result.description.map(ingest.get_single_value)
    # The store column combines artist and audio format; extract_artist keeps only the artist.
    result.creator = result.creator.map(extract_artist)
    result['category'] = 'Music'
    result['subcategory'] = '' # Genre is not available in this dataset, which is a notable weakness. Perhaps it can be added later.
    result['title_search'] = result.title.map(search_text)
    result['creator_search'] = result.creator.map(search_text)
    result.set_index('id', inplace = True)
    return result
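
The `release_date` values extracted this way keep Amazon's display format (for example "February 28, 2010" in the product metadata shown later in this notebook), while the existing Books rows use ISO-style values such as 2010-01-07. If a uniform format is wanted, one option is to convert the strings with `datetime.strptime`. This is a minimal sketch: `to_iso_date` is a hypothetical helper that is not part of the project, and `%B` assumes an English locale.

```python
from datetime import datetime
from typing import Optional

def to_iso_date(text) -> Optional[str]:
    """Hypothetical helper: 'March 21, 2009' -> '2009-03-21'; None if it does not match."""
    if not isinstance(text, str):
        return None
    try:
        # %B is the full month name, which depends on the active locale
        return datetime.strptime(text.strip(), '%B %d, %Y').date().isoformat()
    except ValueError:
        return None

print(to_iso_date('March 21, 2009'))
```

It could be applied with `result.release_date.map(to_iso_date)`, at the cost of discarding values that do not match the expected pattern.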

## Import Reviews

In [54]:
conn = sql.connect('../products.sql')

In [29]:
# 130K records @ 0.6 sec = 217K records / sec
digital_music_reviews = pd.read_csv('../raw/Digital_Music.csv').drop(columns = ['Unnamed: 0'])
print(digital_music_reviews.shape)
digital_music_reviews.head(2)

(130434, 10)


Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5.0,Nice,If i had a dollar for how many times I have pl...,[],B004RQ2IRG,B004RQ2IRG,AFUOYIZBU3MTBOLYKOJE5Z35MBDA,1618972613292,0,True
1,5.0,Excellent,awesome sound - cant wait to see them in perso...,[],B0026UZEI0,B0026UZEI0,AHGAOIZVODNHYMNCBV4DECZH42UQ,1308167525000,0,True


In [30]:
# 4.8M records @ 27.5 sec = 175K records / sec
cd_reviews = pd.read_csv('../raw/CDs_and_Vinyl.csv').drop(columns = ['Unnamed: 0'])
print(cd_reviews.shape)

(4827273, 10)
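
One thing to watch when reading these files: `pd.read_csv` infers column types, and a column whose values consist only of digits is read as integers, which drops leading zeros. Product ids such as `1046314` in the album examples later in this notebook may be a symptom of this, though that is not verified here. The sketch below shows the effect on made-up data and how passing `dtype` avoids it; the zero-padded ids are hypothetical.

```python
import io
import pandas as pd

# Made-up sample with hypothetical zero-padded product ids
csv_text = "parent_asin,rating\n0001046314,4.0\n0001048252,5.0\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'parent_asin': str})

print(inferred.parent_asin.tolist())  # parsed as integers, leading zeros lost
print(as_text.parent_asin.tolist())   # kept as strings, leading zeros preserved
```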


In [31]:
reviews = pd.concat([cd_reviews, digital_music_reviews])
reviews = normalize_amazon_music_reviews(reviews)
reviews.sample(3)

Unnamed: 0,rating,title,review,product_id,user_id,timestamp,upvotes
2010633,2.0,pink noise,I found it to be more annoying than relaxing. ...,B002IYDT1I,AEZ4N5AGDWHB5RUTDIIBFITLDL3Q,1387217211000,3
3465510,4.0,The instrumentation on the entire CD is not th...,The instrumentation on the entire CD is not th...,B000001OQ1,AGO2DIZ4S4EMGCDFGANQHFBDLQOA,1468802620000,0
3147139,5.0,Grade: A,There's nothing like that 'hoping' feeling you...,B00005OMGF,AH5PUT2TZRRZ3RVGSRLLBUBHXNIQ,1005268575000,10



In [None]:
# 5M records @ 22m 35 sec = 3.7K records / sec
clean_reviews = False
if clean_reviews:
    conn.execute("DELETE FROM review WHERE product_id IN (SELECT id FROM product WHERE category = 'Music')")
    print('Removed all existing music reviews.')
ingest.import_reviews(reviews, conn)

Inserting 4,957,707 records into review...
8.78: inserted 100,000 of 4,957,707 records (2.0%) @ 11385 records / sec
19.52: inserted 200,000 of 4,957,707 records (4.0%) @ 10247 records / sec
31.99: inserted 300,000 of 4,957,707 records (6.1%) @ 9378 records / sec
49.68: inserted 400,000 of 4,957,707 records (8.1%) @ 8052 records / sec
64.35: inserted 500,000 of 4,957,707 records (10.1%) @ 7770 records / sec
83.79: inserted 600,000 of 4,957,707 records (12.1%) @ 7161 records / sec
105.66: inserted 700,000 of 4,957,707 records (14.1%) @ 6625 records / sec
120.12: inserted 800,000 of 4,957,707 records (16.1%) @ 6660 records / sec
144.57: inserted 900,000 of 4,957,707 records (18.2%) @ 6225 records / sec
166.35: inserted 1,000,000 of 4,957,707 records (20.2%) @ 6011 records / sec
186.06: inserted 1,100,000 of 4,957,707 records (22.2%) @ 5912 records / sec
208.77: inserted 1,200,000 of 4,957,707 records (24.2%) @ 5748 records / sec
230.60: inserted 1,300,000 of 4,957,707 records (26.2%) @ 56
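
`ingest.import_reviews` is project code that is not shown here. The progress log suggests records are inserted in batches of 100,000. A batched insert of that kind can be sketched with `sqlite3` as follows. The function name `insert_in_batches`, the demo table, and its columns are assumptions for illustration only.

```python
import sqlite3
import time

def insert_in_batches(conn, table, columns, rows, batch_size=100_000):
    """Hypothetical batched insert with progress output; table and columns are trusted names."""
    placeholders = ','.join(['?'] * len(columns))
    statement = f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders})"
    start = time.time()
    inserted = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(statement, batch)
        inserted += len(batch)
        elapsed = time.time() - start
        rate = inserted / elapsed if elapsed > 0 else float('inf')
        print(f"{elapsed:.2f}: inserted {inserted:,} of {len(rows):,} records "
              f"({inserted / len(rows):.1%}) @ {rate:.0f} records / sec")
    return inserted

# Demo on an in-memory database with made-up rows
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE review (user_id TEXT, product_id TEXT, rating REAL)")
demo_rows = [(f"user{i}", f"product{i % 10}", 5.0) for i in range(250)]
insert_in_batches(demo_conn, 'review', ['user_id', 'product_id', 'rating'], demo_rows, batch_size=100)
```

Committing once per batch rather than once per row keeps transaction overhead low while still allowing progress to be reported.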

## Import Products

In [9]:
digital_music_data = pd.read_csv('../raw/meta_Digital_Music.csv').drop(columns = ['Unnamed: 0'])
digital_music_data.head(2)

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,categories,details,parent_asin,bought_together
0,Digital Music,Baja Marimba Band,4.9,8,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],,[],"{'Date First Available': 'February 28, 2010'}",B000V87RP2,
1,Digital Music,'80s Halloween-All Original Artists & Recordings,5.0,3,[],[],14.98,[{'thumb': 'https://m.media-amazon.com/images/...,[],"Love and Rockets (Artist), Duran Duran (...",[],{'Package Dimensions': '5.55 x 4.97 x 0.54 inc...,B0062F0MJQ,


In [10]:
cd_data = pd.read_csv('../raw/meta_CDs_and_Vinyl.csv').drop(columns = ['Unnamed: 0'])

In [None]:
product_data = pd.concat([digital_music_data, cd_data])
product_data_normalized = normalize_amazon_music_data(product_data)
product_data_normalized.sample(10)

id,category,title,description,creator,release_date,subcategory,title_search,creator_search
B001W6Q4BU,Music,The Kinks Choral Collection,"Product Description, After massively successfu...",Ray Davies,"March 21, 2009",,kinkschoralcollection,raydavies
B000008JCW,Music,March,1. No Myth 4:11 2. Half Harvest 4:04 3. This a...,Michael Penn,"July 26, 2006",,march,michaelpenn
B00701QV0A,Music,Eye In The Sky,,Alan Parsons Project Alan Parsons Symphonic ...,"January 21, 2012",,eyeinsky,alanparsonsprojectalanparsonssymphonicproject
B000TJ71EK,Music,Sings Operatic Arias,The brilliant coloratura soprano Roberta Peter...,Roberta Peters,"July 20, 2007",,singsoperaticarias,robertapeters
B000BY8MSC,Music,Caribbean Steel Drums,Sounds wonderful by the pool or while sipping ...,Lifescapes,"November 2, 2005",,caribbeansteeldrums,lifescapes
B09JJCC89T,Music,A Tear In The Fabric of Life,Surprise EP drop from Knocked Loose ahead of t...,Knocked Loose,"October 14, 2021",,atearinfabricoflife,knockedloose
B000I8X5TM,Music,Chaney,Songs: 1. Me estoy Muriendo Por Dentro 2. Me E...,Conjunto Chaney,"September 2, 2006",,chaney,conjuntochaney
B000003FHA,Music,Pops Christmas Party,"The 1959 Boston Pops Christmas ""Living Stereo""...",Arthur Fiedler and the Boston Pops Orchestra,"December 7, 2006",,popschristmasparty,arthurfiedlerandbostonpopsorchestra
B018X0VINQ,Music,Servitude by Imports,,Aversions Crown,"May 23, 2020",,servitudebyimports,aversionscrown
B001GC8V6Q,Music,Grand Collection. Dmitri Hvorostovsky. Vol. 2....,,"Hvorostovsky, Dmitri","September 5, 2012",,grandcollectiondmitrihvorostovskyvol2oldrussia...,dmitrihvorostovsky


### Handling Duplicates
Interestingly, some music products share an id with products that already exist in the database, which results in an error on insertion.

The conflicting data appears to be books in CD format that are classified as audio CDs. We are only interested in music audio CDs, so it's best to remove these from our music data.

In [38]:
dupes = ingest.find_duplicates(product_data_normalized, 'product', conn)
print('Duplicate products already in the database:')
ids = ','.join(['?'] * len(dupes))
pd.read_sql(f"SELECT * FROM product WHERE id IN ({ids})", conn, params = list(dupes.index))


Duplicate products already in the database:


Unnamed: 0,id,title,title_search,creator,creator_search,publisher,description,category,subcategory,release_date
0,075406364X,Afterdark: The Dream Snatcher,afterdarkdreamsnatcher,Annie Dalton,anniedalton,,AFTERDARK is threatened by one man's destructi...,Books,Children's dreams,2001-06
1,0754084507,Watching Out (A Fran Varady crime novel),watchingoutafranvaradycrimenovel,Ann Granger,anngranger,Hachette UK,There's trouble ahead for Fran Varady... Just ...,Books,Fiction,2010-01-07
2,0786175621,V for Vendetta,vforvendetta,,,,,Books,,
3,0787106186,"Soaring With the Phoenix: Renewing the Vision,...",soaringwithphoenixrenewingvisionrevivingspirit...,"Marshall Goldsmith, Beverly Kaye, Ken Shelton",marshallgoldsmithbeverlykayekenshelton,Nicholas Brealey,Great leaders are great learners More than a d...,Books,Business & Economics,2010-11-26
4,082883279X,Assimil Language Course / Le Danois sans Peine...,assimillanguagecourseledanoissanspeinedanishfo...,United Nations,unitednations,UN,,Books,Political Science,2007-11-02
5,0867176865,Color Phonics,colorphonics,,,,,Books,,
6,0875098681,The Tozer CD-ROM Library (Version),tozercdromlibraryversion,A. W. Tozer,awtozer,,The Tozer CD-ROM Library is a searchable datab...,Books,,
7,0886902975,Rich Hall's Vanishing America/Audio Cassette/#...,richhallsvanishingamericaaudiocassette20090,,,,,Books,Booksellers and bookselling,1986
8,0967951615,Sunday Morning: The Novel,sundaymorningnovel,Alan Sillitoe,alansillitoe,HarperCollins UK,"""Working all week at the lathe leaves Arthur S...",Books,Black humor (Literature),2006
9,0970863330,Indigo Dreams: Adult Relaxation-Guided Meditat...,indigodreamsadultrelaxationguidedmeditationrel...,Lori Lite,lorilite,Stress Free Kids,Children love to unwind and relax with this fu...,Books,Self-Help,2008


In [None]:
product_data_normalized.drop(index = dupes.index, inplace = True)
ingest.find_duplicates(product_data_normalized, 'product', conn)

id,category,title,description,creator,release_date,subcategory,title_search,creator_search
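
`ingest.find_duplicates` is project code that is not shown in this notebook. Judging by how it is used above (it takes a DataFrame indexed by `id`, a table name, and a connection, and returns the conflicting rows), a duplicate check along these lines is one plausible reading. The name `find_duplicates_sketch`, the demo table, and the id `B000000001` are hypothetical.

```python
import sqlite3
import pandas as pd

def find_duplicates_sketch(frame: pd.DataFrame, table: str, conn: sqlite3.Connection) -> pd.DataFrame:
    """Hypothetical check: return rows of frame whose index already exists as an id in table."""
    existing = pd.read_sql(f"SELECT id FROM {table}", conn)['id']
    return frame[frame.index.isin(existing)]

# Demo on an in-memory database
demo_db = sqlite3.connect(':memory:')
demo_db.execute("CREATE TABLE product (id TEXT PRIMARY KEY, title TEXT)")
demo_db.execute("INSERT INTO product VALUES ('0786175621', 'V for Vendetta')")

incoming = pd.DataFrame(
    {'title': ['V for Vendetta', 'Thriller']},
    index=pd.Index(['0786175621', 'B000000001'], name='id'),
)
dupes_demo = find_duplicates_sketch(incoming, 'product', demo_db)
print(dupes_demo)
```

Loading every existing id is simple but may be slow for very large tables; querying only the incoming ids is an alternative.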


In [40]:
ingest.import_products(product_data_normalized, conn)

### Verify

In [33]:
pd.read_sql("SELECT COUNT(*) AS albums FROM product WHERE category = 'Music'", conn)

Unnamed: 0,albums
0,768224


In [55]:
pd.read_sql("SELECT category, COUNT(*) AS products FROM product GROUP BY category", conn)

Unnamed: 0,category,products
0,Books,212397
1,Music,768224


In [56]:
print("Album examples")
pd.read_sql("SELECT * FROM product WHERE category = 'Music' LIMIT 5", conn)

Album examples


Unnamed: 0,id,title,title_search,creator,creator_search,publisher,description,category,subcategory,release_date
0,1046314,A Woman of Substance,awomanofsubstance,Barbara Taylor Bradford,barbarataylorbradford,,This is the first in a saga of books about Emm...,Music,,"January 30, 2007"
1,1046519,The Importance of Being Earnest Complete & Una...,importanceofbeingearnestcompleteunabridged,Oscar Wilde Trevor Millum,oscarwildetrevormillum,,,Music,,"October 7, 2006"
2,1048236,The Sherlock Holmes Audio Collection,sherlockholmesaudiocollection,,,,,Music,,"February 16, 2007"
3,1048252,All the Pretty Horses,allprettyhorses,Cormac McCarthy,cormacmccarthy,,"The story of John Grady Cole, who at 16 finds ...",Music,,"January 30, 2007"
4,1048791,"The Crucible Performed by Stuart Pankin, Jerom...",jeromedempseycastcrucibleperformedbystuartpankin,Arthur Miller,arthurmiller,,,Music,,"January 26, 2007"


In [60]:
pd.read_sql("SELECT COUNT(*) AS reviews FROM review", conn)

Unnamed: 0,reviews
0,7959678


In [59]:
pd.read_sql("SELECT category, COUNT(*) as reviews FROM review r JOIN product p ON r.product_id = p.id GROUP BY category", conn)

Unnamed: 0,category,reviews
0,Books,2597367
1,Music,4903858
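
The two per-category counts sum to 7,501,225, which is 458,453 fewer than the 7,959,678 rows in `review`. An inner join drops reviews whose `product_id` has no matching row in `product`, so orphaned reviews are one possible explanation. A `LEFT JOIN` can count them. The sketch below demonstrates the query on a tiny in-memory database with made-up rows and a simplified schema.

```python
import sqlite3

orphan_db = sqlite3.connect(':memory:')
orphan_db.executescript("""
CREATE TABLE product (id TEXT PRIMARY KEY, category TEXT);
CREATE TABLE review (user_id TEXT, product_id TEXT);
INSERT INTO product VALUES ('p1', 'Music'), ('p2', 'Books');
INSERT INTO review VALUES ('u1', 'p1'), ('u2', 'p2'), ('u3', 'missing');
""")

# Reviews with no matching product keep NULL in the product columns after a LEFT JOIN
orphans = orphan_db.execute("""
SELECT COUNT(*) FROM review r
LEFT JOIN product p ON r.product_id = p.id
WHERE p.id IS NULL
""").fetchone()[0]
print(orphans)
```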


In [62]:
print("Review examples:")
pd.read_sql("SELECT * FROM review r JOIN product p ON r.product_id = p.id WHERE category = 'Music' LIMIT 5", conn)

Review examples:


Unnamed: 0,user_id,product_id,title,review,rating,upvotes,downvotes,timestamp,id,title.1,title_search,creator,creator_search,publisher,description,category,subcategory,release_date
0,AFLPX7J55FASTRFVCTHBS5NJKGAA,1046314,MNReview,Great for a quick tape of the best Bradford bo...,4,0,0,1191318038000,1046314,A Woman of Substance,awomanofsubstance,Barbara Taylor Bradford,barbarataylorbradford,,This is the first in a saga of books about Emm...,Music,,"January 30, 2007"
1,AF6QIUNWC2QOTXT7DFM4E3WWZR4A,1046519,This Play Gets No Better Than This...,"Oscar Wilde's masterpiece, this play has many,...",5,1,0,1489635021000,1046519,The Importance of Being Earnest Complete & Una...,importanceofbeingearnestcompleteunabridged,Oscar Wilde Trevor Millum,oscarwildetrevormillum,,,Music,,"October 7, 2006"
2,AFUTBB27LNTCQPZS7Q77WVQUKCBA,1048236,Five Stars,Just as advertised,5,0,0,1501871291329,1048236,The Sherlock Holmes Audio Collection,sherlockholmesaudiocollection,,,,,Music,,"February 16, 2007"
3,AGTHUDIRWR5TUPG564RDXNVF27AQ,1048252,literary masterpiece..,perhaps the most memorable of the Border Trilo...,5,2,0,1178655330000,1048252,All the Pretty Horses,allprettyhorses,Cormac McCarthy,cormacmccarthy,,"The story of John Grady Cole, who at 16 finds ...",Music,,"January 30, 2007"
4,AFIC6SKUPM64WJEFRNNFW75WYO3A,1048252,Just finished- first thoughts,I've only just finished Moby Dick before readi...,4,0,0,1176830183000,1048252,All the Pretty Horses,allprettyhorses,Cormac McCarthy,cormacmccarthy,,"The story of John Grady Cole, who at 16 finds ...",Music,,"January 30, 2007"


In [65]:
print("Most reviewed albums:")
pd.read_sql_query("""
SELECT p.title AS album, p.creator AS artist, COUNT(*) AS reviews
FROM review r JOIN product p ON r.product_id = p.id
WHERE category = 'Music'
GROUP BY p.title, p.creator ORDER BY reviews DESC LIMIT 10
""", conn)

Most reviewed albums:


Unnamed: 0,album,artist,reviews
0,25,Adele,5145
1,That's Christmas To Me,Pentatonix,3648
2,Traveller,Chris Stapleton,3241
3,Hamilton O.B.C.R. Explicit Lyrics,Lin-Manuel Miranda,3096
4,Oscd Ad Adele 21,Adele,2953
5,Partners,Barbra Streisand,2933
6,1989,Taylor Swift,2602
7,I Dreamed A Dream,Susan Boyle,2309
8,Fallen,Evanescence,2161
9,Thriller,Michael Jackson,2004


In [66]:
print("Most reviewed artists:")
pd.read_sql_query("""
SELECT p.creator AS artist, COUNT(*) AS reviews
FROM review r JOIN product p ON r.product_id = p.id
WHERE category = 'Music'
GROUP BY p.creator ORDER BY reviews DESC LIMIT 10
""", conn)

Most reviewed artists:


Unnamed: 0,artist,reviews
0,Various Artists,109827
1,,41886
2,VARIOUS ARTISTS,32563
3,Various,29866
4,The Beatles,25645
5,Rated: Unrated,24659
6,Elvis Presley,16596
7,Pink Floyd,15711
8,The Rolling Stones,12765
9,Bob Dylan,12607
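
The list above counts "Various Artists" and "VARIOUS ARTISTS" separately because it groups on the raw `creator` string. Since `creator_search` appears to hold a lowercased, punctuation-free form of the name, grouping on it should merge such variants. The sketch below demonstrates the idea on made-up rows with a simplified schema; the `creator_search` values in it are assumptions about what `search_text` produces.

```python
import sqlite3

artist_db = sqlite3.connect(':memory:')
artist_db.executescript("""
CREATE TABLE product (id TEXT PRIMARY KEY, creator TEXT, creator_search TEXT, category TEXT);
CREATE TABLE review (product_id TEXT);
INSERT INTO product VALUES
  ('a', 'Various Artists', 'variousartists', 'Music'),
  ('b', 'VARIOUS ARTISTS', 'variousartists', 'Music'),
  ('c', 'The Beatles', 'beatles', 'Music');
INSERT INTO review VALUES ('a'), ('a'), ('b'), ('c');
""")

# Group on the normalized name; MIN(creator) picks one display spelling per group
merged = artist_db.execute("""
SELECT MIN(p.creator) AS artist, COUNT(*) AS reviews
FROM review r JOIN product p ON r.product_id = p.id
WHERE p.category = 'Music' AND p.creator_search <> ''
GROUP BY p.creator_search ORDER BY reviews DESC
""").fetchall()
print(merged)
```

The filter on `creator_search` also leaves out products with no artist, which show up as an empty entry in the output above.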


### Close Connection

In [67]:
conn.close()