# Amazon Digital Music Recommendation System 

## Overview

> In this project, I aim to develop a collaborative filtering recommendation system tailored for Amazon digital music. By leveraging user interactions with music items, such as ratings or purchase histories, the system will analyze patterns and similarities among users and items to generate personalized music recommendations. The project will involve preprocessing the Amazon digital music dataset, training various collaborative filtering models, and evaluating their performance using metrics such as accuracy and coverage. Ultimately, the goal is to deploy a robust recommendation system that enhances the user experience by providing relevant and personalized music suggestions based on their preferences and behaviors.


## Business Understanding

> In the dynamic landscape of digital music, platforms like Amazon face the perpetual challenge of enhancing user engagement and satisfaction. With an abundance of music choices available, users often struggle to discover content that resonates with their preferences. To address this, Amazon is implementing a collaborative and content-based filtering recommendation system aimed at providing personalized music suggestions. This initiative serves the needs of both users, who seek a streamlined music discovery experience, and Amazon, which aims to boost user retention, loyalty, and ultimately revenue. By using user data to tailor recommendations, Amazon fosters a more enjoyable user experience and may also increase sales through greater engagement with relevant music content.

## Data Understanding

> The dataset comes from the compiled Amazon review data, which can be found [here](https://nijianmo.github.io/amazon/index.html). It consists of two gzipped JSON files: the reviews and the product metadata. Because of their size, the files could not be uploaded to GitHub, but they can be downloaded from the link above.

> Because the rating distribution is not normal, it could influence the recommendation model. We therefore generate a new normalized rating column by subtracting each reviewer's average rating (grouped by `reviewerID`) from the original rating.
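
The normalization described above can be sketched with a grouped transform. This is a minimal illustration on a toy frame, not the project's final code: `overall` and `reviewerID` are the dataset's own columns, while the values and the column name `rating_centered` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the review data: real column names, made-up values.
reviews = pd.DataFrame({
    "reviewerID": ["A", "A", "B", "B", "B"],
    "asin": ["x1", "x2", "x1", "x3", "x4"],
    "overall": [5, 3, 4, 4, 1],
})

# Mean rating per reviewer, broadcast back to every row of that reviewer.
reviewer_mean = reviews.groupby("reviewerID")["overall"].transform("mean")

# Centered rating: positive means "above this reviewer's usual rating".
reviews["rating_centered"] = reviews["overall"] - reviewer_mean
```

A reviewer who rates everything 5 ends up with centered ratings of 0, so the model sees relative preference rather than that reviewer's overall generosity.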

In [1]:
#imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import gzip
import warnings
%matplotlib inline

### Load Data

> This project uses two large gzipped JSON files. One contains individual user reviews; the other contains metadata for Amazon Digital Music products.

In [2]:
path_music='./Data/Digital_Music.json.gz'

# Open the compressed file using gzip and read JSON content
with gzip.open(path_music, 'rt', encoding='utf-8') as file:
    # Use pd.read_json to parse the JSON content into a DataFrame
    music = pd.read_json(file, lines=True)

In [3]:
# Specify the path to your compressed JSON file
path_meta = './Data/meta_Digital_Music.json.gz'

# Open the compressed file using gzip and read JSON content
with gzip.open(path_meta, 'rt', encoding='utf-8') as file:
    # Use pd.read_json to parse the JSON content into a DataFrame
    meta = pd.read_json(file, lines=True)
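
Since both files are large, reading them in chunks and keeping only the needed columns is one way to lower peak memory. The sketch below is an assumption-laden alternative to the cells above, not what the notebook does: the helper `read_json_lines_gz`, the chunk size, and the demo file are all hypothetical.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines_gz(path, usecols, chunksize=100_000):
    """Read a gzipped JSON-lines file in chunks, keeping only `usecols`."""
    parts = []
    # With lines=True and a chunksize, pd.read_json yields DataFrames one chunk at a time.
    # dtype=False asks pandas not to infer dtypes, so ID-like strings stay strings.
    reader = pd.read_json(path, lines=True, compression="gzip",
                          chunksize=chunksize, dtype=False)
    for chunk in reader:
        # reindex keeps the requested columns even if a chunk lacks one of them.
        parts.append(chunk.reindex(columns=usecols))
    return pd.concat(parts, ignore_index=True)


# Tiny demo file standing in for Digital_Music.json.gz (made-up records).
records = [
    {"overall": 5, "reviewerID": "U0", "asin": "B000", "summary": "a"},
    {"overall": 4, "reviewerID": "U1", "asin": "B001", "summary": "b"},
    {"overall": 5, "reviewerID": "U2", "asin": "B002", "summary": "c"},
    {"overall": 3, "reviewerID": "U3", "asin": "B003", "summary": "d"},
    {"overall": 5, "reviewerID": "U4", "asin": "B004", "summary": "e"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

demo = read_json_lines_gz(demo_path, ["overall", "reviewerID", "asin"], chunksize=2)
```

Dropping unused columns per chunk means the full twelve-column review frame never has to sit in memory at once.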


In [4]:
df_meta=pd.DataFrame(meta)

In [5]:
df_meta.shape

(74347, 19)

In [6]:
df_music=pd.DataFrame(music)

In [7]:
df_music.shape

(1584082, 12)

### Music Review

In [8]:
music.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1584082 entries, 0 to 1584081
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   overall         1584082 non-null  int64 
 1   verified        1584082 non-null  bool  
 2   reviewTime      1584082 non-null  object
 3   reviewerID      1584082 non-null  object
 4   asin            1584082 non-null  object
 5   style           1310814 non-null  object
 6   reviewerName    1584001 non-null  object
 7   reviewText      1582629 non-null  object
 8   summary         1583547 non-null  object
 9   unixReviewTime  1584082 non-null  int64 
 10  vote            124722 non-null   object
 11  image           6591 non-null     object
dtypes: bool(1), int64(2), object(9)
memory usage: 134.5+ MB


In [9]:
music.columns

Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image'],
      dtype='object')

In [11]:
df_meta.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,[],,[],,Master Collection Volume One,"[B000002UEN, B000008LD5, B01J804JKE, 747403435...",,John Michael Talbot,[],"58,291 in CDs & Vinyl (","[B000002UEN, B000008LD5, 7474034352, B000008LD...","<img src=""https://images-na.ssl-images-amazon....",,,$18.99,1377647,[],[],
1,[],,[],,Hymns Collection: Hymns 1 &amp; 2,"[5558154950, B00014K5V4]",,Second Chapter of Acts,[],"93,164 in CDs & Vinyl (","[B000008KJ3, B000008KJ0, 5558154950, B000UN8KZ...","<img src=""https://images-na.ssl-images-amazon....",,,,1529145,[],[],
2,[],,[],,Early Works - Don Francisco,"[B00004RC05, B003H8F4NA, B003ZFVHPO, B003JMP1Z...",,Don Francisco,[],"875,825 in CDs & Vinyl (","[B003H8F4NA, B003ZFVHPO, B003JMP1ZK, B00004RC0...","<img src=""https://images-na.ssl-images-amazon....",,,,1527134,[],[],
3,[],,[],,So You Wanna Go Back to Egypt,"[B0000275QQ, 0001393774, 0001388312, B0016CP2G...",,Keith Green,[],"203,263 in CDs & Vinyl (","[B00000I7JO, B0016CP2GS, 0001393774, B0000275Q...","<img src=""https://images-na.ssl-images-amazon....",,,$13.01,1388703,[],[],
4,[],,[1. Losing Game 2. I Can't Wait 3. Didn't He S...,,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8J...",,Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNV...","<img src=""https://images-na.ssl-images-amazon....",,,,1526146,[],[],


In [10]:
music.head(3)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"12 22, 2013",A1ZCPG3D3HGRSS,1388703,{'Format:': ' Audio CD'},mark l. massey,This is a great cd full of worship favorites!!...,Great worship cd,1387670400,,
1,5,True,"09 11, 2013",AC2PL52NKPL29,1388703,{'Format:': ' Audio CD'},Norma Mushen,"So creative! Love his music - the words, the ...",Gotta listen to this!,1378857600,,
2,5,True,"03 2, 2013",A1SUZXBDZSDQ3A,1388703,{'Format:': ' Audio CD'},Herbert W. Shurley,"Keith Green, gone far to early in his carreer,...",Great approach still gets the message out,1362182400,,


In [11]:
# List of columns to be dropped (each column listed once)
columns_music = ['verified', 'reviewTime', 'unixReviewTime', 'vote', 'image', 'style', 'summary', 'reviewerName']

In [12]:
# Remove the specified columns from the music dataset.
music_2=music.drop(columns_music , axis=1)

In [13]:
music_2

Unnamed: 0,overall,reviewerID,asin,reviewText
0,5,A1ZCPG3D3HGRSS,0001388703,This is a great cd full of worship favorites!!...
1,5,AC2PL52NKPL29,0001388703,"So creative! Love his music - the words, the ..."
2,5,A1SUZXBDZSDQ3A,0001388703,"Keith Green, gone far to early in his carreer,..."
3,5,A3A0W7FZXM0IZW,0001388703,Keith Green had his special comedy style of Ch...
4,5,A12R54MKO17TW0,0001388703,Keith Green / So you wanna go back to Egypt......
...,...,...,...,...
1584077,5,AR3KABMPL5L0O,B01HJ91P94,Casting Crowns....you do it so well! Awesome s...
1584078,4,A2N53GHW73INDH,B01HJ91P94,This band has produced many inspiring Christia...
1584079,5,ABNKLDCCVJKW1,B01HJ91P94,Awesome band and awesome song. This is my next...
1584080,5,AMWSDABZWFRAT,B01HJ91IVY,Excellent


In [14]:
# Remove rows with missing values from the cleaned music dataset.
music_2.dropna(inplace=True)  

In [15]:
music_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1582629 entries, 0 to 1584081
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   overall     1582629 non-null  int64 
 1   reviewerID  1582629 non-null  object
 2   asin        1582629 non-null  object
 3   reviewText  1582629 non-null  object
dtypes: int64(1), object(3)
memory usage: 60.4+ MB


In [16]:
# Remove duplicate rows from the DataFrame 'music_2' based on all columns
music_2.drop_duplicates(inplace=True)
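
`drop_duplicates()` with no arguments removes only rows that match in every column. Collaborative filtering models usually expect at most one rating per user-item pair, so it may be worth checking for repeated (`reviewerID`, `asin`) pairs as well. The sketch below uses toy data, and keeping the last occurrence is an assumption for illustration, not a choice the notebook makes.

```python
import pandas as pd

# Toy reviews: rows 0 and 1 are exact duplicates; row 2 is the same
# reviewer and product with a different rating and text.
toy = pd.DataFrame({
    "overall": [5, 5, 3, 4],
    "reviewerID": ["A", "A", "A", "B"],
    "asin": ["x1", "x1", "x1", "x1"],
    "reviewText": ["great", "great", "changed my mind", "good"],
})

# Step 1: what the notebook does, removing exact duplicate rows.
exact = toy.drop_duplicates()

# Step 2: count remaining rows that repeat a (reviewerID, asin) pair.
pair_dupes = int(exact.duplicated(subset=["reviewerID", "asin"]).sum())

# Step 3 (assumed policy): keep only the last rating per pair.
one_per_pair = exact.drop_duplicates(subset=["reviewerID", "asin"], keep="last")
```

Whether to keep the first, last, or mean rating per pair depends on the model; the count in step 2 indicates whether the question matters for this dataset.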

In [17]:
music_2.isna().sum()

overall       0
reviewerID    0
asin          0
reviewText    0
dtype: int64

In [18]:
music_2.head()

Unnamed: 0,overall,reviewerID,asin,reviewText
0,5,A1ZCPG3D3HGRSS,1388703,This is a great cd full of worship favorites!!...
1,5,AC2PL52NKPL29,1388703,"So creative! Love his music - the words, the ..."
2,5,A1SUZXBDZSDQ3A,1388703,"Keith Green, gone far to early in his carreer,..."
3,5,A3A0W7FZXM0IZW,1388703,Keith Green had his special comedy style of Ch...
4,5,A12R54MKO17TW0,1388703,Keith Green / So you wanna go back to Egypt......


In [23]:
df_meta.brand.nunique()

36464

### Meta Data

Add the format from the `style` column of the review data to the metadata.

In [19]:

# Define a function to extract the 'Format:' value from a dictionary
def extract_format(style_dict):
    if isinstance(style_dict, dict):
        return style_dict.get('Format:', 'Unknown')
    else:
        return 'Unknown'

# Extract the 'Format:' value from the 'style' column of df_music and store it in df_meta.
# Note: this assignment aligns the two frames on row index, not on 'asin'.
df_meta['style'] = df_music['style'].apply(extract_format)
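
The assignment above lines the two frames up by row index, so meta row *i* receives the format of review row *i* rather than the format of its own product. One alternative, sketched here on toy frames, keys the lookup on `asin`. Taking the most frequent format per product and stripping surrounding whitespace are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd


def extract_format(style_dict):
    """Return the stripped 'Format:' entry of a style dict, or 'Unknown'."""
    if isinstance(style_dict, dict):
        return style_dict.get("Format:", "Unknown").strip()
    return "Unknown"


# Toy stand-ins for df_music and df_meta.
reviews = pd.DataFrame({
    "asin": ["0001", "0001", "0001", "0002", "0003"],
    "style": [{"Format:": " Audio CD"}, {"Format:": " Audio CD"},
              {"Format:": " MP3 Music"}, {"Format:": " Vinyl"}, None],
})
meta = pd.DataFrame({"asin": ["0002", "0001", "0003", "0004"]})

# One format per review, ignoring reviews that carry no format.
formats = reviews.assign(format=reviews["style"].apply(extract_format))
formats = formats[formats["format"] != "Unknown"]

# Most frequent format per product (assumed tie-break: first mode).
format_by_asin = formats.groupby("asin")["format"].agg(lambda s: s.mode().iloc[0])

# Look up each product's format by asin; products without reviews get 'Unknown'.
meta["style"] = meta["asin"].map(format_by_asin).fillna("Unknown")
```

Because the lookup goes through `asin`, the order and length of the two frames no longer matter.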


In [20]:
# Columns to be dropped from the metadata
columns=['imageURL', 'main_cat','category' , 'imageURLHighRes' , 'rank' , 'also_buy' , 'feature' , 'tech2', 'tech1' , 'also_view', 'similar_item', 'date' , 'details' ,'fit' , 'price']

In [21]:
# Creating a new DataFrame 'df_meta_2' by dropping specified columns from the original DataFrame 'df_meta'
df_meta_2 = df_meta.drop(columns, axis=1)

In [22]:
df_meta_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74347 entries, 0 to 74346
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   description  74347 non-null  object
 1   title        74347 non-null  object
 2   brand        74347 non-null  object
 3   asin         74347 non-null  object
 4   style        74347 non-null  object
dtypes: object(5)
memory usage: 2.8+ MB


In [23]:
# asin_list=df_meta_2['asin'].tolist()

In [24]:
# Locate rows in 'df_meta_2' where the length of the 'description' column is equal to 0
df_meta_2.loc[df_meta_2['description'].apply(lambda x: len(x) == 0)].shape

(37821, 5)

> Rather than removing, and so losing, the 37,821 rows whose 'description' is an empty list, we replace those empty lists with the string 'Unknown'.

In [25]:
# Replace empty descriptions with 'Unknown' in the 'df_meta_2' DataFrame
df_meta_2.loc[df_meta_2['description'].apply(lambda x: len(x) == 0), 'description'] = 'Unknown'

In [26]:
df_meta_2.head()

Unnamed: 0,description,title,brand,asin,style
0,Unknown,Master Collection Volume One,John Michael Talbot,1377647,Audio CD
1,Unknown,Hymns Collection: Hymns 1 &amp; 2,Second Chapter of Acts,1529145,Audio CD
2,Unknown,Early Works - Don Francisco,Don Francisco,1527134,Audio CD
3,Unknown,So You Wanna Go Back to Egypt,Keith Green,1388703,Audio CD
4,[1. Losing Game 2. I Can't Wait 3. Didn't He S...,Early Works - Dallas Holm,Dallas Holm,1526146,Audio CD


### Export Cleaned Data to CSV
> To prepare the clean data for modeling, we export the processed DataFrames `music_2` and `df_meta_2` as the CSV files `music_review.csv` and `music_meta.csv` in the 'Data' folder. Saving the cleaned data makes it easy to reload for modeling and later analyses without repeating the steps above.


In [27]:
# Save the DataFrame 'music_2' to a CSV file named 'music_review.csv' in the './Data/' directory
music_2.to_csv('./Data/music_review.csv')

In [28]:
# Exporting the DataFrame 'df_meta_2' to a CSV file named 'music_meta.csv' in the './Data/' directory
df_meta_2.to_csv('./Data/music_meta.csv')
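
When these files are read back, `asin` values made only of digits (for example `0001388703`) can lose their leading zeros if pandas infers a numeric dtype. The round-trip sketch below uses an in-memory buffer in place of the real files; the `dtype` and `index_col` arguments are suggestions, not part of the original notebook.

```python
import io

import pandas as pd

# Toy frame with digit-only product IDs, as found in the metadata.
toy = pd.DataFrame({"overall": [5, 4], "asin": ["0001388703", "0001527134"]})

buffer = io.StringIO()
toy.to_csv(buffer)  # same call shape as above: the index is written as the first column

# Default read: pandas infers integers and the leading zeros are gone.
buffer.seek(0)
inferred = pd.read_csv(buffer, index_col=0)

# Forcing 'asin' to str keeps the IDs exactly as written.
buffer.seek(0)
as_text = pd.read_csv(buffer, index_col=0, dtype={"asin": str})
```

Reading `asin` as text in both files keeps the keys comparable when reviews and metadata are joined later.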