# Amazon Music Recommender

## Overview

>In this project, I aim to develop a collaborative filtering recommendation system tailored for Amazon digital music. By leveraging user interactions with music items, such as ratings or purchase histories, the system will analyze patterns and similarities among users and items to generate personalized music recommendations. The project will involve preprocessing the Amazon digital music dataset, training various collaborative filtering models, and evaluating their performance using metrics such as accuracy and coverage. Ultimately, the goal is to deploy a robust recommendation system that enhances the user experience by providing relevant and personalized music suggestions based on their preferences and behaviors

## Business Understanding

> In the dynamic landscape of digital music, platforms like Amazon face the perpetual challenge of enhancing user engagement and satisfaction. With an abundance of music choices available, users often struggle to discover content that resonates with their preferences. To address this, Amazon is implementing a collaborative and content based filtering recommendation system aimed at providing personalized music suggestions. This initiative serves the needs of both users, who seek streamlined music discovery experiences, and Amazon, which aims to boost user retention, loyalty, and ultimately, revenue. By leveraging user data to tailor recommendations, Amazon not only fosters a more enjoyable user experience but also potentially increases sales through enhanced engagement with relevant music content.

# Data Understanding

> The dataset, was pulled from a compiled dataset of Amazon.The data set can be found in [here](https://nijianmo.github.io/amazon/index.html).The data contains two zipped JSON files: the review and metadata. Due to the large size of the data, GitHub couldn't allow me to upload it here, but it can be found on the link I provided above.

> Given that the rating distribution is not normal, it could influence our recommendation system model. Hence, we'll generate a new normalized rating column by subtracting the average rating of each reviewID from the original rating.

In [1]:
#imports
import pandas as pd
import texthero as hero
from texthero import preprocessing
from texthero import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from nltk.stem import PorterStemmer

# Now you can use the PorterStemmer class


In [2]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [3]:
# Read the CSV file 'music_review.csv' from the './Data/' directory into a DataFrame named 'music_review2'
music_review2 = pd.read_csv('./Data/music_review.csv')

In [4]:
music_review2.head()

Unnamed: 0.1,Unnamed: 0,overall,reviewerID,asin,reviewText
0,0,5,A1ZCPG3D3HGRSS,1388703,This is a great cd full of worship favorites!!...
1,1,5,AC2PL52NKPL29,1388703,"So creative! Love his music - the words, the ..."
2,2,5,A1SUZXBDZSDQ3A,1388703,"Keith Green, gone far to early in his carreer,..."
3,3,5,A3A0W7FZXM0IZW,1388703,Keith Green had his special comedy style of Ch...
4,4,5,A12R54MKO17TW0,1388703,Keith Green / So you wanna go back to Egypt......


grouby by asin and join review text 

In [5]:
# Convert 'reviewText' column to strings
music_review2['reviewText'] = music_review2['reviewText'].astype(str)

# Group by 'asin' and join review text
grouped_reviews = music_review2.groupby('asin')['reviewText'].agg(lambda x: ' '.join(x))

# Convert the result back to a DataFrame
grouped_reviews_df = pd.DataFrame(grouped_reviews).reset_index()


In [6]:
grouped_reviews_df.head()

Unnamed: 0,asin,reviewText
0,1377647,"If you're looking for a meditative, contemplat..."
1,1388703,This is a great cd full of worship favorites!!...
2,1526146,"This is music from my younger years that I, as..."
3,1527134,"Don Francisco's ""Early Works"" are filled with ..."
4,1529145,"Discovering older Christian music, inspiration..."


In [7]:
grouped_reviews_df.shape

(456811, 2)

In [8]:
grouped_reviews_df

Unnamed: 0,asin,reviewText
0,0001377647,"If you're looking for a meditative, contemplat..."
1,0001388703,This is a great cd full of worship favorites!!...
2,0001526146,"This is music from my younger years that I, as..."
3,0001527134,"Don Francisco's ""Early Works"" are filled with ..."
4,0001529145,"Discovering older Christian music, inspiration..."
...,...,...
456806,B01HJ91RWE,Love this group!
456807,B01HJ91TDQ,"This was the song as I've heard it on T.V., do..."
456808,B01HJ91VJ8,"This is a beautiful, worshipful song that glor..."
456809,B01HJ91WOW,"Awesome Love, love, love it Love It,Anytime..."


In [9]:
import string

def clean_review(df):
    # Define NLTK stopwords and Porter Stemmer
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    
    # Define custom preprocessing pipeline
    def custom_pipeline(text):
        # Lowercase
        text = text.lower()
        
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and apply stemming
        tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
        
        # Join tokens back into text
        text = ' '.join(tokens)
        
        return text
    
    # Apply custom preprocessing pipeline to 'reviewText' column
    df['clean_text'] = df['reviewText'].apply(custom_pipeline)
    
    return df

In [10]:
clean_review(grouped_reviews_df)

Unnamed: 0,asin,reviewText,clean_text
0,0001377647,"If you're looking for a meditative, contemplat...",your look medit contempl tape perfect one bar ...
1,0001388703,This is a great cd full of worship favorites!!...,great cd full worship favorit time great keith...
2,0001526146,"This is music from my younger years that I, as...",music younger year musician use quit often chu...
3,0001527134,"Don Francisco's ""Early Works"" are filled with ...",francisco earli work fill uniqu sens passion l...
4,0001529145,"Discovering older Christian music, inspiration...",discov older christian music inspir beauti gif...
...,...,...,...
456806,B01HJ91RWE,Love this group!,love group
456807,B01HJ91TDQ,"This was the song as I've heard it on T.V., do...",song ive heard tv wish longer
456808,B01HJ91VJ8,"This is a beautiful, worshipful song that glor...",beauti worship song glorifi lord cant get enou...
456809,B01HJ91WOW,"Awesome Love, love, love it Love It,Anytime...",awesom love love love love itanytim


In [11]:
grouped_reviews_df.to_csv('./Data/grouped_reviews_df.csv', index=False)

In [12]:
grouped_reviews_df=pd.read_csv('./Data/grouped_reviews_df.csv')

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Assuming grouped_reviews_df is your DataFrame
# Fill missing values in 'clean_text' column with empty strings
grouped_reviews_df['clean_text'] = grouped_reviews_df['clean_text'].fillna('')

# Vectorize the review text using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)  # Limiting to top 1000 features

tfidf_matrix = tfidf_vectorizer.fit_transform(grouped_reviews_df['clean_text'])

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Concatenate the 'asin' column with the TF-IDF DataFrame
tfidf_df = pd.concat([grouped_reviews_df['asin'], tfidf_df], axis=1)


In [14]:
tfidf_df.head()

Unnamed: 0,asin,10,100,11,12,13,14,15,16,17,...,wrote,ye,yeah,year,york,youll,young,youth,youtub,youv
0,1377647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.067305,0.0,0.0,0.0,0.0,0.0,0.0
1,1388703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.05493,0.0,0.0,0.025288,0.0,0.0,0.0
2,1526146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.032802,0.0,0.0,0.09856,0.0,0.0,0.054448,0.0,0.0,0.0
3,1527134,0.0,0.0,0.0,0.0,0.0,0.0,0.035368,0.0,0.0,...,0.0,0.0,0.0,0.18971,0.0,0.028909,0.0,0.0,0.0,0.034821
4,1529145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
tfidf_df.to_csv('./Data/tfidf_df.csv', index=False)

In [16]:
music_meta = pd.read_csv('./Data/music_meta.csv')
music_meta.drop(columns =['Unnamed: 0'], inplace=True)

In [17]:
music_meta.head()

Unnamed: 0,description,title,brand,asin,style
0,Unknown,Master Collection Volume One,John Michael Talbot,1377647,Audio CD
1,Unknown,Hymns Collection: Hymns 1 &amp; 2,Second Chapter of Acts,1529145,Audio CD
2,Unknown,Early Works - Don Francisco,Don Francisco,1527134,Audio CD
3,Unknown,So You Wanna Go Back to Egypt,Keith Green,1388703,Audio CD
4,"[""1. Losing Game 2. I Can't Wait 3. Didn't He ...",Early Works - Dallas Holm,Dallas Holm,1526146,Audio CD


In [18]:
tfidf_df.set_index('asin', inplace=True)
music_meta.set_index('asin', inplace=True)

In [19]:
content_model = tfidf_df.join(music_meta['style'], on='asin', rsuffix='_music_meta')

In [20]:
content_model.head()

Unnamed: 0_level_0,10,100,11,12,13,14,15,16,17,20,...,ye,yeah,year,york,youll,young,youth,youtub,youv,style_music_meta
asin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1377647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.067305,0.0,0.0,0.0,0.0,0.0,0.0,Audio CD
1388703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.05493,0.0,0.0,0.025288,0.0,0.0,0.0,Audio CD
1526146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.09856,0.0,0.0,0.054448,0.0,0.0,0.0,Audio CD
1527134,0.0,0.0,0.0,0.0,0.0,0.0,0.035368,0.0,0.0,0.034105,...,0.0,0.0,0.18971,0.0,0.028909,0.0,0.0,0.0,0.034821,Audio CD
1529145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Audio CD


In [22]:
# content_model.to_csv('content_model.csv', index=False)

In [23]:
content_model = pd.get_dummies(content_model, columns=['style_music_meta'])

In [24]:
content_model.head()

Unnamed: 0_level_0,10,100,11,12,13,14,15,16,17,20,...,style_music_meta_ Audio CD,style_music_meta_ Audio Cassette,style_music_meta_ Blu-ray,style_music_meta_ DVD,style_music_meta_ DVD Audio,style_music_meta_ MP3 Music,style_music_meta_ Paperback,style_music_meta_ VHS Tape,style_music_meta_ Vinyl,style_music_meta_Unknown
asin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1377647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1388703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1526146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
1527134,0.0,0.0,0.0,0.0,0.0,0.0,0.035368,0.0,0.0,0.034105,...,1,0,0,0,0,0,0,0,0,0
1529145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0


In [26]:
content_model.to_csv('./Data/content_model_dummies.csv', index=False)

In [27]:
import faiss

In [45]:
# # Initialize a Faiss index
# dimension = content_model.shape[1]  # Dimension of vectors
# index = faiss.IndexFlatL2(dimension)  # L2 distance is used for similarity search

# # Add vectors to the index
# index.add(content_model.values.astype('float32'))  # Ensure data type is suitable for Faiss

# # Save the Faiss index
# faiss.write_index(index, "content_model.index")

# # Later, when you want to load the index and perform recommendation:

# # Load the index
# index = faiss.read_index("content_model.index")

# # Example: Perform similarity search for a target vector
# target_asin = '0001377647'  # Replace this with the ASIN of the item you want to find similar items for
# target_vector = content_model.loc[target_asin].values.astype('float32').reshape(1, -1)
# num_results = 10
# distances, indices = index.search(target_vector, num_results)

# # Retrieve similar ASINs from your database based on the indices
# similar_asins = content_model.iloc[indices[0]].index.tolist()

# # Print or use similar ASINs as needed
# print("Similar ASINs:", similar_asins)


Similar ASINs: ['0001377647', '555820690X', '0006935257', '5552256646', 'B00004UU0Y', '5558433892', '1565852443', '1565857879', '1577345479', '1631380362']


In [48]:
# def get_similar_titles(target_asin, num_results, content_model, music_meta):
#     # Load the index
#     index = faiss.read_index("content_model.index")
    
#     # Perform similarity search for the target vector
#     target_vector = content_model.loc[target_asin].values.astype('float32').reshape(1, -1)
#     distances, indices = index.search(target_vector, num_results)
    
#     # Retrieve similar ASINs from your database based on the indices
#     similar_asins = content_model.iloc[indices[0]].index.tolist()
    
#     # Get titles from music_meta based on similar ASINs
#     similar_titles = music_meta.loc[similar_asins]['title']
    
#     return similar_titles.tolist()


In [53]:
#Save the Faiss index
faiss.write_index(index, "content_model.index")

In [51]:
def get_similar_titles(target_asin, num_results, content_model, music_meta):
    # Load the index
    index = faiss.read_index("content_model.index")
    
    # Perform similarity search for the target vector using cosine similarity
    target_vector = content_model.loc[target_asin].values.astype('float32').reshape(1, -1)
    faiss.normalize_L2(target_vector)  # Normalize the target vector
    distances, indices = index.search(target_vector, num_results)

    # Retrieve similar ASINs from your database based on the indices
    similar_asins = content_model.iloc[indices[0]].index.tolist()
    
    # Get titles from music_meta based on similar ASINs
    similar_titles = music_meta.loc[similar_asins]['title']
    
    return similar_titles.tolist()


In [54]:
# Example usage:
target_asin = input('ASIN: ')
num_results = int(input('num_results? '))
similar_titles = get_similar_titles(target_asin, num_results, content_model, music_meta)

target_title = music_meta.loc[target_asin]['title']
similar_titles

ASIN: 0001377647
num_results? 5


['Master Collection Volume One',
 'The Ultimate Collection',
 'The Great Courses Must History Repeat the Great Conflicts of This Century?',
 'The Great Courses Great World Religions Christianity',
 'Christ the Lord Is Risen Today']

## Conclusion 

In conclusion, the content-based recommendation system, exemplified by the get_similar_titles function, plays a pivotal role in enhancing the music discovery experience on our platform. By analyzing intrinsic attributes such as title, description, and genre, content-based recommendation identifies music items with similar characteristics to a target item. This approach excels in recommending niche or lesser-known items that may not have garnered substantial user interactions yet. By focusing on the inherent features of items, content-based recommendation tailors recommendations to specific user preferences and interests, enriching the platform's recommendation capabilities and facilitating personalized music exploration.
