# 02807 Final project: Recommendation system
A recommendation system for products in the __Digital Music__ category on __Amazon__. Products are suggested based on a short description entered by the user.
[**Data source**](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)

In [352]:
# Imports
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
import json
import gzip
import spacy
import warnings
# import os
import pandas as pd
import numpy as np
# import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN, KMeans
from scipy import sparse
from hdbscan import HDBSCAN
from collections import Counter, defaultdict
from lxml import html, etree
from nrclex import NRCLex
from transformers import AutoTokenizer, AutoModelWithLMHead
import preprocess_data as pre_process_data


# Load the data 

In [353]:
# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/meta_Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


__Data format__
   * `asin`: ID of the product, e.g. 0000031852
   * `title`: name of the product
   * `feature`: bullet-point format features of the product
   * `description`: description of the product
   * `price`: price in US dollars (at time of crawl)
   * `imageURL`: url of the product image
   * `imageURLHighRes`: url of the high resolution product image
   * `related`: related products (also bought, also viewed, bought together, buy after viewing)
   * `salesRank`: sales rank information
   * `brand`: brand name
   * `categories`: list of categories the product belongs to
   * `tech1`: the first technical detail table of the product
   * `tech2`: the second technical detail table of the product
   * `similar`: similar product table

_Note that several attributes are usually left blank for each product (the specific attributes differ from product to product)._ 


In [354]:
### Load the meta data
data = []
with gzip.open('Dataset/meta_Digital_Music.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# Total length of list, this number equals total number of products
print("Total number of items in the dataset: ", len(data))

Total number of items in the dataset:  74347
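
The loading step above keeps every attribute of every product in memory. As a minimal sketch, assuming only a few fields are needed later, the same gzipped JSON-lines format can be read while keeping a subset of keys. The helper name `load_json_lines`, its `fields` parameter, and the two sample rows are illustrative assumptions and not part of the original notebook.

```python
import gzip
import json
import os
import tempfile


def load_json_lines(path, fields=None):
    """Read a gzipped JSON-lines file into a list of dicts.

    If `fields` is given, keep only those keys (missing keys become None).
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if fields is not None:
                record = {key: record.get(key) for key in fields}
            records.append(record)
    return records


# Tiny hand-made file standing in for meta_Digital_Music.json.gz (hypothetical rows)
sample_rows = [
    {"asin": "0000000001", "title": "Example A", "description": ["first"], "price": "$1.00"},
    {"asin": "0000000002", "title": "Example B", "description": [], "price": ""},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row) + "\n")
    loaded = load_json_lines(sample_path, fields=["asin", "description"])

print(loaded)
```

Opening the file in text mode (`"rt"`) yields `str` lines directly, so no manual decoding is needed before `json.loads`.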


In [355]:
# convert list into pandas dataframe
df = pd.DataFrame.from_dict(data)

# set size of display in pandas
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 20 )

# column names of the dataframe
print("Columns of the dataset: ", df.columns)

print("Total length of the dataset: ", len(df))
# show dataframe with columns and rows
# df.head()
# df2.info()

Columns of the dataset:  Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')
Total length of the dataset:  74347


Filter the products based on the ASINs listed in `asin.csv`. 
This step is necessary because the dataset used for clustering is different from the dataset used for sentiment analysis and similar items. 

In [356]:
# keep only the products whose asin is listed in asin.csv
asins = pd.read_csv('Dataset/asin.csv')
df = df[df['asin'].isin(asins['asin'])]

print("Total length of the dataset after the update with asin", len(df))
print(df.columns)
print(type(df))

Total length of the dataset after the update with asin 45139
Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>


# Data pre-processing

- Remove empty descriptions
- Remove HTML tags
- Remove URLs
- Remove HTML hidden characters
- Remove punctuation
- Remove numbers
- Transform every word into lowercase
- Remove stop words
- Perform stemming 
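
The cleaning steps listed above are implemented in the project's `preprocess_data` module, which is not shown here. The following is only a rough, standard-library sketch of what such a pipeline could look like. The function name `clean_description`, the regular expressions, and the tiny stop-word set are assumptions for illustration (the notebook relies on NLTK's English stop-word list), and stemming is left out because it needs NLTK.

```python
import html
import re
import string

# Small illustrative stop-word list (assumption, much shorter than NLTK's list)
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "in", "of", "to", "i", "you", "we", "are"}

# Translation table that maps every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_description(text):
    """Apply the cleaning steps in order and return a space-joined token string."""
    text = re.sub(r"<[^>]+>", " ", text)                       # remove HTML tags
    text = html.unescape(text)                                 # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)         # remove URLs
    text = re.sub(r"[\x00-\x1f\x7f\u200b\ufeff]", " ", text)   # remove hidden/control characters
    text = text.translate(PUNCT_TO_SPACE)                      # remove punctuation
    text = re.sub(r"\d+", " ", text)                           # remove numbers
    text = text.lower()                                        # lowercase
    tokens = [token for token in text.split() if token not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)


cleaned = clean_description("<b>Learn Spanish</b> in a Flash! Visit http://example.com &amp; save 20%")
print(cleaned)
```

Stop words are filtered after lowercasing so that capitalized forms such as "The" are also removed.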

In [357]:
# Drop rows with no description (description is empty)
df = df[df['description'].map(lambda d: len(d)) > 0]
# df2.head()

In [358]:
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
4,[],,[1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will ...,,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8JC, B00007E8SE, B00000387A, B000I0QKB0, B000025Q0M, B000008QP3, B000FAMYIG, B0009WA252, B0016CP2GS, B016E9NE9Y, B000002UEN, B000T5MJN2, B003H8F4NA, B00004RC05, B000A0GP04, B00004RC01, B00004YNA4, B016E9NEN0, B00BS96UG0, B00BS96XJY, B00BS96Y9S, B00BS96...",,Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNVE, B00007E8SE, B00000DPJN, B00000387A, B0009WA252, B000008QP3, B00KYVH4VI]","<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music""/>",,,,1526146,[],[],
12,[],,"[Spanish Before You Know It - Gold Edition. Learn Spanish in a Flash!, , ]",,Spanish Before You Know It - Gold Edition,[],,Transparent Language,[],"1,153,345 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,"$9,600.61",545069882,[https://images-na.ssl-images-amazon.com/images/I/510aGyfLO5L._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/510aGyfLO5L.jpg],
13,[],,"[Just the CD. The Book has long since vanished. I great condition and it is a classic., , ]",,Puff the Magic Dragon,[],,,[],"242,922 in CDs &amp; Vinyl (","[B00YZ82TPW, B0009YA39U, 1402747829, B00U1CES3M, B000051NSZ, B00514KN3E]","<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,$6.14,545109620,[https://images-na.ssl-images-amazon.com/images/I/51n5-HdJCCL._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/51n5-HdJCCL.jpg],
19,[],,"[The seasons are changing and life slows down in the forest. Or does it? Join Mindy and her friends in Bluebell Woods as they scrabble for the biggest, juiciest blackberries ... carve the silliest and scariest jack-o-lanterns ... and share the true spirit of giving when they're snowed in for thr...",,"The Tales of Mindy Mousekins: Adventures Through the Seasons, Autumn - Winter","[0692384251, B004HVKAAI]",,Jacqueline Houston,[],"200,626 in CDs &amp; Vinyl (",[0692384251],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,,615897398,"[https://images-na.ssl-images-amazon.com/images/I/610CCoHtMyL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/619zvyeoqwL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/61d0wtGHDZL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/516OvXY0TdL._SS40_.jpg]","[https://images-na.ssl-images-amazon.com/images/I/610CCoHtMyL.jpg, https://images-na.ssl-images-amazon.com/images/I/619zvyeoqwL.jpg, https://images-na.ssl-images-amazon.com/images/I/61d0wtGHDZL.jpg, https://images-na.ssl-images-amazon.com/images/I/516OvXY0TdL.jpg]",
20,[],,"[The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes., , ]",,Avancemos! Musica del mundo hispano,[0618753222],,Various Artists,[],"521,475 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,$1.00,618866760,[https://images-na.ssl-images-amazon.com/images/I/512M0WP5tNL._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/512M0WP5tNL.jpg],


In [359]:
# each description is a list of strings: drop the empty strings and join the rest into one string
df.description = df.description.apply(lambda x: [string for string in x if string != ""])
df.description = df.description.apply(lambda x: " ".join(x))
df.iloc[0].description


"1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will You Do? 15. Rise Again"

In [360]:
df_similarity_scores = df.copy()

print("Example of description before preprocessing: ")
print(df.description.iloc[0:1])
df.description = df.description.apply(lambda x: pre_process_data.user_description_sentiment_analysis(x))

print()
print("Example of description after preprocessing: ")
print(df.description.iloc[0:1])


Example of description before preprocessing: 
4    1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will Y...
Name: description, dtype: object

Example of description after preprocessing: 
4    losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
Name: description, dtype: object


## Does any product contain different descriptions?  
Some products are not unique: both the asin and the description are duplicated. 
We drop the rows with duplicated descriptions so that each product appears only once.

After removing the duplicated products, each product is unique.

In [361]:
df_asin_description = df[["asin","description"]].copy()
df_asin_description.drop_duplicates(subset = "description", inplace=True)

df_asin_description

Unnamed: 0,asin,description
4,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
12,0545069882,spanish know gold edition learn spanish flash
13,0545109620,cd book long since vanished great condition classic
19,0615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
20,0618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes
...,...,...
74333,B01HDZM264,dub infused post punk nyu punk professor music label staubgold
74336,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
74338,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
74339,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks


# Sentiment Analysis

In [362]:
df_process = df_asin_description
df_process.head()

Unnamed: 0,asin,description
4,1526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
12,545069882,spanish know gold edition learn spanish flash
13,545109620,cd book long since vanished great condition classic
19,615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
20,618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes


In [363]:
# Suppressing warning about old version of spacy
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Applying Spacy affect model emotions
    nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

    
df_process['emotion_spacy'] = df_process.description.apply(lambda x: Counter([item.label_.lower() for item in nlp_affect(x).ents]))

In [364]:

# Extracting most significant emotion of a particular description
def get_most_significant_emotion(emotions):
    try:
        sign_emotion = max(emotions, key=emotions.get)
    except ValueError:
        sign_emotion = None
    return sign_emotion

df_process['most_significant_emotion_spacy'] = df_process.emotion_spacy.apply(lambda x: get_most_significant_emotion(x))

df_process.head(100)
save_csv = True
if save_csv:
    df_process.to_csv('digital_music.csv')



In [365]:

# Output of user emotion based on input.txt
with open("input.txt", "r") as file_input:
    text = file_input.read()
# reload the affect model so that this cell can also run on its own
nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

def measure_affect_score(sentence : str, nlp_affect):
    affect_percent = {'fear': 0.0, 'anger': 0.0, 'anticipation': 0.0, 'trust': 0.0, 'surprise': 0.0, 'positive': 0.0,
                      'negative': 0.0, 'sadness': 0.0, 'disgust': 0.0, 'joy': 0.0}
    emotions = []
    doc = nlp_affect(sentence)
    if len(doc.ents) != 0:
        for ent in doc.ents:
            emotions.append(ent.label_.lower())
        affect_counts = Counter()
        for emotion in emotions:
            affect_counts[emotion] += 1
        sum_values = sum(affect_counts.values())
        for key in affect_counts.keys():
            affect_percent.update({key: float(affect_counts[key]) / float(sum_values)})
    return affect_percent

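user_emotion_scores = measure_affect_score(text, nlp_affect)
# If no emotion word is detected, every score is 0.0 and max() would return the first key ('fear'),
# so fall back to None in that case
max_emotion = max(user_emotion_scores, key=user_emotion_scores.get) if any(user_emotion_scores.values()) else None
user_emotion = max_emotion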

print(user_emotion)



trust
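
The scoring above turns the emotion labels found in a text into shares that sum to 1 and then picks the largest one. A minimal stand-alone sketch of that idea, without spaCy, is shown below. The names `emotion_shares` and `dominant_emotion` and the sample labels are hypothetical. Unlike a dictionary pre-filled with zeros, where `max` returns the first key when nothing is detected, this version returns `None` for an empty input.

```python
from collections import Counter


def emotion_shares(labels):
    """Return the fraction of occurrences for each emotion label."""
    counts = Counter(label.lower() for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: count / total for emotion, count in counts.items()}


def dominant_emotion(labels):
    """Return the most frequent emotion, or None if no label was found."""
    shares = emotion_shares(labels)
    if not shares:
        return None
    return max(shares, key=shares.get)


# Hypothetical labels, standing in for the entity labels produced by the affect model
labels = ["TRUST", "joy", "trust", "positive"]
shares = emotion_shares(labels)
print(shares)
print(dominant_emotion(labels))
```

When two emotions are tied, `max` returns the one that was seen first, so the result depends on the order of the labels.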


In [366]:
# Find all items whose most significant emotion matches the user's emotion
import pandas as pd

# read file with all emotions 
df_emotion = pd.read_csv('digital_music.csv')  
# keep the rows where most_significant_emotion_spacy == user_emotion
filtered_df = df_emotion[df_emotion['most_significant_emotion_spacy'] == user_emotion]

# save the filtered rows 
filtered_df.to_csv('grouped_emotion.csv', index=False) 



In [367]:
df_emotion.columns

Index(['Unnamed: 0', 'asin', 'description', 'emotion_spacy',
       'most_significant_emotion_spacy'],
      dtype='object')

# Similar Items System
A program that reads the dataset, preprocesses the data, and outputs the most similar items based on a user's description of a product.

In [368]:
import json
from collections import defaultdict
import gzip
import pandas as pd
from lxml import html,etree
import numpy as np
import ipywidgets as widgets
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
from nltk.stem import PorterStemmer
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import os


# set stopwords vocabulary
nltk.download('stopwords')

# set tokenizer
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ariannabianchi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ariannabianchi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [369]:
df_asin_description = df_emotion[["asin","description"]].copy()
df_asin_description.drop_duplicates(subset = "description", inplace=True)
df_asin_description.dropna(subset=['description'], inplace=True)

df_asin_description

Unnamed: 0,asin,description
0,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
1,0545069882,spanish know gold edition learn spanish flash
2,0545109620,cd book long since vanished great condition classic
3,0615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
4,0618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes
...,...,...
16111,B01HDZM264,dub infused post punk nyu punk professor music label staubgold
16112,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
16113,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
16114,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks


In [370]:
df_asin_description.head()

Unnamed: 0,asin,description
0,1526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
1,545069882,spanish know gold edition learn spanish flash
2,545109620,cd book long since vanished great condition classic
3,615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
4,618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes


### Last step of preprocessing: stemming. 
Stemming is useful for comparing similar words.

In [371]:
# stemming of the words in descriptions
stemmer = PorterStemmer()  # create the stemmer once and reuse it for every description

def stemming_data(s):
    # return an empty string for missing or whitespace-only descriptions
    if not isinstance(s, str) or not s or s.isspace():
        return ''
    # tokenize, stem each token and join the stems back into one string
    tokens = word_tokenize(s)
    return ' '.join(stemmer.stem(word) for word in tokens)

In [372]:
# use a name other than `input` to avoid shadowing the built-in function
sample_text = " takes does goes "
stemming_data(sample_text)

'take doe goe'

In [373]:
print(df_asin_description.description.iloc[0:1])

0    losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
Name: description, dtype: object


In [374]:
df_asin_description.description = df_asin_description.description.apply(lambda x: stemming_data(x))

In [375]:
print(df_asin_description.description.iloc[0:1])

0    lose game wait shine never seen righteou broken heart look back saw lord jesu river love hittin road never jesu got ta hold life save save save rise
Name: description, dtype: object


## Creating shingles

In [376]:
# Given a string input, return the list of unique shingles of q consecutive tokens
def shingle(s, q, delimiter=' '):
    # non-string input (e.g. NaN) has no shingles
    if not isinstance(s, str):
        return []
    # split into words, or use the characters themselves when delimiter is ''
    if delimiter != '':
        words_list = s.split(delimiter)
    else:
        words_list = s
    all_shingles = []
    for i in range(len(words_list) - q + 1):
        all_shingles.append(delimiter.join(words_list[i:i+q]))
    return list(set(all_shingles))

In [377]:
# Apply shingles to the df_asin_description
df_asin_description["shingles"] = df_asin_description["description"].apply(lambda x: shingle(x, 3))
# aaa = df_asin_description["description"].apply(lambda x: shingle(x, 3))
# df_asin_description
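
To make the shingling step concrete, here is a small worked example on a five-word string. It is a simplified sketch: the helper `word_shingles` is a hypothetical stand-in that always splits on whitespace and returns a set, so it is not identical to the notebook's `shingle` function.

```python
def word_shingles(text, q):
    """Return the set of all runs of q consecutive words in text."""
    words = text.split()
    # range() is empty when the text has fewer than q words, giving an empty set
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}


example = word_shingles("lose game wait shine never", 3)
short = word_shingles("uk vinyl", 3)

print(sorted(example))
print(short)
```

A text with n words yields at most n - q + 1 shingles, so descriptions shorter than q words produce no shingles at all and can never match a query.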

### Similarity of sets
Computing the Jaccard similarity

In [378]:
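# function that takes an intersection set and a union set and returns the Jaccard similarity
def similarity(intersection_set, union_set):
    # two empty shingle sets have an empty union: return 0 instead of dividing by zero
    if not union_set:
        return 0.0
    return len(intersection_set)/len(union_set)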
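
The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|: the number of shared elements divided by the number of distinct elements overall. The sketch below works through one small case. The function name `jaccard` and the two shingle sets are made up for illustration, and the convention of returning 0.0 for two empty sets is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical shingle sets for a query and an item
query = {"eya flow mix", "flow mix remix", "mix remix thoma"}
item = {"flow mix remix", "mix remix thoma", "remix thoma fehlmann", "thoma fehlmann eya"}

# 2 shared shingles, 3 + 4 - 2 = 5 distinct shingles
score = jaccard(query, item)
print(score)
```

Because the denominator grows with the size of both sets, a long query compared with a short description gets a low score even when every shingle of the description appears in the query.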

In [379]:
# input = "In the dynamic landscape of higher education, universities are continually redefining the traditional boundaries of learning. The integration of arts, music, and literature has become a cornerstone in fostering a holistic educational experience. At the heart of this transformation is the commitment to connect students with a diverse range of disciplines, preparing them not only for academic success but also for a life enriched by creativity and cultural understanding. In this context, universities such as New School are pioneering integrated learning models that transcend conventional subject silos. Their innovative approach, backed by cutting-edge teaching methodologies, empowers students to explore the intersections of arts, music, and literature. The vision goes beyond a mere confluence of disciplines; it seeks to create an immersive educational environment where students can seamlessly weave their academic pursuits into the fabric of their daily lives. One key player in this educational evolution is McGraw, a renowned arts author whose work has become a guiding light for both educators and students alike. McGraw's contributions extend beyond the conventional boundaries of a university classroom, resonating with a global audience. His writings not only inspire a love for the arts but also emphasize the transformative power of integrated learning in shaping well-rounded individuals. The concept of an integrated learning environment transcends the boundaries of time and space. It is not confined to the four walls of a classroom; rather, it permeates every facet of a student's journey. In this dynamic world, students are no longer passive recipients of knowledge but active participants in a vibrant community of learners. The university becomes a nexus where diverse ideas converge, fostering a collaborative spirit that extends far beyond graduation. In this interconnected world, the New School's commitment to integrated learning is a beacon of innovation. 
# Students are not just acquiring knowledge; they are forging connections between seemingly disparate fields, discovering the harmonies between arts and sciences, and navigating the rhythms of a multicultural world. This transformative journey prepares them to navigate the complexities of the modern world with a deep appreciation for diversity and a keen sense of intellectual curiosity. As we stand at the intersection of arts, music, and literature, the integrated learning paradigm championed by universities like New School, guided by visionary authors such as McGraw, is shaping the future of education. It is a testament to the idea that learning is not a compartmentalized experience but a symphony of knowledge, where every note, every discipline, plays a crucial role in the harmonious melody of life."

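# read the user's text (named user_text to avoid shadowing the built-in `input`)
with open("input.txt", "r") as file_input:
    user_text = file_input.read()
# print(user_text)
user_description = pre_process_data.user_description_sentiment_analysis(user_text)
# note: the product descriptions were stemmed before shingling, while the user text is not stemmed here;
# applying stemming_data(user_description) first would compare both sides on stems
user_description = shingle(user_description, 3)  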
# intersection_set = set(user_description).intersection(set(df_asin_description.shingles.iloc[0]))
# union_set = set(user_description).union(set(df_asin_description.shingles.iloc[0]))
# # perform similarity
# sim = similarity(intersection_set, union_set)
# print(sim)


In [380]:
# df_asin_description
df_asin_description["similarity"] = df_asin_description["shingles"].apply(lambda x: similarity(set(user_description).intersection(set(x)), set(user_description).union(set(x))))
df_asin_description


Unnamed: 0,asin,description,shingles,similarity
0,0001526146,lose game wait shine never seen righteou broken heart look back saw lord jesu river love hittin road never jesu got ta hold life save save save rise,"[look back saw, wait shine never, jesu got ta, never seen righteou, saw lord jesu, got ta hold, jesu river love, seen righteou broken, river love hittin, hittin road never, save save rise, love hittin road, lose game wait, game wait shine, road never jesu, shine never seen, ta hold life, broken ...",0.0
1,0545069882,spanish know gold edit learn spanish flash,"[gold edit learn, spanish know gold, edit learn spanish, know gold edit, learn spanish flash]",0.0
2,0545109620,cd book long sinc vanish great condit classic,"[cd book long, book long sinc, great condit classic, long sinc vanish, vanish great condit, sinc vanish great]",0.0
3,0615897398,season chang life slow forest join mindi friend bluebel wood scrabbl biggest juiciest blackberri carv silliest scariest jack lantern share true spirit give snow three whole day master storytel jacquelin houston bring mindi mousekin world aliv heartwarm tale fun adventur,"[tale fun adventur, scrabbl biggest juiciest, houston bring mindi, forest join mindi, scariest jack lantern, day master storytel, whole day master, juiciest blackberri carv, spirit give snow, aliv heartwarm tale, heartwarm tale fun, lantern share true, give snow three, slow forest join, bring mi...",0.0
4,0618866760,msica del mundo hispano audio cd includ authent music around spanish speak world lyric includ liner note,"[audio cd includ, authent music around, lyric includ liner, spanish speak world, includ liner note, hispano audio cd, speak world lyric, del mundo hispano, cd includ authent, mundo hispano audio, includ authent music, music around spanish, world lyric includ, msica del mundo, around spanish speak]",0.0
...,...,...,...,...
16111,B01HDZM264,dub infus post punk nyu punk professor music label staubgold,"[professor music label, nyu punk professor, infus post punk, punk nyu punk, punk professor music, music label staubgold, dub infus post, post punk nyu]",0.0
16112,B01HG2DW1I,track list butter ball zaq attack zona walk like guv sentiment pacif daylight trombon institut technolog san jose fog citi show crb trombon giant,"[crb trombon giant, guv sentiment pacif, daylight trombon institut, like guv sentiment, track list butter, citi show crb, sentiment pacif daylight, pacif daylight trombon, san jose fog, jose fog citi, ball zaq attack, attack zona walk, zaq attack zona, show crb trombon, butter ball zaq, list but...",0.0
16113,B01HH5R7LK,coldplay head full dream tour live etihad stadium manchest england june th cd intro head full dream yellow everi teardrop waterfal scientist bird paradis everglow lover japan magic clock midnight charli brown hymn weekend fix hero viva la vida cd adventur lifetim kaleidoscop troubl see soon amaz...,"[day sky full, live etihad stadium, nme award viva, vida charli brown, stadium manchest england, brown hymn weekend, feat beyonc bruno, clock midnight charli, waterfal scientist bird, kaleidoscop troubl see, england june th, coldplay full super, fix hero viva, vida cd adventur, troubl see soon, ...",0.0
16114,B01HH68B96,known live version that way life goe steam blacktop witha demo version superfici love sang hughi instead chri hick,"[superfici love sang, instead chri hick, witha demo version, known live version, steam blacktop witha, love sang hughi, hughi instead chri, goe steam blacktop, way life goe, live version that, version superfici love, sang hughi instead, demo version superfici, life goe steam, version that way, b...",0.0


Dataframe sorted by similarity

In [381]:

df_asin_description.sort_values(by="similarity", ascending=False, inplace=True)
df_asin_description

Unnamed: 0,asin,description,shingles,similarity
746,B00004NK4G,tracklist eya authent eya flow mix remix thoma fehlmann eya green velvet funk mix remix green velvet eya pro plu mix remix skage smoke eya hardwar mix remix futur forc inc eya cityscap mix remix futur forc inc,"[flow mix remix, eya flow mix, tracklist eya authent, funk mix remix, eya pro plu, forc inc eya, smoke eya hardwar, skage smoke eya, mix remix thoma, mix remix futur, cityscap mix remix, pro plu mix, eya authent eya, thoma fehlmann eya, green velvet funk, plu mix remix, eya cityscap mix, remix g...",0.037975
756,B00004R8UB,inspir melod groov beat rapper solo singer put top chill babi chill multi instrumentalist shane doctor faber man behind music equal home either side record consol write produc materi jeepjazz combin program live play digit analog record techniqu shane uniqu sens melodi composit come togeth follo...,"[groov beat rapper, sens melodi composit, man behind music, solo singer put, put top chill, shane uniqu sens, doctor faber man, consol write produc, top chill babi, melodi composit come, inspir melod groov, music new cd, soon releas film, record techniqu shane, analog record techniqu, babi chill...",0.010050
0,0001526146,lose game wait shine never seen righteou broken heart look back saw lord jesu river love hittin road never jesu got ta hold life save save save rise,"[look back saw, wait shine never, jesu got ta, never seen righteou, saw lord jesu, got ta hold, jesu river love, seen righteou broken, river love hittin, hittin road never, save save rise, love hittin road, lose game wait, game wait shine, road never jesu, shine never seen, ta hold life, broken ...",0.000000
10744,B00400GO5G,westworld action uk vinyl,"[westworld action uk, action uk vinyl]",0.000000
10749,B00403MAFQ,koda kumi sing masterpiec love forev,"[masterpiec love forev, koda kumi sing, sing masterpiec love, kumi sing masterpiec]",0.000000
...,...,...,...,...
5379,B000QWC89I,contain song like avalon look love mood indigo want littl girl,"[want littl girl, avalon look love, like avalon look, song like avalon, contain song like, mood indigo want, indigo want littl, love mood indigo, look love mood]",0.000000
5380,B000QWBXNU,includ navi hymn day holi holi holi lead kindli light prais god bless flow fairest lord jesu nearer god thee onward christian soldier mighti fortress stand jesu old rug cross abid rock age friend jesu,"[day holi holi, mighti fortress stand, flow fairest lord, navi hymn day, prais god bless, light prais god, bless flow fairest, holi holi lead, god bless flow, hymn day holi, nearer god thee, god thee onward, includ navi hymn, stand jesu old, fortress stand jesu, rock age friend, soldier mighti f...",0.000000
5381,B000QWHEGK,contain song like wish knew new know love noth,"[know love noth, wish knew new, song like wish, contain song like, new know love, knew new know, like wish knew]",0.000000
5382,B000QX6M5I,bbc concert orchestra miguel harth bedoya conductor includ moncayo huapango ginastera estancia ballet suit de falla popular spanish song piazzolla tangazo gershwin cuban overtur plaza nocturna,"[cuban overtur plaza, orchestra miguel harth, gershwin cuban overtur, falla popular spanish, bedoya conductor includ, ginastera estancia ballet, miguel harth bedoya, de falla popular, piazzolla tangazo gershwin, conductor includ moncayo, harth bedoya conductor, overtur plaza nocturna, concert or...",0.000000


In [382]:
asin_most_similar = df_asin_description.asin.iloc[0]

In [383]:
print("Similarity of items")
print(df_asin_description.iloc[:20].similarity)

Similarity of items
746      0.037975
756      0.010050
0        0.000000
10744    0.000000
10749    0.000000
10748    0.000000
10747    0.000000
10746    0.000000
10745    0.000000
10743    0.000000
10751    0.000000
10742    0.000000
10741    0.000000
10740    0.000000
10739    0.000000
10738    0.000000
10750    0.000000
10753    0.000000
10752    0.000000
10736    0.000000
Name: similarity, dtype: float64


### Create a CSV file with asin, description and similarity

In [384]:
# Clean the descriptions with the project's preprocessing helper
df_similarity_scores.description = df_similarity_scores.description.apply(pre_process_data.html_url_hidden_chars)
# Join the cleaned descriptions with the similarity scores on the product ID
df_1 = df_similarity_scores[["asin", "description"]].copy()
df_2 = df_asin_description[["asin", "similarity"]].copy()
similarity_df = pd.merge(df_1, df_2, on='asin')
similarity_df.columns

Index(['asin', 'description', 'similarity'], dtype='object')

In [385]:
similarity_df.head()

Unnamed: 0,asin,description,similarity
0,1526146,1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will Y...,0.0
1,545069882,Spanish Before You Know It - Gold Edition. Learn Spanish in a Flash!,0.0
2,545109620,Just the CD. The Book has long since vanished. I great condition and it is a classic.,0.0
3,615897398,"The seasons are changing and life slows down in the forest. Or does it? Join Mindy and her friends in Bluebell Woods as they scrabble for the biggest, juiciest blackberries ... carve the silliest and scariest jack-o-lanterns ... and share the true spirit of giving when they're snowed in for thre...",0.0
4,618866760,The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes.,0.0


In [386]:
# Sort by similarity, most similar products first
similarity_dfd = similarity_df.sort_values(by="similarity", ascending=False)
similarity_dfd.head()

Unnamed: 0,asin,description,similarity
1123,B00004NK4G,Tracklist: 1 Eya (Authentic) 5:13 2 Eya (Flow Mix #2) Remix Thomas Fehlmann 7:27 3 Eya (Green Velvet Funk Mix) Remix Green Velvet 7:53 4 Eya (Pro Plus Mix) Remix Skage + Smokes 5:31 5 Eya (Hardware Mix) Remix Future Forces Inc 6:02 6 Eya...,0.037975
1122,B00004NK4G,Tracklist: 1 Eya (Authentic) 5:13 2 Eya (Flow Mix #2) Remix Thomas Fehlmann 7:27 3 Eya (Green Velvet Funk Mix) Remix Green Velvet 7:53 4 Eya (Pro Plus Mix) Remix Skage + Smokes 5:31 5 Eya (Hardware Mix) Remix Future Forces Inc 6:02 6 Eya...,0.037975
1143,B00004R8UB,"70's inspired melodic grooves over 90's beats. No rappers, no solos, no singers. Put the top down and chill, baby, chill. Multi-instrumentalist Shane 'the doctor' Faber is the man behind the music. Equally at home on either side of the recording console, he writes and produces all of the materia...",0.01005
1142,B00004R8UB,"70's inspired melodic grooves over 90's beats. No rappers, no solos, no singers. Put the top down and chill, baby, chill. Multi-instrumentalist Shane 'the doctor' Faber is the man behind the music. Equally at home on either side of the recording console, he writes and produces all of the materia...",0.01005
11783,B00347DMN8,LP - The Lettermen She Cried 1964 T 2142 Capitol Records Label; Mono version,0.0


In [387]:
similarity_dfd.iloc[:30].to_csv("similarity_results.csv", sep='\t')

## Adding clustering based on user reviews

In [388]:
# Imports
import json
import gzip
import os
import pandas as pd
from scipy.sparse import dok_matrix
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

In [389]:
# Loading the df example
# df = pd.read_pickle('df_example.pkl')
df = similarity_dfd.copy()

In [390]:
import numpy as np

# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


### Unpacking the reviews dataset and getting only the ratings

In [391]:
data = []
with gzip.open('Dataset/Digital_Music.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# Total length of the list; each entry is one review, so this is the number of reviews
print("Total number of items in the dataset: ", len(data))
# Convert list into pandas dataframe
df_rating = pd.DataFrame.from_dict(data)
df_rating = df_rating[['overall', 'reviewerID', 'asin']]

Total number of items in the dataset:  1584082


### Filters used for narrowing down the user reviews
* Using only products from the previous parts (they must have a description, etc.)
* Using only 'positive' reviews (overall rating >= 3)
* Using only users with more than 1 review
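
The three filters can be sketched on a toy table before applying them to the full dataset. This is only an illustration: `toy_ratings` and `known_products` are hypothetical stand-ins for the ratings frame and the set of products kept from the previous parts.

```python
import pandas as pd

# Toy ratings table (hypothetical values, for illustration only)
toy_ratings = pd.DataFrame({
    "overall":    [5.0, 2.0, 4.0, 3.0, 5.0],
    "reviewerID": ["u1", "u1", "u2", "u3", "u3"],
    "asin":       ["p1", "p2", "p1", "p3", "p9"],
})
known_products = {"p1", "p2", "p3"}  # assumed: products kept from the earlier parts

# One boolean mask per filter
is_known = toy_ratings.asin.isin(known_products)
is_positive = toy_ratings.overall >= 3
review_counts = toy_ratings.reviewerID.value_counts()
is_experienced = toy_ratings.reviewerID.map(review_counts) > 1

# A review survives only if it passes all three filters
toy_filtered = toy_ratings[is_known & is_positive & is_experienced]
print(toy_filtered)
```

Note that the review count is taken over all reviews of a user, not only over the ones that pass the other two filters.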

In [392]:
# Get the 'asin' (unique identifier) of every product used so far
unique_asin = set(df.asin)
# Keep only products that are used in the previous parts
idxs = df_rating.asin.isin(unique_asin)
# Keep only 'positive' reviews (overall >= 3)
positive_ratings = df_rating.overall >= 3
# Keep only users with more than 1 review
unique_user = [key for key, value in Counter(df_rating.reviewerID).items() if value > 1]
experienced_users = df_rating.reviewerID.isin(unique_user)
# Combine all filters into a single boolean mask
all_filters = idxs & positive_ratings & experienced_users

# Apply the filters
df_rating = df_rating[all_filters]

# Update the set of rated products
unique_asin = set(df_rating.asin)
unique_user = set(df_rating.reviewerID)
print('Number of users: ', len(unique_user))
print('Number of products: ', len(unique_asin))

Number of users:  20715
Number of products:  16115


In [393]:
# Create lookups that map a user ID (product asin) to its matrix index, and back again
product2idx = {}
idx2product = []
user2idx = {}
idx2user = []
for idx, product in enumerate(unique_asin):
    product2idx[product] = idx
    idx2product.append(product)
for idx, user in enumerate(unique_user):
    user2idx[user] = idx
    idx2user.append(user)
# Create a sparse matrix in which each row is a user and each column is a product
# (float32 is used because float16 is not a dtype that scipy.sparse supports)
product_user_matrix = dok_matrix((len(user2idx), len(product2idx)), dtype=np.float32)
for _, row in df_rating.iterrows():
    product_user_matrix[user2idx[row.reviewerID], product2idx[row.asin]] = 1
print('Matrix size: ', product_user_matrix.shape)
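
Filling a `dok_matrix` row by row with `iterrows` can be slow on a table of this size. A possible alternative, shown here only as a sketch on toy data, is to compute integer codes with `pd.factorize` and build the matrix in one call with `coo_matrix`. The names `toy_reviews`, `toy_idx2user`, `toy_idx2product` and `toy_matrix` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy (user, product) pairs, for illustration only
toy_reviews = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u3"],
    "asin":       ["p1", "p2", "p1", "p3"],
})

# pd.factorize assigns consecutive integer codes and returns the unique labels
user_codes, toy_idx2user = pd.factorize(toy_reviews.reviewerID)
product_codes, toy_idx2product = pd.factorize(toy_reviews.asin)

# Build the users x products matrix in one call
values = np.ones(len(toy_reviews), dtype=np.float32)
toy_matrix = coo_matrix(
    (values, (user_codes, product_codes)),
    shape=(len(toy_idx2user), len(toy_idx2product)),
).tocsr()
# Duplicate (user, product) pairs are summed during conversion; reset them to 1
toy_matrix.data[:] = 1

print(toy_matrix.toarray())
```

The label arrays returned by `pd.factorize` play the role of the index-to-ID lists, so a separate dictionary is only needed for the reverse lookup.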

### Applying the clustering algorithm

In [None]:
# Perform NMF to get clusters
model = NMF(n_components=10, verbose=False, max_iter=1000, random_state=42)
W = model.fit_transform(product_user_matrix)
H = model.components_
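
To make the roles of the two factors concrete, here is a small sketch on a hypothetical 4 x 4 matrix (`toy`, `toy_W` and `toy_H` are illustrative names). With users in rows and products in columns, `fit_transform` returns one row of component weights per user, and `components_` holds one column of component weights per product.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy users x products matrix with two obvious groups of products
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

toy_model = NMF(n_components=2, max_iter=1000, random_state=42)
toy_W = toy_model.fit_transform(toy)   # shape (n_users, n_components)
toy_H = toy_model.components_          # shape (n_components, n_products)

print(toy_W.shape, toy_H.shape)
# Strongest component for each product (one value per column of H)
print(toy_H.argmax(axis=0))
```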

In [None]:
# Check that every product has a non-zero weight in at least one component
non_zeros_cols = np.count_nonzero(H, axis=0)
print((non_zeros_cols > 0).sum() == len(unique_asin))
# Transposing the H matrix
H = H.transpose()

True


In [None]:
# Normalizing the H matrix and applying cosine similarity
row_sums = H.sum(axis=1)
H_normalized = H / row_sums[:, np.newaxis]
similarities = cosine_similarity(H_normalized)
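
Cosine similarity depends only on the direction of each vector, so dividing every row by its (positive) sum should leave the similarity matrix unchanged. The following sketch checks this on a hypothetical `toy_H`; it is meant as a sanity check, not as a replacement for the cell above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_H = np.array([
    [0.2, 0.0, 1.8],
    [1.0, 0.0, 9.0],   # same direction as the first row, different scale
    [3.0, 1.0, 0.0],
])

# Same row normalization as in the notebook cell
toy_normalized = toy_H / toy_H.sum(axis=1)[:, np.newaxis]

sim_raw = cosine_similarity(toy_H)
sim_normalized = cosine_similarity(toy_normalized)

# Both matrices agree up to floating-point error
print(np.allclose(sim_raw, sim_normalized))
```

The normalization is therefore harmless here, but it would divide by zero for a product whose row sums to zero, which is why the non-zero check before it matters.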

In [None]:
n = 5
suggested_product_idx = product2idx[asin_most_similar]
suggested_product = idx2product[suggested_product_idx]
# Get the indices of the n nearest neighbors (argpartition does not sort them)
ind = np.argpartition(similarities[:, suggested_product_idx], -n)[-n:]
# If the suggested product itself is among them, take n+1 neighbors and drop it
if suggested_product_idx in ind:
    ind = np.argpartition(similarities[:, suggested_product_idx], -n-1)[-n-1:]
    ind = np.delete(ind, np.where(ind == suggested_product_idx))
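
`np.argpartition` returns the top entries in arbitrary order. If a ranked list is wanted, one option is to mask out the query product first and then sort only the selected candidates. The helper below is a hypothetical sketch (`top_n_neighbors` and `toy_sim` are not part of the notebook).

```python
import numpy as np

def top_n_neighbors(similarity_matrix, query_idx, n=5):
    """Return the indices of the n items most similar to query_idx, best first."""
    scores = similarity_matrix[:, query_idx].astype(float)  # astype returns a copy
    scores[query_idx] = -np.inf          # never recommend the query itself
    n = min(n, len(scores) - 1)
    if n <= 0:
        return np.array([], dtype=int)
    # Select the n largest scores, then order just those from best to worst
    candidates = np.argpartition(scores, -n)[-n:]
    return candidates[np.argsort(scores[candidates])[::-1]]

# Toy symmetric similarity matrix, for illustration only
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])
print(top_n_neighbors(toy_sim, query_idx=0, n=2))
```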

In [None]:
# Suggested product description
df.loc[df.asin == idx2product[suggested_product_idx]].description

4    The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes.
Name: description, dtype: object

In [None]:
# 5 nearest neighbors description
# df.loc[df.asin.isin([idx2product[i] for i in ind])].description
df.head()

Unnamed: 0,asin,description,similarity
4,0618866760,The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes.,0.043011
1,0545069882,Spanish Before You Know It - Gold Edition. Learn Spanish in a Flash!,0.011628
11783,B00347DMN8,LP - The Lettermen She Cried 1964 T 2142 Capitol Records Label; Mono version,0.0
11789,B0034Z3AZ0,"RELEASED 2009. TONY IS ONE OF IRELAND'S COUNTRY LEGENDS,",0.0
11788,B0034V1WPY,The Best Of Judy Garland This mono 2-LP Record album was released as MCA Records MCA2-4003 in 1980...this album is a reissue of the original released as Decca DXP 7172 and the set with same MCA number in 1973. Judy Garland (vocals) sings on these tracks recorded between 1937 and 1945. The tra...,0.0


In [None]:
# Write the descriptions of the nearest neighbors to output.txt
with open("output.txt", "w") as file:
    for i in ind:
        print(i)
        file.write(str(df.loc[df.asin == idx2product[i]].description))
        file.write("\n")

10141
6903
3911
3908
8660
