# 02807 Final project: Recommendation system
A recommendation system for products in the __Digital Music__ category on __Amazon__. Products are suggested based on a short description entered by the user.
[**Data source**](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)

In [587]:
# Imports
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
import json
import gzip
import spacy
import warnings
# import os
import pandas as pd
import numpy as np
# import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN, KMeans
from scipy import sparse
from hdbscan import HDBSCAN
from collections import Counter, defaultdict
from lxml import html, etree
from nrclex import NRCLex
from transformers import AutoTokenizer, AutoModelWithLMHead
import preprocess_data as pre_process_data


# Load the data 

In [588]:
# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/meta_Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


__Data format__
   * `asin`: ID of the product, e.g. 0000031852
   * `title`: name of the product
   * `feature`: bullet-point format features of the product
   * `description`: description of the product
   * `price`: price in US dollars (at time of crawl)
   * `imageURL`: url of the product image
   * `imageURLHighRes`: url of the high resolution product image
   * `related`: related products (also bought, also viewed, bought together, buy after viewing)
   * `salesRank`: sales rank information
   * `brand`: brand name
   * `categories`: list of categories the product belongs to
   * `tech1`: the first technical detail table of the product
   * `tech2`: the second technical detail table of the product
   * `similar`: similar product table

_Note that each product usually has several attributes left blank (which attributes are missing differs from product to product)._
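
To get a feel for how often attributes are blank, one could count the missing values per field. The sketch below is only an illustration: `count_missing` is a hypothetical helper, the two sample records are made up, and it assumes that an absent key, `None`, an empty string, and an empty list all count as "blank".

```python
from collections import Counter

def count_missing(records, fields):
    """Count, per field, how many records lack a usable value."""
    missing = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            # Treat absent keys, None, empty strings and empty lists as missing.
            if value is None or value == "" or value == []:
                missing[field] += 1
    return missing

# Two made-up records in the shape of the metadata file (values are illustrative).
sample_records = [
    {"asin": "A1", "title": "Album one", "description": [], "price": ""},
    {"asin": "A2", "title": "Album two", "description": ["Live recording"], "brand": "Some Label"},
]
missing_counts = count_missing(sample_records, ["title", "description", "price", "brand"])
print(dict(missing_counts))
```

Running the same function over the loaded `data` list would show which fields are reliable enough to build on.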


In [589]:
### Load the meta data
data = []
with gzip.open('Dataset/meta_Digital_Music.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# Total length of list, this number equals total number of products
print("Total number of items in the dataset: ", len(data))

Total number of items in the dataset:  74347


In [590]:
# convert list into pandas dataframe
df = pd.DataFrame.from_dict(data)

# set size of display in pandas
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 20 )

# column names of the dataframe
print("Columns of the dataset: ", df.columns)

print("Total length of the dataset: ", len(df))
# show dataframe with columns and rows
# df.head()
# df2.info()

Columns of the dataset:  Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')
Total length of the dataset:  74347


Keep only the products whose `asin` appears in `asin.csv`.
This step is necessary because the dataset used for clustering differs from the one used for sentiment analysis and similar items.

In [591]:
# keep only the rows whose asin is listed in asin.csv
asins = pd.read_csv('Dataset/asin.csv')
df = df[df['asin'].isin(asins['asin'])]

print("Total length of the dataset after the update with asin", len(df))
print(df.columns)
print(type(df))

Total length of the dataset after the update with asin 45139
Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>


# Data pre-processing

- Remove empty descriptions
- Remove HTML tags
- Remove URLs
- Remove HTML hidden characters
- Remove punctuation
- Remove numbers
- Transform every word into lowercase
- Remove stop words
- Perform stemming
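
The cleaning itself lives in the `preprocess_data` module, which is not shown here. As a rough, standard-library-only sketch of what the steps above could look like (stemming excluded, since it is applied later with NLTK), consider the following. The function name `clean_description`, the regular expressions, and the tiny stop-word set are all assumptions made for illustration, not the module's actual code.

```python
import html
import re

# Tiny illustrative stop-word list; the notebook downloads NLTK's English list,
# which is presumably what the real preprocessing uses.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "in", "of", "to", "for", "at"}

def clean_description(text):
    """Apply the cleaning steps above, except stemming, to one description."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = html.unescape(text)                  # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # drop punctuation and digits
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_description("<b>Live</b> at the Apollo &amp; more: 12 tracks, see https://example.com")
print(cleaned)  # live apollo more tracks see
```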

In [592]:
# Drop rows with no description (description is empty)
df = df[df['description'].map(lambda d: len(d)) > 0]
# df2.head()

In [593]:
df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
4,[],,[1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will ...,,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8JC, B00007E8SE, B00000387A, B000I0QKB0, B000025Q0M, B000008QP3, B000FAMYIG, B0009WA252, B0016CP2GS, B016E9NE9Y, B000002UEN, B000T5MJN2, B003H8F4NA, B00004RC05, B000A0GP04, B00004RC01, B00004YNA4, B016E9NEN0, B00BS96UG0, B00BS96XJY, B00BS96Y9S, B00BS96...",,Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNVE, B00007E8SE, B00000DPJN, B00000387A, B0009WA252, B000008QP3, B00KYVH4VI]","<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music""/>",,,,1526146,[],[],
12,[],,"[Spanish Before You Know It - Gold Edition. Learn Spanish in a Flash!, , ]",,Spanish Before You Know It - Gold Edition,[],,Transparent Language,[],"1,153,345 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,"$9,600.61",545069882,[https://images-na.ssl-images-amazon.com/images/I/510aGyfLO5L._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/510aGyfLO5L.jpg],
13,[],,"[Just the CD. The Book has long since vanished. I great condition and it is a classic., , ]",,Puff the Magic Dragon,[],,,[],"242,922 in CDs &amp; Vinyl (","[B00YZ82TPW, B0009YA39U, 1402747829, B00U1CES3M, B000051NSZ, B00514KN3E]","<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,$6.14,545109620,[https://images-na.ssl-images-amazon.com/images/I/51n5-HdJCCL._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/51n5-HdJCCL.jpg],
19,[],,"[The seasons are changing and life slows down in the forest. Or does it? Join Mindy and her friends in Bluebell Woods as they scrabble for the biggest, juiciest blackberries ... carve the silliest and scariest jack-o-lanterns ... and share the true spirit of giving when they're snowed in for thr...",,"The Tales of Mindy Mousekins: Adventures Through the Seasons, Autumn - Winter","[0692384251, B004HVKAAI]",,Jacqueline Houston,[],"200,626 in CDs &amp; Vinyl (",[0692384251],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,,615897398,"[https://images-na.ssl-images-amazon.com/images/I/610CCoHtMyL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/619zvyeoqwL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/61d0wtGHDZL._SS40_.jpg, https://images-na.ssl-images-amazon.com/images/I/516OvXY0TdL._SS40_.jpg]","[https://images-na.ssl-images-amazon.com/images/I/610CCoHtMyL.jpg, https://images-na.ssl-images-amazon.com/images/I/619zvyeoqwL.jpg, https://images-na.ssl-images-amazon.com/images/I/61d0wtGHDZL.jpg, https://images-na.ssl-images-amazon.com/images/I/516OvXY0TdL.jpg]",
20,[],,"[The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes., , ]",,Avancemos! Musica del mundo hispano,[0618753222],,Various Artists,[],"521,475 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png"" class=""nav-categ-image"" alt=""Digital Music"" />",,,$1.00,618866760,[https://images-na.ssl-images-amazon.com/images/I/512M0WP5tNL._SS40_.jpg],[https://images-na.ssl-images-amazon.com/images/I/512M0WP5tNL.jpg],


In [594]:
# each description is a list of strings; remove the empty strings and join the rest into one string
df.description = df.description.apply(lambda x: [string for string in x if string != ""])
df.description = df.description.apply(lambda x: " ".join(x))
df.iloc[0].description


"1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will You Do? 15. Rise Again"

In [595]:
df_similarity_scores = df.copy()

print("Example of description before preprocessing: ")
print(df.description.iloc[0:1])
df.description = df.description.apply(lambda x: pre_process_data.user_description_sentiment_analysis(x))

print()
print("Example of description after preprocessing: ")
print(df.description.iloc[0:1])


Example of description before preprocessing: 
4    1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will Y...
Name: description, dtype: object

Example of description after preprocessing: 
4    losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
Name: description, dtype: object


## Does any product contain different descriptions?  
Some products are not unique: their asin and description appear more than once.
We drop rows with a repeated description so that each remaining product is unique.
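
Deduplicating on the description alone keeps the first row for each distinct text, which matches the default `keep='first'` behaviour of pandas `drop_duplicates`. The plain-Python sketch below illustrates that effect on made-up rows; `drop_duplicate_descriptions` is a hypothetical helper, not part of the notebook.

```python
def drop_duplicate_descriptions(rows):
    """Keep the first (asin, description) pair for every distinct description."""
    seen = set()
    unique_rows = []
    for asin, description in rows:
        if description in seen:
            continue
        seen.add(description)
        unique_rows.append((asin, description))
    return unique_rows

rows = [
    ("A1", "live concert recording"),
    ("A2", "live concert recording"),  # same text under a different asin
    ("A3", "studio album"),
]
unique_rows = drop_duplicate_descriptions(rows)
print(unique_rows)
```

Note that a product with a different asin but identical text (`A2` here) is dropped as well.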

In [596]:
df_asin_description = df[["asin","description"]].copy()
df_asin_description.drop_duplicates(subset = "description", inplace=True)

df_asin_description

Unnamed: 0,asin,description
4,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
12,0545069882,spanish know gold edition learn spanish flash
13,0545109620,cd book long since vanished great condition classic
19,0615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
20,0618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes
...,...,...
74333,B01HDZM264,dub infused post punk nyu punk professor music label staubgold
74336,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
74338,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
74339,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks


# Sentiment Analysis

In [597]:
df_process = df_asin_description
df_process.head()

Unnamed: 0,asin,description
4,1526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
12,545069882,spanish know gold edition learn spanish flash
13,545109620,cd book long since vanished great condition classic
19,615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
20,618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes


In [598]:
# Suppressing warning about old version of spacy
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Applying Spacy affect model emotions
    nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

    
df_process['emotion_spacy'] = df_process.description.apply(lambda x: Counter([item.label_.lower() for item in nlp_affect(x).ents]))

In [599]:

# Extracting most significant emotion of a particular description
def get_most_significant_emotion(emotions):
    try:
        sign_emotion = max(emotions, key=emotions.get)
    except ValueError:
        sign_emotion = None
    return sign_emotion

df_process['most_significant_emotion_spacy'] = df_process.emotion_spacy.apply(lambda x: get_most_significant_emotion(x))

df_process.head(100)
save_csv = True
if save_csv:
    df_process.to_csv('digital_music.csv')



In [600]:

# Output of user emotion based on input.txt
with open("input.txt", "r") as file_input:
    text = file_input.read()
nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

def measure_affect_score(sentence : str, nlp_affect):
    affect_percent = {'fear': 0.0, 'anger': 0.0, 'anticipation': 0.0, 'trust': 0.0, 'surprise': 0.0, 'positive': 0.0,
                      'negative': 0.0, 'sadness': 0.0, 'disgust': 0.0, 'joy': 0.0}
    emotions = []
    doc = nlp_affect(sentence)
    if len(doc.ents) != 0:
        for ent in doc.ents:
            emotions.append(ent.label_.lower())
        affect_counts = Counter()
        for emotion in emotions:
            affect_counts[emotion] += 1
        sum_values = sum(affect_counts.values())
        for key in affect_counts.keys():
            affect_percent.update({key: float(affect_counts[key]) / float(sum_values)})
    return affect_percent

user_emotion_scores = measure_affect_score(text,nlp_affect)
max_emotion = max(user_emotion_scores, key=user_emotion_scores.get)
user_emotion = max_emotion

print(user_emotion)



anticipation
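
The scoring above boils down to counting emotion labels and normalising the counts. The sketch below reproduces that idea without spaCy, taking an already extracted list of labels as input. One detail worth knowing: when no emotion is detected, every score stays at 0.0 and `max` returns the first key of the dictionary (`fear`), so this sketch returns `None` in that case instead. The helper names `affect_shares` and `dominant_emotion` are hypothetical.

```python
from collections import Counter

EMOTIONS = ["fear", "anger", "anticipation", "trust", "surprise",
            "positive", "negative", "sadness", "disgust", "joy"]

def affect_shares(labels):
    """Turn a list of detected emotion labels into per-emotion fractions."""
    shares = {emotion: 0.0 for emotion in EMOTIONS}
    counts = Counter(labels)
    total = sum(counts.values())
    for emotion, count in counts.items():
        shares[emotion] = count / total
    return shares

def dominant_emotion(shares):
    """Return the emotion with the largest share, or None if nothing was detected."""
    if not any(shares.values()):
        return None
    return max(shares, key=shares.get)

shares = affect_shares(["joy", "trust", "joy", "anticipation"])
print(dominant_emotion(shares))             # joy
print(dominant_emotion(affect_shares([])))  # None
```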


In [601]:
# Find all items whose most significant emotion matches the user's emotion ("anticipation" here)

# read the file with the emotions of all items
df_emotion = pd.read_csv('digital_music.csv')
# keep the rows where the item's emotion equals the user's emotion
filtered_df = df_emotion[df_emotion['most_significant_emotion_spacy'] == user_emotion]

# save the filtered rows
filtered_df.to_csv('grouped_emotion.csv', index=False)



In [602]:
df_emotion.columns

Index(['Unnamed: 0', 'asin', 'description', 'emotion_spacy',
       'most_significant_emotion_spacy'],
      dtype='object')

# Similar Items System
A program that reads the dataset, preprocesses the data, and outputs the items most similar to a user's description of a product.

In [603]:
import json
from collections import defaultdict
import gzip
import pandas as pd
from lxml import html,etree
import numpy as np
import ipywidgets as widgets
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
from nltk.stem import PorterStemmer
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import os


# set stopwords vocabulary
nltk.download('stopwords')

# set tokenizer
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ariannabianchi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ariannabianchi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [604]:
df_asin_description = df_emotion[["asin","description"]].copy()
df_asin_description.drop_duplicates(subset = "description", inplace=True)
df_asin_description.dropna(subset=['description'], inplace=True)

df_asin_description

Unnamed: 0,asin,description
0,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
1,0545069882,spanish know gold edition learn spanish flash
2,0545109620,cd book long since vanished great condition classic
3,0615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
4,0618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes
...,...,...
16111,B01HDZM264,dub infused post punk nyu punk professor music label staubgold
16112,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
16113,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
16114,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks


In [605]:
df_asin_description.head()

Unnamed: 0,asin,description
0,1526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
1,545069882,spanish know gold edition learn spanish flash
2,545109620,cd book long since vanished great condition classic
3,615897398,seasons changing life slows forest join mindy friends bluebell woods scrabble biggest juiciest blackberries carve silliest scariest jack lanterns share true spirit giving snowed three whole days master storyteller jacqueline houston brings mindy mousekins world alive heartwarming tales fun adven...
4,618866760,msica del mundo hispano audio cd includes authentic music around spanish speaking world lyrics included liner notes


### Last step of preprocessing: stemming
Stemming is useful for comparing similar words.

In [606]:
# stemming of the words in descriptions
def stemming_data(s):
    stemmer = PorterStemmer()
    # empty or whitespace-only descriptions have nothing to stem
    if not s or s.isspace():
        return ''
    # tokenize, stem each token, and join the stems back into one string
    tokens = word_tokenize(s)
    return ' '.join(stemmer.stem(word) for word in tokens)

In [607]:
print(df_asin_description.description.iloc[0:1])

0    losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
Name: description, dtype: object


In [608]:
df_asin_description.description = df_asin_description.description.apply(lambda x: stemming_data(x))

In [609]:
print(df_asin_description.description.iloc[0:1])

0    lose game wait shine never seen righteou broken heart look back saw lord jesu river love hittin road never jesu got ta hold life save save save rise
Name: description, dtype: object


## Creating shingles

In [610]:
# Given a string input, return the list of shingles
def shingle(s, q, delimiter=' '):
    all_shingles = []
    if isinstance(s, float):
        print(s)
    if delimiter != '':
        words_list = s.split(delimiter)
    else:
        words_list = s
    for i in range (len(words_list)-q+1):
        all_shingles.append(delimiter.join(words_list[i:i+q]))
    return list(set(all_shingles))

In [611]:
# Apply shingles to the df_asin_description
df_asin_description["shingles"] = df_asin_description["description"].apply(lambda x: shingle(x, 3))
# aaa = df_asin_description["description"].apply(lambda x: shingle(x, 3))
# df_asin_description
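
As a small worked example of the shingling step, the sketch below builds 3-word shingles for a short stemmed string. `word_shingles` is a simplified variant written for illustration (it returns a set and splits on any whitespace), not the `shingle` function used in the notebook.

```python
def word_shingles(text, q):
    """Return the set of q-word shingles of a whitespace-separated string."""
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

shingles = word_shingles("lose game wait shine never", 3)
print(sorted(shingles))
# ['game wait shine', 'lose game wait', 'wait shine never']

# A description with fewer than q words yields no shingles at all.
short = word_shingles("vinyl lp", 3)
print(short)  # set()
```

Descriptions shorter than three words therefore cannot share a shingle with any user input.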

### Similarity of sets
Computing the Jaccard similarity

In [612]:
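# function that takes an intersection set and a union set and returns the Jaccard similarity
def similarity(intersection_set, union_set):
    # an empty union means both shingle sets are empty; return 0 instead of dividing by zero
    if not union_set:
        return 0.0
    return len(intersection_set)/len(union_set)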
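
For two shingle sets A and B, the Jaccard similarity is the size of their intersection divided by the size of their union. The following sketch works through one small case; the function `jaccard` and the two example sets are illustrative only, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

user = {"king without crown", "without crown bad"}
item = {"king without crown", "crown bad blood", "bad blood jealou"}
score = jaccard(user, item)
print(score)  # 1 shared shingle out of 4 distinct ones: 0.25
```

Because long descriptions produce many shingles, the union grows quickly and even a genuine match can score well below 0.01.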

### Read the user description, preprocess it, and compute the Jaccard similarity with the dataset's descriptions.

In [613]:
with open("input.txt", "r") as file_input:
    input = file_input.read()
# print(input)
user_description = pre_process_data.user_description_sentiment_analysis(input)
user_description = shingle(user_description, 3)  


In [614]:
# df_asin_description
df_asin_description["similarity"] = df_asin_description["shingles"].apply(lambda x: similarity(set(user_description).intersection(set(x)), set(user_description).union(set(x))))
df_asin_description


Unnamed: 0,asin,description,shingles,similarity
0,0001526146,lose game wait shine never seen righteou broken heart look back saw lord jesu river love hittin road never jesu got ta hold life save save save rise,"[look back saw, wait shine never, jesu got ta, never seen righteou, saw lord jesu, got ta hold, jesu river love, seen righteou broken, river love hittin, hittin road never, save save rise, love hittin road, lose game wait, game wait shine, road never jesu, shine never seen, ta hold life, broken ...",0.0
1,0545069882,spanish know gold edit learn spanish flash,"[gold edit learn, spanish know gold, edit learn spanish, know gold edit, learn spanish flash]",0.0
2,0545109620,cd book long sinc vanish great condit classic,"[cd book long, book long sinc, great condit classic, long sinc vanish, vanish great condit, sinc vanish great]",0.0
3,0615897398,season chang life slow forest join mindi friend bluebel wood scrabbl biggest juiciest blackberri carv silliest scariest jack lantern share true spirit give snow three whole day master storytel jacquelin houston bring mindi mousekin world aliv heartwarm tale fun adventur,"[tale fun adventur, scrabbl biggest juiciest, houston bring mindi, forest join mindi, scariest jack lantern, day master storytel, whole day master, juiciest blackberri carv, spirit give snow, aliv heartwarm tale, heartwarm tale fun, lantern share true, give snow three, slow forest join, bring mi...",0.0
4,0618866760,msica del mundo hispano audio cd includ authent music around spanish speak world lyric includ liner note,"[audio cd includ, authent music around, lyric includ liner, spanish speak world, includ liner note, hispano audio cd, speak world lyric, del mundo hispano, cd includ authent, mundo hispano audio, includ authent music, music around spanish, world lyric includ, msica del mundo, around spanish speak]",0.0
...,...,...,...,...
16111,B01HDZM264,dub infus post punk nyu punk professor music label staubgold,"[professor music label, nyu punk professor, infus post punk, punk nyu punk, punk professor music, music label staubgold, dub infus post, post punk nyu]",0.0
16112,B01HG2DW1I,track list butter ball zaq attack zona walk like guv sentiment pacif daylight trombon institut technolog san jose fog citi show crb trombon giant,"[crb trombon giant, guv sentiment pacif, daylight trombon institut, like guv sentiment, track list butter, citi show crb, sentiment pacif daylight, pacif daylight trombon, san jose fog, jose fog citi, ball zaq attack, attack zona walk, zaq attack zona, show crb trombon, butter ball zaq, list but...",0.0
16113,B01HH5R7LK,coldplay head full dream tour live etihad stadium manchest england june th cd intro head full dream yellow everi teardrop waterfal scientist bird paradis everglow lover japan magic clock midnight charli brown hymn weekend fix hero viva la vida cd adventur lifetim kaleidoscop troubl see soon amaz...,"[day sky full, live etihad stadium, nme award viva, vida charli brown, stadium manchest england, brown hymn weekend, feat beyonc bruno, clock midnight charli, waterfal scientist bird, kaleidoscop troubl see, england june th, coldplay full super, fix hero viva, vida cd adventur, troubl see soon, ...",0.0
16114,B01HH68B96,known live version that way life goe steam blacktop witha demo version superfici love sang hughi instead chri hick,"[superfici love sang, instead chri hick, witha demo version, known live version, steam blacktop witha, love sang hughi, hughi instead chri, goe steam blacktop, way life goe, live version that, version superfici love, sang hughi instead, demo version superfici, life goe steam, version that way, b...",0.0


Dataframe sorted by similarity

In [615]:

df_asin_description.sort_values(by="similarity", ascending=False, inplace=True)
df_asin_description

Unnamed: 0,asin,description,shingles,similarity
9811,B002IAV3YM,rare print first album matisyahu jdub record chop em tzama l chol nafshi got water king without crown interlud father forset interlud aish tamid short nigun candl close eye interlud exalt refug interlud warrior outro,"[print first album, tzama l chol, interlud father forset, close eye interlud, short nigun candl, rare print first, without crown interlud, first album matisyahu, interlud exalt refug, chop em tzama, l chol nafshi, album matisyahu jdub, father forset interlud, water king without, chol nafshi got,...",0.008929
13104,B00AUUEA7O,cult album titel track avenu smokey sing night murder love think rage regret ark angel king without crown bad blood jealou lover one day avenu z minneapoli smokey sing miami mix night murder love whole stori chicago abridg,"[without crown bad, regret ark angel, murder love think, night murder love, think rage regret, avenu z minneapoli, angel king without, rage regret ark, one day avenu, lover one day, jealou lover one, sing miami mix, bad blood jealou, love think rage, day avenu z, miami mix night, minneapoli smok...",0.008772
159,9321432531,cd king without crown chop em smash lie jerusalem fight close eye late night zion youth ancient lullabi short nigun dark light silenc candl light fire heaven altar earth got water natur warrior cd time song one day dispatch troop wp hi lo exalt uniqu dove shalom saalam aish tamid refug indestruc...,"[night zion youth, silenc candl light, ancient lullabi short, lullabi short nigun, jerusalem fight close, water natur warrior, one day dispatch, tzama l chol, nafshi psalm motiv, wp hi lo, fight close eye, close eye late, hi lo exalt, father forest walk, nigun dark light, chop em smash, eye late...",0.006803
10748,B00401NFVG,make mine countri charley pride vinyl lp,"[make mine countri, charley pride vinyl, pride vinyl lp, countri charley pride, mine countri charley]",0.000000
10746,B00400IA80,asin bia titl ball treasuri turn centuri popular song lp record label name nonesuch marketplac us,"[treasuri turn centuri, song lp record, lp record label, record label name, ball treasuri turn, nonesuch marketplac us, popular song lp, name nonesuch marketplac, turn centuri popular, titl ball treasuri, centuri popular song, label name nonesuch, asin bia titl, bia titl ball]",0.000000
...,...,...,...,...
5378,B000QW5SJK,old peopl whisper bossi man hit girl renaiss fair knock knock bug thrift store blue last time car flora fauna devil slow danc cat know place go sweet understand dragon pirat heart winter true delici,"[blue last time, go sweet understand, knock bug thrift, fair knock knock, understand dragon pirat, knock knock bug, heart winter true, slow danc cat, danc cat know, store blue last, man hit girl, thrift store blue, bossi man hit, cat know place, sweet understand dragon, hit girl renaiss, fauna d...",0.000000
5379,B000QWC89I,contain song like avalon look love mood indigo want littl girl,"[want littl girl, avalon look love, like avalon look, song like avalon, contain song like, mood indigo want, indigo want littl, love mood indigo, look love mood]",0.000000
5380,B000QWBXNU,includ navi hymn day holi holi holi lead kindli light prais god bless flow fairest lord jesu nearer god thee onward christian soldier mighti fortress stand jesu old rug cross abid rock age friend jesu,"[day holi holi, mighti fortress stand, flow fairest lord, navi hymn day, prais god bless, light prais god, bless flow fairest, holi holi lead, god bless flow, hymn day holi, nearer god thee, god thee onward, includ navi hymn, stand jesu old, fortress stand jesu, rock age friend, soldier mighti f...",0.000000
5381,B000QWHEGK,contain song like wish knew new know love noth,"[know love noth, wish knew new, song like wish, contain song like, new know love, knew new know, like wish knew]",0.000000


In [616]:
asin_most_similar = df_asin_description.asin.iloc[0]

In [617]:
print("Similarity of items")
print(df_asin_description.iloc[:20].similarity)

Similarity of items
9811     0.008929
13104    0.008772
159      0.006803
10748    0.000000
10746    0.000000
10745    0.000000
10744    0.000000
10743    0.000000
10742    0.000000
0        0.000000
10741    0.000000
10740    0.000000
10739    0.000000
10738    0.000000
10737    0.000000
10736    0.000000
10735    0.000000
10747    0.000000
10750    0.000000
10749    0.000000
Name: similarity, dtype: float64


### Create csv file with asin, description, emotion and similarity

In [618]:
# from preprocess_data import html_url_hidden_chars
df_similarity_scores.description = df_similarity_scores.description.apply(lambda x: pre_process_data.html_url_hidden_chars(x))
# df_similarity_scores.iloc[:2].description
# (pd.merge(df1, df2, on='company')
df_1 = df_similarity_scores[["asin","description"]].copy()
df_2 = df_asin_description[["asin","similarity"]].copy()
similarity_df = pd.merge(df_1, df_2, on='asin')
# similarity_df = pd.merge(df_similarity_scores, df_asin_description, on='asin')
similarity_df.columns
# similarity_df.sort_values(by="similarity", ascending=False, inplace=True)
# similarity_df.head()

Index(['asin', 'description', 'similarity'], dtype='object')

In [619]:
similarity_df.head()

Unnamed: 0,asin,description,similarity
0,1526146,1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will Y...,0.0
1,545069882,Spanish Before You Know It - Gold Edition. Learn Spanish in a Flash!,0.0
2,545109620,Just the CD. The Book has long since vanished. I great condition and it is a classic.,0.0
3,615897398,"The seasons are changing and life slows down in the forest. Or does it? Join Mindy and her friends in Bluebell Woods as they scrabble for the biggest, juiciest blackberries ... carve the silliest and scariest jack-o-lanterns ... and share the true spirit of giving when they're snowed in for thre...",0.0
4,618866760,The Msica del mundo hispano Audio CD includes authentic music from around the Spanish-speaking world. Lyrics are included in the liner notes.,0.0


In [620]:
similarity_dfd = similarity_df.sort_values(by="similarity", ascending=False, inplace=False)
similarity_dfd.head()

Unnamed: 0,asin,description,similarity
11373,B002IAV3YM,RARE and out-of-print first album by Matisyahu from 2004 on JDub Records.1. Chop 'em Down2. Tzama L'chol Nafshi3. Got No Water4. King Without A Crown5. Interlude6. Father In The Forset7. Interlude8. Aish Tamid9. Short Nigun10. Candle11. Close My Eyes12. Interlude13. Exaltation14. Refuge15. Inter...,0.008929
14666,B00AUUEA7O,"Cult Album from 1987 [15 Titel/Tracks]: Avenue A When Smokey Sings , The Night You Murdered Love , Think Again , Rage And Then Regret, Ark-Angel , King Without A Crown , Bad Blood , Jealous Lover , One Day, Avenue Z , Minneapolis , When Smokey Sings (The Miami Mix) , The Night You Murdered Love ...",0.008772
158,9321432531,CD 1 1. King Without A Crown 2. Chop 'Em Down 3. Smash Lies 4. Jerusalem 5. What I'm Fighting For 6. Close My Eyes 7. Late Night In Zion 8. Youth 9. Ancient Lullaby 10. Short Nigun 11. For You 12. Darkness Into Light 13. Silence 14. Candle 15. I Will Be Light 16. Fire of Heaven - Altar Of Earth ...,0.006803
11776,B0033B7XF8,"Cole Porter's ""Let's Face It, Red Hot and Blue, Leave it to Me! Smithsonian American Musical Theater Series 1979 Vinyl LP",0.0
11790,B0034ZQWMI,Songs: (1) Who's In The Mood (2) I Just Want You (3) Rockin' Years (4) God Gave Me You (5) I Hurt (6) Guilty (7) Hard Times For Lovers (8) Couple Of Dreamers (9) Everybody's Gonna Want What We Got (10) Old Time Lovers,0.0


In [621]:
similarity_dfd.iloc[:30].to_csv("similarity_results.csv", sep='\t')

## Adding clustering based on user reviews

In [622]:
# Imports
import json
import gzip
import os
import pandas as pd
from scipy.sparse import dok_matrix
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

In [623]:
# Loading the df example
# df = pd.read_pickle('df_example.pkl')
df = similarity_dfd.copy()

In [624]:
import numpy as np

# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


### Unpacking the reviews dataset and getting only the ratings

In [625]:
data = []
with gzip.open('Dataset/Digital_Music.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# Total length of the list; each entry is one review, so this is the number of reviews (not products)
print("Total number of items in the dataset: ", len(data))
# Convert list into pandas dataframe
df_rating = pd.DataFrame.from_dict(data)
df_rating = df_rating[['overall', 'reviewerID', 'asin']]

Total number of items in the dataset:  1584082
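Only three fields of each review are kept, so the full dictionaries do not have to be held in memory. A minimal sketch of that idea, assuming every line of the gzip file is one JSON object with `overall`, `reviewerID` and `asin` keys, as in the cell above. The helper name `iter_ratings` and the tiny demo file are made up for illustration.

```python
import gzip
import json
import os
import tempfile

FIELDS = ("overall", "reviewerID", "asin")

def iter_ratings(path, fields=FIELDS):
    """Yield one tuple per review, keeping only the requested fields."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield tuple(record.get(name) for name in fields)

# Tiny demo file (hypothetical data, not taken from the real dataset)
demo_reviews = [
    {"overall": 5.0, "reviewerID": "U1", "asin": "P1", "reviewText": "great"},
    {"overall": 2.0, "reviewerID": "U2", "asin": "P1", "reviewText": "meh"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for review in demo_reviews:
        f.write(json.dumps(review) + "\n")

rows = list(iter_ratings(demo_path))
print(rows)
```

The resulting tuples could then be turned into a frame with `pd.DataFrame(rows, columns=list(FIELDS))`.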


### Filters used for narrowing down the user reviews
* Using only products from the previous parts (they must have defined the description, etc.)
* Using only 'positive' reviews (overall rating >= 3)
* Using only users with more than 1 review

In [626]:
# Get the 'asin' (unique identifier) of every product used so far
unique_asin = set(df.asin)
# Keep only products which are used in the previous parts
idxs = df_rating.asin.isin(unique_asin)
# Keep only 'positive' reviews (overall >= 3)
positive_ratings = df_rating.overall >= 3
# Keep only users with more than 1 review (counted before the other filters are applied)
unique_user = [key for key, value in Counter(df_rating.reviewerID).items() if value > 1]
experienced_users = df_rating.reviewerID.isin(unique_user)
# Combine all filters into a single boolean mask
all_filters = idxs & positive_ratings & experienced_users

# Applying the filters
df_rating = df_rating[all_filters]

# Update the set of rated products
unique_asin = set(df_rating.asin)
unique_user = set(df_rating.reviewerID)
print('Number of users: ', len(unique_user))
print('Number of products: ', len(unique_asin))

Number of users:  20715
Number of products:  16115
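The three filters can be checked on a toy table. This is only a sketch with made-up values; the names `toy` and `known_asin` are hypothetical stand-ins for `df_rating` and `unique_asin`.

```python
import pandas as pd

# Toy ratings table (hypothetical values)
toy = pd.DataFrame({
    "overall":    [5.0, 2.0, 4.0, 3.0, 5.0],
    "reviewerID": ["U1", "U1", "U2", "U3", "U3"],
    "asin":       ["P1", "P2", "P1", "P9", "P2"],
})
known_asin = {"P1", "P2"}  # stand-in for the products kept from the earlier parts

is_known = toy.asin.isin(known_asin)
is_positive = toy.overall >= 3
# Reviews per user, counted on the unfiltered table
review_counts = toy.reviewerID.map(toy.reviewerID.value_counts())
is_experienced = review_counts > 1

mask = is_known & is_positive & is_experienced
filtered = toy[mask]
print(filtered)
```

Because the review count is taken before the other filters, a user can pass the "more than 1 review" test and still be left with a single row afterwards (both `U1` and `U3` here).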


In [627]:
# Create mappings from product asin (user id) to a matrix index, and lists for the reverse lookup
product2idx = {}
idx2product = []
user2idx = {}
idx2user = []
for idx, product in enumerate(unique_asin):
    product2idx[product] = idx
    idx2product.append(product)
for idx, user in enumerate(unique_user):
    user2idx[user] = idx
    idx2user.append(user)
# Create a sparse matrix where each row represents one user and each column one product
product_user_matrix = dok_matrix((len(user2idx), len(product2idx)), dtype=np.float16)
for _, row in df_rating.iterrows():
    product_user_matrix[user2idx[row.reviewerID], product2idx[row.asin]] = 1
print('Matrix size: ', product_user_matrix.shape)

Matrix size:  (20715, 16115)
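Filling a `dok_matrix` row by row with `iterrows` can be slow for many reviews. A possible vectorized alternative, sketched on toy data under the assumption that a binary "user reviewed product" matrix is all that is needed; `pd.factorize` plays the role of the `user2idx` / `product2idx` dictionaries.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy review table (hypothetical values); U1 reviewed P1 twice
toy = pd.DataFrame({
    "reviewerID": ["U1", "U1", "U2", "U3", "U1"],
    "asin":       ["P1", "P2", "P1", "P3", "P1"],
})

# factorize returns integer codes plus the unique labels in order of appearance
user_codes, users = pd.factorize(toy.reviewerID)
product_codes, products = pd.factorize(toy.asin)

values = np.ones(len(toy), dtype=np.float32)
matrix = coo_matrix(
    (values, (user_codes, product_codes)),
    shape=(len(users), len(products)),
).tocsr()
# Duplicate (user, product) pairs are summed during conversion; reset them to 1
matrix.data[:] = 1.0
print(matrix.toarray())
```

Here `users[i]` and `products[j]` give the labels for row `i` and column `j`, which corresponds to `idx2user` and `idx2product` in the cell above.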


### Applying the clustering algorithm

In [628]:
# Perform NMF to get clusters
model = NMF(n_components=10, verbose=False, max_iter=1000, random_state=42)
W = model.fit_transform(product_user_matrix)
H = model.components_
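To make the roles of the two factors concrete, here is a small sketch on a made-up user x product matrix with two obvious taste groups. The `_toy` names and the explicit `init="nndsvda"` are choices made for this illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user x product matrix (hypothetical data): users 0, 1, 4 like products 0-1,
# users 2, 3 like products 2-3
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
], dtype=float)

toy_model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=42)
W_toy = toy_model.fit_transform(X)   # shape (n_users, n_components)
H_toy = toy_model.components_        # shape (n_components, n_products)

print(W_toy.shape, H_toy.shape)
# Each column of H_toy holds one product's weight in every component
print(np.round(H_toy, 2))
```

Each row of `W_toy` describes a user in terms of the components, and each column of `H_toy` describes a product, which is why the notebook works with the columns of `H` from here on.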

In [629]:
# Check that every product has a non-zero weight in at least one component (cluster)
non_zeros_cols = np.count_nonzero(H, axis=0)
print((non_zeros_cols > 0).sum() == len(unique_asin))
# Transpose H so that each row corresponds to one product
H = H.transpose()

True


In [630]:
# Normalize each row of H to sum to 1, then compute the pairwise cosine similarity between products
row_sums = H.sum(axis=1)
H_normalized = H / row_sums[:, np.newaxis]
similarities = cosine_similarity(H_normalized)
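Cosine similarity only depends on the direction of each row, so dividing a row by its sum should not change the result. The sketch below checks this on made-up values with a small NumPy helper (`cosine_sim` is a hypothetical name, written here instead of the scikit-learn function to keep the example dependency-free).

```python
import numpy as np

def cosine_sim(A):
    """Pairwise cosine similarity between the rows of A (rows must be non-zero)."""
    unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    return unit @ unit.T

# Toy product x component matrix (hypothetical values)
H_toy = np.array([
    [0.9, 0.1, 0.0],
    [1.8, 0.2, 0.0],   # same direction as row 0, twice the magnitude
    [0.0, 0.3, 0.7],
])

H_rowsum = H_toy / H_toy.sum(axis=1, keepdims=True)
same = np.allclose(cosine_sim(H_toy), cosine_sim(H_rowsum))
print(same)
```

The row-sum division does require every row to have a non-zero sum, which is what the check on the columns of `H` in the previous cell verifies.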

In [631]:
n = 5
suggested_product_idx = product2idx[asin_most_similar]
suggested_product = idx2product[suggested_product_idx]
# Indices of the n most similar products (argpartition returns them unordered)
ind = np.argpartition(similarities[:, suggested_product_idx], -n)[-n:]
# If the suggested product itself is among them, take n + 1 candidates and drop it
if suggested_product_idx in ind:
    ind = np.argpartition(similarities[:, suggested_product_idx], -n-1)[-n-1:]
    ind = np.delete(ind, np.where(ind == suggested_product_idx))
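`np.argpartition` returns the top entries in no particular order. If a ranked list is wanted, one option is to mask the query item and sort only the selected candidates. A minimal sketch, where `top_n_neighbors` and the toy similarity matrix are made up for illustration:

```python
import numpy as np

def top_n_neighbors(similarities, item_idx, n=5):
    """Return the indices of the n items most similar to item_idx, best first."""
    scores = similarities[:, item_idx].astype(float)  # astype returns a copy
    scores[item_idx] = -np.inf                        # never return the item itself
    n = min(n, len(scores) - 1)
    if n <= 0:
        return np.array([], dtype=int)
    candidates = np.argpartition(scores, -n)[-n:]     # unordered top n
    order = np.argsort(scores[candidates])[::-1]      # sort only the candidates
    return candidates[order]

# Toy symmetric similarity matrix (hypothetical values)
sim = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.8, 0.3, 1.0],
])
neighbors = top_n_neighbors(sim, 0, n=2)
print(neighbors)
```

Masking the query item with `-inf` avoids the second `argpartition` call that is otherwise needed when the item shows up among its own neighbors.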

In [632]:
# Suggested product description
df.loc[df.asin == idx2product[suggested_product_idx]].description

11373    RARE and out-of-print first album by Matisyahu from 2004 on JDub Records.1. Chop 'em Down2. Tzama L'chol Nafshi3. Got No Water4. King Without A Crown5. Interlude6. Father In The Forset7. Interlude8. Aish Tamid9. Short Nigun10. Candle11. Close My Eyes12. Interlude13. Exaltation14. Refuge15. Inter...
Name: description, dtype: object

In [633]:
# Descriptions of the 5 nearest neighbors (lookup left commented out; the cell displays df.head() instead)
# df.loc[df.asin.isin([idx2product[i] for i in ind])].description
df.head()

Unnamed: 0,asin,description,similarity
11373,B002IAV3YM,RARE and out-of-print first album by Matisyahu from 2004 on JDub Records.1. Chop 'em Down2. Tzama L'chol Nafshi3. Got No Water4. King Without A Crown5. Interlude6. Father In The Forset7. Interlude8. Aish Tamid9. Short Nigun10. Candle11. Close My Eyes12. Interlude13. Exaltation14. Refuge15. Inter...,0.008929
14666,B00AUUEA7O,"Cult Album from 1987 [15 Titel/Tracks]: Avenue A When Smokey Sings , The Night You Murdered Love , Think Again , Rage And Then Regret, Ark-Angel , King Without A Crown , Bad Blood , Jealous Lover , One Day, Avenue Z , Minneapolis , When Smokey Sings (The Miami Mix) , The Night You Murdered Love ...",0.008772
158,9321432531,CD 1 1. King Without A Crown 2. Chop 'Em Down 3. Smash Lies 4. Jerusalem 5. What I'm Fighting For 6. Close My Eyes 7. Late Night In Zion 8. Youth 9. Ancient Lullaby 10. Short Nigun 11. For You 12. Darkness Into Light 13. Silence 14. Candle 15. I Will Be Light 16. Fire of Heaven - Altar Of Earth ...,0.006803
11776,B0033B7XF8,"Cole Porter's ""Let's Face It, Red Hot and Blue, Leave it to Me! Smithsonian American Musical Theater Series 1979 Vinyl LP",0.0
11790,B0034ZQWMI,Songs: (1) Who's In The Mood (2) I Just Want You (3) Rockin' Years (4) God Gave Me You (5) I Hurt (6) Guilty (7) Hard Times For Lovers (8) Couple Of Dreamers (9) Everybody's Gonna Want What We Got (10) Old Time Lovers,0.0


In [634]:
# Write the descriptions of the nearest neighbors to output.txt
with open("output.txt", "w") as file:
    for i in range(len(ind)):
        print(ind[i])
        file.write(str(df.loc[df.asin == idx2product[ind[i]]].description))
        file.write("\n")

2378
10163
1480
15262
5347
