# 02807 Final project: Recommendation system
A recommendation system for products in the __Digital Music__ category on __Amazon__. Products are suggested based on a short description entered by the user.
[**Data source**](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)

## Data processing

In [1]:
# Imports
import json
import gzip
import spacy
import warnings
import os
import pandas as pd
import numpy as np
import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN, KMeans
from scipy import sparse
from hdbscan import HDBSCAN
from collections import Counter, defaultdict
from lxml import html, etree
from nrclex import NRCLex
from transformers import AutoTokenizer, AutoModelWithLMHead

In [7]:
# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/meta_Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


__Data format__
   * `asin`: ID of the product, e.g. 0000031852
   * `title`: name of the product
   * `feature`: bullet-point format features of the product
   * `description`: description of the product
   * `price`: price in US dollars (at time of crawl)
   * `imageURL`: url of the product image
   * `imageURLHighRes`: url of the high resolution product image
   * `related`: related products (also bought, also viewed, bought together, buy after viewing)
   * `salesRank`: sales rank information
   * `brand`: brand name
   * `categories`: list of categories the product belongs to
   * `tech1`: the first technical detail table of the product
   * `tech2`: the second technical detail table of the product
   * `similar`: similar product table

_Note that several attributes are usually left blank for each product (the specific attributes differ from product to product)._
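
To get a feel for how often attributes are blank, a minimal sketch along the following lines could be used. The records below are invented toy data, and `is_blank` is a hypothetical helper, not part of the dataset tooling.

```python
from collections import Counter

# Toy records; field names follow the list above, the values are invented.
toy_records = [
    {"asin": "A1", "title": "Album one", "description": ["Great record"], "price": ""},
    {"asin": "A2", "title": "", "description": [], "price": "$9.99"},
    {"asin": "A3", "title": "Album three", "description": [""], "brand": "Some Label"},
]

def is_blank(value):
    """Treat None, empty strings, and empty or all-blank lists as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple)):
        return all(is_blank(item) for item in value)
    return False

# Every field name that occurs in at least one record
fields = sorted({key for record in toy_records for key in record})

# Count, per field, how many records leave it blank or omit it entirely
blank_counts = Counter(
    field for record in toy_records for field in fields if is_blank(record.get(field))
)

for field in fields:
    print(f"{field}: {blank_counts[field]} of {len(toy_records)} blank")
```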


In [2]:
# Each line of the gzipped file is one JSON object describing a product;
# its attributes are listed under "Data format" above.

### Load the metadata
data = []
with gzip.open('Dataset/meta_Digital_Music.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))
    
# Total length of list, this number equals total number of products
print("Total number of items in the dataset: ", len(data))

Total number of items in the dataset:  74347
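
The loading step relies on the file being gzip-compressed JSON Lines (one JSON object per line). The sketch below reproduces that pattern on an in-memory toy file so it can be run without the dataset; `read_json_lines` and the sample records are hypothetical names and values used only for illustration.

```python
import gzip
import io
import json

# Build a tiny gzip-compressed JSON Lines "file" in memory (invented records).
sample_records = [
    {"asin": "0000000001", "title": "First toy product"},
    {"asin": "0000000002", "title": "Second toy product"},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")
buffer.seek(0)

def read_json_lines(fileobj):
    """Yield one parsed JSON object per non-empty line of a gzipped stream."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

loaded = list(read_json_lines(buffer))
print(len(loaded), loaded[0]["asin"])
```

Reading lazily with a generator avoids holding both the raw lines and the parsed objects in memory at once, which may matter for the larger category files.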


In [3]:
# Convert list into pandas dataframe
df = pd.DataFrame.from_dict(data)

# Change list to strings
df.description = df.description.apply(lambda x: ". ".join(x))

# Many descriptions (and other features) contain HTML.
# strip_html parses the markup and returns plain text that is better suited for analysis.
def strip_html(s):
    if not s or s.isspace():
        return ''
    try:
        return str(html.fromstring(s).text_content())
    except etree.ParserError:
        # The cause of this error is unclear; it seems to occur for some
        # effectively empty descriptions, so they are treated as empty strings.
        return ''

df.description = df.description.apply(strip_html)

# Filter out descriptions shorter than 100 chars
df = df[df['description'].map(lambda d: len(d) >= 100)]

df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
4,[],,1. Losing Game 2. I Can't Wait 3. Didn't He Sh...,,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8J...",,Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNV...","<img src=""https://images-na.ssl-images-amazon....",,,,1526146,[],[],
10,[],,The Music Connection by Silver Burdett Ginn is...,,"The Music Connection Grade 3, CD 7",[],,Silver Burdett Ginn,[],"694,369 in CDs & Vinyl (","[0382262948, 0382262875, 0382262891, 038226290...","<img src=""https://images-na.ssl-images-amazon....",,,$18.79,382262921,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
14,[],,"These entertaining, bright, joyous romps are f...",,Chicka Chicka Boom Boom CD Trio- Set of Three ...,[],,Bill Martin Jr,[],"564,433 in CDs & Vinyl (",[1481400568],"<img src=""https://images-na.ssl-images-amazon....",,,,545352886,[],[],
15,[],,The KING JAMES BIBLE is part of our deepest cu...,,Speak To Me,[],,David Teems,[],"506,806 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon....",,,$500.00,578077698,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
17,[],,Soothing background music and a female voice l...,,"<span class=""a-size-medium a-color-secondary a...",[],,Benita A. Esposito and Steven Mark Kohn (music),[],"980,759 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon....",,,,615165982,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
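
The table above still shows markup in columns other than `description`, for example `<span ...>` in a title and `&amp;` in `rank`. As a dependency-free illustration of the same idea as `strip_html`, the sketch below uses the standard library's `html.parser` instead of `lxml`; it is a swapped-in technique, and `strip_html_stdlib` and the sample strings are hypothetical.

```python
from html import unescape
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html_stdlib(text):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    if not text or text.isspace():
        return ""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())

print(strip_html_stdlib('<span class="a-size-medium">Speak To Me</span>'))
# For strings that contain entities but no tags, html.unescape is enough
print(unescape("506,806 in CDs &amp; Vinyl ("))
```

Unlike a full HTML library, this sketch also keeps the text inside `<script>` or `<style>` elements, so it is only suitable for simple fragments.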


In [10]:
print("Total number of products after filtering: ", len(df))
print("First three product descriptions:")
for i in range(3):
    print()
    print(df.iloc[i].title)
    print(df.iloc[i].description)

Total number of products after filtering:  21911
First three product descriptions:

Early Works - Dallas Holm
1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will You Do? 15. Rise Again

The Music Connection Grade 3, CD 7
The Music Connection by Silver Burdett Ginn is a teaching aid for  
an elementary music or a homeroom teacher. Created by authorities  
in Music, The Music Connection: by Silver Burdett provides an  
excellent foundation for Music studies. Silver Burdetts style is  
suited towards Music studies, and will teach students the  
material clearly without overcomplicating the subject. Contains a  
variety of recordings such as vocal tracks, performance tracks,  
pick-a-track, dance - practice tempo, & more.

Chicka Chicka Boom Boom CD

In [4]:
# Remove empty columns
df.replace("", np.nan, inplace=True)
df.dropna(how='all', axis=1, inplace=True)

# Display final cleaned up pandas dataframe
df.head()

Unnamed: 0,category,description,title,also_buy,brand,feature,rank,also_view,main_cat,date,price,asin,imageURL,imageURLHighRes,details
4,[],1. Losing Game 2. I Can't Wait 3. Didn't He Sh...,Early Works - Dallas Holm,"[B0002N4JP2, 0760131694, B00002EQ79, B00150K8J...",Dallas Holm,[],"399,269 in CDs & Vinyl (","[B0002N4JP2, 0760131694, B00150K8JC, B003MTXNV...","<img src=""https://images-na.ssl-images-amazon....",,,1526146,[],[],
10,[],The Music Connection by Silver Burdett Ginn is...,"The Music Connection Grade 3, CD 7",[],Silver Burdett Ginn,[],"694,369 in CDs & Vinyl (","[0382262948, 0382262875, 0382262891, 038226290...","<img src=""https://images-na.ssl-images-amazon....",,$18.79,382262921,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
14,[],"These entertaining, bright, joyous romps are f...",Chicka Chicka Boom Boom CD Trio- Set of Three ...,[],Bill Martin Jr,[],"564,433 in CDs & Vinyl (",[1481400568],"<img src=""https://images-na.ssl-images-amazon....",,,545352886,[],[],
15,[],The KING JAMES BIBLE is part of our deepest cu...,Speak To Me,[],David Teems,[],"506,806 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon....",,$500.00,578077698,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
17,[],Soothing background music and a female voice l...,"<span class=""a-size-medium a-color-secondary a...",[],Benita A. Esposito and Steven Mark Kohn (music),[],"980,759 in CDs &amp; Vinyl (",[],"<img src=""https://images-na.ssl-images-amazon....",,,615165982,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


## Adding emotion characteristics of the descriptions

In [12]:
# Applying NRCLex emotions
df['emotion_nrc'] = df.description.apply(lambda x: NRCLex(x).raw_emotion_scores) 
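
Lexicon-based scoring of this kind can be pictured as counting, per emotion, how many words of the text are associated with it. The sketch below illustrates the idea with a hypothetical five-word lexicon; it is not the NRC lexicon and not the NRCLex implementation, and `toy_emotion_scores` is an invented name.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; it is not the NRC lexicon.
TOY_LEXICON = {
    "joyous": ["joy", "positive"],
    "bright": ["joy", "positive"],
    "soothing": ["trust", "positive"],
    "losing": ["sadness", "negative"],
    "broken": ["sadness", "negative"],
}

def toy_emotion_scores(text):
    """Count how often each emotion is triggered by the words of the text."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = Counter()
    for word in words:
        scores.update(TOY_LEXICON.get(word, []))
    return dict(scores)

print(toy_emotion_scores("These entertaining, bright, joyous romps"))
```

A description without any lexicon word yields an empty dictionary, which is why the later step that picks the most significant emotion has to handle the empty case.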

In [9]:
# Load the spaCy affect model, suppressing the warning about the old spaCy version
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    nlp_affect = spacy.load('submodules/Spacy-Affect-Model/affect_ner')

# Count the affect labels of the entities recognized in each description
df['emotion_spacy'] = df.description.apply(lambda x: Counter([item.label_.lower() for item in nlp_affect(x).ents]))

In [10]:
# Transformer method for emotion recognition
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-emotion")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-emotion")
model.to(device)

def get_emotion(text):
    # Encode the text with an explicit end-of-sequence token
    input_ids = tokenizer.encode(text + '</s>', return_tensors='pt').to(device)
    # max_length=2 limits the output to the decoder start token plus one generated token
    output = model.generate(input_ids=input_ids, max_length=2)
    # Decode the generated ids; the first decoded string holds the emotion label
    dec = [tokenizer.decode(ids) for ids in output]
    return dec[0]



In [11]:
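# Apply the transformer method to the first 100 descriptions only (the remaining rows get NaN).
# The [6:] slice drops the leading '<pad> ' that is expected at the start of the decoded output.
df['emotion_transformer'] = df.description[:100].apply(lambda x: get_emotion(x)[6:])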

Token indices sequence length is longer than the specified maximum sequence length for this model (583 > 512). Running this sequence through the model will result in indexing errors
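
The fixed `[6:]` slice only works if the decoded string starts with exactly `<pad> `. A slightly more defensive way to clean the label could look like the sketch below; it assumes decoded outputs of the form `<pad> joy`, and `clean_label` is a hypothetical helper.

```python
def clean_label(decoded):
    """Remove T5 special tokens and surrounding whitespace from a decoded label."""
    for token in ("<pad>", "</s>"):
        decoded = decoded.replace(token, "")
    return decoded.strip()

print(clean_label("<pad> joy"))          # same result as "<pad> joy"[6:]
print(clean_label("<pad> sadness</s>"))  # also handles a trailing end-of-sequence token
```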


In [12]:
# Extract the most significant emotion of a description
def get_most_significant_emotion(emotions):
    try:
        sign_emotion = max(emotions, key=emotions.get)
    except ValueError:
        # No emotion was detected (empty dict or Counter)
        sign_emotion = None
    return sign_emotion

df['most_significant_emotion_nrc'] = df.emotion_nrc.apply(get_most_significant_emotion)
df['most_significant_emotion_spacy'] = df.emotion_spacy.apply(get_most_significant_emotion)

# Optionally save the enriched dataframe
save_csv = False
if save_csv:
    df.to_csv('digital_music.csv')
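
One property of picking the maximum with `max(..., key=...)` is that ties are resolved silently: the first key with the highest score wins. The sketch below demonstrates this on invented scores and shows one possible variant that returns all tied emotions; `most_significant` and `all_top_emotions` are hypothetical names.

```python
from collections import Counter

def most_significant(emotions):
    """Return the first emotion with the highest score, or None if there is none."""
    return max(emotions, key=emotions.get) if emotions else None

def all_top_emotions(emotions):
    """Return every emotion that shares the highest score, sorted by name."""
    if not emotions:
        return []
    top = max(emotions.values())
    return sorted(name for name, score in emotions.items() if score == top)

scores = {"positive": 3, "joy": 3, "trust": 1}
print(most_significant(scores))     # the first of the tied keys in insertion order
print(all_top_emotions(scores))     # both tied emotions
print(most_significant(Counter()))  # None for an empty score table
```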

## Similar items

In [None]:
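dfdescription = df.description
# Collect the distinct descriptions recorded for each product (asin)
descr = defaultdict(list)

for idx, row in df.iterrows():
    if row.description in descr[row.asin]:
        # The same asin appears again with an identical description
        print(idx, row.asin, row.description)
    else:
        descr[row.asin].append(row.description)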

In [None]:
descr

In [None]:
# Show products (asin) that have more than one distinct description
for key, value in descr.items():
    if len(value) > 1:
        print(key, value)

In [None]:
display(descr)

In [None]:
def shingle(aString, q, delimiter=' '):
    """
    Input:
        - aString (str): string to split into shingles
        - q (int): number of consecutive tokens per shingle
        - delimiter (str): delimiter used to split the input string into tokens
          (default: space); an empty string produces character-level shingles
    Return: list of unique shingles (in arbitrary order)
    """
    all_shingles = []
    if delimiter != '':
        words_list = aString.split(delimiter)
    else:
        words_list = aString
    for i in range(len(words_list) - q + 1):
        all_shingles.append(delimiter.join(words_list[i:i+q]))
    return list(set(all_shingles))

In [None]:
ex_string, q = dfdescription.iloc[0], 2
# ex_string, q = "Latin rhythms that will get your kids singing in Spanish. ", 2
ex_shingles = shingle(ex_string, q)
# assert len(ex_shingles) == 7
print('\nInitial string:', ex_string)
print(f'>> Shingles with q = {q} :',ex_shingles)
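
Shingle sets are commonly compared with the Jaccard similarity, i.e. the size of the intersection divided by the size of the union. The sketch below shows this on two invented sentences; it is one possible way to use shingles for finding similar items, not necessarily the approach taken later in this project. `word_shingles` is a simplified, hypothetical variant of `shingle` that splits on any whitespace and returns a set.

```python
def word_shingles(text, q):
    """Return the set of q-word shingles of a text (split on whitespace)."""
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

first = word_shingles("soothing background music and a female voice", 2)
second = word_shingles("soothing background music and a male voice", 2)

# 4 shared shingles out of 8 distinct ones
print(jaccard(first, second))
```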

In [None]:
print(len(dfdescription))
# Reassign rather than use inplace=True, because dfdescription refers to a column of df
dfdescription = dfdescription.drop_duplicates()
print(len(dfdescription))

In [None]:
df.head()

In [None]:
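# Self-merge on 'asin': pairs every row with all descriptions recorded for the same asin
# (the result contains description_x and description_y columns)
merged_df = df.merge(df[['asin', 'description']], on='asin', how='left')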

In [None]:
merged_df.iloc[15:200]

## Adding clustering based on users' purchases

In [13]:
# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/Digital_Music.csv'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFilesSmall/Digital_Music.csv -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


In [5]:
# Get the 'asin' (unique identifier) of every product that is still in use
unique_asin = set(df.asin)
# Load the user ratings into a pandas dataframe
df_ratings = pd.read_csv('Dataset/Digital_Music.csv', header=None)
df_ratings.columns = ['asin', 'user', 'rating', 'timestamp']
# Keep only the ratings of products used in the previous parts
df_ratings = df_ratings.loc[df_ratings['asin'].isin(unique_asin)]
# Get all unique users
unique_user = set(df_ratings.user)
# Create dicts that map a product asin (user id) -> index in the matrix
products = {product: idx for idx, product in enumerate(unique_asin)}
users = {user: idx for idx, user in enumerate(unique_user)}
# Create a dense binary matrix in which each row is a product and each column is a user;
# an entry is 1 if the user rated the product
product_user_matrix = np.zeros((len(products), len(users)))
for _, row in df_ratings.iterrows():
    product_user_matrix[products[row.asin], users[row.user]] = 1
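
Because most users rate only a few products, such a matrix is almost entirely zeros. As a sketch of an alternative, the same binary matrix could be assembled directly in a sparse format from coordinate lists, which avoids allocating the dense array. The ratings below are invented, and the `toy_` names are hypothetical.

```python
import numpy as np
from scipy import sparse

# Invented (asin, user) pairs; the last pair duplicates the first one.
toy_ratings = [("P1", "U1"), ("P1", "U2"), ("P2", "U2"), ("P3", "U3"), ("P1", "U1")]

# Sorting makes the row and column order reproducible between runs
toy_products = {asin: idx for idx, asin in enumerate(sorted({p for p, _ in toy_ratings}))}
toy_users = {user: idx for idx, user in enumerate(sorted({u for _, u in toy_ratings}))}

rows = np.array([toy_products[p] for p, _ in toy_ratings])
cols = np.array([toy_users[u] for _, u in toy_ratings])

toy_matrix = sparse.coo_matrix(
    (np.ones(len(toy_ratings)), (rows, cols)),
    shape=(len(toy_products), len(toy_users)),
).tocsr()

# Converting to CSR sums duplicate entries, so reset all stored values to 1
toy_matrix.data[:] = 1

print(toy_matrix.toarray())
```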

__Applying the clustering algorithm__

In [11]:
# Reduce dimensionality
sparse_matrix = sparse.dok_matrix(product_user_matrix)
svd = TruncatedSVD(n_components=1000, n_iter=10, random_state=42)
reduced_matrix = svd.fit_transform(sparse_matrix)
print(reduced_matrix.shape)
print(len(svd.explained_variance_ratio_))
print(sum(svd.explained_variance_ratio_))

(19972, 1000)
1000
0.4433658231459868
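
The last number printed above means that the 1000 components retain roughly 44% of the variance. To see how the cumulative ratio can guide the choice of `n_components`, the sketch below computes it for a small random binary matrix with a full SVD from NumPy. It assumes the ratio is defined as the variance of the projected data divided by the total variance of the input, and `components_needed` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small random binary matrix standing in for the product-user matrix
X = (rng.random((50, 20)) < 0.1).astype(float)

# Full SVD; U * S is the data projected onto the right singular vectors
U, S, Vt = np.linalg.svd(X, full_matrices=False)
projected = U * S

# Share of the total variance captured by each component
ratio = projected.var(axis=0) / X.var(axis=0).sum()
cumulative = np.cumsum(ratio)

def components_needed(target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    return int(np.searchsorted(cumulative, target) + 1)

k = components_needed(0.9)
print(k, cumulative[k - 1])
```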


In [12]:
clustering = KMeans(n_clusters=1000, n_init='auto', random_state=42).fit_predict(reduced_matrix)
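
Before using the cluster labels, it can help to summarize how evenly the items are spread across clusters. The sketch below does this for a short invented label list; the same calls would apply to the `clustering` array, and the variable names are hypothetical.

```python
from collections import Counter

# Invented cluster labels for ten items
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3]

sizes = Counter(toy_labels)
largest_label, largest_size = sizes.most_common(1)[0]
largest_share = largest_size / len(toy_labels)
singletons = sum(1 for size in sizes.values() if size == 1)

print(f"largest cluster: {largest_label} with {largest_share:.0%} of the items")
print(f"clusters with a single item: {singletons} of {len(sizes)}")
```

A single cluster that holds most of the items, together with many single-item clusters, would suggest that the labels carry little information for most products.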

In [14]:
Counter(clustering)

Counter({199: 13,
         0: 17148,
         126: 49,
         473: 13,
         933: 2,
         132: 69,
         291: 1,
         907: 8,
         839: 9,
         63: 79,
         62: 6,
         359: 13,
         241: 14,
         238: 1,
         794: 1,
         103: 40,
         915: 11,
         121: 91,
         26: 1,
         287: 1,
         159: 38,
         354: 18,
         385: 16,
         977: 9,
         18: 46,
         90: 36,
         329: 1,
         705: 10,
         985: 8,
         560: 12,
         218: 1,
         468: 1,
         935: 1,
         655: 2,
         363: 15,
         749: 1,
         375: 27,
         582: 9,
         854: 8,
         573: 11,
         471: 14,
         968: 11,
         648: 11,
         954: 3,
         955: 1,
         353: 17,
         658: 12,
         970: 5,
         190: 1,
         135: 19,
         434: 13,
         914: 1,
         336: 1,
         808: 9,
         384: 1,
         188: 1,
         116: 34,
      