# 02807 Final project: Recommendation system
A recommendation system for products in the __Digital Music__ category on __Amazon__. Products are suggested based on a short description entered by the user.
[**Data source**](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/)

In [2]:
# Imports
import json
import gzip
import spacy
import warnings
import os
import pandas as pd
import numpy as np
# import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN, KMeans
from scipy import sparse
from hdbscan import HDBSCAN
from collections import Counter, defaultdict
from lxml import html, etree
from nrclex import NRCLex
from transformers import AutoTokenizer, AutoModelWithLMHead
from preprocess_data import *
from preprocess_data import pre_process_for_description



None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yuetingguan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yuetingguan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Load the data 

In [3]:
# Download dataset if it is not downloaded yet
if not os.path.exists('Dataset/meta_Digital_Music.json.gz'):
    !wget https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Digital_Music.json.gz -P ./Dataset
else:
    print('Dataset already downloaded.')

Dataset already downloaded.


__Data format__
   * `asin`: ID of the product, e.g. 0000031852
   * `title`: name of the product
   * `feature`: bullet-point format features of the product
   * `description`: description of the product
   * `price`: price in US dollars (at time of crawl)
   * `imageURL`: url of the product image
   * `imageURLHighRes`: url of the high resolution product image
   * `related`: related products (also bought, also viewed, bought together, buy after viewing)
   * `salesRank`: sales rank information
   * `brand`: brand name
   * `categories`: list of categories the product belongs to
   * `tech1`: the first technical detail table of the product
   * `tech2`: the second technical detail table of the product
   * `similar`: similar product table

_Note that several attributes are usually left blank for each product (which attributes are missing differs from product to product)._ 


In [4]:
### Load the meta data
data = []
with gzip.open('Dataset/meta_Digital_Music.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# Total length of list, this number equals total number of products
print("Total number of items in the dataset: ", len(data))

Total number of items in the dataset:  74347


In [5]:
# convert list into pandas dataframe
df = pd.DataFrame.from_dict(data)

# set size of display in pandas
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 20 )

# column names of the dataframe
print("Columns of the dataset: ", df.columns)

# show dataframe with columns and rows
# df.head()
# df2.info()

Columns of the dataset:  Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'tech2',
       'brand', 'feature', 'rank', 'also_view', 'main_cat', 'similar_item',
       'date', 'price', 'asin', 'imageURL', 'imageURLHighRes', 'details'],
      dtype='object')


# Data processing

- Remove empty description
- Remove HTML tag
- Remove URLs
- Remove hidden HTML characters
- Remove punctuation
- Remove numbers
- Transform every word into lowercase
- Remove stop words
- Perform stemming 
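
The cleaning steps listed above can be sketched with the standard library only. This is a minimal illustration, not the project's `pre_process_for_description` (whose source is not shown here): the function name `clean_description` and the tiny `STOP_WORDS` set are assumptions made for the example, and stemming is left out because it needs an external stemmer such as NLTK's `PorterStemmer`.

```python
import html
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's list instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}

def clean_description(text: str) -> str:
    """Hypothetical helper mirroring the listed cleaning steps (without stemming)."""
    text = html.unescape(text)                         # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # remove URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)           # remove punctuation and numbers
    tokens = text.lower().split()                      # lowercase and split on whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_description("<p>The 2 Best Hits &amp; more: visit http://example.com now!</p>"))
```

Tags are stripped before URLs so that links inside attributes disappear together with their tag.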

In [6]:
# Drop rows with no description (description is empty)
df = df[df['description'].map(lambda d: len(d)) > 0]
df.description
# df2.head()

4        [1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will ...
9                                                                                                                                                                                                                                                                                                                [.]
10       [The Music Connection by Silver Burdett Ginn is a teaching aid for  \nan elementary music or a homeroom teacher. Created by authorities  \nin Music, The Music Connection: by Silver Burdett provides an  \nexcellent foundation for Music studies. Silver Burdetts style is  \nsuited towards Music stu...
12                                                                       

In [7]:
# each description is a list of strings; remove the empty strings and join the rest into a single string
df.description = df.description.apply(lambda x: [string for string in x if string != ""])
df.description = df.description.apply(lambda x: " ".join(x))
df.iloc[0].description


"1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will You Do? 15. Rise Again"

In [8]:
# f = open("descriptionHTMLbefore.txt", "w")
# for i in range(5000):
#     f.write(df2.iloc[i].description)
# f.close()
df_similarity_scores = df.copy()

print("Example of description before preprocessing: ")
print(df.description.iloc[0:2])
df.description = df.description.apply(lambda x: pre_process_for_description(x))
print()
print("Example of description after preprocessing: ")
print(df.description.iloc[0:2])

# f = open("descriptionHTMLafter.txt", "w")
# for i in range(5000):
#     f.write(df2.iloc[i].description)
# f.close()



Example of description before preprocessing: 
4    1. Losing Game 2. I Can't Wait 3. Didn't He Shine 4. Never Seen...Righteous... 5. A Broken Heart 6. Looking Back 7. Here We Are 8. I Saw The Lord 9. Jesus Is A River Of Love 10. Hittin' The Road 11. I've Never Been Out Of... 12. Jesus Gotta Hold Of My Life 13. Saved- Saved- Saved 14. What Will Y...
9                                                                                                                                                                                                                                                                                                              .
Name: description, dtype: object

Example of description after preprocessing: 
4    losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
9                                                                                                   

## Does any product contain different descriptions?  
Some products are not unique: both the `asin` and the description are duplicated. 
We process the data so that each product appears only once.

In [67]:
# Counting occurence of unique "asin"
#asin_count = df['asin'].value_counts()
# print(asin_count)
#asin_more_than_once = asin_count[asin_count > 1].index
# print(asin_more_than_once)
# Step 2: Filter df2 to keep rows where 'asin' is in asin_more_than_once
#filtered_df = df[df['asin'].isin(asin_more_than_once)]
#filtered_df = filtered_df[["asin","description"]].sort_values(by="asin")
# Visual confirmation of duplicates 
#filtered_df

In [68]:
""" # If "asin" and "description" match -> drop
filtered_df.drop_duplicates(inplace=True)

# How many unique "asin" ?
len(filtered_df.asin.unique())
filtered_df """


Remove the duplicate products so that each description appears only once.

In [9]:
df_asin_description = df[["asin","description"]].copy()
df_asin_description.drop_duplicates(subset = "description", inplace=True)
# print(len(df_asin_description))
df_asin_description

Unnamed: 0,asin,description
4,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
9,0159024684,
10,0382262921,music connection silver burdett ginn teaching aid elementary music homeroom teacher created authorities music music connection silver burdett provides excellent foundation music studies silver burdetts style suited towards music studies teach students material clearly without overcomplicating su...
12,0545069882,spanish know gold edition learn spanish flash
13,0545109620,cd book long since vanished great condition classic
...,...,...
74336,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
74338,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
74339,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks
74342,B01HH7D5KU,free last southside never gon lose purple coming southside diamonds africa southside southside compadre southside march madness tarentino trap niggas southside da fam da gram skit southside nights southside total length


## Creating shingles

In [10]:
# Given a string input, return the list of shingles
def shingle(s, q, delimiter=' '):
    all_shingles = []
    if delimiter != '':
        words_list = s.split(delimiter)
    else:
        words_list = s
    for i in range (len(words_list)-q+1):
        all_shingles.append(delimiter.join(words_list[i:i+q]))
    return list(set(all_shingles))

In [11]:
# Apply shingles to the df_asin_description
df_asin_description["shingles"] = df_asin_description["description"].apply(lambda x: shingle(x, 3))
# df_asin_description
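
To make the idea concrete, here is a small self-contained sketch of word-level shingling with q = 3. The helper name `word_shingles` is introduced only for this example; it follows the same sliding-window idea as `shingle` above and is applied to one of the preprocessed descriptions from the dataset.

```python
def word_shingles(text: str, q: int = 3) -> set:
    """Return the set of all runs of q consecutive words in text."""
    words = text.split()
    # A text with fewer than q words yields an empty set.
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

description = "spanish know gold edition learn spanish flash"
for s in sorted(word_shingles(description)):
    print(s)
```

A description of n words produces at most n - q + 1 shingles, so very short descriptions contribute few or no shingles to the comparison.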

# Sentiment Analysis

In [12]:
df_process = df_asin_description
df_process.head()

Unnamed: 0,asin,description,shingles
4,1526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise,"[back saw lord, heart looking back, righteous broken heart, ta hold life, hold life saved, looking back saw, never seen righteous, losing game wait, game wait shine, never jesus got, river love hittin, road never jesus, hittin road never, wait shine never, love hittin road, seen righteous broken..."
9,159024684,,[]
10,382262921,music connection silver burdett ginn teaching aid elementary music homeroom teacher created authorities music music connection silver burdett provides excellent foundation music studies silver burdetts style suited towards music studies teach students material clearly without overcomplicating su...,"[style suited towards, towards music studies, recordings vocal tracks, foundation music studies, pick track dance, tracks pick track, without overcomplicating subject, suited towards music, dance practice tempo, silver burdett ginn, silver burdett provides, teaching aid elementary, burdett ginn ..."
12,545069882,spanish know gold edition learn spanish flash,"[gold edition learn, edition learn spanish, spanish know gold, learn spanish flash, know gold edition]"
13,545109620,cd book long since vanished great condition classic,"[long since vanished, vanished great condition, great condition classic, since vanished great, book long since, cd book long]"


In [14]:
# Suppressing warning about old version of spacy
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Applying Spacy affect model emotions
    nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

    
df_process['emotion_spacy'] = df_process.description.apply(lambda x: Counter([item.label_.lower() for item in nlp_affect(x).ents]))

In [None]:
# Applying NRCLex emotions
#df['emotion_nrc'] = df.description.apply(lambda x: NRCLex(x).raw_emotion_scores) 

In [16]:

# Extracting most significant emotion of a particular description
def get_most_significant_emotion(emotions):
    try:
        sign_emotion = max(emotions, key=emotions.get)
    except ValueError:
        sign_emotion = None
    return sign_emotion

#df['most_significant_emotion_nrc'] = df.emotion_nrc.apply(lambda x: get_most_significant_emotion(x))
df_process['most_significant_emotion_spacy'] = df_process.emotion_spacy.apply(lambda x: get_most_significant_emotion(x))

df_process.head(100)
save_csv = True
if save_csv:
    df_process.to_csv('digital_music.csv')



In [17]:

# Output of user emotion based on input.txt
with open("input.txt", "r") as file_input:
    text = file_input.read()
nlp_affect = spacy.load('Spacy-Affect-Model/affect_ner')

def measure_affect_score(sentence : str, nlp_affect):
    affect_percent = {'fear': 0.0, 'anger': 0.0, 'anticipation': 0.0, 'trust': 0.0, 'surprise': 0.0, 'positive': 0.0,
                      'negative': 0.0, 'sadness': 0.0, 'disgust': 0.0, 'joy': 0.0}
    emotions = []
    doc = nlp_affect(sentence)
    if len(doc.ents) != 0:
        for ent in doc.ents:
            emotions.append(ent.label_.lower())
        affect_counts = Counter()
        for emotion in emotions:
            affect_counts[emotion] += 1
        sum_values = sum(affect_counts.values())
        for key in affect_counts.keys():
            affect_percent.update({key: float(affect_counts[key]) / float(sum_values)})
    return affect_percent

user_emotion_scores = measure_affect_score(text,nlp_affect)
max_emotion = max(user_emotion_scores, key=user_emotion_scores.get)
user_emotion = max_emotion

print(user_emotion)



positive
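
The normalisation step in `measure_affect_score` can be tried without the spaCy model. The sketch below assumes the entity labels have already been extracted into a plain list; `affect_proportions` and `dominant_emotion` are hypothetical names used only here. It also shows one caveat: when no emotion is detected every score is 0.0, and a plain `max` over the dictionary would then return its first key, so the example returns `None` in that case instead.

```python
from collections import Counter

EMOTIONS = ["fear", "anger", "anticipation", "trust", "surprise",
            "positive", "negative", "sadness", "disgust", "joy"]

def affect_proportions(labels):
    """Turn a list of emotion labels into per-emotion proportions."""
    scores = dict.fromkeys(EMOTIONS, 0.0)
    counts = Counter(labels)
    total = sum(counts.values())
    for emotion, count in counts.items():
        scores[emotion] = count / total
    return scores

def dominant_emotion(scores):
    """Return the highest-scoring emotion, or None if nothing was detected."""
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

scores = affect_proportions(["joy", "joy", "trust", "positive"])
print(dominant_emotion(scores), scores["joy"])
print(dominant_emotion(affect_proportions([])))
```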


In [19]:
# Find all items whose most significant emotion matches the user's emotion
import pandas as pd

# read the file with the emotions of all items
df_emotion = pd.read_csv('digital_music.csv')  
# keep only the rows where the emotion equals the user's emotion
filtered_df = df_emotion[df_emotion['most_significant_emotion_spacy'] == user_emotion]

# save the filtered rows to a new file
filtered_df.to_csv('grouped_emotion.csv', index=False) 



# Similar Items System
A program that reads the dataset, preprocesses the data, and outputs the items most similar to a product description provided by the user.

In [20]:
import json
from collections import defaultdict
import gzip
import pandas as pd
from lxml import html,etree
import numpy as np
import ipywidgets as widgets
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
from nltk.stem import PorterStemmer
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import os
from preprocess_data import preprocess_data


# set stopwords vocabulary
nltk.download('stopwords')

# set tokenizer
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yuetingguan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yuetingguan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
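# After the CSV round trip the shingle lists are stored as strings, so parse them back into lists
from ast import literal_eval
df_asin_description = df_emotion[["asin", "description", "shingles"]].copy()
df_asin_description["shingles"] = df_asin_description["shingles"].apply(literal_eval)
df_asin_description.drop_duplicates(subset="description", inplace=True)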
# print(len(df_asin_description))
df_asin_description

Unnamed: 0,asin,description
0,0001526146,losing game wait shine never seen righteous broken heart looking back saw lord jesus river love hittin road never jesus got ta hold life saved saved saved rise
1,0159024684,
2,0382262921,music connection silver burdett ginn teaching aid elementary music homeroom teacher created authorities music music connection silver burdett provides excellent foundation music studies silver burdetts style suited towards music studies teach students material clearly without overcomplicating su...
3,0545069882,spanish know gold edition learn spanish flash
4,0545109620,cd book long since vanished great condition classic
...,...,...
26883,B01HG2DW1I,track listing butter ball zaq attack zona walk like guv sentimental pacific daylight trombone institute technology san jose fog city show crbs trombones giant
26884,B01HH5R7LK,coldplay head full dreams tour live etihad stadium manchester england june th cd intro head full dreams yellow every teardrop waterfall scientist birds paradise everglow lovers japan magic clocks midnight charlie brown hymn weekend fix heroes viva la vida cd adventure lifetime kaleidoscope troub...
26885,B01HH68B96,known live versions thats way life goes steam blacktop witha demo version superficial love sang hughie instead chris hicks
26886,B01HH7D5KU,free last southside never gon lose purple coming southside diamonds africa southside southside compadre southside march madness tarentino trap niggas southside da fam da gram skit southside nights southside total length


### Similarity of sets
Computing the Jaccard similarity

In [24]:
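# function that takes an intersection set and a union set and returns the Jaccard similarity
def similarity(intersection_set, union_set):
    # two empty sets have an empty union; return 0 instead of dividing by zero
    if not union_set:
        return 0.0
    return len(intersection_set) / len(union_set)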
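
For two sets A and B, the Jaccard similarity is J(A, B) = |A ∩ B| / |A ∪ B|: 1.0 when the sets are identical and 0.0 when they share nothing. The following sketch works through one case with Python sets; the helper name `jaccard` and the choice to score two empty sets as 0.0 are assumptions made for this example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets are scored as 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

user = {"gold edition learn", "edition learn spanish", "learn spanish flash"}
product = {"spanish know gold", "know gold edition", "gold edition learn",
           "edition learn spanish", "learn spanish flash"}

# 3 shared shingles out of 5 distinct shingles in total
print(jaccard(user, product))
```

Because the score depends only on set membership, repeated shingles in a description do not increase the similarity.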

In [25]:
# input = "In the dynamic landscape of higher education, universities are continually redefining the traditional boundaries of learning. The integration of arts, music, and literature has become a cornerstone in fostering a holistic educational experience. At the heart of this transformation is the commitment to connect students with a diverse range of disciplines, preparing them not only for academic success but also for a life enriched by creativity and cultural understanding. In this context, universities such as New School are pioneering integrated learning models that transcend conventional subject silos. Their innovative approach, backed by cutting-edge teaching methodologies, empowers students to explore the intersections of arts, music, and literature. The vision goes beyond a mere confluence of disciplines; it seeks to create an immersive educational environment where students can seamlessly weave their academic pursuits into the fabric of their daily lives. One key player in this educational evolution is McGraw, a renowned arts author whose work has become a guiding light for both educators and students alike. McGraw's contributions extend beyond the conventional boundaries of a university classroom, resonating with a global audience. His writings not only inspire a love for the arts but also emphasize the transformative power of integrated learning in shaping well-rounded individuals. The concept of an integrated learning environment transcends the boundaries of time and space. It is not confined to the four walls of a classroom; rather, it permeates every facet of a student's journey. In this dynamic world, students are no longer passive recipients of knowledge but active participants in a vibrant community of learners. The university becomes a nexus where diverse ideas converge, fostering a collaborative spirit that extends far beyond graduation. In this interconnected world, the New School's commitment to integrated learning is a beacon of innovation. 
Students are not just acquiring knowledge; they are forging connections between seemingly disparate fields, discovering the harmonies between arts and sciences, and navigating the rhythms of a multicultural world. This transformative journey prepares them to navigate the complexities of the modern world with a deep appreciation for diversity and a keen sense of intellectual curiosity. As we stand at the intersection of arts, music, and literature, the integrated learning paradigm championed by universities like New School, guided by visionary authors such as McGraw, is shaping the future of education. It is a testament to the idea that learning is not a compartmentalized experience but a symphony of knowledge, where every note, every discipline, plays a crucial role in the harmonious melody of life."

with open("input.txt", "r") as file_input:
    user_text = file_input.read()  # avoid shadowing the built-in input()
# print(user_text)
user_description = preprocess_data(user_text)
user_description = shingle(user_description, 3)  
# intersection_set = set(user_description).intersection(set(df_asin_description.shingles.iloc[0]))
# union_set = set(user_description).union(set(df_asin_description.shingles.iloc[0]))
# # perform similarity
# sim = similarity(intersection_set, union_set)
# print(sim)


In [26]:
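# Jaccard similarity between the user's shingles and each product's shingles.
# The shingles are rebuilt from the description so that this cell does not depend
# on how the `shingles` column was stored in the CSV file.
user_shingles = set(user_description)
product_shingles = df_asin_description["description"].fillna("").apply(lambda d: set(shingle(d, 3)))
df_asin_description["similarity"] = product_shingles.apply(
    lambda s: similarity(user_shingles & s, user_shingles | s) if s else 0.0)
df_asin_description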

Dataframe sorted by similarity

In [None]:

df_asin_description.sort_values(by="similarity", ascending=False, inplace=True)
df_asin_description


# if os.path.exists("10RecommendedItems.csv"):
#   os.remove("10RecommendedItems.csv")
# df_asin_description[:11].to_csv('10RecommendedItems.csv', index=False)

In [None]:
print("Similarity of items")
print(df_asin_description.similarity)
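
The ranking step can also be sketched without pandas. In the example below, `top_k`, the catalogue contents, and the placeholder identifiers `ASIN_A` to `ASIN_C` are all hypothetical; the catalogue is assumed to map each product ID to its set of shingles.

```python
import heapq

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k(query: set, catalog: dict, k: int = 10) -> list:
    """Return the k (product_id, score) pairs with the highest Jaccard similarity."""
    scored = ((jaccard(query, shingles), pid) for pid, shingles in catalog.items())
    # nlargest keeps only k candidates in memory instead of sorting the whole catalogue
    return [(pid, score) for score, pid in heapq.nlargest(k, scored)]

catalog = {
    "ASIN_A": {"live in concert", "in concert london"},
    "ASIN_B": {"greatest hits collection"},
    "ASIN_C": {"live in concert", "greatest hits collection", "hits collection remastered"},
}
query = {"live in concert", "in concert london", "concert london 1999"}

for pid, score in top_k(query, catalog, k=2):
    print(pid, round(score, 3))
```

With only the ten best items needed, selecting them with `heapq.nlargest` avoids sorting every product, although sorting the dataframe as above is perfectly adequate at this dataset size.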