# Setting up the data

This dataset is a collection of Amazon customer reviews and star ratings for Home Improvement products:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz

In [16]:
import gzip
from collections import defaultdict
import scipy
import scipy.optimize
import numpy
import random

path = "/content/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz"

In [17]:
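dataset = []

# Open the gzipped TSV in text mode; the with-block closes the file when done
with gzip.open(path, 'rt', encoding="utf8") as f:
    # The first line holds the tab-separated column names
    header = f.readline().strip().split('\t')

    for line in f:
        fields = line.strip().split('\t')
        d = dict(zip(header, fields))
        # Convert the numeric columns from strings to integers
        d['star_rating'] = int(d['star_rating'])
        d['helpful_votes'] = int(d['helpful_votes'])
        d['total_votes'] = int(d['total_votes'])
        dataset.append(d)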

## Checking the attributes of our data

In [18]:
dataset[0]

{'customer_id': '48881148',
 'helpful_votes': 0,
 'marketplace': 'US',
 'product_category': 'Home Improvement',
 'product_id': 'B00FR4YQYK',
 'product_parent': '381800308',
 'product_title': 'SadoTech Model C Wireless Doorbell Operating at over 500-feet Range with Over 50 Chimes, No Batteries Required for Receiver, (Various Colors)',
 'review_body': 'good product',
 'review_date': '2015-08-31',
 'review_headline': 'Four Stars',
 'review_id': 'R215C9BDXTDQOW',
 'star_rating': 4,
 'total_votes': 0,
 'verified_purchase': 'Y',
 'vine': 'N'}

# Finding Similarities

## Extracting the data features required for finding similarities

* usersPerItem: dictionary mapping each item (product ID) to the set of users who reviewed it
* itemsPerUser: dictionary mapping each user (customer ID) to the set of items they reviewed
* itemNames: dictionary mapping each product ID to its product title

In [19]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

itemNames = {}

for d in dataset:
    user,item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['product_title']

## Function to find similarities

We use Jaccard similarity to measure how similar two products are: the number of users who reviewed both products divided by the number of users who reviewed at least one of them.

In [20]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom
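
As a quick sanity check, here is a minimal worked example on toy data. The user IDs and the lowercase helper name `jaccard` are made up for illustration. The helper mirrors the function above and additionally returns 0.0 when both sets are empty, which is an assumption about the desired behaviour rather than something this notebook specifies.

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 & s2| / |s1 | s2| of two sets."""
    union = s1 | s2
    if not union:  # assumption: two empty sets get similarity 0.0
        return 0.0
    return len(s1 & s2) / len(union)

# Hypothetical reviewers of two products
users_a = {"u1", "u2", "u3"}
users_b = {"u2", "u3", "u4", "u5"}

# Shared reviewers: u2, u3 (2 users); all reviewers: u1..u5 (5 users)
print(jaccard(users_a, users_b))  # 2 / 5 = 0.4
```

A value of 1.0 means the two products were reviewed by exactly the same users, and 0.0 means they have no reviewers in common.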

The mostSimilar function below takes a product ID and a value n, the number of similar products we would like, and returns a list of the n most similar products as (similarity, product ID) tuples, sorted by decreasing similarity.

In [21]:
def mostSimilar(iD, n):
    similarities = []
    users = usersPerItem[iD]
    for i2 in usersPerItem:
        if i2 == iD: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:n]
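
The function above compares the query against every item in the dataset, even though any item with no shared reviewers has a Jaccard similarity of 0. One possible speed-up, sketched below on toy data, is to score only the items reviewed by at least one of the query's reviewers. This is a variant, not the notebook's method: the names `most_similar_fast`, `jaccard`, and the toy `reviews` list are hypothetical, and unlike `mostSimilar` it can return fewer than n results because zero-similarity items are never considered.

```python
from collections import defaultdict

def jaccard(s1, s2):
    """Jaccard similarity of two sets (0.0 if both are empty, by assumption)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def most_similar_fast(item_id, n, users_per_item, items_per_user):
    """Rank only the items that share at least one reviewer with item_id."""
    users = users_per_item[item_id]
    # Candidate items: everything reviewed by any reviewer of the query item
    candidates = set()
    for u in users:
        candidates |= items_per_user[u]
    candidates.discard(item_id)  # never recommend the query item itself
    sims = [(jaccard(users, users_per_item[i2]), i2) for i2 in candidates]
    sims.sort(reverse=True)
    return sims[:n]

# Toy (user, item) pairs, made up for illustration
reviews = [("u1", "A"), ("u2", "A"), ("u2", "B"), ("u3", "B"),
           ("u1", "C"), ("u2", "C"), ("u4", "D")]

users_per_item = defaultdict(set)
items_per_user = defaultdict(set)
for user, item in reviews:
    users_per_item[item].add(user)
    items_per_user[user].add(item)

result = most_similar_fast("A", 10, users_per_item, items_per_user)
print(result)  # "C" ranks first (identical reviewers), then "B"; "D" is never scored
```

On a dataset where most item pairs share no reviewers, this restricts the work to a small candidate set while producing the same ranking for all items with non-zero similarity.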

# Getting a Recommendation

In [22]:
query = dataset[10]['product_id']
query

'B008BYQCWM'

In [23]:
itemNames[query]

'Legend 809125 Legend Decorative Lever Passage Adj Bs Bronze'

In [24]:
mostSimilar(query, 10)

[(0.23837209302325582, 'B008BYQCWW'),
 (0.08955223880597014, 'B008BYQECU'),
 (0.050359712230215826, 'B008BYQCS6'),
 (0.04790419161676647, 'B002X9NGPM'),
 (0.026737967914438502, 'B0050ZP8DO'),
 (0.01834862385321101, 'B005CTJ7IU'),
 (0.017241379310344827, 'B00EEPZ6CY'),
 (0.012987012987012988, 'B008BYQGEQ'),
 (0.010526315789473684, 'B0000DCFE7'),
 (0.010416666666666666, 'B0000DCFEM')]

## Getting the 10 most similar products using Jaccard similarity

In [25]:
[itemNames[x[1]] for x in mostSimilar(query, 10)]

['Legend Wave Style Lever Handle Privacy Bed and Bath Leverset Lockset',
 'Legend 809153 Wave Style Handle Dummy Leverset, US613 Oil Rubbed Bronze Finish',
 'Legend 809119 Legend Decorative Lever Entry Adj Bs Kw1 Bronze',
 'Legend 809126 Wave Style Lever Front Door Knob Entry Leverset Lockset and Single Cylinder Deadbolt Combination Set, US3 Polished Brass Finish',
 '3558-ORB Pack of 12 3-1/2” Oil Rubbed Bronze Door Hinges 5/8" Radius Corners',
 'Designers Impressions Kingston Design Oil Rubbed Bronze Door Levers',
 'Dynasty Hardware DEN-HER-100-12PR Denver Front Door Handleset, Aged Oil Rubbed Bronze, With Heritage Lever, Right Hand',
 'Legend 931123 Legend Privacy Egg Knob Lock Oil Rubbed Bronze',
 'Moen Danbury Double Robe Hook',
 'Moen Danbury Towel Ring']