# Clothes shopping recommendation

This project aims to improve customers' online shopping experience and their enthusiasm for shopping. It describes the models used to recommend items that a user is likely to purchase, and compares several approaches to measuring similarity and, from it, making recommendations. \
The experimental results show that measuring similarity between users, based on their purchase and rating histories, makes it possible to predict more precisely how a user will rate items they have not yet purchased, and consequently to curate a list of new items that the user is likely to purchase and rate highly.\
These results are not limited to the fashion industry. The models can be applied in other industries where user-item ratings are available and rating predictions can inform item recommendations.


In [241]:
# import packages

In [220]:
import gzip
import math
import matplotlib.pyplot as plt
import numpy
import random
import sklearn
import string
import json
import pandas as pd
from collections import defaultdict
#from gensim.models import Word2Vec
#from nltk.stem.porter import *
from sklearn import linear_model
from sklearn.manifold import TSNE

#### Dataset
The dataset contains clothing fit measurements and feedback from RentTheRunway. It comes from the paper "Decomposing Fit Semantics for Product Size Recommendation in Metric Spaces" by Rishabh Misra, Mengting Wan, and Julian McAuley (RecSys, 2018).

In [164]:
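dataset = []
# the context manager closes the file once loading is done
with gzip.open('data/renttherunway_final_data.json.gz') as f:
    for l in f:
        dataset.append(json.loads(l))
        if len(dataset) == 100000:  # keep only the first 100,000 records
            break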

In [165]:
# split data
training_set = dataset[0:int(len(dataset)*9/10)]
test_set = dataset[int(len(dataset)*9/10):]
print(len(training_set))
print(len(test_set))

90000
10000
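
The split above is positional: the first 90% of the loaded records form the training set and the last 10% the test set, so any ordering in the file carries over into the split. Below is a minimal sketch of a shuffled split with a fixed seed, as one possible alternative. The helper name `shuffled_split` and the seed value are assumptions and are not part of the original notebook, whose reported numbers were obtained with the positional split.

```python
import random

def shuffled_split(records, train_fraction=0.9, seed=0):
    """Return (train, test) after shuffling a copy of records with a fixed seed."""
    shuffled = list(records)               # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # a private RNG keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy usage on ten integers instead of the review records
train_demo, test_demo = shuffled_split(range(10))
print(len(train_demo), len(test_demo))
```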


The full dataset contains 192,544 records, of which only the first 100,000 are loaded above. Besides body measurements such as bust size, height, and weight, each record includes the text feedback (`review_text`), the review date of the item’s purchase or rental, and the rating the user gave the item on a scale of 1-10.


In [166]:
training_set[0]

{'fit': 'fit',
 'user_id': '420272',
 'bust size': '34d',
 'item_id': '2260466',
 'weight': '137lbs',
 'rating': '10',
 'rented for': 'vacation',
 'review_text': "An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.",
 'body type': 'hourglass',
 'review_summary': 'So many compliments!',
 'category': 'romper',
 'height': '5\' 8"',
 'size': 14,
 'age': '28',
 'review_date': 'April 20, 2016'}

In [207]:
# count how often each rating value occurs (missing ratings are counted as 0)
rating = [int(d['rating']) if d['rating'] is not None else 0 for d in training_set]
for i in range(0,11):
    print(i,rating.count(i))

0 42
1 0
2 463
3 0
4 1344
5 0
6 4974
7 0
8 24852
9 0
10 58325


### Jaccard

In [167]:
def Jaccard(s1 , s2):
    numerator = len(s1.intersection(s2))
    denominator = len(s1.union(s2))
    return numerator / denominator
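
As a small worked example of the Jaccard similarity defined above, the sketch below compares two made-up rental histories. The item IDs are hypothetical, and the guard for two empty sets is an added assumption (the notebook's `Jaccard` would raise a `ZeroDivisionError` in that case).

```python
def jaccard(s1, s2):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)

# Hypothetical item IDs rented by two users
user_a = {"dress_1", "romper_2", "gown_3"}
user_b = {"romper_2", "gown_3", "jacket_4"}
print(jaccard(user_a, user_b))  # 2 shared items / 4 distinct items = 0.5
```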

In [187]:
usersPerItem = defaultdict (set)
itemsPerUser = defaultdict (set)
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)
ratingDict = {} # To retrieve a rating for a specific user/item pair

for d in training_set:
    user,item = d['user_id'], d['item_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    ratingDict[(user,item)] = d['rating']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)

In [188]:
def mostSimilarUser(i): # Query user i; returns the 10 most similar users
    similarities = []
    items = itemsPerUser[i] # items rented or purchased by user i
    for j in itemsPerUser: # Compute similarity against every other user
        if j == i: continue
        sim = Jaccard(items, itemsPerUser[j])
        similarities.append((sim, j))
    similarities.sort(reverse=True) # Sort once, after the loop, to find the most similar
    return similarities[:10]

In [170]:
for d in training_set:
    i = d['item_id']
    u = d['user_id']
    if d['rating'] is None:
        d['rating'] = 0
    else:
        d['rating'] =  int(d['rating'])

In [171]:
ratingMean = sum([int(d['rating']) if d['rating'] is not None else 0 for d in training_set]) / len(training_set)

In [172]:
def predictRating(item, user):
    ratings = []
    similarities = []
    for d in reviewsPerItem[item]:
        j = d['user_id']
        if j == user: continue
        ratings.append(d['rating'])
        similarities.append(Jaccard(itemsPerUser[user], itemsPerUser[j]))
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x, y in zip(ratings, similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        return ratingMean
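
The prediction above is a similarity-weighted average of the ratings that other users gave the item, with the global mean as a fallback when no neighbor has any overlap with the target user. A minimal sketch of the same arithmetic on toy numbers follows; the function name `weighted_prediction` and all values are illustrative assumptions.

```python
def weighted_prediction(neighbor_ratings, similarities, fallback):
    """Similarity-weighted average of neighbor ratings, or fallback if all weights are 0."""
    total = sum(similarities)
    if total <= 0:
        return fallback
    return sum(r * s for r, s in zip(neighbor_ratings, similarities)) / total

# Two hypothetical neighbors rated the item 10 and 6, with similarities 0.75 and 0.25
print(weighted_prediction([10, 6], [0.75, 0.25], fallback=8.0))  # (7.5 + 1.5) / 1.0 = 9.0
```

The more similar neighbor dominates, so the prediction lands closer to 10 than to 6.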

In [173]:
ratingMean

9.091244444444444

In [174]:
simPredictions = [predictRating(d['item_id'], d['user_id']) for d in test_set]

In [175]:
labels = [int(d['rating']) if d['rating'] is not None else 0 for d in test_set]

In [176]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions, labels)]
    return sum(differences)/ len(differences)

In [177]:
MSE(simPredictions, labels)

2.13332842884137
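
An MSE is easier to interpret next to a trivial baseline, such as predicting the training mean for every test example. The sketch below shows that computation on toy ratings; `mean_baseline_mse` is a hypothetical helper, and no baseline value for the real dataset is claimed here.

```python
def mse(predictions, labels):
    """Mean squared error between two equal-length sequences."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

def mean_baseline_mse(train_ratings, test_ratings):
    """MSE obtained by predicting the training mean for every test example."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    return mse([mean_rating] * len(test_ratings), test_ratings)

# Toy ratings, not taken from the dataset
print(mean_baseline_mse([10, 8, 10, 8], [10, 6]))  # mean 9 -> (1 + 9) / 2 = 5.0
```

In the notebook, calling it with the training ratings and `labels` should give the reference value against which the model MSEs can be compared.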

### Bag of Words with Cosine Similarity

In [200]:
#total words
wordCount = defaultdict(int)
for d in training_set:
    for w in d['review_text'].split():
        wordCount[w] += 1
len(wordCount)

76921

In [202]:
#total words after filter
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
for d in training_set:
    r = ''.join([c for c in d['review_text'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1

counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

In [203]:
len(wordCount)

32537

In [205]:
training = []
for d in training_set:
    if d['rating'] is not None:
        training.append(d)

In [204]:
test = []
for d in test_set:
    if d['rating'] is not None:
        test.append(d)

In [206]:
#mean of total rating
ratingMean = sum([int(d['rating']) if d['rating'] is not None else 0 for d in training_set]) / len(training_set)
print(ratingMean)

9.091244444444444


In [210]:
itemAverages = defaultdict(list)
reviewsPerUser = defaultdict(list)
userAverages = defaultdict(list)
reviewsPerItem = defaultdict(list)
for d in training:
    i = d['item_id']
    u = d['user_id']
    if d['rating'] is None:
        d['rating'] = 0
    else:
        d['rating'] =  int(d['rating'])
    itemAverages[i].append(d['rating'])
    reviewsPerUser[u].append(d)
    
    userAverages[u].append(d['rating'])
    reviewsPerItem[i].append(d)

In [211]:
itemsPerUser = defaultdict(list)
for d in training:
    i = d['item_id']
    u = d['user_id']
    itemsPerUser[u].append(i)

In [212]:
usersPerItem = defaultdict(list)
for d in training:
    i = d['item_id']
    u = d['user_id']
    usersPerItem[i].append(u)

In [213]:
for i in itemAverages:
    itemAverages[i] = sum(itemAverages[i]) / len(itemAverages[i])
    
for u in userAverages:
    userAverages[u] = sum(userAverages[u]) / len(userAverages[u])

In [214]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [215]:
# join each user's (and each item's) cleaned reviews into a single document
wordsPerUser = defaultdict(list)
wordsPerItem = defaultdict(list)
punctuation = set(string.punctuation)
for d in training_set:
    u = d['user_id'] 
    i = d['item_id']
    wordsPerUser[u].append(''.join([c for c in d['review_text'].lower() if not c in punctuation]))
    wordsPerItem[i].append(''.join([c for c in d['review_text'].lower() if not c in punctuation]))
    #wordsPerUser[u].append(d['review_text'])
for i in wordsPerUser:
    wordsPerUser[i] = ' '.join(wordsPerUser[i])
for i in wordsPerItem:
    wordsPerItem[i] = ' '.join(wordsPerItem[i])

In [216]:
def Cosine(x1,x2):
    numer = 0
    norm1 = 0
    norm2 = 0
    for a1,a2 in zip(x1,x2):
        numer += a1*a2
        norm1 += a1**2
        norm2 += a2**2
    if norm1*norm2:
        return numer / math.sqrt(norm1*norm2)
    return 0
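
To make the bag-of-words idea concrete, the sketch below turns two short made-up reviews into word-count vectors and compares them with cosine similarity, using only the standard library. It is an illustration under simplifying assumptions: unlike the `CountVectorizer(stop_words='english')` used in the notebook, it does not remove stop words, and the helper names are hypothetical.

```python
import math
import string
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip punctuation, and count whitespace-separated tokens."""
    cleaned = "".join(c for c in text.lower() if c not in string.punctuation)
    return Counter(cleaned.split())

def cosine_counts(c1, c2):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values()) * sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

a = bag_of_words("Great fit, great color!")
b = bag_of_words("Great color but poor fit.")
# shared words: great (2*1), fit (1*1), color (1*1) -> dot product 4
print(cosine_counts(a, b))  # 4 / sqrt(6 * 5)
```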

In [217]:
for d in test:
    i = d['item_id']
    u = d['user_id']
    if d['rating'] is None:
        print('None element')
        d['rating'] = 0
    else:
        d['rating'] =  int(d['rating'])

In [225]:
#similarities by user
def predictRating(user,item):
    rating = 0
    ratingusermean = 0
    similarities = 0
    rev = wordsPerUser[user]
    pre = 0
    if rev == [] or rev == '':
        if itemAverages[item] == '' or itemAverages[item] == []:
            #print('ratingMean')
            return ratingMean
        else:
            #print('itemAverages')
            return itemAverages[item]
    
    else:
        ratingusermean = userAverages[user]
        for i in wordsPerUser:
            if i == user:
                continue
            else:
                if item in itemsPerUser[i]:    
                    documents =[wordsPerUser[user], wordsPerUser[i]]
                    count_vectorizer = CountVectorizer(stop_words='english')
                    sparse_matrix = count_vectorizer.fit_transform(documents)
                    # cosine_similarity accepts the sparse count matrix directly
                    sim = cosine_similarity(sparse_matrix)[0,1]
                    similarities += sim
                    rr = userAverages[user]
                    for x in reviewsPerUser[i]:
                        if x['item_id'] == item:
                            rr = x['rating']
                    rating += (rr-userAverages[user]) * sim
                    #print(user)
        if similarities == 0:
            pre = ratingusermean
        else:
            pre = ratingusermean + rating / similarities
        if pre > 10:
            return 10
        else:
            #print('pre')
            return pre

In [230]:
square_e = 0
step_check = 0
for n in range(len(test)):
    step_check += 1
    u, i = test[n]['user_id'], test[n]['item_id']
    predicted_rating = predictRating(u,i)
    #print((float(test_Set[n]['rating'])),predicted_rating)
    #print(test_Set[n]['rating'])
    square_e += (float(test[n]['rating'])-predicted_rating)**2
    
    if (step_check % 1000 == 0):
        print(step_check)
    
mse = square_e/len(test)

1000
2000
3000
4000
5000
6000
7000
8000
9000


In [232]:
mse

2.0254888390085966

### TF-IDF

In [233]:
# the 1000 most frequent words
words = [x[1] for x in counts[:1000]]

In [234]:
# count how often each rating value occurs in the training set
rate_count = defaultdict(int)
for rate in [d['rating'] for d in training_set ]:
    rate_count[rate] += 1


In [235]:
reviewsListPerUser = defaultdict(str)
for u in reviewsPerUser:
    list_review = [review['review_text'] for review in reviewsPerUser[u]]
    total_review = '\n'.join(list_review)
    reviewsListPerUser[u] = (total_review)
    

In [236]:
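# new document frequency (one document per user: all of that user's reviews)
df = defaultdict(int)
for d in reviewsListPerUser:
    # strip punctuation character by character, then rejoin into one string
    r = ''.join([c for c in reviewsListPerUser[d].lower() if not c in punctuation])
    # mark every word that occurs in at least one document
    for w in set(r.split()):
        # Note = rather than +=: every df[w] is 1, so all idf weights used later equal log2(number of users);
        # use df[w] += 1 for a true document frequency
        #df[w] += 1
        df[w] = 1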
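
Because the cell above assigns `df[w] = 1` instead of incrementing, every word gets the same inverse document frequency. The sketch below shows, on toy documents, what a counting document frequency and the resulting `log2(N / df)` weights look like. It is an illustration only: the function name is an assumption, and switching the notebook to this variant would change the reported MSE.

```python
import math
from collections import defaultdict

def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    counts = defaultdict(int)
    for text in documents:
        for w in set(text.split()):  # set() so a word counts once per document
            counts[w] += 1
    return counts

docs = ["great fit", "great color", "poor fit", "great dress"]  # toy documents
df_demo = document_frequency(docs)
idf_demo = {w: math.log2(len(docs) / df_demo[w]) for w in df_demo}
print(df_demo["great"], idf_demo["poor"])  # 3 documents contain "great"; log2(4 / 1) = 2.0
```

Common words such as "great" receive a low weight and rare words such as "poor" a high one, which is the behavior TF-IDF relies on.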

In [237]:
def predictRating(user,item):
    ratings = []
    similarities = []
    
    rev = reviewsListPerUser[user]
    if rev == [] or rev == '':
        if itemAverages[item] == '' or itemAverages[item] == []:
            #print('ratingMean')
            return ratingMean
        else:
            #print('itemAverages')
            return itemAverages[item]
        
        
    
    tf = defaultdict(int)
    r = ''.join([c for c in rev.lower() if not c in punctuation])
    for w in r.split():
        # Note = rather than +=, different versions of tf could be used instead
        #tf[w] += 1
        tf[w] = 1
        
    #tfidf = dict(zip(words,[tf[w] * math.log2(len(reviewsListPerUser) / df[w]) for w in words]))
    tfidf = [tf[w] * math.log2(len(reviewsListPerUser) / df[w]) for w in words]

    #print('step1 finish')
    
    for d in reviewsPerItem[item]:
        i2 = d['user_id']
        if i2 == user: continue
        
        tf = defaultdict(int)
        r = ''.join([c for c in reviewsListPerUser[i2].lower() if not c in punctuation])
        for w in r.split():
            #tf[w] += 1
            tf[w] = 1
            
        tfidf2 = [tf[w] * math.log2(len(reviewsListPerUser) / df[w]) for w in words]
        
        similarities.append(Cosine(tfidf, tfidf2))
    
    #print('step2 finish',similarities)    
    
    for d in reviewsPerItem[item]:
        i2 = d['user_id']
        if i2 == user: continue
        # other's rating
        ratings.append(d['rating'] - userAverages[i2])
        
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        pred = userAverages[user] + sum(weightedRatings) / sum(similarities)
        if pred >= 10:
            pred = 10
        return pred
    else:
        return ratingMean
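
The prediction above starts from the target user's average rating and adds a similarity-weighted average of how far each neighbor's rating of the item deviates from that neighbor's own average. The sketch below reproduces that arithmetic on toy numbers. The function name, the fallback to the user's mean, and the lower clipping bound are assumptions; the notebook cell falls back to `ratingMean` and clips only at 10.

```python
def mean_centered_prediction(user_mean, neighbors, low=1.0, high=10.0):
    """neighbors: list of (rating, neighbor_mean, similarity) tuples."""
    total = sum(sim for _, _, sim in neighbors)
    if total <= 0:
        return user_mean  # assumption: fall back to the user's own average
    offset = sum((r - m) * sim for r, m, sim in neighbors) / total
    return min(high, max(low, user_mean + offset))  # clip to the rating scale

# Hypothetical values: the target user averages 8; two neighbors rated the item
neighbors = [(10, 9.0, 0.5), (6, 8.0, 0.5)]
print(mean_centered_prediction(8.0, neighbors))  # 8 + (0.5 - 1.0) / 1.0 = 7.5
```

Centering on each neighbor's mean compensates for users who rate everything high or low, which matters here because most ratings are 8 or 10.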

In [238]:
# test data: count how often each rating value occurs
rate_count = defaultdict(int)
for rate in [d['rating'] for d in test_set ]:
    rate_count[rate] += 1

print(rate_count)

for d in test_set:
    i = d['item_id']
    u = d['user_id']
    if d['rating'] is None:
        print('None element')
        d['rating'] = 0
    else:
        d['rating'] =  int(d['rating'])

defaultdict(<class 'int'>, {10: 6461, 6: 564, 8: 2791, 4: 127, 2: 55, 0: 2})


In [239]:
# MSE #
square_e = 0
step_check = 0
for n in range(len(test_set)):
    step_check += 1
    u, i = test_set[n]['user_id'], test_set[n]['item_id']
    predicted_rating = predictRating(u,i)
    square_e += (test_set[n]['rating']-predicted_rating)**2
    
    if (step_check % 1000 == 0):
        print(step_check)
    
mse = square_e/len(test_set)

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000


In [240]:
print(mse)

2.425489512540989


#### Result
Among the models above, Bag of Words with cosine similarity performs best at rating prediction, with an MSE of 2.025, compared with 2.133 for Jaccard similarity over purchase histories and 2.425 for TF-IDF. The best text-based model therefore outperformed similarity computed only from which items users have in common. Our findings suggest that users who write similar reviews also have similar preferences in clothes, so future recommendations can be based on the reviews a customer has already written. The other models do not fall far behind, and further experimentation with other models could yield more accurate results.
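
Turning rating predictions into recommendations amounts to scoring the items a user has not yet rented and keeping the highest-scoring ones. The sketch below shows one way to do this; `recommend_top_n` and the toy predictor are hypothetical and stand in for any of the `predictRating(user, item)` functions in this notebook.

```python
def recommend_top_n(user, candidate_items, seen_items, predict, n=5):
    """Rank items the user has not seen by predicted rating, highest first."""
    unseen = [i for i in candidate_items if i not in seen_items]
    scored = [(predict(user, i), i) for i in unseen]
    scored.sort(reverse=True)
    return [item for _, item in scored[:n]]

# Toy predictor standing in for one of the notebook's models
toy_scores = {"a": 9.5, "b": 7.0, "c": 8.2, "d": 9.9}
picks = recommend_top_n("u1", toy_scores, seen_items={"d"},
                        predict=lambda u, i: toy_scores[i], n=2)
print(picks)  # ['a', 'c'] because "d" was already seen
```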
