# Collaborative Recommendations

In this notebook we're going to build a collaborative recommendation algorithm for our knitting data (kindly provided by www.ravelry.com).

In [7]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from collections import Counter
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances
from math import log, sqrt

I've made these CSVs publicly available in an S3 bucket, here:

Download them and put them in a directory of your choice - you'll need to change the paths in the cells below.

In [2]:
df = pd.read_csv('../data/recommend/user_data.csv')

In [3]:
patterns_df = pd.read_csv('../data/recommend/patterns_data.csv')
patterns_df.index = patterns_df.pattern_id

## Data Cleaning

Not much to do here - just gonna filter to only patterns with 5 or more likes.

In [5]:
counts = df.groupby('pattern_id')['user_id'].count()

In [6]:
filtered_df = df[df.pattern_id.map(counts) >= 5]
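To see what the two cells above do, here is a minimal sketch on a made-up likes table. The `toy_` names and the data are assumptions for illustration only, not part of the Ravelry data.

```python
import pandas as pd

# Hypothetical toy likes table: one row per (user, pattern) like.
toy_likes = pd.DataFrame({
    'user_id':    [1, 2, 3, 4, 5, 1, 2, 3],
    'pattern_id': [10, 10, 10, 10, 10, 20, 20, 30],
})

# Number of likes per pattern, indexed by pattern_id.
toy_counts = toy_likes.groupby('pattern_id')['user_id'].count()

# map() looks up each row's pattern_id in toy_counts, giving a per-row like count,
# which we then use as a boolean mask.
toy_filtered = toy_likes[toy_likes.pattern_id.map(toy_counts) >= 5]

print(toy_filtered.pattern_id.unique())
```

Only pattern 10 has five likes, so patterns 20 and 30 are dropped along with all of their rows.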

## Making a matrix of user likes

What we're going to do here is turn the data into a "matrix" - one row for each pattern, and one column for every user. Each cell holds either a one - if the user likes that pattern - or a zero if they don't.

In [82]:
.cat.codes

0           6910
1          16783
2          16783
3          16783
4          16783
           ...  
3756706     8101
3756707     8101
3756708     8101
3756709     8101
3756710     8101

In [85]:

def make_matrix(df):
    data = np.ones(len(df))
    col = pd.Series(pd.Categorical(df.user_id)).cat.codes
    row = pd.Series(pd.Categorical(df.pattern_id)).cat.codes
    return csr_matrix((data, (row, col)), shape=(df.pattern_id.nunique(), df.user_id.nunique()))

In [86]:
matrix = make_matrix(filtered_df)
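The same construction on a tiny, made-up table makes the layout easier to inspect. This is a sketch under the assumption that the IDs are sortable; the `toy` data and label names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical likes: each row says "this user likes this pattern".
toy = pd.DataFrame({
    'user_id':    ['bea', 'ann', 'ann', 'cat'],
    'pattern_id': [300, 100, 300, 200],
})

# Categorical codes number the distinct values 0..n-1 in sorted order.
rows = pd.Categorical(toy.pattern_id)
cols = pd.Categorical(toy.user_id)

toy_matrix = csr_matrix(
    (np.ones(len(toy)), (rows.codes, cols.codes)),
    shape=(len(rows.categories), len(cols.categories)),
)

row_labels = list(rows.categories)  # pattern ID for each matrix row
col_labels = list(cols.categories)  # user ID for each matrix column
dense = toy_matrix.toarray()
print(row_labels, col_labels)
print(dense)
```

Note that row 0 is pattern 100 even though pattern 300 appears first in the table: the codes follow sorted ID order, not order of appearance. That matters later, when matrix rows are mapped back to pattern IDs.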

We're also gonna apply a TF-IDF transform to the data - each user's column is down-weighted according to how many patterns they've liked, which reduces the effect of "super users" who have liked a whole lot of stuff.

In [14]:
transformer = TfidfTransformer()
matrix = transformer.fit_transform(matrix)
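A small sketch of that effect, using a made-up 4-pattern by 3-user matrix in which user 0 has liked everything. The data is an assumption for illustration; `TfidfTransformer` is used with its defaults, as in the cell above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are patterns, columns are users. User 0 is a "super user".
toy_matrix = csr_matrix(np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
], dtype=float))

toy_tfidf = TfidfTransformer()
weighted = toy_tfidf.fit_transform(toy_matrix).toarray()

# One IDF weight per user: the more patterns a user likes, the lower the weight.
idf = toy_tfidf.idf_
print(idf)
print(weighted.round(3))
```

In the first row, the like from user 0 ends up with a smaller value than the like from user 1, even though both started as a one. Each row is also scaled to unit length, which is the transformer's default normalisation.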

## Dimensionality reduction

This is a bit complicated, but we're going to use truncated SVD (singular value decomposition) to reduce the dimensionality of the data - that is, turn it from one column per user to only 35 columns, while retaining as much information as possible. This is a way of "smoothing" the data - weird outliers tend to be reduced, and you get more consistent results.

In [75]:
shrinky = TruncatedSVD(35, random_state=6)

# TruncatedSVD's default solver is randomized, so I've fixed the random state to ensure consistent results.
# Many of the patterns have similar profiles, so a different random state gives slightly different recommendations.

In [69]:
shrunk = shrinky.fit_transform(matrix)
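To get a feel for what the reduction returns, here is a sketch on a random made-up matrix (200 patterns by 50 users, reduced to 5 columns). The sizes and `toy_` names are assumptions; only the `TruncatedSVD` usage mirrors the cells above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse 0/1 likes matrix, roughly 10% filled.
rng = np.random.default_rng(0)
toy_matrix = csr_matrix((rng.random((200, 50)) < 0.1).astype(float))

toy_svd = TruncatedSVD(n_components=5, random_state=6)
toy_shrunk = toy_svd.fit_transform(toy_matrix)

# Fraction of the original variance kept by the 5 components.
retained = toy_svd.explained_variance_ratio_.sum()

# Refitting with the same random_state gives the same result.
toy_again = TruncatedSVD(n_components=5, random_state=6).fit_transform(toy_matrix)

print(toy_shrunk.shape, round(retained, 3))
```

The output has one row per pattern and one column per component. Checking `explained_variance_ratio_` on the real data is one way to judge whether 35 components is a reasonable choice.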

## Finding similar patterns

It's a bit fiddly converting from the matrix (which contains one row for each pattern with five or more likes) back to the DataFrame which has the pattern names, but the process is otherwise simple.

In [150]:
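# Sorted to match the row order of the matrix, which comes from pd.Categorical codes in make_matrix.
patterns = sorted(filtered_df.pattern_id.unique())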

In [149]:
# Series of pattern IDs indexed by permalink, built without modifying patterns_df.
pattern_ids = patterns_df.set_index('permalink').pattern_id

In [153]:
target = patterns.index(pattern_ids.loc['mr-dangly'])
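The fiddly part is that a matrix row number only means something if the list of pattern IDs is in the same order as the matrix rows. A minimal sketch with made-up IDs and permalinks (all names here are hypothetical) shows the lookup chain and the ordering pitfall:

```python
import pandas as pd

# Hypothetical likes and pattern metadata.
toy_likes = pd.DataFrame({'pattern_id': [300, 100, 300, 200]})
toy_patterns = pd.DataFrame({
    'pattern_id': [100, 200, 300],
    'permalink': ['toy-hat', 'toy-sock', 'toy-scarf'],
})

# Row order produced by pd.Categorical codes: sorted unique IDs.
row_ids = list(pd.Categorical(toy_likes.pattern_id).categories)

# permalink -> pattern ID -> matrix row number.
id_by_permalink = toy_patterns.set_index('permalink').pattern_id
toy_target = row_ids.index(id_by_permalink.loc['toy-scarf'])

# Order of first appearance, for comparison.
appearance_order = list(toy_likes.pattern_id.unique())

print(toy_target, appearance_order.index(300))
```

Here `toy-scarf` sits in matrix row 2, but a list in order of appearance would put it at position 0. The two orders only agree when the likes table happens to be sorted by pattern ID.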

Here's how Mr. Dangly looks in our shrunk matrix: just 35 values which define what kinds of users like him, and what kinds don't.

In [154]:
shrunk[target]

array([ 0.0895207 ,  0.02946681, -0.0574579 , -0.03167866, -0.0509617 ,
        0.01118265,  0.01919036, -0.02228885,  0.01880677,  0.02909504,
       -0.01188873, -0.00646529, -0.0389338 , -0.0470142 , -0.01109375,
        0.00856811, -0.00951299,  0.00711156, -0.00357397,  0.00124073,
        0.02102683, -0.00951941, -0.02167144, -0.0024223 , -0.00691445,
        0.00731288, -0.00792361, -0.01751737, -0.04219167,  0.0106009 ,
        0.00480203,  0.02572602, -0.01532836, -0.02027218, -0.01442943])

Finding similar patterns is just a simple cosine distance calculation.

In [155]:
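# Cosine distance from every pattern to the target, as a flat array.
distances = pairwise_distances(shrunk, shrunk[target].reshape(1, -1), metric='cosine').ravel()
# Pattern IDs of the 10 closest rows (the target itself comes first, at distance 0).
similars = [patterns[n] for n in distances.argsort()[:10]]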
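The lookup above passes `metric='cosine'`, so it compares the direction of each pattern's vector rather than its length. A sketch on four made-up 2-D vectors (the IDs and values are assumptions) shows how the ranking behaves:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

toy_ids = ['a', 'b', 'c', 'd']
toy_vectors = np.array([
    [1.0, 0.0],    # the target
    [2.0, 0.1],    # almost the same direction, twice as long
    [0.0, 1.0],    # at right angles to the target
    [-1.0, 0.0],   # pointing the opposite way
])

toy_target = 0
toy_distances = pairwise_distances(
    toy_vectors, toy_vectors[toy_target].reshape(1, -1), metric='cosine'
).ravel()

# Closest first.
ranked = [toy_ids[n] for n in toy_distances.argsort()]
print(ranked)
```

The target is always its own nearest neighbour, so it comes first in the ranking. If you only want other patterns, slice the sorted indices with `[1:11]` instead of `[:10]`.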

In [165]:
[patterns_df.permalink.loc[i] for i in similars]

['mr-dangly',
 'socktopus',
 'robot-hat-2',
 'the-great-batsby',
 'misty-morn-gloves',
 'the-thrifty-critter',
 'felted-knit-amigurumi-kitties',
 'praying-mantis',
 'pasha',
 'cthulhuclava']