# Content-Based Recommendations

In this notebook we'll build a content-based recommendation algorithm for our knitting data (kindly provided by www.ravelry.com).

In [1]:
import pandas as pd
from copy import deepcopy

from collections import Counter
from scipy.sparse import hstack
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances

I've made these CSVs publicly available in an S3 bucket, here:

You can download them and put them in a directory of your choice - you'll need to change the path in the cell below.

In [2]:
df = pd.read_csv('../data/recommend/patterns_data.csv')

In [3]:
df.head()

Unnamed: 0,pattern_id,keywords,category,difficulty,permalink,difficulty_average,gauge_divisor,gauge,row_gauge,gauge_pattern,yardage,ply,craft
0,524303,female|adult|cables|ribbed|textured|seamed|bal...,pullover,4.0,brigantine-sweater,4.0,1.0,4.25,6.25,Reverse Stockinette,1050.0,10.0,knitting
1,524297,teen|adult|written-pattern|schematic,vest,3.0,unique-shell-vest,3.0,,,,see notes below,300.0,12.0,crochet
2,524299,female|adult|cables|textured|one-piece|bottom-...,other-hat,6.0,sebago-hat,6.0,1.0,6.5,7.0,Cable Pattern,210.0,10.0,knitting
3,524327,cables|stranded|Intarsia|icord|sideways|bottom...,coffee-teapot,5.333333,the-bee-cosy-restoring-normality,5.333333,4.0,32.0,33.0,stockinette knit in the round,300.0,8.0,knitting
4,49,unisex|child|ribbed|one-piece|bottom-up|writte...,pixie,1.431319,meathead-hat,1.431319,4.0,9.0,,stockinette stitch,125.0,,knitting


## Data Cleaning

I'm cleaning the data a little to make it easier to work with.

To cut down on the amount of data, and also to improve the quality of the data, I'm only going to use patterns that have five or more "likes" from Ravelry users.

In [4]:
likes_df = pd.read_csv('../data/recommend/user_data.csv')
counts = likes_df.groupby('pattern_id')['user_id'].count()
filtered_df = df[df.pattern_id.map(counts) >= 5]

In [5]:
required = ['keywords',
    'category',
    'difficulty',
    'permalink',
    'difficulty_average',
    'craft']

In [6]:
fillna_dict = {'gauge_divisor': 0,
    'gauge': 0,
    'row_gauge': 0,
    'gauge_pattern': 'xxxBonusWordxxx',
    'yardage': 0,
    'ply': 0}  

In [7]:
df = filtered_df.dropna(subset=required)

In [8]:
df = df.fillna(fillna_dict)

## Transforming the data

Here I'm importing a few helper methods I wrote to make the data transformation a bit easier. They let me define all the transformations I want in a simple dictionary, and then pass that to the transformer to work on. I wrote this stuff a while ago, so it's not my best work, but it gets the job done.

In [9]:
from util.data_transformation_helpers import *

In [10]:
transformers = {
    'bag of words': NameGettingPipeline([('vectoriser', CountVectorizer(min_df=0.002, max_df=0.2, stop_words='english')), 
                              ('weighting', TfidfTransformer())
                                        ]),
    'keyword list': NameGettingPipeline([('vectoriser', CountVectorizer(tokenizer=lambda x: x.split('|'))), 
                              ('weighting', TfidfTransformer())]),
    'minmax': MinMaxWrapper(),
    'one-hot': OneHotWrapper()
}

The list in the next cell describes how I'm transforming each column, using the transformers defined above. A "keyword list" turns each word in a pipe-separated list into a separate column, "minmax" scales numeric values to between 0 and 1, "one-hot" turns a categorical column into separate columns (one per category), and "bag of words" extracts important words from free text.

The numbers that follow are the "weights" for each column. The transformers ensure that every column has a value between 0 and 1, and those values are then multiplied by the weight to arrive at a final value.

In [11]:
data_transform = [
    ('keywords', 'keyword list', 1),
    ('category', 'keyword list', 2),
    ('difficulty', 'minmax', 2),
    ('ply', 'minmax', 3),
    ('gauge', 'minmax', 1),
    ('yardage', 'minmax', 1),
    ('craft', 'one-hot', 4),
    ('gauge_pattern', 'bag of words', 1)
]
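To make the weighting concrete, here's a minimal sketch of what "minmax, then multiply by the weight" does to a single numeric column. The function name `minmax_then_weight` and the toy ply values are made up for illustration; the real work is done by the helper classes imported above, which may differ in detail.

```python
import pandas as pd

def minmax_then_weight(values, weight):
    """Scale a numeric column to the range [0, 1], then multiply by its weight."""
    low, high = values.min(), values.max()
    scaled = (values - low) / (high - low)
    return scaled * weight

# Toy ply values (illustrative only), weighted by 3 as in the list above
ply = pd.Series([0.0, 2.0, 4.0, 8.0])
weighted_ply = minmax_then_weight(ply, weight=3)
print(weighted_ply.tolist())  # [0.0, 0.75, 1.5, 3.0]
```

A column with weight 3 can contribute a difference of up to 3 between two patterns, while a column with weight 1 can contribute at most 1, so higher weights make a column matter more when we measure similarity later.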

In [12]:
transform_set = [(column, NameGettingPipeline([(
                    'selector', ItemSelector(column)), 
                ('transformer', deepcopy(transformers[transform_type]))
                   ]))  for column, transform_type, weight in data_transform]

We're gonna use scikit-learn's "Feature Union" class (`FeatureUnion`), which is super handy but a bit fiddly to use. The helpers I imported above are going to make it easy, though. This is a little sensitive to the version of pandas you're using - they're cool like that.

In [13]:
weights = {column: weight for column, transform_type, weight in data_transform}

fu = FeatureUnion(transform_set, transformer_weights=weights)
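Since the helper classes aren't shown in this notebook, here's a rough sketch of the same idea using only stock scikit-learn pieces. It's an approximation under stated assumptions, not the helpers' actual implementation: `FunctionTransformer` stands in for `ItemSelector`, and the toy dataframe and the `select_keywords`, `select_difficulty` and `split_on_pipe` functions are hypothetical names invented for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Toy data (illustrative only)
toy = pd.DataFrame({
    "keywords": ["fringe|seamed", "seamed|cables", "cables"],
    "difficulty": [1.0, 3.0, 5.0],
})

def select_keywords(frame):
    # Text vectorisers expect a 1-D sequence of strings
    return frame["keywords"]

def select_difficulty(frame):
    # Scalers expect a 2-D input, hence the double brackets
    return frame[["difficulty"]]

def split_on_pipe(text):
    return text.split("|")

union = FeatureUnion(
    [
        ("keywords", Pipeline([
            ("selector", FunctionTransformer(select_keywords)),
            ("vectoriser", CountVectorizer(tokenizer=split_on_pipe, lowercase=False)),
        ])),
        ("difficulty", Pipeline([
            ("selector", FunctionTransformer(select_difficulty)),
            ("scaler", MinMaxScaler()),
        ])),
    ],
    # Each transformer's output is multiplied by its weight
    transformer_weights={"keywords": 1, "difficulty": 2},
)

toy_features = union.fit_transform(toy)
dense_toy = toy_features.toarray()  # 3 keyword columns + 1 difficulty column
print(dense_toy)
```

The last column holds the scaled difficulty multiplied by its weight of 2, and the first three are the keyword counts (no TF-IDF weighting in this sketch).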

In [14]:

fu.fit(df)

FeatureUnion(n_jobs=1,
       transformer_list=[('keywords', NameGettingPipeline(memory=None,
          steps=[('selector', <util.data_transformation_helpers.ItemSelector object at 0x1a15785e10>), ('transformer', NameGettingPipeline(memory=None,
          steps=[('vectoriser', CountVectorizer(analyzer='word', binary=False, decod...('weighting', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))]))]))],
       transformer_weights={'keywords': 1, 'category': 2, 'difficulty': 2, 'ply': 3, 'gauge': 1, 'yardage': 1, 'craft': 4, 'gauge_pattern': 1})

In [15]:
features = fu.transform(df)

## Finding similar patterns

Our features are returned as a "sparse matrix", a compact representation that stores only the non-zero values. We can convert it to a dense dataframe to look at it in greater detail, though.

In [39]:
transformed_df = pd.DataFrame(features.todense(), columns=fu.get_feature_names())
transformed_df.head()

Unnamed: 0,keywords__2-at-a-time,keywords__3-4-sleeve,keywords__3-dimensional,keywords__adult,keywords__afterthought-heel,keywords__afterthought-pocket,keywords__aline,keywords__amigurumi,keywords__andean,keywords__appliqued,...,gauge_pattern__stitches,gauge_pattern__stocking,gauge_pattern__stranded,gauge_pattern__strands,gauge_pattern__stretched,gauge_pattern__sts,gauge_pattern__unblocked,gauge_pattern__using,gauge_pattern__worked,gauge_pattern__yarn
0,0.0,0.0,0.0,0.173995,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.371447,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.274877,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.20745,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We've now got over 500 columns of data, all of it scaled to between zero and one, and then multiplied by a weight.
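Most of those columns are zero for any given pattern, which is why a sparse matrix is a good fit. Here's a small sketch of the idea, using a made-up 3x5 array rather than the real features:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix (illustrative only): mostly zeros, like the real feature matrix
dense_example = np.array([
    [0.0, 0.0, 4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5],
])

sparse_example = csr_matrix(dense_example)

stored = sparse_example.nnz   # number of values actually stored
total = dense_example.size    # number of cells in the dense version
print(stored, "of", total, "cells are stored")

# Converting back gives the same numbers
round_trip = sparse_example.toarray()
```

Converting to a dense dataframe is fine for a quick look, but for the distance calculations it's better to keep working with the sparse version.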

Here's what Mr. Dangly looks like in the original dataset:

In [31]:
target = list(df.permalink).index('mr-dangly')

In [32]:
df.iloc[target]

pattern_id                                                 3150
keywords              fringe|seamed|written-pattern|worked-flat
category                                                 animal
difficulty                                              2.83824
permalink                                             mr-dangly
difficulty_average                                      2.83824
gauge_divisor                                                 0
gauge                                                         0
row_gauge                                                     0
gauge_pattern                                   xxxBonusWordxxx
yardage                                                       0
ply                                                           0
craft                                                  knitting
Name: 505, dtype: object

And here are his key features in the transformed dataframe.

In [40]:
transformed_df.iloc[target].sort_values(ascending=False)[:10]

craft__knitting              4.000000
category__animal             2.000000
keywords__fringe             0.845390
difficulty__difficulty       0.567647
keywords__seamed             0.406268
keywords__worked-flat        0.290197
keywords__written-pattern    0.189863
keywords__ruffles            0.000000
keywords__schematic          0.000000
keywords__sami               0.000000
Name: 397, dtype: float64

To find similar patterns we can use a simple Euclidean distance calculation: the smaller the distance between two rows of the feature matrix, the more similar the patterns.

In [43]:
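def get_closest_n(target, matrix, n):
    # Euclidean distance from every row of `matrix` to the target row
    distances = pairwise_distances(matrix, matrix[target]).ravel()
    # Positions of the n smallest distances (the target itself comes first)
    return distances.argsort()[:n]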

In [44]:
df.iloc[get_closest_n(target, features, 10)]

Unnamed: 0,pattern_id,keywords,category,difficulty,permalink,difficulty_average,gauge_divisor,gauge,row_gauge,gauge_pattern,yardage,ply,craft
505,3150,fringe|seamed|written-pattern|worked-flat,animal,2.838235,mr-dangly,2.838235,0.0,0.0,0.0,xxxBonusWordxxx,0.0,0.0,knitting
22137,169279,fringe|seamed|amigurumi|3-dimensional|written-...,animal,2.0,spring-collection,2.0,1.0,7.0,9.0,xxxBonusWordxxx,0.0,0.0,knitting
224579,179532,seamed|written-pattern|worked-flat,animal,1.857143,spring-lambs,1.857143,0.0,0.0,0.0,xxxBonusWordxxx,0.0,0.0,knitting
55869,521500,seamed|written-pattern|worked-flat,animal,1.75,pocket-fox,1.75,4.0,0.0,0.0,xxxBonusWordxxx,0.0,0.0,knitting
207269,1253,felted|in-the-round|fringe|one-piece|seamless|...,animal,1.571429,jellyfishin,1.571429,0.0,0.0,0.0,xxxBonusWordxxx,54.0,0.0,knitting
60124,33254,seamed|written-pattern|worked-flat,animal,1.5,knitted-kitten,1.5,0.0,0.0,0.0,xxxBonusWordxxx,0.0,0.0,knitting
511,3208,seamed|written-pattern|worked-flat,animal,3.090909,snoozing-ned,3.090909,4.0,28.0,36.0,xxxBonusWordxxx,0.0,0.0,knitting
253387,54441,fringe|seamed|written-pattern|worked-flat,animal,4.315789,leo-the-lion-3,4.315789,0.0,0.0,0.0,xxxBonusWordxxx,0.0,4.0,knitting
223101,688076,seamed|written-pattern,animal,3.0,toys-from-the-toybox,3.0,4.0,0.0,0.0,xxxBonusWordxxx,164.0,0.0,knitting
371434,229623,seamed|written-pattern,animal,2.333333,cats,2.333333,0.0,0.0,0.0,xxxBonusWordxxx,0.0,0.0,knitting
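
To see why the results look the way they do, here's the same nearest-neighbour lookup on a toy matrix small enough to check by hand. The three columns and their values are invented for illustration: think of them as a craft column (weight 4), a category column (weight 2) and a scaled difficulty.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances

# Toy features (illustrative only): [craft, category, difficulty]
toy_matrix = csr_matrix(np.array([
    [4.0, 2.0, 0.5],   # row 0: the target
    [0.0, 2.0, 0.5],   # row 1: differs from the target in craft
    [4.0, 2.0, 0.4],   # row 2: differs only slightly, in difficulty
    [4.0, 0.0, 0.5],   # row 3: differs from the target in category
]))

# Euclidean distance from every row to row 0
toy_distances = pairwise_distances(toy_matrix, toy_matrix[0]).ravel()

# Row positions sorted from closest to furthest
order = toy_distances.argsort()
print(order.tolist())  # [0, 2, 3, 1]
```

The row that differs in the heavily weighted craft column ends up furthest away, which matches what we see above: every recommendation for Mr. Dangly is a knitted animal, and the differences are in the lower-weighted keywords and difficulty.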
