## Simple Recommender #TODO- embed images of patterns
Based only on overall popularity and ratings.

Average ratings alone do not take popularity into account. If a single person rates a pattern and gives it a 5, that pattern would rank above one with thousands of ratings and an average of 4.8. We need to take both into consideration and use a weighted average.

Ravelry is interesting because it has a few popularity metrics other than rating count: the number of people who have a pattern in their queue (planning to make it in the near future) and the number of projects completed or attempted.

Also note: I only pulled patterns through the API with average ratings of 4 and 5 (though that was most of them).

In [2]:
import pandas as pd
import numpy as np

In [16]:
# import pattern data

df = pd.read_csv('data/patterns_cleaned.csv', low_memory=False)
patterns = df.copy()
print(df.shape)
df.head()
df.columns

(132843, 24)


Index(['pattern_id', 'name', 'name_permalink', 'favorites_count',
       'projects_count', 'difficulty_average', 'difficulty_count',
       'rating_average', 'queued_projects_count', 'rating_count',
       'pattern_type_names', 'pattern_type_clothing', 'photos_url',
       'pattern_needle_sizes', 'pattern_attributes', 'yardage_max', 'yardage',
       'generally_available', 'gauge', 'gauge_divisor', 'free', 'downloadable',
       'categories', 'yarn_weight_description'],
      dtype='object')

In [5]:
# Keep the rating columns for the simple recommender, plus the favourites and project counts
# .copy() makes df an independent frame rather than a slice of the original

df = df[['pattern_id', 'rating_count', 'rating_average', 'favorites_count', 'projects_count']].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132843 entries, 0 to 132842
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   pattern_id       132843 non-null  int64  
 1   rating_count     132841 non-null  float64
 2   rating_average   132843 non-null  float64
 3   favorites_count  132843 non-null  int64  
 4   projects_count   132843 non-null  int64  
dtypes: float64(2), int64(3)
memory usage: 5.1 MB


In [6]:
# Drop rows with missing values (two patterns have no rating_count)

df = df.dropna()
print(df.shape)

(132841, 5)


#### Calculate Weighted Average
To calculate weighted average, we need the overall average for all patterns (mean_average below), rating count (df.rating_count), and rating average (df.rating_average) for each pattern.

$$WR = \frac{c}{c+m}\,R + \frac{m}{c+m}\,M$$

- c: number of ratings for the pattern (`rating_count`)
- M: mean rating across the whole dataset (`mean_average`)
- R: average rating for the pattern (`rating_average`)
- m: minimum number of ratings needed to be on the chart (a high percentile of the rating counts, say the 90th)
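
As a rough illustration of why the weighting helps, the sketch below applies the formula to the two cases described earlier: a pattern with a single rating of 5 and a pattern with 1000 ratings averaging 4.8. The values of `m` and `M` here are made up for the example, not taken from the dataset.

```python
def weighted_rating(c, R, m, M):
    """Shrink a pattern's average rating R toward the overall mean M.

    c: number of ratings for the pattern
    m: minimum rating count, which controls how strongly R is shrunk
    """
    return (c / (c + m)) * R + (m / (c + m)) * M


# Illustrative values only (assumptions, not taken from the dataset).
m = 100   # hypothetical minimum rating count
M = 4.5   # hypothetical mean rating across all patterns

one_rating = weighted_rating(c=1, R=5.0, m=m, M=M)
many_ratings = weighted_rating(c=1000, R=4.8, m=m, M=M)

print(f"1 rating of 5.0     -> {one_rating:.3f}")
print(f"1000 ratings at 4.8 -> {many_ratings:.3f}")
```

With few ratings the score stays close to the overall mean, so the heavily rated pattern comes out ahead.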

##### (reference: https://www.datacamp.com/community/tutorials/recommender-systems-python)

In [7]:
# Calculate the overall pattern average (M)

mean_average = df['rating_average'].mean()
mean_average

4.505618854429369

In [8]:
# Only recommend from the most-rated patterns: keep the top 5% by rating count
# m is the minimum rating count needed to get into that segment

# To compare other cut-offs, print the upper quantiles:
# print(df['rating_count'].quantile(np.arange(.9, 1, .01)))
m = df['rating_count'].quantile(0.95)
print(m)

105.0
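
One way to pick the cut-off is to look at how the threshold moves across the upper percentiles. The sketch below does this on synthetic, skewed counts, because the real CSV is not available here; the `rating_count` series is a made-up stand-in for `df['rating_count']`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed rating counts standing in for df['rating_count'] (assumption).
rng = np.random.default_rng(0)
rating_count = pd.Series(rng.lognormal(mean=2.0, sigma=1.5, size=10_000).round())

# Rating-count threshold at each percentile from the 90th to the 99th.
percentiles = np.linspace(0.90, 0.99, 10)
thresholds = rating_count.quantile(percentiles)
print(thresholds)
```

A higher percentile gives a larger `m`, which both shrinks the candidate pool and pulls scores harder toward the overall mean.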


In [9]:
# Note: despite the variable name, this keeps the top 5% (m is the 0.95 quantile)
top_10percent_count_patterns = df.loc[df['rating_count'] >= m].copy()
top_10percent_count_patterns.shape

(6680, 5)

In [10]:
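# C is the mean rating across all patterns (same value as mean_average above)
C = df['rating_average'].mean()

def weighted_rating(x, m=m, C=C):
    """Weighted rating for one pattern (row) x, shrunk toward the overall mean C."""
    v = x['rating_count']    # number of ratings for the pattern
    R = x['rating_average']  # average rating for the pattern
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)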

In [11]:
top_10percent_count_patterns['score'] = top_10percent_count_patterns.apply(weighted_rating, axis=1)
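
The row-wise `apply` works, but the same function can also be called on whole columns at once, since it only uses arithmetic. A minimal sketch on a small made-up frame (the values and `C` are illustrative, not from the dataset) checks that both routes give the same scores:

```python
import pandas as pd

# Small made-up frame with the same column names as the notebook.
toy = pd.DataFrame({
    "rating_count": [150.0, 400.0, 1000.0],
    "rating_average": [4.9, 4.2, 4.8],
})
m = 105.0  # threshold printed earlier in the notebook
C = 4.5    # stand-in for the dataset mean (assumption)


def weighted_rating(x, m=m, C=C):
    v = x["rating_count"]
    R = x["rating_average"]
    return (v / (v + m) * R) + (m / (m + v) * C)


# Row-wise version, as in the notebook.
row_wise = toy.apply(weighted_rating, axis=1)

# Column-wise version: pass the whole frame so v and R are Series.
vectorized = weighted_rating(toy)

print((row_wise - vectorized).abs().max())
```

On a frame of a few thousand rows either approach is fine; the column-wise call mainly matters if the score is computed over the full dataset.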

In [12]:
top_10percent_count_patterns = top_10percent_count_patterns.sort_values('score', ascending=False)

In [13]:
top_10percent_count_patterns

Unnamed: 0,pattern_id,rating_count,rating_average,favorites_count,projects_count,score
74176,832035,1018.0,4.915521,13642,4102,4.877195
74154,766246,697.0,4.925395,17030,3538,4.870436
55694,847816,990.0,4.906061,17906,3753,4.867662
74214,908945,596.0,4.929530,10384,2859,4.866034
74243,952178,532.0,4.936090,4926,1569,4.865133
...,...,...,...,...,...,...
1191,285223,205.0,3.639024,6631,606,3.932548
7134,1739,367.0,3.752044,2123,905,3.919682
4457,218,206.0,3.504854,7195,487,3.842733
6296,731890,304.0,3.611842,807,1075,3.841296


In [17]:
top_patterns = top_10percent_count_patterns.merge(patterns[['pattern_id','name','photos_url']], on='pattern_id', how = 'left')

In [18]:
top_patterns[['name', 'score', 'rating_count', 'photos_url']][0:27]

Unnamed: 0,name,score,rating_count,photos_url
0,Beloved,4.877195,1018.0,https://images4-g.ravelrycache.com/uploads/tin...
1,Never Not Gnoming,4.870436,697.0,https://images4-g.ravelrycache.com/uploads/Ima...
2,Ripple Bralette,4.867662,990.0,https://images4-g.ravelrycache.com/uploads/jes...
3,Bakers Twine,4.866034,596.0,https://images4-f.ravelrycache.com/uploads/DUC...
4,"Oh, Gnome, You Didn't",4.865133,532.0,https://images4-f.ravelrycache.com/uploads/Ima...
5,ADVENTure Gnome,4.861158,629.0,https://images4-f.ravelrycache.com/uploads/Ima...
6,My Little Secret Crop,4.859022,703.0,https://images4-g.ravelrycache.com/uploads/jes...
7,Slipstravaganza,4.857482,1845.0,https://images4-f.ravelrycache.com/uploads/wes...
8,Nice to Gnome You,4.852141,591.0,https://images4-f.ravelrycache.com/uploads/Ima...
9,Mariechen,4.848375,455.0,https://images4-g.ravelrycache.com/uploads/lil...


Okay, these are not the patterns I would have thought of as popular. This may be because the rating scale is so compact, and because about 4/5 of the people completing the patterns do not actually rate them. However, these patterns must have excellent instructions to have received such high ratings, and I may have to take a second look at them when I'm done with this project.

Apparently gnomes are highly rated by knitters.

### Simple recommender on most favourited projects 
(currently a Ravelry search feature)

Note: you can "favourite" a pattern, or someone's project, without having knit the actual project. It can be more of an "I like this", "great job", or "I love the colour combination you used".

In [19]:
most_favourites = patterns[['pattern_id', 'name', 'favorites_count','generally_available']].sort_values(
    'favorites_count', ascending=False)
most_favourites[0:20]

Unnamed: 0,pattern_id,name,favorites_count,generally_available
74120,588220,Reyna,74715,2015/06/01 00:00:00 -0400
55687,788421,The Weekender,73447,2017/11/01 00:00:00 -0400
74066,130787,Hermione's Everyday Socks,68941,2009/07/01 00:00:00 -0400
74560,216488,GAP-tastic Cowl,68776,2010/12/01 00:00:00 -0500
74067,169260,Honey Cowl,68426,2010/03/01 00:00:00 -0500
74116,566285,Baa-ble Hat,65243,2015/03/01 00:00:00 -0500
55675,710742,No Frills Sweater,64920,2016/11/01 00:00:00 -0400
74068,181549,The Age of Brass and Steam Kerchief,63170,2010/05/01 00:00:00 -0400
74188,870739,Nightshift,62667,2018/10/01 00:00:00 -0400
74074,273024,Bandana Cowl,61741,2011/10/01 00:00:00 -0400


### Most Projects 
(currently a Ravelry search feature)

This could be a factor of pattern release date: not many patterns were readily available in the early to mid 2000s except in books or on knitty.com, which makes those patterns quite popular. I know I've knit a number of these.

In [20]:
most_projects = patterns[['pattern_id', 'name', 'projects_count', 'generally_available']].sort_values(
    'projects_count', ascending=False)
most_projects[0:20]

Unnamed: 0,pattern_id,name,projects_count,generally_available
74066,130787,Hermione's Everyday Socks,35028,2009/07/01 00:00:00 -0400
74070,211562,Hitchhiker,33945,2010/11/01 00:00:00 -0400
55971,605,Baby Surprise Jacket,29683,1968/01/01 00:00:00 -0500
74067,169260,Honey Cowl,27210,2010/03/01 00:00:00 -0500
74064,124400,Sockhead Slouch Hat,25695,2009/05/01 00:00:00 -0400
74089,426231,Barley,25011,2013/07/01 00:00:00 -0400
97,29,Clapotis,23519,2004/09/01 00:00:00 -0400
74060,573,Monkey Socks,22659,2006/12/01 00:00:00 -0500
55673,443533,Flax,22015,2013/10/01 00:00:00 -0400
190,195,Fetching,21231,2006/05/01 00:00:00 -0400
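
To get a feel for how much release date matters, one option is to divide the project count by the pattern's age. The sketch below does this for three rows copied from the table above; the reference date `as_of` is an assumption, since the notebook does not say when the data was pulled.

```python
import pandas as pd

# Three rows copied from the table above.
sample = pd.DataFrame({
    "name": ["Hermione's Everyday Socks", "Baby Surprise Jacket", "Flax"],
    "projects_count": [35028, 29683, 22015],
    "generally_available": [
        "2009/07/01 00:00:00 -0400",
        "1968/01/01 00:00:00 -0500",
        "2013/10/01 00:00:00 -0400",
    ],
})

# Parse the release dates; utc=True handles the mixed UTC offsets.
released = pd.to_datetime(
    sample["generally_available"], format="%Y/%m/%d %H:%M:%S %z", utc=True
)

# Hypothetical reference date for "now" (assumption, not from the notebook).
as_of = pd.Timestamp("2021-01-01", tz="UTC")
age_years = (as_of - released).dt.days / 365.25

sample["projects_per_year"] = sample["projects_count"] / age_years
print(sample.sort_values("projects_per_year", ascending=False)[["name", "projects_per_year"]])
```

On a per-year basis the much older Baby Surprise Jacket drops below the other two, which supports the idea that raw project counts favour patterns that have been around longer.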


In [34]:
# df was reduced to the rating columns above, so select from the full patterns frame
most_queued = patterns[['pattern_id', 'name', 'queued_projects_count', 'favorites_count','generally_available']].sort_values(
    'queued_projects_count', ascending=False)
most_queued[0:50]

Unnamed: 0,pattern_id,name,queued_projects_count,favorites_count,generally_available
74067,169260,Honey Cowl,13906,68426,2010/03/01 00:00:00 -0500
74560,216488,GAP-tastic Cowl,13808,68776,2010/12/01 00:00:00 -0500
197,48911,Star Crossed Slouchy Beret,13283,53972,2008/01/01 00:00:00 -0500
74066,130787,Hermione's Everyday Socks,12774,68941,2009/07/01 00:00:00 -0400
74120,588220,Reyna,11947,74715,2015/06/01 00:00:00 -0400
387,68527,February Lady Sweater,11646,45640,2008/05/01 00:00:00 -0400
56857,103767,owls,11625,56612,2010/02/01 00:00:00 -0500
55687,788421,The Weekender,11448,73447,2017/11/01 00:00:00 -0400
55773,440784,Hitofude Cardigan,11115,59917,2013/10/01 00:00:00 -0400
74068,181549,The Age of Brass and Steam Kerchief,10863,63170,2010/05/01 00:00:00 -0400


### Favourites to projects ratio 
This may not make much sense, but let's give it a shot. 

In [35]:
df.columns

Index(['pattern_id', 'rating_count', 'rating_average'], dtype='object')

In [22]:
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
fav_proj_ratio_df = patterns[['pattern_id', 'name', 'projects_count','favorites_count', 'generally_available']].copy()
fav_proj_ratio_df['fav_proj_ratio'] = fav_proj_ratio_df.favorites_count / fav_proj_ratio_df.projects_count
fav_proj_ratio_df = fav_proj_ratio_df[['pattern_id', 'name', 'fav_proj_ratio','projects_count', 'favorites_count','generally_available']].sort_values(
    'fav_proj_ratio', ascending=False)
fav_proj_ratio_df[0:20]


Unnamed: 0,pattern_id,name,fav_proj_ratio,projects_count,favorites_count,generally_available
65672,598852,A-dress,310.272727,11,3413,2015/08/01 00:00:00 -0400
73445,726472,Fusion,287.071429,14,4019,2017/02/01 00:00:00 -0500
126792,598857,Turkish Wave,259.181818,11,2851,2015/02/01 00:00:00 -0500
50665,471873,3D Hexagon Kaleidoscope,250.0,12,3000,2014/02/01 00:00:00 -0500
54630,693633,Fangorn Forest,221.785714,14,3105,2016/09/01 00:00:00 -0400
105664,858840,Atlantic Waves,221.5,14,3101,2018/09/01 00:00:00 -0400
126459,586126,#132 Sophisticated Cable and Lace Cowl,216.545455,11,2382,2011/08/01 00:00:00 -0400
62014,1031424,Bayou,213.636364,11,2350,2020/05/01 00:00:00 -0400
55159,757436,Old Shale Cardigan for ladies,210.636364,11,2317,2017/06/01 00:00:00 -0400
62082,1075234,Inez,210.083333,12,2521,2020/09/01 00:00:00 -0400


This is interesting and definitely useful. I will have to dig into this to see if I can tune or weight it to make it more meaningful.
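
One possible way to tune the ratio, sketched here as an idea rather than something the notebook does, is to add a pseudo-count `k` to the denominator so that patterns with only a handful of projects are damped. The constant `k` is hypothetical and would need tuning; the two rows are copied from the notebook's outputs.

```python
import pandas as pd

# Two rows copied from the notebook's outputs.
sample = pd.DataFrame({
    "name": ["A-dress", "Reyna"],
    "projects_count": [11, 13963],
    "favorites_count": [3413, 74715],
})
k = 100  # assumption: pseudo-count added to the denominator

# Raw ratio, as in the notebook.
sample["fav_proj_ratio"] = sample["favorites_count"] / sample["projects_count"]

# Damped ratio: the fewer projects a pattern has, the more k pulls it down.
sample["fav_proj_ratio_damped"] = sample["favorites_count"] / (sample["projects_count"] + k)

print(sample)
```

The damping barely changes a pattern with thousands of projects but cuts the ratio of an 11-project pattern by roughly a factor of ten.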

### queued projects

#### favourites and projects sum
(People can favourite a pattern without making it; actually making it is a stronger sign of interest.)

In [31]:
# .copy() gives an independent frame, so adding columns does not raise SettingWithCopyWarning
fav_proj_df = patterns[['pattern_id', 'name', 'projects_count','favorites_count', 'generally_available']].copy()
fav_proj_df['fav_proj_sum'] = fav_proj_df.favorites_count + fav_proj_df.projects_count
fav_proj_df['fav_proj_avg'] = fav_proj_df['fav_proj_sum'] / 2
fav_proj_df = fav_proj_df[['pattern_id', 'name', 'fav_proj_sum', 'fav_proj_avg','projects_count', 'favorites_count','generally_available']].sort_values(
    'fav_proj_avg', ascending=False)
fav_proj_df[0:60]


Unnamed: 0,pattern_id,name,fav_proj_sum,fav_proj_avg,projects_count,favorites_count,generally_available
74066,130787,Hermione's Everyday Socks,103969,51984.5,35028,68941,2009/07/01 00:00:00 -0400
74067,169260,Honey Cowl,95636,47818.0,27210,68426,2010/03/01 00:00:00 -0500
74560,216488,GAP-tastic Cowl,88773,44386.5,19997,68776,2010/12/01 00:00:00 -0500
74120,588220,Reyna,88678,44339.0,13963,74715,2015/06/01 00:00:00 -0400
74070,211562,Hitchhiker,87219,43609.5,33945,53274,2010/11/01 00:00:00 -0400
55687,788421,The Weekender,84759,42379.5,11312,73447,2017/11/01 00:00:00 -0400
55673,443533,Flax,81893,40946.5,22015,59878,2013/10/01 00:00:00 -0400
74064,124400,Sockhead Slouch Hat,80789,40394.5,25695,55094,2009/05/01 00:00:00 -0400
74068,181549,The Age of Brass and Steam Kerchief,77310,38655.0,14140,63170,2010/05/01 00:00:00 -0400
74116,566285,Baa-ble Hat,76127,38063.5,10884,65243,2015/03/01 00:00:00 -0500
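
Since a completed project is arguably a stronger signal than a favourite, a plain sum may undervalue projects. A minimal sketch of a weighted sum follows, using three rows copied from the table above; the weight `w_proj` is a hypothetical choice, not something derived from the data.

```python
import pandas as pd

# Three rows copied from the table above.
sample = pd.DataFrame({
    "name": ["Hermione's Everyday Socks", "Reyna", "Hitchhiker"],
    "projects_count": [35028, 13963, 33945],
    "favorites_count": [68941, 74715, 53274],
})
w_proj = 2.0  # assumption: one project counts as much as two favourites

# Plain sum, as in the notebook.
sample["fav_proj_sum"] = sample["favorites_count"] + sample["projects_count"]

# Weighted sum that gives extra credit to completed or attempted projects.
sample["fav_proj_weighted"] = sample["favorites_count"] + w_proj * sample["projects_count"]

print(sample.sort_values("fav_proj_weighted", ascending=False))
```

With this weight, Hitchhiker moves ahead of Reyna, whose position in the plain sum comes mostly from favourites.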
