# 03 - Ravelry - Content-Based Recommendation System
___

This notebook reads in the cleaned Ravelry dataframe, encodes its features, and computes a pairwise cosine distance matrix between patterns.  That matrix drives a content-based recommender: the patterns with the smallest distance to a chosen pattern are treated as the most similar.

The notebook includes a function that searches the database for pattern names and a function that takes a pattern name and prints the five most similar patterns, each with a URL link to its Ravelry page.

### Contents:
- [Import Data & Prepare Features](#Import-Data-&-Prepare-Features)
- [Calculate Cosine Distances and Build Recommender Dataframe](#Calculate-Cosine-Distances-and-Build-Recommender-Dataframe)
- [User Interface and Recommender Function](#User-Interface-and-Recommender-Function)

|Function|Argument|Purpose|
|---|---|---|
|**spacy_processor**|*str* - text|Keeps only alphabetic tokens longer than one letter, lemmatizes them, omits spaCy stop words, and returns the result as a single string.|
|**display_recs**|*str* - user_pattern|Fuzzy-matches user_pattern to the closest pattern name in the rav_rec_df dataframe, then prints the five most similar patterns with URL links to each pattern's Ravelry page.|
|**patt_search**|*str* - user_search|Accepts a user search string for a specific pattern.  Uses the Fuzzy Wuzzy Python library to find and return ten pattern names similar to the input text.  If nothing matches, returns 'Nothing Found'.|

In [49]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy import sparse

import spacy

from fuzzywuzzy import fuzz 
from fuzzywuzzy import process 

## Import Data & Prepare Features
___

In [50]:
rav_clean_df = pd.read_csv('../data/rav_clean.csv')

In [51]:
rav_clean_df

Unnamed: 0,id,name,author,difficulty_avg,gauge,gauge_divisor,max_yardage,notes,price,projects_count,queued_projects_count,rating_avg,yarn_weight,permalink,type,gauge_per_inch
0,990044,Musselburgh,Ysolda Teague,2.46,6.0,1.0,610.0,>Our favourite swatchless hat pattern\r\n> jus...,6.00,23932,7716,4.89,Fingering,musselburgh,hat,6.0
1,899479,Classic Ribbed Hat,Purl Soho,1.92,32.0,4.0,305.0,MATERIALS\r\nPurl Soho’s [Cashmere Merino Bloo...,0.00,10383,5374,4.83,DK,classic-ribbed-hat-5,hat,8.0
2,1353734,Alpine Bloom Hat,Caitlin Hunter,3.37,24.0,4.0,230.0,The Alpine Bloom hat is designed as a companio...,5.00,1553,2400,4.84,Sport,alpine-bloom-hat,hat,6.0
3,528611,Classic Cuffed Hat,Purl Soho,1.87,20.0,4.0,328.0,"MATERIALS\r\n\r\n- Hat with Pom Pom: 1 (2, 2) ...",0.00,9106,5037,4.70,Worsted,classic-cuffed-hat,hat,5.0
4,7340640,The Traveler Hat,Andrea Mowry,2.18,20.0,4.0,350.0,[Do you enjoy Andrea’s patterns? Sign up for t...,7.00,838,312,4.83,Worsted,the-traveler-hat,hat,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14993,1159217,Furbelow Pullover,Debi Maige,0.00,26.0,4.0,2622.0,Get the full issue for [$8.99][1].\r\n\r\nThe ...,7.99,4,32,0.00,Fingering,furbelow-pullover,pullover,6.5
14994,1163057,Kiely Swoncho,Tamy Gore,5.00,22.0,4.0,2340.0,\r\nA fun architectural design knit in strande...,8.50,11,32,5.00,DK,kiely-swoncho,pullover,5.5
14995,1164074,Suttons Bay Sweater,Plucky Knitter Design,2.67,18.0,4.0,1732.0,The Suttons Bay Sweater may just become your f...,8.00,11,47,4.83,Worsted,suttons-bay-sweater,pullover,4.5
14996,1168304,Hedera,Stephanie Lotven,2.83,22.0,4.0,3210.0,"> **Buy 3, Get 1 FREE! Place 4 of my patterns ...",8.00,11,73,5.00,DK,hedera-4,pullover,5.5


In [52]:
# Won't use these columns for recommender.  Having gauge_per_inch makes the gauge and gauge_divisor feature redundant
# ID is unique to every pattern and not needed for recommender
rav_clean_df.drop(columns = ['id', 'gauge', 'gauge_divisor'], inplace = True)

In [53]:
# Rename columns so they do not collide with TF-IDF feature names: the words 'name' and 'price'
# appear in the notes values.  Otherwise verbose_feature_names_out would need to be True in the column transformer.
rav_clean_df.rename(columns = {'name':'pattern_name', 'price':'pattern_price'}, inplace = True)

In [54]:
rav_clean_df.head()

Unnamed: 0,pattern_name,author,difficulty_avg,max_yardage,notes,pattern_price,projects_count,queued_projects_count,rating_avg,yarn_weight,permalink,type,gauge_per_inch
0,Musselburgh,Ysolda Teague,2.46,610.0,>Our favourite swatchless hat pattern\r\n> jus...,6.0,23932,7716,4.89,Fingering,musselburgh,hat,6.0
1,Classic Ribbed Hat,Purl Soho,1.92,305.0,MATERIALS\r\nPurl Soho’s [Cashmere Merino Bloo...,0.0,10383,5374,4.83,DK,classic-ribbed-hat-5,hat,8.0
2,Alpine Bloom Hat,Caitlin Hunter,3.37,230.0,The Alpine Bloom hat is designed as a companio...,5.0,1553,2400,4.84,Sport,alpine-bloom-hat,hat,6.0
3,Classic Cuffed Hat,Purl Soho,1.87,328.0,"MATERIALS\r\n\r\n- Hat with Pom Pom: 1 (2, 2) ...",0.0,9106,5037,4.7,Worsted,classic-cuffed-hat,hat,5.0
4,The Traveler Hat,Andrea Mowry,2.18,350.0,[Do you enjoy Andrea’s patterns? Sign up for t...,7.0,838,312,4.83,Worsted,the-traveler-hat,hat,5.0


In [55]:
rav_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14998 entries, 0 to 14997
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   pattern_name           14998 non-null  object 
 1   author                 14998 non-null  object 
 2   difficulty_avg         14998 non-null  float64
 3   max_yardage            14998 non-null  float64
 4   notes                  14998 non-null  object 
 5   pattern_price          14998 non-null  float64
 6   projects_count         14998 non-null  int64  
 7   queued_projects_count  14998 non-null  int64  
 8   rating_avg             14998 non-null  float64
 9   yarn_weight            14998 non-null  object 
 10  permalink              14998 non-null  object 
 11  type                   14998 non-null  object 
 12  gauge_per_inch         14998 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 1.5+ MB


**Features to One Hot Encode**
* author
* yarn_weight
* type

**SpaCy Natural Language Processing**
* notes

**Features to Scale**
* difficulty_avg
* gauge_per_inch
* max_yardage
* pattern_price
* projects_count
* queued_projects_count
* rating_avg

In [56]:
# Load the medium size SpaCy pipeline
nlp = spacy.load('en_core_web_md')

In [57]:
# To-do:  Add a custom stop word for phrase 'no notes provided'
def spacy_processor(text):
    
    '''
    (str) text - Creates a token list with only alpha characters and leaving out any that are only one letter.  Also
    lemmatizes words and omits Spacy stop words.
    '''
    #Put the data into spaCy model
    doc = nlp(text)

    tokens = [token.lemma_.lower().strip() for token in doc if token.is_alpha and not token.is_stop and len(token.text) > 1]

    #Put the processed text back together
    processed_text = ' '.join(tokens)

    #return processed text to dataframe
    return processed_text
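
The to-do above (a custom stop word for the phrase 'no notes provided') may be easier to handle before the text reaches spaCy, since the `is_stop` flag is checked one token at a time and the phrase spans three tokens. A minimal sketch, assuming the placeholder appears literally in `notes`; `strip_placeholder` is a hypothetical helper, not part of this notebook:

```python
import re

# Hypothetical placeholder phrase, taken from the to-do comment above.
PLACEHOLDER = re.compile(r'\bno notes provided\b', flags=re.IGNORECASE)

def strip_placeholder(text):
    '''Remove the placeholder phrase and collapse any leftover whitespace.'''
    without_phrase = PLACEHOLDER.sub(' ', text)
    return ' '.join(without_phrase.split())

print(repr(strip_placeholder('No notes provided')))
print(repr(strip_placeholder('Cozy ribbed hat no notes provided')))
```

It could be chained ahead of the existing function, for example `rav_clean_df['notes'].apply(strip_placeholder).apply(spacy_processor)`.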
    

In [58]:
# Apply the function to the text column of rav_clean_df
rav_clean_df['processed_text'] = rav_clean_df['notes'].apply(spacy_processor)

In [65]:
# Drop the notes column - no longer needed now that I have a processed text column from SpaCy
rav_clean_df.drop(columns = ['notes'], inplace = True)

In [66]:
# Four patterns had nothing in the processed text feature after Spacy function.  Adding garment type as the processed_text value.
# This could also be refactored into a function
rav_clean_df[rav_clean_df['processed_text'] == '']

Unnamed: 0,pattern_name,author,difficulty_avg,max_yardage,pattern_price,projects_count,queued_projects_count,rating_avg,yarn_weight,permalink,type,gauge_per_inch,processed_text
6748,Roza's Socks,Grumperina,2.92,429.0,0.0,491,283,4.23,Fingering,rozas-socks,socks,8.25,
9427,Denim Squares Socks,Amy Polcyn,0.0,400.0,0.0,5,9,0.0,Fingering,denim-squares-socks,socks,8.0,
12549,Twin Set,Sirdar,0.0,1840.0,0.0,6,5,0.0,DK,twin-set-9,pullover,5.5,
14450,Dog-Eared Sweater,Lorna Miser,2.44,1781.0,0.0,29,94,4.0,Worsted,dog-eared-sweater,pullover,5.0,


In [67]:
# I admit this is a very specific fix and can be changed into more generic code down the line.
# On the to-do list
rav_clean_df.loc[[6748, 9427],'processed_text'] = 'sock'
rav_clean_df.loc[[12549, 14450],'processed_text'] = 'pullover'
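
One way the hard-coded fix above might be generalized is to fall back to the `type` column wherever `processed_text` is empty. This is only a sketch on a toy dataframe (`toy_df` is made up for illustration). It writes the raw type value, so socks would get 'socks' rather than the lemma 'sock' used above:

```python
import pandas as pd

# Toy stand-in for rav_clean_df with only the two columns that matter here.
toy_df = pd.DataFrame({
    'type': ['socks', 'pullover', 'hat'],
    'processed_text': ['', 'cozy cable yoke', '   '],
})

# Rows whose processed text is empty or whitespace only.
is_empty = toy_df['processed_text'].str.strip() == ''

# Fall back to the garment type for those rows.
toy_df.loc[is_empty, 'processed_text'] = toy_df.loc[is_empty, 'type']

print(toy_df['processed_text'].tolist())
```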

In [68]:
rav_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14998 entries, 0 to 14997
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   pattern_name           14998 non-null  object 
 1   author                 14998 non-null  object 
 2   difficulty_avg         14998 non-null  float64
 3   max_yardage            14998 non-null  float64
 4   pattern_price          14998 non-null  float64
 5   projects_count         14998 non-null  int64  
 6   queued_projects_count  14998 non-null  int64  
 7   rating_avg             14998 non-null  float64
 8   yarn_weight            14998 non-null  object 
 9   permalink              14998 non-null  object 
 10  type                   14998 non-null  object 
 11  gauge_per_inch         14998 non-null  float64
 12  processed_text         14998 non-null  object 
dtypes: float64(5), int64(2), object(6)
memory usage: 1.5+ MB


In [69]:
# Instantiate the transformers

ohe = OneHotEncoder(handle_unknown='ignore',
                    drop = 'first',
                   sparse_output = False)

sc = StandardScaler()
tvec = TfidfVectorizer(max_features = 3000) # Getting errors until I lowered max_features to 3000 - memory issues?  

In [70]:
# Transform columns as mentioned previously
ctx = ColumnTransformer(
    transformers=[
        ('one_hot', ohe, ['author', 'yarn_weight', 'type']),
        ('sc', sc, ['difficulty_avg', 'gauge_per_inch',
                    'max_yardage', 'pattern_price', 'projects_count',
                    'queued_projects_count', 'rating_avg']),
        ('tvec', tvec, 'processed_text'), 
    ],
    remainder='passthrough', verbose_feature_names_out= False
)

rav_clean_enc = ctx.fit_transform(rav_clean_df)

In [71]:
rav_clean_enc = pd.DataFrame(rav_clean_enc,
                             columns = ctx.get_feature_names_out(),
                            )

rav_clean_enc.set_index(['pattern_name'], inplace = True)

In [72]:
rav_clean_enc.head()

pattern_name,author_13th Raven Designs,author_: : : Katie Degroff Knits : : :,author_A Little Knitty Designs,author_A Whimsical Wood Yarn Co.,author_A. Karen Alfke,author_A.Opie Designs,author_Aaron Schwartzbard,author_Abby Brown,author_Abby Goodman,author_Abby Tohline Wooden,...,ysolda,yummy,zag,zero,zig,zigzag,zipper,zoom,ístex,permalink
Musselburgh,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.075168,0.0,0.0,0.0,0.0,0.0,0.0,0.15487,0.0,musselburgh
Classic Ribbed Hat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,classic-ribbed-hat-5
Alpine Bloom Hat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,alpine-bloom-hat
Classic Cuffed Hat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,classic-cuffed-hat
The Traveler Hat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,the-traveler-hat


In [73]:
# Separating out the permalink column so I can use it for the recommender to make
# pattern specific URLS.

# reset_index() returns a new dataframe, so there is no in-place change on a slice
permalink_df = rav_clean_enc[['permalink']].reset_index()

In [74]:
permalink_df

Unnamed: 0,pattern_name,permalink
0,Musselburgh,musselburgh
1,Classic Ribbed Hat,classic-ribbed-hat-5
2,Alpine Bloom Hat,alpine-bloom-hat
3,Classic Cuffed Hat,classic-cuffed-hat
4,The Traveler Hat,the-traveler-hat
...,...,...
14993,Furbelow Pullover,furbelow-pullover
14994,Kiely Swoncho,kiely-swoncho
14995,Suttons Bay Sweater,suttons-bay-sweater
14996,Hedera,hedera-4
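
Since each Ravelry URL is just a fixed prefix plus the permalink, the lookup could also be held in a plain dictionary instead of filtering the dataframe for every recommendation. A small sketch using a few rows from the table above; `url_by_name` is a hypothetical name, and it assumes pattern names are unique (a repeated name would keep only its last permalink):

```python
URL_PREFIX = 'https://www.ravelry.com/patterns/library/'

# A few (pattern_name, permalink) pairs copied from the table above.
pairs = [
    ('Musselburgh', 'musselburgh'),
    ('Classic Ribbed Hat', 'classic-ribbed-hat-5'),
    ('Hedera', 'hedera-4'),
]

# Map each pattern name to its full Ravelry URL.
url_by_name = {name: URL_PREFIX + permalink for name, permalink in pairs}

print(url_by_name['Hedera'])
```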


In [75]:
#Drop the permalink column from rav_clean_enc now that I have a separate permalink
#dataframe to use
rav_clean_enc.drop(columns = 'permalink', inplace = True)

## Calculate Cosine Distances and Build Recommender Dataframe
___

Building the sparse matrix requires a numeric array, and `rav_clean_enc` holds object-typed values, so it is converted to a float array first.  Referenced [this](https://stackoverflow.com/questions/57434284/covert-to-sparse-matrix-typeerror-no-supported-conversion-for-types-dtype) Stack Overflow post.

In [76]:
rav_clean_array = np.array(rav_clean_enc, dtype = float)

In [78]:
rav_clean_sparse = sparse.csr_matrix(rav_clean_array)

In [79]:
rav_clean_sparse

<14998x6351 sparse matrix of type '<class 'numpy.float64'>'
	with 1057783 stored elements in Compressed Sparse Row format>
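
The summary above allows a rough size estimate, which may help explain the memory errors suspected in the `TfidfVectorizer` comment earlier. This is back-of-the-envelope arithmetic rather than a measurement, and the 32-bit index size is an assumption about how SciPy stored this matrix:

```python
# Shape and stored-element count from the sparse matrix summary above.
rows, cols, stored = 14998, 6351, 1057783
FLOAT64_BYTES = 8
INT32_BYTES = 4  # assumption: the CSR index arrays are 32-bit

# The same matrix held as a dense float64 array.
dense_bytes = rows * cols * FLOAT64_BYTES

# CSR keeps one value and one column index per stored element, plus a row pointer array.
csr_bytes = stored * (FLOAT64_BYTES + INT32_BYTES) + (rows + 1) * INT32_BYTES

# The pairwise distance matrix is a dense rows x rows float64 array.
distance_bytes = rows * rows * FLOAT64_BYTES

print(f'density:         {stored / (rows * cols):.2%}')
print(f'dense features:  {dense_bytes / 1e6:,.0f} MB')
print(f'CSR features:    {csr_bytes / 1e6:,.0f} MB')
print(f'distance matrix: {distance_bytes / 1e6:,.0f} MB')
```

Under these assumptions the dense copies and the distance matrix dominate memory use, and every extra TF-IDF feature adds another dense float64 column of 14,998 values.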

In [80]:
distances = pairwise_distances(rav_clean_sparse, metric = 'cosine')


In [110]:
distances

array([[0.        , 0.03232811, 0.29834673, ..., 1.10478672, 1.0708272 ,
        1.12719411],
       [0.03232811, 0.        , 0.18753322, ..., 1.16376385, 1.10049439,
        1.18003074],
       [0.29834673, 0.18753322, 0.        , ..., 1.12124199, 1.11464982,
        1.20354259],
       ...,
       [1.10478672, 1.16376385, 1.12124199, ..., 0.        , 0.404157  ,
        0.69132   ],
       [1.0708272 , 1.10049439, 1.11464982, ..., 0.404157  , 0.        ,
        0.87289763],
       [1.12719411, 1.18003074, 1.20354259, ..., 0.69132   , 0.87289763,
        0.        ]])
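
Several distances above exceed 1. Cosine distance is 1 minus cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction). Values above 1 are possible here because standard-scaled features can be negative. A small pure-Python illustration on made-up vectors, not on the Ravelry data:

```python
import math

def cosine_distance(u, v):
    '''Return 1 - cos(angle between u and v).'''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # no shared direction
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])       # needs negative values

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```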

In [111]:
rav_rec_df = pd.DataFrame(distances, columns = rav_clean_enc.index, index = rav_clean_enc.index)

In [112]:
rav_rec_df.head()

pattern_name,Musselburgh,Classic Ribbed Hat,Alpine Bloom Hat,Classic Cuffed Hat,The Traveler Hat,February Hat,Manhattan Hat,October Hat,My Baker's Hat,Berry Baby Hat,...,Sommar,Stowe,In My Pocket Sweater,Colorblock Raglan Pullover,Argentum Sweater,Furbelow Pullover,Kiely Swoncho,Suttons Bay Sweater,Hedera,Siblings
Musselburgh,0.0,0.032328,0.298347,0.04021,0.556542,0.147871,0.420749,0.190777,0.552751,0.064932,...,1.096324,1.101988,1.127004,1.116466,1.095907,1.08424,1.086557,1.104787,1.070827,1.127194
Classic Ribbed Hat,0.032328,0.0,0.187533,0.009599,0.605715,0.070617,0.409814,0.093948,0.504675,0.018955,...,1.121568,1.156071,1.139926,1.115171,1.116348,1.129156,1.126015,1.163764,1.100494,1.180031
Alpine Bloom Hat,0.298347,0.187533,0.0,0.174866,0.667879,0.105629,0.419704,0.105991,0.577683,0.133989,...,1.223565,1.225287,1.100536,1.227563,1.20313,1.234003,1.076965,1.121242,1.11465,1.203543
Classic Cuffed Hat,0.04021,0.009599,0.174866,0.0,0.570103,0.047776,0.389431,0.091434,0.463313,0.010475,...,1.148436,1.136262,1.161016,1.093742,1.124442,1.137556,1.141645,1.122713,1.113563,1.14783
The Traveler Hat,0.556542,0.605715,0.667879,0.570103,0.0,0.598657,0.352771,0.698406,0.596645,0.593702,...,1.14246,1.069568,0.869934,1.210959,1.277033,1.195256,1.006737,0.741084,1.047988,0.965041


## User Interface and Recommender Function
___

In [113]:
# Make a list of pattern names that can be searched
# Near the end of the project this became redundant and may be removed in future
# Can do the same thing with permalink_df
patterns = rav_rec_df.index

In [114]:
patterns

Index(['Musselburgh', 'Classic Ribbed Hat', 'Alpine Bloom Hat',
       'Classic Cuffed Hat', 'The Traveler Hat', 'February Hat',
       'Manhattan Hat', 'October Hat', 'My Baker's Hat', 'Berry Baby Hat',
       ...
       'Sommar', 'Stowe', 'In My Pocket Sweater', 'Colorblock Raglan Pullover',
       'Argentum Sweater', 'Furbelow Pullover', 'Kiely Swoncho',
       'Suttons Bay Sweater', 'Hedera', 'Siblings'],
      dtype='object', name='pattern_name', length=14998)

In [115]:
def display_recs(user_pattern):
    '''
    Accepts 'user_pattern', a string the user typed for a pattern they want recommendations for.
    
    Uses the Fuzzy Wuzzy Python library to match 'user_pattern' to the closest pattern name in 'rav_rec_df', then sorts that
    pattern's distance column and saves the five most similar patterns in the 'top_five' list.
    Each pattern's Ravelry URL is built from its permalink in 'permalink_df'.
    
    Prints the matched pattern and each recommended pattern with a URL link to its details on Ravelry.  If the matched pattern
    has no row in 'permalink_df', returns a message saying no patterns were found.
    '''
    
    choices = rav_rec_df.columns
    fuzzy_process = process.extractOne(user_pattern, choices)
    fuzzy_choice = fuzzy_process[0]
    matching_row = permalink_df[permalink_df['pattern_name'] == fuzzy_choice]
    # Handles the case where FuzzyWuzzy can't find anything in permalink_df that matches what the user typed,
    # for example 'power flower mittens'
    if not matching_row.empty:
        patt_link = matching_row['permalink'].iloc[0]
        url_prefix = 'https://www.ravelry.com/patterns/library/'
        print(f'We think you are looking for the pattern titled {fuzzy_choice}: {url_prefix}{patt_link}')


        top_five = list(rav_rec_df[fuzzy_choice].sort_values().iloc[1:6].index)

        for patt in top_five:
            top_matching_row = permalink_df[permalink_df['pattern_name'] == patt]
            top_patt_link = top_matching_row['permalink'].iloc[0]
            url = f'{patt}: {url_prefix}{top_patt_link}'  
            print(url)
    else:
        return 'No patterns found like this in our database.'
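
The `iloc[1:6]` slice in `display_recs` relies on the queried pattern sorting first with a distance of 0. If another pattern ties at 0, the query itself might not land in position 0. Filtering by name is a more explicit alternative. A sketch on a plain dictionary; `nearest_patterns` is a hypothetical helper, and the distances are copied from the Musselburgh row of `rav_rec_df.head()`:

```python
def nearest_patterns(distance_row, query, k=5):
    '''Return the k names with the smallest distance, skipping the query itself.'''
    ranked = sorted(distance_row.items(), key=lambda item: item[1])
    return [name for name, _ in ranked if name != query][:k]

# Distances from Musselburgh to a handful of other patterns.
musselburgh_row = {
    'Musselburgh': 0.0,
    'Classic Ribbed Hat': 0.032328,
    'Alpine Bloom Hat': 0.298347,
    'Classic Cuffed Hat': 0.04021,
    'The Traveler Hat': 0.556542,
    'February Hat': 0.147871,
    'Manhattan Hat': 0.420749,
}

print(nearest_patterns(musselburgh_row, 'Musselburgh', k=3))
# ['Classic Ribbed Hat', 'Classic Cuffed Hat', 'February Hat']
```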

In [126]:
def patt_search(user_search):   
    
    '''
    Accepts a user search string for a specific pattern they are looking for.
    Uses the Fuzzy Wuzzy Python library to find and return ten pattern names that are similar to the input text.
    If no matches come back, returns 'Nothing Found'.
    '''
    choices = rav_rec_df.columns
    fuzzy_process = process.extract(user_search, choices, limit = 10)
    matches = [pattern[0] for pattern in fuzzy_process]

    if matches:
        return ', '.join(matches)
    else:
        return 'Nothing Found'
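
As far as I can tell, `process.extract` with a `limit` returns the best-scoring candidates however low their scores are, so the 'Nothing Found' branch above is rarely reached. A similarity cutoff would make it reachable. The sketch below swaps in the standard library's `difflib.get_close_matches` rather than Fuzzy Wuzzy so it runs without extra installs; `search_with_cutoff` and the 0.6 cutoff are illustrative assumptions:

```python
from difflib import get_close_matches

def search_with_cutoff(user_search, choices, limit=10, cutoff=0.6):
    '''Return up to `limit` close matches, or 'Nothing Found' if none reach the cutoff.'''
    # Compare in lowercase, then map back to the original spelling.
    by_lower = {name.lower(): name for name in choices}
    hits = get_close_matches(user_search.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return ', '.join(by_lower[hit] for hit in hits) if hits else 'Nothing Found'

names = ['Pressed Flowers Hat', 'Pressed Flowers Socks', 'Musselburgh']
print(search_with_cutoff('pressed flowers hat', names))
print(search_with_cutoff('qqqq', names))
```

Fuzzy Wuzzy's `process.extractBests` also accepts a `score_cutoff` argument, which may be the closer drop-in if staying with that library.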

In [130]:
search_term = input('Search for a pattern! ').lower()
patt_search(search_term)

Search for a pattern!  pressed flowers


'Pressed Flowers Hat, Pressed Flowers Socks, Pressed Flowers Pullover, Flowers of Fortrose Hat, Pressed Rib Cap & Muffler, Flaming Flowers Brioche Hat, Winter Wonder Flowers Hat, Flamboyant Flowers Beanie, Hearts and Flowers Socks, Floating Flowers Pullover'

In [129]:
# Asks user for a pattern they like and uses display_recs function to return five urls of similar patterns.
user_input = input('Type a knitting pattern you like: ')
display_recs(user_input)

Type a knitting pattern you like:  floewr


We think you are looking for the pattern titled Pressed Flowers Hat: https://www.ravelry.com/patterns/library/pressed-flowers-hat-2
Shiftalong: https://www.ravelry.com/patterns/library/shiftalong
Alpine Bloom Hat: https://www.ravelry.com/patterns/library/alpine-bloom-hat
Baudelaire: https://www.ravelry.com/patterns/library/baudelaire
Welcome to the Flock: https://www.ravelry.com/patterns/library/welcome-to-the-flock
We Call Them Pirates: https://www.ravelry.com/patterns/library/we-call-them-pirates
