# Ravelry API - Content Based Recommendation System
___

This notebook reads in the cleaned Ravelry dataframe and transforms it into a cosine distance matrix.  A content-based recommender can then use this matrix to compare patterns by their features.

The notebook includes code to search the database for patterns and to submit a pattern name in order to get five recommended patterns that are considered similar.  The `display_recs` function prints those five recommendations along with Ravelry URL links to each pattern's page.

### Contents:
- [Import Data & Prepare Features](#Import-Data-&-Prepare-Features)
- [Calculate Cosine Distances and Build Recommender Dataframe](#Calculate-Cosine-Distances-and-Build-Recommender-Dataframe)
- [User Interface and Recommender Function](#User-Interface-and-Recommender-Function)

|Function|Argument|Description|
|---|---|---|
|**display_recs**|*str* - user input|If the user_input pattern is in the rav_rec_df dataframe, it prints the five most similar patterns as well as URL links to each pattern's Ravelry page.|

In [30]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer

from scipy import sparse

import spacy

## Import Data & Prepare Features
___

In [31]:
rav_clean_df = pd.read_csv('../data/rav_clean.csv')
rav_clean_df.drop(columns = ['id', 'gauge', 'gauge_divisor'], inplace = True)

In [32]:
# Rename columns to make NLP encoding easier (e.g. so a 'name' or 'price' token from the text cannot clash
# with a column name) - otherwise verbose_feature_names_out in the column transformer must be set to True
rav_clean_df.rename(columns = {'name':'pattern_name', 'price':'pattern_price'}, inplace = True)

In [33]:
rav_clean_df.head()

Unnamed: 0,pattern_name,author,difficulty_avg,max_yardage,notes,pattern_price,projects_count,queued_projects_count,rating_avg,yarn_weight,type,gauge_per_inch
0,Musselburgh,Ysolda Teague,2.46,610.0,>Our favourite swatchless hat pattern\r\n> jus...,6.0,23656,7700,4.89,Fingering,hat,6.0
1,Classic Ribbed Hat,Purl Soho,1.92,305.0,MATERIALS\r\nPurl Soho’s [Cashmere Merino Bloo...,0.0,10353,5382,4.83,DK,hat,8.0
2,Alpine Bloom Hat,Caitlin Hunter,3.37,230.0,The Alpine Bloom hat is designed as a companio...,5.0,1520,2364,4.84,Sport,hat,6.0
3,Classic Cuffed Hat,Purl Soho,1.87,328.0,"MATERIALS\r\n\r\n- Hat with Pom Pom: 1 (2, 2) ...",0.0,9097,5043,4.7,Worsted,hat,5.0
4,February Hat,Kate Gagnon Osborn,2.62,213.0,When thinking about what I wanted to do for my...,0.0,3888,3532,4.75,Worsted,hat,4.5


In [34]:
rav_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   pattern_name           6000 non-null   object 
 1   author                 6000 non-null   object 
 2   difficulty_avg         6000 non-null   float64
 3   max_yardage            6000 non-null   float64
 4   notes                  6000 non-null   object 
 5   pattern_price          6000 non-null   float64
 6   projects_count         6000 non-null   int64  
 7   queued_projects_count  6000 non-null   int64  
 8   rating_avg             6000 non-null   float64
 9   yarn_weight            6000 non-null   object 
 10  type                   6000 non-null   object 
 11  gauge_per_inch         6000 non-null   float64
dtypes: float64(5), int64(2), object(5)
memory usage: 562.6+ KB


**Features to One Hot Encode**
* author
* yarn_weight
* type

**SpaCy Natural Language Processing**
* notes

**Features to Scale**
* difficulty_avg
* gauge_per_inch
* max_yardage
* pattern_price
* projects_count
* queued_projects_count
* rating_avg
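
To illustrate why the numeric features are scaled, here is a minimal sketch using only the five rows shown in the `head()` output above.  The `standardize` helper is hypothetical and only mirrors what `StandardScaler` does (subtract the mean, divide by the population standard deviation); the notebook itself fits the scaler on all 6000 rows, so the real scaled values will differ.

```python
import numpy as np

# Values copied from the first five rows of rav_clean_df shown above
projects_count = np.array([23656, 10353, 1520, 9097, 3888], dtype=float)
rating_avg = np.array([4.89, 4.83, 4.84, 4.70, 4.75])


def standardize(values):
    """Hypothetical helper: subtract the mean, divide by the population standard deviation."""
    return (values - values.mean()) / values.std()


projects_scaled = standardize(projects_count)
rating_scaled = standardize(rating_avg)

# After scaling, both features are on a comparable scale (mean 0, standard deviation 1)
print(projects_scaled.round(2))
print(rating_scaled.round(2))
```

Without scaling, a feature such as `projects_count` (thousands) would dominate one such as `rating_avg` (a narrow band near 5) in any distance calculation.  Note that scaled values can be negative, which matters for the cosine distances computed later.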

In [5]:
# Load the medium size SpaCy pipeline
nlp = spacy.load('en_core_web_md')

In [6]:
def spacy_processor(text):
    
    #Put the data into spaCy model
    doc = nlp(text)
    
    # Create a tokens list with only alpha characters and leaving out any that are only one letter
    # Also lemmatizes words and omits SpaCy stop words
    tokens = [token.lemma_.lower().strip() for token in doc if token.is_alpha and not token.is_stop and len(token.text) > 1]

    #Put the processed text back together
    processed_text = ' '.join(tokens)

    #return processed text to dataframe
    return processed_text
    

In [7]:
# Apply the function to the text column of rav_clean_df
rav_clean_df['processed_text'] = rav_clean_df['notes'].apply(spacy_processor)

In [8]:
rav_clean_df.drop(columns = ['notes'], inplace = True)

In [9]:
# Noticed only one pattern had nothing in processed text.  To make encoding and saving easier, will add the word
# 'socks'.  Pattern is 'Roza's Socks'.
rav_clean_df[rav_clean_df['processed_text'] == '']

Unnamed: 0,pattern_name,author,difficulty_avg,max_yardage,pattern_price,projects_count,queued_projects_count,rating_avg,yarn_weight,type,gauge_per_inch,processed_text
3738,Roza's Socks,Grumperina,2.92,429.0,0.0,491,283,4.23,Fingering,socks,8.25,


In [10]:
rav_clean_df.loc[3738,'processed_text'] = 'socks'

In [11]:
rav_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   pattern_name           6000 non-null   object 
 1   author                 6000 non-null   object 
 2   difficulty_avg         6000 non-null   float64
 3   max_yardage            6000 non-null   float64
 4   pattern_price          6000 non-null   float64
 5   projects_count         6000 non-null   int64  
 6   queued_projects_count  6000 non-null   int64  
 7   rating_avg             6000 non-null   float64
 8   yarn_weight            6000 non-null   object 
 9   type                   6000 non-null   object 
 10  gauge_per_inch         6000 non-null   float64
 11  processed_text         6000 non-null   object 
dtypes: float64(5), int64(2), object(5)
memory usage: 562.6+ KB


In [12]:
# Instantiate the transformers

ohe = OneHotEncoder(handle_unknown='ignore',
                    drop = 'first',
                   sparse_output = False)

sc = StandardScaler()
tvec = TfidfVectorizer(max_features = 3000) # Getting errors until I lowered max_features to about 3000

In [13]:
# Transform columns as mentioned previously
ctx = ColumnTransformer(
    transformers=[
        ('one_hot', ohe, ['author', 'yarn_weight', 'type']),
        ('sc', sc, ['difficulty_avg', 'gauge_per_inch',
                    'max_yardage', 'pattern_price', 'projects_count',
                    'queued_projects_count', 'rating_avg']),
        ('tvec', tvec, 'processed_text'), 
    ],
    remainder='passthrough', verbose_feature_names_out= False
)

rav_clean_enc = ctx.fit_transform(rav_clean_df)

In [14]:
rav_clean_enc = pd.DataFrame(rav_clean_enc,
                             columns = ctx.get_feature_names_out(),
                            )

rav_clean_enc.set_index(['pattern_name'], inplace = True)

## Calculate Cosine Distances and Build Recommender Dataframe
___

For this to work, `rav_clean_enc` must first be converted to a float array.  Referenced [this](https://stackoverflow.com/questions/57434284/covert-to-sparse-matrix-typeerror-no-supported-conversion-for-types-dtype) Stack Overflow post.

In [15]:
rav_clean_array = np.array(rav_clean_enc, dtype = float)

In [16]:
rav_clean_sparse = sparse.csr_matrix(rav_clean_array)

In [17]:
rav_clean_sparse

<6000x4700 sparse matrix of type '<class 'numpy.float64'>'
	with 475617 stored elements in Compressed Sparse Row format>

In [18]:
distances = pairwise_distances(rav_clean_sparse, metric = 'cosine')
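
For reference, cosine distance is one minus the cosine of the angle between two feature vectors.  The sketch below is a hand-rolled version on made-up vectors (the `cosine_distance` helper is hypothetical, not part of scikit-learn) to show the range of values to expect.

```python
import numpy as np


def cosine_distance(u, v):
    """Hypothetical helper: 1 - cosine similarity of two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


same_direction = cosine_distance([1, 2], [2, 4])    # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])        # nothing in common
opposite = cosine_distance([1, 2], [-1, -2])        # pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Parallel vectors give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2.  Since the standardized features can be negative, distances above 1 are possible in the matrix computed here.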


In [19]:
distances

array([[0.        , 0.04621287, 0.36420993, ..., 0.89519602, 1.16480299,
        1.19011466],
       [0.04621287, 0.        , 0.26616476, ..., 0.92703129, 1.22294791,
        1.23068005],
       [0.36420993, 0.26616476, 0.        , ..., 0.82979868, 1.20798853,
        1.21311884],
       ...,
       [0.89519602, 0.92703129, 0.82979868, ..., 0.        , 0.70442856,
        0.57303657],
       [1.16480299, 1.22294791, 1.20798853, ..., 0.70442856, 0.        ,
        0.62866486],
       [1.19011466, 1.23068005, 1.21311884, ..., 0.57303657, 0.62866486,
        0.        ]])

In [20]:
rav_rec_df = pd.DataFrame(distances, columns = rav_clean_enc.index, index = rav_clean_enc.index)

In [21]:
rav_rec_df.head()

pattern_name,Musselburgh,Classic Ribbed Hat,Alpine Bloom Hat,Classic Cuffed Hat,February Hat,October Hat,Manhattan Hat,My Baker's Hat,Berry Baby Hat,Basic Baby Hat,...,Sister Snowflakes,Galloway Pullover,Afterlight,Rolling Rock,Amélie,Bray,Barbet Turtleneck,Stonewall,Park Pullover,Construction Trucks Sweater
Musselburgh,0.0,0.046213,0.36421,0.054929,0.18675,0.250587,0.645529,0.777101,0.075066,0.044225,...,1.105664,1.043551,0.827158,0.702654,0.952868,0.863075,1.136594,0.895196,1.164803,1.190115
Classic Ribbed Hat,0.046213,0.0,0.266165,0.021598,0.111116,0.144282,0.652178,0.712427,0.033014,0.037861,...,1.052955,1.083355,0.807411,0.677096,0.965539,0.933767,1.128567,0.927031,1.222948,1.23068
Alpine Bloom Hat,0.36421,0.266165,0.0,0.25745,0.203746,0.212913,0.656075,0.810363,0.212479,0.34615,...,0.91328,1.057985,0.720213,0.467526,0.822247,0.75156,1.21177,0.829799,1.207989,1.213119
Classic Cuffed Hat,0.054929,0.021598,0.25745,0.0,0.070223,0.151293,0.61558,0.644627,0.018428,0.046348,...,1.051592,1.078458,0.827334,0.662563,0.980172,0.887187,1.050248,0.863138,1.18538,1.175848
February Hat,0.18675,0.111116,0.203746,0.070223,0.0,0.115133,0.614808,0.571329,0.055685,0.141568,...,0.981764,1.075296,0.823158,0.628921,0.955381,0.865939,0.992007,0.823667,1.189676,1.145111
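
To show how a labeled distance matrix like this yields recommendations, here is a minimal sketch on a made-up four-pattern matrix.  The pattern names, the distances, and the `nearest` helper are all assumptions for illustration.

```python
import pandas as pd

toy_names = ['Hat A', 'Hat B', 'Sock C', 'Sweater D']

# Symmetric toy distance matrix with zeros on the diagonal
toy_rec_df = pd.DataFrame(
    [[0.0, 0.1, 0.8, 1.2],
     [0.1, 0.0, 0.7, 1.1],
     [0.8, 0.7, 0.0, 0.9],
     [1.2, 1.1, 0.9, 0.0]],
    index=toy_names,
    columns=toy_names,
)


def nearest(dist_df, name, k=2):
    """Hypothetical helper: the k patterns with the smallest distance to `name`."""
    # Drop the query pattern by label, then keep the k smallest distances
    return list(dist_df[name].drop(index=name).nsmallest(k).index)


print(nearest(toy_rec_df, 'Hat A'))
print(nearest(toy_rec_df, 'Sweater D'))
```

Dropping the query pattern by label is one design choice; slicing off the first sorted position works equally well as long as the query is the only entry at distance 0 and each pattern name is unique.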


## User Interface and Recommender Function
___

In [22]:
# Make a list of pattern names that can be searched
patterns = rav_rec_df.index

In [23]:
patterns

Index(['Musselburgh', 'Classic Ribbed Hat', 'Alpine Bloom Hat',
       'Classic Cuffed Hat', 'February Hat', 'October Hat', 'Manhattan Hat',
       'My Baker's Hat', 'Berry Baby Hat', 'Basic Baby Hat',
       ...
       'Sister Snowflakes', 'Galloway Pullover', 'Afterlight', 'Rolling Rock',
       'Amélie', 'Bray', 'Barbet Turtleneck', 'Stonewall', 'Park Pullover',
       'Construction Trucks Sweater'],
      dtype='object', name='pattern_name', length=6000)

In [24]:
def display_recs(user_pattern):
    '''
    Print the five patterns most similar to 'user_pattern', each with a URL link to its Ravelry page.

    The function accepts a 'user_pattern' string argument, which is the name of a pattern the user wants recommendations for.
    Using the 'rav_rec_df' dataframe, the function sorts out the five patterns most similar to 'user_pattern' and saves them in a 'top_five' list variable.
    The function then takes 'top_five' and, using a character replacement table, transforms each pattern name into its Ravelry URL equivalent.
    Finally, the function prints the name of each recommended pattern along with a URL link to its details on Ravelry.  If the pattern is not found in 'rav_rec_df',
    the function handles the KeyError by printing a message asking the user to check their input.
    '''

    # Use try here to catch any typos or non-existent patterns the user may enter
    try:

        # The first sorted entry is normally the pattern itself (distance 0), so skip it and keep the next five
        top_five = list(rav_rec_df[user_pattern].sort_values().iloc[1:6].index)

        # Make a dictionary to deal with characters that appear in the pattern name but not in the url address
        replacements = {
            '#': '',
            '&': '',
            ' ': '-',
            '/': '-',
            '!': '',
            '@': '-',
            '~': '-',
            ',': '',
            "'": ''
        }

        # Make a table of the replacements dictionary using .maketrans
        replacement_table = str.maketrans(replacements)

        # Iterate through top five patterns to transform names into their url equivalents
        for patt in top_five:
            url_ready = patt.translate(replacement_table).lower()
            print(f'{patt}: https://www.ravelry.com/patterns/library/{url_ready}')

    # Print an error message if typo or non-existent pattern
    except KeyError:
        print('Please check to see if your pattern is typed correctly.  It must be written exactly as designer writes it.')
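
The name-to-URL step can be tried on its own.  The sketch below reuses the same replacement dictionary in a hypothetical `to_ravelry_slug` helper and applies it to three pattern names from the dataset.  Whether the resulting URLs resolve on Ravelry for every pattern is not verified here; in particular, accented characters are left untouched and any suffix Ravelry may add to distinguish same-named patterns is not handled.

```python
# Same replacement dictionary as in display_recs
replacements = {
    '#': '',
    '&': '',
    ' ': '-',
    '/': '-',
    '!': '',
    '@': '-',
    '~': '-',
    ',': '',
    "'": ''
}
replacement_table = str.maketrans(replacements)


def to_ravelry_slug(pattern_name):
    """Hypothetical helper: apply the replacement table, then lowercase the result."""
    return pattern_name.translate(replacement_table).lower()


for name in ["My Baker's Hat", 'Classic Ribbed Hat', 'Amélie']:
    print(f'{name}: https://www.ravelry.com/patterns/library/{to_ravelry_slug(name)}')
```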

In [37]:
search_term = input('Search for a pattern! ').lower()

# Match the search term literally (regex = False) so characters such as '(' or '+' cannot raise a regex error
matches = list(patterns[patterns.str.lower().str.contains(search_term, regex = False)])

if matches == []:
    print('Nothing Found')
else:
    print(matches)

Search for a pattern!  Sip


['Sip', 'Sipila']
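
Since `display_recs` needs the name exactly as the designer wrote it, a fuzzy lookup is one possible complement to the substring search above.  This is a sketch using the standard library's `difflib`, not something the notebook does; the `suggest_patterns` helper and the short name list are assumptions for illustration.

```python
from difflib import get_close_matches


def suggest_patterns(user_pattern, pattern_names, n=3):
    """Hypothetical helper: return up to n known names that closely match the input."""
    # Compare in lowercase, then map the matches back to their original spelling
    lookup = {name.lower(): name for name in pattern_names}
    close = get_close_matches(user_pattern.lower(), list(lookup), n=n, cutoff=0.6)
    return [lookup[match] for match in close]


sample_names = ['Musselburgh', 'Classic Ribbed Hat', 'Sip', 'Sipila']

# A misspelled name still finds the intended pattern
print(suggest_patterns('musselburg', sample_names))
```

With the real data, `patterns` would take the place of `sample_names`.  The `cutoff` value of 0.6 is the `difflib` default and may need tuning for short pattern names.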


In [39]:
# Asks user for a pattern they like and uses display_recs function to return five urls of similar patterns.
user_input = input('Type a knitting pattern you like: ')
display_recs(user_input)

Type a knitting pattern you like:  Brassica


Holey Hat: https://www.ravelry.com/patterns/library/holey-hat
Close Knit Waffle Hat: https://www.ravelry.com/patterns/library/close-knit-waffle-hat
Backcountry Hat: https://www.ravelry.com/patterns/library/backcountry-hat
The Republic Hat: https://www.ravelry.com/patterns/library/the-republic-hat
Gnarly Hat: https://www.ravelry.com/patterns/library/gnarly-hat


In [27]:
rav_rec_df.to_pickle('../data/rav_rec.pkl')