# Part I: Content-Based Filtering
---
Code and data based on: https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

In [2]:
# source of sample data: 
# https://github.com/nikitaa30/Content-based-Recommender-System/blob/master/sample-data.csv

ds = pd.read_csv("data/sample-data.csv")

In [3]:
ds.head()

| | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |


In [4]:
ds.tail()

| | id | description |
|---|---|---|
| 495 | 496 | Cap 2 bottoms - Cut loose from the maddening c... |
| 496 | 497 | Cap 2 crew - This crew takes the edge off fick... |
| 497 | 498 | All-time shell - No need to use that morning T... |
| 498 | 499 | All-wear cargo shorts - All-Wear Cargo Shorts ... |
| 499 | 500 | All-wear shorts - Time to simplify? Our All-We... |


In [10]:
# see full description of first item
ds.iloc[0,1]

'Active classic boxers - There\'s a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"</li> <li>"Exposed, brushed elastic waistband for comfort"</li> <li>5-panel construction with traditional boxer back</li> <li>"Inseam (size M) is 4 1/2"""</li></ul><br><br><b>Fabric: </b>3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>99 g (3.5 oz)<br><b

**The same description, rendered as Markdown:**

Active classic boxers - There's a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"</li> <li>"Exposed, brushed elastic waistband for comfort"</li> <li>5-panel construction with traditional boxer back</li> <li>"Inseam (size M) is 4 1/2"""</li></ul><br><br><b>Fabric: </b>3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>99 g (3.5 oz)<br><br>Made in Mexico.
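
The raw descriptions contain HTML tags, and the default word tokenizer would pick up tag names such as `br`, `ul` and `li` as terms. The notebook does not clean them. If one wanted to, a minimal sketch might look like the following; the helper `strip_html` is hypothetical, the sample string is a shortened excerpt modeled on the description above, and a regular expression is only a rough approximation of real HTML parsing.

```python
import re

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space, then collapse whitespace."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    return ' '.join(no_tags.split())

# shortened excerpt, for illustration only
sample = 'Made of recycled polyester.<br><br><b>Weight: </b>99 g (3.5 oz)<br><br>Made in Mexico.'
print(strip_html(sample))
```

Applied to the whole column, this could be written as `ds['description'].apply(strip_html)` before vectorizing.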

In [12]:
# calculate TF-IDF scores for every unigram, bigram and trigram in each item description
# min_df=1 (the default) keeps every term that occurs in at least one description
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

In [13]:
# each row of tfidf_matrix is an item and each column a term; the entries are TF-IDF scores
print(tfidf_matrix.toarray())
tfidf_matrix.shape, type(tfidf_matrix)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


((500, 52262), scipy.sparse.csr.csr_matrix)
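
The dense printout above shows only zeros, most likely because each description uses a small fraction of the vocabulary. A sparse matrix can be summarized without converting it to a dense array. The sketch below uses an invented toy matrix; the same attributes (`shape`, `nnz`) should be available on `tfidf_matrix`.

```python
from scipy.sparse import csr_matrix

# invented 2x3 matrix with a single non-zero entry
toy_sparse = csr_matrix([[0.0, 0.5, 0.0],
                         [0.0, 0.0, 0.0]])

n_rows, n_cols = toy_sparse.shape
density = toy_sparse.nnz / (n_rows * n_cols)  # fraction of entries that are stored as non-zero
print(toy_sparse.nnz, density)
```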

In [14]:
# cosine similarity between every pair of item vectors (reflects similarity of item descriptions);
# TfidfVectorizer L2-normalizes each row by default, so the dot product from linear_kernel equals the cosine
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_similarities.shape

(500, 500)
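
`linear_kernel` computes plain dot products; it yields cosine similarity here only because the TF-IDF rows have unit length. A small sketch with two invented vectors illustrates that identity using NumPy alone.

```python
import numpy as np

# two invented item vectors (not from the dataset)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

# cosine similarity from its definition
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# dot product after scaling each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(np.isclose(cosine, dot_of_units))
```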

In [15]:
cosine_similarities[:5,:5]

array([[1.        , 0.10110642, 0.06487353, 0.05420526, 0.04566789],
       [0.10110642, 1.        , 0.4181664 , 0.0545398 , 0.05834021],
       [0.06487353, 0.4181664 , 1.        , 0.05003225, 0.06391289],
       [0.05420526, 0.0545398 , 0.05003225, 1.        , 0.09967924],
       [0.04566789, 0.05834021, 0.06391289, 0.09967924, 1.        ]])

In [16]:
# for each item, store the 98 most similar other items as (score, id) pairs in results

results = {}
for idx, row in ds.iterrows():
    # indices of the 99 highest scores, best first; the top hit is normally the item itself
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    results[row['id']] = similar_items[1:]  # drop the first entry (the item itself)
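
The loop above drops the first entry on the assumption that each item's best match is itself. If two descriptions were identical, the tie could put another item first. A hedged alternative is to exclude the item by position; `top_k_similar` and the toy matrix below are hypothetical and only sketch the idea.

```python
import numpy as np

def top_k_similar(sim_matrix, row, k):
    """Return (score, column) pairs for the k most similar columns, skipping `row` itself."""
    order = np.argsort(sim_matrix[row])[::-1]  # column positions, highest score first
    order = order[order != row][:k]            # drop the item itself explicitly
    return [(float(sim_matrix[row, j]), int(j)) for j in order]

# invented 3x3 similarity matrix for illustration
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])

print(top_k_similar(toy_sim, row=0, k=2))
```

Note that this returns row positions rather than values from the `id` column, so a lookup such as `ds['id'][j]` would still be needed to match the notebook's `results` dictionary.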

In [17]:
results[1][:10]

[(0.22037921472617453, 19),
 (0.16938950913002357, 494),
 (0.16769458065321555, 18),
 (0.16485527745622977, 172),
 (0.148126154605864, 442),
 (0.14577863284367545, 171),
 (0.1413764236536125, 21),
 (0.13884463426216978, 495),
 (0.13879533331363048, 25),
 (0.13813550299091404, 496)]

In [20]:
# Define a helper that returns the product name (the text before ' - ') for a given item id

def item(item_id):
    return ds.loc[ds['id'] == item_id]['description'].tolist()[0].split(' - ')[0]

In [21]:
# Define function that makes recommendations based on similarity of item descriptions (i.e. cosine similarity)

def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")

    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
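
Printing is convenient in a notebook, but returning the recommendations makes them easier to test or reuse. The sketch below is a hypothetical variant, `get_recommendations`, that takes the lookup structures as arguments; `toy_results` and `toy_names` are invented stand-ins for `results` and the product names.

```python
def get_recommendations(item_id, num, results, names):
    """Return up to `num` (name, score) pairs for `item_id`."""
    return [(names[rec_id], score) for score, rec_id in results[item_id][:num]]

# invented stand-ins, for illustration only
toy_results = {1: [(0.9, 2), (0.5, 3)]}
toy_names = {1: "Item A", 2: "Item B", 3: "Item C"}

print(get_recommendations(1, 1, toy_results, toy_names))
```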

In [23]:
# Test recommender system with different inputs:
recommend(1, 5)

Recommending 5 products similar to Active classic boxers...
-------
Recommended: Cap 1 boxer briefs (score:0.22037921472617453)
Recommended: Active boxer briefs (score:0.16938950913002357)
Recommended: Cap 1 bottoms (score:0.16769458065321555)
Recommended: Cap 1 t-shirt (score:0.16485527745622977)
Recommended: Cap 3 bottoms (score:0.148126154605864)


In [24]:
recommend(11, 3)

Recommending 3 products similar to Baby sunshade top...
-------
Recommended: Sunshade hoody (score:0.2133029602108501)
Recommended: Baby baggies apron dress (score:0.10975311296284813)
Recommended: Runshade t-shirt (score:0.09988151262780706)


In [26]:
recommend(100, 10)

Recommending 10 products similar to Paddler board shorts...
-------
Recommended: Minimalist board shorts-19 in. (score:0.27192038148392816)
Recommended: Wavefarer board shorts-21 in. (score:0.22439507803604378)
Recommended: Twenty-three's board shorts (score:0.2177138862218524)
Recommended: Light and variable surf trunks (score:0.20669606455013254)
Recommended: Wavefarer board shorts (score:0.15452391398846163)
Recommended: River shorts (score:0.15017991174958498)
Recommended: Girl's boardie shorts (score:0.14094609172427439)
Recommended: Marlwalker pants (score:0.1402591813373583)
Recommended: Cotton board shorts (score:0.13331764205970884)
Recommended: Meridian board shorts (score:0.12438319780671935)
