Content-based filtering is a type of recommender system that attempts to guess what a user may like based on that user's activity. Content-based filtering makes recommendations by using keywords and attributes assigned to objects in a database (e.g., items in an online marketplace) and matching them to a user profile.
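
As a rough illustration of matching item attributes to a user profile, the sketch below scores a few items by keyword overlap. The catalog, the profile, and the `score` helper are all hypothetical and are not part of this notebook's dataset; Jaccard overlap is just one simple way to do the matching.

```python
# Hypothetical catalog: each item is described by a set of keywords.
items = {
    "boxers": {"underwear", "polyester", "quick-drying"},
    "rain jacket": {"jacket", "waterproof", "shell"},
    "briefs": {"underwear", "polyester", "moisture-wicking"},
}

# Hypothetical user profile, e.g. keywords of items the user has viewed.
user_profile = {"underwear", "polyester"}


def score(item_keywords, profile):
    """Jaccard overlap between an item's keywords and the profile."""
    return len(item_keywords & profile) / len(item_keywords | profile)


# Rank items from best to worst match for this profile.
ranked = sorted(items, key=lambda name: score(items[name], user_profile), reverse=True)
print(ranked)
```

The rest of this notebook uses the same idea, but with TF-IDF vectors in place of keyword sets and item-to-item similarity in place of an explicit profile.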

In [1]:
import pandas as pd

# Loading Dataset

In [2]:
df = pd.read_csv('dataset.csv')

In [3]:
df.head()

| | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |


# EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
id             500 non-null int64
description    500 non-null object
dtypes: int64(1), object(1)
memory usage: 7.9+ KB


In [6]:
df.shape

(500, 2)

The dataset has two columns and 500 rows.

In [7]:
df.describe()

| | id |
|---|---|
| count | 500.0 |
| mean | 250.5 |
| std | 144.481833 |
| min | 1.0 |
| 25% | 125.75 |
| 50% | 250.5 |
| 75% | 375.25 |
| max | 500.0 |


In [8]:
df['description'][0]

'Active classic boxers - There\'s a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.<br><br><b>Details:</b><ul> <li>"Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"</li> <li>"Exposed, brushed elastic waistband for comfort"</li> <li>5-panel construction with traditional boxer back</li> <li>"Inseam (size M) is 4 1/2"""</li></ul><br><br><b>Fabric: </b>3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program<br><br><b>Weight: </b>99 g (3.5 oz)<br><b

The description contains HTML elements that need to be removed.

# Preprocessing

### Remove HTML Elements

In [10]:
import re


def remove_html_elements(text):
    return re.sub(r"<[a-z/]+>", " ", text) 

# example execution
print(remove_html_elements(df['description'][0]))

Active classic boxers - There's a reason why our boxers are a cult favorite - they keep their cool, especially in sticky situations. The quick-drying, lightweight underwear takes up minimal space in a travel pack. An exposed, brushed waistband offers next-to-skin softness, five-panel construction with a traditional boxer back for a classic fit, and a functional fly. Made of 3.7-oz 100% recycled polyester with moisture-wicking performance. Inseam (size M) is 4 1/2". Recyclable through the Common Threads Recycling Program.   Details:    "Silky Capilene 1 fabric is ultralight, breathable and quick-to-dry"   "Exposed, brushed elastic waistband for comfort"   5-panel construction with traditional boxer back   "Inseam (size M) is 4 1/2"""     Fabric:  3.7-oz 100% all-recycled polyester with Gladiodor natural odor control for the garment. Recyclable through the Common Threads Recycling Program   Weight:  99 g (3.5 oz)  Made in Mexico.


In [11]:
df['description'] = df['description'].apply(remove_html_elements)
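
The pattern `<[a-z/]+>` only matches lowercase tags without attributes, which is enough for the sample shown above. If the descriptions also contained uppercase tags or tags with attributes (an assumption, not something verified on this dataset), a more permissive pattern could be used. The helper name `remove_html_tags` and the sample string below are hypothetical.

```python
import re


def remove_html_tags(text):
    """Replace anything of the form <...> with a space (hypothetical variant)."""
    return re.sub(r"<[^>]+>", " ", text)


# Made-up sample with an uppercase tag and a tag carrying an attribute.
sample = '<B>Details:</B><ul class="specs"><li>5-panel construction</li></ul>'

print(re.sub(r"<[a-z/]+>", " ", sample))  # original pattern: some tags survive
print(remove_html_tags(sample))           # permissive pattern: all tags removed
```

The trade-off is that `<[^>]+>` also removes plain text that happens to sit between a `<` and a `>`, so it is only a sketch; a real HTML parser is the safer choice for messy input.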

### Remove special characters

In [12]:
def remove_special_char(text):
    # keep letters only: every run of digits, punctuation or whitespace becomes one space
    return re.sub(r"[^A-Za-z]+", " ", text)

# example execution
print(remove_special_char(df['description'][0]))

Active classic boxers There s a reason why our boxers are a cult favorite they keep their cool especially in sticky situations The quick drying lightweight underwear takes up minimal space in a travel pack An exposed brushed waistband offers next to skin softness five panel construction with a traditional boxer back for a classic fit and a functional fly Made of oz recycled polyester with moisture wicking performance Inseam size M is Recyclable through the Common Threads Recycling Program Details Silky Capilene fabric is ultralight breathable and quick to dry Exposed brushed elastic waistband for comfort panel construction with traditional boxer back Inseam size M is Fabric oz all recycled polyester with Gladiodor natural odor control for the garment Recyclable through the Common Threads Recycling Program Weight g oz Made in Mexico 


In [13]:
df['description'] = df['description'].apply(remove_special_char)

### Remove multiple spaces

In [14]:
def remove_whitespaces(text):
    return re.sub(' +', ' ', text)
    
# example execution to see the effect
print(remove_whitespaces(df['description'][0]))   

Active classic boxers There s a reason why our boxers are a cult favorite they keep their cool especially in sticky situations The quick drying lightweight underwear takes up minimal space in a travel pack An exposed brushed waistband offers next to skin softness five panel construction with a traditional boxer back for a classic fit and a functional fly Made of oz recycled polyester with moisture wicking performance Inseam size M is Recyclable through the Common Threads Recycling Program Details Silky Capilene fabric is ultralight breathable and quick to dry Exposed brushed elastic waistband for comfort panel construction with traditional boxer back Inseam size M is Fabric oz all recycled polyester with Gladiodor natural odor control for the garment Recyclable through the Common Threads Recycling Program Weight g oz Made in Mexico 


In [15]:
df['description'] = df['description'].apply(remove_whitespaces)

### Make all lowercase

In [16]:
df['description'] = df['description'].str.lower()
# example execution to see the effect
print(df['description'][0])

active classic boxers there s a reason why our boxers are a cult favorite they keep their cool especially in sticky situations the quick drying lightweight underwear takes up minimal space in a travel pack an exposed brushed waistband offers next to skin softness five panel construction with a traditional boxer back for a classic fit and a functional fly made of oz recycled polyester with moisture wicking performance inseam size m is recyclable through the common threads recycling program details silky capilene fabric is ultralight breathable and quick to dry exposed brushed elastic waistband for comfort panel construction with traditional boxer back inseam size m is fabric oz all recycled polyester with gladiodor natural odor control for the garment recyclable through the common threads recycling program weight g oz made in mexico 
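
The four cleaning steps above can also be chained in a single function. This is a minimal sketch that reuses the same regular expressions; the name `clean_text` and the sample string are hypothetical.

```python
import re


def clean_text(text):
    """Apply the four cleaning steps in order (hypothetical helper)."""
    text = re.sub(r"<[a-z/]+>", " ", text)   # strip simple HTML tags
    text = re.sub(r"[^A-Za-z]+", " ", text)  # keep letters only
    text = re.sub(r" +", " ", text)          # collapse repeated spaces
    return text.lower()                      # normalize case


print(clean_text("Weight: <b>99 g</b> (3.5 oz)<br>Made in Mexico."))
# → weight g oz made in mexico   (a trailing space remains)
```

Applied to the raw column, `df['description'].apply(clean_text)` would replace the separate cells above with one pass.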


# Content Based Filtering

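TfidfVectorizer is a basic building block of many NLP pipelines. It vectorizes text documents, i.e. it transforms each document into an array of numbers in which every term is weighted by its frequency in the document and its rarity across the corpus, so that the documents can be used in subsequent tasks.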

The linear kernel of two vectors is simply their dot product. In classification it is the kernel of choice when the data is linearly separable or has a large number of features, but here `linear_kernel` is used only to compute the pairwise dot products between the TF-IDF vectors.
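
In this notebook `linear_kernel` serves as a way to obtain cosine similarities: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so a plain dot product between rows already equals their cosine similarity. The sketch below checks this on a made-up corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "recycled polyester boxers",
    "recycled polyester briefs",
    "waterproof alpine jacket",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows have unit L2 norm by default

dot = linear_kernel(tfidf, tfidf)      # pairwise dot products
cos = cosine_similarity(tfidf, tfidf)  # pairwise cosine similarities

print(np.allclose(dot, cos))  # the two matrices agree on normalized rows
```

`linear_kernel` skips the normalization step that `cosine_similarity` performs, which is why it is often preferred on TF-IDF output.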

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [25]:
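t_vec = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 3),   # unigrams, bigrams and trigrams
    min_df=1,             # keep every term; recent scikit-learn versions reject the integer 0
    stop_words='english')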

In [26]:
tvec_matrix = t_vec.fit_transform(df['description'])

In [27]:
tvec_matrix 

<500x49476 sparse matrix of type '<class 'numpy.float64'>'
	with 124956 stored elements in Compressed Sparse Row format>

**cosine similarity**

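Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them: 1 means the vectors point in the same direction and 0 means they are orthogonal. Because the TF-IDF rows are L2-normalized by default, their dot product (which is what `linear_kernel` computes) is equal to their cosine similarity.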
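
For two vectors a and b, the cosine similarity is the dot product a·b divided by the product of their norms. A minimal NumPy sketch, with made-up vectors and a hypothetical helper name:

```python
import numpy as np


def cosine_similarity_manual(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cosine_similarity_manual(a, b))  # close to 1: same direction
print(cosine_similarity_manual(a, c))  # 0: no shared components
```

Since the measure ignores vector length, a long description and a short one can still score as highly similar when they use the same terms in similar proportions.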

In [28]:
cosine_similarities = linear_kernel(tvec_matrix, tvec_matrix)

In [38]:
results = {}

for idx, row in df.iterrows():
    print('{} {}'.format(idx, row))
    # indices of the 99 highest scores in this row, in descending order
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], df['id'][i]) for i in similar_indices]

    # drop the first entry: the top match is normally the item itself
    results[row['id']] = similar_items[1:]

0 id                                                             1
description    active classic boxers there s a reason why our...
Name: 0, dtype: object
1 id                                                             2
description    active sport boxer briefs skinning up glory re...
Name: 1, dtype: object
2 id                                                             3
description    active sport briefs these superbreathable no f...
Name: 2, dtype: object
3 id                                                             4
description    alpine guide pants skin in climb ice switch to...
Name: 3, dtype: object
4 id                                                             5
description    alpine wind jkt on high ridges steep ice and a...
Name: 4, dtype: object
5 id                                                             6
description    ascensionist jkt our most technical soft shell...
Name: 5, dtype: object
[remaining output omitted: one such entry is printed for each of the 500 rows]
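
The slice `argsort()[:-100:-1]` in the loop above is compact but easy to misread: `argsort` returns indices in ascending order of score, and the slice walks backwards from the end and stops before the 100th-from-last element, so it yields the 99 best indices in descending order. A small sketch with a hypothetical score row shows the same pattern for the top 3:

```python
import numpy as np

# Hypothetical similarity row for the item at index 1 (its self-similarity is 1.0).
scores = np.array([0.1, 1.0, 0.4, 0.7, 0.2])

top3 = scores.argsort()[:-4:-1]  # indices of the 3 highest scores, best first
print(top3)                      # → [1 3 2]
print(top3[1:])                  # → [3 2], after dropping the item itself

# The same slice with -100 on a 500-element row returns 99 indices.
print(len(np.arange(500)[:-100:-1]))  # → 99
```

After dropping the item itself, each entry in `results` therefore holds at most 98 neighbours.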

In [39]:
results
# { id : [(similarity, id)], ...}

{1: [(0.1998035147475983, 19),
  (0.15182409328775237, 18),
  (0.14951420049273145, 494),
  (0.14806033056097512, 172),
  (0.13030126898051572, 442),
  (0.1243423196112318, 495),
  (0.12392584918794144, 171),
  (0.12147764628515076, 496),
  (0.12115585151508487, 21),
  (0.11808451510781004, 25),
  (0.11504658905198849, 487),
  (0.11235479373851076, 20),
  (0.1118327275919725, 340),
  (0.10959808513484064, 341),
  (0.10692528247679021, 488),
  (0.10614125580266508, 176),
  (0.10207387829377773, 440),
  (0.09777843752722717, 60),
  (0.09748273457862415, 497),
  (0.09359390250402862, 441),
  (0.09317238064140862, 413),
  (0.09293701129609476, 173),
  (0.08994786659969368, 443),
  (0.08983693442710128, 174),
  (0.0856473126572201, 359),
  (0.08559224675918581, 22),
  (0.08406462778023249, 23),
  (0.0839371084729726, 61),
  (0.08381729893691749, 365),
  (0.08208667249888199, 360),
  (0.08188555763627496, 175),
  (0.079568083310144, 24),
  (0.07953567521275805, 2),
  (0.07927542874833748, 40

# Get Predictions

In [46]:
# retrieve item description
def get_item(item_id):
    # select the description of the row whose 'id' matches item_id
    return df.loc[df['id'] == item_id, 'description'].iloc[0]

get_item(1)

active classic boxers there s a reason why our boxers are a cult favorite they keep their cool especially in sticky situations the quick drying lightweight underwear takes up minimal space in a travel pack an exposed brushed waistband offers next to skin softness five panel construction with a traditional boxer back for a classic fit and a functional fly made of oz recycled polyester with moisture wicking performance inseam size m is recyclable through the common threads recycling program details silky capilene fabric is ultralight breathable and quick to dry exposed brushed elastic waistband for comfort panel construction with traditional boxer back inseam size m is fabric oz all recycled polyester with gladiodor natural odor control for the garment recyclable through the common threads recycling program weight g oz made in mexico 


In [52]:
# get n number of similar product ids from cosine similarity results
def get_recommendations(item_id, num_of_rec=3):
    # get n number of records (n = num_of_rec)
    recs = results[item_id][:num_of_rec]
    for idx, rec in enumerate(recs):
        print('Recommendation {}: {}\nScore: {}'.format(idx, get_item(rec[1]), rec[0]))
        print('---------------------------------------------------------------------')

In [53]:
get_recommendations(1, 3)

Recommendation 0: cap boxer briefs on bivy or belay the form fitting capilene boxer briefs stay dry and comfortable made from recycled polyester the underwear excels at moisture wicking and has gladiodor natural odor control for the garment exposed elastic waistband is brushed for softness the hem is coverstitched for a smooth glide beneath shorts or pants fully functioning fly and supportive front panel keep you covered inseam size m recyclable through the common threads recycling program details moisture wicking capilene fabric with gladiodor for exceptional next to skin comfort and natural odor control for the garment brushed elastic waistband supportive front panel cover stitched hem for smooth glide under shorts or pants won t restrict mobility inseam size m is fabric oz all recycled polyester with gladiodor natural odor control for the garment recyclable through the common threads recycling program weight g oz made in usa 
Score: 0.1998035147475983
-------------------------------