*Recommending **apparel** to users based on **product descriptions**: products whose description text is similar are recommended together, using a content-based **Recommendation Engine** algorithm*

* Recommendation Engine
* Content Based Filtering
* Cosine Similarity

In [1]:
import pandas as pd # Dataframe Manipulation library

# sklearn modules for Content Based Filtering (to handle text)
from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.feature_extraction.text: submodule that gathers utilities to build feature vectors from text documents
# TfidfVectorizer: converts a collection of raw documents to a matrix of TF-IDF features, i.e.
# each description is represented as a vector in a space with one dimension per term
# The text is broken into unique terms; each term's weight is its frequency in the document (TF)
# scaled down by how many documents contain it (IDF), and the terms become the features

from sklearn.metrics.pairwise import linear_kernel
# sklearn.metrics.pairwise: submodule that implements utilities to evaluate pairwise distances or affinities of sets of samples
# linear_kernel: the dot product of two vectors (a polynomial kernel with degree=1 and coef0=0)
# TfidfVectorizer L2-normalises each row by default, so this dot product equals the Cosine Similarity
# between the text content of one product and that of every other product.
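
The claim that `linear_kernel` yields cosine similarity on TF-IDF vectors can be checked on a small example. This is a minimal sketch, assuming the default `norm='l2'` of `TfidfVectorizer`; the three toy descriptions are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions, not rows from sample-data.csv
docs = [
    "organic cotton tee",
    "organic cotton tank",
    "alpine wind jacket",
]

# Rows are L2-normalised because norm='l2' is the default
tfidf = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)      # plain dot products
cos = cosine_similarity(tfidf, tfidf)  # explicit cosine similarity

# On unit-length rows the two matrices should agree
same = np.allclose(dot, cos)
print(same)
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the safer call.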

#### Loading the dataset 
##### *containing details of SKUs (Stock Keeping Units) from an outdoor apparel brand*

In [2]:
df = pd.read_csv("../input/product-item-data/sample-data.csv")
df.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


In [3]:
print(f"The Dataframe has {df.shape[0]} products with their description")

The Dataframe has 500 products with their description


#### TfidfVectorizer
##### *Converting Text description to vector form*

In [4]:
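# min_df=1 keeps every term, as min_df=0 did; recent scikit-learn releases reject an integer min_df below 1
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')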
tfidf_matrix = tf.fit_transform(df.description)


In [5]:
tfidf_matrix
#500 products are represented as rows over 52262 features (unique 1- to 3-word terms) in our dataset

<500x52262 sparse matrix of type '<class 'numpy.float64'>'
	with 148989 stored elements in Compressed Sparse Row format>
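
To see why 500 short descriptions produce so many features, it may help to look at what `ngram_range=(1,3)` generates on a tiny input. The sketch below uses two hypothetical descriptions and omits `stop_words` for simplicity; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions, not rows from sample-data.csv
docs = ["organic cotton tee", "organic cotton tank"]

vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
matrix = vec.fit_transform(docs)

# vocabulary_ maps every learned term (1-, 2- and 3-word) to a column index
features = sorted(vec.vocabulary_)
for term in features:
    print(term)

print(matrix.shape)  # one row per document, one column per term
```

Each three-word description contributes three unigrams, two bigrams and one trigram, so the feature count grows much faster than the number of distinct words.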

##### Creating Cosine Similarity Matrix using linear_kernel

In [6]:
cosine_similarities = linear_kernel(tfidf_matrix,tfidf_matrix) #similarity of every product with every other product (500x500 matrix)

In [7]:
results = {}
for idx, row in df.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-12:-1] #indices of the 11 highest scores, best first: the product itself plus its top 10 neighbours
    similar_items = [(cosine_similarities[idx][i], df['id'][i]) for i in similar_indices]

    results[row['id']] = similar_items[1:] #drop the first entry, which is the product itself

In [8]:
list(results.items())[:2]

[(1,
  [(0.22037921472617467, 19),
   (0.16938950913002365, 494),
   (0.16769458065321555, 18),
   (0.1648552774562297, 172),
   (0.1481261546058637, 442),
   (0.14577863284367548, 171),
   (0.14137642365361247, 21),
   (0.13884463426216961, 495),
   (0.1387953333136303, 25),
   (0.13813550299091382, 496)]),
 (2,
  [(0.41816639921615745, 3),
   (0.11546382098627585, 19),
   (0.11303392245400211, 494),
   (0.11247854521091623, 300),
   (0.11147017924424411, 299),
   (0.10110641701157388, 1),
   (0.09912196647155447, 318),
   (0.0882901384989091, 155),
   (0.08822174844892261, 214),
   (0.08731039686442463, 301)])]

*For products 1 and 2, the lists above give the 10 most similar products,  
wherein the first element of each tuple is the **cosine similarity** value and  
the second element is the **ID of the similar product** (the `id` column) in the dataset*
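
The slice `[:-12:-1]` used to build these lists is compact but easy to misread. The following sketch shows the same pattern on a hypothetical four-element similarity row, with `[:-3:-1]` standing in for `[:-12:-1]`.

```python
import numpy as np

# Hypothetical similarity row for the item at position 1 (self-similarity 1.0)
sims = np.array([0.1, 1.0, 0.4, 0.3])

order = sims.argsort()   # ascending positions: [0, 3, 2, 1]
top2 = order[:-3:-1]     # walk backwards from the end, stop before position -3
neighbours = top2[1:]    # drop the first entry, the item itself

print(top2.tolist())
print(neighbours.tolist())

# The notebook's slice keeps 11 entries, so 10 remain after dropping the first
n_kept = len(np.arange(500).argsort()[:-12:-1])
print(n_kept)
```

Dropping the first entry assumes the item ranks highest against itself; if two products had identical descriptions, the tie could in principle place the other product first.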


In [9]:
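def item(product_id):
    # look up the row by id and keep the text before the first hyphen, i.e. the product name
    return df.loc[df['id'] == product_id]['description'].tolist()[0].split('-')[0]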

* *Given a product ID, the method **item** filters the data to the row with that ID*
* *From that row we extract the text in the description column*
* *Each description has the form "name - details", so we split it on the hyphen and keep the first segment, which is the product name shown in the recommendations. A name that itself contains a hyphen is cut at that hyphen, which is likely why "L/s gravi" appears truncated in the final output*
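
One possible way to keep hyphenated names intact is to split on the separator with its surrounding spaces and only once. This is a sketch, not the notebook's method: `product_name` is a hypothetical helper, and the second sample string is invented for illustration.

```python
def product_name(description: str) -> str:
    """Return the text before the first ' - ' separator, without padding."""
    return description.split(" - ", 1)[0].strip()

# Prefix of a description as shown in df.head()
plain = "Alpine wind jkt - On high ridges"
# Invented example of a name that contains a hyphen
hyphenated = "Quick-dry tee - hypothetical description text"

print(repr(plain.split("-")[0]))       # bare hyphen split keeps a trailing space
print(repr(hyphenated.split("-")[0]))  # bare hyphen split cuts the name short
print(repr(product_name(plain)))
print(repr(product_name(hyphenated)))
```

This assumes every description uses " - " between name and details, which holds for the rows shown in `df.head()` but has not been verified for the whole file.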

In [10]:
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + '...')
    print("\nRecommended products are: ")
    recs  = results[item_id][:num]
    for rec in recs:
        print(item(rec[1]) + " (score: " + str(rec[0]) + ")")      

*recs holds the top **num** entries from results for the given item id*  
*The for loop prints the name and **score** of each of these most similar products*

In [11]:
recommend(item_id = 250, num = 10)

Recommending 10 products similar to Simply organic tee ...

Recommended products are: 
Simply organic top  (score: 0.44072300310060447)
Simply organic tank  (score: 0.2169017842835122)
Simply organic polo  (score: 0.19120465271886505)
S/s squeaky clean polo shirt  (score: 0.17320243767409224)
Reversible phone home  (score: 0.15500114752753574)
Girl's cotton tank dress  (score: 0.13539951140855766)
Island hemp dress  (score: 0.1154825799858559)
Astrid tank  (score: 0.10849430382894588)
Organic cotton under tee  (score: 0.10570552176235502)
L/s gravi (score: 0.10263228049277556)


##### *The products are recommended based on the content of a particular product*