## E-COMMERCE RECOMMENDER SYSTEM

## MODELLING: Part 3

The objective of this notebook is to develop a content-based approach that uses item features to address the cold-start issue, which was not addressed by the previous models.
We will assume that the most important features for personalization are category (dresses, sweatpants, etc.), model_attr and size, and we will focus on these.
Item similarity will be measured using cosine similarity.

We will also explore a hybrid model that combines the recommendations from our SVD model with those from the model developed here.

### Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
raw_data = pd.read_csv('/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/raw/df_modcloth.csv')

In [9]:
raw_data.head()

Unnamed: 0,item_id,user_id,rating,timestamp,size,fit,user_attr,model_attr,category,brand,year,split
0,7443,Alex,4,2010-01-21 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
1,7443,carolyn.agan,3,2010-01-27 08:00:00+00:00,,,,Small,Dresses,,2012,0
2,7443,Robyn,4,2010-01-29 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
3,7443,De,4,2010-02-13 08:00:00+00:00,,,,Small,Dresses,,2012,0
4,7443,tasha,4,2010-02-18 08:00:00+00:00,,,Small,Small,Dresses,,2012,0


## Extracting features of importance

In [10]:
# Defining a function that will calculate the similarity of an item to the other items
# and return the best 15


In [11]:
# The text vectorizer cannot handle NaN, so we convert missing values to empty strings before processing
data = raw_data.fillna('')

In [12]:
# Here we create a new column that aggregates the text from the three feature
# columns that we are interested in
data['aggr'] = data['size'].map(str) + ' '+ data['model_attr'].map(str) + ' '+ data['category'].map(str)

## Modelling

In [13]:
# WARNING: subsetting a sample because the cosine similarity cannot compute on the total dataset
test = data.head(20)
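
One way to shrink the similarity matrix, sketched here as an assumption rather than as what this notebook does, is to keep a single row per item before vectorizing. The data holds one row per rating, so the number of unique items is smaller than the number of rows. The toy values below are hypothetical, and keeping the first row per item means per-rating fields such as size simply follow that first row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy interaction log (hypothetical values): several rating rows per item
toy = pd.DataFrame({
    'item_id': [7443, 7443, 7443, 11960, 11960, 20000],
    'aggr': [' Small Dresses', ' Small Dresses', ' Small Dresses',
             ' Small&Large Outerwear', ' Small&Large Outerwear',
             ' Small Bottoms'],
})

# Keep the first row of each item so the matrix is items x items,
# not ratings x ratings
items = toy.drop_duplicates(subset='item_id').reset_index(drop=True)

item_matrix = TfidfVectorizer().fit_transform(items['aggr'])
item_sim = cosine_similarity(item_matrix)

print(len(toy), 'rating rows ->', len(items), 'unique items')
print(item_sim.shape)
```

The similarity matrix grows with the square of the number of rows, so deduplicating first reduces its size considerably when items have many ratings.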

In [14]:
# Instantiating the vectorizer and fitting the text
vect = TfidfVectorizer()
matrix = vect.fit_transform(test['aggr'])

In [15]:
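# printing the features that will go into our matrix
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(vect.get_feature_names_out().tolist())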

['dresses', 'large', 'outerwear', 'small']
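
The feature list depends on how `TfidfVectorizer` tokenizes the aggregated text. With its default settings it lowercases the text and keeps only tokens of two or more word characters, so a single-character size value would not become a feature. The sketch below illustrates this on two hypothetical `aggr` strings; the size value `4` and the alternative `token_pattern` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical aggregated strings: "<size> <model_attr> <category>"
docs = ['4 Small&Large Outerwear', ' Small Dresses']

# Default tokenizer: lowercases and drops single-character tokens
default_vect = TfidfVectorizer()
default_vect.fit(docs)
default_vocab = sorted(default_vect.vocabulary_)

# Alternative pattern that also keeps single-character tokens
keep_short_vect = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_short_vect.fit(docs)
short_vocab = sorted(keep_short_vect.vocabulary_)

print(default_vocab)
print(short_vocab)
```

If size is meant to influence similarity, it may be worth checking that its values actually appear in the vocabulary.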


In [16]:
# checking what the matrix looks like
print(matrix.toarray())

[[0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.         0.68757698 0.68757698 0.23340053]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.73996547 0.         0.         0.67264485]
 [0.         0.68757698 0.68757698 0.23340053]]


In [17]:
# calculating the similarity
cosine_sim = cosine_similarity(matrix, matrix)
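
To make the similarity values concrete, here is a small sketch that takes the two distinct TF-IDF rows printed above and computes their cosine similarity both by hand and with scikit-learn. The variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# The two distinct rows of the TF-IDF matrix printed above
# (columns: dresses, large, outerwear, small)
dress_row = np.array([0.73996547, 0.0, 0.0, 0.67264485])
outerwear_row = np.array([0.0, 0.68757698, 0.68757698, 0.23340053])

# Cosine similarity = dot product divided by the product of the norms
manual = dress_row @ outerwear_row / (
    np.linalg.norm(dress_row) * np.linalg.norm(outerwear_row)
)

# Same computation through scikit-learn (expects 2D inputs)
sk = cosine_similarity([dress_row], [outerwear_row])[0, 0]

print(round(manual, 3), round(sk, 3))
```

The two rows only share the `small` token, so their similarity is low (about 0.157), while identical rows have a similarity of 1.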

In [18]:
# This cell and the following ones build the lookup and the function used to retrieve recommendations
test2 = test['item_id'].unique()
test2 = pd.DataFrame(test2, columns = ['item_id'])
test2

Unnamed: 0,item_id
0,7443
1,11960


In [19]:
# test2.reset_index()

In [20]:
indices = pd.Series(test2.index, index=test2['item_id'])

In [21]:
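# Defining a function to get the top recommendations for an item based on
# its similarity to the other items

def recommend_items(item_id, cosine_sim):
    # position of the item in the lookup built above
    index = indices[item_id]
    # cosine_sim has one row per row of `test`, so the positions below are row positions
    sim_scores = list(enumerate(cosine_sim[index]))
    # sort by similarity score, highest first
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    # keep the 16 highest-scoring rows
    sim_scores = sim_scores[:16]
    item_indexes = [i[0] for i in sim_scores]
    return raw_data['item_id'].iloc[item_indexes]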

In [22]:
# checking a sample recommendation
recommend_items(7443, cosine_sim)

0     7443
1     7443
2     7443
3     7443
4     7443
5     7443
6     7443
7     7443
8     7443
9     7443
10    7443
12    7443
13    7443
14    7443
15    7443
16    7443
Name: item_id, dtype: int64
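
The output above lists the same item several times because similarity is computed between rating rows. A possible item-level variant is sketched below, assuming a similarity matrix with one row per unique item. The function name, the item id `20000` and the similarity values are hypothetical; this is an illustration and not the function used in this notebook.

```python
import numpy as np
import pandas as pd

def recommend_similar_items(item_id, item_ids, sim_matrix, top_n=15):
    """Return up to top_n other items, ranked by similarity to item_id."""
    # row of the similarity matrix that belongs to the queried item
    idx = list(item_ids).index(item_id)
    # label each score with its item id and drop the queried item itself
    scores = pd.Series(sim_matrix[idx], index=item_ids).drop(item_id)
    return scores.sort_values(ascending=False).head(top_n)

# Hypothetical item-level similarity matrix (one row and column per item)
item_ids = [7443, 11960, 20000]
sim = np.array([[1.00, 0.16, 0.50],
                [0.16, 1.00, 0.10],
                [0.50, 0.10, 1.00]])

recs = recommend_similar_items(7443, item_ids, sim, top_n=2)
print(recs)
```

Dropping the queried item before ranking means the result contains only other items, each listed once.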

The content-based approach gives us the expected results, but it is not well suited to recommending items to users with many interactions, which is where our previous SVD algorithm performed better.
Below we attempt to build an ensemble method that takes the recommendations from both algorithms, counts how many times each item has been recommended, and returns a final list ranked by that count.

In [23]:
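# Defining the function described above, which leverages both the SVD predictions and the content-based results
# NOTE: make_recommendations and predictions come from the previous (SVD) notebook and are not defined here;
# we assume make_recommendations returns an iterable of item ids
def hybrid_recommendations(item_id):
    CB_rec = recommend_items(item_id, cosine_sim)
    CF_rec = make_recommendations(predictions, 10)
    # concatenate the two lists (using `+` on a pandas Series would add element-wise)
    all_rec = list(CB_rec) + list(CF_rec)
    all_rec_df = pd.DataFrame(all_rec, columns = ['item_id']).reset_index()
    # count how many times each item was recommended and rank by that count
    scoring = all_rec_df.groupby('item_id').count().sort_values(by = 'index', ascending = False)
    return scoring[:10]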
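
The counting step can be tried in isolation. The sketch below uses two hypothetical lists of item ids in place of the content-based and SVD outputs, and removes duplicates within each list first so that an item gets at most one vote per recommender. Both the function name and that deduplication step are assumptions, not part of the notebook's method.

```python
import pandas as pd

def rank_by_votes(cb_items, cf_items, top_n=10):
    """Rank items by how many of the two recommenders suggested them."""
    # drop duplicates within each list while keeping the original order
    cb_unique = list(dict.fromkeys(cb_items))
    cf_unique = list(dict.fromkeys(cf_items))
    all_rec = pd.Series(cb_unique + cf_unique, name='item_id')
    # value_counts sorts by count, highest first
    return all_rec.value_counts().head(top_n)

# Hypothetical outputs of the two recommenders
cb = [7443, 11960, 20000]
cf = [11960, 30000, 11960]

ranking = rank_by_votes(cb, cf)
print(ranking)
```

With only two recommenders the vote count is either 1 or 2, so many items tie; a weighted score could break the ties if a finer ranking is needed.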