## E-COMMERCE RECOMMENDER SYSTEM

## MODELLING: Part 3 (Content based)

The objective of this notebook is to develop a content-based approach that uses item features to address the cold-start problem, which the previous models did not handle.
We assume the most important features that can be personalized are category (dresses, sweatpants, etc.), model_attr and size, and we focus on these.
Item similarity is measured as the cosine similarity between the items' TF-IDF vectors.
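
To make the similarity measure concrete, here is a minimal sketch of cosine similarity on two toy vectors. The helper name `cosine_sim_vectors` and the three-word vocabulary are illustrative assumptions, not part of the notebook, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_sim_vectors(a, b):
    """Cosine similarity between two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy bag-of-words vectors over the assumed vocabulary [small, dresses, tops]
small_dresses = [1, 1, 0]
small_tops = [1, 0, 1]

similarity = cosine_sim_vectors(small_dresses, small_tops)
distance = 1 - similarity  # cosine distance is one minus the similarity
print(round(similarity, 3), round(distance, 3))  # 0.5 0.5
```

The two items share one of their two terms, so the similarity is 0.5. Identical vectors score 1 and vectors with no shared terms score 0.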

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
raw_data = pd.read_csv('/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/processed/features.csv')

In [3]:
ratings = pd.read_csv('/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/processed/ratings.csv')

In [4]:
raw_data['item_id'] = ratings['item_id']
raw_data['user_id'] = ratings['user_id']
raw_data.shape

(99892, 9)

In [5]:
data = raw_data.drop_duplicates(subset = ['item_id'], keep = 'first')
data.shape

(1020, 9)

## Extracting features of importance

In [6]:
# Create a new column that aggregates the text from the three feature
# columns we are interested in. assign() returns a new DataFrame, which
# avoids pandas' SettingWithCopyWarning on the de-duplicated slice.
data = data.assign(
    aggr=data['size'].map(str) + ' ' + data['model_attr'].map(str) + ' ' + data['category'].map(str)
)


In [7]:
data.head()

| | size | fit | user_attr | model_attr | category | year | split | item_id | user_id | aggr |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | Just right | Small | Small | Dresses | 2012 | 0 | 7443 | Alex | 2.0 Small Dresses |
| 11 | 2.0 | Just right | Small | Small&Large | Outerwear | 2010 | 0 | 11960 | bcornwell | 2.0 Small&Large Outerwear |
| 50 | 2.0 | Just right | Small | Small | Dresses | 2011 | 0 | 16411 | Candice | 2.0 Small Dresses |
| 280 | 2.0 | Just right | Small | Small | Bottoms | 2013 | 1 | 21296 | Petra | 2.0 Small Bottoms |
| 290 | 2.0 | Slightly large | Small | Small | Tops | 2014 | 0 | 22563 | lexaplex | 2.0 Small Tops |


## Modelling

In [8]:
# Instantiating the vectorizer and fitting the text
vect = TfidfVectorizer()
matrix = vect.fit_transform(data['aggr'])

In [9]:
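# Print the features (vocabulary terms) that go into our matrix.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vect.get_feature_names_out().tolist())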

['bottoms', 'dresses', 'large', 'outerwear', 'small', 'tops']
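
Note that no size token appears in the feature list above. A likely explanation is that scikit-learn's default `token_pattern` keeps only tokens of two or more word characters, so a value such as `2.0` is split at the dot into `2` and `0`, and both are discarded. The sketch below reproduces that tokenization with the standard `re` module. `encode_size` is a hypothetical helper showing one possible way to keep size as a single token; it is not part of the notebook.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    """Mimic the default lowercasing and regex tokenization step."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

def encode_size(size):
    """Hypothetical helper: turn a numeric size into one word-like token."""
    return "size_" + str(size).replace(".", "_")

print(default_tokens("2.0 Small&Large Outerwear"))
# ['small', 'large', 'outerwear']
print(default_tokens(encode_size(2.0) + " Small Dresses"))
# ['size_2_0', 'small', 'dresses']
```

If size should contribute to the similarity, building the `aggr` text with such an encoded token (or passing a custom `token_pattern` to `TfidfVectorizer`) would be one option to evaluate.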


In [10]:
# Calculate the pairwise cosine similarity between all items
cosine_sim = cosine_similarity(matrix, matrix)

In [11]:
indices = pd.Series(data.index, index=data['item_id'])
indices

item_id
7443          0
11960        11
16411        50
21296       280
22563       290
          ...  
153853    82082
153866    82317
149062    85272
54062     95190
153228    97944
Length: 1020, dtype: int64

In [12]:
# Define a function that returns the top recommendations for an item,
# based on its similarity to other items

def recommend_items(item_id, cosine_sim):
    index = indices[item_id]
    sim_scores = list(enumerate(cosine_sim[index]))
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    sim_scores = sim_scores[:16]
    item_indexes = [i[0] for i in sim_scores]
    return data['item_id'].iloc[item_indexes]

In [13]:
# checking a sample recommendation
recommend_items(16411, cosine_sim)

290       22563
319       24853
2135      47397
4058      67022
4798      67507
4844      70230
4973      71434
6441      78227
8937      84436
13744    108260
16388    113643
20836    116736
21870    117620
26581    122266
34975    128359
36565    129267
Name: item_id, dtype: int64
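
One thing worth checking in `recommend_items`: `indices` stores DataFrame index labels (for example, 50 for item 16411), whereas the rows of `cosine_sim` follow the positional order of `data` (0 to 1019). After `drop_duplicates` the two no longer coincide, so a label can select another item's similarity row. This may explain why the query above, a Dresses item, returns the Tops item 22563 first. Below is a minimal sketch of a position-based lookup that also skips the query item itself. The names `recommend_by_position`, `item_ids` and `sim_matrix`, and the similarity values, are illustrative assumptions rather than the notebook's data.

```python
def recommend_by_position(item_id, item_ids, sim_matrix, top_n=15):
    """Return the top_n item ids most similar to item_id.

    item_ids must be in the same order as the rows of sim_matrix.
    """
    # Map each item id to its row position in the similarity matrix
    position_of = {iid: pos for pos, iid in enumerate(item_ids)}
    query_pos = position_of[item_id]

    # Pair every other position with its similarity to the query item
    scores = [
        (pos, score)
        for pos, score in enumerate(sim_matrix[query_pos])
        if pos != query_pos
    ]
    # Highest similarity first; sorted() is stable, so ties keep row order
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [item_ids[pos] for pos, _ in scores[:top_n]]

# Toy data: four items and a symmetric similarity matrix (made-up values)
item_ids = [7443, 11960, 16411, 21296]
sim_matrix = [
    [1.0, 0.2, 1.0, 0.5],
    [0.2, 1.0, 0.2, 0.1],
    [1.0, 0.2, 1.0, 0.5],
    [0.5, 0.1, 0.5, 1.0],
]

print(recommend_by_position(16411, item_ids, sim_matrix, top_n=2))
# [7443, 21296]
```

With the notebook's objects, an equivalent positional mapping could presumably be built as `pd.Series(range(len(data)), index=data['item_id'])`, or by calling `data.reset_index(drop=True)` before computing `indices`.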

In [14]:
data.to_csv(r'/Users/judith/Data_science_projects/Springboard_AssignmentsJY/capstone_three/data/processed/content_base_features.csv', index=False)

The content-based approach gives us the expected results and complements the collaborative approach nicely. We can run both algorithms in parallel and combine their recommendations, so that suggestions can be shown to returning customers as well as to new customers and for new products.
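
As a rough illustration of how the two approaches might be combined, the sketch below falls back to content-based suggestions for unknown users and merges both lists for known ones. Everything here is hypothetical: `hybrid_recommend`, `collaborative_fn`, `content_fn` and `known_users` are placeholder names, and the merge rule (collaborative results first, duplicates removed) is just one possible design.

```python
def hybrid_recommend(user_id, seed_item_id, known_users,
                     collaborative_fn, content_fn, top_n=5):
    """Combine two recommenders (hypothetical interface).

    collaborative_fn(user_id) and content_fn(item_id) each return a
    ranked list of item ids.
    """
    content = list(content_fn(seed_item_id))
    if user_id not in known_users:
        # Cold start: no interaction history, rely on item features only
        return content[:top_n]

    merged = []
    # Collaborative results come first, then content-based ones
    for item in list(collaborative_fn(user_id)) + content:
        if item not in merged:
            merged.append(item)
    return merged[:top_n]

# Toy stand-ins for the two models (made-up item ids)
collab = lambda user_id: [101, 102, 103]
content = lambda item_id: [102, 201, 202]

print(hybrid_recommend("new_user", 16411, {"Alex"}, collab, content, top_n=3))
# [102, 201, 202]
print(hybrid_recommend("Alex", 16411, {"Alex"}, collab, content, top_n=4))
# [101, 102, 103, 201]
```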