<i>This notebook is made by Haozhe TANG on 13, July for the final project of Recommender System.</i>

# Library Importing and Data Importing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
pip install scikit-surprise



In [3]:
#Basic library
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt

#Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

#Surprise
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split as tts

In [4]:
df_items = pd.read_csv('/content/drive/MyDrive/Recommender System/Datasets/beauty_amazon_items - beauty_amazon_items.csv', encoding="utf8")
df_iter = pd.read_csv('/content/drive/MyDrive/Recommender System/Datasets/beauty_amazon_reviews - beauty_amazon_reviews.csv',encoding="utf8")

In [5]:
df_items.head()

Unnamed: 0,Description,Title,Brand,Price,ItemId
0,Loud 'N Clear Personal Sound Amplifier allows ...,Loud 'N Clear&trade; Personal Sound Amplifier,idea village,,P4924
1,No7 Lift & Luminate Triple Action Serum 50ml b...,No7 Lift &amp; Luminate Triple Action Serum 50...,,$44.99,P4622
2,No7 Stay Perfect Foundation now stays perfect ...,No7 Stay Perfect Foundation Cool Vanilla by No7,No7,$28.76,P6435
3,,Wella Koleston Perfect Hair Colour 44/44 Mediu...,,,P4623
4,Lacto Calamine Skin Balance Daily Nourishing L...,Lacto Calamine Skin Balance Oil control 120 ml...,Pirmal Healthcare,$12.15,P7


In [6]:
df_iter.head()

Unnamed: 0,Rating,Time,UserId,ItemId,Review,Summary
0,1,"02 19, 2015",U0,P0,great,One Star
1,4,"12 18, 2014",U1,P0,My husband wanted to reading about the Negro ...,... to reading about the Negro Baseball and th...
2,4,"08 10, 2014",U2,P0,"This book was very informative, covering all a...",Worth the Read
3,5,"03 11, 2013",U3,P0,I am already a baseball fan and knew a bit abo...,Good Read
4,5,"12 25, 2011",U4,P0,This was a good story of the Black leagues. I ...,"More than facts, a good story read!"


# 1. Data Exploration

## 1.1 Item dataset

In [7]:
#Overview
df_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32890 entries, 0 to 32889
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Description  14581 non-null  object
 1   Title        32889 non-null  object
 2   Brand        17217 non-null  object
 3   Price        11459 non-null  object
 4   ItemId       32890 non-null  object
dtypes: object(5)
memory usage: 1.3+ MB


### 1.1.1 Missing value

In [8]:
#Missing value
df_items.isnull().sum()

Description    18309
Title              1
Brand          15673
Price          21431
ItemId             0
dtype: int64

In [9]:
df_items[df_items['Title'].isnull() == True]

Unnamed: 0,Description,Title,Brand,Price,ItemId
27016,,,BCW,$2.89 - $13.99,P3733


<p>In this project, only ItemId and Title is important. So we delete the row with missing Title.</p> <br/>
<b> However, if we delete the item, we have to delete all the interactions according to this item.</b>

### 1.1.2 Duplicate check

In [10]:
# Duplicates
df_items.duplicated().sum()

404

In [11]:
df_items[df_items['Description'] == 'Ultra (Box of 10) Corn Plane Blades']

Unnamed: 0,Description,Title,Brand,Price,ItemId
424,Ultra (Box of 10) Corn Plane Blades,Ultra (Box of 10) Corn Plane Blades,Ultra,$8.00,P4821
828,Ultra (Box of 10) Corn Plane Blades,Ultra (Box of 10) Corn Plane Blades,Ultra,$8.00,P4821


It varified the duplicates really exist!

## 1.2 Iteraction dataset

### 1.2.1 Missing value

In [12]:
#Missing value
df_iter.isnull().sum()

Rating       0
Time         0
UserId       0
ItemId       0
Review     399
Summary    213
dtype: int64

In [13]:
df_iter[df_iter['ItemId'].isnull() == True]

Unnamed: 0,Rating,Time,UserId,ItemId,Review,Summary


<p> In this case, we delete this single row because of the lack of ItemId.</p>

### 1.2.2 Duplicate check

In [14]:
# Duplicates
df_iter.duplicated().sum()

8726

In [15]:
df_iter[(df_iter['UserId'] == 'U12281') & (df_iter['ItemId'] == 'P62')]

Unnamed: 0,Rating,Time,UserId,ItemId,Review,Summary
12330,5,"03 09, 2016",U12281,P62,excellent,Five Stars
12331,5,"03 09, 2016",U12281,P62,excellent,Five Stars


There are 8717 duplicates and I will delete the **second** duplicates and remain the **first** ones.

# 2. Data Preprocessing

## 2.1 Missing values

In [16]:
# Delete the row of item dataset.
df_items_nomissing = df_items.dropna(subset=['Title'])

In [17]:
# Delete all rows with the ItemId of 'P3733' of interaction dataset.
df_iter_nomissing = df_iter[df_iter['ItemId'] != 'P3733']

In [18]:
#Delete all the missing value in iteraction dataset
df_iter_nomissing = df_iter_nomissing.dropna(subset=['ItemId'])

## 2.2 Duplicates

In [19]:
df_items.head()

Unnamed: 0,Description,Title,Brand,Price,ItemId
0,Loud 'N Clear Personal Sound Amplifier allows ...,Loud 'N Clear&trade; Personal Sound Amplifier,idea village,,P4924
1,No7 Lift & Luminate Triple Action Serum 50ml b...,No7 Lift &amp; Luminate Triple Action Serum 50...,,$44.99,P4622
2,No7 Stay Perfect Foundation now stays perfect ...,No7 Stay Perfect Foundation Cool Vanilla by No7,No7,$28.76,P6435
3,,Wella Koleston Perfect Hair Colour 44/44 Mediu...,,,P4623
4,Lacto Calamine Skin Balance Daily Nourishing L...,Lacto Calamine Skin Balance Oil control 120 ml...,Pirmal Healthcare,$12.15,P7


In [20]:
#Drop the duplicates of items dataset and keep the first occurance.
df_items_no_duplicates = df_items_nomissing.drop_duplicates(inplace=False, keep="first", ignore_index=True)

In [21]:
#Drop the duplicates of iteraction dataset and keep the first occurance.
df_iter_no_duplicates = df_iter_nomissing.drop_duplicates(inplace=False, keep='first', ignore_index=True)

## 2.3 Data modifying

In [22]:
print(f'Length of items dataset:', len(df_items_no_duplicates))
print(f'Length of dataset has ItemId starting with "P":', len(df_items_no_duplicates['ItemId'][df_items_no_duplicates['ItemId'].str.startswith('P')]))
print('\n')
print(f'Length of iteraction dataset:', len(df_iter_no_duplicates))
print(f'Length of dataset has ItemId starting with "P":', len(df_iter_no_duplicates['ItemId'][df_iter_no_duplicates['ItemId'].str.startswith('P')]))


Length of items dataset: 32485
Length of dataset has ItemId starting with "P": 32485


Length of iteraction dataset: 362603
Length of dataset has ItemId starting with "P": 362603


I discovered that the Id of items all start with "P", some we can represent all ItemId with "P%". <br/> <br/>
In this case, we can modify all the ItemId from strings to numbers. <br/> <br/>
The same case for users.

In [23]:
#Change the "ItemId" column for item and iter dataset.
df_items_no_duplicates.loc[:, 'ItemId'] = df_items_no_duplicates.loc[:, 'ItemId'].str.replace('P', '', regex=True)
df_iter_no_duplicates.loc[:, 'ItemId'] = df_iter_no_duplicates.loc[:, 'ItemId'].str.replace('P', '', regex=True)

#Change the "UserId" column for iter dataset.
df_iter_no_duplicates.loc[:, 'UserId'] = df_iter_no_duplicates.loc[:, 'UserId'].str.replace('U', '', regex=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_items_no_duplicates.loc[:, 'ItemId'] = df_items_no_duplicates.loc[:, 'ItemId'].str.replace('P', '', regex=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_iter_no_duplicates.loc[:, 'ItemId'] = df_iter_no_duplicates.loc[:, 'ItemId'].str.replace('P', '', regex=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-

In [24]:
#Then we have to transfer the types of ItemId and UserId from object to int.
df_items_no_duplicates.loc[:, 'ItemId'] = df_items_no_duplicates.loc[:, 'ItemId'].astype(int)

df_iter_no_duplicates.loc[:, 'UserId'] = df_iter_no_duplicates.loc[:, 'UserId'].astype(int)
df_iter_no_duplicates.loc[:, 'ItemId'] = df_iter_no_duplicates.loc[:, 'ItemId'].astype(int)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_items_no_duplicates.loc[:, 'ItemId'] = df_items_no_duplicates.loc[:, 'ItemId'].astype(int)
  df_items_no_duplicates.loc[:, 'ItemId'] = df_items_no_duplicates.loc[:, 'ItemId'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_iter_no_duplicates.loc[:, 'UserId'] = df_iter_no_duplicates.loc[:, 'UserId'].astype(int)
  df_iter_no_duplicates.loc[:, 'UserId'] = df_iter_no_duplicates.loc[:, 'UserId'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,

## 2.4 Data Splitting

In this case, we will split the iteration dataset into three parts, and the size of these three parts are as follows:
 - 70% training set;
 - 30% testing set.

First, we have to transfer the data type of the column **"Time"** to **"Datetime"**;
Then, reorder the dataset according to the **iteraction time**.

In [25]:
#Change data type.
df_iter_no_duplicates.loc[:, ('Time')] = pd.to_datetime(df_iter_no_duplicates.loc[:, ('Time')])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_iter_no_duplicates.loc[:, ('Time')] = pd.to_datetime(df_iter_no_duplicates.loc[:, ('Time')])
  df_iter_no_duplicates.loc[:, ('Time')] = pd.to_datetime(df_iter_no_duplicates.loc[:, ('Time')])


In [26]:
#Sort the data type according to the Time column
df_iter_final = df_iter_no_duplicates.sort_values('Time', ascending=True, ignore_index=True)
df_items_final = df_items_no_duplicates.copy()

In [27]:
# Data Splitting on df_iter_final
X, y = df_iter_final.loc[:, ('Rating', 'Time', 'UserId', 'ItemId', 'Review')][:50001], df_iter_final.loc[:, 'ItemId'][:50001]

# Split for training and other datasets.
X_train, X_test, _, _ = train_test_split(X, y, train_size=0.7, shuffle=False)

In [28]:
X

Unnamed: 0,Rating,Time,UserId,ItemId,Review
0,5,2000-01-10,220995,4614,"M (company) is about real people, and real exp..."
1,5,2000-05-06,225997,5365,This calender really is great. In addition to...
2,5,2000-06-03,225996,5365,This calender is brilliant and has plenty of g...
3,5,2000-10-29,221070,4630,"Very good shaver, I have always purchased Nore..."
4,5,2000-11-12,5298,16,Talk about a smooth shave. The blades are clo...
...,...,...,...,...,...
49996,4,2013-08-25,35295,1358,this product functions admirably. it is slight...
49997,5,2013-08-25,42261,277,I can definitely tell this thing works! The on...
49998,4,2013-08-25,35297,1358,Waterpikest in the businesst. Does what its de...
49999,5,2013-08-25,82546,885,I have used this product since I was in my tee...


In [29]:
# Split for training and other datasets.
X_train, X_test, _, _ = train_test_split(X, y, train_size=0.7, shuffle=False)

# 3. Recommendation Models

<h3>Datasets introduction:</h3>
 <li> Final processed datasets: df_items_final for items dataset and df_iter_final for iteraction dataset;</li>
 <li> X_train(val/test): splitted useful information datasets from iteraction dataset;</li>
 <li> y_train(val/test): recommendation validates datasets.</li>

In [30]:
# User chosen
# Filter out the users exist in both datasets (train and test datasets.), it means that these uses are not new user, also not buying anymore.
common_user = set(X_train['UserId']) & set(X_test['UserId']) #549
X_train_common = X_train[X_train['UserId'].isin(common_user)]

#Filter out the users have cold start problem. (Not exist in training set but exist in testing set.)

cs_user = set(X_test['UserId']) - set(X_train['UserId']) #27012
X_train_cs = X_test[X_test['UserId'].isin(cs_user)]

#Total: 87411
#Except the users not exist in common and cold start groups: 27561

In [31]:
len(cs_user)

12774

In [32]:
len(common_user)

223

In [33]:
len(common_user | cs_user)

12997

In [34]:
len(set(X_train['UserId']) | set(X_test['UserId']))

43119

## 3.1 Popularity model

In this model, we plan to **calculate the popularity of all items** and **rank them** according to the **popularity (average rating it reveived)**. <br/> <br/>
Then, we should form **personalized recommendation list** for each user (**exclude the objects they bought befor**). Then recommend them **the most popular product(s)**.

<h3>Important: here we only consider the average of items being purchased over 500 times.</h3>

In [35]:
# Select out the popular items dataset
# Select the Items being purchased over 500 times
a1_purchased_count = (X_train['ItemId'].value_counts())[X_train['ItemId'].value_counts()>=500]
a1_filter_items = a1_purchased_count.index.to_list()
a1_filtered_dataset = X_train[X_train['ItemId'].isin(a1_filter_items)]

In [36]:
#Calculate the average scores of each item and rank them. Here we get the recommendation list globally.
a1_glo_recommend_dataset = a1_filtered_dataset.groupby('ItemId')['Rating'].mean().sort_values(ascending=False)
a1_glo_recommend_items = a1_glo_recommend_dataset.index.to_series()

In [37]:
#Then according to each user's historical behaviors, we exclude all the items have bought by each user, the recommend them the most popular item.
def a1_item_recommend(database, glo_recommend_items, user_id, n_top=1):
    #Find out the ItemId of all items user has bought before.
    items_purchased = database[database['UserId'] == user_id].ItemId.values
    #Filter out these items and return the recommendation list
    common_ele = glo_recommend_items[glo_recommend_items.isin(items_purchased)]
    recommendation = glo_recommend_items[~glo_recommend_items.isin(common_ele)].values[:n_top]
    return recommendation

In [38]:
# Run for users except neither common nor cold start group, get the dictionary of recommendation for each user.
a1_dict = dict()

for user_id in (common_user | cs_user):
  a1_recommendation = a1_item_recommend(X_train, a1_glo_recommend_items, user_id)
  a1_dict[user_id] = a1_recommendation

## 3.2 Content-based model

In [39]:
#Special processes for Nan values to desciption column in items dataset and to review column in X_train dataset.
df_items_final['Description'] = df_items_final['Description'].fillna('None')
X_train['Review'] = X_train['Review'].fillna('None')

In [40]:
#Create feature vectors according to the description of objects.
#Instantiating Objects
tfidf = TfidfVectorizer(lowercase=True, stop_words='english')
item_features = tfidf.fit_transform(df_items_final['Description'])


In [41]:
#Merge all reviews given by the same user.
user_reviews = X_train.groupby(by='UserId', axis=0, group_keys = False)['Review'].apply(''.join)

#Calculate user-profile by using the reviews of them.
# user_features = tfidf.transform(user_reviews)
# user_profile = pd.DataFrame(user_features.toarray())
user_profile = tfidf.transform(user_reviews)

In [42]:
#Calculate the similarity matrix between items and users.
al2_cos = cosine_similarity(user_profile, item_features)
np.fill_diagonal(al2_cos, 0)
al2_cos = pd.DataFrame(al2_cos)

In [61]:
# Exclude the items them bought
# And recommend user with item with the highest similarity score for them.
def a2_item_recommend(al2_cos, user_id, top_n=3):
  item_sim = al2_cos.iloc[user_id, :]
  recommendation = item_sim.sort_values(ascending=False).index[:top_n]
  return recommendation

In [62]:
#Create a dictionary to store the result of recommendation to each user.
a2_dict = dict()
for user_id in common_user:
  try:
    a2_recommendation = a2_item_recommend(al2_cos, user_id=user_id)
  except IndexError:
    continue
  a2_dict[user_id] = a2_recommendation

## 3.3 Collaborative filtering model

After analysing the structure of the two datasets: the amount of **user** is **much bigger** the amount of **item**.<br/><br/>
Thus, in this case, we plan to do **item-based memory** model.

In [45]:
#Create a pivot to set 'ItemId' as the index, 'UserId' as the column and "ratings" as values.
#Then fill all the Nan values with 0.
al3_pivot = X_train[['UserId', 'ItemId', 'Rating']].pivot_table(index='ItemId', columns='UserId', values='Rating').fillna(0).sort_index(axis=0, ascending=True)

In [46]:
#Calculate the similarities between items.
al3_cos = cosine_similarity(al3_pivot)
np.fill_diagonal(al3_cos, 0)
#Turn it into a dataframe.
al3_cos = pd.DataFrame(al3_cos)

In [47]:
al3_pivot.index

Int64Index([    0,     3,     5,     8,     9,    10,    11,    12,    13,
               14,
            ...
            25955, 26103, 26213, 26344, 26790, 27020, 27505, 27560, 27586,
            27602],
           dtype='int64', name='ItemId', length=3853)

In [48]:
#Algorithm to recommend products to users.
def a3_item_recommend(al3_pivot, al3_cos, user_id, top_n=3):
    #Get the items this user rated.
    rated_items = X_train[X_train['UserId'] == user_id]['ItemId'].tolist()
    #Calculate the similarity score between the items rated and other unrated items.
    item_scores = dict()
    for item in rated_items:
        sim_scores = al3_cos[item]
        rated_item_scores = al3_pivot[user_id]
        weighted_scores = sim_scores * rated_item_scores
        item_scores.update(zip(al3_pivot.index, weighted_scores))

    #Delete the items user has bought before.
    for item in rated_items:
        if item in item_scores:
            del item_scores[item]

    #Recommend the item to the user
    recommendation = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    return recommendation

In [49]:
a3_dict = dict()
for user_id in common_user:
  try:
    a3_recommendation = a3_item_recommend(al3_pivot, al3_cos, user_id)
  except ValueError:
    continue
  except KeyError:
    continue
  a3_dict[user_id] = a3_recommendation

In [50]:
a3_dict

{42496: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 40970: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 35854: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 37391: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 19982: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 62993: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 38418: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 12819: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 28180: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 42518: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 49175: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 19479: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 60444: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 46624: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 40993: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 42018: [(0, 0.0), (3, nan), (5, nan), (8, 0.0), (9, nan)],
 63009: [(0, 0.0), (3, nan), (5, nan), (

## 3.4 Model-based collaborative filtering model

In the model-based model, we will use **SVD Model** from surprise library.

In [51]:
#Create a new user-item rating matrix
reader = Reader(rating_scale=(1, 5))
a4_df_iter = Dataset.load_from_df(df_iter_final[['UserId', 'ItemId', 'Rating']][:50001], reader=reader)
trainset, testset =  tts(a4_df_iter, test_size=0.3)

#Train the SVD model
svd = SVD(n_factors=5, verbose=True)
svd.fit(trainset)


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x79e7538947f0>

In [67]:
# Recommendation function

def model_based_recommendation(user_id, top_n=3):
      rated_items = X_train[X_train['UserId'] == user_id]['ItemId'].tolist()
      unrated_items = [item_id for item_id in range(1, len(df_items_final['ItemId']) + 1) if item_id not in rated_items]
      # 获取前top_n个预测评分最高的物品作为推荐列表
      top_n_recommended = sorted(unrated_items, key=lambda item_id: svd.predict(user_id, item_id).est, reverse=True)[:top_n]
      return top_n_recommended

In [68]:
a4_dict = dict()
for user_id in common_user:
  a4_recommendation = model_based_recommendation(user_id)
  a4_dict[user_id] = a4_recommendation

# 4. Models comparison

## 4.1 Evaluation Function

In [54]:
# Validate and measure the recommendation accuracy for Algorithm 1:
# Idea: recommend only to users exist in both training and testing datasets and the cold start users.
def a1_evaluation(recommend_dict, common_user, cs_user, X_test):
  flag = 0
  for user_id in (common_user | cs_user):
    purchased_after = X_test['ItemId'][X_test['UserId'] == user_id]
    if set(recommend_dict[user_id]).intersection(set(purchased_after.values)):
      flag += 1
  return flag

In [55]:
# Validate and measure the recommendation accuracy for Algorithm 2:
# Idea: recommend only to users exist in both training and testing datasets.
def a23_evaluation(recommend_dict, common_user, X_test):
  flag = 0
  for user_id in common_user:
    purchased_after = X_test['ItemId'][X_test['UserId'] == user_id]
    try:
      if set(recommend_dict[user_id]).intersection(set(purchased_after.values)):
        flag += 1
    except KeyError:
      continue
    return flag

In [73]:
# Validate and measure the recommendation accuracy for Algorithm 2:
# Idea: recommend only to users exist in both training and testing datasets.
def a4_evaluation(recommend_dict, common_user, X_test):
  flag = 0
  for user_id in common_user:
    purchased_after = X_test['ItemId'][X_test['UserId'] == user_id]
    try:
      if set(recommend_dict[user_id]).intersection(set(purchased_after.values)):
        flag += 1
    except KeyError:
      continue
    return flag

## 4.2 Evaluations for models

In [57]:
# Algorithm 1
a1_flag = a1_evaluation(a1_dict, common_user, cs_user, X_test)
acc_a1 = round(a1_flag/len(common_user | cs_user) * 100, 1)
print(f'Popularity model has a recommendation accuracy of:', acc_a1, '%')

Popularity model has a recommendation accuracy of: 9.8 %


In [65]:
# Algorithm 2
a2_flag = a23_evaluation(a2_dict, common_user, X_test)
acc_a2 = round(a2_flag/len(a2_dict) * 100, 1)
print(f'Popularity model has a recommendation accuracy of:', acc_a2, '%')

Popularity model has a recommendation accuracy of: 0.0 %


In [None]:
# Algorithm 3
# a3_flag = a23_evaluation(a3_dict, X_train, X_test)
# acc_a3 = round(a3_flag/len(a3_dict) * 100, 1)
# print(f'Popularity model has a recommendation accuracy of:', acc_a3, '%')

In [74]:
# Algorithm 4
a4_flag = a4_evaluation(a4_dict, common_user, X_test)
acc_a4 = round(a4_flag/len(a4_dict) * 100, 1)
print(f'Model-based has a recommendation accuracy of:', acc_a4, '%')

Model-based has a recommendation accuracy of: 0.0 %


## 4.3 Model Comparison

In [75]:
print("Here are the comparison of four model's recommendation accuracy:\n\n")
print(f'Popularity model:', acc_a1, '%')
print(f'Content-based model:', acc_a2, '%')
# print(f'Collaborative model:', acc_a3, '%')
print(f'Model-based model:', acc_a4, '%')

Here are the comparison of four model's recommendation accuracy:


Popularity model: 9.8 %
Content-based model: 0.0 %
Model-based model: 0.0 %


# Copyright

## <h3 align="center"> © Haozhe TANG 07.2023. All rights reserved. <h3/>