<a href="https://colab.research.google.com/github/jiangzl2016/yelp-rating-prediction/blob/master/DeepLearning_RecSys.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Deep Learning Models
#### DeepFM

![DeepFM Model Architecture](https://d2l.ai/_images/rec-deepfm.svg)

DeepFM combines a factorization machine (FM) with a deep neural network, and both components share the same feature embeddings.

The FM component models low-order (first- and second-order) feature interactions, while the nonlinear layers of the neural network model high-order interactions. The network can also capture nonlinear structure in the inputs that the FM component cannot.
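
To make the second-order part concrete, here is a minimal NumPy sketch of the pairwise interaction term used by factorization machines. It relies on the standard identity that the sum of pairwise dot products equals half of "square of the sum minus sum of the squares". The function name `fm_second_order` and the random embeddings are illustrative assumptions; this is not DeepCTR's implementation.

```python
import numpy as np

def fm_second_order(embeddings: np.ndarray) -> float:
    """Sum of pairwise dot products <v_i, v_j> over all i < j.

    `embeddings` has shape (num_fields, embedding_dim): one embedding
    vector per active feature (for example user, business, city/state).
    """
    sum_then_square = np.square(embeddings.sum(axis=0))  # (sum_i v_i)^2, per dimension
    square_then_sum = np.square(embeddings).sum(axis=0)  # sum_i v_i^2, per dimension
    return float(0.5 * (sum_then_square - square_then_sum).sum())

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 8))  # 3 fields, embedding_dim = 8 (hypothetical values)

# Brute-force check against the explicit pairwise sum.
brute_force = sum(float(v[i] @ v[j]) for i in range(3) for j in range(i + 1, 3))
fm_term = fm_second_order(v)
```

The identity lets the interaction term be computed in time linear in the number of fields instead of quadratic, which is why it is practical for sparse inputs with many fields.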

#### Wide and Deep Learning

![Wide and Deep Model](https://2.bp.blogspot.com/-wkrmRibw_GM/V3Mg3O3Q0-I/AAAAAAAABG0/Jm3Nl4-VcYIJ44dA5nSz6vpTyCKF2KWQgCKgB/s640/image03.png)

The Wide part of the model memorizes how strongly the co-occurrence of a query-item feature pair correlates with the target label. The Deep part generalizes to query-item interactions that were rarely or never seen in training.
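
The split between the two parts can be illustrated with a toy NumPy sketch: a wide term holding one weight per (user, item) pair, plus a deep term built from embeddings and a small MLP, summed into a single prediction. All sizes, weights, and the function name `wide_and_deep` are made up for illustration. Note that the notebook itself passes the same three sparse columns to both parts rather than explicit cross-product features.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, emb_dim = 4, 5, 3  # hypothetical toy sizes

# Wide part: one weight per (user, item) cross-product feature.
wide_weights = rng.normal(size=(n_users, n_items))

# Deep part: embeddings followed by a one-hidden-layer MLP.
user_emb = rng.normal(size=(n_users, emb_dim))
item_emb = rng.normal(size=(n_items, emb_dim))
w_hidden = rng.normal(size=(2 * emb_dim, 4))
w_out = rng.normal(size=4)

def wide_and_deep(user: int, item: int) -> float:
    wide_logit = wide_weights[user, item]        # memorizes this exact pair
    x = np.concatenate([user_emb[user], item_emb[item]])
    hidden = np.maximum(x @ w_hidden, 0.0)       # ReLU activation
    deep_logit = hidden @ w_out                  # generalizes through embeddings
    return float(wide_logit + deep_logit)

prediction = wide_and_deep(user=1, item=2)
```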

In [1]:
from google.colab import drive
drive.mount('/content/drive')#, force_remount=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install -U deepctr[gpu]
!pip install -U scikit-learn

Requirement already up-to-date: deepctr[gpu] in /usr/local/lib/python3.6/dist-packages (0.7.0)
Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.6/dist-packages (0.22)


The GPU used for the deep learning models is a Tesla P100.

In [3]:
!nvidia-smi

Fri Dec 20 14:14:03 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

#### Importing packages and reading the dataset ...

In [0]:
## data handling
# setup libraries and env
import os
import shutil
import sys

import numpy as np
from scipy import sparse

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sn
sn.set()

import pandas as pd
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from deepctr.models import DeepFM, CCPM, FNN, PNN, WDL, MLR, NFM, AFM, DCN, DIN, DIEN, DSIN, xDeepFM, AutoInt, NFFM, FGCNN, FiBiNET
from deepctr.inputs import SparseFeat,get_feature_names

import itertools as it

training =  pd.read_csv('/content/drive/My Drive/final_project_datasets/ratings_sample_train_20.csv', index_col = 0)
validation = pd.read_csv('/content/drive/My Drive/final_project_datasets/ratings_sample_validation_20.csv', index_col = 0)
test = pd.read_csv('/content/drive/My Drive/final_project_datasets/ratings_sample_test_20.csv', index_col = 0)

businesses = pd.read_csv('/content/drive/My Drive/final_project_datasets/businesses.csv')
users = pd.read_csv('/content/drive/My Drive/final_project_datasets/active_users.csv')

In [0]:
# training.dropna(inplace = True)
# validation.dropna(inplace = True)
# test.dropna(inplace = True)

##### Include cities as a feature in the deep learning models

In [0]:
businesses['business_city_state'] = businesses['business_city'] + businesses['business_state']

In [7]:
print(training.shape, validation.shape, test.shape)

(803897, 5) (57229, 5) (57223, 5)


In [0]:
training = training.merge(right = businesses[['business_id', 'business_city_state']], how = 'left', on = 'business_id')
validation = validation.merge(right = businesses[['business_id', 'business_city_state']], how = 'left', on = 'business_id')
test = test.merge(right = businesses[['business_id', 'business_city_state']], how = 'left', on = 'business_id')
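
A left join like the one above keeps the row count of the ratings frame only if `business_id` is unique in the right-hand frame. The sketch below, on made-up toy data (`ratings` and `businesses_toy` are hypothetical), shows how the `validate` argument of `DataFrame.merge` can guard against accidental row duplication.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b2", "b1"],
    "rating": [5.0, 2.0, 4.0],
})
businesses_toy = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "business_city_state": ["MesaAZ", "ClevelandOH"],
})

merged = ratings.merge(
    businesses_toy,
    how="left",
    on="business_id",
    validate="many_to_one",  # raises MergeError if business_id repeats on the right
)
```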

In [0]:
column_sequence = ['user_id', 'business_id', 'business_city_state', 'text', 'rating', 'date']
training = training[column_sequence]
validation = validation[column_sequence]
test = test[column_sequence]

In [10]:
# convert object to datetime
# training.date = pd.to_datetime(training.date)
# validation.date = pd.to_datetime(validation.date)
# test.date = pd.to_datetime(test.date)

# find hour from datetime
# training['hour'] = training.date.dt.hour
# validation['hour'] = validation.date.dt.hour
# test['hour'] = test.date.dt.hour

test.head()

Unnamed: 0,user_id,business_id,business_city_state,text,rating,date
0,n6-Gk65cPZL6Uz8qRm3NYw,hk5wpV-_pi5jmDDVPeG8DA,MesaAZ,"I highly recommend Arizona Pet Mortuary, David...",5.0,2018-09-14 18:50:19
1,d6xvYpyzcfbF_AZ8vMB7QA,qdCwzhJ5Yo_Sdm_bYDIfOQ,AhwatukeeAZ,I found Kathy's from yelp. I love to support ...,2.0,2011-09-11 06:09:33
2,sG_h0dIzTKWa3Q6fmb4u-g,XS1Zx6GzjtKPKmhDuVw5Jg,ClevelandOH,I had the Saison infused with grapefruit which...,3.0,2017-06-19 22:55:06
3,FIk4lQQu1eTe2EpzQ4xhBA,jLxeBgWhLRbII2ACkgH1Sg,Las VegasNV,First time for me to come inside at least! Hav...,4.0,2018-09-30 18:00:41
4,TpyOT5E16YASd7EWjLQlrw,U_yacPCk8HgE1ywATmQUrg,EtobicokeON,Ordered for lunch with a few colleagues throug...,5.0,2018-10-13 00:10:59


##### Data Quality check

In [11]:
# quality check
print(len(set(test.user_id) - set(training.user_id)))
print(len(set(validation.user_id) - set(training.user_id)))

0
0


In [0]:
test = test.loc[test.business_id.isin(training.business_id)]
validation = validation.loc[validation.business_id.isin(training.business_id)]
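
The check and the filter above can be reproduced on toy data to see how many rows such a filter removes. This is a sketch with hypothetical frames (`train_toy`, `test_toy`); the real drop is visible later in the training log, where the validation set has fewer rows than the shape printed earlier.

```python
import pandas as pd

train_toy = pd.DataFrame({"user_id": ["u1", "u2"], "business_id": ["b1", "b2"]})
test_toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u2"],
    "business_id": ["b1", "b3", "b2"],
})

# Users that appear in test but never in training (cold-start users).
unseen_users = set(test_toy.user_id) - set(train_toy.user_id)

# Keep only rows whose business was seen during training.
seen_business = test_toy.business_id.isin(train_toy.business_id)
filtered = test_toy.loc[seen_business]
dropped_fraction = 1 - len(filtered) / len(test_toy)
```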

In [0]:
# map each user_id, business_id to an index
# user_mapping = {}
# for n,i in enumerate(training.user_id.unique()):
#   user_mapping[i] = n

# business_mapping = {}
# for n,i in enumerate(training.business_id.unique()):
#   business_mapping[i] = n

In [0]:
# for training
# training['user_id'] = training['user_id'].map(user_mapping)
# training['business_id'] = training['business_id'].map(business_mapping)
# for validation
# validation['user_id'] = validation['user_id'].map(user_mapping)
# validation['business_id'] = validation['business_id'].map(business_mapping)
# for test
# test['user_id'] = test['user_id'].map(user_mapping)
# test['business_id'] = test['business_id'].map(business_mapping)

In [13]:
test.head()

Unnamed: 0,user_id,business_id,business_city_state,text,rating,date
1,d6xvYpyzcfbF_AZ8vMB7QA,qdCwzhJ5Yo_Sdm_bYDIfOQ,AhwatukeeAZ,I found Kathy's from yelp. I love to support ...,2.0,2011-09-11 06:09:33
2,sG_h0dIzTKWa3Q6fmb4u-g,XS1Zx6GzjtKPKmhDuVw5Jg,ClevelandOH,I had the Saison infused with grapefruit which...,3.0,2017-06-19 22:55:06
3,FIk4lQQu1eTe2EpzQ4xhBA,jLxeBgWhLRbII2ACkgH1Sg,Las VegasNV,First time for me to come inside at least! Hav...,4.0,2018-09-30 18:00:41
4,TpyOT5E16YASd7EWjLQlrw,U_yacPCk8HgE1ywATmQUrg,EtobicokeON,Ordered for lunch with a few colleagues throug...,5.0,2018-10-13 00:10:59
5,_N7Ndn29bpll_961oPeEfw,O-b5osM0NO4f31dp6_DatQ,TorontoON,"I can only comment on their macarons, which I'...",3.0,2014-08-01 01:55:23


In [0]:
# validation = validation.loc[~validation.business_id.isin(['WpC53SqwoCY5AuYIFr_1eA'])]

#### Preparation for input into deep learning models

In [0]:
# 1.Label Encoding for sparse features,and do simple Transformation for dense features
training = training.loc[training.business_city_state.apply(type) != float]
training_deep = training.copy()
validation_deep = validation.copy()
test_deep = test.copy()

sparse_features = ["user_id", "business_id", "business_city_state"]#, "hour"]
target = ['rating']
for feat in sparse_features:
  lbe = LabelEncoder()
  training_deep[feat] = lbe.fit_transform(training_deep[feat])
  validation_deep[feat] = lbe.transform(validation_deep[feat])
  test_deep[feat] = lbe.transform(test_deep[feat])
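
`LabelEncoder.transform` raises a `ValueError` when it meets a label that was not present during `fit`, which is why unseen businesses were filtered out earlier. As an illustration of the same mapping, the sketch below swaps in `pandas.Categorical` (not what the notebook uses), which assigns the code `-1` to unseen values instead of raising. The toy IDs are made up.

```python
import pandas as pd

train_ids = pd.Series(["b2", "b1", "b3", "b1"])
new_ids = pd.Series(["b3", "b9", "b1"])  # "b9" never appears in training

# Sorted unique training values, analogous to LabelEncoder's sorted classes.
categories = sorted(train_ids.unique())

train_codes = pd.Categorical(train_ids, categories=categories).codes
new_codes = pd.Categorical(new_ids, categories=categories).codes  # -1 marks unseen IDs
```

A `-1` code cannot be fed to an embedding layer directly, so rows with unseen IDs would still need to be dropped or mapped to a dedicated "unknown" index.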

##### Grid Search - Hyperparameter Tuning

We tuned three hyperparameters for both models:
1. Embedding dimension - 8, 16, 32
2. Hidden Units - (128, 128), (256, 256), (256, 128)
3. Dropout - 0.1, 0.3, 0.5
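
The grid above contains 3 × 3 × 3 = 27 combinations. A minimal sketch of enumerating them as named dictionaries follows; `param_grid` and `grid` are illustrative names, and keeping the names attached to the values avoids depending on the order in which a tuple is unpacked.

```python
import itertools

param_grid = {
    "embedding_dim": [8, 16, 32],
    "dnn_hidden_units": [(128, 128), (256, 256), (256, 128)],
    "dnn_dropout": [0.1, 0.3, 0.5],
}

names = list(param_grid)
grid = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]
```

Each element of `grid` can then be read by key, for example `combo["dnn_dropout"]`.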

#### Grid Search for DeepFM

In [0]:
params = {
    'embedding_dim' : [8, 16, 32],
    'dnn_hidden_units': [(128, 128), (256, 256), (256, 128)],
    'dnn_dropout': [0.1, 0.3, 0.5]
}
allNames = sorted(params)
combinations = it.product(*(params[Name] for Name in allNames))

best_params = None
best_mse = 1000

# grid search for DeepFM
for i in list(combinations):
  dropout, dnn_hidden_units, embedding_dim = i

  # 2.count #unique features for each sparse field
  fixlen_feature_columns = [SparseFeat(feat, training_deep[feat].nunique(), embedding_dim = embedding_dim) for feat in sparse_features]
  linear_feature_columns = fixlen_feature_columns
  dnn_feature_columns = fixlen_feature_columns
  feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
  
  # 3.define model inputs   
  train_model_input = {name:training_deep[name].values for name in feature_names}
  valid_model_input = {name:validation_deep[name].values for name in feature_names}
  test_model_input = {name:test_deep[name].values for name in feature_names}

  # 4.Define Model,train,predict and evaluate
  model = DeepFM(linear_feature_columns, dnn_feature_columns, fm_group=['business_city_state'], task='regression', dnn_hidden_units = dnn_hidden_units, dnn_dropout = dropout, l2_reg_embedding=1e-05, l2_reg_dnn=1e-05, l2_reg_linear=1e-05, seed = 42)
  model.compile("adam", "mse", metrics=['mse'], )
  # only one epoch because it overfits after the first epoch
  history = model.fit(train_model_input, training_deep[target].values, batch_size=256, epochs=1, verbose=2, validation_data= (valid_model_input, validation_deep[target].values))

  validation_predictions = model.predict(valid_model_input, batch_size=256)
  val_mse = mean_squared_error(validation_deep[target].values, validation_predictions)
  print("validation MSE", round(val_mse, 4))
  if val_mse < best_mse:
    best_mse = val_mse
    best_params = [dropout, dnn_hidden_units, embedding_dim]

Train on 803897 samples, validate on 53225 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


803897/803897 - 70s - loss: 1.7237 - mse: 1.7062 - val_loss: 1.7827 - val_mse: 1.7481
validation MSE 1.7481
Train on 803897 samples, validate on 53225 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [0]:
best_mse, best_params

#### Grid Search for WDL

In [0]:
params = {
    'embedding_dim' : [8, 16, 32],
    'dnn_hidden_units': [(128, 128), (256, 256), (256, 128)],
    'dnn_dropout': [0.1, 0.3, 0.5]
}
allNames = sorted(params)
combinations = it.product(*(params[Name] for Name in allNames))

best_params_wdl = None
best_mse_wdl = 1000

# grid search for WDL
for i in list(combinations):
  dropout, dnn_hidden_units, embedding_dim = i

  # 2.count #unique features for each sparse field
  fixlen_feature_columns = [SparseFeat(feat, training_deep[feat].nunique(), embedding_dim = embedding_dim) for feat in sparse_features]
  linear_feature_columns = fixlen_feature_columns
  dnn_feature_columns = fixlen_feature_columns
  feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
  
  # 3.define model inputs   
  train_model_input = {name:training_deep[name].values for name in feature_names}
  valid_model_input = {name:validation_deep[name].values for name in feature_names}
  test_model_input = {name:test_deep[name].values for name in feature_names}

  # 4.Define Model,train,predict and evaluate
  model = WDL(linear_feature_columns, dnn_feature_columns, task='regression', dnn_hidden_units = dnn_hidden_units, dnn_dropout = dropout, l2_reg_embedding=1e-05, l2_reg_dnn=1e-05, l2_reg_linear=1e-05, seed = 42)
  model.compile("adam", "mse", metrics=['mse'], )
  # only one epoch because it overfits after the first epoch
  history = model.fit(train_model_input, training_deep[target].values, batch_size=256, epochs=1, verbose=2, validation_data= (valid_model_input, validation_deep[target].values))

  validation_predictions = model.predict(valid_model_input, batch_size=256)
  val_mse = mean_squared_error(validation_deep[target].values, validation_predictions)
  print("validation MSE", round(val_mse, 4))
  if val_mse < best_mse_wdl:
    best_mse_wdl = val_mse
    best_params_wdl = [dropout, dnn_hidden_units, embedding_dim]

In [0]:
best_mse_wdl, best_params_wdl

#### Refitting the model on the best parameters for each of the models

In [16]:
# 1.Label Encoding for sparse features,and do simple Transformation for dense features
training_combined = pd.concat([training, validation], axis = 0)
training_combined_deep = training_combined.copy()
test_deep = test.copy()

sparse_features = ["user_id", "business_id", "business_city_state"]#, "hour"]
target = ['rating']

for feat in sparse_features:
  lbe = LabelEncoder()
  training_combined_deep[feat] = lbe.fit_transform(training_combined_deep[feat])
  test_deep[feat] = lbe.transform(test_deep[feat])

# best DeepFM Model
dropout_deepfm = 0.1
dnn_hidden_units_deepfm = (128, 128)
embedding_dim_deepfm = 8

# 2.count #unique features for each sparse field
fixlen_feature_columns = [SparseFeat(feat, training_combined_deep[feat].nunique(), embedding_dim = embedding_dim_deepfm) for feat in sparse_features]
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

# 3.define model inputs   
train_model_input = {name:training_combined_deep[name].values for name in feature_names}
test_model_input = {name:test_deep[name].values for name in feature_names}

# 4.Define Model,train,predict and evaluate
model_deepfm = DeepFM(linear_feature_columns, dnn_feature_columns, fm_group=['business_city_state'], task='regression', dnn_hidden_units = dnn_hidden_units_deepfm, dnn_dropout = dropout_deepfm, l2_reg_embedding=1e-05, l2_reg_dnn=1e-05, l2_reg_linear=1e-05, seed = 42)
model_deepfm.compile("adam", "mse", metrics=['mse'], )
# only one epoch because it overfits after the first epoch
history = model_deepfm.fit(train_model_input, training_combined_deep[target].values, batch_size=256, epochs=1, verbose=2)

test_predictions_deepfm = model_deepfm.predict(test_model_input, batch_size=256)
test_mse_deepfm = mean_squared_error(test_deep[target].values, test_predictions_deepfm)


# best WDL Model
dropout_wdl = 0.3
dnn_hidden_units_wdl = (256, 256)
embedding_dim_wdl = 8

# 2.count #unique features for each sparse field
fixlen_feature_columns = [SparseFeat(feat, training_combined_deep[feat].nunique(), embedding_dim = embedding_dim_wdl) for feat in sparse_features]
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

# 3.define model inputs   
train_model_input = {name:training_combined_deep[name].values for name in feature_names}
test_model_input = {name:test_deep[name].values for name in feature_names}

# 4.Define Model,train,predict and evaluate
model_wdl = WDL(linear_feature_columns, dnn_feature_columns, task='regression', dnn_hidden_units = dnn_hidden_units_wdl, dnn_dropout = dropout_wdl, l2_reg_embedding=1e-05, l2_reg_dnn=1e-05, l2_reg_linear=1e-05, seed = 42)
model_wdl.compile("adam", "mse", metrics=['mse'], )
# only one epoch because it overfits after the first epoch
history = model_wdl.fit(train_model_input, training_combined_deep[target].values, batch_size=256, epochs=1, verbose=2)

test_predictions_wdl = model_wdl.predict(test_model_input, batch_size=256)
test_mse_wdl = mean_squared_error(test_deep[target].values, test_predictions_wdl)

Train on 857122 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


857122/857122 - 72s - loss: 1.7143 - mse: 1.6955
Train on 857122 samples


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


857122/857122 - 71s - loss: 1.7126 - mse: 1.6932


In [18]:
test_mse_deepfm, test_mse_wdl

(1.7833152958561953, 1.784929756718987)

#### R2 Score

In [19]:
print("R2 Score: %0.3f" %r2_score(y_true = test_deep[target], y_pred = test_predictions_deepfm))
print("R2 Score: %0.3f" %r2_score(y_true = test_deep[target], y_pred = test_predictions_wdl))

R2 Score: 0.204
R2 Score: 0.203


#### Mean Absolute Error

In [20]:
print("Mean Absolute Error: %0.3f" %mean_absolute_error(y_true = test_deep[target], y_pred = test_predictions_deepfm))
print("Mean Absolute Error: %0.3f" %mean_absolute_error(y_true = test_deep[target], y_pred = test_predictions_wdl))

Mean Absolute Error: 1.082
Mean Absolute Error: 1.079


#### Root Mean Squared Error

In [21]:
print("Root Mean Square Error: %0.3f" %mean_squared_error(y_true = test_deep[target], y_pred = test_predictions_deepfm, squared = False))
print("Root Mean Square Error: %0.3f" %mean_squared_error(y_true = test_deep[target], y_pred = test_predictions_wdl, squared = False))

Root Mean Square Error: 1.335
Root Mean Square Error: 1.336
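
For reference, the three metrics reported above can be written out directly in NumPy. The arrays are made-up toy values, not the notebook's data. The same relationship holds in the results above: the reported RMSE of 1.335 is the square root of the DeepFM test MSE of about 1.7833.

```python
import numpy as np

y_true = np.array([5.0, 2.0, 3.0, 4.0])  # toy ratings
y_pred = np.array([4.0, 3.0, 3.0, 2.0])  # toy predictions

mse = float(np.mean((y_true - y_pred) ** 2))
rmse = float(np.sqrt(mse))                      # RMSE is the square root of MSE
mae = float(np.mean(np.abs(y_true - y_pred)))

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = float(np.sum((y_true - y_pred) ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```

An R2 below zero, as in this toy case, means the predictions are worse than a constant mean predictor.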


#### Rest of the metrics

In [0]:
def process(df):
    # df = df.drop(df.columns[0], axis =1)
    df['date']  = pd.to_datetime(df['date'])
    df['week_day'] = df['date'].dt.weekday
    df['month'] = df['date'].dt.month
    df['hour'] = df['date'].dt.hour
    df = df.merge(users, on = 'user_id')
    df = df.merge(businesses, on = 'business_id')
    return df

In [0]:
ratings_train = process(training.copy())
ratings_validation = process(validation.copy())
ratings_test = process(test.copy())

In [0]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
ratings_train_final = pd.concat([ratings_train, ratings_validation])

In [0]:
unique_city_businesses = ratings_train_final[['business_city','business_id']].drop_duplicates()
unique_cities = unique_city_businesses.groupby('business_city').count()['business_id']
unique_cities = unique_cities[unique_cities > 100]
out = pd.DataFrame()
for city in unique_cities.index:
    tmp = ratings_train_final[(ratings_train_final['business_city'] ==city) &
                              (ratings_train_final['rating'] >ratings_train_final['average_stars'])]
    if len(tmp['user_id'].unique())>4:
        np.random.seed(42)
        # sample users first, then one review per user, so that the same user is never sampled twice in a city
        five_users = np.random.choice(tmp['user_id'].unique(),5, replace = False)
        row = tmp[tmp['user_id'].isin(five_users)].groupby('user_id', group_keys=False).apply(lambda df: df.sample(1))
        out = pd.concat([out, row])

In [26]:
all(out.groupby('business_city').count()['user_id']==5)

True

In [0]:
predict_df = out[['user_id','business_city','business_state']]
predict_df = predict_df.merge(unique_city_businesses, on = 'business_city')

In [28]:
all(predict_df.groupby('business_city')['user_id'].nunique()==5)

True

In [0]:
# remove businesses not in training
predict_df = predict_df.loc[predict_df.business_id.isin(training.business_id)]

In [30]:
predict_df[['user_id', 'business_id']]

Unnamed: 0,user_id,business_id
0,7kfJk_NtOslC9jPk2Koz3g,8AW0koYMDa1PlJMOE-b2-g
1,7kfJk_NtOslC9jPk2Koz3g,-YGQwikbX2fXUIjyegR7pw
2,7kfJk_NtOslC9jPk2Koz3g,5Kh5i4VhXj-Leg8gujIzjQ
3,7kfJk_NtOslC9jPk2Koz3g,Wl1oOVbtK4I9vRKoaSKYiQ
4,7kfJk_NtOslC9jPk2Koz3g,OxSaGGTmIujsjDpDqwyGPQ
...,...,...
567480,gWbXQg0rPLDCRNR0HbImvA,nkLUGjzFNPCrClbT1UIZaw
567481,gWbXQg0rPLDCRNR0HbImvA,r0U1aexkjoUKoTuXlisjng
567482,gWbXQg0rPLDCRNR0HbImvA,KfuHr7dYyEaDrHtQpFdNUw
567483,gWbXQg0rPLDCRNR0HbImvA,41xuKlIuZTLu6qTbpqTY-A


In [0]:
predict_df['business_city_state'] = predict_df['business_city'] + predict_df['business_state']
metric_test_deep = predict_df[['user_id', 'business_id', 'business_city_state']].copy()

for feat in sparse_features:
  lbe = LabelEncoder()
  # fit on the same data the final models were trained on, so the integer codes match
  lbe.fit(training_combined[feat])
  metric_test_deep[feat] = lbe.transform(metric_test_deep[feat])
  
metric_test_input = {name:metric_test_deep[name].values for name in feature_names}

In [0]:
metric_test_predictions_deepfm = model_deepfm.predict(metric_test_input, batch_size=256)

In [0]:
metric_test_predictions_wdl = model_wdl.predict(metric_test_input, batch_size=256)

In [0]:
predict_df['predictions'] = metric_test_predictions_wdl

In [0]:
top_10_recs = predict_df.groupby(['user_id','business_city'])['predictions'].nlargest(10).reset_index()

In [36]:
all(top_10_recs.groupby('business_city')['user_id'].count()==50)

True

In [0]:
cnt =0
serendipity = 0
for row in out.iterrows():
    row_values = row[1]
    top_10 = predict_df.loc[top_10_recs[top_10_recs['user_id'] == row_values['user_id']].level_2]['business_id']
    ###In top 10
    if row_values['business_id'] in top_10.values:
        cnt+=1
    user_history = ratings_train_final[ratings_train_final['user_id'] == row_values['user_id']]    
    been_there = [i for i in top_10.values if i in  user_history.business_id.values]
    serendipity += 1-len(been_there)/10

##### Inclusion of Last Review in Top 10 Recommendations

In [38]:
cnt/len(out)

0.15421686746987953

#### Novelty (share of new restaurants in the top 10 recommendations)

In [39]:
serendipity/len(out)

0.9710843373493965
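
The two numbers above can be read as a hit rate and a novelty score. A small sketch of both quantities for a single user follows, using a hypothetical helper `hit_and_novelty` and made-up IDs, with a top-5 list instead of the notebook's top 10.

```python
def hit_and_novelty(top_k, held_out_item, history):
    """Return (hit, novelty) for one user's recommendation list."""
    hit = held_out_item in top_k
    already_visited = sum(item in history for item in top_k)
    novelty = 1 - already_visited / len(top_k)
    return hit, novelty

top_k = ["b1", "b2", "b3", "b4", "b5"]
hit, novelty = hit_and_novelty(top_k, held_out_item="b3", history={"b3", "b9"})
```

In the notebook, the reviews in `out` appear to be sampled from `ratings_train_final`, so the sampled business is itself part of the user's history, and each hit lowers that user's novelty by one slot.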

In [0]:
predict_df = predict_df.reset_index()

In [41]:
predict_df.columns

Index(['index', 'user_id', 'business_city', 'business_state', 'business_id',
       'business_city_state', 'predictions'],
      dtype='object')

In [42]:
top_10_recs.head(1)

Unnamed: 0,user_id,business_city,level_2,predictions
0,--3WaS23LcIXtxyFULJHTA,Scottsdale,442142,5.031378


In [43]:
predict_df.head(1)

Unnamed: 0,index,user_id,business_city,business_state,business_id,business_city_state,predictions
0,0,7kfJk_NtOslC9jPk2Koz3g,Ajax,ON,8AW0koYMDa1PlJMOE-b2-g,AjaxON,3.028969


In [0]:
analysis_df = predict_df.merge(top_10_recs, left_on = ['user_id','business_city','index'], right_on = ['user_id','business_city','level_2'])

In [45]:
all(analysis_df.groupby('business_city')['business_id'].count() ==50)

True

#### Coverage (share of unique recommendations)

In [46]:
(analysis_df.groupby('business_city')['business_id'].nunique()/50).values.mean()

0.23156626506024094
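
Coverage here is the number of distinct recommended businesses in a city divided by the number of recommendation slots in that city, averaged over cities. A toy sketch with made-up data (the frame `recs` is hypothetical, with two slots per user rather than ten):

```python
import pandas as pd

recs = pd.DataFrame({
    "business_city": ["A"] * 4 + ["B"] * 4,
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "business_id": ["b1", "b2", "b1", "b2", "b5", "b6", "b7", "b8"],
})

by_city = recs.groupby("business_city")["business_id"]
coverage = by_city.nunique() / by_city.count()  # distinct items / slots, per city
mean_coverage = coverage.mean()
```

In city A both users receive the same two businesses, so coverage is 0.5; in city B every slot is distinct, so coverage is 1.0. A low average, as in the result above, suggests the model recommends largely the same businesses to every user in a city.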

In [0]:
predict_df['rankings']=predict_df.groupby(['business_city','user_id'])['predictions'].rank("first",ascending = False)

In [0]:
running_rankings =0
for row in out.iterrows():
    row_values = row[1]
    user_recs = predict_df[(predict_df['user_id']==row_values['user_id'])
                        &(predict_df['business_city']==row_values['business_city'])
                         & (predict_df['business_id']==row_values['business_id'])
                          ]
    assert len(user_recs)==1
    running_rankings += user_recs['rankings'].sum()

#### Average Ranking of Last Positive Review

In [49]:
running_rankings / len(out)

449.69397590361444
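
The ranking step can be sketched on toy data as follows. `rank(method="first", ascending=False)` assigns rank 1 to the highest predicted rating within each group and gives every row a distinct rank. The frame `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "business_id": ["b1", "b2", "b3", "b1", "b2"],
    "predictions": [3.2, 4.8, 4.1, 2.0, 3.5],
})

# Rank candidates per user, best prediction first.
scores["rankings"] = (
    scores.groupby("user_id")["predictions"]
    .rank(method="first", ascending=False)
)

# Rank of a specific (user, business) pair, e.g. a held-out positive review.
rank_of_pair = scores.loc[
    (scores.user_id == "u1") & (scores.business_id == "b3"), "rankings"
].iloc[0]
```

Because each city has a different number of candidate businesses, an average rank is easier to interpret next to the candidate count per city.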