# Advanced Model

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dot, Dense, Concatenate
from keras.optimizers import Adam

## Intro

Previously we built a base model that used a 2D user-item interaction matrix under the hood and made recommendations by computing cosine similarity scores. Now we want to make the model more robust by doing the following:

- Extend the 2D user-item matrix into a tensor by adding in features from both the users' side and items' side
- Use a neural network

This is known as the neural collaborative filtering (NCF) approach.
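
For contrast with what follows, the base model's idea can be sketched as below. This is a minimal illustration on a made-up 3x4 ratings matrix, not the code from the earlier notebook; `ratings` and `cosine_similarity_matrix` are hypothetical names.

```python
import numpy as np

# Hypothetical toy user-item matrix: rows are users, columns are businesses,
# and 0 means "no review".
ratings = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 3.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
])

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1.0, norms)  # guard against all-zero rows
    return unit @ unit.T

user_sim = cosine_similarity_matrix(ratings)
print(np.round(user_sim, 2))
```

Users 0 and 1 rate the same businesses similarly, so their similarity is high; users 0 and 2 share no rated business, so theirs is zero. A matrix like this has no place for side features, which is what the tensor extension is meant to address.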

## Prep Data

We will now load in datasets from all necessary sources: users, restaurants, and reviews:

In [2]:
%%time
user_df = pd.read_feather('../data/yelp_user_cleaned.feather')  # 100 MB
business_df = pd.read_feather('../data/yelp_business_cleaned.feather')  # 40 MB
review_df = pd.read_feather('../data/yelp_review_cleaned.feather')  # 2.3 GB

CPU times: user 5.19 s, sys: 5.15 s, total: 10.3 s
Wall time: 19.5 s


Sanity check:

In [3]:
user_df.user_id.unique().shape, review_df.user_id.unique().shape

((1532223,), (1532233,))

In [4]:
business_df.business_id.unique().shape, review_df.business_id.unique().shape

((68054,), (68054,))
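
Comparing counts alone does not show which ids are missing on which side (the user counts above differ by 10). A set-based check could look like the sketch below; `toy_users` and `toy_reviews` are hypothetical stand-ins for `user_df` and `review_df`.

```python
import pandas as pd

# Hypothetical stand-ins for user_df and review_df.
toy_users = pd.DataFrame({'user_id': ['u1', 'u2', 'u3']})
toy_reviews = pd.DataFrame({'user_id': ['u1', 'u2', 'u2', 'u4']})

user_ids = set(toy_users['user_id'])
review_user_ids = set(toy_reviews['user_id'])

# Reviewers without a user profile, and profiles without any review.
missing_profiles = review_user_ids - user_ids
no_reviews = user_ids - review_user_ids
print(missing_profiles, no_reviews)
```

Reviews whose `user_id` has no matching profile are dropped silently by an inner merge, so it is worth knowing how many there are before merging.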

Take a sample out of concern for hardware:

In [5]:
review_df = review_df.sample(100_000, random_state=42)

We will use the review dataset to build our user-item matrix; first we map the scaled `stars` values back to the original 1 to 5 ratings:

In [6]:
stars_scaled_unique = sorted(list(review_df['stars'].unique()))
stars_scale_map = dict(list(zip(stars_scaled_unique, range(1, 6))))
review_df['stars'] = review_df['stars'].map(stars_scale_map)
review_df.stars.unique()

array([1, 3, 2, 5, 4])
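
The mapping above relies on two things: the scaling preserved the order of the ratings, and all five rating levels are present in the sample. A minimal sketch on toy data, assuming the earlier scaling was a standardization (that is an assumption; the cleaning step is not shown here):

```python
import pandas as pd

# Hypothetical raw ratings and a hypothetical scaling step.
raw = pd.Series([1, 2, 3, 4, 5, 5, 4])
scaled = (raw - raw.mean()) / raw.std()

# Sorting the distinct scaled values recovers the rating order.
levels = sorted(scaled.unique())
if len(levels) != 5:
    raise ValueError('expected exactly five distinct rating levels')

level_to_star = dict(zip(levels, range(1, 6)))
restored = scaled.map(level_to_star)
print(restored.tolist())
```

If a sample happened to miss one rating level, `zip` would silently assign wrong stars, hence the explicit length check.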

Previously we kept only `['user_id', 'business_id', 'stars']` for building the user-item matrix. Now we also keep more features from the review data: `['useful', 'funny', 'cool']`. There is also a `text` column; we could run sentiment analysis on it and use the sentiment categories as a new feature, but we leave that as a future extension.

In [7]:
review_df = review_df.loc[:, ['user_id', 'business_id', 'stars', 'useful', 'funny', 'cool']]
review_df.head(3)

| | user_id | business_id | stars | useful | funny | cool |
| --- | --- | --- | --- | --- | --- | --- |
| 1322294 | 0lpxU4Dfi8AeBt0SeCrEuw | tQKqrLs16Xi-lFrd3_CBAQ | 1 | 0.350319 | -0.184315 | -0.229438 |
| 4297632 | 5nw1Zc3fi_ehDJFd3mUEYA | nLxNJuvgoHQHn_IGYifRnw | 1 | -0.007821 | -0.184315 | -0.229438 |
| 2143059 | 7fDqaGdUMccXQ4bnPwR6yg | etaIhl-sduOKc6J_qHmmtA | 3 | 0.350319 | -0.184315 | 0.677802 |


We also subset and rename columns for the other two data frames before merging all:

In [8]:
user_df.drop(['name'], axis=1, inplace=True)

In [9]:
# Why rename here instead of relying on merge suffixes:
# we later need to know which features belong to users,
# and explicitly named columns make that list easy to keep.
user_df.rename({'useful': 'useful_user',
                'funny': 'funny_user',
                'cool': 'cool_user',
                'review_count': 'review_count_user'}, axis=1, inplace=True)

In [10]:
# note: we've already expanded the dictionaries such as attributes in business_data_inspect.ipynb
business_df.drop(['original_index', 'name', 'address', 'city', 'state',
                  'postal_code', 'latitude', 'longitude',
                  'attributes', 'categories', 'is_restaurant', 'GoodForMeal', 'BestNights'], axis=1, inplace=True)

In [11]:
business_df.rename({'stars': 'stars_business',
                    'review_count': 'review_count_business'}, axis=1, inplace=True)

Business data has categorical features, which we need to one-hot encode (the ids are left unencoded):

In [12]:
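# dtype='float32' keeps the dummy columns numeric; newer pandas versions default to bool
dummies = pd.get_dummies(business_df.drop('business_id', axis=1), drop_first=True, dtype='float32')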

In [13]:
business_df = pd.concat([business_df.loc[:, ['business_id']], dummies], axis=1)
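
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. The `NoiseLevel` column and its values are hypothetical examples of a categorical attribute.

```python
import pandas as pd

# Hypothetical miniature of business_df with one categorical column.
toy_business = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'stars_business': [4.0, 3.5, 5.0],
    'NoiseLevel': ['quiet', 'loud', 'average'],
})

# Encode everything except the id, then put the id back in front.
encoded = pd.get_dummies(toy_business.drop('business_id', axis=1), drop_first=True)
toy_encoded = pd.concat([toy_business[['business_id']], encoded], axis=1)
print(toy_encoded.columns.tolist())
```

Numeric columns pass through unchanged, while a categorical column with k levels becomes k - 1 indicator columns because of `drop_first=True`. Columns with many distinct values therefore widen the frame quickly.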

Next, we merge all three data frames together:

In [14]:
df = pd.merge(review_df, user_df, on='user_id')

In [15]:
df = pd.merge(df, business_df, on='business_id')
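
`pd.merge` defaults to an inner join, so rows without a match on either side disappear without warning. A minimal sketch of how one might guard the merge, using toy frames (hypothetical data) and the `validate` argument of `pd.merge`:

```python
import pandas as pd

# Hypothetical miniature frames.
toy_reviews = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u4'],
    'business_id': ['b1', 'b1', 'b2'],
    'stars': [5, 3, 1],
})
toy_users = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'review_count_user': [10, 2, 7],
})

# validate='many_to_one' raises if user_id is duplicated on the user side,
# which would otherwise multiply review rows.
merged = pd.merge(toy_reviews, toy_users, on='user_id',
                  how='inner', validate='many_to_one')

dropped = len(toy_reviews) - len(merged)
print(f'{dropped} review(s) dropped by the inner join')
```

Tracking the row count before and after each merge makes it clear how much of the sample survives.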

Take a second, smaller sample of the merged frame out of concern for hardware:

In [16]:
df = df.sample(10_000, random_state=42)

In [17]:
df.shape

(10000, 7470)

## Modeling

Encode `user_id` and `business_id` to convert the string ids into integers:

In [18]:
user_id_encoder = LabelEncoder()
business_id_encoder = LabelEncoder()

df['user_id'] = user_id_encoder.fit_transform(df['user_id'])
df['business_id'] = business_id_encoder.fit_transform(df['business_id'])
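
As a small illustration of what `LabelEncoder` does here, the sketch below encodes a made-up list of ids and maps the codes back again.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string ids with one repeat.
ids = ['zeta', 'alpha', 'zeta', 'mid']

encoder = LabelEncoder()
codes = encoder.fit_transform(ids)           # classes are sorted: alpha, mid, zeta
restored = encoder.inverse_transform(codes)  # back to the original strings

print(codes.tolist(), list(restored))
```

The codes run from 0 to the number of distinct ids minus one, which is the index range an embedding layer with that many rows expects. Keeping the fitted encoders around allows predictions to be mapped back to the original Yelp ids.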

Train test split:

In [19]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
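
Because the split is made on rows, some users or businesses in the test set may never appear in the training set, and their embeddings would then stay at their initial values. With a small sample drawn from a large user base this is probably common. A minimal sketch of how to measure it, using made-up id arrays:

```python
import numpy as np

# Hypothetical encoded user ids in each split.
train_users = np.array([0, 1, 2, 2, 3])
test_users = np.array([2, 4, 5])

# True where a test id was never seen during training.
unseen = ~np.isin(test_users, train_users)
unseen_share = unseen.mean()
print(f'{unseen_share:.0%} of test rows have an unseen user')
```

For such rows the prediction can only rely on the side features, so this share helps when interpreting the test loss.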

Helper constants:

In [20]:
num_users = df['user_id'].nunique()
num_businesses = df['business_id'].nunique()

Model architecture:

In [21]:
# embeddings for id's
user_input = Input(shape=(1,), name='user_input')
user_embedding = Embedding(num_users, 16, name='user_embedding')(user_input)
user_flatten = Flatten(name='user_flatten')(user_embedding)

business_input = Input(shape=(1,), name='business_input')
business_embedding = Embedding(num_businesses, 16, name='business_embedding')(business_input)
business_flatten = Flatten(name='business_flatten')(business_embedding)

dot_product = Dot(axes=1, name='dot_product')([user_flatten, business_flatten])

# add in user and business features
user_features_input = Input(shape=(user_df.shape[1] - 1,), name='user_features_input')
business_features_input = Input(shape=(business_df.shape[1] - 1,), name='business_features_input')
concat_features = Concatenate(name='concat_features')([dot_product, user_features_input, business_features_input])

dense_layer = Dense(64, activation='relu', name='dense_layer')(concat_features)
output = Dense(1, activation='linear', name='output')(dense_layer)

model = Model(inputs=[user_input, business_input, user_features_input, business_features_input], outputs=output)
model.compile(optimizer=Adam(0.001), loss='mean_squared_error')
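
To show what this architecture computes for a batch, here is a NumPy-only sketch of the forward pass. The weights are random stand-ins rather than trained values, and the sizes (`num_users`, `n_user_feats`, and so on) are small made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_businesses, emb_dim = 4, 3, 16
n_user_feats, n_business_feats = 2, 5

# Randomly initialised parameters (hypothetical stand-ins for trained weights).
user_emb = rng.normal(size=(num_users, emb_dim))
business_emb = rng.normal(size=(num_businesses, emb_dim))
w_hidden = rng.normal(size=(1 + n_user_feats + n_business_feats, 64))
b_hidden = np.zeros(64)
w_out = rng.normal(size=(64, 1))
b_out = np.zeros(1)

def forward(user_ids, business_ids, user_feats, business_feats):
    # Embedding lookup followed by a row-wise dot product.
    dot = np.sum(user_emb[user_ids] * business_emb[business_ids],
                 axis=1, keepdims=True)
    # Concatenate the interaction score with the side features.
    x = np.concatenate([dot, user_feats, business_feats], axis=1)
    hidden = np.maximum(x @ w_hidden + b_hidden, 0.0)  # ReLU dense layer
    return hidden @ w_out + b_out                      # linear output

batch = 2
preds = forward(np.array([0, 3]), np.array([1, 2]),
                rng.normal(size=(batch, n_user_feats)),
                rng.normal(size=(batch, n_business_feats)))
print(preds.shape)  # (2, 1)
```

The dot product contributes a single column, so the dense layer sees one interaction score next to all user and business features.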

Train:

In [22]:
train_inputs = [
    df_train['user_id'].values,
    df_train['business_id'].values,
    df_train[user_df.columns[1:]].values,
    df_train[business_df.columns[1:]].values
]

model.fit(train_inputs, df_train['stars'].values, epochs=10, validation_split=0.1)

Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x3ede997b0>

Test:

In [24]:
test_inputs = [
    df_test['user_id'].values,
    df_test['business_id'].values,
    df_test[user_df.columns[1:]].values,
    df_test[business_df.columns[1:]].values
]

test_loss = model.evaluate(test_inputs, df_test['stars'].values)
print(f'Test loss (MSE): {test_loss}')

Test loss (MSE): 1.2345056533813477


Note that the base model had an RMSE of 1.4, i.e. an MSE of 1.96. The test MSE of about 1.23 here (an RMSE of about 1.11) is an improvement, although it was measured on a 10,000-row sample rather than the full dataset.
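
The two metrics are related by a square root, so they can be put on the same scale with a couple of lines. The numbers are the ones reported in this notebook.

```python
import math

base_rmse = 1.4                 # base model, as reported
test_mse = 1.2345056533813477   # advanced model, from the cell above

base_mse = base_rmse ** 2
test_rmse = math.sqrt(test_mse)

print(f'base:     RMSE {base_rmse:.2f}, MSE {base_mse:.2f}')
print(f'advanced: RMSE {test_rmse:.2f}, MSE {test_mse:.2f}')
```

RMSE is in units of stars, which makes it the easier of the two to read as "typical error per rating".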

## Next Steps

- Use the full dataset (remove `review_df.sample` and `df.sample`)
- Add more features to dataset, specifically:
    - Do clustering (e.g., k-means) on users and businesses; the cluster labels will be a new feature we can use; this is called a mixed cluster network.
    - Do sentiment analysis on review text; the sentiment categories will be a new feature we can use.
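
A possible sketch of the clustering idea, using scikit-learn's `KMeans` on made-up two-dimensional user features. The choice of k-means, the number of clusters, and the feature values are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical user features: two well-separated groups of 20 users each.
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
user_features = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(user_features)

# cluster_labels holds one integer per user and could be added as a column.
print(np.bincount(cluster_labels))
```

The resulting label is categorical, so it would need one-hot encoding or its own embedding before being fed to the network.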