# Advanced Model

In [53]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dot, Dense, Concatenate
from keras.optimizers import Adam
# pip install textblob
from textblob import TextBlob
from datetime import datetime

np.random.seed(209)
t0 = datetime.now()

## Intro

Previously we built a base model that used a 2D user-item interaction matrix under the hood and made predictions by calculating cosine similarity scores. Now we want to make the model more robust, which we will do by performing the following:

- Extend the 2D user-item matrix into a tensor by adding in features from both the users' side and items' side
- Use a neural network

This is known as the neural collaborative filtering (NCF) approach.
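For contrast with the neural approach, here is a minimal sketch of the kind of cosine-similarity computation a base model like the one described above might perform. The toy matrix and the helper name `cosine_similarity_matrix` are assumptions for illustration, not the base model's actual code.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: restaurants); 0 = no rating.
ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 5.0],
])

def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of `matrix`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for users with no ratings.
    normalized = matrix / np.where(norms == 0, 1.0, norms)
    return normalized @ normalized.T

user_similarity = cosine_similarity_matrix(ratings)
print(np.round(user_similarity, 3))
```

A similarity matrix like this only sees the ratings themselves; the NCF model below can additionally take user and business features as inputs.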

## Prep Data

We will now load in datasets from all necessary sources: users, restaurants, and reviews:

In [2]:
%%time
user_df = pd.read_feather('../data/yelp_user_cleaned.feather')  # 100 MB
business_df = pd.read_feather('../data/yelp_business_cleaned.feather')  # 40 MB
review_df = pd.read_feather('../data/yelp_review_cleaned.feather')  # 2.3 GB

CPU times: user 5.27 s, sys: 5.48 s, total: 10.7 s
Wall time: 20.6 s


Sanity check:

In [3]:
user_df.user_id.unique().shape, review_df.user_id.unique().shape

((1532223,), (1532233,))

In [4]:
business_df.business_id.unique().shape, review_df.business_id.unique().shape

((68054,), (68054,))
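The user counts above differ by ten (1,532,223 unique IDs in `user_df` versus 1,532,233 in `review_df`), which suggests some reviewers have no row in the user table. Since `pd.merge` defaults to an inner join, reviews from such users would be dropped when merging. A minimal sketch of how one might list the missing IDs, using toy stand-in frames with hypothetical IDs:

```python
import pandas as pd

# Toy stand-ins for user_df and review_df (hypothetical IDs).
toy_users = pd.DataFrame({'user_id': ['u1', 'u2', 'u3']})
toy_reviews = pd.DataFrame({'user_id': ['u1', 'u2', 'u2', 'u4']})

# IDs that appear in reviews but have no row in the user table.
missing_users = set(toy_reviews['user_id']) - set(toy_users['user_id'])
print(f'{len(missing_users)} reviewer ID(s) missing from the user table: {sorted(missing_users)}')
```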

Take a sample out of concern for hardware:

In [5]:
review_df = review_df.sample(100_000, random_state=42)

We will use the review dataset to build our user-item matrix.

Previously we kept only the columns `['user_id', 'business_id', 'stars']` to build the user-item matrix; now we also include more features from the review data: `['useful', 'funny', 'cool']`. There is also a `text` column, on which we could run sentiment analysis to derive a new feature. We will not do that in this section (it could be a future extension), but we keep the column for now so that we can save the text shortly.

In [6]:
review_df = review_df.loc[:, ['user_id', 'business_id', 'stars', 'useful', 'funny', 'cool', 'text']]
review_df.head(3)

| | user_id | business_id | stars | useful | funny | cool | text |
|---|---|---|---|---|---|---|---|
| 1322294 | 0lpxU4Dfi8AeBt0SeCrEuw | tQKqrLs16Xi-lFrd3_CBAQ | 1 | 2 | 0 | 0 | My friends and I went there on a Friday night ... |
| 4297632 | 5nw1Zc3fi_ehDJFd3mUEYA | nLxNJuvgoHQHn_IGYifRnw | 1 | 1 | 0 | 0 | Clean, friendly waitstaff. The food, well, to... |
| 2143059 | 7fDqaGdUMccXQ4bnPwR6yg | etaIhl-sduOKc6J_qHmmtA | 3 | 2 | 0 | 2 | Super swanky lunch spot in Clayton. I love, l... |


We also subset and rename columns for the other two data frames before merging all:

In [7]:
user_df.drop(['name'], axis=1, inplace=True)

In [8]:
# why we rename here instead of using merge suffixes:
# we will later need to know which features belong to users,
# and keeping explicit column names is more convenient
user_df.rename({'useful': 'useful_user',
                'funny': 'funny_user',
                'cool': 'cool_user',
                'review_count': 'review_count_user'}, axis=1, inplace=True)

In [9]:
# note: we've already expanded the dictionaries such as attributes in business_data_inspect.ipynb
business_df.drop(['original_index', 'name', 'address', 'city', 'state',
                  'postal_code', 'latitude', 'longitude',
                  'attributes', 'categories', 'is_restaurant', 'GoodForMeal', 'BestNights'], axis=1, inplace=True)

In [10]:
business_df.rename({'stars': 'stars_business',
                    'review_count': 'review_count_business'}, axis=1, inplace=True)

Business data has categorical features, which we need to one-hot encode (but we do not encode the IDs):

In [11]:
dummies = pd.get_dummies(business_df.drop('business_id', axis=1), drop_first=True)

In [12]:
business_df = pd.concat([business_df.loc[:, ['business_id']], dummies], axis=1)
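To make the encoding step concrete, here is a small sketch on a toy frame. The `NoiseLevel` column and its values are hypothetical stand-ins for the real categorical attributes. With `drop_first=True`, the first category (in sorted order) is dropped, so a feature with k categories yields k - 1 indicator columns.

```python
import pandas as pd

toy_business = pd.DataFrame({
    'business_id': ['b1', 'b2', 'b3'],
    'stars_business': [4.0, 3.5, 5.0],
    'NoiseLevel': ['quiet', 'loud', 'average'],  # hypothetical categorical column
})

# Encode everything except the ID, then attach the ID column back.
encoded = pd.get_dummies(toy_business.drop('business_id', axis=1), drop_first=True)
toy_encoded = pd.concat([toy_business[['business_id']], encoded], axis=1)
print(toy_encoded.columns.tolist())
```

Numeric columns such as `stars_business` pass through unchanged; only object or categorical columns are expanded.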

Next, we merge all three data frames together:

In [13]:
df = pd.merge(review_df, user_df, on='user_id')

In [14]:
df = pd.merge(df, business_df, on='business_id')
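When merging, it can help to confirm that each review matches exactly one user (or business) row and to see which reviews find no match. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on toy frames; the frames and their values are assumptions for illustration.

```python
import pandas as pd

toy_reviews = pd.DataFrame({'user_id': ['u1', 'u2', 'u4'], 'stars': [5, 3, 1]})
toy_users = pd.DataFrame({'user_id': ['u1', 'u2', 'u3'], 'fans': [10, 0, 2]})

# validate='many_to_one' raises if user_id is duplicated in toy_users;
# indicator=True adds a `_merge` column describing where each row came from.
checked = pd.merge(toy_reviews, toy_users, on='user_id', how='left',
                   validate='many_to_one', indicator=True)

# Reviews whose user has no row in the user table.
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['user_id', 'stars']])
```

With the default inner join used in the notebook, rows like the unmatched one here would simply disappear from the result.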

Remember to remove text:

In [15]:
df.drop('text', axis=1, inplace=True)

Take a further sample of the merged data, again out of concern for hardware:

In [16]:
df = df.sample(10_000, random_state=42).reset_index(drop=True)

In [17]:
df.shape

(10000, 7470)

Filter out unneeded rows in `user_df`, `business_df`, and `review_df` to match `df`:

In [18]:
user_df = user_df.loc[user_df['user_id'].isin(df['user_id']), :].copy().reset_index(drop=True)
business_df = business_df.loc[business_df['business_id'].isin(df['business_id']), :].copy().reset_index(drop=True)
review_df = review_df.loc[(review_df['user_id'].isin(df['user_id'])) & \
                    (review_df['business_id'].isin(df['business_id'])), :].copy().reset_index(drop=True)

Remove `text` from `review_df` for now, but save it.

In [19]:
review_text = review_df['text'].copy()

In [20]:
review_df.drop('text', axis=1, inplace=True)

Encode `user_id` and `business_id` to convert the strings into integers:

In [21]:
user_id_encoder = LabelEncoder()
business_id_encoder = LabelEncoder()

user_df['user_id'] = user_id_encoder.fit_transform(user_df['user_id'])
review_df['user_id'] = user_id_encoder.transform(review_df['user_id'])
business_df['business_id'] = business_id_encoder.fit_transform(business_df['business_id'])
review_df['business_id'] = business_id_encoder.transform(review_df['business_id'])

df['user_id'] = user_id_encoder.transform(df['user_id'])
df['business_id'] = business_id_encoder.transform(df['business_id'])
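As a small illustration of what `LabelEncoder` does here: it maps each distinct string to an integer code (assigned in sorted order), can map codes back to the original strings, and raises a `ValueError` when asked to transform a label it did not see during fitting. This is why the frames are filtered to match `df` before encoding. The IDs below are made up.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Codes follow the sorted order of the distinct labels: u_a -> 0, u_b -> 1, u_c -> 2.
codes = encoder.fit_transform(['u_b', 'u_a', 'u_c', 'u_a'])
restored = encoder.inverse_transform(codes)

# Transforming a label that was not seen during fit raises ValueError.
try:
    encoder.transform(['u_zzz'])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(codes, list(restored), unseen_ok)
```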

## Modeling

Train test split:

In [22]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

Feature scaling:

In [23]:
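def feature_scaling(df_train, df_test):
    # save stars (target, do not scale)
    stars_train = df_train['stars'].copy()
    stars_test = df_test['stars'].copy()

    # use the column names of the frame being scaled
    # rather than relying on the global `df`
    columns = df_train.columns

    # scale features (fit on the training set only, then apply to the test set)
    # note: this also rescales `user_id` and `business_id` into [0, 1],
    # so the Embedding inputs are no longer the integer codes from LabelEncoder
    scaler = MinMaxScaler()
    df_train = scaler.fit_transform(df_train)
    df_test = scaler.transform(df_test)

    # convert back to data frames
    df_train = pd.DataFrame(df_train, columns=columns)
    df_test = pd.DataFrame(df_test, columns=columns)

    # restore stars
    df_train['stars'] = stars_train.values
    df_test['stars'] = stars_test.values

    return df_train, df_test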

In [24]:
df_train, df_test = feature_scaling(df_train, df_test)
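One thing to note about `feature_scaling` above: `MinMaxScaler` is applied to every column, including `user_id` and `business_id`, so the ID columns end up in [0, 1] rather than as integer codes. A possible variant that leaves the IDs and the target untouched is sketched below. The function name `scale_features_only` and the toy data are assumptions, and this is not the code that produced the results reported in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_features_only(train, test, keep=('user_id', 'business_id', 'stars')):
    """Min-max scale all columns except those listed in `keep`."""
    feature_cols = [c for c in train.columns if c not in keep]

    # Fit on the training set only, then apply the same scaling to the test set.
    scaler = MinMaxScaler().fit(train[feature_cols])

    # Cast feature columns to float first so the scaled values fit the dtype.
    train_scaled = train.astype({c: 'float64' for c in feature_cols})
    test_scaled = test.astype({c: 'float64' for c in feature_cols})
    train_scaled[feature_cols] = scaler.transform(train[feature_cols])
    test_scaled[feature_cols] = scaler.transform(test[feature_cols])
    return train_scaled, test_scaled

toy_train = pd.DataFrame({'user_id': [0, 1, 2], 'business_id': [2, 0, 1],
                          'stars': [5, 3, 1], 'useful': [0, 10, 5]})
toy_test = pd.DataFrame({'user_id': [1], 'business_id': [2],
                         'stars': [4], 'useful': [20]})

toy_train_scaled, toy_test_scaled = scale_features_only(toy_train, toy_test)
print(toy_train_scaled)
print(toy_test_scaled)
```

Test values outside the training range (such as `useful = 20` here) scale to values above 1, because the scaler only knows the training minimum and maximum.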

Build model:

In [25]:
def build_model(df=df, user_df=user_df, business_df=business_df):
    # helper constants
    num_users = df['user_id'].nunique()
    num_businesses = df['business_id'].nunique()
    
    # Model architecture:
    
    # embeddings for id's
    user_input = Input(shape=(1,), name='user_input')
    user_embedding = Embedding(num_users, 16, name='user_embedding')(user_input)
    user_flatten = Flatten(name='user_flatten')(user_embedding)

    business_input = Input(shape=(1,), name='business_input')
    business_embedding = Embedding(num_businesses, 16, name='business_embedding')(business_input)
    business_flatten = Flatten(name='business_flatten')(business_embedding)

    dot_product = Dot(axes=1, name='dot_product')([user_flatten, business_flatten])

    # add in user and business features
    user_features_input = Input(shape=(user_df.shape[1] - 1,), name='user_features_input')
    business_features_input = Input(shape=(business_df.shape[1] - 1,), name='business_features_input')
    concat_features = Concatenate(name='concat_features')([dot_product, user_features_input, business_features_input])

    dense_layer = Dense(64, activation='relu', name='dense_layer')(concat_features)
    output = Dense(1, activation='linear', name='output')(dense_layer)

    model = Model(inputs=[user_input, business_input, user_features_input, business_features_input], outputs=output)
    model.compile(optimizer=Adam(0.0001), loss='mean_squared_error')
    
    return model

In [26]:
model = build_model()
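To make the data flow through `build_model` easier to follow, here is a NumPy-only sketch of the same forward pass: embedding lookup, dot product, concatenation with the side features, one ReLU layer of 64 units, and a linear output. The weights are random and untrained, and the sizes (`num_users`, `n_user_feats`, and so on) are hypothetical; it only illustrates the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_businesses, emb_dim = 5, 4, 16
n_user_feats, n_business_feats = 3, 2  # hypothetical feature counts

# Randomly initialized parameters standing in for the trained weights.
user_emb = rng.normal(size=(num_users, emb_dim))
business_emb = rng.normal(size=(num_businesses, emb_dim))
W1 = rng.normal(size=(1 + n_user_feats + n_business_feats, 64))
b1 = np.zeros(64)
W2 = rng.normal(size=(64, 1))
b2 = np.zeros(1)

def forward(user_ids, business_ids, user_feats, business_feats):
    # Embedding lookup followed by a per-row dot product: shape (batch, 1).
    dot = np.sum(user_emb[user_ids] * business_emb[business_ids], axis=1, keepdims=True)
    # Concatenate the interaction term with the side features.
    x = np.concatenate([dot, user_feats, business_feats], axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)  # dense layer with ReLU
    return hidden @ W2 + b2                # linear output: predicted stars

preds = forward(np.array([0, 3]), np.array([1, 2]),
                rng.random((2, n_user_feats)), rng.random((2, n_business_feats)))
print(preds.shape)
```

Note that the embedding lookup indexes by integer ID, which is why the IDs are label-encoded beforehand.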

Train:

In [27]:
def train_model(model, df_train=df_train, user_df=user_df, business_df=business_df):
    train_inputs = [
        df_train['user_id'].values,
        df_train['business_id'].values,
        df_train[user_df.columns[1:]].values,
        df_train[business_df.columns[1:]].values
    ]

    model.fit(train_inputs, df_train['stars'].values, epochs=15, validation_split=0.1)

In [28]:
%%time
train_model(model)

Epoch 1/15


Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
CPU times: user 20.6 s, sys: 5.59 s, total: 26.2 s
Wall time: 10.3 s


Test:

In [29]:
def test_model(model, df_test=df_test, user_df=user_df, business_df=business_df):
    test_inputs = [
        df_test['user_id'].values,
        df_test['business_id'].values,
        df_test[user_df.columns[1:]].values,
        df_test[business_df.columns[1:]].values
    ]

    test_loss = model.evaluate(test_inputs, df_test['stars'].values)
    print(f'Test loss (MSE): {round(test_loss, 3)}')

In [30]:
test_model(model)

Test loss (MSE): 1.466


The base model had an MSE of 1.96, so the advanced model's test MSE of 1.466 is a clear improvement.
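When reading MSE values like these, a simple reference point can help: the error of always predicting the mean training rating. The sketch below computes that reference on made-up star ratings; the helper name `mse` and the toy arrays are assumptions, and no such reference was computed in this notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

toy_train_stars = np.array([5, 4, 1, 3, 5])
toy_test_stars = np.array([4, 2, 5])

# Predict the training mean for every test row.
constant_prediction = np.full(len(toy_test_stars), toy_train_stars.mean())
mean_baseline_mse = mse(toy_test_stars, constant_prediction)
print(f'Mean-prediction MSE: {mean_baseline_mse:.3f}, RMSE: {mean_baseline_mse ** 0.5:.3f}')
```

The RMSE is in the same units as the target (stars), which can make the size of the error easier to interpret.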

## Model Extension 1: Cluster Labels

We'd like to improve the performance of our model by adding in new features to the training set. In this section, we will perform clustering on users and businesses; the cluster labels will be a new feature we can use. This is called a mixed cluster network.

Scale `user_df`:

In [31]:
def scale(df):
    scaler = MinMaxScaler()
    return scaler.fit_transform(df)

In [32]:
user_df_scaled = scale(user_df)

Apply dimensionality reduction:

In [33]:
def pca(df_scaled):
    pca = PCA(n_components=10)
    return pca.fit_transform(df_scaled)

In [34]:
user_df_pca = pca(user_df_scaled)
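The choice of `n_components=10` is fixed in the function above. One way to judge whether ten components retain enough information is to look at the cumulative explained variance, available through the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses random toy data as a stand-in for the scaled user features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Random toy data standing in for the scaled user features.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 20))
toy_scaled = MinMaxScaler().fit_transform(toy_features)

pca_model = PCA(n_components=10).fit(toy_scaled)

# Fraction of total variance retained by the first k components, for k = 1..10.
cumulative = np.cumsum(pca_model.explained_variance_ratio_)
print(np.round(cumulative, 3))
```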

Hyperparameter tuning:

In [35]:
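def tune_clusterer(df_pca):
    best_score = -1
    # initialize so the function does not fail if no valid clustering is found
    best_eps, best_min_samples = None, None

    for eps in np.arange(0.1, 2, 0.1):
        # range(2, 3) only tries min_samples=2; widen it to search more values
        for min_samples in range(2, 3):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(df_pca)
            # count clusters, excluding the noise label (-1)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            if n_clusters > 1:
                # note: noise points (-1) are scored as their own group here
                score = silhouette_score(df_pca, labels)
                if score > best_score:
                    best_score = score
                    best_eps = eps
                    best_min_samples = min_samples

    print("Best Silhouette Score:", best_score)
    print("Best eps:", best_eps)
    print("Best min_samples:", best_min_samples)
    
    return best_eps, best_min_samples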

In [36]:
%%time
best_eps, best_min_samples = tune_clusterer(user_df_pca)

Best Silhouette Score: 0.5545330937146582
Best eps: 0.30000000000000004
Best min_samples: 2
CPU times: user 18.3 s, sys: 4.73 s, total: 23 s
Wall time: 22.2 s
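In `tune_clusterer`, the silhouette score is computed on all points, so DBSCAN's noise points (label `-1`) are treated as if they formed a cluster of their own. An alternative, sketched below under that observation, is to score only the points that were assigned to a cluster. The helper name `silhouette_without_noise` and the synthetic blobs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def silhouette_without_noise(points, labels):
    """Silhouette score over clustered points only (noise label -1 excluded)."""
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return None  # the score is undefined with fewer than two clusters
    return silhouette_score(points[mask], labels[mask])

# Two tight synthetic blobs plus one far-away outlier.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=3.0, scale=0.1, size=(30, 2))
outlier = np.array([[10.0, 10.0]])
points = np.vstack([blob_a, blob_b, outlier])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
score = silhouette_without_noise(points, labels)
print(sorted(set(labels.tolist())), score)
```

Whether to include noise in the score is a design choice; excluding it measures how well separated the actual clusters are, but it can reward settings that label many points as noise.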


Get clusters:

In [37]:
def generate_clusters(df_pca, best_eps, best_min_samples):
    dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
    labels = dbscan.fit_predict(df_pca)
    return labels

In [38]:
user_df['user_clusters'] = generate_clusters(user_df_pca, best_eps, best_min_samples)
np.unique(user_df['user_clusters'])

array([-1,  0,  1])
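The unique labels alone do not show how the points are distributed across the clusters. A quick way to check cluster sizes and the share of noise points (label `-1`) is sketched below on a made-up label array; in the notebook, the same calls could be applied to `user_df['user_clusters']`.

```python
import numpy as np

# Made-up DBSCAN labels; -1 marks noise points.
toy_labels = np.array([-1, 0, 0, 0, 1, 1, -1, 0])

values, counts = np.unique(toy_labels, return_counts=True)
cluster_sizes = {int(v): int(c) for v, c in zip(values, counts)}
noise_fraction = float(np.mean(toy_labels == -1))

print(cluster_sizes, noise_fraction)
```

If nearly all points fall into a single cluster, the label carries little information as a model feature, even when the silhouette score looks reasonable.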

Evaluation:

In [39]:
print(f"Silhouette Score: {silhouette_score(user_df_pca, user_df['user_clusters'])}")

Silhouette Score: 0.5545330937146582


Perform the same procedures for `business_df`:

In [40]:
%%time
business_df_scaled = scale(business_df)
business_df_pca = pca(business_df_scaled)
best_eps, best_min_samples = tune_clusterer(business_df_pca)
business_df['business_clusters'] = generate_clusters(business_df_pca, best_eps, best_min_samples)
print(f"Silhouette Score: {silhouette_score(business_df_pca, business_df['business_clusters'])}")

Best Silhouette Score: 0.11639984752658532
Best eps: 1.9000000000000001
Best min_samples: 2
Silhouette Score: 0.11639984752658532
CPU times: user 27.4 s, sys: 4.35 s, total: 31.8 s
Wall time: 17.7 s


## Model Extension 2: Sentiment Labels

As with the first extension, we will now perform sentiment analysis on the review text; the sentiment scores will be another new feature we can use. We considered fine-tuning BERT to build our own sentiment analysis model; however, fine-tuning requires labeled data (ground-truth sentiment labels), which we do not have. So instead, we will use an off-the-shelf tool. Transformer models are slow to run on this much text, so we use the Python library `TextBlob`, which is much faster. Its polarity score ranges from -1 (negative) to 1 (positive).

In [41]:
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

In [42]:
%%time
review_df['review_sentiment'] = review_text.apply(get_sentiment)

CPU times: user 3.95 s, sys: 111 ms, total: 4.06 s
Wall time: 4.13 s


In [43]:
review_df['review_sentiment'].head()

0    0.535714
1    0.159977
2    0.733333
3   -0.551515
4    0.097064
Name: review_sentiment, dtype: float64

In [44]:
review_df['review_sentiment'].describe()

count    11946.000000
mean         0.260394
std          0.230792
min         -1.000000
25%          0.130695
50%          0.264103
75%          0.398842
max          1.000000
Name: review_sentiment, dtype: float64
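The feature stored above is a continuous polarity score. If discrete sentiment labels are preferred, one option is to bin the polarity with `pd.cut`. The sketch below reuses the five polarity values shown earlier; the cut points of -0.1 and 0.1 are arbitrary assumptions, not thresholds used in this notebook.

```python
import pandas as pd

# The first five polarity scores shown above.
toy_polarity = pd.Series([0.535714, 0.159977, 0.733333, -0.551515, 0.097064])

# Hypothetical thresholds: below -0.1 is negative, above 0.1 is positive.
sentiment_label = pd.cut(
    toy_polarity,
    bins=[-1.0, -0.1, 0.1, 1.0],
    labels=['negative', 'neutral', 'positive'],
    include_lowest=True,  # so that a polarity of exactly -1.0 is binned too
)
print(sentiment_label.tolist())
```

A categorical label would need to be one-hot encoded before being fed to the model, whereas the continuous score can be used directly.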

## Re-train Model with Extensions

In [45]:
df = pd.merge(review_df, user_df, on='user_id')
df = pd.merge(df, business_df, on='business_id')

In [46]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [47]:
df_train, df_test = feature_scaling(df_train, df_test)

In [48]:
model = build_model(df=df, user_df=user_df, business_df=business_df)

In [49]:
%%time
train_model(model, df_train=df_train, user_df=user_df, business_df=business_df)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
CPU times: user 23.7 s, sys: 6.66 s, total: 30.4 s
Wall time: 12 s


In [50]:
test_model(model, df_test=df_test, user_df=user_df, business_df=business_df)

Test loss (MSE): 1.377


In [51]:
t1 = datetime.now()

In [52]:
print(f'Time elapsed running notebook: {t1 - t0}')

Time elapsed running notebook: 0:01:49.898218


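With the extensions, the test MSE is 1.377, compared with 1.466 without them, so performance is similar with a small gain. The two runs are not strictly comparable, because `df` was rebuilt from the filtered `review_df`, which has 11,946 rows rather than the 10,000 sampled earlier. Either way, the model still beats the base model (MSE 1.96).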
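To put the three MSE values reported in this notebook side by side, the sketch below computes the relative reduction in MSE between them. The helper name `relative_reduction` is an assumption; the numbers are the ones printed above.

```python
# Test MSE values reported in this notebook.
mse_scores = {
    'base': 1.96,
    'advanced': 1.466,
    'advanced + extensions': 1.377,
}

def relative_reduction(before, after):
    """Fraction by which the error dropped going from `before` to `after`."""
    return (before - after) / before

base_to_advanced = relative_reduction(mse_scores['base'], mse_scores['advanced'])
advanced_to_extended = relative_reduction(mse_scores['advanced'],
                                          mse_scores['advanced + extensions'])

print(f'Base -> advanced: {base_to_advanced:.1%}')
print(f'Advanced -> advanced + extensions: {advanced_to_extended:.1%}')
```

These are single-run figures on one random split, so small differences between them should be read with caution.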