This notebook contains an adapted top popular recommender that takes into account workout video data from the YouTube API (views, likes, comments) and compares its performance with the regular top popular recommender, which ranks workouts purely by their number of interactions (for offline testing, this is just the number of Fitness Blender comments on each workout).

In [1]:
# imports
import pandas as pd
import numpy as np
from sklearn.metrics import ndcg_score
from sklearn.preprocessing import MinMaxScaler

# get_data function from src
import sys
sys.path.insert(0,'../src/data')
from model_preprocessing import get_data 

# no warnings
import warnings
warnings.filterwarnings('ignore')

  "LightFM was compiled without OpenMP support. "


In [2]:
# get data: this assumes data collection/preprocessing has been previously run
data_dct = get_data('../data/preprocessed/user_item_interactions.csv') # dictionary containing training, testing, etc
yt = pd.read_csv('../data/raw/workouts_yt.csv') # youtube api data

Num users: 4026, num_items 580.


We adapt our original top popular model to optionally take YouTube data into account. For example, you can specify that YouTube likes should be included in the recommender. In this case, top popular will recommend the workouts with the highest combined interaction count and like count. Since these counts (interactions, views, likes, comments) can differ widely in range, each attribute is first scaled to a value between 0 and 1 using sklearn's MinMaxScaler, and the scaled values are then summed.
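The scale-then-sum step can be illustrated with a minimal sketch on toy data. The workout ids and counts below are made up for illustration and are not taken from the dataset. Unlike the function in the next cell, this sketch keeps both attributes in one dataframe indexed by workout id, so the rows are aligned by construction.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: three workouts with made-up counts.
toy = pd.DataFrame(
    {
        "interactions": [10, 40, 20],
        "like_count": [300, 100, 500],
    },
    index=pd.Index(["w1", "w2", "w3"], name="workout_id"),
)

# Scale each column independently to [0, 1], then sum across columns.
scaled = MinMaxScaler().fit_transform(toy)
combined = pd.Series(scaled.sum(axis=1), index=toy.index)

# Rank workouts by combined score, highest first.
ranking = combined.sort_values(ascending=False)
print(ranking)
```

By interactions alone, w2 would rank first. Once the scaled like count is added, w3 moves to the top, which shows how an included attribute can reorder the recommendations.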

In [3]:
# adapted from src/models/top_popular.py

def top_popular(ui_df, yt_df, include=None, k=None):
    """
    Adapted top popular (different from the one in src/models/top_popular.py used for offline testing)
    
    Arguments:
    - ui_df: User-item interactions dataframe
    - yt_df: Dataframe from youtube API
    - include: List of attributes from youtube API to use in the recommender. If none, 
               this recommender is the same as the regular top popular model seen in 
               src/models/top_popular.py
    - k: Top k recommendations are returned. If None, all recommendations are returned.
    
    Returns:
    - preds: List of predicted workout ids, with highest ranked first. Workout ids are external workout ids
    - scores: Corresponding scores to the predictions. Note: with the regular top popular recommender, 
              the score of each workout is the number of interactions for that workout. If include is not
              None, the scores will be the sum of the interactions and specified youtube attributes, 
              which are each scaled to a number between 0 and 1 with MinMaxScaler.
    """
    if include is None: # same as original top_popular recommender
        workout_counts = ui_df.groupby(
            'workout_id').size().sort_values(ascending=False)
        preds = np.array(workout_counts.index)
        scores = np.array(workout_counts.values)

    else:
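        mms = MinMaxScaler()
        workout_counts = ui_df.groupby('workout_id').size()
        # scale interaction counts to [0, 1]
        scores_scaled = mms.fit_transform(workout_counts.to_numpy().reshape(-1, 1)).ravel()
        # scale the youtube attributes to [0, 1], each column independently
        # NOTE: assumes the rows of yt_df are in the same order as workout_counts.index
        # (sorted workout_id) and cover exactly the same workouts
        yt_scaled = mms.fit_transform(yt_df[['view_count', 'like_count', 'comment_count']])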
        
        # sum to get final score
        if 'view_count' in include:
            scores_scaled += yt_scaled[:,0]
        if 'like_count' in include:
            scores_scaled += yt_scaled[:,1]
        if 'comment_count' in include:
            scores_scaled += yt_scaled[:,2]
        
        # get final predictions/scores sorted
        tot = pd.Series(scores_scaled,index=workout_counts.index).sort_values(ascending=False)
        preds = np.array(tot.index)
        scores = np.array(tot.values)
        
    if k is None:
        return preds, scores
    else:
        return preds[:k], scores[:k]
    
def get_target_scores(external_indices, scores, item_map):
    """
    Same as seen in src/models/top_popular.py
    
    Helper function to build the input for sklearn's ndcg_score:
    Given workout ids and their popularity scores, as well as a dictionary mapping
    external ids to LightFM internal ids, return the list of popularity scores
    ordered by LightFM internal id
    """
    internal_indices = [item_map[i] for i in external_indices]
    scores_by_internal = np.zeros(len(item_map.values()))
    scores_by_internal.put(internal_indices, scores)
    return scores_by_internal


def evaluate_top_popular(train_df, test_ui_matrix, item_map, yt_df, include=None, k=None):
    """
    Adapted to use new top popular model.
    
    Takes in training/testing data and returns average NDCG
    for top popular recommender
    """
    y_true = test_ui_matrix.toarray()
    external_indices, scores = top_popular(train_df, yt_df, include=include)
    y_score = get_target_scores(external_indices, scores, item_map)
    # every user gets the same popularity-based scores
    y_scores = np.tile(y_score, (y_true.shape[0], 1))

    return ndcg_score(y_true, y_scores, k=k)

We fit the recommender on the training dataset and evaluate NDCG (with k=20) on the testing dataset. We compare the original top popular recommender with variants that add view count, like count, and comment count, both individually and combined. The original top popular recommender achieves the highest NDCG (0.0989), and adding any YouTube data lowers it: comment count gives 0.0845, like count 0.0742, view count 0.0739, and all three combined 0.0646.
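As a rough illustration of how this evaluation works, the sketch below scores a toy case where every user receives the same popularity ranking. The interaction matrix and item scores are hypothetical, and `k=2` is chosen only to keep the example small.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical test interactions: 2 users x 4 items (1 = user did the workout).
y_true = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
])

# One popularity score per item, shared by all users.
item_scores = np.array([5.0, 1.0, 3.0, 4.0])
y_score = np.tile(item_scores, (y_true.shape[0], 1))

# k is keyword-only in recent scikit-learn releases.
score = ndcg_score(y_true, y_score, k=2)
print(score)
```

The top two items by popularity are exactly the first user's two test items (NDCG of 1), while the second user's only test item is the least popular one (NDCG of 0), so the average is 0.5. A non-personalized recommender can therefore only score well for users whose tastes match the overall ranking.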

In [4]:
# the original top popular recommender
k=20 # k parameter for ndcg
top_pop_ndcg = evaluate_top_popular(data_dct['train_df'],
                                    data_dct['test_ui_matrix'],
                                    data_dct['item_map'],
                                    yt,
                                    include=None,
                                    k=k)
print(str(top_pop_ndcg))

0.09887370285575565


In [5]:
# including view count
top_pop_ndcg = evaluate_top_popular(data_dct['train_df'],
                                    data_dct['test_ui_matrix'],
                                    data_dct['item_map'],
                                    yt,
                                    include=['view_count'],
                                    k=k)
print(str(top_pop_ndcg))

0.07385714923369312


In [6]:
# including like count
top_pop_ndcg = evaluate_top_popular(data_dct['train_df'],
                                    data_dct['test_ui_matrix'],
                                    data_dct['item_map'],
                                    yt,
                                    include=['like_count'],
                                    k=k)
print(str(top_pop_ndcg))

0.07417811298862918


In [7]:
# including comment count
top_pop_ndcg = evaluate_top_popular(data_dct['train_df'],
                                    data_dct['test_ui_matrix'],
                                    data_dct['item_map'],
                                    yt,
                                    include=['comment_count'],
                                    k=k)
print(str(top_pop_ndcg))

0.08445299011940773


In [8]:
# including view_count, like_count, comment_count
top_pop_ndcg = evaluate_top_popular(data_dct['train_df'],
                                    data_dct['test_ui_matrix'],
                                    data_dct['item_map'],
                                    yt,
                                    include=['view_count','like_count','comment_count'],
                                    k=k)
print(str(top_pop_ndcg))

0.06464159834645176
