# Model Evaluation | Movielens 25M Dataset with Visual Enrichment

## Movielens 25M Dataset
This notebook uses the [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/), which describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. These data were created by 162541 users between January 09, 1995 and November 21, 2019. The dataset was generated on November 21, 2019.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of these files are given in the dataset's README.

## Visual Enrichment
The "tmdbId" column from the MovieLens dataset is used to query the [The Movie Database (TMDb) API](https://www.themoviedb.org/documentation/api). The corresponding movie poster URL and image file are stored for later use in the enrichment process.
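As a minimal sketch of the lookup step, the snippet below attaches "tmdbId" to each rating by joining on "movieId". It assumes the column layout of the MovieLens `ratings.csv` and `links.csv` files; the inline rows are a tiny illustrative sample, not the real data, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Tiny inline stand-ins for ratings.csv and links.csv (illustrative rows only).
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,5.0,1445714835\n"
)
links_csv = io.StringIO(
    "movieId,imdbId,tmdbId\n"
    "1,0114709,862\n"
    "2,0113497,8844\n"
    "3,0113228,15602\n"
)

ratings = pd.read_csv(ratings_csv)
# Read imdbId as a string so its leading zeros are preserved.
links = pd.read_csv(links_csv, dtype={"imdbId": str})

# A left join keeps every rating, even when a movie has no tmdbId.
ratings_with_tmdb = ratings.merge(
    links[["movieId", "tmdbId"]], on="movieId", how="left"
)
print(ratings_with_tmdb)
```

Each distinct "tmdbId" in the joined frame is then the key that would be sent to the TMDb API to look up the poster.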

Once the poster for each movie has been retrieved, the image is sent to [Azure Computer Vision](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview) for analysis and metadata generation. The resulting features are then used to enrich the MovieLens dataset:
* [Categories](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-categorizing-images)
* [Color](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-color-schemes)
* [Tags](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-tagging-images)
* [Description](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-describing-images)
* [Celebrities](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-detecting-domain-content)
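
The prepared dataset loaded below stores these features in fixed-width columns such as `categories_0`, `color_0`, `tags_0` and `description_0`. The sketch below shows one way such columns could be produced from a single analysis result. It is not the pipeline used to build the dataset. The helper `flatten_poster_features` is hypothetical. The dictionary keys (`categories`, `color.dominantColors`, `tags`, `description.tags`) are assumed to follow the Analyze Image response format. The sample values are hand-written.

```python
def flatten_poster_features(analysis, n=3):
    """Flatten one image-analysis result into fixed-width columns.

    Hypothetical helper: the keys read from `analysis` are assumed to match
    the Analyze Image response; missing entries are filled with None.
    """
    sources = {
        "categories": [c["name"] for c in analysis.get("categories", [])],
        "color": analysis.get("color", {}).get("dominantColors", []),
        "tags": [t["name"] for t in analysis.get("tags", [])],
        "description": analysis.get("description", {}).get("tags", []),
    }
    row = {}
    for prefix, values in sources.items():
        # Keep only the first n values per feature and pad the rest.
        for i in range(n):
            row[f"{prefix}_{i}"] = values[i] if i < len(values) else None
    return row


# Hand-written sample result, for illustration only.
sample_analysis = {
    "categories": [{"name": "text_", "score": 0.5}],
    "color": {"dominantColors": ["Orange", "Black"]},
    "tags": [
        {"name": "human face", "confidence": 0.98},
        {"name": "text", "confidence": 0.95},
        {"name": "poster", "confidence": 0.90},
        {"name": "art", "confidence": 0.50},
    ],
    "description": {"tags": ["text"], "captions": []},
}

row = flatten_poster_features(sample_analysis)
print(row)
```

Truncating to the first three values per feature keeps the table rectangular, at the cost of dropping lower-ranked tags.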

## 0 Global Settings and Imports

In [17]:
import pandas as pd

# Global Variables
SAMPLE_SIZE = 100000

## 1 Load Movielens Dataset

In [18]:
df = pd.read_csv("../../carve/datasets/carve-movielens-prepared.csv")
df = df.drop(["Unnamed: 0"], axis=1)
df = df.sample(n=SAMPLE_SIZE, random_state=0)
df.reset_index(inplace=True)

df.head()



Unnamed: 0,index,movieId,userId,rating,timestamp,genres_0,genres_1,genres_2,genres_3,genres_4,...,categories_2,color_0,color_1,color_2,tags_0,tags_1,tags_2,description_0,description_1,description_2
0,14781518,5418,50997.0,5.0,1198447000.0,Action,Mystery,Thriller,,,...,,Orange,Black,,human face,text,poster,text,,
1,17786929,44191,60716.0,4.0,1437543000.0,Action,Sci-Fi,Thriller,IMAX,,...,,Brown,Red,Black,book,human face,album cover,candle,text,dark
2,18917852,60069,1691.0,5.0,1534741000.0,Adventure,Animation,Children,Romance,Sci-Fi,...,,Black,Blue,,text,screenshot,cartoon,calendar,,
3,7110650,1376,152325.0,4.0,975537500.0,Adventure,Comedy,Sci-Fi,,,...,,Black,Pink,Yellow,book,painting,human face,text,book,
4,3918965,608,158007.0,4.0,832007300.0,Comedy,Crime,Drama,Thriller,,...,,White,,,stitch,text,embroidery,text,,


## 2 Evaluate Models

### 2.1 SAR Single Node on MovieLens (Python, CPU)

Simple Algorithm for Recommendation (SAR) is a fast and scalable algorithm for personalized recommendations based on user transaction history. It produces easily explainable and interpretable recommendations and handles "cold item" and "semi-cold user" scenarios. SAR is a neighborhood-based algorithm (as discussed in [Recommender Systems by Aggarwal](https://dl.acm.org/citation.cfm?id=2931100)) intended for ranking the top items for each user. More details about SAR can be found in the [deep dive notebook](https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/sar_deep_dive.ipynb).

SAR recommends items that are most ***similar*** to the ones that the user already has an existing ***affinity*** for. Two items are ***similar*** if the users that interacted with one item are also likely to have interacted with the other. A user has an ***affinity*** to an item if they have interacted with it in the past.
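
The two ideas above can be illustrated with a few lines of NumPy. This is a simplified sketch, not the `recommenders` implementation: it uses a made-up binary interaction matrix as the affinity, and it leaves out the rating weights, time decay and normalization that the SAR model in this notebook is configured with.

```python
import numpy as np

# Toy user-item matrix (rows: users, columns: items); 1 means "interacted".
# The values are made up for illustration.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

# Item co-occurrence: C[i, j] = number of users who interacted with both i and j.
cooccurrence = interactions.T @ interactions

# Jaccard similarity: C[i, j] / (C[i, i] + C[j, j] - C[i, j]).
item_counts = np.diag(cooccurrence)
similarity = cooccurrence / (
    item_counts[:, None] + item_counts[None, :] - cooccurrence
)

# Affinity: here simply the interaction indicator for each user.
affinity = interactions.astype(float)

# Scores: affinity multiplied by the item similarity matrix.
scores = affinity @ similarity

# Mask items the user has already seen, then pick the best remaining item.
scores_unseen = np.where(interactions > 0, -np.inf, scores)
top_item = scores_unseen.argmax(axis=1)
print(top_item)
```

User 0 has interacted with items 0 and 1, and item 2 co-occurs with both of them, so item 2 is that user's top recommendation.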

#### Advantages of SAR:
- High accuracy for an easy to train and deploy algorithm
- Fast training, only requiring simple counting to construct matrices used at prediction time. 
- Fast scoring, only involving multiplication of the similarity matrix with an affinity vector

#### Notes on using SAR properly:
- Since it does not use item or user features, it can be at a disadvantage against algorithms that do.
- It's memory-hungry, requiring the creation of an $m \times m$ sparse square matrix (where $m$ is the number of items). This can also be a problem for many matrix factorization algorithms.
- SAR favors an implicit rating scenario and it does not predict ratings.

This notebook provides an example of how to utilize and evaluate SAR in Python on a CPU.

In [19]:
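def modelSar(data, params):
    """
    Train and evaluate SAR on a ratings DataFrame.

    `data` needs the columns userId, movieId, rating and timestamp; `params` is unused.
    Returns a (model, evaluation_results) tuple.
    """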
    # Import packages
    %load_ext autoreload
    %autoreload 2

    import logging
    import numpy as np
    import pandas as pd
    import scrapbook as sb
    from sklearn.preprocessing import minmax_scale

    from recommenders.utils.python_utils import binarize
    from recommenders.utils.timer import Timer
    from recommenders.datasets import movielens
    from recommenders.datasets.python_splitters import python_stratified_split
    from recommenders.evaluation.python_evaluation import (
        map_at_k,
        ndcg_at_k,
        precision_at_k,
        recall_at_k,
        rmse,
        mae,
        logloss,
        rsquared,
        exp_var
    )
    from recommenders.models.sar import SAR
    import sys

    print("System version: {}".format(sys.version))
    print("Pandas version: {}".format(pd.__version__))

    #Global Variables
    TOP_K = 10

    #Start logging
    #logging.basicConfig(level=logging.DEBUG, 
    #                format='%(asctime)s %(levelname)-8s %(message)s')
    
    #SAR Code
    print("\nStarting SAR...")

    #Convert the float precision to 32-bit in order to reduce memory consumption 
    data['rating'] = data['rating'].astype(np.float32)

    #Split dataset
    train, test = python_stratified_split(data, ratio=0.75, col_user='userId', col_item='movieId', seed=0)

    #Print data summary
    print("""
    Train:
    Total Ratings: {train_total}
    Unique Users: {train_users}
    Unique Items: {train_items}

    Test:
    Total Ratings: {test_total}
    Unique Users: {test_users}
    Unique Items: {test_items}
    """.format(
        train_total=len(train),
        train_users=len(train['userId'].unique()),
        train_items=len(train['movieId'].unique()),
        test_total=len(test),
        test_users=len(test['userId'].unique()),
        test_items=len(test['movieId'].unique()),
    ))

    #Instantiate the SAR algorithm and set the index
    model = SAR(
        col_user="userId",
        col_item="movieId",
        col_rating="rating",
        col_timestamp="timestamp",
        similarity_type="jaccard", 
        time_decay_coefficient=30, 
        timedecay_formula=True,
        normalize=True
    )

    #Train the SAR model on our training data
    with Timer() as train_time:
        model.fit(train)
    print("Took {} seconds for training.".format(train_time.interval))
    
    #Get the top-k recommendations for our testing data
    with Timer() as test_time:
        top_k = model.recommend_k_items(test, remove_seen=True)
    print("Took {} seconds for prediction.".format(test_time.interval))
    
    #Evaluate model
    positivity_threshold = 2
    test_bin = test.copy()
    test_bin['rating'] = binarize(test_bin['rating'], positivity_threshold)

    top_k_prob = top_k.copy()
    top_k_prob['prediction'] = minmax_scale(
        top_k_prob['prediction'].astype(float)
    )

    eval_map = map_at_k(test, top_k, col_user='userId', col_item='movieId', col_rating='rating', k=TOP_K)
    eval_ndcg = ndcg_at_k(test, top_k, col_user='userId', col_item='movieId', col_rating='rating', k=TOP_K)
    eval_precision = precision_at_k(test, top_k, col_user='userId', col_item='movieId', col_rating='rating', k=TOP_K)
    eval_recall = recall_at_k(test, top_k, col_user='userId', col_item='movieId', col_rating='rating', k=TOP_K)
    eval_rmse = rmse(test, top_k, col_user='userId', col_item='movieId', col_rating='rating')
    eval_mae = mae(test, top_k, col_user='userId', col_item='movieId', col_rating='rating')
    eval_rsquared = rsquared(test, top_k, col_user='userId', col_item='movieId', col_rating='rating')
    eval_exp_var = exp_var(test, top_k, col_user='userId', col_item='movieId', col_rating='rating')
    eval_logloss = logloss(test_bin, top_k_prob, col_user='userId', col_item='movieId', col_rating='rating')

    evaluation_results = {"Top K": TOP_K,
                        "MAP": eval_map,
                        "NDCG": eval_ndcg,
                        "Precision": eval_precision,
                        "Recall": eval_recall,
                        "RMSE": eval_rmse,
                        "MAE": eval_mae,
                        "R2": eval_rsquared,
                        "EXP-VAR": eval_exp_var,
                        "Logloss": eval_logloss}

    print("Finished SAR...\n")
    
    return (model, evaluation_results)


### 2.2 LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This section trains a LightGBM model on the enriched MovieLens sample loaded above. The `rating` column is used as the label, and the user, movie, genre and poster-derived columns are used as categorical features.

[LightGBM](https://github.com/Microsoft/LightGBM) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
* Fast training speed and high efficiency.
* Low memory usage.
* Great accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.
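
The parameters used further down set `'objective': "binary"`, which expects a 0/1 label, while `rating` is a star value. The training log in this notebook reports `Number of positive: 80000, number of negative: 0` and a constant AUC of 1. This suggests the label was not binarized before training. The sketch below shows one way to derive a binary label. The threshold of 3.5 and the column name `label` are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd

# Made-up ratings for illustration.
ratings = pd.DataFrame({"rating": [0.5, 2.0, 3.5, 4.0, 5.0]})

# Hypothetical cut-off: ratings at or above it count as positive.
THRESHOLD = 3.5

# 1 = positive, 0 = negative.
ratings["label"] = (ratings["rating"] >= THRESHOLD).astype(int)
print(ratings)
```

With such a column in place, `label_col` could point at it instead of `rating`, provided `rating` itself is dropped from the features to avoid leaking the target.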

In [20]:
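def modelLightgbm(data, params):
    """
    Train LightGBM on ordinal-encoded categorical features.

    `data` is the prepared DataFrame; `params` is the LightGBM parameter dict.
    Returns a (model, evaluation_results) tuple; evaluation is not implemented yet.
    """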
    # Import packages
    import sys
    import os
    import numpy as np
    import lightgbm as lgb
    import papermill as pm
    import scrapbook as sb
    import pandas as pd
    import category_encoders as ce
    from tempfile import TemporaryDirectory
    from sklearn.metrics import roc_auc_score, log_loss

    import recommenders.models.lightgbm.lightgbm_utils as lgb_utils
    import recommenders.datasets.criteo as criteo

    print("System version: {}".format(sys.version))
    print("LightGBM version: {}".format(lgb.__version__))

    #Global Variables
    MAX_LEAF = 64
    MIN_DATA = 20
    NUM_OF_TREES = 100
    TREE_LEARNING_RATE = 0.15
    EARLY_STOPPING_ROUNDS = 20
    METRIC = "auc"
    SIZE = "sample"

    #Start logging
    #logging.basicConfig(level=logging.DEBUG, 
    #                format='%(asctime)s %(levelname)-8s %(message)s')

    #Lightgbm Code
    print("\nStarting Lightgbm...")

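    #Split dataset positionally into 80% train, 10% validation and 10% test
    length = len(data)
    train_end = int(0.8 * length)
    valid_end = int(0.9 * length)
    train_data = data.iloc[:train_end]
    valid_data = data.iloc[train_end:valid_end]
    test_data = data.iloc[valid_end:]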

    #Encode the string-like categorical features by an ordinal encoder
    cate_cols = ["userId", 
                "movieId", 
                "genres_0", 
                "genres_1", 
                "genres_2", 
                "genres_3", 
                "genres_4", 
                "categories_0", 
                "categories_1", 
                "categories_2", 
                "color_0", 
                "color_1", 
                "color_2", 
                "tags_0", 
                "tags_1", 
                "tags_2", 
                "description_0", 
                "description_1", 
                "description_2"]

    label_col = "rating"

    ord_encoder = ce.ordinal.OrdinalEncoder(cols=cate_cols)

    def encode_csv(df, encoder, label_col, typ='fit'):
        if typ == 'fit':
            df = encoder.fit_transform(df)
        else:
            df = encoder.transform(df)
        y = df[label_col].values
        del df[label_col]
        return df, y

    train_x, train_y = encode_csv(train_data, ord_encoder, label_col)
    valid_x, valid_y = encode_csv(valid_data, ord_encoder, label_col, 'transform')
    test_x, test_y = encode_csv(test_data, ord_encoder, label_col, 'transform')

    print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
        .format(trn_x_shape=train_x.shape,
                trn_y_shape=train_y.shape,
                vld_x_shape=valid_x.shape,
                vld_y_shape=valid_y.shape,
                tst_x_shape=test_x.shape,
                tst_y_shape=test_y.shape,))
    
    #Train the Lightgbm model on our training data
    lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params, categorical_feature=cate_cols)
    lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
    lgb_test = lgb.Dataset(test_x, test_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
    model = lgb.train(params,
                    lgb_train,
                    num_boost_round=NUM_OF_TREES,
                    early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                    valid_sets=lgb_valid,
                    categorical_feature=cate_cols)

    #TODO: Evaluation metrics
    evaluation_results = dict()

    print("Finished Lightgbm...\n")
    
    return (model, evaluation_results)
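
The `#TODO: Evaluation metrics` placeholder above leaves `evaluation_results` empty. The sketch below shows one way it could be filled using the `roc_auc_score` and `log_loss` functions that the cell already imports. The helper name `evaluate_binary` is hypothetical, and the labels and probabilities are made-up values.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score


def evaluate_binary(y_true, y_prob):
    """Return AUC and log loss for binary labels and predicted probabilities.

    Hypothetical helper; both metrics need y_true to contain both classes.
    """
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "Logloss": log_loss(y_true, y_prob),
    }


# Made-up labels and predicted probabilities for illustration.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

results = evaluate_binary(y_true, y_prob)
print(results)
```

Inside the function, the call might look like `evaluate_binary(test_y.reshape(-1), model.predict(test_x))`. This only works once the label contains both classes, because `roc_auc_score` raises an error when a single class is present.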

In [21]:
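def modelLightgbmOptimized(data, params):
    """
    Train LightGBM on features produced by lgb_utils.NumEncoder.

    `data` is the prepared DataFrame; `params` is the LightGBM parameter dict.
    Returns a (model, evaluation_results) tuple; evaluation is not implemented yet.
    """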
    # Import packages
    import sys
    import os
    import numpy as np
    import lightgbm as lgb
    import papermill as pm
    import scrapbook as sb
    import pandas as pd
    import category_encoders as ce
    from tempfile import TemporaryDirectory
    from sklearn.metrics import roc_auc_score, log_loss

    import recommenders.models.lightgbm.lightgbm_utils as lgb_utils
    import recommenders.datasets.criteo as criteo

    print("System version: {}".format(sys.version))
    print("LightGBM version: {}".format(lgb.__version__))

    #Global Variables
    #NUM_OF_TREES and EARLY_STOPPING_ROUNDS are read from the notebook's global scope

    #Start logging
    #logging.basicConfig(level=logging.DEBUG, 
    #                format='%(asctime)s %(levelname)-8s %(message)s')
    
    #Lightgbm Optimized Code
    print("\nStarting Lightgbm Optimized...")

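    #Split dataset positionally into 80% train, 10% validation and 10% test
    length = len(data)
    train_end = int(0.8 * length)
    valid_end = int(0.9 * length)
    train_data = data.iloc[:train_end]
    valid_data = data.iloc[train_end:valid_end]
    test_data = data.iloc[valid_end:]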

    #Encode the string-like categorical features by an ordinal encoder
    cate_cols = ["userId", 
                "movieId", 
                "genres_0", 
                "genres_1", 
                "genres_2", 
                "genres_3", 
                "genres_4", 
                "categories_0", 
                "categories_1", 
                "categories_2", 
                "color_0", 
                "color_1", 
                "color_2", 
                "tags_0", 
                "tags_1", 
                "tags_2", 
                "description_0", 
                "description_1", 
                "description_2"]
    label_col = 'rating'
    nume_cols = []

    #Convert all the categorical features in original data into numerical ones
    num_encoder = lgb_utils.NumEncoder(cate_cols, nume_cols, label_col)
    train_x, train_y = num_encoder.fit_transform(train_data)
    valid_x, valid_y = num_encoder.transform(valid_data)
    test_x, test_y = num_encoder.transform(test_data)
    del num_encoder
    print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
        .format(trn_x_shape=train_x.shape,
                trn_y_shape=train_y.shape,
                vld_x_shape=valid_x.shape,
                vld_y_shape=valid_y.shape,
                tst_x_shape=test_x.shape,
                tst_y_shape=test_y.shape,))

    #Train the Lightgbm model on our training data
    lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params)
    lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train)
    model = lgb.train(params,
                    lgb_train,
                    num_boost_round=NUM_OF_TREES,
                    early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                    valid_sets=lgb_valid)

    #TODO: Evaluation metrics
    evaluation_results = dict()

    print("Finished Lightgbm Optimized...\n")
    
    return (model, evaluation_results)

In [22]:
#Run SAR
results_sar = modelSar(df, None)

#Run Lightgbm
MAX_LEAF = 64
MIN_DATA = 20
NUM_OF_TREES = 100
TREE_LEARNING_RATE = 0.15
EARLY_STOPPING_ROUNDS = 20
METRIC = "auc"
SIZE = "sample"

params_lightgbm = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'num_class': 1,
    'objective': "binary",
    'metric': METRIC,
    'num_leaves': MAX_LEAF,
    'min_data': MIN_DATA,
    'boost_from_average': True,
    #set it according to your cpu cores.
    'num_threads': 24,
    'feature_fraction': 0.8,
    'learning_rate': TREE_LEARNING_RATE,
}
results_lightgbm = modelLightgbm(df, params_lightgbm)

#Run Lightgbm Optimized
results_lightgbm_optimized = modelLightgbmOptimized(df, params_lightgbm)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
System version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
Pandas version: 1.3.5

Starting SAR...


2022-05-08 22:37:31,581 INFO     Collecting user affinity matrix
2022-05-08 22:37:31,583 INFO     Calculating time-decayed affinities
2022-05-08 22:37:31,598 INFO     Creating index columns
2022-05-08 22:37:31,662 INFO     Calculating normalization factors
2022-05-08 22:37:31,682 INFO     Building user affinity sparse matrix
2022-05-08 22:37:31,684 INFO     Calculating item co-occurrence
2022-05-08 22:37:31,695 INFO     Calculating item similarity
2022-05-08 22:37:31,695 INFO     Using jaccard based similarity



    Train:
    Total Ratings: 87911
    Unique Users: 55089
    Unique Items: 8580

    Test:
    Total Ratings: 12089
    Unique Users: 9774
    Unique Items: 4419
    


2022-05-08 22:37:33,096 INFO     Done training
2022-05-08 22:37:33,101 INFO     Calculating recommendation scores


Took 1.5241332619989407 seconds for training.


2022-05-08 22:37:35,528 INFO     Removing seen items


Took 2.8350970070023322 seconds for prediction.
Finished SAR...

System version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
LightGBM version: 3.3.2

Starting Lightgbm...


2022-05-08 22:38:01,365 INFO     Filtering and fillna features


Train Data Shape: X: (80000, 21); Y: (80000,).
Valid Data Shape: X: (10000, 21); Y: (10000,).
Test Data Shape: X: (10000, 21); Y: (10000,).

[LightGBM] [Info] Number of positive: 80000, number of negative: 0
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13092
[LightGBM] [Info] Number of data points in the train set: 80000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Info] Start training from score 34.539576
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[1]	valid_0's auc: 1
Training until validation scores don't improve for 20 rounds
[2]	valid_0's auc: 1
[3]	valid_0's auc: 1
[4]	valid_0's auc: 1
[5]	valid_0's auc: 1
[6]	valid_0's auc: 1
[7]	valid_0's auc: 1
[8]	valid_0's auc: 1
[9]	valid_0's auc: 1
[10]	valid_0's auc: 1
[11]	valid_0's auc: 1
[12]	valid_0's auc: 1
[13]	valid_0's auc: 1
[14]	valid_0's auc: 1
[15]	valid_0's auc: 1
[16]	valid

100%|██████████| 19/19 [00:01<00:00,  9.54it/s]
0it [00:00, ?it/s]
2022-05-08 22:38:03,359 INFO     Ordinal encoding cate features
2022-05-08 22:38:03,876 INFO     Target encoding cate features
100%|██████████| 19/19 [00:01<00:00, 10.30it/s]
2022-05-08 22:38:05,723 INFO     Start manual binary encoding
100%|██████████| 38/38 [00:03<00:00, 10.52it/s]
100%|██████████| 19/19 [00:02<00:00,  9.44it/s]
2022-05-08 22:38:11,509 INFO     Filtering and fillna features
100%|██████████| 19/19 [00:00<00:00, 418.94it/s]
0it [00:00, ?it/s]
2022-05-08 22:38:11,558 INFO     Ordinal encoding cate features
2022-05-08 22:38:11,615 INFO     Target encoding cate features
100%|██████████| 19/19 [00:00<00:00, 81.75it/s]
2022-05-08 22:38:11,850 INFO     Start manual binary encoding
100%|██████████| 38/38 [00:03<00:00, 11.06it/s]
100%|██████████| 19/19 [00:01<00:00, 10.51it/s]
2022-05-08 22:38:17,217 INFO     Filtering and fillna features
100%|██████████| 19/19 [00:00<00:00, 367.41it/s]
0it [00:00, ?it/s]
2022-

Train Data Shape: X: (80000, 156); Y: (80000, 1).
Valid Data Shape: X: (10000, 156); Y: (10000, 1).
Test Data Shape: X: (10000, 156); Y: (10000, 1).

[LightGBM] [Info] Number of positive: 80000, number of negative: 0
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 9921
[LightGBM] [Info] Number of data points in the train set: 80000, number of used features: 154
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[LightGBM] [Info] Start training from score 34.539576
[LightGBM] [Info] [binary:BoostFromScore]: pavg=1.000000 -> initscore=34.539576
[1]	valid_0's auc: 1
Training until validation scores don't improve for 20 rounds
[2]	valid_0's auc: 1
[3]	valid_0's auc: 1
[4]	valid_0's auc: 1
[5]	valid_0's auc: 1
[6]	valid_0's auc: 1
[7]	valid_0's auc: 1
[8]	valid_0's auc: 1
[9]	valid_0's auc: 1
[10]	valid_0's auc: 1
[11]	valid_0's auc: 1
[12]	valid_0's auc: 1
[13]	valid_0's auc: 1
[14]	valid_0's auc: 1
[15]	valid_0's auc: 1
[

In [23]:
# Show results
print(f"\n\nSAR Results:\n {results_sar[1]}\n")
print(f"Lightgbm Results:\n {results_lightgbm[1]}\n")
print(f"Lightgbm Optimized Results:\n {results_lightgbm_optimized[1]}\n")



SAR Results:
 {'Top K': 10, 'MAP': 0.0003698263290037384, 'NDCG': 0.0006328883367846422, 'Precision': 0.00016369961121342339, 'Recall': 0.0014835277266216492, 'RMSE': 3.1991763260010977, 'MAE': 3.081206230847868, 'R2': -13.31743533442921, 'EXP-VAR': -0.036446535106668954, 'Logloss': 5.403117521910442}

Lightgbm Results:
 {}

Lightgbm Optimized Results:
 {}



In [24]:
df_results = pd.DataFrame.from_dict([results_sar[1], results_lightgbm[1], results_lightgbm_optimized[1]])
df_results["Models"] = ["SAR",
                        "Lightgbm",
                        "Lightgbm Optimized"]
df_results.head()

Unnamed: 0,Top K,MAP,NDCG,Precision,Recall,RMSE,MAE,R2,EXP-VAR,Logloss,Models
0,10.0,0.00037,0.000633,0.000164,0.001484,3.199176,3.081206,-13.317435,-0.036447,5.403118,SAR
1,,,,,,,,,,,Lightgbm
2,,,,,,,,,,,Lightgbm Optimized
