# Tutorial

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/massquantity/LibRecommender/blob/master/examples/tutorial.ipynb)
[![View in doc](https://img.shields.io/badge/document-tutorial-ffdfba)](https://librecommender.readthedocs.io/en/latest/tutorial.html)

This tutorial walks you through the complete process of training a model in LibRecommender, i.e. **data processing -> feature engineering -> training -> evaluation -> save/load -> retraining**. We will use [Wide & Deep](https://arxiv.org/pdf/1606.07792.pdf) as the example algorithm.

First make sure the latest LibRecommender has been installed:

In [1]:
!pip install -U LibRecommender

Collecting LibRecommender
  Downloading librecommender-1.5.2.tar.gz (525 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gensim>=4.0.0 (from LibRecommender)
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Building wheels for collected packages: LibRecommender
  Building wheel for LibRecommender (pyproject.toml) ... done
  Created wheel for LibRecommender: filename=librecommender-1.5.2-cp312-cp312-linux_x86_64.whl size=2104733 sha256=6eb72c398

For how to deploy a trained model in LibRecommender, see [Serving Guide](https://librecommender.readthedocs.io/en/latest/serving_guide/python.html).

**NOTE**: If you encounter errors like `Variables already exist, disallowed...`, just call `tf.compat.v1.reset_default_graph()` first.

## Load data

In this tutorial we will use the [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) dataset. The following code loads the data into a `pandas.DataFrame`. If the data does not exist locally, it is downloaded first.
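
To make the file layout concrete, here is a minimal pure-Python sketch of the two parsing steps the loader performs: splitting a `::`-separated record and spreading a `|`-separated genre string over three columns padded with `"missing"`. The helper names `parse_rating_line` and `split_genres` are hypothetical and exist only for illustration; the tutorial itself does this with pandas.

```python
def parse_rating_line(line):
    """Parse one `user::item::rating::timestamp` record into a dict."""
    user, item, rating, time = line.strip().split("::")
    return {"user": int(user), "item": int(item), "label": int(rating), "time": int(time)}


def split_genres(genres, n_cols=3, fill="missing"):
    """Split a `|`-separated genre string into exactly `n_cols` values."""
    parts = genres.split("|")[:n_cols]  # keep at most the first `n_cols` genres
    return parts + [fill] * (n_cols - len(parts))  # pad short lists


print(parse_rating_line("1::1193::5::978300760"))
print(split_genres("Comedy|Romance|War"))
print(split_genres("Horror"))
```

A movie with a single genre ends up as one real value followed by two `"missing"` placeholders, which is the same shape as the `genre1`, `genre2`, `genre3` columns in the merged table.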

In [6]:
import random
import warnings
from pathlib import Path

import pandas as pd
import tensorflow as tf
warnings.filterwarnings("ignore")

In [7]:
def load_ml_1m():
    # download and extract zip file
    tf.keras.utils.get_file(
        "ml-1m.zip",
        "http://files.grouplens.org/datasets/movielens/ml-1m.zip",
        cache_dir=".",
        cache_subdir=".",
        extract=True,
    )
    # read and merge data into same table
    cur_path = Path(".").absolute()
    # a multi-character separator such as "::" requires the python parsing engine
    ratings = pd.read_csv(
        cur_path / "ml-1m" / "ratings.dat",
        sep="::",
        engine="python",
        usecols=[0, 1, 2, 3],
        names=["user", "item", "rating", "time"],
    )
    users = pd.read_csv(
        cur_path / "ml-1m" / "users.dat",
        sep="::",
        engine="python",
        usecols=[0, 1, 2, 3],
        names=["user", "sex", "age", "occupation"],
    )
    items = pd.read_csv(
        cur_path / "ml-1m" / "movies.dat",
        sep="::",
        engine="python",
        usecols=[0, 2],
        names=["item", "genre"],
        encoding="iso-8859-1",
    )
    items[["genre1", "genre2", "genre3"]] = (
        items["genre"].str.split(r"|", expand=True).fillna("missing").iloc[:, :3]
    )
    items = items.drop("genre", axis=1)
    data = ratings.merge(users, on="user").merge(items, on="item")
    data = data.rename(columns={"rating": "label"})
    # random shuffle data
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)
    return data

In [18]:
data = load_ml_1m()
print("data shape:", data.shape)

data shape: (1000209, 10)


In [19]:
data.iloc[random.choices(range(len(data)), k=10)]  # randomly select 10 rows

| | user | item | label | time | sex | age | occupation | genre1 | genre2 | genre3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 170058 | 5157 | 2863 | 3 | 961945490 | M | 35 | 1 | Comedy | Musical | missing |
| 596043 | 3565 | 3316 | 3 | 966780337 | M | 25 | 18 | Action | Thriller | missing |
| 716474 | 1658 | 3503 | 2 | 974715496 | F | 35 | 6 | Drama | Sci-Fi | missing |
| 191099 | 4732 | 1347 | 4 | 963332750 | M | 25 | 14 | Horror | missing | missing |
| 227392 | 3888 | 356 | 4 | 966136051 | M | 45 | 17 | Comedy | Romance | War |
| 628648 | 4087 | 3273 | 4 | 965431617 | M | 1 | 4 | Horror | Mystery | Thriller |
| 264760 | 5843 | 1207 | 5 | 957806395 | M | 35 | 1 | Drama | missing | missing |
| 701013 | 5394 | 671 | 5 | 961438584 | M | 18 | 0 | Comedy | Sci-Fi | missing |
| 317689 | 5831 | 3102 | 4 | 957905790 | M | 25 | 1 | Thriller | missing | missing |
| 308156 | 4889 | 480 | 5 | 962736365 | M | 18 | 4 | Action | Adventure | Sci-Fi |


Now we have about 1 million rows of data. To evaluate the model after training, we first need to split the data into train, eval and test sets. In this tutorial we will simply use `random_split`. For other ways of splitting data, see [Data Processing](https://librecommender.readthedocs.io/en/latest/user_guide/data_processing.html).

**For now, we will only use the first half of the data for training. Later we will use the rest of the data to retrain the model.**
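
As a rough illustration of what halving the data and a ratio-based random split amount to, the sketch below works on row indices only, using the standard library. It is not the implementation behind `random_split`; the helper `split_by_ratios` is hypothetical, and the way rounding remainders are assigned to the last chunk is an assumption.

```python
import random


def split_by_ratios(n_rows, ratios, seed=42):
    """Shuffle row indices and cut them into consecutive chunks by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-8:
        raise ValueError("ratios must sum to 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    chunks, start = [], 0
    for i, ratio in enumerate(ratios):
        # the last chunk takes whatever is left, so no row is dropped
        end = n_rows if i == len(ratios) - 1 else start + int(n_rows * ratio)
        chunks.append(indices[start:end])
        start = end
    return chunks


n_total = 1000209             # row count printed by the tutorial
n_first_half = n_total // 2   # rows used for the initial training
n_second_half = n_total - n_first_half  # rows kept back for retraining
train_idx, eval_idx, test_idx = split_by_ratios(n_first_half, [0.8, 0.1, 0.1])
print(n_first_half, n_second_half)
print(len(train_idx), len(eval_idx), len(test_idx))
```

Because the total row count is odd, integer division leaves the second half one row larger than the first.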

## Process Data & Features

In [20]:
from libreco.data import random_split

# split data into three folds for training, evaluating and testing
first_half_data = data[: (len(data) // 2)]
train_data, eval_data, test_data = random_split(first_half_data, multi_ratios=[0.8, 0.1, 0.1], seed=42)

The data contains some categorical features such as "sex" and "genre", as well as a numerical feature "age". In LibRecommender we use `sparse_col` to represent categorical features and `dense_col` to represent numerical features. You therefore specify the column information first and then use the `DatasetFeat.build_*` functions to process the data.

In [21]:
print("first half data shape:", first_half_data.shape)

first half data shape: (500104, 10)


In [22]:
from libreco.data import DatasetFeat

sparse_col = ["sex", "occupation", "genre1", "genre2", "genre3"]
dense_col = ["age"]
user_col = ["sex", "age", "occupation"]
item_col = ["genre1", "genre2", "genre3"]

train_data, data_info = DatasetFeat.build_trainset(train_data, user_col, item_col, sparse_col, dense_col)
eval_data = DatasetFeat.build_evalset(eval_data)
test_data = DatasetFeat.build_testset(test_data)

`user_col` lists the features that belong to users, and `item_col` lists the features that belong to items. Note that the column counts should match, i.e. `len(sparse_col) + len(dense_col) == len(user_col) + len(item_col)`.
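
A quick sanity check of this rule can be written before building the dataset. The helper below is a hypothetical convenience, not part of LibRecommender. Beyond the count rule stated above, it also checks that both groupings name the same columns, which is an additional assumption of this sketch.

```python
def check_feature_columns(sparse_col, dense_col, user_col, item_col):
    """Raise ValueError unless both groupings cover the same feature columns."""
    by_type = list(sparse_col) + list(dense_col)
    by_owner = list(user_col) + list(item_col)
    if len(by_type) != len(by_owner):
        raise ValueError(f"column counts differ: {len(by_type)} vs {len(by_owner)}")
    if set(by_type) != set(by_owner):
        # assumption of this sketch: every feature is listed in both groupings
        diff = sorted(set(by_type) ^ set(by_owner))
        raise ValueError(f"columns listed in only one grouping: {diff}")
    return True


sparse_col = ["sex", "occupation", "genre1", "genre2", "genre3"]
dense_col = ["age"]
user_col = ["sex", "age", "occupation"]
item_col = ["genre1", "genre2", "genre3"]
print(check_feature_columns(sparse_col, dense_col, user_col, item_col))
```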

In [None]:
print(data_info)

n_users: 6040, n_items: 3576, data density: 1.8523 %


## Training the Model

Now with all the data and features prepared, we can start training the model!

As its name suggests, the `Wide & Deep` algorithm has a wide part and a deep part, and they use different optimizers. So we specify their learning rates separately with a dict, e.g. `{"wide": 0.05, "deep": 7e-4}`. For other model hyperparameters, see the API reference of [WideDeep](https://librecommender.readthedocs.io/en/latest/api/algorithms/wide_deep.html).

In this example we treat all the samples in the data as positive samples and perform negative sampling to generate negative ones. This kind of data is called "implicit data".
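
The idea of negative sampling can be sketched in a few lines of plain Python. This is only a conceptual illustration, assuming uniform random sampling of items the user has not interacted with. The function `sample_negatives` and its `num_neg` parameter are hypothetical and do not describe the sampler that LibRecommender uses internally.

```python
import random


def sample_negatives(interactions, n_items, num_neg=1, seed=42):
    """Label every observed (user, item) pair 1 and add `num_neg`
    randomly drawn unseen items per pair with label 0."""
    rng = random.Random(seed)
    seen = {}
    for user, item in interactions:
        seen.setdefault(user, set()).add(item)
    rows = []
    for user, item in interactions:
        if len(seen[user]) >= n_items:
            raise ValueError(f"user {user} has no unseen items to sample")
        rows.append((user, item, 1))
        for _ in range(num_neg):
            neg = rng.randrange(n_items)
            while neg in seen[user]:  # redraw until the item is unseen by this user
                neg = rng.randrange(n_items)
            rows.append((user, neg, 0))
    return rows


observed = [(0, 1), (0, 2), (1, 0)]  # toy (user, item) pairs
for row in sample_negatives(observed, n_items=5):
    print(row)
```

With one negative per positive, the number of training rows doubles, which is why sampled datasets contain more rows than the original interaction table.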

In [None]:
from libreco.algorithms import WideDeep

In [None]:
model = WideDeep(
    task="ranking",
    data_info=data_info,
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.05, "deep": 7e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)

model.fit(
    train_data,
    neg_sampling=True,  # perform negative sampling on training and eval data
    verbose=2,
    shuffle=True,
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

Training start time: 2023-04-06 15:12:45
Instructions for updating:
Colocations handled automatically by placer.


2023-04-06 15:12:45.758683: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


total params: 192,413 | embedding params: 165,109 | network params: 27,304


2023-04-06 15:12:46.174116: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
train: 100%|██████████████████████████████████████████████████████| 391/391 [00:02<00:00, 134.68it/s]


Epoch 1 elapsed: 2.905s
	 train_loss: 0.959
random neg item sampling elapsed: 0.024s


eval_pointwise: 100%|███████████████████████████████████████████████| 13/13 [00:00<00:00, 143.54it/s]
eval_listwise: 100%|████████████████████████████████████████████| 2797/2797 [00:09<00:00, 287.07it/s]


	 eval log_loss: 0.5823
	 eval roc_auc: 0.8032
	 eval precision@10: 0.0236
	 eval recall@10: 0.0339
	 eval ndcg@10: 0.1001


train: 100%|██████████████████████████████████████████████████████| 391/391 [00:02<00:00, 156.01it/s]


Epoch 2 elapsed: 2.508s
	 train_loss: 0.499


eval_pointwise: 100%|███████████████████████████████████████████████| 13/13 [00:00<00:00, 235.78it/s]
eval_listwise: 100%|████████████████████████████████████████████| 2797/2797 [00:10<00:00, 256.00it/s]


	 eval log_loss: 0.4769
	 eval roc_auc: 0.8488
	 eval precision@10: 0.0332
	 eval recall@10: 0.0523
	 eval ndcg@10: 0.1376


We've trained the model for 2 epochs and evaluated the performance on the eval data during training. Next we can evaluate on the *independent* test data.

In [None]:
from libreco.evaluation import evaluate

evaluate(
    model=model,
    data=test_data,
    neg_sampling=True,  # perform negative sampling on test data
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

random neg item sampling elapsed: 0.025s


eval_pointwise: 100%|███████████████████████████████████████████████| 13/13 [00:00<00:00, 219.84it/s]
eval_listwise: 100%|████████████████████████████████████████████| 2834/2834 [00:10<00:00, 278.12it/s]


{'loss': 0.4782908669403157,
 'roc_auc': 0.8483713737644527,
 'precision': 0.031268748897123694,
 'recall': 0.04829594849021039,
 'ndcg': 0.12866793895121623}
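
For readers unfamiliar with the ranking metrics reported above, the following sketch computes precision@k, recall@k and NDCG@k for a single user using common textbook definitions with binary relevance. It is an assumption that these match the library's exact definitions (for example, how users are averaged or how ties are handled); the function name is hypothetical.

```python
import math


def precision_recall_ndcg_at_k(recommended, relevant, k=10):
    """Compute precision@k, recall@k and NDCG@k for one user."""
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts each hit by log2 of its 1-based rank plus one
    dcg = sum(hit / math.log2(rank + 2) for rank, hit in enumerate(hits))
    # ideal DCG: all relevant items (at most k) ranked at the top
    idcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


recommended = [10, 20, 30, 40]   # toy ranked list for one user
relevant = {20, 40, 50}          # toy set of held-out items for that user
print(precision_recall_ndcg_at_k(recommended, relevant, k=4))
```

Precision and recall only count how many relevant items appear in the top k, while NDCG also rewards placing them nearer the top of the list.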

## Make Recommendation

Making recommendations is straightforward. You can make recommendations for one user or for a batch of users.

In [None]:
model.recommend_user(user=1, n_rec=3)

{1: array([ 364, 3751, 2858])}

In [None]:
model.recommend_user(user=[1, 2, 3], n_rec=3)

{1: array([ 364, 3751, 2858]),
 2: array([1617,  608,  912]),
 3: array([ 589, 2571, 1200])}

You can also make recommendations based on specific user features.

In [None]:
model.recommend_user(user=1, n_rec=3, user_feats={"sex": "M", "age": 33})

{1: array([2716,  589, 2571])}

In [None]:
model.recommend_user(user=1, n_rec=3, user_feats={"occupation": 17})

{1: array([2858, 1210, 1580])}

## Save, Load and Inference

When saving the model, we should also save the `data_info` for feature information.

In [None]:
data_info.save("model_path", model_name="wide_deep")
model.save("model_path", model_name="wide_deep")

Then we can load the model and make recommendations again.

In [None]:
tf.compat.v1.reset_default_graph()  # the default graph must be reset in TensorFlow 1.x graph mode

In [None]:
from libreco.data import DataInfo

loaded_data_info = DataInfo.load("model_path", model_name="wide_deep")
loaded_model = WideDeep.load("model_path", model_name="wide_deep", data_info=loaded_data_info)
loaded_model.recommend_user(user=1, n_rec=3)

total params: 192,413 | embedding params: 165,109 | network params: 27,304


{1: array([ 364, 3751, 2858])}

## Retrain the Model with New Data

Remember that we split the original `MovieLens 1M` data into two parts in the first place? We will treat the **second half** of the data as our new data and retrain the saved model with it. In real-world recommender systems, data may be generated every day, so it is inefficient to train the model from scratch every time we get some new data.

In [None]:
second_half_data = data[(len(data) // 2) :]
train_data, eval_data = random_split(second_half_data, multi_ratios=[0.8, 0.2])

In [None]:
print("second half data shape:", second_half_data.shape)

second half data shape: (500105, 10)


The data processing is similar, except that we should use `merge_trainset()` and `merge_evalset()` in `DatasetFeat`.

The purpose of these functions is to combine information from the old data with that from the new data, especially for users/items that appear only in the new data. For more details, see [Model Retrain](https://librecommender.readthedocs.io/en/latest/user_guide/model_retrain.html).
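
Conceptually, merging means that users and items already known keep their existing internal indices, while previously unseen ones receive new indices. The sketch below illustrates that idea with a plain dict. It is not how `DataInfo` stores its mappings; `merge_id_mapping` is a hypothetical helper, and it assumes the old indices are contiguous from 0.

```python
def merge_id_mapping(old_mapping, new_ids):
    """Extend an id -> index mapping with unseen ids, keeping old indices fixed."""
    merged = dict(old_mapping)
    for raw_id in new_ids:
        if raw_id not in merged:
            merged[raw_id] = len(merged)  # new ids are appended at the end
    return merged


old_user_index = {101: 0, 102: 1, 103: 2}
new_user_index = merge_id_mapping(old_user_index, [102, 104, 105, 104])
print(new_user_index)
```

Keeping old indices stable is what allows previously learned parameters to be reused for known users and items.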

In [None]:
# pass `loaded_data_info` and get `new_data_info`
train_data, new_data_info = DatasetFeat.merge_trainset(train_data, loaded_data_info, merge_behavior=True)
eval_data = DatasetFeat.merge_evalset(eval_data, new_data_info)  # use new_data_info

Then we construct a new model and call the `rebuild_model` method to assign the old variables to the new model.
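
One way to picture this assignment step: the new model has larger embedding tables because of new users/items, and the rows learned earlier are copied into the matching positions. The sketch below shows this with nested lists. It is an illustration under that assumption only and does not reproduce the internals of `rebuild_model`; `assign_old_rows` and `init_value` are hypothetical names.

```python
def assign_old_rows(old_table, new_size, embed_size, init_value=0.0):
    """Build a `new_size` x `embed_size` table and copy the old rows into it."""
    if new_size < len(old_table):
        raise ValueError("new table must be at least as large as the old one")
    # rows for new users/items start from a fresh initial value
    new_table = [[init_value] * embed_size for _ in range(new_size)]
    for index, row in enumerate(old_table):
        new_table[index] = list(row)  # reuse the previously trained row
    return new_table


old_embeddings = [[0.1, 0.2], [0.3, 0.4]]   # toy table trained on old data
new_embeddings = assign_old_rows(old_embeddings, new_size=4, embed_size=2)
print(new_embeddings)
```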

In [None]:
tf.compat.v1.reset_default_graph()  # the default graph must be reset in TensorFlow 1.x graph mode

In [None]:
new_model = WideDeep(
    task="ranking",
    data_info=new_data_info,  # pass new_data_info
    embed_size=16,
    n_epochs=2,
    loss_type="cross_entropy",
    lr={"wide": 0.01, "deep": 1e-4},
    batch_size=2048,
    use_bn=True,
    hidden_units=(128, 64, 32),
)

new_model.rebuild_model(path="model_path", model_name="wide_deep", full_assign=True)

total params: 194,164 | embedding params: 166,860 | network params: 27,304


Finally, the training and recommendation parts are the same as before.

In [None]:
new_model.fit(
    train_data,
    neg_sampling=True,
    verbose=2,
    shuffle=True,
    eval_data=eval_data,
    metrics=["loss", "roc_auc", "precision", "recall", "ndcg"],
)

Training start time: 2023-04-06 15:18:29


train: 100%|██████████████████████████████████████████████████████| 391/391 [00:02<00:00, 136.50it/s]


Epoch 1 elapsed: 2.867s
	 train_loss: 0.4867
random neg item sampling elapsed: 0.058s


eval_pointwise: 100%|███████████████████████████████████████████████| 25/25 [00:00<00:00, 175.29it/s]
eval_listwise: 100%|████████████████████████████████████████████| 2981/2981 [00:11<00:00, 262.03it/s]


	 eval log_loss: 0.4482
	 eval roc_auc: 0.8708
	 eval precision@10: 0.0985
	 eval recall@10: 0.0710
	 eval ndcg@10: 0.3062


train: 100%|██████████████████████████████████████████████████████| 391/391 [00:02<00:00, 141.23it/s]


Epoch 2 elapsed: 2.770s
	 train_loss: 0.472


eval_pointwise: 100%|███████████████████████████████████████████████| 25/25 [00:00<00:00, 214.44it/s]
eval_listwise: 100%|████████████████████████████████████████████| 2981/2981 [00:10<00:00, 275.00it/s]


	 eval log_loss: 0.4416
	 eval roc_auc: 0.8741
	 eval precision@10: 0.1031
	 eval recall@10: 0.0738
	 eval ndcg@10: 0.3168


In [None]:
new_model.recommend_user(user=1, n_rec=3)

{1: array([ 364, 2858, 1210])}

In [None]:
new_model.recommend_user(user=[1, 2, 3], n_rec=3)

{1: array([ 364, 2858, 1210]),
 2: array([ 608, 1617, 1233]),
 3: array([ 589, 2571, 1387])}

**This completes our tutorial!**

+ For more examples, see the [examples](https://github.com/massquantity/LibRecommender/tree/master/examples) folder on GitHub.

+ For more usage details, please head to the [User Guide](https://librecommender.readthedocs.io/en/latest/user_guide/index.html).

+ For serving a trained model, please head to the [Python Serving Guide](https://librecommender.readthedocs.io/en/latest/serving_guide/python.html).