<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# Template

The notebook title should be concise and sit at heading level 1, i.e., with a single "#" in the markdown source.

Right under the title, place a brief introduction to the notebook. Usually this states the technical or business problem that the notebook's contents try to solve.

**Example**:

This notebook shows how to set a version check for Recommenders.

## 0 Global settings

Heading level 2 is used for each section in the notebook. Numbering starts from 0, and section 0 usually covers global settings such as module imports and global variable definitions. 
Section names start with a capital letter. 

#### Module imports

It is good practice to put all imports in the first cell. 

In [1]:
import os
import pickle
import sys

import numpy as np
import pandas as pd
import tensorflow as tf

import recommenders

tf.get_logger().setLevel('ERROR')  # only show error messages

from recommenders.utils.timer import Timer
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.evaluation.python_evaluation import (
    map, ndcg_at_k, precision_at_k, recall_at_k
)

from recommenders.models.attrec.attrec import AttRec
from recommenders.models.attrec.dataIterator import DataIterator
from recommenders.utils.constants import SEED as DEFAULT_SEED
from recommenders.utils.notebook_utils import store_metadata

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")
print("AttRec module imported successfully!")


System version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:22:19) [Clang 14.0.6 ]
Tensorflow version: 2.18.0
AttRec module imported successfully!


#### Global variables

To make notebook tests easy to parameterize, add the "parameters" tag to the cell so that `papermill` can find and override its variables during testing. 

In [2]:
RECOMMENDERS_VERSION = "0.1.1"

In [3]:
# top k items to recommend
TOP_K = 50

# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

# Model parameters
EPOCHS = 30
BATCH_SIZE = 256

SEED = DEFAULT_SEED  # Set None for non-deterministic results
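
For illustration, the "parameters" tag is stored in the cell metadata of the notebook's JSON. The sketch below builds a tiny notebook-like dictionary by hand and lists the cells carrying that tag. The helper name `find_parameter_cells` and the toy notebook are assumptions made for this example; they are not part of `papermill` or Recommenders.

```python
def find_parameter_cells(notebook):
    """Return the source text of every code cell tagged "parameters"."""
    tagged = []
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        if cell.get("cell_type") == "code" and "parameters" in tags:
            tagged.append("".join(cell.get("source", [])))
    return tagged


# Hypothetical two-cell notebook, following the notebook JSON layout
# in which tags are kept under cell["metadata"]["tags"].
toy_notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": ["import sys\n"]},
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": ["TOP_K = 50\n", "EPOCHS = 30\n"],
        },
    ]
}

parameter_sources = find_parameter_cells(toy_notebook)
print(parameter_sources)
```

Only the second cell is reported, which is why variables meant to be overridden in tests should be kept together in the tagged cell.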

In [4]:
df = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=["userID", "itemID", "rating", "timestamp"]
)

df.head()

100%|██████████| 4.81k/4.81k [00:02<00:00, 2.30kKB/s]


   userID  itemID  rating  timestamp
0     196     242     3.0  881250949
1     186     302     3.0  891717742
2      22     377     1.0  878887116
3     244      51     2.0  880606923
4     166     346     1.0  886397596


In [10]:
import os
def create_train_test(df, seq_counts=5, target_counts=3, save_dir='processed_data', is_Save=True):
    """
    Splits the dataset into train/test sets with user-item sequences.

    Args:
        df: Currently unused; the interactions are read from the hard-coded `data_path` below.
        seq_counts (int): Length of input sequences.
        target_counts (int): Number of items to predict.
        save_dir (str): Directory to save the train/test data.
        is_Save (bool): Whether to save the datasets and metadata.

    Returns:
        train (pd.DataFrame): Training data.
        test (pd.DataFrame): Testing data.
        user_all_items (dict): Mapping of users to their full item interaction lists.
        all_user_count (int): Total number of unique users.
        all_item_count (int): Total number of unique items.
        user_map (dict): Mapping of original user IDs to remapped IDs.
        item_map (dict): Mapping of original item IDs to remapped IDs.
    """
    # Ensure the save directory exists
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Load data (the path is specific to the original author's machine; adjust it to your environment)
    data_path = '/Users/leeisbadk/recommenders/examples/99_model_attrec/ml-100k/ml-100k/u.data'
    data = pd.read_csv(data_path, sep='\t', header=None, names=['user_id', 'item_id', 'rating', 'timestamp'])
    # Remap user and item IDs to contiguous indices starting from 0
    user_map = {uid: i for i, uid in enumerate(data['user_id'].unique())}
    item_map = {iid: i for i, iid in enumerate(data['item_id'].unique())}
    data['user_id'] = data['user_id'].map(user_map)
    data['item_id'] = data['item_id'].map(item_map)

    # Sort data by user and timestamp
    data = data.sort_values(by=['user_id', 'timestamp']).reset_index(drop=True)

    # Group data by user
    user_sessions = data.groupby('user_id')['item_id'].apply(list).reset_index()
    user_sessions.rename(columns={'item_id': 'item_list'}, inplace=True)

    user_all_items = {}
    train_users, train_seqs, train_targets = [], [], []
    test_users, test_seqs, test_targets = [], [], []

    for _, row in user_sessions.iterrows():
        user = row['user_id']
        items = row['item_list']
        user_all_items[user] = items

        # Create training sequences
        for i in range(seq_counts, len(items) - target_counts):
            seqs = items[i - seq_counts:i]
            targets = items[i:i + target_counts]
            train_users.append(user)
            train_seqs.append(seqs)
            train_targets.append(targets)

        # Create testing sequence
        if len(items) > seq_counts + target_counts:
            test_seq = items[-seq_counts - target_counts:-target_counts]
            test_target = items[-target_counts:]
            test_users.append(user)
            test_seqs.append(test_seq)
            test_targets.append(test_target)

    # Convert to DataFrames
    train = pd.DataFrame({'user': train_users, 'seq': train_seqs, 'target': train_targets})
    test = pd.DataFrame({'user': test_users, 'seq': test_seqs, 'target': test_targets})

    # Metadata
    all_user_count = len(user_map)
    all_item_count = len(item_map)

    if is_Save:
        # Save datasets
        train.to_csv(os.path.join(save_dir, 'train.csv'), index=False)
        test.to_csv(os.path.join(save_dir, 'test.csv'), index=False)

        # Save mappings and metadata
        with open(os.path.join(save_dir, 'info.pkl'), 'wb') as f:
            pickle.dump(user_all_items, f, pickle.HIGHEST_PROTOCOL)
            pickle.dump(all_user_count, f, pickle.HIGHEST_PROTOCOL)
            pickle.dump(all_item_count, f, pickle.HIGHEST_PROTOCOL)
            pickle.dump(user_map, f, pickle.HIGHEST_PROTOCOL)
            pickle.dump(item_map, f, pickle.HIGHEST_PROTOCOL)

        print(f"Train and test datasets saved in '{save_dir}'")

    return train, test, user_all_items, all_user_count, all_item_count, user_map, item_map
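
To make the windowing in `create_train_test` easier to follow, here is a minimal sketch that applies the same index arithmetic to a toy item list. The function name `sliding_windows` and the toy data are assumptions for illustration only.

```python
def sliding_windows(items, seq_len, target_len):
    """Build (input sequence, target items) pairs with the same ranges as the training loop."""
    pairs = []
    for i in range(seq_len, len(items) - target_len):
        pairs.append((items[i - seq_len:i], items[i:i + target_len]))
    return pairs


# Hypothetical, chronologically ordered item IDs for one user.
toy_items = list(range(10))
seq_len, target_len = 5, 3

train_pairs = sliding_windows(toy_items, seq_len, target_len)

# The test example uses the last `target_len` items as the target and the
# `seq_len` items just before them as the input sequence.
test_pair = (toy_items[-seq_len - target_len:-target_len], toy_items[-target_len:])

for seq, target in train_pairs:
    print("train", seq, "->", target)
print("test ", test_pair[0], "->", test_pair[1])
```

With these ranges, the last training targets can contain items that also appear in the held-out test target (items 7 and 8 in the toy data), which is worth keeping in mind when interpreting the test metrics.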


In [6]:

import argparse

parser = argparse.ArgumentParser()

parser.add_argument('--file_path', type=str, default='/Users/leeisbadk/recommenders/examples/99_model_attrec/ml-100k/ml-100k/u.data', help='training data dir')
parser.add_argument('--test_path', type=str, default='input/test.csv', help='testing data dir')
parser.add_argument('--train_path', type=str, default='input/train.csv', help='training data dir')
parser.add_argument('--mode', type=str, default='train', help='train or test')
parser.add_argument('--w', type=float, default=0.3, help='The final score is a weighted sum of them with the controlling factor ω')
parser.add_argument('--num_epochs', type=int, default=30, help='number of epochs')
parser.add_argument('--sequence_length', type=int, default=5, help='sequence length')
parser.add_argument('--target_length', type=int, default=3, help='target length')  # should be 3
parser.add_argument('--neg_sample_count', type=int, default=10, help='number of negative sample')
parser.add_argument('--item_count', type=int, default=1685, help='number of items')
parser.add_argument('--user_count', type=int, default=945, help='number of user')
parser.add_argument('--embedding_size', type=int, default=100, help='embedding size')
parser.add_argument('--batch_size', type=int, default=256, help='batch size')
parser.add_argument('--learning_rate', type=float, default=1e-2, help='learning rate')
parser.add_argument('--keep_prob', type=float, default=0.5, help='keep prob of dropout')
parser.add_argument('--l2_lambda', type=float, default=1e-3, help='Regularization rate for l2')
parser.add_argument('--gamma', type=float, default=0.5, help='gamma of the margin hinge loss')
parser.add_argument('--grad_clip', type=float, default=10, help='gradient clipping threshold to prevent gradients from growing too large')
parser.add_argument('--save_path', type=str, default='save_path/model1.ckpt', help='the whole path to save the model')

FLAGS, unparsed = parser.parse_known_args()

print(FLAGS)





Namespace(file_path='/Users/leeisbadk/Library/Jupyter/runtime/kernel-v313f62ff26472befe2c513c3af12f03d61c2dd95b.json', test_path='input/test.csv', train_path='input/train.csv', mode='train', w=0.3, num_epochs=30, sequence_length=5, target_length=3, neg_sample_count=10, item_count=1685, user_count=945, embedding_size=100, batch_size=256, learning_rate=0.01, keep_prob=0.5, l2_lambda=0.001, gamma=0.5, grad_clip=10, save_path='save_path/model1.ckpt')
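
In the output above, `file_path` holds a kernel connection file rather than the declared default. One plausible explanation, assuming the kernel was launched with an argument such as `--f=<connection file>`, is that `argparse` accepts unambiguous prefixes of long options by default. The sketch below reproduces that behaviour with a hypothetical argument list and shows how `allow_abbrev=False` leaves the unrecognized argument in the unknown list instead.

```python
import argparse


def build_parser(allow_abbrev):
    """Small parser with two of the options used above (defaults are placeholders)."""
    parser = argparse.ArgumentParser(allow_abbrev=allow_abbrev)
    parser.add_argument("--file_path", type=str, default="u.data")
    parser.add_argument("--num_epochs", type=int, default=30)
    return parser


# Hypothetical argument list; the path is a placeholder.
argv = ["--num_epochs", "5", "--f=/tmp/kernel.json"]

# Default behaviour: "--f" is treated as an abbreviation of "--file_path".
loose_flags, loose_unknown = build_parser(True).parse_known_args(argv)

# Abbreviations disabled: "--f=..." is returned as an unknown argument.
strict_flags, strict_unknown = build_parser(False).parse_known_args(argv)

print(loose_flags.file_path, loose_unknown)
print(strict_flags.file_path, strict_unknown)
```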


In [12]:
def Metric_HR(target_list, predict_list, num):
    """Hit rate at `num`: fraction of targets that appear in the top-`num` predictions."""
    count = 0
    for i in range(len(target_list)):
        t = target_list[i]
        preds = predict_list[i]
        preds = preds[:num]
        if t in preds:
            count += 1
    return count / len(target_list)

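def Metric_MRR(target_list, predict_list):
    """Mean reciprocal rank; each target must appear in its prediction list (list.index raises ValueError otherwise)."""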
    count = 0
    for i in range(len(target_list)):
        t = target_list[i]
        preds = predict_list[i]
        rank = preds.index(t) + 1
        count += 1 / rank
    return count / len(target_list)
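
As a quick sanity check of the two metrics, the following sketch recomputes them on a hand-made example. The function names `hit_rate_at_k` and `mean_reciprocal_rank` and the toy rankings are assumptions for illustration; they restate the logic of `Metric_HR` and `Metric_MRR` in a self-contained form.

```python
def hit_rate_at_k(targets, ranked_lists, k):
    """Fraction of targets found among the first k entries of their ranked list."""
    hits = sum(1 for t, ranked in zip(targets, ranked_lists) if t in ranked[:k])
    return hits / len(targets)


def mean_reciprocal_rank(targets, ranked_lists):
    """Average of 1 / rank, where rank is the 1-based position of the target."""
    total = sum(1 / (ranked.index(t) + 1) for t, ranked in zip(targets, ranked_lists))
    return total / len(targets)


# Two hypothetical users: the first target is ranked 1st, the second is ranked 4th.
targets = [3, 7]
ranked_lists = [[3, 1, 2, 7], [1, 2, 3, 7]]

hr_at_2 = hit_rate_at_k(targets, ranked_lists, k=2)   # one hit out of two users
mrr = mean_reciprocal_rank(targets, ranked_lists)     # (1/1 + 1/4) / 2

print(hr_at_2, mrr)
```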

In [None]:
from recommenders.models.attrec.attrec import AttRec
from recommenders.models.attrec.dataIterator import DataIterator

import sys
import os
sys.path.append('..')
os.environ["CUDA_VISIBLE_DEVICES"]='1'

import numpy as np
import tensorflow.compat.v1 as tf  # TF1-style graph/session API used by the training loop below

tf.disable_v2_behavior()



def main(args):
    data, num_users, num_items = df, df['userID'].nunique(), df['itemID'].nunique()
    print(' make datasets')
    train_data, test_data ,user_all_items, all_user_count\
        , all_item_count, user_map, item_map \
        = create_train_test(FLAGS.file_path, FLAGS.sequence_length, FLAGS.target_length, is_Save=False)
    FLAGS.item_count = all_item_count
    FLAGS.user_count = all_user_count
    all_index = [i for i in range(FLAGS.item_count)]
    print(train_data)
    print(test_data)
    print(' load model and training')
    graph = tf.Graph()
    with graph.as_default():
      with tf.compat.v1.Session() as sess:
          #Load model
          model = AttRec(FLAGS)
          topk_index = model.predict(all_index,len(all_index))
          total_loss = model.loss

          #Add L2
          # with tf.name_scope('l2loss'):
          #     loss = model.loss
          #     tv = tf.trainable_variables()
          #     regularization_cost = FLAGS.l2_lambda * tf.reduce_sum([tf.nn.l2_loss(v) for v in tv])
          #     total_loss = loss + regularization_cost

          #Optimizer
          global_step = tf.Variable(0, trainable=False)
          update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
          with tf.control_dependencies(update_ops):
              optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate)
              tvars = tf.trainable_variables()
              grads, _ = tf.clip_by_global_norm(tf.gradients(total_loss, tvars), FLAGS.grad_clip)
              grads_and_vars = tuple(zip(grads, tvars))
              train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)


          #Saver and initializer
          saver = tf.train.Saver()
          if FLAGS.mode == 'test':
              saver.restore(sess, FLAGS.save_path)
          else:
              sess.run(tf.global_variables_initializer())

          #Batch reader
          trainIterator = DataIterator(data=train_data
                                      , batch_size=FLAGS.batch_size
                                      ,max_seq_length=FLAGS.batch_size
                                      ,neg_count=FLAGS.neg_sample_count
                                      ,all_items=all_index
                                      ,user_all_items=user_all_items
                                      ,shuffle=True)
          testIterator = DataIterator(data=test_data
                                      ,batch_size = FLAGS.batch_size
                                      , max_seq_length=FLAGS.batch_size
                                      , neg_count=FLAGS.neg_sample_count
                                      , all_items=all_index
                                      , user_all_items=user_all_items
                                      , shuffle=False)
          #Training and test for every epoch
          for epoch in range(FLAGS.num_epochs):
              cost_list = []
              for train_input in trainIterator:
                user, next_target, user_seq, sl, neg_seq = train_input

                # Convert lists to NumPy arrays before checking shape
                user_seq_array = np.array(user_seq)
                neg_seq_array = np.array(neg_seq)
                user_array = np.array(user)
                next_target_array = np.array(next_target)
                # Print shapes of relevant tensors
                # print("Shape of user_seq:", user_seq_array.shape)
                # print("Shape of neg_seq:", neg_seq_array.shape)
                # print("Shape of hist_seq:", user_seq_array.shape)  #tis the issue
                feed_dict = {model.u_p: user, model.next_p: next_target, model.sl: sl,
                            model.hist_seq: user_seq, model.neg_p: neg_seq,
                            model.keep_prob:FLAGS.keep_prob,model.is_Training:True}

                _, step, cost = sess.run([train_op, global_step, total_loss], feed_dict)
                cost_list.append(np.mean(cost))
              mean_cost = np.mean(cost_list)
              saver.save(sess, FLAGS.save_path)

              pred_list = []
              next_list = []
              # test and cal hr50 and mrr
              for test_input in testIterator:
                  user, next_target, user_seq, sl, neg_seq = test_input
                  feed_dict = {model.u_p: user, model.next_p: next_target, model.sl: sl,
                              model.hist_seq: user_seq,model.keep_prob:1.0
                              ,model.is_Training:False}
                  pred_indexs = sess.run(topk_index, feed_dict)
                  pred_list += pred_indexs.tolist()
                  #only predict one next item
                  single_target = [item[0] for item in next_target]
                  next_list += single_target
              hr50 = Metric_HR(next_list,pred_list,50)
              mrr = Metric_MRR(next_list,pred_list)
              print(" epoch {},  mean_loss{:g}, test HR@50: {:g}, test MRR: {:g}"
                    .format(epoch + 1, mean_cost,hr50,mrr))



if __name__ == '__main__':
    main([])

 make datasets
       user                        seq            target
0         0    [0, 289, 491, 380, 751]    [466, 522, 10]
1         0  [289, 491, 380, 751, 466]    [522, 10, 672]
2         0  [491, 380, 751, 466, 522]   [10, 672, 1045]
3         0   [380, 751, 466, 522, 10]  [672, 1045, 649]
4         0   [751, 466, 522, 10, 672]  [1045, 649, 377]
...     ...                        ...               ...
92451   942   [209, 10, 873, 935, 614]    [355, 158, 12]
92452   942   [10, 873, 935, 614, 355]    [158, 12, 141]
92453   942  [873, 935, 614, 355, 158]    [12, 141, 452]
92454   942   [935, 614, 355, 158, 12]   [141, 452, 672]
92455   942   [614, 355, 158, 12, 141]    [452, 672, 68]

[92456 rows x 3 columns]
     user                           seq            target
0       0    [834, 438, 632, 656, 1006]   [947, 363, 521]
1       1       [452, 899, 25, 246, 48]    [530, 145, 31]
2       2     [758, 437, 458, 476, 368]    [769, 14, 305]
3       3   [834, 215, 1092, 945, 1100]  [6

I0000 00:00:1732037766.321900 1361356 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
W0000 00:00:1732037766.503567 1361356 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" model: "0" frequency: 2400 num_cores: 8 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 16384 l2_cache_size: 524288 l3_cache_size: 524288 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
W0000 00:00:1732037774.419466 1361356 op_level_cost_estimator.cc:699] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "CPU" model: "0" frequency: 2400 num_cores: 8 environment { key: "cpu_instruction_set" value: "ARM NEON" } environment 

 epoch 1,  mean_loss0.133791, test HR@50: 0.420997, test MRR: 0.0483422
 epoch 2,  mean_loss0.0635938, test HR@50: 0.47508, test MRR: 0.0553399
 epoch 3,  mean_loss0.0547148, test HR@50: 0.493107, test MRR: 0.051918
 epoch 4,  mean_loss0.0521787, test HR@50: 0.509014, test MRR: 0.0594593
 epoch 5,  mean_loss0.0509994, test HR@50: 0.530223, test MRR: 0.0622902
 epoch 6,  mean_loss0.0501182, test HR@50: 0.532344, test MRR: 0.0599459
 epoch 7,  mean_loss0.0493966, test HR@50: 0.534464, test MRR: 0.0675514
 epoch 8,  mean_loss0.0488412, test HR@50: 0.514316, test MRR: 0.0663028
 epoch 9,  mean_loss0.0484448, test HR@50: 0.534464, test MRR: 0.0641826
 epoch 10,  mean_loss0.0480731, test HR@50: 0.534464, test MRR: 0.0651407
 epoch 11,  mean_loss0.0477825, test HR@50: 0.533404, test MRR: 0.0611846
 epoch 12,  mean_loss0.047418, test HR@50: 0.52492, test MRR: 0.0658963
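
The training cell clips gradients with `tf.clip_by_global_norm` before applying them. As a rough illustration of that rule, the pure-Python sketch below rescales a list of toy gradients so that their joint L2 norm does not exceed the threshold. The function is a simplified stand-in written for this example, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Hypothetical gradients for two variables; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)

print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is limited.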


## 1 Section1

Sections can be hierarchical. Level numbers are connected by ".". 

### 1.1 Sub-section1

Note that:
1. The Python code in the notebook should follow the PEP 8 standard.
2. The code should be formatted with Black.

In [None]:
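def check_version(version, current_version):
    """Return True if `version` is not newer than `current_version`; raise ValueError otherwise."""
    v1_parts = version.split(".")
    v2_parts = current_version.split(".")

    # Pad the version parts with zeros to ensure equal length
    max_parts = max(len(v1_parts), len(v2_parts))
    v1_parts.extend(["0"] * (max_parts - len(v1_parts)))
    v2_parts.extend(["0"] * (max_parts - len(v2_parts)))

    # Compare part by part; the first differing part decides the result
    for v1, v2 in zip(v1_parts, v2_parts):
        if int(v1) < int(v2):
            print(f"{version} is older than {current_version}")
            return True
        elif int(v1) > int(v2):
            print(f"Error: {version} is newer than {current_version}")
            raise ValueError(f"{version} is newer than {current_version}")

    # All parts are equal
    print(f"{version} is the same as {current_version}")
    return True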


In [None]:
checked_version = check_version(RECOMMENDERS_VERSION, recommenders.__version__)

0.1.1 is older than 1.2.0
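
The part-by-part comparison can also be written compactly. The sketch below is one possible formulation, with the hypothetical name `compare_versions`, that assumes purely numeric, dot-separated versions and pads the shorter one with zeros.

```python
from itertools import zip_longest


def compare_versions(a, b):
    """Return -1 if a is older than b, 0 if equal, 1 if a is newer."""
    parts_a = [int(p) for p in a.split(".")]
    parts_b = [int(p) for p in b.split(".")]
    # zip_longest pads the shorter version with zeros, so "1.2" equals "1.2.0".
    for x, y in zip_longest(parts_a, parts_b, fillvalue=0):
        if x != y:
            return -1 if x < y else 1
    return 0


print(compare_versions("0.1.1", "1.2.0"))
print(compare_versions("1.2", "1.2.0"))
print(compare_versions("1.10.0", "1.9.9"))
```

Comparing integers rather than strings matters for the last case: as text, "10" sorts before "9".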


Code in a notebook is tested with `store_metadata`. The example below shows how to record a variable for testing purposes.

In [None]:
store_metadata("checked_version", checked_version)

#### 1.1.1 Sub-sub-section

### 1.2 Sub-section2

## 2 Section2

## References

References are highly encouraged for technical explanations in notebooks, so that readers can easily understand the theory and reproduce the code. 

**Example:**
    
1. Jianxun Lian et al., "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems", Proc. ACM KDD, London, UK, 2018, pp. 1754-1763.
2. PySpark MLlib evaluation metrics, url: https://spark.apache.org/docs/2.3.0/mllib-evaluation-metrics.html

Note that this section is not part of the notebook body, so its name does not have to be numbered. 