# Neural Cognitive Diagnosis Model (NCDM)
This notebook shows how to train and use the NCDM. First, we load the data (FrcSub is selected by default; Math1, Math2, ASSIST_0910 and ASSIST_2017 are also supported). Then we train an NCDM and save its parameters to a file. Finally, we load the parameters from that file and evaluate the model on the test dataset.

The script version can be found in [NCDM.py](NCDM.py).

# Data Preparation
Before processing the data, we first need to acquire the dataset, as shown in [prepare_dataset.ipynb](prepare_dataset.ipynb).

In [41]:
# Load the data from files
import ast

import pandas as pd
import numpy as np

dataSet_list = ('FrcSub', 'Math1', 'Math2', 'ASSIST_0910', 'ASSIST_2017')
dataSet = dataSet_list[0]

# sub_prob_index is only loaded for Math1 and Math2; it is empty otherwise.
# For FrcSub, Math1 and Math2, test.csv is also used as the validation split.
if dataSet == 'ASSIST_0910':
    read_dir = '../data/a0910/'
    sub_prob_index = []
    valid_data = pd.read_csv(read_dir + "valid.csv")
elif dataSet == 'ASSIST_2017':
    read_dir = '../data/a2017/'
    sub_prob_index = []
    valid_data = pd.read_csv(read_dir + "valid.csv")
elif dataSet == 'FrcSub':
    read_dir = '../data/frcSub/'
    sub_prob_index = []
    valid_data = pd.read_csv(read_dir + "test.csv")
elif dataSet == 'Math1':
    read_dir = '../data/math1/'
    sub_prob_index = np.loadtxt(read_dir + 'sub_prob_index.csv', dtype=int)
    valid_data = pd.read_csv(read_dir + "test.csv")
elif dataSet == 'Math2':
    read_dir = '../data/math2/'
    sub_prob_index = np.loadtxt(read_dir + 'sub_prob_index.csv', dtype=int)
    valid_data = pd.read_csv(read_dir + "test.csv")
else:
    # Fail loudly instead of silently exiting the kernel
    raise ValueError('Dataset does not exist!')
print('Dataset:', dataSet)

train_data = pd.read_csv(read_dir + "train.csv")

test_data = pd.read_csv(read_dir + "test.csv")
df_item = pd.read_csv(read_dir + "item.csv")

# Map each item to its de-duplicated knowledge codes and collect all codes
item2knowledge = {}
knowledge_set = set()
for _, s in df_item.iterrows():
    # knowledge_code is stored as a string such as "[4, 6, 7]"; parse it as a literal
    item_id, knowledge_codes = s['item_id'], list(set(ast.literal_eval(s['knowledge_code'])))
    item2knowledge[item_id] = knowledge_codes
    knowledge_set.update(knowledge_codes)

train_data.head(5)

Dataset: FrcSub


Unnamed: 0,user_id,item_id,score
0,1,1,0.0
1,1,3,0.0
2,1,5,0.0
3,1,6,0.0
4,1,7,1.0


In [42]:
df_item

Unnamed: 0,item_id,knowledge_code
0,1,"[4, 6, 7]"
1,2,"[4, 7]"
2,3,"[4, 7]"
3,4,"[2, 3, 5, 7]"
4,5,"[2, 4, 7, 8]"
5,6,[7]
6,7,"[1, 2, 7]"
7,8,[7]
8,9,[2]
9,10,"[2, 5, 7, 8]"
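The `knowledge_code` column holds strings such as `"[4, 6, 7]"`, so each row has to be parsed before it can be used as a list. A minimal sketch of that step on three of the rows shown above, assuming `ast.literal_eval` for parsing (it accepts only Python literals). The `rows` list is a hypothetical stand-in for `df_item`, and the codes are sorted here only to make the printed result deterministic.

```python
import ast

# Hypothetical stand-in for three rows of df_item: (item_id, knowledge_code string)
rows = [(1, "[4, 6, 7]"), (2, "[4, 7]"), (6, "[7]")]

item2knowledge = {}
knowledge_set = set()
for item_id, code_str in rows:
    codes = sorted(set(ast.literal_eval(code_str)))  # parse, then de-duplicate
    item2knowledge[item_id] = codes
    knowledge_set.update(codes)

print(item2knowledge)      # {1: [4, 6, 7], 2: [4, 7], 6: [7]}
print(max(knowledge_set))  # 7
```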


In [43]:
len(train_data), len(valid_data), len(test_data)

(8576, 2144, 2144)

In [44]:
# Get basic data info for model initialization
import numpy as np
user_n = np.max(train_data['user_id'])
item_n = np.max([np.max(train_data['item_id']), np.max(valid_data['item_id']), np.max(test_data['item_id'])])
knowledge_n = np.max(list(knowledge_set))

user_n, item_n, knowledge_n

(536, 20, 8)
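Using the maximum ID as the count works only when IDs start at 1, because the later `- 1` shift turns them into 0-based indices. The printed values are also consistent with every user answering every item: 536 × 20 = 10720 = 8576 + 2144 responses. A small sketch of both checks follows; `check_one_based` is a hypothetical helper, not part of the notebook.

```python
def check_one_based(ids, n):
    """Return True if every ID lies in [1, n], so that id - 1 is a valid index."""
    return all(1 <= i <= n for i in ids)

user_n, item_n = 536, 20        # values printed above
n_train, n_test = 8576, 2144    # split sizes printed above

print(check_one_based([1, 20, 7], item_n))  # True
print(check_one_based([0, 5], item_n))      # False: a 0-based ID would break the "- 1" shift

# A full response matrix would have user_n * item_n entries.
print(user_n * item_n == n_train + n_test)  # True
```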

In [45]:
# Transform data to torch Dataloader (i.e., batchify)
# batch_size is set to 32

import torch
from torch.utils.data import TensorDataset, DataLoader

batch_size = 32
def transform(user, item, item2knowledge, score, batch_size):
    knowledge_emb = torch.zeros((len(item), knowledge_n))
    for idx in range(len(item)):
        knowledge_emb[idx][np.array(item2knowledge[item[idx]]) - 1] = 1.0

    data_set = TensorDataset(
        torch.tensor(user, dtype=torch.int64) - 1,  # IDs in [1, user_n] -> indices in [0, user_n - 1]
        torch.tensor(item, dtype=torch.int64) - 1,  # IDs in [1, item_n] -> indices in [0, item_n - 1]
        knowledge_emb,
        torch.tensor(score, dtype=torch.float32)
    )
    return DataLoader(data_set, batch_size=batch_size, shuffle=True)


train_set, valid_set, test_set = [
    transform(data["user_id"], data["item_id"], item2knowledge, data["score"], batch_size)
    for data in [train_data, valid_data, test_data]
]

train_set, valid_set, test_set

(<torch.utils.data.dataloader.DataLoader at 0x22278c8f130>,
 <torch.utils.data.dataloader.DataLoader at 0x22210557a30>,
 <torch.utils.data.dataloader.DataLoader at 0x22210557f10>)
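`transform` does two things: it turns each item's knowledge codes into a multi-hot vector of length `knowledge_n`, and it groups the rows into batches. The sketch below illustrates both without any dependencies; `multi_hot` is a hypothetical pure-Python stand-in for the tensor indexing in `transform`, and the batch count assumes the `DataLoader` default of keeping the last partial batch.

```python
import math

def multi_hot(codes, knowledge_n):
    """Return a 0/1 list of length knowledge_n with ones at the 1-based codes."""
    vec = [0.0] * knowledge_n
    for c in codes:
        vec[c - 1] = 1.0  # shift the 1-based code to a 0-based position
    return vec

# Item 1 of FrcSub covers knowledge codes 4, 6 and 7 out of 8
print(multi_hot([4, 6, 7], 8))
# [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]

# Number of batches per split: ceil(n_rows / batch_size)
batch_size = 32
n_batches = (math.ceil(8576 / batch_size), math.ceil(2144 / batch_size))
print(n_batches)  # (268, 67)
```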

# Training and Persistence

In [46]:
import logging
logging.getLogger().setLevel(logging.INFO)

In [47]:
from NCDM import NCDM

cdm = NCDM(knowledge_n, item_n, user_n)
cdm.train(train_set, valid_set, sub_prob_index, epoch=3, device="cuda")
cdm.save("ncdm.snapshot")

Epoch 0: 100%|██████████| 268/268 [00:00<00:00, 360.63it/s]


[Epoch 0] average loss: 0.955596


Evaluating: 100%|██████████| 67/67 [00:00<00:00, 1049.68it/s]


[Epoch 0] obj_acc: 0.534049,obj_auc: 0.611185,obj_rmse: 0.499534, obj_mae: 0.495882


Epoch 1: 100%|██████████| 268/268 [00:00<00:00, 375.46it/s]


[Epoch 1] average loss: 0.693264


Evaluating: 100%|██████████| 67/67 [00:00<00:00, 781.22it/s]


[Epoch 1] obj_acc: 0.465951,obj_auc: 0.661415,obj_rmse: 0.500797, obj_mae: 0.500694


Epoch 2: 100%|██████████| 268/268 [00:00<00:00, 352.64it/s]


[Epoch 2] average loss: 0.693863


Evaluating: 100%|██████████| 67/67 [00:00<00:00, 855.70it/s]
INFO:root:save parameters to ncdm.snapshot


[Epoch 2] obj_acc: 0.534049,obj_auc: 0.741887,obj_rmse: 0.499447, obj_mae: 0.495899
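One way to read these numbers, offered as a sanity check rather than a diagnosis: if the training loss is binary cross-entropy (an assumption, since the loss is defined in NCDM.py and not shown here), a model that outputs 0.5 for every response has loss ln 2 ≈ 0.693 and RMSE 0.5, which is close to what epochs 1 and 2 report. The two accuracy values seen across epochs, 0.534049 and 0.465951, also sum to 1, as they would if all predictions fell on one side of the decision threshold and that side flipped between epochs.

```python
import math

p = 0.5                                   # constant prediction
bce_at_half = -math.log(p)                # same for label 0 and label 1 when p = 0.5
rmse_at_half = math.sqrt((1 - p) ** 2)    # every error has magnitude 0.5

print(round(bce_at_half, 6))              # 0.693147
print(rmse_at_half)                       # 0.5
print(round(0.534049 + 0.465951, 6))      # 1.0
```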


# Loading and Testing

In [48]:
cdm.load("ncdm.snapshot")
print('data_set_name:', dataSet)
# eval returns an extra (sub_rmse, sub_mae) pair when sub_prob_index is non-empty
if len(sub_prob_index) > 0:
    (obj_acc, obj_auc, obj_rmse, obj_mae), (sub_rmse, sub_mae) = cdm.eval(test_set, sub_prob_index)
    print("obj_acc: %.6f,obj_auc: %.6f,obj_rmse: %.6f, obj_mae: %.6f,\nsub_rmse: %.6f, sub_mae: %.6f" % (
        obj_acc, obj_auc, obj_rmse, obj_mae, sub_rmse, sub_mae))
else:
    obj_acc, obj_auc, obj_rmse, obj_mae = cdm.eval(test_set, sub_prob_index)
    print("obj_acc: %.6f,obj_auc: %.6f,obj_rmse: %.6f, obj_mae: %.6f" % (
        obj_acc, obj_auc, obj_rmse, obj_mae))

INFO:root:load parameters from ncdm.snapshot


data_set_name: FrcSub


Evaluating: 100%|██████████| 67/67 [00:00<00:00, 946.21it/s]

obj_acc: 0.534049,obj_auc: 0.741899,obj_rmse: 0.499447, obj_mae: 0.495899
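Three of the four reported metrics can be sketched on a toy example. This assumes predictions are thresholded at 0.5 for accuracy, a common convention that the notebook does not confirm; how `cdm.eval` computes them is defined in NCDM.py. AUC is left out because it needs a ranking computation. The labels and predictions below are made up for illustration.

```python
import math

y_true = [1, 0, 1, 1]           # made-up labels
y_pred = [0.9, 0.2, 0.4, 0.7]   # made-up predicted probabilities

# Accuracy: fraction of predictions on the correct side of 0.5 (assumed threshold)
acc = sum((p >= 0.5) == bool(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# MAE and RMSE are computed on the raw probabilities, not the thresholded labels
errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"acc={acc:.2f} mae={mae:.2f} rmse={rmse:.4f}")
# acc=0.75 mae=0.30 rmse=0.3536
```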



