Datasets are not sorted by time, model uses information from the future #3

artursil · 2022-03-06T12:08:28Z

I went through the code and one thing is bothering me: I think there is a major bug in the implementation. It is possible that I have misunderstood something, so please correct me if I'm wrong, but as far as I understand, this code trains and validates using information "from the future".

If you examine the values in the code below, you will see that there are negative values for the delta.

TGSRec/model.py

Line 557 in 0c7ba17

src_ngh_t_batch_delta = cut_time_l[:, np.newaxis] - src_ngh_t_batch

I can see that the mask is created only for the 0 values, so negative values are still used.

TGSRec/model.py

Line 575 in 0c7ba17

mask = src_ngh_node_batch_th == 0
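To make the concern concrete, here is a minimal sketch on made-up numbers; it is not the repository's code. It applies the delta expression from line 557 to toy NumPy arrays and shows that a mask testing only for 0 leaves a negative delta unmasked. It assumes that node id 0 marks a padded neighbor slot. The names `padding_mask`, `future_mask`, `combined_mask` and `leaked` are hypothetical and only illustrate one way such entries could be detected and excluded.

```python
import numpy as np

# Toy inputs (invented for illustration): two query times, three sampled neighbors each.
cut_time_l = np.array([10.0, 20.0])
src_ngh_node_batch = np.array([[3, 7, 0],      # assumption: 0 marks a padded slot
                               [5, 2, 9]])
src_ngh_t_batch = np.array([[4.0, 12.0, 0.0],  # 12.0 is later than the query time 10.0
                            [15.0, 18.0, 19.0]])

# Same expression as the quoted line: time elapsed since each neighbor interaction.
src_ngh_t_batch_delta = cut_time_l[:, np.newaxis] - src_ngh_t_batch

# Equivalent of the quoted mask, on a NumPy array instead of a tensor:
# it hides padded entries only.
padding_mask = src_ngh_node_batch == 0

# Hypothetical extra mask: neighbors whose interaction lies after the query time.
future_mask = src_ngh_t_batch_delta < 0
combined_mask = padding_mask | future_mask

# Entries that are in the future but not hidden by the padding mask.
leaked = future_mask & ~padding_mask
```

Any `True` entry in `leaked` is a neighbor interaction that happened after the query time and would still be attended to under the padding-only mask.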

The data here is sorted by edge id, not by timestamp, so a possible fix would be to sort by x[2] instead of x[1].

TGSRec/graph.py

Lines 34 to 39 in 0c7ba17

    
    for i in range(len(adj_list)):
        curr = adj_list[i]
        curr = sorted(curr, key=lambda x: x[1])
        n_idx_l.extend([x[0] for x in curr])
        e_idx_l.extend([x[1] for x in curr])
        n_ts_l.extend([x[2] for x in curr])
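The proposed change can be sketched as follows. This is an illustration under stated assumptions, not a patch to graph.py: `flatten_adj_list` and its `sort_index` parameter are hypothetical names, and each entry is assumed to be a `(neighbor_id, edge_id, timestamp)` tuple, which is what the `x[0]`, `x[1]`, `x[2]` accesses in the quoted loop suggest.

```python
def flatten_adj_list(adj_list, sort_index=2):
    """Flatten per-node neighbor lists, sorting each list by one tuple field.

    Assumption: every entry is (neighbor_id, edge_id, timestamp).
    sort_index=1 reproduces the quoted behavior (sort by edge id);
    sort_index=2 sorts by timestamp instead.
    """
    n_idx_l, e_idx_l, n_ts_l = [], [], []
    for curr in adj_list:
        curr = sorted(curr, key=lambda x: x[sort_index])
        n_idx_l.extend(x[0] for x in curr)
        e_idx_l.extend(x[1] for x in curr)
        n_ts_l.extend(x[2] for x in curr)
    return n_idx_l, e_idx_l, n_ts_l


# Toy data (invented): edge ids do not follow timestamp order.
adj_list = [[(2, 1, 30.0), (3, 2, 10.0), (4, 3, 20.0)]]

_, _, ts_by_edge = flatten_adj_list(adj_list, sort_index=1)
_, _, ts_by_time = flatten_adj_list(adj_list, sort_index=2)
```

With this toy input, sorting by edge id leaves the timestamps out of order, while sorting by the third field yields a non-decreasing timestamp list, which any later search over neighbor times would need.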

If you look at TGSRec/datasets/ml-100k/u.data, the data is not sorted by timestamp, and as far as I can tell there is no point in your codebase where this sorting happens.
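A quick way to check this claim is sketched below. It assumes the standard MovieLens 100k layout for u.data (tab-separated user id, item id, rating, timestamp); `read_interactions` and `is_sorted_by_time` are hypothetical helpers, and the sample rows are made up rather than taken from the dataset.

```python
import csv
import io


def read_interactions(handle):
    """Read tab-separated rows of (user, item, rating, timestamp)."""
    rows = []
    for user, item, rating, ts in csv.reader(handle, delimiter="\t"):
        rows.append((int(user), int(item), float(rating), int(ts)))
    return rows


def is_sorted_by_time(rows):
    """Return True if timestamps never decrease from one row to the next."""
    return all(a[3] <= b[3] for a, b in zip(rows, rows[1:]))


# Made-up rows in the assumed u.data layout; a real run would open the file instead.
sample = io.StringIO("1\t10\t5\t300\n2\t11\t3\t100\n1\t12\t4\t200\n")
rows = read_interactions(sample)

# sorted() is stable, so rows with equal timestamps keep their original order.
sorted_rows = sorted(rows, key=lambda r: r[3])
```

Running `is_sorted_by_time` on the rows read from the real file would confirm or refute the observation, and writing `sorted_rows` back out is one possible way to produce the sorted input used in the comparison.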

I ran experiments on ml-100k in both settings: your original implementation, and the same code with the input data sorted by timestamp. The results with sorted data are significantly worse, at least in the early stages of training. I haven't run it for 200 epochs, so the final results may be closer to each other, but first I would like to check whether my assumption is correct.

Results after 20 epochs:
Without sorting:

valid acc: 0.7069337926425662
valid auc: 0.8038448618385599
valid f1: 0.7070636805233961
valid ap: 0.8172432697828477

With sorting:

valid acc: 0.5271334211112526
valid auc: 0.7374971517076878
valid f1: 0.6789490018391757
valid ap: 0.7001478155001294


zyh981022 · 2022-03-10T02:46:30Z

Hello, your finding is really crucial. What about the results on the other datasets? Are they incorrect in the same way as for ml-100k?

artursil · 2022-03-11T06:57:17Z

Hello, I haven't checked the results for other datasets. I only checked whether they have a similar sorting problem, and unfortunately they do. If you run process_amazon.py without any changes, the output isn't sorted by timestamp, so I would expect it to affect the results as well. I'm not sure in what way, though.

zfan20 · 2022-04-18T04:13:33Z

Hi all,

The codebase did have a bug: the indices were not sorted by timestamp. I have updated the preprocessing code for the Amazon and ml-100k datasets. We will rerun the experiments as soon as we can.

Best to all,

Ziwei

homaonchij · 2022-08-10T08:49:53Z

Hello, may I ask you some questions about the code for this paper? I'm sorry to disturb you. Could you leave some contact information, if that's convenient for you?

SungMinCho · 2023-02-02T09:26:41Z

Has the paper been updated?
