
[FEA] Model inference locally #359

Closed
Ahanmr opened this issue Jan 13, 2022 · 18 comments


Ahanmr commented Jan 13, 2022

🚀 Feature request

I wanted to check whether it is currently possible to perform model inference locally instead of setting up Triton Server. If yes, where is the documentation for this currently available?

Motivation

  • Currently, whenever I build a new model, I have to deploy it on Triton Server. But just to check the model's performance on a batch of 10-20 sessions for testing, I wanted to know whether there is a way to see results by running inference locally.
  • Secondly, to set up Triton Server, one needs to download and set up the nvcr.io/nvidia/merlin/merlin-inference:21.09 Docker image. Is there a workaround to run inference locally instead?

rnyak (Contributor) commented Jan 18, 2022

@Ahanmr By model performance, do you mean model accuracy? If yes, you need to use trainer.evaluate(metric_key_prefix='eval') for a PyTorch model to evaluate your model's prediction power on the validation set first (see this example). If you want to do final prediction on a test set, you can simply do the following after model training:

model.eval()
model(input_batch_test_set)

and for a TensorFlow model you can do model.predict(input_batch_test_set).

You don't need the inference server for that, if that's your question.
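(For reference, a minimal sketch of that local prediction flow for the PyTorch model; it assumes input_batch_test_set is a dictionary of feature tensors coming from the dataloader and that the model returns the output dictionary with a "predictions" key shown later in this thread:)

import torch

model.eval()                      # switch the model to evaluation mode
with torch.no_grad():             # no gradients needed for inference
    output = model(input_batch_test_set)
scores = output["predictions"]    # one row of item scores per session in the batch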

Ahanmr (Author) commented Jan 27, 2022

So, the following line of code helped me get the metrics for the test_set, but I wanted to confirm once that I'm doing this correctly, so that I can go ahead and deploy this.

eval_metrics = trainer.evaluate(eval_dataset=eval_data_paths, metric_key_prefix='eval')
for key in sorted(eval_metrics.keys()):
    print("  %s = %s" % (key, str(eval_metrics[key])))

After this, to get the predictions on a batch of product_id sessions, I do the following:

eval_dataload = get_dataloader(eval_data_paths, train_args.per_device_eval_batch_size)
batch = next(iter(eval_dataload))

and batch looks like this:

{'product_id-list_seq': tensor([[   642,    642,    642,      0,      0,      0,      0,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0],
         [  2962,   2962,   2962,      0,      0,      0,      0,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0],
         [ 12252,  10221,  18489,  20159,  31245,  10221,  18885,  12252,  18885,
           12252,  18885,  12252,   2886,  20159,   2886,  20159,   2886,  31245,
           78571,  10221],
         [  5005,   1839,   3112,  15067,   3112,    455,   3503,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0],
         ...
         [  6589,   1881,    185,      0,      0,      0,      0,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0]], device='cuda:0')}

Then finally, on running model(batch), I got the following output dictionary. I just wanted to clarify: what are the labels here? Are they the targets or the predicted output?

output = model(batch)

{'loss': tensor(12.9607, device='cuda:0', grad_fn=<NllLossBackward0>),
 'labels': tensor([   642,    642,   2962,   2962,  10221,  18489,  20159,  31245,  10221,
          18885,  12252,  18885,  12252,  18885,  12252,   2886,  20159,   2886,
          20159,   2886,  31245,  78571,  10221,   1839,   3112,  15067,   3112,
            455,   3503,  27648,  33940, 371869,   1693,   3809,   1693,   1152,
          39043,  20117, 162305,  42138,   1976,  11139,   1920,    746,    746,
            322,   3780,   3121, 185678,   3121,   3780,   3121,   3780,   3121,
           3121,  73064,  73064,  73064, 419261, 147073,  95499, 175334,  59296,
           1497,   2602,   2602, 462300,  50554,   7634,  50554,  74959,  50554,
         225943,   3525,   1264,    501,   1264,     79,   6027,   6027,  31012,
             21,      6,    306,   4363,  14283,  30137,   4363,  30137, 224529,
         143820, 224522,    925,    576,   1902,  17982,  25200, 162768, 162772,
         203166, 162772, 162768,  85980,  62426,   1881,    185],
        device='cuda:0'),
 'predictions': tensor([[ -40.8026,  -41.2528,  -13.0302,  ...,  -41.4139,  -41.3386,
           -41.0184],
         [ -40.8026,  -41.2528,  -13.0302,  ...,  -41.4139,  -41.3386,
           -41.0184],
         [ -57.8077,  -57.5196,  -21.6697,  ...,  -58.0113,  -55.8364,
           -57.8259],
         ...,
         [ -16.0957,  -16.1184,  -10.7048,  ...,  -16.0890,  -16.1076,
           -16.0958],
         [-103.6853, -102.6834,  -13.4621,  ..., -100.6037, -104.4472,
          -104.3079],
         [ -70.6631,  -70.4196,   -5.5336,  ...,  -71.1709,  -70.4627,
           -70.9110]], device='cuda:0', grad_fn=<LogSoftmaxBackward0>),
 'pred_metadata': {},
 'model_outputs': []}

So, I extracted output['predictions'], but this is a tensor of logits, one per item in the original catalog. Is there a direct function to reverse-map the top-n maximum values of this tensor to product_ids? Am I correct that the top-n maximum values correspond to the products with the highest scores? I also wanted to check whether there is a way to convert this tensor directly into a readable output with the original product_ids from the dataframe, like the one shown in the Triton server notebook.
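(As a minimal sketch, assuming output['predictions'] has shape (batch_size, item_cardinality), the top-n highest-scoring items per session could be pulled out with torch.topk; note these are still the encoded/categorified ids, not the original product_ids:)

import torch

top_n = 10
# values and indices of the n highest-scoring items per session (indices = encoded item ids)
scores, encoded_top_n = torch.topk(output['predictions'], k=top_n, dim=-1)
print(encoded_top_n)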

Ahanmr closed this as completed Feb 7, 2022
@SaadAhmed433

I am also confused about this. Is there a way to get sorted predictions for each session with their labels (predicted output)?

rnyak (Contributor) commented Mar 17, 2022

@SaadAhmed433 The model generates predicted logit values and you need to convert them to the recommended item_ids. See this example for how we do it with a util function (cell 23).

Hope that helps.

@SaadAhmed433

Yes, I have been looking at this notebook for some time now, and after a bit of experimentation I was able to make it work for my use case. Hopefully the implementation is correct @rnyak

test_path = ['./data/sessions_by_day_mind/6/test.parquet']
trainer.eval_dataloader = get_dataloader(test_path, train_args.per_device_eval_batch_size)

#batch 
batch = next(iter(trainer.get_eval_dataloader()))

The batch contains 32 sessions.

model.eval()

model(batch, training=False)

response =  model(batch, training=False)["predictions"]

This stores the prediction logits in the response variable above.

Now I load the data that I have pre-processed

df = cudf.read_parquet("./data/sessions_by_day_mind/6/test.parquet")
df = df.iloc[0:32, :]  # the same 32 sessions selected in the dataframe, to correspond to the ones selected in the batch

[screenshot of the first 32 sessions of the test dataframe]

Made a very slight change to the util function because I am not using the tritonclient inference server

import numpy as np

def visualize_response(batch, response, top_k, session_col="session_id"):
    """
    Util function to extract top-k encoded item-ids from logits

    Parameters
    ----------
    batch : cudf.DataFrame
        the batch of raw data (read locally here instead of being sent to Triton server).
    response : torch.Tensor
        prediction logits as a tensor.
    top_k : int
        the `top_k` top items to retrieve from predictions.
    """
    sessions = batch[session_col].drop_duplicates().values
    predictions = response.cpu().detach().numpy()  # changed from response.as_numpy() to .cpu().detach().numpy()
    top_preds = np.argpartition(predictions, -top_k, axis=1)[:, -top_k:]
    for session, next_items in zip(sessions, top_preds):
        print(
            "- Top-%s predictions for session `%s`: %s\n"
            % (top_k, session, " || ".join([str(e) for e in next_items]))
        )

Finally

visualize_response(df, response , top_k=5, session_col='session_id')

Output:

[screenshot of the top-5 predictions printed for each session]

@kontrabas380

@SaadAhmed433 how did you define the method "get_dataloader"?

@SaadAhmed433

@kontrabas380 I followed the tutorial example given in this notebook

[screenshot of the get_dataloader definition from the tutorial notebook]

@cchudzian

Hi All,

First of all, thank you for sharing this awesome code with us!
Hope you don't mind me adding my two cents to the very interesting thread.

I'm afraid that there's an issue with the local inference as explained above, i.e.

model.eval()
model(input_batch_test_set)

If the masking layer is retained in a trained model and not disabled (I'm not sure there's an easy way to turn it off or remove it), then at inference time it will replace the embedding of the last element of every sequence with a learnable masking embedding. To my understanding this effectively makes the model ignore the last, most recently consumed item.

@lightsailpro

@SaadAhmed433: About the local prediction: in your code, "batch = next(iter(trainer.get_eval_dataloader()))" only returns 32 sessions. Do you know how to loop through the whole file to get all sessions? Does trainer.eval_dataloader remember the last batch read position, so that we just keep calling next until the batch is null, or does "next(iter(" just randomly return 32 records? Thanks

test_path = ['./data/sessions_by_day_mind/6/test.parquet']
trainer.eval_dataloader = get_dataloader(test_path, train_args.per_device_eval_batch_size)

#batch
batch = next(iter(trainer.get_eval_dataloader()))

@SaadAhmed433

Yes, you can keep calling next and you will get the next batch in sequence; it remembers where it ended and will start from the new position.

Here is a small snippet of code

test_path = ['./data/test.parquet']
test_loader = get_dataloader(test_path)
#create iterable batch 
iterable_batch = iter(test_loader)

Now you loop over the length of the test_loader and apply the next method

for j in range(len(test_loader)):
    batch = next(iterable_batch)
    
    with torch.no_grad():
        response = model_p(batch, training=False)["predictions"]
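(If you want to keep the results, a small extension of the loop above that collects the predictions of every batch into one tensor; model_p and test_loader are the objects defined above:)

import torch

all_preds = []
iterable_batch = iter(test_loader)
for _ in range(len(test_loader)):
    batch = next(iterable_batch)
    with torch.no_grad():
        preds = model_p(batch, training=False)["predictions"]
    all_preds.append(preds.cpu())               # move each batch to CPU to free GPU memory

all_predictions = torch.cat(all_preds, dim=0)   # shape: (n_sessions, item_cardinality)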

@hosseinkalbasi

@SaadAhmed433 thanks for the detailed explanation. Regarding the last section, the visualize_response function: what are those numbers in front of the session_id in the result? For example, for session 17 you have a recommendation of 230. Is that a session ID, an item ID, or an index? Could you finish this last part of your example?

@hosseinkalbasi

@rnyak Thanks for the explanation. Are you sure the printed predictions of this example are item IDs? Or are they indices of items? I searched the initial dataset and cannot find some of these numbers as item IDs. I wish the example would go one extra step and print exactly the recommendation from the initial dataset.

rnyak (Contributor) commented Aug 8, 2022

@hosseinkalbasi These are categorified (encoded) item ids. You need to convert/map them back to the original item ids.

In order to do that, you need to find the categories folder where you can see the parquet file unique.item_id.parquet containing the mapping encoded item_id --> original item_id. Just read in this parquet file, and you will see that the index column is the encoded item_id and the item_id column is the original id. From there you can do the mapping yourself with simple custom code.

hosseinkalbasi commented Aug 8, 2022

Thank you @rnyak for the detailed response, interesting! I understand now!
Unfortunately, I still cannot find the connection between the item_id and the encoded item id. I unpacked unique.item_id.parquet as you specified. The encoded item id is not there, and the magnitude of the item ids differs from what we see as recommendations (encoded item ids). I got the same result after unpacking ./workflow_etl/categories/unique.session_id.parquet. Did I proceed as you instructed? If I could somehow decode the encoded item id, that would be great!

[screenshot of the unpacked unique.item_id.parquet dataframe]

rnyak (Contributor) commented Aug 8, 2022

The index column, from 0 to 52739, holds your encoded item ids, which are what you get back from Triton after you run the visualize_response function. Aren't the values you get as recommendations (encoded item ids) within 0-52739 in your case?

Basically, if you do this:

category_item_id =  cudf.read_parquet('/workspace/kdd/categories/unique.item_id.parquet').reset_index().rename(columns={"index": "encoded_item_id"})

you will get your encoded item id column and original ids in the same cudf dataframe.
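(From there, one way to do the mapping, as a rough sketch: it assumes top_preds is the (n_sessions, top_k) numpy array of encoded ids produced inside the visualize_response function above, and category_item_id is the dataframe built as shown; exact column handling may differ in your setup:)

# lookup table: encoded item id (index) -> original item id (value)
lookup = category_item_id.to_pandas().set_index("encoded_item_id")["item_id"]

# map every encoded prediction back to its original item id, keeping the (n_sessions, top_k) shape
original_preds = lookup.loc[top_preds.ravel()].to_numpy().reshape(top_preds.shape)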

@hosseinkalbasi

Thank you @rnyak! This is so much clearer for me now. I can now resolve the recommendations.
No, the length of each prediction row I get is a bit larger (by 4 items) than the total number of item ids! Not sure if this is a BUG or how it could be explained!

Detailed comparisons:

Let's see how many item ids we have in the first place:

cudf.read_parquet('/workspace/data/interactions_merged_df.parquet')['item_id'].nunique()
# output: 52739

How many encoded item ids do we have after the NVTabular workflow has been applied?

category_item_id = cudf.read_parquet('./workflow_etl/categories/unique.item_id.parquet').reset_index().dropna(axis=0, subset=['item_id']).rename(columns={"index": "encoded_item_id"})
category_item_id

# number of rows: 52739

[screenshot of the category_item_id dataframe]

All good so far! However, if we take a closer look at the prediction (Triton Inference Server response), we have 52743 logit values for each predicted session!

After running this code:

import nvtabular.inference.triton as nvt_triton
import tritonclient.grpc as grpcclient

inputs = nvt_triton.convert_df_to_triton_input(filtered_batch.columns, filtered_batch, grpcclient.InferInput)
output_names = ["output_1"]
outputs = []
for col in output_names:
    outputs.append(grpcclient.InferRequestedOutput(col))
    print(outputs)
MODEL_NAME_NVT = "t4r_tf"
with grpcclient.InferenceServerClient("<inference-server-ip>:8001") as client:
    response = client.infer(MODEL_NAME_NVT, inputs)
    response_as_numpy = response.as_numpy(col)
    print(col, ':\n', response_as_numpy)
    print("*"*8, response_as_numpy.shape)

Output:

[<tritonclient.grpc.InferRequestedOutput object at 0x7f73388d8850>]
output_1 :
 [[-15.010758  -15.9208555  -9.183845  ... -15.026     -17.51056
  -14.421532 ]
 [-14.498426  -15.743032  -10.884762  ... -15.942841  -17.284363
  -15.883841 ]
 [-19.52346   -16.099092  -11.823889  ... -16.027607  -18.37647
  -13.709213 ]
 ...
 [-15.586963  -16.59345    -9.404068  ... -21.083416  -17.071983
  -16.640312 ]
 [-16.728306  -14.76777   -10.235783  ... -18.325325  -15.863041
  -13.49     ]
 [-19.494577  -21.577778   -9.008173  ... -23.189869  -18.583168
  -19.197771 ]]
******** (15, 52743)

Is this expected, @rnyak? I expected the length of each prediction row to be 52739 as well.

rnyak (Contributor) commented Aug 9, 2022

@hosseinkalbasi What is your filtered_batch? Did you apply any filtering? What is the code before the code block you shared above? Btw, I assume you are running the end-to-end example notebook?

Note that in the NVT pipeline we apply a minimum session length filter, meaning sessions shorter than 2 are not considered.
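(For illustration, the kind of filter meant here looks roughly like this in the NVTabular workflow; groupby_features and the item_id-count column name are assumptions based on the example notebook, so adapt them to your pipeline:)

import nvtabular as nvt

# hypothetical: keep only sessions with at least 2 interactions after the session-level groupby
filtered_sessions = groupby_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= 2)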

korchi commented Aug 2, 2023

Hi everyone. May I ask you for help?
I would like to understand how the trained model is going to behave in production. To that end, I want to recompute RecallAt on my own by running inference with the model and compare it with the recall returned by trainer.evaluate(eval_dataset=eval_path). But I keep getting different metric results.

Specifically, I thought that if eval_on_last_item_seq_only=True, then the trainer.evaluate() method would compute metrics by counting how many times the prediction hits the last item_id in the sequence/session (given the input sequence [0:N-1], where the Nth item_id is the label).

I tried to replicate it. I truncated my valid.parquet dataset by splitting each sequence into a new sequence [1:N-1] (as input) and the last item [N] (as label), and computed RecallAt on my own by simply running inference and counting the number of hits.
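(Roughly, that truncation could look like this, as a sketch; the item_id-list_seq column name is an assumption, and the split keeps everything but the last item as input and the last item as label:)

import pandas as pd

df = pd.read_parquet("valid.parquet")
df["input_seq"] = df["item_id-list_seq"].apply(lambda seq: seq[:-1])  # sequence without its last item
df["label"] = df["item_id-list_seq"].apply(lambda seq: seq[-1])       # last item used as the target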

predictions = model(sequence_truncated, testing=False)
correct = int(label in predictions.argsort()[-20:])  # top-20 = last 20 indices of an ascending argsort

However, I am getting a smaller RecallAt compared to running trainer.evaluate(eval_dataset=eval_path).

I debugged the code and found that the predictions inside trainer.evaluate() differ from my predictions during inference mainly due to testing=True, which significantly influences mask_targets and masking_scheme. Could someone explain in detail how the testing variable influences the inputs and outputs, and why the model predictions are so different with testing=True (in evaluate()) and testing=False (during inference)? And how can I reproduce what the evaluate method does by running inference (testing=False)?

Thanks a lot.
