Currently, some functions can return lists, but with images and other heavy modalities this can overload memory.
To keep everything running smoothly, we can make sure that the outputs of functions like load_dataset, create_dataset and evaluate are all Hugging Face datasets.