## Reading In Datatable (RID)
![py_datatable_logo.png](attachment:py_datatable_logo.png)

The training dataset for this competition is large, and a plain ***pd.read_csv*** call runs out of memory on Kaggle Notebooks.

This notebook shows how you can use [Python datatable](https://datatable.readthedocs.io/en/latest/index.html) to read the complete training data and convert it to a pandas dataframe in under a minute.   
Just reading the dataset from binary format into datatable takes less than a second!

There are other options to consider as well and some of them have been described in [this notebook](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets).

You can try out or learn more about datatable from [**DatatableTon**: *💯 datatable exercises*](https://github.com/vopani/datatableton)

## Installation
Python datatable is available by default on Kaggle Notebooks. If it is missing, it can be installed using pip (requires internet access) or from a wheel file (works without internet).

In [None]:
# installation with internet
# !python3 -m pip install pip --upgrade
# !python3 -m pip install datatable --upgrade

In [None]:
# installation without internet
# !pip install ../input/python-datatable/datatable-1.0.0-cp37-cp37m-manylinux_2_12_x86_64.whl
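
A quick way to confirm that one of the installation routes above worked is to check whether the package can be imported. This is only a minimal sketch: it uses the standard-library `importlib` lookup and prints the installed version if datatable is found.

```python
import importlib.util

# Look up the package without importing it, so a missing install
# produces a message instead of an ImportError.
spec = importlib.util.find_spec("datatable")

if spec is None:
    print("datatable is not installed; run one of the install cells above")
else:
    import datatable as dt
    print("datatable version:", dt.__version__)
```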

## Reading data in csv format
First, let's read the dataset from the raw CSV file. Datatable reads the full dataset and converts it to pandas in less than a minute.


In [None]:
%%time

# reading the dataset from raw csv file
import datatable as dt

train = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

print(train.shape)


In [None]:
train.head()

## Reading data in jay format
Once the dataset is saved in datatable's binary format (.jay), datatable can read the entire dataset back in less than a second!


In [None]:
# saving the dataset in .jay (binary format)
dt.fread("../input/riiid-test-answer-prediction/train.csv").to_jay("train.jay")

In [None]:
%%time

# reading the dataset from .jay format
import datatable as dt

train = dt.fread("train.jay")

print(train.shape)

In [None]:
train

It takes less than a second to read the entire dataset into Python datatable!

Converting it to pandas is fairly fast too, taking another few seconds.

In [None]:
%%time

import datatable as dt

train = dt.fread("train.jay").to_pandas()

print(train.shape)


In [None]:
train.head()

While I personally find this the fastest approach, there are other options to consider, some of which are described in [this notebook](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets).
