![Benchmark Loading Speed](https://i.imgur.com/8BBQdvc.jpg)

# Benchmark Loading Speed 

This notebook covers most of the common data loading techniques. Some of them time out or run out of memory before finishing the job, others are inadequate for loading specific data types and structures, and some are very fast; those are the ones we'll try to benchmark.

Feel free to share your thoughts with us in the comment section =)



##### 0- Libs

##### 1- Parser 
##### 2- Numpy
##### 3- Numba
##### 4- Pandas 
##### 5- Datatable

##### 6- cuDF
##### 7- cuPY

##### 8- Pickle
##### 9- Joblib

##### 10- feather
##### 11- parquet
##### 12- jay
##### 13- hdf5

##### 14- Benchmark
##### 15- Kudos

# 0- Libs

In [None]:
import csv
import numpy as np
from numpy import genfromtxt
from numba import njit
import cudf
import cupy
import pandas as pd
import datatable as dt
import pickle
import joblib
import feather
import plotly.express as px

In [None]:
data_path = '/kaggle/input/jane-street-market-prediction/train.csv'
# saved_data_path = '../input/name of the notebook/name of the data file'  # use this if you import the data from another notebook
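
The benchmark cell that follows relies on the IPython magic `%timeit -o`, which returns a result object whose `.average` attribute is used later in the Benchmark section. Outside a notebook that magic is not available, so here is a minimal sketch of a plain-Python stand-in built on the standard `timeit` module. The helper name `time_loader` and the toy workload are assumptions made for illustration; they are not part of the original notebook, and the repeat count here does not try to match what `%timeit` picks automatically.

```python
import timeit

def time_loader(loader, repeat=3):
    """Call `loader` `repeat` times and return the mean wall-clock time in seconds."""
    # number=1 means each measurement is exactly one full call of `loader`
    timings = timeit.repeat(loader, number=1, repeat=repeat)
    return sum(timings) / len(timings)

# Stand-in workload so the sketch runs anywhere; in this notebook the callable
# would be something like `lambda: pd.read_csv(data_path)`.
average_seconds = time_loader(lambda: sum(range(100_000)))
print(f"{average_seconds:.4f} s")
```

Passing a callable keeps the helper independent of any particular library, so the same function could time a Pandas, Datatable, or Pickle loader.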

In [None]:
#################################### 1- Python’s Built-in CSV parser ####################################
#with open('/kaggle/input/jane-street-market-prediction/train.csv') as csv_file:
#    csv_reader = csv.reader(csv_file, delimiter=',')
#    data = list(csv_reader) 
# Result : Your notebook tried to allocate more memory than is available. It has restarted.


#################################### 2- Numpy ####################################
#my_data = genfromtxt('/kaggle/input/jane-street-market-prediction/train.csv', delimiter=',')
# Result : Your notebook tried to allocate more memory than is available. It has restarted.

# Step 1
np.save('./data.npy', pd.read_csv(data_path))

# Step 2
speed_np = %timeit -o np.load('./data.npy')


#################################### 3- Numba ####################################
#@njit
#def numba_data():
#    data = genfromtxt('/kaggle/input/jane-street-market-prediction/train.csv', delimiter=',')
#    return data
# Result : Untyped global name 'genfromtxt': cannot determine Numba type of <class 'function'>


#################################### 4- Pandas ####################################
speed_pd =  %timeit -o  pd.read_csv(data_path)


#################################### 5- Datatable ####################################
speed_dt =  %timeit -o  dt.fread(data_path)


#################################### 6- cuDF ####################################
speed_cu =  %timeit -o  data_cudf = cudf.read_csv(data_path)


#################################### 7- cuPY ####################################
#data = cupy.load('/kaggle/input/jane-street-market-prediction/train.csv', allow_pickle=False) # if the loaded file is pickled then allow_pickle=True
# Result : fails, cupy.load only reads .npy/.npz files (optionally pickled), not CSV


#################################### 8- Pickle + Datatable ####################################
# Step 1 (save the data)
with open('data_dt.pickle', 'wb') as f:  # close the file handle once the dump is done
    pickle.dump(dt.fread(data_path), f)

# Step 2 (load the data)
speed_pickle =  %timeit -o  pickle.load(open('./data_dt.pickle', 'rb'))


#################################### 9- Joblib + Datatable ####################################
# Step 1 (save the data)
joblib.dump(dt.fread(data_path), 'data_dt.joblib')

# Step 2 (load the data)
speed_joblib =  %timeit -o  joblib.load('./data_dt.joblib')


#################################### 10- Feather + Pandas ####################################
# Step 1 (save the data)
pd.read_csv(data_path).to_feather("data_pd.feather")

# Step 2 (load the data)
speed_feather =  %timeit -o  pd.read_feather('./data_pd.feather')


#################################### 11- Parquet + Pandas ####################################
# Step 1 (save the data)
pd.read_csv(data_path).to_parquet("data.parquet")

# Step 2 (load the data)
speed_parquet =  %timeit -o  pd.read_parquet('./data.parquet')


#################################### 12- Jay + Datatable ####################################
# Step 1 (save the data)
dt.fread(data_path).to_jay("data_dt.jay")

# Step 2 (load the data)
speed_jay =  %timeit -o  dt.fread('./data_dt.jay')


#################################### 13- hdf5 + Pandas ####################################
# Step 1 (save the data)
pd.read_csv(data_path).to_hdf("data.h5", key="data")

# Step 2 (load the data)
speed_hdf5 =  %timeit -o pd.read_hdf('./data.h5', "data")
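
Methods 1 and 2 above ran out of memory because they build the whole file in memory at once (`list(csv_reader)` keeps every field as a Python string). If only an aggregate is needed, one possible workaround is to stream the file row by row with the built-in `csv` module. The sketch below is an illustration under assumptions: the function name `stream_column_mean` is hypothetical, and the tiny temporary file with `date` and `weight` columns is made up so the example runs anywhere; it is not the competition data and was not benchmarked here.

```python
import csv
import os
import tempfile

def stream_column_mean(csv_path, column):
    """Average one numeric column while holding only one row in memory at a time."""
    total, count = 0.0, 0
    with open(csv_path, newline='') as csv_file:
        for row in csv.DictReader(csv_file):
            if row[column] != '':  # skip missing values
                total += float(row[column])
                count += 1
    return total / count if count else float('nan')

# Tiny made-up file (not the real train.csv) so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('date,weight\n0,1.5\n0,\n1,2.5\n')
    demo_path = tmp.name

demo_mean = stream_column_mean(demo_path, 'weight')
os.remove(demo_path)
print(demo_mean)
```

This trades speed for a small memory footprint, so it is a fallback for files that do not fit in RAM rather than a competitor to the fast loaders timed above.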

# 14- Benchmark

In [None]:
# Collect the %timeit results in the same order as the sections above
results = {'Numpy': speed_np, 'Pandas': speed_pd, 'Datatable': speed_dt, 'cuDF': speed_cu, 'Pickle': speed_pickle,
           'Joblib': speed_joblib, 'Feather': speed_feather, 'Parquet': speed_parquet, 'Jay': speed_jay, 'hdf5': speed_hdf5}
speed_name = list(results)
# Jay is formatted with 4 decimals, everything else with 2
speed = ["{:.4f}".format(r.average) if name == 'Jay' else "{:.2f}".format(r.average) for name, r in results.items()]

for row in range(len(speed)):
    print(speed_name[row] + ' = ', speed[row], 's')
    
speedy = pd.DataFrame()
speedy["name"] = speed_name
speedy["speed"] = list(map(float, speed))
speedy.sort_values(by=['speed'], ascending=True, inplace=True)

In [None]:
fig = px.bar(speedy, y='speed', x='name', text='speed', title='Benchmarking Data loading speed in seconds')
fig.show()

There are other very fast ways to load data that can be very useful during data processing. I'll try to add them as soon as I finish testing them.

Finally, feel free to post the code that you use for fast data processing in the comment section =)

# 15- Kudos

##### https://www.kaggle.com/quillio/pickling
##### https://www.kaggle.com/pedrocouto39/fast-reading-w-pickle-feather-parquet-jay