# Elegant Multithreading

No one wants slow CSV reading, especially in performance-intensive work such as machine learning, where datasets are large and compute resources are used heavily. Our state-of-the-art data ecosystem takes full advantage of the multithreading functionality recently introduced in Julia.

It does so with careful attention to ease of use and without loss of capability. The CSV.jl package exposes multithreading through a single keyword argument, `threaded`, and is compared here against the equivalent Python library, pandas.

In [11]:
]activate .

[32m[1mActivating[22m[39m environment at `~/mlds/Project.toml`


In [12]:
using Base.Threads
Threads.nthreads()

8
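
The thread count reported above is fixed when Julia starts, for example through the `JULIA_NUM_THREADS` environment variable. As a rough illustration of the task-based threading that a reader can build on, here is a minimal sketch that sums a vector in per-thread chunks. `chunked_sum` is a hypothetical helper written for this example; it is not part of CSV.jl and does not show how CSV.jl parses files internally.

```julia
# Hypothetical helper: sum a vector by splitting it into roughly one chunk per thread.
function chunked_sum(xs::AbstractVector{<:Real})
    isempty(xs) && return zero(eltype(xs))
    # Chunk size rounded up so that at most nthreads() chunks are created.
    chunksize = cld(length(xs), Threads.nthreads())
    chunks = Iterators.partition(xs, chunksize)
    # Each chunk is summed in its own task; the scheduler may run tasks on different threads.
    tasks = map(chunks) do chunk
        Threads.@spawn sum(chunk)
    end
    # Wait for every task and combine the partial sums.
    return sum(fetch.(tasks))
end

xs = collect(1:1_000_000)
println(chunked_sum(xs) == sum(xs))
```

With a single thread the function still works; it simply creates one task for the whole vector.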

In [29]:
using CSV, BenchmarkTools
using CSV.DataFrames

## Benchmarking Julia CSV Reading - CSV.jl

Real-world datasets often comprise varied forms of data and are heterogeneous in nature unless special care is taken to homogenise them.

We therefore benchmark reading a common file type, CSV, and compare CSV.jl against the state of the art in Python, i.e. pandas.

In [30]:
df = CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv")
describe(df)

| | variable | mean | min | median | max |
|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any |
| 1 | col1 | 0.504951 | 2.45087e-5 | 0.508145 | 0.999627 |
| 2 | col2 | 0.504824 | 0.0004 | 0.5117 | 0.9995 |
| 3 | col3 | 1.54429e15 | -9222739277495079109 | 7.11743e16 | 9221746770517594113 |
| 4 | col4 | | 1950-01-02T22:38:46 | | 2000-12-10T06:53:17 |
| 5 | col5 | | 00VOeNeUrFI8oOIqfHin | | zyhNFCt6GlzgwMLdwrlJ |
| 6 | col6 | | Categorical string 1 | | Categorical string 5 |
| 7 | col7 | | 01AEEir015gbhP50WAmv"Dixp7idRQb | | zzzMocd6QksZTrYEXgMc"8KbwLTkNDH |
| 8 | col8 | 0.49722 | 1.00435e-5 | 0.494471 | 0.99977 |
| 9 | col9 | 0.500279 | 0.0002 | 0.50215 | 0.9999 |
| 10 | col10 | -1.22871e15 | -9218269981171391739 | -1.37732e17 | 9221618083976482060 |
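
The benchmark file itself is not reproduced in this notebook. For readers who want to try something similar, the sketch below writes a file with a comparable mix of column types (floats, short decimals, full-range integers, timestamps, random strings and a small categorical set). The function name `write_mixed_csv`, the column count, the date range and the distributions are all assumptions made for illustration; the result is not the original `mixed.csv`.

```julia
using Random, Dates

# Hypothetical generator: column layout and value ranges are assumptions,
# chosen only to resemble the kinds of columns in the summary above.
function write_mixed_csv(path::AbstractString, nrows::Integer; seed::Integer = 1)
    rng = MersenneTwister(seed)
    first_ts = DateTime(1950, 1, 1)
    # Number of whole seconds between the assumed first and last timestamps.
    span_s = Dates.value(DateTime(2001, 1, 1) - first_ts) ÷ 1000
    open(path, "w") do io
        println(io, "col1,col2,col3,col4,col5,col6")
        for _ in 1:nrows
            f  = rand(rng)                                   # Float64 in [0, 1)
            r  = round(rand(rng), digits = 4)                # decimal with up to 4 digits
            n  = rand(rng, Int64)                            # full-range 64-bit integer
            ts = first_ts + Second(rand(rng, 0:span_s - 1))  # timestamp at second resolution
            s  = randstring(rng, 20)                         # 20-character alphanumeric string
            c  = string("Categorical string ", rand(rng, 1:5))
            println(io, join((f, r, n, ts, s, c), ','))
        end
    end
    return path
end

path = write_mixed_csv(joinpath(mktempdir(), "mixed_sample.csv"), 1_000)
println(countlines(path))
```

The generated strings contain no commas or quotes, so no CSV escaping is needed in this simplified version.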


In [14]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv", threaded = false);

  242.538 ms (158577 allocations: 47.27 MiB)


In [24]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv", threaded = true);

  45.052 ms (109923 allocations: 43.29 MiB)


In [22]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/big_mixed.csv", threaded = true);

  337.263 ms (951256 allocations: 430.31 MiB)


## Benchmarking Python CSV Reading - pandas

In [23]:
run(`python3 -m timeit -s "import pandas" -p "pandas.read_csv('csv-benchmarks/big_mixed.csv')"`)

10 loops, best of 3: 6.36 sec per loop


Process(`[4mpython3[24m [4m-m[24m [4mtimeit[24m [4m-s[24m [4m'import pandas'[24m [4m-p[24m [4m"pandas.read_csv('csv-benchmarks/big_mixed.csv')"[24m`, ProcessExited(0))

## That's a speedup of more than **15x** with multithreading, and about 3x without!
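
The ratios can be recomputed from the timings printed in the cells above. The notebook does not show a single-threaded run on `big_mixed.csv`, so the single-threaded comparison in this sketch is an estimate: it assumes the threading gain measured on `mixed.csv` carries over to the larger file, which may not hold exactly.

```julia
# Timings copied from the benchmark outputs above, converted to seconds.
julia_mixed_single   = 242.538e-3  # CSV.jl, mixed.csv, threaded = false
julia_mixed_threaded = 45.052e-3   # CSV.jl, mixed.csv, threaded = true
julia_big_threaded   = 337.263e-3  # CSV.jl, big_mixed.csv, threaded = true
pandas_big           = 6.36        # pandas, big_mixed.csv

speedup(slow, fast) = slow / fast

# Gain from threading within CSV.jl, measured on mixed.csv.
threading_gain = speedup(julia_mixed_single, julia_mixed_threaded)

# Multithreaded CSV.jl against pandas on the same large file.
vs_pandas_threaded = speedup(pandas_big, julia_big_threaded)

# Estimate only: assumes the same threading gain applies to big_mixed.csv.
est_big_single   = julia_big_threaded * threading_gain
vs_pandas_single = speedup(pandas_big, est_big_single)

println("threading gain on mixed.csv:       ", round(threading_gain, digits = 1), "x")
println("threaded CSV.jl vs pandas:         ", round(vs_pandas_threaded, digits = 1), "x")
println("single-threaded estimate vs pandas: ", round(vs_pandas_single, digits = 1), "x")
```

On these figures the threaded read is roughly 19 times faster than pandas, and the estimated single-threaded read is about 3.5 times faster.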