# JSONL To CSV Testing
This worksheet performs a sample JSONL to CSV conversion and measures its impact on performance.

This worksheet requires a large dataset of Amazon music reviews. Before running it, download the music review dataset from the URL below and decompress it to `../raw/CDs_and_Vinyl.jsonl`, the path the code cells read:

https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_2023/raw/review_categories/CDs_and_Vinyl.jsonl.gz

### Test Results / Take-Home Points:
- CSV reduces load times by over 10x compared to JSONL - from over 5 minutes to less than half a minute for processing 5M product reviews.
- CSV loads about 3x as fast as JSON (25 seconds versus 75 seconds) and is 20% more space-efficient.
- File I/O is a bottleneck in Python - for unknown reasons, reading the 3.3GB file took about 40x longer than the raw hardware should need.
- JSONL is not recommended for Python due to very slow parsing and lack of built-in library support.
- CSV and Pandas are (barely) adequate for developer workflows on millions of records, but user-facing applications require a much more performant tech stack.
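
The ratios above follow from the timings recorded in the cells below. The short calculation here reproduces them as a sanity check. The variable names are illustrative, the 321-second JSONL figure is the 5 minute 21 second conversion run used as a proxy for a full JSONL load, and the 2-second hardware figure is the worksheet's own rough estimate.

```python
# Timings (seconds) and file sizes (GB) as reported in this worksheet.
jsonl_seconds = 5 * 60 + 21   # conversion run, used as a proxy for a full JSONL load
json_seconds = 75.0           # pd.read_json on the JSON array file
csv_seconds = 25.514          # pd.read_csv
raw_read_seconds = 76.050     # file.read() on the 3.3GB JSONL file
hardware_seconds = 2.0        # rough estimate of what the hardware allows
jsonl_gb, csv_gb = 3.3, 2.6

speedup_vs_jsonl = jsonl_seconds / csv_seconds
speedup_vs_json = json_seconds / csv_seconds
io_overhead = raw_read_seconds / hardware_seconds
space_saving = 1 - csv_gb / jsonl_gb

print(f"CSV vs JSONL load time: {speedup_vs_jsonl:.1f}x")
print(f"CSV vs JSON load time:  {speedup_vs_json:.1f}x")
print(f"Python read vs hardware estimate: {io_overhead:.0f}x")
print(f"Space saved by CSV: {space_saving:.0%}")
```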

In [14]:
import json
import pandas as pd
import time

from jsonl_to_csv import jsonl_to_csv, profile

In [None]:
# I/O test: how long does it take just to read the full 3.3GB JSONL file?
# The contents stay in `jsonl` so that a later cell can write a JSON (not JSONL) variation that Pandas can read directly.

# Result: over a minute (76 seconds).
# I/O speed = 3.3 GB / 76 sec = 43 MB/s
# 5M records / 76 sec = 66K records / sec
# Raw hardware I/O limits suggest about 2 seconds for 3GB, so going through Python appears to add *a lot* of overhead to file I/O.
# Alternatively, the read may be bound by inefficient RAM usage for a file of this size.
t = time.perf_counter()
path = '../raw/CDs_and_Vinyl.jsonl'
with open(path) as file:
    jsonl = file.read()
profile("Read jsonl file contents", t)

76.050:	Read jsonl file contents
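
One way to narrow down where the time goes is to compare the text-mode read above with a plain binary read in fixed-size chunks, which skips text decoding and never builds a single multi-gigabyte string. The sketch below is not part of the original worksheet: `time_binary_read` and the 16 MiB chunk size are illustrative choices, and no result for it was recorded here.

```python
import time

def time_binary_read(path, chunk_size=16 * 1024 * 1024):
    """Read a file in binary chunks; return (bytes_read, elapsed_seconds).

    Hypothetical helper for comparison only; the chunk size is an assumption.
    """
    total = 0
    start = time.perf_counter()
    with open(path, 'rb') as file:
        # read() returns b'' at end of file, which ends the loop
        while chunk := file.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed seconds gives a throughput that can be set against the 43 MB/s measured above. If the binary read is much faster, decoding and string construction are likely contributors. If it is similarly slow, the storage layer or memory pressure is the more plausible suspect.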


In [None]:
# JSONL test:
# JSONL parsing is erratic: bursts of speed (on pace with CSV loading in the best case) are followed by long pauses.
# Overall performance is poor, taking several minutes to parse a large file (the run shown below was interrupted).
# In addition, notice how the simple act of loading a DataFrame from a file becomes unexpectedly involved.
# This is why we need a conversion utility.
batch_count = 100_000
t = time.perf_counter()
records = []
with open(path) as file:
    # enumerate from 1 so `iteration` is the number of records parsed so far
    for iteration, line in enumerate(file, start=1):
        records.append(json.loads(line))
        if iteration % batch_count == 0:
            profile(f'parsed {iteration} records', t)
records = pd.DataFrame.from_records(records)
profile('Finished parsing jsonl', t)

1.165:	parsed 100000 records
1.561:	parsed 200000 records
1.998:	parsed 300000 records
2.473:	parsed 400000 records
2.969:	parsed 500000 records
3.212:	parsed 600000 records
3.804:	parsed 700000 records
4.071:	parsed 800000 records
4.793:	parsed 900000 records
5.041:	parsed 1000000 records
5.841:	parsed 1100000 records
6.150:	parsed 1200000 records
6.485:	parsed 1300000 records
7.844:	parsed 1400000 records
8.232:	parsed 1500000 records
8.500:	parsed 1600000 records
8.786:	parsed 1700000 records
12.253:	parsed 1800000 records
12.640:	parsed 1900000 records
12.932:	parsed 2000000 records
13.367:	parsed 2100000 records
13.710:	parsed 2200000 records
19.673:	parsed 2300000 records
20.230:	parsed 2400000 records
20.572:	parsed 2500000 records
20.882:	parsed 2600000 records
21.179:	parsed 2700000 records
21.488:	parsed 2800000 records
21.803:	parsed 2900000 records
30.885:	parsed 3000000 records
31.171:	parsed 3100000 records
31.480:	parsed 3200000 records
31.810:	parsed 3300000 records
32.

KeyboardInterrupt: 
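
For comparison, pandas can parse line-delimited JSON itself: `pd.read_json` accepts `lines=True`, and with `chunksize` it yields DataFrames batch by batch instead of requiring one large list of dicts first. The sketch below is an alternative that was not timed in this worksheet, so it may or may not be faster on this dataset. `load_jsonl_in_chunks` is a hypothetical helper name, and the default batch size simply mirrors `batch_count` above.

```python
import pandas as pd

def load_jsonl_in_chunks(path, chunksize=100_000):
    """Load a JSONL file into one DataFrame, parsing `chunksize` lines at a time.

    Hypothetical helper; not the approach measured in this worksheet.
    """
    # lines=True treats every line as one JSON record; chunksize makes
    # read_json return an iterator of DataFrames instead of a single frame.
    with pd.read_json(path, lines=True, chunksize=chunksize) as reader:
        chunks = list(reader)
    # Stitch the batches together with a fresh 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)
```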

In [None]:
# Conversion test: 5 minutes 21 seconds.
# We can use this as a rough proxy for how long a full read of the dataset from a JSONL file would take if it ran to completion.
jsonl_to_csv(path, batch_size=100_000)

Opened file ../raw/CDs_and_Vinyl.jsonl
6.489:	converted 100,000 records
12.633:	converted 200,000 records
18.929:	converted 300,000 records
25.292:	converted 400,000 records
31.139:	converted 500,000 records
37.495:	converted 600,000 records
44.252:	converted 700,000 records
50.624:	converted 800,000 records
57.162:	converted 900,000 records
63.555:	converted 1,000,000 records
70.106:	converted 1,100,000 records
76.791:	converted 1,200,000 records
82.984:	converted 1,300,000 records
89.857:	converted 1,400,000 records
96.460:	converted 1,500,000 records
102.881:	converted 1,600,000 records
109.713:	converted 1,700,000 records
116.198:	converted 1,800,000 records
122.141:	converted 1,900,000 records
128.617:	converted 2,000,000 records
135.181:	converted 2,100,000 records
142.094:	converted 2,200,000 records
148.590:	converted 2,300,000 records
154.843:	converted 2,400,000 records
161.129:	converted 2,500,000 records
167.629:	converted 2,600,000 records
174.277:	converted 2,700,000 reco
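
The `jsonl_to_csv` module itself is not reproduced in this worksheet. As a rough illustration of what such a conversion involves, the sketch below streams records from a JSONL file into a CSV file in batches using only the standard library. It is a stand-in under stated assumptions, not the utility used above: the name `jsonl_to_csv_sketch`, its parameters, and the choice to take the column set from the first record are all hypothetical.

```python
import csv
import json

def jsonl_to_csv_sketch(jsonl_path, csv_path, batch_size=100_000):
    """Stream a JSONL file into a CSV file; return the number of records written.

    Illustrative only: not the jsonl_to_csv utility imported in this worksheet.
    """
    count = 0
    writer = None
    batch = []
    with open(jsonl_path, encoding='utf-8') as src, \
         open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record carries every column name.
                # Later extra keys are dropped; missing keys become ''.
                writer = csv.DictWriter(dst, fieldnames=list(record),
                                        extrasaction='ignore', restval='')
                writer.writeheader()
            batch.append(record)
            if len(batch) >= batch_size:
                writer.writerows(batch)  # flush one full batch to disk
                count += len(batch)
                batch.clear()
        if batch:
            writer.writerows(batch)      # flush the final partial batch
            count += len(batch)
    return count
```

Nested values such as lists or dicts are written in their Python string form by this sketch, so a real utility would need to decide how to flatten or serialize them.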

In [None]:
# CSV variation:
# - Performance = 25 seconds - about 3x as fast as JSON and over 10x as fast as JSONL
#   - Though this bridges the performance gap for developers, it also shows that CSV files are not adequate for user-facing workflows on real-world product data
# - Storage space decreases by 20% - down from 3.3GB to 2.6GB
csv_path = path.replace('.jsonl', '.csv')
t = time.perf_counter()
data = pd.read_csv(csv_path)
profile('Read CSV file', t)

25.514:	Read CSV file


In [None]:
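# Write a JSON (not JSONL) variation of the dataset for the next cell.
# A JSON array needs commas between records, so each line break becomes ",\n".
json_path = path.replace('.jsonl', '.json')
json_str = '[' + jsonl.strip().replace('\n', ',\n') + ']'
with open(json_path, 'w') as file:
    file.write(json_str)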
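
Building the array in memory means holding a second multi-gigabyte string alongside `jsonl`. A streaming variant that copies one line at a time avoids that. The sketch below is one way to do it, with `jsonl_to_json_array` as a hypothetical helper name that is not part of this worksheet.

```python
def jsonl_to_json_array(jsonl_path, json_path):
    """Copy a JSONL file into a JSON array file line by line; return the record count.

    Hypothetical helper; assumes each non-blank line is one complete JSON record.
    """
    count = 0
    with open(jsonl_path, encoding='utf-8') as src, \
         open(json_path, 'w', encoding='utf-8') as dst:
        dst.write('[')
        for line in src:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if count:
                dst.write(',\n')  # separator before every record except the first
            dst.write(line)
            count += 1
        dst.write(']')
    return count
```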

In [None]:
# JSON variation:
# 75 seconds to load records. Almost exactly the same speed as dumping the file into a string, and 3x slower than CSV.
t = time.perf_counter()
data = pd.read_json(json_path)
profile('Read json', t)