To practice writing performant Julia code I am attempting to improve the speed of part 1. As a reminder, the terminal output after the last run was:

        Processing folder: info
        Finished in 4.35 seconds
        Processing folder: local-monthly-gross-returns
        Finished in 130.5 seconds
        Processing folder: local-monthly-net-returns
        Finished in 148.06 seconds
        Processing folder: monthly-costs
        Finished in 63.1 seconds
        Processing folder: monthly-morningstar-category
        Finished in 467.8 seconds
        Processing folder: monthly-net-assets
        Finished in 237.36 seconds
        Processing folder: usd-monthly-gross-returns
        Finished in 85.24 seconds
        Processing folder: usd-monthly-net-returns
        Finished in 100.15 seconds

The total runtime of the script (excluding the time taken to load the info dataframe initially) is then:

In [1]:
println(round((4.35+130.5+148.06+63.1+467.8+237.36+85.24+100.15)/60, digits=2), " minutes")

20.61 minutes


The first change was to employ parallel processing through @threads. I lost the terminal output from that run because I had to stop it partway through, but I had recorded two of the times manually. It looked something like:

        Regrouping files...
        Processed folder info in 12.6 seconds
        Processed folder monthly-costs in 1412.8 seconds
        Processed folder usd-monthly-gross-returns in 1730.4 seconds

In [2]:
println("monthly-costs took $(round(1412.8/60, digits=2)) minutes")
println("usd-monthly-gross-returns took $(round(1730.4/60, digits=2)) minutes")


monthly-costs took 23.55 minutes
usd-monthly-gross-returns took 28.84 minutes


The ratio of the usd-monthly-gross-returns time to the monthly-costs time remained roughly the same, but each folder took 20 to 22 times as long as in the original run, so parallelism gave no benefit at all and the overall runtime increased substantially.
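
For reference, the general shape of a @threads loop over folders is sketched below. This is a minimal illustration, not the script's actual code: process_folder is a hypothetical stand-in for the real per-folder work. Note that @threads only runs iterations concurrently when Julia is started with more than one thread (for example with julia --threads auto).

```julia
using Base.Threads

# Hypothetical stand-in for the real per-folder work (reading, concatenating, saving).
process_folder(name::AbstractString) = length(name)

function process_all(folders::Vector{String})
    results = Vector{Int}(undef, length(folders))
    # Each iteration writes only to its own slot, so no locking is needed.
    @threads for i in eachindex(folders)
        results[i] = process_folder(folders[i])
    end
    return results
end

folders = ["info", "monthly-costs", "usd-monthly-gross-returns"]
results = process_all(folders)
println("Threads available: ", nthreads())
```

Printing nthreads() is a quick sanity check that the session really has more than one thread before reading anything into the timings.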

As a measure of the speed-up obtained simply by allowing CSV.read to run multithreaded, here is the terminal output from running the original code with number of threads increased to auto in settings:

        Processed folder info in 4.35 seconds
        Processed folder local-monthly-gross-returns in 109.61 seconds
        Processed folder local-monthly-net-returns in 142.17 seconds
        Processed folder monthly-costs in 59.3 seconds

I stopped the run after the first three data folders. Multithreading appears to have saved roughly 4 to 21 seconds per folder.

I wrote a new script to clean all the CSV files to make future loads faster, but it took almost as long as the original script (about 18.7 minutes in total) and would be unlikely to save more than that amount of time in the next step. Terminal output:

        Cleaning csv files...
        Finished cleaning info in 15.46 seconds.
        Finished cleaning local-monthly-gross-returns in 141.7 seconds.
        Finished cleaning local-monthly-net-returns in 149.93 seconds.
        Finished cleaning monthly-costs in 78.25 seconds.
        Finished cleaning monthly-morningstar-category in 132.58 seconds.
        Finished cleaning monthly-net-assets in 294.68 seconds.
        Finished cleaning usd-monthly-gross-returns in 135.9 seconds.
        Finished cleaning usd-monthly-net-returns in 175.42 seconds.
        Finished cleaning all csv files in 1123.93 seconds.

I will try to clean the files directly without using CSV.

I was able to get the empty quotes removed with very little overhead. Terminal output:

        Copying csv files...
        Finished copying csv files in 14.19 seconds.
        Cleaning csv files...
        Finished cleaning info in 0.3 seconds.
        Finished cleaning local-monthly-gross-returns in 35.57 seconds.
        Finished cleaning local-monthly-net-returns in 32.29 seconds.
        Finished cleaning monthly-costs in 16.93 seconds.
        Finished cleaning monthly-morningstar-category in 26.62 seconds.
        Finished cleaning monthly-net-assets in 20.98 seconds.
        Finished cleaning usd-monthly-gross-returns in 20.91 seconds.
        Finished cleaning usd-monthly-net-returns in 18.77 seconds.
        Finished cleaning all csv files in 186.57 seconds.
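
A minimal sketch of this kind of direct text cleaning is shown below. It assumes that "empty quotes" means fields consisting solely of a pair of double quotes, and it works line by line without any CSV parsing. The function names strip_empty_quotes and clean_file are hypothetical, not the names used in the actual script.

```julia
# Remove fields that consist only of "" (an empty quoted field).
# The lookarounds require the pair to sit between commas or line boundaries,
# so escaped quotes inside a quoted field are left alone.
function strip_empty_quotes(line::AbstractString)
    return replace(line, r"(?<![^,])\"\"(?![^,])" => "")
end

# Stream a file line by line into a cleaned copy.
function clean_file(src::AbstractString, dst::AbstractString)
    open(dst, "w") do out
        for line in eachline(src)
            println(out, strip_empty_quotes(line))
        end
    end
    return dst
end

# Small demonstration on a temporary file.
dir = mktempdir()
src = joinpath(dir, "raw.csv")
dst = joinpath(dir, "clean.csv")
write(src, "id,name,value\n1,\"\",\"x\"\n\"\",\"a \"\"b\"\"\",\"\"\n")
clean_file(src, dst)
print(read(dst, String))
```

Streaming with eachline keeps memory use flat for large files, at the cost of assuming that no quoted field spans multiple lines.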

Here is the new cleaning code terminal output after correcting it to truncate the file after cleaning and retrying IO operations until success or timeout:

        Copying csv files...
        Finished copying csv files in 39.83 seconds.
        Cleaning csv files...
        Finished cleaning info in 0.56 seconds.
        Finished cleaning local-monthly-gross-returns in 27.12 seconds.
        Finished cleaning local-monthly-net-returns in 24.45 seconds.
        Finished cleaning monthly-costs in 13.28 seconds.
        Finished cleaning monthly-morningstar-category in 32.39 seconds.
        Finished cleaning monthly-net-assets in 27.4 seconds.
        Finished cleaning usd-monthly-gross-returns in 16.97 seconds.
        Finished cleaning usd-monthly-net-returns in 25.15 seconds.
        Finished cleaning all csv files in 207.16 seconds.
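
The two corrections can be sketched roughly as follows. This is an illustration under assumptions, not the actual script: it assumes the file is rewritten in place (so leftover bytes from the longer original must be cut off), and the helper names with_retry and clean_in_place, the timeout, and the pause length are all hypothetical.

```julia
# Retry `f` until it succeeds or `timeout` seconds have passed.
function with_retry(f; timeout::Real=5.0, pause::Real=0.1)
    deadline = time() + timeout
    while true
        try
            return f()
        catch err
            # Only retry I/O-style failures; anything else is rethrown immediately.
            (err isa SystemError || err isa Base.IOError) || rethrow()
            time() >= deadline && rethrow()
            sleep(pause)
        end
    end
end

# Overwrite a file with its cleaned contents, then truncate it.
# Without the truncate call, bytes from the (longer) original would remain at the end.
function clean_in_place(path::AbstractString, clean)
    cleaned = clean(read(path, String))
    with_retry() do
        open(path, "r+") do io
            write(io, cleaned)
            truncate(io, position(io))
        end
    end
    return path
end

# Small demonstration on a temporary file.
demo_path = joinpath(mktempdir(), "demo.csv")
write(demo_path, "a,\"\",b\n")
clean_in_place(demo_path, s -> replace(s, "\"\"" => ""))
print(read(demo_path, String))
```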

After three days, I've finally managed to get multithreaded CSV reading to work on the full set of CSVs. Now that multithreading is active, here is the new terminal output:

        Regrouping files...
        Processed folder info in 4.18 seconds
        Processed folder local-monthly-gross-returns in 116.24 seconds
        Processed folder local-monthly-net-returns in 145.8 seconds
        Processed folder monthly-costs in 73.6 seconds
        Processed folder monthly-morningstar-category in 375.51 seconds
        Processed folder monthly-net-assets in 384.62 seconds
        Processed folder usd-monthly-gross-returns in 83.62 seconds
        Processed folder usd-monthly-net-returns in 101.82 seconds
        Finished refining mutual fund data in 1302.07 seconds

The original terminal output for this script was:

        Processing folder: info
        Finished in 4.35 seconds
        Processing folder: local-monthly-gross-returns
        Finished in 130.5 seconds
        Processing folder: local-monthly-net-returns
        Finished in 148.06 seconds
        Processing folder: monthly-costs
        Finished in 63.1 seconds
        Processing folder: monthly-morningstar-category
        Finished in 467.8 seconds
        Processing folder: monthly-net-assets
        Finished in 237.36 seconds
        Processing folder: usd-monthly-gross-returns
        Finished in 85.24 seconds
        Processing folder: usd-monthly-net-returns
        Finished in 100.15 seconds  

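Most folders changed only marginally. monthly-morningstar-category improved by about 92 seconds, but monthly-costs slowed by about 10 seconds and monthly-net-assets got substantially worse (384.62 vs. 237.36 seconds). As a result, the sum of the folder times is larger with multithreading (about 1285 seconds) than without (about 1237 seconds).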
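
The per-folder comparison can be tabulated with a few lines of Julia. The times below are copied from the two terminal outputs above; the variable names are just for illustration.

```julia
folders = ["info", "local-monthly-gross-returns", "local-monthly-net-returns",
           "monthly-costs", "monthly-morningstar-category", "monthly-net-assets",
           "usd-monthly-gross-returns", "usd-monthly-net-returns"]
original      = [4.35, 130.5, 148.06, 63.1, 467.8, 237.36, 85.24, 100.15]
multithreaded = [4.18, 116.24, 145.8, 73.6, 375.51, 384.62, 83.62, 101.82]

# Positive differences mean the multithreaded run was slower.
for (f, o, m) in zip(folders, original, multithreaded)
    println(rpad(f, 30), lpad(round(m - o, digits=2), 9), " s")
end

total_change = sum(multithreaded) - sum(original)
println("Change in summed folder time: ", round(total_change, digits=2), " s")
```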

The profiler shows that 68% of the time is spent saving the CSVs, 14% concatenating the data, and only 3% reading the data.
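
A breakdown like this can be collected with Julia's built-in Profile standard library. The sketch below profiles a hypothetical stand-in function, busy_work, rather than the real regrouping script, and prints a flat report sorted by sample count.

```julia
using Profile

# Hypothetical stand-in for one stage of the pipeline.
function busy_work(n::Integer)
    total = 0.0
    for i in 1:n
        total += sqrt(i)
    end
    return total
end

busy_work(10)                    # run once first so compilation is not profiled
Profile.clear()                  # discard samples from any earlier profiling
@profile busy_work(20_000_000)   # collect samples while the work runs
Profile.print(format=:flat, sortedby=:count)
```

In the flat report, the lines with the highest counts are where the sampled time went, which is how shares such as "saving" versus "reading" can be estimated.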

With pooled concatenation instead of serial concatenation:

        Regrouping files...
        Processed folder info in 4.1 seconds
        Processed folder local-monthly-gross-returns in 104.27 seconds
        Processed folder local-monthly-net-returns in 148.02 seconds
        Processed folder monthly-costs in 67.38 seconds
        Processed folder monthly-morningstar-category in 269.78 seconds
        Processed folder monthly-net-assets in 310.17 seconds
        Processed folder usd-monthly-gross-returns in 75.4 seconds
        Processed folder usd-monthly-net-returns in 98.71 seconds
        Finished refining mutual fund data in 1093.77 seconds
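
The difference between the two approaches can be illustrated as below, assuming "pooled" means collecting all the pieces first and concatenating them in a single call, while "serial" means growing the result one piece at a time. Plain vectors stand in for the dataframes here, and both function names are hypothetical.

```julia
# Four small pieces standing in for the per-file tables.
parts = [collect(i:i+2) for i in 1:3:10]

# Serial: every step allocates a new array and copies everything accumulated so far.
function concat_serial(parts)
    acc = empty(first(parts))
    for p in parts
        acc = vcat(acc, p)
    end
    return acc
end

# Pooled: hand all the pieces to one reduction so the result can be built in one pass.
concat_pooled(parts) = reduce(vcat, parts)

println(concat_serial(parts) == concat_pooled(parts))
```

Both give the same result; the serial version simply repeats the copying work as the accumulated table grows.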

Here is the output of the cleaning code with the addition of cleaning for thousands-separator commas:

        Copying csv files...
        Finished copying csv files in 22.09 seconds.
        Cleaning csv files...
        Finished cleaning info in 0.72 seconds.
        Finished cleaning local-monthly-gross-returns in 36.03 seconds.
        Finished cleaning local-monthly-net-returns in 28.44 seconds.
        Finished cleaning monthly-costs in 15.35 seconds.
        Finished cleaning monthly-morningstar-category in 25.19 seconds.
        Finished cleaning monthly-net-assets in 25.53 seconds.
        Finished cleaning usd-monthly-gross-returns in 18.55 seconds.
        Finished cleaning usd-monthly-net-returns in 17.33 seconds.
        Finished cleaning all csv files in 189.22 seconds.

It runs in roughly the same amount of time, with most stages taking only 1-2 seconds longer than without that part. About 50% of the time is still spent removing empty double quotes. I should check whether this is still necessary.
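
As a rough sketch of the thousands-separator cleaning, the function below assumes the affected values appear as quoted numeric fields such as "1,234,567.89" and rewrites them as plain numbers. The function name strip_thousands and the exact pattern are assumptions for illustration, not the script's actual code.

```julia
# Match a quoted number with comma-grouped thousands, e.g. "1,234,567.89".
const THOUSANDS_FIELD = r"\"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?\""

# Drop the quotes and the commas from each matching field; other fields are untouched.
function strip_thousands(line::AbstractString)
    return replace(line, THOUSANDS_FIELD => m -> replace(m, r"[\",]" => ""))
end

println(strip_thousands("A,\"1,234,567.89\",\"Smith, John\",12"))
```

Requiring groups of exactly three digits after each comma keeps quoted text fields that merely contain commas from being altered.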

Without thousands separators, monthly-net-assets runs substantially faster:

        Regrouping files...
        Processed folder info in 3.99 seconds
        Processed folder local-monthly-gross-returns in 103.67 seconds
        Processed folder local-monthly-net-returns in 146.86 seconds
        Processed folder monthly-costs in 72.71 seconds
        Processed folder monthly-morningstar-category in 271.08 seconds
        Processed folder monthly-net-assets in 77.28 seconds
        Processed folder usd-monthly-gross-returns in 119.39 seconds
        Processed folder usd-monthly-net-returns in 131.19 seconds
        Finished refining mutual fund data in 942.55 seconds

To test if the removal of double quotes is still necessary, I've taken it out of the cleaning step. Here is the new output of the cleaning step:

        Copying csv files...
        Finished copying csv files in 45.13 seconds.
        Cleaning csv files...
        Finished cleaning info in 0.72 seconds.
        Finished cleaning local-monthly-gross-returns in 8.07 seconds.
        Finished cleaning local-monthly-net-returns in 7.42 seconds.
        Finished cleaning monthly-costs in 4.62 seconds.
        Finished cleaning monthly-morningstar-category in 10.16 seconds.
        Finished cleaning monthly-net-assets in 11.16 seconds.
        Finished cleaning usd-monthly-gross-returns in 5.16 seconds.
        Finished cleaning usd-monthly-net-returns in 5.41 seconds.
        Finished cleaning all csv files in 97.87 seconds.

And of the grouping step:

        Regrouping files...
        Processed folder info in 4.08 seconds
        Processed folder local-monthly-gross-returns in 102.64 seconds
        Processed folder local-monthly-net-returns in 141.61 seconds
        Processed folder monthly-costs in 54.09 seconds
        Processed folder monthly-morningstar-category in 182.64 seconds
        Processed folder monthly-net-assets in 73.36 seconds
        Processed folder usd-monthly-gross-returns in 117.78 seconds
        Processed folder usd-monthly-net-returns in 126.56 seconds
        Finished refining mutual fund data in 817.61 seconds

I've renumbered the steps to include the cleaning step as the new step 1. With that in mind, the total runtime of steps 1 and 2 is now:

In [2]:
"$(round((97.87+817.61)/60, digits=2)) minutes"

"15.26 minutes"

This is about 5 minutes faster than the original implementation of what is now step 2. Some of that speed-up occurred because the monthly-morningstar-category processing time dropped (from 271.08 to 182.64 seconds) when the empty double quotes were retained. I'm not sure why, but I won't spend any time testing it.

I've just come back and updated the grouping script so that it deletes empty rows and empty columns. Here is the terminal output of that step now:

        Regrouping files...
        Processed folder info in 2.99 seconds
        Processed folder local-monthly-gross-returns in 82.84 seconds
        Processed folder local-monthly-net-returns in 144.57 seconds
        Processed folder monthly-costs in 43.51 seconds
        Processed folder monthly-morningstar-category in 240.36 seconds
        Processed folder monthly-net-assets in 91.52 seconds
        Processed folder usd-monthly-gross-returns in 127.74 seconds
        Processed folder usd-monthly-net-returns in 136.24 seconds
        Finished refining mutual fund data in 870.68 seconds

Some folders were faster and some slower, but overall it took only about 1 extra minute (870.68 vs. 817.61 seconds) and will allow for easier processing in the next stage.
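
The idea of dropping empty rows and columns can be sketched as follows, using a plain matrix with missing values as a stand-in for the dataframe. It assumes "empty" means every entry in the row or column is missing; drop_empty is a hypothetical name.

```julia
# Keep only rows and columns that contain at least one non-missing value.
function drop_empty(m::AbstractMatrix)
    keep_rows = [!all(ismissing, row) for row in eachrow(m)]
    keep_cols = [!all(ismissing, col) for col in eachcol(m)]
    return m[keep_rows, keep_cols]
end

data = [1.0     missing 2.0;
        missing missing missing;
        3.0     missing missing]

display(drop_empty(data))
```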

Later on I realised that cleaning newlines was no longer going to work, so I edited out the cleaning script altogether. The rerouted regrouping script output is now:

        Regrouping files...
        Processed folder info in 4.45 seconds
        Processed folder local-monthly-gross-returns in 87.01 seconds
        Processed folder local-monthly-net-returns in 146.95 seconds
        Processed folder monthly-costs in 47.34 seconds
        Processed folder monthly-morningstar-category in 210.86 seconds
        Processed folder monthly-net-assets in 342.48 seconds
        Processed folder usd-monthly-gross-returns in 66.58 seconds
        Processed folder usd-monthly-net-returns in 98.45 seconds
        Finished refining mutual fund data in 1020.07 seconds

There were no significant slowdowns except in monthly-net-assets. I forgot that the cleaning code also removed the thousands separators, so I'll have to add that to the regrouping code.

I edited the regrouping script to remove the thousands separators in place and reran a segment of the script so that only monthly-net-assets was processed. The output was:

        Regrouping files...
        Processed folder info in 3.02 seconds
        Processed folder monthly-net-assets in 71.43 seconds
        Finished refining mutual fund data in 75.04 seconds

With that timing, the total runtime would have been about 749 seconds (1020.07 - 342.48 + 71.43), or roughly 12.5 minutes. It seems that the cleaning script was never needed to begin with.
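
The projection can be reproduced by taking the total of the full run and swapping the monthly-net-assets time for the one from the partial rerun. The variable names below are just for illustration.

```julia
full_run_total   = 1020.07   # total of the full regrouping run, in seconds
net_assets_slow  = 342.48    # monthly-net-assets in that full run
net_assets_fast  = 71.43     # monthly-net-assets in the partial rerun

projected = full_run_total - net_assets_slow + net_assets_fast
println(round(projected, digits=2), " seconds, or ",
        round(projected / 60, digits=2), " minutes")
```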

It turns out that at least one other file contains a thousands-separator comma in at least one place. Rather than spend time devising a way to tell whether the removal is necessary, I changed the script so that thousands-separator commas are removed from all data. This increased the runtime to:

        Regrouping files...
        Processed folder info in 4.63 seconds
        Processed folder local-monthly-gross-returns in 104.02 seconds
        Processed folder local-monthly-net-returns in 188.06 seconds
        Processed folder monthly-costs in 60.86 seconds
        Processed folder monthly-morningstar-category in 268.3 seconds
        Processed folder monthly-net-assets in 99.16 seconds
        Processed folder usd-monthly-gross-returns in 120.9 seconds
        Processed folder usd-monthly-net-returns in 158.46 seconds
        Finished refining mutual fund data in 1020.94 seconds