Jon's notes #95

Open

RohanAlexander opened this issue Nov 22, 2023 · 0 comments

RohanAlexander (Owner) commented Nov 22, 2023

Just because a dataset is FAIR, it is not necessarily an unbiased representation of the world. Further, it is not necessarily fair in the everyday way that word is used i.e. impartial and honest (Lima et al. 2022). FAIR reflects whether a dataset is appropriately available, not whether it is appropriate.

This seems a bit out of place; it might read better after the following paragraph, which explains how a dataset could be problematic?

Medicine developed approaches to this over a long time. And out of that we have seen Health Insurance Portability and Accountability Act (HIPAA) in the US, and then the broader General Data Protection Regulation (GDPR) in Europe

I might replace HIPAA with CCPA? Or mention CCPA after GDPR?

More recently, approaches based on differential privacy are being implemented

Link to something describing differential privacy?
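
Even a sentence or two plus a small illustration of the Laplace mechanism might help make the idea concrete. A minimal sketch, with a made-up toy dataset, an arbitrary epsilon, and a simple counting query:

```r
# Minimal sketch of the Laplace mechanism: release a count with noise
# calibrated to the query's sensitivity and a chosen privacy budget epsilon.
# The "records" and the value of epsilon are made up for illustration.
set.seed(853)

ages <- c(23, 35, 47, 51, 62, 29, 41)   # pretend these are sensitive records
true_count <- sum(ages > 40)            # the query: how many people are over 40?

epsilon <- 1       # privacy budget: smaller means more privacy and more noise
sensitivity <- 1   # a count changes by at most 1 if one person is added or removed

# Laplace noise as the difference of two exponentials with rate epsilon/sensitivity
laplace_noise <- rexp(1, rate = epsilon / sensitivity) -
  rexp(1, rate = epsilon / sensitivity)

true_count + laplace_noise   # the value that would be released
```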

even using databases, such as SQL

The pedant in me twitched here. It's probably not a big deal, but maybe replace SQL with Postgres or MariaDB or another specific database?

the simplicity of a CSV can be a useful feature.

I like simplicity here, but I wonder whether the simplicity that makes CSVs useful is really that they are text-based, which lends itself to human inspection. One could type out, or read by eye, a (small) CSV if one needed to, but typing out a Parquet file is not something a human could do.
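
Something like this is the sort of contrast I have in mind (file names and data made up):

```r
# A CSV is plain text: a human can read, or even type out, the whole thing.
library(arrow)

small <- data.frame(name = c("Ada", "Grace"), year = c(1843, 1952))

write.csv(small, "small.csv", row.names = FALSE)
write_parquet(small, "small.parquet")

readLines("small.csv")      # the literal rows: "name","year" / "Ada",1843 / ...

# The Parquet file is binary: the raw bytes mean nothing to a human reader.
readBin("small.parquet", what = "raw", n = 10)
```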

The storage and retrieval of information is especially connected with libraries.

My first parse of this was libraries-as-in-software-libraries, which made this almost too obvious. Maybe call out that you're talking about the broader, more traditional sense?

For instance, if we want some dataset to be available for a decade, and widely available, then it becomes important to store it in open and persistent formats, such as CSV (Hart et al. 2016).

Something that might be worth highlighting here is how the evolution of our own physical storage media has had similarly complicated issues. Datasets and recordings made on various media (wax cylinders, magnetic tapes, proprietary optical disks, etc.) vary in how easy they are to read now, with some being practically impossible. Though the Parquet fan in me doesn't want "such as CSV" here, since Parquet is itself open.

Another practical concern is that the maximum file size on GitHub is 100MB. And a final concern, for some, is that GitHub is owned by Microsoft, a for-profit US technology firm.

Would mentioning git LFS be interesting or relevant here?

Section 10.3.2

This felt like it went by really fast. It's possible you've covered R packages elsewhere in the text by this point, in which case it's not a huge deal, but it felt like there were jumps over a number of steps here.

That is not to say that it is impossible. If we made a mistake, such as accidentally pushing the original dataset to GitHub then they could be recovered. And it is likely that various governments can reverse the cryptographic hashes used here.

Is it worth mentioning keys here? Specifically, that a key can allow some people to reverse this but (generally) not others. Especially given the salting discussion below, IMO keys are more fundamental than salting, though 💯 on including salting here!
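
A rough sketch of the distinction, using the openssl package with a made-up key and salt:

```r
library(openssl)

email <- "rohan@example.com"

# A plain hash: anyone can recompute it from a guessed input.
sha256(email)

# A salted hash: the salt mostly defeats precomputed lookup tables, but
# anyone who also has the salt can still test guesses.
salt <- "2fe9Qa"
sha256(paste0(email, salt))

# A keyed hash (HMAC): only someone holding the secret key can recompute or
# verify it, so the key determines who can "reverse" it by guessing inputs.
secret_key <- "store-this-somewhere-safe"
sha256(email, key = secret_key)
```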

It is also important to recognize that there are many definitions of privacy, of which differential privacy is just one []

Is the [] a typo here?

The release of that dataset could never be linked to them because they are not in it

This might be beyond the point you're trying to make here, but the first thing I thought of when I read this was the (kind of) counterpoint of police recently using DNA databases to find suspects. The suspects themselves might not be in the database, but the nature of DNA means that individuals can still be identified even if they aren't in the database.

The issue with this approach is that even if it is fine for what is of interest and known now, because it is model based, it cannot account for the questions that we do not even think to ask at the moment.

It might be worth going into a bit more detail about this. Some future unknown questions might be totally fine, but others not, yeah?

Here we discuss iterating through multiple files, and then turn to changing data formats, beginning with Apache Arrow, and finally SQL.

It would be good to mention Parquet here.

10.6.2

IMO a brief discussion of Parquet would be fantastic, especially highlighting that it's an open standard with many implementations. I still hear folks say "well yeah, but CSV is open, so I'm going to keep storing things in that" when people suggest Parquet.

We use “.arrow” when we are actively using the data, for instance, as we clean, prepare, and model. And we use “.parquet” for data storage, for instance, saving a copy of the original data, and making the final dataset available to others. This is because “.parquet” is focused on size, while “.arrow” is focused on efficiency.

There are a number of moving parts here (and this is an answer that will likely change over time), but there are some ways that querying Parquet files can be more efficient than Arrow files. The reasons are complicated and depend on a lot of details of the data: where it's stored (especially if the query is IO bound), the kinds of queries, how the rows are distributed in the files, and so on. One real example I can point to is the Apache Arrow benchmarks for dataset selectivity from S3: https://conbench.ursa.dev/runs/3fd9d75ba0744b64b37db3f5a3512d6f/?search=dataset-selectivity. The "ipc" variants are the same as .arrow files (long story about that naming convention; the project is finally coming around to calling those files just "arrow" everywhere, because IPC is really confusing!) and "parquet" is Parquet. The benchmark reads in 100% of the data, then 10%, then 1%, and Arrow is slower than Parquet in every case!

If I were recommending what folks should do in practice with stuff like this, I would generally recommend sticking with Parquet right now (then one doesn't need to decide "is this an intermediary or not?"), since the circumstances where the Arrow format is faster than Parquet (when reading with the arrow library or querying with its query engine) are relatively rare. The exception would be cases where direct, random access to files is really important. One nice thing you can do with Arrow files is memory map them, so you don't need to read the whole file into memory; you can then jump anywhere in the file and only read the areas around what you're actually using. You can do something similar with Parquet, but there's more overhead since you have to decompress big(ish) chunks. That's not a particularly common use case or pattern in on-the-ground data science, though.
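
A minimal sketch of what I mean, using a built-in dataset rather than anything realistic:

```r
# Write the same data as Parquet and as Arrow (Feather/IPC), then query each
# lazily with open_dataset() so nothing is read until collect() is called.
library(arrow)
library(dplyr)

write_parquet(mtcars, "mtcars.parquet")
write_feather(mtcars, "mtcars.arrow")   # the Arrow/IPC file format

parquet_ds <- open_dataset("mtcars.parquet", format = "parquet")
arrow_ds   <- open_dataset("mtcars.arrow", format = "feather")

# The same dplyr pipeline runs against either; for most day-to-day work the
# Parquet copy is the one worth keeping.
parquet_ds |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect()
```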

For long-term storage, we might replace “.csv” files with “.parquet” files.

Not sure if this is a hill you want to die on, but I would say that at this point one should save everything in Parquet over CSV. Even "just" for having a schema attached to the Parquet file, so figuring out how to decode, say, dates, integers, or zipcodes-as-strings-not-numbers is totally obviated (for analysis now, but in some ways even more important for analysis in the future!).
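
A tiny example of the kind of thing a stored schema obviates (column names and values made up):

```r
# A zipcode stored as a string survives a Parquet round trip, but a CSV round
# trip with default type guessing turns it into a number and drops the leading zero.
library(arrow)
library(readr)

payments <- data.frame(zipcode = c("02139", "10027"), amount = c(100, 250))

write_parquet(payments, "payments.parquet")
write_csv(payments, "payments.csv")

read_parquet("payments.parquet")$zipcode   # "02139" "10027" -- still character
read_csv("payments.csv")$zipcode           # 2139 10027 -- guessed as numeric
```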

Having considered “.parquet”, and explained the benefits compared with CSVs for large data that we are storing, we now consider “.arrow” and the benefit of this file type for large data that we are actively using. Again, these benefits will be most notable for larger datasets, and so we will consider the ProPublica US Open Payments Data, from the Centers for Medicare & Medicaid Services, which is 6.66 GB and available here. It is available as a CSV file, and so we will compare reading in the data and creating a summary of the average total amount of payment on the basis of state using read_csv(), with the same task using read_csv_arrow(). We find a considerable speed-up when using read_csv_arrow() (Table 10.2).

I find it a bit confusing to talk about the .arrow format here when the rest of the paragraph is about the arrow CSV reader and the arrow query engine.

Table 10.2

I haven't run this code on my own (I probably should...), but I'm curious what version of arrow you used here. I'm a little surprised that the manipulate-and-summarise step takes longer for arrow than it does for regular R (the "CSV" case). I believe all of the functions you're using in that dplyr query should be mapped into the arrow query engine by now, so that step should be incredibly fast.

You might also consider using open_dataset() instead of read_csv_arrow() here. That lets you run the query without reading everything into memory: (1) it's good for datasets that are larger than memory, and (2) it can be incredibly efficient since it skips reading columns or rows that are irrelevant to the query.
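
Roughly something like this; I'm assuming the file path and the Open Payments column names (Recipient_State, Total_Amount_of_Payment_USDollars) from memory, so treat them as placeholders:

```r
library(arrow)
library(dplyr)

# Point at the CSV without reading it into memory; the query engine only
# touches the columns and rows it needs, then collect() returns a tibble.
payments <- open_dataset("open_payments.csv", format = "csv")

payments |>
  group_by(Recipient_State) |>
  summarise(mean_payment = mean(Total_Amount_of_Payment_USDollars, na.rm = TRUE)) |>
  collect()
```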

Crane (2022) provides further information about specific tasks and Navarro (2022) provides helpful examples of implementation.

💯 citations here! I might have missed it, but https://arrow-user2022.netlify.app is another fantastic resource for using arrow. And Danielle's blog has many deep dives after the one you cite here.

While it may be true that the SQL is never as good as the original,

What do you mean by this? What's the original?

There is an underlying ordering build into number, date and text fields that means we can use BETWEEN on all those, not just numeric.

"There is an underlying ordering build into ..." yeah?
