Compare Data Frames
Cghlewis edited this page Jan 31, 2023 · 4 revisions
There are several use cases for comparing data frames.
- The first one that comes to mind in education research is double data entry.
  - Say, for example, you collect survey data from students using a paper form. After you collect the paper surveys, you need to enter the responses into a database for analysis. It is best practice to double enter that data to minimize entry errors.
  - Typically, two individuals enter all the surveys into two separate entry files and then reconcile any discrepancies between those files. While this doesn't catch ALL possible errors, it greatly reduces the entry errors that occur when data is entered only once into a single file.
  - Depending on how these databases are set up, your system may check for errors automatically. However, you may want to import the data into a program like R to check for errors across the entry files.
  - Notice in our example below that if we compared our two data frames, we would find a discrepancy in q1 for stu_id=12345 and a discrepancy in q2 for stu_id=12346. We would need to go back to the original forms to see what the correct response should be and fix the error in the appropriate entry file.
Example Entry 1 data file:
stu_id | q1 | q2 | q3 |
---|---|---|---|
12345 | 5 | 6 | 4 |
12346 | 2 | 3 | 4 |
12347 | 1 | 5 | 2 |
Example Entry 2 data file:
stu_id | q1 | q2 | q3 |
---|---|---|---|
12345 | NA | 6 | 4 |
12346 | 2 | 1 | 4 |
12347 | 1 | 5 | 2 |
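The comparison of the two example entry files can be sketched in base R. This is a minimal illustration, not the approach used elsewhere in this wiki: the object names (`entry1`, `entry2`, `discrepancies`) are assumptions, and the data frames are typed in by hand rather than read from files.

```r
# Recreate the two example entry files as data frames
entry1 <- data.frame(
  stu_id = c(12345, 12346, 12347),
  q1 = c(5, 2, 1),
  q2 = c(6, 3, 5),
  q3 = c(4, 4, 2)
)
entry2 <- data.frame(
  stu_id = c(12345, 12346, 12347),
  q1 = c(NA, 2, 1),
  q2 = c(6, 1, 5),
  q3 = c(4, 4, 2)
)

# Line the rows up by student id so the comparison is row-for-row
entry1 <- entry1[order(entry1$stu_id), ]
entry2 <- entry2[order(entry2$stu_id), ]
stopifnot(identical(entry1$stu_id, entry2$stu_id))

# Compare every question column, treating NA vs. a value as a mismatch
questions <- setdiff(names(entry1), "stu_id")
a <- as.matrix(entry1[questions])
b <- as.matrix(entry2[questions])
mismatch <- (is.na(a) != is.na(b)) | (!is.na(a) & !is.na(b) & a != b)

# List each discrepancy: which student, which question, and both entered values
idx <- which(mismatch, arr.ind = TRUE)
discrepancies <- data.frame(
  stu_id = entry1$stu_id[idx[, "row"]],
  question = questions[idx[, "col"]],
  entry1 = a[idx],
  entry2 = b[idx]
)
print(discrepancies)
```

The resulting `discrepancies` data frame flags q1 for stu_id 12345 and q2 for stu_id 12346, which are the two cells to check against the original paper forms.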
- Another example is when someone makes a copy of your file and you simply want to see whether the two files are identical.
- Yet another reason to compare data frames is to check column types before merging. Imagine we want to bind student data across years. If our grade level variable was numeric last year and we receive a dataset with that same variable this year, the two variables need to be the same type before we can bind the data together. We may want to compare our data frame columns before binding.
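The last two use cases can be sketched in base R as well. The data here is made up for illustration: `year1`, `year2`, and the `grade_level` values are hypothetical, and converting with `as.numeric()` is only one possible way to resolve a type mismatch.

```r
# Hypothetical data: grade_level is numeric one year and character the next
year1 <- data.frame(stu_id = c(1, 2), grade_level = c(3, 4))
year2 <- data.frame(stu_id = c(3, 4), grade_level = c("4", "5"),
                    stringsAsFactors = FALSE)
year1_copy <- year1

# Use case: is a copy exactly the same as the original?
# identical() is TRUE only if values, types, and attributes all match
is_same <- identical(year1, year1_copy)

# Use case: do the shared columns have the same type before binding?
shared <- intersect(names(year1), names(year2))
col_types <- data.frame(
  column = shared,
  year1 = vapply(year1[shared], function(x) class(x)[1], character(1)),
  year2 = vapply(year2[shared], function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
type_mismatch <- col_types[col_types$year1 != col_types$year2, ]
print(type_mismatch)

# One possible fix: convert the mismatched column, then bind the rows
year2$grade_level <- as.numeric(year2$grade_level)
all_years <- rbind(year1, year2)
```

Here `type_mismatch` contains a single row for `grade_level`, which tells us which column to convert before binding.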
Main functions used in examples
Package | Functions |
---|---|
diffdf | diffdf() |
janitor | compare_df_cols() |
dplyr | all_equal() |
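As a rough sketch of how these functions might be called on the example entry files above: the object names are assumptions, and the package calls are guarded so the code still runs if diffdf or janitor is not installed. Note that `all_equal()` was deprecated in dplyr 1.1.0, so base R's `all.equal()` is swapped in here instead.

```r
# Recreate the two example entry files
entry1 <- data.frame(stu_id = c(12345, 12346, 12347),
                     q1 = c(5, 2, 1), q2 = c(6, 3, 5), q3 = c(4, 4, 2))
entry2 <- data.frame(stu_id = c(12345, 12346, 12347),
                     q1 = c(NA, 2, 1), q2 = c(6, 1, 5), q3 = c(4, 4, 2))

# diffdf(): report value differences, matching rows on the stu_id key
if (requireNamespace("diffdf", quietly = TRUE)) {
  print(diffdf::diffdf(entry1, entry2, keys = "stu_id",
                       suppress_warnings = TRUE))
}

# compare_df_cols(): show column types side by side;
# return = "mismatch" keeps only the columns whose types differ
if (requireNamespace("janitor", quietly = TRUE)) {
  print(janitor::compare_df_cols(entry1, entry2, return = "mismatch"))
}

# Base R stand-in for the deprecated dplyr::all_equal():
# all.equal() returns TRUE or a description of the differences
same <- isTRUE(all.equal(entry1, entry2))
```

Because both entry files store every column as numeric, `compare_df_cols()` has no type mismatches to report for this data; the differences are in the values, which is what `diffdf()` and `all.equal()` look at.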
Other functions used
Package | Functions |
---|---|
readr | read_csv() |
Resources
- https://link.springer.com/article/10.3758/s13428-019-01207-3
- https://bookdown.org/Maxine/r4ds/comparing-two-data-frames-tibbles.html
- Although I decided against diving into checksums for these examples (they are too strict to solve the problems I have in these examples), I think it is worth reading about them and knowing how to create them:
  - https://www.howtogeek.com/363735/what-is-a-checksum-and-why-should-you-care/
  - https://stackoverflow.com/questions/43081791/in-r-find-whether-two-files-differ
  - https://stackoverflow.com/questions/10592148/compare-if-two-dataframe-objects-in-r-are-equal
  - https://unix.stackexchange.com/questions/616330/why-doesnt-changing-a-files-name-change-its-checksum
  - https://www.rdocumentation.org/packages/SpaDES/versions/1.3.1/topics/checksums
  - https://stat.ethz.ch/R-manual/R-devel/library/tools/html/md5sum.html
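For readers who want to try creating a checksum, here is a minimal sketch using `tools::md5sum()`, which ships with R. The files are temporary CSVs written only for this illustration, and the variable names are assumptions.

```r
# Write the same small data frame to two temporary CSV files
df <- data.frame(stu_id = c(12345, 12346), q1 = c(5, 2))
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
write.csv(df, file_a, row.names = FALSE)
write.csv(df, file_b, row.names = FALSE)

# md5sum() returns one checksum per file; identical bytes give identical sums
sums <- tools::md5sum(c(file_a, file_b))
same_bytes <- unname(sums[1] == sums[2])

# Changing a single value changes the checksum of that file
df$q1[1] <- 4
write.csv(df, file_b, row.names = FALSE)
sums_changed <- tools::md5sum(c(file_a, file_b))
still_same <- unname(sums_changed[1] == sums_changed[2])
```

A checksum only says whether two files match byte for byte. It does not say where they differ, which is why it is too strict for the double entry problem.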