
Compare Data Frames

Cghlewis edited this page Jan 31, 2023 · 4 revisions

There are several use cases for comparing data frames.

  1. The first one that comes to mind in education research is when you need to double enter data.
    • Say, for example, you collect survey data from students on a paper form. After collecting the surveys, you need to enter the responses into a database for analysis. It is best practice to double enter that data to minimize entry errors.
    • Typically, two individuals each enter all of the surveys into separate entry files, and any discrepancies between the two files are then reconciled. While this doesn't catch ALL possible errors, it greatly reduces the entry errors that occur when data is entered only once into a single file.
    • Depending on how these databases are set up, your system may check for errors automatically. However, you may want to import the data into a program like R to check for discrepancies across the entry files.
    • Notice that in our example below, comparing the two data frames would reveal a discrepancy in q1 for stu_id=12345 and another in q2 for stu_id=12346. We would need to go back to the original forms to see what the correct response should be and fix the error in the appropriate entry file.

Example Entry 1 data file:

stu_id q1 q2 q3
12345 5 6 4
12346 2 3 4
12347 1 5 2

Example Entry 2 data file:

stu_id q1 q2 q3
12345 NA 6 4
12346 2 1 4
12347 1 5 2

  2. Another example is when someone makes a copy of your file and you simply want to see whether the two files are identical.

  3. Yet another reason to compare data frames is to check column types before merging. Imagine we want to bind student data across years. If our grade level variable was numeric last year and we receive a dataset with that same variable again this year, the variable must be the same type in both datasets before we can bind them. We may want to compare our data frame columns before binding.

Compare data frames
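
As a minimal sketch of the double-entry check described above, the two example entry files can be typed in as data frames and compared. The object names `entry1`, `entry2`, and `mismatch` are illustrative assumptions, and in practice the files would more likely be read in with `read_csv()`. The `diffdf()` call only runs if the diffdf package is installed. `dplyr::all_equal()` is left out here because it was deprecated in dplyr 1.1.0; base R's `identical()` covers the simple yes/no check.

```r
# The two example entry files, typed in by hand for illustration
entry1 <- data.frame(
  stu_id = c(12345, 12346, 12347),
  q1 = c(5, 2, 1),
  q2 = c(6, 3, 5),
  q3 = c(4, 4, 2)
)
entry2 <- data.frame(
  stu_id = c(12345, 12346, 12347),
  q1 = c(NA, 2, 1),
  q2 = c(6, 1, 5),
  q3 = c(4, 4, 2)
)

# Quick yes/no check: are the two files exactly the same?
files_identical <- identical(entry1, entry2)

# Base R cell-by-cell check. Assumes both files have the same rows
# and columns in the same order. A cell counts as a mismatch if it is
# missing in only one file, or if both values are present but differ.
mismatch <- (is.na(entry1) != is.na(entry2)) |
  (!is.na(entry1) & !is.na(entry2) & entry1 != entry2)

# Row and column positions of the cells that disagree
which(mismatch, arr.ind = TRUE)

# Fuller report matched on stu_id, if diffdf is available
if (requireNamespace("diffdf", quietly = TRUE)) {
  diffs <- diffdf::diffdf(entry1, entry2, keys = "stu_id",
                          suppress_warnings = TRUE)
  print(diffs)
}
```

Matching on `stu_id` through `keys` means the diffdf report does not depend on the two files listing students in the same order, which the base R check above does assume.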

Compare data frame column types for binding
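
The grade level scenario above might look like the following sketch. The data frames `last_year` and `this_year` and their values are made up for illustration, and the sketch assumes this year's grade level arrived as character. The `compare_df_cols()` call only runs if the janitor package is installed; the base R comparison before it gives the same kind of information.

```r
# Hypothetical data: grade_level is numeric last year, character this year
last_year <- data.frame(stu_id = c(12345, 12346), grade_level = c(3, 4))
this_year <- data.frame(stu_id = c(12347, 12348), grade_level = c("3", "4"),
                        stringsAsFactors = FALSE)

# Base R: list the class of each shared column side by side
col_types <- data.frame(
  column = names(last_year),
  type_last_year = sapply(last_year, class),
  type_this_year = sapply(this_year[names(last_year)], class),
  stringsAsFactors = FALSE
)

# Keep only the columns whose types disagree
mismatched <- col_types[col_types$type_last_year != col_types$type_this_year, ]
mismatched

# Same idea with janitor, if it is available
if (requireNamespace("janitor", quietly = TRUE)) {
  print(janitor::compare_df_cols(last_year, this_year, return = "mismatch"))
}

# Once the types agree, the two years can be bound together
this_year$grade_level <- as.numeric(this_year$grade_level)
combined <- rbind(last_year, this_year)
```

Converting the character column to numeric is one possible fix; converting last year's column to character would work just as well if grade levels such as "K" can appear.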


Main functions used in examples

Package Functions
diffdf diffdf()
janitor compare_df_cols()
dplyr all_equal()

Other functions used

Package Functions
readr read_csv()

Resources
