Skip to content
This repository was archived by the owner on Mar 28, 2025. It is now read-only.
echang97 edited this page Jul 25, 2019 · 60 revisions

This is the wiki page for mapping out the Data Quality scripts

Goals

  • Compare two Excel files to determine if data was added, changed, or deleted
  • Determine if an Excel file follows its predefined format
  • Check if numbers are in line with older records

Dependencies

These should already be included with Anaconda. In case they aren't run these through terminal

  • Pandas - conda install pandas
  • Xlsxwriter - conda install xlsxwriter

What Things Do and How to Run

Be sure to change the directory to where the script is located using cd

cd Documents/GitHub/Data-Quality-Checker/

Excel Diff

  • Credit to Matthew Kudija for the Source Code
  • Highlights differences between two Excel files
  • Exports file with highlighted differences

Running Excel Diff through Terminal: python diff.py old.xlsx new.xlsx

// Note: "../" is used to navigate back one folder from the current directory (folder)
python diff.py ../files/monthly_production_05-2019 ../input/monthly_production_06-2019.xlsx
  • A Excel file will export in the output folder under "[old] vs [new].xlsx"

Format Checker

  • A Python script that accepts an Excel file
  • Creates formats based on sample Excel files
  • Checks given Excel file for:
    • New or Missing Field Names
    • Unexpected or missing Field Entries
    • Unexpected Units of measurement or New items
    • Number of Withheld rows
  • Can replace specific commodities

Setup only needs to be run once per file type (e.g. Monthly Production, Federal Production CY), edit the json file when adding new things

Running Setup through Terminal: python formatcheck.py setup filename.xlsx

// Note: The following path is in quotes because white-space is present
python formatcheck.py setup '../input/monthly_revenue_05-2019 (1).xlsx'
  • Setup is done to create a general layout of a file type. This will export a .json file which can be edited.

Running Format Check through Terminal: python formatcheck.py filename.xlsx

python formatcheck.py ../input/monthly_revenue_05-2019.xlsx

Exporting Format Check Replace python formatcheck.py export filename.xlsx

python formatcheck.py export ../input/monthly_revenue_05-2019.xlsx
  • Note: Export may be missing formatting

Number Checker

  • Grouping determines default thresholds
  • Editable json files for thresholds

Running Setup through Terminal: python numberchecker.py setup filename.xlsx

python numberchecker.py setup '../input/monthly_revenue_05-2019 (1).xlsx'
  • Setup will ask the user for input on which columns to use for grouping.
  • Type in the column names separated by commas and a single space (e.g. Commodity, Revenue Type)
  • Creates a .json file with thresholds based on standard deviation, is editable.

Running Number Check through Terminal: python numberchecker.py filename.xlsx

python numberchecker.py ../input/monthly_revenue_05-2019.xlsx

Clone this wiki locally