Skip to content
This repository was archived by the owner on Mar 28, 2025. It is now read-only.
echang97 edited this page Jun 27, 2019 · 60 revisions

This is the wiki page for mapping out the Data Quality scripts

Goals

  • Compare two Excel files to determine if data was added, changed, or deleted
  • Determine if an Excel file follows its predefined format
  • Check if numbers are in line with older records

What Things Do and How to Run

NOTES: Be sure to change the directory to where the script is located using cd

Use tab for auto-completing words to reduce the amount of typing

cd Documents/GitHub/Data-Quality-Checker/scripts

Excel Diff

  • Credit to Matthew Kudija for the Source Code
  • Highlights differences between two Excel files
  • Exports file with highlighted differences

Running Excel Diff through Terminal: python diff.py old.xlsx new.xlsx

python diff.py ../files/monthly_production_05-2019 ../files/monthly_production_06-2019.xlsx

Format Checker

  • A Python script that accepts an Excel file
  • Creates formats based on sample Excel files
  • Checks given Excel file for:
    • New or Missing Field Names
    • Unexpected or missing Field Entries
    • Unexpected Units of measurement or New items
    • Number of Withheld rows

Setup needs to be run only once per format, run again if new stuff is added

Running Setup through Terminal: python formatcheck.py setup filename.xlsx

python formatcheck.py setup '../files/monthly_revenue_05-2019 (1).xlsx'

Running Format Check through Terminal: python formatcheck.py filename.xlsx

python formatcheck.py ../files/monthly_revenue_05-2019.xlsx

Highlighter

  • Similar to Format Checker
  • Exports an Excel File highlighting anomalies
  • Can use same config files as formatcheck.py

Running Highlighter through Terminal: python highlighter.py "filename.xlsx"

Clone this wiki locally