Skip to content
This repository was archived by the owner on Mar 28, 2025. It is now read-only.
echang97 edited this page Jul 11, 2019 · 60 revisions

This is the wiki page for mapping out the Data Quality scripts

Goals

  • Compare two Excel files to determine if data was added, changed, or deleted
  • Determine if an Excel file follows its predefined format
  • Check if numbers are in line with older records

What Things Do and How to Run

NOTES: Be sure to change the directory to where the script is located using cd

Anaconda (PowerShell) Prompt is recommended

Use tab for auto-completing words to reduce the amount of typing

"../" goes back a folder

cd Documents/GitHub/Data-Quality-Checker/scripts

Excel Diff

  • Credit to Matthew Kudija for the Source Code
  • Highlights differences between two Excel files
  • Exports file with highlighted differences

Running Excel Diff through Terminal: python diff.py old.xlsx new.xlsx

python diff.py ../files/monthly_production_05-2019 ../files/monthly_production_06-2019.xlsx

Format Checker

  • A Python script that accepts an Excel file
  • Creates formats based on sample Excel files
  • Checks given Excel file for:
    • New or Missing Field Names
    • Unexpected or missing Field Entries
    • Unexpected Units of measurement or New items
    • Number of Withheld rows

Setup only needs to be run once per format, run again if new stuff is added edit the json file when adding new things

Running Setup through Terminal: python formatcheck.py setup filename.xlsx

python formatcheck.py setup '../files/monthly_revenue_05-2019 (1).xlsx'

Running Format Check through Terminal: python formatcheck.py filename.xlsx

python formatcheck.py ../files/monthly_revenue_05-2019.xlsx

Highlighter

  • Similar to Format Checker
  • Exports an Excel File highlighting anomalies
  • Can use same config files as formatcheck.py

Running Highlighter through Terminal: python highlighter.py "filename.xlsx"

Clone this wiki locally