m.csv.clean: Cleaning of CSV files (#361)

Uses the Python csv package to read a CSV file and runs a series of cleanups, especially on the column names in the first row; it also removes extra whitespace from cells. It can prefix numbers used as column names (such as a ZIP code or DOY), and it can identify dates and reformat them as needed using the Python datetime package. The interface is designed so that the user can process a large number of columns without specifying them individually, which means there is some guesswork inside to make that happen. This applies especially to columns containing dates and to columns with a missing name in the header. The name is derived from v.clean, hence m.csv.clean.
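
The header cleanups described above can be illustrated with a minimal standalone sketch. It is not the module's code: `clean_column_name` is a hypothetical, simplified stand-in for the name legalization (the module relies on `legalize_vector_name` from `grass.script` when available), and the sample header values are made up for illustration.

```python
import re
from datetime import datetime


def reformat_date(recognized_formats, clean_format, text):
    """Return text reformatted with clean_format if it parses as a date.

    The first format in recognized_formats that matches wins; text that
    matches none of them is returned unchanged.
    """
    for recognized in recognized_formats:
        try:
            return datetime.strptime(text, recognized).strftime(clean_format)
        except ValueError:
            continue  # Not a date in this format, try the next one.
    return text


def clean_column_name(name, prefix="col_"):
    """Hypothetical helper: start with an ASCII letter, keep only [A-Za-z0-9_]."""
    # Collapse runs of whitespace and drop leading and trailing whitespace.
    name = re.sub(r"\s+", " ", name).strip()
    # Prefix names that do not start with an ASCII letter (e.g., ZIP codes).
    if not re.match("[A-Za-z]", name):
        name = prefix + name
    # Replace every remaining character that is not allowed by an underscore.
    return re.sub("[^A-Za-z0-9_]", "_", name)


# Made-up header row: a number, a name with extra whitespace, and a date.
header = ["27606", " site  name ", "7/30/2021"]
cleaned = []
for column in header:
    column = reformat_date(["%m/%d/%Y", "%m/%d/%y"], "date_%Y-%m-%d", column)
    cleaned.append(clean_column_name(column))
print(cleaned)  # ['col_27606', 'site_name', 'date_2021_07_30']
```

Dates are reformatted before the name is legalized, so the dashes produced by the date format end up as underscores in the final column name.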
OSGeo · Jan 7, 2021 · ac581a8
1 parent 8b2d22e
Showing 3 changed files with 292 additions and 0 deletions.
diff --git a/grass7/misc/m.csv.clean/Makefile b/grass7/misc/m.csv.clean/Makefile
@@ -0,0 +1,7 @@
+MODULE_TOPDIR = ../..
+
+PGM = m.csv.clean
+
+include $(MODULE_TOPDIR)/include/Make/Script.make
+
+default: script
diff --git a/grass7/misc/m.csv.clean/m.csv.clean.html b/grass7/misc/m.csv.clean/m.csv.clean.html
@@ -0,0 +1,62 @@
+<h2>DESCRIPTION</h2>
+
+<em>m.csv.clean</em> reads a CSV (Comma Separated Values) file,
+cleans it, and writes a new CSV file.
+The input separator is comma (<code>,</code>) by default,
+but it can be set to any single character such as semicolon (<code>;</code>),
+pipe (<code>|</code>), or tab; the output always uses comma.
+
+<h2>NOTES</h2>
+
+Originally, the name for this module was supposed to be <em>m.csv.polish</em>
+and the module was to be accompanied by a module named <em>m.csv.czech</em>
+for checking the state of the CSV.
+
+<h2>EXAMPLES</h2>
+
+<h3>In GRASS GIS shell</h3>
+
+The following applies all the default fixes to the file <code>sampling_sites_raw.csv</code>
+and output a cleaned file <code>sampling_sites.csv</code>:
+
+<div class="code"><pre>
+m.csv.clean input=sampling_sites_raw.csv output=sampling_sites.csv
+</pre></div>
+
+<h3>In any shell</h3>
+
+The module does not use any information from the current location and mapset,
+so it is easy to run it with an ad hoc temporary location
+by executing a <code>grass --exec</code> command:
+
+<div class="code"><pre>
+grass --tmp-location XY --exec m.csv.clean input=sampling_sites_raw.csv output=sampling_sites.csv
+</pre></div>
+
+
+<h2>SEE ALSO</h2>
+
+<ul>
+    <li>
+        <em><a href="v.in.csv.html">v.in.csv</a></em>
+        for an addon module for importing CSV as vector points with coordinate transformation,
+    </li>
+    <li>
+        <em><a href="https://grass.osgeo.org/grass-stable/manuals/v.in.ascii.html">v.in.ascii</a></em>
+        for importing CSV as vector points with a different approach,
+    </li>
+    <li>
+        <em><a href="https://grass.osgeo.org/grass-stable/manuals/v.in.ogr.html">v.in.ogr</a></em>
+        for an alternative CSV import using GDAL/OGR.
+    </li>
+</ul>
+
+
+<h2>AUTHOR</h2>
+
+Vaclav Petras, <a href="http://geospatial.ncsu.edu/">NCSU Center for Geospatial Analytics</a>
+
+<!--
+<p>
+<i>Last changed: $Date$</i>
+-->
diff --git a/grass7/misc/m.csv.clean/m.csv.clean.py b/grass7/misc/m.csv.clean/m.csv.clean.py
@@ -0,0 +1,223 @@
+#!/usr/bin/env python3
+
+# AUTHOR(S): Vaclav Petras <wenzeslaus gmail com>
+#
+# COPYRIGHT: (C) 2020 Vaclav Petras and by the GRASS Development Team
+#
+#            This program is free software under the GNU General Public
+#            License (>=v2). Read the file COPYING that comes with GRASS
+#            for details.
+
+#%module
+#% label: Creates a cleaned-up copy of a CSV file
+#% description: Creates CSV files which are ready to be used in GRASS GIS
+#% keyword: miscellaneous
+#% keyword: CSV
+#% keyword: ASCII
+#%end
+
+#%option G_OPT_F_INPUT
+#% label: Input CSV file to clean up
+#% required: yes
+#%end
+
+#%option G_OPT_F_SEP
+#% answer: comma
+#% required: yes
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% label: Clean CSV output file
+#% required: yes
+#%end
+
+#%option
+#% key: prefix
+#% label: Prefix for columns which don't start with a letter
+#% description: Prefix itself must start with a letter of the English alphabet
+#% type: string
+#% required: yes
+#% answer: col_
+#%end
+
+#%option
+#% key: recognized_date
+#% label: Recognized date formats (e.g., %m/%d/%y)
+#% description: For example, %m/%d/%Y,%m/%d/%y matches 7/30/2021 and 7/30/21
+#% type: string
+#% required: no
+#% multiple: yes
+#% guisection: Date
+#%end
+
+#%option
+#% key: clean_date
+#% label: Format for the new cleaned-up date
+#% description: For example, %Y-%m-%d for 2021-07-30
+#% type: string
+#% required: no
+#% answer: date_%Y-%m-%d
+#% guisection: Date
+#%end
+
+#%option
+#% key: missing_names
+#% label: Names for the columns without a name in the header
+#% description: If only one is provided, but more than one is needed, an underscore and the column number are added
+#% type: string
+#% required: yes
+#% answer: column
+#%end
+
+#%option
+#% key: cell_clean
+#% label: Operations to apply to non-header cells in the body of the document
+#% description: The date_format operation is applied only when recognized_date is set
+#% type: string
+#% required: no
+#% multiple: yes
+#% options: strip_whitespace,collapse_whitespace,date_format,none
+#% answer: strip_whitespace,collapse_whitespace
+#%end
+
+import sys
+import csv
+import re
+from datetime import datetime
+
+import grass.script as gs
+
+try:
+    from grass.script import legalize_vector_name as make_name_sql_compliant
+except ImportError:
+    def make_name_sql_compliant(name, fallback_prefix="x"):
+        """Make *name* usable for vectors, tables, and columns
+
+        This is a simplified copy of the legalize_vector_name() function
+        from the library (not available in 7.8).
+        """
+        if fallback_prefix and re.match("[^A-Za-z]", name[0], flags=re.ASCII):
+            name = "{fallback_prefix}{name}".format(**locals())
+        name = re.sub("[^A-Za-z0-9_]", "_", name, flags=re.ASCII)
+        keywords = ["and", "or", "not"]
+        if name in keywords:
+            name = "{name}_".format(**locals())
+        return name
+
+
+def collapse_whitespace(text):
+    """Minimize the whitespace in the text.
+    
+    Replaces runs of whitespace including the unicode ones
+    by a single space.
+
+    Leading and trailing whitespace is collapsed, but not removed.
+    """
+    return re.sub(r"\s+", " ", text)
+
+
+def minimize_whitespace(text):
+    """Minimize the whitespace in the text.
+    
+    Replaces multiple whitespaces including the unicode ones
+    by a single space.
+
+    Removes leading and trailing whitespace.
+    """
+    return collapse_whitespace(text).strip()
+
+def reformat_date(detect_dates, date_format, date):
+    """Reformats date into a desired format
+
+    If *date* is not recognized as a date by one of the
+    *detect_dates* formats, original *date* is returned.
+    """
+    for detect_date in detect_dates:
+        try:
+            parsed = datetime.strptime(date, detect_date)
+            return parsed.strftime(date_format)
+        except ValueError:
+            # Not a date in this format, so we try the next one.
+            pass
+    return date
+
+def main():
+    options, flags = gs.parser()
+    in_filename = options["input"]
+    out_filename = options["output"]
+    input_separator = gs.separator(options["separator"])
+    prefix = options["prefix"]
+    # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
+    date_formats = None
+    if options["recognized_date"]:
+        date_formats = options["recognized_date"].split(",")
+    out_date_format = options["clean_date"]
+    missing_names = options["missing_names"].split(",")
+    # TODO: lowercase the column names
+
+    if prefix and re.match("[^A-Za-z]", prefix[0]):
+        gs.fatal(_("Prefix (now <{prefix}>) must start with an ASCII letter (a-z or A-Z in English alphabet)").format(prefix=prefix))
+
+    with open(in_filename, "r", newline="") as infile, open(out_filename, "w", newline="") as outfile:
+        # TODO: Input format to parameters (important)
+        # TODO: Output format to parameters (somewhat less important)
+        input_csv = csv.reader(infile, delimiter=input_separator, quotechar='"')
+        output_csv = csv.writer(outfile, delimiter=",", quotechar='"', lineterminator="\n")
+        for i, row in enumerate(input_csv):
+            # TODO: Optionally remove newlines from cells.
+            # In header and body replace by space (and turns into underscore for header).
+            if i == 0:
+                new_row = []
+                num_unnamed_columns = 0
+                duplicated_number = 2  # starting at two for duplicated names
+                for column_number, column in enumerate(row):
+                    if date_formats:
+                        column = reformat_date(date_formats, out_date_format, column)
+                    if not column:
+                        if not num_unnamed_columns:
+                            column = missing_names[0]
+                        elif len(missing_names) == 1:
+                            column = f"{missing_names[0]}_{column_number + 1}"
+                        elif num_unnamed_columns < len(missing_names):
+                            column = missing_names[num_unnamed_columns]
+                        else:
+                            column = f"{missing_names[-1]}_{duplicated_number}"
+                            duplicated_number += 1
+                        num_unnamed_columns += 1
+                    column = minimize_whitespace(column)
+                    # TODO: Also duplicate column names should be resolved here.
+                    # Perhaps just move the else of no column names here or perhaps not
+                    # because it would be difficult to navigate the code.
+                    column = make_name_sql_compliant(column, fallback_prefix=prefix)
+                    new_row.append(column)
+            else:
+                # TODO: Optionally reformat dates in the body too (but without prefix).
+                # TODO: Recognize numbers with spaces and commas and fix them.
+                # For example, 10,000 and 10 000,5 should/might be
+                # 10000 (or 10.0) and 10000.5.
+                # TODO: General find and replace for cells (which could take care of some escape chars
+                # or other mess). Question is how to make it general/more than one replace pair.
+                # (Remove would be easier to have in the interface.)
+                new_row = []
+                row_has_content = False
+                for column in row:
+                    if column:
+                        row_has_content = True
+                    # TODO: Use bools for this, perhaps a dedicated class for this type of option.
+                    # This is an experiment with extremely aggressive replacement of flags by options.
+                    if "collapse_whitespace" in options["cell_clean"]:
+                        column = collapse_whitespace(column)
+                    if "strip_whitespace" in options["cell_clean"]:
+                        column = column.strip()
+                    if date_formats and "date_format" in options["cell_clean"]:
+                        column = reformat_date(date_formats, out_date_format, column)
+                    new_row.append(column)
+                # Skips completely empty rows and rows with only separators.
+                if not row_has_content:
+                    continue
+                # TODO: Add except csv.Error as error:
+            output_csv.writerow(new_row)
+
+
+if __name__ == "__main__":
+    sys.exit(main())