Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prototype validation scripts #107

Closed
illume opened this issue Sep 27, 2022 · 7 comments
Closed

prototype validation scripts #107

illume opened this issue Sep 27, 2022 · 7 comments
Assignees

Comments

@illume
Copy link
Collaborator

illume commented Sep 27, 2022

One of the requirements is to make it easy for admins to change the validation scripts. See #46

How to do that? In order to explore this more, and to see if this is possible, I propose we implement 3 validation scripts.

See validation functions here: https://github.com/ForestGeoHack/ForestGEO/wiki/ForestGEO-App-Specification#appendix

Two possibilities:

  • SQL based
  • Javascript or typescript based functions
@esscott esscott added this to To do in ForestGEO App MVR via automation Sep 27, 2022
@illume
Copy link
Collaborator Author

illume commented Oct 19, 2022

JavaScript validations

There are a number of JavaScript validation scripts been made already.
See https://github.com/ForestGeoHack/ForestGEO/blob/main/api/upload/index.ts#L68

I expect some validations, like the CSV based ones there, will be easier and better done in JavaScript. Things like checking if it's a CSV file at all, or if there are no headers, or if the columns aren't the correct number.

Existing SQL schema does validations we didn't consider

The existing SQL schema provides validations not listed in our list of validation in the specification. I'd expect if we implemented all validations in terms of JavaScript we'd end up having to add many validations that the schema implements. For example, the schema provides uniqueness and type checks.

MySQL compatibility

SQL versions differ slightly, and CosmosDB for example doesn't have perfect or even close compatibility with MySQL.

For example, Cosmos DB does not support JOINs on multiple tables.

In Azure Cosmos DB, joins are scoped to a single item. Cross-item and cross-container joins are not supported.
https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/join

A managed MySQL server service on Azure

The lowest offered by Azure is 1GB memory for $6.205/month, with an offer of 1 year free. See pricing, with $0.115 per GB per month of storage.

Validations done within MySQL using triggers and stored procedures

This tutorial is good showing how to use a stored procedure with a trigger for validation: https://www.mysqltutorial.org/mysql-triggers/mysql-call-stored-procedure-from-trigger/

How to Call a Stored Procedure from a Trigger by Practical Examples
This tutorial shows you step by step how to call a stored procedure from a trigger in MySQL.

Example trigger based validation in MySQL

Here is a trigger that can be used for validation.

DROP TRIGGER IF EXISTS CensusTriggerTest;

delimiter //

CREATE TRIGGER CensusTriggerTest BEFORE INSERT ON Census
  FOR EACH ROW
  BEGIN
    IF NEW.PlotID > 4 THEN
      signal sqlstate '45000' set message_text = 'My Error Message';
    END IF;
  END;//

delimiter ;

Then this insert will fail because PlotID is greater than 4 (as per the check in the trigger):

INSERT into Census VALUES (16, 5, 3, '1998-11-04', '1998-12-31', NULL);

Signal can be used for error messages. https://stackoverflow.com/questions/24/throw-an-error-preventing-a-table-update-in-a-mysql-trigger

Delimiter is required because otherwise it gives us an error when we use a ;.

@illume
Copy link
Collaborator Author

illume commented Oct 19, 2022

Emma and I did the 'screenDiam' function and came up with this:

DROP PROCEDURE IF EXISTS screenDiam;

DELIMITER $$

CREATE PROCEDURE screenDiam(
    minDBH float, 
    maxDBH float
)
BEGIN
    UPDATE TempOldTrees
    SET Errors = CONCAT(TRIM(Errors),";Check DBH")
    WHERE DBH<>0 AND (DBH>maxDBH OR DBH<minDBH) AND Errors<>'NONE' AND Errors NOT LIKE '%Check DBH%' AND Errors <> "" AND Errors IS NOT NULL;

    UPDATE TempOldTrees
    SET Errors = 'Check DBH'
    WHERE DBH<>0 AND (DBH>maxDBH OR DBH<minDBH) AND Errors<>'NONE' AND Errors NOT LIKE '%Check DBH%' AND (Errors = "" OR ISNULL(Errors));
END$$

DELIMITER ;


CALL screenDiam(0.1, 1000);

SELECT TempID, TreeID, DBH, Errors FROM TempOldTrees where DBH<>0 AND (DBH>1000 OR DBH<0.1) AND Errors<>'NONE';

Some other notes we made in the process:

Q: What are temp table names? TempMultiStems, TempOldTrees, TempNewPlants?

    // forestgeo/CTFSWeb App/ctfsweb_v5.01/ctfsweb/application/models/Screeningmodel.php
    public function screenDiam ($fileName, $maxDBH, $minDBH)
    {
      //Check for diameter range
      $q1 = "SELECT TempID FROM ".$fileName." WHERE DBH<>0 AND (DBH>".$maxDBH." OR DBH<".$minDBH.") AND Errors<>'NONE'";
      $runQuery1 = $this->screeningdb->query($q1);
      if ($runQuery1->num_rows() > 0)
      {
        foreach($runQuery1->result() as $row)
        {
          $q2 = 'UPDATE '.$fileName.' SET Errors = CONCAT(TRIM(Errors),";Check DBH") WHERE TempID = '.$row->TempID.' AND Errors <> "" AND Errors IS NOT NULL';
          $q3 = 'UPDATE '.$fileName.' SET Errors = "Check DBH" WHERE TempID = '.$row->TempID.' AND (Errors = "" OR ISNULL(Errors))';
          $runQuery2 = $this->screeningdb->query($q2);
          $runQuery3 = $this->screeningdb->query($q3);
        }
      }
    }

Hard code maxDBH, and minDBH for now. They are taken from user input. See Screening.php:67

        $minDBH = $this->input->post('minDBH');
        $maxDBH = $this->input->post('maxDBH');

Can we store maxDBH, and minDBH in a table?
Or maybe in the TempOldTrees input row.
Or is there another way to give input to triggers?
Does it need to be sent from the web interface at all? Is this still a requirement?

@esscott esscott moved this from To do to In progress in ForestGEO App MVR Oct 25, 2022
@illume
Copy link
Collaborator Author

illume commented Oct 26, 2022

Some notes from meeting with Suzanne and Emma

Diameter

  • per site usually
  • some would like to do per species diameter checks

On MySQL VS CosmosDB:

  • Not so sure about NoSQL dbs to make comparison. Not being able to JOIN on separate tables would be a strong reason to not use Cosmos DB.
  • Suzanne supportive of using MySQL in the end.

On "CSV" files.

  • Tab delimited files! See "FILES DATA FORMAT_DATABASE" doc.
  • there are 5 different types of files needed, All outlined in doc.
  • Will get some example data for testing.
  • "Always have errors in a census"

stem tags

  • Stem tags depend on the site.

  • Some stem tags not unique, depends on the site.

    1. many errors are the result of this one problem: manually entering in tags. Many tags have different species. There's a validation already for this. They use the stem tag as the tag accidentally.

Editing changes

  • One weakness of CTFSWeb, can't make changes.
  • When editing make sure there is a record.

On "TempMultiStems, TempOldTrees, TempNewPlants" tables

  • better not to have separate tables. Just upload on file. This was only to reduce the paper form size.
  • old trees, census before tag numbers already in the DB.
  • multi stem, old tree with new stem.
  • new plants

Some validations on all records, only some on old ones.

Dead trees

  • Some sites do measure dead trees.
  • There is some difference of opinion on if dead trees should be recorded.
  • There is a NULL recorded in the height for dead trees.

@illume
Copy link
Collaborator Author

illume commented Oct 26, 2022

Regarding custom fields

MySQL (since 2015) supports JSON fields https://dev.mysql.com/doc/refman/5.7/en/json.html
Which allow inserting, and querying JSON data into a table column. Making it slightly better than "custom1" and custom2 that was used before.

This can be used to have schema-less data attached to a row that can be queried.

@illume
Copy link
Collaborator Author

illume commented Oct 26, 2022

page 11-13 of https://forestgeo.si.edu/sites/default/files/database_handbook-final.pdf has "CHAPTER 2: Adding Fixed Content to Your Database" sections on the type of data files that need to be uploaded.

@illume
Copy link
Collaborator Author

illume commented Dec 11, 2022

A bunch of issues have been added with the validation label. One for each function.

@illume
Copy link
Collaborator Author

illume commented Dec 12, 2022

The validation function needs to be a mysql SQL procedure.

  • Below is an example procedure that checks all the entries in a table for errors.
  • If the validation function finds any errors, it puts them in the Errors field separated by ";" characters.
  • Existing errors should not be removed. That is, add new errors to the ones that are there already.

Each validation function should start with a procedure like this. Just pasted into the github issue to start with. It will eventually live within a data validation package. But for now, adding it in as a comment in the github issue is what is needed.

DROP PROCEDURE IF EXISTS screenDiam;

DELIMITER $$

CREATE PROCEDURE screenDiam(
    minDBH float, 
    maxDBH float
)
BEGIN
    UPDATE TempOldTrees
    SET Errors = CONCAT(TRIM(Errors),";Check DBH")
    WHERE DBH<>0 AND (DBH>maxDBH OR DBH<minDBH) AND Errors<>'NONE' AND Errors NOT LIKE '%Check DBH%' AND Errors <> "" AND Errors IS NOT NULL;

    UPDATE TempOldTrees
    SET Errors = 'Check DBH'
    WHERE DBH<>0 AND (DBH>maxDBH OR DBH<minDBH) AND Errors<>'NONE' AND Errors NOT LIKE '%Check DBH%' AND (Errors = "" OR ISNULL(Errors));
END$$

DELIMITER ;


CALL screenDiam(0.1, 1000);

SELECT TempID, TreeID, DBH, Errors FROM TempOldTrees where DBH<>0 AND (DBH>1000 OR DBH<0.1) AND Errors<>'NONE';

ForestGEO App MVR automation moved this from In progress to Done Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

3 participants