Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datachecks for field lengths #56

Closed
jteresco opened this issue Jul 25, 2017 · 32 comments · Fixed by #305
Closed

Datachecks for field lengths #56

jteresco opened this issue Jul 25, 2017 · 32 comments · Fixed by #305

Comments

@jteresco
Copy link
Contributor

A system should be put in place where all entries from TM data files that end up in DB fields are checked for length to avoid problems like TravelMapping/Web#195 .

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

@jteresco jteresco self-assigned this Jul 25, 2017
@jteresco
Copy link
Contributor Author

More evidence that this should happen: http://tm.teresco.org/forum/index.php?topic=2358.0

@jteresco
Copy link
Contributor Author

This is now more important, and should be combined with a truncation of entries that are too long, since MySQL 5.7 will not allow overlength fields to be added to a column. Increasing (possibly temporarily, possibly to be replaced with a TEXT field) the width of the description in the updates table.

@yakra
Copy link
Contributor

yakra commented Nov 30, 2018

This could be done during the Route constructor. el.add_error, and, yes, halt the siteupdate process. :)

@yakra
Copy link
Contributor

yakra commented Jul 15, 2019

Forgot about this one.
I implemented at least part of it in C++:

if (banner.size() > 6)
{ el.add_error("Banner >6 characters in " + system->systemname + ".csv line: [" + line + "], " + banner);
return;
}

@yakra
Copy link
Contributor

yakra commented Jul 17, 2019

MySQL 8 won't even insert the overlength values unto the tables, and the site update process halts.
ERROR 1406 (22001) at line 1074: Data too long for column 'banner' at row 49532
fraidfd is commented out on lab2 until the HighwayData side is resolved.

@yakra
Copy link
Contributor

yakra commented Aug 27, 2019

The above means these should be handled at the ErrorList level of the siteupdate program, rather than create a DatacheckEntry that would need a successful site update in order to be shown on siteupdate.php.

@yakra
Copy link
Contributor

yakra commented Sep 6, 2019

C++

ABORTING due to 3 errors:
1: Abbrev >3 characters in usafl.csv line: [usafl;FL;FL44;;Leesburg;;fl.fl044;], Leesburg
2: Could not find Route matching root fl.fl044 in system usafl.
3: No roots in usafl_con.csv line [usafl;FL44;;Leesburg;fl.fl044]
Fri Sep  6 11:51:20 EDT 2019

The 2nd and 3rd errors happen because the Route constructor aborts, and then the HighwaySystem ctor does if (!route_list.back().is_valid()) route_list.pop_back();

@yakra
Copy link
Contributor

yakra commented Mar 24, 2020

@yakra
Copy link
Contributor

yakra commented Mar 28, 2020

In the OP, @jteresco wrote:

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

table text field proposed
constant
current
length
clinched traveler traveler 48
clinchedConnectedRoutes route root 32
clinchedConnectedRoutes traveler traveler 48
clinchedOverallMileageByRegion region regionCode 8
clinchedOverallMileageByRegion traveler traveler 48
clinchedRoutes route root 32
clinchedRoutes traveler traveler 48
clinchedSystemMileageByRegion systemName systemName 10
clinchedSystemMileageByRegion region regionCode 8
clinchedSystemMileageByRegion traveler traveler 48
connectedRouteRoots firstRoot root 32
connectedRouteRoots root root 32
connectedRoutes systemName systemName 10
connectedRoutes route route 16
connectedRoutes banner banner 6
connectedRoutes groupName ? 100
connectedRoutes firstRoot root 32
continents code continentCode 3
continents name continentName 15
countries code countryCode 3
countries name countryName 32
datacheckErrors route root 32
datacheckErrors label1 label 50
datacheckErrors label2 label 20
datacheckErrors label3 label 20
datacheckErrors code dcErrCode 22
datacheckErrors value dcErrValue 32
graphTypes category graphCategory 12
graphTypes descr graphDescr 100
graphTypes longDescr N/A N/A
graphs filename graphFilename 32
graphs descr graphDescr 100
graphs format graphFormat 10
graphs category graphCategory 12
overallMileageByRegion region regionCode 8
regions code regionCode 8
regions name regionName 48
regions country countryCode 3
regions continent continentCode 3
regions regiontype regiontype 32
routes systemName systemName 10
routes region regionCode 8
routes route route 16
routes banner banner 6
routes abbrev abbrev 3
routes city ? 100
routes root root 32
segments root root 32
systemMileageByRegion systemName systemName 10
systemMileageByRegion region regionCode 8
systemUpdates date date 10
systemUpdates region regionExtDescr
regionLongDescr
countryRegion
or...?
48
systemUpdates systemName systemName 10
systemUpdates description systemFullName 128
systemUpdates statusChange statusChange 16
systems systemName systemName 10
systems countryCode countryCode 3
systems fullName systemFullName 60
systems color color 16
systems level level 10
updates date date 10
updates region regionExtDescr
regionLongDescr
countryRegion
or...?
60
updates route routeLongName 80
updates root root 32
updates description updateText 1024
waypoints pointName label 20
waypoints root root 32
  • label
    datacheckErrors.label1 was expanded a bit after the VISIBLE_HIDDEN_COLOC datacheck was first introduced, which at the time used long graph vertex style labels. It's since been changed to list the waypoint label used in the flagged route, so this field's length can be reduced to match the other waypoint label fields.
    These are currently limited to 20 characters. I've uncovered an 18-character label so far, with a shell script still chugging away. While that one deserves a forum post, it's possible a contributor may use a longer label in a hidden waypoint for note-taking purposes. Which is legitimate, IMO. Reaching for the outer limits of what we may expect for a visible label, I come up with CHIH1604WAltBypCla(410B)_N, at 26 characters.
    (I can only hope that should we ever come across a "South Dear Leader His Majestic Royal Highness The First And Foremost Exalted Grandest Of The Grand Poobah Among Poobahs Sir Comrade Charles Christian 'Otto Yamamoto' Slater Junior Esquire Boulevard West" that needs a waypoint, we'd name it ChaSlaBlvd or OttoYamBlvd and not... uhh... yeah.)
    To be on the safe side & provide a bit of headroom, shall we expand the waypoint label fields (datacheckErrors.label1, datacheckErrors.label2, datacheckErrors.label3, waypoints.pointName) to 26 characters?

  • regionExtDescr, regionLongDescr, countryRegion... ?
    I'm not sure what to call this. Any preferences or suggestions? As seen in the updates and systemUpdates tables, these often take the form of "(countryName) regionName".
    countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83.
    Shall we expand the systemUpdates.region and updates.region to 83 characters?

  • route
    Not to be confused with where a DB field named route contains a root. (Maybe call it routeName?)
    16 characters ought to be fine. The two 12-char examples I've seen (there may be others) appear to defy naming conventions anyway.

  • systemFullName
    systemUpdates.description contains full highway system names. Some that are no longer in use, but the same basic stuff. I see no need to use a different constant from systems.fullName. The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis).
    Limit both fields to 60 characters? Add a few more chars' headroom to be on the safe side?

  • connectedRoutes.groupName and routes.city often but not always match. The one is based off the other; they're the same basic idea. I see no reason to use different constants for these two fields. Any good suggestions on what to call this one?

  • routeLongName
    The longest current example is 77 chars out of 80 allowed: Skorostnaya Avtomobil'naya Doroga Moskva - Sankt Peterburg (Vyshny Volochyok).
    Add a few more chars' headroom to be on the safe side?


C++:

Because Globals Are Bad, we could have these be static members of a class to avoid having to pass around another data structure between functions. Just do

sqlfile << "CREATE TABLE continents (code VARCHAR(" << dbfieldlength::continentCode
	<< "), name VARCHAR(" << dbfieldlength::continentName
	<< "), PRIMARY KEY(code));\n";

and off we go.

Python:

Less familiar with this, but it looks like we can do static class constants too.

@jteresco
Copy link
Contributor Author

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

@yakra
Copy link
Contributor

yakra commented Apr 1, 2020

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

Let's leave these as they are, then. We can expand them if we ever need to in the future. It's one of the things the constants are meant to facilitate.

  • Expanding label to handle 26 chars for a "worst-case" like CHIH1604WAltBypCla(410B)_N.
  • Choosing countryRegion as the most descriptive of what its field contains, as "regionExtDescr" or "regionLongDescr" could be confused with name+regionType concatenations (E.G. Saint Barthelemy (Overseas Collectivity)) that don't actually show up in a single DB field.
  • systemFullName will be set to 60 chars; less space will be allotted in the systemUpdates table, which should not be a problem.
  • Using city for connectedRoutes.groupName and routes.city.

@yakra
Copy link
Contributor

yakra commented Apr 1, 2020

In some cases we won't need to check for valid length because our input must match data that's already been verified.
For example, matching ConnectedRoute roots to an existing Route object's root field.

@yakra yakra added the pending label Apr 1, 2020
@yakra
Copy link
Contributor

yakra commented Apr 1, 2020

over-length labels will be implemented as a standard datacheck, LABEL_TOO_LONG.
ToDo: prevent datacheck.php from displaying error codes.

@yakra
Copy link
Contributor

yakra commented Apr 1, 2020

ToDo:

  • Right now, multiple errors on a single CSV line will require multiple passes thru siteupdate to be added to the ErrorList & displayed to the user, due to the way the continue statements are set up. Fix.
  • C++: { in place of } spotted. Look for more.
  • C++: tidy up indentation on el.add_error calls.
  • checks for graphs table values, starting with filenames:
    • subtract 14 B for "-collapsed.tmg"
    • any values of the following are safe with the present field sizes:
      • <continentCode>+"-continent"
      • <countryCode>+"-country"
      • <systemName>+"-system"
      • <regionCode>+"-region"
    • multisystem & multiregion: fields[1] maxes out at DBFieldLength::graphFilename - 14
    • area: a.base + to_string(a.r) maxes out at DBFieldLength::graphFilename - 19
  • sql_file.cpp & corresponding Python
  • fix missing std::to_string / str calls:
    • compiler may not complain, because char pointer arithmetic.
    • look @ new code in GitKraken & fix, in both versions
    • See what the compiler does complain about. Fix & recheck Python.
    • Python will complain. Fix & recheck C++.
  • extraneous C++ strings & length checks in Region.cpp; move back toward the old char* method
  • What "pars" items were flagged as "!xp" in C++?, where we don't report the number of fields?
    Do halves & halves, with el.add_error for unexpected field counts
    • HighwaySystem
      • one last strtok to check for too many fields
    • Region (see above)
    • any other similar classes done by strtok? None.
      • is anything strtok not using delete/return?
      • expected n populated fields
      • one last strtok to check for too many fields
  • new delete/return combos?
    How is this stuff handled & checked for validity once the ctor returns?
    (Should all be fine, but best to be safe.)
    • HighwaySystem: if level is assigned a value
    • Region: if type is assigned a value, and country & continent are found
  • "characters" -> "bytes"
  • grep & diff CREATE TABLE lines in TravelMapping.sql: before & after, Python & C++
  • Consider whether to check for unrecognised colours in HighwaySystem ctor
  • Mirror exceptions in C++. Verbatim?...
    • Will these be reported the same way by Ubuntu/CentOS/BSD? (Yes? Yes.)
siteupdate.py
approx. line
try to open
992 wpt file
1171 chopped CSV
1182 _con CSV
2236 continents.csv
2260 countries.csv
2284 regions.csv
2337 systems.csv

@yakra
Copy link
Contributor

yakra commented Apr 1, 2020

"(countryName) regionName".
countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83.
Shall we expand the systemUpdates.region and updates.region to 83 characters?

Similarly, datacheckErrors.value for VISIBLE_HIDDEN_COLOC cases is root@label.
root of 32 + label of 26 + 1 more char for @ = 59.

Instead of hard-coding these constants, we can set them relative to the other constants.

@yakra
Copy link
Contributor

yakra commented Apr 2, 2020

It's possible for HighwayData contributors to add data to areagraphs.csv, multiregion.csv and multisystem.csv that would result in a filename over the limit of what the DB can store.
Ideally, we'd add to the ErrorList and abort.

Except that we check ErrorList and abort before Setting up for graphs of highway data.
This was intentional -- @jteresco wrote in #162:

Recent changes have meant that when a data error slips in, the "ABORTING due to N errors" doesn't happen until after (the time consuming) generation of all the subgraphs. I had gotten used to being able to ignore the site update process once graph generation started. I'd like to see about getting that ABORTING condition moved back earlier so I have a better chance to notice the problem.

The subgraph speed optimizations started later that month, and graph generation is now much less time consuming -- 2m27s (including traveled graphs!) vice 21m40s.

My hope is that this is a tolerable amount of time for the ABORTING condition to be moved back after graph generation to catch the remaining errors. I'd prefer this over the couple of alternatives I can think of.

@jteresco
Copy link
Contributor Author

jteresco commented Apr 2, 2020

Yes, definitely can move after the graphs now to get the more complete error list before aborting. One possibility that is probably not worth the trouble is once any error has been detected, we skip things like writing of the actual files.

@yakra
Copy link
Contributor

yakra commented Apr 2, 2020

Skipping all area, multiregion & multisystem graphs would save us a total of 13.2s (Python/noreaster). Skipping only whichever new ones cause the error would most likely be an even smaller difference. I can see the utility in generating a desired file, just with a too-long name, and just skipping writing the DB.
Call it "not worth the trouble"?

@yakra
Copy link
Contributor

yakra commented Apr 4, 2020

The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis).

I think when I ran LongFields.py, I had an old head checked out (the one I use for speed tests) that predates frabfcd70;FRA;Bourgogne-Franche-Comté Routes Départementales (Haute-Saône);yellow;5;devel.
Interestingly, LongFields.py reports this as 60 characters, seemingly right up against the 60 char DB limit. But it turns out this has 3 multi-byte characters, making it total of 63 bytes.
The revised C++ program correctly reports this as oversized. I'm still working out some weird runtime bugs with that version, and have not tested the new siteupdate.py yet. It'll be interesting to see what it has to say about this when I run it.

There are a couple more systems still commented out with longer names:

63	Bourgogne-Franche-Comté Routes Départementales (Côte-d’Or)
63	Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
66	Bourgogne-Franche-Comté Routes Départementales (Saône-et-Loire)
72	Bourgogne-Franche-Comté Routes Départementales (Territoire de Belfort)

My first thought was to make a stopgap expansion of systems.fullName pending sorting out the rest of this issue, but closer inspection reveals many Routes Départementales systems' names to be questionable and able to be shortened. I'll take this to the forum.

@yakra
Copy link
Contributor

yakra commented Apr 4, 2020

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

Regardless of how the Python version behaves, this complicates the C++ version quite a bit, as the "error" it's reporting is a false positive.

Fun fun. I think I'll take a break & spend my next couple days of quarantine getting reacquainted with my classic video game systems.

@yakra
Copy link
Contributor

yakra commented Apr 4, 2020

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

I'm guessing this changed some time between MySQL 5.7 & 8.0, along with the transition from 5.7 accepting & truncating overlength values, to 8.0 stopping with an error message...

  • TravelMapping-2020-03-24@21:13:20.sql worked fine on noreaster
  • lab2 barfed on it: ERROR 1406 (22001) at line 726: Data too long for column 'fullName' at row 362
    (the offending datum being Finland Seututiet (outlined - temp preview to fix active routes))

I'm bummed that I didn't take a screenshot of travelmapping.net when things were messed up, but ISTR the ñ being displayed as a �, as in TravelMapping/Web#195. Here's how it appears on lab2:
MEX57Cañ

Seems the thing to do is plan for the most restrictive circumstances, that we're limited by byte counts rather than character counts.
Pro: We don't have to get the C++ version to do character counts.
Con: We do have to get the Python version to do byte counts.

Edit: OK, it's pretty straightforward:

>>> print(len('ñ'.encode('utf-8')))
2

@yakra
Copy link
Contributor

yakra commented Apr 4, 2020

diffs

  • The old C++ version would stop processing a Route object if it encountered an overlength Banner or Abbrev. Fixed. Once I remove bel.n712pel, there are no diffs between the old and new logs/stats/graphs/nmp_merged.
  • siteupdate.py is running right now, 2nd pass just started. ToDo next.
  • And then, diff the new C++ & new Python output.

  • After that, on to changing all the new len('ñ') to len('ñ'.encode('utf-8')) checks.

@yakra
Copy link
Contributor

yakra commented Apr 6, 2020

OK, finally about ready to go on this.
Considering all the small things to watch out for & fix that kept popping up, it'd be wise to break one of everything in HighwayData and make sure everything behaves as expected.

Since the C++ version is all about speed, it writes the .sql file in the background during the graph generation process. Thus we have almost a complete file by the time we reach the ABORTING condition. All that's left are datacheckErrors, graphTypes & graphs. This is sometimes reported as taking 0.1s to write, so I just decided to finish it off before terminating. Not much harm in a junk .sql file hanging around, and it's there completely if you wanna do a post-mortem grep. And who knows, depending on the errors involved, it may even be possible to manually ingest it into a lab2 with no, or even a noreaster with minimal, ill effects.
What matters is that the program terminates with an exit code of 1, so localupdate.sh knows to not continue.

A final thought:
How about checking for unrecognized colors in systems.csv? We could have a set of the "official TM" colors, and check each systems.csv datum for membership in this set, catching potential misspellings etc. Just that DBFieldLength is no longer the best class name for all our constants. Ideas on a better name? Thoughts?

@yakra
Copy link
Contributor

yakra commented Apr 8, 2020

  • Break & test the following in C++:
    • region code in chopped routes .csv
    • country code in:
      • regions.csv
      • systems.csv
    • continent code in regions.csv
  • Keep dummy entries out of the DB

@jteresco
Copy link
Contributor Author

jteresco commented Apr 8, 2020

A thought on colors, probably overkill. We could go a bit further, having a colors.csv that defines names and RGB values for the "official" TM colors, introducing a new DB table to store them, the systems table would refer to those entries, and the front-end code would load up its set of colors from that new DB table.

@yakra
Copy link
Contributor

yakra commented Apr 8, 2020

I don't see a big need to specify RGB values, as we can already override them, albeit on a per-tier rather than per-color basis. I can happily agree to "probably overkill", and leave this feature out. It'll be good to get this enhancement out the door.

Before the big stress test gets underway, while I'm in the CSV parsing routes I'm giving the C++ ones a much-needed cleanup, and tightening a couple other minor screws.
After, there will be at least one merge conflict to resolve on my end.

@yakra
Copy link
Contributor

yakra commented Apr 9, 2020

There's a remote chance that the MALFORMED_LAT & MALFORMED_LON datachecks can produce overlength info values. I included checks for this, but the Python version is not 100% foolproof.
Rather than spend time on a fix and then gunk up the commit history with obsolete code that will just be overwritten again in a few days, I think I'll scrap these changes (in both languages) & go no-build pending a cleanup per #299.

@yakra
Copy link
Contributor

yakra commented Apr 10, 2020

cd ~/TravelMapping/yakra/DataProcessing/siteupdate/cplusplus/

g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate_const 2>&1
./siteupdate_const -t 4 -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData | tee siteupdate_m1.log
grep -n ABORTING siteupdate_m1.log; AbortLineCPP=$(grep -n ABORTING siteupdate_m1.log | cut -f1 -d:)
tail -n +$AbortLineCPP siteupdate_m1.log | less

cd ../python-teresco

./siteupdate.py -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData -t 1 | tee siteupdate_s1.log
grep -n ABORTING siteupdate_s1.log; AbortLinePY=$(grep -n ABORTING siteupdate_s1.log | cut -f1 -d:)
tail -n +$AbortLinePY siteupdate_s1.log | less

cd ..

diff <(tail -n +`expr $AbortLineCPP + 1` cplusplus/siteupdate_m1.log | sed 's~^[0-9]\+: ~~' | head -n 40) <(tail -n +`expr $AbortLinePY + 1` python-teresco/siteupdate_s1.log | sed 's~^[0-9]\+: ~~' | head -n 40)

@yakra
Copy link
Contributor

yakra commented Apr 10, 2020

        # look up country and continent, add index into those arrays
        # in case they're needed for lookups later (not needed for DB)
        for i in range(len(countries)):
            country = countries[i][0]
            if country == fields[2]:
                fields.append(i)
                break
        if len(fields) != 6:
            el.add_error("Could not find country matching regions.csv line: " + line)
            continue
        for i in range(len(continents)):
            continent = continents[i][0]
            if continent == fields[3]:
                fields.append(i)
                break
        if len(fields) != 7:
            el.add_error("Could not find continent matching regions.csv line: " + line)
            continue

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

@yakra
Copy link
Contributor

yakra commented Apr 11, 2020

  • vdiffs are slowly becoming sane
  • Combine where feasible (which is nowhere):
    • if len(line) == 0 or line.startswith('#'):
    • if (line.empty() || line[0] == '#')
  • Make sure all HighwayData breakages are as expected. I know I whiffed at least 4 of'em.
  • C++: Make rg_str part of the Route object; use it to load & process WPTs
    • Change validity check in HighwaySystem ctor accordingly
  • Check DIFFs again
  • Revert the Waypoint ctor changes
  • Maybe revert the list delimiter changes for the time being too
  • Merge & deal with conflicts

@jteresco
Copy link
Contributor Author

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

I am not sure why they were there in the first place, so I suppose no harm in getting rid of them.

yakra added a commit to yakra/DataProcessing that referenced this issue Apr 12, 2020
@yakra
Copy link
Contributor

yakra commented Apr 13, 2020

# in case they're needed for lookups later (not needed for DB)

Sounds like a case of planning for the future, not fully knowing yet what's going to be done with the data.

I did simillar when starting out the C++ version, looking up & storing a pointer the the continent & country std::pairs. Nothing much is done with them, but the only less RAM-intensive option would have been to include a C char array in the Region object. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants