Datachecks for field lengths #56
Comments
More evidence that this should happen: http://tm.teresco.org/forum/index.php?topic=2358.0 |
This is now more important, and should be combined with a truncation of entries that are too long, since MySQL 5.7 will not allow overlength fields to be added to a column. Increasing (possibly temporarily, possibly to be replaced with a TEXT field) the width of the description in the updates table. |
This could be done during the Route constructor. |
Forgot about this one. DataProcessing/siteupdate/cplusplus/classes/Route/Route.cpp Lines 56 to 59 in 0817bdd
|
MySQL 8 won't even insert the overlength values into the tables, and the site update process halts. |
The above means these should be handled at the |
C++
The 2nd and 3rd errors happen because the Route constructor aborts, and then the HighwaySystem ctor does |
In the OP, @jteresco wrote:
C++: Because Globals Are Bad, we could have these be static members of a class to avoid having to pass around another data structure between functions. Just do

    sqlfile << "CREATE TABLE continents (code VARCHAR(" << dbfieldlength::continentCode
            << "), name VARCHAR(" << dbfieldlength::continentName
            << "), PRIMARY KEY(code));\n";

and off we go.
Python: Less familiar with this, but it looks like we can do static class constants too. |
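For the Python side, a minimal sketch of the same idea might look like the following. The class name `DBFieldLength`, the helper names, and the widths 3 and 15 are assumptions for illustration only, not the project's actual values.

```python
class DBFieldLength:
    """Hypothetical field-width constants; the numbers are placeholders."""
    continentCode = 3
    continentName = 15


def create_continents_table():
    """Build the CREATE TABLE statement from the shared constants."""
    return ("CREATE TABLE continents (code VARCHAR("
            + str(DBFieldLength.continentCode)
            + "), name VARCHAR("
            + str(DBFieldLength.continentName)
            + "), PRIMARY KEY(code));\n")


def fits(value, limit):
    """Return True if value fits in a column of the given width (in characters)."""
    return len(value) <= limit
```

Because the table definition and the length checks read the same class attributes, widening a column later means changing one number.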
I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB. |
Let's leave these as they are, then. We can expand them if we ever need to in the future. It's one of the things the constants are meant to facilitate.
|
In some cases we won't need to check for valid length because our input must match data that's already been verified. |
over-length labels will be implemented as a standard datacheck, LABEL_TOO_LONG. |
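A rough sketch of what such a datacheck could look like, assuming a hypothetical limit `MAX_LABEL_LENGTH` and a plain list of tuples for the datacheck entries (the real siteupdate data structures may well differ):

```python
MAX_LABEL_LENGTH = 26  # hypothetical limit, not the project's actual value


def check_label(route, label, datacheckerrors):
    """Flag an over-length label and return a version that fits the DB column.

    route and datacheckerrors are stand-ins for whatever the site update
    program really uses; here they are a string and a list of tuples.
    """
    if len(label) > MAX_LABEL_LENGTH:
        truncated = label[:MAX_LABEL_LENGTH]
        datacheckerrors.append((route, truncated, "LABEL_TOO_LONG"))
        return truncated
    return label
```

Truncating at the same moment the datacheck is logged keeps the value safe to insert while still reporting the problem to contributors.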
ToDo:
|
Similarly, instead of hard-coding these constants, we can set them relative to the other constants. |
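As an illustration of deriving one width from others, here is a small sketch. The attribute names and numbers are made up for the example; the point is only that a composite field's limit can be computed from its parts.

```python
class DBFieldLength:
    # Hypothetical base widths; not the project's real values.
    root = 32
    label = 26
    # Derived width: a composite "root@label" string needs both parts
    # plus one character for the separator.
    rootAtLabel = root + 1 + label
```

If `root` or `label` is ever expanded, `rootAtLabel` follows automatically.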
It's possible for HighwayData contributors to add data to areagraphs.csv, multiregion.csv and multisystem.csv that would result in a filename over the limit of what the DB can store. Except that we check
The subgraph speed optimizations started later that month, and graph generation is now much less time consuming -- 2m27s (including traveled graphs!) vice 21m40s. My hope is that this is a tolerable amount of time for the ABORTING condition to be moved back after graph generation to catch the remaining errors. I'd prefer this over the couple of alternatives I can think of. |
Yes, definitely can move after the graphs now to get the more complete error list before aborting. One possibility that is probably not worth the trouble is once any error has been detected, we skip things like writing of the actual files. |
Skipping all area, multiregion & multisystem graphs would save us a total of 13.2s (Python/noreaster). Skipping only whichever new ones cause the error would most likely be an even smaller difference. I can see the utility in generating a desired file, just with a too-long name, and just skipping writing the DB. |
I think when I ran LongFields.py, I had an old head checked out (the one I use for speed tests) that predates There are a couple more systems still commented out with longer names:
My first thought was to make a stopgap expansion of |
Just queried the DB on lab2; turns out that Regardless of how the Python version behaves, this complicates the C++ version quite a bit, as the "error" it's reporting is a false positive. Fun fun. I think I'll take a break & spend my next couple days of quarantine getting reacquainted with my classic video game systems. |
I'm guessing this changed some time between MySQL 5.7 & 8.0, along with the transition from 5.7 accepting & truncating overlength values, to 8.0 stopping with an error message...
I'm bummed that I didn't take a screenshot of travelmapping.net when things were messed up, but ISTR the Seems the thing to do is plan for the most restrictive circumstances, that we're limited by byte counts rather than character counts. Edit: OK, it's pretty straightforward:

    >>> print(len('ñ'.encode('utf-8')))
    2

|
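Building on that, a minimal sketch of checking and truncating by byte count rather than character count could look like this. The function names are hypothetical; `str.encode` and `bytes.decode` with `errors='ignore'` are standard Python.

```python
def utf8_len(s):
    """Length of s in bytes when encoded as UTF-8."""
    return len(s.encode('utf-8'))


def truncate_utf8(s, max_bytes):
    """Shorten s so its UTF-8 encoding fits in max_bytes.

    Slicing the encoded bytes can cut a multi-byte character in half;
    decoding with errors='ignore' drops that incomplete trailing sequence.
    """
    encoded = s.encode('utf-8')
    if len(encoded) <= max_bytes:
        return s
    return encoded[:max_bytes].decode('utf-8', errors='ignore')
```

For example, `truncate_utf8('añb', 2)` keeps only `'a'`, since the two-byte `ñ` would not fit whole in the remaining byte.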
diffs
|
OK, finally about ready to go on this. Since the C++ version is all about speed, it writes the .sql file in the background during the graph generation process. Thus we have almost a complete file by the time we reach the ABORTING condition. All that's left are datacheckErrors, graphTypes & graphs. This is sometimes reported as taking 0.1s to write, so I just decided to finish it off before terminating. Not much harm in a junk .sql file hanging around, and it's there completely if you wanna do a post-mortem grep. And who knows, depending on the errors involved, it may even be possible to manually ingest it into a lab2 with no, or even a noreaster with minimal, ill effects. A final thought: |
|
A thought on colors, probably overkill. We could go a bit further, having a colors.csv that defines names and RGB values for the "official" TM colors, introducing a new DB table to store them, the systems table would refer to those entries, and the front-end code would load up its set of colors from that new DB table. |
I don't see a big need to specify RGB values, as we can already override them, albeit on a per-tier rather than per-color basis. I can happily agree to "probably overkill", and leave this feature out. It'll be good to get this enhancement out the door. Before the big stress test gets underway, while I'm in the CSV parsing routes I'm giving the C++ ones a much-needed cleanup, and tightening a couple other minor screws. |
There's a remote chance that the MALFORMED_LAT & MALFORMED_LON datachecks can produce overlength info values. I included checks for this, but the Python version is not 100% foolproof. |
cd ~/TravelMapping/yakra/DataProcessing/siteupdate/cplusplus/
g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate_const 2>&1
./siteupdate_const -t 4 -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData | tee siteupdate_m1.log
grep -n ABORTING siteupdate_m1.log; AbortLineCPP=$(grep -n ABORTING siteupdate_m1.log | cut -f1 -d:)
tail -n +$AbortLineCPP siteupdate_m1.log | less
cd ../python-teresco
./siteupdate.py -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData -t 1 | tee siteupdate_s1.log
grep -n ABORTING siteupdate_s1.log; AbortLinePY=$(grep -n ABORTING siteupdate_s1.log | cut -f1 -d:)
tail -n +$AbortLinePY siteupdate_s1.log | less
cd ..
diff <(tail -n +`expr $AbortLineCPP + 1` cplusplus/siteupdate_m1.log | sed 's~^[0-9]\+: ~~' | head -n 40) <(tail -n +`expr $AbortLinePY + 1` python-teresco/siteupdate_s1.log | sed 's~^[0-9]\+: ~~' | head -n 40) |
    # look up country and continent, add index into those arrays
    # in case they're needed for lookups later (not needed for DB)
    for i in range(len(countries)):
        country = countries[i][0]
        if country == fields[2]:
            fields.append(i)
            break
    if len(fields) != 6:
        el.add_error("Could not find country matching regions.csv line: " + line)
        continue
    for i in range(len(continents)):
        continent = continents[i][0]
        if continent == fields[3]:
            fields.append(i)
            break
    if len(fields) != 7:
        el.add_error("Could not find continent matching regions.csv line: " + line)
        continue

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the |
|
I am not sure why they were there in the first place, so I suppose no harm in getting rid of them. |
Sounds like a case of planning for the future, not fully knowing yet what's going to be done with the data. I did similar when starting out the C++ version, looking up & storing a pointer to the continent & country |
A system should be put in place where all entries from TM data files that end up in DB fields are checked for length to avoid problems like TravelMapping/Web#195 .
A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.