
Integer schema columns converted to numeric (decimal) columns with OGRSQL dialect #3345

Closed
steveoh opened this issue Jan 5, 2021 · 23 comments · Fixed by #4979

steveoh commented Jan 5, 2021

Expected behavior and actual behavior.

I expect to be able to translate a table from MSSQL Server to PostGIS with a schema as close to the source as reasonably possible.

The actual behavior is that a column of type smallint in MSSQL translates to numeric(5) in PostGIS, and int translates to numeric(10). Numeric field types imply decimal values. PostgreSQL has equivalent integer types, but I assume the dialect isn't able to infer them from the SQL query?

One issue this creates is that when using this data in a service that serializes these types to JSON or other formats, the output is different: 10 becomes 10.0 because numeric is interpreted as a decimal. However insignificant the .0 may seem, typed deserializers will break on the int-vs-decimal property type mismatch.
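The int-vs-decimal distinction is easy to demonstrate in isolation (a minimal Python sketch; the actual service is a C# API, so this only illustrates the serialization difference, not the real code path):

```python
import json

# An integer value serializes without a fractional part.
as_int = json.dumps({"SYMBOL": 10})

# The same value read back as a decimal/float gains a ".0",
# which a strictly typed deserializer expecting an int will reject.
as_decimal = json.dumps({"SYMBOL": float(10)})

print(as_int)      # {"SYMBOL": 10}
print(as_decimal)  # {"SYMBOL": 10.0}
```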

Steps to reproduce the problem.

Create a table with a smallint, int, or bigint column, then run code similar to this to convert from MSSQL to PostGIS.

sql = 'SELECT "small_int_field", "int_field", "big_int_field" FROM "schema_of.the_table_created_above"'

options = [
  '-f',
  'PostgreSQL',
  '-dialect',
  'OGRSQL',
  '-sql',
  sql,
  '-lco',  # OVERWRITE is a layer creation option, so it needs -lco
  'OVERWRITE=YES'
]

pg_options = gdal.VectorTranslateOptions(options=options)
result = gdal.VectorTranslate(postgis_connection, mssql_connection, options=pg_options)

Operating system

Windows Server 2016 Standard

GDAL version and provenance

3.2.1 from conda


rouault commented Jan 5, 2021

Could you show the output of ogrinfo -al -so {connection_string} on the MSSQL table you are querying?


steveoh commented Jan 5, 2021

This table has smallint, int, and numeric fields.

The MSSQL schema

[screenshot]

The ogrinfo output

Layer name: Energy.COALLEASES
Geometry: Unknown (any)
Feature Count: 156
Extent: (462200.490000, 4285221.610000) - (563494.880000, 4401325.350000)
Layer SRS WKT:
(unknown)
FID Column = OBJECTID
Geometry Column = Shape
SGID93_ENERGY_CoalLeases_AREA: Real (38.8)
PERIMETER: Real (38.8)
OWNER_: Integer (10.0)
OWNER_ID: Integer (10.0)
CODE: String (3.0)
SYMBOL: Integer (5.0)
GDB_GEOMATTR_DATA: Binary (0.0)

The PostGIS schema

[screenshot]

jratike80 (Collaborator) commented

Any chance to verify that the result is the same with some newer GDAL? Version 2.3.3 is pretty old. And what happens if you use native SQL (I do not know whether there is support for a native MSSQL dialect)?


steveoh commented Jan 5, 2021

OK. I ran ogrinfo --version on the machine I ran the ogrinfo command from and it returned 3.2.0. The machine that is executing the VectorTranslate has 3.0.2 installed.

I can't use the MSSQL dialect because of #1990 and how it infers the geometry incorrectly.

Are those versions new enough or should I upgrade?


steveoh commented Jan 6, 2021

OK, I updated GDAL on the machine to 3.2.1 and will let you know what happens in the morning after it runs.


steveoh commented Jan 6, 2021

@jratike80 I have the same experience with GDAL v3.2.1.

Would you or @rouault like any other information?


steveoh commented Jan 20, 2021

bump

jratike80 commented

But isn't that right in PostGIS? Per https://www.postgresql.org/docs/13/datatype-numeric.html:

Integers can be considered to have a scale of zero.

and then

The precision must be positive, the scale zero or positive. Alternatively:
NUMERIC(precision)
selects a scale of 0.

So the error happens here:

One issue created by this is that when using this data in a service that serializes these types to json or other formats, the output is different. 10 becomes 10.0 because numeric is interpreted as a decimal.

In other words, "10 becomes 10.0 because integer is interpreted as a decimal."


steveoh commented Jan 20, 2021

Reading that, it seems correct, but is it the best field type choice?

Since PostgreSQL has more specific integer types, I would prefer to use them.

The doc also says that

The types decimal and numeric are equivalent

so whichever of the two types is used, the same interpretation (and the same error) applies.

jratike80 commented

Per RFC 50 https://trac.osgeo.org/gdal/wiki/rfc50_ogr_field_subtype, the PostGIS driver does support the smallint (OFSTInt16) subtype. Perhaps it is the MSSQL driver https://gdal.org/drivers/vector/mssqlspatial.html that does not support smallint on reading. Smallint converts into Integer (5.0), but that cannot be changed back into smallint because Integer (5.0) covers a wider range than -32768 to +32767.

Fortunately CAST ... AS SMALLINT is supported, so with some hand editing you can create the PostGIS table as you wish.

Example:

ogr2ogr -f PostgreSQL PG:"host=localhost port=5432 dbname=my_db user=user password=password" -sql "select CAST(1 AS SMALLINT) from foo as bar" foo.jml

ogrinfo PG:"host=localhost port=5432 dbname=my_db user=user password=password" bar
INFO: Open of `host=localhost port=5432 dbname=my_db user=user password=password'
      using driver `PostgreSQL' successful.

Layer name: bar
Geometry: Unknown (any)
Feature Count: 1
Extent: (501.000000, 378.000000) - (729.000000, 539.000000)
Layer SRS WKT:
(unknown)
FID Column = ogc_fid
Geometry Column = wkb_geometry
field_1: Integer(Int16) (5.0)
OGRFeature(bar):1
  field_1 (Integer(Int16)) = 1
  POLYGON ((554 539,729 400,501 378,554 539))

My conclusions (maybe not correct; I am a GDAL user like you, not a developer):

  • The MSSQL driver does not support a strict smallint datatype. You can make a feature request. Not many developers work with MSSQL, so prepare to wait, or to contract the development.
  • Meanwhile you can use CAST ... AS SMALLINT to create smallint fields in PostGIS.
  • The service that writes out GeoJSON probably has a bug if it converts integers into decimals.
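The type chain under discussion can be summarized as a lookup table (a sketch only: the OGR type and subtype names follow RFC 50, but the mapping shows what a lossless MSSQL -> OGR -> PostGIS round trip would need to preserve, not what the drivers currently do):

```python
# MSSQL integer type -> (OGR field type, OGR field subtype), RFC 50 naming.
MSSQL_TO_OGR = {
    "smallint": ("OFTInteger", "OFSTInt16"),    # 16-bit
    "int":      ("OFTInteger", "OFSTNone"),     # 32-bit
    "bigint":   ("OFTInteger64", "OFSTNone"),   # 64-bit
}

# (OGR field type, subtype) -> PostgreSQL type that preserves the width.
OGR_TO_POSTGRES = {
    ("OFTInteger", "OFSTInt16"):  "smallint",
    ("OFTInteger", "OFSTNone"):   "integer",
    ("OFTInteger64", "OFSTNone"): "bigint",
}

def lossless_pg_type(mssql_type: str) -> str:
    """PostgreSQL type a lossless MSSQL -> OGR -> PostGIS chain would produce."""
    return OGR_TO_POSTGRES[MSSQL_TO_OGR[mssql_type]]
```

Because the MSSQL driver reports smallint as plain Integer (5.0), the subtype information is lost before the PostGIS driver can use it, which is why numeric(5,0) comes out instead.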


steveoh commented Jan 20, 2021

Just to clarify, I am using the OGR SQL dialect.

Casting would be an option if I weren't dynamically loading ~400 tables.

I did write code to query the types in MSSQL and alter the PostGIS types after loading the data with OGR, but I think it would be ideal if the type inference matched.

I will look into the C# PostgreSQL driver to see why it treats a numeric(5,0) as a decimal. That is a good idea.
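The post-load fix mentioned above could look roughly like this (a sketch with hypothetical names; in practice the column list would come from querying MSSQL's INFORMATION_SCHEMA.COLUMNS):

```python
# Hypothetical helper: narrow numeric(p,0) columns back to proper
# PostgreSQL integer types after an ogr2ogr load.
PG_INT_TYPES = {"smallint": "smallint", "int": "integer", "bigint": "bigint"}

def alter_statements(table: str, int_columns: dict) -> list:
    """Build ALTER TABLE statements for the integer columns of a loaded table.

    int_columns maps column name -> source MSSQL type.
    """
    return [
        f'ALTER TABLE {table} ALTER COLUMN "{col}" '
        f'TYPE {PG_INT_TYPES[t]} USING "{col}"::{PG_INT_TYPES[t]};'
        for col, t in int_columns.items()
    ]
```

For the COALLEASES example, alter_statements("energy.coalleases", {"symbol": "smallint"}) would yield a single statement casting the column in place.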

jratike80 commented

Check first what data ogrinfo finds in your table. I get it like this:

Layer name: bar2
Geometry: None
Feature Count: 1
Layer SRS WKT:
(unknown)
inttest: Integer (5.0)
OGRFeature(bar2):0
  inttest (Integer) = 2

Inttest is numeric(5,0) in PostGIS; for ogrinfo it is Integer (5.0), and the value is "2" without decimals. Do you get something different?


steveoh commented Jan 20, 2021

I pasted the ogrinfo output in #3345 (comment)


jratike80 commented Jan 20, 2021

No, the part that contains the feature data. Run the command again without -so and copy the first feature. It begins like this:

OGRFeature(bar2):0
  inttest (Integer) = 2


steveoh commented Jan 20, 2021

OK, sorry, I'm not familiar with that tool's options.

OGRFeature(Energy.COALLEASES):1
  SGID93_ENERGY_CoalLeases_AREA (Real) = 0.00000000
  PERIMETER (Real) = 0.00000000
  OWNER_ (Integer) = 0
  OWNER_ID (Integer) = 0
  CODE (String) = fed
  SYMBOL (Integer) = 83
  GDB_GEOMATTR_DATA (Binary) = (null)
  POLYGON ((487284.05 4367065.85,487299.34 4368149.43,486505.71 4368149.43,486505.71 4367889.99,486098.33 4367889.08,486097.7 4367084.38,485695.36 4367084.6,485694.61 4366671.22,485324.45 4366669.62,485298.7 4366582.06,485298.51 4366506.81,485298.39 4366448.18,485298.26 4366393.49,485297.98 4366267.15,486096.23 4366270.62,486094.76 4365465.92,486899.45 4365469.42,486887.26 4366256.96,487680.86 4366256.97,487696.11 4367065.85,487284.05 4367065.85))

jratike80 commented

The value of SYMBOL is 83 without decimals, so you can concentrate on the GeoJSON writer, because that seems to be what adds decimals to integers.


steveoh commented Jan 20, 2021

Yup, it's a two-part issue for me.

The database table is created with a different schema. Downstream, the database driver interprets the numerics as decimals without looking at the precision.

And finally, the bug my users notice is that the serialized output is different.

If I can fix these issues in either or both places, I think that would be a win for everyone.

The database driver folks closed the issue as expected behavior, and your argument is that the types are equivalent, so my point of view seems to be in the minority.

jratike80 commented

Please clarify:

  1. In the conversion MSSQL -> PostGIS, the smallint field is converted into Numeric(5) instead of being created as smallint. This is probably due to missing support for subtypes of integers in the GDAL MSSQL driver, and GDAL could be improved in this place. However, MSSQL integers are kept as integers and the data are not changed.
  2. What driver do you mean with this: "Downstream, the database driver interprets the numerics as decimals without looking at the precision"? GDAL drivers do read integers from Numeric(5) fields as integers, as in SYMBOL (Integer) = 83.
  3. What do you mean by "finally the bug my users notice is that the serialized output is different"? Where does that serialization happen, and by what code? Do you mean https://github.com/npgsql/npgsql? GDAL does not have any problem. My PostGIS test data converts into GeoJSON this way:
{
"type": "FeatureCollection",
"name": "bar2",
"features": [
{ "type": "Feature", "properties": { "inttest": 2 }, "geometry": null }
]
}

I don't quite understand what the npgsql developer is waiting for in npgsql/npgsql#3471 (comment), but presumably they know their code.

I agree that a) support for the smallint subtype in the GDAL MSSQL driver and b) better handling by npgsql of the different ways to define an integer in the PostGIS schema would be nice. Until then it seems that you must live with the rules that npgsql sets. That requires some manual work: checking the MSSQL schema and fine-tuning the SQL.

sql = 'SELECT CAST("small_int_field" AS SMALLINT), CAST("int_field" AS INTEGER), CAST("big_int_field" AS BIGINT) FROM "schema_of.the_table_created_above"'


steveoh commented Jan 22, 2021

I apologize for the confusion. I am explaining the layers of architecture and process, with all the parts involved, to show how this issue reaches my users. Since there are a few dependent parts, I thought it would be good to lay it out.

  1. In the conversion MSSQL -> PostGIS, the smallint field is converted into Numeric(5) instead of being created as smallint. This is probably due to missing support for subtypes of integers in the GDAL MSSQL driver, and GDAL could be improved in this place. However, MSSQL integers are kept as integers and the data are not changed.

Yes, but please remember I'm using the OGR SQL driver, NOT the MSSQL driver.

  2. What driver do you mean with this: "Downstream, the database driver interprets the numerics as decimals without looking at the precision"? GDAL drivers do read integers from Numeric(5) fields as integers, as in SYMBOL (Integer) = 83.

I am referencing the C# PostgreSQL driver that is used to query PostgreSQL for the data after it has been loaded with GDAL. Nothing to do with GDAL here; I'm just explaining another layer in my chain of issues.

  3. What do you mean by "finally the bug my users notice is that the serialized output is different"? Where does that serialization happen, and by what code?

I have a C# web API that uses the npgsql driver to query data and return it to users serialized as JSON. That is the end of my chain.


jratike80 commented Jan 22, 2021

Hi,

Yes but please remember I'm using the OGR SQL driver NOT the MSSQL driver.

OGR SQL https://gdal.org/user/ogr_sql_dialect.html is not a driver. It is a query method for accessing data through the drivers. This is the list of drivers: https://gdal.org/drivers/vector/index.html.

It may be that in some circumstances the selected SQL dialect also has an effect on the results, even when reading data through the same driver. Limitations and oddities may sort of pile up. Have you confirmed that the selected SQL dialect plays a role in your case? And why do you use the OGR SQL dialect at all?


steveoh commented Jan 22, 2021

And why do you use OGR SQL dialect at all?

I already answered this question above, but I will repeat it.

Because #1990

jratike80 commented

All right, thanks. This is a long thread already and I do this just for fun, so my concentration may not be the best possible.

What prevents you from using OGR SQL with CAST...AS SMALLINT?


steveoh commented Jan 28, 2021

All right, thanks. This is a long thread already and I do this just for fun, so my concentration may not be the best possible.

I really appreciate everything you have done so far

What prevents you from using OGR SQL with CAST...AS SMALLINT?

I mentioned that I am mirroring a database with hundreds of tables with this script, and the schemas change often enough that it would be cumbersome to store a static copy somewhere. As a shortcut I SELECT * in the import loop. I suppose I could dynamically query the source for the schema and generate the statement with all the casts. All the other field types come across as I expect.
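Generating the casts dynamically, as described, might be sketched like this (a hypothetical helper; the (name, type) pairs would come from a query against INFORMATION_SCHEMA.COLUMNS on the MSSQL side):

```python
# MSSQL integer types that need an explicit CAST in the OGR SQL statement.
INT_CASTS = {"smallint": "SMALLINT", "int": "INTEGER", "bigint": "BIGINT"}

def build_select(table: str, columns: list) -> str:
    """Build an OGR SQL SELECT that casts integer columns explicitly.

    columns is a list of (name, mssql_type) pairs.
    """
    parts = [
        f'CAST("{name}" AS {INT_CASTS[t]}) AS "{name}"' if t in INT_CASTS
        else f'"{name}"'
        for name, t in columns
    ]
    return f'SELECT {", ".join(parts)} FROM "{table}"'
```

The generated string drops straight into the -sql option of the VectorTranslate call from the original report.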
