diff --git a/docs/connection/db_connection/clickhouse/sql.rst b/docs/connection/db_connection/clickhouse/sql.rst index cd331ef33..658bd052a 100644 --- a/docs/connection/db_connection/clickhouse/sql.rst +++ b/docs/connection/db_connection/clickhouse/sql.rst @@ -17,8 +17,8 @@ Syntax support Only queries with the following syntax are supported: -* ``SELECT ...`` -* ``WITH alias AS (...) SELECT ...`` +* ✅︎ ``SELECT ... FROM ...`` +* ✅︎ ``WITH alias AS (...) SELECT ...`` Queries like ``SHOW ...`` are not supported. diff --git a/docs/connection/db_connection/clickhouse/types.rst b/docs/connection/db_connection/clickhouse/types.rst index e9bdd22e9..a995cfa5f 100644 --- a/docs/connection/db_connection/clickhouse/types.rst +++ b/docs/connection/db_connection/clickhouse/types.rst @@ -55,7 +55,7 @@ But Spark does not have specific dialect for Clickhouse, so Generic JDBC dialect Generic dialect is using SQL ANSI type names while creating tables in target database, not database-specific types. If some cases this may lead to using wrong column type. For example, Spark creates column of type ``TIMESTAMP`` -which corresponds to Clickhouse's type ``DateTime32`` (precision up to seconds) +which corresponds to Clickhouse type ``DateTime32`` (precision up to seconds) instead of more precise ``DateTime64`` (precision up to nanoseconds). This may lead to incidental precision loss, or sometimes data cannot be written to created table at all. @@ -192,7 +192,10 @@ Numeric types Temporal types ~~~~~~~~~~~~~~ -Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``. +Notes: + * Datetime with timezone has the same precision as without timezone + * ``DateTime`` is alias for ``DateTime32`` + * ``TIMESTAMP`` is alias for ``DateTime32``, but ``TIMESTAMP(N)`` is alias for ``DateTime64(N)`` +-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+ | Clickhouse type (read) | Spark type | Clickhousetype (write) | Clickhouse type (create) | @@ -238,6 +241,31 @@ Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``. | ``IntervalYear`` | | | | +-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+ +.. 
warning:: + + Note that types in Clickhouse and Spark have different value ranges: + + +------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + | Clickhouse type | Min value | Max value | Spark type | Min value | Max value | + +========================+===================================+===================================+=====================+================================+================================+ + | ``Date`` | ``1970-01-01`` | ``2149-06-06`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | + +------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + | ``DateTime32`` | ``1970-01-01 00:00:00`` | ``2106-02-07 06:28:15`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` | + +------------------------+-----------------------------------+-----------------------------------+ | | | + | ``DateTime64(N=0..8)`` | ``1900-01-01 00:00:00.00000000`` | ``2299-12-31 23:59:59.99999999`` | | | | + +------------------------+-----------------------------------+-----------------------------------+ | | | + | ``DateTime64(N=9)`` | ``1900-01-01 00:00:00.000000000`` | ``2262-04-11 23:47:16.999999999`` | | | | + +------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + + So not all of values in Spark DataFrame can be written to Clickhouse. + + References: + * `Clickhouse Date documentation `_ + * `Clickhouse Datetime32 documentation `_ + * `Clickhouse Datetime64 documentation `_ + * `Spark DateType documentation `_ + * `Spark TimestampType documentation `_ + .. [4] Clickhouse support datetime up to nanoseconds precision (``23:59:59.999999999``), but Spark ``TimestampType()`` supports datetime up to microseconds precision (``23:59:59.999999``). @@ -257,17 +285,17 @@ String types +--------------------------------------+------------------+------------------------+--------------------------+ | Clickhouse type (read) | Spark type | Clickhousetype (write) | Clickhouse type (create) | +======================================+==================+========================+==========================+ -| ``IPv4`` | ``StringType()`` | ``String`` | ``String`` | +| ``FixedString(N)`` | ``StringType()`` | ``String`` | ``String`` | +--------------------------------------+ | | | -| ``IPv6`` | | | | +| ``String`` | | | | +--------------------------------------+ | | | | ``Enum8`` | | | | +--------------------------------------+ | | | | ``Enum16`` | | | | +--------------------------------------+ | | | -| ``FixedString(N)`` | | | | +| ``IPv4`` | | | | +--------------------------------------+ | | | -| ``String`` | | | | +| ``IPv6`` | | | | +--------------------------------------+------------------+ | | | ``-`` | ``BinaryType()`` | | | +--------------------------------------+------------------+------------------------+--------------------------+ @@ -352,7 +380,7 @@ and write it as ``String`` column in Clickhouse: array_column_json String, ) ENGINE = MergeTree() - ORDER BY time + ORDER BY id """, ) @@ -369,18 +397,34 @@ Then you can parse this column on Clickhouse side - for example, by creating a v .. 
code:: sql - SELECT id, JSONExtract(json_column, 'Array(String)') FROM target_tbl + SELECT + id, + JSONExtract(json_column, 'Array(String)') AS array_column + FROM target_tbl + +You can also use `ALIAS `_ +or `MATERIALIZED `_ columns +to avoid writing such expression in every ``SELECT`` clause all the time: -You can also use `ALIAS `_ columns -to avoid writing such expression in every ``SELECT`` clause all the time. +.. code-block:: sql + + CREATE TABLE default.target_tbl AS ( + id Int32, + array_column_json String, + -- computed column + array_column Array(String) ALIAS JSONExtract(json_column, 'Array(String)') + -- or materialized column + -- array_column Array(String) MATERIALIZED JSONExtract(json_column, 'Array(String)') + ) + ENGINE = MergeTree() + ORDER BY id Downsides: * Using ``SELECT JSONExtract(...)`` or ``ALIAS`` column can be expensive, because value is calculated on every row access. This can be especially harmful if such column is used in ``WHERE`` clause. -* Both ``ALIAS`` columns are not included in ``SELECT *`` clause, they should be added explicitly: ``SELECT *, calculated_column FROM table``. +* ``ALIAS`` and ``MATERIALIZED`` columns are not included in ``SELECT *`` clause, they should be added explicitly: ``SELECT *, calculated_column FROM table``. .. warning:: - `MATERIALIZED `_ and `EPHEMERAL `_ columns are not supported by Spark because they cannot be selected to determine target column type. diff --git a/docs/connection/db_connection/greenplum/types.rst b/docs/connection/db_connection/greenplum/types.rst index fb3077e9f..baea34914 100644 --- a/docs/connection/db_connection/greenplum/types.rst +++ b/docs/connection/db_connection/greenplum/types.rst @@ -102,7 +102,9 @@ See Greenplum `CREATE TABLE `_. +See: + * `official connector documentation `_ + * `list of Greenplum types `_ Numeric types ~~~~~~~~~~~~~ @@ -181,6 +183,27 @@ Temporal types | ``tstzrange`` | | | | +------------------------------------+-------------------------+-----------------------+-------------------------+ +.. warning:: + + Note that types in Greenplum and Spark have different value ranges: + + +----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + | Greenplum type | Min value | Max value | Spark type | Min value | Max value | + +================+=================================+==================================+=====================+================================+================================+ + | ``date`` | ``-4713-01-01`` | ``5874897-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | + +----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + | ``timestamp`` | ``-4713-01-01 00:00:00.000000`` | ``294276-12-31 23:59:59.999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` | + +----------------+---------------------------------+----------------------------------+ | | | + | ``time`` | ``00:00:00.000000`` | ``24:00:00.000000`` | | | | + +----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + + So not all of values can be read from Greenplum to Spark. 
+
+    References:
+        * `Greenplum types documentation `_
+        * `Spark DateType documentation `_
+        * `Spark TimestampType documentation `_
+
 .. [3]
     ``time`` type is the same as ``timestamp`` with date ``1970-01-01``.
     So instead of reading data from Postgres like ``23:59:59``
diff --git a/docs/connection/db_connection/mssql/execute.rst b/docs/connection/db_connection/mssql/execute.rst
index bed53fec5..54156ed76 100644
--- a/docs/connection/db_connection/mssql/execute.rst
+++ b/docs/connection/db_connection/mssql/execute.rst
@@ -3,6 +3,96 @@
 Executing statements in MSSQL
 =============================
 
+How to
+------
+
+There are 2 ways to execute some statement in MSSQL.
+
+Use :obj:`MSSQL.fetch `
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use this method to execute some ``SELECT`` query which returns **small number of rows**, like reading
+MSSQL config, or reading data from some reference table.
+
+Method accepts :obj:`JDBCOptions `.
+
+Connection opened using this method should be then closed with :obj:`MSSQL.close `.
+
+Syntax support
+^^^^^^^^^^^^^^
+
+This method supports **any** query syntax supported by MSSQL, like:
+
+* ✅︎ ``SELECT ... FROM ...``
+* ✅︎ ``WITH alias AS (...) SELECT ...``
+* ✅︎ ``SELECT func(arg1, arg2)`` - call function
+* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported
+
+Examples
+^^^^^^^^
+
+.. code-block:: python
+
+    from onetl.connection import MSSQL
+
+    mssql = MSSQL(...)
+
+    df = mssql.fetch(
+        "SELECT value FROM some.reference_table WHERE key = 'some_constant'",
+        options=MSSQL.JDBCOptions(query_timeout=10),
+    )
+    mssql.close()
+    value = df.collect()[0][0]  # get value from first row and first column
+
+Use :obj:`MSSQL.execute `
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use this method to execute DDL and DML operations. Each method call runs the operation in a separate transaction, and then commits it.
+
+Method accepts :obj:`JDBCOptions `.
+
+Connection opened using this method should be then closed with :obj:`MSSQL.close `.
+
+Syntax support
+^^^^^^^^^^^^^^
+
+This method supports **any** query syntax supported by MSSQL, like:
+
+* ✅︎ ``CREATE TABLE ...``, ``CREATE VIEW ...``
+* ✅︎ ``ALTER ...``
+* ✅︎ ``INSERT INTO ... SELECT ...``
+* ✅︎ ``DROP TABLE ...``, ``DROP VIEW ...``, and so on
+* ✅︎ ``EXEC procedure arg1, arg2`` or ``{call procedure(arg1, arg2)}`` - special syntax for calling procedure
+* ✅︎ ``DECLARE ... BEGIN ... END`` - execute T-SQL statement
+* ✅︎ other statements not mentioned here
+* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported
+
+Examples
+^^^^^^^^
+
+.. code-block:: python
+
+    from onetl.connection import MSSQL
+
+    mssql = MSSQL(...)
+
+    with mssql:
+        mssql.execute("DROP TABLE schema.table")
+        mssql.execute(
+            """
+            CREATE TABLE schema.table (
+                id bigint IDENTITY,
+                [key] varchar(4000),
+                value numeric
+            )
+            """,
+            options=MSSQL.JDBCOptions(query_timeout=10),
+        )
+
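+Calling a stored procedure works the same way. A minimal sketch, assuming a hypothetical
+procedure ``schema.refresh_reference_data`` already exists on the MSSQL side:
+
+.. code-block:: python
+
+    # procedure name and argument below are assumptions, not something shipped with onETL
+    mssql.execute(
+        "EXEC schema.refresh_reference_data @dry_run = 0",
+        options=MSSQL.JDBCOptions(query_timeout=30),
+    )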
+
+References
+----------
+
 .. currentmodule:: onetl.connection.db_connection.mssql.connection
 
 .. automethod:: MSSQL.fetch
diff --git a/docs/connection/db_connection/mssql/index.rst b/docs/connection/db_connection/mssql/index.rst
index 5e511e83e..696dfda53 100644
--- a/docs/connection/db_connection/mssql/index.rst
+++ b/docs/connection/db_connection/mssql/index.rst
@@ -1,12 +1,13 @@
 .. _mssql:
 
 MSSQL
-=====
+======
 
 .. toctree::
     :maxdepth: 1
     :caption: Connection
 
+    prerequisites
     connection
 
 .. toctree::
@@ -14,5 +15,12 @@ MSSQL
     :caption: Operations
 
     read
+    sql
     write
     execute
+
+.. toctree::
+    :maxdepth: 1
+    :caption: Troubleshooting
+
+    types
diff --git a/docs/connection/db_connection/mssql/prerequisites.rst b/docs/connection/db_connection/mssql/prerequisites.rst
new file mode 100644
index 000000000..33786d61c
--- /dev/null
+++ b/docs/connection/db_connection/mssql/prerequisites.rst
@@ -0,0 +1,77 @@
+.. _mssql-prerequisites:
+
+Prerequisites
+=============
+
+Version Compatibility
+---------------------
+
+* SQL Server versions: 2014 - 2022
+* Spark versions: 2.3.x - 3.5.x
+* Java versions: 8 - 20
+
+See `official documentation `_
+and `official compatibility matrix `_.
+
+Installing PySpark
+------------------
+
+To use MSSQL connector you should have PySpark installed (or injected to ``sys.path``)
+BEFORE creating the connector instance.
+
+See :ref:`install-spark` installation instruction for more details.
+
+Connecting to MSSQL
+--------------------
+
+Connection port
+~~~~~~~~~~~~~~~
+
+Connection is usually performed to port 1433. Port may differ for different MSSQL instances.
+Please ask your MSSQL administrator to provide required information.
+
+Connection host
+~~~~~~~~~~~~~~~
+
+It is possible to connect to MSSQL by using either DNS name of host or its IP address.
+
+If you're using MSSQL cluster, it is currently possible to connect only to **one specific node**.
+Connecting to multiple nodes to perform load balancing, as well as automatic failover to new master/replica are not supported.
+
+Required grants
+~~~~~~~~~~~~~~~
+
+Ask your MSSQL cluster administrator to set the following grants for the user
+used for creating a connection:
+
+.. tabs::
+
+    .. code-tab:: sql Read + Write (schema is owned by user)
+
+        -- allow creating tables for user
+        GRANT CREATE TABLE TO username;
+
+        -- allow read & write access to specific table
+        GRANT SELECT, INSERT ON username.mytable TO username;
+
+        -- only if if_exists="replace_entire_table" is used:
+        -- allow dropping/truncating the table
+        GRANT ALTER ON username.mytable TO username;
+
+    .. code-tab:: sql Read + Write (schema is not owned by user)
+
+        -- allow creating tables for user
+        GRANT CREATE TABLE TO username;
+
+        -- allow managing tables in specific schema, and inserting data to tables
+        GRANT ALTER, SELECT, INSERT ON SCHEMA::someschema TO username;
+
+    .. code-tab:: sql Read only
+
+        -- allow read access to specific table
+        GRANT SELECT ON someschema.mytable TO username;
+
+More details can be found in official documentation:
+    * `GRANT ON DATABASE `_
+    * `GRANT ON OBJECT `_
+    * `GRANT ON SCHEMA `_
diff --git a/docs/connection/db_connection/mssql/read.rst b/docs/connection/db_connection/mssql/read.rst
index 3a336f823..50c958c68 100644
--- a/docs/connection/db_connection/mssql/read.rst
+++ b/docs/connection/db_connection/mssql/read.rst
@@ -1,18 +1,74 @@
 .. _mssql-read:
 
-Reading from MSSQL
-==================
+Reading from MSSQL using ``DBReader``
+======================================
 
-There are 2 ways of distributed data reading from MSSQL:
+.. warning::
 
-* Using :obj:`DBReader ` with different :ref:`strategy`
-* Using :obj:`MSSQL.sql `
+    Please take into account :ref:`mssql-types`
 
-Both methods accept :obj:`JDBCReadOptions `
+:obj:`DBReader ` supports :ref:`strategy` for incremental data reading,
+but does not support custom queries, like JOINs.
 
-.. currentmodule:: onetl.connection.db_connection.mssql.connection
+Supported DBReader features
+---------------------------
 
-..
automethod:: MSSQL.sql +* ✅︎ ``columns`` +* ✅︎ ``where`` +* ✅︎ ``hwm``, supported strategies: +* * ✅︎ :ref:`snapshot-strategy` +* * ✅︎ :ref:`incremental-strategy` +* * ✅︎ :ref:`snapshot-batch-strategy` +* * ✅︎ :ref:`incremental-batch-strategy` +* ❌ ``hint`` (MSSQL does support hints, but DBReader not, at least for now) +* ❌ ``df_schema`` +* ✅︎ ``options`` (see :obj:`JDBCReadOptions `) + +Examples +-------- + +Snapshot strategy: + +.. code-block:: python + + from onetl.connection import MSSQL + from onetl.db import DBReader + + mssql = MSSQL(...) + + reader = DBReader( + connection=mssql, + source="schema.table", + columns=["id", "key", "CAST(value AS text) value", "updated_dt"], + where="key = 'something'", + options=MSSQL.ReadOptions(partition_column="id", num_partitions=10), + ) + df = reader.run() + +Incremental strategy: + +.. code-block:: python + + from onetl.connection import MSSQL + from onetl.db import DBReader + from onetl.strategy import IncrementalStrategy + + mssql = MSSQL(...) + + reader = DBReader( + connection=mssql, + source="schema.table", + columns=["id", "key", "CAST(value AS text) value", "updated_dt"], + where="key = 'something'", + hwm=DBReader.AutoDetectHWM(name="mssql_hwm", expression="updated_dt"), + options=MSSQL.ReadOptions(partition_column="id", num_partitions=10), + ) + + with IncrementalStrategy(): + df = reader.run() + +Read options +------------ .. currentmodule:: onetl.connection.db_connection.jdbc_connection.options diff --git a/docs/connection/db_connection/mssql/sql.rst b/docs/connection/db_connection/mssql/sql.rst new file mode 100644 index 000000000..cf3467de4 --- /dev/null +++ b/docs/connection/db_connection/mssql/sql.rst @@ -0,0 +1,52 @@ +.. _mssql-sql: + +Reading from MSSQL using ``MSSQL.sql`` +======================================== + +.. warning:: + + Please take into account :ref:`mssql-types` + +:obj:`MSSQL.sql ` allows passing custom SQL query, +but does not support incremental strategies. + +Method also accepts :obj:`JDBCReadOptions `. + +Syntax support +-------------- + +Only queries with the following syntax are supported: + +* ✅︎ ``SELECT ... FROM ...`` +* ❌ ``WITH alias AS (...) SELECT ...`` +* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported + +Examples +-------- + +.. code-block:: python + + from onetl.connection import MSSQL + + mssql = MSSQL(...) + df = mssql.sql( + """ + SELECT + id, + key, + CAST(value AS text) value, + updated_at + FROM + some.mytable + WHERE + key = 'something' + """, + options=MSSQL.ReadOptions(partition_column="id", num_partitions=10), + ) + +References +---------- + +.. currentmodule:: onetl.connection.db_connection.mssql.connection + +.. automethod:: MSSQL.sql diff --git a/docs/connection/db_connection/mssql/types.rst b/docs/connection/db_connection/mssql/types.rst new file mode 100644 index 000000000..ca46dacab --- /dev/null +++ b/docs/connection/db_connection/mssql/types.rst @@ -0,0 +1,371 @@ +.. _mssql-types: + +MSSQL <-> Spark type mapping +================================= + +Type detection & casting +------------------------ + +Spark's DataFrames always have a ``schema`` which is a list of columns with corresponding Spark types. All operations on a column are performed using column type. + +Reading from MSSQL +~~~~~~~~~~~~~~~~~~~~~~~ + +This is how MSSQL connector performs this: + +* For each column in query result (``SELECT column1, column2, ... FROM table ...``) get column name and MSSQL type. 
+* Find corresponding ``MSSQL type (read)`` -> ``Spark type`` combination (see below) for each DataFrame column. If no combination is found, raise exception.
+* Create DataFrame from query with specific column names and Spark types.
+
+Writing to some existing MSSQL table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is how MSSQL connector performs this:
+
+* Get names of columns in DataFrame. [1]_
+* Perform a ``SELECT * FROM table WHERE 1=0`` query.
+* Take only columns present in DataFrame (by name, case insensitive). For each found column get MSSQL type.
+* Find corresponding ``Spark type`` -> ``MSSQL type (write)`` combination (see below) for each DataFrame column. If no combination is found, raise exception.
+* If ``MSSQL type (write)`` matches ``MSSQL type (read)``, no additional casts will be performed, DataFrame column will be written to MSSQL as is.
+* If ``MSSQL type (write)`` does not match ``MSSQL type (read)``, DataFrame column will be cast to the target column type **on MSSQL side**.
+  For example, you can write column with text data to ``int`` column, if column contains valid integer values within supported value range and precision [2]_.
+
+.. [1]
+    This allows writing data to tables with ``DEFAULT``, ``IDENTITY`` and computed columns - if DataFrame has no such column,
+    it will be populated by MSSQL.
+
+.. [2]
+    This is true only if DataFrame column is a ``StringType()``, because text value is parsed automatically to target column type.
+
+    But other types cannot be silently converted, like ``int -> text``. This requires explicit casting, see `DBWriter`_.
+
+Create new table using Spark
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+
+    ABSOLUTELY NOT RECOMMENDED!
+
+This is how MSSQL connector performs this:
+
+* Find corresponding ``Spark type`` -> ``MSSQL type (create)`` combination (see below) for each DataFrame column. If no combination is found, raise exception.
+* Generate DDL for creating table in MSSQL, like ``CREATE TABLE (col1 ...)``, and run it.
+* Write DataFrame to created table as is.
+
+But in some cases this may lead to using a wrong column type. For example, for Spark's ``TimestampType()``
+the generated DDL uses MSSQL type ``datetime`` (precision up to milliseconds)
+instead of the more precise ``datetime2(6)`` (precision up to microseconds).
+This may lead to incidental precision loss, or sometimes data cannot be written to the created table at all.
+
+So instead of relying on Spark to create tables:
+
+.. dropdown:: See example
+
+    .. code:: python
+
+        writer = DBWriter(
+            connection=mssql,
+            table="myschema.target_tbl",
+            options=MSSQL.WriteOptions(
+                if_exists="append",
+            ),
+        )
+        writer.run(df)
+
+Always prefer creating tables with specific types **BEFORE WRITING DATA**:
+
+.. dropdown:: See example
+
+    ..
code:: python + + mssql.execute( + """ + CREATE TABLE schema.table AS ( + id bigint, + key text, + value datetime2(6) -- specific type and precision + ) + """, + ) + + writer = DBWriter( + connection=mssql, + table="myschema.target_tbl", + options=MSSQL.WriteOptions(if_exists="append"), + ) + writer.run(df) + +References +~~~~~~~~~~ + +Here you can find source code with type conversions: + +* `MSSQL -> JDBC `_ +* `JDBC -> Spark `_ +* `Spark -> JDBC `_ +* `JDBC -> MSSQL `_ + +Supported types +--------------- + +See `official documentation `_ + +Numeric types +~~~~~~~~~~~~~ + ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| MSSQL type (read) | Spark type | MSSQL type (write) | MSSQL type (create) | ++===============================+===================================+===============================+===============================+ +| ``decimal`` | ``DecimalType(P=18, S=0)`` | ``decimal(P=18, S=0)`` | ``decimal(P=18, S=0)`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``decimal(P=0..38)`` | ``DecimalType(P=0..38, S=0)`` | ``decimal(P=0..38, S=0)`` | ``decimal(P=0..38, S=0)`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``decimal(P=0..38, S=0..38)`` | ``DecimalType(P=0..38, S=0..38)`` | ``decimal(P=0..38, S=0..38)`` | ``decimal(P=0..38, S=0..38)`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``real`` | ``FloatType()`` | ``real`` | ``real`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``float`` | ``DoubleType()`` | ``float`` | ``float`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``smallint`` | ``ShortType()`` | ``smallint`` | ``smallint`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``tinyint`` | ``IntegerType()`` | ``int`` | ``int`` | ++-------------------------------+ | | | +| ``int`` | | | | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ +| ``bigint`` | ``LongType()`` | ``bigint`` | ``bigint`` | ++-------------------------------+-----------------------------------+-------------------------------+-------------------------------+ + +Temporal types +~~~~~~~~~~~~~~ + +.. note:: + + MSSQL ``timestamp`` type is alias for ``rowversion`` (see `Special types`_). It is not a temporal type! 
+ ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| MSSQL type (read) | Spark type | MSSQL type (write) | MSSQL type (create) | ++==========================================+======================================+===================================+===============================+ +| ``date`` | ``DateType()`` | ``date`` | ``date`` | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| ``smalldatetime``, minutes | ``TimestampType()``, microseconds | ``datetime``, milliseconds | ``datetime``, milliseconds | ++------------------------------------------+ | | | +| ``datetime``, milliseconds | | | | ++------------------------------------------+ | | | +| ``datetime2(N=0)``, seconds | | | | ++------------------------------------------+ | | | +| ``datetime2(N=3)``, milliseconds | | | | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| ``datetime2(N=6)``, microseconds | ``TimestampType()``, microseconds | ``datetime``, milliseconds, | ``datetime``, milliseconds, | ++------------------------------------------+--------------------------------------+ **precision loss** [3]_ | **precision loss** [3]_ | +| ``datetime2(N=7)``, 100s of nanoseconds | ``TimestampType()``, microseconds, | | | +| | **precision loss** [4]_ | | | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| ``time(N=0)``, seconds | ``TimestampType()``, microseconds, |``datetime``, milliseconds | ``datetime``, milliseconds | ++------------------------------------------+ with time format quirks [5]_ | | | +| ``time(N=3)``, milliseconds | | | | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| ``time(N=6)``, microseconds | ``TimestampType()``, microseconds, |``datetime``, milliseconds, | ``datetime``, milliseconds, | ++ | with time format quirks [5]_ | **precision loss** [3]_ | **precision loss** [3]_ | ++------------------------------------------+--------------------------------------+ | | +| ``time``, 100s of nanoseconds | ``TimestampType()``, microseconds, | | | ++------------------------------------------+ **precision loss** [4]_, | | | +| ``time(N=7)``, 100s of nanoseconds | with time format quirks [5]_ | | | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ +| ``datetimeoffset`` | ``StringType()`` | ``nvarchar`` | ``nvarchar`` | ++------------------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ + +.. 
warning::
+
+    Note that types in MSSQL and Spark have different value ranges:
+
+    +-------------------+--------------------------------+--------------------------------+---------------------+--------------------------------+--------------------------------+
+    | MSSQL type        | Min value                      | Max value                      | Spark type          | Min value                      | Max value                      |
+    +===================+================================+================================+=====================+================================+================================+
+    | ``smalldatetime`` | ``1900-01-01 00:00:00``        | ``2079-06-06 23:59:00``        | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+    +-------------------+--------------------------------+--------------------------------+                     |                                |                                |
+    | ``datetime``      | ``1753-01-01 00:00:00.000``    | ``9999-12-31 23:59:59.997``    |                     |                                |                                |
+    +-------------------+--------------------------------+--------------------------------+                     |                                |                                |
+    | ``datetime2``     | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |                     |                                |                                |
+    +-------------------+--------------------------------+--------------------------------+                     |                                |                                |
+    | ``time``          | ``00:00:00.0000000``           | ``23:59:59.9999999``           |                     |                                |                                |
+    +-------------------+--------------------------------+--------------------------------+---------------------+--------------------------------+--------------------------------+
+
+    So not all values in a Spark DataFrame can be written to MSSQL.
+
+    References:
+        * `MSSQL datetime documentation `_
+        * `MSSQL datetime2 documentation `_
+        * `Spark DateType documentation `_
+        * `Spark TimestampType documentation `_
+
+.. [3]
+    MSSQL dialect for Spark generates DDL with type ``datetime`` which has precision up to milliseconds (``23:59:59.999``, 10\ :superscript:`-3` seconds).
+    Inserting data with microsecond and higher precision (``23:59:59.999999`` .. ``23:59:59.9999999``, 10\ :superscript:`-6` .. 10\ :superscript:`-7` seconds)
+    will lead to **throwing away microseconds**.
+
+.. [4]
+    MSSQL supports timestamps up to 100s of nanoseconds precision (``23:59:59.9999999``, 10\ :superscript:`-7` seconds),
+    but Spark ``TimestampType()`` supports datetime up to microseconds precision (``23:59:59.999999``, 10\ :superscript:`-6` seconds).
+    Last digit will be lost during read or write operations.
+
+.. [5]
+    ``time`` type is read as Spark ``TimestampType()`` with date part set to ``1970-01-01``. So instead of reading data from MSSQL like ``23:59:59.999999``
+    it is actually read as ``1970-01-01 23:59:59.999999``, and vice versa (see the sketch below).
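+
+If only the time-of-day part is needed on the Spark side, the synthetic ``1970-01-01`` date part
+can be stripped after reading. A minimal sketch, assuming the DataFrame has a hypothetical
+``time_column`` read from an MSSQL ``time(N)`` column:
+
+.. code-block:: python
+
+    from pyspark.sql.functions import date_format
+
+    # time(N) values arrive as TimestampType() with the date part fixed to 1970-01-01,
+    # so keep only the time-of-day portion as a string
+    df = df.withColumn("time_only", date_format("time_column", "HH:mm:ss"))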
+ +String types +~~~~~~~~~~~~~ + ++-------------------+------------------+--------------------+---------------------+ +| MSSQL type (read) | Spark type | MSSQL type (write) | MSSQL type (create) | ++===================+==================+====================+=====================+ +| ``char`` | ``StringType()`` | ``nvarchar`` | ``nvarchar`` | ++-------------------+ | | | +| ``char(N)`` | | | | ++-------------------+ | | | +| ``nchar`` | | | | ++-------------------+ | | | +| ``nchar(N)`` | | | | ++-------------------+ | | | +| ``varchar`` | | | | ++-------------------+ | | | +| ``varchar(N)`` | | | | ++-------------------+ | | | +| ``nvarchar`` | | | | ++-------------------+ | | | +| ``nvarchar(N)`` | | | | ++-------------------+ | | | +| ``mediumtext`` | | | | ++-------------------+ | | | +| ``text`` | | | | ++-------------------+ | | | +| ``ntext`` | | | | ++-------------------+ | | | +| ``xml`` | | | | ++-------------------+------------------+--------------------+---------------------+ + +Binary types +~~~~~~~~~~~~ + ++--------------------+-------------------+--------------------+---------------------+ +| MSSQL type (read) | Spark type | MSSQL type (write) | MSSQL type (create) | ++====================+===================+====================+=====================+ +| ``bit`` | ``BooleanType()`` | ``bit`` | ``bit`` | ++--------------------+-------------------+--------------------+---------------------+ +| ``binary`` | ``BinaryType()`` | ``varbinary`` | ``varbinary`` | ++--------------------+ | | | +| ``binary(N)`` | | | | ++--------------------+ | | | +| ``varbinary`` | | | | ++--------------------+ | | | +| ``varbinary(N)`` | | | | ++--------------------+ | | | +| ``image`` | | | | ++--------------------+-------------------+--------------------+---------------------+ + +Special types +~~~~~~~~~~~~~~ + ++---------------------------+------------------+--------------------+---------------------+ +| MSSQL type (read) | Spark type | MSSQL type (write) | MSSQL type (create) | ++===========================+==================+====================+=====================+ +| ``geography`` | ``BinaryType()`` | ``varbinary`` | ``varbinary`` | ++---------------------------+ | | | +| ``geometry`` | | | | ++---------------------------+ | | | +| ``hierarchyid`` | | | | ++---------------------------+ | | | +| ``rowversion`` | | | | ++---------------------------+------------------+--------------------+---------------------+ +| ``sql_variant`` | unsupported | | | ++---------------------------+------------------+--------------------+---------------------+ +| ``sysname`` | ``StringType()`` | ``nvarchar`` | ``nvarchar`` | ++---------------------------+ | | | +| ``uniqueidentifier`` | | | | ++---------------------------+------------------+--------------------+---------------------+ + +Explicit type cast +------------------ + +``DBReader`` +~~~~~~~~~~~~ + +It is possible to explicitly cast column type using ``DBReader(columns=...)`` syntax. + +For example, you can use ``CAST(column AS text)`` to convert data to string representation on MSSQL side, and so it will be read as Spark's ``StringType()``: + +.. code-block:: python + + from onetl.connection import MSSQL + from onetl.db import DBReader + + mssql = MSSQL(...) 
+
+    reader = DBReader(
+        connection=mssql,
+        source="schema.table",
+        columns=[
+            "id",
+            "supported_column",
+            "CAST(unsupported_column AS text) unsupported_column_str",
+        ],
+    )
+    df = reader.run()
+
+    # cast column content to proper Spark type
+    df = df.select(
+        df.id,
+        df.supported_column,
+        # explicit cast
+        df.unsupported_column_str.cast("integer").alias("parsed_integer"),
+    )
+
+``DBWriter``
+~~~~~~~~~~~~
+
+Convert dataframe column to JSON using `to_json `_,
+and write it as ``text`` column in MSSQL:
+
+.. code:: python
+
+    mssql.execute(
+        """
+        CREATE TABLE schema.target_tbl (
+            id bigint,
+            struct_column_json text -- any string type, actually
+        )
+        """,
+    )
+
+    from pyspark.sql.functions import to_json
+
+    df = df.select(
+        df.id,
+        to_json(df.struct_column).alias("struct_column_json"),
+    )
+
+    writer.run(df)
+
+Then you can parse this column on the MSSQL side - for example, by creating a view:
+
+.. code:: sql
+
+    SELECT
+        id,
+        JSON_VALUE(struct_column_json, '$.nested.field') AS nested_field
+    FROM target_tbl
+
+Or by using `computed column `_:
+
+.. code-block:: sql
+
+    CREATE TABLE schema.target_table (
+        id bigint,
+        supported_column datetime2(6),
+        struct_column_json text, -- any string type, actually
+        -- computed column
+        nested_field AS (JSON_VALUE(struct_column_json, '$.nested.field'))
+        -- or persisted column
+        -- nested_field AS (JSON_VALUE(struct_column_json, '$.nested.field')) PERSISTED
+    )
+
+By default, column value is calculated on every table read.
+Column marked as ``PERSISTED`` is calculated during insert, but this requires additional space.
diff --git a/docs/connection/db_connection/mssql/write.rst b/docs/connection/db_connection/mssql/write.rst
index c8a5e5906..eb15fe93e 100644
--- a/docs/connection/db_connection/mssql/write.rst
+++ b/docs/connection/db_connection/mssql/write.rst
@@ -1,9 +1,47 @@
 .. _mssql-write:
 
-Writing to MSSQL
-================
+Writing to MSSQL using ``DBWriter``
+====================================
 
-For writing data to MSSQL, use :obj:`DBWriter ` with options below.
+For writing data to MSSQL, use :obj:`DBWriter `.
+
+.. warning::
+
+    Please take into account :ref:`mssql-types`
+
+.. warning::
+
+    It is always recommended to create the table explicitly using :obj:`MSSQL.execute `
+    instead of relying on Spark's table DDL generation.
+
+    This is because Spark's DDL generator can create columns with different precision and types than expected,
+    causing precision loss or other issues.
+
+Examples
+--------
+
+.. code-block:: python
+
+    from onetl.connection import MSSQL
+    from onetl.db import DBWriter
+
+    mssql = MSSQL(...)
+
+    df = ...  # data is here
+
+    writer = DBWriter(
+        connection=mssql,
+        target="schema.table",
+        options=MSSQL.WriteOptions(if_exists="append"),
+    )
+
+    writer.run(df)
+
+
+Write options
+-------------
+
+The method above accepts :obj:`JDBCWriteOptions `
 
 .. currentmodule:: onetl.connection.db_connection.jdbc_connection.options
diff --git a/docs/connection/db_connection/mysql/sql.rst b/docs/connection/db_connection/mysql/sql.rst
index 515dcf77c..3a1624036 100644
--- a/docs/connection/db_connection/mysql/sql.rst
+++ b/docs/connection/db_connection/mysql/sql.rst
@@ -17,12 +17,10 @@ Syntax support
 
 Only queries with the following syntax are supported:
 
-* ``SELECT ...``
-* ``WITH alias AS (...) SELECT ...``
-
-Queries like ``SHOW ...`` are not supported.
-
-This method also does not support multiple queries in the same operation, like ``SET ...; SELECT ...;``.
+* ✅︎ ``SELECT ... FROM ...``
+* ✅︎ ``WITH alias AS (...)
SELECT ...`` +* ❌ ``SHOW ...`` +* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported Examples -------- diff --git a/docs/connection/db_connection/mysql/types.rst b/docs/connection/db_connection/mysql/types.rst index 4968bd434..14e7adc64 100644 --- a/docs/connection/db_connection/mysql/types.rst +++ b/docs/connection/db_connection/mysql/types.rst @@ -47,7 +47,7 @@ This is how MySQL connector performs this: * Write DataFrame to created table as is. But some cases this may lead to using wrong column type. For example, Spark creates column of type ``timestamp`` -which corresponds to MySQL's type ``timestamp(0)`` (precision up to seconds) +which corresponds to MySQL type ``timestamp(0)`` (precision up to seconds) instead of more precise ``timestamp(6)`` (precision up to nanoseconds). This may lead to incidental precision loss, or sometimes data cannot be written to created table at all. @@ -162,7 +162,7 @@ Temporal types +-----------------------------------+--------------------------------------+-----------------------------------+-------------------------------+ | ``datetime(N=3)``, milliseconds | ``TimestampType()``, microseconds | ``timestamp(N=6)``, microseconds | ``timestamp(N=0)``, seconds, | +-----------------------------------+ | | **precision loss** [4]_, | -| ``timestamp(N=6)``, milliseconds | | | | +| ``timestamp(N=3)``, milliseconds | | | | +-----------------------------------+ | | | | ``datetime(N=6)``, microseconds | | | | +-----------------------------------+ | | | @@ -181,32 +181,28 @@ Temporal types Note that types in MySQL and Spark have different value ranges: - +-----------------------------+-------------------------+-------------------------+ - | Type | Min value | Max value | - +=============================+=========================+=========================+ - | MySQL's ``year`` | ``1901`` | ``2155`` | - +-----------------------------+-------------------------+-------------------------+ - | MySQL's ``date`` | ``1000-01-01`` | ``9999-12-31`` | - +-----------------------------+-------------------------+-------------------------+ - | Spark's ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | - +-----------------------------+-------------------------+-------------------------+ - | MySQL's ``datetime`` | ``1000-01-01 00:00:00`` | ``9999-12-31 23:59:59`` | - +-----------------------------+-------------------------+-------------------------+ - | MySQL's ``timestamp`` | ``1970-01-01 00:00:01`` | ``9999-12-31 23:59:59`` | - +-----------------------------+-------------------------+-------------------------+ - | MySQL's ``time`` | ``-838:59:59`` | ``838:59:59`` | - +-----------------------------+-------------------------+-------------------------+ - | Spark's ``TimestampType()`` | ``0001-01-01 00:00:00`` | ``9999-12-31 23:59:59`` | - +-----------------------------+-------------------------+-------------------------+ + +---------------+--------------------------------+--------------------------------+---------------------+--------------------------------+--------------------------------+ + | MySQL type | Min value | Max value | Spark type | Min value | Max value | + +===============+================================+================================+=====================+================================+================================+ + | ``year`` | ``1901`` | ``2155`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | + +---------------+--------------------------------+--------------------------------+ | | | + | ``date`` | ``1000-01-01`` | ``9999-12-31`` | | | | + 
+---------------+--------------------------------+--------------------------------+---------------------+--------------------------------+--------------------------------+ + | ``datetime`` | ``1000-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.499999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` | + +---------------+--------------------------------+--------------------------------+ | | | + | ``timestamp`` | ``1970-01-01 00:00:01.000000`` | ``9999-12-31 23:59:59.499999`` | | | | + +---------------+--------------------------------+--------------------------------+ | | | + | ``time`` | ``-838:59:59.000000`` | ``838:59:59.000000`` | | | | + +---------------+--------------------------------+--------------------------------+---------------------+--------------------------------+--------------------------------+ So Spark can read all the values from MySQL, but not all of values in Spark DataFrame can be written to MySQL. - See: - * `MySQL's year documentation `_ - * `MySQL's date, datetime & timestamp documentation `_ - * `MySQL's time documentation `_ - * `Spark's DateType documentation `_ - * `Spark's TimestampType documentation `_ + References: + * `MySQL year documentation `_ + * `MySQL date, datetime & timestamp documentation `_ + * `MySQL time documentation `_ + * `Spark DateType documentation `_ + * `Spark TimestampType documentation `_ .. [4] MySQL dialect generates DDL with MySQL type ``timestamp`` which is alias for ``timestamp(0)`` with precision up to seconds (``23:59:59``). @@ -244,44 +240,44 @@ String types Binary types ~~~~~~~~~~~~ -+---------------------------+------------------+--------------------+---------------------+ -| MySQL type (read) | Spark type | MySQL type (write) | MySQL type (create) | -+===========================+==================+====================+=====================+ -| ``binary`` | ``BinaryType()`` | ``blob`` | ``blob`` | -+---------------------------+ | | | -| ``binary(N)`` | | | | -+---------------------------+ | | | -| ``varbinary(N)`` | | | | -+---------------------------+ | | | -| ``mediumblob`` | | | | -+---------------------------+ | | | -| ``blob`` | | | | -+---------------------------+ | | | -| ``longblob`` | | | | -+---------------------------+------------------+--------------------+---------------------+ ++-------------------+------------------+--------------------+---------------------+ +| MySQL type (read) | Spark type | MySQL type (write) | MySQL type (create) | ++===================+==================+====================+=====================+ +| ``binary`` | ``BinaryType()`` | ``blob`` | ``blob`` | ++-------------------+ | | | +| ``binary(N)`` | | | | ++-------------------+ | | | +| ``varbinary(N)`` | | | | ++-------------------+ | | | +| ``mediumblob`` | | | | ++-------------------+ | | | +| ``blob`` | | | | ++-------------------+ | | | +| ``longblob`` | | | | ++-------------------+------------------+--------------------+---------------------+ Geometry types ~~~~~~~~~~~~~~ -+---------------------------+------------------+--------------------+---------------------+ -| MySQL type (read) | Spark type | MySQL type (write) | MySQL type (create) | -+===========================+==================+====================+=====================+ -| ``point`` | ``BinaryType()`` | ``blob`` | ``blob`` | -+---------------------------+ | | | -| ``linestring`` | | | | -+---------------------------+ | | | -| ``polygon`` | | | | -+---------------------------+ | | | -| ``geometry`` | | | | 
-+---------------------------+ | | | -| ``multipoint`` | | | | -+---------------------------+ | | | -| ``multilinestring`` | | | | -+---------------------------+ | | | -| ``multipolygon`` | | | | -+---------------------------+ | | | -| ``geometrycollection`` | | | | -+---------------------------+------------------+--------------------+---------------------+ ++------------------------+------------------+--------------------+---------------------+ +| MySQL type (read) | Spark type | MySQL type (write) | MySQL type (create) | ++========================+==================+====================+=====================+ +| ``point`` | ``BinaryType()`` | ``blob`` | ``blob`` | ++------------------------+ | | | +| ``linestring`` | | | | ++------------------------+ | | | +| ``polygon`` | | | | ++------------------------+ | | | +| ``geometry`` | | | | ++------------------------+ | | | +| ``multipoint`` | | | | ++------------------------+ | | | +| ``multilinestring`` | | | | ++------------------------+ | | | +| ``multipolygon`` | | | | ++------------------------+ | | | +| ``geometrycollection`` | | | | ++------------------------+------------------+--------------------+---------------------+ Explicit type cast ------------------ @@ -311,7 +307,7 @@ to convert column of any type to string representation, and then parse this colu columns=[ "id", "supported_column", - "CAST(unsupported_column AS TEXT) unsupported_column_str", + "CAST(unsupported_column AS text) unsupported_column_str", # or "JSON_OBJECT('key', value_column) json_column", ], @@ -362,7 +358,10 @@ Then you can parse this column on MySQL side - for example, by creating a view: .. code:: sql - SELECT id, json_column->"$[0]" AS array_item FROM target_tbl + SELECT + id, + array_column_json->"$[0]" AS array_item + FROM target_tbl Or by using `GENERATED column `_: @@ -372,7 +371,10 @@ Or by using `GENERATED column "$[0]")) VIRTUAL + -- or stired column + -- array_item_0 GENERATED ALWAYS AS (array_column_json->"$[0]")) STORED ) ``VIRTUAL`` column value is calculated on every table read. diff --git a/docs/connection/db_connection/oracle/prerequisites.rst b/docs/connection/db_connection/oracle/prerequisites.rst index c86bc393c..b5b64e437 100644 --- a/docs/connection/db_connection/oracle/prerequisites.rst +++ b/docs/connection/db_connection/oracle/prerequisites.rst @@ -88,14 +88,14 @@ used for creating a connection: -- as Oracle does not support specifying exact schema name GRANT CREATE ANY TABLE TO username; + -- allow read & write access to specific table + GRANT SELECT, INSERT ON someschema.mytable TO username; + -- only if if_exists="replace_entire_table" is used: -- allow dropping/truncating tables in any schema, -- as Oracle does not support specifying exact schema name GRANT DROP ANY TABLE TO username; - -- allow read & write access to specific table - GRANT SELECT, INSERT ON someschema.mytable TO username; - .. code-tab:: sql Read only -- allow user to log in diff --git a/docs/connection/db_connection/oracle/sql.rst b/docs/connection/db_connection/oracle/sql.rst index 3ea0832d1..6971de4ed 100644 --- a/docs/connection/db_connection/oracle/sql.rst +++ b/docs/connection/db_connection/oracle/sql.rst @@ -17,12 +17,10 @@ Syntax support Only queries with the following syntax are supported: -* ``SELECT ...`` -* ``WITH alias AS (...) SELECT ...`` - -Queries like ``SHOW ...`` are not supported. - -This method also does not support multiple queries in the same operation, like ``SET ...; SELECT ...;``. +* ✅︎ ``SELECT ... 
FROM ...`` +* ✅︎ ``WITH alias AS (...) SELECT ...`` +* ❌ ``SHOW ...`` +* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported Examples -------- diff --git a/docs/connection/db_connection/oracle/types.rst b/docs/connection/db_connection/oracle/types.rst index 907afe1c9..122f8119b 100644 --- a/docs/connection/db_connection/oracle/types.rst +++ b/docs/connection/db_connection/oracle/types.rst @@ -173,6 +173,25 @@ Temporal types | ``INTERVAL DAY TO SECOND`` | | | | +--------------------------------------------+------------------------------------+---------------------------------+---------------------------------+ +.. warning:: + + Note that types in Oracle and Spark have different value ranges: + + +---------------+------------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + | Oracle type | Min value | Max value | Spark type | Min value | Max value | + +===============+====================================+===================================+=====================+================================+================================+ + | ``date`` | ``-4712-01-01`` | ``9999-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | + +---------------+------------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + | ``timestamp`` | ``-4712-01-01 00:00:00.000000000`` | ``9999-12-31 23:59:59.999999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` | + +---------------+------------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+ + + So not all of values can be read from Oracle to Spark. + + References: + * `Oracle date, timestamp and intervals documentation `_ + * `Spark DateType documentation `_ + * `Spark TimestampType documentation `_ + .. [4] Oracle support timestamp up to nanoseconds precision (``23:59:59.999999999``), but Spark ``TimestampType()`` supports datetime up to microseconds precision (``23:59:59.999999``). diff --git a/docs/connection/db_connection/postgres/sql.rst b/docs/connection/db_connection/postgres/sql.rst index 1430381a1..e0316e2a4 100644 --- a/docs/connection/db_connection/postgres/sql.rst +++ b/docs/connection/db_connection/postgres/sql.rst @@ -17,12 +17,9 @@ Syntax support Only queries with the following syntax are supported: -* ``SELECT ...`` -* ``WITH alias AS (...) SELECT ...`` - -Queries like ``SHOW ...`` are not supported. - -This method also does not support multiple queries in the same operation, like ``SET ...; SELECT ...;``. +* ✅︎ ``SELECT ... FROM ...`` +* ✅︎ ``WITH alias AS (...) SELECT ...`` +* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported Examples -------- diff --git a/docs/connection/db_connection/postgres/types.rst b/docs/connection/db_connection/postgres/types.rst index 224d4ee4d..5424289e4 100644 --- a/docs/connection/db_connection/postgres/types.rst +++ b/docs/connection/db_connection/postgres/types.rst @@ -38,8 +38,7 @@ This is how Postgres connector performs this: it will be populated by Postgres. .. [3] - This is true only if conversion of ``T_df -> T_tbl`` target or source types is ``text`` or ``StringType()``. So it is possible to - insert data back to column with original type, if data format is matching the column type. 
+ This is true only if either DataFrame column is a ``StringType()``, or target column is ``text`` type. But other types cannot be silently converted, like ``bytea -> bit(N)``. This requires explicit casting, see `Manual conversion to string`_. @@ -202,6 +201,27 @@ Temporal types | ``tstzrange`` | | | | +------------------------------------+------------------------------+-----------------------+-------------------------+ +.. warning:: + + Note that types in Postgres and Spark have different value ranges: + + +---------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + | Postgres type | Min value | Max value | Spark type | Min value | Max value | + +===============+=================================+==================================+=====================+================================+================================+ + | ``date`` | ``-4713-01-01`` | ``5874897-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` | + +---------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + | ``timestamp`` | ``-4713-01-01 00:00:00.000000`` | ``294276-12-31 23:59:59.999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` | + +---------------+---------------------------------+----------------------------------+ | | | + | ``time`` | ``00:00:00.000000`` | ``24:00:00.000000`` | | | | + +---------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+ + + So not all of values can be read from Postgres to Spark. + + References: + * `Postgres date/time types documentation `_ + * `Spark DateType documentation `_ + * `Spark TimestampType documentation `_ + .. [6] ``time`` type is the same as ``timestamp`` with date ``1970-01-01``. So instead of reading data from Postgres like ``23:59:59`` diff --git a/onetl/connection/db_connection/mssql/connection.py b/onetl/connection/db_connection/mssql/connection.py index c2716ce72..608ef7521 100644 --- a/onetl/connection/db_connection/mssql/connection.py +++ b/onetl/connection/db_connection/mssql/connection.py @@ -29,30 +29,9 @@ class MSSQL(JDBCConnection): (`official MSSQL JDBC driver `_). - .. dropdown:: Version compatibility - - * SQL Server versions: 2014 - 2022 - * Spark versions: 2.3.x - 3.5.x - * Java versions: 8 - 20 - - See `official documentation `_ - and `official compatibility matrix `_. - .. warning:: - To use MSSQL connector you should have PySpark installed (or injected to ``sys.path``) - BEFORE creating the connector instance. - - You can install PySpark as follows: - - .. code:: bash - - pip install onetl[spark] # latest PySpark version - - # or - pip install onetl pyspark=3.5.0 # pass specific PySpark version - - See :ref:`install-spark` installation instruction for more details. + Before using this connector please take into account :ref:`mssql-prerequisites` Parameters ----------