Commit

[DOP-13252] Improve MSSQL documentation

dolfinus committed Mar 13, 2024
1 parent 02ba65a commit 28815fd
Showing 18 changed files with 905 additions and 133 deletions.
4 changes: 2 additions & 2 deletions docs/connection/db_connection/clickhouse/sql.rst
@@ -17,8 +17,8 @@ Syntax support

Only queries with the following syntax are supported:

* ``SELECT ...``
* ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``

Queries like ``SHOW ...`` are not supported.

68 changes: 56 additions & 12 deletions docs/connection/db_connection/clickhouse/types.rst
@@ -55,7 +55,7 @@ But Spark does not have specific dialect for Clickhouse, so Generic JDBC dialect
Generic dialect uses ANSI SQL type names while creating tables in the target database, not database-specific types.

In some cases this may lead to using the wrong column type. For example, Spark creates a column of type ``TIMESTAMP``
which corresponds to Clickhouse's type ``DateTime32`` (precision up to seconds)
which corresponds to Clickhouse type ``DateTime32`` (precision up to seconds)
instead of more precise ``DateTime64`` (precision up to nanoseconds).
This may lead to unintended precision loss, or sometimes data cannot be written to the created table at all.
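One way to avoid this is to create the target table explicitly with the desired ``DateTime64`` type before writing, instead of letting Spark generate it. A minimal sketch, assuming the ``Clickhouse.execute`` method (analogous to ``MSSQL.execute`` described elsewhere in this commit) and a hypothetical ``default.target_tbl`` table:

.. code-block:: python

    from onetl.connection import Clickhouse

    clickhouse = Clickhouse(...)

    # create the table with DateTime64(6) explicitly,
    # instead of relying on the generic TIMESTAMP -> DateTime32 mapping
    clickhouse.execute(
        """
        CREATE TABLE default.target_tbl (
            id Int32,
            created_at DateTime64(6)
        )
        ENGINE = MergeTree()
        ORDER BY id
        """,
    )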

@@ -192,7 +192,10 @@ Numeric types
Temporal types
~~~~~~~~~~~~~~

Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
Notes:
* Datetime with timezone has the same precision as datetime without timezone
* ``DateTime`` is an alias for ``DateTime32``
* ``TIMESTAMP`` is an alias for ``DateTime32``, but ``TIMESTAMP(N)`` is an alias for ``DateTime64(N)``

+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
@@ -238,6 +241,31 @@ Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
| ``IntervalYear`` | | | |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+

.. warning::

Note that types in Clickhouse and Spark have different value ranges:

+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| Clickhouse type | Min value | Max value | Spark type | Min value | Max value |
+========================+===================================+===================================+=====================+================================+================================+
| ``Date`` | ``1970-01-01`` | ``2149-06-06`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``DateTime32`` | ``1970-01-01 00:00:00`` | ``2106-02-07 06:28:15`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(N=0..8)`` | ``1900-01-01 00:00:00.00000000`` | ``2299-12-31 23:59:59.99999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(N=9)`` | ``1900-01-01 00:00:00.000000000`` | ``2262-04-11 23:47:16.999999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values in a Spark DataFrame can be written to Clickhouse.

References:
* `Clickhouse Date documentation <https://clickhouse.com/docs/en/sql-reference/data-types/date>`_
* `Clickhouse Datetime32 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime>`_
* `Clickhouse Datetime64 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime64>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_
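If the DataFrame may contain timestamps outside these bounds, a hedged option is to filter them out (or clamp them) before writing. A sketch with a hypothetical ``df`` and ``updated_at`` column:

.. code-block:: python

    from pyspark.sql import functions as F

    # keep only rows whose timestamps fit into the Clickhouse DateTime64 range,
    # so the write does not fail on out-of-range values
    safe_df = df.where(
        (F.col("updated_at") >= F.lit("1900-01-01 00:00:00"))
        & (F.col("updated_at") <= F.lit("2299-12-31 23:59:59"))
    )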

.. [4]
Clickhouse supports datetime up to nanosecond precision (``23:59:59.999999999``),
but Spark ``TimestampType()`` supports datetime only up to microsecond precision (``23:59:59.999999``).
@@ -257,17 +285,17 @@ String types
+--------------------------------------+------------------+------------------------+--------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
+======================================+==================+========================+==========================+
| ``IPv4`` | ``StringType()`` | ``String`` | ``String`` |
| ``FixedString(N)`` | ``StringType()`` | ``String`` | ``String`` |
+--------------------------------------+ | | |
| ``IPv6`` | | | |
| ``String`` | | | |
+--------------------------------------+ | | |
| ``Enum8`` | | | |
+--------------------------------------+ | | |
| ``Enum16`` | | | |
+--------------------------------------+ | | |
| ``FixedString(N)`` | | | |
| ``IPv4`` | | | |
+--------------------------------------+ | | |
| ``String`` | | | |
| ``IPv6`` | | | |
+--------------------------------------+------------------+ | |
| ``-`` | ``BinaryType()`` | | |
+--------------------------------------+------------------+------------------------+--------------------------+
@@ -352,7 +380,7 @@ and write it as ``String`` column in Clickhouse:
array_column_json String,
)
ENGINE = MergeTree()
ORDER BY time
ORDER BY id
""",
)
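On the Spark side, the array column can be serialized into a JSON string before writing — a minimal sketch, assuming a DataFrame ``df`` with an ``array_column`` of array type:

.. code-block:: python

    from pyspark.sql.functions import to_json

    # serialize the array into a JSON string matching the
    # array_column_json String column of the target table
    df = df.withColumn("array_column_json", to_json("array_column")).drop("array_column")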
@@ -369,18 +397,34 @@ Then you can parse this column on Clickhouse side - for example, by creating a view:

.. code:: sql
SELECT id, JSONExtract(json_column, 'Array(String)') FROM target_tbl
SELECT
id,
JSONExtract(array_column_json, 'Array(String)') AS array_column
FROM target_tbl
You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_
or `MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ columns
to avoid writing such an expression in every ``SELECT`` clause all the time:

You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_ columns
to avoid writing such expression in every ``SELECT`` clause all the time.
.. code-block:: sql

    CREATE TABLE default.target_tbl (
        id Int32,
        array_column_json String,
        -- computed column
        array_column Array(String) ALIAS JSONExtract(array_column_json, 'Array(String)')
        -- or materialized column
        -- array_column Array(String) MATERIALIZED JSONExtract(array_column_json, 'Array(String)')
    )
    ENGINE = MergeTree()
    ORDER BY id
Downsides:

* Using ``SELECT JSONExtract(...)`` or an ``ALIAS`` column can be expensive, because the value is calculated on every row access. This can be especially harmful if such a column is used in a ``WHERE`` clause.
* Both ``ALIAS`` columns are not included in ``SELECT *`` clause, they should be added explicitly: ``SELECT *, calculated_column FROM table``.
* ``ALIAS`` and ``MATERIALIZED`` columns are not included in the ``SELECT *`` clause; they should be added explicitly: ``SELECT *, calculated_column FROM table``.

.. warning::

`MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ and
`EPHEMERAL <https://clickhouse.com/docs/en/sql-reference/statements/create/table#ephemeral>`_ columns are not supported by Spark
because they cannot be selected to determine target column type.
25 changes: 24 additions & 1 deletion docs/connection/db_connection/greenplum/types.rst
@@ -102,7 +102,9 @@ See Greenplum `CREATE TABLE <https://docs.vmware.com/en/VMware-Greenplum/7/green
Supported types
---------------

See `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_.
See:
* `official connector documentation <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_
* `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_

Numeric types
~~~~~~~~~~~~~
@@ -181,6 +183,27 @@ Temporal types
| ``tstzrange`` | | | |
+------------------------------------+-------------------------+-----------------------+-------------------------+

.. warning::

Note that types in Greenplum and Spark have different value ranges:

+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| Greenplum type | Min value | Max value | Spark type | Min value | Max value |
+================+=================================+==================================+=====================+================================+================================+
| ``date`` | ``-4713-01-01`` | ``5874897-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``timestamp`` | ``-4713-01-01 00:00:00.000000`` | ``294276-12-31 23:59:59.999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+----------------+---------------------------------+----------------------------------+ | | |
| ``time`` | ``00:00:00.000000`` | ``24:00:00.000000`` | | | |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values can be read from Greenplum into Spark.

References:
* `Greenplum types documentation <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_
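When reading such data, one hedged workaround is to restrict the date range on the Greenplum side, for example via the ``where`` clause of ``DBReader`` (hypothetical table and column names, and an existing ``greenplum`` connection object):

.. code-block:: python

    from onetl.db import DBReader

    # skip rows whose dates do not fit into Spark's supported range
    reader = DBReader(
        connection=greenplum,
        source="schema.table",
        where="business_dt BETWEEN '0001-01-01' AND '9999-12-31'",
    )
    df = reader.run()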

.. [3]
``time`` type is the same as ``timestamp`` with date ``1970-01-01``. So instead of reading data from Greenplum like ``23:59:59``
90 changes: 90 additions & 0 deletions docs/connection/db_connection/mssql/execute.rst
@@ -3,6 +3,96 @@
Executing statements in MSSQL
=============================

How to
------

There are 2 ways to execute a statement in MSSQL:

Use :obj:`MSSQL.fetch <onetl.connection.db_connection.mssql.connection.MSSQL.fetch>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute a ``SELECT`` query which returns a **small number of rows**, like reading
MSSQL config values, or reading data from some reference table.

The method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

A connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT func(arg1, arg2)`` - call function
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    df = mssql.fetch(
        "SELECT value FROM some.reference_table WHERE key = 'some_constant'",
        options=MSSQL.JDBCOptions(query_timeout=10),
    )
    mssql.close()
    value = df.collect()[0][0]  # get value from the first row and first column
Use :obj:`MSSQL.execute <onetl.connection.db_connection.mssql.connection.MSSQL.execute>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute DDL and DML operations. Each method call runs the operation in a separate transaction, and then commits it.

The method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

A connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``CREATE TABLE ...``, ``CREATE VIEW ...``
* ✅︎ ``ALTER ...``
* ✅︎ ``INSERT INTO ... SELECT ...``
* ✅︎ ``DROP TABLE ...``, ``DROP VIEW ...``, and so on
* ✅︎ ``EXEC procedure arg1, arg2`` or ``{call procedure(arg1, arg2)}`` - special syntax for calling a procedure
* ✅︎ ``DECLARE ... BEGIN ... END`` - execute a T-SQL block
* ✅︎ other statements not mentioned here
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    with mssql:
        mssql.execute("DROP TABLE schema.table")
        mssql.execute(
            """
            CREATE TABLE schema.table (
                id bigint IDENTITY(1,1),
                key VARCHAR(4000),
                value NUMERIC
            )
            """,
            options=MSSQL.JDBCOptions(query_timeout=10),
        )
References
----------

.. currentmodule:: onetl.connection.db_connection.mssql.connection

.. automethod:: MSSQL.fetch
10 changes: 9 additions & 1 deletion docs/connection/db_connection/mssql/index.rst
@@ -1,18 +1,26 @@
.. _mssql:

MSSQL
=====
======

.. toctree::
:maxdepth: 1
:caption: Connection

prerequisites
connection

.. toctree::
:maxdepth: 1
:caption: Operations

read
sql
write
execute

.. toctree::
:maxdepth: 1
:caption: Troubleshooting

types
77 changes: 77 additions & 0 deletions docs/connection/db_connection/mssql/prerequisites.rst
@@ -0,0 +1,77 @@
.. _mssql-prerequisites:

Prerequisites
=============

Version Compatibility
---------------------

* SQL Server versions: 2014 - 2022
* Spark versions: 2.3.x - 3.5.x
* Java versions: 8 - 20

See `official documentation <https://learn.microsoft.com/en-us/sql/connect/jdbc/system-requirements-for-the-jdbc-driver>`_
and `official compatibility matrix <https://learn.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server-support-matrix>`_.

Installing PySpark
------------------

To use the MSSQL connector you should have PySpark installed (or injected into ``sys.path``)
BEFORE creating the connector instance.

See :ref:`install-spark` installation instructions for more details.
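For example, a minimal sketch of a Spark session that pulls the MSSQL JDBC driver (assuming the ``MSSQL.get_packages()`` helper is available in your onETL version):

.. code-block:: python

    from pyspark.sql import SparkSession

    from onetl.connection import MSSQL

    # MSSQL.get_packages() is assumed to return the Maven coordinates of the JDBC driver
    spark = (
        SparkSession.builder.appName("spark-app-name")
        .config("spark.jars.packages", ",".join(MSSQL.get_packages()))
        .getOrCreate()
    )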

Connecting to MSSQL
--------------------

Connection port
~~~~~~~~~~~~~~~

Connection is usually performed to port 1433. Port may differ for different MSSQL instances.
Please ask your MSSQL administrator to provide the required information.

Connection host
~~~~~~~~~~~~~~~

It is possible to connect to MSSQL by using either the DNS name of the host or its IP address.

If you're using a MSSQL cluster, it is currently possible to connect only to **one specific node**.
Connecting to multiple nodes to perform load balancing, as well as automatic failover to a new master/replica, is not supported.
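A minimal connection sketch (hostname, credentials and database names are placeholders; the full parameter set is described on the Connection page):

.. code-block:: python

    from onetl.connection import MSSQL

    # connect to one specific node, using its DNS name and port
    mssql = MSSQL(
        host="mssql.domain.com",
        port=1433,
        user="username",
        password="***",
        database="mydb",
        spark=spark,
    )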

Required grants
~~~~~~~~~~~~~~~

Ask your MSSQL cluster administrator to set the following grants for the user
used for creating a connection:

.. tabs::

.. code-tab:: sql Read + Write (schema is owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow read & write access to specific table
GRANT SELECT, INSERT ON username.mytable TO username;

-- only if if_exists="replace_entire_table" is used:
-- allow dropping/truncating tables in any schema
GRANT ALTER ON username.mytable TO username;

.. code-tab:: sql Read + Write (schema is not owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow managing tables in specific schema, and inserting data to tables
GRANT ALTER, SELECT, INSERT ON SCHEMA::someschema TO username;

.. code-tab:: sql Read only

-- allow read access to specific table
GRANT SELECT ON someschema.mytable TO username;

More details can be found in the official documentation:
* `GRANT ON DATABASE <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-database-permissions-transact-sql>`_
* `GRANT ON OBJECT <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-object-permissions-transact-sql>`_
* `GRANT ON SCHEMA <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-schema-permissions-transact-sql>`_