Commit

[DOP-13252] Improve MSSQL documentation

dolfinus committed Mar 13, 2024
1 parent 02ba65a commit 28815fd
Showing 18 changed files with 905 additions and 133 deletions.
4 changes: 2 additions & 2 deletions docs/connection/db_connection/clickhouse/sql.rst
@@ -17,8 +17,8 @@ Syntax support

Only queries with the following syntax are supported:

* ``SELECT ...``
* ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``

Queries like ``SHOW ...`` are not supported.

68 changes: 56 additions & 12 deletions docs/connection/db_connection/clickhouse/types.rst
@@ -55,7 +55,7 @@ But Spark does not have specific dialect for Clickhouse, so Generic JDBC dialect
Generic dialect uses ANSI SQL type names while creating tables in the target database, not database-specific types.

In some cases this may lead to using the wrong column type. For example, Spark creates a column of type ``TIMESTAMP``
which corresponds to Clickhouse's type ``DateTime32`` (precision up to seconds)
which corresponds to Clickhouse type ``DateTime32`` (precision up to seconds)
instead of more precise ``DateTime64`` (precision up to nanoseconds).
This may lead to unintended precision loss, or sometimes data cannot be written to the created table at all.
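One way to avoid this is to create the target table explicitly with the desired ``DateTime64`` type before writing, instead of letting Spark generate it. A minimal sketch, assuming the ``Clickhouse.execute`` method (analogous to ``MSSQL.execute`` described elsewhere in this commit) and a hypothetical ``default.target_tbl`` table:

.. code-block:: python

    from onetl.connection import Clickhouse

    clickhouse = Clickhouse(...)

    # create the table with DateTime64(6) explicitly,
    # instead of relying on the generic TIMESTAMP -> DateTime32 mapping
    clickhouse.execute(
        """
        CREATE TABLE default.target_tbl (
            id Int32,
            created_at DateTime64(6)
        )
        ENGINE = MergeTree()
        ORDER BY id
        """,
    )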

@@ -192,7 +192,10 @@ Numeric types
Temporal types
~~~~~~~~~~~~~~

Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
Notes:
* Datetime with timezone has the same precision as datetime without timezone
* ``DateTime`` is an alias for ``DateTime32``
* ``TIMESTAMP`` is an alias for ``DateTime32``, but ``TIMESTAMP(N)`` is an alias for ``DateTime64(N)``

+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
@@ -238,6 +241,31 @@ Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
| ``IntervalYear`` | | | |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+

.. warning::

Note that types in Clickhouse and Spark have different value ranges:

+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| Clickhouse type | Min value | Max value | Spark type | Min value | Max value |
+========================+===================================+===================================+=====================+================================+================================+
| ``Date`` | ``1970-01-01`` | ``2149-06-06`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``DateTime32`` | ``1970-01-01 00:00:00`` | ``2106-02-07 06:28:15`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(N=0..8)`` | ``1900-01-01 00:00:00.00000000`` | ``2299-12-31 23:59:59.99999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(N=9)`` | ``1900-01-01 00:00:00.000000000`` | ``2262-04-11 23:47:16.999999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values in a Spark DataFrame can be written to Clickhouse.

References:
* `Clickhouse Date documentation <https://clickhouse.com/docs/en/sql-reference/data-types/date>`_
* `Clickhouse Datetime32 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime>`_
* `Clickhouse Datetime64 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime64>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_
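If the DataFrame may contain timestamps outside these bounds, a hedged option is to filter them out (or clamp them) before writing. A sketch with a hypothetical ``df`` and ``updated_at`` column:

.. code-block:: python

    from pyspark.sql import functions as F

    # keep only rows whose timestamps fit into the Clickhouse DateTime64 range,
    # so the write does not fail on out-of-range values
    safe_df = df.where(
        (F.col("updated_at") >= F.lit("1900-01-01 00:00:00"))
        & (F.col("updated_at") <= F.lit("2299-12-31 23:59:59"))
    )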

.. [4]
Clickhouse supports datetime up to nanosecond precision (``23:59:59.999999999``),
but Spark ``TimestampType()`` supports datetime only up to microsecond precision (``23:59:59.999999``).
@@ -257,17 +285,17 @@ String types
+--------------------------------------+------------------+------------------------+--------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
+======================================+==================+========================+==========================+
| ``IPv4`` | ``StringType()`` | ``String`` | ``String`` |
| ``FixedString(N)`` | ``StringType()`` | ``String`` | ``String`` |
+--------------------------------------+ | | |
| ``IPv6`` | | | |
| ``String`` | | | |
+--------------------------------------+ | | |
| ``Enum8`` | | | |
+--------------------------------------+ | | |
| ``Enum16`` | | | |
+--------------------------------------+ | | |
| ``FixedString(N)`` | | | |
| ``IPv4`` | | | |
+--------------------------------------+ | | |
| ``String`` | | | |
| ``IPv6`` | | | |
+--------------------------------------+------------------+ | |
| ``-`` | ``BinaryType()`` | | |
+--------------------------------------+------------------+------------------------+--------------------------+
@@ -352,7 +380,7 @@ and write it as ``String`` column in Clickhouse:
array_column_json String,
)
ENGINE = MergeTree()
ORDER BY time
ORDER BY id
""",
)
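On the Spark side, the array column can be serialized into a JSON string before writing — a minimal sketch, assuming a DataFrame ``df`` with an ``array_column`` of array type:

.. code-block:: python

    from pyspark.sql.functions import to_json

    # serialize the array into a JSON string matching the
    # array_column_json String column of the target table
    df = df.withColumn("array_column_json", to_json("array_column")).drop("array_column")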
@@ -369,18 +397,34 @@ Then you can parse this column on Clickhouse side - for example, by creating a view:

.. code:: sql
SELECT id, JSONExtract(json_column, 'Array(String)') FROM target_tbl
SELECT
id,
JSONExtract(array_column_json, 'Array(String)') AS array_column
FROM target_tbl
You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_
or `MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ columns
to avoid writing such an expression in every ``SELECT`` clause all the time:

You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_ columns
to avoid writing such expression in every ``SELECT`` clause all the time.
.. code-block:: sql

    CREATE TABLE default.target_tbl (
        id Int32,
        array_column_json String,
        -- computed column
        array_column Array(String) ALIAS JSONExtract(array_column_json, 'Array(String)')
        -- or materialized column
        -- array_column Array(String) MATERIALIZED JSONExtract(array_column_json, 'Array(String)')
    )
    ENGINE = MergeTree()
    ORDER BY id
Downsides:

* Using ``SELECT JSONExtract(...)`` or an ``ALIAS`` column can be expensive, because the value is calculated on every row access. This can be especially harmful if such a column is used in a ``WHERE`` clause.
* Both ``ALIAS`` columns are not included in ``SELECT *`` clause, they should be added explicitly: ``SELECT *, calculated_column FROM table``.
* ``ALIAS`` and ``MATERIALIZED`` columns are not included in the ``SELECT *`` clause; they should be added explicitly: ``SELECT *, calculated_column FROM table``.

.. warning::

`MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ and
`EPHEMERAL <https://clickhouse.com/docs/en/sql-reference/statements/create/table#ephemeral>`_ columns are not supported by Spark
because they cannot be selected to determine target column type.
25 changes: 24 additions & 1 deletion docs/connection/db_connection/greenplum/types.rst
@@ -102,7 +102,9 @@ See Greenplum `CREATE TABLE <https://docs.vmware.com/en/VMware-Greenplum/7/green
Supported types
---------------

See `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_.
See:
* `official connector documentation <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_
* `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_

Numeric types
~~~~~~~~~~~~~
@@ -181,6 +183,27 @@ Temporal types
| ``tstzrange`` | | | |
+------------------------------------+-------------------------+-----------------------+-------------------------+

.. warning::

Note that types in Greenplum and Spark have different value ranges:

+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| Greenplum type | Min value | Max value | Spark type | Min value | Max value |
+================+=================================+==================================+=====================+================================+================================+
| ``date`` | ``-4713-01-01`` | ``5874897-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``timestamp`` | ``-4713-01-01 00:00:00.000000`` | ``294276-12-31 23:59:59.999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+----------------+---------------------------------+----------------------------------+ | | |
| ``time`` | ``00:00:00.000000`` | ``24:00:00.000000`` | | | |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values can be read from Greenplum into Spark.

References:
* `Greenplum types documentation <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_
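When reading such data, one hedged workaround is to restrict the date range on the Greenplum side, for example via the ``where`` clause of ``DBReader`` (hypothetical table and column names, and an existing ``greenplum`` connection object):

.. code-block:: python

    from onetl.db import DBReader

    # skip rows whose dates do not fit into Spark's supported range
    reader = DBReader(
        connection=greenplum,
        source="schema.table",
        where="business_dt BETWEEN '0001-01-01' AND '9999-12-31'",
    )
    df = reader.run()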

.. [3]
``time`` type is the same as ``timestamp`` with date ``1970-01-01``. So instead of reading data from Greenplum like ``23:59:59``
90 changes: 90 additions & 0 deletions docs/connection/db_connection/mssql/execute.rst
@@ -3,6 +3,96 @@
Executing statements in MSSQL
=============================

How to
------

There are 2 ways to execute a statement in MSSQL:

Use :obj:`MSSQL.fetch <onetl.connection.db_connection.mssql.connection.MSSQL.fetch>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute a ``SELECT`` query which returns a **small number of rows**, like reading
MSSQL config values, or reading data from some reference table.

The method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

A connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT func(arg1, arg2)`` - call function
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    df = mssql.fetch(
        "SELECT value FROM some.reference_table WHERE key = 'some_constant'",
        options=MSSQL.JDBCOptions(query_timeout=10),
    )
    mssql.close()
    value = df.collect()[0][0]  # get value from the first row and first column
Use :obj:`MSSQL.execute <onetl.connection.db_connection.mssql.connection.MSSQL.execute>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute DDL and DML operations. Each method call runs the operation in a separate transaction, and then commits it.

The method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

A connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``CREATE TABLE ...``, ``CREATE VIEW ...``
* ✅︎ ``ALTER ...``
* ✅︎ ``INSERT INTO ... SELECT ...``
* ✅︎ ``DROP TABLE ...``, ``DROP VIEW ...``, and so on
* ✅︎ ``EXEC procedure arg1, arg2`` or ``{call procedure(arg1, arg2)}`` - special syntax for calling a procedure
* ✅︎ ``DECLARE ... BEGIN ... END`` - execute a T-SQL block
* ✅︎ other statements not mentioned here
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    with mssql:
        mssql.execute("DROP TABLE schema.table")
        mssql.execute(
            """
            CREATE TABLE schema.table (
                id bigint IDENTITY(1,1),
                key VARCHAR(4000),
                value NUMERIC
            )
            """,
            options=MSSQL.JDBCOptions(query_timeout=10),
        )
References
----------

.. currentmodule:: onetl.connection.db_connection.mssql.connection

.. automethod:: MSSQL.fetch
10 changes: 9 additions & 1 deletion docs/connection/db_connection/mssql/index.rst
@@ -1,18 +1,26 @@
.. _mssql:

MSSQL
=====
======

.. toctree::
:maxdepth: 1
:caption: Connection

prerequisites
connection

.. toctree::
:maxdepth: 1
:caption: Operations

read
sql
write
execute

.. toctree::
:maxdepth: 1
:caption: Troubleshooting

types
77 changes: 77 additions & 0 deletions docs/connection/db_connection/mssql/prerequisites.rst
@@ -0,0 +1,77 @@
.. _mssql-prerequisites:

Prerequisites
=============

Version Compatibility
---------------------

* SQL Server versions: 2014 - 2022
* Spark versions: 2.3.x - 3.5.x
* Java versions: 8 - 20

See `official documentation <https://learn.microsoft.com/en-us/sql/connect/jdbc/system-requirements-for-the-jdbc-driver>`_
and `official compatibility matrix <https://learn.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server-support-matrix>`_.

Installing PySpark
------------------

To use the MSSQL connector you should have PySpark installed (or injected into ``sys.path``)
BEFORE creating the connector instance.

See :ref:`install-spark` installation instructions for more details.
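For example, a minimal sketch of a Spark session that pulls the MSSQL JDBC driver (assuming the ``MSSQL.get_packages()`` helper is available in your onETL version):

.. code-block:: python

    from pyspark.sql import SparkSession

    from onetl.connection import MSSQL

    # MSSQL.get_packages() is assumed to return the Maven coordinates of the JDBC driver
    spark = (
        SparkSession.builder.appName("spark-app-name")
        .config("spark.jars.packages", ",".join(MSSQL.get_packages()))
        .getOrCreate()
    )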

Connecting to MSSQL
--------------------

Connection port
~~~~~~~~~~~~~~~

Connection is usually performed to port 1433. Port may differ for different MSSQL instances.
Please ask your MSSQL administrator to provide the required information.

Connection host
~~~~~~~~~~~~~~~

It is possible to connect to MSSQL by using either the DNS name of the host or its IP address.

If you're using a MSSQL cluster, it is currently possible to connect only to **one specific node**.
Connecting to multiple nodes to perform load balancing, as well as automatic failover to a new master/replica, is not supported.
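A minimal connection sketch (hostname, credentials and database names are placeholders; the full parameter set is described on the Connection page):

.. code-block:: python

    from onetl.connection import MSSQL

    # connect to one specific node, using its DNS name and port
    mssql = MSSQL(
        host="mssql.domain.com",
        port=1433,
        user="username",
        password="***",
        database="mydb",
        spark=spark,
    )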

Required grants
~~~~~~~~~~~~~~~

Ask your MSSQL cluster administrator to set the following grants for the user
used for creating a connection:

.. tabs::

.. code-tab:: sql Read + Write (schema is owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow read & write access to specific table
GRANT SELECT, INSERT ON username.mytable TO username;

-- only if if_exists="replace_entire_table" is used:
-- allow dropping/truncating tables in any schema
GRANT ALTER ON username.mytable TO username;

.. code-tab:: sql Read + Write (schema is not owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow managing tables in specific schema, and inserting data to tables
GRANT ALTER, SELECT, INSERT ON SCHEMA::someschema TO username;

.. code-tab:: sql Read only

-- allow read access to specific table
GRANT SELECT ON someschema.mytable TO username;

More details can be found in the official documentation:
* `GRANT ON DATABASE <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-database-permissions-transact-sql>`_
* `GRANT ON OBJECT <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-object-permissions-transact-sql>`_
* `GRANT ON SCHEMA <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-schema-permissions-transact-sql>`_