[DOP-13252] Improve MSSQL documentation #235

Merged (1 commit, Mar 18, 2024)
5 changes: 5 additions & 0 deletions docs/changelog/next_release/235.improvement.rst
@@ -0,0 +1,5 @@
Improve MSSQL documentation:
* Add "Types" section describing mapping between MSSQL and Spark types
* Add "Prerequisites" section describing different aspects of connecting to MSSQL
* Separate documentation of ``DBReader`` and ``MSSQL.sql``
* Add examples for ``MSSQL.fetch`` and ``MSSQL.execute``
4 changes: 2 additions & 2 deletions docs/connection/db_connection/clickhouse/sql.rst
@@ -17,8 +17,8 @@ Syntax support

Only queries with the following syntax are supported:

* ``SELECT ...``
* ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``

Queries like ``SHOW ...`` are not supported.

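For example, such a query could be run via ``Clickhouse.sql`` like this (a minimal sketch; the ``clickhouse`` connection object and table names are illustrative):

.. code-block:: python

    # assumes `clickhouse` is an already created Clickhouse connection
    df = clickhouse.sql(
        """
        SELECT id, value
        FROM schema.some_table
        WHERE key = 'some_constant'
        """
    )
    df.show()
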
Expand Down
68 changes: 56 additions & 12 deletions docs/connection/db_connection/clickhouse/types.rst
@@ -55,7 +55,7 @@ But Spark does not have specific dialect for Clickhouse, so Generic JDBC dialect
Generic dialect uses ANSI SQL type names when creating tables in the target database, not database-specific types.

In some cases this may lead to using a wrong column type. For example, Spark creates a column of type ``TIMESTAMP``
which corresponds to Clickhouse's type ``DateTime32`` (precision up to seconds)
which corresponds to Clickhouse type ``DateTime32`` (precision up to seconds)
instead of more precise ``DateTime64`` (precision up to nanoseconds).
This may lead to incidental precision loss, or sometimes data cannot be written to the created table at all.

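One way to work around this is to create the target table manually with the precise types and only append data to it. Below is a sketch under the assumption that the table is created via ``Clickhouse.execute``; the ``DBWriter`` import path and option names may differ between onETL versions:

.. code-block:: python

    from onetl.connection import Clickhouse
    from onetl.db import DBWriter  # import path may differ between onETL versions

    clickhouse = Clickhouse(...)

    # create the target table explicitly, using the precise DateTime64 type
    clickhouse.execute(
        """
        CREATE TABLE default.target_tbl (
            id Int32,
            business_dt DateTime64(6)
        )
        ENGINE = MergeTree()
        ORDER BY id
        """
    )

    # only append data to the existing table, so Spark does not create it with generic types
    writer = DBWriter(
        connection=clickhouse,
        target="default.target_tbl",
        options=Clickhouse.WriteOptions(if_exists="append"),
    )
    writer.run(df)
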
@@ -192,7 +192,10 @@ Numeric types
Temporal types
~~~~~~~~~~~~~~

Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
Notes:
* Datetime with timezone has the same precision as without timezone
* ``DateTime`` is an alias for ``DateTime32``
* ``TIMESTAMP`` is an alias for ``DateTime32``, but ``TIMESTAMP(N)`` is an alias for ``DateTime64(N)``

+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
@@ -238,6 +241,31 @@ Note: ``DateTime(P, TZ)`` has the same precision as ``DateTime(P)``.
| ``IntervalYear`` | | | |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+

.. warning::

Note that types in Clickhouse and Spark have different value ranges:

+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| Clickhouse type | Min value | Max value | Spark type | Min value | Max value |
+========================+===================================+===================================+=====================+================================+================================+
| ``Date`` | ``1970-01-01`` | ``2149-06-06`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``DateTime32`` | ``1970-01-01 00:00:00`` | ``2106-02-07 06:28:15`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(P=0..8)`` | ``1900-01-01 00:00:00.00000000`` | ``2299-12-31 23:59:59.99999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+ | | |
| ``DateTime64(P=9)`` | ``1900-01-01 00:00:00.000000000`` | ``2262-04-11 23:47:16.999999999`` | | | |
+------------------------+-----------------------------------+-----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values from a Spark DataFrame can be written to Clickhouse.

References:
* `Clickhouse Date documentation <https://clickhouse.com/docs/en/sql-reference/data-types/date>`_
* `Clickhouse Datetime32 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime>`_
* `Clickhouse Datetime64 documentation <https://clickhouse.com/docs/en/sql-reference/data-types/datetime64>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_

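One hedged way to guard against this on the Spark side is to drop out-of-range values before writing (a sketch; ``business_dt`` is an illustrative column name):

.. code-block:: python

    from pyspark.sql import functions as F

    # keep only rows that fit into the Clickhouse DateTime64(P=0..8) range
    df_in_range = df.where(
        (F.col("business_dt") >= F.lit("1900-01-01 00:00:00"))
        & (F.col("business_dt") <= F.lit("2299-12-31 23:59:59"))
    )
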
.. [4]
Clickhouse supports datetime up to nanosecond precision (``23:59:59.999999999``),
but Spark ``TimestampType()`` supports datetime only up to microsecond precision (``23:59:59.999999``).
@@ -257,17 +285,17 @@ String types
+--------------------------------------+------------------+------------------------+--------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
+======================================+==================+========================+==========================+
| ``IPv4`` | ``StringType()`` | ``String`` | ``String`` |
| ``FixedString(N)`` | ``StringType()`` | ``String`` | ``String`` |
+--------------------------------------+ | | |
| ``IPv6`` | | | |
| ``String`` | | | |
+--------------------------------------+ | | |
| ``Enum8`` | | | |
+--------------------------------------+ | | |
| ``Enum16`` | | | |
+--------------------------------------+ | | |
| ``FixedString(N)`` | | | |
| ``IPv4`` | | | |
+--------------------------------------+ | | |
| ``String`` | | | |
| ``IPv6`` | | | |
+--------------------------------------+------------------+ | |
| ``-`` | ``BinaryType()`` | | |
+--------------------------------------+------------------+------------------------+--------------------------+
@@ -352,7 +380,7 @@ and write it as ``String`` column in Clickhouse:
array_column_json String,
)
ENGINE = MergeTree()
ORDER BY time
ORDER BY id
""",
)

@@ -369,18 +397,34 @@ Then you can parse this column on Clickhouse side - for example, by creating a view

.. code:: sql

    SELECT id, JSONExtract(json_column, 'Array(String)') FROM target_tbl
    SELECT
        id,
        JSONExtract(array_column_json, 'Array(String)') AS array_column
    FROM target_tbl

You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_
or `MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ columns
to avoid writing such an expression in every ``SELECT`` clause:

You can also use `ALIAS <https://clickhouse.com/docs/en/sql-reference/statements/create/table#alias>`_ columns
to avoid writing such expression in every ``SELECT`` clause all the time.
.. code-block:: sql

    CREATE TABLE default.target_tbl (
        id Int32,
        array_column_json String,
        -- computed column
        array_column Array(String) ALIAS JSONExtract(array_column_json, 'Array(String)')
        -- or materialized column
        -- array_column Array(String) MATERIALIZED JSONExtract(array_column_json, 'Array(String)')
    )
    ENGINE = MergeTree()
    ORDER BY id

Downsides:

* Using ``SELECT JSONExtract(...)`` or an ``ALIAS`` column can be expensive, because the value is calculated on every row access. This can be especially harmful if such a column is used in a ``WHERE`` clause.
* Both ``ALIAS`` columns are not included in ``SELECT *`` clause, they should be added explicitly: ``SELECT *, calculated_column FROM table``.
* ``ALIAS`` and ``MATERIALIZED`` columns are not included in the ``SELECT *`` clause; they should be added explicitly: ``SELECT *, calculated_column FROM table``.

.. warning::

`MATERIALIZED <https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized>`_ and
`EPHEMERAL <https://clickhouse.com/docs/en/sql-reference/statements/create/table#ephemeral>`_ columns are not supported by Spark
because they cannot be selected to determine target column type.
25 changes: 24 additions & 1 deletion docs/connection/db_connection/greenplum/types.rst
@@ -102,7 +102,9 @@ See Greenplum `CREATE TABLE <https://docs.vmware.com/en/VMware-Greenplum/7/green
Supported types
---------------

See `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_.
See:
* `official connector documentation <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/2.3/greenplum-connector-spark/reference-datatype_mapping.html>`_
* `list of Greenplum types <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_

Numeric types
~~~~~~~~~~~~~
@@ -181,6 +183,27 @@ Temporal types
| ``tstzrange`` | | | |
+------------------------------------+-------------------------+-----------------------+-------------------------+

.. warning::

Note that types in Greenplum and Spark have different value ranges:

+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| Greenplum type | Min value | Max value | Spark type | Min value | Max value |
+================+=================================+==================================+=====================+================================+================================+
| ``date`` | ``-4713-01-01`` | ``5874897-01-01`` | ``DateType()`` | ``0001-01-01`` | ``9999-12-31`` |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+
| ``timestamp`` | ``-4713-01-01 00:00:00.000000`` | ``294276-12-31 23:59:59.999999`` | ``TimestampType()`` | ``0001-01-01 00:00:00.000000`` | ``9999-12-31 23:59:59.999999`` |
+----------------+---------------------------------+----------------------------------+ | | |
| ``time`` | ``00:00:00.000000`` | ``24:00:00.000000`` | | | |
+----------------+---------------------------------+----------------------------------+---------------------+--------------------------------+--------------------------------+

So not all values can be read from Greenplum into Spark.

References:
* `Greenplum types documentation <https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/ref_guide-data_types.html>`_
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_

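One hedged way to avoid this is to filter such rows on the Greenplum side while reading, for example via the ``where`` clause of ``DBReader`` (a sketch; table and column names are illustrative, ``where`` support and the import path may differ between onETL versions):

.. code-block:: python

    from onetl.connection import Greenplum
    from onetl.db import DBReader  # import path may differ between onETL versions

    greenplum = Greenplum(...)

    # skip rows whose dates cannot be represented by Spark's DateType/TimestampType
    reader = DBReader(
        connection=greenplum,
        source="schema.some_table",
        where="business_dt BETWEEN '0001-01-01' AND '9999-12-31'",
    )
    df = reader.run()
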
.. [3]
``time`` type is the same as ``timestamp`` with date ``1970-01-01``. So instead of reading data from Postgres like ``23:59:59``
90 changes: 90 additions & 0 deletions docs/connection/db_connection/mssql/execute.rst
@@ -3,6 +3,96 @@
Executing statements in MSSQL
=============================

How to
------

There are two ways to execute a statement in MSSQL:

Use :obj:`MSSQL.fetch <onetl.connection.db_connection.mssql.connection.MSSQL.fetch>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute a ``SELECT`` query which returns a **small number of rows**, like reading
MSSQL config, or reading data from some reference table.

Method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

The connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``SELECT ... FROM ...``
* ✅︎ ``WITH alias AS (...) SELECT ...``
* ✅︎ ``SELECT func(arg1, arg2)`` - call function
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    df = mssql.fetch(
        "SELECT value FROM some.reference_table WHERE key = 'some_constant'",
        options=MSSQL.JDBCOptions(query_timeout=10),
    )
    mssql.close()
    value = df.collect()[0][0]  # get value from first row and first column

Use :obj:`MSSQL.execute <onetl.connection.db_connection.mssql.connection.MSSQL.execute>`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use this method to execute DDL and DML operations. Each method call runs an operation in a separate transaction, and then commits it.

Method accepts :obj:`JDBCOptions <onetl.connection.db_connection.jdbc_mixin.options.JDBCOptions>`.

The connection opened using this method should then be closed with :obj:`MSSQL.close <onetl.connection.db_connection.mssql.connection.MSSQL.close>`.

Syntax support
^^^^^^^^^^^^^^

This method supports **any** query syntax supported by MSSQL, like:

* ✅︎ ``CREATE TABLE ...``, ``CREATE VIEW ...``
* ✅︎ ``ALTER ...``
* ✅︎ ``INSERT INTO ... AS SELECT ...``
* ✅︎ ``DROP TABLE ...``, ``DROP VIEW ...``, and so on
* ✅︎ ``EXEC procedure arg1, arg2`` or ``{call procedure(arg1, arg2)}`` - special syntax for calling procedure
* ✅︎ ``DECLARE ... BEGIN ... END`` - execute T-SQL statement
* ✅︎ other statements not mentioned here
* ❌ ``SET ...; SELECT ...;`` - multiple statements not supported

Examples
^^^^^^^^

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(...)

    with mssql:
        mssql.execute("DROP TABLE schema.table")
        mssql.execute(
            """
            CREATE TABLE schema.table (
                id BIGINT IDENTITY(1,1),
                [key] VARCHAR(4000),
                [value] DECIMAL(20, 10)
            )
            """,
            options=MSSQL.JDBCOptions(query_timeout=10),
        )

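A stored procedure could be called in a similar way, using the ``{call ...}`` syntax mentioned above (a sketch; the procedure name and arguments are illustrative):

.. code-block:: python

    with mssql:
        # JDBC escape syntax for calling a stored procedure
        mssql.execute("{call some_schema.some_procedure(1, 'some value')}")
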
References
----------

.. currentmodule:: onetl.connection.db_connection.mssql.connection

.. automethod:: MSSQL.fetch
10 changes: 9 additions & 1 deletion docs/connection/db_connection/mssql/index.rst
@@ -1,18 +1,26 @@
.. _mssql:

MSSQL
=====
======

.. toctree::
:maxdepth: 1
:caption: Connection

prerequisites
connection

.. toctree::
:maxdepth: 1
:caption: Operations

read
sql
write
execute

.. toctree::
:maxdepth: 1
:caption: Troubleshooting

types
77 changes: 77 additions & 0 deletions docs/connection/db_connection/mssql/prerequisites.rst
@@ -0,0 +1,77 @@
.. _mssql-prerequisites:

Prerequisites
=============

Version Compatibility
---------------------

* SQL Server versions: 2014 - 2022
* Spark versions: 2.3.x - 3.5.x
* Java versions: 8 - 20

See `official documentation <https://learn.microsoft.com/en-us/sql/connect/jdbc/system-requirements-for-the-jdbc-driver>`_
and `official compatibility matrix <https://learn.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server-support-matrix>`_.

Installing PySpark
------------------

To use the MSSQL connector you should have PySpark installed (or injected into ``sys.path``)
BEFORE creating the connector instance.

See :ref:`install-spark` installation instruction for more details.

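For example, a Spark session with the MSSQL JDBC driver package could be created like this (a minimal sketch; ``MSSQL.get_packages()`` is assumed to be available in your onETL version):

.. code-block:: python

    from pyspark.sql import SparkSession

    from onetl.connection import MSSQL

    # MSSQL.get_packages() is assumed to return Maven coordinates of the JDBC driver
    spark = (
        SparkSession.builder
        .appName("mssql-onetl-example")
        .config("spark.jars.packages", ",".join(MSSQL.get_packages()))
        .getOrCreate()
    )
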
Connecting to MSSQL
--------------------

Connection port
~~~~~~~~~~~~~~~

Connection is usually performed to port 1433. Port may differ for different MSSQL instances.
Please ask your MSSQL administrator to provide required information.

Connection host
~~~~~~~~~~~~~~~

It is possible to connect to MSSQL by using either the DNS name of the host or its IP address.

If you're using an MSSQL cluster, it is currently possible to connect only to **one specific node**.
Connecting to multiple nodes to perform load balancing, as well as automatic failover to a new master/replica, is not supported.

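For example, a connection could be created like this (a hedged sketch; host, port, credentials and ``extra`` options are illustrative and depend on your setup):

.. code-block:: python

    from onetl.connection import MSSQL

    mssql = MSSQL(
        host="mssql.domain.com",  # DNS name or IP address of one specific node
        port=1433,
        user="username",
        password="***",
        database="mydb",
        extra={"trustServerCertificate": "true"},  # illustrative driver option
        spark=spark,
    )
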
Required grants
~~~~~~~~~~~~~~~

Ask your MSSQL cluster administrator to set the following grants for the user
used for creating the connection:

.. tabs::

.. code-tab:: sql Read + Write (schema is owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow read & write access to specific table
GRANT SELECT, INSERT ON username.mytable TO username;

-- only if if_exists="replace_entire_table" is used:
    -- allow dropping/truncating this specific table
GRANT ALTER ON username.mytable TO username;

.. code-tab:: sql Read + Write (schema is not owned by user)

-- allow creating tables for user
GRANT CREATE TABLE TO username;

-- allow managing tables in specific schema, and inserting data to tables
GRANT ALTER, SELECT, INSERT ON SCHEMA::someschema TO username;

.. code-tab:: sql Read only

-- allow read access to specific table
GRANT SELECT ON someschema.mytable TO username;

More details can be found in the official documentation:
* `GRANT ON DATABASE <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-database-permissions-transact-sql>`_
* `GRANT ON OBJECT <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-object-permissions-transact-sql>`_
* `GRANT ON SCHEMA <https://learn.microsoft.com/en-us/sql/t-sql/statements/grant-schema-permissions-transact-sql>`_