Commit
[DOP-14058] Improve Kafka documentation
dolfinus committed May 21, 2024
1 parent aa2b753 commit f31d502
Showing 12 changed files with 594 additions and 570 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -58,6 +58,7 @@
"sphinxcontrib.plantuml",
"sphinx.ext.extlinks",
"sphinx_favicon",
"sphinxcontrib.autodoc_pydantic",
]
numpydoc_show_class_members = False
autodoc_pydantic_model_show_config = False
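
The new ``sphinxcontrib.autodoc_pydantic`` entry enables autodoc-pydantic, which renders the fields of pydantic models in the generated API docs (onETL connection classes are built on pydantic). A minimal sketch of the kind of class it documents; the model below is hypothetical, for illustration only:

from pydantic import BaseModel

# Hypothetical model: autodoc-pydantic renders its docstring and fields
# in the Sphinx output, subject to the autodoc_pydantic_* options above.
class KafkaAuth(BaseModel):
    """Authentication settings (example docstring rendered in the API docs)."""

    user: str
    password: str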
70 changes: 47 additions & 23 deletions docs/connection/db_connection/clickhouse/types.rst
@@ -158,8 +158,8 @@ Numeric types
| ``Int64`` | ``LongType()`` | ``Int64`` | ``Int64`` |
+--------------------------------+-----------------------------------+-------------------------------+-------------------------------+
| ``Int128`` | unsupported [3]_ | | |
+--------------------------------+-----------------------------------+-------------------------------+-------------------------------+
| ``Int256`` | unsupported [3]_ | | |
+--------------------------------+ | | |
| ``Int256`` | | | |
+--------------------------------+-----------------------------------+-------------------------------+-------------------------------+
| ``-`` | ``ByteType()`` | ``Int8`` | ``Int8`` |
+--------------------------------+-----------------------------------+-------------------------------+-------------------------------+
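
Since ``Int128``/``Int256`` cannot be read directly, one workaround is to cast them to a supported type on the Clickhouse side while reading. A sketch, assuming an onETL ``DBReader`` whose ``columns`` accepts raw SQL expressions; the table and column names are hypothetical:

from onetl.db import DBReader

# Sketch: Int128/Int256 are unsupported by the connector, so cast them in
# the read query. "clickhouse" is an existing Clickhouse connection object.
reader = DBReader(
    connection=clickhouse,
    source="schema.table",
    columns=[
        "id",
        "CAST(huge_number AS String) AS huge_number",  # Int256 -> String
    ],
)
df = reader.run()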
@@ -198,20 +198,25 @@ Notes:
+===================================+======================================+==================================+===============================+
| ``Date`` | ``DateType()`` | ``Date`` | ``Date`` |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``Date32`` | ``DateType()`` | ``Date`` | ``Date`` |
| | | | **cannot be inserted** [6]_ |
| ``Date32`` | ``DateType()`` | ``Date`` | ``Date``, |
| | | | **cannot insert data** [4]_ |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``DateTime32``, seconds | ``TimestampType()``, microseconds | ``DateTime64(6)``, microseconds | ``DateTime32``, seconds |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``DateTime32``, seconds | ``TimestampType()`` | ``DateTime64(6)``, microseconds | ``DateTime32`` |
+-----------------------------------+--------------------------------------+----------------------------------+ seconds |
| ``DateTime64(3)``, milliseconds | ``TimestampType()`` | ``DateTime64(6)``, microseconds | **precision loss** [4]_ |
+-----------------------------------+--------------------------------------+----------------------------------+ |
| ``DateTime64(6)``, microseconds | ``TimestampType()`` | ``DateTime64(6)``, microseconds | |
+-----------------------------------+--------------------------------------+----------------------------------+ |
| ``DateTime64(7..9)``, nanoseconds | ``TimestampType()`` | ``DateTime64(6)`` | |
| | | microseconds | |
| | | **precision loss** [4]_ | |
| ``DateTime64(3)``, milliseconds | ``TimestampType()``, microseconds | ``DateTime64(6)``, microseconds | ``DateTime32``, seconds, |
| | | | **precision loss** [5]_ |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``-`` | ``TimestampNTZType()`` | ``DateTime64(6)`` | |
| ``DateTime64(6)``, microseconds | ``TimestampType()``, microseconds | | ``DateTime32``, seconds, |
+-----------------------------------+--------------------------------------+ | **precision loss** [6]_ |
| ``DateTime64(7..9)``, nanoseconds | ``TimestampType()``, microseconds, | | |
| | **precision loss** [4]_ | | |
| | | | |
+-----------------------------------+--------------------------------------+ | |
| ``-`` | ``TimestampNTZType()``, microseconds | | |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``DateTime32(TZ)`` | unsupported [7]_ | | |
+-----------------------------------+ | | |
| ``DateTime64(P, TZ)`` | | | |
+-----------------------------------+--------------------------------------+----------------------------------+-------------------------------+
| ``IntervalNanosecond`` | ``LongType()`` | ``Int64`` | ``Int64`` |
+-----------------------------------+ | | |
@@ -261,6 +266,10 @@ Notes:
* `Spark DateType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DateType.html>`_
* `Spark TimestampType documentation <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/TimestampType.html>`_

.. [4]
``Date`` has a different byte representation than ``Date32``, so inserting a value of type ``Date32`` into a ``Date`` column
leads to errors on the Clickhouse side, e.g. ``Date(106617) should be between 0 and 65535 inclusive of both values``.
.. [4]
Clickhouse supports datetimes up to nanosecond precision (``23:59:59.999999999``),
but Spark ``TimestampType()`` only supports datetimes up to microsecond precision (``23:59:59.999999``).
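
Because Spark timestamps top out at microseconds, writing into a seconds-precision ``DateTime32`` column truncates silently. A sketch of making that truncation explicit before writing; the column name is hypothetical:

from pyspark.sql.functions import col, date_trunc

# Sketch: truncate to whole seconds before writing into a DateTime32 column,
# so the precision loss is visible in the code instead of happening silently.
df = df.withColumn("event_time", date_trunc("second", col("event_time")))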
@@ -291,27 +300,40 @@ String types
| ``IPv4`` | | | |
+--------------------------------------+ | | |
| ``IPv6`` | | | |
+--------------------------------------+ | | |
| ``UUID`` | | | |
+--------------------------------------+------------------+ | |
| ``-`` | ``BinaryType()`` | | |
+--------------------------------------+------------------+------------------------+--------------------------+
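
Since ``IPv4``/``IPv6``/``UUID`` values arrive in Spark as plain strings, ordinary string functions apply to them. A small sketch; the column name is hypothetical:

from pyspark.sql.functions import col

# Sketch: UUID columns are read as StringType(), so string operations work.
df.where(col("user_uuid").startswith("0000")).show()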

Struct types
~~~~~~~~~~~~~

+--------------------------------------+------------------+------------------------+--------------------------+
| Clickhouse type (read) | Spark type | Clickhouse type (write) | Clickhouse type (create) |
+======================================+==================+========================+==========================+
| ``Map(K, V)`` | ``StringType()`` | ``String`` | ``String`` |
+--------------------------------------+ | | |
| ``Tuple(T1, T2, ...)`` | | | |
+--------------------------------------+ | | |
| ``JSON`` | | | |
+--------------------------------------+------------------+------------------------+--------------------------+
| ``Array(T)`` | unsupported | | |
+--------------------------------------+ | | |
| ``Nested(field1 T1, ...)`` | | | |
+--------------------------------------+------------------+------------------------+--------------------------+
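
``Map``/``Tuple``/``JSON`` columns arrive in Spark as JSON strings, so they can be unpacked with ``JSON.parse_column`` as the docs show below. A sketch for a hypothetical ``Map(String, Int32)`` column named ``attributes``:

from onetl.file.format import JSON
from pyspark.sql.types import IntegerType, MapType, StringType

# Sketch: a Map(String, Int32) column is read as a JSON string, then parsed
# back into a Spark MapType. The column name is hypothetical.
df = df.select(
    df.id,
    JSON().parse_column("attributes", MapType(StringType(), IntegerType())),
)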

Unsupported types
-----------------

Columns of these Clickhouse types cannot be read by Spark:
* ``AggregateFunction(func, T)``
* ``Array(T)``
* ``JSON``
* ``Map(K, V)``
* ``MultiPolygon``
* ``Nested(field1 T1, ...)``
* ``Nothing``
* ``Point``
* ``Polygon``
* ``Ring``
* ``SimpleAggregateFunction(func, T)``
* ``Tuple(T1, T2, ...)``
* ``UUID``

Dataframes with these Spark types cannot be written to Clickhouse:
* ``ArrayType(T)``
@@ -359,9 +381,10 @@ For parsing JSON columns in ClickHouse, :obj:`JSON.parse_column <onetl.file.form
# Spark requires all columns to have some specific type, describe it
column_type = ArrayType(IntegerType())
json = JSON()
df = df.select(
    df.id,
    JSON().parse_column("array_column", column_type),
    json.parse_column("array_column", column_type),
)
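
For completeness, the snippet above assumes imports along these lines (a sketch; the full example in the rendered docs includes them):

from onetl.file.format import JSON
from pyspark.sql.types import ArrayType, IntegerType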
``DBWriter``
Expand Down Expand Up @@ -389,9 +412,10 @@ For writing JSON data to ClickHouse, use the :obj:`JSON.serialize_column <onetl.
""",
)
json = JSON()
df = df.select(
    df.id,
    JSON().serialize_column(df.array_column).alias("array_column_json"),
    json.serialize_column(df.array_column).alias("array_column_json"),
)
writer.run(df)
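
The ``writer`` object used above is created earlier in the example; a minimal sketch, with a hypothetical target table:

from onetl.db import DBWriter

writer = DBWriter(
    connection=clickhouse,  # existing Clickhouse connection
    target="schema.table",  # hypothetical target table
)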