Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

使用pgloader将mysql迁移至pgsql并使用mysql_fdw校验表数据 #48

Open
n0099 opened this issue Apr 18, 2024 · 9 comments
Open

Comments

@n0099
Copy link
Owner

n0099 commented Apr 18, 2024

  1. 环境
$ uname -a
Linux azure 6.5.0-1018-azure #19~22.04.2-Ubuntu SMP Thu Mar 21 16:45:46 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ mysqld --version
/usr/sbin/mysqld  Ver 8.0.36-0ubuntu0.22.04.1 for Linux on x86_64 ((Ubuntu))
$ /usr/lib/postgresql/16/bin/postgres --version
postgres (PostgreSQL) 16.2 (Ubuntu 16.2-1.pgdg22.04+1)

pgloader@3.6.10
mysql_fdw@2.9.1
库名假定为tbm

  1. 使用某11年历史的commonlisp程序 https://github.com/dimitri/pgloader 可以在多个数据源(其中有着mysql和pgsql)之间导入导出

1.1 由于 dimitri/pgloader#1211 ,从postgre社区apt source安装的pgloader 3.6.10-2.pgdg22.04+1无法连接mysqld

$ apt show -a pgloader
Package: pgloader
Version: 3.6.10-2.pgdg22.04+1
Priority: optional
Section: database
Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Installed-Size: 33.9 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libsqlite3-0, libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.3 MB
APT-Sources: https://apt.postgresql.org/pub/repos/apt jammy-pgdg/main amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.

Package: pgloader
Version: 3.6.3-1ubuntu1
Priority: extra
Section: universe/misc
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Dimitri Fontaine <dim@tapoueh.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 33.8 MB
Depends: freetds-dev, libc6 (>= 2.34), zlib1g (>= 1:1.1.4), libssl3
Homepage: https://github.com/dimitri/pgloader
Download-Size: 29.5 MB
APT-Sources: http://azure.archive.ubuntu.com/ubuntu jammy/universe amd64 Packages
Description: extract, transform and load data into PostgreSQL
    pgloader imports data from different kind of sources and COPY it into
    PostgreSQL.
    .
    The command language is described in the manual page and allows one to
    describe where to find the data source, its format, and to describe data
    processing and transformation.
    .
    Supported source formats include CSV, fixed width flat files, dBase3 files
    (DBF), and SQLite and MySQL databases. In most of those formats, pgloader
    is able to auto-discover the schema and create the tables and the indexes
    in PostgreSQL. In the MySQL case it's possible to edit CASTing rules from
    the pgloader command directly.

1.1.1. 建议从源码自行编译3.6.10

sudo apt install -y pgloader sbcl && # 通过安装apt中的pgloader来安装其运行时所需依赖和sbcl commonlisp编译器
git clone https://github.com/dimitri/pgloader &&
cd pgloader &&
git checkout v3.6.10 &&
make DYNSIZE=1024 && # 内存限制
build/bin/pgloader --version && # pgloader version "3.6.af8c3c1" compiled with SBCL 2.1.11.debian
sudo apt purge -y sbcl

1.1.2. 清理:

sudo apt purge -y pgloader &&
sudo apt autoremove --purge

1.2. 由于 dimitri/pgloader#782 ,pgloader只能通过仍在使用mysql 8.0.34中deprecated的mysql_native_password用户登录mysqld

1.2.1. 建议ad-hoc地临时创建一个最小权限(只GRANT库/表级权限)用户用于pgloader而非修改现有用户的default auth plugin

  1. 令文件schema.load
LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

WITH quote identifiers, schema only

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;

2.1. 可以使用INCLUDING ONLY TABLE NAMES MATCHING ~/regexp/, 'exact'来filter要导入的表名

2.2. 令文件data.load

LOAD DATABASE
    FROM mysql://username:password@localhost/tbm
    INTO pgsql://username:password@localhost/tbm

 SET MySQL PARAMETERS -- https://github.com/dimitri/pgloader/issues/999
     net_read_timeout  = '300',
     net_write_timeout = '300'

WITH quote identifiers, data only, on error stop, workers = 2, concurrency = 1, batch rows = 1000, batch size = 8MB, prefetch rows = 10000 -- 酌情tunning https://pgloader.readthedocs.io/en/latest/command.html#batch-behaviour-options

INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/
;

2.3. 分开导入表结构表数据是因为pgloader默认行为是按照指定和mysql://源下默认的CASTING创建目的表结构(除了索引),再COPY导入表数据,最后并行{表索引数量}CREATE INDEX并且不等待其完成而是并行地开始下一个表的过程: https://pgloader.readthedocs.io/en/latest/batches.html

By default, as many CREATE INDEX threads as the maximum number of indexes per table are found in your source schema. It is possible to set the max parallel create index WITH option to another number in case there’s just too many of them to create.

由于未知原因在3.6.af8c3c1WITH max parallel create index 1并没有实际作用仍然会执行上述流程所以不得不分开在WITH schema only.load中先完成CREATE INDEX dimitri/pgloader#1576 以减少io争用

2.3.1. 出于类似的原因也建议先禁用autovacuum

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
SELECT * FROM pg_settings WHERE name = 'autovacuum';

清理:待pgloader完成后再手动VACUMM FULL https://dba.stackexchange.com/questions/130496/is-it-worth-it-to-run-vacuum-on-a-table-that-only-receives-inserts/130514#130514

ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();
VACUMM FULL;

2.4. WITH quote identifiers是为了方便后续校验两端表数据是否一致时可以假定列名相同,您可以之后再重命名为postgre社区习惯的snake_case以减少满屏幕""(相当于mysql的``

2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql

SELECT * FROM pg_stat_activity WHERE application_name = 'pgloader';

可根据表大小猜测进度 https://stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505

SELECT table_schema, table_name, pg_relation_size('"'||table_schema||'"."'||table_name||'"')::float8 / 1024 / 1024 "MiB"
FROM information_schema.tables
WHERE table_schema = 'tbm'
ORDER BY 3 DESC;
  1. 设有表tbmc_f97650_thread结构
    mysql端:
CREATE TABLE `tbmc_f97650_thread` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `tid` bigint unsigned NOT NULL,
  `threadType` bigint unsigned NOT NULL,
  `stickyType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `topicType` tinytext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `isGood` tinyint DEFAULT NULL,
  `title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `authorUid` bigint NOT NULL,
  `postedAt` int unsigned NOT NULL,
  `latestReplyPostedAt` int unsigned NOT NULL,
  `latestReplierUid` bigint DEFAULT NULL,
  `replyCount` int unsigned DEFAULT NULL,
  `viewCount` int unsigned DEFAULT NULL,
  `shareCount` int unsigned DEFAULT NULL,
  `agreeCount` int DEFAULT NULL,
  `disagreeCount` int DEFAULT NULL,
  `zan` blob,
  `geolocation` blob,
  `authorPhoneType` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `createdAt` int unsigned NOT NULL,
  `updatedAt` int unsigned DEFAULT NULL,
  `lastSeenAt` int unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `tid` (`tid`) USING BTREE,
  KEY `authorUid` (`authorUid`),
  KEY `postTime` (`postedAt` DESC)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm_dev.tbmc_f97650_thread (
  id bigint primary key not null default nextval('tbmc_f97650_thread_id_seq'::regclass),
  tid bigint not null,
  "threadType" bigint not null,
  "stickyType" text,
  "topicType" text,
  "isGood" smallint,
  title text not null,
  "authorUid" bigint not null,
  "postedAt" bigint not null,
  "latestReplyPostedAt" bigint not null,
  "latestReplierUid" bigint,
  "replyCount" bigint,
  "viewCount" bigint,
  "shareCount" bigint,
  "agreeCount" integer,
  "disagreeCount" integer,
  zan bytea,
  geolocation bytea,
  "authorPhoneType" text,
  "createdAt" bigint not null,
  "updatedAt" bigint,
  "lastSeenAt" bigint
);
create index "idx_23173_authorUid" on tbmc_f97650_thread using btree ("authorUid");
create index "idx_23173_postTime" on tbmc_f97650_thread using btree ("postedAt");
create unique index idx_23173_tid on tbmc_f97650_thread using btree (tid);

3.1. 可以通过将表中每列值concat起来hash后再aggreate地对每行hash再hash来得出整个表的hash
mermaid.live

flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --CONCAT_WS(' ', ...)--> row[row\nsingle long csv line string but using space and without any escape]
    row --sha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash

mysql

SET SESSION group_concat_max_len = 18446744073709551615; -- 默认只有1024超过会带WARNING截断 https://dba.stackexchange.com/questions/197746/why-set-group-concat-max-len-below-the-maximum
SELECT SHA2(GROUP_CONCAT(SHA2(CONCAT_WS(' ',
    tid, threadType, stickyType, topicType, isGood, title, authorUid, postedAt, latestReplyPostedAt, latestReplierUid, replyCount, viewCount, shareCount, agreeCount, disagreeCount, LOWER(HEX(zan)), LOWER(HEX(geolocation)), authorPhoneType, createdAt, updatedAt, lastSeenAt
), 512) ORDER BY tid SEPARATOR '\n'), 512) hash FROM tbmc_f97650_thread;

blob类型字段必须套LOWER(HEX())因为mysql默认的blob2text输出类似0xDEADBEAF形式而pgsql是deadbeaf
pgsql

SELECT encode(sha512(convert_to(array_to_string(array_agg(encode(sha512(convert_to(concat_ws(' ',
    tid, "threadType", "stickyType", "topicType", "isGood", title, "authorUid", "postedAt", "latestReplyPostedAt", "latestReplierUid", "replyCount", "viewCount", "shareCount", "agreeCount", "disagreeCount", encode(zan, 'hex'), encode(geolocation, 'hex'), "authorPhoneType", "createdAt", "updatedAt", "lastSeenAt"
), 'utf8')), 'hex') ORDER BY tid), E'\n'), 'utf8')), 'hex') hash FROM tbmc_f97650_thread;

3.2. 然而在大表上两者最终都会耗尽内存而失败,并且需要

  • 显式列出所有列名
  • 给mysql表的blob列套LOWER(HEX())
  • GROUP_CONCAT()时按照PKUK排序行否则在pgsql中默认迫真假随机
  • 导致难以针对100+个表一键生成

3.3. 理论上可以通过赋予纯SQL(无需臭名昭著的命令式存储过程或外部程序辅助)图灵完备性 https://wiki.postgresql.org/wiki/Mandelbrot_set 导致其踏入了同一条Turing Tarpit之河流的rCTELIMIT 10000地每次hash的行并对每10k行aggreate之hash再aggreate地hash从而减少每行hash所aggreate的字符串长度
mermaid.live

flowchart TD
    fields[tid\nthreadType\nstickyType\ntopicType\nisGood\ntitle\nauthorUid\npostedAt\nlatestReplyPostedAt\nlatestReplierUid\nreplyCount\nviewCount\nshareCount\nagreeCount\ndisagreeCount\nzan\ngeolocation\nauthorPhoneType\ncreatedAt\nupdatedAt\nlastSeenAt]
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT 10000\nCONCAT_WS(' ', ...)--> row
    fields --LIMIT rest\nCONCAT_WS(' ', ...)--> row
    row[row\nsingle long csv line string but using space and without any escape]
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT 10000\nsha2-256--> row_hash
    row --LIMIT rest\nsha2-256--> row_hash
    row_hash --GROUP_CONCAT(... ORDER BY tid SEPARATOR '&bsol;n')--> rows_hash[rows_hash\nnew line delimited rows]
    rows_hash --sha2-256--> table_hash
  1. https://stackoverflow.com/questions/59715303/is-there-any-way-join-tables-of-mysql-and-postgresql/59721228#59721228某so纯路人指出可以使用postgre社区牛逼哄哄的某知名企业EnterpriseDB所开发的某pgsql扩展 https://github.com/EnterpriseDB/mysql_fdw 通过foreign data wrapper来在pgsql中将外部数据当做迫真物化视图查询

4.1. 安装:

git clone https://github.com/EnterpriseDB/mysql_fdw &&
git checkout REL-2_9_1 &&
sudo apt install -y libmysqlclient-dev postgresql-server-dev-16 &&
export PATH=/usr/local/pgsql/bin/:$PATH &&
make USE_PGXS=1 &&
sudo make USE_PGXS=1 install &&
make USE_PGXS=1 installcheck &&
sudo apt purge -y postgresql-server-dev-16

清理:

sudo apt purge -y libmysqlclient-dev &&
sudo make USE_PGXS=1 uninstall &&
rm -rf mysql_fdw

4.2. 安装该pgsql扩展

ALTER USER tbm WITH SUPERUSER;
CREATE EXTENSION IF NOT EXISTS mysql_fdw;
ALTER USER tbm WITH NOSUPERUSER;
CREATE SERVER IF NOT EXISTS mysql
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER -- 当前登录用户 也可以用PUBLIC使得用户名/密码对所有pgsql用户可见
    SERVER mysql
    OPTIONS (username 'username', password 'password');

清理:

DROP USER MAPPING IF EXISTS FOR CURRENT_USER SERVER mysql; -- FOR CURRENT_USER同上
DROP SCHEMA IF EXISTS mysql CASCADE;
DROP SERVER IF EXISTS mysql CASCADE;
DROP EXTENSION IF EXISTS mysql_fdw CASCADE;
DROP DATABASE contrib_regression; -- make USE_PGXS=1 installcheck所产生

4.3. 在pgsql中创建fdw表

CREATE SCHEMA IF NOT EXISTS mysql;
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmc_f97650_thread"()
    INHERITS (tbm."tbmc_f97650_thread")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmc_f97650_thread');

使用table inheritance是为了避免在2.已经pgloader schema.load后再写一遍fdw表结构,但这会导致

4.3.1. 被继承表tbmc_f97650_thread无法再直接UPDATE/DELETE返回[HV005] ERROR: system attribute "tableoid" can't be fetched from remote relation而必须加ONLY EnterpriseDB/mysql_fdw#300 但仍然可以INSERT

4.3.1.1. 这个bug可能会在未来版本的mysql_fdw中修复但在这里并不重要,因为这种迫真假锁全表正好避免了在校验期间UPDATE/DELETE(但防不了INSERT)(默认事务隔离级别READ COMMITTED而非mysql默认的REPEATABLE READ #32 (comment) 当然您也可以改)了pgsql表导致假阴性

4.3.2. 由于table inheritance的本质:

SELECT * FROM tbm."tbmc_f97650_thread"

也会出现

SELECT * FROM mysql."tbmc_f97650_thread"

中的所有结果:也就是被继承表并集了所有继承表(在这里通过查询mysql),相当于

SELECT * FROM ONLY tbm."tbmc_f97650_thread" UNION ALL SELECT * FROM mysql."tbmc_f97650_thread"
  1. 设有表tbmi_ocr_box_Latn结构:
    mysql端:
CREATE TABLE `tbmi_ocr_box_Latn` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `centerPointX` smallint unsigned NOT NULL,
  `centerPointY` smallint unsigned NOT NULL,
  `width` smallint unsigned NOT NULL,
  `height` smallint unsigned NOT NULL,
  `rotationDegrees` smallint unsigned NOT NULL,
  `recognizer` enum('TesseractHorizontal','TesseractVertical','PaddleOCRv4','PaddleOCRv3') NOT NULL,
  `confidence` tinyint unsigned NOT NULL,
  `text` text NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`,`centerPointX`,`centerPointY`,`width`,`height`,`rotationDegrees`,`recognizer`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm."tbmi_ocr_box_Latn" (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "centerPointX" integer not null,
  "centerPointY" integer not null,
  width integer not null,
  height integer not null,
  "rotationDegrees" integer not null,
  recognizer "tbmi_ocr_box_Latn_recognizer" not null,
  confidence smallint not null,
  text text not null,
  primary key ("imageId", "frameIndex", "centerPointX", "centerPointY", width, height, "rotationDegrees", recognizer)
);

5.1. 根据某so纯路人symmdifftable有4种写法:

5.1.1. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/15333054#15333054

-- symmetric difference FULL OUTER JOIN USING (...columns)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b
  USING ("imageId", "frameIndex", "centerPointX", "centerPointY", "width", "height", "rotationDegrees", "recognizer", "confidence", "text")
WHERE (a."imageId" IS NULL AND a."frameIndex" IS NULL AND a."centerPointX" IS NULL AND a."centerPointY" IS NULL AND a."width" IS NULL AND a."height" IS NULL AND a."rotationDegrees" IS NULL AND a."recognizer" IS NULL AND a."confidence" IS NULL AND a."text" IS NULL)
   OR (b."imageId" IS NULL AND b."frameIndex" IS NULL AND b."centerPointX" IS NULL AND b."centerPointY" IS NULL AND b."width" IS NULL AND b."height" IS NULL AND b."rotationDegrees" IS NULL AND b."recognizer" IS NULL AND b."confidence" IS NULL AND b."text" IS NULL);
1 row retrieved starting from 1 in 5 s 745 ms (execution: 5 s 698 ms, fetching: 47 ms)

5.1.1.1. 一键生成

SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b \n  USING ("' ||
    string_agg(column_name, '", "' ORDER BY ordinal_position) ||
    E'")\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbmi_ocr_box_Latn'
GROUP BY table_name;

5.1.1.2. 也可以使用NATURAL FULL OUTER JOIN ONLY https://www.postgresql.org/docs/current/queries-table-expressions.html 省略USING (...)

Finally, NATURAL is a shorthand form of USING: it forms a USING list consisting of all column names that appear in both input tables. As with USING, these columns appear only once in the output table. If there are no common column names, NATURAL JOIN behaves like JOIN ... ON TRUE, producing a cross-product join.

5.1.2. 但5.1.1.5.1.1.2.并不能用于有列类型允许NULL之存在的表如7.中的tbm_bilibiliVote表有着

[...]
  `authorExpGrade` int DEFAULT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
[...]

因为根据sql特色之 https://en.wikipedia.org/wiki/Three-valued_logic NULL = NULLUNKNOWNUNKNOWNboolfalse于是不会被ON/WHERE等clause视作相同 https://stackoverflow.com/questions/14366004/sql-server-join-missing-null-values

5.1.2.1. 因而需要将5.1.1.中的USING (...columns)展开成对所有列=operator以及额外的OR (IS NULL AND IS NULL)

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    (a."pid" = b."pid") AND
    (a."authorUid" = b."authorUid") AND
    (a."authorExpGrade" = b."authorExpGrade" OR (a."authorExpGrade" IS NULL AND b."authorExpGrade" IS NULL)) AND
    (a."isValid" = b."isValid") AND
    (a."voteBy" = b."voteBy" OR (a."voteBy" IS NULL AND b."voteBy" IS NULL)) AND
    (a."voteFor" = b."voteFor" OR (a."voteFor" IS NULL AND b."voteFor" IS NULL)) AND
    (a."replyContent" = b."replyContent") AND
    (a."postTime" = b."postTime")
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);

5.1.2.1.1. 不对NOT NULL的列做额外判断是为了避免7.1.中的

FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

5.1.2.2. 一键生成

SELECT
    'SELECT COUNT(*) FROM mysql."' ||
    table_name ||
    '" a FULL OUTER JOIN ONLY tbm."' ||
    table_name ||
    E'" b ON\n' ||
    string_agg(
        '    (a."' ||
        column_name ||
        '" = b."' ||
        column_name ||
        CASE WHEN is_nullable::bool = true THEN
            '" OR (a."' ||
            column_name ||
            '" IS NULL AND b."' ||
            column_name ||
            '" IS NULL))'
        ELSE '")'
        END,
        E' AND\n' ORDER BY ordinal_position
    ) ||
    E'\nWHERE (a."' ||
    string_agg(column_name, '" IS NULL AND a."' ORDER BY ordinal_position) ||
    E'" IS NULL)\n   OR (b."' ||
    string_agg(column_name, '" IS NULL AND b."' ORDER BY ordinal_position) ||
    '" IS NULL);'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND table_name = 'tbm_bilibiliVote'
GROUP BY table_name;

5.1.3. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/49381589#49381589

-- symmetric difference FULL OUTER JOIN row(table)
SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn" a FULL OUTER JOIN ONLY tbm."tbmi_ocr_box_Latn" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
1 row retrieved starting from 1 in 6 s 464 ms (execution: 6 s 345 ms, fetching: 119 ms)

5.1.3.1. 其中row(t)https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

5.1.3.2. 这也避免了5.1.2.

5.1.4. https://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables

-- non symmetric difference EXCEPT mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b) t;
1 row retrieved starting from 1 in 5 s 137 ms (execution: 4 s 903 ms, fetching: 234 ms)

由于EXCEPT不是symmdiffFULL OUTER JOIN而是SELECT a.* FROM a LEFT OUTER JOIN b所以只能执行两遍2x耗时正如同没有FULL OUTER JOIN可用的mysql人 https://stackoverflow.com/questions/4796872/how-can-i-do-a-full-outer-join-in-mysql

-- non symmetric difference EXCEPT tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b) t;
1 row retrieved starting from 1 in 5 s 793 ms (execution: 5 s 527 ms, fetching: 266 ms)

5.1.4.1. 由于未知bug

-- symmetric difference EXCEPT
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM mysql."tbmi_ocr_box_Latn" b
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a EXCEPT SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b
) t;

UNION ALL后始终为0行即便单独查询有行

1 row retrieved starting from 1 in 8 s 678 ms (execution: 8 s 595 ms, fetching: 83 ms)

5.1.4.2. 这也避免了5.1.2.

5.1.5. 典型的误用NOT EXISTS

-- non symmetric difference WHERE NOT EXISTS mysql
SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)) t;
1 row retrieved starting from 1 in 764 ms (execution: 250 ms, fetching: 514 ms)

因为SELECT * FROM mysql."tbmi_ocr_box_Latn"在此显然返回非0行(因而只需要读一行mysql."tbmi_ocr_box_Latn"几百ms后就会立即返回)从而使得该predicate在非空表上恒真

-- non symmetric difference WHERE NOT EXISTS tbm
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)) t;
1 row retrieved starting from 1 in 516 ms (execution: 125 ms, fetching: 391 ms)

5.1.5.1. 由于未知bug

-- symmetric difference WHERE NOT EXISTS
SELECT COUNT(*) FROM (
    SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM mysql."tbmi_ocr_box_Latn" b)
    UNION ALL
    SELECT * FROM mysql."tbmi_ocr_box_Latn" a WHERE NOT EXISTS (SELECT * FROM ONLY tbm."tbmi_ocr_box_Latn" b)
) t;

UNION ALL后始终为0行即便单独查询有非0行

1 row retrieved starting from 1 in 776 ms (execution: 203 ms, fetching: 573 ms)

5.1.6. 除5.1.5.外的3种执行期间均需pgsql消耗1x表大小的内存(总比3.2.好)和3x表大小临时表存储
image
image
image
image

5.1.6.1. 可以修改临时表空间在fs上的默认路径/var/lib/postgresql/16/main/base/pgsql_tmp https://dba.stackexchange.com/questions/170661/in-postgres-how-do-i-adjust-the-pgsql-tmp-setting/170665#170665

tmp={目的路径} &&
mkdir -p $tmp /PG_16_202307071 &&
chown -R postgres: $tmp &&
chmod -R 700 $tmp
create tablespace temp_tablespace owner postgres location '{目的路径}';
grant create on tablespace temp_tablespace to public;
alter system set temp_tablespaces = temp_tablespace;
select pg_reload_conf();
create temp table a(b text);

5.2. 总行数对比

SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."tbmi_ocr_box_Latn") a, (SELECT COUNT(*) FROM ONLY tbm."tbmi_ocr_box_Latn") b) t;
   a    |   b    | diff
--------+--------+--------
 472655 | 472655 | 0
(1 row)
  1. 令文件compare.sql有在mysql中codegen出基于5.1.3.4.3.5.2.校验用的一键生成pgsql
SELECT CONCAT(
    'CREATE FOREIGN TABLE IF NOT EXISTS mysql."',
    TABLE_NAME,
    '"() INHERITS (tbm."',
    TABLE_NAME,
    '") SERVER mysql OPTIONS (dbname \'tbm\', table_name \'',
    TABLE_NAME,
    '\');\n',
    'SELECT a, b, b - a diff FROM (SELECT (SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '") a, (SELECT COUNT(*) FROM ONLY tbm."',
    TABLE_NAME,
    '") b) t;\n',
    'SELECT COUNT(*) FROM mysql."',
    TABLE_NAME,
    '" a FULL OUTER JOIN ONLY tbm."',
    TABLE_NAME,
    '" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'tbm'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

6.1. 一键bash

cd pgloader && # 1.1.1.中git clone的pgloader目录路径
(
    /usr/bin/time -v build/bin/pgloader schema.load &&
    /usr/bin/time -v build/bin/pgloader data.load &&
    PGPASSWORD=password /usr/bin/time -v psql -aUusername tbm < compare.sql
) 2>&1 | tee -a stdout+err
  1. 设有表tbm_bilibiliVote结构 https://github.com/n0099/bilibiliVote
    mysql端:
CREATE TABLE `tbm_bilibiliVote` (
  `pid` bigint NOT NULL,
  `authorUid` bigint NOT NULL,
  `authorExpGrade` int DEFAULT NULL,
  `isValid` tinyint(1) NOT NULL,
  `voteBy` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `voteFor` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `replyContent` json DEFAULT NULL,
  `postTime` datetime NOT NULL,
  PRIMARY KEY (`pid`),
  KEY `authorUid` (`authorUid`),
  KEY `isVaild` (`isValid`) USING BTREE,
  KEY `voteFor` (`voteFor`(100),`isValid`) USING BTREE,
  KEY `postTime` (`postTime`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm."tbm_bilibiliVote" (
  pid bigint primary key not null,
  "authorUid" bigint not null,
  "authorExpGrade" integer,
  "isValid" boolean not null,
  "voteBy" text,
  "voteFor" text,
  "replyContent" json,
  "postTime" timestamp with time zone not null
);
create index "idx_163841_authorUid" on "tbm_bilibiliVote" using btree ("authorUid");
create index "idx_163841_isVaild" on "tbm_bilibiliVote" using btree ("isValid");
create index "idx_163841_postTime" on "tbm_bilibiliVote" using btree ("postTime");
create index "idx_163841_voteFor" on "tbm_bilibiliVote" using btree ("voteFor", "isValid");

7.1. 使用5.1.3.中的symmetric difference FULL OUTER JOIN row(table)

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

可得 https://stackoverflow.com/questions/47405732/why-does-postgresql-throw-full-join-is-only-supported-with-merge-joinable-or-ha

ERROR:  FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

7.1.1. 使用5.1.5.中的non symmetric difference EXCEPT

SELECT COUNT(*) FROM (SELECT * FROM ONLY tbm."tbm_bilibiliVote" a EXCEPT SELECT * FROM mysql."tbm_bilibiliVote" b) t;
SELECT COUNT(*) FROM (SELECT * FROM mysql."tbm_bilibiliVote" a EXCEPT SELECT * FROM ONLY tbm."tbm_bilibiliVote" b) t;

可得 https://stackoverflow.com/questions/48420438/could-not-identify-an-equality-operator-for-type-json-when-using-distinct

ERROR: could not identify an equality operator for type json

7.1.2. 执行5.1.2.1.

SELECT * FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON
    a."authorUid" = b."authorUid" AND
    a."isValid" = b."isValid" AND
    a."pid" = b."pid" AND
    a."postTime" = b."postTime" AND
    a."replyContent" = b."replyContent" AND
    a."voteBy" = b."voteBy" AND
    a."voteFor" = b."voteFor" AND
    a."authorExpGrade" = b."authorExpGrade"
WHERE (a."pid" IS NULL AND a."authorUid" IS NULL AND a."authorExpGrade" IS NULL AND a."isValid" IS NULL AND a."voteBy" IS NULL AND a."voteFor" IS NULL AND a."replyContent" IS NULL AND a."postTime" IS NULL)
   OR (b."pid" IS NULL AND b."authorUid" IS NULL AND b."authorExpGrade" IS NULL AND b."isValid" IS NULL AND b."voteBy" IS NULL AND b."voteFor" IS NULL AND b."replyContent" IS NULL AND b."postTime" IS NULL);

会有 https://stackoverflow.com/questions/32843213/operator-does-not-exist-json-json

ERROR: operator does not exist: json = json
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.

7.1.3. 7.中可见列replyContentjson类型

-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::jsonb = b."replyContent"::jsonb AND
count
29510

这恰好等于5.2.

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" UNION ALL SELECT COUNT(*) FROM ONLY tbm."tbm_bilibiliVote";
count
14755
14755

之和

7.1.4. 但即便是json::text

-    a."replyContent" = b."replyContent" AND
+    a."replyContent"::text = b."replyContent"::text AND

也仍然相同

7.1.5. 可以在mysql端将该json列转text

ALTER TABLE tbm_bilibiliVote MODIFY replyContent TEXT NOT NULL;

后再重新按照2.2.仅导入该表(INCLUDING ONLY TABLE NAMES MATCHING 'tbm_bilibiliVote'

7.2. 但即便两端的列replyContent重新导入后均为text类型也仍然有所有行不同
实际上是由于列postTimemysql无时区类型datetime2.3.中的pgloadermysql://默认的CASTING

type datetime to timestamptz

转为pgsql有时区类型timestamptz时由pgloader根据系统时区 UTC+8而非pgsql时区 UTC+0转换为UTC

SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
pid postTime postTime
124498114574 2019-03-10 12:38:17.000000 +00:00 2019-03-10 04:38:17.000000 +00:00
124498200265 2019-03-10 12:44:12.000000 +00:00 2019-03-10 04:44:12.000000 +00:00
...

7.2.1. 由于timestamptz实际上并没有额外存储时区UTC offset https://stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql
因而其并不能解决 https://z.n0099.net/#narrow/near/86397 https://old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz/ https://news.ycombinator.com/item?id=20212671 中争论的带时区引用未来指定datetime问题
https://www.w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_Events

The SQL data types 'date', 'time', and 'timestamp' are field based time values which are intended to be zone offset independent: they are actually, technically, floating time values! The data type 'timestamp with time zone' is the zone offset-dependent equivalent of 'timestamp' in SQL.

https://www.w3.org/International/wiki/WorkingWithTimeZones#Floating_Time

  • Floating Time
    Some observed time values are not related to a specific moment in incremental time. Instead, they need to be combined with local information to determine a range of acceptable incremental time values. We refer to these sorts of time values as "floating times". Floating times are not attached and should never be attached to a particular time zone.
    Examples of floating time events include a user’s birth date, an employee’s hire or termination date, or a list of company holidays.
    For example, suppose that January 1st is considered a holiday in your application. The day "January 1" has a floating-time status as a "holiday". That "day" can begin as early as midnight GMT-14:00 and end as late as midnight of January 2 GMT+12:00, depending on local time. This covers an incremental time range of fifty hours.

再叠加mysql_fdw不论当前pgsql时区 https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES 是什么都不会在将从mysql读到的datetime转换为timestamptz时根据时区重新计算时间而是只修改offset(也就是使用跟mysql类型datetime语义相同的pgsql类型timestamp的语义)

SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT a.pid, a."postTime", b."postTime" FROM mysql."tbm_bilibiliVote" a JOIN ONLY tbm."tbm_bilibiliVote" b ON a.pid = b.pid ORDER BY pid;
pid postTime postTime
124498114574 2019-03-10 12:38:17.000000 +08:00 2019-03-10 12:38:17.000000 +08:00
124498200265 2019-03-10 12:44:12.000000 +08:00 2019-03-10 12:44:12.000000 +08:00

因而只需要修改pgsql时区

SET SESSION TIME ZONE 'Asia/Shanghai';
SELECT COUNT(*) FROM ONLY mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

便校验一致

7.2.2. 如果执行了7.1.5.需再在pgsql端将该列类型改回jsonjsonb

DROP FOREIGN TABLE mysql."tbm_bilibiliVote"; -- 避免`ALTER TABLE`由于`4.3.2.`而返回假阳性之`ERROR: "tbm_bilibiliVote" is not a table`
ALTER TABLE tbm."tbm_bilibiliVote" ALTER "replyContent" TYPE jsonb USING "replyContent"::jsonb;
-- 修改`6.`中的`TABLE_NAME REGEXP '...'`predicate为`TABLE_NAME = 'tbm_bilibiliVote'`并eval其输出以重新执行校验
VACUUM FULL tbm."tbm_bilibiliVote";
  1. 设有表tbmi_metadata_gif结构:
CREATE TABLE `tbmi_metadata_gif` (
  `imageId` int unsigned NOT NULL,
  `repeatCount` smallint unsigned NOT NULL,
  `colorTableMode` enum('Global','Local') NOT NULL,
  `globalColorTableLength` int NOT NULL,
  `comments` json DEFAULT NULL,
  PRIMARY KEY (`imageId`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm.tbmi_metadata_gif (
  "imageId" bigint primary key not null,
  "repeatCount" integer not null,
  "colorTableMode" "tbmi_metadata_gif_colorTableMode" not null,
  "globalColorTableLength" integer not null,
  comments json
);

8.1. 按照7.1.3.

SELECT COUNT(*) FROM mysql."tbmi_metadata_gif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_gif" b ON
    (a."imageId" = b."imageId") AND
    (a."repeatCount" = b."repeatCount") AND
    (a."colorTableMode" = b."colorTableMode") AND
    (a."globalColorTableLength" = b."globalColorTableLength") AND
    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
WHERE (a."imageId" IS NULL AND a."repeatCount" IS NULL AND a."colorTableMode" IS NULL AND a."globalColorTableLength" IS NULL AND a."comments" IS NULL)
   OR (b."imageId" IS NULL AND b."repeatCount" IS NULL AND b."colorTableMode" IS NULL AND b."globalColorTableLength" IS NULL AND b."comments" IS NULL);

ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.
Where: JSON data, line 1: ...                   ", "Created by TechSmith\u0000...

8.1.1. 但按照7.1.4.就无差异

-    (a."comments"::jsonb = b."comments"::jsonb OR (a."comments" IS NULL AND b."comments" IS NULL))
+    (a."comments"::text = b."comments"::text OR (a."comments" IS NULL AND b."comments" IS NULL))

8.1.2. 这是由于jsonb类型不同于本质textjson,会将json解析后存储为pgsql内置类型之集,再由于text类型受c人最爱的 https://en.wikipedia.org/wiki/Null-terminated_string 影响不支持存储256个字节中的唯一一个0x00导致jsonb也无法存储\u0000转义后的0x00进其内部的text turbot/steampipe-postgres-fdw#118
https://www.postgresql.org/docs/current/datatype-json.html

The jsonb type also rejects \u0000 (because that cannot be represented in PostgreSQL's text type)

Table 8.23. JSON Primitive Types and Corresponding PostgreSQL Types

JSON primitive type PostgreSQL type Notes
string text \u0000 is disallowed, as are Unicode escapes representing characters not available in the database > encoding
number numeric NaN and infinity values are disallowed
boolean boolean Only lowercase true and false spellings are accepted
null (none) SQL NULL is a different concept

8.2. 该表该列中共有如下\u0000 https://codepoints.net/U+0000

SELECT COUNT(*) count, comments::text c
FROM ONLY tbm."tbmi_metadata_gif"
WHERE comments::text LIKE '%\u0000%'
GROUP BY c
ORDER BY count DESC;
count c
31 [" ", "Created by TechSmith\u0000"]
2 ["tvc\u0000"]
1 ["Author:\t\tGod\_job\_dave\u0000\u0000\u0000???\u0001\u0000\u0000\u0000\u0000P??\u0001\u0000"]
1 ["Optimized by Ulead SmartSaver!\u0000"]
1 ["CREATOR: gd-jpeg v1.0 \(using IJG JPEG v62\), quality = 75\n\u0000"]

8.2.1. 使用如下3层如同6.codegen迫真元编程自我复制的quine病毒可在mysql端检查所有含有0x00text类型及其各个长度变种 https://stackoverflow.com/questions/13932750/tinytext-text-mediumtext-and-longtext-maximum-storage-sizes 列并生成ALTER列类型为BLOB(pgloader在pgsql端按照2.3.mysql://源下默认的CASTING会转换为可存储所有256个字节bytea)和SELECT查阅的一键生成mysql:

SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
'SELECT t.*, CONCAT(
    \'ALTER TABLE tbm.`\',
    t,
    \'` MODIFY `\',
    c,
    \'` \',
    REPLACE(COLUMN_TYPE, \'text\', \'blob\'),
    CASE WHEN COLUMNS.IS_NULLABLE = \'true\' THEN \' NULL;\'
    ELSE \' NOT NULL;\'
    END
) `alter`, CONCAT(
    \'SELECT * FROM tbm.`\',
    t,
    \'` WHERE CAST(`\',
    c,
    \'` AS BINARY) LIKE CONCAT(\'\'%\'\', 0x00, \'\'%\'\');\'
) `select`
FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) count, \'',
        TABLE_NAME,
        '\' t, \'',
        COLUMN_NAME,
        '\' c FROM `',
        TABLE_NAME,
        '` WHERE CAST(`',
        COLUMN_NAME,
        '` AS BINARY) LIKE CONCAT(\'%\', 0x00, \'%\')'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t
JOIN information_schema.COLUMNS
  ON count != 0 AND TABLE_SCHEMA = \'tbm\'
 AND TABLE_NAME = t AND COLUMN_NAME = c;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE LIKE '%text%'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

count t c alter select
4321 tbmi_metadata_embedded_exif userComment ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `userComment` BLOB NOT NULL; SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`userComment` AS BINARY) LIKE CONCAT('%', 0x00, '%');
2612 tbmi_metadata_embedded_exif xpAuthor ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `xpAuthor` BLOB NOT NULL; SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`xpAuthor` AS BINARY) LIKE CONCAT('%', 0x00, '%');
0.00s user 0.02s system 0% cpu 42:49.75 total

8.3. 值得注意的是2.2.导入 dimitri/pgloader#1573 不会有截断0x00后字节的WARNING6.校验 EnterpriseDB/mysql_fdw#299 过程中也无法发现以0x00结尾text中的最后一个0x00字节消失

  1. 设有表tbmi_hash结构:
CREATE TABLE `tbmi_hash` (
  `imageId` int unsigned NOT NULL,
  `frameIndex` int unsigned NOT NULL,
  `pHash` binary(8) NOT NULL,
  `averageHash` bigint unsigned NOT NULL,
  `blockMeanHash` binary(32) NOT NULL,
  `marrHildrethHash` binary(72) NOT NULL,
  `thumbHash` tinyblob NOT NULL,
  PRIMARY KEY (`imageId`,`frameIndex`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm.tbmi_hash (
  "imageId" bigint not null,
  "frameIndex" bigint not null,
  "pHash" bytea not null,
  "averageHash" bigint not null,
  "blockMeanHash" bytea not null,
  "marrHildrethHash" bytea not null,
  "thumbHash" bytea not null,
  primary key ("imageId", "frameIndex")
);

9.1. 由于pgsql没有mysql特有(isosql不要求)带unsigned修饰的的int类型 https://stackoverflow.com/questions/20810134/why-unsigned-integer-is-not-available-in-postgresql/59802732#59802732 正如同m$ft精神mvp最爱的中小实体企业c#工控上位机人上壬 https://z.n0099.net/#narrow/near/94726 除非使用扩展 https://github.com/petere/pguint 因而

-  `averageHash` bigint unsigned NOT NULL,
+  "averageHash" bigint not null,

导致该列上限值从 $2^{64} = 18446744073709551615$ 降到 $2^{63} = 9223372036854775807$

9.2. 实际上所谓的opencv_imghash Starry-OvO/aiotieba#63 (comment)averageHash

{AverageHash.Create(), (image, bytes) => image.AverageHash = BitConverter.ToUInt64(bytes)},
本质上更类似于 https://en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有可运算的正整数语义之类型存储而应直接视作斑点二进制 https://en.wikipedia.org/wiki/Binary_large_object

ALTER TABLE tbmi_hash ADD COLUMN averageHash_2 BINARY(8) NOT NULL AFTER averageHash;
UPDATE tbmi_hash SET averageHash_2 = UNHEX(LPAD(HEX(averageHash), 16, '00')); -- https://stackoverflow.com/questions/58611436/how-do-you-convert-an-integer-into-its-condensed-binary-equivalent-e-g-binary3/58657797#comment138021380_58657797
SELECT COUNT(*) FROM tbmi_hash WHERE averageHash_2 != UNHEX(LPAD(HEX(averageHash), 16, '00'));
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_hash DROP COLUMN averageHash, RENAME COLUMN averageHash_2 TO averageHash;

9.3. 一键生成找出所有存在值 $&gt; 2^{63}$bigint unsigned列的mysql

SET SESSION group_concat_max_len = 18446744073709551615; -- 同`3.1.`
SELECT CONCAT(
    'SELECT * FROM (\n',
    GROUP_CONCAT(
        '    SELECT COUNT(*) c, \'',
        TABLE_NAME,
        '\' t FROM `',
        TABLE_NAME,
        '` WHERE `',
        COLUMN_NAME,
        '` > 9223372036854775807'
        SEPARATOR ' UNION ALL\n'
    ),
    '\n) t WHERE c != 0;'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND COLUMN_TYPE = 'bigint unsigned'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

9.4.

DROP TABLE tbm."tbmi_hash" CASCADE;

类比7.1.5.重新导入该表

CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_hash"()
    INHERITS (tbm."tbmi_hash")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmi_hash');

Comment is too long (maximum is 65536 characters)

@n0099
Copy link
Owner Author

n0099 commented Apr 18, 2024

  1. 设有表tbmi_metadata_embedded_exif结构:
CREATE TABLE `tbmi_metadata_embedded_exif` (
  `imageId` int unsigned NOT NULL,
  `orientation` text,
  `imageDescription` text,
  `userComment` blob,
  `artist` text,
  `xpAuthor` blob,
  `copyright` text,
  `imageUniqueId` text,
  `bodySerialNumber` text,
  `make` text,
  `model` text,
  `software` text,
  `customRendered` smallint DEFAULT NULL,
  `dateTime` datetime DEFAULT NULL,
  `dateTimeOffset` text,
  `dateTimeDigitized` datetime DEFAULT NULL,
  `dateTimeDigitizedOffset` text,
  `dateTimeOriginal` datetime DEFAULT NULL,
  `dateTimeOriginalOffset` text,
  `offsetTime` text,
  `offsetTimeDigitized` text,
  `offsetTimeOriginal` text,
  `gpsDateTime` datetime DEFAULT NULL,
  `gpsCoordinate` point DEFAULT NULL,
  `gpsImgDirection` float DEFAULT NULL,
  `gpsImgDirectionRef` text,
  `rawBytes` blob,
  `xxHash3` binary(8) NOT NULL,
  PRIMARY KEY (`imageId`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

由pgloader在pgsql端按照2.3.mysql://源下默认的CASTING所创建:

create table tbm.tbmi_metadata_embedded_exif (
  "imageId" bigint primary key not null,
  orientation text,
  "imageDescription" text,
  "userComment" text,
  artist text,
  "xpAuthor" text,
  copyright text,
  "imageUniqueId" text,
  "bodySerialNumber" text,
  make text,
  model text,
  software text,
  "customRendered" smallint,
  "dateTime" timestamp with time zone,
  "dateTimeOffset" text,
  "dateTimeDigitized" timestamp with time zone,
  "dateTimeDigitizedOffset" text,
  "dateTimeOriginal" timestamp with time zone,
  "dateTimeOriginalOffset" text,
  "offsetTime" text,
  "offsetTimeDigitized" text,
  "offsetTimeOriginal" text,
  "gpsDateTime" timestamp with time zone,
  "gpsCoordinate" point,
  "gpsImgDirection" double precision,
  "gpsImgDirectionRef" text,
  "rawBytes" bytea,
  "xxHash3" bytea not null
);

10.1. 使用5.1.3.中的symmetric difference FULL OUTER JOIN row(table)

SELECT COUNT(*) FROM ONLY mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

可得7.1.

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

10.1.1. 执行5.1.1.

SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b
  USING ("imageId", "orientation", "imageDescription", "userComment", "artist", "xpAuthor", "copyright", "imageUniqueId", "bodySerialNumber", "make", "model", "software", "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3")
WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
   OR (b."imageId" IS NULL AND b."orientation" IS NULL AND b."imageDescription" IS NULL AND b."userComment" IS NULL AND b."artist" IS NULL AND b."xpAuthor" IS NULL AND b."copyright" IS NULL AND b."imageUniqueId" IS NULL AND b."bodySerialNumber" IS NULL AND b."make" IS NULL AND b."model" IS NULL AND b."software" IS NULL AND b."customRendered" IS NULL AND b."dateTime" IS NULL AND b."dateTimeOffset" IS NULL AND b."dateTimeDigitized" IS NULL AND b."dateTimeDigitizedOffset" IS NULL AND b."dateTimeOriginal" IS NULL AND b."dateTimeOriginalOffset" IS NULL AND b."offsetTime" IS NULL AND b."offsetTimeDigitized" IS NULL AND b."offsetTimeOriginal" IS NULL AND b."gpsDateTime" IS NULL AND b."gpsCoordinate" IS NULL AND b."gpsImgDirection" IS NULL AND b."gpsImgDirectionRef" IS NULL AND b."rawBytes" IS NULL AND b."xxHash3" IS NULL);

会有

ERROR: operator does not exist: point = point
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.

10.1.2. 按照 https://dba.stackexchange.com/questions/252066/how-to-formulate-equality-predicate-on-point-column-in-postgresql/252112#252112 使用5.1.2.1.

SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON
    (a."imageId" = b."imageId") AND
    (a."orientation" = b."orientation" OR (a."orientation" IS NULL AND b."orientation" IS NULL)) AND
    (a."imageDescription" = b."imageDescription" OR (a."imageDescription" IS NULL AND b."imageDescription" IS NULL)) AND
    (a."userComment" = b."userComment" OR (a."userComment" IS NULL AND b."userComment" IS NULL)) AND
    (a."artist" = b."artist" OR (a."artist" IS NULL AND b."artist" IS NULL)) AND
    (a."xpAuthor" = b."xpAuthor" OR (a."xpAuthor" IS NULL AND b."xpAuthor" IS NULL)) AND
    (a."copyright" = b."copyright" OR (a."copyright" IS NULL AND b."copyright" IS NULL)) AND
    (a."imageUniqueId" = b."imageUniqueId" OR (a."imageUniqueId" IS NULL AND b."imageUniqueId" IS NULL)) AND
    (a."bodySerialNumber" = b."bodySerialNumber" OR (a."bodySerialNumber" IS NULL AND b."bodySerialNumber" IS NULL)) AND
    (a."make" = b."make" OR (a."make" IS NULL AND b."make" IS NULL)) AND
    (a."model" = b."model" OR (a."model" IS NULL AND b."model" IS NULL)) AND
    (a."software" = b."software" OR (a."software" IS NULL AND b."software" IS NULL)) AND
    (a."customRendered" = b."customRendered" OR (a."customRendered" IS NULL AND b."customRendered" IS NULL)) AND
    (a."dateTime" = b."dateTime" OR (a."dateTime" IS NULL AND b."dateTime" IS NULL)) AND
    (a."dateTimeOffset" = b."dateTimeOffset" OR (a."dateTimeOffset" IS NULL AND b."dateTimeOffset" IS NULL)) AND
    (a."dateTimeDigitized" = b."dateTimeDigitized" OR (a."dateTimeDigitized" IS NULL AND b."dateTimeDigitized" IS NULL)) AND
    (a."dateTimeDigitizedOffset" = b."dateTimeDigitizedOffset" OR (a."dateTimeDigitizedOffset" IS NULL AND b."dateTimeDigitizedOffset" IS NULL)) AND
    (a."dateTimeOriginal" = b."dateTimeOriginal" OR (a."dateTimeOriginal" IS NULL AND b."dateTimeOriginal" IS NULL)) AND
    (a."dateTimeOriginalOffset" = b."dateTimeOriginalOffset" OR (a."dateTimeOriginalOffset" IS NULL AND b."dateTimeOriginalOffset" IS NULL)) AND
    (a."offsetTime" = b."offsetTime" OR (a."offsetTime" IS NULL AND b."offsetTime" IS NULL)) AND
    (a."offsetTimeDigitized" = b."offsetTimeDigitized" OR (a."offsetTimeDigitized" IS NULL AND b."offsetTimeDigitized" IS NULL)) AND
    (a."offsetTimeOriginal" = b."offsetTimeOriginal" OR (a."offsetTimeOriginal" IS NULL AND b."offsetTimeOriginal" IS NULL)) AND
    (a."gpsDateTime" = b."gpsDateTime" OR (a."gpsDateTime" IS NULL AND b."gpsDateTime" IS NULL)) AND
    (a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
    (a."gpsImgDirection" = b."gpsImgDirection" OR (a."gpsImgDirection" IS NULL AND b."gpsImgDirection" IS NULL)) AND
    (a."gpsImgDirectionRef" = b."gpsImgDirectionRef" OR (a."gpsImgDirectionRef" IS NULL AND b."gpsImgDirectionRef" IS NULL)) AND
    (a."rawBytes" = b."rawBytes" OR (a."rawBytes" IS NULL AND b."rawBytes" IS NULL)) AND
    (a."xxHash3" = b."xxHash3")
WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
   OR (b."imageId" IS NULL AND b."orientation" IS NULL AND b."imageDescription" IS NULL AND b."userComment" IS NULL AND b."artist" IS NULL AND b."xpAuthor" IS NULL AND b."copyright" IS NULL AND b."imageUniqueId" IS NULL AND b."bodySerialNumber" IS NULL AND b."make" IS NULL AND b."model" IS NULL AND b."software" IS NULL AND b."customRendered" IS NULL AND b."dateTime" IS NULL AND b."dateTimeOffset" IS NULL AND b."dateTimeDigitized" IS NULL AND b."dateTimeDigitizedOffset" IS NULL AND b."dateTimeOriginal" IS NULL AND b."dateTimeOriginalOffset" IS NULL AND b."offsetTime" IS NULL AND b."offsetTimeDigitized" IS NULL AND b."offsetTimeOriginal" IS NULL AND b."gpsDateTime" IS NULL AND b."gpsCoordinate" IS NULL AND b."gpsImgDirection" IS NULL AND b."gpsImgDirectionRef" IS NULL AND b."rawBytes" IS NULL AND b."xxHash3" IS NULL);

可得

ERROR: invalid input syntax for type point: "��"

其分别是 https://codepoints.net/U+FFFE 的utf8和0x10 https://codepoints.net/U+0010 (github将其便乘了两个U+FFFE

10.1.3. 即便尝试::text也仍然相同

-    (a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
+    (a."gpsCoordinate"::text = b."gpsCoordinate"::text OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND

事实核查:截止2024年4月,mysql_fdw仍然不支持转换任何gis类型 EnterpriseDB/mysql_fdw#260

10.2.mysql创建视图(类似4.3.中的表继承,当然pgsql也有着isosql中的视图)

CREATE OR REPLACE VIEW tbmi_metadata_embedded_exif_fdw
    AS SELECT *,
        ST_Latitude(gpsCoordinate) gpsCoordinateLat,
        ST_Longitude(gpsCoordinate) gpsCoordinateLng
       FROM tbmi_metadata_embedded_exif;
SELECT imageId, gpsCoordinate, gpsCoordinateLat, gpsCoordinateLng
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL;

清理:

DROP VIEW tbmi_metadata_embedded_exif_fdw;

10.2.1. 按照4.3.

DROP FOREIGN TABLE mysql."tbmi_metadata_embedded_exif";
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_metadata_embedded_exif"(
    "gpsCoordinateLat" float8,
    "gpsCoordinateLng" float8)
    INHERITS (tbm."tbmi_metadata_embedded_exif")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmi_metadata_embedded_exif_fdw');
-    (a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
+    ((a."gpsCoordinateLat" = b."gpsCoordinate"[0] AND a."gpsCoordinateLng" = b."gpsCoordinate"[1])
+        OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
-WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
+WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsCoordinateLat" IS NULL AND a."gpsCoordinateLng" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)

10.2.2. 然而

ERROR: failed to execute the MySQL query:
Invalid GIS data provided to function st_latitude.

https://dev.mysql.com/doc/refman/8.0/en/gis-point-property-functions.html

  • If any geometry argument is not a syntactically well-formed geometry, an ER_GIS_INVALID_DATA error occurs.
  • If a longitude or latitude value is out of range, an error occurs:
    • If a longitude value is not in the range (−180, 180], an ER_LONGITUDE_OUT_OF_RANGE error occurs.
    • If a latitude value is not in the range [−90, 90], an ER_LATITUDE_OUT_OF_RANGE error occurs.

https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_gis_invalid_data

Error number: 3037; Symbol: ER_GIS_INVALID_DATA; SQLSTATE: 22023
Message: Invalid GIS data provided to function %s.
A spatial function was called with an argument not recognized as a valid geometry value.

10.2.3. 但即便直接在mysql端绕过视图换成ST_X() ST_Y()也一样

SELECT COUNT(ST_X(gpsCoordinate)) FROM tbmi_metadata_embedded_exif WHERE gpsCoordinate IS NOT NULL;
[3037] Data truncation: Invalid GIS data provided to function st_x.

10.2.4. 但这又正常

SELECT COUNT(*) FROM (
    SELECT gpsCoordinate, ST_X(gpsCoordinate) FROM tbmi_metadata_embedded_exif WHERE gpsCoordinate IS NOT NULL
) t;
-- 10026

10.3. 换wkt https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry 表达

CREATE OR REPLACE VIEW tbmi_metadata_embedded_exif_fdw
    AS SELECT *,
        ST_AsWKT(gpsCoordinate) gpsCoordinateWKT,
        ST_AsWKB(gpsCoordinate) gpsCoordinateWKB
       FROM tbmi_metadata_embedded_exif;
SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKT) FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate IS NOT NULL;
COUNT(gpsCoordinate) COUNT(gpsCoordinateWKT)
10026 10016

可得有10行wkt为NULL

10.3.1. 直接查看mysql内部存储gis类型所用的二进制encoding

SELECT COUNT(gpsCoordinate), gpsCoordinate, COUNT(gpsCoordinateWKT), gpsCoordinateWKT
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL AND gpsCoordinateWKT IS NULL
GROUP BY gpsCoordinate, gpsCoordinateWKT;
COUNT(gpsCoordinate) gpsCoordinate COUNT(gpsCoordinateWKT) gpsCoordinateWKT
9 0xE61000000101000000000000000000F8FF000000000000F8FF 0 null
1 0xE61000000101000000000000000000F8FF98B868DB43A13E40 0 null

使用 https://rodic.fr/wp-content/uploads/2015/11/geom_converter.html 无法解析wkb之0xE61000000101000000000000000000F8FF000000000000F8FF
https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary 对此早有预言

The first byte indicates the byte order for the data:

  • 00 : big endian
  • 01 : little endian

手动修改第一个字节0xE60x00后的001000000101000000000000000000F8FF000000000000F8FF
得到POINT(7.291122019556398e-304 1.352623777034994e-309)
暂且蒙古

public static Point? ParseGpsCoordinateOrNull(
是如何产生这10个神必丶的

10.3.2. 但实际上mysql内部存储gis类型所用的二进制encoding与wkb并不一致

SELECT gpsCoordinate, gpsCoordinateWKB, gpsCoordinateWKT FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate != gpsCoordinateWKB;
gpsCoordinate gpsCoordinateWKB gpsCoordinateWKT
0xE61000000101000000FD4DC6BCFF565C400BCD751A69193740 0x01010000000BCD751A69193740FD4DC6BCFF565C40 POINT(23.099259999999997 113.35935897222221)
0xE61000000101000000FD4DC6BCFF565C400BCD751A69193740 0x01010000000BCD751A69193740FD4DC6BCFF565C40 POINT(23.099259999999997 113.35935897222221)
0xE61000000101000000E605D847A70C5E40B88E71C5C5293E40 0x0101000000B88E71C5C5293E40E605D847A70C5E40 POINT(30.163173999999998 120.19771)
0xE6100000010100000012F4FF7FB67861406A96FF1F91D94140 0x01010000006A96FF1F91D9414012F4FF7FB6786140 POINT(35.69974136333333 139.77227783194445)
0xE6100000010100000016F1FF1F9679614082CBFFFFF2D04140 0x010100000082CBFFFFF2D0414016F1FF1F96796140 POINT(35.63241577138889 139.79957580555555)
0xE61000000101000000B4EEFF9F9A79614005FDFF9FF7D04140 0x010100000005FDFF9FF7D04140B4EEFF9F9A796140 POINT(35.63255691527778 139.80012512194446)
...

10.3.3. 而如果只看wkb则所有行都能转为wkb

SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKB) FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate IS NOT NULL;
COUNT(gpsCoordinate) COUNT(gpsCoordinateWKB)
10026 10026

10.3.4. 扩展10.3.1.可得

SELECT COUNT(gpsCoordinate), gpsCoordinate,
       COUNT(gpsCoordinateWKT), gpsCoordinateWKT,
       COUNT(gpsCoordinateWKB), gpsCoordinateWKB,
       ST_PointFromWKB(gpsCoordinateWKB) a
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL AND gpsCoordinateWKT IS NULL
GROUP BY gpsCoordinate, gpsCoordinateWKT, gpsCoordinateWKB;
COUNT(gpsCoordinate) gpsCoordinate COUNT(gpsCoordinateWKT) gpsCoordinateWKT COUNT(gpsCoordinateWKB) gpsCoordinateWKB a
9 0xE61000000101000000000000000000F8FF000000000000F8FF 0 null 9 0x0101000000000000000000F8FF000000000000F8FF 0x000000000101000000000000000000F8FF000000000000F8FF
1 0xE61000000101000000000000000000F8FF98B868DB43A13E40 0 null 1 0x010100000098B868DB43A13E40000000000000F8FF 0x00000000010100000098B868DB43A13E40000000000000F8FF

0x0101000000000000000000F8FF000000000000F8FFPOINT(NaN NaN)
0x010100000098B868DB43A13E40000000000000F8FFPOINT(30.629941666666667 NaN)
值得注意的是重新ST_PointFromWKB()使得头4字节0xE610便乘了0x0000
暂且怀疑是 https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCore.MySql/blob/156585fd44ddad7e2594ef99176effac39f92a3c/test/EFCore.MySql.FunctionalTests/MySqlMigrationsSqlGeneratorTest.cs#L133 亦或其幕后的 https://github.com/search?q=repo%3ANetTopologySuite%2FNetTopologySuite+%22E610%22&type=code 及其深层transitive依赖INSERT时(因而在.net层而非mysql层完成了转换所以不一致)就写入了mysql内部存储gis类型所用的二进制encoding而非wkb/t

10.3.5. 而在INSERT INTO时并不会检查pointx y是否为NaN https://bugs.mysql.com/bug.php?id=80123

SELECT COUNT(gpsCoordinate), gpsCoordinate, ST_IsValid(gpsCoordinate)
FROM tbmi_metadata_embedded_exif
WHERE gpsCoordinate IS NOT NULL AND ST_AsWKT(gpsCoordinate) IS NULL
GROUP BY gpsCoordinate;
[3037] Data truncation: Invalid GIS data provided to function st_isvalid.

10.4. 由于pgsql本身不提供gis用途的ST_*()函数如 https://postgis.net/docs/ST_PointFromWKB.html 只提供了类型供存储因而需要手动安装所谓的后地理信息系统 https://en.wikipedia.org/wiki/PostGIS 扩展 https://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS3UbuntuPGSQLApt

sudo apt install -y postgresql-16-postgis-3
CREATE EXTENSION IF NOT EXISTS postgis;

清理:

DROP EXTENSION IF EXISTS postgis;
sudo apt purge -y postgresql-16-postgis-3 &&
sudo apt autoremove --purge

10.5. 如同9.2.mysql端

CREATE TABLE tbmi_metadata_embedded_exif_wkb LIKE tbmi_metadata_embedded_exif;
INSERT INTO tbmi_metadata_embedded_exif_wkb SELECT * FROM tbmi_metadata_embedded_exif;
ALTER TABLE tbmi_metadata_embedded_exif_wkb ADD COLUMN gpsCoordinateWKB BLOB NULL AFTER gpsCoordinate;
UPDATE tbmi_metadata_embedded_exif_wkb SET gpsCoordinateWKB = ST_AsWKB(gpsCoordinate) WHERE gpsCoordinate IS NOT NULL;
SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKB) FROM tbmi_metadata_embedded_exif_wkb;
-- expecting the same count
SELECT COUNT(*) FROM tbmi_metadata_embedded_exif_wkb
WHERE (gpsCoordinate IS NULL AND gpsCoordinateWKB IS NOT NULL)
   OR (gpsCoordinateWKB IS NULL AND gpsCoordinate IS NOT NULL);
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_metadata_embedded_exif_wkb DROP COLUMN gpsCoordinate;

清理:

DROP TABLE tbmi_metadata_embedded_exif_wkb;

10.5.1.
pgsql端准备重新导入

DROP TABLE tbm."tbmi_metadata_embedded_exif" CASCADE;

类比7.1.5.导入mysql表tbmi_metadata_embedded_exif_wkb

CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_metadata_embedded_exif_wkb"()
    INHERITS (tbm."tbmi_metadata_embedded_exif_wkb")
    SERVER mysql
    OPTIONS (dbname 'tbm', table_name 'tbmi_metadata_embedded_exif_wkb');

10.5.2. 由于根据10.该表中有4个从datetimemysql类型列转换来的timestamptzpgsql类型列因而需要先7.2.1.再重新执行10.1.校验

  "dateTime" timestamp with time zone,
  "dateTimeDigitized" timestamp with time zone,
  "dateTimeOriginal" timestamp with time zone,
  "gpsDateTime" timestamp with time zone,

10.5.3. 10.5.1.后有pgsql表结构

create table tbm.tbmi_metadata_embedded_exif_wkb (
  "imageId" bigint primary key not null,
  orientation text,
  "imageDescription" text,
  "userComment" bytea,
  artist text,
  "xpAuthor" bytea,
  copyright text,
  "imageUniqueId" text,
  "bodySerialNumber" text,
  make text,
  model text,
  software text,
  "customRendered" smallint,
  "dateTime" timestamp with time zone,
  "dateTimeOffset" text,
  "dateTimeDigitized" timestamp with time zone,
  "dateTimeDigitizedOffset" text,
  "dateTimeOriginal" timestamp with time zone,
  "dateTimeOriginalOffset" text,
  "offsetTime" text,
  "offsetTimeDigitized" text,
  "offsetTimeOriginal" text,
  "gpsDateTime" timestamp with time zone,
  "gpsCoordinateWKB" bytea,
  "gpsImgDirection" double precision,
  "gpsImgDirectionRef" text,
  "rawBytes" bytea,
  "xxHash3" bytea not null
);

由于pgsql长期以来无法像mysql那样一键移动 https://stackoverflow.com/questions/6805426/how-to-move-columns-in-a-mysql-table 列顺序 https://stackoverflow.com/questions/285733/how-do-i-alter-the-position-of-a-column-in-a-postgresql-database-table http://wiki.postgresql.org/wiki/Alter_column_position
因而需要手动创建

-create table tbm.tbmi_metadata_embedded_exif_wkb (
+create table tbm.tbmi_metadata_embedded_exif (
-  "gpsCoordinateWKB" bytea,
+  "gpsCoordinate" point,

10.5.4. 如同10.5.pgsql端

DROP FOREIGN TABLE mysql."tbmi_metadata_embedded_exif_wkb"; -- 避免`4.3.2.`
ALTER TABLE tbm."tbmi_metadata_embedded_exif_wkb" ADD COLUMN "gpsCoordinate" POINT;
UPDATE tbm."tbmi_metadata_embedded_exif_wkb" SET "gpsCoordinate" = st_pointfromwkb("gpsCoordinateWKB")::point WHERE "gpsCoordinateWKB" IS NOT NULL;
SELECT COUNT("gpsCoordinate"), COUNT("gpsCoordinateWKB") FROM tbm."tbmi_metadata_embedded_exif_wkb";
-- expecting the same count
SELECT COUNT(*) FROM tbm."tbmi_metadata_embedded_exif_wkb"
WHERE ("gpsCoordinate" IS NULL AND "gpsCoordinateWKB" IS NOT NULL)
   OR ("gpsCoordinateWKB" IS NULL AND "gpsCoordinate" IS NOT NULL);
-- expecting 0 rows returned, then:
ALTER TABLE tbm."tbmi_metadata_embedded_exif_wkb" DROP COLUMN "gpsCoordinateWKB";
INSERT INTO tbm."tbmi_metadata_embedded_exif"
      ("imageId", orientation, "imageDescription", "userComment", artist, "xpAuthor", copyright, "imageUniqueId", "bodySerialNumber", make, model, software, "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3")
SELECT "imageId", orientation, "imageDescription", "userComment", artist, "xpAuthor", copyright, "imageUniqueId", "bodySerialNumber", make, model, software, "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3"
FROM tbm."tbmi_metadata_embedded_exif_wkb";

10.5.5. 按10.1.校验

-SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON
+SELECT COUNT(*) FROM tbm."tbmi_metadata_embedded_exif_wkb" a FULL OUTER JOIN tbm."tbmi_metadata_embedded_exif" b ON

清理:

DROP TABLE tbmi_metadata_embedded_exif_wkb;
VACUUM FULL tbmi_metadata_embedded_exif;

10.6. 然而10.3.1.中的那10行中除POINT(30.629941666666667 NaN)外的⑨行仍然无法st_pointfromwkb()变为NULL
10.3.5.会对POINT(30.629941666666667 NaN)外的⑨行呈现假阴性 https://trac.osgeo.org/postgis/ticket/5718

SELECT "gpsCoordinate",
       "gpsCoordinateWKB",
       st_isvalid(st_pointfromwkb("gpsCoordinateWKB"))
FROM tbmi_metadata_embedded_exif_wkb
WHERE ("gpsCoordinate" IS NULL AND "gpsCoordinateWKB" IS NOT NULL)
   OR ("gpsCoordinateWKB" IS NULL AND "gpsCoordinate" IS NOT NULL);

暂且理解为ST_IsValid()的实现不同
何谓10.4.所装ST_*()函数背后的超时空精神 https://stackoverflow.com/questions/7234679/what-is-st-in-postgis https://gis.stackexchange.com/questions/286808/what-does-the-st-prefix-stand-for-e-g-in-st-polygonize

10.6.1. 理论上可以将point类型列拆分存储为两个IEEE754类型x y列因为pgsql不像mysql https://stackoverflow.com/questions/41936403/mysql-ieee-floating-point-nan-positiveinfinity-negativeinfinity ,支持 https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-FLOAT 存储IEEE754中3大臭名昭著浪费空间 https://stackoverflow.com/questions/19800415/why-does-ieee-754-reserve-so-many-nan-values https://stackoverflow.com/questions/18118408/what-is-the-difference-between-quiet-nan-and-signaling-nan 的特殊值 https://en.wikipedia.org/wiki/NaN $+\infty$ $-\infty$

10.6.2. 最终选择了在mysql表中将所有10行都

UPDATE tbmi_metadata_embedded_exif
SET gpsCoordinate = NULL
WHERE gpsCoordinate IS NOT NULL AND ST_AsWKT(gpsCoordinate) IS NULL;
10 rows affected in 10 s 455 ms
SELECT imageId, gpsCoordinate, st_isvalid(gpsCoordinate)
FROM tbmi_metadata_embedded_exif
WHERE gpsCoordinate IS NOT NULL AND st_isvalid(gpsCoordinate) != 1
ORDER BY gpsCoordinate;
0 rows retrieved in 831 ms (execution: 778 ms, fetching: 53 ms)

后重新执行10.5.10.5.5.

  1. 回顾3.可见pgloader在pgsql端按照2.3.mysql://源下默认的CASTING会将AUTO_INCREMENTPKint及其变种长度类型列转换为serial类型列 https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL
-  `id` int unsigned NOT NULL AUTO_INCREMENT,
+  id bigint primary key not null default nextval('tbmc_f97650_thread_id_seq'::regclass),

11.1. 而其早已被彻底批倒批臭 https://stackoverflow.com/questions/55300370/postgresql-serial-vs-identity
并在pgsql 10中实现了isosql中的identify可使用 https://www.2ndquadrant.com/en/blog/postgresql-10-identity-columns/ 中的 https://en.wikipedia.org/wiki/PL/pgSQL 3.3.命令式存储过程一键转换

CREATE OR REPLACE FUNCTION upgrade_serial_to_identity(tbl regclass, col name)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
  colnum smallint;
  seqid oid;
  count int;
BEGIN
  -- find column number
  SELECT attnum INTO colnum FROM pg_attribute WHERE attrelid = tbl AND attname = col;
  IF NOT FOUND THEN
    RAISE EXCEPTION 'column does not exist';
  END IF;

  -- find sequence
  SELECT INTO seqid objid
    FROM pg_depend
    WHERE (refclassid, refobjid, refobjsubid) = ('pg_class'::regclass, tbl, colnum)
      AND classid = 'pg_class'::regclass AND objsubid = 0
      AND deptype = 'a';

  GET DIAGNOSTICS count = ROW_COUNT;
  IF count < 1 THEN
    RAISE EXCEPTION 'no linked sequence found';
  ELSIF count > 1 THEN
    RAISE EXCEPTION 'more than one linked sequence found';
  END IF;

  -- drop the default
  EXECUTE 'ALTER TABLE ' || tbl || ' ALTER COLUMN ' || quote_ident(col) || ' DROP DEFAULT';

  -- change the dependency between column and sequence to internal
  UPDATE pg_depend
    SET deptype = 'i'
    WHERE (classid, objid, objsubid) = ('pg_class'::regclass, seqid, 0)
      AND deptype = 'a';

  -- mark the column as identity column
  UPDATE pg_attribute
    SET attidentity = 'd'
    WHERE attrelid = tbl
      AND attname = col;
END;
$$;

11.2. mysql端一键生成针对便乘serialAUTO_INCREMENT调用上述转换的pgsql

SELECT CONCAT(
    'SELECT upgrade_serial_to_identity(\'tbm."',
    TABLE_NAME,
    '"\', \'',
    COLUMN_NAME,
    '\');'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND EXTRA LIKE '%auto_increment%'
  AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;

REGEXP部分同2.1.

@n0099
Copy link
Owner Author

n0099 commented Apr 28, 2024

7.2.1. 由于timestamptz实际上并没有额外存储时区UTC offset stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql 因而其并不能解决 z.n0099.net/#narrow/near/86397 old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz news.ycombinator.com/item?id=20212671 中争论的带时区引用未来指定datetime问题 w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_Events

The SQL data types 'date', 'time', and 'timestamp' are field based time values which are intended to be zone offset independent: they are actually, technically, floating time values! The data type 'timestamp with time zone' is the zone offset-dependent equivalent of 'timestamp' in SQL.

w3.org/International/wiki/WorkingWithTimeZones#Floating_Time

  • Floating Time
    Some observed time values are not related to a specific moment in incremental time. Instead, they need to be combined with local information to determine a range of acceptable incremental time values. We refer to these sorts of time values as "floating times". Floating times are not attached and should never be attached to a particular time zone.
    Examples of floating time events include a user’s birth date, an employee’s hire or termination date, or a list of company holidays.
    For example, suppose that January 1st is considered a holiday in your application. The day "January 1" has a floating-time status as a "holiday". That "day" can begin as early as midnight GMT-14:00 and end as late as midnight of January 2 GMT+12:00, depending on local time. This covers an incremental time range of fifty hours.

https://news.ycombinator.com/item?id=19500640
https://news.ycombinator.com/item?id=30659164
https://www.roji.org/storing-timezones-in-the-db
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#TIMESTAMP_WITH_TIME_ZONE

@n0099
Copy link
Owner Author

n0099 commented Apr 29, 2024

9.2. 实际上所谓的opencv_imghash Starry-OvO/aiotieba#63 (comment)averageHash

{AverageHash.Create(), (image, bytes) => image.AverageHash = BitConverter.ToUInt64(bytes)},

本质上更类似于 en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有可运算的正整数语义之类型存储而应直接视作斑点二进制 en.wikipedia.org/wiki/Binary_large_object

ALTER TABLE tbmi_hash ADD COLUMN averageHash_2 BINARY(8) NOT NULL AFTER averageHash;
UPDATE tbmi_hash SET averageHash_2 = UNHEX(LPAD(HEX(averageHash), 16, '00')); -- https://stackoverflow.com/questions/58611436/how-do-you-convert-an-integer-into-its-condensed-binary-equivalent-e-g-binary3/58657797#comment138021380_58657797
SELECT COUNT(*) FROM tbmi_hash WHERE averageHash_2 != UNHEX(LPAD(HEX(averageHash), 16, '00'));
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_hash DROP COLUMN averageHash, RENAME COLUMN averageHash_2 TO averageHash;

9.2.1. 事实上

{PHash.Create(), (image, bytes) => image.PHash = BitConverter.ToUInt64(bytes)},
{AverageHash.Create(), (image, bytes) => image.AverageHash = BitConverter.ToUInt64(bytes)},

已经使得原byte[]被反转地解释为了小端序
sharplab.io

using System;
using System.Linq;

byte[] big = [0, 1, 2, 3, 4, 5, 6, 7];
var small = big.Reverse().ToArray();
Console.WriteLine(Convert.ToHexString(big)); // 0001020304050607
Console.WriteLine(BitConverter.ToUInt64(big, 0)); // 506097522914230528
Console.WriteLine();
Console.WriteLine(Convert.ToHexString(small)); // 0706050403020100
Console.WriteLine(BitConverter.ToUInt64(small, 0)); // 283686952306183
SELECT UNHEX(LPAD(HEX(506097522914230528), 16, '00')),
       UNHEX(LPAD(HEX(283686952306183), 16, '00'));
UNHEX(LPAD(HEX(506097522914230528), 16, '00')) UNHEX(LPAD(HEX(283686952306183), 16, '00'))
0x0706050403020100 0x0001020304050607

而其他地方更常用的xxhash3 https://z.n0099.net/#narrow/near/88989 反而的确是大端序
https://source.dot.net/#System.IO.Hashing/System/IO/Hashing/XxHash3.cs,19a80b2f50453190,references
https://github.com/dotnet/runtime/blob/92ca5f3227ada3f1620cd9b0ea631491303e4148/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L73

https://dba.stackexchange.com/questions/156700/reverse-byte-order-of-a-postgres-bytea-field/156706#156706

SELECT t.b, text_reverse, decode(text_reverse, 'hex') AS bytea_reverse
FROM   (SELECT '\x0001020304050607'::bytea b) t
LEFT   JOIN LATERAL (
   SELECT string_agg(to_hex(get_byte(b, x)), '') AS text_reverse
   FROM   generate_series(octet_length(t.b) - 1, 0, -1) x
   ) x ON true;
b text_reverse bytea_reverse
0x0001020304050607 76543210 0x76543210

https://stackoverflow.com/questions/11142235/convert-bigint-to-bytea-but-swap-the-byte-order/26494661#26494661

CREATE OR REPLACE FUNCTION reverse(bytea) RETURNS bytea AS $reverse$
    SELECT string_agg(byte,''::bytea)
      FROM (
        SELECT substr($1,i,1) byte
          FROM generate_series(length($1),1,-1) i) s
$reverse$ LANGUAGE sql;
SELECT reverse('\x0001020304050607'::bytea);
DROP FUNCTION reverse(bytea);
reverse
0x0706050403020100

mysql端

UPDATE tbmi_hash SET pHash = REVERSE(pHash), averageHash = REVERSE(averageHash);
-- 14,322,348 rows affected in 20 m 6 s 174 ms

pgsql端

UPDATE tbmi_hash SET "pHash" = reverse("pHash"), "averageHash" = reverse("averageHash");
-- 14,322,348 rows affected in 28 m 56 s 671 ms

n0099 added a commit that referenced this issue May 2, 2024
…`8.1.2.` @ `CrawlerDbContext..OnModelCreating()`

@ crawler

* converting string fields that may contain `0x00` to utf8 bytes: #48 `8.1.2.` @ `ImagePipelineDbContext.OnModelCreating()`
* now will convert point with invalid coordinate to null: #48 `10.3.1.` @ `MetadataConsumer.ExifGpsTagValuesParser.ParseGpsCoordinateOrNull()`
@ imagePipeline
@ c#
@n0099
Copy link
Owner Author

n0099 commented May 2, 2024

2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql

SELECT * FROM pg_stat_activity WHERE application_name = 'pgloader';

可根据表大小猜测进度 stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505

SELECT table_schema, table_name, pg_relation_size('"'||table_schema||'"."'||table_name||'"')::float8 / 1024 / 1024 "MiB"
FROM information_schema.tables
WHERE table_schema = 'tbm'
ORDER BY 3 DESC;

https://www.postgresql.org/docs/current/functions-admin.html
https://stackoverflow.com/questions/41991380/whats-the-difference-between-pg-table-size-pg-relation-size-pg-total-relatio

WITH t AS (
    SELECT table_schema, table_name,
           ('"' || table_schema || '"."' || table_name ||'"')::regclass regclass
    FROM information_schema.tables
    WHERE table_schema = 'tbm'
)
SELECT table_schema, table_name,
       pg_relation_filepath(regclass),
       pg_indexes_size(regclass) / 1024 / 1024 i,
       pg_table_size(regclass) / 1024 / 1024 t,
       pg_relation_size(regclass) / 1024 / 1024 r,
       pg_total_relation_size(regclass) / 1024 / 1024 tr
FROM t
ORDER BY tr DESC;

@n0099
Copy link
Owner Author

n0099 commented May 2, 2024

  1. 设有表tbmc_f97650_reply结构
create table tbmc_f97650_reply (
  id bigint primary key not null default nextval('tbmc_f97650_reply_id_seq'::regclass),
  tid bigint not null,
  pid bigint not null,
  floor bigint not null,
  "authorUid" bigint not null,
  "subReplyCount" bigint,
  "postedAt" bigint not null,
  "isFold" smallint,
  "agreeCount" integer,
  "disagreeCount" integer,
  geolocation bytea,
  "signatureId" bigint,
  "createdAt" bigint not null,
  "updatedAt" bigint,
  "lastSeenAt" bigint
);
create unique index idx_23148_pid on tbmc_f97650_reply using btree (pid);
create index "idx_23148_postTime" on tbmc_f97650_reply using btree ("postedAt");
create index "idx_23148_authorUid" on tbmc_f97650_reply using btree ("authorUid");
create index idx_23148_tid on tbmc_f97650_reply using btree (tid);

12.1. 其中业务无关自增PKid属于 https://en.wikipedia.org/wiki/Surrogate_key
此前在mysql中使用此类id代理键是由于mysql innodb基于clustered index=primary index=PK https://dba.stackexchange.com/questions/195288/what-is-the-formal-definition-of-a-primary-index-and-primary-key 因而INSERT不连续的PK值会使其不断重新平衡b+tree以保持PK序与行在表中的物理存储序同步 https://stackoverflow.com/questions/1251636/what-do-clustered-and-non-clustered-index-actually-mean
而众所周知pgsql如同老mysql myisam使用heap table因而无需不断重新平衡b+tree https://use-the-index-luke.com/sql/clustering/index-organized-clustered-index
这也是为什么有3.2.迫真假随机序因为行物理存储序在无序的heap而非良序stack aka clusteredindex中本就是随机的 https://www.postgresql.org/docs/current/indexes-index-only-scans.html
但正义执行 https://www.postgresql.org/docs/current/sql-cluster.html 可以在指定索引列中新的不连续值又INSERT/UPDATE写入前让行物理存储序指定索引列同步为一致的良序

12.2. 事实核查:正义削除该id代理键再加上嵌入于表名中用于水平分表 https://en.wikipedia.org/wiki/Shard_(database_architecture)fid可节省 $1-(1754/1978) \approx 11.3\%$表大小

CREATE TABLE tbmc_reply (fid int NOT NULL, LIKE tbmc_f97650_reply INCLUDING ALL);
ALTER TABLE tbmc_reply DROP COLUMN id, ADD PRIMARY KEY (pid);
DROP INDEX tbmc_reply_pid_idx; -- 重复的UNIQUE(pid)
INSERT INTO tbmc_reply
                (fid, tid, pid, floor, "authorUid", "subReplyCount", "postedAt", "isFold", "agreeCount", "disagreeCount", geolocation, "signatureId", "createdAt", "updatedAt", "lastSeenAt")
SELECT 97650 AS  fid, tid, pid, floor, "authorUid", "subReplyCount", "postedAt", "isFold", "agreeCount", "disagreeCount", geolocation, "signatureId", "createdAt", "updatedAt", "lastSeenAt"
FROM tbmc_f97650_reply;
VACUUM FULL tbmc_reply;

12.3. 而进一步削除fid梦回此前的水平分表之fid嵌入表名

ALTER TABLE tbmc_reply DROP COLUMN fid;
VACUUM FULL tbmc_reply;

也只会进一步节省 $1-(1754/1675) \approx 4.7\%$80MiB

@n0099
Copy link
Owner Author

n0099 commented May 3, 2024

11.2. mysql端一键生成针对便乘serialAUTO_INCREMENT调用上述转换的pgsql

11.3.

SELECT *
FROM information_schema.columns
WHERE table_schema = 'tbm'
  AND column_default IS NOT NULL;
  1. 一键生成varchar(n)text
SELECT 'ALTER TABLE tbm."' || table_name || '" ALTER COLUMN "' || column_name || '" TYPE text;'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND data_type = 'character varying';

而索引中的列长度似乎没有被pgloader迁移过来

@n0099
Copy link
Owner Author

n0099 commented May 4, 2024

  1. https://pgloader.readthedocs.io/en/latest/ref/mysql.html#mysql-database-migration-options-with

uniquify index names, preserve index names

MySQL index names are unique per-table whereas in PostgreSQL index names have to be unique per-schema. The default for pgloader is to change the index name by prefixing it with idx_OID where OID is the internal numeric identifier of the table the index is built against.

In somes cases like when the DDL are entirely left to a framework it might be sensible for pgloader to refrain from handling index unique names, that is achieved by using the preserve index names option.

The default is to uniquify index names.

Even when using the option preserve index names, MySQL primary key indexes named “PRIMARY” will get their names uniquified. Failing to do so would prevent the primary keys to be created again in PostgreSQL where the index names must be unique per schema.

一键生成eval重命名索引把所有索引重命名为$表名_$字段1_$字段2_...再把PK索引重命名为_pkey后缀 https://stackoverflow.com/questions/65069778/postgresql-rename-table-cascade-renaming-to-objects-like-indices-constraints/78485773#78485773

-- https://stackoverflow.com/questions/65069778/postgresql-rename-table-cascade-renaming-to-objects-like-indices-constraints/78485773#78485773
DO $$DECLARE r record;
BEGIN
    FOR r IN
        SELECT 'ALTER INDEX "' || schema_name || '"."' || index_name ||
               '" RENAME TO "' || new_index_name || '";' sql, *
        FROM (
            SELECT schema_name, table_name, index_name,
                   table_name ||
                   CASE constraint_type
                       -- only uncomment below line when using postgresql<12 https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment109084417_4108266
                       -- WHEN 'f' THEN '_' || columns[1] -- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment127573826_4108266
                       WHEN 'p' THEN '' -- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment127553786_4108266
                       ELSE '_' || string_agg(columns, '_')
                   END || '_' ||
                   -- https://stackoverflow.com/questions/15417167/case-when-null-evaluates-to-false
                   CASE COALESCE(constraint_type, 'NULL')
                       -- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names/4108266#4108266
                       -- https://gist.github.com/popravich/d6816ef1653329fb1745
                       -- https://stackoverflow.com/questions/8674562/postgresql-is-it-possible-to-provide-custom-name-for-primary-key-or-unique/8674640#8674640
                       WHEN 'c' THEN 'check'
                       WHEN 'f' THEN 'fkey'
                       WHEN 'p' THEN 'pkey'
                       WHEN 'u' THEN 'key'
                       WHEN 'x' THEN 'excl'
                       WHEN 'NULL' THEN 'idx'
                   END new_index_name
            FROM (
                SELECT n.nspname schema_name, t.relname table_name, i.relname index_name, c.contype constraint_type, a.attname columns
                FROM pg_index x
                -- https://stackoverflow.com/questions/37329561/how-to-list-indexes-created-for-table-in-postgres/37330092#37330092
                JOIN pg_class t ON t.oid = x.indrelid
                JOIN pg_namespace n ON n.oid = t.relnamespace
                JOIN pg_class i ON i.oid = x.indexrelid
                -- https://stackoverflow.com/questions/55447819/how-to-get-the-list-of-column-names-for-all-indexes-on-a-table-in-postgresql/55448707#55448707
                JOIN pg_attribute a ON a.attrelid = x.indrelid AND a.attnum = ANY(x.indkey)
                -- https://www.postgresql.org/docs/current/catalog-pg-constraint.html
                LEFT JOIN pg_constraint c ON c.conindid = x.indexrelid
                WHERE n.nspname = 'your_schema'
            ) t
            GROUP BY schema_name, table_name, index_name, constraint_type
        ) t
        WHERE index_name != new_index_name
    LOOP
         -- https://stackoverflow.com/questions/1348126/postgresql-modify-owner-on-all-tables-simultaneously-in-postgresql/37259655#37259655
         EXECUTE r.sql;
    END LOOP;
END$$;

12.2. 事实核查:正义削除该id代理键再加上嵌入于表名中用于水平分表 https://en.wikipedia.org/wiki/Shard_(database_architecture)fid可节省 1−(1754/1978)≈11.3% 的`表大小

12.4. 一键生成eval为主题帖 回复帖 楼中楼表削除id代理键并重设PK

DO $$DECLARE r record;
BEGIN
    FOR r IN
        SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
               '" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_tid";' c
        FROM information_schema.tables t
        WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_thread$'
        UNION
        SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
               '" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_pid";' c
        FROM information_schema.tables t
        WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_reply$'
        UNION
        SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
               '" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_spid";' c
        FROM information_schema.tables t
        WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_subReply$'
    LOOP
        EXECUTE r.c;
    END LOOP;
END$$;

@n0099
Copy link
Owner Author

n0099 commented May 7, 2024

7.1. 使用5.1.3.中的symmetric difference FULL OUTER JOIN row(table)

SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;

可得 https://stackoverflow.com/questions/47405732/why-does-postgresql-throw-full-join-is-only-supported-with-merge-joinable-or-ha

ERROR:  FULL JOIN is only supported with merge-joinable or hash-joinable join conditions

https://www.postgresql.org/message-id/flat/20060313104028.GB6714%40svana.org
https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#FULL_OUTER_JOIN_conditions

PostgreSQL currently limits such conditions to ones that are implementable without constructing explicit UNION operations; that is, the condition must be hashable, mergeable, or constant.
Adding support for arbitrary full joins (which would likely require constructing a UNION of the inner join and the two anti-joins) seems like a lot of work for minimal gain.

@n0099 n0099 closed this as completed May 15, 2024
@n0099 n0099 reopened this May 15, 2024
@n0099 n0099 closed this as completed in fa16951 May 18, 2024
@n0099 n0099 reopened this Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant