Add removeGarbageParametersFromURL function #6498

Open
alexey-milovidov opened this issue Aug 14, 2019 · 3 comments · May be fixed by #49305
Labels
feature minor Priority: minor warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency.

Comments

@alexey-milovidov
Member

alexey-milovidov commented Aug 14, 2019

Use case
Remove long base64- and hex-encoded parameters that look random. Example:

http://yandex.ru/clck/jsredir?from=yandex.ru;search%2F;web;;&text=&etext=1004&cst=AiuY0DBWFJ5Hyx_fyvalFPA3abBqdnSOApSiLPWwVkIeiz46AroRCQXrfJ8M5oYTWorWEWccK4Kw_QEhSDD6X4nGMT4OabEk0xnry4NtnOEzWFPU4iTzunSVVjkuY7CYolnD7hb04cMRv7iMnaO8LjNm0hxqvwN9sXCzUYeXp_muLsdY4W99_U5MJKGmz7IAmR5-ceoAoaBB2XGYAS9BTYKbbvlmneBpbf_SwAd_6OOACXtLmRXqXad3AQbcArYE8LCO0zmE9vpha3yoT0jl8pd9CUmbGZR5nA3sf5TcDFTpr5nYaOdxjmHep2cZeW3QHvPtKA2xWXW6qzGrQeZ1SEOPcJ1afJqmAHisup90hhNYyl2hxl8xn_DtCRbJqYHb88JtuQ3591EGW42wPZhSbxBFdU0KIZN3c_VZOmk6avzKzqG_kJpjPObWXbh9qs0S23WxDGCcPUrIzi3ESSLv1qgaRhqkfjBc57BFVA4RxlljpKQdeVeTbklJgqptznf1aHZQ2wYARBzC_jvv994MCTZIus_NctCMWoSaU74OaMmo0h5ScYLI2CWy6nj5PbhCrgeLsaEBVOQT9xoLSoCRfJ78xI_T1ruuD3QBJmHY6YW8f5UM36LRbzhd5vmNTPRvrs2wcCFhF_w&l10n=ru&cts=1458904227333&mc=1.584962500721156

Note: these parameters are usually not really "garbage", but they are impossible to decrypt and only waste space on disk. We already remove such parameters from Yandex, Google, etc. URLs before inserting into ClickHouse.

Implementation
For parameters of some minimal length, quickly calculate some statistics like:

  • the value contains only symbols from the base64 or URL-safe base64 alphabet;
  • the character distribution looks uniform.
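
The two checks above can be sketched in Python as follows. This is only an illustration of the heuristic, not the implementation from the linked pull request: the names `looks_like_garbage` and `remove_garbage_parameters`, the minimum length of 32, and the entropy threshold of 3.0 bits per character are all assumptions made for the example. Shannon entropy stands in for "distribution looks uniform"; other statistics could serve equally well.

```python
import math
from collections import Counter

# Standard and URL-safe base64 alphabets combined; hex digits are a subset.
BASE64_CHARS = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/-_="
)

MIN_LENGTH = 32    # assumed threshold, not specified in the issue
MIN_ENTROPY = 3.0  # assumed threshold, in bits per character


def shannon_entropy(value: str) -> float:
    """Entropy of the character distribution, in bits per character."""
    total = len(value)
    counts = Counter(value)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def looks_like_garbage(value: str) -> bool:
    """Long enough, base64-alphabet only, and close to uniformly distributed."""
    if len(value) < MIN_LENGTH:
        return False
    if any(ch not in BASE64_CHARS for ch in value):
        return False
    return shannon_entropy(value) >= MIN_ENTROPY


def remove_garbage_parameters(url: str) -> str:
    """Drop query parameters whose values look random; keep the rest verbatim."""
    url, hash_sep, fragment = url.partition("#")
    base, query_sep, query = url.partition("?")
    if not query_sep:
        return url + hash_sep + fragment
    kept = [
        param
        for param in query.split("&")
        if not looks_like_garbage(param.partition("=")[2])
    ]
    result = base + ("?" + "&".join(kept) if kept else "")
    return result + hash_sep + fragment


# The cst value is shortened here to its first 64 characters for readability.
cst = "AiuY0DBWFJ5Hyx_fyvalFPA3abBqdnSOApSiLPWwVkIeiz46AroRCQXrfJ8M5oYT"
url = (
    "http://yandex.ru/clck/jsredir?from=yandex.ru;search%2F;web;;&text=&etext=1004"
    "&cst=" + cst + "&l10n=ru&cts=1458904227333&mc=1.584962500721156"
)
print(remove_garbage_parameters(url))
```

With these thresholds only `cst` is dropped: `from` and `mc` contain characters outside the base64 alphabet, and the remaining values are shorter than the minimum length. The thresholds would need tuning on real data, since a long decimal identifier, for example, could also pass both checks.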
@alexey-milovidov alexey-milovidov added feature minor Priority: minor labels Aug 14, 2019
@alexey-milovidov alexey-milovidov changed the title Add removeGarbageParametersFromURL Add removeGarbageParametersFromURL function Aug 14, 2019
@stale stale bot added the stale label Oct 20, 2019
@blinkov blinkov removed the stale label Oct 22, 2019
@ClickHouse ClickHouse deleted a comment from stale bot Jul 1, 2020
@alexey-milovidov
Member Author

This function is needed for web analytics applications.

@alexey-milovidov alexey-milovidov added the warmup task The task for new ClickHouse team members. Low risk, moderate complexity, no urgency. label Jul 3, 2021
@mayank-17

mayank-17 commented May 21, 2023

Hi @alexey-milovidov, can you give a sample test case for the URL below?

http://yandex.ru/clck/jsredir?from=yandex.ru;search%2F;web;;&text=&etext=1004&cst=AiuY0DBWFJ5Hyx_fyvalFPA3abBqdnSOApSiLPWwVkIeiz46AroRCQXrfJ8M5oYTWorWEWccK4Kw_QEhSDD6X4nGMT4OabEk0xnry4NtnOEzWFPU4iTzunSVVjkuY7CYolnD7hb04cMRv7iMnaO8LjNm0hxqvwN9sXCzUYeXp_muLsdY4W99_U5MJKGmz7IAmR5-ceoAoaBB2XGYAS9BTYKbbvlmneBpbf_SwAd_6OOACXtLmRXqXad3AQbcArYE8LCO0zmE9vpha3yoT0jl8pd9CUmbGZR5nA3sf5TcDFTpr5nYaOdxjmHep2cZeW3QHvPtKA2xWXW6qzGrQeZ1SEOPcJ1afJqmAHisup90hhNYyl2hxl8xn_DtCRbJqYHb88JtuQ3591EGW42wPZhSbxBFdU0KIZN3c_VZOmk6avzKzqG_kJpjPObWXbh9qs0S23WxDGCcPUrIzi3ESSLv1qgaRhqkfjBc57BFVA4RxlljpKQdeVeTbklJgqptznf1aHZQ2wYARBzC_jvv994MCTZIus_NctCMWoSaU74OaMmo0h5ScYLI2CWy6nj5PbhCrgeLsaEBVOQT9xoLSoCRfJ78xI_T1ruuD3QBJmHY6YW8f5UM36LRbzhd5vmNTPRvrs2wcCFhF_w&l10n=ru&cts=1458904227333&mc=1.584962500721156

I am interested and would like to pick this up.

@alexey-milovidov
Member Author

@mayank-17 the function should return http://yandex.ru/clck/jsredir?from=yandex.ru;search%2F;web;;&text=&etext=1004&l10n=ru&cts=1458904227333&mc=1.584962500721156, because the cst parameter will be detected as garbage and removed.

But this task is already active; it is being implemented here: #49305

@rschu1ze rschu1ze linked a pull request May 29, 2023 that will close this issue
3 participants