Implement algorithm for typo discovery #89

polyntsov · 2022-02-20T16:07:24Z

Implements the initial workflow for typo discovery using approximate and precise FD mining algorithms. Refactors main.cpp and adds a new AlgoFactory module, a convenient interface for creating algorithm instances. Introduces namespace algos (part of #63) for entities in src/algorithms/ (eventually all code from src/algorithms/ needs to be placed into this namespace).

src/algorithms/FDAlgorithm.h

src/algorithms/AlgoFactory.h

polyntsov · 2022-02-21T16:09:14Z

src/algorithms/Algorithms.h

+#include "algorithms/Fd_mine.h"
+#include "algorithms/Pyro.h"
+#include "algorithms/TaneX.h"
+


algorithms/TypoMiner.h should probably be added here.

That depends on what you had in mind when creating this file.

If it is a list of all algorithms, meant to simplify includes in files that need every algorithm, then it should be added.

If it is only for FD algorithms, it should be renamed to FDAlgorithms.h, and TypoMiner should not be added.

Such grouping may make the includes easier to read, but I would probably remove this header altogether: as long as it is only used in AlgoFactory, all the algorithms can be included there individually, once.

I originally created it as a header with all the algorithms.
It seems such a header could be needed anywhere that acts as some kind of client of the desbordante lib: the backend of the web application, main of the console desbordante, the desbordante tests. However, all such clients should in principle use AlgoFactory (although currently they do not). We could include all the required algorithm headers in AlgoFactory and rely on transitive inclusion.

polyntsov · 2022-02-21T16:09:55Z

src/algorithms/TypoMiner.h

+template <typename PreciseAlgo, typename ApproxAlgo>
+TypoMiner<PreciseAlgo, ApproxAlgo>::TypoMiner(Config const& config)
+    : Primitive(config.data, config.separator, config.has_header,
+                {"Precise fd algorithm execution", "Approximate fd algoritm execution",


We need to either implement progress reporting or pass an empty vector here.

Mstrutov

Looks very good. I would still like to study TypoMiner and Configuration more closely, but if the functionality is already needed, it can be merged.

Mstrutov · 2022-02-23T15:55:16Z

src/algorithms/AlgoFactory.h

+    }
+}
+
+/* Really cumbersome, also copying parameter names and types throughout the project


We could introduce a struct with static fields like
public static kIsNullEqualNull = "is_null_equal_null";
or an equivalent enum, if BetterEnums supports that.

That idea came up, but it is unclear whether we need a struct/enum per primitive type with its own options, or one big struct for all possible options. In the latter case it probably makes sense to do something more than a plain enum, something that also holds option descriptions and types in some form (what is currently hardcoded in main.cpp).
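A minimal sketch of the "one big struct" variant discussed above, assuming C++17. Only the name `is_null_equal_null` comes from the thread; `OptionInfo`, `kAllOptions`, `FindOption`, the other option names and all descriptions are hypothetical and only illustrate how names and descriptions could live in one place.

```cpp
#include <array>
#include <string_view>

// Hypothetical option descriptor: a name plus a human-readable description.
struct OptionInfo {
    std::string_view name;
    std::string_view description;
};

// One central place for option names. Only "is_null_equal_null" is taken from
// the discussion; the other entries and all descriptions are illustrative.
struct OptionNames {
    static constexpr std::string_view kIsNullEqualNull = "is_null_equal_null";
    static constexpr std::string_view kSeparator = "separator";
    static constexpr std::string_view kHasHeader = "has_header";
};

inline constexpr std::array<OptionInfo, 3> kAllOptions{{
    {OptionNames::kIsNullEqualNull, "treat two NULL values as equal"},
    {OptionNames::kSeparator, "CSV column separator"},
    {OptionNames::kHasHeader, "whether the CSV file has a header row"},
}};

// Returns the descriptor for `name`, or nullptr if the option is unknown.
constexpr OptionInfo const* FindOption(std::string_view name) {
    for (OptionInfo const& option : kAllOptions) {
        if (option.name == name) return &option;
    }
    return nullptr;
}
```

With such a table, a command-line front end could iterate over `kAllOptions` to register options instead of repeating string literals in several files.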

src/algorithms/FDAlgorithm.h

Mstrutov · 2022-02-23T16:28:56Z

Mstrutov

OK

A separate module for creating algorithm instances avoids code duplication between the places that need to create algorithm objects, e.g. main() of the CLI desbordante app and the backend of the desbordante web app. This approach also scales much better when new algorithms are added.

Introduce a base class for FDAlgorithm and move there the input_generator_ field and the methods and fields for working with the progress bar. These are common to all primitives to be implemented, such as conditional functional dependencies, association rules and more. Add a new constructor from ColumnRelationLayoutData to the algorithms inherited from PliBasedFDAlgorithm. Also make the algorithms' ownership of the relation shared rather than unique. This helps implement primitives that use algorithms as building blocks: such primitives need to execute multiple algorithms over the same relation while also owning that relation themselves.
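The shared-ownership idea can be sketched roughly as follows. `Relation`, `Algorithm` and `CompositePrimitive` are hypothetical stand-ins rather than the project's classes; the sketch only shows how `std::shared_ptr` lets a primitive and its building-block algorithms hold the same relation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the relation data that the algorithms work on.
struct Relation {
    std::vector<std::vector<std::string>> rows;
};

// Hypothetical algorithm that shares ownership of its input relation.
class Algorithm {
public:
    explicit Algorithm(std::shared_ptr<Relation> relation) : relation_(std::move(relation)) {
        assert(relation_ != nullptr);
    }

    std::size_t CountRows() const { return relation_->rows.size(); }

private:
    std::shared_ptr<Relation> relation_;
};

// A primitive built from two algorithms: it owns the relation itself and
// hands the same object to both building blocks without copying the data.
class CompositePrimitive {
public:
    explicit CompositePrimitive(std::shared_ptr<Relation> relation)
        : relation_(std::move(relation)), precise_(relation_), approximate_(relation_) {}

    bool AlgorithmsSeeSameData() const {
        return precise_.CountRows() == approximate_.CountRows();
    }

    long OwnerCount() const { return relation_.use_count(); }

private:
    // Declared first so it is initialized before the algorithms that copy it.
    std::shared_ptr<Relation> relation_;
    Algorithm precise_;
    Algorithm approximate_;
};
```

With unique ownership, the second algorithm could not receive the relation without a copy or a transfer away from the first one, which is the limitation the commit message describes.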

polyntsov requested a review from Mstrutov February 20, 2022 16:07

polyntsov added the feature Provides new functionality label Feb 20, 2022

polyntsov commented Feb 21, 2022

src/algorithms/FDAlgorithm.h

polyntsov commented Feb 21, 2022

src/algorithms/AlgoFactory.h

polyntsov commented Feb 21, 2022

Mstrutov approved these changes Feb 23, 2022

polyntsov force-pushed the workflow branch from a446d3d to 958453d Compare February 27, 2022 20:20

polyntsov requested a review from Mstrutov February 27, 2022 20:25

Mstrutov approved these changes Mar 8, 2022

polyntsov added 19 commits March 8, 2022 14:25

Add better_enum library

f4f7ede

Refactor CSVParser

dbcc74c

Implement methods in CSVParser to get a line from the file by its number. This can be used to print the results of the typo miner primitive. Use Boost algorithms to parse a CSV string instead of the hand-written algorithm, which is less reliable.

Fix PliBasedFDAlgorithm::relation_ initialization

a40a851

Add aliases to PositionListIndex

968ef1f

Improves readability

Move fd algorithm configuration params to one struct

2d67126

Refactor main.cpp and algo_factory.h: implement a method that creates an algorithm instance from a map of parameters, which should be easier to maintain.

Replace switch case in algo_factory with templates

ae379a2

Reduces code duplication and simplifies algo_factory interface.
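A rough sketch of what these two commits describe: creating an algorithm instance from a map of parameters, with a function template instead of one switch branch per algorithm. Everything here is a simplified assumption: the types are stand-ins, the string-to-string `ParamsMap` and the names `CreateConfig`, `CreateInstance` and `CreateAlgorithm` are hypothetical, and the actual algo_factory.h may look quite different.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>

// All types below are simplified, hypothetical stand-ins, not the project's classes.
struct Config {
    std::string data;
    char separator = ',';
    bool has_header = true;
};

struct Primitive {
    virtual ~Primitive() = default;
    virtual std::string Name() const = 0;
};

struct Tane : Primitive {
    explicit Tane(Config const&) {}
    std::string Name() const override { return "tane"; }
};

struct Pyro : Primitive {
    explicit Pyro(Config const&) {}
    std::string Name() const override { return "pyro"; }
};

using ParamsMap = std::unordered_map<std::string, std::string>;

// Builds the shared configuration once instead of repeating it per algorithm.
Config CreateConfig(ParamsMap const& params) {
    Config config;
    config.data = params.at("data");  // required; throws std::out_of_range if absent
    if (auto it = params.find("separator"); it != params.end() && !it->second.empty()) {
        config.separator = it->second.front();
    }
    if (auto it = params.find("has_header"); it != params.end()) {
        config.has_header = (it->second == "true");
    }
    return config;
}

// One template replaces a hand-written switch branch per algorithm.
template <typename Algo>
std::unique_ptr<Primitive> CreateInstance(ParamsMap const& params) {
    return std::make_unique<Algo>(CreateConfig(params));
}

std::unique_ptr<Primitive> CreateAlgorithm(std::string const& name, ParamsMap const& params) {
    using Creator = std::unique_ptr<Primitive> (*)(ParamsMap const&);
    static std::unordered_map<std::string, Creator> const creators{
        {"tane", &CreateInstance<Tane>},
        {"pyro", &CreateInstance<Pyro>},
    };
    auto it = creators.find(name);
    if (it == creators.end()) {
        throw std::invalid_argument("unknown algorithm: " + name);
    }
    std::unique_ptr<Primitive> algorithm = it->second(params);
    assert(algorithm != nullptr);  // creators always return a constructed object
    return algorithm;
}
```

In this layout, adding an algorithm means adding one entry to `creators`; both a CLI `main()` and a web backend could call `CreateAlgorithm` with their own parameter maps.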

Add virtual destructor to Primitive

068c3e2

Implement method to convert container of fds to json

7ded798

Implement a static FDAlgorithm method to convert a container of FDs to JSON. Generally useful functionality, e.g. for testing.

Implement operator== for RelationalSchema

280b074

In the typo mining workflow, several algorithms work on the same table and can have different RelationalSchema objects, so we cannot just compare pointers to RelationalSchema. Implement operator== for RelationalSchema objects to address this issue.
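The difference between pointer comparison and structural comparison can be sketched like this. The class below is a simplified, hypothetical schema (a relation name plus ordered column names); which fields the project's operator== actually compares is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified, hypothetical schema: a relation name plus ordered column names.
class RelationalSchema {
public:
    explicit RelationalSchema(std::string name) : name_(std::move(name)) {}

    void AppendColumn(std::string column_name) { columns_.push_back(std::move(column_name)); }

    std::string const& GetColumnName(std::size_t index) const {
        assert(index < columns_.size());
        return columns_[index];
    }

    // Structural equality: same relation name and same columns in the same order.
    friend bool operator==(RelationalSchema const& lhs, RelationalSchema const& rhs) {
        return lhs.name_ == rhs.name_ && lhs.columns_ == rhs.columns_;
    }

    friend bool operator!=(RelationalSchema const& lhs, RelationalSchema const& rhs) {
        return !(lhs == rhs);
    }

private:
    std::string name_;
    std::vector<std::string> columns_;
};
```

Two algorithms that each build their own schema from the same table get distinct objects, so `&a == &b` is false while `a == b` holds.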

Implement Vertical::GetColumnIndicesAsVector()

e21b8c4

Useful for testing when you don't have a schema and therefore can't create your own Verticals to compare with the Verticals in the algorithm's result.
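A sketch of the idea, assuming a Vertical stores its columns as a bitset. The class below is a hypothetical stand-in that uses `std::vector<bool>`; only the method name `GetColumnIndicesAsVector` comes from the commit title, and its return type here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Simplified, hypothetical Vertical: a set of columns stored as a bitset.
class Vertical {
public:
    Vertical(std::size_t num_columns, std::initializer_list<unsigned> indices)
        : column_bits_(num_columns, false) {
        for (unsigned index : indices) {
            assert(index < num_columns);
            column_bits_[index] = true;
        }
    }

    // Returns the indices of the columns in this vertical, in ascending order.
    std::vector<unsigned> GetColumnIndicesAsVector() const {
        std::vector<unsigned> indices;
        for (std::size_t i = 0; i < column_bits_.size(); ++i) {
            if (column_bits_[i]) indices.push_back(static_cast<unsigned>(i));
        }
        return indices;
    }

private:
    std::vector<bool> column_bits_;
};
```

A test can then compare plain index vectors against an algorithm's result without constructing a schema first.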

Implement TypoMiner algorithm

406e752

Implement tests for TypoMiner

ca3a4e9

Rename files to meet styleguide

97e102e

Move better-enums library to lib/

6682db2

Add better-enums library to github workflow

a88d2d0

Use boost::mp11 to choose algorithm type at runtime

0d5dbb1

It is better than a hand-coded version of the same functionality.

Pass empty progress phase names vector in TypoMiner

4c88fea

Because the progress bar is not implemented yet.

polyntsov force-pushed the workflow branch from 73806a3 to 4c88fea Compare March 8, 2022 11:28

polyntsov requested a review from Mstrutov March 8, 2022 13:06

polyntsov force-pushed the workflow branch from d61eafc to eff23db Compare March 8, 2022 13:40

Mstrutov approved these changes Mar 8, 2022

polyntsov added 2 commits March 8, 2022 19:26

Implement pull_datasets.sh

988d971

Implement a script that pulls datasets using git lfs or, if git lfs fails because the free GitHub data quota is exceeded, retrieves the datasets from git history. This is a temporary script to work around the small GitHub git lfs data quota.

Pull datasets via pull_datasets.sh in GitHub Actions

db44f63

polyntsov force-pushed the workflow branch from eff23db to db44f63 Compare March 8, 2022 16:29

Mstrutov approved these changes Mar 8, 2022

Mstrutov merged commit 30699cc into Desbordante:main Mar 8, 2022

polyntsov deleted the workflow branch October 28, 2022 12:52

Mstrutov mentioned this pull request Jul 17, 2023

Refactor CSV Parser #96

Open
