Proposal to integrate a new clustering library #983

Open
lispc opened this Issue Apr 21, 2015 · 11 comments

Contributor

lispc commented Apr 21, 2015

Hi, I am a research student from the Database Group at Tsinghua University. We have developed a Java clustering library, https://github.com/lispc/EditDistanceClusterer, which is much faster than the simile-vicino library currently used in OpenRefine. I wonder whether it is possible to integrate this library into OpenRefine. What features / tests / performance reports are needed? (This is my first open-source pull request, so apologies for anything not done correctly.)

baditaflorin commented Apr 21, 2015

I don't know what the procedure is, but if it is faster, that is a good thing; I would love to test it once it is integrated into OpenRefine.
I run clustering on big datasets (15M rows) and usually have to wait around 40-50 seconds even with the simplest algorithm.

Here is the code for the current clustering methods: https://github.com/OpenRefine/OpenRefine/tree/9b2a506caada4ea6d580d2140184fef8caa7566c/main/src/com/google/refine/clustering/binning

Member

tfmorris commented Apr 21, 2015

Welcome - and thanks for the offer. We would love to have an additional high performance clustering algorithm. The existing simile-vicino algorithms are integrated here: https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/knn/kNNClusterer.java, but that integration assumes blocking is required. Does the new algorithm use no blocking at all, or just larger blocks? (I haven't had a chance to read the paper yet.)

If you have a paper or blog post which describes the performance, that would be sufficient for the performance side of things. For test coverage, we prefer to have test coverage for all new features (although we're not very good about enforcing that).

Next steps:

  • fork the OpenRefine repo
  • create a feature branch
  • integrate your code and tests
  • submit a pull request for us to review

If you have any questions, please ask on the openrefine-dev list. We look forward to your contribution!
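
For readers following the thread, here is a minimal, self-contained sketch of the kind of component under discussion: something that takes a column's distinct values plus a distance threshold and returns groups of similar values. The class and method names below are illustrative assumptions, not OpenRefine's actual clusterer interfaces, and the all-pairs loop is exactly the O(n²) baseline that a specialised library such as EditDistanceClusterer is meant to beat.

```java
import java.util.*;

/** Illustrative sketch only; not OpenRefine's Clusterer API. */
public class NaiveEditDistanceClusterer {

    /** Groups values whose pairwise Levenshtein distance is <= maxDistance. */
    public static List<Set<String>> cluster(List<String> values, int maxDistance) {
        int n = values.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        // O(n^2) all-pairs comparison: the baseline a specialised library should beat.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (levenshtein(values.get(i), values.get(j)) <= maxDistance) {
                    union(parent, i, j);
                }
            }
        }

        // Collect connected components; singletons are not interesting clusters.
        Map<Integer, Set<String>> groups = new HashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new TreeSet<>()).add(values.get(i));
        }
        List<Set<String>> clusters = new ArrayList<>();
        for (Set<String> g : groups.values()) {
            if (g.size() > 1) clusters.add(g);
        }
        return clusters;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    private static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }

    /** Standard two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

As a usage example, cluster(Arrays.asList("colour", "color", "colr", "banana"), 2) would group the first three spellings and leave "banana" on its own.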

Contributor

lispc commented Apr 22, 2015

@baditaflorin Thanks for the link. I have already read all of the clustering code in OpenRefine, Simile-Vicino, and SecondString. My algorithm is specifically optimized for edit-distance clustering, so my goal is not to replace most of the existing algorithms but only the edit-distance module.
@tfmorris The new algorithm does not use blocking at all. Here is a brief comparison between the new lib and simile-vicino (there seems to be a bug in simile-vicino, which is mentioned in the gist): https://gist.github.com/lispc/d4ee8de81e7bfa6bc352. I will provide a git repo to reproduce the experiment results and start the fork-and-pull-request steps. Thanks.
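
For context on the blocking question: "blocking" here means grouping values by a cheap key so that only values in the same or nearby blocks are ever compared, at the risk of missing matches if the key is too coarse. Below is a hedged illustration of the idea using the simple length filter (two strings within edit distance d cannot differ in length by more than d). It is not taken from simile-vicino or from the proposed library; their actual strategies differ.

```java
import java.util.*;

/** Illustration of "blocking" for edit-distance candidate generation; not simile-vicino's code. */
public class LengthBlocking {

    /**
     * Returns candidate index pairs whose edit distance could still be <= maxDistance,
     * using only the length filter: |len(a) - len(b)| <= maxDistance.
     * Pairs that fail the filter are never compared, which is the whole point of blocking.
     */
    public static List<int[]> candidatePairs(List<String> values, int maxDistance) {
        // Block values by string length.
        Map<Integer, List<Integer>> byLength = new TreeMap<>();
        for (int i = 0; i < values.size(); i++) {
            byLength.computeIfAbsent(values.get(i).length(), k -> new ArrayList<>()).add(i);
        }

        List<int[]> candidates = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> block : byLength.entrySet()) {
            int len = block.getKey();
            // Compare this block against itself and against blocks within maxDistance in length.
            for (int otherLen = len; otherLen <= len + maxDistance; otherLen++) {
                List<Integer> other = byLength.get(otherLen);
                if (other == null) continue;
                for (int i : block.getValue()) {
                    for (int j : other) {
                        if (otherLen > len || j > i) {   // avoid self-pairs and duplicates
                            candidates.add(new int[] { i, j });
                        }
                    }
                }
            }
        }
        return candidates;   // each surviving pair still needs a real edit-distance check
    }
}
```

An algorithm that "does not use blocking at all" has to make every pair cheap to rule out by some other means rather than relying on a pruning step like this one.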

baditaflorin commented Apr 22, 2015

If you need a real live database to play with, I have a database with all the objects of all the trials from 2000 to 2015 from all the courts in the Romanian justice system: 16M rows, 800,000 distinct values.

There are a lot of examples of real-life errors made by real people:
https://www.dropbox.com/s/lg6nmb8l56l19px/240_instante_obiect.rar?dl=0

Waiting to test your repo.

li-guoliang commented Apr 23, 2015

Hi @baditaflorin and @tfmorris, I am the supervisor of @lispc. I have studied data cleaning for several years and I would like to contribute our algorithms to OpenRefine. Our algorithm is super fast: in the competition organized by the premier international conference EDBT'13, it beat all of the other algorithms by an order of magnitude (see http://www2.informatik.hu-berlin.de/~wandelt/searchjoincompetition2013/Results.html).

@baditaflorin I am very interested in your trial dataset and would like to test our algorithm on it. Could you share a copy with us? Thank you.

baditaflorin commented Apr 24, 2015

@li-guoliang You can find the Dropbox link in the comment above.

I also have a list of 46M rows containing all of the firms, public administrations, and persons that have been involved in a trial.

li-guoliang commented Apr 24, 2015

@baditaflorin I have downloaded the dataset. We will test it and let you know the results later. Would it be possible to have a copy of your 46M-row data as well?

Thanks.

Contributor

lispc commented Apr 29, 2015

@baditaflorin I had a look at the dataset. There are two clustering methods in OpenRefine: binning and kNN. This dataset is very suitable for testing the binning method, but it is too large for the kNN method; I calculated that clustering it with the simile-vicino library would take several days. So may I guess that you have not actually run this dataset through the current kNN method (namely, simile-vicino) in OpenRefine? Even our algorithm may take several hours. Do you have a smaller dataset, or could you give me some advice on generating a smaller dataset that simulates how OpenRefine is usually used? Thanks.
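
For readers unfamiliar with the two methods: binning clusters values by a normalised key, so it is essentially one linear pass plus a hash lookup, which is why 800,000 distinct values pose no problem for it, whereas kNN-style edit-distance clustering has to consider values pairwise. The sketch below is a simplified fingerprint-style keyer in the spirit of OpenRefine's binning; the real keyer may differ in details (character normalisation, edge cases), so treat it only as an illustration of why binning scales.

```java
import java.util.*;
import java.util.stream.Collectors;

/** Simplified fingerprint-style binning; OpenRefine's real keyer may differ in details. */
public class SimpleBinningClusterer {

    /** Lowercase, strip punctuation, then sort and de-duplicate the tokens. */
    static String fingerprint(String value) {
        String cleaned = value.trim().toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", " ");
        return Arrays.stream(cleaned.split("\\s+"))
                .filter(t -> !t.isEmpty())
                .distinct()
                .sorted()
                .collect(Collectors.joining(" "));
    }

    /** One linear pass: values sharing a fingerprint land in the same bin. */
    public static Collection<List<String>> cluster(List<String> values) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String v : values) {
            bins.computeIfAbsent(fingerprint(v), k -> new ArrayList<>()).add(v);
        }
        // Only bins with more than one distinct spelling are interesting clusters.
        return bins.values().stream()
                .filter(bin -> new HashSet<>(bin).size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Curtea de Apel Bucuresti",
                "curtea de apel  Bucuresti.",
                "Bucuresti, Curtea de Apel");
        System.out.println(cluster(values));   // all three spellings end up in one bin
    }
}
```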

Member

tfmorris commented Apr 30, 2015

@lispc The stated design center for OpenRefine is < 1 M rows, although in practice users push this higher. I can see what sample data sets I have available, but something like 1 to 5 million author names would probably be a reasonably representative large data set.

Contributor

lispc commented May 5, 2015

@tfmorris Thanks. I am improving the multicore version of our lib. It may take a bit longer before I can issue a pull request.

Member

tfmorris commented Oct 16, 2015

There's an Apache-licensed implementation of PassJoin in Scala available here: https://github.com/sjyk/sampleclean-async/blob/master/src/main/scala/sampleclean/clean/deduplication/join/PassJoin.scala

I haven't studied it closely enough to see if it implements exactly the same algorithm.
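
For anyone comparing that Scala code with the library proposed here: the core idea shared by PassJoin-style methods is a pigeonhole filter. If a string is partitioned into tau + 1 contiguous segments, then any string within edit distance tau must contain at least one of those segments verbatim, because tau edits can touch at most tau segments. The sketch below shows only that filter step in simplified form and is not taken from either repository; the real PassJoin additionally constrains where each segment may match, which prunes far more candidates.

```java
import java.util.*;

/** Simplified PassJoin-style pigeonhole filter; not the actual PassJoin implementation. */
public class SegmentFilter {

    /** Split s into tau + 1 contiguous segments of (almost) equal length. */
    static List<String> segments(String s, int tau) {
        int parts = tau + 1;
        List<String> segs = new ArrayList<>(parts);
        int base = s.length() / parts, extra = s.length() % parts, pos = 0;
        for (int i = 0; i < parts; i++) {
            int len = base + (i < extra ? 1 : 0);
            segs.add(s.substring(pos, pos + len));
            pos += len;
        }
        return segs;
    }

    /**
     * Cheap candidate test: returns false only when ed(a, b) is certainly > tau,
     * so the expensive edit-distance verification can be skipped for that pair.
     */
    static boolean isCandidate(String a, String b, int tau) {
        if (Math.abs(a.length() - b.length()) > tau) return false;
        if (a.length() <= tau) return true;          // too short to partition usefully; verify directly
        for (String seg : segments(a, tau)) {
            if (b.contains(seg)) return true;        // one untouched segment survives verbatim
        }
        return false;
    }

    public static void main(String[] args) {
        // "tribunal" vs "tribunol": one typo, so at least one of the 2 + 1 segments survives.
        System.out.println(isCandidate("tribunal", "tribunol", 2));  // true -> verify with edit distance
        System.out.println(isCandidate("tribunal", "bucuresti", 2)); // false -> skip verification
    }
}
```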
