Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tree/XML Diff #10

Open
gabriel-weaver opened this issue Jun 21, 2012 · 2 comments
Open

Tree/XML Diff #10

gabriel-weaver opened this issue Jun 21, 2012 · 2 comments

Comments

@gabriel-weaver
Copy link
Owner

Phillip Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, 337 (unknown):unknown, June 2005.

We survey the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes. These op- erations lead to the tree edit distance, alignment distance, and inclusion problem. For each problem we review the results available and present, in detail, one or more of the central algorithms for solving the problem.

@gabriel-weaver
Copy link
Owner Author

Sudarshan S. Chawathe, Anand Rajaraman, Hector Garcia-Molina, and Jennifer Widom. Change detection in hierarchically structured information. In In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD ’96), pages 493–504. ACM, June 1996.

Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hierarchically structured data. Since in many cases changes must be computed from old and new versions of the data, we define the hierarchical change detection problem as the problem of finding a “minimum-cost edit script” that transforms one data tree to another, and we present efficient algorithms for computing such an edit script. Our algorithms make use of some key domain characteristics to achieve substantially better performance than previous, general- purpose algorithms. We study the performance of our algorithms both analytically and empirically, and we describe the application of our techniques to hierarchically structured documents.

@gabriel-weaver
Copy link
Owner Author

Gre ́gory Cobe ́na, Serge Abiteboul, and Ame ́lie Marian. Detecting changes in XML documents. In In Pro- ceedings of the 18th International Conference on Data Engineering, pages 41–52. IEEE, February and March 2002.

We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volume of XML data. Because of the context, our algorithm has to be very efficient in terms of speed and memory space even at the cost of some loss of “quality”. Also, it considers, besides insertions, dele- tions and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to an- cestors and descendants to obtain more matchings. It also uses XML specific information such as ID attributes. We provide a performance analysis of the algorithm. We show that it runs in average in linear time vs. quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP- hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the “optimal” in terms of quality. Finally we present experi- ments on a small sample of XML pages found on the Web.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant