rug-compling/dep-brown-cluster

Dependency Brown clustering

Syntactic extension of Brown et al. 1992 clustering algorithm

This is a modification of Percy Liang’s implementation (version 1.3) of the Brown hierarchical word clustering algorithm. The modified version is based on a dependency language model (DLM) instead of a bigram language model.

Note that the segments of the original code that are not relevant for dependency clustering have not been revised. The modification should be seen as a minimal working extension of the original code for dependency-based clustering.

Input

One dependency instance per line, given as a tab-separated triple of “head”, “dependent” and “count” (see input.txt for an example). Space-separated multiword sequences are treated as a single token. The program therefore expects that dependency instances and their counts have already been extracted.
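As an illustration of the expected format, the sketch below counts (head, dependent) pairs and renders them as tab-separated lines. It is not part of this repository: the function name `dependency_count_lines` and the sample pairs are hypothetical, and the sorting is only for reproducibility, since nothing here states that line order matters.

```python
from collections import Counter


def dependency_count_lines(pairs):
    """Render (head, dependent) pairs as 'head<TAB>dependent<TAB>count' lines.

    Hypothetical helper for illustration; not part of dep-brown-cluster.
    """
    counts = Counter(pairs)
    # Sorted only to make the output reproducible (an assumption, not a requirement).
    return [f"{head}\t{dep}\t{n}" for (head, dep), n in sorted(counts.items())]


if __name__ == "__main__":
    # Made-up dependency instances; "New York" shows a multiword token,
    # which stays intact because only tabs separate the fields.
    sample_pairs = [("eat", "apple"), ("eat", "apple"), ("visit", "New York")]
    for line in dependency_count_lines(sample_pairs):
        print(line)
```

Writing the returned lines to a file, one per line, should give something usable as the `--text` argument, assuming the extraction step upstream produces the pairs.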

Output

For each word type, its cluster (see output.txt for an example). In particular, each line is:

[cluster bit id] [word] [number of times word occurs in input]
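A minimal sketch of reading such lines back in, assuming the three fields are separated by whitespace and that only the word field may itself contain spaces. The function names are hypothetical. Grouping by a bit-string prefix relies on the usual reading of Brown cluster ids as paths in a binary merge tree, where a shared prefix indicates a coarser shared cluster.

```python
def parse_path_line(line):
    """Split one output line into (bit_id, word, count).

    Assumes whitespace-separated fields; the word may contain spaces,
    so the first and last fields are peeled off from either end.
    """
    bits, rest = line.rstrip("\n").split(None, 1)
    word, count = rest.rsplit(None, 1)
    return bits, word, int(count)


def group_by_prefix(entries, prefix_len):
    """Group words by the first prefix_len bits of their cluster id."""
    groups = {}
    for bits, word, _count in entries:
        groups.setdefault(bits[:prefix_len], []).append(word)
    return groups


if __name__ == "__main__":
    # Made-up lines in the documented field order.
    lines = ["0110\tapple\t12\n", "0111\tpear\t7\n", "10\tNew York\t3\n"]
    entries = [parse_path_line(l) for l in lines]
    print(group_by_prefix(entries, 3))
```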

References

If you use this code, please cite:

Other references:

Compile

make

Run

Cluster input.txt into 50 clusters (--max-ind-level controls amount of verbose output):

./wcluster --text input.txt --c 50 --max-ind-level 3
# Output in input-c50-p1.out/paths
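When the tool is driven from a script, the command and the location of its output can be assembled as in the sketch below. The flags are the ones shown above. The output path pattern, including the fixed `-p1` part, is generalized from this single example and is an assumption, as are the hypothetical function names.

```python
from pathlib import Path


def wcluster_command(text_file, num_clusters, max_ind_level=3):
    """Build the argument list for the wcluster call shown in this README."""
    return [
        "./wcluster",
        "--text", str(text_file),
        "--c", str(num_clusters),
        "--max-ind-level", str(max_ind_level),
    ]


def expected_paths_file(text_file, num_clusters):
    """Guess the output location, e.g. input-c50-p1.out/paths.

    Assumption: the pattern '<stem>-c<N>-p1.out/paths' holds in general;
    only the input.txt / 50 clusters case is documented.
    """
    stem = Path(text_file).stem
    return Path(f"{stem}-c{num_clusters}-p1.out") / "paths"


if __name__ == "__main__":
    print(" ".join(wcluster_command("input.txt", 50)))
    print(expected_paths_file("input.txt", 50).as_posix())
```

The argument list can be passed to `subprocess.run(..., check=True)` from the directory containing the compiled `wcluster` binary.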

Changes for dependency clustering

Changes to the original code were made in the following files/functions:

  • wcluster.cc
    • read_text_process_word()
    • read_text()
    • incorporate_new_phrase()
    • create_initial_clusters()
    • compute_cluster_distribs()
    • main()
  • strdb.cc
    • read_text()

All modifications in the source code are marked as comments beginning with “dlm”.

Acknowledgments

Thanks to Percy Liang for clarifications about parts of the original code.

Copyright

(C) Copyright 2007-2012, Percy Liang

(C) Copyright Simon Šuster

Permission is granted for anyone to copy, use, or modify these programs and accompanying documents for purposes of research or education, provided this copyright notice is retained, and note is made of any changes that have been made.

These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user’s own risk.

http://www.let.rug.nl/suster/
