
===========================
======    README    =======
===========================

This package contains the code underlying the SCAN model as described in 
Frermann and Lapata (2016). The code can be used to (a) create binary files
containing time-stamped input documents (the required input to SCAN) and (b)
train SCAN models.

This bundle contains three parts:
(1) main.zip
    the main functions. Run the code from this directory. It also contains example
    input and output.
(2) scan_lib.zip
    the core code underlying SCAN (my code)
(3) other_lib.zip
    libraries that are used but are not part of the core Go libraries. Each of these must
    be placed in the same directory as scan_lib.zip


To run the code, navigate into your 'main' directory and run 

EITHER the pre-compiled binary (should work out of the box)
./dynamic-senses -parameter_file=/path/to/parameters/ [-create_corpus] -store={true,false}

OR the code itself (requires Go to be installed)
go run *.go -parameter_file=/path/to/parameters/ [-create_corpus] -store={true,false}

The command-line parameters are:

- parameter_file 
    path to a text file containing all parameters (see below)
    
-create_corpus 
    an optional parameter. If it is set, a binary corpus will be created; otherwise a model will be trained.
    
-store 
    indicates whether target word-specific corpora should be stored
   
To run the code itself (the second option above, i.e., not just the binary), golang (https://golang.org/doc/)
must be installed. I use version go1.7.4 (it might be safest to use the same version). All required
packages to run this code should be part of this bundle.
   

   
----------------------------------
----- running out of the box -----
----------------------------------
I include an example of all necessary test files:
- a parameters.txt (which has my own hard-coded paths, MUST BE CHANGED)
- a corpus file under main/test_input/corpus.txt
- a file with target words under main/test_input/targets.txt

Change the paths in the parameters.txt to your own paths (to the corpus / targets files),
and run

(a) to create a binary corpus
./dynamic_senses -parameter_file=path/to/parameters.txt -create_corpus -store=true

(b) to train models
./dynamic_senses -parameter_file=path/to/parameters.txt -store=true
   
   
   
---------------------------
--- the parameters file ---
---------------------------
All parameters are specified in a parameters file which must be passed as input to
the program. It is best to look at the included example file 'parameters.txt'. The parameters include

- paths to the underlying text corpora, target word sets, etc.
- model parameters (see the paper for an explanation)
- sampler parameters (number of iterations)
- parameters regarding the time start/end/intervals of interest
- parameters to optionally restrict the minimum number of available documents (to ignore highly
  infrequent words) and / or the maximum number of available documents per time interval
  (to get manageable-size corpora for very frequent words)
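
For orientation only, a parameters file covering the parameters described in the sections below could
look roughly like the sketch here. The key=value layout and all concrete values are my own guesses for
illustration; copy the actual syntax from the bundled parameters.txt:

    text_corpus=/your/path/corpus.txt
    target_words=/your/path/targets.txt
    window_size=5
    bin_corpus_store=/your/path/corpus.bin
    full_corpus_path=/your/path/corpus.bin
    word_corpus_path=/your/path/word_corpora/
    output_path=/your/path/output/
    start_time=1700
    end_time=2000
    time_interval=10
    iterations=1000
    kappaF=...
    kappaK=...
    a0=...
    b0=...
    num_top=10
    min_doc_per_word=500
    max_docs_per_slice=...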




------------------------------
Creating binary input corpora
------------------------------
Takes a text file, a list of target words, and a 'document length' specification, and outputs
a binary corpus with target word-specific, time-stamped documents. Document length refers to
the size of the context window considered around the target word, e.g., 5 words.
[The corpus has words mapped to unique identifiers (IDs), and contains dictionaries mapping from
 word strings to IDs and back.]
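
To make the context-window idea concrete, here is a minimal Go sketch of how such windows could be
collected. This is not the actual scan_lib code; the function name is made up, and the assumption
that the window is symmetric around the target word is mine:

    package main

    import (
        "fmt"
        "strings"
    )

    // contextWindows collects, for every occurrence of target in text, the
    // surrounding +/- windowSize tokens as one small "document".
    func contextWindows(text, target string, windowSize int) [][]string {
        tokens := strings.Fields(text)
        var docs [][]string
        for i, tok := range tokens {
            if tok != target {
                continue
            }
            lo := i - windowSize
            if lo < 0 {
                lo = 0
            }
            hi := i + windowSize + 1
            if hi > len(tokens) {
                hi = len(tokens)
            }
            doc := append([]string{}, tokens[lo:i]...) // words to the left
            doc = append(doc, tokens[i+1:hi]...)       // words to the right
            docs = append(docs, doc)
        }
        return docs
    }

    func main() {
        // two occurrences of 'power' -> two context-window documents
        fmt.Println(contextWindows("the power plant supplies electric power to the town", "power", 2))
    }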

It takes the following parameters (all specified in parameters.txt):

- text_corpus
    path to a text file in which each line contains a number indicating a
    year (of origin), followed by a \tab\ character, followed by the corresponding text
    from that year of origin. The same year can be listed multiple times (see the example after this parameter list):

    YEAR \tab\ text ....
    YEAR \tab\ text ....
    ....

- target_words
    path to a text file containing all target words of interest
    whose meaning should be tracked, one word per line. 

- window_size
    the size of the context window to consider around the target word (i.e., the 'document length'
    explained above)

- bin_corpus_store
    path to the location where the binary corpus should be stored
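
As a purely illustrative example of these two input files (the years and sentences are invented; the
target words are the ones shipped under main/test_input/), the expected formats look like:

    corpus.txt:
    1764 \tab\ the steam engine became a new source of power ...
    1764 \tab\ the king held absolute power over his subjects ...
    1993 \tab\ the new power plant was connected to the grid ...

    targets.txt:
    power
    battery
    transport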




------------------------------
    Training a SCAN model 
------------------------------
Once we have a binary corpus of time-stamped documents as explained above, we can train SCAN
models that track meaning change of individual target words. To do this we

(1) extract a target word-specific corpus from the underlying binary corpus. It contains only time-tagged 
    documents with the specified target word. This step converts the absolute times in the underlying corpus (e.g., 1764, 1993, ...)
    into time intervals (0, 1, ..., T) based on the start_time, end_time and time_interval parameters (see below).
    It takes the following parameters (all specified in parameters.txt):

    - full_corpus_path
        path to the underlying binary corpus
        
    - start_time
        the earliest time stamp in the underlying corpus to be considered
        
    - end_time
        the latest time stamp in the underlying corpus to be considered
        
    - time_interval
        the length of the time intervals into which the span [start_time, end_time] is to be split
      
        e.g., if start_time=1700, end_time=2000, time_interval=10, then documents are binned into 10-year bins
        and all documents from before 1700 and after 2000 are ignored. Documents from 1700-1709 are assigned to
        bin 0, documents from 1710-1719 are assigned to bin 1, and so on (see the small sketch of this binning
        at the end of this section).
    
    - word_corpus_path
        path to a location where word-specific corpora are stored. 
        
        The filename reflects the choice of start_time / end_time / time_interval, 
        e.g., corpus_s1700_e2000_i10.bin for the example above
    
    
(2) pass this corpus to the model and train the model with MCMC inference. It creates a model and human-readable output:

    - model.bin   the trained model binary
    - output.dat  human-readable model output, namely, for each time slice, its distribution over senses, 
                  and, for each sense in each time slice, its distribution over words (as the set of most 
                  highly associated words).

    It takes the following parameters (all specified in parameters.txt)
    
    - output_path
        directory in which the files output.dat and model.bin are to be stored

        If the output directory and these two files already exist (from a previous run), the old files are moved
        to output_old.dat and model_old.bin

    - kappaF, kappaK, a0, b0, num_top
        model parameters; check paper (or ask me!) for explanations
        
    - iterations
        number of training iterations
        
    - min_doc_per_word
        the model doesn't work well if only very few documents are available for a target word. You may want to
        only learn models for target words that occur at least N (~ 500?) times in the data
    
    - max_docs_per_slice
        some words occur extremely often. To get a manageable-size input corpus you can restrict the number of documents
        to consider per time interval
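
The time binning in step (1) amounts to simple integer arithmetic. The following Go snippet is a sketch
of the idea only, under the assumption that bins are computed exactly as in the 1700/2000/10 example
above; it is not the code scan_lib actually uses:

    package main

    import "fmt"

    // timeBin maps an absolute year to a time-interval index, or reports that
    // the document falls outside [startTime, endTime] and should be ignored.
    func timeBin(year, startTime, endTime, timeInterval int) (int, bool) {
        if year < startTime || year > endTime {
            return 0, false
        }
        return (year - startTime) / timeInterval, true
    }

    func main() {
        // with start_time=1700, end_time=2000, time_interval=10:
        // 1705 -> bin 0, 1712 -> bin 1, 1764 -> bin 6, 1650 -> ignored
        for _, year := range []int{1705, 1712, 1764, 1650} {
            if bin, ok := timeBin(year, 1700, 2000, 10); ok {
                fmt.Printf("year %d -> bin %d\n", year, bin)
            } else {
                fmt.Printf("year %d -> ignored\n", year)
            }
        }
    }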


        
        
        

-------------------------------
  Understanding the output
-------------------------------
The program creates a human-readable output file for each target word in [output_path]/word/output.dat

The included directories under main/test_input/output/ contain output for models trained on the corpus in 
main/test_input/corpus.txt for the target words in main/test_input/targets.txt

Models were trained to learn
  -- from the corpus.txt file (containing some text from between 1700 and 2009)
  -- K=8 senses per word
  -- with start_time=1700, end_time=2009, intervals=20 --> yielding 16 20-year time intervals in total

In each output.dat file:
  -- K indexes the sense ID
  -- T indexes the time slice ID
  -- After each sense / time, the top 10 words with the highest probability under that sense are listed
  -- The bottom line of each block (p(w|s)....) sums over all senses, i.e., shows the most highly associated words 
     for a particular time, ignoring sense information.

     
The file output.dat contains the same information in two ways.

*** per type ***
First, I list *by sense* the representation of the
same sense (k=0...K) for each time slice. The first number indicates the sense's prevalence at that
particular time (as a probability, between 0 and 1). Look, for example, at main/example_input/output/power/output.dat.
We can see that sense 2 (the block with K=2) seems to relate to the 'electricity' sense of power (it has
highly associated words such as 'battery', 'plant', etc., especially towards later times), and its prevalence increases
towards later time slices.

*** per time ***
This lists the senses associated with each time interval (the content of the lines is the same as above). E.g., 
the block associated with T=0 (the first block under 'per_time') shows that sense K=7 has high probability 
and sense K=2 (the 'electricity' sense, including, e.g., the word 'dynamo') has low probability. Sense K=7 refers to 
the 'mental' sense of power.

Output for the targets 'battery' and 'transport' is also included.










