biased-text-sample
==================

by Joseph Turian


Make a biased sample of a large text corpus, based upon the text in a smaller corpus.
Essentially, index the large text corpus with Lucene, and for each document
in the smaller corpus, retrieve the top ten Lucene results.
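
The retrieval idea can be sketched without Lucene. The following is a minimal illustration only: `TinyIndex` and `biased_sample` are hypothetical names, and the TF-IDF scoring is a simple stand-in, not Lucene's ranking and not the code in this repository.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy in-memory inverted index (assumed stand-in for a Lucene index)."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in enumerate(self.documents):
            for term, tf in Counter(tokenize(text)).items():
                self.postings[term][doc_id] = tf

    def search(self, query, k=10):
        """Return up to k (doc_id, score) pairs, best score first."""
        n = len(self.documents)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            hits = self.postings.get(term, {})
            if not hits:
                continue
            idf = math.log(1 + n / len(hits))  # rarer terms weigh more
            for doc_id, tf in hits.items():
                scores[doc_id] += tf * idf
        # Sort by descending score, breaking ties by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


def biased_sample(large_corpus, small_corpus, k=10):
    """Union of the top-k hits for every document in the small corpus."""
    index = TinyIndex(large_corpus)
    selected = set()
    for query in small_corpus:
        selected.update(doc_id for doc_id, _ in index.search(query, k))
    return [index.documents[i] for i in sorted(selected)]
```

Each small-corpus document acts as a query, so the sample is pulled toward text in the large corpus that shares vocabulary with the small corpus.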

Pipe a large stream of text into the indexer:
    /u/turian/data/web_corpus/WaCky2/sentencesplit.py  | ./index-sentences.py


REQUIREMENTS:
    * numpy
        Used for the Bloom filter.
    * murmurhash
        Used for the Bloom filter.
    * http://github.com/turian/common
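
For readers unfamiliar with Bloom filters, here is a minimal sketch of the data structure. It is not this repository's implementation: it uses a `bytearray` and `hashlib.blake2b` from the standard library in place of numpy and murmurhash, and the class name and parameters are hypothetical. This README does not say what the filter is used for; skipping sentences that were already emitted would be one plausible use, but that is an assumption.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Yield one bit position per hash function (salted blake2b)."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )
```

A caller would check `sentence in seen` before emitting a sentence and call `seen.add(sentence)` afterwards, trading a small false-positive rate for fixed memory use.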
