- original authors: Xylander23 wrote the original code; Lyrichu later ported it to Python 3.
- references: Code for Chinese Word Segmentation, Blog about New Words Detection, 中文分词新词发现 (Chinese word segmentation and new-word discovery).
- old version: an earlier, immature version of mine can be found here.
- stop-words dictionary: a stop-words dictionary file can improve the final performance of EUREKA; an example can be seen here (this dictionary is copied from Lyrichu).
- input corpus: the input corpus is a single long string, such as the text of a novel or several concatenated document pieces. See an example.
- corpus in mongodb: you can store each document as one sample in a collection of a MongoDB database, in a format like this:
{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
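As a hedged sketch of preparing such documents with pymongo (the corpus strings, database name, and collection name are placeholders; the insert call is commented out because it needs a running MongoDB server):

```python
# Each document carries one corpus piece under the "content" key;
# MongoDB assigns the _id field automatically on insert.
docs = [{"content": piece} for piece in ("语料片段一……", "语料片段二……")]

# import pymongo
# client = pymongo.MongoClient("mongodb://localhost:27017/")
# client["your_database_name"]["your_collection_name"].insert_many(docs)
```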
eureka -> model
from eureka import Eureka
model = Eureka()
model.load_dictionary()
# data from a .txt file
####################################################################
import codecs
corpus = codecs.open("document.txt", "r", "utf-8").read()
n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    # corpus_size is the length of each sub-corpus split from the input corpus
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)
# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples read from the collection
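To make the corpus_size parameter concrete, here is an illustrative split of a long corpus into fixed-size sub-corpora. This only mirrors the parameter's meaning; it is not EUREKA's actual splitting code, which may pick boundaries differently:

```python
def split_corpus(corpus, corpus_size=200000):
    # naive fixed-width split; the last chunk may be shorter
    return [corpus[i:i + corpus_size] for i in range(0, len(corpus), corpus_size)]

chunks = split_corpus("字" * 450000)  # 200000 + 200000 + 50000 characters
```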
- Python>=3.5
- pandas>=0.22.0
- pkuseg
- jieba>=0.39
- tqdm>=4.19.5
- Flask (optional, needed if running server.py)
- pymongo (optional, needed only when loading data from MongoDB)
- ipdb (optional, for debugging on the command line)
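The list above could be captured in a requirements.txt along these lines (an assumption on my part; the project may pin versions differently):

```
pandas>=0.22.0
pkuseg
jieba>=0.39
tqdm>=4.19.5
Flask    # optional, for server.py
pymongo  # optional, for MongoDB input
ipdb     # optional, for command-line debugging
```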
- Eureka comes from the Ancient Greek word heúrēka, which means "I have found (it)".
- Eureka is also the heroine of the Japanese anime Eureka Seven.