EUREKA

Source:

original authors: Xylander23 is the original author; Lyrichu then ported the code to Python 3.
references: Code for Chinese Word Segmentation, Blog about New Words Detection, New Word Discovery for Chinese Word Segmentation (中文分词新词发现).
old version: an earlier, immature version of mine can be found here.

Data:

stop-words dictionary: a stop-words dictionary file can improve the final performance of EUREKA; an example can be seen here (this dictionary is copied from Lyrichu).
input corpus: the input corpus is one long string, such as the text of a novel or several documents concatenated together. See an example.
corpus in MongoDB: you can store each document as one sample in a collection of a MongoDB database, in a format like this:

{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
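The three data sources above could be prepared along the following lines. This is a minimal sketch and not part of EUREKA: the helper names `load_stop_words`, `build_corpus`, and `to_mongo_documents` are hypothetical, and the sketch assumes the stop-words file holds one word per line and that MongoDB assigns `_id` when a document is inserted without one.

```python
from pathlib import Path


def load_stop_words(path):
    """Read a stop-words file, assuming one word per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def build_corpus(paths):
    """Concatenate several UTF-8 text files into one long corpus string."""
    return "\n".join(Path(p).read_text(encoding="utf-8") for p in paths)


def to_mongo_documents(texts):
    """Wrap each non-empty text as {"content": text}, one sample per document."""
    return [{"content": text} for text in texts if text]
```

With pymongo, the list returned by `to_mongo_documents` could be passed to `insert_many` on a collection, and the string returned by `build_corpus` could serve as the input corpus.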

Code Dependency:

eureka -> model

Usage Example:

from eureka import Eureka
model = Eureka()
model.load_dictionary()

# data from .txt file
####################################################################
# read the whole file as one UTF-8 string; the file is closed automatically
with open("document.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)  # corpus_size is the length of each sub-corpus taken from the input corpus

# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples used from the collection
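
The `n` and `corpus_size` arguments suggest that a large input is handled as a series of smaller sub-corpora. The sketch below illustrates that idea only; it is not EUREKA's actual implementation, and `corpus_from_documents` and `iter_sub_corpora` are hypothetical helpers that assume documents shaped like `{"content": ...}`.

```python
from itertools import islice


def corpus_from_documents(documents, n):
    """Join the "content" field of at most n documents into one string."""
    return "\n".join(doc["content"] for doc in islice(documents, n))


def iter_sub_corpora(corpus, corpus_size=200000):
    """Yield consecutive slices of the corpus, each at most corpus_size long."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    for start in range(0, len(corpus), corpus_size):
        yield corpus[start:start + corpus_size]


if __name__ == "__main__":
    # toy documents standing in for samples from a collection
    docs = [{"content": "abcdef"}, {"content": "ghij"}, {"content": "unused"}]
    corpus = corpus_from_documents(docs, n=2)
    print(list(iter_sub_corpora(corpus, corpus_size=4)))
```

A pymongo cursor such as the one returned by `col.find()` is iterable, so it could be passed as `documents` in place of the toy list.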

Requirements

Python>=3.5
pandas>=0.22.0
pkuseg
jieba>=0.39
tqdm>=4.19.5
Flask (optional, only if running server.py)
pymongo (optional; EUREKA can process MongoDB data, but does not itself require this library)
ipdb (optional, only if debugging on the command line)
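
Because three of the packages above are optional, it can help to check which of them are installed before relying on them. The following is a small sketch using only the standard library; `optional_status` is a hypothetical helper, and `flask`, `pymongo`, and `ipdb` are assumed to be the import names of those packages.

```python
from importlib.util import find_spec

# import names of the optional packages (assumed)
OPTIONAL_MODULES = ("flask", "pymongo", "ipdb")


def optional_status(modules=OPTIONAL_MODULES):
    """Map each module name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, installed in optional_status().items():
        print(f"{name}: {'installed' if installed else 'missing'}")
```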

Allusion

Eureka comes from the Ancient Greek word heúrēka, which means "I have found (it)".
Eureka is also the heroine of the Japanese anime Eureka Seven.

