EUREKA is an unsupervised model for detecting new words in a Chinese corpus.

Schlampig/EUREKA

EUREKA

Source:


Data:

  • stop-words dictionary: a stop-words dictionary file can improve the final performance of EUREKA. An example can be seen here (this dictionary is copied from Lyrichu).
  • input corpus: the input corpus is one long string, such as the text of a novel or several documents concatenated together. See an example.
  • corpus in MongoDB: you can store each document as one sample in a collection of a MongoDB database, in a format like this:
{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
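
As a minimal sketch of how documents in that shape could be written to a collection, the snippet below wraps each text in a dict with a "content" field and inserts them in one call. The helper names make_documents and store_corpus are hypothetical and not part of EUREKA; col is assumed to be a pymongo collection (anything with an insert_many method works). The "_id" field is left out because MongoDB generates it automatically on insert.

```python
def make_documents(texts):
    """Wrap each non-empty text in a dict with a "content" field.

    Hypothetical helper, not part of EUREKA.
    """
    return [{"content": t} for t in texts if t and t.strip()]


def store_corpus(col, texts):
    """Insert one document per text into `col` and return how many were inserted.

    `col` is assumed to behave like a pymongo collection (it needs insert_many).
    """
    docs = make_documents(texts)
    if not docs:
        # insert_many rejects an empty list, so return early
        return 0
    result = col.insert_many(docs)
    return len(result.inserted_ids)
```

With pymongo, col would be obtained as in the usage example, for instance pymongo.MongoClient("mongodb://localhost:27017/")["your_database_name"]["your_collection_name"], and store_corpus(col, list_of_texts) would then add one sample per text.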

Code Dependencies:

eureka -> model   

Usage Example:

from eureka import Eureka
model = Eureka()
model.load_dictionary()

# data from .txt file
####################################################################
with open("document.txt", "r", encoding="utf-8") as f:  # the file is closed once it has been read
    corpus = f.read()

n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)  # corpus_size is the length of each sub-corpus cut from the input corpus

# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples read from the collection
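
To illustrate what a corpus_size of 200000 means for a long input, here is a rough sketch that cuts a string into consecutive pieces of at most corpus_size characters. The function split_corpus is hypothetical and only an assumption about the general idea; how EUREKA actually divides the corpus internally (for example, whether it respects sentence boundaries) is not stated here and may differ.

```python
def split_corpus(corpus, corpus_size=200000):
    """Cut `corpus` into consecutive pieces of at most `corpus_size` characters.

    Hypothetical illustration only; EUREKA's own splitting may differ.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return [corpus[i:i + corpus_size] for i in range(0, len(corpus), corpus_size)]


# A 20-character string cut into pieces of at most 8 characters
chunks = split_corpus("新词发现" * 5, corpus_size=8)
print([len(c) for c in chunks])  # [8, 8, 4]
```

Fixed-length slicing like this can cut a word in half at a piece boundary, which is one reason a real implementation might choose its split points more carefully.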

Requirements

  • Python>=3.5
  • pandas>=0.22.0
  • pkuseg
  • jieba>=0.39
  • tqdm>=4.19.5
  • Flask (optional, only needed to run server.py)
  • pymongo (optional, only needed to read data from MongoDB; EUREKA itself does not depend on it)
  • ipdb (optional, for debugging on the command line)

Allusion

  • Eureka comes from the Ancient Greek word heúrēka, which means "I have found (it)".
  • Eureka is also the heroine of the Japanese anime Eureka Seven.
