tigerchen52/tfidf-tool

tfidf-tool

This tool is implemented in Python. It provides a simple and fast way to calculate tf-idf values.

why use this tool?

  • the tool calculates idf values with multiple processes, which can be up to n times faster than a single-process approach when n processes are used
  • it can calculate n-gram tf-idf values
  • it can extract key words from documents

quick start

All the input files we use are in the 'input' directory. We will use 'wiki_head_10.txt', which contains 10 wiki documents, to train our model, and 'wiki_test.txt' to test it.

get idf value

    doc = Document('../input/wiki_head_10.txt')
    tfidf = TFIDF(
        documents=doc,
        ngram=2,
        stop_words_path='../input/stop_words.txt',
        idf_path='../output/idf.txt'
    )
    # use 2 processes, each handling 5 docs
    tfidf.multi_pro_idf(process_num=2, p_doc_num=5)

Here we calculate bigram idf values from the 10 wiki docs and store them in the file given by idf_path.
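To make the idea behind the multi-process step concrete, here is a minimal sketch and not the tool's actual code. It assumes each document is already a list of n-gram terms, and it assumes the common formula idf(t) = log(N / df(t)), which this README does not specify. The names `count_document_frequency`, `compute_idf`, and `map_fn` are hypothetical. The chunks are processed sequentially here; passing something like `multiprocessing.Pool(...).map` as `map_fn` is one way the per-chunk work could be spread over processes.

```python
import math
from collections import Counter


def count_document_frequency(chunk):
    """Count, for one chunk of documents, how many documents contain each term."""
    df = Counter()
    for terms in chunk:
        df.update(set(terms))  # count each term once per document
    return df


def compute_idf(documents, chunk_size, map_fn=map):
    """Split documents into chunks, count per chunk, then merge into idf values."""
    chunks = [documents[i:i + chunk_size]
              for i in range(0, len(documents), chunk_size)]
    total_df = Counter()
    # map_fn is the plain built-in map by default (sequential); a process
    # pool's map could be passed instead to handle chunks in parallel.
    for partial in map_fn(count_document_frequency, chunks):
        total_df.update(partial)
    n_docs = len(documents)
    # Assumed formula (not confirmed by the tool): idf(t) = log(N / df(t)).
    return {term: math.log(n_docs / df) for term, df in total_df.items()}


# Four toy documents, each given as a list of bigram terms.
docs = [
    ["new york", "york city"],
    ["new york", "york times"],
    ["los angeles", "angeles times"],
    ["new york", "new york"],
]
idf = compute_idf(docs, chunk_size=2)
```

Because the per-chunk counts are simply added together, the result does not depend on how the documents are split, which is what makes the work easy to distribute.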

TFIDF's parameters

  • documents: an instance of Document. Its input is a generator in which every element is a list of sentences representing one document
  • ngram: an integer. 1 represents unigram, 2 represents bigram, 3 represents trigram...
  • stop_words_path: path to a stop words file. If stop words are used, n-grams that contain a stop word are filtered out
  • idf_path: a file path where the idf values are stored
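As an illustration of how the ngram and stop_words_path parameters interact, the following is a small sketch of n-gram extraction with stop-word filtering. It is not taken from the tool: the function name `extract_ngrams` is hypothetical, and whitespace tokenization is an assumption made only for this example.

```python
def extract_ngrams(tokens, n, stop_words=frozenset()):
    """Return the n-grams of `tokens`, skipping any that contain a stop word."""
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Drop the whole n-gram if any of its words is a stop word.
        if any(word in stop_words for word in window):
            continue
        grams.append(" ".join(window))
    return grams


# Assumption for this example: tokens come from a simple whitespace split.
tokens = "the cat sat on the mat".split()
bigrams = extract_ngrams(tokens, n=2, stop_words={"the", "on"})
print(bigrams)  # ['cat sat']
```

Only "cat sat" survives here because every other bigram in the sentence contains "the" or "on".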

get tfidf value and extract key words

    tfidf = TFIDF(
        documents=None,
        ngram=2,
        stop_words_path='../input/stop_words.txt',
        idf_path='../output/idf.txt'
    )
    tfidf.load_idf()
    doc = tfidf.read_file('../input/wiki_test.txt')
    # a dict mapping each word to its tf-idf value
    # (stored under a new name so the TFIDF object is not overwritten)
    tfidf_values = tfidf.calculate_tfidf(doc)
    # extract the top 10 key words from one document
    keywords = tfidf.find_keywords(doc, 10)
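
The scoring and key word steps can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: tf is taken as the term count divided by the document length, terms without a stored idf value are skipped, and the names `tfidf_scores` and `top_keywords` are hypothetical.

```python
import heapq
from collections import Counter


def tfidf_scores(terms, idf):
    """Score each term of one document as tf * idf.

    Assumption: terms that have no idf value are skipped.
    """
    counts = Counter(terms)
    total = len(terms)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def top_keywords(scores, k):
    """Return the k (term, score) pairs with the highest scores."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy idf values and one toy document made of bigram terms.
idf = {"machine learning": 0.5, "neural network": 2.0, "of the": 0.01}
terms = ["machine learning", "neural network", "machine learning", "of the"]
scores = tfidf_scores(terms, idf)
print(top_keywords(scores, 2))
# [('neural network', 0.5), ('machine learning', 0.25)]
```

A rarer term ("neural network") can outrank a more frequent one ("machine learning") because its idf value is higher.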
