# Tfidf

An Elixir implementation of tf-idf.

Based on the blog post by Steven Loria

## What is tf-idf?

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

tf-idf on Wikipedia
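
To make the definition concrete, here is a minimal sketch of one common tf-idf variant: term frequency as the share of tokens equal to the word, multiplied by a smoothed inverse document frequency `ln(N / (1 + n))`, where `N` is the corpus size and `n` is the number of corpus documents containing the word. The module name `TfidfSketch` and its functions are hypothetical and are not this library's source; the formula is an assumption inferred from the usage examples in this README, whose scores it appears to reproduce.

```elixir
defmodule TfidfSketch do
  # Hypothetical illustration, not the library's actual implementation.

  # Term frequency: fraction of tokens in the document equal to `word`.
  def tf(word, tokens) do
    Enum.count(tokens, &(&1 == word)) / length(tokens)
  end

  # Smoothed inverse document frequency: ln(N / (1 + n)), where n is the
  # number of corpus documents (token lists) that contain `word`.
  def idf(word, corpus) do
    containing = Enum.count(corpus, &(word in &1))
    :math.log(length(corpus) / (1 + containing))
  end

  # tf-idf is the product of the two factors.
  def tfidf(word, tokens, corpus) do
    tf(word, tokens) * idf(word, corpus)
  end
end

# Whitespace tokenization is used here purely for the example.
corpus = Enum.map(["dog hat", "dog", "cat mat", "duck"], &String.split/1)
IO.inspect(TfidfSketch.tfidf("dog", ["nice", "dog", "dog"], corpus))
```

Here "dog" makes up 2 of 3 tokens and appears in 2 of 4 corpus documents, so the score is `2/3 * ln(4/3)`, roughly 0.1918. A word that appears in no corpus document still gets a finite score because of the `1 +` in the denominator.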

## Installation

Add `tfidf` to the dependencies in your `mix.exs`:

defp deps do
  [{:tfidf, "~> 0.1.0"}]
end

## Usage

Tfidf.calculate(word, text, corpus, tokenize_fn \\ &tokenize(&1))

Calculates the tf-idf score of a given word within a text, relative to a corpus (a list of texts).

iex> Tfidf.calculate("dog", "nice dog dog", ["dog hat", "dog", "cat mat", "duck"])
0.19178804830118723

An optional tokenizer function can be passed as the last argument to replace the default tokenizer:

iex> Tfidf.calculate("dog", "nice,dog,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
0.19178804830118723
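
As the example above suggests, a tokenizer is any one-argument function that takes a string and returns a list of tokens. The sketch below shows one possible tokenizer that lowercases the text and splits on runs of characters that are neither letters nor digits. The module `MyTokenizer` is a hypothetical name used for illustration, and the commented call assumes the `:tfidf` dependency is installed.

```elixir
defmodule MyTokenizer do
  # Hypothetical helper: lowercase the text, then split on any run of
  # characters that are not Unicode letters or digits, dropping empty tokens.
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end
end

IO.inspect(MyTokenizer.tokenize("Nice, DOG dog!"))

# With the library installed, the function could be passed as the last argument:
# Tfidf.calculate("dog", "Nice, DOG dog!", ["Dog hat", "dog", "cat mat", "duck"],
#                 &MyTokenizer.tokenize/1)
```

Normalizing case and punctuation in the tokenizer means "Dog", "dog," and "dog" all count as the same term, which is usually what you want when scoring natural-language text.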

=====

Tfidf.calculate(word, tokenized_text, corpus)

Calculates the tf-idf score of a given word within a pre-tokenized text (a list of tokens), relative to a corpus of pre-tokenized texts.

iex> Tfidf.calculate("dog", ["nice", "dog", "dog"], [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]])
0.19178804830118723

=====

Tfidf.calculate_all(text, corpus, tokenize_fn \\ &tokenize(&1))

Calculates the tf-idf for every word in a given text and returns a list of {word, score} tuples.

iex> Tfidf.calculate_all("nice dog", ["dog hat", "dog", "cat mat", "duck"])
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]

As with Tfidf.calculate/4, an optional tokenizer function can be passed as the last argument; it is used in place of the default tokenizer.

iex> Tfidf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
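
Because the result is a plain list of `{word, score}` tuples, it can be post-processed with the standard `Enum` functions. The sketch below ranks words by descending score to pick the most distinctive terms. The `scores` list is copied from the example output above rather than computed, and the variable names are illustrative only.

```elixir
# Output of the calculate_all example above, used here as literal data.
scores = [{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]

# Sort by the score (second tuple element), highest first.
ranked = Enum.sort_by(scores, fn {_word, score} -> score end, &>=/2)

# Keep only the words of the top-scoring entries.
top_words = ranked |> Enum.take(1) |> Enum.map(fn {word, _score} -> word end)

IO.inspect(top_words)
```

In this example "nice" ranks above "dog" because it appears in none of the corpus documents, while "dog" appears in two of the four.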