
gateplugin-CorpusStats

A plugin for the GATE language technology framework for calculating various term and term pair statistics over a corpus.

The plugin implements the following processing resources (PRs):

  • CorpusStatsTfIdfPR for processing a whole corpus and creating files that contain corpus statistics such as document frequency, term frequency, and the total number of documents.
  • AssignStatsTfIdfPR for processing a corpus and using the corpus statistics file created by CorpusStatsTfIdfPR to add features to the terms in each document of the corpus. These features can hold scores such as tf (term frequency), wtf (weighted term frequency), ltfidf (logarithmic term frequency times inverse document frequency), and others.
  • CorpusStatsCollocationsPR for processing a corpus and creating TSV files that contain corpus statistics like PMI, Chi-Squared and others for all pairs of terms.
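
To make the statistics named above more concrete, here is a minimal Python sketch of how tf, df, a logarithmic tf-idf score, and PMI could be computed over a toy corpus. It is an illustration only and does not use the plugin or the GATE API. The toy documents, the function names, the formula `(1 + ln tf) * ln(N / df)` for ltfidf, and the document-level co-occurrence definition used for PMI are all assumptions; the plugin's actual definitions may differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): each document is a list of tokenized terms.
docs = [
    ["gate", "plugin", "corpus"],
    ["corpus", "statistics", "corpus"],
    ["gate", "statistics"],
    ["language", "technology"],
]
n_docs = len(docs)

# df: number of documents in which each term occurs at least once.
df = Counter(term for doc in docs for term in set(doc))


def tf(term, doc):
    """Raw term frequency: how often `term` occurs in `doc`."""
    return doc.count(term)


def ltfidf(term, doc):
    """Assumed textbook variant: (1 + ln tf) * ln(N / df); 0 if the term is absent."""
    f = tf(term, doc)
    if f == 0:
        return 0.0
    return (1.0 + math.log(f)) * math.log(n_docs / df[term])


def pmi(a, b):
    """Assumed document-level PMI: ln(P(a, b) / (P(a) * P(b)))."""
    n_ab = sum(1 for doc in docs if a in doc and b in doc)
    if n_ab == 0:
        return float("-inf")  # the pair never co-occurs in any document
    return math.log((n_ab * n_docs) / (df[a] * df[b]))
```

In this sketch a pair of terms co-occurs when both appear anywhere in the same document. A real collocation statistic might instead count co-occurrence within a sentence or a sliding window, which changes the resulting scores.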

More documentation: