உளி வீரன் - வித்து - உத்தி (Chisel Warrior - Seed - Technique)
old_html
plain_text
LICENSE
README.md
bluedot1.gif
harvest_ungiram.py
html2word.sh
pmdr0.gif
txt2json.py
unigram.json
unigram.txt
uniqword.sh
up.gif
v2unigram.json

This repository harvests unigram and bigram data from various corpora. The first source is the Project Madurai corpus, restricted to prose (unparsed cir/seer poetry and all other poetry are skipped). The data and all scripts are in the public domain.

The 'plain_text' folder currently holds 4036616 words in total. It stores the text one word per line, so it serves as both unigram and bigram data at the word level. One may use the open-tamil library to:

- discover the unigram word frequency of this corpus
- discover the bigram word frequency of this corpus (since successive words occur on successive lines)
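
The two counts above can also be sketched with the Python standard library alone. This is a minimal illustration, not the repository's own script and not the open-tamil API. It assumes every file in 'plain_text' is UTF-8 text with one word per line; the helper names `read_words`, `count_ngrams`, and `corpus_counts` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a UTF-8 file (assumed: one word per line)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_ngrams(words):
    """Count unigrams and adjacent-word bigrams in a list of words."""
    unigrams = Counter(words)
    # Successive lines hold successive words, so pairing each word
    # with its successor yields the word-level bigrams.
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams


def corpus_counts(folder):
    """Aggregate counts over every file in a folder.

    Bigrams are counted per file, so no pair spans a file boundary.
    """
    unigrams, bigrams = Counter(), Counter()
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            file_unigrams, file_bigrams = count_ngrams(read_words(path))
            unigrams.update(file_unigrams)
            bigrams.update(file_bigrams)
    return unigrams, bigrams
```

Calling `corpus_counts("plain_text")` from the repository root would return two `Counter` objects, and `most_common(10)` on either one lists its ten most frequent entries. Counting bigrams per file is a design choice made here so that the last word of one work is never paired with the first word of the next.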