`word_list_tools`

Python and pandas tools to perform various analyses on different types of word lists

Note: This repo was thoroughly restructured on 2014/09/13. I tried to go through and make sure all the paths were valid, but it's possible some slipped by me.

Word corpora used:

COHA, the Corpus of Historical American English from Brigham Young University. 1-grams require a licence to use, so they are not included here; .gitignore has a rule to ignore coha_1*.*. Metadata/summary data is included here.
Brown corpus, part of Python's NLTK
Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005 (http://statmt.org/europarl/) [files not included because they are ginormous]

Simple word lists used:

The Moby list of crossword puzzle words (113,809 words)

Run script first:

initial_data_munge.ipynb will make properly formatted Python objects of the corpora/lists above

Tools:

letter_frequency: Simple single letter counts, and comparisons to a standard cryptography text
letter_proximity: conditional probabilities for each half of a bigram, i.e. the probability that one letter follows or precedes another
letter_distributions: graphs of how often a letter is towards the beginning, middle or end of words
letter_distribution_europarl_comparison: letter_distributions for multiple languages in Europarl
google_ngrams: a work in progress
pattern_searches: various ways to search the word lists by pattern
stringcmp_py: scripts to compare strings (e.g. Jaro-Winkler, Levenshtein, Metaphone). From the Australian National University, under a GNU open source license. (The web page I originally downloaded them from in 2013 is no longer online.)
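
To make the letter_proximity idea concrete, here is a minimal sketch of computing both conditional probabilities from a word list. It is not the code in the letter_proximity folder: the function name `bigram_conditionals`, the sample words, and the choice to count each word once (rather than weighting by corpus frequency) are all assumptions made for illustration, and it uses only the standard library rather than pandas.

```python
from collections import Counter, defaultdict


def bigram_conditionals(words):
    """Return two nested dicts of conditional probabilities.

    follows[a][b]  = P(next letter is b | current letter is a)
    precedes[b][a] = P(previous letter is a | current letter is b)

    Hypothetical helper for illustration; each word counts once.
    """
    bigrams = Counter()
    for word in words:
        word = word.lower()
        # Pair every letter with its right-hand neighbour
        for first, second in zip(word, word[1:]):
            if first.isalpha() and second.isalpha():
                bigrams[first, second] += 1

    # Totals for each letter in first and in second position
    first_totals = Counter()
    second_totals = Counter()
    for (first, second), n in bigrams.items():
        first_totals[first] += n
        second_totals[second] += n

    follows = defaultdict(dict)
    precedes = defaultdict(dict)
    for (first, second), n in bigrams.items():
        follows[first][second] = n / first_totals[first]
        precedes[second][first] = n / second_totals[second]
    return follows, precedes


# Tiny made-up word list, not one of the corpora above
sample_words = ["queen", "quiet", "unique", "under"]
follows, precedes = bigram_conditionals(sample_words)
print(follows["q"])   # what follows "q" in the sample
print(precedes["u"])  # what precedes "u" in the sample
```

The two tables are deliberately kept separate because they are normalized differently: `follows` divides by how often a letter appears as the first half of a bigram, `precedes` by how often it appears as the second half.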

See the readme inside the data_user_pickle_csv folder for the schemas of the files it contains.

Prooffreader/word_list_tools
