Word vs. World

This repository contains code for extracting co-occurrence statistics from Wikipedia articles. These co-occurrences will later serve as experimental stimuli for studying how regularities in language interact with our knowledge of regularities in the world.

Branches

master: count co-occurrences in windows
no-window: count co-occurrences in noun chunks
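To make the window-based counting on the master branch concrete, here is a minimal sketch. It is not the repository's implementation: the function name, the forward-only window, and the treatment of pairs as unordered are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def count_window_cooccurrences(
    tokens: List[str],
    vocab: Set[str],
    window_size: int = 2,
) -> Counter:
    """Count unordered pairs of vocab words that occur within window_size tokens.

    Hypothetical helper: the real code may define windows differently
    (e.g. symmetric windows or ordered pairs).
    """
    counts: Counter = Counter()
    for i, w1 in enumerate(tokens):
        if w1 not in vocab:
            continue
        # Look only forward, so each pair of positions is counted once.
        for w2 in tokens[i + 1 : i + 1 + window_size]:
            if w2 in vocab:
                pair: Tuple[str, str] = tuple(sorted((w1, w2)))
                counts[pair] += 1
    return counts


tokens = "the dog chased the cat".split()
vocab = {"dog", "chased", "cat"}
narrow = count_window_cooccurrences(tokens, vocab, window_size=2)
wide = count_window_cooccurrences(tokens, vocab, window_size=3)
```

With a window of 2, "dog" and "cat" are too far apart to be counted together; widening the window to 3 adds that pair.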

Order of Operations

1. Generate a vocabulary based on the desired size and constraints.
2. Set vocab_name in params.py to the name of the vocabulary to use.
3. Submit the job to Ludwig.
4. Retrieve the results and save them to a database.
5. Query the database to find the desired word pairs.
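The last two steps can be sketched as follows. This is only an illustration: the source does not say which database is used, so SQLite, the table name cooc, its columns, and both function names are assumptions.

```python
import sqlite3
from typing import Dict, List, Tuple


def save_pairs(conn: sqlite3.Connection, pair2count: Dict[Tuple[str, str], int]) -> None:
    """Store co-occurrence counts in a hypothetical table named cooc."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cooc ("
        "w1 TEXT, w2 TEXT, freq INTEGER, PRIMARY KEY (w1, w2))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO cooc VALUES (?, ?, ?)",
        [(w1, w2, n) for (w1, w2), n in pair2count.items()],
    )
    conn.commit()


def query_pairs(conn: sqlite3.Connection, min_freq: int) -> List[Tuple[str, str, int]]:
    """Return word pairs that co-occur at least min_freq times, most frequent first."""
    cur = conn.execute(
        "SELECT w1, w2, freq FROM cooc WHERE freq >= ? ORDER BY freq DESC, w1, w2",
        (min_freq,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
save_pairs(conn, {("cat", "dog"): 5, ("cat", "tree"): 1})
frequent = query_pairs(conn, min_freq=2)
```

Keeping the pair as the primary key means that re-saving results for the same pair overwrites the old count rather than duplicating the row.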

Background

The input text files are stored on the UIUC Learning & Language Lab file server. The Python package Ludwig is used to interact with files on the server. More information about Ludwig can be found here.

The corpus as an abstraction

To Ludwig, a corpus is an abstraction. That is, no single text file uniquely identifies a Wikipedia corpus. Instead, a corpus is made up of multiple text files that exist in research_data/CreateWikiCorpus/runs. Moreover, multiple corpora are saved on the server. However, each text file is associated with only one corpus.

Retrieving text files

So, how are all text files associated with a single corpus retrieved? This is where a parameter configuration comes in. Each text file is co-located with a param2val.yaml file, which represents its parameter configuration. A parameter configuration is simply the set of parameters used to create a single corpus. Thus, in order to retrieve a single corpus, all text files associated with the same parameter configuration must be retrieved.

The safest method for obtaining only those text files that make up a specific corpus of interest is to manually inspect the folders in research_data/CreateWikiCorpus/runs. Inspect each param2val.yaml file, and if the parameter configuration matches, note the parameter configuration ID, a.k.a. param_name. This unique ID can be used to programmatically retrieve the specific set of text files that make up the corpus of interest.
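Once a param_name is known, retrieval might look like the sketch below. It does not use Ludwig's own API. It assumes that each parameter configuration has its own folder named after param_name inside runs, and that the corpus files end in .txt. Both the layout and the function name are assumptions, so check them against the server before relying on this.

```python
from pathlib import Path
from typing import List


def find_corpus_files(runs_dir: Path, param_name: str, pattern: str = "*.txt") -> List[Path]:
    """Collect all text files stored under runs_dir/param_name (assumed layout)."""
    param_dir = runs_dir / param_name
    if not param_dir.is_dir():
        raise FileNotFoundError(f"No folder for {param_name} in {runs_dir}")
    # Search recursively, because each job may write into its own sub-folder.
    return sorted(param_dir.rglob(pattern))
```

For example, with research_data mounted at /media/research_data, the call would be find_corpus_files(Path("/media/research_data/CreateWikiCorpus/runs"), "param_1"), where "param_1" is a placeholder for the ID noted during manual inspection.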

Usage

Access to the UIUC file server is required. If the remote directory research_data is mounted at /media/research_data, as is the default on Linux, run:

ludwig

If research_data is not mounted at /media/research_data (e.g. on macOS, where the default mounting point is /Volumes/), the mounting point needs to be specified:

ludwig -mnt /Volumes/research_data

To run jobs locally, rather than on Ludwig workers:

ludwig --local

Note, however, that access to the server is still required for fetching corpus data.

Running locally is especially useful for debugging. To run a single minimal configuration, using a small number of articles and a reduced vocabulary:

ludwig --local --minimal
