Joe Strout's CSC585 Project

This is the code repository for Joe Strout's Fall 2018 class project for CSC 585, Algorithms for Natural Language Processing, at the University of Arizona. This repository contains:

A C implementation of word2vec, forked from https://github.com/dav/word2vec
Custom code to output particular sets of embedded word vectors for further analysis.

This repository's official home is https://github.com/JoeStrout/csc585-project.

Installation

The following instructions assume a recent macOS environment with gcc and make (obtained, for example, by installing the Xcode command-line tools). Other Unix/Linux environments should work similarly.

Download the project from https://github.com/JoeStrout/csc585-project
At a shell prompt, cd to the src directory:

cd trunk/src

Build the executables with make:

make

This should compile and link cleanly, with no errors or warnings. It uses gcc with only standard libraries.

Get the test data. The easiest way to do this is via the get-data.sh script:

cd ../scripts
./get-data.sh

If this script fails, perhaps because your system lacks the curl or unzip commands, then manually download the text8 zip file and unzip it into the data directory. The result is the first 100 MB of a normalized (lowercased, with punctuation removed) Wikipedia dump.
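If neither curl nor unzip is available, the manual download can also be done with Python's standard library. This is only a sketch of the fallback described above, not what get-data.sh does internally. The URL is an assumption (the usual public location of the text8 archive), and the function names are hypothetical.

```python
import os
import urllib.request
import zipfile

# Assumption: the usual public location of the text8 archive.
# The README does not state the URL, so verify it before relying on it.
TEXT8_URL = "http://mattmahoney.net/dc/text8.zip"


def fetch_zip(url, zip_path):
    """Download url to zip_path unless a file is already there."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return zip_path


def unzip_into(zip_path, data_dir):
    """Extract every member of zip_path into data_dir; return the member names."""
    os.makedirs(data_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
        return archive.namelist()
```

Run from the scripts directory, `unzip_into(fetch_zip(TEXT8_URL, "text8.zip"), "../data")` should leave the unzipped corpus in the data directory, assuming the archive holds a single file named text8.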

How to Skip Training

Pre-trained word vectors for each of the four experiments can be found in the data directory. For each experiment, you will find a notes file describing the experimental conditions; CSV files containing training and test words for gender, latitude, and mass (and, in the case of experiment 4, also isDangerous and hasWheels); and the full set of word embeddings in binary form (text8-vector.bin).
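To inspect a pre-trained text8-vector.bin without the extract tool, a small reader can help. The sketch below assumes the file follows the standard word2vec binary layout (a text header with vocabulary size and vector dimension, then for each word the word itself, a space, the raw 32-bit floats, and a newline) and that the floats are little-endian. Neither assumption is confirmed by this README, and `read_word2vec_bin` is a hypothetical name.

```python
import struct


def read_word2vec_bin(path):
    """Return {word: tuple of floats} from a word2vec-style binary file."""
    vectors = {}
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dimension>\n"
        header = f.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space,
            # skipping the newline that ends the previous record.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":
                    chars.extend(ch)
            raw = f.read(4 * dim)
            if len(raw) != 4 * dim:
                raise ValueError("file ended before all vectors were read")
            # Assumption: little-endian float32 values.
            word = chars.decode("utf-8", errors="replace")
            vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors
```

For example, `read_word2vec_bin("../data/text8-vector.bin")` would return a dictionary keyed by word if the layout assumption holds. The whole vocabulary is loaded into memory at once, which is simple but not economical for large files.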

Training

To regenerate basic embedded word vectors, run the following script in the scripts directory:

./create-text8-vector-data.sh

This looks in the data directory for the text8 file downloaded during installation (the get-data.sh step above), and applies the word2vec algorithm (in skip-gram mode) to generate embedded word vectors for the 1.7 million words therein.

Alternatively, you can run one of the four experimental scripts, experiment-1.sh through experiment-4.sh, each of which applies different options:

experiment-1.sh: baseline, equivalent to create-text8-vector-data.sh
experiment-2.sh: simple pinning of gender, latitude, and mass
experiment-3.sh: as above, but with pinned examples repeated 1000X
experiment-4.sh: simple pinning with the addition of hasWheels and isDangerous

Progress will be displayed as the training proceeds. On my MacBook Pro, each run took about half an hour, except for experiment 3, which took about 24 hours.

After any experimental run, the text8-vector.bin word embeddings will be found in the data directory. The next step is to extract and analyze the data of interest.

Data extraction & analysis

To reproduce the analyses in the HW4 paper, change to the bin directory and run the extract executable, passing in the path to the word vectors:

cd ../bin
./extract ../data/text8-vector.bin

This will write selected words, along with correct target values, to five CSV files in the data directory:

genderWords.csv: target value = 1 for female, 1 for male
latitudeWords.csv: target value is degrees North divided by 90
massWords.csv: target value is 0.1 * log10(mass in kg)
hasWheels.csv: target value is 1 where hasWheels is true, 0 where false
isDangerous.csv: target value is 1 where isDangerous is true, 0 where false
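The latitude and mass targets listed above are simple arithmetic. A minimal sketch follows, using hypothetical function names and made-up inputs rather than values taken from the CSV files.

```python
import math


def latitude_target(degrees_north):
    """Scale a latitude in degrees North (negative for South) to [-1, 1]."""
    return degrees_north / 90.0


def mass_target(mass_kg):
    """Compress a mass in kilograms onto a log scale: 0.1 * log10(mass)."""
    return 0.1 * math.log10(mass_kg)


# Made-up inputs for illustration only.
equator = latitude_target(0.0)    # 0.0
mid_north = latitude_target(45.0)  # 0.5
one_tonne = mass_target(1000.0)    # approximately 0.3
one_gram = mass_target(0.001)      # approximately -0.3
```

The log scale lets masses that differ by many orders of magnitude land in a narrow numeric range: each factor of ten in mass shifts the target by 0.1.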

You can then open these CSV files in your favorite spreadsheet program, or paste them into Google Sheets as I have done here:
