Machine-Learning

Scripts for training and testing machine learning models based on Sefaria's data

Setup

Install Local DBManager package

This repo includes a local Python package containing utility functions and classes for interfacing with Prodigy. These utilities also define the interface between Mongo and spaCy, so the package must be installed before running any training tasks.

To install, follow instructions in prodigy/prodigy_utils/README.txt.

Install requirements

Run: pip install -r requirements.txt

DVC

This repo uses DVC to track changes to large data files and models. DVC is installed via requirements.txt. Optionally, you can also install shell completion by following the DVC installation instructions.

DVC is modeled after git. The following is a list of common DVC commands:

dvc add <filename>  # add filename to DVC tracking
dvc pull            # pull latest data from remote
dvc push            # push latest data to remote
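If you want to call these commands from a Python script (for example, to pull data before training), a thin wrapper is enough. The following is a minimal sketch, not code from this repo: the function names `dvc_command` and `run_dvc` are hypothetical, and it assumes the `dvc` executable is on your PATH.

```python
import subprocess
from typing import List, Optional

# The three DVC actions listed above; anything else is rejected.
SUPPORTED_ACTIONS = {"add", "pull", "push"}


def dvc_command(action: str, target: Optional[str] = None) -> List[str]:
    """Build the argument list for one of the DVC commands listed above."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported DVC action: {action}")
    if action == "add" and target is None:
        raise ValueError("'dvc add' needs a file to track")
    cmd = ["dvc", action]
    if target is not None:
        cmd.append(target)
    return cmd


def run_dvc(action: str, target: Optional[str] = None) -> None:
    """Run the command; raises CalledProcessError if DVC exits non-zero."""
    subprocess.run(dvc_command(action, target), check=True)
```

For example, `run_dvc("pull")` would fetch the latest data from the remote, and `run_dvc("add", "<filename>")` would start tracking a file.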

See the DVC documentation for more details.

Note: we currently use Google Storage on the development cluster as the remote for this repo.

Run

Most scripts in this repo are run using spaCy projects. See the spaCy projects documentation for general information.

Run a project using new vars override script

The new way to run a spaCy project is to use the vars override script located at util/run_project_with_vars.py.

This script lets you use a base project.yml file and inject variable overrides from a separate file, which is convenient when one project has several possible sets of variables.
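The override idea can be illustrated with plain dictionaries. This is a minimal sketch under stated assumptions, not the actual implementation of run_project_with_vars.py: the function name `apply_vars_overrides` and the variable names in the example are hypothetical. It relies only on the fact that a spaCy project.yml keeps its variables under a top-level `vars` key.

```python
import copy
from typing import Any, Dict


def apply_vars_overrides(
    base_project: Dict[str, Any], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of the parsed project.yml with `vars` overridden.

    Keys in `overrides` replace keys of the same name in the base `vars`
    section; everything else in the project is left untouched.
    """
    merged = copy.deepcopy(base_project)
    merged_vars = dict(merged.get("vars", {}))
    merged_vars.update(overrides)
    merged["vars"] = merged_vars
    return merged


# Hypothetical base project and vars file contents, as parsed YAML.
base = {
    "vars": {"lang": "en", "gpu_id": -1},
    "commands": [{"name": "train-ner"}],
}
ref_he_overrides = {"lang": "he"}

project = apply_vars_overrides(base, ref_he_overrides)
print(project["vars"])
```

Copying the base project before merging keeps the shared project.yml contents unchanged, so the same base can be reused with several vars files.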

Follow these steps to run with vars overrides:

cd util
python ./run_project_with_vars.py [project] [vars_name] [task]

run_project_with_vars parameters

| Param | Description |
| --- | --- |
| project | Name of a project folder, e.g. torah_ner. |
| vars_name | Name of a vars file, which must be located in the project folder under the vars folder, e.g. ref_he. Note: leave off the file suffix. |
| task | Name of a command or workflow in the project's project.yml file. |
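To make the three parameters concrete, here is a rough sketch of how they might be parsed and how `vars_name` could map to a file on disk. It is illustrative only: the helper names are hypothetical, and the `.yml` suffix is an assumption, since the text above only says that the suffix is omitted.

```python
import argparse
from pathlib import Path
from typing import List, Optional

# Assumption: vars files use the .yml suffix. The README does not say which.
VARS_SUFFIX = ".yml"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the three positional parameters described in the table above."""
    parser = argparse.ArgumentParser(prog="run_project_with_vars.py")
    parser.add_argument("project", help="name of a project folder, e.g. torah_ner")
    parser.add_argument("vars_name", help="vars file name without suffix, e.g. ref_he")
    parser.add_argument("task", help="command or workflow defined in project.yml")
    return parser.parse_args(argv)


def vars_file_path(repo_root: Path, project: str, vars_name: str) -> Path:
    """Locate the vars file: <repo_root>/<project>/vars/<vars_name><suffix>."""
    return repo_root / project / "vars" / f"{vars_name}{VARS_SUFFIX}"


args = parse_args(["torah_ner", "ref_he", "train-ner"])
print(vars_file_path(Path("."), args.project, args.vars_name))
```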

Projects

Below is a list of current projects

torah_ner

This project is meant to train NER classifiers for multiple tasks. Each task is defined in a vars file in torah_ner/vars. You must use run_project_with_vars.py to run this project.

vars

| Vars file | Description |
| --- | --- |
| ref_he | Vars for training a Hebrew NER model to recognize citations |
| subref_he | Vars for training a Hebrew NER model to recognize the parts of a citation within a citation |

Machine Learning Job

To build a Docker image tagged mljob, make sure you are in the root directory of the Machine-Learning project and run:

docker build . -f ./build/training/Dockerfile -t mljob

To run a job from the mljob image, set the container's entrypoint to bash and then run python util/job.py inside the container. For example:

docker run -it --entrypoint /bin/bash \
  -v $GOOGLE_APPLICATION_CREDENTIALS:/tmp/keys/mljob.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/mljob.json \
  -e MONGO_HOST="172.17.0.2" \
  -e MONGO_PORT=27017 \
  -e MONGO_USER="" \
  -e MONGO_PASSWORD="" \
  -e REPLICASET_NAME="" \
  -e GPU_ID=-1 \
  -e ML_PROJECT_DIR=torah_ner \
  -e PYTHONPATH=/app \
  mljob

Then, once inside the container, pass one or more comma-separated tasks followed by the name of a YAML vars file located in $ML_PROJECT_DIR/vars:

python util/job.py pretrain,train-ner ref_en

ML_PROJECT_DIR, MONGO_HOST, MONGO_PORT, MONGO_USER, MONGO_PASSWORD, and REPLICASET_NAME are all specific to the environment running the job (for example, Sefaria-Project or ContextUS).
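The sketch below shows one way a job script could read these environment variables and split the comma-separated task argument. It is not taken from util/job.py: `JobConfig`, `load_job_config`, and `parse_tasks` are hypothetical names, and the fallback defaults are assumptions chosen to mirror the docker run example above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class JobConfig:
    project_dir: str
    mongo_host: str
    mongo_port: int
    mongo_user: str
    mongo_password: str
    replicaset_name: str
    gpu_id: int


def load_job_config(env: Optional[Mapping[str, str]] = None) -> JobConfig:
    """Collect the environment-specific settings into one object."""
    if env is None:
        env = os.environ
    return JobConfig(
        project_dir=env["ML_PROJECT_DIR"],  # required; KeyError if missing
        mongo_host=env.get("MONGO_HOST", "localhost"),  # assumed default
        mongo_port=int(env.get("MONGO_PORT", "27017")),  # assumed default
        mongo_user=env.get("MONGO_USER", ""),
        mongo_password=env.get("MONGO_PASSWORD", ""),
        replicaset_name=env.get("REPLICASET_NAME", ""),
        gpu_id=int(env.get("GPU_ID", "-1")),  # -1 as in the example above
    )


def parse_tasks(task_arg: str) -> List[str]:
    """Split 'pretrain,train-ner' into ['pretrain', 'train-ner']."""
    return [task.strip() for task in task_arg.split(",") if task.strip()]
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment, which is useful because these values differ between deployments.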
