Weakly supervised classification of adverse event reports from the FDA's MAUDE database.

Analyzing MAUDE with Snorkel

This repo contains code developed as part of a collaboration with the International Consortium of Investigative Journalists (ICIJ).

I. Labeling Functions

The preprocess.py script loads a TSV of MAUDE records and applies all labeling functions.

python preprocess.py \
	--outdir results/gender/ \
	--train data/MaudeFull8M.tsv \
	--chunksize 1000000 \
	--n_procs 36

This takes ~96 minutes to run on 8M records using 36 CPU cores.
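The role of --chunksize and --n_procs can be illustrated with a minimal sketch that uses only the standard library. It is not the code in preprocess.py: the function names, the `text` column, and the word-count worker are hypothetical stand-ins for whatever per-record work the labeling functions perform.

```python
import csv
from itertools import islice
from multiprocessing import Pool


def read_chunks(path, chunksize):
    """Yield lists of row dicts, at most `chunksize` rows each, from a TSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        while True:
            chunk = list(islice(reader, chunksize))
            if not chunk:
                break
            yield chunk


def count_words(chunk):
    """Placeholder per-chunk work: word counts of a hypothetical `text` column."""
    return [len(row["text"].split()) for row in chunk]


def process_file(path, chunksize, n_procs, worker=count_words):
    """Apply `worker` to every chunk, using a process pool when n_procs > 1."""
    chunks = read_chunks(path, chunksize)
    if n_procs > 1:
        with Pool(n_procs) as pool:
            # imap preserves chunk order, so results line up with input rows.
            return [v for result in pool.imap(worker, chunks) for v in result]
    return [v for result in map(worker, chunks) for v in result]
```

Reading in chunks bounds how many rows the reader pulls at once, while the pool spreads chunks across processes. Under these assumptions, larger chunks mean fewer hand-offs between processes at the cost of more memory per worker.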

II. Datasets

All documents are derived from the FDA's public MAUDE database. All datasets can be downloaded with:

./download.sh

1. Unlabeled Documents

  • MaudeSample20k 16MB. Uniform random sample of 20k records. data/MaudeSample20k.tsv
  • MaudeSample500k 400MB. Uniform random sample of 500k records. data/MaudeSample500k.tsv
  • MaudeSample2M 1.5GB. Uniform random sample of 2M records. data/MaudeSample2M.tsv

2. GENDER Labels

All documents are labeled with y ∈ {MALE, FEMALE, UNK}
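To make the label space concrete, here is a hedged sketch of what keyword-based labeling functions for GENDER might look like. The integer encoding, the ABSTAIN value, the regular expressions, and the function names are all assumptions for illustration; they are not the labeling functions shipped in this repo.

```python
import re

# Hypothetical encoding: ABSTAIN means "this function casts no vote".
ABSTAIN = -1
MALE, FEMALE, UNK = 0, 1, 2

# Illustrative keyword patterns only; a real system would need far more care.
MALE_RE = re.compile(r"\b(he|him|his|male|man|mr)\b", re.IGNORECASE)
FEMALE_RE = re.compile(r"\b(she|her|hers|female|woman|mrs|ms)\b", re.IGNORECASE)


def lf_male_terms(text):
    """Vote MALE if a male-indicating term appears, otherwise abstain."""
    return MALE if MALE_RE.search(text) else ABSTAIN


def lf_female_terms(text):
    """Vote FEMALE if a female-indicating term appears, otherwise abstain."""
    return FEMALE if FEMALE_RE.search(text) else ABSTAIN


def apply_lfs(texts, lfs):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]
```

Each function votes on a single record or abstains, so the output of apply_lfs is a matrix of votes that a later step can combine into one label per record.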

3. Model-generated Labels

These are model-predicted labels (either the majority vote of the labeling functions (LFs) or the output of an end model such as BERT).
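A majority vote over labeling function outputs can be sketched as follows. The integer encoding and the choice to map ties and all-abstain rows to UNK are assumptions made for this example; the repo may resolve those cases differently.

```python
from collections import Counter

# Hypothetical encoding: ABSTAIN means "this function casts no vote".
ABSTAIN = -1
MALE, FEMALE, UNK = 0, 1, 2


def majority_vote(votes, abstain=ABSTAIN, default=UNK):
    """Return the most common non-abstain vote, or `default` on a tie or no votes."""
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return default
    ranked = counts.most_common(2)
    # A tie between the two leading labels is treated as undecided.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]
```

For example, a record whose votes are [MALE, MALE, FEMALE, ABSTAIN] would be labeled MALE, while a record on which every function abstains would fall back to UNK.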

All files are on Dropbox here.

  • GENDER MV_2M_MAUDE_38_LFs_2019-6-21.tsv.bz2 15MB.
  • GENDER MV_DEATH_INJURY_MAUDE_38_LFs_2019-6-25.tsv.bz2 23MB.
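
The .tsv.bz2 files can be inspected without decompressing them to disk first. The sketch below uses only the standard library and makes no assumption about column names; it does assume the files are UTF-8 encoded and tab-separated, and the helper name is hypothetical.

```python
import bz2
import csv
from itertools import islice


def head_bz2_tsv(path, n=5):
    """Return the first `n` rows of a bz2-compressed TSV as lists of strings."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(islice(csv.reader(handle, delimiter="\t"), n))
```

Calling head_bz2_tsv on one of the downloaded files returns its first few rows, which is a quick way to check the column layout before loading the whole file.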