In general, DSIR aims to select data from the raw dataset that matches the feature distribution of the target data. Thus, the choice of feature space and of importance estimator on that feature space changes the behavior of DSIR for different use-cases. Extending the base `DSIR` class in `base.py` is simple: follow the example in `hashed_ngram_dsir.py`.
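As a loose sketch of what a custom feature space might look like (the hook names and import path below are hypothetical placeholders, not the actual interface; see `base.py` and `hashed_ngram_dsir.py` for the real one):

```python
from data_selection.base import DSIR  # assumed module path


class EmbeddingDSIR(DSIR):
    """Illustrative subclass: select data in an embedding feature space."""

    def featurize(self, text: str):
        # Hypothetical hook: map a document to a feature vector z.
        raise NotImplementedError

    def importance_estimator(self, features):
        # Hypothetical hook: return the log importance weight
        # log p_target(z) - log p_raw(z) for the given features.
        raise NotImplementedError
```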
Base class for DSIR.

Constructor arguments:
- `raw_datasets`: List of data paths
- `target_datasets`: List of data paths
- `cache_dir`: Directory to store cached intermediates (log importance weights)
- `raw_load_dataset_fn`: Function to load the raw dataset from a path (see the sketch after this list)
- `raw_parse_example_fn`: Function that takes in an example dict and outputs a string
- `target_load_dataset_fn`: Function to load the target dataset from a path
- `target_parse_example_fn`: Function that takes in an example dict and outputs a string
- `num_proc`: Number of CPUs to parallelize over. If None, use all available CPUs.
- `separate_targets`: Whether to select data separately for each target and then join the selections. For example, with two target datasets, one natural language and one code, the most heavily upweighted data when `separate_targets=False` may skew toward documents with a mix of natural language and code, such as StackExchange. When `separate_targets=True`, two separate DSIR runs occur in parallel, selecting a mixture of documents using each target.
- `target_proportions`: Weighting across multiple targets if `separate_targets=True`. The proportions are at the document level. Set to None to weight by the size (in tokens) of each target dataset.
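For example, a loader/parser pair for jsonl data might look like the following sketch (the file format and field names are illustrative assumptions; only the signatures, a path in and an iterable of dicts out, and a dict in and a string out, come from the documentation above):

```python
import json
from typing import Dict, Iterator


def load_jsonl_dataset(path: str) -> Iterator[Dict]:
    # Illustrative loader: stream one example dict per line of a jsonl file.
    with open(path) as f:
        for line in f:
            yield json.loads(line)


def parse_example(example: Dict) -> str:
    # Illustrative parser: concatenate the fields DSIR should featurize.
    return example.get('title', '') + '\n' + example.get('text', '')
```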
`compute_importance_weights(self) -> None`

Compute importance weights on the raw dataset with `self.importance_estimator`. Saves importance weights in `self.log_importance_weights_dir / {index}.npy` in chunks indexed by `index`. Also saves other per-example metadata (numpy arrays) in `self.perexample_metadata_dir / {index}.npy`.
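Assuming a constructed instance `dsir` (see the `HashedNgramDSIR` example below), computing the weights and peeking at the first saved chunk might look like:

```python
import numpy as np
from pathlib import Path

dsir.compute_importance_weights()

# Inspect the first chunk of saved log importance weights (chunk index 0).
chunk = np.load(Path(dsir.log_importance_weights_dir) / '0.npy')
print(chunk.shape, chunk[:5])
```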
`resample(self, out_dir: str, num_to_sample: int, cache_dir: str = None, top_k: bool = False) -> None`

Resample the raw dataset according to importance weights.

Arguments:
- `out_dir`: Path to save the resampled dataset
- `num_to_sample`: Number of examples to sample
- `cache_dir`: Path to cache the resampled dataset
- `top_k`: If True, take the top `num_to_sample` examples by importance weight instead of sampling
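For instance (paths and counts are illustrative):

```python
# Sample 100k documents in proportion to their importance weights.
dsir.resample(out_dir='/data/resampled', num_to_sample=100_000,
              cache_dir='/tmp/dsir_resample_cache')

# Or deterministically keep the 100k highest-weight examples.
dsir.resample(out_dir='/data/resampled_topk', num_to_sample=100_000, top_k=True)
```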
The main subclass we provide is DSIR with hashed n-gram features. This choice of feature space allows for efficient data selection over large datasets.
Constructor arguments (see the usage sketch after this list):
- `raw_datasets`: List of data paths
- `target_datasets`: List of data paths
- `cache_dir`: Directory to store cached log importance weights
- `load_dataset_fn`: Function to load a dataset from a path. Defaults to `default_load_dataset_fn`.
- `parse_example_fn`: Function that takes in an example dict and returns a string. Defaults to returning the "text" field of the example.
- `num_proc`: Number of processes to use for parallelization. Defaults to the number of cores.
- `ngrams`: N in N-grams. 2 means both unigrams and bigrams.
- `num_buckets`: Number of buckets to hash n-grams into.
- `tokenizer`: `word_tokenize` or `wordpunct`
- `min_example_length`: Minimum number of tokens for an example to be considered.
- `separate_targets`: Whether to select data separately for each target and then join the selections. For example, with two target datasets, one natural language and one code, the most heavily upweighted data when `separate_targets=False` may skew toward documents with a mix of natural language and code, such as StackExchange. When `separate_targets=True`, two separate DSIR runs occur in parallel, selecting a mixture of documents using each target according to `target_proportions`.
- `target_proportions`: Weighting across multiple targets if `separate_targets=True`. The proportions are at the document level. Set to None to weight by the size (in tokens) of each target dataset.
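A minimal construction sketch, assuming the package is importable as `data_selection` and using illustrative jsonl paths and hyperparameter values:

```python
from data_selection import HashedNgramDSIR

# Illustrative shard paths; any list of data paths works.
raw_datasets = ['/data/raw/shard_0.jsonl', '/data/raw/shard_1.jsonl']
target_datasets = ['/data/target/nl.jsonl', '/data/target/code.jsonl']

dsir = HashedNgramDSIR(
    raw_datasets,
    target_datasets,
    cache_dir='/tmp/dsir_cache',
    ngrams=2,                       # unigrams + bigrams
    num_buckets=10_000,             # hash buckets for n-gram features
    separate_targets=True,          # select per-target, then join
    target_proportions=[0.5, 0.5],  # document-level mixture over targets
)
```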
`fit_importance_estimator(self, num_tokens_to_fit='auto') -> None`

Fit the importance estimator.

Arguments:
- `num_tokens_to_fit`: Number of tokens of the raw dataset to fit the importance estimator on. Set to "all" to fit on all tokens, "auto" to determine the number of tokens automatically (100k * `num_buckets`), or an integer to fit on that many tokens.
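Putting the pieces together, an end-to-end run over the `dsir` object constructed above (paths and counts still illustrative) would look like:

```python
# Fit on 100k * num_buckets tokens, compute per-example log importance
# weights, then resample the raw data toward the target distribution.
dsir.fit_importance_estimator(num_tokens_to_fit='auto')
dsir.compute_importance_weights()
dsir.resample(out_dir='/data/resampled', num_to_sample=100_000)
```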