Synthetic Library Oversampling Model

This repository provides a Python implementation of a quantitative model for estimating the sampling requirements of synthetic gene libraries with non‑uniform representation and imperfect fidelity. The model is conceptually inspired by the probabilistic framework used in Guido NJ, Handerson S, Joseph EM, Leake D, Kung LA (2016) Determination of a Screening Metric for High Diversity DNA Libraries. PLoS ONE 11(12): e0167088.

Overview

Given a library with:

Diversity (N) – number of unique intended variants,
Fidelity (f) – fraction of perfect molecules (e.g. 0.8 for 80%),
Uniformity (Gini) – representation unevenness (0 = perfectly uniform, 1 = extremely skewed),

the model estimates either:

Required sample size t to reach a target coverage with a given probability, or
Guaranteed coverage achievable for a given t and probability.

The script implements a coupon‑collector–style model with Poissonization and lognormal‑derived frequency distributions whose Gini coefficient matches the input value.
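The last ingredient can be illustrated with a short sketch that builds lognormal-shaped variant frequencies and tunes their spread by bisection until the Gini coefficient reaches a target. This is not the code in oversampling.py. The function names, the quantile-midpoint discretization, and the use of the standard library in place of numpy are assumptions made to keep the example self-contained.

```python
import math
from statistics import NormalDist

def gini(values):
    """Gini coefficient of a list of non-negative values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Formula for sorted data: G = sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    weighted = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    return weighted / (n * total)

def lognormal_frequencies(n, sigma):
    """n lognormal quantile midpoints with log-scale spread sigma, normalized to sum to 1."""
    nd = NormalDist()
    raw = [math.exp(sigma * nd.inv_cdf((i + 0.5) / n)) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def sigma_for_gini(target, n=400, lo=0.0, hi=10.0, iters=60):
    """Bisection on sigma, relying on the Gini coefficient increasing with sigma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gini(lognormal_frequencies(n, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: 400 frequency classes matched to a Gini of 0.40.
sigma = sigma_for_gini(0.40)
freqs = lognormal_frequencies(400, sigma)
print(f"sigma={sigma:.4f}  gini={gini(freqs):.4f}")
```

For a continuous lognormal distribution the Gini coefficient has the closed form $2\Phi(\sigma/\sqrt{2}) - 1$, so bisection is mainly useful when the frequencies are discretized into a finite number of classes, as assumed here.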

Installation

Requires Python ≥3.8. The only third‑party dependency is numpy.

git clone https://github.com/PlesaLab/Oversampling_model.git
cd Oversampling_model
pip install numpy

Usage

Compute number of samples needed for desired coverage

python oversampling.py --N 1536 --fidelity 0.30 --gini 0.40 t_for_coverage --coverage 0.95 --prob 0.95

Output example:

[t_for_coverage]
N=1,536  fidelity=0.3000  gini=0.4000  K=400
Target: coverage >= 0.9500 with probability >= 0.950
Required samples t  = 36,068.091
Effective perfect draws f*t = 10,820.427
At t, E[coverage]=0.957693  Var=2.187476e-05

Compute coverage guaranteed for a given sample size

python oversampling.py --N 1536 --fidelity 0.30 --gini 0.40 coverage_for_t --t 1E5 --prob 0.95

Output example:

[coverage_for_t]
N=1,536  fidelity=0.3000  gini=0.4000  K=400
At t=100,000.000, guaranteed coverage (p>=0.950) = 0.994161
E[coverage]=0.996523  Var=2.062565e-06

Model Details

Fidelity (f) acts as a multiplicative penalty: only a fraction f of draws are usable (so effective sampling = f × t).
Non‑uniformity (Gini) is modeled by lognormal variant frequencies whose spread is tuned by bisection until the distribution's Gini coefficient matches the specified value.
Coverage distribution is approximated via a central‑limit‑theorem (CLT) form:

$P(C_t \ge c) \approx 1 - \Phi\!\left(\frac{c - \mathbb{E}[C_t]}{\sqrt{\mathrm{Var}[C_t]}}\right)$

where $C_t$ is the coverage fraction observed after sampling t molecules and $\Phi$ is the standard normal cumulative distribution function.
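One way to obtain the mean and variance that enter this approximation is sketched below. It assumes the Poissonized coupon-collector form, in which variant i with frequency p_i is seen at least once with probability 1 - exp(-f t p_i) and the per-variant indicators are treated as independent. The function names and the uniform example library are hypothetical, and the repository's script may compute these quantities differently (for instance over aggregate bins).

```python
import math
from statistics import NormalDist

def coverage_moments(freqs, fidelity, t):
    """Mean and variance of the coverage fraction under Poissonization."""
    n = len(freqs)
    # Probability that each variant is never drawn as a perfect molecule.
    miss = [math.exp(-fidelity * t * p) for p in freqs]
    mean = sum(1.0 - m for m in miss) / n
    # Independent Bernoulli indicators: the variances add.
    var = sum(m * (1.0 - m) for m in miss) / n**2
    return mean, var

def prob_coverage_at_least(c, mean, var):
    """CLT approximation: P(C_t >= c) ~ 1 - Phi((c - mean) / sqrt(var))."""
    if var <= 0.0:
        return 1.0 if mean >= c else 0.0
    return 1.0 - NormalDist().cdf((c - mean) / math.sqrt(var))

# Hypothetical uniform library: 1,536 variants, each with frequency 1/N.
N = 1536
freqs = [1.0 / N] * N
mean, var = coverage_moments(freqs, fidelity=0.30, t=20_000)
p95 = prob_coverage_at_least(0.95, mean, var)
print(f"E[coverage]={mean:.6f}  Var={var:.6e}  P(C_t >= 0.95)~{p95:.4f}")
```

Because the library here is uniform, the numbers will not match the example outputs above, which use a Gini coefficient of 0.40.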

Solving for t gives the minimal sampling depth meeting your desired coverage and probability.
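The solve step can be sketched as a one-dimensional root search. The example below assumes the Poissonized moments and CLT form described above, and it assumes the probability of reaching the target coverage increases with t over the range searched. The doubling-then-bisection strategy and all function names are illustrative choices, not necessarily what oversampling.py does.

```python
import math
from statistics import NormalDist

def prob_coverage_at_least(freqs, fidelity, t, c):
    """CLT estimate of P(C_t >= c) from Poissonized mean and variance."""
    n = len(freqs)
    miss = [math.exp(-fidelity * t * p) for p in freqs]
    mean = sum(1.0 - m for m in miss) / n
    var = sum(m * (1.0 - m) for m in miss) / n**2
    if var <= 0.0:
        return 1.0 if mean >= c else 0.0
    return 1.0 - NormalDist().cdf((c - mean) / math.sqrt(var))

def min_samples(freqs, fidelity, coverage, prob, rel_tol=1e-9):
    """Smallest t (to relative tolerance) with P(C_t >= coverage) >= prob."""
    lo, hi = 0.0, 1.0
    # Double the upper bound until it satisfies the target.
    while prob_coverage_at_least(freqs, fidelity, hi, coverage) < prob:
        lo, hi = hi, hi * 2.0
    # Bisect, keeping hi as a depth that always meets the target.
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if prob_coverage_at_least(freqs, fidelity, mid, coverage) >= prob:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical uniform library of 1,536 variants at 30% fidelity.
N = 1536
freqs = [1.0 / N] * N
t_req = min_samples(freqs, fidelity=0.30, coverage=0.95, prob=0.95)
print(f"Required samples t ~ {t_req:,.1f}")
```

Keeping the upper bound as the returned value means the result always satisfies the target under the approximation, at the cost of overshooting the true threshold by at most the tolerance.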

📊 Example Parameters

| Parameter | Meaning | Typical Range |
|---|---|---|
| `N` | Number of unique genes | 384–786,432 |
| `fidelity` | Fraction perfect | 0.05–0.6 |
| `gini` | Library unevenness | 0.15–0.9 |
| `coverage` | Desired completeness | 0.8–0.99 |
| `prob` | Probability guarantee | 0.9–0.99 |

Notes

The algorithm uses 400 aggregate bins by default (--K) for speed; increase this for very high N or extreme Gini values.
