The udf-bioutils
project contains bioinformatics (and public health surveillance adjacent) functions used internally by CDC's respiratory virus surveillance programs. Functions must be uploaded and registered with Apache Impala. UDFs and UDAs extend Impala's native capabilities with domain-specific analysis. This library, in concert with external programs and specialized schema, helps form a basis for database-centric sequence surveillance analytics. See INSTALL.md for building and installation instructions.
For direct correspondence, feel free to contact: Samuel S. Shepard (vfn4@cdc.gov), Centers for Disease Control and Prevention.
- Overview of UDF BIOUTILS
- User-defined bioinformatics utilities for Impala SQL
- Acknowledgments
- Notices
Functions may be created in a schema such as udx
. This is our default, but the user is free to change the function location to whatever they wish. To show functions already registered, one may write:
use udx;
show functions;
show aggregate functions;
For further readings related to function development, see:
Scalar functions take any number of values from a single row and return a single value.
complete_date(STRING date) -> STRING
Purpose: Parses STRING dates with delimiters .
,/
, and -
; adds missing month or day component when applicable (as the first of either). A NULL
in the argument will return a null value.
fortnight_date(<STRING or TIMESTAMP or DATE> date [, BOOLEAN default_cadence]) -> DATE
Purpose: Given a valid date, the function returns the date of the LAST Saturday within a complete, two-week period. Fortnights begin on a Sunday. In other words, the date is rounded to the last date of the fortnight (ending in a Saturday). The optional default_cadence
parameter changes which two-week period is being considered, where a false
value results in the fortnight dates being adjusted by one week.
Null inputs, invalid dates, or the empty STRING will return null
. If the date
is a STRING it must be in YYYY-MM-DD format but may also be delimited using .
or /
. Partial STRING dates are considered invalid but can be corrected (see complete_date). Finally, dates before 1400-01-01
return NULL
, even if the date is considered valid otherwise.
saturday_date(<STRING or TIMESTAMP or DATE> date) -> DATE
Purpose: Given a valid date, the function returns the date of the Saturday within the same week. In other words, the date is rounded to the last date for that week (Sunday to Saturday).
Null inputs, invalid dates, or the empty STRING will return null
. If the date
is a STRING it must be in YYYY-MM-DD format but may also be delimited using .
or /
. Partial STRING dates are considered invalid but can be corrected (see complete_date). Finally, dates before 1400-01-01
return NULL
, even if the date is considered valid otherwise.
to_epiweek(<STRING or TIMESTAMP> date [, BOOLEAN year_format]) -> INT
Purpose: Returns the MMWR or EPI week of a formatted date STRING or timestamp. Normally the output is an integer ranging from 1 up to 52 or 53. The optional year_format
(defaults to false
) adds the padded week to the year so that the output can be graphed on a contiguous timeline. Some examples using the year format would be: 202102 and 199853.
Null inputs, invalid dates, or the empty STRING will return null
. If the date
is a STRING it must be in YYYY-MM-DD format but may also be delimited using .
or /
. Partial STRING dates are considered invalid but can be corrected (see complete_date). Finally, dates before 1400-01-01
return NULL
, even if the date is considered valid otherwise.
variant_hash(STRING residues) -> STRING
nt_id(STRING nucleotides) -> STRING
Purpose: Returns hashed identifiers for aligned or unaligned sequences. Case is ignored in the hash, as is whitespace, :
, -
, and .
. The nt_id
function also ignores ~
, which represents a translated partial codon in the variant_hash
. The variant_hash
is a 32 character hexadecimal from the md5 hash while the nt_id
is a 40 character hexadecimal from the sha1 hash. Null values or empty STRING return NULL
.
md5(STRING field1 [,STRING field2, ...]) -> STRING
Purpose: This function takes a variable number of STRING fields (i.e., "variadic") and returns a 32 character hexadecimal from the md5 hash.
Any null value in an argument or all empty STRINGs returns NULL
. Note: fields are concatenated with the bell character as the delimiter before hashing in order to preserve field boundaries.
→ See also the Impala native functions MURMUR_HASH, FNV_HASH, SHA1/SHA2, and HEX.
ci_t(DOUBLE confidence, BIGINT sample_size, DOUBLE standard_deviation [, BOOLEAN is_two_sided]) -> DOUBLE
Purpose: Takes the confidence level (e.g., 0.99), sample_size
,sample standard_deviation
and an optional boolean (default is two-sided) for sided-ness in order to generate the confidence interval for the Student's T-distribution. In other words, the value returned is +/- around the point estimate.
quantile_t(DOUBLE confidence, BIGINT samples, BOOLEAN is_two_sided) -> DOUBLE
Purpose: Given some confidence
level (e.g., 0.99), number of samples
and a boolean for whether the Student's T-distribution is two-sided, estimate the quantile function.
reverse_complement(STRING nucleotides) -> STRING
Purpose: Returns the reverse complement nucleotide sequence. Case is preserved and ambiguous nucleotides are mapped. Non-nucleotide characters are left as-is but in reverse order. Null values return NULL
and an empty STRING remains empty.
to_aa(STRING nucleotides[, STRING replacement_nucleotides, int starting_position])
Purpose: Translates a nucleotide sequence to an amino acid sequence starting at position 1 of argument nucleotides
(including resolvable ambiguous codons). Unknown or partial codons are translated as ?
, mixed or partially gapped codons are translated as ~
, and deletions (-
) or missing data (.
) are compacted from 3 to 1 character. Residues are always written out in uppercase. Optionally, one may overwrite a portion of the nucleotide sequence prior to translation by providing replacement_nucleotides
and a starting_position
. Specifying out-of-range indices will append to the 5' or 3' end while specifying a replacement sequence larger than the original will result in the extra nucleotides being appended after in-range bases are overwritten. If any argument is NULL
a null value is returned. If the replacement_nucleotides
argument is an empty STRING, the nucleotides
argument is translated as-is. On the other hand, if the nucleotides
argument is an empty STRING but replacement_nucleotides
is not, then replacement_nucleotides
is translated and returned.
to_aa3(STRING nucleotides) -> STRING
Purpose: Translates DNA coding sequences into an amino acid sequence. However, if a codon could result in more than one translated AA residue, then up to 3 possible translations will be prouduced (separated by /
) before reverting to an X
. If the degenerate translation is in a sequence it is enclosed in brackets to avoid misreading.
Example:
select udx.to_aa3("ATGsCCTCCTGA") --> "M[A/P]S*"
select udx.to_aa3("GrN") --> "D/E/G"
select udx.to_aa3("GNy") --> "X" (more than 3 residues possible)
hamming_distance(STRING seq1, STRING seq2 [, STRING pairwise_deletion_set])
nt_distance(STRING seq1, STRING seq2)
Return type: INT
Purpose: Counts the number of differences between two sequences (though any STRINGs may be used).
If one sequence is longer than the other, the extra characters are discarded from the calculation. By default, any pair of characters with a .
as an element is ignored by the calculation. In DAIS, the .
character is used for missing data. Optionally, one may explicitly add a pairwise deletion character set. If any pair of characters contain any of the characters in the argument, that position is ignored from the calculation. If any argument is NULL
or either sequence argument is an empty STRING, a null value is returned. If the optional pairwise_deletion_set
argument is an empty STRING, no pairwise deletion is performed. The nt_distance
function is the same as the default version of hamming_distance
but does not count ambiguated differences. For example, A ≠ T but A = R.
→ See also the Impala native function JARO_DISTANCE.
tn_93(STRING sequence1, STRING sequence2 [, DOUBLE alpha]) -> DOUBLE
Purpose: Calculates the Tamura-Nei (TN-93) evolutionary distance between two aligned nucleotide sequences. The model accounts for different base frequencies of each nucleotide, as well as different rates for different substitution types: transitions, (A ↔ G or C ↔ T) type 1 transversions, (A ↔ T and C ↔ G) and type 2 transversions. (A ↔ C and G ↔ T). If either argument is NULL
or ""
then a null is returned. For very short and dissimilar sequences, a null value may also be returned due to needing to calculate the logarithm of a non-positive number. This model can optionally include a correction for rate variability among sites using a gamma distribution with a single shape parameter (alpha). Smaller values of alpha represent greater rate variation, while larger values suggest more uniform rates. If no alpha is specified, the distance is calculated under the assumption of equal rates across all sites.
sequence_diff(STRING seq1, STRING seq2)
sequence_diff_nt(STRING seq1, STRING seq2)
Return type: STRING
Purpose: Expects aligned biological sequences and returns a sequence that denotes the differences between seq1
and seq2
by returning the character from seq2
in the differing positions. Otherwise, if the characters are the same at a given position, a .
will be returned. For example, if seq1
sequence is AAAA-A
and seq2
sequence is AGA-AA
, the function will return .G.-A.
.
- The function
sequence_diff_nt
is similar to the vanillasequence_diff
function but will correctly account for degenerate nucleotides (e.g., Y for pyrimidine, R for purine).
mutation_list(STRING seq1, STRING seq2 [, STRING range])
mutation_list_gly(STRING seq1, STRING seq2)
mutation_list_indel_gly(STRING seq1, STRING seq2)
mutation_list_nt(STRING seq1, STRING seq2)
mutation_list_pds(STRING seq1, STRING seq2, STRING pairwise_deletion_set)
Return type: STRING
Purpose: Expects aligned biological sequences and returns a list of mutations from seq1
to seq2
, delimited by a comma + space. If the range
argument is included, only those sites will be compared (see the description of range_coords
in substr_range
). For example: A2G, T160K, G340R
. If any argument is NULL
or empty, a null value is returned. The function mutation_list
returns differences and may be used for nucleotide, amino acid, or any other sequence. There are some alternative variants to the function:
- The function
mutation_list_gly
is similar to the vanillamutation_list
function but annotates the addition or loss of an N-linked glycosylation site due to substitution or single deletion (fromseq1
toseq2
). Functionmutation_list_indel_gly
is an enhancement ofmutation_list_gly
which robustly accounts for the effects of indels on glycosylation for deletions no larger than 5 (a parameter that is adjustable at compile-time). Warning:mutation_list_indel_gly
is substantially slower thanmutation_list_gly
. - The function
mutation_list_nt
is suitable only for nucleotide sequences and ignores resolvable differences involving ambiguous nucleotides (e.g., "R2G" would not be listed). - The function
mutation_list_pds
also contains an explicit argument for a pairwise deletion character set. If any pair of characters contain any of the characters in the argument, that position is ignored from the calculation.
pcd(STRING sequence1, STRING sequence2) -> DOUBLE
Purpose: Calculates the PCD ("physiochemical", or perhaps more precisely, the "physicochemical" distance) between two aligned amino acid sequences. First the function takes the Euclidean distance of physiochemical factors (see Atchley et al.'s protein sequence analysis) corresponding to the compared residues at each site. After that the per-site PCD values are averaged over all valid sites, or sites where both alleles contain normal amino residues. The function ignores sequence case and does not count X
as a valid comparison. However, comparisons to deletion states (-
) are calculated versus the 0-vector. If either argument is NULL
or ""
then a null is returned. Anecdotally, S. Shepard has observed that the distances correlate strongly with the JTT model.
pcd_list(STRING sequence1, STRING sequence2) -> STRING
Purpose: Calculates the per-site "physiochemical distance" (PCD) between two aligned amino acid sequences (see description for pcd()
). String-encoded PCDs for each site are space-delimited, and the number of sites will be the shorter of the two sequences. If either argument is NULL
or ""
then a null is returned.
aa_std(STRING sequence) -> STRING
nt_std(STRING sequence) -> STRING
Purpose: Standardizes the assumed amino acid or nucleotide sequence
respectively by removing extraneous characters (\n\r\t :.-
) and switching to uppercase. The nucleotide version also removes ~
, which is interpreted by DAIS-ribosome as a partial codon for amino acid sequences. The same sequence cleaning is done by the ID generating functions variant_hash and nt_id. Returns NULL
if the input is null or an empty string is provided.
longest_deletion(STRING sequence), deletion_events(STRING sequence) -> INT
Return type: INT
Purpose: Returns the length of the longest deletion in the sequence
or the number of deletion events respectively. In either function, a deletion event must have upstream and downstream alphabetic character to the deletion span (using -
for deletion characters only). The functions return null
if the sequence is null and they return 0 if the empty STRING is used.
any_instr(STRING haystack, STRING needles) -> BOOLEAN
Purpose: Checks if any of the characters in the needles
STRING is in the haystack
STRING. Will return NULL
if either argument is null. Note that if both needles
and haystack
has an empty string, the function always returns TRUE
.
→ See also the Impala native function INSTR.
contains_sym(STRING str1, STRING str2) -> BOOLEAN
Purpose: Returns true if str1
is a substring of str2
or vice-versa. If just one argument is an empty STRING, the function returns false. A NULL
in any argument will return a null value.
contains_element(STRING str, STRING list, STRING delim) -> BOOLEAN
Purpose: Return TRUE if str
contains any element in list
delimited by delim
as a substring. The delimiter may be a sequence of characters, but if it is empty the list is split by character. A NULL
in any argument will return a null value.
cut_paste(STRING str, STRING delim, STRING fields [, STRING output_delim]) -> STRING
Purpose: For str
, split the string using the delim
and paste/concatenate back together based on the specified fields
(similar to using Unix cut
and/or paste
). One can optionally specify the output_delim
otherwise delim
(or if NULL
). The fields
uses 1-based indexing which can be separated with ;
or ,
characters. Ranges are also allowed using -
or ..
strings. The function will return NULL
if either str
, delim
, or fields
is NULL
.
Example:
select udx.cut_paste("The::fields::are::cut::pastable::!", "::", "1..3;6,6;4-3", " ")
-- Returns: "The fields are ! ! cut are"
is_element(STRING str, STRING list, STRING delim) -> BOOLEAN
Purpose: Returns true if str
is equal to any element of list
delimited by delim
. The delimiter may be a sequence of characters, but if it is empty the list is split by character.
→ See also the Impala native function FIND_IN_SET.
range_from_list(STRING integer_list, STRING delim) -> STRING
Purpose: Takes STRING list
of integers separated by STRING delim
(split by substring) and returns a STRING range in the format described in substr_range
for the field range_coords
. List elements will be added uniquely as a set. If a non-integer element is found or any argument is NULL
, a null value is returned. If an empty list
is given it is returned as-is; if an empty delim
is given, the list
is returned.
substr_range(STRING str, STRING range_coords) -> STRING
Purpose: Returns the characters in str
specified by range_coords
.
All characters are concatenated as specified by range_coords
. Ranges may be listed in forward and reverse, which affects output order, and are denoted by #..#
.
Multiple ranges or single characters may be separated using a semi-colon or comma. For example: 10..1;12;15;20..25
. If any argument is NULL
or an empty STRING, a null value is returned.
→ See also the Impala native function SUBSTR.
sort_alleles(STRING allele_or_mutation_list, STRING delim) -> STRING
Return type: STRING
Purpose: Returns a list of sorted mutations (e.g., A2G, T160K
) or alleles (e.g., 2G 160K
) where the list delimiter delim
is used both for splitting elements and for the output of the sorted list. Alleles and mutations are sorted first by their integer position, unlike other list sorting which looks at the element as a STRING. If any argument is NULL
a null value is returned.
sort_list(STRING list, STRING delim)
sort_list_unique(STRING list, STRING delim)
sort_list_set(STRING list, STRING delim_set, STRING output_delim)
sort_site_list(SRING list)
Return type: STRING
Purpose: Returns an alphabetically sorted list of elements in the STRING list
delimited by delim
or delim_set
. The function sort_list
interprets multi-character delimiters as a whole STRING while the function sort_list_set
treats each character in the argument delim_set
as a single-character delimiter (all are applied). The input and output delimiter for sort_list
are taken to be the same while the output delimiter for sort_list_set
is specified by output_delim
. If any argument is NULL
a null value is returned. The function variant sort_list_unique
behaves exactly like sort_list
but removes redundant elements. sort_site_list
is for numerically sorting a string of comma-separated integers returned by the output of group_concat()
, and will return NULL
if non-numeric values are found in the list.
codon_at_og_position(STRING query_nt_coords, STRING cds_nt_coords, STRING cds_aln, BIGINT original_position [, STRING allele]) -> STRING
Purpose: Takes coordinate mapping information provided by DAIS-ribosome, such as the query_nt_coords
and cds_nt_coords
, the cds_aln
(aligned coding sequence), and a position in the original coordinate space and then yields the whole codon at that position. The original position might come from an untrimmed IRMA variant position. An optional allele argument may be provided in order to mutate the retrieved codon at the mapped CDS position. This can be convenient for analyzing variants.
Empty strings and null values on any input returns NULL
. In particular, if the mapped position is before, after, or in-between (insertions) alignment-space coordinates, they are considered out-of-bounds and the function will return NULL
.
Example:
-- Out of bounds
select udx.codon_at_og_position("4..6;8..10", "1..3;4..6", "ATGGAC", 3) --> NULL
-- Retrieves the in-frame codon regardless of codon position
select udx.codon_at_og_position("4..6;8..10", "1..3;4..6", "ATGGAC", 4) --> "ATG"
select udx.codon_at_og_position("4..6;8..10", "1..3;4..6", "ATGGAC", 6) --> "ATG"
select udx.codon_at_og_position("4..6;8..10", "1..3;4..6", "ATGGAC", 9) --> "GAC"
-- The retrieved codon can be mutated at the corresponding alignment position
select udx.codon_at_og_position("4..6;8..10;11..16", "1..3;4..6;7..12", "ATGTAGCATTAR", 16) --> "TAR"
select udx.codon_at_og_position("4..6;8..10;11..16", "1..3;4..6;7..12", "ATGTAGCATTAR", 16, "a") --> "TAa"
og_to_aa_position(STRING query_nt_coords, STRING cds_nt_coords, BIGINT original_position) -> INT
og_to_cds_position(STRING query_nt_coords, STRING cds_nt_coords, BIGINT original_position) -> INT
Purpose: Takes coordinate mapping information provided by DAIS-ribosome, such as the query_nt_coords
and cds_nt_coords
plus a position in the original coordinate space and maps to the aligned AA or CDS position for the og_to_aa_position
or og_to_cds_position
functions respectively. The original position might come from an untrimmed IRMA variant position.
Empty strings and null values on any input returns NULL
. In particular, if the mapped position is before, after, or in-between (insertions) alignment-space coordinates, they are considered out-of-bounds and the function will return NULL
.
Example:
select udx.og_to_aa_position("4..6;8..10;11..16", "1..3;4..6;7..12", 11) --> 3
select udx.og_to_cds_position("4..6;8..10;11..16", "1..3;4..6;7..12", 11) --> 7
-- Insertions are not mappable
select udx.og_to_cds_position("1..456;457..459;460..983", "31..486;486;487..1010", 458) --> NULL
-- This is fine
select udx.og_to_cds_position("1..456;457..459;460..983", "31..486;486;487..1010", 460) --> 487
og_pos_to_aa3_mutation(STRING query_nt_coords, STRING cds_nt_coords, STRING cds_aln, BIGINT original_position, STRING consensus_allele, STRING minority_allele) -> STRING
Purpose: Takes coordinate mapping information provided by DAIS-ribosome, such as the query_nt_coords
and cds_nt_coords
, the cds_aln
(aligned coding sequence), a position in the original coordinate space, and a pair of alleles. Under the hood, the function obtains the codon, mutates it at the provided CDS position using each allele, and then translates each mutated codon using the degenerate amino acid translation function. The output is of the form AA1#AA2
where the numeric position is the aligned amino acid position.
The original position might come from an untrimmed IRMA variant position. Likewise, the alleles might come from IRMA's consensus and variant allele provided by its variant table.
Empty strings and null values on any input returns NULL
. In particular, if the mapped position is before, after, or in-between (insertions) alignment-space coordinates, they are considered out-of-bounds and the function will return NULL
.
Example:
-- Simple amino acid mutations
select udx.og_pos_to_aa3_mutation("4..6", "1..3", "ATR", 6, "G", "A") --> "M1I"
select udx.og_pos_to_aa3_mutation("4..6;8..10", "1..3;4..6", "ATGTYG", 9, "C", "T") --> "S2L"
select udx.og_pos_to_aa3_mutation("4..6;8..10;11..16", "1..3;4..6;7..12", "ATGSCGCATTYN", 8, "C", "G") --> "P2A"
-- Multiple products are allowed on either side
select udx.og_pos_to_aa3_mutation("4..6;8..10;11..16", "1..3;4..6;7..12", "ATGTAGCMWTYN", 12, "C", "A") --> "P3H/Q"
select udx.og_pos_to_aa3_mutation("4..6;8..10;11..16", "1..3;4..6;7..12", "ATGTAGCATTYK", 16, "g", "t") --> "L/S4F/S"
Aggregate functions take many values within a group and return a single value per group.
bitwise_sum(BIGINT values) -> BIGINT
Purpose: Returns the bitwise OR of all values. Available on the CDP cluster
kurtosis(<INT or DOUBLE> values) -> DOUBLE
Purpose: Compute the 4th moment or kurtosis using a one-pass formula.
logfold_agreement(INT values) -> DOUBLE
Purpose: Performs a test for modality called agreement (uniform: 0, unimodal: +1, bimodal: -1). The logfold titer values are allowed to take -16 to 16 but the categories for the calculation will be bounded to 10 for the distribution.
skewness(<INT or DOUBLE> values) -> DOUBLE
Purpose: Compute the 3rd moment or skewness using a one-pass formula.
-- Aggregate Function
alphanumeric_entropy(STRING s) -> DOUBLE
nt_entropy(STRING s) -> DOUBLE
aa_entropy(STRING s) -> DOUBLE
codon_entropy(STRING s) -> DOUBLE
-- Scalar Function
alnum_entropy(STRING s) -> DOUBLE
Purpose: The aggregate function alphanumeric_entropy
slurps all strings within the GROUP BY
column and calculates the alphanumeric character entropy over the total group-wise data. Likewise, the aggregate functions nt_entropy
, aa_entropy
, and codon_entropy
calculate the entropy based on group data where the entropy is being calculated over canonical nucleotides, amino acids (1-letter code, no X
or *
), or by codon (3 canonical nucleotides) respectively. Please note that these function operate over the first base, residue or codon and do not slurp every value in each sequence. As such they are useful in combination with extraction and slicing functions.
The scalar function alnum_entropy
is equivalent to alphanumeric_entropy
but only performs its operation on a single value.
We'd like to thank contributors (in alphabetical order) who have suggested features, identified bugs, or submitted merge requests:
- Norman Hassell (CDC)
- Kristine Lacek (CDC)
- Clint Paden (CDC)
General disclaimer This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later. This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version. This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details. You should have received a copy of the Apache Software License along with this program. If not, see: http://www.apache.org/licenses/LICENSE-2.0.html. The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.