Is your feature request related to a problem? Please describe.
The instability index function in seqmetrics (src/metrics.rs) is likely inefficient.
This is a function to calculate how unstable a protein is predicted to be when purified and in a test tube. Unstable proteins are less likely to be useful for downstream applications such as use of enzymes in biotechnology.
It is calculated according to Guruprasad K, Reddy BV, Pandit MW (1990). "Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence". Protein Eng. 4 (2): 155–61, where they take the weighted sum of dipeptide occurrences that are more frequently found in unstable proteins compared to stable ones.
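For reference, the index defined in the cited paper can be written as below, where L is the sequence length and DIWV is the dipeptide instability weight value for the adjacent residues x_i and x_{i+1}. The paper treats values above 40 as predicting an unstable protein. Whether the seqmetrics function applies the same 10/L scaling is an assumption, since only the inner loop is quoted in this issue.

```latex
II = \frac{10}{L} \sum_{i=1}^{L-1} \mathrm{DIWV}(x_i x_{i+1})
```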
The weights are loaded by the load_instability function, just above the instability index function in seqmetrics. The values it reads are in dipeptide_stability_values.csv in the seqmetrics crate folder, and the query file is K12_ribo.gbk in the same folder.
for window in chars.windows(2) {
    // builds a fresh String key for each adjacent residue pair
    let pair = format!("{}{}", window[0], window[1]);
    if let Some(val) = weights.get(&pair) {
        total += val;
    }
}
The format! call above allocates a new String for every dipeptide in every protein (potentially thousands of proteins). Keying the lookup on a tuple or array instead of a String should be more efficient.
Describe the solution you'd like
We should be able to improve efficiency by doing the weight lookup without building a String for each pair.
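One possible shape for this, as a minimal sketch rather than the crate's actual code: key the table on a two-byte array so the lookup needs no per-pair allocation. The names PairWeights, to_pair_weights and dipeptide_weight_sum are hypothetical. The sketch assumes the existing table is a HashMap<String, f64> and that protein sequences are ASCII, so each residue is one byte.

```rust
use std::collections::HashMap;

/// Hypothetical weight table keyed by a two-byte array instead of a String.
type PairWeights = HashMap<[u8; 2], f64>;

/// Convert a String-keyed table (assumed HashMap<String, f64>) once, up front.
fn to_pair_weights(weights: &HashMap<String, f64>) -> PairWeights {
    weights
        .iter()
        .filter_map(|(k, v)| match k.as_bytes() {
            [a, b] => Some(([*a, *b], *v)),
            _ => None, // skip keys that are not exactly two bytes long
        })
        .collect()
}

/// Sum the weights of all adjacent residue pairs without allocating per pair.
/// Pairs that are missing from the table contribute nothing, as in the original loop.
fn dipeptide_weight_sum(seq: &str, weights: &PairWeights) -> f64 {
    seq.as_bytes()
        .windows(2)
        .filter_map(|w| weights.get(&[w[0], w[1]]))
        .sum()
}
```

The conversion runs once per weight file, so the per-protein loop only hashes two bytes per window. A (char, char) tuple key would work the same way if sequences may contain non-ASCII characters.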
Additional context
A test is present in the seqmetrics crate:
use tokio::io::BufReader;
#[cfg(test)]
#[allow(dead_code)]
#[allow(unused_mut)]
#[allow(unused_variables)]
#[allow(unused_assignments)]
#[tokio::test]
pub async fn instability_test() -> Result<(), anyhow::Error> {
    let file_gbk = File::open("K12_ribo.gbk")?;
    let reader = Reader::new(file_gbk);
    let mut records = reader.records();
    let weights = load_instability("dipeptide_stability_values.csv").await?;
    loop {
        match records.next() {
            Some(Ok(record)) => {
                for (k, _v) in &record.cds.attributes {
                    match record.seq_features.get_sequence_faa(&k) {
                        Some(value) => {
                            let seq_faa = value.to_string();
                            let result = instability_index(seq_faa, &weights).await;
                            println!(
                                "instability index for {} {} is {}",
                                &record.id, &k, &result
                            );
                        }
                        _ => (),
                    };
                }
            }
            Some(Err(e)) => {
                println!("theres an error {:?}", e);
            }
            None => {
                println!("finished iteration");
                break;
            }
        }
    }
    return Ok(());
}