# Profiling of gene sets enriched in genomes with strong IDP propensity in ARF


Functional annotation was performed with KofamScan (Aramaki et al., Bioinformatics, 2020).

The HMM profiles and KO list were downloaded on Mar. 24, 2021, as follows:

```{bash}
# at /nfs_share/yamanouchi/kofamscan210324
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz 
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz 

gunzip ko_list.gz
tar xvfz profiles.tar.gz
```

(`ko_list` was copied to the `metadata/` directory.)

The KofamScan annotation pipeline was then run with the following configuration:

```{bash}
docker-compose run --rm kofamscan /bin/bash /scripts/210528_kofamscan.sh
```

- docker-compose.yml (partial)
    ```
    kofamscan:
      build: ./docker/kofamscan
      volumes:
        - ./data:/data
        - ./scripts:/scripts
        - /nfs_share/yamanouchi/kofamscan210324:/db
    ```


- docker/kofamscan/Dockerfile
    ```{Dockerfile}
    FROM continuumio/miniconda3:4.9.2
    RUN conda install -c conda-forge -y mamba==0.8.2
    RUN mamba install -c bioconda -y kofamscan==1.3.0 
    ```


- scripts/210528_kofamscan.sh:

    ```{bash}
    #!/bin/bash -eu

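    # Annotate one proteome: /data/cds_prot/<name>.faa -> /data/kofamscan/<name>.tsv
    run_kofamscan () {
        local infile=$1
        local outfile=${infile/cds_prot/kofamscan}
        outfile=${outfile/.faa/.tsv}
        local tmpdir
        tmpdir=$(mktemp -d)
        # $tmpdir is expanded when the trap is set, so cleanup still works after the function returns.
        trap "[[ -d '$tmpdir' ]] && rm -rf '$tmpdir'" ERR EXIT
        exec_annotation -p /db/profiles -k /db/ko_list --cpu 20 --tmp-dir "$tmpdir" -f detail-tsv "$infile" > "$outfile"
    }

    export -f run_kofamscan

    # Run up to 5 annotations in parallel; -I{} already implies one input line per command.
    ls /data/cds_prot/*.faa | xargs -t -P5 -I{} /bin/bash -c 'run_kofamscan {}'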
    ```
    
    

In [1]:
import pandas as pd
from functools import partial
from tqdm.notebook import tqdm
from pyscripts.config import path2
from pyscripts.datasets import Metadata
metadata = Metadata()

In [2]:
def load_annot(gcf):
    annot = pd.read_csv(
        path2.data/'kofamscan'/f'{gcf}.tsv', 
        sep='\t', skiprows=2, 
        names=['star', 'gene_name', 'KO', 'threshold', 'score', 'evalue', 'KO_definition']
    )
    # Each gene is assigned the K number with the lowest (most significant) e-value
    # among hits marked '*' (score above the KO-specific threshold).
    # Genes with no hit above the threshold receive no annotation.
    best_hit = annot[annot['star'].notnull()].groupby('gene_name')['evalue'].idxmin()
    filt_annot = annot.loc[best_hit, 'KO'].value_counts()
    return gcf, filt_annot
    
from multiprocessing import Pool

with Pool(100) as pool:
    ko_sets = pd.DataFrame(
        dict(tqdm(pool.imap_unordered(load_annot, metadata.acc['refseq']), total=len(metadata.acc))), 
        dtype=pd.Int64Dtype()
    ).fillna(0).astype(int)

ko_sets.to_pickle(path2.data/'kofamscan'/'summary.pkl.bz2')

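
To make the best-hit rule in `load_annot` concrete, here is a minimal sketch on a toy table. The gene names, K numbers, and e-values are made up for illustration and do not come from the real KofamScan output; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy stand-in for a parsed detail-tsv table (all values are hypothetical).
annot = pd.DataFrame({
    "star":      ["*", None, "*", "*", None],
    "gene_name": ["g1", "g1", "g2", "g2", "g3"],
    "KO":        ["K_a", "K_b", "K_c", "K_a", "K_d"],
    "evalue":    [1e-50, 1e-60, 1e-10, 1e-30, 1e-5],
})

# Keep only hits whose score cleared the KO-specific threshold (marked '*').
starred = annot[annot["star"].notnull()]

# For each gene, take the row label of the starred hit with the smallest e-value.
best_idx = starred.groupby("gene_name")["evalue"].idxmin()

# Count how many genes were assigned to each KO.
ko_counts = annot.loc[best_idx, "KO"].value_counts()
print(ko_counts)
```

In this toy case `g1` keeps `K_a` even though `K_b` has a smaller e-value, because the `K_b` hit is not starred, and `g3` receives no annotation at all.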

In [3]:
idp_summary = pd.read_pickle(
    path2.data/'iupred2a'/'summary.pkl.bz2'
).swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1)

In [4]:
idp_fw2rc1 = idp_summary[30,0.5].loc[[4,8]].sum() / idp_summary[30,0].loc[[4,8]].sum()
ko_corr = ko_sets.T.apply(partial(idp_fw2rc1.corr, method='spearman')).rename('rho')
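
The per-KO correlation step can be illustrated on a small example. The sketch below uses hypothetical genome labels, KO labels, and values, and it computes Spearman's rho as the Pearson correlation of ranks. That is an equivalent formulation chosen here so the example depends only on pandas; the cell above calls `Series.corr(method='spearman')` directly.

```python
import pandas as pd

# Hypothetical per-genome statistic (stand-in for idp_fw2rc1), indexed by genome.
stat = pd.Series({"gA": 0.1, "gB": 0.2, "gC": 0.3, "gD": 0.4})

# Hypothetical KO count matrix with the same layout as ko_sets:
# rows are KOs, columns are genomes.
counts = pd.DataFrame(
    {"gA": [0, 3], "gB": [1, 2], "gC": [2, 1], "gD": [5, 0]},
    index=["K_up", "K_down"],
)

# Transpose so that each column is one KO, then correlate its ranks
# with the ranks of the per-genome statistic (aligned on genome labels).
stat_rank = stat.rank()
rho = counts.T.apply(lambda col: stat_rank.corr(col.rank())).rename("rho")
print(rho)
```

A KO whose copy number rises monotonically with the statistic gets rho = 1 and one that falls monotonically gets rho = -1, regardless of the actual count values.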

In [5]:
ko_list = pd.read_csv(path2.metadata/'ko_list', sep='\t', index_col=0)
results = pd.concat([ko_corr, ko_list['definition']], axis=1).dropna().sort_values(by='rho')
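
As a small illustration of how the correlation table is joined with the KO definitions and then sliced, the following sketch uses hypothetical KO labels and definitions. It shows that `dropna()` discards both KOs without a defined correlation and KOs that appear only in the definition list, and that a reversed `iloc` slice returns the highest-rho rows first.

```python
import numpy as np
import pandas as pd

# Hypothetical correlations; K_c has an undefined rho.
rho = pd.Series({"K_a": 0.6, "K_b": -0.4, "K_c": np.nan}, name="rho")

# Hypothetical definitions; K_d has no correlation value.
definitions = pd.Series(
    {"K_a": "definition A", "K_b": "definition B", "K_d": "definition D"},
    name="definition",
)

# Outer-join on the KO index, drop incomplete rows, sort ascending by rho.
table = pd.concat([rho, definitions], axis=1).dropna().sort_values(by="rho")

# Most negative first (as stored) and most positive first (reversed slice).
lowest_first = table.iloc[0:2]
highest_first = table.iloc[-1:-3:-1]
print(highest_first)
```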

In [6]:
with pd.option_context('display.max_rows', 100, 'display.max_colwidth', 150):
    display(results.iloc[-1:-101:-1])
    display(results.iloc[0:100])
    

| KO | rho | definition |
|---|---|---|
| K14162 | 0.639115 | error-prone DNA polymerase [EC:2.7.7.7] |
| K00344 | 0.632878 | NADPH:quinone reductase [EC:1.6.5.5] |
| K01692 | 0.620871 | enoyl-CoA hydratase [EC:4.2.1.17] |
| K05524 | 0.607644 | ferredoxin |
| K01496 | 0.60239 | phosphoribosyl-AMP cyclohydrolase [EC:3.5.4.19] |
| K14998 | 0.596191 | surfeit locus 1 family protein |
| K05838 | 0.595265 | putative thioredoxin |
| K01214 | 0.584397 | isoamylase [EC:3.2.1.68] |
| K12510 | 0.584232 | tight adherence protein B |
| K06980 | 0.582072 | tRNA-modifying protein YgfZ |


| KO | rho | definition |
|---|---|---|
| K22391 | -0.543406 | GTP cyclohydrolase I [EC:3.5.4.16] |
| K03495 | -0.531573 | tRNA uridine 5-carboxymethylaminomethyl modification enzyme |
| K07462 | -0.524237 | single-stranded-DNA-specific exonuclease [EC:3.1.-.-] |
| K07456 | -0.516295 | DNA mismatch repair protein MutS2 |
| K00243 | -0.506743 | uncharacterized protein |
| K03650 | -0.497353 | tRNA modification GTPase [EC:3.6.-.-] |
| K00876 | -0.483929 | uridine kinase [EC:2.7.1.48] |
| K01893 | -0.481589 | asparaginyl-tRNA synthetase [EC:6.1.1.22] |
| K03978 | -0.480697 | GTP-binding protein |
| K15460 | -0.476195 | tRNA1Val (adenine37-N6)-methyltransferase [EC:2.1.1.223] |


- Reproducibility was confirmed.
- A more carefully formatted version of this table will be prepared elsewhere in this workspace.