# VFDB

https://github.com/biobakery/humann#custom-nucleotide-reference-database

## Database File

In [5]:
ls

VFDB_setA_nt.fas   VFDB_setB_nt_anno.txt  VFDB_setB_pro_anno.txt
VFDB_setA_pro.fas  VFDB_setB_nt.fas       VFDB_setB_pro.fas


In [6]:
grep ">" VFDB_setB_nt.fas | wc -l

32522


In [7]:
grep ">" VFDB_setB_pro.fas | wc -l

28616
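
The same count can be reproduced in Python when the check needs to live in a script. This is a minimal sketch: `count_fasta_records` is a hypothetical helper name, and it counts only lines that start with `>`, which is slightly stricter than `grep ">"` (that matches `>` anywhere on a line).

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `count_fasta_records("VFDB_setB_nt.fas")` should agree with the `grep` count as long as no sequence line contains a `>` character.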


## Custom Database

### Custom nucleotide reference database | Bowtie Index 

In [None]:
bowtie2-build VFDB_setA_nt.fas VFDB_setA_nt #core dataset

In [None]:
bowtie2-build VFDB_setB_nt.fas VFDB_setB_nt #full dataset

### Custom protein reference database | Diamond Database

In [None]:
diamond makedb --in VFDB_setA_pro.fas -d VFDB_setA_pro #reference database for DIAMOND

In [None]:
diamond makedb --in VFDB_setB_pro.fas -d VFDB_setB_pro #full dataset
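
The Bowtie2 and DIAMOND build commands follow one pattern per tool, so they can be generated from the set names. The sketch below only assembles and prints the commands used above; `build_commands` and `run_all` are hypothetical helper names, and actually running them (with `dry_run=False`) assumes `bowtie2-build` and `diamond` are on `PATH` and the FASTA files are in the working directory.

```python
import subprocess


def build_commands(sets=("setA", "setB")):
    """Return the index-building commands for each VFDB set as argument lists."""
    commands = []
    for name in sets:
        # Bowtie2 index from the nucleotide FASTA
        commands.append(["bowtie2-build", f"VFDB_{name}_nt.fas", f"VFDB_{name}_nt"])
        # DIAMOND database from the protein FASTA
        commands.append(["diamond", "makedb", "--in", f"VFDB_{name}_pro.fas",
                         "-d", f"VFDB_{name}_pro"])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_all(build_commands())  # dry run: prints the four commands without executing them
```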

### Custom reference database annotations | id mapping

In [47]:
from Bio import SeqIO
record_iterator = SeqIO.parse("./DataBase/VFDB-2019-12-16/VFDB_setA_nt.fas", "fasta")

first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)
print(len(first_record))

VFG037176(gb|YP_001844723)
VFG037176(gb|YP_001844723) (plc) phospholipase C [Phospholipase C (VF0470)] [Acinetobacter baumannii ACICU]
2169


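```python
import re
import pandas as pd
from Bio import SeqIO

rows = []  # collect one record per sequence, then build the DataFrame once
for seq in SeqIO.parse("./DataBase/VFDB-2019-12-20/VFDB_setB_nt.fas", "fasta"):
    # Parenthesised groups in the header: accession, gene name, VF ID
    gf = re.findall(r'\((.*?)\)', seq.description)
    # Drop the accession, which is already part of seq.id
    gi = re.findall(r'\((.*?)\)', seq.id)
    gis = "".join(gi)
    if gis in gf:
        gf.remove(gis)

    # The last [...] field is the organism; keep genus and species only
    des2 = re.findall(r'\[(.*?)\]', seq.description)
    taxa1 = des2[-1].split(" ")
    taxa = "g__" + taxa1[0] + ".s__" + taxa1[1]
    # The VF ID (e.g. VF0470) is in des2[-2] if it is ever needed

    rows.append({'id': seq.id, 'gf': gf[0], 'len': str(len(seq)), 'taxa': taxa})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list
df = pd.DataFrame(rows, columns=['id', 'gf', 'len', 'taxa'])
df.to_csv("id-VF-gf.tsv", encoding="utf-8", sep="\t", header=False, index=False)
```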
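
To see what the loop extracts from a single header, the parsing can be isolated in a function and run on the first record printed above. This is a sketch, not part of the original pipeline: `parse_vfdb_header` is a hypothetical name, and it assumes every header follows the layout of that example (gene name in the first parentheses after the ID, organism in the last square brackets).

```python
import re


def parse_vfdb_header(seq_id, description):
    """Return (gene_family, taxonomy) parsed from a VFDB FASTA header."""
    # Parenthesised groups in the description, minus the accession that is part of the ID
    groups = re.findall(r"\((.*?)\)", description)
    accession = "".join(re.findall(r"\((.*?)\)", seq_id))
    if accession in groups:
        groups.remove(accession)
    # The last square-bracketed field holds the organism name
    organism = re.findall(r"\[(.*?)\]", description)[-1].split(" ")
    taxonomy = "g__" + organism[0] + ".s__" + organism[1]
    return groups[0], taxonomy


header_id = "VFG037176(gb|YP_001844723)"
header = ("VFG037176(gb|YP_001844723) (plc) phospholipase C "
          "[Phospholipase C (VF0470)] [Acinetobacter baumannii ACICU]")
print(parse_vfdb_header(header_id, header))
# ('plc', 'g__Acinetobacter.s__baumannii')
```

Headers without a parenthesised gene name, or with a one-word organism field, would raise an `IndexError` here, just as in the loop above.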

Each row of the mapping file has four tab-separated columns: `identifier`, `gene_family`, `gene_length`, `taxonomy`.

In [3]:
head id-mapping-full.tsv

VFG037176(gb|YP_001844723)	plc	2169	g__Acinetobacter.s__baumannii
VFG037177(gb|YP_001846906)	plc	2229	g__Acinetobacter.s__baumannii
VFG037203(gb|YP_001847849)	plcD	1626	g__Acinetobacter.s__baumannii
VFG037218(gb|YP_001847229)	basJ	1170	g__Acinetobacter.s__baumannii
VFG037232(gb|YP_001847230)	basI	756	g__Acinetobacter.s__baumannii
VFG037246(gb|YP_001847231)	basH	735	g__Acinetobacter.s__baumannii
VFG037260(gb|YP_001847232)	barB	1596	g__Acinetobacter.s__baumannii
VFG037274(gb|YP_001847233)	barA	1611	g__Acinetobacter.s__baumannii
VFG037288(gb|YP_001847235)	basG	1152	g__Acinetobacter.s__baumannii
VFG037302(gb|YP_001847236)	basF	870	g__Acinetobacter.s__baumannii
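
Before passing the mapping file to HUMAnN2, it can be worth checking that every row really has the four columns shown above. The following is a minimal sketch under that assumption; `check_id_mapping` is a hypothetical helper, and it checks only the column count and that the length field is an integer.

```python
def check_id_mapping(path):
    """Return (line_number, problem) pairs for rows that break the 4-column layout."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            fields = line.split("\t")
            if len(fields) != 4:
                problems.append((line_number, f"expected 4 columns, found {len(fields)}"))
            elif not fields[2].isdigit():
                problems.append((line_number, f"gene length is not an integer: {fields[2]!r}"))
    return problems
```

An empty list from `check_id_mapping("id-mapping-full.tsv")` means no row was flagged.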


To run HUMAnN2 with the custom reference database annotations (FILE), use the option "`--id-mapping $FILE`".

In [None]:
--id-mapping <id_mapping.tsv>

## Change Database

In [None]:
humann2_config --update database_folders nucleotide /home/junyuchen/Lab/Custom-DataBase/VFDB/Bowtie2-Index_VFDB_setA_nt

To run HUMAnN2 with your custom nucleotide reference database (located in DIR), use the option "`--bypass-nucleotide-index`" and provide the custom database as the ChocoPhlAn option with "`--nucleotide-database DIR`". If you would like to bypass the translated alignment portion of HUMAnN2, add the option "`--bypass-translated-search`".

In [None]:
--bypass-nucleotide-index --nucleotide-database /home/junyuchen/Lab/Custom-DataBase/VFDB/Bowtie2-Index_VFDB_setA_nt

In [None]:
humann2_config --update database_folders protein /home/junyuchen/Lab/Custom-DataBase/VFDB/Diamod-VFDB_setA_pro

### Example

In [None]:
humann2 --threads 4 --bypass-nucleotide-index --nucleotide-database /home/junyuchen/Lab/Custom-DataBase/VFDB/Bowtie2-Index_VFDB_setB_nt --input /home/junyuchen/Lab/Custom-DataBase/VFDB/TestData/Pseudomonas_simiae_WCS417_2455.ffn --output Verification/Pseudomonas_simiae_WCS417_2455_setB --id-mapping /home/junyuchen/Lab/Custom-DataBase/VFDB/id-mapping-full.tsv 
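
When the same run has to be repeated for many samples, the command line can be assembled programmatically from the options described above. This is a sketch only: `humann2_command` is a hypothetical helper, the file names in the usage line are placeholders, and nothing is executed.

```python
import shlex


def humann2_command(input_path, output_dir, nucleotide_db, id_mapping=None,
                    threads=None, bypass_translated=False):
    """Build a humann2 argument list for a custom nucleotide database."""
    cmd = ["humann2",
           "--input", input_path,
           "--output", output_dir,
           "--bypass-nucleotide-index",
           "--nucleotide-database", nucleotide_db]
    if id_mapping:
        cmd += ["--id-mapping", id_mapping]
    if threads:
        cmd += ["--threads", str(threads)]
    if bypass_translated:
        # Skip the translated (protein) alignment step
        cmd.append("--bypass-translated-search")
    return cmd


# Placeholder paths, for illustration only
example = humann2_command("sample.fastq", "sample_out", "Bowtie2-Index_VFDB_setB_nt",
                          id_mapping="id-mapping-full.tsv", threads=4)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.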

### Final parameters

```shell
humann2_config --update database_folders protein /home/junyuchen/Lab/Custom-DataBase/VFDB/DataBase/VFDB-2019-12-16/Diamod-VFDB_setB_pro
```

```shell
humann2 \
  --bypass-nucleotide-index \
  --nucleotide-database /home/junyuchen/Lab/Custom-DataBase/VFDB/DataBase/VFDB-2019-12-20 \
  --input <cat_reads dir> \
  --output <dir> \
  --id-mapping /home/junyuchen/Lab/Custom-DataBase/VFDB/DataBase/VFDB-2019-12-20/id-mapping-VFDB-2019-12-20.tsv
```

In [None]:
humann2 --threads 4 --bypass-nucleotide-index --nucleotide-database /home/junyuchen/Lab/Custom-DataBase/VFDB/Bowtie2-Index_VFDB_setB_nt --input /home/junyuchen/Lab/HUMAnN2-Pipline/humann2-conda/cat_reads/HSM7J4QT.fastq --output iHMP-HSM7J4QT-setB --id-mapping /home/junyuchen/Lab/Custom-DataBase/VFDB/id-mapping-full.tsv

In [None]:
/home/junyuchen/Lab/Meta-Analysis/Scripts/Metagenomics_VFDB_only.py