# Plant Pathogen Genome Mining
Genome mining involves extracting valuable genomic information from databases such as NCBI to understand gene content, structure, and potential functions. For plant pathogens, genome mining aids in identifying genes related to virulence, resistance, and adaptation, facilitating research in plant pathology and crop protection. In this notebook, we explore methods to fetch and analyze genome data, focusing on plant pathogens.

Steps Involved:
- Fetching genomes from NCBI using datasets
- Summarizing genomic data
- Utilizing Python and Unix commands for data analysis

### **INSTALL BIOCONDA IN COLAB**
Many bioinformatics tools are distributed through Bioconda, which requires conda. The cell below installs Miniconda into /usr/local in Google Colab. Bioconda setup instructions: https://bioconda.github.io/

In [None]:
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

### **INSTALL NCBI-DATASETS**
NCBI Datasets provides command-line tools (datasets and dataformat) for downloading data directly from NCBI. The cell below installs them with conda. Documentation: https://www.ncbi.nlm.nih.gov/datasets/

In [None]:
!conda install conda-forge::ncbi-datasets-cli -y
#!conda install -c conda-forge ncbi-datasets-cli -y
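
Before moving on, it can help to confirm that the install cells worked. The following is a minimal sketch that only checks whether the three executables used in this notebook can be found on the PATH; the helper name `find_tools` is hypothetical, and the sketch does not verify versions.

```python
import shutil

def find_tools(names=("conda", "datasets", "dataformat")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

tools = find_tools()
for name, path in tools.items():
    # A missing tool usually means an install cell failed or was skipped.
    status = path if path else "NOT FOUND (re-run the install cells)"
    print(f"{name}: {status}")
```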

## Fetching Metadata of Genomes from NCBI Using Datasets
The NCBI Datasets command-line tool allows you to easily fetch genome data. It's a powerful resource for retrieving data directly from NCBI's extensive database.


**Summarizing Genome Metadata with the NCBI Datasets Tool**

Let's fetch metadata of virus genomes.
The following command summarizes the available genome data for Tomato yellow leaf curl virus using the NCBI Datasets tool:

In [None]:
!datasets summary genome taxon "tomato yellow leaf curl virus" --as-json-lines

To save the output, redirect it to a file with the ">" character. Here we write it to virus_metadata.jsonl:

In [4]:
!datasets summary genome taxon "tomato yellow leaf curl virus" --as-json-lines > virus_metadata.jsonl

**Converting JSON to TSV Format**

JSON files are very handy, but sometimes not very user-friendly to read. Once you have the genome metadata saved in JSONL format, you can convert it to a more readable tab-separated values (TSV) format using the dataformat tool. This tool allows you to extract and display specific fields of interest from genomic metadata in an easy-to-read format like TSV.

In [None]:
!dataformat tsv genome --fields accession,organism-name,annotinfo-name --inputfile virus_metadata.jsonl > virus_summary.tsv
!cat virus_summary.tsv
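
To make the JSONL-to-TSV idea concrete, here is a small Python sketch of the same kind of flattening. It is not how dataformat is implemented. The two records, their accessions, and the key names (`accession`, `organism.organism_name`) are illustrative assumptions, and the helpers `pick` and `jsonl_to_tsv` are hypothetical.

```python
import csv
import io
import json

def pick(record, dotted_path, default=""):
    """Follow a dotted path such as 'organism.organism_name' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value

def jsonl_to_tsv(lines, paths):
    """Turn JSON Lines text into TSV text with one column per dotted path."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(paths)  # header row uses the paths themselves
    for line in lines:
        if line.strip():  # ignore blank lines
            record = json.loads(line)
            writer.writerow([pick(record, p) for p in paths])
    return out.getvalue()

# Made-up records standing in for virus_metadata.jsonl (assumed structure).
sample_lines = [
    '{"accession": "GCF_000000001.1", "organism": {"organism_name": "Example virus A"}}',
    '{"accession": "GCF_000000002.1", "organism": {}}',
]
tsv_text = jsonl_to_tsv(sample_lines, ["accession", "organism.organism_name"])
print(tsv_text)
```

To try it on real output, pass `open("virus_metadata.jsonl")` instead of `sample_lines` and adjust the paths to the keys actually present in the file.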

Let's now fetch the metadata for fungal genomes, using Fusarium oxysporum as an example.

The datasets command retrieves genome metadata for all records that match "Fusarium oxysporum" in the GenBank database. The output is then piped to the next command. The dataformat command converts the JSON data to a tab-separated values (TSV) format and displays selected fields: accession, organism name, GC percentage, and geographic location of the biosample.

In [None]:
!datasets summary genome taxon "Fusarium oxysporum" --assembly-source genbank --as-json-lines | dataformat tsv genome --fields accession,organism-name,assmstats-gc-percent,assminfo-biosample-geo-loc-name > fo_attributes.tsv
!cat fo_attributes.tsv

What other attributes (fields) can we select?

In [None]:
!dataformat tsv genome --help

Use Unix commands to fetch specific information from our report, fo_attributes.tsv.

The report fo_attributes.tsv has a header line, which we skip with the Unix command tail -n +2. We can then count how many records we obtained with wc -l.

In [None]:
!tail -n +2 fo_attributes.tsv | wc -l

We can use the awk command to calculate the average GC content. By examining the table report, we see that the GC value for each genome is in the third column, $3.

In [None]:
!tail -n +2 fo_attributes.tsv | awk -F'\t' '{sum += $3; count++} END {print "Average GC Content: ", sum/count}'
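
The same average can be sketched in Python, which makes the two details that matter explicit: the header is skipped exactly once, and the sum is divided by the number of rows that actually carry a GC value. The sample table below is made up (placeholder accessions, values, and header labels), and `mean_gc` is a hypothetical helper.

```python
import csv
import io

# Hypothetical stand-in for fo_attributes.tsv: one header row, then
# accession, organism name, GC percent, and location. All values are made up.
sample_tsv = (
    "accession\torganism\tgc_percent\tlocation\n"
    "GCA_000000001.1\tFusarium oxysporum\t47.5\tAustralia: Queensland\n"
    "GCA_000000002.1\tFusarium oxysporum\t48.5\tUSA\n"
    "GCA_000000003.1\tFusarium oxysporum\t\tAustralia\n"
)

def mean_gc(handle, column=2):
    """Return (n_records, n_with_gc, mean) for a TSV stream with one header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header, like `tail -n +2`
    n_records = 0
    values = []
    for row in reader:
        n_records += 1
        # Only rows with a non-empty GC field contribute to the mean.
        if len(row) > column and row[column].strip():
            values.append(float(row[column]))
    mean = sum(values) / len(values) if values else None
    return n_records, len(values), mean

n_records, n_with_gc, average = mean_gc(io.StringIO(sample_tsv))
print(f"{n_records} records, {n_with_gc} with GC values, mean GC = {average}")
```

Note that awk treats an empty field as 0 in arithmetic, so records with a missing GC value lower an awk average unless they are filtered out first.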

In the report, the geographic location of the isolate is in the fourth column. We can use cut to display the location, then sort and count.

In [None]:
!tail -n +2 fo_attributes.tsv | cut -f 4 | sort | uniq -c

Let's count all the genomes reported from Australia.

In [None]:
!cut -f 4 fo_attributes.tsv | grep -c "Australia"

Let's list the distinct Australian locations and count the genomes reported from each.

In [None]:
!cut -f 4 fo_attributes.tsv | grep "Australia" | sort | uniq -c
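
As a Python counterpart to cut, sort, uniq, and grep, the sketch below tallies genomes per country. It assumes the location field follows the common "Country: region" pattern and uses a made-up sample table; `tally_countries` is a hypothetical helper.

```python
import collections
import csv
import io

# Hypothetical stand-in for fo_attributes.tsv; all values are made up.
sample_tsv = (
    "accession\torganism\tgc_percent\tlocation\n"
    "GCA_000000001.1\tFusarium oxysporum\t47.5\tAustralia: Queensland\n"
    "GCA_000000002.1\tFusarium oxysporum\t48.5\tUSA\n"
    "GCA_000000003.1\tFusarium oxysporum\t47.9\tAustralia\n"
    "GCA_000000004.1\tFusarium oxysporum\t47.1\t\n"
)

def tally_countries(handle, column=3):
    """Count records per country, taking the text before the first ':' as the country."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    counts = collections.Counter()
    for row in reader:
        location = row[column].strip() if len(row) > column else ""
        country = location.split(":", 1)[0].strip() or "not reported"
        counts[country] += 1
    return counts

counts = tally_countries(io.StringIO(sample_tsv))
for country, n in counts.most_common():
    print(n, country, sep="\t")
```

Grouping by country rather than by the full location string merges entries such as "Australia" and "Australia: Queensland", which `uniq -c` would report separately.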

An alternative to Unix commands is the Python library pandas, which works very well with tables. The official pandas tutorials are an excellent starting point: https://pandas.pydata.org/docs/getting_started/tutorials.html

## Download Sequence Data Using Datasets

So far, we have downloaded metadata. Now, let's work with genomic data. The following commands download available genome assemblies for various plant pathogens using different criteria. Let’s start with Fusarium.

In [None]:
!datasets download genome taxon "Fusarium oxysporum f. sp. lycopersici" --assembly-source genbank

The results are saved in a zip file named ncbi_dataset.zip. The following commands unzip ncbi_dataset.zip and rename the extracted ncbi_dataset folder to fo_data.

In [None]:
!unzip -o ncbi_dataset.zip
!mv ncbi_dataset fo_data
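
The archive can also be inspected from Python before extracting it. The sketch below builds a tiny in-memory zip that imitates the assumed package layout (`ncbi_dataset/data/<accession>/...`) and lists the FASTA members. The file names and the helper `list_fasta_members` are hypothetical.

```python
import io
import zipfile

def list_fasta_members(zip_source, suffixes=(".fna", ".faa")):
    """Return the sorted archive members whose names end with a FASTA suffix."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(suffixes))

# Build a small stand-in archive in memory (assumed layout, made-up names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README.md", "placeholder")
    archive.writestr(
        "ncbi_dataset/data/GCA_000000001.1/GCA_000000001.1_example_genomic.fna",
        ">seq1\nACGT\n",
    )
    archive.writestr("ncbi_dataset/data/assembly_data_report.jsonl", "{}\n")
buffer.seek(0)

fasta_members = list_fasta_members(buffer)
print(fasta_members)
```

On a real download, `list_fasta_members("ncbi_dataset.zip")` would take the file path instead of the in-memory buffer.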

You can also include other genomic features to download. For example, you can include protein sequences.

In [None]:
!datasets download genome taxon "Fusarium oxysporum f. sp. lycopersici" --assembly-source genbank --include protein

Now unzip the file and rename the extracted folder to fo_prot_data.

In [None]:
!unzip -o ncbi_dataset.zip
!mv ncbi_dataset fo_prot_data
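
Once protein FASTA files are on disk, a quick sanity check is to count the sequences and their lengths. This is a minimal FASTA reader under simple assumptions (plain text, ">" header lines); the sample records are made up and `read_fasta` is a hypothetical helper.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Made-up protein records standing in for a downloaded .faa file.
sample_faa = io.StringIO(
    ">prot1 hypothetical protein\nMKT\nAYI\n"
    ">prot2 hypothetical protein\nMSEQ\n"
)
lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(sample_faa)}
print(len(lengths), "proteins;", lengths)
```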

**Fetch a Specific Genome by Accession Number**

If you know the accession number of a genome, you can fetch it directly using the following command:

In [None]:
!datasets download genome accession GCA_021237285.1
!unzip -o ncbi_dataset.zip
!mv ncbi_dataset GCA_021237285.1_data
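
Assembly accessions follow a fixed pattern: a GCA (GenBank) or GCF (RefSeq) prefix, nine digits, and a version number. A small check like the sketch below can catch typos before a download is attempted; the helper name `parse_accession` is hypothetical.

```python
import re

# GCA_ or GCF_, nine digits, a dot, then the version number.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")

def parse_accession(text):
    """Split an assembly accession into its parts, or return None if malformed."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    prefix, digits, version = match.groups()
    return {
        "source": "GenBank" if prefix == "GCA" else "RefSeq",
        "number": digits,
        "version": int(version),
    }

info = parse_accession("GCA_021237285.1")
print(info)
print(parse_accession("GCA_123"))  # malformed, so None
```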

**Fetch Pathogen Genomes by Assembly Level**

You can also filter genomes by assembly level (e.g., complete genome, scaffold level, contig level). For example, let's work with the tomato bacterial pathogen Clavibacter michiganensis and fetch only its complete genome sequences:

In [None]:
!datasets download genome taxon "Clavibacter michiganensis" --assembly-level complete --assembly-source genbank
!unzip -o ncbi_dataset.zip
!mv ncbi_dataset clavibacter_data

**Fetch Pathogen Genomes by Species and Filter for Contig Level**

If you’re interested in contig-level assemblies rather than complete genomes (for cases where complete genomes are unavailable), you can adjust the command as follows:

Example: Download contig-level assemblies of Erwinia amylovora (fire blight pathogen):

In [None]:
!datasets download genome taxon "Erwinia amylovora" --assembly-level contig --assembly-source genbank --filename erwinia_amylovora_contig_genomes.zip
!unzip -o erwinia_amylovora_contig_genomes.zip
!mv ncbi_dataset erwinia_data
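
Because `--filename` changes the name of the archive, the unzip step has to target that same name rather than the default ncbi_dataset.zip. One way to keep the two in sync is to build both commands from a single variable, as in this sketch. It only assembles the argument lists and does not run anything. The helper `build_download` is hypothetical and uses only flags that appear in this notebook.

```python
import shlex

def build_download(taxon, assembly_level=None, assembly_source=None,
                   filename="ncbi_dataset.zip"):
    """Return (download_cmd, unzip_cmd) argument lists that share one zip name."""
    download = ["datasets", "download", "genome", "taxon", taxon]
    if assembly_level:
        download += ["--assembly-level", assembly_level]
    if assembly_source:
        download += ["--assembly-source", assembly_source]
    download += ["--filename", filename]
    unzip = ["unzip", "-o", filename]  # same name as passed to --filename
    return download, unzip

download_cmd, unzip_cmd = build_download(
    "Erwinia amylovora",
    assembly_level="contig",
    assembly_source="genbank",
    filename="erwinia_amylovora_contig_genomes.zip",
)
print(shlex.join(download_cmd))
print(shlex.join(unzip_cmd))
```

The printed lines can be pasted after `!` in a Colab cell, or the lists can be passed to `subprocess.run` once the tools are installed.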