## Generate _P.generosa_ FastA files from CDS, gene, and mRNA GFFs

See this [GitHub Issue](https://github.com/RobertsLab/resources/issues/1439)

This notebook requires [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`; it provides the `gtf_extract` command used below.

### List computer specs

In [1]:
%%bash
echo "TODAY'S DATE"
date
echo "------------"
echo ""
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "
hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE
Thu 24 Mar 2022 02:51:39 PM PDT
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2400.007
BogoMIPS:                        4800.01
Hypervisor vendor:               VMware
Virtualization type:     

No LSB modules are available.


### Set variables
- Variables set with `%env` are environment variables, available to the `%%bash` cells.
- Variables set without `%env` are Python variables.
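
The `%%bash` cells can read the `%env` values because they run as child processes and inherit the environment. The same inheritance rule can be sketched in plain shell; the variable names `demo_dir` and `local_only` are made up for illustration.

```shell
#!/usr/bin/env bash
# Exported variables are inherited by child processes, which is the same
# mechanism that lets %%bash cells see values set with %env.
export demo_dir="/tmp/demo"      # hypothetical name, exported to children
local_only="not exported"        # plain shell variable, stays in this shell

child_sees_exported=$(bash -c 'echo "${demo_dir:-unset}"')
child_sees_local=$(bash -c 'echo "${local_only:-unset}"')

echo "exported: ${child_sees_exported}"
echo "local:    ${child_sees_local}"
```

Only the exported variable is visible to the child shell; the plain variable is reported as unset.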

In [2]:
# Set directories
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20220324-pgen-gffs_to_fastas
analysis_dir="/home/sam/analyses/20220324-pgen-gffs_to_fastas"

# Input GFFs
%env gff_url=https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming/
%env gff_prefix=Panopea-generosa-v1.0.a4.

# Genome FastA
%env genome_fasta=Panopea-generosa-v1.0.fa

# Programs
%env bedtools=/home/sam/programs/bedtools-2.29.1/bin/bedtools
%env samtools=/home/sam/programs/samtools-1.12/samtools

# Formatting
%env line_break="----------------------------------------------------------------------------------------------"

env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20220324-pgen-gffs_to_fastas
env: gff_url=https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming/
env: gff_prefix=Panopea-generosa-v1.0.a4.
env: genome_fasta=Panopea-generosa-v1.0.fa
env: bedtools=/home/sam/programs/bedtools-2.29.1/bin/bedtools
env: samtools=/home/sam/programs/samtools-1.12/samtools
env: line_break="----------------------------------------------------------------------------------------------"


### Make data and analysis directories if they don't exist

In [3]:
%%bash
mkdir --parents "${analysis_dir}" "${data_dir}"

### Download FastA and GFFs

When downloading with `wget`, include the `--no-check-certificate` option to avoid a certificate error.

In [4]:
%%bash

cd "${data_dir}"

# Array of GFF files.
gff_array=(Panopea-generosa-v1.0.a4.CDS.gff3 Panopea-generosa-v1.0.a4.mRNA.gff3 Panopea-generosa-v1.0.a4.gene.gff3)

# Download GFFs
for gff in "${gff_array[@]}"
do
    wget \
    --no-check-certificate \
    --continue \
    --quiet \
    "${gff_url}${gff}"
done

# Download FastA
wget \
--no-check-certificate \
--continue \
--quiet \
"${gff_url}${genome_fasta}"

ls -lh

total 985M
-rw-rw-r-- 1 sam sam  53M Nov  5  2019 Panopea-generosa-v1.0.a4.CDS.gff3
-rw-rw-r-- 1 sam sam 9.5M Nov  5  2019 Panopea-generosa-v1.0.a4.gene.gff3
-rw-rw-r-- 1 sam sam 9.1M Nov  5  2019 Panopea-generosa-v1.0.a4.mRNA.gff3
-rw-rw-r-- 1 sam sam 914M Nov  5  2019 Panopea-generosa-v1.0.fa
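
The GFF filenames are typed out in full in several cells even though `gff_prefix` is already defined. As a possible alternative, the array could be built from the prefix and a list of feature types. This is only a sketch: the `features` array is an assumed helper, and the loop prints URLs rather than downloading anything.

```shell
#!/usr/bin/env bash
# Build the GFF file list from a prefix and a list of feature types.
# Values mirror the notebook's variables; nothing is downloaded here.
gff_url="https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming/"
gff_prefix="Panopea-generosa-v1.0.a4."
features=(CDS mRNA gene)   # assumed helper array, not in the original cells

gff_array=()
for feature in "${features[@]}"
do
    gff_array+=("${gff_prefix}${feature}.gff3")
done

# Print the URLs that a download loop would request
for gff in "${gff_array[@]}"
do
    echo "${gff_url}${gff}"
done
```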


### Generate MD5 checksums for reference

In [5]:
%%bash
cd "${data_dir}"

md5sum *

b38127f901cd5f5f076bb85e40fab2f6  Panopea-generosa-v1.0.a4.CDS.gff3
5bf1cfc3ae2b68d41c49d0f732ade723  Panopea-generosa-v1.0.a4.gene.gff3
3514ad8a4fba72b00403ec604e9e32e4  Panopea-generosa-v1.0.a4.mRNA.gff3
b7b64f0ce79499d79a865348658d2e49  Panopea-generosa-v1.0.fa


### Examine GFF files

In [6]:
%%bash
cd "${data_dir}"

# Array of GFF files
gff_array=(Panopea-generosa-v1.0.a4.CDS.gff3 Panopea-generosa-v1.0.a4.mRNA.gff3 Panopea-generosa-v1.0.a4.gene.gff3)

# Pass all files to head at once so it prints a filename header for each
head "${gff_array[@]}"

==> Panopea-generosa-v1.0.a4.CDS.gff3 <==
##gff-version 3
##Generated using GenSAS, Monday 7th of October 2019 04:54:37 AM
##Project Name : Pgenerosa_v074
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	2	125	.	+	0	ID=PGEN_.00g000010.m01.CDS01;Name=PGEN_.00g000010.m01.CDS01;Parent=PGEN_.00g000010.m01;original_ID=cds.21510-PGEN_.00g234140.m01;Alias=cds.21510-PGEN_.00g234140.m01
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	1995	2095	.	+	1	ID=PGEN_.00g000010.m01.CDS02;Name=PGEN_.00g000010.m01.CDS02;Parent=PGEN_.00g000010.m01;original_ID=cds.21510-PGEN_.00g234140.m01;Alias=cds.21510-PGEN_.00g234140.m01
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	3325	3495	.	+	0	ID=PGEN_.00g000010.m01.CDS03;Name=PGEN_.00g000010.m01.CDS03;Parent=PGEN_.00g000010.m01;original_ID=cds.21510-PGEN_.00g234140.m01;Alias=cds.21510-PGEN_.00g234140.m01
Scaffold_01	GenSAS_5d9637f372b5d-publish	CDS	4651	4719	.	+	0	ID=PGEN_.00g000010.m01.CDS04;Name=PGEN_.00g000010.m01.CDS04;Parent=PGEN_.00g000010.m01;original_ID=cds.21510-P

### Create customized BED file from GFFs

- The 4th ("name") column of each BED file holds the feature ID so that `bedtools getfasta -name` can write it into the FastA headers later.

  - 4th column will be `ID|Parent` for CDS and mRNA features, or `ID` alone for genes.
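
To make the name column concrete, here is a minimal sketch that builds `ID|Parent` directly from the attributes field of a GFF3 record with `awk`. It is not how `gtf_extract` works internally; it only illustrates the intended result, using the first CDS record from the GFF shown earlier with its trailing attributes trimmed.

```shell
#!/usr/bin/env bash
# Build the BED "name" column (ID|Parent) from a GFF3 attributes field with awk.
# The record is the first CDS line from the GFF, with trailing attributes trimmed.
gff_line=$(printf 'Scaffold_01\tGenSAS_5d9637f372b5d-publish\tCDS\t2\t125\t.\t+\t0\tID=PGEN_.00g000010.m01.CDS01;Name=PGEN_.00g000010.m01.CDS01;Parent=PGEN_.00g000010.m01')

name_column=$(printf '%s\n' "${gff_line}" | awk -F'\t' '
{
    id = ""; parent = ""
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
    }
    # Records without a Parent attribute (e.g. genes) keep the ID alone
    print (parent == "" ? id : (id "|" parent))
}')

echo "${name_column}"
```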

In [7]:
%%bash
cd "${data_dir}"

# Array of GFF files
gff_array=(Panopea-generosa-v1.0.a4.CDS.gff3 Panopea-generosa-v1.0.a4.mRNA.gff3 Panopea-generosa-v1.0.a4.gene.gff3)


for gff in "${gff_array[@]}"
do

    # Trim off filename prefix
    trimmed_name=${gff/Panopea-generosa-v1.0.a4./}
    
    # Trim off filename suffix to get genome feature
    feature=${trimmed_name/.gff3/}
    
    if [[ "${feature}" != "gene" ]]
    then
        # Run gtf_extract on CDS and mRNA GFFs (ID and Parent)
        gtf_extract \
        --gff \
        --fields=chr,start,end,ID,Parent \
        "${gff}" \
        | awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "|" $5}' \
        > ${analysis_dir}/"${feature}".bed.tmp
    else
        # Run gtf_extract on the gene GFF (ID only)
        gtf_extract \
        --gff \
        --fields=chr,start,end,ID \
        "${gff}" \
        | awk '{print $1 "\t" $2 "\t" $3 "\t" $4}' \
        > ${analysis_dir}/"${feature}".bed.tmp
    fi
done

ls -lh ${analysis_dir}/*.bed.tmp

-rw-rw-r-- 1 sam sam  18M Mar 24 14:55 /home/sam/analyses/20220324-pgen-gffs_to_fastas/CDS.bed.tmp
-rw-rw-r-- 1 sam sam 1.6M Mar 24 14:55 /home/sam/analyses/20220324-pgen-gffs_to_fastas/gene.bed.tmp
-rw-rw-r-- 1 sam sam 2.4M Mar 24 14:55 /home/sam/analyses/20220324-pgen-gffs_to_fastas/mRNA.bed.tmp
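
One point worth checking before reusing these files: GFF3 coordinates are 1-based and inclusive, whereas BED coordinates are 0-based and half-open. The `awk` commands above pass the GFF start through unchanged, so `bedtools` would read an interval written as `2 125` as bases 3 to 125. If exact GFF intervals are wanted, the start can be decremented during conversion. The sketch below assumes tab-separated `chr,start,end,ID,Parent` input like the `gtf_extract` output.

```shell
#!/usr/bin/env bash
# Convert 1-based, inclusive GFF3 coordinates to 0-based, half-open BED
# coordinates by subtracting 1 from the start; the end is unchanged.
# Input mimics tab-separated chr,start,end,ID,Parent fields (first CDS record).
gff_fields=$(printf 'Scaffold_01\t2\t125\tPGEN_.00g000010.m01.CDS01\tPGEN_.00g000010.m01')

bed_line=$(printf '%s\n' "${gff_fields}" \
  | awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3, $4 "|" $5 }')

echo "${bed_line}"

# Feature length is the same in both systems: 125 - 2 + 1 = 125 - 1 = 124
bed_length=$(printf '%s\n' "${bed_line}" | awk -F'\t' '{ print $3 - $2 }')
echo "length: ${bed_length}"
```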


### Examine temporary BED files

In [8]:
%%bash
cd "${analysis_dir}"

for bed in *.bed.tmp
do
    echo ""
    echo "${bed}"
    echo ""
    head "${bed}"
    echo ""
    echo "${line_break}"
done


CDS.bed.tmp

Scaffold_01	2	125	PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01
Scaffold_01	1995	2095	PGEN_.00g000010.m01.CDS02|PGEN_.00g000010.m01
Scaffold_01	3325	3495	PGEN_.00g000010.m01.CDS03|PGEN_.00g000010.m01
Scaffold_01	4651	4719	PGEN_.00g000010.m01.CDS04|PGEN_.00g000010.m01
Scaffold_01	19808	19943	PGEN_.00g000020.m01.CDS01|PGEN_.00g000020.m01
Scaffold_01	21133	21362	PGEN_.00g000020.m01.CDS02|PGEN_.00g000020.m01
Scaffold_01	22487	22613	PGEN_.00g000020.m01.CDS03|PGEN_.00g000020.m01
Scaffold_01	24824	24959	PGEN_.00g000020.m01.CDS04|PGEN_.00g000020.m01
Scaffold_01	25981	26126	PGEN_.00g000020.m01.CDS05|PGEN_.00g000020.m01
Scaffold_01	27969	28019	PGEN_.00g000020.m01.CDS06|PGEN_.00g000020.m01

"----------------------------------------------------------------------------------------------"

gene.bed.tmp

Scaffold_01	2	4719	PGEN_.00g000010
Scaffold_01	19808	36739	PGEN_.00g000020
Scaffold_01	49248	52578	PGEN_.00g000030
Scaffold_01	55792	67546	PGEN_.00g000040
Scaffold_01	67586	69113	PGEN_.

### Create FastA files

In [9]:
%%bash
cd "${analysis_dir}"

for bed in *.bed.tmp
do
    # Strip the last extension (.tmp), e.g. CDS.bed.tmp becomes CDS.bed
    feature=${bed%.*}
    
    # Use bedtools getfasta to make FastAs from the BED files
    ${bedtools} getfasta \
    -name \
    -fi ${data_dir}/${genome_fasta} \
    -bed ${bed} \
    > ${gff_prefix}${feature}.fasta
    
    # Remove tmp BED file
    echo ""
    echo "Removing ${bed}."
    rm "${bed}"
done

ls -lh *.fasta


Removing CDS.bed.tmp.

Removing gene.bed.tmp.

Removing mRNA.bed.tmp.
-rw-rw-r-- 1 sam sam  64M Mar 24 14:55 Panopea-generosa-v1.0.a4.CDS.bed.fasta
-rw-rw-r-- 1 sam sam 362M Mar 24 14:55 Panopea-generosa-v1.0.a4.gene.bed.fasta
-rw-rw-r-- 1 sam sam 475M Mar 24 14:55 Panopea-generosa-v1.0.a4.mRNA.bed.fasta


index file /home/sam/data/P_generosa/genomes/Panopea-generosa-v1.0.fa.fai not found, generating...
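
The `.bed` in the output names (for example `Panopea-generosa-v1.0.a4.CDS.bed.fasta`) comes from `${bed%.*}`, which removes only the shortest matching suffix. A short sketch comparing it with the longest-match form `%%`:

```shell
#!/usr/bin/env bash
# Compare shortest-suffix (%) and longest-suffix (%%) removal on a BED filename.
bed="CDS.bed.tmp"

shortest="${bed%.*}"    # removes only the last extension
longest="${bed%%.*}"    # removes everything from the first period

echo "${shortest}"
echo "${longest}"
```

Using `%%` would give names such as `Panopea-generosa-v1.0.a4.CDS.fasta`; the files produced in this notebook keep the `.bed` infix.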


### Create FastA index files

In [10]:
%%bash
cd "${analysis_dir}"

for fasta in *.fasta
do
   ${samtools} faidx "${fasta}"
done

ls -ltrh

total 927M
-rw-rw-r-- 1 sam sam  64M Mar 24 14:55 Panopea-generosa-v1.0.a4.CDS.bed.fasta
-rw-rw-r-- 1 sam sam 362M Mar 24 14:55 Panopea-generosa-v1.0.a4.gene.bed.fasta
-rw-rw-r-- 1 sam sam 475M Mar 24 14:55 Panopea-generosa-v1.0.a4.mRNA.bed.fasta
-rw-rw-r-- 1 sam sam  22M Mar 24 14:55 Panopea-generosa-v1.0.a4.CDS.bed.fasta.fai
-rw-rw-r-- 1 sam sam 2.4M Mar 24 14:55 Panopea-generosa-v1.0.a4.gene.bed.fasta.fai
-rw-rw-r-- 1 sam sam 3.4M Mar 24 14:55 Panopea-generosa-v1.0.a4.mRNA.bed.fasta.fai
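
Each `.fai` file is a tab-delimited table whose first two columns are the sequence name and its length in bases, which makes quick summaries easy. The sketch below counts sequences and sums their lengths; the two index records are made-up values for illustration, not taken from the real indexes.

```shell
#!/usr/bin/env bash
# Summarize a FastA index: column 1 is the sequence name, column 2 its length.
# The two records below are made-up values for illustration only.
fai_file=$(mktemp)
printf 'seqA\t124\t6\t60\t61\nseqB\t101\t139\t60\t61\n' > "${fai_file}"

summary=$(awk -F'\t' '{ n++; total += $2 } END { print n "\t" total }' "${fai_file}")
echo "sequences and total bases: ${summary}"

rm "${fai_file}"
```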


### Examine FastAs

In [11]:
%%bash
cd "${analysis_dir}"

for fasta in *.fasta
do
    grep --with-filename "^>" "${fasta}" | head
done

Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01::Scaffold_01:2-125
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000010.m01.CDS02|PGEN_.00g000010.m01::Scaffold_01:1995-2095
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000010.m01.CDS03|PGEN_.00g000010.m01::Scaffold_01:3325-3495
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000010.m01.CDS04|PGEN_.00g000010.m01::Scaffold_01:4651-4719
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000020.m01.CDS01|PGEN_.00g000020.m01::Scaffold_01:19808-19943
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000020.m01.CDS02|PGEN_.00g000020.m01::Scaffold_01:21133-21362
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000020.m01.CDS03|PGEN_.00g000020.m01::Scaffold_01:22487-22613
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000020.m01.CDS04|PGEN_.00g000020.m01::Scaffold_01:24824-24959
Panopea-generosa-v1.0.a4.CDS.bed.fasta:>PGEN_.00g000020.m01.CDS05|PGEN_.00g000020.m01::Scaffold_01:25981-26126
Panopea-gener
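
The headers follow the pattern `name::chrom:start-end`, where `name` is the 4th BED column. A minimal sketch of splitting one header into its parts with bash parameter expansion, using the first CDS header above:

```shell
#!/usr/bin/env bash
# Split a bedtools "-name" style header into its parts with parameter expansion.
header=">PGEN_.00g000010.m01.CDS01|PGEN_.00g000010.m01::Scaffold_01:2-125"

no_prefix="${header#>}"            # drop the leading ">"
name="${no_prefix%%::*}"           # text before "::" (the BED name column)
region="${no_prefix#*::}"          # text after "::" (chrom:start-end)
feature_id="${name%%|*}"           # ID before the "|"
parent_id="${name#*|}"             # Parent after the "|"

echo "feature: ${feature_id}"
echo "parent:  ${parent_id}"
echo "region:  ${region}"
```

For gene headers, which have no `|`, both `feature_id` and `parent_id` would simply equal the gene ID.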

### Generate MD5 checksums

In [12]:
%%bash
cd "${analysis_dir}"

md5sum * | tee --append checksums.md5

267bada15ec1d289213c7bbaf9cd8674  Panopea-generosa-v1.0.a4.CDS.bed.fasta
12d3c59c804b194d63f3bbab75fa617a  Panopea-generosa-v1.0.a4.CDS.bed.fasta.fai
7c956b1c27d14bd91959763403f81265  Panopea-generosa-v1.0.a4.gene.bed.fasta
588d18f5fe0e4f2259a25586349fc244  Panopea-generosa-v1.0.a4.gene.bed.fasta.fai
1823be75694cf70f0ea6f1abc072ba16  Panopea-generosa-v1.0.a4.mRNA.bed.fasta
e120b4c1d3bb0917868e72cd22507bbc  Panopea-generosa-v1.0.a4.mRNA.bed.fasta.fai
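
If the checksum cell is run again, `*` would also match `checksums.md5` and `--append` would duplicate entries. The sketch below shows one way to write the file and then verify against it with `md5sum --check`. It assumes GNU coreutils and works in a scratch directory on a throwaway file rather than the real FastAs.

```shell
#!/usr/bin/env bash
# Write a checksum file, then verify the files against it.
# Uses a scratch directory and a throwaway file, not the real FastAs.
work_dir=$(mktemp -d)
cd "${work_dir}" || exit 1
printf '>seq1\nACGT\n' > example.fasta

# Overwrite rather than append so a re-run does not duplicate entries,
# and list files explicitly so checksums.md5 is never hashed itself.
md5sum example.fasta > checksums.md5

# --check re-hashes each listed file and reports OK or FAILED
md5sum --check checksums.md5
check_status=$?

cd - > /dev/null
rm -r "${work_dir}"
echo "exit status: ${check_status}"
```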


### Remove unneeded data files

In [13]:
%%bash
cd "${data_dir}"

# Array of GFF files
gff_array=(Panopea-generosa-v1.0.a4.CDS.gff3 Panopea-generosa-v1.0.a4.mRNA.gff3 Panopea-generosa-v1.0.a4.gene.gff3)

# Remove genome FastA
echo "Removing ${genome_fasta}."
rm "${genome_fasta}"

# Remove GFFs
for gff in "${gff_array[@]}"
do
  echo ""
  echo "Removing ${gff}."
  rm "${gff}"
done

ls -lh

Removing Panopea-generosa-v1.0.fa.

Removing Panopea-generosa-v1.0.a4.CDS.gff3.

Removing Panopea-generosa-v1.0.a4.mRNA.gff3.

Removing Panopea-generosa-v1.0.a4.gene.gff3.
total 4.0K
-rw-rw-r-- 1 sam sam 658 Mar 24 14:55 Panopea-generosa-v1.0.fa.fai


### Program options

In [14]:
%%bash

gtf_extract -h

echo ""
echo "${line_break}"
echo "${line_break}"
echo ""

${samtools} faidx -h

echo ""
echo "${line_break}"
echo "${line_break}"
echo ""

${bedtools} getfasta -h

usage: gtf_extract [-h] [-v] [-f FEATURE_TYPE] [--fields FIELD_LIST]
                   [-o OUTFILE] [--gff] [-k]
                   GTF_FILE

Extract selected data items from a GTF file and output in tab-delimited
format. The program can also operate on GFF files provided the --gff option is
specified.

positional arguments:
  GTF_FILE              input GTF file to extract data items from

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -f FEATURE_TYPE, --feature FEATURE_TYPE
                        only extract data for lines where feature is
                        FEATURE_TYPE
  --fields FIELD_LIST   comma-separated list of fields to output in tab-
                        delimited format for each line in the GTF, e.g.
                        'chrom,start,end'. Fields can either be a GTF field
                        name (i.e. 'chrom', 'source', 'feature', 'start',
                       


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.29.1
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options: 
	-fi		Input FASTA file
	-fo		Output file (opt., default is STDOUT
	-bed		BED/GFF/VCF file of ranges to extract from -fi
	-name		Use the name field and coordinates for the FASTA header
	-name+		(deprecated) Use the name field and coordinates for the FASTA header
	-nameOnly	Use the name field for the FASTA header
	-split		Given BED12 fmt., extract and concatenate the sequences
			from the BED "blocks" (e.g., exons)
	-tab		Write output in TAB delimited format.
			- Default is FASTA format.
	-s		Force strandedness. If the feature occupies the antisense,
			strand, the sequence will be reverse complemented.
			- By default, strand information is ignored.
	-fullHeader	Use full fasta header.
			- By default, only the word before the first space or tab 
			is used.



CalledProcessError: Command 'b'\ngtf_extract -h\n\necho ""\necho "${line_break}"\necho "${line_break}"\necho ""\n\n${samtools} faidx -h\n\necho ""\necho "${line_break}"\necho "${line_break}"\necho ""\n\n${bedtools} getfasta -h\n'' returned non-zero exit status 1.
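
The `CalledProcessError` above reflects the cell's exit status rather than missing output: `%%bash` raises it when the script returns non-zero, and the last command, `${bedtools} getfasta -h`, exited with status 1. A sketch of the behavior and one way to guard against it, with `false` standing in for a help command that exits non-zero:

```shell
#!/usr/bin/env bash
# A script's exit status is that of its last command, so a help flag that
# returns non-zero makes the whole cell look like a failure.
# "false" stands in for a command that prints help and exits with status 1.
bash -c 'echo "help text"; false'
status_plain=$?

# Appending "|| true" masks the non-zero status of that one command
bash -c 'echo "help text"; false || true'
status_masked=$?

echo "without guard: ${status_plain}"
echo "with guard:    ${status_masked}"
```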