## Convert _P. generosa_ GFF to GTF

### Notebook relies on:

- [GffRead](https://github.com/gpertea/gffread)

### Addresses [this GitHub Issue](https://github.com/RobertsLab/resources/issues/1411)

### List computer specs

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
# Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Tue 01 Mar 2022 10:33:08 AM PST
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2400.008
BogoMIPS:                        4800.01
Hypervisor vendor:               VMware
Virtualization type:    

No LSB modules are available.


### Set variables
- `%env` sets an environment variable that `%%bash` cells can read

- A plain assignment (no `%env`) creates a Python variable

In [2]:
# Set directories, input/output files
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20220301-pgen-gff_to_gtf
analysis_dir="20220301-pgen-gff_to_gtf"

# Input file (hosted on gannet, not NCBI; see URL below)
%env gff=Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3

# URL to download the input file from
%env url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation

# Output file(s)
%env gtf=Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gtf


# Set program locations
%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread

env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20220301-pgen-gff_to_gtf
env: gff=Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3
env: url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation
env: gtf=Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gtf
env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread
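
The distinction matters because each `%%bash` cell runs in a child process, which inherits environment variables but cannot see Python names. As a rough shell analogy, an exported variable plays the role of `%env` and an unexported one plays the role of a plain Python assignment. The names `demo_env_var` and `demo_local_var` are invented for illustration.

```bash
# Analogy only: export ~ %env, unexported assignment ~ plain Python variable.
export demo_env_var="visible"   # hypothetical name; inherited by child processes
demo_local_var="hidden"         # hypothetical name; NOT inherited by child processes

# The child shell stands in for a %%bash cell launched by the notebook.
child_view="$(sh -c 'echo "env=${demo_env_var:-unset} local=${demo_local_var:-unset}"')"
echo "${child_view}"
```

This is why `analysis_dir` is set twice in the cell above: once with `%env` for the bash cells and once as a Python variable.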


### Create analysis directory

In [3]:
%%bash
# Make analysis and data directories, if they don't exist
mkdir --parents "${analysis_dir}"

mkdir --parents "${data_dir}"

### Download GFF

In [4]:
%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading the file if it's already been downloaded.
# Use --no-check-certificate to avoid download error from gannet
wget --quiet \
--continue \
--no-check-certificate \
"${url}/${gff}"

ls -ltrh "${gff}"

-rw-rw-r-- 1 sam sam 518M Oct 14  2019 Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gff3
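
Because `--quiet` suppresses wget's own messages, a failed or partial download can be easy to miss. The following is a minimal sketch of a sanity check, assuming a usable GFF3 file is non-empty and begins with the `##gff-version 3` pragma. The helper `check_gff3` and the scratch files are hypothetical and stand in for the real download.

```bash
# Hypothetical helper: succeed only if the file is non-empty and its first
# line starts with the GFF3 version pragma.
check_gff3() {
  [ -s "$1" ] || return 1
  case "$(head -n 1 "$1")" in
    "##gff-version 3"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Scratch files standing in for a good download and a failed (empty) one
good_gff="$(mktemp)"
empty_gff="$(mktemp)"
printf '##gff-version 3\nscaf1\tsrc\tgene\t2\t4719\t.\t+\t.\tID=g1\n' > "${good_gff}"

if check_gff3 "${good_gff}"; then good_status="ok"; else good_status="failed"; fi
if check_gff3 "${empty_gff}"; then empty_status="ok"; else empty_status="failed"; fi

echo "good file: ${good_status}"
echo "empty file: ${empty_status}"

rm -f "${good_gff}" "${empty_gff}"
```

In the notebook, the same function could be called on `"${data_dir}/${gff}"` right after the download.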


### Examine GFF

In [5]:
%%bash
head -n 20 "${data_dir}"/"${gff}"

##gff-version 3
##Generated using GenSAS, Monday 7th of October 2019 04:54:37 AM
##Project Name : Pgenerosa_v074
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	gene	2	4719	.	+	.	ID=PGEN_.00g000010;Name=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140;Alias=21510-PGEN_.00g234140;original_name=21510-PGEN_.00g234140;Notes=sp|Q86IC9|CAMT1_DICDI [BLAST protein vs protein (blastp) 2.7.1],PF01596.12 [Pfam 1.6]
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	mRNA	2	4719	.	+	.	ID=PGEN_.00g000010.m01;Name=PGEN_.00g000010.m01;Parent=PGEN_.00g000010;original_ID=21510-PGEN_.00g234140.m01;Alias=21510-PGEN_.00g234140.m01;original_name=21510-PGEN_.00g234140
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	ID=PGEN_.00g000010.m01.exon01;Name=PGEN_.00g000010.m01.exon01;Parent=PGEN_.00g000010.m01;original_ID=21510-PGEN_.00g234140.m01.exon1;Alias=21510-PGEN_.00g234140.m01.exon1
PGA_scaffold1__77_contigs__length_896438

### Convert GFF to GTF

In [6]:
%%bash
cd "${data_dir}"

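# -E: warn about duplicate transcript IDs and other potential problems (written to stderr)
# -T: write GTF instead of the default GFF3
"${gffread}" -E \
"${data_dir}/${gff}" -T \
1> "${analysis_dir}/${gtf}" \
2> "${analysis_dir}/gffread-gff_to_gtf.stderr"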
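
To make the format change concrete: GFF3 records relationships through `ID=` and `Parent=` attributes, while GTF carries `transcript_id` and `gene_id` on each record. The hand-rolled `awk` sketch below illustrates that mapping for exon records on a two-line toy input. It is not how GffRead is implemented, and the scaffold, source, and ID values are invented.

```bash
# Illustration only: map GFF3 Parent/ID attributes to GTF-style attributes.
toy_gff="$(mktemp)"   # hypothetical scratch file
printf 'scaf1\tsrc\tmRNA\t2\t4719\t.\t+\t.\tID=g1.m01;Parent=g1\n' > "${toy_gff}"
printf 'scaf1\tsrc\texon\t2\t125\t.\t+\t.\tID=g1.m01.exon01;Parent=g1.m01\n' >> "${toy_gff}"

gtf_exons="$(awk -F'\t' -v OFS='\t' '
  # Return the value of one key from a GFF3 attribute string (key=value;key=value)
  function attr(s, key,    n, i, parts, kv) {
    n = split(s, parts, ";")
    for (i = 1; i <= n; i++) {
      split(parts[i], kv, "=")
      if (kv[1] == key) return kv[2]
    }
    return ""
  }
  # Remember which gene each transcript belongs to
  $3 == "mRNA" { gene_of[attr($9, "ID")] = attr($9, "Parent") }
  # Rewrite column 9 of exon records in GTF style
  $3 == "exon" {
    tx = attr($9, "Parent")
    $9 = "transcript_id \"" tx "\"; gene_id \"" gene_of[tx] "\";"
    print
  }
' "${toy_gff}")"

echo "${gtf_exons}"
rm -f "${toy_gff}"
```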

### Inspect GTF

In [7]:
%%bash
head "${analysis_dir}/${gtf}"

PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	transcript	2	4719	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010"
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	exon	2	125	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	exon	1995	2095	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	exon	3325	3495	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	exon	4651	4719	.	+	.	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-publish	CDS	2	125	.	+	0	transcript_id "PGEN_.00g000010.m01"; gene_id "PGEN_.00g000010";
PGA_scaffold1__77_contigs__length_89643857	GenSAS_5d9637f372b5d-p
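
`head` shows only the first few records. A tally of the feature types in column 3 gives a fuller picture of what the conversion kept. The sketch below runs on a small invented GTF so that it is self-contained. The scratch file and its contents are assumptions, not real output.

```bash
# Toy GTF standing in for the real output file
sample_gtf="$(mktemp)"
{
  printf 'scaf1\tsrc\ttranscript\t2\t4719\t.\t+\t.\ttranscript_id "g1.m01"; gene_id "g1"\n'
  printf 'scaf1\tsrc\texon\t2\t125\t.\t+\t.\ttranscript_id "g1.m01"; gene_id "g1";\n'
  printf 'scaf1\tsrc\texon\t1995\t2095\t.\t+\t.\ttranscript_id "g1.m01"; gene_id "g1";\n'
  printf 'scaf1\tsrc\tCDS\t2\t125\t.\t+\t0\ttranscript_id "g1.m01"; gene_id "g1";\n'
} > "${sample_gtf}"

# Count records per feature type (column 3), printed as type=count
feature_counts="$(cut -f 3 "${sample_gtf}" | LC_ALL=C sort | uniq -c | awk '{print $2 "=" $1}')"
echo "${feature_counts}"

rm -f "${sample_gtf}"
```

On the real data, replacing `"${sample_gtf}"` with `"${analysis_dir}/${gtf}"` would report the counts for the converted file.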

### Generate checksum(s)

In [8]:
%%bash
cd "${analysis_dir}"

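# Skip checksums.md5 itself so re-running the cell doesn't checksum its own output
for file in *
do
  [ "${file}" = "checksums.md5" ] && continue
  md5sum "${file}" | tee --append checksums.md5
done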

e82f283fa410a33b182b54fab585bad7  gffread-gff_to_gtf.stderr
2926b2a9029eb98775b75883cbf199af  Panopea-generosa-vv0.74.a4-merged-2019-10-07-4-46-46.gtf
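
A checksum file is only useful if it is checked later, for example after copying the results to another machine. The sketch below shows the round trip with `md5sum --check` in a throwaway directory. The directory and `sample.gtf` are hypothetical stand-ins for the analysis directory and its files.

```bash
# Throwaway directory standing in for the analysis directory
work_dir="$(mktemp -d)"
printf 'example\n' > "${work_dir}/sample.gtf"

# Record the checksum, then verify the file against it
(cd "${work_dir}" && md5sum sample.gtf > checksums.md5)
verify_output="$(cd "${work_dir}" && LC_ALL=C md5sum --check checksums.md5)"
echo "${verify_output}"

rm -rf "${work_dir}"
```

`md5sum --check` exits non-zero if any listed file is missing or its checksum differs, so the same command can gate later steps.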


### Document GffRead program options

In [9]:
%%bash
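# gffread exits with status 1 after printing its usage, so Jupyter reports a CalledProcessError after the help text
${gffread} -h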

gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 
 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
 [<input_gff>] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.
 
Options:
 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>
 --nids discard records/transcripts if their IDs are listed in <IDs.lst>
 -i   discard transcripts having an intron larger than <maxintron>
 -l   discard transcripts

CalledProcessError: Command 'b'${gffread} -h\n'' returned non-zero exit status 1.