## Update _P.generosa_ primary gene annotations mapping file from 20220419

See this [GitHub Issue](https://github.com/RobertsLab/resources/issues/1602).

This notebook utilized files generated on [20220419](https://robertslab.github.io/sams-notebook/2022/04/19/Data-Wrangling-Create-Primary-P.generosa-Genome-Annotation-File.html) (Notebook entry).

### List computer specs

In [1]:
%%bash
echo "TODAY'S DATE"
date
echo "------------"
echo ""
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "
hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE
Fri Mar 24 08:44:37 AM PDT 2023
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
CPU family:                      6
Model:                           165
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
Stepping:                        2
BogoMIPS:                        4800.05
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall 

No LSB modules are available.


### Set variables
- `%env` sets an environment variable, which is available to `%%bash` cells
- An assignment without `%env` sets a Python variable

In [14]:
######################################################################
### Set directories
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20230322-pgen-gene_annotation-update
analysis_dir="/home/sam/analyses/20230322-pgen-gene_annotation-update"

#####################################################################
### Input files
%env base_url=https://gannet.fish.washington.edu/Atumefaciens/20220419-pgen-gene_annotation_mapping
    
# UniProt batch results
%env uniprot_output=20220419-pgen-uniprot_batch-results.txt

# Genome IDs and SPIDs
%env genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt

######################################################################
### Output files

# Parsed UniProt
%env parsed_uniprot=20230322-pgen-accession-gene_name-gene_description-go_ids.tab

# Final output
%env joined_output=20230322-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-all_go_ids-C_go_ids-P_go_ids-F_go_ids.tab

######################################################################


env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20230322-pgen-gene_annotation-update
env: base_url=https://gannet.fish.washington.edu/Atumefaciens/20220419-pgen-gene_annotation_mapping
env: uniprot_output=20220419-pgen-uniprot_batch-results.txt
env: genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
env: parsed_uniprot=20230322-pgen-accession-gene_name-gene_description-go_ids.tab
env: joined_output=20230322-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-all_go_ids-C_go_ids-P_go_ids-F_go_ids.tab


### Make input/output directories

In [3]:
%%bash
# If directories don't exist, make them
mkdir --parents "${analysis_dir}"

### Download and inspect annotation files

`--quiet`: Prevents `wget` output from overwhelming Jupyter Notebook

`--continue`: If a download was previously started, resumes where it left off and does not create a second file if one already exists.

In [4]:
%%bash
cd "${analysis_dir}"

wget --quiet --continue "${base_url}/${uniprot_output}"
wget --quiet --continue "${base_url}/${genome_IDs_SPIDs}"

ls -ltrh

echo ""
echo "---------------------------------------------------------"
echo ""
head -n 25 *.txt

total 138M
-rw-rw-r-- 1 sam sam 359K Apr 20  2022 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
-rw-rw-r-- 1 sam sam 138M Apr 20  2022 20220419-pgen-uniprot_batch-results.txt

---------------------------------------------------------

==> 20220419-pgen-uniprot_batch-results.txt <==
ID   CAMT1_DICDI             Reviewed;         230 AA.
AC   Q86IC9; Q552T5;
DT   05-MAY-2009, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   23-FEB-2022, entry version 92.
DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;
DE            EC=2.1.1.104;
DE   AltName: Full=O-methyltransferase 5;
GN   Name=omt5; ORFNames=DDB_G0275499;
OS   Dictyostelium discoideum (Slime mold).
OC   Eukaryota; Amoebozoa; Evosea; Eumycetozoa; Dictyostelia; Dictyosteliales;
OC   Dictyosteliaceae; Dictyostelium.
OX   NCBI_TaxID=44689;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=12097910; DOI=10.1038/nature00847;
RA  

### Check UniProt batch retrieval

Print the first entry (the end of each entry is denoted by a line beginning with `//`).

Let's break it down step by step:

- `grep -n "^//"` - This command searches `${uniprot_output}` for all lines that begin with `//` and uses the `-n` flag to prefix each match with its line number.

- `head -n 1` - This command takes the first line of the output from `grep`, which is the line number of the first line that begins with `//`, followed by a colon and the matching line itself.

- `cut -d ":" -f 1` - This command extracts the line number from the output of `head` by splitting the output at the colon (:) and selecting the first field.

- `xargs -I {} head -n {}` - This command passes the line number to `head` as an argument, which prints the first n lines of a file.

Together, these commands print all lines in `${uniprot_output}` up to and including the first line that begins with `//`.
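
The steps above can be tried on a small stand-in file first. This is a minimal sketch, assuming a hypothetical two-record file created with `mktemp` (the record names and accessions are made up and are not part of this analysis):

```shell
# Hypothetical toy file: two records, each terminated by a "//" line
toy_file=$(mktemp)
printf '%s\n' \
  "ID   TOY1_EXAMPLE" \
  "AC   A00001;" \
  "//" \
  "ID   TOY2_EXAMPLE" \
  "AC   A00002;" \
  "//" > "${toy_file}"

# Same pipeline as described above, applied to the toy file
first_record=$(grep -n "^//" "${toy_file}" \
  | head -n 1 \
  | cut -d ":" -f 1 \
  | xargs -I {} head -n {} "${toy_file}")

# Only the first record (three lines, ending with "//") is printed
echo "${first_record}"

rm -f "${toy_file}"
```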

---

Counting accessions:

- `grep -c "^AC"`: Counts accession lines (lines beginning with `AC`).
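
Note that a record can spread its accessions over more than one `AC` line, so the `AC` line count may be higher than the number of records. A minimal sketch of the difference, assuming a hypothetical toy file with made-up accessions; counting the `//` terminator lines gives the number of records instead:

```shell
# Hypothetical toy file: the first record has two AC lines, the second has one
toy_file=$(mktemp)
printf '%s\n' \
  "ID   TOY1_EXAMPLE" \
  "AC   A00001; A00002; A00003;" \
  "AC   A00004;" \
  "//" \
  "ID   TOY2_EXAMPLE" \
  "AC   A00005;" \
  "//" > "${toy_file}"

# Count lines beginning with AC
ac_lines=$(grep -c "^AC" "${toy_file}")

# Count record terminators, i.e. the number of records
records=$(grep -c "^//" "${toy_file}")

echo "AC lines: ${ac_lines}"
echo "Records:  ${records}"

rm -f "${toy_file}"
```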

In [5]:
%%bash
cd "${analysis_dir}"

grep -n "^//" "${uniprot_output}" \
| head -n 1 \
| cut -d ":" -f 1 \
| xargs -I {} head -n {} "${uniprot_output}"

echo ""

echo "----------------------------------------------------"

echo ""

echo "Number of accessions:"

echo ""

grep -c "^AC" "${uniprot_output}"

ID   CAMT1_DICDI             Reviewed;         230 AA.
AC   Q86IC9; Q552T5;
DT   05-MAY-2009, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   23-FEB-2022, entry version 92.
DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;
DE            EC=2.1.1.104;
DE   AltName: Full=O-methyltransferase 5;
GN   Name=omt5; ORFNames=DDB_G0275499;
OS   Dictyostelium discoideum (Slime mold).
OC   Eukaryota; Amoebozoa; Evosea; Eumycetozoa; Dictyostelia; Dictyosteliales;
OC   Dictyosteliaceae; Dictyostelium.
OX   NCBI_TaxID=44689;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=12097910; DOI=10.1038/nature00847;
RA   Gloeckner G., Eichinger L., Szafranski K., Pachebat J.A., Bankier A.T.,
RA   Dear P.H., Lehmann R., Baumgart C., Parra G., Abril J.F., Guigo R.,
RA   Kumpf K., Tunggal B., Cox E.C., Quail M.A., Platzer M., Rosenthal A.,
RA   Noegel A.A.;
RT   "Sequence and analysis of chromosome 2 of Dictyostelium discoide

## Parse the stuff we want

- UniProt accession

- Gene name/abbreviation

- Gene description

- GO IDs

- GO aspect (cellular component `C`, molecular function `F`, and biological process `P`)
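
In the UniProt flat-file format, GO terms sit on `DR` lines whose second field is `GO;`, followed by the GO ID and a term prefixed with the aspect letter. A minimal sketch of pulling out the ID and aspect from one such line; the line itself is an illustrative stand-in written for this example, not taken from the results file:

```shell
# Illustrative DR line in UniProt flat-file layout (made up for this sketch)
line="DR   GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell."

# Third whitespace-delimited field is the GO ID; strip the trailing semi-colon
go_id=$(echo "${line}" | awk '{print $3}' | sed 's/;$//')

# Fourth field begins with the aspect letter (C, F, or P) followed by a colon
go_aspect=$(echo "${line}" | awk '{print $4}' | awk -F ":" '{print $1}')

echo "${go_id} ${go_aspect}"
```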

#### Check DE descriptor lines to decide pattern matching

Checks lines beginning with `DE` and prints the 2nd field when it contains `Name`.

Reduces these to unique values, which determine the pattern matching used for parsing in the next cell.
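
A minimal sketch of the same check on a few stand-in `DE` lines (the protein names and EC number are made up for illustration), showing that continuation lines such as `EC=` are ignored because their second field does not contain `Name`:

```shell
# Hypothetical DE lines mimicking the UniProt flat-file layout
de_lines=$(printf '%s\n' \
  "DE   RecName: Full=Toy protein 1;" \
  "DE            EC=0.0.0.0;" \
  "DE   AltName: Full=Toy alias;" \
  "DE   RecName: Full=Toy protein 2;")

# Keep the 2nd field only when it contains "Name", then reduce to unique values
name_keys=$(echo "${de_lines}" | grep "^DE" | awk '$2 ~ /Name/ { print $2 }' | sort -u)

echo "${name_keys}"
```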

In [6]:
%%bash
cd "${analysis_dir}"

grep "^DE" "${uniprot_output}" | awk '$2 ~ /Name/ { print $2 }' | sort -u

AltName:
RecName:


In [7]:
%%bash
cd "${analysis_dir}"

# Loop through UniProt records
time \
while read -r line
do
  # Get record line descriptor
  descriptor=$(echo "${line}" | awk '{print $1}')

  # Capture second field for evaluation
  go_line=$(echo "${line}" | awk '{print $2}')

  # Append GO IDs to array
  if [[ "${go_line}" == "GO;" ]]; then
    go_id=$(echo "${line}" | awk '{print $3}')
    go_ids_array+=("${go_id}")
    go_id_aspect=$(echo "${line}" | awk '{print $4}' | awk -F":" '{print $1}')
    if [[ "${go_id_aspect}" == "C" ]]; then
      go_id_C_array+=("${go_id}")
    elif [[ "${go_id_aspect}" == "F" ]]; then
      go_id_F_array+=("${go_id}")
    elif [[ "${go_id_aspect}" == "P" ]]; then
      go_id_P_array+=("${go_id}")
    fi
  elif [[ "${go_line}" == "GeneID;" ]]; then
    # Uses sed to strip trailing semi-colon
    gene_id=$(echo "${line}" | awk '{print $3}' | sed 's/;$//')
  fi

  # Get gene description
  if [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "RecName:" ]]; then
    # Uses sed to strip trailing spaces at end of line and remove commas
    gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')

  # Get alternate name
  elif [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "AltName:" ]]; then
    # Uses sed to strip trailing spaces at end of line and remove commas
    alt_gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')

  # Get gene name
  elif [[ "${descriptor}" == "GN"  ]] && [[ $(echo "${line}" | awk -F "=" '{print $1}') == "GN   Name" ]]; then
    # Uses sed to strip trailing spaces at end of line
    gene=$(echo "${line}" | awk -F 'Name=|{|;' '{print $2}' | sed 's/[[:blank:]]*$//')

  # Get UniProt accession
  elif [[ "${descriptor}" == "AC" ]]; then
    # awk removes "AC" notation
    # sed removes all spaces
    # sed removes trailing semi-colon
    # Uses array to handle accessions being on multiple lines of UniProt records file
    accession=$(echo "${line}" | awk '{$1="";print $0}' | sed 's/[[:space:]]*//g' | sed 's/;$//')
    accessions_array+=("${accession}")

  # Identify end of record (UniProt records are terminated by "//")
  elif [[ "${descriptor}" == "//" ]]; then

    ### Format GO arrays for easier printing ###
    
    # Remove semi-colon delimiters
    go_ids_array=("${go_ids_array[@]/;}")
    go_id_C_array=("${go_id_C_array[@]/;}")
    go_id_F_array=("${go_id_F_array[@]/;}")
    go_id_P_array=("${go_id_P_array[@]/;}")
    
    # Join array elements using semi-colon
    # sets the IFS (Internal Field Separator) to semicolon
    joined_go_ids=$(IFS=';' && echo "${go_ids_array[*]}")
    joined_go_id_C=$(IFS=';' && echo "${go_id_C_array[*]}")
    joined_go_id_F=$(IFS=';' && echo "${go_id_F_array[*]}")
    joined_go_id_P=$(IFS=';' && echo "${go_id_P_array[*]}")
    
    ### End GO array formatting ###
    
    ### Print tab-delimited ###
    
    # Prints the nine fields tab-delimited
    # IFS=';' joins accessions collected from multiple AC lines with a semi-colon
    # GO IDs were already joined above as GOID1;GOID2;GOIDn
    (IFS=';'; printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" \
     "${accessions_array[*]}" \
     "${gene_id}" \
     "${gene}" \
     "${gene_description}" \
     "${alt_gene_description}" \
     "${joined_go_ids}" \
     "${joined_go_id_C}" \
     "${joined_go_id_P}" \
     "${joined_go_id_F}")
    
    ### END PRINTING ###

    # Re-initialize variables
    accession=""
    accessions_array=()
    # Reset so a record without an AltName does not inherit the previous one
    alt_gene_description=""
    descriptor=""
    gene=""
    gene_description=""
    gene_id=""
    go_id=""
    go_ids_array=()
    go_id_C_array=()
    go_id_F_array=()
    go_id_P_array=()
  fi


done < "${uniprot_output}" >> "${parsed_uniprot}"


real	293m48.623s
user	298m10.453s
sys	41m57.464s


### Inspect parsed UniProt file

In [8]:
%%bash
cd "${analysis_dir}"

wc -l "${parsed_uniprot}"

echo ""
echo "------------------------------------------------------------------"
echo ""

head -n 25 "${parsed_uniprot}" | column -t

10304 20230322-pgen-accession-gene_name-gene_description-go_ids.tab

------------------------------------------------------------------

Q86IC9;Q552T5                                                                                            8620183          omt5             Probable                       caffeoyl-CoA         O-methyltransferase  1                                                                                                                                                                                                                                                                                                                                          O-methyltransferase                                                                                                                                                                                                                                                                                                             

### Sets markdown table align left in subsequent cell

In [11]:
%%html
<style>
  table {margin-left: 0 !important;}
</style>

### Join with original list of genes and SPIDs

Output format (tab-delimited):

| gene_ID | SPIDs | UniProt_gene_ID | gene | gene_description | alternate_gene_description | all_GO_IDs | CC_GO_IDs | BP_GO_IDs | MF_GO_IDs |
|---------|-------|-----------------|------|------------------|----------------------------|------------|-----------|-----------|-----------|


Explanation:

- `awk -v FS='[;[:space:]]+'`: Sets the Field Separator variable to one or more semi-colons and/or whitespace characters. This handles the `;` between UniProt accessions, so `$1` is the first accession on the line. Allows for proper searching.

- `NR==FNR`: Restricts next block (designated by `{}`) to work only on first input file.

- `{array[$1]=$0; next}`: Adds the entire line (`$0`) of the first file to the array named `array`, keyed by the first field (`$1`), and then skips to the next line, so the remaining commands only run on the second input file.

- `($1 in array)`: Looks for the value of the first column (`$1`, which is SPID) from the second file to see if there's a match from the array (which contains the line from the first file).

- `{print $2"\t"array[$1]}`: If there's a match, print the second column (`$2`, which is gene ID) from the second file, followed by a tab and the line from the first file.

- `"${parsed_uniprot}" "${genome_IDs_SPIDs}"`: The first and second input files.

- `> "${joined_output}"`: Result of the join.
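
The join logic described above can be sketched on two tiny stand-in files. The file names, accessions, and gene IDs here are hypothetical, and the SPID list layout (SPID, then gene ID) is an assumption based on the explanation above:

```shell
work_dir=$(mktemp -d)

# Stand-in for the parsed UniProt file: accessions first, then other columns
printf 'A00001;A00002\t111\tgeneA\tToy protein A\n' >  "${work_dir}/parsed.tab"
printf 'B00001\t222\tgeneB\tToy protein B\n'        >> "${work_dir}/parsed.tab"

# Stand-in for the SPID list (assumed layout: SPID <tab> gene ID)
printf 'A00001\tGENE_0001\n' >  "${work_dir}/ids.txt"
printf 'C00001\tGENE_0002\n' >> "${work_dir}/ids.txt"

# First file fills the array keyed by first accession;
# second file prints gene ID plus the stored line when its SPID matches
joined=$(awk -v FS='[;[:space:]]+' '
  NR == FNR { array[$1] = $0; next }
  ($1 in array) { print $2 "\t" array[$1] }
' "${work_dir}/parsed.tab" "${work_dir}/ids.txt")

echo "${joined}"

rm -r "${work_dir}"
```

Only `GENE_0001` is printed, because `C00001` has no entry in the parsed file. Since the array is keyed on the first accession of each line, a SPID that matches only a secondary accession (`A00002` here) would not be joined.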

In [15]:
%%bash

cd "${analysis_dir}"

awk \
-v FS='[;[:space:]]+' \
'NR==FNR \
{array[$1]=$0; next} \
($1 in array) \
{print $2"\t"array[$1]}' \
"${parsed_uniprot}" "${genome_IDs_SPIDs}" \
> "${joined_output}"

### Inspect final annotation file

In [16]:
%%bash

cd "${analysis_dir}"

wc -l "${joined_output}"

echo ""
echo "------------------------------------------------------------------"
echo ""

head -n 25 "${joined_output}" | column -t

14672 20230322-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-all_go_ids-C_go_ids-P_go_ids-F_go_ids.tab

------------------------------------------------------------------

PGEN_.00g000010  Q86IC9;Q552T5                                                                                            8620183          omt5             Probable                       caffeoyl-CoA         O-methyltransferase  1                                                                                                                                                                                                                                                                                                                                          O-methyltransferase                                                                                                                                                                                                                          