## Create a _P. verrucosa_ "karyotype" file in the following format:

`name\tlength`

### Per [this GitHub Issue](https://github.com/RobertsLab/resources/issues/1580).

### List computer specs

In [2]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Wed Feb 15 11:37:24 AM PST 2023
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
CPU family:                      6
Model:                           165
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
Stepping:                        2
BogoMIPS:                        4800.01
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall

No LSB modules are available.


### Set variables
- `%env` sets an environment variable, which bash cells can read

- an assignment without `%env` creates a Python variable

In [3]:
# Set directories, input/output files
%env data_dir=/home/sam/data/P_verrucosa/genomes
%env analysis_dir=/home/sam/analyses/20230215-pver-GCA_014529365.1-karytoype
analysis_dir="20230215-pver-GCA_014529365.1-karytoype"

# Input files (from NCBI)
%env fasta_index=GCA_014529365.1_Pver_genome_assembly_v1.0_genomic.fna.fai

# URL of file directory
%env url=https://owl.fish.washington.edu/halfshell/genomic-databank

# Output file(s)
%env karyotype=GCA_014529365.1-pver-karytotype-name_length.tab

env: data_dir=/home/sam/data/P_verrucosa/genomes
env: analysis_dir=/home/sam/analyses/20230215-pver-GCA_014529365.1-karytoype
env: fasta_index=GCA_014529365.1_Pver_genome_assembly_v1.0_genomic.fna.fai
env: url=https://owl.fish.washington.edu/halfshell/genomic-databank
env: karyotype=GCA_014529365.1-pver-karytotype-name_length.tab
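
Because bash cells only see variables set with `%env`, a cell can fail in confusing ways if the variable cell was skipped. A minimal sketch of a guard that could open a bash cell, using the shell's `${var:?message}` expansion. The `/tmp/demo_...` defaults are placeholders (not from the notebook) so the snippet runs on its own:

```shell
# Placeholder defaults so this sketch runs on its own;
# in the notebook these values come from the %env lines above.
export data_dir="${data_dir:-/tmp/demo_data_dir}"
export analysis_dir="${analysis_dir:-/tmp/demo_analysis_dir}"

# ${var:?msg} stops the script with msg if var is unset or empty.
: "${data_dir:?data_dir is not set}"
: "${analysis_dir:?analysis_dir is not set}"

echo "data_dir=${data_dir}"
echo "analysis_dir=${analysis_dir}"
```

In an actual cell the two `export` lines would be dropped, leaving only the guards.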


### Create analysis directory

In [4]:
%%bash
# Make analysis and data directories, if they don't exist
mkdir --parents "${analysis_dir}"

mkdir --parents "${data_dir}"

### Download FastA Index

In [5]:
%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading the file if it's already been downloaded.
# Use --no-check-certificate to avoid certificate errors from the server
wget --quiet \
--continue \
--no-check-certificate \
"${url}/${fasta_index}"

ls -ltrh "${fasta_index}"

-rw-r--r-- 1 sam sam 693K Feb 15 10:00 GCA_014529365.1_Pver_genome_assembly_v1.0_genomic.fna.fai


### Examine FastA Index

In [6]:
%%bash
head -n 20 "${data_dir}"/"${fasta_index}"

JAAVTL010000001.1	2095917	112	80	81
JAAVTL010000002.1	2081954	2122340	80	81
JAAVTL010000003.1	1617595	4230431	80	81
JAAVTL010000004.1	1576134	5868358	80	81
JAAVTL010000005.1	1560107	7464306	80	81
JAAVTL010000006.1	1451149	9044027	80	81
JAAVTL010000007.1	1442001	10513428	80	81
JAAVTL010000008.1	1404416	11973567	80	81
JAAVTL010000009.1	1375744	13395651	80	81
JAAVTL010000010.1	1318009	14788704	80	81
JAAVTL010000011.1	1243551	16123301	80	81
JAAVTL010000012.1	1229536	17382509	80	81
JAAVTL010000013.1	1172851	18627527	80	81
JAAVTL010000014.1	1203294	19815151	80	81
JAAVTL010000015.1	1198208	21033599	80	81
JAAVTL010000016.1	1181740	22246897	80	81
JAAVTL010000017.1	1125063	23443521	80	81
JAAVTL010000018.1	1142483	24582760	80	81
JAAVTL010000019.1	1132017	25739637	80	81
JAAVTL010000020.1	1094778	26885917	80	81
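
The five tab-separated columns above follow the `samtools faidx` index layout: sequence name, length, byte offset, bases per line, and bytes per line. A quick sanity check before converting could look like the sketch below. It assumes a small demo file (two records copied from the output above) rather than the real index, so that it runs on its own:

```shell
# Demo input: first two records from the index shown above,
# written to a temporary file so the sketch runs on its own.
fai="$(mktemp)"
printf 'JAAVTL010000001.1\t2095917\t112\t80\t81\n' > "${fai}"
printf 'JAAVTL010000002.1\t2081954\t2122340\t80\t81\n' >> "${fai}"

# Count lines that do NOT have exactly five tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 5 {n++} END {print n+0}' "${fai}")

# -s is true only if the file exists and is not empty.
if [ -s "${fai}" ] && [ "${bad_lines}" -eq 0 ]; then
  echo "index looks well-formed"
else
  echo "index is empty or has ${bad_lines} malformed line(s)" >&2
fi
```

To check the downloaded file instead, `fai` would point at `"${data_dir}/${fasta_index}"`.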


### Convert FastA index to desired format:

`name\tlength`

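The command below uses awk to print the first column (`$1`), a tab (`\t`), and the second column (`$2`). Because the `print` arguments are separated by commas, awk also inserts its default output field separator (a single space) on each side of the tab.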

In [8]:
%%bash
cd "${data_dir}"

awk '{print $1, "\t", $2}' "${fasta_index}" \
> "${analysis_dir}/${karyotype}"

### Inspect karyotype file

In [9]:
%%bash
head "${analysis_dir}/${karyotype}"

JAAVTL010000001.1 	 2095917
JAAVTL010000002.1 	 2081954
JAAVTL010000003.1 	 1617595
JAAVTL010000004.1 	 1576134
JAAVTL010000005.1 	 1560107
JAAVTL010000006.1 	 1451149
JAAVTL010000007.1 	 1442001
JAAVTL010000008.1 	 1404416
JAAVTL010000009.1 	 1375744
JAAVTL010000010.1 	 1318009
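
The output above has a space on each side of the tab. If a downstream tool expects a bare tab, one option is to set awk's output field separator (`OFS`) to a tab. The sketch below compares the two forms on a small demo file (two records copied from the FastA index). `count_padded` is a hypothetical helper introduced here for illustration only:

```shell
# Demo input: two records copied from the FastA index, in a temporary file.
demo_fai="$(mktemp)"
printf 'JAAVTL010000001.1\t2095917\t112\t80\t81\n' > "${demo_fai}"
printf 'JAAVTL010000002.1\t2081954\t2122340\t80\t81\n' >> "${demo_fai}"

# Hypothetical helper: count lines that are not exactly name<TAB>length
# (wrong field count, or a space inside either field).
count_padded() {
  awk -F'\t' 'NF != 2 || $1 ~ / / || $2 ~ / / {n++} END {print n+0}'
}

# Form used in the notebook: each comma inserts the default OFS (a space).
padded_original=$(awk '{print $1, "\t", $2}' "${demo_fai}" | count_padded)

# Variant: OFS is a tab, so the comma emits only the tab.
padded_strict=$(awk 'BEGIN {OFS="\t"} {print $1, $2}' "${demo_fai}" | count_padded)

echo "padded lines, original form: ${padded_original}"
echo "padded lines, OFS variant:   ${padded_strict}"
```

Regenerating the karyotype file with the `OFS` variant would change its contents, so any recorded checksum would need to be regenerated as well.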


### Generate checksum(s)

In [10]:
%%bash
cd "${analysis_dir}"

for file in *
do
  md5sum "${file}" | tee --append checksums.md5
done

5aafd422505f26c0793a3b88abe0359f  GCA_014529365.1-pver-karytotype-name_length.tab
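
The recorded checksums can later be used to confirm that a copied or downloaded file is intact. A minimal sketch, assuming GNU coreutils `md5sum` and a throwaway demo directory and file name (both placeholders, not the notebook's paths):

```shell
# Demo directory with a placeholder file, so the sketch runs on its own.
demo_dir="$(mktemp -d)"
cd "${demo_dir}"
printf 'JAAVTL010000001.1\t2095917\n' > demo-karyotype.tab

# Record the checksum, then verify the file against the recorded value.
md5sum demo-karyotype.tab > checksums.md5

# --check re-computes each listed file's MD5 and compares it to the record;
# it exits non-zero if any file fails to match.
md5sum --check checksums.md5 && verified="yes"
```

In the analysis directory, the equivalent would be running `md5sum --check checksums.md5` next to the karyotype file.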
