## Rename and group genomes (pt2 Siavash HGT)

### I. Rename each genome:
For each **"mutated.fasta"** matching `./input/simulated_genomes/*/mutated_*/mutated.fasta`:<br>
Copy it under a more descriptive name such as **"AG-359-G18_a22_aad001.fasta"**, taken from its parent directory name.
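
As a rough illustration of the renaming rule, the sketch below derives the new file name from a single path. The example path is hypothetical: it only assumes that the parent directory is named `mutated_BLOSUM62_` followed by the genome, alpha and aad labels, which is what the glob pattern above and the prefix stripped in the next cell suggest.

```python
from pathlib import PurePosixPath

PREFIX = "mutated_BLOSUM62_"  # fixed prefix removed from the directory name

def renamed_basename(src: str) -> str:
    """Return the new file name for one mutated.fasta path.

    The parent directory name carries the genome, alpha and aad labels.
    """
    parent = PurePosixPath(src).parent.name
    return parent.replace(PREFIX, "") + ".fasta"

# Hypothetical input path, shaped like the glob pattern
example = ("./input/simulated_genomes/AG-359-G18/"
           "mutated_BLOSUM62_AG-359-G18_a22_aad001/mutated.fasta")
print(renamed_basename(example))  # AG-359-G18_a22_aad001.fasta
```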

In [11]:
import pandas as pd
from glob import glob
import shutil
import os
from os.path import exists

PATTERN_input = "./input/simulated_genomes/*/mutated_*/mutated.fasta"
DIR_mid = "./mid/0_mutated_genomes/"
DIR_out = "./mid/1_grouped_genomes/"

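os.makedirs(DIR_mid, exist_ok=True)  # make sure the destination directory exists

LIST_to_copy = glob(PATTERN_input)
for SRC in LIST_to_copy:
    # The parent directory name encodes genome, alpha and aad; drop the fixed prefix
    STR_basename = os.path.basename(os.path.dirname(SRC)).replace('mutated_BLOSUM62_', '')
    DST = DIR_mid + STR_basename + ".fasta"
    if not exists(DST):  # skip files copied in a previous run
        shutil.copyfile(SRC, DST)

print("All renamed")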



All renamed


### II. Group each genome
...into sets that share both a base genome and an alpha value (e.g. 'AG-359-G18_a22'),<br>
so that genomes within a group differ only by average amino acid distance (aad):
* 'AG-359-G18_a22_**aad0005**.fasta'
* 'AG-359-G18_a22_**aad001**.fasta'
* 'AG-359-G18_a22_**aad002**.fasta'
* etc.
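
The grouping rule amounts to cutting each file name at `_aad`. A minimal sketch, using illustrative file names rather than the real directory contents:

```python
def group_key(filename: str) -> str:
    """Return the text before '_aad': base genome plus alpha value."""
    return filename.split("_aad")[0]

# Illustrative file names (not read from disk)
names = [
    "AG-359-G18_a22_aad0005.fasta",
    "AG-359-G18_a22_aad001.fasta",
    "AG-359-G18_a5_aad001.fasta",
]

groups = {}
for name in names:
    groups.setdefault(group_key(name), []).append(name)

for key, members in groups.items():
    print(key, len(members))
```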

#### Why?
Because the next section runs ANI (average nucleotide identity) on pairs of genomes drawn from within each group.
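
To make the pairing concrete, here is a sketch of enumerating every unordered pair within one group. It assumes all-against-all pairing inside a group, which may not be exactly how the next section selects pairs, and the file names are illustrative.

```python
from itertools import combinations
from math import comb

# Illustrative members of a single group
group = [
    "AG-359-G18_a22_aad002.fasta",
    "AG-359-G18_a22_aad0005.fasta",
    "AG-359-G18_a22_aad001.fasta",
]

# Every unordered pair of distinct genomes in the group
pairs = list(combinations(sorted(group), 2))
for first, second in pairs:
    print(first, second)

# A group of 26 fastas would give this many unordered pairs
print(comb(26, 2))  # 325
```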

In [63]:
### Group each fasta by its source genome and alpha value (e.g. "AG-359-G18_a22" has a group of 26 fastas with different aad values)
LIST_files_to_group = glob(DIR_mid + "*.fasta")
DF = pd.DataFrame({'SRC': LIST_files_to_group})
DF['group'] = DF['SRC'].map(os.path.basename).str.split('_aad').str[0]  # text before '_aad' is the group name
print("Grouping "+str(len(DF))+" fastas based on their source genome and aad")

### Create a directory for each group
for DIR in DF['group'].unique():
    os.makedirs(os.path.join(DIR_out, DIR), exist_ok=True)  # also creates DIR_out if missing
print("Made folder for each group")

### Copy each fasta into its group directory
DF['basename'] = DF['SRC'].map(os.path.basename)
DF['DST'] = "./mid/1_grouped_genomes/" + DF['group'] + "/" + DF['basename']  # Derive destination path

for SRC, DST in zip(DF['SRC'], DF['DST']):
    if not exists(DST):  # skip files copied in a previous run
        shutil.copyfile(SRC, DST)
print("Copied each file into its group directory")

Grouping 1456 fastas based on their source genome and aad
Made folder for each group
Copied each file into its group directory
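
One way to sanity-check the result is to count the fastas in each group directory. The sketch below builds a small throwaway tree in a temporary directory rather than touching `./mid/1_grouped_genomes/`; the helper `count_per_group` and the file names are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_per_group(root: Path) -> Counter:
    """Count *.fasta files in each immediate subdirectory of root."""
    return Counter(p.parent.name for p in root.glob("*/*.fasta"))

# Illustrative file names, placed in a temporary tree
names = [
    "AG-359-G18_a22_aad0005.fasta",
    "AG-359-G18_a22_aad001.fasta",
    "AG-359-G18_a5_aad001.fasta",
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in names:
        group_dir = root / name.split("_aad")[0]
        group_dir.mkdir(exist_ok=True)
        (group_dir / name).touch()
    counts = count_per_group(root)

print(dict(counts))
```

Pointing `count_per_group` at the real output directory should show the same count for every group if each source genome and alpha value has the same set of aad values.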


### III. GZIP each group

In [103]:
#!ls ./mid/1_grouped_genomes
!for SAG_DIR in ./mid/1_grouped_genomes/*; do ID=${SAG_DIR%/}; echo $ID; done

./mid/1_grouped_genomes/AG-359-G18_a22
./mid/1_grouped_genomes/AG-359-G18_a5
./mid/1_grouped_genomes/AG-390-D15_a22
./mid/1_grouped_genomes/AG-390-D15_a5
./mid/1_grouped_genomes/AG-390-N04_a22
./mid/1_grouped_genomes/AG-390-N04_a5
./mid/1_grouped_genomes/AG-414-L04_a22
./mid/1_grouped_genomes/AG-414-L04_a5
./mid/1_grouped_genomes/AG-426-E17_a22
./mid/1_grouped_genomes/AG-426-E17_a5
./mid/1_grouped_genomes/AG-435-F03_a22
./mid/1_grouped_genomes/AG-435-F03_a5
./mid/1_grouped_genomes/AG-891-G05_a22
./mid/1_grouped_genomes/AG-891-G05_a5
./mid/1_grouped_genomes/AG-891-I18_a22
./mid/1_grouped_genomes/AG-891-I18_a5
./mid/1_grouped_genomes/AG-891-J04_a22
./mid/1_grouped_genomes/AG-891-J04_a5
./mid/1_grouped_genomes/AG-891-J07_a22
./mid/1_grouped_genomes/AG-891-J07_a5
./mid/1_grouped_genomes/AG-891-K05_a22
./mid/1_grouped_genomes/AG-891-K05_a5
./mid/1_grouped_genomes/AG-892-F15_a22
./mid/1_grouped_genomes/AG-892-F15_a5
./mid/1_grouped_genomes/AG-893-E23_a22
./mid/1_grouped_genomes/AG-893-E23_a5

In [99]:
import tarfile

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

# os.walk yields the root directory first, so [1:] skips it
for DIR in [x[0] for x in os.walk("./mid/1_grouped_genomes/")][1:]:
    basename = os.path.basename(DIR)
    DEST = "./mid/2_zipped_inputs/"+basename+".tar.gz"
    print(basename+'+'+DEST)
    # Dry run only: uncomment the next line to actually write each archive
    # make_tarfile(DEST, DIR)
print("done")

AG-891-G05_a5+./mid/2_zipped_inputs/AG-891-G05_a5.tar.gz
AG-891-G05_a22+./mid/2_zipped_inputs/AG-891-G05_a22.tar.gz
AG-435-F03_a5+./mid/2_zipped_inputs/AG-435-F03_a5.tar.gz
AG-435-F03_a22+./mid/2_zipped_inputs/AG-435-F03_a22.tar.gz
AG-891-I18_a22+./mid/2_zipped_inputs/AG-891-I18_a22.tar.gz
AG-891-I18_a5+./mid/2_zipped_inputs/AG-891-I18_a5.tar.gz
AG-893-E23_a5+./mid/2_zipped_inputs/AG-893-E23_a5.tar.gz
AG-893-E23_a22+./mid/2_zipped_inputs/AG-893-E23_a22.tar.gz
AG-920-L07_a22+./mid/2_zipped_inputs/AG-920-L07_a22.tar.gz
AG-920-L07_a5+./mid/2_zipped_inputs/AG-920-L07_a5.tar.gz
AG-414-L04_a5+./mid/2_zipped_inputs/AG-414-L04_a5.tar.gz
AG-414-L04_a22+./mid/2_zipped_inputs/AG-414-L04_a22.tar.gz
AG-891-K05_a5+./mid/2_zipped_inputs/AG-891-K05_a5.tar.gz
AG-891-K05_a22+./mid/2_zipped_inputs/AG-891-K05_a22.tar.gz
AG-893-F11_a22+./mid/2_zipped_inputs/AG-893-F11_a22.tar.gz
AG-893-F11_a5+./mid/2_zipped_inputs/AG-893-F11_a5.tar.gz
AG-908-A02_a22+./mid/2_zipped_inputs/AG-908-A02_a22.tar.gz
AG-908-A02_a5
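
Since the loop above only prints the planned destinations, the following sketch exercises `make_tarfile` on a throwaway group directory and lists the archive members. The directory, file name and FASTA record are made up for the test and live in a temporary directory.

```python
import os
import tarfile
import tempfile

def make_tarfile(output_filename, source_dir):
    """Write source_dir into a gzip-compressed tar, rooted at its own name."""
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical group directory holding one tiny fasta
    group_dir = os.path.join(tmp, "AG-359-G18_a22")
    os.mkdir(group_dir)
    with open(os.path.join(group_dir, "AG-359-G18_a22_aad001.fasta"), "w") as fh:
        fh.write(">contig_1\nACGT\n")

    archive = os.path.join(tmp, "AG-359-G18_a22.tar.gz")
    make_tarfile(archive, group_dir)

    # Read the archive back and collect member names
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

Note that `os.path.basename` returns an empty string for a path ending in `/`, so `source_dir` should be passed without a trailing slash for the archive to be rooted at the group name.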

In [83]:
[x[0] for x in os.walk("./mid/1_grouped_genomes/")]




['./mid/1_grouped_genomes/',
 './mid/1_grouped_genomes/AG-891-G05_a5',
 './mid/1_grouped_genomes/AG-891-G05_a22',
 './mid/1_grouped_genomes/AG-435-F03_a5',
 './mid/1_grouped_genomes/AG-435-F03_a22',
 './mid/1_grouped_genomes/AG-891-I18_a22',
 './mid/1_grouped_genomes/AG-891-I18_a5',
 './mid/1_grouped_genomes/AG-893-E23_a5',
 './mid/1_grouped_genomes/AG-893-E23_a22',
 './mid/1_grouped_genomes/AG-920-L07_a22',
 './mid/1_grouped_genomes/AG-920-L07_a5',
 './mid/1_grouped_genomes/AG-414-L04_a5',
 './mid/1_grouped_genomes/AG-414-L04_a22',
 './mid/1_grouped_genomes/AG-891-K05_a5',
 './mid/1_grouped_genomes/AG-891-K05_a22',
 './mid/1_grouped_genomes/AG-893-F11_a22',
 './mid/1_grouped_genomes/AG-893-F11_a5',
 './mid/1_grouped_genomes/AG-908-A02_a22',
 './mid/1_grouped_genomes/AG-908-A02_a5',
 './mid/1_grouped_genomes/AG-894-P05_a22',
 './mid/1_grouped_genomes/AG-894-P05_a5',
 './mid/1_grouped_genomes/AG-891-J07_a5',
 './mid/1_grouped_genomes/AG-891-J07_a22',
 './mid/1_grouped_genomes/AG-894-C14