# Data Inputs
(adapted from `continuous_filtering.ipynb`)

This notebook assembles the input files needed to run our analysis of genomic islands (scripts in this directory). The following files were downloaded from the listed links to the indicated locations:

1. `Data_S1-Pro-Clusters.xlsx` from Supplementary Materials of [Blaskowski et al., 2024](https://zenodo.org/records/12210994)
    - Downloaded to `data/clusters/`
1. `Data_S7-genome-metadata.csv` from Supplementary Materials of [Blaskowski et al., 2024](https://zenodo.org/records/12210994)
    - Downloaded to `data/cycogs/`
1. `Data_S8-img_data_cycog6.tar.gz` from Supplementary Materials of [Blaskowski et al., 2024](https://zenodo.org/records/12210994)
    - Downloaded to this directory (`data/`) and extracted with `tar -xzvf Data_S8-img_data_cycog6.tar.gz`
    - Extracted directory is named `data/img_data_cycog6/`
1. `cycogs.tsv` from Supplementary Materials of [Berube et al., 2018](https://doi.org/10.6084/m9.figshare.c.4037048.v1)
    - Downloaded to `data/cycogs/`
1. `mmc1.xlsx` and `mmc2.xlsx` from Supplementary Materials of [Hackl et al., 2023](https://doi.org/10.1016/j.cell.2022.12.006)
    - Downloaded to `data/hackl-2023/`
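
Before running the cells below, it can help to confirm that the downloads landed where the list above says they should. The following is a minimal sketch, not part of the original analysis: `EXPECTED_INPUTS` and `find_missing_inputs` are hypothetical names, and the relative paths assume the notebook is run from the `data/` directory, as the `pd.read_csv` calls below suggest.

```python
from pathlib import Path

# Expected inputs, relative to data/ (paths copied from the list above).
EXPECTED_INPUTS = [
    "clusters/Data_S1-Pro-Clusters.xlsx",
    "cycogs/Data_S7-genome-metadata.csv",
    "img_data_cycog6",  # directory extracted from Data_S8-img_data_cycog6.tar.gz
    "cycogs/cycogs.tsv",
    "hackl-2023/mmc1.xlsx",
    "hackl-2023/mmc2.xlsx",
]


def find_missing_inputs(data_dir=".", expected=EXPECTED_INPUTS):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


missing = find_missing_inputs()
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All expected inputs are present.")
```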

The outputs of this notebook are the following files:

1. `genome-metadata.csv`
    - A file of all of the _Prochlorococcus_ genomes used to make CyCOGv6, with additional information added in, including:
        - Whether or not a genome was analyzed for predicted genomic islands by Hackl et al.
        - The number of contigs per genome in the sequences used to make CyCOGv6 (`Data_S8-img_data_cycog6.tar.gz`)
1. `ortholog-metadata.csv`
    - A file of all of the genes included in CyCOGv6 that includes the following fields:
       - `MappingName`, `CyCOGID`, `GenomeName`, `GeneID`, `GeneID`


In [1]:
import os
import pandas as pd

## Genome Metadata

This code block builds the `genome-metadata.csv` file from `Data_S7-genome-metadata.csv` and the other input files.

In [2]:
# read in genome metadata from Blaskowski et al.

cycog_genome_df = pd.read_csv('cycogs/Data_S7-genome-metadata.csv')
cycog_genome_df = cycog_genome_df.rename(columns={'Clade': 'CladeCyCOG', 'Completeness': 'CompletenessCyCOG'})
cycog_genome_df = cycog_genome_df[cycog_genome_df.Group == 'Prochlorococcus']
cycog_genome_df['CyCOGGenome'] = True

cycog_genome_df

Unnamed: 0,GenomeID,GenomeName,Type,Group,CladeCyCOG,Virocell,CompletenessCyCOG,CyCOGGenome
0,2716884698,AG-316-L16,SAG,Prochlorococcus,AMZ-II,False,20.69,True
1,2716884700,AG-316-N23,SAG,Prochlorococcus,AMZ-II,False,20.69,True
2,2716884701,AG-316-P23,SAG,Prochlorococcus,AMZ-II,False,12.07,True
3,2716884699,AG-316-L21,SAG,Prochlorococcus,AMZ-II,False,10.34,True
4,2716884642,AG-316-A05,SAG,Prochlorococcus,AMZ-II,False,6.90,True
...,...,...,...,...,...,...,...,...
599,2681813574,MIT1341,ISOLATE,Prochlorococcus,LLVII,False,100.00,True
600,2681813570,MIT1300,ISOLATE,Prochlorococcus,LLVII,False,99.86,True
601,2681813572,MIT1307,ISOLATE,Prochlorococcus,LLVII,False,99.18,True
602,2667527276,AG-402-N21,SAG,Prochlorococcus,LLVIII,False,79.80,True


In [3]:
# add in information on how many contigs each CyCOG genome has

genome_dir = 'img_data_cycog6/'

# collect (genome ID, contig count) pairs for the genomes listed in the CyCOG metadata
data = []
cycog_genome_ids = set(cycog_genome_df['GenomeID'])

# Iterate over the per-genome directories within genome_dir
for genome in os.listdir(genome_dir):
    genome_path = os.path.join(genome_dir, genome)

    # Skip anything that is not a genome directory (directory names are numeric genome IDs)
    if not os.path.isdir(genome_path) or not genome.isdigit():
        continue

    if int(genome) in cycog_genome_ids:
        # Build fasta filename of genome
        file_path = os.path.join(genome_path, f'{genome}.fna')
        # Open and read the .fna file
        with open(file_path, 'r') as fna_file:
            content = fna_file.read()
            # Count the number of occurrences of the '>' character
            count_greater_than = content.count('>')
            # Append the result to the data list
            data.append((int(genome), count_greater_than))

# Add count data to dataframe
cycog_genome_df['NContigsCyCOG'] = cycog_genome_df['GenomeID'].map(dict(data))
cycog_genome_df


Unnamed: 0,GenomeID,GenomeName,Type,Group,CladeCyCOG,Virocell,CompletenessCyCOG,CyCOGGenome,NContigsCyCOG
0,2716884698,AG-316-L16,SAG,Prochlorococcus,AMZ-II,False,20.69,True,64
1,2716884700,AG-316-N23,SAG,Prochlorococcus,AMZ-II,False,20.69,True,41
2,2716884701,AG-316-P23,SAG,Prochlorococcus,AMZ-II,False,12.07,True,31
3,2716884699,AG-316-L21,SAG,Prochlorococcus,AMZ-II,False,10.34,True,34
4,2716884642,AG-316-A05,SAG,Prochlorococcus,AMZ-II,False,6.90,True,20
...,...,...,...,...,...,...,...,...,...
599,2681813574,MIT1341,ISOLATE,Prochlorococcus,LLVII,False,100.00,True,1
600,2681813570,MIT1300,ISOLATE,Prochlorococcus,LLVII,False,99.86,True,1
601,2681813572,MIT1307,ISOLATE,Prochlorococcus,LLVII,False,99.18,True,1
602,2667527276,AG-402-N21,SAG,Prochlorococcus,LLVIII,False,79.80,True,47
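
The contig count above is the number of `>` characters in each `.fna` file. A slightly more defensive variant counts only lines that start with `>`, so that a `>` inside a header description is not counted twice, and it avoids loading the whole file into memory. The sketch below runs on a made-up in-memory FASTA; `count_fasta_records` is a hypothetical helper, not a function from the original notebook.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))


# Toy two-contig FASTA; the second header contains an extra '>' in its description.
example = io.StringIO(">contig_1\nACGT\nACGT\n>contig_2 len>3\nGGCC\n")
print(count_fasta_records(example))  # prints 2
```

For an on-disk genome, the same helper can be applied to an open file handle, for example `with open(file_path) as fna_file: count_fasta_records(fna_file)`.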


In [4]:
# read in genome data from Hackl et al.

gi_locations_df = pd.read_excel('hackl-2023/mmc1.xlsx').rename(columns={
    'data_img_genome_id': 'GenomeID', 
    'genome_id': 'GenomeName', 
    'clade_bac120': 'CladeHackl', 
    'sample_type': 'Type', 
    'assembly_contigs': 'NContigsHackl', 
    'taxonomy_genus': 'Genus', 
    'checkm_completeness': 'CompletenessHackl'
})
gi_locations_df['GenomicIslands'] = True

gi_locations_df


Unnamed: 0,GenomeName,name,CladeHackl,notes,Type,depth,lat,long,assembly_genome_size,assembly_ambiguous_bases,...,taxonomy_strain_ncbi_taxid,data_assembly_file_origin,data_ncbi_bioproject_accession,data_ncbi_biosample_accession,data_ncbi_genbank_accession,data_ncbi_refseq_accession,data_ncbi_wgs_accession,GenomeID,data_assembly_ftp,GenomicIslands
0,AG-311-I09,Uncultured Prochlorococcus sp. AG-311-I09,HLI,Simons sequencing project,SAG,20.0,-20.08,-70.80,697970,0,...,,Bigelow,,,,,,2.716885e+09,,True
1,AG-311-J05,Uncultured Prochlorococcus sp. AG-311-J05,HLI,Simons sequencing project,SAG,20.0,-20.08,-70.80,623148,0,...,,Bigelow,,,,,,2.716885e+09,,True
2,AG-311-M23,Uncultured Prochlorococcus sp. AG-311-M23,HLI,Simons sequencing project,SAG,20.0,-20.08,-70.80,651213,0,...,,Bigelow,,,,,,2.716885e+09,,True
3,AG-321-D23,Uncultured Prochlorococcus sp. AG-321-D23,HLI,Simons sequencing project,SAG,14.0,-23.46,-88.77,1201571,0,...,,Bigelow,,,,,,2.716885e+09,,True
4,AG-321-E21,Uncultured Prochlorococcus sp. AG-321-E21,HLI,Simons sequencing project,SAG,14.0,-23.46,-88.77,1314750,0,...,,Bigelow,,,,,,2.716885e+09,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
618,MIT1307,Prochlorococcus sp. MIT1307,LLVII,,Isolate,150.0,22.00,-158.00,2032419,0,...,,Chisholm Lab,,,,,,,,True
619,MIT1341,Prochlorococcus sp. MIT1341,LLVII,,Isolate,150.0,22.00,-158.00,1937096,0,...,,Chisholm Lab,,,,,,,,True
620,UBA1269,Prochlorococcus sp. UBA1269,LLVII,Tara; location unclear; depth unclear,MAG,,,,1606016,71071,...,1947242.0,NCBI,PRJNA348753,SAMN06451922,GCA_002308455.1,,DBVH00000000,,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002...,True
621,UBA1273,Prochlorococcus sp. UBA1273,LLVII,Tara; location unclear; depth unclear,MAG,,,,1658888,58506,...,1947243.0,NCBI,PRJNA348753,SAMN06454572,GCA_002308835.1,,DBVD00000000,,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002...,True


In [5]:
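# merge the CyCOG and Hackl genome tables on genome name
# (outer join: genomes found in only one dataset are kept, with missing values filled in below)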

genome_df = pd.merge(
    cycog_genome_df[['GenomeID', 'GenomeName', 'Type', 'CladeCyCOG', 'CompletenessCyCOG', 'CyCOGGenome', 'NContigsCyCOG']], 
    gi_locations_df[['GenomeName',  'CladeHackl', 'CompletenessHackl', 'GenomicIslands', 'NContigsHackl']], 
    on=['GenomeName'], 
    how='outer'
)
genome_df['GenomeID'] = genome_df['GenomeID'].fillna(-1).astype(int)
genome_df['NContigsCyCOG'] = genome_df['NContigsCyCOG'].fillna(-1).astype(int)
genome_df['NContigsHackl'] = genome_df['NContigsHackl'].fillna(-1).astype(int)
genome_df['CyCOGGenome'] = genome_df['CyCOGGenome'].fillna(False).astype(bool)
genome_df['GenomicIslands'] = genome_df['GenomicIslands'].fillna(False).astype(bool)
genome_df


Unnamed: 0,GenomeID,GenomeName,Type,CladeCyCOG,CompletenessCyCOG,CyCOGGenome,NContigsCyCOG,CladeHackl,CompletenessHackl,GenomicIslands,NContigsHackl
0,2716884698,AG-316-L16,SAG,AMZ-II,20.69,True,64,,,False,-1
1,2716884700,AG-316-N23,SAG,AMZ-II,20.69,True,41,,,False,-1
2,2716884701,AG-316-P23,SAG,AMZ-II,12.07,True,31,,,False,-1
3,2716884699,AG-316-L21,SAG,AMZ-II,10.34,True,34,,,False,-1
4,2716884642,AG-316-A05,SAG,AMZ-II,6.90,True,20,,,False,-1
...,...,...,...,...,...,...,...,...,...,...,...
732,-1,JGI_02_N20,,,,False,-1,LLIV,30.07,True,53
733,-1,TMED223,,,,False,-1,LLIV,50.40,True,121
734,-1,UBA1269,,,,False,-1,LLVII,65.70,True,440
735,-1,UBA1273,,,,False,-1,LLVII,69.16,True,398
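
The merge above marks missing IDs and contig counts with a `-1` sentinel. One possible alternative, shown here only as a sketch on a three-genome toy table, is to let `pd.merge` report dataset membership through `indicator=True` and to keep missing counts as `<NA>` using the nullable `Int64` dtype. The toy contig counts are taken from the tables in this notebook; the column handling is an assumption about what downstream scripts could accept, not what they currently expect.

```python
import pandas as pd

# Toy stand-ins for cycog_genome_df and gi_locations_df
left = pd.DataFrame({"GenomeName": ["MED4", "AG-316-L16"], "NContigsCyCOG": [1, 64]})
right = pd.DataFrame({"GenomeName": ["MED4", "UBA1269"], "NContigsHackl": [1, 440]})

# indicator=True adds a '_merge' column with values left_only / right_only / both
merged = pd.merge(left, right, on="GenomeName", how="outer", indicator=True)

# Derive the membership flags from the merge indicator
merged["CyCOGGenome"] = merged["_merge"].isin(["left_only", "both"])
merged["GenomicIslands"] = merged["_merge"].isin(["right_only", "both"])

# Nullable integers keep missing counts as <NA> instead of a -1 sentinel
for col in ["NContigsCyCOG", "NContigsHackl"]:
    merged[col] = merged[col].astype("Int64")

merged = merged.drop(columns="_merge")
print(merged)
```

The trade-off is that `<NA>` values are written to CSV as empty fields, so any script that tests for `-1` would need to be updated accordingly.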


In [6]:
# look at genomes in which number of CyCOG contigs doesn't match number of Hackl contigs

overlap_df = genome_df[genome_df['CyCOGGenome'] & genome_df['GenomicIslands']]
discordant_df = overlap_df[overlap_df['NContigsCyCOG'] != overlap_df['NContigsHackl']]
print(f'{len(discordant_df)} genomes are discordant')
discordant_df[['GenomeID', 'GenomeName', 'CladeCyCOG', 'NContigsCyCOG', 'CladeHackl', 'NContigsHackl']]


28 genomes are discordant


Unnamed: 0,GenomeID,GenomeName,CladeCyCOG,NContigsCyCOG,CladeHackl,NContigsHackl
119,2606217689,EQPAC1,HLI,8,HLI,7
331,2606217677,SB,HLII,4,HLII,3
334,2606217680,MIT9311,HLII,17,HLII,15
336,2606217312,MIT9314,HLII,16,HLII,15
341,2606217606,GP2,HLII,11,HLII,10
342,2606217692,MIT9107,HLII,13,HLII,12
344,2606217690,MIT9116,HLII,22,HLII,18
345,2606217318,MIT9123,HLII,18,HLII,17
346,2606217691,MIT9302,HLII,17,HLII,15
359,2551306550,W11,HLIV,158,HLIII/HLIV,62


In [7]:
# list genomes present in both datasets that are closed (a single contig) in either assembly

single_contig_genomes_df = overlap_df[(overlap_df['NContigsCyCOG'] == 1) | (overlap_df['NContigsHackl'] == 1)]
print(f'There are {len(single_contig_genomes_df)} complete isolate genomes that are closed (one contig only) ' + 
      f'representing {single_contig_genomes_df.CladeHackl.nunique()} distinct clades.')

single_contig_genomes_df[
    ['GenomeID', 'GenomeName', 'Type', 'CladeCyCOG', 'CladeHackl', 'NContigsCyCOG', 'NContigsHackl', 
     'CompletenessCyCOG', 'CompletenessHackl']
]

There are 27 complete isolate genomes that are closed (one contig only) representing 7 distinct clades.


Unnamed: 0,GenomeID,GenomeName,Type,CladeCyCOG,CladeHackl,NContigsCyCOG,NContigsHackl,CompletenessCyCOG,CompletenessHackl
118,2623620345,MIT9515,ISOLATE,HLI,HLI,1,1,100.0,100.0
120,2606217259,MED4,ISOLATE,HLI,HLI,1,1,99.46,100.0
329,2681813573,MIT1314,ISOLATE,HLII,HLII,1,1,100.0,100.0
332,2606217688,MIT0604,ISOLATE,HLII,HLII,1,1,99.73,99.85
333,2606217559,MIT9215,ISOLATE,HLII,HLII,1,1,99.73,100.0
335,2606217708,MIT9312,ISOLATE,HLII,HLII,1,1,99.73,100.0
340,2623620959,AS9601,ISOLATE,HLII,HLII,1,1,99.64,99.89
343,2623620961,MIT9301,ISOLATE,HLII,HLII,1,1,99.46,100.0
470,2623620348,NATL1A,ISOLATE,LLI,LLI,1,1,99.73,100.0
471,2606217240,NATL2A,ISOLATE,LLI,LLI,1,1,99.45,100.0


In [8]:
# save genome-metadata.csv

# create the output directory if it does not already exist
os.makedirs('metadata', exist_ok=True)
genome_df.to_csv('metadata/genome-metadata.csv', index=False)
