## Additional Reproducibility Efforts 

The analysis of the `.csv` and `.tsv` files uses `pandas`; `DataFrame` is imported separately for type annotations.
All supplementary tables (`STable*`) were downloaded from the Zenodo repository [1] of the original article [2].

> [1] A. Gavriilidou, “Compendium of specialized metabolite biosynthetic diversity encoded in bacterial genomes”. Zenodo, Apr. 15, 2022. (https://doi.org/10.5281/zenodo.5159210)

> [2] Gavriilidou, A., Kautsar, S.A., Zaburannyi, N. et al. Compendium of specialized metabolite biosynthetic diversity encoded in bacterial genomes. Nat Microbiol 7, 726–735 (2022). (https://doi.org/10.1038/s41564-022-01110-2)

In [1]:
import pandas as pd
from pandas import DataFrame

In [2]:
STable1: DataFrame = pd.read_csv("STable1_all_genomes_info.tsv", sep="|")
STable2: DataFrame = pd.read_csv("STable2_BiG-SLICE_t0.4_GCF_assignment.csv", sep=",")
STable3: DataFrame = pd.read_csv("STable3_BiG-SLICE_t0.5_GCF_assignment.csv", sep=",")
STable4: DataFrame = pd.read_csv("STable4_BiG-SLICE_t0.6_GCF_assignment.csv", sep=",")
STable5: DataFrame = pd.read_csv("STable5_BiG-SLICE_t0.7_GCF_assignment.csv", sep=",")

In [3]:
STable1 = STable1.rename(columns={"bgc_ids":"bgc_id"})

STable1["bgc_id"] =  STable1.bgc_id.str.split(",")

STable1 = STable1.explode(column="bgc_id").reset_index(drop=True)

STable1["bgc_id"] = STable1.bgc_id.astype("int64")
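The reshaping in the cell above can be illustrated on a tiny table. This is a minimal sketch that assumes `bgc_ids` holds comma-separated integers, as the cell implies; `toy_genomes` and its values are hypothetical and not taken from STable1.

```python
import pandas as pd

# Hypothetical stand-in for STable1: one row per genome,
# with all BGC ids packed into a single comma-separated string.
toy_genomes = pd.DataFrame(
    {
        "AccNo": ["genome_A", "genome_B"],
        "bgc_ids": ["101,102,103", "204"],
    }
)

# Same steps as the cell above: rename, split, explode, cast to integer.
toy_long = toy_genomes.rename(columns={"bgc_ids": "bgc_id"})
toy_long["bgc_id"] = toy_long["bgc_id"].str.split(",")
toy_long = toy_long.explode(column="bgc_id").reset_index(drop=True)
toy_long["bgc_id"] = toy_long["bgc_id"].astype("int64")

print(toy_long)
```

After the explode step each BGC has its own row, which is what allows a merge on `bgc_id` with the GCF assignment tables.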

For every threshold, the exploded `STable1` is joined to the corresponding GCF assignment table with `pd.merge` on `bgc_id`. The taxonomy is taken from the `taxonomy` column of each merged dataframe, and the rows whose genus is *Streptomyces* are selected. This addresses the following question:

* **The *Streptomyces* GCF counts are known from Figure 1_A (see Figure_1.ipynb). Does the filtering procedure below give the same results?**

Figure 1_A data:

|          | **_Genus_**          | **_Number of GCFs_** | **_Threshold (T)_** |
|----------|----------------------|----------------------|---------------------|
| **_0_**  | **_Streptomyces_**   | **_7294_**           | **_t=0.4_**         |
| **_9_**  | **_Streptomyces_**   | **_5720_**           | **_t=0.5_**         |
| **_18_** | **_Streptomyces_**   | **_4360_**           | **_t=0.6_**         |
| **_27_** | **_Streptomyces_**   | **_2889_**           | **_t=0.7_**         |


In [4]:
thresholds_4_gcfs = pd.merge(STable1, STable2, on='bgc_id')

taxonomy_4 = thresholds_4_gcfs.taxonomy.str.split(",",expand=True).fillna("")
streptomyces_index_4 = taxonomy_4[taxonomy_4[5] == "Streptomyces"].index

thresholds_4_gcf_count = thresholds_4_gcfs.loc[streptomyces_index_4].gcf_id.value_counts()

In [5]:
thresholds_5_gcfs = pd.merge(STable1, STable3, on='bgc_id')

taxonomy_5 = thresholds_5_gcfs.taxonomy.str.split(",",expand=True).fillna("")
streptomyces_index_5 = taxonomy_5[taxonomy_5[5] == "Streptomyces"].index

thresholds_5_gcf_count = thresholds_5_gcfs.loc[streptomyces_index_5].gcf_id.value_counts()

In [6]:
thresholds_6_gcfs = pd.merge(STable1, STable4, on='bgc_id')

taxonomy_6 = thresholds_6_gcfs.taxonomy.str.split(",",expand=True).fillna("")
streptomyces_index_6 = taxonomy_6[taxonomy_6[5] == "Streptomyces"].index

thresholds_6_gcf_count = thresholds_6_gcfs.loc[streptomyces_index_6].gcf_id.value_counts()

In [7]:
thresholds_7_gcfs = pd.merge(STable1, STable5, on='bgc_id')

taxonomy_7 = thresholds_7_gcfs.taxonomy.str.split(",",expand=True).fillna("")
streptomyces_index_7 = taxonomy_7[taxonomy_7[5] == "Streptomyces"].index

thresholds_7_gcf_count = thresholds_7_gcfs.loc[streptomyces_index_7].gcf_id.value_counts()
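The four cells above repeat the same steps for each threshold. As a hedged sketch, they could be folded into one helper; `count_genus_gcfs` and the toy tables are hypothetical names introduced only for illustration, and the genus is assumed to sit at position 5 of the comma-separated taxonomy string, as in the cells above.

```python
import pandas as pd
from pandas import DataFrame


def count_genus_gcfs(genomes: DataFrame, assignments: DataFrame, genus: str) -> int:
    """Count distinct gcf_id values among BGCs whose genus (rank 5) matches."""
    merged = pd.merge(genomes, assignments, on="bgc_id")  # inner join, as above
    ranks = merged["taxonomy"].str.split(",", expand=True).fillna("")
    return merged.loc[ranks[5] == genus, "gcf_id"].nunique()


# Hypothetical toy tables shaped like the exploded STable1 and one assignment table.
toy_genomes = DataFrame(
    {
        "bgc_id": [1, 2, 3, 4],
        "taxonomy": [
            "Bacteria,P1,C1,O1,F1,Streptomyces,Streptomyces sp1",
            "Bacteria,P1,C1,O1,F1,Streptomyces,Streptomyces sp2",
            "Bacteria,P1,C1,O1,F1,Streptomyces,Streptomyces sp2",
            "Bacteria,P1,C1,O2,F2,Nocardia,Nocardia sp1",
        ],
    }
)
toy_assignments = DataFrame(
    {
        "bgc_id": [1, 2, 3, 4, 5],  # bgc_id 5 has no genome row and is dropped
        "gcf_id": [10, 10, 11, 12, 13],
    }
)

print(count_genus_gcfs(toy_genomes, toy_assignments, "Streptomyces"))
print(count_genus_gcfs(toy_genomes, toy_assignments, "Nocardia"))
```

`nunique()` returns the same number as `len(... .value_counts())` in the cells above, since both ignore missing values by default.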

In [8]:
len(thresholds_4_gcf_count),len(thresholds_5_gcf_count),len(thresholds_6_gcf_count),len(thresholds_7_gcf_count)

(8703, 6798, 5136, 3363)

Side by side, the published and reproduced counts are:
<table>
<tr><th>Original Data </th><th>Reproduced Data</th></tr>
<tr><td>

|          |      **_Genus_**     | **_Number of GCFs_** | **_Threshold (T)_** |
|:--------:|:--------------------:|:--------------------:|:-------------------:|
|  **_0_** |  **_Streptomyces_**  |      **_7294_**      |     **_t=0.4_**     |
|  **_9_** |  **_Streptomyces_**  |      **_5720_**      |     **_t=0.5_**     |
| **_18_** |  **_Streptomyces_**  |      **_4360_**      |     **_t=0.6_**     |
| **_27_** |  **_Streptomyces_**  |      **_2889_**      |     **_t=0.7_**     |

</td><td>

|          |      **_Genus_**     | **_Number of GCFs_** | **_Threshold (T)_** |
|:--------:|:--------------------:|:--------------------:|:-------------------:|
|  **_0_** |  **_Streptomyces_**  |      **_8703_**      |     **_t=0.4_**     |
|  **_9_** |  **_Streptomyces_**  |      **_6798_**      |     **_t=0.5_**     |
| **_18_** |  **_Streptomyces_**  |      **_5136_**      |     **_t=0.6_**     |
| **_27_** |  **_Streptomyces_**  |      **_3363_**      |     **_t=0.7_**     |

</td></tr> </table>

In [9]:
# Excess of the reproduced counts over the published Figure 1_A counts
len(thresholds_4_gcf_count)-7294,len(thresholds_5_gcf_count)-5720,len(thresholds_6_gcf_count)-4360,len(thresholds_7_gcf_count)-2889

(1409, 1078, 776, 474)

In [10]:
len(thresholds_4_gcf_count)-len(thresholds_5_gcf_count),len(thresholds_5_gcf_count)-len(thresholds_6_gcf_count),len(thresholds_6_gcf_count)-len(thresholds_7_gcf_count)

(1905, 1662, 1773)

In [11]:
1409-1078, 1078-776, 776-474

(331, 302, 302)
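The arithmetic in the last few cells can also be kept in a single table, which avoids retyping intermediate results. This is only a sketch: the numbers are the published and reproduced counts quoted above, while the column names are chosen here for illustration.

```python
import pandas as pd

comparison = pd.DataFrame(
    {
        "threshold": [0.4, 0.5, 0.6, 0.7],
        "original": [7294, 5720, 4360, 2889],    # Figure 1_A values quoted above
        "reproduced": [8703, 6798, 5136, 3363],  # output of cell In [8]
    }
).set_index("threshold")

# Absolute and relative excess of the reproduced counts.
comparison["excess"] = comparison["reproduced"] - comparison["original"]
comparison["excess_pct"] = (100 * comparison["excess"] / comparison["original"]).round(1)

# Change between neighbouring thresholds, matching cells In [10] and In [11].
step_reproduced = (-comparison["reproduced"].diff()).dropna().astype(int)
step_excess = (-comparison["excess"].diff()).dropna().astype(int)

print(comparison)
print(step_reproduced.tolist(), step_excess.tolist())
```

With these inputs the reproduced counts exceed the published ones by roughly 16 to 19 percent at every threshold, so the discrepancy looks systematic rather than specific to one threshold.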

In [12]:
# Row-count difference between each assignment table and its inner merge with STable1
len(STable2)-len(thresholds_4_gcfs),len(STable3)-len(thresholds_5_gcfs),len(STable4)-len(thresholds_6_gcfs),len(STable5)-len(thresholds_7_gcfs)

(764, 37, 37, 37)

The differences between the published and reproduced counts, together with the auxiliary calculations from cells In [10] to In [12], are summarized below:
<table>
<tr><th>Original Data vs Reproduced Data </th><th>Extra Operations</th></tr>
<tr><td>

|          |      **_Genus_**     | **_Number of GCFs Differ_** | **_Threshold (T)_** |
|:--------:|:--------------------:|:---------------------------:|:-------------------:|
|  **_$diff_1$_** |  **_Streptomyces_**  |          **_1409_**         |     **_t=0.4_**     |
|  **_$diff_2$_** |  **_Streptomyces_**  |          **_1078_**         |     **_t=0.5_**     |
| **_$diff_3$_** |  **_Streptomyces_**  |          **_776_**          |     **_t=0.6_**     |
| **_$diff_4$_** |  **_Streptomyces_**  |          **_474_**          |     **_t=0.7_**     |

</td><td>

| **_Operation_** | **_Values of the output_** |
|:--------------------:|:--------------------:|
| **_$t_{0.4} - t_{0.5},\ t_{0.5} - t_{0.6},\ t_{0.6} - t_{0.7}$_** | **(1905, 1662, 1773)** |
| **_$diff_1 - diff_2,\ diff_2 - diff_3,\ diff_3 - diff_4$_** | **(331, 302, 302)** |
| **_Rows lost in the inner merge (t = 0.4, 0.5, 0.6, 0.7)_** | **(764, 37, 37, 37)** |

</td></tr> </table>
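To see which rows an inner merge drops, rather than only how many, one option is a left merge with `indicator=True`. The sketch below uses hypothetical toy tables and assumes each `bgc_id` appears at most once on the genome side, so that the row-count difference equals the number of unmatched assignment rows.

```python
import pandas as pd

# Hypothetical toy tables; the real inputs would be the exploded STable1
# and one of the GCF assignment tables.
toy_genomes = pd.DataFrame({"bgc_id": [1, 2, 3], "taxonomy": ["t1", "t2", "t3"]})
toy_assignments = pd.DataFrame({"bgc_id": [2, 3, 4, 5], "gcf_id": [10, 11, 12, 13]})

# Keep every assignment row and record whether a genome row matched it.
audit = pd.merge(toy_assignments, toy_genomes, on="bgc_id", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(unmatched))
print(unmatched["bgc_id"].tolist())
```

Inspecting the unmatched `bgc_id` values in the real tables might show whether the 764 and 37 dropped rows belong to particular datasets.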

In [23]:
phylum_taxa  = taxonomy_4[1].value_counts()[:20]
class_taxa   = taxonomy_4[2].value_counts()[:20]
order_taxa   = taxonomy_4[3].value_counts()[:20]
family_taxa  = taxonomy_4[4].value_counts()[:20]
genus_taxa   = taxonomy_4[5].value_counts()[:20]
species_taxa = taxonomy_4[6].value_counts()[:20]
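The positional columns `1` to `6` correspond to phylum through species. A hedged sketch of giving them names follows; the label `domain` for position 0 is an assumption, and the lineages are hypothetical placeholders rather than real taxa.

```python
import pandas as pd

# Assumed rank order of the comma-separated taxonomy string.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

toy = pd.DataFrame(
    {
        "taxonomy": [
            "Bacteria,PhylumA,ClassA,OrderA,FamilyA,GenusA,GenusA sp1",
            "Bacteria,PhylumA,ClassA,OrderA,FamilyA,GenusA,GenusA sp2",
            "Bacteria,PhylumB,ClassB",  # lineage truncated above order level
        ]
    }
)

ranks = toy["taxonomy"].str.split(",", expand=True).fillna("")
ranks.columns = RANKS[: ranks.shape[1]]

genus_counts = ranks["genus"].value_counts()
print(genus_counts)
```

Because missing ranks are filled with an empty string, truncated lineages are counted under the name `""`, which can rank highly in `value_counts()`.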

In [27]:
# Taxonomy resolution
taxonomy_resolution = DataFrame({"phylum_taxa_name":list(phylum_taxa.index),
                         "phylum_taxa_count":list(phylum_taxa.values),
                         "class_taxa_name":list(class_taxa.index),
                         "class_taxa_count":list(class_taxa.values),
                         "order_taxa_name":list(order_taxa.index),
                         "order_taxa_count":list(order_taxa.values),
                         "family_taxa_name":list(family_taxa.index),
                         "family_taxa_count":list(family_taxa.values),
                         "genus_taxa_name":list(genus_taxa.index),
                         "genus_taxa_count":list(genus_taxa.values),
                         "species_taxa_name":list(species_taxa.index),
                         "species_taxa_count":list(species_taxa.values),
                        }
                       )

In [28]:
taxonomy_resolution.head()

|   | phylum_taxa_name | phylum_taxa_count | class_taxa_name | class_taxa_count | order_taxa_name | order_taxa_count | family_taxa_name | family_taxa_count | genus_taxa_name | genus_taxa_count | species_taxa_name | species_taxa_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Proteobacteria | 565024 | Gammaproteobacteria | 507067 | Enterobacterales | 202288 | Enterobacteriaceae | 170657 | Mycobacterium | 105197 |  | 115269 |
| 1 | Actinobacteriota | 264466 | Actinobacteria | 262184 | Mycobacteriales | 180105 | Mycobacteriaceae | 164511 | Pseudomonas | 75072 | Mycobacterium tuberculosis | 93107 |
| 2 | Firmicutes | 229209 | Bacilli | 228404 | Pseudomonadales | 170045 | Pseudomonadaceae | 124741 | Staphylococcus | 74118 | Pseudomonas aeruginosa | 74268 |
| 3 | Bacteroidota | 32879 | Alphaproteobacteria | 57906 | Burkholderiales | 100727 | Burkholderiaceae | 90287 | Streptococcus | 74113 | Staphylococcus aureus | 63582 |
| 4 | Firmicutes_A | 28999 | Bacteroidia | 30909 | Lactobacillales | 94986 | Streptococcaceae | 75486 | Escherichia | 68537 | Streptococcus pneumoniae | 50516 |


In [32]:
lookup = taxonomy_4[5].value_counts().index

In [34]:
lookup

Index(['Mycobacterium', 'Pseudomonas', 'Staphylococcus', 'Streptococcus',
       'Escherichia', 'Burkholderia', 'Streptomyces', 'Pseudomonas_E',
       'Acinetobacter', 'Klebsiella',
       ...
       'Tepidibacter_A', 'UBA6177 sp002422875', 'Pasteurella_A', 'Ac37b',
       'Thermopetrobacter', 'Sediminicola_A', 'UBA6149 sp002423155', 'UBA5941',
       'UBA6147', 'Deferribacter'],
      dtype='object', name=5, length=5352)

In [None]:
for i in lookup:
    pass  # unfinished draft cell; the completed loop is in cell In [49]

In [35]:
thresholds_4_gcfs

|   | dataset_name | AccNo | taxonomy | bgc_id | gcf_id | feature_to_centroid_distance |
|---|---|---|---|---|---|---|
| 0 | mag_uba | DBAD | Bacteria,Proteobacteria,Gammaproteobacteria,Bu... | 1201186 | 1499 | 0.266547 |
| 1 | mag_uba | DBAD | Bacteria,Proteobacteria,Gammaproteobacteria,Bu... | 1201187 | 14583 | 0.097783 |
| 2 | mag_uba | DBAD | Bacteria,Proteobacteria,Gammaproteobacteria,Bu... | 1201188 | 6524 | 0.299472 |
| 3 | mag_uba | DBAD | Bacteria,Proteobacteria,Gammaproteobacteria,Bu... | 1201189 | 29829 | 0.770989 |
| 4 | mag_uba | DBAF | Bacteria,Proteobacteria,Gammaproteobacteria,Ps... | 1201638 | 7661 | 0.544232 |
| ... | ... | ... | ... | ... | ... | ... |
| 1185226 | GEMS | 3300029947_3 | Bacteria,Bacteroidota,Bacteroidia,Bacteroidale... | 1293123 | 63104 | 0.000000 |
| 1185227 | GEMS | 3300029947_7 | Bacteria,Bacteroidota,Bacteroidia,Bacteroidale... | 1266441 | 57198 | 0.454297 |
| 1185228 | GEMS | 3300029947_9 | Bacteria,Bacteroidota,Bacteroidia,Bacteroidale... | 1283834 | 61822 | 0.347687 |
| 1185229 | GEMS | 3300029948_2 | Bacteria,Bacteroidota,Bacteroidia,Flavobacteri... | 1240339 | 16897 | 0.316188 |


In [49]:
# Per genus: number of distinct GCFs, the GCF ids, and the number of BGCs in each GCF
genus_gcf_stats = {}
for i in lookup:
    temp = thresholds_4_gcfs.loc[taxonomy_4[taxonomy_4[5] == i].index, "gcf_id"].value_counts()
    genus_gcf_stats[i] = [len(temp), temp.index, temp.values]

In [50]:
df = pd.DataFrame.from_dict(genus_gcf_stats)

In [59]:
df_t = df.transpose()

In [63]:
df_t = df_t.rename({0:"gcf_count_unique",1:"gcf_index",2:"gcf_count"},axis=1)

In [65]:
df_t.to_csv("reproducibility_results.csv",index_label="Genus_Taxonomy")

In [71]:
df_t_sorted = df_t.sort_values("gcf_count_unique",ascending=False)

In [72]:
df_t_sorted.to_csv("reproducibility_results_sorted.csv",index_label="Genus_Taxonomy")
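If only the number of distinct GCFs per genus is needed, a `groupby` with `nunique` could replace the explicit loop over `lookup`. The sketch below runs on a hypothetical toy table with the same `taxonomy` and `gcf_id` columns as `thresholds_4_gcfs`; it does not keep the per-GCF id and count lists that the loop stores.

```python
import pandas as pd

# Hypothetical stand-in for the merged table thresholds_4_gcfs.
toy_merged = pd.DataFrame(
    {
        "taxonomy": [
            "Bacteria,P1,C1,O1,F1,GenusA,GenusA sp1",
            "Bacteria,P1,C1,O1,F1,GenusA,GenusA sp1",
            "Bacteria,P1,C1,O1,F1,GenusA,GenusA sp2",
            "Bacteria,P2,C2,O2,F2,GenusB,GenusB sp1",
        ],
        "gcf_id": [1, 1, 2, 3],
    }
)

# Genus is assumed to be rank 5 of the taxonomy string, as in the cells above.
genus = toy_merged["taxonomy"].str.split(",", expand=True).fillna("")[5]

summary = (
    toy_merged.groupby(genus)["gcf_id"]
    .nunique()
    .rename("gcf_count_unique")
    .rename_axis("Genus_Taxonomy")
    .sort_values(ascending=False)
)
print(summary)
```

On a table with more than a million rows, a single grouped pass is likely to be considerably faster than filtering the dataframe once per genus.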

In [77]:
df_t_sorted[1:12]

|   | gcf_count_unique | gcf_index | gcf_count |
|---|---|---|---|
| Streptomyces | 8703 | `Index([57228, 49265, 45969, 42802, 883, 925...` | `[1061, 766, 682, 654, 591, 534, 507, 491, 443,...` |
| Pseudomonas_E | 1517 | `Index([25095, 22671, 37850, 58987, 9257, 2787...` | `[4282, 2073, 2048, 1498, 1445, 1395, 1330, 120...` |
| Nocardia | 1421 | `Index([51953, 9257, 32744, 17269, 57228, 5000...` | `[173, 128, 76, 62, 60, 55, 54, 53, 48, 47, 46,...` |
| Micromonospora | 1089 | `Index([49696, 57228, 40767, 12206, 20197, 2830...` | `[215, 182, 152, 108, 88, 87, 86, 85, 81, 77, 7...` |
| Amycolatopsis | 964 | `Index([57228, 9131, 5897, 0, 56542, 4926...` | `[64, 52, 38, 35, 34, 33, 30, 30, 30, 29, 29, 2...` |
| Mycobacterium | 862 | `Index([42326, 43972, 64462, 58399, 61011, 6144...` | `[6586, 6578, 6498, 6489, 6482, 6328, 6266, 598...` |
| Kitasatospora | 859 | `Index([ 9604, 42802, 57228, 49265, 13199, 1037...` | `[53, 46, 32, 30, 24, 23, 22, 22, 19, 18, 17, 1...` |
| Streptococcus | 782 | `Index([ 37, 4762, 6957, 1986, 14723, 3390...` | `[6284, 4914, 4582, 3994, 3817, 3079, 3031, 206...` |
| Rhodococcus | 736 | `Index([51953, 9257, 32744, 17269, 10317, 710...` | `[365, 363, 266, 235, 202, 143, 121, 115, 98, 9...` |
| Mycolicibacterium | 724 | `Index([51953, 54664, 59999, 51460, 5897, 445...` | `[342, 276, 257, 179, 166, 127, 95, 92, 76, 63,...` |


# END OF ADDITIONAL EFFORTS

In [None]:
# thresholds_4_gcfs: np.ndarray = np.array(STable2.gcf_id.unique())
# thresholds_5_gcfs: np.ndarray = np.array(STable3.gcf_id.unique())
# thresholds_6_gcfs: np.ndarray = np.array(STable4.gcf_id.unique())
# thresholds_7_gcfs: np.ndarray = np.array(STable5.gcf_id.unique())
# bgc_ids: DataFrame = STable1.bgc_ids.str.split(",", expand=True)
# bgc_ids = (bgc_ids.apply(pd.to_numeric, downcast="unsigned"))
# bgc_ids[["dataset_name", "AccNo", "taxonomy"]] = STable1[["dataset_name", "AccNo", "taxonomy"]]
# STable2["bgc_id"] = STable2["bgc_id"].astype(np.int64)
# STable2.index = STable2["bgc_id"]
# STable2["taxonomy"] = ""
# from pandarallel import pandarallel
# import os
# pandarallel.initialize(nb_workers=os.cpu_count())
# def add_taxonomy_to_table(gcfs_array: np.ndarray, stable: DataFrame) -> None:
#     # print(type(gcfs_array))
#     # print(gcfs_array[5])
#     for gcf in gcfs_array:
#         # print(f"gcf{gcf}\n")
#         bgc_ids_of_gcf: np.ndarray = np.array(STable2[STable2.gcf_id == gcf]["bgc_id"])
#         for bgc_id in bgc_ids_of_gcf:
#             # print(f"bgc_id{bgc_id}\n")
#             # print(f"type(bgc_id{type(bgc_id)}\n")
#             bgc_tax = bgc_ids[bgc_ids.eq(bgc_id).any(axis="columns")]["taxonomy"].iloc[0]
#             stable.loc[bgc_id, "taxonomy"] = bgc_tax
#add_taxonomy_to_table = np.vectorize(add_taxonomy_to_table)

#add_taxonomy_to_table(gcfs_array = thresholds_4_gcfs[-10:], stable = STable2)