# 1. Filtering

This notebook contains the code for filtering the reprocessed organism selection data.

## 1.1 Setup
First, load the necessary libraries.
The `zoogletools` package must be installed before running this notebook. Install it in editable mode by running the following command from the top level of this GitHub repository.

```bash
pip install -e .
```

In [1]:
import arcadia_pycolor as apc

import zoogletools as zt

DATA_DIR = "../data/2025-04-21-os-portal-reprocessed/"



## 1.2 Organism parameters

These parameters select the organisms whose hits will be filtered, the color palette used for each organism's funnel plot, and the version suffix added to the output filenames.

In [2]:
# Choose the organisms to filter.
organisms = [
    "Ciona-intestinalis",
    "Salpingoeca-rosetta",
]

# For each organism, choose a color palette for the filtering funnel plot.
organism_color_palettes = {
    "Ciona-intestinalis": apc.Gradient(
        name="yellows",
        colors=apc.palettes.yellow_shades.colors,
    )
    .resample_as_palette(7)
    .colors,
    "Salpingoeca-rosetta": apc.gradients.blues.resample_as_palette(6).colors,
}

# Choose a version suffix for the filtered hits and funnel plot files.
version_suffix = "v3.2"
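
The version suffix ends up in the output filenames that Section 1.3 builds with f-strings. The sketch below shows the same naming scheme collected into one helper. `build_output_paths` and its `figures_dir` argument are hypothetical names introduced here for illustration and are not part of `zoogletools`. The helper also creates the figures directory up front, on the assumption that the plotting code might not create it.

```python
from pathlib import Path


def build_output_paths(organism: str, version_suffix: str, figures_dir: str = "figures") -> dict:
    """Hypothetical helper: build the output paths for one organism.

    The filename patterns mirror the f-strings used in Section 1.3.
    """
    figures = Path(figures_dir)
    # Create the figures directory if it is missing (assumption: the
    # plotting function may not create it on its own).
    figures.mkdir(parents=True, exist_ok=True)
    return {
        "hits": Path(f"{organism}_filtered_hits_{version_suffix}.tsv"),
        "funnel": figures / f"{organism}_filtering_funnel_plot_{version_suffix}.html",
    }


paths = build_output_paths("Ciona-intestinalis", "v3.2")
print(paths["hits"])
print(paths["funnel"])
```

Changing `version_suffix` changes both filenames together, so the results of an earlier run are not overwritten.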

## 1.3 Running the filtering pipeline

This section runs the filtering pipeline for each organism.
It uses the default filtering pipeline, which is defined in the `zoogletools.filtering` module. You can also define your own filtering pipeline by creating a new `FilteringPipeline` object and adding your own `Filter`s.
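
The general idea of a pipeline built from individual filters can be illustrated with a small standalone sketch. This is not the `zoogletools` API: `SketchFilter`, `SketchPipeline`, the thresholds, and the toy rows are all hypothetical, and only the column names (`pvalue_rowwise`, `percentile`) are taken from this notebook's output tables. Each filter is a named predicate, and the pipeline records how many rows survive each step, which is the kind of count a funnel plot displays.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchFilter:
    """Hypothetical stand-in for a filter: a name plus a keep/drop predicate."""

    name: str
    keep: Callable[[dict], bool]


@dataclass
class SketchPipeline:
    """Hypothetical stand-in for a pipeline: applies its filters in order."""

    filters: list = field(default_factory=list)

    def add(self, new_filter: SketchFilter) -> "SketchPipeline":
        self.filters.append(new_filter)
        return self  # returning self allows chained .add() calls

    def run(self, rows: list) -> tuple:
        # Track the number of surviving rows after every filter.
        counts = {"input": len(rows)}
        for current in self.filters:
            rows = [row for row in rows if current.keep(row)]
            counts[current.name] = len(rows)
        return rows, counts


# Toy rows with invented values; thresholds below are illustrative only.
toy_hits = [
    {"gene_symbol": "GENE_A", "pvalue_rowwise": 0.00, "percentile": 0.004},
    {"gene_symbol": "GENE_B", "pvalue_rowwise": 0.86, "percentile": 0.010},
    {"gene_symbol": "GENE_C", "pvalue_rowwise": 0.01, "percentile": 0.900},
]

pipeline = (
    SketchPipeline()
    .add(SketchFilter("pvalue_rowwise < 0.05", lambda row: row["pvalue_rowwise"] < 0.05))
    .add(SketchFilter("percentile < 0.5", lambda row: row["percentile"] < 0.5))
)
kept, counts = pipeline.run(toy_hits)
print(counts)
print([row["gene_symbol"] for row in kept])
```

The order of the filters matters for the per-step counts but not for the final set of rows, since every filter here is an independent row-level predicate.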

In [3]:
results = {}

for organism in organisms:
    filtered_hits, filter_counts = zt.filtering.filter_organism_proteins(
        target_organism=organism,
        data_dirpath=DATA_DIR,
        disease_data_filepath="../data/2025-04-21-merged-disease-datasets.tsv",
    )

    results[organism] = {
        "filtered_hits": filtered_hits,
        "filter_counts": filter_counts,
    }

    filtered_hits.to_csv(
        f"{organism}_filtered_hits_{version_suffix}.tsv",
        sep="\t",
    )

    fig = zt.plotting.create_funnel_plot(
        filter_counts,
        title=f"{organism} filtering",
        save_filepath=f"figures/{organism}_filtering_funnel_plot_{version_suffix}.html",
        width=550,
        color_list=organism_color_palettes[organism],
    )

    fig.show()

    final_table = zt.filtering.create_final_filtered_sheet(
        filtered_hits,
        f"{organism}_filtered_hits_{version_suffix}.tsv",
    )

    display(final_table)

|  | gene_symbol | disease_names | orphanet_highest_prevalence_class | orphanet_prevalences | human_protein | trait_distance | percentile | pvalue_rowwise | pvalue_colwise | organism_protein | portfolio_rank | zoogle | omim | orphanet | opentargets | marrvel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | IMPA1 | Intellectual disability, autosomal recessive 59 |  |  | P29218 | 0.903045 | 0.004310 | 0.0000 | 0.0170 | F6TKL9 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/602064?search=impa1&hig... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/3612 |
| 1 | FCHO1 | Immunodeficiency 76 |  |  | O14526 | 1.103326 | 0.005747 | 0.0000 | 0.0000 | F6V4B3 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/613437?search=fcho1&hig... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/23149 |
| 2 | NAXE | Encephalopathy, progressive, early-onset, with... | <1/1000000 | 11 cases (Worldwide); <1/1000000 (Worldwide) | Q8NCW5 | 1.321808 | 0.006849 | 0.0000 | 0.0000 | H2Y1D7 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/608862?search=naxe&high... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/128240 |
| 3 | KANK1 | Cerebral palsy, spastic quadriplegic, 2 | <1/1000000 | 17 cases (Worldwide); <1/1000000 (Worldwide) | Q14678 | 1.267654 | 0.007042 | 0.0145 | 0.0000 | F7ALU8 | 2.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/607704?search=kank1&hig... | https://www.orpha.net/en/disease/list/gene/KAN... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/23189 |
| 4 | RNF213 | Moyamoya disease 2 | 1-9/100000 | 1/135,135 (Japan); 1/16,129 (Japan); 1/232,558... | Q63HN8 | 1.101398 | 0.007576 | 0.0000 | 0.0217 | F6PRC5 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/613768?search=rnf213&hi... | https://www.orpha.net/en/disease/list/gene/RNF... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/57674 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 187 | CEP295 | Seckel syndrome 11 |  |  | Q9C0D2 | 4.791067 | 0.909091 | 0.8616 | 0.0427 | H2Y0Y1 | 20.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/617728?search=cep295&hi... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/85459 |
| 188 | FBXO28 | Developmental and epileptic encephalopathy 100 | Unknown | Unknown (Worldwide) | Q9NVF7 | 3.839233 | 0.933333 | 0.8672 | 0.0000 | F7B6K5 | 14.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/609100?search=fbxo28&hi... | https://www.orpha.net/en/disease/list/gene/FBX... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/23219 |
| 189 | NUP88 | Fetal akinesia deformation sequence 4 | 1-9/1000000 | 1/166,667 (Europe); <1/1000000 (Europe) | Q99567 | 5.013818 | 1.000000 | 0.9540 | 0.0000 | F6ZIK8 | 24.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/602552?search=nup88&hig... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/4927 |
| 190 | SUPT7L | FISCHER-ZIRNSAK PROGEROID SYNDROME |  |  | O94864 | 4.295183 | 1.000000 | 0.9504 | 0.0000 | F6ZKQ7 | 20.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/612762?search=supt7l&hi... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/9913 |


|  | gene_symbol | disease_names | orphanet_highest_prevalence_class | orphanet_prevalences | human_protein | trait_distance | percentile | pvalue_rowwise | pvalue_colwise | organism_protein | portfolio_rank | zoogle | omim | orphanet | opentargets | marrvel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CDC14A | Autosomal recessive nonsyndromic hearing loss 32 | Unknown | Unknown (Worldwide) | Q9UNH5 | 1.861257 | 0.004386 | 0.0000 | 0.0085 | F2ULW5 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/603504?search=cdc14a&hi... | https://www.orpha.net/en/disease/list/gene/CDC... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/8556 |
| 1 | DDC | Deficiency of aromatic-L-amino-acid decarboxylase | <1/1000000 | 140 cases (Worldwide); <1/1000000 (Worldwide) | P20711 | 1.124221 | 0.004878 | 0.0000 | 0.0085 | F2UCR6 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/107930?search=ddc&highl... | https://www.orpha.net/en/disease/list/gene/DDC... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/1644 |
| 2 | TNIK | Intellectual disability, autosomal recessive 54 |  |  | Q9UKE5 | 0.779315 | 0.005051 | 0.0000 | 0.0000 | F2USE6 | 1.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/610005?search=tnik&high... | https://www.orpha.net/en/disease/list/gene/TNI... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/23043 |
| 3 | HMGB3 | X-linked colobomatous microphthalmia-microceph... |  |  | O15347 | 0.775353 | 0.009849 | 0.0336 | 0.0000 | F2U4M4 | 15.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/300193?search=hmgb3&hig... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/3149 |
| 4 | CHCHD2 | Parkinson disease 22, autosomal dominant |  |  | Q9Y6H1 | 1.963659 | 0.013333 | 0.0206 | 0.0000 | F2UJH6 | 2.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/616244?search=chchd2&hi... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/51142 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | PTRHD1 | Neurodevelopmental disorder with early-onset p... |  |  | Q6GMV3 | 4.367880 | 0.846154 | 0.8262 | 0.0222 | F2U0Q1 | 33.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/617342?search=ptrhd1&hi... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/391356 |
| 149 | CAMSAP1 | Cortical dysplasia, complex, with other brain ... |  |  | Q5T5Y3 | 5.392030 | 0.894977 | 0.8911 | 0.0000 | F2UGQ1 | 66.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/613774?search=camsap1&h... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/157922 |
| 150 | PHYKPL | Phosphohydroxylysinuria |  |  | Q8IUZ5 | 5.509309 | 0.901754 | 0.9097 | 0.0180 | F2TXT5 | 87.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/614683?search=phykpl&hi... |  | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/85007 |
| 151 | RTTN | Microcephalic primordial dwarfism due to RTTN ... | <1/1000000 | 28 cases (Worldwide); <1/1000000 (Worldwide) | Q86VV8 | 4.682612 | 0.926829 | 0.9008 | 0.0458 | F2UMH2 | 38.0 | https://zoogle.arcadiascience.com/search?gene=... | https://omim.org/entry/610436?search=rttn&high... | https://www.orpha.net/en/disease/list/gene/RTT... | https://platform.opentargets.org/target/ENSG00... | https://marrvel.org/human/gene/25914 |
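
The loop above saves each table with `to_csv(..., sep="\t")`, which by default also writes the row index as a first column with an empty header. The sketch below shows one way to read such a file back with pandas. It uses an in-memory buffer and a toy table with invented values in place of the real TSV files, so it is an illustration of the round trip and not a record of how these outputs were produced.

```python
import io

import pandas as pd

# Toy stand-in for a filtered-hits table; the values are invented.
toy_hits = pd.DataFrame(
    {
        "gene_symbol": ["GENE_A", "GENE_B"],
        "portfolio_rank": [2.0, 1.0],
    }
)

# Write as TSV the same way the loop does (the index is included by default).
buffer = io.StringIO()
toy_hits.to_csv(buffer, sep="\t")

# Reading without index_col turns the saved index into an "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer, sep="\t")

# Passing index_col=0 restores the saved index instead.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)

print(list(naive.columns))
print(list(restored.columns))

# Example follow-up: pick the best-ranked hit (lowest portfolio_rank).
top_hit = restored.sort_values("portfolio_rank").head(1)
print(top_hit["gene_symbol"].iloc[0])
```

If the index carries no meaning downstream, passing `index=False` to `to_csv` is an alternative that avoids the extra column entirely.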
