# FathomNet Python API
Experiments with the [fathomnet-py](https://github.com/fathomnet/fathomnet-py) client-side API.

In [None]:
%pip install --user fathomnet 

In [10]:
import fathomnet
from fathomnet.api import boundingboxes
import json
import requests
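import urllib.parse  # Needed for urllib.parse.quote; a bare `import urllib` doesn't guarantee the submodule is loaded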
import enum

In [4]:
concepts = boundingboxes.find_concepts()
print(len(concepts))

2119


## Mapping from concepts to Aphia IDs
Currently, all of the FathomNet data is classified by **concepts**. These labels range from scientific names to common names, and represent everything from organisms to equipment.

Some examples:
> - Tetrorchis erythrogaster
> - Parthenopidae
> - marine organism
> - tire

For the ML Challenge, we are strictly interested in biological organisms, so we need to filter out uncategorized groups. This is where the **World Register of Marine Species (WoRMS)** comes in!

### WoRMS Lookup

WoRMS holds a searchable directory of marine organisms, organized by taxonomy. Each taxon, at every level from kingdom to species, is labeled with an **AphiaID**, a unique integer. We'll look up each concept name and try to find a matching AphiaID.

You can read more about the WoRMS API from [the documentation](https://www.marinespecies.org/rest/), but for the ID lookup we'll start with two operations.

```/AphiaIDByName/{ScientificName}```
*Get the AphiaID for a given name.*

Sometimes a name matches more than one AphiaID, for example when classifications conflict. In that case we'll look up all of the matching records and keep the first one whose status is `accepted`, falling back to the first record if none is.

```/AphiaRecordsByName/{ScientificName}```
*Get one or more matching (max. 50) AphiaRecords for a given name*
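
As a rough illustration of the selection step described above, the sketch below picks an AphiaID out of a list of records shaped like an `AphiaRecordsByName` response. The helper name `pick_accepted_id` and the sample records are made up for illustration; only the `status` and `AphiaID` field names come from the AphiaRecord format.

```python
from typing import Optional


def pick_accepted_id(records: list) -> Optional[int]:
    """Return the AphiaID of the first accepted record, else the first record's ID.

    `records` is assumed to be a list of dicts with "status" and "AphiaID" keys.
    """
    if not records:
        return None
    for record in records:
        if record.get("status") == "accepted":
            return record.get("AphiaID")
    # No accepted record: fall back to the first match.
    return records[0].get("AphiaID")


# Hypothetical records, not real WoRMS data.
sample_records = [
    {"AphiaID": 111, "status": "unaccepted"},
    {"AphiaID": 222, "status": "accepted"},
]
print(pick_accepted_id(sample_records))
```

Keeping the selection separate from the HTTP request makes it easy to check without calling the WoRMS service.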



*Note:*
*The lookup as written is very lossy: to cope with misspellings and odd formatting, it discards everything in a label except the highest-level name. That's fine for our application, because we only need phylum-level organization (more on this later), but it may not suit other uses!*

In [59]:
from typing import Optional


def get_worms_id(scientific_name: str) -> Optional[int]:
    """Return the AphiaID for a name, or None if no match is found."""
    parsed_name = urllib.parse.quote(scientific_name)
    url = "https://www.marinespecies.org/rest/AphiaIDByName/" + parsed_name
    response = requests.get(url)
    if response.status_code == 200:  # Successful
        return response.json()

    if response.status_code == 206:
        # Multiple matches, so get the first accepted ID if one exists.
        url = "https://www.marinespecies.org/rest/AphiaRecordsByName/" + parsed_name
        response = requests.get(url)
        if response.status_code == 200:  # Successful
            records = response.json()
            for record in records:
                if record["status"] == "accepted":
                    return record["AphiaID"]
            # No accepted record, so fall back to the first match.
            if records:
                return records[0]["AphiaID"]

    # No match, or an unexpected response.
    return None


In [None]:
# Sort out species that have exact matches in WoRMS.
concepts = set(concepts)
matched_concepts_to_id: dict[str, int] = dict()  # AphiaIDs are integers
matched_concepts = set()
unmatched_concepts = set()

In [32]:
# Iterate over a snapshot, because each concept is removed from the set once it has been looked up.
for new_concept in list(concepts):
    aphia_id = get_worms_id(new_concept)
    concepts.remove(new_concept)
    if aphia_id is None:
        print("{}: NO MATCH".format(new_concept))
        unmatched_concepts.add(new_concept)
    else:
        print("{}: {}".format(new_concept, aphia_id))
        matched_concepts.add(new_concept)
        matched_concepts_to_id[new_concept] = aphia_id

print("\nMatched Concepts: {}".format(len(matched_concepts)))
print("\nUnmatched Concepts: {}".format(len(unmatched_concepts)))

Desbruyeresia: 391517
tire: NO MATCH
Caryophyllia/Javania: NO MATCH
Caenopedina pulchella: 456763
Tetrorchis erythrogaster: 117868
Hollardia goslinei: 281083
Diastobranchus capensis: 158656
Thrissacanthias penicillatus: 292833
Appendicularia: NO MATCH
Actinernus: 100691
Myroconger gracilis: 281607
inner filter: NO MATCH
Amphianthus sp.: NO MATCH
Clio: 137751
Galiteuthis phyllura: 341807
Tetractinellida: 597812
Cirroteuthis: 153091
Asteronyx: 123578
Bolocera: 100698
paragon: NO MATCH
Iridogorgia bella: 286152
Midwater Respirometry System: NO MATCH
Pseudosagitta maxima: 105445
marine organism: NO MATCH
Parthenopidae: 106761
Acanthamunnopsis milleri: 258647
Narella hypsocalyx: 719480
Cyclothone pallida: 127288
Victorgorgia alba: 1045634
Clausophyidae: 135337
Munidopsis recta: 392592
Echiura "mucus tube": NO MATCH
Archeterokrohnia docrickettsae: 742233
Laqueus: 235265
Funiculina-Halipteris complex: NO MATCH
Phyllodocida: 892
Hydractiniidae: 1601
Lamprogrammus brunswigi: 159133
Acanthascina

Example concepts that were unmatched:
tire
Group/Group
Genus (right)
Genus "subspecies"
Genus sp. 1
salp detritus
can
wood fall experiment
Cirrata "egg"
GenusnaME

Additional filters to run on currently unmatched groups:
- Ignore non-capitalized classifications
- Set all characters to lowercase
- Ignore any characters after first space or slash
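
The three filters above can be sketched as one small normalization helper. This is only an illustration: the function name `normalize_concept` is made up, and reading "set all characters to lowercase" as "lowercase everything after the first letter" is an assumption, not something the second-pass cell below actually does.

```python
import re
from typing import Optional


def normalize_concept(label: str) -> Optional[str]:
    """Reduce a FathomNet label to a single candidate name, or None to skip it."""
    # Keep only the text before the first space or slash.
    first_token = re.split(r"[ /]", label.strip(), maxsplit=1)[0]
    # Labels that don't start with a capital letter are treated as non-taxonomic.
    if not first_token or not first_token[0].isupper():
        return None
    # Lowercase everything after the first letter (assumed interpretation).
    return first_token[0] + first_token[1:].lower()


for label in ["Caryophyllia/Javania", "Amphianthus sp.", "GenusnaME", "tire"]:
    print("{}: {}".format(label, normalize_concept(label)))
```

Returning `None` for skipped labels keeps them in the unmatched set, so they still end up in the final JSON with a null ID.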

In [None]:
sorted_matches = list(matched_concepts)
sorted_matches.sort()
print(sorted_matches)

sorted_unmatched = list(unmatched_concepts)
sorted_unmatched.sort()
print(sorted_unmatched)

In [64]:
# Second-pass filter

unmatched_concepts_list = list(unmatched_concepts)
for new_concept in unmatched_concepts_list:
    # Ignore characters after space or slash
    formatted_concept = new_concept.split(" ")[0]
    formatted_concept = formatted_concept.split("/")[0]

    # Check that the first character is capitalized (and that anything is left at all)
    if formatted_concept and formatted_concept[0].isupper():
        # Search for this term on WoRMS
        aphia_id = get_worms_id(formatted_concept)
        if aphia_id is None:
            print("{}: NO MATCH".format(new_concept))
        else:
            print("{}: {}".format(new_concept, aphia_id))
            unmatched_concepts.remove(new_concept)
            matched_concepts.add(new_concept)
            matched_concepts_to_id[new_concept] = aphia_id

    else:
        print("{}: IGNORED".format(new_concept))

print("\nMatched Concepts: {}".format(len(matched_concepts)))
print("\nUnmatched Concepts: {}".format(len(unmatched_concepts)))

Octopodinae: NO MATCH
kelp holdfast: IGNORED
tire: IGNORED
inner filter: IGNORED
paragon: IGNORED
detrital aggregate: IGNORED
Midwater Respirometry System: NO MATCH
sheet flow: IGNORED
BED: NO MATCH
marine organism: IGNORED
Medusae: NO MATCH
whale carcass: IGNORED
Funiculina-Halipteris complex: NO MATCH
ADCP: NO MATCH
Sebastomus complex: NO MATCH
bottle: IGNORED
plastic: IGNORED
Tanyostea: NO MATCH
manipulator: IGNORED
medusa carcass: IGNORED
trash: IGNORED
DiplacanthopomaA: NO MATCH
cf. Hansenothuria sp.: IGNORED
TorquaratoridaeB sp. 1: NO MATCH
DiplacanthopomaB: NO MATCH
plastic bag: IGNORED
Eye-in-the-Sea: NO MATCH
Macon: NO MATCH
swing arm: IGNORED
bottle-2: IGNORED
cable spool: IGNORED
Vitreosalpa: NO MATCH
salp detritus: IGNORED
rope: IGNORED
wood fall experiment: IGNORED
trap: IGNORED
boulder: IGNORED
net: IGNORED
narella sp.: IGNORED
dover sole: IGNORED
can: IGNORED
Homerpro: NO MATCH
sinker: IGNORED
Tomopterid eggcase: NO MATCH
Neptunea-Buccinum Complex: NO MATCH
polychaete tu

In [65]:
unmatched_concepts

{'2G Robotics structured light laser',
 '55-gallon drum',
 'ADCP',
 'Actinaria',
 'BED',
 'Bassogigas1',
 'Bassozetus1',
 'Benthic Respiration System',
 'Benthic Rover',
 'Chrysogorgidae',
 'Ctenophore',
 'DeepPIV 1.0',
 'DeepPIV 2.0',
 'DeepPIV 3.0',
 'Detritus Sampler',
 'DiplacanthopomaA',
 'DiplacanthopomaB',
 'DiplacanthopomaC',
 'Doliolenetta',
 'Dye Injector',
 'Eye-in-the-Sea',
 'Fecampiid eggcase',
 'Funiculina-Halipteris complex',
 'Homerpro',
 'Hydrate Synthesis Chamber',
 'Hydromedusae',
 'Ink Dispenser',
 'Krill molt',
 'LRJ complex',
 'Lagrangian sediment trap',
 'Larval Sampler',
 'Laser Raman',
 'Leptocephalus-2',
 'Macon',
 'Medusae',
 'Midwater Respirometry System',
 'Neptunea-Buccinum Complex',
 'Octopodinae',
 'Phyllospadix-Zostera detritus',
 'Push Corer',
 'Roundnose grenadier',
 'Sebastomus complex',
 'Solmunaegina nematophora',
 'Sonardyne beacon',
 'Suction Sampler',
 'Tanyostea',
 'Temperature Gradient Probe',
 'Teuthoidea',
 'Theudoidea',
 'Tomopterid eggcase

In [66]:
full_matches = dict(matched_concepts_to_id)

for concept in unmatched_concepts:
    full_matches[concept] = None

with open("../data/concept_to_aphia_id.json", 'w') as fp:
    json.dump(full_matches, fp, sort_keys=True)

## Mapping from IDs to groups
You can skip to this portion if you already have the `concept_to_aphia_id.json` file!

We're going to look up each AphiaID and find the corresponding **AphiaRecord**, which includes information about the ID's position in the taxonomic tree. From there, we'll map it to a group for classification.

The classifications requested for the MATE ML Challenge are:
> - Annelids
> - Arthropods
> - Cnidarians
> - Echinoderms
> - Mollusca
> - Porifera
> - Other Invertebrates
> - Vertebrates: Fishes
> - Unidentified Biology

The astute marine biology student may notice that most of these categories are **phyla** (singular **phylum**), which are broad categories we use to organize organisms. The phylum *Echinodermata*, for instance, includes sea stars, sea cucumbers, sea urchins, and more.

However, some of these categories aren't phyla at all! For example, fishes (and humans) belong to the phylum *Chordata*, but so do invertebrates like tunicates (sea squirts), which fall under 'Other Invertebrates'. This makes our categorization slightly less straightforward.

In [7]:
from typing import Optional

# Load in our previous map from our concept names to Aphia IDs.
concept_to_id: dict[str, Optional[int]]
concept_to_group: dict[str, Optional[str]] = dict()

with open("../data/concept_to_aphia_id.json", 'r') as fp:
    concept_to_id = json.load(fp)


**AphiaRecord:**
Here's an example of what comes in an AphiaRecord:
```
{
  "AphiaID": 274506,
  "url": "https://www.marinespecies.org/aphia.php?p=taxdetails&id=274506",
  "scientificname": "Liparis catharus",
  "authority": "Vogt, 1973",
  "status": "accepted",
  "unacceptreason": null,
  "taxonRankID": 220,
  "rank": "Species",
  "valid_AphiaID": 274506,
  "valid_name": "Liparis catharus",
  "valid_authority": "Vogt, 1973",
  "parentNameUsageID": 126160,
  "kingdom": "Animalia",
  "phylum": "Chordata",
  "class": "Actinopteri",
  "order": "Perciformes",
  "family": "Liparidae",
  "genus": "Liparis",
  "citation": "Froese, R. and D. Pauly. Editors. (2022). FishBase. Liparis catharus Vogt, 1973. Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=274506 on 2022-05-19",
  "lsid": "urn:lsid:marinespecies.org:taxname:274506",
  "isMarine": 1,
  "isBrackish": 0,
  "isFreshwater": 0,
  "isTerrestrial": 0,
  "isExtinct": null,
  "match_type": "exact",
  "modified": "2008-01-15T17:27:08.177Z"
}
```

We'll mostly be interested in the phylum and class information, which we'll get using this operation:

```
/AphiaRecordByAphiaID/{ID}
```
*Get the complete AphiaRecord for a given AphiaID*
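
To make the "phylum and class information" concrete, here is a minimal sketch that pulls the taxonomic lineage out of a record. The helper name `lineage` and the `RANKS` tuple are illustrative; the field names and values are taken from the example AphiaRecord above, trimmed to the fields used here.

```python
# Rank fields in the order they appear in the example AphiaRecord.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


def lineage(record: dict) -> list:
    """Return the named ranks in a record, from kingdom down, skipping missing ones."""
    return [record[rank] for rank in RANKS if record.get(rank)]


# A trimmed copy of the example record above.
example_record = {
    "AphiaID": 274506,
    "scientificname": "Liparis catharus",
    "kingdom": "Animalia",
    "phylum": "Chordata",
    "class": "Actinopteri",
    "order": "Perciformes",
    "family": "Liparidae",
    "genus": "Liparis",
}
print(" > ".join(lineage(example_record)))
```

Using `record.get(rank)` skips ranks that are absent or null, which is how this sketch assumes higher-level taxa are represented.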

In [19]:
from typing import Optional


class OrganismClass(enum.Enum):
    ANNELIDA = "annelida"
    ARTHROPODA = "arthropoda"
    CNIDARIA = "cnidaria"
    ECHINODERMATA = "echinodermata"
    MOLLUSCA = "mollusca"
    PORIFERA = "porifera"
    OTHER_INVERTEBRATES = "other-invertebrates"
    VERTEBRATES_FISHES = "fish"
    UNIDENTIFIED = "unidentified"

# Phyla that correspond directly to one of the challenge categories.
phylum_to_class = {
    "Annelida": OrganismClass.ANNELIDA,
    "Arthropoda": OrganismClass.ARTHROPODA,
    "Cnidaria": OrganismClass.CNIDARIA,
    "Echinodermata": OrganismClass.ECHINODERMATA,
    "Mollusca": OrganismClass.MOLLUSCA,
    "Porifera": OrganismClass.PORIFERA
}

def get_record_from_aphia_id(aphia_id: int) -> Optional[dict]:
    """Fetch the AphiaRecord for an AphiaID, or return None if the request fails."""
    url = "https://www.marinespecies.org/rest/AphiaRecordByAphiaID/{}".format(aphia_id)
    response = requests.get(url)
    if response.status_code == 200:  # Successful
        return response.json()
    else:
        return None


def class_from_record(aphia_record: dict) -> OrganismClass:
    """Returns the classification of an organism based on the AphiaRecord.
    """

    if aphia_record["kingdom"] != "Animalia":
        # Non-animal classifications should be treated as unidentified.
        return OrganismClass.UNIDENTIFIED

    if aphia_record["phylum"] in phylum_to_class:  # Easily matched!
        return phylum_to_class[aphia_record["phylum"]]

    else:
        # Organism is other, fish, or unidentified.
        if aphia_record["phylum"] == "Chordata":
            # Check if class is either in subphylum Tunicata (tunicates) or Cephalochordata (lancelets)
            # Unfortunately the AphiaRecord doesn't give us subphylum so we'll have to go a step further to check.
            tunicata_classes = ["Appendicularia", "Ascidiacea", "Larvacea", "Sorberacea", "Thaliacea"]
            cephalochordate_classes = ["Leptocardii"]

            if aphia_record["class"] in tunicata_classes or aphia_record["class"] in cephalochordate_classes:
                return OrganismClass.OTHER_INVERTEBRATES

            # We'll also check to make sure our organism is *actually* a fish!
            fish_classes = ["Actinopteri", "Cladistii", "Coelacanthi", "Dipneusti", "Elasmobranchii", "Holocephali", "Myxini", "Petromyzonti"]
            # We also add in a weird exception for Scorpaeniformes (scorpionfishes), because they don't have a class.
            if aphia_record["class"] in fish_classes or aphia_record["order"] == "Scorpaeniformes":
                return OrganismClass.VERTEBRATES_FISHES

            # There are also labels in FathomNet for classifications like "Actinopterygii", which technically would slip through these earlier
            # checks. We'll do additional checks for these names too.
            fish_categories = ["Vertebrata", "Agnatha", "Cyclostomi", "Gnathostomata", "Chondrichthyes", "Osteichthyes", "Actinopterygii", "Sarcopterygii"]
            if aphia_record["scientificname"] in fish_categories:
                return OrganismClass.VERTEBRATES_FISHES

            # If we still can't find a category, we'll go ahead and return it as unidentified.
            return OrganismClass.UNIDENTIFIED
        else:
            # Anything that's not a chordate is an invertebrate.
            return OrganismClass.OTHER_INVERTEBRATES
    


In [21]:
class_from_record(get_record_from_aphia_id(10194))

<OrganismClass.VERTEBRATES_FISHES: 'fish'>

In [23]:
# Convert concepts to groups using the Aphia IDs
for concept, aphia_id in concept_to_id.items():
    print("{}: ".format(concept), end="")
    organism_class = None
    if aphia_id:  # Check that it is not None
        record = get_record_from_aphia_id(aphia_id)
        if record is not None:  # The lookup returns None on a non-200 response
            organism_class = class_from_record(record)

    # Fail gracefully: concepts without a classification are stored as None.
    concept_to_group[concept] = organism_class.value if organism_class else None
    print(concept_to_group[concept])


2G Robotics structured light laser: None
55-gallon drum: None
ADCP: None
Abraliopsis (Boreabraliopsis) felis: mollusca
Abyssoberyx: fish
Abyssocladia: porifera
Abyssocladia lakwollii: porifera
Abyssocucumis abyssorum: echinodermata
Abyssopathes: cnidaria
Abyssopathes lyra: cnidaria
Acanella: cnidaria
Acanella dispar: cnidaria
Acanella weberi: cnidaria
Acanthacaris tenuimana: arthropoda
Acanthamunnopsis: arthropoda
Acanthamunnopsis milleri: arthropoda
Acanthascinae: porifera
Acanthascinae sp. 1-4 complex: porifera
Acanthascinae sp. 2: porifera
Acanthascinae sp. 4: porifera
Acanthascus: porifera
Acanthephyra: arthropoda
Acanthephyra eximia: arthropoda
Acanthephyra sp.: arthropoda
Acanthogorgia: cnidaria
Acanthogorgia sp.: cnidaria
Acanthonus armatus: fish
Acanthopathes: cnidaria
Acanthoptilum: cnidaria
Acanthurus xanthopterus: fish
Acesta: mollusca
Acesta mori: mollusca
Acesta sphoni: mollusca
Actinaria: None
Actinauge verrillii: cnidaria
Actinernus: cnidaria
Actinernus nobilis: cnidaria

In [None]:
with open("../data/concept_to_group.json", 'w') as fp:
    json.dump(concept_to_group, fp, sort_keys=True)