# Testing the New Carbohydrate Classification System

This notebook tests the updated classification system, which assigns each compound to one of five classes based on its position in the ChEBI ontology hierarchy:

**Main Classes:**
1. **main carbohydrate group** - Falls under a direct child of CHEBI:16646 (carbohydrate) that itself has more than one child (e.g., monosaccharide, oligosaccharide, polysaccharide)
2. **other carbohydrate** - Under CHEBI:16646 but not in any main group
3. **main carbohydrate derivative group** - Falls under a direct child of CHEBI:63299 (carbohydrate derivative) that itself has more than one child
4. **other carbohydrate derivative** - Under CHEBI:63299 but not in any main group
5. **other** - Under CHEBI:78616 (carbohydrates and carbohydrate derivatives) but not under CHEBI:16646 or CHEBI:63299
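
The five rules above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the package's actual implementation: `main_groups`, `classify`, and the `CHILDREN` map are hypothetical names, the `toy:` identifiers are made-up placeholders rather than real ChEBI terms, and checking CHEBI:16646 before CHEBI:63299 is an assumed precedence. Using the compound's most specific term as the subclass outside the main groups mirrors what the recorded outputs in this notebook show.

```python
CARB = "CHEBI:16646"        # carbohydrate
CARB_DERIV = "CHEBI:63299"  # carbohydrate derivative
CARB_ROOT = "CHEBI:78616"   # carbohydrates and carbohydrate derivatives


def main_groups(parent, children):
    """Direct children of `parent` that themselves have more than one child."""
    return {c for c in children.get(parent, []) if len(children.get(c, [])) > 1}


def classify(ancestors, children):
    """Return (main_class, subclass) for one compound.

    `ancestors` is the compound's lineage, most specific term first.
    `children` maps a term to the list of its direct children.
    """
    lineage = set(ancestors)
    if CARB_ROOT not in lineage:
        return None, None  # not a carbohydrate at all

    branches = (
        (CARB, "main carbohydrate group", "other carbohydrate"),
        (CARB_DERIV, "main carbohydrate derivative group",
         "other carbohydrate derivative"),
    )
    for root, main_label, other_label in branches:  # assumed precedence
        if root in lineage:
            groups = main_groups(root, children)
            hits = [term for term in ancestors if term in groups]
            if hits:
                return main_label, hits[0]
            return other_label, ancestors[0]
    return "other", ancestors[0]


# Toy hierarchy: every "toy:" term is a placeholder, not a real ChEBI entry.
CHILDREN = {
    CARB: ["toy:mono", "toy:single"],
    "toy:mono": ["toy:aldose", "toy:ketose"],  # two children: a main group
    "toy:single": ["toy:leaf"],                # one child: not a main group
    CARB_DERIV: ["toy:deriv"],
    "toy:deriv": ["toy:d1", "toy:d2"],
}

print(classify(["toy:aldose", "toy:mono", CARB, CARB_ROOT], CHILDREN))
print(classify(["toy:leaf", "toy:single", CARB, CARB_ROOT], CHILDREN))
print(classify(["toy:x", CARB_ROOT], CHILDREN))
```

Keeping the children map as a plain dictionary matches the idea of a cached parent-to-children lookup, so the same function could be pointed at real cached data without touching the rules.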

In [1]:
# Import from the installed package
from carbonhydrate_analysis.main import get_compound_info_pubchem

print("Carbohydrate Classification Testing")
print("=" * 60)

Loaded ChEBI cache with 78 entries from /Users/sckuo/Documents/ylclab-ch-carbonhydrate-analysis/data/cache/chebi_children_cache.json
Carbohydrate Classification Testing


## Test 1: Glucose (Expected: main carbohydrate group / monosaccharide)

In [2]:
print("\n" + "=" * 60)
print("TEST 1: Glucose")
print("=" * 60)

glucose = get_compound_info_pubchem("WQZGKKKJIJFFOK-GASJEMHNSA-N")

if glucose:
    print(f"✓ Compound Name: {glucose['name']}")
    print(f"  Formula: {glucose['formula']}")
    print(f"  PubChem CID: {glucose['pubchem_cid']}")
    print(f"\n  Is Carbohydrate: {glucose['is_carbohydrate']}")
    print(f"  Main Class: {glucose['carbohydrate_main_class']}")
    print(f"  Subclass: {glucose['carbohydrate_subclass']}")
    
    print(f"\n  ChEBI Ontology Terms (first 10):")
    for i, term in enumerate(glucose['chebi_ontology'][:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"    {i}. {term_str}")
else:
    print("✗ Failed to retrieve glucose information")


TEST 1: Glucose
Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  WQZGKKKJIJFFOK-GASJEMHNSA-N... -> CID 5793

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
✓ Compound Name: (3R,4S,5S,6R)-6-(hydroxymethyl)oxane-2,3,4,5-tetrol
  Formula: C6H12O6
  PubChem CID: 5793

  Is Carbohydrate: True
  Main Class: main carbohydrate group
  Subclass: monosaccharide

  ChEBI Ontology Terms (first 10):
    1. D-glucopyranose
    2. glucopyranose
    3. glucose
    4. aldohexose
    5. aldose
    6. monosaccharide
    7. carbohydrate
    8. carbohydrates and carbohydrate derivatives
    9. organooxygen compound
    10. organochalcogen compound
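
The term-normalization expression used when printing the ontology is repeated in several cells and could be factored into a helper. The sketch below is an assumption-laden illustration: `term_label` is a hypothetical name, and the shape of the entries in `chebi_ontology` is not documented in this notebook. The recorded output suggests plain strings; the helper also accepts a dict whose `StringWithMarkup` value is either a dict or a list of dicts, since either shape may turn up in PubChem-style records.

```python
def term_label(term):
    """Return a printable label for one ontology entry.

    Accepts a plain string, or a dict carrying the label under
    'StringWithMarkup' (as a dict, or as a list whose first item is a dict).
    Anything else falls back to str(term).
    """
    if isinstance(term, str):
        return term
    if isinstance(term, dict):
        markup = term.get("StringWithMarkup")
        if isinstance(markup, list) and markup:
            markup = markup[0]  # take the first markup entry
        if isinstance(markup, dict) and "String" in markup:
            return markup["String"]
    return str(term)


# Illustrative entries only; not real API output.
sample_terms = ["glucose", {"StringWithMarkup": [{"String": "aldose"}]}]
for i, term in enumerate(sample_terms, 1):
    print(f"    {i}. {term_label(term)}")
```

With such a helper, each test cell's loop body reduces to a single `print` call, and an unexpected entry shape degrades to its string form instead of raising.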


## Test 2: Sucrose (Expected: main carbohydrate group / oligosaccharide or disaccharide)

In [3]:
print("\n" + "=" * 60)
print("TEST 2: Sucrose (Disaccharide)")
print("=" * 60)

sucrose = get_compound_info_pubchem("CZMRCDWAGMRECN-UGDNZRGBSA-N")

if sucrose:
    print(f"✓ Compound Name: {sucrose['name']}")
    print(f"  Formula: {sucrose['formula']}")
    print(f"  PubChem CID: {sucrose['pubchem_cid']}")
    print(f"\n  Is Carbohydrate: {sucrose['is_carbohydrate']}")
    print(f"  Main Class: {sucrose['carbohydrate_main_class']}")
    print(f"  Subclass: {sucrose['carbohydrate_subclass']}")
else:
    print("✗ Failed to retrieve sucrose information")


TEST 2: Sucrose (Disaccharide)
Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  CZMRCDWAGMRECN-UGDNZRGBSA-N... -> CID 5988

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
✓ Compound Name: (2R,3R,4S,5S,6R)-2-[(2S,3S,4S,5R)-3,4-dihydroxy-2,5-bis(hydroxymethyl)oxolan-2-yl]oxy-6-(hydroxymethyl)oxane-3,4,5-triol
  Formula: C12H22O11
  PubChem CID: 5988

  Is Carbohydrate: True
  Main Class: main carbohydrate group
  Subclass: oligosaccharide


## Test 3: Non-carbohydrate compound (Expected: is_carbohydrate=False)

In [4]:
print("\n" + "=" * 60)
print("TEST 3: Benzene (Non-carbohydrate)")
print("=" * 60)

benzene = get_compound_info_pubchem("UHOVQNZJYSORNB-UHFFFAOYSA-N")

if benzene:
    print(f"✓ Compound Name: {benzene['name']}")
    print(f"  Formula: {benzene['formula']}")
    print(f"  PubChem CID: {benzene['pubchem_cid']}")
    print(f"\n  Is Carbohydrate: {benzene['is_carbohydrate']}")
    print(f"  Main Class: {benzene['carbohydrate_main_class']}")
    print(f"  Subclass: {benzene['carbohydrate_subclass']}")
else:
    print("✗ Failed to retrieve benzene information")


TEST 3: Benzene (Non-carbohydrate)
Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  UHOVQNZJYSORNB-UHFFFAOYSA-N... -> CID 241

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
✓ Compound Name: benzene
  Formula: C6H6
  PubChem CID: 241

  Is Carbohydrate: False
  Main Class: None
  Subclass: None


## Test 4: Glyceric acid (Expected: carbohydrate acid - could be main or other carbohydrate)

In [5]:
print("\n" + "=" * 60)
print("TEST 4: Glyceric acid")
print("=" * 60)

glyceric_acid = get_compound_info_pubchem("RBNPOMFGQQGHHO-UHFFFAOYSA-N")

if glyceric_acid:
    print(f"✓ Compound Name: {glyceric_acid['name']}")
    print(f"  Formula: {glyceric_acid['formula']}")
    print(f"  PubChem CID: {glyceric_acid['pubchem_cid']}")
    print(f"\n  Is Carbohydrate: {glyceric_acid['is_carbohydrate']}")
    print(f"  Main Class: {glyceric_acid['carbohydrate_main_class']}")
    print(f"  Subclass: {glyceric_acid['carbohydrate_subclass']}")
    
    print(f"\n  ChEBI Ontology Terms (first 10):")
    for i, term in enumerate(glyceric_acid['chebi_ontology'][:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"    {i}. {term_str}")
else:
    print("✗ Failed to retrieve glyceric acid information")


TEST 4: Glyceric acid
Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  RBNPOMFGQQGHHO-UHFFFAOYSA-N... -> CID 752

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
✓ Compound Name: 2,3-dihydroxypropanoic acid
  Formula: C3H6O4
  PubChem CID: 752

  Is Carbohydrate: True
  Main Class: main carbohydrate group
  Subclass: carbohydrate acid

  ChEBI Ontology Terms (first 10):
    1. glyceric acid
    2. trionic acid
    3. aldonic acid
    4. carbohydrate acid
    5. carboxylic acid
    6. organic acid
    7. organic molecular entity
    8. carbon group molecular entity
    9. p-block molecular entity
    10. main group molecular entity


## Test 5: Agarose (MJQHZNBUODTQTK-WKGBVCLCSA-N)

In [6]:
inchikey = "MJQHZNBUODTQTK-WKGBVCLCSA-N"
agarose = get_compound_info_pubchem(inchikey)

if agarose:
    print(f"Name: {agarose.get('name')}")
    print(f"Formula: {agarose.get('formula')}")
    print(f"PubChem CID: {agarose.get('pubchem_cid')}")
    print(f"\nIs Carbohydrate: {agarose.get('is_carbohydrate')}")
    print(f"Main Class: {agarose.get('carbohydrate_main_class')}")
    print(f"Subclass: {agarose.get('carbohydrate_subclass')}")
    print("\nChEBI Ontology (first 10):")
    for i, term in enumerate(agarose.get('chebi_ontology', [])[:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"  {i}. {term_str}")
else:
    print("Compound not found")

Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  MJQHZNBUODTQTK-WKGBVCLCSA-N... -> CID 11966311

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
Name: (2S,3R,4S,5R,6R)-2-[[(1S,3S,4S,5S,8R)-3-[(2S,3R,4S,5S,6R)-2-[[(1S,3R,4S,5S,8R)-3,4-dihydroxy-2,6-dioxabicyclo[3.2.1]octan-8-yl]oxy]-3,5-dihydroxy-6-(hydroxymethyl)oxan-4-yl]oxy-4-hydroxy-2,6-dioxabicyclo[3.2.1]octan-8-yl]oxy]-6-(hydroxymethyl)oxane-3,4,5-triol
Formula: C24H38O19
PubChem CID: 11966311

Is Carbohydrate: True
Main Class: main carbohydrate group
Subclass: polysaccharide

ChEBI Ontology (first 10):
  1. agarose
  2. polysaccharide
  3. biomacromolecule
  4. organic molecular entity
  5. carbon group molecular entity
  6. p-block molecular entity
  7. main group molecular entity
  8. molecular entity
  9. chemical entity
  10. agarose


## Test 6: Mniopetal C (DSJKYHXDKAFGAJ-MCCJONFTSA-N)

In [7]:
inchikey = "DSJKYHXDKAFGAJ-MCCJONFTSA-N"
mniopetal_c = get_compound_info_pubchem(inchikey)

if mniopetal_c:
    print(f"Name: {mniopetal_c.get('name')}")
    print(f"Formula: {mniopetal_c.get('formula')}")
    print(f"PubChem CID: {mniopetal_c.get('pubchem_cid')}")
    print(f"\nIs Carbohydrate: {mniopetal_c.get('is_carbohydrate')}")
    print(f"Main Class: {mniopetal_c.get('carbohydrate_main_class')}")
    print(f"Subclass: {mniopetal_c.get('carbohydrate_subclass')}")
    print("\nChEBI Ontology (first 10):")
    for i, term in enumerate(mniopetal_c.get('chebi_ontology', [])[:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"  {i}. {term_str}")
else:
    print("Compound not found")

Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  DSJKYHXDKAFGAJ-MCCJONFTSA-N... -> CID 10094677

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
Name: [(3S,3aS,6aS,9S,10R,10aR)-4-formyl-3,10-dihydroxy-7,7-dimethyl-1-oxo-3a,6,6a,8,9,10-hexahydro-3H-benzo[d][2]benzofuran-9-yl] 2-hydroxyoctanoate
Formula: C23H34O8
PubChem CID: 10094677

Is Carbohydrate: True
Main Class: other
Subclass: Mniopetal C

ChEBI Ontology (first 10):
  1. Mniopetal C
  2. carbohydrates and carbohydrate derivatives
  3. organooxygen compound
  4. organochalcogen compound
  5. heteroorganic entity
  6. organic molecular entity
  7. carbon group molecular entity
  8. p-block molecular entity
  9. main group molecular entity
  10. molecular entity


## Test 7: Apulose (KNWXMEXZFWGIPP-UHFFFAOYSA-N)

In [8]:
inchikey = "KNWXMEXZFWGIPP-UHFFFAOYSA-N"
apulose = get_compound_info_pubchem(inchikey)

if apulose:
    print(f"Name: {apulose.get('name')}")
    print(f"Formula: {apulose.get('formula')}")
    print(f"PubChem CID: {apulose.get('pubchem_cid')}")
    print(f"\nIs Carbohydrate: {apulose.get('is_carbohydrate')}")
    print(f"Main Class: {apulose.get('carbohydrate_main_class')}")
    print(f"Subclass: {apulose.get('carbohydrate_subclass')}")
    print("\nChEBI Ontology (first 10):")
    for i, term in enumerate(apulose.get('chebi_ontology', [])[:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"  {i}. {term_str}")
else:
    print("Compound not found")

Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  KNWXMEXZFWGIPP-UHFFFAOYSA-N... -> CID 11019032

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
Name: 1,3,4-trihydroxy-3-(hydroxymethyl)butan-2-one
Formula: C5H10O5
PubChem CID: 11019032

Is Carbohydrate: True
Main Class: other carbohydrate
Subclass: apulose

ChEBI Ontology (first 10):
  1. apulose
  2. carbohydrate
  3. carbohydrates and carbohydrate derivatives
  4. organooxygen compound
  5. organochalcogen compound
  6. heteroorganic entity
  7. organic molecular entity
  8. carbon group molecular entity
  9. p-block molecular entity
  10. main group molecular entity


## Test 8: Adgggg (KMRCGPSUZRGVOV-TXYBRGFCSA-N)

In [9]:
inchikey = "KMRCGPSUZRGVOV-TXYBRGFCSA-N"
adgggg = get_compound_info_pubchem(inchikey)

if adgggg:
    print(f"Name: {adgggg.get('name')}")
    print(f"Formula: {adgggg.get('formula')}")
    print(f"PubChem CID: {adgggg.get('pubchem_cid')}")
    print(f"\nIs Carbohydrate: {adgggg.get('is_carbohydrate')}")
    print(f"Main Class: {adgggg.get('carbohydrate_main_class')}")
    print(f"Subclass: {adgggg.get('carbohydrate_subclass')}")
    print("\nChEBI Ontology (first 10):")
    for i, term in enumerate(adgggg.get('chebi_ontology', [])[:10], 1):
        term_str = term.get('StringWithMarkup', {}).get('String', term) if isinstance(term, dict) else term
        print(f"  {i}. {term_str}")
else:
    print("Compound not found")

Processing single inchikey identifier...
Resolving 1 inchikeys to CIDs...
  KMRCGPSUZRGVOV-TXYBRGFCSA-N... -> CID 195235

Found 1 valid CIDs, fetching properties...
Processing CIDs 1-1 of 1...
Name: (2S,4S,5R,6R)-5-acetamido-2-[(2R,3R,4S,5S,6R)-2-[(2R,3R,4R,5R,6R)-3-acetamido-2,5-dihydroxy-6-(hydroxymethyl)oxan-4-yl]oxy-3,5-dihydroxy-6-(hydroxymethyl)oxan-4-yl]oxy-4-hydroxy-6-[(1R,2R)-1,2,3-trihydroxypropyl]oxane-2-carboxylic acid
Formula: C25H42N2O19
PubChem CID: 195235

Is Carbohydrate: True
Main Class: other
Subclass: Adgggg

ChEBI Ontology (first 10):
  1. Adgggg
  2. carbohydrates and carbohydrate derivatives
  3. organooxygen compound
  4. organochalcogen compound
  5. heteroorganic entity
  6. organic molecular entity
  7. carbon group molecular entity
  8. p-block molecular entity
  9. main group molecular entity
  10. molecular entity


## Summary

Running the cells above exercises the new classification system. For each compound, the system:
1. Checks whether the compound belongs to CHEBI:78616 (carbohydrates and carbohydrate derivatives)
2. Classifies it into one of the five main classes based on the ChEBI hierarchy
3. Assigns a subclass based on the direct children of the main class; in the runs above, compounds outside the main groups report their own ChEBI name as the subclass
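
The results recorded in this notebook can be pinned as a regression check so that later changes to the classifier are caught automatically. The following is a sketch, not part of the package: `EXPECTED`, `find_mismatches`, and `stub_lookup` are hypothetical names, the expected values are copied from the outputs above, and the stub replays those values so the example runs without network access. In the notebook, `get_compound_info_pubchem` would be passed in place of the stub.

```python
# (is_carbohydrate, main class, subclass) as recorded in the outputs above.
EXPECTED = {
    "WQZGKKKJIJFFOK-GASJEMHNSA-N": (True, "main carbohydrate group", "monosaccharide"),
    "CZMRCDWAGMRECN-UGDNZRGBSA-N": (True, "main carbohydrate group", "oligosaccharide"),
    "UHOVQNZJYSORNB-UHFFFAOYSA-N": (False, None, None),
    "RBNPOMFGQQGHHO-UHFFFAOYSA-N": (True, "main carbohydrate group", "carbohydrate acid"),
    "MJQHZNBUODTQTK-WKGBVCLCSA-N": (True, "main carbohydrate group", "polysaccharide"),
    "DSJKYHXDKAFGAJ-MCCJONFTSA-N": (True, "other", "Mniopetal C"),
    "KNWXMEXZFWGIPP-UHFFFAOYSA-N": (True, "other carbohydrate", "apulose"),
    "KMRCGPSUZRGVOV-TXYBRGFCSA-N": (True, "other", "Adgggg"),
}


def find_mismatches(lookup, expected=EXPECTED):
    """Compare lookup(inchikey) against the expected triples.

    Returns a list of (inchikey, expected, actual) tuples; `actual` is None
    when the lookup returned nothing for that compound.
    """
    mismatches = []
    for inchikey, want in expected.items():
        info = lookup(inchikey)
        if not info:
            mismatches.append((inchikey, want, None))
            continue
        got = (
            info.get("is_carbohydrate"),
            info.get("carbohydrate_main_class"),
            info.get("carbohydrate_subclass"),
        )
        if got != want:
            mismatches.append((inchikey, want, got))
    return mismatches


def stub_lookup(inchikey):
    """Offline stand-in that replays the recorded results."""
    is_carb, main_class, subclass = EXPECTED[inchikey]
    return {
        "is_carbohydrate": is_carb,
        "carbohydrate_main_class": main_class,
        "carbohydrate_subclass": subclass,
    }


for inchikey, want, got in find_mismatches(stub_lookup):
    print(f"MISMATCH {inchikey}: expected {want}, got {got}")
```

Returning the mismatches as data rather than asserting inside the loop means a single run reports every compound that has drifted, not just the first.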