# Analysis: PubChem Fallback Classification Results

This notebook parses the log file `../20260116.log` to check whether every CID that logs "Falling back to PubChem classification" ends up classified as a non-carbohydrate (`is_carb=False`).

In [1]:
import re
from collections import defaultdict
import pandas as pd

## Load and Parse Log File

In [2]:
# Read the log file
log_file_path = '../20260116.log'

with open(log_file_path, 'r') as f:
    log_lines = f.readlines()

print(f"Total lines in log: {len(log_lines):,}")

Total lines in log: 75,444


## Extract CIDs with PubChem Fallback

In [3]:
# Pattern to match "Falling back to PubChem classification" with CID
fallback_pattern = re.compile(r'CID (\d+): Falling back to PubChem classification')

# Pattern to match final classification result
result_pattern = re.compile(r'CID (\d+): Successfully processed \(is_carb=(True|False)\)')

# Find all CIDs that fall back to PubChem
fallback_cids = set()
for line in log_lines:
    match = fallback_pattern.search(line)
    if match:
        cid = match.group(1)
        fallback_cids.add(cid)

print(f"Number of CIDs with PubChem fallback: {len(fallback_cids):,}")

Number of CIDs with PubChem fallback: 1,377
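
To see what the two patterns capture, here is a minimal sketch that runs them over a few made-up log lines. The timestamp and `INFO` prefix, the CID values, and the `parse_line` helper are assumptions for illustration; only the message text after `CID` follows the patterns in the cell above.

```python
import re

fallback_pattern = re.compile(r'CID (\d+): Falling back to PubChem classification')
result_pattern = re.compile(r'CID (\d+): Successfully processed \(is_carb=(True|False)\)')

# Hypothetical log lines: the prefix format and CID values are invented.
sample_lines = [
    "2026-01-16 10:00:00 INFO CID 1: Falling back to PubChem classification",
    "2026-01-16 10:00:01 INFO CID 1: Successfully processed (is_carb=False)",
    "2026-01-16 10:00:02 INFO CID 2: Some unrelated message",
]


def parse_line(line):
    """Return ('fallback', cid), ('result', cid, is_carb), or None."""
    match = fallback_pattern.search(line)
    if match:
        return ("fallback", match.group(1))
    match = result_pattern.search(line)
    if match:
        # The captured group is the string 'True' or 'False', not a bool.
        return ("result", match.group(1), match.group(2) == "True")
    return None


parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

Note that the CID is captured as a string, which is why the later cells compare and store CIDs as strings throughout.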


## Extract Classification Results

In [4]:
# Extract classification results for all CIDs
classification_results = {}
for line in log_lines:
    match = result_pattern.search(line)
    if match:
        cid = match.group(1)
        is_carb = match.group(2) == 'True'
        classification_results[cid] = is_carb

print(f"Total CIDs with classification results: {len(classification_results):,}")

Total CIDs with classification results: 1,446
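
Because `classification_results` is a plain dict, a CID that logs more than one result keeps only the last value. Below is a small sketch of how one might check for conflicting results, using toy `(cid, is_carb)` pairs rather than the real log; `collect_results` is a hypothetical helper, not part of the pipeline.

```python
from collections import defaultdict


def collect_results(pairs):
    """Group every logged is_carb value by CID instead of keeping only the last."""
    seen = defaultdict(set)
    for cid, is_carb in pairs:
        seen[cid].add(is_carb)
    return seen


# Toy data: CID "3" is logged twice with different outcomes.
pairs = [("1", False), ("2", True), ("1", False), ("3", True), ("3", False)]

seen = collect_results(pairs)
conflicting = sorted(cid for cid, values in seen.items() if len(values) > 1)
print(f"CIDs with conflicting results: {conflicting}")
```

An empty `conflicting` list on the real log would mean the last-value-wins behavior of the dict is harmless here.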


## Analyze Fallback CIDs Classification

In [5]:
# Check classification results for fallback CIDs
fallback_results = {}
for cid in fallback_cids:
    if cid in classification_results:
        fallback_results[cid] = classification_results[cid]

# Count True vs False (True counts as 1 when summed)
carb_count = sum(fallback_results.values())
non_carb_count = len(fallback_results) - carb_count
# Guard against an empty result set so the percentage never divides by zero
non_carb_pct = non_carb_count / len(fallback_results) * 100 if fallback_results else 0.0

print(f"\nFallback CIDs with results: {len(fallback_results):,}")
print(f"Classified as carbohydrate (True): {carb_count:,}")
print(f"Classified as non-carbohydrate (False): {non_carb_count:,}")
print(f"\nPercentage classified as non-carbohydrate: {non_carb_pct:.2f}%")


Fallback CIDs with results: 1,376
Classified as carbohydrate (True): 3
Classified as non-carbohydrate (False): 1,373

Percentage classified as non-carbohydrate: 99.78%
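
The fallback set holds 1,377 CIDs, yet only 1,376 of them have a result, so one fallback CID has no matching result line. A minimal sketch of how the gap could be listed with a set difference follows; `split_by_result` is a hypothetical helper and the CIDs are toy values.

```python
def split_by_result(fallback_cids, classification_results):
    """Separate fallback CIDs into those with a logged result and those without."""
    with_result = {
        cid: classification_results[cid]
        for cid in fallback_cids
        if cid in classification_results
    }
    # CIDs are strings, so sort numerically for readable output.
    without_result = sorted(set(fallback_cids) - set(classification_results), key=int)
    return with_result, without_result


# Toy data: CID "33" fell back but never received a result.
fallback = {"11", "22", "33"}
results = {"11": False, "22": True, "44": False}

with_result, without_result = split_by_result(fallback, results)
print(f"Fallback CIDs without a result: {without_result}")
```

In the notebook, calling `split_by_result(fallback_cids, classification_results)` should return a one-element list for the second value, given the counts above.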


## Check if ALL Fallback CIDs are Non-Carbohydrates

In [6]:
# Check if all fallback CIDs are classified as False
all_false = all(not is_carb for is_carb in fallback_results.values())

print(f"\nAre ALL fallback CIDs classified as non-carbohydrates? {all_false}")

if not all_false:
    print("\nCIDs with fallback that ARE classified as carbohydrates:")
    carb_fallback_cids = [cid for cid, is_carb in fallback_results.items() if is_carb]
    for cid in carb_fallback_cids:
        print(f"  CID {cid}")


Are ALL fallback CIDs classified as non-carbohydrates? False

CIDs with fallback that ARE classified as carbohydrates:
  CID 5460026
  CID 19233
  CID 439357
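
To investigate why the three CIDs above came out as carbohydrates, it may help to pull every log line that mentions a given CID. This is a minimal sketch, assuming each relevant line contains `CID <number>:` as in the patterns used earlier; `lines_for_cid` is a hypothetical helper and the sample lines are invented.

```python
import re


def lines_for_cid(log_lines, cid):
    """Return every log line that mentions exactly this CID, newline stripped."""
    # The trailing colon keeps "CID 12:" from matching "CID 123:".
    pattern = re.compile(rf"CID {re.escape(str(cid))}:")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


# Invented sample lines; the "INFO" prefix is an assumption.
sample_lines = [
    "INFO CID 12: Falling back to PubChem classification\n",
    "INFO CID 123: Falling back to PubChem classification\n",
    "INFO CID 12: Successfully processed (is_carb=True)\n",
]

history = lines_for_cid(sample_lines, "12")
for line in history:
    print(line)
```

In the notebook this could be run as `lines_for_cid(log_lines, "19233")` for each of the three CIDs listed above.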


## Summary Statistics

In [7]:
# Create a summary dataframe
summary_data = {
    'Metric': [
        'Fallback CIDs with results',
        'Classified as carbohydrate',
        'Classified as non-carbohydrate',
        'Percentage non-carbohydrate'
    ],
    'Value': [
        len(fallback_results),
        carb_count,
        non_carb_count,
        f"{non_carb_count/len(fallback_results)*100:.2f}%"
    ]
}

summary_df = pd.DataFrame(summary_data)
print("\n=== Summary ===")
print(summary_df.to_string(index=False))


=== Summary ===
                        Metric  Value
    Fallback CIDs with results   1376
    Classified as carbohydrate      3
Classified as non-carbohydrate   1373
   Percentage non-carbohydrate 99.78%


## Detailed Analysis: Sample Fallback Cases

In [8]:
# Show the first few fallback CIDs in iteration order (an arbitrary subset, not a random sample)
sample_size = min(10, len(fallback_results))
sample_cids = list(fallback_results.items())[:sample_size]

print(f"\nSample of {sample_size} fallback CIDs:")
for cid, is_carb in sample_cids:
    status = "Carbohydrate" if is_carb else "Non-carbohydrate"
    print(f"  CID {cid}: {status}")


Sample of 10 fallback CIDs:
  CID 11907: Non-carbohydrate
  CID 1571307: Non-carbohydrate
  CID 247831: Non-carbohydrate
  CID 240: Non-carbohydrate
  CID 56773938: Non-carbohydrate
  CID 170530: Non-carbohydrate
  CID 23638286: Non-carbohydrate
  CID 97176: Non-carbohydrate
  CID 45359587: Non-carbohydrate
  CID 44715815: Non-carbohydrate


## Conclusion

In [9]:
print("\n" + "="*70)
print("CONCLUSION")
print("="*70)

if all_false:
    print("✓ CONFIRMED: All CIDs that fall back to PubChem classification")
    print("  are classified as non-carbohydrates (is_carb=False).")
else:
    print("✗ NOT CONFIRMED: Some CIDs that fall back to PubChem classification")
    print("  ARE classified as carbohydrates (is_carb=True).")
    print(f"\n  {carb_count} out of {len(fallback_results)} fallback CIDs are carbohydrates.")

print("="*70)


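======================================================================
CONCLUSION
======================================================================
✗ NOT CONFIRMED: Some CIDs that fall back to PubChem classification
  ARE classified as carbohydrates (is_carb=True).

  3 out of 1376 fallback CIDs are carbohydrates.
======================================================================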
