# ChemInformant User Manual

Welcome to `ChemInformant`! This manual walks you from basic queries to advanced data analysis workflows and shows how the library simplifies working with the PubChem database.

### Core Features at a Glance

- **Analysis-Ready**: Core functions return clean Pandas DataFrames, ready for immediate analysis.
- **Out-of-the-Box Robustness**: Comes with built-in caching, smart rate-limiting, and automatic retries, requiring zero configuration.
- **Dual-Layer API**: Offers simple convenience functions for quick lookups and a powerful engine for high-performance batch operations.
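
To make the robustness features listed above more concrete, here is a minimal, standard-library-only sketch of the general pattern: caching combined with retries and exponential backoff. It is not ChemInformant's actual implementation. The names `fetch_with_retry`, `cached_lookup`, and `flaky_source` are hypothetical and exist only for this illustration.

```python
import time
from functools import lru_cache

# Hypothetical helper: NOT part of ChemInformant, only an illustration of the pattern.
def fetch_with_retry(func, *args, retries=3, base_delay=0.01):
    """Call func(*args), retrying with exponential backoff on ConnectionError."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)  # wait doubles after each failure

calls = {"count": 0}

def flaky_source(name):
    """Stand-in for a remote service that fails on its first two calls."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionError("temporary failure")
    return f"record for {name}"

@lru_cache(maxsize=None)
def cached_lookup(name):
    """Cache successful results so repeated lookups skip the remote call."""
    return fetch_with_retry(flaky_source, name)

first = cached_lookup("aspirin")   # two failed attempts, then success
second = cached_lookup("aspirin")  # served from the cache, no new remote call
print(first, "| remote calls made:", calls["count"])
```

Because only successful results are cached, a lookup that exhausts its retries can be attempted again later.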

### Guide Overview

1.  [**Installation & Setup**](#Installation-&-Setup)
2.  [**Quick Start: Master the Core in Five Minutes**](#Quick-Start:-Master-the-Core-in-Five-Minutes)
3.  [**The Convenience API: The Art of Quick Lookups**](#The-Convenience-API:-The-Art-of-Quick-Lookups)
4.  [**Batch Data Retrieval & Analysis**](#Batch-Data-Retrieval-&-Analysis)
5.  [**Advanced Applications: Solving Real-World Problems**](#Advanced-Applications:-Solving-Real-World-Problems)
6.  [**Data Export: Sharing Your Results**](#Data-Export:-Sharing-Your-Results)

## Installation & Setup

First, make sure ChemInformant is installed. To run every example in this manual, we recommend installing it with the `all` extra used in the command below, which is intended to cover the plotting and analysis dependencies.

In [None]:
# Run this in a notebook cell; in a terminal, drop the leading "!".
!pip install "ChemInformant[all]"

In [None]:
# Import all necessary libraries
import ChemInformant as ci
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from IPython.display import display, Image

# Configure display options for the best visualization experience
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)
sns.set_theme(style="whitegrid", context="talk")

print("ChemInformant loaded successfully!")

## Quick Start: Master the Core in Five Minutes

The most powerful feature of `ChemInformant` is the `get_properties` function. It allows you to go from a list of compound names to an analysis-ready data table in a single line of code.

In [None]:
# 1. Define the compounds and properties you are interested in
identifiers = ["aspirin", "caffeine", "paracetamol"]
properties = ["molecular_weight", "xlogp", "molecular_formula", "cas"]

# 2. ✨ Get all data with a single function call!
df = ci.get_properties(identifiers, properties)

# 3. The result is immediately available for analysis
print("DataFrame returned by ChemInformant:")
display(df)

## The Convenience API: The Art of Quick Lookups

For everyday, quick lookups of single properties, ChemInformant provides a series of clearly named convenience functions.

### 3.1 Using Convenience Functions

These functions follow the intuitive `get_<property>()` pattern.

In [None]:
compound = "ibuprofen"

# Basic properties
print(f"Molecular weight : {ci.get_weight(compound)} g/mol")
print(f"Formula          : {ci.get_formula(compound)}")
print(f"CAS RN           : {ci.get_cas(compound)}")
print(f"IUPAC name       : {ci.get_iupac_name(compound)}")
print(f"LogP (XLogP)     : {ci.get_xlogp(compound)}")

# SMILES
print(f"\nCanonical SMILES : {ci.get_canonical_smiles(compound)}")

# 2-D structure
print("\n2-D structure:")
ci.draw_compound(compound)

# Compound object
obj = ci.get_compound(compound)
print(f"\nPubChem CID  : {obj.cid}")
print(f"PubChem URL  : {obj.pubchem_url}")

### 3.2 Support for Multiple Identifier Types

All convenience and core functions support various types of inputs. ChemInformant resolves them for you automatically in the background.

In [None]:
# Support for Name, CID, SMILES, etc.
test_identifiers = [
    "aspirin",                        # Common Name
    "acetylsalicylic acid",           # Chemical Name
    2244,                             # PubChem CID
    "CC(=O)OC1=CC=CC=C1C(=O)O"        # SMILES String
]

results = ci.get_properties(test_identifiers, ['molecular_weight'])

print("Query results for different identifier types:")
display(results[['input_identifier', 'cid', 'molecular_weight', 'status']])
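
As a rough intuition for what "resolving identifiers automatically" can involve, the sketch below guesses an identifier's type from its surface form. This is a hypothetical heuristic, not ChemInformant's actual resolution logic, and `guess_identifier_type` is a name invented for this example.

```python
# Hypothetical heuristic, NOT ChemInformant's actual resolution logic.
SMILES_ONLY_CHARS = set("=#()[]@/\\+")

def guess_identifier_type(identifier):
    """Return 'cid', 'smiles', or 'name' using simple surface heuristics."""
    if isinstance(identifier, int):
        return "cid"
    text = str(identifier).strip()
    if text.isdigit():
        return "cid"  # a purely numeric string is treated as a CID
    if " " not in text and any(ch in SMILES_ONLY_CHARS for ch in text):
        return "smiles"  # bond and branch symbols are rare in common names
    return "name"  # everything else falls back to a name lookup

test_identifiers = [
    "aspirin",
    "acetylsalicylic acid",
    2244,
    "CC(=O)OC1=CC=CC=C1C(=O)O",
]
guessed = [guess_identifier_type(x) for x in test_identifiers]
print(guessed)
```

A heuristic this simple has clear limits: a systematic name with parentheses and no spaces would be misread as SMILES, and a short SMILES string such as `CCO` would be treated as a name. A robust resolver needs a fallback strategy rather than a single guess.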

## Batch Data Retrieval & Analysis

The true power of `get_properties` shines when you need to process a large number of compounds.

In [None]:
# 1. Define a list of 12 common drugs
drugs = [
    "aspirin", "ibuprofen", "paracetamol", "naproxen",
    "penicillin", "amoxicillin", "ciprofloxacin",
    "lisinopril", "amlodipine", "metoprolol",
    "atorvastatin", "simvastatin"
]

# 2. Define all properties to retrieve in bulk
props = [
    "molecular_weight", "molecular_formula", "canonical_smiles", 
    "xlogp", "iupac_name", "cas", "synonyms"
]

# 3. Execute the batch query
# ci.setup_cache(backend="memory") # If needed, you can switch to memory cache for quick tests
df_bulk = ci.get_properties(drugs, props)

print(f"Successfully queried: {len(df_bulk[df_bulk['status'] == 'OK'])} / {len(drugs)}")
display(df_bulk.head())
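
For much longer input lists, it can help to remove duplicates and split the work into smaller batches so that partial results can be saved along the way. The sketch below shows one way to do this with the standard library only. Whether manual batching is needed at all is an assumption, since the library may already batch requests internally. The helper `chunked` is hypothetical.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

names = [
    "aspirin", "ibuprofen", "aspirin", "naproxen",
    "ibuprofen", "caffeine", "paracetamol",
]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_names = list(dict.fromkeys(names))
batches = list(chunked(unique_names, 2))
print(batches)
```

Each batch could then be passed to a separate query, with the resulting tables concatenated afterwards.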

### Data Analysis and Visualization

The returned DataFrame can be directly fed into data analysis and visualization pipelines.

In [None]:
# 1. Data cleaning and preparation
analysis_df = df_bulk[df_bulk['status'] == 'OK'].copy()
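# errors='coerce' turns any non-numeric value into NaN instead of raising
analysis_df['molecular_weight'] = pd.to_numeric(analysis_df['molecular_weight'], errors='coerce')
analysis_df['xlogp'] = pd.to_numeric(analysis_df['xlogp'], errors='coerce')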

# 2. Calculate descriptive statistics
print("Descriptive Statistics of Drug Properties:")
display(analysis_df[['molecular_weight', 'xlogp']].describe().round(2))

# 3. Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))

sns.histplot(analysis_df['molecular_weight'], kde=True, ax=ax1)
ax1.set_title('Distribution of Molecular Weights')

sns.scatterplot(data=analysis_df, x='molecular_weight', y='xlogp', s=100, ax=ax2)
ax2.set_title('Molecular Weight vs. Lipophilicity (XLogP)')

plt.tight_layout()
plt.show()

## Advanced Applications: Solving Real-World Problems

This section showcases two advanced use cases, demonstrating the value of `ChemInformant` in practical research scenarios.

### Case 1: Drug-Likeness Assessment (Lipinski's Rule of Five)

In [None]:
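# A simplified check using two of Lipinski's four criteria (MW <= 500, LogP <= 5);
# hydrogen-bond donor and acceptor counts are not evaluated here.
# Note: rows with a missing xlogp compare as False and are flagged as not drug-like.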
def check_lipinski(df):
    df['is_druglike'] = (df['molecular_weight'] <= 500) & (df['xlogp'] <= 5)
    return df

analysis_df = check_lipinski(analysis_df)
print(f"Drug-likeness analysis result: {analysis_df['is_druglike'].sum()}/{len(analysis_df)} compounds passed the rule.")

# Visualize the results
plt.figure(figsize=(10, 8))
sns.scatterplot(data=analysis_df, x='molecular_weight', y='xlogp', 
                hue='is_druglike', style='is_druglike', 
                s=150, palette={True: 'green', False: 'red'})
plt.axvline(x=500, color='grey', linestyle='--', label='MW = 500')
plt.axhline(y=5, color='grey', linestyle='--', label='XLogP = 5')
plt.title('Drug-like Chemical Space & Lipinski Rule Assessment')
plt.legend()
plt.show()
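
The cell above checks only molecular weight and LogP. The full Rule of Five also limits hydrogen-bond donors (at most 5) and acceptors (at most 10), and a compound is commonly regarded as drug-like if it breaks no more than one of the four criteria. The sketch below applies all four using plain dictionaries. The compounds and their values are made up for illustration and are not PubChem data, and the field names (`mw`, `logp`, `hbd`, `hba`) are assumptions rather than ChemInformant column names.

```python
# Illustrative values only: these are made-up compounds, not PubChem data.
compounds = [
    {"name": "compound_a", "mw": 320.0, "logp": 2.1, "hbd": 2, "hba": 5},
    {"name": "compound_b", "mw": 610.5, "logp": 6.3, "hbd": 3, "hba": 9},
    {"name": "compound_c", "mw": 480.2, "logp": 5.4, "hbd": 1, "hba": 7},
]

def lipinski_violations(c):
    """Count how many of the four Rule of Five criteria a compound breaks."""
    checks = [
        c["mw"] > 500,   # molecular weight above 500 Da
        c["logp"] > 5,   # partition coefficient (logP) above 5
        c["hbd"] > 5,    # more than 5 hydrogen-bond donors
        c["hba"] > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)

# Accept compounds with at most one violation.
report = {c["name"]: lipinski_violations(c) for c in compounds}
druglike = [name for name, n in report.items() if n <= 1]
print(report)
print(druglike)
```

Counting violations instead of returning a single boolean makes it easy to see which compounds are borderline.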

### Case 2: Clustering Similar Drugs with Machine Learning

In [None]:
# 1. Prepare features for clustering
features_df = analysis_df[['molecular_weight', 'xlogp']].dropna()
features_scaled = StandardScaler().fit_transform(features_df)

# 2. Cluster using K-Means algorithm (finding 3 clusters)
kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
analysis_df.loc[features_df.index, 'cluster'] = kmeans.fit_predict(features_scaled)

# 3. Visualize the clustering results
plt.figure(figsize=(14, 9))
sns.scatterplot(data=analysis_df, x='molecular_weight', y='xlogp',
                hue='cluster', palette='viridis', s=200,
                style='input_identifier', markers=True, legend='full')
plt.title('Drug Clustering based on Physicochemical Properties')
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
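
Scaling matters here because K-Means relies on Euclidean distance: molecular weight spans hundreds of units while XLogP spans only a few, so without scaling the weight would dominate the clustering. `StandardScaler` centers each feature at zero and divides by its population standard deviation. The sketch below reproduces that arithmetic with the standard library. The numbers are illustrative only, and `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in values]

# Illustrative numbers only, not PubChem data.
weights = [150.0, 200.0, 350.0, 500.0]  # spans hundreds of units
logp = [1.0, 2.0, 3.0, 4.0]             # spans a few units

z_weights = standardize(weights)
z_logp = standardize(logp)
print([round(z, 2) for z in z_weights])
print([round(z, 2) for z in z_logp])
```

After standardization both features have mean 0 and unit variance, so each contributes comparably to the distance calculation.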

## Data Export: Sharing Your Results

Finally, all data retrieved and processed with `ChemInformant` can be easily exported into various common formats.

In [None]:
from datetime import datetime

# 1. Export to a CSV file
csv_filename = f"drug_properties_{datetime.now().strftime('%Y%m%d')}.csv"
analysis_df.to_csv(csv_filename, index=False)
print(f"✓ Data successfully saved to: {csv_filename}")

# 2. Export to an Excel file (with multiple sheets)
with pd.ExcelWriter('drug_analysis.xlsx') as writer:
    analysis_df.to_excel(writer, sheet_name='Raw Data', index=False)
    analysis_df[['molecular_weight', 'xlogp']].describe().to_excel(writer, sheet_name='Summary Stats')
print("✓ Excel file successfully created with 'Raw Data' and 'Summary Stats' sheets.")

# 3. Prepare a SMILES file for other cheminformatics tools
smiles_df = analysis_df[['canonical_smiles', 'input_identifier']].dropna()
smiles_df.to_csv('compounds_smiles.smi', sep='\t', index=False, header=False)
print(f"✓ SMILES file (compounds_smiles.smi) successfully created with {len(smiles_df)} compounds.")
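
The `.smi` file written above has no header and stores one compound per line: the SMILES string, a tab, then the name. As a quick check of that layout, the sketch below parses text in the same shape using only the standard library. The sample line reuses the aspirin SMILES from earlier in this manual, and the in-memory string stands in for the real file.

```python
import csv
import io

# Stand-in for the contents of compounds_smiles.smi: SMILES, a tab, then a name.
smi_text = "CC(=O)OC1=CC=CC=C1C(=O)O\taspirin\n"

records = []
for row in csv.reader(io.StringIO(smi_text), delimiter="\t"):
    if not row:
        continue  # skip blank lines
    smiles, name = row[0], row[1]
    records.append({"smiles": smiles, "name": name})

print(records)
```

To read the actual file, replace `io.StringIO(smi_text)` with `open("compounds_smiles.smi", newline="")`.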

---
## Further Information

This manual has demonstrated how `ChemInformant` can be a powerful and reliable tool in your cheminformatics toolbox. We encourage you to explore its features further and connect with us through the following channels:

- **Project Homepage & Source Code:** [https://github.com/HzaCode/ChemInformant](https://github.com/HzaCode/ChemInformant)
- **Bugs & Feature Requests:** [GitHub Issues](https://github.com/HzaCode/ChemInformant/issues)