# Step-by-step guide

Welcome to this notebook demonstrating the initial use of **RobotU Molkit**, a Python toolkit designed to streamline chemical data processing by integrating AI-powered enrichment and similarity search capabilities.

`robotu-molkit` connects directly to PubChem to fetch and structure raw compound data, then leverages **IBM Granite models** for advanced semantic processing. This includes:

- 🧪 **Molecule Summarization**: Automatically generate clear, concise natural-language summaries of chemical compounds.
- 🧠 **Semantic Embedding Generation**: Transform structured chemical data into high-dimensional embeddings optimized for similarity search.
- 🔍 **Canonical Query Interpretation**: Understand and standardize natural-language queries, enabling smarter search workflows.
- 🔗 **Similarity Search**: Perform deep similarity analysis using both semantic embeddings and structural metrics (e.g., Tanimoto similarity).
- ⚡ **Optional FAST Mode**: Use lightweight embeddings for speed-critical tasks or when computational resources are limited.

This notebook will walk through the core functionalities of the library, from ingesting molecular records to performing semantic embedding and search. Whether you're building a searchable compound database, analyzing structure–activity relationships, or simply exploring AI-enhanced cheminformatics, RobotU Molkit offers a modular and powerful foundation.

> ⚙️ Before running the examples, make sure to install the package using:  
> `pip install robotu-molkit`

In [None]:
!pip install robotu-molkit

## Step 1 – Ingest 91 Molecules from PubChem

The command below uses the `molkit ingest` subcommand to fetch compound records from **PubChem** by CID (Compound ID) and parse them into structured Molecule JSON files ready for enrichment and analysis. The file `molecule_cids.txt` contains 91 CIDs, one per line. You can find the corresponding names in `molecules_cids_with_names.csv`.

```bash
!molkit ingest --file "molecule_cids.txt" --concurrency 2
```

### 🔍 What This Does

- **Downloads** raw 3D compound records from PubChem using the PUG-REST API.
- **Parses** each record into a Molecule JSON object (with schema-compliant fields).
- **Stores** both raw and parsed data in designated folders under `data/`.

### 🛠️ Available Options

| Option             | Description |
|--------------------|-------------|
| `--file`, `-f`     | Path to a text file with **one CID per line**. Useful for bulk ingestion. |
| `--concurrency`, `-c` | Number of parallel workers fetching data (default: `5`). You may reduce this to avoid hitting PubChem’s rate limits (max 5 requests/sec, 400/min). |
| `--raw-dir`, `-r`  | Directory to store raw JSON files from PubChem (default: `data/downloaded_data`). |
| `--parsed-dir`, `-p` | Directory to store parsed Molecule JSON payloads (default: `data/parsed`). |
| `CID(s)` (positional) | You can also pass one or more CIDs directly as arguments instead of using `--file`. |

> ⚠️ Note: Either `--file` or positional CID(s) must be provided. If neither is supplied, the CLI exits with an error.
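Before handing a CID file to `molkit ingest`, it can help to check that it really holds one numeric CID per line. The sketch below is a standalone helper, not part of `robotu-molkit`; the function name `parse_cids` and the choice to skip `#` comment lines are assumptions made for illustration.

```python
from typing import Iterable, List


def parse_cids(lines: Iterable[str]) -> List[int]:
    """Return unique positive integer CIDs in first-seen order.

    Blank lines and lines starting with '#' are skipped. The comment
    convention is an assumption of this sketch, not a documented feature.
    """
    seen = set()
    cids = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text or text.startswith("#"):
            continue
        # Accept plain ASCII digits only, and reject zero.
        if not (text.isascii() and text.isdigit()) or int(text) <= 0:
            raise ValueError(f"line {line_no}: not a valid CID: {text!r}")
        cid = int(text)
        if cid not in seen:
            seen.add(cid)
            cids.append(cid)
    return cids


if __name__ == "__main__":
    sample = ["2244\n", "5957\n", "\n", "2519\n", "2244\n"]
    print(parse_cids(sample))
```

To check a real file, pass an open file handle, for example `parse_cids(open("molecule_cids.txt"))`.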

### Example with Direct CIDs

```bash
!molkit ingest 2244 5957 2519 --concurrency 3
```

### Rate Limits and Reliability

To respect PubChem’s API limits:
- Maximum **5 requests per second**
- Maximum **400 requests per minute**
- Each request has a timeout of **30 seconds**

Using `--concurrency 2` is a safe option that balances performance and reliability, especially when fetching a large number of compounds.
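If you fetch from PubChem in your own scripts, the two limits above can be enforced with a small sliding-window limiter. This is a minimal sketch of that general technique, not the throttling logic `molkit ingest` uses internally; the class and function names are hypothetical, and the code is not thread-safe as written.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` acquisitions in any `period`-second window."""

    def __init__(self, max_calls: int, period: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps: Deque[float] = deque()

    def acquire(self) -> None:
        """Block until one more call fits in the window, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Wait until the oldest recorded call expires.
            self._sleep(self.period - (now - self._stamps[0]))


# The two limits quoted above: 5 requests/second and 400 requests/minute.
per_second = SlidingWindowLimiter(max_calls=5, period=1.0)
per_minute = SlidingWindowLimiter(max_calls=400, period=60.0)


def throttle() -> None:
    """Call before each request; returns once both limits allow it."""
    per_minute.acquire()
    per_second.acquire()
```

With several worker threads, wrap `acquire` in a `threading.Lock` so the timestamp queue is not modified concurrently.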

---
In the next step, we’ll enrich these molecules with natural-language summaries and semantic embeddings.

In [40]:
!molkit ingest --file "molecule_cids.txt" --concurrency 2

INFO: Starting ingest of 91 CIDs...
Procesado registro CID 2519
Procesado registro CID 5429
Procesado registro CID 2153
Procesado registro CID 4534
Procesado registro CID 4744
Procesado registro CID 3758
Procesado registro CID 89594
Procesado registro CID 8378
Procesado registro CID 7028
Procesado registro CID 118943
Procesado registro CID 5816
Procesado registro CID 439260
Procesado registro CID 6041
Procesado registro CID 4236
Procesado registro CID 2244
Procesado registro CID 3672
Procesado registro CID 1983
Procesado registro CID 156391
Procesado registro CID 3033
Procesado registro CID 3825
Procesado registro CID 3715
Procesado registro CID 2662
Procesado registro CID 54676228
Procesado registro CID 3306
Procesado registro CID 3394
Procesado registro CID 4781
Procesado registro CID 4055
Procesado registro CID 4485
Procesado registro CID 4783
Procesado registro CID 4044
Procesado registro CID 5345
Procesado registro CID 5120
Procesado registro CID 3826
Procesado registro CID 681
Pr

## 🔐 Configuring Watsonx Credentials

Store your IBM Watsonx credentials locally using:

```bash
!molkit config --watsonx-api-key "your-api-key" --watsonx-project-id "your-project-id"
```

Credentials are stored at:
`~/.config/molkit/config.json`


These credentials are then globally available to all `molkit` commands in your environment.

---

## 🧠 Runtime Credential Resolution

If you need to resolve credentials dynamically (e.g. inside a script or notebook), use the `CredentialsManager` utility. It checks in the following order:

1. **CLI overrides**  
2. **Environment variables** (`IBM_API_KEY`, `IBM_PROJECT_ID`, `WATSONX_URL`)  
3. **Local config file** (`~/.config/molkit/config.json`)  
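The precedence above can be pictured as "first non-empty value wins". The sketch below illustrates that idea only; it is not the `CredentialsManager` implementation, and the function name `resolve_setting` and the config key `api_key` are hypothetical.

```python
from typing import Mapping, Optional


def resolve_setting(name: str,
                    cli_value: Optional[str],
                    env: Mapping[str, str],
                    env_var: str,
                    config: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty value: CLI override, then env, then config."""
    for candidate in (cli_value, env.get(env_var), config.get(name)):
        if candidate:
            return candidate
    return None


if __name__ == "__main__":
    env = {"IBM_API_KEY": "from-env"}
    config = {"api_key": "from-config"}  # hypothetical key name
    print(resolve_setting("api_key", None, env, "IBM_API_KEY", config))
```

In a real script you would pass `os.environ` as `env` and the parsed contents of the config file as `config`.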

---

### ✅ Usage Guide – `CredentialsManager`

```python
from molkit.auth import CredentialsManager

# Load credentials
api_key, project_id = CredentialsManager.load()

# Load Watsonx service URL
watsonx_url = CredentialsManager.get_watsonx_url()

# Set or update values in ~/.config/molkit/config.json
CredentialsManager.set_api_key("your-api-key")
CredentialsManager.set_project_id("your-project-id")
CredentialsManager.set_watsonx_url("https://your.custom.url")
```

This is the preferred way to **inject or retrieve credentials programmatically** during embedding or search workflows.

In [19]:
!molkit config   --watsonx-api-key WATSONX_API_KEY   --watsonx-project-id WATSONX_PROJECT_ID

Credentials saved to /home/user/.config/molkit/config.json


## Step 2 – Generate Semantic Embeddings with IBM Watsonx

Once the molecules are parsed from PubChem, the next step is enriching each entry using advanced semantic embedding models provided by **IBM Granite**. The `molkit embed` command generates concise natural-language summaries and semantic embeddings for each parsed molecule.

```bash
!molkit embed --parsed-dir "data/parsed" --out-dir "data/vectors" --fast
```

### 🔍 What Happens During Embedding?

- **Reads** each structured Molecule JSON from the `parsed-dir` folder.
- **Generates**:
  - A concise natural-language summary using the Granite Instruct model (`ibm/granite-3-8b-instruct`).
  - High-dimensional semantic embeddings to support similarity searches.
- **Stores** the summaries and embeddings in a JSONL file (by default at `data/vectors/watsonx_vectors.jsonl`).
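A JSONL file holds one JSON object per line, so the output can be inspected with the standard library alone. The sketch below assumes each record carries `cid`, `summary`, and `embedding` fields; those field names are guesses for illustration, so check them against your own file.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Made-up records; the real file's field names may differ.
    sample = [
        '{"cid": 2519, "summary": "example text", "embedding": [0.1, 0.2]}',
        "",
        '{"cid": 2244, "summary": "example text", "embedding": [0.3, 0.4]}',
    ]
    for record in iter_jsonl(sample):
        print(record["cid"], len(record["embedding"]))
```

To read the real output, open the file and pass the handle: `with open("data/vectors/watsonx_vectors.jsonl") as fh: records = list(iter_jsonl(fh))`.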

---

## 🛠️ CLI Options for `molkit embed`

| Option                        | Description |
|-------------------------------|-------------|
| `--parsed-dir`, `-p`          | Directory containing parsed molecule JSON files (**default**: `data/parsed`). |
| `--out-dir`, `-o`             | Directory to store the generated embeddings (**default**: `data/vectors`). |
| `--model`, `-m`               | Embedding model ID. (**default**: `ibm/granite-embedding-278m-multilingual`). |
| `--fast`                      | Use the faster, smaller model (`ibm/granite-embedding-107m-multilingual`). |
| `--watsonx-api-key`, `-k`     | IBM Watsonx API key (overrides environment or configuration settings). |
| `--watsonx-project-id`, `-j`  | IBM Watsonx Project ID (overrides environment or configuration settings). |
| `--watsonx-url`               | Watsonx API endpoint URL (**default**: `https://us-south.ml.cloud.ibm.com`). |

> ⚠️ **Important**: IBM credentials are required and must be provided through the configuration file, environment variables, or explicitly via the `--watsonx-api-key` and `--watsonx-project-id` flags.

---

## 🎯 Choosing the Right Model

### ✅ Default Mode (`ibm/granite-embedding-278m-multilingual`)  
- **Embedding dimensions:** 768  
- **Advantages:**  
  - Higher precision and greater semantic accuracy.  
  - Ideal for detailed chemical analysis, precise molecular similarity searches, drug discovery, and predictive modeling tasks.  
- **Recommended for chemists** who require detailed semantic insights and maximum embedding quality.

### ⚡ FAST Mode (`ibm/granite-embedding-107m-multilingual`)  
- **Embedding dimensions:** 384  
- **Advantages:**  
  - Lower computational cost and faster execution times.  
  - Best suited for quick exploratory analyses, high-throughput scenarios, or resource-constrained environments.  
- **Recommended for initial testing or prototyping**.

---

## 🖥️ Example Commands

**Default model usage (recommended for accuracy):**

```bash
!molkit embed
```

**FAST model usage (recommended for speed):**

```bash
!molkit embed --fast
```

---

## 🧠 What’s Next?

The resulting `watsonx_vectors.jsonl` file contains a summary and a high-dimensional embedding for each molecule, ready for **local vector search** using `faiss-cpu`.


In [20]:
!molkit embed --watsonx-url="https://us-south.ml.cloud.ibm.com"

Embedding model: ibm/granite-embedding-278m-multilingual
✅  Embeddings written to data/vectors/watsonx_vectors.jsonl


## 🔎 Local Semantic Search

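With your molecule embeddings in place, you can now run semantic queries against a local FAISS index built with `faiss-cpu`. The index lookup runs on your machine; enriching and embedding the query still calls the Granite models through Watsonx.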

We’ll leverage the `LocalSearch` class to:

- Enrich your natural-language query via Granite Instruct  
- Generate a matching embedding with the same model used earlier  
- Perform a similarity search against the FAISS index  
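The retrieval step can be pictured as "over-fetch, filter, truncate": take the nearest neighbours, drop anything below a similarity cutoff, and keep the best few. The sketch below shows that pattern with a brute-force cosine scan in plain Python standing in for the FAISS index; it is not the `LocalSearch` implementation, and all names in it are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: Sequence[float],
           index: Dict[int, Sequence[float]],
           fetch_k: int,
           threshold: float,
           top_k: int) -> List[Tuple[int, float]]:
    """Over-fetch `fetch_k` neighbours, drop those below `threshold`, keep `top_k`."""
    scored = sorted(((cid, cosine(query, vec)) for cid, vec in index.items()),
                    key=lambda pair: pair[1], reverse=True)
    candidates = scored[:fetch_k]  # stands in for the nearest-neighbour lookup
    relevant = [(cid, score) for cid, score in candidates if score >= threshold]
    return relevant[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" keyed by made-up IDs.
    toy_index = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
    print(search([1.0, 0.0], toy_index, fetch_k=3, threshold=0.5, top_k=2))
```

Fetching more neighbours than you intend to return leaves room for later filters to discard candidates without emptying the result list.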

Let’s execute our first query.  


In [21]:
from robotu_molkit.search.searcher import LocalSearch
from robotu_molkit.constants import DEFAULT_JSONL_FILE_ROUTE

In [22]:
# Path to the JSONL file with precomputed embeddings and metadata
JSONL_PATH    = DEFAULT_JSONL_FILE_ROUTE  # e.g. "data/vectors/watsonx_vectors.jsonl"

# Minimum similarity score (cosine) to consider a hit relevant
SIM_THRESHOLD = 0.70

# After filtering by SIM_THRESHOLD, return this many top results
TOP_K         = 20

# Number of nearest neighbors to fetch from FAISS before filtering
FAISS_K       = 300
# --------------------------------------------------------------------------- #

# Initialize searcher
searcher = LocalSearch(jsonl_path=JSONL_PATH)

### 🔎 Simple Semantic Search Example

In this example, we’ll run a purely semantic query against our local FAISS index of 91 compounds, focusing on central nervous system stimulants.

**Query:**  
> “central nervous system stimulants”

Expected top hits include methylxanthine derivatives like caffeine, theobromine, and theophylline.

In [33]:
# Define the query (no metadata filters in this example)
query_text = "central nervous system stimulants"

# Perform a semantic-only search
results = searcher.search_by_semantics(
    query_text=query_text, top_k=TOP_K, faiss_k=FAISS_K,
)

#### 🔎 Displaying Results as a Table

Format the top semantic-search hits into a plain-text table for clear comparison.

In [34]:
# Prepare table header and rows
header = f"{'CID':<8} {'Name':<20} {'MW':<8} {'Sol':<10} {'Score':<6}"
rows = []
for m, s in results:
    cid  = m['cid']
    name = m.get('name', '<unknown>')
    mw   = m.get('molecular_weight', 0)
    sol  = m.get('solubility_tag', '')
    rows.append(f"{cid:<8} {name:<20} {mw:<8.1f} {sol:<10} {s:<.3f}")

# Print query and table
print(f"Results for query: \"{query_text}\"\n")
print(header)
print('-' * len(header))
print("\n".join(rows))

Results for query: "central nervous system stimulants"

CID      Name                 MW       Sol        Score 
--------------------------------------------------------
187      acetylcholine        146.2    soluble    0.620
5202     serotonin            176.2    moderately soluble 0.620
5770     reserpine            608.7    insoluble  0.616
245005   aconitine            645.7    sparingly soluble 0.614
439260   norepinephrine       169.2    soluble    0.609
774      histamine            111.1    soluble    0.592
1548943  Capsaicin            305.4    sparingly soluble 0.587
89594    nicotine             162.2    moderately soluble 0.580
1150     tryptamine           160.2    moderately soluble 0.579
239      beta-alanine         89.1     very soluble 0.577
681      dopamine             153.2    soluble    0.574
5184     (-)-Scopolamine      303.4    moderately soluble 0.572
896      Melatonin            232.3    moderately soluble 0.572
7028     PSEUDOEPHEDRINE      165.2    moderat

## 🧠 Result Analysis: Semantic Query – “Central Nervous System Stimulants”

We performed a semantic-only search across a local FAISS index of 91 molecules using the query:

> **"central nervous system stimulants"**

The model ranked molecules based on their conceptual relevance to the query, using no structural constraints.

### 🔬 Observations

- **The returned hits** include caffeine, nicotine, pseudoephedrine, dopamine, serotonin, and epinephrine, all compounds with well-documented **stimulant activity or involvement in neural signaling**.
- **Neurotransmitters** like acetylcholine, norepinephrine, and histamine also appear, reflecting the model’s understanding of their role in CNS excitation.
- Some **biosynthetic precursors** or modulators (e.g., tryptamine, beta-alanine, melatonin) are included due to their indirect relevance in neurophysiology.
- A few results like reserpine, aconitine, or naproxen are **false positives** in this context, even though reserpine and aconitine rank third and fourth in the output above. They likely share textual or pharmacological proximity in the literature.

### ✅ What This Tells Us

- **Granite Instruct** effectively enriched the input query, capturing both direct stimulants and adjacent biochemical actors.
- **Granite Embedding** produced vectors that reflect semantic relationships, retrieving molecules that span neurotransmitters, methylxanthines, and sympathomimetics.
- No structural similarity was enforced — hence the occasional inclusion of pharmacologically unrelated compounds.

This confirms that semantic-only search is useful for **broad hypothesis generation** or **contextual exploration**, especially when the query involves **functional descriptions** instead of precise chemical patterns.

For more refined control, we can now combine semantic signals with structural filters.

## 🧬 Combined Semantic + Structural Search

To improve result specificity, we now combine **semantic similarity** with **structural filtering** using:

> `search_by_semantics_and_structure`

This method first identifies semantically relevant candidates, then refines the list by applying **Tanimoto similarity** against inferred molecular scaffolds.

### What It Does

- 🧠 **Granite Instruct** enriches and interprets the user query to infer one or more scaffold SMILES.
- 🧪 **Granite Embedding** embeds the query to locate semantically similar molecules in the FAISS index.
- 🧬 **Tanimoto filtering** compares structural fingerprints of candidates to the inferred scaffolds and keeps only those above a similarity threshold (e.g. ≥ 0.70).
- ✅ Final results combine **semantic relevance** and **structural alignment**.
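The structural filter can be illustrated with fingerprints represented as sets of "on" bit positions. The sketch below shows Tanimoto similarity and a threshold filter in plain Python. Keeping a candidate when its best score against any scaffold meets the cutoff is an assumption of this sketch, as are the function names; the library computes its fingerprints from real structures, which this toy example does not attempt.

```python
from typing import AbstractSet, Dict, List, Sequence, Tuple


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_by_scaffolds(candidates: Dict[str, AbstractSet[int]],
                        scaffolds: Sequence[AbstractSet[int]],
                        threshold: float) -> List[Tuple[str, float]]:
    """Keep candidates whose best Tanimoto score against any scaffold meets `threshold`."""
    kept = []
    for name, bits in candidates.items():
        best = max((tanimoto(bits, s) for s in scaffolds), default=0.0)
        if best >= threshold:
            kept.append((name, best))
    return kept


if __name__ == "__main__":
    # Made-up bit sets, not real fingerprints.
    scaffold = {1, 2, 3}
    candidates = {"close analogue": {1, 2, 3, 4}, "unrelated": {9}}
    print(filter_by_scaffolds(candidates, [scaffold], threshold=0.45))
```

Because Tanimoto is shared bits divided by total distinct bits, two molecules can be functionally similar and still score near zero, which is why this filter is so much stricter than the semantic ranking.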

### Why Use It?

This hybrid approach filters out semantically plausible but **structurally unrelated molecules**, giving priority to compounds that are **both functionally and chemically close** to the query.

Perfect for:
- Querying drug analogues
- Scanning scaffold-specific chemotypes
- Prioritizing leads in a known structural class

Let’s try it with the same query:  
> _"central nervous system stimulants"_

This time, we expect to retrieve methylxanthine-like structures with more consistent chemistry.

In [38]:
# Query string
query_text = "central nervous system stimulants"

# Tanimoto cutoff for the structural filter (distinct from SIM_THRESHOLD above)
TANIMOTO_THRESHOLD = 0.45

# Run combined semantic + structural search
results = searcher.search_by_semantics_and_structure(
    query_text=query_text,
    top_k=TOP_K,
    faiss_k=FAISS_K,
    sim_threshold=TANIMOTO_THRESHOLD,
)

# Format results: each hit is (metadata, semantic score, Tanimoto similarity)
entries = [
    f"CID {m['cid']} Name:{m.get('name','<unknown>')} "
    f"MW:{m.get('molecular_weight',0):.1f} "
    f"Sol:{m.get('solubility_tag','')} "
    f"Score:{s:.3f} Tanimoto:{sim:.2f}"
    for m, s, sim in results
]

# Report the threshold that was actually passed to the search
print(
    f"Results for query: \"{query_text}\"\n"
    f"Top {len(entries)} hits (Granite-inferred scaffolds, Tanimoto ≥ {TANIMOTO_THRESHOLD}):\n"
    + "\n".join(entries)
    + "\n\nNote: Scaffold inference was performed using IBM's granite-3-8b-instruct model. "
      "Semantic search was powered by granite-embedding-278m-multilingual; "
      "structural filtering used Tanimoto similarity on ECFP fingerprints."
)

🔍 Inferred scaffolds: ['amphetamine', 'methamphetamine', 'caffeine']
⚠️ Failed to resolve scaffold 'amphetamine': 'CID 3007 not found in index.'
⚠️ Failed to resolve scaffold 'methamphetamine': 'CID 10836 not found in index.'
✅ CID 2519 for 'caffeine' → ECFP vector loaded
→ CID 187 Name:acetylcholine Tanimoto: 0.07
→ CID 5202 Name:serotonin Tanimoto: 0.04
→ CID 5770 Name:reserpine Tanimoto: 0.07
→ CID 245005 Name:aconitine Tanimoto: 0.06
→ CID 439260 Name:norepinephrine Tanimoto: 0.07
→ CID 774 Name:histamine Tanimoto: 0.07
→ CID 1548943 Name:Capsaicin Tanimoto: 0.10
→ CID 89594 Name:nicotine Tanimoto: 0.13
→ CID 1150 Name:tryptamine Tanimoto: 0.04
→ CID 239 Name:beta-alanine Tanimoto: 0.03
→ CID 681 Name:dopamine Tanimoto: 0.05
→ CID 5184 Name:(-)-Scopolamine Tanimoto: 0.12
→ CID 896 Name:Melatonin Tanimoto: 0.07
→ CID 7028 Name:PSEUDOEPHEDRINE Tanimoto: 0.10
→ CID 1983 Name:acetaminophen Tanimoto: 0.10
→ CID 33032 Name:L-glutamic acid Tanimoto: 0.02
→ CID 586 Name:creatine Tanimoto: 

## 🧬 Result Analysis: Combined Semantic + Structural Search (Tanimoto ≥ 0.45)

We executed a query for:

> **"central nervous system stimulants"**

Using `search_by_semantics_and_structure`, this time with a **Tanimoto threshold of 0.45**.

---

### ✅ Inferred Scaffolds

- `amphetamine` ❌ not present in the demo index
- `methamphetamine` ❌ not present in the demo index
- `caffeine` ✅ resolved and used for fingerprint comparison

Only `caffeine` was found among the 91 indexed molecules and used as the reference scaffold.

---

### 🔍 Top Hits (Tanimoto ≥ 0.45)

| CID   | Name         | Score  | Tanimoto |
|--------|--------------|--------|----------|
| 2519   | caffeine     | 0.564  | 1.00     |
| 2153   | theophylline | 0.535  | 0.47     |
| 5429   | theobromine  | 0.512  | 0.52     |

These results reflect canonical **methylxanthine CNS stimulants**:
- **Caffeine** is the direct scaffold.
- **Theobromine** and **theophylline** are well-known derivatives with similar stimulant properties and shared structural motifs.

---

### 🧠 Interpretation

The combination of semantic and structural similarity successfully retrieved the most relevant compounds in the context of this query.

Given that this is a **demo subset with 91 molecules**, these results demonstrate that:
- **Semantic enrichment** correctly interpreted the functional intent of the query.
- **Scaffold-based filtering** prioritized structurally consistent molecules.
- **Granite models** are capable of aligning language and chemistry for precise and explainable search results.


---



## 🧪 Filtering Semantic Search with Metadata

In addition to semantic and structural similarity, `robotu-molkit` allows you to **apply filters** to the molecule metadata returned by the query.

This enables powerful combinations like:

- Search for **functional concepts** (e.g. "anti-inflammatory drugs")
- Filter by **solubility**, **molecular weight**, **toxicity**, or any numeric/tagged field

### 🧩 How Filters Work

Filters are passed as a dictionary where keys correspond to metadata fields. You can use:

- **Exact match**  
  `{ "solubility_tag": "soluble" }`

- **Range filters** (numeric values)  
  `{ "molecular_weight": (150, 300) }`

- **Multiple allowed values**  
  `{ "solubility_tag": ["soluble", "moderately soluble"] }`

These filters are applied **after** semantic similarity is computed, helping to refine the final results.
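The three filter forms can be read as a small matching rule applied to each hit's metadata. The sketch below is one plausible interpretation, not the library's code: it treats range bounds as inclusive and rejects records that lack a filtered field, and both choices are assumptions. The function name `matches` is hypothetical.

```python
from typing import Any, Mapping


def matches(meta: Mapping[str, Any], filters: Mapping[str, Any]) -> bool:
    """Return True if `meta` satisfies every rule in `filters`."""
    for field, rule in filters.items():
        value = meta.get(field)
        if value is None:
            return False  # assumption: a missing field never matches
        if isinstance(rule, tuple):  # (low, high) numeric range, inclusive
            low, high = rule
            if not (low <= value <= high):
                return False
        elif isinstance(rule, list):  # any of several allowed values
            if value not in rule:
                return False
        elif value != rule:  # exact match
            return False
    return True


if __name__ == "__main__":
    hit = {"cid": 3672, "molecular_weight": 206.3, "solubility_tag": "sparingly soluble"}
    filters = {
        "molecular_weight": (200, 400),
        "solubility_tag": ["sparingly soluble", "moderately soluble"],
    }
    print(matches(hit, filters))
```

Using a tuple for ranges and a list for allowed values keeps the two cases distinguishable by type, which is why the order of the `isinstance` checks matters here.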

---

### 💊 Example Query: "Anti-inflammatory agents"

In this example, we’ll search for molecules related to:

> **"Anti-inflammatory agents"**

...and restrict the results to those with:
- **Molecular weight between 200 and 400**
- **Solubility tag = "sparingly soluble" or "moderately soluble"**

This should surface NSAIDs (non-steroidal anti-inflammatory drugs) like ibuprofen, naproxen, diclofenac, etc., based on their semantics and physical properties.

---

In [39]:
# Query and filters
query_text = "anti-inflammatory agents"
filters = {
    "molecular_weight": (200, 400),
    "solubility_tag": ["sparingly soluble", "moderately soluble"]
}

# Run filtered semantic search
results = searcher.search_by_semantics(
    query_text=query_text,
    top_k=TOP_K,
    faiss_k=FAISS_K,
    filters=filters
)

# Print results
header = f"{'CID':<8} {'Name':<20} {'MW':<8} {'Solubility':<20} {'Score':<6}"
rows = [
    f"{m['cid']:<8} {m.get('name','<unknown>'):<20} "
    f"{m.get('molecular_weight', 0):<8.1f} {m.get('solubility_tag',''):<20} {s:<.3f}"
    for m, s in results
]

print(f"Results for query: \"{query_text}\" with filters:\n{filters}\n")
print(header)
print("-" * len(header))
print("\n".join(rows))

Results for query: "anti-inflammatory agents" with filters:
{'molecular_weight': (200, 400), 'solubility_tag': ['sparingly soluble', 'moderately soluble']}

CID      Name                 MW       Solubility           Score 
------------------------------------------------------------------
3672     ibuprofen            206.3    sparingly soluble    0.642
54676228 piroxicam            331.4    sparingly soluble    0.626
3825     ketoprofen           254.3    sparingly soluble    0.625
3394     flurbiprofen         244.3    sparingly soluble    0.617
3826     ketorolac            255.3    sparingly soluble    0.614
1548943  Capsaicin            305.4    sparingly soluble    0.611
5184     (-)-Scopolamine      303.4    moderately soluble   0.608
156391   NAPROXEN             230.3    sparingly soluble    0.606
4236     modafinil            273.4    moderately soluble   0.603
5280343  quercetin            302.2    sparingly soluble    0.593
9064     Cianidanol           290.3    moderately

## 🧪 Filtered Semantic Search – Anti-inflammatory Agents

We ran a semantic query for:

> **"anti-inflammatory agents"**

...using `search_by_semantics` with two metadata filters:

```python
{
    "molecular_weight": (200, 400),
    "solubility_tag": ["sparingly soluble", "moderately soluble"]
}
```

This combination retrieved molecules that are:
- Semantically related to anti-inflammatory activity
- Within a **realistic drug-like mass range**
- Limited to **medium or low aqueous solubility**

---

### ✅ Top Hits (Filtered)

| CID       | Name             | MW     | Solubility           | Score |
|-----------|------------------|--------|----------------------|--------|
| 3672      | ibuprofen         | 206.3  | sparingly soluble     | 0.642 |
| 54676228  | piroxicam         | 331.4  | sparingly soluble     | 0.626 |
| 3825      | ketoprofen        | 254.3  | sparingly soluble     | 0.625 |
| 3394      | flurbiprofen      | 244.3  | sparingly soluble     | 0.617 |
| 3826      | ketorolac         | 255.3  | sparingly soluble     | 0.614 |
| ...       | ...               | ...    | ...                   | ...   |

Most top results are well-known **NSAIDs** (non-steroidal anti-inflammatory drugs), such as:
- **Ibuprofen**
- **Naproxen**
- **Ketoprofen**
- **Piroxicam**
- **Flurbiprofen**

Also retrieved were compounds like:
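- **Capsaicin** – a topical analgesic with reported anti-inflammatory effects
- **Quercetin, Genistein, Resveratrol** – plant polyphenols with anti-inflammatory potential (quercetin and genistein are flavonoids; resveratrol is a stilbenoid)
- **Scopolamine, Modafinil, Harmine** – not anti-inflammatory agents in the usual sense; likely literature-associated hits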

---

### 🧠 Summary

- The semantic model correctly captured anti-inflammatory meaning.
- The filters narrowed the hits to a **drug-like molecular-weight range** with low to moderate aqueous solubility; on their own they do not establish oral bioavailability.
- This search illustrates how combining **semantic queries with physicochemical constraints** can support targeted molecule exploration even in small curated sets.

This demonstrates a practical use case for medicinal chemistry, formulation profiling, or compound triage.

## 🔮 What’s Next for `robotu-molkit`

The current release of `robotu-molkit` already supports advanced semantic search, molecular summarization, and FAISS-based local querying — all powered by IBM Granite models.

But the next evolution is coming soon.

---

### 🚀 Coming Soon: Natural Language–Driven Code Generation

The next release will introduce support for:

> **`granite-instruct-code`**  
> An IBM Granite model designed to generate and execute code in response to natural-language queries.

With this, `robotu-molkit` will be able to:
- **Interpret free-form queries** — even if they don’t contain explicit filters or numbers.
- **Infer filters automatically** — such as molecular weight ranges, solubility classes, or structural constraints.
- **Generate and run internal code** — directly from natural language, with no manual configuration needed.

---

### 🗣️ Example (Future Behavior)

Input:
> _“Show me small analgesics with moderate solubility”_

The library will:
1. Enrich the query with `granite-instruct`.
2. Use `granite-instruct-code` to:
   - Interpret _"small"_ as `molecular_weight < 300`
   - Interpret _"moderate solubility"_ as a tag filter
   - Combine everything into a valid search operation
3. Execute the search internally and return ranked results — all from one sentence.
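To make the idea concrete, the sketch below maps the two phrases from the example to filters with a hard-coded lookup table. This is only a toy stand-in: the planned feature is described as model-driven, the rule table and function name are hypothetical, and the `(0, 300)` range approximates `molecular_weight < 300` using the range-filter form shown earlier in this notebook.

```python
from typing import Any, Dict, Tuple

# Hypothetical phrase-to-filter rules mirroring the example in the text.
RULES: Dict[str, Tuple[str, Any]] = {
    "small": ("molecular_weight", (0, 300)),
    "moderate solubility": ("solubility_tag", "moderately soluble"),
}


def infer_filters(query: str) -> Dict[str, Any]:
    """Return the filters whose trigger phrase appears in the query (naive substring match)."""
    text = query.lower()
    return {field: rule for phrase, (field, rule) in RULES.items() if phrase in text}


if __name__ == "__main__":
    print(infer_filters("Show me small analgesics with moderate solubility"))
```

A substring lookup like this breaks on paraphrases such as "low molecular weight", which is the gap a language model is meant to close.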

---

### ✨ Why It Matters

This makes `robotu-molkit`:
- Even more accessible to scientists who think in **goals, not filters**
- Able to **automatically convert text into queries**
- A foundation for **intelligent chemical exploration interfaces**

Stay tuned — natural-language–driven molecule discovery is almost here.

## 🚀 RobotU’s Vision

`robotu-molkit` is just the **foundation** of RobotU’s bigger vision: creating an AI-powered, quantum-ready simulation platform for chemists and chemical engineers.

Think of an experience like **Shuri in Wakanda**, interacting seamlessly with her AI in a holographic environment to virtually explore molecular structures, run instant simulations, and discover compounds—like the lost heart-shaped herb.

RobotU aims to bring this futuristic interaction closer to reality:

- **Natural language commands:** Interact effortlessly using everyday language, powered by AI models.
- **Seamless integration:** Automatic retrieval, filtering, and preparation of molecular data for simulations.
- **Quantum simulation:** Leveraging libraries like **Qiskit Nature** and IBM Quantum systems to execute advanced molecular simulations.
- **Visual insights:** Immediate visual feedback through an intuitive graphical interface, displaying simulation outcomes interactively.

`robotu-molkit` is our first step toward making molecule discovery and quantum simulations accessible to everyone—no supercomputers, vibranium, or holograms required (yet).