## Code for Chapter 6 of the book LangChain for Life Science and Healthcare, by Dr. Ivan Reznikov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YibccEsb7ZrqtoibX7lYgwNjJ5oZFHFe?usp=sharing)

# Finding Typos in PubChem Database

This notebook demonstrates how to use LangChain and OpenAI's GPT models to generate potential typos for chemical terms and then search the PubChem database to see how many compounds are associated with these typos. This approach can help identify common misspellings in chemical databases and potentially find compounds that were indexed under incorrect spellings.

## Overview

The workflow consists of three main steps:
1. **Setup**: Install required packages and configure API keys
2. **Typo Generation**: Use LangChain with OpenAI's GPT model to generate potential typos for a given chemical term
3. **Database Search**: Query PubChem's API to find how many compounds are associated with each potential typo


## Environment Setup and Package Installation

First, we need to install the required packages for LangChain integration with OpenAI.

In [None]:
!pip install -q langchain langchain_openai openai

In [None]:
!pip freeze | grep "lang\|openai"

google-ai-generativelanguage==0.6.15
google-cloud-language==2.17.1
langchain==0.3.22
langchain-core==0.3.49
langchain-openai==0.3.12
langchain-text-splitters==0.3.7
langcodes==3.5.0
langsmith==0.3.22
language_data==1.3.0
libclang==18.1.1
openai==1.70.0


In [None]:
from google.colab import userdata
import os
# Read the OpenAI API key from Colab secrets, falling back to a placeholder
def set_api_keys(default_openai_key: str = "YOUR_API_KEY") -> None:
    """Set OPENAI_API_KEY from the Colab secret LC4LS_OPENAI_API_KEY, or use the default if the secret is empty."""

    os.environ["OPENAI_API_KEY"] = userdata.get("LC4LS_OPENAI_API_KEY") or default_openai_key

set_api_keys()

## LangChain Setup for Typo Generation

Here we create a LangChain pipeline that uses OpenAI's GPT model to generate potential typos.


In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [None]:
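# temperature=0 keeps the output as repeatable as the model allows
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Only a system message and the user input are needed here;
# no chat history or agent scratchpad is used in this notebook
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a professional editor and typo-catcher"),
        ("human", "{input}"),
    ]
)

# Compose prompt -> model -> plain-string output
typo_chain = prompt | llm | StrOutputParser()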

**Chain Explanation**:
- The `|` operator composes the components into a single pipeline
- Input flows through: prompt formatting → LLM call → string parsing
- This syntax is the LangChain Expression Language (LCEL)
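
To make the pipe idea concrete, here is a toy sketch in plain Python. It is not LangChain's implementation: the `Step` class and the three stages are hypothetical stand-ins that only illustrate how overloading `|` can chain one stage's output into the next stage's input.

```python
class Step:
    """Toy stand-in for a chain component: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` returns a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stages mirroring prompt -> LLM -> parser
format_prompt = Step(lambda word: f"List typos for: {word}")
fake_llm = Step(lambda text: {"content": text.upper()})  # no real model call
parse_output = Step(lambda message: message["content"])

toy_chain = format_prompt | fake_llm | parse_output
result = toy_chain.invoke("ethyl")
print(result)
```

The real `typo_chain` follows the same shape, with the prompt template, the chat model, and `StrOutputParser` in place of the toy stages.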

## PubChem API Integration

We'll create a function to query PubChem's database for compound information.


In [None]:
import requests
def get_pubchem_data(subword):
  # Query PubChem's SDQ agent for compounds that match `subword` in any field
  response = requests.get(
    'https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=json&query={%22select%22:%22*%22,%22collection%22:%22compound%22,%22order%22:[%22relevancescore,desc%22],%22start%22:1,%22limit%22:10,%22where%22:{%22ands%22:[{%22*%22:%22'+subword+'%22}]},%22width%22:1000000,%22listids%22:0}',
    timeout=30,  # avoid hanging indefinitely on a slow response
  )
  return response

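**API URL Breakdown**:
- `infmt=json&outfmt=json`: Input and output format specification
- `query={...}`: A URL-encoded JSON object (`%22` is an encoded double quote) that holds the fields below
- `"select": "*"`: Select all available fields
- `"collection": "compound"`: Search in the compound database
- `"order": ["relevancescore,desc"]`: Sort by relevance (highest first)
- `"start": 1`, `"limit": 10`: Return at most 10 results, starting from the first
- `"where": {"ands": [{"*": subword}]}`: Search for the subword in any field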
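
As a sketch of how the same request could be assembled without hand-written `%22` escapes, the query can be built as a Python dict and encoded with the standard library. The helper name `build_sdq_url` and the constant `SDQ_ENDPOINT` are illustrative, and it is an assumption that the endpoint accepts a fully percent-encoded query in the same way as the hand-built URL.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

SDQ_ENDPOINT = "https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi"


def build_sdq_url(subword: str, limit: int = 10) -> str:
    """Build the search URL from a dict instead of string concatenation."""
    query = {
        "select": "*",
        "collection": "compound",
        "order": ["relevancescore,desc"],
        "start": 1,
        "limit": limit,
        "where": {"ands": [{"*": subword}]},
        "width": 1000000,
        "listids": 0,
    }
    params = {
        "infmt": "json",
        "outfmt": "json",
        # Compact JSON; urlencode percent-encodes quotes, braces, and brackets
        "query": json.dumps(query, separators=(",", ":")),
    }
    return f"{SDQ_ENDPOINT}?{urlencode(params)}"


url = build_sdq_url("ethly")
# Round-trip check: decode the URL and recover the original query object
decoded = json.loads(parse_qs(urlparse(url).query)["query"][0])
print(decoded["where"])
```

Building the query as a dict also means a search term containing quotes or spaces is escaped automatically rather than breaking the JSON.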

## Generate and Test Typos

Now we'll generate potential typos for a chemical term and search PubChem for each one.

In [None]:
word = "ethyl"
typo_llm_response = typo_chain.invoke(f"""
  Return a semicolon-separated list of the 10 most likely typos for the word {word}.
  The response should contain only possible typos!
  Don't include the initial word {word}.
  Avoid adding duplicates."""
)
# Drop periods, split on semicolons, and trim whitespace around each item
typo_list = [x.strip() for x in typo_llm_response.replace(".", "").split(";")]

In [None]:
typo_list, len(set(typo_list))

(['ehtyl',
  'etyl',
  'ethly',
  'eythl',
  'ehtil',
  'etyhl',
  'ethil',
  'eylth',
  'ehtly',
  'etlhy'],
 10)
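
The model is only asked to avoid duplicates and the original word, so its reply may not always comply. A slightly more defensive parser might look like the sketch below; `parse_typos` is a hypothetical helper, not part of the notebook, and the sample string is made up to exercise the edge cases.

```python
def parse_typos(raw: str, original: str) -> list[str]:
    """Split a semicolon-separated reply, dropping empties, duplicates, and the original word."""
    seen = set()
    typos = []
    for item in raw.replace(".", "").split(";"):
        candidate = item.strip().lower()
        # Skip blanks, the correctly spelled word, and anything already collected
        if not candidate or candidate == original.lower() or candidate in seen:
            continue
        seen.add(candidate)
        typos.append(candidate)
    return typos


# Made-up reply with an empty item, the original word, and a duplicate
sample = "ehtyl; etyl; ethly;; Ethyl; etyl."
parsed = parse_typos(sample, "ethyl")
print(parsed)
```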

## Search PubChem for Each Typo

Finally, we'll search PubChem for each generated typo to see how many compounds are associated with it.


In [None]:
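typo_dict = {}
for subword in typo_list:
  response = get_pubchem_data(subword)
  # totalCount is the number of compound records matching the search term
  total_count = response.json()['SDQOutputSet'][0]['totalCount']
  # Keep only typos with at least one hit
  if total_count:
    typo_dict[subword] = total_count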

In [None]:
typo_dict

{'ehtyl': 53, 'etyl': 23, 'ethly': 2453, 'etyhl': 1, 'ehtly': 10}
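
The loop above assumes every response contains `SDQOutputSet[0].totalCount`. A minimal sketch of a more tolerant extraction, followed by ranking the counts reported above, is shown here. `extract_total_count` is a hypothetical helper and `mock_payload` is a hand-written stand-in that contains only the key used in this notebook, not a full PubChem response.

```python
def extract_total_count(payload: dict) -> int:
    """Return totalCount from an SDQ-style payload, or 0 if the shape is unexpected."""
    try:
        return int(payload["SDQOutputSet"][0]["totalCount"])
    except (KeyError, IndexError, TypeError, ValueError):
        return 0


# Hand-written stand-in for response.json(); no network call is made
mock_payload = {"SDQOutputSet": [{"totalCount": 53}]}
count = extract_total_count(mock_payload)

# Counts copied from the notebook output above, ranked from most to fewest hits
typo_counts = {"ehtyl": 53, "etyl": 23, "ethly": 2453, "etyhl": 1, "ehtly": 10}
ranked = sorted(typo_counts.items(), key=lambda kv: kv[1], reverse=True)
for typo, hits in ranked:
    print(f"{typo}: {hits}")
```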

## Results Analysis and Interpretation

The results show which generated typos match compound records in PubChem. Because the query searches every field, each count is the number of compound records that contain the misspelled string in at least one field.

**Most Common Typos Found:**
- **'ethly'** (2,453 matching records) - By far the most frequent typo. The 'y' and 'l' are swapped, a simple adjacent transposition that is easy to make when typing quickly.

**Moderately Common Typos:**
- **'ehtyl'** (53 matching records) - The 't' and 'h' are swapped, another adjacent transposition
- **'etyl'** (23 matching records) - The 'h' is missing entirely, possibly from fast typing

**Rare Typos:**
- **'ehtly'** (10 matching records) - Combines two adjacent transpositions ('t'/'h' and 'y'/'l'); no letter is missing
- **'etyhl'** (1 matching record) - The 'h' and 'y' are swapped, a single adjacent transposition

**What This Reveals:**
1. **Real-world data quality issues**: Even in scientific databases like PubChem, human input errors occur and persist
2. **Common error patterns**: Transposition errors (swapping adjacent letters) are the most frequent type of typo
3. **Impact magnitude**: The 'ethly' typo appears in nearly 2,500 compound entries, suggesting this is a systematic issue that could affect chemical literature searches and data retrieval
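
The error patterns can also be checked mechanically. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance, in which an adjacent transposition, an insertion, a deletion, or a substitution each cost one edit. This is an added illustration rather than part of the original notebook, and `osa_distance` is a name chosen here.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: edits include adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "yl" -> "ly"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


found_typos = ["ehtyl", "etyl", "ethly", "etyhl", "ehtly"]
distances = {typo: osa_distance("ethyl", typo) for typo in found_typos}
print(distances)
```

Four of the five typos found are a single edit away from "ethyl"; only 'ehtly' needs two edits, both of them transpositions.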

### Potential Applications:

- **Database Curation**: Identify potential spelling errors in chemical databases
- **Search Enhancement**: Improve chemical search engines with common misspelling patterns
- **Quality Control**: Validate chemical nomenclature in research databases
