# OrgAn: Molecular Analysis and Chromatographic Simulation

*Raphaël Tisseyre, Bastien Pinsard, Johan Schmidt*

OrgAn is a computational framework designed for the extraction, analysis, and simulation of molecular properties in the context of organic chemistry.

This notebook presents a modular workflow to:
- Load molecular data from a CSV file containing SMILES representations,
- Get molecular information (logP, pKa, structure...) from online databases such as PubChem,
- Apply analytical functions to explore molecular trends such as acidity and hydrophobicity,
- Simulate chromatographic separation based on calculated parameters.

All modules are designed to operate on standardized input formats and produce structured outputs suitable for further analysis or visualization.

In [None]:
import os  
import sys  

current_dir = os.getcwd()  
target_dir_relative = os.path.join("src", "OrgAn")  
target_dir_absolute = os.path.abspath(os.path.join(current_dir, target_dir_relative))  
sys.path.append(target_dir_absolute)

### 1. Building the Dataset from a CSV File

In order to initiate molecular property analysis, a dataset must first be constructed from a user-provided CSV file. The expected input is a `.csv` file containing a single column labeled `smiles`, where each row represents a molecule encoded in the SMILES (Simplified Molecular Input Line Entry System) format.

**Example:**

```
smiles
CCO
c1ccccc1
CC(=O)O
```

This file is the main input for the data processing pipeline. Each SMILES string is used to retrieve chemical and structural information from internal or external sources. The collected data is stored in a `pandas.DataFrame`, which can then be analyzed using the functions provided in this project. To create this DataFrame, we first define a function that gathers the data from external databases.
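As a minimal sketch of the expected input format, the snippet below reads the `smiles` column with the standard library only. The helper name `read_smiles` and the in-memory file content are illustrative assumptions; they are not part of OrgAn, which does this reading inside `gives_data_frame`.

```python
import csv
import io

# Hypothetical file content matching the expected single-column format.
csv_text = "smiles\nCCO\nc1ccccc1\nCC(=O)O\n"

def read_smiles(handle):
    """Return the non-empty entries of the `smiles` column, in file order."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "smiles" not in reader.fieldnames:
        raise ValueError("expected a column labeled 'smiles'")
    return [row["smiles"].strip() for row in reader
            if row["smiles"] and row["smiles"].strip()]

smiles_list = read_smiles(io.StringIO(csv_text))
print(smiles_list)
```

For a real file, `open("data_example.csv", newline="")` would replace the `io.StringIO` object.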

### 1.1 PubChem Request

The function `get_mol_info_from_smiles` retrieves detailed chemical and structural information from PubChem using a single SMILES string.

**How it works:**  
It collects key information such as the canonical SMILES, IUPAC name, molecular weight, formula, logP, formal charge, and CAS number. When available, pKa values are also retrieved and parsed. If 3D structural data is accessible, Sterimol descriptors (L, B1, B5) are computed using the `morfeus` library.

The function returns all gathered data in a dictionary format. Missing values are handled gracefully and replaced with `None` when not found.
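The "missing values become `None`" behavior can be sketched as follows. The field names and the helper `normalize_record` are hypothetical; the keys actually returned by `get_mol_info_from_smiles` may differ.

```python
# Hypothetical field names, chosen for illustration only.
EXPECTED_FIELDS = ("iupac_name", "molecular_weight", "logP", "pKa", "charge")

def normalize_record(raw):
    """Return a dict with every expected field, using None for missing ones."""
    return {field: raw.get(field) for field in EXPECTED_FIELDS}

# Pretend the lookup only returned some of the properties.
partial = {"iupac_name": "ethanol", "molecular_weight": 46.07}
record = normalize_record(partial)
print(record)
```

Keeping a fixed set of keys means every row of the resulting table has the same columns, whatever the database returned.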

**Uses:**  
This function is used to programmatically obtain molecular data when a compound is not already present in the local dataset. It is internally called within the `gives_data_frame` function to supplement or complete entries in the main molecular database.

The extracted data ensures compatibility with downstream analysis steps such as property filtering, clustering, and chromatographic modeling.

**Example:** 

In [None]:
from OrgAn import get_mol_info_from_smiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin
info = get_mol_info_from_smiles(smiles)
print(info)

### 1.2 DataFrame building

With the request function defined, the `gives_data_frame` function can be used to retrieve the properties of every compound listed in the user's CSV file.

 

**How it works:**

The function first searches the package database for the properties of each compound; if no entry matches, it sends a PubChem request to gather them. This order was chosen to save time, since a request takes longer than a lookup in the local database, as can be observed when running the example above.
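The local-first lookup order can be sketched as below. The names `lookup_properties`, `local_db` and `fake_fetch` are stand-ins introduced for illustration, and the property values are placeholders; this is not the actual implementation of `gives_data_frame`.

```python
def lookup_properties(smiles, local_db, fetch_remote):
    """Return (properties, source), trying the local database first."""
    if smiles in local_db:
        return local_db[smiles], "local"      # fast path: no network access
    return fetch_remote(smiles), "remote"     # slow path: external request

# Stand-in for the PubChem request, recording which SMILES were requested.
remote_calls = []

def fake_fetch(smiles):
    remote_calls.append(smiles)
    return {"smiles": smiles, "logP": None}

# Placeholder local database with a single entry.
local_db = {"CCO": {"smiles": "CCO", "logP": -0.3}}

first = lookup_properties("CCO", local_db, fake_fetch)
second = lookup_properties("c1ccccc1", local_db, fake_fetch)
print(first[1], second[1], remote_calls)
```

Only the compound missing from the local database triggers the slow request.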

**Uses:**

The main use is to build a DataFrame so that the data can be handled easily with the standard `pandas` functions or with the functions shown in the next section. It can also be used to build a database of all the compounds available in a laboratory.

In [None]:
from OrgAn import gives_data_frame

df = gives_data_frame("data_example.csv")
df

### 2. Dataframe functions

This section will explain the functions dedicated to DataFrame handling. 

#### 2.1. pKa and logP gaps

The functions `find_logp_gaps` and `find_pKa_gaps` find the largest gaps in logP or pKa within the input DataFrame. Such gaps show where a sample splits into hydrophilic and hydrophobic compounds, or into acidic and basic ones.

**How it works:**  
The functions take a DataFrame, assumed to be the one returned by `gives_data_frame`, and the number of gaps to return. The DataFrame is first sorted along the logP or pKa column. A loop then goes through the sorted values, computes the difference between consecutive values, and keeps the largest differences together with the two indexes on either side of each. The function returns a list containing the values of the gaps and the corresponding indexes.
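The sort-then-compare idea can be sketched on a plain list of values. The function name `find_largest_gaps`, its return format and the sample values are assumptions made for illustration; the OrgAn functions operate on a DataFrame and may return their results differently.

```python
def find_largest_gaps(values, nb=1):
    """Return the nb largest gaps between consecutive sorted values.

    Each result is (gap, index_low, index_high), where the indexes are
    positions in the original `values` list. Missing values are skipped.
    """
    # Positions of the known values, ordered by increasing value.
    order = sorted((i for i, v in enumerate(values) if v is not None),
                   key=lambda i: values[i])
    # Difference between each pair of neighbors in the sorted order.
    gaps = [(values[hi] - values[lo], lo, hi)
            for lo, hi in zip(order, order[1:])]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:nb]

# Illustrative logP values; None marks a compound without a known logP.
logp_values = [-0.3, 2.1, 0.2, None, 4.9, 2.4]
top_gaps = find_largest_gaps(logp_values, nb=2)
print(top_gaps)
```

Returning the indexes alongside each gap makes it possible to identify which two compounds sit on either side of it.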


**Uses:**  
In organic chemistry, compounds often need to be separated, which can be difficult for complex mixtures. Knowing where the largest gaps lie makes it easier to plan a separation based on pH or hydrophilicity.

These gaps are also useful for managing composition gradients in chromatography: where a large logP gap exists, the composition of the eluant can be changed to speed up the run and save time.

**Example:** 

In [None]:
from OrgAn import find_logp_gaps, find_pKa_gaps

logp_gaps = find_logp_gaps(df)
pka_gaps = find_pKa_gaps(df, nb=2)
logp_gaps, pka_gaps

#### 2.2. Compound suggestion 

The function `find_compounds` returns the 5 compounds from the package database that best fit the user's property criteria.

**How it works:**  
The function's inputs are the pKa, logP, charge, Sterimol parameters (L, B1 and B5) and SMILES. If the user provides a SMILES string, the function simply looks up that exact compound in the database, or through a PubChem request. If the user provides criteria but no SMILES, the function sorts the database to find the 5 best matches.

The function accepts a maximum difference of $\pm$ 1.5 for the logP and $\pm$ 3 for each Sterimol parameter.
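A minimal sketch of tolerance-based matching is given below. The tolerances come from the text; the ranking by total absolute deviation, the helper names and the four made-up compounds are assumptions, since the scoring actually used by `find_compounds` is not described here.

```python
LOGP_TOLERANCE = 1.5      # maximum logP difference, from the text
STERIMOL_TOLERANCE = 3.0  # maximum difference for each of L, B1 and B5

def within(value, target, tolerance):
    """True when no target is given, or when value is known and close enough."""
    if target is None:
        return True
    return value is not None and abs(value - target) <= tolerance

def best_matches(database, logP=None, L=None, B1=None, B5=None, limit=5):
    """Filter by tolerance, then rank by total absolute deviation (assumed score)."""
    targets = {
        "logP": (logP, LOGP_TOLERANCE),
        "L": (L, STERIMOL_TOLERANCE),
        "B1": (B1, STERIMOL_TOLERANCE),
        "B5": (B5, STERIMOL_TOLERANCE),
    }
    kept = [
        c for c in database
        if all(within(c.get(key), t, tol) for key, (t, tol) in targets.items())
    ]

    def score(compound):
        return sum(abs(compound[key] - t)
                   for key, (t, _) in targets.items() if t is not None)

    return sorted(kept, key=score)[:limit]

# Made-up compounds, for illustration only.
database = [
    {"name": "A", "logP": 1.0, "L": 4.0, "B1": 1.5, "B5": 3.0},
    {"name": "B", "logP": 2.0, "L": 6.0, "B1": 1.7, "B5": 3.2},
    {"name": "C", "logP": 4.0, "L": 4.1, "B1": 1.5, "B5": 3.1},
    {"name": "D", "logP": None, "L": 4.0, "B1": 1.5, "B5": 3.0},
]
matches = [c["name"] for c in best_matches(database, logP=1.2, L=4.0)]
print(matches)
```

Compound C is rejected because its logP lies more than 1.5 away from the target, and D because its logP is unknown.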

**Uses:**  
This function is aimed at organic chemists planning a synthesis. The choice of acid or base for a reaction is often determined by its charge, its bulkiness and, of course, its acidity, and this choice can lead to thermodynamic or kinetic products with different properties of interest. The function is useful not only for this type of reaction but whenever the desired properties of a reactant are known.

**Example:** 

In [None]:
from OrgAn import find_compounds
suggestion_1 = find_compounds(smiles="CCO")
suggestion_2 = find_compounds(pKa = 4, charge = 0)

suggestion_1

In [None]:
suggestion_2

### 3. Chromatography

#### 3.1. Elution order estimation

The function `get_elution_order` estimates an elution order by sorting given compounds by their logP values. Since the logP is an indication of the hydrophobicity of a compound, it can be used to estimate the elution order in liquid chromatography. However, this is only an estimation as the elution order also depends on factors such as hydrogen bonds and dipole moments, which are not taken into account in the logP value.

**How it works**

The function has two parameters: `solutes` and `is_reverse_phase`. 
`solutes` is a dataframe containing the logP of the solutes (e.g. generated using `gives_data_frame`).
`is_reverse_phase` is a boolean determining if the chromatography is normal phase or reverse phase.

The function then sorts the dataframe by the value of each compound's logP, in descending order if the chromatography is normal phase, in ascending order if it is reverse phase. The function then outputs the sorted dataframe.
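The sorting rule can be sketched on a list of dictionaries rather than a DataFrame. The function name `elution_order`, the default value of `is_reverse_phase` and the logP values are illustrative assumptions.

```python
def elution_order(solutes, is_reverse_phase=True):
    """Sort by logP: ascending for reverse phase, descending for normal phase."""
    return sorted(solutes, key=lambda s: s["logP"],
                  reverse=not is_reverse_phase)

# Illustrative logP values.
solutes = [
    {"name": "benzene", "logP": 2.1},
    {"name": "ethanol", "logP": -0.3},
    {"name": "acetic acid", "logP": -0.2},
]

reverse_phase = [s["name"] for s in elution_order(solutes, True)]
normal_phase = [s["name"] for s in elution_order(solutes, False)]
print(reverse_phase)
print(normal_phase)
```

In reverse phase the most hydrophilic solute (lowest logP) comes first; in normal phase the order is inverted.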

**Uses**

Estimating the elution order can help the user plan for their chromatography. For example, it can help them decide whether to use normal phase or reverse phase.

**Example**

In [None]:
from OrgAn import get_elution_order

get_elution_order(df)

#### 3.2. Polarity index calculation

The function `calculate_polarity_index` calculates the polarity index of an eluant using the table 28.4.1 provided [here](https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Instrumental_Analysis_(LibreTexts)/28%3A_High-Performance_Liquid_Chromatography/28.04%3A_Partition_Chromatography).  

**How it works**

The function's parameters are: `cyclohex`, `n_hex`, `ccl4`, `ipr_ether`, `toluene`, `et2o`, `thf`, `etoh`, `etoac`, `dioxane`, `meoh`, `mecn` and `water`, all `float`.

The parameters correspond to the volume fractions of, respectively, cyclohexane, n-hexane, carbon tetrachloride, isopropyl ether, toluene, diethyl ether, tetrahydrofuran, ethanol, ethyl acetate, 1,4-dioxane, methanol, acetonitrile and water present in the eluant.

If the total value of the volume fractions exceeds 1, the function raises a `ValueError`.

The function returns the polarity index of the eluant as a `float`.
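A common way to compute the polarity index of a mixture is the volume-fraction-weighted sum of the pure-solvent indices, $P' = \sum_i \phi_i P'_i$; the sketch below assumes this is what `calculate_polarity_index` does. Only three solvents are included, and the function name `polarity_index_of_mixture` is hypothetical. The water and methanol values reproduce the 60:40 water–methanol figure of $P' \approx 8.2$ quoted from the same source later in this notebook; the acetonitrile value should be checked against the cited table.

```python
# Pure-solvent polarity indices P' (subset; verify against the cited table).
POLARITY_INDEX = {"meoh": 5.1, "mecn": 5.8, "water": 10.2}

def polarity_index_of_mixture(**fractions):
    """Volume-fraction-weighted sum of the pure-solvent polarity indices."""
    total = sum(fractions.values())
    if total > 1 + 1e-9:  # small tolerance for floating-point rounding
        raise ValueError(f"volume fractions sum to {total}, which exceeds 1")
    return sum(POLARITY_INDEX[name] * fraction
               for name, fraction in fractions.items())

p_mix = polarity_index_of_mixture(water=0.6, meoh=0.4)
print(round(p_mix, 2))
```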

**Uses**

Calculating the polarity index of an eluant helps the user plan their chromatography. The polarity of the eluant plays an important role in how well different solutes are separated, since a small variation in polarity index can induce a large variation in the retention factor.

**Example:**

In [None]:
from OrgAn import calculate_polarity_index

calculate_polarity_index(n_hex = 0.4, etoac = 0.6)

#### 3.3. Retention factor estimation

The function `estimate_retention_factor` estimates a retention factor for reverse phase chromatography based on a given logP value and polarity index. However, this is only an estimation and has limitations. As stated before, the retention factor depends on more than just the logP, and the derivation below makes several further approximations, for example that octanol and the stationary phase in reverse phase chromatography are equivalent (they are similar, but not equivalent).



**How it works**

The function has two parameters: `logP` and `polarity_index`, both `float`. The `logP` parameter is the logP value of the solute of which the user wants to calculate the retention factor, and `polarity_index` is the polarity index of the eluant, for example as calculated by `calculate_polarity_index`.

The retention factor is then calculated using the following formula: $ \log k = \log P - 0.5\cdot(10.2-P') $.

Since the retention factor reflects the partitioning of a solute between the stationary phase and the mobile phase, and the logP is the log of the partition coefficient between octanol and water, the retention factor in reverse phase chromatography and the logP must be correlated, as the stationary phase is usually made of alkanes of similar length to octanol.

[This article](https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Instrumental_Analysis_(LibreTexts)/28%3A_High-Performance_Liquid_Chromatography/28.04%3A_Partition_Chromatography) states: 
> As a general rule, a two unit change in the polarity index corresponds to an approximately 10-fold change in a solute’s retention factor. Here is a simple example. If a solute’s retention factor, k, is 22 when using water as a mobile phase (P′ = 10.2), then switching to a mobile phase of 60:40 water–methanol (P′ = 8.2) decreases k to approximately 2.2.

This means that the retention factor is an exponential function of the polarity index or, in other words, that the log of the retention factor ($\log k$) is a linear function of the polarity index. We thus get as a general function:

$$
    \tag 1
    \log k = a \cdot P' + b
$$

Where $ k $ is the retention factor and $ P' $ is the polarity index of the eluant.

Since the retention factor is a partition coefficient and the stationary phase in reverse phase chromatography is similar to octanol, we can assume that $ \log k = \log P $ when the eluant is pure water. Thus we get:

$$
    \tag 2
    \log P = a \cdot P'_{H_{2}O} + b
$$

By subtracting equation $(1)$ from equation $(2)$ we get:

$$
    \tag 3
    \log P - \log k = a\cdot(P'_{H_{2}O}-P')
$$

Plugging in the values given in the article quoted above, where a two-unit change in $ P' $ changes $ \log k $ by 1, we can calculate that $ a = 1/2 = 0.5 $. We thus get:

$$
    \tag 4
    \log k = \log P - 0.5\cdot(10.2-P')
$$

Which is the formula used to calculate the retention factor in the function.

The function then returns the retention factor as a `float`.
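Equation $(4)$ can be checked numerically against the example quoted from the article. The sketch below is not the OrgAn implementation: the name `retention_factor` is hypothetical, and returning $k$ itself (rather than $\log k$) is an assumption.

```python
import math

P_WATER = 10.2  # polarity index of pure water

def retention_factor(logP, polarity_index):
    """Apply log k = log P - 0.5 * (10.2 - P') and return k."""
    log_k = logP - 0.5 * (P_WATER - polarity_index)
    return 10 ** log_k

# Solute from the quoted example: k = 22 in pure water, so log P = log10(22)
# under the assumption log k = log P at P' = 10.2.
logP = math.log10(22)
k_water = retention_factor(logP, 10.2)
k_mix = retention_factor(logP, 8.2)
print(k_water, k_mix)
```

Lowering the polarity index by two units divides the retention factor by ten, in agreement with the rule quoted above.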

**Uses**

This function is very useful to plan a chromatography. For example, the user can run several tests with different polarity indices for a given set of logP values, until they find a polarity index that allows a good separation. The user could also use the function to optimise or minimise the retention time of a solute.

**Example**

In [None]:
from OrgAn import estimate_retention_factor, calculate_polarity_index

estimate_retention_factor(4.2, calculate_polarity_index(etoac=0.4, et2o= 0.6))

#### 3.4. Chromatogram plot

The function `generate_chromatogram` generates a chromatogram plot from a DataFrame (for example one generated by `gives_data_frame`), the polarity index of the eluant and the dead time of the chromatography column. The chromatogram is very rudimentary: it only shows the retention times derived from `estimate_retention_factor`, and gives neither the resolution nor the width of the peaks.

**How it works**

The function has three parameters: `solutes`, `polarity_index` and `dead_time`.

`solutes` is a `DataFrame` containing the logP values of the solutes the user wishes to see the chromatogram of.
`polarity_index` and `dead_time`, both `float`, are respectively the polarity index of the eluant (for example as calculated by `calculate_polarity_index`) and the dead time of the chromatography column.

The function first calculates the retention times by calling `estimate_retention_factor`, using the logP, dead time and polarity index values given, and then plots the signals as a function of time in minutes. As said before, this is a very "crude" chromatogram. The signals are only shown as a straight line with no width, and the retention times are approximations.
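The retention factor is defined as $k = (t_R - t_0)/t_0$, so a retention time follows from the dead time as $t_R = t_0 (1 + k)$. The sketch below assumes that `generate_chromatogram` uses this relation together with equation $(4)$; the function name `retention_times` and the logP values are illustrative, and the plotting step is left out.

```python
import math

def retention_times(logP_values, polarity_index, dead_time):
    """Return t_R = t_0 * (1 + k) for each solute, with k from equation (4)."""
    times = []
    for logP in logP_values:
        k = 10 ** (logP - 0.5 * (10.2 - polarity_index))
        times.append(dead_time * (1 + k))
    return times

# Illustrative solutes, eluant with P' = 8.2, dead time of 3 minutes.
times = retention_times([0.0, 1.0, 2.0], polarity_index=8.2, dead_time=3)
print([round(t, 2) for t in times])
```

Each retention time would then be drawn as a vertical line on the time axis.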

**Uses**

This function is mainly a visual complement to `estimate_retention_factor`. It helps streamline the process of finding the right eluant for reverse phase HPLC to allow the best compromise between high separation and low retention time.

**Example**

In [None]:
from OrgAn import generate_chromatogram, calculate_polarity_index

generate_chromatogram(df, calculate_polarity_index(etoac=0.4, et2o= 0.6), 3)

# Note that the generated chromatogram does not show a good separation. This is normal as the dataframe given has many very hydrophilic compounds

#### Conclusion

OrgAn is a modular tool designed to analyze molecular behavior in organic chemistry and simulate chromatographic separation. It brings together automated data collection, analysis of chemical properties, and basic modeling of how compounds separate. The purpose is to help chemists better understand molecular behavior and improve how they plan separations.

This report explains the development of the tool, starting from the collection of molecular data using PubChem, followed by the analysis of relevant properties, and ending with the simulation of chromatographic behavior. The document is structured to guide the reader through each step of the project while explaining the technical process and the reasoning behind the main decisions.

**Improvements**

Naturally, many improvements can be made. For one, `get_mol_info_from_smiles` takes a few seconds to process one molecule. This is not a problem on a small scale, but it becomes cumbersome on a large scale. The function could therefore be made faster, for example by querying another, quicker database.

Other improvements concern the pKa values. Not all molecules have a pKa listed in PubChem, so a pKa estimation could be used instead. Additionally, some molecules, such as sodium *tert*-butoxide, do not have a pKa listed but have a parent compound that does (in this case, *tert*-butanol).

Furthermore, as mentioned before, the estimations for chromatography rely heavily on approximations and leave out many factors. To improve the chromatography tools, factors such as dipole moment and hydrogen bonding could be factored into the calculation.