# Welcome to **OrgAn** 🧪
*Raphaël Tisseyre, Bastien Pinsard, Johan Schmidt*

Welcome to **OrgAn**, an interactive tool designed to explore molecular properties in organic chemistry.

This notebook will guide you step-by-step to:
- Load your molecular data from a simple file,
- Automatically enrich this data through requests to databases like **PubChem**,
- Apply analytical functions to explore chemical properties,
- Simulate chromatography experiments to predict experimental behavior.

Our goal is to **give you full control**, even if you're not experienced with programming. All you need is a `.csv` file with one column of **SMILES** (a textual representation of molecules), and we’ll take care of the rest.

In [19]:
import os
import sys

# Make the modules in src/OrgAn importable from this notebook
current_dir = os.getcwd()
target_dir_relative = os.path.join("src", "OrgAn")
target_dir_absolute = os.path.abspath(os.path.join(current_dir, target_dir_relative))
sys.path.append(target_dir_absolute)



## 0. Building your dataset (DataFrame) from a CSV file

To begin exploring the chemical properties of your molecules, you first need to describe them.

The method used here is very straightforward: you provide a **`.csv` file with a single column**, where each line corresponds to a molecule represented by a **SMILES**. Example:

```
smiles
CCO
c1ccccc1
CC(=O)O
```

This file is the foundation of the entire project: it allows us to query databases to extract the chemical properties of these molecules.
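
To make the expected file layout concrete, here is a minimal sketch of how such a single-column file can be read with Python's standard `csv` module. The helper name `read_smiles` is hypothetical and only for illustration; it is not the code used inside `gives_data_frame`.

```python
import csv
import io


def read_smiles(csv_text):
    """Return the SMILES strings found in the 'smiles' column of some CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows whose 'smiles' cell is empty or only whitespace
    return [row["smiles"].strip() for row in reader if row["smiles"].strip()]


# Same content as the example file above
example = "smiles\nCCO\nc1ccccc1\nCC(=O)O\n"
smiles_list = read_smiles(example)
print(smiles_list)
```

The first line of the file must be the header `smiles`, since the column is looked up by name.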

## 1.1 Understanding data requests: how do we enrich SMILES?

When you use our `gives_data_frame(...)` function, it will automatically **query PubChem** for each SMILES string to retrieve useful information such as:
- **logP** (hydrophobicity),
- **Molecular weight**,
- **Molecular formula**,
- **Formal charge**, 
- **pKa values**, and more.

These queries are performed through **HTTP requests** behind the scenes using a Python library. To avoid repeating requests, the results are cached:
- If the molecule is **already present in our local database**, we use it directly (faster).
- Otherwise, we look it up on PubChem and **store the properties** for future use.
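
The "check locally first, then query" logic described above can be sketched as follows. This is only an illustration of the idea: the names `lookup_properties` and `fake_fetch` are hypothetical, a plain dictionary stands in for the local database, and no real request is sent.

```python
def lookup_properties(smiles, cache, fetch):
    """Return the properties for `smiles`, calling `fetch` only on a cache miss."""
    if smiles not in cache:
        # Not stored locally yet: fetch once and remember the result
        cache[smiles] = fetch(smiles)
    return cache[smiles]


calls = []  # records every "remote" lookup so we can count them


def fake_fetch(smiles):
    """Stand-in for a PubChem request (assumption: the real one returns a dict)."""
    calls.append(smiles)
    return {"smiles": smiles}


cache = {}
lookup_properties("CCO", cache, fake_fetch)  # cache miss, fetch is called
lookup_properties("CCO", cache, fake_fetch)  # cache hit, no second fetch
print(len(calls))  # 1
```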

In the background, this lookup is handled by a function from a module called `pchem_rq`.

### Example of a PubChem data request
The function used internally to perform this operation is named `get_mol_info_from_smiles(...)`. Calling it on a single SMILES string looks like this:

In [20]:
# Simplified example of a request to PubChem
from pchem_rq import get_mol_info_from_smiles

mol_properties = get_mol_info_from_smiles('CCO')  # Ethanol
print(mol_properties)

{'name': 'ethanol', 'cid': 702, 'CAS': None, 'smiles': 'CCO', 'molWeight': 46.07, 'molFormula': 'C2H6O', 'logP': -0.1, 'is_pKa_parent_compound': False, 'pKa': 15.9, 'charge': 0, 'sterimol_L': 4.51, 'sterimol_B1': 1.7, 'sterimol_B5': 3.26, 'CASno': '64-17-5'}


This function returns a dictionary containing all the key chemical properties.

You don’t need to modify how it works — this behavior is fully automated inside the function we’ll be using next.

## 1.2 Next step: Automatically generating your DataFrame

In the next section, we’ll use the `gives_data_frame()` function to load a `.csv` file containing your SMILES and generate a full table of molecular properties.

Before that, make sure your `.csv` file is located in the project’s root directory. Then we’ll load it by name, like this:

In [23]:
# Step 1 – Create the CSV file used in the example
with open("example_smiles.csv", "w") as f:
    f.write("smiles\nCCO\nc1ccccc1\nCC(=O)O")

# Step 2 – Now call the function normally
from functions import gives_data_frame

df = gives_data_frame("example_smiles.csv")
df.head()


KeyError: 'names'

#### functions.py

### X: Chromatography

#### X.1: Elution order estimation

The function `get_elution_order` estimates an elution order by sorting given compounds by their logP values. Since the logP is an indication of the hydrophobicity of a compound, it can be used to estimate the elution order in liquid chromatography. However, this is only an estimation as the elution order also depends on factors such as hydrogen bonds and dipole moments, which are not taken into account in the logP value.

**How it works**

The function has two parameters: `solutes` and `is_reverse_phase`. 
`solutes` is a dataframe containing the logP of the solutes (e.g. generated using `gives_data_frame`).
`is_reverse_phase` is a boolean indicating whether the chromatography is reverse phase (`True`) or normal phase (`False`).

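The function then sorts the dataframe by each compound's logP value: in descending order if the chromatography is normal phase, and in ascending order if it is reverse phase. In normal phase the stationary phase is polar, so the least polar compounds (highest logP) elute first; in reverse phase the opposite holds. The function then outputs the sorted dataframe.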
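
The sorting rule can be sketched with plain Python. This is not the `get_elution_order` source: it uses a list of dictionaries instead of a DataFrame to stay dependency-free, the name `elution_order` is hypothetical, and apart from ethanol (logP of -0.1, as returned above) the logP values are rounded, illustrative numbers.

```python
def elution_order(solutes, is_reverse_phase):
    """Sort solutes by logP: ascending for reverse phase, descending for normal phase."""
    return sorted(solutes, key=lambda s: s["logP"], reverse=not is_reverse_phase)


# Illustrative values only (assumption), except ethanol which comes from the output above
solutes = [
    {"name": "benzene", "logP": 2.1},
    {"name": "ethanol", "logP": -0.1},
    {"name": "acetic acid", "logP": -0.2},
]

reverse_phase = [s["name"] for s in elution_order(solutes, is_reverse_phase=True)]
normal_phase = [s["name"] for s in elution_order(solutes, is_reverse_phase=False)]
print(reverse_phase)  # ['acetic acid', 'ethanol', 'benzene']
print(normal_phase)   # ['benzene', 'ethanol', 'acetic acid']
```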

**Uses**

Estimating the elution order can help the user plan for their chromatography. For example, it can help them decide whether to use normal phase or reverse phase.

**Example**

In [21]:
from functions import gives_data_frame
from chromato import get_elution_order

solutes = gives_data_frame("tests/test_data.csv")
print(get_elution_order(solutes))

KeyError: 'names'

#### X.2: Polarity index calculation

The function `calculate_polarity_index` calculates the polarity index of an eluant as the volume-fraction-weighted average of the polarity indices of its solvents, using the values from Table 28.4.1 provided [here](https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Instrumental_Analysis_(LibreTexts)/28%3A_High-Performance_Liquid_Chromatography/28.04%3A_Partition_Chromatography).

**How it works**

The function's parameters are: `cyclohex`, `n_hex`, `ccl4`, `ipr_ether`, `toluene`, `et2o`, `thf`, `etoh`, `etoac`, `dioxane`, `meoh`, `mecn` and `water`, all `float`.

The parameters correspond to the volume fractions of, respectively, cyclohexane, n-hexane, carbon tetrachloride, isopropyl ether, toluene, diethyl ether, tetrahydrofuran, ethanol, ethyl acetate, 1,4-dioxane, methanol, acetonitrile and water present in the eluant.

If the total value of the volume fractions exceeds 1, the function raises a `ValueError`.

The function returns the polarity index of the eluant as a `float`.
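
As a rough sketch of the calculation (not the actual `calculate_polarity_index` code), the polarity index of a mixture can be computed as the sum of each solvent's polarity index multiplied by its volume fraction. The function name `polarity_index` and the dictionary below are hypothetical, and only four solvents are included; their values are quoted from the table referenced above and should be checked against it.

```python
# Polarity indices for a few solvents (assumption: values quoted from Table 28.4.1)
POLARITY_INDEX = {"n_hex": 0.1, "etoac": 4.4, "meoh": 5.1, "water": 10.2}


def polarity_index(**fractions):
    """Volume-fraction-weighted polarity index of an eluant."""
    total = sum(fractions.values())
    if total > 1 + 1e-9:  # small tolerance for floating-point rounding
        raise ValueError("The volume fractions add up to more than 1.")
    return sum(POLARITY_INDEX[name] * frac for name, frac in fractions.items())


print(round(polarity_index(n_hex=0.4, etoac=0.6), 2))  # 2.68
```

Passing fractions that add up to more than 1, for example `polarity_index(water=0.7, meoh=0.6)`, raises a `ValueError`.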

**Uses**

Calculating the polarity index of an eluant helps the user plan their chromatography. The polarity of the eluant plays an important role in how well different solutes are separated, since a small variation in the polarity index can induce a large variation in the retention factor.

**Example**

In [None]:
from chromato import calculate_polarity_index

print(calculate_polarity_index(n_hex = 0.4, etoac = 0.6))

2.68


#### X.3: Retention factor estimation

The function `estimate_retention_factor` estimates a retention factor for reverse phase chromatography from a given logP value and polarity index. This is only an estimate and has limitations. As stated before, the retention factor depends on more than just the logP, and the derivation below makes several further approximations, for example that octanol and the stationary phase in reverse phase chromatography are equivalent: they are similar, but not equivalent.



**How it works**

The function has two parameters: `logP` and `polarity_index`, both `float`. The `logP` parameter is the logP value of the solute whose retention factor the user wants to estimate, and `polarity_index` is the polarity index of the eluant, for example as calculated by `calculate_polarity_index`.

The retention factor is then calculated using the following formula: $ \log k = \log P - 0.5\cdot(10.2-P') $.

The retention factor reflects the partition of the solute between the stationary phase and the mobile phase, and the logP is the log of the partition coefficient between octanol and water. The retention factor in reverse phase chromatography and the logP should therefore be correlated, as the stationary phase is usually made of alkyl chains comparable to octanol.

[This article](https://chem.libretexts.org/Bookshelves/Analytical_Chemistry/Instrumental_Analysis_(LibreTexts)/28%3A_High-Performance_Liquid_Chromatography/28.04%3A_Partition_Chromatography) states: 
> As a general rule, a two unit change in the polarity index corresponds to an approximately 10-fold change in a solute’s retention factor. Here is a simple example. If a solute’s retention factor, k, is 22 when using water as a mobile phase (P′ = 10.2), then switching to a mobile phase of 60:40 water–methanol (P′ = 8.2) decreases k to approximately 2.2.

This means that the retention factor is an exponential function of the polarity index or, in other words, that the log of the retention factor ($ \log k $) is a linear function of the polarity index. We thus get as a general function:

$$
    \tag 1
    \log k = a \cdot P' + b
$$

Where $ k $ is the retention factor and $ P' $ is the polarity index of the eluant.

Since the retention factor reflects a partition coefficient, and since the stationary phase in reverse phase chromatography is similar to octanol, we can assume that $ \log k = \log P $ when the eluant is pure water. Thus we get:

$$
    \tag 2
    \log P = a \cdot P'_{H_{2}O} + b
$$

By subtracting equation $(1)$ from equation $(2)$ we get:

$$
    \tag 3
    \log P - \log k = a\cdot(P'_{H_{2}O}-P')
$$

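The article quoted above gives $ k = 22 $ at $ P' = 10.2 $ and $ k \approx 2.2 $ at $ P' = 8.2 $. Writing equation $(1)$ for both values and subtracting gives $ \log 22 - \log 2.2 = a\cdot(10.2 - 8.2) $, i.e. $ 1 = 2a $, so $ a = 0.5 $. We thus get: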

$$
    \tag 4
    \log k = \log P - 0.5\cdot(10.2-P')
$$

This is the formula the function uses to calculate the retention factor.

The function then returns the retention factor as a `float`.
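
The formula can be checked numerically against the example quoted from the article. The sketch below is a stand-alone re-implementation for illustration, not the `estimate_retention_factor` source: the name `retention_factor` is hypothetical, and it assumes the returned value is $ k $ itself rather than $ \log k $.

```python
import math

P_WATER = 10.2  # polarity index of pure water


def retention_factor(log_p, polarity_index):
    """Estimate k from log k = logP - 0.5 * (10.2 - P')."""
    log_k = log_p - 0.5 * (P_WATER - polarity_index)
    return 10 ** log_k


# A solute with k = 22 in pure water has logP = log10(22) under the assumption above
log_p = math.log10(22)
k_water = retention_factor(log_p, 10.2)
k_mix = retention_factor(log_p, 8.2)
print(round(k_water, 2))  # 22.0
print(round(k_mix, 2))    # 2.2
```

Lowering the polarity index by two units divides the estimated retention factor by ten, as stated in the article.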

**Uses**

This function is very useful to plan a chromatography. For example, the user can run several tests with different polarity indices for a given set of logP values, until they find a polarity index that allows a good separation. The user could also use the function to optimise or minimise the retention time of a solute.

**Example**

In [None]:
from chromato import estimate_retention_factor
print(estimate_retention_factor(1.5, 8.2))  # logP = 1.5, polarity index = 8.2

#### X.4: Chromatogram plot

The function `generate_chromatogram` generates a chromatogram plot from a dataframe (for example one generated by `gives_data_frame`), the polarity index of the eluant and the dead time of the chromatography column. The chromatogram is a very rudimentary one, as it only shows the retention times derived from `estimate_retention_factor`; it doesn't give the resolution, the width of the peaks, etc.

**How it works**

***!!NEED TO IMPROVE THAT LATER!!***

The function has three parameters: `solutes`, `dead_time` and `polarity_index`.

`solutes` is a `DataFrame` containing the logP values of the solutes the user wishes to see the chromatogram of.
`polarity_index` and `dead_time`, both `float`, are respectively the polarity index of the eluant (for example as calculated by `calculate_polarity_index`) and the dead time of the chromatography column.

The function first calculates the retention times by calling `estimate_retention_factor`, using the logP, dead time and polarity index values given, and then plots the signals as a function of time in minutes. As said before, this is a very "crude" chromatogram: each signal is drawn as a straight line with no width, and the retention times are approximations.
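
One way to go from a retention factor to a retention time is the standard definition $ k = (t_R - t_0)/t_0 $, which gives $ t_R = t_0\cdot(1 + k) $, where $ t_0 $ is the dead time. The sketch below assumes this relation and re-implements the retention factor formula from X.3; it is not the `generate_chromatogram` source, the name `retention_time` is hypothetical, and the plotting step is left out.

```python
def retention_time(log_p, polarity_index, dead_time):
    """Retention time t_R = t_0 * (1 + k), with k estimated as in section X.3."""
    k = 10 ** (log_p - 0.5 * (10.2 - polarity_index))
    return dead_time * (1 + k)


# Illustrative solutes (assumption: logP values chosen for the example only)
solutes = {"A": 0.0, "B": 1.0}
times = {name: retention_time(log_p, 10.2, dead_time=1.0)
         for name, log_p in solutes.items()}
print(times)  # {'A': 2.0, 'B': 11.0}
```

Each retention time would then be drawn as one vertical line on the time axis.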

**Uses**

This function is mainly a visual complement to `estimate_retention_factor`. It helps streamline the process of finding the right eluant for reverse phase HPLC to allow the best compromise between high separation and low retention time.

**Example**

In [None]:
# again, will write later

#### Conclusion