# Welcome to the medicinal chemistry workshop!

### Story 
_You are a chemist working in the laboratory of a large pharmaceutical company. **Your task is to study drug compounds and compare their properties to those of aspirin.** Water solubility is particularly important, as the drug compound needs to dissolve well in blood. The biological effect and metabolism of the drug compound can be modeled later._

_You have already studied many different promising drug compounds without success in the laboratory using demanding solubility tests that depend on the acidity and temperature of the solutions. The project deadline for completing the solubility tests is approaching. Therefore, new ways must be found to determine the best possible drug compounds from the sample for further research and experimentation._

_Fortunately, you have access to various laboratory results and modeling tools that pharmaceutical and chemical software developers have been working on for decades. It's time to dive into the world of cheminformatics and solve the water solubility tests before the deadline!_


## Technical instructions
This work is done in a Jupyter notebook. A notebook is made up of cells, which makes it **modular**: the computer runs each cell individually, and the cells are meant to be run in order from top to bottom. This workshop uses two types of cells:

- Code cells, in which commands can be written. They usually start by importing the libraries that the commands need.
- Markdown cells, which contain text that is formatted, as in a text editor, with special **markup characters** placed either before a word or on both sides of it.

You can edit any cell by double-clicking it and then typing commands or pasting copied text into it. To run a cell and see the effect of your changes, press **CTRL + ENTER**.

If you encounter technical problems (for example, a cell does not work), you may have run the cells out of order. Try running the cells again from top to bottom. If that does not help, save your work and then press **CTRL + SHIFT + R** to reload the notebook. Alternatively, run the 'Restart the kernel' command from the top bar.

---

# PART I

### Recap of how SMILES work

**Table 1.** Examples of SMILES rules
| Symbol | Example | Description |
| --- | --- | --- |
| Chemical symbol | `O` | `O` is a water molecule. Elements are written with their chemical symbols, as in the periodic table. Hydrogens are not written out; they are inferred from the typical valence of the element, meaning the number of bonds it forms. The code below shows how to display them. |
| Covalent bond | `C=C` | Single bonds are not shown, but double and triple bonds are shown with the symbols = and #. |
| Hydrogen | `C` | Hydrogens are implicit: `C` alone corresponds to methane, CH4. |
| Branches | `CCCC(C)CC` | Parentheses () mark a branch, and the symbols inside them give the atoms of that branch. |
| Ring structure | `C1CCCCC1` or `C1CCC(C(C1)Cl)Cl` | A pair of matching digits (1-9) marks the two atoms that are bonded to each other to close the ring. Substituents replacing hydrogen atoms can be placed in parentheses. NOTE: Cl is a chlorine atom. |
| Aromaticity | `c1ccccc1` | Aromatic ring atoms are written in lowercase. The example is benzene. |
| Chirality | `C[C@@H](C(=O)O)N` | L-alanine. The @ or @@ after the carbon, inside square brackets, marks it as a chiral center and gives the arrangement of its neighbors. |
| Isotopes | `[13C]` | A carbon-13 atom. The number before the element symbol, inside square brackets, gives the isotope. Atoms in square brackets get no implicit hydrogens, so carbon-13 methane is `[13CH4]`. |
| Electric charge and radicals | `[NH4+].[Cl-]` or `C[CH2].[CH3]` | Charges are written as + or - inside square brackets. Radicals are often read using special algorithms in software that interprets SMILES strings; there is no established notation for them. |
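
To make the rules in Table 1 more concrete, the sketch below splits a SMILES string into its individual symbols with a regular expression. The `tokenize` helper and its pattern are hypothetical illustrations written for this handout, not part of RDKit, and they cover only the notation listed in the table rather than the full SMILES language.

```python
import re

# Hypothetical helper for illustration only; RDKit has its own, complete parser.
# The pattern covers just the notation listed in Table 1.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"       # bracket atom, e.g. [13C], [NH4+], [C@@H]
    r"|Br|Cl"           # two-letter elements are tried before single letters
    r"|[BCNOPSFI]"      # common single-letter elements
    r"|[bcnops]"        # aromatic atoms (lowercase)
    r"|[-=#/\\.()]"     # bonds, stereo bonds, dot separator, branch parentheses
    r"|%\d{2}|\d"       # ring-closure numbers
)


def tokenize(smiles):
    """Split a SMILES string into a list of symbols (tokens)."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not give back the input, something was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(tokenize("C1CCC(C(C1)Cl)Cl"))   # ring digits, branches and Cl as one symbol
print(tokenize("C[C@@H](C(=O)O)N"))   # the chiral carbon is a single bracket token
print(tokenize("[NH4+].[Cl-]"))       # two charged species separated by a dot
```

Reading the printed lists next to the table shows which characters belong together, for example that `Cl` is one atom and that everything inside square brackets describes a single atom.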

### 1. Drawing compounds using SMILES strings

In [None]:
# Using the code below, you can draw a compound by entering the SMILES string in the
# smiles = ' ' field and running the cell by pressing CTRL + Enter.
# Start by trying one of the SMILES strings from the table above.

# Import required modules from libraries
from rdkit import Chem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG

smiles = 'C'  # Move the desired SMILES string between the ' ' characters to replace C. Finally, press CTRL + ENTER to run the cell.
m = Chem.MolFromSmiles(smiles)

# m = Chem.AddHs(m) # You can use this command to visualize all hydrogen atoms, if you wish. Remove the first # character to activate the setting.

# Create a function moltosvg that creates an SVG image of a structural formula from a SMILES string.
def moltosvg(mol, molSize=(350, 300), kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except Exception:  # Kekulization failed, so fall back to a fresh copy of the molecule.
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0], molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    return SVG(svg)
    

structuralFormula = moltosvg(m)  # Define the structural formula variable and set its value to an SVG image using the moltosvg function.
display(structuralFormula)       # Display the structural formula image.

###  Question 2.
**Which compound did you draw, and what functional groups does the compound have?**

These questions are answered on the Microsoft Forms form.

---

## 2. Creating a connection table
Let's test how connection tables work. Choose a more complex SMILES string and create a connection table for it with the code below. You can also try several different SMILES strings. The computer starts indexing from zero, so the first atom is numbered 0 both in the table and in the image.

In [None]:
# Import required modules from libraries
from rdkit import Chem
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
import numpy as np
import pandas as pd

mol = Chem.MolFromSmiles('C') # Replace C with a more complex SMILES string

# Define settings for the rdkit drawing function.
d2d = rdMolDraw2D.MolDraw2DSVG(300,300)
d2d.drawOptions().addAtomIndices = True  # Add atom index numbers to the image. Numbering starts at 0 on the computer.
d2d.DrawMolecules([mol])
d2d.FinishDrawing()
svg = SVG(d2d.GetDrawingText())

num_atoms = mol.GetNumAtoms()
connectivity_matrix = np.zeros((num_atoms, num_atoms), dtype="f")
pd.options.display.float_format = lambda x: ('%g' % x)


# Fill in the connection table using a for loop.
for bond in mol.GetBonds():
    begin_atom_idx = bond.GetBeginAtomIdx()
    end_atom_idx = bond.GetEndAtomIdx()
    bond_type = bond.GetBondTypeAsDouble()  # 1.0 = single, 2.0 = double, 3.0 = triple, 1.5 = aromatic
    connectivity_matrix[begin_atom_idx][end_atom_idx] = bond_type
    connectivity_matrix[end_atom_idx][begin_atom_idx] = bond_type  # The table is symmetric.
    
df = pd.DataFrame(connectivity_matrix)

# Print the structural formula and the connection table.
display(svg, df)
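
The same idea can be sketched without RDKit to show what the loop above does. The example below builds the connection table for acetaldehyde (`CC=O`) from a bond list typed in by hand; the `connection_table` helper and the bond list are illustrative assumptions, not RDKit output.

```python
# Hand-written description of acetaldehyde, SMILES 'CC=O' (illustrative input).
atoms = ["C", "C", "O"]              # atom 0, atom 1, atom 2
bonds = [(0, 1, 1.0), (1, 2, 2.0)]   # (begin atom, end atom, bond order)


def connection_table(num_atoms, bonds):
    """Return a symmetric matrix whose entry [i][j] is the bond order between atoms i and j."""
    table = [[0.0] * num_atoms for _ in range(num_atoms)]
    for begin, end, order in bonds:
        table[begin][end] = order
        table[end][begin] = order    # a bond connects both atoms, so mirror the entry
    return table


table = connection_table(len(atoms), bonds)
for symbol, row in zip(atoms, table):
    print(symbol, row)
```

Each row belongs to one atom, and a zero means that the two atoms are not bonded to each other.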

### Question 3.
**What is the bond between the second and third carbons?**

----

## 3. Let's learn about stereoisomerism

In [None]:
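# With this code, you can draw compounds that have stereochemistry.

# Import required modules from libraries
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG

# First, mol1 and mol2 are created from SMILES strings.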

mol1 = Chem.MolFromSmiles('C/C=C/C') # Note that this only works for isomeric SMILES strings.
mol2 = Chem.MolFromSmiles('C[C@H](O)[C@@H](N)C(O)=O')


d2d = rdMolDraw2D.MolDraw2DSVG(600,280,300,280)
d2d.drawOptions().addStereoAnnotation=True # The command marks chiral centers on the drawing.
d2d.DrawMolecules([mol1,mol2]) # The 600-pixel-wide canvas is split into two 300-pixel panels, one per molecule.
d2d.FinishDrawing()
rakenneKaavat = SVG(d2d.GetDrawingText())
display(rakenneKaavat)

### Question 4.
**Is the molecule on the left cis or trans? How many chiral centers are there in the compound on the right?**

---

## 4. Introduction to the PUG-REST method and SMILES

Compounds can be imported into the software environment by sending queries to the PubChem server through its PUG-REST interface. Listing the desired molecules in the code lets you quickly fetch information about drug compounds, such as their SMILES strings, without using a search engine.
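
A PUG-REST query is simply a web address assembled from a few parts: the server, the input (here a compound name), the requested property, and the output format. The sketch below only builds and prints such an address and does not contact the server. The helper name `build_pug_rest_url` is hypothetical, and URL-encoding the name with `quote` is an added precaution for names that contain spaces.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pug_rest_url(compound_name, prop="CanonicalSMILES", output="txt"):
    """Assemble a PUG-REST address: server / input / operation / output format."""
    encoded_name = quote(compound_name, safe="")  # e.g. a space becomes %20
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{prop}/{output}"


# Hypothetical usage: print the address instead of sending the request.
print(build_pug_rest_url("aspirin"))
print(build_pug_rest_url("acetylsalicylic acid"))
```

Pasting a printed address into a browser shows the same plain-text answer that the code cell below receives from the server.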

### Examples of medicines
| ACTIVE INGREDIENT        | BRAND NAME                                   | DESCRIPTION                                                                                                              |
| ------------------------ | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **Aspirin**             | **(Aspirin)**                               | The most widely used anti-inflammatory drug. It prevents blood clots, so it's used in the initial treatment of heart attacks. |
| **Ibuprofen**           | **(Burana)**                                | An anti-inflammatory drug.                                                                                               |
| **Paracetamol**         | **(Panadol)**                               | A pain reliever.                                                                                                       |
| **Morphine**            | **(Morphine)**                              | A powerful pain reliever. Used in hospital care.                                                                        |
| **Cortisone**           | **(Prednisone, Hydrocortisone)**            | Cortisone is used for conditions like asthma, allergies, and skin diseases. Administered through inhalation, as tablets, or as ointment. |
| **Penicillin V**        | **(V-Pen)**                                 | An antibiotic for treating bacterial infections.                                                                         |
| **Cetirizine**          | **(Heinix)**                                | An antihistamine for treating allergies.                                                                                 |
| **Insulin**             | **(Lantus)**                                | A hormone for treating diabetes.                                                                                         |
| **Paclitaxel**          | **(Taxol)**                                 | A chemotherapy drug.                                                                                                     |
| **Sertraline**          | **(Zoloft)**                                | An antidepressant, also used for panic disorder and obsessive-compulsive disorder.                                       |


Select three medicinal substances to study. You can use the table above or choose other substances. Enter the names in English, and use the name of the active ingredient, not the brand name. If needed, you can look up the active ingredient of a medicine online.

In [None]:
# Import required modules from libraries
import requests

# Create a function that retrieves data from the PubChem database using the PUG-REST method.
# Do not modify this part of the code!
def retrieve_names(compound_names):
    pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
    pugoper = "property/CanonicalSMILES"
    pugout = "txt"
    results = {}
    for compound_name in compound_names:
        pugin = "compound/name/" + compound_name
        url = f"{pugrest}/{pugin}/{pugoper}/{pugout}"
        response = requests.get(url)
        if response.status_code == 200:
            results[compound_name] = response.text.strip()  # Remove surrounding whitespace and line breaks.
        else:
            results[compound_name] = f"Error: {response.status_code}"
    return results

# Edit the code line below. Add the English names of the drugs you have selected here, enclosed in quotation marks.
compound_names = ["", "", ""] 
# Only edit the code line above.

# Retrieve SMILES strings using the retrieve_names function.
canonical_smiles = retrieve_names(compound_names)

# Print out the results
for compound_name, smiles in canonical_smiles.items():
    print(f"Canonical SMILES for {compound_name}: {smiles}")

### Question 5.
**Which drugs did you choose and what are their SMILES strings?**

---

## 5. Drawing medicines

In [None]:
# Import required modules from libraries
from rdkit import Chem 
from rdkit.Chem.Draw import IPythonConsole 
from rdkit.Chem import rdDepictor 
from rdkit.Chem.Draw import rdMolDraw2D 
from IPython.display import SVG 

smiles = '' # Test your search results by adding SMILES one at a time between the '' characters.
m = Chem.MolFromSmiles(smiles)
# m = Chem.AddHs(m) # You can use this command to visualize all hydrogen atoms, if you wish. Remove the first # character to activate the setting.

# Create a function moltosvg_with_indices that creates an SVG image of the structural formula from a SMILES string.
def moltosvg_with_indices(mol, molSize = (350,300), kekulize = True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except Exception:  # Kekulization failed, so fall back to a fresh copy of the molecule.
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0], molSize[1])
    drawer.drawOptions().addAtomIndices = True
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    return SVG(svg)
    
structuralFormula = moltosvg_with_indices(m)  # Define the structural formula variable and set its value to an SVG image using the moltosvg_with_indices function.
display(structuralFormula)       # Display the structural formula image.

### Question 6.
**What functional groups can you find in medicinal substances?**

---

## 6. Introduction to IUPAC names and the PubChem database

In [None]:
# With this code, you can search for specific information in the PubChem database using various trivial or IUPAC names.

# Import required modules from libraries
import pubchempy as pcp 
import pandas as pd
pd.options.display.max_colwidth = 1000 # Set a maximum length longer than the default value for columns in the pandas library.

# List of example substances. You can add the compounds you want here.
names = ["water", 
         "benzene", 
         "methanol", 
         "ethene", 
         "ethanol", 
         "propene", 
         "1-propanol", 
         "2-propanol", 
         "butadiene", 
         "1-butanol", 
         "2-butanol", 
         "tert-butanol"]

# Use a for loop to go through the list and save the substance IDs to a list.
# Compounds in the PubChem database are indexed, and CID is an abbreviation for compound identifier.
# They are unique to each compound and provide a clearer basis for searching than SMILES or other data.
compound_ids = []
for name in names:
    try:
        cp = pcp.get_compounds(name, namespace='name')
        if cp:
            compound_ids.append(cp[0].cid)
    except Exception as e:
        print(f"Error retrieving CID for '{name}': {e}")
results = pcp.get_properties(['IUPACName', 'MolecularFormula', 'MolecularWeight', 'CanonicalSMILES'], compound_ids)

# Create a table from the results using the pandas library and display it
df = pd.DataFrame(results)
display(df)

### Question 7.
**How does the name of the drug you selected differ from the IUPAC name?**

---

# PART II

QSAR (quantitative structure-activity relationship) models can be used to predict the biological activity of drugs. However, the approach was first used to predict boiling points based on molecular structure. Therefore, we will begin this part of the workshop by visualizing the correlation between molar mass and boiling point.

### 7. Let's examine the correlation between molar mass and boiling point

In [None]:
# Import required modules from libraries
import pandas as pd

pd.options.display.max_colwidth = 1000  # Set a maximum length longer than the default value for columns in the pandas library.
df = pd.read_csv('../../data/BP.csv')   # Read the file as a table using the pandas library.
display(df)                             # Display the table. The name df is short for DataFrame.

In [None]:
# Import required modules from libraries
import matplotlib.pyplot as plt

plt.scatter(df.Moolimassa, df.Kiehumispiste_K)      # Create the scatter plot from the data. (The data file has Finnish column names.)
plt.xlabel('Molar mass')                            # Set the name on the x-axis
plt.ylabel('Boiling point (K)')                     # Set the name on the y-axis
plt.show()                                          # Show the figure

### Question 8.
**What can you conclude about the boiling points of molecules and their relationship to their molar mass? Why can the same molar mass lead to different boiling points?**

---

### 8. Fitting a linear regression model

Linear fits are probably familiar from math classes. This kind of mathematical modeling is one of the key methods in chemoinformatics, because it lets us examine the relationship (correlation) between two or more variables. The code below fits a linear regression model to the dataset. What do you notice in the graph it produces?

In [None]:
# Import required modules from libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Import data from a .csv file into a pandas dataframe
df = pd.read_csv('../../data/BP.csv')

# Fit a linear regression model
model = LinearRegression()
X = df[['Moolimassa']]          # Independent variable
y = df[['Kiehumispiste_K']]     # Dependent variable
model.fit(X, y)

# Draw a scatter plot from the data
plt.figure()
sns.scatterplot(x='Moolimassa', y='Kiehumispiste_K', data=df, color='tab:blue', label='Data points')

# Draw a regression line
plt.plot(X, model.predict(X), color='tab:red', label='Linear regression')

# Set labels and title
plt.xlabel('Molar mass')
plt.ylabel('Boiling point (K)')
plt.title('Linear regression model')

# Show the legend
plt.legend()

# Show the plot
plt.show()
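
For a single explanatory variable, the ordinary least-squares line that `LinearRegression` fits can also be computed by hand: the slope is the covariance of x and y divided by the variance of x, and the line passes through the mean point. The sketch below shows this calculation. It does not read `BP.csv`; it uses approximate literature values for the first five straight-chain alkanes (methane to pentane), and the helper names are illustrative.

```python
# Approximate literature values for methane, ethane, propane, butane and pentane.
# These are NOT taken from BP.csv; they only illustrate the calculation.
molar_mass = [16.04, 30.07, 44.10, 58.12, 72.15]       # g/mol
boiling_point = [111.7, 184.6, 231.0, 272.7, 309.2]    # K


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variation in y that the fitted line explains (1.0 = perfect fit)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(molar_mass, boiling_point)
r2 = r_squared(molar_mass, boiling_point, slope, intercept)
print(f"slope = {slope:.2f} K per g/mol, intercept = {intercept:.1f} K, R^2 = {r2:.3f}")
```

The R² value gives a number to go with the visual impression of the plot: the closer it is to 1, the more closely the points follow the line.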

### Question 9.
**Does the red regression line accurately follow the data points?**

---

### 9. Predicting the effectiveness of drugs
#### Your task

**Find compounds with a solubility comparable to aspirin (LogS value -2.1) by examining them using the machine learning model below.** The LogS value describes the water solubility of a compound and is explained in more detail in the table below.

| Solubility (LogS)   | LogS < -4            | -4 < LogS < -2            | -2 < LogS < 0                     |
| -------------------- | ----------------- | -------------------- | ------------------------- |
| Interpretation   | Does not dissolve in water | Dissolves rather well   | Dissolves excellently               |
| Blood Concentration | Minimal            | Good                      | High, increased concentration           |
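
The table can be turned into a small helper that sorts a LogS value into one of the three classes. This is a sketch with two assumptions: the function names are hypothetical, and values that fall exactly on a boundary (-4 or -2), which the table leaves open, are placed in the more soluble class. The conversion to mol/L assumes that LogS is the base-10 logarithm of the solubility in mol/L.

```python
def classify_logs(log_s):
    """Return the interpretation row of the LogS table for one value."""
    if log_s < -4:
        return "does not dissolve in water"
    if log_s < -2:   # assumption: exactly -4 counts as this class
        return "dissolves rather well"
    return "dissolves excellently"   # assumption: exactly -2 and values above 0 land here


def logs_to_molar(log_s):
    """Convert LogS to solubility in mol/L, assuming LogS = log10(solubility)."""
    return 10 ** log_s


aspirin_logs = -2.1   # reference value given in the task description
print(classify_logs(aspirin_logs))
print(f"{logs_to_molar(aspirin_logs):.4f} mol/L")
```

Because the scale is logarithmic, a difference of one LogS unit corresponds to a tenfold difference in solubility.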

It is time to examine at least 5 compounds, starting from their SMILES strings. In addition to the compounds you select, always use **aspirin** as a reference:

- You can search databases either freely or with PUG-REST queries.
- Use the experimental data on compounds (the table loaded below).

Evaluate their suitability as drug compounds, particularly with regard to **solubility values and Lipinski's Rule of Five**.
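
As a reminder, Lipinski's Rule of Five says that an orally taken drug is more likely to be absorbed well if its molar mass is at most 500 g/mol, its logP is at most 5, and it has at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors. The sketch below checks these four limits for values that you look up or calculate yourself. The function name is hypothetical, and the aspirin values are approximate figures of the kind reported in compound databases, so treat them as illustrative.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """Return a list of the Rule of Five limits that a compound exceeds."""
    violations = []
    if mol_weight > 500:
        violations.append("molar mass over 500 g/mol")
    if log_p > 5:
        violations.append("logP over 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


# Approximate, illustrative values for aspirin (check a database for exact figures).
aspirin = {"mol_weight": 180.16, "log_p": 1.2, "h_donors": 1, "h_acceptors": 4}

problems = lipinski_violations(**aspirin)
print("Violations:", problems if problems else "none")
```

The rule is usually read loosely: a compound with no more than one violation is still considered a reasonable candidate.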

**The purpose of this work is to map a small sample of compounds with the same solubility**, which you or others can **use as a starting point for the following chemoinformatics solubility models**. Everything starts from known and verified data points! Good luck!

In [None]:
# For example, you can examine whether the model values correspond to the results of experimental studies.
# The dataset contains the results of failed predictions and correct predictions.

# Import required modules from libraries
import numpy as np
import pandas as pd

# Define options for the pandas library
pd.options.display.max_colwidth = 1000  # Set a maximum length longer than the default value for columns in the pandas library.
pd.options.display.max_rows = 500       # Set a maximum number of rows larger than the default value in the pandas library.

df = pd.read_csv('../../data/kokeellisetLogS_referenssipisteet.csv') # Read the file as a table using the pandas library.
df = df.loc[:100, ['Chemical name', 'LogS exp (mol/L)', 'Test', 'SMILES']] # Select rows 0-100 and the columns we want from the table

# Print the table
display(df)

### 10. Machine learning model

In [None]:
# Import the machine learning model for predicting water solubility

# Import required modules from libraries
from pycaret.classification import load_model  # Import only the function that is needed instead of everything (*).

# Load the pre-trained machine learning model.
model = load_model("../../data/WaterSoulubility_02_06_2025_model")

Use this machine learning model to predict the solubility and suitability of compounds as drugs! Enter SMILES strings for the compounds you want to study, but remember to separate them with commas and line breaks.

In [None]:
# Import required modules from libraries
from chem_util import descriptors_from_smiles 

# Enter SMILES strings into the list using quotation marks around each compound as shown in the example.
# For example, the following compounds.
smiles = [
  "C1CCCCC1",
  "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
  "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=C(C=C3)O)N)C(=O)O)C",
  # The r prefix makes this a raw string, so the backslashes in the SMILES are kept as they are.
  r"CC1=C(C(=O)C(=C(C1=O)OC)OC)CC=C(C)CC\C=C(/C)\CC\C=C(/C)\CC\C=C(/C)\CC\C=C(/C)\CC\C=C(/C)\CC\C=C(/C)\CCC=C(C)C",
  "F[C@H]1CC[C@H](O)CC1",
  "C1CC1[C@H](F)C1CCC1",
  "CC(=O)OC1=CC=CC=C1C(=O)O"] # This one is Aspirin!

x = descriptors_from_smiles(smiles) # The machine learning model utilizes molecular descriptors when calculating the requested values based on SMILES.
display(x) # Display the table.

### 11. Predicting solubility

In [None]:
# Predict the solubility using the machine learning model.
ennuste = model.predict(x)

# Create a table of predictions and their SMILES and print it.
df = pd.DataFrame({"Predicted LogS": ennuste, "SMILES": smiles})
display(df)

### Question 10.
**What compounds are you studying, and are they soluble in water?**

### Question 11.
**How does the solubility of the compound you have chosen differ from that of aspirin, which is known to dissolve well in water and is suitable for oral administration as a medicine?**

---

### 12. SwissADME
SwissADME is a free online tool used to evaluate the pharmacokinetic properties of chemical compounds. It helps predict the absorption, distribution, metabolism, and excretion of compounds, as well as analyze their chemical reactivity and toxicity. The tool is particularly useful in drug development, as it provides a quick and easy way to assess the potential of compounds to act as effective and safe drugs.

http://www.swissadme.ch/ 

### Question 12.
**Examine the same compounds using SwissADME. Do you get the same prediction of the compound's solubility and efficacy as a drug using SwissADME compared to the machine learning model above?**

---