<a href="https://colab.research.google.com/github/Jahan08/RDKit-application-Cheminformatics-Analysis/blob/main/Molecular_Property_distribution_3D_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Welcome to Walter_1.0.** This code uses RDKit to compute the following parameters: molecular weight, topological polar surface area (TPSA), number of rotatable bonds, number of H-bond donors, number of H-bond acceptors, fraction sp3, LogP, number of aromatic rings, number of aliphatic rings, number of saturated rings, number of heteroatoms, and QED. You can then upload a second .csv file containing the SMILES strings for your own molecule or molecules, and they will be plotted together with the first dataset in a 3D PCA plot, so you can get an idea of where in "chemical space" your molecules live compared with the dataset (e.g., FDA-approved drugs).

Datasets such as FDA-approved drugs, vet drugs, drugs containing phenols, and drugs containing phenolic ethers are available at the following GitHub page (download as .csv, then upload when prompted): https://github.com/SculpturatusLabs/FDA-approved_SMILES. Each uploaded file must contain a column named `SMILES`.
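As a rough illustration of the file layout the code expects, the sketch below checks a small CSV for the required `SMILES` header using only the Python standard library. The `Name` column, the two example rows, and the helper names `has_smiles_column` and `read_smiles` are assumptions made for this example; they are not taken from the datasets linked above.

```python
import csv
import io

# Hypothetical file contents: only the 'SMILES' header is required by the notebook.
EXAMPLE_CSV = """Name,SMILES
aspirin,CC(=O)OC1=CC=CC=C1C(=O)O
caffeine,CN1C=NC2=C1C(=O)N(C(=O)N2C)C
"""


def has_smiles_column(csv_text):
    """Return True if the header row contains a column named exactly 'SMILES'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return reader.fieldnames is not None and "SMILES" in reader.fieldnames


def read_smiles(csv_text):
    """Return the non-empty entries of the 'SMILES' column as a list of strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["SMILES"] for row in reader if row["SMILES"]]


if has_smiles_column(EXAMPLE_CSV):
    print(read_smiles(EXAMPLE_CSV))
```

The header match is case-sensitive, so a file whose column is called `smiles` would be rejected in the same way the notebook rejects it.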

**To Run the Code:**
  1. At the top, click "Runtime", then "Run all".
  2. Scroll to the bottom of the screen. Once the packages are installed, you will be prompted to upload a dataset. Upload the dataset that you want to use, and when it finishes uploading, you will be prompted to name it. Choose a name (this will be used in the legend of the plot).
  3. At this point a 3D PCA plot will be generated for this dataset. Scroll past it.
  4. You will be prompted to upload a second .csv file with a `SMILES` column containing your own molecule or molecules, and then to name it. This name will be used in the legend of the PCA plot.
  5. A combined 3D PCA plot showing both datasets will be generated.
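The two datasets end up in the same coordinate system because the scaler and the PCA are fitted on the first dataset only, and the second dataset is transformed with those fitted objects. A minimal sketch of that pattern follows, assuming random placeholder numbers in place of real descriptors; the array names `reference` and `query` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data (an assumption): 200 "reference" rows and 5 "query" rows,
# each with 12 descriptor columns.
reference = rng.normal(size=(200, 12))
query = rng.normal(loc=0.5, size=(5, 12))

# Fit the scaler and the PCA on the reference set only.
scaler = StandardScaler().fit(reference)
pca = PCA(n_components=3).fit(scaler.transform(reference))

# Project both sets with the same fitted objects so they share one set of axes.
reference_pcs = pca.transform(scaler.transform(reference))
query_pcs = pca.transform(scaler.transform(query))

print(reference_pcs.shape, query_pcs.shape)
print(pca.explained_variance_ratio_ * 100)  # percent of variance per component
```

Refitting the scaler or the PCA on the second dataset would place it on different axes, so its points could no longer be compared with the first dataset.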

In [None]:
!pip install pandas rdkit scikit-learn matplotlib plotly



In [None]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from google.colab import files
from IPython.display import clear_output

# Function to calculate descriptors
def calculate_descriptors(smiles):
    """Return 12 RDKit descriptors for a SMILES string, or 12 Nones if it cannot be parsed."""
    # Empty CSV cells arrive as NaN (a float), which MolFromSmiles cannot accept
    mol = Chem.MolFromSmiles(smiles) if isinstance(smiles, str) else None
    if mol is None:
        return (None,) * 12
    # The order must match the column names used when the results are assigned
    return (
        Descriptors.MolWt(mol),                       # molecular weight
        rdMolDescriptors.CalcFractionCSP3(mol),       # fraction of sp3 carbons
        Descriptors.MolLogP(mol),                     # LogP
        Descriptors.NumHDonors(mol),                  # H-bond donors
        Descriptors.NumHAcceptors(mol),               # H-bond acceptors
        Descriptors.TPSA(mol),                        # topological polar surface area
        Descriptors.NumRotatableBonds(mol),           # rotatable bonds
        rdMolDescriptors.CalcNumAromaticRings(mol),   # aromatic rings
        rdMolDescriptors.CalcNumAliphaticRings(mol),  # aliphatic rings
        rdMolDescriptors.CalcNumSaturatedRings(mol),  # saturated rings
        Descriptors.NumHeteroatoms(mol),              # heteroatoms
        Descriptors.qed(mol),                         # QED
    )

# Upload the first CSV file (in blue)
uploaded_blue = files.upload()

# Prompt for the name of the uploaded CSV file
csv_name_blue = input("Enter a name for the blue dataset: ")

# Load the CSV file into a pandas dataframe
df_blue = pd.read_csv(next(iter(uploaded_blue)))

# Ensure the CSV contains a column named 'SMILES'
if 'SMILES' not in df_blue.columns:
    raise ValueError("The CSV file must contain a 'SMILES' column.")

# Descriptor columns, in the order returned by calculate_descriptors
features = ['MolecularWeight', 'FractionSP3', 'LogP', 'NumHDonors', 'NumHAcceptors', 'TPSA', 'NumRotatableBonds', 'NumAromaticRings', 'NumAliphaticRings', 'NumSaturatedRings', 'NumHeteroatoms', 'QED']

# Apply the function to the dataframe
df_blue[features] = df_blue['SMILES'].apply(lambda x: pd.Series(calculate_descriptors(x)))

# Drop rows whose SMILES strings could not be processed; only the descriptor
# columns are checked, so blanks in unrelated columns do not discard a molecule
df_blue = df_blue.dropna(subset=features)

# Select the descriptor columns for PCA on the original dataset
x_blue = df_blue[features]

# Normalize the data by setting mean to 0 and variance to 1
scaler = StandardScaler()
x_blue_normalized = scaler.fit_transform(x_blue)

# Perform PCA
pca = PCA(n_components=3)
principal_components_blue = pca.fit_transform(x_blue_normalized)
pca_df_blue = pd.DataFrame(data=principal_components_blue, columns=['Principal Component 1', 'Principal Component 2', 'Principal Component 3'])

# Get percentage of variance explained by each component
explained_variance = pca.explained_variance_ratio_ * 100

# Plot initial data in blue with 90% transparency
fig = px.scatter_3d(
    pca_df_blue, x='Principal Component 1', y='Principal Component 2', z='Principal Component 3',
    color_discrete_sequence=['blue'], opacity=0.1
)
# Blue dots with 90% transparency; name the trace so the dataset name appears in the legend
fig.update_traces(marker=dict(size=6, opacity=0.1), name=csv_name_blue, showlegend=True)
# Axis title fonts are set through title=dict(font=...); the older 'titlefont' key is deprecated in Plotly
fig.update_layout(
    scene=dict(
        xaxis=dict(title=dict(text=f'Principal Component 1<br>({explained_variance[0]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12)),
        yaxis=dict(title=dict(text=f'Principal Component 2<br>({explained_variance[1]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12)),
        zaxis=dict(title=dict(text=f'Principal Component 3<br>({explained_variance[2]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12))
    ),
    title='Initial 3D PCA Plot (Blue)',
    width=1000,
    height=800
)
fig.show()

# Upload the second CSV file (in red)
uploaded_red = files.upload()

# Prompt for the name of the second CSV file
csv_name_red = input("Enter a name for the red dataset: ")

# Load the second CSV file into a pandas dataframe
df_red = pd.read_csv(next(iter(uploaded_red)))

# Ensure the second CSV contains a column named 'SMILES'
if 'SMILES' not in df_red.columns:
    raise ValueError("The CSV file must contain a 'SMILES' column.")

# Apply the function to the second dataframe
df_red[features] = df_red['SMILES'].apply(lambda x: pd.Series(calculate_descriptors(x)))

# Drop rows whose SMILES strings could not be processed
df_red = df_red.dropna(subset=features)

# Project the second dataset with the scaler and PCA fitted on the first dataset
x_red = df_red[features]
x_red_normalized = scaler.transform(x_red)
principal_components_red = pca.transform(x_red_normalized)
pca_df_red = pd.DataFrame(data=principal_components_red, columns=['Principal Component 1', 'Principal Component 2', 'Principal Component 3'])

# Plot the updated data with both blue and red points
fig = px.scatter_3d(
    pca_df_blue, x='Principal Component 1', y='Principal Component 2', z='Principal Component 3',
    color_discrete_sequence=['blue'], opacity=0.1
)
# Blue dots with 90% transparency; name the trace so the dataset name appears in the legend
fig.update_traces(marker=dict(size=6, opacity=0.1), name=csv_name_blue, showlegend=True)

# Add red data to the plot with smaller dots and 50% transparency
fig.add_scatter3d(
    x=pca_df_red['Principal Component 1'], y=pca_df_red['Principal Component 2'], z=pca_df_red['Principal Component 3'],
    mode='markers', marker=dict(color='red', size=4, opacity=0.5), name=csv_name_red  # Red dots with smaller size and 50% transparency
)
# Axis title fonts are set through title=dict(font=...); the older 'titlefont' key is deprecated in Plotly
fig.update_layout(
    scene=dict(
        xaxis=dict(title=dict(text=f'Principal Component 1<br>({explained_variance[0]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12)),
        yaxis=dict(title=dict(text=f'Principal Component 2<br>({explained_variance[1]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12)),
        zaxis=dict(title=dict(text=f'Principal Component 3<br>({explained_variance[2]:.2f}%)', font=dict(size=12, family='Arial Black', color='black')), tickfont=dict(size=12))
    ),
    title='3D PCA Plot with Blue and Red Points',
    width=1000,
    height=800
)
fig.show()




Saving FDA-approved_1951-2021.csv to FDA-approved_1951-2021 (9).csv
Enter a name for the blue dataset: Drugs


[22:42:56] Can't kekulize mol.  Unkekulized atoms: 3 4 16
[22:42:59] Explicit valence for atom # 28 N, 4, is greater than permitted


Saving test.csv to test (4).csv
Enter a name for the red dataset: Test
