<a href="https://colab.research.google.com/github/Bnt-Suleiman/AI_and_Drug_Discovery_Course_2026/blob/main/Part_3_Pubchem_Fingerprint_Calculation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **AI And Biotechnology/Bioinformatics**

## **AI and Drug Discovery Course: QSAR Modeling**
This notebook demonstrates how to collect and preprocess bioactivity data from ChEMBL for QSAR modeling

# **Part 3: Descriptor Calculation**

PaDELPy is a Python wrapper for the PaDEL-Descriptor (molecular descriptor calculation) software.  

It provide the following descriptors/fingerprint:  
* 1444 - 2D Descriptors
* 431 - 3D Descriptors
* 881 bits - PubChem Fingerprints

## **Install PaDELpy**

In [2]:
!pip install padelpy

Collecting padelpy
  Downloading padelpy-0.1.16-py3-none-any.whl.metadata (7.7 kB)
Downloading padelpy-0.1.16-py3-none-any.whl (20.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.9/20.9 MB[0m [31m68.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: padelpy
Successfully installed padelpy-0.1.16


## **Import libraries**

In [15]:
import pandas as pd
import numpy as np
from google.colab import files
from padelpy import padeldescriptor

## **Load dataset**

In [16]:
df = pd.read_csv('df_lipinski.csv')
df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL1914665,active,8.39794,CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,344.418,4.8937,3.0,4.0
1,CHEMBL1914655,active,7.207608,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,409.287,5.2661,3.0,4.0
2,CHEMBL1914654,active,8.420216,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,348.381,4.6427,3.0,4.0
3,CHEMBL1914660,active,8.045757,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc...,348.381,4.6427,3.0,4.0
4,CHEMBL1914653,active,8.431798,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,330.391,4.5036,3.0,4.0


In [17]:
data = df[['canonical_smiles', 'molecule_chembl_id']]
data.head()

Unnamed: 0,canonical_smiles,molecule_chembl_id
0,CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,CHEMBL1914665
1,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,CHEMBL1914655
2,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,CHEMBL1914654
3,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc...,CHEMBL1914660
4,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,CHEMBL1914653


## **Convert to .smi format**

In [18]:
df_smi = data['canonical_smiles'].to_csv('smiles_chembl.smi', index=None, header=None)

In [19]:
! cat smiles_chembl.smi | head

CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1
C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(Br)cc1
C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(F)cc1
C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc(F)c1
C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1
C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1F
Cc1ccc([C@@H](C)Nc2ncnc3[nH]c(-c4ccc(O)cc4)cc23)cc1
Cc1cccc(C(C)Nc2ncnc3[nH]c(-c4ccc(O)cc4)cc23)c1
Cc1ccccc1C(C)Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12
Oc1ccc(-c2cc3c(NCc4ccc(F)cc4)ncnc3[nH]2)cc1


## **Calculate molecular Pubchem Fingerprints using "padeldescriptor" function**


In [33]:
padeldescriptor(mol_dir= "smiles_chembl.smi",
                d_file='pubchem_fingerprints.csv',
                fingerprints = True,
                retainorder= True,
                #removesalt = True, standardizetautomers = True, standardizenitro=True
                )

KeyboardInterrupt: 

In [39]:
display(significant_descriptors)

Unnamed: 0,descriptor,correlation,p_value
2,PubchemFP2,-0.043023,3.100043e-02
3,PubchemFP3,0.075463,1.523355e-04
14,PubchemFP14,0.067408,7.196568e-04
15,PubchemFP15,0.066923,7.862031e-04
17,PubchemFP17,0.069738,4.667218e-04
...,...,...,...
826,PubchemFP826,0.114756,7.925491e-09
830,PubchemFP830,0.049689,1.271424e-02
835,PubchemFP835,0.053473,7.324710e-03
836,PubchemFP836,0.085516,1.758956e-05


In [40]:
display(non_significant_descriptors)

Unnamed: 0,descriptor,correlation,p_value
1,PubchemFP1,-0.003750,0.850911
12,PubchemFP12,0.030764,0.123054
13,PubchemFP13,-0.027940,0.161367
16,PubchemFP16,0.026958,0.176625
18,PubchemFP18,-0.012428,0.533395
...,...,...,...
834,PubchemFP834,-0.017157,0.389842
837,PubchemFP837,0.025155,0.207360
839,PubchemFP839,0.023415,0.240555
860,PubchemFP860,0.014732,0.460323


In [21]:
!ls -lh pubchem_fingerprints.csv

-rw-r--r-- 1 root root 4.4M Feb 22 17:30 pubchem_fingerprints.csv


In [24]:
# Select only the columns we need for ML
meta_cols = df[['molecule_chembl_id', 'bioactivity_class', 'pIC50']]

# Reset index to ensure proper alignment
meta_cols = meta_cols.reset_index(drop=True)
df_fingerprint = df_fingerprint.reset_index(drop=True)

# Combine meta data with fingerprints
combined_df = pd.concat([meta_cols, df_fingerprint.drop(df_fingerprint.columns[0], axis=1)], axis=1)

# Inspect the first few rows
combined_df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1914665,active,8.39794,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL1914655,active,7.207608,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL1914654,active,8.420216,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL1914660,active,8.045757,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL1914653,active,8.431798,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
df_fingerprint = pd.read_csv("pubchem_fingerprints.csv")
df_fingerprint.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,AUTOGEN_smiles_chembl_1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_smiles_chembl_2,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_smiles_chembl_3,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_smiles_chembl_4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_smiles_chembl_5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
!ls -lh pubchem_fingerprints.csv

-rw-r--r-- 1 root root 4.4M Feb 22 17:30 pubchem_fingerprints.csv


In [26]:
df_fingerprint = pd.read_csv("pubchem_fingerprints.csv")
df_fingerprint.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,AUTOGEN_smiles_chembl_1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AUTOGEN_smiles_chembl_2,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AUTOGEN_smiles_chembl_3,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AUTOGEN_smiles_chembl_4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AUTOGEN_smiles_chembl_5,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## **Prepare Dataset for ML**

In [27]:
df.head()

Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL1914665,active,8.39794,CC[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,344.418,4.8937,3.0,4.0
1,CHEMBL1914655,active,7.207608,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,409.287,5.2661,3.0,4.0
2,CHEMBL1914654,active,8.420216,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccc(...,348.381,4.6427,3.0,4.0
3,CHEMBL1914660,active,8.045757,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1cccc...,348.381,4.6427,3.0,4.0
4,CHEMBL1914653,active,8.431798,C[C@@H](Nc1ncnc2[nH]c(-c3ccc(O)cc3)cc12)c1ccccc1,330.391,4.5036,3.0,4.0


In [28]:
# Select only the columns we need for ML
meta_cols = df[['molecule_chembl_id', 'bioactivity_class', 'pIC50']]

# Reset index to ensure proper alignment
meta_cols = meta_cols.reset_index(drop=True)
df_fingerprint = df_fingerprint.reset_index(drop=True)

# Combine meta data with fingerprints
combined_df = pd.concat([meta_cols, df_fingerprint.drop(df_fingerprint.columns[0], axis=1)], axis=1)

# Inspect the first few rows
combined_df.head()


Unnamed: 0,molecule_chembl_id,bioactivity_class,pIC50,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL1914665,active,8.39794,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,CHEMBL1914655,active,7.207608,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,CHEMBL1914654,active,8.420216,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CHEMBL1914660,active,8.045757,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,CHEMBL1914653,active,8.431798,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Save and download the dataset**

In [29]:
# Save as CSV
combined_df.to_csv("QSAR_dataset.csv", index=False)
print("Combined dataset saved as QSAR_dataset.csv")

# Download file in Colab
files.download("QSAR_dataset.csv")

Combined dataset saved as QSAR_dataset.csv


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# **Calculate other fingerprints**

## **Download xml Files from Github**

In [30]:
!wget https://github.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/raw/main/padel_descriptors_xml.zip

--2026-02-22 17:35:06--  https://github.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/raw/main/padel_descriptors_xml.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/main/padel_descriptors_xml.zip [following]
--2026-02-22 17:35:06--  https://raw.githubusercontent.com/AI-Biotechnology-Bioinformatics/Drug_Discovery_AI_Course_2026/main/padel_descriptors_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘padel_descriptors_xml.zip’


2026-02-22 17:35:07 (25.4 MB/s) - ‘

## **Unzip all files**

In [31]:
!unzip padel_descriptors_xml.zip

Archive:  padel_descriptors_xml.zip
  inflating: AtomPairs2DFingerprintCount.xml  
  inflating: AtomPairs2DFingerprinter.xml  
  inflating: EStateFingerprinter.xml  
  inflating: ExtendedFingerprinter.xml  
  inflating: Fingerprinter.xml       
  inflating: GraphOnlyFingerprinter.xml  
  inflating: KlekotaRothFingerprintCount.xml  
  inflating: KlekotaRothFingerprinter.xml  
  inflating: MACCSFingerprinter.xml  
  inflating: PubchemFingerprinter.xml  
  inflating: SubstructureFingerprintCount.xml  
  inflating: SubstructureFingerprinter.xml  


## **Calculate Fingerprints**

In [32]:
# Specify the XML file for SubstructureFingerprinter directly
Substruc_fp = "SubstructureFingerprinter.xml"

# Calculate Substructure fingerprints
padeldescriptor(
    mol_dir='smiles_chembl.smi',
    d_file='Substructure_fingerprints.csv',
    fingerprints=True,
    descriptortypes= Substruc_fp,
    retainorder=True
    # removesalt=True, standardizetautomers=True
)

KeyboardInterrupt: 

# Task
Analyze the correlation between `pIC50` values and PubChem fingerprint descriptors from the `combined_df` DataFrame. For each descriptor, calculate the Pearson correlation coefficient and p-value. Identify and display lists of statistically significant (p-value < 0.05) and non-significant descriptors.

## Prepare Data for Analysis

### Subtask:
Separate the `pIC50` values and the PubChem fingerprint descriptors from the `combined_df` DataFrame. The 'Name' column (first column) of `df_fingerprint` was dropped during concatenation, so the descriptors are 'PubchemFP0' to 'PubchemFP880'.


**Reasoning**:
The subtask requires separating the target variable (`pIC50`) and the descriptor features (PubChem fingerprints) from the `combined_df` DataFrame. This code will extract `pIC50` into `y` and 'PubchemFP' columns into `X`.



In [34]:
y = combined_df['pIC50']
X = combined_df.drop(['molecule_chembl_id', 'bioactivity_class', 'pIC50'], axis=1)

print("Shape of y:", y.shape)
print("Shape of X:", X.shape)
X.head()

Shape of y: (11838,)
Shape of X: (11838, 881)


Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Reasoning**:
To analyze the correlation between `pIC50` values and PubChem fingerprint descriptors, it's necessary to calculate the Pearson correlation coefficient and p-value for each descriptor. This step will iterate through all PubChem fingerprint columns in `X` and compute these statistical measures against `y` (pIC50).



In [35]:
from scipy.stats import pearsonr

# Initialize lists to store results
correlations = []
p_values = []

# Iterate through each PubChem fingerprint descriptor in X
for column in X.columns:
    # Calculate Pearson correlation coefficient and p-value
    corr, p_val = pearsonr(X[column], y)
    correlations.append({'descriptor': column, 'correlation': corr})
    p_values.append({'descriptor': column, 'p_value': p_val})

# Convert lists to DataFrames for easier handling
df_correlations = pd.DataFrame(correlations)
df_p_values = pd.DataFrame(p_values)

print("Correlation results (first 5 rows):")
print(df_correlations.head())
print("\nP-value results (first 5 rows):")
print(df_p_values.head())

Correlation results (first 5 rows):
   descriptor  correlation
0  PubchemFP0          NaN
1  PubchemFP1          NaN
2  PubchemFP2          NaN
3  PubchemFP3          NaN
4  PubchemFP4          NaN

P-value results (first 5 rows):
   descriptor  p_value
0  PubchemFP0      NaN
1  PubchemFP1      NaN
2  PubchemFP2      NaN
3  PubchemFP3      NaN
4  PubchemFP4      NaN


**Reasoning**:
The previous correlation and p-value calculations resulted in `NaN` values, likely due to descriptors with zero variance. To address this, I will first filter out descriptors with `NaN` correlation or `NaN` p-values, then filter the remaining valid results into statistically significant (p-value < 0.05) and non-significant groups, and finally display these lists.



In [36]:
from scipy.stats import pearsonr

# Initialize lists to store results
correlations = []
p_values = []

# Iterate through each PubChem fingerprint descriptor in X
for column in X.columns:
    # Check if the column has at least two unique values to calculate correlation
    if X[column].nunique() > 1:
        # Calculate Pearson correlation coefficient and p-value
        corr, p_val = pearsonr(X[column], y)
        correlations.append({'descriptor': column, 'correlation': corr})
        p_values.append({'descriptor': column, 'p_value': p_val})
    else:
        # If variance is zero, correlation and p-value are undefined
        correlations.append({'descriptor': column, 'correlation': np.nan})
        p_values.append({'descriptor': column, 'p_value': np.nan})

# Convert lists to DataFrames for easier handling
df_correlations = pd.DataFrame(correlations)
df_p_values = pd.DataFrame(p_values)

# Merge correlation and p-value DataFrames
df_results = pd.merge(df_correlations, df_p_values, on='descriptor')

# Filter out rows where correlation or p_value is NaN (due to constant descriptor values)
df_results_filtered = df_results.dropna(subset=['correlation', 'p_value'])

# Define significance level
significance_level = 0.05

# Identify significant descriptors
significant_descriptors = df_results_filtered[df_results_filtered['p_value'] < significance_level]

# Identify non-significant descriptors
non_significant_descriptors = df_results_filtered[df_results_filtered['p_value'] >= significance_level]

# Descriptors that resulted in NaN (constant values)
nan_descriptors = df_results[df_results['p_value'].isna()]

print("\nStatistically Significant Descriptors (p-value < 0.05):")
print(significant_descriptors.head())
print(f"Total significant descriptors: {len(significant_descriptors)}")

print("\nStatistically Non-Significant Descriptors (p-value >= 0.05):")
print(non_significant_descriptors.head())
print(f"Total non-significant descriptors: {len(non_significant_descriptors)}")

print("\nDescriptors with undefined correlation/p-value (constant values):")
print(nan_descriptors.head())
print(f"Total NaN descriptors: {len(nan_descriptors)}")


Statistically Significant Descriptors (p-value < 0.05):
Empty DataFrame
Columns: [descriptor, correlation, p_value]
Index: []
Total significant descriptors: 0

Statistically Non-Significant Descriptors (p-value >= 0.05):
Empty DataFrame
Columns: [descriptor, correlation, p_value]
Index: []
Total non-significant descriptors: 0

Descriptors with undefined correlation/p-value (constant values):
   descriptor  correlation  p_value
0  PubchemFP0          NaN      NaN
1  PubchemFP1          NaN      NaN
2  PubchemFP2          NaN      NaN
3  PubchemFP3          NaN      NaN
4  PubchemFP4          NaN      NaN
Total NaN descriptors: 881


**Reasoning**:
The previous steps indicated that all PubChem fingerprint descriptors in `X` have zero variance (all `NaN` correlation/p-values). This means they are constant across all samples, making correlation analysis meaningless. To confirm this and understand the data's characteristics, I will inspect the unique values and variance of the `X` DataFrame's columns.



In [37]:
print("Number of unique values per column in X (top 10):\n", X.nunique().head(10))
print("\nValue counts of unique values in X columns:\n", X.nunique().value_counts())
print("\nVariance of columns in X (top 10):\n", X.var().head(10))
print("\nNumber of columns with zero variance:", (X.var() == 0).sum())

Number of unique values per column in X (top 10):
 PubchemFP0    1
PubchemFP1    2
PubchemFP2    2
PubchemFP3    2
PubchemFP4    1
PubchemFP5    1
PubchemFP6    1
PubchemFP7    1
PubchemFP8    1
PubchemFP9    1
dtype: int64

Value counts of unique values in X columns:
 2    560
1    321
Name: count, dtype: int64

Variance of columns in X (top 10):
 PubchemFP0    0.000000
PubchemFP1    0.000398
PubchemFP2    0.062385
PubchemFP3    0.181567
PubchemFP4    0.000000
PubchemFP5    0.000000
PubchemFP6    0.000000
PubchemFP7    0.000000
PubchemFP8    0.000000
PubchemFP9    0.000000
dtype: float64

Number of columns with zero variance: 321


**Reasoning**:
The previous attempts to calculate Pearson correlation resulted in `NaN` values for all descriptors, even for columns that were identified as having more than one unique value (non-constant). This indicates the presence of `NaN` values in either the descriptor columns (`X`) or the target variable (`y`). To accurately calculate correlations, it is crucial to handle these missing values. This step will check for `NaN`s in `y` and a sample non-constant column from `X`, then re-implement the correlation calculation to explicitly drop `NaN`s from the pairwise data before computing `pearsonr` for each descriptor, ensuring valid correlation and p-value results.



In [38]:
from scipy.stats import pearsonr

# Check for NaNs in y (pIC50)
print("Number of NaNs in pIC50 (y):", y.isnull().sum())

# Identify a sample non-constant column in X to check for NaNs
non_constant_columns = X.columns[X.nunique() > 1]
if len(non_constant_columns) > 0:
    sample_column_name = non_constant_columns[0]
    print(f"Number of NaNs in sample non-constant descriptor '{sample_column_name}':", X[sample_column_name].isnull().sum())
else:
    print("No non-constant columns found in X, all columns have zero variance.")

# Initialize lists to store results
correlations = []
p_values = []

# Iterate through each PubChem fingerprint descriptor in X
for column in X.columns:
    # Check if the column has at least two unique values to allow correlation calculation
    if X[column].nunique() > 1:
        # Create a temporary DataFrame for this descriptor column and y, then drop rows with NaNs
        temp_data = pd.DataFrame({'descriptor_values': X[column], 'pIC50_values': y}).dropna()

        if not temp_data.empty:
            # Calculate Pearson correlation coefficient and p-value on clean data
            corr, p_val = pearsonr(temp_data['descriptor_values'], temp_data['pIC50_values'])
            correlations.append({'descriptor': column, 'correlation': corr})
            p_values.append({'descriptor': column, 'p_value': p_val})
        else:
            # If no valid data remains after dropping NaNs
            correlations.append({'descriptor': column, 'correlation': np.nan})
            p_values.append({'descriptor': column, 'p_value': np.nan})
    else:
        # If variance is zero (constant value), correlation and p-value are undefined
        correlations.append({'descriptor': column, 'correlation': np.nan})
        p_values.append({'descriptor': column, 'p_value': np.nan})

# Convert lists to DataFrames for easier handling
df_correlations = pd.DataFrame(correlations)
df_p_values = pd.DataFrame(p_values)

# Merge correlation and p-value DataFrames
df_results = pd.merge(df_correlations, df_p_values, on='descriptor')

# Filter out rows where correlation or p_value is NaN (either constant descriptor or no valid pairs after dropping NaNs)
df_results_filtered = df_results.dropna(subset=['correlation', 'p_value'])

# Define significance level
significance_level = 0.05

# Identify significant descriptors
significant_descriptors = df_results_filtered[df_results_filtered['p_value'] < significance_level]

# Identify non-significant descriptors
non_significant_descriptors = df_results_filtered[df_results_filtered['p_value'] >= significance_level]

# Descriptors that resulted in NaN (constant values or no valid data for correlation)
nan_descriptors = df_results[df_results['p_value'].isna()]

print("\nStatistically Significant Descriptors (p-value < 0.05):")
print(significant_descriptors.head())
print(f"Total significant descriptors: {len(significant_descriptors)}")

print("\nStatistically Non-Significant Descriptors (p-value >= 0.05):")
print(non_significant_descriptors.head())
print(f"Total non-significant descriptors: {len(non_significant_descriptors)}")

print("\nDescriptors with undefined correlation/p-value (constant values or insufficient data):")
print(nan_descriptors.head())
print(f"Total NaN descriptors: {len(nan_descriptors)}")

Number of NaNs in pIC50 (y): 0
Number of NaNs in sample non-constant descriptor 'PubchemFP1': 9324

Statistically Significant Descriptors (p-value < 0.05):
     descriptor  correlation   p_value
2    PubchemFP2    -0.043023  0.031000
3    PubchemFP3     0.075463  0.000152
14  PubchemFP14     0.067408  0.000720
15  PubchemFP15     0.066923  0.000786
17  PubchemFP17     0.069738  0.000467
Total significant descriptors: 283

Statistically Non-Significant Descriptors (p-value >= 0.05):
     descriptor  correlation   p_value
1    PubchemFP1    -0.003750  0.850911
12  PubchemFP12     0.030764  0.123054
13  PubchemFP13    -0.027940  0.161367
16  PubchemFP16     0.026958  0.176625
18  PubchemFP18    -0.012428  0.533395
Total non-significant descriptors: 277

Descriptors with undefined correlation/p-value (constant values or insufficient data):
   descriptor  correlation  p_value
0  PubchemFP0          NaN      NaN
4  PubchemFP4          NaN      NaN
5  PubchemFP5          NaN      NaN
6  Pubch

## Final Task

### Subtask:
Summarize the findings by listing the statistically significant and non-significant descriptors with respect to `pIC50`.


## Summary:

### Q&A
The analysis identified the following descriptors:
*   **Statistically Significant Descriptors (p-value < 0.05):** 283 descriptors showed a statistically significant correlation with `pIC50`.
*   **Statistically Non-Significant Descriptors (p-value \>= 0.05):** 277 descriptors did not show a statistically significant correlation with `pIC50`.
*   **Undefined Correlation/p-value Descriptors:** 321 descriptors could not have their correlation calculated, primarily due to constant values (zero variance) or insufficient valid data pairs after handling missing values.

### Data Analysis Key Findings
*   The initial dataset contained `pIC50` values for 11,838 entries and 881 PubChem fingerprint descriptors.
*   Early attempts to calculate Pearson correlation coefficients and p-values failed, producing `NaN` results across all descriptors.
*   Investigation revealed that 321 descriptor columns had zero variance (i.e., contained only one unique value), making correlation calculation impossible for them.
*   Further data inspection showed a significant number of missing values (NaNs) within individual descriptor columns (e.g., `PubchemFP1` had 9,324 NaNs out of 11,838 entries), which also prevented direct correlation calculation.
*   To address these issues, a refined approach was implemented where NaNs were dropped pairwise between each descriptor and the `pIC50` values before calculating the Pearson correlation coefficient and p-value.
*   After implementing the robust correlation calculation, 283 descriptors were found to be statistically significant with respect to `pIC50` (p-value < 0.05).
*   277 descriptors were identified as statistically non-significant (p-value \>= 0.05) with `pIC50`.
*   321 descriptors remained with undefined correlation/p-values due to being constant or having insufficient valid data for computation.

### Insights or Next Steps
*   Investigate the 283 statistically significant PubChem fingerprint descriptors further to understand their structural implications on bioactivity and potentially prioritize them for drug discovery.
*   Consider methods to handle the 321 descriptors with undefined correlations, such as removing them from further analysis or imputing values if their information is deemed critical and a suitable imputation strategy can be applied without introducing bias.


# Task
Here are the lists of statistically significant and non-significant descriptors:

**Statistically Significant Descriptors (p-value < 0.05):**

```
     descriptor  correlation       p_value
2    PubchemFP2    -0.043023  3.100043e-02
3    PubchemFP3     0.075463  1.523355e-04
14  PubchemFP14     0.067408  7.196568e-04
15  PubchemFP15     0.066923  7.862031e-04
17  PubchemFP17     0.069738  4.667218e-04
..          ...          ...           ...
826 PubchemFP826     0.114756  7.925491e-09
830 PubchemFP830     0.049689  1.271424e-02
835 PubchemFP835     0.053473  7.324710e-03
836 PubchemFP836     0.044810  2.411603e-02
854 PubchemFP854    -0.042051  3.500366e-02

[283 rows x 3 columns]
```

**Statistically Non-Significant Descriptors (p-value >= 0.05):**

```
     descriptor  correlation   p_value
1    PubchemFP1    -0.003750  0.850911
12  PubchemFP12     0.030764  0.123054
13  PubchemFP13    -0.027940  0.161367
16  PubchemFP16     0.026958  0.176625
18  PubchemFP18    -0.012428  0.533395
..          ...          ...       ...
834 PubchemFP834    -0.017157  0.389842
837 PubchemFP837     0.025155  0.207360
839 PubchemFP839     0.023415  0.240555
860 PubchemFP860     0.014732  0.460323
868 PubchemFP868     0.012288  0.538018

[277 rows x 3 columns]
```

## Final Answer

### Subtask:
Provide the direct answer to the user's request for significant and non-significant descriptors based on the analysis already performed.


## Summary:

### Q&A
*   **What are the statistically significant descriptors?**
    The statistically significant descriptors are those with a p-value less than 0.05.
*   **What are the statistically non-significant descriptors?**
    The statistically non-significant descriptors are those with a p-value greater than or equal to 0.05.

### Data Analysis Key Findings
*   Out of the descriptors analyzed, 283 were found to be statistically significant, indicating a statistically relevant correlation with the target variable (p-value < 0.05).
*   A total of 277 descriptors were identified as statistically non-significant, meaning their observed correlations with the target variable were likely due to random chance (p-value $\ge$ 0.05).
*   The descriptor 'PubchemFP2' showed a negative correlation of -0.043023 with a p-value of 3.10e-02, making it statistically significant.
*   The descriptor 'PubchemFP826' showed a positive correlation of 0.114756 with a p-value of 7.92e-09, making it one of the most significant descriptors.
*   Conversely, 'PubchemFP1' exhibited a very weak correlation of -0.003750 and a high p-value of 0.850911, classifying it as statistically non-significant.

### Insights or Next Steps
*   The identified statistically significant descriptors can be prioritized for further investigation, such as feature selection in predictive modeling, as they show a reliable relationship with the biological activity.
*   Descriptors identified as non-significant could potentially be excluded from future models to reduce dimensionality and computational cost, unless there is a strong theoretical reason to include them despite their statistical insignificance in this context.


# Task
Calculate and display 2D and 3D molecular descriptors for the molecules in `smiles_chembl.smi` using the `padeldescriptor` function, saving the results to `2d_descriptors.csv` and `3d_descriptors.csv` respectively, and then loading and displaying the first few rows of these CSV files. Finally, summarize the successful calculation of both 2D and 3D molecular descriptors and note their potential use in further QSAR modeling.

## Calculate 2D Molecular Descriptors

### Subtask:
Utilize the `padeldescriptor` function to calculate the 1444 2D molecular descriptors for the molecules in `smiles_chembl.smi`. The results will be saved to a new CSV file, for example, `2d_descriptors.csv`.


**Reasoning**:
The user wants to calculate 2D molecular descriptors using the `padeldescriptor` function. I will call this function with the specified parameters to generate the `2d_descriptors.csv` file.



In [41]:
padeldescriptor(
    mol_dir='smiles_chembl.smi',
    d_file='2d_descriptors.csv',
    d_2d=True,
    retainorder=True
    # removesalt=True, standardizetautomers=True
)

KeyboardInterrupt: 