# Drug One-Hot Encoding for Drug-Protein Interactions

This notebook reads a Parquet file of 34,741 drug-protein interactions covering 1,028 unique drugs, builds a one-hot encoding for each unique drug, and saves the encoded drug table for model training.

## Dataset Overview
- Total interactions: 34,741
- Unique drugs: 1,028
- Encoding method: One-hot encoding
- Output: Parquet file with encoded drug data

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.2
NumPy version: 2.3.0


## 2. Load and Explore the Dataset

In [2]:
# Load the parquet file
df = pd.read_parquet('scope_onside_common_v3.parquet')

print(f"Dataset shape: {df.shape}")
print(f"Total drug-protein interactions: {len(df)}")

# Display basic information about the dataset
print("\nDataset info:")
print(df.info())

print("\nFirst few rows:")
print(df.head())

Dataset shape: (34741, 7)
Total drug-protein interactions: 34741

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34741 entries, 0 to 34740
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   drug_chembl_id     34741 non-null  object
 1   target_uniprot_id  34741 non-null  object
 2   label              34741 non-null  int64 
 3   smiles             34741 non-null  object
 4   sequence           34741 non-null  object
 5   molfile_3d         34741 non-null  object
 6   rxcui              34741 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.9+ MB
None

First few rows:
  drug_chembl_id target_uniprot_id  label  \
0     CHEMBL1000            O15245      0   
1     CHEMBL1000            P08183      1   
2     CHEMBL1000            P35367      1   
3     CHEMBL1000            Q02763      0   
4     CHEMBL1000            Q12809      0   

                                        smiles

In [3]:
# Examine column names to identify drug-related columns
print("Column names:")
for i, col in enumerate(df.columns):
    print(f"{i+1}. {col}")

print("\nColumn data types:")
print(df.dtypes)

Column names:
1. drug_chembl_id
2. target_uniprot_id
3. label
4. smiles
5. sequence
6. molfile_3d
7. rxcui

Column data types:
drug_chembl_id       object
target_uniprot_id    object
label                 int64
smiles               object
sequence             object
molfile_3d           object
rxcui                object
dtype: object


## 3. Extract Unique Drugs

In [4]:
# First, let's identify the drug identifier column and SMILES column
# Common column names for drugs: 'drug_id', 'compound_id', 'smiles', 'drug_name', etc.

# Let's check for drug-related columns
drug_related_cols = [col for col in df.columns if any(keyword in col.lower() 
                     for keyword in ['drug', 'compound', 'smiles', 'molecule', 'chemical'])]

print("Drug-related columns found:")
for col in drug_related_cols:
    print(f"- {col}")
    print(f"  Sample values: {df[col].head(3).tolist()}")
    print(f"  Unique values: {df[col].nunique()}")
    print()

Drug-related columns found:
- drug_chembl_id
  Sample values: ['CHEMBL1000', 'CHEMBL1000', 'CHEMBL1000']
  Unique values: 1028

- smiles
  Sample values: ['O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1', 'O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1', 'O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1']
  Unique values: 1028
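
Both `drug_chembl_id` and `smiles` report 1,028 unique values, which suggests, but does not prove, that each drug ID maps to exactly one SMILES string. A minimal sketch of an explicit check follows. The helper `is_one_to_one` and the toy frame are hypothetical illustrations, not part of the original notebook; with the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

# Toy stand-in for the interaction table (made-up IDs, not real ChEMBL records).
toy = pd.DataFrame({
    "drug_chembl_id": ["DRUG_A", "DRUG_A", "DRUG_B", "DRUG_C"],
    "smiles": ["CCO", "CCO", "CCN", "CCC"],
})

def is_one_to_one(frame: pd.DataFrame, left: str, right: str) -> bool:
    """True if each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = frame[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

id_smiles_one_to_one = is_one_to_one(toy, "drug_chembl_id", "smiles")

# A frame where one ID carries two different SMILES strings fails the check.
conflict = pd.concat(
    [toy, pd.DataFrame({"drug_chembl_id": ["DRUG_A"], "smiles": ["CCCC"]})],
    ignore_index=True,
)
conflict_one_to_one = is_one_to_one(conflict, "drug_chembl_id", "smiles")
```

If the check fails on the real data, taking the first SMILES per drug (as the later `groupby(...).first()` does) would silently discard the conflicting strings.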



In [5]:
# Default column names. The dataset inspected above uses 'drug_chembl_id' rather
# than 'drug_id', so the auto-detection below is what actually selects the column.
# Set DRUG_ID_COLUMN = 'drug_chembl_id' directly to skip the auto-detection.

DRUG_ID_COLUMN = 'drug_id'  # Placeholder; replaced by auto-detection if not found
SMILES_COLUMN = 'smiles'    # Matches the SMILES column in this dataset

# Check whether the assumed columns exist; otherwise fall back to auto-detection
available_cols = list(df.columns)
print("Available columns:", available_cols)

# If the assumed drug identifier column is missing, guess one from column cardinality
if DRUG_ID_COLUMN not in df.columns:
    print(f"\nColumn '{DRUG_ID_COLUMN}' not found. Please update the DRUG_ID_COLUMN variable above.")
    print("Trying to auto-detect drug identifier column...")

    # Heuristic: take the first column with between 501 and 2000 unique values.
    # Several columns (e.g. 'smiles') can qualify, so column order decides the result.
    potential_drug_cols = [col for col in df.columns if df[col].nunique() <= 2000 and df[col].nunique() > 500]
    if potential_drug_cols:
        DRUG_ID_COLUMN = potential_drug_cols[0]
        print(f"Using '{DRUG_ID_COLUMN}' as drug identifier column (has {df[DRUG_ID_COLUMN].nunique()} unique values)")
    else:
        DRUG_ID_COLUMN = df.columns[0]  # Fallback to first column
        print(f"Fallback: Using '{DRUG_ID_COLUMN}' as drug identifier column")

if SMILES_COLUMN not in df.columns:
    print(f"\nColumn '{SMILES_COLUMN}' not found. Please update the SMILES_COLUMN variable above.")
    # Try to find SMILES column
    smiles_cols = [col for col in df.columns if 'smiles' in col.lower()]
    if smiles_cols:
        SMILES_COLUMN = smiles_cols[0]
        print(f"Using '{SMILES_COLUMN}' as SMILES column")
    else:
        print("No SMILES column found. Will proceed with drug identifier only.")

Available columns: ['drug_chembl_id', 'target_uniprot_id', 'label', 'smiles', 'sequence', 'molfile_3d', 'rxcui']

Column 'drug_id' not found. Please update the DRUG_ID_COLUMN variable above.
Trying to auto-detect drug identifier column...
Using 'drug_chembl_id' as drug identifier column (has 1028 unique values)


In [6]:
# Extract unique drugs
print(f"Using '{DRUG_ID_COLUMN}' as drug identifier column")
unique_drugs = df[DRUG_ID_COLUMN].unique()
n_unique_drugs = len(unique_drugs)

EXPECTED_N_DRUGS = 1028  # From the dataset overview at the top of the notebook

print(f"\nNumber of unique drugs: {n_unique_drugs}")
print(f"Expected: {EXPECTED_N_DRUGS} unique drugs")

if n_unique_drugs != EXPECTED_N_DRUGS:
    print(f"Warning: Found {n_unique_drugs} unique drugs, but expected {EXPECTED_N_DRUGS}.")
    print("This might be due to incorrect column identification.")

# Create a DataFrame with unique drugs and their information
if SMILES_COLUMN in df.columns:
    # Get the first occurrence of each drug with its SMILES
    unique_drugs_df = df.groupby(DRUG_ID_COLUMN)[SMILES_COLUMN].first().reset_index()
    print(f"\nUnique drugs DataFrame created with columns: {list(unique_drugs_df.columns)}")
    print(f"Sample SMILES: {unique_drugs_df[SMILES_COLUMN].iloc[0]}")
else:
    # Create DataFrame with just drug identifiers
    unique_drugs_df = pd.DataFrame({DRUG_ID_COLUMN: unique_drugs})
    print(f"\nUnique drugs DataFrame created with drug identifiers only")

print(f"\nFirst 5 unique drugs:")
print(unique_drugs_df.head())

Using 'drug_chembl_id' as drug identifier column

Number of unique drugs: 1028
Expected: 1028 unique drugs

Unique drugs DataFrame created with columns: ['drug_chembl_id', 'smiles']
Sample SMILES: O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1

First 5 unique drugs:
  drug_chembl_id                                          smiles
0     CHEMBL1000     O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1
1     CHEMBL1002               CC(C)(C)NC[C@H](O)c1ccc(O)c(CO)c1
2     CHEMBL1004                  CN(C)CCOC(C)(c1ccccc1)c1ccccn1
3     CHEMBL1005  CCC(=O)N(c1ccccc1)C1(C(=O)OC)CCN(CCC(=O)OC)CC1
4     CHEMBL1009                  N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O


## 4. Create One-Hot Encoding for Drugs

In [7]:
# Create one-hot encoding for the unique drugs
print("Creating one-hot encoding for drugs...")

# Sort the unique drugs to ensure consistent ordering
unique_drugs_sorted = sorted(unique_drugs)
n_drugs = len(unique_drugs_sorted)

print(f"Number of drugs to encode: {n_drugs}")
print(f"One-hot vector dimension: {n_drugs}")

# Create a mapping from drug ID to one-hot index
drug_to_index = {drug: idx for idx, drug in enumerate(unique_drugs_sorted)}

# Create one-hot encoding matrix
# Each row represents a drug, each column represents a position in the one-hot vector
one_hot_matrix = np.eye(n_drugs, dtype=np.int8)  # Using int8 to save memory

print(f"One-hot encoding matrix shape: {one_hot_matrix.shape}")
print(f"Memory usage: {one_hot_matrix.nbytes / (1024*1024):.2f} MB")

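# Create DataFrame with drug IDs and their one-hot encodings.
# Building all one-hot columns at once avoids inserting 1,028 columns one by one,
# which fragments the DataFrame and triggers a pandas PerformanceWarning.
onehot_df = pd.DataFrame(
    one_hot_matrix,
    columns=[f'drug_onehot_{i:04d}' for i in range(n_drugs)],
)
drug_encoding_df = pd.concat(
    [pd.DataFrame({DRUG_ID_COLUMN: unique_drugs_sorted}), onehot_df],
    axis=1,
)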

print(f"\nDrug encoding DataFrame shape: {drug_encoding_df.shape}")
print(f"Columns: Drug ID + {n_drugs} one-hot features")

# Display first few drugs and their encodings (showing only first 10 one-hot columns for readability)
print(f"\nFirst 3 drugs with their one-hot encodings (showing first 10 dimensions):")
display_cols = [DRUG_ID_COLUMN] + [f'drug_onehot_{i:04d}' for i in range(min(10, n_drugs))]
print(drug_encoding_df[display_cols].head(3))

Creating one-hot encoding for drugs...
Number of drugs to encode: 1028
One-hot vector dimension: 1028
One-hot encoding matrix shape: (1028, 1028)
Memory usage: 1.01 MB

Drug encoding DataFrame shape: (1028, 1029)
Columns: Drug ID + 1028 one-hot features

First 3 drugs with their one-hot encodings (showing first 10 dimensions):
  drug_chembl_id  drug_onehot_0000  drug_onehot_0001  drug_onehot_0002  \
0     CHEMBL1000                 1                 0                 0   
1     CHEMBL1002                 0                 1                 0   
2     CHEMBL1004                 0                 0                 1   

   drug_onehot_0003  drug_onehot_0004  drug_onehot_0005  drug_onehot_0006  \
0                 0                 0                 0                 0   
1                 0                 0                 0                 0   
2                 0                 0                 0                 0   

   drug_onehot_0007  drug_onehot_0008  drug_onehot_0009  
0      

In [8]:
# Verify the one-hot encoding
print("Verifying one-hot encoding...")

# Check that each row sums to 1 (exactly one position is 1, rest are 0)
onehot_cols = [col for col in drug_encoding_df.columns if col.startswith('drug_onehot_')]
row_sums = drug_encoding_df[onehot_cols].sum(axis=1)

print(f"All rows sum to 1: {all(row_sums == 1)}")
print(f"Min row sum: {row_sums.min()}, Max row sum: {row_sums.max()}")

# Check that each column has exactly one 1 (each drug has a unique position)
col_sums = drug_encoding_df[onehot_cols].sum(axis=0)
print(f"All columns sum to 1: {all(col_sums == 1)}")
print(f"Min column sum: {col_sums.min()}, Max column sum: {col_sums.max()}")

print("\nOne-hot encoding verification complete!")

Verifying one-hot encoding...
All rows sum to 1: True
Min row sum: 1, Max row sum: 1
All columns sum to 1: True
Min column sum: 1, Max column sum: 1

One-hot encoding verification complete!
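
The `drug_to_index` mapping built in cell 7 is not used again in this notebook. One plausible use, sketched below under toy assumptions, is to look up the one-hot row for each interaction without materializing a wide DataFrame. The names `interaction_drugs`, `row_indices`, and `interaction_onehot` are hypothetical; in the notebook the drug list would come from `df[DRUG_ID_COLUMN]`.

```python
import numpy as np

# Hypothetical toy data: one drug ID per interaction row (made-up IDs).
interaction_drugs = ["DRUG_B", "DRUG_A", "DRUG_B", "DRUG_C"]

# Same construction as in the notebook: sorted unique IDs, index map, identity matrix.
unique_sorted = sorted(set(interaction_drugs))
drug_to_index = {drug: idx for idx, drug in enumerate(unique_sorted)}
one_hot_matrix = np.eye(len(unique_sorted), dtype=np.int8)

# Translate each interaction's drug into its integer position, then use
# integer-array indexing to pull the matching rows of the identity matrix.
row_indices = np.array([drug_to_index[d] for d in interaction_drugs])
interaction_onehot = one_hot_matrix[row_indices]

print(interaction_onehot.shape)  # (4, 3)
```

Storing only `row_indices` and expanding on demand keeps memory proportional to the number of interactions rather than interactions times drugs.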


## 5. Save Encoded Dataset

In [9]:
# Add SMILES information if available
if SMILES_COLUMN in df.columns:
    # Merge SMILES information with the encoded drugs
    smiles_info = df.groupby(DRUG_ID_COLUMN)[SMILES_COLUMN].first().reset_index()
    drug_encoding_df = drug_encoding_df.merge(smiles_info, on=DRUG_ID_COLUMN, how='left')
    print(f"Added SMILES information to the encoding DataFrame")

print(f"\nFinal encoded dataset shape: {drug_encoding_df.shape}")
print(f"Columns: {list(drug_encoding_df.columns[:5])} ... (showing first 5)")

# Save the encoded drugs dataset
output_filename = 'encoded_drugs_onehot.parquet'
drug_encoding_df.to_parquet(output_filename, index=False)

print(f"\nEncoded drugs saved to: {output_filename}")
print(f"Dataset contains {len(drug_encoding_df)} unique drugs")
print(f"Each drug is represented by a {n_drugs}-dimensional one-hot vector")

# Display summary statistics
print(f"\nSummary:")
print(f"- Original dataset: {len(df)} drug-protein interactions")
print(f"- Unique drugs: {len(drug_encoding_df)}")
print(f"- One-hot encoding dimensions: {n_drugs}")
import os  # required for the file-size report; it was not imported before this point
print(f"- Output file size: {os.path.getsize(output_filename) / (1024*1024):.2f} MB")

# Show final structure
print(f"\nFinal dataset structure:")
print(drug_encoding_df.info(memory_usage='deep'))

Added SMILES information to the encoding DataFrame

Final encoded dataset shape: (1028, 1030)
Columns: ['drug_chembl_id', 'drug_onehot_0000', 'drug_onehot_0001', 'drug_onehot_0002', 'drug_onehot_0003'] ... (showing first 5)

Encoded drugs saved to: encoded_drugs_onehot.parquet
Dataset contains 1028 unique drugs
Each drug is represented by a 1028-dimensional one-hot vector

Summary:
- Original dataset: 34741 drug-protein interactions
- Unique drugs: 1028
- One-hot encoding dimensions: 1028


Final dataset structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028 entries, 0 to 1027
Columns: 1030 entries, drug_chembl_id to smiles
dtypes: int8(1028), object(2)
memory usage: 1.2 MB
None


In [10]:
# Import os for file size calculation
import os

# Create a sample view of the final dataset
print("Sample of the final encoded dataset:")
print("=" * 80)

# Show structure with limited one-hot columns for readability
sample_cols = [DRUG_ID_COLUMN]
if SMILES_COLUMN in drug_encoding_df.columns:
    sample_cols.append(SMILES_COLUMN)
sample_cols.extend([f'drug_onehot_{i:04d}' for i in range(min(5, n_drugs))])

if len(sample_cols) < len(drug_encoding_df.columns):
    print(f"Showing columns: {sample_cols} ... (and {len(drug_encoding_df.columns) - len(sample_cols)} more one-hot columns)")
else:
    print(f"Showing all columns: {sample_cols}")

print(drug_encoding_df[sample_cols].head(10))

print("\n" + "=" * 80)
print("One-hot encoding completed successfully!")
print(f"Output file: {output_filename}")
print(f"Ready for model training!")

Sample of the final encoded dataset:
Showing columns: ['drug_chembl_id', 'smiles', 'drug_onehot_0000', 'drug_onehot_0001', 'drug_onehot_0002', 'drug_onehot_0003', 'drug_onehot_0004'] ... (and 1023 more one-hot columns)
  drug_chembl_id                                             smiles  \
0     CHEMBL1000        O=C(O)COCCN1CCN(C(c2ccccc2)c2ccc(Cl)cc2)CC1   
1     CHEMBL1002                  CC(C)(C)NC[C@H](O)c1ccc(O)c(CO)c1   
2     CHEMBL1004                     CN(C)CCOC(C)(c1ccccc1)c1ccccn1   
3     CHEMBL1005     CCC(=O)N(c1ccccc1)C1(C(=O)OC)CCN(CCC(=O)OC)CC1   
4     CHEMBL1009                     N[C@@H](Cc1ccc(O)c(O)c1)C(=O)O   
5     CHEMBL1016  CCOc1nc2cccc(C(=O)O)c2n1Cc1ccc(-c2ccccc2-c2nnn...   
6     CHEMBL1017  CCCc1nc2c(C)cc(-c3nc4ccccc4n3C)cc2n1Cc1ccc(-c2...   
7   CHEMBL101740                   CN(C)CC/C=C1/c2ccccc2COc2ccccc21   
8     CHEMBL1020                  Cc1ccc(C(=O)c2ccc(CC(=O)O)n2C)cc1   
9     CHEMBL1021                    NC(=O)Cc1cccc(C(=O)c2ccccc2)c1N   


## Next Steps

The one-hot encoded drug dataset is now ready for use in machine learning models. Here are some next steps you might consider:

1. **Model Training**: Use the encoded drugs as input features for your drug-protein interaction prediction model
2. **Alternative Encodings**: Compare with SMILES-to-vector encoding methods
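3. **Dimensionality Reduction**: If the 1,028-dimensional sparse vectors are impractical, consider a learned embedding layer or a similar technique. PCA is unlikely to help, because the one-hot vectors of distinct drugs are mutually orthogonal and carry no similarity structure to compress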
4. **Integration**: Merge this encoded drug data with protein features and interaction labels for training
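
The integration step can be sketched as a many-to-one merge of the encoded drug table onto the interaction table. This is a minimal illustration on toy data, assuming the same column naming as the notebook (`drug_chembl_id`, `drug_onehot_NNNN`); the frames `interactions`, `encoded`, and `training_table` are hypothetical names, and the IDs are made up.

```python
import numpy as np
import pandas as pd

# Toy interaction table with the same key and label columns as the notebook's `df`.
interactions = pd.DataFrame({
    "drug_chembl_id": ["DRUG_A", "DRUG_A", "DRUG_B"],
    "target_uniprot_id": ["P1", "P2", "P1"],
    "label": [0, 1, 1],
})

# Toy encoded drug table: one row per unique drug, one one-hot column per drug.
drug_ids = sorted(interactions["drug_chembl_id"].unique())
onehot = pd.DataFrame(
    np.eye(len(drug_ids), dtype=np.int8),
    columns=[f"drug_onehot_{i:04d}" for i in range(len(drug_ids))],
)
encoded = pd.concat([pd.DataFrame({"drug_chembl_id": drug_ids}), onehot], axis=1)

# validate="many_to_one" makes pandas raise if a drug ID appears more than once
# in the encoded table, which would otherwise duplicate interaction rows.
training_table = interactions.merge(
    encoded, on="drug_chembl_id", how="left", validate="many_to_one"
)
print(training_table.shape)  # (3, 5)
```

On the full dataset this produces a 34,741-row table with 1,028 extra columns, so the index-based lookup may be preferable when memory is tight.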

### File Output
- **File**: `encoded_drugs_onehot.parquet`
- **Content**: 1028 unique drugs with one-hot encodings
- **Format**: Parquet file for efficient storage and fast loading
- **Columns**: Drug ID, SMILES (if available), and 1028 one-hot encoded features