# Step 2: Preprocess and Encode DAVIS/KIBA Datasets

This notebook preprocesses the DAVIS and KIBA datasets and generates protein embeddings with two methods: your custom one-hot encoding and ProtBERT.

**Workflow:**
- Load DAVIS and KIBA datasets from the extracted DeepDTA folder.
- Clean and format protein sequences and drug SMILES as needed.
- Apply your custom one-hot encoding to all protein sequences.
- Apply ProtBERT embedding to all protein sequences.
- Save the resulting embeddings for downstream evaluation.

---

In [1]:
# Import required libraries
import os
import pandas as pd
# Add any other imports needed for preprocessing and encoding

## 1. Load DAVIS and KIBA datasets
Adjust the paths and file names below to match your local copy of the extracted DeepDTA folder.

In [None]:
davis_path = 'data/external/DeepDTA-master/data/DAVIS/'
kiba_path = 'data/external/DeepDTA-master/data/KIBA/'
davis_df = pd.read_csv(os.path.join(davis_path, 'davis.csv'))
kiba_df = pd.read_csv(os.path.join(kiba_path, 'kiba.csv'))
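If the folder layout of your download differs from the paths above, `pd.read_csv` stops with a bare `FileNotFoundError`. A small guard can make that easier to diagnose. The sketch below is one possible approach; `load_table` is a hypothetical helper introduced here for illustration, not part of DeepDTA or pandas.

```python
import os
import pandas as pd


def load_table(folder, filename):
    """Read a CSV from `folder`, listing the folder's contents if the file is missing."""
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        # Show what is actually there so the path or file name can be corrected.
        found = sorted(os.listdir(folder)) if os.path.isdir(folder) else []
        raise FileNotFoundError(f"{path} not found; folder contains: {found}")
    return pd.read_csv(path)
```

With this helper, the loading lines could be written as `davis_df = load_table(davis_path, 'davis.csv')` and `kiba_df = load_table(kiba_path, 'kiba.csv')`.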

## 2. Preprocess protein sequences and drug SMILES
- Clean sequences (e.g. remove invalid characters, pad or crop to a fixed length).
- Prepare the cleaned sequences and SMILES strings for the encoding steps below.

In [None]:
# Your preprocessing code here
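As a starting point for the protein side of this step, the following is a minimal sketch of a cleaning function. It assumes the 20 standard amino acids as the valid alphabet, replaces anything else with `X`, and crops to `max_len`. The name `clean_sequence` and the default `max_len=1000` are illustrative choices, not values taken from this project; SMILES strings are left untouched here.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_sequence(seq, max_len=1000, unknown="X"):
    """Uppercase, replace non-standard residues with `unknown`, and crop to `max_len`."""
    seq = "".join(seq.split()).upper()  # drop whitespace and line breaks
    cleaned = "".join(ch if ch in VALID_AA else unknown for ch in seq)
    return cleaned[:max_len]
```

Padding is deliberately left to the encoder, since the appropriate padding value depends on the encoding used.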

## 3. Generate custom one-hot encodings
Apply your one-hot encoding function to all protein sequences.

In [None]:
# Example: one_hot_embeddings = [your_one_hot_encode(seq) for seq in protein_sequences]
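If you do not yet have a one-hot function, the sketch below shows one common formulation: each residue becomes a 20-dimensional indicator vector, and sequences are padded with zero rows up to a fixed length. This is a generic stand-in, not the custom encoding this notebook refers to. The names `one_hot_encode` and `AMINO_ACIDS` and the default `max_len=1000` are assumptions made for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len=1000):
    """Return a (max_len, 20) float32 array; unknown residues and padding stay all-zero."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded
```

Because every output has the same shape, the per-protein arrays can be combined with `np.stack([one_hot_encode(s) for s in protein_sequences])` into a single array of shape `(n_proteins, max_len, 20)`.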

## 4. Generate ProtBERT embeddings
Apply ProtBERT to all protein sequences.

*Tip: Use the Hugging Face Transformers library for ProtBERT.*

In [None]:
# Example: Use transformers to load ProtBERT and generate embeddings
# from transformers import BertModel, BertTokenizer
# ...
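A fuller version of the commented example might look like the sketch below. It assumes the `Rostlab/prot_bert` checkpoint on the Hugging Face Hub, which expects uppercase residues separated by spaces with the rare residues U, Z, O and B mapped to X. Mean pooling over tokens is one of several ways to get a fixed-size vector per protein and is chosen here only for simplicity. The helper names `format_for_protbert` and `protbert_embed` are hypothetical.

```python
import re


def format_for_protbert(seq):
    """Map rare residues (U, Z, O, B) to X and separate residues with spaces."""
    return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))


def protbert_embed(sequences, model_name="Rostlab/prot_bert", device="cpu"):
    """Return one mean-pooled embedding per sequence as a (n, hidden_size) array."""
    # Imported here so the formatting helper works without torch installed.
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = BertModel.from_pretrained(model_name).to(device).eval()

    batch = tokenizer([format_for_protbert(s) for s in sequences],
                      padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (n, tokens, hidden_size)

    # Average over non-padding tokens (this includes [CLS] and [SEP]).
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.cpu().numpy()
```

This version encodes all sequences in a single batch, which is fine for a quick check. For the full protein lists it is usually safer to call `protbert_embed` on small chunks and concatenate the results, since long sequences use a lot of memory.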

## 5. Save embeddings for downstream evaluation
Save both one-hot and ProtBERT embeddings for use in the evaluation notebook.

In [None]:
# Example: save embeddings as .csv (2-D arrays only, one row per protein) or .npy (any shape)
# pd.DataFrame(one_hot_embeddings).to_csv('davis_onehot_embeddings.csv', index=False)
# pd.DataFrame(protbert_embeddings).to_csv('davis_protbert_embeddings.csv', index=False)
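CSV suits flat, one-row-per-protein embeddings, but per-residue one-hot arrays are three-dimensional and do not fit a `DataFrame`. As an alternative, the sketch below saves any array shape in NumPy's `.npy` format. The helper `save_embeddings` and the output directory name used in the usage note are hypothetical.

```python
import os
import numpy as np


def save_embeddings(embeddings, out_dir, name):
    """Write `embeddings` to <out_dir>/<name>.npy and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}.npy")
    np.save(path, np.asarray(embeddings))
    return path
```

For example, `save_embeddings(one_hot_embeddings, 'embeddings', 'davis_onehot')` writes a file that the evaluation notebook can read back with `np.load`.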