In [17]:
import os
# Move up one directory so the relative data paths used below (data/toy_data/...) resolve.
os.chdir('../')

from FlexMol.dataset import *

**Drug-Target Interaction (DTI) Data:**  
  You can load DTI data using the `load_DTI` function. Ensure that your file contains a header with at least the following columns:
  - `Drug`
  - `Protein`
  - `Y` (Interaction label)  
  Optionally, include a `Protein_ID` column if you plan to use a protein structure encoder.
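
Before loading a large file, it can help to confirm that the header has the columns listed above. The sketch below is not part of FlexMol: `missing_columns` is a hypothetical helper that reads only the first line with Python's `csv` module, and it assumes the same single-character delimiter you would pass to `load_DTI`.

```python
import csv

# Columns the text above lists as required for DTI data.
DTI_REQUIRED = ("Drug", "Protein", "Y")


def missing_columns(path, required=DTI_REQUIRED, delimiter=" "):
    """Return the required column names absent from the file's header line.

    Hypothetical helper for illustration, not a FlexMol function.
    """
    with open(path, newline="") as handle:
        # Read only the header row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return [name for name in required if name not in header]
```

An empty list means every required column is present. Pass a different `required` tuple to check DDI or PPI files the same way.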

In [18]:
# Protein_ID is an optional unique identifier for each protein in the dataset.
# It can link to additional information about the protein, such as its 3D structure:
# a 3D encoder might take a PDB (Protein Data Bank) ID as input to fetch the structure.

# Load Drug-Target Interaction data
DTI = load_DTI("data/toy_data/dti.txt", delimiter=" ")
print("Drug-Target Interaction data:")
print(DTI.head())

Drug-Target Interaction data:
                                                Drug  \
0  CN(C)CC1CCN2C=C(C3=CC=CC=C32)C4=C(C5=CN(CCO1)C...   
1  CN1CCC(CC1)COC2=C(C=C3C(=C2)N=CN=C3NC4=C(C=C(C...   
2  CNC(=O)C1=CC=CC=C1SC2=CC3=C(C=C2)C(=NN3)C=CC4=...   
3  CCN(CC)CCNC(=O)C1=C(NC(=C1C)C=C2C3=C(C=CC(=C3)...   
4  CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C...   

                                             Protein  Y           Protein_ID  
0  PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK...  0  ABL1-phosphorylated  
1  MAEKQKHDGRVKIGHYVLGDTLGVGTFGKVKIGEHQLTGHKVAVKI...  0          AMPK-alpha2  
2  MAALSGGGGGGAEPGQALFNGDMEPEAGAGAGAAASSAADPAIPEE...  0          BRAF(V600E)  
3  MTSSLQRPWRVPWLPWTILLVSTAAASQNQERLCAFKDPYQQDLGI...  0                BMPR2  
4  MGPEALSSLLLLLLVASGDADMKGHFDPAKCRYALGMQDRTIPDSD...  1                 DDR1  


**Drug-Drug Interaction (DDI) Data:**  
  You can load DDI data using the `load_DDI` function. This function supports any file format readable by `pd.read_csv`. Ensure that the first line of your file contains a header with at least the following columns:
  - `Drug1`
  - `Drug2`
  - `Y` (Interaction label)
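
Since the text above says `load_DDI` accepts any format that `pd.read_csv` can read, a file can be previewed with pandas first. This is a minimal sketch using an in-memory comma-separated sample with placeholder rows rather than the toy dataset. Whether `load_DDI` forwards every `pd.read_csv` keyword is an assumption that is not confirmed here.

```python
import io

import pandas as pd

# In-memory stand-in for a comma-separated DDI file (placeholder rows).
sample = io.StringIO(
    "Drug1,Drug2,Y\n"
    "CCO,CC(=O)O,0\n"
    "CCO,c1ccccc1,1\n"
)

preview = pd.read_csv(sample, delimiter=",")

# Confirm the three required columns are present before handing the file on.
required = {"Drug1", "Drug2", "Y"}
missing = required - set(preview.columns)
print("missing columns:", sorted(missing))

# Inspect how many rows carry each interaction label.
print(preview["Y"].value_counts())
```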

In [19]:
# Load Drug-Drug Interaction data
DDI = load_DDI("data/toy_data/ddi.txt", delimiter=" ")
print("\nDrug-Drug Interaction data:")
print(DDI.head())



Drug-Drug Interaction data:
                                               Drug1  \
0  CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...   
1  CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...   
2  CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...   
3  CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...   
4  CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...   

                                          Drug2     Y  
0  CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3   528  
1  CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3   420  
2  CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3   464  
3  CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3  1100  
4  CCC(=O)N(C1CCN(CC1)CCC2=CC=CC=C2)C3=CC=CC=C3   207  


**Protein-Protein Interaction (PPI) Data:**  
  You can load PPI data using the `load_PPI` function. Ensure that your file contains a header with at least the following columns:
  - `Protein1`
  - `Protein2`
  - `Y` (Interaction label)  
  Optionally, include `Protein1_ID` and `Protein2_ID` columns if you plan to use protein structure encoders.

In [20]:
# Load Protein-Protein Interaction data
# Optional Protein1_ID and Protein2_ID columns can be included for 3D encoders that require PDB as input.
PPI = load_PPI("data/toy_data/ppi.txt", delimiter=" ")
print("\nProtein-Protein Interaction data:")
print(PPI.head())


Protein-Protein Interaction data:
                                            Protein1  \
0  MTITVGDAVSETELENKSQNVVLSPKASASSDISTDVDKDTSSSWD...   
1  MSFKATITESGKQNIWFRAIYVLSTIQDDIKITVTTNELIAWSMNE...   
2  MSAKAEKKPASKAPAEKKPAAKKTSTSTDGKKRSKARKETYSSYIY...   
3  MSKIDSVLIIGGSGFLGLHLIQQFFDINPKPDIHIFDVRDLPEKLS...   
4  MDQQAAYSTPYKKNTLSCTMSATLKDYLNKRVVIIKVDGECLIASL...   

                                            Protein2  Y Protein1_ID  \
0  MSRAVGIDLGTTYSCVAHFSNDRVEIIANDQGNRTTPSYVAFTDTE...  1      P53049   
1  MGQLLSHPLTEKTIEYNEYKNNQASTGIVPRFYNCVGSMQGYRLTQ...  1      Q08949   
2  MSLSSKLSVQDLDLKDKRVFIRVDFNVPLDGKKITSNQRIVAALPT...  1      P02293   
3  MSTPTAADRAKALERKNEGNVFVKEKHFLKAIEKYTEAIDLDSTQS...  1      P53199   
4  MAGAPAPPPPPPPPALGGSAPKPAKSVMQGRDALLGDIRKGMKLKK...  1      P47093   

  Protein2_ID  
0      P09435  
1      P38089  
2      P00560  
3      P53043  
4      P37370  


In [21]:
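# Passing split_frac returns train, validation, and test DataFrames (here 60/20/20) instead of a single DataFrame.
train_df, val_df, test_df = load_PPI("data/toy_data/ppi.txt", delimiter=" ", split_frac=[0.6, 0.2, 0.2])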
print(train_df.head())
print(val_df.head())
print(test_df.head())

                                            Protein1  \
0  MDQQAAYSTPYKKNTLSCTMSATLKDYLNKRVVIIKVDGECLIASL...   
1  MSVPAIAPRRKRLADGLSVTQKVFVRSRNGGATKIVREHYLRSDIP...   
2  MSKVMKPSNGKGSRKSSKAATPDTKNFFHAKKKDPVNQDKANNASQ...   
3  MANPFSRWFLSERPPNCHVADLETSLDPHQTLLKVQKYKPALSDWV...   
4  MDNLQVSDIETALQCISSTASQDDKNKALQFLEQFQRSTVAWSICN...   

                                            Protein2  Y Protein1_ID  \
0  MAGAPAPPPPPPPPALGGSAPKPAKSVMQGRDALLGDIRKGMKLKK...  1      P47093   
1  MSEEQTAIDSPPSTVEGSVETVTTIDSPSTTASTIAATAEEHPQLE...  1      Q08162   
2  MINESVSKREGFHESISRETSASNALGLYNKFNDERNPRYRTMIAE...  1      P53628   
3  MAGKKGQKKSGLGNHGKNSDMDVEDRLQAVVLTDSYETRFMPLTAV...  1      P36107   
4  MSSRVCYHINGPFFIIKLIDPKHLNSLTFEDFVYIALLLHKANDID...  1      Q99189   

  Protein2_ID  
0      P37370  
1      P06105  
2      P47050  
3      P32501  
4      Q08558  
                                            Protein1  \
0  MEDIEKIKPYVRSFSKALDELKPEIEKLTSKSLDEQLLLLSDERAK...   
1  MSSSLLSVLKEKSRSLKIRNKPVKM
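
The `split_frac` argument in the cell above makes the loader return three DataFrames instead of one. FlexMol's splitting logic is not shown in this notebook, so the sketch below only illustrates what a shuffled 60/20/20 split means in plain pandas. `split_frame` and its `seed` parameter are hypothetical names.

```python
import pandas as pd


def split_frame(df, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle df and cut it into train/validation/test parts.

    Hypothetical illustration, not FlexMol's implementation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Shuffle all rows, then renumber them from 0.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_end = int(fractions[0] * len(shuffled))
    val_end = train_end + int(fractions[1] * len(shuffled))
    train = shuffled.iloc[:train_end]
    val = shuffled.iloc[train_end:val_end].reset_index(drop=True)
    # The test part takes whatever rows remain after train and validation.
    test = shuffled.iloc[val_end:].reset_index(drop=True)
    return train, val, test


# Ten placeholder rows with the PPI column names.
toy = pd.DataFrame({"Protein1": list("ABCDEFGHIJ"),
                    "Protein2": list("KLMNOPQRST"),
                    "Y": [0, 1] * 5})
train_part, val_part, test_part = split_frame(toy)
print(len(train_part), len(val_part), len(test_part))
```

Each part has its index reset to start at 0, which is consistent with the printed output above, where every split begins at row 0.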

Alternatively, you can load custom data directly into a pandas DataFrame using any method you prefer. As long as the DataFrame contains the columns required for its task (`Drug`, `Protein`, `Y` for DTI; `Drug1`, `Drug2`, `Y` for DDI; `Protein1`, `Protein2`, `Y` for PPI), you can proceed with the FlexMol pipeline.
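
As a minimal sketch of that route, the DataFrame below is built in memory with the DTI column names listed earlier. The SMILES strings, sequences, and labels are short placeholders chosen for illustration, not real interaction data.

```python
import pandas as pd

# Placeholder rows: simple SMILES strings, toy sequences, dummy labels.
custom_dti = pd.DataFrame(
    {
        "Drug": ["CCO", "CC(=O)O", "c1ccccc1"],
        "Protein": ["MKTAYIAK", "MSEEQTAI", "MAGAPAPP"],
        "Y": [1, 0, 1],
    }
)

# Sanity checks worth running before passing the frame on.
assert {"Drug", "Protein", "Y"} <= set(custom_dti.columns)
assert not custom_dti[["Drug", "Protein", "Y"]].isna().any().any()
print(custom_dti)
```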