<a href="https://colab.research.google.com/github/MinamiNaoya/ExperimentTools/blob/main/leash_ecfps_and_random_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leash Tutorial - ECFPs and Random Forest
## Introduction

There are many ways to represent molecules for machine learning.

In this tutorial we will go through one of the simplest: ECFPs [[1]](https://pubs.acs.org/doi/10.1021/ci100050t) and Random Forest. This technique is surprisingly powerful, and on previous benchmarks often gets uncomfortably close to the state of the art.

First, molecule graphs are broken into bags of subgraphs of varying sizes.

![ecfp featurizing process (chemaxon)](https://docs.chemaxon.com/display/docs/images/download/attachments/1806333/ecfp_generation.png)
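To make the subgraph step concrete, here is a toy sketch of the circular-neighborhood idea in plain Python. It is not RDKit's implementation: the helper names (`stable_hash`, `circular_identifiers`) are made up for illustration, bond orders and hydrogens are ignored, and real ECFPs use richer atom invariants and remove structurally duplicate substructures.

```python
import hashlib

def stable_hash(obj) -> int:
    """Deterministic 32-bit hash of a printable object (unlike built-in hash())."""
    digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def circular_identifiers(atoms, bonds, radius=2):
    """Return the set of neighborhood identifiers seen at radii 0..radius.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom index pairs (bond orders ignored in this toy)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is identified by its element alone.
    current = [stable_hash(symbol) for symbol in atoms]
    seen = set(current)

    for _ in range(radius):
        # An atom's new identifier combines its own identifier with the sorted
        # identifiers of its neighbors, so it covers a larger neighborhood
        # with every iteration.
        current = [
            stable_hash((current[i], tuple(sorted(current[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        seen.update(current)
    return seen

# Ethanol (C-C-O) and dimethyl ether (C-O-C), heavy atoms only.
ethanol = circular_identifiers(["C", "C", "O"], [(0, 1), (1, 2)])
ether = circular_identifiers(["C", "O", "C"], [(0, 1), (1, 2)])
print(len(ethanol), len(ether), len(ethanol & ether))
```

The two molecules share their radius-0 identifiers (same elements) but diverge once the neighborhoods grow, which is what lets the fingerprint tell isomers apart.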

Then the bag of subgraphs is hashed into a bit vector.

![hashing process (chemaxon)](https://docs.chemaxon.com/display/docs/images/download/attachments/1806333/ecfp_folding.png)

This can be thought of as analogous to the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing) [[2]](https://alex.smola.org/papers/2009/Weinbergeretal09.pdf) on bag of words for NLP problems, from the days before transformers.
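The folding step can be sketched in a few lines. This is only an illustration of the hashing trick, assuming made-up substructure labels and a hypothetical helper `fold_to_bits`; RDKit uses its own hashing internally.

```python
import hashlib

def fold_to_bits(substructures, n_bits=16):
    """Map each substructure label to one position of a fixed-length bit vector."""
    bits = [0] * n_bits
    for label in substructures:
        digest = hashlib.md5(label.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = 1  # labels that collide simply share a bit
    return bits

bag = ["C", "O", "C-C", "C-O", "C-C-O"]  # made-up substructure labels
vector = fold_to_bits(bag, n_bits=16)
print(vector, sum(vector))
```

The vector length is fixed no matter how many substructures the molecule has, at the cost of occasional collisions where two different substructures set the same bit.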

We use RDKit, an open-source cheminformatics toolkit, to generate the ECFP features as hashed bit vectors. We can install it as follows:

In [1]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:19
🔁 Restarting kernel...


In [1]:
!pip install rdkit


The training set is pretty big, but we can treat the parquet files as databases using duckdb. We will use this to sample down to a smaller dataset for demonstration purposes. Let's install duckdb as well.

In [2]:
!pip install duckdb

Collecting duckdb
  Downloading duckdb-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (762 bytes)
Downloading duckdb-1.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.5 MB)
Installing collected packages: duckdb
Successfully installed duckdb-1.0.0

In [1]:
!pip install kaggle

from google.colab import drive
drive.mount('/content/drive')

import os
import json

# Read the Kaggle API credentials from the kaggle.json stored on Google Drive
with open("/content/drive/MyDrive/kaggle.json", 'r') as f:
    json_data = json.load(f)
os.environ['KAGGLE_USERNAME'] = json_data['username']
os.environ['KAGGLE_KEY'] = json_data['key']

Mounted at /content/drive


In [2]:
!kaggle competitions download -c leash-BELKA

Downloading leash-BELKA.zip to /content
100% 4.15G/4.16G [00:52<00:00, 124MB/s]
100% 4.16G/4.16G [00:52<00:00, 85.4MB/s]


In [3]:
!unzip '/content/leash-BELKA.zip'

Archive:  /content/leash-BELKA.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: test.parquet            
  inflating: train.csv               
  inflating: train.parquet           


## Data Preparation

The paths to the training and test .parquet files are defined below. We use duckdb to scan the large training set. Just to get started, let's sample an equal number of positives and negatives.

The query selects 40,000 random rows where `binds` equals 0 (non-binding) and 40,000 where it equals 1 (binding), so that the model is not biased towards one class.


In [4]:
import duckdb
import pandas as pd

train_path = '/content/train.parquet'
test_path = '/content/test.parquet'

con = duckdb.connect()

df = con.query(f"""(SELECT *
                        FROM parquet_scan('{train_path}')
                        WHERE binds = 0
                        ORDER BY random()
                        LIMIT 40000)
                        UNION ALL
                        (SELECT *
                        FROM parquet_scan('{train_path}')
                        WHERE binds = 1
                        ORDER BY random()
                        LIMIT 40000)""").df()

con.close()


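## sEH
Epoxide hydrolase 2 is encoded by the EPHX2 genetic locus, and its protein product is commonly named "soluble epoxide hydrolase", abbreviated sEH. Hydrolases are enzymes that catalyze certain chemical reactions, and EPHX2/sEH also hydrolyzes certain phosphate groups. EPHX2/sEH is a potential drug target for high blood pressure and diabetes progression, and small molecules inhibiting EPHX2/sEH from earlier DEL efforts made it to clinical trials.
## BRD4
Bromodomain 4 is encoded by the BRD4 locus, and its protein product is also named BRD4. Bromodomains bind to the protein spools in the nucleus that DNA wraps around (called histones) and affect the likelihood that nearby DNA will be transcribed, producing new gene products. Bromodomains play roles in cancer progression, and a number of drugs have been discovered that inhibit their activity.
## ALB
The third target, serum albumin, is encoded by the ALB locus, and its protein product is also named ALB. The protein product is sometimes abbreviated HSA, for "human serum albumin". ALB, the most common protein in the blood, drives osmotic pressure (bringing fluid back from tissues into blood vessels) and transports many ligands, hormones, fatty acids, and more. The ALB that was screened was purchased from Active Motif. For participants who want to incorporate protein structural information into their submissions, the amino acid sequence is positions 25 to 609 of UniProt entry P02768, the crystal structure is PDB entry 1AO6, and the predicted structure is AlphaFold2 entry P02768. Additional ALB crystal structures with bound ligands can be found in the PDB.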

In [5]:
df

Unnamed: 0,id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds
0,233070315,O=C(O)C1CCCN1C(=O)OCC1c2ccccc2-c2ccccc21,Nc1ccncn1,Cl.NCC1CC2(C1)CC2(F)F,O=C(N[Dy])C1CCCN1c1nc(NCC2CC3(C2)CC3(F)F)nc(Nc...,BRD4,0
1,49602117,Cc1cc(Br)cc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,Cl.NCC1CCOCC12CCCC2,Nc1cncnc1,Cc1cc(Br)cc(C(=O)N[Dy])c1Nc1nc(NCC2CCOCC23CCCC...,BRD4,0
2,187558285,O=C(Nc1ccc(C(=O)O)cc1)OCC1c2ccccc2-c2ccccc21,CNC(=O)c1cc(Oc2ccc(N)cc2)ccn1,NCc1ccsc1,CNC(=O)c1cc(Oc2ccc(Nc3nc(NCc4ccsc4)nc(Nc4ccc(C...,HSA,0
3,195792980,O=C(Nc1ccc(C(=O)O)nc1)OCC1c2ccccc2-c2ccccc21,NCCN1CC2CCC1C2,Cl.NCCn1cnc2sccc2c1=O,O=C(N[Dy])c1ccc(Nc2nc(NCCN3CC4CCC3C4)nc(NCCn3c...,sEH,0
4,41079344,COc1nccc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,NCCC1SCCS1,Cl.NCCC1CC1,COc1nccc(C(=O)N[Dy])c1Nc1nc(NCCC2CC2)nc(NCCC2S...,sEH,0
...,...,...,...,...,...,...,...
79995,244803662,O=C(O)C[C@@H](Cc1ccc(C(F)(F)F)cc1)NC(=O)OCC1c2...,NCCC1CSC1,NCc1cccs1,O=C(C[C@@H](Cc1ccc(C(F)(F)F)cc1)Nc1nc(NCCC2CSC...,sEH,1
79996,83266094,O=C(NC[C@H]1CC[C@H](C(=O)O)CC1)OCC1c2ccccc2-c2...,Cl.Cl.NCC=Cc1cccnc1,COC(=O)c1cncc(N)c1,COC(=O)c1cncc(Nc2nc(NCC=Cc3cccnc3)nc(NC[C@H]3C...,sEH,1
79997,17664114,CC(C)CC(NC(=O)OCC1c2ccccc2-c2ccccc21)C(=O)O,COc1c(F)ccc(F)c1CN.Cl,Cc1ncccc1N,COc1c(F)ccc(F)c1CNc1nc(Nc2cccnc2C)nc(NC(CC(C)C...,BRD4,1
79998,187453674,O=C(Nc1ccc(C(=O)O)cc1)OCC1c2ccccc2-c2ccccc21,CC(F)(F)CN.Cl,Cc1cc(N)ncn1,Cc1cc(Nc2nc(NCC(C)(F)F)nc(Nc3ccc(C(=O)N[Dy])cc...,BRD4,1


### Downloading the PDB files

In [6]:
!pip install biopython
import time
import urllib.request
from Bio.PDB import PDBList

# PDB IDs per target -> 1ao6: HSA, 7jkz: BRD4, 3i1y: sEH
pdb_ids = ['1ao6', '7jkz', '3i1y']

def download_file(url, dst_path):
    """Download url and write the raw bytes to dst_path."""
    with urllib.request.urlopen(url) as web_file:
        with open(dst_path, 'wb') as local_file:
            local_file.write(web_file.read())

# AlphaFold predicted structure for HSA (UniProt P02768)
url = "https://alphafold.ebi.ac.uk/files/AF-P02768-F1-model_v4.pdb"

dst_path = "AF-P02768-F1-model_v4.pdb"
download_file(url, dst_path)

pdbl = PDBList()

# Fetch each entry in mmCIF format (saved as pdb_files/<id>.cif), pausing between requests
for pdb_id in pdb_ids:
    pdbl.retrieve_pdb_file(pdb_id, pdir='pdb_files/', file_format='mmCif')
    time.sleep(10)




Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: biopython
Successfully installed biopython-1.83
Downloading PDB structure '1ao6'...
Downloading PDB structure '7jkz'...
Downloading PDB structure '3i1y'...


In [7]:
from Bio.PDB import PDBParser

# Parse the AlphaFold model and print the residue names of every chain
pdb_parser = PDBParser(QUIET=True)
structure = pdb_parser.get_structure('X', '/content/AF-P02768-F1-model_v4.pdb')

for model in structure.get_list():
    for chain in model.get_list():
        print(chain.get_id())
        for residue in chain.get_list():
            print(residue.get_resname(), end=' ')

A
MET LYS TRP VAL THR PHE ILE SER LEU LEU PHE LEU PHE SER SER ALA TYR SER ARG GLY VAL PHE ARG ARG ASP ALA HIS LYS SER GLU VAL ALA HIS ARG PHE LYS ASP LEU GLY GLU GLU ASN PHE LYS ALA LEU VAL LEU ILE ALA PHE ALA GLN TYR LEU GLN GLN CYS PRO PHE GLU ASP HIS VAL LYS LEU VAL ASN GLU VAL THR GLU PHE ALA LYS THR CYS VAL ALA ASP GLU SER ALA GLU ASN CYS ASP LYS SER LEU HIS THR LEU PHE GLY ASP LYS LEU CYS THR VAL ALA THR LEU ARG GLU THR TYR GLY GLU MET ALA ASP CYS CYS ALA LYS GLN GLU PRO GLU ARG ASN GLU CYS PHE LEU GLN HIS LYS ASP ASP ASN PRO ASN LEU PRO ARG LEU VAL ARG PRO GLU VAL ASP VAL MET CYS THR ALA PHE HIS ASP ASN GLU GLU THR PHE LEU LYS LYS TYR LEU TYR GLU ILE ALA ARG ARG HIS PRO TYR PHE TYR ALA PRO GLU LEU LEU PHE PHE ALA LYS ARG TYR LYS ALA ALA PHE THR GLU CYS CYS GLN ALA ALA ASP LYS ALA ALA CYS LEU LEU PRO LYS LEU ASP GLU LEU ARG ASP GLU GLY LYS ALA SER SER ALA LYS GLN ARG LEU LYS CYS ALA SER LEU GLN LYS PHE GLY GLU ARG ALA PHE LYS ALA TRP ALA VAL ALA ARG LEU SER GLN ARG PHE PRO LYS AL

## Feature Preprocessing

Let's grab the SMILES for the fully assembled molecule, `molecule_smiles`, and generate ECFPs for it. We could choose a different radius or number of bits, but a radius of 2 and 1024 bits is pretty standard.
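To get a feel for what the bit count trades off, here is a small back-of-the-envelope calculation. It assumes identifiers are hashed uniformly at random into the vector, and the figure of 60 distinct substructure identifiers per molecule is an illustrative assumption, not a number measured on this dataset.

```python
def expected_bits_set(n_identifiers: int, n_bits: int) -> float:
    """Expected number of distinct bits set when n_identifiers are hashed
    uniformly at random into a vector of n_bits positions."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_identifiers)

n_identifiers = 60  # assumed count, for illustration only
for n_bits in (512, 1024, 2048):
    bits_on = expected_bits_set(n_identifiers, n_bits)
    lost_to_collisions = n_identifiers - bits_on
    print(n_bits, round(bits_on, 1), round(lost_to_collisions, 1))
```

Doubling the vector length reduces collisions but also doubles the number of feature columns the model has to handle.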

In [8]:
from rdkit import Chem
from rdkit.Chem import AllChem

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import OneHotEncoder


# Convert SMILES to RDKit molecules
df['molecule'] = df['molecule_smiles'].apply(Chem.MolFromSmiles)

# Generate ECFPs
def generate_ecfp(molecule, radius=2, bits=1024):
    if molecule is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(molecule, radius, nBits=bits))



df['ecfp'] = df['molecule'].apply(generate_ecfp)

Streaming output truncated to the last 5000 lines.


In [9]:
# Load the ChemBERTa tokenizer and encoder.
# transformers does not export a `RobertaForRegression` class (importing it raises
# ImportError), so AutoModel is used here to load the RoBERTa encoder without the
# regression head.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MTR")
model = AutoModel.from_pretrained("DeepChem/ChemBERTa-77M-MTR")

In [10]:
def protein_name_to_pdb(protein_name):
  if protein_name == 'BRD4':
    return '/content/pdb_files/7jkz.cif'
  if protein_name == 'HSA':
    return '/content/pdb_files/1ao6.cif'
  if protein_name == 'sEH':
    return '/content/pdb_files/3i1y.cif'

df['pdb_file_path'] = df['protein_name'].apply(protein_name_to_pdb)

In [11]:
df['pdb_file_path']

0        /content/pdb_files/7jkz.cif
1        /content/pdb_files/7jkz.cif
2        /content/pdb_files/1ao6.cif
3        /content/pdb_files/3i1y.cif
4        /content/pdb_files/3i1y.cif
                    ...             
79995    /content/pdb_files/3i1y.cif
79996    /content/pdb_files/3i1y.cif
79997    /content/pdb_files/7jkz.cif
79998    /content/pdb_files/7jkz.cif
79999    /content/pdb_files/3i1y.cif
Name: pdb_file_path, Length: 80000, dtype: object

In [12]:

from Bio.PDB import MMCIFParser

mmcif_parser = MMCIFParser(QUIET=True)

def get_amino_list_from_pdb(pdb_file_path) -> list:
    """Return one string of concatenated three-letter residue names per chain."""
    structure = mmcif_parser.get_structure('X', pdb_file_path)

    sequences = []
    for model in structure:
        for chain in model:
            seq = ''
            for residue in chain:
                # Keep standard residues only (skip hetero residues and waters)
                if residue.id[0] == ' ':
                    seq += residue.get_resname()
            sequences.append(seq)
    return sequences

def process_sequence(sequence: str) -> list:
    """Split a concatenated residue string into three-letter codes."""
    return [sequence[i:i+3] for i in range(0, len(sequence), 3)]

sequences = get_amino_list_from_pdb('/content/pdb_files/1ao6.cif')
processed_sequence = process_sequence(sequences[0])
print(processed_sequence)


['SER', 'GLU', 'VAL', 'ALA', 'HIS', 'ARG', 'PHE', 'LYS', 'ASP', 'LEU', 'GLY', 'GLU', 'GLU', 'ASN', 'PHE', 'LYS', 'ALA', 'LEU', 'VAL', 'LEU', 'ILE', 'ALA', 'PHE', 'ALA', 'GLN', 'TYR', 'LEU', 'GLN', 'GLN', 'CYS', 'PRO', 'PHE', 'GLU', 'ASP', 'HIS', 'VAL', 'LYS', 'LEU', 'VAL', 'ASN', 'GLU', 'VAL', 'THR', 'GLU', 'PHE', 'ALA', 'LYS', 'THR', 'CYS', 'VAL', 'ALA', 'ASP', 'GLU', 'SER', 'ALA', 'GLU', 'ASN', 'CYS', 'ASP', 'LYS', 'SER', 'LEU', 'HIS', 'THR', 'LEU', 'PHE', 'GLY', 'ASP', 'LYS', 'LEU', 'CYS', 'THR', 'VAL', 'ALA', 'THR', 'LEU', 'ARG', 'GLU', 'THR', 'TYR', 'GLY', 'GLU', 'MET', 'ALA', 'ASP', 'CYS', 'CYS', 'ALA', 'LYS', 'GLN', 'GLU', 'PRO', 'GLU', 'ARG', 'ASN', 'GLU', 'CYS', 'PHE', 'LEU', 'GLN', 'HIS', 'LYS', 'ASP', 'ASP', 'ASN', 'PRO', 'ASN', 'LEU', 'PRO', 'ARG', 'LEU', 'VAL', 'ARG', 'PRO', 'GLU', 'VAL', 'ASP', 'VAL', 'MET', 'CYS', 'THR', 'ALA', 'PHE', 'HIS', 'ASP', 'ASN', 'GLU', 'GLU', 'THR', 'PHE', 'LEU', 'LYS', 'LYS', 'TYR', 'LEU', 'TYR', 'GLU', 'ILE', 'ALA', 'ARG', 'ARG', 'HIS', 'PRO'

In [None]:
#df['amino_seq'] = df['pdb_file_path'].apply(get_amino_list_from_pdb)  # too slow: re-parses the structure file for every row

In [13]:
# To reduce computation, parse each structure once and reuse the result

sequences_1ao6 = get_amino_list_from_pdb('/content/pdb_files/1ao6.cif')
processed_sequence_1ao6 = process_sequence(sequences_1ao6[0])

sequences_3i1y = get_amino_list_from_pdb('/content/pdb_files/3i1y.cif')
processed_sequence_3i1y = process_sequence(sequences_3i1y[0])

sequences_7jkz = get_amino_list_from_pdb('/content/pdb_files/7jkz.cif')
processed_sequence_7jkz = process_sequence(sequences_7jkz[0])
# 7jkz: BRD4, 1ao6: HSA, 3i1y: sEH
pdb_dict = {
    'HSA': processed_sequence_1ao6,
    'sEH': processed_sequence_3i1y,
    'BRD4': processed_sequence_7jkz
}
pdb_dict_amino = {
    'HSA': sequences_1ao6,
    'sEH': sequences_3i1y,
    'BRD4': sequences_7jkz
}


## About DTI (Drug-Target Interaction)
https://www.nature.com/articles/s41598-023-30026-y

In [14]:
df['amino_seq_list'] = df['protein_name'].apply(lambda x: pdb_dict.get(x, x))

In [15]:
df['amino_seq'] = df['protein_name'].apply(lambda x: pdb_dict_amino.get(x, x))

In [16]:
df.head()

Unnamed: 0,id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds,molecule,ecfp,pdb_file_path,amino_seq_list,amino_seq
0,233070315,O=C(O)C1CCCN1C(=O)OCC1c2ccccc2-c2ccccc21,Nc1ccncn1,Cl.NCC1CC2(C1)CC2(F)F,O=C(N[Dy])C1CCCN1c1nc(NCC2CC3(C2)CC3(F)F)nc(Nc...,BRD4,0,<rdkit.Chem.rdchem.Mol object at 0x7fb210fae0a0>,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...
1,49602117,Cc1cc(Br)cc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,Cl.NCC1CCOCC12CCCC2,Nc1cncnc1,Cc1cc(Br)cc(C(=O)N[Dy])c1Nc1nc(NCC2CCOCC23CCCC...,BRD4,0,<rdkit.Chem.rdchem.Mol object at 0x7fb210fae810>,"[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...
2,187558285,O=C(Nc1ccc(C(=O)O)cc1)OCC1c2ccccc2-c2ccccc21,CNC(=O)c1cc(Oc2ccc(N)cc2)ccn1,NCc1ccsc1,CNC(=O)c1cc(Oc2ccc(Nc3nc(NCc4ccsc4)nc(Nc4ccc(C...,HSA,0,<rdkit.Chem.rdchem.Mol object at 0x7fb2110003c0>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",/content/pdb_files/1ao6.cif,"[SER, GLU, VAL, ALA, HIS, ARG, PHE, LYS, ASP, ...",[SERGLUVALALAHISARGPHELYSASPLEUGLYGLUGLUASNPHE...
3,195792980,O=C(Nc1ccc(C(=O)O)nc1)OCC1c2ccccc2-c2ccccc21,NCCN1CC2CCC1C2,Cl.NCCn1cnc2sccc2c1=O,O=C(N[Dy])c1ccc(Nc2nc(NCCN3CC4CCC3C4)nc(NCCn3c...,sEH,0,<rdkit.Chem.rdchem.Mol object at 0x7fb211000430>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...
4,41079344,COc1nccc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,NCCC1SCCS1,Cl.NCCC1CC1,COc1nccc(C(=O)N[Dy])c1Nc1nc(NCCC2CC2)nc(NCCC2S...,sEH,0,<rdkit.Chem.rdchem.Mol object at 0x7fb2110004a0>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...


In [17]:
df['amino_seq_list']

0        [LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...
1        [LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...
2        [SER, GLU, VAL, ALA, HIS, ARG, PHE, LYS, ASP, ...
3        [ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...
4        [ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...
                               ...                        
79995    [ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...
79996    [ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...
79997    [LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...
79998    [LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...
79999    [ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...
Name: amino_seq_list, Length: 80000, dtype: object

In [18]:
df['amino_seq_str'] = df['amino_seq_list'].apply(lambda x: ','.join(x))

## Train Model

In [19]:
# One-hot encode the protein via its amino-acid sequence string.
# Each protein maps to a single distinct string, so this yields three columns
# and carries the same information as one-hot encoding protein_name.
onehot_encoder = OneHotEncoder(sparse_output=False)
amino_onehot = onehot_encoder.fit_transform(df['amino_seq_str'].values.reshape(-1, 1))

# Combine ECFPs and the one-hot encoded amino-acid sequence
X = [ecfp + protein for ecfp, protein in zip(df['ecfp'].tolist(), amino_onehot.tolist())]
y = df['binds'].tolist()

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]  # Probability of the positive class

# Calculate the mean average precision
map_score = average_precision_score(y_test, y_pred_proba)
print(f"Mean Average Precision (mAP): {map_score:.2f}")




Mean Average Precision (mAP): 0.96


Look at that Average Precision score. We did amazing!

Actually no, we just overfit. This is likely to be a recurring theme in this competition. It is easy to predict molecules that come from the same corner of chemical space, but generalizing to new molecules is extremely difficult.
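One way to get a less optimistic estimate is to hold out whole building blocks, so that no validation molecule shares that building block with a training molecule. The sketch below is an assumption about how such a split could look, not part of the original tutorial: the helper `split_by_building_block` is hypothetical and the `BB0`..`BB4` labels are placeholders rather than real SMILES. scikit-learn's `GroupShuffleSplit` offers the same idea for DataFrames.

```python
import random

def split_by_building_block(rows, key="buildingblock1_smiles",
                            valid_fraction=0.2, seed=42):
    """Split rows so that every value of `key` lands entirely in train or valid."""
    blocks = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(blocks)
    n_valid = max(1, int(len(blocks) * valid_fraction))
    held_out = set(blocks[:n_valid])
    train = [row for row in rows if row[key] not in held_out]
    valid = [row for row in rows if row[key] in held_out]
    return train, valid

# Toy rows: 20 molecules built from 5 placeholder building blocks.
rows = [{"buildingblock1_smiles": f"BB{i % 5}", "binds": i % 2} for i in range(20)]
train_rows, valid_rows = split_by_building_block(rows)

train_blocks = {r["buildingblock1_smiles"] for r in train_rows}
valid_blocks = {r["buildingblock1_smiles"] for r in valid_rows}
print(len(train_rows), len(valid_rows), train_blocks & valid_blocks)
```

A score computed on such a split is usually lower than on a random split, but it is closer to the question the competition asks: how well does the model do on chemistry it has not seen?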

In [20]:
import gc
gc.collect()

54

In [21]:
df

Unnamed: 0,id,buildingblock1_smiles,buildingblock2_smiles,buildingblock3_smiles,molecule_smiles,protein_name,binds,molecule,ecfp,pdb_file_path,amino_seq_list,amino_seq,amino_seq_str
0,233070315,O=C(O)C1CCCN1C(=O)OCC1c2ccccc2-c2ccccc21,Nc1ccncn1,Cl.NCC1CC2(C1)CC2(F)F,O=C(N[Dy])C1CCCN1c1nc(NCC2CC3(C2)CC3(F)F)nc(Nc...,BRD4,0,<rdkit.Chem.rdchem.Mol object at 0x7fb210fae0a0>,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...,"LYS,VAL,SER,GLU,GLN,LEU,LYS,CYS,CYS,SER,GLY,IL..."
1,49602117,Cc1cc(Br)cc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,Cl.NCC1CCOCC12CCCC2,Nc1cncnc1,Cc1cc(Br)cc(C(=O)N[Dy])c1Nc1nc(NCC2CCOCC23CCCC...,BRD4,0,<rdkit.Chem.rdchem.Mol object at 0x7fb210fae810>,"[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...,"LYS,VAL,SER,GLU,GLN,LEU,LYS,CYS,CYS,SER,GLY,IL..."
2,187558285,O=C(Nc1ccc(C(=O)O)cc1)OCC1c2ccccc2-c2ccccc21,CNC(=O)c1cc(Oc2ccc(N)cc2)ccn1,NCc1ccsc1,CNC(=O)c1cc(Oc2ccc(Nc3nc(NCc4ccsc4)nc(Nc4ccc(C...,HSA,0,<rdkit.Chem.rdchem.Mol object at 0x7fb2110003c0>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",/content/pdb_files/1ao6.cif,"[SER, GLU, VAL, ALA, HIS, ARG, PHE, LYS, ASP, ...",[SERGLUVALALAHISARGPHELYSASPLEUGLYGLUGLUASNPHE...,"SER,GLU,VAL,ALA,HIS,ARG,PHE,LYS,ASP,LEU,GLY,GL..."
3,195792980,O=C(Nc1ccc(C(=O)O)nc1)OCC1c2ccccc2-c2ccccc21,NCCN1CC2CCC1C2,Cl.NCCn1cnc2sccc2c1=O,O=C(N[Dy])c1ccc(Nc2nc(NCCN3CC4CCC3C4)nc(NCCn3c...,sEH,0,<rdkit.Chem.rdchem.Mol object at 0x7fb211000430>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...,"ARG,ALA,ALA,VAL,PHE,ASP,LEU,ASP,GLY,VAL,LEU,AL..."
4,41079344,COc1nccc(C(=O)O)c1NC(=O)OCC1c2ccccc2-c2ccccc21,NCCC1SCCS1,Cl.NCCC1CC1,COc1nccc(C(=O)N[Dy])c1Nc1nc(NCCC2CC2)nc(NCCC2S...,sEH,0,<rdkit.Chem.rdchem.Mol object at 0x7fb2110004a0>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...,"ARG,ALA,ALA,VAL,PHE,ASP,LEU,ASP,GLY,VAL,LEU,AL..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
79995,244803662,O=C(O)C[C@@H](Cc1ccc(C(F)(F)F)cc1)NC(=O)OCC1c2...,NCCC1CSC1,NCc1cccs1,O=C(C[C@@H](Cc1ccc(C(F)(F)F)cc1)Nc1nc(NCCC2CSC...,sEH,1,<rdkit.Chem.rdchem.Mol object at 0x7fb21045ee30>,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...,"ARG,ALA,ALA,VAL,PHE,ASP,LEU,ASP,GLY,VAL,LEU,AL..."
79996,83266094,O=C(NC[C@H]1CC[C@H](C(=O)O)CC1)OCC1c2ccccc2-c2...,Cl.Cl.NCC=Cc1cccnc1,COC(=O)c1cncc(N)c1,COC(=O)c1cncc(Nc2nc(NCC=Cc3cccnc3)nc(NC[C@H]3C...,sEH,1,<rdkit.Chem.rdchem.Mol object at 0x7fb21045eea0>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/3i1y.cif,"[ARG, ALA, ALA, VAL, PHE, ASP, LEU, ASP, GLY, ...",[ARGALAALAVALPHEASPLEUASPGLYVALLEUALALEUPROALA...,"ARG,ALA,ALA,VAL,PHE,ASP,LEU,ASP,GLY,VAL,LEU,AL..."
79997,17664114,CC(C)CC(NC(=O)OCC1c2ccccc2-c2ccccc21)C(=O)O,COc1c(F)ccc(F)c1CN.Cl,Cc1ncccc1N,COc1c(F)ccc(F)c1CNc1nc(Nc2cccnc2C)nc(NC(CC(C)C...,BRD4,1,<rdkit.Chem.rdchem.Mol object at 0x7fb21045ef10>,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...,"LYS,VAL,SER,GLU,GLN,LEU,LYS,CYS,CYS,SER,GLY,IL..."
79998,187453674,O=C(Nc1ccc(C(=O)O)cc1)OCC1c2ccccc2-c2ccccc21,CC(F)(F)CN.Cl,Cc1cc(N)ncn1,Cc1cc(Nc2nc(NCC(C)(F)F)nc(Nc3ccc(C(=O)N[Dy])cc...,BRD4,1,<rdkit.Chem.rdchem.Mol object at 0x7fb21045ef80>,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...",/content/pdb_files/7jkz.cif,"[LYS, VAL, SER, GLU, GLN, LEU, LYS, CYS, CYS, ...",[LYSVALSERGLUGLNLEULYSCYSCYSSERGLYILELEULYSGLU...,"LYS,VAL,SER,GLU,GLN,LEU,LYS,CYS,CYS,SER,GLY,IL..."


## Test Prediction

The trained Random Forest model is then used to predict binding probabilities for the test set. These predictions are saved to a CSV file, which serves as the submission file for the Kaggle competition.
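One detail worth spelling out: the feature columns at prediction time have to line up with the columns the model was trained on. The toy sketch below uses hypothetical helpers (`fit_categories`, `one_hot`) rather than scikit-learn, and shows what goes wrong if the categories are re-derived from a test chunk that happens to contain only one protein.

```python
def fit_categories(values):
    """Learn a fixed, sorted list of categories (the 'fit' step)."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode value against an already fixed category list (the 'transform' step)."""
    return [1 if value == category else 0 for category in categories]

# Categories learned once, on the training data.
train_categories = fit_categories(["BRD4", "HSA", "sEH", "HSA"])

# A test chunk that only contains one protein.
chunk = ["sEH", "sEH"]

good = [one_hot(v, train_categories) for v in chunk]       # 3 columns, as in training
bad = [one_hot(v, fit_categories(chunk)) for v in chunk]   # refit on the chunk: 1 column

print(good)
print(bad)
```

With scikit-learn's `OneHotEncoder`, this corresponds to calling `fit_transform` once on the training data and `transform` on every test chunk.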

In [22]:
import os
import gc
import pandas as pd

# Process the test.csv file chunk by chunk
test_file = '/content/test.csv'
output_file = 'submission.csv'  # Specify the path and filename for the output file

# Read test.csv in chunks of 10,000 rows to keep memory usage low
for df_test in pd.read_csv(test_file, chunksize=10000):

    # Generate ECFPs for the molecule_smiles
    df_test['molecule'] = df_test['molecule_smiles'].apply(Chem.MolFromSmiles)
    df_test['ecfp'] = df_test['molecule'].apply(generate_ecfp)
    df_test['amino_seq_list'] = df_test['protein_name'].apply(lambda x: pdb_dict.get(x, x))
    df_test['amino_seq_str'] = df_test['amino_seq_list'].apply(lambda x: ','.join(x))
    # One-hot encode the amino-acid sequence with the encoder fitted on the training data.
    # Use transform (not fit_transform) so the columns match those seen during training.
    amino_onehot = onehot_encoder.transform(df_test['amino_seq_str'].values.reshape(-1, 1))
    # Combine ECFPs and the one-hot encoded amino-acid sequence
    X_test = [ecfp + amino_acid for ecfp, amino_acid in zip(df_test['ecfp'].tolist(), amino_onehot.tolist())]

    # Predict the probabilities
    probabilities = rf_model.predict_proba(X_test)[:, 1]

    # Create a DataFrame with 'id' and 'binds' columns
    output_df = pd.DataFrame({'id': df_test['id'], 'binds': probabilities})

    # Append to the CSV file, writing the header only if the file does not exist yet
    output_df.to_csv(output_file, index=False, mode='a', header=not os.path.exists(output_file))
    gc.collect()


Streaming output truncated to the last 5000 lines.


In [23]:
import pandas as pd

df1 = pd.read_csv("submission.csv")


## Approaches
1. Try using ChemBERTa to turn SMILES into features.
2. Try models other than random forest.
3. Generate features from amino acid sequence and DNA sequence information.
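As a starting point for idea 3, a simple sequence-derived feature is the amino-acid composition: the fraction of each of the 20 standard residues in a chain. The sketch below is one possible featurization, not something the notebook implements; `composition_features` is a hypothetical helper, and the example input is just the first ten residues printed for 1ao6 above.

```python
from collections import Counter

STANDARD_RESIDUES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]

def composition_features(residues):
    """Return the fraction of each standard residue in a list of three-letter codes."""
    counts = Counter(residues)
    total = sum(counts[name] for name in STANDARD_RESIDUES)
    if total == 0:
        return [0.0] * len(STANDARD_RESIDUES)
    return [counts[name] / total for name in STANDARD_RESIDUES]

# First ten residues of the 1ao6 chain, as printed earlier in the notebook.
example = ["SER", "GLU", "VAL", "ALA", "HIS", "ARG", "PHE", "LYS", "ASP", "LEU"]
features = composition_features(example)
print(len(features), features[STANDARD_RESIDUES.index("SER")])
```

With only three targets, any feature computed per protein takes just three distinct values, so for a tree-based model it carries no more information than the protein name. Sequence features are more likely to pay off when combined with molecule features in a model that can relate the two.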
