# Base models

## Endpoints:
1. **EPA category** (classes 1-4)  
   - Classification according to the [EPA](https://www.epa.gov/) system.
   
2. **GHS category** (classes 1-5)  
   - Classification according to [GHS](https://www.unece.org/ghs-rev0-2003.html).

3. **LD50 (mmol/kg)**  
   - Regression to predict the lethal dose (target column `logLD50_mmolkg`).

4. **Toxicity (binary)**  
   - Classification: `Toxic` (LD50 < 2000 mg/kg → 1).

5. **High toxicity (binary)**  
   - Classification: `Very Toxic` (LD50 < 50 mg/kg → 1).
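
The two binary endpoints are defined by LD50 thresholds. A minimal sketch of that thresholding follows; the function name and the handling of missing values are assumptions made for illustration, and the labels in the dataset itself may have been assigned by a different procedure.

```python
# Thresholds taken from the endpoint definitions above (mg/kg).
TOXIC_THRESHOLD_MG_KG = 2000.0
VERY_TOXIC_THRESHOLD_MG_KG = 50.0


def binarize_ld50(ld50_mg_kg):
    """Return (toxic, very_toxic) labels for one LD50 value in mg/kg.

    Hypothetical helper: a missing LD50 (None) yields missing labels.
    """
    if ld50_mg_kg is None:
        return None, None
    toxic = int(ld50_mg_kg < TOXIC_THRESHOLD_MG_KG)
    very_toxic = int(ld50_mg_kg < VERY_TOXIC_THRESHOLD_MG_KG)
    return toxic, very_toxic


for value in (10.0, 300.0, 5000.0, None):
    print(value, binarize_ld50(value))
```

Both comparisons are strict, so a compound with an LD50 of exactly 2000 mg/kg is labeled `0` for `Toxic`.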

---

## Algorithms:
### 1. Random Forest (RF)
- **Task types**:  
  Classification, regression.
- **Details**:  
  An ensemble of decision trees with tuned hyperparameters (`n_estimators`, `max_depth`, `bootstrap`, etc.).

### 2. Support Vector Machines (SVM) / Support Vector Regression (SVR)
- **Task types**:  
  Classification (SVM), regression (SVR).
- **Details**:  
  Selection of the kernel (`rbf`, `linear`), the regularization parameter `C`, and the kernel coefficient `gamma`.

### 3. XGBoost
- **Task types**:  
  Classification, regression.
- **Details**:  
  Gradient boosting with tuning of the learning rate (`learning_rate`), tree depth (`max_depth`), regularization (`gamma`), and row subsampling (`subsample`).

### 4. k-Nearest Neighbors (kNN)
- **Task types**:  
  Classification, regression.
- **Details**:  
  Tuning of the distance metric (e.g. `dice`, `braycurtis`), the number of neighbors (`n_neighbors`), and the weighting scheme (`weights`).
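
To make the two kNN metrics named above concrete, here is a small pure-Python sketch of the Dice dissimilarity (for bit fingerprints) and the Bray-Curtis distance (for count fingerprints). The function names and the choice to return `0.0` when the denominator is zero are assumptions of this sketch, not taken from the project code.

```python
def dice_distance(u, v):
    """Dice dissimilarity between two equal-length binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    only_one = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    denom = 2 * both + only_one
    # Convention chosen for this sketch: two all-zero vectors have distance 0.
    return only_one / denom if denom else 0.0


def braycurtis_distance(u, v):
    """Bray-Curtis distance between two equal-length count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    denom = sum(abs(a + b) for a, b in zip(u, v))
    return num / denom if denom else 0.0


bits_a, bits_b = [1, 1, 0, 1, 0], [1, 0, 0, 1, 1]
counts_a, counts_b = [3, 0, 2, 1], [1, 1, 2, 0]
print(dice_distance(bits_a, bits_b))
print(braycurtis_distance(counts_a, counts_b))
```

Dice only looks at which bits are set, which suits binary fingerprints, while Bray-Curtis compares magnitudes and therefore makes use of the counts.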

---

## Training strategy
- **5-fold cross-validation** was used for all models.
- The best parameters were selected with `GridSearchCV` or `RandomizedSearchCV`.
- Evaluation metrics:  
  - Classification: `ROC AUC`, `F1-weighted`.  
  - Regression: `MSE`.
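
The search strategy above can be sketched with scikit-learn as follows. The synthetic data, the deliberately tiny parameter space, and the number of sampled candidates (`n_iter=4`) are assumptions chosen so the example runs in seconds; they are not the settings used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for a descriptor matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Deliberately small search space (illustrative values only).
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
    "min_samples_split": [2, 5],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                      # number of sampled candidates
    scoring="roc_auc",             # binary endpoint metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For the other endpoints the `scoring` string would change, for example `"f1_weighted"` for multiclass classification or `"neg_mean_squared_error"` for regression (scikit-learn maximizes scores, so MSE is negated).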

<details>
<summary>🔍 Example hyperparameter grid (RF)</summary>

```python
{
  'n_estimators': [500, 1000, 1500],
  'max_depth': [None, 20, 50, 80],
  'min_samples_split': [2, 5, 10],
  'bootstrap': [True, False]
}
```

</details>

# Model selection results with 5-fold cross-validation

---

## Endpoint 1: Toxic (binary classification)
**Evaluation metric:** `ROC AUC`

### 1. ECFP6 Bits
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='dice'`, `n_neighbors=19`, `weights='distance'`                       | 0.7765003505028532    |
| **SVM**    | `C=1`, `gamma=0.01`                                                           | 0.7615940094038895    |
| **RF**     | `n_estimators=500`, `min_samples_split=10`, `min_samples_leaf=4`, `max_features='sqrt'`, `max_depth=80`, `bootstrap=False` | 0.7962032805553361 |
| **XGBoost**| `subsample=0.6`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=10`, `learning_rate=0.01`, `gamma=1`, `colsample_bytree=0.6` | 0.779149971311067 |

### 2. ECFP6 Counts
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='braycurtis'`, `n_neighbors=15`, `weights='distance'`                 | 0.7758825346243101    |
| **SVM**    | `C=1`, `gamma=0.01`                                                           | 0.756896105732593     |
| **RF**     | `n_estimators=1500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=None`, `bootstrap=False` | 0.7997123705018435 |
| **XGBoost**| `subsample=0.7`, `n_estimators=1500`, `min_child_weight=1`, `max_depth=10`, `learning_rate=0.01`, `gamma=1`, `colsample_bytree=0.9` | 0.7828889119910382 |

### 3. MACCS
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='rogerstanimoto'`, `n_neighbors=19`, `weights='distance'`             | 0.8005131593622032    |
| **SVM**    | `C=1`, `gamma=0.1`                                                            | 0.8142373488655791    |
| **RF**     | `n_estimators=1500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='log2'`, `max_depth=None`, `bootstrap=False` | 0.8220226523893197 |
| **XGBoost**| `subsample=0.8`, `n_estimators=500`, `min_child_weight=1`, `max_depth=10`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.6` | 0.8157 |

### 4. RDKit 2D
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=15`, `p=1`, `weights='distance'`                                 | 0.8032485798583557    |
| **SVM**    | `C=1`, `gamma=1`                                                              | 0.8141723132585037    |
| **RF**     | `n_estimators=1500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=20`, `bootstrap=False` | 0.8377562807671433 |
| **XGBoost**| `subsample=0.8`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=10`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.6` | 0.8300594548863851 |

### 5. Mordred
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=15`, `p=1`, `weights='distance'`                                 | 0.8050440160843321    |
| **SVM**    | `C=10`, `gamma=0.1`                                                           | 0.8216012931247432    |
| **RF**     | `n_estimators=1500`, `min_samples_split=2`, `min_samples_leaf=4`, `max_features='sqrt'`, `max_depth=None`, `bootstrap=False` | 0.8350463294079058 |
| **XGBoost**| `subsample=0.6`, `n_estimators=500`, `min_child_weight=1`, `max_depth=10`, `learning_rate=0.01`, `gamma=1`, `colsample_bytree=0.7` | 0.8345831522387966 |

---

## Endpoint 2: EPA (multiclass classification)
**Evaluation metric:** `F1-weighted`

### 1. ECFP6 Bits
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='dice'`, `n_neighbors=9`, `weights='distance'`                        | 0.5280720565764737    |
| **SVM**    | `C=10`, `gamma=0.01`                                                          | 0.502214881274199     |
| **RF**     | `n_estimators=1500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=50`, `bootstrap=False` | 0.4962855434736273 |
| **XGBoost**| `subsample=0.7`, `n_estimators=500`, `min_child_weight=5`, `max_depth=6`, `learning_rate=0.1`, `gamma=0`, `colsample_bytree=0.8` | 0.5168191594611219 |

### 2. ECFP6 Counts
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='braycurtis'`, `n_neighbors=9`, `weights='distance'`                  | 0.5286790743951763    |
| **SVM**    | `C=10`, `gamma=0.01`                                                          | 0.507425711140356     |
| **RF**     | `n_estimators=1500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=None`, `bootstrap=False` | 0.49182092154551615 |
| **XGBoost**| `subsample=0.7`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=10`, `learning_rate=0.1`, `gamma=0`, `colsample_bytree=0.7` | 0.5164297901337191 |

### 3. MACCS
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='rogerstanimoto'`, `n_neighbors=15`, `weights='distance'`             | 0.5400022074386337    |
| **SVM**    | `C=100`, `gamma=0.1`                                                          | 0.538085334634762     |
| **RF**     | `n_estimators=1500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=35`, `bootstrap=False` | 0.5387384859360687 |
| **XGBoost**| `subsample=0.7`, `n_estimators=1500`, `min_child_weight=1`, `max_depth=6`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.5` | 0.5367818120130525 |

### 4. RDKit 2D
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=9`, `p=1`, `weights='distance'`                                  | 0.5394396679306864    |
| **SVM**    | `C=10`, `gamma=1`                                                             | 0.5396127137198012    |
| **RF**     | `n_estimators=500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=65`, `bootstrap=False` | 0.544268835812836 |
| **XGBoost**| `subsample=0.8`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=10`, `learning_rate=0.1`, `gamma=0`, `colsample_bytree=0.5` | 0.5437794552662495 |

### 5. Mordred
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=9`, `p=1`, `weights='distance'`                                  | 0.5397530479270659    |
| **SVM**    | `C=10`, `gamma=1`                                                             | 0.5365768484152484    |
| **RF**     | `n_estimators=1500`, `min_samples_split=10`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=50`, `bootstrap=False` | 0.5431836837060968 |
| **XGBoost**| `subsample=1.0`, `n_estimators=1500`, `min_child_weight=1`, `max_depth=10`, `learning_rate=0.1`, `gamma=0`, `colsample_bytree=0.6` | 0.5482312937748454 |

---

## Endpoint 3: logLD50 (regression)
**Evaluation metric:** `MSE`

### 1. ECFP6 Bits
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='dice'`, `n_neighbors=15`, `weights='distance'`                       | 0.5332636774453705    |
| **SVM**    | `C=1`, `gamma=0.01`                                                           | 0.5503405736122804    |
| **RF**     | `n_estimators=1500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=50`, `bootstrap=True` | 0.5374327338246813 |
| **XGBoost**| `subsample=0.6`, `n_estimators=1500`, `min_child_weight=1`, `max_depth=10`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.9` | 0.49743116439673163 |

### 2. ECFP6 Counts
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='braycurtis'`, `n_neighbors=9`, `weights='distance'`                  | 0.5249314216982528    |
| **SVM**    | `C=10`, `gamma=0.01`                                                          | 0.5453515648632298    |
| **RF**     | `n_estimators=1500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=None`, `bootstrap=False` | 0.5408206640969156 |
| **XGBoost**| `subsample=0.9`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=10`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.5` | 0.4954868664317701 |

### 3. MACCS
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `metric='rogerstanimoto'`, `n_neighbors=9`, `weights='distance'`              | 0.489269050087817     |
| **SVM**    | `C=10`, `gamma=0.1`                                                           | 0.45290928375622946   |
| **RF**     | `n_estimators=500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=None`, `bootstrap=False` | 0.443660837367564 |
| **XGBoost**| `subsample=0.6`, `n_estimators=1500`, `min_child_weight=3`, `max_depth=6`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.5` | 0.4310792218926224 |

### 4. RDKit 2D
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=9`, `p=1`, `weights='distance'`                                  | 0.49686610339448434   |
| **SVM**    | `C=1`, `gamma=1`                                                              | 0.49409442627621375   |
| **RF**     | `n_estimators=1500`, `min_samples_split=2`, `min_samples_leaf=2`, `max_features='log2'`, `max_depth=20`, `bootstrap=False` | 0.46541088612250886 |
| **XGBoost**| `subsample=0.7`, `n_estimators=1500`, `min_child_weight=5`, `max_depth=10`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.6` | 0.4468637779795599 |

### 5. Mordred
| Algorithm  | Hyperparameters                                                                | Best score            |
|------------|-------------------------------------------------------------------------------|-----------------------|
| **kNN**    | `n_neighbors=9`, `p=1`, `weights='distance'`                                  | 0.47073453589854897   |
| **SVM**    | `C=10`, `gamma=0.1`                                                           | 0.46021540702351055   |
| **RF**     | `n_estimators=500`, `min_samples_split=5`, `min_samples_leaf=2`, `max_features='sqrt'`, `max_depth=80`, `bootstrap=False` | 0.45110994944203975 |
| **XGBoost**| `subsample=0.7`, `n_estimators=1500`, `min_child_weight=1`, `max_depth=6`, `learning_rate=0.01`, `gamma=0`, `colsample_bytree=0.7` | 0.42682088858473594 |

---

**Notes:**
- **Binary classification** (Toxic) was evaluated with `ROC AUC`.
- **Multiclass classification** (EPA) was evaluated with `F1-weighted`.
- **Regression** (logLD50) was evaluated with `MSE` (lower is better).
- All hyperparameters were selected using **5-fold cross-validation**.
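
For readers unfamiliar with the two less common metrics in the notes above, the following pure-Python sketch shows how a support-weighted F1 score and the mean squared error are computed. The function names and toy labels are illustrative assumptions; in practice `sklearn.metrics.f1_score(..., average='weighted')` and `sklearn.metrics.mean_squared_error` would be used.

```python
from collections import Counter


def f1_weighted(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error; lower is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


print(f1_weighted([1, 1, 2, 2, 2, 3], [1, 2, 2, 2, 3, 3]))
print(mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))
```

Weighting by support means frequent classes dominate the score, which matters for imbalanced category endpoints.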

In [1]:
from utils import * 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import itertools
from pprint import pprint
import joblib

import statistics

# Models
from xgboost import XGBClassifier, XGBRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from sklearn.model_selection import KFold, cross_validate, GridSearchCV, cross_val_score, RandomizedSearchCV 
from sklearn.model_selection import cross_val_predict

from sklearn.pipeline import Pipeline

from sklearn.metrics import make_scorer

# regression metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# classification metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score, f1_score, matthews_corrcoef

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin
from sklearn.base import clone
from sklearn.model_selection import check_cv

## Data

In [2]:
train_labels = pd.read_csv('../data/processed/train_labels.csv', index_col = 'CASRN')
test_labels = pd.read_csv('../data/processed/test_labels.csv', index_col = 'CASRN')
train_labels.shape, test_labels.shape

((8221, 6), (2849, 6))

Check how many labeled samples the train and test sets contain for each endpoint.

In [3]:
train_labels.head(1)

| CASRN | SMILES | logLD50_mmolkg | verytoxic | toxic | EPA_category | GHS_category |
|-------|--------|----------------|-----------|-------|--------------|--------------|
| 23233-88-7 | CC(=O)Oc1c(Br)cc(Cl)cc1C(=S)Nc1ccc(Br)cc1 | 0.810998 | 0.0 | 0.0 | 3.0 | 5.0 |


In [4]:
print('LD50:', 
      'Train Set:', train_labels.shape[0] - sum(train_labels.logLD50_mmolkg.isnull()),
      'Test Set:',  test_labels.shape[0] - sum(test_labels.logLD50_mmolkg.isnull()),
      '\n'
      'Binary Toxic:',
      'Train Set:', train_labels.shape[0] - sum(train_labels.toxic.isnull()),
      'Test Set:',  test_labels.shape[0] - sum(test_labels.toxic.isnull()),     
           
      '\n'
      'EPA_category:',
      'Train Set:', train_labels.shape[0] - sum(train_labels.EPA_category.isnull()),
      'Test Set:',  test_labels.shape[0] - sum(test_labels.EPA_category.isnull()),    
      
      '\n' '\n'
      'The other two endpoints are not modeled'
      '\n' 
      'Binary verytoxic:',
      'Train Set:', train_labels.shape[0] - sum(train_labels.verytoxic.isnull()),
      'Test Set:',  test_labels.shape[0] - sum(test_labels.verytoxic.isnull()),    
      
      '\n'
      'GHS_category:',
      'Train Set:', train_labels.shape[0] - sum(train_labels.GHS_category.isnull()),
      'Test Set:',  test_labels.shape[0] - sum(test_labels.GHS_category.isnull())       
     )

LD50: Train Set: 6092 Test Set: 2145 
Binary Toxic: Train Set: 8209 Test Set: 2844 
EPA_category: Train Set: 8126 Test Set: 2822 

The other two endpoints are not modeled
Binary verytoxic: Train Set: 8219 Test Set: 2848 
GHS_category: Train Set: 8189 Test Set: 2840
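
The per-endpoint counts above can also be obtained in one call with `DataFrame.notna().sum()`. The sketch below uses a tiny made-up label table (the index values `A`, `B`, `C` and the numbers are placeholders, not real data) with the same column names as the notebook.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical label table; only the column names match the notebook.
labels = pd.DataFrame(
    {
        "logLD50_mmolkg": [0.81, np.nan, 1.20],
        "toxic": [0.0, 1.0, np.nan],
        "EPA_category": [3.0, 2.0, 4.0],
    },
    index=pd.Index(["A", "B", "C"], name="CASRN"),
)

# Number of non-missing labels per endpoint column.
labeled_counts = labels.notna().sum()
print(labeled_counts)
```

Applied to `train_labels` and `test_labels`, this would give the same numbers as the long `print` call without repeating the subtraction for every column.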


In [5]:
# import all the training features
train_ecfp6_bits = pd.read_csv('../data/Bmodel_features/modeling_train_ecfp6_bits.csv', index_col='CASRN')
train_ecfp6_counts = pd.read_csv('../data/Bmodel_features/modeling_train_ecfp6_counts.csv', index_col='CASRN')
train_maccs = pd.read_csv('../data/Bmodel_features/modeling_train_maccs.csv', index_col='CASRN')
train_rdkit2d = pd.read_csv('../data/Bmodel_features/modeling_train_rdkit2d.csv', index_col='CASRN')
train_mordred = pd.read_csv('../data/Bmodel_features/modeling_train_mordred.csv', index_col='CASRN')


# import all the test features
test_ecfp6_bits = pd.read_csv('../data/Bmodel_features/modeling_test_ecfp6_bits.csv', index_col='CASRN')
test_ecfp6_counts = pd.read_csv('../data/Bmodel_features/modeling_test_ecfp6_counts.csv', index_col='CASRN')
test_maccs = pd.read_csv('../data/Bmodel_features/modeling_test_maccs.csv', index_col='CASRN')
test_rdkit2d = pd.read_csv('../data/Bmodel_features/modeling_test_rdkit2d.csv', index_col='CASRN')
test_mordred = pd.read_csv('../data/Bmodel_features/modeling_test_mordred.csv', index_col='CASRN')
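
The ten `read_csv` calls above follow a single file-naming pattern. A hedged sketch of building those paths programmatically is shown below; `feature_path` and `feature_paths` are hypothetical names introduced here, and no files are read.

```python
from pathlib import PurePosixPath

FEATURE_DIR = PurePosixPath("../data/Bmodel_features")
DESCRIPTORS = ["ecfp6_bits", "ecfp6_counts", "maccs", "rdkit2d", "mordred"]


def feature_path(split, descriptor):
    """Build the CSV path for one split ('train' or 'test') and descriptor."""
    return FEATURE_DIR / f"modeling_{split}_{descriptor}.csv"


# Map (split, descriptor) to its file path.
feature_paths = {
    (split, descriptor): feature_path(split, descriptor)
    for split in ("train", "test")
    for descriptor in DESCRIPTORS
}

for key, path in feature_paths.items():
    print(key, path)
```

Each path could then be passed to `pd.read_csv(path, index_col='CASRN')` to fill a dictionary of feature tables instead of ten separate variables.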

# Training

- Report the cross-validation results
- Save `meta features` for building hierarchical models
- Save `out-of-fold predictions` for evaluating the base models
- Save the trained `base models`
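
The helpers used in the cells below (`prepare_input`, `Classification_meta_features`) come from `utils` and are not shown here. As a rough illustration of the idea behind out-of-fold predictions and meta features, the following sketch uses scikit-learn's `cross_val_predict` on synthetic data; it is a stand-in under stated assumptions, not the project's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for one descriptor matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is predicted by a model fitted on
# the other folds, so it never saw that row during training.
oof_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
oof_labels = oof_proba.argmax(axis=1)

# Refit on all rows to obtain the base model that would be saved.
base_model = model.fit(X, y)

print(oof_proba.shape, float(np.mean(oof_labels == y)))
```

The two probability columns play the role of the `{name}-0` and `{name}-1` meta-feature columns; because they are out-of-fold, a second-level model trained on them does not see predictions made on each base model's own training rows.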

## Endpoint 1: Toxic

### Random Forest

In [6]:
%%time
#yes
endpoint = 'Toxic'
descriptor = 'mordred'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 2, min_samples_leaf=4,
                              max_features = 'sqrt', max_depth=None, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.788 std: 0.015
Balance Accuracy: 0.778 std: 0.015
matthews_corrcoef: 0.566 std: 0.03
f1_score: 0.787 std: 0.015
AUROC: 0.778 std: 0.015
CPU times: total: 1h 1min 43s
Wall time: 10min 31s


['../results/Base_models/Toxic_RF_mordred_CVScore']
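
The report printed above lists a mean and a standard deviation per metric across the cross-validation folds. `report_clf_models` itself is defined in `utils` and not shown; the sketch below is a hypothetical equivalent that assumes scores are stored as a dictionary of per-fold lists and that the population standard deviation is used.

```python
import statistics


def report_scores(cv_scores):
    """Format per-fold scores as '<metric>: <mean> std: <std>' lines.

    Assumption: cv_scores maps a metric name to a list of per-fold values.
    """
    lines = []
    for metric, values in cv_scores.items():
        mean = statistics.mean(values)
        std = statistics.pstdev(values)  # population std (an assumption)
        lines.append(f"{metric}: {round(mean, 3)} std: {round(std, 3)}")
    return lines


# Made-up fold scores, for illustration only.
for line in report_scores({"Accuracy": [0.78, 0.80, 0.79]}):
    print(line)
```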

In [7]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'rdkit2d'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 5, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=20, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.787 std: 0.013
Balance Accuracy: 0.776 std: 0.011
matthews_corrcoef: 0.562 std: 0.024
f1_score: 0.785 std: 0.012
AUROC: 0.776 std: 0.011
CPU times: total: 20min 20s
Wall time: 3min 53s


['../results/Base_models/Toxic_RF_rdkit2d_CVScore']

In [8]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'maccs'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 2, min_samples_leaf=2,
                              max_features = 'log2', max_depth=None, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.784 std: 0.015
Balance Accuracy: 0.775 std: 0.014
matthews_corrcoef: 0.558 std: 0.03
f1_score: 0.783 std: 0.014
AUROC: 0.775 std: 0.014
CPU times: total: 5min 51s
Wall time: 2min 1s


['../results/Base_models/Toxic_RF_maccs_CVScore']

In [9]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'ecfp6counts'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 2, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=None, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.766 std: 0.018
Balance Accuracy: 0.751 std: 0.017
matthews_corrcoef: 0.521 std: 0.037
f1_score: 0.762 std: 0.018
AUROC: 0.751 std: 0.017
CPU times: total: 37min 18s
Wall time: 6min 48s


['../results/Base_models/Toxic_RF_ecfp6counts_CVScore']

In [10]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'ecfp6bits'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 500, min_samples_split = 10, min_samples_leaf=4,
                              max_features = 'sqrt', max_depth=80, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.759 std: 0.013
Balance Accuracy: 0.744 std: 0.014
matthews_corrcoef: 0.504 std: 0.03
f1_score: 0.755 std: 0.013
AUROC: 0.744 std: 0.014
CPU times: total: 28min 11s
Wall time: 4min 50s


['../results/Base_models/Toxic_RF_ecfp6bits_CVScore']
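
The five Random Forest cells above differ only in the descriptor and a few hyperparameters. One way to avoid the copy-and-paste is a small configuration table, sketched below with the values from those cells; the names `SHARED`, `PER_DESCRIPTOR`, and `build_runs` are hypothetical, and the sketch only assembles settings without training anything.

```python
ENDPOINT = "Toxic"
ALGORITHM = "RF"

# Settings shared by all five cells above.
SHARED = {"random_state": 42, "n_jobs": 6, "bootstrap": False}

# Descriptor-specific hyperparameters copied from the cells above.
PER_DESCRIPTOR = {
    "mordred": dict(n_estimators=1500, min_samples_split=2,
                    min_samples_leaf=4, max_features="sqrt", max_depth=None),
    "rdkit2d": dict(n_estimators=1500, min_samples_split=5,
                    min_samples_leaf=2, max_features="sqrt", max_depth=20),
    "maccs": dict(n_estimators=1500, min_samples_split=2,
                  min_samples_leaf=2, max_features="log2", max_depth=None),
    "ecfp6counts": dict(n_estimators=1500, min_samples_split=2,
                        min_samples_leaf=2, max_features="sqrt", max_depth=None),
    "ecfp6bits": dict(n_estimators=500, min_samples_split=10,
                      min_samples_leaf=4, max_features="sqrt", max_depth=80),
}


def build_runs():
    """Return one run description per descriptor."""
    runs = []
    for descriptor, params in PER_DESCRIPTOR.items():
        name = f"{ENDPOINT}_{ALGORITHM}_{descriptor}"
        runs.append({
            "name": name,
            "descriptor": descriptor,
            "params": {**SHARED, **params},
            "col_names": [f"{name}-0", f"{name}-1"],
        })
    return runs


for run in build_runs():
    print(run["name"], run["params"])
```

Inside a loop, `RandomForestClassifier(**run["params"])` would replace the hand-written constructor call, and `run["name"]` would drive the output file names.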

### XGBoost

In [11]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'ecfp6bits'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.6, n_estimators = 1500, min_child_weight=3,
                    max_depth = 10, learning_rate=0.01, gamma= 1,
                    colsample_bytree = 0.6)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.753 std: 0.014
Balance Accuracy: 0.742 std: 0.014
matthews_corrcoef: 0.492 std: 0.028
f1_score: 0.751 std: 0.014
AUROC: 0.742 std: 0.014
CPU times: total: 21min 16s
Wall time: 3min 35s


['../results/Base_models/Toxic_xgboost_ecfp6bits_CVScore']

In [12]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'ecfp6counts'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.7, n_estimators = 1500, min_child_weight=1,
                    max_depth = 10, learning_rate=0.01, gamma= 1,
                    colsample_bytree = 0.9)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.761 std: 0.013
Balance Accuracy: 0.75 std: 0.012
matthews_corrcoef: 0.508 std: 0.026
f1_score: 0.758 std: 0.013
AUROC: 0.75 std: 0.012
CPU times: total: 24min 15s
Wall time: 4min 4s


['../results/Base_models/Toxic_xgboost_ecfp6counts_CVScore']

In [13]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'maccs'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.8, n_estimators = 500, min_child_weight=1,
                    max_depth = 10, learning_rate=0.01, gamma= 0,
                    colsample_bytree = 0.6)

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.78 std: 0.011
Balance Accuracy: 0.77 std: 0.012
matthews_corrcoef: 0.549 std: 0.024
f1_score: 0.778 std: 0.011
AUROC: 0.77 std: 0.012
CPU times: total: 2min 2s
Wall time: 20.6 s


['../results/Base_models/Toxic_xgboost_maccs_CVScore']

In [14]:
%%time
#yes
# things that need to change from model to model
# (1) descriptor (2) algorithm (3) model (4) model in the Classification_meta_features

endpoint = 'Toxic'
descriptor = 'rdkit2d'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.8, n_estimators = 1500, min_child_weight=3,
                    max_depth = 10, learning_rate=0.01, gamma= 0,
                    colsample_bytree = 0.6)

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.787 std: 0.014
Balance Accuracy: 0.779 std: 0.014
matthews_corrcoef: 0.564 std: 0.029
f1_score: 0.786 std: 0.014
AUROC: 0.779 std: 0.014
CPU times: total: 14min 7s
Wall time: 2min 21s


['../results/Base_models/Toxic_xgboost_rdkit2d_CVScore']

In [15]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) model passed to Classification_meta_features

endpoint = 'Toxic'
descriptor = 'mordred'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.6, n_estimators = 500, min_child_weight=1,
                    max_depth = 10, learning_rate=0.01, gamma= 1,
                    colsample_bytree = 0.7)

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.791 std: 0.017
Balance Accuracy: 0.781 std: 0.017
matthews_corrcoef: 0.572 std: 0.035
f1_score: 0.789 std: 0.017
AUROC: 0.781 std: 0.017
CPU times: total: 52min 42s
Wall time: 8min 50s


['../results/Base_models/Toxic_xgboost_mordred_CVScore']
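
In each binary result block above, the reported AUROC equals the balanced accuracy to three decimals (for example 0.781 and 0.781 for the Mordred model). One possible explanation, which the notebook does not confirm, is that `report_clf_models` computes AUROC from hard class labels rather than from predicted probabilities; for binary labels the two quantities then coincide. The sketch below demonstrates that identity on synthetic data. All names and data are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.6, 0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
labels = model.predict(X_te)              # hard 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
bal_acc = balanced_accuracy_score(y_te, labels)

print(f"AUROC from labels:        {auc_from_labels:.3f}")
print(f"Balanced accuracy:        {bal_acc:.3f}")
print(f"AUROC from probabilities: {auc_from_scores:.3f}")
```

If that is what happens here, recomputing AUROC from the saved out-of-fold probabilities would give a threshold-independent value.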

### kNN

In [16]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'ecfp6bits'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = KNeighborsClassifier(metric = 'dice', n_neighbors = 19, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




Accuracy: 0.746 std: 0.014
Balance Accuracy: 0.738 std: 0.013
matthews_corrcoef: 0.48 std: 0.028
f1_score: 0.745 std: 0.013
AUROC: 0.738 std: 0.013
CPU times: total: 3min 28s
Wall time: 3min 28s


['../results/Base_models/Toxic_knn_ecfp6bits_CVScore']

In [17]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'ecfp6counts'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = KNeighborsClassifier(metric = 'braycurtis', n_neighbors = 15, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.744 std: 0.008
Balance Accuracy: 0.735 std: 0.008
matthews_corrcoef: 0.475 std: 0.017
f1_score: 0.743 std: 0.008
AUROC: 0.735 std: 0.008
CPU times: total: 4min 40s
Wall time: 37.8 s


['../results/Base_models/Toxic_knn_ecfp6counts_CVScore']

In [18]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'maccs'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = KNeighborsClassifier(metric = 'rogerstanimoto', n_neighbors = 19, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




Accuracy: 0.763 std: 0.016
Balance Accuracy: 0.752 std: 0.015
matthews_corrcoef: 0.513 std: 0.032
f1_score: 0.761 std: 0.015
AUROC: 0.752 std: 0.015
CPU times: total: 28 s
Wall time: 27.9 s


['../results/Base_models/Toxic_knn_maccs_CVScore']

In [19]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'rdkit2d'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = KNeighborsClassifier(n_neighbors = 15, p = 1, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.763 std: 0.014
Balance Accuracy: 0.754 std: 0.014
matthews_corrcoef: 0.515 std: 0.03
f1_score: 0.762 std: 0.014
AUROC: 0.754 std: 0.014
CPU times: total: 25 s
Wall time: 3.19 s


['../results/Base_models/Toxic_knn_rdkit2d_CVScore']

In [20]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'mordred'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = KNeighborsClassifier(n_neighbors = 15, p = 1, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.77 std: 0.017
Balance Accuracy: 0.763 std: 0.018
matthews_corrcoef: 0.529 std: 0.036
f1_score: 0.769 std: 0.017
AUROC: 0.763 std: 0.018
CPU times: total: 55.1 s
Wall time: 6.94 s


['../results/Base_models/Toxic_knn_mordred_CVScore']
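
The kNN cells above pair each descriptor type with a different distance: `dice` for ECFP6 bits, `braycurtis` for ECFP6 counts, `rogerstanimoto` for MACCS, and Manhattan distance (`p = 1`) for the RDKit 2D and Mordred descriptors. As a minimal sketch of the first case, the snippet below fits a `KNeighborsClassifier` with the Dice distance on random boolean vectors standing in for fingerprints. The data, the sizes, and the explicit `algorithm="brute"` setting are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Random boolean "fingerprints": 200 molecules x 64 bits (synthetic, illustration only).
X = rng.random((200, 64)) < 0.2
y = rng.integers(0, 2, size=200)

# Dice distance is defined on boolean vectors, so the input is kept as dtype bool.
knn = KNeighborsClassifier(metric="dice", n_neighbors=19,
                           weights="distance", algorithm="brute")
knn.fit(X, y)

# One probability column per class, in the order given by knn.classes_.
proba = knn.predict_proba(X[:5])
print(proba.shape)
```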

### SVM

In [21]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'mordred'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = SVC(random_state=42, probability=True,
          C = 10, gamma = 0.1, kernel = 'rbf')

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=6, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.778 std: 0.016
Balance Accuracy: 0.77 std: 0.016
matthews_corrcoef: 0.545 std: 0.033
f1_score: 0.777 std: 0.016
AUROC: 0.77 std: 0.016
CPU times: total: 1min 31s
Wall time: 8min 32s


['../results/Base_models/Toxic_svm_mordred_CVScore']

In [22]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'rdkit2d'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = SVC(random_state=42, probability=True,
          C = 1, gamma = 1, kernel = 'rbf')

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=6, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.768 std: 0.015
Balance Accuracy: 0.757 std: 0.015
matthews_corrcoef: 0.523 std: 0.03
f1_score: 0.766 std: 0.015
AUROC: 0.757 std: 0.015
CPU times: total: 26.7 s
Wall time: 4min 2s


['../results/Base_models/Toxic_svm_rdkit2d_CVScore']

In [23]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'maccs'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = SVC(random_state=42, probability=True,
          C = 1, gamma = 0.1, kernel = 'rbf')

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=6, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.781 std: 0.014
Balance Accuracy: 0.772 std: 0.014
matthews_corrcoef: 0.552 std: 0.03
f1_score: 0.78 std: 0.014
AUROC: 0.772 std: 0.014
CPU times: total: 27.2 s
Wall time: 4min 5s


['../results/Base_models/Toxic_svm_maccs_CVScore']

In [24]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'ecfp6bits'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = SVC(random_state=42, probability=True,
          C = 1, gamma = 0.01, kernel = 'rbf')

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=6, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.742 std: 0.008
Balance Accuracy: 0.729 std: 0.008
matthews_corrcoef: 0.469 std: 0.017
f1_score: 0.739 std: 0.007
AUROC: 0.729 std: 0.008
CPU times: total: 6min 2s
Wall time: 34min 15s


['../results/Base_models/Toxic_svm_ecfp6bits_CVScore']

In [25]:
%%time
#yes
# Things that need to change from model to model:
# (1) descriptor (2) algorithm (3) model (4) features

endpoint = 'Toxic'
descriptor = 'ecfp6counts'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_toxic = joblib.load('../encoder_models/encoder_toxic.joblib')

# model
clf = SVC(random_state=42, probability=True,
          C = 1, gamma = 0.01, kernel = 'rbf')

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'toxic', encoder = encoder_toxic)

# results
BCM_mf,  BCM_oof, BCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=6, 
                                                      col_names = [f'{name}-0', f'{name}-1'])
# report the results
report_clf_models(cv_score)

# Save results
BCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', BCM_oof)
joblib.dump(BCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.738 std: 0.013
Balance Accuracy: 0.723 std: 0.014
matthews_corrcoef: 0.46 std: 0.029
f1_score: 0.734 std: 0.013
AUROC: 0.723 std: 0.014
CPU times: total: 4min 49s
Wall time: 36min 14s


['../results/Base_models/Toxic_svm_ecfp6counts_CVScore']
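
Every SVM cell sets `probability=True`, which the two-column meta-feature output depends on, since class probabilities are otherwise not available from `SVC`. In scikit-learn this option adds an internal cross-validated calibration step during `fit`, so training costs more than for a plain `SVC`; that may contribute to the long wall times above, although the notebook does not break the timing down. A minimal sketch on synthetic data, with hyperparameters copied from the MACCS cell purely as an example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem standing in for a descriptor table.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# probability=True must be set before fit; it enables predict_proba.
svm = SVC(random_state=42, probability=True, C=1, gamma=0.1, kernel="rbf")
svm.fit(X, y)

# Columns follow the order of svm.classes_.
proba = svm.predict_proba(X[:5])
print(proba.shape)
```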

## Endpoint 2: EPA category (multiclass classification, classes 1-4)

### Random Forest

In [26]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'mordred'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
rf_clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 10, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=50, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(rf_clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.635 std: 0.016
Balance Accuracy: 0.531 std: 0.013
matthews_corrcoef: 0.41 std: 0.026
f1_score: 0.614 std: 0.017
AUROC: 0.676 std: 0.012
CPU times: total: 1h 24min 13s
Wall time: 14min 19s


['../results/Base_models/EPA_RF_mordred_CVScore']

In [27]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'rdkit2d'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 500, min_samples_split = 2, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=65, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.634 std: 0.012
Balance Accuracy: 0.533 std: 0.014
matthews_corrcoef: 0.409 std: 0.013
f1_score: 0.615 std: 0.014
AUROC: 0.677 std: 0.007
CPU times: total: 8min 44s
Wall time: 1min 34s


['../results/Base_models/EPA_RF_rdkit2d_CVScore']

In [28]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'maccs'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 5, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=35, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.629 std: 0.013
Balance Accuracy: 0.53 std: 0.014
matthews_corrcoef: 0.401 std: 0.023
f1_score: 0.609 std: 0.014
AUROC: 0.675 std: 0.011
CPU times: total: 8min 33s
Wall time: 2min 11s


['../results/Base_models/EPA_RF_maccs_CVScore']

In [29]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6bits'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 5, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=50, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.612 std: 0.019
Balance Accuracy: 0.483 std: 0.017
matthews_corrcoef: 0.368 std: 0.024
f1_score: 0.569 std: 0.021
AUROC: 0.641 std: 0.011
CPU times: total: 1h 34min 21s
Wall time: 16min 8s


['../results/Base_models/EPA_RF_ecfp6bits_CVScore']

In [30]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6counts'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = RandomForestClassifier(random_state =42, n_jobs=6,
                              n_estimators = 1500, min_samples_split = 5, min_samples_leaf=2,
                              max_features = 'sqrt', max_depth=None, bootstrap= False)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.615 std: 0.019
Balance Accuracy: 0.484 std: 0.017
matthews_corrcoef: 0.372 std: 0.03
f1_score: 0.574 std: 0.022
AUROC: 0.643 std: 0.013
CPU times: total: 33min 17s
Wall time: 5min 57s


['../results/Base_models/EPA_RF_ecfp6counts_CVScore']
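
For the EPA endpoint the meta-feature table has four probability columns, one per category (`{name}-1` to `{name}-4`), and the report prints the same five metrics as in the binary case. How `report_clf_models` averages them across classes is not shown. The sketch below computes them for a synthetic 4-class problem, assuming weighted F1 and one-vs-rest AUROC; both averaging choices are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 4-class problem standing in for the encoded EPA categories.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
proba = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")
pred = proba.argmax(axis=1)  # class with the highest out-of-fold probability

scores = {
    "accuracy": accuracy_score(y, pred),
    "balanced_accuracy": balanced_accuracy_score(y, pred),
    "mcc": matthews_corrcoef(y, pred),
    "f1_weighted": f1_score(y, pred, average="weighted"),
    # One-vs-rest AUROC needs the full probability matrix (rows summing to 1).
    "auroc_ovr": roc_auc_score(y, proba, multi_class="ovr"),
}
for key, value in scores.items():
    print(f"{key}: {value:.3f}")
```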

### XGBoost

In [31]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6bits'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.7, n_estimators = 500, min_child_weight=5,
                    max_depth = 6, learning_rate=0.1, gamma= 0,
                    colsample_bytree = 0.8)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_bits, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.599 std: 0.013
Balance Accuracy: 0.505 std: 0.019
matthews_corrcoef: 0.35 std: 0.017
f1_score: 0.577 std: 0.015
AUROC: 0.652 std: 0.008
CPU times: total: 24min 28s
Wall time: 4min 8s


['../results/Base_models/EPA_xgboost_ecfp6bits_CVScore']

In [32]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6counts'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.7, n_estimators = 1500, min_child_weight=3,
                    max_depth = 10, learning_rate=0.1, gamma= 0,
                    colsample_bytree = 0.7)

#input
a, b,c,d,e = prepare_input(train_labels, train_ecfp6_counts, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.605 std: 0.011
Balance Accuracy: 0.531 std: 0.022
matthews_corrcoef: 0.369 std: 0.027
f1_score: 0.593 std: 0.014
AUROC: 0.667 std: 0.014
CPU times: total: 1h 31min 8s
Wall time: 15min 17s


['../results/Base_models/EPA_xgboost_ecfp6counts_CVScore']

In [33]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'maccs'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.7, n_estimators = 1500, min_child_weight=1,
                    max_depth = 6, learning_rate=0.01, gamma= 0,
                    colsample_bytree = 0.5)

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.626 std: 0.015
Balance Accuracy: 0.532 std: 0.01
matthews_corrcoef: 0.397 std: 0.018
f1_score: 0.606 std: 0.017
AUROC: 0.675 std: 0.01
CPU times: total: 11min 58s
Wall time: 2min 1s


['../results/Base_models/EPA_xgboost_maccs_CVScore']

In [34]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'rdkit2d'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 0.8, n_estimators = 1500, min_child_weight=3,
                    max_depth = 10, learning_rate=0.1, gamma= 0,
                    colsample_bytree = 0.5)

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.626 std: 0.01
Balance Accuracy: 0.551 std: 0.013
matthews_corrcoef: 0.403 std: 0.018
f1_score: 0.616 std: 0.011
AUROC: 0.683 std: 0.009
CPU times: total: 34min 20s
Wall time: 5min 45s


['../results/Base_models/EPA_xgboost_rdkit2d_CVScore']

In [35]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'mordred'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = XGBClassifier(random_state =123, n_jobs=6,
                    subsample = 1.0, n_estimators = 1500, min_child_weight=1,
                    max_depth = 10, learning_rate=0.1, gamma= 0,
                    colsample_bytree = 0.6)

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.638 std: 0.012
Balance Accuracy: 0.556 std: 0.011
matthews_corrcoef: 0.42 std: 0.016
f1_score: 0.623 std: 0.013
AUROC: 0.688 std: 0.008
CPU times: total: 3h 5min 11s
Wall time: 31min 5s


['../results/Base_models/EPA_xgboost_mordred_CVScore']
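
The EPA cells differ only in the descriptor table, the estimator, and its tuned hyperparameters; everything else is repeated verbatim. One possible way to reduce the repetition is to keep those three pieces in a small configuration list and loop over it. The sketch below is not the notebook's code: it uses synthetic arrays in place of the real descriptor tables, placeholder hyperparameters, and a plain `cross_val_score` call in place of `Classification_meta_features`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for two descriptor tables that share the same labels.
X_all, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                               n_classes=4, random_state=0)
descriptors = {"demoA": X_all[:, :15], "demoB": X_all[:, 15:]}

# (descriptor, algorithm, estimator); hyperparameters are placeholders.
configs = [
    ("demoA", "RF", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("demoB", "knn", KNeighborsClassifier(n_neighbors=9, p=1, weights="distance")),
]

endpoint = "EPA"
results = {}
for descriptor, algorithm, clf in configs:
    name = f"{endpoint}_{algorithm}_{descriptor}"
    # Stratified 5-fold accuracy as a simple placeholder for the full report.
    acc = cross_val_score(clf, descriptors[descriptor], y, cv=5, scoring="accuracy")
    results[name] = (acc.mean(), acc.std())
    print(f"{name}: accuracy {acc.mean():.3f} std {acc.std():.3f}")
```

Keeping the naming rule in one place also makes it harder for a saved file name to drift from the model that produced it.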

### kNN

In [36]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'mordred'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = KNeighborsClassifier(n_neighbors = 9, p = 1, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_mordred, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.615 std: 0.01
Balance Accuracy: 0.549 std: 0.012
matthews_corrcoef: 0.389 std: 0.016
f1_score: 0.605 std: 0.011
AUROC: 0.68 std: 0.008
CPU times: total: 1min
Wall time: 7.72 s


['../results/Base_models/EPA_knn_mordred_CVScore']

In [37]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'rdkit2d'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = KNeighborsClassifier(n_neighbors = 9, p = 1, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_rdkit2d, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.61 std: 0.018
Balance Accuracy: 0.542 std: 0.015
matthews_corrcoef: 0.381 std: 0.02
f1_score: 0.6 std: 0.019
AUROC: 0.677 std: 0.012
CPU times: total: 27.4 s
Wall time: 3.51 s


['../results/Base_models/EPA_knn_rdkit2d_CVScore']

In [38]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'maccs'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model
clf = KNeighborsClassifier(metric = 'rogerstanimoto', n_neighbors = 15, weights = 'distance')

#input
a, b,c,d,e = prepare_input(train_labels, train_maccs, target = 'EPA_category', encoder = encoder_epa)

# results
MCM_mf,  MCM_oof, MCM_base_model, cv_score  = Classification_meta_features(clf, a, c, b, d, e,cv=10,n_jobs=1, 
                                                      col_names = [f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




Accuracy: 0.603 std: 0.012
Balance Accuracy: 0.525 std: 0.015
matthews_corrcoef: 0.367 std: 0.019
f1_score: 0.59 std: 0.015
AUROC: 0.668 std: 0.01
CPU times: total: 23.8 s
Wall time: 23.5 s


['../results/Base_models/EPA_knn_maccs_CVScore']

In [39]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6counts'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: kNN with the Bray-Curtis metric and inverse-distance weighting
clf = KNeighborsClassifier(metric='braycurtis', n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.605 std: 0.02
Balance Accuracy: 0.538 std: 0.019
matthews_corrcoef: 0.371 std: 0.029
f1_score: 0.593 std: 0.021
AUROC: 0.67 std: 0.015
CPU times: total: 5min 6s
Wall time: 41.5 s


['../results/Base_models/EPA_knn_ecfp6counts_CVScore']

In [40]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6bits'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: kNN with the Dice metric and inverse-distance weighting
clf = KNeighborsClassifier(metric='dice', n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




Accuracy: 0.605 std: 0.022
Balance Accuracy: 0.536 std: 0.023
matthews_corrcoef: 0.372 std: 0.032
f1_score: 0.594 std: 0.023
AUROC: 0.672 std: 0.017
CPU times: total: 3min 13s
Wall time: 3min 14s


['../results/Base_models/EPA_knn_ecfp6bits_CVScore']
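
The kNN cells above pair each fingerprint with a different distance: `rogerstanimoto` for MACCS, `braycurtis` for ECFP6 counts and `dice` for ECFP6 bits. As a rough illustration of what these distances measure, here is a pure-Python sketch that follows the SciPy definitions. The function names and the toy vectors are made up for this example and are not part of the notebook.

```python
def _bool_counts(u, v):
    """Return (c_tt, c_tf, c_ft, c_ff) for two equal-length 0/1 vectors."""
    c_tt = sum(1 for x, y in zip(u, v) if x and y)
    c_tf = sum(1 for x, y in zip(u, v) if x and not y)
    c_ft = sum(1 for x, y in zip(u, v) if not x and y)
    c_ff = sum(1 for x, y in zip(u, v) if not x and not y)
    return c_tt, c_tf, c_ft, c_ff


def dice_distance(u, v):
    # Ignores positions where both bits are off.
    c_tt, c_tf, c_ft, _ = _bool_counts(u, v)
    return (c_tf + c_ft) / (2 * c_tt + c_tf + c_ft)


def rogers_tanimoto_distance(u, v):
    # Counts shared "off" bits (c_ff) as agreement; mismatches weigh double.
    c_tt, c_tf, c_ft, c_ff = _bool_counts(u, v)
    r = 2 * (c_tf + c_ft)
    return r / (c_tt + c_ff + r)


def bray_curtis_distance(u, v):
    # Works on counts: total absolute difference over total abundance.
    num = sum(abs(x - y) for x, y in zip(u, v))
    den = sum(abs(x + y) for x, y in zip(u, v))
    return num / den


# Toy vectors (illustrative only).
bits_a, bits_b = [1, 1, 0, 0], [1, 0, 1, 0]
counts_a, counts_b = [2, 0, 1], [1, 1, 1]

print(dice_distance(bits_a, bits_b))             # 0.5
print(rogers_tanimoto_distance(bits_a, bits_b))  # about 0.667
print(bray_curtis_distance(counts_a, counts_b))  # about 0.333
```

The sketch does not guard against all-zero vectors, where the Dice and Bray-Curtis denominators vanish.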

### SVM

In [41]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'mordred'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: RBF-kernel SVC; probability=True is needed for predict_proba
clf = SVC(random_state=42, probability=True,
          C=10, gamma=1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_mordred, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.605 std: 0.01
Balance Accuracy: 0.545 std: 0.015
matthews_corrcoef: 0.375 std: 0.021
f1_score: 0.597 std: 0.012
AUROC: 0.672 std: 0.01
CPU times: total: 42min 9s
Wall time: 42min 21s


['../results/Base_models/EPA_svm_mordred_CVScore']

In [42]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'rdkit2d'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: RBF-kernel SVC; probability=True is needed for predict_proba
clf = SVC(random_state=42, probability=True,
          C=10, gamma=1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_rdkit2d, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.61 std: 0.016
Balance Accuracy: 0.55 std: 0.016
matthews_corrcoef: 0.386 std: 0.019
f1_score: 0.604 std: 0.018
AUROC: 0.681 std: 0.011
CPU times: total: 10min 3s
Wall time: 11min 44s


['../results/Base_models/EPA_svm_rdkit2d_CVScore']

In [43]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'maccs'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: RBF-kernel SVC; probability=True is needed for predict_proba
clf = SVC(random_state=42, probability=True,
          C=100, gamma=0.1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_maccs, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


Accuracy: 0.598 std: 0.008
Balance Accuracy: 0.54 std: 0.009
matthews_corrcoef: 0.37 std: 0.01
f1_score: 0.593 std: 0.008
AUROC: 0.675 std: 0.006
CPU times: total: 12min 31s
Wall time: 12min 36s


['../results/Base_models/EPA_svm_maccs_CVScore']

In [44]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6bits'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: RBF-kernel SVC; probability=True is needed for predict_proba
clf = SVC(random_state=42, probability=True,
          C=10, gamma=0.01, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')

Accuracy: 0.588 std: 0.019
Balance Accuracy: 0.522 std: 0.019
matthews_corrcoef: 0.346 std: 0.031
f1_score: 0.578 std: 0.02
AUROC: 0.66 std: 0.015
CPU times: total: 2h 46min 20s
Wall time: 2h 47min 40s


['../results/Base_models/EPA_svm_ecfp6bits_CVScore']

In [45]:
%%time
#yes
endpoint = 'EPA'
descriptor = 'ecfp6counts'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

encoder_epa = joblib.load('../encoder_models/encoder_epa.joblib')

# model: RBF-kernel SVC; probability=True is needed for predict_proba
clf = SVC(random_state=42, probability=True,
          C=10, gamma=0.01, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='EPA_category', encoder=encoder_epa)

# results
MCM_mf, MCM_oof, MCM_base_model, cv_score = Classification_meta_features(
    clf, a, c, b, d, e, cv=10, n_jobs=1,
    col_names=[f'{name}-1', f'{name}-2', f'{name}-3', f'{name}-4'])
# report the results
report_clf_models(cv_score)

# Save results
MCM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', MCM_oof)
joblib.dump(MCM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')

Accuracy: 0.594 std: 0.012
Balance Accuracy: 0.528 std: 0.016
matthews_corrcoef: 0.356 std: 0.025
f1_score: 0.585 std: 0.012
AUROC: 0.664 std: 0.012
CPU times: total: 2h 18min 1s
Wall time: 2h 42min 57s


['../results/Base_models/EPA_svm_ecfp6counts_CVScore']
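
Every cell in this section writes the same four artifacts, with file names derived from `endpoint`, `algorithm` and `descriptor`. Below is a small sketch of that naming scheme. `artifact_paths` is a hypothetical helper (the notebook builds these strings inline), and `EPA_SVM_PARAMS` simply restates the `C` and `gamma` values from the SVM cells above.

```python
# Hypothetical helper: reproduces the f-string paths used in the cells above.
def artifact_paths(endpoint, algorithm, descriptor):
    name = f'{endpoint}_{algorithm}_{descriptor}'
    return {
        'meta_features': f'../data/Hmodel_features/{name}.csv',
        'oof': f'../results/Base_models/{name}.npy',
        'model': f'../models/Base_models/{name}.pkl',
        'cv_score': f'../results/Base_models/{name}_CVScore',
    }


# C / gamma per descriptor, copied from the EPA SVM cells (kernel is 'rbf' throughout).
EPA_SVM_PARAMS = {
    'mordred': {'C': 10, 'gamma': 1},
    'rdkit2d': {'C': 10, 'gamma': 1},
    'maccs': {'C': 100, 'gamma': 0.1},
    'ecfp6bits': {'C': 10, 'gamma': 0.01},
    'ecfp6counts': {'C': 10, 'gamma': 0.01},
}

for descriptor, params in EPA_SVM_PARAMS.items():
    paths = artifact_paths('EPA', 'svm', descriptor)
    print(descriptor, params, paths['cv_score'])
```

Keeping the settings in one table like this makes it easier to see that only `C`, `gamma` and the descriptor matrix change between the SVM cells.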

## Endpoint 3: LD50

### Random Forest

In [46]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'mordred'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: random forest; with bootstrap=False every tree is fit on the full training set
rf_reg = RandomForestRegressor(random_state=42, n_jobs=6,
                               n_estimators=500, min_samples_split=5, min_samples_leaf=2,
                               max_features='sqrt', max_depth=80, bootstrap=False)

# input
a, b, c, d, e = prepare_input(train_labels, train_mordred, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    rf_reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.593 std: 0.026
R2: 0.568 std: 0.031
MAE: 0.437 std: 0.017
MSE: 0.353 std: 0.032
CPU times: total: 18min 16s
Wall time: 3min 8s


['../results/Base_models/LD50_RF_mordred_CVScore']
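
`Regression_meta_features` is defined elsewhere in the project, so its internals are not shown here. Judging only by the names of its outputs (`RM_oof`, `RM_mf`), it appears to collect out-of-fold predictions. The sketch below illustrates that general idea with a trivial mean predictor and contiguous folds. Everything in it (`kfold_indices`, `oof_mean_predictions`, the toy targets) is hypothetical and stands in for the real helper.

```python
def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def oof_mean_predictions(y, n_splits):
    """Predict each held-out fold with the mean target of the remaining folds."""
    oof = [None] * len(y)
    for fold in kfold_indices(len(y), n_splits):
        held_out = set(fold)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)
        for i in fold:
            oof[i] = prediction
    return oof


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy targets
print(oof_mean_predictions(y, n_splits=3))  # [4.5, 4.5, 3.5, 3.5, 2.5, 2.5]
```

Each sample is predicted by a model that never saw it, which is what allows such predictions to serve as inputs to a second-level model without leaking the target.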

In [47]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'rdkit2d'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: random forest; with bootstrap=False every tree is fit on the full training set
rf_reg = RandomForestRegressor(random_state=42, n_jobs=6,
                               n_estimators=1500, min_samples_split=2, min_samples_leaf=2,
                               max_features='log2', max_depth=20, bootstrap=False)

# input
a, b, c, d, e = prepare_input(train_labels, train_rdkit2d, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    rf_reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.602 std: 0.024
R2: 0.556 std: 0.027
MAE: 0.443 std: 0.016
MSE: 0.363 std: 0.029
CPU times: total: 10min 16s
Wall time: 1min 54s


['../results/Base_models/LD50_RF_rdkit2d_CVScore']

In [48]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'maccs'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: random forest; with bootstrap=False every tree is fit on the full training set
rf_reg = RandomForestRegressor(random_state=42, n_jobs=6,
                               n_estimators=500, min_samples_split=2, min_samples_leaf=2,
                               max_features='sqrt', max_depth=None, bootstrap=False)

# input
a, b, c, d, e = prepare_input(train_labels, train_maccs, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    rf_reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.597 std: 0.024
R2: 0.563 std: 0.032
MAE: 0.44 std: 0.017
MSE: 0.357 std: 0.029
CPU times: total: 1min 44s
Wall time: 21.4 s


['../results/Base_models/LD50_RF_maccs_CVScore']

In [49]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6counts'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: random forest; with bootstrap=False every tree is fit on the full training set
rf_reg = RandomForestRegressor(random_state=42, n_jobs=6,
                               n_estimators=1500, min_samples_split=5, min_samples_leaf=2,
                               max_features='sqrt', max_depth=None, bootstrap=False)

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    rf_reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.641 std: 0.021
R2: 0.495 std: 0.028
MAE: 0.481 std: 0.016
MSE: 0.412 std: 0.028
CPU times: total: 23min 12s
Wall time: 4min 4s


['../results/Base_models/LD50_RF_ecfp6counts_CVScore']

In [50]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6bits'
algorithm = 'RF'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: random forest; bootstrap=True fits each tree on a resampled training set
rf_reg = RandomForestRegressor(random_state=42, n_jobs=6,
                               n_estimators=1500, min_samples_split=2, min_samples_leaf=2,
                               max_features='sqrt', max_depth=50, bootstrap=True)

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    rf_reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.651 std: 0.021
R2: 0.48 std: 0.024
MAE: 0.488 std: 0.015
MSE: 0.425 std: 0.027
CPU times: total: 41min 16s
Wall time: 7min 9s


['../results/Base_models/LD50_RF_ecfp6bits_CVScore']
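
The reports above list both RMSE and MSE. If the RMSE is averaged per fold (an assumption, since `report_cv_reg_models` is not shown), its mean need not equal the square root of the mean MSE, although for the values printed here the two are close: sqrt(0.425) is about 0.652 against a reported RMSE of 0.651. A sketch with made-up fold scores:

```python
import math
import statistics

# Hypothetical per-fold MSE values, for illustration only.
fold_mse = [0.40, 0.45, 0.38, 0.47]

fold_rmse = [math.sqrt(m) for m in fold_mse]

mean_of_rmse = statistics.mean(fold_rmse)             # average of per-fold RMSE
rmse_of_mean = math.sqrt(statistics.mean(fold_mse))   # sqrt of the average MSE

# By Jensen's inequality the first value can never exceed the second.
print(round(mean_of_rmse, 4), round(rmse_of_mean, 4))
```

The gap grows with the spread of the fold scores, so it is worth stating which of the two aggregates a table reports.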

### XGBoost

In [51]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6bits'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: gradient-boosted trees; many estimators paired with a low learning rate
reg = XGBRegressor(random_state=123, n_jobs=6,
                   subsample=0.6, n_estimators=1500, min_child_weight=1,
                   max_depth=10, learning_rate=0.01, gamma=0,
                   colsample_bytree=0.9)

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.618 std: 0.024
R2: 0.532 std: 0.029
MAE: 0.46 std: 0.018
MSE: 0.382 std: 0.029
CPU times: total: 29min 38s
Wall time: 5min 3s


['../results/Base_models/LD50_xgboost_ecfp6bits_CVScore']

In [52]:
%%time

endpoint = 'LD50'
descriptor = 'ecfp6counts'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: gradient-boosted trees; many estimators paired with a low learning rate
reg = XGBRegressor(random_state=123, n_jobs=6,
                   subsample=0.9, n_estimators=1500, min_child_weight=3,
                   max_depth=10, learning_rate=0.01, gamma=0,
                   colsample_bytree=0.5)

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.618 std: 0.024
R2: 0.532 std: 0.033
MAE: 0.462 std: 0.016
MSE: 0.382 std: 0.029
CPU times: total: 24min 1s
Wall time: 4min 3s


['../results/Base_models/LD50_xgboost_ecfp6counts_CVScore']

In [53]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'maccs'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: gradient-boosted trees; many estimators paired with a low learning rate
reg = XGBRegressor(random_state=123, n_jobs=6,
                   subsample=0.6, n_estimators=1500, min_child_weight=3,
                   max_depth=6, learning_rate=0.01, gamma=0,
                   colsample_bytree=0.5)

# input
a, b, c, d, e = prepare_input(train_labels, train_maccs, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.597 std: 0.026
R2: 0.563 std: 0.034
MAE: 0.443 std: 0.019
MSE: 0.356 std: 0.031
CPU times: total: 2min 50s
Wall time: 28.6 s


['../results/Base_models/LD50_xgboost_maccs_CVScore']

In [54]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'rdkit2d'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: gradient-boosted trees; many estimators paired with a low learning rate
reg = XGBRegressor(random_state=123, n_jobs=6,
                   subsample=0.7, n_estimators=1500, min_child_weight=5,
                   max_depth=10, learning_rate=0.01, gamma=0,
                   colsample_bytree=0.6)

# input
a, b, c, d, e = prepare_input(train_labels, train_rdkit2d, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.585 std: 0.024
R2: 0.581 std: 0.026
MAE: 0.43 std: 0.017
MSE: 0.342 std: 0.028
CPU times: total: 30min 19s
Wall time: 5min 6s


['../results/Base_models/LD50_xgboost_rdkit2d_CVScore']

In [55]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'mordred'
algorithm = 'xgboost'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: gradient-boosted trees; many estimators paired with a low learning rate
reg = XGBRegressor(random_state=123, n_jobs=6,
                   subsample=0.7, n_estimators=1500, min_child_weight=1,
                   max_depth=6, learning_rate=0.01, gamma=0,
                   colsample_bytree=0.7)

# input
a, b, c, d, e = prepare_input(train_labels, train_mordred, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.58 std: 0.027
R2: 0.587 std: 0.032
MAE: 0.429 std: 0.019
MSE: 0.338 std: 0.032
CPU times: total: 56min 22s
Wall time: 9min 27s


['../results/Base_models/LD50_xgboost_mordred_CVScore']

### kNN

In [56]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'mordred'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: kNN regressor with Manhattan distance (p=1) and inverse-distance weighting
reg = KNeighborsRegressor(p=1, n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_mordred, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.608 std: 0.023
R2: 0.546 std: 0.03
MAE: 0.443 std: 0.016
MSE: 0.371 std: 0.028
CPU times: total: 46.9 s
Wall time: 5.94 s


['../results/Base_models/LD50_knn_mordred_CVScore']

In [57]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'rdkit2d'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: kNN regressor with Manhattan distance (p=1) and inverse-distance weighting
reg = KNeighborsRegressor(p=1, n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_rdkit2d, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.626 std: 0.024
R2: 0.518 std: 0.037
MAE: 0.456 std: 0.017
MSE: 0.393 std: 0.031
CPU times: total: 22.4 s
Wall time: 2.9 s


['../results/Base_models/LD50_knn_rdkit2d_CVScore']

In [58]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'maccs'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: kNN regressor with the Rogers-Tanimoto metric and inverse-distance weighting
reg = KNeighborsRegressor(metric='rogerstanimoto', n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_maccs, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




RMSE: 0.632 std: 0.029
R2: 0.51 std: 0.045
MAE: 0.462 std: 0.017
MSE: 0.4 std: 0.037
CPU times: total: 20 s
Wall time: 19.8 s


['../results/Base_models/LD50_knn_maccs_CVScore']

In [59]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6counts'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: kNN regressor with the Bray-Curtis metric and inverse-distance weighting
reg = KNeighborsRegressor(metric='braycurtis', n_neighbors=9, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.638 std: 0.024
R2: 0.501 std: 0.039
MAE: 0.471 std: 0.016
MSE: 0.407 std: 0.03
CPU times: total: 3min 34s
Wall time: 28.6 s


['../results/Base_models/LD50_knn_ecfp6counts_CVScore']

In [60]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6bits'
algorithm = 'knn'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: kNN regressor with the Dice metric and inverse-distance weighting
reg = KNeighborsRegressor(metric='dice', n_neighbors=15, weights='distance')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=1, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')




RMSE: 0.646 std: 0.017
R2: 0.488 std: 0.026
MAE: 0.476 std: 0.014
MSE: 0.418 std: 0.022
CPU times: total: 2min 36s
Wall time: 2min 38s


['../results/Base_models/LD50_knn_ecfp6bits_CVScore']
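
All kNN models here use `weights='distance'`, which in scikit-learn weights each neighbour by the inverse of its distance to the query. Below is a minimal pure-Python sketch of that weighting for a single query point. `idw_predict` is a made-up name, and the zero-distance branch mirrors scikit-learn's behaviour of letting exact matches decide the prediction.

```python
def idw_predict(distances, targets):
    """Inverse-distance-weighted average of neighbour targets for one query."""
    exact = [t for d, t in zip(distances, targets) if d == 0]
    if exact:
        # Neighbours at zero distance take all the weight.
        return sum(exact) / len(exact)
    weights = [1.0 / d for d in distances]
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)


# Toy neighbours (illustrative only): distances to the query and their targets.
print(idw_predict([1.0, 2.0], [0.0, 3.0]))  # 1.0
print(idw_predict([0.0, 2.0], [5.0, 3.0]))  # 5.0
```

With uniform weights the first query would give 1.5, so the closer neighbour visibly pulls the estimate toward its own target.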

### SVM

In [61]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'mordred'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: RBF-kernel support vector regression
reg = SVR(C=10, gamma=0.1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_mordred, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=6, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.612 std: 0.025
R2: 0.541 std: 0.026
MAE: 0.446 std: 0.015
MSE: 0.375 std: 0.03
CPU times: total: 9.95 s
Wall time: 1min 18s


['../results/Base_models/LD50_svm_mordred_CVScore']

In [62]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'rdkit2d'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: RBF-kernel support vector regression
reg = SVR(C=1, gamma=1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_rdkit2d, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=6, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.635 std: 0.025
R2: 0.506 std: 0.031
MAE: 0.461 std: 0.015
MSE: 0.404 std: 0.032
CPU times: total: 4.25 s
Wall time: 24.6 s


['../results/Base_models/LD50_svm_rdkit2d_CVScore']

In [63]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'maccs'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: RBF-kernel support vector regression
reg = SVR(C=10, gamma=0.1, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_maccs, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=6, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.61 std: 0.026
R2: 0.543 std: 0.031
MAE: 0.452 std: 0.019
MSE: 0.373 std: 0.031
CPU times: total: 4.45 s
Wall time: 23 s


['../results/Base_models/LD50_svm_maccs_CVScore']

In [64]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6bits'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: RBF-kernel support vector regression
reg = SVR(C=1, gamma=0.01, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_bits, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=6, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')


RMSE: 0.668 std: 0.02
R2: 0.454 std: 0.024
MAE: 0.49 std: 0.016
MSE: 0.446 std: 0.026
CPU times: total: 55.1 s
Wall time: 5min 37s


['../results/Base_models/LD50_svm_ecfp6bits_CVScore']

In [65]:
%%time
#yes
endpoint = 'LD50'
descriptor = 'ecfp6counts'
algorithm = 'svm'
name = f'{endpoint}_{algorithm}_{descriptor}'

# model: RBF-kernel support vector regression
reg = SVR(C=10, gamma=0.01, kernel='rbf')

# input
a, b, c, d, e = prepare_input(train_labels, train_ecfp6_counts, target='logLD50_mmolkg')

# results
RM_mf, RM_oof, RM_base_model, cv_score = Regression_meta_features(
    reg, a, c, b, d, e, cv=10, n_jobs=6, col_names=[name])
# report the results
report_cv_reg_models(cv_score)

# Save results
RM_mf.to_csv(f'../data/Hmodel_features/{name}.csv')
np.save(f'../results/Base_models/{name}.npy', RM_oof)
joblib.dump(RM_base_model, f'../models/Base_models/{name}.pkl')
joblib.dump(cv_score, f'../results/Base_models/{name}_CVScore')

RMSE: 0.644 std: 0.024
R2: 0.49 std: 0.042
MAE: 0.486 std: 0.02
MSE: 0.416 std: 0.031
CPU times: total: 39.6 s
Wall time: 4min 18s


['../results/Base_models/LD50_svm_ecfp6counts_CVScore']
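
To compare the LD50 base models at a glance, the cross-validated RMSE values printed in the cells above can be collected and ranked. The numbers below are copied from those outputs. The dictionary layout and variable names are only an illustration and do not come from the notebook.

```python
# Mean RMSE over the 10 CV folds (target: logLD50_mmolkg), copied from the outputs above.
ld50_rmse = {
    ('RF', 'mordred'): 0.593, ('RF', 'rdkit2d'): 0.602, ('RF', 'maccs'): 0.597,
    ('RF', 'ecfp6counts'): 0.641, ('RF', 'ecfp6bits'): 0.651,
    ('xgboost', 'ecfp6bits'): 0.618, ('xgboost', 'ecfp6counts'): 0.618,
    ('xgboost', 'maccs'): 0.597, ('xgboost', 'rdkit2d'): 0.585,
    ('xgboost', 'mordred'): 0.58,
    ('knn', 'mordred'): 0.608, ('knn', 'rdkit2d'): 0.626, ('knn', 'maccs'): 0.632,
    ('knn', 'ecfp6counts'): 0.638, ('knn', 'ecfp6bits'): 0.646,
    ('svm', 'mordred'): 0.612, ('svm', 'rdkit2d'): 0.635, ('svm', 'maccs'): 0.61,
    ('svm', 'ecfp6bits'): 0.668, ('svm', 'ecfp6counts'): 0.644,
}

# Lower RMSE is better, so sort ascending.
ranked = sorted(ld50_rmse.items(), key=lambda item: item[1])
for (algorithm, descriptor), rmse in ranked[:3]:
    print(f'{algorithm:8s}{descriptor:12s}{rmse:.3f}')

# First (best) descriptor seen for each algorithm in the ranked list.
best_per_algorithm = {}
for (algorithm, descriptor), rmse in ranked:
    best_per_algorithm.setdefault(algorithm, (descriptor, rmse))
print(best_per_algorithm)
```

The top three entries (XGBoost on mordred, XGBoost on rdkit2d, RF on mordred) differ by less than the fold-to-fold standard deviation reported for each of them, so the ranking is indicative rather than decisive.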