# AutoML
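Use the [AutoGluon](https://auto.gluon.ai/stable/index.html) automated machine learning framework to quickly search for the best-performing machine learning model for identifying proteins associated with LLPS (liquid-liquid phase separation).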


In [8]:
import numpy as np
import pandas as pd
import os
import torch

Embed = 'ESM'

DATA_PATH = '../dataset/dataset2.0/'


In [9]:
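def load_esm_embed(csv_file,embed_type):
    """Load one per-protein ESM embedding (.pt) and its label for each row of csv_file."""
    EMBED_PATH = DATA_PATH+embed_type+'_embed/'
    EMB_LAYER = 33  # layer whose mean representation is used as the feature vector
    Xs = []
    ys = []
    # Embeddings live in a folder named after the CSV file (without its extension)
    Embed_PATH = EMBED_PATH+csv_file.split('.')[0]
    data_df = pd.read_csv(DATA_PATH+csv_file)
    for index, row in data_df.iterrows():
        protein_id = row['id']  # protein identifier, also the .pt file name
        label = row['label']

        fn = f'{Embed_PATH}/{protein_id}.pt'
        embs = torch.load(fn)

        Xs.append(embs['mean_representations'][EMB_LAYER])
        ys.append(label)
    # Stack the per-protein vectors into one (n_proteins, n_features) array
    Xs = torch.stack(Xs, dim=0).numpy()
    print('load {} esm embedding'.format(csv_file))
    print(len(ys))
    print(Xs.shape)
    return Xs,ys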

def load_embed(csv_file,embed_type):
    EMBED_PATH = DATA_PATH+embed_type+'_embed/'
    Embed_PATH = EMBED_PATH+csv_file.split('.')[0]+'_embeds.npy'
    data_df =  pd.read_csv(DATA_PATH+csv_file)
    ys = data_df['label']
    Xs = np.load(Embed_PATH)
    print('load {} {}_embed embedding from {}'.format(csv_file,embed_type,Embed_PATH))
    print(len(ys))
    print(Xs.shape)
    return Xs,ys
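
`load_embed` pairs the labels from the CSV with the rows of the `.npy` matrix purely by position, so it assumes both have the same length and the same order. Below is a minimal sketch of a sanity check that could be run on the returned values. `check_alignment` is a hypothetical helper, not part of this notebook, and it can only verify the row count, not the ordering.

```python
import numpy as np

def check_alignment(Xs, ys):
    """Raise if the embedding matrix and the labels cannot be paired row by row."""
    Xs = np.asarray(Xs)
    if Xs.ndim != 2:
        raise ValueError(f'expected a 2-D embedding matrix, got shape {Xs.shape}')
    if Xs.shape[0] != len(ys):
        raise ValueError(f'{Xs.shape[0]} embeddings but {len(ys)} labels')
    return Xs.shape

# Toy stand-in for real embeddings: 4 proteins with 3 features each
shape = check_alignment(np.zeros((4, 3)), [1, 1, 0, 0])
```

Calling it right after `load_embed` would turn a silent mismatch into an immediate error.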





In [10]:
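def embedding(csv_file,embed='ESM'):
    # ESM-style embeddings are stored as one .pt file per protein;
    # every other embedding type is stored as a single .npy matrix.
    if embed in ('ESM', 'ESM2_15b'):
        Xs,ys = load_esm_embed(csv_file,embed)
    else:
        Xs,ys = load_embed(csv_file,embed)
    return Xs,ys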
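
One Python detail matters when dispatching on the embedding name: `embed == 'ESM' or 'ESM2_15b'` parses as `(embed == 'ESM') or 'ESM2_15b'`, and a non-empty string is always truthy, so that form takes the first branch for every input. A membership test avoids this. The sketch below uses two hypothetical helpers that return the loader name instead of loading anything, purely to show the difference.

```python
def pick_loader_or(embed):
    # The right-hand operand 'ESM2_15b' is a non-empty string, hence always truthy
    if embed == 'ESM' or 'ESM2_15b':
        return 'load_esm_embed'
    return 'load_embed'

def pick_loader_in(embed):
    # Membership test: true only for the two ESM-style names
    if embed in ('ESM', 'ESM2_15b'):
        return 'load_esm_embed'
    return 'load_embed'

for name in ['ESM', 'ESM2_15b', 'T5']:
    print(name, pick_loader_or(name), pick_loader_in(name))
```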


In [11]:
def Create_dataset(embed='ESM'):

    positive_data = 'positive_train_422.csv'
    negative_data = 'negative_train_3307.csv'
    Xs_p,ys_p = embedding(positive_data,embed)
    Xs_n,ys_n = embedding(negative_data,embed)
    p_df = pd.DataFrame(Xs_p)
    p_df['label'] = list(ys_p)
    n_df = pd.DataFrame(Xs_n)
    n_df['label'] = list(ys_n)
    return p_df,n_df
    

In [12]:
p_df,n_df = Create_dataset()

load positive_train_422.csv esm embedding
422
(422, 1280)
load negative_train_3307.csv esm embedding
3307
(3307, 1280)


## Addressing class imbalance
### Strategy 1: Oversampling
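
To put numbers on the imbalance, the short calculation below uses only the row counts printed by the loading cell above (422 positives, 3307 negatives). Since the training log further down reports `accuracy` as the evaluation metric, the accuracy of a trivial "always negative" classifier is a useful reference point. The variable names are illustrative.

```python
n_pos, n_neg = 422, 3307          # row counts printed by Create_dataset()
total = n_pos + n_neg

pos_frac = n_pos / total          # share of positive examples
neg_per_pos = n_neg / n_pos       # negatives per positive
majority_baseline_acc = n_neg / total  # accuracy of always predicting the negative class

print(f'total rows: {total}')
print(f'positive fraction: {pos_frac:.3f}')
print(f'negatives per positive: {neg_per_pos:.2f}')
print(f'majority-class baseline accuracy: {majority_baseline_acc:.3f}')
```
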

In [13]:
from autogluon.tabular import TabularDataset, TabularPredictor

def AutoML(sample=None):
    # embed_types = ['ESM','ProtBert_bfd','ProtBert','T5','UniRep','ESM_finetune']
    embed_types = ['T5']
    if sample == 'Oversampling':
        for embed in embed_types:
            p_df,n_df = Create_dataset(embed)
            # Replication is currently disabled, so the positives are used once;
            # uncomment the next line to repeat them 8 times.
            # p_df = pd.concat([p_df]*8)
            p_df = pd.concat([p_df])
            df = pd.concat([p_df,n_df])
            # Shuffle so that positives and negatives are interleaved
            df = df.sample(frac=1, random_state=42).reset_index(drop=True)
            df = TabularDataset(df)
            TabularPredictor(label="label",path='AutoML_{}_Oversampling'.format(embed)).fit(df,num_bag_folds=10)
    elif sample is None:
        for embed in embed_types:
            p_df,n_df = Create_dataset(embed)
            df = pd.concat([p_df,n_df])
            # Shuffle so that positives and negatives are interleaved
            df = df.sample(frac=1, random_state=42).reset_index(drop=True)
            df = TabularDataset(df)
            # TabularPredictor(label="label",path='AutoML_result_0613-1/AutoML_{}'.format(embed)).fit(df,num_bag_folds=5)
            TabularPredictor(label="label",path='AutoML_result_12-5/AutoML_{}'.format(embed)).fit(df)
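
The oversampling branch above has its replication line, `pd.concat([p_df]*8)`, commented out. As a rough illustration of what that step would do, the sketch below replicates the positive rows by a factor derived from the class ratio and then shuffles. `oversample_positive` is a hypothetical helper, and the tiny random frames are stand-ins for the real embeddings. With the real counts, `round(3307 / 422)` gives 8, which matches the factor in the commented-out line.

```python
import numpy as np
import pandas as pd

def oversample_positive(p_df, n_df, seed=42):
    """Repeat the positive rows until they roughly match the negatives, then shuffle."""
    factor = max(1, round(len(n_df) / len(p_df)))
    p_over = pd.concat([p_df] * factor, ignore_index=True)
    df = pd.concat([p_over, n_df], ignore_index=True)
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    return df, factor

# Toy data (assumption): 4 positives and 31 negatives with 3 features each
rng = np.random.default_rng(0)
p_toy = pd.DataFrame(rng.normal(size=(4, 3)))
p_toy['label'] = 1
n_toy = pd.DataFrame(rng.normal(size=(31, 3)))
n_toy['label'] = 0

balanced, factor = oversample_positive(p_toy, n_toy)
print(factor, balanced['label'].value_counts().to_dict())
```

Note that plain replication places exact copies of the same protein in the training data, so with bagged folds a copy can land in both the training and validation part of a fold.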

In [14]:
AutoML()

load positive_train_422.csv esm embedding
422
(422, 1280)


Beginning AutoGluon training ...
AutoGluon will save models to "AutoML_result_07-5/AutoML_ESM2_15b/"
AutoGluon Version:  0.7.0
Python Version:     3.9.16
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023
Train Data Rows:    3729
Train Data Columns: 1280
Label Column: label
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    261145.21 MB
	Train Data (Original)  Memory Usage: 19.09 MB (0.0% of available memory)


	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.


load negative_train_3307.csv esm embedding
3307
(3307, 1280)


	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 1280 | ['0', '1', '2', '3', '4', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('float', []) : 1280 | ['0', '1', '2', '3', '4', ...]
	1.1s = Fit runtime
	1280 features in original data used to generate 1280 features in processed data.
	Train Data (Processed) Memory Usage: 19.09 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.21s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.13408420488066505, Train Rows: 3229, Val Rows: 500
Fitting 13 L1 model
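
The three numbers on the train/validation split line of the log are mutually consistent: the reported `holdout_frac` equals the validation row count divided by the total row count. The check below only reproduces that arithmetic from the logged values and makes no claim about how AutoGluon chooses the fraction internally.

```python
total_rows = 3729      # "Train Data Rows" in the log
val_rows = 500         # "Val Rows" in the log

holdout_frac = val_rows / total_rows
train_rows = total_rows - val_rows

print(holdout_frac)    # compare with the holdout_frac value in the log
print(train_rows)      # compare with "Train Rows" in the log
```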