To demonstrate the ``TreeModel().predict_proba()`` method, we load the ``DOM_GSEC`` example dataset and its corresponding feature set (see [Breimann25a]_):

In [6]:
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now fit the ``TreeModel``, which by default internally fits 3 tree-based models over 5 training rounds:

In [7]:
tm = aa.TreeModel()
tm = tm.fit(X, labels=labels)

The ``TreeModel().predict_proba()`` method computes probability predictions by averaging across the fitted models and training rounds, a Monte Carlo approach for robust estimation. It returns the averaged prediction for each sample together with its standard deviation:

In [9]:
pred, pred_std = tm.predict_proba(X)

df_seq["prediction"] = pred
df_seq["pred_std"] = pred_std

print("Prediction scores for 5 substrates")
aa.display_df(df_seq.head(5))

print("Prediction scores for 5 non-substrates")
aa.display_df(df_seq.tail(5))

Prediction scores for 5 substrates


Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c,prediction,pred_std
1,P05067,MLPGLALLLLAAWTA...GYENPTYKFFEQMQN,1,701,723,FAEDVGSNKG,AIIGLMVGGVVIATVIVITLVML,KKKQYTSIHH,0.998504,0.00267
2,P14925,MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS,1,868,890,KLSTEPGSGV,SVVLITTLLVIPVLVLLAIVMFI,RWKKSRAFGD,0.992071,0.002571
3,P70180,MRSLLLFTFSACVLL...RELREDSIRSHFSVA,1,477,499,PCKSSGGLEE,SAVTGIVVGALLGAGLLMAFYFF,RKKYRITIER,0.994342,0.003736
4,Q03157,MGPTSPAARGQGRRW...HGYENPTYRFLEERP,1,585,607,APSGTGVSRE,ALSGLLIMGAGGGSLIVLSLLLL,RKKKPYGTIS,0.991111,0.009306
5,Q06481,MAATGTAAAAATGRL...GYENPTYKYLEQMQI,1,694,716,LREDFSLSSS,ALIGLLVIAVAIATVIVISLVML,RKRQYGTISH,0.999858,0.000105


Prediction scores for 5 non-substrates


Unnamed: 0,entry,sequence,label,tmd_start,tmd_stop,jmd_n,tmd,jmd_c,prediction,pred_std
122,P36941,MLLPWATSAPGLAWG...TPSNRGPRNQFITHD,0,226,248,PLPPEMSGTM,LMLAVLLPLAFFLLLATVFSCIW,KSHPSLCRKL,0.002809,0.002456
123,P25446,MLWIWAVLPLVLAGS...STPDTGNENEGQCLE,0,170,187,NCRKQSPRNR,LWLLTILVLLIPLVFIYR,KYRKRKCWKR,0.089479,0.011379
124,Q9P2J2,MVWCLGLAVLSLVIS...AYRQPVPHPEQATLL,0,738,760,PGLLPQPVLA,GVVGGVCFLGVAVLVSILAGCLL,NRRRAARRRR,0.061405,0.01077
125,Q96J42,MVPAAGRRPPRVMRL...SIRWLIPGQEQEHVE,0,324,342,LPSTLIKSVD,WLLVFSLFFLISFIMYATI,RTESIRWLIP,0.026037,0.008509
126,P0DPA2,MRVGGAFHLLLVCLS...DCAEGPVQCKNGLLV,0,265,287,KVSDSRRIGV,IIGIVLGSLLALGCLAVGIWGLV,CCCCGGSGAG,0.005619,0.00153
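
The averaging step described above can be illustrated with a minimal, library-independent sketch. It assumes that every fitted model in every training round yields one probability per sample, and that the reported values are the mean and the population standard deviation over all such runs. The helper ``aggregate_predictions`` and the toy numbers are hypothetical and are not part of the ``aaanalysis`` API; how ``TreeModel`` aggregates internally may differ in detail.

```python
from statistics import fmean, pstdev


def aggregate_predictions(runs):
    """Average per-sample probabilities over all runs (hypothetical helper).

    runs: list of runs, each a list with one probability per sample
          (one run = one fitted model in one training round).
    Returns (means, stds) with one value per sample.
    """
    n_samples = len(runs[0])
    if any(len(run) != n_samples for run in runs):
        raise ValueError("Every run must score the same samples")
    # zip(*runs) regroups the values so each tuple holds all runs for one sample
    per_sample = list(zip(*runs))
    means = [fmean(values) for values in per_sample]
    # Population standard deviation is an assumption made for this sketch
    stds = [pstdev(values) for values in per_sample]
    return means, stds


# Made-up probabilities for 3 samples from 2 rounds x 3 models = 6 runs
toy_runs = [
    [0.9, 0.2, 0.5],
    [1.0, 0.0, 0.5],
    [0.9, 0.1, 0.5],
    [1.0, 0.1, 0.5],
    [0.9, 0.2, 0.5],
    [1.0, 0.0, 0.5],
]
toy_pred, toy_std = aggregate_predictions(toy_runs)
```

Read this way, a small ``pred_std`` means the individual models and rounds agree on a sample, while a larger value marks a prediction that varies more between runs.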
