# Merck Molecular Activity Challenge

<a href="https://www.kaggle.com/c/MerckActivity">Link to competition on Kaggle.</a>

Submissions to the leaderboard were closed at the time of writing.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns; sns.set_style('whitegrid')
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 999
plt.rcParams['figure.figsize'] = (16, 9)

## Load Data

The data are spread across fifteen pairs of training and test datasets, each pair corresponding to a different biologically relevant target. We will therefore load, process, and model each pair separately, then combine the predictions into a single CSV in the submission format.

In [2]:
training_sets = []
for i in range(1, 16):
    training_sets.append(pd.read_csv('TrainingSet/ACT%s_competition_training.csv' % i))

In [3]:
for i in range(15):
    print(i, training_sets[i].shape, training_sets[i].isnull().values.any())

0 (37241, 9493) False
1 (8716, 5879) False
2 (6148, 5205) False
3 (1815, 4308) False
4 (3212, 6276) False
5 (37388, 8923) False
6 (1569, 4507) False
7 (9965, 5805) False
8 (5351, 4732) False
9 (11151, 5792) False
10 (6399, 5137) False
11 (8651, 5472) False
12 (6105, 5700) False
13 (4165, 5947) False
14 (5059, 5554) False


In [4]:
test_sets = []
for i in range(1, 16):
    test_sets.append(pd.read_csv('TestSet/ACT%s_competition_test.csv' % i))

In [5]:
for i in range(15):
    print(i, test_sets[i].shape, test_sets[i].isnull().values.any())

0 (12338, 9492) False
1 (2907, 5878) False
2 (2045, 5204) False
3 (598, 4307) False
4 (1072, 6275) False
5 (12406, 8922) False
6 (523, 4506) False
7 (3335, 5804) False
8 (1769, 4731) False
9 (3704, 5791) False
10 (2094, 5136) False
11 (2899, 5471) False
12 (1707, 5699) False
13 (1382, 5946) False
14 (1698, 5553) False


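Each training set has roughly three times as many rows as its test set (about a 3:1 split), and each test set has one column fewer than its training set because it lacks the `Act` target column. There are no missing values, although most feature columns are very sparse (mostly zeros); we will deal with this in the next section.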
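
The split ratio and the sparsity can be checked numerically rather than read off the shapes. The following is a minimal sketch on toy data; `summarise_pair` and the toy frames are hypothetical names introduced here, and the column layout (`MOLECULE`, `Act`, then descriptors) is assumed to match the training files.

```python
import pandas as pd


def summarise_pair(train: pd.DataFrame, test: pd.DataFrame, first_feature: int = 2) -> dict:
    """Return the train:test row ratio and the fraction of zero-valued feature entries.

    Assumes the training frame holds MOLECULE and Act in its first two columns.
    """
    features = train.iloc[:, first_feature:]
    zero_fraction = float((features == 0).to_numpy().mean())
    return {"ratio": len(train) / len(test), "zero_fraction": zero_fraction}


# Toy stand-ins for one training/test pair (not competition data)
toy_train = pd.DataFrame({
    "MOLECULE": ["M1", "M2", "M3"],
    "Act": [4.3, 5.1, 6.0],
    "D_1": [0, 0, 2],
    "D_2": [0, 1, 0],
})
toy_test = pd.DataFrame({"MOLECULE": ["M4"], "D_1": [0], "D_2": [3]})

summary = summarise_pair(toy_train, toy_test)
print(summary)
```

In the notebook, the same function could be called as `summarise_pair(training_sets[i], test_sets[i])` for each `i`.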

## Data Preprocessing

The datasets have many columns, most of which carry little information (low or zero variance). As a blanket approach to discarding features that are probably not predictive, we will remove features that take only a single value, as well as features whose second most common value occurs fewer times than a minimum threshold (set arbitrarily to 100 below).

Note: the following cell takes multiple hours to run.

In [6]:
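threshold = 100  # minimum count required for the second most common value
for i in range(15):
    # The first two training columns are MOLECULE and Act; the rest are features
    for col in training_sets[i].columns[2:]:
        if len(training_sets[i][col].value_counts()) > 1:
            # value_counts() is sorted by frequency, so index 1 is the second most common value
            if training_sets[i][col].value_counts().values[1] < threshold:
                training_sets[i].drop(col, axis=1, inplace=True)
                test_sets[i].drop(col, axis=1, inplace=True)
        else:
            # Constant feature: drop it from both frames
            training_sets[i].drop(col, axis=1, inplace=True)
            test_sets[i].drop(col, axis=1, inplace=True)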
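
A possible way to shorten the run time is to compute the value counts once per column, collect the names of the columns to discard, and drop them from both frames in a single call. The sketch below applies the same rule as the cell above to toy data; `low_information_columns` is a hypothetical helper, and the speed-up has not been measured on the competition files.

```python
import pandas as pd


def low_information_columns(df: pd.DataFrame, first_feature: int = 2, threshold: int = 100) -> list:
    """List feature columns that are constant or whose second most common value is rare."""
    to_drop = []
    for col in df.columns[first_feature:]:
        counts = df[col].value_counts()  # sorted by frequency, most common first
        if len(counts) < 2 or counts.iloc[1] < threshold:
            to_drop.append(col)
    return to_drop


# Toy frames (not competition data); a threshold of 2 keeps the example small
toy_train = pd.DataFrame({
    "MOLECULE": ["M1", "M2", "M3", "M4"],
    "Act": [4.3, 5.1, 6.0, 4.9],
    "D_1": [0, 0, 0, 0],  # constant -> dropped
    "D_2": [0, 0, 1, 1],  # second most common value occurs twice -> kept
    "D_3": [0, 0, 0, 5],  # second most common value occurs once -> dropped
})
toy_test = pd.DataFrame({"MOLECULE": ["M5"], "D_1": [0], "D_2": [1], "D_3": [0]})

to_drop = low_information_columns(toy_train, threshold=2)

# One drop per frame instead of one in-place drop per column
train_small = toy_train.drop(columns=to_drop)
test_small = toy_test.drop(columns=to_drop)
print(to_drop, list(train_small.columns))
```

The columns are selected from the training frame only and then removed from both frames, so the training and test features stay aligned.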

In [7]:
for i in range(15):
    print(i, training_sets[i].shape, training_sets[i].isnull().values.any())

0 (37241, 3371) False
1 (8716, 1509) False
2 (6148, 1338) False
3 (1815, 1140) False
4 (3212, 1335) False
5 (37388, 3090) False
6 (1569, 919) False
7 (9965, 1545) False
8 (5351, 1127) False
9 (11151, 1436) False
10 (6399, 1653) False
11 (8651, 1841) False
12 (6105, 1829) False
13 (4165, 1527) False
14 (5059, 1480) False


The number of columns in each dataset has been reduced by roughly 64% to 80%, that is, by about two thirds or more. As the processing step above took multiple hours, we will store the prepared datasets so that they can be reloaded later without repeating it.

In [10]:
import pickle

with open('training-data-prepared.pkl', 'wb') as f:
    pickle.dump(training_sets, f)
    
with open('test-data-prepared.pkl', 'wb') as f:
    pickle.dump(test_sets, f)
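
To pick the work up in a later session, the pickled lists can be read back with `pickle.load`. The sketch below round-trips a toy list of frames through a temporary file; `save_datasets`, `load_datasets` and the file name are illustrative assumptions. Pickle files should only be loaded from a trusted source, and frames pickled under one pandas version may not load cleanly under a very different one.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def save_datasets(datasets, path):
    """Pickle a list of DataFrames to `path`."""
    with open(path, "wb") as f:
        pickle.dump(datasets, f)


def load_datasets(path):
    """Load a list of DataFrames previously written by save_datasets."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy list standing in for training_sets (not competition data)
toy_sets = [pd.DataFrame({"MOLECULE": ["M1"], "Act": [4.3], "D_1": [0]})]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-data-prepared.pkl"  # hypothetical file name
    save_datasets(toy_sets, path)
    restored = load_datasets(path)

print(restored[0].equals(toy_sets[0]))
```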

In [15]:
# Example processed dataset
training_sets[0].head()

| | MOLECULE | Act | D_41 | D_58 | D_59 | ... | D_10735 | D_10737 | D_10741 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACT1_M_80 | 6.0179 | 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 1 | ACT1_M_189 | 4.3003 | 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 2 | ACT1_M_190 | 5.2697 | 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 3 | ACT1_M_402 | 6.1797 | 0 | 0 | 0 | ... | 0 | 0 | 0 |
| 4 | ACT1_M_659 | 4.3003 | 0 | 3 | 0 | ... | 0 | 0 | 0 |


## Train Model

We will use grid search cross-validation to build optimised random forest models for each dataset. R2 scores will be calculated for each of the fifteen models on a holdout portion of the training data, as per the evaluation method described in the competition.

Note: we cannot replicate the 'time-split' validation methodology used for evaluation in the competition, so our R2 scores will likely be higher than those on the leaderboard.

In [66]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from time import time

models = []
scores = []

param_grid = {'n_estimators': [10],
              'max_depth': [10, 20, 30, None],
              'min_samples_split': [2, 3],
              'min_samples_leaf': [1, 2]}

tic = time()
for i in range(15):
    toc = time()
    print("Computing dataset {}... cumulative run time: {:.1f} minutes.".format(i+1, (toc-tic)/60))
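    # Features start at the third column; the second column (Act) is the target.
    # With default arguments, train_test_split holds out 25% of the rows at random.
    X_train, X_test, y_train, y_test = train_test_split(training_sets[i].iloc[:, 2:], training_sets[i].iloc[:, 1])
    rf = RandomForestRegressor()
    grid_search = GridSearchCV(rf, param_grid, cv=3, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    models.append(grid_search)
    # score() uses the refitted best estimator and returns R2 on the holdout rows
    scores.append(grid_search.score(X_test, y_test))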
toc = time()
print("Model training and testing complete. Total time elapsed: {:.1f} minutes.".format((toc-tic)/60))

Computing dataset 1... cumulative run time: 0.0 minutes.
Computing dataset 2... cumulative run time: 27.2 minutes.
Computing dataset 3... cumulative run time: 29.1 minutes.
Computing dataset 4... cumulative run time: 30.3 minutes.
Computing dataset 5... cumulative run time: 30.6 minutes.
Computing dataset 6... cumulative run time: 31.3 minutes.
Computing dataset 7... cumulative run time: 49.5 minutes.
Computing dataset 8... cumulative run time: 49.8 minutes.
Computing dataset 9... cumulative run time: 52.3 minutes.
Computing dataset 10... cumulative run time: 53.3 minutes.
Computing dataset 11... cumulative run time: 55.8 minutes.
Computing dataset 12... cumulative run time: 57.4 minutes.
Computing dataset 13... cumulative run time: 59.6 minutes.
Computing dataset 14... cumulative run time: 61.3 minutes.
Computing dataset 15... cumulative run time: 62.2 minutes.
Model training and testing complete. Total time elapsed: 63.2 minutes.


In [67]:
print("Mean R2 score: {:.5f}".format(np.mean(scores)))

Mean R2 score: 0.63281


The best R2 score on the competition leaderboard is 0.49, well below our mean of 0.63. The two figures are not directly comparable: our score comes from a random holdout of the training data, whereas the leaderboard was evaluated on time-split test sets, which probably differ in distribution from the training data.
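
A further caveat on the leaderboard comparison concerns the metric itself. With default settings, `GridSearchCV.score` falls back to the regressor's own `score` method, which returns the coefficient of determination. If the competition's R2 is instead the squared Pearson correlation between predictions and observed activities (an assumption to verify against the competition's evaluation page), the two numbers can differ. The sketch below, with hypothetical function names and toy values, shows how far apart they can be for biased but perfectly correlated predictions.

```python
import numpy as np


def coefficient_of_determination(y_true, y_pred):
    """1 - SS_res / SS_tot, the quantity returned by a scikit-learn regressor's score()."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def squared_pearson(y_true, y_pred):
    """Square of the Pearson correlation between observed and predicted values."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2


# Toy values: predictions are perfectly correlated with the truth but offset by 1
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 1.0

cod = coefficient_of_determination(y_true, y_pred)
r2_corr = squared_pearson(y_true, y_pred)
print(cod, r2_corr)
```

The squared correlation ignores a constant offset or rescaling of the predictions, whereas the coefficient of determination penalises it.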

The best models found via grid search show that different optimal parameters were selected for different datasets.

In [74]:
for i in range(15):
    print("Dataset %s - best model parameters:" % (i+1))
    print(models[i].best_params_)
    print()

Dataset 1 - best model parameters:
{'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 10}

Dataset 2 - best model parameters:
{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 10}

Dataset 3 - best model parameters:
{'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 10}

Dataset 4 - best model parameters:
{'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 10}

Dataset 5 - best model parameters:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 10}

Dataset 6 - best model parameters:
{'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 10}

Dataset 7 - best model parameters:
{'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 10}

Dataset 8 - best model parameters:
{'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 10}

Dataset 9 - best model par

## Make Predictions

Although the leaderboard is closed to new submissions, for completeness we will make predictions for the test data and write them out in the required format.

In [68]:
frames = []

for i in range(15):
    # The first test column is MOLECULE; the remaining columns are the features
    y_pred = models[i].predict(test_sets[i].iloc[:, 1:])
    ID = test_sets[i].iloc[:, 0]
    frames.append(pd.DataFrame({'MoleculeID': ID, 'Prediction': y_pred}))

# DataFrame.append was removed in pandas 2.0, so concatenate the per-dataset frames once
rf_results = pd.concat(frames, ignore_index=True)

rf_results.to_csv('rf_results.csv', index=False)
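
The prediction step relies on the test features appearing in the same order as the training features, because columns are selected by position. A more defensive option is to select the test columns by name, in training order. The sketch below uses toy frames; `aligned_test_features` is a hypothetical helper, not part of pandas or scikit-learn.

```python
import pandas as pd


def aligned_test_features(train: pd.DataFrame, test: pd.DataFrame, first_feature: int = 2) -> pd.DataFrame:
    """Return the test features selected by name, in the training feature order.

    Assumes the training frame holds MOLECULE and Act in its first two columns.
    """
    feature_names = list(train.columns[first_feature:])
    missing = [c for c in feature_names if c not in test.columns]
    if missing:
        raise KeyError(f"Test set is missing {len(missing)} training feature(s), e.g. {missing[:5]}")
    return test[feature_names]


# Toy frames (not competition data); the test columns are deliberately out of order
toy_train = pd.DataFrame({"MOLECULE": ["M1"], "Act": [4.3], "D_1": [0], "D_2": [1]})
toy_test = pd.DataFrame({"MOLECULE": ["M2"], "D_2": [5], "D_1": [7]})

X_new = aligned_test_features(toy_train, toy_test)
print(list(X_new.columns))
```

In the notebook, `models[i].predict(aligned_test_features(training_sets[i], test_sets[i]))` would then fail loudly on a column mismatch instead of silently predicting from misordered features.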