# Description of the problem and solution

Task 1 was to predict a person's age from brain image data: a standard regression problem. The original dataset has 832 features, many NaN values and a number of outliers, so careful preprocessing was needed before the data could be used in our regression model.

The first step was imputation: each NaN value was filled with the median of its feature column. The median was preferred over other statistics (e.g. the mean) because the dataset contains outliers and the median is far less sensitive to them. For example, for the column 1, 2, _, 5, 20 the median of the observed values is 3.5, while the mean is 7.

The next step was feature selection. Using the "autofeat" library (paper: https://arxiv.org/pdf/1901.07329.pdf), we selected the 21 most important features. The algorithm loops over the following steps: compute the correlation of each feature with the target, select the promising features, train a Lasso regression model on them, and keep only the features with non-zero regression weights. We then reduced the datasets to these 21 features.

Finally, we trained our final regression model on the reduced datasets. We tried several outlier detection techniques but decided to keep the outliers and use a tree-based method for the final model. Tree-based methods are generally robust to outliers, and keeping all the data avoids the risk of excluding important features or points. We used the "ExtraTreesRegressor" model from the "sklearn" package and tuned it on the R2 score on our validation set. The final model scored above 0.6 on the validation sets using cross-validation and 0.6812 on the ETH submission leaderboard, where the hard baseline was set to 0.65 by the Advanced Machine Learning Task 1 team.
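
The median-versus-mean argument in the description can be checked with a few lines. This is only a small sketch that uses the four observed values from the example; the variable names are illustrative.

```python
from statistics import mean, median

# Observed values from the example in the description (the missing entry is left out)
observed = [1, 2, 5, 20]

fill_median = median(observed)  # middle of the sorted values: (2 + 5) / 2
fill_mean = mean(observed)      # pulled upward by the outlier 20

print(fill_median, fill_mean)
```

The single large value moves the mean well away from the bulk of the data, while the median stays between the two central values.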

# Include all the necessary packages

In [0]:
!pip install autofeat

from sklearn.metrics import r2_score

try:
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from autofeat import FeatureSelector

from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor



# Load the data from the CSV files

In [0]:
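# Column names: 'id' followed by the 832 feature columns x0 ... x831
column_names_x = ['id'] + ['x'+str(i) for i in range(832)]

# skiprows=1 skips the header row of the CSV, since the names are given explicitly
raw_dataset_x = pd.read_csv('/content/X_train.csv', names=column_names_x,
                      na_values="?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)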

dataset_x = raw_dataset_x.copy()
dataset_x.tail()

Unnamed: 0,id,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,...,x792,x793,x794,x795,x796,x797,x798,x799,x800,x801,x802,x803,x804,x805,x806,x807,x808,x809,x810,x811,x812,x813,x814,x815,x816,x817,x818,x819,x820,x821,x822,x823,x824,x825,x826,x827,x828,x829,x830,x831
1207,1207.0,,5395.719279,95668.548818,1125.414599,10341.757613,10.547165,108249.1874,1073291.0,108672.758838,2.365013,1042126.0,21031.538201,1059.704248,103860.409465,107.805931,100.350289,14040.029112,,90414.095308,10.10847,3.215609,102144.219482,10.087478,3825.046407,89.538601,4550.239929,3542.470382,206892.186644,1011027.0,11432.036864,10.957698,976.093967,84.689946,2500.095879,1068.544855,8.868114,3.695648,10381.626637,864106.8,...,6.934015,2.05604,9.362337,908.498493,94.234452,10.31914,4155.345879,8.498432e+16,10977.013897,93.766125,8.120353,10595.018829,959448.6,10.485803,10.601033,101328.492269,2.417968,48.359296,9486.144826,2.185578,966.094287,903925.4,10.510759,1049.753163,10856.049561,9109.098919,10320.748617,10892.743222,106528.008864,294.801642,10.519831,10.234396,987.456317,978.661701,109520.061346,102914.439553,8144.701025,10.44241,102380.867791,2.236101
1208,1208.0,93669.580198,3564.454295,96937.341346,1040.828378,8415.112792,10.7218,100628.318687,1027671.0,102269.472299,2.398768,1015631.0,10073.888339,1005.671797,106924.048275,98.512805,104.366843,11719.094168,115055.546469,95750.178334,,3.446025,100455.697718,10.303138,3502.012112,84.167435,3462.164372,3411.77148,206892.234016,1082826.0,9957.026155,11.155088,1127.806594,107.745872,2298.057201,1031.637383,8.778034,3.477553,10627.490011,1007256.0,...,10.732723,2.14034,10.930583,956.306694,117.419062,10.36167,3436.194754,6.249342e+16,9866.685225,104.420975,9.171168,10762.025617,920525.4,10.627956,10.101622,105133.566646,2.74025,55.449159,10959.655054,2.160671,1111.015635,1118527.0,10.342601,1030.540259,10097.00667,9039.07482,10895.768148,10315.088493,106898.299599,195.245438,10.618901,10.45655,1112.699713,,,106685.6479,7428.338174,10.405107,107051.806312,2.29704
1209,1209.0,94119.048262,,88398.586879,1298.024079,8905.109893,10.294486,103347.543915,1087571.0,106082.959731,2.34545,1074165.0,49595.639929,1057.051949,103147.267707,,104.099172,11275.080883,115055.523086,105201.547958,10.358605,2.872759,107834.745605,10.070086,2708.039945,90.315251,4502.08382,2267.035496,206892.199058,1016735.0,8545.036877,,818.975938,95.255808,2614.046035,1002.028549,11.262573,3.016052,10912.480045,1016656.0,...,10.447547,,10.502198,967.481047,88.216016,10.535296,3548.968573,3.358649e+16,8140.800029,98.17203,11.651412,8779.078077,792044.2,10.525088,10.468168,104071.834129,2.615953,130.309663,10513.17024,2.228708,728.052692,1147362.0,9.968195,1077.035231,7580.07975,8509.033486,10228.351982,10327.801413,100144.263391,310.183226,10.887052,10.836007,1124.755976,988.19391,102076.305244,101964.413583,6470.242459,9.786964,,2.222794
1210,1210.0,,3356.949591,97081.843123,1002.778571,10864.284014,10.797471,102894.681201,,100102.393811,2.436208,1072574.0,37587.58873,1047.896435,105241.465707,93.329187,102.255382,11668.076593,115055.525882,85619.926316,10.732048,2.422995,104044.942484,,2945.014802,112.182044,3202.074919,4883.134639,206892.208501,1095387.0,7036.096752,8.290676,878.350888,93.612258,2560.094596,1011.33159,10.580875,2.822595,,1085728.0,...,10.115872,2.1581,9.868292,1004.134878,110.99999,10.71302,3394.243287,1.33597e+17,8371.82167,89.70365,9.220535,8998.055593,1113826.0,10.30225,10.883418,101283.417629,2.186525,366.750195,11376.166095,2.186734,1120.066965,1013800.0,11.734373,1046.93903,9132.087999,5896.007658,10706.113864,10369.803817,102404.821085,,10.661966,10.569144,1010.143125,1064.316139,101477.589227,104517.364548,4922.920835,8.566389,103968.235402,2.071553
1211,1211.0,101744.439395,3416.474452,111000.425491,893.08046,9250.598106,10.935695,102906.878188,,107003.164135,2.547171,1087477.0,32084.6338,1075.700568,104998.478525,118.678047,106.993698,13226.038641,115055.543174,97282.156471,10.208538,3.710694,107439.936587,10.753281,4719.08552,96.505582,3419.271979,3796.393911,206892.165049,1006348.0,11518.011808,10.803776,1103.388499,106.35652,2583.097413,1081.964569,11.453524,3.203942,10257.152912,929206.5,...,8.157006,2.1688,8.8442,1005.89732,96.253172,,3289.749206,9.364744e+16,9170.136304,97.067706,,,1148141.0,10.98994,10.204516,100896.544241,2.720932,49.56783,8013.156899,2.550404,916.073375,1080536.0,9.101128,1013.211838,12517.066013,10990.064488,10349.678752,10096.876996,,322.445715,10.599132,10.350638,962.755003,1093.734877,105353.550546,106061.798352,9302.374002,10.949418,109317.619776,2.496408


In [0]:
column_names_y = ['id','y']
# skiprows=1 skips the header row of the CSV, since the names are given explicitly
raw_dataset_y = pd.read_csv('/content/y_train.csv', names=column_names_y,
                      na_values="?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)

dataset_y = raw_dataset_y.copy()
dataset_y.tail()

Unnamed: 0,id,y
1207,1207.0,66.0
1208,1208.0,73.0
1209,1209.0,74.0
1210,1210.0,78.0
1211,1211.0,64.0


# Print the missing values

In [0]:
missing_values = dataset_x.isna().sum()
print(missing_values)

id        0
x0       81
x1      103
x2       92
x3       91
       ... 
x827     83
x828     78
x829     98
x830     84
x831     92
Length: 833, dtype: int64


# Split the data into training and test data

In [0]:
# Split using sklearn.model_selection
x_train, x_test, y_train, y_test = train_test_split(dataset_x, dataset_y, test_size=0.2, random_state = 100)

In [0]:
train_stats = x_train.describe()
train_stats.pop("id")
train_stats = train_stats.transpose()
train_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
x0,909.0,99849.359545,9534.020258,65533.368423,93818.485147,100183.062423,105994.290528,130226.576502
x1,885.0,3698.730375,943.683864,180.312021,3076.550570,3651.110055,4303.892503,7265.213902
x2,891.0,99975.389109,9540.065988,68544.573581,93937.346571,99386.035114,106102.200889,132221.045067
x3,898.0,999.944996,100.903669,694.745271,935.303439,999.571797,1068.606823,1434.200505
x4,890.0,10001.743350,1001.473353,6681.561828,9339.312428,10021.924636,10646.003276,13560.223285
...,...,...,...,...,...,...,...,...
x827,906.0,104990.967566,2761.209013,100015.768596,102783.100024,104986.305216,107352.370223,109999.847537
x828,907.0,6827.704539,1387.835714,1696.036569,6002.316474,6835.947954,7652.607118,11276.075121
x829,884.0,10.021817,0.982265,6.899008,9.378562,9.977236,10.676450,13.188278
x830,902.0,104960.353706,2845.423469,100003.049706,102653.373914,104838.184005,107428.898901,109993.046071


# Fill the NaN values in the training and test sets with the median of each column

In [0]:
x_train = x_train.fillna(x_train.median())
x_test = x_test.fillna(x_test.median())
missing_values = x_train.isna().sum()
print(missing_values)

id      0
x0      0
x1      0
x2      0
x3      0
       ..
x827    0
x828    0
x829    0
x830    0
x831    0
Length: 833, dtype: int64
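
The cell above fills the held-out split with its own medians. A common alternative is to learn the medians on the training split only and reuse them on every other split, so that the same fill values are applied everywhere. The sketch below shows this with scikit-learn's `SimpleImputer` on a small made-up frame; `toy_train`, `toy_test` and their values are illustrative assumptions, not data from this task.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for x_train / x_test
toy_train = pd.DataFrame({"x0": [1.0, 2.0, np.nan, 5.0, 20.0],
                          "x1": [10.0, np.nan, 30.0, 40.0, 50.0]})
toy_test = pd.DataFrame({"x0": [np.nan, 4.0],
                         "x1": [25.0, np.nan]})

# Learn the per-column medians on the training split only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)

# Reuse the training medians to fill the held-out split
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

print(test_filled)
```

Whether this changes the validation score for this dataset was not checked here.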


# Remove the unnecessary "id" label

In [0]:
y_train.pop("id")
y_test.pop("id")

1143    1143.0
941      941.0
365      365.0
467      467.0
615      615.0
         ...  
156      156.0
689      689.0
28        28.0
69        69.0
1198    1198.0
Name: id, Length: 243, dtype: float64

# Feature selection using autofeat

In [0]:
fsel = FeatureSelector(featsel_runs=4,
        max_it=150,
        w_thr=1e-6,
        keep=None,
        n_jobs=1,
        verbose=1)

new_X = fsel.fit_transform(pd.DataFrame(x_train, columns=column_names_x), y_train)
print(new_X.columns) 

df_train = pd.DataFrame(x_train, columns=column_names_x)
df_test = pd.DataFrame(x_test, columns=column_names_x)

  y = column_or_1d(y, warn=True)


[featsel] Scaling data...done.
[featsel] 220/833 features after univariate filtering
[featsel] Feature selection run 1/4
[featsel] Feature selection run 2/4
[featsel] Feature selection run 3/4
[featsel] Feature selection run 4/4
[featsel] 28 features after 4 feature selection runs
[featsel] 28 features after correlation filtering
[featsel] 21 features after noise filtering
[featsel] 21 final features selected (including 0 original keep features).
Index(['x400', 'x635', 'x757', 'x516', 'x809', 'x214', 'x556', 'x617', 'x93',
       'x346', 'x596', 'x255', 'x309', 'x252', 'x292', 'x738', 'x537', 'x593',
       'x474', 'x614', 'x502'],
      dtype='object')
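
To make the selection idea more concrete, the sketch below reproduces its three basic steps (correlation pre-filter, Lasso fit, keep non-zero weights) with plain scikit-learn on synthetic data. It is not autofeat's implementation: the function `select_features`, its parameters `top_k` and `alpha`, and the toy data are assumptions made for illustration, and autofeat additionally repeats the selection over several runs and applies correlation and noise filtering, as the log above shows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def select_features(X, y, top_k=5, alpha=0.1, w_thr=1e-6):
    """Correlation pre-filter followed by a Lasso fit; keeps non-zero weights."""
    y = np.asarray(y, dtype=float)
    # Standardise so that correlations and Lasso weights are comparable across columns
    Xs = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    # Step 1: rank the features by absolute correlation with the target
    corr = Xs.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
    promising = corr.sort_values(ascending=False).index[:top_k]
    # Step 2: fit a Lasso model on the promising features only
    lasso = Lasso(alpha=alpha).fit(Xs[promising], y)
    # Step 3: keep the features whose weight is not (numerically) zero
    return [name for name, w in zip(promising, lasso.coef_) if abs(w) > w_thr]


# Synthetic data: only x0 and x3 carry signal, the other columns are noise
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 10)),
                     columns=["x" + str(i) for i in range(10)])
y_toy = 3.0 * X_toy["x0"] - 2.0 * X_toy["x3"] + rng.normal(scale=0.1, size=200)

selected = select_features(X_toy, y_toy)
print(selected)
```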


# Keep only the necessary features

In [0]:
#accepted=[400, 757, 635, 516, 132, 15, 809, 116, 214,
       #556, 617, 93, 346, 596, 309, 252, 292, 474,
       #593, 614]

# Indices of the 21 features selected by autofeat
accepted=[400, 635, 757, 516, 809, 214, 556, 617, 93,
       346, 596, 255, 309, 252, 292, 738, 537, 593,
       474, 614, 502]

# Print the kept indices in ascending order
for j in sorted(accepted):
  print(j)

# Keep 'id' plus the selected columns, in their original order
selected_columns = ['id'] + ['x'+str(j) for j in sorted(accepted)]
dataset_selected_x = x_train[selected_columns].copy()
x_test_selected = x_test[selected_columns].copy()

train_stats = dataset_selected_x.describe()
train_stats.pop("id")
train_stats = train_stats.transpose()
train_stats

93
214
252
255
292
309
346
400
474
502
516
537
556
593
596
614
617
635
738
757
809


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
x93,969.0,3519.955,612.3839,1789.851,3122.307,3492.04,3882.096,5902.057
x214,969.0,1686.814,424.6669,385.7756,1438.003,1657.997,1929.01,3719.025
x252,969.0,5153.136,6443.209,-10663.89,2036.529,3051.817,5396.992,52184.76
x255,969.0,10631.72,1847.941,3258.064,9580.092,10533.04,11700.04,17427.68
x292,969.0,128952.5,15619.97,71152.48,120166.9,128221.8,137714.5,198229.3
x309,969.0,13609.34,2104.294,1787.475,12418.01,13558.88,14807.91,21422.39
x346,969.0,7264.234,1238.998,2195.927,6577.637,7355.575,8085.946,11215.81
x400,969.0,2.419177,0.1566607,1.441218,2.325897,2.416256,2.507563,3.029658
x474,969.0,213870.4,33633.2,65802.33,195133.6,211370.2,231157.9,482499.8
x502,969.0,73411960000000.0,50512130000000.0,-83842850000000.0,41292570000000.0,62458950000000.0,91989350000000.0,381690700000000.0


# Regression model

In [0]:
rfr = ExtraTreesRegressor(n_jobs=1, max_depth=None, n_estimators=180, random_state=0, min_samples_split=3, max_features=None)

rfr.fit(dataset_selected_x, np.ravel(y_train))

y_pred = rfr.predict(x_test_selected) 

score = r2_score(y_test, y_pred)
print(score)

0.6037280104368132


# Export the file with our predictions for submission

In [0]:
#accepted=[400, 757, 635, 516, 132, 15, 809, 116, 214,
       #556, 617, 93, 346, 596, 309, 252, 292, 474,
       #593, 614]

accepted=[400, 635, 757, 516, 809, 214, 556, 617, 93,
       346, 596, 255, 309, 252, 292, 738, 537, 593,
       474, 614, 502]

column_names_x = ['id'] + ['x'+str(i) for i in range(832)]

# skiprows=1 skips the header row of the CSV, since the names are given explicitly
raw_dataset_x_test = pd.read_csv('/content/X_test.csv', names=column_names_x,
                      na_values="?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)

# Fill the NaN values with the median of each column
dataset_x_test = raw_dataset_x_test.fillna(raw_dataset_x_test.median())

# Print the kept indices in ascending order
for j in sorted(accepted):
  print(j)

# Keep 'id' plus the selected columns, matching the training columns
selected_columns = ['id'] + ['x'+str(j) for j in sorted(accepted)]
dataset_selected_x_test = dataset_x_test[selected_columns].copy()

predictions = rfr.predict(dataset_selected_x_test)

with open('predictions.txt', 'w') as f:
    f.write("id,y\n")
    for index, predict in enumerate(predictions):
        # ids are written as floats (0.0, 1.0, ...)
        f.write("%s,%s\n" % (float(index), predict.item()))

93
214
252
255
292
309
346
400
474
502
516
537
556
593
596
614
617
635
738
757
809
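
The export loop in the cell above can also be expressed with pandas. The sketch below uses a small stand-in array instead of the real model output and writes to a hypothetical file name, `predictions_sketch.txt`; it assumes, as the loop does, that the ids are simply 0.0, 1.0, 2.0, ... in row order.

```python
import numpy as np
import pandas as pd

# Stand-in for the array returned by rfr.predict(...)
toy_predictions = np.array([66.2, 73.9, 70.1])

submission = pd.DataFrame({
    "id": np.arange(len(toy_predictions), dtype=float),  # 0.0, 1.0, 2.0, ...
    "y": toy_predictions,
})

# index=False keeps the DataFrame index out of the file, leaving only "id,y"
submission.to_csv("predictions_sketch.txt", index=False)
```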
