# Description of the problem and solution

The goal of Task 1 was to predict a person's age from brain image data, a standard regression problem. The original dataset contained 832 features, many NaN values, and a few outliers, so careful preprocessing was needed before the data could be used in a regression model.

The first step was imputation: each NaN value was replaced with the median of its feature column. We preferred the median to alternatives such as the mean because the dataset contains many outliers and the median is far less sensitive to them. For example, for the column 1, 2, _, 5, 20 the median of the observed values is 3.5, while the mean is 7.

The next step was feature selection. Using the "autofeat" library (paper: https://arxiv.org/pdf/1901.07329.pdf), we selected the 21 most important features. The algorithm repeats the following loop: correlate the features with the target, select the promising ones, train a Lasso regression model on them, and keep only the features with non-zero regression weights. We then reduced the datasets to these 21 features.

Finally, we trained the final regression model on the reduced datasets. We tried several outlier detection techniques but decided to keep the outliers and use a tree-based model instead. Tree-based methods are generally robust to outliers, and keeping every sample avoids the risk of discarding important points. We used the "ExtraTreesRegressor" model from the "sklearn" package and tuned it based on the R2 score on our validation set. The final model scored above 0.6 on the validation sets under cross-validation and 0.6812 on the ETH submission leaderboard, where the hard baseline was set to 0.65 by the Advanced Machine Learning Task 1 team.
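To make the median-versus-mean argument concrete, the short sketch below recomputes both statistics for the four observed values of the example column (1, 2, 5, 20). It is only an illustration on toy numbers, not part of the original pipeline.

```python
import numpy as np

# Observed values of the example column; the missing entry is left out.
observed = np.array([1.0, 2.0, 5.0, 20.0])

mean_fill = observed.mean()        # pulled upward by the outlier 20
median_fill = np.median(observed)  # average of the two middle values, 2 and 5

print("mean:", mean_fill, "median:", median_fill)
```

The mean is dragged towards the outlier, while the median stays close to the bulk of the values, which is why it is the safer fill value here.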

# Import the necessary packages

In [1]:
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from autofeat import FeatureSelector

from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor

# Load the data from the CSV files

In [2]:
column_names_x = ['id']
for i in range(832):
  column_names_x.append('x'+str(i))

raw_dataset_x = pd.read_csv('Data/X_train.csv', names=column_names_x,
                      na_values = "?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)

dataset_x = raw_dataset_x.copy()
dataset_x.tail()

index,id,x0,x1,x2,x3,x4,x5,x6,x7,x8,...,x822,x823,x824,x825,x826,x827,x828,x829,x830,x831
1207,1207.0,18707.457475,13610.725702,3785.886941,113497.632841,109.526764,97.812339,11274.011935,10803.953566,10949.811419,...,8472.132451,17727.003274,1033.057071,104.838553,102.235191,3099.069091,3.079234,1362.889974,,10110.36057
1208,1208.0,17108.239122,12168.536128,3442.619145,98218.773311,100.468476,109.994258,11031.326117,10231.743317,10687.321177,...,8839.251924,14721.087037,792.061138,105.823015,109.915094,3124.059793,2.935047,1577.40487,1026.749434,10620.330033
1209,1209.0,14264.707321,9273.405761,3580.894003,101668.927699,102.620705,104.470375,9159.594864,10661.827392,10623.176915,...,10264.321725,15226.056342,831.02519,101.926717,101.669153,2252.03187,,1781.79972,1066.379647,10317.757445
1210,1210.0,14907.07744,10936.636575,3159.167789,100400.608972,106.622507,84.859872,10356.404262,10107.960852,10384.92446,...,10310.165709,12976.062457,852.00107,103.098317,104.397562,2585.04866,2.731768,1300.379678,1049.37004,10876.010268
1211,1211.0,14975.969273,11451.350347,3107.470343,100080.295868,108.040548,,10766.88864,10726.567333,10632.289329,...,8853.948288,12953.077245,919.051273,107.668141,102.999153,2489.012323,2.721068,1416.819713,1088.315592,10359.138969


In [3]:
column_names_y = ['id','y']
raw_dataset_y = pd.read_csv('Data/y_train.csv', names=column_names_y,
                      na_values = "?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)

dataset_y = raw_dataset_y.copy()
dataset_y.tail()

index,id,y
1207,1207.0,70.0
1208,1208.0,86.0
1209,1209.0,68.0
1210,1210.0,71.0
1211,1211.0,53.0


# Print the missing values

In [4]:
missing_values = dataset_x.isna().sum()
print (missing_values)

id        0
x0       91
x1       84
x2       88
x3       91
       ... 
x827    101
x828     95
x829     92
x830     80
x831     96
Length: 833, dtype: int64
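Raw counts are hard to compare at a glance, so it can help to look at the fraction of missing entries per column instead. The sketch below does this on a small hypothetical frame (`toy` and its values are made up for illustration); the same two calls could be applied to `dataset_x`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for dataset_x.
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0, 4.0],
    "x1": [np.nan, np.nan, 2.0, 5.0],
    "x2": [7.0, 8.0, 9.0, 10.0],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```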


# Split the data into training and test data

In [5]:
# Split using sklearn.model_selection
x_train, x_test, y_train, y_test = train_test_split(dataset_x, dataset_y, test_size=0.2, random_state = 100)

In [6]:
train_stats = x_train.describe()
train_stats.pop("id")
train_stats = train_stats.transpose()
train_stats

feature,count,mean,std,min,25%,50%,75%,max
x0,901.0,15254.390521,2231.563761,6027.219619,13918.413076,15042.836058,16709.756102,28273.690135
x1,902.0,10970.368098,1587.024515,6764.060541,9873.366723,10825.381843,11921.708544,17777.338221
x2,906.0,3434.029571,433.096972,1849.453269,3156.328586,3409.310025,3702.856739,5622.951648
x3,896.0,100121.823708,9950.710491,65828.916291,93444.346839,100160.023492,106380.812492,133145.632257
x4,910.0,105.212591,2.837787,100.067706,102.784971,105.176689,107.681571,110.087261
...,...,...,...,...,...,...,...,...
x827,886.0,2492.950348,513.730065,867.825011,2168.018939,2470.568940,2800.578394,4904.988601
x828,895.0,2.730978,0.264650,1.744389,2.538333,2.715997,2.892752,3.795277
x829,899.0,1359.150032,260.487607,711.119041,1176.638459,1366.071397,1538.372796,2322.003232
x830,908.0,1052.854318,29.229104,1000.067137,1028.688835,1054.615566,1078.409654,1099.975679


# Fill the NaN values in the training and test sets with the median of each column

In [7]:
x_train = x_train.fillna(x_train.median())
x_test = x_test.fillna(x_test.median())
missing_values = x_train.isna().sum()
print (missing_values)

id      0
x0      0
x1      0
x2      0
x3      0
       ..
x827    0
x828    0
x829    0
x830    0
x831    0
Length: 833, dtype: int64
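The cell above fills the held-out split with its own medians. A common alternative, shown here only as a sketch on toy data and not as the method used in this notebook, is to learn the medians on the training split and reuse them for any other split, for example with `SimpleImputer` from scikit-learn. The frames `train` and `held_out` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with one feature column.
train = pd.DataFrame({"x0": [1.0, 2.0, np.nan, 5.0, 20.0]})
held_out = pd.DataFrame({"x0": [np.nan, 4.0]})

# Learn the per-column medians from the training split only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training medians when filling the held-out split.
held_out_filled = pd.DataFrame(
    imputer.transform(held_out), columns=held_out.columns, index=held_out.index
)

print(imputer.statistics_)
print(held_out_filled)
```

Fitting on the training split alone keeps information from the held-out rows out of the preprocessing step, so the validation score is a slightly more honest estimate.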


# Remove the unnecessary "id" column from the targets

In [8]:
y_train.pop("id")
y_test.pop("id")

1143    1143.0
941      941.0
365      365.0
467      467.0
615      615.0
         ...  
156      156.0
689      689.0
28        28.0
69        69.0
1198    1198.0
Name: id, Length: 243, dtype: float64

# Feature selection using autofeat

In [12]:
fsel = FeatureSelector(featsel_runs=4,
        max_it=150,
        w_thr=1e-6,
        keep=None,
        n_jobs=1,
        verbose=1)

new_X = fsel.fit_transform(pd.DataFrame(x_train, columns=column_names_x), y_train)
print(new_X.columns) 

df_train = pd.DataFrame(x_train, columns=column_names_x)
df_test = pd.DataFrame(x_test, columns=column_names_x)

  y = column_or_1d(y, warn=True)


[featsel] Scaling data...done.
[featsel] 231/833 features after univariate filtering
[featsel] Feature selection run 1/4
[featsel] Feature selection run 2/4
[featsel] Feature selection run 3/4
[featsel] Feature selection run 4/4
[featsel] 37 features after 4 feature selection runs
[featsel] 33 features after correlation filtering
[featsel] 22 features after noise filtering
[featsel] 22 final features selected (including 0 original keep features).
Index(['x68', 'x194', 'x633', 'x92', 'x485', 'x612', 'x641', 'x458', 'x380',
       'x233', 'x465', 'x726', 'x325', 'x89', 'x232', 'x415', 'x209', 'x348',
       'x115', 'x819', 'x742', 'x665'],
      dtype='object')
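The selection loop described at the top of this notebook (correlate with the target, keep promising features, fit a Lasso model, keep non-zero weights) can be illustrated with plain scikit-learn. The sketch below is a heavily simplified, single-pass version on synthetic data; it is not autofeat's actual implementation, and the data, the top-10 cut-off, and all variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only x0 and x5 actually drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 30)), columns=[f"x{i}" for i in range(30)])
y = 3.0 * X["x0"] - 2.0 * X["x5"] + rng.normal(scale=0.5, size=200)

# Step 1: scale, then rank features by absolute correlation with the target.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
correlations = X_scaled.corrwith(y).abs()
promising = correlations.sort_values(ascending=False).head(10).index

# Step 2: fit a Lasso model on the promising features only.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled[promising], y)

# Step 3: keep the features whose weights exceed a small threshold
# (the same role as w_thr in the FeatureSelector call above).
w_thr = 1e-6
selected = [name for name, w in zip(promising, lasso.coef_) if abs(w) > w_thr]
print(selected)
```

autofeat additionally repeats the selection over several runs and applies correlation and noise filtering, as the log output above shows; none of that is reproduced here.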


# Keep only the necessary features

In [10]:
dataset_selected_x = x_train.copy()
x_test_selected= x_test.copy()

#accepted=[400, 757, 635, 516, 132, 15, 809, 116, 214,
       #556, 617, 93, 346, 596, 309, 252, 292, 474,
       #593, 614]

accepted=[400, 635, 757, 516, 809, 214, 556, 617, 93,
       346, 596, 255, 309, 252, 292, 738, 537, 593,
       474, 614, 502]

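# Drop every feature column that is not in the accepted list.
# Note: the "id" column is not deleted here, so it stays among the model inputs.
for j in range(832):
    if j in accepted:
        print(j)
    else:
        del dataset_selected_x['x'+str(j)]
        del x_test_selected['x'+str(j)]
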
train_stats = dataset_selected_x.describe()
train_stats.pop("id")
train_stats = train_stats.transpose()
train_stats

93
214
252
255
292
309
346
400
474
502
516
537
556
593
596
614
617
635
738
757
809


feature,count,mean,std,min,25%,50%,75%,max
x93,969.0,4367.681,589.5625,2162.708,4015.083,4348.544,4679.013,7382.366
x214,969.0,2.303067,0.1842725,1.526802,2.195638,2.306954,2.41204,2.927466
x252,969.0,10504.52,269.7904,10000.14,10289.82,10503.14,10727.45,10999.66
x255,969.0,100.5422,9.167929,62.2626,94.50646,100.8159,106.0984,133.5456
x292,969.0,37821.05,0.02802936,37821.01,37821.03,37821.05,37821.08,37821.11
x309,969.0,2.078555,0.1601533,1.357216,1.986457,2.083465,2.170873,2.700711
x346,969.0,1049018.0,26867.07,1000087.0,1026117.0,1049662.0,1068778.0,1099860.0
x400,969.0,1004.372,99.72676,582.7449,940.4754,1006.425,1066.169,1468.679
x474,969.0,1051360.0,26874.0,1000022.0,1030836.0,1052140.0,1072486.0,1099845.0
x502,969.0,15693.04,2050.646,6837.685,14506.54,15562.46,16956.34,29098.47
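Deleting columns one by one works, but the same subset can be taken in a single indexing step. The helper below is a hypothetical sketch (`select_features` and its `keep_id` flag are not part of the notebook); with `keep_id=True` it mirrors the loop above, which leaves the "id" column in the frame.

```python
import pandas as pd

def select_features(frame, accepted, keep_id=True):
    """Return a copy of frame restricted to the accepted 'x<j>' columns."""
    # Sorting reproduces the column order left behind by the deletion loop.
    columns = [f"x{j}" for j in sorted(accepted)]
    if keep_id:
        columns = ["id"] + columns
    return frame[columns].copy()

# Hypothetical toy frame standing in for x_train.
toy = pd.DataFrame({
    "id": [0.0, 1.0],
    "x0": [1.0, 2.0],
    "x1": [3.0, 4.0],
    "x2": [5.0, 6.0],
})

with_id = select_features(toy, accepted=[2, 0])
without_id = select_features(toy, accepted=[2, 0], keep_id=False)
print(list(with_id.columns), list(without_id.columns))
```

Applying one helper to the training, validation, and submission frames also guarantees that all three end up with identical columns in identical order.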


# Regression model

In [11]:
rfr = ExtraTreesRegressor(n_jobs=1, max_depth=None, n_estimators=180, random_state=0, min_samples_split=3, max_features=None)

rfr.fit(dataset_selected_x, np.ravel(y_train))

y_pred = rfr.predict(x_test_selected) 

score = r2_score(y_test, y_pred)
print(score)

0.06651028255721181
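The description at the top mentions cross-validation, whereas the cell above scores a single hold-out split. One possible way to obtain a cross-validated R2 estimate is sketched below with `cross_val_score`. The data are synthetic (`make_regression`), and `n_estimators=50` is chosen only to keep the example fast; the other hyperparameters are copied from the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the selected features.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=5, noise=10.0, random_state=0
)

model = ExtraTreesRegressor(
    n_estimators=50,        # smaller than the 180 used above, for speed
    min_samples_split=3,
    max_features=None,
    random_state=0,
)

# Five shuffled folds; each fold is used once as the validation set.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="r2")

print(scores)
print("mean R2:", scores.mean())
```

Averaging over several folds gives a more stable estimate than one split, which matters when tuning hyperparameters on roughly a thousand samples.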


# Export the file with our predictions for submission

In [None]:
#accepted=[400, 757, 635, 516, 132, 15, 809, 116, 214,
       #556, 617, 93, 346, 596, 309, 252, 292, 474,
       #593, 614]

accepted=[400, 635, 757, 516, 809, 214, 556, 617, 93,
       346, 596, 255, 309, 252, 292, 738, 537, 593,
       474, 614, 502]

column_names_x = ['id']
for i in range(832):
  column_names_x.append('x'+str(i))

raw_dataset_x_test = pd.read_csv('/content/X_test.csv', names=column_names_x,
                      na_values = "?", comment='\t',
                      sep=",", skipinitialspace=True, skiprows=1)

dataset_x_test = raw_dataset_x_test.copy()
dataset_x_test.tail()

dataset_x_test = dataset_x_test.fillna(dataset_x_test.median())

dataset_selected_x_test = dataset_x_test.copy()

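# Drop every feature column that is not in the accepted list.
# Note: the "id" column is kept, matching the columns the model was trained on.
for j in range(832):
    if j in accepted:
        print(j)
    else:
        del dataset_selected_x_test['x'+str(j)]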

predictions = rfr.predict(dataset_selected_x_test)

index = 0.0
with open('predictions.txt', 'w') as f:
    f.write("%s\n" % "id,y")
    for predict in predictions:
        writing_str = str(index)+','+str(predict.item(0))
        f.write("%s\n" % writing_str)
        index = index + 1

93
214
252
255
292
309
346
400
474
502
516
537
556
593
596
614
617
635
738
757
809
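The submission file can also be assembled with pandas instead of a manual write loop. The sketch below assumes, as the loop above does, that ids are consecutive floats starting at 0.0 and that the header is `id,y`; `build_submission` is a hypothetical helper and the prediction values are made up.

```python
import numpy as np
import pandas as pd

def build_submission(predictions, first_id=0.0):
    """Pair each prediction with a consecutive float id, starting at first_id."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    ids = first_id + np.arange(len(predictions), dtype=float)
    return pd.DataFrame({"id": ids, "y": predictions})

# Made-up predictions standing in for rfr.predict(...).
submission = build_submission([70.2, 65.8, 81.0])

# Passing a path instead, e.g. to_csv('predictions.txt', index=False),
# would write the same content to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```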
