# Tux for speed

### Evaluation of speed for the ML process, with and without feature selection

Feature selection process:

In [1]:
import pandas as pd
dfResults = pd.read_csv("old_results/results.csv")

Find the best model, i.e. the row with the lowest mean MAPE

In [2]:
dfResults.sort_values("mean").iloc[0].name

139
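
As a side note, the same lookup can be written without sorting the whole frame. The sketch below uses a made-up three-row stand-in for `results.csv` (the `mean` values and index labels are invented for illustration) and checks that `idxmin` returns the same index label as the sort-based expression.

```python
import pandas as pd

# Toy stand-in for results.csv; values and index labels are made up.
df_toy = pd.DataFrame({"mean": [12.4, 9.8, 15.1]}, index=[7, 139, 42])

# Sort-based lookup, as in the cell above.
best_by_sort = df_toy.sort_values("mean").iloc[0].name

# Direct lookup of the index label holding the smallest mean.
best_by_idxmin = df_toy["mean"].idxmin()

print(best_by_sort, best_by_idxmin)
```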

Find the best list in the feature importance file

In [3]:
dfImportance = pd.read_csv("old_results/feature_importance.csv")
best_list = dfImportance.loc[dfResults.sort_values("mean").iloc[0].name]

Select the columns to keep and to drop

In [4]:
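def select_features(best_list, strategy="absolute", nb_columns=500, quantile=0.98):
    # Returns a (kept, dropped) pair of Series of feature importances
    if strategy == "absolute":
        # Keep the nb_columns most important features (sort only once)
        ranked = best_list.sort_values(ascending=False)
        return ranked.iloc[:nb_columns], ranked.iloc[nb_columns:]
    elif strategy == "percentile":
        # Keep the features whose importance is strictly above the quantile
        threshold = best_list.quantile(quantile)
        return best_list[best_list > threshold], best_list[best_list <= threshold]
    raise ValueError(f"Unknown strategy: {strategy!r}")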
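
To see what the two strategies return, here is a small self-contained sketch on a made-up importance Series. The feature names and values are invented, and the function is a condensed copy of the one above, so this illustrates the behaviour rather than the real data.

```python
import pandas as pd

def select_features(best_list, strategy="absolute", nb_columns=500, quantile=0.98):
    # Condensed copy of the notebook function, for illustration only.
    if strategy == "absolute":
        ranked = best_list.sort_values(ascending=False)
        return ranked.iloc[:nb_columns], ranked.iloc[nb_columns:]
    elif strategy == "percentile":
        threshold = best_list.quantile(quantile)
        return best_list[best_list > threshold], best_list[best_list <= threshold]

# Hypothetical importances for four features.
toy = pd.Series({"f_a": 0.50, "f_b": 0.30, "f_c": 0.15, "f_d": 0.05})

# "absolute": keep a fixed number of top-ranked features.
keep_abs, drop_abs = select_features(toy, strategy="absolute", nb_columns=2)

# "percentile": keep the features above a quantile of the importances.
keep_pct, drop_pct = select_features(toy, strategy="percentile", quantile=0.5)

print(list(keep_abs.index), list(drop_abs.index))
print(list(keep_pct.index), list(drop_pct.index))
```

With `strategy="absolute"` the number of kept features is fixed in advance, whereas with `strategy="percentile"` it depends on how the importances are distributed.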

In [5]:
columns_to_keep, columns_to_drop = select_features(best_list, nb_columns=500)

Create the config file to run the experiment on

In [6]:
import json
size_methods = ["vmlinux", "GZIP-bzImage", "GZIP-vmlinux", "GZIP", "BZIP2-bzImage", 
              "BZIP2-vmlinux", "BZIP2", "LZMA-bzImage", "LZMA-vmlinux", "LZMA", "XZ-bzImage", "XZ-vmlinux", "XZ", 
              "LZO-bzImage", "LZO-vmlinux", "LZO", "LZ4-bzImage", "LZ4-vmlinux", "LZ4"]

config = {
    "max_depth":25,
    "nbFolds":5,
    "n_estimators":48,
    "columns_to_drop":["cid"]+size_methods+list(columns_to_drop.index),
    "minSampleSize":75000,
    "maxSampleSize":75001,
    "paceSampleSize":1
}

with open("config/config.json","w") as f:
    json.dump(config,f)
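
Because `columns_to_drop` can hold thousands of names, it can be worth reading the file back before launching a long run. The sketch below is only an illustration: it writes a shortened, hypothetical config (the name `FEATURE_X` is invented) to a temporary directory instead of `config/config.json`, and it makes no assumption about how `index.py` itself parses the file.

```python
import json
import os
import tempfile

# Shortened, hypothetical config; "FEATURE_X" is a made-up column name.
config_toy = {
    "max_depth": 25,
    "nbFolds": 5,
    "n_estimators": 48,
    "columns_to_drop": ["cid", "vmlinux", "FEATURE_X"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    # Write the config, then read it back from disk.
    with open(path, "w") as f:
        json.dump(config_toy, f)
    with open(path) as f:
        loaded = json.load(f)

# The reloaded dict should be identical to what was written.
round_trip_ok = loaded == config_toy
n_dropped = len(loaded["columns_to_drop"])
print(round_trip_ok, n_dropped)
```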

Running the script with feature selection:

In [7]:
!python3 index.py

1564402698.0285175
Starting
Train size 75000
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
End
1564403057.0533917
Total time :  359.02487421035767


With 5 folds and 48 estimators, a training set of 75k rows and a max depth of 25, the process takes roughly 6 minutes.

Running the script without feature selection:

In [10]:

config = {
    "max_depth":25,
    "nbFolds":5,
    "n_estimators":48,
    "columns_to_drop":["cid"]+size_methods,
    "minSampleSize":75000,
    "maxSampleSize":75001,
    "paceSampleSize":1
}

with open("config/config.json","w") as f:
    json.dump(config,f)

In [11]:
!python3 index.py

1564406615.4647942
Starting
Train size 75000
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
End
1564414574.5938969
Total time :  7959.129102706909


With the same configuration but without feature selection, working on 9000+ columns, the process takes about 133 minutes (7959 s), roughly 22 times longer than when working on the 500 selected features.
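
The comparison can be reproduced from the two "Total time" values printed by the runs above. This is plain arithmetic on those two numbers; the variable names are only for illustration.

```python
# Wall-clock totals in seconds, copied from the two runs above.
t_selected = 359.02487421035767  # 500 selected features
t_all = 7959.129102706909        # all 9000+ columns

minutes_selected = t_selected / 60
minutes_all = t_all / 60
speedup = t_all / t_selected

print(f"{minutes_selected:.1f} min vs {minutes_all:.1f} min, ratio {speedup:.1f}x")
```

Both figures come from a single run each, so the ratio is an indication rather than a precise benchmark.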