# Automated Performance Specialization

This notebook walks through automated performance specialization using a concrete example in Python.

## Performance Specialization

Performance specialization is the addition of constraints to a configurable system so that every possible configuration meets a set performance objective.

The idea is to automate this process using machine learning instead of relying on expert knowledge to infer these additional constraints.
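To make the definition above concrete, here is a minimal sketch on a toy system with two Boolean options. The performance values and the objective are hypothetical and only serve to show how a single added constraint can leave only configurations that meet the objective.

```python
import pandas as pd

# Hypothetical measurements for a toy system with two Boolean options.
configs = pd.DataFrame(
    {
        "KeepAlive": [0, 0, 1, 1],
        "Handle": [0, 1, 0, 1],
        "perf": [1200, 900, 2100, 1900],
    }
)
objective = 1800  # hypothetical performance objective

# Without constraints, some configurations miss the objective.
all_ok_before = bool((configs["perf"] > objective).all())

# Adding the constraint "KeepAlive must be activated" specializes the system.
specialized = configs[configs["KeepAlive"] == 1]
all_ok_after = bool((specialized["perf"] > objective).all())

print(all_ok_before, all_ok_after)
```

In this toy case the constraint was picked by hand; the rest of the notebook is about learning such constraints from measurements.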

## Dataset

The first thing we need is a dataset on which to train a machine learning model.

In [1]:
import pandas as pd
df = pd.read_csv("datasets/Apache.csv")
df

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle,perf
0,1,0,0,0,0,0,0,0,0,1200
1,1,0,1,0,1,0,0,0,0,2100
2,1,0,1,1,0,0,1,1,0,2310
3,1,0,0,1,0,1,0,1,0,1260
4,1,0,0,1,0,1,1,1,0,1140
...,...,...,...,...,...,...,...,...,...,...
187,1,1,1,0,0,1,1,1,0,1860
188,1,1,1,0,1,1,1,1,0,1920
189,1,0,0,1,0,0,0,1,0,1410
190,1,0,1,1,0,0,0,1,0,2460


This dataset contains 192 examples of Apache server configurations, each with a measured performance metric, here the number of pages served per second. Every column except "perf" represents an option, and its value shows whether the option is activated (1) or not (0).

Performance specialization is a classification problem: deciding whether a configuration is acceptable or not. We need to set a performance objective, or threshold, to label the examples accordingly.

In [2]:
threshold = 1800 # Arbitrary value, can be changed

In [4]:
df["acceptable"] = df["perf"].map(lambda x: x > threshold)

In [5]:
df

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle,perf,acceptable
0,1,0,0,0,0,0,0,0,0,1200,False
1,1,0,1,0,1,0,0,0,0,2100,True
2,1,0,1,1,0,0,1,1,0,2310,True
3,1,0,0,1,0,1,0,1,0,1260,False
4,1,0,0,1,0,1,1,1,0,1140,False
...,...,...,...,...,...,...,...,...,...,...,...
187,1,1,1,0,0,1,1,1,0,1860,True
188,1,1,1,0,1,1,1,1,0,1920,True
189,1,0,0,1,0,0,0,1,0,1410,False
190,1,0,1,1,0,0,0,1,0,2460,True
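Because the threshold is arbitrary, it can help to see how the share of acceptable configurations moves with it. The sketch below uses only the nine "perf" values visible in the truncated table above, not the full dataset, and the candidate thresholds other than 1800 are assumptions chosen for illustration.

```python
import pandas as pd

# The nine "perf" values visible in the truncated table (not the full dataset).
perf = pd.Series([1200, 2100, 2310, 1260, 1140, 1860, 1920, 1410, 2460])

# Share of configurations labelled acceptable for each candidate threshold.
shares = {candidate: float((perf > candidate).mean()) for candidate in (1500, 1800, 2200)}

for candidate, share in shares.items():
    print(f"threshold={candidate}: {share:.0%} acceptable")
```

A more restrictive threshold leaves fewer acceptable examples, which is why the class balance matters when choosing an evaluation metric.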


## Learning

We now have a dataset we can learn from. For this task, we choose the Decision Tree algorithm for two main reasons:
 * It can handle a large number of features and their interactions: this is a basic requirement for dealing with configurable systems
 * It produces a white-box model from which we can extract decision rules: this is needed to infer the constraints for the specialization

Set up the learning algorithm

In [6]:
from sklearn import tree

# The hyperparameters heavily impact accuracy; the best combination should be searched for
cls = tree.DecisionTreeClassifier(max_depth=12, min_samples_split=4, criterion="gini")
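One possible way to search for a good hyperparameter combination is a cross-validated grid search. The sketch below is not part of the original notebook: it runs on synthetic Boolean data standing in for the Apache dataset, and both the ground-truth rule and the parameter grid are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Apache data: 200 configurations of 8 Boolean options.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
# Hypothetical ground truth: acceptable when option 1 is on and option 7 is off.
y = (X[:, 1] == 1) & (X[:, 7] == 0)

# Hypothetical grid; the ranges would need tuning for a real dataset.
param_grid = {
    "max_depth": [3, 6, 12],
    "min_samples_split": [2, 4, 8],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="balanced_accuracy",  # same metric the notebook uses for evaluation
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `X` and `y` would be the training features and the "acceptable" labels, with the test set kept aside for the final evaluation.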

Split the dataset into training and testing sets

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=["perf","acceptable"]), df["acceptable"], train_size=0.7)
X_train.head()

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle
76,1,1,0,0,1,1,1,0,1
177,1,1,0,0,1,0,1,1,0
80,1,1,1,0,1,0,0,0,1
30,1,1,0,1,1,1,1,1,0
121,1,1,0,0,0,0,1,0,0


Train the Decision Tree with the training set

In [10]:
cls.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=12, min_samples_split=4)

We now measure the accuracy of the model. The performance threshold can vary, which changes the class distribution. Plain accuracy is very sensitive to class imbalance, so we rely on balanced accuracy, which averages the recall obtained on each class.
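To illustrate why plain accuracy can mislead on imbalanced classes, here is a minimal sketch with hypothetical labels: nine unacceptable configurations, one acceptable one, and a useless model that always answers "unacceptable".

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels under a restrictive threshold: 9 of 10 are unacceptable.
y_true = [False] * 9 + [True]
# A useless model that always predicts "unacceptable".
y_always_false = [False] * 10

acc = accuracy_score(y_true, y_always_false)
bal_acc = balanced_accuracy_score(y_true, y_always_false)
print(acc, bal_acc)  # 0.9 0.5
```

Accuracy rewards the constant prediction, while balanced accuracy drops to the chance level of 0.5 because the recall on the acceptable class is zero.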

In [12]:
from sklearn.metrics import balanced_accuracy_score

# Predicting the test set
y_pred = cls.predict(X_test)

# Comparing the prediction to the truth
balanced_accuracy_score(
    y_test,
    y_pred
)

0.9473684210526316

The balanced accuracy is very high (about 0.95 on this test split), which indicates that the Decision Tree can accurately model whether an Apache configuration is acceptable.

The last step is to extract the rules, which can be done quite simply:

In [14]:
from sklearn.tree import _tree


def tree_to_rules_valid(model, feature_names):
    """Print one conjunction of conditions per leaf that predicts the acceptable class."""
    tree_ = model.tree_
    # Map each node to the name of the feature it splits on (leaves have none)
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    def recurse(node, previous_rules):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: follow both branches, extending the path conditions
            name = feature_name[node]
            threshold = tree_.threshold[node]
            recurse(tree_.children_left[node], previous_rules + [name + " <= " + str(threshold)])
            recurse(tree_.children_right[node], previous_rules + [name + " > " + str(threshold)])
        else:
            # Leaf: keep the path only if class index 1 (True, acceptable) is the majority
            if tree_.value[node][0][0] < tree_.value[node][0][1]:
                print(" & ".join(previous_rules))

    recurse(0, [])


tree_to_rules_valid(cls, X_train.columns)

KeepAlive > 0.5 & Handle <= 0.5 & InMemory <= 0.5 & AccessLog <= 0.5
KeepAlive > 0.5 & Handle <= 0.5 & InMemory <= 0.5 & AccessLog > 0.5 & EnableSendfile > 0.5 & HostnameLookups <= 0.5
KeepAlive > 0.5 & Handle <= 0.5 & InMemory > 0.5


According to these rules, an acceptable configuration must have KeepAlive activated (> 0.5) and Handle deactivated (<= 0.5); both conditions are necessary. The remaining rules describe interactions between options. If InMemory is activated, no further condition applies. If InMemory is deactivated, the configuration is acceptable when AccessLog is deactivated; when AccessLog is activated, EnableSendfile must be activated and HostnameLookups deactivated.

These rules can easily be applied to further constrain Apache.
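As a rough sketch of what applying the rules could look like, the three extracted rules can be encoded as a predicate and used to filter the configuration space. The function name and the dictionary representation of a configuration are assumptions made for this illustration; only the six options that appear in the rules are enumerated.

```python
from itertools import product

# The six options that appear in the extracted rules.
OPTIONS = ["HostnameLookups", "KeepAlive", "EnableSendfile", "AccessLog", "InMemory", "Handle"]


def satisfies_rules(cfg):
    """Return True if cfg (option name -> 0/1) matches one of the three extracted rules."""
    if not (cfg["KeepAlive"] and not cfg["Handle"]):
        return False  # necessary conditions shared by every rule
    if cfg["InMemory"]:
        return True  # third rule: InMemory activated
    if not cfg["AccessLog"]:
        return True  # first rule: InMemory and AccessLog deactivated
    # Second rule: AccessLog activated requires EnableSendfile on and HostnameLookups off
    return bool(cfg["EnableSendfile"] and not cfg["HostnameLookups"])


# Enumerate every combination of the six options and keep the allowed ones.
all_configs = [dict(zip(OPTIONS, values)) for values in product([0, 1], repeat=len(OPTIONS))]
allowed = [cfg for cfg in all_configs if satisfies_rules(cfg)]
print(len(allowed), "of", len(all_configs), "configurations remain")  # 13 of 64
```

Options that do not appear in any rule (such as FollowSymLinks or ExtendedStatus) stay unconstrained.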

### Regression

Although specialization is ultimately a classification problem, the class is derived from the performance value, so we can approach it another way. The main problem with classification is that the model completely ignores how acceptable a configuration is: from the model's point of view, a barely acceptable configuration is the same as a highly performant one. Performance is important information that is lost in the training process.

Instead of training a classification model on the acceptability of a configuration, we train a regression model to predict the performance value, then compare the predicted performance against the threshold to decide whether the configuration is acceptable. Unlike the classification model, this model is aware of the performance.

In [17]:
# Criterion left at its default, mean squared error ("mse" was renamed "squared_error" in scikit-learn 1.0)
reg = tree.DecisionTreeRegressor(max_depth=12, min_samples_split=4)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=["perf","acceptable"]), df["perf"], train_size=0.7)
X_train.head()

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle
189,1,0,0,1,0,0,0,1,0
126,1,0,1,0,0,0,1,0,0
156,1,0,0,0,0,0,1,1,0
14,1,1,1,1,0,0,1,1,0
180,1,1,0,0,0,1,0,1,0


In [19]:
reg.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=12, min_samples_split=4)

We measure the accuracy of the model. Note that we do not use a regression metric; we measure what we are actually interested in, the balanced accuracy of the classes obtained by comparing the predicted values with the threshold.

In [26]:
# Predicting the test set
y_pred = reg.predict(X_test)

# Comparing the prediction to the truth
balanced_accuracy_score(
    y_test > threshold,
    y_pred > threshold
)

0.9761904761904762

As we can see, this approach is also very accurate, scoring even higher than classification here, although the two scores come from different random train/test splits.

One downside of this technique, however, is that the model ignores the threshold. The model is trained to reduce the difference between predicted and true values, so if it predicts 1801 instead of 1799, the regression error is very low, yet the configuration is misclassified because the prediction falls on the other side of the threshold.
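The 1801 versus 1799 case can be checked with a few lines of arithmetic. This is only an illustration of the mismatch between the regression error and the classification outcome, using the threshold of 1800 set earlier.

```python
threshold = 1800  # same objective as in the notebook
y_true, y_pred = 1799, 1801  # values from the example in the text

# The regression error is tiny...
squared_error = (y_pred - y_true) ** 2
# ...but the two values fall on opposite sides of the threshold.
same_class = (y_true > threshold) == (y_pred > threshold)
print(squared_error, same_class)  # 4 False
```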

### Specialized Regression

The classification model is unaware of the performance, and the regression model is unaware of the threshold. We therefore propose a new approach that is aware of both.

We keep the regression model but modify the performance values in the dataset: if a value is higher than the threshold, we increase it by a large amount. This creates a gap at the threshold, so a prediction on the wrong side of the threshold produces a very high error, which the learning phase will try to avoid.

In [27]:
df["specialized_perf"] = df["perf"].map(lambda x: x + 10000 if x > threshold else x)
df

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle,perf,acceptable,specialized_perf
0,1,0,0,0,0,0,0,0,0,1200,False,1200
1,1,0,1,0,1,0,0,0,0,2100,True,12100
2,1,0,1,1,0,0,1,1,0,2310,True,12310
3,1,0,0,1,0,1,0,1,0,1260,False,1260
4,1,0,0,1,0,1,1,1,0,1140,False,1140
...,...,...,...,...,...,...,...,...,...,...,...,...
187,1,1,1,0,0,1,1,1,0,1860,True,11860
188,1,1,1,0,1,1,1,1,0,1920,True,11920
189,1,0,0,1,0,0,0,1,0,1410,False,1410
190,1,0,1,1,0,0,0,1,0,2460,True,12460
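The effect of the shift on the training error can be sketched with the same kind of arithmetic. The helper name `specialize` is hypothetical, and the two values straddling the threshold are chosen for illustration; the offset of 10000 matches the notebook cell above.

```python
threshold = 1800
shift = 10000  # same offset as in the notebook cell above


def specialize(perf):
    """Shift acceptable values upward, leaving a gap above the threshold."""
    return perf + shift if perf > threshold else perf


y_true, y_wrong_side = 1799, 1801  # hypothetical values straddling the threshold

# On the raw scale the two values are almost identical.
plain_error = (y_wrong_side - y_true) ** 2
# On the specialized scale they are separated by the gap.
specialized_error = (specialize(y_wrong_side) - specialize(y_true)) ** 2
print(plain_error, specialized_error)  # 4 100040004
```

A squared error of this size gives the tree a strong incentive to separate the two sides of the threshold when choosing its splits.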


In [28]:
# Criterion left at its default, mean squared error ("mse" was renamed "squared_error" in scikit-learn 1.0)
reg = tree.DecisionTreeRegressor(max_depth=12, min_samples_split=4)

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=["perf","acceptable","specialized_perf"]), df["specialized_perf"], train_size=0.7)
X_train.head()

Unnamed: 0,Base,HostnameLookups,KeepAlive,EnableSendfile,FollowSymLinks,AccessLog,ExtendedStatus,InMemory,Handle
125,1,1,1,0,0,1,0,0,0
146,1,1,1,0,1,1,0,0,0
110,1,0,0,1,0,1,0,0,0
154,1,0,0,0,0,0,0,1,0
174,1,1,0,0,0,0,1,1,0


In [29]:
reg.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=12, min_samples_split=4)

In [30]:
# Predicting the test set
y_pred = reg.predict(X_test)

# Comparing the prediction to the truth
balanced_accuracy_score(
    y_test > threshold,
    y_pred > threshold
)

0.9767441860465116

The balanced accuracy is on par with simple regression.

However, this is only one case, and we must investigate other parameter values, such as the training set size (we used 70%), threshold variation (what happens with a very restrictive threshold?), hyperparameter optimization, and so on.
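One way such an investigation could start is a simple sweep over thresholds and training set sizes. The sketch below is an assumption about how to organize it, not a result: it uses synthetic data with made-up option effects in place of the Apache measurements, and fixed random seeds so that runs are comparable.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Apache measurements (hypothetical effects and noise).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
perf = 1000 + 800 * X[:, 1] - 500 * X[:, 7] + 300 * X[:, 6] + rng.normal(0, 50, size=200)

results = []
for threshold in (1200, 1500, 1800):
    for train_size in (0.3, 0.5, 0.7):
        X_train, X_test, y_train, y_test = train_test_split(
            X, perf, train_size=train_size, random_state=0
        )
        reg = DecisionTreeRegressor(max_depth=12, min_samples_split=4, random_state=0)
        reg.fit(X_train, y_train)
        # Same evaluation as in the notebook: classes derived from the threshold.
        score = balanced_accuracy_score(y_test > threshold, reg.predict(X_test) > threshold)
        results.append((threshold, train_size, score))

for threshold, train_size, score in results:
    print(f"threshold={threshold} train_size={train_size} balanced_accuracy={score:.3f}")
```

Repeating each setting over several random splits and averaging the scores would give a more reliable picture than a single split.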