In [1]:
!pip install pycona
import pandas as pd
from pycona import *

## Prediction-based interactive CA

In **PyConA**, we can customize the environment used in interactive CA systems. The basic environment, *ActiveCAEnv*, lets us choose the exact method used for each of the three subcomponents of interactive CA: query generation, finding the scope, and finding the constraint.

In addition to the basic CA environment, a prediction-based environment is defined, namely *ProbaActiveCAEnv*, using techniques presented in "Learning to Learn in Interactive Constraint Acquisition", AAAI, 2024.

The difference is that ProbaActiveCAEnv stores additional information during the acquisition process: a constraint-level dataset is created and grown incrementally as more information is obtained about the constraints of the initial bias. Learned constraints get a positive label and excluded constraints get a negative label. This dataset is used to train a probabilistic classifier, which in turn predicts probabilities for the remaining candidate constraints and guides the acquisition process.
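The bookkeeping described above can be sketched in a few lines of plain Python. This is a standalone illustration, not PyConA's implementation: the `ConstraintDataset` class, the `(relation, var_a, var_b)` tuples and the `relation_score` function are all hypothetical, and ranking candidates by score is only one assumed way in which predictions could guide the acquisition.

```python
# Standalone sketch, NOT PyConA code: all names below are hypothetical.
# A constraint is modelled as a (relation, var_a, var_b) tuple.

class ConstraintDataset:
    """Collects labelled constraints while the acquisition proceeds."""

    def __init__(self, bias):
        self.bias = set(bias)   # candidates whose status is still unknown
        self.rows = []          # (constraint, label) pairs collected so far

    def mark_learned(self, c):
        """A learned constraint becomes a positive example (label 1)."""
        self.bias.discard(c)
        self.rows.append((c, 1))

    def mark_excluded(self, c):
        """An excluded constraint becomes a negative example (label 0)."""
        self.bias.discard(c)
        self.rows.append((c, 0))

    def rank_candidates(self, score):
        """Order the remaining candidates from most to least likely."""
        return sorted(self.bias, key=score, reverse=True)


bias = [("!=", 0, 1), ("==", 0, 1), ("<", 0, 1), ("!=", 1, 2), ("==", 1, 2)]
data = ConstraintDataset(bias)
data.mark_learned(("!=", 0, 1))
data.mark_excluded(("==", 0, 1))
data.mark_excluded(("<", 0, 1))


def relation_score(c):
    """Share of positive labels among seen constraints with the same relation."""
    labels = [y for (seen, y) in data.rows if seen[0] == c[0]]
    return sum(labels) / len(labels) if labels else 0.5


ranking = data.rank_candidates(relation_score)
print(ranking)
```

In this toy run, the remaining `!=` candidate is ranked ahead of the remaining `==` candidate, because the only `!=` constraint seen so far was learned and the only `==` constraint seen so far was excluded.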

For the above, ProbaActiveCAEnv provides 2 additional options that can be customized by the user:
- Feature representation (.feature_representation): The feature representation used for the constraints
- Classifier (.classifier): The (probabilistic) classifier used to predict probabilities for the candidate constraints

By default, a decision tree classifier is used, while the default feature representation is the one presented in "Learning to Learn in Interactive Constraint Acquisition", AAAI, 2024.


Let's create an interactive CA system using ProbaActiveCAEnv, and compare its performance with using the basic ActiveCAEnv. For that, we will use the running example on nurse rostering from the introductory tutorial.

In [2]:
from pycona import benchmarks

instance, oracle = benchmarks.construct_nurse_rostering(3, 2, 8, 2)

# env = ProbaActiveCAEnv()  # optional: ProbaActiveCAEnv is the default environment
ga = GrowAcq()  # same as GrowAcq(env)
learned_instance = ga.learn(instance, oracle, verbose=1)

Running growacq with <pycona.active_algorithms.mquacq2.MQuAcq2 object at 0x000001DF271B0070> as inner algorithm

Learned 0 constraints in 0 queries.
...L..
Learned 1 constraints in 5 queries.
...L.L
Learned 3 constraints in 9 queries.
...L.L.L
Learned 6 constraints in 14 queries.
...L.L.L.L
Learned 10 constraints in 20 queries.
...L.L.L.L.L
Learned 15 constraints in 27 queries.
......L..L
Learned 17 constraints in 35 queries.
.....L....L.L
Learned 20 constraints in 45 queries.
........L.L
Learned 22 constraints in 54 queries.
........L.L.L
Learned 25 constraints in 64 queries.
......L.L.L.L.
Learned 29 constraints in 74 queries.
.....L.L.L.L.L..
Learned 34 constraints in 85 queries.


In [3]:
# compare its performance with using the basic ActiveCAEnv
env_noprob = ActiveCAEnv()
ga_noprob = GrowAcq(env_noprob)
learned_instance = ga_noprob.learn(instance, oracle=oracle)


pd.concat([ga.env.metrics.short_statistics, 
           ga_noprob.env.metrics.short_statistics], keys=["Probabilistic", "Basic"])

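| | | CL | tot_q | top_lvl_q | tfs_q | tfc_q | avg_q_size | avg_gen_time | avg_t | max_t | tot_t | conv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Probabilistic | 0 | 34 | 85 | 79 | 2 | 4 | 5.3176 | 0.1515 | 0.1528 | 0.8518 | 12.99 | 1 |
| Basic | 0 | 34 | 190 | 68 | 83 | 39 | 4.5105 | 0.1505 | 0.0705 | 0.5536 | 13.4012 | 1 |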


As we can see, the probabilistic CA system needed far fewer queries: 85 in total, compared to 190 with the basic environment.
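To quantify the difference, the short computation below redoes the arithmetic on the numbers reported in the table above. It assumes that `top_lvl_q`, `tfs_q` and `tfc_q` count the top-level, scope-finding and constraint-finding queries respectively; that reading is an interpretation of the column names, supported only by the fact that the three columns sum to `tot_q`.

```python
# Query counts copied from the table above: (top_lvl_q, tfs_q, tfc_q).
# The meaning of the three columns is an assumption based on their names.
queries = {
    "Probabilistic": (79, 2, 4),
    "Basic": (68, 83, 39),
}

# The three parts add up to the tot_q column (85 and 190).
totals = {name: sum(parts) for name, parts in queries.items()}

# Relative reduction in the total number of queries.
reduction = 1 - totals["Probabilistic"] / totals["Basic"]

print(totals)               # {'Probabilistic': 85, 'Basic': 190}
print(f"{reduction:.1%}")   # 55.3%
```

Under that reading of the columns, almost all of the savings in this run come from the `tfs_q` and `tfc_q` columns (6 queries instead of 122), while the number of top-level queries is slightly higher (79 instead of 68).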

### Customize the behaviour of ProbaActiveCAEnv

##### Changing the classifier used

In **PyConA**, we can also change the classifier that is used. **PyConA** uses scikit-learn classifiers, but any classifier with .fit() and .predict_proba() methods can be used. We have also defined a *Predictor* abstract class, which can be subclassed for use in ProbaActiveCAEnv.
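To make the .fit() / .predict_proba() convention concrete, here is a minimal sketch of a classifier that follows it. The `FrequencyClassifier` class is hypothetical and written in plain Python for illustration. scikit-learn classifiers return NumPy arrays from `predict_proba`, whereas this toy version returns nested lists, and whether PyConA accepts it unchanged has not been verified here.

```python
from collections import defaultdict


class FrequencyClassifier:
    """Toy binary classifier following the fit/predict_proba convention.

    Hypothetical example: it estimates P(label = 1) per distinct feature
    vector from label counts, with Laplace (add-one) smoothing.
    """

    def fit(self, X, y):
        # counts_[features] = [number of 0 labels, number of 1 labels]
        self.counts_ = defaultdict(lambda: [0, 0])
        for features, label in zip(X, y):
            self.counts_[tuple(features)][int(label)] += 1
        self.classes_ = [0, 1]  # column order used by predict_proba
        return self

    def predict_proba(self, X):
        # One row per sample: [P(label = 0), P(label = 1)]
        rows = []
        for features in X:
            neg, pos = self.counts_.get(tuple(features), (0, 0))
            p_pos = (pos + 1) / (neg + pos + 2)
            rows.append([1 - p_pos, p_pos])
        return rows


clf = FrequencyClassifier().fit([[0], [0], [1]], [1, 1, 0])
proba = clf.predict_proba([[0], [1], [2]])
print(proba)
```

A feature vector that was never seen during training gets probability 0.5 for both classes, which is the effect of the add-one smoothing.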

Let us now use a Naive Bayes classifier in ProbaActiveCAEnv.

In [4]:
from sklearn.naive_bayes import GaussianNB
envNB = ProbaActiveCAEnv(classifier=GaussianNB())
gaNB = GrowAcq(envNB)
learned_instance = gaNB.learn(instance, oracle=oracle)

pd.concat([ga.env.metrics.short_statistics, 
           gaNB.env.metrics.short_statistics,
           ga_noprob.env.metrics.short_statistics], keys=["Decision Tree", "Naive Bayes", "Basic"])

| | | CL | tot_q | top_lvl_q | tfs_q | tfc_q | avg_q_size | avg_gen_time | avg_t | max_t | tot_t | conv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision Tree | 0 | 34 | 85 | 79 | 2 | 4 | 5.3176 | 0.1515 | 0.1528 | 0.8518 | 12.99 | 1 |
| Naive Bayes | 0 | 34 | 95 | 85 | 5 | 5 | 5.3895 | 0.1149 | 0.1164 | 0.4708 | 11.062 | 1 |
| Basic | 0 | 34 | 190 | 68 | 83 | 39 | 4.5105 | 0.1505 | 0.0705 | 0.5536 | 13.4012 | 1 |


We can observe a small increase in the number of queries with Naive Bayes (95 instead of 85 with the decision tree), but this is still only half of the 190 queries needed with the basic ActiveCAEnv.

##### Changing the feature representation


In **PyConA**, the FeatureRepresentation class is used to featurize constraints, so that they can be used by a probabilistic classifier.

By subclassing FeatureRepresentation, we can define custom feature representations. Let's define a simple feature representation that only takes into account the relation of the constraint:

In [5]:
from pycona.predictor import FeatureRepresentation
from pycona.utils import get_relation

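class FeaturesSimpleRel(FeatureRepresentation):
    """Feature representation that only uses the relation of a constraint."""

    def featurize_constraint(self, c):
        # Look up the relation of c in the language of the instance
        relation = get_relation(c, self.instance.language)
        return [relation]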


We can then pass this feature representation to ProbaActiveCAEnv, either when initializing it or later (via .feature_representation):

In [6]:
env1 = ProbaActiveCAEnv(feature_representation=FeaturesSimpleRel())
ga1 = GrowAcq(env1)
learned_instance1 = ga1.learn(instance, oracle=oracle)

env2 = ProbaActiveCAEnv()
ga2 = GrowAcq(env2)
learned_instance2 = ga2.learn(instance, oracle=oracle)

pd.concat([ga2.env.metrics.short_statistics, 
           ga1.env.metrics.short_statistics], keys=["All Features", "Simple Features"])

| | | CL | tot_q | top_lvl_q | tfs_q | tfc_q | avg_q_size | avg_gen_time | avg_t | max_t | tot_t | conv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All Features | 0 | 34 | 82 | 79 | 1 | 2 | 5.3537 | 0.1633 | 0.1706 | 1.0124 | 13.9928 | 1 |
| Simple Features | 0 | 34 | 90 | 88 | 0 | 2 | 5.6667 | 0.1333 | 0.1379 | 0.7805 | 12.4102 | 1 |


Although the simple feature representation we defined needs slightly more queries than the more advanced default one (90 instead of 82), it already reduces the number of queries considerably compared to the 190 queries of the basic ActiveCAEnv. That is because it can directly recognise that the majority of the candidate constraints are probably not part of the target set of constraints, as they are not "!=" constraints.
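The intuition behind this can be checked on a small standalone example. The sketch below does not use PyConA: the six-relation language, the four variables and the assumption that the target consists only of "!=" constraints are all hypothetical choices made for illustration. It shows that a relation-only encoding already splits such a bias into groups that contain either only target constraints or none at all.

```python
from itertools import combinations

# Hypothetical setup, not taken from the nurse rostering benchmark.
relations = ["==", "!=", "<", ">", "<=", ">="]
variables = range(4)

# Bias: every relation over every pair of variables.
bias = [(rel, a, b) for a, b in combinations(variables, 2) for rel in relations]

# Assumed target: exactly the "!=" constraints.
target = {c for c in bias if c[0] == "!="}


def featurize(c):
    """One-hot encoding of the relation; the variables are ignored."""
    return [int(c[0] == rel) for rel in relations]


# Group the candidates by feature vector and record target membership.
groups = {}
for c in bias:
    groups.setdefault(tuple(featurize(c)), []).append(c in target)

# Every group is "pure": all of its members are in the target, or none are.
pure = all(len(set(flags)) == 1 for flags in groups.values())
non_target_share = 1 - len(target) / len(bias)

print(len(bias), len(groups), pure, f"{non_target_share:.1%}")
```

In this toy bias of 36 candidates, 30 (about 83%) share a feature vector that never occurs in the target, so a classifier trained on the relation alone can give all of them a low probability after seeing only a few labelled examples.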