# Introductory Notebook 2: Mixed-Datatype Problems, Multi-Objective and Mixed-Objective Problems
This notebook covers mixed datatypes and generating counterfactuals with multiple objectives of mixed types.

In [1]:
import sys
sys.path.append('../src/')
import decode_mcd
from decode_mcd import mcd_problem
from decode_mcd import mcd_dataset


from decode_mcd import design_targets
from decode_mcd import mcd_generator

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Creating a Dataset

First, let's create a dataset inspired by some basic arithmetic. This time, we will create a dataset with four variables of different types. `A` will be a random integer from 0 to 10 inclusive. `B` will be a random float from -1 to 1. `C` will be a random choice among "Add", "Subtract", "Multiply", and "Divide". Finally, `D` will be a boolean variable determining the sign of the expression: True corresponds to positive, while False means negative. To handle mixed datatypes effectively, we will use pandas.

In [2]:
num_data = 1000
A = np.random.randint(0, 11, num_data) #The upper bound is exclusive, so 11 yields integers from 0 to 10 inclusive.
B = np.random.rand(num_data) * 2 - 1 #Randomized values which originally range from 0 to 1. Scales to -1 to 1.
C = np.random.choice(["Add", "Subtract", "Multiply", "Divide"], num_data)
D = np.random.choice([True, False], num_data)
x = pd.DataFrame({"A": A, "B": B, "C": C, "D": D})
display(x)


Unnamed: 0,A,B,C,D
0,5,0.295309,Divide,False
1,7,-0.348660,Divide,False
2,1,-0.617843,Add,False
3,0,-0.042743,Multiply,True
4,1,0.021514,Divide,True
...,...,...,...,...
995,8,-0.153462,Multiply,False
996,3,0.411702,Add,True
997,4,0.767748,Add,False
998,2,0.003527,Subtract,True
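
One detail worth noting about the sampling above: `np.random.randint` treats its upper bound as exclusive, and an unseeded run yields a different dataset each time. Below is a minimal sketch of a reproducible alternative, assuming NumPy's `Generator` API is acceptable here. The seed value and the name `rng` are illustrative choices and are not part of the original notebook.

```python
import numpy as np

# Seeded generator so the sampled data is identical on every run (the seed is arbitrary).
rng = np.random.default_rng(seed=0)
num_data = 1000

# endpoint=True makes the upper bound inclusive, so A can take the value 10.
A = rng.integers(0, 10, size=num_data, endpoint=True)
# Uniform floats on the half-open interval [-1, 1).
B = rng.uniform(-1.0, 1.0, size=num_data)
C = rng.choice(["Add", "Subtract", "Multiply", "Divide"], size=num_data)
D = rng.choice([True, False], size=num_data)
```

The resulting arrays can be passed to `pd.DataFrame` exactly as in the cell above.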


We will create two functions. The first performs the operation C(D(A), B). The second checks whether D(C(B, A)) > 0. For example, if C is "Add" and D is False, function 1 calculates -A+B, while function 2 checks whether -(B+A) is greater than 0. We code up the `evaluate` function, which takes in a dataframe of `A`, `B`, `C`, and `D` values and returns the corresponding `O1` and `O2` values, the outputs of the two functions.

In [3]:
def apply_operation(C, x, y):
    #Vectorized function to calculate C(x, y), for example if C is "Subtract", calculates x-y.
    add_mask = (C == "Add")
    subtract_mask = (C == "Subtract")
    multiply_mask = (C == "Multiply")
    divide_mask = (C == "Divide")
    result = np.zeros(len(C))
    result[add_mask] = x[add_mask] + y[add_mask]
    result[subtract_mask] = x[subtract_mask] - y[subtract_mask]
    result[multiply_mask] = x[multiply_mask] * y[multiply_mask]
    result[divide_mask] = x[divide_mask] / y[divide_mask]
    return result

def apply_inverse(D, x):
    #Vectorized function to calculate D(x), for example if D is False, returns -x. 
    return x*D-x*~D

def evaluate(x):
    #Evaluation function to calculate both objectives. x is an nx4 dataframe. 
    A = x["A"] #First isolate the individual variables from the provided dataframe
    B = x["B"]
    C = x["C"]
    D = x["D"].astype(bool)
    objective_1 = apply_operation(C, apply_inverse(D, A), B) #Calculate objective 1
    objective_2 = np.greater(apply_inverse(D, apply_operation(C, B, A)), 0) #Calculate objective 2
    return pd.DataFrame({"O1": objective_1, "O2": objective_2}) #Create a nx2 dataframe with the objective values
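
To make the two objective definitions concrete, here is a scalar sketch of the same arithmetic for a single row. The names `OPERATIONS`, `signed`, and `evaluate_row` are hypothetical stand-ins for illustration. They are not part of the notebook or of `decode_mcd`.

```python
import operator

# Scalar counterparts of the four operations selected by C.
OPERATIONS = {
    "Add": operator.add,
    "Subtract": operator.sub,
    "Multiply": operator.mul,
    "Divide": operator.truediv,
}

def signed(d, value):
    # D(value): keep the sign when d is True, flip it when d is False.
    return value if d else -value

def evaluate_row(a, b, c, d):
    o1 = OPERATIONS[c](signed(d, a), b)       # O1 = C(D(A), B)
    o2 = signed(d, OPERATIONS[c](b, a)) > 0   # O2 = D(C(B, A)) > 0
    return o1, o2

print(evaluate_row(3, 0.5, "Add", False))     # (-2.5, False)
print(evaluate_row(2, 0.5, "Divide", True))   # (4.0, True)
```

In the first call, O1 is -3 + 0.5 and O2 asks whether -(0.5 + 3) is positive. In the second, O1 is 2 / 0.5 and O2 asks whether 0.5 / 2 is positive.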
 

Let's evaluate our dataset.

In [4]:
y = evaluate(x)
display(y)

Unnamed: 0,O1,O2
0,-16.931417,False
1,20.076850,True
2,-1.617843,False
3,-0.000000,False
4,46.482258,True
...,...,...
995,1.227697,True
996,3.411702,True
997,-3.232252,False
998,1.996473,False


Finally, we create our query. 

In [5]:
x_query = pd.DataFrame({"A": [0], "B": [0.0], "C": ["Add"], "D": [True]}, index = ["Query"])
display(x_query)

Unnamed: 0,A,B,C,D
Query,0,0.0,Add,True


### Setting up MCD

Now we are ready to set up the `McdDataset`. We specify the datatypes of our dataset. This time we have one of each type of variable. The bounds and options match those used to generate the data earlier in the notebook.

In [6]:
from pymoo.core.variable import Real, Integer, Choice, Binary 
datatypes=[Integer(bounds=(0, 10)), 
           Real(bounds=(-1, 1)), 
           Choice(options=["Add", "Subtract", "Multiply", "Divide"]), 
           Binary()]

data = mcd_dataset.McdDataset(x=x, y=y, x_datatypes=datatypes)

Next, we create the design targets and the `McdProblem`. In this case, we create two design targets: one continuous target and one categorical target. Continuous targets should be used for any objective with ordinal significance (such as floats or ints), while categorical targets should be used for those without (such as classes or bools). We want to set a hard lower bound of 10 for O1 with no upper bound, meaning C(D(A), B) >= 10. We also require that O2 be True, meaning that D(C(B, A)) > 0.

In [7]:
#We first set up a ContinuousTarget for O1, setting a minimum of 10 and setting no upper bound (i.e. infinity)
target_1 = design_targets.ContinuousTarget(label = "O1", lower_bound=10, upper_bound=np.inf)

#We then set up a CategoricalTarget for O2 specifying only True as the desired class. 
#Desired_classes is a list. In problems with multiple classes, this list specifies the acceptable classes.
target_2 = design_targets.CategoricalTarget(label = "O2", desired_classes=[True])

#We then create a DesignTargets object with the two targets.
y_targets = design_targets.DesignTargets(continuous_targets=[target_1], categorical_targets=[target_2])
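
As a conceptual aid, the sketch below shows one way the two kinds of targets could be checked against a single prediction. This is an illustration under assumptions, not the `decode_mcd` implementation: the helper names `continuous_violation` and `categorical_satisfied` are hypothetical, and the library's internal scoring may differ.

```python
import math

def continuous_violation(value, lower_bound=-math.inf, upper_bound=math.inf):
    # Distance by which value falls outside [lower_bound, upper_bound]; 0.0 if inside.
    return max(value - upper_bound, lower_bound - value, 0.0)

def categorical_satisfied(value, desired_classes):
    # A categorical target is met when the predicted class is one of the accepted classes.
    return value in desired_classes

print(continuous_violation(12.5, lower_bound=10))   # 0.0
print(continuous_violation(7.0, lower_bound=10))    # 3.0
print(categorical_satisfied(True, [True]))          # True
```

Leaving `upper_bound` at infinity mirrors the one-sided target on O1: only values below 10 incur a violation.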

In [8]:
problem = mcd_problem.McdProblem(mcd_dataset=data, x_query = x_query, y_targets = y_targets, prediction_function=evaluate)

Finally, we create the `McdGenerator`:

In [9]:
generator = mcd_generator.McdGenerator(mcd_problem=problem, pop_size=100, initialize_from_dataset=True)

### Generating Counterfactuals
Finally, we run the generator and sample.

In [10]:
generator.generate(n_generations=10)

1000 dataset entries found matching problem parameters
Initial population initialized from dataset of 1000 samples!
Training GA from 0 to 10 generations!
n_gen  |  n_eval  | n_nds  |     cv_min    |     cv_avg    |      eps      |   indicator  
     1 |        0 |      6 |  0.000000E+00 |  1.209752E+01 |             - |             -
     2 |      100 |      6 |  0.000000E+00 |  0.2775958492 |  0.000000E+00 |             f
     3 |      200 |      6 |  0.000000E+00 |  0.0844261116 |  0.000000E+00 |             f
     4 |      300 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
     5 |      400 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
     6 |      500 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
     7 |      600 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
     8 |      700 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
     9 |      800 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f
    10 |      900 |      6 |  0.000000E+00 |  0.000000E+00 |  0.000000E+00 |             f

In [11]:
num_samples = 10 
counterfactuals = generator.sample(num_samples, include_dataset=False)
display(counterfactuals)

Collecting all counterfactual candidates!
Scoring all counterfactual candidates!
Calculating diversity matrix!
Sampling diverse set of counterfactual candidates!
samples_index=[np.int64(22), np.int64(153), np.int64(97), np.int64(39), np.int64(177), np.int64(87), np.int64(176), np.int64(16), np.int64(38), np.int64(88)]
Done! Returning CFs


Unnamed: 0,A,B,C,D
0,1,0.00985,Divide,True
1,5,0.489808,Divide,True
2,2,0.19024,Divide,True
3,3,0.025129,Divide,True
4,1,-0.015964,Divide,False
5,7,0.002456,Divide,True
6,4,0.144306,Divide,True
7,2,0.005912,Divide,True
8,3,0.214452,Divide,True
9,5,0.021514,Divide,True


Let's evaluate the counterfactuals we generated. We should see that every O1 value is greater than or equal to 10 and every O2 value is True.

In [12]:
evaluate(counterfactuals)

Unnamed: 0,O1,O2
0,101.520958,True
1,10.208088,True
2,10.51301,True
3,119.384876,True
4,62.642512,True
5,2849.927002,True
6,27.718813,True
7,338.306053,True
8,13.989167,True
9,232.41129,True
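
Rather than inspecting the table by eye, the same check can be expressed in code. The sketch below is self-contained, so it uses a small stand-in dataframe built from the first three rows of the output above. In the notebook, `results` would simply be `evaluate(counterfactuals)`.

```python
import pandas as pd

# Stand-in for evaluate(counterfactuals): the first three rows of the output above.
results = pd.DataFrame({
    "O1": [101.520958, 10.208088, 10.51301],
    "O2": [True, True, True],
})

meets_o1 = bool((results["O1"] >= 10).all())  # ContinuousTarget: O1 >= 10
meets_o2 = bool(results["O2"].all())          # CategoricalTarget: O2 is True
print(meets_o1 and meets_o2)                  # True
```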


### Constraining input features
MCD provides a convenient interface for freezing input features that should not be changed. These can be specified when constructing the `McdProblem` through the parameter `features_to_freeze`. Let's say we want to ensure that our generated counterfactuals don't change the operation used in the query, which was addition. We can specify that as below.

In [13]:
import importlib
importlib.reload(mcd_problem)
x_query = pd.DataFrame({"A": [-1], "B": [0.0], "C": ["Add"],  "D": [True]}, index = ["Query"])
to_freeze = ["C"]
problem = mcd_problem.McdProblem(mcd_dataset=data, x_query = x_query, y_targets = y_targets, features_to_freeze=to_freeze, prediction_function=evaluate)



In [14]:
import importlib
importlib.reload(mcd_generator)
generator = mcd_generator.McdGenerator(mcd_problem=problem, pop_size=100, initialize_from_dataset=True)
generator.generate(n_generations=10)
counterfactuals = generator.sample(num_samples, include_dataset=False)

261 dataset entries found matching problem parameters
Initial population initialized from dataset of 261 samples!
Training GA from 0 to 10 generations!
n_gen  |  n_eval  | n_nds  |     cv_min    |     cv_avg    |      eps      |   indicator  
     1 |        0 |      1 |  0.0953373044 |  1.048030E+01 |             - |             -
     2 |      100 |      1 |  0.0953373044 |  3.3255443406 |             - |             -
     3 |      200 |      1 |  0.000000E+00 |  2.3350621227 |             - |             -
     4 |      300 |      2 |  0.000000E+00 |  1.7001490346 |  1.0000000000 |         ideal
     5 |      400 |      3 |  0.000000E+00 |  1.1421771261 |  0.0063617221 |         ideal
     6 |      500 |      6 |  0.000000E+00 |  0.7399105182 |  0.7572665872 |         ideal
     7 |      600 |      7 |  0.000000E+00 |  0.5045780804 |  0.0136805328 |             f
     8 |      700 |      9 |  0.000000E+00 |  0.3006217072 |  0.0339088500 |             f
     9 |      800 |     11 | 

Now, we can see that our generated counterfactuals indeed all use the addition operator! 

In [15]:
display(counterfactuals)

Unnamed: 0,A,B,C,D
0,10,0.0,Add,True
1,10,0.264259,Add,True
2,10,0.021456,Add,True
3,10,0.115747,Add,True
4,10,0.062995,Add,True
5,10,0.004985,Add,True
6,10,0.14412,Add,True
7,10,0.412947,Add,True
8,10,0.0307,Add,True
9,10,0.06975,Add,True
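
The same observation can be checked programmatically. The helper below, `frozen_features_unchanged`, is a hypothetical name introduced for this sketch and is not part of `decode_mcd`. The stand-in dataframe `cfs` copies the first two rows of the table above so that the example runs on its own.

```python
import pandas as pd

def frozen_features_unchanged(counterfactuals, query, frozen):
    # True when every counterfactual keeps the query's value in each frozen column.
    query_row = query.iloc[0]
    return all(bool((counterfactuals[col] == query_row[col]).all()) for col in frozen)

x_query = pd.DataFrame({"A": [-1], "B": [0.0], "C": ["Add"], "D": [True]}, index=["Query"])
cfs = pd.DataFrame({"A": [10, 10], "B": [0.0, 0.264259], "C": ["Add", "Add"], "D": [True, True]})

print(frozen_features_unchanged(cfs, x_query, ["C"]))  # True: C was frozen
print(frozen_features_unchanged(cfs, x_query, ["A"]))  # False: A was free to change
```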


### Adding Optimization Objectives to the Mix
Now let's try converting one of our two constraints into an objective. We will replace our `ContinuousTarget` with a `MinimizationTarget`. In this case we do not specify bounds, but if we wanted to further constrain the objective being minimized, this could be done by using a `ContinuousTarget` together with a `MinimizationTarget`.

In [16]:
target_1 = design_targets.MinimizationTarget(label = "O1")
y_targets = design_targets.DesignTargets(minimization_targets=[target_1], categorical_targets=[target_2])
problem = mcd_problem.McdProblem(mcd_dataset=data, x_query = x_query, y_targets = y_targets, prediction_function=evaluate)



In [17]:
generator = mcd_generator.McdGenerator(mcd_problem=problem, pop_size=100, initialize_from_dataset=True)
generator.generate(n_generations=10)
counterfactuals = generator.sample(num_samples, include_dataset=False)

1000 dataset entries found matching problem parameters
Initial population initialized from dataset of 1000 samples!
Training GA from 0 to 10 generations!
n_gen  |  n_eval  | n_nds  |     cv_min    |     cv_avg    |      eps      |   indicator  
     1 |        0 |     71 |  0.000000E+00 |  0.5074925075 |             - |             -
     2 |      100 |     72 |  0.000000E+00 |  0.000000E+00 |  0.0003884776 |             f
     3 |      200 |     90 |  0.000000E+00 |  0.000000E+00 |  0.3333333333 |         ideal
     4 |      300 |    100 |  0.000000E+00 |  0.000000E+00 |  0.0122772257 |         ideal
     5 |      400 |    100 |  0.000000E+00 |  0.000000E+00 |  0.0066136045 |             f
     6 |      500 |    100 |  0.000000E+00 |  0.000000E+00 |  0.0050456493 |             f
     7 |      600 |    100 |  0.000000E+00 |  0.000000E+00 |  0.0493069366 |         ideal
     8 |      700 |    100 |  0.000000E+00 |  0.000000E+00 |  0.0061095408 |             f
     9 |      800 |    100 

We can see that in this case, the generated counterfactuals have much smaller O1 values. 

In [18]:
evaluate(counterfactuals)

Unnamed: 0,O1,O2
0,0.99643,True
1,9.99758,True
2,0.921598,True
3,17.419469,True
4,-7.954702,True
5,6.773268,True
6,2.532651,True
7,1.067632,True
8,-0.998695,True
9,-0.004147,True
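
To put a number on "much smaller", one option is to compare the medians of the two sets of O1 values. The sketch below copies the values from the two `evaluate(counterfactuals)` outputs in this notebook (the constrained run with O1 >= 10 and the minimization run above). The variable names are illustrative.

```python
from statistics import median

# O1 values from the constrained run (O1 >= 10) earlier in the notebook.
o1_constrained = [101.520958, 10.208088, 10.51301, 119.384876, 62.642512,
                  2849.927002, 27.718813, 338.306053, 13.989167, 232.41129]
# O1 values from the minimization run above.
o1_minimized = [0.99643, 9.99758, 0.921598, 17.419469, -7.954702,
                6.773268, 2.532651, 1.067632, -0.998695, -0.004147]

print(median(o1_constrained), median(o1_minimized))
```

The median drops from roughly 82 to roughly 1. The median is used here rather than the mean because the constrained run contains one very large value.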


This concludes the second MCD tutorial notebook! The third will cover advanced selection options for counterfactuals. 