# Parameter settings for subgroup discovery
This example uses the UCI Adult dataset, where the objective is to predict whether a person earns more (label 1) or less (label 0) than $50,000 a year.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
#Import dataset
d = fetch_openml(data_id=1590, as_frame=True)
X = d.data
d_train=pd.get_dummies(X)
y_true = (d.target == '>50K') * 1
#training the classifier
classifier = DecisionTreeClassifier(min_samples_leaf=10, max_depth=4)
classifier.fit(d_train, y_true)
#Producing y_pred
y_pred = classifier.predict(d_train)

## SubgroupDiscoveryTask Object
This class holds all the parameters used by the subgroup discovery algorithms.<br/>
**Parameters:**

|name |type |default value |description
|:-----|:-----|:-----|:----- 
|X| pandas.DataFrame or <br/> numpy.ndarray| |dataset.
|y_true| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |ground truth.
|y_pred| pandas.DataFrame,<br/> pandas.Series or <br/> numpy.ndarray| |predicted label.
|feature_names| list of strings| None| see below.
|nominal_features| list of strings| None| see below.
|numeric_features| list of strings| None| see below.
|qf | String or <br/>callable object| 'equalized_odds_difference'| see below.
|discretizer| String| 'equalfreq'| see below.
|num_bins | int| 6|  see below.
|dynamic_discretization| bool| True| see below.
|result_set_size| int| 5| maximum number of subgroups <br/>that the subgroup discovery will return.
|depth| int| 3| maximum number of descriptors<br/> that a description can contain.
|min_quality| float| 0.1| minimum quality<br/> that a subgroup needs in order to be selected.
|min_support| int| 200| minimum size <br/>that a subgroup needs in order to be selected.

### X and feature_names
The X parameter can be a pandas DataFrame or a NumPy array. If we pass a NumPy array, we must also pass the feature_names parameter, a list with the column names of X:

In [2]:
import fairsd as fsd
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred) # X is a pandas.DataFrame

# X as a numpy array: the column names must be passed explicitly
x_np    = X.to_numpy()
columns = list(X.columns)
task=fsd.SubgroupDiscoveryTask(x_np, y_true, y_pred, feature_names = columns)

### nominal_features and numeric_features
These two parameters (lists of strings) specify which columns of X hold nominal values and which hold numeric values. For attributes that appear in neither list, the data type is inferred automatically.<br/>
**Example:**<br/>
    Let's analyze the education-num attribute

In [3]:
educationnum_val = X['education-num'].unique()
print(np.sort(educationnum_val))

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16.]


The values of this attribute are integers from 1 to 16. <br/>
The package treats this attribute as numeric by default, but if we want it to be treated as nominal we can use the nominal_features parameter:

In [4]:
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, nominal_features = ['education-num'])
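
To make the effect of these overrides concrete, here is a minimal sketch of how columns could be split into nominal and numeric ones. It is not fairsd's actual inference logic: the helper `split_feature_types` is hypothetical, and it assumes the inference is based on the pandas dtype of each column.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the adult dataset.
toy = pd.DataFrame({
    "education-num": [9.0, 13.0, 10.0, 14.0],
    "workclass": pd.Categorical(["Private", "State-gov", "Private", "Self-emp"]),
    "age": [39, 50, 38, 53],
})

def split_feature_types(df, nominal_features=None, numeric_features=None):
    """Hypothetical helper: explicit lists win, the rest is inferred from the dtype."""
    nominal = list(nominal_features or [])
    numeric = list(numeric_features or [])
    for col in df.columns:
        if col in nominal or col in numeric:
            continue  # the user already decided how to treat this column
        if pd.api.types.is_numeric_dtype(df[col]):
            numeric.append(col)
        else:
            nominal.append(col)
    return nominal, numeric

# Without overrides, education-num is inferred as numeric.
default_nominal, default_numeric = split_feature_types(toy)
# With the override, it moves to the nominal list.
forced_nominal, forced_numeric = split_feature_types(
    toy, nominal_features=["education-num"]
)
print(default_nominal, default_numeric)
print(forced_nominal, forced_numeric)
```

Treating education-num as nominal means each of its 16 values can become its own descriptor, instead of being grouped into bins by the discretizer.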

### qf -- quality function
This parameter can be a string or a callable object.<br/>
* For the list of the accepted strings see [quality_function_options](https://github.com/MaurizioPulizzi/fairsd/blob/62e8e408801efe1dcc3e19f05fc3ae3181440241/fairsd/algorithms.py#L13)
* For a more detailed explanation of the qf parameter see [here]()
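
As an illustration of what the default option measures, the sketch below computes an equalized odds difference by hand, using the common definition: the larger of the true-positive-rate gap and the false-positive-rate gap between two groups. Comparing a subgroup against its complement is an assumption made here, and the function name `eq_odds_diff_sketch` is hypothetical. This is not fairsd's implementation, and it does not show the signature that a callable qf must have.

```python
import numpy as np

def positive_rate(y_true, y_pred, label):
    """Fraction of rows with y_true == label that were predicted positive."""
    mask = y_true == label
    return y_pred[mask].mean() if mask.any() else 0.0

def eq_odds_diff_sketch(y_true, y_pred, in_subgroup):
    """Hypothetical helper: max of the TPR gap and FPR gap, subgroup vs. complement."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    inside = np.asarray(in_subgroup, dtype=bool)
    outside = ~inside
    tpr_gap = abs(positive_rate(y_true[inside], y_pred[inside], 1)
                  - positive_rate(y_true[outside], y_pred[outside], 1))
    fpr_gap = abs(positive_rate(y_true[inside], y_pred[inside], 0)
                  - positive_rate(y_true[outside], y_pred[outside], 0))
    return max(tpr_gap, fpr_gap)

# Toy data: the classifier is perfect inside the subgroup and poor outside it.
y_true_toy = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 1, 0])
subgroup = np.array([True, True, True, True, False, False, False, False])

quality = eq_odds_diff_sketch(y_true_toy, y_pred_toy, subgroup)
print(quality)
```

A larger value indicates that the classifier's error rates differ more between the subgroup and the rest of the data.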

### discretizer
This parameter determines the algorithm used to discretize the numerical features.
It is set to 'equalfreq' by default, but other options are available:<br/>

|discretizer parameter |description 
|:-----|:----- 
|'equalfreq'|approximate equal frequency discretization.
|'equalwidth'|equal width discretization.
|'mdlp'|minimum description length principle discretization (Fayyad and Irani approach).

In [5]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='mdlp')
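
The difference between the first two options can be seen on a small skewed sample. The following is a plain NumPy sketch of the two ideas, not the code fairsd runs internally; the data and variable names are made up for illustration.

```python
import numpy as np

# Skewed toy feature: nine small values and one outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
num_bins = 2

# Equal width: split the range [min, max] into intervals of equal length.
width_edges = np.linspace(values.min(), values.max(), num_bins + 1)

# Equal frequency: put the cut points at quantiles, so that each bin
# holds roughly the same number of rows.
freq_edges = np.quantile(values, np.linspace(0, 1, num_bins + 1))

# Use only the interior edges as cut points and count the rows per bin.
width_counts = np.bincount(np.digitize(values, width_edges[1:-1]), minlength=num_bins)
freq_counts = np.bincount(np.digitize(values, freq_edges[1:-1]), minlength=num_bins)

print(width_counts)  # rows per bin with equal-width cut points
print(freq_counts)   # rows per bin with equal-frequency cut points
```

With equal width, the outlier ends up alone in its bin, while equal frequency splits the rows evenly. This matters for subgroup discovery because very small bins are unlikely to satisfy min_support.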

### num_bins
This parameter determines the maximum number of bins that the discretization of a numerical feature will produce. The number of bins produced can be lower than num_bins because of the min_support constraint:<br/>
* the equal-frequency and mdlp discretizations never produce bins smaller than min_support, since we know in advance that such bins could never be used to create a description with large enough support.<br/>

If a value lower than two is used, the number of bins will be automatically decided.


In [6]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, discretizer='equalwidth', num_bins = 4)
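
The interaction between num_bins and min_support for equal-frequency discretization can be reasoned about with simple arithmetic. The helper below is hypothetical and only sketches that reasoning under the assumption that every bin must contain at least min_support rows; the exact rule used by the package may differ.

```python
def effective_num_bins(n_rows, num_bins, min_support):
    """Hypothetical helper: upper bound on useful equal-frequency bins."""
    # A bin smaller than min_support can never yield a valid subgroup,
    # so at most n_rows // min_support bins of equal size are useful.
    max_useful = max(1, n_rows // min_support)
    return min(num_bins, max_useful)

many_rows = effective_num_bins(n_rows=10000, num_bins=6, min_support=200)
some_rows = effective_num_bins(n_rows=1000, num_bins=6, min_support=200)
few_rows = effective_num_bins(n_rows=500, num_bins=6, min_support=200)
print(many_rows, some_rows, few_rows)
```

The fewer rows are available (for example, deep inside a description when dynamic discretization is used), the more likely it is that fewer than num_bins bins are produced.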

### dynamic_discretization
If it is set to False, each numerical feature is discretized only once, before the subgroup discovery algorithm starts.<br/>
If it is set to True, the discretization of numerical features is integrated into the subgroup discovery algorithm: it is performed each time the algorithm tries to expand a description with a numerical attribute.

In [7]:
#example
task=fsd.SubgroupDiscoveryTask(X, y_true, y_pred, dynamic_discretization = False)
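
The practical consequence of the two settings is that the cut points can differ. The sketch below uses a single median cut as a stand-in for a full discretization and a made-up boolean mask as the "current description"; both are assumptions for illustration and not fairsd internals.

```python
import numpy as np

# Toy numeric feature and a toy description that covers the last five rows.
age = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)
covered = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=bool)

# Static discretization: the cut point is computed once, on all rows.
static_cut = np.quantile(age, 0.5)

# Dynamic discretization: the cut point is recomputed on the rows
# covered by the description that is being expanded.
dynamic_cut = np.quantile(age[covered], 0.5)

print(static_cut, dynamic_cut)
```

The static cut point is far from the center of the covered rows, so it would not split them evenly. Recomputing the cut points adapts the bins to the data that the description actually covers, at the cost of extra computation during the search.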