# Pysubgroup

**pysubgroup** is a Python package that enables subgroup discovery in the Python+pandas (scipy stack) data analysis environment. It provides a lightweight, easy-to-use, extensible, and freely available implementation of state-of-the-art algorithms, interestingness measures, and presentation options.

As of 2018, this library is still in a prototype phase. It has, however, already been successfully employed in active application projects.


## Subgroup Discovery

Subgroup Discovery is a well-established data mining technique that allows you to identify patterns in your data.
More precisely, the goal of subgroup discovery is to identify descriptions of data subsets that show an interesting distribution with respect to a pre-specified target concept.
For example, given a dataset of patients in a hospital, we could be interested in subgroups of patients for which a certain treatment X was successful.
One example result could then be stated as:

_"While in general the operation is successful in only 60% of the cases, for the subgroup
of female patients under 50 who have also been treated with drug D, the success rate was 82%."_

Here, a variable _operation success_ is the target concept, and the identified subgroup has the interpretable description _female=True AND age<50 AND drug_D = True_. We call these single conditions (such as _female=True_) selection expressions or, for short, _selectors_.
The interesting behavior for this subgroup is that the distribution of the target concept differs significantly from the distribution in the overall general dataset.
A discovered subgroup could also be seen as a rule:
```
female=True AND age<50 AND drug_D = True ==> Operation_outcome=SUCCESS
```
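To make the notions of selector, description, and target share concrete, here is a minimal sketch in plain Python. The patient records are made up purely for illustration (they do not reproduce the 60% / 82% figures from the quote), and the helper names `covers` and `success_rate` are hypothetical, not part of pysubgroup.

```python
# Made-up toy records, purely illustrative (not real patient data).
patients = [
    {"female": True,  "age": 34, "drug_D": True,  "success": True},
    {"female": True,  "age": 45, "drug_D": True,  "success": True},
    {"female": True,  "age": 29, "drug_D": True,  "success": True},
    {"female": True,  "age": 48, "drug_D": True,  "success": False},
    {"female": True,  "age": 62, "drug_D": True,  "success": False},
    {"female": False, "age": 40, "drug_D": True,  "success": True},
    {"female": False, "age": 55, "drug_D": False, "success": False},
    {"female": True,  "age": 38, "drug_D": False, "success": True},
    {"female": False, "age": 47, "drug_D": False, "success": False},
    {"female": False, "age": 71, "drug_D": True,  "success": True},
]

# Each selector is a single condition on one record.
selectors = {
    "female=True": lambda p: p["female"],
    "age<50": lambda p: p["age"] < 50,
    "drug_D=True": lambda p: p["drug_D"],
}

def covers(description, record):
    """A description is a conjunction: every selector must hold."""
    return all(selectors[name](record) for name in description)

def success_rate(records):
    """Share of records for which the target concept is true."""
    return sum(r["success"] for r in records) / len(records)

description = ["female=True", "age<50", "drug_D=True"]
subgroup = [p for p in patients if covers(description, p)]

print("overall success rate: ", success_rate(patients))
print("subgroup success rate:", success_rate(subgroup))
```

A subgroup is interesting when the second number deviates clearly from the first, ideally while the subgroup still covers a reasonable share of the data.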
Computationally, subgroup discovery is challenging since a large number of such conjunctive subgroup descriptions has to be considered. Finding computable criteria for which subgroups are likely to be interesting to a user is also a persistent challenge.
Therefore, a lot of literature has been devoted to the topic of subgroup discovery (including some of my own work). Recent overviews on the topic include:

* Herrera, Francisco, et al. "[An overview on subgroup discovery: foundations and applications.](https://scholar.google.de/scholar?q=Herrera%2C+Franciso%2C+et+al.+%E2%80%9CAn+overview+on+subgroup+discovery%3A+foundations+and+applications.%E2%80%9D+Knowledge+and+information+systems+29.3+(2011)%3A+495-525.)" Knowledge and information systems 29.3 (2011): 495-525.
* Atzmueller, Martin. "[Subgroup discovery.](https://scholar.google.de/scholar?q=Atzmueller%2C+Martin.+%E2%80%9CSubgroup+discovery.%E2%80%9D+Wiley+Interdisciplinary+Reviews%3A+Data+Mining+and+Knowledge+Discovery+5.1+(2015)%3A+35-49.)" Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5.1 (2015): 35-49.
* And of course, my point of view on the topic is [summarized in my dissertation](https://opus.bibliothek.uni-wuerzburg.de/files/9781/Dissertation-Lemmerich.pdf).


## Prerequisites and Installation
pysubgroup is built to fit into the standard Python data analysis environment of the scipy stack.
Thus, it can be used with just pandas (including its dependencies numpy, scipy, and matplotlib) installed. Visualizations are carried out with the matplotlib library.

pysubgroup consists of pure Python code. Thus, you can simply download the code from the repository and copy it into your `site-packages` directory.
pysubgroup is also available on PyPI and can be installed with:

```
pip install pysubgroup
```

## How to use
A simple use case (here using the well-known _titanic_ data) can be created in just a few lines of code:

```python
import pysubgroup as ps

# Load the example dataset
from pysubgroup.tests.DataSets import get_titanic_data
data = get_titanic_data()

# Target concept: passengers with Survived == True
target = ps.BinaryTarget('Survived', True)
# Search space: basic selectors for all columns except the target
searchspace = ps.create_selectors(data, ignore=['Survived'])
task = ps.SubgroupDiscoveryTask(
    data,
    target,
    searchspace,
    result_set_size=5,
    depth=2,
    qf=ps.WRAccQF())  # Weighted Relative Accuracy, passed as an instance
result = ps.BeamSearch().execute(task)
```


In IPython or Jupyter, appending `?` to a quality function shows its documentation, for example for `ps.LiftQF`:

```
ps.LiftQF?

Init signature: ps.LiftQF()
Docstring:
Lift Quality Function

LiftQF is a StandardQF with a=0.
Thus it treats the difference in ratios as the quality without caring about the relative size of a subgroup.
File:           /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pysubgroup/binary_target.py
Type:           ABCMeta
```


The first line imports the _pysubgroup_ package.
The following lines load an example dataset (the popular titanic dataset).

Thereafter, we define a target, i.e., the property we are mainly interested in (_'Survived'_).
Then, we define the search space as a list of basic selectors. Descriptions are built from this search space. We can create this list manually or use a utility function.
Next, we create a SubgroupDiscoveryTask object that encapsulates what we want to find in our search.
In particular, it comprises the target, the search space, the depth of the search (the maximum number of selectors combined in a subgroup description), and the interestingness measure for candidate scoring (here, the Weighted Relative Accuracy measure).
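To get a feel for how quickly the number of candidate descriptions grows with the search depth, the following sketch counts all conjunctions of at most `depth` distinct selectors. It is only an upper-bound estimate under the simplifying assumption that any selectors may be combined. The search space size of 50 is made up for illustration, and `count_descriptions` is a hypothetical helper, not part of pysubgroup.

```python
from math import comb

def count_descriptions(n_selectors: int, depth: int) -> int:
    """Count conjunctions of 1..depth distinct selectors (order is irrelevant)."""
    return sum(comb(n_selectors, k) for k in range(1, depth + 1))

# Hypothetical search space of 50 basic selectors.
for depth in (1, 2, 3):
    print(depth, count_descriptions(50, depth))
```

With 50 selectors, depth 2 already yields 1,275 candidate descriptions and depth 3 yields 20,875, which is why the depth is usually kept small.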

The last line executes the defined task by performing a search with an algorithm, in this case beam search. The result of this algorithm execution is stored in a SubgroupDiscoveryResults object.
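For readers unfamiliar with beam search, the following is a minimal, generic sketch of the idea in plain Python: at each level, every description in the beam is extended by one selector, all candidates are scored, and only the best few are refined further. This is a simplified level-wise variant written for illustration, not pysubgroup's implementation. The toy rows, the selector labels, and the helpers `wracc` and `beam_search` are all assumptions.

```python
def wracc(rows, covered, target):
    """Weighted relative accuracy: coverage * (target share in subgroup - overall)."""
    if not covered:
        return float("-inf")  # empty subgroups are never interesting
    share_sg = sum(r[target] for r in covered) / len(covered)
    share_all = sum(r[target] for r in rows) / len(rows)
    return len(covered) / len(rows) * (share_sg - share_all)

def beam_search(rows, selectors, target, beam_width=2, depth=2):
    """Return the beam_width best (quality, description) pairs, best first."""
    def covered(description):
        return [r for r in rows if all(selectors[s](r) for s in description)]

    evaluated = {}        # description -> quality, for every candidate seen
    beam = [frozenset()]  # start with the empty description
    for _ in range(depth):
        # Extend each description in the beam by one additional selector.
        candidates = {d | {label} for d in beam
                      for label in selectors if label not in d}
        for d in candidates:
            evaluated[d] = wracc(rows, covered(d), target)
        ranked = sorted(candidates, key=lambda d: (-evaluated[d], sorted(d)))
        beam = ranked[:beam_width]  # only the best candidates are refined further
    best = sorted(evaluated, key=lambda d: (-evaluated[d], sorted(d)))[:beam_width]
    return [(evaluated[d], tuple(sorted(d))) for d in best]

# Made-up toy rows, loosely shaped like the titanic data.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": True},
    {"Sex": "female", "Pclass": 1, "Survived": True},
    {"Sex": "female", "Pclass": 2, "Survived": True},
    {"Sex": "female", "Pclass": 3, "Survived": False},
    {"Sex": "male",   "Pclass": 1, "Survived": True},
    {"Sex": "male",   "Pclass": 2, "Survived": False},
    {"Sex": "male",   "Pclass": 3, "Survived": False},
    {"Sex": "male",   "Pclass": 3, "Survived": False},
]
selectors = {
    "Sex=='female'": lambda r: r["Sex"] == "female",
    "Sex=='male'": lambda r: r["Sex"] == "male",
    "Pclass==1": lambda r: r["Pclass"] == 1,
    "Pclass==3": lambda r: r["Pclass"] == 3,
}

results = beam_search(rows, selectors, "Survived", beam_width=2, depth=2)
for quality, description in results:
    print(round(quality, 4), " AND ".join(description))
```

Because only `beam_width` descriptions survive each level, the search is fast but heuristic: a good conjunction can be missed if none of its shorter prefixes makes it into the beam.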

To print the result as a table, we can, for example, call:

```python
result.to_dataframe()
```



| | quality | subgroup | size_sg | size_dataset | positives_sg | positives_dataset | size_complement | relative_size_sg | relative_size_complement | coverage_sg | coverage_complement | target_share_sg | target_share_complement | target_share_dataset | lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.13215 | `Sex=='female'` | 56.0 | 156.0 | 40.0 | 54.0 | 100.0 | 0.358974 | 0.641026 | 0.740741 | 0.259259 | 0.714286 | 0.14 | 0.346154 | 2.063492 |
| 1 | 0.101331 | `Parch==0 AND Sex=='female'` | 41.0 | 156.0 | 30.0 | 54.0 | 115.0 | 0.262821 | 0.737179 | 0.555556 | 0.444444 | 0.731707 | 0.208696 | 0.346154 | 2.113821 |
| 2 | 0.079142 | `Sex=='female' AND SibSp: [0:1[` | 25.0 | 156.0 | 21.0 | 54.0 | 131.0 | 0.160256 | 0.839744 | 0.388889 | 0.611111 | 0.84 | 0.251908 | 0.346154 | 2.426667 |
| 3 | 0.077663 | `Cabin.isnull() AND Sex=='female'` | 43.0 | 156.0 | 27.0 | 54.0 | 113.0 | 0.275641 | 0.724359 | 0.5 | 0.5 | 0.627907 | 0.238938 | 0.346154 | 1.813953 |
| 4 | 0.071746 | `Embarked=='S' AND Sex=='female'` | 37.0 | 156.0 | 24.0 | 54.0 | 119.0 | 0.237179 | 0.762821 | 0.444444 | 0.555556 | 0.648649 | 0.252101 | 0.346154 | 1.873874 |
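The columns of the first result row (`Sex=='female'`) can be checked by hand. The sketch below assumes the common form of the standard quality function from the subgroup discovery literature, `(size_sg / size_dataset) ** a * (target_share_sg - target_share_dataset)`, where `a=1` gives Weighted Relative Accuracy and `a=0` gives the plain difference in target shares. The helper `standard_quality` is hypothetical and written only for this check; it is not pysubgroup's code.

```python
def standard_quality(size_sg, size_dataset, positives_sg, positives_dataset, a):
    """(relative subgroup size)^a * (target share in subgroup - target share overall)."""
    target_share_sg = positives_sg / size_sg
    target_share_dataset = positives_dataset / size_dataset
    return (size_sg / size_dataset) ** a * (target_share_sg - target_share_dataset)

# Counts taken from the first result row (Sex=='female').
row = dict(size_sg=56, size_dataset=156, positives_sg=40, positives_dataset=54)

wracc = standard_quality(**row, a=1)             # matches the 'quality' column
share_difference = standard_quality(**row, a=0)  # ignores the subgroup size
lift = (row["positives_sg"] / row["size_sg"]) / (
    row["positives_dataset"] / row["size_dataset"])  # matches the 'lift' column

print(round(wracc, 6), round(share_difference, 6), round(lift, 6))
```

The `a=1` value reproduces the reported quality of 0.13215, which indicates that the table was computed with Weighted Relative Accuracy.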
