# Measuring Complexity in Python with `pymfe`

`pymfe` is a Python package for meta-feature extraction; its meta-features include data complexity measures. More details on `pymfe` are available here: <https://github.com/ealcobaca/pymfe>.

We want to extract complexity measures efficiently. In this notebook we extract a few complexity measures from a generic data set, covering the main categories: linearity (linear separability), feature-based measures, class balance, and neighborhood information.

## Installation

Install `pymfe` in your environment with the command below, and make sure that all of the required packages are installed successfully.

In [None]:
%pip install -U pymfe

## Importing our Data set

In [5]:
import pandas as pd

df = pd.read_csv('seeds_dataset.csv')

## Splitting Data from Target

Split the data from its target column first. This step is necessary because `mfe.fit()` takes the data and the target as separate arguments in the next step.

In [6]:
target = df.pop('label').values
data = df.values
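
Note that `df.pop('label')` removes the column from `df` in place, so running the cell a second time raises a `KeyError`. The sketch below shows a non-mutating alternative on a small made-up frame; the column names `area` and `perimeter` and all values are illustrative assumptions, not the contents of `seeds_dataset.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data set (made-up values)
toy = pd.DataFrame({
    "area": [15.26, 14.88, 13.84, 11.23],
    "perimeter": [14.84, 14.57, 13.94, 12.63],
    "label": [1, 1, 2, 3],
})

# Non-mutating split: `toy` keeps its 'label' column, so the cell can be re-run
toy_target = toy["label"].to_numpy()
toy_data = toy.drop(columns="label").to_numpy()

# One row per sample in both arrays, and one column per remaining feature
print(toy_data.shape, toy_target.shape)
```

The separate names `toy_data` and `toy_target` keep this sketch from overwriting the `data` and `target` variables used in the rest of the notebook.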

## Measuring Complexity

### Importing `pymfe`

In [14]:
from pymfe.mfe import MFE


### A general Approach

A general approach extracts every complexity measure available for a data set, as in the following example:

In [16]:
mfe = MFE(groups=['complexity'])
mfe.fit(data, target)
ft = mfe.extract()
print("\n".join("{:50} {:30}".format(x, y) for x, y in zip(ft[0], ft[1])))

c1                                                             0.9999999999999998
c2                                                                            0.0
cls_coef                                                       0.8800452059292428
density                                                        0.9117338801549328
f1.mean                                                        0.3182269381394485
f1.sd                                                         0.21095891573659348
f1v.mean                                                      0.04198678143236121
f1v.sd                                                       0.021182150211258875
f2.mean                                                      0.002206548501187337
f2.sd                                                       0.0028772370809719634
f3.mean                                                       0.21904761904761905
f3.sd                                                         0.22916589362887552
f4.mean         

### Extracting a set of complexity data

What if we want to extract only a subset of complexity measures? For example: one feature-based measure, one linearity measure, one class balance measure, and one neighborhood measure. For this notebook, we use the following set of measures: `[C2, L2, N1, F2]`.

In [17]:
complexity_measures = ['C2', 'L2', 'N1', 'F2']

Now, let us set up our MFE object to restrict the extraction to our complexity measures.

In [20]:
mfe = MFE(groups=['complexity'], 
          features=complexity_measures,
          summary=['mean'])

Finally, we extract the desired complexity measures.

In [24]:
mfe.fit(data, target)
ft = mfe.extract()
print("\n".join("{:50} {:30}".format(x, y) for x, y in zip(ft[0], ft[1])))


c2                                                                            0.0
f2.mean                                                      0.002206548501187337
l2.mean                                                      0.014285714285714271
n1                                                            0.12380952380952381


For us, the implementation in `cbdgen` can be similar to the previous extraction, which is great, because `cbdgen` is ready for this approach without much work. The open questions are: is it more efficient than other packages? Is it better than `ECoL`?
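
One way to start answering the efficiency question is to time the extraction itself. The helper below is only a sketch: `time_call` is a hypothetical name introduced here, and the workload is a placeholder so that the block runs without `pymfe` or the data set. In the notebook, the placeholder would be replaced by a function that runs the fit-and-extract steps shown earlier.

```python
import time


def time_call(func, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def placeholder_workload():
    # Stand-in for the real work; in the notebook this body could be:
    #     mfe = MFE(groups=['complexity'], features=complexity_measures, summary=['mean'])
    #     mfe.fit(data, target)
    #     mfe.extract()
    return sum(i * i for i in range(10_000))


elapsed = time_call(placeholder_workload)
print(f"best of 5 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from other processes; timing a different package the same way, on the same data, would give a first rough comparison.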