In [1]:
from lemuras import Table

### Sample data

In [2]:
cols = ['type', 'size', 'weight', 'tel']
rows = [
    ['A', 1, 12, '+79360360193'],
    ['B', 4, 12, '84505505151'],
    ['A', 3, 10, '+31415926535'],
    ['B', 6, 14, ''],
    ['A', 4, 10, '23816326412'],
    ['A', 2, 12, 'None'],
]

df1 = Table(cols, rows, 'Sample')
df1

'type','size','weight','tel'
'A',1,12,'+79360360193'
'B',4,12,'84505505151'
'A',3,10,'+31415926535'
'B',6,14,''
'A',4,10,'23816326412'
'A',2,12,'None'


# Group by single column

You can group rows by a single column:

In [3]:
gr = df1.groupby('type')
gr

'type','counts'
'A',4
'B',2


And then aggregate the groups into a new table:

In [4]:
df2 = gr.agg({
    'size': { 'Count': 'count', 'SizeAvg': 'avg' },
    'weight': { 'WeightMedian': 'median', 'WeightSum': 'sum' }
})

df2

'type','Count','SizeAvg','WeightMedian','WeightSum'
'A',4,2.5,11.0,44
'B',2,5.0,13.0,26
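
To make the two steps concrete, here is a plain-Python sketch of grouping by one column and then aggregating each group. It does not use Lemuras and is not how the library is implemented; `group_rows` and `summary` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from statistics import mean, median

cols = ['type', 'size', 'weight', 'tel']
rows = [
    ['A', 1, 12, '+79360360193'],
    ['B', 4, 12, '84505505151'],
    ['A', 3, 10, '+31415926535'],
    ['B', 6, 14, ''],
    ['A', 4, 10, '23816326412'],
    ['A', 2, 12, 'None'],
]


# Hypothetical helper (not a Lemuras function): bucket rows by one column.
def group_rows(cols, rows, key):
    key_idx = cols.index(key)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_idx]].append(row)
    return dict(groups)


groups = group_rows(cols, rows, 'type')
size_idx, weight_idx = cols.index('size'), cols.index('weight')

# Step two: reduce every group to one row of named aggregates.
summary = {}
for key, members in groups.items():
    sizes = [r[size_idx] for r in members]
    weights = [r[weight_idx] for r in members]
    summary[key] = {
        'Count': len(sizes),
        'SizeAvg': mean(sizes),
        'WeightMedian': median(weights),
        'WeightSum': sum(weights),
    }

for key, stats in summary.items():
    print(key, stats)
```

The values in `summary` should match the aggregated table shown above for the same sample data.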


# Group by multiple columns

To group by multiple columns, pass a list of key columns:

In [5]:
df2 = df1.groupby(['type', 'weight']).agg({
    'size': {
        'Count': 'count',
        'SizeSum': 'sum'
    }
})

df2

'type','weight','Count','SizeSum'
'A',12,2,3
'A',10,2,7
'B',12,1,4
'B',14,1,6
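
Conceptually, grouping by several columns means that the group key becomes a tuple of values instead of a single value. The following plain-Python sketch (independent of Lemuras, with names chosen only for illustration) reproduces the counts and sums above. The order of the groups may differ from the order Lemuras prints.

```python
from collections import defaultdict

# (type, size, weight) triples taken from the sample data
rows = [
    ('A', 1, 12),
    ('B', 4, 12),
    ('A', 3, 10),
    ('B', 6, 14),
    ('A', 4, 10),
    ('A', 2, 12),
]

# The composite key is a tuple built from both key columns.
sizes_by_key = defaultdict(list)
for type_, size, weight in rows:
    sizes_by_key[(type_, weight)].append(size)

result = {
    key: {'Count': len(sizes), 'SizeSum': sum(sizes)}
    for key, sizes in sizes_by_key.items()
}

for key, stats in result.items():
    print(key, stats)
```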


# Group by all

You can aggregate all the rows at once by grouping by no columns. Just call `groupby()` without arguments:

In [6]:
gr = df1.groupby()
gr

'counts'
6


Then aggregate it as you wish:

In [7]:
df2 = gr.agg({ 'size': { 'Count': 'count', 'Sum': 'sum' } })

df2

'Count','Sum'
6,20


# Aggregation with your own function or lambda

The following aggregation functions are available by name, passed as strings:

- **`'count'`** - elements count, group size.

- **`'min'`** - the lowest value.

- **`'max'`** - the highest value.

- **`'sum'`** - elements sum.

- **`'avg'`**, **`'mean'`** - average value.

- **`'mode'`** - the most common value.

- **`'middle'`**, **`'median'`** - the middle value: half of the values are lower and half are higher.

- **`'first'`** - the first value of the list (it's handy when you don't need the specific element, but just an example).

- **`'last'`** - the last value of the list.


If these are not enough, you can pass your own functions or lambda expressions as aggregations:

In [8]:
def check5(lst):
    for el in lst:
        if el == 5:
            return True
    return False

df2 = df1.groupby().agg({ 'size': {
    'Count': 'count',
    'Random': 'last',
    'Something': lambda x: 2 * sum(x) - 3 * min(x),
    'Have_size_5': check5,
}})

df2

'Count','Random','Something','Have_size_5'
6,2,37,False
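
One way to picture how string names and callables can be mixed in one specification is a lookup table that resolves names to functions and passes callables through unchanged. The sketch below is an assumption about the general pattern, not Lemuras's actual code; `NAMED`, `resolve` and `aggregate` are hypothetical names, and the standard `statistics` module stands in for the library's own implementations.

```python
from statistics import mean, median, mode

# Hypothetical registry mapping the documented names to plain functions.
NAMED = {
    'count': len,
    'min': min,
    'max': max,
    'sum': sum,
    'avg': mean,
    'mean': mean,
    'mode': mode,
    'middle': median,
    'median': median,
    'first': lambda values: values[0],
    'last': lambda values: values[-1],
}


def resolve(fun):
    """Return a callable for either a known name or a user-supplied function."""
    if callable(fun):
        return fun
    return NAMED[fun]


def aggregate(values, spec):
    """Apply every entry of {result_name: name_or_callable} to the values."""
    return {name: resolve(fun)(values) for name, fun in spec.items()}


sizes = [1, 4, 3, 6, 4, 2]  # the 'size' column of the sample data
result = aggregate(sizes, {
    'Count': 'count',
    'Random': 'last',
    'Something': lambda x: 2 * sum(x) - 3 * min(x),
    'Have_size_5': lambda x: 5 in x,
})
print(result)
```

Here `lambda x: 5 in x` plays the role of `check5` from the cell above, assuming the values arrive as a plain list.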


# Default aggregation functions

Sometimes you don't know in advance all the columns of the tables your code will process, but you still need to aggregate them. In this case, you can specify default functions using the `default_fun` argument, which can be either a single string or function, or a dictionary of them. This lets you aggregate all the columns without naming them.

Here is an example that groups by one column, aggregates one specific column explicitly, and applies a single default function:

In [9]:
df2 = df1.groupby('type').agg({'size': {'size-sum': 'sum'}}, 'first')
df2

'type','size-sum','size','weight','tel'
'A',10,1,12,'+79360360193'
'B',10,4,12,'84505505151'


Another example, grouping by all and aggregating with multiple default functions:

In [10]:
df2 = df1.groupby().agg({}, {'min': 'min', 'max': 'max'})
df2

'type_min','type_max','size_min','size_max','weight_min','weight_max','tel_min','tel_max'
'A','B',1,6,10,14,'','None'


As you can see, Lemuras automatically combines the dictionary keys with the column names when using default aggregation functions.
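
The naming rule visible in the two outputs above can be sketched in plain Python: a single default keeps the column name, while a dictionary of defaults yields one `column_key` name per pair. This is inferred from the printed headers only; `default_names` is a hypothetical helper, not a Lemuras function.

```python
def default_names(columns, default_fun):
    """Guess the output column names produced by default aggregation."""
    if isinstance(default_fun, dict):
        # One output column per (column, dictionary key) pair.
        return [f'{col}_{key}' for col in columns for key in default_fun]
    # A single string or function: column names are kept unchanged.
    return list(columns)


columns = ['type', 'size', 'weight', 'tel']
print(default_names(columns, {'min': 'min', 'max': 'max'}))
print(default_names(['size', 'weight', 'tel'], 'first'))
```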

# Groups access

In addition, you can extract the group keys and counts:

In [11]:
gr = df1.groupby(['type', 'weight'])
gr.counts()

'type','weight','counts'
'A',12,2
'A',10,2
'B',12,1
'B',14,1


Get a list with all the groups:

In [12]:
for el in gr.split():
    print(el)

- Table object, title: "Group type=A weight=12", 4 columns, 2 rows.
'type' 'weight' 'size' 'tel'
'A' 12 1 '+79360360193'
'A' 12 2 'None'
- Table object, title: "Group type=A weight=10", 4 columns, 2 rows.
'type' 'weight' 'size' 'tel'
'A' 10 3 '+31415926535'
'A' 10 4 '23816326412'
- Table object, title: "Group type=B weight=12", 4 columns, 1 rows.
'type' 'weight' 'size' 'tel'
'B' 12 4 '84505505151'
- Table object, title: "Group type=B weight=14", 4 columns, 1 rows.
'type' 'weight' 'size' 'tel'
'B' 14 6 ''


Or retrieve a specific single group:

In [13]:
gr.get_group(['A', 12])

'type','weight','size','tel'
'A',12,1,'+79360360193'
'A',12,2,'None'


Both the **`.split()`** and **`.get_group()`** methods take a **`should_add_keys`** argument. Its default value is **True**; if it is set to **False**, the key columns are omitted from the result:

In [14]:
gr.get_group(['A', 10], should_add_keys=False)

'size','tel'
3,'+31415926535'
4,'23816326412'
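
The effect of `should_add_keys` can be imitated in plain Python as a row filter followed by a column selection. The sketch below mirrors the column order seen in the outputs above (key columns first, then the rest). It is an illustration under that assumption, and this standalone `get_group` function is hypothetical, not the Lemuras method.

```python
cols = ['type', 'size', 'weight', 'tel']
rows = [
    ['A', 1, 12, '+79360360193'],
    ['B', 4, 12, '84505505151'],
    ['A', 3, 10, '+31415926535'],
    ['B', 6, 14, ''],
    ['A', 4, 10, '23816326412'],
    ['A', 2, 12, 'None'],
]


def get_group(cols, rows, key_cols, key_values, should_add_keys=True):
    """Return (header, rows) for the group whose key columns equal key_values."""
    key_idx = [cols.index(c) for c in key_cols]
    rest_idx = [i for i in range(len(cols)) if i not in key_idx]
    # Key columns come first when requested, followed by the remaining columns.
    out_idx = (key_idx if should_add_keys else []) + rest_idx
    header = [cols[i] for i in out_idx]
    body = [
        [row[i] for i in out_idx]
        for row in rows
        if [row[i] for i in key_idx] == list(key_values)
    ]
    return header, body


print(get_group(cols, rows, ['type', 'weight'], ['A', 12]))
print(get_group(cols, rows, ['type', 'weight'], ['A', 10], should_add_keys=False))
```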
