### Data Aggregation 
Data aggregation is the process of grouping data and then analyzing each group, typically by applying one or more functions to it.<br>
A technique called **Split-Apply-Combine** is commonly used in data analysis:
- **Split** - the dataset is split into smaller fragments (groups) according to some criterion, for example the values of a column
- **Apply** - each fragment is processed independently by a function, for example .mean() or .std()
- **Combine** - the per-group results are combined into a single result<br>
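
The three steps can be spelled out by hand to see what `groupby` does in one expression. This is only a minimal sketch on hypothetical toy data (the column names `key` and `value` are assumptions, not part of the sensors dataset used later):

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook's dataset
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {key: part for key, part in df.groupby("key")}

# Apply: compute a statistic on each piece
means = {key: part["value"].mean() for key, part in pieces.items()}

# Combine: assemble the per-group results into one object
manual = pd.Series(means, name="value")

# groupby performs the same three steps in a single expression
automatic = df.groupby("key")["value"].mean()
```

Both `manual` and `automatic` hold one mean per key; the only difference is that the `groupby` result also carries the index name `key`.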

After splitting the data, the following operations can be applied:
- **Aggregation** - computes a summary statistic for each group, for example the mean or the number of elements
- **Transformation** - performs group-specific computations and returns a result with the same index as the input
- **Filtration** - discards whole groups based on a group-level computation that evaluates to True or False

### Grouping by 
- **df.groupby( [ column_names ] )** - groups by one or more columns. **Returns an intermediate GroupBy object that describes the grouping**; nothing is computed until a function is applied
- **df.groupby( level = [ list of num ] or [ list of labels ] )** - groups by one or more levels of a hierarchical index
- **gb.ngroups** - returns the number of groups
- **gb.groups** - returns a dict that maps each group name to the index labels of its rows
- **gb.size()** - returns the number of rows in each group (NaN included)
- **gb.count()** - returns the number of non-NaN values in each column of each group
- **gb.get_group( ' group_name ' )** - retrieves the rows of the given group
- **gb.nth( row_num )** - retrieves the row at position row_num (zero-based) from each group
- **gb.apply( lambda g: g.sort_values( ' col_name ', ascending = False ) )** - sorts the rows within each group (GroupBy objects have no **sort_values** method of their own)

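Printing a GroupBy object does not show its contents. To see the groups, iterate over it (each iteration yields the group name and the group's DataFrame), for example in a small helper function<br>
Group keys are sorted in ascending order by default; to keep the order of first appearance, pass **sort = False** to **groupby**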
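
As a rough illustration of the points above, the sketch below compares the default key ordering with `sort=False`, and `size()` with `count()`. The toy frame and its column names are assumptions made up for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing reading
df = pd.DataFrame({"sensor": ["b", "a", "b", "a"],
                   "reading": [1.0, np.nan, 3.0, 4.0]})

sorted_gb = df.groupby("sensor")                 # keys sorted ascending
unsorted_gb = df.groupby("sensor", sort=False)   # keys in order of first appearance

# Iterating over a GroupBy object yields (name, sub-frame) pairs
sorted_keys = [name for name, _ in sorted_gb]
unsorted_keys = [name for name, _ in unsorted_gb]

sizes = sorted_gb.size()     # rows per group, NaN included
counts = sorted_gb.count()   # non-NaN values per column
```

Here `size()` reports two rows for both groups, while `count()` reports only one value for group `a` because its other reading is NaN.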

### Aggregation Functions
- **gb.agg( function )** - applies the provided function to each column of each group. **agg( )** is an alias for **aggregate( )**, so the two are interchangeable<br>
- **gb.agg ( [ func_1, func_n ] )** - a list of functions can be provided; each one is applied to every column
- **gb.agg ( { 'col_name_1' : func_1, 'col_name_n': func_n } )** - applies a specific function to a specific column
- **gb.agg( new_name = pd.NamedAgg( column = ' col_name ', aggfunc = ' min ' ) )** - applies a function to a column and sets the name of the resulting column

Many built-in methods can also be used for aggregation directly (for example gb.count(), gb.median()). **See the pandas documentation for the full list**
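
Named aggregation is easiest to read in a small example. The sketch below uses a hypothetical frame (the output names `min_reading` and `max_reading` are made up for illustration) and shows that a plain `(column, aggfunc)` tuple can be used in place of `pd.NamedAgg`:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"sensor": ["accel", "accel", "orientation", "orientation"],
                   "reading": [0.5, 1.0, 0.1, 0.3]})

summary = df.groupby("sensor").agg(
    # keyword = name of the output column
    min_reading=pd.NamedAgg(column="reading", aggfunc="min"),
    # a plain (column, aggfunc) tuple is accepted as well
    max_reading=("reading", "max"),
)
```

The keyword names become the column labels of `summary`, so no renaming step is needed afterwards.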

### Transformation
The **.transform()** method applies a function to the values of **each group** and returns a result with the same index as the original DataFrame
- **gb.transform( func )** - applies a function to each column of each group. **func** can be any callable, usually a **lambda**
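
A common use of `transform` is centering values within their group. The following is a minimal sketch on hypothetical data (the `label` and `value` columns are assumptions); it also shows that a function name such as `"mean"` broadcasts the group statistic back to every row:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"label": ["A", "B", "A", "B"],
                   "value": [1.0, 10.0, 3.0, 30.0]})

# Subtract each group's mean from the members of that group
centered = df.groupby("label")["value"].transform(lambda s: s - s.mean())

# Broadcast the group mean back to the original rows
group_mean = df.groupby("label")["value"].transform("mean")
```

Unlike an aggregation, both results have one entry per original row and keep the original index, so they can be assigned back to `df` as new columns.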

### Filtration 
The **.filter( )** method decides which groups are removed from the final result.<br>
It requires a **function** that returns either **True** or **False** for each group. If the function returns **True** for a **group**, the group is kept in the final result; otherwise the group is excluded.
- **gb.filter( function )** - applies a function to each group and keeps or drops the whole group. **The function is usually a lambda**
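
A short sketch of group-level filtering, assuming a hypothetical frame with `label` and `value` columns. The predicate receives each group's sub-frame and must return a single boolean:

```python
import pandas as pd

# Hypothetical toy data: group B has a single row
df = pd.DataFrame({"label": ["A", "A", "B", "C", "C"],
                   "value": [1, 2, 3, 4, 5]})

# Keep only groups that contain at least two rows
kept = df.groupby("label").filter(lambda g: len(g) >= 2)
```

The result contains the original rows of the surviving groups with their original index, so group B's single row is dropped as a whole.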

**gb** - GroupBy object

In [3]:
path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/sensors.csv'
data = pd.read_csv(path)
data.head()

   interval sensor axis  reading
0         0  accel    Z      0.0
1         0  accel    Y      0.5
2         0  accel    X      1.0
3         1  accel    Z      0.1
4         1  accel    Y      0.4


In [82]:
# Group by a single column (a scalar key lets get_group take a plain label)
grouped_data = data.groupby('sensor')
grouped_data.get_group('accel')['reading']

0     0.0
1     0.5
2     1.0
3     0.1
4     0.4
5     0.9
6     0.2
7     0.3
8     0.8
9     0.3
10    0.2
11    0.7
Name: reading, dtype: float64

In [2]:
import pandas as pd
import numpy as np

In [29]:
path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/sensors.csv'
data = pd.read_csv(path)
data.head()

   interval sensor axis  reading
0         0  accel    Z      0.0
1         0  accel    Y      0.5
2         0  accel    X      1.0
3         1  accel    Z      0.1
4         1  accel    Y      0.4


### Group by

In [30]:
# Group By
grouped_data = data.groupby('sensor')

# Number of groups
print(grouped_data.ngroups)

# Group names and the index labels of their rows
groups = grouped_data.groups
print('\n'+str(groups))

2

{'accel': Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64'), 'orientation': Int64Index([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], dtype='int64')}


In [31]:
# Helper that prints the name and the first rows of each group
def print_groups(group_object,row_num = 5):
    for name, group in group_object:
        print(name)
        print(group[:row_num])

# See the content of each group
print_groups(grouped_data)

accel
   interval sensor axis  reading
0         0  accel    Z      0.0
1         0  accel    Y      0.5
2         0  accel    X      1.0
3         1  accel    Z      0.1
4         1  accel    Y      0.4
orientation
    interval       sensor axis  reading
12         0  orientation    Z      0.0
13         0  orientation    Y      0.1
14         0  orientation    X      0.0
15         1  orientation    Z      0.0
16         1  orientation    Y      0.2


In [32]:
# Number of rows in each group
print(grouped_data.size())

# Number of non-NaN values per column in each group
print('\n'+str(grouped_data.count()))

# Retrieve a specific group
print('\n'+str(grouped_data.get_group('orientation').head()))

# Retrieve the row at position 2 (the third row) of each group
print('\n'+str(grouped_data.nth(2)))

sensor
accel          12
orientation    12
dtype: int64

             interval  axis  reading
sensor                              
accel              12    12       12
orientation        12    12       12

    interval       sensor axis  reading
12         0  orientation    Z      0.0
13         0  orientation    Y      0.1
14         0  orientation    X      0.0
15         1  orientation    Z      0.0
16         1  orientation    Y      0.2

             interval axis  reading
sensor                             
accel               0    X      1.0
orientation         0    X      0.0


In [34]:
# Grouping by several columns
grouped_data = data.groupby(['sensor','axis'])
print_groups(grouped_data)

('accel', 'X')
    interval sensor axis  reading
2          0  accel    X      1.0
5          1  accel    X      0.9
8          2  accel    X      0.8
11         3  accel    X      0.7
('accel', 'Y')
    interval sensor axis  reading
1          0  accel    Y      0.5
4          1  accel    Y      0.4
7          2  accel    Y      0.3
10         3  accel    Y      0.2
('accel', 'Z')
   interval sensor axis  reading
0         0  accel    Z      0.0
3         1  accel    Z      0.1
6         2  accel    Z      0.2
9         3  accel    Z      0.3
('orientation', 'X')
    interval       sensor axis  reading
14         0  orientation    X      0.0
17         1  orientation    X      0.1
20         2  orientation    X      0.2
23         3  orientation    X      0.3
('orientation', 'Y')
    interval       sensor axis  reading
13         0  orientation    Y      0.1
16         1  orientation    Y      0.2
19         2  orientation    Y      0.3
22         3  orientation    Y      0.4
('orient

In [42]:
# Grouping by the levels of a hierarchical index
hierarchical = data.copy()
hierarchical = hierarchical.set_index(['sensor','axis'])
print(hierarchical.head())

grouped = hierarchical.groupby(level=['sensor','axis'])
print_groups(grouped)

             interval  reading
sensor axis                   
accel  Z            0      0.0
       Y            0      0.5
       X            0      1.0
       Z            1      0.1
       Y            1      0.4
('accel', 'X')
             interval  reading
sensor axis                   
accel  X            0      1.0
       X            1      0.9
       X            2      0.8
       X            3      0.7
('accel', 'Y')
             interval  reading
sensor axis                   
accel  Y            0      0.5
       Y            1      0.4
       Y            2      0.3
       Y            3      0.2
('accel', 'Z')
             interval  reading
sensor axis                   
accel  Z            0      0.0
       Z            1      0.1
       Z            2      0.2
       Z            3      0.3
('orientation', 'X')
                  interval  reading
sensor      axis                   
orientation X            0      0.0
            X            1      0.1
            X  

### Aggregation Functions

In [51]:
# A single aggregation function
print('\n'+str(grouped.agg('mean').head()))

# A list of aggregation functions
print('\n'+str(grouped.agg(['sum','std']).head()))

# A dictionary applies a specific function to a specific column
print('\n'+str(grouped.agg({'interval':len,'reading':'mean'})))

# Aggregation of a single column
print('\n'+str(grouped['reading'].mean()))


                  interval  reading
sensor      axis                   
accel       X          1.5     0.85
            Y          1.5     0.35
            Z          1.5     0.15
orientation X          1.5     0.15
            Y          1.5     0.25

                 interval           reading          
                      sum       std     sum       std
sensor      axis                                     
accel       X           6  1.290994     3.4  0.129099
            Y           6  1.290994     1.4  0.129099
            Z           6  1.290994     0.6  0.129099
orientation X           6  1.290994     0.6  0.129099
            Y           6  1.290994     1.0  0.129099

                  interval  reading
sensor      axis                   
accel       X            4     0.85
            Y            4     0.35
            Z            4     0.15
orientation X            4     0.15
            Y            4     0.25
            Z            4     0.00

sensor       axis
accel 

### Transformation

In [53]:
data = pd.DataFrame({'Label':list('ACBAC'),
                     'Values':np.arange(5),
                     'Values_2':np.arange(5,10),
                     'Other':['foo','bar','baz','fiz','buz']}, index= list('VWXYZ'))
data

  Label  Values  Values_2 Other
V     A       0         5   foo
W     C       1         6   bar
X     B       2         7   baz
Y     A       3         8   fiz
Z     C       4         9   buz


In [57]:
# Group the data and apply a transformation (add 10 to every value of the numeric columns)
grouped_data = data.groupby('Label')
transformed_data = grouped_data[['Values','Values_2']].transform(lambda val: val+10)
transformed_data

   Values  Values_2
V      10        15
W      11        16
X      12        17
Y      13        18
Z      14        19


In [60]:
# Fill missing values with the mean of their group
df = pd.DataFrame({'Label':list('ABABAB'),
                   'Values':[10,20,11,np.nan,12,22]})
grouped_df = df.groupby('Label')

filled_df = grouped_df.transform(lambda col: col.fillna(col.mean()))
filled_df

   Values
0    10.0
1    20.0
2    11.0
3    21.0
4    12.0
5    22.0


### Filtration

In [72]:
df = pd.DataFrame({'Label':list('AABCCC'),
                   'Values':[1,2,3,4,np.nan,8]})
print(df)

# Drop groups with fewer than two non-NaN values
f = lambda x: x.Values.count()>1
print('\n'+str(df.groupby('Label').filter(f)))

# Drop groups that contain NaN values
f = lambda x: x.Values.isnull().sum() == 0
print('\n'+str(df.groupby('Label').filter(f)))

  Label  Values
0     A     1.0
1     A     2.0
2     B     3.0
3     C     4.0
4     C     NaN
5     C     8.0

  Label  Values
0     A     1.0
1     A     2.0
3     C     4.0
4     C     NaN
5     C     8.0

  Label  Values
0     A     1.0
1     A     2.0
2     B     3.0


In [80]:
# Keep only groups whose mean differs from the mean of the group means by more than 2.0
grouped = df.groupby('Label')
group_mean = grouped['Values'].mean().mean()
f = lambda x: abs(x.Values.mean() - group_mean) > 2.0
grouped.filter(f)

  Label  Values
3     C     4.0
4     C     NaN
5     C     8.0
