# Data Aggregation:  
The last stage of data manipulation is data aggregation. By data aggregation you generally mean a transformation that produces a single value from an array.  
In fact, you have already performed many data aggregation operations, for example when you calculated the __sum()__, __mean()__, and __count()__. These functions operate on a set of data and perform a calculation whose result is a single value.  
However, a more formal approach, and one that gives more control over data aggregation, is one that includes the __categorization__ of a set.  

The __categorization__ of a set of data for grouping is often a critical stage in data analysis. It is a process of transformation: after the division into groups, you apply a function that converts or transforms the data in some way depending on the group they belong to.  
Very often the two phases, grouping and application of a function, are performed in a single step.  
For this part of data analysis too, pandas provides a very flexible and high-performance tool: __GroupBy__.  

Again, as in the case of joins, those familiar with relational databases and the SQL language will find similarities. Nevertheless, languages such as SQL are quite limited when applied to operations on groups. Given the flexibility of a programming language like Python, with all the libraries available, especially pandas, you can perform very complex operations on groups.  


### GroupBy:
Now you will analyze in detail what the __GroupBy__ process is and how it works. Its internal mechanism is generally referred to as __SPLIT-APPLY-COMBINE__. You can think of it as three phases, each expressed by one operation:  
 • splitting: division of the data set into groups.  
 • applying: application of a function to each group.  
 • combining: combination of the results obtained from the different groups.  
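
The three phases can be written out by hand to make them concrete. The following is only a minimal sketch, assuming a small sample frame with a color column and a price1 column; the names `pieces`, `applied`, and `manual` are illustrative and are not part of pandas.

```python
import pandas as pd

# Small sample frame (assumed here purely for illustration)
frame = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
})

# Splitting: one Series of price1 values per distinct color
pieces = {color: frame.loc[frame['color'] == color, 'price1']
          for color in frame['color'].unique()}

# Applying: reduce each piece to a single value (here, the mean)
applied = {color: piece.mean() for color, piece in pieces.items()}

# Combining: collect the per-group results into one Series, sorted by key
manual = pd.Series(applied, name='price1').sort_index()
manual.index.name = 'color'

# The same three phases carried out by GroupBy in a single expression
grouped = frame.groupby('color')['price1'].mean()

print(manual)
print(grouped)
```

Both printed Series should list the same mean per color; the manual version just exposes the steps that GroupBy carries out internally.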


In [516]:
frame = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                       'object': ['pen','pencil','pencil','ashtray','pen'],
                       'price1' : [5.56,4.20,1.30,0.56,2.75],
                       'price2' : [4.75,4.12,1.60,0.75,3.15]})

In [517]:
frame

   color   object  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15


Suppose you need to calculate the average of the __price1__ column using the group labels listed in the color column.  
There are several ways to do this. For example, you can access the __price1__ column and call the __groupby()__ function with the color column.

In [518]:
group = frame['price1'].groupby(frame['color'])

In [519]:
group

<pandas.core.groupby.generic.SeriesGroupBy object at 0x124a82b80>

The object you got is a __GroupBy__ object. The operation you just did not really perform any calculation; it only collected the information needed to execute the calculation later. What you have done is a grouping, in which all rows having the same value of color are gathered into a single group.

To analyze in detail how the rows of the DataFrame were divided into groups, you can use the groups attribute of the GroupBy object.

In [520]:
group.groups

{'green': [2, 4], 'red': [1, 3], 'white': [0]}

As you can see, each group is listed together with the rows of the data frame assigned to it. Now it is sufficient to apply an operation to the group to obtain the result for each individual group.  

In [521]:
group.mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [522]:
group.sum()

color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

### Hierarchical Grouping:  
You have seen how to group the data according to the values of a column used as the key. The same thing can be extended to multiple columns, i.e., you can group by multiple keys hierarchically.

In [523]:
ggroup = frame['price1'].groupby([frame['color'], frame['object']])

In [524]:
ggroup.groups

{('green', 'pen'): [4], ('green', 'pencil'): [2], ('red', 'ashtray'): [3], ('red', 'pencil'): [1], ('white', 'pen'): [0]}

In [525]:
ggroup.sum()

color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

So far you have applied the grouping to a single column of data, but it can be extended to multiple columns or to the entire data frame. Also, if you do not need to reuse the GroupBy object several times, it is convenient to combine the grouping and the calculation in a single pass, without defining any intermediate variable.

In [526]:
frame[['price1','price2']].groupby(frame['color']).mean()

       price1  price2
color
green   2.025   2.375
red     2.380   2.435
white   5.560   4.750


# Group Iteration:
The __GroupBy__ object supports iteration, generating a sequence of 2-tuples that contain the name of the group together with its portion of the data.

In [528]:
for name, group in frame.groupby('color'):
    print(name)
    print(group)

green
   color  object  price1  price2
2  green  pencil    1.30    1.60
4  green     pen    2.75    3.15
red
  color   object  price1  price2
1   red   pencil    4.20    4.12
3   red  ashtray    0.56    0.75
white
   color object  price1  price2
0  white    pen    5.56    4.75


In the example you have just seen, you only printed each group for illustration. In practice, you replace the printing operation with the function to be applied to the group.
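
As a minimal sketch of replacing the print call with a real computation, the loop below records, for each color, the object with the highest price1. The dictionary name `most_expensive` and the choice of computation are assumptions made only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# Hypothetical per-group computation: the most expensive object by price1
most_expensive = {}
for name, rows in frame.groupby('color'):
    # rows is the sub-DataFrame holding only the rows of this color
    top = rows.loc[rows['price1'].idxmax()]
    most_expensive[name] = top['object']

print(most_expensive)
# {'green': 'pen', 'red': 'pencil', 'white': 'pen'}
```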

### Chain of Transformation:
From these examples you have seen that each grouping, when subjected to a calculation or other operation, returns a Series (if a single column of data was selected) or a DataFrame, regardless of how the grouping was obtained and of the selection criteria. The result retains the index system and the names of the columns.

In [529]:
result1 = frame['price1'].groupby(frame['color']).mean()

In [530]:
type(result1)

pandas.core.series.Series

In [531]:
result1 #Series

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [532]:
result2 = frame[['price1', 'price2']].groupby(frame['color']).mean()

In [533]:
type(result2)

pandas.core.frame.DataFrame

In [534]:
result2 #DataFrame

       price1  price2
color
green   2.025   2.375
red     2.380   2.435
white   5.560   4.750


So it is possible to select a single column at any point in the various phases of this process. Here are two cases in which a single column is selected at different stages of the process. This example illustrates the great flexibility of the grouping system provided by pandas.

In [535]:
frame['price1'].groupby(frame['color']).mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [536]:
frame.groupby(frame['color'])['price1'].mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64
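
The column can also be selected at a third point, after the aggregation has been computed. The sketch below compares all three placements. It assumes the numeric columns are selected before calling mean(), so that the text column object is not averaged; the variable names are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})
numeric = ['price1', 'price2']

# 1. Select the column first, then group and aggregate
first = frame['price1'].groupby(frame['color']).mean()
# 2. Group the whole frame, select the column, then aggregate
second = frame.groupby('color')['price1'].mean()
# 3. Group and aggregate the numeric columns, then select from the result
third = frame.groupby('color')[numeric].mean()['price1']

# Each call raises an AssertionError if the two Series differ
pd.testing.assert_series_equal(first, second)
pd.testing.assert_series_equal(second, third)
```

The third form computes the mean of every selected column before discarding all but one, so on wide frames the first two forms are likely to do less work.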

In addition, after an aggregation the names of some columns may not be very meaningful. It is often useful to add a prefix to the column name that describes the type of aggregation.  
Adding a prefix, instead of completely replacing the name, is very useful for keeping track of the source data from which the aggregated values are derived.  
This is important if you apply a chain of transformations (a series of DataFrames, each generated from the previous one) in which you need to keep some reference to the source data.

In [537]:
frame

   color   object  price1  price2
0  white      pen    5.56    4.75
1    red   pencil    4.20    4.12
2  green   pencil    1.30    1.60
3    red  ashtray    0.56    0.75
4  green      pen    2.75    3.15


In [538]:
means = frame.groupby(frame['color'])[['price1','price2']].mean().add_prefix('mean_')

In [539]:
means

       mean_price1  mean_price2
color
green        2.025        2.375
red          2.380        2.435
white        5.560        4.750


### Functions of Groups:
Although many methods have not been implemented specifically for use with __GroupBy__, they work correctly on data structures such as the Series. You saw in the previous section how easy it is to get a Series from a GroupBy object by specifying the name of the column and then applying the method that makes the calculation.  
For example, you can calculate quantiles with the __quantile()__ function.

In [540]:
group = frame.groupby('color')

In [541]:
group['price1'].quantile(0.6)

color
green    2.170
red      2.744
white    5.560
Name: price1, dtype: float64

You can also define your own aggregation functions. Define the function separately and then pass it as an argument to the __agg()__ function.  
For example, you could calculate the range of values of each group.

In [542]:
# Note: this name shadows Python's built-in range() for the rest of the session
def range(series):
    return series.max() - series.min()
    

In [543]:
group['price1'].agg(range)

color
green    1.45
red      3.64
white    0.00
Name: price1, dtype: float64

The __agg()__ function allows you to use aggregate functions on an entire DataFrame.

In [544]:
group[['price1','price2']].agg(range)

       price1  price2
color
green    1.45    1.55
red      3.64    3.37
white    0.00    0.00


You can also use several aggregate functions at the same time, again with the __agg()__ function, by passing a list of the operations to be done; each operation becomes a new column.

In [545]:
group['price1'].agg(['mean','std',range])

        mean       std  range
color
green  2.025  1.025305   1.45
red    2.380  2.573869   3.64
white  5.560       NaN   0.00


In [546]:
group[['price1','price2']].agg(['mean','std',range])

      price1                  price2
        mean       std range    mean       std range
color
green  2.025  1.025305  1.45   2.375  1.096016  1.55
red    2.380  2.573869  3.64   2.435  2.382950  3.37
white  5.560       NaN  0.00   4.750       NaN  0.00


# Advanced Data Aggregation:
In this section you will be introduced to the __transform()__ and __apply()__ functions, which allow you to perform many kinds of group operations, some very complex.  
Now suppose we want to bring together in the same DataFrame the following:  
(i) The DataFrame of origin (the one containing the data).  
(ii) The DataFrame obtained by calculating a group aggregation, for example, the sum. 

In [547]:
frame = pd.DataFrame({ 'color':['white','red','green','red','green'],
                       'price1':[5.56,4.20,1.30,0.56,2.75],
                       'price2':[4.75,4.12,1.60,0.75,3.15]})

In [548]:
frame

   color  price1  price2
0  white    5.56    4.75
1    red    4.20    4.12
2  green    1.30    1.60
3    red    0.56    0.75
4  green    2.75    3.15


In [549]:
sums = frame.groupby(frame['color']).sum().add_prefix('tot_')

In [550]:
sums

       tot_price1  tot_price2
color
green        4.05        4.75
red          4.76        4.87
white        5.56        4.75


In [551]:
sums = frame.groupby(frame['color'])['price1'].sum().add_prefix('tot_')

In [552]:
sums

color
tot_green    4.05
tot_red      4.76
tot_white    5.56
Name: price1, dtype: float64

In [553]:
sums = frame.groupby(frame['color'])['price1'].sum().rename('tot_price1').reset_index()

In [554]:
sums

   color  tot_price1
0  green        4.05
1    red        4.76
2  white        5.56


In [555]:
print(sums.to_string(index=False))

color  tot_price1
green        4.05
  red        4.76
white        5.56


In [557]:
tot_price1 = frame.groupby('color')['price1'].sum().rename('tot_price1')

In [558]:
tot_price2 = frame.groupby('color')['price2'].sum().rename('tot_price2')

In [560]:
result=pd.merge(frame,tot_price1,left_on='color',right_index=True)

# Merge into result (not frame) so that the tot_price1 column is kept
result=pd.merge(result,tot_price2,left_on='color',right_index=True)

print(result)

   color  price1  price2  tot_price1  tot_price2
0  white    5.56    4.75        5.56        4.75
1    red    4.20    4.12        4.76        4.87
3    red    0.56    0.75        4.76        4.87
2  green    1.30    1.60        4.05        4.75
4  green    2.75    3.15        4.05        4.75


In [561]:
import pandas as pd

# Original DataFrame
frame = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15]
})

# Group and calculate totals
tot_price1 = frame.groupby('color')['price1'].sum().rename('tot_price1')
tot_price2 = frame.groupby('color')['price2'].sum().rename('tot_price2')

# Merge the first total
result = pd.merge(frame, tot_price1, left_on='color', right_index=True)

# Merge the second total
result = pd.merge(result, tot_price2, left_on='color', right_index=True)

print(result)


   color  price1  price2  tot_price1  tot_price2
0  white    5.56    4.75        5.56        4.75
1    red    4.20    4.12        4.76        4.87
3    red    0.56    0.75        4.76        4.87
2  green    1.30    1.60        4.05        4.75
4  green    2.75    3.15        4.05        4.75


So thanks to __merge()__, you managed to add the results of an aggregation to each row of the starting DataFrame. But there is another way to do this type of operation: using __transform()__. This function performs the aggregation as you have seen before, but at the same time it places the calculated values, based on the key value, on each row of the starting DataFrame.

In [562]:
import numpy as np

In [563]:
frame.groupby('color').transform(np.sum).add_prefix('tot_')

   tot_price1  tot_price2
0        5.56        4.75
1        4.76        4.87
2        4.05        4.75
3        4.76        4.87
4        4.05        4.75


As you can see, the __transform()__ method is a more specialized function with very specific requirements. In this usage, the function passed as an argument produces a single scalar value per group (an aggregation), which is then broadcast to every row of that group.
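
Because the output of transform() keeps the index of the original rows, it can be attached to the starting DataFrame directly, without a merge. The following is a minimal sketch, assuming the aggregation is passed by its string name 'sum' and that the numeric columns are selected explicitly; the names `totals` and `result` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['white', 'red', 'green', 'red', 'green'],
    'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
    'price2': [4.75, 4.12, 1.60, 0.75, 3.15],
})

# One output row per input row, aligned on the original index
totals = (frame.groupby('color')[['price1', 'price2']]
               .transform('sum')
               .add_prefix('tot_'))

# The shared index lets join() place each total next to its source row
result = frame.join(totals)
print(result)
```

Unlike the merge-based version, the rows stay in their original order, since nothing is re-sorted by key.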

The more general GroupBy method is __apply()__. This method applies the split-apply-combine scheme in its entirety: it divides the object into pieces, invokes the passed function on each piece, and then tries to concatenate the pieces back together. 

In [564]:
frame = pd.DataFrame( { 'color':['white','black','white','white','black','black'],
                        'status':['up','up','down','down','down','up'],
                        'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                        'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})

In [565]:
frame

   color status  value1  value2
0  white     up   12.33   11.23
1  black     up   14.55   31.80
2  white   down   22.34   29.99
3  white   down   27.84   31.18
4  black   down   23.40   18.25
5  black     up   18.33   22.44


In [566]:
frame.groupby(['color','status']).apply(lambda x:x.max())

              color status  value1  value2
color status
black down    black   down   23.40   18.25
      up      black     up   18.33   31.80
white down    white   down   27.84   31.18
      up      white     up   12.33   11.23
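
The function passed to apply() does not have to reduce each piece to one value per column; it can also return a DataFrame, and the pieces are then concatenated. The sketch below keeps, for each color, the row with the largest value1. The choice of computation and the explicit column selection before apply() are assumptions made for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['white', 'black', 'white', 'white', 'black', 'black'],
    'status': ['up', 'up', 'down', 'down', 'down', 'up'],
    'value1': [12.33, 14.55, 22.34, 27.84, 23.40, 18.33],
    'value2': [11.23, 31.80, 29.99, 31.18, 18.25, 22.44],
})

# Select the non-key columns so each piece contains only the data to process
pieces = frame.groupby('color')[['status', 'value1', 'value2']]

# Each piece is a DataFrame; nlargest(1, ...) returns a one-row DataFrame
top = pieces.apply(lambda piece: piece.nlargest(1, 'value1'))
print(top)
```

The result is indexed by the group key together with the original row label, which shows the combining phase stitching the per-group pieces back together.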


In [567]:
frame

   color status  value1  value2
0  white     up   12.33   11.23
1  black     up   14.55   31.80
2  white   down   22.34   29.99
3  white   down   27.84   31.18
4  black   down   23.40   18.25
5  black     up   18.33   22.44


In [568]:
reindex = {0: 'first', 1:'second', 2:'third', 3:'fourth',4:'fifth',5:'sixth'}

In [569]:
frame.rename(index=reindex)

        color status  value1  value2
first   white     up   12.33   11.23
second  black     up   14.55   31.80
third   white   down   22.34   29.99
fourth  white   down   27.84   31.18
fifth   black   down   23.40   18.25
sixth   black     up   18.33   22.44


In [570]:
temp = pd.date_range('12/06/2024', periods=10, freq='H')

In [571]:
temp

DatetimeIndex(['2024-12-06 00:00:00', '2024-12-06 01:00:00',
               '2024-12-06 02:00:00', '2024-12-06 03:00:00',
               '2024-12-06 04:00:00', '2024-12-06 05:00:00',
               '2024-12-06 06:00:00', '2024-12-06 07:00:00',
               '2024-12-06 08:00:00', '2024-12-06 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [572]:
timeseries = pd.Series(np.random.rand(10), index=temp)

In [573]:
timeseries

2024-12-06 00:00:00    0.937479
2024-12-06 01:00:00    0.286185
2024-12-06 02:00:00    0.933360
2024-12-06 03:00:00    0.406129
2024-12-06 04:00:00    0.216975
2024-12-06 05:00:00    0.430628
2024-12-06 06:00:00    0.849103
2024-12-06 07:00:00    0.333514
2024-12-06 08:00:00    0.107547
2024-12-06 09:00:00    0.719963
Freq: H, dtype: float64

In [574]:
timetable = pd.DataFrame( {'date': temp, 'value1' : np.random.rand(10),
                           'value2' : np.random.rand(10)})

In [575]:
timetable

                 date    value1    value2
0 2024-12-06 00:00:00  0.074599  0.507956
1 2024-12-06 01:00:00  0.924956  0.047119
2 2024-12-06 02:00:00  0.695631  0.273885
3 2024-12-06 03:00:00  0.334812  0.003628
4 2024-12-06 04:00:00  0.179927  0.698406
5 2024-12-06 05:00:00  0.854949  0.022060
6 2024-12-06 06:00:00  0.064615  0.830684
7 2024-12-06 07:00:00  0.208654  0.990387
8 2024-12-06 08:00:00  0.086680  0.089906
9 2024-12-06 09:00:00  0.267486  0.038139


In [576]:
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']

In [577]:
timetable

                 date    value1    value2    cat
0 2024-12-06 00:00:00  0.074599  0.507956     up
1 2024-12-06 01:00:00  0.924956  0.047119   down
2 2024-12-06 02:00:00  0.695631  0.273885   left
3 2024-12-06 03:00:00  0.334812  0.003628   left
4 2024-12-06 04:00:00  0.179927  0.698406     up
5 2024-12-06 05:00:00  0.854949  0.022060     up
6 2024-12-06 06:00:00  0.064615  0.830684   down
7 2024-12-06 07:00:00  0.208654  0.990387  right
8 2024-12-06 08:00:00  0.086680  0.089906  right
9 2024-12-06 09:00:00  0.267486  0.038139     up


The example shown above, however, has duplicate key values ...

We add to the preceding DataFrame a column containing a set of text values that we will use as key values.

# Conclusions:
In this chapter you saw the three basic parts into which data manipulation is divided: preparation, processing, and data aggregation. Thanks to a series of examples, you got to know a set of pandas library functions that perform these operations.  
You saw how to apply these functions to simple data structures so that you can become familiar with how they work and understand their applicability to more complex cases.  
You now have knowledge of all the tools necessary to prepare a data set for the next phase of data analysis: __data visualisation__.  
In the next chapter, you will be presented with the Python library __Matplotlib__, which can convert data structures into any kind of chart.

Great ...  
Good Bye 