<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Aggregation-and-Grouping" data-toc-modified-id="Data-Aggregation-and-Grouping-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Aggregation and Grouping</a></span><ul class="toc-item"><li><span><a href="#Aggregation" data-toc-modified-id="Aggregation-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Aggregation</a></span></li><li><span><a href="#Grouping" data-toc-modified-id="Grouping-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Grouping</a></span><ul class="toc-item"><li><span><a href="#Iterating-GroupBy-object" data-toc-modified-id="Iterating-GroupBy-object-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Iterating GroupBy object</a></span></li><li><span><a href="#groups" data-toc-modified-id="groups-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span><code>groups</code></a></span></li><li><span><a href="#size" data-toc-modified-id="size-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span><code>size</code></a></span></li><li><span><a href="#get_group()" data-toc-modified-id="get_group()-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span><code>get_group()</code></a></span></li><li><span><a href="#Applying-aggregations" data-toc-modified-id="Applying-aggregations-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Applying aggregations</a></span></li><li><span><a href="#Multiple-Aggregations" data-toc-modified-id="Multiple-Aggregations-1.2.6"><span class="toc-item-num">1.2.6&nbsp;&nbsp;</span>Multiple Aggregations</a></span></li><li><span><a href="#Grouping-by-Multiple-Variables" data-toc-modified-id="Grouping-by-Multiple-Variables-1.2.7"><span class="toc-item-num">1.2.7&nbsp;&nbsp;</span>Grouping by Multiple Variables</a></span></li></ul></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Summary</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a 
href="#💡-Check-for-understanding" data-toc-modified-id="💡-Check-for-understanding-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>💡 Check for understanding</a></span></li></ul></li></ul></li></ul></div>

# Data Aggregation and Grouping 

Data aggregation and grouping are fundamental operations in data analysis. Grouping splits a dataset into subsets according to one or more variables, and aggregation reduces each subset to a single summary value such as a count, mean, sum, or standard deviation.

In [1]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
df = pd.read_csv(url)

# Display the first few rows of the DataFrame
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Aggregation

Aggregation in pandas involves applying a function to a dataset, transforming multiple values into a single value. pandas provides various aggregation functions, including:

- `mean()`: Compute the arithmetic mean.
- `sum()`: Compute the sum of values.
- `min()`: Compute the minimum value.
- `max()`: Compute the maximum value.
- `count()`: Count the number of non-null values.
- `std()`: Compute the standard deviation.

For instance, to find the mean value of a numerical column, use the `mean()` function:

In [2]:
df['Fare'].mean()

32.204207968574636
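
The aggregation functions listed above skip missing values by default. The following is a minimal sketch on a small made-up Series (the values are illustrative and not taken from the Titanic data) that applies each of them and contrasts `count()` with the `size` attribute of a Series:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): five fares, one of them missing.
fares = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0], name="Fare")

summary = {
    "mean": fares.mean(),    # NaN is skipped: (10 + 20 + 30 + 40) / 4
    "sum": fares.sum(),      # NaN is skipped
    "min": fares.min(),
    "max": fares.max(),
    "count": fares.count(),  # number of non-null values
    "size": fares.size,      # number of values, NaN included
}
print(summary)
```

Here `count` and `size` differ by one because of the single missing value.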

## Grouping

Grouping is the process of splitting the data into groups based on certain criteria. The `groupby()` function is used for this purpose.

In [3]:
grupo = df.groupby('Sex')

grupo

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x137bafa30>

In [7]:
display(df.head())

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


### Iterating GroupBy object

The `groupby()` function in pandas returns a `GroupBy` object that can be iterated over. Each iteration provides a tuple where the first item is the group identifier and the second item is the data in that group as a DataFrame.

In [9]:
for nombre, gro in grupo:
    
    display('Nombre', nombre)
    display('Grupo', gro)

'Nombre'

'female'

'Grupo'

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


'Nombre'

'male'

'Grupo'

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [10]:
df.Sex.unique()

array(['male', 'female'], dtype=object)

In [12]:
df.Sex.nunique() == len(grupo)

True

### `groups`
The `groups` attribute of a pandas `GroupBy` object is a dictionary. The keys of this dictionary are the computed unique groups and the corresponding values are the axis labels belonging to each group.

In [13]:
grupo.groups

{'female': [1, 2, 3, 8, 9, 10, 11, 14, 15, 18, 19, 22, 24, 25, 28, 31, 32, 38, 39, 40, 41, 43, 44, 47, 49, 52, 53, 56, 58, 61, 66, 68, 71, 79, 82, 84, 85, 88, 98, 100, 106, 109, 111, 113, 114, 119, 123, 128, 132, 133, 136, 140, 141, 142, 147, 151, 156, 161, 166, 167, 172, 177, 180, 184, 186, 190, 192, 194, 195, 198, 199, 205, 208, 211, 215, 216, 218, 229, 230, 233, 235, 237, 240, 241, 246, 247, 251, 254, 255, 256, 257, 258, 259, 264, 268, 269, 272, 274, 275, 276, ...], 'male': [0, 4, 5, 6, 7, 12, 13, 16, 17, 20, 21, 23, 26, 27, 29, 30, 33, 34, 35, 36, 37, 42, 45, 46, 48, 50, 51, 54, 55, 57, 59, 60, 62, 63, 64, 65, 67, 69, 70, 72, 73, 74, 75, 76, 77, 78, 80, 81, 83, 86, 87, 89, 90, 91, 92, 93, 94, 95, 96, 97, 99, 101, 102, 103, 104, 105, 107, 108, 110, 112, 115, 116, 117, 118, 120, 121, 122, 124, 125, 126, 127, 129, 130, 131, 134, 135, 137, 138, 139, 143, 144, 145, 146, 148, 149, 150, 152, 153, 154, 155, ...]}
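
To make the structure of `groups` easier to see, here is a minimal sketch on a five-row toy DataFrame (made-up values, not the Titanic data). It converts the index labels of each group to plain lists and then uses them with `.loc` to recover the rows of one group:

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female", "male", "female"],
     "Fare": [7.25, 71.28, 7.92, 8.05, 53.10]},
)
grouped = toy.groupby("Sex")

# Keys are the group names; values are the index labels of the rows in each group.
index_labels = {name: list(labels) for name, labels in grouped.groups.items()}
print(index_labels)

# The labels can be passed to .loc to recover the rows of a group.
female_rows = toy.loc[grouped.groups["female"]]
print(female_rows)
```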

### `size`

The `size()` method of a pandas `GroupBy` object returns a Series with the number of rows in each group. It resembles `count()`, but `size()` counts every row, including rows with `NaN` values, whereas `count()` reports the number of non-null values in each column.



In [15]:
grupo.size()

Sex
female    314
male      577
dtype: int64
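
The difference between `size()` and `count()` only shows up when a group contains missing values. A minimal sketch, assuming a toy DataFrame with made-up ages (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): some ages are missing.
toy = pd.DataFrame(
    {"Sex": ["female", "female", "male", "male", "male"],
     "Age": [38.0, np.nan, 22.0, np.nan, np.nan]},
)
grouped = toy.groupby("Sex")

group_sizes = grouped.size()          # rows per group, NaN included
age_counts = grouped["Age"].count()   # non-null ages per group
print(group_sizes)
print(age_counts)
```

Each group has only one known age, so `count()` reports 1 for both, while `size()` reports 2 and 3 rows.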

### `get_group()`

The `get_group` method is used to select a single group from a `GroupBy` object as a DataFrame. You provide the name of the group you want to select as an argument.

In [17]:
grupo.get_group('female').head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
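
Selecting a group with `get_group()` picks out the same rows as filtering the original DataFrame with a boolean mask. A minimal sketch on made-up data (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame(
    {"Sex": ["male", "female", "female", "male"],
     "Fare": [7.25, 71.28, 7.92, 8.05]},
)
grouped = toy.groupby("Sex")

via_get_group = grouped.get_group("female")
via_mask = toy[toy["Sex"] == "female"]

# Both approaches select rows with the same index labels and the same fares.
same_rows = via_get_group.index.equals(via_mask.index)
print(same_rows)
```

`get_group()` is convenient when the `GroupBy` object already exists; for a one-off selection the boolean mask is usually enough.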


### Applying aggregations

Once the data is split into groups, you need a way to represent each group in the resulting output. That's where aggregation functions come in.

For example, if you group a DataFrame by a categorical variable (like 'City'), you'll end up with a separate group for each unique city in your data. But how do you want to represent each city in your result? Do you want the mean of another variable (like 'Sales') for each city? The sum? The maximum? This is what the aggregation function determines.

When you apply an aggregation function after a `groupby()`, pandas applies that function to each group separately and then combines the results into a new DataFrame (or a Series, when a single column is selected).

In [20]:
df.groupby('Sex').mean(numeric_only=True)

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
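
The split, apply, and combine steps can also be written out by hand, which may help clarify what the aggregation does. The sketch below uses the hypothetical 'City' and 'Sales' columns from the example in the text, with made-up values, and compares the manual result with the built-in one:

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame(
    {"City": ["Lima", "Quito", "Lima", "Quito", "Lima"],
     "Sales": [100, 80, 120, 60, 140]},
)

# Split: iterate over (group name, sub-DataFrame) pairs.
# Apply: reduce each sub-DataFrame to one number.
# Combine: collect the per-group results into a Series.
manual = pd.Series(
    {city: rows["Sales"].mean() for city, rows in toy.groupby("City")},
    name="Sales",
)

builtin = toy.groupby("City")["Sales"].mean()
print(manual)
print(builtin)
```

In practice the built-in form is preferable; the manual loop is only meant to show what happens conceptually.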


In [21]:
df.groupby('Sex')['Age'].mean(numeric_only=True)

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

In [22]:
df.groupby('Sex')[['Age', 'Fare']].mean(numeric_only=True)

Sex,Age,Fare
female,27.915709,44.479818
male,30.726645,25.523893


In [25]:
df.groupby('Sex')[['Age', 'Fare']].agg(['mean', 'std', 'count'])

,Age,Age,Age,Fare,Fare,Fare
Sex,mean,std,count,mean,std,count
female,27.915709,14.110146,261,44.479818,57.997698,314
male,30.726645,14.678201,453,25.523893,43.138263,577


In [28]:
df.groupby('Sex').agg({'Age': 'mean', 'Fare': 'sum',  'Embarked': 'first'})

Sex,Age,Fare,Embarked
female,27.915709,13966.6628,C
male,30.726645,14727.2865,S


In [29]:
import numpy as np

np.sum

<function numpy.sum(a, axis=None, dtype=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)>

In [30]:
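# Any callable works as an aggregation; newer pandas releases may warn when given np.sum and suggest the string 'sum' instead
df.groupby('Sex').agg({'Age': 'mean', 'Fare': np.sum,  'Embarked': 'first'})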

Sex,Age,Fare,Embarked
female,27.915709,13966.6628,C
male,30.726645,14727.2865,S


In [32]:
def producto(lst):
    # Takes an iterable of numbers and returns their product
    res = 1
    for e in lst:
        res *= e
    return res

In [33]:
producto([1,2,3,4])

24

In [34]:
df.groupby('Sex').agg({'Age': 'mean', 'Fare': producto,  'Embarked': 'first'})

Sex,Age,Fare,Embarked
female,27.915709,inf,C
male,30.726645,0.0,S
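
The `inf` and `0.0` in the `Fare` column above are most likely floating-point effects rather than a bug in `producto`: multiplying hundreds of fares exceeds the largest representable float, and a single fare of 0 makes the whole product 0. A minimal sketch with made-up numbers, using pandas' built-in `'prod'` aggregation in place of the custom function:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration): one group overflows, the other contains a zero.
toy = pd.DataFrame(
    {"group": ["big", "big", "has_zero", "has_zero"],
     "value": [1e200, 1e200, 0.0, 50.0]},
)

products = toy.groupby("group")["value"].agg("prod")
print(products)

overflowed = np.isinf(products["big"])      # 1e200 * 1e200 is too large for a float64
zeroed = products["has_zero"] == 0.0        # any zero factor makes the product zero
print(overflowed, zeroed)
```

A product is therefore rarely a useful summary for a column like `Fare`; it serves here only to show that any function mapping a group to a single value can be used.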


In [39]:
def moda(serie):
    # Returns the most frequent value; ties resolve to the first value in sorted order
    return serie.mode()[0]

In [40]:
moda(df.Embarked)

'S'

In [41]:
df.groupby('Sex').agg({'Age': 'mean', 'Fare': 'sum',  'Embarked': moda})

Sex,Age,Fare,Embarked
female,27.915709,13966.6628,S
male,30.726645,14727.2865,S
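
Note that `serie.mode()[0]` assumes every group has at least one non-null value; for a group made up entirely of `NaN`, `mode()` returns an empty Series and the lookup fails. One possible way to guard against this is sketched below. The helper name `safe_mode` and the toy data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def safe_mode(series):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = series.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else np.nan

# Toy data (an assumption for illustration): the 'male' group has no known port.
toy = pd.DataFrame(
    {"Sex": ["female", "female", "female", "male", "male"],
     "Embarked": ["S", "C", "S", np.nan, np.nan]},
)
result = toy.groupby("Sex")["Embarked"].agg(safe_mode)
print(result)
```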


### Multiple Aggregations

You can perform multiple aggregations at once using the `agg()` function. 

For example, let's find the mean and standard deviation of every numeric column, grouped by sex:

In [42]:
# Keep only numeric columns so that 'mean' and 'std' are well defined
df.select_dtypes('number').groupby(df['Sex']).agg(['mean', 'std'])


,PassengerId,PassengerId,Survived,Survived,Pclass,Pclass,Age,Age,SibSp,SibSp,Parch,Parch,Fare,Fare
Sex,mean,std,mean,std,mean,std,mean,std,mean,std,mean,std,mean,std
female,431.028662,256.846324,0.742038,0.438211,2.159236,0.85729,27.915709,14.110146,0.694268,1.15652,0.649682,1.022846,44.479818,57.997698
male,454.147314,257.486139,0.188908,0.391775,2.389948,0.81358,30.726645,14.678201,0.429809,1.061811,0.235702,0.612294,25.523893,43.138263
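
When only a few statistics are needed, pandas' named aggregation gives each output column an explicit name and avoids the two-level column header. A minimal sketch on made-up data; the output column names (`age_count`, `age_mean`, and so on) are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame(
    {"Sex": ["female", "female", "male", "male", "male"],
     "Age": [38.0, 26.0, 22.0, np.nan, 35.0],
     "Fare": [71.28, 7.92, 7.25, 8.46, 8.05]},
)

# Named aggregation: output_column=(input_column, function).
stats = toy.groupby("Sex").agg(
    age_count=("Age", "count"),
    age_mean=("Age", "mean"),
    age_std=("Age", "std"),
    fare_max=("Fare", "max"),
)
print(stats)
```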


### Grouping by Multiple Variables

You can group by multiple variables by passing a list to the `groupby()` function.

For example, let's find the mean of every numeric column, including age, grouped by sex and by whether the passenger survived:

In [43]:
df.groupby(['Sex', 'Survived']).mean(numeric_only=True)


Sex,Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
female,0,434.851852,2.851852,25.046875,1.209877,1.037037,23.024385
female,1,429.699571,1.918455,28.847716,0.515021,0.515021,51.938573
male,0,449.121795,2.476496,31.618056,0.440171,0.207265,21.960993
male,1,475.724771,2.018349,27.276022,0.385321,0.357798,40.821484


In [45]:
df.groupby(['Sex', 'Survived']).agg({'Age': 'mean', 'Fare': 'min', 'Pclass': moda})

Sex,Survived,Age,Fare,Pclass
female,0,25.046875,6.75,3
female,1,28.847716,7.225,1
male,0,31.618056,0.0,3
male,1,27.276022,0.0,3
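
Grouping by several variables produces a result with a multi-level index. Two common ways to reshape it are sketched below on made-up data: `reset_index()` turns the index levels back into ordinary columns, and `unstack()` moves one level into the columns.

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame(
    {"Sex": ["female", "female", "male", "male", "male", "female"],
     "Survived": [1, 0, 0, 1, 0, 1],
     "Age": [30.0, 20.0, 40.0, 10.0, 50.0, 40.0]},
)

# Series indexed by (Sex, Survived)
mean_age = toy.groupby(["Sex", "Survived"])["Age"].mean()

flat = mean_age.reset_index()         # index levels become ordinary columns
wide = mean_age.unstack("Survived")   # one row per Sex, one column per Survived value
print(flat)
print(wide)
```

The flat form is convenient for further filtering or merging, while the wide form is easier to read as a small cross-tabulation.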


In [46]:
df.groupby('Pclass').mean(numeric_only=True)


Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555


In [52]:
df.groupby('Pclass').agg({'Survived': 'mean', 'Fare': 'mean', 'PassengerId': 'count'})

Pclass,Survived,Fare,PassengerId
1,0.62963,84.154687,216
2,0.472826,20.662183,184
3,0.242363,13.67555,491


In [55]:
df.groupby('Embarked')[['Survived', 'Age', 'Fare']].mean()

Embarked,Survived,Age,Fare
C,0.553571,30.814769,59.954144
Q,0.38961,28.089286,13.27603
S,0.336957,29.445397,27.079812


In [57]:
df[['Survived', 'Age', 'Fare']].head()

,Survived,Age,Fare
0,0,22.0,7.25
1,1,38.0,71.2833
2,1,26.0,7.925
3,1,35.0,53.1
4,0,35.0,8.05


In [58]:
df.groupby('Embarked')['Fare'].agg(['min', 'mean', 'max'])

Embarked,min,mean,max
C,4.0125,59.954144,512.3292
Q,6.75,13.27603,90.0
S,0.0,27.079812,263.0


In [63]:
df[df.Embarked=='S']['Fare'].max()

263.0

In [64]:
df.groupby('Embarked')[['Fare', 'Pclass']].agg(['min', 'mean', 'max'])

,Fare,Fare,Fare,Pclass,Pclass,Pclass
Embarked,min,mean,max,min,mean,max
C,4.0125,59.954144,512.3292,1,1.886905,3
Q,6.75,13.27603,90.0,1,2.909091,3
S,0.0,27.079812,263.0,1,2.350932,3


In [65]:
df.groupby(['Embarked','Pclass'])['Fare'].agg(['min', 'mean', 'max'])

Embarked,Pclass,min,mean,max
C,1,26.55,104.718529,512.3292
C,2,12.0,25.358335,41.5792
C,3,4.0125,11.214083,22.3583
Q,1,90.0,90.0,90.0
Q,2,12.35,12.35,12.35
Q,3,6.75,11.183393,29.125
S,1,0.0,70.364862,263.0
S,2,0.0,20.327439,73.5
S,3,0.0,14.644083,69.55


In [66]:
df.groupby('Embarked')[['Fare', 'Pclass']].agg({'Fare': ['min', 'mean', 'max'], 'Pclass': moda})

,Fare,Fare,Fare,Pclass
Embarked,min,mean,max,moda
C,4.0125,59.954144,512.3292,1
Q,6.75,13.27603,90.0,3
S,0.0,27.079812,263.0,3


In [68]:
df[(df.Embarked=='S') & (df.Fare==263)]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S


In [69]:
df[df.Embarked=='S'].shape

(644, 12)

In [71]:
df[df.Embarked=='S'].groupby('Pclass')['Embarked'].count()

Pclass
1    127
2    164
3    353
Name: Embarked, dtype: int64

In [72]:
df[df.Fare==0]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
179,180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S
263,264,0,1,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
277,278,0,2,"Parkes, Mr. Francis ""Frank""",male,,0,0,239853,0.0,,S
302,303,0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0,,S
466,467,0,2,"Campbell, Mr. William",male,,0,0,239853,0.0,,S
481,482,0,2,"Frost, Mr. Anthony Wood ""Archie""",male,,0,0,239854,0.0,,S
597,598,0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0,,S


In [74]:
df[df.Fare==0].groupby('Pclass')['Embarked'].count()

Pclass
1    5
2    6
3    4
Name: Embarked, dtype: int64

In [79]:
df[df.Fare==0].groupby(['Pclass', 'Survived']).count()

Pclass,Survived,PassengerId,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,5,5,5,3,5,5,5,5,3,5
2,0,6,6,6,0,6,6,6,6,0,6
3,0,3,3,3,3,3,3,3,3,0,3
3,1,1,1,1,1,1,1,1,1,0,1


# Summary

- Aggregation involves applying a function to a dataset that reduces multiple values into a single value. Common aggregation functions in pandas include `mean()`, `sum()`, `min()`, `max()`, `count()`, and `std()`.
- Grouping in pandas is done using the `groupby()` function, which splits data into groups based on certain criteria. The grouped data can then be aggregated separately.
    - A `GroupBy` object can be iterated over, with each iteration yielding a tuple where the first item is the group identifier, and the second item is the data in that group as a DataFrame.
    - The `groups` attribute of a `GroupBy` object is a dictionary where the keys are the computed unique groups, and the corresponding values are the axis labels belonging to each group.
    - The `size()` method of a `GroupBy` object returns a Series giving the number of rows in each group. Unlike `count()`, `size()` includes rows with `NaN` values.
    - The `get_group()` method of a `GroupBy` object allows for the selection of a single group as a DataFrame.
- After a `groupby()` operation, an aggregation function is necessary to represent each group in the resulting output.
- Multiple aggregations can be performed at once using the `agg()` function.
- Grouping can be done by multiple variables by passing a list to the `groupby()` function. This can be helpful when you want to analyze your data at different levels of granularity.