### Objectives
You will be able to:

- Use groupby methods to aggregate different groups in a DataFrame

### Using .groupby()
During the Exploratory Data Analysis phase, one of the most common tasks you'll want to do is split the dataset into subgroups and compare them to see if you can notice any trends. For instance, you may want to group the passengers together by gender or age. You can do this by using the .groupby() method built into pandas DataFrames.

Consider the Titanic DataFrame as an example. First load the data; then, to group passengers by gender, call .groupby() with the name of the column:

In [2]:
import pandas as pd
df = pd.read_csv('data/titanic.csv', index_col=0)
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [4]:
# Group by column name
df.groupby('Sex')
# Equivalent: pass the column itself as the grouping key
df.groupby(df['Sex'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001AA2DD29D60>

Note that this alone will not display any data; it only returns a DataFrameGroupBy object. Although you have split the dataset into groups, you don't have a meaningful way to display information until you chain an **aggregation function** onto the groupby. This allows you to compute summary statistics!
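
To make the "split first, aggregate later" idea concrete, here is a minimal sketch of what the grouped object holds before any aggregation is chained. The `toy` DataFrame and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic table
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
    "Fare": [7.25, 71.25, 8.0, 8.0, 53.0],
})

grouped = toy.groupby("Sex")

# No statistics have been computed yet; the object only records the split
print(type(grouped).__name__)
print(grouped.ngroups)   # number of distinct groups
print(grouped.groups)    # maps each group label to its row labels

# Pull out the rows of a single group as an ordinary DataFrame
females = grouped.get_group("female")
print(females)
```

Inspecting `.groups` or calling `.get_group()` is a handy sanity check that the split is what you expect before you compute anything on it.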

You can quickly use an aggregation function by chaining the call to the end of the <font color='red'>.groupby() </font> method.

In [15]:
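# Pclass does not appear in the summed table: it is stored as text because some rows contain '?'
# df['Pclass'].astype('int32')  # fails because of the '?' values
# numeric_only=True restricts the sum to numeric columns (recent pandas versions no longer drop text columns automatically)
df.groupby('Sex').sum(numeric_only=True)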

| Sex | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| female | 135343 | 233 | 7286.0 | 218 | 204 | 13966.6628 |
| male | 262043 | 109 | 13919.17 | 248 | 136 | 14727.2865 |


Aggregation functions help you quickly compare subsets of your data. For example, the Survived totals above (233 for female passengers versus 109 for male passengers) show that there were more female survivors overall than male survivors.
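
The same comparison can be sketched on a small, self-contained frame. The `toy` data below is hypothetical, and `numeric_only=True` is used on the assumption that you are running a recent pandas version, where text columns are not dropped automatically.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Name": ["A", "B", "C", "D", "E"],
    "Survived": [0, 1, 1, 1, 0],
    "Fare": [7.25, 71.25, 8.0, 8.0, 53.0],
})

# numeric_only=True skips text columns such as Name
totals = toy.groupby("Sex").sum(numeric_only=True)
print(totals)

# Compare the number of survivors in each group
survivors = totals["Survived"]
print(survivors["female"] > survivors["male"])
```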

### Aggregation functions
There are many built-in aggregate methods provided for you in the pandas package, and you can even write and apply your own. Some of the most common aggregate methods you may want to use are:

- <font color='red'>.min()</font>: returns the minimum value for each column by group
- <font color='red'>.max()</font>: returns the maximum value for each column by group
- <font color='red'>.mean()</font>: returns the average value for each column by group
- <font color='red'>.median()</font>: returns the median value for each column by group
- <font color='red'>.count()</font>: returns the number of non-missing values in each column by group

You can also see a list of all of the built-in aggregation methods by creating a grouped object and then using tab completion to inspect the available methods:

In [18]:
grouped_df = df.groupby('Sex')
# Type `grouped_df.` and press Tab to list the available methods, then pick one:
grouped_df.min()


| Sex | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare |
|---|---|---|---|---|---|---|---|---|---|
| female | 2 | 0 | 1 | Abbott, Mrs. Stanton (Rosa Hunt) | 0.75 | 0 | 0 | 110152 | 6.75 |
| male | 1 | 0 | 1 | Abbing, Mr. Anthony | 0.42 | 0 | 0 | 110413 | 0.0 |
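
Beyond calling one built-in aggregate at a time, the `.agg()` method accepts several aggregates at once as well as your own function. The sketch below assumes a made-up `toy` frame, and `value_range` is a hypothetical helper written for this example rather than part of pandas.

```python
import pandas as pd

# Hypothetical toy data; one Age value is deliberately missing
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Fare": [7.25, 71.25, 8.0, 8.0, 53.0],
})

def value_range(series):
    """Hypothetical custom aggregate: largest value minus smallest value."""
    return series.max() - series.min()

grouped = toy.groupby("Sex")

# Several built-in aggregates at once, named by string
summary = grouped["Fare"].agg(["min", "max", "mean"])
print(summary)

# A user-defined function is called once per group
fare_range = grouped["Fare"].agg(value_range)
print(fare_range)

# count() ignores missing values, so the female Age count is 2, not 3
age_counts = grouped["Age"].count()
print(age_counts)
```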


### Multiple groups
You can also split data into multiple levels of groups by passing in a list containing the name of every column you want to group by -- for instance, by every combination of both Sex and Pclass.

In [19]:
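# numeric_only=True averages only the numeric columns (text columns such as Name cannot be averaged)
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)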

| Sex | Pclass | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| female | 1 | 467.965909 | 0.965909 | 34.531646 | 0.568182 | 0.488636 | 104.824574 |
| female | 2 | 426.1 | 0.914286 | 27.757353 | 0.442857 | 0.557143 | 21.49256 |
| female | 3 | 395.59854 | 0.50365 | 21.892857 | 0.839416 | 0.751825 | 15.626431 |
| female | ? | 533.578947 | 0.789474 | 32.8125 | 1.157895 | 1.0 | 57.726316 |
| male | 1 | 448.353982 | 0.345133 | 41.025474 | 0.318584 | 0.274336 | 69.105863 |
| male | 2 | 440.872549 | 0.156863 | 30.982234 | 0.343137 | 0.22549 | 20.048897 |
| male | 3 | 454.966867 | 0.138554 | 26.437942 | 0.51506 | 0.231928 | 12.658833 |
| male | ? | 512.033333 | 0.266667 | 32.619048 | 0.2 | 0.166667 | 22.353467 |
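
Grouping by more than one column produces a result indexed by every combination of the keys. As a rough sketch on made-up data (the `toy` frame is hypothetical), this is how you might look up one combination and flatten the keys back into columns:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 1, 1, 0],
})

by_sex_class = toy.groupby(["Sex", "Pclass"]).mean(numeric_only=True)

# The group keys become the levels of a MultiIndex
print(list(by_sex_class.index.names))
print(by_sex_class)

# Look up one combination with a tuple of labels
print(by_sex_class.loc[("female", 1), "Survived"])

# reset_index() turns the group keys back into ordinary columns
flat = by_sex_class.reset_index()
print(flat)
```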


### Selecting information from grouped objects
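You can also select just the columns you're interested in. Selecting a column from the grouped object before aggregating returns a Series for that column; alternatively, because an aggregation on the full grouped object returns a DataFrame, you can select columns from that result.

The example below demonstrates the syntax for returning the mean of the Survived column for every combination of Sex and Pclass: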

In [22]:
df.groupby(['Sex','Pclass'])['Survived'].mean()

Sex     Pclass
female  1         0.965909
        2         0.914286
        3         0.503650
        ?         0.789474
male    1         0.345133
        2         0.156863
        3         0.138554
        ?         0.266667
Name: Survived, dtype: float64
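
The two routes to a single column, selecting before or after aggregating, should give the same numbers. Here is a minimal sketch on hypothetical `toy` data that checks this and also shows how a list of column names keeps the result as a DataFrame:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Fare": [7.25, 71.25, 8.0, 30.0, 53.0, 8.0],
})

# Select the column first, then aggregate: returns a Series
before = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

# Aggregate everything, then pick the column from the resulting DataFrame
after = toy.groupby(["Sex", "Pclass"]).mean(numeric_only=True)["Survived"]

print(before.equals(after))

# A list of column names keeps the result as a DataFrame
subset = toy.groupby(["Sex", "Pclass"])[["Survived", "Fare"]].mean()
print(subset)
```

Selecting first avoids computing aggregates for columns you never look at, which can matter on wide tables.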

The above example slices by column, but you can also slice by index. Take a look:

In [29]:
grouped = df.groupby(['Sex','Pclass'])['Survived'].mean()
print(grouped['female'])
print(grouped['female'][1])

Pclass
1    0.965909
2    0.914286
3    0.503650
?    0.789474
Name: Survived, dtype: float64
0.9142857142857143


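Note that you need to provide only the value female as the index, and are returned all the groups where the passenger is female, regardless of the Pclass value. The second example, grouped['female'][1], returns the value for female passengers with a 2nd-class ticket (0.914286), not a 1st-class ticket. The Pclass labels in this dataset are strings (the column contains '?'), so the integer 1 was treated as a position, and position 1 is the second entry. To select first class by label, use the string '1'; to select by position deliberately, use .iloc. Recent pandas versions may warn about or reject this integer fallback.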
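
The difference between selecting by label and selecting by position can be sketched as follows. The `toy` frame is hypothetical; its Pclass column is stored as text to mirror the situation described above.

```python
import pandas as pd

# Hypothetical toy data; Pclass is text because of the '?' placeholder
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "female", "male", "male"],
    "Pclass": ["1", "1", "2", "2", "?", "1", "2"],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
})

grouped = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
female = grouped["female"]
print(female)

# Label-based: the string "1" selects first class
first_class = female.loc["1"]

# Position-based: position 1 is the second entry, which is class "2"
second_entry = female.iloc[1]

# Both levels at once, using a tuple of labels
same_first_class = grouped.loc[("female", "1")]

print(first_class, second_entry, same_first_class)
```

Using `.loc` and `.iloc` explicitly removes any ambiguity about whether a key is meant as a label or a position.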