# Groupby

The groupby method lets you group rows that share a value in a column (e.g. ID) and then apply aggregate functions, such as sum or mean, to the values in each group.
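As a quick illustration of the idea before the main example, here is a minimal sketch on a toy frame. The column names `ID` and `Value` and the data are made up for this illustration only.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'ID': ['a', 'b', 'a', 'b'],
                    'Value': [1, 2, 3, 4]})

# Group the rows by ID, then aggregate the Value column within each group
totals = toy.groupby('ID')['Value'].sum()
averages = toy.groupby('ID')['Value'].mean()

print(totals)
print(averages)
```

Rows with the same `ID` are collected into one group, and each aggregate returns one value per group.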

In [2]:
import pandas as pd
# Build the data as a dict that maps each column name to its list of values
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

In [8]:
df = pd.DataFrame(data, index=['R1','R2','R3','R4','R5','R6'])
df.index.name = 'Index'
df

Index,Company,Person,Sales
R1,GOOG,Sam,200
R2,GOOG,Charlie,120
R3,MSFT,Amy,340
R4,MSFT,Vanessa,124
R5,FB,Carl,243
R6,FB,Sarah,350


**Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This creates a DataFrameGroupBy object:**

In [10]:
grp_by_co = df.groupby("Company")
grp_by_co

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c5ac438>

You can then call aggregate methods on that object:

In [12]:
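grp_by_co.mean(numeric_only=True)  # Returns a df of companies with their mean sales. numeric_only=True restricts the mean to numeric columns; older pandas versions dropped non-numeric columns such as Person silently, while pandas 2.0+ raises a TypeError without it.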

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0


In [13]:
df.groupby('Company').mean(numeric_only=True)    # The same result in a single chained call

Company,Sales
FB,296.5
GOOG,160.0
MSFT,232.0
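An alternative that does not depend on how pandas treats non-numeric columns is to select the column of interest before aggregating. A minimal sketch, rebuilding the same data so that it runs on its own (the names `sales_by_co` and `mean_sales` are just illustrative):

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Selecting 'Sales' on the grouped object gives a SeriesGroupBy,
# so the Person column never takes part in the aggregation
sales_by_co = df.groupby('Company')['Sales']
mean_sales = sales_by_co.mean()

print(mean_sales)
```

Here the result is a Series indexed by Company rather than a one-column DataFrame.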


More examples of aggregate methods:

In [17]:
df.groupby('Company').std(numeric_only=True)  # Sample standard deviation of sales per company

Company,Sales
FB,75.660426
GOOG,56.568542
MSFT,152.735065


In [18]:
df.groupby('Company').std(numeric_only=True).loc['FB']   # The aggregate result is a DataFrame, so DataFrame methods work on it. Here .loc retrieves the row labelled 'FB'.

Sales    75.660426
Name: FB, dtype: float64
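Several aggregates can also be computed in one call with `agg()`. The sketch below rebuilds the same data so that it is self-contained; the variable names `summary` and `fb_mean` are only illustrative.

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# agg() with a list of function names returns a DataFrame
# with one column per aggregate and one row per company
summary = df.groupby('Company')['Sales'].agg(['mean', 'std'])

# Row and column labels can be combined in a single .loc lookup
fb_mean = summary.loc['FB', 'mean']

print(summary)
print(fb_mean)
```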

In [30]:
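df.groupby('Company').min()  # min() works on both numeric and string columns: it returns the lowest Sales value for each company and the Person name that sorts first alphabetically. Each column is reduced independently, so the values in a row need not come from the same original row (for MSFT, Amy's sales are 340, not 124).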

Company,Person,Sales
FB,Carl,243
GOOG,Charlie,120
MSFT,Amy,124


In [31]:
# Python compares strings by the numeric Unicode code points of their characters
'a' > 'b'

False

In [27]:
'a' < 'b'

True

In [29]:
'a' == 'A'

False

In [33]:
ord('a')

97

In [35]:
ord('b')

98

In [36]:
ord('A')

65
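One consequence of comparing by code point is that uppercase letters sort before lowercase ones. A small sketch with a hypothetical list of names (not taken from the DataFrame above):

```python
# Hypothetical names with mixed capitalisation
names = ['carl', 'Sarah', 'Charlie']

# 'C' (67) and 'S' (83) both come before 'c' (99),
# so the capitalised names sort ahead of 'carl'
smallest = min(names)

# Passing key=str.lower compares the names case-insensitively instead
smallest_ci = min(names, key=str.lower)

print(smallest)     # Charlie
print(smallest_ci)  # carl
```

The same ordering applies when min() or max() is called on a string column after a groupby.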

In [41]:
df

,Company,Person,Sales
0,GOOG,Sam,200
1,GOOG,Charlie,120
2,MSFT,Amy,340
3,MSFT,Vanessa,124
4,FB,Carl,243
5,FB,Sarah,350


In [39]:
df.groupby('Company').max()

Company,Person,Sales
FB,Sarah,350
GOOG,Sam,200
MSFT,Vanessa,340


In [42]:
grp_by_co.count()  # Returns the number of non-null values in each column for each company.

Company,Person,Sales
FB,2,2
GOOG,2,2
MSFT,2,2
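Because count() skips missing values, it can differ from the number of rows in a group, which size() reports. A minimal sketch on a hypothetical frame containing one missing Sales value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Sales value for FB
toy = pd.DataFrame({'Company': ['FB', 'FB', 'GOOG'],
                    'Sales': [243, np.nan, 200]})

counts = toy.groupby('Company').count()  # non-null values per column
sizes = toy.groupby('Company').size()    # rows per group, NaN included

print(counts)
print(sizes)
```

In the example data above the two agree, since no values are missing.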


In [43]:
grp_by_co.describe()  # Returns summary statistics for the numeric columns of each group

,Sales,Sales,Sales,Sales,Sales,Sales,Sales,Sales
Company,count,mean,std,min,25%,50%,75%,max
FB,2.0,296.5,75.660426,243.0,269.75,296.5,323.25,350.0
GOOG,2.0,160.0,56.568542,120.0,140.0,160.0,180.0,200.0
MSFT,2.0,232.0,152.735065,124.0,178.0,232.0,286.0,340.0


In [47]:
desc_data = grp_by_co.describe().transpose()
desc_data

,Company,FB,GOOG,MSFT
Sales,count,2.0,2.0,2.0
Sales,mean,296.5,160.0,232.0
Sales,std,75.660426,56.568542,152.735065
Sales,min,243.0,120.0,124.0
Sales,25%,269.75,140.0,178.0
Sales,50%,296.5,160.0,232.0
Sales,75%,323.25,180.0,286.0
Sales,max,350.0,200.0,340.0


In [49]:
desc_data['GOOG']  # Select a single column from the DataFrame above

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64

In [55]:
desc_data.transpose().loc['GOOG']   # Transpose back so that companies are rows, then select the 'GOOG' row

Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64
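The same row can be read directly from the describe() result without transposing, and a single statistic can be picked out with a (column, statistic) tuple. A sketch that rebuilds the data so it runs on its own; selecting `[['Sales']]` is a choice made here to keep only the numeric column:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# Companies are the rows; columns are (column name, statistic) pairs
desc = df.groupby('Company')[['Sales']].describe()

# All statistics for one company
goog_stats = desc.loc['GOOG']

# One statistic for one company
goog_mean = desc.loc['GOOG', ('Sales', 'mean')]

print(goog_stats)
print(goog_mean)
```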

# Great Job!