# GroupBy

`groupby` splits the data into groups by category and applies a function to each group.<br>
The result is reported per group present in the data set. Pandas has built-in methods for rolling the data up into such groups.

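### Split Data into Groups:
A pandas object can be split into groups in several ways:
- `obj.groupby('key')`: group rows by one column
- `obj.groupby(['key1','key2'])`: group rows by a combination of columns
- `obj.groupby(key, axis=1)`: group columns instead of rows (the `axis` argument is deprecated in recent pandas releases)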

### Iterating through Groups:
With the groupby object in hand, we can iterate over it. Each iteration yields a pair of the group key and the rows belonging to that group, much like `itertools.groupby`.

### Select a Group:
Using the get_group() method, we can select a single group.<br>
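
As a minimal sketch of `get_group()`, the cell below uses a small made-up frame (`teams_df` and its values are hypothetical, not the data set used later in this notebook). The selected group keeps the index labels of the original rows.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating get_group()
teams_df = pd.DataFrame({
    'Team': ['Riders', 'Devils', 'Riders', 'Kings'],
    'Points': [876, 863, 789, 741],
})

grouped_teams = teams_df.groupby('Team')

# Select the rows of a single group by its key
riders = grouped_teams.get_group('Riders')
print(riders)
```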

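### Applying Functions to Groups:
Once the data is split, we can apply one of the following operations to each group:
- Aggregation: computing a summary statistic for each group
- Transformation: performing a group-specific operation that returns a result of the same size as the input
- Filtering: discarding whole groups that do not meet a condition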
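
The three kinds of operation listed above can be sketched with the `agg`, `transform` and `filter` methods of a groupby object. The frame `scores_df` and its values are made up for illustration, and the conditions chosen (deviation from the team mean, teams with at least two rows) are only examples.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
scores_df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
    'Points': [876, 789, 863, 673, 741],
})
grouped_scores = scores_df.groupby('Team')

# Aggregation: one summary value per group
mean_points = grouped_scores['Points'].agg('mean')

# Transformation: a result aligned with the original rows
# (here, each row's deviation from its team's mean)
centered = grouped_scores['Points'].transform(lambda s: s - s.mean())

# Filtering: keep only the rows of groups that satisfy a condition
# (here, teams that appear in at least two rows)
multi_row_teams = grouped_scores.filter(lambda g: len(g) >= 2)

print(mean_points)
print(centered)
print(multi_row_teams)
```

Aggregation shrinks the result to one row per team, transformation keeps one value per original row, and filtering returns a subset of the original rows.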


In [None]:
# create a DataFrame object and perform all the operations
import pandas as pd

# note: 'kings' (lowercase) is a different string from 'Kings',
# so groupby treats it as a separate group
ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

print(df)

In [None]:
# groupby returns a DataFrameGroupBy object; printing it shows the object, not the grouped data
print(df.groupby('Team'))

In [None]:
# the groups property maps each group key to the index labels of its rows
print(df.groupby('Team').groups)

In [None]:
# groupby with multiple columns: each key is a (Team, Year) tuple
print(df.groupby(['Team', 'Year']).groups)

In [None]:
# groupby year
grouped = df.groupby('Year')
# iterating through groups: key is the year, val is the DataFrame of rows for that year
for key, val in grouped:
    print('key:', key)
    print('val:', val)

### Transform function

A groupby transform is performed with the __DataFrameGroupBy.transform()__ function.<br>
It applies the given function to each group and returns a result with the same index as the original object, so every input row gets a value.

In [None]:
import pandas as pd
df = pd.read_excel("../datasets/sales_transactions.xlsx")
# this is a small dataset with 12 records
df.head(20)

In [None]:
# sum ext price for each order
df.groupby('order')["ext price"].sum()

In [None]:
# what if we want to add the order total as a column on every row? => transform()

In [None]:
df.groupby('order')["ext price"].transform('sum')


This returns a data set of a different size than an ordinary groupby aggregation. Instead of showing only the totals for the 3 orders, the result has one value per row of the original data set, and each row carries the total of its order. That is the distinguishing feature of transform.
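
The size difference can be checked on a small self-contained frame. This is only a sketch: `orders_demo` and its values are made up and stand in for the sales data, with the same `order` and `ext price` column names.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; the values are invented
orders_demo = pd.DataFrame({
    'order': [10001, 10001, 10005, 10005, 10005],
    'ext price': [100.0, 50.0, 20.0, 30.0, 50.0],
})

# Aggregation: one row per order
per_order = orders_demo.groupby('order')['ext price'].sum()

# Transformation: one value per original row, each row holding its order's total
per_row = orders_demo.groupby('order')['ext price'].transform('sum')

print(len(per_order), len(per_row))
print(per_row)
```

Because `per_row` shares the index of `orders_demo`, it can be assigned directly as a new column.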

In [None]:
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
df.head(20)

## Lagged (or forward) values

Use the `shift` function to get lagged (or forward) values. This is needed quite often, for example to calculate a percentage change over time.

Let's calculate the percentage change in state tax revenue. A percentage change is easier to compare across states than a change in dollars, because states differ greatly in size.


In [None]:
# forward slashes work on Windows as well as on macOS and Linux
df = pd.read_excel('../datasets/State Tax and GSP.xlsx', sheet_name='Sheet1')
df.head()

In [None]:
# sort by state and year so that shift(1) within a state picks up the previous year
df.sort_values(['StateId', 'Year'], inplace=True)

In [None]:
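# previous year's TAX within each state (missing for the first year of each state)
df['TAX_lag'] = df.groupby('StateId')['TAX'].shift(1)
# percentage change relative to the previous year
df['TAX_change'] = (df['TAX'] - df['TAX_lag']) / df['TAX_lag']
df[['Year', 'StateId', 'TAX', 'TAX_lag', 'TAX_change']].head()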
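
The same steps can be tried without the Excel file. The sketch below uses a made-up frame (`tax_demo`, with invented values) and also shows a forward value via `shift(-1)`. As an alternative to the manual formula, `pct_change()` on the grouped column should give the same ratio when the data has no gaps.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tax data set
tax_demo = pd.DataFrame({
    'StateId': [1, 1, 1, 2, 2, 2],
    'Year': [2000, 2001, 2002, 2000, 2001, 2002],
    'TAX': [100.0, 110.0, 121.0, 200.0, 250.0, 200.0],
}).sort_values(['StateId', 'Year'])

tax_by_state = tax_demo.groupby('StateId')['TAX']

tax_lag = tax_by_state.shift(1)      # previous year's value within each state
tax_lead = tax_by_state.shift(-1)    # next year's value within each state
tax_change = (tax_demo['TAX'] - tax_lag) / tax_lag

# One-step alternative: percentage change within each state
tax_change_alt = tax_by_state.pct_change()

result = tax_demo.assign(TAX_lag=tax_lag, TAX_lead=tax_lead,
                         TAX_change=tax_change, TAX_change_alt=tax_change_alt)
print(result)
```

The first year of each state has no lag and the last year has no lead, so those cells are missing rather than borrowed from the neighbouring state.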