## How to think about group operations
- Iterating over groups
- Selecting a column or subset of columns
- Grouping with dictionaries and series
- Grouping with functions
- Grouping by Index levels


## Data Aggregation
- Column-wise and Multiple Function Application
- Returning aggregated Data without Row Indexes

## Apply: General split-apply-combine
- Suppressing the Group keys
- Quantile and Bucket analysis
- Filling missing and group specific values
- Random sampling and permutation
- Group Weighted Average and Correlation
- Group-wise linear regression

## Group Transformations and 'Unwrapped' GroupBys

## Pivot tables and Cross-tabulation
- Cross-tabulations: crosstab

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({'key1': ['a', 'a', None, 'b', 'c'],
                   'key2': pd.Series([1, 2, 3, None, 5],
                                     dtype='Int64'),
                   'data1': np.random.standard_normal(5),
                   'data2': np.random.standard_normal(5)})

In [4]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,-0.949232,-0.859027
1,a,2.0,0.90237,0.987569
2,,3.0,0.805328,-0.89225
3,b,,0.269996,0.291597
4,c,5.0,-1.341779,-0.229023


In [5]:
# group data1 by the labels in key1 (the mean is computed a few cells below)
grouped = df['data1'].groupby(df['key1'])

In [6]:
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000018F93327910>

In [7]:
means = df['data1'].groupby([df['key1'], df["key2"]]).mean()

In [8]:
means

key1  key2
a     1      -0.949232
      2       0.902370
c     5      -1.341779
Name: data1, dtype: float64

In [9]:
means.unstack()

key2,1,2,5
key1,,,
a,-0.949232,0.90237,
c,,,-1.341779


In [12]:
# group keys can be any arrays or lists of the right length, not only Series
states = np.array(['PB', 'DL', 'UK', 'HP', 'UP',])
years = [2001, 2002, 2005, 2001, 2009,]

df['data1'].groupby([states, years]).mean()

DL  2002    0.902370
HP  2001    0.269996
PB  2001   -0.949232
UK  2005    0.805328
UP  2009   -1.341779
Name: data1, dtype: float64

In [14]:
grouped.mean()

key1
a   -0.023431
b    0.269996
c   -1.341779
Name: data1, dtype: float64
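
The cells above rely on the split-apply-combine idea: split the values by key, apply a function to each piece, and combine the results. A minimal sketch of doing those three steps by hand, on a made-up `toy` frame (not the random data used in this notebook), next to the `groupby` one-liner:

```python
import pandas as pd

# Hypothetical toy data, chosen so the means are easy to check by eye
toy = pd.DataFrame({"key": ["a", "a", "b", "c"],
                    "value": [1.0, 3.0, 10.0, 20.0]})

# split: one sub-Series of values per distinct key
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}
# apply: reduce each piece to a scalar
reduced = {k: piece.mean() for k, piece in pieces.items()}
# combine: glue the scalars back into one labelled Series
by_hand = pd.Series(reduced).sort_index()

# the same computation with groupby
with_groupby = toy.groupby("key")["value"].mean()

print(by_hand)
print(with_groupby)
```

Both Series should hold the same key-to-mean mapping; `groupby` simply does the bookkeeping for you.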

In [15]:
means2 = df['data1'].groupby([df['key1'], df['key2']]).mean()

means2

key1  key2
a     1      -0.949232
      2       0.902370
c     5      -1.341779
Name: data1, dtype: float64

In [17]:
means2.unstack()

key2,1,2,5
key1,,,
a,-0.949232,0.90237,
c,,,-1.341779


In [18]:
means2.unstack(0)

key1,a,c
key2,,
1,-0.949232,
2,0.90237,
5,,-1.341779


In [19]:
df.groupby('key1').mean()

key1,key2,data1,data2
a,1.5,-0.023431,0.064271
b,,0.269996,0.291597
c,5.0,-1.341779,-0.229023


In [25]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,-0.949232,-0.859027
1,a,2.0,0.90237,0.987569
2,,3.0,0.805328,-0.89225
3,b,,0.269996,0.291597
4,c,5.0,-1.341779,-0.229023


In [26]:
df.groupby(['key1', 'key2']).mean()

key1,key2,data1,data2
a,1,-0.949232,-0.859027
a,2,0.90237,0.987569
c,5,-1.341779,-0.229023


In [28]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     1       1
      2       1
c     5       1
dtype: int64

In [29]:
df.groupby('key1', dropna= False).size()

key1
a      2
b      1
c      1
NaN    1
dtype: int64

In [30]:
df.groupby(['key1', 'key2'], dropna=False).size()

key1  key2
a     1       1
      2       1
b     <NA>    1
c     5       1
NaN   3       1
dtype: int64

In [31]:
df.groupby('key1').count()

key1,key2,data1,data2
a,2,2,2
b,0,1,1
c,1,1,1
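
`size()` and `count()` are easy to mix up: `size()` counts rows per group, while `count()` counts non-null values per column. A small sketch on a hypothetical `toy` frame, also showing what `dropna=False` changes:

```python
import pandas as pd

# Hypothetical data: one missing key and one missing value
toy = pd.DataFrame({"key": ["a", "a", None, "b"],
                    "x": [1.0, None, 3.0, 4.0]})

# rows per group; the row with a missing key is dropped by default
sizes = toy.groupby("key").size()
# keep the missing key as its own group
sizes_with_na = toy.groupby("key", dropna=False).size()
# non-null x values per group
counts = toy.groupby("key")["x"].count()

print(sizes)
print(sizes_with_na)
print(counts)
```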


### Iterating over groups

In [32]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1  key2     data1     data2
0    a     1 -0.949232 -0.859027
1    a     2  0.902370  0.987569
b
  key1  key2     data1     data2
3    b  <NA>  0.269996  0.291597
c
  key1  key2     data1     data2
4    c     5 -1.341779 -0.229023


In [33]:
# in case of multiple key elements
for (k1, k2), group in df.groupby(['key1','key2']):
    print((k1, k2))
    print(group)

('a', 1)
  key1  key2     data1     data2
0    a     1 -0.949232 -0.859027
('a', 2)
  key1  key2    data1     data2
1    a     2  0.90237  0.987569
('c', 5)
  key1  key2     data1     data2
4    c     5 -1.341779 -0.229023


#### Collecting the groups into a dictionary of pieces

In [35]:

pieces = {name : group for name, group in df.groupby('key1')}
pieces['b']

Unnamed: 0,key1,key2,data1,data2
3,b,,0.269996,0.291597


In [42]:
pieces['c']

Unnamed: 0,key1,key2,data1,data2
4,c,5,-1.341779,-0.229023
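
If only one group is needed, building the whole dictionary is optional. A short sketch, assuming a hypothetical `toy` frame, that compares the dictionary approach with the `get_group` method of the GroupBy object:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
grouped_toy = toy.groupby("key")

# all pieces at once, as in the cell above
pieces_toy = {name: group for name, group in grouped_toy}

# a single piece, looked up by its group label
piece_a = grouped_toy.get_group("a")
print(piece_a)
```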


In [43]:
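# the dict maps column names to group names, so group along the columns;
# without axis='columns' the mapping is applied to the row labels 0-4 and matches nothing
grouped = df.groupby({'key1': 'key', 'key2': 'key',
                      'data1': 'data', 'data2': 'data'}, axis='columns')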

In [49]:
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018F9795F9A0>

In [48]:
for group_key, group_values in grouped:
    print(group_key)
    print(group_values)

### Selecting a Column or Subset of Columns

In [51]:
df.groupby('key1')['data1']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000018F978AF040>

In [52]:
df.groupby('key1')[['data2']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018F97822580>

In [54]:
# the indexing above is shorthand for:
df['data1'].groupby(df['key1'])
df['data2'].groupby(df['key1'])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000018F978EE970>

In [56]:
df.groupby(['key1', 'key2'])[['data2']].mean()

key1,key2,data2
a,1,-0.859027
a,2,0.987569
c,5,-0.229023


In [58]:
s_grouped = df.groupby(['key1', 'key2'])['data2']

In [60]:
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000018F97770520>

In [62]:
s_grouped.mean()

key1  key2
a     1      -0.859027
      2       0.987569
c     5      -0.229023
Name: data2, dtype: float64

### Grouping with Dictionaries and Series

In [63]:
people = pd.DataFrame(np.random.standard_normal((5,5)),
                     columns=['a', 'b', 'c', 'd','e'],
                     index = ['Kunal', 'Rahul', 'Sachin', 'Sorav', 'Andrew'])

In [64]:
# add a few NA values
people.iloc[2:3, [1, 2]] = np.nan

In [66]:
people

Unnamed: 0,a,b,c,d,e
Kunal,0.168348,-1.118059,-0.800172,-0.866632,-0.197982
Rahul,0.200227,0.640015,1.108393,-0.08585,-0.888133
Sachin,0.847146,,,0.463212,-0.077719
Sorav,1.440603,0.431537,-0.670088,-0.121215,-1.211302
Andrew,-0.63742,0.237133,-0.426589,-0.631543,0.098273


In [67]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
          'd': 'blue', 'e': 'red', 'f': 'orange'}

In [69]:
by_column = people.groupby(mapping, axis= 'columns')
by_column.sum()

Unnamed: 0,blue,red
Kunal,-1.666804,-1.147693
Rahul,1.022543,-0.047891
Sachin,0.463212,0.769427
Sorav,-0.791303,0.660838
Andrew,-1.058132,-0.302014


In [71]:
# same functionality holds for Series

map_series = pd.Series(mapping)

map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [72]:
people.groupby(map_series, axis= 'columns').count()

Unnamed: 0,blue,red
Kunal,2,3
Rahul,2,3
Sachin,1,2
Sorav,2,3
Andrew,2,3
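
A note on `axis='columns'`: to my knowledge, recent pandas releases (2.1 and later) deprecate the `axis` argument of `groupby`. A sketch of an equivalent that avoids it by transposing, grouping the rows, and transposing back; `toy_people` is a hypothetical stand-in for `people` with easy-to-check numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `people`: values 0..9 in two rows
toy_people = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                          columns=["a", "b", "c", "d", "e"],
                          index=["Kunal", "Rahul"])

mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}  # unused key 'f' is fine

# transpose so the columns become rows, group those rows, transpose back
by_color = toy_people.T.groupby(mapping).sum().T
print(by_color)
```

The unused `'f'` entry in the mapping is simply ignored, as in the original example.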


### Grouping with Functions

In [73]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
5,1.809178,-0.046507,-0.361866,-1.073697,-2.297416
6,0.209726,0.237133,-0.426589,-0.168331,0.020554


In [75]:
# mixing functions with arrays, dictionaries, or Series

key_list = ['one', 'one', 'two', 'one', 'two']

people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
5,one,0.168348,-1.118059,-0.800172,-0.866632,-1.211302
6,two,-0.63742,0.237133,-0.426589,-0.631543,-0.077719
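
When a function is passed as the group key, pandas calls it once per index label and uses the return values as group names. A tiny sketch with a hypothetical Series, so the grouping by name length can be checked by hand:

```python
import pandas as pd

# Hypothetical data: the index labels are what `len` is applied to
scores = pd.Series([1, 2, 3], index=["Kunal", "Rahul", "Andrew"])

# len("Kunal") == len("Rahul") == 5, len("Andrew") == 6
by_name_length = scores.groupby(len).sum()
print(by_name_length)
```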


### Grouping by Index levels

In [79]:
levels = pd.MultiIndex.from_arrays([['IN', 'US', 'CAN'],
                                   [1,2,3]],
                                   names=['fdk', 'goleala'])

In [83]:
hier_df = pd.DataFrame(np.random.standard_normal((4,3)),
                      columns = levels)

hier_df

fdk,IN,US,CAN
goleala,1,2,3
0,-0.973483,0.175499,-1.375942
1,-0.417278,-0.593448,-0.097291
2,-1.864057,-1.157927,0.2198
3,-0.062763,0.796581,-0.905844


In [84]:
# to group by an index level, pass the level keyword

In [88]:
hier_df.groupby(level = 'fdk', axis = 'columns').count()

fdk,CAN,IN,US
0,1,1,1
1,1,1,1
2,1,1,1
3,1,1,1
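
The `level` keyword works the same way when the hierarchy sits on the rows, which avoids the `axis` argument altogether. A minimal sketch with a hypothetical two-level row index (the level names `country` and `n` are made up):

```python
import pandas as pd

# Hypothetical MultiIndex on the rows
idx = pd.MultiIndex.from_arrays([["IN", "IN", "US"], [1, 2, 1]],
                                names=["country", "n"])
values = pd.Series([1.0, 2.0, 3.0], index=idx)

# aggregate over everything that shares the same 'country' label
per_country = values.groupby(level="country").sum()
print(per_country)
```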


## Data Aggregation
- an aggregation is any transformation that reduces an array to a scalar value
- [optimised methods](https://learning.oreilly.com/library/view/python-for-data/9781098104023/ch10.html#table_opt_groupby_methods)

In [89]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,-0.949232,-0.859027
1,a,2.0,0.90237,0.987569
2,,3.0,0.805328,-0.89225
3,b,,0.269996,0.291597
4,c,5.0,-1.341779,-0.229023


In [90]:
grouped = df.groupby('key1')

In [91]:
grouped['data1'].nsmallest(2)

key1   
a     0   -0.949232
      1    0.902370
b     3    0.269996
c     4   -1.341779
Name: data1, dtype: float64

In [92]:
# aggregation function
def peak_to_peak(arr):
    return arr.max() - arr.min()

In [94]:
grouped.agg(peak_to_peak)

key1,key2,data1,data2
a,1.0,1.851602,1.846596
b,,0.0,0.0
c,0.0,0.0,0.0


In [96]:
grouped.describe()

,key2,key2,key2,key2,key2,key2,key2,key2,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
key1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
a,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,2.0,-0.023431,...,0.43947,0.90237,2.0,0.064271,1.305741,-0.859027,-0.397378,0.064271,0.52592,0.987569
b,0.0,,,,,,,,1.0,0.269996,...,0.269996,0.269996,1.0,0.291597,,0.291597,0.291597,0.291597,0.291597,0.291597
c,1.0,5.0,,5.0,5.0,5.0,5.0,5.0,1.0,-1.341779,...,-1.341779,-1.341779,1.0,-0.229023,,-0.229023,-0.229023,-0.229023,-0.229023,-0.229023


### Column-Wise multiple function application

In [104]:
iris = pd.read_csv(r"E:\pythonfordatanalysis\semainedu26fevrier\iris.csv")

In [105]:
iris.head()

Unnamed: 0,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm),Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [107]:
iris.columns

Index(['Id', 'Sepal Length (cm)', 'Sepal Width (cm)', 'Petal Length (cm)',
       'Petal Width (cm)', 'Species'],
      dtype='object')

In [108]:
grouped2 = iris.groupby(['Sepal Length (cm)', 'Sepal Width (cm)', 'Petal Length (cm)',
       'Petal Width (cm)'])

In [115]:
missing_values = iris.isna()

missing_values_sum = iris.isna().sum()

missing_values_percent = (iris.isna().sum() / len(iris)) * 100

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Id                 150 non-null    int64  
 1   Sepal Length (cm)  150 non-null    float64
 2   Sepal Width (cm)   150 non-null    float64
 3   Petal Length (cm)  150 non-null    float64
 4   Petal Width (cm)   150 non-null    float64
 5   Species            150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [116]:
grouped.agg(['mean', 'std', peak_to_peak])

,key2,key2,key2,data1,data1,data1,data2,data2,data2
key1,mean,std,peak_to_peak,mean,std,peak_to_peak,mean,std,peak_to_peak
a,1.5,0.707107,1.0,-0.023431,1.30928,1.851602,0.064271,1.305741,1.846596
b,,,,0.269996,,0.0,0.291597,,0.0
c,5.0,,0.0,-1.341779,,0.0,-0.229023,,0.0
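
Two variations on multiple-function aggregation are worth keeping next to this cell: named aggregation, which picks the output column names, and a dict, which applies different functions to different columns. A sketch on a hypothetical `toy` frame; the output names `average` and `spread` are arbitrary:

```python
import pandas as pd

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Hypothetical data with easy-to-check values
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "data1": [1.0, 4.0, 2.0],
                    "data2": [10.0, 30.0, 5.0]})

# named aggregation: keyword = output column name, value = function
renamed = toy.groupby("key")["data1"].agg(average="mean", spread=peak_to_peak)

# dict: a different function (or list of functions) per column
per_column = toy.groupby("key").agg({"data1": "max", "data2": ["min", "sum"]})

print(renamed)
print(per_column)
```

The dict form returns hierarchical columns, so a single result is addressed with a tuple such as `("data2", "sum")`.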


In [120]:
# drop the Species column (strings), which gave a traceback in the aggregation before
iris2 = iris.drop('Species', axis=1)

In [121]:
iris2.agg(['mean', 'std', peak_to_peak])

Unnamed: 0,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm)
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
peak_to_peak,149.0,3.6,2.4,5.9,2.4
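
The outline item "Returning aggregated data without row indexes" has no cell yet. A minimal sketch of what it refers to, assuming a hypothetical `toy` frame: `as_index=False` keeps the group keys as an ordinary column instead of moving them into the index.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

indexed = toy.groupby("key").mean()               # 'key' becomes the index
flat = toy.groupby("key", as_index=False).mean()  # 'key' stays a column

print(indexed)
print(flat)
```

Calling `reset_index()` on the indexed result gives the same table; `as_index=False` just skips that extra step.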


## Apply: General split-apply-combine

In [125]:
def top(frame, n=5, column='Sepal Length (cm)'):
    # the n rows with the largest values in `column`
    return frame.sort_values(column, ascending=False)[:n]

In [127]:
top(iris2, n=6)

Unnamed: 0,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm)
131,132,7.9,3.8,6.4,2.0
135,136,7.7,3.0,6.1,2.3
122,123,7.7,2.8,6.7,2.0
117,118,7.7,3.8,6.7,2.2
118,119,7.7,2.6,6.9,2.3
105,106,7.6,3.0,6.6,2.1


In [130]:
iris2.groupby('Sepal Length (cm)').apply(top)

Sepal Length (cm),,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm)
4.3,13,14,4.3,3.0,1.1,0.1
4.4,8,9,4.4,2.9,1.4,0.2
4.4,38,39,4.4,3.0,1.3,0.2
4.4,42,43,4.4,3.2,1.3,0.2
4.5,41,42,4.5,2.3,1.3,0.3
...,...,...,...,...,...,...
7.7,117,118,7.7,3.8,6.7,2.2
7.7,118,119,7.7,2.6,6.9,2.3
7.7,122,123,7.7,2.8,6.7,2.0
7.7,135,136,7.7,3.0,6.1,2.3
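
Extra arguments after the function are forwarded by `apply` to every call, so `n` and `column` can be set per use. A sketch on a hypothetical `toy` frame; selecting the columns explicitly before `apply` is my own choice here, to keep the example independent of how a given pandas version treats the grouping column:

```python
import pandas as pd

def top(frame, n=5, column="value"):
    # the n rows with the largest values in `column`
    return frame.sort_values(column, ascending=False)[:n]

# Hypothetical data for illustration
toy = pd.DataFrame({"key": ["a", "a", "a", "b", "b"],
                    "value": [1, 5, 3, 2, 9],
                    "other": [10, 20, 30, 40, 50]})

# n and column are passed through to top() for each group
best = toy.groupby("key")[["value", "other"]].apply(top, n=1, column="value")
print(best)
```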


In [132]:
# calling describe() directly on the GroupBy object

result = iris2.groupby("Sepal Length (cm)").describe()

result

,Id,Id,Id,Id,Id,Id,Id,Id,Sepal Width (cm),Sepal Width (cm),...,Petal Length (cm),Petal Length (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm),Petal Width (cm)
Sepal Length (cm),count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
4.3,1.0,14.0,,14.0,14.0,14.0,14.0,14.0,1.0,3.0,...,1.1,1.1,1.0,0.1,,0.1,0.1,0.1,0.1,0.1
4.4,3.0,30.333333,18.583146,9.0,24.0,39.0,41.0,43.0,3.0,3.033333,...,1.35,1.4,3.0,0.2,3.3993500000000003e-17,0.2,0.2,0.2,0.2,0.2
4.5,1.0,42.0,,42.0,42.0,42.0,42.0,42.0,1.0,2.3,...,1.3,1.3,1.0,0.3,,0.3,0.3,0.3,0.3,0.3
4.6,4.0,20.5,20.141168,4.0,6.25,15.0,29.25,48.0,4.0,3.325,...,1.425,1.5,4.0,0.225,0.05,0.2,0.2,0.2,0.225,0.3
4.7,2.0,16.5,19.091883,3.0,9.75,16.5,23.25,30.0,2.0,3.2,...,1.525,1.6,2.0,0.2,0.0,0.2,0.2,0.2,0.2,0.2
4.8,5.0,25.4,14.046352,12.0,13.0,25.0,31.0,46.0,5.0,3.18,...,1.6,1.9,5.0,0.2,0.07071068,0.1,0.2,0.2,0.2,0.3
4.9,6.0,41.666667,37.866432,2.0,16.25,36.5,53.0,107.0,6.0,2.866667,...,2.85,4.5,6.0,0.533333,0.6713171,0.1,0.1,0.15,0.8,1.7
5.0,10.0,39.2,26.029044,5.0,26.25,38.5,48.5,94.0,10.0,3.12,...,1.6,3.5,10.0,0.43,0.3267687,0.2,0.2,0.25,0.55,1.0
5.1,9.0,35.111111,28.117808,1.0,20.0,24.0,45.0,99.0,9.0,3.477778,...,1.7,3.0,9.0,0.4,0.2828427,0.2,0.2,0.3,0.4,1.1
5.2,4.0,37.5,15.154757,28.0,28.75,31.0,39.75,60.0,4.0,3.425,...,2.1,3.9,4.0,0.475,0.6184658,0.1,0.175,0.2,0.5,1.4


In [133]:
# invoking a method inside GroupBy
def f(group):
    return group.describe()
grouped.apply(f)

key1,,key2,data1,data2
a,count,2.0,2.0,2.0
a,mean,1.5,-0.023431,0.064271
a,std,0.707107,1.30928,1.305741
a,min,1.0,-0.949232,-0.859027
a,25%,1.25,-0.486331,-0.397378
a,50%,1.5,-0.023431,0.064271
a,75%,1.75,0.43947,0.52592
a,max,2.0,0.90237,0.987569
b,count,0.0,1.0,1.0
b,mean,,0.269996,0.291597


In [134]:
iris2.apply(f)

Unnamed: 0,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm)
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


### Suppressing the Group Keys

In [136]:
iris2.groupby('Sepal Length (cm)', group_keys = False).apply(top)

Unnamed: 0,Id,Sepal Length (cm),Sepal Width (cm),Petal Length (cm),Petal Width (cm)
13,14,4.3,3.0,1.1,0.1
8,9,4.4,2.9,1.4,0.2
38,39,4.4,3.0,1.3,0.2
42,43,4.4,3.2,1.3,0.2
41,42,4.5,2.3,1.3,0.3
...,...,...,...,...,...
117,118,7.7,3.8,6.7,2.2
118,119,7.7,2.6,6.9,2.3
122,123,7.7,2.8,6.7,2.0
135,136,7.7,3.0,6.1,2.3


### Quantile and Bucket analysis

In [138]:
frame = pd.DataFrame({'data1': np.random.standard_normal(1000),
                     'data2': np.random.standard_normal(1000)})

frame.head()

Unnamed: 0,data1,data2
0,-0.218315,-1.59393
1,2.851876,0.110652
2,-2.136658,-0.057209
3,1.380738,-1.014417
4,0.375207,0.910869


In [139]:
quartiles = pd.cut(frame['data1'],4)

In [140]:
quartiles.head()

0     (-1.6, 0.127]
1    (1.854, 3.581]
2    (-3.333, -1.6]
3    (0.127, 1.854]
4    (0.127, 1.854]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.333, -1.6] < (-1.6, 0.127] < (0.127, 1.854] < (1.854, 3.581]]

In [144]:
def get_stats(group):
    return pd.DataFrame({
        'min': group.min(), 'max':group.max(),
        'count':group.count(), "mean":group.mean()
    })

In [142]:
grouped3 = frame.groupby(quartiles)

In [145]:
grouped3.apply(get_stats)

data1,,min,max,count,mean
"(-3.333, -1.6]",data1,-3.326377,-1.615895,47,-2.008459
"(-3.333, -1.6]",data2,-2.449983,1.891096,47,-0.031678
"(-1.6, 0.127]",data1,-1.592835,0.127111,499,-0.559453
"(-1.6, 0.127]",data2,-3.688831,2.891848,499,0.060442
"(0.127, 1.854]",data1,0.129218,1.845841,419,0.767489
"(0.127, 1.854]",data2,-3.10763,3.210868,419,0.063923
"(1.854, 3.581]",data1,1.860336,3.580781,35,2.272007
"(1.854, 3.581]",data2,-1.915169,1.595983,35,-0.040774


In [146]:
# the same statistics can also be computed with agg
# (pass a list rather than a set, since a set has no defined column order)
grouped3.agg(['min', 'max', 'count', 'mean'])

,data1,data1,data1,data1,data2,data2,data2,data2
data1,min,max,count,mean,min,max,count,mean
"(-3.333, -1.6]",-3.326377,-1.615895,47,-2.008459,-2.449983,1.891096,47,-0.031678
"(-1.6, 0.127]",-1.592835,0.127111,499,-0.559453,-3.688831,2.891848,499,0.060442
"(0.127, 1.854]",0.129218,1.845841,419,0.767489,-3.10763,3.210868,419,0.063923
"(1.854, 3.581]",1.860336,3.580781,35,2.272007,-1.915169,1.595983,35,-0.040774


In [147]:
# using pandas.qcut

quartiles_samp = pd.qcut(frame['data1'], 4, labels= False)

quartiles_samp.head()

0    1
1    3
2    0
3    3
4    2
Name: data1, dtype: int64

In [148]:
grouped4 = frame.groupby(quartiles_samp)

In [149]:
grouped4.apply(get_stats)

data1,,min,max,count,mean
0,data1,-3.326377,-0.634348,250,-1.220039
0,data2,-2.449983,2.891848,250,0.170686
1,data1,-0.629175,0.018952,250,-0.288017
1,data2,-2.67328,2.854084,250,-0.041068
2,data1,0.023148,0.676789,250,0.326941
2,data2,-3.688831,3.210868,250,0.070652
3,data1,0.681385,3.580781,250,1.29125
3,data2,-3.10763,2.653283,250,0.015843
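
The two outputs above differ because `pd.cut` makes buckets of equal width while `pd.qcut` makes buckets of (roughly) equal size. A sketch that makes the difference visible on a deliberately skewed, hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: the squares 0, 1, 4, ..., 9801
values = pd.Series(np.arange(100.0) ** 2)

equal_width = pd.cut(values, 4)    # 4 intervals of the same length
equal_count = pd.qcut(values, 4)   # 4 intervals cut at the sample quartiles

# sort=False keeps the buckets in interval order
width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(count_counts)
```

With skewed data the equal-width buckets end up with very different counts, which is why the variable named `quartiles` earlier (built with `pd.cut`) does not actually hold quartiles.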


### Ex: Filling missing values with group-specific values

In [150]:

s = pd.Series(np.random.standard_normal(6))

s[::2] = np.nan

s

0         NaN
1   -0.860730
2         NaN
3    1.412749
4         NaN
5    0.419197
dtype: float64

In [152]:
s.fillna(s.mean())

0    0.323739
1   -0.860730
2    0.323739
3    1.412749
4    0.323739
5    0.419197
dtype: float64

In [156]:
states = ['MP', 'UP', 'MH', 'RJ',
         'HP', 'UK', 'PB', 'KR']

group_key = ['Centre', 'Centre', 'Centre', 'Centre',
            'North', 'North', "North", "North"]

In [158]:
data4 = pd.Series(np.random.standard_normal(8),
                index= states)

In [159]:
data4

MP    0.712152
UP    0.211319
MH    0.925109
RJ   -0.204536
HP    0.293047
UK    0.497221
PB    0.034935
KR   -0.842300
dtype: float64

In [163]:
# missing data for states

data4[['MP', "PB", "KR"]] = np.nan

In [164]:
data4

MP         NaN
UP    0.211319
MH    0.925109
RJ   -0.204536
HP    0.293047
UK    0.497221
PB         NaN
KR         NaN
dtype: float64

In [165]:
data4.groupby(group_key).size()

Centre    4
North     4
dtype: int64

In [166]:
data4.groupby(group_key).mean()

Centre    0.310630
North     0.395134
dtype: float64

In [167]:
data4.groupby(group_key).count()

Centre    3
North     2
dtype: int64

In [169]:
# filling NA values with the group means
def fill_mean(group):
    return group.fillna(group.mean())

data4.groupby(group_key).apply(fill_mean)

Centre  MP    0.310630
        UP    0.211319
        MH    0.925109
        RJ   -0.204536
North   HP    0.293047
        UK    0.497221
        PB    0.395134
        KR    0.395134
dtype: float64

In [171]:
data4

MP         NaN
UP    0.211319
MH    0.925109
RJ   -0.204536
HP    0.293047
UK    0.497221
PB         NaN
KR         NaN
dtype: float64

In [173]:
# filling with predefined values per group
fill_values = {'Centre': 0.5,
              'North': -1}

def fill_func(group):
    return group.fillna(fill_values[group.name])

data4.groupby(group_key).apply(fill_func)

Centre  MP    0.500000
        UP    0.211319
        MH    0.925109
        RJ   -0.204536
North   HP    0.293047
        UK    0.497221
        PB   -1.000000
        KR   -1.000000
dtype: float64
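
An alternative to `apply` for the group-mean fill is `transform('mean')`, which returns a Series aligned with the original index, so the result keeps a flat index without the extra group level. A sketch on hypothetical data (`toy_series` and `toy_key` are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, one missing value in each
toy_series = pd.Series([np.nan, 1.0, 3.0, 4.0, np.nan],
                       index=list("vwxyz"))
toy_key = ["c", "c", "c", "n", "n"]

# group means broadcast back to the shape of the original Series
group_means = toy_series.groupby(toy_key).transform("mean")

# fillna aligns on the index, so each gap gets its own group's mean
filled = toy_series.fillna(group_means)
print(filled)
```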

### Ex: Random Sampling and Permutation

In [175]:
suits = ['H', 'S', 'C', 'D']  # Hearts, Spades, Clubs, Diamonds
card_val = (list(range(1, 11)) + [10] * 3) * 4

In [177]:
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
    
deck= pd.Series(card_val, index = cards)

In [178]:
deck.head(13)

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64

In [179]:
# draw a random hand of five cards from the deck
def draw(deck, n=5):
    return deck.sample(n)

draw(deck)

6H      6
6S      6
7S      7
10H    10
9D      9
dtype: int64

In [180]:
# two random cards from each suit
def get_suit(card):
    # last letter is suit
    return card[-1]

In [181]:
deck.groupby(get_suit).apply(draw, n = 2)

C  KC    10
   AC     1
D  JD    10
   6D     6
H  AH     1
   2H     2
S  9S     9
   AS     1
dtype: int64

In [182]:
# alternatively, group_keys = False
deck.groupby(get_suit, group_keys= False).apply(draw, n=2)

4C     4
KC    10
6D     6
JD    10
8H     8
AH     1
JS    10
8S     8
dtype: int64
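
The two draws above differ because `sample` is random on every call. If a repeatable hand is wanted, `sample` accepts a `random_state`. A sketch with a hypothetical eight-card deck; the `seed` parameter on `draw_seeded` is my own addition, not part of the original `draw`:

```python
import pandas as pd

# Hypothetical mini deck: two cards per suit
mini_deck = pd.Series(range(8),
                      index=["AH", "2H", "AS", "2S", "AC", "2C", "AD", "2D"])

def get_suit(card):
    # last letter is the suit
    return card[-1]

def draw_seeded(cards, n=1, seed=None):
    # seed is forwarded to sample() so the draw can be reproduced
    return cards.sample(n, random_state=seed)

hand1 = mini_deck.groupby(get_suit, group_keys=False).apply(draw_seeded, n=1, seed=0)
hand2 = mini_deck.groupby(get_suit, group_keys=False).apply(draw_seeded, n=1, seed=0)
print(hand1)
```

Both hands are drawn with the same seed, so they should be identical, with one card per suit.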

### Ex: Group weighted Average and Correlation

In [183]:
df5 = pd.DataFrame({"category": ['a', 'a', 'a', 'a'],
                  "data": np.random.standard_normal(4),
                   'weights': np.random.uniform(size = 4)})

In [184]:
df5

Unnamed: 0,category,data,weights
0,a,-0.272834,0.453331
1,a,1.032855,0.573681
2,a,0.123029,0.2396
3,a,-0.715634,0.474301


In [185]:
# weighted average by category
grouped = df5.groupby('category')

def get_wavg(group):
    return np.average(group['data'],
                     weights=group['weights'])

grouped.apply(get_wavg)

category
a    0.091272
dtype: float64
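
The weighted average per group is sum(weight * value) / sum(weight). A sketch that checks `np.average` against that formula written out with plain groupby sums, on a hypothetical two-category frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two categories and round numbers
toy = pd.DataFrame({"category": ["a", "a", "b", "b"],
                    "data": [1.0, 3.0, 10.0, 20.0],
                    "weights": [1.0, 3.0, 1.0, 1.0]})

def get_wavg(group):
    return np.average(group["data"], weights=group["weights"])

via_apply = toy.groupby("category")[["data", "weights"]].apply(get_wavg)

# the same formula by hand: sum(w * x) / sum(w) within each category
weighted = toy["data"] * toy["weights"]
by_hand = (weighted.groupby(toy["category"]).sum()
           / toy["weights"].groupby(toy["category"]).sum())

print(via_apply)
print(by_hand)
```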

### Ex: Group-Wise linear Regression

- using a `regress` helper function built on the statsmodels econometrics library

In [190]:
import statsmodels.api as sm


In [None]:
! pip install statsmodels

In [191]:
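def regress(data, yvar=None, xvars=None):
    Y = data[yvar]
    X = data[xvars].copy()   # copy so the new column does not touch the caller's frame
    X['intercept'] = 1.0     # constant column for the intercept term
    result = sm.OLS(Y, X).fit()
    return result.params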

In [193]:
iris2.columns

Index(['Id', 'Sepal Length (cm)', 'Sepal Width (cm)', 'Petal Length (cm)',
       'Petal Width (cm)'],
      dtype='object')

In [195]:
iris2.rename(columns={'Sepal Length (cm)' : 'sepal_length'}, inplace = True)

In [196]:
iris2.columns

Index(['Id', 'sepal_length', 'Sepal Width (cm)', 'Petal Length (cm)',
       'Petal Width (cm)'],
      dtype='object')

In [None]:
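# 'AAPL' and 'SPX' are not iris columns; as an assumed adaptation,
# regress petal length on petal width within each species
iris.groupby('Species').apply(regress, yvar='Petal Length (cm)',
                              xvars=['Petal Width (cm)'])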
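
To see the shape of a group-wise regression result without needing statsmodels or the iris file, here is a self-contained sketch. It swaps in `np.linalg.lstsq` for `sm.OLS` (a plain least-squares fit, without the statistics statsmodels reports) and uses a hypothetical `toy` frame whose groups lie exactly on known lines:

```python
import numpy as np
import pandas as pd

def regress_lstsq(group, yvar, xvars):
    # design matrix: the regressors plus a constant column for the intercept
    X = group[xvars].to_numpy(dtype=float)
    X = np.column_stack([X, np.ones(len(X))])
    y = group[yvar].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=list(xvars) + ["intercept"])

# Hypothetical data: s1 follows y = 2x + 1, s2 follows y = -x
toy = pd.DataFrame({"species": ["s1"] * 3 + ["s2"] * 3,
                    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
                    "y": [1.0, 3.0, 5.0, 0.0, -1.0, -2.0]})

# one row of fitted parameters per group
params = toy.groupby("species")[["x", "y"]].apply(regress_lstsq,
                                                  yvar="y", xvars=["x"])
print(params)
```

Because each call returns a Series with the same labels, `apply` stacks the results into a DataFrame with one row per group.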


### Group transforms and 'Unwrapped' GroupBys

In [198]:
df6 = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value' : np.arange(12.)
                   })

In [199]:
df6

Unnamed: 0,key,value
0,a,0.0
1,b,1.0
2,c,2.0
3,a,3.0
4,b,4.0
5,c,5.0
6,a,6.0
7,b,7.0
8,c,8.0
9,a,9.0


In [203]:
# group means by key

g = df6.groupby('key')['value']

g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [205]:
def get_mean(group):
    return group.mean()

g.transform(get_mean)

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [207]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [208]:
def times_two(group):
    return group* 2

g.transform(times_two)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [210]:
# ranks in descending order
def get_ranks(group):
    return group.rank(ascending=False)

g.transform(get_ranks)

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [211]:
# transformation function composed of aggregations

def normalize(x):
    return (x - x.mean())/x.std()

In [212]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [213]:
g.apply(normalize)

key    
a    0    -1.161895
     3    -0.387298
     6     0.387298
     9     1.161895
b    1    -1.161895
     4    -0.387298
     7     0.387298
     10    1.161895
c    2    -1.161895
     5    -0.387298
     8     0.387298
     11    1.161895
Name: value, dtype: float64

In [214]:
# using built-in 'mean' and 'std' functions

g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [215]:
normalized2 = (df6['value'] - g.transform('mean')) / g.transform('std')

In [216]:
normalized2

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
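
A quick sanity check for the normalization: after it, every group should have mean 0 and standard deviation 1, and the function-based and string-based versions should agree. A self-contained sketch that rebuilds the same `key`/`value` layout under the hypothetical name `toy`:

```python
import numpy as np
import pandas as pd

# Same layout as df6, rebuilt so the block runs on its own
toy = pd.DataFrame({"key": ["a", "b", "c"] * 4,
                    "value": np.arange(12.0)})
g_toy = toy.groupby("key")["value"]

def normalize(x):
    return (x - x.mean()) / x.std()

with_function = g_toy.transform(normalize)
with_builtins = (toy["value"] - g_toy.transform("mean")) / g_toy.transform("std")

# per-group mean and std of the normalized values
check = with_builtins.groupby(toy["key"]).agg(["mean", "std"])
print(check)
```

The string-based version usually runs faster on large data because the built-in aggregations avoid calling a Python function once per group.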

### Pivot tables and Cross-tabulation
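
This section has no cells yet. As a starting point, a minimal sketch of the two tools named in the outline, on a hypothetical `toy` frame (the `day`, `smoker`, and `tip` columns are made up): `pivot_table` aggregates a value column by row and column keys, and `pd.crosstab` counts how often key combinations occur.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
                    "smoker": ["No", "Yes", "No", "No", "Yes"],
                    "tip": [1.0, 2.0, 3.0, 5.0, 4.0]})

# mean tip for each (day, smoker) combination
mean_tip = toy.pivot_table(values="tip", index="day",
                           columns="smoker", aggfunc="mean")

# frequency table of day vs smoker, with row and column totals
freq = pd.crosstab(toy["day"], toy["smoker"], margins=True)

print(mean_tip)
print(freq)
```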