# Group Transforms and Analysis

You learned the basics of computing group statistics and applying your own transformations to groups in a dataset.

Let’s consider a collection of hypothetical stock portfolios. I first randomly generate a broad universe of 1,000 tickers:

In [20]:
import random; random.seed(0)
import string
import numpy as np
import pandas as pd

In [17]:
N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return '1'.join([random.choice(choices) for _ in range(n)])

tickers = np.array([rands(5) for _ in range(N)])

In [18]:
tickers

array(['D1C1R1J1M', 'W1E1D1V1H', 'Y1V1L1R1U', 'Y1L1M1W1R', 'M1A1C1F1M',
       'B1F1O1P1D', 'Y1K1E1D1L', 'V1Y1G1V1V', 'U1J1W1J1D', 'H1C1J1T1Q',
       'W1A1U1P1L', 'H1T1A1F1N', 'G1T1U1L1F', 'L1U1O1S1O', 'B1Z1H1L1C',
       'F1C1N1X1A', 'K1H1U1Y1K', 'W1H1A1O1Q', 'I1P1R1J1K', 'B1U1T1B1R',
       'G1I1W1Z1I', 'K1K1T1C1G', 'V1A1K1G1X', 'S1B1L1A1N', 'K1U1P1G1U',
       'C1F1C1U1R', 'W1I1V1E1R', 'G1N1F1C1Z', 'Z1P1I1P1E', 'N1N1B1C1O',
       'B1H1E1Q1P', 'O1H1M1Z1W', 'Q1S1G1Q1M', 'H1J1Q1W1Y', 'K1F1A1M1V',
       'F1H1U1I1R', 'Y1V1Y1X1C', 'J1S1R1C1I', 'A1P1O1X1I', 'S1T1Z1M1N',
       'F1B1Z1X1H', 'S1V1N1G1J', 'B1X1B1W1K', 'L1E1T1I1F', 'G1Z1I1I1O',
       'X1K1V1N1X', 'N1I1F1R1L', 'B1Y1T1C1D', 'L1W1Y1G1V', 'P1R1O1C1C',
       'X1T1S1G1W', 'K1U1C1R1E', 'I1U1C1O1L', 'A1V1B1X1G', 'P1Y1E1Q1G',
       'U1D1O1X1O', 'F1C1W1H1O', 'T1J1S1K1W', 'X1T1Y1T1O', 'C1U1S1N1X',
       'W1D1G1D1D', 'T1Q1A1L1N', 'S1Z1F1B1O', 'A1V1O1T1L', 'N1K1C1W1M',
       'I1Z1J1P1S', 'X1K1S1Y1Y', 'P1Q1X1L1Z', 'G1N1L1I1W', 'C1W1

I then create a DataFrame with three columns of hypothetical, randomly generated portfolio values for a subset of the tickers:

In [21]:
M = 500
df = pd.DataFrame({'Momentum': np.random.randn(M) / 200 + 0.03,
                   'Value': np.random.randn(M) / 200 + 0.08,
                   'ShortInterest': np.random.randn(M) / 200 - 0.02},
                  index=tickers[:M])

In [22]:
df

,Momentum,Value,ShortInterest
D1C1R1J1M,0.036649,0.077906,-0.025152
W1E1D1V1H,0.033180,0.076947,-0.029863
Y1V1L1R1U,0.029630,0.069364,-0.017857
Y1L1M1W1R,0.031200,0.080340,-0.022389
M1A1C1F1M,0.029876,0.082379,-0.013235
...,...,...,...
E1L1S1U1P,0.041140,0.080067,-0.011857
B1X1G1R1I,0.033145,0.078836,-0.022577
N1N1M1Z1I,0.030463,0.078472,-0.018437
M1O1Q1V1P,0.027751,0.081169,-0.021985
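Note that the cells above seed Python's `random` module but not NumPy, so the values in `df` will differ from run to run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed `12345` and the placeholder labels `T0`..`T4` are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)    # hypothetical seed; any integer works
M = 5                                 # small size for illustration only
labels = [f'T{i}' for i in range(M)]  # placeholder tickers, not the real ones

# Same construction as above, but drawing from a seeded generator
sample = pd.DataFrame({'Momentum': rng.standard_normal(M) / 200 + 0.03,
                       'Value': rng.standard_normal(M) / 200 + 0.08,
                       'ShortInterest': rng.standard_normal(M) / 200 - 0.02},
                      index=labels)
print(sample)
```

Re-running this cell with the same seed gives the same frame each time.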


Next, let’s create a random industry classification for the tickers. To keep things simple, I’ll use just two industries and store the mapping in a Series:

In [25]:
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = pd.Series(ind_names[sampler], index=tickers,
                       name='industry')

Now we can group by industries and carry out group aggregation and transformations:

In [32]:
by_industry = df.groupby(industries)

In [33]:
by_industry.sum()

industry,Momentum,Value,ShortInterest
FINANCIAL,6.978585,18.499729,-4.524055
TECH,8.156718,21.80383,-5.327546
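It is worth noticing that `industries` covers all 1,000 tickers while `df` holds only the first 500. This works because, when the grouping key is a Series, pandas aligns it to the frame's index by label. A small sketch with toy data (the tickers `AAA`..`EEE` are made up for illustration):

```python
import pandas as pd

# The frame covers three tickers; the mapping covers five, in another order.
frame = pd.DataFrame({'Momentum': [0.01, 0.02, 0.03]},
                     index=['AAA', 'BBB', 'CCC'])
mapping = pd.Series(['TECH', 'FINANCIAL', 'TECH', 'TECH', 'FINANCIAL'],
                    index=['CCC', 'AAA', 'BBB', 'DDD', 'EEE'],
                    name='industry')

# The mapping is matched to frame.index by label, so its order does not
# matter and the extra tickers (DDD, EEE) play no part in the grouping.
sizes = frame.groupby(mapping).size()
print(sizes)
```

Only the three tickers present in `frame` are counted: one in FINANCIAL and two in TECH.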


In [35]:
by_industry.describe().T

,industry,FINANCIAL,TECH
Momentum,count,229.0,271.0
Momentum,mean,0.030474,0.030099
Momentum,std,0.004656,0.005033
Momentum,min,0.016736,0.017135
Momentum,25%,0.0271,0.026808
Momentum,50%,0.03101,0.030479
Momentum,75%,0.033795,0.033662
Momentum,max,0.042247,0.04414
Value,count,229.0,271.0
Value,mean,0.080785,0.080457


By defining transformation functions, it’s easy to transform these portfolios by industry. For example, standardizing within industry is widely used in equity portfolio construction:

In [36]:
# Within-industry standardize

def zscore(group):
    return (group - group.mean()) / group.std()

df_stand = by_industry.apply(zscore)

You can verify that each industry has mean 0 (up to floating-point rounding error) and standard deviation 1:

In [37]:
df_stand.groupby(industries).agg(['mean', 'std'])

,Momentum,Momentum,Value,Value,ShortInterest,ShortInterest
industry,mean,std,mean,std,mean,std
FINANCIAL,-1.163553e-16,1.0,3.202921e-15,1.0,-5.963207e-16,1.0
TECH,8.674898e-16,1.0,1.699338e-15,1.0,-3.408508e-16,1.0
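Depending on the pandas version, `groupby(...).apply` may prepend the group key to the result's index (as far as I am aware, this is the default in pandas 2.x), in which case the result no longer lines up with `industries` for the check above. A sketch of the same standardization using `transform`, which returns a result indexed like the input; the toy frame, the seed, and the names `toy`, `toy_ind`, and `toy_stand` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for the sketch
toy = pd.DataFrame({'Momentum': rng.normal(0.03, 0.005, 8),
                    'Value': rng.normal(0.08, 0.005, 8)},
                   index=list('ABCDEFGH'))
toy_ind = pd.Series(['FINANCIAL', 'TECH'] * 4, index=toy.index,
                    name='industry')

def zscore(group):
    return (group - group.mean()) / group.std()

# transform keeps the original row index, so the result aligns with toy_ind
toy_stand = toy.groupby(toy_ind).transform(zscore)

# Same verification as above: per-industry mean ~0 and std 1
check = toy_stand.groupby(toy_ind).agg(['mean', 'std'])
print(check)
```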


Built-in transformations, such as rank, can be applied more concisely:

In [38]:
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)

In [39]:
ind_rank

,Momentum,Value,ShortInterest
D1C1R1J1M,24.0,163.0,198.0
W1E1D1V1H,73.0,203.0,264.0
Y1V1L1R1U,133.0,225.0,73.0
Y1L1M1W1R,103.0,124.0,161.0
M1A1C1F1M,146.0,98.0,25.0
...,...,...,...
E1L1S1U1P,2.0,127.0,12.0
B1X1G1R1I,75.0,164.0,201.0
N1N1M1Z1I,137.0,170.0,110.0
M1O1Q1V1P,185.0,123.0,191.0


In [41]:
ind_rank.groupby(industries).agg(['min', 'max'])

,Momentum,Momentum,Value,Value,ShortInterest,ShortInterest
industry,min,max,min,max,min,max
FINANCIAL,1.0,229.0,1.0,229.0,1.0,229.0
TECH,1.0,271.0,1.0,271.0,1.0,271.0
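The ranks run from 1 up to the size of each group because ranking restarts within every industry. A toy sketch makes this easier to see; the values and the group labels `X` and `Y` are made up for illustration:

```python
import pandas as pd

scores = pd.Series([0.5, 0.9, 0.1, 0.7, 0.3],
                   index=['A', 'B', 'C', 'D', 'E'])
groups = pd.Series(['X', 'X', 'X', 'Y', 'Y'], index=scores.index)

# Descending rank within each group: the largest value in a group gets rank 1
ranks = scores.groupby(groups).rank(ascending=False)
print(ranks)
```

Here `B` and `D` each receive rank 1 because they are the largest values in their own groups. With the default `method='average'`, tied values would share the average of their ranks.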


In quantitative equity, “rank and standardize” is a common sequence of transforms. You could do this by chaining together rank and zscore like so:

In [43]:
# Industry rank and standardize

by_industry.apply(lambda x: zscore(x.rank())).info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, D1C1R1J1M to X1T1P1N1Y
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Momentum       500 non-null    float64
 1   Value          500 non-null    float64
 2   ShortInterest  500 non-null    float64
dtypes: float64(3)
memory usage: 31.8+ KB
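One property of this sequence is that the output depends only on the ordering within each group, not on the magnitudes. A minimal sketch under toy assumptions (the data, the group labels, and the helper name `rank_standardize` are hypothetical), written with `transform` so the result keeps the input's index:

```python
import numpy as np
import pandas as pd

def zscore(group):
    return (group - group.mean()) / group.std()

def rank_standardize(values, keys):
    # Rank within each group, then z-score the ranks within that group
    return values.groupby(keys).transform(lambda x: zscore(x.rank()))

groups = pd.Series(['X', 'X', 'X', 'Y', 'Y', 'Y'], index=list('ABCDEF'))
base = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], index=groups.index)

with_outlier = base.copy()
with_outlier['C'] = 1000.0  # same ordering within X, far larger magnitude

a = rank_standardize(base, groups)
b = rank_standardize(with_outlier, groups)
print(a.equals(b))
```

Both inputs produce the same standardized ranks, since the outlier changes the value of `C` but not its position within group `X`.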
