# Value and bivariate sorts

This chapter extends univariate portfolio analysis to bivariate sorts, which means that we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature, and some scholars even use sorts with three grouping variables. Conceptually, portfolio sorts extend naturally to higher dimensions.

We form portfolios on firm size and the book-to-market ratio. Calculating book-to-market ratios requires accounting data, which necessitates additional steps during portfolio formation. Finally, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.

## Data preparation

First, we load the necessary data from our `SQLite` database introduced in our chapter on *"Accessing & managing financial data"*. We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in memory. We use the same data sources for firm size as in the previous chapter.

In [1]:
import pandas as pd
import sqlite3
# Read sqlite query results into a pandas DataFrame
tidy_finance = sqlite3.connect("D:/Tidy/tidyfinance.sqlite")
crsp_monthly = pd.read_sql_query("SELECT * from crsp_monthly", tidy_finance)
factors_ff_monthly = pd.read_sql_query("SELECT * from factors_ff_monthly", tidy_finance)
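
If memory is a concern, the column selection can also happen inside the SQL query rather than after loading. The following is a minimal sketch, assuming an in-memory database and a toy table `toy_crsp`; both are hypothetical stand-ins for the real database file and its tables.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real database file (an assumption for illustration)
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "permno": [10001, 10002],
    "month": ["2020-01-01", "2020-01-01"],
    "ret_excess": [0.01, -0.02],
    "unused_column": ["a", "b"],
}).to_sql("toy_crsp", con, index=False)

# Name the columns in the query so unused ones never reach pandas
toy = pd.read_sql_query("SELECT permno, month, ret_excess FROM toy_crsp", con)
con.close()
print(toy.columns.tolist())
```

The same idea would apply to `crsp_monthly`, provided the column names in the query match those stored in the database.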

In [2]:
# Attach the Fama-French factors and keep only the columns needed for the sorts
crsp_monthly = pd.merge(crsp_monthly, factors_ff_monthly, on='month')[
    ['permno', 'gvkey', 'month', 'industry', 'ret_excess', 'mkt_excess', 'mktcap', 'mktcap_lag', 'exchange']
]
crsp_monthly['month'] = pd.to_datetime(crsp_monthly['month'])
crsp_monthly['permno'] = crsp_monthly['permno'].astype(int)
crsp_monthly = crsp_monthly.dropna()

Further, we utilize accounting data. The most common source of accounting data is *Compustat*. We only need book equity data in this application, which we select from our database. Additionally, we convert the variable `datadate` to the first day of its month, as we only consider monthly returns here and do not need to account for the exact date. To achieve this, we parse the dates with `pd.to_datetime()` and shift each date, which we take to be a month-end, to the start of its month.

In [3]:
compustat = pd.read_sql_query("SELECT * from compustat", tidy_finance)
be = compustat[['gvkey', 'datadate', 'be']].dropna()

In [4]:
from datetime import timedelta
be['month']=pd.to_datetime(be['datadate']) + timedelta(days=1) - pd.DateOffset(months=1)
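
The date arithmetic above can be checked on a few toy dates. The sketch below assumes that `datadate` always falls on a month-end and compares the offset arithmetic with an alternative based on monthly periods. The dates are made up for illustration.

```python
import pandas as pd
from datetime import timedelta

# Toy fiscal-year-end dates (month-ends, as assumed for `datadate`)
datadate = pd.to_datetime(pd.Series(["2019-12-31", "2020-02-29", "2020-06-30"]))

# Arithmetic used above: step to the next month's first day, then back one month
via_offset = datadate + timedelta(days=1) - pd.DateOffset(months=1)

# Alternative: truncate to the monthly period and take its start
via_period = datadate.dt.to_period("M").dt.to_timestamp()

print(pd.DataFrame({"datadate": datadate, "offset": via_offset, "period": via_period}))
```

Both routes agree for month-end dates. For a mid-month date the offset arithmetic would land in the previous month, so the period-based variant may be the more robust choice if the month-end assumption is in doubt.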

## Book-to-market ratio

A fundamental problem in handling accounting data is the *look-ahead bias*: when forming a portfolio, we must not include data that was not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents' trades. Hence, we have to lag accounting information.

We continue to lag firm size, i.e., market capitalization, by one month. Then, we compute the book-to-market ratio, which relates a firm's book equity to its market equity. Firms with high (low) book-to-market are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, even longer lags are used in the asset pricing literature.^[The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in the section on p-hacking.]

Having both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date` column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly, whereas the accounting information is only released on an annual basis. Hence, book-to-market matches only one month per year, leaving eleven missing observations. To solve this frequency issue, we carry the latest book-to-market ratio of each firm forward to the subsequent months, i.e., we fill the missing observations with the most recent report. We do this by forward-filling (`fillna()` with `method='ffill'`, or equivalently `ffill()`) after sorting by firm (identified by `permno` and `gvkey`) and date, and we fill on a firm basis via `groupby()` as usual. As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.

In [5]:
me=crsp_monthly.assign(sorting_date = crsp_monthly.month + pd.DateOffset(months=1))[['permno', 'sorting_date','mktcap']].rename(columns={'mktcap':'me'})

In [6]:
bm=be.merge(crsp_monthly[['month', 'permno', 'gvkey', 'mktcap']],left_on=["gvkey", "month"],right_on=["gvkey", "month"],how='inner')

In [7]:
bm=bm.assign(bm=bm.be/bm.mktcap,sorting_date = bm.month + pd.DateOffset(months=6))[['permno', 'gvkey', 'sorting_date', 'bm']].sort_values(['permno', 'gvkey', 'sorting_date'])

In [8]:
data_for_sorts=crsp_monthly.merge(bm, left_on=["permno", "gvkey", "month"],right_on=["permno", "gvkey", 'sorting_date'],how='left').merge(me, left_on=["permno",  "month"],right_on=["permno",'sorting_date'],how='left')[['permno', 'gvkey', 'month', 'ret_excess', 'mktcap_lag','me', 'bm', 'exchange']]

In [9]:
data_for_sorts=data_for_sorts.sort_values(['permno', 'gvkey', 'month'])

In [10]:
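# Carry each firm's latest book-to-market forward (equivalent to fillna(method='ffill')), then drop unmatched rows
data_for_sorts=data_for_sorts.assign(bm=data_for_sorts.groupby(['permno', 'gvkey']).bm.ffill()).dropna()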
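
The lag-and-fill logic can be illustrated on toy data. The sketch below assumes one hypothetical firm (`permno` 1) with a single book-to-market observation from December 2019. The names `returns`, `bm_toy`, and `usable` are made up for this example.

```python
import pandas as pd

# Toy monthly observations for one hypothetical firm over 2020
returns = pd.DataFrame({
    "permno": 1,
    "month": pd.date_range("2020-01-01", periods=12, freq="MS"),
})

# One annual book-to-market observation, measured in December 2019
bm_toy = pd.DataFrame({
    "permno": 1,
    "month": pd.date_range("2019-12-01", periods=1, freq="MS"),
    "bm": [0.8],
})
# Make it usable for sorting only six months later
bm_toy["sorting_date"] = bm_toy["month"] + pd.DateOffset(months=6)

merged = returns.merge(
    bm_toy[["permno", "sorting_date", "bm"]],
    left_on=["permno", "month"],
    right_on=["permno", "sorting_date"],
    how="left",
).sort_values(["permno", "month"])

# Carry the latest available value forward within each firm
merged["bm"] = merged.groupby("permno")["bm"].ffill()
usable = merged.dropna(subset=["bm"])
print(usable[["month", "bm"]])
```

In this toy case, January to May 2020 have no usable book-to-market value and drop out, while June to December inherit the December 2019 observation.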

The last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `var` into the function for defining different sorting variables.

In [11]:
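def bucket(row,var,groups):
    # Map one row to its quintile using the breakpoint columns '20%' to '80%'
    if groups != 5:
        raise NotImplementedError("bucket() only supports groups == 5")
    # Note: because of the lower bound 0, negative values skip this branch and land in bucket 2
    if 0<=row[var]<=row['20%']:
        value = 1
    elif row[var]<=row['40%']:
        value=2
    elif row[var]<=row['60%']:
        value=3
    elif row[var]<=row['80%']:
        value=4
    elif row[var]>row['80%']:
        value=5
    else:
        # Reached only when a comparison involves a missing value
        value=float('nan')
    return value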

In [12]:
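import numpy as np
def assign_portfolio(df,var,groups,exchange):
    # Monthly breakpoints of `var`, computed only from stocks listed on `exchange`
    nyse = df.loc[df['exchange'].isin(exchange)].groupby(["month"])[var].describe(percentiles=np.linspace(0,1,groups+1)[1:-1]).reset_index()
    # After reset_index(), columns 0-4 are month, count, mean, std, min; percentiles start at column 5.
    # describe() also reports the median here, so an odd `groups` yields `groups` percentile columns,
    # while an even `groups` (where 50% is already a breakpoint) yields `groups - 1`.
    if groups % 2==1:
        nyse=nyse.iloc[:,[0]+list(range(5,5+groups))].merge(df, how='inner', left_on=['month'], right_on = ['month'])
    else:
        nyse=nyse.iloc[:,[0]+list(range(5,5+groups-1))].merge(df, how='inner',left_on=['month'], right_on = ['month'])
    # Assign each stock-month to a portfolio given that month's breakpoints
    nyse['portfolio_{}'.format(var)]=nyse.apply( lambda x:bucket(x,var,groups), axis=1)
    nyse=nyse.sort_values(['permno','month'])
    return nyse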
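
The idea of exchange-specific breakpoints can also be sketched without `describe()`. The hypothetical helper `assign_portfolio_sketch()` below is not the function used in this chapter. It assumes a single cross-section, takes breakpoints from `np.quantile()`, and treats ties and negative values differently from `bucket()`.

```python
import numpy as np
import pandas as pd

def assign_portfolio_sketch(df, var, groups, exchanges):
    """Hypothetical helper: portfolio numbers 1..groups for one cross-section."""
    # Breakpoints come only from stocks on the selected exchanges
    subset = df.loc[df["exchange"].isin(exchanges), var]
    probs = np.linspace(0, 1, groups + 1)[1:-1]
    breakpoints = np.quantile(subset, probs)
    # side="left" puts a value equal to a breakpoint into the lower bucket
    return np.searchsorted(breakpoints, df[var].to_numpy(), side="left") + 1

toy = pd.DataFrame({
    "exchange": ["NYSE"] * 10 + ["NASDAQ"] * 2,
    "me": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0.5, 20],
})
toy["portfolio_me"] = assign_portfolio_sketch(toy, "me", 5, ["NYSE"])
print(toy)
```

The two NASDAQ stocks do not influence the breakpoints but are still assigned: the very small one ends up in portfolio 1 and the very large one in portfolio 5.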

After these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.

In [24]:
# Free memory: these intermediate tables are no longer needed
del compustat, crsp_monthly, bm, be, me

## Independent sorts

Bivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variable's spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms. There are five portfolios at each end because of the size splits we employ alongside the book-to-market splits.

To implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm` and `portfolio_me`, respectively. Then, these separate portfolios are combined into the final sort stored in `portfolio_combined`. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. Notice that we weight the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.

In [13]:
data_for_sorts=data_for_sorts.merge(assign_portfolio(data_for_sorts,'bm',5,['NYSE'])[['permno','portfolio_bm','month']],left_on=['permno','month'],right_on=['permno','month'])

In [14]:
data_for_sorts=data_for_sorts.merge(assign_portfolio(data_for_sorts,'me',5,['NYSE'])[['permno','portfolio_me','month']],left_on=['permno','month'],right_on=['permno','month'])

In [15]:
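# Work on an explicit copy so that adding columns below does not trigger a SettingWithCopyWarning
value_portfolios=data_for_sorts.dropna().copy()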

In [16]:
value_portfolios['portfolio_combined']=value_portfolios.apply(lambda x:str(int(x['portfolio_bm'])) + '-' + str(int(x['portfolio_me'])),axis = 1)
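
To see what an independent sort does, consider a toy cross-section of eight hypothetical stocks sorted into two-by-two portfolios. The sketch assumes breakpoints from all stocks via `pd.qcut()` rather than NYSE breakpoints, and it builds the combined label with vectorized string operations instead of a row-wise `apply()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "permno": range(1, 9),
    "me": [1, 2, 3, 4, 5, 6, 7, 8],
    "bm": [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.8],
})

# Each variable is sorted on its own, ignoring the other one
toy["portfolio_me"] = pd.qcut(toy["me"], 2, labels=False) + 1
toy["portfolio_bm"] = pd.qcut(toy["bm"], 2, labels=False) + 1

# Vectorized construction of the combined label
toy["portfolio_combined"] = (
    toy["portfolio_bm"].astype(str) + "-" + toy["portfolio_me"].astype(str)
)

# Number of stocks per combined portfolio
counts = pd.crosstab(toy["portfolio_bm"], toy["portfolio_me"])
print(counts)
```

Because size and book-to-market are correlated in this toy sample, the four cells are unevenly populated. This is a typical feature of independent sorts.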

With the portfolio assignments in place, we compute the monthly value-weighted return of each of the 25 portfolios and are then ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weight these portfolios equally, but this is yet another researcher's choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium in percent per month.

In [17]:
value_premium=value_portfolios.groupby(['month','portfolio_combined']).apply(lambda x: pd.Series([np.average(x['ret_excess'], weights=x['mktcap_lag'])], 
                                                                index=['ret'])).reset_index()
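
Value weighting within a portfolio amounts to dividing the sum of weight times return by the sum of weights. The sketch below uses made-up numbers and shows a vectorized variant next to `np.average()`. The column names follow the chapter, while the values are purely illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "portfolio_combined": ["1-1", "1-1", "5-1", "5-1"],
    "ret_excess": [0.10, -0.02, 0.04, 0.00],
    "mktcap_lag": [300.0, 100.0, 50.0, 150.0],
})

# Value-weighted return = sum(weight * return) / sum(weight) within each portfolio
weighted = (toy["ret_excess"] * toy["mktcap_lag"]).groupby(toy["portfolio_combined"]).sum()
total_cap = toy["mktcap_lag"].groupby(toy["portfolio_combined"]).sum()
vw_ret = weighted / total_cap

# Same number for portfolio "1-1" via np.average, as in the cell above
check = np.average([0.10, -0.02], weights=[300.0, 100.0])
print(vw_ret)
```

In portfolio `1-1` the large stock dominates, so the value-weighted return lies closer to its return than the equal-weighted average would.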

In [18]:
value_premium['portfolio_bm']=value_premium.portfolio_combined.apply(lambda x:x[0])

In [19]:
value_premium=value_premium.groupby(['month', 'portfolio_bm']).ret.mean().unstack()

In [20]:
(value_premium.iloc[: , -1]-value_premium.iloc[: , 0]).mean()*100

0.3063750693423136
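
The last three cells can be traced on a small example. The sketch assumes a two-by-two sort over two months with made-up portfolio returns. It equal-weights the size portfolios within each book-to-market group and then takes high minus low.

```python
import pandas as pd

portfolio_returns = pd.DataFrame({
    "month": ["2020-01", "2020-01", "2020-01", "2020-01",
              "2020-02", "2020-02", "2020-02", "2020-02"],
    "portfolio_bm": [1, 1, 2, 2, 1, 1, 2, 2],
    "portfolio_me": [1, 2, 1, 2, 1, 2, 1, 2],
    "ret": [0.01, 0.03, 0.02, 0.06, 0.00, 0.02, 0.03, 0.01],
})

# Equal-weight across size portfolios within each book-to-market group
by_bm = portfolio_returns.groupby(["month", "portfolio_bm"])["ret"].mean().unstack()

# Long the highest, short the lowest book-to-market group, month by month
long_short = by_bm.iloc[:, -1] - by_bm.iloc[:, 0]

# Average monthly premium in percent
premium_pct = long_short.mean() * 100
print(premium_pct)
```

The monthly differentials are 2 and 1 percent in this toy example, so the average premium is 1.5 percent per month.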

## Dependent sorts

In the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable. In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure a roughly equal number of stocks within each portfolio, and exactly so only among the stocks used to compute the breakpoints.

To implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()` with `var='me'`. Then, we group our data again by month and by the size portfolio, compute book-to-market breakpoints from NYSE stocks within each group, and assign the book-to-market portfolio. The rest of the implementation is the same as before. Finally, we compute the value premium.
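
The mechanics of a dependent sort can be sketched on a toy cross-section. The example assumes eight hypothetical stocks in one month, a two-by-two sort, and breakpoints from all stocks via `pd.qcut()` rather than from NYSE stocks only.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": "2020-01",
    "permno": range(1, 9),
    "me": [1, 2, 3, 4, 5, 6, 7, 8],
    "bm": [0.9, 0.1, 0.5, 0.3, 0.2, 0.8, 0.4, 0.6],
})

# First sort: size portfolios within each month
toy["portfolio_me"] = toy.groupby("month")["me"].transform(
    lambda s: pd.qcut(s, 2, labels=False) + 1
).astype(int)

# Second sort: book-to-market breakpoints are specific to each month and size portfolio
toy["portfolio_bm"] = toy.groupby(["month", "portfolio_me"])["bm"].transform(
    lambda s: pd.qcut(s, 2, labels=False) + 1
).astype(int)

# Number of stocks per combined portfolio
counts = pd.crosstab(toy["portfolio_bm"], toy["portfolio_me"])
print(counts)
```

Since the second sort runs within each size bucket, every cell of the two-by-two grid holds the same number of stocks here.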

In [21]:
data_for_sorts=data_for_sorts.drop(['portfolio_bm','portfolio_me'],axis=1)

In [22]:
data_for_sorts=data_for_sorts.merge(assign_portfolio(data_for_sorts,'me',5,['NYSE'])[['permno','portfolio_me','month']],left_on=['permno','month'],right_on=['permno','month'])

In [32]:
nyse=data_for_sorts.loc[data_for_sorts['exchange'].isin(['NYSE'])].groupby(['month','portfolio_me'])['bm'].describe(percentiles=np.linspace(0,1,5+1)[1:-1]).reset_index()

In [35]:
nyse.iloc[:,[0,1]+list(range(6,5+5+1))]

| | month | portfolio_me | 20% | 40% | 50% | 60% | 80% |
|---|---|---|---|---|---|---|---|
| 0 | 1960-08-01 | 1 | 3.043874 | 3.086235 | 3.107416 | 3.128596 | 3.170957 |
| 1 | 1960-08-01 | 2 | 0.892990 | 0.892990 | 0.892990 | 0.892990 | 0.892990 |
| 2 | 1960-08-01 | 3 | 1.551437 | 1.551437 | 1.551437 | 1.551437 | 1.551437 |
| 3 | 1960-08-01 | 4 | 1.095951 | 1.095951 | 1.095951 | 1.095951 | 1.095951 |
| 4 | 1960-08-01 | 5 | 1.298501 | 1.612112 | 1.768917 | 1.925723 | 2.239334 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3620 | 2020-12-01 | 1 | 0.467692 | 0.747937 | 0.914651 | 1.106672 | 1.689066 |
| 3621 | 2020-12-01 | 2 | 0.332199 | 0.516728 | 0.666623 | 0.777333 | 1.002481 |
| 3622 | 2020-12-01 | 3 | 0.228130 | 0.396968 | 0.468030 | 0.537173 | 0.803560 |
| 3623 | 2020-12-01 | 4 | 0.154653 | 0.267289 | 0.373745 | 0.487313 | 0.711881 |


In [36]:
nyse=nyse.iloc[:,[0,1]+list(range(6,5+5+1))].merge(data_for_sorts, how='inner',left_on=['month','portfolio_me'], right_on = ['month','portfolio_me'])

In [42]:
nyse['portfolio_bm']=nyse.apply( lambda x:bucket(x,'bm',5), axis=1)

In [44]:
data_for_sorts=nyse.copy()

In [46]:
# Same steps as for the independent sorts, now based on the dependent assignment
value_portfolios=data_for_sorts.dropna().copy()
value_portfolios['portfolio_combined']=value_portfolios.apply(lambda x:str(int(x['portfolio_bm'])) + '-' + str(int(x['portfolio_me'])),axis = 1)
value_premium=value_portfolios.groupby(['month','portfolio_combined']).apply(lambda x: pd.Series([np.average(x['ret_excess'], weights=x['mktcap_lag'])], 
                                                                index=['ret'])).reset_index()
value_premium['portfolio_bm']=value_premium.portfolio_combined.apply(lambda x:x[0])
value_premium=value_premium.groupby(['month', 'portfolio_bm']).ret.mean().unstack()
(value_premium.iloc[: , -1]-value_premium.iloc[: , 0]).mean()*100

0.25145431478579017

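The value premium from dependent sorts is about 0.25 percent per month, or 3.02 percent per year, compared with about 0.31 percent per month (3.68 percent per year) from the independent sorts above.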

Overall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. In the other, we create dependent portfolio sorts. In line with the previous chapter, we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.