# Regressions - Scaling

## Heteroscedasticity

One problem with heteroscedasticity is that variables can appear significant simply because they are driven by scale (firm size).

Let's regress market value (the log of market capitalization, winsorized) on the number of employees:


In [None]:
import pandas as pd
import numpy as np
from scipy.stats import mstats # for winsorize
import statsmodels.formula.api as smf # for OLS

df = pd.read_excel(r'..\datasets\Compustat-Funda.xlsx')

In [None]:
# filter: only keep if positive price, equity, assets and csho (number of shares outstanding)
df = df[ ( df['prcc_f'] > 0) & ( df['ceq'] > 0) & ( df['at'] > 0) & (df['csho'] > 0 ) ]

# add mtb, roa and size
df['mtb'] = df['prcc_f'] * df['csho'] / df['ceq']
df['roa'] = df['ni'] / df['at']
df['size'] = np.log ( df['prcc_f'] * df['csho'] )

# drop if any of these have missing values
df = df.dropna(subset=['mtb', 'roa', 'size'])
df.head()

In [None]:
# add winsorized variables to dataframe
df['mtb_w'] = mstats.winsorize(df['mtb'], limits=[0.01, 0.01]).data
df['roa_w'] = mstats.winsorize(df['roa'], limits=[0.01, 0.01]).data
df['size_w'] = mstats.winsorize(df['size'], limits=[0.01, 0.01]).data

In [None]:
# regression
lm = smf.ols("size_w ~ emp ", data=df).fit() 
print(str(lm.summary()))

## Choice of scalar

To deal with size effects driving regression results you need to select an appropriate scalar.

## Example: Earnings management

From 'Earnings management to avoid earnings decreases and losses' (Burgstahler and Dichev, 1997):

In [None]:
from IPython.display import Image
Image("images/scaling - burgstahler.png")

The variable displayed is earnings per share scaled by stock price. The discontinuity around 0 is often mentioned as evidence that managers manage earnings to avoid losses.

From Durtschi and Easton (Earnings Management? The Shapes of the Frequency Distributions of Earnings Metrics Are Not Evidence Ipso Facto):

In [None]:
Image("images/scaling - durtschi.png")

The above figure shows earnings per share unscaled; it does not have the same discontinuity around 0 (it does show some 'spikes' around 0.10, 0.20, 0.30 suggesting that managers prefer to manage up to round numbers).

In [None]:
Image("images/scaling - durtschi2.png")

The authors show that stock price is correlated with the sign of earnings: firms that have losses often have a low stock price, firms with profits tend to have a higher stock price. For the eps/p graph this means that data points on the 'loss' side (left side of y-axis) are pushed to the left (dividing by small price makes eps/p become more negative). Profits get scaled by a larger price, pushing these observations against the y-axis.
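The mechanism described above can be sketched with a small simulation. This is not the authors' data or method: the EPS distribution, the price levels (5 for loss firms, 20 for profit firms) and the bin widths are all hypothetical values chosen for illustration. EPS is drawn symmetrically around zero, so any asymmetry in eps/p comes from the scalar alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical EPS: symmetric around zero, so no discontinuity by construction
eps = rng.normal(loc=0.0, scale=1.0, size=n)

# Assumption: loss firms have a lower typical stock price than profit firms
base_price = np.where(eps < 0, 5.0, 20.0)
price = base_price * np.exp(rng.normal(0.0, 0.3, size=n))

eps_p = eps / price


def count_around_zero(x, width):
    """Count observations just below and just above zero."""
    left = int(np.sum((x >= -width) & (x < 0)))
    right = int(np.sum((x >= 0) & (x < width)))
    return left, right


raw_left, raw_right = count_around_zero(eps, 0.05)
scaled_left, scaled_right = count_around_zero(eps_p, 0.01)

print("unscaled eps (left, right of 0):", raw_left, raw_right)
print("eps / price  (left, right of 0):", scaled_left, scaled_right)
```

In this sketch the unscaled counts on either side of zero are roughly equal, while the scaled variable has far fewer observations just left of zero than just right of it.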

> The purpose of Durtschi and Easton is to highlight the importance of scaling, not to argue earnings management does not exist.


## Other issues with scaling

Suppose you have the following regression: `Y = a + bX + cZ + e`, where both 'Y' and 'Z' are correlated with size ('S').
That means heteroscedasticity is likely (larger values of Y and Z come with larger errors).

So, it looks like Y and Z need to be scaled: `Y/S = a + bX + cZ/S + e`
    
Scaling can solve heteroscedasticity problems, but can also introduce new problems (a 'catch-22').
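A minimal sketch of the first half of that statement, using simulated data rather than the Compustat sample. The model is a simplified, hypothetical one (a single regressor S, with an error term whose spread is proportional to S), and `resid_sd_ratio` is a helper introduced here for illustration. It compares the residual spread of large and small observations before and after dividing the whole equation by S.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical size variable S and an error whose spread grows with S
S = rng.lognormal(mean=0.0, sigma=1.0, size=n)
Y = 3.0 + 2.0 * S + S * rng.normal(size=n)


def resid_sd_ratio(x, y, size):
    """Fit OLS of y on x; return residual SD of the large-size half
    divided by residual SD of the small-size half."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    large = size > np.median(size)
    return resid[large].std() / resid[~large].std()


# Unscaled: Y on S
ratio_unscaled = resid_sd_ratio(S, Y, S)
# Scaled: Y/S on 1/S (every term of the equation divided by S)
ratio_scaled = resid_sd_ratio(1.0 / S, Y / S, S)

print("residual SD ratio, unscaled:", round(ratio_unscaled, 2))
print("residual SD ratio, scaled:  ", round(ratio_scaled, 2))
```

A ratio well above 1 indicates that errors are larger for large observations; a ratio near 1 indicates roughly constant error variance.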

## Example

Using the dataset constructed earlier, let's add a random variable R. Let's say R is something that a researcher collected, and the researcher is planning to write a paper on the relation between R and firm size.

"Hypothesis 1: Firm size is negatively correlated with 'whatever' (measured by a random number)"

In [None]:
# df.shape[0] is the number of rows in the dataframe df
# np.random.rand(df.shape[0]) returns a 1-D array of uniform random numbers in [0, 1)
df['R'] = np.random.rand(df.shape[0])
df['R_P'] = df['R'] / df['prcc_f']
df[['prcc_f','R', 'R_P']].describe()

### Testing the hypothesis

Let's run this regression:

`R/P = a + b Marketcap + e`

Where 'R/P' is the random variable R scaled by stock price and Marketcap is firm size (`size`, the log of market capitalization).

If the dependent variable (Y) and an independent variable (Z) are both scaled by S (for example sales, stock price or market cap), the coefficient c on Z/S may be significant even though Y and Z are unrelated. In this example, the independent variable (market cap) is not scaled. Nonetheless, the same issue can arise because the dependent variable is scaled.
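The case where both sides share a scalar can be illustrated with simulated data. In this sketch Y, Z and S are hypothetical variables drawn independently of each other (the uniform ranges are arbitrary), so any correlation between Y/S and Z/S is created by the common denominator.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical, mutually independent variables
Y = rng.uniform(1.0, 2.0, size=n)
Z = rng.uniform(1.0, 2.0, size=n)
S = rng.uniform(0.5, 2.0, size=n)

# Correlation before and after dividing both variables by the same scalar
corr_raw = np.corrcoef(Y, Z)[0, 1]
corr_scaled = np.corrcoef(Y / S, Z / S)[0, 1]

print("corr(Y, Z):    ", round(corr_raw, 3))
print("corr(Y/S, Z/S):", round(corr_scaled, 3))
```

The unscaled correlation is close to zero, while the scaled one is strongly positive, because both ratios are large whenever S is small.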


### Dependent variable: R/P

'R' is scaled by stock price (the researcher could also have scaled by assets, equity, the length of the annual report, etc).

In [None]:
# regression
lm = smf.ols("R_P ~ size ", data=df).fit() 
print(str(lm.summary()))

### Test for a relation with the scalar only

To check whether scaling produces spurious results (and how large the problem is), you can use 1 scaled by stock price as the dependent variable. The coefficient b then shows the bias introduced by scaling:
    
`1 / P = a + b Marketcap + e`
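
Why the 1/P regression is informative can be sketched in one line. Assume R is independent of both stock price and market cap (true for the random R constructed here, but only an assumption for real data). Writing $b_{R/P}$ and $b_{1/P}$ for the population slopes of the two regressions (labels introduced here for illustration):

$$
b_{R/P}
= \frac{\operatorname{Cov}(R/P,\ \text{Marketcap})}{\operatorname{Var}(\text{Marketcap})}
= E[R]\,\frac{\operatorname{Cov}(1/P,\ \text{Marketcap})}{\operatorname{Var}(\text{Marketcap})}
= E[R]\; b_{1/P}
$$

So a nonzero slope in the 1/P regression carries over to the R/P regression, multiplied by the mean of R, even though R itself is pure noise.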

### Dependent variable: 1/P

In [None]:
# one divided by stock price (just the inverse of the scalar)
df['one_P'] = 1 / df['prcc_f']
# regression
lm = smf.ols("one_P ~ size ", data=df).fit() 
print(str(lm.summary()))

### Dependent variable: R

In [None]:
# regression
lm = smf.ols("R ~ size ", data=df).fit() 
print(str(lm.summary()))

#### Rule of thumb

If the dependent variable is scaled, the coefficients in the model can reflect relations with both the numerator and the denominator. Any correlation between a regressor and the denominator will bias the coefficient (upward or downward).
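
The three regressions in this example can be reproduced on simulated data, which makes the rule of thumb easy to check without the Compustat file. Everything here is hypothetical: the lognormal price and share counts, the uniform R, and the helper `ols_slope`. Because R is independent of price in this sketch, the slope for R/P should be close to the mean of R (0.5) times the slope for 1/P.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical firm data: lognormal stock price and shares outstanding
log_price = rng.normal(3.0, 1.0, size=n)
log_shares = rng.normal(4.0, 1.0, size=n)
price = np.exp(log_price)
size = log_price + log_shares  # log of market cap

# Random variable unrelated to anything about the firm
R = rng.uniform(0.0, 1.0, size=n)


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    return np.polyfit(x, y, 1)[0]


slope_R_P = ols_slope(size, R / price)      # dependent variable: R/P
slope_one_P = ols_slope(size, 1.0 / price)  # dependent variable: 1/P
slope_R = ols_slope(size, R)                # dependent variable: R

print("slope, R/P on size:", slope_R_P)
print("slope, 1/P on size:", slope_one_P)
print("slope, R on size:  ", slope_R)
print("ratio (R/P slope) / (1/P slope):", slope_R_P / slope_one_P)
```

In this sketch the unscaled R has a slope near zero, while the R/P slope is clearly negative and equal to roughly half the 1/P slope: the apparent relation comes entirely from the denominator.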
    