# EPS ESTIMATE REVISION MOMENTUM MODELLING OVERVIEW

The stock market has its share of adages that sound pithy but often prove powerful and true in practice. One of the most well known is “Don’t fight the Fed”, which argues that investors should respect the trend set by the Federal Reserve's monetary policy decisions. Another, far less well known, concerns the power of a stock’s EPS trend: “Earnings trump and trend”. It reflects the observation that the best longer-term stock investments are often those where the investor is on the right side of the earnings trend. Both phrases speak to the power of trends in the stock market. This analysis focuses on the latter: we set out to test whether “Earnings trump and trend” holds up statistically and to model it effectively.

A few examples from well-known companies illustrate the point:

 •	Google – During the past 5 years, GOOGL is up 279% with over 80% of this return coming from earnings per share (EPS) growth.
 
 •	Domino’s Pizza – During the past 5 years, DPZ is up 228% with over 90% of this return coming from earnings per share (EPS) growth.  
 
 •	Home Depot – During the past 5 years, HD is up 219% with approximately 70% of this return coming from earnings per share (EPS) growth.
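The attribution in these examples can be approximated by splitting a price change into an EPS component and a valuation-multiple component, since price = EPS × P/E. The following is a minimal sketch assuming a log decomposition; the function name and the input numbers are hypothetical and are not the figures behind the examples above, whose exact method is not stated.

```python
import math

def eps_share_of_return(price_start, price_end, eps_start, eps_end):
    """Share of the log price return explained by EPS growth.

    Uses price = EPS * (P/E), so
    log(price ratio) = log(EPS ratio) + log(P/E ratio).
    Assumes positive prices and positive EPS at both dates.
    """
    log_price = math.log(price_end / price_start)
    log_eps = math.log(eps_end / eps_start)
    return log_eps / log_price

# Hypothetical inputs for illustration only (not actual company data)
share = eps_share_of_return(price_start=100.0, price_end=300.0,
                            eps_start=5.0, eps_end=12.5)
print(f"EPS growth share of return: {share:.0%}")
```

Whatever share is not explained by EPS growth is attributed to a change in the P/E multiple.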

Therefore, while getting the valuation “right” for a given stock is important, we would argue that getting the earnings “right” is even more important, hence the “trump” notion. Accompanying this is the power of trends, for which Newton’s first law of motion offers an analogy: “an object in motion stays in motion with the same speed and in the same direction unless acted upon by an unbalanced force”. We believe that such “momentum” is prevalent in the stock market and manifests itself in various ways, one of which is the operating performance / earnings trajectory of a given company. This can be seen in earnings trends that ebb and flow as industry dynamics, product cycles and similar factors strengthen and weaken over time.

These foundational and experiential beliefs serve as the basis of the forthcoming analysis, in which we use machine learning and related tools to analyze and model the earnings estimates of a basket of stocks. We call this exercise EPS estimate revision momentum modelling.

# BUSINESS UNDERSTANDING

The goal of this analysis is not only to use powerful algorithms to predict EPS but also to better understand the drivers behind such predictions. The variables and data set were gathered and organized with this goal in mind. The use case is multifaceted, serving both professional and individual investors. For the professional, this analysis could aid in the following: predicting EPS, which in turn could be combined with a target multiple and/or DCF to establish an estimated target price; comparing predicted EPS growth potential across a universe of stocks; comparing the model’s outcomes with human-driven modelling and analytical efforts; providing investment ideas based on the model’s output; and providing insight into the predictive power of individual metrics and groups of metrics, which can guide the focus of an analyst’s research efforts. For individual investors, the model’s predictions could serve as a means to generate buy and sell ideas.

#### ???? Goals – what outcomes (NEED HELP HERE)
#### ???? Accuracy in EPS predictions, determine if / which EPS variables have predictive price power, segmentation of data set besides by sector, drivers behind EPS predictions, 
#### ???? Measure a good outcome???

# DATA UNDERSTANDING

To begin, let’s provide some background information on some “what’s” and “why’s” of our analysis.

What are some of the reasons for using consensus earnings estimate data instead of historical or other measures?

As discussed above, the premise of our analysis is to model the consensus EPS estimates.  We chose to model the consensus EPS estimates for several reasons:

•	While historical EPS is typically reported only quarterly, consensus EPS estimates are updated much more frequently, so there is valuable insight to glean from their movement as information enters the market.

•	Crowdsource for insight; don’t reinvent the wheel. Our primary approach is to study those who know the subject (i.e. EPS trajectory) best, and then to glean insight from their work and improve upon it with statistical tools. While a few of our variables are unrelated to consensus EPS estimates, the majority seek to capture changes in them. We believe these estimates embed information from the two parties that arguably know each company best, and we use statistics to test that belief. One is the sell side analysts covering a given stock, whose livelihood depends on such skill. The other is company management, which is the source of the financial information fed to those analysts and whose livelihood is likewise tied to the company. Hence, we are not trying to outsmart them with other information but rather to capture their collective wisdom in a manner which is statistically efficacious.

What basket of stocks did we choose and why?

We narrowed the universe for this analysis to 400 stocks using Bloomberg and FactSet, based on the following parameters:

•	US domiciled and traded companies

•	The real estate sector was excluded because such stocks do not “trade” based on EPS (but rather on funds from operations, or FFO).

•	Market cap greater than 1.5 billion.

•	We set a floor on the number of analysts covering each stock as greater than 4 so that there is sufficient estimate data available for comparability purposes.

•	At least 10 years of trading history, so that we can compute the desired metrics and compare accordingly.

•	Five year EPS growth must be above zero and the R-squared (historical EPS vs time) must be above 0.1. This further narrows the universe to companies that are deemed higher quality and have had a more stable historical EPS profile, which we expect to aid our ability to analyze and model consensus EPS. On the other side of the coin are companies with extremely volatile EPS profiles, at times in negative territory, which understandably would be much more difficult and less insightful to model.
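The growth and R-squared screen in the last bullet can be sketched as follows. This is only an illustration under stated assumptions: the actual screen was run in Bloomberg and FactSet, the helper `eps_r_squared` is a hypothetical name, and the EPS histories are made-up values.

```python
import numpy as np
import pandas as pd

def eps_r_squared(eps: pd.Series) -> float:
    """R-squared of a straight-line fit of EPS against time (0, 1, 2, ...)."""
    y = eps.dropna().to_numpy(dtype=float)
    if len(y) < 3 or np.std(y) == 0:
        return float("nan")
    t = np.arange(len(y))
    r = np.corrcoef(t, y)[0, 1]  # for a simple linear fit, R^2 = r^2
    return float(r ** 2)

# Hypothetical EPS histories (illustrative values only)
history = pd.DataFrame({
    "STEADY": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    "CHOPPY": [1.0, 0.2, 1.4, 0.1, 1.2, 0.6],
})

r2 = history.apply(eps_r_squared)
growth = history.iloc[-1] / history.iloc[0] - 1

# Keep only names with positive growth AND a reasonably linear EPS path
passes = (growth > 0) & (r2 > 0.1)
print(pd.DataFrame({"growth": growth, "r2": r2, "passes": passes}))
```

A steadily rising series passes both tests, while a volatile series with no clear trend fails.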

What was done to enable comparability across the basket of stocks given they are all unique businesses and grow at different rates?

•	We normalized some of the variables in a way that ensures comparability (more details on this below). Because some companies have grown faster than others, we included both absolute growth levels and relative / proportional levels. The analyst estimate figures are already comparable across the universe and thus needed no adjustment.

What explanatory variables did we include and why?

We gathered weekly data for the past five years (12/30/16 - 10/1/21) and assigned each explanatory variable to one of six categories: GROWTH, STABILITY, ANALYST REVISIONS, INCOME STATEMENT, RETURNS and OTHER. In total there are 52 explanatory variables, 7 potential response variables and ~99k rows of data. Note that we stopped the data at 10/1/21 so that the response variables, which depend on forward-looking values, can be calculated. Details on each category and the variables included are as follows:

#### GROWTH
Consensus (i.e. sell side analysts) EPS median estimates trends:

Annualized growth rate for each of the following time periods on a daily basis for the past 5 year period (1 month, 2 month, 3 month, 6 month, 1 year, 3 year and 5 year)

•	Labeled as “Best_EPS_##” in the data set.

Percent relative to their respective 5 year growth rate for all the aforementioned time periods except the 5 year.  The 5 year growth rate approximates a company's longer term growth rate thus this ratio captures the current trend relative to the longer term.

•	Labeled as “Best_EPS_##_v5Y”.
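The growth and ratio variables above can be sketched as follows. The annualization method is not stated in the text, so this sketch assumes simple linear scaling; the summary statistics later in the notebook (for example, a 1-month minimum near -12 and a 3-month minimum near -4) look consistent with that, but this is an inference. The function name, the 52-weeks-per-year constant and the EPS values are hypothetical.

```python
import pandas as pd

WEEKS_PER_YEAR = 52  # assumption: one observation per week

def annualized_growth(eps: pd.Series, weeks: int) -> pd.Series:
    """Trailing growth over `weeks` observations, scaled linearly to one year."""
    period_growth = eps / eps.shift(weeks) - 1
    return period_growth * (WEEKS_PER_YEAR / weeks)

# Hypothetical weekly consensus EPS for a single stock (illustrative values only)
eps = pd.Series([1.0 + 0.01 * i for i in range(270)])

g_3m = annualized_growth(eps, weeks=13)    # analogous to Best_EPS_3M
g_5y = annualized_growth(eps, weeks=260)   # analogous to Best_EPS_5Y
ratio_3m_v5y = g_3m / g_5y                 # analogous to Best_EPS_3M_v5Y

print(pd.DataFrame({"g_3m": g_3m, "g_5y": g_5y, "ratio": ratio_3m_v5y}).tail(3))
```

Because the ratio divides by the 5 year growth rate, it becomes unstable when that rate is close to zero, which may explain the extreme minimum and maximum values of the `_v5Y` columns in the summary statistics.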

3M, 6M and 1Y growth rate ranks within our universe thus aiding in comparability.

•	Labeled as “Best_EPS_##_Rank”.

A continuous variable and a classification variable, both seeking to capture short-term EPS acceleration, which is deemed attractive because EPS is not only improving but doing so at an increasing rate. For the continuous variable we averaged the aforementioned EPS growth ranks (3M, 6M and 1Y) into a single rank. For the classification variable we flagged the stocks meeting the following criteria: EPS growth of 3M > 6M AND 6M > 1Y.

•	Labeled as "ST_Accel" and "ST_Accel_Class".
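A minimal sketch of the rank and acceleration variables follows, assuming ranks are percentile ranks computed across the universe on each date and that ST_Accel is the average rank scaled to 0-100 (which matches the sample rows shown later in the notebook). The tickers and growth values are hypothetical, and the exact ranking convention used in the source data may differ.

```python
import pandas as pd

# Hypothetical cross-section of four stocks on a single date
snap = pd.DataFrame({
    "Date": ["2021-10-01"] * 4,
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "Best_EPS_3M": [0.40, 0.10, 0.05, -0.20],
    "Best_EPS_6M": [0.30, 0.12, 0.08, -0.10],
    "Best_EPS_1Y": [0.20, 0.15, 0.02, 0.05],
})

growth_cols = ["Best_EPS_3M", "Best_EPS_6M", "Best_EPS_1Y"]
rank_cols = [c + "_Rank" for c in growth_cols]

# Percentile rank of each stock within its date's cross-section
for col in growth_cols:
    snap[col + "_Rank"] = snap.groupby("Date")[col].rank(pct=True)

# Continuous measure: average of the three ranks on a 0-100 scale
snap["ST_Accel"] = 100 * snap[rank_cols].mean(axis=1)

# Classification measure: 1 when growth is accelerating (3M > 6M > 1Y)
accelerating = (snap["Best_EPS_3M"] > snap["Best_EPS_6M"]) & (
    snap["Best_EPS_6M"] > snap["Best_EPS_1Y"]
)
snap["ST_Accel_Class"] = accelerating.astype(int)

print(snap[["ticker", "ST_Accel", "ST_Accel_Class"]])
```

Only the first hypothetical stock satisfies both inequalities, so it is the only one flagged as accelerating.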

#### STABILITY
Stability measures seek to capture the stability of EPS which not only aids in modelling efforts but also evidences confidence in the outlook of a company by analysts and management that provide much of the underlying information.

R-squared of the weekly 5 year EPS figures.  Both the value and the rank of this measure.

•	Labeled as “Best_EPS_5Y_R2" and “Best_EPS_5Y_R2_Rank".

Standard deviation of the weekly 5 year EPS figures.  

•	Labeled as “Best_EPS_5Y_SD".

FactSet derived measure of EPS stability defined as: measuring the consistency for an estimate item over the past 5 years.

•	Labeled as “Best_EPS_5Y_Stability".

#### ANALYST REVISIONS
See details above regarding crowdsourcing for the reasoning behind these revisions metrics.

Sell side analyst EPS revisions (Upward, Downward and Unchanged).  A “revision” is a change (regardless of magnitude) in an analyst’s estimate during the past 3 months for a company’s EPS for the next 12 month period.  Calculated the % of total revisions for each metric.

•	Labeled as “An_Up", "An_Down", "An_Unch".

We included the current value as well as the change in each of these variables on a 3- and 6-month basis and labeled them similarly to the previous metrics but added “_#M” on the end.  This captures second derivative changes in the revisions metrics.

Analyst revision ratios	
We pulled a predetermined metric labeled “Mark_Rev” from FactSet which seeks to quantify the relative trend in the analyst revisions with the lower being better.

•	Labeled as “An_Mark"

We also created our own metric labeled as “Net_Est_Rev_Ratio” which equates to the following:  (Upward Revisions – Downward Revisions) / Total Revisions.  This ratio ranges from -100 to 100 and seeks to capture the change in such revision on a proportion basis during the past 3 month period.

•	Labeled as “NRR"

For both of these metrics we also created variables capturing the change in Net_Est_Rev_Ratio and Mark_Rev on a 3- and 6-month basis. As above, these capture second derivative changes and are labeled like the previous metrics with “_#M” added on the end.
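The net revision ratio and its 3- and 6-month changes can be sketched as follows. This assumes weekly observations, with 13 and 26 weeks standing in for 3 and 6 months, and that total revisions means upward plus downward plus unchanged; the revision counts are hypothetical.

```python
import pandas as pd

def net_revision_ratio(up: pd.Series, down: pd.Series, unchanged: pd.Series) -> pd.Series:
    """(Upward - Downward) / Total revisions, expressed on a -100 to 100 scale."""
    total = up + down + unchanged
    # Leave the ratio undefined (NaN) where there are no estimates at all
    return (100 * (up - down) / total).where(total > 0)

# Hypothetical weekly revision counts for one stock (illustrative values only)
rev = pd.DataFrame({
    "up": [5] * 15 + [8] * 15,
    "down": [3] * 30,
    "unch": [2] * 30,
})

rev["NRR"] = net_revision_ratio(rev["up"], rev["down"], rev["unch"])
rev["NRR_3M"] = rev["NRR"].diff(13)   # change vs roughly 3 months ago
rev["NRR_6M"] = rev["NRR"].diff(26)   # change vs roughly 6 months ago

print(rev.iloc[[0, 15, 26]])
```

On the full panel, the differences would need to be taken within each ticker (for example with a `groupby` on the ticker column) so that one stock's history does not bleed into the next.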

#### RETURNS
Technical analysis (aka price momentum) presumes that price leads fundamentals, as the collective market begins to price in changes in fundamentals before such changes become quantifiable. In hopes of capturing some of the collective market’s wisdom, following the same rationale as above regarding analyst and management information, we included relative return data on a 1-, 3- and 6-month basis. This was calculated by subtracting the return of the S&P 500 Equal Weighted Index from each stock's return over the same period, which removes noise caused by overall market movements and gives a cleaner measure of a stock's performance.

•	Labeled as “Return_#M".
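The relative return calculation can be sketched as follows, assuming weekly prices, 4 weeks as a stand-in for 1 month, and returns expressed in percent. The function name and price series are hypothetical.

```python
import pandas as pd

def relative_return(stock_px: pd.Series, index_px: pd.Series, weeks: int) -> pd.Series:
    """Stock return minus index return over the same trailing window, in percent."""
    stock_ret = (stock_px / stock_px.shift(weeks) - 1) * 100
    index_ret = (index_px / index_px.shift(weeks) - 1) * 100
    return stock_ret - index_ret

# Hypothetical weekly closing levels (illustrative values only)
stock = pd.Series([100.0, 102.0, 104.0, 106.0, 110.0])
index = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])

rel_1m = relative_return(stock, index, weeks=4)
print(rel_1m)
```

In this toy case the stock gains 10% while the index gains 4%, so the last value is a relative return of 6 percentage points.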

Additionally, with the aforementioned technical analysis view in mind, we also included a relative price momentum measure which is defined as the change over the last 6 months in the one month moving average relative to the index.

•	Labeled as “Rel_PMO".

#### OTHER
Included Market Capitalization values and Sector classification variables.

•	Labeled as “Mkt_Cap" and "Sector".

## EXPLANATORY VARIABLES SUMMARY (51 total)

Best_EPS_1M = 1 month trailing consensus EPS growth

Best_EPS_3M = 3 month trailing consensus EPS growth

Best_EPS_6M = 6 month trailing consensus EPS growth

Best_EPS_1Y = 1 year trailing consensus EPS growth

Best_EPS_3Y = 3 year trailing consensus EPS growth

Best_EPS_5Y = 5 year trailing consensus EPS growth

Best_EPS_1M_v5Y = 1 month / 5 year trailing EPS ratio

Best_EPS_3M_v5Y = 3 month / 5 year trailing EPS ratio

Best_EPS_6M_v5Y = 6 month / 5 year trailing EPS ratio

Best_EPS_1Y_v5Y = 1 year / 5 year trailing EPS ratio

Best_EPS_3Y_v5Y = 3 year / 5 year trailing EPS ratio

Best_EPS_1M_Rank = 1 month trailing EPS rank within this 400 stock universe

Best_EPS_3M_Rank = 3 month trailing EPS rank within this 400 stock universe

Best_EPS_6M_Rank = 6 month trailing EPS rank within this 400 stock universe

Best_EPS_1Y_Rank = 1 year trailing EPS rank within this 400 stock universe

ST_Accel = average of Best_EPS_3M_Rank, Best_EPS_6M_Rank and Best_EPS_1Y_Rank

ST_Accel_Class = binary measure (1 or 0) if Best_EPS_3M > Best_EPS_6M AND Best_EPS_6M > Best_EPS_1Y or not

Best_EPS_5Y_R2 = 5 year R-squared of weekly EPS

Best_EPS_5Y_R2_Rank =  5 year r-squared EPS rank within this 400 stock universe

Best_EPS_5Y_SD =  5 year standard deviation of weekly EPS

Best_EPS_5Y_Stability = FactSet calculated stability of 5 year weekly EPS

An_Up = % of analysts that revised their EPS estimate UP during the past 3 months

An_Down = % of analysts that revised their EPS estimate DOWN during the past 3 months

An_Unch = % of analysts that left their EPS estimate UNCHANGED during the past 3 months

An_Mark = FactSet calculated analyst revision measure

NRR = "An_Up" minus "An_Down"

An_Up_3M = "An_Up" minus "An_Up" 3 months ago

An_Down_3M = "An_Down" minus "An_Down" 3 months ago

An_Unch_3M = "An_Unch" minus "An_Unch" 3 months ago

An_Mark_3M = "An_Mark" minus "An_Mark" 3 months ago

NRR_3M = "NRR" minus "NRR" 3 months ago

An_Up_6M = "An_Up" minus "An_Up" 6 months ago

An_Down_6M = "An_Down" minus "An_Down" 6 months ago

An_Unch_6M = "An_Unch" minus "An_Unch" 6 months ago

An_Mark_6M = "An_Mark" minus "An_Mark" 6 months ago

NRR_6M = "NRR" minus "NRR" 6 months ago

ROIC = trailing 12 month return on invested capital (ROIC)

ROIC_1Y_Chg = 1 year change in ROIC

ROIC_SD = 5 year standard deviation of ROIC

ROE = trailing 12 month return on equity (ROE)

ROE_1Y_Chg = 1 year change in ROE

ROE_SD = 5 year standard deviation of ROE

FCF_Mgn = trailing 12 month free cash flow margin (FCF margin)

FCF_Mgn_1Y_Chg = 1 year change in FCF margin

FCF_Mgn_SD = 5 year standard deviation of FCF margin

Op_Mgn = trailing 12 month operating margin (Op margin)

Op_Mgn_1Y_Chg = 1 year change in Op margin

Op_Mgn_SD = 5 year standard deviation of Op margin

Return_1M = 1 month trailing relative price return

Return_3M = 3 month trailing relative price return

Return_6M = 6 month trailing relative price return

Rel_PMO = relative price momentum

Mkt_Cap = current market capitalization

Sector = GICS sector classification

## RESPONSE VARIABLES SUMMARY (7 total)

Fwd_Best_EPS_6M = 6 month FORWARD consensus EPS growth

Fwd_Best_EPS_6M_v5Y = 6 month FORWARD consensus EPS growth / 5 year trailing consensus EPS growth

Fwd_ST_Accel_3M = FORWARD average of Best_EPS_3M_Rank, Best_EPS_6M_Rank and Best_EPS_1Y_Rank

Fwd_ST_Accel_Class_3M = FORWARD binary measure (1 or 0) if Best_EPS_3M > Best_EPS_6M AND Best_EPS_6M > Best_EPS_1Y or not

Fwd_Return_1M = 1 month FORWARD price return

Fwd_Return_3M = 3 month FORWARD price return

Fwd_Return_6M = 6 month FORWARD price return
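Forward-looking response variables of this kind are typically built by shifting each stock's own series backwards in time, which is also why the last few observations per stock have no value. A minimal sketch, assuming a long-format panel with one row per date and ticker; the `price` column, the 2-week horizon and the values are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

def forward_return(panel: pd.DataFrame, weeks: int, price_col: str = "price") -> pd.Series:
    """Percent return from each row's date to `weeks` observations later, per ticker."""
    ordered = panel.sort_values(["ticker", "Date"])
    # shift(-weeks) within each ticker pulls the future price onto today's row
    future_price = ordered.groupby("ticker")[price_col].shift(-weeks)
    return (future_price / ordered[price_col] - 1) * 100

dates = pd.date_range("2021-01-01", periods=6, freq="W-FRI")
panel = pd.DataFrame({
    "Date": list(dates) * 2,
    "ticker": ["AAA"] * 6 + ["BBB"] * 6,
    "price": [10, 11, 12, 13, 14, 15, 20, 20, 20, 22, 22, 22],
})

panel["Fwd_Return_2W"] = forward_return(panel, weeks=2)
print(panel)
```

Grouping by ticker before shifting keeps one stock's future prices from being matched to another stock's rows, and the final `weeks` rows of each ticker are left as NaN because their forward window extends past the end of the data.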

## DATA QUALITY

•	FactSet was used to gather all of the data. As SMU students, we are provided free FactSet licenses upon request via the business library during our time in the program. We attempted to get the data via FactSet's API, but our student license did not provide such access, so we used the FactSet Excel add-in. Given FactSet's reputation in the marketplace, we have high confidence in the accuracy of the data gathered. After gathering the data in Excel, we used R code to format it and create the summary CSV files used in our analysis.

In [1]:
# read in data and packages
import pandas as pd
import numpy as np
import bamboolib as bam

df = pd.read_csv('FS_DATA_ALL_ML_ADJ_5Y.csv')

In [2]:
# quick summary view of data set
df.head()

Unnamed: 0,Date,ticker,Fwd_Best_EPS_6M,Fwd_Best_EPS_6M_v5Y,Fwd_ST_Accel_3M,Fwd_ST_Accel_Class_3M,Fwd_Return_6M,Fwd_Return_3M,Fwd_Return_1M,Mkt_Cap,...,FCF_Mgn,FCF_Mgn_1Y_Chg,FCF_Mgn_SD,Op_Mgn,Op_Mgn_1Y_Chg,Op_Mgn_SD,Return_1M,Return_3M,Return_6M,Rel_PMO
0,12/30/2016,AAPL-US,-0.016685,-0.248072,24.5,0.0,18.352664,19.648291,2.777767,608683.06,...,24.352012,-5.922154,2.692903,26.63497041,-2.173891,2.801545,4.206145,-0.331259,14.932847,11.204
1,12/30/2016,MSFT-US,0.054237,2.036492,26.7,0.0,5.182207,1.703727,3.342819,480342.2,...,32.736822,6.862099,4.760858,26.62114882,-2.56037,4.109913,3.697074,5.283427,15.901137,14.9649
2,12/30/2016,GOOGL-US,-0.045126,-0.396018,10.866667,0.0,10.275401,2.061749,4.120195,547815.2,...,28.77871,8.073658,4.200644,25.8288478,1.499313,2.621106,2.480841,-4.75713,4.441488,3.776
3,12/30/2016,AMZN-US,0.819502,0.817332,91.666667,0.0,22.04727,13.30291,8.940398,357688.0,...,7.137447,1.771002,2.150023,3.200305912,1.321354,0.867805,0.10668,-13.756174,-3.798473,-1.211
4,12/30/2016,TSLA-US,,,,0.0,62.17998,25.312459,15.857481,34523.973,...,-22.346721,35.023011,91.4723,-10.19323636,3.82712,77.175347,16.57443,1.421392,-8.429813,-13.237


In [3]:
# summary of variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104800 entries, 0 to 104799
Data columns (total 61 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Date                   99600 non-null  object 
 1   ticker                 99600 non-null  object 
 2   Fwd_Best_EPS_6M        93183 non-null  float64
 3   Fwd_Best_EPS_6M_v5Y    87723 non-null  float64
 4   Fwd_ST_Accel_3M        97928 non-null  float64
 5   Fwd_ST_Accel_Class_3M  99600 non-null  float64
 6   Fwd_Return_6M          94400 non-null  float64
 7   Fwd_Return_3M          99600 non-null  float64
 8   Fwd_Return_1M          99600 non-null  float64
 9   Mkt_Cap                99600 non-null  float64
 10  Sector                 99600 non-null  object 
 11  Best_EPS_1M            98671 non-null  float64
 12  Best_EPS_3M            98503 non-null  float64
 13  Best_EPS_6M            98312 non-null  float64
 14  Best_EPS_1Y            98231 non-null  float64
 15  

In [3]:
# change variable types accordingly
df['Op_Mgn'] = pd.to_numeric(df['Op_Mgn'], downcast='float', errors='coerce')
df['Fwd_ST_Accel_Class_3M'] = df['Fwd_ST_Accel_Class_3M'].astype('category')
df['ST_Accel_Class'] = df['ST_Accel_Class'].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104800 entries, 0 to 104799
Data columns (total 60 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Date                   99600 non-null  object  
 1   ticker                 99600 non-null  object  
 2   Fwd_Best_EPS_6M        93183 non-null  float64 
 3   Fwd_Best_EPS_6M_v5Y    82523 non-null  float64 
 4   Fwd_ST_Accel_3M        97928 non-null  float64 
 5   Fwd_ST_Accel_Class_3M  99600 non-null  category
 6   Fwd_Return_6M          94400 non-null  float64 
 7   Fwd_Return_3M          99600 non-null  float64 
 8   Fwd_Return_1M          99600 non-null  float64 
 9   Mkt_Cap                99600 non-null  float64 
 10  Sector                 99600 non-null  object  
 11  Best_EPS_1M            98671 non-null  float64 
 12  Best_EPS_3M            98503 non-null  float64 
 13  Best_EPS_6M            98312 non-null  float64 
 14  Best_EPS_1Y            98231 non-nul

In [4]:
# Remove columns: Date, Ticker
# These columns will not aid in modelling efforts
if 'Date' in df:
    del df['Date']
    
if 'ticker' in df:
    del df['ticker']
    
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104800 entries, 0 to 104799
Data columns (total 58 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Fwd_Best_EPS_6M        93183 non-null  float64 
 1   Fwd_Best_EPS_6M_v5Y    82523 non-null  float64 
 2   Fwd_ST_Accel_3M        97928 non-null  float64 
 3   Fwd_ST_Accel_Class_3M  99600 non-null  category
 4   Fwd_Return_6M          94400 non-null  float64 
 5   Fwd_Return_3M          99600 non-null  float64 
 6   Fwd_Return_1M          99600 non-null  float64 
 7   Mkt_Cap                99600 non-null  float64 
 8   Sector                 99600 non-null  object  
 9   Best_EPS_1M            98671 non-null  float64 
 10  Best_EPS_3M            98503 non-null  float64 
 11  Best_EPS_6M            98312 non-null  float64 
 12  Best_EPS_1Y            98231 non-null  float64 
 13  Best_EPS_3Y            97953 non-null  float64 
 14  Best_EPS_5Y            96924 non-nul

In [8]:
# segment data based on characteristics / category to aid in analysis
# .copy() makes each subset an independent frame, so later in-place edits
# (e.g. fillna) do not trigger pandas' SettingWithCopyWarning

# Response variables
df_response = df[['Fwd_Best_EPS_6M', 'Fwd_Best_EPS_6M_v5Y', 'Fwd_ST_Accel_3M', 'Fwd_ST_Accel_Class_3M', 'Fwd_Return_6M', 'Fwd_Return_3M', 'Fwd_Return_1M']].copy()

# EPS growth variables
df_growth = df[['Best_EPS_1M', 'Best_EPS_3M', 'Best_EPS_6M', 'Best_EPS_1Y', 'Best_EPS_3Y', 'Best_EPS_5Y', 'Best_EPS_3M_v5Y', 'Best_EPS_6M_v5Y', 'Best_EPS_1Y_v5Y', 'Best_EPS_3Y_v5Y', 'Best_EPS_3M_Rank', 'Best_EPS_6M_Rank', 'Best_EPS_1Y_Rank', 'ST_Accel', 'ST_Accel_Class']].copy()

# EPS stability variables
df_stability = df[['Best_EPS_5Y_R2', 'Best_EPS_5Y_R2_Rank', 'Best_EPS_5Y_SD', 'Best_EPS_5Y_Stability']].copy()

# Analyst Revision variables
df_revisions = df[['An_Up', 'An_Down', 'An_Unch', 'An_Mark', 'NRR', 'An_Up_3M', 'An_Down_3M', 'An_Unch_3M', 'An_Mark_3M', 'NRR_3M', 'An_Up_6M', 'An_Down_6M', 'An_Unch_6M', 'An_Mark_6M', 'NRR_6M']].copy()

# Income statement variables
df_inc_stmt = df[['ROIC', 'ROIC_1Y_Chg', 'ROIC_SD', 'ROE', 'ROE_1Y_Chg', 'ROE_SD', 'FCF_Mgn', 'FCF_Mgn_1Y_Chg', 'FCF_Mgn_SD', 'Op_Mgn', 'Op_Mgn_1Y_Chg', 'Op_Mgn_SD']].copy()

# Return variables (only the last expression in a cell is displayed)
df_return = df[['Return_1M', 'Return_3M', 'Return_6M', 'Rel_PMO']].copy()
df_return

Unnamed: 0,Return_1M,Return_3M,Return_6M,Rel_PMO
0,4.206145,-0.331259,14.932847,11.2040
1,3.697074,5.283427,15.901137,14.9649
2,2.480841,-4.757130,4.441488,3.7760
3,0.106680,-13.756174,-3.798473,-1.2110
4,16.574430,1.421392,-8.429813,-13.2370
...,...,...,...,...
104795,,,,
104796,,,,
104797,,,,
104798,,,,


In [None]:
######################## I stopped here, the rest is "notes"############################

### EPS Growth variables analysis

In [7]:
# check for NAs - EPS growth variables
df_growth.isna().sum()

Best_EPS_1M          6129
Best_EPS_3M          6297
Best_EPS_6M          6488
Best_EPS_1Y          6569
Best_EPS_3Y          6847
Best_EPS_5Y          7876
Best_EPS_3M_v5Y     18719
Best_EPS_6M_v5Y     18815
Best_EPS_1Y_v5Y     18851
Best_EPS_3Y_v5Y     18863
Best_EPS_3M_Rank     6297
Best_EPS_6M_Rank     6488
Best_EPS_1Y_Rank     6569
ST_Accel             6820
ST_Accel_Class       5200
dtype: int64

In [8]:
# replace N/A values in EPS growth variables with median values
# THOUGHTS ON THIS??????
# Note: rows with no underlying data at all (e.g. the trailing rows in the output
# below) also receive the column medians, so they become synthetic "median" rows.
num_cols = [c for c in df_growth.columns if c != 'ST_Accel_Class']
df_growth[num_cols] = df_growth[num_cols].fillna(df_growth[num_cols].median())
# ST_Accel_Class is categorical, so missing values are set to 0 rather than a median
df_growth['ST_Accel_Class'] = df_growth['ST_Accel_Class'].fillna(0)
df_growth

Unnamed: 0,Best_EPS_1M,Best_EPS_3M,Best_EPS_6M,Best_EPS_1Y,Best_EPS_3Y,Best_EPS_5Y,Best_EPS_3M_v5Y,Best_EPS_6M_v5Y,Best_EPS_1Y_v5Y,Best_EPS_3Y_v5Y,Best_EPS_3M_Rank,Best_EPS_6M_Rank,Best_EPS_1Y_Rank,ST_Accel,ST_Accel_Class
0,0.00000,0.34825,0.17992,-0.07653,0.14663,0.16376,2.12659,1.09871,-0.46732,0.89540,0.898,0.825,0.119,61.40000,0.0
1,0.00000,0.08304,0.20974,0.07273,0.03844,0.01533,5.41770,13.68290,4.74459,2.50758,0.701,0.845,0.438,66.13333,0.0
2,0.00695,0.04446,0.05988,0.19207,0.18900,0.17332,0.25651,0.34551,1.10819,1.09048,0.597,0.620,0.757,65.80000,0.0
3,-0.04959,-0.72109,-0.25045,1.51042,1.82327,0.62393,-1.15572,-0.40141,2.42080,2.92222,0.035,0.083,0.992,37.00000,0.0
4,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104795,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104796,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104797,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104798,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
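On the question of whether a single global median is appropriate: one alternative, sketched below, is to drop rows that have no date at all and then fill remaining gaps with the median of the same date's cross-section, so that a missing value is replaced by what was typical across the universe that week. This is only a sketch under assumptions: it would need to run before the Date column is removed, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mini panel: two dates plus one structurally empty row
toy = pd.DataFrame({
    "Date": ["d1", "d1", "d1", "d2", "d2", "d2", None],
    "Best_EPS_3M": [0.1, np.nan, 0.3, 1.0, 2.0, np.nan, np.nan],
})

# Rows without a date carry no information, so drop them instead of imputing
toy = toy.dropna(subset=["Date"]).copy()

# Fill remaining gaps with the median of the same date's cross-section
toy["Best_EPS_3M"] = toy.groupby("Date")["Best_EPS_3M"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

Whether this is preferable depends on the modelling goal; a per-date median avoids mixing different market regimes but still manufactures observations, so dropping incomplete rows is another reasonable option.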


In [9]:
# re-check for NAs - EPS growth variables
df_growth.isna().sum()

Best_EPS_1M         0
Best_EPS_3M         0
Best_EPS_6M         0
Best_EPS_1Y         0
Best_EPS_3Y         0
Best_EPS_5Y         0
Best_EPS_3M_v5Y     0
Best_EPS_6M_v5Y     0
Best_EPS_1Y_v5Y     0
Best_EPS_3Y_v5Y     0
Best_EPS_3M_Rank    0
Best_EPS_6M_Rank    0
Best_EPS_1Y_Rank    0
ST_Accel            0
ST_Accel_Class      0
dtype: int64

In [10]:
# check summary statistics - EPS growth variables
df_growth.describe()

Unnamed: 0,Best_EPS_1M,Best_EPS_3M,Best_EPS_6M,Best_EPS_1Y,Best_EPS_3Y,Best_EPS_5Y,Best_EPS_3M_v5Y,Best_EPS_6M_v5Y,Best_EPS_1Y_v5Y,Best_EPS_3Y_v5Y,Best_EPS_3M_Rank,Best_EPS_6M_Rank,Best_EPS_1Y_Rank,ST_Accel
count,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0,104800.0
mean,0.239005,0.269546,0.266599,0.224823,0.201332,0.268976,2.037869,2.278274,1.571484,1.009913,0.498633,0.49946,0.499516,49.818829
std,2.956148,2.19958,1.80468,1.415461,0.927354,2.381394,102.483534,77.023768,132.430055,63.941699,0.281217,0.280347,0.280197,23.846909
min,-11.86364,-3.96135,-1.98982,-0.99768,-0.33293,-0.19984,-9683.7525,-7607.20485,-25788.35065,-14241.61503,0.0,0.0,0.0,0.0
25%,0.0,-0.00722,0.0,0.03153,0.05873,0.06563,0.02461,0.131805,0.36695,0.6272,0.262,0.265,0.266,32.591667
50%,0.0,0.07985,0.1194,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.8643,0.5,0.5,0.5,49.73333
75%,0.07317,0.331272,0.32064,0.26154,0.2248,0.23347,1.706635,1.70176,1.362355,1.103765,0.734,0.734,0.733,67.23333
max,288.38585,284.0,258.0,214.0,110.33333,174.74752,12806.23224,11490.19231,15660.90909,1028.57143,1.0,1.0,1.0,100.0


In [11]:
df_growth

Unnamed: 0,Best_EPS_1M,Best_EPS_3M,Best_EPS_6M,Best_EPS_1Y,Best_EPS_3Y,Best_EPS_5Y,Best_EPS_3M_v5Y,Best_EPS_6M_v5Y,Best_EPS_1Y_v5Y,Best_EPS_3Y_v5Y,Best_EPS_3M_Rank,Best_EPS_6M_Rank,Best_EPS_1Y_Rank,ST_Accel,ST_Accel_Class
0,0.00000,0.34825,0.17992,-0.07653,0.14663,0.16376,2.12659,1.09871,-0.46732,0.89540,0.898,0.825,0.119,61.40000,0.0
1,0.00000,0.08304,0.20974,0.07273,0.03844,0.01533,5.41770,13.68290,4.74459,2.50758,0.701,0.845,0.438,66.13333,0.0
2,0.00695,0.04446,0.05988,0.19207,0.18900,0.17332,0.25651,0.34551,1.10819,1.09048,0.597,0.620,0.757,65.80000,0.0
3,-0.04959,-0.72109,-0.25045,1.51042,1.82327,0.62393,-1.15572,-0.40141,2.42080,2.92222,0.035,0.083,0.992,37.00000,0.0
4,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104795,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104796,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104797,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104798,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0


In [16]:
# outliers
# check box plots and percentiles?????

# set range for EPS growth figures (annualized) to +/- 200%
growth_cols = ['Best_EPS_1M', 'Best_EPS_3M', 'Best_EPS_6M', 'Best_EPS_1Y', 'Best_EPS_3Y', 'Best_EPS_5Y']
# keep a row only if every growth column lies strictly inside (-2, 2)
in_range = ((df_growth[growth_cols] > -2) & (df_growth[growth_cols] < 2)).all(axis=1)
df_growth = df_growth.loc[in_range]

# set range for EPS growth figures vs 5Y average to +/- ?????
#ratio_cols = ['Best_EPS_3M_v5Y', 'Best_EPS_6M_v5Y', 'Best_EPS_1Y_v5Y', 'Best_EPS_3Y_v5Y']
#in_range = ((df_growth[ratio_cols] > -5) & (df_growth[ratio_cols] < 5)).all(axis=1)
#df_growth = df_growth.loc[in_range]

df_growth

Unnamed: 0,Best_EPS_1M,Best_EPS_3M,Best_EPS_6M,Best_EPS_1Y,Best_EPS_3Y,Best_EPS_5Y,Best_EPS_3M_v5Y,Best_EPS_6M_v5Y,Best_EPS_1Y_v5Y,Best_EPS_3Y_v5Y,Best_EPS_3M_Rank,Best_EPS_6M_Rank,Best_EPS_1Y_Rank,ST_Accel,ST_Accel_Class
0,0.00000,0.34825,0.17992,-0.07653,0.14663,0.16376,2.12659,1.09871,-0.46732,0.89540,0.898,0.825,0.119,61.40000,0.0
2,0.00695,0.04446,0.05988,0.19207,0.18900,0.17332,0.25651,0.34551,1.10819,1.09048,0.597,0.620,0.757,65.80000,0.0
3,-0.04959,-0.72109,-0.25045,1.51042,1.82327,0.62393,-1.15572,-0.40141,2.42080,2.92222,0.035,0.083,0.992,37.00000,0.0
4,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
6,0.03064,0.18739,0.10357,-0.01340,0.11441,0.06062,3.09121,1.70855,-0.22106,1.88727,0.822,0.744,0.204,59.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104795,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104796,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104797,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
104798,0.00000,0.07985,0.11940,0.12672,0.13097,0.13574,0.50276,0.75698,0.79365,0.86430,0.500,0.500,0.500,49.73333,0.0
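On the percentile question raised in the cell above: an alternative to dropping rows outside a fixed range is to clip each column to its own percentiles, which keeps every row but caps extreme values. The following is a minimal sketch; the function name and the 1st/99th percentile cutoffs are assumptions rather than choices made in the analysis.

```python
import pandas as pd

def clip_to_percentiles(frame: pd.DataFrame, cols, lower: float = 0.01,
                        upper: float = 0.99) -> pd.DataFrame:
    """Cap each listed column at its own lower/upper percentile (row count unchanged)."""
    out = frame.copy()
    lo = out[cols].quantile(lower)
    hi = out[cols].quantile(upper)
    # axis=1 aligns the per-column bounds with the column labels
    out[cols] = out[cols].clip(lower=lo, upper=hi, axis=1)
    return out

# Hypothetical column with one extreme value (illustrative only)
toy = pd.DataFrame({"x": [float(v) for v in range(100)] + [10000.0]})
clipped = clip_to_percentiles(toy, ["x"])
print(clipped["x"].describe())
```

The trade-off is that the hard range filter above reduces the row count (the summary statistics that follow show 75,262 rows versus 104,800 before), whereas clipping retains all rows at the cost of altering the extreme values.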


In [None]:
# scale values??????
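If scaling is pursued, one option that is less sensitive to the remaining outliers than a mean and standard deviation is to centre on the median and divide by the interquartile range. The sketch below uses pandas only; the function name is hypothetical, and scikit-learn's `RobustScaler` offers a similar transformation.

```python
import pandas as pd

def robust_scale(frame: pd.DataFrame, cols) -> pd.DataFrame:
    """Centre each listed column on its median and divide by its interquartile range."""
    out = frame.copy()
    med = out[cols].median()
    iqr = out[cols].quantile(0.75) - out[cols].quantile(0.25)
    # A zero IQR would divide by zero, so such columns are left as NaN
    out[cols] = (out[cols] - med) / iqr.replace(0, float("nan"))
    return out

# Hypothetical values for illustration only
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = robust_scale(toy, ["x"])
print(scaled)
```

In a modelling pipeline the median and interquartile range would normally be estimated on the training split only and then applied to the test split, to avoid leaking information.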

In [17]:
# re-check summary statistics - EPS growth variables
df_growth.describe()

Unnamed: 0,Best_EPS_1M,Best_EPS_3M,Best_EPS_6M,Best_EPS_1Y,Best_EPS_3Y,Best_EPS_5Y,Best_EPS_3M_v5Y,Best_EPS_6M_v5Y,Best_EPS_1Y_v5Y,Best_EPS_3Y_v5Y,Best_EPS_3M_Rank,Best_EPS_6M_Rank,Best_EPS_1Y_Rank,ST_Accel
count,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0,75262.0
mean,0.064403,0.103547,0.128317,0.152326,0.175095,0.20352,0.524091,0.673004,0.781666,0.89781,0.468503,0.472717,0.491447,47.726022
std,0.373736,0.321315,0.281639,0.240117,0.190808,0.222371,1.130601,1.004666,0.823788,0.491916,0.253873,0.256143,0.262495,21.687511
min,-1.99793,-1.97605,-1.95035,-0.99218,-0.32982,-0.19809,-3.99581,-3.99377,-3.87324,-3.28366,0.0,0.0,0.0,0.0
25%,0.0,-0.00569,0.003295,0.04661,0.08136,0.09696,0.02133,0.135815,0.413703,0.684103,0.258,0.262,0.28,31.93333
50%,0.0,0.0699,0.11278,0.12672,0.13343,0.1452,0.50276,0.75698,0.79365,0.8643,0.496,0.5,0.5,49.73333
75%,0.03397,0.2,0.2297,0.22923,0.232777,0.25913,0.8165,1.07389,1.039285,1.032575,0.663,0.67,0.702,62.46667
max,1.99504,1.99701,1.99315,1.97018,1.98526,1.99608,3.99961,3.99771,3.9949,3.99773,1.0,1.0,0.997,98.76667
