# Demo Quantitative Stock Selection Model: Asset Growth

## List Comprehension

List comprehension offers a shorter syntax for creating a new list from the values of an existing list. It is a compact way to write a for-loop.

In [1]:
#Suppose we want to compute the square of each number in mylist
mylist = [21, 30, 12, 24, 9]
mylist_square = [ 21*21, 30*30, 12*12, 24*24, 9*9]
mylist_square = [ s*s for s in mylist ]
mylist_square

[441, 900, 144, 576, 81]
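For comparison, here is a minimal sketch of the explicit for-loop that the comprehension replaces, together with an optional filter clause. The names `squares_loop`, `squares_comp`, and `even_squares` are illustrative and do not appear elsewhere in this notebook.

```python
mylist = [21, 30, 12, 24, 9]

# Explicit for-loop version
squares_loop = []
for s in mylist:
    squares_loop.append(s * s)

# Equivalent list comprehension
squares_comp = [s * s for s in mylist]

# A comprehension can also filter: keep the squares of even numbers only
even_squares = [s * s for s in mylist if s % 2 == 0]

print(squares_loop == squares_comp)
print(even_squares)
```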

## Concatenate Data Frames

Given two data frames with the same set of columns, we can stack one on top of the other using **pd.concat()**.

In [2]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['E', 'D', 'C', 'B'],
                    'value': [5, 6, 7, 8]})

display(df1)
display(df2)

Unnamed: 0,key,value
0,A,1
1,B,2
2,C,3
3,D,5


Unnamed: 0,key,value
0,E,5
1,D,6
2,C,7
3,B,8


In [3]:
#Concatenate df1 and df2 with df1 on top of df2
df3 = pd.concat([df1,df2])
df3.reset_index(drop=True, inplace=True)
df3

Unnamed: 0,key,value
0,A,1
1,B,2
2,C,3
3,D,5
4,E,5
5,D,6
6,C,7
7,B,8
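As a possible shortcut, `pd.concat()` accepts `ignore_index=True`, which renumbers the rows during concatenation so that a separate `reset_index` call is not needed. The sketch below rebuilds the two small data frames; `stacked` and `df3_alt` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'key': ['E', 'D', 'C', 'B'], 'value': [5, 6, 7, 8]})

# Without ignore_index, the original row labels 0..3 appear twice
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..7 in one step
df3_alt = pd.concat([df1, df2], ignore_index=True)
print(df3_alt.index.tolist())
```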


## Asset Growth 

$$
\text{Asset Growth}_t = \frac{\text{Total Assets at } t}{\text{Total Assets at } t-1}
$$
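To make the definition concrete, the sketch below computes the ratio in two ways on three made-up total-asset values (the numbers are hypothetical, not taken from the data set). Since `pct_change()` returns the ratio minus one, adding one recovers the ratio, up to floating-point rounding.

```python
import pandas as pd

# Three hypothetical year-end total-asset values for a single firm
at = pd.Series([100.0, 110.0, 99.0])

# Asset growth as defined above: assets at t divided by assets at t-1
growth_ratio = at / at.shift(1)

# pct_change() returns at_t / at_{t-1} - 1, so adding 1 recovers the ratio
growth_pct = at.pct_change() + 1

print(growth_ratio)
print(growth_pct)
```

The first entry is NaN in both cases because there is no previous year to compare with.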

In [4]:
import pandas as pd
import numpy as np
df = pd.read_csv('classdata/AnnualTotalAsset.csv')
df['datadate']=pd.to_datetime(df['datadate'],format="%Y%m%d")

In [5]:
#We only select the columns we need
df=df[["LPERMNO","datadate","fyear","at"]].copy()
df.head()

Unnamed: 0,LPERMNO,datadate,fyear,at
0,54594,2000-05-31,1999.0,740.998
1,54594,2001-05-31,2000.0,701.854
2,54594,2002-05-31,2001.0,710.199
3,54594,2003-05-31,2002.0,686.621
4,54594,2004-05-31,2003.0,709.292


Column **at** is total assets. Column **datadate** is the date on which the fiscal year ends. Column **fyear** is the fiscal year.

Sort the data frame by **LPERMNO** and then by **datadate**.

In [6]:
df.sort_values(by=['LPERMNO', 'datadate'], inplace=True)
df.reset_index(drop=True, inplace=True)

## Step 1: Generate Signal

We introduce two new commands:

   - Use **pct_change()** to calculate the percentage change between rows in total assets.
   - Use **groupby** to group rows from the same stock and apply **pct_change** to each group.

The output of **df.groupby("LPERMNO")** is a special object called a "groupby" object (a **DataFrameGroupBy**). Unlike a regular data frame, printing a groupby object shows only its type and memory address, not its rows. A groupby object contains information about the groups and allows us to apply the same transformation to each group.

In [7]:
type(df.groupby("LPERMNO"))

pandas.core.groupby.generic.DataFrameGroupBy

In [8]:
df.groupby("LPERMNO")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7eff7321f040>
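Although the groupby object does not display its rows, its contents can still be inspected. The following is a small sketch on a made-up data frame (`toy` and its values are hypothetical) showing how to count the groups and loop over them.

```python
import pandas as pd

# Hypothetical data: two stocks with two and three annual observations
toy = pd.DataFrame({
    "LPERMNO": [1, 1, 2, 2, 2],
    "at": [10.0, 12.0, 5.0, 6.0, 9.0],
})
grouped = toy.groupby("LPERMNO")

print(grouped.ngroups)            # number of distinct stocks
for permno, rows in grouped:      # each item is (group key, sub-DataFrame)
    print(permno, len(rows))

group_sizes = grouped.size()      # number of rows per stock
print(group_sizes)
```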

**get_group()** will return the rows from the same group (same stock in this case). 

In [9]:
#The rows from stock 93433
df.groupby("LPERMNO").get_group(93433)

Unnamed: 0,LPERMNO,datadate,fyear,at
120658,93433,2010-12-31,2010.0,231.814
120659,93433,2011-12-31,2011.0,118.112
120660,93433,2012-12-31,2012.0,81.517
120661,93433,2013-12-31,2013.0,42.404
120662,93433,2014-12-31,2014.0,12.33
120663,93433,2015-12-31,2015.0,6.004
120664,93433,2016-12-31,2016.0,7.25
120665,93433,2017-12-31,2017.0,6.648
120666,93433,2018-12-31,2018.0,23.494


Apply **.pct_change()+1** to generate the asset growth for stock 93433.

In [10]:
df.groupby("LPERMNO").get_group(93433)["at"].pct_change()+1

120658         NaN
120659    0.509512
120660    0.690167
120661    0.520186
120662    0.290774
120663    0.486942
120664    1.207528
120665    0.916966
120666    3.533995
Name: at, dtype: float64

By removing **.get_group(93433)** in the code above, we can generate the asset growth for each stock. We save the asset growth in a new column called **Signal**.

In [11]:
df["Signal"]=df.groupby("LPERMNO")["at"].pct_change()+1
df.groupby("LPERMNO").get_group(93433)

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal
120658,93433,2010-12-31,2010.0,231.814,
120659,93433,2011-12-31,2011.0,118.112,0.509512
120660,93433,2012-12-31,2012.0,81.517,0.690167
120661,93433,2013-12-31,2013.0,42.404,0.520186
120662,93433,2014-12-31,2014.0,12.33,0.290774
120663,93433,2015-12-31,2015.0,6.004,0.486942
120664,93433,2016-12-31,2016.0,7.25,1.207528
120665,93433,2017-12-31,2017.0,6.648,0.916966
120666,93433,2018-12-31,2018.0,23.494,3.533995
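The grouping matters because, without it, the first row of one stock would be compared with the last row of the previous stock. A minimal sketch on made-up numbers (`toy`, `naive`, and `by_stock` are illustrative names):

```python
import pandas as pd

# Hypothetical data: two stocks with two annual observations each
toy = pd.DataFrame({
    "LPERMNO": [1, 1, 2, 2],
    "at": [10.0, 20.0, 100.0, 50.0],
})

# Ungrouped: the third row compares stock 2 with stock 1 (100 / 20), which is wrong
naive = toy["at"].pct_change() + 1

# Grouped: the first row of every stock is NaN, as it should be
by_stock = toy.groupby("LPERMNO")["at"].pct_change() + 1

print(pd.DataFrame({"naive": naive, "by_stock": by_stock}))
```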


## Step 2: Generate Date When Each Signal Becomes Available

We now have the signal in column **Signal**. Next, we generate the quarter in which each signal first becomes available and store it in a new column called **Date**.

In this example, we assume the available date is six months after **datadate** and we represent each quarter by "yyyymm" with mm being 03, 06, 09, or 12. 

In [12]:
#relativedelta moves a date forward by whole months; if the resulting day does not exist, the last day of that month is used
from dateutil.relativedelta import relativedelta

In [13]:
#Generate the available date by moving datadate six months forward. For example:
s=pd.to_datetime("20100831",format="%Y%m%d")
s+relativedelta(months=6)

Timestamp('2011-02-28 00:00:00')

In [14]:
#Get the year part
(s+relativedelta(months=6)).year

2011

In [15]:
#Get the quarter part
(s+relativedelta(months=6)).quarter

1

Generate **Date** as $100\times year + 3\times quarter$ using list comprehension.

In [16]:
df["year"]=[(s+relativedelta(months=6)).year for s in df["datadate"]]
df["quarter"]=[(s+relativedelta(months=6)).quarter for s in df["datadate"]]
df["Date"]=df["year"]*100+df["quarter"]*3

In [17]:
df.groupby("LPERMNO").get_group(93433)

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
120658,93433,2010-12-31,2010.0,231.814,,2011,2,201106
120659,93433,2011-12-31,2011.0,118.112,0.509512,2012,2,201206
120660,93433,2012-12-31,2012.0,81.517,0.690167,2013,2,201306
120661,93433,2013-12-31,2013.0,42.404,0.520186,2014,2,201406
120662,93433,2014-12-31,2014.0,12.33,0.290774,2015,2,201506
120663,93433,2015-12-31,2015.0,6.004,0.486942,2016,2,201606
120664,93433,2016-12-31,2016.0,7.25,1.207528,2017,2,201706
120665,93433,2017-12-31,2017.0,6.648,0.916966,2018,2,201806
120666,93433,2018-12-31,2018.0,23.494,3.533995,2019,2,201906
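The list comprehensions above process one date at a time. As a possible alternative, the same shift can be written as a single vectorized operation with `pd.DateOffset` and the `.dt` accessor. This is a sketch on three sample dates, not the notebook's own method; `datadate`, `available`, and `date_code` are illustrative names.

```python
import pandas as pd

datadate = pd.Series(pd.to_datetime(["2010-08-31", "2010-12-31", "2001-01-31"]))

# Shift every date six months forward in one vectorized operation
available = datadate + pd.DateOffset(months=6)

# Encode the quarter as yyyymm with mm in {03, 06, 09, 12}
date_code = available.dt.year * 100 + available.dt.quarter * 3

print(available.tolist())
print(date_code.tolist())
```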


Note that some stocks may have a growth equal to infinity or NaN, for example when total assets in the previous year are zero. See the following example.

In [18]:
df.groupby("LPERMNO").get_group(32548)

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
23695,32548,2001-01-31,2000.0,0.0,,2001,3,200109
23696,32548,2002-01-31,2001.0,0.0,,2002,3,200209
23697,32548,2003-01-31,2002.0,0.0,,2003,3,200309
23698,32548,2004-01-31,2003.0,0.021,inf,2004,3,200409
23699,32548,2004-12-31,2004.0,22.52,1072.380952,2005,2,200506
23700,32548,2005-12-31,2005.0,39.301,1.74516,2006,2,200606
23701,32548,2006-12-31,2006.0,70.88,1.803516,2007,2,200706
23702,32548,2007-12-31,2007.0,93.974,1.325818,2008,2,200806
23703,32548,2008-12-31,2008.0,114.507,1.218497,2009,2,200906
23704,32548,2009-12-31,2009.0,133.379,1.164811,2010,2,201006


To remove the abnormal data, we need to replace infinity by NaN and drop the rows that contain NaNs. 

In [19]:
df.replace(np.inf, np.nan, inplace=True)
df.dropna(how='any',axis=0, inplace=True)
df.sort_values(by=['LPERMNO', 'Date'],inplace=True)
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
0,10001,2001-06-30,2001.0,61.261,1.211817,2001,4,200112
1,10001,2002-06-30,2002.0,56.855,0.928078,2002,4,200212
2,10001,2003-06-30,2003.0,61.341,1.078902,2003,4,200312
3,10001,2004-06-30,2004.0,60.219,0.981709,2004,4,200412
4,10001,2005-06-30,2005.0,57.986,0.962919,2005,4,200512
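The sketch below reproduces the problem in isolation, using the first total-asset values of stock 32548 shown earlier. Passing both `np.inf` and `-np.inf` to `replace` is an extra precaution that is not needed for this data set, where total assets are never negative.

```python
import numpy as np
import pandas as pd

# Total assets of stock 32548 for fiscal years 2001 to 2004
at = pd.Series([0.0, 0.0, 0.021, 22.52])

# 0/0 gives NaN and 0.021/0 gives inf
signal = at.pct_change() + 1

# Turn infinities into NaN, then drop every NaN
cleaned = signal.replace([np.inf, -np.inf], np.nan).dropna()

print(signal)
print(cleaned)
```

Only the last observation survives, because it is the only one with a positive previous-year value.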


## Step 3: Generate Signals for All Dates

Note that since the signal is constructed using data at the annual frequency, the signal will likely be used four times. For example, for signals constructed using the financial statement for the fiscal year ending in December 2012, we assume the same signal value is available in June, September, and December of 2013 and in March of 2014, after which the financial statement for 2013 will be used to update the signal value.

The code above only generates one signal each year. Next we will need to generate the signal for all quarters of a year.

The strategy is to make three copies of **df** whose **Date** is 9, 12, and 15 months after **datadate**. To do so, we move **datadate** forward by 9, 12, and 15 months, respectively, generate **Date** as $100\times year + 3\times quarter$ using list comprehension, and stack each copy onto **df** with **pd.concat()**.

In [20]:
#fill the signal for the remaining three quarters of a year.
dftemp=df.copy()
for i in [9, 12, 15]:
    dftemp["Date"]=[(s+relativedelta(months=i)).year*100+(s+relativedelta(months=i)).quarter*3 for s in dftemp["datadate"]]
    df=pd.concat([df,dftemp])

df.sort_values(by=['LPERMNO', 'datadate'], inplace=True)
df.reset_index(drop=True, inplace=True)
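To see what the loop does for a single observation, the sketch below computes the four quarter codes for a fiscal year ending on 2012-12-31. The helper `quarter_code` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

def quarter_code(ts, months):
    """Return yyyymm (mm in 03/06/09/12) for ts shifted forward by `months`."""
    shifted = ts + relativedelta(months=months)
    return shifted.year * 100 + shifted.quarter * 3

fiscal_year_end = pd.Timestamp("2012-12-31")

# 6 months is the first available quarter; 9, 12, and 15 fill the rest of the year
codes = [quarter_code(fiscal_year_end, m) for m in [6, 9, 12, 15]]
print(codes)
```

The result covers June, September, and December of 2013 and March of 2014, matching the example described above.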

For most stocks, this works nicely. See stock 93433 below.

In [21]:
df.groupby("LPERMNO").get_group(93433)[0:12]

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
386280,93433,2011-12-31,2011.0,118.112,0.509512,2012,2,201206
386281,93433,2011-12-31,2011.0,118.112,0.509512,2012,2,201209
386282,93433,2011-12-31,2011.0,118.112,0.509512,2012,2,201212
386283,93433,2011-12-31,2011.0,118.112,0.509512,2012,2,201303
386284,93433,2012-12-31,2012.0,81.517,0.690167,2013,2,201306
386285,93433,2012-12-31,2012.0,81.517,0.690167,2013,2,201309
386286,93433,2012-12-31,2012.0,81.517,0.690167,2013,2,201312
386287,93433,2012-12-31,2012.0,81.517,0.690167,2013,2,201403
386288,93433,2013-12-31,2013.0,42.404,0.520186,2014,2,201406
386289,93433,2013-12-31,2013.0,42.404,0.520186,2014,2,201409


Note that a company may change the end of its fiscal year. If that happens, some quarters will have more than one signal. See quarter 200206 of stock 10421 in the following example.

In [22]:
df.groupby("LPERMNO").get_group(10421)[0:7]

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
3844,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200109
3845,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200112
3846,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200203
3847,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200206
3848,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200206
3849,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200209
3850,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200212


We use **groupby** to group the rows with the same "LPERMNO" and "Date". Then we use **.tail(1)** to keep only the last row of each group. Because the rows are sorted by **datadate**, the last row is the latest signal (the one we are supposed to use).

In [23]:
df.groupby(["LPERMNO","Date"]).get_group((10421,200206))

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
3847,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200206
3848,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200206


In [24]:
df.groupby(["LPERMNO","Date"]).get_group((10421,200206)).tail(1)

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
3848,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200206


By removing **.get_group((10421,200206))** in the code above, we keep only the latest signal for every combination of stock and **Date**.

In [25]:
df=df.groupby(["LPERMNO","Date"]).tail(1)
df.reset_index(drop=True, inplace=True)
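A possible alternative to `groupby(...).tail(1)` is `drop_duplicates` with `keep="last"`, which also keeps the last row for each combination of the chosen columns. The sketch below checks on a three-row excerpt of stock 10421 that the two approaches agree; it assumes, as in the notebook, that the rows are already sorted by **datadate**. The names `toy`, `via_tail`, and `via_dedup` are illustrative.

```python
import pandas as pd

# Three rows of stock 10421; quarter 200206 appears twice
toy = pd.DataFrame({
    "LPERMNO": [10421, 10421, 10421],
    "datadate": pd.to_datetime(["2001-03-31", "2001-03-31", "2001-12-31"]),
    "Date": [200203, 200206, 200206],
    "Signal": [1.479267, 1.479267, 1.174087],
}).sort_values(["LPERMNO", "datadate"])

# Approach used in the notebook
via_tail = toy.groupby(["LPERMNO", "Date"]).tail(1)

# Alternative: drop duplicated (LPERMNO, Date) pairs, keeping the last one
via_dedup = toy.drop_duplicates(subset=["LPERMNO", "Date"], keep="last")

print(via_tail.equals(via_dedup))
print(via_dedup)
```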

There is only one signal in quarter 200206 now.

In [26]:
df.groupby("LPERMNO").get_group(10421)[0:7]

Unnamed: 0,LPERMNO,datadate,fyear,at,Signal,year,quarter,Date
3843,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200109
3844,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200112
3845,10421,2001-03-31,2000.0,695.526,1.479267,2001,3,200203
3846,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200206
3847,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200209
3848,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200212
3849,10421,2001-12-31,2001.0,816.608,1.174087,2002,2,200303


## Step 4: Generate Summary Statistics

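Next, we print the summary statistics of the signals grouped by **Date**, restricted to rows whose **quarter** column equals 4 (signals that first become available in the fourth quarter, coded as yyyy12).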

In [27]:
df[df.quarter==4].groupby("Date")["Signal"].describe(percentiles=[0.1,0.9])

Date,count,mean,std,min,10%,50%,90%,max
200012,1.0,0.720942,,0.720942,0.720942,0.720942,0.720942,0.720942
200103,1.0,0.720942,,0.720942,0.720942,0.720942,0.720942,0.720942
200106,1.0,0.720942,,0.720942,0.720942,0.720942,0.720942,0.720942
200109,1.0,0.720942,,0.720942,0.720942,0.720942,0.720942,0.720942
200112,619.0,1.070348,0.489472,0.021516,0.721766,1.016669,1.382239,8.227926
...,...,...,...,...,...,...,...,...
201909,266.0,1.109369,0.378158,0.272896,0.882710,1.045347,1.342671,4.085383
201912,261.0,1.058989,0.290926,0.161474,0.834333,1.021757,1.295721,3.408770
202003,261.0,1.058989,0.290926,0.161474,0.834333,1.021757,1.295721,3.408770
202006,261.0,1.058989,0.290926,0.161474,0.834333,1.021757,1.295721,3.408770
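The `percentiles` argument controls which quantiles `describe()` reports; the median (50%) is included by default. A small sketch on made-up signal values (all numbers in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical signals for two quarters, five stocks each
toy = pd.DataFrame({
    "Date": [200112] * 5 + [200212] * 5,
    "Signal": [0.8, 0.9, 1.0, 1.1, 1.2, 0.5, 1.0, 1.5, 2.0, 2.5],
})

# One row of summary statistics per Date, with the 10th and 90th percentiles
summary = toy.groupby("Date")["Signal"].describe(percentiles=[0.1, 0.9])

print(summary.columns.tolist())
print(summary)
```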


Save the columns we want to a CSV file.

In [28]:
df[["LPERMNO","datadate","Date","Signal"]].to_csv("Signal1.csv",index=False)