# Target Definition

This notebook uses the fiscal date and the reported EPS of each observation to build the target: the EPS reported x years later.

## Import Libraries

In [1]:
import pandas as pd

## Import Data

In [2]:
data = pd.read_csv('../data/preprocessed/preprocessed_data.csv')
data.head()

Unnamed: 0,fiscalDateEnding,totalAssets,totalCurrentAssets,cashAndCashEquivalentsAtCarryingValue,cashAndShortTermInvestments,inventory,currentNetReceivables,totalNonCurrentAssets,propertyPlantEquipment,accumulatedDepreciationAmortizationPPE,...,incomeTaxExpense,interestAndDebtExpense,netIncomeFromContinuingOperations,comprehensiveIncomeNetOfTax,ebit,ebitda,netIncome,reportedEPS,Sector,Industry
0,2024-11-02,48228277000,5484654000.0,1991342000.0,2363164000.0,1447687000.0,1336331000.0,42743620000.0,3415550000.0,3772438000.0,...,142067000,322227000.0,1635273000.0,1638319000.0,2032798000,3732798000.0,1635273000,6.38,MANUFACTURING,SEMICONDUCTORS & RELATED DEVICES
1,2023-10-28,48794478000,4384022000.0,958061000.0,958061000.0,1642214000.0,1469734000.0,44410460000.0,3219157000.0,3424775000.0,...,293424000,264641000.0,3314579000.0,3324429000.0,3872644000,5872644000.0,3314579000,10.08,MANUFACTURING,SEMICONDUCTORS & RELATED DEVICES
2,2022-10-29,50302350000,4937992000.0,1470572000.0,1470572000.0,1399914000.0,1800462000.0,45364360000.0,2401304000.0,3148203000.0,...,350188000,200408000.0,2748561000.0,2736974000.0,3299157000,5313357000.0,2748561000,9.59,MANUFACTURING,SEMICONDUCTORS & RELATED DEVICES
3,2021-10-30,52322071000,5378317000.0,1977964000.0,1977964000.0,1200610000.0,1459056000.0,46943750000.0,1979051000.0,2956246000.0,...,-61708000,399975000.0,1390422000.0,1453318000.0,1513539000,2356898000.0,1390422000,6.43,MANUFACTURING,SEMICONDUCTORS & RELATED DEVICES
4,2020-10-31,21468603000,2517688000.0,1055860000.0,1055860000.0,608260000.0,737536000.0,18895680000.0,1120561000.0,2765095000.0,...,90856000,193305000.0,1220761000.0,1161478000.0,1504922000,2082070000.0,1220761000,4.91,MANUFACTURING,SEMICONDUCTORS & RELATED DEVICES


## Extract the year information

We extract the year from the date so that we can later retrieve the EPS reported x years afterwards, which is our target.

In [3]:
# we need to extract the year from the date
data['year'] = data.fiscalDateEnding.str[:4].astype(int)
data[['fiscalDateEnding', 'year']].head()


Unnamed: 0,fiscalDateEnding,year
0,2024-11-02,2024
1,2023-10-28,2023
2,2022-10-29,2022
3,2021-10-30,2021
4,2020-10-31,2020
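
Slicing the first four characters works because every `fiscalDateEnding` value shown here follows the `YYYY-MM-DD` layout. As a hedged alternative, the sketch below parses the column with `pd.to_datetime` and reads the year through the `.dt` accessor. The small `sample` frame is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# hypothetical sample mimicking the fiscalDateEnding column
sample = pd.DataFrame({'fiscalDateEnding': ['2024-11-02', '2023-10-28', '2022-10-29']})

# string slicing, as in the cell above
sample['year_from_slice'] = sample['fiscalDateEnding'].str[:4].astype(int)

# datetime parsing: a value that does not match the format raises an error
# instead of silently producing a wrong year
parsed = pd.to_datetime(sample['fiscalDateEnding'], format='%Y-%m-%d')
sample['year_from_datetime'] = parsed.dt.year

print(sample)
```

Both columns hold the same years here; the datetime route mainly helps if the date format is not guaranteed.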


One thing needs attention: a given symbol should not have the same year twice. Let's take a look.

In [4]:
# first, let's create a temporary dataset with only what we need to do our verifications
temp = data.copy()
temp = temp[['symbol', 'fiscalDateEnding', 'year']]
temp.head()

Unnamed: 0,symbol,fiscalDateEnding,year
0,ADI,2024-11-02,2024
1,ADI,2023-10-28,2023
2,ADI,2022-10-29,2022
3,ADI,2021-10-30,2021
4,ADI,2020-10-31,2020


In [5]:
# then, we need the list of symbols in the data
symbols = temp['symbol'].unique()
len(symbols)

101

In [6]:
# for each symbol, we will look at the list of years we have
# then we can compare the length of the list of years [2024, 2023,...,2016]
# with the length of the set of this same list
# this works because a mathematical set only contains unique values.

# example
mylist = ['a', 'a', 'a', 'b', 'c']
myset = set(mylist)
print(f"my list: {mylist} | my set: {myset}")

my list: ['a', 'a', 'a', 'b', 'c'] | my set: {'b', 'a', 'c'}


In [7]:
# in our case

for sym in symbols:
    listofyears = temp[temp['symbol'] == sym]['year'].values
    setofyears = set(listofyears)
    if len(listofyears) != len(setofyears):
        print(f"symbol: {sym} | list length: {len(listofyears)} | set length: {len(setofyears)} | duplicate(s): {len(listofyears) - len(setofyears)}")

symbol: JNJ | list length: 17 | set length: 14 | duplicate(s): 3
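
The same check can be written without an explicit loop. This is a minimal sketch, assuming a frame with `symbol` and `year` columns like `temp`; the `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# hypothetical data: symbol 'BBB' has the year 2012 twice
toy = pd.DataFrame({
    'symbol': ['AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'year':   [2011, 2012, 2011, 2012, 2012],
})

# number of rows per (symbol, year) pair
counts = toy.groupby(['symbol', 'year']).size()

# pairs that appear more than once
duplicated_pairs = counts[counts > 1]
print(duplicated_pairs)

# rows involved in a duplicate (keep=False marks every occurrence)
dup_rows = toy[toy.duplicated(subset=['symbol', 'year'], keep=False)]
print(dup_rows)
```

Compared with the loop, this also tells us which years are duplicated, not only how many duplicates each symbol has.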


It seems we do have one symbol with the same year appearing more than once in the dataset. Let's investigate to see if we can come up with a solution.

In [8]:
temp[temp['symbol'] == 'JNJ']

Unnamed: 0,symbol,fiscalDateEnding,year
1283,JNJ,2023-12-31,2023
1284,JNJ,2022-12-31,2022
1285,JNJ,2021-12-31,2021
1286,JNJ,2020-12-31,2020
1287,JNJ,2019-12-31,2019
1288,JNJ,2018-12-31,2018
1289,JNJ,2017-12-31,2017
1290,JNJ,2016-12-31,2016
1291,JNJ,2015-12-31,2015
1292,JNJ,2014-12-31,2014


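In this particular case, the issue comes from the data sources themselves: there are two entries for 2012 in both the balance sheet data and the income statement data. Here, we keep the largest values by grouping on year and symbol and taking the maximum. Note that `.max()` is computed per column, so the row kept for a duplicated year may combine values from different entries.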

In [9]:
data = data.groupby(['year', 'symbol'], as_index=False).max()

In [10]:
data[data['symbol'] == 'JNJ']

Unnamed: 0,year,symbol,fiscalDateEnding,totalAssets,totalCurrentAssets,cashAndCashEquivalentsAtCarryingValue,cashAndShortTermInvestments,inventory,currentNetReceivables,totalNonCurrentAssets,...,incomeTaxExpense,interestAndDebtExpense,netIncomeFromContinuingOperations,comprehensiveIncomeNetOfTax,ebit,ebitda,netIncome,reportedEPS,Sector,Industry
108,2010,JNJ,2010-12-31,94682000000,39541000000.0,15810000000.0,19425000000.0,5180000000.0,9646000000.0,45154000000.0,...,3489000000,451000000.0,12266000000.0,13399000000.0,16206000000,16527000000.0,12266000000,4.76,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
192,2011,JNJ,2011-12-31,102908000000,47307000000.0,19355000000.0,27658000000.0,5378000000.0,9774000000.0,45848000000.0,...,3613000000,455000000.0,13334000000.0,12861000000.0,17402000000,18150000000.0,13334000000,5.0,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
281,2012,JNJ,2012-12-31,121347000000,54316000000.0,24542000000.0,32261000000.0,7495000000.0,11309000000.0,70690000000.0,...,3261000000,571000000.0,10514000000.0,10675000000.0,14646000000,15792000000.0,10853000000,5.11,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
373,2013,JNJ,2013-12-31,132683000000,56407000000.0,20927000000.0,29206000000.0,7878000000.0,11713000000.0,72404000000.0,...,1640000000,482000000.0,13831000000.0,16781000000.0,15953000000,17316000000.0,13831000000,5.52,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
467,2014,JNJ,2014-12-31,130358000000,55744000000.0,14523000000.0,33089000000.0,8184000000.0,10985000000.0,68529000000.0,...,4240000000,533000000.0,16323000000.0,8461000000.0,21096000000,22494000000.0,16323000000,5.97,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
562,2015,JNJ,2015-12-31,133411000000,60210000000.0,13732000000.0,38376000000.0,8053000000.0,10734000000.0,67711000000.0,...,3787000000,552000000.0,15409000000.0,12966000000.0,19748000000,20948000000.0,15409000000,6.2,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
657,2016,JNJ,2016-12-31,141208000000,65032000000.0,18972000000.0,41907000000.0,8144000000.0,11699000000.0,70028000000.0,...,3263000000,726000000.0,16540000000.0,14804000000.0,20529000000,21729000000.0,16540000000,6.68,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
752,2017,JNJ,2017-12-31,157303000000,43088000000.0,17824000000.0,18296000000.0,8765000000.0,13490000000.0,107636000000.0,...,16373000000,934000000.0,1300000000.0,3002000000.0,18607000000,21607000000.0,1300000000,7.3,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
847,2018,JNJ,2018-12-31,152954000000,46033000000.0,18107000000.0,19687000000.0,8599000000.0,14098000000.0,99756000000.0,...,2702000000,1005000000.0,15297000000.0,13506000000.0,19004000000,23404000000.0,15297000000,8.18,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
944,2019,JNJ,2019-12-31,157728000000,45274000000.0,17305000000.0,20435000000.0,9020000000.0,14481000000.0,105186000000.0,...,2209000000,318000000.0,15119000000.0,14450000000.0,17646000000,22146000000.0,15119000000,8.68,LIFE SCIENCES,PHARMACEUTICAL PREPARATIONS
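
Because `groupby(...).max()` takes the maximum of each column separately, the row kept for a duplicated year can mix values coming from different entries. The sketch below illustrates this on invented numbers and shows one possible alternative that keeps a single complete row per (symbol, year). Ranking by `totalAssets` is an assumption made for the example, not a choice made in this notebook.

```python
import pandas as pd

# hypothetical duplicate entries for one symbol and year
toy = pd.DataFrame({
    'symbol':      ['XYZ', 'XYZ'],
    'year':        [2012, 2012],
    'totalAssets': [100, 120],
    'netIncome':   [15, 10],
})

# column-wise maximum: totalAssets comes from one row, netIncome from the other
colwise = toy.groupby(['year', 'symbol'], as_index=False).max()
print(colwise)

# alternative: keep the whole row with the largest totalAssets
whole_row = (
    toy.sort_values('totalAssets', ascending=False)
       .drop_duplicates(subset=['symbol', 'year'], keep='first')
       .reset_index(drop=True)
)
print(whole_row)
```

Which behaviour is preferable depends on why the duplicates exist, for instance restated figures versus partial filings.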


## Building the Target Variable

We can create a copy of the dataset, shift its year by a constant, and then merge the original dataset with this altered copy to obtain a target variable.

In [None]:
# Let's say we want to predict the EPS 5 years from the observed data year
PREDICTION_HORIZON = 5

target_data = data.copy()
target_data = target_data[['symbol', 'year', 'reportedEPS']]

# we can subtract the prediction horizon from the year
target_data['year'] = target_data['year'] - PREDICTION_HORIZON

# let's also rename the columns
target_data.rename(columns={'reportedEPS' : 'futureEPS'}, inplace=True)

# Now we can merge the original dataset with this new one
# an inner merge drops the observations whose future EPS is not known yet
data = data.merge(target_data, on=['symbol', 'year'], how='inner')
data.head()

Unnamed: 0,year,symbol,fiscalDateEnding,totalAssets,totalCurrentAssets,cashAndCashEquivalentsAtCarryingValue,cashAndShortTermInvestments,inventory,currentNetReceivables,totalNonCurrentAssets,...,interestAndDebtExpense,netIncomeFromContinuingOperations,comprehensiveIncomeNetOfTax,ebit,ebitda,netIncome,reportedEPS,Sector,Industry,futureEPS
0,2004,BLK,2004-12-31,1145235000,867424000.0,457673000.0,457673000.0,15549000.0,,277811000.0,...,,,,142517000,163203000.0,143141000,2.73,FINANCE,"SECURITY BROKERS, DEALERS & FLOTATION COMPANIES",7.05
1,2005,BLK,2005-12-31,1848000000,1234567000.0,484223000.0,484223000.0,61882000.0,,613433000.0,...,,,,297403000,328305000.0,233908000,4.04,FINANCE,"SECURITY BROKERS, DEALERS & FLOTATION COMPANIES",10.94
2,2006,BLK,2006-12-31,20469492000,2124670000.0,1160304000.0,1160304000.0,472483000.0,,11354230000.0,...,,,,405451000,478260000.0,322602000,5.09,FINANCE,"SECURITY BROKERS, DEALERS & FLOTATION COMPANIES",11.85
3,2007,BLK,2007-12-31,22561515000,3241842000.0,1656200000.0,1656200000.0,395006000.0,,19319670000.0,...,,,,1368130000,1566954000.0,995272000,8.2,FINANCE,"SECURITY BROKERS, DEALERS & FLOTATION COMPANIES",13.69
4,2008,BLK,2008-12-31,19924000000,3242000000.0,2032000000.0,2032000000.0,122000000.0,,16682000000.0,...,,,,1695671000,1971420000.0,786419000,6.43,FINANCE,"SECURITY BROKERS, DEALERS & FLOTATION COMPANIES",16.6


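At the time of writing (2024), we cannot know the EPS five years after a 2023 or 2024 observation (i.e. in 2028 and 2029). More generally, any observation whose year plus the prediction horizon falls beyond the last year in the data has no match in the shifted copy and would get a NaN value for `futureEPS`. We preemptively deal with these rows by using an inner join in the merge.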
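
To make the effect of the merge type concrete, here is a small sketch on invented EPS values with a horizon of 2 years. The `toy` frame, the symbol `XYZ`, and `HORIZON` are made up for the example.

```python
import pandas as pd

HORIZON = 2  # hypothetical horizon for the toy example

toy = pd.DataFrame({
    'symbol':      ['XYZ'] * 4,
    'year':        [2020, 2021, 2022, 2023],
    'reportedEPS': [1.0, 1.5, 2.0, 2.5],
})

# shifted copy: the EPS of year Y is attached to year Y - HORIZON
future = toy[['symbol', 'year', 'reportedEPS']].copy()
future['year'] = future['year'] - HORIZON
future = future.rename(columns={'reportedEPS': 'futureEPS'})

# left merge keeps every observation; the last HORIZON years get NaN
left = toy.merge(future, on=['symbol', 'year'], how='left')
print(left)

# inner merge keeps only observations that have a known future EPS
inner = toy.merge(future, on=['symbol', 'year'], how='inner')
print(inner)
```

With the left merge, the 2022 and 2023 rows remain but carry a missing target; with the inner merge, they are dropped.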

## Saving the Dataset

In [12]:
data.to_csv('../data/preprocessed/preprocessed_data_with_target.csv', index=False)