# Handling Financial Data

The goal of this notebook is to study and use financial data on corporations.
The approach relies on a memory that keeps track of the last known data for each security.

## Feature engineering techniques:

- [Start with the end](#Start) 
- [Get Data](#Get_Data)
- [Reorder Data](#Reorder_Data)
- [Missing Values](#Missing_Values)
- [Base Feature Engineering](#Base_FE)
- [Market Features](#Market_Features)
- [Time Features](#Time_Features)
- [Running Moving Average](#RMA) (<- Magic)
- [Moving Average Features](#MA_FE)
- [Betas](#Betas)
- [Putting it all together](#All) (<- All the features)
- [Complete Feature Exploration](#FE_exploration)

In [None]:
import jpx_tokyo_market_prediction
env = jpx_tokyo_market_prediction.make_env()
iter_test = env.iter_test()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from datetime import datetime
import pickle
import collections

def timestamp_to_date(timestamp):
    return datetime.fromtimestamp(timestamp)

DEBUG = False

<a id='Start'></a>
# Start with the end
We first look at the data served by the submission iterator.

In [None]:
(prices, options, financials, trades, secondary_prices, sample_prediction) = next(iter_test)

financials

The iterator serves very little data: each day we only receive the update (the newly disclosed rows). We will therefore need an online feature engineering framework.
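
The idea of keeping the last known data can be sketched on toy data as follows. This is a minimal illustration, not the code used later in this notebook: the frames `memory` and `update`, the helper `apply_update` and all values are hypothetical, and the sketch assumes at most one update row per security.

```python
import numpy as np
import pandas as pd

# Hypothetical state: one row per security, holding the last known values.
memory = pd.DataFrame(
    {"NetSales": [100.0, 200.0], "Profit": [10.0, 20.0]},
    index=pd.Index([1301, 1332], name="SecuritiesCode"),
)

# Hypothetical daily update: only security 1301 reports, and Profit is missing.
update = pd.DataFrame(
    {"NetSales": [120.0], "Profit": [np.nan]},
    index=pd.Index([1301], name="SecuritiesCode"),
)

def apply_update(memory, update):
    """Overwrite the memory with the non-missing values of the update."""
    # Ignore securities that are not tracked in the memory.
    known = update.loc[update.index.intersection(memory.index)]
    # combine_first keeps the values of `known` and falls back to `memory`
    # wherever `known` is missing.
    return known.combine_first(memory)

memory = apply_update(memory, update)
print(memory)
```

Security 1301 gets its new `NetSales` while keeping its previous `Profit`, and security 1332 is left untouched.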

<a id='Get_Data'></a>
# Get Data

We start with a pandas framework for EDA. 

In [None]:
# get raw data
train = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/stock_prices.csv')[['Date','SecuritiesCode','Volume','Close','Target','ExpectedDividend']]
train_financials = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/train_files/financials.csv')

if DEBUG:
    train_financials = train_financials[train_financials.Date.isin(train_financials.Date.unique()[-300:])]


train_financials_sup = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files/financials.csv')

#transform to float
bool_cols = ['ApplyingOfSpecificAccountingOfTheQuarterlyFinancialStatements',
       'MaterialChangesInSubsidiaries',
       'ChangesBasedOnRevisionsOfAccountingStandard',
       'ChangesOtherThanOnesBasedOnRevisionsOfAccountingStandard',
       'ChangesInAccountingEstimates', 'RetrospectiveRestatement']

float_cols = ['NetSales',
       'OperatingProfit', 'OrdinaryProfit', 'Profit', 'EarningsPerShare',
       'TotalAssets', 'Equity', 'EquityToAssetRatio', 'BookValuePerShare',
       'ResultDividendPerShare1stQuarter', 'ResultDividendPerShare2ndQuarter',
       'ResultDividendPerShare3rdQuarter',
       'ResultDividendPerShareFiscalYearEnd', 'ResultDividendPerShareAnnual',
       'ForecastDividendPerShare1stQuarter',
       'ForecastDividendPerShare2ndQuarter',
       'ForecastDividendPerShare3rdQuarter',
       'ForecastDividendPerShareFiscalYearEnd',
       'ForecastDividendPerShareAnnual', 'ForecastNetSales',
       'ForecastOperatingProfit', 'ForecastOrdinaryProfit', 'ForecastProfit',
       'ForecastEarningsPerShare',
       'NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock',
       'NumberOfTreasuryStockAtTheEndOfFiscalYear', 'AverageNumberOfShares']


features_to_lag = ['Date', 'DisclosedDate', 'TypeOfDocument', 'NetSales',
       'OperatingProfit', 'OrdinaryProfit', 'Profit', 'EarningsPerShare',
       'TotalAssets', 'Equity', 'EquityToAssetRatio', 'BookValuePerShare',
       'ResultDividendPerShare1stQuarter', 'ResultDividendPerShare2ndQuarter',
       'ResultDividendPerShare3rdQuarter',
       'ResultDividendPerShareFiscalYearEnd', 'ResultDividendPerShareAnnual',
       'ForecastDividendPerShare1stQuarter',
       'ForecastDividendPerShare2ndQuarter',
       'ForecastDividendPerShare3rdQuarter',
       'ForecastDividendPerShareFiscalYearEnd',
       'ForecastDividendPerShareAnnual', 'ForecastNetSales',
       'ForecastOperatingProfit', 'ForecastOrdinaryProfit', 'ForecastProfit',
       'ForecastEarningsPerShare',
       'ApplyingOfSpecificAccountingOfTheQuarterlyFinancialStatements',
       'MaterialChangesInSubsidiaries',
       'ChangesBasedOnRevisionsOfAccountingStandard',
       'ChangesOtherThanOnesBasedOnRevisionsOfAccountingStandard',
       'ChangesInAccountingEstimates', 'RetrospectiveRestatement',
       'NumberOfIssuedAndOutstandingSharesAtTheEndOfFiscalYearIncludingTreasuryStock',
       'NumberOfTreasuryStockAtTheEndOfFiscalYear', 'AverageNumberOfShares']

dtypes_dict = {}

for c in bool_cols:
    dtypes_dict[c] = 'bool'

for c in float_cols:
    dtypes_dict[c] = 'float32'

def clean_financials(df):
    df = df.replace('-',np.nan).replace('－',np.nan)
    df = df.astype(dtypes_dict)
    return df
    
# clean data
train_financials = clean_financials(train_financials)
train_financials_sup = clean_financials(train_financials_sup)
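
To see what the cleaning step does, here is a small sketch on made-up data. The frame `raw` and the helper `clean_numeric` are hypothetical; the sketch only assumes that missing values are encoded as an ASCII hyphen or a full-width hyphen, as handled by `clean_financials` above.

```python
import numpy as np
import pandas as pd

# Made-up raw values: numbers stored as strings, missing values stored as hyphens.
raw = pd.DataFrame(
    {"NetSales": ["1200", "-", "1350"], "Profit": ["－", "80", "95"]},
    dtype=object,
)

def clean_numeric(df, cols):
    """Replace both hyphen variants with NaN, then cast the columns to float32."""
    out = df.copy()
    out[cols] = out[cols].replace(["-", "－"], np.nan)
    return out.astype({c: "float32" for c in cols})

cleaned = clean_numeric(raw, ["NetSales", "Profit"])
print(cleaned.dtypes)
print(cleaned)
```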

# Feature engineering:

In [None]:
%%time 

df_result = []
grouped_train = train.groupby('SecuritiesCode')
grouped_financial = train_financials.groupby('SecuritiesCode')
Forecast_col = ['ForecastDividendPerShareFiscalYearEnd','ForecastDividendPerShareAnnual','ForecastNetSales',
            'ForecastOperatingProfit','ForecastOrdinaryProfit','ForecastProfit','ForecastEarningsPerShare']
Forecast_surprise_col = [s+'_surprise_rel' for s in Forecast_col]

for code in train.SecuritiesCode.unique():
    train_security = grouped_train.get_group(code)
    # copy the group so that new columns are not written into a slice of train_financials
    financial_security = grouped_financial.get_group(code).copy()
    financial_security[Forecast_surprise_col] = financial_security[Forecast_col]/financial_security[Forecast_col].shift()
    financial_security = financial_security.ffill()
    train_security = train_security.merge(financial_security, on=['Date','SecuritiesCode'], how='left')
    train_security = train_security.ffill()
    train_security['ID_str'] = train_security.DateCode.str[-4:]
    train_security['Days_Since_Disclosure'] = (pd.to_datetime(train_security['Date']) - pd.to_datetime(train_security['DisclosedDate'])).dt.days
    df_result.append(train_security)
    
df = pd.concat(df_result)
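
The two features built in the loop above (relative forecast surprise and days since disclosure) can be checked on a toy example. This is a sketch under assumptions: the frames `prices` and `fin` are made up, and the per-security loop is replaced by `groupby`, which is a swapped-in formulation rather than the code of this notebook.

```python
import pandas as pd

# Made-up daily prices and disclosures for a single security.
prices = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"],
    "SecuritiesCode": [1301] * 4,
})
fin = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-06"],
    "SecuritiesCode": [1301, 1301],
    "DisclosedDate": ["2021-01-04", "2021-01-06"],
    "ForecastProfit": [100.0, 110.0],
})

# Relative surprise: new forecast divided by the previous forecast of the same security.
fin["ForecastProfit_surprise_rel"] = (
    fin["ForecastProfit"] / fin.groupby("SecuritiesCode")["ForecastProfit"].shift()
)

# Attach disclosures to the daily rows, then carry the last disclosure forward per security.
daily = prices.merge(fin, on=["Date", "SecuritiesCode"], how="left")
filled = daily.groupby("SecuritiesCode").ffill()
daily[filled.columns] = filled

daily["Days_Since_Disclosure"] = (
    pd.to_datetime(daily["Date"]) - pd.to_datetime(daily["DisclosedDate"])
).dt.days
print(daily)
```

The surprise is undefined until a security has disclosed at least twice, and it stays constant between two disclosures because of the forward fill.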

# On-line Feature Engineering:



We need:

- data cleaning
- a memory to keep track of current data
- a memory to keep track of past data (for growth features)
- feature engineering
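
The memory of past data can be built on a bounded `collections.deque`. The sketch below is a simplified illustration on scalars rather than DataFrames; `make_lagger` and `push` are hypothetical names.

```python
import collections

def make_lagger(horizon):
    """Return a function that yields the value seen `horizon` calls ago (None until then)."""
    buffer = collections.deque(maxlen=horizon)

    def push(value):
        # Read the oldest entry before appending, so that a full buffer
        # gives a lag of exactly `horizon` steps.
        lagged = buffer[0] if len(buffer) == horizon else None
        buffer.append(value)
        return lagged

    return push

lag_2 = make_lagger(2)
lags = [lag_2(v) for v in [10, 11, 12, 13]]
print(lags)  # [None, None, 10, 11]
```

The order of operations matters: appending first and reading `buffer[0]` afterwards returns a value that is only `horizon - 1` steps old, and the current value itself when `horizon` is 1.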

In [None]:
Forecast_col = ['ForecastDividendPerShareFiscalYearEnd','ForecastDividendPerShareAnnual','ForecastNetSales','ForecastOperatingProfit','ForecastOrdinaryProfit','ForecastProfit','ForecastEarningsPerShare']
Forecast_surprise_col = [s+'_surprise_rel' for s in Forecast_col]

Amounts = ['NetSales', 'OperatingProfit','OrdinaryProfit','Profit', 'TotalAssets', 'Equity']
Bases = ['Profit', 'TotalAssets', 'Equity']
Forecasts = ['ForecastDividend','ForecastNetSales','ForecastOperatingProfit','ForecastOrdinaryProfit','ForecastProfit','ForecastEarnings']

horizons = [250,60,20,5,1]
suffixes = ['_y_1','_q_1','_m_1','_w_1','_1']

# lagger
dict_lag = {}
for h in horizons:
    dict_lag[h] = collections.deque(maxlen=h)

df_memory = pd.DataFrame(index=train.SecuritiesCode.unique(), columns = financials.columns).astype(dtypes_dict)

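def update_fiancials(df, update, date):
    # stamp every security with the current date
    df['Date'] = date
    for i in range(len(update)):
        row = update.iloc[i]
        if row.SecuritiesCode in df.index:
            # overwrite the stored row, keeping the stored value where the update is missing
            df.loc[row.SecuritiesCode] = np.where(row.isna(),df.loc[row.SecuritiesCode],row)
    return df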

def build_base_finanacials_features(df):
    # Days Since Disclosure
    df['DaysSinceDisclosure'] = (pd.to_datetime(df['Date']) - pd.to_datetime(df['DisclosedDate'])).dt.days
    # Base Amount Features
    df['ForecastDividend'] =  df['AverageNumberOfShares']*df['ForecastDividendPerShareAnnual']
    df['ForecastEarnings'] =  df['AverageNumberOfShares']*df['ForecastEarningsPerShare']
    return df

def build_ratio_finanacials_features(df):
    # ratios
    for Am in Amounts:
        for Base in Bases:
            if Am!=Base:
                df['r'+Am+'/'+Base] = df[Am]/df[Base]
    # forecast ratios: forecast amount divided by each base
    for For in Forecasts:
        for Base in Bases:
            df['r'+For+'/'+Base] = df[For]/df[Base]
    return df

def build_growth_finanacials_features(df):
    # Growth Features:
    for Am in Amounts:
        df[Am+'_YoY_growth'] = df[Am]/df[Am+'_y_1']
        df[Am+'_QoQ_growth'] = df[Am]/df[Am+'_q_1']
    return df


def Fundamental_Data_builder(group, df_memory):

    #clean data
    group = clean_financials(group)
    
    #update memory (`date` is not an argument: it is read from the loop that calls this function)
    df_memory = update_fiancials(df_memory.copy(), group, date).copy()
    df_features = df_memory.copy()
     
    # build base features
    df_features = build_base_finanacials_features(df_features)
    
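    #create lagged features - TODO : lag only interesting features
    df_list = [df_features.copy()]
    
    for h,s in zip(horizons,suffixes):
        # each deque keeps the last h snapshots; after the append, element 0 is the
        # oldest one: h-1 updates back once the deque is full (the current one when h == 1)
        dict_lag[h].append(df_memory[features_to_lag].copy())
        df_lag = dict_lag[h][0]
        df_lag.columns = [c + s for c in features_to_lag]
        df_list.append(df_lag)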
    df_features_agg = pd.concat(df_list, axis=1)

    # Build ratio / growth features
    df_features_agg = build_ratio_finanacials_features(df_features_agg)
    df_features_agg = build_growth_finanacials_features(df_features_agg)

    return df_features_agg, df_memory
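
The ratio and growth features above are plain divisions, so a base equal to zero produces infinite values. A possible guard is sketched below on toy data; `safe_ratio` is a hypothetical helper that is not part of this notebook, and the frame `state` is made up.

```python
import numpy as np
import pandas as pd

# Made-up state: current amounts plus one lagged column, indexed by security.
state = pd.DataFrame(
    {"Profit": [10.0, 5.0], "Equity": [100.0, 0.0], "Profit_y_1": [8.0, np.nan]},
    index=[1301, 1332],
)

def safe_ratio(num, den):
    """Divide two Series and map infinite results (division by zero) to NaN."""
    return (num / den).replace([np.inf, -np.inf], np.nan)

state["rProfit/Equity"] = safe_ratio(state["Profit"], state["Equity"])
state["Profit_YoY_growth"] = safe_ratio(state["Profit"], state["Profit_y_1"])
print(state)
```

Whether to keep infinite values or turn them into missing values depends on the model used downstream; tree-based models usually handle `NaN` more gracefully than `inf`.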

In [None]:
result = []

for date, group in train_financials.groupby('Date'):
    features_agg, df_memory = Fundamental_Data_builder(group, df_memory.copy())
    result.append(features_agg.copy())
    
df_output = pd.concat(result).astype(dtypes_dict)

Date_Names = df_output.columns[df_output.columns.str.startswith('Date')]
df_output[Date_Names] = df_output[Date_Names].astype("str")

df_output.to_parquet('train_financials.parquet')
df_memory.astype({'Date':str}).to_parquet('memory.parquet')
with open('dict_lag_train.pkl', 'wb') as f:
    pickle.dump(dict_lag, f)

In [None]:
result_sup = []

for date, group in train_financials_sup.groupby('Date'):
    features_agg, df_memory = Fundamental_Data_builder(group, df_memory.copy())
    result_sup.append(features_agg.copy())
    
df_output_sup = pd.concat(result_sup).astype(dtypes_dict)
Date_Names = df_output_sup.columns[df_output_sup.columns.str.startswith('Date')]
df_output_sup[Date_Names] = df_output_sup[Date_Names].astype(str)

df_output_sup.to_parquet('train_financials_sup.parquet')
df_memory.astype({'Date':str}).to_parquet('memory_sup.parquet')
with open('dict_lag_train_sup.pkl', 'wb') as f:
    pickle.dump(dict_lag, f)
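
The saved state is meant to be reloaded later, for example at inference time. The round trip of a dictionary of bounded deques can be checked with the sketch below; the file name, the temporary directory and the toy snapshots are assumptions made for the illustration.

```python
import collections
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical lag state: one bounded deque of DataFrames per horizon.
state = {h: collections.deque(maxlen=h) for h in (5, 1)}
for day in range(7):
    snapshot = pd.DataFrame({"NetSales": [100.0 + day]}, index=[1301])
    for buffer in state.values():
        buffer.append(snapshot)

path = os.path.join(tempfile.mkdtemp(), "dict_lag_demo.pkl")
with open(path, "wb") as f:  # the context manager closes the file even on error
    pickle.dump(state, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print({h: len(buffer) for h, buffer in restored.items()})
```

The `maxlen` of each deque survives pickling, so the restored state keeps discarding the oldest snapshot as new ones are appended.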

# EDA:

Average return by quantile of each feature, depending on the age of the news.
We mainly look at two things:
- Long-term relationships between features and target: we find some relationships between size / performance and target.
- Short-term impact of news (good or bad).

In [None]:
df_output = df_output.merge(train, on = ['Date','SecuritiesCode'], how='left')

numeric_features = df_output.columns[(df_output.dtypes=='float32')|(df_output.dtypes=='float64')]

# copy the selections: new columns are added to them in the next cell
data_change = df_output[(df_output['DaysSinceDisclosure']==1)][numeric_features].copy()
data_change2 = df_output[(df_output['DaysSinceDisclosure']>1)][numeric_features].copy()

In [None]:
error_plot = []

for c in numeric_features:
    try:
        print(c)
        plt.scatter(data_change[c],data_change.Target)
        plt.show()

        n = 10
        q = np.arange(n+1)/n

        # decile edges, copied so they can be modified, then opened at both ends
        quant = data_change[c].quantile(q=q).to_numpy(copy=True)
        quant[0] = -np.inf
        quant[n] = np.inf

        data_change['groups'] = pd.cut(data_change[c], quant, duplicates='drop')
        data_change['avg_Target'] = data_change.groupby('groups')['Target'].transform('mean')
        plt.plot(data_change.groupby('groups')['Target'].mean())

        data_change2['groups'] = pd.cut(data_change2[c], quant,duplicates='drop')
        data_change2['avg_Target'] = data_change2.groupby('groups')['Target'].transform('mean')
        plt.plot(data_change2.groupby('groups')['Target'].mean())

        plt.show()
        
    except Exception:
        error_plot.append(c)
        
print(error_plot)
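
The binning logic of the loop above can be isolated and checked on synthetic data. In this sketch the two samples, the helper names and the slopes are all made up: `fresh` stands in for rows with `DaysSinceDisclosure == 1` and `stale` for rows with `DaysSinceDisclosure > 1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def make_sample(n, slope):
    """Synthetic feature/target pair with a linear dependence of the given slope."""
    x = rng.normal(size=n)
    return pd.DataFrame({"feature": x, "Target": slope * x + rng.normal(scale=0.5, size=n)})

fresh = make_sample(1000, slope=0.2)  # the target reacts to the feature
stale = make_sample(1000, slope=0.0)  # no relationship left

# Decile edges estimated on the fresh sample, opened at both ends so that
# every value of the other sample also falls into a bin.
edges = np.array(fresh["feature"].quantile(np.linspace(0, 1, 11)))
edges[0], edges[-1] = -np.inf, np.inf

def mean_target_by_bin(df):
    """Average target per feature bin, using the shared edges."""
    bins = pd.cut(df["feature"], edges, labels=False)
    return df.groupby(bins)["Target"].mean()

fresh_curve = mean_target_by_bin(fresh)
stale_curve = mean_target_by_bin(stale)
print(pd.DataFrame({"fresh": fresh_curve, "stale": stale_curve}))
```

Using the same edges for both samples makes the two curves comparable bin by bin, which is what the two `plt.plot` calls above rely on.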