# FIAM Hackathon

## Importing the Necessary Libraries

In [1]:
import pandas as pd 
import pandas_datareader.data as web
import numpy as np 
import yfinance as yf 
import matplotlib.pyplot as plt 
import seaborn as sns 
import scipy as sp 
from sklearn import preprocessing, decomposition, model_selection, linear_model, feature_selection, ensemble, metrics  
rs = 123  # np.random.seed returns None, so keep the seed value itself in rs
np.random.seed(rs)
# pd.set_option("display.max_rows", None)  
# pd.set_option("display.max_columns", None) 
pd.set_option("display.float_format", "{:.4f}".format) 
import warnings 
warnings.filterwarnings("ignore") 

In [2]:
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats("svg") 

## Loading the Data 

In [3]:
ticker_data = pd.read_csv("C:/Users/khail/OneDrive/Desktop/Github Projects/ticker_data.csv", parse_dates = ["date"])
# Collapse each date to the first day of its month (e.g. 2005-03-31 becomes 2005-03-01)
ticker_data["date"] = ticker_data["date"].dt.strftime("%Y-%m")
ticker_data["date"] = pd.to_datetime(ticker_data["date"]) 
ticker_data.sort_values(by = ["date", "stock_ticker"], inplace = True, ignore_index = True)

In [4]:
ticker_data.shape 

(273373, 160)

## Variable Explanation 

- **age** : Firm age
- **aliq_at** : Liquidity of book assets 
- **aliq_mat** : Liquidity of market assets 
- **ami_126d** : Amihud measure
- **at_be** : Book leverage
- **at_gr1** : Asset growth 
- **at_me** : Assets-to-market
- **at_turnover** : Capital turnover 

- **be_gr1a** : Change in common equity
- **be_me** : Book-to-market equity
- **beta_60m** : Market beta
- **beta_dimson_21d** : Dimson beta
- **betabab_1260d** : Frazzini-Pedersen market beta
- **betadown_252d** : Downside beta
- **bev_mev** : Book-to-market enterprise value 
- **bidaskhl_21d** : High-low bid-ask spread

- **capex_abn** : Abnormal corporate investment
- **capx_gr1** : CAPEX growth (1 year)
- **capx_gr2** : CAPEX growth (2 years)
- **capx_gr3** : CAPEX growth (3 years)
- **cash_at** : Cash-to-assets 
- **chcsho_12m** : Net stock issues 
- **coa_gr1a** : Change in current operating assets 
- **col_gr1a** : Change in current operating liabilities
- **cop_at** : Cash-based operating profits-to-book assets
- **cop_atl1** : Cash-based operating profits-to-lagged book assets
- **corr_1260d** : Market correlation
- **coskew_21d** : Coskewness
- **cowc_gr1a** : Change in current operating working capital 

- **dbnetis_at** : Net debt issuance
- **debt_gr3** : Growth in book debt (3 years)
- **debt_me** : Debt-to-market 
- **dgp_dsale** : Change gross margin minus change sales
- **div12m_me** : Dividend yield
- **dolvol_126d** : Dollar trading volume (126 days)
- **dolvol_var_126d** : Coefficient of variation for dollar trading volume
- **dsale_dinv** : Change sales minus change Inventory
- **dsale_drec** : Change sales minus change receivables
- **dsale_dsga** : Change sales minus change SG&A

- **earnings_variability** : Earnings variability
- **ebit_bev** : Return on net operating assets
- **ebit_sale** : Profit margin
- **ebitda_mev** : EBITDA-to-market enterprise value
- **emp_gr1** : Hiring rate
- **eq_dur** : Equity duration
- **eqnetis_at** : Net equity issuance
- **eqnpo_12m** : Equity net payout
- **eqnpo_me** : Net payout yield
- **eqpo_me** : Payout yield

- **f_score** : Piotroski F-score
- **fcf_me** : Free cash flow-to-price
- **fnl_gr1a** : Change in financial liabilities
- **gp_at** : Gross profits-to-assets 
- **gp_atl1** : Gross profits-to-lagged assets

- **intrinsic_value** : Intrinsic value-to-market
- **inv_gr1** : Inventory growth
- **inv_gr1a** : Inventory change
- **iskew_capm_21d** : Idiosyncratic skewness from CAPM
- **iskew_ff3_21d** : Idiosyncratic skewness from the Fama-French 3-factor model
- **iskew_hxz4_21d** : Idiosyncratic skewness from the q-factor model 
- **ivol_capm_21d** : Idiosyncratic volatility from the CAPM (21 days)
- **ivol_capm_252d** : Idiosyncratic volatility from the CAPM (252 days)
- **ivol_ff3_21d** : Idiosyncratic volatility from the Fama-French 3-factor model 
- **ivol_hxz4_21d** : Idiosyncratic volatility from the q-factor model

- **kz_index** : Kaplan-Zingales index
- **lnoa_gr1a** : Change in long-term net operating assets 
- **lti_gr1a** : Change in long-term investments 
- **market_equity** : Market equity 
- **mispricing_mgmt** : Mispricing factor : Management
- **mispricing_perf** : Mispricing factor : Performance 
- **month** : Month 
- **mspread** : Market spread

- **ncoa_gr1a** : Change in non-current operating assets
- **ncol_gr1a** : Change in non-current operating liabilities
- **netdebt_me** : Net debt-to-price 
- **netis_at** : Net total issuance 
- **nfna_gr1a** : Change in net financial assets 
- **ni_ar1** : Earnings persistence 
- **ni_be** : Return on equity 
- **ni_inc8q** : Number of consecutive quarters with earnings increases 
- **ni_ivol** : Earnings volatility 
- **ni_me** : Earnings to price 
- **niq_at** : Quarterly return on assets 
- **niq_at_chg1** : Change in quarterly return on assets 
- **niq_be** : Quarterly return on equity 
- **niq_be_chg1** : Change in quarterly return on equity 
- **niq_su** : Standardized earnings surprise 
- **nncoa_gr1a** : Change in net non-current operating assets 
- **noa_at** : Net operating assets 
- **noa_gr1a** : Change in net operating assets 

- **o_score** : Ohlson O-score
- **oaccruals_at** : Operating accruals 
- **oaccruals_ni** : Percent operating accruals 
- **ocf_at** : Operating cash flow to assets 
- **ocf_at_chg1** : Change in operating cash flow to assets 
- **ocf_me** : Operating cash flow to market equity 
- **ocfq_saleq_std** : Standard deviation of operating cash flow to sales (quarterly)
- **op_at** : Operating profit to assets
- **op_atl1** : Operating profit to lagged assets 
- **ope_be** : Operating profit to book equity 
- **ope_bel1** : Operating profit to lagged book equity 
- **opex_at** : Operating expenses to assets 

- **pi_nix** : Pre-tax income (excluding extraordinary items)
- **ppeinv_gr1a** : PPE investments growth (1-year)
- **prc** : Price
- **prc_highprc_252d** : Price to 252-Day high 
- **qmj** : Quality minus junk 
- **qmj_growth** : Quality minus junk, growth subcomponent
- **qmj_prof** : Quality minus junk, profitability subcomponent
- **qmj_safety** : Quality minus junk, safety subcomponent

- **rd5_at** : Research and Development (5-year average) to assets
- **rd_me** : Research and Development to market equity
- **rd_sale** : Research and Development to sales
- **resff3_12_1** : Residual of the Fama-French 3-factor model (12-month, 1-month lag)
- **resff3_6_1** : Residual of the Fama-French 3-factor model (6-month, 1-month lag)
- **ret_1_0** : 1-month return 
- **ret_3_1** : 3-month return (excluding last month)
- **ret_6_1** : 6-month return (excluding last month)
- **ret_9_1** : 9-month return (excluding last month)
- **ret_12_1** : 12-month return (excluding last month)
- **ret_12_7** : 6-month return (7 to 12 months ago)
- **ret_60_12** : 5-year return (excluding last year)
- **rmax1_21d** : Maximum daily return (21 days)
- **rmax5_21d** : Maximum 5-day return (21 days)
- **rmax5_rvol_21d** : Maximum 5-day return to realized volatility (21 days)
- **rskew_21d** : Return skewness (21 days)
- **rvol_21d** : Realized volatility (21 days)

- **sale_bev** : Sales to book enterprise value 
- **sale_emp_gr1** : Sales to employee growth (1 year)
- **sale_gr1** : Sales growth (1 year)
- **sale_gr3** : Sales growth (3 years)
- **sale_me** : Sales to market equity
- **saleq_gr1** : Quarterly sales growth (1 year)
- **saleq_su** : Sales surprise (quarterly)
- **seas_1_1an** : Seasonality 1-Year (annual) 
- **seas_1_1na** : Seasonality 1-Year (non-annual) 
- **seas_2_5an** : Seasonality 2-Year to 5-Year (annual) 
- **seas_2_5na** : Seasonality 2-Year to 5-Year (non-annual) 
- **sti_gr1a** : Short-term investment growth (1 year) 

- **taccruals_at** : Total accruals 
- **taccruals_ni** : Percent total accruals 
- **tangibility** : Asset tangibility
- **tax_gr1a** : Tax expense surprise 
- **turnover_126d** : Share turnover 
- **turnover_var_126d** : Coefficient of variation for share turnover 

- **year** : Year
- **z_score** : Altman Z-score
- **zero_trades_126d** : Number of zero trades with turnover as tiebreaker (6 months)
- **zero_trades_21d** : Number of zero trades with turnover as tiebreaker (1 month)
- **zero_trades_252d** : Number of zero trades with turnover as tiebreaker (12 months) 

## 1. Data Preparation

### 1.1 Drop Uninformative Features 

In [5]:
ticker_data.drop(columns = ["exchcd", "shrcd", "rf"], inplace = True)

### 1.2 Drop Forward-Looking Features 

In [6]:
ticker_data.drop(columns = ["eps_medest", "eps_meanest", "eps_stdevest", "eps_actual"], inplace = True)

### 1.3 Drop Observations with Missing Tickers

In [7]:
ticker_data.dropna(subset = ["stock_ticker"], inplace = True)

### 1.4 Dummy Variables : Month & Sector

In [8]:
month_mapper = {1 : "Jan", 2 : "Feb", 3 : "Mar", 4 : "Apr", 5 : "May", 6 : "Jun", 
          7 : "Jul", 8 : "Aug", 9 : "Sept", 10 : "Oct", 11 : "Nov", 12 : "Dec"}
ticker_data["month"] = ticker_data["month"].map(month_mapper)

In [9]:
ticker_data = pd.get_dummies(data = ticker_data , columns = ["month"], drop_first = True, 
                             dtype = "int", prefix = "", prefix_sep = "")
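
A minimal sketch on toy data (not the hackathon file) of what `drop_first = True` does here. pandas sorts string categories alphabetically, so the dropped baseline month is "Apr" rather than "Jan"; the four-month frame below is an assumption made purely for illustration.

```python
import pandas as pd

# Toy frame with four months only (illustrative, not the real data)
toy = pd.DataFrame({"month": [1, 2, 4, 12]})
toy["month"] = toy["month"].map({1: "Jan", 2: "Feb", 4: "Apr", 12: "Dec"})

dummies = pd.get_dummies(data=toy, columns=["month"], drop_first=True,
                         dtype="int", prefix="", prefix_sep="")

# "Apr" sorts first alphabetically, so it becomes the omitted baseline
print(dummies.columns.tolist())
# The April row is encoded as all zeros
print(dummies.iloc[2].tolist())
```

If January is the preferred baseline, one option is to convert `month` to an ordered categorical before calling `pd.get_dummies`.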

### 1.5 Filter for the Most Traded Stocks Per Month 

In [10]:
ticker_data = ticker_data.sort_values(by = ["date", "dolvol_126d"], ascending = [True, False])
ticker_data = ticker_data.groupby("date").head(500).reset_index(drop = True)
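
The sort-then-`head` pattern can be checked on a small example. The sketch below uses made-up tickers and volumes and keeps the top 2 names per month instead of 500; the name `top_n` is introduced only for illustration.

```python
import pandas as pd

# Made-up data: three tickers over two months
toy = pd.DataFrame({
    "date": ["2000-01"] * 3 + ["2000-02"] * 3,
    "stock_ticker": ["A", "B", "C", "A", "B", "C"],
    "dolvol_126d": [5.0, 9.0, 1.0, 2.0, 3.0, 8.0],
})

top_n = 2  # the notebook uses 500

# Within each month, rows are ordered from highest to lowest dollar volume
toy = toy.sort_values(by=["date", "dolvol_126d"], ascending=[True, False])

# head(top_n) then keeps the first top_n rows of every month
top = toy.groupby("date").head(top_n).reset_index(drop=True)
print(top)
```

Because `sort_values` places missing values last by default, rows with a missing `dolvol_126d` are only kept when a month has fewer than `top_n` non-missing rows.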

### 1.6 In-Sample Vs Out-of-Sample

In [11]:
ticker_data["date"] = ticker_data["date"].dt.strftime("%Y-%m")
ticker_train = ticker_data[(ticker_data["date"] >= "2000-01") & (ticker_data["date"] < "2010-01")].copy( )
ticker_test = ticker_data[(ticker_data["date"] >= "2010-01")].copy( )
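
The split relies on `"%Y-%m"` strings comparing in chronological order, which holds because the year and month are zero-padded. A small sanity check on a synthetic monthly calendar (an assumption; the real file may have gaps) confirms the boundaries and the size of the in-sample window:

```python
import pandas as pd

# Synthetic month-start calendar spanning both periods
dates = pd.Series(pd.date_range("1999-11-01", "2010-02-01", freq="MS")).dt.strftime("%Y-%m")

train = dates[(dates >= "2000-01") & (dates < "2010-01")]
test = dates[dates >= "2010-01"]

print(train.iloc[0], train.iloc[-1], len(train))
print(test.tolist())
```

With a complete calendar the in-sample period contains 120 distinct months, which is the budget available to the cross-validation scheme in the next section.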

## 2. Custom Time Series CV 

### 2.1 Rolling Fixed Window : Avoiding Lookahead Bias

In [12]:
class RollingTimeSeriesCV :
    
    def __init__(self, train_duration, test_duration, lookahead, n_splits) :
        self.test_duration = test_duration 
        self.train_duration = train_duration 
        self.lookahead = lookahead
        self.n_splits = n_splits

    def split(self, X, y = None, groups = None) :

        unique_dates = X["date"].unique( )  ## Extract unique dates 
        days = sorted(unique_dates, reverse = True)   ## Sort unique dates in descending order 
        
        split_idx = [  ]
        for i in range(self.n_splits) :
            
            test_end_idx = i * self.test_duration 
            test_start_idx = test_end_idx + self.test_duration - 1
                
            train_end_idx = test_start_idx + 1 + self.lookahead 
            train_start_idx = train_end_idx + self.train_duration - 1 
            
            split_idx.append([train_start_idx, train_end_idx, test_start_idx, test_end_idx])

        for split in split_idx :
            train_start_idx, train_end_idx, test_start_idx, test_end_idx = split
    
            # Translate indices to dates using the 'days' list
            train_start_date = days[train_start_idx] if train_start_idx < len(days) else None
            train_end_date = days[train_end_idx] if train_end_idx < len(days) else None
            test_start_date = days[test_start_idx] if test_start_idx < len(days) else None
            test_end_date = days[test_end_idx] if test_end_idx < len(days) else None
            
            yield train_start_date, train_end_date, test_start_date, test_end_date 
        
    def get_n_splits(self, X = None, y = None, groups = None) :
        return self.n_splits
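
The class above yields date boundaries, whereas scikit-learn utilities such as `cross_val_score` expect `split` to yield arrays of positional row indices. One possible bridge is sketched below; `dates_to_indices` is a hypothetical helper, not part of the notebook, and it assumes the `date` column holds `"%Y-%m"` strings as in Section 1.6.

```python
import numpy as np
import pandas as pd

def dates_to_indices(X, train_start, train_end, test_start, test_end):
    """Hypothetical helper: map inclusive date boundaries to positional row indices."""
    dates = X["date"]
    train_mask = (dates >= train_start) & (dates <= train_end)
    test_mask = (dates >= test_start) & (dates <= test_end)
    # flatnonzero returns the integer positions of the True entries
    return np.flatnonzero(train_mask.to_numpy()), np.flatnonzero(test_mask.to_numpy())

# Toy frame: 2000-03 plays the role of the lookahead gap
toy = pd.DataFrame({"date": ["2000-01", "2000-01", "2000-02",
                             "2000-03", "2000-04", "2000-04"]})

train_idx, test_idx = dates_to_indices(toy, "2000-01", "2000-02", "2000-04", "2000-04")
print(train_idx, test_idx)
```

The positions refer to row order, so the frame should have been passed through `reset_index(drop = True)` or be indexed with `iloc`.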

### 2.2 Testing Time Series CV 

**Key Consideration** :
> $\mathrm{train\_duration} + \mathrm{test\_duration} \times \mathrm{n\_splits} + \mathrm{lookahead} \le \text{total number of months in training set}$

In [13]:
cv1 = RollingTimeSeriesCV(train_duration = 60, test_duration = 6, lookahead = 1, n_splits = 8)

In [14]:
for train_start_date, train_end_date, test_start_date, test_end_date in cv1.split(X = ticker_train) :
    
    print(f"Train : {train_start_date} to {train_end_date} , Test : {test_start_date} to {test_end_date}")
    print("")

Train : 2004-06 to 2009-05 , Test : 2009-07 to 2009-12

Train : 2003-12 to 2008-11 , Test : 2009-01 to 2009-06

Train : 2003-06 to 2008-05 , Test : 2008-07 to 2008-12

Train : 2002-12 to 2007-11 , Test : 2008-01 to 2008-06

Train : 2002-06 to 2007-05 , Test : 2007-07 to 2007-12

Train : 2001-12 to 2006-11 , Test : 2007-01 to 2007-06

Train : 2001-06 to 2006-05 , Test : 2006-07 to 2006-12

Train : 2000-12 to 2005-11 , Test : 2006-01 to 2006-06



In [15]:
cv2 = RollingTimeSeriesCV(train_duration = 72, test_duration = 3, lookahead = 3, n_splits = 12)

In [16]:
for train_start_date, train_end_date, test_start_date, test_end_date in cv2.split(X = ticker_train) :
    
    print(f"Train : {train_start_date} to {train_end_date} , Test : {test_start_date} to {test_end_date}")
    print("")

Train : 2003-07 to 2009-06 , Test : 2009-10 to 2009-12

Train : 2003-04 to 2009-03 , Test : 2009-07 to 2009-09

Train : 2003-01 to 2008-12 , Test : 2009-04 to 2009-06

Train : 2002-10 to 2008-09 , Test : 2009-01 to 2009-03

Train : 2002-07 to 2008-06 , Test : 2008-10 to 2008-12

Train : 2002-04 to 2008-03 , Test : 2008-07 to 2008-09

Train : 2002-01 to 2007-12 , Test : 2008-04 to 2008-06

Train : 2001-10 to 2007-09 , Test : 2008-01 to 2008-03

Train : 2001-07 to 2007-06 , Test : 2007-10 to 2007-12

Train : 2001-04 to 2007-03 , Test : 2007-07 to 2007-09

Train : 2001-01 to 2006-12 , Test : 2007-04 to 2007-06

Train : 2000-10 to 2006-09 , Test : 2007-01 to 2007-03
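
Both configurations can be checked against the key consideration with simple arithmetic. The sketch assumes the in-sample period has 120 distinct months (2000-01 to 2009-12, none missing); `required_months` is a name introduced here for illustration.

```python
def required_months(train_duration, test_duration, lookahead, n_splits):
    """Months needed so that the earliest training window still fits in the data."""
    return train_duration + test_duration * n_splits + lookahead

n_months = 120  # assumption: every month from 2000-01 to 2009-12 is present

cv1_need = required_months(train_duration=60, test_duration=6, lookahead=1, n_splits=8)
cv2_need = required_months(train_duration=72, test_duration=3, lookahead=3, n_splits=12)

print(cv1_need, cv1_need <= n_months)  # 109 True
print(cv2_need, cv2_need <= n_months)  # 111 True
```

The unused months (11 for `cv1`, 9 for `cv2`) match the outputs above, where the earliest training windows start at 2000-12 and 2000-10 rather than 2000-01.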



## 3. Feature Selection 

In [17]:
X_train = ticker_train.drop(columns = ["stock_ticker", "stock_exret"]).set_index("date")
y_train = ticker_train["stock_exret"]

In [18]:
X_test = ticker_test.drop(columns = ["stock_ticker", "stock_exret"]).set_index("date")
y_test = ticker_test["stock_exret"]

### 3.1 F-Statistic 

In [19]:
def select_features(X, y, n_features = 50) :
    
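    # SelectKBest and f_regression live in sklearn.feature_selection, imported in In [1]
    selector = feature_selection.SelectKBest(feature_selection.f_regression, k = n_features)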
    selector.fit(X, y)
    selected_features = X.columns[selector.get_support()].tolist()
    return selected_features
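
A quick way to see the F-statistic filter at work is to run it on synthetic data where the informative columns are known. Everything below (column names, coefficients, sample size) is made up for illustration; the real features may also contain missing values, which `f_regression` does not accept, so they would need to be handled first.

```python
import numpy as np
import pandas as pd
from sklearn import feature_selection

rng = np.random.default_rng(0)

# Five synthetic features; only "b" and "d" drive the target
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = 3.0 * X["b"] - 2.0 * X["d"] + rng.normal(scale=0.1, size=200)

# Keep the k features with the largest univariate F-statistics
selector = feature_selection.SelectKBest(feature_selection.f_regression, k=2)
selector.fit(X, y)

selected = X.columns[selector.get_support()].tolist()
print(selected)
```

Because the test is univariate, it scores each feature in isolation and cannot detect features that only matter through interactions.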

### 3.2 Random Forest 

In [139]:
ts_cv = model_selection.TimeSeriesSplit(n_splits = 10)
rf_reg = ensemble.RandomForestRegressor(random_state = 42)
rf_reg.fit(X_train, y_train)
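
The notebook stops at the fit. One plausible next step, shown here only as a sketch on synthetic data, is to rank features by the fitted forest's impurity-based importances; the feature names and the choice of keeping the top 2 are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble

rng = np.random.default_rng(0)

# Four synthetic features; only "f2" drives the target
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
y = 2.0 * X["f2"] + rng.normal(scale=0.1, size=300)

rf = ensemble.RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X, y)

# feature_importances_ sums to 1; a higher value means more impurity reduction
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
top_features = importances.head(2).index.tolist()
print(importances)
```

Impurity-based importances are computed on the training data, so `sklearn.inspection.permutation_importance` on held-out folds is a common cross-check.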