# Exercise Instructions: Panel Data Modeling with Machine Learning Models

**Objective:**
The goal of this exercise is to practice panel data modeling skills using three machine learning models (Random Forest, Single Decision Tree, and Linear Regression with Elastic Net) that have not been utilized in the project so far. Completing the entire task or a significant portion during the class will earn you an additional 7 points (above what is outlined in the syllabus) towards your final grade.

**Tasks:**

1. **GitHub Setup:**
   - If you haven't done so already, [create](https://github.com/join) a GitHub account.
   - [Download](https://desktop.github.com) and [configure](https://docs.github.com/en/desktop/configuring-and-customizing-github-desktop/configuring-basic-settings-in-github-desktop) GitHub Desktop on your laptop. An introduction to the GitHub Desktop app is available [here](https://joshuadull.github.io/GitHub-Desktop/02-getting-started/index.html). If you prefer the git command line, follow these [instructions](https://github.com/michaelwozniak/ml2_tools?tab=readme-ov-file#git).
2. **Repository Forking:**
   - [Fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) the following repository to your own GitHub account: [https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates](https://github.com/michaelwozniak/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates)

3. **Repository Cloning:**
   - [Clone](https://docs.github.com/en/desktop/adding-and-cloning-repositories/cloning-a-repository-from-github-to-github-desktop) the forked repository to your local computer using GitHub Desktop.

4. **Notebook Exploration:**
   - Open the file `notebooks/10.exercise.ipynb` to begin the ML tasks.

5. **Model Creation:**

   In the file `notebooks/10.exercise.ipynb`:
   - Create the following models:
      1. Random Forest ([RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html))
      2. Decision Tree ([DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html))
      3. Linear Regression with Elastic Net ([ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html))
   
   Follow a similar process to the models presented in class (e.g., KNN - `notebooks/07.knn-model.ipynb`):
      - Load the prepared training data.
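      - Perform feature engineering if deemed necessary (note: the tree-based models, Random Forest and Decision Tree, do not require data standardization, unlike SVM and KNN; the Elastic Net penalty is scale-sensitive, so standardizing its inputs is advisable).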
      - Conduct feature selection.
      - Perform hyperparameter tuning.
      - Identify a local champion for each model class (the best model for RF, DT, Elastic Net).
      - Save local champions to a pickle file.

6. **Model Evaluation:**
   - In the notebook `notebooks/09.final-comparison-and-summary.ipynb`, load the models you created and check if they outperform the previously used models.

7. **Version Control:**
   - At the end of the class, even if the tasks are incomplete, [commit](https://docs.github.com/en/desktop/making-changes-in-a-branch/committing-and-reviewing-changes-to-your-project-in-github-desktop) your changes using GitHub Desktop.
   - [Push](https://docs.github.com/en/desktop/making-changes-in-a-branch/pushing-changes-to-github-from-github-desktop) your changes to your remote GitHub repository.

8. **Submission:**
   - Send me the link to your GitHub project (my email: *mj.wozniak9@uw.edu.pl*).

Good luck with the exercise! If you have any questions, feel free to ask.
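
For the last step of Model Creation (saving local champions to a pickle file), the following is a minimal sketch of a save and load round trip. It runs on synthetic data rather than the course dataset, and the temporary directory and the file name `dt_demo.sav` are hypothetical stand-ins for the project's model folder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (not the course dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Pretend this fitted tree is the local champion of its model class.
champion = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Hypothetical location used only for this demo.
model_path = Path(tempfile.mkdtemp()) / "dt_demo.sav"

# The context manager closes the file even if pickling fails.
with open(model_path, "wb") as f:
    pickle.dump(champion, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(champion.predict(X_demo), restored.predict(X_demo))
print("round trip preserved predictions:", same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that wrote them, so it may help to record the version next to the file.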

In [1]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import ElasticNet

from sklearn.model_selection import (GridSearchCV, TimeSeriesSplit,
                                     train_test_split)
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


In [2]:
import sklearn, sys
print("sklearn:", sklearn.__version__)
print("Python executable:", sys.executable)


sklearn: 1.6.1
Python executable: /Users/igor/Downloads/ML_dod1/.venv/bin/python


In [3]:
DATA_DIR   = Path("../data/output")
MODEL_DIR  = Path("../models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

df   = pd.read_csv(DATA_DIR / "train_fe.csv", index_col=0)
rank = pd.read_excel(DATA_DIR / "feature_ranking.xlsx", index_col=0)

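TARGET = "etr"
# Sort chronologically ("rok" is the year column) so that TimeSeriesSplit
# folds never train on rows that come after their validation rows.
df = df.sort_values("rok").reset_index(drop=True)
X, y = df.drop(columns=TARGET), df[TARGET]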


In [4]:
benchmark2 = [
    "ta",
    "txt",
    "pi",
    "str",
    "xrd",
    "ni",
    "ppent",
    "intant",
    "dlc",
    "dltt",
    "capex",
    "revenue",
    "cce",
    "adv",
    "diff",
    "roa",
    "lev",
    "intan",
    "rd",
    "ppe",
    "sale",
    "cash_holdings",
    "adv_expenditure",
    "capex2",
    "cfc",
    "dta",
    "y_v2x_polyarchy",
    "WB_GDPgrowth",
    "WB_GDPpc",
    "WB_Inflation",
    "rr_per_country",
    "rr_per_sector",
    "etr_y_past",
    "etr_y_ma",
    "diff_ma",
    "roa_ma",
    "lev_ma",
    "intan_ma",
    "ppe_ma",
    "sale_ma",
    "cash_holdings_ma",
    "roa_past",
    "lev_past",
    "intan_past",
    "ppe_past",
    "sale_past",
    "cash_holdings_past",
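]
# First 35 rows of the ranking file as stored; this assumes the file is
# already ordered from best to worst feature.
mi_features_35 = rank.iloc[:35].index.tolist()

# Top 60 features by mutual-information score, sorted explicitly.
tree_feats = rank.sort_values("mi_score", ascending=False).iloc[:60].index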
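
The `mi_score` column suggests that `feature_ranking.xlsx` holds a mutual-information ranking, although the notebook that produced it is not shown here. As a rough sketch under that assumption, a comparable ranking table could be built as follows; the data and the column names `signal_strong`, `signal_weak` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one strongly informative, one weakly informative
# and one irrelevant feature (hypothetical names).
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "signal_strong": rng.normal(size=n),
    "signal_weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = (2.0 * X_demo["signal_strong"]
          + 0.3 * X_demo["signal_weak"]
          + rng.normal(scale=0.1, size=n))

# Estimate the mutual information between each feature and the target.
mi = mutual_info_regression(X_demo, y_demo, random_state=0)

# Ranking table indexed by feature name, best feature first.
rank_demo = (pd.DataFrame({"mi_score": mi}, index=X_demo.columns)
             .sort_values("mi_score", ascending=False))

# Same slicing pattern as in the cell above, here with the top 2 features.
top_2 = rank_demo.iloc[:2].index.tolist()
print(rank_demo)
print(top_2)
```

Sorting before slicing, as done for `tree_feats`, keeps the selection correct even if the stored file is not ordered by score.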


In [None]:
# TimeSeriesSplit, GridSearchCV, RFECV, SelectFromModel, Pipeline and
# mean_squared_error are already imported in the first cell.

# Time-ordered splits: 6 folds, each validated on the next 363 rows
# (rows are sorted by year in the data-loading cell).
tscv = TimeSeriesSplit(n_splits=6, test_size=363)

# Built-in scikit-learn scorer: negated RMSE, so higher is better.
SCORER = "neg_root_mean_squared_error"

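def train_champion(model_name: str,
                   base_estimator,
                   param_grid: dict,
                   feature_candidates,
                   scale=False):
    """Tune, select features for, evaluate and save one model class.

    Steps: grid search over ``param_grid`` on ``feature_candidates``,
    feature selection with the tuned estimator, error estimation on the
    ``tscv`` folds, then a final refit on all rows and a pickle dump.
    Returns (mean fold error, std of fold error, fitted pipeline).
    """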

    steps = []
    if scale:
        steps.append(("sc", StandardScaler()))
    steps.append(("model", base_estimator))
    pipe = Pipeline(steps)

    search = GridSearchCV(
        pipe,
        param_grid={f"model__{k}": v for k, v in param_grid.items()},
        cv=5,
        scoring=SCORER,
        n_jobs=-1,
    )
    search.fit(X[feature_candidates], y)
    best_pipe = search.best_estimator_

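    # Tree models expose feature_importances_, so recursive feature
    # elimination with cross-validation (RFECV) can drop one feature per step.
    if isinstance(base_estimator, (RandomForestRegressor, DecisionTreeRegressor)):
        fs = RFECV(
            estimator=best_pipe,
            step=1,
            cv=5,
            scoring=SCORER,
            n_jobs=-1,
            importance_getter="named_steps.model.feature_importances_",
        )
    else:
        # Linear model: keep features whose absolute coefficient is at least
        # the median, reusing the already fitted model (prefit=True).
        fs = SelectFromModel(
            estimator=best_pipe.named_steps["model"],
            prefit=True,
            threshold="median",
        )
    fs.fit(X[feature_candidates], y)
    selected_feats = X[feature_candidates].columns[fs.get_support()]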

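    # Re-evaluate on the time-ordered folds using only the selected features.
    # mean_squared_error returns the MSE (not the RMSE), so the values
    # collected here are squared errors; apply np.sqrt to obtain an RMSE.
    cv_val = []
    for tr_idx, te_idx in tscv.split(X):
        best_pipe.fit(X.iloc[tr_idx][selected_feats], y.iloc[tr_idx])
        preds = best_pipe.predict(X.iloc[te_idx][selected_feats])
        cv_val.append(mean_squared_error(y.iloc[te_idx], preds))

    mean_mse, std_mse = np.mean(cv_val), np.std(cv_val)

    # Refit the champion on all training rows and persist it.
    best_pipe.fit(X[selected_feats], y)
    with open(MODEL_DIR / f"{model_name}.sav", "wb") as f:
        pickle.dump(best_pipe, f)

    print(f"{model_name}: {mean_mse:.4f} ± {std_mse:.4f}  |  {len(selected_feats)} feats")
    return mean_mse, std_mse, best_pipe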
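
The function above passes `cv=5` to `GridSearchCV` and `RFECV`, which for a regressor means plain K-fold without shuffling, while the time-ordered `tscv` splitter is used only in the final evaluation loop. If time-ordered validation is also wanted during tuning, a splitter object can be passed as `cv` directly. The sketch below illustrates this on synthetic data; the names `tscv_demo`, `X_demo` and `y_demo` and the small grid are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, already "time-ordered" data (row order stands in for time).
rng = np.random.default_rng(0)
n_rows = 300
X_demo = rng.normal(size=(n_rows, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=n_rows)

# Three folds, each validated on the 50 rows that follow its training rows.
tscv_demo = TimeSeriesSplit(n_splits=3, test_size=50)
folds = list(tscv_demo.split(X_demo))
for fold, (tr_idx, te_idx) in enumerate(folds):
    # Every training index precedes every validation index.
    print(fold, "train:", tr_idx.min(), "-", tr_idx.max(),
          "| validate:", te_idx.min(), "-", te_idx.max())

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([("sc", StandardScaler()), ("model", ElasticNet(max_iter=5000))])
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": [0.01, 0.1, 1.0],
                "model__l1_ratio": [0.1, 0.5, 0.9]},
    cv=tscv_demo,                      # splitter object instead of an integer
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# The scorer is negated, so flip the sign to report an RMSE.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 4))
```

Whether this is preferable depends on the exercise goals: time-ordered folds give a more honest estimate for forecasting, but they train on fewer rows in the early folds.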


In [6]:
rf_grid = dict(
    n_estimators=[200, 500, 800],
    max_depth=[None, 10, 20, 30],
    min_samples_leaf=[1, 3, 5],
    max_features=["sqrt", "log2", 0.5],
)

dt_grid = dict(
    max_depth=[None, 5, 10, 20, 30],
    min_samples_leaf=[1, 3, 5, 10],
    min_samples_split=[2, 5, 10],
)

en_grid = dict(
    alpha=[0.01, 0.1, 1, 5, 10],
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
    max_iter=[5000],
)
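
Before launching the searches, it can be useful to estimate how many model fits each grid implies. The sketch below copies the three grids into hypothetical `*_demo` variables and counts the candidates with `ParameterGrid`; the fold count of 5 matches the `cv=5` used in the grid search.

```python
from sklearn.model_selection import ParameterGrid

# Copies of the grids defined above, kept separate for this illustration.
rf_grid_demo = dict(
    n_estimators=[200, 500, 800],
    max_depth=[None, 10, 20, 30],
    min_samples_leaf=[1, 3, 5],
    max_features=["sqrt", "log2", 0.5],
)
dt_grid_demo = dict(
    max_depth=[None, 5, 10, 20, 30],
    min_samples_leaf=[1, 3, 5, 10],
    min_samples_split=[2, 5, 10],
)
en_grid_demo = dict(
    alpha=[0.01, 0.1, 1, 5, 10],
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
    max_iter=[5000],
)

n_folds = 5  # number of cross-validation folds in the grid search
fit_counts = {}
for name, grid in [("rf", rf_grid_demo), ("dt", dt_grid_demo), ("elastic", en_grid_demo)]:
    n_candidates = len(ParameterGrid(grid))   # product of the list lengths
    fit_counts[name] = (n_candidates, n_candidates * n_folds)
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} CV fits")
```

This gives 108 candidates (540 fits) for the Random Forest, 60 (300) for the Decision Tree and 30 (150) for Elastic Net, not counting the final refit or the later feature-selection step. The Random Forest search therefore dominates the runtime.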


In [7]:
rf_score = train_champion("rf",
                          RandomForestRegressor(random_state=0, n_jobs=-1),
                          rf_grid,
                          tree_feats,    
                          scale=False)

dt_score = train_champion("dt",
                          DecisionTreeRegressor(random_state=0),
                          dt_grid,
                          tree_feats,
                          scale=False)

# Elastic Net is the only model here that is fitted on standardized inputs.
en_score = train_champion("elastic",
                          ElasticNet(),
                          en_grid,
                          mi_features_35,
                          scale=True)


rf: 0.0198 ± 0.0034  |  56 feats
dt: 0.0214 ± 0.0040  |  3 feats
elastic: 0.0221 ± 0.0029  |  35 feats
