## Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
file_path = 'JPMaQS_Quantamental_Indicators.csv'
df = pd.read_csv(file_path, parse_dates=['real_date'])

print(df.columns)

Index(['Unnamed: 0', 'real_date', 'cid', 'xcat', 'value', 'grading', 'eop_lag',
       'mop_lag'],
      dtype='object')


In [3]:
# Keep the relevant columns; .copy() avoids chained-assignment warnings below
df = df[['cid', 'xcat', 'real_date', 'value', 'grading', 'eop_lag', 'mop_lag']].copy()
df['ticker'] = df['cid'] + "_" + df['xcat']

In [4]:
# Cross-sections and categories of interest
cids = ["AUD", "CAD", "CHF", "EUR", "GBP", "JPY", "NOK", "NZD", "SEK", "USD"]
xcats = [
    "RYLDIRS05Y_NSA", "INTRGDPv5Y_NSA_P1M1ML12_3MMA", "CPIC_SJA_P6M6ML6AR",
    "CPIH_SA_P1M1ML12", "INFTEFF_NSA", "PCREDITBN_SJA_P1M1ML12",
    "RGDP_SA_P1Q1QL4_20QMA"
]
df = df[df['cid'].isin(cids) & df['xcat'].isin(xcats)]

In [5]:
# Separate columns for each xcat
df_pivot = df.pivot_table(index=['real_date', 'cid'], columns='xcat', values='value').reset_index()

print(df_pivot.head())

xcat  real_date  cid  CPIC_SJA_P6M6ML6AR  CPIH_SA_P1M1ML12  INFTEFF_NSA  \
0    2000-01-03  AUD            1.428580          1.647446     1.874567   
1    2000-01-03  CAD            1.709066          2.292576     1.749144   
2    2000-01-03  CHF                 NaN          1.663356     0.827757   
3    2000-01-03  EUR                 NaN          1.446079          NaN   
4    2000-01-03  GBP            0.314695          1.156351     2.005796   

xcat  INTRGDPv5Y_NSA_P1M1ML12_3MMA  PCREDITBN_SJA_P1M1ML12  \
0                         0.247776                9.517471   
1                         1.788620                6.888624   
2                         0.381352                4.423255   
3                              NaN                     NaN   
4                        -0.108668                     NaN   

xcat  RGDP_SA_P1Q1QL4_20QMA  RYLDIRS05Y_NSA  
0                  4.307267        5.391185  
1                  3.303088        4.637855  
2                  1.156352        2.3

In [6]:
# Handling missing values
df_pivot.sort_values(by=['cid', 'real_date'], inplace=True)
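# Forward-fill gaps (fillna(method='ffill') is deprecated in recent pandas).
# Note: a plain forward fill also carries values across cid boundaries.
df_pivot.ffill(inplace=True)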
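
Because the frame is sorted by `cid` and then `real_date`, a plain forward fill can carry the last value of one cross-section into the leading gaps of the next. The following is a minimal sketch of a group-wise alternative on a toy panel; the frame `toy` and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values): CAD has no observation on its first date.
toy = pd.DataFrame({
    "cid": ["AUD", "AUD", "CAD", "CAD"],
    "real_date": pd.to_datetime(["2000-01-03", "2000-01-04"] * 2),
    "value": [1.0, 2.0, np.nan, 4.0],
}).sort_values(["cid", "real_date"])

# Plain forward fill: the last AUD value (2.0) leaks into the first CAD row.
plain = toy["value"].ffill()

# Group-wise forward fill: each cid is filled only from its own history,
# so the first CAD row stays missing.
grouped = toy.groupby("cid")["value"].ffill()

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
```

Switching the notebook to the group-wise version would change the downstream numbers, so the reported results would need to be re-run.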


In [7]:
# Signal constituents
df_pivot['XGDP_NEG'] = -df_pivot['INTRGDPv5Y_NSA_P1M1ML12_3MMA']
df_pivot['XCPI_NEG'] = - (df_pivot['CPIC_SJA_P6M6ML6AR'] + df_pivot['CPIH_SA_P1M1ML12']) / 2 + df_pivot['INFTEFF_NSA']
df_pivot['XPCG_NEG'] = - df_pivot['PCREDITBN_SJA_P1M1ML12'] + df_pivot['INFTEFF_NSA'] + df_pivot['RGDP_SA_P1Q1QL4_20QMA']

In [8]:
# Relevant columns
selected_cols = ['real_date', 'cid', 'XGDP_NEG', 'XCPI_NEG', 'XPCG_NEG', 'RYLDIRS05Y_NSA']
df_pivot = df_pivot[selected_cols].copy()  # copy so new columns can be added safely

In [9]:
# Normalize feature variables using z-scores
for col in ['XGDP_NEG', 'XCPI_NEG', 'XPCG_NEG', 'RYLDIRS05Y_NSA']:
    df_pivot[col + '_ZN4'] = (df_pivot[col] - df_pivot[col].mean()) / df_pivot[col].std()
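
The z-score above uses the mean and standard deviation of the full sample, so each normalized value depends on observations that come later in time. As a small illustration on a hypothetical series `s`, the sketch below compares the full-sample z-score with an expanding-window variant that only uses data up to each date. The expanding variant is an assumption offered for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical values

# Full-sample z-score, as in the cell above (pandas std() defaults to ddof=1).
z_full = (s - s.mean()) / s.std()

# Expanding z-score: statistics at each row use only that row and earlier rows.
exp_mean = s.expanding(min_periods=2).mean()
exp_std = s.expanding(min_periods=2).std()
z_expanding = (s - exp_mean) / exp_std

print(pd.DataFrame({"z_full": z_full, "z_expanding": z_expanding}))
```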

In [10]:
# Composite indicator
df_pivot['MACRO_AVGZ'] = df_pivot[['XGDP_NEG_ZN4', 'XCPI_NEG_ZN4', 'XPCG_NEG_ZN4', 'RYLDIRS05Y_NSA_ZN4']].mean(axis=1)

In [11]:
# Monthly frequency
df_pivot.set_index('real_date', inplace=True)
df_monthly = df_pivot.resample('M').last()
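
With `real_date` as the only index, `resample('M').last()` groups rows by month alone, so all cross-sections in a month collapse into a single row. If one row per `cid` and month is wanted, a group-wise version along the following lines may be closer to the intent. This is a sketch on a hypothetical toy panel; the `month` column is a helper introduced here for illustration.

```python
import pandas as pd

dates = pd.to_datetime(["2000-01-03", "2000-01-31", "2000-02-01", "2000-02-29"])
panel = pd.DataFrame({
    "real_date": list(dates) * 2,
    "cid": ["AUD"] * 4 + ["CAD"] * 4,
    "MACRO_AVGZ": [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4],  # hypothetical values
})

# Helper column: the calendar month of each observation.
panel["month"] = panel["real_date"].dt.to_period("M")

# Last observation per cid and month, keeping the panel structure.
monthly = (
    panel.sort_values(["cid", "real_date"])
    .groupby(["cid", "month"], as_index=False)
    .last()
)

print(monthly)
```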

In [12]:
# Drop rows with missing values
df_monthly.dropna(inplace=True)

In [13]:
# Features and target
X = df_monthly[['XGDP_NEG_ZN4', 'XCPI_NEG_ZN4', 'XPCG_NEG_ZN4', 'RYLDIRS05Y_NSA_ZN4']]
y = df_monthly['MACRO_AVGZ']
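
Note that `MACRO_AVGZ` was defined as the row-wise mean of the same four `_ZN4` columns used as features, so on rows without missing values the target is an exact linear function of `X`. The sketch below illustrates this on synthetic stand-in data (`X_toy` and `y_toy` are hypothetical): a plain linear regression recovers a coefficient of 0.25 per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))  # four stand-in z-score columns
y_toy = X_toy.mean(axis=1)         # target built the same way as MACRO_AVGZ

lin = LinearRegression().fit(X_toy, y_toy)
coefs = lin.coef_
score = lin.score(X_toy, y_toy)

print("Coefficients:", coefs)
print("R²:", score)
```

This makes a linear model a useful baseline for judging the tree-based models that follow.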

In [14]:
# Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
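
`train_test_split` shuffles rows by default, so with date-indexed data the test set is interleaved in time with the training set. If a chronological hold-out is preferred, passing `shuffle=False` keeps the last rows as the test set. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

idx = pd.date_range("2000-01-01", periods=10)  # hypothetical daily index
X_toy = pd.DataFrame({"f": np.arange(10.0)}, index=idx)
y_toy = pd.Series(np.arange(10.0), index=idx)

# shuffle=False keeps the original order: first 80% train, last 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False
)

print("Train ends:", X_tr.index.max(), "| Test starts:", X_te.index.min())
```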

In [15]:
# Scaling (standardize all feature columns; no feature expansion is applied)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X.columns)
    ]
)

## Building our Model

In [30]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

In [19]:
preprocessor = StandardScaler()

# Random Forest Regressor pipeline
rf_model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

In [20]:
# Train the Random Forest Regressor model
rf_model_pipeline.fit(X_train, y_train)
rf_predictions_train = rf_model_pipeline.predict(X_train)
rf_predictions_test = rf_model_pipeline.predict(X_test)

train_mse_rf = mean_squared_error(y_train, rf_predictions_train)
test_mse_rf = mean_squared_error(y_test, rf_predictions_test)
train_r2_rf = r2_score(y_train, rf_predictions_train)
test_r2_rf = r2_score(y_test, rf_predictions_test)

print("Random Forest Regressor:")
print(f"Training MSE: {train_mse_rf}")
print(f"Test MSE: {test_mse_rf}")
print(f"Training R²: {train_r2_rf}")
print(f"Test R²: {test_r2_rf}")

cv_scores_rf = cross_val_score(rf_model_pipeline, X, y, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores_rf)
print("Mean Cross-Validation R²:", np.mean(cv_scores_rf))

Random Forest Regressor:
Training MSE: 0.0015991635940493449
Test MSE: 0.0038223955048743608
Training R²: 0.9932320517075023
Test R²: 0.9808777373054388
Cross-Validation R² Scores: [-2.78935438  0.28107945  0.84705524 -0.49929551 -0.19701962]
Mean Cross-Validation R²: -0.4715069631131919
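
Negative cross-validation scores are possible because R² compares the model's squared error with that of always predicting the mean of the evaluated targets: R² = 1 - SS_res / SS_tot. A small worked example with hypothetical numbers:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # hypothetical poor predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # 4 + 1 + 0 = 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 1 + 0 + 1 = 2
r2 = 1.0 - ss_res / ss_tot                      # 1 - 5/2 = -1.5

print("R²:", r2)
```

A fold with R² below zero is therefore one where the model did worse than a constant prediction at that fold's mean.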


Several steps may reduce overfitting and improve the Random Forest Regressor's performance. Hyperparameter tuning with Grid Search or Random Search can adjust the model's settings. Feature selection can remove noise by using fewer features. Models with built-in regularization, such as Gradient Boosting, are worth considering. Finally, stronger cross-validation methods are recommended.
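
One possible reading of "stronger cross-validation" for date-indexed data is a forward-chaining scheme, in which every fold trains on earlier rows and validates on later ones. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic stand-in data (`X_toy`, `y_toy` are hypothetical). Whether this scheme suits the actual panel is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))  # rows assumed to be in time order
y_toy = X_toy.mean(axis=1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X_toy))   # (train_indices, test_indices) per fold

model = RandomForestRegressor(n_estimators=20, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=tscv, scoring="r2")

print("Forward-chaining R² scores:", scores)
```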

In [37]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [38]:
# Define the parameter grid
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
    'model__bootstrap': [True, False]
}

# Set up the pipeline
rf_model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

In [40]:
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf_model_pipeline, param_grid=param_grid,
                           cv=5, n_jobs=-1, scoring='r2', verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
print("Best parameters found: ", grid_search.best_params_)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters found:  {'model__bootstrap': True, 'model__max_depth': None, 'model__min_samples_leaf': 1, 'model__min_samples_split': 2, 'model__n_estimators': 200}
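
The count in the log above follows from the grid: 3 × 4 × 3 × 3 × 2 = 216 parameter combinations, each fitted on 5 folds. This can be checked with `ParameterGrid` before launching a long search:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [None, 10, 20, 30],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(param_grid))  # number of combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

The best parameters found are close to the scikit-learn defaults apart from `n_estimators`, which is consistent with the nearly unchanged scores in the next cell.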


In [41]:
best_rf_model = grid_search.best_estimator_

best_rf_predictions_train = best_rf_model.predict(X_train)
best_rf_predictions_test = best_rf_model.predict(X_test)

train_mse_best_rf = mean_squared_error(y_train, best_rf_predictions_train)
test_mse_best_rf = mean_squared_error(y_test, best_rf_predictions_test)
train_r2_best_rf = r2_score(y_train, best_rf_predictions_train)
test_r2_best_rf = r2_score(y_test, best_rf_predictions_test)

print("Best Random Forest Regressor:")
print(f"Training MSE: {train_mse_best_rf}")
print(f"Test MSE: {test_mse_best_rf}")
print(f"Training R²: {train_r2_best_rf}")
print(f"Test R²: {test_r2_best_rf}")

cv_scores_best_rf = cross_val_score(best_rf_model, X, y, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores_best_rf)
print("Mean Cross-Validation R²:", np.mean(cv_scores_best_rf))

Best Random Forest Regressor:
Training MSE: 0.0014175949461399484
Test MSE: 0.003910450525903866
Training R²: 0.9940004829206452
Test R²: 0.9804372252648731
Cross-Validation R² Scores: [-2.64822255  0.2695247   0.86733604 -0.49947904 -0.18997028]
Mean Cross-Validation R²: -0.4401622252770364


The Random Forest Regressor sits toward the right side of the fitting graph, which indicates a tendency to overfit. The training R² is extremely high (0.9940) and drops only slightly to 0.9804 on the test set, but the cross-validation R² scores vary widely, from -2.65 to 0.87, with a mean of -0.44. Together these results suggest that the model fits the training set closely while generalizing poorly to new data.

Based on this performance, the next models to be examined are the Gradient Boosting Regressor and XGBoost (Extreme Gradient Boosting). The Gradient Boosting Regressor is the model of choice because it handles non-linear relationships and interactions well and incorporates regularization to limit overfitting. It builds trees sequentially, so that each tree corrects the errors of the one before it. XGBoost, an advanced implementation of gradient boosting, is also considered because it is optimized for speed and performance and includes regularization terms to control overfitting, which has made it highly effective in various data science competitions.
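
The sequential behavior of gradient boosting can be illustrated with scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method, which yields predictions after each added tree. This is a sketch on synthetic data; the target function and hyperparameter values are assumptions chosen for illustration, with `learning_rate` and `max_depth` acting as the regularizing settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] ** 2  # hypothetical non-linear target

gbr = GradientBoostingRegressor(
    n_estimators=50,    # number of sequential trees
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    max_depth=2,        # shallow trees limit model complexity
    random_state=42,
)
gbr.fit(X_toy, y_toy)

# Training MSE after each boosting stage: later trees correct earlier errors.
stage_mse = [mean_squared_error(y_toy, pred) for pred in gbr.staged_predict(X_toy)]

print("MSE after first tree:", stage_mse[0])
print("MSE after last tree:", stage_mse[-1])
```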

# Answer the Questions for Random Forest Regressor

Question: Where does your model fit in the fitting graph?

The Random Forest Regressor sits on the right side of the fitting graph, in the overfitting region. The training R² is very high (0.9940) and the test R² is slightly lower (0.9804). More tellingly, the cross-validation R² scores fluctuate widely, which suggests that the model does not generalize well across different subsets of the data.



## Conclusion Section for Random Forest Regressor


The Random Forest Regressor showed signs of overfitting: a very high training R² (0.9940), a slightly lower test R² (0.9804), and wide variation in the cross-validation R² scores, which indicates that the model does not generalize well across different data subsets. Although the model can capture intricate non-linear relationships in the training set, it may not perform as well on unseen data.

There are several ways to improve the model. First, more thorough hyperparameter tuning with GridSearchCV or RandomizedSearchCV can identify better parameters for the Random Forest model. Second, feature engineering can produce new features, such as interaction terms, polynomial features, or domain-specific transformations, that represent the underlying patterns in the data more accurately. Third, models with built-in regularization, such as Gradient Boosting or XGBoost, can limit overfitting. Fourth, more robust cross-validation schemes, such as repeated k-fold or, for time-ordered data, forward-chaining splits, give a more reliable assessment of how the model performs across different data subsets. Fifth, ensemble methods such as stacking or blending combine the strengths of several models and may improve overall performance. Finally, adding more data to the training set may help.
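
As an illustration of the first suggestion, `RandomizedSearchCV` samples a fixed number of parameter combinations instead of trying the whole grid, which keeps the search cheap. The sketch below runs on synthetic stand-in data; the parameter values and `n_iter` are assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy.mean(axis=1)

param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # sample 5 of the 18 possible combinations
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X_toy, y_toy)
best = search.best_params_

print("Best sampled parameters:", best)
```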