## Data Preprocessing

Our goal in the data preprocessing stage was to prepare the dataset for models that depend on feature scaling and data quality, such as Linear Regression and Ridge Regression. We first selected the relevant columns and filtered the data down to the most important macroeconomic and financial variables, so that the models would train on relevant data. We then pivoted the data so that each indicator has its own column and used forward filling to address missing values, which preserves the continuity and completeness of the time series for trend analysis. Next, we engineered new features and normalized them with z-scores so that all inputs share a common scale. This matters most for Ridge Regression, whose L2 penalty is applied equally to every coefficient and therefore depends on the features being on the same scale. We condensed overall macroeconomic conditions into a single, more manageable target variable by building a composite indicator (MACRO_AVGZ). Downsampling to a monthly frequency reduced noise and captured steadier trends. Finally, we divided the dataset into training and test sets so that we could assess the models on previously unseen data and gauge how well they generalize. Together, these steps were intended to give the models clean, organized, and relevant data to train on, with the aim of improving their accuracy and their capacity to generalize to new data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
file_path = 'JPMaQS_Quantamental_Indicators.csv'
df = pd.read_csv(file_path, parse_dates=['real_date'])

print(df.columns)

Index(['Unnamed: 0', 'real_date', 'cid', 'xcat', 'value', 'grading', 'eop_lag',
       'mop_lag'],
      dtype='object')


In [None]:
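# Keep the columns we need; .copy() avoids assigning to a slice of the original frame
df = df[['cid', 'xcat', 'real_date', 'value', 'grading', 'eop_lag', 'mop_lag']].copy()
# Ticker = cross-section + "_" + category
df['ticker'] = df['cid'] + "_" + df['xcat']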

In [None]:
# Cross-sections and categories of interest
cids = ["AUD", "CAD", "CHF", "EUR", "GBP", "JPY", "NOK", "NZD", "SEK", "USD"]
xcats = [
    "RYLDIRS05Y_NSA", "INTRGDPv5Y_NSA_P1M1ML12_3MMA", "CPIC_SJA_P6M6ML6AR",
    "CPIH_SA_P1M1ML12", "INFTEFF_NSA", "PCREDITBN_SJA_P1M1ML12",
    "RGDP_SA_P1Q1QL4_20QMA"
]
df = df[df['cid'].isin(cids) & df['xcat'].isin(xcats)]

In [None]:
# Separate columns for each xcat
df_pivot = df.pivot_table(index=['real_date', 'cid'], columns='xcat', values='value').reset_index()

print(df_pivot.head())

xcat  real_date  cid  CPIC_SJA_P6M6ML6AR  CPIH_SA_P1M1ML12  INFTEFF_NSA  \
0    2000-01-03  AUD            1.428580          1.647446     1.874567   
1    2000-01-03  CAD            1.709066          2.292576     1.749144   
2    2000-01-03  CHF                 NaN          1.663356     0.827757   
3    2000-01-03  EUR                 NaN          1.446079          NaN   
4    2000-01-03  GBP            0.314695          1.156351     2.005796   

xcat  INTRGDPv5Y_NSA_P1M1ML12_3MMA  PCREDITBN_SJA_P1M1ML12  \
0                         0.247776                9.517471   
1                         1.788620                6.888624   
2                         0.381352                4.423255   
3                              NaN                     NaN   
4                        -0.108668                     NaN   

xcat  RGDP_SA_P1Q1QL4_20QMA  RYLDIRS05Y_NSA  
0                  4.307267        5.391185  
1                  3.303088        4.637855  
2                  1.156352        2.3

In [None]:
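# Handling missing values: sort by cross-section and date, then forward fill
df_pivot.sort_values(by=['cid', 'real_date'], inplace=True)
df_pivot.ffill(inplace=True)  # replaces fillna(method='ffill'), which is deprecated in recent pandas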

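One caveat with forward filling a frame that is sorted by `cid` and then by date is that the first rows of one cross-section can inherit the last values of the previous one. The sketch below, on made-up toy values, contrasts a plain forward fill with a grouped one; `toy`, `global_fill`, and `grouped_fill` are illustrative names and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy panel: two cross-sections, one indicator with gaps (values are made up).
toy = pd.DataFrame({
    "cid": ["AUD", "AUD", "AUD", "CAD", "CAD", "CAD"],
    "real_date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"] * 2),
    "value": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
})
toy = toy.sort_values(["cid", "real_date"]).reset_index(drop=True)

# Plain forward fill: the first CAD row inherits the last AUD value.
global_fill = toy["value"].ffill()

# Grouped forward fill: gaps are only filled from earlier dates of the same cid,
# so a gap at the start of a cross-section stays missing.
grouped_fill = toy.groupby("cid")["value"].ffill()

print(pd.DataFrame({"cid": toy["cid"], "global": global_fill, "grouped": grouped_fill}))
```

Whether the grouped variant is preferable depends on how leading gaps should be treated; it leaves them as NaN, so they would be dropped later by `dropna`.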


In [None]:
# Calculate signal constituents
df_pivot['XGDP_NEG'] = -df_pivot['INTRGDPv5Y_NSA_P1M1ML12_3MMA']
df_pivot['XCPI_NEG'] = - (df_pivot['CPIC_SJA_P6M6ML6AR'] + df_pivot['CPIH_SA_P1M1ML12']) / 2 + df_pivot['INFTEFF_NSA']
df_pivot['XPCG_NEG'] = - df_pivot['PCREDITBN_SJA_P1M1ML12'] + df_pivot['INFTEFF_NSA'] + df_pivot['RGDP_SA_P1Q1QL4_20QMA']

In [None]:
# Relevant columns (copy so that later column assignments do not act on a slice)
selected_cols = ['real_date', 'cid', 'XGDP_NEG', 'XCPI_NEG', 'XPCG_NEG', 'RYLDIRS05Y_NSA']
df_pivot = df_pivot[selected_cols].copy()

In [None]:
# Normalize feature variables using z-scores
for col in ['XGDP_NEG', 'XCPI_NEG', 'XPCG_NEG', 'RYLDIRS05Y_NSA']:
    df_pivot[col + '_ZN4'] = (df_pivot[col] - df_pivot[col].mean()) / df_pivot[col].std()

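The loop above normalizes each column with a single mean and standard deviation pooled over all cross-sections. As a point of comparison only, the following sketch shows how a per-`cid` z-score could be computed with `groupby(...).transform`. This is an alternative, not the method used in this notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: two cross-sections whose levels differ by a factor of ten.
toy = pd.DataFrame({
    "cid": ["AUD"] * 3 + ["CAD"] * 3,
    "XGDP_NEG": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Pooled z-score, as in the notebook: one mean and one std for the whole column.
pooled = (toy["XGDP_NEG"] - toy["XGDP_NEG"].mean()) / toy["XGDP_NEG"].std()

# Per-cid z-score: mean and std are computed within each cross-section.
grp = toy.groupby("cid")["XGDP_NEG"]
per_cid = (toy["XGDP_NEG"] - grp.transform("mean")) / grp.transform("std")

print(pd.DataFrame({"cid": toy["cid"], "pooled": pooled, "per_cid": per_cid}))
```

With pooled statistics the CAD rows dominate the scale; with per-`cid` statistics both cross-sections map to the same standardized values.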

In [None]:
# Composite indicator
df_pivot['MACRO_AVGZ'] = df_pivot[['XGDP_NEG_ZN4', 'XCPI_NEG_ZN4', 'XPCG_NEG_ZN4', 'RYLDIRS05Y_NSA_ZN4']].mean(axis=1)
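Written out, the normalization and the composite computed above amount to the following, where k runs over the four columns `XGDP_NEG`, `XCPI_NEG`, `XPCG_NEG`, and `RYLDIRS05Y_NSA`, and the mean and standard deviation are the full-sample statistics of each column (pandas `.std()` uses the sample standard deviation, `ddof=1`, by default). The symbols are our own notation, not part of the dataset.

```latex
z_{k} = \frac{x_{k} - \bar{x}_{k}}{s_{k}},
\qquad
\mathrm{MACRO\_AVGZ} = \frac{1}{4} \sum_{k=1}^{4} z_{k}
```

Note that `.mean(axis=1)` skips missing values by default, so on rows where a z-score is NaN the average is taken over the remaining columns rather than over all four.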

In [None]:
# Monthly frequency
df_pivot.set_index('real_date', inplace=True)
df_monthly = df_pivot.resample('M').last()
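Because `df_pivot` holds several cross-sections per date, resampling on the date index alone yields a single row per month for the whole panel. If one row per `cid` and month is wanted instead, a grouped version could look like the sketch below. It is an illustration on made-up values, and `toy`, `month`, and `monthly` are hypothetical names.

```python
import pandas as pd

# Toy panel indexed by date, with two cross-sections (values are made up).
dates = pd.to_datetime(["2000-01-28", "2000-01-31", "2000-02-28", "2000-02-29"])
toy = pd.DataFrame({
    "real_date": list(dates) * 2,
    "cid": ["AUD"] * 4 + ["CAD"] * 4,
    "MACRO_AVGZ": [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4],
}).set_index("real_date")

# Group by cross-section and calendar month, then keep the last observation
# of each month within each cross-section.
month = toy.index.to_period("M").rename("month")
monthly = toy.groupby(["cid", month])["MACRO_AVGZ"].last()

print(monthly)
```

The result is indexed by (`cid`, `month`), which keeps the cross-sections separate for any later per-country analysis.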

In [None]:
# Drop rows with missing values
df_monthly.dropna(inplace=True)

In [None]:
# Features and target
X = df_monthly[['XGDP_NEG_ZN4', 'XCPI_NEG_ZN4', 'XPCG_NEG_ZN4', 'RYLDIRS05Y_NSA_ZN4']]
y = df_monthly['MACRO_AVGZ']

In [None]:
# Train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
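The split above shuffles the rows before dividing them. For time series it is common to hold out the most recent observations instead, so that the test set lies strictly after the training set. A minimal sketch of that alternative, on toy data rather than the notebook's `X` and `y`, is:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy monthly series (values are made up); "MS" is the month-start frequency.
idx = pd.date_range("2000-01-01", periods=10, freq="MS")
X_toy = pd.DataFrame({"x": np.arange(10.0)}, index=idx)
y_toy = pd.Series(np.arange(10.0), index=idx)

# shuffle=False keeps the original order: first 80% for training, last 20% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

print("train ends:", X_tr.index.max().date(), "| test starts:", X_te.index.min().date())
```

This assumes the rows are already sorted by date; with a shuffled split, test months can fall between training months.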

In [None]:
# Scaling (feature expansion is added as a separate step in each model pipeline)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X.columns)
    ]
)

## Building our Model

### Model 1

In [None]:
# Linear Regression pipeline
linear_model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_expansion', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression())
])

In [None]:
# Linear Regression model
linear_model_pipeline.fit(X_train, y_train)

In [None]:
y_train_pred_linear = linear_model_pipeline.predict(X_train)
y_test_pred_linear = linear_model_pipeline.predict(X_test)

train_mse_linear = mean_squared_error(y_train, y_train_pred_linear)
test_mse_linear = mean_squared_error(y_test, y_test_pred_linear)
train_r2_linear = r2_score(y_train, y_train_pred_linear)
test_r2_linear = r2_score(y_test, y_test_pred_linear)

print("Linear Regression:")
print(f"Training MSE: {train_mse_linear}")
print(f"Test MSE: {test_mse_linear}")
print(f"Training R²: {train_r2_linear}")
print(f"Test R²: {test_r2_linear}")

Linear Regression:
Training MSE: 2.6444872105683483e-31
Test MSE: 1.1537005234208904e-31
Training R²: 1.0
Test R²: 1.0


The linear regression results show a Mean Squared Error (MSE) that is effectively zero for both the training set (2.6444872105683483e-31) and the test set (1.1537005234208904e-31); values this small are at the level of floating-point rounding error, meaning that predictions and actual values are practically identical. The R² value is 1.0 for both sets, which indicates that the model explains all of the variance in the target variable. Results this perfect point to a serious problem rather than a success. We read them as a sign of overfitting, which happens when a model captures the noise and outliers in the training data in addition to the underlying patterns, and then performs poorly on fresh, unseen data. Because the test error here is just as low as the training error, a likely contributing factor is the way the target was built: MACRO_AVGZ is the row-wise mean of the same four z-scored features that are used as inputs, so a linear model can reproduce it exactly.
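One way to check whether a perfect score comes from the way the target is constructed is to repeat the experiment on synthetic data. The sketch below uses random features and defines the target as their row-wise mean, mirroring how `MACRO_AVGZ` is computed from the four `_ZN4` columns. The data and the names `X_toy`, `y_toy`, and `pipe` are assumptions made for illustration; the pipeline steps follow the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic features with no real-world meaning.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))

# Target defined as the row-wise mean of the features, like MACRO_AVGZ.
y_toy = X_toy.mean(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)

test_r2 = r2_score(y_te, pipe.predict(X_te))
test_mse = mean_squared_error(y_te, pipe.predict(X_te))
print(f"Test R²: {test_r2}, Test MSE: {test_mse}")
```

If the test R² on such data is also essentially 1, the perfect fit says more about the target being a linear function of the inputs than about the model's predictive skill.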

### Model 2

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

In [None]:
# Ridge Regression pipeline
ridge_model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_expansion', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', Ridge(alpha=1.0))  # alpha is the regularization strength
])

In [None]:
# Ridge Regression model
ridge_model_pipeline.fit(X_train, y_train)

In [None]:
y_train_pred_ridge = ridge_model_pipeline.predict(X_train)
y_test_pred_ridge = ridge_model_pipeline.predict(X_test)

train_mse_ridge = mean_squared_error(y_train, y_train_pred_ridge)
test_mse_ridge = mean_squared_error(y_test, y_test_pred_ridge)
train_r2_ridge = r2_score(y_train, y_train_pred_ridge)
test_r2_ridge = r2_score(y_test, y_test_pred_ridge)

print("Ridge Regression:")
print(f"Training MSE: {train_mse_ridge}")
print(f"Test MSE: {test_mse_ridge}")
print(f"Training R²: {train_r2_ridge}")
print(f"Test R²: {test_r2_ridge}")

cv_scores_ridge = cross_val_score(ridge_model_pipeline, X, y, cv=5, scoring='r2')
print("Cross-Validation R² Scores:", cv_scores_ridge)
print("Mean Cross-Validation R²:", np.mean(cv_scores_ridge))

Ridge Regression:
Training MSE: 1.453966772964182e-05
Test MSE: 1.4214579694990766e-05
Training R²: 0.9999533961393585
Test R²: 0.9999530193010754
Cross-Validation R² Scores: [0.99963664 0.99984517 0.99967164 0.99912687 0.99812884]
Mean Cross-Validation R²: 0.9992818331772734


With a training MSE of 1.453966772964182e-05 and a test MSE of 1.4214579694990766e-05, the Ridge Regression model predicts both the training and test datasets very accurately. The R² values for the training and test sets, 0.9999533961393585 and 0.9999530193010754 respectively, are extremely high and indicate that the model explains almost all of the variance in the target variable. The cross-validation R² scores show strong and consistent performance across the five folds, with a mean of 0.9992818331772734 and a range of 0.99812884 to 0.99984517. The errors are no longer at machine precision because the L2 penalty shrinks the coefficients, and the nearly identical training and test errors suggest that the model generalizes well rather than overfitting, which was our concern with the original Linear Regression model. As a result, Ridge Regression balances goodness of fit with predictive capacity on both seen and unseen data when predicting the composite financial indicator used in the portfolio optimization.

## Answers to Questions

### Answer the questions: Where does your model fit in the fitting graph? What are the next models you are thinking of, and why?

Question 1:

Linear Regression: The perfect R² values of the Linear Regression model suggested that it was overfitting. This implies that it fits the training data, noise included, too closely, which may result in poor generalization to fresh data. This model would sit on the right side of the fitting graph, where model complexity is high and overfitting causes the error on test data to start rising.

Ridge Regression: The Ridge Regression model falls in the appropriate range of model complexity. By penalizing large coefficients, the regularization term helps prevent overfitting and balances variance against bias. The model's low MSEs and high R² scores on both the training and test sets indicate good generalization. Ridge Regression would be close to the bottom of the U-shaped test-error curve in the fitting graph, which represents the optimal model complexity.

We used both Linear Regression and Ridge Regression in order to compare the effects of regularization. Linear Regression provided a baseline for how a basic model performs without regularization, and its perfect R² scores helped us identify a potential overfitting problem. Ridge Regression was then introduced to address the overfitting seen with Linear Regression. By adding an L2 regularization term, Ridge Regression penalizes large coefficients, which reduces overfitting and improves generalization to fresh data. This illustrated how important regularization is for building a robust prediction model.
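The effect of the L2 penalty on the coefficients can be illustrated with a short sketch: as `alpha` grows, the norm of the fitted coefficient vector shrinks toward zero. The data, the true coefficients, and the list of `alpha` values below are made up for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
true_coef = np.array([1.0, -2.0, 0.5, 3.0])
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=100)

# Fit Ridge for increasing regularization strengths and record the coefficient norm.
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = [
    float(np.linalg.norm(Ridge(alpha=a).fit(X_toy, y_toy).coef_))
    for a in alphas
]

for a, n in zip(alphas, coef_norms):
    print(f"alpha={a:>8}: ||coef|| = {n:.4f}")
```

A very small `alpha` behaves almost like ordinary least squares, while a very large one pushes the model toward predicting the mean of the target.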

Question 2:
Investigating non-linear models, such as the Random Forest Regressor, can be very helpful for portfolio optimization in the next stages. Random Forest is an ensemble learning technique that combines several decision trees to enhance predictive performance. It captures intricate interactions and non-linear relationships between features that linear models may overlook. Such relationships are common in financial data, so this can be quite helpful there.

Advantages: By averaging several trees, Random Forest reduces overfitting and is resistant to noise and outliers in the data. It also offers feature importance metrics, which help in understanding what drives the predictions, and it does not require the features to be scaled.

Comparison with Linear Models: By contrasting Random Forest's performance with that of the linear models (Linear and Ridge Regression), we can assess whether incorporating non-linear interactions considerably increases the prediction accuracy and robustness of the portfolio optimization model. Where there are intricate relationships between the financial indicators, Random Forest may perform better and provide a more accurate and trustworthy model for decision making.
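As a rough sketch of what this next step could look like, the example below fits a `RandomForestRegressor` on synthetic data whose target depends on an interaction between two features, and then reads the feature importances. The data, the hyperparameters, and all variable names are assumptions for illustration; no results for the actual dataset are implied.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target is a non-linear interaction of features 0 and 1;
# features 2 and 3 are irrelevant noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(500, 4))
y_toy = X_toy[:, 0] * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# n_estimators=200 is an arbitrary choice for this sketch.
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_tr, y_tr)

test_r2 = r2_score(y_te, forest.predict(X_te))
importances = forest.feature_importances_

print("Test R²:", test_r2)
print("Feature importances:", importances)
```

A plain linear model cannot represent the product term without explicit feature expansion, which is the kind of difference the proposed comparison is meant to reveal.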

## Conclusion section: What is the conclusion of your 1st model? What can be done to possibly improve it?
For both the training and test datasets, the Linear Regression model showed perfect R² values, which we interpreted as a sign of significant overfitting. Such a perfect fit implies that the model was capturing noise and patterns specific to this data that may not generalize to fresh, unobserved data. Several approaches can be considered to improve the Linear Regression model. Regularization techniques such as Lasso Regression (L1 regularization) and Ridge Regression (L2 regularization) penalize large coefficients and reduce overfitting. The model can also be made simpler and better able to generalize by limiting the features to those that are most relevant. Finally, evaluating the model with cross-validation and adjusting the hyperparameters accordingly can further reduce overfitting.

In contrast to Linear Regression, the Ridge Regression model offered a more robust and balanced fit. It showed low MSE values and high R² scores for both the training and test datasets. The regularization term reduced overfitting and produced a model that performs well on new data, and the cross-validation scores further supported the model's robustness and dependability. Still, there is room for improvement with the Ridge Regression model. The regularization parameter (alpha) can be fine-tuned with methods such as Grid Search or Random Search to find the value that minimizes error and improves generalization. Enhancing existing features or adding new ones might also help the model forecast more accurately by capturing more relevant information. Exploring ensemble methods such as Random Forest or Gradient Boosting can capture more complex relatio

There are several ways to build on the Ridge Regression model's performance. Models such as the Random Forest Regressor can help identify non-linear patterns in the data. Performance may be further improved by experimenting with other regularization strategies, such as ElasticNet, which combines L1 and L2 regularization. Regularly assessing the model with cross-validation, and adjusting the modeling strategy in response to the findings, supports robust and trustworthy predictions for portfolio optimization. By following these steps, we can build a more resilient model that generalizes to new data more effectively and offers more trustworthy insights for portfolio optimization.
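The alpha tuning mentioned above could be sketched with `GridSearchCV` as follows. The synthetic data, the candidate grid, and the use of `TimeSeriesSplit` for the folds are assumptions chosen for illustration; the pipeline steps mirror the Ridge pipeline used earlier.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the monthly feature matrix and target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.full(4, 0.25) + rng.normal(scale=0.05, size=120)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge()),
])

# Candidate regularization strengths; the step name "model" prefixes the parameter.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# TimeSeriesSplit keeps each validation fold after its training fold in time.
search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["model__alpha"]
print("Best alpha:", best_alpha, "| CV MSE:", -search.best_score_)
```

Swapping `Ridge` for `ElasticNet` and adding `model__l1_ratio` to the grid would extend the same search to the combined L1 and L2 penalty.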


