We outline a structured approach for presenting research findings. The framework is divided into several key segments:

1. Introduction
1. Dataset overview
1. Analytics and learning strategies
1. Empirical results: baseline and robustness
1. Conclusion

The opening segment encompasses four essential elements:

- Contextual Background: What is the larger setting of the study? What makes this area of inquiry compelling? What are the existing gaps or limitations within the current body of research? What are some unanswered yet noteworthy questions?

- Project Contributions: What are the specific advancements made by this study, such as in data acquisition, algorithmic development, parameter adjustments, etc.?

- Summary of the main empirical results: What is the main statistical statement? Is it significant (e.g. statistically or economically)?

- Literature and Resource Citations: What are related academic papers? What are the GitHub repositories, expert blogs, or software packages that were used in this project?

In the dataset profile, one should consider:

- The origin and composition of data utilized in the study. If the dataset is original, then provide the source code to ensure reproducibility.

- The chronological accuracy of the data points, verifying that the dates reflect the actual availability of information.

- A detailed analysis of descriptive statistics, with an emphasis on discussing the importance of the chosen graphs or metrics.

The analytics and machine learning methodologies section accounts for:

- A detailed explanation of the foundational algorithm.

- A description of the data partitioning strategy for training, validation and test.

- An overview of the parameter selection and optimization process.

To effectively convey the empirical findings, separate the baseline results from the additional robustness tests. Within the primary empirical outcomes portion, include:

- Key statistical evaluations (for instance, if presenting a backtest – provide a pnl graph alongside the Sharpe ratio).

- Insights into what primarily influences the results, such as specific characteristics or assets that significantly impact performance.
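The backtest statistics mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the function names, the zero risk-free rate, the 252-day annualization, and the sample numbers are all assumptions made for the example, not part of the framework itself.

```python
import math
import statistics


def annualized_sharpe(daily_pnl, periods_per_year=252):
    """Annualized Sharpe ratio of a daily PnL series (risk-free rate assumed zero)."""
    mean = statistics.fmean(daily_pnl)
    std = statistics.stdev(daily_pnl)  # sample standard deviation
    if std == 0:
        return float("nan")  # undefined when the series has no variation
    return mean / std * math.sqrt(periods_per_year)


def cumulative_pnl(daily_pnl):
    """Running sum of daily PnL, i.e. the curve plotted in a PnL graph."""
    total, curve = 0.0, []
    for value in daily_pnl:
        total += value
        curve.append(total)
    return curve


# Hypothetical daily strategy returns, for illustration only.
example_pnl = [0.01, -0.005, 0.007, 0.002, -0.003]
sharpe = annualized_sharpe(example_pnl)
curve = cumulative_pnl(example_pnl)
```

Reporting the same two quantities for a buy-and-hold benchmark over the identical period makes the comparison easier to read.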

The robustness of empirical tests section should detail:

- Evaluation of the stability of the principal finding against variations in hyperparameters or algorithmic modifications.

Finally, the conclusive synthesis should recapitulate the primary findings, consider external elements that may influence the results, and hint at potential directions for further investigative work.


# Introduction
In this project on Machine Learning for Portfolio Management and Trading, we investigate the effect of central bank policies on financial markets. Our approach focuses on betting activity related to Federal Reserve decisions and monetary policy related topics, aiming to study whether this information can be transformed into a tradable strategy. The data is sourced from Polymarket, a decentralized prediction market platform where users anonymously place bets on real-world events, including Federal Reserve policy outcomes.

By analyzing the evolution of these betting probabilities, especially in the hours leading up to market open, we seek to detect shifts in sentiment or positioning that may precede movements in the S&P 500. In particular, sudden changes in betting activity before scheduled announcements could indicate meaningful information flow, whether due to collective expectations, rapid consensus formation, or, in rare cases, the presence of informed traders operating in an anonymous and decentralized environment. Our goal is to determine whether such early signals contain predictive value and whether they can be systematically incorporated into a daily trading strategy.

---


### Contextual Background
Markets, and our reference index the S&P 500, are highly affected by the monetary policies of central banks, particularly the Federal Reserve. Much research has examined how monetary policy decisions shape market activity. In general, unexpected changes in expectations about Fed policy are associated with sharp increases in volatility and large market moves, while pre-scheduled FOMC announcements tend to allow markets to price interest-rate risk more efficiently. In this paper, we highlight episodes of market behavior that coincide with unexpected policy-related events and show how these patterns appear in the daily returns of the S&P 500. These observations build on established evidence that equity markets react strongly to monetary policy surprises as described in Bernanke, B. & Kuttner, K. (2004).

[1]: Bernanke, B. & Kuttner, K. (2004). *What Explains the Stock Market’s Reaction to Federal Reserve Policy?*

---

### Prediction Markets and Informational Efficiency

A second motivation for using Polymarket comes from the fact that prediction markets often react quickly to new information and can aggregate expectations better than surveys or expert opinions. Wolfers and Zitzewitz (2006) show that prediction-market prices usually adjust ahead of major events and can reflect both public expectations and, at times, the actions of better-informed traders.

Based on this idea, we examine Polymarket, which is decentralized, anonymous, and trades 24/7. Since it has low barriers to participation and updates continuously, it gives us a high-frequency view of expectations around monetary-policy decisions. If betting probabilities move before the equity market opens or ahead of FOMC announcements, these shifts may reflect changes in sentiment, early positioning, or even informed trading. This allows us to check whether these signals help predict next-day S&P 500 returns.
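One way to turn such pre-open moves into a daily signal is sketched below. It is only an illustration under stated assumptions: the helper names, the use of the last quote strictly before each open, and the sample timestamps and probabilities are hypothetical and do not describe the actual Polymarket data layout.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def last_value_before(times, values, cutoff):
    """Latest value observed strictly before `cutoff` (times must be sorted), else None."""
    i = bisect_left(times, cutoff)
    return values[i - 1] if i > 0 else None


def pre_open_changes(times, probs, opens):
    """Change in the last pre-open probability from one session to the next."""
    snapshots = [last_value_before(times, probs, t) for t in opens]
    changes = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        changes.append(None if prev is None or curr is None else curr - prev)
    return changes


def utc(*args):
    """Small helper to build timezone-aware UTC timestamps."""
    return datetime(*args, tzinfo=timezone.utc)


# Hypothetical quotes for one contract; 14:30 UTC is 09:30 New York time in January.
times = [utc(2024, 1, 2, 10, 0), utc(2024, 1, 2, 20, 0),
         utc(2024, 1, 3, 13, 0), utc(2024, 1, 3, 15, 0)]
probs = [0.40, 0.45, 0.60, 0.55]
opens = [utc(2024, 1, 2, 14, 30), utc(2024, 1, 3, 14, 30)]

changes = pre_open_changes(times, probs, opens)
```

Using only quotes that arrive before the open keeps the signal free of information from the trading session it is meant to predict.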

[2]: Wolfers, J. & Zitzewitz, E. (2006). *Prediction Markets in Theory and Practice.*

---

### Project Contributions

Our study contributes in a few ways. First, to the best of our knowledge, nobody has used Polymarket data to study possible anticipation effects on the S&P 500 or to build a trading strategy from it. So the idea of linking decentralized prediction-market data with equity-index returns is new.

Second, we built our own dataset by collecting all the relevant Polymarket markets through their API and cleaning the data ourselves. Third, on the machine-learning side, we use LightGBM, which works well with noisy and unstructured features like ours and can capture non-linear relationships. We explain this more in the model section of the notebook.

Taken together, these points allow us to test whether Polymarket signals contain information that can help forecast next-day S&P 500 returns.

---

### Summary of the main empirical results

Our proposed approach of using gradient boosting methods to analyze the unstructured Polymarket dataset has led to promising results. Feeding the unstructured data, without any preprocessing, into a rolling LightGBM model appears to yield a profitable strategy with a Sharpe ratio of 2.2, which beats a simple buy-and-hold strategy over the same time horizon.

After polishing the dataset by adding an indicator that acts as a proxy for volume and market sentiment, and transforming the input feature matrix through rolling means and normalization, performance increased further: the resulting strategy achieved a Sharpe ratio of 6. Such results seem too good to be true...
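The rolling-mean and normalization step can be sketched as follows. The exact transformation used in the project is not specified here, so the window length, the trailing z-score form of normalization, and the function names are assumptions chosen for illustration. Both helpers use only past or current observations, which avoids feeding future information into the features.

```python
import statistics


def rolling_mean(series, window):
    """Trailing mean over the last `window` values; None until enough history exists."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.fmean(series[i + 1 - window : i + 1]))
    return out


def trailing_zscore(series, window):
    """Standardize each value using the mean and std of the preceding `window` values only."""
    out = []
    for i, value in enumerate(series):
        if i < window:
            out.append(None)  # not enough past data yet
            continue
        past = series[i - window : i]
        std = statistics.pstdev(past)
        out.append(None if std == 0 else (value - statistics.fmean(past)) / std)
    return out


# Hypothetical feature column, for illustration only.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = rolling_mean(feature, window=2)
normalized = trailing_zscore(feature, window=2)
```

Normalizing with statistics computed over the full sample instead would leak future information into earlier rows, which is one common source of inflated backtest results.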


### Literature and Resource Citations

Throughout the development of the project, we adapted several pieces of code from the `skfin` repository developed by Prof. Sylvain Champonnois, available on GitHub at https://github.com/schampon/skfin

Besides standard packages for plotting, data handling, and numerical computation, we used `sklearn`, which provides access to many estimators and utilities, together with `LightGBM`, which is distributed as the separate `lightgbm` package and offers a scikit-learn-compatible interface.

# Dataset Overview

# Analytics and learning 

For our predictive analysis we rely on Gradient Boosting Machines (GBM) as described in Ke, G. et al. (2017), a family of models that build many small decision trees in sequence, with each new tree correcting the errors of the previous ones. This iterative process allows the model to capture complex, nonlinear relationships without requiring strong assumptions about how the data is generated. We implement this approach using LightGBM, an efficient and highly optimized GBM framework that uses histogram-based binning and leaf-wise tree growth to achieve high accuracy with very fast training times.
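The idea of trees fitted in sequence to the remaining errors can be illustrated with a deliberately small sketch. This is plain gradient boosting for squared error with one-split trees on a single feature, written from scratch for illustration; it is not LightGBM's implementation (no histogram binning, no leaf-wise growth), and the function names and toy data are assumptions.

```python
def fit_stump(x, residuals):
    """Best single-split tree on one feature under squared error.

    Assumes `x` contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosting(x, y, n_trees=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)  # initial prediction: the mean of the targets
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if xi <= threshold else right_mean)
                 for xi, p in zip(x, preds)]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if xi <= t else r for t, l, r in stumps)


# Toy step-function data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = fit_boosting(x, y)
```

Each round removes a fraction of the remaining error set by the learning rate, which is why many small trees are combined rather than a single large one.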

LightGBM is particularly well suited to our setting because the inputs we use—market-derived signals, Polymarket probabilities, and event-driven return patterns—are irregular, noisy, and only partially structured. Traditional linear or parametric models struggle with this type of data, as they rely on smooth relationships and stationarity. In contrast, LightGBM naturally handles discontinuities, threshold effects, nonlinearities, missing values, and interactions between features without extensive preprocessing. These properties make it an effective tool for identifying how unexpected monetary-policy-related events propagate into the daily returns of the S&P 500.

---

### Applying LightGBM to S&P Returns

In addition, we implement our model within a rolling training framework. Since Polymarket contracts typically remain active for an average of around 45 days, we assume that the predictive structure of the data evolves over similar horizons. To account for this, we retrain the model every day using all information until the day before the returns we are trying to predict: on each date we fit LightGBM on the most recent data, generate a prediction for the next day’s S&P 500 return, and then roll the window forward by one day. This approach allows the model to continuously adapt to changing market conditions, shifting expectations about monetary policy, and evolving patterns in prediction-market pricing. 

Since Polymarket is a relatively new platform, most of its trading activity and market depth have increased only within the past year. For this reason, we adopt an 80/20 chronological split, where the initial 80% of observations are used for model initialization and parameter calibration, and the remaining 20% aligns with the period of elevated volume and richer informational content. After training on the initial 80%, we then apply the previously described rolling-window strategy on the final portion of the dataset. This approach also improves computational efficiency, as the number of model refits is limited to the number of timestamps in the final portion of the dataset.
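The split and the daily refit schedule described above can be sketched as an index generator. The function names are hypothetical, and the sketch assumes an expanding training window (all data up to the day before the forecast); a fixed-length rolling window would only change the start of the training range. The mean forecast stands in for fitting and predicting with the actual model.

```python
def walk_forward_splits(n_obs, initial_frac=0.8):
    """Yield (train_indices, test_index) pairs for a daily walk-forward evaluation.

    The first `initial_frac` of observations is reserved for initialization;
    each later observation is forecast using only strictly earlier data.
    """
    start = int(n_obs * initial_frac)
    for t in range(start, n_obs):
        yield range(0, t), t


def walk_forward_mean_forecasts(returns, initial_frac=0.8):
    """Placeholder forecaster: predicts the mean of past returns at each step."""
    forecasts = {}
    for train_idx, t in walk_forward_splits(len(returns), initial_frac):
        history = [returns[i] for i in train_idx]
        forecasts[t] = sum(history) / len(history)  # stand-in for model fit + predict
    return forecasts


# Ten hypothetical daily returns: the last two are forecast out of sample.
splits = list(walk_forward_splits(10))
forecasts = walk_forward_mean_forecasts([0.01] * 10)
```

With this layout the model is refit once per observation in the final 20% of the sample, which is where the computational saving comes from.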


This methodology is particularly appropriate for S&P 500 returns because equity index returns are well known to exhibit low long-term predictability, rapidly shifting regimes, and short-lived signals driven by macro announcements and policy surprises. Most of the information in daily stock-index movements is influenced by immediate news flow and sentiment rather than persistent structural factors, meaning that models benefit from frequent retraining and a focus on the most recent market environment. A rolling approach captures these dynamics by continually updating the model’s understanding of how current prediction-market probabilities and monetary-policy expectations translate into next-day index returns.



[3]: Ke, G. et al. (2017). *LightGBM: A Highly Efficient Gradient Boosting Decision Tree*.


# Empirical results: baseline and robustness

After obtaining the predictions from 

# Conclusion

Based on our analysis, we conclude that there seems to be valuable information in Polymarket betting data, and that this information can be exploited to develop a profitable and robust trading strategy. Given the unstructured nature of the data, our proposed approach of using a LightGBM model on a rolling-window basis seems to serve this purpose, as we were able to obtain Sharpe ratios of ... . Areas of further investigation are.. 
