# Chapter 95: Automated Scientific Discovery

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the concept of automated scientific discovery and its relevance to time‑series analysis.
- Apply symbolic regression techniques to discover mathematical relationships in time‑series data.
- Use equation discovery algorithms to uncover underlying physical or economic laws from observations.
- Implement causal discovery methods to infer causal structures from time‑series (e.g., Granger causality, PC algorithm).
- Leverage AI to generate hypotheses and guide scientific experimentation.
- Apply these techniques to the NEPSE dataset to discover relationships between stocks or between stocks and external factors.
- Evaluate the discovered equations and causal models for validity and interpretability.
- Recognize the limitations and challenges of automated discovery, including overfitting and confounding.

---

## **95.1 Introduction to Automated Scientific Discovery**

Automated scientific discovery is the use of artificial intelligence to uncover new knowledge, relationships, and laws from data. Traditionally, science has progressed through human intuition, hypothesis formulation, and experimentation. AI can accelerate this process by:

- **Discovering equations** that fit observed data (symbolic regression).
- **Inferring causal relationships** from observational time‑series (causal discovery).
- **Generating hypotheses** that can be tested experimentally.
- **Automating the scientific method** in closed‑loop systems.

In the context of time‑series, automated discovery can help us understand the underlying dynamics of a system. For the NEPSE stock market, we might discover:

- A mathematical relationship between the prices of two stocks (e.g., cointegration).
- A causal relationship: “changes in interest rates cause changes in banking stock prices.”
- A physics‑like law: “price movement is proportional to trading volume squared.”

These discoveries can lead to better forecasting models, improved trading strategies, and deeper understanding of market behaviour.

This chapter introduces key techniques in automated scientific discovery, with practical examples using Python libraries such as `gplearn` (symbolic regression), `pysindy` (equation discovery), and `causalnex` and `statsmodels` (causal discovery).

---

## **95.2 Symbolic Regression**

Symbolic regression searches the space of mathematical expressions for a model that best fits a given dataset. Unlike traditional regression, which assumes a fixed functional form (e.g., linear) and estimates only its coefficients, symbolic regression searches over the form of the equation as well, limited only by the operators it is given.

### **95.2.1 How It Works**
Symbolic regression typically uses genetic programming:

1. **Initialisation**: Create a population of random mathematical expressions (trees) from a set of operators (+, -, ×, ÷, sin, exp, etc.) and variables.
2. **Fitness evaluation**: Evaluate each expression on the training data (e.g., mean squared error).
3. **Selection**: Choose the fittest expressions to breed.
4. **Crossover and mutation**: Combine and modify expressions to create new ones.
5. **Repeat** for many generations.

The result is a set of equations that trade off complexity and accuracy.
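To make the fitness and parsimony ideas concrete, here is a minimal sketch that scores a few hand-written expression trees. It is not how `gplearn` works internally: the nested-tuple encoding, the helper names `evaluate`, `size`, and `fitness`, and the penalty of 0.01 per node are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical encoding: a tree is (operator, left, right), a variable name, or a constant.
OPS = {'add': np.add, 'sub': np.subtract, 'mul': np.multiply}

def evaluate(tree, variables):
    """Recursively evaluate an expression tree on arrays of variable values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, variables), evaluate(right, variables))
    if isinstance(tree, str):
        return variables[tree]
    return tree  # numeric constant, broadcast by NumPy

def size(tree):
    """Number of nodes in the tree, used as a complexity measure."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1

def fitness(tree, variables, y, parsimony=0.01):
    """Mean squared error plus a penalty per node (lower is better)."""
    mse = np.mean((evaluate(tree, variables) - y) ** 2)
    return mse + parsimony * size(tree)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0  # noise-free target

candidates = {
    # 2x + 1
    'exact': ('add', ('mul', 2.0, 'x'), 1.0),
    # 2x + 1 + (x - x): same predictions, more nodes
    'bloated': ('add', ('add', ('mul', 2.0, 'x'), 1.0), ('sub', 'x', 'x')),
    # x^2: simple but wrong
    'wrong': ('mul', 'x', 'x'),
}
scores = {name: fitness(tree, {'x': x}, y) for name, tree in candidates.items()}

# Selection would favour the candidates at the top of this ranking
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: fitness = {score:.3f}")
```

The exact and bloated trees make identical predictions, so only the size penalty separates them. That is the role the parsimony term plays during selection.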

### **95.2.2 Symbolic Regression with `gplearn`**

We'll use `gplearn` to discover a relationship between NEPSE stock prices and volume.

```python
# pip install gplearn
import numpy as np
import pandas as pd
from gplearn.genetic import SymbolicRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate synthetic NEPSE-like data (or use real)
np.random.seed(42)
n = 1000
price = pd.Series(1000 + np.cumsum(np.random.randn(n) * 5))
volume = pd.Series(np.random.lognormal(12, 1, n))
# shift(1) gives the previous price and NaN on the first row;
# np.roll would wrap the last price around to the first row instead
lag_price = price.shift(1)
# Create a known relationship: price_change = 0.01 * volume - 0.5 * lagged_price + noise
price_change = 0.01 * volume - 0.5 * lag_price + np.random.randn(n) * 2
df = pd.DataFrame({
    'price': price,
    'volume': volume,
    'price_change': price_change,
    'lag_price': lag_price
})
df = df.dropna()  # drops the first row, which has no lagged price

# Define features and target
X = df[['price', 'volume', 'lag_price']].values
y = df['price_change'].values

# Split data chronologically: shuffling would leak future observations into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Define function set (operators)
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log', 'neg', 'sin', 'cos']

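# Create symbolic regressor
est = SymbolicRegressor(
    population_size=5000,
    generations=20,
    stopping_criteria=0.01,
    function_set=function_set,  # without this, gplearn uses its default operators
    p_crossover=0.7,
    p_subtree_mutation=0.1,
    p_hoist_mutation=0.05,
    p_point_mutation=0.1,
    max_samples=0.9,
    verbose=1,
    parsimony_coefficient=0.01,
    feature_names=['price', 'volume', 'lag_price'],  # readable names in the printed program
    random_state=42
)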

# Train (this may take a few minutes)
est.fit(X_train, y_train)

# Evaluate
y_pred = est.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.4f}")

# Print the discovered equation
print("Discovered equation:")
print(est._program)

# Plot
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Symbolic Regression Performance')

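plt.subplot(1,2,2)
# A program object cannot be plotted directly; show training progress instead
plt.plot(est.run_details_['best_fitness'])
plt.xlabel('Generation')
plt.ylabel('Best training fitness')
plt.title('Training Progress')
plt.tight_layout()
plt.show()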
```

**Explanation**:

- We create synthetic data with a known (but noisy) relationship between price change, volume, and lagged price.
- `SymbolicRegressor` evolves a population of expressions. The `parsimony_coefficient` penalises complex expressions, encouraging simpler equations.
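- The best program is printed in prefix notation. It might be something like `add(mul(0.01, volume), mul(-0.5, lag_price))`, closely approximating the true relationship (gplearn labels the inputs `X0`, `X1`, `X2` unless `feature_names` is passed).
- This demonstrates how symbolic regression can recover a known generating equation from noisy data. On real market data there is no guarantee that such a compact law exists.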

### **95.2.3 Applying to Real NEPSE Data**

For real NEPSE data, you might try to discover relationships between:

- Stock returns and trading volume.
- Price movements of related stocks (e.g., banking sector).
- Price and external factors (e.g., exchange rate, oil prices).

However, beware of overfitting: the discovered equations may not generalise out of sample. Use cross‑validation and test on a hold‑out period.
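One way to run such a check is walk-forward validation, where every fold trains on the past and tests on the period that follows. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic data. `LinearRegression` is only a fast stand-in here, and the features and coefficients are made up for illustration; any scikit-learn-style estimator, including a symbolic regressor, could take its place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for chronologically ordered features and target
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Each split trains on an expanding window and tests on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
fold_mse = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression()  # stand-in estimator (assumption)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], y_pred))

for fold, mse in enumerate(fold_mse, start=1):
    print(f"Fold {fold}: test MSE = {mse:.4f}")
```

If the error rises sharply in later folds, the fitted relationship may not be stable over time.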

---

## **95.3 Equation Discovery Algorithms**

Symbolic regression is one form of equation discovery. Other approaches include:

- **SINDy** (Sparse Identification of Nonlinear Dynamics): Uses sparse regression to identify governing equations from time‑series.
- **Eureqa**: Commercial software for symbolic regression.
- **PySR**: A high‑performance symbolic regression library in Python.

### **95.3.1 SINDy Example**

SINDy is particularly suited for discovering dynamical systems from time‑series.

```python
# pip install pysindy
import pysindy as ps
import numpy as np
import matplotlib.pyplot as plt

# Generate data from a simple dynamical system: dx/dt = -x
t = np.linspace(0, 10, 100)
x = np.exp(-t)[:, np.newaxis]
x_dot = -x  # derivative

# Create SINDy model
model = ps.SINDy()
model.fit(x, t=t, x_dot=x_dot)
model.print()

# With the default feature name x0, the output should be close to: (x0)' = -1.000 x0
```

For NEPSE, you could model stock prices as a dynamical system, though financial data is notoriously noisy and non‑stationary.
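The sparse regression at the core of SINDy can be sketched in plain NumPy as sequentially thresholded least squares: fit all candidate terms, zero out the small coefficients, and refit the rest. This is a simplified illustration, not `pysindy`'s implementation. The function name `stlsq`, the candidate library `[1, x, x^2]`, and the threshold of 0.1 are assumptions.

```python
import numpy as np

def stlsq(theta, x_dot, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares for x_dot ~ theta @ xi."""
    xi = np.linalg.lstsq(theta, x_dot, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                      # prune negligible terms
        keep = ~small
        if not keep.any():
            break
        # Refit using only the surviving candidate terms
        xi[keep] = np.linalg.lstsq(theta[:, keep], x_dot, rcond=None)[0]
    return xi

# Same toy system as above: dx/dt = -x
t = np.linspace(0, 10, 100)
x = np.exp(-t)
x_dot = -x

# Candidate library of terms: 1, x, x^2
names = ['1', 'x', 'x^2']
theta = np.column_stack([np.ones_like(x), x, x ** 2])

xi = stlsq(theta, x_dot)
terms = [f"{c:.3f} {name}" for c, name in zip(xi, names) if c != 0.0]
print("dx/dt =", " + ".join(terms))
```

With noisy data the derivative would have to be estimated numerically, and the threshold becomes an important tuning choice.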

---

## **95.4 Causal Discovery**

Causal discovery aims to infer causal relationships from observational data. In time‑series, this is often done using:

- **Granger causality**: A statistical test for whether one time series helps predict another.
- **PC algorithm**: A constraint‑based method that uses conditional independence tests.
- **LiNGAM**: Assumes linear non‑Gaussian relationships.
- **VAR models**: Vector autoregression with causality tests.

Understanding causality is crucial for making interventions (e.g., “if we change interest rates, what happens to stock prices?”) and for building robust forecasting models.

### **95.4.1 Granger Causality**

Granger causality tests whether past values of one time series improve the prediction of another, beyond its own past.

```python
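from statsmodels.tsa.stattools import grangercausalitytests
import pandas as pd

# Example: test if volume Granger-causes price.
# The function tests whether the SECOND column helps predict the FIRST.
# The test assumes stationary series, so use price changes rather than price levels.
df_g = pd.DataFrame({
    'd_price': df['price'].diff(),
    'volume': df['volume'],
}).dropna()

# Test with up to 5 lags
gc_result = grangercausalitytests(df_g, maxlag=5)

# Null hypothesis: volume does NOT Granger-cause price changes.
# A p-value below 0.05 rejects it at the 5% level for that lag.
for lag, (tests, _) in gc_result.items():
    print(f"lag {lag}: F-test p-value = {tests['ssr_ftest'][1]:.4f}")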
```

**Interpretation**: If volume Granger‑causes price, it means past volume helps predict future price, suggesting a directional relationship (but not necessarily true causality).

### **95.4.2 PC Algorithm for Time‑Series**

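The PC algorithm can be applied to time‑series by treating each lag as a separate variable. The `causallearn` library implements PC; `causalnex` instead provides NOTEARS, a score‑based method that learns the graph by continuous optimisation, and the example below uses it on the same lagged representation.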

```python
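# pip install causalnex
from causalnex.structure.notears import from_pandas
import pandas as pd

# Create a DataFrame with multiple lags
df_lags = pd.DataFrame({
    'price_t': df['price'],
    'price_t1': df['price'].shift(1),
    'volume_t': df['volume'],
    'volume_t1': df['volume'].shift(1),
}).dropna()

# Standardise so that edge weights are comparable across variables
df_lags = (df_lags - df_lags.mean()) / df_lags.std()

# Learn structure using NOTEARS (continuous optimisation).
# Lagged nodes may not be children: the present cannot cause the past.
# The weight threshold of 0.1 is an illustrative choice.
sm = from_pandas(
    df_lags,
    tabu_child_nodes=['price_t1', 'volume_t1'],
    w_threshold=0.1,
)

# Inspect the learned edges (StructureModel is a networkx DiGraph)
for source, target, weight in sm.edges(data='weight'):
    print(f"{source} -> {target}: weight = {weight:.3f}")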
```

The learned structure may contain directed edges such as `volume_t1 -> price_t`, which would suggest that past volume influences current price.
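Constraint-based methods such as PC are built on conditional independence tests. For continuous, roughly Gaussian data, a common choice is Fisher's z test on the partial correlation. The sketch below implements only that test, not the full PC search. The helper name `fisher_z_pvalue` and the synthetic chain `a -> b -> c` are assumptions for illustration; in a time-series setting the columns would be lagged variables.

```python
import numpy as np
from scipy import stats

def fisher_z_pvalue(data, i, j, cond=()):
    """p-value for H0: columns i and j are independent given the columns in cond."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    precision = np.linalg.inv(corr)
    # Partial correlation of i and j given cond, read off the precision matrix
    r = -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    stat = np.sqrt(data.shape[0] - len(cond) - 3) * abs(z)
    return 2 * stats.norm.sf(stat)

# Synthetic chain a -> b -> c: a and c are related only through b
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
data = np.column_stack([a, b, c])

p_marginal = fisher_z_pvalue(data, 0, 2)
p_given_b = fisher_z_pvalue(data, 0, 2, cond=(1,))
print(f"a vs c, unconditional: p = {p_marginal:.3g}")
print(f"a vs c, given b:       p = {p_given_b:.3g}")
```

PC removes the edge between two variables whenever it finds a conditioning set that makes them independent. Here, conditioning on `b` is what would justify dropping a direct `a`-`c` edge.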

### **95.4.3 Limitations of Causal Discovery**

- **Confounding**: Unobserved variables can create spurious relationships.
- **Feedback loops**: In financial markets, causality is often bidirectional (price and volume influence each other).
- **Non‑stationarity**: Causal structures may change over time.
- **Data requirements**: Large samples are needed for reliable inference.

Despite these challenges, causal discovery can generate hypotheses for further investigation.

---

## **95.5 Hypothesis Generation with AI**

AI can also generate hypotheses by identifying surprising patterns or anomalies. For example, if a stock's price movement deviates significantly from its historical relationship with another stock, an LLM could suggest potential causes (e.g., merger rumours, regulatory changes).

```python
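def generate_hypothesis(stock_a, stock_b, historical_corr, recent_corr):
    """Build a prompt asking an LLM to explain a correlation break between two stocks."""
    prompt = f"""
    Historically, stocks {stock_a} and {stock_b} in the NEPSE market have had a correlation of {historical_corr:.2f}.
    Over the last week, their correlation has dropped to {recent_corr:.2f}.
    What are some possible hypotheses for this decoupling? Consider market news, sector-specific events, or company announcements.
    """
    # Call LLM (as in Chapter 94)
    # ...
    # Until the LLM call is wired in, return the prompt itself
    return prompt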
```

This combines data analysis with LLM reasoning to suggest plausible explanations.
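The data-analysis half of this workflow, detecting that a correlation has broken down, can be sketched with pandas. The helper name `correlation_break`, the 60-observation window, and the minimum drop of 0.3 are assumptions. A one-week window as in the prompt above would give a much noisier correlation estimate.

```python
import numpy as np
import pandas as pd

def correlation_break(returns_a, returns_b, window=60, min_drop=0.3):
    """Compare the correlation in the most recent window with the earlier history."""
    historical = returns_a.iloc[:-window].corr(returns_b.iloc[:-window])
    recent = returns_a.iloc[-window:].corr(returns_b.iloc[-window:])
    flagged = bool((historical - recent) > min_drop)
    return historical, recent, flagged

# Synthetic returns: a shared factor drives both stocks until the final window
rng = np.random.default_rng(0)
n, window = 500, 60
common = rng.normal(size=n)
a = common + rng.normal(scale=0.5, size=n)
b = common + rng.normal(scale=0.5, size=n)
b[-window:] = rng.normal(size=window)  # decouple stock B in the final window

historical, recent, flagged = correlation_break(pd.Series(a), pd.Series(b), window=window)
print(f"historical = {historical:.2f}, recent = {recent:.2f}, break flagged = {flagged}")
```

The two correlation values can then be passed to the prompt so that the LLM reasons about measured numbers rather than hard-coded ones.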

---

## **95.6 Applications to NEPSE**

Let's apply these techniques to the NEPSE dataset to discover meaningful relationships.

### **95.6.1 Discovering a Relationship Between Banking Stocks**

We might hypothesise that banking stocks move together. Symbolic regression could find an equation linking their returns.

```python
# Assume we have data for NABIL and EBL (two major banks)
df_banks = pd.DataFrame({
    'nabil_return': nabil_returns,
    'ebl_return': ebl_returns,
})

X = df_banks[['nabil_return']].values
y = df_banks['ebl_return'].values

est = SymbolicRegressor(...)
est.fit(X, y)
print(est._program)
```

The discovered equation might be something like `ebl_return = 0.95 * nabil_return + 0.02`, which would indicate a strong linear relationship. `gplearn` reports it in prefix notation, e.g. `add(mul(0.95, X0), 0.02)`.

### **95.6.2 Causal Discovery Between Volume and Price**

Using Granger causality, we might find that volume Granger‑causes price for some stocks, but not others. This could inform feature selection for prediction models.

### **95.6.3 Discovering Market Regimes**

Symbolic regression could be applied piecewise to discover different equations for different market regimes (bull, bear, sideways). This aligns with the regime‑switching models from Chapter 83.
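A piecewise fit can be sketched as follows, under loud assumptions: the data is synthetic, regimes are labelled by the sign of a 20-day rolling mean of market returns, only two regimes are used, and a straight-line fit with `np.polyfit` stands in for the per-regime symbolic regression.

```python
import numpy as np
import pandas as pd

# Synthetic market: positive drift in the first half, negative in the second,
# with a stock whose sensitivity to the market differs between the two halves
rng = np.random.default_rng(0)
n = 500
first_half = np.arange(n) < n // 2
market = pd.Series(np.where(first_half, 0.5, -0.5) + rng.normal(size=n))
beta = np.where(first_half, 1.2, 0.6)
stock = beta * market + rng.normal(scale=0.1, size=n)

# Label regimes from the data alone, using the rolling trend of the market
frame = pd.DataFrame({
    'market': market,
    'stock': stock,
    'trend': market.rolling(20).mean(),
}).dropna()
frame['regime'] = np.where(frame['trend'] > 0, 'bull', 'bear')

# Fit one equation per regime (a line here; a symbolic regressor could be used instead)
slopes = {}
for name, group in frame.groupby('regime'):
    slope, intercept = np.polyfit(group['market'], group['stock'], deg=1)
    slopes[name] = slope
    print(f"{name}: stock = {slope:.2f} * market + {intercept:.2f} ({len(group)} rows)")
```

Regime labels estimated this way are imperfect near transitions, so the per-regime coefficients are slightly blurred versions of the true ones.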

---

## **95.7 Evaluation and Validation**

Discovered equations and causal models must be validated:

- **Out‑of‑sample testing**: Does the equation hold on unseen data?
- **Cross‑validation**: Use rolling window evaluation for time‑series.
- **Simulation**: Can the discovered dynamics simulate realistic behaviour?
- **Domain knowledge**: Does the discovered relationship make sense to a domain expert?

For the NEPSE system, a discovered relationship should be reviewed by a financial analyst before being used in trading decisions.
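The simulation check in particular is easy to automate: integrate the discovered equation forward and compare the trajectory with observations. The sketch below reuses the toy system `dx/dt = -x` from the SINDy example. The helper name `simulate`, the forward-Euler scheme, and the deliberately wrong alternative coefficient of -0.5 are assumptions for illustration.

```python
import numpy as np

def simulate(rhs, x0, t):
    """Forward-Euler integration of dx/dt = rhs(x) on the time grid t."""
    x = np.empty_like(t)
    x[0] = x0
    for k in range(1, len(t)):
        x[k] = x[k - 1] + (t[k] - t[k - 1]) * rhs(x[k - 1])
    return x

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# "Observed" trajectory of the true system dx/dt = -x
t = np.linspace(0, 10, 1001)
observed = np.exp(-t)

# Two candidate discovered equations
good = simulate(lambda x: -1.0 * x, observed[0], t)
poor = simulate(lambda x: -0.5 * x, observed[0], t)

rmse_good = rmse(good, observed)
rmse_poor = rmse(poor, observed)
print(f"RMSE with coefficient -1.0: {rmse_good:.4f}")
print(f"RMSE with coefficient -0.5: {rmse_poor:.4f}")
```

For noisy financial series, a simulated path will not track the observed one point by point. Comparing distributional properties such as volatility or autocorrelation is usually more informative.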

---

## **95.8 Tools and Libraries**

- **gplearn**: Symbolic regression with genetic programming.
- **PySR**: Faster symbolic regression with a Python interface to a Julia backend.
- **pysindy**: Sparse identification of nonlinear dynamics.
- **causalnex**: Causal discovery and Bayesian networks.
- **causallearn**: Collection of causal discovery algorithms.
- **statsmodels**: Granger causality, VAR models.
- **DoWhy**: Causal inference library.

---

## **95.9 Challenges and Limitations**

- **Overfitting**: Complex equations can fit noise. Use parsimony penalties and validation.
- **Interpretability**: Some discovered equations may be too complex to understand.
- **Non‑identifiability**: Multiple equations may fit the data equally well.
- **Computational cost**: Symbolic regression is computationally expensive.
- **Causal discovery assumptions**: Many algorithms assume no hidden confounders, linearity, etc.
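Non-identifiability is easy to demonstrate on synthetic data. Over a narrow input range, `sin(x)` and a straight line are nearly indistinguishable once noise is present. The range, noise level, and sample size below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-0.3, 0.3, size=500)
y = np.sin(x) + rng.normal(scale=0.05, size=500)

# Candidate 1: the true generating form, sin(x)
mse_sin = np.mean((np.sin(x) - y) ** 2)

# Candidate 2: a straight line a * x, with a fitted by least squares
a = np.dot(x, y) / np.dot(x, x)
mse_linear = np.mean((a * x - y) ** 2)

print(f"MSE of sin(x):  {mse_sin:.5f}")
print(f"MSE of {a:.3f} * x: {mse_linear:.5f}")
```

Both candidates fit to within the noise level, so the data alone cannot decide between them. Only observations over a wider range, or domain knowledge, would.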

---

## **95.10 Future Directions**

- **Integration with LLMs**: Use LLMs to guide the search for equations or to interpret discovered models.
- **Automated experimentation**: AI that designs experiments to test hypotheses.
- **Discovering physical laws from video**: Extending to spatiotemporal data.
- **Causal representation learning**: Learning causal variables directly from time‑series.

---

## **Chapter Summary**

In this chapter, we explored automated scientific discovery and its application to time‑series data. We covered:

- Symbolic regression using genetic programming to discover mathematical relationships.
- Equation discovery with SINDy for dynamical systems.
- Causal discovery techniques including Granger causality and the PC algorithm.
- Hypothesis generation using LLMs.
- Practical applications to the NEPSE dataset.
- Evaluation methods and challenges.

Automated discovery can uncover hidden patterns and relationships in financial time‑series, leading to better models and deeper understanding. While not a replacement for human expertise, it is a powerful tool in the data scientist's arsenal.

In the next chapter, we will explore **Edge AI and TinyML**, focusing on deploying time‑series models on resource‑constrained devices.

---

**End of Chapter 95**

