# **Chapter 19: Defining Prediction Targets**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand how to translate business goals into precise prediction targets
- Design target variables appropriate for different forecasting horizons
- Create binary, multi‑class, and regression targets from raw NEPSE data
- Distinguish between price prediction, return prediction, and directional prediction
- Implement probability targets and prediction intervals
- Prevent target leakage when constructing labels
- Validate targets for consistency and practical utility
- Choose the right target for your specific trading or investment strategy

---

## **19.1 Understanding Your Prediction Goal**

Before writing any code or training any model, you must be crystal clear about what you are trying to predict. The prediction target is the variable your model will learn to forecast. In the context of the NEPSE stock prediction system, the goal might be:

- **Predict the exact closing price** for tomorrow – useful for limit orders or valuation.
- **Predict the direction of price movement** (up/down) – sufficient for many trading strategies.
- **Predict the magnitude of return** – for position sizing.
- **Predict volatility** – for risk management.
- **Predict whether a stock will hit its circuit breaker** – a binary event.

Each goal leads to a different target variable and often a different modeling approach. Defining the target correctly is the most important step in the entire pipeline; a poorly chosen target will yield useless models no matter how sophisticated the features.

### **19.1.1 Connecting Business Goals to Targets**

Let’s consider a few realistic scenarios for a trader using the NEPSE system:

| Business Goal | Desired Prediction | Target Type | Example Target |
|---------------|--------------------|-------------|----------------|
| Buy at a specific price tomorrow | Exact closing price | Regression | `Close` at t+1 |
| Decide whether to buy or sell | Direction (up/down) | Binary classification | 1 if `Close` at t+1 > `Close` at t, else 0 |
| Scale position size | Magnitude of return | Regression | `(Close_{t+1} - Close_t) / Close_t` |
| Avoid circuit breaker hits | Probability of hitting upper circuit | Probability / Binary | 1 if `Daily_Return` at t+1 ≥ 4%, else 0 |
| Hedge volatility | Future volatility | Regression | Standard deviation of daily returns over the next 20 days |

In this chapter, we will focus on the most common targets: next-day close price, next-day return, and direction. However, the principles apply to any target you might define.

---

## **19.2 Target Variable Design**

Designing a target variable involves deciding what exactly the model should predict and over what horizon. The raw NEPSE CSV contains many columns, but the most common source for targets is the `Close` price. However, we must be careful: using raw prices can lead to non‑stationarity issues (see Chapter 2). Often it is better to predict returns or transformations that are more stationary.

### **19.2.1 Predicting Raw Prices**

Predicting the exact future closing price is the most direct regression task. It is straightforward to implement:

```python
import pandas as pd
import numpy as np

# Load NEPSE data
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Create target: next day's closing price
df['Target_Close_t+1'] = df.groupby('Symbol')['Close'].shift(-1)

# Check the result
print(df[['Date', 'Symbol', 'Close', 'Target_Close_t+1']].head(10))
```

**Explanation:**

- We use `groupby('Symbol')` because each symbol has its own time series. Without grouping, the shift would incorrectly use the next row even if it belongs to a different stock.
- `shift(-1)` moves the next day's close value up by one row. For the last row of each symbol, the target will be `NaN` (since there is no next day). We will drop those rows later.
- This target is intuitive: the model learns to output a number that should be as close as possible to the actual next closing price.

**Drawbacks of raw price targets:**
- Prices are non‑stationary; they trend over time. A model trained on old prices may not generalize to new price levels.
- Performance metrics like RMSE are scale‑dependent; a stock trading at 1000 NPR will have larger errors than one at 100 NPR, even if the relative error is the same.
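
To make the scale-dependence point concrete, the sketch below applies the same 2% forecast error to two stocks at different price levels. The prices and the `rmse` helper are invented for illustration and are not part of the NEPSE dataset.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical closing prices for a high-priced and a low-priced stock
high_priced = np.array([1000.0, 1010.0, 990.0])
low_priced = np.array([100.0, 101.0, 99.0])

# Both forecasts overshoot by exactly 2%
rmse_high = rmse(high_priced, high_priced * 1.02)
rmse_low = rmse(low_priced, low_priced * 1.02)

print(f"RMSE high-priced: {rmse_high:.2f} NPR")
print(f"RMSE low-priced:  {rmse_low:.2f} NPR")
print(f"Ratio: {rmse_high / rmse_low:.1f}")
```

The relative error is identical, yet the RMSE differs by the ratio of the price levels, which is why pooled price-level metrics are dominated by expensive stocks.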

### **19.2.2 Predicting Returns**

Returns (percentage changes) are much more stationary than raw prices. Most financial forecasting models use returns as the target. The daily return can be computed as:

```
Return_t+1 = (Close_t+1 - Close_t) / Close_t
```

or in log form: `log(Close_t+1) - log(Close_t)`.

```python
# Compute daily return
df['Return'] = df.groupby('Symbol')['Close'].pct_change()

# Target: next day's return
df['Target_Return_t+1'] = df.groupby('Symbol')['Return'].shift(-1)

# For the first row of each symbol, Return will be NaN; we'll handle later.
print(df[['Date', 'Symbol', 'Close', 'Return', 'Target_Return_t+1']].head(10))
```

**Explanation:**

- `pct_change()` calculates the percentage change from the previous row within each symbol group. This gives us today's return (based on yesterday's close).
- We then shift that return backward to align today's features with tomorrow's return. So at row `t`, we have features from day `t` and the target is the return from day `t` to day `t+1`.
- Returns are typically centered around zero and have more stable statistical properties than prices. Models predicting returns often generalize better across different market regimes.

**Log returns** are often preferred because they are additive over time and tend to be more symmetric than simple returns (real return distributions still have fatter tails than a normal distribution):

```python
df['Log_Close'] = np.log(df['Close'])
df['Log_Return'] = df.groupby('Symbol')['Log_Close'].diff()
df['Target_LogReturn_t+1'] = df.groupby('Symbol')['Log_Return'].shift(-1)
```

Log returns are approximately equal to simple returns for small changes, but they have nicer mathematical properties.
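
The following sketch checks both properties numerically on a short, made-up price path: log returns sum to the whole-period log return, while simple returns must be compounded. The price values are assumptions chosen only for the demonstration.

```python
import numpy as np

# Hypothetical closing prices over four consecutive trading days
close = np.array([500.0, 505.0, 498.0, 510.0])

simple_returns = close[1:] / close[:-1] - 1
log_returns = np.diff(np.log(close))

# Log returns add up to the log return over the whole period
total_log_return = np.log(close[-1] / close[0])
assert np.isclose(log_returns.sum(), total_log_return)

# Simple returns compound multiplicatively instead of adding
total_simple_return = close[-1] / close[0] - 1
assert np.isclose(np.prod(1 + simple_returns) - 1, total_simple_return)

# For small daily moves the two definitions nearly coincide
max_gap = np.max(np.abs(simple_returns - log_returns))
print(f"Largest simple-vs-log gap: {max_gap:.6f}")
```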

### **19.2.3 Predicting Direction**

For many trading strategies, only the direction matters – whether the price will go up or down. This simplifies the problem to binary classification.

```python
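# Binary direction: 1 if next day's close > today's close, else 0
next_close = df.groupby('Symbol')['Close'].shift(-1)
df['Target_Direction_t+1'] = (next_close > df['Close']).astype(float)
# The last row of each symbol has no next day: label it NaN, not 0
df.loc[next_close.isna(), 'Target_Direction_t+1'] = np.nan

# Check distribution
print(df['Target_Direction_t+1'].value_counts(normalize=True))
```

**Explanation:**

- The comparison `next_close > df['Close']` creates a boolean Series, which we convert to 0/1. Rows without a next day are set to `NaN`; otherwise the comparison against `NaN` would silently label them as "down".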
- This target ignores the magnitude of the move – a 0.1% increase is treated the same as a 10% increase. This can be both a strength (less sensitive to noise) and a weakness (doesn't differentiate between small and large moves).
- Class imbalance: In trending markets, ups may outnumber downs, or vice versa. We may need to handle imbalance (e.g., via class weights or resampling).
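
As a rough illustration of the class-weight idea, the sketch below derives "balanced" weights (inversely proportional to class frequency) from a small, invented label series. The labels are hypothetical; the formula is the common `n_samples / (n_classes * count)` heuristic.

```python
import pandas as pd

# Hypothetical direction labels: 7 up days, 3 down days
labels = pd.Series([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

counts = labels.value_counts().sort_index()
n_samples, n_classes = len(labels), counts.size

# "Balanced" weights: inversely proportional to class frequency
class_weight = (n_samples / (n_classes * counts)).to_dict()

print(counts.to_dict())
print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of many scikit-learn classifiers, such as `LogisticRegression`.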

We could also create a ternary target: up, down, or no change (within a threshold). For example, define "no change" as absolute return < 0.5%.

```python
# Multi-class direction
def direction_class(ret, threshold=0.005):
    if ret > threshold:
        return 2  # up
    elif ret < -threshold:
        return 0  # down
    else:
        return 1  # flat

df['Target_Multi'] = df['Target_Return_t+1'].apply(lambda x: direction_class(x) if pd.notna(x) else np.nan)
```

---

## **19.3 Time Horizon Selection**

The prediction horizon determines how far into the future we are forecasting. The NEPSE data is daily, so natural horizons are:

- **Short‑term:** 1 to 5 days ahead (next day to next week)
- **Medium‑term:** 5 to 20 days ahead (up to one month)
- **Long‑term:** 20+ days ahead (multi‑month)

### **19.3.1 Short‑Term Horizons**

For a day trader or swing trader, the focus is on 1‑ to 5‑day forecasts. These are the most common in machine learning for finance because patterns are more discernible over short periods.

```python
# Short-term: 1-day ahead (already shown)
# 3-day ahead target (closing price in 3 days)
df['Target_Close_t+3'] = df.groupby('Symbol')['Close'].shift(-3)

# 3-day return
df['Target_Return_t+3'] = (df.groupby('Symbol')['Close'].shift(-3) / df['Close']) - 1
```

**Explanation:**

- `shift(-3)` aligns today's features with the closing price three trading days later. Note that weekends and holidays mean "3 trading days" may span more than 3 calendar days, but for a trading system that's exactly what we want.
- The further we predict, the noisier and less accurate the forecasts tend to be.

### **19.3.2 Medium‑Term Horizons**

Medium‑term forecasts (e.g., 10 or 20 days) are relevant for position traders or portfolio rebalancing.

```python
# 10-day ahead return
df['Target_Return_t+10'] = (df.groupby('Symbol')['Close'].shift(-10) / df['Close']) - 1
```

**Caution:** With longer horizons, the overlap between training and test periods in cross‑validation becomes more complex. We must ensure that our validation strategy respects that future information is not leaked.
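
One way to picture the overlap problem: with a 10-day horizon, the label of a training row near the split boundary is computed from closing prices that belong to the test period. The sketch below, on a synthetic single-symbol series, drops ("purges") those rows. The split position and the price path are arbitrary assumptions for the demo.

```python
import numpy as np
import pandas as pd

horizon = 10  # the label at row t looks ahead to row t + horizon

# Hypothetical single-symbol price series, one row per trading day
prices = pd.DataFrame({"Close": np.linspace(100.0, 130.0, 60)})
prices["Target_Return_t+10"] = prices["Close"].shift(-horizon) / prices["Close"] - 1
prices = prices.dropna(subset=["Target_Return_t+10"])

split = 35  # first row of the test period (arbitrary choice for the demo)

# Purge: drop the last `horizon` training rows, because their labels
# depend on closing prices that fall inside the test period
train = prices.iloc[: split - horizon]
test = prices.iloc[split:]

print(f"train rows: {len(train)}, test rows: {len(test)}")
```

After purging, the latest price used by any training label precedes the first test row.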

### **19.3.3 Multi‑Step Forecasting**

If you need forecasts for multiple horizons simultaneously, you have two choices:

1. **Direct approach:** Train separate models for each horizon (e.g., one model for t+1, another for t+2, etc.).
2. **Recursive approach:** Train one model for t+1, then use its predictions as features to predict t+2, and so on.

The direct approach is simpler and often more accurate, though it requires maintaining multiple models. The recursive approach can compound errors.

```python
# Direct multi-step example
horizons = [1, 2, 3, 5, 10]
targets = {}
for h in horizons:
    targets[f'Target_t+{h}'] = df.groupby('Symbol')['Close'].shift(-h)

# Now you have separate columns, each can be used to train a dedicated model.
```

---

## **19.4 Binary Classification Targets**

Binary classification is widely used for directional forecasting. The target is typically 1 for "up" and 0 for "down". However, we must define what "up" means precisely.

### **19.4.1 Simple Up/Down**

The simplest definition: up if next close > current close.

```python
next_close = df.groupby('Symbol')['Close'].shift(-1)
df['Target_Up'] = (next_close > df['Close']).astype(float).where(next_close.notna())  # NaN when there is no next day
```

### **19.4.2 Threshold‑Based Up/Down**

To avoid predicting tiny, random movements, we can define a threshold. Only moves larger than a certain percentage are considered "up" or "down"; moves within the threshold are discarded or treated as a third class.

```python
threshold = 0.01  # 1%
future_return = (df.groupby('Symbol')['Close'].shift(-1) / df['Close']) - 1

df['Target_Up_Threshold'] = 0.0
df.loc[future_return > threshold, 'Target_Up_Threshold'] = 1
df.loc[future_return < -threshold, 'Target_Up_Threshold'] = -1   # or 0 if binary
df.loc[future_return.isna(), 'Target_Up_Threshold'] = np.nan      # no next day, no label

# For binary classification, we might keep only rows with |return| > threshold
# and map -1 to 0 for down.
```

**Explanation:**

- This creates a cleaner signal, ignoring noise. However, it reduces the number of training samples and may discard valuable information if small moves are actually predictable.
- In practice, you might experiment with different thresholds based on transaction costs – you only care about moves large enough to be profitable after costs.
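
A minimal sketch of the binary variant described above, where small moves are discarded and the threshold is tied to trading costs. The return values and the 1% `round_trip_cost` are assumptions; substitute the fees that apply to your own account.

```python
import numpy as np
import pandas as pd

# Hypothetical next-day returns; NaN marks a row with no next day
future_return = pd.Series([0.015, -0.002, -0.02, 0.004, 0.03, np.nan])

round_trip_cost = 0.01  # assumed 1% total cost; substitute your own fees

# Keep only moves large enough to matter after costs
tradable = future_return.abs() > round_trip_cost
binary_target = (future_return[tradable] > 0).astype(int)

print(binary_target.tolist())
print(f"Kept {tradable.sum()} of {future_return.notna().sum()} labelled rows")
```

Rows with a missing return fail the comparison and are dropped together with the small moves, so the remaining labels are strictly 0 or 1.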

### **19.4.3 Event‑Based Targets**

Sometimes you want to predict a specific event, such as hitting the upper circuit breaker (≥4% gain in NEPSE). This is a binary target.

```python
# Daily return (using Prev. Close if available, else compute)
if 'Prev. Close' in df.columns:
    df['Daily_Return'] = (df['Close'] - df['Prev. Close']) / df['Prev. Close']
else:
    df['Daily_Return'] = df.groupby('Symbol')['Close'].pct_change()

# Target: 1 if tomorrow's return is >= 4% (upper circuit), else 0
tomorrow_return = df.groupby('Symbol')['Daily_Return'].shift(-1)
df['Target_UpperCircuit'] = (tomorrow_return >= 0.04).astype(float)
df.loc[tomorrow_return.isna(), 'Target_UpperCircuit'] = np.nan  # no next day, no label
```

**Explanation:**

- Circuit breaker events are rare (class imbalance). Predicting them is a classic rare‑event prediction problem. We would need special techniques (e.g., oversampling, cost‑sensitive learning) to handle the imbalance.

---

## **19.5 Regression Targets**

Regression targets predict a continuous value. The most common are price, return, and volatility.

### **19.5.1 Price Regression**

Already covered in 19.2.1. One nuance: when predicting price, we often use log price to stabilize variance.

```python
df['Log_Close'] = np.log(df['Close'])
df['Target_LogClose_t+1'] = df.groupby('Symbol')['Log_Close'].shift(-1)
```

After prediction, we can exponentiate to get price.

### **19.5.2 Return Regression**

Predicting the exact return (percentage change) is the standard approach in academic finance.

```python
# Both the shift and the division stay within each symbol's own series
df['Target_Return_t+1'] = df.groupby('Symbol')['Close'].shift(-1) / df['Close'] - 1
```

**Why returns?** Returns are scale‑free and more stationary. They also align with financial theory, where log prices are often modeled as a random walk with drift, which makes returns the natural quantity to forecast.

### **19.5.3 Volatility Regression**

Volatility forecasting is crucial for risk management. We can define volatility as the standard deviation of returns over a future window, or as the daily price range.

```python
# Future 5-day realized volatility (standard deviation of daily returns)
# First, compute daily returns
df['Return'] = df.groupby('Symbol')['Close'].pct_change()

# For each day, compute the standard deviation of returns over the next 5 days
# This requires a rolling forward window – careful with lookahead
def future_volatility(series, window=5):
    return series.rolling(window).std().shift(-window)

# transform keeps the original row index, so the result aligns with df
df['Target_Volatility_t+5'] = df.groupby('Symbol')['Return'].transform(
    lambda x: future_volatility(x, 5)
)
```

**Explanation:**

- `future_volatility` computes the rolling standard deviation and then shifts it backward so that today's row contains the volatility of the next `window` days.
- This target is useful for models that predict risk, not just return.

---

## **19.6 Multi‑Class Classification Targets**

Sometimes binary up/down is too coarse, and regression is too detailed. Multi‑class offers a middle ground. Common classes:

- Strong down, down, flat, up, strong up
- Or based on quantiles of historical returns

### **19.6.1 Quantile‑Based Classes**

Divide the distribution of future returns into quantiles (e.g., quintiles).

```python
# Next-day return, computed within each symbol
future_return = df.groupby('Symbol')['Close'].shift(-1) / df['Close'] - 1

# Quintile bins are computed over the pooled returns of all symbols.
# qcut leaves NaN returns unlabeled, so no manual index mapping is needed.
df['Target_Quintile'] = pd.qcut(
    future_return, q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5']
)
```

**Explanation:**

- This creates five classes of equal frequency. The model learns to predict which quintile the next return falls into.
- This can be useful for strategies that differentiate between strong and weak moves, but the classes are relative to the historical distribution, which may shift over time.
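
Because the bin edges depend on the data they are computed from, one reasonable precaution is to learn the edges on the training period only and then apply them unchanged to later data. The sketch below uses synthetic returns (a calm period followed by a more volatile one) to show how the class mix drifts; all series and parameters are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical daily returns: a calm training period, then a more volatile one
train_returns = pd.Series(rng.normal(0.0, 0.01, 500))
later_returns = pd.Series(rng.normal(0.0, 0.02, 200))

labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]

# Learn the bin edges on the training period only
_, edges = pd.qcut(train_returns, q=5, labels=labels, retbins=True)
edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range

# Apply the frozen edges to later data
later_classes = pd.cut(later_returns, bins=edges, labels=labels)
print(later_classes.value_counts(normalize=True).sort_index())
```

In the more volatile period the outer classes become far more frequent than 20% each, which is the distribution shift mentioned above.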

### **19.6.2 Fixed Threshold Classes**

Define classes based on fixed economic thresholds, e.g.,:
- Class 0: return < -2%
- Class 1: -2% ≤ return < 0%
- Class 2: 0% ≤ return < 2%
- Class 3: return ≥ 2%

```python
def return_class(ret):
    if pd.isna(ret):
        return np.nan
    if ret < -0.02:
        return 0
    elif ret < 0:
        return 1
    elif ret < 0.02:
        return 2
    else:
        return 3

df['Target_Class'] = future_return.apply(return_class)
```

These thresholds are stable over time and interpretable.

---

## **19.7 Probability Targets**

Sometimes we want the model to output a probability – e.g., probability that the return exceeds 2%, or probability of an up move. This can be achieved by:

- Using a probabilistic model (e.g., logistic regression outputs probabilities).
- Training a classifier and calibrating its outputs (e.g., Platt scaling).
- Using quantile regression or distribution forecasting.

For the NEPSE system, probability targets are useful for risk‑adjusted decision making.

### **19.7.1 Binary Probability Target**

If we define a binary event, a classifier can output the probability of that event. For example, probability of an up move.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

# Assume X_train, y_train (binary) are ready
model = LogisticRegression()
model.fit(X_train, y_train)

# Get probabilities
probs = model.predict_proba(X_test)[:, 1]  # probability of class 1
```

**Explanation:**

- `predict_proba` returns an array of shape (n_samples, n_classes). For binary classification, the second column is the probability of the positive class.
- These probabilities can be used directly for position sizing (e.g., Kelly criterion) or for ranking trades.
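
Raw classifier scores are not always well calibrated. The sketch below shows one possible way to apply Platt scaling with scikit-learn's `CalibratedClassifierCV`, using synthetic features and labels as stand-ins for real NEPSE data. Note that the internal folds here ignore time order, which is acceptable only because the data is synthetic.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic stand-ins for real features and up/down labels
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Chronological split: earlier rows train, later rows test
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Platt scaling (method="sigmoid") fitted with 3-fold internal cross-validation
calibrated = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]  # calibrated P(up)
print(probs[:5].round(3))
```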

### **19.7.2 Quantile Regression for Probabilistic Forecasts**

Quantile regression predicts specific quantiles of the target distribution, giving a full probabilistic forecast without assuming a distributional form.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Train a model to predict the 0.1 quantile (lower bound)
model_lower = GradientBoostingRegressor(loss='quantile', alpha=0.1)
model_lower.fit(X_train, y_train)  # y_train is return

# Predict the 0.9 quantile (upper bound)
model_upper = GradientBoostingRegressor(loss='quantile', alpha=0.9)
model_upper.fit(X_train, y_train)

lower = model_lower.predict(X_test)
upper = model_upper.predict(X_test)
```

**Explanation:**

- Gradient boosting with `loss='quantile'` directly estimates the conditional quantile. This gives us prediction intervals without any parametric assumptions.
- The interval [lower, upper] should contain the true value 80% of the time (if the model is well‑calibrated).
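
Whether the interval really covers about 80% of outcomes can be checked empirically on held-out data. The helper below, `interval_coverage`, is a hypothetical name introduced here, and the arrays are made-up numbers standing in for realized returns and predicted quantiles.

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# Hypothetical realized returns and predicted 10%/90% quantiles
y_true = np.array([0.010, -0.020, 0.005, 0.030, -0.001])
lower = np.array([-0.015, -0.015, -0.010, -0.010, -0.012])
upper = np.array([0.020, 0.018, 0.015, 0.020, 0.011])

coverage = interval_coverage(y_true, lower, upper)
print(f"Empirical coverage: {coverage:.0%} (nominal 80%)")
```

Coverage well below the nominal level suggests the intervals are too narrow; coverage well above it suggests they are too wide to be useful.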

---

## **19.8 Creating Labels from Raw Data**

Labels are the actual values of the target variable for historical data. They must be created carefully to avoid using future information. The `shift` operation is the primary tool, but there are nuances.

### **19.8.1 Using `shift` Correctly**

Always use `groupby` to ensure shifts are within each symbol's time series. Also, be mindful of missing days (e.g., weekends, holidays). If your data includes only trading days, `shift` works perfectly. If there are gaps, you may need to reindex to a continuous calendar and then fill.

```python
# Correct: shift within each symbol
df['Target'] = df.groupby('Symbol')['Close'].shift(-1)

# Incorrect: global shift (mixes symbols)
df['Target_wrong'] = df['Close'].shift(-1)  # WRONG
```

### **19.8.2 Handling Non‑Trading Days**

If your dataset includes only trading days, there is no issue. But if you ever need to predict over calendar days (e.g., Sunday's close from Thursday's close, since NEPSE's regular trading week runs Sunday through Thursday), you must account for the weekend gap. In that case, you would reindex each symbol to a daily calendar and then shift.

```python
# Example: reindex to daily calendar for one symbol
symbol_df = df[df['Symbol'] == 'NEPSE'].set_index('Date').asfreq('D')
# Now shift works with calendar days, but you'll have many NaN rows for non‑trading days.
```

For most trading systems, using trading days is appropriate because that's when you can actually trade.

### **19.8.3 Creating Multi‑Horizon Labels**

For multi‑horizon forecasting, you create multiple target columns, each shifted by a different amount. Ensure you have enough data for the longest horizon (the last `h` rows of each symbol will have NaN targets and must be dropped).

```python
horizons = [1, 2, 3, 5, 10]
for h in horizons:
    df[f'Target_t+{h}'] = df.groupby('Symbol')['Close'].shift(-h)
```
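
The effect on sample size can be seen on a toy dataset. This sketch builds two horizons for two hypothetical symbols (`AAA` and `BBB`, not real NEPSE tickers) and drops every row that lacks at least one target.

```python
import pandas as pd

# Hypothetical two-symbol dataset, six trading days each
demo = pd.DataFrame({
    "Symbol": ["AAA"] * 6 + ["BBB"] * 6,
    "Close": [10, 11, 12, 13, 14, 15, 50, 51, 52, 53, 54, 55],
})

horizons = [1, 3]
target_cols = []
for h in horizons:
    col = f"Target_t+{h}"
    demo[col] = demo.groupby("Symbol")["Close"].shift(-h)
    target_cols.append(col)

# Rows lacking any target are the last max(horizons) rows of each symbol
complete = demo.dropna(subset=target_cols)
print(len(demo), "->", len(complete))
```

If each horizon gets its own model, an alternative is to drop rows per target column instead, which keeps more data for the short horizons.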

---

## **19.9 Target Leakage Prevention**

Target leakage occurs when information from the future is inadvertently used to create the target or features. This is a critical pitfall in time‑series. For targets, the main leakage risk is using data that wouldn't be available at the "prediction time" to compute the target.

### **19.9.1 Leakage in Target Creation**

A target is, by definition, built from future data: "next day's close" or "next day's return" is exactly what the model is asked to predict, so using `shift(-1)` to construct the label is not leakage. Leakage arises when a quantity that is only known after prediction time ends up among the **features**. For example, `Prev. Close` is known today and is safe to use, but tomorrow's `High` is not.

```python
# LEAKAGE EXAMPLE (WRONG): a future value used as a feature
df['Feature_Leaky'] = (df.groupby('Symbol')['High'].shift(-1) - df['Close']) / df['Close']
# This uses tomorrow's High, which is not known today.
```

The same expression would be a legitimate *target* (the maximum upside over the next day); it becomes leakage only when it is fed to the model as an input. The target must also match the moment at which you act: if you predict at today's close, the label should be built only from data after today's close.

### **19.9.2 Leakage in Feature Engineering for Target**

Sometimes you might accidentally create features that include the target value. For example, a rolling mean whose window extends into the target period leaks the answer.

```python
# Safe for a t+1 target: the window ends at today's close
df['SMA_5'] = df.groupby('Symbol')['Close'].transform(lambda s: s.rolling(5).mean())

# LEAKAGE: a centered window also averages the next two closes
df['SMA_5_leaky'] = df.groupby('Symbol')['Close'].transform(
    lambda s: s.rolling(5, center=True).mean()
)
# If you were predicting today's close instead, even SMA_5 would leak.
```

Always think about the temporal order: features must be from time `t` or earlier; target is from time `t+1` or later.
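
A simple way to test a feature for look-ahead is to recompute it on truncated history and see whether past values change. The helper `uses_future_data` below is a hypothetical name for this check, shown on a made-up price series.

```python
import numpy as np
import pandas as pd

def uses_future_data(feature_fn, series, cutoff):
    """Return True if values up to `cutoff` change when later rows are removed."""
    full = feature_fn(series).iloc[: cutoff + 1]
    truncated = feature_fn(series.iloc[: cutoff + 1])
    return not np.allclose(full, truncated, equal_nan=True)

# Hypothetical closing prices for a single symbol
close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0, 111.0])

trailing_sma = lambda s: s.rolling(3).mean()               # window ends today
centered_sma = lambda s: s.rolling(3, center=True).mean()  # reaches one day ahead

print(uses_future_data(trailing_sma, close, cutoff=4))
print(uses_future_data(centered_sma, close, cutoff=4))
```

The trailing average is unaffected by removing later rows, while the centered average is not, which flags it as unusable for prediction at time `t`.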

### **19.9.3 Validation Leakage**

When you split data, ensure that no target information from the validation set is used to create features for the training set. This is why we use time‑based splits and never random shuffles.

---

## **19.10 Target Validation**

After creating your target, you should validate that it makes sense and is suitable for your modeling task.

### **19.10.1 Basic Checks**

- **Missing values:** After shifting, the last `h` rows for each symbol will be NaN. Check that you have enough data.
- **Distribution:** Plot the target distribution. For returns, it should be roughly symmetric around zero. For binary classification, check class balance.
- **Stationarity:** For regression targets, run an Augmented Dickey‑Fuller test to see if the target is stationary. Returns usually are; prices are not.

```python
from statsmodels.tsa.stattools import adfuller

# Test stationarity of returns for one symbol
returns = df[df['Symbol']=='NEPSE']['Return'].dropna()
result = adfuller(returns)
print(f'ADF Statistic: {result[0]}')
print(f'p-value: {result[1]}')
if result[1] <= 0.05:
    print("Series is stationary")
else:
    print("Series is non-stationary")
```

### **19.10.2 Persistence and Random Walk Baseline**

For returns, the simplest baseline is the historical mean, which is close to zero for daily returns. For direction, the persistence model (predict "up" if today was up) is a common baseline. Compare your target's predictability against these baselines.

```python
# Persistence baseline for direction
# Shifting the t+1 target forward by one row gives the move from t-1 to t,
# which is already known at time t, so it is a valid naive forecast.
df['Prev_Direction'] = df.groupby('Symbol')['Target_Direction_t+1'].shift(1)
```
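
To turn the baseline into a number, compare it with the realized direction. The sketch below does this for a single made-up price series and also reports the majority-class accuracy, which any useful model should beat as well. The prices are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical closes for one symbol
close = pd.Series([100, 101, 103, 102, 104, 105, 104, 103, 105, 106], dtype=float)

next_close = close.shift(-1)
prev_close = close.shift(1)

# Direction from t to t+1 (the target) and from t-1 to t (known at time t)
target = (next_close > close).astype(float).where(next_close.notna())
today = (close > prev_close).astype(float).where(prev_close.notna())

scored = pd.DataFrame({"target": target, "persistence": today}).dropna()
persistence_acc = (scored["target"] == scored["persistence"]).mean()
majority_acc = max(scored["target"].mean(), 1 - scored["target"].mean())

print(f"Persistence accuracy: {persistence_acc:.2f}")
print(f"Majority-class accuracy: {majority_acc:.2f}")
```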

### **19.10.3 Economic Significance**

Ultimately, the target must align with a profitable trading strategy. For example, if you predict direction with 51% accuracy, you might still lose money after transaction costs. Consider designing targets that incorporate costs directly (e.g., only predict moves large enough to cover costs).
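
The arithmetic behind the 51% example can be sketched as follows. The function `expected_net_return` is a hypothetical helper, and it assumes that winning and losing trades have the same average size; the 1% move and 0.5% cost are illustrative values, not NEPSE fee schedules.

```python
def expected_net_return(hit_rate, avg_move, round_trip_cost):
    """Expected per-trade return for a symmetric up/down bet after costs.

    Assumes wins and losses have the same average size `avg_move`.
    """
    gross = hit_rate * avg_move - (1 - hit_rate) * avg_move
    return gross - round_trip_cost

# Hypothetical inputs: 51% accuracy, 1% average move, 0.5% round-trip cost
net = expected_net_return(hit_rate=0.51, avg_move=0.01, round_trip_cost=0.005)
print(f"Expected net return per trade: {net:.4%}")

# Accuracy needed to break even under the same assumptions:
# (2p - 1) * avg_move = cost  =>  p = 0.5 + cost / (2 * avg_move)
breakeven_hit_rate = 0.5 + 0.005 / (2 * 0.01)
print(f"Break-even hit rate: {breakeven_hit_rate:.0%}")
```

Under these assumptions a 51% hit rate loses money on every trade in expectation, and the break-even accuracy is far higher than intuition might suggest.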

---

## **Chapter Summary**

In this chapter, we explored the critical first step of any prediction system: defining the target variable. Using the NEPSE dataset as our example, we covered:

- **Understanding the business goal** – the target must reflect what you actually want to achieve.
- **Designing target types:** regression (price, return, volatility), classification (binary, multi‑class), and probability targets.
- **Horizon selection:** short‑term (1‑5 days), medium‑term (5‑20 days), and long‑term.
- **Creating labels correctly** using `groupby` and `shift` to align features with future values.
- **Preventing target leakage** – ensuring no future information contaminates the training process.
- **Validating targets** – checking stationarity, distribution, and baseline comparisons.

### **Practical Takeaways for the NEPSE System:**

- For most trading strategies, predicting **returns** (not raw prices) is preferable due to stationarity.
- **Directional targets** simplify the problem and are sufficient for many strategies, but may ignore magnitude.
- **Multi‑class targets** can differentiate between strong and weak moves.
- Always use **time‑aware splits** and never shuffle when creating training and test sets.
- Validate that your target is **predictable beyond a naive baseline** before investing in complex models.

With well‑defined targets, we are ready to build models that learn meaningful patterns. In the next chapter, **Chapter 20: Data Splitting Strategies**, we will explore how to properly divide time‑series data for training, validation, and testing, ensuring our models are evaluated realistically.

---

**End of Chapter 19**