![QuantConnect Logo](https://cdn.quantconnect.com/web/i/icon.png)
<hr>

# Introduction
This notebook illustrates how applying various preprocessing techniques to factor values can affect the accuracy of a machine learning model. The model used here is a random forest from the `lightgbm` Python package.

# Part 1: Get Raw Data
Gather some raw data so you can train and test the model. Start by producing the label, which is 1 if the future weekly (5-trading-day) open-to-open return of the SPY ETF is positive and 0 otherwise.

In [1]:
import plotly.graph_objects as go

# Create a QuantBook.
qb = QuantBook()
# Add the asset.
symbol = qb.add_equity("SPY", Resolution.DAILY).symbol
# Get the asset history.
history = qb.history(symbol, datetime(2000, 1, 1), datetime(2024, 1, 1))
# Calculate the labels.
label = history.loc[symbol]['open'].pct_change(5).shift(-5).dropna().apply(
    lambda x: int(x > 0)
)
# Show the result.
go.Figure(
    go.Scatter(
        x=label.index, 
        # Align prices with the labels; the last 5 rows have no label.
        y=history.loc[symbol]['open'].loc[label.index], mode='markers', 
        marker=dict(
            color=['blue' if x else 'red' for x in label.values], size=3
        )
    ),
    dict(
        title="Label Distribution<br><sup>Labels change frequently "
            + "as a result of market volatility.</sup>", 
        xaxis_title="Date", yaxis_title="Price", 
        xaxis={'range': [label.index[0], label.index[-1]]}
    )
).show()
print(label)
print(label.value_counts())
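
To make the `pct_change(5).shift(-5)` step concrete, here is a minimal sketch on a toy series. The prices and the names `opens`, `future_return`, and `toy_label` are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical open prices for eight consecutive business days.
opens = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 100.0, 110.0],
    index=pd.date_range("2024-01-01", periods=8, freq="B"),
)

# pct_change(5) gives the trailing 5-day return at each row.
# shift(-5) moves it back 5 rows, so each row holds its future 5-day return.
future_return = opens.pct_change(5).shift(-5)

# The last 5 rows have no future price yet, so dropna() removes them.
toy_label = (future_return.dropna() > 0).astype(int)
print(toy_label)
```

Because the last five rows have no future price, the label series is five rows shorter than the price series it was built from.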

Next, define the factors you'll input into the model to predict the label. For demonstration purposes, use random factors.

In [46]:
np.random.seed(2)
num_factors = 4
num_samples = len(label)
factors = np.random.rand(num_samples, num_factors)
# Make one of the factors non-stationary.
factors[:, -1] = factors[:, -1].cumsum()
factors

# Part 2: Test Model Accuracy Using Raw Factors
The next code block defines a function that trains and tests the model. It uses the first 75% of the data for training and holds out the last 25% for the out-of-sample test. Let's see how the model performs with just the raw factor values. The function prints the out-of-sample accuracy, which is the fraction of test samples for which the model predicted the correct label. It also displays a line plot of the probability the model assigns to each possible label (0 or 1) for each sample in the test set.

The factors are random, but there are more labels of class 1 than class 0, so expect the model to assign a greater probability to class 1 for most predictions and to reach an accuracy slightly greater than 50%.

In [47]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def oos_accuracy(factors, label):
    X_train, X_test, y_train, y_test = train_test_split(
        factors, label, test_size=0.25, shuffle=False
    )
    model = lgb.train(
        {
            'seed' : 1234, 
            'verbose': -1, 
            'boosting_type': 'rf', 
            'feature_fraction': 0.8, 
            'objective': 'multiclass', 
            'num_class': 2, 
            'bagging_freq': 5, 
            'bagging_fraction': 0.8,
            # LightGBM spells this parameter `is_unbalance`.
            'is_unbalance': True
        }, 
        train_set=lgb.Dataset(
            data=X_train, label=y_train, free_raw_data=True
        ).construct()
    )
    predictions = model.predict(X_test)
    x = list(range(len(predictions)))
    go.Figure(
        [
            go.Scatter(x=x, y=predictions[:, 0], name=0),
            go.Scatter(x=x, y=predictions[:, 1], name=1)
        ],
        dict(
            title="Probability of Each Label<br><sup>Class 1 gets a greater "
                + "probability because the SPY has an upward bias</sup>", 
            xaxis_title="Test Sample", yaxis_title="Probability"
        )
    ).show()
    y_hat = predictions.argmax(axis=1)
    # accuracy_score expects (y_true, y_pred).
    print(f"Accuracy: {round(accuracy_score(y_test, y_hat), 4)}")

oos_accuracy(factors, label)
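
One way to judge the printed accuracy is to compare it with a naive baseline that always predicts the most common training label. The sketch below assumes a hypothetical helper, `majority_class_accuracy`, and made-up label counts. It is not part of the notebook's pipeline.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy from always predicting the most common training label."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    majority = np.bincount(y_train).argmax()
    return float((y_test == majority).mean())

# Hypothetical labels: 55% ones in training, 60% ones in the test window.
y_train = np.array([1] * 55 + [0] * 45)
y_test = np.array([1] * 12 + [0] * 8)

baseline = majority_class_accuracy(y_train, y_test)
print(f"Baseline accuracy: {baseline:.2f}")
```

If a model trained on random factors scores close to this baseline, it has most likely learned only the class imbalance.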

# Part 3: Test Model Accuracy Using Stationary Factors

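Lopez de Prado explains that "supervised learning algorithms typically require stationary features" (2018, p. 76). Let's perform an Augmented Dickey–Fuller (ADF) test to see if our factors are stationary at the 95% confidence level. The null hypothesis of the test is that the series has a unit root (is non-stationary), so a p-value at or below 0.05 rejects it.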

### Test Factor Stationarity

In [48]:
from statsmodels.tsa.stattools import adfuller

for factor_idx in range(num_factors):
    factor = factors[:, factor_idx]
    test_results = adfuller(factor, maxlag=1, regression='c', autolag=None)
    # Check the p-value.
    output = "Stationary" if test_results[1] <= 0.05 else "Not stationary"
    print(f"Factor {factor_idx}: {output}")

### Adjust Factor Values to Achieve Stationarity
If the raw factors aren't stationary, you can transform them to make them stationary. Lopez de Prado mentions that "virtually all finance papers attempt to recover stationarity by applying an integer differentiation. . ., which means that most studies have over-differentiated the series, that is, they have removed much more memory than was necessary to satisfy standard econometric assumptions" (2018, p. 76). To avoid over-differentiating the factors, you can use Lopez de Prado's fractional differentiation technique. The following code is from Lopez de Prado (2018, pp. 79-84):

In [49]:
def get_weights_ffd(d, thres):
    '''
    Computing the weights for differentiating the series with fixed 
    window size
    
        Parameters:
            d (float): differentiating factor
            thres (float): threshold for cutting off weights
            
        Returns:
            w (np.ndarray): array containing weights
    '''
    w, k = [1.0], 1
    while True:
        w_ = -w[-1] / k * (d - k + 1)
        if abs(w_) < thres:
            break
        w.append(w_)
        k += 1
    w = np.array(w[::-1]).reshape(-1, 1)
    return w

def frac_diff_ffd(series, d, thres=1e-5):
    '''
    Fractional differentiation with constant width window
    Note 1: thres determines the cut-off weight for the window
    Note 2: d can be any positive fractional, not necessarily bounded 
    [0,1]
    
        Parameters:
            series (pd.DataFrame): dataframe with time series
            d (float): differentiating factor
            thres (float): threshold for cutting off weights
        
        Returns:
            df (pd.DataFrame): dataframe with differentiated series
    '''
    w = get_weights_ffd(d, thres)
    width = len(w) - 1

    df = {}
    for name in series.columns:
        series_f = series[[name]].ffill().dropna()
        df_ = pd.Series(index=np.arange(series.shape[0]), dtype=object)
        for iloc1 in range(width, series_f.shape[0]):
            loc0, loc1 = series_f.index[iloc1 - width], series_f.index[iloc1]
            if not np.isfinite(series.loc[loc1, name]):
                continue    # exclude NAs
            df_[loc1] = np.dot(w.T, series_f.loc[loc0:loc1])[0, 0]
        df[name] = df_.dropna().copy(deep=True)
    df = pd.concat(df, axis=1)
    return df

def ffd(process, thres=0.01):
    '''
    Finding the minimum differentiating factor that passes the ADF test
    
        Parameters:
            process (pd.Series): named series with random process values
            thres (float): threshold for cutting off weights

        Returns:
            (pd.Series): the series differentiated with the smallest 
             tested d that passes the ADF test (d=1 if none passes)
    '''    
    for d in np.linspace(0, 1, 11):
        process_diff = frac_diff_ffd(pd.DataFrame(process), d, thres)
        test_results = adfuller(
            process_diff[process.name], maxlag=1, regression='c', autolag=None
        )
        if test_results[1] <= 0.05:
            break
    return process_diff[process.name]

stationary_factors = pd.DataFrame()
for factor_idx in range(num_factors):
    stationary_factors[factor_idx] = ffd(
        pd.Series(factors[:, factor_idx], name=factor_idx)
    )
stationary_factors
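
To see what the weight recursion produces, the following sketch recomputes the weights for a few values of `d`. The function `ffd_weights` is a simplified, hypothetical restatement of `get_weights_ffd` that keeps the weights in natural order, with index 0 applying to the current observation. It is meant for inspection only.

```python
import numpy as np

def ffd_weights(d, thres):
    """Weights w_k = -w_{k-1} * (d - k + 1) / k, cut off at |w_k| < thres."""
    w, k = [1.0], 1
    while True:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < thres:
            break
        w.append(w_k)
        k += 1
    return np.array(w)  # w[0] applies to the current observation

print(ffd_weights(0.0, 0.01))      # d = 0 leaves the series unchanged
print(ffd_weights(1.0, 0.01))      # d = 1 is the ordinary first difference
print(ffd_weights(0.5, 0.01)[:3])  # fractional d keeps a decaying memory

# With d = 1, applying the weights reproduces np.diff.
x = np.array([1.0, 3.0, 6.0, 10.0])
diffed = np.convolve(x, ffd_weights(1.0, 0.01), mode='valid')
print(diffed)
```

A fractional `d` between 0 and 1 gives weights that decay slowly instead of stopping after one lag, which is how the method keeps more of the series' memory than a first difference does.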

### Test Accuracy
Let's now test the out-of-sample accuracy of the model when using the stationary factors.

In [50]:
oos_accuracy(stationary_factors.values, label)

# Part 4: Test Model Accuracy Using Standardized Factors

Another common preprocessing technique is standardization, which rescales each factor to have a mean of zero and a standard deviation of one. Let's try it.

### Standardize Factor Values


In [51]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized_factors = scaler.fit_transform(stationary_factors)
standardized_factors
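
The sketch below checks what `StandardScaler` does to a single column. It uses a hypothetical, deliberately skewed factor (exponential draws) and a small `skewness` helper written only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical skewed factor: exponential draws are far from symmetric.
skewed = rng.exponential(scale=3.0, size=(5000, 1))

scaled = StandardScaler().fit_transform(skewed)

def skewness(a):
    """Sample skewness: third central moment over the cubed std."""
    a = a.ravel()
    return float(((a - a.mean()) ** 3).mean() / a.std() ** 3)

print(f"mean: {scaled.mean():.6f}, std: {scaled.std():.6f}")
print(f"skew before: {skewness(skewed):.3f}, after: {skewness(scaled):.3f}")
```

The mean and standard deviation change, but the skewness does not. Standardization shifts and rescales the values without changing the shape of their distribution.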

### Test Accuracy
Now that you have the standardized factors, use them to test the model's out-of-sample accuracy.

In [52]:
oos_accuracy(standardized_factors, label)

# Part 5: Test Model Accuracy Using Principal Components
Principal component analysis (PCA) is another common preprocessing technique that can reduce the dimensionality of the factors. PCA performs best when the factors are on the same scale ([reference](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)), so in this case, perform PCA on the standardized factors.

### Perform PCA

In [53]:
from sklearn.decomposition import PCA

pca = PCA(random_state=0)
principal_components = pca.fit_transform(standardized_factors[1:, :])
principal_components
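
To get a feel for when PCA can reduce dimensionality, the sketch below runs it on three hypothetical factors, two of which are nearly identical, and prints the share of variance each component explains. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical factors: column 1 is almost a copy of column 0.
base = rng.normal(size=(1000, 1))
X = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(1000, 1)),
    rng.normal(size=(1000, 1)),
])

toy_pca = PCA(random_state=0)
components = toy_pca.fit_transform(StandardScaler().fit_transform(X))
ratios = toy_pca.explained_variance_ratio_
print(ratios)
```

When factors are redundant, the last components explain almost no variance and can be dropped. With independent random factors like the ones in this notebook, the variance should be spread much more evenly, so PCA mainly rotates the data.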

### Test Accuracy

In [54]:
oos_accuracy(principal_components, label.iloc[1:])