## Project: Predicting Stock Market
## Part 1: Data preparation, model estimation
**Data:** S&P500 index valuation, New York Stock Exchange (NYSE), 1950 to 2015  
**DataQuest Lesson:** https://app.dataquest.io/c/11/m/65/guided-project%3A-predicting-the-stock-market/1/the-dataset?path=2&slug=data-scientist&version=1.2  
**Source:** https://github.com/NickyThreeNames/DataquestGuidedProjects/blob/master/Guided%20Project-%20Predicting%20the%20stock%20market/sphist.csv


#### Useful links

Inspiration notebooks:
* https://medium.com/shiyan-boxer/s-p-500-stock-price-prediction-using-machine-learning-and-deep-learning-328b1839d1b6
* https://medium.com/mlearning-ai/predict-sp500-stock-price-with-python-machine-learning-sentiment-analysis-a296dc276353
* https://medium.com/p/59b06de25357
* https://iopscience.iop.org/article/10.1088/1742-6596/1366/1/012130
* https://towardsdatascience.com/stock-market-predictions-with-rnn-using-daily-market-variables-6f928c867fd2
* https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests
* https://www.kaggle.com/code/janiobachmann/s-p-500-time-series-forecasting-with-prophet

Feature Engineering:
* https://towardsdatascience.com/one-hot-encoding-scikit-vs-pandas-2133775567b8
* https://towardsdatascience.com/using-columntransformer-to-combine-data-processing-steps-af383f7d5260


Correlation for non-numeric data (Dython library):  

Variable standardization:  

Variable selection:  

Optimal Binning and WoE transformation for continuous dependent variable:

Diagnostics:

Regression:


## Analysis plan
1. Import data
2. Research about approach
3. Clean Data
4. Feature Engineering
5. Feature Selection
6. Model training & tuning
7. Final model selection - evaluation on test data
8. Diagnostics


Additional: compare a simple train/test split forecast with a rolling X-day forecast, and ARIMA/SARIMA/GARCH models with linear and neural models.
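
The difference between a single train/test split and a rolling evaluation can be sketched as follows. This is a minimal illustration on toy data, assuming scikit-learn's `TimeSeriesSplit` is an acceptable stand-in for the "rolling X-days" idea; the 80/20 cut, the number of folds and the test block size are arbitrary choices, not values taken from this project.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for a chronologically sorted feature matrix (20 "days")
X = np.arange(20).reshape(-1, 1)

# Single hold-out split: the last 20% of observations form the test set
cut = int(len(X) * 0.8)
holdout_train_idx = np.arange(cut)
holdout_test_idx = np.arange(cut, len(X))

# Rolling-origin evaluation: each fold trains on the past only
# and tests on the block of days that immediately follows it
tscv = TimeSeriesSplit(n_splits=4, test_size=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f"train 0..{train_idx[-1]} -> test {test_idx[0]}..{test_idx[-1]}")
```

Every fold keeps the test indices strictly after the training indices, so each evaluation sees the same kind of past-only information as the final forecast would.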


### 1. Setting up environment

#### 1.1 Importing packages & setting-up parameters

In [1]:
# Set up autoreload for faster debugging
# (automatically reloads changes made to the subpackages' code)
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [2]:
# Import parent directory (main project directory)
# for packages importing
import sys
import os

# Getting the parent directory name in which your script is running
parent = os.path.dirname(os.path.realpath(''))

# adding the parent directory to
# the sys.path.
sys.path.append(parent)

# now we can import the module in the parent
# directory.

In [3]:
# Project packages import
import gp24package.data.make_dataset as gp24md
import gp24package.features.build_features as gp24feat
# import gp24package.explore_visualise.eda as gp23eda
# import gp24package.features.build_features as gp23feat
# import gp24package.models.hyperparameters_model as gp23hyperparam
# import gp24package.models.train_model as gp23train


# Pylance highlighting package issue (not to be worried about)
# https://github.com/microsoft/pylance-release/blob/main/TROUBLESHOOTING.md#unresolved-import-warnings

# Standard Python libraries import
from IPython.display import display, HTML #  tidied-up display
from time import time #  project timer
from itertools import chain # for list iterations
from datetime import datetime
import pandas as pd
import numpy as np

# plots
import matplotlib.pyplot as plt
import seaborn as sn

# Necessary packages
import gp24package
import session_info # build and requirements.txt
import pickle # dump models

import ta #technical-analysis

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# Turn on inline plot display in Jupyter Notebook
%matplotlib inline 
# Setting pandas display options
pd.options.display.max_columns = 300
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 250

In [4]:
# MANUAL_INPUT - marks sections of code that are not automated and have to be manually re-coded for new datasets.

# parameters
seed = 12345
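# MANUAL_INPUT: 'SalePrice', 'Order' and 'PID' are not columns of sphist.csv
# (see data.info() below); replace them before these parameters are used
target = 'SalePrice'
ID_vars = ['Order', 'PID']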


# One-hot encoding list
# contains variables and their possible values for one-hot encoding
categories = [('month',[1,2,3,4,5,6,7,8,9,10,11,12]),
              ('weekday',[0,1,2,3,4,5,6])]


#### 1.2 Starting project timer and exporting requirements

In [5]:
# Starting project timer
tic_all = time()

In [6]:
# Collecting packages info and saving to requirements.txt file
session_info.show(cpu = True, std_lib = True, dependencies = True, write_req_file = True,
                  req_file_name = 'requirements.txt')

#### 1.3 Importing and inspecting source data

In [7]:
data = gp24md.MakeDataset('sphist.csv').data
print(type(data))
pd.options.display.float_format = '{:.2f}'.format
display(HTML(data.head().to_html()))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.42,2090.42,2066.78,2077.07,4043820000.0,2077.07
1,2015-12-04,2051.24,2093.84,2051.24,2091.69,4214910000.0,2091.69
2,2015-12-03,2080.71,2085.0,2042.35,2049.62,4306490000.0,2049.62
3,2015-12-02,2101.71,2104.27,2077.11,2079.51,3950640000.0,2079.51
4,2015-12-01,2082.93,2103.37,2082.93,2102.63,3712120000.0,2102.63


In [8]:
data['Date'] = pd.to_datetime(data['Date'])
data.sort_values(ascending = True, inplace  = True, by = 'Date')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 16590 entries, 16589 to 0
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 1.0 MB
None


In [9]:
data2 = gp24feat.DateFeatures(data_in = data, variable = 'Date').output()
data2.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,month,day,weekday,month_sin,month_cos,weekday_sin,weekday_cos
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,1,3,1,0.5,0.87,0.78,0.62
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,1,4,2,0.5,0.87,0.97,-0.22
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,1,5,3,0.5,0.87,0.43,-0.9
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,1,6,4,0.5,0.87,-0.43,-0.9
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,1,9,0,0.5,0.87,0.0,1.0
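
The `month_sin`/`month_cos` and `weekday_sin`/`weekday_cos` columns above are consistent with a standard cyclic encoding with periods 12 and 7. The internals of `gp24feat.DateFeatures` are not shown here, so the following is only a sketch of how such columns could be produced; `add_cyclic_features` is a hypothetical helper, not part of the project package.

```python
import numpy as np
import pandas as pd


def add_cyclic_features(df, column, period):
    """Return a copy of df with <column>_sin and <column>_cos added."""
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Three dates taken from the head of the dataset
demo = pd.DataFrame({"Date": pd.to_datetime(["1950-01-03", "1950-01-04", "1950-01-09"])})
demo["month"] = demo["Date"].dt.month
demo["weekday"] = demo["Date"].dt.dayofweek  # Monday = 0

demo = add_cyclic_features(demo, "month", period=12)
demo = add_cyclic_features(demo, "weekday", period=7)
print(demo.round(2))
```

The encoding places December next to January and Sunday next to Monday on a circle, which a plain integer code cannot express.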


In [10]:
# Using OneHotEncoder from sklearn to create binary variables representing months and weekdays

data3 = gp24feat.FeaturesOneHotEncoding(data_in = data2, features_list = categories).transform(data2)
data3.head()

Unnamed: 0,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12,weekday_0,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
16589,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
16588,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
16587,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
16586,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
16585,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
# Concatenate to original data
data4 = pd.concat([data2, data3], axis = 1)

data4.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,month,day,weekday,month_sin,month_cos,weekday_sin,weekday_cos,month_1,month_2,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12,weekday_0,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,1,3,1,0.5,0.87,0.78,0.62,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,1,4,2,0.5,0.87,0.97,-0.22,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,1,5,3,0.5,0.87,0.43,-0.9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,1,6,4,0.5,0.87,-0.43,-0.9,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,1,9,0,0.5,0.87,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
# Adding variables
# Generate all combinations of days for ratio calculations

# Averages for comparison
# If I am predicting the price for a given day, I should not know that day's price; I can only know the past

# https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests
# https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/
# https://towardsdatascience.com/3-basic-steps-of-stock-market-analysis-in-python-917787012143
# https://tradewithpython.com/generating-buy-sell-signals-using-python
# https://not-satoshi.com/course-binance-api-trading-bot-sma-macd-and-bollinger-bands/

import itertools
import numpy as np


# list of days to iterate by
days = [3, 5, 30, 60, 90, 120, 240, 365]

# Dictionary to store results
results = {}
for i in days:
    mean_value = data['Close'].rolling(i).mean()
    stdev_value = data['Close'].rolling(i).std()
    max_value = data['Close'].rolling(i).max()
    min_value = data['Close'].rolling(i).min()
    return_value = data['Close'].pct_change(periods = i)

    results[f'mean_{i}'] = mean_value
    results[f'stdev_{i}'] = stdev_value
    results[f'max_{i}'] = max_value
    results[f'min_{i}'] = min_value
    results[f'returns_{i}'] = return_value
    

combinations = list(itertools.combinations_with_replacement(days, 2))
print(combinations)

for combo in combinations:
    day1, day2 = combo
    if day2 >= day1:
        # calculate variable value
        mean_stdev_ratio = np.where(results[f'stdev_{day2}'] != 0, results[f'mean_{day1}'] / results[f'stdev_{day2}'], np.nan)
        mean_max_ratio = np.where(results[f'max_{day2}'] != 0, results[f'mean_{day1}'] / results[f'max_{day2}'], np.nan)
        mean_min_ratio = np.where(results[f'min_{day2}'] != 0, results[f'mean_{day1}'] / results[f'min_{day2}'], np.nan)
        stdev_mean_ratio = np.where(results[f'mean_{day2}'] != 0, results[f'stdev_{day1}'] / results[f'mean_{day2}'], np.nan)
        stdev_max_ratio = np.where(results[f'max_{day2}'] != 0, results[f'stdev_{day1}'] / results[f'max_{day2}'], np.nan)
        stdev_min_ratio = np.where(results[f'min_{day2}'] != 0, results[f'stdev_{day1}'] / results[f'min_{day2}'], np.nan)
        max_mean_ratio = np.where(results[f'mean_{day2}'] != 0, results[f'max_{day1}'] / results[f'mean_{day2}'], np.nan)
        max_stdev_ratio = np.where(results[f'stdev_{day2}'] != 0, results[f'max_{day1}'] / results[f'stdev_{day2}'], np.nan)
        max_min_ratio =  np.where(results[f'min_{day2}'] != 0, results[f'max_{day1}'] / results[f'min_{day2}'], np.nan)
        min_mean_ratio = np.where(results[f'mean_{day2}'] != 0, results[f'min_{day1}'] / results[f'mean_{day2}'], np.nan)
        min_stdev_ratio = np.where(results[f'stdev_{day2}'] != 0, results[f'min_{day1}'] / results[f'stdev_{day2}'], np.nan)
        min_max_ratio = np.where(results[f'max_{day2}'] != 0, results[f'min_{day1}'] / results[f'max_{day2}'], np.nan)
        # assign to dictionary
        results[f'mean_{day1}_stdev_{day2}_ratio'] = mean_stdev_ratio
        results[f'mean_{day1}_max_{day2}_ratio'] = mean_max_ratio
        results[f'mean_{day1}_min_{day2}_ratio'] = mean_min_ratio
        results[f'stdev_{day1}_mean_{day2}_ratio'] = stdev_mean_ratio
        results[f'stdev_{day1}_max_{day2}_ratio'] = stdev_max_ratio
        results[f'stdev_{day1}_min_{day2}_ratio'] = stdev_min_ratio
        results[f'max_{day1}_mean_{day2}_ratio'] = max_mean_ratio
        results[f'max_{day1}_stdev_{day2}_ratio'] = max_stdev_ratio
        results[f'max_{day1}_min_{day2}_ratio'] = max_min_ratio
        results[f'min_{day1}_mean_{day2}_ratio'] = min_mean_ratio
        results[f'min_{day1}_stdev_{day2}_ratio'] = min_stdev_ratio
        results[f'min_{day1}_max_{day2}_ratio'] = min_max_ratio
        # Same-statistic ratios are calculated only if day2 > day1; for day1 == day2 the ratio would always be 1
        if day2 > day1:
            # calculate variable value
            mean_ratio = np.where(results[f'mean_{day2}'] != 0, results[f'mean_{day1}'] / results[f'mean_{day2}'], np.nan)
            stdev_ratio = np.where(results[f'stdev_{day2}'] != 0, results[f'stdev_{day1}'] / results[f'stdev_{day2}'], np.nan)
            max_ratio = np.where(results[f'max_{day2}'] != 0, results[f'max_{day1}'] / results[f'max_{day2}'], np.nan)
            min_ratio = np.where(results[f'min_{day2}'] != 0, results[f'min_{day1}'] / results[f'min_{day2}'], np.nan)
            # assign to dictionary
            results[f'mean_{day1}_mean_{day2}_ratio'] = mean_ratio
            results[f'stdev_{day1}_stdev_{day2}_ratio'] = stdev_ratio
            results[f'max_{day1}_max_{day2}_ratio'] = max_ratio
            results[f'min_{day1}_min_{day2}_ratio'] = min_ratio


results = pd.DataFrame(results)

results = results.shift(periods=1)
print(results.shape)
results.head(15)


[(3, 3), (3, 5), (3, 30), (3, 60), (3, 90), (3, 120), (3, 240), (3, 365), (5, 5), (5, 30), (5, 60), (5, 90), (5, 120), (5, 240), (5, 365), (30, 30), (30, 60), (30, 90), (30, 120), (30, 240), (30, 365), (60, 60), (60, 90), (60, 120), (60, 240), (60, 365), (90, 90), (90, 120), (90, 240), (90, 365), (120, 120), (120, 240), (120, 365), (240, 240), (240, 365), (365, 365)]
(16590, 584)


Unnamed: 0,mean_3,stdev_3,max_3,min_3,returns_3,mean_5,stdev_5,max_5,min_5,returns_5,mean_30,stdev_30,max_30,min_30,returns_30,mean_60,stdev_60,max_60,min_60,returns_60,mean_90,stdev_90,max_90,min_90,returns_90,mean_120,stdev_120,max_120,min_120,returns_120,mean_240,stdev_240,max_240,min_240,returns_240,mean_365,stdev_365,max_365,min_365,returns_365,mean_3_stdev_3_ratio,mean_3_max_3_ratio,mean_3_min_3_ratio,stdev_3_mean_3_ratio,stdev_3_max_3_ratio,stdev_3_min_3_ratio,max_3_mean_3_ratio,max_3_stdev_3_ratio,max_3_min_3_ratio,min_3_mean_3_ratio,min_3_stdev_3_ratio,min_3_max_3_ratio,mean_3_stdev_5_ratio,mean_3_max_5_ratio,mean_3_min_5_ratio,stdev_3_mean_5_ratio,stdev_3_max_5_ratio,stdev_3_min_5_ratio,max_3_mean_5_ratio,max_3_stdev_5_ratio,max_3_min_5_ratio,min_3_mean_5_ratio,min_3_stdev_5_ratio,min_3_max_5_ratio,mean_3_mean_5_ratio,stdev_3_stdev_5_ratio,max_3_max_5_ratio,min_3_min_5_ratio,mean_3_stdev_30_ratio,mean_3_max_30_ratio,mean_3_min_30_ratio,stdev_3_mean_30_ratio,stdev_3_max_30_ratio,stdev_3_min_30_ratio,max_3_mean_30_ratio,max_3_stdev_30_ratio,max_3_min_30_ratio,min_3_mean_30_ratio,min_3_stdev_30_ratio,min_3_max_30_ratio,mean_3_mean_30_ratio,stdev_3_stdev_30_ratio,max_3_max_30_ratio,min_3_min_30_ratio,mean_3_stdev_60_ratio,mean_3_max_60_ratio,mean_3_min_60_ratio,stdev_3_mean_60_ratio,stdev_3_max_60_ratio,stdev_3_min_60_ratio,max_3_mean_60_ratio,max_3_stdev_60_ratio,max_3_min_60_ratio,min_3_mean_60_ratio,min_3_stdev_60_ratio,min_3_max_60_ratio,mean_3_mean_60_ratio,stdev_3_stdev_60_ratio,max_3_max_60_ratio,min_3_min_60_ratio,mean_3_stdev_90_ratio,mean_3_max_90_ratio,mean_3_min_90_ratio,stdev_3_mean_90_ratio,stdev_3_max_90_ratio,stdev_3_min_90_ratio,max_3_mean_90_ratio,max_3_stdev_90_ratio,max_3_min_90_ratio,min_3_mean_90_ratio,min_3_stdev_90_ratio,min_3_max_90_ratio,mean_3_mean_90_ratio,stdev_3_stdev_90_ratio,max_3_max_90_ratio,min_3_min_90_ratio,mean_3_stdev_120_ratio,mean_3_max_120_ratio,mean_3_min_120_ratio,stdev_3_mean_120_ratio,stdev_3_max_120_ratio,stdev_3_
min_120_ratio,max_3_mean_120_ratio,max_3_stdev_120_ratio,max_3_min_120_ratio,min_3_mean_120_ratio,min_3_stdev_120_ratio,min_3_max_120_ratio,mean_3_mean_120_ratio,stdev_3_stdev_120_ratio,max_3_max_120_ratio,min_3_min_120_ratio,mean_3_stdev_240_ratio,mean_3_max_240_ratio,mean_3_min_240_ratio,stdev_3_mean_240_ratio,stdev_3_max_240_ratio,stdev_3_min_240_ratio,max_3_mean_240_ratio,max_3_stdev_240_ratio,max_3_min_240_ratio,min_3_mean_240_ratio,min_3_stdev_240_ratio,min_3_max_240_ratio,mean_3_mean_240_ratio,stdev_3_stdev_240_ratio,max_3_max_240_ratio,min_3_min_240_ratio,mean_3_stdev_365_ratio,mean_3_max_365_ratio,...,min_60_stdev_365_ratio,min_60_max_365_ratio,mean_60_mean_365_ratio,stdev_60_stdev_365_ratio,max_60_max_365_ratio,min_60_min_365_ratio,mean_90_stdev_90_ratio,mean_90_max_90_ratio,mean_90_min_90_ratio,stdev_90_mean_90_ratio,stdev_90_max_90_ratio,stdev_90_min_90_ratio,max_90_mean_90_ratio,max_90_stdev_90_ratio,max_90_min_90_ratio,min_90_mean_90_ratio,min_90_stdev_90_ratio,min_90_max_90_ratio,mean_90_stdev_120_ratio,mean_90_max_120_ratio,mean_90_min_120_ratio,stdev_90_mean_120_ratio,stdev_90_max_120_ratio,stdev_90_min_120_ratio,max_90_mean_120_ratio,max_90_stdev_120_ratio,max_90_min_120_ratio,min_90_mean_120_ratio,min_90_stdev_120_ratio,min_90_max_120_ratio,mean_90_mean_120_ratio,stdev_90_stdev_120_ratio,max_90_max_120_ratio,min_90_min_120_ratio,mean_90_stdev_240_ratio,mean_90_max_240_ratio,mean_90_min_240_ratio,stdev_90_mean_240_ratio,stdev_90_max_240_ratio,stdev_90_min_240_ratio,max_90_mean_240_ratio,max_90_stdev_240_ratio,max_90_min_240_ratio,min_90_mean_240_ratio,min_90_stdev_240_ratio,min_90_max_240_ratio,mean_90_mean_240_ratio,stdev_90_stdev_240_ratio,max_90_max_240_ratio,min_90_min_240_ratio,mean_90_stdev_365_ratio,mean_90_max_365_ratio,mean_90_min_365_ratio,stdev_90_mean_365_ratio,stdev_90_max_365_ratio,stdev_90_min_365_ratio,max_90_mean_365_ratio,max_90_stdev_365_ratio,max_90_min_365_ratio,min_90_mean_365_ratio,min_90_stdev_365_ratio,min_90_max_365_ratio,
mean_90_mean_365_ratio,stdev_90_stdev_365_ratio,max_90_max_365_ratio,min_90_min_365_ratio,mean_120_stdev_120_ratio,mean_120_max_120_ratio,mean_120_min_120_ratio,stdev_120_mean_120_ratio,stdev_120_max_120_ratio,stdev_120_min_120_ratio,max_120_mean_120_ratio,max_120_stdev_120_ratio,max_120_min_120_ratio,min_120_mean_120_ratio,min_120_stdev_120_ratio,min_120_max_120_ratio,mean_120_stdev_240_ratio,mean_120_max_240_ratio,mean_120_min_240_ratio,stdev_120_mean_240_ratio,stdev_120_max_240_ratio,stdev_120_min_240_ratio,max_120_mean_240_ratio,max_120_stdev_240_ratio,max_120_min_240_ratio,min_120_mean_240_ratio,min_120_stdev_240_ratio,min_120_max_240_ratio,mean_120_mean_240_ratio,stdev_120_stdev_240_ratio,max_120_max_240_ratio,min_120_min_240_ratio,mean_120_stdev_365_ratio,mean_120_max_365_ratio,mean_120_min_365_ratio,stdev_120_mean_365_ratio,stdev_120_max_365_ratio,stdev_120_min_365_ratio,max_120_mean_365_ratio,max_120_stdev_365_ratio,max_120_min_365_ratio,min_120_mean_365_ratio,min_120_stdev_365_ratio,min_120_max_365_ratio,mean_120_mean_365_ratio,stdev_120_stdev_365_ratio,max_120_max_365_ratio,min_120_min_365_ratio,mean_240_stdev_240_ratio,mean_240_max_240_ratio,mean_240_min_240_ratio,stdev_240_mean_240_ratio,stdev_240_max_240_ratio,stdev_240_min_240_ratio,max_240_mean_240_ratio,max_240_stdev_240_ratio,max_240_min_240_ratio,min_240_mean_240_ratio,min_240_stdev_240_ratio,min_240_max_240_ratio,mean_240_stdev_365_ratio,mean_240_max_365_ratio,mean_240_min_365_ratio,stdev_240_mean_365_ratio,stdev_240_max_365_ratio,stdev_240_min_365_ratio,max_240_mean_365_ratio,max_240_stdev_365_ratio,max_240_min_365_ratio,min_240_mean_365_ratio,min_240_stdev_365_ratio,min_240_max_365_ratio,mean_240_mean_365_ratio,stdev_240_stdev_365_ratio,max_240_max_365_ratio,min_240_min_365_ratio,mean_365_stdev_365_ratio,mean_365_max_365_ratio,mean_365_min_365_ratio,stdev_365_mean_365_ratio,stdev_365_max_365_ratio,stdev_365_min_365_ratio,max_365_mean_365_ratio,max_365_stdev_365_ratio,max_365_min_365_ratio,min_3
65_mean_365_ratio,min_365_stdev_365_ratio,min_365_max_365_ratio
16589,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16588,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16587,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16586,16.81,0.14,16.93,16.66,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,121.23,0.99,1.01,0.01,0.01,0.01,1.01,122.08,1.02,0.99,120.13,0.98,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16585,16.92,0.07,16.98,16.85,0.02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,258.03,1.0,1.0,0.0,0.0,0.0,1.0,258.94,1.01,1.0,256.96,0.99,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16584,17.0,0.08,17.08,16.93,0.01,16.9,0.16,17.08,16.66,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,222.54,1.0,1.0,0.0,0.0,0.0,1.0,223.63,1.01,1.0,221.67,0.99,107.6,1.0,1.02,0.0,0.0,0.0,1.01,108.13,1.03,1.0,107.18,0.99,1.01,0.48,1.0,1.02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16583,17.03,0.05,17.08,16.98,0.01,16.97,0.09,17.08,16.85,0.02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,340.6,1.0,1.0,0.0,0.0,0.0,1.0,341.6,1.01,1.0,339.6,0.99,191.24,1.0,1.01,0.0,0.0,0.0,1.01,191.8,1.01,1.0,190.68,0.99,1.0,0.56,1.0,1.01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16582,17.07,0.03,17.09,17.03,0.01,17.02,0.07,17.09,16.93,0.01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,530.93,1.0,1.0,0.0,0.0,0.0,1.0,531.65,1.0,1.0,529.79,1.0,252.46,1.0,1.01,0.0,0.0,0.0,1.0,252.8,1.01,1.0,251.92,1.0,1.0,0.48,1.0,1.01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16581,16.96,0.18,17.09,16.76,-0.02,16.99,0.13,17.09,16.76,-0.01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,96.48,0.99,1.01,0.01,0.01,0.01,1.01,97.22,1.02,0.99,95.34,0.98,125.82,0.99,1.01,0.01,0.01,0.01,1.01,126.78,1.02,0.99,124.34,0.98,1.0,1.3,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16580,16.84,0.22,17.09,16.67,-0.02,16.93,0.2,17.09,16.67,-0.02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,76.15,0.99,1.01,0.01,0.01,0.01,1.01,77.28,1.03,0.99,75.38,0.98,85.68,0.99,1.01,0.01,0.01,0.01,1.01,86.95,1.03,0.98,84.82,0.98,0.99,1.13,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
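
The effect of the final `shift(periods=1)` can be checked on the first few closing prices from the dataset. This is a small verification sketch; the names `mean_3_leaky` and `mean_3_safe` are introduced here for illustration only.

```python
import pandas as pd

# First six closing prices of the series (1950-01-03 to 1950-01-10)
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.03], name="Close")

# Rolling mean that includes the current day's close (uses the value to be predicted)
mean_3_leaky = close.rolling(3).mean()

# After shifting by one row, row t depends only on closes from t-3 .. t-1
mean_3_safe = close.rolling(3).mean().shift(1)

check = pd.DataFrame({"Close": close, "leaky": mean_3_leaky, "safe": mean_3_safe})
print(check.round(2))
```

The fourth row of `safe` equals the mean of the first three closes, which matches the `mean_3` value of 16.81 shown for index 16586 in the output above.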


In [None]:
# Undefined name: raises NameError, which stops "Run All" at this cell
CZUPURKACZ

In [None]:
# TRAIN TEST SPLIT

data['Date_split'] = np.where(data["Date"] > datetime(year=2015, month=4, day=1),1,0)
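
A date flag like `Date_split` can then be turned into two chronological subsets. The sketch below uses toy values and assumes the same cutoff of 2015-04-01; `cutoff`, `train` and `test` are names chosen for this example.

```python
import pandas as pd

# Toy frame: the dates straddle the cutoff, the Close values are placeholders
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2015-03-31", "2015-04-01", "2015-04-02", "2015-12-07"]),
    "Close": [1.0, 2.0, 3.0, 4.0],
})

cutoff = pd.Timestamp("2015-04-01")

# Strict ">" keeps the cutoff day itself in the training part
is_test = demo["Date"] > cutoff
train = demo[~is_test]
test = demo[is_test]

print(len(train), len(test))
```

Because the comparison is strict, observations dated exactly 2015-04-01 end up in the training set.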

In [19]:



# 2023/07/25 - TO DO: optimization?
# PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.
# Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy(



# Shift all values forward by one row to avoid information leakage:
# without .shift, every variable calculated in this step would treat the current day's close price as known information
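
The PerformanceWarning quoted above recommends joining all columns at once. One possible pattern, sketched on synthetic data with arbitrary window sizes, is to collect the new columns in a dictionary, build a frame from it in one step, and attach it with a single `pd.concat`.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices; the windows are arbitrary example values
close = pd.Series(np.linspace(10.0, 20.0, 50), name="Close")
windows = [3, 5, 10]

# Collect every new column in a dict first ...
features = {}
for w in windows:
    roll = close.rolling(w)
    features[f"mean_{w}"] = roll.mean()
    features[f"stdev_{w}"] = roll.std()

# ... then build the frame once, lag it by one row, and attach it with one concat
feature_frame = pd.DataFrame(features).shift(1)
combined = pd.concat([close, feature_frame], axis=1)
print(combined.shape)
```

This avoids inserting columns into an existing DataFrame one by one, which is the situation the warning describes.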



In [None]:
# https://technical-analysis-library-in-python.readthedocs.io/en/latest/
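
As a sanity check for library indicators, lagged simple and exponential moving averages can also be written in plain pandas. The formulas below are the generic ones (rolling mean, and exponential weighting with `span=window`, `adjust=False`); whether they match the `ta` implementation exactly is an assumption that should be verified against the library output.

```python
import pandas as pd

# First eight closing prices of the series
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.03, 17.09, 16.76], name="Close")
window = 5

# Simple moving average over the last `window` closes, lagged by one day
sma = close.rolling(window).mean().shift(1)

# Exponential moving average with span=window, lagged the same way
ema = close.ewm(span=window, adjust=False, min_periods=window).mean().shift(1)

indicators = pd.DataFrame({"Close": close, f"SMA_{window}": sma, f"EMA_{window}": ema})
print(indicators.round(2))
```

The first non-missing `SMA_5` value is the mean of the first five closes and appears one row later because of the lag.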

In [None]:

# Create technical indicators

# https://www.investopedia.com/terms/s/sma.asp
data['SMA_5'] = ta.trend.sma_indicator(data['Close'], window=5)
data['SMA_5'] = data['SMA_5'].shift(periods=1)

data['SMA_10'] = ta.trend.sma_indicator(data['Close'], window=10)
data['SMA_10'] = data['SMA_10'].shift(periods=1)

data['SMA_30'] = ta.trend.sma_indicator(data['Close'], window=30)
data['SMA_30'] = data['SMA_30'].shift(periods=1)
# https://www.investopedia.com/terms/e/ema.asp
data['EMA_5'] = ta.trend.ema_indicator(data['Close'], window=5)
data['EMA_5'] = data['EMA_5'].shift(periods=1)

data['EMA_10'] = ta.trend.ema_indicator(data['Close'], window=10)
data['EMA_10'] = data['EMA_10'].shift(periods=1)

data['EMA_30'] = ta.trend.ema_indicator(data['Close'], window=30)
data['EMA_30'] = data['EMA_30'].shift(periods=1)

# https://www.investopedia.com/terms/r/rsi.asp
data['RSI_14'] = ta.momentum.rsi(data['Close'], window=14)
data['RSI_14'] = data['RSI_14'].shift(periods=1)


# https://www.investopedia.com/terms/p/pricerateofchange.asp
data['ROC_10'] = ta.momentum.roc(data['Close'], window=10)
data['ROC_10'] = data['ROC_10'].shift(periods=1)

# these functions return objects rather than numbers - to be checked
data['AOI_34'] = ta.momentum.awesome_oscillator(high = data['High'], low = data['Low'])
data['AOI_34'] = data['AOI_34'].shift(periods=1)

data['KAMA_30'] = ta.momentum.kama(data['Close'])
data['KAMA_30'] = data['KAMA_30'].shift(periods=1)

data['PPO_30'] = ta.momentum.ppo(data['Close'])
data['PPO_30'] = data['PPO_30'].shift(periods=1)

data['SRSI_14'] = ta.momentum.stochrsi(data['Close'])
data['SRSI_14'] = data['SRSI_14'].shift(periods=1)

data['SO_14'] = ta.momentum.stoch(high = data['High'], low = data['Low'], close = data['Close'])
data['SO_14'] = data['SO_14'].shift(periods=1)

data['TSII_14'] = ta.momentum.tsi(data['Close'])
data['TSII_14'] = data['TSII_14'].shift(periods=1)

# Create seasonal indicators
data['Month'] = data['Date'].dt.month
data['DayOfWeek'] = data['Date'].dt.dayofweek


# Convert 'Month' and 'DayOfWeek' to cyclic coordinates
data['Month_sin'] = np.sin(2*np.pi*data['Month']/12)
data['Month_cos'] = np.cos(2*np.pi*data['Month']/12)
data['DayOfWeek_sin'] = np.sin(2*np.pi*data['DayOfWeek']/7)
data['DayOfWeek_cos'] = np.cos(2*np.pi*data['DayOfWeek']/7)


print(data.head(20))

            Date  Open  High   Low  Close     Volume  Adj Close  Date_split  \
16589 1950-01-03 16.66 16.66 16.66  16.66 1260000.00      16.66           0   
16588 1950-01-04 16.85 16.85 16.85  16.85 1890000.00      16.85           0   
16587 1950-01-05 16.93 16.93 16.93  16.93 2550000.00      16.93           0   
16586 1950-01-06 16.98 16.98 16.98  16.98 2010000.00      16.98           0   
16585 1950-01-09 17.08 17.08 17.08  17.08 2520000.00      17.08           0   
16584 1950-01-10 17.03 17.03 17.03  17.03 2160000.00      17.03           0   
16583 1950-01-11 17.09 17.09 17.09  17.09 2630000.00      17.09           0   
16582 1950-01-12 16.76 16.76 16.76  16.76 2970000.00      16.76           0   
16581 1950-01-13 16.67 16.67 16.67  16.67 3330000.00      16.67           0   
16580 1950-01-16 16.72 16.72 16.72  16.72 1460000.00      16.72           0   
16579 1950-01-17 16.86 16.86 16.86  16.86 1790000.00      16.86           0   
16578 1950-01-18 16.85 16.85 16.85  16.85 1570000.00