## Project: Predicting the Stock Market
## Part 1: Data preparation, model estimation
**Data:** S&P500 index valuation, New York Stock Exchange (NYSE), 1950 to 2015  
**DataQuest Lesson:** https://app.dataquest.io/c/11/m/65/guided-project%3A-predicting-the-stock-market/1/the-dataset?path=2&slug=data-scientist&version=1.2  
**Source:** https://github.com/NickyThreeNames/DataquestGuidedProjects/blob/master/Guided%20Project-%20Predicting%20the%20stock%20market/sphist.csv


#### Useful links

Inspiration notebooks:
* https://medium.com/shiyan-boxer/s-p-500-stock-price-prediction-using-machine-learning-and-deep-learning-328b1839d1b6
* https://medium.com/mlearning-ai/predict-sp500-stock-price-with-python-machine-learning-sentiment-analysis-a296dc276353
* https://medium.com/p/59b06de25357
* https://iopscience.iop.org/article/10.1088/1742-6596/1366/1/012130
* https://towardsdatascience.com/stock-market-predictions-with-rnn-using-daily-market-variables-6f928c867fd2
* https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests
* https://www.kaggle.com/code/janiobachmann/s-p-500-time-series-forecasting-with-prophet

Feature Engineering:

Correlation for non-numeric data (Dython library):  

Variable standardization:  

Variable selection:  

Optimal Binning and WoE transformation for continuous dependent variable:

Diagnostics:

Regression:


## Analysis plan
1. Import data
2. Research the approach
3. Clean Data
4. Feature Engineering
5. Feature Selection
6. Model training & tuning
7. Final model selection - evaluation on test data
8. Diagnostics


Additional: compare a simple train/test split forecast with a rolling X-day forecast? Compare ARIMA/SARIMA/GARCH models with linear and neural models.
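
The rolling evaluation mentioned in the note above can be sketched as an expanding-window (rolling-origin) split: train on everything up to a cut-off, test on the next few rows, then move the cut-off forward. This is only a minimal illustration; the function name `rolling_origin_splits` and the parameters `initial_train` and `horizon` are hypothetical and not part of the project package.

```python
from typing import Iterator, Tuple


def rolling_origin_splits(
    n_rows: int, initial_train: int, horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs for an expanding-window evaluation.

    Rows are assumed to be sorted by date in ascending order.
    """
    start = initial_train
    while start < n_rows:
        stop = min(start + horizon, n_rows)
        # Train on all rows before the cut-off, test on the next `horizon` rows
        yield range(0, start), range(start, stop)
        start = stop


splits = list(rolling_origin_splits(n_rows=10, initial_train=4, horizon=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

A single train/test split corresponds to the special case where `horizon` covers all remaining rows, so both approaches can share the same evaluation code.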


### 1. Setting up environment

#### 1.1 Importing packages & setting-up parameters

In [1]:
# Set up autoreload for faster debugging
# (automatically reloads changes made to subpackage code)
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [2]:
# Add the parent directory (main project directory) to sys.path
# so that the project packages can be imported
import sys
import os

# Parent of the current working directory (where this notebook is running)
parent = os.path.dirname(os.path.realpath(''))

# Append the parent directory to sys.path
sys.path.append(parent)

# Modules located in the parent directory can now be imported

In [3]:
# Project packages import
import gp24package.data.make_dataset as gp24md
# import gp24package.explore_visualise.eda as gp23eda
# import gp24package.features.build_features as gp23feat
# import gp24package.models.hyperparameters_model as gp23hyperparam
# import gp24package.models.train_model as gp23train


# Pylance may highlight these imports as unresolved (nothing to worry about)
# https://github.com/microsoft/pylance-release/blob/main/TROUBLESHOOTING.md#unresolved-import-warnings

# Standard Python libraries import
from IPython.display import display, HTML #  tidied-up display
from time import time #  project timer
from itertools import chain # for list iterations
from datetime import datetime
import pandas as pd
import numpy as np

# plots
import matplotlib.pyplot as plt
import seaborn as sn

# Necessary packages
import gp24package
import session_info # build and requirements.txt
import pickle # dump models

import ta #technical-analysis


# Turn on inline plot display in Jupyter Notebook
%matplotlib inline 
# Setting pandas display options
pd.options.display.max_columns = 300
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 250

In [4]:
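# MANUAL_INPUT marks sections of code that are not automated and must be manually adapted to each new dataset.

# parameters
seed = 12345
# MANUAL_INPUT: 'SalePrice', 'Order' and 'PID' are not columns of sphist.csv;
# these values still need to be updated for this dataset
target = 'SalePrice'
ID_vars = ['Order', 'PID']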

#### 1.2 Starting project timer and exporting requirements

In [5]:
# Starting project timer
tic_all = time()

In [6]:
# Collecting packages info and saving to requirements.txt file
session_info.show(cpu = True, std_lib = True, dependencies = True, write_req_file = True,
                  req_file_name = 'requirements.txt')

#### 1.3 Importing and inspecting source data

In [7]:
data = gp24md.MakeDataset('sphist.csv').data
print(type(data))
pd.options.display.float_format = '{:.2f}'.format
display(HTML(data.head().to_html()))

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.42,2090.42,2066.78,2077.07,4043820000.0,2077.07
1,2015-12-04,2051.24,2093.84,2051.24,2091.69,4214910000.0,2091.69
2,2015-12-03,2080.71,2085.0,2042.35,2049.62,4306490000.0,2049.62
3,2015-12-02,2101.71,2104.27,2077.11,2079.51,3950640000.0,2079.51
4,2015-12-01,2082.93,2103.37,2082.93,2102.63,3712120000.0,2102.63


In [8]:
data['Date'] = pd.to_datetime(data['Date'])
data.sort_values(ascending = True, inplace  = True, by = 'Date')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 16590 entries, 16589 to 0
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 1.0 MB
None


In [9]:
data['Date_split'] = np.where(data["Date"] > datetime(year=2015, month=4, day=1),1,0)


# Averages for comparison
# When predicting the price for a given day, I must not know that day's price; I can only know the past

# https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests
# https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/
# https://towardsdatascience.com/3-basic-steps-of-stock-market-analysis-in-python-917787012143
# https://tradewithpython.com/generating-buy-sell-signals-using-python
# https://not-satoshi.com/course-binance-api-trading-bot-sma-macd-and-bollinger-bands/




# List of rolling-window lengths (in rows, i.e. trading days) to iterate over
windows = [3, 5, 30, 60, 90, 120, 240, 365]
# List of values
values = []

# Dictionary to store results
results = {}
# A separate loop variable avoids overwriting the list that is being iterated
for window in windows:
    rolling_close = data['Close'].rolling(window)

    results[f'mean_{window}'] = rolling_close.mean()
    results[f'stdev_{window}'] = rolling_close.std()
    results[f'max_{window}'] = rolling_close.max()
    results[f'min_{window}'] = rolling_close.min()
    


results_df = pd.DataFrame(results)

# Shift all values forward by one row to avoid information leakage:
# without .shift, every feature calculated in this step would use the current day's close price as known information
results_df = results_df.shift(periods=1)
print(results_df.shape)
print(results_df.head())

(16590, 32)
       mean_3  stdev_3  max_3  min_3  mean_5  stdev_5  max_5  min_5  mean_30  \
16589     NaN      NaN    NaN    NaN     NaN      NaN    NaN    NaN      NaN   
16588     NaN      NaN    NaN    NaN     NaN      NaN    NaN    NaN      NaN   
16587     NaN      NaN    NaN    NaN     NaN      NaN    NaN    NaN      NaN   
16586   16.81     0.14  16.93  16.66     NaN      NaN    NaN    NaN      NaN   
16585   16.92     0.07  16.98  16.85     NaN      NaN    NaN    NaN      NaN   

       stdev_30  max_30  min_30  mean_60  stdev_60  max_60  min_60  mean_90  \
16589       NaN     NaN     NaN      NaN       NaN     NaN     NaN      NaN   
16588       NaN     NaN     NaN      NaN       NaN     NaN     NaN      NaN   
16587       NaN     NaN     NaN      NaN       NaN     NaN     NaN      NaN   
16586       NaN     NaN     NaN      NaN       NaN     NaN     NaN      NaN   
16585       NaN     NaN     NaN      NaN       NaN     NaN     NaN      NaN   

       stdev_90  max_90  min_90 
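
The effect of the `.shift(periods=1)` step can be checked on a toy series. The sketch below uses made-up prices (the names `close`, `leaky` and `safe` are illustrative only) and shows that the shifted rolling mean for a given row is built only from earlier rows, which is why the first non-missing `mean_3` above appears in the fourth row rather than the third.

```python
import pandas as pd

# Made-up closing prices, oldest first
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Rolling mean that includes the current row (leaks the current close)
leaky = close.rolling(3).mean()

# Same feature moved one row forward: row i only sees rows i-3 .. i-1
safe = leaky.shift(periods=1)

comparison = pd.DataFrame({"close": close, "leaky": leaky, "safe": safe})
print(comparison)
```

For row 3 the leaky feature averages 11, 12 and 13 (including that row's own close), while the shifted feature averages 10, 11 and 12.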

In [10]:
# https://technical-analysis-library-in-python.readthedocs.io/en/latest/
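
For reference, the two simplest indicators used in this notebook can be written directly in pandas. This is a sketch under the assumption that the simple moving average is a plain rolling mean and the exponential moving average is a recursive (`adjust=False`) exponentially weighted mean with `span` equal to the window; it is not the library's source code. The input is the first seven `Close` values of the dataset, and the results agree with the first non-missing `SMA_5` and `EMA_5` values printed in the final table of this section (16.90 and 16.93).

```python
import pandas as pd

# First seven Close values of the dataset (1950-01-03 to 1950-01-11)
close = pd.Series([16.66, 16.85, 16.93, 16.98, 17.08, 17.03, 17.09])

window = 5

# Simple moving average: mean of the last `window` closes, shifted by one row
sma_5 = close.rolling(window).mean().shift(periods=1)

# Exponential moving average: recursive weighting with alpha = 2 / (window + 1),
# also shifted by one row so the current close is never used
ema_5 = close.ewm(span=window, adjust=False, min_periods=window).mean().shift(periods=1)

print(pd.DataFrame({"close": close, "sma_5": sma_5, "ema_5": ema_5}).round(2))
```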

In [14]:

# Create technical indicators

# https://www.investopedia.com/terms/s/sma.asp
data['SMA_5'] = ta.trend.sma_indicator(data['Close'], window=5)
data['SMA_5'] = data['SMA_5'].shift(periods=1)

data['SMA_10'] = ta.trend.sma_indicator(data['Close'], window=10)
data['SMA_10'] = data['SMA_10'].shift(periods=1)

data['SMA_30'] = ta.trend.sma_indicator(data['Close'], window=30)
data['SMA_30'] = data['SMA_30'].shift(periods=1)
# https://www.investopedia.com/terms/e/ema.asp
data['EMA_5'] = ta.trend.ema_indicator(data['Close'], window=5)
data['EMA_5'] = data['EMA_5'].shift(periods=1)

data['EMA_10'] = ta.trend.ema_indicator(data['Close'], window=10)
data['EMA_10'] = data['EMA_10'].shift(periods=1)

data['EMA_30'] = ta.trend.ema_indicator(data['Close'], window=30)
data['EMA_30'] = data['EMA_30'].shift(periods=1)

# https://www.investopedia.com/terms/r/rsi.asp
data['RSI_14'] = ta.momentum.rsi(data['Close'], window=14)
data['RSI_14'] = data['RSI_14'].shift(periods=1)


# https://www.investopedia.com/terms/p/pricerateofchange.asp
data['ROC_10'] = ta.momentum.roc(data['Close'], window=10)
data['ROC_10'] = data['ROC_10'].shift(periods=1)

# these functions return objects, not numbers - to be checked
data['AOI_34'] = ta.momentum.AwesomeOscillatorIndicator(high = data['High'], low = data['Low']).awesome_oscillator()
data['AOI_34'] = data['AOI_34'].shift(periods=1)

data['KAMA_30'] = ta.momentum.KAMAIndicator(data['Close']).kama()
data['KAMA_30'] = data['KAMA_30'].shift(periods=1)

data['PPO_30'] = ta.momentum.PercentagePriceOscillator(data['Close']).ppo()
data['PPO_30'] = data['PPO_30'].shift(periods=1)

data['SRSI_14'] = ta.momentum.StochRSIIndicator(data['Close']).stochrsi()
data['SRSI_14'] = data['SRSI_14'].shift(periods=1)

data['SO_14'] = ta.momentum.stoch(high = data['High'], low = data['Low'], close = data['Close'])
data['SO_14'] = data['SO_14'].shift(periods=1)

data['TSII_14'] = ta.momentum.tsi(data['Close'])
data['TSII_14'] = data['TSII_14'].shift(periods=1)

# Create seasonal indicators
data['Month'] = data['Date'].dt.month
data['DayOfWeek'] = data['Date'].dt.dayofweek


# Convert 'Month' and 'DayOfWeek' to cyclic coordinates
data['Month_sin'] = np.sin(2*np.pi*data['Month']/12)
data['Month_cos'] = np.cos(2*np.pi*data['Month']/12)
data['DayOfWeek_sin'] = np.sin(2*np.pi*data['DayOfWeek']/7)
data['DayOfWeek_cos'] = np.cos(2*np.pi*data['DayOfWeek']/7)


print(data.head(20))

            Date  Open  High   Low  Close     Volume  Adj Close  Date_split  \
16589 1950-01-03 16.66 16.66 16.66  16.66 1260000.00      16.66           0   
16588 1950-01-04 16.85 16.85 16.85  16.85 1890000.00      16.85           0   
16587 1950-01-05 16.93 16.93 16.93  16.93 2550000.00      16.93           0   
16586 1950-01-06 16.98 16.98 16.98  16.98 2010000.00      16.98           0   
16585 1950-01-09 17.08 17.08 17.08  17.08 2520000.00      17.08           0   
16584 1950-01-10 17.03 17.03 17.03  17.03 2160000.00      17.03           0   
16583 1950-01-11 17.09 17.09 17.09  17.09 2630000.00      17.09           0   
16582 1950-01-12 16.76 16.76 16.76  16.76 2970000.00      16.76           0   
16581 1950-01-13 16.67 16.67 16.67  16.67 3330000.00      16.67           0   
16580 1950-01-16 16.72 16.72 16.72  16.72 1460000.00      16.72           0   
16579 1950-01-17 16.86 16.86 16.86  16.86 1790000.00      16.86           0   
16578 1950-01-18 16.85 16.85 16.85  16.85 1570000.00
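
The reason for converting `Month` and `DayOfWeek` to sine/cosine pairs is that the raw integers break at the end of the cycle: December (12) and January (1) are numerically far apart although they are neighbours in time. The sketch below, with the hypothetical helpers `cyclic_encode` and `cyclic_distance`, measures distances between months in the encoded space.

```python
import numpy as np


def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def cyclic_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    sin_a, cos_a = cyclic_encode(a, period)
    sin_b, cos_b = cyclic_encode(b, period)
    return float(np.hypot(sin_a - sin_b, cos_a - cos_b))


d_dec_jan = cyclic_distance(12, 1, period=12)
d_jan_feb = cyclic_distance(1, 2, period=12)
d_jan_jul = cyclic_distance(1, 7, period=12)

print(f"Dec-Jan: {d_dec_jan:.3f}, Jan-Feb: {d_jan_feb:.3f}, Jan-Jul: {d_jan_jul:.3f}")
```

In the encoded space December is as close to January as January is to February, and months half a year apart are the farthest from each other.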

In [12]:
data.head(15)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,Date_split,SMA_5,SMA_10,SMA_30,EMA_5,EMA_10,EMA_30,RSI_14,ROC_10,AOI_34,KAMA_30,PPO_30,SRSI_14,SO_14,TSII_14,Month,DayOfWeek,Month_sin,Month_cos,DayOfWeek_sin,DayOfWeek_cos
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,0,,,,,,,,,,,,,,,1,1,0.5,0.87,0.78,0.62
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,0,,,,,,,,,,,,,,,1,2,0.5,0.87,0.97,-0.22
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,0,,,,,,,,,,,,,,,1,3,0.5,0.87,0.43,-0.9
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,0,,,,,,,,,,,,,,,1,4,0.5,0.87,-0.43,-0.9
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,0,,,,,,,,,,,,,,,1,0,0.5,0.87,0.0,1.0
16584,1950-01-10,17.03,17.03,17.03,17.03,2160000.0,17.03,0,16.9,,,16.93,,,,,,,,,,,1,1,0.5,0.87,0.78,0.62
16583,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,0,16.97,,,16.96,,,,,,,,,,,1,2,0.5,0.87,0.97,-0.22
16582,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,0,17.02,,,17.01,,,,,,,,,,,1,3,0.5,0.87,0.43,-0.9
16581,1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,0,16.99,,,16.92,,,,,,,,,,,1,4,0.5,0.87,-0.43,-0.9
16580,1950-01-16,16.72,16.72,16.72,16.72,1460000.0,16.72,0,16.93,,,16.84,,,,,,16.72,,,,,1,0,0.5,0.87,0.0,1.0
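
The first rows of the table contain missing values because each shifted indicator needs a warm-up period before it can be computed. One possible way to prepare a modelling frame, sketched here on synthetic data rather than on the project dataset, is to drop the warm-up rows and then split on the `Date_split` flag. The names `frame`, `model_frame`, `train` and `test` and the `SMA_3` column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic business-day data around the split date used in this notebook
dates = pd.date_range("2015-03-25", periods=10, freq="B")
frame = pd.DataFrame({"Date": dates, "Close": np.linspace(100.0, 109.0, 10)})

# A shifted rolling feature: its first three rows are missing (warm-up)
frame["SMA_3"] = frame["Close"].rolling(3).mean().shift(periods=1)

# Same rule as Date_split above: 1 for dates after 2015-04-01, else 0
frame["Date_split"] = np.where(frame["Date"] > pd.Timestamp("2015-04-01"), 1, 0)

# Drop rows where the feature could not be computed, then split by the flag
model_frame = frame.dropna(subset=["SMA_3"])
train = model_frame[model_frame["Date_split"] == 0]
test = model_frame[model_frame["Date_split"] == 1]

print(len(train), "training rows,", len(test), "test rows")
```

Because the split is by date, every training row precedes every test row, which keeps the evaluation consistent with only knowing the past.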
