## Project: Predicting the Stock Market
## Part 1: Data preparation, model estimation
**Data:** S&P500 index valuation, New York Stock Exchange (NYSE), 1950 to 2015  
**DataQuest Lesson:** https://app.dataquest.io/c/11/m/65/guided-project%3A-predicting-the-stock-market/1/the-dataset?path=2&slug=data-scientist&version=1.2  
**Source:** https://github.com/NickyThreeNames/DataquestGuidedProjects/blob/master/Guided%20Project-%20Predicting%20the%20stock%20market/sphist.csv


#### Useful links

Inspiration notebooks:
* https://medium.com/shiyan-boxer/s-p-500-stock-price-prediction-using-machine-learning-and-deep-learning-328b1839d1b6
* https://medium.com/mlearning-ai/predict-sp500-stock-price-with-python-machine-learning-sentiment-analysis-a296dc276353
* https://medium.com/p/59b06de25357
* https://iopscience.iop.org/article/10.1088/1742-6596/1366/1/012130
* https://towardsdatascience.com/stock-market-predictions-with-rnn-using-daily-market-variables-6f928c867fd2
* https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests
* https://www.kaggle.com/code/janiobachmann/s-p-500-time-series-forecasting-with-prophet

Feature Engineering:

Correlation for non-numeric data (Dython library):  

Variable standardization:  

Variable selection:  

Optimal Binning and WoE transformation for continuous dependent variable:

Diagnostics:

Regression:


### 1. Setting up environment

#### 1.1 Importing packages & setting-up parameters

In [23]:
# Set up autoreload for faster debugging
# (automatically reloads changes made to subpackage code)
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [24]:
# Add the parent directory (main project directory) to sys.path
# so that the project package can be imported
import sys
import os

# Parent of the current working directory (where the notebook is run from)
parent = os.path.dirname(os.path.realpath(''))

# Append the parent directory to sys.path
sys.path.append(parent)

# Modules in the parent directory can now be imported

In [25]:
# Project packages import
import gp24package.data.make_dataset as gp24md
# import gp24package.explore_visualise.eda as gp23eda
# import gp24package.features.build_features as gp23feat
# import gp24package.models.hyperparameters_model as gp23hyperparam
# import gp24package.models.train_model as gp23train


# Pylance may flag the package import as unresolved (nothing to worry about)
# https://github.com/microsoft/pylance-release/blob/main/TROUBLESHOOTING.md#unresolved-import-warnings

# Standard Python libraries import
from IPython.display import display, HTML #  tidied-up display
from time import time #  project timer
from itertools import chain # for list iterations
from datetime import datetime
import pandas as pd
import numpy as np

# plots
import matplotlib.pyplot as plt
import seaborn as sn

# Necessary packages
import gp24package
import session_info # build and requirements.txt
import pickle # dump models

# Turn on inline plot display in Jupyter Notebook
%matplotlib inline 
# Setting pandas display options
pd.options.display.max_columns = 300
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 250

In [26]:
# MANUAL_INPUT - marks sections of code that are not automated and must be re-coded manually for each new dataset.

# parameters
seed = 12345
# Assumed values for sphist.csv: the previous 'SalePrice', 'Order' and 'PID' are not columns of this dataset
target = 'Close'
ID_vars = ['Date']

#### 1.2 Starting project timer and exporting requirements

In [27]:
# Starting project timer
tic_all = time()

In [28]:
# Collecting packages info and saving to requirements.txt file
session_info.show(cpu = True, std_lib = True, dependencies = True, write_req_file = True,
                  req_file_name = 'requirements.txt')

#### 1.3 Importing and inspecting source data

In [29]:
data = gp24md.MakeDataset('sphist.csv').data
print(type(data))
pd.options.display.float_format = '{:.2f}'.format
display(HTML(data.head().to_html()))

<class 'pandas.core.frame.DataFrame'>


| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2015-12-07 | 2090.42 | 2090.42 | 2066.78 | 2077.07 | 4043820000.0 | 2077.07 |
| 1 | 2015-12-04 | 2051.24 | 2093.84 | 2051.24 | 2091.69 | 4214910000.0 | 2091.69 |
| 2 | 2015-12-03 | 2080.71 | 2085.0 | 2042.35 | 2049.62 | 4306490000.0 | 2049.62 |
| 3 | 2015-12-02 | 2101.71 | 2104.27 | 2077.11 | 2079.51 | 3950640000.0 | 2079.51 |
| 4 | 2015-12-01 | 2082.93 | 2103.37 | 2082.93 | 2102.63 | 3712120000.0 | 2102.63 |


In [30]:
data['Date'] = pd.to_datetime(data['Date'])
data.sort_values(by='Date', ascending=True, inplace=True)
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 16590 entries, 16589 to 0
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 1.0 MB
None


In [35]:
# Flag rows dated after 2015-04-01 (1) versus on or before that date (0)
data['Date_split'] = np.where(data['Date'] > datetime(year=2015, month=4, day=1), 1, 0)

# Rolling statistics for comparison.
# When predicting the price for a given day, that day's price must not be known;
# only past values may be used. The windows below still include the current row.

# https://www.kaggle.com/code/samaxtech/predicting-s-p500-index-linearreg-randomforests

# List of window lengths (in rows, i.e. trading days) to iterate over
days = [3, 5, 30, 60, 90, 120, 240, 365]
# List of values
values = []

# Dictionary to store results, keyed by window length
results = {}
for n_days in days:
    window = data['Close'].rolling(n_days)
    results[n_days] = {
        'mean': window.mean(),
        'std': window.std(),
        'max': window.max(),
        'min': window.min(),
    }

# Note: this column is labelled 5d but uses a 3-row window
data['mean_close_5d'] = data['Close'].rolling(3).mean()

# Columns labelled 3m/6m/9m/12m use windows of 30/60/90/120 rows
data['mean_close_3m'] = data['Close'].rolling(30).mean()
data['mean_close_6m'] = data['Close'].rolling(60).mean()
data['mean_close_9m'] = data['Close'].rolling(90).mean()
data['mean_close_12m'] = data['Close'].rolling(120).mean()
data['std_close_3m'] = data['Close'].rolling(30).std()
data['std_close_6m'] = data['Close'].rolling(60).std()
data['std_close_9m'] = data['Close'].rolling(90).std()
data['std_close_12m'] = data['Close'].rolling(120).std()
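
The comment in the cell above points out that a feature for a given day should rely only on past prices, whereas `rolling(n)` includes the current row in each window. One possible way to handle this, sketched here under the assumption that the frame is sorted by ascending date, is to shift the series by one row before applying the window. The helper name `add_lagged_rolling_features` and the `_lag` column naming are illustrative and not part of the project package.

```python
import pandas as pd

def add_lagged_rolling_features(df, column="Close", windows=(3, 5, 30)):
    """Return a copy of df with rolling mean/std features built from past rows only.

    Hypothetical helper: assumes df is already sorted by ascending date.
    """
    out = df.copy()
    # shift(1) removes the current row from every window, so the feature
    # for row t depends only on rows t-1, t-2, ...
    past = out[column].shift(1)
    for n in windows:
        out[f"mean_{column.lower()}_{n}d_lag"] = past.rolling(n).mean()
        out[f"std_{column.lower()}_{n}d_lag"] = past.rolling(n).std()
    return out

# Toy example using the first five closing prices of the dataset
toy = pd.DataFrame({"Close": [16.66, 16.85, 16.93, 16.98, 17.08]})
toy_features = add_lagged_rolling_features(toy, windows=(3,))
print(toy_features)
```

With a 3-row window the first three rows have no complete history, so their features are `NaN`; the fourth row is the first one with a value, computed from the three preceding closes.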


In [36]:
data.head()

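| | Date | Open | High | Low | Close | Volume | Adj Close | Date_split | mean_close_3m | mean_close_6m | mean_close_9m | mean_close_12m | std_close_3m | std_close_6m | std_close_9m | std_close_12m | mean_close_5d | mean_close_5d_v2 | mean_close_3m_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16.81 | 16.76 | NaN |
| 16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16.92 | 16.89 | NaN |
| 16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 17.0 | 16.95 | NaN |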
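
The `Date_split` flag marks rows dated after 2015-04-01, and the leading rows of each rolling column are `NaN` until the window fills. A minimal sketch of how such a flag could separate training and test rows once the incomplete rows are dropped is given below. It runs on synthetic data; the column `feat` and the names `train` and `test` are assumptions made for illustration, not the notebook's actual modelling code.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Synthetic stand-in for the S&P 500 frame (values are made up for illustration)
dates = pd.date_range("2015-03-25", periods=10, freq="D")
df = pd.DataFrame({"Date": dates, "Close": np.arange(10, dtype=float)})

# Same rule as in the notebook: 1 for rows strictly after the cutoff date
cutoff = datetime(year=2015, month=4, day=1)
df["Date_split"] = np.where(df["Date"] > cutoff, 1, 0)

# Past-only rolling mean; the first three rows have no complete window
df["feat"] = df["Close"].shift(1).rolling(3).mean()
df = df.dropna(subset=["feat"])

# Chronological split: earlier rows for training, later rows for testing
train = df[df["Date_split"] == 0]
test = df[df["Date_split"] == 1]
print(len(train), len(test))
```

Splitting on a date rather than at random keeps every training row earlier than every test row, which matches the forecasting setting described in the notebook.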
