In [1]:
# For data processing
import numpy as np
import pandas as pd

# For saving the list of datasets to a pickle file
import pickle

### Preparing the original dataset for further processing

*Segregate the stocks within different numpy arrays according to the ticker name.*
![Stack-stock-data-on-top-of-each-other.png](https://i.postimg.cc/HnDyJp9s/Stack-stock-data-on-top-of-each-other.png)

WARNING❗ 
- Different stocks have different numbers of rows, because not every stock was available for the whole time span.
- As a result, we cannot get values for every trading day for every stock.

To avoid shape mismatches, we remove the stocks that have empty values for some of those dates.

E.g. Stock A might start from 1990-01-02 and Stock B from 1991-01-02, so the two do not have the same shape. We then remove Stock B, making sure all remaining stocks have the same number of rows.
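
The filtering idea can be sketched on a toy table. This is only an illustration, not the notebook's own code: the `toy` frame, its values, and the rule "full history = as many rows as the longest stock" are assumptions made for the example.

```python
import pandas as pd

# Toy long-format table: stock "A" has 3 rows, stock "B" only 2 (it starts a day later).
toy = pd.DataFrame({
    "Date": ["1990-01-02", "1990-01-03", "1990-01-04", "1990-01-03", "1990-01-04"],
    "Symbol": ["A", "A", "A", "B", "B"],
    "Open": [1.0, 1.1, 1.2, 5.0, 5.1],
}).set_index("Date")

# Number of rows available for each ticker.
rows_per_symbol = toy.groupby("Symbol").size()

# Assumption: a stock with full history has as many rows as the longest one.
full_length = rows_per_symbol.max()
kept_symbols = rows_per_symbol[rows_per_symbol == full_length].index.tolist()

print(kept_symbols)  # ['A']
```

Counting rows per ticker first also makes it easy to inspect how many stocks would be dropped before actually dropping them.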

In [2]:
# Retrieve the data from your drive
df = pd.read_csv('../datasets/stock-prices-S&P-constituents/stocks-data.csv')
df = df.set_index('Date')

In [3]:
# ticker of each stock
symbols = df['Symbol'].unique()

n_stocks = len(symbols) # number of stocks
n_rows = 7307 # number of rows of a stock that is present over the whole time span

stocks = [] # Store the stocks data inside stocks list
filtered_symbols = [] # Symbols of the stocks that have exactly n_rows rows

# segregate the stocks within different numpy arrays according to the ticker name
for i in range(n_stocks):
    # Rows of this ticker, with Date restored as column 0 and the Symbol column (index 1) dropped
    stock_i = np.delete(df[df['Symbol'] == symbols[i]].reset_index().to_numpy(), 1, axis=1)
    # Keep only the stocks that were available over the whole span, 1990 to 2018
    if stock_i.shape[0] == n_rows:
        stocks.append(stock_i)
        filtered_symbols.append(symbols[i])

In [4]:
# Create a check point
final_stocks = stocks

*The total number of stocks is now only 251, but all 251 stocks have the same number of rows.*

In [5]:
len(final_stocks), len(filtered_symbols), type(final_stocks)

(251, 251, list)

In [6]:
# Convert the list into array for smooth manipulation of data later
final_stocks = np.array(final_stocks)
type(final_stocks), final_stocks.shape

(numpy.ndarray, (251, 7307, 7))

We only need the opening price and the adjusted closing price, so we can remove the other features: Close, High, Low and Volume.

In [7]:
df[df['Symbol']=='AAPL']

| Date | Symbol | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|
| 1990-01-02 | AAPL | 0.264482 | 0.332589 | 0.334821 | 0.312500 | 0.314732 | 183198400.0 |
| 1990-01-03 | AAPL | 0.266257 | 0.334821 | 0.339286 | 0.334821 | 0.339286 | 207995200.0 |
| 1990-01-04 | AAPL | 0.267145 | 0.335938 | 0.345982 | 0.332589 | 0.341518 | 221513600.0 |
| 1990-01-05 | AAPL | 0.268033 | 0.337054 | 0.341518 | 0.330357 | 0.337054 | 123312000.0 |
| 1990-01-08 | AAPL | 0.269808 | 0.339286 | 0.339286 | 0.330357 | 0.334821 | 101572800.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-12-24 | AAPL | 35.375175 | 36.707500 | 37.887501 | 36.647499 | 37.037498 | 148676800.0 |
| 2018-12-26 | AAPL | 37.866348 | 39.292500 | 39.307499 | 36.680000 | 37.075001 | 234330000.0 |
| 2018-12-27 | AAPL | 37.620605 | 39.037498 | 39.192501 | 37.517502 | 38.959999 | 212468400.0 |
| 2018-12-28 | AAPL | 37.639881 | 39.057499 | 39.630001 | 38.637501 | 39.375000 | 169165600.0 |


In [8]:
final_stocks[0, 0, :]

array(['1990-01-02', 0.2644821405410766, 0.3325890004634857,
       0.3348209857940674, 0.3125, 0.3147319853305816, 183198400.0],
      dtype=object)

In [9]:
final_stocks[0, 0, 2:4], final_stocks[0, 0, 6]

(array([0.3325890004634857, 0.3348209857940674], dtype=object), 183198400.0)

In [10]:
final_stocks = np.delete(final_stocks, np.s_[2:5], axis=2) # Delete Close, High, Low columns 
final_stocks = np.delete(final_stocks, 3, axis=2) # Delete Volume column, NOTE after first deletion index is changed
final_stocks.shape # Now it only contains Date, Adj Close and Open

(251, 7307, 3)
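
As a cross-check of the index bookkeeping, the same columns can be kept by naming their positions instead of deleting the others. The sketch below is an assumption-laden toy (a made-up `toy` array with rounded values and the column order Date, Adj Close, Close, High, Low, Open, Volume); it only shows that the two routes agree.

```python
import numpy as np

# Toy array shaped (n_stocks=1, n_days=2, n_columns=7), columns ordered as
# Date, Adj Close, Close, High, Low, Open, Volume (values are illustrative).
toy = np.array([[
    ["1990-01-02", 0.26, 0.33, 0.33, 0.31, 0.31, 183198400.0],
    ["1990-01-03", 0.27, 0.33, 0.34, 0.33, 0.34, 207995200.0],
]], dtype=object)

# Route with two deletions: drop Close/High/Low, then Volume
# (which sits at index 3 once the first three columns are gone).
by_delete = np.delete(np.delete(toy, np.s_[2:5], axis=2), 3, axis=2)

# Alternative: list the columns to keep (Date=0, Adj Close=1, Open=5).
keep = [0, 1, 5]
by_select = toy[:, :, keep]

print(by_select.shape)                  # (1, 2, 3)
print((by_delete == by_select).all())   # True
```

Selecting by position avoids having to remember that indices shift after the first deletion.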

In [11]:
final_stocks[:, :, 0] = np.array([pd.to_datetime(stock_i).date for stock_i in final_stocks[:, :, 0]])

In [12]:
final_stocks[0, 0, 0]

datetime.date(1990, 1, 2)
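
The conversion relies on `pd.to_datetime` returning a `DatetimeIndex`, whose `.date` attribute is an array of `datetime.date` objects. A minimal sketch on a hypothetical two-element array (`date_strings` is not part of the notebook):

```python
import numpy as np
import pandas as pd

# Two date strings stored as objects, like the Date column of final_stocks.
date_strings = np.array(["1990-01-02", "1990-01-03"], dtype=object)

# DatetimeIndex.date yields a numpy array of datetime.date objects.
dates = pd.to_datetime(date_strings).date

print(type(dates[0]).__name__)  # date
print(dates[0] < dates[1])      # True
```

Having real `date` objects is what makes the comparisons and the `.replace(year=...)` calls in the next section possible.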

### Datasets creation with non-overlapping testing period from original dataset

We divide the dataset, consisting of 29 years from January 1990 to December 2018, using a 4-year window and a 1-year stride. Each study period is split into a training part (about 756 days, roughly 3 years) and a trading part (about 252 days, roughly 1 year).

So, we obtain 26 study periods with non-overlapping trading parts.

![Dataset-creation-with-non-overlapping-testing-period.png](https://i.postimg.cc/7YWw5YwL/Dataset-creation-with-non-overlapping-testing-period.png)
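
The count of 26 can be checked with a few lines of arithmetic. The variable names below are chosen for this sketch only; it works on calendar years and ignores the exact trading days.

```python
first_year, last_year = 1990, 2018  # span of the dataset
window_size, stride = 4, 1          # years per study period, years between starts

# Each study period covers `window_size` calendar years; its last year is the trading part.
study_periods = [
    (start, start + window_size - 1)
    for start in range(first_year, last_year - window_size + 2, stride)
]

# The trading years (last year of each period) never repeat.
trading_years = [end for _, end in study_periods]

print(len(study_periods))   # 26
print(study_periods[0])     # (1990, 1993)
print(study_periods[-1])    # (2015, 2018)
```

With a stride of 1 year, consecutive periods share three years of data, but each trading year belongs to exactly one period.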

**METHOD TO CREATE THE NON-OVERLAPPING TESTING PERIODS**

1. Store the dates inside a `temp` variable.

  ![dates-layed-out-in-stock-price-prediction.png](https://i.postimg.cc/65MCHfGk/dates-layed-out-in-stock-price-prediction.png)

2. Define 2 variables: `year_start`, which points to the starting day of each dataset, and `start_index`, which holds the exact index of that starting day.
  ![year-start-start-index-variables.png](https://i.postimg.cc/MGvP8krx/year-start-start-index-variables.png)

3. `year_start` will go till `2015`, as 2015-2018 is the last study period.

4. Another variable called `year_end` marks the end of the study period. To be precise, it is not the last day of the final year but the day right after it, i.e. if `year_start = '1990-01-02'` then `year_end = '1994-01-02'`.

  Why does this make sense?

  We want to find the exact index of the first day and the last day of each dataset, and we then use conditional (boolean) indexing with those bounds. Defining `year_end` this way takes little effort: we only have to shift the year of `year_start` by `window_size`. In the condition, we then exclude `year_end` itself.
  
  ![year-end.png](https://i.postimg.cc/XY9CjPWR/year-end.png)

5. Index `temp`, which contains the whole 29-year span, from `year_start` (inclusive) to `year_end` (exclusive). The condition is `temp[(temp>=year_start) & (temp<year_end)]`. This is the `timeline` of the current dataset (not yet created!).
  ![timeline-creation.png](https://i.postimg.cc/JzrDkWfL/timeline-creation.png)

6. Calculate the `end_index` using the length of the current dataset's timeline and `start_index`. 
  
  `end_index = start_index + len(timeline)`
  ![progress-of-end-index.png](https://i.postimg.cc/T2rzsZHB/progress-of-end-index.png)

7. Slice the data from `start_index` to `end_index` (exclusive). The date is not needed in the further steps, so drop it by slicing the last axis from `index=1` to the end, since the date sits at `index=0`.

  `data[:, start_index:end_index, 1:]`
  
  Then append it to `datasets` list.

8. Finally, update `year_start` by moving it forward by `stride=1` year. If `year_start` is `1990-01-02`, then next time it will be `1991-01-02`.
![progress-year-start.png](https://i.postimg.cc/HkwvsY30/progress-year-start.png)

9. Now that we have the new `year_start`, we use it to find the `before_start_timeline`, i.e. the time that has passed since `1990-01-02`. This timeline stores the days from `1990-01-02` up to the current `year_start` date.

  It helps us find the new `start_index`: we just need the length of this `before_start_timeline`, or in other words how many days have passed from `1990-01-02` till the current `year_start`.

![progress-of-start-index.png](https://i.postimg.cc/dtcKLdWy/progress-of-start-index.png)
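
The index arithmetic of steps 4 to 6 and 9 can be traced on a toy calendar. Everything below is hypothetical (three made-up "trading days" per year and a 2-year window). The sketch counts the days strictly before `year_start`, so that `start_index` lands on the first day of the window; that choice is an assumption of the sketch.

```python
import numpy as np
from datetime import date

# Toy calendar: three "trading days" per year for 1990-1992 (hypothetical dates).
temp = np.array(
    [date(y, m, 2) for y in (1990, 1991, 1992) for m in (1, 5, 9)],
    dtype=object,
)

window_size = 2  # years, kept small for the toy data

year_start = date(1991, 1, 2)
# Step 4: the day right after the window, excluded from the condition.
year_end = year_start.replace(year=year_start.year + window_size)

# Step 5: dates that fall inside [year_start, year_end).
timeline = temp[(temp >= year_start) & (temp < year_end)]

# Step 9 idea: the number of days before year_start is the index of the window's first day.
start_index = int((temp < year_start).sum())

# Step 6: exclusive end of the slice.
end_index = start_index + len(timeline)

print(start_index, end_index)  # 3 9
```

Slicing `temp[start_index:end_index]` then returns exactly the dates in `timeline`, which is the property the real slicing in step 7 depends on.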

In [13]:
def dataset_generator(data, window_size=4, stride=1):
    '''
    data: stocks data with dates from 1990 to 2018 -> dims = (n_stocks, n_days, n_features), date at feature index 0
    window_size: number of years contained in each dataset -> int
    stride: number of years by which the window slides -> int

    returns a list of datasets, each holding 'window_size' years of stock data (date column removed) -> list
    '''

    # datasets -> [D1, D2, D3, ..., D26], Di will be a (251, 4 year time length, 2)
    datasets = []

    # Step 1
    temp = data[0, :, 0]

    # Step 2
    year_start = data[0, 0, 0]
    start_index = 0

    # Step 3
    while year_start.year<=2015:

        # Step 4
        year_end = year_start.replace(year=year_start.year+window_size)

        # Step 5
        timeline = temp[(temp>=year_start) & (temp<year_end)]

        # Step 6
        end_index = start_index + timeline.shape[0]

        # Step 7
        datasets.append(data[:, start_index:end_index, 1:])

        # Step 8
        year_start = year_start.replace(year=year_start.year+stride)

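        # Step 9
        # Count only the days strictly before year_start, so that start_index is
        # the index of the first day on or after year_start (using <= would skip
        # that day whenever year_start itself is a trading day)
        before_start_timeline = temp[temp<year_start]
        start_index = len(before_start_timeline)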

    return datasets

In [14]:
datasets = dataset_generator(final_stocks)

In [15]:
len(datasets)

26

In [16]:
datasets[0].shape

(251, 1013, 2)

In [17]:
datasets[0][0, 0, :]

array([0.2644821405410766, 0.3147319853305816], dtype=object)

Order of values for the above result : *Adj Close, Open*
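
Each study period is meant to be split into a training part and a trading part. A minimal sketch of such a split, assuming a fixed cut after 756 days and using a zero-filled stand-in array (`study_period` and `n_train_days` are hypothetical names; the notebook does not show its split):

```python
import numpy as np

# Stand-in for one study period: (n_stocks, n_days, n_features) = (3, 1013, 2).
study_period = np.zeros((3, 1013, 2))

n_train_days = 756  # assumption: roughly three years of trading days

# Split along the time axis (axis 1).
training_part = study_period[:, :n_train_days, :]
trading_part = study_period[:, n_train_days:, :]

print(training_part.shape, trading_part.shape)  # (3, 756, 2) (3, 257, 2)
```

Because the number of trading days varies slightly from year to year, the trading part is only approximately 252 days long under this fixed cut.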

In [20]:
with open('../datasets/datasets-list', 'wb') as file:
    pickle.dump(datasets, file)

In [4]:
with open('../datasets/symbols', 'wb') as file:
    pickle.dump(filtered_symbols, file)
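
To read the saved objects back later, the files are opened in binary read mode and passed to `pickle.load`. The sketch below is a self-contained round trip with toy stand-in objects and a temporary directory, so it does not touch the real `../datasets/` files.

```python
import os
import pickle
import tempfile

# Stand-ins for the objects saved above (hypothetical toy values).
toy_datasets = [[1, 2], [3, 4]]
toy_symbols = ["AAA", "BBB"]

with tempfile.TemporaryDirectory() as tmp_dir:
    datasets_path = os.path.join(tmp_dir, "datasets-list")
    symbols_path = os.path.join(tmp_dir, "symbols")

    # Save, mirroring the cells above.
    with open(datasets_path, "wb") as file:
        pickle.dump(toy_datasets, file)
    with open(symbols_path, "wb") as file:
        pickle.dump(toy_symbols, file)

    # Load: binary read mode plus pickle.load.
    with open(datasets_path, "rb") as file:
        loaded_datasets = pickle.load(file)
    with open(symbols_path, "rb") as file:
        loaded_symbols = pickle.load(file)

print(loaded_datasets == toy_datasets, loaded_symbols == toy_symbols)  # True True
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.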