<a href="https://colab.research.google.com/github/MatteoBettini/Stock-Market-Prediction-2020/blob/main/notebooks/Data%20exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Take-home Assessment

# Dataset exploration

In this section we load and explore the "**Processed_NASDAQ**" dataset, which contains several daily features of the NASDAQ Composite from 2010 to 2017. The dataset was acquired from [this repository](https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables#).

It covers features from several categories: technical indicators, futures contracts, commodity prices, major indices of markets around the world, prices of major companies in the U.S. market, and treasury bill rates. Sources and a thorough description of the features are given in the paper '[CNNpred: CNN-based stock market prediction using a diverse set of variables](https://arxiv.org/pdf/1810.08923.pdf)'.

The dataset contains 1984 entries, each representing one day of trading in the stock market. Each entry has 84 features, of which 2 are strings identifying the date and the name of the market. The remaining 82 features are grouped as follows:

*   Primitive features
*   Technical indicators
*   Economic data
*   World stock markets
*   The exchange rate of U.S. dollar
*   Commodities
*   Big U.S. Companies
*   Futures contracts

A tabular description of the features is also reported in the following images.


![](https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/feature_description/feature_table_1.png?token=ANHXQQI7BINXDEPRGLZ42LS74XAH6)

![](https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/feature_description/feature_table_2.png?token=ANHXQQOG2ZP4BPHXIWXNWWK74XAKY)




### Imports

In [None]:
# To plot figures
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# To make this notebook's output stable across runs
np.random.seed(42)

### Loading the dataset

In [7]:
nasdaq_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NASDAQ.csv?token=ANHXQQK4VPBE6ABSBCHTF5K74UP3W'
dji_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_DJI.csv?token=ANHXQQOXYQZSSSTVFKX6RZS74XDXA'
nyse_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NYSE.csv?token=ANHXQQNAISMPCLVLRTGNJBC74XD2C'
russel_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_RUSSELL.csv?token=ANHXQQPGLBLSM3B36OLWIPC74XD3U'
s_p_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_S%26P.csv?token=ANHXQQNRFS3NKP2XCF5Q5MS74XD5K'

In [8]:
nasdaq_df = pd.read_csv(nasdaq_url)
# Dataset is now stored in a pandas DataFrame

Now that the dataset is loaded, we can start inspecting the data.

In [11]:
nasdaq_df.head()

Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,ROC_20,EMA_10,EMA_20,EMA_50,EMA_200,DTB4WK,DTB3,DTB6,DGS5,DGS10,Oil,Gold,DAAA,DBAA,GBP,JPY,CAD,CNY,AAPL,AMZN,GE,JNJ,JPM,MSFT,WFC,XOM,FCHI,FTSE,GDAXI,GSPC,...,NYSE,TE1,TE2,TE3,TE5,TE6,DE1,DE2,DE4,DE5,DE6,CTB3M,CTB6M,CTB1Y,Name,AUD,Brent,CAC-F,copper-F,WIT-oil,DAX-F,DJI-F,EUR,FTSE-F,gold-F,HSI-F,KOSPI-F,NASDAQ-F,GAS-F,Nikkei-F,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,2269.149902,,,,,,,,,,,,,,0.04,0.06,0.2,2.69,3.85,,,5.33,6.39,,,,,,,,,,,,,,,,,...,,3.81,3.79,3.65,0.02,0.16,1.06,2.54,6.19,6.33,6.35,,,,NASDAQ,0.35,-0.13,0.15,0.09,0.1,0.48,-1.19,-0.12,0.27,0.34,1.68,-0.07,-0.96,-2.4,0.67,0.03,0.26,-1.08,-1.0,-0.11,-0.08,-0.06,-0.48,0.3,0.39
1,2010-01-04,2308.419922,0.560308,0.017306,,,,,,,,,,,,0.05,0.08,0.18,2.65,3.85,0.02683,0.0,5.35,6.39,-0.004222,-0.004467,-0.010644,-0.001991,0.015565,-0.004609,0.02115,0.004192,0.028318,0.01542,0.012227,0.014078,0.019724,,,0.016043,...,0.019733,3.8,3.77,3.67,0.03,0.13,1.04,2.54,6.21,6.31,6.34,-0.1,-0.04386,-0.01487,NASDAQ,1.73,2.81,1.99,1.36,2.71,0.96,1.28,0.61,1.74,2.05,-0.52,0.54,1.51,5.6,0.31,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.1
2,2010-01-05,2308.709961,0.225994,0.000126,0.017306,,,,,,,,,,,0.03,0.07,0.17,2.56,3.77,0.002699,0.00156,5.24,6.3,-0.007628,-0.009838,-0.001441,1.5e-05,0.001729,0.0059,0.005178,-0.011596,0.01937,0.000323,0.027452,0.003904,-0.000264,0.004036,-0.002718,0.003116,...,0.003839,3.74,3.7,3.6,0.04,0.14,1.06,2.53,6.13,6.23,6.27,-0.055556,-0.073394,-0.033962,NASDAQ,-0.08,0.59,-0.11,0.24,0.32,-0.14,-0.04,-0.31,0.38,0.04,2.03,-0.18,-0.08,-4.2,0.47,-0.07,1.96,-0.2,0.31,0.43,0.03,0.12,-0.9,1.42,-0.12
3,2010-01-06,2301.090088,-0.048364,-0.0033,0.000126,0.017306,,,,,,,,,,0.03,0.06,0.15,2.6,3.85,0.016883,0.006009,5.3,6.34,0.002067,0.008418,-0.007311,0.000191,-0.015906,-0.018116,-0.005151,0.008134,0.005494,-0.006137,0.001425,0.008643,0.001186,0.001358,0.00041,0.000546,...,0.003104,3.82,3.79,3.7,0.03,0.12,1.04,2.49,6.19,6.28,6.31,-0.117647,0.0,0.015625,NASDAQ,0.91,1.61,0.15,2.41,1.72,-0.01,0.01,0.31,0.16,1.59,0.79,0.78,-0.36,6.6,0.19,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,2300.050049,0.007416,-0.000452,-0.0033,0.000126,0.017306,,,,,,,,,0.02,0.05,0.16,2.62,3.85,-0.006256,0.000221,5.31,6.33,-0.005609,0.011196,0.002035,-7.3e-05,-0.001849,-0.017013,0.05178,-0.007137,0.019809,-0.0104,0.036286,-0.003142,0.001775,-0.000597,-0.002481,0.004001,...,0.0022,3.83,3.8,3.69,0.03,0.14,1.02,2.48,6.17,6.28,6.31,0.066667,0.019802,0.007692,NASDAQ,-0.41,-0.46,0.15,-1.9,-0.63,-0.12,0.28,-0.66,0.06,-0.25,-0.6,-1.27,-0.05,-3.38,-0.09,-0.72,0.94,0.5,0.4,0.58,0.58,0.54,-1.85,0.22,-0.58
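
The first rows above contain many empty cells, because lagged and rolling features such as `mom1` or `EMA_10` need some history before they are defined. A minimal sketch of how the missing values could be counted per column is shown below. It uses a hypothetical three-column miniature (`toy_df`) built from values in the output above rather than the real file, so it runs without network access.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: three columns copied from the first five rows above
toy_df = pd.DataFrame({
    'Close': [2269.149902, 2308.419922, 2308.709961, 2301.090088, 2300.050049],
    'mom':   [np.nan, 0.017306, 0.000126, -0.0033, -0.000452],
    'mom1':  [np.nan, np.nan, 0.017306, 0.000126, -0.0033],
})

# Number of missing values in each column, largest first
missing_per_column = toy_df.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Index of the first row in which every column is defined
first_complete_row = toy_df.dropna().index[0]
print(first_complete_row)
```

The same two calls should apply unchanged to `nasdaq_df`, assuming it has been loaded as in the cell above.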


In [None]:
def series_to_supervised(data, n_in, dropnan=False):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array
            (rows are time steps, columns are variables).
        n_in: Number of lag observations as input (X).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning:
        the lagged inputs plus a 'target' column holding the change of
        the first variable between t-1 and t.
    """
    df = pd.DataFrame(data)
    # Read the variable count from the frame so lists and 1-D arrays also work
    n_vars = df.shape[1]
    cols, names = [], []
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    # forecast target (t): first column at t minus first column at t-1
    cols.append(df.iloc[:, 0].diff())
    names.append('target')

    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg = agg.dropna()
    return agg
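
To make the lag framing concrete, here is a small sketch on a toy series. `lag_frame` is a hypothetical, compact stand-in that follows the same idea as `series_to_supervised` (lagged copies of every column, plus the one-step change of the first column as `target`). It is not the function used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

def lag_frame(values, n_lags):
    """Hypothetical helper: lagged inputs plus the change of column 0 as target."""
    df = pd.DataFrame(values)
    parts = []
    for i in range(n_lags, 0, -1):
        shifted = df.shift(i)  # row t now holds the values observed at t-i
        shifted.columns = ['var%d(t-%d)' % (j + 1, i) for j in range(df.shape[1])]
        parts.append(shifted)
    # Target at row t: first column at t minus first column at t-1
    parts.append(df.iloc[:, 0].diff().rename('target'))
    return pd.concat(parts, axis=1)

# Toy series: 5 time steps, 2 variables (column 0 plays the role of Close)
toy = np.array([[10., 1.], [12., 2.], [11., 3.], [15., 4.], [14., 5.]])
framed = lag_frame(toy, n_lags=2)
print(framed)
```

Row 2 of `framed` holds the two previous observations, `(10, 1)` and `(12, 2)`, as inputs and `11 - 12 = -1` as target. The first `n_lags` rows contain NaN because no earlier history exists for them.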

In [3]:
dataset_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NASDAQ.csv?token=ANHXQQK4VPBE6ABSBCHTF5K74UP3W'
nasdaq_df = pd.read_csv(dataset_url)
# Dataset is now stored in a pandas DataFrame
nasdaq_df.info()
# Drop the two string columns so that .values yields a purely numeric array
nasdaq_df = nasdaq_df.drop(columns=['Date','Name'])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1984 entries, 0 to 1983
Data columns (total 84 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            1984 non-null   object 
 1   Close           1984 non-null   float64
 2   Volume          1983 non-null   float64
 3   mom             1983 non-null   float64
 4   mom1            1982 non-null   float64
 5   mom2            1981 non-null   float64
 6   mom3            1980 non-null   float64
 7   ROC_5           1979 non-null   float64
 8   ROC_10          1974 non-null   float64
 9   ROC_15          1969 non-null   float64
 10  ROC_20          1964 non-null   float64
 11  EMA_10          1975 non-null   float64
 12  EMA_20          1965 non-null   float64
 13  EMA_50          1935 non-null   float64
 14  EMA_200         1785 non-null   float64
 15  DTB4WK          1984 non-null   float64
 16  DTB3            1984 non-null   float64
 17  DTB6            1984 non-null   f

In [None]:
data = series_to_supervised(nasdaq_df.values, 40)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1984 entries, 0 to 1983
Columns: 3281 entries, var1(t-40) to target
dtypes: float64(3281)
memory usage: 49.7 MB
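
The figures in this summary can be checked by hand. Each of the 82 numeric features appears once per lag, and one `target` column is added. The short sketch below redoes that arithmetic. The memory estimate assumes 8 bytes per `float64` cell and ignores the small index overhead.

```python
n_rows = 1984      # trading days in the dataset
n_features = 82    # numeric columns left after dropping Date and Name
n_lags = 40        # value passed as n_in

# One lagged copy of every feature per lag, plus the target column
n_columns = n_features * n_lags + 1

# Approximate size of the float64 values in MiB
approx_mb = n_rows * n_columns * 8 / 1024**2

print(n_columns, round(approx_mb, 1))  # 3281 49.7
```

Because `dropnan` is left at its default of `False`, all 1984 rows are kept, and the first 40 rows still contain NaN in their lagged columns.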
