# S&P 500 Predictive Analysis

Vulcun provided a dataset of past stock market prices.

## Loading in the Dataset

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import os

In [2]:
test_set = pd.read_csv("dataset/observations_test.csv")
train_set = pd.read_csv("dataset/observations_train.csv")
ser = pd.read_csv("dataset/series.csv")

In [3]:
train = train_set.merge(ser, on = "series_id", how = 'left')
train.head()

| | series_id | date | value | name | frequency | units | seasonal_adjustment | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | AAA10Y | 2000-01-03 00:00:00.0000000 | 1.17 | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 1 | AAA10Y | 2000-01-04 00:00:00.0000000 | 1.2 | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 2 | AAA10Y | 2000-01-05 00:00:00.0000000 | 1.16 | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 3 | AAA10Y | 2000-01-06 00:00:00.0000000 | 1.15 | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 4 | AAA10Y | 2000-01-07 00:00:00.0000000 | 1.17 | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |


## Data Cleaning

In [5]:
train.dtypes

series_id               object
date                    object
value                  float64
name                    object
frequency               object
units                   object
seasonal_adjustment     object
Description             object
dtype: object

I want to transform the date into a datetime object so it is easier to work with time.
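A minimal sketch of that conversion, run on a small made-up frame rather than on `train` itself. The `sample` frame and its values are assumptions for illustration; only the date string format is taken from the output above. Slicing to the first 19 characters drops the seven-digit fractional seconds so that an explicit format string can be used.

```python
import pandas as pd

# Hypothetical miniature of the merged frame; the values are illustrative only.
sample = pd.DataFrame({
    "series_id": ["AAA10Y", "AAA10Y"],
    "date": ["2000-01-03 00:00:00.0000000", "2000-01-04 00:00:00.0000000"],
    "value": [1.17, 1.20],
})

# Keep "YYYY-MM-DD HH:MM:SS" (19 characters) and parse it with an explicit format.
sample["date"] = pd.to_datetime(
    sample["date"].str.slice(0, 19), format="%Y-%m-%d %H:%M:%S"
)

print(sample.dtypes)
```

Once the column is a datetime, accessors such as `sample["date"].dt.year` and date-based sorting or resampling become available.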

In [7]:
train.describe()

| | value |
|---|---|
| count | 100621.0 |
| mean | 664.4961 |
| std | 4442.223695 |
| min | -87.57 |
| 25% | 0.66 |
| 50% | 1.76 |
| 75% | 6.0 |
| max | 85450.0 |


Negative values look odd at first: for a stock price, they would mean the company is paying customers to buy its stock. I therefore decided to investigate which series contain them.

In [13]:
train.loc[train["value"] < 0.0, "units"].unique()

array(['Percent Change from Year Ago', 'Percent',
       'Percent Change at Annual Rate'], dtype=object)

The negative values appear because those series are measured as percentages or percent changes, which can legitimately fall below zero. Because the series use different units, we must standardize the values before we can analyze them together.
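One possible way to standardize, sketched here as an assumption rather than as the method this notebook settles on, is a per-series z-score: subtract each series' mean and divide by its standard deviation so that series in different units land on a comparable scale. The `sample` frame, its series ids, and the `value_z` column name are hypothetical.

```python
import pandas as pd

# Hypothetical frame with two series on very different scales.
sample = pd.DataFrame({
    "series_id": ["A", "A", "A", "B", "B", "B"],
    "value": [1.0, 2.0, 3.0, 100.0, 200.0, 300.0],
})

# Compute the mean and (sample) standard deviation within each series,
# broadcast back to the original rows with transform.
grouped = sample.groupby("series_id")["value"]
sample["value_z"] = (sample["value"] - grouped.transform("mean")) / grouped.transform("std")

print(sample)
```

Grouping by `series_id` keeps a large-valued series from dominating the scale of a small-valued one. A series with a single observation or constant values would produce NaN or infinite z-scores and would need separate handling.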

### Check for Null values

In [9]:
train.isnull().any()

series_id              False
date                   False
value                   True
name                   False
frequency              False
units                  False
seasonal_adjustment    False
Description             True
dtype: bool

In [17]:
train["frequency"].unique()

In [18]:
train.loc[train["value"].isnull()]

| | series_id | date | value | name | frequency | units | seasonal_adjustment | Description |
|---|---|---|---|---|---|---|---|---|
| 10 | AAA10Y | 2000-01-17 00:00:00.0000000 | NaN | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 35 | AAA10Y | 2000-02-21 00:00:00.0000000 | NaN | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 79 | AAA10Y | 2000-04-21 00:00:00.0000000 | NaN | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 105 | AAA10Y | 2000-05-29 00:00:00.0000000 | NaN | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| 131 | AAA10Y | 2000-07-04 00:00:00.0000000 | NaN | Moodys Seasoned Aaa Corporate Bond Yield Relat... | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between Moo... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94337 | TEDRATE | 2017-09-04 00:00:00.0000000 | NaN | TED Spread | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between 3-M... |
| 94362 | TEDRATE | 2017-10-09 00:00:00.0000000 | NaN | TED Spread | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between 3-M... |
| 94395 | TEDRATE | 2017-11-23 00:00:00.0000000 | NaN | TED Spread | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between 3-M... |
| 94417 | TEDRATE | 2017-12-25 00:00:00.0000000 | NaN | TED Spread | Daily | Percent | Not Seasonally Adjusted | Series is calculated as the spread between 3-M... |


For series measured daily, the missing values appear to fall on observed holidays when the market was closed (for example 2000-07-04 and 2017-12-25). For these days, we can impute the value from the previous close.
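The previous-close imputation can be sketched as a forward fill within each series. This is an illustration on a made-up frame: the series ids `S1` and `S2`, the values, and the `value_filled` column name are all hypothetical, and the `date` column is assumed to already be a datetime.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two daily series, each with a gap on a holiday.
sample = pd.DataFrame({
    "series_id": ["S1"] * 3 + ["S2"] * 3,
    "date": pd.to_datetime(["2000-01-14", "2000-01-17", "2000-01-18"] * 2),
    "value": [1.0, np.nan, 1.2, np.nan, 0.3, np.nan],  # made-up values
})

# Sort so that "previous" means the previous date within the same series,
# then carry the last observed value forward, one series at a time.
sample = sample.sort_values(["series_id", "date"])
sample["value_filled"] = sample.groupby("series_id")["value"].ffill()

print(sample)
```

Filling inside `groupby("series_id")` prevents the last value of one series from leaking into the start of the next. A gap at the very start of a series has no previous close, so it stays NaN and needs a separate decision.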