<img src='img/logo.png' />

<img src='img/title.png'>

<img src='img/py3k.png'>

# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
* [Missing Data](#Missing-Data)
	* [Concatenate and investigate](#Concatenate-and-investigate)
	* [Fill missing data](#Fill-missing-data)
	* [Left Joining](#Left-Joining)
	* [Resample and reindex](#Resample-and-reindex)

# Learning Objectives:

After completion of this module, learners should be able to:

* Use join and concatenate between two DataFrames
* Use forward fill to fill in missing data
* Create a datetime index from `pd.date_range`

In [None]:
import pandas as pd
from pandas_datareader import data
import numpy as np
%matplotlib inline

# Missing Data

In Pandas, missing data are represented by the value `NaN`, which for floating-point data is NumPy's `np.nan`. Unlike plain NumPy arithmetic, however, Pandas statistical operations like `.sum()` and `.mean()` skip `NaN` values by default. `groupby` likewise excludes rows whose group key is `NaN` by default.
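
As a minimal sketch of this default behavior, the cell below uses a small made-up Series and DataFrame (the names `s` and `toy` and all values are hypothetical, chosen only for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                      # NaN is skipped: 1.0 + 3.0
average = s.mean()                   # mean of the two non-missing values
strict_total = s.sum(skipna=False)   # NaN propagates when skipping is turned off

# groupby drops rows whose group key is NaN by default
toy = pd.DataFrame({'key': ['a', np.nan, 'a'], 'value': [1, 2, 3]})
grouped = toy.groupby('key')['value'].sum()
print(total, average, strict_total)
print(grouped)
```

Passing `skipna=False` is a quick way to check whether a column contains any missing values before trusting a total.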

In this example we want to align time series data and fill in missing values. We are going to use the `ffill()` method. See the [documentation on missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more information.

Not all stock data are reported on the same days. The US and Australia may have different holidays.

In [None]:
sp500 = data.DataReader('^GSPC', 'yahoo', '2016-1-1', '2016-1-31')[['Close']].rename(columns={'Close':'sp500'})
aus = data.DataReader('^AXJO', 'yahoo', '2016-1-1', '2016-1-31')[['Close']].rename(columns={'Close':'aus'})

The `.equals()` method is a quick way to determine if two index objects (of any type) have the same values and order. In the next sections we'll investigate where the dates do not align and fill in the missing values.

In [None]:
aus.index.equals(sp500.index)
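
When `.equals()` returns `False`, it does not say which dates differ. One possible way to find them, sketched here with two made-up calendars (`idx_a` and `idx_b` are hypothetical), is `Index.difference`.

```python
import pandas as pd

# Hypothetical trading calendars: market B is closed on 2016-01-05
idx_a = pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-06'])
idx_b = pd.to_datetime(['2016-01-04', '2016-01-06'])

same = idx_a.equals(idx_b)            # the calendars do not match
only_in_a = idx_a.difference(idx_b)   # dates present in A but not in B
only_in_b = idx_b.difference(idx_a)   # dates present in B but not in A
print(same)
print(only_in_a)
print(only_in_b)
```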

## Concatenate and investigate

Let's take an outer join to figure out where the missing data are.

In [None]:
df = pd.concat([sp500, aus], axis=1, join='outer')
df
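
The downloaded data depend on a live data source, so here is a small offline sketch of the same idea. The frames `us` and `au` and their prices are made up; the point is only to show how an outer join exposes the gaps and how `isna()` can list them.

```python
import pandas as pd

# Hypothetical closing prices; each market skips a different day
us = pd.DataFrame({'us': [10.0, 11.0, 12.0]},
                  index=pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-07']))
au = pd.DataFrame({'au': [20.0, 21.0, 22.0]},
                  index=pd.to_datetime(['2016-01-04', '2016-01-06', '2016-01-07']))

# The outer join keeps the union of both indexes;
# sort_index() makes the rows read chronologically
both = pd.concat([us, au], axis=1, join='outer').sort_index()

# Keep only the rows where at least one market has no value
gaps = both[both.isna().any(axis=1)]
missing_per_column = both.isna().sum()
print(gaps)
print(missing_per_column)
```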

Matplotlib will show gaps in the data.

In [None]:
df.plot()

## Fill missing data

There are a number of ways by which we can *fill* in missing data. For this data set *forward filling* is appropriate because the close price would not have changed on days where no trading occurred.

In [None]:
df.ffill()
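
To see exactly what forward filling does, here is a sketch on made-up prices (the `prices` frame is hypothetical). Note that a missing value at the very start has nothing before it to copy, so it stays `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps in both columns
prices = pd.DataFrame(
    {'us': [10.0, 11.0, np.nan, 12.0],
     'au': [np.nan, 20.0, 21.0, np.nan]},
    index=pd.date_range('2016-01-04', periods=4),
)

# Each NaN is replaced by the most recent earlier value in its column;
# ffill() returns a new DataFrame and leaves `prices` unchanged
filled = prices.ffill()
print(filled)
```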

We also could have done a linear interpolation to provide the missing data.

In [None]:
df.interpolate(method='linear')
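
The difference between the two approaches is easiest to see side by side. This sketch uses a made-up Series `s` with a two-day gap: forward fill holds the last price, while linear interpolation draws a straight line across the gap, treating the rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-day gap
s = pd.Series([10.0, np.nan, np.nan, 16.0],
              index=pd.date_range('2016-01-04', periods=4))

held = s.ffill()                          # repeats 10.0 across the gap
linear = s.interpolate(method='linear')   # evenly spaced steps from 10.0 to 16.0
print(pd.DataFrame({'ffill': held, 'linear': linear}))
```

For closing prices, interpolation invents values that were never traded, which is one reason to prefer forward filling here.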

## Left Joining

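Because `.join()` performs a left join by default, joining `aus` to `sp500` keeps only the dates in the SP500 index and drops the AXJO data from January 18, which has no matching SP500 row. After the forward fill, January 26th in the `aus` column holds the value from January 25th.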

In [None]:
sp500.join(aus).ffill()
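
The same effect can be sketched offline with two made-up frames (`us` and `au` are hypothetical). The date that exists only in the right-hand frame disappears, and the date that exists only in the left-hand frame is forward filled.

```python
import pandas as pd

# Hypothetical closing prices; each market skips a different day
us = pd.DataFrame({'us': [10.0, 11.0, 12.0]},
                  index=pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-07']))
au = pd.DataFrame({'au': [20.0, 21.0, 22.0]},
                  index=pd.to_datetime(['2016-01-04', '2016-01-06', '2016-01-07']))

# join() defaults to how='left', so only the dates in `us` survive
left = us.join(au)
left_filled = left.ffill()
print(left_filled)
```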

## Resample and reindex

Suppose I wanted *all* of the days in January. I can resample or reindex and forward fill the missing data.

In [None]:
sp500.resample('D').mean().join(aus.resample('D').mean()).ffill()
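
As a small sketch of what `resample('D')` does on its own, consider a made-up frame with one value on a Friday and one on the following Monday. Resampling inserts the weekend as empty rows, but it does not extend the index beyond the first and last observations.

```python
import pandas as pd

# Hypothetical prices on a Friday and the following Monday
us = pd.DataFrame({'us': [10.0, 11.0]},
                  index=pd.to_datetime(['2016-01-08', '2016-01-11']))

# One row per calendar day between the first and last observation;
# days without data get NaN because the mean of no values is NaN
daily = us.resample('D').mean()
daily_filled = daily.ffill()
print(daily_filled)
```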

Notice that the first three days in January are missing and so are the last two days, because `resample` only spans the dates between the first and last observations. To include those I make a new index and join the reindexed DataFrames.

In [None]:
jan_index = pd.date_range('2016-1-1', '2016-1-31')
jan_index

In [None]:
sp500.reindex(jan_index).join(aus.reindex(jan_index)).ffill()
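
Here is a sketch of `reindex` on a made-up one-week calendar (`week` and `us` are hypothetical). Reindexing adds every requested date, but a forward fill can only fill the trailing days; the days before the first observation have no earlier value to copy.

```python
import pandas as pd

# Hypothetical target calendar and two observed prices
week = pd.date_range('2016-01-01', '2016-01-07')
us = pd.DataFrame({'us': [10.0, 11.0]},
                  index=pd.to_datetime(['2016-01-04', '2016-01-05']))

full = us.reindex(week)   # dates without data become NaN rows
filled = full.ffill()     # fills January 6 and 7, but not January 1 to 3
print(filled)
```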

In order to get the first three days in January I need to read in the last day recorded in December. I'm going to read in starting on December 20 just to be safe.

In [None]:
sp500 = data.DataReader('^GSPC', 'yahoo', '2015-12-20', '2016-1-31')[['Close']].rename(columns={'Close':'sp500'})
aus = data.DataReader('^AXJO', 'yahoo', '2015-12-20', '2016-1-31')[['Close']].rename(columns={'Close':'aus'})

First I'll resample by day and forward fill.

In [None]:
sp500 = sp500.resample('D').mean().ffill()
aus = aus.resample('D').mean().ffill()

In [None]:
sp500.index.equals(aus.index)

The two DataFrame Indexes are equal and can easily be joined or concatenated before re-indexing. A final forward fill sets the values for January 30 and 31.

In [None]:
sp500.join(aus).reindex(jan_index).ffill()
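
The whole recipe can be sketched offline with made-up prices that start on the last day of December (all names and values here are hypothetical). Each frame is resampled and forward filled, the two are joined, and the result is reindexed to the target calendar and forward filled once more.

```python
import pandas as pd

# Hypothetical prices, starting in December so that January 1 can be filled
us = pd.DataFrame({'us': [9.0, 10.0, 11.0]},
                  index=pd.to_datetime(['2015-12-31', '2016-01-04', '2016-01-05']))
au = pd.DataFrame({'au': [19.0, 20.0]},
                  index=pd.to_datetime(['2015-12-31', '2016-01-05']))
target = pd.date_range('2016-01-01', '2016-01-07')

# Step 1: one row per calendar day, holding the last known price
us_daily = us.resample('D').mean().ffill()
au_daily = au.resample('D').mean().ffill()

# Step 2: join, trim to the target calendar, and fill the trailing days
result = us_daily.join(au_daily).reindex(target).ffill()
print(result)
```

Because both toy frames share the same first and last dates, their daily indexes match; with real data it is worth checking `.equals()` before joining, as done above.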

<img src='img/copyright.png'>