## Data Retrieval

Yahoo Finance API is used to get stocks data for Google over period of seven years. 

In [29]:
import yahoo_finance as yahoo
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames

# Show matplotlib plots inline (nicely formatted in the notebook)
%matplotlib inline

###### Entry Parameters #######
startDate = '2010-01-01'
endDate = '2017-01-01'
ticker = 'GOOG'

#Used for re-running: stops querying the API if we already have the data
fetchData = False

#Stock Data- first step is to obtain the list of stocks, and then select a stock to run through machine learning
fileName = 'data/stocksData.csv'
###############################

# returive stock data using yahoo Finance API and return a dataFrame
def retrieveStockData():
    try:
        if fetchData:
            frames = []
            for symbol in ['GOOG']:
                print('Retriving data for ticker _' + symbol + '_ .....')
                target_data = yahoo.Share(symbol).get_historical(startDate, endDate)
                df = pd.DataFrame(target_data).sort_values(by='Date')
                df = df[['Symbol', 'Date','Open', 'Close', 'Adj_Close', 'High','Low', 'Volume']]
                frames.append(df)

            data = pd.concat(frames)    
            # save as CSV to stop blowing up their API
            data.to_csv(fileName, index_col=None, header=0, parse_dates=['Date'])
        else:
            # read the existing csv 
            data = pd.read_csv(fileName)

        data.fillna(method='ffill', inplace=True)
        data.fillna(method='bfill', inplace=True)
        #Date and Symbol columns not required
        data.drop(['Date', 'Symbol'], axis = 1, inplace = True)
        print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
        return data 
    except:
         print "Dataset could not be loaded. Is the dataset missing?"
        

data = retrieveStockData()

#whats of interest here is the percentage change from one day to the next
#data = data.pct_change()

display(data.head())


Wholesale customers dataset has 1762 samples with 7 features each.


Unnamed: 0,Index,Open,Close,Adj_Close,High,Low,Volume
0,1761,626.951088,626.751061,313.062468,629.511067,624.241073,3927000
1,1760,627.181073,623.991055,311.683844,627.841071,621.541045,6031900
2,1759,625.861078,608.261023,303.826685,625.861078,606.361042,7987100
3,1758,609.401025,594.101005,296.753749,610.001045,592.651008,12876600
4,1757,592.000997,602.021036,300.709808,603.251034,589.110988,9483900


## Data Exploration
In this section I will explore the data through visualizations and code to understand how each feature is related to the others. I will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.


In [28]:
# Display a description of the dataset
display(data.describe())

Unnamed: 0,Index,Open,Close,Adj_Close,High,Low,Volume
count,1762.0,1762.0,1762.0,1762.0,1762.0,1762.0,1762.0
mean,880.5,667.988556,667.7322,460.416137,673.341401,662.013391,4051175.0
std,508.789904,159.812252,159.790218,173.76548,160.469786,159.052972,2933264.0
min,0.0,438.310758,436.070761,217.817563,442.28076,433.630737,7900.0
25%,440.25,548.639605,547.364993,299.208801,553.582514,542.753834,1901175.0
50%,880.5,613.846067,614.161057,437.910342,619.111038,609.356057,3642400.0
75%,1320.75,749.96282,749.452515,579.082539,756.218456,742.882817,5165425.0
max,1761.0,1226.802152,1220.172036,813.109985,1228.882066,1218.602083,29760600.0


The first thing I notice here is that the mean values and median values differ a lot. 
That means the distribution should not be a normal distribution. In such a case, median is considered to be more reiable than mean.

### Implementation: Selecting Samples

To get a better understanding of stock data and how this data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add three indices to the indices list which will represent the stocks to track. 
It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [34]:
# Select three indices to sample from the dataset
indices = [50, 500, 1700]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)

print "Chosen samples of Google stock dataset:"
display(samples)

Chosen samples of Google stock dataset:


Unnamed: 0,Index,Open,Close,Adj_Close,High,Low,Volume
0,1711,568.301005,565.56097,282.497987,571.450956,564.250962,6668000
1,1261,632.051074,640.251081,319.805735,644.491104,632.001082,3224800
2,61,776.030029,776.429993,776.429993,778.710022,772.890015,1201400


## Identifying outliers
Outliers are data points that are distinctvely separated from other data points any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile.

Outliers has the biggest effect on the mean but not much effect on the median. Effecting mean can lead to effecting variance and then having largest effect on standard deviation.

In [10]:
#locate number of outliers for each column, outlier being 1.5 IQR up or down from upper or lower quartile
outliers = pd.DataFrame(index=data.index)
outliers = pd.DataFrame(np.where(
        (data > 1.5 * ((data.quantile(0.75) - data.quantile(0.25)) + data.quantile(0.75))) |
        (data < 1.5 * (data.quantile(0.25) - (data.quantile(0.75)) - data.quantile(0.25))),1, 0), 
                        columns=data.columns)

print outliers

      Unnamed: 0  Open  Close  Adj_Close  High  Low  Volume
0              0     0      0          0     0    0       0
1              0     0      0          0     0    0       0
2              0     0      0          0     0    0       0
3              0     0      0          0     0    0       0
4              0     0      0          0     0    0       0
5              0     0      0          0     0    0       0
6              0     0      0          0     0    0       0
7              0     0      0          0     0    0       0
8              0     0      0          0     0    0       0
9              0     0      0          0     0    0       0
10             0     0      0          0     0    0       0
11             0     0      0          0     0    0       0
12             0     0      0          0     0    0       0
13             0     0      0          0     0    0       0
14             0     0      0          0     0    0       0
15             0     0      0          0