# Pulling Stock Data
This is a Python notebook for analyzing stock data based on ticker symbols.


## 0. Programming in Python
Let's start by seeing how to make variables, functions, and logic in Python.

### Variables
Variables can take on many data types, from `'Strings'` to numbers (`0`), `True`/`False` and even functions.

Type **`x = 'Hello world!'`** in the box below, then press **`Shift + Enter`** to execute the code.

Nothing happened... That's because we've only just created the variable `x`. Now to show the value of the variable, we need to type **`x`**. Again, press **`Shift + Enter`** to execute the code.
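As a rough sketch of the idea above, here are a few variables of different types. The names `count` and `is_ready` are only illustrative and are not used elsewhere in this notebook.

```python
# Assign values of different types to variables (names are illustrative).
x = 'Hello world!'   # a string
count = 0            # a number
is_ready = True      # a boolean (True/False)

# In a notebook, a bare `x` on the last line of a cell displays its value;
# print() shows the value anywhere in a cell.
print(x)
```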

### Functions
Functions can take on any name - you get to choose - but the syntax for defining a function is always the same. Functions can also operate on one or more `variables` that get defined when you initially create the function. These variables are a placeholder for what to do with any data that gets sent to the function when it is used.

Python uses indentation instead of `{` and `}` like in JavaScript or CSS. So be careful how you indent your code.

Below, type **`def myFunction(y):`** on the first line, then hit `Enter` and type **`return y + 3`**. The second line should be automatically indented. Press **`Shift + Enter`** to save that function.

In the box below that, type **`myFunction(3)`** - what do you expect it to return when you press **`Shift + Enter`**?
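For comparison, here is a small sketch of a different function with two placeholder variables. The name `add_numbers` is hypothetical and chosen only for this example; note how the body is indented under the `def` line.

```python
# A hypothetical function with two placeholder variables, a and b.
def add_numbers(a, b):
    # The indented line below is the body of the function.
    return a + b

# The values 2 and 5 are sent to the function and take the place of a and b.
result = add_numbers(2, 5)
print(result)
```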

### Logic
Logic is the last piece of the programming foundation: it tests a comparison and, depending on whether that comparison is `True` or `False`, runs one outcome or the other.

We've put the basic structure in the box below, but you need to add a comparison in the `()` to test: e.g. **`(5 > 3)`** or **`('a' == 'a')`**. Press **`Shift + Enter`** to run the logic below.

In [None]:
if ():
   print('The comparison is True')
else:
   print('The comparison is False')
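If you'd like to see a comparison in action first, here is a minimal sketch. The variable `temperature` and its value are made up for illustration.

```python
# Each comparison evaluates to either True or False.
print(5 > 3)
print('a' == 'b')

temperature = 30  # illustrative value

# The branch that runs depends on the result of the comparison.
if (temperature > 25):
    message = 'The comparison is True'
else:
    message = 'The comparison is False'
print(message)
```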

### Packages
Packages are the secret sauce of Python - they are the libraries of pre-existing code that we bring in to perform complex functions and streamline our code. To use a package, we have to import it before accessing its contents. To do that, we just type `import package_name`.

Let's start by importing the data scientist's best friend, `numpy`. It is the package for "scientific computing", which means it lets us manipulate large arrays of numbers.

To save space in our code when we use a package, we can shorten its name when we import it by typing **`import package_name as short_name`**. Let's import `numpy`, giving it the shortened name `np`, as is common practice.

If it worked, nothing should have happened.

So let's test it out. Using numpy functions, create an array of the following numbers: `[1,2,3,4]`, and take the log of that array. To use a function from a package, type `package_name.function_name`.

Your output should look like this:

`array([ 0.        ,  0.69314718,  1.09861229,  1.38629436])`
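The same `package_name.function_name` pattern works for any numpy function. As a sketch, here it is with a different function (`np.sqrt`) and a different array, so the exercise above is still yours to solve.

```python
import numpy as np

# Build an array, then apply a numpy function to every element at once.
squares = np.array([1, 4, 9, 16])
roots = np.sqrt(squares)  # square root of each element
print(roots)
```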

### Wrap Up
So that's it!  Variables, functions, logic and packages are the building blocks of programming in any language.

Now that you've got a handle on those, we're going to get a bit more complicated working with our data.

*Note: Just like above, you'll need to press `Shift + Enter` to run any code in an `In [ ]:` box.*

## <font color='green'>You're now finished with this section! Let your facilitators know, and have a break.</font>

![done](https://media.giphy.com/media/XreQmk7ETCak0/giphy.gif)

## 1. Sourcing Data
To begin working with our data, let's use an API called [Quandl](https://www.quandl.com) to bring in stock data.

We need to first `import quandl` to get the Quandl library of functions, then set our API key. The key for today is **`Byjzu4U8rmR1iEhZnp7V`** - copy and paste that between the `""` below.

In [None]:
!pip install quandl # install quandl
!pip install --upgrade pandas # upgrade pandas (some housekeeping)

import quandl
quandl.ApiConfig.api_key = ""

Now that we've got our connection to quandl, let's pull a single stock (`AAPL`) and store that in a variable called `data`.

Add **`WIKI/AAPL`** between the `""` below.

In [None]:
data = quandl.get("", rows=5)

To find out the type of data, we can type **`print(type(data))`**.

And if we want to look at the data itself, we can type **`data`** below.

To assess the health of each stock, let's find its `Close` price. If you look at the output above, that's the 4th column.

Based on Quandl's [API documentation](https://docs.quandl.com/docs/time-series-2), we can extract just that column by adding **`.4`** after `WIKI/AAPL` to get **`WIKI/AAPL.4`**:

In [None]:
data = quandl.get("WIKI/AAPL", rows=5, collapse='monthly')
data

That's great! But we want to show data for the last 10 years.

First, we need to have Python tell us what today's date is and subtract 10 years from that. Replace `#TODO` with **`print(start_date)`** after the definition to check that the value makes sense:

In [None]:
import datetime
start_date = (datetime.datetime.now() - datetime.timedelta(days=10*365)).strftime('%Y-%m-01')
#TODO
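To see what each piece of that line does, here is a sketch using a fixed date instead of `datetime.datetime.now()` (an assumption made so the result is the same every time you run it). The names `today`, `ten_years_ago` and `example_start` are illustrative.

```python
import datetime

# A fixed "today" keeps the example reproducible; the notebook uses now().
today = datetime.datetime(2017, 6, 15)

# Ten years is approximated as 10 * 365 days, so leap days are ignored.
ten_years_ago = today - datetime.timedelta(days=10 * 365)

# '%Y-%m-01' keeps the year and month but pins the day to the 1st.
example_start = ten_years_ago.strftime('%Y-%m-01')
print(example_start)
```

Because the day is pinned to the 1st of the month, the few days lost by ignoring leap years don't matter here.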

Now we need to add a new parameter to our Quandl get request. After `collapse='monthly',`, replace `#TODO` with **`start_date=start_date`** inside the parentheses.

Then, to save space in the notebook, we're only going to show the top five rows of data using `data.head(5)`.

How could you show the top 10 rows instead?

In [None]:
data = quandl.get("WIKI/AAPL.4", rows=120, collapse='monthly',
                  #TODO
                 )
data.head(5)

Python has a number of great visualization tools available.

Let's do a quick visualization of the data to see if it looks right. Run the code below:

In [None]:
%matplotlib inline
!pip install mpld3 # install a package to let us zoom into our plots
import mpld3
mpld3.enable_notebook()

ax = data.plot()

Uh-oh! Looks like there's a problem: There's a big drop in AAPL stock in 2014! **If you don't see this, check with a facilitator**

Why?!  Well, a quick Google shows they [split their stock](https://www.washingtonpost.com/news/the-switch/wp/2014/06/09/apples-stock-price-just-dropped-more-than-500-a-share-but-dont-panic/). 

Luckily, Quandl has accounted for that. Instead of `Close`, we'll need to use the `Adjusted Close` price (column 11) from Quandl.

Modify the code below to get the **11**th column instead of the 4th.

In [None]:
data = quandl.get("WIKI/AAPL.4", collapse='monthly', start_date = '2007-01-01')

# Now plot the data:
data.plot()

Great! Let's now focus on the full portfolio.


## <font color='green'>You're now finished with this section! Let your facilitators know, and have a break.</font>

![done](https://media.giphy.com/media/3o7TKLpxzkbvjwEgSc/giphy.gif)

## 2. Sourcing more Data

Now that we have access to stock data, let's pull in our list of Warren's 2003 acquisitions.

First, we need to bring in the cleaned CSV file we exported from Open Refine, and store it as a variable. Let's call it `buffett`.

Now let's take the output of the Open Refine scrubbing step and replace **`your_clean_csv.csv`** with that output:

In [None]:
from pandas import read_csv
buffett = read_csv('/resources/data/your_clean_csv.csv')
buffett.head(5)

Next we want to choose some of those stocks for our analysis. We could either type out the ticker symbol (e.g. `MSFT`) by hand, or we could refer to the `buffett` variable in order to select the ticker symbols. Let's choose the first 3 stocks from `buffett`. Replace `buffett['TICKER'][0:0]` with `buffett['TICKER']`**`[0:3]`**. If your column is called something different (like `Ticker`) then update accordingly:

In [None]:
symbols = [ 'WIKI/%s.11' % ticker for ticker in buffett['TICKER'][0:0] ]
symbols
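The list comprehension above builds one Quandl code per ticker. As a sketch of how that works, here it is on a plain list; the `tickers` list is a hypothetical stand-in for the `buffett['TICKER']` column.

```python
# Hypothetical tickers standing in for buffett['TICKER'].
tickers = ['MSFT', 'AAPL', 'KHC', 'KO']

# '%s' is replaced by each ticker in turn; '.11' asks for column 11 (Adj. Close).
# The slice [0:3] keeps the first three items (positions 0, 1 and 2).
example_symbols = ['WIKI/%s.11' % ticker for ticker in tickers[0:3]]
print(example_symbols)
```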

Good job - we're now ready to pull in the historical stock data from quandl!

Let's save our progress by pulling this data into a new dataframe, `data2`. This means that if we run into any problems with `data2`, we won't affect our original `data`:

In [None]:
data2 = quandl.get(symbols, rows=121, collapse='monthly', start_date=start_date)
data2.head(5)

You might see `NaN` ("Not a Number") where you'd expect to see numbers. That just means we have some missing data; we won't worry too much about that at this point. 

Let's have a look at the end of the data. To do this we look at the "tail". Type **`data2.tail(5)`** below and run the cell:

Let's clean up the column names so they show just the ticker symbol, by removing `WIKI/` and ` - Adj. Close` from each column name.

We can iterate over the columns using the `for col in data2.columns` syntax. To look at the final 5 rows of our dataframe, **replace `#TODO` with `data2.tail(5)`**:

In [None]:
data2.columns = [col.replace(' - Adj. Close','').replace('WIKI/','') for col in data2.columns]
#TODO
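Here is a sketch of the same clean-up on a plain list of names, so you can see what each `replace` does. The column names below are hypothetical examples in the style Quandl returns for multiple symbols.

```python
# Hypothetical column names in the 'WIKI/<ticker> - Adj. Close' style.
columns = ['WIKI/MSFT - Adj. Close', 'WIKI/AAPL - Adj. Close']

# Each replace() swaps the matching text for an empty string, i.e. removes it.
cleaned = [col.replace(' - Adj. Close', '').replace('WIKI/', '') for col in columns]
print(cleaned)
```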

Let's have another quick look at the data, to make sure everything looks good.

Add the code to plot `data2` (hint: you've used similar code above already for `data`):

Explore the data by using the icons in the bottom left corner of the plot. 

It looks like we have some incomplete data.

Further investigation reveals that we only have data for Kraft Heinz (KHC) from July 2015 onwards. We'll need to remember this for later, as this might affect our analysis.

## <font color='green'>You're now finished with this section! Let your facilitators know, and have a break.</font>

![done](https://media.giphy.com/media/26FL2NwYBOq3Z6C6Q/giphy.gif)

## 3. Analysis

## <font color='orange'>Please wait for your instructor to begin the Analysis section before proceeding</font>

Data analysis is all about answering questions. In this case Warren has one question - what will the prices of his stocks be in the future?

To answer that question we are going to build our own Time Series Forecasting model. This model will use the historic data that we have collected to forecast future prices. We'll test it on one stock's data as we go.

Let's start by creating a new dataframe called `apple`, which will just be a copy of our existing dataframe called `data`. We'll then rename the one column of data and plot it to have a look at the data. Run the next cell without editing it and look at the stock data again:

In [None]:
apple = data.copy()
apple.columns = ['close']  #renaming "Adj. close" to something simpler
apple.plot()

What observations can we make about the behavior of the stock price that might help our predictions?

When looking at time series data, it's important to look for patterns that our model can try to use to make a prediction. 
We do this by asking four central questions:

 1. Is there a **Trend**? The direction that the data is headed in, increasing or decreasing. Data with no up or down trend is said to be "stationary".
 2. Is there **Changing Variance**? Time series data tends to fluctuate a lot. Does the scale of those fluctuations change over time?
 3. Is there **Seasonality**? Patterns associated with days, weeks, or months.
 4. Are there **Long-run Cycles**? Patterns occurring beyond the scale of months, not described by seasonal change.
 
So first of all, let's handle any **trend** we've noticed by differencing. Create a new column on our `apple` data called `'diff_close'`, and make it equal to the difference between the `close` column and the `close` column `shift`ed forward by one row. This will give us the change from each month to the next.

Now let's plot the new column, using `diff_close` as the y axis.

Does it look stationary now? If not, you can always difference again!
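If differencing with `shift` is new to you, here is a sketch on a tiny made-up dataframe. The name `toy` and its values are illustrative only; your own code should work on `apple`.

```python
import pandas as pd

# A tiny, made-up price series with a clear upward trend.
toy = pd.DataFrame({'close': [10.0, 12.0, 15.0, 19.0]})

# shift(1) moves every value down one row, so each row lines up
# with the previous row's value. The first row has no previous value.
toy['diff_close'] = toy['close'] - toy['close'].shift(1)
print(toy)
```

The first difference is `NaN` because there is nothing before the first row to subtract.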

Once you feel happy that your data is stationary, we need to make sure that the variance is stable too. Let's start by making a new column called `log_close` that is, you guessed it, the log of the `close` column! Then plot this new column as the y axis.

Now let's finish our "trend-cleaning" off by calculating the difference of the `log_close` column and store it in another new column called `difflog_close`. Plot this new column as the y axis. 

How should we interpret this graph? What do you notice?
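One way to think about it: differences of logs measure relative (percentage-like) change rather than change in dollars. The sketch below uses made-up prices that grow by 10% each step to show this; the names are illustrative.

```python
import math

# Made-up prices that grow by exactly 10% at each step.
prices = [100.0, 110.0, 121.0]

# Raw differences get bigger as the price level rises...
raw_diffs = [b - a for a, b in zip(prices, prices[1:])]

# ...but differences of logs stay the same for the same relative change.
log_diffs = [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

print(raw_diffs)
print(log_diffs)
```

Both log differences are (up to rounding) equal to `math.log(1.1)`, which is why taking logs helps keep the size of the fluctuations steady as prices grow.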

Now that we have removed the changing variance, and "induced" stationarity, we are ready to apply our ARIMA model.

We first need to import the package, then we can create the model. These two lines have been completed for you. To generate the model, we need to create a variable called `results_ARIMA` and set it equal to the fitted model. Write the code to do that in place of `#TODO`:

*Note: to create a fitted model, call `model.fit()`*

In [None]:
!pip install statsmodels
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(apple.log_close, order=(5, 1, 0))

#TODO

We now have a fitted ARIMA model, easy as that!

To see what we have created we plot our model, which is already written on the first line below. For comparison to the original data, write code to plot the `apple.difflog_close` data in place of `#TODO` on the second line.

In [None]:
results_ARIMA.fittedvalues.plot(color='red')
#TODO

We should be able to see the red line (our model) following the movement of the blue line (the true data), but at a smaller scale. This is because a lot of the extreme jumps in the data are unpredictable random noise - such is the nature of stocks.

Now that we've seen our model in action, it is time to use it to create predictions.

Create a variable called `log_forecast` and set it equal to the forecasts for the next 3 steps, then `print(log_forecast)` on the second line:

*Note: `results_ARIMA.forecast()` returns multiple values; we only want the first one, indexed by `[0]`.*

Great work!

Now, to get those predictions back into something useful, we need to "unlog" them and add them to the end of our original data. To reward you, and save you some time, we've put the process into a function called `ARIMAforecast`, which will take any series of data, parameters, and number of forecasting steps, and predict the future using ARIMA.

Simply run the code below to see the results!

The last few datapoints on the curve are the ARIMA predictions that you generated.

In [None]:
def ARIMAforecast(series, params = [5,1, 0], steps = 4):   
    
    X = series                                                     #take the data
    log_X = np.log(X)                                              #log the data
    model = ARIMA(log_X, order=(params[0], params[1], params[2]))  #create ARIMA model
    results_ARIMA = model.fit(disp=0)                              #fit ARIMA model
    forecast = results_ARIMA.forecast(steps = steps)[0]            #forecast 'steps' number of steps
    unlog_forecast = np.exp(forecast)                              #"unlog" the forecast
    full_predict = np.append(X,unlog_forecast)                     #append it to the end of the column
    return full_predict.tolist()                                   #return as list

import matplotlib.pyplot as plt                                    #import plotting package
plt.plot(ARIMAforecast(apple.close))                               #plot results
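The "unlog" and append steps inside `ARIMAforecast` can be tried on their own. In this sketch there is no model at all: `log_forecast` is a made-up stand-in for what a fitted model would return on the log scale, and the prices are illustrative.

```python
import numpy as np

# Made-up price history.
history = np.array([100.0, 110.0, 121.0])

# Stand-in for a model's one-step forecast on the log scale (not a real forecast).
log_forecast = np.array([np.log(133.1)])

# np.exp undoes np.log, bringing the forecast back to the price scale.
unlog_forecast = np.exp(log_forecast)

# Append the forecast to the end of the history and convert to a plain list.
full_predict = np.append(history, unlog_forecast).tolist()
print(full_predict)
```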

Congratulations! We have built a model that can predict the future.

This is a simplistic introduction to time series analysis, although these methods are the foundation of many of the cutting-edge techniques in use today. If you'd like to investigate ARIMA further, we recommend following the introduction here:

https://datascience.ibm.com/exchange/public/entry/view/815137c868b916821dec777bdc23013c

## <font color='green'>You're now finished with this section! Let your facilitators know, and have a break.</font>

![done](https://media.giphy.com/media/R6aNZ3Uc1aR1K/giphy.gif)

## 4. Preparing for Visualization

Let's get our data ready for a more snazzy visualization.

Based on the [D3.js show reel](https://bl.ocks.org/mbostock/1256572), we need the data to be arranged like this:

```
symbol,date,price
MSFT,Jan 2000,39.81
MSFT,Feb 2000,36.35
MSFT,Mar 2000,43.22
MSFT,Apr 2000,28.37
MSFT,May 2000,25.45
```

### 4.1 Replotting with forecasts

First, we need to calculate our predictions for each stock (using the ARIMA parameters set in `master_params`) and append them to the end of our dataframe. Due to data formatting, this requires a few extra lines of code, which we have completed below. We will predict 6 months ahead.

Read through the code below and then print the last 10 rows of `data3` to validate that the code works. Write your code in place of `#TODO`:

In [None]:
import pandas as pd

numberofmonths = 6 # how many predictions do we want to make?

# Here we need to add numberofmonths onto the end of the dataframe using data.reindex:
data3 = data2.reindex(pd.date_range(datetime.datetime.now().date(), periods=121 + numberofmonths, freq='MS')+pd.DateOffset(days=-1, months=-120),fill_value="NaN")

master_params = [5,1,0]

for i, column in enumerate(data3):
    if data3[column].isnull().values.any(): # select only the range of dates for which we have data
        last_nan = np.where(data3[column].isnull().values)[0][-1]
        stock = data3[column].iloc[(last_nan+1):-numberofmonths].values
    else:
        stock = data3[column].iloc[:-numberofmonths].values
    
    stock = stock.tolist()
    forecast = ARIMAforecast(stock, params = master_params, steps = numberofmonths) # create our forecast
    
    col_forecast = pd.Series(forecast[-numberofmonths:], index = pd.date_range(start= datetime.datetime.now().date(), periods=numberofmonths, freq='MS') + pd.DateOffset(days=-1, months=1))
    data3[column].iloc[-numberofmonths:] = col_forecast
    

#TODO
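The date arithmetic in the cell above is the densest part, so here is a sketch of it in isolation. It uses a fixed start date rather than today's date (an assumption, so the output is reproducible), and the names `month_starts` and `month_ends` are illustrative.

```python
import pandas as pd

# freq='MS' yields month starts, beginning with the first one
# on or after the start date (here, 1 July 2017).
month_starts = pd.date_range(start='2017-06-15', periods=3, freq='MS')

# Adding one month and then subtracting one day turns each
# month start into the last day of that same month.
month_ends = month_starts + pd.DateOffset(days=-1, months=1)
print(month_ends)
```

This appears to be how the new index is lined up with month-end dates like the ones in the monthly Quandl data.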

How did your model fare on the other stocks? Let's visualize the results!

Below, we create two overlapping plots, one with the existing data and one with our forecasts, marked in red. Run the code without editing to have a look:

In [None]:
fig, ax = plt.subplots()
_ = ax.plot(data3.iloc[:-numberofmonths])
_ = ax.plot(data3.iloc[-(numberofmonths+1):], color = "Red")

What do you notice about our forecasts?

### 4.2 Exporting for further visualization

Next, we want to rearrange all of the data to be more like what we need for D3. pandas comes with a method called `unstack()` that does just that!

Below, make a variable called **`datalist`** and set it equal to **`data3.unstack()`**.

Then show the top 10 rows of `datalist`. (Hint: you've used this type of function before.)
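To get a feel for what `unstack()` does to a dataframe like ours, here is a sketch on a tiny made-up table. The name `wide` and the `AAPL` prices are illustrative, not real data.

```python
import pandas as pd

# A small "wide" table: one column per symbol, one row per date.
wide = pd.DataFrame(
    {'MSFT': [39.81, 36.35], 'AAPL': [25.94, 28.66]},
    index=pd.to_datetime(['2000-01-31', '2000-02-29']),
)

# unstack() turns it into one long Series indexed by (symbol, date).
long_format = wide.unstack()
print(long_format)
```

Each row of the result now holds a single price labelled by its symbol and date, which is much closer to the `symbol,date,price` layout that D3 expects.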

Almost there! We've done the hard work, now we just need it in a CSV format.

We've written this bit of code to clean the data a bit more and to output the data in a comma-separated value format. Read through it, then run it without editing.

In [None]:
csv = datalist.to_csv(header=True, index_label=['symbol','date','price'], date_format='%b %Y', index=True)
csv = csv.replace("price,0","price") # remove addition of ',0' on first line
print(csv)

We could copy & paste this into a new CSV file for our D3.js visualization, or we could write code to do that for us.

To make the downloadable file, we've got to bring in a library called `base64` which will encode the file. Then we use that to create the file and add a bit of HTML to make it so we can download the file.

Run the cell below, and download the resulting **`stocks.csv`** file.

In [None]:
import base64
from IPython.display import HTML

b64 = base64.b64encode(csv.encode())
payload = b64.decode()
html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{filename}</a>'
html = html.format(payload=payload,title="stocks.csv",filename="stocks.csv")
HTML(html)
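If you're curious what the encoding step is doing, this sketch runs it on a two-line made-up CSV string and then decodes it again. The variable names here are illustrative and deliberately different from the ones in the cell above.

```python
import base64

# A tiny, made-up CSV string in the layout D3 expects.
csv_text = "symbol,date,price\nMSFT,Jan 2000,39.81\n"

# Encode the text as base64 so it can be embedded directly in a link.
encoded = base64.b64encode(csv_text.encode()).decode()

# The href is a "data:" URL: the file contents travel inside the link itself.
link = '<a download="{filename}" href="data:text/csv;base64,{payload}">{filename}</a>'.format(
    filename="stocks.csv", payload=encoded)
print(link)

# Decoding the payload recovers the original text exactly.
decoded = base64.b64decode(encoded).decode()
print(decoded == csv_text)
```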


Now you can import this file into your D3.js visualization.

## <font color='green'>You're done!</font>

![done](https://media.giphy.com/media/15BuyagtKucHm/giphy.gif)