# Using DataFrames

This lesson introduces:

* Computing returns (percentage change)
* Basic mathematical operations on DataFrames

This first cell loads the data used in this lesson.

In [47]:
# Setup: Load prices
import pandas as pd
prices = pd.read_hdf("data/dataframes.h5", "prices")
sep_04 = pd.read_hdf("data/dataframes.h5", "sep_04")
goog = pd.read_hdf("data/dataframes.h5", "goog")

In [67]:
prices.columns = ["SPY", "AAPL", "GOOG"]
prices

Unnamed: 0,SPY,AAPL,GOOG
2018-09-04,289.81,228.36,1197.0
2018-09-05,289.03,226.87,1186.48
2018-09-06,288.16,223.1,1171.44
2018-09-07,287.6,221.3,1164.83
2018-09-10,288.1,218.33,1164.64
2018-09-11,289.05,223.85,1177.36
2018-09-12,289.12,221.07,1162.82
2018-09-13,290.83,226.41,1175.33
2018-09-14,290.88,223.84,1172.53
2018-09-17,289.34,217.88,1156.05


## Problem: Compute Returns

Compute returns using 

```python
returns = prices.pct_change()
```

which computes the percentage change.

Additionally, extract returns for each name using 

```python
spy_returns = returns["SPY"]
```
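
As a small check of what `pct_change` does, the sketch below rebuilds it from `shift` using the first three SPY and AAPL prices from the table above. The names `sample`, `manual`, and `spy_sample` are illustrative only and are not part of the lesson data.

```python
import pandas as pd

# First three rows of the SPY and AAPL prices shown above
sample = pd.DataFrame(
    {"SPY": [289.81, 289.03, 288.16], "AAPL": [228.36, 226.87, 223.10]},
    index=pd.to_datetime(["2018-09-04", "2018-09-05", "2018-09-06"]),
)

# pct_change divides each price by the previous row's price and subtracts 1
returns_sample = sample.pct_change()
manual = sample / sample.shift(1) - 1

# The first row has no previous price, so it is NaN in both versions
print(returns_sample)
print(manual)

# Selecting one column with a plain string label returns a Series
spy_sample = returns_sample["SPY"]
```

Note that the column label is the string `"SPY"`, not the tuple `("SPY",)` that a trailing comma inside the brackets would create.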

In [68]:
returns = prices.pct_change()
returns.head()




Unnamed: 0,SPY,AAPL,GOOG
2018-09-04,,,
2018-09-05,-0.002691,-0.006525,-0.008789
2018-09-06,-0.00301,-0.016617,-0.012676
2018-09-07,-0.001943,-0.008068,-0.005643
2018-09-10,0.001739,-0.013421,-0.000163


In [69]:
returns = returns.dropna()
returns.head()

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,-0.002691,-0.006525,-0.008789
2018-09-06,-0.00301,-0.016617,-0.012676
2018-09-07,-0.001943,-0.008068,-0.005643
2018-09-10,0.001739,-0.013421,-0.000163
2018-09-11,0.003297,0.025283,0.010922


In [75]:
spy_returns = returns["SPY"]


In [8]:
# Select each column with a plain string label.
# A trailing comma, as in returns["SPY",], makes the key a tuple and fails.
spy_returns = returns["SPY"]
aapl_returns = returns["AAPL"]
goog_returns = returns["GOOG"]


## Problem: Compute Log Returns

Compute log returns using

```python
import numpy as np

log_returns = np.log(prices).diff()
```

Log returns are the first difference of the natural log of the prices. Mathematically, this is
$r_{t}=\ln\left(P_{t}\right)-\ln\left(P_{t-1}\right)=\ln\left(\frac{P_{t}}{P_{t-1}}\right)\approx\frac{P_{t}}{P_{t-1}}-1$.
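
To make the approximation concrete, the following sketch compares simple and log returns for the first three SPY prices from the table above. The names `spy`, `simple`, and `log_ret` are illustrative.

```python
import numpy as np
import pandas as pd

# First three SPY prices from the table above
spy = pd.Series([289.81, 289.03, 288.16])

simple = spy.pct_change().dropna()
log_ret = np.log(spy).diff().dropna()

# Exact relationship: log return = ln(1 + simple return)
assert np.allclose(log_ret, np.log1p(simple))

# For small returns the two are close; the gap is roughly r**2 / 2
print((simple - log_ret).abs().max())
```

Because $\ln(1+r)\le r$, the log return is never larger than the simple return, and the two diverge as returns grow in magnitude.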

In [9]:
import numpy as np
log_prices = np.log(prices)
log_prices.head()

log_returns = log_prices.diff().dropna()
log_returns

# Takes the log of prices, applies the first difference, and drops missing values.
# This can also be written as a single expression:
# log_returns = np.log(prices).diff().dropna()

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,-0.002695,-0.006546,-0.008827
2018-09-06,-0.003015,-0.016757,-0.012757
2018-09-07,-0.001945,-0.008101,-0.005659
2018-09-10,0.001737,-0.013512,-0.000163
2018-09-11,0.003292,0.024969,0.010863
2018-09-12,0.000242,-0.012497,-0.012427
2018-09-13,0.005897,0.023868,0.010701
2018-09-14,0.000172,-0.011416,-0.002385
2018-09-17,-0.005308,-0.026987,-0.014155
2018-09-18,0.005411,0.001651,0.004462


## Basic Mathematical Operations

|  Operation            | Symbol | Precedence |
|:----------------------|:------:|:----------:|
| Parentheses           | ()     | 4          |
| Exponentiation        | **     | 3          |
| Multiplication        | *      | 2          | 
| Division              | /      | 2          |
| Floor division        | //     | 2          |
| Modulus               | %      | 2          | 
| Matrix multiplication | @      | 2          |
| Addition              | +      | 1          |
| Subtraction           | -      | 1          |

**Note**: Higher precedence operators are evaluated first, and ties are
evaluated left to right, except for `**`, which is evaluated right to left.
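
A few plain-Python expressions illustrate these rules. The variable names are chosen only for this sketch.

```python
# Multiplication binds tighter than addition
mul_first = 2 + 3 * 4          # 2 + 12
# Parentheses override the default order
parens_first = (2 + 3) * 4     # 5 * 4
# Operators with equal precedence are evaluated left to right
left_to_right = 7 // 2 * 2     # (7 // 2) * 2
# Exponentiation groups right to left
right_to_left = 2 ** 3 ** 2    # 2 ** (3 ** 2)
# Exponentiation also binds tighter than a leading minus sign
neg_power = -2 ** 2            # -(2 ** 2)

print(mul_first, parens_first, left_to_right, right_to_left, neg_power)
```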


## Problem: Scalar Operations
1. Add 1 to all returns
2. Square the returns
3. Multiply the price of Google by 2. 
4. Extract the fractional return using floor division and modulus

In [13]:
returns_1 = returns + 1


returns_2 = returns ** 2

# Use a plain string label; store the result separately so prices is unchanged
goog_doubled = 2 * prices["GOOG"]

In [79]:
returns // 1
# floor division: rounds toward negative infinity, so small negative returns become -1.0

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,-1.0,-1.0,-1.0
2018-09-06,-1.0,-1.0,-1.0
2018-09-07,-1.0,-1.0,-1.0
2018-09-10,0.0,-1.0,-1.0
2018-09-11,0.0,0.0,0.0
2018-09-12,0.0,-1.0,-1.0
2018-09-13,0.0,0.0,0.0
2018-09-14,0.0,-1.0,-1.0
2018-09-17,-1.0,-1.0,-1.0
2018-09-18,0.0,0.0,0.0


In [80]:
returns % 1
# % modulus operator - remainder after division by 1

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,0.997309,0.993475,0.991211
2018-09-06,0.99699,0.983383,0.987324
2018-09-07,0.998057,0.991932,0.994357
2018-09-10,0.001739,0.986579,0.999837
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,0.987581,0.98765
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,0.988649,0.997618
2018-09-17,0.994706,0.973374,0.985945
2018-09-18,0.005426,0.001652,0.004472


In [81]:
returns - returns // 1


Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,0.997309,0.993475,0.991211
2018-09-06,0.99699,0.983383,0.987324
2018-09-07,0.998057,0.991932,0.994357
2018-09-10,0.001739,0.986579,0.999837
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,0.987581,0.98765
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,0.988649,0.997618
2018-09-17,0.994706,0.973374,0.985945
2018-09-18,0.005426,0.001652,0.004472
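
The outputs above show values near 0.99 for negative returns. The sketch below, using made-up values that are exactly representable in floating point, shows why: floor division rounds toward negative infinity, so the remainder of a small negative number is one plus that number.

```python
import pandas as pd

# Illustrative values, not taken from the lesson data
r = pd.Series([-0.25, 0.5, 1.75])

whole = r // 1   # floor of each value
frac = r % 1     # remainder after division by 1

# -0.25 // 1 is -1.0, so the remainder is -0.25 - (-1.0) = 0.75
print(whole.tolist())
print(frac.tolist())

# Floor plus remainder always recovers the original value
assert (whole + frac == r).all()
assert (r - r // 1 == frac).all()
```

If the aim is the signed fractional part instead, a different approach such as subtracting the truncated value would be needed; that is outside what the problem asks.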


## Problem: Addition of Series
Add the returns on SPY to those of AAPL.


In [15]:

spy_aapl = returns["SPY"] + returns["AAPL"]
spy_aapl
# creates a new series with the same index (here, of dates)


2018-09-05   -0.009216
2018-09-06   -0.019628
2018-09-07   -0.010011
2018-09-10   -0.011682
2018-09-11    0.028580
2018-09-12   -0.012177
2018-09-13    0.030070
2018-09-14   -0.011179
2018-09-17   -0.031920
2018-09-18    0.007078
2018-09-19   -0.005510
dtype: float64
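
Addition works row by row here because both Series share the same index. As a minimal sketch with made-up data, the example below shows that pandas aligns on index labels rather than position, and that labels present in only one operand produce `NaN`.

```python
import pandas as pd

# Illustrative Series with partially overlapping, differently ordered labels
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

total = x + y
# "b" and "c" appear in both, so they are summed by label;
# "a" and "d" appear in only one operand, so they are NaN
print(total)
```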

## Problem: Combining methods and mathematical operations
Using only basic mathematical operations compute the 
correlation between the returns on AAPL and SPY. 

In [23]:
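# Number of observations
nobs = aapl_returns.shape[0]
# Demean each series
a = aapl_returns - aapl_returns.mean()
s = spy_returns - spy_returns.mean()
# Sample variances and covariance (dividing by nobs - 1)
aapl_var = a.dot(a) / (nobs - 1)
spy_var = s.dot(s) / (nobs - 1)
aapl_spy_cov = a.dot(s) / (nobs - 1)
# Correlation is the covariance scaled by the product of the standard deviations
c = aapl_spy_cov / np.sqrt(aapl_var * spy_var)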
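
One way to check a hand-computed correlation is to compare it with the built-in `Series.corr`. The sketch below does this for the first five AAPL and SPY returns shown earlier; the names `aapl`, `spy`, `manual`, and `builtin` are illustrative.

```python
import numpy as np
import pandas as pd

# First five AAPL and SPY returns from the table above
aapl = pd.Series([-0.006525, -0.016617, -0.008068, -0.013421, 0.025283])
spy = pd.Series([-0.002691, -0.003010, -0.001943, 0.001739, 0.003297])

nobs = aapl.shape[0]
a = aapl - aapl.mean()
s = spy - spy.mean()

# Covariance divided by the square root of the product of the variances
manual = (a.dot(s) / (nobs - 1)) / np.sqrt(
    (a.dot(a) / (nobs - 1)) * (s.dot(s) / (nobs - 1))
)

# Built-in Pearson correlation for comparison
builtin = aapl.corr(spy)
assert np.isclose(manual, builtin)
```

The `nobs - 1` factors cancel in the ratio, so the result does not depend on which divisor is used as long as it is used consistently.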

## Problem: Addition of DataFrames
Construct a `DataFrame` that contains only the SPY column from `returns`
and add it to the `returns` `DataFrame`.

In [41]:
spy_returns_df = pd.DataFrame({"SPY": spy_returns})
spy_returns_df

Unnamed: 0,SPY
2018-09-05,-0.002691
2018-09-06,-0.00301
2018-09-07,-0.001943
2018-09-10,0.001739
2018-09-11,0.003297
2018-09-12,0.000242
2018-09-13,0.005914
2018-09-14,0.000172
2018-09-17,-0.005294
2018-09-18,0.005426


In [31]:
returns

Unnamed: 0,SPY,AAPL,GOOG
2018-09-05,-0.002691,-0.006525,-0.008789
2018-09-06,-0.00301,-0.016617,-0.012676
2018-09-07,-0.001943,-0.008068,-0.005643
2018-09-10,0.001739,-0.013421,-0.000163
2018-09-11,0.003297,0.025283,0.010922
2018-09-12,0.000242,-0.012419,-0.01235
2018-09-13,0.005914,0.024155,0.010758
2018-09-14,0.000172,-0.011351,-0.002382
2018-09-17,-0.005294,-0.026626,-0.014055
2018-09-18,0.005426,0.001652,0.004472


In [43]:
spy_returns_df + returns
# Columns are aligned by label: AAPL and GOOG are missing from spy_returns_df,
# so they are NaN in the sum, while SPY holds twice the SPY return.
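
The alignment rule for `DataFrame` addition can be seen on a small example built from two rows of the returns shown earlier. The names `sample`, `spy_only`, `total`, and `filled` are illustrative, and the use of `DataFrame.add` with `fill_value` is an optional variation, not part of the problem.

```python
import pandas as pd

# Two rows of returns from the table above
sample = pd.DataFrame(
    {"SPY": [-0.002691, -0.003010], "AAPL": [-0.006525, -0.016617]},
    index=["2018-09-05", "2018-09-06"],
)
spy_only = sample[["SPY"]]  # double brackets keep a DataFrame

total = spy_only + sample
# AAPL exists in only one operand, so its column is all NaN;
# SPY exists in both, so it holds twice the SPY return
print(total)

# DataFrame.add with fill_value=0 treats the missing column as zeros instead
filled = spy_only.add(sample, fill_value=0)
print(filled)
```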


## Problem: Non-conformable math

Add the prices in `sep_04` to the prices of `goog`. What happens? 
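
The structure of `sep_04` and `goog` is not shown in this lesson. As a hedged sketch, assume one Series is indexed by ticker and the other by date; the stand-ins `by_ticker` and `by_date` below are hypothetical and reuse prices from the table above.

```python
import pandas as pd

# Hypothetical stand-in for a Series of prices on one date, indexed by ticker
by_ticker = pd.Series([289.81, 228.36, 1197.0], index=["SPY", "AAPL", "GOOG"])
# Hypothetical stand-in for one ticker's prices, indexed by date (as strings)
by_date = pd.Series(
    [1197.0, 1186.48, 1171.44],
    index=["2018-09-04", "2018-09-05", "2018-09-06"],
)

total = by_ticker + by_date
# No label appears in both indexes, so the result has the union of the
# labels and every value is NaN
print(total)
```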

## Problem: Constructing portfolio returns
Set up a 3-element array of portfolio weights 

$$w=\left(\frac{1}{3},\,\frac{1}{3}\,,\frac{1}{3}\right)$$

and compute the return of a portfolio with weight $\frac{1}{3}$ in each security.
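
A minimal sketch of this calculation, using two rows of the returns shown earlier, is given below. The names `sample`, `portfolio`, and `portfolio_alt` are illustrative.

```python
import numpy as np
import pandas as pd

# Two rows of returns from the table above
sample = pd.DataFrame(
    {
        "SPY": [-0.002691, -0.003010],
        "AAPL": [-0.006525, -0.016617],
        "GOOG": [-0.008789, -0.012676],
    },
    index=["2018-09-05", "2018-09-06"],
)

w = np.array([1 / 3, 1 / 3, 1 / 3])

# Matrix product: each row of returns times the weight vector
portfolio = sample @ w

# Equivalent element-wise form: scale each column, then sum across columns
portfolio_alt = (sample * w).sum(axis=1)
print(portfolio)
```

With equal weights the portfolio return is simply the row mean, which gives a quick sanity check.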


## Exercises

### Exercise: Combine math with function

Add 1 to the output of `np.arange` to produce the sequence 1, 2, ..., 10.
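
One possible approach, shown as a sketch: `np.arange(10)` produces 0 through 9, and adding a scalar shifts every element. The name `seq` is illustrative.

```python
import numpy as np

# np.arange(10) is 0, 1, ..., 9; adding 1 is applied element by element
seq = np.arange(10) + 1
print(seq)
```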

### Exercise: Understand pandas math

Use the `Series` and `DataFrame` below to compute the sums 

* `a+b`
* `a+c`
* `b+c`
* `a+b+c`

to understand how pandas treats missing values.

In [82]:
# Setup: Data for exercise
import pandas as pd
import numpy as np

rs = np.random.RandomState(19991231)

idx = ["A","a","B",3]
columns = ["A",1,"B",3]
a = pd.Series([1,2,3,4], index=idx)
b = pd.Series([10,9,8,7], index=columns)
values = rs.randint(1, 11, size=(4,4))
c = pd.DataFrame(values, columns=columns, index=idx)

### Exercise: Math with duplicates

Add the Series `d` to `a` to see what happens with duplicates.

In [83]:
# Setup: Data for exercise

d = pd.Series([10, 101], index=["A","A"])