*This notebook is intellectual property of Auquan and is distributed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License](https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode). Any modification or distribution of this notebook without express permission of Auquan is prohibited and will result in legal prosecution.*


# Expected Value

The expected value of a random variable is the probability-weighted average of all possible values.
When these probabilities are equal, the expected value is the same as the arithmetic mean, defined as the sum of the observations divided by the number of observations:
$$\mu = \frac{\sum_{i=1}^N X_i}{N}$$

where $X_1, X_2, \ldots , X_N$ are our observations.

For example, if a fair six-sided die is rolled many times, we expect each number from 1 to 6 to show up about equally often. So the expected value of a roll is $(1+2+3+4+5+6)/6 = 3.5$.


In [None]:
from __future__ import print_function
import numpy as np
import scipy.stats as stats

# Let's say the random variables x1 and x2 have the following values
x1 = [10,9,8,5,5,6,7,4,3,2]
x2 = x1 + [100]

print ('Mean of x1:', sum(x1), '/', len(x1), '=', np.mean(x1))
print ('Mean of x2:', sum(x2), '/', len(x2), '=', np.mean(x2))

When the probabilities of different observations are not equal, i.e. a random variable $X$ can take value $X_1$ with probability $p_1$, $X_2$ with probability $p_2$, and so on, the expected value of $X$ is the same as the <i>weighted</i> arithmetic mean.
The weighted arithmetic mean is defined as
$$\sum_{i=1}^n p_i X_i $$

where $\sum_{i=1}^n p_i = 1$

This is how we calculated the expected value of trades in the first exercise: $EV(Trade) = \sum P(X)*profit(X)$
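As a minimal sketch of the weighted mean, consider a hypothetical loaded die. The probabilities below are made up for illustration and are not from the exercise. NumPy's `np.average` with its `weights` argument computes the same probability-weighted sum:

```python
import numpy as np

# Hypothetical loaded die: face 6 is three times as likely as any other face
values = np.array([1, 2, 3, 4, 5, 6])
probs = np.array([1, 1, 1, 1, 1, 3]) / 8.0  # probabilities sum to 1

# Expected value as the probability-weighted sum of the values
ev_manual = np.sum(probs * values)

# np.average with weights computes the same quantity
ev_numpy = np.average(values, weights=probs)

print('Expected value (manual sum):', ev_manual)
print('Expected value (np.average):', ev_numpy)
```

With equal weights this reduces to the ordinary arithmetic mean of 3.5.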

The expected value is simply the average of all values obtained if you perform the experiment it represents many times. Over many repetitions, (number of times a value $X$ is obtained) / (total number of times the experiment is performed) approaches the probability $P(X)$. Hence the result $EV = \sum P(X)*X$

This follows from the law of large numbers - the average of the results obtained from a large number of repetitions of an experiment should be close to the expected value, and will tend to become closer as more trials are performed.
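The law of large numbers can be illustrated with a small simulation. This is only a sketch: the seed and the sample sizes are arbitrary choices, and the exact sample means depend on them.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

# Simulate rolls of a fair six-sided die (the upper bound 7 is exclusive)
rolls = rng.integers(1, 7, size=100000)

# The sample mean should drift towards 3.5 as more rolls are included
for n in [10, 100, 1000, 10000, 100000]:
    print('Mean of first %d rolls: %.4f' % (n, rolls[:n].mean()))
```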

### Some properties of expected values that are handy:
* The expected value of a constant is equal to the constant itself $E[c] = c$
* The expected value is linear, i.e $E[aX+bY] = aE[X]+bE[Y]$ 
* If $X \leq Y$ , then $E[X] \leq E[Y]$
* The expected value is not multiplicative, i.e. $E[XY]$ is not necessarily equal to $E[X]E[Y]$. 
  The amount by which they differ is called the covariance, covered in a later notebook:
  $Cov(X,Y)=E[XY]-E[X]E[Y]$.
  If X and Y are uncorrelated, $Cov(X,Y)=0$.
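The linearity and covariance properties above can be checked numerically. The sketch below uses synthetic, deliberately correlated data. The distributions, coefficients and seed are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 100000
X = rng.normal(2.0, 1.0, size=n)
Y = 0.5 * X + rng.normal(0.0, 1.0, size=n)  # Y depends on X, so the two are correlated
a, b = 3.0, -2.0

# Linearity: E[aX + bY] = aE[X] + bE[Y]
lhs = np.mean(a * X + b * Y)
rhs = a * np.mean(X) + b * np.mean(Y)
print('E[aX+bY]      :', lhs)
print('aE[X] + bE[Y] :', rhs)

# Not multiplicative: E[XY] - E[X]E[Y] is the covariance
cov_from_means = np.mean(X * Y) - np.mean(X) * np.mean(Y)
cov_numpy = np.cov(X, Y, ddof=0)[0, 1]  # ddof=0 matches the formula above
print('E[XY] - E[X]E[Y]:', cov_from_means)
print('np.cov          :', cov_numpy)
```

Because `Y` is built from `X`, the covariance is clearly non-zero here, so $E[XY]$ and $E[X]E[Y]$ differ.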


You've already used expected value to solve questions in the first post of this week. For example, in Ex2 of part 1, when you calculate if you should trade or wait at a level, you use the following approach:
- Calculate probability $p$ that level gets better by 1 before it goes back to the same level where you entered position
- Calculate profit if you wait and exit position when level gets better
- Calculate loss if you wait and stock goes back to the same level where you entered position
- EV(Waiting for Next Level) = $p*Profit(Current Level+1) + (1-p)*Loss(Enter Level)$

Now you can compare this to the profit from exiting the position at the current level. You wait if EV(Waiting for Next Level) > Current Profit; otherwise you trade now.
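The decision rule above can be sketched as follows. The function name `ev_wait` and all the numbers are hypothetical and not taken from the exercise. The loss term is assumed to be a signed payoff (zero or negative).

```python
def ev_wait(p, profit_next_level, loss_enter_level):
    """Expected value of waiting: p * profit + (1 - p) * loss (loss is a signed payoff)."""
    return p * profit_next_level + (1 - p) * loss_enter_level

# Hypothetical inputs for illustration only
p = 0.4               # probability the level improves by 1 first
profit_next = 3.0     # profit if we exit one level better
loss_enter = 0.0      # payoff if price returns to the entry level
current_profit = 1.0  # profit from exiting right now

ev = ev_wait(p, profit_next, loss_enter)
decision = 'wait' if ev > current_profit else 'trade now'
print('EV(wait) = %.2f, current profit = %.2f -> %s' % (ev, current_profit, decision))
```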

### Other measures of centrality that are commonly used are:

- **Median**

The number which appears in the middle of the list when it is sorted in increasing or decreasing order, i.e. the value in position $(n+1)/2$ when $n$ is odd and the average of the values in positions $n/2$ and $(n+2)/2$ when $n$ is even. One advantage of using the median to describe data, compared to the mean, is that it is not skewed as much by extremely large or small values.

The median tells us the value that splits the data set in half, but not how much smaller or larger the other values are.

In [None]:
print('Median of x1:', np.median(x1))
print('Median of x2:', np.median(x2))

* **Mode**

The most frequently occurring value in a data set. The mode of a probability distribution is the value x at which its probability mass function (or probability density function, for a continuous distribution) takes its maximum value.

In [None]:
def mode(l):
    # Count the number of times each element appears in the list
    counts = {}
    for e in l:
        if e in counts:
            counts[e] += 1
        else:
            counts[e] = 1
            
    # Return the elements that appear the most times
    maxcount = 0
    modes = set()
    for key in counts:
        if counts[key] > maxcount:
            maxcount = counts[key]
            modes = {key}
        elif counts[key] == maxcount:
            modes.add(key)
            
    if maxcount > 1 or len(l) == 1:
        return list(modes)
    return 'No mode'
    
print ('All of the modes of x1:', mode(x1))

* **Geometric mean**

It is the central tendency of a set of numbers by using the product of their values (as opposed to the arithmetic mean which uses their sum). The geometric mean is defined as the nth root of the product of n numbers:
$$ G = \sqrt[n]{X_1X_2\ldots X_n} $$

for observations $X_i \geq 0$. When all observations are strictly positive, we can also rewrite it as an arithmetic mean using logarithms:
$$ \ln G = \frac{\sum_{i=1}^n \ln X_i}{n} $$

The geometric mean is always less than or equal to the arithmetic mean (when working with nonnegative observations), with equality only when all of the observations are the same.

In [None]:
# Use scipy's gmean function to compute the geometric mean
print ('Geometric mean of x1:', stats.gmean(x1))
print ('Geometric mean of x2:', stats.gmean(x2))

The geometric mean is frequently used with stock returns. If we have stock returns $R_1, \ldots, R_T$ over different times, we use the geometric mean to calculate the average return $R_G$ so that if the rate of return over the whole time period were constant and equal to $R_G$, the final price of the security would be the same as in the case of returns $R_1, \ldots, R_T$.
$$ R_G = \sqrt[T]{(1 + R_1)\ldots (1 + R_T)} - 1$$
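A short sketch of this formula, using made-up per-period returns, shows that compounding at the constant rate $R_G$ reproduces the same total growth as the actual returns, whereas the arithmetic mean return is higher:

```python
import numpy as np

# Hypothetical per-period returns, for illustration only
R = np.array([0.10, -0.05, 0.20, -0.10])
T = len(R)

# Geometric mean return: T-th root of the product of (1 + R_t), minus 1
R_G = np.prod(1 + R) ** (1.0 / T) - 1
# Arithmetic mean return, for comparison
R_A = np.mean(R)

# Compounding at R_G for T periods reproduces the actual total growth
growth_actual = np.prod(1 + R)
growth_constant = (1 + R_G) ** T

print('Geometric mean return :', R_G)
print('Arithmetic mean return:', R_A)
print('Total growth (actual returns) :', growth_actual)
print('Total growth (constant R_G)   :', growth_constant)
```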

# Variance and Standard Deviation

Variance and standard deviation are measures of the dispersion of a dataset around its mean.

We can define the mean absolute deviation as the average of the distances of observations from the arithmetic mean. We use the absolute value of the deviation, so that 5 above the mean and 5 below the mean both contribute 5, because otherwise the deviations always sum to 0.

$$ MAD = \frac{\sum_{i=1}^n |X_i - \mu|}{n} $$

where $n$ is the number of observations and $\mu$ is their mean.
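As a sketch, the mean absolute deviation can be computed directly from this definition. The helper name `mean_abs_dev` is ours, and `x1` and `x2` repeat the lists defined earlier so that the block runs on its own.

```python
import numpy as np

x1 = [10, 9, 8, 5, 5, 6, 7, 4, 3, 2]
x2 = x1 + [100]

def mean_abs_dev(x):
    """Average absolute distance of the observations from their arithmetic mean."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - x.mean()))

# The signed deviations cancel out, which is why we take absolute values
print('Sum of signed deviations of x1:', np.sum(np.array(x1) - np.mean(x1)))
print('MAD of x1:', mean_abs_dev(x1))
print('MAD of x2:', mean_abs_dev(x2))
```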

Instead of using absolute deviations, we can use the squared deviations. This gives the **variance** $\sigma^2$, the average of the squared deviations around the mean:
$$ \sigma^2 = \frac{\sum_{i=1}^n (X_i - \mu)^2}{n} $$

**Standard deviation** is simply the square root of the variance, $\sigma$, and it is the easier of the two to interpret because it is in the same units as the observations.

Note that variance is additive for uncorrelated variables, while standard deviation is not.

In [None]:
print('Variance of x1:', np.var(x1))
print('Standard deviation of x1:', np.std(x1))
print('Variance of x2:', np.var(x2))
print('Standard deviation of x2:', np.std(x2))

Standard deviation indicates the amount of variation in a set of data values. A low standard deviation indicates that the data points tend to be close to the expected value, while a high standard deviation indicates that the data points are spread out over a wider range of values.
 
### Some properties of standard deviation that are handy:

* The standard deviation of a constant is equal to 0
* Standard deviations cannot be added. In general, $\sigma(X+Y)\neq \sigma(X) + \sigma(Y)$
* However, variances can be added. In fact, $\sigma^2(X+Y) = \sigma^2(X) + \sigma^2(Y) + 2*Cov(X,Y)$
* If X and Y are uncorrelated,  $Cov(X,Y)=0$ and $\sigma^2(X+Y) = \sigma^2(X) + \sigma^2(Y)$
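These properties can be checked on synthetic data. In the sketch below the distributions, coefficients and seed are arbitrary assumptions. `X` and `Y` are constructed to be correlated so that the covariance term matters.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 100000
X = rng.normal(0.0, 1.0, size=n)
Y = 0.8 * X + rng.normal(0.0, 0.6, size=n)  # correlated with X by construction

# Variance of the sum versus the formula with the covariance term
var_sum = np.var(X + Y)
cov_xy = np.cov(X, Y, ddof=0)[0, 1]  # ddof=0 to match np.var's default
var_formula = np.var(X) + np.var(Y) + 2 * cov_xy

print('Var(X+Y)                    :', var_sum)
print('Var(X) + Var(Y) + 2Cov(X,Y) :', var_formula)

# Standard deviations do not add
print('Std(X+Y)        :', np.std(X + Y))
print('Std(X) + Std(Y) :', np.std(X) + np.std(Y))
```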

## Volatility

If an experiment is performed daily and the results on one day do not affect the results on any other day, the daily observations are uncorrelated. If we measure the daily standard deviation as $\sigma_i$, then we can calculate the standard deviation for a year, also called the annualized standard deviation, as:
$$\sigma_{ann} = \sqrt{\sum_{i=1}^T \sigma_i^2}$$

In finance, we sum over all trading days and this annualized standard deviation is called **Volatility**. A year usually has 252 trading days and it is common practice to approximate this to 256 for quick calculation. Therefore a stock whose daily returns are normally distributed with a daily standard deviation of $\sigma = 1$% will have a volatility of $\sqrt{256*1^2}= 16$%
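The annualization formula can be sketched on synthetic data. Everything below is an assumption for illustration: the returns are simulated as independent normal draws with a constant 1% daily standard deviation, not real market data.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
daily_sigma = 0.01              # assumed constant daily standard deviation (1%)
trading_days = 252

# With a constant daily sigma, sqrt(sum of sigma_i^2) reduces to sigma * sqrt(T)
sigma_ann = np.sqrt(np.sum(np.full(trading_days, daily_sigma) ** 2))
sigma_ann_shortcut = daily_sigma * np.sqrt(trading_days)
print('Annualized sigma (sum of variances):', sigma_ann)
print('Annualized sigma (sqrt(T) shortcut):', sigma_ann_shortcut)

# Estimate the same quantity from simulated, independent daily returns
daily_returns = rng.normal(0.0, daily_sigma, size=trading_days * 40)
est_vol = np.std(daily_returns) * np.sqrt(trading_days)
print('Volatility estimated from simulated returns:', est_vol)
```

The square-root scaling relies on the daily returns being uncorrelated. It is not guaranteed to hold for real price series.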

In [None]:
# Install yahoo finance to obtain historical market data
!pip install yfinance

In [None]:
import yfinance as yf

startDateStr = '2012-12-31'
endDateStr = '2017-12-31'
instrumentIds = ['AAPL','MSFT', 'FB', 'GOOG', 'T', 'INTC', 'V', 'CSCO', 'VZ']
data_dict = {}
for instrumentId in instrumentIds:
    data = yf.download(instrumentId, startDateStr, endDateStr)
    data_dict[instrumentId] = data.Close

**Ex1: We've loaded 5 years of daily data for a few stocks from 2013-17. Write code to calculate the volatility of these stock returns.**

In [None]:
import numpy as np

# Calculate vol of each stock
for instrumentId in instrumentIds:
    prices = data_dict[instrumentId]
    
    returns = None # Calculate this using price(t)/price(t-1) - 1
    vol = None # Calculate this using np.std() over returns 
    
    print('Volatility for %s is : %s'%(instrumentId, vol) )

Now let's say we create a basket of stocks in the given ratio. 

**Ex2: Calculate the average return of this basket and volatility of this basket assuming each stock is independent.**

| Symbol | % of Portfolio |
|--------|----------------|
|AAPL | 24% |
|MSFT | 18.25% |
|GOOG | 17.5% |
|FB | 11.75% |
|T | 6.5% |
|INTC | 6% |
|V | 5.75% |
|CSCO | 5.25% |
|VZ | 5% |

In [None]:
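# Calculate vol of basket made up of stocks in given ratio
# Note: the weights follow the table order (AAPL, MSFT, GOOG, FB, ...), which differs from instrumentIds
w = np.array([0.24, 0.1825, 0.175, 0.1175, 0.065, 0.06, 0.0575, 0.0525 , 0.05])

portfolio_price = None # Compute this using weighted composition of stock prices
portfolio_returns = None # Compute this as done before
# Compare against None explicitly: the truth value of a pandas Series is ambiguous
if portfolio_returns is not None:
    print('Volatility for the portfolio is: %s'%(np.std(portfolio_returns)))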

In [None]:
instrumentId = 'XLK'
ds = yf.download(instrumentId, startDateStr, endDateStr)
XLK_price = ds.Close

**Ex3: Now calculate the returns and volatility for XLK. XLK is the sector index for tech stocks in the S&P, and the stocks above make up more than 70% of the index.**

Do you find that the volatility you calculate for the index matches that of the basket above?

In [None]:
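# Calculate vol of XLK
XLK_returns = None # Calculate this using XLK_price
# Compare against None explicitly: the truth value of a pandas Series is ambiguous
if XLK_returns is not None:
    print('Volatility for the XLK is: %s'%(np.std(XLK_returns)))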

What does this tell us? The stock prices are not independent of each other and we must account for this in our calculations. We do this by calculating the covariance, which we'll cover in the next post.

### Note: These are Only Estimates

It is important to remember that when we are working with a subset of actual data, these computations will only give you sample statistics, that is mean and standard deviation of a sample of data. Whether or not this reflects the current true population mean and standard deviation is not always obvious, and more effort has to be put into determining that. This is especially problematic in finance because all data are time series and the mean and variance may change over time. In general do not assume that because something is true of your sample, it will remain true going forward.
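To see how much sample statistics can differ from the true values, the sketch below draws samples of different sizes from a distribution whose mean and standard deviation we choose ourselves. The parameters and seed are arbitrary assumptions. Real financial data adds the further complication that the true parameters may change over time.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
true_mean, true_std = 0.0005, 0.01  # assumed "population" parameters

# Store (sample mean, sample std) for each sample size
estimates = {}
for n in [20, 250, 5000]:
    sample = rng.normal(true_mean, true_std, size=n)
    estimates[n] = (np.mean(sample), np.std(sample, ddof=1))
    print('n = %4d: sample mean = %.5f, sample std = %.5f' % (n, estimates[n][0], estimates[n][1]))

print('True values: mean = %.5f, std = %.5f' % (true_mean, true_std))
```

Smaller samples tend to give noisier estimates, so a statistic computed on a short history should be treated with caution.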

This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Auquan. Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, Auquan, has not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information, believed to be reliable, available to Auquan, at the time of publication. Auquan makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.