<a href="https://colab.research.google.com/github/LukaT11/quantitative_finance/blob/master/Measures_of_Central_Tendency.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Measures of Central Tendency**

In this notebook we will discuss ways to summarize a set of data using a single number. The goal is to capture information about the distribution of data.

## **Arithmetic mean**

The arithmetic mean is used very frequently to summarize numerical data, and is usually the one assumed to be meant by the word "average." It is defined as the sum of the observations divided by the number of observations:

$$\mu = \frac{\sum_{i=1}^N X_i}{N}$$


where $X_1, X_2, \ldots , X_N$ are our observations.

In [1]:
#Two useful statistical libraries
import scipy.stats as stats
import numpy as np

# We'll use these two data sets as examples
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print('Mean of x1:', sum(x1), '/', len(x1), '=', np.mean(x1))
print('Mean of x2:', sum(x2), '/', len(x2), '=', np.mean(x2))

Mean of x1: 29 / 8 = 3.625
Mean of x2: 129 / 9 = 14.333333333333334


We can also define a <i>weighted</i> arithmetic mean, which is useful for explicitly specifying the number of times each observation should be counted. For instance, in computing the average value of a portfolio, it is more convenient to say that 70% of your stocks are of type X rather than making a list of every share you hold.

The weighted arithmetic mean is defined as
$$\sum_{i=1}^N w_i X_i$$

where $\sum_{i=1}^N w_i = 1$. In the usual arithmetic mean, we have $w_i = 1/N$ for all $i$.
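
As a minimal sketch of the definition above, the weighted mean can be computed directly from a list of values and a matching list of weights. The helper name `weighted_mean`, the two prices, and the 70/30 split are illustrative assumptions, not part of the original notebook. The second call uses equal weights (one over the number of observations) to show that this recovers the ordinary arithmetic mean of `x1`.

```python
import math

def weighted_mean(values, weights):
    """Return sum(w_i * X_i), assuming the weights sum to 1."""
    if len(values) != len(weights):
        raise ValueError('values and weights must have the same length')
    if not math.isclose(sum(weights), 1.0):
        raise ValueError('weights must sum to 1')
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical portfolio: 70% of holdings in stock X, 30% in stock Y
prices = [10.0, 20.0]           # illustrative prices, not real data
portfolio_weights = [0.7, 0.3]
portfolio_average = weighted_mean(prices, portfolio_weights)

# Equal weights recover the ordinary arithmetic mean of x1
x1 = [1, 2, 2, 3, 4, 5, 5, 7]
equal_weights = [1 / len(x1)] * len(x1)
equal_weight_mean = weighted_mean(x1, equal_weights)

print('Weighted average price:', portfolio_average)
print('Equal-weight mean of x1:', equal_weight_mean)
```

With NumPy available, `np.average(values, weights=weights)` should give the same result without a hand-written helper.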

## **Median**

The median of a set of data is the number that appears in the middle of the list when it is sorted in increasing or decreasing order. When we have an odd number $n$ of data points, this is simply the value in position $(n+1)/2$. When we have an even number of data points, the list splits in half and there is no item in the middle, so we define the median as the average of the values in positions $n/2$ and $(n+2)/2$.

The median is less affected by extreme values in the data than the arithmetic mean. It tells us the value that splits the data set in half, but not how much smaller or larger the other values are.
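
The position rule above can be sketched in a few lines of plain Python. The function name `median_by_position` is a hypothetical helper introduced here for illustration. The positions in the text are 1-based, so each one is shifted down by one to get a Python list index.

```python
import statistics

def median_by_position(data):
    """Median from the sorted list, using the 1-based positions in the text."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError('median of empty data is undefined')
    if n % 2 == 1:
        # Odd n: value in position (n + 1) / 2, i.e. index (n + 1) // 2 - 1
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values in positions n / 2 and (n + 2) / 2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

x1 = [1, 2, 2, 3, 4, 5, 5, 7]
x2 = x1 + [100]

print('Median of x1:', median_by_position(x1))
print('Median of x2:', median_by_position(x2))
# Cross-check against the standard library implementation
print('Matches statistics.median:', median_by_position(x1) == statistics.median(x1))
```

Adding the outlier 100 moves the median only from the average of 3 and 4 to the single middle value 4, which illustrates why the median is less sensitive to extreme values than the mean.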

In [2]:
print('Median of x1:', np.median(x1))
print('Median of x2:', np.median(x2))

Median of x1: 3.5
Median of x2: 4.0


## **Mode**

The mode is the most frequently occurring value in a data set. Unlike the mean and the median, it can be applied to non-numerical data. It is especially useful when the possible values are unrelated to one another, so that a value being common says nothing about its neighbors. For example, if a weighted die comes up 6 often, that does not mean it is likely to come up 5, so knowing that the data set has a mode of 6 is more useful than knowing it has a mean of 4.5.

The important concepts here are:

*   [collections.Counter](https://docs.python.org/2/library/collections.html#collections.Counter) and its [most_common](https://docs.python.org/2/library/collections.html#collections.Counter.most_common) method
*   A [list comprehension](https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions)



In [3]:
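# SciPy has a built-in mode function, but it returns exactly one value
# even if two values occur the same number of times, or if no value appears more than once

from collections import Counter

# Newer SciPy versions return a scalar mode for 1-D input while older ones return
# an array, so np.atleast_1d keeps the indexing valid in both cases
print('One mode of x1:', np.atleast_1d(stats.mode(x1).mode)[0])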

def mode_function2(lst):
    counter = Counter(lst)
    _,val = counter.most_common(1)[0]
    print('Multiple modes: ' + str( [x for x,y in counter.items() if y == val]))
  
mode_function2(x1)



One mode of x1: 2
Multiple modes: [2, 5]
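
Because the mode only requires counting equal values, the same `Counter` approach should work on non-numerical data as well. The sketch below assumes a hypothetical helper `all_modes` that returns the tied values instead of printing them, and a made-up list of analyst ratings used purely for illustration.

```python
from collections import Counter

def all_modes(values):
    """Return every value tied for the highest count (empty list for no data)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical analyst ratings; the labels are illustrative, not real data
ratings = ['buy', 'hold', 'buy', 'sell', 'hold', 'buy']

print('Modes of ratings:', all_modes(ratings))
print('Modes of x1:', all_modes([1, 2, 2, 3, 4, 5, 5, 7]))
```

Returning a list rather than printing makes the result reusable, for example to check whether a data set has a unique mode with `len(all_modes(values)) == 1`.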


For data that can take on many different values, such as returns data, there may not be any values that appear more than once. In this case we can bin values, like we do when constructing a histogram, and then find the mode of the data set where each value is replaced with the name of its bin. That is, we find which bin elements fall into most often.

In [0]:
!pip install quandl

In [5]:
import quandl

start = '2014-01-01'
end = '2015-01-01'

quandl.ApiConfig.api_key = 'YOUR_API_KEY'  # replace with your own Quandl API key

data = quandl.get('WIKI/MSFT.4', start_date = start, end_date = end)

returns = data.pct_change()[1:]

hist, bins = np.histogram(returns, 20) # Break data up into 20 bins
maxfreq = max(hist)
# Find all of the bins that are hit with frequency maxfreq, then print the left edge of each
print('Mode of bins:', [bins[i] for i, j in enumerate(hist) if j == maxfreq])

Mode of bins: [-0.008048068531763167, -0.004092245037164856]
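
The cell above prints only the left edge of each most-populated bin. As a rough sketch of the same binning idea that reports full intervals and needs no data download, the example below uses only the standard library and a synthetic sample. The helper `modal_bins`, the seed, and the Gaussian parameters are assumptions made for illustration and are not MSFT data. As with `np.histogram`, bins are equal-width and the maximum value is placed in the last bin.

```python
import random
from collections import Counter

def modal_bins(values, num_bins=20):
    """Return (left, right) edges of the equal-width bins with the highest count."""
    low, high = min(values), max(values)
    if low == high:
        return [(low, high)]  # all values identical: one degenerate bin
    width = (high - low) / num_bins
    # Assign each value a bin index; the maximum value goes in the last bin
    indices = [min(int((v - low) / width), num_bins - 1) for v in values]
    counts = Counter(indices)
    highest = max(counts.values())
    return [(low + i * width, low + (i + 1) * width)
            for i in sorted(counts) if counts[i] == highest]

random.seed(0)  # fixed seed so the synthetic sample is reproducible
# Synthetic daily returns (an assumption for illustration, not real prices)
synthetic_returns = [random.gauss(0.0005, 0.01) for _ in range(250)]

for left, right in modal_bins(synthetic_returns):
    print('Modal bin: [{:.5f}, {:.5f})'.format(left, right))
```

The result depends on the number of bins: fewer, wider bins make ties less likely but locate the mode less precisely.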
