# Calculating Financial Statistics
## Introduction

In this lesson, we will learn how to use Python to calculate key financial statistics for making informed investment decisions. There are many different kinds of financial assets, or *securities*, ranging from real estate to government bonds to common stock. But no matter the type of investment, there are two things that we always need to consider about each asset: the return and the risk.

The *rate of return* is a measure of the amount of money gained or lost in an investment. A positive return signifies a profit and a negative return indicates a loss. The *risk* of an investment is defined as the likelihood of suffering a financial loss.

There is often a tradeoff between risk and return, where the higher the potential return of an asset, the higher the risk involved. Thus, it is important to understand both aspects for making smart choices about an investment - which we will learn to do in this lesson.

Let us get started by looking at the rate of return!

***
### Exercise

1. When speaking about investments, we express the calculated rate of return as a percentage. For example, it is more common to say that Investment $A$ has a $7.5\%$ rate of return, instead of a rate of return of `0.075`.

    Let us write a function that we will use throughout the lesson for converting a decimal value to the percent form. First, define a variable called `rate_of_return` and assign to it a value of `0.075`.

In [1]:
rate_of_return = 0.075

2. Next, define a function called `display_as_percentage()` that takes a parameter called `val`, which will be the value in decimal form. Format `val` to have our function return a percentage. We want to:

* Round the result to 1 decimal place
* Return a formatted string that ends with a `'%'`

In [2]:
def display_as_percentage(val):
  return (f"{val:.1%}")

3. Let us test out our function! Call `display_as_percentage()` to format `rate_of_return` as a percentage, and print out the result.

In [3]:
print(display_as_percentage(rate_of_return))

7.5%


***

## Simple Rate of Return

Now that we are familiar with how the rate of return is usually expressed, let us take a look at calculating it.

The most basic type of return is the **simple rate of return**. It is defined as the difference between the starting and ending price of an investment over a time period, divided by the starting price. If an investment returns dividends, those dividends should be added to the numerator.

$$
R = \frac{E - S + D}{S}
$$

* $R$: simple rate of return
* $S$: starting price of investment
* $E$: ending price of investment
* $D$: dividend

In the equation above, the numerator represents the absolute amount gained or lost in an investment. For example, if the starting price of an investment is $\$25$ and the ending price is $\$30$, then there is a $\$5$ profit.

While this gives us some information about the return of an investment, it is by itself not very useful for comparing between investments. That is, a $\$5$ profit from a $\$25$ investment may be a decent gain, but the same $\$5$ profit from a $\$10,000$ investment would be trivial.

This is why the simple rate of return is calculated as the absolute gain or loss divided by the starting price of the investment. By expressing the return as a percentage of the originally invested amount, we can more easily compare across different investments.
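
As a quick illustration of why the ratio matters, the sketch below compares the two $\$5$ profits from the paragraphs above. The variable names are hypothetical and chosen only for this example.

```python
# Same absolute profit, very different starting prices (values from the text above)
profit = 5
small_start = 25
large_start = 10_000

# Profit expressed as a fraction of the amount originally invested
small_rate = profit / small_start
large_rate = profit / large_start

print(f"{small_rate:.2%}")  # 20.00%
print(f"{large_rate:.2%}")  # 0.05%
```

The same $\$5$ gain is a sizable return on the small investment and a negligible one on the large investment.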

***
### Exercise

1. Define a function called `calculate_simple_return()` that has 2 required parameters and 1 optional parameter:

* `start_price`: starting price of investment
* `end_price`: ending price of investment
* `dividend`: dividend (default value: 0)

For now, just put `return` in the function body.

In [4]:
def calculate_simple_return(start_price, end_price, dividend=0):
    return 0

2. Now, calculate the simple rate of return using the formula provided in the narrative above and return the result.

In [5]:
def calculate_simple_return(start_price, end_price, dividend=0):
    return ((end_price - start_price + dividend) / start_price)

3. Call the function, passing in `200` as the `start_price` and `250` as the `end_price`, and store the result in a variable called `simple_return`.

In [6]:
simple_return = calculate_simple_return(start_price=200, end_price=250)

4. Print the string `'The simple rate of return is X%'`, where `X%` is the simple_return expressed as a percentage. Use the `display_as_percentage()` helper function you have written in the previous exercise to display the value as a percentage.

In [7]:
print(f'The simple rate of return is {display_as_percentage(simple_return)}')

The simple rate of return is 25.0%


5. Now, modify your function call to include a dividend of `20` for the investment and run your code. How did the simple rate of return change?

In [8]:
simple_return = calculate_simple_return(start_price=200, end_price=250, dividend=20)

print(f'The simple rate of return is {display_as_percentage(simple_return)}')

The simple rate of return is 35.0%


***

## Logarithmic Rate of Return

Another type of return is the *logarithmic rate of return*, also known as the continuously compounded return. This is the expected return for an investment where the earnings are assumed to be continually reinvested over the time period. It is calculated by taking the difference between the log of the ending price and the log of the starting price.

$$
r = \log(E) - \log(S) = \log\left(\frac{E}{S}\right)
$$

* $r$: logarithmic rate of return
* $S$: starting price of investment
* $E$: ending price of investment

The advantage of the log rate of return is that it is easy to make calculations about a single asset over time. On the other hand, calculating the simple rate of return is easier for dealing with multiple assets over the same time period.

We will take a closer look at both of these in the subsequent exercises. But whichever type of return we choose to use, it is important to remember to be consistent in using the same one for any further financial calculations.
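
The two kinds of return are closely related. As a minimal sketch, assuming an investment with no dividend, the ratio $E / S$ equals $1 + R$, so the log return is $\log(1 + R)$. The prices below are illustrative values, not data from the lesson exercises.

```python
from math import log, isclose

start_price = 25
end_price = 30

simple_return = (end_price - start_price) / start_price  # R = (E - S) / S, no dividend
log_return = log(end_price / start_price)                # r = log(E / S)

# Since E / S = 1 + R when there is no dividend, r equals log(1 + R)
print(isclose(log_return, log(1 + simple_return)))  # True
print(f"{simple_return:.1%}")  # 20.0%
print(f"{log_return:.1%}")     # 18.2%
```

The two values differ, which is why the same type of return should be used throughout a calculation.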

***
### Exercise

1. First, import the `log` function from the `math` module.

In [9]:
from math import log

2. Define a function called `calculate_log_return()` that has 2 parameters:

* `start_price`: starting price of investment
* `end_price`: ending price of investment

For now, just put `return` in the function body.

In [10]:
def calculate_log_return(start_price, end_price):
  return 0

3. Now, calculate the logarithmic rate of return using the formula provided in the narrative above and return the result.

In [11]:
def calculate_log_return(start_price, end_price):
  return log(end_price / start_price)

4. Call the function, passing in `200` as the `start_price` and `250` as the `end_price`, and store the result in a variable called `log_return`.

In [12]:
log_return = calculate_log_return(start_price=200, end_price=250)

5. Print the string `'The log rate of return is X%'`, where `X%` is the `log_return` expressed as a percentage.

In [13]:
print(f'The log rate of return is {display_as_percentage(log_return)}')

The log rate of return is 22.3%


***

## Aggregate Across Time I

When describing the rate of return of an investment, something that is important to keep in mind is the time frame of the investment. For example, an investment with a $2\%$ rate of return over one day is surely not the same as an investment with a $2\%$ rate of return over one month. Thus, it is common to convert returns to a standard time period. Often, this means converting to the annual rate of return in a process called *annualizing*.

As we have covered earlier, aggregating across time for a single asset can easily be calculated using the log rate of return.

To convert a log rate of return from one time period to another, we can multiply the rate of return by the number of original time periods in the new time period.

$$
r = r_{0} \times t
$$

* $r$: converted log rate of return
* $r_{0}$: original log rate of return
* $t$: the number of original time periods in the new time period

For example, if we are converting daily returns to annual returns, $t$ may be $252$ because that is typically the number of trading days in a year. If we are converting monthly returns to annual returns, $t$ would be $12$.

***
### Exercise

In [14]:
daily_return_a = 0.001
monthly_return_b = 0.022

1. You are given the log rate of return for two investments - the daily return of Investment A and monthly return of Investment B. Print out the following strings:

* `'The daily rate of return for Investment A is X%'`, where `X%` is `daily_return_a` expressed as a percentage.
* `'The monthly rate of return for Investment B is X%'`, where `X%` is `monthly_return_b` expressed as a percentage.

In [15]:
print(f'The daily rate of return for Investment A is {display_as_percentage(daily_return_a)}')
print(f'The monthly rate of return for Investment B is {display_as_percentage(monthly_return_b)}')

The daily rate of return for Investment A is 0.1%
The monthly rate of return for Investment B is 2.2%


2. It is difficult to compare daily returns with monthly returns. Let us write a function to convert both to annual returns! Define a function called `annualize_return()` that has 2 parameters:

* `log_return`: original log rate of return
* `t`: the number of original time periods in the new time period

For now, just put `return` in the function body.

In [16]:
def annualize_return(log_return, t):
    return 0

3. Now, calculate the converted rate of return using the formula provided in the narrative above and return the result.

In [17]:
def annualize_return(log_return, t):
    return log_return * t

4. Use the function to annualize `daily_return_a` and store the result in a variable called `annual_return_a`.

    Print the string `'The annual rate of return for Investment A is X%'`, where `X%` is the `annual_return_a` expressed as a percentage.

In [18]:
daily_return_multiplier = 252
annual_return_a = annualize_return(log_return=daily_return_a, t=daily_return_multiplier)

print(f'The annual rate of return for Investment A is {display_as_percentage(annual_return_a)}')

The annual rate of return for Investment A is 25.2%


5. Use the function to annualize `monthly_return_b` and store the result in a variable called `annual_return_b`.

    Print the string `'The annual rate of return for Investment B is X%'`, where `X%` is the `annual_return_b` expressed as a percentage.

    How do the annual returns for Investments A and B compare?

In [19]:
monthly_return_multiplier = 12
annual_return_b = annualize_return(log_return=monthly_return_b, t=monthly_return_multiplier)

print(f'The annual rate of return for Investment B is {display_as_percentage(annual_return_b)}')

The annual rate of return for Investment B is 26.4%


***

## Aggregate Across Time II

Now, let us look at an extension of the previous conversion formula. Suppose we know the log rate of return for 5 days of a given year. Which daily log return would we use to calculate the annual return?

In this case, we can first take the average of the 5 daily log returns, then multiply by 252, the number of trading days in a year. The general formula is:

$$
r = \frac{r_{0_1} + r_{0_2} + ... + r_{0_n}}{n} \times t
$$

* $r$: converted log rate of return
* $r_{0_n}$: the n<sup>th</sup> log return from the original time period
* $n$: the number of returns from the original time period
* $t$: the number of original time periods in the new time period

Notice how if we only have one log return from the original time period, $n$ equals $1$, and we can simplify the equation to obtain the conversion formula we saw in the previous exercise.

On the other hand, if we know the log return for all $252$ trading days, then $n$ equals $t$, and we will simply be summing up all the log returns across the new time period.

$$
r = r_{0_1} + r_{0_2} + ... + r_{0_t}
$$

* $r$: converted log rate of return
* $r_{0_n}$: the n<sup>th</sup> log return from the original time period
* $t$: the number of original time periods in the new time period
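
The summation formula above works because the intermediate prices cancel when log returns are added. The sketch below checks this on a short, hypothetical price series; the prices and variable names are assumptions made for illustration only.

```python
from math import log, isclose

# Hypothetical closing prices for four consecutive days
prices = [100.0, 102.0, 101.0, 105.0]

# Daily log returns: log(price_today / price_yesterday)
daily_log_returns = [log(b / a) for a, b in zip(prices, prices[1:])]

# Log return over the whole period, computed directly from the first and last price
total_log_return = log(prices[-1] / prices[0])

# The sum of the daily log returns matches the whole-period log return
print(isclose(sum(daily_log_returns), total_log_return))  # True
```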

***
### Exercise

In [20]:
daily_returns = [0.002, -0.002, 0.003, 0.002, -0.001]

1. You are provided with a list of 5 daily log returns from a given week in the list `daily_returns`. Write a function called `convert_returns()` that takes the following 2 parameters:

* `log_returns`: list of log rates of return from the original time period
* `t`: the number of original time periods in the new time period

For now, just put `return` in the function body.

In [21]:
def convert_returns(log_returns, t):
    return 0

2. Now, calculate the converted rate of return using the first formula provided in the narrative above and return the result.

In [22]:
def convert_returns(log_returns, t):
    return sum(log_returns) / len(log_returns) * t

3. Use the function to find the annual rate of return from the list of daily returns and store the result in a variable called `annual_return`.

    Copy the following line to print out the result:

        print('The annual rate of return is', display_as_percentage(annual_return))

In [23]:
annual_return = convert_returns(daily_returns, t=daily_return_multiplier)
print(f'The annual rate of return is {display_as_percentage(annual_return)}')

The annual rate of return is 20.2%


4. Now, let us use the function to find the weekly rate of return from the list of daily returns and store the result in a variable called `weekly_return`.

    Copy the following line to the text editor to print out the result:

        print('The weekly rate of return is', display_as_percentage(weekly_return))

In [24]:
weekly_trading_days = 5
weekly_return = convert_returns(daily_returns, t=weekly_trading_days)
print(f'The weekly rate of return is {display_as_percentage(weekly_return)}')

The weekly rate of return is 0.4%


5. We will demonstrate that log returns are additive over time. Since we are given the daily log return for all 5 trading days of the week, the rate of return for that given week is also equivalent to summing up all the daily returns in that period.

    Reassign the variable `weekly_return` to be the sum of all values in the list of `daily_returns`, and run the code. You should see that you get the same result as calling `convert_returns()`!

In [25]:
weekly_return = sum(daily_returns)
print(f'The weekly rate of return is {display_as_percentage(weekly_return)}')

The weekly rate of return is 0.4%


***

## Aggregate Across Assets

Although we will be focusing primarily on individual assets in this lesson, it is important to recognize that these investments often make up the pieces of a larger *financial portfolio*. Portfolios will be covered in more depth in a future lesson, but let’s get a preview of how to calculate the expected rate of return for an entire portfolio of investments.

As we have learned earlier, using the simple rate of return makes it easy to aggregate across multiple assets. The portfolio return would simply be the weighted average of each individual asset’s simple rate of return.

$$
R = (W_{1} * R_{1}) + (W_{2} * R_{2}) + ... + (W_{n} * R_{n})
$$
    
* $R$: portfolio simple rate of return
* $W_{i}$: weight of the ith investment in the portfolio
* $R_{i}$: simple rate of return of the ith investment in the portfolio

The weight of each asset is obtained by:

$$
W_{i} = \frac{S_{i}}{S_{1} + S_{2} + ... + S_{n}}
$$

* $W_{i}$: weight of the ith investment in the portfolio
* $S_{i}$: starting price of the ith investment in the portfolio
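
The two formulas above can be sketched in a few lines of Python. This is a minimal example with hypothetical prices for three investments and no dividends; the variable names are assumptions made for illustration.

```python
# Hypothetical starting and ending prices for three investments (no dividends)
start_prices = [200.0, 300.0, 500.0]
end_prices = [250.0, 330.0, 450.0]

# Simple rate of return of each investment: (E - S) / S
returns = [(e - s) / s for s, e in zip(start_prices, end_prices)]

# Weight of each investment: its starting price over the total starting price
total_start = sum(start_prices)
weights = [s / total_start for s in start_prices]

# Portfolio return: weighted average of the individual simple returns
portfolio_return = sum(w * r for w, r in zip(weights, returns))
print(f"{portfolio_return:.1%}")  # 3.0%
```

As a sanity check, the portfolio grows from $\$1{,}000$ to $\$1{,}030$ in total, which is the same $3\%$ simple return.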

***

## Variance

Now that we have a good understanding of rate of return, let us shift our focus to assessing the risk involved in an investment. One of the key statistics for understanding risk is variance. *Variance* is a measure of the spread of a dataset, or how far apart each value is from the mean. The greater the variance, the more spread out or variable the data is.

For example, let us take a look at the data below showing the returns of two investments over the course of a week. Which one has a higher variance? What do you think that tells us about their relative risk?
```
# Investment A
returns_a = [0.05, -0.10, -0.08, 0.05, 0.14]
 
# Investment B
returns_b = [-0.01, 0.02, 0.01, 0.04, 0.03]
```
As seen, the returns for Investment A fluctuate more throughout the week while the returns for Investment B remain relatively consistent. This means Investment A has a higher variance and therefore a higher risk. An asset with a high variance is generally a riskier one because its return can vary significantly in a short period of time, making it less stable and more unpredictable.

The formula for calculating variance is:

$$
\sigma^{2} = \frac{\Sigma(X_{i} - \bar{X})^{2}}{n}
$$

* $\sigma^{2}$: variance
* $X_{i}$: the ith value in the dataset
* $\bar{X}$: the mean of the dataset
* $n$: the number of values in the dataset
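
To confirm the comparison of Investments A and B numerically, the sketch below applies the formula above to the two lists from the example. It uses `pvariance()` from Python's standard `statistics` module, which divides by $n$ as the formula does; using this function here is a choice made for illustration, not the approach taken in the exercises.

```python
from statistics import pvariance

# Weekly returns from the example above
returns_a = [0.05, -0.10, -0.08, 0.05, 0.14]
returns_b = [-0.01, 0.02, 0.01, 0.04, 0.03]

# Population variance: mean of the squared deviations from the mean
variance_a = pvariance(returns_a)
variance_b = pvariance(returns_b)

print(f"{variance_a:.4f}")  # 0.0081
print(f"{variance_b:.4f}")  # 0.0003
```

The variance of Investment A is many times larger than that of Investment B, matching the visual impression from the data.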

***
### Exercise

In [26]:
import numpy as np

returns_disney = [0.22, 0.12, 0.01, 0.05, 0.04]
returns_cbs = [-0.13, -0.15, 0.31, -0.06, -0.29]

variance_disney = np.var(returns_disney)
variance_cbs = np.var(returns_cbs)

1. You are given the historical annual stock returns for the Walt Disney Company (DIS) and CBS Corporation (CBS). Run the cell below to see the variance of each. Which would be considered the riskier investment?

    Note: We are using Python's numpy library to calculate the variance here. Don’t worry too much about the syntax - we will cover numpy in a future lesson!

In [27]:
print(f'The variance of Disney stock returns is {variance_disney:.4f}')
print(f'The variance of CBS stock returns is {variance_cbs:.4f}')

The variance of Disney stock returns is 0.0057
The variance of CBS stock returns is 0.0405


2. Now, let us calculate the variance ourselves! We will use the sample `dataset` provided. First, calculate the mean of the values in `dataset` and store the result in a variable called `mean`.

In [28]:
dataset = [10, 8, 9, 10, 12]

In [29]:
mean = sum(dataset) / len(dataset)

3. Next, calculate the sum in the numerator of the variance formula by looping through the `dataset` list and adding up the squared difference between each data point and the `mean`. Store the sum in a variable called `numerator`.

In [30]:
numerator = 0
for number in dataset:
    numerator += (number - mean) ** 2

4. Divide the `numerator` by the number of values in `dataset` to obtain the variance. Store the value in a variable called `variance`.

In [31]:
variance = numerator / len(dataset)

5. Great! Now that we have the code for calculating the variance, let us move it inside a function so we can call it for other datasets.

    Define a function called `calculate_variance()` that takes a parameter called `dataset`.

In [32]:
def calculate_variance(dataset):
    mean = sum(dataset) / len(dataset)
    numerator = 0
    for number in dataset:
        numerator += (number - mean) ** 2
    variance = numerator / len(dataset)
    return variance

6. Use `calculate_variance()` to find the variance of `returns_disney` and `returns_cbs`. Reassign the variables `variance_disney` and `variance_cbs` to store those values, respectively.

    Run the code! We should see that we get the same result as when we called the `numpy` function.

In [33]:
variance_disney = calculate_variance(returns_disney)
variance_cbs = calculate_variance(returns_cbs)

print(f'The variance of Disney stock returns is {variance_disney:.4f}')
print(f'The variance of CBS stock returns is {variance_cbs:.4f}')

The variance of Disney stock returns is 0.0057
The variance of CBS stock returns is 0.0405


## Standard Deviation

Although the variance is useful in determining the relative risk of an investment, it is sometimes not the easiest statistic to interpret since it does not have the same unit as the original data. As an alternative, it is common to use the standard deviation to describe the spread of the dataset.

*Standard deviation* is simply the square root of the variance. It has the same unit as the original dataset.

$$
\sigma = \sqrt{\frac{\Sigma(X_{i} - \bar{X})^{2}}{n}}
$$

* $\sigma$: standard deviation
* $X_{i}$: the ith value in the dataset
* $\bar{X}$: the mean of the dataset
* $n$: the number of values in the dataset
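
As a small sketch of the unit argument, the example below computes both statistics for the sample `dataset` used in the exercises. It relies on `pvariance()` and `pstdev()` from the standard `statistics` module, which divide by $n$ as in the formula above; this is an illustrative shortcut rather than the lesson's own implementation.

```python
from statistics import pvariance, pstdev

dataset = [10, 8, 9, 10, 12]

variance = pvariance(dataset)  # expressed in squared units of the data
stddev = pstdev(dataset)       # expressed in the same units as the data

print(f"{variance:.2f}")  # 1.76
print(f"{stddev:.2f}")    # 1.33
```

The standard deviation can be read directly against the data: a typical value lies roughly $1.33$ away from the mean of $9.8$.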

***
### Exercise

In [34]:
import numpy as np

returns_disney = [0.22, 0.12, 0.01, 0.05, 0.04]
returns_cbs = [-0.13, -0.15, 0.31, -0.06, -0.29]

stddev_disney = np.std(returns_disney)
stddev_cbs = np.std(returns_cbs)

1. Let us again use Python's `numpy` library to preview the standard deviations for the two stock returns. Run the cell below. Notice how the standard deviation has the same unit as the original data.

In [35]:
print(f'The standard deviation of Disney stock returns is {display_as_percentage(stddev_disney)}')
print(f'The standard deviation of CBS stock returns is {display_as_percentage(stddev_cbs)}')

The standard deviation of Disney stock returns is 7.5%
The standard deviation of CBS stock returns is 20.1%


2. Now, let us calculate the standard deviation ourselves! First, import the `sqrt` function from the `math` module.

In [36]:
from math import sqrt

3. Next, call the `calculate_variance()` function and pass in the `dataset` provided. Store the returned value in a variable called `variance`.

In [37]:
dataset = [10, 8, 9, 10, 12]

In [38]:
variance = calculate_variance(dataset)

4. Take the square root of variance using the imported `sqrt()` function and store the result in a variable called `stddev`.

In [39]:
stddev = sqrt(variance)

5. Now, let us move the code for calculating standard deviation inside a function!

    Define a function called `calculate_stddev()` that takes a parameter called `dataset`.

In [40]:
from math import sqrt

def calculate_stddev(dataset):
    variance = calculate_variance(dataset)
    stddev = sqrt(variance)
    return stddev

6. Use `calculate_stddev()` to find the standard deviation of `returns_disney` and `returns_cbs`. Reassign the variables `stddev_disney` and `stddev_cbs` to store those values, respectively.

    Run the code! We should see that we get the same result as when we called the `numpy` function.

In [41]:
stddev_disney = calculate_stddev(returns_disney)
stddev_cbs = calculate_stddev(returns_cbs)

print(f'The standard deviation of Disney stock returns is {display_as_percentage(stddev_disney)}')
print(f'The standard deviation of CBS stock returns is {display_as_percentage(stddev_cbs)}')

The standard deviation of Disney stock returns is 7.5%
The standard deviation of CBS stock returns is 20.1%


***

## Correlation I

Another important statistic for assessing risk is the correlation between the returns of two assets. *Correlation* is a measure of how closely two datasets are associated with each other. It is often represented by the correlation coefficient, which is a value that ranges between $-1$ and $1$. This indicates whether there is a positive correlation, negative correlation, or no correlation:

* **Positive correlation** – when the rate of return of one asset deviates upward from its mean, the other usually deviates upward as well.
* **Negative correlation** – when the rate of return of one asset deviates upward from its mean, the other usually deviates downward.
* **No correlation** – when a change in one asset’s rate of return does not dictate a change in another. The correlation coefficient will be close to $0$.

Two assets from the same industry generally have a positive correlation, as they are likely affected by similar external conditions. For example, automobile stocks may all be positively correlated with each other. Oil stocks, on the other hand, may be negatively correlated with automobile stocks because high oil costs can negatively impact car sales.

When building a portfolio, it is generally a good idea to include assets that are not correlated with each other. If assets are independent of one another, then there is a lower risk of the financial loss that can occur when assets in a portfolio are correlated. This allows for greater diversification and balances out the overall risk and return of the portfolio.

**Note**: You may also see correlation calculated using asset prices instead of returns, but returns are generally preferred. See <a href="https://quantdare.com/correlation-prices-returns/">this blogpost on correlation calculations</a> for more information if you are interested!
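
To illustrate the diversification point above, the sketch below uses an extreme, hypothetical case: two assets whose returns offset each other perfectly. The return values are invented for illustration, and `pstdev()` from the standard `statistics` module is used to measure the spread.

```python
from statistics import pstdev

# Hypothetical returns for two assets that move in opposite directions
returns_x = [0.02, -0.01, 0.03, -0.02]
returns_y = [-0.01, 0.02, -0.02, 0.03]

# Returns of a portfolio split evenly between the two assets
returns_portfolio = [0.5 * x + 0.5 * y for x, y in zip(returns_x, returns_y)]

print(f"{pstdev(returns_x):.4f}")          # 0.0206
print(f"{pstdev(returns_y):.4f}")          # 0.0206
print(f"{pstdev(returns_portfolio):.4f}")  # 0.0000
```

Each asset is volatile on its own, but the combined portfolio earns nearly the same return every period. Real assets rarely offset each other this cleanly, so the reduction in risk is usually partial.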

***
### Exercise

In [42]:
import numpy as np

def calculate_correlation(set_x, set_y):
    sum_x = sum(set_x)
    sum_y = sum(set_y)
    sum_x2 = sum([x ** 2 for x in set_x])
    sum_y2 = sum([y ** 2 for y in set_y])
    sum_xy = sum([x * y for x, y in zip(set_x, set_y)])
    n = len(set_x)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

returns_general_motors = [0.018, -0.005, -0.047, -0.009, -0.012, 0.003, -0.027, -0.014, 0.029, -0.062, 0.009]
returns_ford = [0.002, -0.004, -0.027, -0.022, -0.001, 0.002, -0.006, -0.017, 0.035, -0.029, 0.002]
returns_exxon_mobil = [0.008, 0.015, 0.009, 0.012, 0.003, -0.007, 0.006, 0.005, -0.048, 0.025, -0.012]
returns_apple = [-0.002, 0.007, -0.004, -0.004, 0.002, 0.013, -0.011, 0.017, -0.001, 0.012, 0.006]

1. You are given the stock returns for the General Motors Company (GM), Ford Motor Company (F), Exxon Mobil Corporation (XOM), and Apple Inc. (AAPL) in the cell above. We are also provided with a function called `calculate_correlation()` for calculating the correlation coefficient between two datasets.

    Run the code to see the coefficient for General Motors and Ford. Is there a correlation between them? If so, what kind?

In [43]:
corr_gm_ford = calculate_correlation(returns_general_motors, returns_ford)
print(f'The correlation coefficient between General Motors and Ford is {corr_gm_ford}')

The correlation coefficient between General Motors and Ford is 0.8414599743167742


2. Call `calculate_correlation()` and pass in `returns_general_motors` and `returns_exxon_mobil`. Print out the result in the string `'The correlation coefficient between General Motors and ExxonMobil is X'`, where `X` is the correlation coefficient.

    Is there a correlation between them? If so, what kind?

In [44]:
corr_gm_exxon = calculate_correlation(returns_general_motors, returns_exxon_mobil)
print(f'The correlation coefficient between General Motors and ExxonMobil is {corr_gm_exxon}')

The correlation coefficient between General Motors and ExxonMobil is -0.7032246241393197


3. Call `calculate_correlation()` and pass in `returns_general_motors` and `returns_apple`. Print out the result in the string `'The correlation coefficient between General Motors and Apple is X'`, where `X` is the correlation coefficient.

    Is there a correlation between them? If so, what kind?

In [45]:
corr_gm_apple = calculate_correlation(returns_general_motors, returns_apple)
print(f'The correlation coefficient between General Motors and Apple is {corr_gm_apple}')

The correlation coefficient between General Motors and Apple is -0.05181389942186936


4. Python's `numpy` library also has a function for generating a coefficient matrix that displays the correlation between all pairs of datasets in a list. Use the following code to print out `corrcoef_matrix`.

    `corrcoef_matrix = np.corrcoef([returns_general_motors, returns_ford, returns_exxon_mobil, returns_apple])`

    How well do the other assets correlate with each other?

    **Hint**: The `corrcoef_matrix` that is printed out displays the correlation between all pairs of datasets in the list that was passed to the function:

    `[returns_general_motors, returns_ford, returns_exxon_mobil, returns_apple]`

    The list serves as both the row names and column names of the matrix.

    For example, the value in the 1st row and 3rd column (which is equal to the value in the 3rd row and 1st column) is the correlation coefficient between `returns_general_motors` and `returns_exxon_mobil`.

    The values along the diagonal of the matrix are the correlation coefficients of each dataset with itself, which will always be `1` (indicating a perfect correlation).

    See <a href="https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html">here</a> for more info on the `corrcoef()` function!

In [46]:
corrcoef_matrix = np.corrcoef([returns_general_motors, returns_ford, returns_exxon_mobil, returns_apple])

corrcoef_matrix

array([[ 1.        ,  0.84145997, -0.70322462, -0.0518139 ],
       [ 0.84145997,  1.        , -0.87407739, -0.1286648 ],
       [-0.70322462, -0.87407739,  1.        ,  0.09955855],
       [-0.0518139 , -0.1286648 ,  0.09955855,  1.        ]])

***

## Correlation II

Now that we have a good understanding of what correlation is and how to use it to assess the risk in an investment, let’s take a closer look at how the correlation coefficient is calculated.

Below is the formula for the Pearson correlation coefficient:

$$
r_{xy} = \frac{n * \Sigma{(X_{i} * Y_{i})} - \Sigma{X_{i}} * \Sigma{Y_{i}}}{\sqrt{n * \Sigma{X_{i}^{2}} - (\Sigma{X_{i}})^{2}} \sqrt{n * \Sigma{Y_{i}^{2}} - (\Sigma{Y_{i}})^{2}}}
$$

* $r_{xy}$: correlation coefficient
* $X_{i}$: the ith value in dataset X
* $Y_{i}$: the ith value in dataset Y
* $n$: the number of values in the dataset

This equation may seem overwhelming, but do not worry! We will break it down.

***
### Exercise

In [47]:
from math import sqrt
import numpy as np

def calculate_correlation(set_x, set_y):
  # Sum of all values in each dataset
  sum_x = sum(set_x)
  sum_y = sum(set_y)

  # Sum of all squared values in each dataset
  sum_x2 = sum([x ** 2 for x in set_x])

  sum_y2 = 0
  for y in set_y:
    sum_y2 += y ** 2

  # Sum of the product of each respective element in each dataset 
  sum_xy = 0
  for i in range(len(set_x)):
    sum_xy += set_x[i] * set_y[i]

  # Length of dataset
  n = len(set_x)

  # Calculate correlation coefficient
  numerator = n * sum_xy - sum_x * sum_y
  denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

  return numerator / denominator

1. We have provided the code for the `calculate_correlation()` function that calculates the correlation coefficient between two input datasets. It is a long one! Take a moment to understand what each section is doing. Proceed to the next checkpoint when you are ready.

    **Hint**: The `calculate_correlation()` function takes two lists as inputs, `set_x` and `set_y`, each containing n values. We will break down the correlation coefficient formula and calculate each individual component.

    First, we calculate the sum of all values in each dataset:

* `sum_x`: $\Sigma X_{i} = X_{1} + X_{2} + ... + X_{n}$
* `sum_y`: $\Sigma Y_{i} = Y_{1} + Y_{2} + ... + Y_{n}$

    Then, we find the sum of all squared values in each dataset:

* `sum_x2`: $\Sigma X^{2}_{i} = \left(X_{1}\right)^{2} + \left(X_{2}\right)^{2} + ... + \left(X_{n}\right)^{2}$
* `sum_y2`: $\Sigma Y^{2}_{i} = \left(Y_{1}\right)^{2} + \left(Y_{2}\right)^{2} + ... + \left(Y_{n}\right)^{2}$

    Finally, we add up the product of each respective element from each dataset:

* `sum_xy`: $\Sigma \left(X_{i} * Y_{i}\right) = \left(X_{1} * Y_{1}\right) + \left(X_{2} * Y_{2}\right) + ... + \left(X_{n} * Y_{n}\right)$

    With these summations calculated, we can put it all together and compute the correlation coefficient with the given formula!

2. As you may have noticed, there are multiple ways to write the code to achieve the same end result. For example, `sum_x2` and `sum_y2` both store the sum of all squared values in a list, but `sum_x2` is obtained using a list comprehension while `sum_y2` is obtained using a for loop.

    Although both approaches are perfectly valid, it is always a good idea to look out for ways to simplify code and improve readability. The list comprehension used for `sum_x2` looks much cleaner and more concise! Let us refactor `sum_y2` to use a list comprehension as well.

In [48]:
def calculate_correlation(set_x, set_y):
  # Sum of all values in each dataset
  sum_x = sum(set_x)
  sum_y = sum(set_y)

  # Sum of all squared values in each dataset
  sum_x2 = sum([x ** 2 for x in set_x])
  sum_y2 = sum([y ** 2 for y in set_y])
  
  # Sum of the product of each respective element in each dataset 
  sum_xy = 0
  for i in range(len(set_x)):
    sum_xy += set_x[i] * set_y[i]

  # Length of dataset
  n = len(set_x)

  # Calculate correlation coefficient
  numerator = n * sum_xy - sum_x * sum_y
  denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

  return numerator / denominator

3. Now, let us take a look at how `sum_xy` is obtained. We loop through the indices generated by `range(len(set_x))` and access the corresponding elements of `set_x` and `set_y` by their index `i`.

    Again, while there is nothing wrong with this approach, index-based code is not the most intuitive or the easiest to read. `zip()` to the rescue!

    Recall that `zip()` is a built-in function that takes two (or more) lists as inputs and returns an iterator that groups the corresponding elements together in tuples. Replace `range(len(set_x))` with a call to `zip()` so that we can loop through the elements of both lists simultaneously.
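
As a quick refresher, the sketch below shows what `zip()` produces and how to loop over it. The lists `prices` and `quantities` are hypothetical names chosen only for this illustration.

```python
# Hypothetical example lists, chosen only to illustrate zip()
prices = [10, 20, 30]
quantities = [1, 2, 3]

# zip() yields one tuple per position: (prices[0], quantities[0]), and so on
pairs = list(zip(prices, quantities))
print(pairs)  # [(10, 1), (20, 2), (30, 3)]

# Looping over zip() gives both elements at once, with no index needed
total = 0
for pair in zip(prices, quantities):
  total += pair[0] * pair[1]

print(total)  # 10*1 + 20*2 + 30*3 = 140
```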

In [49]:
from math import sqrt

def calculate_correlation(set_x, set_y):
  # Sum of all values in each dataset
  sum_x = sum(set_x)
  sum_y = sum(set_y)

  # Sum of all squared values in each dataset
  sum_x2 = sum([x ** 2 for x in set_x])
  sum_y2 = sum([y ** 2 for y in set_y])
  
  # Sum of the product of each respective element in each dataset 
  sum_xy = 0
  for item in zip(set_x, set_y):
    sum_xy += item[0] * item[1]

  # Length of dataset
  n = len(set_x)

  # Calculate correlation coefficient
  numerator = n * sum_xy - sum_x * sum_y
  denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

  return numerator / denominator

4. The code for `sum_xy` is looking cleaner already! Is there a way to simplify it further? That is right - list comprehension! Modify your code to use a list comprehension to obtain `sum_xy`.

In [50]:
def calculate_correlation(set_x, set_y):
  # Sum of all values in each dataset
  sum_x = sum(set_x)
  sum_y = sum(set_y)

  # Sum of all squared values in each dataset
  sum_x2 = sum([x ** 2 for x in set_x])
  sum_y2 = sum([y ** 2 for y in set_y])
  
  # Sum of the product of each respective element in each dataset 
  sum_xy = sum([x * y for x, y in zip(set_x, set_y)])

  # Length of dataset
  n = len(set_x)

  # Calculate correlation coefficient
  numerator = n * sum_xy - sum_x * sum_y
  denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

  return numerator / denominator
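
One way to sanity-check the refactored function is to call it on datasets whose correlation is known in advance. The sketch below repeats the function in compact form so that it runs on its own. The three small lists are made up for illustration.

```python
from math import sqrt

def calculate_correlation(set_x, set_y):
  # Same formula as the version above, repeated so this sketch is self-contained
  n = len(set_x)
  sum_x, sum_y = sum(set_x), sum(set_y)
  sum_x2 = sum([x ** 2 for x in set_x])
  sum_y2 = sum([y ** 2 for y in set_y])
  sum_xy = sum([x * y for x, y in zip(set_x, set_y)])
  numerator = n * sum_xy - sum_x * sum_y
  denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
  return numerator / denominator

# Made-up datasets for illustration only
rising = [1, 2, 3, 4]
also_rising = [10, 20, 30, 40]  # moves in step with `rising`
falling = [8, 6, 4, 2]          # moves opposite to `rising`

print(calculate_correlation(rising, also_rising))  # 1.0
print(calculate_correlation(rising, falling))      # -1.0
```

Note that if either dataset is constant (for example `[5, 5, 5]`), the denominator is zero and the function as written raises a `ZeroDivisionError`.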

***

## Review

Congratulations on reaching the end!

In this lesson, we learned to calculate and understand the rate of return of an investment:

* **Simple Rate of Return** – advantageous for aggregating over assets
* **Logarithmic Rate of Return** – advantageous for aggregating over time
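
As a brief reminder of why each form suits its use, here is a minimal sketch with hypothetical numbers. It assumes the log return is defined as `log(1 + r)` for a simple return `r`, and that a portfolio's simple return is the weighted sum of its assets' simple returns.

```python
from math import log, exp

# Hypothetical simple returns for one asset over two consecutive periods
simple_returns = [0.10, -0.05]

# Over time, simple returns must be compounded multiplicatively...
total_simple = (1 + simple_returns[0]) * (1 + simple_returns[1]) - 1

# ...while log returns can simply be added up
log_returns = [log(1 + r) for r in simple_returns]
total_log = sum(log_returns)

# Converting the summed log return back recovers the compounded simple return
print(round(total_simple, 4))        # 0.045
print(round(exp(total_log) - 1, 4))  # 0.045

# Across assets, simple returns combine as a weighted sum
# (hypothetical weights and single-period returns for two assets)
weights = [0.6, 0.4]
asset_returns = [0.10, 0.02]
portfolio_return = sum([w * r for w, r in zip(weights, asset_returns)])
print(round(portfolio_return, 4))    # 0.068
```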

We also covered other key financial statistics and what they signify in terms of the risk of an investment:

* **Variance** – measure of the spread of a dataset; an asset with low variance is less risky
* **Standard Deviation** – square root of the variance; easier to interpret than variance because it has the same unit as the original dataset
* **Correlation** – measure of the linear association between datasets; assets with no correlation have returns with no linear relationship to each other

Let us practice what we have learned!

***
### Exercise

In [51]:
annual_returns = [0.02, 0.05, -0.04, 0.04, 0.02, -0.02, 0.01, 0.03, 0.05, 0.02]

1. We are given the historical annual rates of return for an investment in the list `annual_returns`. Recall that the calculated returns are typically in decimal form, as seen here. But more often, we’d express this value as a percentage.

    Use the `display_as_percentage()` function to display each value in `annual_returns` as a percentage. Store the resulting list in a variable called `annual_returns_percentage`. Feel free to print out the list to see what it looks like.

In [52]:
annual_returns_percentage = [display_as_percentage(i) for i in annual_returns]

annual_returns_percentage

['2.0%',
 '5.0%',
 '-4.0%',
 '4.0%',
 '2.0%',
 '-2.0%',
 '1.0%',
 '3.0%',
 '5.0%',
 '2.0%']

2. Great! Now, let us further improve the readability of the data by putting it in some context and formatting it in a sentence.

    Print out the string: `'The historical annual rates of return are: X%, X%, ..., X%'`, where the percent rates of return are separated by commas.

    How do the rates of return change over the years? Is the investment overall profitable?

In [53]:
print(f'The historical annual rates of return are: {", ".join(annual_returns_percentage)}')

The historical annual rates of return are: 2.0%, 5.0%, -4.0%, 4.0%, 2.0%, -2.0%, 1.0%, 3.0%, 5.0%, 2.0%


3. Now let us calculate the variance of the rates of return. Use the `calculate_variance()` function and store the returned value in a variable called `variance`.

    Copy the following line to the text editor to print out the result:

    `print('The variance of the rates of return is', variance)`

    Can you tell anything about the relative risk of the investment from this number?

In [54]:
variance = calculate_variance(annual_returns)

print(f'The variance of the rates of return is {variance:.5f}')

The variance of the rates of return is 0.00076


4. Recall that the variance is not always the most intuitive statistic to interpret because it does not have the same unit of measurement as the original dataset. But standard deviation does!

    Use the `calculate_stddev()` function to calculate the standard deviation of the rates of return. Since the standard deviation has the same unit as the rate of return, also use `display_as_percentage()` to format the value and store it in a variable called `stddev`.

    Copy the following line to the editor to print out the result:

    `print('The standard deviation of the rates of return is', stddev)`

    What can you tell about the overall spread of the data and risk of the investment?

In [55]:
stddev = calculate_stddev(annual_returns)

print(f'The standard deviation of the rates of return is {display_as_percentage(stddev)}')

The standard deviation of the rates of return is 2.7%
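
As a cross-check, the printed values above are consistent with the population variance and standard deviation, which divide by $n$ rather than $n - 1$. The sketch below reproduces them with Python's built-in `statistics` module. It assumes the lesson's `calculate_variance()` and `calculate_stddev()` functions, which are defined earlier and not shown here, compute the population versions.

```python
from statistics import pvariance, pstdev

annual_returns = [0.02, 0.05, -0.04, 0.04, 0.02, -0.02, 0.01, 0.03, 0.05, 0.02]

# Population variance and standard deviation (divide by n, not n - 1)
print(f'{pvariance(annual_returns):.5f}')  # 0.00076
print(f'{pstdev(annual_returns):.1%}')     # 2.7%
```

The sample versions (`statistics.variance()` and `statistics.stdev()`) would give slightly larger values for the same data.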
