# Analyzing Stock Data to Identify Investment Risk

**Author: Daniel Park**

Although the internet dates back to the 1970s, it was not until the 1990s that it became a mainstay in homes across the globe. And with the new age of information came the people who set up shop in that space.

It was during that time that companies began setting up virtual marketplaces, namely Jeff Bezos's Amazon and Pierre Omidyar's eBay, to match buyers and sellers of goods and services. Although both are household names, how do these companies hold up as investments today?

In this analysis, we will assess the risk and return of each of these e-commerce companies by calculating their rates of return along with other key statistics, such as variance and correlation, that are used to assess risk.

Before starting, run the code cell below to load IPython's `autoreload` extension, which reloads imported modules before executing user code so that changes to them are picked up automatically.

In [1]:
%load_ext autoreload
%autoreload 2

## Risk Investment Analysis Helper Functions

In order to calculate the risk and reward of the two companies, we must first set up basic functions to calculate log return, variance, standard deviation, and the correlation coefficient (we will go over why these are important later).

But first, let's import the necessary math functions and define a function for displaying a decimal as a percentage.

In [2]:
from math import log, sqrt

def display_as_percentage(val):
    return '{:.1f}%'.format(val*100)

### Logarithmic Rate of Return

The first helper function that we need to write is the **logarithmic rate of return**, also known as the **continuously compounded return**, given by the formula below.

$$
r = \log(E) - \log(S) = \log\left(\frac{E}{S}\right)
$$

where 

- `r`: logarithmic rate of return
- `S`: starting price of investment
- `E`: ending price of investment

This formula calculates the return on an investment over a given time period, under the assumption that earnings are continuously reinvested.

In [3]:
def calculate_log_return(start_price, end_price):
    return log(end_price / start_price)
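As a quick illustration of how a log return relates to the more familiar simple (percentage) return, here is a minimal sketch. The prices are hypothetical and chosen only for the example; the helper is repeated so the cell runs on its own.

```python
from math import exp, log

def calculate_log_return(start_price, end_price):
    # Same definition as the helper above, repeated so this cell is self-contained.
    return log(end_price / start_price)

# Hypothetical prices, chosen only for illustration.
start_price, end_price = 100.0, 105.0

log_return = calculate_log_return(start_price, end_price)
simple_return = end_price / start_price - 1

# exp() converts a log return back into a simple return.
recovered_simple_return = exp(log_return) - 1

print(f"log return:    {log_return:.4f}")
print(f"simple return: {simple_return:.4f}")
print(f"recovered:     {recovered_simple_return:.4f}")
```

For small price moves the two measures are close, and the log return of a gain is always a little smaller than the corresponding simple return.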

### Variance

One of the key statistics for assessing the risk of an investment is **variance**, which measures the spread of a dataset, or how far each value lies from the mean. The greater the variance, the more spread out the data. It is given by the formula below.

$$\sigma^{2} = \frac{\Sigma(X_{i} - \bar{X})^{2}}{n}$$

- `σ²`: variance
- `Xi`: the ith value in the dataset
- `X̄`: the mean of the dataset
- `n`: the number of values in the dataset


In the context of investing, an asset with a high variance is generally a riskier one because its return can vary significantly in a short period of time, making it less stable and more unpredictable.

In [4]:
def calculate_variance(dataset):
    mean = sum(dataset)/len(dataset)
    numerator = 0
    for data in dataset:
        numerator += (data-mean) ** 2
    return numerator / len(dataset)
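The formula above divides by `n`, which makes it the population variance. As a sanity check, the sketch below compares a compact version of the helper with Python's standard `statistics` module, where `pvariance` divides by `n` and `variance` divides by `n - 1`. The returns are hypothetical values used only for illustration.

```python
from statistics import pvariance, variance

def calculate_variance(dataset):
    # Compact equivalent of the helper above: mean squared deviation from the mean.
    mean = sum(dataset) / len(dataset)
    return sum((value - mean) ** 2 for value in dataset) / len(dataset)

# Hypothetical monthly returns, for illustration only.
sample_returns = [0.02, -0.01, 0.03, -0.04, 0.01]

population = calculate_variance(sample_returns)   # divides by n
library_population = pvariance(sample_returns)    # also divides by n
library_sample = variance(sample_returns)         # divides by n - 1

print(population, library_population, library_sample)
```

With only a dozen or so monthly returns the two conventions give noticeably different numbers, so it is worth being consistent about which one is used when comparing stocks.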

### Standard Deviation


Although the variance is useful in determining the relative risk of an investment, it is not always the easiest statistic to interpret, since it is expressed in squared units rather than the units of the original data. The standard deviation describes the same spread in the original units.

It is calculated simply as the square root of the variance.

$$\sigma = \sqrt{\frac{\Sigma(X_{i} - \bar{X})^{2}}{n}}$$

- `σ`: standard deviation
- `Xi`: the ith value in the dataset
- `X̄`: the mean of the dataset
- `n`: the number of values in the dataset

In [5]:
def calculate_stddev(dataset):
    variance = calculate_variance(dataset)
    return sqrt(variance)
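To see why the standard deviation is easier to read than the variance, the sketch below expresses the same hypothetical returns first as fractions and then as percentages. The helpers are repeated in compact form so the cell runs on its own.

```python
from math import sqrt

def calculate_variance(dataset):
    # Population variance, as defined earlier.
    mean = sum(dataset) / len(dataset)
    return sum((value - mean) ** 2 for value in dataset) / len(dataset)

def calculate_stddev(dataset):
    return sqrt(calculate_variance(dataset))

# Hypothetical monthly returns, for illustration only.
returns_as_fractions = [0.02, -0.01, 0.03, -0.04, 0.01]
returns_as_percent = [value * 100 for value in returns_as_fractions]

# Rescaling the data by 100 rescales the variance by 100 squared,
# but the standard deviation only by 100, like the data itself.
variance_ratio = (calculate_variance(returns_as_percent)
                  / calculate_variance(returns_as_fractions))
stddev_ratio = (calculate_stddev(returns_as_percent)
                / calculate_stddev(returns_as_fractions))

print(variance_ratio, stddev_ratio)
```

Because the standard deviation scales the same way as the data, a value such as 0.10 can be read directly as "about 10% per month".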

### Calculating Correlation


Another important statistic for assessing risk is the correlation between the returns of two assets. Correlation is a measure of how closely two datasets are associated with each other. It is often represented by the correlation coefficient, which is a value that ranges between -1 and 1. A value near 1 indicates a strong positive correlation, a value near -1 a strong negative correlation, and a value near 0 little or no linear correlation.

In finance, two assets from the same industry generally have a positive correlation, as they are likely affected by similar external conditions. So for example, automobile stocks may be positively correlated with each other while oil stocks may be negatively correlated with automobile stocks.

When building a portfolio, it is generally a good idea to include assets that are not correlated with each other. When assets move independently of one another, a loss in one is less likely to coincide with a loss in the others, which lowers the risk of large losses across the whole portfolio. This allows for greater diversification and balances out the overall risk and return of the portfolio.
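The effect of correlation on portfolio risk can be illustrated with the standard two-asset portfolio variance formula, which is not part of this analysis and is included only as a sketch. The weights and volatilities are hypothetical, and `two_asset_portfolio_stddev` is an illustrative helper name.

```python
from math import sqrt

def two_asset_portfolio_stddev(weight_a, stddev_a, stddev_b, correlation):
    # Standard two-asset formula:
    # var = (wa*sa)^2 + (wb*sb)^2 + 2*wa*wb*rho*sa*sb
    weight_b = 1 - weight_a
    portfolio_variance = (
        (weight_a * stddev_a) ** 2
        + (weight_b * stddev_b) ** 2
        + 2 * weight_a * weight_b * correlation * stddev_a * stddev_b
    )
    return sqrt(portfolio_variance)

# Hypothetical portfolio: half in each asset, each with a 10% standard deviation.
for rho in (1.0, 0.5, 0.0, -0.5):
    risk = two_asset_portfolio_stddev(0.5, 0.10, 0.10, rho)
    print(f"correlation {rho:+.1f}: portfolio stddev {risk:.3f}")
```

Under these assumptions the portfolio's standard deviation shrinks as the correlation falls, which is the diversification benefit described above.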

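To get a single value that describes the relationship between two continuous variables, we use the Pearson correlation coefficient, which measures both the strength and direction of their linear relationship.

We use the formula shown below:

$$\rho = \frac{\text{cov}(X,Y)}{\sigma_x \sigma_y}$$

- `ρ`: Pearson correlation coefficient
- `cov(X, Y)`: covariance of X and Y, the mean of the products (Xi - X̄)(Yi - Ȳ), where `X̄` and `Ȳ` are the means of X and Y
- `σx`: standard deviation of X
- `σy`: standard deviation of Y

The function below uses an algebraically equivalent form written in terms of sums, so the means do not have to be computed first.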

In [6]:
# Calculate the Pearson correlation coefficient of two equal-length datasets
def calculate_correlation(set_x, set_y):
    # Sums of the raw values
    sum_x = sum(set_x)
    sum_y = sum(set_y)

    # Sums of squares and of pairwise products
    sum_x2 = sum(x ** 2 for x in set_x)
    sum_y2 = sum(y ** 2 for y in set_y)
    sum_xy = sum(x * y for x, y in zip(set_x, set_y))

    # Sum form of cov(X, Y) / (stddev_x * stddev_y)
    n = len(set_x)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
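To connect the function back to the formula, the sketch below computes the coefficient twice on hypothetical data: once directly from the covariance and standard deviations, and once from the sums. Both versions assume the two lists have the same length, and the names `correlation_from_definition` and `correlation_from_sums` are illustrative only.

```python
from math import sqrt

def correlation_from_definition(set_x, set_y):
    # rho = cov(X, Y) / (stddev_x * stddev_y), using population statistics.
    n = len(set_x)
    mean_x = sum(set_x) / n
    mean_y = sum(set_y) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_x, set_y)) / n
    stddev_x = sqrt(sum((x - mean_x) ** 2 for x in set_x) / n)
    stddev_y = sqrt(sum((y - mean_y) ** 2 for y in set_y) / n)
    return covariance / (stddev_x * stddev_y)

def correlation_from_sums(set_x, set_y):
    # Same quantity with numerator and denominator both multiplied by n squared.
    n = len(set_x)
    sum_x, sum_y = sum(set_x), sum(set_y)
    sum_x2 = sum(x ** 2 for x in set_x)
    sum_y2 = sum(y ** 2 for y in set_y)
    sum_xy = sum(x * y for x, y in zip(set_x, set_y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical, equal-length return series, for illustration only.
xs = [0.02, -0.01, 0.03, -0.04, 0.01]
ys = [0.01, -0.02, 0.02, -0.03, 0.02]

print(correlation_from_definition(xs, ys))
print(correlation_from_sums(xs, ys))
```

The two results agree up to floating-point rounding, which is why the shorter sum form can stand in for the definition.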

## Calculate Rate of Return

Now to start calculating the logarithmic rates of return from the stock prices, I defined a function called `get_returns()` that takes a parameter called `prices`, which will be a list of stock prices.

To calculate the returns, the function iterates through the list of prices, computes the log return of each pair of adjacent prices, and appends it to a list of returns. Working with relative changes rather than absolute price differences keeps a stock's price level from distorting the comparison of risk.

The function will return a list of log returns calculated from each adjacent pair of prices.

In [7]:
def get_returns(prices):
    returns = []
    for i in range(len(prices)-1):
        start_price = prices[i]
        end_price = prices[i+1]
        returns.append(calculate_log_return(start_price, end_price))
    return returns
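Two properties of this function are worth seeing on a small example: a list of `n` prices produces `n - 1` returns, and the returns do not depend on the price level. The sketch below uses a compact `zip`-based equivalent of the loop and hypothetical prices.

```python
from math import log

def get_returns(prices):
    # Compact equivalent of the loop above: pair each price with the next one.
    return [log(end / start) for start, end in zip(prices, prices[1:])]

# Hypothetical prices: the second series is the first multiplied by 100.
cheap_stock = [10.0, 11.0, 9.9, 10.4]
expensive_stock = [price * 100 for price in cheap_stock]

cheap_returns = get_returns(cheap_stock)
expensive_returns = get_returns(expensive_stock)

print(len(cheap_stock), len(cheap_returns))
print(cheap_returns)
print(expensive_returns)
```

The two lists of returns match up to floating-point rounding even though one stock trades at 100 times the price of the other, which is what makes returns comparable across companies.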

To analyze the investment risk of Amazon (AMZN) and eBay (EBAY), I will use data from [Yahoo Finance](https://finance.yahoo.com/): the stock price at the start of each month from June 2021 to June 2022 (to avoid the discrepancy caused by Amazon's 20:1 stock split).

For now, I entered the data manually to focus on learning the economic concepts, but in a later edition I will use BeautifulSoup to scrape the data directly from [Yahoo Finance](https://finance.yahoo.com/).

In [8]:
amazon_prices = [171.73, 167.65, 174.82, 164.45, 168.09, 207.25, 167.55, 150.00, 152.73, 164.15, 152.40, 152.26]
ebay_prices = [61.52, 70.21, 68.81, 77.06, 69.70, 76.47, 68.64, 66.45, 59.91, 54.56, 57.87, 51.97, 49.08]

With the data, I used the `get_returns()` function to find the monthly log rates of return from the Amazon and eBay stock prices, and stored the resulting lists in the variables `amazon_returns` and `ebay_returns`, respectively.

In [9]:
amazon_returns = get_returns(amazon_prices)
ebay_returns = get_returns(ebay_prices)

I then output the lists of monthly returns, displaying each return as a percentage. From the raw lists alone, however, it is hard to tell how the returns of the two stocks compare and whether they are profitable on average.

In [10]:
[display_as_percentage(price_return) for price_return in amazon_returns]

['-2.4%',
 '4.2%',
 '-6.1%',
 '2.2%',
 '20.9%',
 '-21.3%',
 '-11.1%',
 '1.8%',
 '7.2%',
 '-7.4%',
 '-0.1%']

In [11]:
[display_as_percentage(price_return) for price_return in ebay_returns]

['13.2%',
 '-2.0%',
 '11.3%',
 '-10.0%',
 '9.3%',
 '-10.8%',
 '-3.2%',
 '-10.4%',
 '-9.4%',
 '5.9%',
 '-10.8%',
 '-5.7%']

Because log returns are additive, we can sum the monthly returns to get the rate of return over the whole period. A longer window would generally give more reliable estimates, but we restrict the analysis to this one-year period to stay clear of the stock split mentioned earlier.


As shown below, over the period between June 2021 and June 2022 the annual rate of return was negative for both companies, with eBay having a lower annual rate of return than Amazon.

In [12]:
amzn_annual_return = sum(amazon_returns)
ebay_annual_return = sum(ebay_returns)

print(amzn_annual_return)
print(ebay_annual_return)

-0.12033389011770391
-0.22591070535467814
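Summing works because log returns telescope: the intermediate prices cancel, so the total equals the log return from the first price to the last. The sketch below checks this on the Amazon prices listed earlier and converts the result into a simple percentage return with `exp()`.

```python
from math import exp, log

# Amazon prices copied from the data cell above.
amazon_prices = [171.73, 167.65, 174.82, 164.45, 168.09, 207.25,
                 167.55, 150.00, 152.73, 164.15, 152.40, 152.26]

# Monthly log returns between adjacent prices.
monthly_returns = [log(end / start)
                   for start, end in zip(amazon_prices, amazon_prices[1:])]

summed_return = sum(monthly_returns)
direct_return = log(amazon_prices[-1] / amazon_prices[0])

# Convert the total log return into a simple return over the same period.
simple_return = exp(summed_return) - 1

print(f"sum of monthly log returns: {summed_return:.4f}")
print(f"log(last / first):          {direct_return:.4f}")
print(f"simple return:              {simple_return:.1%}")
```

The first two values agree up to floating-point rounding, and the simple return is the ordinary percentage change between the first and last price.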


## Assess Investment Risk

Now to assess the risk of each investment, I calculated the variance of each stock's monthly returns using the `calculate_variance()` function defined earlier and printed the results.

As we see below, the variance for Amazon’s monthly returns is slightly higher than eBay’s. A greater variance generally signifies a riskier investment.

In [13]:
print(calculate_variance(amazon_returns))
print(calculate_variance(ebay_returns))

0.01069053597390166
0.007967784162336419


To further justify our finding with the variance, we can calculate the standard deviation of each stock's monthly returns using the `calculate_stddev()` function, and print out the results.

Amazon’s monthly returns have a greater standard deviation than eBay’s, which agrees with the variance result: investing in Amazon stock is likely riskier. It is worth noting again that Amazon also has the higher (less negative) rate of return over this period, demonstrating the risk-return tradeoff that often exists in investments.

In [14]:
print(display_as_percentage(calculate_stddev(amazon_returns)))
print(display_as_percentage(calculate_stddev(ebay_returns)))

10.3%
8.9%


Finally, I calculated the correlation between the stock returns using the `calculate_correlation()` function. The correlation coefficient is about 0.17, which indicates a weak positive correlation. Note that the two return lists differ in length (11 Amazon returns and 12 eBay returns), so this figure should be treated as a rough estimate.

As noted in the **Calculating Correlation** section, we should be careful about investing in highly correlated stocks to avoid putting all our eggs in one basket, so to speak. Instead, it is wise to invest in uncorrelated stocks, such that a loss in one does not automatically mean a loss in the other. This diversifies the investment portfolio and reduces overall risk.

In [15]:
print(calculate_correlation(amazon_returns, ebay_returns))

0.16507117833663582


This project is meant to serve as an introductory guide to analyzing investment risk and as a gateway into financial analytics with Python, and should not be used as financial advice.