# SLU05 - Covariance and correlation: Exercise notebook

In [None]:
import pandas as pd
import numpy as np
import math
import utils
import seaborn as sns
import hashlib
from matplotlib import pyplot as plt

def _hash(s):
    return hashlib.blake2b(bytes(str(s), encoding='utf8'), digest_size=5).hexdigest()

In this notebook, you will practice the following:

- Covariance 
- Pearson correlation
- Spearman correlation
- Correlation matrix
- Spurious correlations

## Exercise 1 - Covariance and correlation with Pandas

In this exercise, you will calculate covariance and correlation on a sample dataset.

We're going to use a [dataset of car fuel consumption](https://www.kaggle.com/datasets/anderas/car-consume) for this exercise. Let's begin by taking a quick look at the dataset:

In [None]:
carride = pd.read_csv('data/carride2.csv')
carride.head()

### Exercise 1.1 - Are speed and consumption related?

We'll begin by checking if the speed of the cars is related to the fuel consumption.

Edit the function below so that it returns the covariance, Pearson correlation, and Spearman correlation between speed and consumption.
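
As a reminder of the pandas calls involved, here is a minimal sketch on a made-up table. The DataFrame `toy` and its columns `x` and `y` are invented for illustration and are not part of the car dataset; `Series.cov` and `Series.corr` accept the same arguments if you prefer to work with two Series directly.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the car dataset).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Each call returns a full matrix; .loc picks the entry for the pair we want.
covariance = toy.cov().loc["x", "y"]                       # sample covariance
pearson_corr = toy.corr(method="pearson").loc["x", "y"]    # linear correlation
spearman_corr = toy.corr(method="spearman").loc["x", "y"]  # rank correlation

print(covariance, pearson_corr, spearman_corr)
```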

In [None]:
def check_if_related(consumption, speed):
    """ Calculates covariance and correlations between the given variables.

    Parameters:
        consumption, speed (pd.Series): variables between which to calculate the covariance and correlations

    Returns:
        covariance (float): covariance between the given variables
        pearson_corr (float): Pearson correlation between the given variables
        spearman_corr (float): Spearman correlation between the given variables
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return covariance, pearson_corr, spearman_corr

In [None]:
consumption, speed = utils.get_distance_speed()
cov, p_corr, s_corr = check_if_related(consumption, speed)
np.testing.assert_almost_equal(cov, -25.061536002557343, decimal=3, err_msg="The covariance seems to be wrong.")
np.testing.assert_almost_equal(p_corr, -0.10365751316032523, decimal=3, err_msg="The Pearson correlation seems to be wrong.")
np.testing.assert_almost_equal(s_corr, -0.11331148715392869, decimal=3, err_msg="The Spearman correlation seems to be wrong.")
print("Well done! Everything seems to be in order! Approximated values:\n"
    f"Covariance = {round(cov, 2)}\n"
    f"Pearson correlation  = {round(p_corr, 2)}\n"
    f"Spearman correlation = {round(s_corr, 2)}\n"
    "The results show that the correlation is weak.")

### Exercise 1.2 - Changing units

Now for a simple multiple-choice exercise. Distances in the dataset are measured in meters. Let's assume we want to express them in feet.

We know that 1 meter ≈ 3.28 feet, so a foot is a shorter unit than a meter and every distance becomes a larger number when expressed in feet.

If we extract the covariance and Pearson/Spearman correlations again, but this time with the distances in feet, which of the following statements is true?

- A. The covariance, Pearson correlation and Spearman correlation will decrease.
- B. The covariance will increase, but the Pearson correlation and Spearman correlation will decrease.
- C. They all (covariance, Pearson correlation and Spearman correlation) remain the same.
- D. The covariance will increase, but Pearson correlation and Spearman correlation will remain the same.

Write the letter corresponding to your chosen answer as a text string into the variable `ex1_answer` below.

In [None]:
# ex1_answer = "Z"
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert _hash(ex1_answer.lower()) == '838cd7f570', "Wrong choice. Remember that correlation does not depend on units."
print("Good job!")
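
Once the check above passes, the effect of a unit change can be verified numerically. The sketch below uses invented toy values (`distance_m` and `other` are hypothetical names, not columns of the dataset) and computes Spearman as the Pearson correlation of the ranks.

```python
import pandas as pd

METERS_TO_FEET = 3.28  # conversion factor quoted in the exercise text

# Toy values, invented for illustration only.
distance_m = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
other = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

distance_ft = distance_m * METERS_TO_FEET

# Covariance carries the units of both variables, so it is rescaled.
cov_m = distance_m.cov(other)
cov_ft = distance_ft.cov(other)

# Pearson divides by both standard deviations, so the factor cancels.
pearson_m = distance_m.corr(other)
pearson_ft = distance_ft.corr(other)

# Spearman only looks at ranks, which a positive rescaling leaves untouched.
spearman_m = distance_m.rank().corr(other.rank())
spearman_ft = distance_ft.rank().corr(other.rank())

print(cov_ft / cov_m, pearson_m, pearson_ft, spearman_m, spearman_ft)
```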

## Exercise 2 - Pearson experiment

The following dataset presents the heights of fathers and their sons, based on Karl Pearson's famous experiment from around 1903. The number of cases is 1078. Random noise was added to the original data to produce heights to the nearest 0.1 inch.  
(More info: https://www.kaggle.com/datasets/abhilash04/fathersandsonheight)

In [None]:
fathers_sons_heights = pd.read_csv('data/pearson.csv')
fathers_sons_heights.head()

In [None]:
plt.scatter(fathers_sons_heights['Fathers'],fathers_sons_heights['Sons'])
plt.xlabel('Fathers')
plt.ylabel('Sons')
plt.title('Height of fathers and sons.');

### Exercise 2.1 - What can you read from the plot?
What can we infer from the scatter plot above about the relationship between the two height variables?

- A. There is no correlation between the fathers' and sons' heights.
- B. There is a negative correlation visible.
- C. There is a positive correlation visible.
- D. Nothing can be inferred from just the graph.
    
Write the letter corresponding to your chosen answer as a text string into the variable `ex2_answer` below.

In [None]:
# ex2_answer = "Z"
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert _hash(ex2_answer.lower()) == '2add4c06d4', "Not correct. Look at the overall trend of the points in the scatter plot."
print("Nice!")

### Exercise 2.2 - The outliers
Now let's look at another dataset of fathers' and sons' heights. We can spot some really weird heights, or so-called outliers. Maybe these are the tallest men ever, or maybe giants really existed!

In [None]:
outliers_fathers_sons = pd.read_csv('data/outlier_data.csv')
plt.scatter(outliers_fathers_sons['Fathers'],outliers_fathers_sons['Sons'])
plt.xlabel('Fathers')
plt.ylabel('Sons')
plt.title('Heights of fathers and sons with outliers.');

Calculate the Pearson and Spearman correlations between the fathers' and sons' heights for this dataset (`outliers_fathers_sons`) and the previous one (`fathers_sons_heights`). What results do you expect?
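
To build some intuition first, recall that the Spearman correlation is the Pearson correlation computed on ranks. The sketch below illustrates this on invented toy heights (the table `toy` and its values are hypothetical, not taken from either dataset) and then stretches one value without changing its rank.

```python
import pandas as pd

# Toy heights in inches, invented for illustration only.
toy = pd.DataFrame({
    "fathers": [64.0, 66.0, 67.0, 68.0, 70.0, 72.0],
    "sons":    [65.0, 68.0, 66.5, 68.5, 71.0, 70.5],
})

# Spearman from pandas versus Pearson computed on the ranks by hand.
spearman_builtin = toy.corr(method="spearman").loc["fathers", "sons"]
spearman_by_hand = toy["fathers"].rank().corr(toy["sons"].rank())

# Turn the tallest father into an extreme value; he is still the tallest,
# so every rank stays the same.
with_outlier = toy.copy()
with_outlier.loc[5, "fathers"] = 110.0

pearson_before = toy["fathers"].corr(toy["sons"])
pearson_after = with_outlier["fathers"].corr(with_outlier["sons"])
spearman_before = spearman_builtin
spearman_after = with_outlier.corr(method="spearman").loc["fathers", "sons"]

print(pearson_before, pearson_after)
print(spearman_before, spearman_after)
```

In this toy case the ranks are untouched, so only the Pearson value moves. With real data an outlier can also shift a few ranks, so a small change in the Spearman value is still possible.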

In [None]:
# pearson_corr_normal, spearman_corr_normal = 
# pearson_corr_outlier, spearman_corr_outlier = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
np.testing.assert_almost_equal(pearson_corr_normal - pearson_corr_outlier, 0.07206794228263841, decimal=4,
                err_msg="The Pearson correlations seem to be off. You should see some change between the two datasets.")
np.testing.assert_almost_equal(spearman_corr_normal - spearman_corr_outlier, 0.0020233482820588566, decimal=4,
                err_msg="The Spearman correlation seems to be off. You should see almost no change between the two datasets.")
print("So far, so good!")
# quick plot to see what happens
utils.plot_correlation_bargraph(pearson_corr_normal, pearson_corr_outlier,
                          spearman_corr_normal, spearman_corr_outlier)

### Exercise 2.3 - What to do when you have outliers?

So, unsurprisingly, the taller the fathers, the taller the sons: a positive correlation. But, as you can see, outliers can strongly affect your analysis. When dealing with a dataset that contains outliers, which correlation method should you use?

- A. Pearson.
- B. Spearman.

Write the letter corresponding to your chosen answer as a text string into the variable `ex3_answer` below.

In [None]:
# ex3_answer = "Z"
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert _hash(ex3_answer.lower()) == '9350f68d6b', "Not correct. Look at the changes in the correlations in the bar graphs."
print("Nice! Now you can avoid the data disruption caused by the tallest or "
      "shortest man who ever lived!")

## Exercise 3 - Interest in health issues 

This health search dataset includes an index of search volumes for various common medical topics in the United States. The data covers the period 2004 for all US counties. Source: https://www.searching-for-health.com/

In [None]:
health = pd.read_csv('data/health_issue.csv')
health.head()

Explore the dataset using the tools you learned in this SLU, then answer the questions in the cell below:

- You can use `display()` to pretty-print a DataFrame from anywhere in a cell.
- Use the heatmap of the correlation matrix that we used in the learning notebooks.
- You may want to import something to help with the visualization.
- You can either type in the answers or use a purely programmatic solution.
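
For the programmatic route, one possible pattern is sketched below on a made-up table. The DataFrame `toy` and its columns `a` to `d` are hypothetical and unrelated to the health dataset; only the pandas calls carry over.

```python
import pandas as pd

# Toy table with made-up columns and values (not the health dataset).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "c": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [3.0, 3.5, 1.0, 4.0, 2.0, 5.0],
})

pearson_matrix = toy.corr()                    # Pearson is the default method
spearman_matrix = toy.corr(method="spearman")

# Correlation of every other column with "a", most negative first.
ranking_vs_a = pearson_matrix["a"].drop("a").sort_values()
most_negative_vs_a = ranking_vs_a.index[0]

# A single entry of the Spearman matrix.
spearman_a_b = spearman_matrix.loc["a", "b"]

print(ranking_vs_a)
print(most_negative_vs_a, spearman_a_b)
```

Passing a matrix like `pearson_matrix` to `sns.heatmap(pearson_matrix, annot=True)` gives the kind of heatmap mentioned above.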

In [None]:
# explore the dataset

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Complete the following questions

# Q1: What is the pair of health issues with the most negative Pearson
#   correlation?
# (pass the answer as a list, and remember, you can just type it in, no
#   fancy Pandas needed)
# health_pair_with_lowest_pearson_corr = ...
# YOUR CODE HERE
raise NotImplementedError()

# Q2: What is the health issue with the most negative Pearson
#   correlation with Obesity?
# health_rank_pearson_corr_with_obesity = ...
# YOUR CODE HERE
raise NotImplementedError()

# Q3: What is the Spearman correlation between vaccine and stroke?
# spearman_corr_between_vaccine_and_stroke = ...
# YOUR CODE HERE
raise NotImplementedError()

# Q4: Observe the top Pearson correlation pairs, and then look at the
#   general correlation matrix.
# Which health issue seems to be the most correlated to other health
#   issues making it a possible confounding variable?
# possible_confounding_variable =
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert _hash(sorted([h.lower() for h in health_pair_with_lowest_pearson_corr])) == '506ae2bc48',\
"That is not the pair with the lowest Pearson correlation."
assert _hash(health_rank_pearson_corr_with_obesity.lower()) == '2def1fe0b7', "That is not it, look again."
np.testing.assert_almost_equal(spearman_corr_between_vaccine_and_stroke, 0.187897, decimal=3,
                    err_msg="Wrong spearman correlation value between vaccine and stroke.")
assert _hash(possible_confounding_variable.lower()) == '22037c48a8', "Not correct, check again."
print("You got it!")

## Exercise 4 - Lots of stocks
You were hired by a hedge fund, because money. 

On the first day, your boss, Greedy McRiskyface, asked you to select one stock pair so that he can short one and long the other.

Note: If an investor wants to short (sell) one stock and long (buy) another, it means they expect the prices to move in **opposite directions!** This will help you understand which correlation extreme you're expected to find.

If you select the best possible pair (use Pearson) you get a raise!

The answer should be (1) the two stocks, as a list and (2) their Pearson correlation, as a float.

It is very important that you restart the kernel every time you run this exercise, otherwise the asserts might not pass even though your solution is correct.
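
One way to search a correlation matrix for an extreme pair is to keep only its upper triangle, so that each pair appears once and the diagonal is ignored. The sketch below assumes a made-up price table (`toy_prices` and the tickers `AAA`, `BBB`, `CCC` are hypothetical) and is not the only possible approach.

```python
import numpy as np
import pandas as pd

# Toy prices for three hypothetical tickers, invented for illustration only.
toy_prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0, 13.0, 14.0],
    "BBB": [20.0, 19.0, 18.5, 17.0, 16.0],
    "CCC": [5.0, 7.0, 6.0, 8.0, 7.0],
})

corr = toy_prices.corr()  # Pearson correlation matrix

# Boolean mask of the strict upper triangle (k=1 excludes the diagonal).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Entries outside the mask become NaN; stack() gives a Series indexed by pair.
pairs = corr.where(upper).stack()

lowest_pair = list(pairs.idxmin())   # labels of the most negative pair
lowest_corr = float(pairs.min())     # its Pearson correlation

print(lowest_pair, round(lowest_corr, 3))
```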

In [None]:
stock_data = utils.get_stocks_data_2()
stock_data.head()

In [None]:
# selected_stock_pair = ...
# selected_stock_pair_pearson_corr = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(selected_stock_pair, list), "The variable should be a list."
assert len(selected_stock_pair) == 2, "There should be two stocks in the list."
assert _hash(sorted(selected_stock_pair))=='56085de6e5', 'The selected stock pair is not correct.'
np.testing.assert_almost_equal(selected_stock_pair_pearson_corr, -0.4694947400556539, decimal=3,
                    err_msg="The Pearson correlation value is not correct.")
utils.dirty_little_secret()