In [0]:
# HTML is exposed by the public IPython.display module.
from IPython.display import HTML
HTML("""
<style>
@import "https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css";
</style>
""")

# Covariance and Correlation

**Overview**
This exercise introduces the covariance and correlation matrices.
It covers:
- Understanding what **covariance** and **correlation** are.
- Implementing both matrices using basic NumPy operations.
- Interpreting what these matrices say about the relationships between variables.


In data analysis and machine learning, understanding the relationships between variables is crucial.

Two key tools for this are the **covariance** and **correlation** matrices.
Recall the definitions: 

**Covariance:**

$$
\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
$$

**Correlation:**

$$
\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
$$
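
Here $\bar{X}$ and $\bar{Y}$ are the sample means, $\sigma_X$ and $\sigma_Y$ are the sample standard deviations, and $n$ is the number of observations.

The data cell in Task 2 defines a synthetic dataset with 4 features, which could represent the _height_, _weight_, _age_ and _shoe size_ of a group of people.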
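
To make the two definitions concrete, the sketch below evaluates them step by step for a single pair of variables. The vectors `x` and `y` are made-up illustrative values, not the exercise data, and the code only handles the scalar case for two variables, not the full matrices.

```python
import numpy as np

# Two short, made-up samples (illustrative values, not the exercise data).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])
n = len(x)

# Deviations of each observation from its sample mean.
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of the products of deviations, divided by n - 1.
cov_xy = np.sum(dx * dy) / (n - 1)

# Sample standard deviations, using the same n - 1 normalization.
sigma_x = np.sqrt(np.sum(dx ** 2) / (n - 1))
sigma_y = np.sqrt(np.sum(dy ** 2) / (n - 1))

# Correlation: covariance rescaled by both standard deviations.
corr_xy = cov_xy / (sigma_x * sigma_y)

print(f"Cov(x, y)  = {cov_xy:.4f}")
print(f"Corr(x, y) = {corr_xy:.4f}")
```

Because the same $n-1$ factor appears in the covariance and in both standard deviations, it cancels in the correlation, which always lies between -1 and 1.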

---
**Task 1 (medium): Reflect on theory💡📽️**
1. Explain the key characteristics of covariance and correlation. Use the following questions to guide your explanation.
    - What does each one assess?
    - How do they describe relationships between variables differently?
    - Why might you prefer one measure over the other in analysis?
    - What does a covariance of zero mean? Does it imply independence?
    - Why are the diagonal elements of the covariance matrix always the variances of the variables?
    - Why are the diagonal elements of the correlation matrix always equal to 1?
    - What does the sign of a covariance or correlation value tell us?
    - Can two variables be strongly related but have a correlation close to zero? Under what circumstances?




---

In [0]:
#Write your reflections here...


---
**Task 2 (easy): Generate data👩‍💻**
1. Run the cell below to load the data.


---

In [0]:
import numpy as np
import pandas as pd
import util_corr_cov

data = {
    'Height': [150, 155, 160, 165, 170, 175, 180, 185, 190, 195],
    'Weight': [50, 53, 57, 60, 65, 70, 72, 75, 78, 80],
    'Age':    [21, 22, 23, 26, 27, 28, 30, 23, 25, 31],
    'Shoe_Size': [36, 37, 38, 39, 40, 41, 42, 41, 42, 39]
}

df = pd.DataFrame(data)
X = df.values
df


---
**Task 3 (easy): Calculate 1👩‍💻**
1. Complete the functions `covariance_matrix` and `correlation_matrix`.

**Important**
Do not use built-in covariance or correlation functions (for example `np.cov`, `np.corrcoef`, `DataFrame.cov` or `DataFrame.corr`) in your implementation.




---

In [0]:
def covariance_matrix(X):
    """
    Computes the covariance matrix manually.

    Parameters
    ----------
    X : numpy.ndarray, shape (n_samples, n_features)

    Returns
    -------
    cov_matrix : numpy.ndarray, shape (n_features, n_features)
        A square matrix representing the covariance between each pair of variables.
    """
    # Write your code here


def correlation_matrix(X):
    """
    Computes the correlation matrix manually using the covariance matrix.

    Parameters
    ----------
    X : numpy.ndarray, shape (n_samples, n_features)

    Returns
    -------
    corr_matrix : numpy.ndarray, shape (n_features, n_features)
        A square matrix representing the correlation between each pair of variables.
    """
    # Write your code here


---
**Task 4 (easy): Calculate 2👩‍💻**
1. Run the cell below to calculate and visualize the covariance and correlation matrices.


---

In [0]:
cov_mat = covariance_matrix(X)
print("Covariance Matrix:\n", cov_mat)

corr_mat = correlation_matrix(X)
print("\nCorrelation Matrix:\n", corr_mat)

util_corr_cov.plot_cov(df, cov_mat)
util_corr_cov.plot_corr(df, corr_mat)
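
One possible way to sanity-check the printed matrices is to compare them with NumPy's reference results. The helper below is a hypothetical addition, not part of the exercise or of `util_corr_cov`. It uses `np.cov` and `np.corrcoef` for verification only, so it does not replace the manual implementation asked for in Task 3. The demo array `X_demo` is made up.

```python
import numpy as np

def check_against_numpy(X, cov_mat, corr_mat):
    """Hypothetical helper: compare matrices with NumPy's reference results.

    Returns a tuple (cov_ok, corr_ok) of booleans.
    """
    # rowvar=False: each column is a variable, each row an observation.
    # np.cov normalizes by n - 1 by default, matching the definition above.
    cov_ok = np.allclose(cov_mat, np.cov(X, rowvar=False))
    corr_ok = np.allclose(corr_mat, np.corrcoef(X, rowvar=False))
    return cov_ok, corr_ok

# Demo on a small made-up array (4 observations, 2 variables).
X_demo = np.array([[1.0, 2.0], [2.0, 4.5], [3.0, 5.0], [4.0, 8.5]])
ref_cov = np.cov(X_demo, rowvar=False)
ref_corr = np.corrcoef(X_demo, rowvar=False)

# The reference matrices pass the check by construction.
print(check_against_numpy(X_demo, ref_cov, ref_corr))

# A covariance matrix normalized by n instead of n - 1 should be flagged.
n_demo = X_demo.shape[0]
wrong_cov = ref_cov * (n_demo - 1) / n_demo
print(check_against_numpy(X_demo, wrong_cov, ref_corr))
```

In the notebook, calling `check_against_numpy(X, cov_mat, corr_mat)` after the cell above should report whether both matrices agree with NumPy up to floating-point tolerance.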


---
**Task 5 (medium): Reflection💡📽️**
Reflect on the results. Use the following questions to guide your reflection:
1. Look at your covariance matrix. Which variables have the largest variances? What does that tell you?
2. Which pairs of variables have the highest positive covariance or correlation?
3. Which pairs show negative or near-zero relationships? What might that indicate?
4. Do the results align with what you expected from the dataset?
5. Compare your covariance and correlation matrices:
    - How do the magnitudes of the numbers differ?
    - What stays consistent between them?
6. Why might correlation be a more useful comparison measure when variables are in different units or scales?
7. If two variables have a high covariance but a low correlation, what might that suggest about the scales or units involved?


---

In [0]:
#Write your reflections here...