### Covariance, Correlation & Coefficient of Correlation

- Correlation:

Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables. The most commonly used correlation coefficient is the Pearson correlation coefficient, which ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and +1 indicates a perfect positive linear relationship.

The formula for calculating the Pearson correlation coefficient (r) is:

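$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$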

where x and y are the variables being analyzed, n is the number of observations, xi and yi are the values of x and y for the ith observation, and x̄ and ȳ are the mean values of x and y, respectively.
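As a minimal sketch, the definition of r can be written out directly in plain Python. The function name `pearson_r` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed directly from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from the means
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    den = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    if den == 0:
        raise ValueError("r is undefined when one of the variables is constant")
    return num / den

# Illustrative data (hypothetical values)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pearson_r(xs, ys))  # roughly 0.77
```

The number of observations n does not appear explicitly because it cancels between the numerator and the denominator.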

- Covariance:

Covariance is a statistical measure that describes how two variables vary together. If the two variables increase or decrease together, their covariance is positive. If they vary in opposite directions, their covariance is negative.

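The formula for calculating covariance, dividing by the number of observations n (the population form), is:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where X and Y are the variables being analyzed, n is the number of observations, xi and yi are the values of X and Y for the ith observation, and x̄ and ȳ are the mean values of X and Y, respectively. The sample covariance uses the same sum but divides by n - 1 instead of n.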

For example, let's say we have the following two sets of data:

X = {1, 2, 3, 4, 5}
Y = {6, 7, 8, 9, 10}

The mean of X is 3 and the mean of Y is 8. Dividing by n = 5, the covariance between X and Y is:

Cov(X,Y) = ((1-3)(6-8) + (2-3)(7-8) + (3-3)(8-8) + (4-3)(9-8) + (5-3)(10-8))/5
= (4 + 1 + 0 + 1 + 4)/5
= 2.0

This positive covariance indicates that X and Y tend to increase together.
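The arithmetic above can be checked with a short sketch. The helper `covariance` and its `ddof` parameter are names assumed here for illustration; `ddof=0` divides by n as in the hand calculation, and `ddof=1` divides by n - 1 (the sample covariance).

```python
def covariance(xs, ys, ddof=0):
    """Covariance of two equal-length sequences, dividing by (n - ddof)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of the products of the deviations from each mean
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (n - ddof)

X = [1, 2, 3, 4, 5]
Y = [6, 7, 8, 9, 10]

print(covariance(X, Y))          # divides by n = 5      -> 2.0
print(covariance(X, Y, ddof=1))  # divides by n - 1 = 4  -> 2.5
```

The sign of the result is the same under either divisor; only the magnitude changes.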

- Correlation coefficient:

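The correlation coefficient is the single number, between -1 and 1, that quantifies the correlation. It is the covariance rescaled by the standard deviations of the two variables, which makes it independent of the units in which the variables are measured.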

The formula for the correlation coefficient (r) is:

r = Cov(X,Y) / (SD(X) * SD(Y))

For example, using the same data as before:

X = {1, 2, 3, 4, 5}
Y = {6, 7, 8, 9, 10}

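The covariance and the standard deviations must use the same divisor. With the covariance of 2.0 computed above (dividing by n), the matching population standard deviations are SD(X) = SD(Y) = √2 ≈ 1.414, so the correlation coefficient between X and Y is:

r = Cov(X,Y) / (SD(X) * SD(Y))
= 2.0 / (1.414 * 1.414)
= 1.0

This means that there is a perfect positive correlation between X and Y, which is expected because Y = X + 5 is an exact linear function of X.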

Note that the correlation coefficient only measures linear relationships between variables, and does not capture other types of relationships, such as curvilinear relationships.
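To illustrate this limitation, here is a small sketch with hypothetical data in which y is completely determined by x (y = x²), yet the Pearson coefficient is zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Hypothetical data: y depends on x exactly, but not linearly
x = np.array([-2, -1, 0, 1, 2])
y = x ** 2  # [4, 1, 0, 1, 4]

r = np.corrcoef(x, y)[0, 1]
print("Correlation coefficient:", r)  # zero, up to floating-point rounding
```

A coefficient near zero therefore rules out a linear trend only; plotting the data is a simple way to spot relationships like this one.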



In [1]:
import numpy as np

# Define two sets of data
x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])

# Compute the covariance matrix
cov_matrix = np.cov(x, y)

# Extract the covariance value
covariance = cov_matrix[0, 1]

# Compute the correlation matrix
corr_matrix = np.corrcoef(x, y)

# Extract the correlation value
correlation = corr_matrix[0, 1]

# "Correlation coefficient" is another name for the same value, so this repeats the lookup above
corr_coef = np.corrcoef(x, y)[0, 1]

print("Covariance:", covariance)
print("Correlation:", correlation)
print("Correlation Coefficient:", corr_coef)

Covariance: 2.5
Correlation: 0.9999999999999999
Correlation Coefficient: 0.9999999999999999


Note that the np.cov function returns a covariance matrix: a 2x2 matrix with the variances of x and y on the diagonal and their covariance off the diagonal. We extract the covariance by selecting the element in the first row and second column (cov_matrix[0, 1]). By default np.cov computes the sample covariance, dividing by n - 1, which is why it returns 2.5 rather than the 2.0 obtained by dividing by n in the hand calculation. Similarly, the np.corrcoef function returns a 2x2 correlation matrix, and we extract the correlation in the same way. The correlation and the correlation coefficient are the same quantity, so both lookups return the same value; it differs from 1 only by floating-point rounding.
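The following sketch shows how the divisor can be chosen through the `ddof` argument of np.cov and np.std, and that the correlation coefficient comes out the same either way as long as the covariance and standard deviations use matching divisors. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])

# Default: sample covariance, divides by n - 1
sample_cov = np.cov(x, y)[0, 1]

# ddof=0: population covariance, divides by n
population_cov = np.cov(x, y, ddof=0)[0, 1]

# The divisor cancels in r when Cov and SD use the same ddof
r_sample = sample_cov / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_population = population_cov / (np.std(x, ddof=0) * np.std(y, ddof=0))

print("Sample covariance:", sample_cov)
print("Population covariance:", population_cov)
print("r (sample divisors):", r_sample)
print("r (population divisors):", r_population)
```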

In [3]:
# Negative correlation
# Define two sets of data with negative correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 8, 6, 4, 2])

# Compute the covariance and correlation
covariance = np.cov(x, y)[0, 1]
correlation = np.corrcoef(x, y)[0, 1]
corr_coef = np.corrcoef(x, y)[0, 1]

print("Covariance:", covariance)
print("Correlation:", correlation)
print("Correlation Coefficient:", corr_coef)

Covariance: -5.0
Correlation: -0.9999999999999999
Correlation Coefficient: -0.9999999999999999


In this example, the two sets of data have a perfect negative correlation: the covariance is negative (-5.0), and the correlation coefficient equals -1.0 up to floating-point rounding, because y = 12 - 2x is an exact decreasing linear function of x.

In [4]:
# Positive Correlation

# Define two sets of data with positive correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Compute the covariance and correlation
covariance = np.cov(x, y)[0, 1]
correlation = np.corrcoef(x, y)[0, 1]
corr_coef = np.corrcoef(x, y)[0, 1]

print("Covariance:", covariance)
print("Correlation:", correlation)
print("Correlation Coefficient:", corr_coef)


Covariance: 5.0
Correlation: 0.9999999999999999
Correlation Coefficient: 0.9999999999999999


In this example, the two sets of data have a perfect positive correlation: the covariance is positive (5.0), and the correlation coefficient equals 1.0 up to floating-point rounding, because y = 2x is an exact increasing linear function of x.

In [5]:
# No Correlation

# Define two sets of data with no correlation
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 3, 5, 1])

# Compute the covariance and correlation
covariance = np.cov(x, y)[0, 1]
correlation = np.corrcoef(x, y)[0, 1]
corr_coef = np.corrcoef(x, y)[0, 1]

print("Covariance:", covariance)
print("Correlation:", correlation)
print("Correlation Coefficient:", corr_coef)

Covariance: -0.25
Correlation: -0.09999999999999999
Correlation Coefficient: -0.09999999999999999


In this example, the two sets of data have no clear linear relationship: the covariance (-0.25) and the correlation coefficient (about -0.1) are both negative but small, with the correlation coefficient close to 0.

### What is Causality in statistics?

Causality in statistics refers to the relationship between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. In other words, causality is the relationship between cause and effect, and it is concerned with understanding how one variable affects another variable.

Establishing causality requires more than just observing that two variables are related - it requires demonstrating that the relationship between the two variables is not just due to chance, but is in fact a true cause-and-effect relationship. This can be challenging, as there may be other variables (known as confounding variables) that are related to both the cause and effect variables, which can make it difficult to determine which variable is truly causing the effect.

To establish causality, researchers often use experimental designs, where one variable (the independent variable) is manipulated and the effect on another variable (the dependent variable) is observed. Randomly assigning participants to groups balances confounding variables across the groups on average, so a difference in outcomes can be attributed to the manipulation with much greater confidence than in an observational comparison.
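A small simulation can make the role of confounding and randomization concrete. This is a sketch under stated assumptions: the data are synthetic, the confounder `z`, the noise scale of 0.5, and the sample size are all arbitrary choices, and in both settings x is constructed to have no effect on y.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5000

# Hypothetical confounder that influences both variables
z = rng.normal(size=n)

# Observational setting: x and y both depend on z; x does not affect y
x_obs = z + 0.5 * rng.normal(size=n)
y_obs = z + 0.5 * rng.normal(size=n)
r_obs = np.corrcoef(x_obs, y_obs)[0, 1]

# Randomized setting: x is assigned by coin flip, independently of z
x_rand = rng.integers(0, 2, size=n)
y_rand = z + 0.5 * rng.normal(size=n)
r_rand = np.corrcoef(x_rand, y_rand)[0, 1]

print(f"Observational correlation: {r_obs:.2f}")  # strongly positive
print(f"Randomized correlation:    {r_rand:.2f}")  # close to zero
```

In the observational setting the shared dependence on z produces a strong correlation even though x has no effect on y. Once x is assigned at random, that spurious correlation disappears.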

Causality is an important concept in many fields, including science, medicine, social science, and economics, as it is necessary to understand the relationship between variables in order to make predictions and develop interventions that can improve outcomes.

Some examples of causality in different fields:

 - Science: A researcher wants to investigate the effect of a new drug on blood pressure. In a randomized controlled trial, some participants are given the drug while others receive a placebo. The researcher measures the participants' blood pressure before and after the treatment. By comparing the results between the two groups, the researcher can establish causality and determine if the drug is effective at reducing blood pressure.

 - Medicine: A physician observes that patients who smoke have a higher risk of developing lung cancer. To build evidence for causality, the physician may conduct a longitudinal study where patients are followed over time to see if smoking is associated with an increased risk of lung cancer, while controlling for other variables such as age, gender, and exposure to environmental toxins. Because participants cannot be randomly assigned to smoke, the causal conclusion depends on how well such confounders are controlled.

 - Social Science: A researcher wants to investigate whether a new intervention program is effective at reducing juvenile delinquency. The researcher randomly assigns participants to either receive the intervention or not, and tracks their behavior over time. By comparing the outcomes between the two groups, the researcher can establish causality and determine if the intervention is effective.

 - Economics: A policy maker wants to investigate the effect of a minimum wage increase on employment rates. By comparing employment rates before and after the minimum wage increase, while controlling for other variables such as inflation and economic growth, the policy maker can estimate whether the increase has had an impact on employment, although confounders that were not measured can still bias a before-and-after comparison.

In each of these examples, the causal claim rests on a carefully designed study: randomizing participants where that is possible, controlling for confounding variables where it is not, and observing the effect of one variable on another. The randomized designs support the strongest conclusions, while the observational ones depend on how completely confounders are accounted for. Either way, the resulting conclusions about the causal relationship between variables can inform policy decisions and improve outcomes.