https://github.com/mwaskom/seaborn-data
> - Covariance and correlation are widely used in:
>   - EDA (exploratory data analysis)
>   - Feature selection

> ### Complete Notes Below
> - Covariance and Correlation 
> - Pearson Correlation Coefficient & Spearman Rank Correlation  

## Covariance And Correlation

In [1]:
import seaborn as sns

In [2]:
df=sns.load_dataset('healthexp')
df.head()

Unnamed: 0,Year,Country,Spending_USD,Life_Expectancy
0,1970,Germany,252.311,70.6
1,1970,France,192.143,72.2
2,1970,Great Britain,123.993,71.9
3,1970,Japan,150.437,72.0
4,1970,USA,326.961,70.9


In [5]:
# Covariance matrix (numeric columns only; 'Country' is a string column)
df.cov(numeric_only=True)


Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,201.098848,25718.83,41.915454
Spending_USD,25718.827373,4817761.0,4166.800912
Life_Expectancy,41.915454,4166.801,10.733902


In [7]:
# Pearson correlation coefficient (numeric columns only)
df.corr(method='pearson', numeric_only=True)


Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,1.0,0.826273,0.902175
Spending_USD,0.826273,1.0,0.57943
Life_Expectancy,0.902175,0.57943,1.0
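
The Pearson matrix above is the covariance matrix from `df.cov()` rescaled: each covariance is divided by the product of the two standard deviations, which are the square roots of the diagonal entries. The sketch below assumes the rounded values printed in the output above, so it should agree only to about four decimal places; the names `cov`, `std`, and `corr` are illustrative.

```python
import numpy as np

# Values copied from the df.cov() output above
# (row/column order: Year, Spending_USD, Life_Expectancy).
cov = np.array([
    [201.098848, 25718.827373, 41.915454],
    [25718.827373, 4817761.0, 4166.800912],
    [41.915454, 4166.800912, 10.733902],
])

# Standard deviations are the square roots of the variances on the diagonal.
std = np.sqrt(np.diag(cov))

# Divide every Cov(i, j) by std_i * std_j to get the correlation matrix.
corr = cov / np.outer(std, std)

print(np.round(corr, 4))
```

The off-diagonal entries should line up with the `df.corr(method='pearson')` output, and the diagonal is 1 by construction.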


In [8]:
# Spearman rank correlation (numeric columns only)
df.corr(method='spearman', numeric_only=True)


Unnamed: 0,Year,Spending_USD,Life_Expectancy
Year,1.0,0.931598,0.896117
Spending_USD,0.931598,1.0,0.747407
Life_Expectancy,0.896117,0.747407,1.0


In [None]:
df=sns.load_dataset('flights')
df.head()

Unnamed: 0,year,month,passengers
0,1949,Jan,112
1,1949,Feb,118
2,1949,Mar,132
3,1949,Apr,129
4,1949,May,121


In [10]:
# 'month' is categorical, so restrict the correlation to numeric columns
df.corr(numeric_only=True)


Unnamed: 0,year,passengers
year,1.0,0.921824
passengers,0.921824,1.0


In [11]:
df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [12]:
# 'sex', 'smoker', 'day', and 'time' are categorical; use numeric columns only
df.corr(numeric_only=True)


Unnamed: 0,total_bill,tip,size
total_bill,1.0,0.675734,0.598315
tip,0.675734,1.0,0.489299
size,0.598315,0.489299,1.0


# **Covariance and Correlation**  

Covariance and correlation are **two statistical measures** used to analyze the relationship between two numerical variables.  

- **Covariance** determines the **direction** of the relationship.  
- **Correlation** determines the **strength and direction** of the relationship.  

These concepts are widely used in **data analysis, finance, machine learning, and statistics**.

---

## **1. Covariance**  

### **Definition**  
Covariance measures how two variables change **together**. It indicates whether an increase in one variable **corresponds** to an increase or decrease in another variable.  

### **Mathematical Formula for Covariance**  

For **two random variables** $X$ and $Y$, covariance is given by:  

$$
\text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}
$$

where:  
- $X_i, Y_i$ → Individual values of variables **$X$** and **$Y$**  
- $\bar{X}, \bar{Y}$ → Mean of **$X$** and **$Y$**  
- $n$ → Number of observations  

> **Note:** This formula calculates the **population covariance**. For **sample covariance**, divide by $n-1$ instead of $n$. pandas `DataFrame.cov()` and `np.cov` use the sample ($n-1$) version by default.  
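
To make the difference between the two divisors concrete, here is a minimal sketch that implements the formula directly. The helper name `covariance`, its `ddof` argument, and the small dataset are assumptions made for illustration, not part of the notes above.

```python
import numpy as np

def covariance(x, y, ddof=0):
    """Covariance of two equal-length sequences.

    ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same length")
    # Sum of products of deviations from the respective means.
    deviations = (x - x.mean()) * (y - y.mean())
    return deviations.sum() / (len(x) - ddof)

# Hypothetical data, chosen only to illustrate the two divisors.
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

pop_cov = covariance(x, y, ddof=0)     # 11 / 4
sample_cov = covariance(x, y, ddof=1)  # 11 / 3

print("Population covariance:", pop_cov)
print("Sample covariance:", sample_cov)
```

The sample version is always larger in magnitude than the population version, and the gap shrinks as $n$ grows.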

### **Interpretation of Covariance**  

- **Positive Covariance** $(\text{Cov}(X, Y) > 0)$  
  - $X$ and $Y$ increase together (or decrease together).  
  - Example: More study time tends to go with higher marks.  
- **Negative Covariance** $(\text{Cov}(X, Y) < 0)$  
  - $X$ increases while $Y$ decreases (or vice versa).  
  - Example: More social media usage tends to go with lower productivity.  
- **Zero Covariance** $(\text{Cov}(X, Y) = 0)$  
  - No linear relationship between $X$ and $Y$ (a non-linear relationship may still exist).  
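
Zero covariance rules out only a *linear* relationship. As a small illustration (the data here is made up for the purpose), $Y = X^2$ on a range symmetric around zero is fully determined by $X$ yet has zero covariance with it:

```python
import numpy as np

# Symmetric x values; y depends on x exactly, but not linearly.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Population covariance, following the formula above.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print("Covariance between x and x**2:", cov_xy)
```

The positive and negative products of deviations cancel, so covariance alone would miss this dependence.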

---

## **2. Correlation**  

### **Definition**  
Correlation **standardizes** covariance and measures the **strength and direction** of the relationship between two variables. It ranges from **$-1$ to $+1$**.

### **Mathematical Formula for Correlation**  
$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
$$

where:  
- $\sigma_X$ → **Standard deviation of $X$**  
- $\sigma_Y$ → **Standard deviation of $Y$**  
- $\text{Cov}(X, Y)$ → **Covariance between $X$ and $Y$**  

### **Properties of Correlation Coefficient $(r)$**  

- $r = 1$ → Perfect **positive correlation** (strongest relationship).  
- $r = -1$ → Perfect **negative correlation** (strongest inverse relationship).  
- $r = 0$ → No **linear** correlation (a non-linear relationship may still exist).  
- $0 < r < 1$ → Weak to strong **positive** correlation.  
- $-1 < r < 0$ → Weak to strong **negative** correlation.  

> **Note:** Correlation is **unitless**, making it easy to compare across datasets.

---

## **3. Covariance vs. Correlation: Key Differences**  

| Feature         | Covariance | Correlation |
|---------------|------------|------------|
| **Definition** | Measures the **direction** of the relationship | Measures **strength and direction** of the relationship |
| **Range** | $-\infty$ to $+\infty$ | $-1$ to $+1$ |
| **Unit dependence** | Depends on measurement units | Unitless (normalized) |
| **Comparison** | Cannot compare across datasets | Can compare across datasets |
| **Effect of Scaling** | Affected by change in scale | Not affected by scale changes |
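
The unit-dependence and scaling rows of the table can be checked with a short sketch. The data and variable names below are hypothetical: the same study time is expressed once in hours and once in minutes.

```python
import numpy as np

# Hypothetical data: study time in hours and a test score.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 55.0, 61.0, 60.0, 68.0])

# The same measurements expressed in minutes.
minutes = hours * 60

cov_hours = np.cov(hours, score, bias=True)[0, 1]
cov_minutes = np.cov(minutes, score, bias=True)[0, 1]

r_hours = np.corrcoef(hours, score)[0, 1]
r_minutes = np.corrcoef(minutes, score)[0, 1]

print("Covariance (hours):  ", cov_hours)
print("Covariance (minutes):", cov_minutes)  # 60 times larger
print("Correlation (hours):  ", r_hours)
print("Correlation (minutes):", r_minutes)   # unchanged
```

Changing the unit multiplies the covariance by the conversion factor, while the correlation stays the same because the standard deviation in the denominator scales by the same factor.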

---

## **4. Example Calculation: Covariance and Correlation**  

Let's consider the dataset below:

| X (Study Hours) | Y (Marks) |
|-----------------|----------|
| 2  | 10  |
| 4  | 20  |
| 6  | 30  |
| 8  | 40  |
| 10 | 50  |

### **Step 1: Compute Means**  
$$
\bar{X} = \frac{2 + 4 + 6 + 8 + 10}{5} = 6
$$
$$
\bar{Y} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30
$$

### **Step 2: Compute Covariance**  
$$
\text{Cov}(X, Y) = \frac{(2-6)(10-30) + (4-6)(20-30) + (6-6)(30-30) + (8-6)(40-30) + (10-6)(50-30)}{5}
$$

$$
= \frac{(-4)(-20) + (-2)(-10) + (0)(0) + (2)(10) + (4)(20)}{5} = \frac{80 + 20 + 0 + 20 + 80}{5} = \frac{200}{5} = 40
$$

Since $\text{Cov}(X, Y) > 0$, **$X$ and $Y$ are positively related**.

### **Step 3: Compute Standard Deviations**  
$$
\sigma_X = \sqrt{\frac{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}{5}}
$$
$$
= \sqrt{\frac{16 + 4 + 0 + 4 + 16}{5}} = \sqrt{8} \approx 2.83
$$

$$
\sigma_Y = \sqrt{\frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5}}
$$
$$
= \sqrt{\frac{400 + 100 + 0 + 100 + 400}{5}} = \sqrt{200} \approx 14.14
$$

### **Step 4: Compute Correlation**  
$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{40}{\sqrt{8} \times \sqrt{200}} = \frac{40}{\sqrt{1600}} = 1
$$

Since $r = 1$, there is a **perfect positive correlation** between **study hours and marks**.
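
The four steps can also be followed one at a time in NumPy, which is a convenient way to check the hand arithmetic. This is a sketch using the population ($n$) divisor throughout, as in the formulas above; the variable names are illustrative.

```python
import numpy as np

X = np.array([2, 4, 6, 8, 10], dtype=float)   # study hours
Y = np.array([10, 20, 30, 40, 50], dtype=float)  # marks
n = len(X)

# Step 1: means
mean_x, mean_y = X.mean(), Y.mean()

# Step 2: population covariance (divide by n)
cov_xy = np.sum((X - mean_x) * (Y - mean_y)) / n

# Step 3: population standard deviations (divide by n)
sigma_x = np.sqrt(np.sum((X - mean_x) ** 2) / n)
sigma_y = np.sqrt(np.sum((Y - mean_y) ** 2) / n)

# Step 4: correlation
r = cov_xy / (sigma_x * sigma_y)

print("Means:", mean_x, mean_y)
print("Covariance:", cov_xy)
print("Standard deviations:", round(sigma_x, 2), round(sigma_y, 2))
print("Correlation:", round(r, 6))
```

Using $n-1$ in both the covariance and the standard deviations would change the intermediate values but not $r$, because the divisor cancels in the ratio.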

---

## **5. Python Implementation**  

### **Computing Covariance in Python**

In [1]:
import numpy as np

X = np.array([2, 4, 6, 8, 10])
Y = np.array([10, 20, 30, 40, 50])

# Compute Covariance Matrix
cov_matrix = np.cov(X, Y, bias=True)  # Use bias=True for population covariance
cov_XY = cov_matrix[0, 1]

print("Covariance between X and Y:", cov_XY)

Covariance between X and Y: 40.0


### **Computing Correlation in Python**

In [2]:
import scipy.stats as stats

# Compute Pearson Correlation
corr_coefficient, _ = stats.pearsonr(X, Y)

print("Correlation coefficient (r):", corr_coefficient)

Correlation coefficient (r): 1.0


# **Pearson Correlation Coefficient & Spearman Rank Correlation**  

Correlation measures the **relationship** between two variables. The two most commonly used correlation coefficients are:  

1. **Pearson Correlation Coefficient ($r$)** – Measures **linear relationships**.  
2. **Spearman Rank Correlation ($\rho$)** – Measures **monotonic relationships** (order-based).  

---

## **1. Pearson Correlation Coefficient ($r$)**  

### **Definition**  
The **Pearson Correlation Coefficient ($r$)** measures the **strength and direction** of a **linear relationship** between two continuous variables.  

### **Formula**  
$$
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2} \sqrt{\sum (Y_i - \bar{Y})^2}}
$$  
or equivalently:  
$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
$$  

where:  
- $\text{Cov}(X, Y)$ → Covariance between $X$ and $Y$.  
- $\sigma_X, \sigma_Y$ → Standard deviations of $X$ and $Y$.  

### **Interpretation of $r$**  
- $r = 1$ → **Perfect positive correlation**  
- $r = -1$ → **Perfect negative correlation**  
- $r = 0$ → **No linear correlation**  
- $0 < r < 1$ → **Weak to strong positive correlation**  
- $-1 < r < 0$ → **Weak to strong negative correlation**  

### **Example Calculation**  

Given dataset:  

| $X$ (Hours Studied) | $Y$ (Marks) |
|-----------------|----------|
| 2  | 10  |
| 4  | 20  |
| 6  | 30  |
| 8  | 40  |
| 10 | 50  |

Using the **covariance** ($40$) and **standard deviations** ($\sqrt{8} \approx 2.83$ and $\sqrt{200} \approx 14.14$) of this dataset:  

$$
r = \frac{40}{\sqrt{8} \times \sqrt{200}} = \frac{40}{40} = 1
$$  

Since **$r = 1$**, there is a **perfect positive correlation** between study hours and marks.

---

## **2. Spearman Rank Correlation ($\rho$ or $r_s$)**  

### **Definition**  
The **Spearman Rank Correlation ($\rho$)** measures the **strength and direction** of a **monotonic relationship** between two variables. It is used when data is **ordinal, non-linear, or not normally distributed**.

- Uses **ranked values** instead of raw data.  
- Suitable for **non-linear but monotonic** relationships.  

### **Formula**  

1. **General Formula (Based on Covariance):**  
   $$
   r_s = \frac{\text{Cov}(\text{Rank}_X, \text{Rank}_Y)}{\sigma_{\text{Rank}_X} \sigma_{\text{Rank}_Y}}
   $$
   - Similar to Pearson’s formula but applied to **ranked data**.  
   - Covariance and standard deviation are calculated **on the ranks**.  

2. **Simplified Formula (Using Rank Differences):**  
   $$
   \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
   $$
   where:  
   - $d_i$ → Difference between **ranks** of $X$ and $Y$ for each data point.  
   - $n$ → Total number of observations.  

   This shortcut is exact only when there are **no tied ranks**; with ties, use the general formula on average ranks.  
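
Both formulas can be compared on a small example. The sketch below assumes made-up data with no tied values; the helper `ranks` is hypothetical and does not handle ties (`scipy.stats.rankdata` assigns average ranks when ties are present).

```python
import numpy as np

def ranks(values):
    """1-based ranks of the values; assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

# Hypothetical data with no tied values.
x = np.array([12, 15, 19, 24, 30], dtype=float)
y = np.array([3.1, 2.4, 7.8, 5.0, 9.9])

rank_x, rank_y = ranks(x), ranks(y)
n = len(x)

# General formula: Pearson correlation computed on the ranks.
cov_ranks = np.mean((rank_x - rank_x.mean()) * (rank_y - rank_y.mean()))
rho_general = cov_ranks / (rank_x.std() * rank_y.std())

# Simplified formula: based on squared rank differences.
d = rank_x - rank_y
rho_simplified = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(f"General formula:    {rho_general:.3f}")     # 0.800
print(f"Simplified formula: {rho_simplified:.3f}")  # 0.800
```

Here the ranks of `y` are 2, 1, 4, 3, 5, so $\sum d_i^2 = 4$ and both routes give $1 - \frac{24}{120} = 0.8$.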


#### **Key Points**  
✅ **Based on ranks**, not raw values.  
✅ **Suitable for ordinal data** and **non-parametric** (no assumption about the underlying distribution).  
✅ **Captures monotonic trends**, even if the relationship is not **linear**.  
✅ **Less sensitive to outliers** than Pearson’s correlation.  
✅ **Used in cases** where data is **not normally distributed**.  

### **Interpretation of $\rho$**  
- $\rho = 1$ → Perfectly increasing monotonic relationship.  
- $\rho = -1$ → Perfectly decreasing monotonic relationship.  
- $\rho = 0$ → No monotonic relationship.  

### **Example Calculation**  

| $X$ | $Y$ | Rank(X) | Rank(Y) | $d_i$ = Rank(X) - Rank(Y) | $d_i^2$ |
|----|----|--------|--------|----|----|
| 2  | 10  | 1 | 1 | 0 | 0 |
| 4  | 20  | 2 | 2 | 0 | 0 |
| 6  | 30  | 3 | 3 | 0 | 0 |
| 8  | 40  | 4 | 4 | 0 | 0 |
| 10 | 50  | 5 | 5 | 0 | 0 |

$$
\rho = 1 - \frac{6 (0)}{5(5^2 - 1)} = 1
$$  

Since **$\rho = 1$**, the variables have a **perfect monotonic relationship**.

---

## **3. Pearson vs. Spearman: Key Differences**  

| Feature            | Pearson Correlation ($r$) | Spearman Rank Correlation ($\rho$) |
|-------------------|------------------|------------------|
| **Measures** | **Linear** relationship | **Monotonic** relationship |
| **Data Type** | Continuous (interval/ratio) | Ordinal or continuous |
| **Effect of Outliers** | Sensitive to outliers | Less sensitive to outliers |
| **Use Case** | Linear relationships, roughly normal data | Monotonic (possibly non-linear) relationships, skewed or ordinal data |
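
The first row of the table shows up clearly on data that is monotonic but far from linear. The sketch below uses an exponential curve as an assumed example and computes Spearman as the Pearson correlation of the ranks; the `ranks` helper is hypothetical and assumes no ties.

```python
import numpy as np

def ranks(values):
    """1-based ranks of the values; assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

x = np.arange(1, 11, dtype=float)
y = np.exp(x)  # strictly increasing, but strongly non-linear

# Pearson on the raw values measures how close the points are to a line.
pearson = np.corrcoef(x, y)[0, 1]

# Spearman is Pearson applied to the ranks, so only the ordering matters.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because `y` always increases with `x`, the ranks match exactly and Spearman reports a perfect monotonic relationship, while Pearson is noticeably below 1 since the points do not fall on a straight line.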

---

## **4. Python Implementation**  

### **Pearson Correlation in Python**

In [3]:
import numpy as np
import scipy.stats as stats

X = np.array([2, 4, 6, 8, 10])
Y = np.array([10, 20, 30, 40, 50])

# Compute Pearson Correlation
pearson_corr, _ = stats.pearsonr(X, Y)
print("Pearson Correlation:", pearson_corr)

Pearson Correlation: 1.0


### **Spearman Rank Correlation in Python**

In [4]:
# Compute Spearman Rank Correlation
spearman_corr, _ = stats.spearmanr(X, Y)
print("Spearman Rank Correlation:", spearman_corr)

Spearman Rank Correlation: 0.9999999999999999


## **Conclusion**  
- **Pearson ($r$)** → Use when the relationship is **linear** and the data is roughly **normally distributed** without strong outliers.  
- **Spearman ($\rho$)** → Use when data is **ordinal**, the relationship is **monotonic but non-linear**, or the data has **outliers**.  

Choose between them based on the **data type, the shape of the relationship, and the distribution** of the data. 🚀