In machine learning (ML), understanding the relationships between variables is crucial for building accurate and effective models. Two important concepts that help in quantifying relationships between variables are covariance and correlation.

1. **Covariance:**
Covariance is a measure that indicates the extent to which two variables change together. If the values of two variables tend to increase or decrease together, their covariance will be positive. If one variable tends to increase while the other decreases, their covariance will be negative. A covariance close to zero indicates that there is little to no linear relationship between the variables.

Mathematically, the covariance between two variables X and Y is calculated using the following formula:

Cov(X, Y) = Σ [(xi - μx) * (yi - μy)] / N

where:
- xi, yi: Individual data points for variables X and Y
- μx, μy: Means of variables X and Y
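- N: Total number of data points (this is the population form; the sample covariance, which pandas' `df.cov()` computes by default, divides by N - 1 instead)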
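As a minimal sketch of the formula above, the snippet below computes the covariance by hand and compares it with NumPy. The arrays `x` and `y` and the helper `population_covariance` are hypothetical, made up only for illustration; they are not part of the dataset used later in this notebook.

```python
import numpy as np

# Hypothetical toy data, chosen only so the arithmetic is easy to follow
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def population_covariance(x, y):
    """Covariance with N in the denominator, matching the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sum of products of deviations from the means, divided by N
    return np.sum((x - x.mean()) * (y - y.mean())) / len(x)


cov_manual = population_covariance(x, y)

# np.cov returns a 2x2 covariance matrix; entry [0, 1] is Cov(x, y).
# bias=True divides by N, while the default divides by N - 1.
cov_numpy = np.cov(x, y, bias=True)[0, 1]
cov_sample = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy, cov_sample)  # 1.2, 1.2 and 1.5 (up to float rounding)
```

The population and sample versions differ only in the denominator, so the gap between them shrinks as N grows.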

However, the magnitude of covariance is hard to interpret because it depends on the units and scales of the variables. Covariance is therefore not a standardized measure, and its values cannot be compared directly across different pairs of variables.

2. **Correlation:**
Correlation is a standardized measure that quantifies the strength and direction of the linear relationship between two variables. It's normalized to always fall within the range of -1 to 1. A correlation coefficient of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

The most commonly used correlation coefficient is Pearson's correlation coefficient (r), which is calculated as:

r = Cov(X, Y) / (σx * σy)

where:
- Cov(X, Y): Covariance between variables X and Y
- σx, σy: Standard deviations of variables X and Y

Correlation is a valuable tool for understanding how changes in one variable relate to changes in another. It is particularly useful in feature selection, identifying multicollinearity (when variables are highly correlated), and gaining insights into the relationships between features in your dataset.

In summary, covariance and correlation are essential concepts in machine learning for assessing relationships between variables. Covariance provides a basic understanding of whether variables tend to vary together or in opposite directions, while correlation quantifies the strength and direction of linear relationships in a standardized way. Both concepts are fundamental for feature engineering, model selection, and understanding the underlying patterns within your data.

In [1]:
import seaborn as sns

In [4]:
df=sns.load_dataset('healthexp')

In [5]:
df.head()

|  | Year | Country | Spending_USD | Life_Expectancy |
|---|---|---|---|---|
| 0 | 1970 | Germany | 252.311 | 70.6 |
| 1 | 1970 | France | 192.143 | 72.2 |
| 2 | 1970 | Great Britain | 123.993 | 71.9 |
| 3 | 1970 | Japan | 150.437 | 72.0 |
| 4 | 1970 | USA | 326.961 | 70.9 |


In [7]:
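## covariance (numeric columns only; 'Country' is non-numeric and would raise an error in pandas 2.0+)
df.cov(numeric_only=True)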

|  | Year | Spending_USD | Life_Expectancy |
|---|---|---|---|
| Year | 201.098848 | 25718.83 | 41.915454 |
| Spending_USD | 25718.827373 | 4817761.0 | 4166.800912 |
| Life_Expectancy | 41.915454 | 4166.801 | 10.733902 |


## Correlation
### Spearman

Spearman's rank correlation coefficient is a valuable tool for assessing the strength and direction of monotonic relationships between variables. It's a non-parametric alternative to Pearson's correlation that is particularly useful when dealing with nonlinear relationships or ordinal data.
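To make the difference from Pearson concrete, here is a small sketch on hypothetical data where `y` grows monotonically but nonlinearly with `x`. It also shows that Spearman's coefficient can be obtained by applying Pearson's formula to the ranks of the values. The series below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: y increases with x, but along a cubic curve
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Pearson measures linear association, so it falls short of 1 here
pearson = x.corr(y, method="pearson")

# Spearman only looks at the ordering of the values, which is identical
spearman = x.corr(y, method="spearman")

# Equivalent view: Pearson's r computed on the ranks of x and y
spearman_via_ranks = x.rank().corr(y.rank(), method="pearson")

print(pearson, spearman, spearman_via_ranks)
```

Because the ordering of `y` matches the ordering of `x` exactly, Spearman's coefficient reaches 1 even though the relationship is not a straight line.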


In [10]:
df.corr(method='spearman', numeric_only=True)

|  | Year | Spending_USD | Life_Expectancy |
|---|---|---|---|
| Year | 1.0 | 0.931598 | 0.896117 |
| Spending_USD | 0.931598 | 1.0 | 0.747407 |
| Life_Expectancy | 0.896117 | 0.747407 | 1.0 |


### Pearson
Pearson's correlation coefficient, often denoted as "r," is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship between the variables.

In [11]:
df.corr(method='pearson', numeric_only=True)

|  | Year | Spending_USD | Life_Expectancy |
|---|---|---|---|
| Year | 1.0 | 0.826273 | 0.902175 |
| Spending_USD | 0.826273 | 1.0 | 0.57943 |
| Life_Expectancy | 0.902175 | 0.57943 | 1.0 |


# Skewness

Skewness is a statistical measure that describes the asymmetry of the probability distribution of a dataset. It indicates the degree to which the data is skewed, or "lopsided," towards one tail of the distribution. In other words, it measures the extent to which the values are concentrated on one side of the mean compared to the other side.

Skewness is important because it provides insights into the shape of the distribution and helps to understand the departure from symmetry. There are three main types of skewness:

1. **Negative Skewness (Left Skewness):**
   A distribution is negatively skewed when the tail on the left side of the distribution is longer or fatter than the tail on the right side. This means that the majority of the data points are concentrated towards the right side, and the distribution is stretched towards the left. The mean is typically less than the median in a negatively skewed distribution.

2. **Positive Skewness (Right Skewness):**
   A distribution is positively skewed when the tail on the right side of the distribution is longer or fatter than the tail on the left side. In this case, the majority of the data points are concentrated towards the left side, and the distribution is stretched towards the right. The mean is usually greater than the median in a positively skewed distribution.

3. **No Skewness (Symmetrical Distribution):**
   A distribution is symmetrical when it is balanced around its mean, and both tails are of equal length. In a symmetrical distribution, the mean and the median are approximately equal.

Skewness can be quantified with several formulas. One common choice is Pearson's second coefficient of skewness (also called median skewness):

\[ \text{Skewness} = \frac{3 \times (\text{Mean} - \text{Median})}{\text{Standard Deviation}} \]

Positive values of skewness indicate positive skew (right skew), negative values indicate negative skew (left skew), and a value of 0 is what a symmetrical distribution produces, since its mean and median coincide.
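The following sketch evaluates the coefficient above on three small, hypothetical samples, one for each type of skewness. The helper name `pearson_median_skewness` and the data are made up for illustration, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the formula does not say which version to use.

```python
import numpy as np


def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation, as in the formula above."""
    values = np.asarray(values, dtype=float)
    # Assumption: sample standard deviation (ddof=1)
    return 3 * (values.mean() - np.median(values)) / values.std(ddof=1)


# Hypothetical samples for illustration
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]           # one large value pulls the mean up
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]    # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]                        # mean and median coincide

skew_right = pearson_median_skewness(right_skewed)
skew_left = pearson_median_skewness(left_skewed)
skew_symmetric = pearson_median_skewness(symmetric)

print(skew_right)      # positive: mean (4.75) is greater than median (3)
print(skew_left)       # negative: mean (-4.75) is less than median (-3)
print(skew_symmetric)  # 0.0
```

Note that pandas' `Series.skew()` uses a moment-based definition rather than this coefficient, so its magnitudes will differ from the values printed here.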

Skewness is a useful measure when dealing with datasets that might not follow a normal distribution. In data analysis, understanding the skewness of a dataset can help in choosing appropriate statistical techniques and making accurate interpretations of the data.

![skewness.png](attachment:skewness.png)