# Bivariate analysis

Understand the relationship between pairs of variables






## 01. Features and labels

Bivariate analysis begins the process of inferring cause-and-effect relationships. We need to divide our data fields into those that represent *causes* and those that represent potential *effect* measures.


* **features**: causes, independent variables; the data fields used to explain or predict the labels.
* **labels**: effects, dependent variables; what you want to predict or explain because it represents a valuable outcome of interest.




### 01.01 Example

We want to explain the variable *charges* using all the other columns as features.

In [1]:
# https://www.kaggle.com/mirichoi0218/insurance
import pandas as pd 
df = pd.read_csv('http://www.ishelp.info/data/insurance.csv')
df.head()

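|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |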
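A minimal sketch of the feature/label split described above, assuming *charges* is the label and every other column is a feature. The five rows are copied from the output above so the block runs without a download; the names `insurance`, `features` and `label` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Five rows copied from the df.head() output above (no network access needed).
insurance = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# The label is the outcome we want to explain.
label = insurance['charges']
# The features are every remaining column (the candidate causes).
features = insurance.drop(columns=['charges'])

print(list(features.columns))
print(label.name)
```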


## 02. Effect Size

The effect size of a feature is the strength of its relationship with the label, measured separately for each feature. The appropriate statistic and visualization depend on the data types of the feature and the label.

In [2]:
# Which statistic and plot to use for each combination of data types
label_data_types = ['numeric','numeric','categorical']
feature_data_types = ['numeric','categorical','categorical']
effect_size_stat = ['Pearson correlation','one-way ANOVA','Pearson chi-square']
visualization = ['Scatterplot','Bar Chart','CrossTab']

# Named `summary` to avoid shadowing the built-in `dict`
summary = {
    'Label data type' : label_data_types,
    'Feature data type': feature_data_types,
    'Effect size stat': effect_size_stat,
    'Visualization' : visualization
}

In [3]:
# Use a separate name so the insurance data in `df` is not overwritten
summary_df = pd.DataFrame(summary)
summary_df

|   | Label data type | Feature data type | Effect size stat | Visualization |
|---|---|---|---|---|
| 0 | numeric | numeric | Pearson correlation | Scatterplot |
| 1 | numeric | categorical | one-way ANOVA | Bar Chart |
| 2 | categorical | categorical | Pearson chi-square | CrossTab |


**Note**

Finding an effect does not by itself imply that the effect is causal. We will only treat a relationship as causal when we have theoretical reasons to do so.

### 02.01. Pearson Correlation

A statistical measure of effect size that indicates how strongly two numeric variables vary together: the degree to which a pair of variables is **linearly** related. Significance tests on it assume that the feature and the label being compared are both approximately normally distributed.


The correlation coefficient ranges from -1 to 1. Negative correlations indicate that as one variable increases, the other decreases, whereas positive correlations indicate that as one variable increases, the other increases as well.

Scatterplots provide an easy way to interpret correlation coefficients.




#### 02.01.01. Sample Formula

$r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2  \sum_{i=1}^{N}(y_i - \bar{y})^2}}$
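As a sketch, the formula above can be computed term by term in plain Python. The function name `pearson_r` and the toy data are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation, following the formula above term by term."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of the products of the deviations from each mean.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on an increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a decreasing line
print(r_up, r_down)
```

Points lying exactly on a line give the extreme values of the coefficient: 1 for an increasing line and -1 for a decreasing one.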

#### 02.01.02. Cohen's standard

* Small effect size: .10 ≤ |r| < .30
* Medium effect size: .30 ≤ |r| < .50
* Large effect size: |r| ≥ .50
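A small helper can apply these thresholds, treating them as cut-offs on the absolute value of r so that negative correlations are classified by strength. The function name `cohen_label` is hypothetical, and the `'negligible'` label for values below .10 is an assumption, since that range is not named above.

```python
def cohen_label(r):
    """Map a correlation coefficient to an effect-size label using the cut-offs above."""
    size = abs(r)  # the sign gives the direction, not the strength
    if size >= 0.50:
        return 'large'
    if size >= 0.30:
        return 'medium'
    if size >= 0.10:
        return 'small'
    return 'negligible'  # assumption: this range has no name in the list above

labels = [cohen_label(r) for r in (0.05, 0.20, -0.35, 0.72)]
print(labels)
```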

#### 02.01.03. Assumptions

**Assumptions for a valid correlation.** The degree to which these assumptions hold determines the degree to which we can trust the *r* we calculate.

- Continuous data
- Linear relationship
    - We can apply transformations to the data to make the relationship linear
- Homoskedastic relationship
    - The spread of the points is consistent across all values of x and y (same variance across all values)
    - There are ways to handle violations, for example techniques for correcting skewness

The reason we measure skewness and kurtosis in the **univariate analysis** phase is to assess how much our data violates these assumptions. Violations of any of the assumptions above are usually the result of the numeric variables not having normal distributions.

**Solutions**

- Select a statistic that does not depend on those assumptions
- Adjust the scale of the data using a mathematical transformation
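To illustrate the second option, here is a sketch in which a log transformation turns an exponential relationship into a linear one, raising the Pearson correlation. The data are synthetic and the choice of the logarithm is an assumption made for this example; other data may call for a different transformation.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)

x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]  # exponential, not linear, in x

r_raw = pearson_r(x, y)
# log(y) equals 0.5 * x, so the transformed relationship is linear.
r_log = pearson_r(x, [math.log(yi) for yi in y])

print(round(r_raw, 3), round(r_log, 3))
```

The untransformed r understates a relationship that is in fact perfectly predictable, because Pearson's r only measures the linear part of it.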

#### 02.01.04. Other stats

Non-parametric (a.k.a. distribution-free) statistics calculate a correlation coefficient without depending on the data being normally distributed.

* Kendall's tau
* Spearman's rho
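As a sketch of why rank-based statistics avoid the normality assumption, Spearman's rho can be computed as the Pearson correlation of the ranks of the data. The helper names `average_ranks` and `spearman_rho` are illustrative; in practice a statistics library would normally be used instead.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]  # monotonic but not linear

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(round(rho, 3), round(r, 3))
```

Because only the ordering of the values matters, any strictly increasing relationship yields a rho of 1, while Pearson's r stays below 1 when the relationship is curved.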



[Non-linear dependence article](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0697-7)

## 03. P-values


The p-value is the probability that data with no real relationship would produce, purely by random chance, an effect size at least as strong as the one found. It is not an effect size itself but a measure of how much confidence the data support: for the same effect size, the p-value decreases as the amount of data increases *(more data, more confidence in the results found)*.

Put differently, the p-value is the probability that future data collections would show results at least as extreme as those observed, given that we assume the variables have no relationship (the null hypothesis).

The generally accepted threshold for the p-value is 0.05, meaning that there must be less than a 5% probability that a correlation as strong as the one we found would occur if there were truly no relationship. In that case we can reject the null hypothesis $(H_0)$.
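One way to make this definition concrete is a permutation test: shuffle one variable to destroy any real relationship, and count how often the shuffled data produce a correlation at least as strong as the observed one. This is a sketch under stated assumptions, not the method used elsewhere in this notebook; the function name, the number of permutations, the seed and the toy data are all illustrative.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value for r under the null hypothesis of no relationship."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

x = list(range(20))
y = [2 * xi + (xi % 3) for xi in x]  # strong, nearly linear relationship

p = permutation_p_value(x, y)
print(p, p < 0.05)
```

With a relationship this strong, shuffled data almost never match the observed correlation, so the estimated p-value falls well below the 0.05 threshold.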





### 03.01. p-values factors

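* Sample size

Larger sample sizes tend to decrease p-values.

* DoF (degrees of freedom)

The range of possible outcomes (measured as the degrees of freedom) is part of the p-value calculation. When the DoF reflects the number of categories being compared, as in the chi-square test, a higher DoF tends to increase the p-value for the same test statistic.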
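For the correlation case, the effect of sample size can be sketched with the usual test statistic for r, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which has n - 2 degrees of freedom. The function name `t_statistic` and the chosen values of r and n are illustrative; converting t into a p-value requires the t distribution and is left out of this sketch.

```python
import math

def t_statistic(r, n):
    """t statistic for testing r = 0 with n paired observations (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The same correlation, observed in samples of increasing size.
for n in (20, 100, 1000):
    print(n, round(t_statistic(0.3, n), 2))
```

The same r produces a larger t as n grows, and a larger t corresponds to a smaller p-value, which is why a modest correlation can be non-significant in a small sample and significant in a large one.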

### 03.02. Correlation Context

In the correlation (r) context, the p-value represents the probability of finding a correlation at least as strong as the one in the current data set if the true correlation were r = 0.0. *The lower the p-value, the less compatible the data are with there being no relationship between the two variables.*