**CORRELATION** (*r*): This measures how strongly two **NUMERICAL** variables are linearly related (or *correlated*).

Simple numbers:
- **1**: Straight line, upward (completely related)
- **0**: Scattered (no relation)
- **-1**: Straight line, but downward (or decreasing)

Anything in between indicates how strong the relationship is between the two numerical variables: the closer to 1 or -1, the stronger it is.
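As a quick illustration of the three reference values, here is a minimal sketch on made-up data (the arrays `y_up`, `y_down`, and `y_noise` are invented for this example and are not part of the insurance dataset). It uses `np.corrcoef`, which returns a 2x2 matrix whose off-diagonal entry is *r*.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
x = np.arange(100, dtype=float)

y_up = 2 * x + 5                 # perfect upward straight line
y_down = -3 * x + 1              # perfect downward straight line
y_noise = rng.normal(size=100)   # random values generated independently of x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_noise = np.corrcoef(x, y_noise)[0, 1]

print(round(r_up, 4), round(r_down, 4), round(r_noise, 4))
```

The first two values should come out at 1 and -1 (up to floating-point error), while the third should land somewhere near 0 without being exactly 0.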

In [2]:
import pandas as pd
import matplotlib as mat
import numpy as np
df = pd.read_csv('http://www.ishelp.info/data/insurance.csv')
df.charges.corr(df.age)

#could replace age with bmi, etc

0.2990081933306476

In [3]:
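# Keep only the numeric columns; calling df.corr() on the full frame
# can raise an error in recent pandas versions if it contains text columns
df_numeric = df.select_dtypes(include=['float64', 'int64'])

# Display the pairwise correlation matrix
df_numeric.corr()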

| | age | bmi | children | charges |
|---|---|---|---|---|
| age | 1.0 | 0.109272 | 0.042469 | 0.299008 |
| bmi | 0.109272 | 1.0 | 0.012759 | 0.198341 |
| children | 0.042469 | 0.012759 | 1.0 | 0.067998 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.0 |
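To read a matrix like this more easily, one option is to pull out a single column and sort it. The sketch below retypes the values shown above by hand (so it runs without downloading the CSV) and ranks the other variables by their correlation with `charges`; the names `corr_matrix` and `ranked` are just illustrative.

```python
import pandas as pd

cols = ['age', 'bmi', 'children', 'charges']

# Correlation values copied from the output above
corr_matrix = pd.DataFrame(
    [[1.0, 0.109272, 0.042469, 0.299008],
     [0.109272, 1.0, 0.012759, 0.198341],
     [0.042469, 0.012759, 1.0, 0.067998],
     [0.299008, 0.198341, 0.067998, 1.0]],
    index=cols, columns=cols,
)

# Drop the variable's correlation with itself (always 1), then sort
ranked = corr_matrix['charges'].drop('charges').sort_values(ascending=False)
print(ranked)
```

With these numbers, `age` comes first, then `bmi`, then `children`, although none of the three is strongly correlated with `charges`.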


**ASSUMPTIONS**: For correlation to be interpreted properly, the following MUST be TRUE about the two variables (forget about the middle parts of the data):
1. **Continuous data**: the values are spread fairly evenly across a range, NOT bunched into a few groups, which may indicate the variable has categories, or is categorical
2. **Linear Relationship**: if there's a curve in the pattern of data, then something else is going on that a straight-line measure like *r* will not capture
3. **Homoscedasticity (Equal Variance)**: big scary word, basically means that if the data DOES follow a straight line, it is evenly spread around that line, and NOT far away from it in some areas and really close in others
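The linearity assumption is the easiest one to demonstrate. In this minimal sketch on made-up data, `y` is completely determined by `x`, but along a curve, so *r* over the full range comes out near 0 even though the two variables are clearly related. The variable names are illustrative only.

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # symmetric around 0
y = x ** 2                    # y depends entirely on x, but along a curve

# Over the full range the upward and downward halves cancel out
r_curve = np.corrcoef(x, y)[0, 1]

# Restricting to x >= 0 leaves only the rising half of the curve
x_pos = x[x >= 0]
r_half = np.corrcoef(x_pos, x_pos ** 2)[0, 1]

print(round(r_curve, 4), round(r_half, 4))
```

A near-zero *r* here does not mean "no relationship"; it means "no straight-line relationship," which is why a scatter plot is worth checking before trusting the number.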

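**P-VALUES**: These read the OPPOSITE way from correlation: a correlation far from 0 signals a strong relationship, but we want p-values to be CLOSE to 0. A p-value is a probability between 0 and 1, so it can be read like a percentage (0.05 = 5%).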

Break this down step by step:
- **H0: Null Hypothesis**: No relationship between two variables (status quo)
- **HA: Alternative Hypothesis**: There IS a relationship between two variables (questioning the status quo)

**P-value** means the following:
1. Suppose the null hypothesis is assumed true (there is NO relationship between two variables)
2. However, your data happens to show a relationship
3. What is the probability that you would find a relationship at least this strong in the data IF the null hypothesis were true? That probability is the p-value.
(HINT: If the relationship in the data is clear, it would probably be very low, right?)

Therefore, when the p-value is low, REJECT the null hypothesis and conclude there is likely a relationship.
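The three steps above can be acted out with a simulation. This is a sketch on invented data, and it uses shuffling (a permutation test) to approximate the p-value; that is a different technique from the formula `pearsonr()` uses, chosen here only because it mirrors the reasoning directly. Shuffling `y` breaks any real link to `x`, which is what "the null hypothesis is true" looks like.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the sketch is repeatable

# Hypothetical data: y depends on x, plus noise
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
r_observed = np.corrcoef(x, y)[0, 1]

# Step 1: simulate the null hypothesis by shuffling y many times
n_shuffles = 2000
r_null = np.array([
    np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_shuffles)
])

# Step 3: how often does "no relationship" produce an r at least as
# extreme as the one actually observed?
p_simulated = np.mean(np.abs(r_null) >= abs(r_observed))

print(r_observed, p_simulated)
```

If almost none of the shuffled datasets match the observed *r*, the simulated p-value is close to 0, and the null hypothesis looks implausible.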

**NOTE**: P-values should be below **0.05**. This is often described as being **95% confident**; more precisely, it means there is less than a 5% chance of seeing a relationship this strong if there were no relationship between the variables.

A p-value of 0.10 corresponds to the looser **90%** level.

Don't accept p-values above 0.10. If the p-value is only slightly higher than 0.05, the relationship would still be "of interest," but other variables may be involved as well.
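These cutoffs can be written down as a small helper. The function `describe_p_value` and its labels are hypothetical, invented here to mirror the rule of thumb in this note; they are not part of pandas or SciPy.

```python
def describe_p_value(p, strict=0.05, lenient=0.10):
    """Label a p-value using the cutoffs from this note (hypothetical helper)."""
    if p < strict:
        return "significant"       # below 0.05
    if p <= lenient:
        return "of interest"       # between 0.05 and 0.10
    return "not significant"       # above 0.10


for p in (0.001, 0.07, 0.25):
    print(p, describe_p_value(p))
```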

In [4]:
from scipy import stats
corr = stats.pearsonr(df.charges, df.age)
corr

PearsonRResult(statistic=0.29900819333064765, pvalue=4.886693331718192e-29)

pearsonr() gives the correlation coefficient AND p-value between two variables. NOTICE the 2nd value is NOT 4.89, but is about 4.89e-29, meaning 4.89 × 10^-29: a decimal point followed by 28 zeros and then 489. That is far below 0.05.

The following cell prints both values in a more presentable form.
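To see what the `e-29` notation stands for, the p-value can be formatted in a few different ways using Python's built-in format specifiers. The value is copied from the output above; the variable names are illustrative.

```python
p = 4.886693331718192e-29   # p-value reported by pearsonr above

sci = f"{p:.2e}"            # compact scientific notation
expanded = f"{p:.31f}"      # fixed-point, 31 decimal places
rounded = round(p, 4)       # rounding to 4 places wipes the value out

print(sci)
print(expanded)
print(rounded)
print(p < 0.05)
```

Rounding to a handful of decimal places turns a value this small into 0.0, so scientific notation is the safer way to report it.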

In [5]:
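# Correlation coefficient, rounded to 4 decimal places
print('r: \t' + str(round(corr[0], 4)))
# Scientific notation keeps the p-value readable; round(corr[1], 4) would print 0.0
print('p-value:\t' + format(corr[1], '.2e'))
# Full, unrounded p-value
corr[1]

r: 	0.299
p-value:	4.89e-29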


4.886693331718192e-29