<div class="alert alert-block alert-warning"><b>Hypothesis flow</b></div>

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
import scikit_posthocs as sp
from scipy import stats
pd.options.display.float_format = '{:,.4f}'.format

### 1.Defining Hypothesis

* H₀: μ=x, H₁: μ≠x
* H₀: μ≤x, H₁: μ>x
* H₀: μ≥x, H₁: μ<x

<img src="hypothesis.jpg" width="600" height="300">

Suppose I test whether my drug D lowers recovery time and drug C doesn't. If I find no significant difference between the results, I fail to reject the null hypothesis; if there is a significant difference between the results, I reject the null hypothesis.

* Type 1 error -> False positive -> I reject the null hypothesis in favor of the alternative hypothesis when the null hypothesis is actually true
* Type 2 error -> False negative -> I fail to reject the null hypothesis when it should be rejected
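
To make the Type 1 error concrete, here is a small simulation sketch. Everything in it (the seed, the group sizes, the normal populations) is an assumption chosen for illustration. Both groups are drawn from the same population, so every rejection is a false positive, and the share of rejections should land near the chosen α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same distribution, so H0 is true by construction.
    a = rng.normal(loc=50, scale=10, size=30)
    b = rng.normal(loc=50, scale=10, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1  # rejecting a true H0 is a Type 1 error

type1_rate = false_positives / n_experiments
print(f"Type 1 error rate: {type1_rate:.3f} (alpha = {alpha})")
```
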

### 2.Assumption Check

Parametric or non-parametric? If I meet these criteria, I can use a parametric test.

* Observations in each sample are normally distributed.
* Observations in each sample have the same variance.
* Observations in each sample are independent and identically distributed (IID)

It makes sense that my two groups have to be comparable; I cannot test apples against pears... <br>Independent = values from sample A cannot affect values from sample B<br>Identically distributed -> I cannot compare sample A from regular coins with sample B from weighted coins, because the probabilities, and therefore the final distributions, will be very different.

### 3. Selecting the proper test

<img src="paired-unpaired.png" width="900" height="500">

* paired samples -> are you happy? Before After -> asking same people 
* unpaired samples -> are you happy? Before After -> asking different people -> independent, much harder to evaluate

<b>Continuous data - hypo selection

<img src="hypocontinuous.png" width="900" height="500">

<b>Categorical data - hypo selection

<img src="hypocategorical.png" width="500" height="200">

### 4. Decision and Conclusion

If the p-value is lower than our threshold (the significance level α), we reject H0 in favor of Ha/H1; the picture below illustrates this.
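
The decision rule can be sketched as a tiny helper. `decide` is a hypothetical name used only for illustration; the default threshold of 0.05 is the one used in most of the examples in this notebook.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a p-value and a significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # A p-value below the threshold means the data are unlikely under H0.
    return "Reject H0" if p_value < alpha else "Fail to reject H0"


print(decide(0.0038))              # p-value below the threshold
print(decide(0.8226))              # p-value above the threshold
print(decide(0.03, alpha=0.01))    # a stricter threshold changes the decision
```
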

<img src="pvalue.png" width="400" height="200">

<div class="alert alert-block alert-warning"><b>Test examples</b></div>

# t-test independent - unpaired

A professor gave an online lecture to Group A and recorded it; Group B watched the recording afterwards. The professor believes that the students who attended the lecture in real time (synchronously, Group A) will score higher than the students who watched it later (asynchronously, Group B). He recorded the average grades.

## <b>1.DEFINING MAIN HYPOTHESIS

* H₀: μₛ≤μₐ -> Students who attended the lecture in real time do not score higher than students who watched the video later.
* H₁: μₛ>μₐ -> Students who attended the lecture in real time score higher.

Grades are obtained from different individuals -> unpaired.

In [6]:
#my data
sync = np.array([94. , 84.9, 82.6, 69.5, 80.1, 79.6, 81.4, 77.8, 81.7, 78.8, 73.2,
       87.9, 87.9, 93.5, 82.3, 79.3, 78.3, 71.6, 88.6, 74.6, 74.1, 80.6])
asyncr = np.array([77.1, 71.7, 91. , 72.2, 74.8, 85.1, 67.6, 69.9, 75.3, 71.7, 65.7, 72.6, 71.5, 78.2])

## <b>2.Assumption check

### <b>2A -> are the data normally distributed?

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.
* Significance level: α = 0.05

In [10]:
def check_normality(data):
    test_stat_normality, p_value_normality=stats.shapiro(data)
    print("p value:%.4f" % p_value_normality)
    if p_value_normality <0.05:
        print("Reject null hypothesis >> The data is not normally distributed")
    else:
        print("Fail to reject null hypothesis >> The data is normally distributed")

In [16]:
check_normality(sync)
check_normality(asyncr)

p value:0.6556
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0803
Fail to reject null hypothesis >> The data is normally distributed


<code style="background:GREEN;color:WHITE">DATA ARE NORMALLY DISTRIBUTED - FIRST ASSUMPTION CHECKED</code>

### <b>2B -> are the variances of the samples different?

* H₀: The variances of the samples are the same.
* H₁: The variances of the samples are different.

In [17]:
def check_variance_homogeneity(group1, group2):
    test_stat_var, p_value_var= stats.levene(group1,group2)
    print("p value:%.4f" % p_value_var)
    if p_value_var <0.05:
        print("Reject null hypothesis >> The variances of the samples are different.")
    else:
        print("Fail to reject null hypothesis >> The variances of the samples are same.")

In [18]:
check_variance_homogeneity(sync, asyncr)

p value:0.8149
Fail to reject null hypothesis >> The variances of the samples are same.


<code style="background:GREEN;color:WHITE">VARIANCES OF THE SAMPLES ARE THE SAME - SECOND ASSUMPTION CHECKED</code>

## 3. Selecting the Proper test

The assumptions for parametric tests are satisfied. We have unpaired data, a parametric test and 2 groups => unpaired samples t-test.
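
The selection logic for two unpaired groups can be sketched as a small helper. `choose_two_sample_test` is a hypothetical name, and falling back to Welch's t-test when only the variances differ is my own assumption, not something taken from the flow chart above.

```python
import numpy as np
from scipy import stats


def choose_two_sample_test(group1, group2, alpha=0.05):
    """Suggest a test for two unpaired groups from the assumption checks."""
    # Normality of each group (Shapiro-Wilk).
    normal = all(stats.shapiro(g)[1] >= alpha for g in (group1, group2))
    if not normal:
        return "mann-whitney u"
    # Homogeneity of variances (Levene).
    equal_var = stats.levene(group1, group2)[1] >= alpha
    return "student t-test" if equal_var else "welch t-test"


sync = np.array([94.0, 84.9, 82.6, 69.5, 80.1, 79.6, 81.4, 77.8, 81.7, 78.8, 73.2,
                 87.9, 87.9, 93.5, 82.3, 79.3, 78.3, 71.6, 88.6, 74.6, 74.1, 80.6])
asyncr = np.array([77.1, 71.7, 91.0, 72.2, 74.8, 85.1, 67.6, 69.9, 75.3, 71.7,
                   65.7, 72.6, 71.5, 78.2])

print(choose_two_sample_test(sync, asyncr))
```

For the grades data both checks pass, so the helper agrees with the choice made here.
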

In [20]:
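#stats.ttest_ind(sync,asyncr) 
#the function only takes my two samples and does everything by itself
#stats.ttest_ind -> from stats - t-test - independent (unpaired)
#halving the two-sided p value is valid here because mean(sync) > mean(asyncr), the direction of H1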

ttest,p_value = stats.ttest_ind(sync,asyncr)
print("p value:%.8f" % p_value)
print("since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:%.4f" %(p_value/2))
if p_value/2 <0.05: 
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.00753598
since the hypothesis is one sided >> use p_value/2 >> p_value_one_sided:0.0038
Reject null hypothesis
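
Halving the two-sided p-value only works when the t statistic points in the direction of H₁. As a sketch, assuming SciPy 1.6 or newer, the `alternative` argument of `stats.ttest_ind` requests the one-sided test directly, and for this data it should agree with the halved value:

```python
import numpy as np
from scipy import stats

sync = np.array([94.0, 84.9, 82.6, 69.5, 80.1, 79.6, 81.4, 77.8, 81.7, 78.8, 73.2,
                 87.9, 87.9, 93.5, 82.3, 79.3, 78.3, 71.6, 88.6, 74.6, 74.1, 80.6])
asyncr = np.array([77.1, 71.7, 91.0, 72.2, 74.8, 85.1, 67.6, 69.9, 75.3, 71.7,
                   65.7, 72.6, 71.5, 78.2])

# Two-sided test, as in the cell above.
t_stat, p_two_sided = stats.ttest_ind(sync, asyncr)

# One-sided test for H1: mean(sync) > mean(asyncr).
_, p_one_sided = stats.ttest_ind(sync, asyncr, alternative="greater")

print(f"t statistic     : {t_stat:.4f}")
print(f"two-sided p / 2 : {p_two_sided / 2:.4f}")
print(f"one-sided p     : {p_one_sided:.4f}")
```

If the t statistic were negative, the one-sided p-value for "greater" would be 1 - p/2 instead, which is why passing `alternative` is the safer habit.
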


* H₀: μₛ≤μₐ -> Students who attended the lecture in real time do not score higher than students who watched the video later.
* H₁: μₛ>μₐ -> Students who attended the lecture in real time score higher.

## 4.Conclusion

At this significance level, there is enough evidence to conclude that the average grade of the students who follow the course synchronously is higher than the students who follow the course asynchronously.

<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# ANOVA

Check whether a special baby formula has an effect on babies' weight gain. There are three groups: breastfed only, formula only, and both. Is there a significant difference between these groups?

In [23]:
only_breast=np.array([794.1, 716.9, 993. , 724.7, 760.9, 908.2, 659.3 , 690.8, 768.7,
       717.3 , 630.7, 729.5, 714.1, 810.3, 583.5, 679.9, 865.1])

only_formula=np.array([ 898.8,  881.2,  940.2,  966.2,  957.5, 1061.7, 1046.2,  980.4,
        895.6,  919.7, 1074.1,  952.5,  796.3,  859.6,  871.1 , 1047.5,
        919.1 , 1160.5,  996.9])

both=np.array([976.4, 656.4, 861.2, 706.8, 718.5, 717.1, 759.8, 894.6, 867.6,
       805.6, 765.4, 800.3, 789.9, 875.3, 740. , 799.4, 790.3, 795.2 ,
       823.6, 818.7, 926.8, 791.7, 948.3])

## 1.Defining Hypothesis

* H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.
* H₁: At least one of them is different.

## 2.Assumption check

* Data are independent

### 2A normality check

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.

In [24]:
check_normality(only_breast)
check_normality(only_formula)
check_normality(both)

p value:0.4694
Fail to reject null hypothesis >> The data is normally distributed
p value:0.8879
Fail to reject null hypothesis >> The data is normally distributed
p value:0.7973
Fail to reject null hypothesis >> The data is normally distributed


<code style="background:GREEN;color:WHITE">DATA ARE NORMALLY DISTRIBUTED - FIRST ASSUMPTION CHECKED</code>

### 2B variance check

* H₀: The variances of the samples are the same.
* H₁: The variances of the samples are different.

In [25]:
stat, pvalue_levene= stats.levene(only_breast,only_formula,both)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.7673
Fail to reject null hypothesis >> The variances of the samples are same.


<code style="background:GREEN;color:WHITE">VARIANCES OF THE SAMPLES ARE THE SAME - SECOND ASSUMPTION CHECKED</code>

## 3. Selecting the Proper test

Unpaired - more than two groups - assumptions satisfied -> parametric => one-way analysis of variance (ANOVA)

In [26]:
#running our anova function
F, p_value = stats.f_oneway(only_breast,only_formula,both)
print("p value:%.6f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000000
Reject null hypothesis


## 4.Conclusion

* H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.
* H₁: At least one of them is different.

At this significance level, it can be concluded that at least one of the groups has a different average monthly weight gain.
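
To see what `stats.f_oneway` measures, the F statistic can be rebuilt by hand as the ratio of between-group to within-group variability. The three small arrays below are hypothetical toy data, not the weight-gain samples.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, chosen only to keep the arithmetic readable.
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([14.0, 15.0, 13.0, 16.0]),
    np.array([20.0, 19.0, 21.0, 22.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Variability of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
print(f"manual p = {p_manual:.6f}, scipy p = {p_scipy:.6f}")
```
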

<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-danger"><b>To find which group or groups cause the difference, we need to perform a posthoc test/pairwise comparison</b></div>

In [27]:
# Pairwise T test for multiple comparisons of independent groups. May be used after a parametric ANOVA to do pairwise comparisons.

import scikit_posthocs as sp
posthoc_df= sp.posthoc_ttest([only_breast,only_formula,both], equal_var=True, p_adjust="bonferroni")

group_names= ["only breast", "only formula","both"]
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")

| | only breast | only formula | both |
|---|---|---|---|
| only breast | 1.0 | 0.0 | 0.129454 |
| only formula | 0.0 | 1.0 | 4e-06 |
| both | 0.129454 | 4e-06 | 1.0 |


* “only breast” is different from “only formula”
* “only formula” is different from both “only breast” and “both”
* “both” is different from “only formula”, but not significantly different from “only breast” (p = 0.129)
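
The idea behind a Bonferroni-adjusted pairwise comparison can be sketched with SciPy alone: run a t-test for every pair and multiply each p-value by the number of comparisons, capping at 1. The samples are hypothetical toy data, and this is an illustration of the adjustment, not a reimplementation of `sp.posthoc_ttest`.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical toy samples.
samples = {
    "A": np.array([10.0, 12.0, 11.0, 13.0, 12.5]),
    "B": np.array([14.0, 15.0, 13.0, 16.0, 15.5]),
    "C": np.array([11.0, 12.5, 10.5, 13.5, 12.0]),
}

pairs = list(combinations(samples, 2))  # every unordered pair of groups

# Unadjusted p-value of each pairwise t-test.
raw = {pair: stats.ttest_ind(samples[pair[0]], samples[pair[1]])[1] for pair in pairs}

# Bonferroni: multiply by the number of comparisons, never exceeding 1.
adjusted = {pair: min(1.0, p * len(pairs)) for pair, p in raw.items()}

for pair in pairs:
    print(pair, f"raw p = {raw[pair]:.4f}", f"adjusted p = {adjusted[pair]:.4f}")
```
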

<div class="alert alert-block alert-warning"><b></b></div>

# Mann Whitney U

In [29]:
#data
test_team=np.array([6.2,  7.1,  1.5,  2,3 ,  2,  1.5,  6.1,  2.4,  2.3, 12.4,  1.8,  5.3,  3.1, 9.4,  2.3, 4.1])
developer_team=np.array([2.3,  2.1,  1.4,  2.0, 8.7,  2.2,  3.1,  4.2,  3.6, 2.5,  3.1,  6.2, 12.1,  3.9,  2.2, 1.2 ,3.4])

## <b>1.DEFINING MAIN HYPOTHESIS

Does one team work more than the other one?

* H₀: μ₁≤μ₂
* H₁: μ₁>μ₂

## <b>2.Assumption check

### 2A normality check

In [31]:
check_normality(test_team)
check_normality(developer_team)

p value:0.0046
Reject null hypothesis >> The data is not normally distributed
p value:0.0005
Reject null hypothesis >> The data is not normally distributed


<code style="background:RED;color:WHITE">DATA ARE NOT NORMALLY DISTRIBUTED - FIRST ASSUMPTION NOT SATISFIED</code>

### 2B variance check

In [33]:
check_variance_homogeneity(test_team, developer_team)
#variance homogeneity = assumption that all groups have the same or similar variance

p value:0.5410
Fail to reject null hypothesis >> The variances of the samples are same.


<code style="background:GREEN;color:WHITE">VARIANCES OF THE SAMPLES ARE THE SAME - SECOND ASSUMPTION CHECKED</code>

## 3.Selecting the proper test

2 groups - different individuals, so the data are unpaired, BUT the data are not normally distributed, and that's why we need to use a non-parametric test for 2 groups -> Mann-Whitney U Test.

In [34]:
ttest,pvalue = stats.mannwhitneyu(test_team,developer_team, alternative="two-sided")
print("p-value:%.4f" % pvalue)
if pvalue <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p-value:0.8226
Fail to reject null hypothesis


## 4.Conclusion

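At this significance level, it can be said that there is no statistically significant difference between the overwork time of the two teams. Note that the test above was run with alternative="two-sided"; the one-sided H₁ from step 1 would correspond to alternative="greater", assuming μ₁ refers to the first sample (test_team).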
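
As a sketch of what the U statistic counts, it can be computed by comparing every value of one sample with every value of the other. The two arrays are hypothetical toy data, and the one-sided call assumes H₁ says the first sample tends to be larger.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data.
x = np.array([1.1, 2.5, 3.7, 4.2, 6.8, 7.4])
y = np.array([0.9, 1.8, 2.2, 3.1, 3.9, 5.0])

# U for x: over all (x_i, y_j) pairs, count how often x_i beats y_j (ties count 0.5).
diff = x[:, None] - y[None, :]
u_x = (diff > 0).sum() + 0.5 * (diff == 0).sum()
u_y = x.size * y.size - u_x  # the two U values always add up to n_x * n_y

# One-sided test, H1: values in x tend to be larger than values in y.
res = stats.mannwhitneyu(x, y, alternative="greater")

print(f"U by hand (x) = {u_x}, U by hand (y) = {u_y}")
print(f"scipy U = {res.statistic}, one-sided p = {res.pvalue:.4f}")
```

Which of the two U values SciPy reports has changed between versions, so the comparison with the hand count should allow either one.
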

<font size="6"><B><code style="background:RED;color:WHITE">WE FAILED TO REJECT H0</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# Kruskal-Wallis

In [36]:
#data
youtube=np.array([1913, 1879, 1939, 2146, 2040, 2127, 2122, 2156, 2036, 1974, 1956,
       2146, 2151, 1943, 2125])
       
instagram =  np.array([2305., 2355., 2203., 2231., 2185., 2420., 2386., 2410., 2340.,
       2349., 2241., 2396., 2244., 2267., 2281.])
       
facebook = np.array([2133., 2522., 2124., 2551., 2293., 2367., 2460., 2311., 2178.,
       2113., 2048., 2443., 2265., 2095., 2528.]) 

## 1.Defining hypothesis

Is there any difference between the ad providers?

* H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.
* H₁: At least one of them is different.

## <b>2.Assumption check

### 2A normality check

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.

In [38]:
check_normality(youtube)
check_normality(instagram)
check_normality(facebook)

p value:0.0285
Reject null hypothesis >> The data is not normally distributed
p value:0.4156
Fail to reject null hypothesis >> The data is normally distributed
p value:0.1716
Fail to reject null hypothesis >> The data is normally distributed


<code style="background:RED;color:WHITE">YOUTUBE DATA ARE NOT NORMALLY DISTRIBUTED - FIRST ASSUMPTION NOT SATISFIED</code>

### 2B variance check

* H₀: The variances of the samples are the same.
* H₁: The variances of the samples are different.

In [39]:
stat, pvalue_levene= stats.levene(youtube, instagram, facebook)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.0012
Reject null hypothesis >> The variances of the samples are different.


<code style="background:RED;color:WHITE">VARIANCES OF THE SAMPLES ARE DIFFERENT - SECOND ASSUMPTION NOT SATISFIED</code>

## 3.Selecting the proper test

I have more than two groups, which are unpaired, and the normality and equal-variance assumptions are not satisfied, therefore I am choosing a non-parametric test -> Kruskal-Wallis.

In [40]:
F, p_value = stats.kruskal(youtube, instagram, facebook)
print("p value:%.6f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000015
Reject null hypothesis


## 4.Conclusion

At this significance level, at least one of the average customer acquisition numbers is different.
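
Kruskal-Wallis works on the ranks of the observations rather than on their raw values, which is why it does not need normality. A quick way to see this, sketched with hypothetical toy data, is that any order-preserving transformation (here a logarithm) leaves the result unchanged:

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, all values distinct.
a = np.array([12.0, 15.0, 11.0, 19.0, 14.0])
b = np.array([22.0, 25.0, 18.0, 30.0, 27.0])
c = np.array([16.0, 21.0, 13.0, 24.0, 17.0])

# Test on the raw values.
h_raw, p_raw = stats.kruskal(a, b, c)

# Test on log-transformed values: the ordering, and so the ranks, stay the same.
h_log, p_log = stats.kruskal(np.log(a), np.log(b), np.log(c))

print(f"raw: H = {h_raw:.4f}, p = {p_raw:.4f}")
print(f"log: H = {h_log:.4f}, p = {p_log:.4f}")
```
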

In [41]:
posthoc_df = sp.posthoc_mannwhitney([youtube,instagram, facebook], p_adjust = 'bonferroni')
group_names= ["youtube", "instagram","facebook"]
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")

| | youtube | instagram | facebook |
|---|---|---|---|
| youtube | 1.0 | 1e-05 | 0.002337 |
| instagram | 1e-05 | 1.0 | 1.0 |
| facebook | 0.002337 | 1.0 | 1.0 |


<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# t-test dependent

Diet test :) cholesterol measurements of the same patients before and after the diet.

In [44]:
#data
test_results_before_diet=[224, 235, 223, 253, 253, 224, 244, 225, 259, 220, 242, 240, 239, 229, 276, 254, 237, 227]
test_results_after_diet=[198, 195, 213, 190, 246, 206, 225, 199, 214, 210, 188, 205, 200, 220, 190, 199, 191, 218]

## 1.Defining hypothesis

* H₀: μd>=0 or The true mean difference (after − before) is equal to or bigger than zero.
* H₁: μd<0 or The true mean difference (after − before) is smaller than zero.

## <b>2.Assumption check

### 2A normality check

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.

In [45]:
check_normality(test_results_before_diet)
check_normality(test_results_after_diet)

p value:0.1635
Fail to reject null hypothesis >> The data is normally distributed
p value:0.1003
Fail to reject null hypothesis >> The data is normally distributed


<code style="background:GREEN;color:WHITE">DATA ARE NORMALLY DISTRIBUTED - FIRST ASSUMPTION CHECKED</code>

## 3.Selecting the proper test

The data are paired since they are collected from the same individuals, and the assumptions are satisfied, so we can use the dependent t-test.

In [46]:

test_stat, p_value_paired = stats.ttest_rel(test_results_before_diet,test_results_after_diet)
print("p value:%.6f" % p_value_paired , "one tailed p value:%.6f" %(p_value_paired/2))
#the hypothesis is one sided, so compare the one tailed p value with the threshold
if p_value_paired/2 <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p value:0.000008 one tailed p value:0.000004
Reject null hypothesis


## 4.Conclusion

At this significance level, there is enough evidence to conclude mean cholesterol level of patients has decreased after the diet.
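
A paired t-test is the same thing as a one-sample t-test on the per-person differences, which also suggests checking normality on the differences themselves. The sketch below uses the before/after data from above; running Shapiro-Wilk on the differences is my addition, not part of the original flow.

```python
import numpy as np
from scipy import stats

before = np.array([224, 235, 223, 253, 253, 224, 244, 225, 259, 220,
                   242, 240, 239, 229, 276, 254, 237, 227])
after = np.array([198, 195, 213, 190, 246, 206, 225, 199, 214, 210,
                  188, 205, 200, 220, 190, 199, 191, 218])

diff = before - after  # one difference per patient

# Paired t-test on the two columns.
t_rel, p_rel = stats.ttest_rel(before, after)

# One-sample t-test asking whether the mean difference is zero.
t_one, p_one = stats.ttest_1samp(diff, 0.0)

# Normality check on the differences (assumption of the paired t-test).
_, p_shapiro_diff = stats.shapiro(diff)

print(f"paired    : t = {t_rel:.4f}, p = {p_rel:.6f}")
print(f"one-sample: t = {t_one:.4f}, p = {p_one:.6f}")
print(f"Shapiro p value on differences: {p_shapiro_diff:.4f}")
```
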

<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# Wilcoxon signed-rank test

Two companies offer file compression, and an investor is trying to find out which company is better.

In [47]:
#data
piedpiper=[4.57, 4.55, 5.47, 4.67, 5.41, 5.55, 5.53, 5.63, 3.86, 3.97, 5.44, 3.93, 5.31, 5.17, 4.39, 4.28, 5.25]
endframe = [4.27, 3.93, 4.01, 4.07, 3.87, 4. , 4. , 3.72, 4.16, 4.1 , 3.9 , 3.97, 4.08, 3.96, 3.96, 3.77, 4.09]

## 1.Defining hypothesis

* H₀: μd>=0 or The true mean difference (EndFrame − PiedPiper) is equal to or bigger than zero.
* H₁: μd<0 or The true mean difference (EndFrame − PiedPiper) is smaller than zero.

## <b>2.Assumption check

### 2A normality check

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.

In [48]:
check_normality(piedpiper)
check_normality(endframe)

p value:0.0304
Reject null hypothesis >> The data is not normally distributed
p value:0.9587
Fail to reject null hypothesis >> The data is normally distributed


<code style="background:RED;color:WHITE">DATA ARE NOT NORMALLY DISTRIBUTED - FIRST ASSUMPTION NOT SATISFIED</code>

## 3.Selecting the proper test

Normality is not satisfied, so the test is non-parametric; the data are paired and we have only 2 groups of data -> Wilcoxon signed-rank test.

In [49]:
test,pvalue = stats.wilcoxon(endframe,piedpiper) ##alternative default two sided
print("p-value:%.6f" %pvalue, ">> one_tailed_pval:%.6f" %(pvalue/2))

test,one_sided_pvalue = stats.wilcoxon(endframe,piedpiper, alternative="less")
print("one sided pvalue:%.6f" %(one_sided_pvalue))
if one_sided_pvalue <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

p-value:0.000214 >> one_tailed_pval:0.000107
one sided pvalue:0.000107
Reject null hypothesis


## 4.Conclusion

At this significance level, there is enough evidence to conclude that the performance of PiedPiper is better than that of EndFrame.
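
The signed-rank statistic can be rebuilt by hand: rank the absolute differences, then add up the ranks of the positive and of the negative differences separately. The data are hypothetical toy values without ties or zero differences; for the two-sided test SciPy reports the smaller of the two sums.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: paired measurements, no tied or zero differences.
x = np.array([5.1, 4.8, 6.0, 5.7, 4.9, 6.3, 5.5, 6.1])
y = np.array([4.6, 5.0, 5.1, 4.4, 5.2, 5.2, 4.7, 4.5])

d = x - y
ranks = stats.rankdata(np.abs(d))  # rank the sizes of the differences

w_plus = ranks[d > 0].sum()   # rank sum of the positive differences
w_minus = ranks[d < 0].sum()  # rank sum of the negative differences

res = stats.wilcoxon(x, y)    # two-sided by default

print(f"W+ = {w_plus}, W- = {w_minus}")
print(f"scipy statistic = {res.statistic}, p = {res.pvalue:.4f}")
```
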

<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# Friedman Chi-Square

A researcher is curious whether there is a difference in performance between three methods (A, B and C).

In [50]:
#data
method_A = np.array([89.8, 89.9, 88.6, 88.7, 89.6, 89.7, 89.2, 89.3])
method_B =   np.array([90.0, 90.1, 88.8, 88.9, 89.9, 90.0, 89.0, 89.2])
method_C = np.array([91.5, 90.7, 90.3, 90.4, 90.2, 90.3, 90.2, 90.3])

## 1.Defining hypothesis

* H₀: μ₁=μ₂=μ₃ or The mean of the samples is the same.
* H₁: At least one of them is different.

## <b>2.Assumption check

### 2A normality check

* H₀: The data is normally distributed.
* H₁: The data is not normally distributed.

In [54]:
check_normality(method_A)
check_normality(method_B)
check_normality(method_C)

p value:0.3076
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0515
Fail to reject null hypothesis >> The data is normally distributed
p value:0.0016
Reject null hypothesis >> The data is not normally distributed


<code style="background:RED;color:WHITE">DATA ARE NOT NORMALLY DISTRIBUTED - FIRST ASSUMPTION NOT SATISFIED</code>

### 2B variance check

* H₀: The variances of the samples are the same.
* H₁: The variances of the samples are different.

In [55]:
stat, pvalue_levene= stats.levene(method_A, method_B, method_C)

print("p value:%.4f" % pvalue_levene)
if pvalue_levene <0.05:
    print("Reject null hypothesis >> The variances of the samples are different.")
else:
    print("Fail to reject null hypothesis >> The variances of the samples are same.")

p value:0.1953
Fail to reject null hypothesis >> The variances of the samples are same.


<code style="background:GREEN;color:WHITE">VARIANCES OF THE SAMPLES ARE THE SAME - SECOND ASSUMPTION CHECKED</code>

## 3.Selecting the proper test

Normality is not satisfied, so the test is non-parametric; 3 groups, paired data (same tests) -> nonparametric ANOVA for paired data - Friedman Chi-Square.

In [57]:
test_stat,p_value = stats.friedmanchisquare(method_A,method_B, method_C)
print("p value:%.4f" % p_value)
if p_value <0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
    
print(np.round(np.mean(method_A),2), np.round(np.mean(method_B),2), np.round(np.mean(method_C),2))     

p value:0.0015
Reject null hypothesis
89.35 89.49 90.49


## 4.Conclusion

At this significance level, at least one of the methods has a different performance.
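
The Friedman statistic can be sketched by hand: rank the methods within each experiment (row), add up the ranks per method, and plug the rank sums into the chi-square formula. The score table is hypothetical toy data without ties inside a row, so no tie correction is needed.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: 6 experiments (rows) x 3 methods (columns).
scores = np.array([
    [7.1, 8.3, 9.0],
    [6.5, 7.2, 8.8],
    [7.9, 7.4, 9.5],
    [6.8, 8.0, 8.6],
    [7.3, 8.9, 8.1],
    [6.9, 7.7, 9.2],
])
n, k = scores.shape

# Rank the methods inside every row, then sum the ranks per method.
ranks = np.apply_along_axis(stats.rankdata, 1, scores)
rank_sums = ranks.sum(axis=0)

chi2_manual = 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3.0 * n * (k + 1)
p_manual = stats.chi2.sf(chi2_manual, k - 1)

chi2_scipy, p_scipy = stats.friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])

print(f"manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}")
```
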

In [58]:
data = np.array([method_A, method_B, method_C]) 
posthoc_df=sp.posthoc_wilcoxon(data, p_adjust="holm")
# posthoc_df = sp.posthoc_nemenyi_friedman(data.T) ## another option for the posthoc test

group_names= ["Method A", "Method B","Method C"]
posthoc_df.columns= group_names
posthoc_df.index= group_names
posthoc_df.style.applymap(lambda x: "background-color:violet" if x<0.05 else "background-color: white")

| | Method A | Method B | Method C |
|---|---|---|---|
| Method A | 1.0 | 0.078125 | 0.023438 |
| Method B | 0.078125 | 1.0 | 0.023438 |
| Method C | 0.023438 | 0.023438 | 1.0 |


Method C outperforms A and B: its adjusted p-values against both are below 0.05, and its mean (90.49) is the highest.

<font size="6"><B><code style="background:GREEN;color:WHITE">WE REJECT H0 IN FAVOR OF HA</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# Chi-square test of independence - categorical

An analyst of a financial investment company is curious about the relationship between gender and risk appetite. I am trying to find out whether there is any connection between Gender and Risk Appetite -> categorical data.

## 1.Defining hypothesis

* H₀: Gender and risk appetite are independent.
* H₁: Gender and risk appetite are dependent.

## 3.Selecting the proper test

The chi-square test of independence should be used for this question. It relies on the same statistic as the goodness-of-fit test: it measures how close the observed counts are to the counts expected if the two variables were independent. The assumption of this test, every Ei ≥ 5 (in at least 80% of the cells), is satisfied.

In [59]:
from scipy.stats import chi2_contingency

obs =np.array([[53, 23, 30, 36, 88],[71, 48, 51, 57, 203]])
chi2, p, dof, ex = chi2_contingency(obs, correction=False)

print("expected frequencies:\n ", np.round(ex,2))
print("degrees of freedom:", dof)
print("test stat :%.4f" % chi2)
print("p value:%.4f" % p)

expected frequencies:
  [[ 43.21  24.74  28.23  32.41 101.41]
 [ 80.79  46.26  52.77  60.59 189.59]]
degrees of freedom: 4
test stat :7.0942
p value:0.1310


4 degrees of freedom because dof = (rows − 1) × (columns − 1) = (2 − 1) × (5 − 1) = 4 :)
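
The expected frequencies, the test statistic and the degrees of freedom can all be reproduced by hand from the observed table. This sketch uses the same `obs` array as above and compares the result with `chi2_contingency`:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

obs = np.array([[53, 23, 30, 36, 88], [71, 48, 51, 57, 203]])

# Expected count of a cell under independence: row total * column total / grand total.
row_totals = obs.sum(axis=1, keepdims=True)
col_totals = obs.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / obs.sum()

# Pearson chi-square statistic: squared deviations scaled by the expected counts.
stat_manual = ((obs - expected) ** 2 / expected).sum()
dof_manual = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_manual = chi2.sf(stat_manual, dof_manual)

stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(obs, correction=False)

print("expected by hand:\n", np.round(expected, 2))
print(f"manual: stat = {stat_manual:.4f}, dof = {dof_manual}, p = {p_manual:.4f}")
print(f"scipy : stat = {stat_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4f}")
```
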

In [60]:
from scipy.stats import chi2
## calculate critical stat

alpha = 0.01
df = (5-1)*(2-1)
critical_stat = chi2.ppf((1-alpha), df)
print("critical stat:%.4f" % critical_stat)

critical stat:13.2767


## 4.Conclusion

Since the p-value (0.1310) is larger than α=0.01 (or the calculated statistic=7.09 is smaller than the critical statistic=13.28) → Fail to Reject H₀. At this significance level, there is not enough evidence to conclude that gender and risk appetite are dependent.

<font size="6"><B><code style="background:RED;color:WHITE">WE FAILED TO REJECT H0</code></B></font>

<div class="alert alert-block alert-warning"><b></b></div>

# BONUS Pearson correlation

Correlation coefficients give a measure of the linear relationship between the two variables. 

* The relationship between the two variables has to be linear
* No outliers
* Data has to be normally distributed
* Data must be homoscedastic - variance check?

Homoscedasticity, or homogeneity of variances, is an assumption of equal or similar variances in different groups being compared.

In [61]:
X1 = [500,400,300,200,100]
X2 = [200,400,600,800,1000]
y = [100,200,300.5,400,500.5]
df = pd.DataFrame({'X1':X1,
                   'X2':X2,
                   'y' :y})
# correlation matrix
df.corr()

| | X1 | X2 | y |
|---|---|---|---|
| X1 | 1.0 | -1.0 | -1.0 |
| X2 | -1.0 | 1.0 | 1.0 |
| y | -1.0 | 1.0 | 1.0 |


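Interpretation is quite an easy thing :) X1 and X2 are perfectly negatively correlated (r = -1), while y rises almost perfectly linearly with X2 (r ≈ 1) and falls with X1 (r ≈ -1).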
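
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with NumPy, using X2 and y from the cell above, compared against `np.corrcoef`:

```python
import numpy as np

x = np.array([200, 400, 600, 800, 1000], dtype=float)  # X2 from the cell above
y = np.array([100, 200, 300.5, 400, 500.5])

# Centre both variables around their means.
x_centered = x - x.mean()
y_centered = y - y.mean()

# Pearson r: co-variation divided by the product of the individual variations.
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual r = {r_manual:.6f}")
print(f"numpy  r = {r_numpy:.6f}")
```

Because y is not an exact multiple of X2, r is slightly below 1 and only shows as 1.0 after rounding.
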