# Practice:  Statistical Significance

Let's continue to work with the diabetes dataset to apply a t-test or an ANOVA test to real world data.

In [20]:
# Import pandas, so that we can import the diabetes dataset and work with the data frame version of this data
import pandas as pd

In [21]:
# Set the path
path = 'https://raw.githubusercontent.com/GWC-DCMB/ClubCurriculum/master/'
# This is where the file is located
filename = path + 'SampleData/diabetes.csv'

In [22]:
# Load the diabetes dataset into a DataFrame
diabetes_df = pd.read_csv(filename)
diabetes_df

Unnamed: 0,AGE,SEX,BMI,MAP,TC,LDL,HDL,TCH,LTG,GLU,Y
0,59,2,32.1,101.00,157,93.2,38.0,4.00,4.8598,87,151
1,48,1,21.6,87.00,183,103.2,70.0,3.00,3.8918,69,75
2,72,2,30.5,93.00,156,93.6,41.0,4.00,4.6728,85,141
3,24,1,25.3,84.00,198,131.4,40.0,5.00,4.8903,89,206
4,50,1,23.0,101.00,192,125.4,52.0,4.00,4.2905,80,135
5,23,1,22.6,89.00,139,64.8,61.0,2.00,4.1897,68,97
6,36,2,22.0,90.00,160,99.6,50.0,3.00,3.9512,82,138
7,66,2,26.2,114.00,255,185.0,56.0,4.55,4.2485,92,63
8,60,2,32.1,83.00,179,119.4,42.0,4.00,4.4773,94,110
9,29,1,30.0,85.00,180,93.4,43.0,4.00,5.3845,88,310


# Problem A

For our first problem, we are interested in understanding whether there are differences in LDL levels (the "bad" cholesterol) by sex, i.e. are LDL levels different for males vs. females?

**1. Formulate the null hypothesis and the alternative hypothesis.**
- **Null hypothesis**: There is NO difference in LDL levels between male and female. 
- **Alternative hypothesis**: There is a difference in LDL levels by sex. 

In [23]:
# Import numpy 
import numpy as np

Males are indicated by "1" for the variable "SEX", while females are indicated by "2".

In [24]:
# Define a vector of the LDL levels for males and name it ldl_male
diabetes_male = diabetes_df.query('SEX == 1')
ldl_male = diabetes_male['LDL']

# Define a vector of the LDL levels for females and name it ldl_female
diabetes_female = diabetes_df.query('SEX == 2')
ldl_female = diabetes_female['LDL']

**2. Identify and compute a test statistic that can be used to reject or fail to reject the null hypothesis.**
- As we are working with two independent samples, we will use the two-sample t-test and use the t-statistic.

**3. Compute the test statistic and p-value.**

In [25]:
# Import stats methods to help calculate the t-statistic and p-value
from scipy import stats

In [26]:
# Run a Student's t-test
t_statistic, p_value = stats.ttest_ind(ldl_male, ldl_female)

# Print out the test statistic and p-value
print("t-statistic = " + str(t_statistic))
print("p-value = " + str(p_value))

t-statistic = -3.02289333435
p-value = 0.00264998737357


**4. Compare the p-value to an acceptable significance value, $\alpha$ and compare the test statistic to acceptable critical value(s)**. If p-value $\leq \alpha$ and the test-statistic $\geq$ +critical value or test-statistic $\leq$ -critical value, that the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is valid.**
- p-value $= 0.0026 \lt 0.05$, so we reject the null hypothesis. 
- t-statistic $= -3.02 \lt -1.96$, so this reaffirms that we reject the null hypothesis. 
- Interpretation: There is a significant difference in LDL levels between males and females.

# Problem B

For this next problem, we are interested in determining whether there are differences in LDL levels between underweight or normal, overweight, and obese samples. I combined underweight and normal weight participants into the same group, as there are only two underweight participants.

**1. Formulate the null hypothesis and the alternative hypothesis.**
- **Null hypothesis**: There is NO difference in LDL levels among the underweight or normal, overweight, and obese samples or the BMI categories.
- **Alternative hypothesis**: There is a difference in LDL levels by BMI categories. 

In [27]:
# Define a vector of LDL levels for underweight participants (BMI < 20)
diabetes_underweight_normal = diabetes_df.query('BMI < 20')
ldl_underweight_normal = diabetes_underweight_normal['LDL']

# Define a vector of LDL levels for overweighted participants (BMI >= 25 and BMI < 30)
diabetes_overweight = diabetes_df.query('BMI >= 25 and BMI < 30')
ldl_overweight = diabetes_overweight['LDL']

# Define a vector of LDL levels for obese participants (BMI >= 30)
diabetes_obese = diabetes_df.query('BMI >= 30')
ldl_obese = diabetes_obese['LDL']

**2. Identify and compute a test statistic that can be used to fail to reject or reject the null hypothesis.**
- As we are working with 3 independent samples, we will use the an **ANOVA test** and use the **F-statistic** as our test statistic.


**3. Compute the test statistic and p-value.**

In [28]:
# Run an ANOVA test
f_statistic_anova, p_value_anova = stats.f_oneway(ldl_underweight_normal, ldl_overweight, ldl_obese)

# Print out the f-statistic and p-value
print("f-statistic = " + str(f_statistic_anova))
print("p-value = " + str(p_value_anova))

f-statistic = 11.516546719
p-value = 1.58362390317e-05


In [29]:
# Total degree of freedom is sample size - 1 or N - 1. 
degree_freedom_total = diabetes_df.shape[0] - 1
# When comparing between the groups, the degree of freedom is number of groups - 1 or k - 1.
degree_freedom_between = 3 - 1
# When comparing within the groups, the degree of freedom is sample size - number of groups or N - k.
degree_freedom_within = diabetes_df.shape[0] - 3

# Determine the critical value for the f-statistic
critical_value_for_f_value = stats.f.ppf(q = 1 - 0.05, dfn = degree_freedom_between, dfd = degree_freedom_within)
critical_value_for_f_value

3.0162684445780026

**4. Compare the p-value to an acceptable significance value, $\alpha$ and compare the test statistic to acceptable critical value(s)**. If p-value $\leq \alpha$ and the f-statistic $\geq$ critical value, that the observed effect is statistically significant, the null hypothesis is rejected, and the alternative hypothesis is valid.**
- p-value $= 0.0000158 \lt 0.05$, so we reject the null hypothesis. 
- F-statistic $= 11.51 \gt 3.01$, so this reaffirms that we reject the null hypothesis. 
- Interpretation: There is a significant difference in diabetes progression among the BMI categories.

## Regressions

**Definition**: **linear regression** is a linear approach to describe the relationship (or association) between an outcome (or dependent variable) and one or more explanatory (or independent variables). 

**Definition**: **simple linear regression** is the case of having only one explanatory or independent variable. 

Do you remember the equation for a line? 
$$y = mx+b$$

where x is the independent variable, y is the dependent variable, m is the slope, and b is the intercept. 

Nice! One more thing before we run some code, regression models will provide us information on the statistical significance of the relationship between the dependent and independent variables. It will quantify the degree of association (or the effect size), which is also the slope. 

## Problem C

Is there an association between LDL levels and disease progression? 

In [30]:
import statsmodels.api as sm

In [31]:
# Define a matrix of an independent variable
X = diabetes_df['LDL']
# Add a column with a constant
X = sm.add_constant(X)

# Define the dependent variable
y = diabetes_df['Y']

# Note the difference in argument order
model = sm.OLS(y, X).fit()

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.03
Model:,OLS,Adj. R-squared:,0.028
Method:,Least Squares,F-statistic:,13.75
Date:,"Wed, 19 Feb 2020",Prob (F-statistic):,0.000236
Time:,11:35:58,Log-Likelihood:,-2540.4
No. Observations:,442,AIC:,5085.0
Df Residuals:,440,BIC:,5093.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,101.2015,14.205,7.124,0.000,73.283,129.120
LDL,0.4412,0.119,3.708,0.000,0.207,0.675

0,1,2,3
Omnibus:,46.057,Durbin-Watson:,1.87
Prob(Omnibus):,0.0,Jarque-Bera (JB):,28.041
Skew:,0.48,Prob(JB):,8.15e-07
Kurtosis:,2.225,Cond. No.,469.0


**"coef"** is the **regression coefficient** (or slope) and quantifies the association (or correlation) between the independent and dependent variables. 
- If the sign is +, then there is a positive association between the dependent and independent variables. As the independent variable increases, so will the dependent variable. 
- If the sign is -, then there is a negative association between the dependent and independent variables. As the independent variable increases, the dependent variable will decrease. 
- Is the association between LDL levels and disease progression positive or negative?
    - **Answer**: Positive. 
    - **Interpretation**: For everyone one-unit increase in LDL levels, the disease progression will increase by 0.44. 

**"t"** is the **test statistic**. Do you remember the critical value for when the significance level is 0.05?
- **Answer**: 1.96
- So if the our calculated t-statistic is greater than 1.96 or less than -1.96, then the association is statistically significant.
- Is the association between LDL levels and disease progression statistically significant?
    - **Answer**: Yes, because 3.708 > 1.96

**P>|t|** is the **p-value**. So if the p-value < 0.05, then the association is statistically significant. 
- Using the p-value, is the association between LDL levels and disease progression statistically significant?
    - **Answer**: Yes, because 0.000 < 0.05!

For more information and codes on running regression models, please refer to this link:  https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9.

Congratulations on completing the lesson and practice! 

It's a lot of information, but you learned powerful tools to be on your way to answer your own research questions by analyzing real world data! 