# Advanced Statistics with Python 

## We will cover 4 major topics :

- One-Sample Hypothesis Test
- Two-Sample Independent Hypothesis T-Test
- One-Way Analysis of Variance
- Chi-Squared Test of Independence

#### Hypothesis testing

A hypothesis is, simply put, an educated guess about something. You should be able to test it, either by experiment or observation.

Hypothesis testing is a way to check whether the results of an experiment are meaningful, i.e. unlikely to have arisen by chance alone.

These hypotheses are usually of the form : if I change (i.e. increase/decrease) this independent variable, then the dependent variable will change (i.e. increase/decrease).

Your hypothesis should :

- Be phrased as an if/then statement
- Specify both the independent and dependent variables
- Be testable by experiment, survey or other scientifically sound technique
- Have some design criteria

When formulating your hypothesis you also have to specify the null hypothesis, which essentially states what you would see if your hypothesis were wrong: nothing happens, i.e. changing the independent variable has no effect on the dependent variable.


### Blueprint for conducting hypothesis testing :

- Collect data so that you have a random sample; this makes it representative of the population you're studying

- Formulate your hypothesis

- Figure out the distribution your population follows (more on this later)

    - based on said distribution figure out what type of test to use, i.e. parametric/non-parametric, one or two tailed
    
- Pick the appropriate test to verify hypothesis

- Find the critical value, this is based on the alpha value (more on this later)

- Run the statistical test. Based on the result, reject or fail to reject the null hypothesis using the p-value and test statistic.
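The blueprint above can be sketched end to end on simulated data. This is only an illustration: the sample, the hypothesized mean `mu_0` and the choice of a two-tailed one-sample t-test are all assumptions made for the example, not part of the data used later in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Step 1: a random sample (simulated here, the numbers are made up)
sample = rng.normal(loc=52.0, scale=10.0, size=200)

# Step 2: hypotheses. H0: population mean == mu_0, H1: population mean != mu_0
mu_0 = 50.0
alpha = 0.05

# Steps 3-5: assume the data is roughly normal, pick a two-tailed
# one-sample t-test and look up the critical value for the chosen alpha
dof = sample.size - 1
t_crit = stats.t.ppf(1 - alpha / 2, dof)

# Step 6: run the test and decide
t_stat, p_val = stats.ttest_1samp(sample, popmean=mu_0)
reject = bool(p_val < alpha)

print("t statistic:", t_stat, "critical value:", t_crit)
print("p-value:", p_val)
print("reject H0" if reject else "fail to reject H0")
```

Comparing the p-value with alpha and comparing the absolute test statistic with the critical value are two views of the same decision.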



A quick glance at some of the distributions (and distribution shapes) out there :

- Beta Distribution.
- Exponential Distribution.
- Gamma Distribution.
- Inverse Gamma Distribution.
- Log Normal Distribution.
- Logistic Distribution.
- Maxwell-Boltzmann Distribution.
- Poisson Distribution.
- Skewed Distribution.
- Symmetric Distribution.
- Uniform Distribution.
- Unimodal Distribution.
- Weibull Distribution.

[Here](http://people.stern.nyu.edu/adamodar/New_Home_Page/StatFile/statdistns.htm) is a neat website that does a pretty good job at explaining some of these distributions.

### One-Sample Hypothesis Test

- One tailed : right-tailed or left-tailed
- Two tailed, aka inequality
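To make the difference between the tails concrete, here is a small sketch that turns one test statistic into the three possible p-values. The value `z = 1.8` is hypothetical and the statistic is assumed to follow a standard normal distribution under the null hypothesis.

```python
from scipy import stats

z = 1.8  # hypothetical test statistic, assumed standard normal under H0

p_right = stats.norm.sf(z)          # right-tailed: P(Z >= z)
p_left = stats.norm.cdf(z)          # left-tailed:  P(Z <= z)
p_two = 2 * stats.norm.sf(abs(z))   # two-tailed:   P(|Z| >= |z|)

print("right-tailed p-value:", p_right)
print("left-tailed p-value:", p_left)
print("two-tailed p-value:", p_two)
```

With this particular value the right-tailed test is significant at alpha = 0.05 while the two-tailed test is not, which is why the direction of the test should be chosen before looking at the data.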

### Z-test : this test is used to check whether a sample was drawn from a population with a known mean and standard deviation

Requires : 
- sample to be normally distributed
- population mean
- population standard deviation
- sample size greater than 30

Null hypothesis : Sample mean is the same as the population mean

Alternative : Sample mean is lower, higher or not the same
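As a rough sketch of what happens under the hood, the z statistic is the distance between the sample mean and the population mean, measured in standard errors. The helper `one_sample_z` and the simulated data below are hypothetical and only meant to illustrate the formula.

```python
import numpy as np
from scipy import stats

def one_sample_z(sample, pop_mean, pop_std):
    """Two-tailed one-sample z-test with a known population standard deviation."""
    sample = np.asarray(sample, dtype=float)
    standard_error = pop_std / np.sqrt(sample.size)
    z = (sample.mean() - pop_mean) / standard_error
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Simulated sample, the parameters are made up for illustration
rng = np.random.default_rng(1)
sample = rng.normal(128.0, 20.0, 500)

z, p = one_sample_z(sample, pop_mean=128.0, pop_std=20.0)
print("z-score", z)
print("p-value", p)
```

Note that, as far as I can tell, the Statsmodels `ztest` used in this notebook estimates the standard deviation from the sample instead of taking a known population value, so its result can differ slightly from this textbook version.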





In [17]:
import pandas as pd
# from scipy import stats
from statsmodels.stats import weightstats as stests

#This comes from a Kaggle data set
df = pd.read_csv("cardio_train.csv", sep=';')

# df.head()

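# One-sample z-test of the ap_hi column against a hypothesized mean of 128
ztest, pval = stests.ztest(df['ap_hi'], x2=None, value=128)

print("z-score", ztest)
print("p-value", float(pval))
# Sample mean, for comparison with the hypothesized value
print(df['ap_hi'].mean())

if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

z-score 1.4040093635408895
p-value 0.16031606130705378
128.8172857142857
fail to reject null hypothesis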


In [1]:
import pandas as pd
# from scipy import stats
from statsmodels.stats import weightstats as stests


stests.ztest?



### T-test : this test is used to compare the means of 2 given samples when the population parameters, like mean and standard deviation, are not known.

Requires :
- normal distribution of the sample 
- independence of samples
- homogeneity of variance

Null hypothesis : means are the same

Alternative : Sample mean is lower, higher or not the same
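The requirements listed above can be checked before running the test. The sketch below uses the Shapiro-Wilk test for normality and Levene's test for equal variances on simulated data. Both the data and the choice of these two particular checks are assumptions of this example, and other diagnostics would work too.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(12.5, 2.6, 200)  # simulated sample A
group_b = rng.normal(10.7, 2.1, 200)  # simulated sample B

# Normality: a small p-value suggests the sample is NOT normally distributed
w_a, p_norm_a = stats.shapiro(group_a)
w_b, p_norm_b = stats.shapiro(group_b)

# Homogeneity of variance: a small p-value suggests the variances differ
lev_stat, p_levene = stats.levene(group_a, group_b)

# If the variances look unequal, Welch's t-test drops that assumption
welch_stat, welch_p = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Shapiro p-values:", p_norm_a, p_norm_b)
print("Levene p-value:", p_levene)
print("Welch t-test:", welch_stat, welch_p)
```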

3 versions are available :
- One sample t-test : compares the mean of a single group against a known mean
- Independent samples t-test : compares the mean for 2 distinct groups
- Paired samples t-test : compares the mean from the same group at different times
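Each of the three versions has its own SciPy function. The sketch below calls all three on simulated data; the `before`, `after` and `other_group` arrays are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(10.0, 2.0, 50)              # one group, first measurement
after = before + rng.normal(0.5, 1.0, 50)       # same group, measured again
other_group = rng.normal(11.0, 2.0, 60)         # a distinct group

# One sample: is the mean of `before` different from a known value of 10?
one_sample = stats.ttest_1samp(before, popmean=10.0)

# Independent samples: do two distinct groups have different means?
independent = stats.ttest_ind(before, other_group)

# Paired samples: did the same group change between the two measurements?
paired = stats.ttest_rel(before, after)

print(one_sample)
print(independent)
print(paired)
```

A paired t-test is equivalent to a one-sample t-test on the per-subject differences against 0, which is why it only makes sense when the two arrays line up element by element.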

What does the t-score mean ?
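One way to read the t-score: it is the difference between the two sample means divided by the standard error of that difference, so it tells you how many standard errors apart the means are. The further it is from 0, the harder it is to explain the difference by chance. The helper `pooled_t` below is a hypothetical, hand-rolled version of the equal-variance independent samples test, written only to show the arithmetic.

```python
import numpy as np
from scipy import stats

def pooled_t(a, b):
    """Independent samples t-test assuming equal variances (two-tailed)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (a.mean() - b.mean()) / standard_error
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p, dof

# Tiny made-up samples
t, p, dof = pooled_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print("t-score", t, "p-value", p, "degrees of freedom", dof)
```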

#### Two-Sample Independent Hypothesis T-Test

#### Let's see if the number of miles run by Boulderites is greater than the number run by Denverites

In [18]:
#Independent samples t-test
from statsmodels.stats.weightstats import ttest_ind
import numpy as np

Let's use the Statsmodels package

In [19]:
#Random sampling from a group of Boulderites
b_miles = np.random.normal(12.5, 2.6, 1000)

#Random sampling from a group of Denverites 
d_miles = np.random.normal(10.7, 2.1, 1000)

In [20]:
#Perform the test :
#The test is two-sided by default and returns (t statistic, p-value, degrees of freedom)
ttest_stats = ttest_ind(b_miles, d_miles)
ttest_stats

(17.093635031597223, 3.0200646889491784e-61, 1998.0)

In [2]:
from statsmodels.stats.weightstats import ttest_ind
ttest_ind?

Now using Scipy instead

In [21]:
from scipy.stats import ttest_ind

ttest_sp = ttest_ind(b_miles, d_miles)
ttest_sp

Ttest_indResult(statistic=17.09363503159724, pvalue=3.0200646889483175e-61)

In [22]:
#Dependent (paired) samples t-test
#Note : these simulated samples are not truly paired, this only illustrates the call
from scipy.stats import ttest_rel

ttest_pair = ttest_rel(b_miles, d_miles)
ttest_pair

Ttest_relResult(statistic=16.677143571064153, pvalue=2.8217239987865275e-55)

### ANOVA (ANalysis Of VAriance) : this is an extension of the t-test which allows you to compare multiple (3 or more) samples with a single test. This test compares the variation between groups with the variation within groups.

Requires :
- data must be close to normal distribution
- samples must be independent
- sample variances must be equal
- similar sample sizes

2 versions :
- One way ANOVA : this is used to compare the difference between the 3 or more samples/groups of a single independent variable
- Two way ANOVA : is an extension of the one-way ANOVA, which allows you to see the impact of two independent variables

Null : All group means are equal

Alternative : At least one group mean is significantly different from the others
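The "within groups versus between groups" idea can be written out by hand. The function `one_way_f` below is a hypothetical helper that computes the F statistic as the ratio of the between-group mean square to the within-group mean square. It is a sketch of the textbook formula, and the toy groups are made up.

```python
import numpy as np
from scipy import stats

def one_way_f(*groups):
    """One-way ANOVA F statistic and p-value computed from the sums of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    f = ms_between / ms_within
    p = stats.f.sf(f, k - 1, n - k)
    return f, p

f, p = one_way_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print("F statistic", f, "p-value", p)
```

A large F means the group means are spread out much more than the noise inside each group would explain.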


The following example comes from [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)

In [23]:
import scipy.stats as stats

'''Here are some data on a shell measurement (the length of the anterior adductor muscle scar, standardized by 
dividing by length) in the mussel Mytilus trossulus from five locations: 
-Tillamook, Oregon
-Newport, Oregon
-Petersburg, Alaska 
-Magadan, Russia
-Tvarminne, Finland 
taken from a much larger data set used in McDonald et al. (1991).

Used for identifying empty shells to determine their correct taxonomic placement
'''
    
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735,
              0.0659, 0.0923, 0.0836]

newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,
            0.0725]

petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]

magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764,
            0.0689]

tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]


stats.f_oneway(tillamook, newport, petersburg, magadan, tvarminne)

#The p-value is well below 0.05, so we reject the null hypothesis: at least one location's mean shell measurement differs from the others.

F_onewayResult(statistic=7.121019471642447, pvalue=0.0002812242314534544)

In [4]:
import scipy.stats as stats

stats.f_oneway?

### Chi-square test : this is used to verify independence of categorical variables

Requires :
- independent samples
- variables to be categorical

2 versions :
- Goodness of fit test, which determines whether a sample matches an expected (population) distribution
- Test of independence, which compares two categorical variables in a contingency table to check whether they are related
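For the goodness of fit version, a minimal sketch with a simulated die could look like this. The rolls are randomly generated and the "fair die" expectation of equal counts per face is the assumption being tested.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
rolls = rng.integers(1, 7, size=6000)  # simulated rolls of a six-sided die

# Observed count for each face 1..6
observed = np.bincount(rolls, minlength=7)[1:]

# Expected counts if the die is fair: the same count for every face
expected = np.full(6, rolls.size / 6)

chi2_stat, p_val = stats.chisquare(observed, f_exp=expected)
print("observed counts:", observed)
print("chi-square statistic:", chi2_stat)
print("p-value:", p_val)
```

A large p-value here means the observed counts are consistent with a fair die.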

Null : Variable A and variable B are independent.

The chi-square statistic is a non-negative real number that tells you how much difference exists between your observed counts and the counts you would expect if there were no relationship at all in the population. A small value means the observed counts are close to the expected ones, i.e. the data is consistent with independence; a large value suggests there is a relationship.
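Here is a short sketch of how that number is computed for a contingency table. The `observed` counts are hypothetical; the expected count of each cell is its row total times its column total divided by the grand total.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 table of observed counts
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Counts we would expect if rows and columns were independent
expected = row_totals * col_totals / total

# Chi-square statistic: squared differences scaled by the expected counts
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Same computation done by SciPy, for comparison
chi2_ref, p_ref, dof_ref, expected_ref = stats.chi2_contingency(observed)

print("manual:", chi2_manual, "scipy:", chi2_ref)
print("degrees of freedom:", dof)
```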

To perform a chi-squared test and get a p-value you need :
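- degrees of freedom : number of categories - 1 for a goodness of fit test, (rows - 1) x (columns - 1) for a contingency table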
- alpha level, this is chosen by the user. Usually 0.05 is used.
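With those two ingredients, the critical value and the p-value both come from the chi-square distribution. The sketch below plugs in the statistic and degrees of freedom printed by the dice example in this notebook; the alpha of 0.05 is the usual default mentioned above.

```python
from scipy import stats

alpha = 0.05
dof = 20                        # (5 - 1) * (6 - 1) for the 5 x 6 dice table
chi2_stat = 14.356104530652805  # statistic printed by the dice example

# Reject the null hypothesis only if the statistic exceeds this value
critical_value = stats.chi2.ppf(1 - alpha, dof)

# Probability of a statistic at least this large if the null is true
p_val = stats.chi2.sf(chi2_stat, dof)

reject = bool(chi2_stat > critical_value)
print("critical value:", critical_value)
print("p-value:", p_val)
print("reject H0" if reject else "fail to reject H0")
```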

Let's take a look at a chi-square test using dice

In [None]:
from scipy import stats
import numpy as np

In [25]:
#Generate random die rolls
r1 = np.random.randint(1,7,5000)
r2 = np.random.randint(1,7,5000)
r3 = np.random.randint(1,7,5000)
r4 = np.random.randint(1,7,5000)
r5 = np.random.randint(1,7,5000)


#Calculate the counts for each face of the die
unique, counts1 = np.unique(r1, return_counts=True)
unique, counts2 = np.unique(r2, return_counts=True)
unique, counts3 = np.unique(r3, return_counts=True)
unique, counts4 = np.unique(r4, return_counts=True)
unique, counts5 = np.unique(r5, return_counts=True)

#Combine the results
die_rolls = np.array([counts1, counts2, counts3, counts4, counts5])

#Running the test
chi2_stat, p_val, dof, ex = stats.chi2_contingency(die_rolls)

#Print out the results
print("===Chi2 Stat===")
print(chi2_stat)
print("\n")

print("===Degrees of Freedom===")
print(dof)
print("\n")

print("===P-Value===")
print(p_val)
print("\n")

print("===Contingency Table===")
print(ex)

#Note that `ex` holds the expected counts under independence, not the observed counts.
#What do you think of the values in this table ?

===Chi2 Stat===
14.356104530652805


===Degrees of Freedom===
20


===P-Value===
0.8119914228939127


===Contingency Table===
[[826.6 812.  838.2 840.6 832.6 850. ]
 [826.6 812.  838.2 840.6 832.6 850. ]
 [826.6 812.  838.2 840.6 832.6 850. ]
 [826.6 812.  838.2 840.6 832.6 850. ]
 [826.6 812.  838.2 840.6 832.6 850. ]]


Interesting* anecdote about chi-square test

*Heavily depends on your definition of interesting