In [48]:
import numpy as np 
import seaborn as sns
import pandas as pd


# Nonparametric Test

Nonparametric tests are statistical methods that do not assume the data follow a particular distribution (for example, the normal distribution). This makes them useful when the assumptions of a parametric test are not met.

Nonparametric tests serve as alternatives to parametric tests such as the t-test or ANOVA, which are valid only when the underlying data satisfy certain assumptions.


They are also called distribution-free tests because they rely on fewer assumptions about the underlying distribution.

1. Mann-Whitney U Test (Wilcoxon Rank-Sum Test): nonparametric alternative to the Independent Samples T-Test  

    - Use: To compare a continuous outcome in two independent samples.

    - Null Hypothesis: H0: The two populations have the same distribution



2. Wilcoxon Signed-Rank Test: nonparametric alternative to the Paired Samples T-Test

    - Use: To compare a continuous outcome in two matched or paired samples.  
    
    - Null Hypothesis: H0: Median difference is zero  
    
    - Test Statistic: The test statistic is W, defined as the smaller of W+ and W-, which are the sums of the ranks of the positive and negative difference scores, respectively.  


3. Kruskal-Wallis Test: nonparametric alternative to One-Way ANOVA

    - When: If the normality, homogeneity of variances, or outlier assumptions for One-Way ANOVA are not met, you may want to run the nonparametric Kruskal-Wallis test instead.  

    - Use: To compare a continuous outcome in more than two independent samples.

    - Null Hypothesis: H0: k population medians are equal
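
As a rough illustration of the three tests listed above, here is a minimal sketch using `scipy.stats`. The data are synthetic and the variable names (`group_a`, `before`, `after`, and so on) are made up for this example; they are not part of the Titanic analysis that follows.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed data for illustration only (an assumption, not real data).
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=30)
group_b = rng.exponential(scale=1.5, size=30)
group_c = rng.exponential(scale=2.0, size=30)

# 1. Mann-Whitney U test: two independent samples.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# 2. Wilcoxon signed-rank test: two paired samples
#    (here, hypothetical before/after measurements on the same units).
before = rng.normal(loc=10.0, scale=2.0, size=30)
after = before + rng.normal(loc=0.5, scale=1.0, size=30)
w_stat, w_p = stats.wilcoxon(before, after)

# 3. Kruskal-Wallis test: more than two independent samples.
h_stat, h_p = stats.kruskal(group_a, group_b, group_c)

print(f"Mann-Whitney U: U={u_stat:.2f}, p={u_p:.4f}")
print(f"Wilcoxon signed-rank: W={w_stat:.2f}, p={w_p:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={h_p:.4f}")
```

In each case a small p-value is evidence against the corresponding null hypothesis listed above.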




# Chi-squared

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related).

1. Can only compare categorical variables.   

  
2. Cannot make comparisons between continuous variables or between categorical and continuous variables.   

  
3. Only assesses associations between categorical variables, and cannot provide any inferences about causation.  

  
4. If your categorical variables represent "pre-test" and "post-test" observations, then the chi-square test of independence is not appropriate, because the observations are paired rather than independent.


# Hypothesis

$$H_0: \text{Variable 1 is independent of Variable 2}$$
$$H_1: \text{Variable 1 is not independent of Variable 2}$$


## Test Statistic

$$\chi^{2} = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(O_{ij} - e_{ij})^2}{e_{ij}}$$

where:

$O_{ij}$ is the observed cell count in the $i^{th}$ row and $j^{th}$ column  
  
  
$e_{ij}$ is the expected cell count in the $i^{th}$ row and $j^{th}$ column  
  
$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
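
The expected-count formula can be sketched in NumPy as an outer product of the row and column totals. The helper name `expected_counts` and the small `toy_table` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def expected_counts(observed):
    """Expected cell counts under independence (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1)   # one total per row
    col_totals = observed.sum(axis=0)   # one total per column
    grand_total = observed.sum()
    # e_ij = (row i total * column j total) / grand total
    return np.outer(row_totals, col_totals) / grand_total

# Toy 2x2 table (made-up numbers): row totals 30 and 70, column totals 40 and 60.
toy_table = [[10, 20], [30, 40]]
toy_expected = expected_counts(toy_table)
print(toy_expected)
# [[12. 18.]
#  [28. 42.]]
```

Note that the expected counts keep the same row and column totals as the observed table.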

  
  
The calculated $\chi^2$ value is then compared to the critical value from the $\chi^2$ distribution with degrees of freedom $df = (R - 1)(C - 1)$ at the chosen significance level $\alpha$. If the calculated $\chi^2$ value exceeds the critical value, we reject the null hypothesis.
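
Instead of reading the critical value from a printed table, it can be looked up with `scipy.stats.chi2.ppf`. The sketch below assumes a significance level of 0.05 and a 2x2 table; both are illustrative choices.

```python
from scipy import stats

alpha = 0.05            # assumed significance level
n_rows, n_cols = 2, 2   # assumed table shape (R x C)
dof = (n_rows - 1) * (n_cols - 1)

# Critical value: the point with probability alpha in the upper tail.
critical_value = stats.chi2.ppf(1 - alpha, dof)
print(round(critical_value, 3))  # 3.841
```

Any calculated $\chi^2$ statistic above this value leads to rejecting the null hypothesis at the 5% level for a 2x2 table.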


### Sufficient sample size assumption
1. Expected frequencies for each cell are at least 1.  
  
  
2. Expected frequencies should be at least 5 for the majority (80%) of the cells.
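
The two rules above can be checked programmatically. This is a minimal sketch; the function name `sample_size_ok` and its parameter names are hypothetical, and the thresholds simply mirror the rules as stated.

```python
import numpy as np

def sample_size_ok(expected, min_count=1.0, usual_count=5.0, share=0.8):
    """Check the two expected-frequency rules (hypothetical helper)."""
    expected = np.asarray(expected, dtype=float)
    # Rule 1: every expected count is at least min_count.
    all_at_least_min = bool((expected >= min_count).all())
    # Rule 2: at least `share` of the cells have expected count >= usual_count.
    enough_large_cells = bool((expected >= usual_count).mean() >= share)
    return all_at_least_min and enough_large_cells

print(sample_size_ok([[12.0, 18.0], [28.0, 42.0]]))  # True
print(sample_size_ok([[0.5, 10.0], [10.0, 10.0]]))   # False: one cell is below 1
```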

# Example

In [30]:
df = sns.load_dataset('titanic')
df.head(3)

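| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |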


In [107]:
contingency_table = pd.crosstab(df['survived'],df['sex'])
contingency_table


| survived \ sex | female | male |
|---|---|---|
| 0 | 81 | 468 |
| 1 | 233 | 109 |


## Hypothesis

$$H_0: \text{Sex and survival are independent}$$
$$H_1: \text{Sex and survival are not independent}$$

## Test Statistic

$$\chi^{2} = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(O_{ij} - e_{ij})^2}{e_{ij}}$$

where:

$O_{ij}$ is the observed cell count in the $i^{th}$ row and $j^{th}$ column  
  
  
$e_{ij}$ is the expected cell count in the $i^{th}$ row and $j^{th}$ column  
  
$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$


$$col_1 = 314$$  

$$col_2 = 577$$  

$$row_1 = 549$$  

$$row_2 = 342$$

$$\text{grand total} = 314 + 577 = 549 + 342 = 891 \text{ passengers}$$

$$e_{1,1} = \frac{549 \times 314}{891} = 193.47$$  

$$e_{1,2} = \frac{549 \times 577}{891} = 355.53$$  

$$e_{2,1} = \frac{342 \times 314}{891} = 120.53$$  

$$e_{2,2} = \frac{342 \times 577}{891} = 221.47$$  


### Sample size assumption has been met
All expected counts are greater than 1, and all of them (100%, above the 80% threshold) are greater than 5.


$$\chi^{2} = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(O_{ij} - e_{ij})^2}{e_{ij}}$$

Carrying the unrounded expected counts through the calculation, the four cell contributions are:

$$\frac{(O_{11}-e_{11})^2}{e_{11}} = \frac{(81-193.47)^2}{193.47} \approx 65.39$$

$$\frac{(O_{12}-e_{12})^2}{e_{12}} = \frac{(468-355.53)^2}{355.53} \approx 35.58$$

$$\frac{(O_{21}-e_{21})^2}{e_{21}} = \frac{(233-120.53)^2}{120.53} \approx 104.96$$

$$\frac{(O_{22}-e_{22})^2}{e_{22}} = \frac{(109-221.47)^2}{221.47} \approx 57.12$$


$$\chi^{2} \approx 65.39 + 35.58 + 104.96 + 57.12 = 263.05$$
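
The hand calculation can be double-checked with a few lines of NumPy. This sketch hard-codes the observed counts from the contingency table above (so it runs without loading the dataset) and uses `scipy.stats.chi2.sf` for the upper-tail p-value; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = survived (0, 1), columns = sex (female, male).
observed = np.array([[81, 468], [233, 109]], dtype=float)

# Expected counts under independence: outer product of totals / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-square statistic: sum over all cells of (O - e)^2 / e.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom and upper-tail p-value.
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)

print(round(chi2_stat, 2), dof, p_value)
```

Because no values are rounded along the way, the statistic comes out at about 263.05.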


## Degrees of Freedom for a 2x2 table

$$df = (R-1)(C-1) = (2-1)(2-1) = 1$$

### Using Scipy


In [108]:
# Observed counts as a 2x2 NumPy array (rows: survived, columns: sex)
f_obs = contingency_table.to_numpy()
f_obs

array([[ 81, 468],
       [233, 109]])

In [109]:
from scipy import stats
# With df = 1 (2x2 table), Yates' continuity correction is applied by default.
stats.chi2_contingency(f_obs)[0:3]

(260.71702016732104, 1.1973570627755645e-58, 1)
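
The statistic reported by SciPy (260.7) is slightly smaller than the uncorrected Pearson statistic. A likely explanation, sketched below, is that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom; passing `correction=False` turns it off. The observed counts are hard-coded here so the snippet is self-contained.

```python
import numpy as np
from scipy import stats

observed = np.array([[81, 468], [233, 109]])

# Default behaviour: Yates' continuity correction is applied for df = 1.
corrected = stats.chi2_contingency(observed)

# Uncorrected Pearson chi-square, matching the textbook formula.
uncorrected = stats.chi2_contingency(observed, correction=False)

print(round(corrected[0], 2))    # 260.72
print(round(uncorrected[0], 2))  # 263.05
```

Both versions give a p-value far below 0.05 here, so the choice of correction does not change the decision for this table.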

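## Conclusion

The SciPy result is close to, but not exactly the same as, our calculation by hand: $\chi^{2}$ = 260.7 versus about 263. The difference arises because `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom (a 2x2 table); passing `correction=False` gives the uncorrected statistic computed by hand. The p-value is approximately 0 (about 1.2e-58) and degrees of freedom = 1.

With a p-value < 0.05, we reject the null hypothesis: there is strong evidence of an association between 'sex' and 'survived'. The test itself does not describe the direction or strength of this relationship, and it says nothing about causation; it only tells us that these two variables are not independent of each other.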