### Wilcoxon Rank Sum Test (AKA Mann Whitney Test)

This test is useful when you have two independent samples and want to compare their populations with a non-parametric test, for example because both samples are very small and not normally distributed.



### Independent samples test

This test is for independent samples. If you want a single-sample or paired-sample test, use the Wilcoxon Signed Rank Test instead.

In [1]:
# The second sample should be equal to or larger than the first sample

sample_1 = c(64, 60, 68, 73, 72, 70)
n_1 = length(sample_1)
n_1

sample_2 = c(68, 72, 79, 69, 84, 80, 78)
n_2 = length(sample_2)
n_2

$ H_0 : \mu_1 = \mu_2 $

$ H_a : \mu_1 \ne \mu_2 $

$ \alpha = 0.05 $

In the cell below, the vector with the fewest elements must come first.

In [2]:
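# Pool both samples and rank the pooled values (tied values get the average rank)
combined_samples = c(sample_1, sample_2)
indexes = rank(combined_samples)
sample_1_length = length(sample_1)

# X: the ranks that belong to sample_1
X = head(indexes, sample_1_length)
X
# Y: the ranks that belong to sample_2
Y = tail(indexes, -sample_1_length)
Y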
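As a quick sanity check on the ranking step, the sketch below rebuilds the pooled ranks and confirms they sum to $ N(N+1)/2 $, where $ N $ is the total number of observations. The names `ranks`, `rank_total`, `expected_total` and `tied_ranks` are illustrative and not part of the original notebook; the sample values are copied from the first cell.

```r
# Assumed: the same two samples defined in the first cell
sample_1 <- c(64, 60, 68, 73, 72, 70)
sample_2 <- c(68, 72, 79, 69, 84, 80, 78)

# rank() uses ties.method = "average" by default, so tied
# observations share the mean of the ranks they occupy
ranks <- rank(c(sample_1, sample_2))

# The ranks of N observations always sum to N(N + 1)/2, ties or not
N <- length(ranks)
rank_total <- sum(ranks)
expected_total <- N * (N + 1) / 2

# In this data the ties come in pairs, so they appear as fractional ranks
tied_ranks <- ranks[ranks != floor(ranks)]
tied_ranks
```

The values 68 and 72 each occur in both samples, which is why some ranks in X and Y are not whole numbers.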

### The W distribution

The test statistic follows the W distribution. Consider the rank vectors X and Y defined above.

Let:

$ m $ be the length of X

$ n $ be the length of Y.

Consider X. The _lowest possible_ sum of its elements occurs when X holds the ranks $ 1, \dots, m $:

$ W_{min} = \frac{m(m+1)}{2} $

The _largest possible_ sum of its elements occurs when X holds the ranks $ n+1, \dots, n+m $:

$ W_{max} = \frac{m(m+n+1+n)}{2} = \frac{m(m+2n+1)}{2} $

#### An example using the smallest possible elements for X

$ X = \{1,2,3\} $

$ Y = \{4,5,6,7\} $

Then in this scenario, the minimum value in the W distribution is:

$ \frac{3(3+1)}{2} = 1+2+3 = 6 $

And the maximum value in the W distribution is:

$ \frac{3(3+1+2 \cdot 4)}{2} = 7+6+5 = 18 $
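The two bounds can be checked numerically. The sketch below defines a hypothetical helper, `w_range()`, which is not part of the original notebook. It evaluates both formulas and then compares them against a brute-force enumeration of every way to pick 3 ranks out of 7.

```r
# Hypothetical helper: smallest and largest possible rank sums for a
# sample of size m pooled with a sample of size n
w_range <- function(m, n) {
  c(min = m * (m + 1) / 2,          # X holds the ranks 1..m
    max = m * (m + 2 * n + 1) / 2)  # X holds the ranks (n + 1)..(n + m)
}

# The worked example above: 3 ranks for X, 4 ranks for Y
example_range <- w_range(3, 4)

# Brute force: each column of combn(7, 3) is one possible set of ranks for X
brute_force_range <- range(colSums(combn(7, 3)))

# The sample sizes used in this notebook (6 and 7 observations)
notebook_range <- w_range(6, 7)

example_range
brute_force_range
notebook_range
```

For the notebook's sample sizes this gives bounds of 21 and 63, so the summed ranks of sample_1 must lie in that interval.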

### The sample statistic

Given the above, the summed ranks of the data in sample_1 must fall somewhere between the minimum and maximum of the W distribution.

In [3]:
testStatistic = sum(X)
testStatistic
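For comparison, R's built-in `wilcox.test()` runs the same test, but it reports the statistic in its Mann-Whitney form: the rank sum of the first sample minus $ m(m+1)/2 $. The sketch below shows the relationship. The names `rank_sum`, `result` and `u_statistic` are illustrative. Setting `exact = FALSE` requests the normal approximation explicitly, which also avoids the warning about ties that R may otherwise emit for this data.

```r
# Assumed: the same two samples defined in the first cell
sample_1 <- c(64, 60, 68, 73, 72, 70)
sample_2 <- c(68, 72, 79, 69, 84, 80, 78)
m <- length(sample_1)

# Rank sum of sample_1, computed the same way as in the notebook
rank_sum <- sum(rank(c(sample_1, sample_2))[seq_len(m)])

# Built-in two-sided test using the normal approximation
result <- wilcox.test(sample_1, sample_2,
                      alternative = "two.sided", exact = FALSE)

# The statistic R labels "W" equals rank_sum - m(m + 1)/2
u_statistic <- unname(result$statistic)

rank_sum
u_statistic
```

Because of this offset, the value printed by `wilcox.test()` cannot be compared directly with rank-sum critical values from a printed table; add $ m(m+1)/2 $ back first.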

### Region of rejection

Using _Table A.14 Critical Values for the Wilcoxon Rank-Sum Test_, find the critical value for this test.

Find the entry for your n_1 and n_2 (n_1 must be the smaller value).

If this is a one-tailed test, look up your chosen $ \alpha $ to find the relevant c value. If this is a two-tailed test, use $ \alpha/2 $ instead.

The c value is the critical value for the upper tail. To find the critical value for the lower tail, $ c^* $, compute:

$ c^* = n_1(n_1+n_2+1) - c $

In [4]:
c = 56
cStar = n_1*(n_1 + n_2 + 1) - c
cStar

If the test statistic is at or above $ c $, or at or below $ c^* $, then you can reject $ H_0 $ in favour of the alternative. Otherwise, you fail to reject $ H_0 $.
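The decision rule can be written out as a short check. This is a minimal sketch that assumes the upper critical value of 56 used in the cell above. The names `w`, `c_upper`, `c_lower` and `reject_h0` are illustrative and not part of the original notebook.

```r
# Assumed: the same two samples defined in the first cell
sample_1 <- c(64, 60, 68, 73, 72, 70)
sample_2 <- c(68, 72, 79, 69, 84, 80, 78)
n_1 <- length(sample_1)
n_2 <- length(sample_2)

# Test statistic: summed ranks of the smaller sample
w <- sum(rank(c(sample_1, sample_2))[seq_len(n_1)])

# Upper critical value as used in the notebook; the lower one follows
# from the symmetry of the W distribution about n_1(n_1 + n_2 + 1)/2
c_upper <- 56
c_lower <- n_1 * (n_1 + n_2 + 1) - c_upper

# Two-tailed rejection region: w >= c_upper or w <= c_lower
reject_h0 <- (w >= c_upper) || (w <= c_lower)

c(w = w, c_lower = c_lower, c_upper = c_upper)
reject_h0
```

With these numbers, the rank sum of 29 lies just inside the interval between 28 and 56, so this sketch does not reject $ H_0 $.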