<img style="float: right;" src="http://www2.le.ac.uk/liscb1.jpg">  
# Leicester Institute of Structural and Chemical Biology: Python for Biochemists
# The Mann–Whitney *U* test

## Requirements
To use the Mann-Whitney *U* test reliably, you need:
* Independent (non-paired, non-sequential) observations
* Ordinal observations, i.e. you can tell whether one observation is 'more than' another
* A null hypothesis that the distributions of both populations are equal
* An alternative hypothesis that the distributions of the two populations are unequal
* More than ~20 observations

Probably the easiest way to understand how the Mann-Whitney *U* test works is to do a simple example in code.  

For comparing two small sets of observations, a direct method is quick and gives insight into the meaning of the *U* statistic, which corresponds to the number of wins out of all pairwise contests. The tortoise and hare example below is taken from the [Wikipedia article](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test).

In [1]:
# We run a race with 6 tortoises and 6 hares.  What's the finishing order? 
race_result = ['T', 'H', 'H', 'H', 'H', 'H', 'T', 'T', 'T', 'T', 'T', 'H']

In [2]:
def mann_whitney_u_sequence(sequence):
    labels = tuple(set(sequence))
    if len(labels) != 2:
        raise ValueError('There must be exactly two kinds of things in your sequence.')
        
    wins = {label:0 for label in labels}
    for i, item in enumerate(sequence):          # Enumerate every item in the list.
        for later_item in sequence[i + 1:]:      # Now check every item that comes later in the sequence,...
            if item != later_item:               # and if later_item is the other kind of thing, item scores a win.
                wins[item] += 1
    return wins

In [3]:
u_sequence = mann_whitney_u_sequence(race_result)
print(f'The Mann-Whitney-U values are: {u_sequence}')

The Mann-Whitney-U values are: {'H': 25, 'T': 11}
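
A quick sanity check on this result: with no ties, every tortoise-versus-hare contest has exactly one winner, so the two *U* values should add up to the number of contests, 6 × 6 = 36. The sketch below recounts the wins and confirms this. The helper name `pairwise_wins` is made up for this illustration and is not part of any library.

```python
from collections import Counter

race_result = ['T', 'H', 'H', 'H', 'H', 'H', 'T', 'T', 'T', 'T', 'T', 'H']

def pairwise_wins(sequence):
    """Count, for each label, how many members of the other group finish later."""
    wins = Counter()
    for i, item in enumerate(sequence):
        wins[item] += sum(1 for later_item in sequence[i + 1:] if later_item != item)
    return dict(wins)

wins = pairwise_wins(race_result)
counts = Counter(race_result)
n_contests = counts['T'] * counts['H']   # every tortoise meets every hare once

# With no ties, each contest has exactly one winner, so the wins sum to n_contests.
assert sum(wins.values()) == n_contests
print(wins, n_contests)
```

Because of this relationship, only one of the two values needs to be counted; the other follows by subtraction.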


If we have a list of placings, we can see how the Mann-Whitney *U* test deals with ties:

In [4]:
finishing_places_hares = [2, 3, 4, 5, 6, 11]
finishing_places_tortoises = [1, 7, 8, 9, 10, 11]

In [5]:
def mann_whitney_u_places(places_1, places_2):
    u_1 = 0
    
    for place_1 in places_1:            # For every value in places_1,
        for place_2 in places_2:        # compare it to every value in places_2.
            if place_1 > place_2:       # A larger value scores one point for group 1
                u_1 += 1                # (for finishing places, larger means a later finish).
            elif place_1 == place_2:    # A tie scores half a point.
                u_1 += 0.5
                
    u_2 = (len(places_1) * len(places_2)) - u_1
    
    return (u_1, u_2)
mann_whitney_u_places(finishing_places_hares, finishing_places_tortoises)

(10.5, 25.5)

In [6]:
u_places = mann_whitney_u_places(finishing_places_hares, finishing_places_tortoises)
print(f'The Mann-Whitney-U values are: {u_places}')

The Mann-Whitney-U values are: (10.5, 25.5)
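
The same values can also be obtained without comparing every pair, using the standard rank-sum form of the statistic: rank all observations together (tied values share the mean of their ranks), sum the ranks of the first group to get $R_1$, and compute $U_1 = R_1 - n_1(n_1 + 1)/2$. The following is a minimal pure-Python sketch of that idea; the helper names `average_ranks` and `mann_whitney_u_ranks` are made up for this illustration.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # positions are 0-based, ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def mann_whitney_u_ranks(sample_1, sample_2):
    """Compute (U_1, U_2) from the rank sum of sample_1 in the pooled data."""
    n_1, n_2 = len(sample_1), len(sample_2)
    ranks = average_ranks(list(sample_1) + list(sample_2))
    rank_sum_1 = sum(ranks[:n_1])
    u_1 = rank_sum_1 - n_1 * (n_1 + 1) / 2
    return (u_1, n_1 * n_2 - u_1)

finishing_places_hares = [2, 3, 4, 5, 6, 11]
finishing_places_tortoises = [1, 7, 8, 9, 10, 11]
u_ranks = mann_whitney_u_ranks(finishing_places_hares, finishing_places_tortoises)
print(f'The rank-based Mann-Whitney-U values are: {u_ranks}')
```

Here the two tied 11th places each receive rank 11.5, which is how the half-win for a tie arises in the rank-sum form.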


While the pairwise comparison works for small datasets, it becomes computationally expensive as they grow, because every observation in one group is compared with every observation in the other. Fortunately, there's a faster, more robust version in `scipy` that also gives the p-value. Make sure to specify whether you want the two-sided (`'two-sided'`) or a one-sided (`'less'` or `'greater'`) alternative hypothesis:

In [7]:
from scipy.stats import mannwhitneyu

result = mannwhitneyu(finishing_places_hares, finishing_places_tortoises, alternative='two-sided')

print(f'The Mann-Whitney-U statistic for the first group is {result.statistic}')
print(f'The Mann-Whitney-U p-value is {result.pvalue}')

The Mann-Whitney-U statistic for the first group is 10.5
The Mann-Whitney-U p-value is 0.26149617619114585
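
To see what the `alternative` argument changes, the sketch below runs the same data through all three options. In `scipy`, `'less'` tests whether the first sample tends to have smaller values than the second (here, whether hares tend to have smaller place numbers, i.e. finish earlier), and `'greater'` tests the opposite direction. The dictionary `p_values` is just a convenience for this illustration, and the exact p-values may vary with the `scipy` version.

```python
from scipy.stats import mannwhitneyu

finishing_places_hares = [2, 3, 4, 5, 6, 11]
finishing_places_tortoises = [1, 7, 8, 9, 10, 11]

p_values = {}
for alternative in ('two-sided', 'less', 'greater'):
    result = mannwhitneyu(finishing_places_hares, finishing_places_tortoises,
                          alternative=alternative)
    p_values[alternative] = result.pvalue
    print(f'{alternative}: U = {result.statistic}, p = {result.pvalue:.3f}')
```

For these data the one-sided p-value in the direction the data lean towards (`'less'`) should be smaller than the two-sided one, and the p-value for the opposite direction (`'greater'`) should be larger. Choose the alternative before looking at the data rather than picking whichever gives the smallest p-value.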
