# Statistical Analysis II - Practicum 1

## Non-parametric statistics

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

### Mann-Whitney U-test (Wilcoxon rank sum test)

Example from [the web](http://users.sussex.ac.uk/~grahamh/RM1web/Mann-Whitney%20worked%20example.pdf)

The effectiveness of advertising for two rival products (Brand X and Brand Y) was compared.
Market research was carried out at a local shopping centre: participants were shown adverts for two rival brands of tea and rated how likely they were to buy the product (out of 10, with 10 being "definitely going to buy the product").
Half of the participants gave ratings for one of the products, the other half gave ratings for the other product.

In [None]:
df = pd.DataFrame({'Participant':np.arange(6),
                   'Rating': [3,4,2,6,2,5]})
df = df.set_index('Participant')

df2 = pd.DataFrame({'Participant':np.arange(6),
                   'Rating': [9,7,5,10,6,8]})

df2 = df2.set_index('Participant')

print(df)

print(df2)

Which test do we use? We have two conditions, with each participant taking part in only one of them, so the two groups are independent. The data are ratings (ordinal data), and hence a non-parametric test is appropriate: the Mann-Whitney U test.

Rank all scores together, ignoring which group they belong to. 

In [None]:
df['Product'] = 1
df2['Product'] = 2

df_compare = pd.concat([df,df2])

df_compare['Rank'] = df_compare.Rating.rank()

df_compare
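Tied ratings share the average of the ranks they would otherwise occupy, which is what the pandas default `rank(method='average')` does. A minimal, self-contained sketch on the twelve ratings from this example (the variable names `ratings` and `ranks` are only illustrative):

```python
import pandas as pd

# All twelve ratings from the example: product 1 first, then product 2
ratings = pd.Series([3, 4, 2, 6, 2, 5, 9, 7, 5, 10, 6, 8])

# method='average' is the pandas default: tied values get the mean of their ranks
ranks = ratings.rank(method='average')

# The two ratings of 2 would occupy ranks 1 and 2, so both receive 1.5
print(pd.DataFrame({'Rating': ratings, 'Rank': ranks}))
```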

Add up the ranks for product 1 to get T1, and the ranks for product 2 to get T2.

In [None]:
T1 = df_compare.groupby('Product').get_group(1)['Rank'].sum()

print(T1)

T2 = df_compare.groupby('Product').get_group(2)['Rank'].sum()

print(T2)
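A quick sanity check on the rank totals: the ranks 1..N always add up to N(N+1)/2, so T1 + T2 must equal that value whether or not there are ties. The sketch below recomputes the totals from the raw ratings so that it runs on its own; `ratings_1` and `ratings_2` are illustrative names, not part of the cells above.

```python
import pandas as pd

ratings_1 = [3, 4, 2, 6, 2, 5]    # product 1
ratings_2 = [9, 7, 5, 10, 6, 8]   # product 2

# Rank all scores together, then split the ranks back into the two groups
ranks = pd.Series(ratings_1 + ratings_2).rank()
T1 = ranks.iloc[:len(ratings_1)].sum()
T2 = ranks.iloc[len(ratings_1):].sum()

# The rank totals must add up to N(N+1)/2
N = len(ranks)
assert T1 + T2 == N * (N + 1) / 2
print(T1, T2, N * (N + 1) / 2)
```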

Select the larger rank total. In this case it’s T2.

Calculate **n1, n2** and **nx**

Here n1 and n2 are the numbers of participants in each group, and nx is the number of participants in the group with the larger rank total.

Therefore

In [None]:
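n1 = len(df)   # participants who rated product 1 (6)
n2 = len(df2)  # participants who rated product 2 (6)
nx = n2        # size of the group with the larger rank total (T2)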

Find U (Note: Tx is the larger rank total, T2 in this case)  

$U = n_1 n_2 + n_x \frac{n_x+1}{2}-T_x$

In [None]:
U = n1*n2+nx*(nx+1)/2-T2

U
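The same formula can be applied to each group in turn. The two resulting values always satisfy U1 + U2 = n1 n2, and the smaller one is the statistic compared against the table; using the larger rank total, as above, gives that smaller value directly. A minimal sketch under those assumptions, recomputing everything from the raw ratings (variable names are illustrative):

```python
import pandas as pd

ratings_1 = [3, 4, 2, 6, 2, 5]    # product 1
ratings_2 = [9, 7, 5, 10, 6, 8]   # product 2
n1, n2 = len(ratings_1), len(ratings_2)

# Joint ranking and per-group rank totals
ranks = pd.Series(ratings_1 + ratings_2).rank()
T1 = ranks.iloc[:n1].sum()
T2 = ranks.iloc[n1:].sum()

# U for each group, using the same formula as above
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - T1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - T2

# The two values always sum to n1 * n2
assert U1 + U2 == n1 * n2

# The smaller value is the one looked up in the table
U_min = min(U1, U2)
print(U1, U2, U_min)
```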

Use a table of critical U values for the Mann-Whitney U Test

![title](images/significance_table_MW5.png)

![title](images/significance_table_MW1.png)

For **n1 = 6** and **n2 = 6**, the critical value of U for a two-tailed test is *5* at the **0.05 significance level** and *2* at the **0.01 significance level**.

To be significant, our obtained U has to be equal to or less than this critical value. 

Our obtained U = 2 does not exceed the critical value of 2 at the 0.01 level, so we can say that there is a highly significant difference (**p<.01**) between the ratings given to each brand in terms of the likelihood of buying the product.
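The decision rule can be written out explicitly. The sketch below hard-codes only the two critical values quoted above for n1 = n2 = 6; the dictionary `critical_U` is an illustrative name and is not a general lookup table.

```python
# Two-tailed critical values for n1 = n2 = 6, as quoted from the tables above
critical_U = {0.05: 5, 0.01: 2}

U = 2  # statistic obtained by hand

# The result is significant when U is equal to or less than the critical value
decisions = {alpha: U <= crit for alpha, crit in critical_U.items()}

for alpha, significant in decisions.items():
    print(f"alpha = {alpha}: critical U = {critical_U[alpha]}, significant = {significant}")
```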

How does this compare with Python's SciPy? The details are in the `mannwhitneyu` documentation [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html).

In [None]:
mannwhitneyu(x=df.Rating,
             y=df2.Rating,
             alternative='two-sided')