# M-squared Test

In [73]:
# This cell is a code cell.  In cells like this
# we run Python code

# Here's some code that will likely appear near the top of every homework or lecture this semester.

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import scipy.stats as stats

import scipy

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)

# To add text (called a comment) to a code cell,
# just put a hash mark (#) at the beginning of the comment.
# Comments are ignored by the computer when executing the code.

When both categorical variables are ordinal, you can assign numeric scores to the categories and run what's called an $M^2$ test.

Recall that we can use `scipy.stats.pearsonr` to calculate the Pearson correlation coefficient, called $r$.  Then, 

$M^2 = (n-1)r^2$

When the sample size, $n$, is large and there is no trend, $M^2$ approximately follows a chi-squared distribution with 1 degree of freedom.
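As a minimal sketch of the formula above, the helper below computes $r$, $M^2$, and the upper-tail chi-squared p-value from a table of counts plus a score for each row and column. The function name `m2_test` and the small table of counts are made up for illustration; they are not from the book.

```python
import numpy as np
from scipy import stats

def m2_test(counts, row_scores, col_scores):
    """Return (r, M2, p) for a two-way table of counts with ordinal scores.

    Hypothetical helper: it expands the table into one (x, y) pair per
    observation, then applies M^2 = (n - 1) * r^2.
    """
    counts = np.asarray(counts)
    row_scores = np.asarray(row_scores, dtype=float)
    col_scores = np.asarray(col_scores, dtype=float)

    # Row and column index of every cell in the table
    rows, cols = np.indices(counts.shape)

    # Repeat each cell's scores as many times as its count
    x = np.repeat(row_scores[rows.ravel()], counts.ravel())
    y = np.repeat(col_scores[cols.ravel()], counts.ravel())

    r, _ = stats.pearsonr(x, y)
    n = counts.sum()
    m2 = (n - 1) * r**2

    # The p-value is the area to the right of M^2 under chi-squared(1)
    p = stats.chi2.sf(m2, df=1)
    return r, m2, p

# Made-up 3x2 table, used only to show the call
toy_counts = [[20, 5], [15, 10], [10, 15]]
r, m2, p = m2_test(toy_counts, row_scores=[0, 1, 2], col_scores=[0, 1])
print(f"r = {r:.4f}, M^2 = {m2:.3f}, p = {p:.4f}")
```

Note that the p-value comes from the survival function `chi2.sf` (upper-tail area), not from the density `chi2.pdf`.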

$H_o$: There is NO trend between these two variables.

$H_a$: There is a trend between these two variables.

Notice that this alternative has no direction: it does not say whether the trend is increasing or decreasing.  We will return to this discussion later.


The numbers that you attach to the ordered categories **DO matter**.  Let's look at some examples using data from our book on the average number of servings of alcohol a mother has while pregnant and whether her baby has any malformations.

|Alcohol (avg. servings)|Malformation absent|Malformation present|
|---|---|---|
|0 |17066  |48|
|<1 |14464 |38|
|1-2|788   |5|
|3-5|126   |1|
|6+ |37    |1|

The numbers you give the labels are supposed to reflect how "far apart" you feel these categories are.  For the first example, let's use the reasonable choice of 0, 1, 2, 3, and 4 for the labels.




In [74]:
a = 0
b = 1
c = 2
d = 3
e = 4

zero = np.repeat(a, 17114)
one = np.repeat(b, 14502)
two = np.repeat(c, 793)
thr = np.repeat(d, 127)
four = np.repeat(e, 38)

consumption = np.append(zero, np.append(one, np.append(two, np.append(thr, four))))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 17066), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 14464), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 788), np.repeat(j, 5))
mal_thr = np.append(np.repeat(i, 126), np.repeat(j, 1))
mal_four = np.append(np.repeat(i, 37), np.repeat(j, 1))

malformation = np.append(mal_zero, np.append(mal_one, np.append(mal_two, np.append(mal_thr, mal_four))))

(r, p) = stats.pearsonr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# The p-value is the upper-tail area, so use the survival function (sf), not the density (pdf)
pval = scipy.stats.chi2.sf(M2, df = 1)

print(f"With these values for the categories, the M-square is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from an r of {r}")


With these values for the categories, the M-square is 1.828, which has a p-value of 0.1764.
This comes from an r of 0.007490842429343666


Next, let's use the midpoint of each interval as the label.  That is, 0 servings of alcohol will be paired with 0, and less than 1 serving will be paired with 0.5.  The 1-2 interval has a midpoint of 1.5, while 3-5 has a midpoint of 4.  The last interval, 6+, has no upper bound and therefore no midpoint, but we don't want to choose an extreme value.  The choice of 7 is arbitrary, but that does not mean it has no effect on the results.

In [75]:
a = 0
b = .5
c = 1.5
d = 4
e = 7  # change to 6, 70, or some other value to see how this label affects the result


zero = np.repeat(a, 17114)
one = np.repeat(b, 14502)
two = np.repeat(c, 793)
thr = np.repeat(d, 127)
four = np.repeat(e, 38)

consumption = np.append(zero, np.append(one, np.append(two, np.append(thr, four))))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 17066), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 14464), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 788), np.repeat(j, 5))
mal_thr = np.append(np.repeat(i, 126), np.repeat(j, 1))
mal_four = np.append(np.repeat(i, 37), np.repeat(j, 1))


malformation = np.append(mal_zero, np.append(mal_one, np.append(mal_two, np.append(mal_thr, mal_four))))

(r, p) = stats.pearsonr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# The p-value is the upper-tail area, so use the survival function (sf), not the density (pdf)
pval = scipy.stats.chi2.sf(M2, df = 1)


print(f"With these values for the categories, the M-square is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from an r of {r}")

With these values for the categories, the M-square is 6.57, which has a p-value of 0.0104.
This comes from an r of 0.014202067281577017
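To make the effect of the arbitrary top label concrete, here is a small sketch that recomputes $M^2$ for a few choices of the 6+ score while keeping the other midpoints fixed. The helper name `m2_for_top_score` is hypothetical; the counts are the ones in the table above.

```python
import numpy as np
from scipy import stats

# Group sizes and number of malformations per consumption group (from the table)
sizes = np.array([17114, 14502, 793, 127, 38])
present = np.array([48, 38, 5, 1, 1])

# 0 = malformation absent, 1 = present, listed group by group
malformation = np.concatenate(
    [np.repeat([0, 1], [n - k, k]) for n, k in zip(sizes, present)]
)

def m2_for_top_score(top_score):
    """Hypothetical helper: M^2 and p-value when the 6+ group gets `top_score`."""
    scores = np.array([0, 0.5, 1.5, 4, top_score], dtype=float)
    consumption = np.repeat(scores, sizes)
    r, _ = stats.pearsonr(consumption, malformation)
    m2 = (len(consumption) - 1) * r**2
    return m2, stats.chi2.sf(m2, df=1)

for top in (6, 7, 70):
    m2, p = m2_for_top_score(top)
    print(f"top score {top:>2}: M^2 = {m2:.3f}, p-value = {p:.4f}")
```

Only 38 of the 32,574 mothers are in the 6+ group, yet the score given to that group still changes the statistic, which is the sense in which the labels matter.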


What if we used Spearman's rho, the non-parametric (rank-based) version of correlation?  It is available as `scipy.stats.spearmanr`.

In [76]:
stats.pearsonr(consumption, malformation)

PearsonRResult(statistic=0.014202067281577017, pvalue=0.010369491709263002)

In [77]:
stats.spearmanr(consumption, malformation)

SpearmanrResult(correlation=0.0032846493413217063, pvalue=0.5533142141720796)

In [78]:
a = 0
b = 1
c = 2
d = 3
e = 4


zero = np.repeat(a, 17114)
one = np.repeat(b, 14502)
two = np.repeat(c, 793)
thr = np.repeat(d, 127)
four = np.repeat(e, 38)

consumption = np.append(zero, np.append(one, np.append(two, np.append(thr, four))))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 17066), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 14464), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 788), np.repeat(j, 5))
mal_thr = np.append(np.repeat(i, 126), np.repeat(j, 1))
mal_four = np.append(np.repeat(i, 37), np.repeat(j, 1))


malformation = np.append(mal_zero, np.append(mal_one, np.append(mal_two, np.append(mal_thr, mal_four))))

(r, p) = stats.pearsonr(malformation, consumption)

(rho, p_2) = stats.spearmanr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# Upper-tail areas (sf), not densities (pdf), give the p-values
pval = scipy.stats.chi2.sf(M2, df = 1)

pval2 = scipy.stats.chi2.sf(rho**2 *(len(consumption)-1),  df = 1)


print(f"With these values for the categories, the M-square from Pearson's r is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from a Pearson r of {r}")

print(f"With these values for the categories, the M-square from Spearman's rho is {np.round(rho**2 *(len(consumption)-1), 3)}, which has a p-value of {np.round(pval2, 4)}.")
print(f"This comes from a rho of {rho}")

With these values for the categories, the M-square from Pearson's r is 1.828, which has a p-value of 0.1764.
This comes from a Pearson r of 0.007490842429343666
With these values for the categories, the M-square from Spearman's rho is 0.351, which has a p-value of 0.5533.
This comes from a rho of 0.003284649341321706


In [79]:
a = 0
b = .5
c = 1.5
d = 4
e = 7


zero = np.repeat(a, 17114)
one = np.repeat(b, 14502)
two = np.repeat(c, 793)
thr = np.repeat(d, 127)
four = np.repeat(e, 38)

consumption = np.append(zero, np.append(one, np.append(two, np.append(thr, four))))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 17066), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 14464), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 788), np.repeat(j, 5))
mal_thr = np.append(np.repeat(i, 126), np.repeat(j, 1))
mal_four = np.append(np.repeat(i, 37), np.repeat(j, 1))


malformation = np.append(mal_zero, np.append(mal_one, np.append(mal_two, np.append(mal_thr, mal_four))))

(r, p) = stats.pearsonr(malformation, consumption)

(rho, p_2) = stats.spearmanr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)


# Upper-tail areas (sf), not densities (pdf), give the p-values
pval = scipy.stats.chi2.sf(M2, df = 1)

pval2 = scipy.stats.chi2.sf(rho**2 *(len(consumption)-1),  df = 1)


print(f"With these values for the categories, the M-square from Pearson's r is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from a Pearson r of {r}")

print(f"With these values for the categories, the M-square from Spearman's rho is {np.round(rho**2 *(len(consumption)-1), 3)}, which has a p-value of {np.round(pval2, 4)}.")
print(f"This comes from a rho of {rho}")

save_for_later = r *(len(consumption)-1)**0.5

With these values for the categories, the M-square from Pearson's r is 6.57, which has a p-value of 0.0104.
This comes from a Pearson r of 0.014202067281577017
With these values for the categories, the M-square from Spearman's rho is 0.351, which has a p-value of 0.5533.
This comes from a rho of 0.003284649341321706


In [80]:
a = (17114+1)/2
b = (17114*2 + 14502)/2
c = (17114*3+14502*2+793)/2
d = (17114*4+14502*3+793*2+127)/2
e = (17114*5+14502*4+793*3+127*2+38)/2


zero = np.repeat(a, 17114)
one = np.repeat(b, 14502)
two = np.repeat(c, 793)
thr = np.repeat(d, 127)
four = np.repeat(e, 38)

consumption = np.append(zero, np.append(one, np.append(two, np.append(thr, four))))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 17066), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 14464), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 788), np.repeat(j, 5))
mal_thr = np.append(np.repeat(i, 126), np.repeat(j, 1))
mal_four = np.append(np.repeat(i, 37), np.repeat(j, 1))


malformation = np.append(mal_zero, np.append(mal_one, np.append(mal_two, np.append(mal_thr, mal_four))))

(r, p) = stats.pearsonr(malformation, consumption)

(rho, p_2) = stats.spearmanr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# Upper-tail areas (sf), not densities (pdf), give the p-values
pval = scipy.stats.chi2.sf(M2, df = 1)

pval2 = scipy.stats.chi2.sf(rho**2 *(len(consumption)-1),  df = 1)


print(f"With these values for the categories, the M-square from Pearson's r is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from a Pearson r of {r}")

print(f"With these values for the categories, the M-square from Spearman's rho is {np.round(rho**2 *(len(consumption)-1), 3)}, which has a p-value of {np.round(pval2, 4)}.")
print(f"This comes from a rho of {rho}")

With these values for the categories, the M-square from Pearson's r is 1.895, which has a p-value of 0.1686.
This comes from a Pearson r of 0.0076274892663406955
With these values for the categories, the M-square from Spearman's rho is 0.351, which has a p-value of 0.5533.
This comes from a rho of 0.003284649341321706


As you can see, if we use the non-parametric correlation, known as Spearman's rho or $\rho$, then the labels don't matter as long as they keep the categories in the same order.  However, in this case the $M^2$ test based on $\rho$ is not significant.
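One way to see why the labels drop out: Spearman's rho is Pearson's $r$ computed on the ranks of the data, with tied values sharing their average rank (the midrank). The sketch below checks this on the alcohol data, assuming `scipy.stats.rankdata` with its default averaging of ties; the helper name `spearman_for` is made up for illustration.

```python
import numpy as np
from scipy import stats

# Group sizes and number of malformations per consumption group (from the table)
sizes = np.array([17114, 14502, 793, 127, 38])
present = np.array([48, 38, 5, 1, 1])

# 0 = malformation absent, 1 = present, listed group by group
malformation = np.concatenate(
    [np.repeat([0, 1], [n - k, k]) for n, k in zip(sizes, present)]
)

def spearman_for(scores):
    """Hypothetical helper: Spearman's rho for a given set of category scores."""
    consumption = np.repeat(np.asarray(scores, dtype=float), sizes)
    rho, _ = stats.spearmanr(consumption, malformation)
    return rho

# Two different labelings that keep the categories in the same order
rho_integers = spearman_for([0, 1, 2, 3, 4])
rho_midpoints = spearman_for([0, 0.5, 1.5, 4, 7])

# Pearson's r on the midranks of both variables
consumption = np.repeat([0.0, 1.0, 2.0, 3.0, 4.0], sizes)
r_on_ranks, _ = stats.pearsonr(
    stats.rankdata(consumption), stats.rankdata(malformation)
)

print(rho_integers, rho_midpoints, r_on_ranks)
```

All three printed values agree up to floating-point rounding, because ranking throws away the spacing between labels and keeps only their order.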

If the data is balanced (meaning group sizes within the larger sample are the same or at least very similar), then the number values on the labels don't matter as much.  

Let's examine these fictitious data.

In [81]:
a = 0
b = 1
c = 2


zero = np.repeat(a, 1000)
one = np.repeat(b, 1000)
two = np.repeat(c, 1000)


consumption = np.append(zero, np.append(one, two))

i = 0
j = 1

mal_zero = np.append(np.repeat(i, 1000-48), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 1000-38), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 1000-5), np.repeat(j, 5))



malformation = np.append(mal_zero, np.append(mal_one, mal_two))

(r, p) = stats.pearsonr(malformation, consumption)

(rho, p_2) = stats.spearmanr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# Upper-tail areas (sf), not densities (pdf), give the p-values
pval = scipy.stats.chi2.sf(M2, df = 1)

pval2 = scipy.stats.chi2.sf(rho**2 *(len(consumption)-1),  df = 1)


print(f"With these values for the categories, the M-square from Pearson's r is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from a Pearson r of {r}")

print(f"With these values for the categories, the M-square from Spearman's rho is {np.round(rho**2 *(len(consumption)-1), 3)}, which has a p-value of {np.round(pval2, 4)}.")
print(f"This comes from a rho of {rho}")

With these values for the categories, the M-square from Pearson's r is 31.421, which has a p-value of 0.0.
This comes from a Pearson r of -0.10235793797163036
With these values for the categories, the M-square from Spearman's rho is 31.421, which has a p-value of 0.0.
This comes from a rho of -0.10235793797162661


In [82]:
a = (17114+1)/2
b = (17114*2 + 14502)/2
c = (17114*3+14502*2+793)/2


zero = np.repeat(a, 1000)
one = np.repeat(b, 1000)
two = np.repeat(c, 1000)


consumption = np.append(zero, np.append(one, two))

i = 0
j = 2

mal_zero = np.append(np.repeat(i, 1000-48), np.repeat(j, 48))
mal_one = np.append(np.repeat(i, 1000-38), np.repeat(j, 38))
mal_two = np.append(np.repeat(i, 1000-5), np.repeat(j, 5))



malformation = np.append(mal_zero, np.append(mal_one, mal_two))

(r, p) = stats.pearsonr(malformation, consumption)

(rho, p_2) = stats.spearmanr(malformation, consumption)

M2 = r**2 * (len(consumption)-1)

# Upper-tail areas (sf), not densities (pdf), give the p-values
pval = scipy.stats.chi2.sf(M2, df = 1)

pval2 = scipy.stats.chi2.sf(rho**2 *(len(consumption)-1),  df = 1)


print(f"With these values for the categories, the M-square from Pearson's r is {np.round(M2, 3)}, which has a p-value of {np.round(pval, 4)}.")
print(f"This comes from a Pearson r of {r}")

print(f"With these values for the categories, the M-square from Spearman's rho is {np.round(rho**2 *(len(consumption)-1), 3)}, which has a p-value of {np.round(pval2, 4)}.")
print(f"This comes from a rho of {rho}")

With these values for the categories, the M-square from Pearson's r is 31.558, which has a p-value of 0.0.
This comes from a Pearson r of -0.10258163601545953
With these values for the categories, the M-square from Spearman's rho is 31.421, which has a p-value of 0.0.
This comes from a rho of -0.10235793797162661


# Response labels?

What about the numbers used to label the responses?  Do they matter?  It depends.  When the response is binary, you can change its labels without changing $M^2$, because any two distinct labels are just a linear rescaling of 0 and 1, and $r^2$ is unaffected by linear rescaling.  In the code above, the response labels are the variables `i` and `j`, so they are easy to change.  Change them and you'll see that $M^2$ does not change.


When the response has 3 or more ordered levels, the labels matter in much the same way they did with the predictor.  
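The binary case can be checked directly. The sketch below recomputes $M^2$ for the alcohol data with the midpoint scores and three different pairs of response labels, including one pair that reverses the order. The helper name `m2_with_response_labels` and the label pairs are illustrative choices, not from the book.

```python
import numpy as np
from scipy import stats

# Group sizes and number of malformations per consumption group (from the table)
sizes = np.array([17114, 14502, 793, 127, 38])
present = np.array([48, 38, 5, 1, 1])

# Midpoint scores for the predictor, as in the earlier example
consumption = np.repeat([0, 0.5, 1.5, 4, 7], sizes)

def m2_with_response_labels(absent_label, present_label):
    """Hypothetical helper: M^2 when the binary response uses the given labels."""
    malformation = np.concatenate([
        np.repeat([absent_label, present_label], [n - k, k])
        for n, k in zip(sizes, present)
    ])
    r, _ = stats.pearsonr(consumption, malformation)
    return (len(consumption) - 1) * r**2

for labels in [(0, 1), (0, 2), (10, -3)]:
    print(labels, round(m2_with_response_labels(*labels), 3))
```

Reversing the order of the labels flips the sign of $r$, but squaring removes the sign, so $M^2$ is the same in all three cases.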


Speaking of this, $M^2$ is like $\chi^2$ in that it does not care which variable is the response and which is the predictor.  Recall, $r$ doesn't care about that either.  


# M and directions



So, if the statistic for the $M^2$ test is $M^2$ what is $M$?  

Let $M = r \sqrt{n-1}$, the signed square root of $M^2$ (it takes the sign of $r$).  When $n$ is large and the null hypothesis is true, $M$ approximately follows the standard normal distribution, so it can be used with a directed alternative hypothesis.

Let $\rho$ represent the correlation, then 

$H_o: \rho = 0$

$H_a: \rho > 0$

If you're going to use the undirected alternative, $H_a: \rho \neq 0$, then use the $M^2$ test instead.


If we go back to the example where we used the interval midpoints as the labels, we see that $M = 2.563$ and the one-sided p-value associated with that is $p = 0.0052$.  Using these, we could conclude not only that there is a relationship between malformations and the mother's alcohol consumption, but also that the likelihood of a malformation increases as the mother's consumption increases.




In [83]:
M = save_for_later

M

2.5631879068826491

In [86]:
1-stats.norm.cdf(2.56)

0.0052336081635557807
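The directed and undirected tests are tied together: the two-sided p-value from $M^2$ is twice the one-sided p-value from $M$ when $M$ points in the direction of the alternative. A minimal sketch, reusing the $r$ printed above for the midpoint labels and $n = 32574$ (the sum of the group sizes); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

r = 0.014202067281577017   # Pearson r from the midpoint-label example
n = 17114 + 14502 + 793 + 127 + 38

M = r * np.sqrt(n - 1)

# One-sided p-value for H_a: rho > 0 (area to the right of M under N(0, 1))
p_one_sided = stats.norm.sf(M)

# Two-sided p-value from the M^2 test (area to the right of M^2 under chi-squared(1))
p_two_sided = stats.chi2.sf(M**2, df=1)

print(f"M = {M:.4f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

Using `stats.norm.sf(M)` instead of `1 - stats.norm.cdf(...)` avoids retyping a rounded value of $M$ and is more accurate for very small tail areas.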