# Comparing Group Membership

The $χ^2$ test can be used to compare two categorical variables and helps us answer questions like:

- Is whether or not a customer churns independent of their subscription plan?
- Are doctors less likely to smoke?
- Does playing on the home field give a soccer team an advantage?

In this lesson we will dive into how the test is performed.

## The $χ^2$ Contingency Table Test

The $χ^2$ test can also be used in several other ways, but we will use what is referred to as the contingency table test, which lets us test the hypothesis that one categorical variable is independent of another. To do this, we will:

1. Calculate the theoretical expected values
2. Find the actual observed values
3. Calculate a test statistic and p-value based on the two tables above

Specifically, our test statistic, $χ^2$, is given by:

$$χ^2=∑\frac{(O−E)^2}{E}$$

Here $O$ is the observed count and $E$ is the expected count in each cell of the table, and the sum runs over every cell.
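To make the formula concrete, here is a minimal sketch in plain Python. The function name `chi2_statistic` and the 2x2 counts are made up for illustration; they are not part of the lesson's dataset.

```python
def chi2_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two equally shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total

# Made-up 2x2 counts, for illustration only
observed = [[10, 20], [30, 40]]
# Expected counts under independence: row total * column total / grand total
expected = [[12.0, 18.0], [28.0, 42.0]]

statistic = chi2_statistic(observed, expected)
print(round(statistic, 4))
```

Each cell contributes more to the statistic the further its observed count sits from its expected count, relative to the size of the expected count.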

For this example, we will look at the dataset on cars that we explored previously.

In [1]:
import pandas as pd
from scipy import stats
from pydataset import data

mpg = data('mpg')
mpg['transmission'] = mpg.trans.str[:-4] # a little cleaning goes a long way
mpg.head()

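|   | manufacturer | model | displ | year | cyl | trans      | drv | cty | hwy | fl | class   | transmission |
|---|--------------|-------|-------|------|-----|------------|-----|-----|-----|----|---------|--------------|
| 1 | audi         | a4    | 1.8   | 1999 | 4   | auto(l5)   | f   | 18  | 29  | p  | compact | auto         |
| 2 | audi         | a4    | 1.8   | 1999 | 4   | manual(m5) | f   | 21  | 29  | p  | compact | manual       |
| 3 | audi         | a4    | 2.0   | 2008 | 4   | manual(m6) | f   | 20  | 31  | p  | compact | manual       |
| 4 | audi         | a4    | 2.0   | 2008 | 4   | auto(av)   | f   | 21  | 30  | p  | compact | auto         |
| 5 | audi         | a4    | 2.8   | 1999 | 6   | auto(l5)   | f   | 16  | 26  | p  | compact | auto         |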


In [2]:
from IPython.core.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))

We will investigate whether a car's drive type is independent of its transmission type.

$H_0$: drive is independent of transmission type

#### Calculating Expected Proportions

To begin with, we will calculate the values we would expect to see if the two groups are independent.

For each subgroup, we calculate its proportion of the total, then multiply each drive type's proportion by each transmission type's proportion to get the expected proportion for every combination.

To start with, we'll calculate the proportions for transmission type:

In [3]:
n = mpg.shape[0]

transmission_proportions = mpg.transmission.value_counts() / n
transmission_proportions

auto      0.67094
manual    0.32906
Name: transmission, dtype: float64

This tells us that cars with automatic transmissions make up ~ 67% of the total, and cars with manual transmissions make up ~ 33% of the total.

Now we'll do the same for drive types.

In [4]:
drive_proportions = mpg.drv.value_counts() / n
drive_proportions

f    0.452991
4    0.440171
r    0.106838
Name: drv, dtype: float64

To find the expected proportion for each combination, we multiply every pair of proportions together.

For example, to find the expected proportion of cars with an automatic transmission and 4-wheel drive, we would multiply those two proportions together.

$$.67∗.44 = .2948$$

So we would expect about 29.5% of the total cars to be automatic and 4-wheel drive.

Below we show some code that will loop through all of the proportions and perform this calculation for all combinations of groups.


In [5]:
expected = pd.DataFrame()

# Series.items() yields (label, value) pairs; iteritems() was removed in pandas 2.0
for transmission_group, t_prop in transmission_proportions.items():
    for drive_group, d_prop in drive_proportions.items():
        expected.loc[drive_group, transmission_group] = t_prop * d_prop

expected.sort_index(inplace=True)
expected

|   | auto     | manual   |
|---|----------|----------|
| 4 | 0.295328 | 0.144843 |
| f | 0.30393  | 0.149061 |
| r | 0.071682 | 0.035156 |


To convert these proportions to expected counts, we multiply by the total number of observations:

In [6]:
expected *= n
expected

|   | auto      | manual    |
|---|-----------|-----------|
| 4 | 69.106838 | 33.893162 |
| f | 71.119658 | 34.880342 |
| r | 16.773504 | 8.226496  |
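The nested loop above is equivalent to an outer product of the two sets of marginal proportions. The sketch below hard-codes the proportions printed earlier rather than loading the dataset, and it assumes `n = 234` (the number of rows in `mpg`, which the lesson does not print). The variable names are illustrative.

```python
import numpy as np

# Marginal proportions as printed earlier in this lesson
drive_props = np.array([0.440171, 0.452991, 0.106838])   # 4, f, r
transmission_props = np.array([0.67094, 0.32906])        # auto, manual

# Assumption: n = 234 rows in mpg (the lesson does not print n)
n = 234

# Entry [i, j] is drive_props[i] * transmission_props[j]
expected_props = np.outer(drive_props, transmission_props)
expected_counts = expected_props * n
print(expected_counts.round(2))
```

Because the inputs are rounded proportions, the result should agree with the table above only to a few decimal places.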


Now that we have the expected counts, we need the observed counts so that we can compare them. To do this, we'll use the `crosstab` function from pandas.

In [7]:
observed = pd.crosstab(mpg.drv, mpg.transmission)
observed

| drv | auto | manual |
|-----|------|--------|
| 4   | 75   | 28     |
| f   | 65   | 41     |
| r   | 17   | 8      |
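As a side note, `crosstab` can also report the row and column totals that the expected values are built from. The sketch below uses a small hypothetical DataFrame (not the `mpg` data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data, not taken from mpg
cars = pd.DataFrame({
    'drv': ['f', 'f', '4', '4', 'r', 'f'],
    'transmission': ['auto', 'manual', 'auto', 'auto', 'manual', 'auto'],
})

# margins=True appends an 'All' row and column holding the totals
counts = pd.crosstab(cars.drv, cars.transmission, margins=True)
print(counts)
```

The totals are handy for checking by hand, but they should be left out of any table passed to a $χ^2$ calculation.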


Now we can calculate our test statistic, $χ^2$:


In [8]:
chi2 = ((observed - expected)**2 / expected).values.sum()
chi2

3.136769245971112

We also need to find our degrees of freedom for the distribution. The degrees of freedom are given by:

$$(nrows−1)×(ncols−1)$$

Where nrows and ncols are the number of rows and columns in our contingency table.


In [9]:
nrows, ncols = observed.shape

degrees_of_freedom = (nrows - 1) * (ncols - 1)

Now, based on the test statistic and degrees of freedom, we could look up the corresponding p-value in a pre-calculated table, or use `scipy`'s `chi2` distribution.


In [10]:
stats.chi2(degrees_of_freedom).sf(chi2)

0.20838152534979645
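The `sf` method is the survival function, that is, the probability of seeing a statistic at least this large if the null hypothesis were true. The sketch below checks it two ways: against `1 - cdf`, and against the closed form $e^{-x/2}$, which holds only for the special case of 2 degrees of freedom. The variable names are illustrative.

```python
import math
from scipy import stats

chi2_value = 3.136769245971112   # test statistic computed above
degrees_of_freedom = 2           # (3 - 1) * (2 - 1)

dist = stats.chi2(degrees_of_freedom)
p_sf = dist.sf(chi2_value)          # upper-tail probability
p_cdf = 1 - dist.cdf(chi2_value)    # same quantity via the CDF

# With exactly 2 degrees of freedom the upper tail is exp(-x / 2)
p_closed_form = math.exp(-chi2_value / 2)

print(p_sf, p_cdf, p_closed_form)
```

In general `sf` is preferable to `1 - cdf`, since it keeps more precision when the p-value is very small.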

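With a p-value this high (about 0.21, well above a conventional significance level such as 0.05), we fail to reject our null hypothesis: the data do not give us evidence that drive type and transmission type are dependent.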

### Another Example

Suppose we have the following contingency table:

|          | Product A | Product B |
|----------|-----------|-----------|
| Churn    | 100       | 50        |
| No Churn | 120       | 28        |

And we want to know if a customer churning is independent of which product offering they have.

We have all the information that we need to run a $χ^2$ test, because we can calculate the population proportions from the above table.

1. Find the totals for Product A, Product B, Churn, and No Churn

|          | Product A | Product B | Total |
|----------|-----------|-----------|-------|
| Churn    | 100       | 50        | 150   |
| No Churn | 120       | 28        | 148   |
| Total    | 220       | 78        | 298   |

2. Calculate the proportions
- Product A = 220 / 298 = .738
- Product B = 78 / 298 = .262
- Churn = 150 / 298 = .503
- No churn = 148 / 298 = .497

3. Multiply these together to produce a contingency table of expected values

First we calculate the expected proportions:

|          | Product A | Product B |
|----------|-----------|-----------|
| Churn    | 0.372     | 0.132     |
| No Churn | 0.367     | 0.130     |

Then we multiply by the total number of observations (298) to get the expected counts:

|          | Product A | Product B |
|----------|-----------|-----------|
| Churn    | 110.7     | 39.3      |
| No Churn | 109.3     | 38.7      |

4. Calculate the test statistic and compute a p-value

In [11]:
index = ['Churn', 'No Churn']
columns = ['Product A', 'Product B']

observed = pd.DataFrame([[100, 50], [120, 28]], index=index, columns=columns)
n = observed.values.sum()

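# Expected proportions rounded to three decimals, so the counts printed below
# differ slightly from the exact expected counts in the table above
expected = pd.DataFrame([[.372, .132], [.367, .130]], index=index, columns=columns) * n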

chi2 = ((observed - expected)**2 / expected).values.sum()

nrows, ncols = observed.shape

degrees_of_freedom = (nrows - 1) * (ncols - 1)

p = stats.chi2(degrees_of_freedom).sf(chi2)

print('Observed')
print(observed)
print('---\nExpected')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
          Product A  Product B
Churn           100         50
No Churn        120         28
---
Expected
          Product A  Product B
Churn       110.856     39.336
No Churn    109.366     38.740
---

chi^2 = 7.9656
p     = 0.0048


#### The Easy Way

We can also give our observed values to the `chi2_contingency` function from scipy's `stats` module to make all the calculations for us.


In [12]:
observed = pd.crosstab(mpg.drv, mpg.transmission)

In [13]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed

[[75 28]
 [65 41]
 [17  8]]
---
Expected

[[69.10683761 33.89316239]
 [71.11965812 34.88034188]
 [16.77350427  8.22649573]]
---

chi^2 = 3.1368
p     = 0.2084


Note that this function will return not just the $χ^2$ test statistic and p-value, but also the degrees of freedom, and a matrix of expected values.
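One caveat when using `chi2_contingency` on a 2x2 table such as the churn example: when there is only one degree of freedom, the function applies Yates' continuity correction by default, so its statistic will not match a by-hand sum of $(O−E)^2/E$. Passing `correction=False` turns this off. The sketch below compares the two; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Churn by product counts from the example above
observed = np.array([[100, 50], [120, 28]])

# Default: Yates' continuity correction is applied because the table is 2x2
chi2_corrected, p_corrected, dof, expected = stats.chi2_contingency(observed)

# correction=False reproduces the plain sum of (O - E)^2 / E
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print(f'with correction:    chi^2 = {chi2_corrected:.4f}, p = {p_corrected:.4f}')
print(f'without correction: chi^2 = {chi2_plain:.4f}, p = {p_plain:.4f}')
```

The uncorrected statistic here is computed from unrounded expected counts, so it should also differ slightly from the manual churn calculation above, which used proportions rounded to three decimals. For the 3x2 drive by transmission table the correction does not apply, which is why the manual and `chi2_contingency` results agree there.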