Cramér's V is a measure of association between two categorical variables.
https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V
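
For reference, the uncorrected form of the statistic that the implementations below compute can be written as follows. Here $\chi^2$ is Pearson's chi-squared statistic of the contingency table, $n$ is the total number of observations, and $r$ and $k$ are the numbers of rows and columns (these symbol names are chosen for illustration):

$$
V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; r - 1)}}
$$

$V$ ranges from 0 (no association) to 1 (complete association).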
> This notebook benchmarks several implementations of Cramér's V.
- Timings below were measured on an Intel i5-8400 CPU.


In [129]:
import numpy as np
import pandas as pd


generate data

In [130]:
# generate a random DataFrame with two categorical variables

# Define the categories
animal = ['cat', 'dog', 'mouse']
color = ['red', 'blue', 'green']
size = 100

# Generate a pd.DataFrame with two categorical variables
df = pd.DataFrame({
    'animal': np.random.choice(animal, size),
    'color': np.random.choice(color, size)
})
df.head(2)

  animal color
0    cat  blue
1    cat  blue
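
The data above is drawn without a seed, so each run produces different rows and a different value of V. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable here; the seed value `0` and the name `df_seeded` are arbitrary choices for illustration, and this variant will not reproduce the numbers shown in this notebook:

```python
import numpy as np
import pandas as pd

# Seeded generator so the same rows are drawn on every run (seed value is arbitrary)
rng = np.random.default_rng(0)

animal = ['cat', 'dog', 'mouse']
color = ['red', 'blue', 'green']
size = 100

# Same layout as df above, but reproducible
df_seeded = pd.DataFrame({
    'animal': rng.choice(animal, size),
    'color': rng.choice(color, size),
})
```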


method 1, concise method with scipy and pandas

In [131]:
from scipy.stats.contingency import association

In [132]:
def cramer1(a, b):
    xtab = pd.crosstab(a, b)
    return association(xtab, method='cramer')
cramer1(df['animal'], df['color'])

0.1268005573659622

method 2, implementing the formula described on the Wikipedia page

In [133]:
from scipy.stats import chi2_contingency

In [134]:
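def cramer2(a, b):
    xtab = pd.crosstab(a, b)
    # correction=False matches association(); the default would apply Yates' correction to 2x2 tables
    chi2 = chi2_contingency(xtab, correction=False)[0]
    return np.sqrt((chi2 / xtab.values.sum()) / min(xtab.shape[0] - 1, xtab.shape[1] - 1))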
cramer2(df['animal'], df['color'])

0.1268005573659622
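
To make the formula concrete, here is a sketch that computes the chi-squared statistic by hand from the observed and expected counts instead of calling `chi2_contingency`. The function name `cramers_v_from_table` and the example tables are hypothetical, and the sketch assumes every row and column total is non-zero (otherwise the expected counts would contain zeros):

```python
import numpy as np

def cramers_v_from_table(observed):
    """Cramér's V from a 2-D array of counts, without continuity correction."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return np.sqrt((chi2 / n) / min(observed.shape[0] - 1, observed.shape[1] - 1))

# Hypothetical tables, for illustration only
perfect = np.diag([5, 5, 5])               # each row category maps to exactly one column
independent = np.array([[10, 20],
                        [20, 40]])         # rows are proportional to each other

v_perfect = cramers_v_from_table(perfect)          # complete association
v_independent = cramers_v_from_table(independent)  # no association
```

The two tables bracket the range of the statistic: the diagonal table gives V = 1 and the proportional table gives V = 0.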

method 3, replace pd.crosstab with a faster NumPy crosstab

In [139]:
# modified based on
# https://gist.github.com/alexland/d6d64d3f634895b9dc8e

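def numpy_crosstab(a, b):
    # Map each value to the integer index of its category
    uniq_vals_a, idx_a = np.unique(a, return_inverse=True)
    uniq_vals_b, idx_b = np.unique(b, return_inverse=True)
    shape_xt = (uniq_vals_a.size, uniq_vals_b.size)
    xt = np.zeros(shape_xt, dtype='uint')
    # Count every (row, column) pair, including repeated pairs
    np.add.at(xt, (idx_a, idx_b), 1)
    return xt

def cramer3(a, b):
    xtab = numpy_crosstab(a, b)
    # correction=False matches association(); the default would apply Yates' correction to 2x2 tables
    chi2 = chi2_contingency(xtab, correction=False)[0]
    return np.sqrt((chi2 / xtab.sum()) / min(xtab.shape[0] - 1, xtab.shape[1] - 1))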
cramer3(df['animal'], df['color'])

0.1268005573659622
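
`numpy_crosstab` relies on `np.add.at` rather than a plain `xt[idx_a, idx_b] += 1`. A small sketch of why, using made-up index arrays: with fancy indexing, a repeated (row, column) pair is incremented only once, whereas `np.add.at` accumulates every occurrence.

```python
import numpy as np

# Made-up indices; the pair (0, 1) appears three times
idx_a = np.array([0, 0, 1, 0])
idx_b = np.array([1, 1, 2, 1])

buffered = np.zeros((2, 3), dtype='uint')
buffered[idx_a, idx_b] += 1                 # buffered[0, 1] ends up as 1

unbuffered = np.zeros((2, 3), dtype='uint')
np.add.at(unbuffered, (idx_a, idx_b), 1)    # unbuffered[0, 1] ends up as 3
```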

benchmarking

In [69]:
%%timeit
cramer1(df['animal'], df['color'])

5.94 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [70]:
%%timeit
cramer2(df['animal'], df['color'])

5.38 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [140]:
%%timeit
cramer3(df['animal'], df['color'])

375 µs ± 8.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
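
The timings above use `size = 100`, where fixed per-call overhead probably dominates. A sketch of how the comparison could be repeated outside the notebook at more than one input size, using the standard `timeit` module. The helper names, the sizes, the seed, and the repeat counts are all arbitrary choices for illustration, and no particular result is implied:

```python
import timeit

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def crosstab_pandas(a, b):
    return pd.crosstab(a, b).to_numpy()

def crosstab_numpy(a, b):
    uniq_a, idx_a = np.unique(np.asarray(a), return_inverse=True)
    uniq_b, idx_b = np.unique(np.asarray(b), return_inverse=True)
    xt = np.zeros((uniq_a.size, uniq_b.size), dtype='uint')
    np.add.at(xt, (idx_a, idx_b), 1)
    return xt

def cramers_v(xtab):
    chi2 = chi2_contingency(xtab, correction=False)[0]
    return np.sqrt((chi2 / xtab.sum()) / min(xtab.shape[0] - 1, xtab.shape[1] - 1))

rng = np.random.default_rng(0)  # arbitrary seed
results = {}
for size in (100, 10_000):
    a = pd.Series(rng.choice(['cat', 'dog', 'mouse'], size))
    b = pd.Series(rng.choice(['red', 'blue', 'green'], size))
    for name, crosstab in (('pandas', crosstab_pandas), ('numpy', crosstab_numpy)):
        # Best of 3 repeats of 20 calls each, reported per call
        total = min(timeit.repeat(lambda: cramers_v(crosstab(a, b)), number=20, repeat=3))
        results[(name, size)] = total / 20
        print(f'{name:>6} size={size:>6}: {results[(name, size)] * 1e6:.0f} µs per call')
```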
