# Review

In [2]:
import pandas as pd
from scipy.stats import chi2_contingency

As a final exercise, the NPI dataset has been loaded for you as npi. Remember that the columns are defined as follows:
- `influence`: `yes` = I have a natural talent for influencing people; `no` = I am not good at influencing people.
- `blend_in`: `yes` = I prefer to blend in with the crowd; `no` = I like to be the center of attention.
- `special`: `yes` = I think I am a special person; `no` = I am no better or worse than most people.
- `leader`: `yes` = I see myself as a good leader; `no` = I am not sure if I would make a good leader.
- `authority`: `yes` = I like to have authority over other people; `no` = I don’t mind following orders.

Which other pairs of questions might be associated (or not)? Use the workspace and your newfound skills to investigate for yourself!

In [3]:
df = pd.read_csv('npi_sample.csv')
df.head()

  influence blend_in special leader authority
0        no      yes     yes    yes       yes
1        no      yes      no     no        no
2       yes       no     yes    yes       yes
3       yes       no      no    yes       yes
4       yes      yes      no    yes        no


### One example: association between two categorical columns

Frequency Table

In [4]:
influence_blend_in_freq = pd.crosstab(df.influence, df.blend_in)
influence_blend_in_freq

blend_in     no   yes
influence            
no          773  3535
yes        2626  4163


Proportion Table

In [5]:
influence_blend_in_prop = influence_blend_in_freq / len(df)
influence_blend_in_prop

blend_in         no       yes
influence                    
no         0.069658  0.318555
yes        0.236641  0.375146


Marginal Proportions

In [6]:
influence_marginal = influence_blend_in_prop.sum(axis=1)
blend_in_marginal = influence_blend_in_prop.sum(axis=0)

influence_marginal, blend_in_marginal

(influence
 no     0.388213
 yes    0.611787
 dtype: float64,
 blend_in
 no     0.306299
 yes    0.693701
 dtype: float64)
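
If the two answers were independent, each expected cell proportion would be the product of the corresponding marginal proportions, and multiplying by the sample size would give the expected counts. A minimal sketch of that calculation, with the observed counts copied from the frequency table above into a NumPy array (the variable names here are illustrative, not part of the notebook). The result can be compared with the `expected` array that `chi2_contingency` returns.

```python
import numpy as np

# Observed counts copied from the frequency table above
# (rows: influence no/yes, columns: blend_in no/yes).
observed = np.array([[773, 3535],
                     [2626, 4163]])
n = observed.sum()

# Marginal proportions of each variable.
row_marginal = observed.sum(axis=1) / n   # influence
col_marginal = observed.sum(axis=0) / n   # blend_in

# Expected counts under independence: product of marginals times n.
expected_by_hand = np.outer(row_marginal, col_marginal) * n
print(expected_by_hand)
```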

Expected Contingency Table

In [7]:
chi2, p, dof, expected = chi2_contingency(influence_blend_in_freq)

In [8]:
expected, influence_blend_in_freq

(array([[1319.53609084, 2988.46390916],
        [2079.46390916, 4709.53609084]]),
 blend_in     no   yes
 influence            
 no          773  3535
 yes        2626  4163)

In [9]:
p, dof

(np.float64(8.431959337996535e-118), 1)

In [10]:
chi2

np.float64(532.4132818664078)
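
For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic above is slightly smaller than the uncorrected Pearson value. A sketch that reproduces it by hand, again with the observed counts copied from the frequency table (the names `chi2_yates` and `chi2_pearson` are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: influence no/yes, columns: blend_in no/yes).
observed = np.array([[773, 3535],
                     [2626, 4163]])

chi2_scipy, p, dof, expected = chi2_contingency(observed)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring.
chi2_yates = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()

# Uncorrected Pearson statistic, for comparison.
chi2_pearson = (((observed - expected) ** 2) / expected).sum()

print(chi2_scipy, chi2_yates, chi2_pearson)
```

Passing `correction=False` to `chi2_contingency` would return the uncorrected value instead.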

**Answer:** The pair `influence` and `blend_in` is associated: the observed contingency table differs substantially from the table expected under independence. The $\chi^2$ statistic is about `532`, far above the critical value of about `4` (more precisely `3.84` for 1 degree of freedom at the 5% significance level), and the p-value is effectively zero.
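
The critical value mentioned above comes from the $\chi^2$ distribution with the test's degrees of freedom. A short sketch of how it can be looked up, assuming a significance level of 0.05 (the notebook does not state one) and using the statistic reported earlier:

```python
from scipy.stats import chi2 as chi2_dist

alpha = 0.05   # assumed significance level, not stated in the notebook
dof = 1        # degrees of freedom reported for the 2x2 table

# Critical value: the statistic must exceed this to reject independence.
critical_value = chi2_dist.ppf(1 - alpha, dof)

# Right-tail probability of the observed statistic, i.e. its p-value.
statistic = 532.4132818664078
p_value = chi2_dist.sf(statistic, dof)

print(critical_value, p_value)
```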

### Testing

In [11]:
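def association(col1, col2):
    """Cross-tabulate two categorical columns and run a chi-square test of independence."""
    freq = pd.crosstab(col1, col2)   # observed counts
    prop = freq / len(col1)          # joint proportions
    marginal1 = prop.sum(axis=1)     # marginal proportions of col1
    marginal2 = prop.sum(axis=0)     # marginal proportions of col2
    chi2, p, dof, expected = chi2_contingency(freq)

    return freq, prop, marginal1, marginal2, chi2, p, dof, expected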

In [17]:
freq, prop, marginal1, marginal2, chi2, p, dof, expected = association(df.influence, df.special)

print(f"Frequency: \n{freq}", end="\n\n")
print(f"Proportion: \n{prop}", end="\n\n")
print(f"Marginal 1: \n{marginal1}", end="\n\n")
print(f"Marginal 2: \n{marginal2}", end="\n\n")
print(f"Chi2: {chi2}", end="\n\n")
print(f"P-value: {p}", end="\n\n")
print(f"Degrees of Freedom: {dof}", end="\n\n")
print(f"Expected: \n{expected}")

Frequency: 
special      no   yes
influence            
no         2725  1583
yes        3249  3540

Proportion: 
special          no       yes
influence                    
no         0.245562  0.142651
yes        0.292782  0.319005

Marginal 1: 
influence
no     0.388213
yes    0.611787
dtype: float64

Marginal 2: 
special
no     0.538344
yes    0.461656
dtype: float64

Chi2: 250.80246206335414

P-value: 1.73578852784625e-56

Degrees of Freedom: 1

Expected: 
[[2319.1846445 1988.8153555]
 [3654.8153555 3134.1846445]]


In [23]:
import itertools

columns = df.columns
combinations = list(itertools.combinations(columns, 2))

for combination in combinations:
    print(f"Association between {combination[0]} and {combination[1]}")
    print("--------------------------------------------------")
    freq, prop, marginal1, marginal2, chi2, p, dof, expected = association(df[combination[0]], df[combination[1]])
    print(f"Chi2: {chi2}", end="\n\n\n")

Association between influence and blend_in
--------------------------------------------------
Chi2: 532.4132818664078


Association between influence and special
--------------------------------------------------
Chi2: 250.80246206335414


Association between influence and leader
--------------------------------------------------
Chi2: 1307.8836807573769


Association between influence and authority
--------------------------------------------------
Chi2: 356.9691576604298


Association between blend_in and special
--------------------------------------------------
Chi2: 631.5051574353496


Association between blend_in and leader
--------------------------------------------------
Chi2: 462.44980106783


Association between blend_in and authority
--------------------------------------------------
Chi2: 665.4529799272262


Association between special and leader
--------------------------------------------------
Chi2: 410.7382415694936


Association between special and authority
---------