### Co-occurrence Analysis and Finding Potential Relationships

Given a co-occurrence matrix, compute the following:
1. Joint probabilities
2. Marginal probabilities
3. Joint probabilities under the assumption that the entities are independent
4. Pointwise mutual information (PMI)
5. Phi-square
6. Chi-square

In [1]:
import numpy as np
np.set_printoptions(precision=5, suppress=True)

We work off the following co-occurrence table:


| <i></i>         | Apple | Facebook | Tesla |
| --------------- |:-----:|:--------:|:-----:|
| Elon Musk       | 10    | 15       | 300   |
| Mark Zuckerberg | 500   | 10000    | 500   |
| Tim Cook        | 200   | 30       | 10    |


Each cell counts the news articles in which a person and a company are mentioned together: Elon Musk and Apple co-occur in 10 articles, Elon Musk and Facebook in 15, and so on.

In [2]:
co_occurrence_table = np.array([[10, 15, 300],
                                [500, 10000, 500],
                                [200, 30, 10]])
print(co_occurrence_table)

[[   10    15   300]
 [  500 10000   500]
 [  200    30    10]]


### Task 1 - Joint Probabilities

In [3]:
joint_prob = co_occurrence_table/co_occurrence_table.sum()

In [4]:
joint_prob

array([[0.00086, 0.0013 , 0.02594],
       [0.04323, 0.86468, 0.04323],
       [0.01729, 0.00259, 0.00086]])

In [7]:
joint_prob.sum()

1.0

### Task 2 - Marginal Probabilities

In [10]:
# Marginal probability of each person: sum each row across the companies
marginal_prob_people = joint_prob.sum(axis=1)
marginal_prob_people

array([0.0281 , 0.95115, 0.02075])

In [11]:
# Marginal probability of each company: sum each column across the people
marginal_prob_companies = joint_prob.sum(axis=0)
marginal_prob_companies

array([0.06139, 0.86857, 0.07004])

### Task 3 - Joint Probabilities if Entities Are Independent

In [15]:
# The outer product is order dependent: people first gives rows = people, columns = companies,
# matching the layout of joint_prob
joint_prob_people_company_independent = np.outer(marginal_prob_people,marginal_prob_companies)
joint_prob_people_company_independent

array([[0.00173, 0.02441, 0.00197],
       [0.05839, 0.82614, 0.06662],
       [0.00127, 0.01802, 0.00145]])
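
As a quick sanity check, the table built from the outer product should itself be a probability distribution with the same marginals as the observed one. The sketch below recomputes everything from the counts so that it runs on its own; the short variable names (`counts`, `joint`, `independent`, and so on) are chosen here for illustration and are not the notebook's names.

```python
import numpy as np

# Same co-occurrence counts as above (rows = people, columns = companies)
counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])

joint = counts / counts.sum()
p_people = joint.sum(axis=1)       # row sums
p_companies = joint.sum(axis=0)    # column sums
independent = np.outer(p_people, p_companies)

# The independence table should sum to 1 ...
total = independent.sum()
# ... and reproduce the observed marginals along each axis
same_row_marginals = np.allclose(independent.sum(axis=1), p_people)
same_col_marginals = np.allclose(independent.sum(axis=0), p_companies)

print(total, same_row_marginals, same_col_marginals)
```

Only the way probability mass is spread across the cells differs between the two tables, which is what the PMI and phi-square steps measure.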

### Task 4 - PMI

In [17]:
ratio = joint_prob/joint_prob_people_company_independent
ratio

array([[ 0.50119,  0.05314, 13.17949],
       [ 0.7404 ,  1.04665,  0.64899],
       [13.57394,  0.14391,  0.59491]])

In [19]:
pmi = np.log2(ratio)
pmi

array([[-0.99657, -4.23412,  3.72022],
       [-0.43363,  0.06578, -0.62373],
       [ 3.76277, -2.79671, -0.74926]])

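Ranking the nine PMI values from largest to smallest, the top three are 3.76277 (Tim Cook and Apple), 3.72022 (Elon Musk and Tesla), and 0.06578 (Mark Zuckerberg and Facebook). These are the only positive values in the table, and they recover the CEO/company pairings. A positive PMI means a pair co-occurs more often than independence would predict; a negative PMI means it co-occurs less often.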
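
The ranking can also be done programmatically rather than by eye. The following is a minimal sketch, assuming the row and column labels from the table at the top; the `people` and `companies` lists and the `ranked` variable are introduced here for illustration.

```python
import numpy as np

# Labels taken from the co-occurrence table (rows = people, columns = companies)
people = ["Elon Musk", "Mark Zuckerberg", "Tim Cook"]
companies = ["Apple", "Facebook", "Tesla"]

counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])

joint = counts / counts.sum()
independent = np.outer(joint.sum(axis=1), joint.sum(axis=0))
pmi = np.log2(joint / independent)

# Sort the flattened PMI values in descending order,
# then map the flat positions back to (row, column) indices
order = np.argsort(pmi, axis=None)[::-1]
rows, cols = np.unravel_index(order, pmi.shape)

ranked = [(people[r], companies[c], float(pmi[r, c])) for r, c in zip(rows, cols)]

# Show the three strongest associations
for person, company, score in ranked[:3]:
    print(f"{person:16s} {company:9s} {score: .5f}")
```

The first three entries of `ranked` should correspond to the three largest PMI values in the array above.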

### Task 5 - Compute Phi-square

$$\phi^2 = \sum_{A,B} \frac{\left(P(A,B)-P(A)P(B)\right)^2}{P(A)P(B)}$$

In [20]:
numer = (joint_prob-joint_prob_people_company_independent)**2

In [21]:
# Denominator: the joint probabilities expected under independence, P(A)P(B)
denom = joint_prob_people_company_independent

In [22]:
phi_square = (numer/denom).sum()

In [23]:
phi_square

0.5430990706343366

### Task 6 - Compute Chi-square

In [27]:
chi_square = phi_square*co_occurrence_table.sum()
chi_square

6280.940751886103
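
The cell above obtains chi-square by scaling phi-square by the total number of co-occurrences, N. As a cross-check, the statistic can also be computed directly from observed and expected counts, where the expected count for a cell is (row total × column total) / N. The sketch below is self-contained and uses illustrative variable names (`counts`, `expected`, `chi_square_direct`) that are not part of the notebook.

```python
import numpy as np

counts = np.array([[10, 15, 300],
                   [500, 10000, 500],
                   [200, 30, 10]])
n = counts.sum()

# Expected counts under independence: (row total * column total) / N
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n

# Pearson chi-square computed directly on counts
chi_square_direct = ((counts - expected) ** 2 / expected).sum()

# Phi-square on probabilities, as in Task 5
joint = counts / n
independent = np.outer(joint.sum(axis=1), joint.sum(axis=0))
phi_square = ((joint - independent) ** 2 / independent).sum()

# The two routes should agree up to floating-point error
print(chi_square_direct, phi_square * n)
```

The agreement follows from substituting observed = N·P(A,B) and expected = N·P(A)P(B) into the count-based formula: a factor of N² in the numerator and N in the denominator leaves N times the phi-square sum.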