# Two-Way Tables and Two Categorical Variables

Data science is all about relationships between variables. How do we summarize and visualize the relationship between two categorical variables?

For example, what can we say about the relationship between gender and survival on the Titanic?

In [None]:
import pandas as pd

df_titanic = pd.read_csv("https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/titanic_imbalanced.csv")
df_titanic.head()

Unnamed: 0,name,gender,age,class,embarked,country,ticketno,fare,survived
0,"Abbing, Mr. Anthony",male,42.0,3rd,S,United States,5547.0,7.11,0
1,"Abbott, Mr. Eugene Joseph",male,13.0,3rd,S,United States,2673.0,20.05,0
2,"Abbott, Mr. Rossmore Edward",male,16.0,3rd,S,United States,2673.0,20.05,0
3,"Abbott, Mrs. Rhoda Mary 'Rosa'",female,39.0,3rd,S,England,2673.0,20.05,1
4,"Abelseth, Miss. Karen Marie",female,16.0,3rd,S,Norway,348125.0,7.13,1


We can summarize each variable individually like we did in the previous lesson.

In [None]:
df_titanic["gender"].value_counts()

male      1718
female     489
Name: gender, dtype: int64

In [None]:
df_titanic["survived"].value_counts()

0    1496
1     711
Name: survived, dtype: int64

But this does not tell us how gender interacts with survival. To do that, we need to produce a _cross-tabulation_, or "cross-tab" for short. (Statisticians tend to call this a _contingency table_ or a _two-way table_.)

In [None]:
pd.crosstab(df_titanic["survived"], df_titanic["gender"])

| survived \ gender | female | male |
|---|---|---|
| 0 | 130 | 1366 |
| 1 | 359 | 352 |


A cross-tabulation of two categorical variables is a two-dimensional array, with the levels of one variable along the rows and the levels of the other variable along the columns. Each cell contains the number of observations with that particular combination of levels. So in the Titanic data set, there were 359 females who survived and 1366 males who died. From the cross-tabulation, we can see that more females survived than died (359 vs. 130), while far more males died than survived (1366 vs. 352). Gender was clearly strongly associated with survival, which is consistent with the Titanic's policy of ["women and children first"](https://en.wikipedia.org/wiki/Women_and_children_first).
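To see what a cross-tabulation computes, here is a minimal sketch that rebuilds the table by hand. It assumes the four counts shown above and reconstructs a small stand-in `DataFrame` from them (the names `counts`, `rows`, `df`, and `by_hand` are illustrative and not part of the original notebook), so it runs without downloading the CSV.

```python
import pandas as pd

# Counts copied from the cross-tabulation above: (survived, gender) -> count.
counts = {(0, "female"): 130, (0, "male"): 1366,
          (1, "female"): 359, (1, "male"): 352}

# Rebuild one row per person so the sketch runs without downloading the CSV.
rows = [(s, g) for (s, g), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["survived", "gender"])

# Count each (survived, gender) combination, then pivot gender into columns.
by_hand = df.groupby(["survived", "gender"]).size().unstack(fill_value=0)

# The built-in helper should produce the same table of counts.
table = pd.crosstab(df["survived"], df["gender"])

print(by_hand)
print((by_hand.to_numpy() == table.to_numpy()).all())
```

In this sketch, `pd.crosstab()` behaves like a shortcut for "group by both variables, count, and reshape".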

To get proportions (which we interpret as probabilities) instead of counts, we specify `normalize=True`, which divides every cell by the total number of observations.


In [None]:
joint_survived_gender = pd.crosstab(df_titanic["survived"], df_titanic["gender"],
                                    normalize=True)
joint_survived_gender

| survived \ gender | female | male |
|---|---|---|
| 0 | 0.058903 | 0.61894 |
| 1 | 0.162664 | 0.159493 |


Notice that the four probabilities in this table add up to 1.0. Each of these probabilities is called a joint probability and can be notated, for example, as

$$ P(\text{female}, \text{died}) = 0.058903.$$

Collectively, these probabilities make up the _joint distribution_ of the variables **survived** and **gender**.
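As a rough check on where these joint probabilities come from, the sketch below divides the table of counts by the grand total. The `counts` frame is typed in by hand from the cross-tabulation above, and the variable names are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Counts typed in from the cross-tabulation above (rows: survived, columns: gender).
counts = pd.DataFrame({"female": [130, 359], "male": [1366, 352]},
                      index=pd.Index([0, 1], name="survived"))
counts.columns.name = "gender"

# Each joint probability is a cell count divided by the total number of people.
total = counts.to_numpy().sum()
joint = counts / total

print(total)
print(joint.round(6))
print(joint.to_numpy().sum())  # the four joint probabilities should sum to 1
```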

## Marginal Distributions

Is it possible to recover the distribution of **gender** alone from the joint distribution of **survived** and **gender**?

Yes! We simply sum the probabilities for each **gender** over all the possible levels of **survived**.

\begin{align}
P(\text{female}) = P(\text{female}, \text{died}) + P(\text{female}, \text{survived}) &= 0.058903 + 0.162664 = 0.221567 \\
P(\text{male}) = P(\text{male}, \text{died}) + P(\text{male}, \text{survived}) &= 0.618940 + 0.159493 = 0.778433
\end{align}

In code, we sum the `DataFrame` _over_ one of its dimensions, using the `axis=` argument of `.sum()` to specify which one. (The code's result can differ in the last digit from the hand calculation above, because the values above were already rounded to six decimals.)

- `axis=0` refers to the rows. In the current example, **survived** is the variable along this axis.
- `axis=1` refers to the columns. In the current example, **gender** is the variable along this axis.

Since we want to sum _over_ the **survived** variable, we specify `.sum(axis=0)`.

In [None]:
gender = joint_survived_gender.sum(axis=0)
gender

gender
female    0.221568
male      0.778432
dtype: float64

When calculated from a joint distribution, the distribution of one variable is called a _marginal distribution_. So the above is the marginal distribution of **gender**.

The name "marginal distribution" comes from the fact that it is customary to write these totals in the _margins_ of the table. In fact, `pd.crosstab()` has an argument `margins=` that automatically adds these margins to the cross-tabulation.

In [None]:
pd.crosstab(df_titanic["survived"], df_titanic["gender"],
            normalize=True, margins=True)

| survived \ gender | female | male | All |
|---|---|---|---|
| 0 | 0.058903 | 0.61894 | 0.677843 |
| 1 | 0.162664 | 0.159493 | 0.322157 |
| All | 0.221568 | 0.778432 | 1.0 |


While the margins are useful for display purposes, they actually make computations more difficult, since it is easy to mix up which numbers correspond to joint probabilities and which ones correspond to marginal probabilities.

Likewise, to obtain the marginal distribution of **survived**, we sum over the possible levels of **gender** (which is the variable along `axis=1`).

In [None]:
survived = joint_survived_gender.sum(axis=1)
survived

survived
0    0.677843
1    0.322157
dtype: float64
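One way to convince yourself that these sums really are the single-variable distributions is to compare them with `value_counts(normalize=True)`. The sketch below does this on a stand-in `DataFrame` rebuilt from the four counts in the cross-tabulation; the variable names are illustrative assumptions.

```python
import pandas as pd

# Stand-in data rebuilt from the counts in the cross-tabulation above.
counts = {(0, "female"): 130, (0, "male"): 1366,
          (1, "female"): 359, (1, "male"): 352}
rows = [(s, g) for (s, g), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["survived", "gender"])

joint = pd.crosstab(df["survived"], df["gender"], normalize=True)

# Marginals obtained from the joint distribution.
gender_marginal = joint.sum(axis=0)    # sums over survived (the rows)
survived_marginal = joint.sum(axis=1)  # sums over gender (the columns)

# The same distributions computed directly from each column of the data.
gender_direct = df["gender"].value_counts(normalize=True).sort_index()
survived_direct = df["survived"].value_counts(normalize=True).sort_index()

# Largest absolute difference between the two routes (expected to be ~0).
gender_gap = (gender_marginal - gender_direct).abs().max()
survived_gap = (survived_marginal - survived_direct).abs().max()
print(gender_gap, survived_gap)
```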

## Conditional Distributions

Let's take another look at the joint distribution of **survived** and **gender**.

In [None]:
pd.crosstab(df_titanic["survived"], df_titanic["gender"],
            normalize=True, margins=True)

| survived \ gender | female | male | All |
|---|---|---|---|
| 0 | 0.058903 | 0.61894 | 0.677843 |
| 1 | 0.162664 | 0.159493 | 0.322157 |
| All | 0.221568 | 0.778432 | 1.0 |


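In [None]:
# Look up each joint probability by its (survived, gender) labels.
p_male_survived = joint_survived_gender.loc[1, "male"]      # P(male, survived)
p_female_survived = joint_survived_gender.loc[1, "female"]  # P(female, survived)
p_male_died = joint_survived_gender.loc[0, "male"]          # P(male, not survived)
p_female_died = joint_survived_gender.loc[0, "female"]      # P(female, not survived)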


From the joint distribution, it is tempting to conclude that females and males did not differ too much in their survival rates, since

$$ P(\text{female}, \text{survived}) = 0.162664 $$

is not too different from

$$ P(\text{male}, \text{survived}) = 0.159493. $$

This is because there were 359 women and 352 men who survived, out of 2207 passengers.

But this is the wrong comparison. The joint probabilities are affected by the baseline gender probabilities, and over three-quarters of the people aboard the Titanic were men. $P(\text{male}, \text{survived})$ and $ P(\text{female}, \text{survived})$ should not even be close if men were just as likely to survive as women, simply because of the sheer number of men aboard.
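To make the baseline effect concrete, here is a hypothetical calculation (an assumption for illustration, not something observed in the data): suppose both genders had survived at the overall rate of 0.322157. The joint probabilities would then be

\begin{align}
P(\text{male}, \text{survived}) &= 0.778432 \times 0.322157 \approx 0.2508 \\
P(\text{female}, \text{survived}) &= 0.221568 \times 0.322157 \approx 0.0714,
\end{align}

so the male figure would be about three and a half times the female one. The observed values, 0.159493 and 0.162664, are nearly equal instead.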

A better comparison is between the conditional probabilities. We ought to compare

$$ P(\text{survived} | \text{female}) $$

to

$$ P(\text{survived} | \text{male}). $$

To calculate each conditional probability, we simply divide the joint probability by the marginal probability. That is,

\begin{align}
P(\text{survived} | \text{female}) = \frac{P(\text{female}, \text{survived})}{P(\text{female})} &= \frac{0.162664}{0.221568} \approx 0.7341 \\
P(\text{survived} | \text{male}) = \frac{P(\text{male}, \text{survived})}{P(\text{male})} &= \frac{0.159493}{0.778432} \approx 0.2049
\end{align}

The conditional probabilities expose the stark difference in survival rates. One way to think about conditional probabilities is that they _adjust_ for the baseline gender probabilities. By dividing by $P(\text{male})$ and $P(\text{female})$, we adjust for the fact that there were more men and fewer women on the Titanic, thus enabling an apples-to-apples comparison.
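The division of joint by marginal can be sketched in pandas with `DataFrame.div`. The example assumes a stand-in `DataFrame` rebuilt from the four counts in the cross-tabulation, with illustrative variable names, and compares the result with `normalize="columns"`.

```python
import pandas as pd

# Stand-in data rebuilt from the counts in the cross-tabulation above.
counts = {(0, "female"): 130, (0, "male"): 1366,
          (1, "female"): 359, (1, "male"): 352}
rows = [(s, g) for (s, g), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["survived", "gender"])

joint = pd.crosstab(df["survived"], df["gender"], normalize=True)
p_gender = joint.sum(axis=0)  # marginal distribution of gender

# Divide each column of the joint table by that gender's marginal probability.
survived_given_gender = joint.div(p_gender, axis="columns")

# pandas can do the same normalization directly.
direct = pd.crosstab(df["survived"], df["gender"], normalize="columns")

print(survived_given_gender.round(4))
print((survived_given_gender - direct).abs().to_numpy().max())
```

Each column of `survived_given_gender` sums to 1, because it is a distribution over **survived** within one gender.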

We can also get this result directly from `pd.crosstab()` by specifying `normalize="columns"`, which divides each column by its total so that every column sums to 1.

In [None]:
pd.crosstab(df_titanic["survived"], df_titanic["gender"],
            normalize="columns", margins=True)

| survived \ gender | female | male | All |
|---|---|---|---|
| 0 | 0.265849 | 0.795111 | 0.677843 |
| 1 | 0.734151 | 0.204889 | 0.322157 |


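Passing `normalize="index"` instead divides each row by its total, giving the conditional distribution of **gender** given **survived**. For example, $P(\text{female} | \text{survived}) = 0.504923$.

In [None]: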
pd.crosstab(df_titanic["survived"], df_titanic["gender"],
            normalize="index", margins=True)

| survived \ gender | female | male |
|---|---|---|
| 0 | 0.086898 | 0.913102 |
| 1 | 0.504923 | 0.495077 |
| All | 0.221568 | 0.778432 |
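The row-normalized table can be reproduced the same way, this time dividing each row of the joint distribution by the marginal probability of **survived**. As before, this is a sketch on stand-in data rebuilt from the four counts, with illustrative variable names.

```python
import pandas as pd

# Stand-in data rebuilt from the counts in the cross-tabulation above.
counts = {(0, "female"): 130, (0, "male"): 1366,
          (1, "female"): 359, (1, "male"): 352}
rows = [(s, g) for (s, g), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["survived", "gender"])

joint = pd.crosstab(df["survived"], df["gender"], normalize=True)
p_survived = joint.sum(axis=1)  # marginal distribution of survived

# Divide each row of the joint table by that outcome's marginal probability.
gender_given_survived = joint.div(p_survived, axis="index")

# Compare with the table pandas produces for normalize="index".
direct = pd.crosstab(df["survived"], df["gender"], normalize="index")

print(gender_given_survived.round(6))
print(gender_given_survived.sum(axis=1))  # each row should sum to 1
```

Note that this table answers a different question from the column-normalized one: it describes who the survivors were, not how likely each gender was to survive.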
