# Part 19: Frequency Tables

A frequency table is a data table that shows how many records fall into each category of one or more categorical variables. The cells below build frequency tables with pd.crosstab().


In [1]:
import numpy as np
import pandas as pd
import os

In [7]:
os.chdir('/home/sindhuvarun/github/ML-Learning/staticsAndProbability/PythonForDataAnalytics/dataset/Titanic')
titanic_train = pd.read_csv('train.csv')
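# Convert Cabin to strings; missing values (NaN) become the string 'nan'
char_cabin = titanic_train['Cabin'].astype(str)
# Keep only the first character (the deck letter); missing cabins become 'n'
new_cabin = np.array([cabin[0] for cabin in char_cabin])
titanic_train['Cabin'] = pd.Categorical(new_cabin)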
titanic_train.head(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [6]:
survived_tab = pd.crosstab(index=titanic_train['Survived'], columns="count")
survived_tab

Survived,count
0,549
1,342


In [9]:
pClass_tab = pd.crosstab(index = titanic_train['Pclass'], columns="count")
pClass_tab

Pclass,count
1,216
2,184
3,491


In [11]:
gender_tab = pd.crosstab(index=titanic_train['Sex'], columns="number")
gender_tab

Sex,number
female,314
male,577


In [12]:
cabin_tab = pd.crosstab(index=titanic_train['Cabin'], columns="count")
cabin_tab

Cabin,count
n,687
C,59
E,32
G,4
D,33
A,15
B,47
F,13
T,1


Even these simple one-way tables give us some useful insight: we immediately get a sense of the distribution of records across the categories. For instance, we see that males outnumbered females by a wide margin (577 to 314) and that there were more third class passengers (491) than first and second class passengers combined (400).
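
For a single column, `Series.value_counts()` gives the same counts as a one-way `crosstab()`. The sketch below uses a small made-up `toy` DataFrame (not the Titanic data) so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column such as 'Sex'.
toy = pd.DataFrame({"Sex": ["male", "female", "male", "male", "female"]})

# One-way frequency table with crosstab, as in the cells above.
tab = pd.crosstab(index=toy["Sex"], columns="count")

# The same counts as a Series; sort_index() orders the categories by label,
# matching the row order of the crosstab result.
counts = toy["Sex"].value_counts().sort_index()

print(tab)
print(counts)
```

`crosstab()` returns a DataFrame and `value_counts()` returns a Series, so which one is more convenient depends on what you plan to do with the counts next.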

__Another useful aspect of frequency tables is that they allow you to extract the proportion of the data that belongs to each category:__

In [13]:
cabin_tab/cabin_tab.sum()

Cabin,count
n,0.771044
C,0.066218
E,0.035915
G,0.004489
D,0.037037
A,0.016835
B,0.05275
F,0.01459
T,0.001122
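
The same proportions can also come straight from pandas rather than from a manual division. A minimal sketch on made-up cabin letters (the `toy` Series is hypothetical), assuming the `normalize` option of `crosstab()` and `value_counts()`:

```python
import pandas as pd

# Hypothetical toy cabin letters; 'n' marks a missing cabin, as above.
toy = pd.Series(["n", "C", "n", "B", "n", "C", "n", "n"], name="Cabin")

# Manual approach: divide the counts by their total.
tab = pd.crosstab(index=toy, columns="count")
by_division = tab / tab.sum()

# Let crosstab normalize over the whole table.
by_normalize = pd.crosstab(index=toy, columns="count", normalize=True)

# Or skip crosstab and ask value_counts for proportions directly.
by_value_counts = toy.value_counts(normalize=True)

print(by_division)
print(by_normalize)
print(by_value_counts)
```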


## Two-Way Tables
Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give you insight into the relationship between two variables.

In [16]:
# Table of survival vs gender
survived_gender = pd.crosstab(index=titanic_train['Survived'], 
                             columns=titanic_train['Sex'])
survived_gender.index=["Dead", "Survived"]
survived_gender

Sex,female,male
Dead,81,468
Survived,233,109


In [19]:
# Table of survival vs passenger class
survived_Pclass = pd.crosstab(index=titanic_train['Survived'],
                             columns=titanic_train['Pclass'])
survived_Pclass.index = ["Dead", "Survived"]
survived_Pclass

Pclass,1,2,3
Dead,80,97,372
Survived,136,87,119


__You can get the marginal counts (totals for each row and column) by including the argument margins=True:__

In [27]:
survived_Pclass = pd.crosstab(index=titanic_train['Survived'],
                             columns=titanic_train['Pclass'],
                             margins=True)
survived_Pclass.index = ['Dead', 'Survival', 'ColTotal']
survived_Pclass.columns = ['Class1', "Class2", "Class3", "RowTotal"]
survived_Pclass

,Class1,Class2,Class3,RowTotal
Dead,80,97,372,549
Survival,136,87,119,342
ColTotal,216,184,491,891
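
If you only need the totals to carry a different label, `crosstab()` also accepts a `margins_name` argument, which avoids relabeling the index and columns by hand. A small sketch on made-up data (the `toy` frame and the label "Total" are illustrative choices, not part of the Titanic example):

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 0],
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
})

# margins=True adds a totals row and column; margins_name sets their label
# (the default label is "All").
with_totals = pd.crosstab(index=toy["Survived"],
                          columns=toy["Pclass"],
                          margins=True,
                          margins_name="Total")
print(with_totals)
```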


__To get the total proportion of counts in each cell, divide the table by the grand total:__

In [28]:
survived_Pclass/survived_Pclass.loc['ColTotal', 'RowTotal']

,Class1,Class2,Class3,RowTotal
Dead,0.089787,0.108866,0.417508,0.616162
Survival,0.152637,0.097643,0.133558,0.383838
ColTotal,0.242424,0.20651,0.551066,1.0


__To get the proportion of counts along each column (in this case, the survival rate within each passenger class), divide by the column totals:__

In [31]:
survived_Pclass/survived_Pclass.loc['ColTotal']

,Class1,Class2,Class3,RowTotal
Dead,0.37037,0.527174,0.757637,0.616162
Survival,0.62963,0.472826,0.242363,0.383838
ColTotal,1.0,1.0,1.0,1.0


To get the proportion of counts along each row, divide by the row totals. By default, dividing a DataFrame by a Series matches the Series index against the DataFrame's column labels and applies the division across every row. Here we want the opposite: each column divided by the RowTotal column. To align the Series with the row labels instead, use df.div() with the axis set to 0 (or "index"):

In [36]:
survived_Pclass.div(survived_Pclass['RowTotal'], axis=0)

,Class1,Class2,Class3,RowTotal
Dead,0.145719,0.176685,0.677596,1.0
Survival,0.397661,0.254386,0.347953,1.0
ColTotal,0.242424,0.20651,0.551066,1.0
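
To see why `df.div(..., axis=0)` is needed here, it helps to compare it with the plain `/` operator. The sketch below rebuilds the counts from the table above by hand (the variable names `table`, `row_totals`, `misaligned` and `row_props` are illustrative):

```python
import pandas as pd

# Counts copied from the survival-by-class table above (without margins).
table = pd.DataFrame(
    {"Class1": [80, 136], "Class2": [97, 87], "Class3": [372, 119]},
    index=["Dead", "Survival"],
)

# Row totals: a Series indexed by the ROW labels ('Dead', 'Survival').
row_totals = table.sum(axis=1)

# Plain division aligns the Series index with the COLUMN labels.
# None of the labels match, so every cell of the result is NaN.
misaligned = table / row_totals

# div(axis=0) aligns the Series with the row labels instead.
row_props = table.div(row_totals, axis=0)

print(misaligned)
print(row_props)
```

Dividing by `survived_Pclass.loc['ColTotal']` worked with the plain operator earlier because that Series is indexed by the column labels, which is exactly what the default alignment expects.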


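Alternatively, you can transpose the table with df.T to swap rows and columns and then divide as normal. The row labels of the original table become column labels, so the default alignment works; note that the result is transposed as well: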

In [39]:
survived_Pclass.T/survived_Pclass['RowTotal']

,Dead,Survival,ColTotal
Class1,0.145719,0.397661,0.242424
Class2,0.176685,0.254386,0.20651
Class3,0.677596,0.347953,0.551066
RowTotal,1.0,1.0,1.0
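
The three kinds of proportions above (by grand total, by column, by row) can also be requested directly from `crosstab()` through its `normalize` argument. A minimal sketch on made-up data (the `toy` frame is hypothetical), computed without margins:

```python
import pandas as pd

# Hypothetical toy data with the same two columns used above.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 0],
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
})

# Each cell as a share of the grand total.
all_props = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="all")

# Each cell as a share of its column total (survival rate within each class).
col_props = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="columns")

# Each cell as a share of its row total (class mix within each outcome).
row_props = pd.crosstab(toy["Survived"], toy["Pclass"], normalize="index")

print(all_props)
print(col_props)
print(row_props)
```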


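### Higher Dimensional Tables

Passing a list of variables to crosstab() produces a table with more than two dimensions. Here the columns are grouped first by passenger class and then by sex: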

In [47]:
surv_sex_class = pd.crosstab(index=titanic_train['Survived'],
                            columns=[
                                titanic_train['Pclass'],
                                titanic_train['Sex']],
                             margins=True
                            )
surv_sex_class

Pclass,1,1,2,2,3,3,All
Sex,female,male,female,male,female,male,
Survived,,,,,,,
0,3,77,6,91,72,300,549
1,91,45,70,17,72,47,342
All,94,122,76,108,144,347,891


In [54]:
surv_sex_class[1]  # Get the subtable under Pclass 1

Sex,female,male
Survived,,
0,3,77
1,91,45
All,94,122


In [56]:
surv_sex_class[2]['female'] # Get the female column within Pclass 2

Survived
0       6
1      70
All    76
Name: female, dtype: int64
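
Indexing with `[...]` selects from the outer column level first. To slice on the inner level instead (for example, all female columns across classes), `DataFrame.xs()` is one option. The sketch below rebuilds the counts from the table above by hand, without the "All" margins, so it runs on its own; the variable names are illustrative:

```python
import pandas as pd

# Two-level column index: passenger class on the outside, sex on the inside.
columns = pd.MultiIndex.from_product(
    [[1, 2, 3], ["female", "male"]], names=["Pclass", "Sex"]
)

# Counts copied from the three-way table above (margins left out).
table = pd.DataFrame(
    [[3, 77, 6, 91, 72, 300],
     [91, 45, 70, 17, 72, 47]],
    index=pd.Index([0, 1], name="Survived"),
    columns=columns,
)

first_class = table[1]                # sub-table for Pclass 1
second_female = table[(2, "female")]  # single column, addressed by a tuple

# Cross-section on the inner level: female counts for every class.
females = table.xs("female", axis=1, level="Sex")

print(first_class)
print(second_female)
print(females)
```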

In [59]:
# Proportion of survival across each column:
surv_sex_class/surv_sex_class.loc['All']

Pclass,1,1,2,2,3,3,All
Sex,female,male,female,male,female,male,
Survived,,,,,,,
0,0.031915,0.631148,0.078947,0.842593,0.5,0.864553,0.616162
1,0.968085,0.368852,0.921053,0.157407,0.5,0.135447,0.383838
All,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Here we see something quite interesting: over 90% of women in first class and second class survived (about 97% and 92%), but only 50% of women in third class survived. Men in first class also survived at a greater rate (about 37%) than men in second and third class (about 16% and 14%). Passenger class seems to be strongly associated with survival, so it would likely be useful to include as a feature in a predictive model.
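
The survival rates quoted above can be laid out as a compact class-by-sex table, which makes the comparison easier to read. A minimal sketch that rebuilds the counts from the three-way table by hand (the names `counts`, `survival_rate` and `rate_table` are illustrative):

```python
import pandas as pd

# Two-level column index: passenger class on the outside, sex on the inside.
columns = pd.MultiIndex.from_product(
    [[1, 2, 3], ["female", "male"]], names=["Pclass", "Sex"]
)

# Counts copied from the three-way table above (margins left out).
counts = pd.DataFrame(
    [[3, 77, 6, 91, 72, 300],
     [91, 45, 70, 17, 72, 47]],
    index=pd.Index([0, 1], name="Survived"),
    columns=columns,
)

# Survivors divided by the total in each (class, sex) group.
survival_rate = counts.loc[1] / counts.sum()

# Pivot the inner level into columns: one row per class, one column per sex.
rate_table = survival_rate.unstack("Sex")
print(rate_table.round(3))
```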