Discovering relationships between variables is a fundamental goal of data analysis. Frequency tables are a basic tool for exploring data and getting a first idea of those relationships. A frequency table is simply a table that shows the counts of one or more categorical variables.

To explore frequency tables, we'll revisit the Titanic training set. We will start by performing a couple of the same preprocessing steps.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('train.csv')

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
# Let's have a look at the Cabin column of the dataset

df['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64',

* As we can see, the Cabin values do not follow a clean categorical pattern, but we can still categorize them using their first letter.

In [12]:
# Before converting the Cabin column to a categorical variable, cast it to str
# (missing values become the string 'nan')

char_cabin = df['Cabin'].astype(str)
# Create an array of the first letter of each Cabin value

new_cabin = np.array([i[0] for i in char_cabin])
new_cabin

array(['n', 'C', 'n', 'C', 'n', 'n', 'E', 'n', 'n', 'n', 'G', 'C', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'D', 'n', 'A', 'n', 'n',
       'n', 'C', 'n', 'n', 'n', 'B', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'D', 'n', 'B', 'C', 'n', 'n', 'n', 'n', 'n', 'B', 'C', 'n', 'n',
       'n', 'F', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'F', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'C', 'n', 'n',
       'n', 'E', 'n', 'n', 'n', 'A', 'D', 'n', 'n', 'n', 'n', 'D', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'C', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'B', 'n', 'n', 'n', 'n', 'E', 'D', 'n', 'n', 'n', 'F', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'D', 'C', 'n', 'B', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'F', 'n', 'n', 'C', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'E', 'n', 'n',
       'n', 'B', 'n', 'n', 'n', 'A', 'n', 'n', 'C', 'n', 'n', 'n

In [14]:
# Now convert the Cabin column to a categorical type

df['Cabin'] = pd.Categorical(new_cabin)

In [16]:
df.dtypes

PassengerId       int64
Survived          int64
Pclass            int64
Name             object
Sex              object
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin          category
Embarked         object
dtype: object
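
Note that `astype(str)` turns missing values into the string 'nan', which is why passengers without a recorded cabin end up in an 'n' category. As a minimal sketch of one alternative, the pandas `.str` accessor can take the first character while leaving missing values missing, so they can be labelled explicitly. The toy `cabin` series and the 'Unknown' label are assumptions made for illustration and are not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Cabin'] (not the real data)
cabin = pd.Series([np.nan, 'C85', 'C123', np.nan, 'E46', 'F G73'])

# .str[0] takes the first character and leaves NaN untouched,
# so missing cabins can be given an explicit label afterwards
deck = cabin.str[0].fillna('Unknown')
deck_cat = pd.Categorical(deck)

print(deck.tolist())
print(list(deck_cat.categories))
```

Either approach yields one category per deck letter; the difference is only in how missing cabins are named.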

**One-Way Tables**

You can create frequency tables (also known as crosstabs) in pandas with the pd.crosstab() function. It takes one or more array-like objects as the index or columns and constructs a new DataFrame of counts from the supplied arrays. Let's make a one-way table of the Survived variable:

In [18]:
t1 = pd.crosstab(index = df['Survived'],
                columns = 'count')
t1

col_0,count
Survived,Unnamed: 1_level_1
0,549
1,342


In [19]:
type(t1)

pandas.core.frame.DataFrame

In [20]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S


In [21]:
pd.crosstab(index = df['Pclass'],
           columns = 'count')

col_0,count
Pclass,Unnamed: 1_level_1
1,216
2,184
3,491


In [22]:
pd.crosstab(index = df['Sex'],
           columns = 'count')

col_0,count
Sex,Unnamed: 1_level_1
female,314
male,577


In [25]:
pd.crosstab(index = df['SibSp'],
           columns = 'count')

col_0,count
SibSp,Unnamed: 1_level_1
0,608
1,209
2,28
3,16
4,18
5,5
8,7


In [29]:
cabin_tab = pd.crosstab(index = df['Cabin'],
           columns = 'count')
cabin_tab

col_0,count
Cabin,Unnamed: 1_level_1
A,15
B,47
C,59
D,33
E,32
F,13
G,4
T,1
n,687


We can also call the value_counts() method on a pandas Series (a single column) to check its counts:

In [27]:
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64
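
One difference from `pd.crosstab()` is worth keeping in mind: `value_counts()` orders its result by frequency rather than by category label, and by default it leaves out missing values. The sketch below illustrates both points on a hypothetical toy column (the values are invented for the example, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'], name='Embarked')

by_freq = embarked.value_counts()                # most frequent first, NaN dropped
by_label = embarked.value_counts().sort_index()  # ordered by label, like a crosstab
with_missing = embarked.value_counts(dropna=False)  # NaN counted as its own entry

print(by_freq)
print(by_label)
print(with_missing)
```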

Even these simple one-way tables give us some useful insight: we immediately get a sense of the distribution of records across the categories. For instance, we see that males outnumbered females by a significant margin and that there were more third class passengers than first and second class passengers combined.

One of the most useful aspects of frequency tables is that they allow us to extract the proportion of the data that belongs to each category. With a one-way table, we can do this by dividing each table value by the total number of records in the table:

In [30]:
cabin_tab/cabin_tab.sum()

col_0,count
Cabin,Unnamed: 1_level_1
A,0.016835
B,0.05275
C,0.066218
D,0.037037
E,0.035915
F,0.01459
G,0.004489
T,0.001122
n,0.771044


In [33]:
# We can express the above result as percentages

cabin_tab_dis = (cabin_tab/cabin_tab.sum())*100
cabin_tab_dis

col_0,count
Cabin,Unnamed: 1_level_1
A,1.683502
B,5.274972
C,6.621773
D,3.703704
E,3.59147
F,1.459035
G,0.448934
T,0.112233
n,77.104377


In [37]:
cabin_tab_dis.rename(columns={'count': 'Percent'}, inplace=True)

In [38]:
cabin_tab_dis

col_0,Percent
Cabin,Unnamed: 1_level_1
A,1.683502
B,5.274972
C,6.621773
D,3.703704
E,3.59147
F,1.459035
G,0.448934
T,0.112233
n,77.104377
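
As an alternative to dividing by the total by hand, `pd.crosstab()` accepts a `normalize` argument that returns proportions directly. The following is a minimal sketch on a hypothetical toy column standing in for `df['Cabin']`; the values are assumptions chosen only to keep the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical toy column standing in for df['Cabin']
deck = pd.Series(['C', 'n', 'n', 'B', 'n', 'C', 'n', 'n'], name='Cabin')

# normalize=True divides every cell by the grand total
props = pd.crosstab(index=deck, columns='count', normalize=True)

# Scale to percentages and rename the column accordingly
percent = (props * 100).rename(columns={'count': 'Percent'})
print(percent)
```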


**Two-Way Tables**

Two-way frequency tables, also called contingency tables, are tables of counts with two dimensions where each dimension is a different variable. Two-way tables can give you insight into the relationship between two variables. To create a two-way table, pass two variables to the pd.crosstab() function instead of one:

In [42]:
# table of survival vs sex

sur_sex = pd.crosstab(index = df['Survived'],
                     columns = df['Sex'])
sur_sex.index = ['died','survived']
sur_sex

Sex,female,male
died,81,468
survived,233,109
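
Raw counts are hard to compare when the groups differ in size, so it can help to convert a two-way table to proportions. The sketch below rebuilds the survival-by-sex counts shown above as a small DataFrame (so it runs without the CSV file) and divides by column totals and by row totals; the variable names `col_props` and `row_props` are illustrative:

```python
import pandas as pd

# Counts copied from the survival-by-sex table above
sur_sex = pd.DataFrame({'female': [81, 233], 'male': [468, 109]},
                       index=['died', 'survived'])

# Divide each column by its total: outcome proportions within each sex
col_props = sur_sex / sur_sex.sum(axis=0)

# Divide each row by its total: sex split within each outcome
row_props = sur_sex.div(sur_sex.sum(axis=1), axis=0)

print(col_props.round(3))
print(row_props.round(3))
```

Read by column, roughly 74% of women survived compared with roughly 19% of men.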


In [48]:
# Table of survival vs passenger class

sur_pclass = pd.crosstab(index = df['Survived'],
           columns  = df['Pclass'])

sur_pclass.index = ['died','survived']
sur_pclass

Pclass,1,2,3
died,80,97,372
survived,136,87,119


We can get the marginal counts (totals for each row and column) by including the argument margins=True:

In [56]:
sur_class = pd.crosstab(index = df['Survived'],
           columns = df['Pclass'],
           margins = True)

sur_class.index = ['died','survived','col_total']
sur_class.columns = ['class1','class2','class3','row_total']

sur_class


Unnamed: 0,class1,class2,class3,row_total
died,80,97,372,549
survived,136,87,119,342
col_total,216,184,491,891
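
Assigning a whole new list to `.index` or `.columns` relies on knowing the exact order of the labels. A slightly more defensive sketch is shown below: it names the margin with the `margins_name` argument of `pd.crosstab()` and relabels rows with a dictionary passed to `rename()`. The toy data and the label 'total' are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic set
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [1, 1, 2, 3, 3, 3],
})

# margins_name sets the label used for both the total row and total column
tab = pd.crosstab(index=toy['Survived'], columns=toy['Pclass'],
                  margins=True, margins_name='total')

# Relabel rows by mapping old labels to new ones instead of by position
tab = tab.rename(index={0: 'died', 1: 'survived'})

# The margin row should equal the column sums of the interior cells
interior = tab.drop(index='total', columns='total')
print(tab)
print(interior.sum(axis=0).tolist())
```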


**Higher Dimensional Tables**

The crosstab() function lets you create tables out of more than two categories. Higher dimensional tables can be a little confusing to look at, but they can also yield finer-grained insight into interactions between multiple variables. Let's create a 3-way table inspecting survival, sex and passenger class:

In [60]:
sur_sex_class = pd.crosstab(index = df['Survived'],
                           columns = [df['Pclass'],df['Sex']],
                           margins = True)
sur_sex_class

Pclass,1,1,2,2,3,3,All
Sex,female,male,female,male,female,male,Unnamed: 7_level_1
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,3,77,6,91,72,300,549
1,91,45,70,17,72,47,342
All,94,122,76,108,144,347,891


Notice that by passing a second variable to the columns argument, the resulting table has columns categorized by both Pclass and Sex. Indexing the table by a value of the outermost column level (Pclass) returns a whole section of the table instead of an individual column:

In [62]:
sur_sex_class[2]  # get the sub-table for Pclass 2

Sex,female,male
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,6,91
1,70,17
All,76,108


In [64]:
sur_sex_class[2]['female']  # get the female column within Pclass 2

Survived
0       6
1      70
All    76
Name: female, dtype: int64
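
Two other ways of selecting from hierarchical columns may be useful here: a tuple selects a single (Pclass, Sex) column in one step, and `DataFrame.xs()` selects on an inner level across all values of the outer level. This is a sketch on hypothetical toy rows, built without margins to keep the column labels uniform:

```python
import pandas as pd

# Hypothetical toy data with every (Pclass, Sex) combination present
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [1, 1, 2, 2, 3, 3],
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
})

tab = pd.crosstab(index=toy['Survived'],
                  columns=[toy['Pclass'], toy['Sex']])

# A tuple picks one column from the two-level column index
class2_female = tab[(2, 'female')]

# xs() on the inner level returns the female column for every class
female_by_class = tab.xs('female', axis=1, level='Sex')

print(class2_female)
print(female_by_class)
```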

In [65]:
# To get each cell as a fraction of its column total, divide every column by the 'All' row

sur_sex_class/sur_sex_class.loc['All']

Pclass,1,1,2,2,3,3,All
Sex,female,male,female,male,female,male,Unnamed: 7_level_1
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,0.031915,0.631148,0.078947,0.842593,0.5,0.864553,0.616162
1,0.968085,0.368852,0.921053,0.157407,0.5,0.135447,0.383838
All,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [66]:
# Convert the above result to percentages

(sur_sex_class/sur_sex_class.loc['All'])*100

Pclass,1,1,2,2,3,3,All
Sex,female,male,female,male,female,male,Unnamed: 7_level_1
Survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
0,3.191489,63.114754,7.894737,84.259259,50.0,86.455331,61.616162
1,96.808511,36.885246,92.105263,15.740741,50.0,13.544669,38.383838
All,100.0,100.0,100.0,100.0,100.0,100.0,100.0


* Here we see something quite interesting: over 90% of women in first class and second class survived, but only 50% of women in third class survived. Men in first class also survived at a greater rate than men in lower classes. Passenger class seems to have a significant impact on survival, so it would likely be useful to include as a feature in a predictive model.
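
Survival rates per group can also be cross-checked with a different technique, `groupby()` followed by `mean()`, because the mean of a 0/1 column is the proportion of ones. The sketch below uses hypothetical toy rows (not the Titanic data) and compares the result against `pd.crosstab()` with `normalize='columns'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 1],
    'Sex':      ['male', 'female', 'female', 'female', 'male',
                 'male', 'female', 'female', 'male', 'male'],
})

# Mean of a 0/1 column within each group = survival rate of that group
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean()

# Same quantity from a crosstab: column-normalized, row for Survived == 1
tab = pd.crosstab(index=toy['Survived'],
                  columns=[toy['Pclass'], toy['Sex']],
                  normalize='columns')
from_crosstab = tab.loc[1]

print(rates)
print(np.allclose(rates.to_numpy(), from_crosstab.to_numpy()))
```

The crosstab keeps both outcomes visible, while the groupby form returns only the rate, which can be more convenient when building features.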