# Selecting dataframe rows and columns

This notebook demonstrates some practical techniques for working with dataframes.  Recall that the rows of a dataframe usually represent observations (elements of a sample of data), and the columns of a dataframe usually represent variables.  We illustrate selection techniques for both rows and columns of a dataframe, using the NHANES data introduced earlier in the course.

In [51]:
import numpy as np
import pandas as pd

Read the NHANES 2015-2016 data from a file into Pandas:

In [52]:
df = pd.read_csv("nhanes_2015_2016.csv")

To get oriented, we can display the first few rows of the dataframe:

In [53]:
df.head()

,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In a Pandas dataframe, every column has a name and a dtype.  The 'dtypes' attribute reports the dtype of each column, indexed by column name:

In [54]:
df.dtypes

SEQN          int64
ALQ101      float64
ALQ110      float64
ALQ130      float64
SMQ020        int64
RIAGENDR      int64
RIDAGEYR      int64
RIDRETH1      int64
DMDCITZN    float64
DMDEDUC2    float64
DMDMARTL    float64
DMDHHSIZ      int64
WTINT2YR    float64
SDMVPSU       int64
SDMVSTRA      int64
INDFMPIR    float64
BPXSY1      float64
BPXDI1      float64
BPXSY2      float64
BPXDI2      float64
BMXWT       float64
BMXHT       float64
BMXBMI      float64
BMXLEG      float64
BMXARML     float64
BMXARMC     float64
BMXWAIST    float64
HIQ210      float64
dtype: object
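
Several integer-coded variables above (for example ALQ101) are stored as float64, which is consistent with those columns containing missing values: by default, Pandas stores an integer column that includes NaN as float64. The sketch below illustrates this on a small made-up dataframe; the values are hypothetical and only the column names are borrowed from NHANES.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names borrowed from NHANES for illustration only.
toy = pd.DataFrame({
    "SEQN": [1, 2, 3],             # complete integer column
    "ALQ101": [1, np.nan, 2],      # integer codes with one missing value
    "BMXWT": [94.8, 90.4, 83.4],   # continuous measurement
})

# dtypes is an attribute (no parentheses); it returns a Series indexed by column name.
toy_dtypes = toy.dtypes
print(toy_dtypes)

# Count missing values per column to see which columns contain NaN.
missing_counts = toy.isna().sum()
print(missing_counts)
```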

## Selecting columns

Now suppose that we wish to retain only the body measures columns (columns with "BMX" in the name). 

One option is to directly type the names of the columns that we want to retain and create a *literal list*.

In [55]:
keep = ['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST']

Another option is to write a list comprehension to "filter" the column names according to a selection criterion:

In [56]:
keep2 = [column for column in df.columns if 'BMX' in column]
print(keep2)

['BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC', 'BMXWAIST']


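Here is a slightly stricter way to accomplish the same goal, which keeps only the columns whose names begin with "BMX" (rather than containing "BMX" anywhere in the name):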

In [57]:
keep3 = [column for column in df.columns if column.startswith("BMX")]
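
For the NHANES columns the two filters give the same list, but they can differ in general. The sketch below uses a made-up set of column names (the column "XBMXFLAG" is hypothetical) to show the difference, along with two built-in alternatives that return the selected columns directly.

```python
import pandas as pd

# Hypothetical columns: "XBMXFLAG" contains "BMX" but does not start with it.
toy = pd.DataFrame(columns=["SEQN", "BMXWT", "BMXHT", "XBMXFLAG"])

contains = [c for c in toy.columns if "BMX" in c]
starts = [c for c in toy.columns if c.startswith("BMX")]
print(contains)  # includes XBMXFLAG
print(starts)    # excludes XBMXFLAG

# Built-in alternatives that return the sub-dataframe directly.
by_regex = toy.filter(regex="^BMX")                      # regex anchored at the start of the name
by_mask = toy.loc[:, toy.columns.str.startswith("BMX")]  # boolean mask over the columns
```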

Below are two equivalent ways to retain only the columns named in the list 'keep':

In [58]:
df_BMX = df[keep]
df_BMX2 = df.loc[:, keep]

We can confirm that the two results from the preceding cell are equal:

In [59]:
pd.testing.assert_frame_equal(df_BMX, df_BMX2)
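
The cell above produces no output because `assert_frame_equal` returns silently when the two dataframes match and raises an `AssertionError` otherwise. A minimal sketch of both outcomes, using a small made-up dataframe (the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"BMXWT": [94.8, 90.4], "BMXHT": [184.5, 171.4], "SEQN": [1, 2]})
keep = ["BMXWT", "BMXHT"]

a = toy[keep]
b = toy.loc[:, keep]

# Passes silently (returns None) when the frames match.
pd.testing.assert_frame_equal(a, b)

# Raises AssertionError when they differ, here because the column order differs.
try:
    pd.testing.assert_frame_equal(a, toy[["BMXHT", "BMXWT"]])
    frames_differ = False
except AssertionError:
    frames_differ = True
print(frames_differ)
```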

Now we can inspect our results.

In [60]:
df_BMX.head()

,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST
0,94.8,184.5,27.8,43.3,43.6,35.9,101.1
1,90.4,171.4,30.8,38.0,40.0,33.2,107.9
2,83.4,170.1,28.8,35.6,37.0,31.0,116.5
3,109.8,160.9,42.4,38.5,37.7,38.3,110.1
4,55.2,164.9,20.3,37.4,36.0,27.2,80.4


## Selecting rows

There are many situations where we want to select a subset of the rows in a dataframe, based on one or more conditions.

For example, suppose we want a dataframe that consists only of rows where the value of 'BMXWAIST' is greater than the median BMXWAIST computed over all subjects.  We begin by calculating this median 'BMXWAIST'.

In [61]:
waist_median = df_BMX['BMXWAIST'].median()

In [62]:
waist_median

98.3
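
If a column contains missing values, the `median` method skips them by default. The following sketch, with made-up waist values, shows the default behavior next to `skipna=False`, which lets NaN propagate instead.

```python
import numpy as np
import pandas as pd

waist = pd.Series([101.1, 107.9, np.nan, 80.4])  # hypothetical values

median_default = waist.median()             # NaN is skipped by default
median_strict = waist.median(skipna=False)  # NaN propagates to the result

print(median_default, median_strict)
```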

Now we can select only the rows where the value of BMXWAIST is greater than the median:

In [63]:
df_BMX2 = df_BMX[df_BMX['BMXWAIST'] > waist_median]
print(df_BMX.shape)
print(df_BMX2.shape)
df_BMX2.head()

(5735, 7)
(2677, 7)


,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST
0,94.8,184.5,27.8,43.3,43.6,35.9,101.1
1,90.4,171.4,30.8,38.0,40.0,33.2,107.9
2,83.4,170.1,28.8,35.6,37.0,31.0,116.5
3,109.8,160.9,42.4,38.5,37.7,38.3,110.1
9,108.3,179.4,33.6,46.0,44.1,38.5,116.0
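
The expression inside the square brackets is a boolean Series (a "mask") with one True/False value per row. Two details are worth noting: a comparison involving NaN evaluates to False, so rows with a missing value are dropped, and the selected rows keep their original index labels, which is why the index above jumps from 3 to 9. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"BMXWAIST": [101.1, 80.4, np.nan, 116.5]})  # hypothetical values
cutoff = 98.3

mask = toy["BMXWAIST"] > cutoff  # boolean Series aligned with toy's index
print(mask.tolist())             # the NaN row compares as False

selected = toy[mask]             # keeps rows where mask is True; index labels are preserved
print(selected.index.tolist())
```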


Now let's add another condition, that 'BMXLEG' must be less than 32:

In [64]:
condition1 = df_BMX['BMXWAIST'] > waist_median
condition2 = df_BMX['BMXLEG'] < 32
df_BMX2 = df_BMX[condition1 & condition2]
print(df_BMX2.shape)
df_BMX2.head()

(163, 7)


,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST
15,80.5,150.8,35.4,31.6,32.7,33.7,113.5
27,75.6,145.2,35.9,31.0,33.1,36.0,108.0
39,63.7,147.9,29.1,26.0,34.0,31.5,110.0
52,105.9,157.7,42.6,29.2,35.0,40.7,129.1
55,77.5,148.3,35.2,30.5,34.0,34.4,107.6
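
The conditions can also be written inline rather than stored in variables first. In that case each comparison should be wrapped in parentheses, because `&` binds more tightly than `>` and `<` in Python. The sketch below, on made-up data, also shows `|` (or) and `~` (not).

```python
import pandas as pd

# Hypothetical values for illustration.
toy = pd.DataFrame({
    "BMXWAIST": [113.5, 95.0, 108.0, 120.0],
    "BMXLEG": [31.6, 30.0, 35.0, 26.0],
})
cutoff = 98.3

# Parentheses are required around each comparison when combining with & or |.
both = toy[(toy["BMXWAIST"] > cutoff) & (toy["BMXLEG"] < 32)]
either = toy[(toy["BMXWAIST"] > cutoff) | (toy["BMXLEG"] < 32)]
not_short = toy[~(toy["BMXLEG"] < 32)]  # ~ negates a boolean mask

print(both.index.tolist(), either.index.tolist(), not_short.index.tolist())
```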


The [query](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) method in Pandas is a powerful and flexible way to select dataframe rows.  We can use 'query' to carry out the selection implemented above as follows:

In [65]:
df_BMX3 = df_BMX.query('BMXWAIST > @waist_median & BMXLEG < 32')

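Note that when calling the query method, the entire selection expression is a string enclosed in quotation marks (as always in Python, strings can be enclosed in either single or double quotation marks).  Within this string, column names are written as bare names, and variables from the surrounding Python environment must be prefixed with "@" so that they can be distinguished from column names.  The reason for these requirements is that the expression is not evaluated directly by the Python interpreter; it is parsed by Pandas and passed to [pandas.eval](https://pandas.pydata.org/docs/reference/api/pandas.eval.html), which by default uses the [numexpr](https://github.com/pydata/numexpr) library for efficiency when it is installed.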
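
A few further features of query strings may be useful. As far as the Pandas documentation describes, `and`/`or` can be used in place of `&`/`|` (and inside a query string `&` and `|` take the precedence of `and` and `or`, so the comparisons need no parentheses), and backticks can quote column names that are not valid Python identifiers. The sketch below uses made-up data; the column name "leg length" is hypothetical and chosen only to show the backtick syntax.

```python
import pandas as pd

toy = pd.DataFrame({
    "BMXWAIST": [113.5, 95.0, 108.0, 120.0],
    "leg length": [31.6, 30.0, 35.0, 26.0],  # hypothetical name containing a space
})
cutoff = 98.3

# `and` replaces &; backticks quote a column name that is not a valid identifier.
q1 = toy.query("BMXWAIST > @cutoff and `leg length` < 32")

# Equivalent boolean-mask selection for comparison.
q2 = toy[(toy["BMXWAIST"] > cutoff) & (toy["leg length"] < 32)]

pd.testing.assert_frame_equal(q1, q2)
print(q1.index.tolist())
```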

We can confirm that the two approaches give the same result:

In [66]:
pd.testing.assert_frame_equal(df_BMX2, df_BMX3)