# Read data files and explore their contents

## Reading a data file

We will be working with the NHANES (National Health and Nutrition Examination Survey) data from the 2015-2016 wave. The raw data for this study are available here: [NHANES 2015-2016](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015)

As in many large studies, the NHANES data are spread across multiple files.  The NHANES files are stored in SAS transport (.xpt) format. This is a somewhat obscure format, and while Pandas is perfectly capable of reading the NHANES data directly from the .xpt files, accomplishing this task is a more advanced topic than we want to get into here.  Therefore, we have prepared some merged datasets in .csv format.

In [1]:
import pandas as pd

df = pd.read_csv('csv/nhanes_2015_2016.csv')

To confirm that we have actually obtained the data that we are expecting, we can display the shape (number of rows and columns) of the data frame.

In [2]:
df.shape

(5735, 28)

Based on what we see above, the data set being read here has 5735 rows, corresponding to 5735 people in this wave of the NHANES study, and 28 columns, corresponding to 28 variables in this particular dataset.  Note that NHANES collects thousands of variables on each study subject, but here we are working with a reduced file that contains a limited number of variables.
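As a minimal sketch of the same check on data whose size is known in advance, the example below reads a tiny CSV from an in-memory string. The three rows and the name `toy` are made up for illustration and are not part of the NHANES file.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the real file (illustrative values only).
toy_csv = io.StringIO(
    "SEQN,RIAGENDR,BMXWT\n"
    "1,1,94.8\n"
    "2,1,90.4\n"
    "3,2,83.4\n"
)
toy = pd.read_csv(toy_csv)

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

Here the header line supplies the column names, so the three data lines give three rows and three columns.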

## Exploring the contents of a data set

Pandas has a number of basic ways to understand what is in a data set. For example, above we used the `.shape` attribute to determine the numbers of rows and columns in a data set. The columns in a Pandas data frame have names. To see the names, use the `.columns` attribute:

In [3]:
df.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210'],
      dtype='object')

These names correspond to variables in the NHANES study. For example, "SEQN" is a unique identifier for one person, and "BMXWT" is the subject's weight in kilograms ("BMX" is the NHANES prefix for body measurements). The variables in the NHANES data set are documented in a set of "codebooks" that are available on-line.  The codebooks for the 2015-2016 wave of NHANES can be found at the following link: [Codebooks NHANES 2015-2016](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2015)

For convenience, direct links to some of the codebooks are included below:

- [Demographics codebook](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm)

- [Body measures codebook](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm)

- [Blood pressure codebook](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.htm)

- [Alcohol questionnaire codebook](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/ALQ_I.htm)

- [Smoking questionnaire codebook](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/SMQ_I.htm)
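Because related NHANES variables share a prefix (such as "BMX" for body measurements), the column names can be used to pick out a group of variables. The following is a small sketch on a hypothetical empty frame that carries only a handful of NHANES-style column names; the names `toy` and `bmx_cols` are arbitrary.

```python
import pandas as pd

# Hypothetical miniature frame with a few NHANES-style column names and no rows.
toy = pd.DataFrame(columns=["SEQN", "RIAGENDR", "BMXWT", "BMXHT", "BPXSY1"])

# Keep only the body-measurement columns, identified by the "BMX" prefix.
bmx_cols = [name for name in toy.columns if name.startswith("BMX")]
print(bmx_cols)

# Membership tests work directly on the column index.
print("SEQN" in toy.columns)
```

The same list comprehension should work on the full data frame, where it would select the body-measurement columns present in the file.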

Every variable in a Pandas data frame has a data type.  There are many different data types, but most commonly you will encounter floating point values (real numbers), integers, strings (text), and date/time values.  When Pandas reads a text/csv file, it infers the data type of each column from the values it finds in that column.  Usually it selects an appropriate type, but occasionally it does not.  To confirm that the data types are consistent with what the variables represent, inspect the `.dtypes` attribute of the data frame.

In [4]:
df.dtypes

SEQN          int64
ALQ101      float64
ALQ110      float64
ALQ130      float64
SMQ020        int64
RIAGENDR      int64
RIDAGEYR      int64
RIDRETH1      int64
DMDCITZN    float64
DMDEDUC2    float64
DMDMARTL    float64
DMDHHSIZ      int64
WTINT2YR    float64
SDMVPSU       int64
SDMVSTRA      int64
INDFMPIR    float64
BPXSY1      float64
BPXDI1      float64
BPXSY2      float64
BPXDI2      float64
BMXWT       float64
BMXHT       float64
BMXBMI      float64
BMXLEG      float64
BMXARML     float64
BMXARMC     float64
BMXWAIST    float64
HIQ210      float64
dtype: object

As we see here, all of the variables in this file have a floating point or integer data type. Unlike many data sets, NHANES does not use any text values in its data. For example, while many datasets would use text labels like "F" or "M" to denote a subject's gender, this information is represented in NHANES with integer codes.  The variable "RIAGENDR" contains each subject's gender, with male gender coded as "1" and female gender coded as "2". The "RIAGENDR" variable is part of the demographics component of NHANES, so this coding can be found in the demographics codebook.
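When integer codes are hard to read, one option is to translate them into text labels. The sketch below does this with `Series.map` on made-up values; the new column name "RIAGENDRx" and the label strings are arbitrary choices for this illustration, and only the 1/2 coding comes from the text above.

```python
import pandas as pd

# Made-up gender codes using the NHANES convention (1 = male, 2 = female).
toy = pd.DataFrame({"RIAGENDR": [1, 2, 2, 1]})

# Store the labels in a new column so the original codes are preserved.
toy["RIAGENDRx"] = toy["RIAGENDR"].map({1: "Male", 2: "Female"})
print(toy["RIAGENDRx"].tolist())
```

Note that `map` returns a missing value for any code that is absent from the dictionary, so the dictionary should cover every code listed in the codebook.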

Variables like "BMXWT", which represent a quantitative measurement, will typically be stored as floating point data values.
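Some variables that hold integer codes, such as "DMDEDUC2", nevertheless appear above with a floating point type. A likely explanation is that, by default, Pandas marks missing entries with `NaN`, which is a floating point value, so an integer column with any blanks is read as `float64`. The sketch below illustrates this on a made-up in-memory CSV.

```python
import io
import pandas as pd

# Made-up data: both columns hold integer codes, but "DMDEDUC2" has one blank entry.
toy_csv = io.StringIO(
    "RIAGENDR,DMDEDUC2\n"
    "1,5\n"
    "2,\n"
    "2,3\n"
)
toy = pd.read_csv(toy_csv)

# The complete column is read as an integer type; the column with a blank becomes float.
print(toy.dtypes)
```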

### Slicing a data set

As discussed above, a Pandas data frame is a rectangular data table, in which the rows represent cases and the columns represent variables. One common manipulation of a data frame is to extract the data for one case or for one variable. There are several ways to do this, as shown below.

To extract all the values for one variable, the following four approaches are equivalent.  In the next lines of code, we are assigning the data from one column of the data frame into new variables.  The first three approaches access the variable by name ("DMDEDUC2" here is an NHANES variable containing a person's educational attainment).  The fourth approach accesses the variable by position. Note that "DMDEDUC2" is in position `9` of the `df.columns` array shown above; remember that Python counts starting at position zero.

In [5]:
a = df['DMDEDUC2']
b = df.loc[:, 'DMDEDUC2']
c = df.DMDEDUC2
d = df.iloc[:, 9]

One reason to slice a variable out of a data frame is so that we can then pass it into a function. For example, we can find the maximum value over all values in a column using any one of the following four lines of code:

In [6]:
print(a.max())
print(b.max())
print(c.max())
print(d.max())

9.0
9.0
9.0
9.0
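To check directly that the four ways of selecting a column give the same result, the Series can be compared with the `.equals` method. This is a minimal sketch on a made-up two-column frame, so the position passed to `.iloc` is 1 rather than 9.

```python
import pandas as pd

# Made-up values; only the column names follow NHANES conventions.
toy = pd.DataFrame({"SEQN": [1, 2, 3], "DMDEDUC2": [5.0, 3.0, 9.0]})

a = toy["DMDEDUC2"]
b = toy.loc[:, "DMDEDUC2"]
c = toy.DMDEDUC2
d = toy.iloc[:, 1]  # "DMDEDUC2" is at position 1 in this toy frame

# .equals compares the values, the dtype, and the index of two Series.
all_equal = a.equals(b) and a.equals(c) and a.equals(d)
print(all_equal)
```

The attribute form (`toy.DMDEDUC2`) only works when the column name is a valid Python identifier that does not clash with an existing data frame attribute or method, so the bracket form is the safer general choice.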


Every value has a type, and the type information can be obtained using the `type()` function.  This can be useful, for example, if you are looking for the documentation associated with some value, but you do not know what the value's type is.

Here we see that the variable `df` has type `DataFrame`, while one column of `df` has type `Series`. As noted above, a Series is a Pandas data structure for holding a single column (or row) of data.

In [7]:
print(type(df))  # The type of the variable
print(type(df.DMDEDUC2))  # The type of one column of the data frame
print(type(df.iloc[2, :]))  # The type of one row of the data frame

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


It may also be useful to slice a row (case) out of a data frame. Just as a data frame's columns have names, the rows also have names, which are called the "index".  However, many data sets do not have meaningful row names, so it is more common to extract a row of a data frame using its position. The `.iloc[]` indexer slices rows or columns from a data frame by position (counting from 0). The following line of code extracts the row at position 3 from the data set (which is the fourth row when counting from one).

In [8]:
x = df.iloc[3, :]

Another important data frame manipulation is to extract a contiguous block of rows or columns from the data set.  Below we slice by position, in the first case taking row positions 3 and 4 (counting from 0, which are rows 4 and 5 counting from 1), and in the second case taking columns 2, 3, and 4 (columns 3, 4, 5 if counting from 1).

In [9]:
y = df.iloc[3:5, :]
z = df.iloc[:, 2:5]
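One detail worth keeping in mind is that position-based slices with `.iloc` exclude the end position, as in ordinary Python slicing, whereas label-based slices with `.loc` include the end label. The sketch below shows the difference on a made-up frame whose row labels are the default 0, 1, 2, and so on.

```python
import pandas as pd

# Made-up measurements; the default row labels are the integers 0 through 5.
toy = pd.DataFrame({
    "SEQN": [10, 11, 12, 13, 14, 15],
    "BMXWT": [94.8, 90.4, 83.4, 109.8, 55.2, 64.4],
    "BMXHT": [184.5, 171.4, 170.1, 160.9, 164.9, 150.0],
})

by_position = toy.iloc[3:5, :]  # positions 3 and 4; end position 5 is excluded
by_label = toy.loc[3:5, :]      # labels 3, 4, and 5; end label 5 is included

print(by_position.shape)
print(by_label.shape)
```

With these six rows, the position-based slice has two rows and the label-based slice has three.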

### Missing values

When reading a dataset using Pandas, there is a set of values including `NA`, `NULL`, and `NaN` that are taken by default to represent a missing value. The full list of default missing value codes is in the `read_csv` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).  This document also explains how to change the way that `read_csv` decides whether a variable's value is missing.
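As a small sketch of customizing the missing value codes, the `na_values` argument of `read_csv` adds extra strings to the default list. The sentinel `-999` below is invented for this example and is not an NHANES code, and the CSV content is made up.

```python
import io
import pandas as pd

# Made-up CSV with one default missing code ("NA") and one custom sentinel ("-999").
toy_csv = io.StringIO(
    "SEQN,DMDEDUC2\n"
    "1,5\n"
    "2,NA\n"
    "3,-999\n"
)

# Treat "-999" as missing in addition to the default codes such as "NA".
toy = pd.read_csv(toy_csv, na_values=["-999"])
print(toy["DMDEDUC2"].isnull().sum())
```

Both the `NA` entry and the `-999` entry are read as missing here, so the count is 2.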

Pandas has functions called `isnull` and `notnull` that can be used to identify where the missing and non-missing values are located in a data frame. Below we use these functions to count the number of missing and non-missing "DMDEDUC2" values.

In [10]:
print(pd.isnull(df['DMDEDUC2']).sum())
print(pd.notnull(df['DMDEDUC2']).sum())

261
5474


As an aside, note that there may be a variety of distinct forms of missingness in a variable, and in some cases it is important to keep these values distinct. For example, in the case of the "DMDEDUC2" variable, in addition to the blank or `NA` values that Pandas considers to be missing, three people responded "don't know" (code value 9).  In many analyses, the "don't know" values will also be treated as missing, but at this point we are considering "don't know" to be a distinct category of observed response.
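If a later analysis does need to treat "don't know" as missing, one possible approach is to recode the value 9 to `NaN` in a separate column, so that the original responses remain available. The sketch below uses made-up values, and the column name "DMDEDUC2_recoded" is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up responses: two "don't know" codes (9) and one value that is already missing.
toy = pd.DataFrame({"DMDEDUC2": [5.0, 9.0, np.nan, 3.0, 9.0]})

# Recode 9 to NaN in a new column, leaving the original column untouched.
toy["DMDEDUC2_recoded"] = toy["DMDEDUC2"].replace({9: np.nan})

print(toy["DMDEDUC2"].isnull().sum())
print(toy["DMDEDUC2_recoded"].isnull().sum())
```

In this toy data the original column has one missing value and the recoded column has three.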