Data Cleaning and Preparing
===========================

<a id="1"></a>
1: Index
---------

- [1: Index](#1)
- [2: Types of Bad Data](#2)
- [3: Exploring a Dataset](#3)
  - [3.0: Setup](#3.0)
  - [3.1: Exploring a Subset](#3.1)
    - [3.1.1: Browse Data and Rename Columns](#3.1.1)
    - [3.1.2: Clean Globally](#3.1.2)
    - [3.1.3: Clean Military](#3.1.3)
    - [3.1.4: Clean Citizenship](#3.1.4)
    - [3.1.5: Clean Gender](#3.1.5)
    - [3.1.6: Save CSV](#3.1.6)
- [4: Dealing with Missing Data](#4)
- [5: Dealing with Duplicates](#5)
  - [5.1: Duplicated Rows](#5.1)
  - [5.2: Duplicated Columns](#5.2)
- [6: Infeasible and Extreme Data](#6)
  - [6.1: Impossible Values](#6.1)
  - [6.2: Categorical Errors](#6.2)
  - [6.3: Extreme/Out of Range Values](#6.3)
  - [6.4: Outliers](#6.4)
  - [6.5: Saturated Data](#6.5)
- [7: Automating Data Preparation](#7)
  - [7.1: Testing](#7.1)
  - [7.2: Automation](#7.2)
- [8: Feature Selection](#8)

<a id="2"></a>
2: Types of Bad Data
--------------------

- Formatting errors (e.g. extra whitespace)
- Value errors (e.g. misspellings; note that Treehouse classifies misspellings as
  "Formatting errors")
- Incorrect data type (e.g. numbers stored as strings)
- Nonsensical data entries (e.g. age < 0)
- Duplicate entries (duplicate rows or columns)
- Missing data (e.g. NaN)
- Saturated data (e.g. value beyond a measurement limit)
- Systematic and individual errors (error affects many entries or only one)
- Confidential information (e.g. personally identifying or private information)
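
As a rough illustration of how a few of the categories above can be detected, here is a minimal sketch on a hypothetical toy table (the column names `id`, `age`, and `answer` are made up for this example and are not part of the NHANES data used below):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing several kinds of bad data at once
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'age': [34, -5, -5, np.nan],
    'answer': ['Yes', ' yes ', ' yes ', 'Y'],
})

n_missing = int(toy['age'].isna().sum())        # missing data (NaN)
n_negative = int((toy['age'] < 0).sum())        # nonsensical entries (age < 0)
n_duplicate_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

# Formatting errors: values that change when surrounding whitespace is stripped
has_padding = toy['answer'] != toy['answer'].str.strip()
n_padded = int(has_padding.sum())

print(n_missing, n_negative, n_duplicate_rows, n_padded)
```

Counting each problem before fixing anything gives a baseline to compare against after cleaning.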

<a id="3"> </a>
3: Exploring a Dataset
----------------------

- [3.0: Setup](#3.0)
- [3.1: Exploring a Subset](#3.1)
  - [3.1.1: Browse Data and Rename Columns](#3.1.1)
  - [3.1.2: Clean Globally](#3.1.2)
  - [3.1.3: Clean Military](#3.1.3)
  - [3.1.4: Clean Citizenship](#3.1.4)
  - [3.1.5: Clean Gender](#3.1.5)
  - [3.1.6: Save CSV](#3.1.6)

<a id="3.0"> </a>
### 3.0: Setup ###

In [1]:
import os
import numpy as np
import pandas as pd

DATAPATH = os.path.join('thirdpartydata', 'cleaning-preparing-s1v4')

bodymeasures_filename = os.path.join(DATAPATH, 'BodyMeasures.csv')
demographics_filename = os.path.join(DATAPATH, 'Demographics.csv')
occupations_filename = os.path.join(DATAPATH, 'Occupations.csv')

In [2]:
# codebook URL: https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BMX.htm
bodymeasures = pd.read_csv(bodymeasures_filename)
bodymeasures.head()

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE,RIDAGEYR
0,1,289.0,1.0,,12.5,3.0,93.2,,,,...,2.0,,,,,,,,,2
1,2,376.0,1.0,,75.4,,,,,,...,2.0,,,,,,,,,77
2,3,199.0,1.0,,32.9,,,,,,...,2.0,,,,,,,,,95
3,4,170.0,1.0,,13.3,,87.1,,,,...,2.0,,,,,,,,,1
4,5,277.0,1.0,,92.5,,,,,,...,2.0,,,,,,,,,49


In [3]:
bodymeasures.describe()

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE,RIDAGEYR
count,9278.0,9266.0,9266.0,138.0,9181.0,253.0,1079.0,29.0,267.0,1.0,...,9148.0,3.0,0.0,3.0,0.0,3.0,3.0,3.0,1.0,9278.0
mean,4992.804592,261.712066,1.021368,25.905797,67.519987,2.976285,80.244486,1.0,41.319476,1.0,...,1.999672,2.0,,2.0,,1.0,2.0,1.666667,2.0,29.014766
std,2869.638216,80.9763,0.18344,37.109207,282.449524,1.174899,13.927584,0.0,2.980247,,...,0.018107,0.0,,0.0,,0.0,0.0,0.57735,,24.442185
min,1.0,0.0,1.0,1.0,-149.0,1.0,44.9,1.0,15.5,1.0,...,1.0,2.0,,2.0,,1.0,2.0,1.0,2.0,-69.0
25%,2517.25,211.0,1.0,2.0,39.2,3.0,69.15,1.0,39.5,1.0,...,2.0,2.0,,2.0,,1.0,2.0,1.5,2.0,10.0
50%,4988.5,258.0,1.0,6.0,63.0,3.0,81.3,1.0,41.7,1.0,...,2.0,2.0,,2.0,,1.0,2.0,2.0,2.0,19.0
75%,7482.75,307.0,1.0,56.0,79.7,3.0,91.8,1.0,43.15,1.0,...,2.0,2.0,,2.0,,1.0,2.0,2.0,2.0,47.0
max,9965.0,909.0,3.0,99.0,12870.0,11.0,110.3,1.0,47.9,1.0,...,2.0,2.0,,2.0,,1.0,2.0,2.0,2.0,109.0


In [4]:
# codebook URL: https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.htm
demographics = pd.read_csv(demographics_filename)
demographics.head()

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
0,1,1.0,Exam,2.0,Female,2.0,29.0,31.0,Non-Hispanic Black,2.0,...,10094.0171,9912.461855,9727.078709,10041.524113,9953.955984,9857.381983,9865.152486,10327.992682,9809.165049,10323.315747
1,2,1.0,Both,2.0,Male,77.0,926.0,926.0,Non-Hispanic White,1.0,...,27186.728682,27324.345051,28099.663528,27757.066921,28049.286048,26716.602006,26877.704909,27268.025234,27406.38362,26984.812909
2,3,1.0,Exam,1.0,Female,95.0,125.0,126.0,Non-Hispanic White,1.0,...,43993.193099,44075.386428,46642.563799,44967.681579,44572.48201,44087.945688,44831.370881,44480.987235,45389.112766,43781.905637
3,4,1.0,Both,2.0,Male,1.0,22.0,23.0,Non-Hispanic Black,2.0,...,10702.307249,10531.444441,10346.119327,10636.063039,0.0,10533.108939,10654.749584,10851.024385,10564.981435,11012.529729
4,5,1.0,Both,2.0,Male,49.0,597.0,597.0,Non-Hispanic White,1.0,...,93164.78243,92119.608772,95388.490406,94131.383538,95297.809952,91325.082461,91640.586117,92817.926915,94282.855382,91993.251203


In [5]:
# codebook URL: https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/OCQ.htm
occupations = pd.read_csv(occupations_filename)
occupations.head()

Unnamed: 0,SEQN,OCQ130,OCQ150,OCQ160,OCQ180,OCQ210,OCD230,OCD240,OCQ260,OCD270,...,OCD390,OCD395,OCQ420,OCQ430,OCQ440,OCQ450,OCQ470G,OCD470,OCD480,RIAGENDR
0,2.0,,2.0,2.0,,1.0,38.0,13.0,1.0,168.0,...,9.0,300.0,,,,,,,,Male
1,5.0,,1.0,,40.0,,19.0,1.0,1.0,48.0,...,,,,,,,,,,Male
2,6.0,,4.0,,,,,,,,...,16.0,3.0,2.0,,,,,,,Female
3,7.0,,1.0,,45.0,,27.0,23.0,1.0,36.0,...,,,,,,,,,,F
4,8.0,7.0,,,,,,,,,...,,,,,,,,,,M


In [6]:
# Make a dataset by joining demographics and bodymeasures
dataset = pd.merge(demographics, bodymeasures, on='SEQN', how='inner') # CF 7.2 in the Pandas Notebook
dataset.head()

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR_x,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE,RIDAGEYR_y
0,1,1.0,Exam,2.0,Female,2.0,29.0,31.0,Non-Hispanic Black,2.0,...,2.0,,,,,,,,,2
1,2,1.0,Both,2.0,Male,77.0,926.0,926.0,Non-Hispanic White,1.0,...,2.0,,,,,,,,,77
2,3,1.0,Exam,1.0,Female,95.0,125.0,126.0,Non-Hispanic White,1.0,...,2.0,,,,,,,,,95
3,4,1.0,Both,2.0,Male,1.0,22.0,23.0,Non-Hispanic Black,2.0,...,2.0,,,,,,,,,1
4,5,1.0,Both,2.0,Male,49.0,597.0,597.0,Non-Hispanic White,1.0,...,2.0,,,,,,,,,49


In [7]:
# Write the dataset back out as a CSV
#dataset.to_csv('joined_demo_bmx.csv', index=False)
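
The joined output above contains `RIDAGEYR_x` and `RIDAGEYR_y` because `RIDAGEYR` appears in both input tables. The following sketch reproduces that behavior on hypothetical toy frames (the values are illustrative only), and shows one way to list shared non-key columns before merging:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
left = pd.DataFrame({
    'SEQN': [1, 2, 3],
    'RIDAGEYR': [2, 77, 95],
    'RIAGENDR': ['Female', 'Male', 'Female'],
})
right = pd.DataFrame({
    'SEQN': [1, 2, 4],
    'RIDAGEYR': [2, 77, 49],
    'BMXWT': [12.5, 75.4, 92.5],
})

# Columns present in both tables, other than the join key
shared = sorted((set(left.columns) & set(right.columns)) - {'SEQN'})

# pandas disambiguates shared non-key columns with the default suffixes _x and _y
joined = pd.merge(left, right, on='SEQN', how='inner')

print(shared)
print(list(joined.columns))
```

A non-empty `shared` list is a hint that one of the tables carries a duplicated column, which section 5.2 deals with.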

<a id="3.1"> </a>
### 3.1: Exploring a Subset ###

- [3.1.1: Browse Data and Rename Columns](#3.1.1)
- [3.1.2: Clean Globally](#3.1.2)
- [3.1.3: Clean Military](#3.1.3)
- [3.1.4: Clean Citizenship](#3.1.4)
- [3.1.5: Clean Gender](#3.1.5)
- [3.1.6: Save CSV](#3.1.6)

<a id="3.1.1"> </a>
#### 3.1.1: Browse Data and Rename Columns ####

In [8]:
# 'SEQN' = ID
# 'RIDAGEYR' = Age at Screening (should be capped at 85)
# 'RIAGENDR' = Gender
# 'DMQMILIT' = Veteran/Military Status
# 'DMDCITZN' = Citizenship Status
demo_subset = demographics.loc[:, ['SEQN', 'RIDAGEYR', 'RIAGENDR', 'DMQMILIT', 'DMDCITZN']]
demo_subset.rename(columns={'SEQN': 'ID', 'RIDAGEYR': 'Age', 'RIAGENDR': 'Gender', 'DMQMILIT': 'Military', 'DMDCITZN': 'Citizenship'}, inplace=True)
demo_subset.head()

Unnamed: 0,ID,Age,Gender,Military,Citizenship
0,1,2.0,Female,,Citizen by birth or naturalization
1,2,77.0,Male,Y,Citizen by birth or naturalization
2,3,95.0,Female,,Not a citizen of the US
3,4,1.0,Male,,Citizen by birth or naturalization
4,5,49.0,Male,Yes,Citizen by birth or naturalization


<a id="3.1.2"> </a>
#### 3.1.2: Clean Globally ####

In [9]:
# Instead of cleaning whitespace on each column individually (original code left as comments 
# below), we can use the apply method to clean all relevant columns in one go:
text_columns = ['Gender', 'Military', 'Citizenship']
demo_subset[text_columns] = demo_subset[text_columns].apply(lambda x: x.str.strip())


# We can fix minor typos like 'Y' -> 'Yes', etc
BINARY_CORRECTION_MAP = {
    'Y': 'Yes',
    'N': 'No',
}
GENDER_CORRECTION_MAP = {
    'M': 'Male',
    'F': 'Female',
}
DONTKNOW_CORRECTION_MAP = {
    "Don't Know": "Don't know",
    "Dont know": "Don't know",
    "Unknown": "Don't know",
}

global_replacements = {
    **BINARY_CORRECTION_MAP,
    **GENDER_CORRECTION_MAP,
    **DONTKNOW_CORRECTION_MAP,
}

demo_subset.replace(
    {
        'Military': {**global_replacements},
        'Citizenship': {**global_replacements},
        'Gender': {**global_replacements},
    },
    inplace=True
)
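
The order of the two steps above matters: `replace` matches whole values exactly, so a padded entry such as `' Y'` is not corrected unless whitespace is stripped first. A small sketch on a hypothetical Series (the values and the `CORRECTIONS` map are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative correction map, in the same spirit as the maps defined above
CORRECTIONS = {'Y': 'Yes', 'N': 'No', 'Dont know': "Don't know"}
raw = pd.Series([' Y', 'Yes ', 'N', 'No', 'Dont know ', np.nan])

# Replacing without stripping leaves padded variants untouched
replaced_only = raw.replace(CORRECTIONS)

# Stripping first lets every variant match its correction
stripped_then_replaced = raw.str.strip().replace(CORRECTIONS)

print(replaced_only.nunique(), stripped_then_replaced.nunique())
```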

<a id="3.1.3"> </a>
#### 3.1.3: Clean Military ####

In [10]:
unique_military = demo_subset['Military'].unique()
len(unique_military)

5

In [11]:
# Before the global cleaning in 3.1.2 this column had 38 unique values, which seems like a
# lot for military status; after cleaning only 5 remain
unique_military

array([nan, 'Yes', 'No', "Don't know", 'Refused'], dtype=object)

In [12]:
# In the raw data, many values were duplicates that differed only by extraneous whitespace;
# the strip below is now handled globally in 3.1.2
#demo_subset.loc[:, 'Military'] = demo_subset.loc[:, 'Military'].str.strip()

In [13]:
unique_military = demo_subset['Military'].unique()
len(unique_military)

5

In [14]:
unique_military

array([nan, 'Yes', 'No', "Don't know", 'Refused'], dtype=object)

In [15]:
#replace_dict = {'Military': {**BINARY_CORRECTION_MAP, **DONTKNOW_CORRECTION_MAP}}
#demo_subset.replace(replace_dict, inplace=True)

In [16]:
unique_military = demo_subset['Military'].unique()
len(unique_military)

5

In [17]:
# Success: We are down to the 5 truly different response values
unique_military

array([nan, 'Yes', 'No', "Don't know", 'Refused'], dtype=object)

In [18]:
# Replace unique string values with numeric codes:
replace_dict = {
    'Military': {
        'Yes': 1,
        'No': 2,
        'Refused': 7,
        "Don't know": 9,
    }
}
demo_subset.replace(replace_dict, inplace=True)
unique_military = demo_subset['Military'].unique()
unique_military

array([nan,  1.,  2.,  9.,  7.])

<a id="3.1.4"> </a>
#### 3.1.4: Clean Citizenship ####

In [19]:
# Let's repeat for citizenship (we'll strip whitespace before we even look at the values)
#demo_subset.loc[:,'Citizenship'] = demo_subset.loc[:, 'Citizenship'].str.strip()
unique_citizenship = demo_subset['Citizenship'].unique()
unique_citizenship

array(['Citizen by birth or naturalization', 'Not a citizen of the US',
       'Refused', "Don't know", nan], dtype=object)

In [20]:
#replace_dict = {'Citizenship': {**DONTKNOW_CORRECTION_MAP}}
#demo_subset.replace(replace_dict, inplace=True)

In [21]:
unique_citizenship = demo_subset['Citizenship'].unique()
unique_citizenship

array(['Citizen by birth or naturalization', 'Not a citizen of the US',
       'Refused', "Don't know", nan], dtype=object)

In [22]:
# Replace unique string values with numeric codes:
replace_dict = {
    'Citizenship': {
        'Citizen by birth or naturalization': 1,
        'Not a citizen of the US': 2,
        'Refused': 7,
        "Don't know": 9,
    }
}
demo_subset.replace(replace_dict, inplace=True)
unique_citizenship = demo_subset['Citizenship'].unique()
unique_citizenship

array([ 1.,  2.,  7.,  9., nan])

<a id="3.1.5"> </a>
#### 3.1.5: Clean Gender ####

In [23]:
#demo_subset.loc[:, 'Gender'] = demo_subset.loc[:, 'Gender'].str.strip()
unique_gender = demo_subset['Gender'].unique()
unique_gender

array(['Female', 'Male', nan], dtype=object)

In [24]:
#replace_dict = {'Gender': {**GENDER_CORRECTION_MAP}}
#demo_subset.replace(replace_dict, inplace=True)

In [25]:
unique_gender = demo_subset['Gender'].unique()
unique_gender

array(['Female', 'Male', nan], dtype=object)

In [26]:
# Replace unique string values with numeric codes:
replace_dict = {
    'Gender': {
        'Male': 1,
        'Female': 2,
    }
}
demo_subset.replace(replace_dict, inplace=True)
unique_gender = demo_subset['Gender'].unique()
unique_gender

array([ 2.,  1., nan])
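
One caveat with `replace` for encoding is that any label missing from the dictionary is silently left as a string, so the column can end up mixing strings and numbers. A possible safeguard, sketched here with a hypothetical helper named `encode` (not part of pandas), is to check for unmapped labels before converting:

```python
import numpy as np
import pandas as pd

GENDER_CODES = {'Male': 1, 'Female': 2}  # same codes as the cell above

def encode(series, codes):
    """Map labels to numeric codes, refusing to silently skip unknown labels."""
    unknown = set(series.dropna().unique()) - set(codes)
    if unknown:
        raise ValueError(f'Unmapped labels: {sorted(unknown)}')
    # Series.map turns anything not in `codes` (here only NaN) into NaN
    return series.map(codes)

clean = pd.Series(['Female', 'Male', np.nan])
encoded = encode(clean, GENDER_CODES)
print(encoded.tolist())
```

Calling `encode(pd.Series(['F']), GENDER_CODES)` raises a `ValueError`, which flags that the typo corrections have not been applied yet.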

<a id="3.1.6"> </a>
#### 3.1.6: Save CSV ####

In [27]:
#demo_subset.to_csv('demo_subset.csv', index=False)

<a id="4"> </a>
4: Dealing with Missing Data
----------------------------

In [28]:
total_entries = len(demo_subset.index)
valid_entries = demo_subset.count()
missing_data_c = total_entries - valid_entries
missing_data_c

ID                0
Age             209
Gender          211
Military       4607
Citizenship     212
dtype: int64

In [29]:
missing_percentage = missing_data_c / total_entries * 100
missing_percentage.sort_values(ascending=False).head()

Military       43.519743
Citizenship     2.002645
Gender          1.993199
Age             1.974306
ID              0.000000
dtype: float64
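
The same percentages can be obtained more directly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Age': [2.0, 77.0, np.nan, 1.0],
    'Military': [np.nan, 1.0, np.nan, np.nan],
})

# isna() gives a True/False frame; the column-wise mean is the missing fraction
missing_percentage = toy.isna().mean() * 100

# The longer route used in the cells above, for comparison
manual = (len(toy.index) - toy.count()) / len(toy.index) * 100

print(missing_percentage.sort_values(ascending=False))
```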

In [30]:
missing_data_r = demo_subset.isnull().sum(axis=1)
missing_data_r

0        1
1        0
2        1
3        1
4        0
        ..
10581    0
10582    0
10583    0
10584    0
10585    1
Length: 10586, dtype: int64

In [31]:
demo_subset

Unnamed: 0,ID,Age,Gender,Military,Citizenship
0,1,2.0,2.0,,1.0
1,2,77.0,1.0,1.0,1.0
2,3,95.0,2.0,,2.0
3,4,1.0,1.0,,1.0
4,5,49.0,1.0,1.0,1.0
...,...,...,...,...,...
10581,2774,17.0,2.0,2.0,1.0
10582,7696,56.0,2.0,2.0,2.0
10583,868,18.0,1.0,2.0,1.0
10584,6810,31.0,2.0,2.0,2.0


In [32]:
print('Demographics:')
print(demographics.dtypes.head())

print('\nBody Measures:')
print(bodymeasures.dtypes.head())

Demographics:
SEQN          int64
SDDSRVYR    float64
RIDSTATR     object
RIDEXMON    float64
RIAGENDR     object
dtype: object

Body Measures:
SEQN          int64
BMAEXLEN    float64
BMAEXSTS    float64
BMAEXCMT    float64
BMXWT       float64
dtype: object


In [33]:
# Convert string to integer
# bmx.loc[:, 'SEQN'] = pd.to_numeric(bmx['SEQN'], downcast='integer')
#
# if the data contains values that cannot be parsed as numbers, use the `errors` parameter:
# bmx.loc[:, 'SEQN'] = pd.to_numeric(bmx['SEQN'], downcast='integer', errors='coerce')
#
# WARNING: in the example above, even though we specified `downcast='integer'` the `dtype` of
# the field will be `float64` if there were any coerced strings (which were coerced to NaN),
# since NaN is considered a float
#
# In that case we could instead do the following:
# 1. Get the index for each row where the ID is not a number
# ind = np.isnan(bmx['SEQN'])
#
# 2. Drop the rows with non-numeric IDs
# bmx = bmx.loc[~ind,:]
#
# 3. Finally, convert the field to integer
# bmx.loc[:, 'SEQN'] = pd.to_numeric(bmx['SEQN'])
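
The commented steps above can be tried out on a small hypothetical Series (`raw_ids` is made up for this sketch). It also shows an alternative that the notes above do not mention: pandas' nullable `Int64` dtype, which can hold integers alongside missing values.

```python
import pandas as pd

# Hypothetical ID column read in as strings, with one unparseable entry
raw_ids = pd.Series(['1', '2', 'n/a', '4'])

# Unparseable strings become NaN, which forces a float dtype
coerced = pd.to_numeric(raw_ids, errors='coerce')
print(coerced.dtype)

# Option 1 (as in the notes above): drop the unparseable rows, then convert
valid = coerced.notna()
as_int = coerced[valid].astype('int64')

# Option 2 (an alternative): keep every row using the nullable integer dtype
as_nullable = coerced.astype('Int64')
print(as_nullable.dtype)
```

Option 1 loses rows but yields a plain NumPy integer column; Option 2 keeps the rows and marks the bad IDs as missing.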

In [34]:
minors_mask = demo_subset.loc[:, 'Age'] < 18
# Remove information for minors by setting the following fields to NaN
demo_subset.loc[minors_mask, ['Gender', 'Military', 'Citizenship']] = np.nan

In [35]:
demo_subset

Unnamed: 0,ID,Age,Gender,Military,Citizenship
0,1,2.0,,,
1,2,77.0,1.0,1.0,1.0
2,3,95.0,2.0,,2.0
3,4,1.0,,,
4,5,49.0,1.0,1.0,1.0
...,...,...,...,...,...
10581,2774,17.0,,,
10582,7696,56.0,2.0,2.0,2.0
10583,868,18.0,1.0,2.0,1.0
10584,6810,31.0,2.0,2.0,2.0


In [36]:
# Find the column with the largest proportion of missing values

In [37]:
valid_entries = demographics.count()
total_rows = len(demographics.index)
missing_data = total_rows - valid_entries
missing_proportion = missing_data / total_rows
missing_proportion.head()

SEQN        0.000000
SDDSRVYR    0.019932
RIDSTATR    0.019743
RIDEXMON    0.086907
RIAGENDR    0.019932
dtype: float64

In [38]:
missing_proportion.sort_values(ascending=False).index[0]

'DMARACE'

In [39]:
# Pandas has a built-in df and series method to do what we just did in the line above
missing_proportion.idxmax()

'DMARACE'

<a id="5"> </a>
5: Dealing with Duplicates
--------------------------

- [5.1: Duplicated Rows](#5.1)
- [5.2: Duplicated Columns](#5.2)

<a id="5.1"> </a>
### 5.1: Duplicated Rows ###

In [40]:
# `keep`
# ------
# - `'first'` (default): Mark duplicates as True except first occurrence
# - `'last'`: Mark duplicates as True except last occurrence
# - `False`: Mark all duplicates as True
duplicates_mask = demographics['SEQN'].duplicated(keep=False)
print(f'Number of duplicated rows: {len(demographics.loc[duplicates_mask,:])}')
demographics.loc[duplicates_mask,:].sort_values('SEQN').head()

Number of duplicated rows: 1213


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
14,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,112123.984382,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
10491,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,33314.865599,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
10102,42,1.0,Both,2.0,Female,18.0,223.0,224.0,Non-Hispanic Black,2.0,...,9118.858665,9139.612368,8948.540469,9083.673929,0.0,9095.89801,9290.163552,9459.795274,9312.80037,9518.027784
9991,42,1.0,Both,2.0,Female,18.0,223.0,224.0,Non-Hispanic Black,2.0,...,9118.858665,9139.612368,8948.540469,9083.673929,0.0,9095.89801,9290.163552,9459.795274,9312.80037,9518.027784
41,42,1.0,Both,2.0,Female,18.0,223.0,224.0,Non-Hispanic Black,2.0,...,9118.858665,9139.612368,8948.540469,9083.673929,0.0,9095.89801,9290.163552,9459.795274,9312.80037,9518.027784


In [41]:
# We can use `drop_duplicates` to remove perfect duplicates (i.e., rows where every value in 
# every column is the same)
demographics.drop_duplicates(inplace=True)

# Check how many duplicates remain:
duplicates_mask = demographics['SEQN'].duplicated(keep=False)
print(f'Number of duplicated rows: {len(demographics.loc[duplicates_mask,:])}')
demographics.loc[duplicates_mask,:].sort_values('SEQN').head()

Number of duplicated rows: 814


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
14,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,112123.984382,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
10491,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,33314.865599,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
10307,57,,Both,2.0,,39.0,475.0,476.0,Non-Hispanic White,1.0,...,93148.702889,,,,,,,92801.90724,94266.582871,91977.37386
56,57,1.0,,,Male,,,,,,...,,92103.709621,95372.027071,94115.13717,95281.362267,91309.32044,91624.769642,,,
59,60,1.0,Exam,2.0,Male,5.0,70.0,71.0,Non-Hispanic White,1.0,...,41858.569852,41576.241533,43040.796615,43167.384386,42922.779884,41775.622676,0.0,42091.705803,42039.843678,41556.080118


In [42]:
# Some entries might be identical but missing different values. In this case, the 
# drop_duplicates method won't find these duplicates. We first need to combine the values
# from the incomplete duplicates to make complete records and then remove duplicates
import itertools
duplicate_ids = demographics.loc[duplicates_mask, 'SEQN'].unique()
duplicate_ids

array([  15,   57,   60,   79,   83,  105,  147,  151,  160,  206,  219,
        225,  233,  262,  263,  295,  303,  383,  385,  395,  478,  481,
        524,  575,  588,  620,  646,  652,  690,  703,  715,  781,  806,
        826,  830,  838,  863,  868,  869,  967,  990, 1044, 1071, 1080,
       1116, 1129, 1164, 1234, 1282, 1331, 1336, 1360, 1377, 1431, 1448,
       1493, 1494, 1503, 1517, 1528, 1544, 1547, 1575, 1599, 1606, 1613,
       1619, 1639, 1646, 1650, 1667, 1682, 1694, 1696, 1718, 1720, 1782,
       1784, 1820, 1828, 1902, 1906, 1909, 1947, 2026, 2036, 2043, 2047,
       2078, 2109, 2140, 2156, 2204, 2206, 2237, 2286, 2295, 2302, 2315,
       2331, 2351, 2485, 2500, 2518, 2523, 2529, 2573, 2610, 2672, 2706,
       2718, 2730, 2734, 2743, 2765, 2769, 2774, 2782, 2810, 2866, 2949,
       2981, 3106, 3115, 3142, 3150, 3158, 3171, 3173, 3233, 3283, 3290,
       3311, 3329, 3345, 3346, 3360, 3367, 3377, 3429, 3461, 3561, 3573,
       3574, 3587, 3604, 3619, 3620, 3643, 3649, 37

In [43]:
# Note this method is VERY slow and inefficient

# iterate through every id that has duplicates
for duplicate_id in duplicate_ids:
    # get an array of all the row indices with the current id
    duplicate_rows = np.where(demographics['SEQN'] == duplicate_id)[0]  # the array is the first element in a tuple
    
    # create a cartesian product of all the duplicated rows, and fill the data pairwise
    # e.g., suppose we had three duplicates: A, B, C; `itertools.product` would create the
    # cartesian product: [(A, A), (A, B), (A, C), (B, A), (B, B), (B, C), (C, A), (C, B), (C, C)]
    # (observe that every duplicate gets compared with every other duplicate)
    for (left, right) in itertools.product(duplicate_rows, repeat=2):
        
        # For every na value in the left row, replace with the corresponding value from the right row
        demographics.iloc[left,:] = demographics.iloc[left,:].fillna(demographics.iloc[right,:])
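
Since the pairwise loop above is slow, a vectorized alternative may be worth considering. The sketch below is not the notebook's method: it assumes that forward-filling and then back-filling within each `SEQN` group is an acceptable way to combine partial records, and it uses hypothetical toy data. Rows with genuinely conflicting values remain distinct, as they do with the loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: SEQN 57 has two complementary partial records,
# SEQN 60 has two conflicting records, SEQN 61 has no duplicates
toy = pd.DataFrame({
    'SEQN': [57, 57, 60, 60, 61],
    'A': [1.0, np.nan, 2.0, 1.0, 5.0],
    'B': [np.nan, 'x', 'y', 'y', 'z'],
})

value_cols = [c for c in toy.columns if c != 'SEQN']
filled = toy.copy()

# Fill gaps from earlier rows of the same SEQN, then from later rows
filled[value_cols] = filled.groupby('SEQN')[value_cols].ffill()
filled[value_cols] = filled.groupby('SEQN')[value_cols].bfill()

# Complementary records are now identical and collapse into one row
deduped = filled.drop_duplicates()
print(deduped)
```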

In [44]:
demographics.drop_duplicates(inplace=True)

# Check how many duplicates remain:
duplicates_mask = demographics['SEQN'].duplicated(keep=False)
print(f'Number of duplicated rows: {len(demographics.loc[duplicates_mask,:])}')
demographics.loc[duplicates_mask,:].sort_values('SEQN').head()

Number of duplicated rows: 412


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
14,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,112123.984382,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
10491,15,1.0,exam,1.0,Female,38.0,459.0,460.0,Non-Hispanic White,1.0,...,110651.142444,109666.993353,113803.42969,33314.865599,113109.089839,108722.224537,109264.251804,110186.143135,112750.496899,110008.64796
59,60,1.0,Exam,2.0,Male,5.0,70.0,71.0,Non-Hispanic White,1.0,...,41858.569852,41576.241533,43040.796615,43167.384386,42922.779884,41775.622676,0.0,42091.705803,42039.843678,41556.080118
10392,60,1.0,Exam,1.0,Male,5.0,70.0,71.0,Non-Hispanic White,1.0,...,41858.569852,41576.241533,43040.796615,12698.465159,20321.432353,41775.622676,0.0,42091.705803,42039.843678,41556.080118
82,83,1.0,exam,2.0,Female,60.0,731.0,731.0,Non-Hispanic Black,2.0,...,10840.799501,10774.881528,10553.677856,10816.699831,10998.350416,10902.784484,10863.252191,10954.27438,10864.016537,0.0


In [45]:
# The remaining duplicates have conflicting values, for example `SEQN` 60 has two different 
# values for `RIDEXMON`.
# Since we have no way of ascertaining which of the duplicated values is the correct one, the
# safest thing is to drop all records with conflicting values:
duplicates_mask = demographics['SEQN'].duplicated(keep=False)
demographics = demographics.loc[~duplicates_mask,:]

In [46]:
# Check how many duplicates remain:
duplicates_mask = demographics['SEQN'].duplicated(keep=False)
print(f'Number of duplicated rows: {len(demographics.loc[duplicates_mask,:])}')
demographics.loc[duplicates_mask,:].sort_values('SEQN').head()

Number of duplicated rows: 0


Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52


<a id="5.2"> </a>
### 5.2: Duplicated Columns ###

In [47]:
# Let's see which, if any, columns are in both demographics and bodymeasures
set(demographics.columns).intersection(bodymeasures.columns)

{'RIDAGEYR', 'SEQN'}

In [48]:
# `RIDAGEYR` is age. Age should only be in demographics, not in body measures
bodymeasures.drop(columns='RIDAGEYR', inplace=True)

# Now verify that only `SEQN` is in both tables
set(demographics.columns).intersection(bodymeasures.columns)

{'SEQN'}

In [49]:
# We could now safely merge demographics and bodymeasures
merged = pd.merge(demographics, bodymeasures, on='SEQN', how='inner')

In [50]:
merged.head()

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
0,1,1.0,Exam,2.0,Female,2.0,29.0,31.0,Non-Hispanic Black,2.0,...,,2.0,,,,,,,,
1,2,1.0,Both,2.0,Male,77.0,926.0,926.0,Non-Hispanic White,1.0,...,,2.0,,,,,,,,
2,3,1.0,Exam,1.0,Female,95.0,125.0,126.0,Non-Hispanic White,1.0,...,,2.0,,,,,,,,
3,4,1.0,Both,2.0,Male,1.0,22.0,23.0,Non-Hispanic Black,2.0,...,,2.0,,,,,,,,
4,5,1.0,Both,2.0,Male,49.0,597.0,597.0,Non-Hispanic White,1.0,...,,2.0,,,,,,,,
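
To guard against duplicated keys slipping back in, `pd.merge` accepts a `validate` argument that raises an error when the key relationship is not the expected one. A small sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames; SEQN 2 is duplicated on the left
demo = pd.DataFrame({'SEQN': [1, 2, 2], 'RIDAGEYR': [2, 77, 77]})
bmx = pd.DataFrame({'SEQN': [1, 2], 'BMXWT': [12.5, 75.4]})

# With duplicated keys, a one-to-one merge is rejected
try:
    pd.merge(demo, bmx, on='SEQN', how='inner', validate='one_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(merge_ok)

# After removing the duplicate, the same merge succeeds
deduped = demo.drop_duplicates()
merged = pd.merge(deduped, bmx, on='SEQN', how='inner', validate='one_to_one')
print(merged)
```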


<a id="6"> </a>
6: Infeasible and Extreme Data
------------------------------
- [6.1: Impossible Values](#6.1)
- [6.2: Categorical Errors](#6.2)
- [6.3: Extreme/Out of Range Values](#6.3)
- [6.4: Outliers](#6.4)
- [6.5: Saturated Data](#6.5)

<a id="6.1"> </a>
### 6.1: Impossible Values ###

In [51]:
# Example, a person cannot have a negative weight
bodymeasures['BMXWT'].describe()  # we can see that there is at least one entry where the value is negative

count     9181.000000
mean        67.519987
std        282.449524
min       -149.000000
25%         39.200000
50%         63.000000
75%         79.700000
max      12870.000000
Name: BMXWT, dtype: float64

In [52]:
# Let's look at all such entries
subzero_mask = bodymeasures.BMXWT < 0
bodymeasures[subzero_mask]

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
1962,2128,210.0,1.0,,-26.3,,,,,,...,,2.0,,,,,,,,
3418,3698,256.0,1.0,,-48.14,,,,,,...,,2.0,,,,,,,,
4023,4334,311.0,1.0,,-96.5,,,,,,...,,2.0,,,,,,,,
4060,4373,349.0,1.0,,-96.4,,,,,,...,,2.0,,,,,,,,
6480,6959,312.0,1.0,,-99.44,,,,,,...,,2.0,,,,,,,,
6762,7266,349.0,1.0,,-149.0,,,,,,...,2.0,2.0,,,,,,,,


In [53]:
# Replace all such values with NaN
bodymeasures.loc[subzero_mask, 'BMXWT'] = np.nan

In [54]:
# Now look at the summary again
bodymeasures.BMXWT.describe()

count     9175.000000
mean        67.620357
std        282.512782
min          3.100000
25%         39.300000
50%         63.000000
75%         79.800000
max      12870.000000
Name: BMXWT, dtype: float64

<a id="6.2"> </a>
### 6.2: Categorical Errors ###

In [55]:
# For example, looking at Weight Comment, the codebook specifies 4 possible values:
# 1: Could not obtain;
# 2: Exceeds capacity;
# 3: Clothing;
# 4: Medical appliance;
# (and NaN for missing values)
bodymeasures.BMIWT.unique()

array([ 3., nan,  4.,  1., 11.,  7.])

In [56]:
# Here we see that some entries have been coded 7 or 11, which are not valid categories
category_mask = bodymeasures.BMIWT > 4
bodymeasures[category_mask]

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
531,591,265.0,1.0,,58.3,11.0,,,,,...,,2.0,,,,,,,,
1978,2145,261.0,1.0,,79.8,11.0,,,,,...,,2.0,,,,,,,,
2576,2790,239.0,1.0,,79.8,7.0,,,,,...,,2.0,,,,,,,,
4548,4892,263.0,1.0,,83.6,7.0,,,,,...,,2.0,,,,,,,,
7811,8386,244.0,1.0,,37.3,11.0,,,,,...,,2.0,,,,,,,,


In [57]:
# Replace all such values with NaN
bodymeasures.loc[category_mask, 'BMIWT'] = np.nan

# Recheck values
category_mask = bodymeasures.BMIWT > 4
bodymeasures[category_mask]

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
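
The `> 4` mask works for this column, but it would miss invalid codes below the valid range (for example 0). A more general sketch, using hypothetical toy values, checks membership in the list of valid codes instead:

```python
import numpy as np
import pandas as pd

VALID_BMIWT = [1, 2, 3, 4]  # the four codes listed in the codebook comment above

# Hypothetical toy values, including codes above and below the valid range
codes = pd.Series([3.0, np.nan, 4.0, 1.0, 11.0, 7.0, 0.0])

# Invalid = present (not NaN) and not one of the valid codes
invalid_mask = codes.notna() & ~codes.isin(VALID_BMIWT)

# Series.mask replaces the flagged entries with NaN
cleaned = codes.mask(invalid_mask)
print(cleaned.unique())
```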


<a id="6.3"> </a>
### 6.3: Extreme/Out of Range Values ###

In [58]:
# Let's look at weight again
bodymeasures['BMXWT'].describe()

count     9175.000000
mean        67.620357
std        282.512782
min          3.100000
25%         39.300000
50%         63.000000
75%         79.800000
max      12870.000000
Name: BMXWT, dtype: float64

In [59]:
# At least one person is listed as having a weight of 12.9 metric tonnes. This cannot be
# valid.
# The heaviest person ever recorded was 635kg. We will assume that any weight entries greater
# than this are errors.
MAX_WEIGHT = 635
weightmax_mask = bodymeasures.BMXWT > MAX_WEIGHT
bodymeasures[weightmax_mask]

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
124,135,409.0,1.0,,5510.0,,,,,,...,,2.0,,,,,,,,
880,964,315.0,1.0,,7090.0,,,,,,...,,2.0,,,,,,,,
896,980,271.0,1.0,,11810.0,,,,,,...,1.0,2.0,,,,,,,,
1335,1451,409.0,1.0,,1430.0,,96.8,,,,...,,2.0,,,,,,,,
1375,1493,233.0,1.0,,12870.0,,,,,,...,,2.0,,,,,,,,
4050,4363,253.0,1.0,,1360.0,,94.3,,,,...,,2.0,,,,,,,,
4413,4751,396.0,1.0,,12554.0,,,,,,...,,2.0,,,,,,,,
4856,5221,356.0,1.0,,5790.0,,,,,,...,,2.0,,,,,,,,
8248,8864,241.0,2.0,6.0,12340.0,,,,,,...,,,,,,,,,,


In [60]:
bodymeasures.loc[weightmax_mask, 'BMXWT'] = np.nan

bodymeasures['BMXWT'].describe()

count    9166.000000
mean       59.967574
std        29.841072
min         3.100000
25%        39.200000
50%        63.000000
75%        79.600000
max       193.300000
Name: BMXWT, dtype: float64
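
The lower bound from 6.1 and the upper bound used here can also be applied in a single step with `Series.between`. This is only a sketch on hypothetical toy values; note that `between` returns `False` for NaN, so missing entries have to be excluded explicitly to avoid counting them as out of range.

```python
import numpy as np
import pandas as pd

MIN_WEIGHT, MAX_WEIGHT = 0, 635  # plausible range in kg, as assumed in the text above

# Hypothetical toy weights: one negative, one missing, one absurdly large
weights = pd.Series([12.5, -26.3, 75.4, np.nan, 12870.0])

in_range = weights.between(MIN_WEIGHT, MAX_WEIGHT)  # bounds are inclusive by default
out_of_range = weights.notna() & ~in_range

# Keep values where the condition holds; everything else becomes NaN
cleaned = weights.where(~out_of_range)
print(cleaned)
```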

<a id="6.4"> </a>
### 6.4: Outliers ###

Outliers are not necessarily incorrect, so removing them could discard legitimate data and introduce bias. However, we still want to identify them.

In [61]:
# Let's find the z-score (# standard deviations from the mean) of the lowest and highest values
mean_wt = np.nanmean(bodymeasures.BMXWT)
std_wt = np.nanstd(bodymeasures.BMXWT)
min_wt = np.nanmin(bodymeasures.BMXWT)
max_wt = np.nanmax(bodymeasures.BMXWT)

low_wt_zscore = (min_wt - mean_wt) / std_wt
high_wt_zscore = (max_wt - mean_wt) / std_wt

print( low_wt_zscore )
print( high_wt_zscore )

-1.905785256469405
4.468328013497096
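
The same calculation can be applied to every value at once to flag candidate outliers for inspection. In this sketch the toy weights and the threshold of 2.5 are assumptions chosen for illustration, not values taken from the dataset or a recommended cutoff:

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights with one unusually large value
weights = pd.Series([55.0, 60.0, 62.0, 58.0, 61.0, 59.0, 63.0, 57.0, 60.0, 150.0, np.nan])

# pandas skips NaN by default; ddof=0 matches np.nanstd as used above
z_scores = (weights - weights.mean()) / weights.std(ddof=0)

Z_THRESHOLD = 2.5  # assumed cutoff for this example
outliers = weights[z_scores.abs() > Z_THRESHOLD]
print(outliers)
```

Flagged values are candidates for review rather than automatic removal, in line with the caution above.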


<a id="6.5"> </a>
### 6.5: Saturated Data ###

In [62]:
# We know from the codebook that ages 85 and over are topcoded at 85
# Thus any value > 85 is incorrect, and any value == 85 is saturated
max_age = np.nanmax(demographics.RIDAGEYR)

age_mask = demographics.RIDAGEYR > 85
demographics[age_mask]

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
2,3,1.0,Exam,1.0,Female,95.0,125.0,126.0,Non-Hispanic White,1.0,...,43993.193099,44075.386428,46642.563799,44967.681579,44572.48201,44087.945688,44831.370881,44480.987235,45389.112766,43781.905637
2518,2519,1.0,Both,2.0,Male,103.0,167.0,171.0,Non-Hispanic White,1.0,...,65205.582912,66697.698374,68753.196442,66511.689165,66963.711142,65276.256929,66462.199596,65844.480121,66786.805379,64697.36934
2804,2805,1.0,Exam,2.0,Female,92.0,4.0,5.0,Non-Hispanic White,1.0,...,19703.69122,19768.528033,20251.991653,0.0,19811.852195,19631.498136,19809.86465,19699.759843,19943.782217,19558.144011
2905,2906,1.0,Exam,1.0,Female,103.0,799.0,799.0,Non-Hispanic Black,2.0,...,12157.724627,12083.79903,11835.723847,12130.697362,12334.414605,12227.239459,12182.904838,12284.984275,12183.762035,12917.30742
3201,3202,1.0,Both,2.0,Male,103.0,967.0,968.0,Non-Hispanic Black,2.0,...,9187.345128,9080.02308,8986.031938,9369.506905,9501.816854,9075.378231,9268.66133,9313.423583,9145.223146,0.0
3539,3540,1.0,exam,2.0,Male,87.0,11.0,12.0,Mexican American,3.0,...,2169.959959,2128.801437,2128.801437,2150.382372,2173.265526,2260.477542,2136.463309,2144.124165,2128.801437,2128.801437
6006,6007,1.0,Exam,2.0,Female,89.0,98.0,99.0,Non-Hispanic Black,2.0,...,9514.341959,9546.152415,9318.767931,9493.772915,9671.821321,9498.674311,9696.457245,9805.370221,9732.910624,0.0
7926,7927,1.0,Exam,1.0,Female,109.0,929.0,930.0,Non-Hispanic White,1.0,...,47688.48813,48213.597041,49879.678868,49152.353166,49217.799507,47224.199855,47340.856093,47782.976887,48269.267217,47725.380232
8185,8186,1.0,Both,2.0,Male,99.0,891.0,891.0,Non-Hispanic White,1.0,...,27548.975362,27688.425385,28474.074511,28126.913012,28423.025782,27072.584532,27235.834034,27631.35514,27771.557068,27344.369183
8491,8492,1.0,Both Interviewed and MEC examined,2.0,Female,105.0,1002.0,1003.0,Non-Hispanic White,1.0,...,21811.797393,22051.972111,22814.005899,22481.34111,22511.275007,21599.440865,21652.797185,21855.014733,22077.434579,21828.671131


In [63]:
demographics.loc[age_mask, 'RIDAGEYR'] = 85

In [64]:
age_mask = demographics.RIDAGEYR > 85
demographics[age_mask]

Unnamed: 0,SEQN,SDDSRVYR,RIDSTATR,RIDEXMON,RIAGENDR,RIDAGEYR,RIDAGEMN,RIDAGEEX,RIDRETH1,RIDRETH2,...,WTIREP43,WTIREP44,WTIREP45,WTIREP46,WTIREP47,WTIREP48,WTIREP49,WTIREP50,WTIREP51,WTIREP52
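
The capping step above can also be written with `Series.clip`, which needs no boolean mask. The following is a minimal sketch on a made-up frame: the `toy` data and the `AGE_CAP` name are illustrative assumptions and are not part of the NHANES files.

```python
import pandas as pd

AGE_CAP = 85  # assumed top-code, matching the value used in the cell above

# Hypothetical stand-in for the demographics frame
toy = pd.DataFrame({'SEQN': [1, 2, 3, 4],
                    'RIDAGEYR': [34.0, 95.0, 85.0, 103.0]})

# clip(upper=...) replaces every value above the cap with the cap itself
toy['RIDAGEYR'] = toy['RIDAGEYR'].clip(upper=AGE_CAP)

# No ages above the cap should remain
assert (toy['RIDAGEYR'] > AGE_CAP).sum() == 0
```

Either form discards the true age of the oldest respondents, so it is worth keeping the original file untouched.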


<a id="7"> </a>
7: Automating Data Preparation
------------------------------

- [7.1: Testing](#7.1)
- [7.2: Automation](#7.2)

<a id="7.1"> </a>
### 7.1: Testing ###

In [68]:
# Test that array shapes are as expected
np.testing.assert_array_equal(demographics.shape, (9760, 144))
np.testing.assert_array_equal(bodymeasures.shape, (9278, 38))
np.testing.assert_array_equal(occupations.shape, (7749, 37))

(7749, 37)

In [71]:
# Test that the joined dataframe has the correct number of columns
dataset = pd.merge(demographics, bodymeasures, on='SEQN', how='inner')
dataset = pd.merge(dataset, occupations, on='SEQN', how='inner')

expected_total_columns = (demographics.shape[1]
                       + bodymeasures.shape[1]
                       + occupations.shape[1]
                       - 2)
np.testing.assert_equal(dataset.shape[1], expected_total_columns)

In [72]:
# Test that the number of rows in the joined dataframe is less than or equal to
# the rows in demographics (an inner join cannot add rows as long as SEQN is
# unique in each table)
assert dataset.shape[0] <= demographics.shape[0]
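
The row-count test relies on `SEQN` being unique in every table; a repeated key would multiply rows even under an inner join. As a hedged sketch, `pd.merge` can check this itself through its `validate` argument. The `left` and `right` frames below are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for two of the tables
left = pd.DataFrame({'SEQN': [1, 2, 3], 'RIDAGEYR': [30, 41, 85]})
right = pd.DataFrame({'SEQN': [2, 3, 4], 'BMXWT': [70.1, 55.3, 80.0]})

# validate='one_to_one' raises pandas.errors.MergeError if SEQN repeats
# in either table, so the join fails loudly instead of duplicating rows
joined = pd.merge(left, right, on='SEQN', how='inner', validate='one_to_one')

# With unique keys, an inner join cannot return more rows than either input
assert len(joined) <= min(len(left), len(right))
# One shared key column, so the column counts add up minus one
assert joined.shape[1] == left.shape[1] + right.shape[1] - 1
```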

<a id="7.2"> </a>
### 7.2: Automation ###

In [74]:
# Strip whitespace
for column in demographics.columns:
    try:
        demographics.loc[:,column] = demographics.loc[:,column].str.strip()
    except AttributeError:  # non-string columns
        pass

In [75]:
# In columns containing discrete data, replace invalid values with NaN
VALID_CODES = {
    'DMDBORN': [1, 2, 3, 7, 9],
    'DMDCITZN': [1, 2, 7, 9]
}
for column, values in VALID_CODES.items():
    valid_mask = demographics[column].isin(values)
    demographics.loc[~valid_mask, column] = np.nan

In [None]:
# In columns containing a range of data, replace out-of-range values with NaN
# (BMXWT and BMXHT are columns of bodymeasures, not demographics)
VALID_RANGE = {
    'BMXWT': (0, 635),
    'BMXHT': (81.8, 201.3),
}
for column, values in VALID_RANGE.items():
    valid_mask = ((bodymeasures[column] >= values[0])
               & (bodymeasures[column] <= values[1]))
    bodymeasures.loc[~valid_mask, column] = np.nan
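
The range check can be wrapped in a small reusable function. This is a sketch only: `mask_out_of_range` is a hypothetical helper name, and the `toy` frame is made up for illustration. It uses `Series.between` (inclusive on both ends by default) and `Series.where`.

```python
import pandas as pd

def mask_out_of_range(frame, valid_range):
    """Return a copy of frame with out-of-range values replaced by NaN."""
    cleaned = frame.copy()
    for column, (low, high) in valid_range.items():
        # between() is inclusive on both ends by default; NaN compares as False
        in_range = cleaned[column].between(low, high)
        # where() keeps values where the condition holds and writes NaN elsewhere
        cleaned[column] = cleaned[column].where(in_range)
    return cleaned

# Hypothetical weights: two plausible values and two impossible ones
toy = pd.DataFrame({'BMXWT': [3.1, -4.0, 700.0, 63.0]})
cleaned = mask_out_of_range(toy, {'BMXWT': (0, 635)})
```

Returning a copy leaves the input frame unchanged, which makes the step easier to test than in-place assignment.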

In [78]:
# Find columns that contain too many missing values (candidates for removal)
MAX_MISSING_PROPORTION = 0.3
non_missing_rows = demographics.count()
total_rows = len(demographics.index)
missing_rows = total_rows - non_missing_rows
missing_rows_proportion = missing_rows / total_rows

missing_mask = missing_rows_proportion > MAX_MISSING_PROPORTION
demographics.columns[missing_mask]

Index(['DMQMILIT', 'DMDBORN', 'DMDCITZN', 'DMDYRSUS', 'DMDEDUC3', 'DMDEDUC2',
       'DMDSCHOL', 'DMDMARTL', 'RIDEXPRG', 'RIDPREG', 'DMDHSEDU', 'DMAETHN',
       'DMARACE'],
      dtype='object')
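
The cell above only lists the sparse columns. One possible way to actually drop them is sketched below on a made-up frame; the `toy` data is an assumption, and `isna().mean()` is used as a shorter route to the same proportion.

```python
import numpy as np
import pandas as pd

MAX_MISSING_PROPORTION = 0.3

# Hypothetical frame: one sparse column, one mostly complete column
toy = pd.DataFrame({
    'SEQN': [1, 2, 3, 4],
    'DMQMILIT': [np.nan, np.nan, 1.0, np.nan],  # 75% missing
    'RIDAGEYR': [34.0, np.nan, 85.0, 12.0],     # 25% missing
})

# isna().mean() gives the fraction of missing entries per column
missing_proportion = toy.isna().mean()
too_sparse = missing_proportion[missing_proportion > MAX_MISSING_PROPORTION].index

# drop() returns a new frame; the original is left untouched
trimmed = toy.drop(columns=too_sparse)
```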

<a id="8"> </a>
8: Feature Selection
-----------------------

- [8.1: Using statistical measures](#8.1)

<a id="8.1"> </a>
### 8.1: Using statistical measures ###

In [82]:
# Let's start by taking a look at the correlation between a few variables:
bmx_subset = bodymeasures[['BMXWT', 'BMXHT', 'BMXARMC']]
bmx_subset.corr()

Unnamed: 0,BMXWT,BMXHT,BMXARMC
BMXWT,1.0,0.792439,0.962957
BMXHT,0.792439,1.0,0.717806
BMXARMC,0.962957,0.717806,1.0


In [85]:
# Here are all the correlations
correlations = bodymeasures.corr()
correlations

Unnamed: 0,SEQN,BMAEXLEN,BMAEXSTS,BMAEXCMT,BMXWT,BMIWT,BMXRECUM,BMIRECUM,BMXHEAD,BMIHEAD,...,BMISUB,BMAAMP,BMAUREXT,BMAUPREL,BMAULEXT,BMAUPLEL,BMALOREX,BMALORKN,BMALLEXT,BMALLKNE
SEQN,1.0,-0.005587,0.006266,0.036363,0.010767,0.013072,-0.024519,,0.070769,,...,-0.03524,0.009234,,,,,,,0.206894,
BMAEXLEN,-0.005587,1.0,-0.261646,0.428511,0.341805,0.115279,0.283557,,0.110186,,...,0.139543,-0.004788,,,,,,,0.969201,
BMAEXSTS,0.006266,-0.261646,1.0,-0.166574,-0.014908,0.013982,0.050508,,-0.105523,,...,0.004511,0.000909,,,,,,,,
BMAEXCMT,0.036363,0.428511,-0.166574,1.0,0.099935,-0.808746,-0.697847,,,,...,0.577288,,,,,,,,,
BMXWT,0.010767,0.341805,-0.014908,0.099935,1.0,0.184963,0.939712,,0.721255,,...,0.256605,-0.008648,,,,,,,,
BMIWT,0.013072,0.115279,0.013982,-0.808746,0.184963,1.0,,,,,...,0.12356,-0.02643,,,,,,,1.0,
BMXRECUM,-0.024519,0.283557,0.050508,-0.697847,0.939712,,1.0,,0.706535,,...,-0.35463,,,,,,,,,
BMIRECUM,,,,,,,,,,,...,,,,,,,,,,
BMXHEAD,0.070769,0.110186,-0.105523,,0.721255,,0.706535,,1.0,,...,-1.0,,,,,,,,,
BMIHEAD,,,,,,,,,,,...,,,,,,,,,,


In [86]:
# Let's get just the correlations against height
height_corrs = correlations.loc['BMXHT',:]
height_corrs

SEQN       -0.001174
BMAEXLEN    0.184662
BMAEXSTS   -0.016523
BMAEXCMT    0.062168
BMXWT       0.792439
BMIWT       0.109128
BMXRECUM    0.976458
BMIRECUM         NaN
BMXHEAD          NaN
BMIHEAD          NaN
BMXHT       1.000000
BMIHT            NaN
BMXBMI      0.502305
BMXLEG      0.787780
BMILEG           NaN
BMXCALF     0.529484
BMICALF          NaN
BMXARML     0.953757
BMIARML          NaN
BMXARMC     0.717806
BMIARMC          NaN
BMXWAIST    0.674774
BMIWAIST         NaN
BMXTHICR    0.443504
BMITHICR         NaN
BMXTRI      0.233100
BMITRI      0.293590
BMXSUB      0.427107
BMISUB      0.092885
BMAAMP     -0.002982
BMAUREXT         NaN
BMAUPREL         NaN
BMAULEXT         NaN
BMAUPLEL         NaN
BMALOREX         NaN
BMALORKN         NaN
BMALLEXT         NaN
BMALLKNE         NaN
Name: BMXHT, dtype: float64

In [87]:
# We can filter to just the strongly correlated relations
# (positive or negative, hence the absolute value)
STRONG_CORRELATION = 0.7
strong_correlation_mask = height_corrs.abs() > STRONG_CORRELATION
height_corrs[strong_correlation_mask]

BMXWT       0.792439
BMXRECUM    0.976458
BMXHT       1.000000
BMXLEG      0.787780
BMXARML     0.953757
BMXARMC     0.717806
Name: BMXHT, dtype: float64

In [88]:
# Or we could just list the column names that match the criteria
bodymeasures.columns[strong_correlation_mask]

Index(['BMXWT', 'BMXRECUM', 'BMXHT', 'BMXLEG', 'BMXARML', 'BMXARMC'], dtype='object')

In [90]:
# Let's create a selection that comprises our mask plus the index column SEQN
# (copy first so that strong_correlation_mask itself is not modified)
selection_mask = strong_correlation_mask.copy()
selection_mask['SEQN'] = True

In [91]:
# Now we can create a new dataframe with just the selected columns
subset = bodymeasures.loc[:,selection_mask]
subset.describe()

Unnamed: 0,SEQN,BMXWT,BMXRECUM,BMXHT,BMXLEG,BMXARML,BMXARMC
count,9278.0,9166.0,1079.0,8450.0,7214.0,8974.0,8969.0
mean,4992.804592,59.967574,80.244486,156.262888,39.91626,32.760464,28.053841
std,2869.638216,29.841072,13.927584,22.382665,4.149213,7.426678,7.47245
min,1.0,3.1,44.9,81.8,23.6,9.1,10.8
25%,2517.25,39.2,69.15,151.1,37.4,30.8,22.8
50%,4988.5,63.0,81.3,161.8,40.0,35.1,28.8
75%,7482.75,79.6,91.8,170.6,42.7,37.6,33.2
max,9965.0,193.3,110.3,201.3,55.0,46.5,58.2
