In [10]:
import pandas as pd
import seaborn as sns
import janitor  # pyjanitor: registers the DataFrame.clean_names() method
sns.set()

In [16]:
# import the data and standardize column names with pyjanitor
demo_phq = pd.read_csv("../data/raw/DEMO_PHQ.csv").clean_names()
pag_hei = pd.read_csv("../data/raw/PAG_HEI.csv").clean_names()

# Data Dictionary

## DEMO_PHQ

- **seqn**: unique identifier (respondent)
- **dpq0_**: answers to the PHQ-9 questionnaire items; each column is expected to hold values from 0 to 3
- **riagendr**: gender
<span style="color:yellow">
  - 1: male
  - 2: female
</span>
- **ridageyr**: age in years
- **ridreth1**: race
<span style="color:yellow">
  - 1: white non-hispanic
  - 2: black non-hispanic
  - 3: mexican-american
  - 4: other
  - 5: other-hispanic
</span>
- **dmdeduc**: education level
<span style="color:yellow">
  - 1: less than 9 years
  - 2: 9 to 12 years
  - 3: high school
  - 4: incomplete higher education
  - 5: complete higher education
  - 7: refused to answer
  - 9: don't know
</span>
- **indfminc**: annual family income in US$
<span style="color:yellow">
  - 1: 0 - 4999
  - 2: 5000 - 9999
  - 3: 10000 - 14999
  - 4: 15000 - 19999
  - 5: 20000 - 24999
  - 6: 25000 - 34999
  - 7: 35000 - 44999
  - 8: 45000 - 54999
  - 9: 55000 - 64999
  - 10: 65000 - 74999
  - 11: >= 75000
  - 12: > 20000
  - 13: < 20000
  - 77: refused to answer
  - 99: don't know
</span>
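
The code lists above can be turned into readable labels with `Series.map`, and the expected 0 to 3 range of the `dpq0_` columns can be checked directly. This is a minimal sketch on a small hypothetical `sample` frame standing in for `demo_phq`. Treating the "refused" and "don't know" codes (7/9 and 77/99) as missing is an assumption, not something the dictionary prescribes.

```python
import pandas as pd

# Hypothetical sample rows standing in for demo_phq
sample = pd.DataFrame({
    "seqn": [1, 2, 3],
    "dpq010": [0.0, 3.0, None],
    "dpq020": [1.0, 2.0, 0.0],
    "riagendr": [1, 2, 2],
    "ridreth1": [1, 3, 5],
    "dmdeduc": [5, 7, 9],
    "indfminc": [11.0, 77.0, 99.0],
})

# Labels copied from the data dictionary above
gender_labels = {1: "male", 2: "female"}
race_labels = {
    1: "white non-hispanic",
    2: "black non-hispanic",
    3: "mexican-american",
    4: "other",
    5: "other-hispanic",
}

decoded = sample.assign(
    gender=sample["riagendr"].map(gender_labels),
    race=sample["ridreth1"].map(race_labels),
    # Assumption: non-informative answers are treated as missing
    dmdeduc=sample["dmdeduc"].mask(sample["dmdeduc"].isin([7, 9])),
    indfminc=sample["indfminc"].mask(sample["indfminc"].isin([77, 99])),
)

# Every PHQ-9 answer should be 0, 1, 2, 3 or missing
dpq_cols = [c for c in sample.columns if c.startswith("dpq0")]
dpq_valid = (sample[dpq_cols].isin([0, 1, 2, 3]) | sample[dpq_cols].isna()).all().all()
```

Keeping the original code columns next to the decoded ones makes it easy to trace a label back to its raw value.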


## PAG_HEI

- **seqn**: unique identifier (respondent)
- **pag_minw**: total weekly minutes of moderate-to-vigorous aerobic activity (PAG)
- **adherence**: adherence group:
<span style="color:yellow">
  - 1: low (< 150 min/week)
  - 2: adequate (150 - 300 min/week)
  - 3: above (> 300 min/week)
</span>
- **hei2015_**: Healthy Eating Index (HEI-2015) scores, with the following ranges:
<span style="color:yellow">
  - 0-5: vegetables, dark green vegetables and beans, fruits, whole fruits, proteins, seafood and plant proteins
  - 0-10: whole grains, dairy, fatty acids, sodium, refined grains, saturated fat, added sugar
  - 0-100: final score
</span>
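
The adherence thresholds listed above can be reproduced from `pag_minw`. The sketch below assumes that `adherence` is derived only from the weekly minutes and that missing minutes yield a missing group; the helper name `adherence_group` is hypothetical and the source does not confirm how the column was actually built.

```python
import numpy as np
import pandas as pd

def adherence_group(minutes: pd.Series) -> pd.Series:
    """Map weekly activity minutes to the 1/2/3 adherence codes."""
    m = minutes.to_numpy(dtype=float)
    # Conditions are evaluated in order: < 150 is low, then <= 300 is adequate,
    # everything else (> 300) is above
    codes = np.select([m < 150, m <= 300], [1, 2], default=3)
    # NaN fails both comparisons and would fall into the default, so blank it out
    return pd.Series(codes, index=minutes.index, dtype="float").where(minutes.notna())

# Hypothetical values covering both boundaries and a missing entry
minutes = pd.Series([0, 149.9, 150, 300, 300.1, None], dtype="float")
groups = adherence_group(minutes)
```

Comparing `groups` with the stored `adherence` column would show whether the boundaries (150 and 300 falling in group 2) match the data.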

# Data Pre-Processing

In [28]:
pag_hei_seqn_ids = pag_hei.seqn.to_list()
demo_phq_seqn_ids = demo_phq.seqn.to_list()

# comparing lists
pag_hei_to_demo_phq = set(pag_hei_seqn_ids) - set(demo_phq_seqn_ids)
print(len(pag_hei_to_demo_phq))

4090


**Questions**:
- Why are 4090 respondents present in PAG_HEI but missing from DEMO_PHQ?
- How are the unique identifiers of the two tables related to each other?
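
One way to start answering these questions is to compare the identifiers in both directions and to check for duplicates. This is a minimal sketch on hypothetical ID series; `demo_ids` and `hei_ids` stand in for `demo_phq.seqn` and `pag_hei.seqn`.

```python
import pandas as pd

# Hypothetical stand-ins for demo_phq.seqn and pag_hei.seqn
demo_ids = pd.Series([1, 2, 3, 4], name="seqn")
hei_ids = pd.Series([2, 3, 4, 5, 6], name="seqn")

only_in_hei = set(hei_ids) - set(demo_ids)    # in PAG_HEI, missing from DEMO_PHQ
only_in_demo = set(demo_ids) - set(hei_ids)   # in DEMO_PHQ, missing from PAG_HEI
shared = set(hei_ids) & set(demo_ids)         # present in both tables

# Duplicated identifiers would mean seqn is not a unique key
has_duplicates = bool(demo_ids.duplicated().any() or hei_ids.duplicated().any())
```

If `only_in_demo` is empty and there are no duplicates, every DEMO_PHQ respondent has exactly one PAG_HEI row, and the extra PAG_HEI rows simply have no questionnaire data.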

## Merging the data


In [34]:
demo_phq


Unnamed: 0,seqn,dpq010,dpq020,dpq030,dpq040,dpq050,dpq060,dpq070,dpq080,dpq090,riagendr,ridageyr,ridreth1,dmdeduc,indfminc
0,31130,,,,,,,,,,2,85,3,4,4.0
1,31131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,44,4,4,11.0
2,31132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,70,3,5,11.0
3,31134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,73,3,3,12.0
4,31139,0.0,0.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,2,18,2,3,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5329,41466,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2,58,5,2,3.0
5330,41468,0.0,2.0,0.0,1.0,1.0,2.0,1.0,3.0,0.0,2,66,1,1,8.0
5331,41469,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1,19,4,4,2.0
5332,41472,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,34,3,4,7.0


In [20]:

demo_phq.columns

Index(['seqn', 'dpq010', 'dpq020', 'dpq030', 'dpq040', 'dpq050', 'dpq060',
       'dpq070', 'dpq080', 'dpq090', 'riagendr', 'ridageyr', 'ridreth1',
       'dmdeduc', 'indfminc'],
      dtype='object')

In [32]:
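# PAG_HEI rows left after removing the respondents that are absent from DEMO_PHQ
pag_hei.shape[0] - len(pag_hei_to_demo_phq)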

5334
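
The merge itself is not shown in this section. One possible approach, sketched here on hypothetical frames `demo` and `hei`, is a left merge on `seqn` that keeps every questionnaire respondent. The `validate` and `indicator` options of `DataFrame.merge` are used to confirm that `seqn` is unique on both sides and to count the matches.

```python
import pandas as pd

# Hypothetical stand-ins for demo_phq and pag_hei
demo = pd.DataFrame({"seqn": [1, 2, 3], "ridageyr": [85, 44, 70]})
hei = pd.DataFrame({"seqn": [2, 3, 4, 5], "pag_minw": [0.0, 180.0, 320.0, 90.0]})

merged = demo.merge(
    hei,
    on="seqn",
    how="left",             # keep every row of demo, matched or not
    validate="one_to_one",  # raise MergeError if seqn repeats on either side
    indicator=True,         # add a _merge column: "both" or "left_only"
)

# Number of respondents with and without a matching PAG_HEI row
match_counts = merged["_merge"].value_counts()
```

Whether a left, inner, or outer merge is appropriate depends on how the respondents without PHQ-9 answers should be treated in the analysis.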