# COGS 108 - Data Checkpoint

# Names

- Xingyu Chen
- Rosie Peng
- Zhekai Wang
- Yuchen Guo
- Zeyu Liu

<a id='research_question'></a>
# Research Question

How does people’s health consciousness affect the healthy degree of their weight in the U.S. in 2017? 

- Health consciousness is defined as the degree to which people value their health. In this context, deliberate behaviors that improve one's health are considered health-conscious (for example, having health insurance and exercising regularly). Conversely, actions that are detrimental to health imply a lack of health consciousness (for example, smoking or drinking). How we quantify it is discussed below.

- The healthy degree of one's weight is defined based on the standard BMI scale and indicates how healthy one’s weight is. On the BMI scale, the healthy range is [18.5, 25). With a BMI outside this range, one is considered underweight or overweight, which is less healthy. If one’s BMI is 30 or higher, we consider one obese, which is morbid and should be treated as a disease. How we use BMI to measure this variable is discussed in the following parts.
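
The BMI ranges described above can be sketched as a small helper. This is only an illustration under the standard adult cutoffs; the function name `bmi_category` and the label strings are hypothetical choices for this example and are not part of the dataset.

```python
def bmi_category(bmi):
    """Classify an adult BMI value using the standard cutoffs.

    Assumed boundaries: below 18.5 is underweight, [18.5, 25) is healthy,
    [25, 30) is overweight, and 30 or above is obese.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy"
    if bmi < 30:
        return "overweight"
    return "obese"


# A few sample BMI values, including the center of the healthy range.
for value in (17.9, 21.75, 29.30, 35.44):
    print(value, bmi_category(value))
```

Half-open intervals keep every BMI value in exactly one category, so a BMI of exactly 25 counts as overweight and exactly 30 counts as obese.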

### Hypothesis

Greater health consciousness causes people to have a healthier weight (concepts defined above).

- We have this expectation because keeping a healthy body weight is an important part of personal health, and it seems that people who intentionally maintain their health usually have a healthier body weight. We want to test whether there is a cause-and-effect relationship.


**(We lost points in the research question part and hypothesis part in proposal, and we addressed the feedback here.)**

# Dataset(s)

**(We lost points in the data part in proposal. Addressed the feedback here.)**

- Dataset Name: HARMONIZED DATA FROM U.S. NATIONAL HEALTH SURVEYS
- Link to the dataset: https://nhis.ipums.org/nhis/ (This is the link to the database; we created the data extract ourselves.)
- Number of observations: 78,132

This dataset contains harmonized data from an individual-level survey, distributed by IPUMS, that collects information on the health of the U.S. population, including demographics, health behaviors, and health insurance. It is structured as a time series, but to have complete coverage of the variables we need, we chose the 2017 sample. Each observation in the dataset is an individual.



### Variable List
We have 27 variables in the dataset.

| Variable Name | Description | Type | Values Code |
|---|---|---|---|
| YEAR | year | Numerical | 2017 only |
| REGION |region in U.S. of residence|Categorical|1=Northeast, 2=North Central/Midwest, 3=South, 4=West|
|NHISPID|NHIS unique identifier for person|Numerical|Unique 14-digit value for each individual|
|SAMPWEIGHT|sample person weight|Numerical|inverse probability of selection into the interview; used as weight variable|
|AGE|age|Numerical|0-85; age 85+ are recorded as 85|
|SEX|sex|Categorical|1=Male, 2=Female|
|BMI|body mass index|Numerical|0=Missing, 99.99=Unknown; calculated for age 18+ |
|BMICAT|body mass index, categorical|Categorical|1=underweight, 2=normal, 3=overweight, 4=obese, 9=unknown|
|HINOTCOVE|whether the person lacks health insurance coverage|Categorical|1=has coverage, 2=no coverage, 9=don't know|
|HINONE|whether the person didn't currently have any health insurance coverage|Categorical|1=has some type of insurance, 2=has not, 7=refuse to answer, 9=don't know|
|HIP1COST|amount spent for insurance premiums|Numerical|0=not apply to this question, 99997=refused, 99998=not certain, 99999=don't know|
|ALC5UPYR|total number of days during the past 12 months where one had 5 or more alcoholic drinks|Numerical|990+ =missing for reasons|
|ALCAMT|average number of alcoholic beverages consumed by one on days that they drink|Numerical|96+ =missing; 0=not apply|
|CIGDAYMO|number days smoked in past 30 days|Numerical|90+ =missing|
|SMOKFREQNOW|smoke every day, some days, or not at all now|Categorical|0=not apply, 1=Not at all, 2=some days, 3=everyday, 5+ =missing|
|SMKLSFREQNOW|frequency of smokeless tobacco use now(chewing tobacco, snuff, dip, etc.)|Categorical|0=never use/not apply, 1=everyday, 2=some days, 4=not at all now|
|QUITNO|time since quit smoking: number of units|Numerical|should use with unit types variable; 0=not apply, 900+ =missing; 95+ recorded as 95|
|QUITTP|time since quit smoking: time period unit|Categorical|1=days, 2=weeks, 3=months, 4=years, 0=not apply, 6+ =missing|
|MOD10FWK|frequency of moderate leisure-time physical activity 10+ minutes: times per week|Numerical|0=not apply, 94=less than once per week, 95=never, 96=unable to do, 97+ =missing|
|VIG10FWK|frequency of vigorous leisure-time physical activity 10+ minutes: times per week|Numerical|Same as MOD10FWK above|
|STRONGFWK|frequency of leisure-time strengthening activity: times per week|Numerical|Same as MOD10FWK above|
|DIETDYR|ever told by health professional to reduce fat or calories in diet, past year|Categorical|0=not apply, 1=No, 2=Yes, 5+ =missing|
|WTPROGDYR|ever told by health professional to participate in weight loss program, past year|Categorical|Same as above|
|DIETNOW|currently reducing the amount of fat or calories in diet|Categorical|Same as above|
|WTPROGNOW|currently participating in a weight loss program|Categorical|Same as above|
|HRSLEEP|usual hours sleep per day|Numerical|0=not apply, 30+ =missing|
|PCLOOKHELYR|looked up health information on Internet, past year|Categorical|1=No, 2=Yes, 0=not apply, 5+ =missing|


- _"not apply to this question"_ indicates that the value is missing because the question does not apply to the individual's situation. For example, someone who reports that they never drink has no number of alcoholic drinks to report, so they do not need to answer this question in the survey, and the corresponding variable ALCAMT is recorded as 0. Also, most questions are asked only of sampled adults (age 18+), so individuals who are not sampled or are under 18 can have this value.
- _"missing for reasons"/"missing"_ indicates that the value is missing because the individual refused to answer, was uncertain about the question, or didn't know.
- Variables that are included in the dataset but will not be used in the analysis are not listed in this table, as they are either unrelated to this project or redundant with other variables. They will be removed in the data cleaning part below. Descriptions of these variables can be found on the database website. Variables to be removed: PX, PERWEIGHT, ASTATFLG, CSTATFLG, MCAIDPREM, QUITYRS
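
The variable list describes SAMPWEIGHT as the weight variable. As a minimal sketch of what weighting means, the snippet below computes a weighted mean in plain Python. The helper name `weighted_mean` is hypothetical, and the three BMI and SAMPWEIGHT pairs are copied from rows shown in the outputs later in this notebook purely for illustration; how the weights are used in the final analysis is not decided here.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, with each value counted in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Three illustrative (BMI, SAMPWEIGHT) pairs.
bmi = [29.30, 35.44, 43.13]
sampweight = [5044.0, 17305.0, 7383.0]

unweighted = sum(bmi) / len(bmi)
weighted = weighted_mean(bmi, sampweight)
print(round(unweighted, 2), round(weighted, 2))
```

A respondent with a larger SAMPWEIGHT stands in for more people in the population, so that respondent's value pulls the weighted mean more strongly than in the unweighted mean.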

# Setup

In [27]:
import pandas as pd
import numpy as np
import requests

In [28]:
# Read the CSV file. The raw GitHub URL carries an access token, so we load a local copy of the extract instead.
url='https://raw.githubusercontent.com/COGS108/Group_Sp23_cogs108project/master/nhis_2017.csv?token=GHSAT0AAAAAACA5LIDDCU3SKR3FCFZXO3CWZDELJ3Q'
# df = pd.read_csv(url, index_col=0)
df = pd.read_csv("nhis_2017.csv")
df.head(10)

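| | YEAR | REGION | NHISPID | PX | PERWEIGHT | SAMPWEIGHT | ASTATFLG | CSTATFLG | AGE | SEX | ... | QUITYRS | MOD10FWK | VIG10FWK | STRONGFWK | DIETDYR | WTPROGDYR | DIETNOW | WTPROGNOW | HRSLEEP | PCLOOKHELYR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 3 | 20170000030101 | 1 | 4000 | 5044.0 | 1 | 0 | 65 | 2 | ... | 96 | 95 | 7 | 95 | 1 | 1 | 2 | 1 | 8 | 2 |
| 1 | 2017 | 3 | 20170000080101 | 1 | 6112 | 0.0 | 2 | 0 | 27 | 2 | ... | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2017 | 3 | 20170000080102 | 2 | 3778 | 4808.0 | 0 | 1 | 10 | 1 | ... | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017 | 2 | 20170000090101 | 1 | 3143 | 3770.0 | 1 | 0 | 19 | 1 | ... | 96 | 98 | 3 | 3 | 2 | 1 | 2 | 1 | 6 | 2 |
| 4 | 2017 | 2 | 20170000110101 | 1 | 4766 | 0.0 | 3 | 0 | 43 | 2 | ... | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2017 | 2 | 20170000110102 | 2 | 4514 | 17305.0 | 1 | 0 | 45 | 1 | ... | 96 | 2 | 94 | 2 | 2 | 1 | 2 | 1 | 5 | 1 |
| 6 | 2017 | 2 | 20170000110103 | 3 | 5298 | 0.0 | 3 | 0 | 20 | 2 | ... | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2017 | 2 | 20170000110104 | 4 | 4070 | 4027.0 | 0 | 1 | 13 | 1 | ... | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2017 | 2 | 20170000150101 | 1 | 4960 | 7383.0 | 1 | 0 | 67 | 2 | ... | 5 | 7 | 3 | 3 | 1 | 1 | 2 | 1 | 8 | 2 |
| 9 | 2017 | 3 | 20170000180101 | 1 | 3868 | 8314.0 | 1 | 0 | 40 | 1 | ... | 96 | 3 | 95 | 95 | 1 | 1 | 1 | 1 | 8 | 1 |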


# Data Cleaning

- This dataset uses certain values to represent missingness, so we cannot simply use dropna() in pandas. Instead, we exclude certain values.
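
An alternative to filtering row by row is to turn the special codes into `NaN` first, after which `dropna()` does work. The sketch below shows the idea on a toy frame. The dictionary `MISSING_FROM`, the helper `mask_special_codes`, and the toy values are assumptions made up for this example; the thresholds are read off the variable list above for three of the variables only.

```python
import pandas as pd

# Assumed rule for these three variables (from the variable list):
# 0 means "not apply", and codes at or above the threshold mean "missing".
MISSING_FROM = {"HRSLEEP": 30, "DIETDYR": 5, "MOD10FWK": 97}


def mask_special_codes(frame, missing_from, not_apply=0):
    """Return a copy of `frame` where 'not apply' and 'missing' codes become NaN."""
    out = frame.copy()
    for col, threshold in missing_from.items():
        special = (out[col] == not_apply) | (out[col] >= threshold)
        out[col] = out[col].mask(special)
    return out


# Toy data, not real survey rows.
toy = pd.DataFrame({
    "HRSLEEP": [8, 0, 97, 6],
    "DIETDYR": [1, 2, 0, 9],
    "MOD10FWK": [95, 3, 98, 94],
})

masked = mask_special_codes(toy, MISSING_FROM)
cleaned = masked.dropna()
print(cleaned)
```

Note that 94, 95, and 96 in MOD10FWK are kept here because they carry real answers (less than once per week, never, unable to do) rather than missingness.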

In [29]:
## drop variables we do not use; the alcohol and smoking variables are also dropped because they are left out of the index for now
data = df.drop(columns=["PX", "PERWEIGHT", "ASTATFLG", "CSTATFLG", "MCAIDPREM", "QUITYRS","ALC5UPYR","ALCAMT","CIGDAYMO","QUITNO","QUITTP","SMOKFREQNOW","SMKLSFREQNOW"])
## remove rows where BMI is 0 (missing) or 99.99 (unknown)
data = data.loc[(data['BMI']!=0) & (data['BMI']<99)]

## for HINOTCOVE, 9 = dont know we need to get rid of that
data = data.loc[(data['HINOTCOVE']!=9)]

## for HINONE, 7 = refused to answer and 9 = don't know, so we remove those rows
## (each comparison needs its own parentheses because & binds more tightly than !=)
data = data.loc[(data['HINONE']!=9) & (data['HINONE']!=7)]

## for HIP1COST, 99997 and above are refused/not certain/don't know and 0 = not apply, so we remove them
data = data.loc[(data['HIP1COST']<99997) & (data['HIP1COST']!=0)]

# ## ALC5UPYR over 990 are useless data
# data = data.loc[(data['ALC5UPYR']<990)]

# ## ALCAMT over 96 are useless
# data = data.loc[(data['ALCAMT']<96)]

# ## CIGDAYMO over 90 are missing data
# data = data.loc[(data['CIGDAYMO']<90)]

# ## QUITNO over 900 are missing
# data = data.loc[(data['QUITNO']<900)]

# ## QUITTP over 6 are missing
# data = data.loc[(data['QUITTP']<6)]

# ## SMOKFREQNOW over 5 are missing data
# data = data.loc[(data['SMOKFREQNOW']<5)]

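## For MOD10FWK, 0=not apply and 97+ are missing values, so we remove them.
data = data.loc[(data['MOD10FWK']!=0) & (data['MOD10FWK']<97)]
def MOD_PA(num):
    ## recode the special activity codes into times per week
    if num >= 95:
        ## 95 = never, 96 = unable to do
        return 0
    elif num == 94:
        ## 94 = less than once per week, counted as 1
        return 1
    else:
        return num
data['MOD10FWK'] = data['MOD10FWK'].apply(MOD_PA)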

## For VIG10FWK, 0=not apply and 97+ are missing values, so we remove them.
data = data.loc[(data['VIG10FWK']!=0) & (data['VIG10FWK']<97)]
data['VIG10FWK'] = data['VIG10FWK'].apply(MOD_PA)

## For STRONGFWK, 0=not apply and 97+ are missing values, so we remove them.
data = data.loc[(data['STRONGFWK']!=0) & (data['STRONGFWK']<97)]
data['STRONGFWK'] = data['STRONGFWK'].apply(MOD_PA)

## For DIETDYR, 0=not apply and 5+ are missing values, so we remove them.
data = data.loc[(data['DIETDYR']!=0) & (data['DIETDYR']<5)]

## For WTPROGDYR, 0=not apply and 5+ are missing values, so we remove them.
data = data.loc[(data['WTPROGDYR']!=0) & (data['WTPROGDYR']<5)]

## For DIETNOW, 0=not apply and 5+ are missing values, so we remove them.
data = data.loc[(data['DIETNOW']!=0) & (data['DIETNOW']<5)]

## For WTPROGNOW, 0=not apply and 5+ are missing values, so we remove them.
data = data.loc[(data['WTPROGNOW']!=0) & (data['WTPROGNOW']<5)]

## For HRSLEEP, 0=not apply and 30+ are missing values, so we remove them.
data = data.loc[(data['HRSLEEP']!=0) & (data['HRSLEEP']<30)]

## For PCLOOKHELYR, 0=not apply and 5+ are missing values, so we remove them.
data = data.loc[(data['PCLOOKHELYR']!=0) & (data['PCLOOKHELYR']<5)]

## since the insurance cost gap is huge, we change it into categorical data, where 1 = paying $1000 or more, 0 = under $1000, -1 = not paying at all
## (rows with HIP1COST == 0 were already removed above, so -1 does not occur after this cleaning)
def cost_cat(cost):
    if cost>= 1000:
        return 1
    elif cost<=0:
        return -1
    else:
        return 0
data['HIP1COST'] = data['HIP1COST'].apply(cost_cat)
data

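| | YEAR | REGION | NHISPID | SAMPWEIGHT | AGE | SEX | BMI | BMICAT | HINOTCOVE | HINONE | HIP1COST | MOD10FWK | VIG10FWK | STRONGFWK | DIETDYR | WTPROGDYR | DIETNOW | WTPROGNOW | HRSLEEP | PCLOOKHELYR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 3 | 20170000030101 | 5044.0 | 65 | 2 | 29.30 | 3 | 1 | 1 | 1 | 0 | 7 | 0 | 1 | 1 | 2 | 1 | 8 | 2 |
| 5 | 2017 | 2 | 20170000110102 | 17305.0 | 45 | 1 | 35.44 | 4 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 5 | 1 |
| 8 | 2017 | 2 | 20170000150101 | 7383.0 | 67 | 2 | 43.13 | 4 | 1 | 1 | 1 | 7 | 3 | 3 | 1 | 1 | 2 | 1 | 8 | 2 |
| 9 | 2017 | 3 | 20170000180101 | 8314.0 | 40 | 1 | 32.27 | 4 | 1 | 1 | 1 | 3 | 0 | 0 | 1 | 1 | 1 | 1 | 8 | 1 |
| 29 | 2017 | 2 | 20170000260101 | 5164.0 | 70 | 1 | 26.62 | 3 | 1 | 1 | 1 | 2 | 0 | 2 | 1 | 1 | 2 | 2 | 8 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78113 | 2017 | 1 | 20170588690101 | 4341.0 | 78 | 2 | 19.84 | 2 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 7 | 1 |
| 78121 | 2017 | 4 | 20170588800102 | 6082.0 | 80 | 2 | 20.51 | 2 | 1 | 1 | 1 | 4 | 0 | 0 | 1 | 1 | 2 | 1 | 8 | 2 |
| 78122 | 2017 | 3 | 20170588810101 | 5557.0 | 70 | 2 | 36.01 | 4 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 8 | 2 |
| 78125 | 2017 | 1 | 20170588830101 | 7833.0 | 67 | 1 | 34.45 | 4 | 1 | 1 | 1 | 2 | 0 | 0 | 2 | 2 | 2 | 1 | 7 | 2 |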


### Construct Variables

- **health consciousness index**: First, we categorize the health-related variables as conscious or unconscious, depending on whether their effects on one's health are positive or negative. Then we sum them up to get the health consciousness index. The higher its value, the more health-conscious the individual is considered to be.
- **healthy degree of weight**: In the categorical version, we reassign values to the categorical BMI: 1=healthy, 0=not healthy, -1=morbid. In the numerical version, we use the BMI variable and give the center of the healthy range (21.75) a "healthy score" of 100, so the score is 100 - (absolute distance to 21.75).
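
To make the numerical score concrete, the short sketch below evaluates it at the standard BMI boundaries. The names `CENTER` and `healthy_score` are hypothetical and used only for this illustration; the formula is the one stated in the bullet above.

```python
CENTER = 21.75  # midpoint of the healthy BMI range [18.5, 25)


def healthy_score(bmi):
    """Healthy score as defined above: 100 minus the absolute distance to 21.75."""
    return 100 - abs(CENTER - bmi)


boundaries = [
    ("lower healthy bound", 18.5),
    ("upper healthy bound", 25.0),
    ("obesity threshold", 30.0),
]
for label, bmi in boundaries:
    print(label, bmi, healthy_score(bmi))
# lower healthy bound 18.5 96.75
# upper healthy bound 25.0 96.75
# obesity threshold 30.0 91.75
```

Because the score depends only on the distance from 21.75, it is symmetric: a BMI of 13.5 gets the same score (91.75) as a BMI of 30, so the score alone does not say whether someone is under or over the healthy range.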

In [30]:
##creating the health consciousness index column based on the health-related variables
##removed from the formula -data['ALC5UPYR']-data['ALCAMT']-data['CIGDAYMO']-data['SMOKFREQNOW']-data['SMKLSFREQNOW']+data['QUITNO']+data['QUITTP']
data['Health Consciousness index'] = data['HINOTCOVE']+data['HINONE']+data['HIP1COST']+data['MOD10FWK']+data['VIG10FWK']+data['STRONGFWK']+data['DIETDYR']+data['WTPROGDYR']+data['DIETNOW']+data['WTPROGNOW']+data['HRSLEEP']+data['PCLOOKHELYR']

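## creating the healthy degree of weight column based on BMI
data['Healthy degree of weight'] = 100-abs(21.75-data['BMI'])

## classify the score: 96 or above is healthy, from 89 up to (but not including) 96 is not healthy, and below 89 is morbid
## the normal BMI range 18.5-24.9 lies within 3.25 of the center 21.75, so 96 is used as an approximate healthy cutoff
## (a score of 96 or above means the BMI is within 4 of 21.75; a score below 89 means it is more than 11 away)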
def healthy_BMI(healthy_score):
    if healthy_score >= 96:
        return 1
    elif healthy_score <89:
        return -1
    else:
        return 0

## changing the categorical BMI into 1 = healthy, 0 = not healthy, -1 = morbid, based on the healthy degree of weight
data['BMICAT'] = data['Healthy degree of weight'].apply(healthy_BMI)
data


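| | YEAR | REGION | NHISPID | SAMPWEIGHT | AGE | SEX | BMI | BMICAT | HINOTCOVE | HINONE | ... | VIG10FWK | STRONGFWK | DIETDYR | WTPROGDYR | DIETNOW | WTPROGNOW | HRSLEEP | PCLOOKHELYR | Health Consciousness index | Healthy degree of weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 3 | 20170000030101 | 5044.0 | 65 | 2 | 29.30 | 0 | 1 | 1 | ... | 7 | 0 | 1 | 1 | 2 | 1 | 8 | 2 | 25 | 92.45 |
| 5 | 2017 | 2 | 20170000110102 | 17305.0 | 45 | 1 | 35.44 | -1 | 1 | 1 | ... | 1 | 2 | 2 | 1 | 2 | 1 | 5 | 1 | 20 | 86.31 |
| 8 | 2017 | 2 | 20170000150101 | 7383.0 | 67 | 2 | 43.13 | -1 | 1 | 1 | ... | 3 | 3 | 1 | 1 | 2 | 1 | 8 | 2 | 31 | 78.62 |
| 9 | 2017 | 3 | 20170000180101 | 8314.0 | 40 | 1 | 32.27 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 8 | 1 | 19 | 89.48 |
| 29 | 2017 | 2 | 20170000260101 | 5164.0 | 70 | 1 | 26.62 | 0 | 1 | 1 | ... | 0 | 2 | 1 | 1 | 2 | 2 | 8 | 2 | 23 | 95.13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78113 | 2017 | 1 | 20170588690101 | 4341.0 | 78 | 2 | 19.84 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 7 | 1 | 16 | 98.09 |
| 78121 | 2017 | 4 | 20170588800102 | 6082.0 | 80 | 2 | 20.51 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 2 | 1 | 8 | 2 | 22 | 98.76 |
| 78122 | 2017 | 3 | 20170588810101 | 5557.0 | 70 | 2 | 36.01 | -1 | 1 | 1 | ... | 1 | 0 | 2 | 1 | 2 | 1 | 8 | 2 | 19 | 85.74 |
| 78125 | 2017 | 1 | 20170588830101 | 7833.0 | 67 | 1 | 34.45 | -1 | 1 | 1 | ... | 0 | 0 | 2 | 2 | 2 | 1 | 7 | 2 | 21 | 87.30 |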


- Also, we rename the variables for convenience and readability.

In [31]:
## renaming the columns for readability
data = data.rename(columns = {"BMICAT":"BMI, categorical","HINOTCOVE":"Person Lack of Health insurance coverage","HINONE":"Person did not have insurancecoverage",
                       "HIP1COST":"amount spent for insurance premiums","ALC5UPYR":"had 5 or more alcoholic drink during the past 12 month",
                       "ALCAMT":"average number of alcoholic drink consumed by the day they had drink","CIGDAYMO":"number of days smoked in past 30 days",
                       "SMOKFREQNOW":"smoke frequency", "SMKLSFREQNOW":"use of smokeless tobacco product frequency","QUITNO":"time since quit smoking by number of unit",
                       "QUITTP":"time since quit somking by time period","MOD10FWK":"moderate leisure-time physical activity",
                       "VIG10FWK":"vigorous leisure-time physical activity","STRONGFWK":"leisure-time strengthening activity",
                       "DIETDYR":"told by health professional to diet","WTPROGDYR":"told by health professional to participate in weight loss",
                       "DIETNOW":"currently in diet","WTPROGNOW":"currently in a weight loss program","HRSLEEP":"hours of sleep","PCLOOKHELYR":"looked up health information on Internet"})
data

| | YEAR | REGION | NHISPID | SAMPWEIGHT | AGE | SEX | BMI | BMI, categorical | Person Lack of Health insurance coverage | Person did not have insurancecoverage | ... | vigorous leisure-time physical activity | leisure-time strengthening activity | told by health professional to diet | told by health professional to participate in weight loss | currently in diet | currently in a weight loss program | hours of sleep | looked up health information on Internet | Health Consciousness index | Healthy degree of weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 3 | 20170000030101 | 5044.0 | 65 | 2 | 29.30 | 0 | 1 | 1 | ... | 7 | 0 | 1 | 1 | 2 | 1 | 8 | 2 | 25 | 92.45 |
| 5 | 2017 | 2 | 20170000110102 | 17305.0 | 45 | 1 | 35.44 | -1 | 1 | 1 | ... | 1 | 2 | 2 | 1 | 2 | 1 | 5 | 1 | 20 | 86.31 |
| 8 | 2017 | 2 | 20170000150101 | 7383.0 | 67 | 2 | 43.13 | -1 | 1 | 1 | ... | 3 | 3 | 1 | 1 | 2 | 1 | 8 | 2 | 31 | 78.62 |
| 9 | 2017 | 3 | 20170000180101 | 8314.0 | 40 | 1 | 32.27 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 8 | 1 | 19 | 89.48 |
| 29 | 2017 | 2 | 20170000260101 | 5164.0 | 70 | 1 | 26.62 | 0 | 1 | 1 | ... | 0 | 2 | 1 | 1 | 2 | 2 | 8 | 2 | 23 | 95.13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 78113 | 2017 | 1 | 20170588690101 | 4341.0 | 78 | 2 | 19.84 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 7 | 1 | 16 | 98.09 |
| 78121 | 2017 | 4 | 20170588800102 | 6082.0 | 80 | 2 | 20.51 | 1 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 2 | 1 | 8 | 2 | 22 | 98.76 |
| 78122 | 2017 | 3 | 20170588810101 | 5557.0 | 70 | 2 | 36.01 | -1 | 1 | 1 | ... | 1 | 0 | 2 | 1 | 2 | 1 | 8 | 2 | 19 | 85.74 |
| 78125 | 2017 | 1 | 20170588830101 | 7833.0 | 67 | 1 | 34.45 | -1 | 1 | 1 | ... | 0 | 0 | 2 | 2 | 2 | 1 | 7 | 2 | 21 | 87.30 |


<a id='updated ethics'></a>
# Updated Ethics and Privacy

- Data: The data we use is available to everyone. We collected it from the NHIS website https://nhis.ipums.org/nhis/, which is free for everyone to use. Our data contains 26742 individuals, which is enough for us to make reliable inferences.

- Informed consent: We collect data from The National Health Interview Survey website, and we believe the data on this website is free for us to use.

- Privacy: Data anonymization and protection of individual privacy are priorities for our project. The data we gathered and use in the analysis is anonymous. It does not contain any names, addresses, or other personally identifying information. We use only NHISPID, the NHIS unique identifier, to distinguish individuals.

- Fairness and objectivity: There is no bias or discrimination in our data. We collect all data from the year 2017 without intentionally choosing certain groups or individuals. The data source, NHIS, collects information on the health, health care access, and health behaviors of the civilian, non-institutionalized U.S. population, which does not contain any biases. In our future analysis, we will not group individuals using any standards like sex, ethnicity, or race that may cause biases.

- Potential biases related to topic: Our topic, the relationship between health consciousness and healthy degree of weight, may be sensitive because many people with obesity face discrimination today. We will only analyze how health consciousness affects weight and will not express any emotions or attitudes towards the weight of individuals.

- Transparency: Our model is not a black box. We will show every step of our analysis to make our decisions and results interpretable.