# Imports

In [1]:
import sys

import pandas as pd
import altair as alt
import numpy as np

# Skip this step if hdcd is already installed as a package; it only adds a local checkout to the import path
path_dir = "/Users/suxian/Data/Class/CSE583/HDCD"
sys.path.append(path_dir)

import hdcd

# Example Use on Chronic Disease Index (CDI) data

In [2]:
cdi_dummy = pd.read_csv("/Users/suxian/Data/Class/CSE583/HDCD/dummy_data/cdi_dummy.csv")

### A look at the CDI data, with a data dictionary for the key variables:
- YearStart : The year of each measurement.
- LocationAbbr : Abbreviation of the location, e.g. US for United States.
- LocationDesc : Full name of the location, e.g. United States.
- Topic : A higher-level category for the "Question" variable. E.g. every "Question" related to alcohol belongs to the Topic "Alcohol".
- Question : The main variable of interest. Usually comes from an annual survey.
- DataValue : The value recorded for the corresponding "Question"; can be numerical or categorical.
- DataValueType : Describes the type of the "DataValue" column, such as "Mean", "Age-adjusted Mean", "Crude Prevalence".
- Stratification : Stratification method applied to the "Question", such as "Gender", "Race/Ethnicity", or "Overall".
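Before relying on the dictionary above, it can help to confirm that a loaded frame actually carries these columns. The following is a minimal pandas sketch; the `toy` frame and its values are made-up placeholders standing in for `cdi_dummy`, not real CDI records.

```python
import pandas as pd

# Columns described in the data dictionary above.
DICTIONARY_COLUMNS = [
    "YearStart", "LocationAbbr", "LocationDesc", "Topic",
    "Question", "DataValue", "DataValueType", "Stratification",
]

# Hypothetical two-row frame (placeholder values, not CDI data).
# "Stratification" is left out on purpose to show what the check reports.
toy = pd.DataFrame({
    "YearStart": [2019, 2020],
    "LocationAbbr": ["AR", "OR"],
    "LocationDesc": ["Arkansas", "Oregon"],
    "Topic": ["Alcohol", "Alcohol"],
    "Question": ["Example question", "Example question"],
    "DataValue": [4.1, 3.8],
    "DataValueType": ["Mean", "Age-adjusted Mean"],
})

# Collect every documented column that the frame does not contain.
missing = [col for col in DICTIONARY_COLUMNS if col not in toy.columns]
print("Missing columns:", missing)
```

Swapping `toy` for the real dataframe gives a quick sanity check before calling any of the plotting functions.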

In [3]:
cdi_dummy.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,Response,DataValueUnit,DataValueType,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,2010,2010,OK,Oklahoma,NVSS,Cardiovascular Disease,Mortality from coronary heart disease,,"cases per 100,000",Crude Rate,...,40,CVD,CVD1_3,CRDRATE,GENDER,GENF,,,,
1,2020,2020,OR,Oregon,NVSS,Cardiovascular Disease,Mortality from coronary heart disease,,"cases per 100,000",Age-adjusted Rate,...,41,CVD,CVD1_3,AGEADJRATE,RACE,WHT,,,,
2,2019,2019,NV,Nevada,NVSS,Cardiovascular Disease,Mortality from coronary heart disease,,"cases per 100,000",Crude Rate,...,32,CVD,CVD1_3,CRDRATE,OVERALL,OVR,,,,
3,2020,2020,TN,Tennessee,NVSS,Cardiovascular Disease,Mortality from coronary heart disease,,Number,Number,...,47,CVD,CVD1_3,NMBR,OVERALL,OVR,,,,
4,2020,2020,VA,Virginia,NVSS,Cardiovascular Disease,Mortality from coronary heart disease,,Number,Number,...,51,CVD,CVD1_3,NMBR,RACE,BLK,,,,


### Recommended: start with the data_summary and variable_summary functions

In [4]:
hdcd.data_summary(cdi_dummy)

The dataframe contains 34 of columns and 29833 of rows 

The dataframe contains 3 topics and 4 questions 

The list of topics including ['Cardiovascular Disease', 'Alcohol', 'Overarching Conditions'] 

The stratifications of the variables including         ['Gender', 'Race/Ethnicity', 'Overall'] 

The set of functions and tools are designed to analyze the chronic disease index (dataframe) data from the CDC. For more information, print out the [Question] variables from the dataset and explore them under the [Topic] variable


### Check a topic of interest and find the questions

In [5]:
cdi_dummy[cdi_dummy["Topic"] == "Alcohol"]["Question"].value_counts()

Question
Binge drinking frequency among adults aged >= 18 years who binge drink    9570
Name: count, dtype: int64
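The same lookup can be done for all topics at once. This is a small pandas sketch on a hypothetical frame; the topic and question strings are placeholders, and only the `Topic` and `Question` column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the CDI data.
toy = pd.DataFrame({
    "Topic": ["Alcohol", "Alcohol", "Cardiovascular Disease"],
    "Question": ["Q-alcohol", "Q-alcohol", "Q-cvd"],
})

# Count rows for every (Topic, Question) pair in one pass.
question_counts = toy.groupby("Topic")["Question"].value_counts()
print(question_counts)

# Number of distinct questions available under each topic.
questions_per_topic = toy.groupby("Topic")["Question"].nunique()
print(questions_per_topic)
```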

### Use the variable_summary() function to get:
- The available DataValueType values
- The number of unique locations (LocationAbbr/LocationDesc)
- The years with longitudinal data available in YearStart

In [6]:
variable = "Binge drinking frequency among adults aged >= 18 years who binge drink"
hdcd.variable_summary(variable, cdi_dummy)

variable units including ['Mean' 'Age-adjusted Mean']
numbers of geo-location (states) have data available: 55/55
numbers of unique years have data available: 11, from 2011 to 2021
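For readers who want to see where such numbers could come from, here is a rough pandas sketch that computes the same three quantities by hand. It is not the implementation of `hdcd.variable_summary`; the `toy` frame and the question labels `Q1`/`Q2` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the CDI data.
toy = pd.DataFrame({
    "Question": ["Q1", "Q1", "Q1", "Q2"],
    "DataValueType": ["Mean", "Age-adjusted Mean", "Mean", "Number"],
    "LocationAbbr": ["AR", "AR", "OR", "AR"],
    "YearStart": [2011, 2012, 2012, 2015],
})

# Restrict to the question of interest.
subset = toy[toy["Question"] == "Q1"]

# Units (value types) reported for this question, in order of appearance.
value_types = subset["DataValueType"].unique().tolist()

# How many locations and distinct years have data.
n_locations = subset["LocationAbbr"].nunique()
n_years = subset["YearStart"].nunique()
year_range = (int(subset["YearStart"].min()), int(subset["YearStart"].max()))

print(value_types, n_locations, n_years, year_range)
```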


### Plot the geomap as a general visualization for the selected variable

In [7]:
variable = "Binge drinking frequency among adults aged >= 18 years who binge drink"
datatype = "Age-adjusted Mean"
stratification = "Overall"
dataframe = cdi_dummy.copy()
color_scheme = 'bluepurple'
width = 'container'

hdcd.plot_geomap(variable,
                 datatype,
                 stratification,
                 dataframe,
                 color_scheme,
                 width)

### Plot the longitudinal visualization to dive into a state of interest

In the geomap above, Arkansas seems to change drastically over time. Let's take a closer look.

In [8]:
variable = "Binge drinking frequency among adults aged >= 18 years who binge drink"
location = "Arkansas"
stratification = "Gender"
dataframe = cdi_dummy.copy()

hdcd.plot_longitudinal_change(variable,
                              location,
                              stratification,
                              dataframe)

### Plot a simple scatterplot to examine the correlation between this variable and an outcome of interest

It looks like binge drinking in Arkansas fluctuates, with a gradually increasing trend.
Let's see how this behavior relates to an outcome of interest: life expectancy.

In [10]:
sod = 'Binge drinking frequency among adults aged >= 18 years who binge drink'
health_outcome = 'Life expectancy at birth'
stratification = "Overall"
dataframe = cdi_dummy.copy()
print_corr = True

hdcd.plot_corr(sod,
               health_outcome,
               stratification,
               dataframe,
               print_corr)

spearmanr correlation coefficient for [Binge drinking frequency among adults aged >= 18 years who binge drink - Age-adjusted Mean] and [Life expectancy at birth - Number]: SignificanceResult(statistic=-0.6146151584618712, pvalue=1.6034462441952998e-06) 

spearmanr correlation coefficient for [Binge drinking frequency among adults aged >= 18 years who binge drink - Mean] and [Life expectancy at birth - Number]: SignificanceResult(statistic=-0.5925314034252818, pvalue=4.632513505293616e-06) 
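The `SignificanceResult` objects printed above are what `scipy.stats.spearmanr` returns. As a minimal sketch of how that statistic behaves, the example below runs it on two short made-up series; the numbers are illustrative only and unrelated to the CDI data.

```python
from scipy.stats import spearmanr

# Made-up paired observations (not CDI values).
exposure = [1, 2, 3, 4, 5]
outcome = [10, 9, 7, 8, 1]

# Spearman's rho is a rank-based measure of monotonic association.
# The result unpacks as (statistic, p-value).
rho, p_value = spearmanr(exposure, outcome)

print(f"rho = {rho:.2f}")
```

A negative coefficient, as in the output above, means that higher values of one variable tend to pair with lower values of the other. It describes association only and does not by itself show that one variable causes the other.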



# Summary

- This notebook serves as a general starting point for using the hdcd tool
- The three steps illustrated here show, in practice, how to discover new data patterns and identify crucial health issues