# Project proposal

---

Group name: Nico & Burcin

---


## Introduction

The aim of this project is to investigate whether there is a correlation between household income and the death rate in the United States of America. To explore this relation, we gathered data on both topics and will analyse how and to what extent the death rate is impacted by household income.

### Research Question

We want to answer the following question:

**Does household income have an impact on death rates in the U.S., and if so, how large is it?**

The predictor variable will be the median household income and the response variable will be the age-adjusted death rate. Further insight can be gained by using categories like death cause, state or year (see Analysis Approach).
The absolute number of total deaths will provide additional useful information.

This question is backed by the following studies:
* KINGE, Jonas Minet, et al. Association of household income with life expectancy and cause-specific mortality in Norway, 2005-2015. JAMA, 2019, vol. 321, no. 19, pp. 1916-1925. (https://jamanetwork.com/journals/jama/article-abstract/2733322)
* KAPLAN, George A., et al. Inequality in income and mortality in the United States: analysis of mortality and potential pathways. BMJ, 1996, vol. 312, no. 7037, pp. 999-1003. (https://www.bmj.com/content/312/7037/999.full)
* O’CONNOR, Gerald T., et al. Median household income and mortality rate in cystic fibrosis. Pediatrics, 2003, vol. 111, no. 4, pp. e333-e339. (https://publications.aap.org/pediatrics/article-abstract/111/4/e333/63113/Median-Household-Income-and-Mortality-Rate-in)

Although the first study was done in Norway and the second study investigates mortality instead of death rate, we expect to make similar observations.
Therefore, our hypothesis regarding the research question is:

**The household income and the death rate will have a negative correlation.**

That is, the higher the household income, the lower the death rate.

Added information on mortality vs mortality rate/death rate:
*Mortality is a fact that refers to susceptibility to death. While there is a crude death rate that refers to number of deaths in a population in a year, mortality rate is the number of deaths per thousand people over a period of time that is normally a year.* 

See: 
* https://www.differencebetween.com/difference-between-death-rate-and-vs-mortality-rate/
* https://en.wikipedia.org/wiki/Mortality_rate

Added information on age-adjustment:
*In epidemiology and demography, age adjustment, also called age standardization, is a technique used to allow statistical populations to be compared when the age profiles of the populations are quite different. The effects of the age factor are adjusted or standardized.* 

See: 
* https://en.wikipedia.org/wiki/Age_adjustment
* https://www.health.pa.gov/topics/HealthStatistics/Statistical-Resources/UnderstandingHealthStats/Pages/Age-Adjusted-Rates.aspx
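
To make the idea of age adjustment concrete, here is a minimal sketch of direct standardization: the age-specific rates of a population are weighted by the age shares of a standard population. All numbers, the two age groups and the function names are made up for illustration; they are not taken from the NCHS data.

```python
# Illustrative sketch of direct age standardization (toy numbers, not real data).

def crude_rate(deaths, population):
    """Deaths per 100,000 people, ignoring the age structure."""
    return 100_000 * sum(deaths) / sum(population)


def age_adjusted_rate(deaths, population, standard_weights):
    """Weight each age-specific rate by the standard population share."""
    age_specific = [100_000 * d / p for d, p in zip(deaths, population)]
    return sum(w * r for w, r in zip(standard_weights, age_specific))


# Hypothetical standard population: 80% "young", 20% "old".
standard_weights = [0.8, 0.2]

# Two hypothetical states with identical age-specific rates
# (100 and 2,000 per 100,000) but different age structures.
state_a = {"deaths": [900, 2_000], "population": [900_000, 100_000]}
state_b = {"deaths": [600, 8_000], "population": [600_000, 400_000]}

crude_a = crude_rate(state_a["deaths"], state_a["population"])
crude_b = crude_rate(state_b["deaths"], state_b["population"])
adjusted_a = age_adjusted_rate(state_a["deaths"], state_a["population"], standard_weights)
adjusted_b = age_adjusted_rate(state_b["deaths"], state_b["population"], standard_weights)

print(crude_a, crude_b)        # crude rates differ because state B is older
print(adjusted_a, adjusted_b)  # adjusted rates coincide
```

In this toy example the crude rates differ only because of the age structure, while the age-adjusted rates are the same. This is why the age-adjusted rate is the more suitable response variable when comparing states.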


## Data description

We have two data sets, which we will combine by joining on the state column (and, once the income table has been reshaped, on the year).

The first data set:

The dataset covers the 10 leading causes of death in the United States, beginning in 1999. Data are based on information from all resident death certificates filed in the 50 states and the District of Columbia using demographic and medical characteristics. Age-adjusted death rates (per 100,000 population) are based on the 2000 U.S. standard population. Populations used for computing death rates after 2010 are postcensal estimates based on the 2010 census, estimated as of July 1, 2010. Rates for census years are based on populations enumerated in the corresponding censuses. Rates for non-census years before 2010 are revised using updated intercensal population estimates and may differ from rates previously published.

The data set has 10868 observations and 6 columns.

The columns are:
* year, from 1999 to 2017
* 113 cause name, the NDI ICD-10 113 categories for causes of death
* cause name, the generic name for the death cause defined in the 113 cause name column
* state, the state in which the data was collected
* deaths, the total number of deaths
* age-adjusted death rate, the standardized death rate (per 100,000 population) for the specific state in the observed year

In [1]:
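from pathlib import Path

import pandas as pd

# Raw files live in ../data/raw; pathlib keeps the path portable across operating systems
raw_data = Path('..') / 'data' / 'raw'
file_income = 'Median_Household_Income_By_State_1990-2017.csv'
file_death = 'NCHS_-_Leading_Causes_of_Death__United_States.csv'

# Load both raw tables unchanged
df_income = pd.read_csv(raw_data / file_income)
df_death = pd.read_csv(raw_data / file_death)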


In [2]:
df_death

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,United States,169936,49.4
1,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2703,53.8
2,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,436,63.7
3,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4184,56.2
4,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1625,51.8
...,...,...,...,...,...,...
10863,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Virginia,1035,16.9
10864,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Washington,278,5.2
10865,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,West Virginia,345,16.4
10866,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Wisconsin,677,11.9


The second data set:

The data were prepared in May 2019. They originally combine data from the American Census and the American Community Survey (ACS).
The table shows the median household income by state (in constant 2017 dollars) for the years 1990, 2000, 2005, 2010 and 2013-2017. The first column lists the states of the United States, and the other columns correspond to the years mentioned above. Each cell holds the median household income for the given state and year.
This dataset needs to be cleaned and corrected before it can be joined with the death dataset.
In particular, the columns of the income dataset need to be transformed so that they have the same format as the death dataset.
The join can be done by state and year. Income values for the years not represented in the income dataset are missing and need to be either dropped or replaced.

In [3]:
df_income

Unnamed: 0,"Table 102.30. Median household income, by state: Selected years, 1990 through 2017",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31
0,[In constant 2017 dollars. Standard errors app...,,,,,,,,,,...,,,,,,,,,,
1,State,1990\1\,2000\2\,2005.0,,2010.0,,2013.0,,2014.0,...,,,,,,,,,,
2,1,2,3,4.0,,5.0,,6.0,,7.0,...,,,,,,,,,,
3,United States ...........,57500,62000,58200.0,80.0,56400.0,40.0,55100.0,40.0,55600.0,...,,,,,,,,,,
4,Alabama ....................,45200,50400,46400.0,400.0,45600.0,320.0,45200.0,410.0,44400.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,Wyoming ..................,51900,55900,58100.0,1160.0,60300.0,1300.0,61900.0,1150.0,59100.0,...,,,,,,,,,,
64,\1\Based on 1989 incomes collected in the 1990...,,,,,,,,,,...,,,,,,,,,,
65,\2\Based on 1999 incomes collected in the 2000...,,,,,,,,,,...,,,,,,,,,,
66,NOTE: Constant dollars adjusted by the Consume...,,,,,,,,,,...,,,,,,,,,,
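
The cleaning and joining steps described above could look roughly like the following sketch. It runs on a small hand-made stand-in for the two tables, so all values are illustrative, and the column name `Median Household Income` is our own choice rather than something present in the raw file. The real income file additionally needs its title, numbering and footnote rows removed first.

```python
import pandas as pd

# Toy stand-in for the raw income table (illustrative values, not real data).
# It mimics two quirks visible in the raw file: dot leaders after the state
# name and footnote markers such as \2\ in the year labels.
income_wide = pd.DataFrame({
    "State": ["Alabama ....................", "Alaska ......................"],
    "2000\\2\\": ["50400", "75000"],
    "2005": [46400.0, 70000.0],
    "2010": [45600.0, 68000.0],
})

# Strip footnote markers from the column labels ("2000\2\" -> "2000").
income_wide.columns = [col.split("\\")[0] for col in income_wide.columns]
# Remove the dot leaders and trailing blanks from the state names.
income_wide["State"] = income_wide["State"].str.rstrip(" .")

# Reshape from one column per year to one row per (state, year).
income_long = income_wide.melt(
    id_vars="State", var_name="Year", value_name="Median Household Income"
)
income_long["Year"] = income_long["Year"].astype(int)
income_long["Median Household Income"] = pd.to_numeric(
    income_long["Median Household Income"]
)

# Toy stand-in for the death table (illustrative values, not real data).
deaths = pd.DataFrame({
    "Year": [2000, 2005, 2007],
    "State": ["Alabama", "Alabama", "Alabama"],
    "Cause Name": ["Kidney disease"] * 3,
    "Age-adjusted Death Rate": [20.0, 21.0, 22.0],
})

# An inner join drops death records for years without income data (here 2007).
merged = deaths.merge(income_long, on=["State", "Year"], how="inner")
print(merged)
```

Using `how="inner"` corresponds to dropping the years without income data. A left join followed by interpolation would be one way to replace the missing values instead.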



## Analysis approach

Our response variable is the age-adjusted death rate.
This variable shows the standardized death rate (per 100,000 population) in a specific year for a specific state, adjusted for the age factor.

In order to test our hypothesis, we will use different visualizations and summary statistics. For example:
* Multi-Line Highlight or Multi-Line Tooltip, because both the x axis (income) and the y axis (death rate) are numeric variables and the categories for death cause could be visualized with different lines.
    Examples can be found here: 
    * https://altair-viz.github.io/gallery/multiline_tooltip.html
    * https://altair-viz.github.io/gallery/multiline_highlight.html

    Additionally, we will show different line plots with summarized info on the income per state or per year and also the death rate per state or per year.
* For the overall death rates and income in different years, a facet plot will be used to make comparisons easier.
    * https://altair-viz.github.io/gallery/beckers_barley_wrapped_facet.html

* In addition, summary statistics in table form will show the changes over the years for
    -  Total death cause or death rate in the US by year
    - Death cause or death rate by state for each year
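
As a rough sketch, the summary tables listed above could be built with pandas `groupby` and `pivot_table`. The small frame below only imitates the structure of the death dataset; the 2016 values are made up for illustration. Because the dataset contains a `United States` aggregate row next to the individual states, that row is filtered out before summing.

```python
import pandas as pd

# Toy frame with the same columns as the death dataset (2016 values are made up).
deaths = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017, 2017],
    "State": ["Alabama", "Alaska", "Alabama", "Alaska", "United States"],
    "Cause Name": ["Unintentional injuries"] * 5,
    "Deaths": [2600, 420, 2703, 436, 169936],
    "Age-adjusted Death Rate": [52.0, 62.0, 53.8, 63.7, 49.4],
})

# Exclude the national aggregate so that state totals are not double counted.
states_only = deaths[deaths["State"] != "United States"]

# Total deaths per year, summed over the states in the frame.
deaths_by_year = states_only.groupby("Year")["Deaths"].sum()

# Age-adjusted death rate by state (rows) and year (columns).
rate_by_state_year = states_only.pivot_table(
    index="State",
    columns="Year",
    values="Age-adjusted Death Rate",
    aggfunc="mean",
)

print(deaths_by_year)
print(rate_by_state_year)
```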
 
The predictor variable will be the median household income. However, we suspect that the death cause and the state explain variation on a more detailed level. Exploring them can give us insights and potentially relevant information for our model.

### Model type
Since the predictor and the response variable are numeric, and we try to find a pattern between them, we have a regression problem.
We will start with simple linear regression, since we expect to have only one predictor (household income) and one response/dependent variable (death rate), and will assume a linear relationship.
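
A minimal sketch of this first step, using NumPy only: the Pearson correlation checks the sign of the relationship stated in the hypothesis, and a least-squares line gives the simple linear regression. The five data points are invented for illustration and say nothing about the real data.

```python
import numpy as np

# Invented example values: income in thousand dollars, death rate per 100,000.
income = np.array([45.0, 50.0, 55.0, 60.0, 70.0])
death_rate = np.array([900.0, 860.0, 830.0, 790.0, 720.0])

# Pearson correlation coefficient between predictor and response.
r = np.corrcoef(income, death_rate)[0, 1]

# Least-squares fit of death_rate = slope * income + intercept.
slope, intercept = np.polyfit(income, death_rate, deg=1)

print(f"correlation: {r:.3f}")
print(f"death_rate = {slope:.2f} * income + {intercept:.2f}")
```

A negative `r` and a negative `slope` would be in line with the hypothesis. How strong the effect is would be read from the size of the slope and the quality of the fit.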

This assumption could change, depending on the insights we get from our analysis of the data. Therefore, we will also take the following models into account:
* Polynomial Regression, in case the relationship between predictor and response variable is not linear
* Bayesian Regression
* Decision Tree Regression, mainly gradient-boosted trees with xgboost
* Gradient Descent Regression

In order to find the best performing model, we will compare them using a specific metric (e.g. Mean Squared Error or Mean Absolute Error).
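
The comparison could be sketched as follows for two of the candidates, a linear and a polynomial (degree 2) fit, evaluated on held-out data with both metrics. The data are synthetic, the 45/15 split is an arbitrary choice, and `evaluate` is a helper name introduced here for illustration. The other model types would be plugged in the same way.

```python
import numpy as np

# Synthetic data with a slightly curved relationship plus noise (not real data).
rng = np.random.default_rng(0)
income = rng.uniform(40, 80, size=60)
death_rate = 1200 - 8 * income + 0.03 * income**2 + rng.normal(0, 10, size=60)

# The samples are already in random order, so a simple slice serves as a split.
train, test = slice(0, 45), slice(45, None)


def evaluate(degree):
    """Fit a polynomial of the given degree on the training part and
    return MSE and MAE on the held-out part."""
    coeffs = np.polyfit(income[train], death_rate[train], deg=degree)
    predictions = np.polyval(coeffs, income[test])
    errors = death_rate[test] - predictions
    return {
        "mse": float(np.mean(errors**2)),
        "mae": float(np.mean(np.abs(errors))),
    }


results = {degree: evaluate(degree) for degree in (1, 2)}
for degree, metrics in results.items():
    print(degree, metrics)
```

The model with the lowest error on the held-out data would be preferred. MSE penalizes large errors more strongly than MAE, so the two metrics can rank models differently.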



## Data dictionary

*Create a data dictionary for all the variables in your data set. You may fill out the data description table or create your own table with Pandas:*

<br>


| Name  |   Description	| Role   	| Type   	|  Format 	|
|---	|---	        |---    	|---	    |---	|
|   	|   	        |   	    |   	    |   	|
|   	|   	        |   	    |   	    |   	|
|   	|   	        |   	    |   	    |   	|


<br>

- `Role`: response, predictor, ID (ID columns are not used in a model but can help to better understand the data)

- `Type`: nominal, ordinal or numeric

- `Format`: int, float, string, category, date or object
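
As a possible starting point, the data dictionary could be created with Pandas from the column descriptions given in the data description. The sketch below is tentative. The column name `Median Household Income` is our own label for the cleaned income variable, the `ID` roles are an assumption, and the types and formats refer to the data after cleaning, so they should be checked against the actual data.

```python
import pandas as pd

# Tentative data dictionary; roles, types and formats are assumptions
# that still need to be verified on the cleaned data.
data_dictionary = pd.DataFrame(
    [
        ("Year", "Year of observation (1999 to 2017)", "ID", "numeric", "int"),
        ("113 Cause Name", "NDI ICD-10 113 category of the death cause", "ID", "nominal", "string"),
        ("Cause Name", "Generic name of the death cause", "ID", "nominal", "category"),
        ("State", "State in which the data was collected", "ID", "nominal", "category"),
        ("Deaths", "Total number of deaths", "ID", "numeric", "int"),
        ("Age-adjusted Death Rate", "Standardized death rate per 100,000 population", "response", "numeric", "float"),
        ("Median Household Income", "Median household income by state and year", "predictor", "numeric", "float"),
    ],
    columns=["Name", "Description", "Role", "Type", "Format"],
)

print(data_dictionary.to_string(index=False))
```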