# DATA 602 - Project Proposal
## Dirk Hartog

#### ______________________________________________________________________________________

### Research Question and Purpose

The data set I chose contains cardiovascular disease mortality rates and percent changes for US counties from 1999 to 2019, stratified by gender, race, and age. The purpose of this research is to determine whether rates of cardiovascular disease mortality differ by race, gender, or location across the United States. Another goal is to answer the question: what are the trends in cardiovascular disease mortality rates at the county and state level? Deeper insight into who is most affected and where rates are highest may provide opportunities to target public health efforts.

### Data Source

The data originate from the National Vital Statistics System (NVSS); the CSV file is available at [Data.gov](https://catalog.data.gov/dataset/rates-and-trends-in-heart-disease-and-stroke-mortality-among-us-adults-35-by-county-a-2000-45659).

### Libraries that may be used

In [90]:
# Pandas
import pandas as pd 

# Datetime
from datetime import datetime as dt 

# Numpy
import numpy as np 

# Matplotlib
import matplotlib.pyplot as plt

# Seaborn
import seaborn as sns

# Plotly
import plotly.express as px

### Exploratory Data Analysis

In [91]:
# Load in the csv file

heart_disease = pd.read_csv('Heart_Disease_and_Stroke.csv', dtype = {'Year':'str'})

#### *1. Explore contents of the data*

In [92]:
# View the first 5 rows

heart_disease.head()

| | Year | LocationAbbr | LocationDesc | GeographicLevel | DataSource | Class | Topic | Data_Value | Data_Value_Unit | Data_Value_Type | ... | Data_Value_Footnote | Confidence_limit_Low | Confidence_limit_High | StratificationCategory1 | Stratification1 | StratificationCategory2 | Stratification2 | StratificationCategory3 | Stratification3 | LocationID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999 | AL | Autauga | County | NVSS | Cardiovascular Diseases | All heart disease | NaN | per 100,000 | Age-Standardized, Spatiotemporally Smoothed Rate | ... | Value suppressed | NaN | NaN | Age group | Ages 35-64 years | Race | American Indian/Alaska Native | Sex | Overall | 1001 |
| 1 | 2013 | AL | Autauga | County | NVSS | Cardiovascular Diseases | All heart disease | NaN | per 100,000 | Age-Standardized, Spatiotemporally Smoothed Rate | ... | Value suppressed | NaN | NaN | Age group | Ages 35-64 years | Race | American Indian/Alaska Native | Sex | Overall | 1001 |
| 2 | 2014 | AL | Autauga | County | NVSS | Cardiovascular Diseases | All heart disease | NaN | per 100,000 | Age-Standardized, Spatiotemporally Smoothed Rate | ... | Value suppressed | NaN | NaN | Age group | Ages 35-64 years | Race | American Indian/Alaska Native | Sex | Overall | 1001 |
| 3 | 2005 | AL | Autauga | County | NVSS | Cardiovascular Diseases | All heart disease | NaN | per 100,000 | Age-Standardized, Spatiotemporally Smoothed Rate | ... | Value suppressed | NaN | NaN | Age group | Ages 35-64 years | Race | American Indian/Alaska Native | Sex | Overall | 1001 |
| 4 | 2012 | AL | Autauga | County | NVSS | Cardiovascular Diseases | All heart disease | NaN | per 100,000 | Age-Standardized, Spatiotemporally Smoothed Rate | ... | Value suppressed | NaN | NaN | Age group | Ages 35-64 years | Race | American Indian/Alaska Native | Sex | Overall | 1001 |


In [93]:
# Inspect all column names

heart_disease.columns

Index(['Year', 'LocationAbbr', 'LocationDesc', 'GeographicLevel', 'DataSource',
       'Class', 'Topic', 'Data_Value', 'Data_Value_Unit', 'Data_Value_Type',
       'Data_Value_Footnote_Symbol', 'Data_Value_Footnote',
       'Confidence_limit_Low', 'Confidence_limit_High',
       'StratificationCategory1', 'Stratification1', 'StratificationCategory2',
       'Stratification2', 'StratificationCategory3', 'Stratification3',
       'LocationID'],
      dtype='object')

In [94]:
# Check data types and number of non-null values

heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5770240 entries, 0 to 5770239
Data columns (total 21 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   Year                        object 
 1   LocationAbbr                object 
 2   LocationDesc                object 
 3   GeographicLevel             object 
 4   DataSource                  object 
 5   Class                       object 
 6   Topic                       object 
 7   Data_Value                  float64
 8   Data_Value_Unit             object 
 9   Data_Value_Type             object 
 10  Data_Value_Footnote_Symbol  object 
 11  Data_Value_Footnote         object 
 12  Confidence_limit_Low        float64
 13  Confidence_limit_High       float64
 14  StratificationCategory1     object 
 15  Stratification1             object 
 16  StratificationCategory2     object 
 17  Stratification2             object 
 18  StratificationCategory3     object 
 19  Stratification3      

In [95]:
# Since info() did not report the non-null counts, we still need to check for missing values

heart_disease.isna().any()

Year                          False
LocationAbbr                  False
LocationDesc                  False
GeographicLevel               False
DataSource                    False
Class                         False
Topic                         False
Data_Value                     True
Data_Value_Unit               False
Data_Value_Type               False
Data_Value_Footnote_Symbol     True
Data_Value_Footnote            True
Confidence_limit_Low           True
Confidence_limit_High          True
StratificationCategory1       False
Stratification1               False
StratificationCategory2       False
Stratification2               False
StratificationCategory3       False
Stratification3               False
LocationID                    False
dtype: bool
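
`isna().any()` only flags whether a column has gaps, not how large they are. A minimal sketch of one way to quantify the missingness follows. The `toy` frame is made-up illustrative data standing in for `heart_disease`, and `missing_summary` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Keep only columns with at least one missing value, largest gap first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Made-up toy rows (NOT the real data), mimicking suppressed values
toy = pd.DataFrame({
    "Year": ["1999", "2013", "2014", "2005"],
    "Data_Value": [np.nan, 25.7, np.nan, 29.5],
    "Data_Value_Footnote": ["Value suppressed", None, "Value suppressed", None],
})

result = missing_summary(toy)
print(result)
```

On the real data the call would presumably be `missing_summary(heart_disease)`, which should show whether the gaps in `Data_Value` line up with the suppressed-value footnotes.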

In [96]:
# Print the unique values in each column

for col in heart_disease.columns: 
    print(f'*Unique values in column: {col}*')
    print(heart_disease[col].unique())

*Unique values in column: Year*
['1999' '2013' '2014' '2005' '2012' '2010' '2009' '2011' '2007' '2019'
 '2018' '2004' '2016' '2015' '2000' '2002' '2003' '2006' '2008' '2001'
 '2017' '1999 - 2010' '2010 - 2019']
*Unique values in column: LocationAbbr*
['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'FL' 'DC' 'GA' 'HI' 'ID' 'IL'
 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE'
 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD'
 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']
*Unique values in column: LocationDesc*
['Autauga' 'Baldwin' 'Barbour' ... 'Uinta' 'Washakie' 'Weston']
*Unique values in column: GeographicLevel*
['County']
*Unique values in column: DataSource*
['NVSS']
*Unique values in column: Class*
['Cardiovascular Diseases']
*Unique values in column: Topic*
['All heart disease' 'All stroke' 'Coronary heart disease (CHD)'
 'Cardiovascular disease (CVD)' 'Heart failure']
*Unique values in column: Data_Value*
[  nan  25.7  29.5 ... -76.8 -76.  -74
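
The `Year` column mixes single years with the two period labels `'1999 - 2010'` and `'2010 - 2019'`, which is why it was read in as a string. One possible way to separate annual rows from period rows before any trend analysis is sketched below. The toy values are made up for illustration, and `split_by_year_type` is a hypothetical helper name.

```python
import pandas as pd

def split_by_year_type(df: pd.DataFrame, year_col: str = "Year"):
    """Split rows into annual records and multi-year period records."""
    # Period labels such as '1999 - 2010' contain a hyphen; single years do not
    is_period = df[year_col].str.contains("-", regex=False)
    annual = df[~is_period].copy()
    # Single years can safely be converted to integers for plotting and sorting
    annual[year_col] = annual[year_col].astype(int)
    period = df[is_period].copy()
    return annual, period

# Made-up toy rows (NOT the real data)
toy = pd.DataFrame({
    "Year": ["1999", "2013", "1999 - 2010", "2010 - 2019"],
    "Data_Value": [310.2, 250.4, -19.3, 4.1],
})

annual, period = split_by_year_type(toy)
print(annual)
print(period)
```

Keeping the two subsets apart avoids accidentally averaging rates per 100,000 together with percent changes.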

#### *2. Generate summary statistics*

In [97]:
# Summary statistics by RACE/ETHNICITY and All heart disease 

df1 = heart_disease[(heart_disease["Topic"] == "All heart disease") & 
              (heart_disease["Data_Value_Type"] == "Total percent change") &
              (heart_disease["Stratification3"] == "Overall") &
              (heart_disease["Stratification2"] != "Overall")
             ].dropna(axis = 0, subset = "Data_Value")

In [98]:
df1.groupby(["Year", "Stratification2"]).agg({"Data_Value" : ["describe"]})

| Year | Stratification2 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 1999 - 2010 | American Indian/Alaska Native | 525.0 | -9.414857 | 23.330523 | -61.2 | -24.9 | -12.1 | 1.6 | 105.2 |
| 1999 - 2010 | Asian/Pacific Islander | 1012.0 | -29.606522 | 12.39888 | -59.9 | -37.05 | -30.9 | -23.2 | 69.1 |
| 1999 - 2010 | Black (Non-Hispanic) | 2324.0 | -29.705465 | 13.409396 | -67.1 | -38.4 | -30.9 | -23.0 | 248.7 |
| 1999 - 2010 | Hispanic | 1987.0 | -32.861399 | 13.076478 | -72.2 | -41.55 | -34.1 | -25.9 | 40.7 |
| 1999 - 2010 | White | 6000.0 | -24.690683 | 13.270857 | -60.7 | -34.3 | -27.1 | -17.575 | 47.7 |
| 2010 - 2019 | American Indian/Alaska Native | 525.0 | 9.157143 | 23.641501 | -46.8 | -7.4 | 7.5 | 23.6 | 146.9 |
| 2010 - 2019 | Asian/Pacific Islander | 1013.0 | 4.534353 | 23.23303 | -37.7 | -13.7 | 1.0 | 19.2 | 147.0 |
| 2010 - 2019 | Black (Non-Hispanic) | 2324.0 | 1.546256 | 16.824125 | -50.5 | -10.3 | 0.2 | 11.9 | 82.9 |
| 2010 - 2019 | Hispanic | 1989.0 | -0.997285 | 14.648169 | -49.9 | -11.7 | -2.4 | 7.6 | 80.8 |
| 2010 - 2019 | White | 6002.0 | 0.27021 | 14.211688 | -38.5 | -9.4 | -1.3 | 8.0 | 140.0 |


In [99]:
# Summary statistics by SEX and All heart disease 

df2 = heart_disease[(heart_disease["Topic"] == "All heart disease") & 
              (heart_disease["Data_Value_Type"] == "Total percent change") &
              (heart_disease["Stratification3"] != "Overall") &
              (heart_disease["Stratification2"] == "Overall")
             ].dropna(axis = 0, subset = "Data_Value")

In [85]:
df2.groupby(["Year", "Stratification3"]).agg({"Data_Value" : ["describe"]})

| Year | Stratification3 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 1999 - 2010 | Men | 5805.0 | -26.192317 | 11.900591 | -62.7 | -34.7 | -28.3 | -19.7 | 67.5 |
| 1999 - 2010 | Women | 5841.0 | -25.05333 | 15.11613 | -61.0 | -35.7 | -28.0 | -17.5 | 56.2 |
| 2010 - 2019 | Men | 5807.0 | -0.629551 | 12.247496 | -47.1 | -8.9 | -1.7 | 6.1 | 98.2 |
| 2010 - 2019 | Women | 5843.0 | 1.005545 | 18.094917 | -42.0 | -11.8 | -1.7 | 11.6 | 204.8 |


In [100]:
# Summary statistics by AGE and All heart disease 

df3 = heart_disease[(heart_disease["Topic"] == "All heart disease") & 
              (heart_disease["Data_Value_Type"] == "Total percent change") &
              (heart_disease["Stratification3"] == "Overall") &
              (heart_disease["Stratification2"] == "Overall")
             ].dropna(axis = 0, subset = "Data_Value")

In [101]:
df3.groupby(["Year", "Stratification1"]).agg({"Data_Value" : ["describe"]})

| Year | Stratification1 | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 1999 - 2010 | Ages 35-64 years | 3082.0 | -19.830402 | 12.751135 | -56.3 | -29.0 | -21.3 | -12.3 | 69.0 |
| 1999 - 2010 | Ages 65 years and older | 3023.0 | -32.066953 | 8.084356 | -59.8 | -37.7 | -32.7 | -27.4 | 11.4 |
| 2010 - 2019 | Ages 35-64 years | 3083.0 | 6.767467 | 13.437475 | -43.5 | -2.5 | 5.8 | 14.8 | 142.6 |
| 2010 - 2019 | Ages 65 years and older | 3024.0 | -6.712665 | 10.028252 | -37.0 | -13.3 | -7.1 | -1.0 | 113.4 |


### Brief Summary and Further Analysis

Across the summary statistics for each category, the mean total percent change in "All heart disease" mortality rates was negative for every subgroup from 1999 to 2010, indicating that rates fell. From 2010 to 2019, the mean change remained negative only for the Hispanic population, men, and those aged 65 years and older; for the other subgroups it was positive on average.
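
To make the comparison between the two periods easier to read, the mean percent change could be pivoted so that each subgroup is a row and each period a column. The sketch below shows the idea on made-up toy values, not the real results. `mean_change_by_group` is a hypothetical helper, and the column names are assumed to match the filtered frames above.

```python
import pandas as pd

def mean_change_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Mean total percent change per subgroup (rows) and period (columns)."""
    return (
        df.pivot_table(index=group_col, columns="Year",
                       values="Data_Value", aggfunc="mean")
          .round(1)
    )

# Made-up toy rows (NOT the real data): two counties per group and period
toy = pd.DataFrame({
    "Year": ["1999 - 2010", "1999 - 2010", "2010 - 2019", "2010 - 2019"] * 2,
    "Stratification3": ["Men"] * 4 + ["Women"] * 4,
    "Data_Value": [-30.0, -20.0, -2.0, 0.0, -28.0, -22.0, 3.0, 1.0],
})

table = mean_change_by_group(toy, "Stratification3")
# Subgroups whose mean change stayed negative in the later period
still_declining = table.index[table["2010 - 2019"] < 0].tolist()
print(table)
print(still_declining)
```

With the frames defined earlier, a call such as `mean_change_by_group(df2, "Stratification3")` would presumably reproduce the `mean` column of the summary table for sex in a wider layout.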

Further analysis will use data visualization techniques to examine rates and trends within subgroups and to answer the research questions:

- Identifying trends in mortality rates through a line plot.
- Comparing group mortality rates with bar charts to identify any inequalities that may exist within the population.
- Looking at the distribution of mortality rates at the state or county level through box plots or histograms, which may reveal regions where interventions could help reduce cardiovascular disease mortality.
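
As a rough illustration of the first planned visualization, the sketch below draws a line plot of the mean annual rate per subgroup. It assumes the input has already been restricted to annual rate rows (single years, not the percent-change periods). The toy values are made up, and `plot_trend` is a hypothetical helper rather than the final implementation.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_trend(df: pd.DataFrame, group_col: str, ax=None):
    """Line plot of the mean Data_Value per year, one line per subgroup."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    # Average across counties, then spread subgroups into columns
    trend = df.groupby(["Year", group_col])["Data_Value"].mean().unstack(group_col)
    trend.plot(ax=ax, marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Mean rate per 100,000")
    ax.set_title(f"Mortality rate trend by {group_col}")
    return ax

# Made-up toy rows (NOT the real data)
toy = pd.DataFrame({
    "Year": [2017, 2018, 2019] * 2,
    "Stratification3": ["Men"] * 3 + ["Women"] * 3,
    "Data_Value": [400.0, 395.0, 390.0, 280.0, 278.0, 275.0],
})

ax = plot_trend(toy, "Stratification3")
```

The same grouped means could feed a bar chart (`trend.plot(kind="bar")`), and `seaborn.boxplot` is one option for the state-level distributions.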