## NYC Leading causes of Death data analysis

### Group: Apurva Padwal(apadwal2@illinois.edu) and Pranav Dange(pdange2@illinois.edu)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt

We used the NCHS - Leading Causes of Death: United States dataset for this contextual visualization; more details about the dataset are available at https://catalog.data.gov/dataset/nchs-leading-causes-of-death-united-states. We read the dataset in CSV format from https://data.cdc.gov/api/views/bi63-dtpu/rows.csv?accessType=DOWNLOAD. Covering the years from 1999 onward, it reports age-adjusted death rates and the number of deaths for the ten leading causes of death in the US. Because the data is reported at the state level, we used it to show the average number of deaths in New York, compared with California, from 2010 to 2015 across the various causes of death.

In [2]:
# importing the data
us_leading_cause_of_death = pd.read_csv('https://data.cdc.gov/api/views/bi63-dtpu/rows.csv?accessType=DOWNLOAD').dropna()

In [3]:
#printing the first 10 values to get an idea of the dataset
us_leading_cause_of_death.head(10)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
0,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,United States,169936,49.4
1,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alabama,2703,53.8
2,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Alaska,436,63.7
3,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arizona,4184,56.2
4,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Arkansas,1625,51.8
5,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,California,13840,33.2
6,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Colorado,3037,53.6
7,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Connecticut,2078,53.2
8,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,Delaware,608,61.9
9,2017,"Accidents (unintentional injuries) (V01-X59,Y8...",Unintentional injuries,District of Columbia,427,61.0


In [4]:
#printing the last 10 values to get an idea of the dataset
us_leading_cause_of_death.tail(10)

Unnamed: 0,Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
10858,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Tennessee,675,12.3
10859,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Texas,1669,10.3
10860,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,United States,35525,13.0
10861,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Utah,135,9.1
10862,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Vermont,56,9.2
10863,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Virginia,1035,16.9
10864,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Washington,278,5.2
10865,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,West Virginia,345,16.4
10866,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Wisconsin,677,11.9
10867,1999,"Nephritis, nephrotic syndrome and nephrosis (N...",Kidney disease,Wyoming,30,6.8


In [5]:
#printing all the columns present in the dataset
us_leading_cause_of_death.columns

Index(['Year', '113 Cause Name', 'Cause Name', 'State', 'Deaths',
       'Age-adjusted Death Rate'],
      dtype='object')

In [6]:
# keep 2010-2015 rows for New York and California, excluding the 'All causes' total
us_leading_cause_of_death = us_leading_cause_of_death[
    us_leading_cause_of_death['Year'].between(2010, 2015)
    & (us_leading_cause_of_death['Cause Name'] != 'All causes')
    & us_leading_cause_of_death['State'].isin(['New York', 'California'])
]

In [7]:
# plot the bar chart
Final_cont_bar_chart = alt.Chart(us_leading_cause_of_death).mark_bar().encode(
  alt.Column('Year:O'),  # treat Year as an ordered category, not a continuous quantity
  alt.X('State:N', title ='State'),
  alt.Y('average(Deaths):Q', axis=alt.Axis(grid=False), title ='Average Number of Deaths'),
  alt.Color('State:N'),
  tooltip=(['average(Deaths):Q'])
).properties(
  title='Statewise comparison between California and New York for Average Number of Deaths during 2010-2015'
)

In [8]:
# display the bar chart
Final_cont_bar_chart = Final_cont_bar_chart.properties(width=115)
Final_cont_bar_chart

#### Plot choice explanation:

For this first plot, we chose a bar chart. It is a suitable way to compare the average number of deaths in two states, California and New York, because bar heights make the numbers easy to compare. The chart also shows effectively how the two states differ in each year.

We used the Altair visualization library for this bar chart to make the visualization interactive. We defined a custom title and axis labels to make clear what is being visualized. The bars show the average number of deaths in each state: the height of a bar represents the average number of deaths and its color represents the state.
Tooltips give the exact average number of deaths for each bar.
The visual is divided into columns by year, with the state names displayed on the x-axis and the average number of deaths on the y-axis.
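As a rough illustration of what the `average(Deaths)` encoding computes for each bar, the sketch below builds the equivalent table with pandas. The rows in `sample` are made-up values for illustration only, not figures from the CDC dataset, and the names `sample` and `avg_deaths` are hypothetical.

```python
import pandas as pd

# Illustrative rows only: these numbers are made up, not taken from the CDC file.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010],
    "State": ["California", "California", "New York", "New York"],
    "Cause Name": ["Heart disease", "Cancer", "Heart disease", "Cancer"],
    "Deaths": [100, 80, 60, 40],
})

# Average deaths per (Year, State): one value per bar in the faceted chart.
avg_deaths = (
    sample.groupby(["Year", "State"], as_index=False)["Deaths"]
    .mean()
    .rename(columns={"Deaths": "Average Deaths"})
)
print(avg_deaths)
```

Each row of `avg_deaths` corresponds to one bar: the mean is taken over all causes of death for that state and year.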

When selecting contextual visualizations, we had to look for datasets that would help us learn more about our key dataset and relate to it. After some research, we chose the New York City Community Health Survey dataset. It covers additional areas that gave us deeper insights, such as the proportion of people without access to medical treatment, access to health insurance, and lifestyle choices such as diet and smoking habits. We chose to investigate and go forward with this dataset because it gives us an opportunity to analyze how someone's lifestyle, daily habits and circumstances contribute to their cause of death in the city of New York.

In [9]:
#myJekyllDir = 'C:/Users/prana/pdange21.github.io/assets/json/'

In [10]:
#Final_cont_bar_chart.save(myJekyllDir+"Final_cont_bar_chart.json")

#### Contextual Visualization 2

In [11]:
# importing the data
us_leading_cause_of_death_2 = pd.read_csv('https://data.cdc.gov/api/views/bi63-dtpu/rows.csv?accessType=DOWNLOAD').dropna()

In [12]:
# filter the data to include only heart disease deaths in New York and California from 2010 to 2015
heart_disease_data = us_leading_cause_of_death_2[(us_leading_cause_of_death_2['State'].isin(['New York', 'California'])) & 
                                     (us_leading_cause_of_death_2['Year'].between(2010, 2015)) &
                                     (us_leading_cause_of_death_2['Cause Name'] == 'Heart disease')]

# create a line chart using Altair
cont_line_chart_1 = alt.Chart(heart_disease_data).mark_line().encode(
    x='Year:O', # O is for ordinal (categorical with order)
    y='Deaths:Q', # Q is for quantitative
    color='State:N' # N is for nominal (categorical without order)
).properties(
    title='Comparison of Deaths due to Heart Diseases in New York and California (2010-2015)',
    width=500
)

# display the line chart
cont_line_chart_1

In [13]:
#myJekyllDir = 'C:/Users/prana/pdange21.github.io/assets/json/'

In [14]:
#cont_line_chart_1.save(myJekyllDir+"cont_line_chart_1.json")

#### Plot choice explanation:

For the second plot, we thought a line chart would be suitable for visualizing trends in our data for the two selected states. A line chart lets us compare the states directly and shows how the values change across consecutive years.
Again, we used the Altair visualization library for this line chart to make the visualization interactive. We defined a custom title to make clear what is being visualized. The graph depicts the number of deaths from heart disease in New York and California from 2010 to 2015. Two lines are drawn: one for New York and one for California. The x-axis shows the years, while the y-axis shows the number of deaths from heart disease. Each line has a distinct color, and a legend identifies the two states.
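To read the gap between the two lines as numbers, one option is to pivot the long-format data so that each state becomes a column. The sketch below assumes a small made-up `sample` frame (illustrative values, not CDC figures); the names `sample`, `wide` and `Difference` are hypothetical.

```python
import pandas as pd

# Illustrative rows only: made-up numbers, not CDC values.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "State": ["California", "New York", "California", "New York"],
    "Deaths": [100, 70, 110, 65],
})

# One row per year, one column per state (the same shape the two lines trace).
wide = sample.pivot(index="Year", columns="State", values="Deaths")

# Year-by-year gap between the two lines.
wide["Difference"] = wide["California"] - wide["New York"]
print(wide)
```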


#### Contextual Visualization 3:

In [15]:
# Filter for heart disease and the desired years
heart_disease_data = us_leading_cause_of_death_2[(us_leading_cause_of_death_2['Cause Name'] == 'Heart disease') & 
                                      (us_leading_cause_of_death_2['Year'].between(2010, 2015))]

# Filter for New York and get average deaths by year
ny_heart_disease_data = heart_disease_data[heart_disease_data['State'] == 'New York']
# numeric_only=True skips text columns, which newer pandas versions refuse to average
ny_heart_disease_data = ny_heart_disease_data.groupby('Year').mean(numeric_only=True).reset_index()
ny_heart_disease_data['State'] = 'New York'

# Get average deaths by year for all other states combined
# ('United States' is the national total, not a state, so it is excluded as well)
other_states_heart_disease_data = heart_disease_data[~heart_disease_data['State'].isin(['New York', 'United States'])]
other_states_heart_disease_data = other_states_heart_disease_data.groupby('Year').mean(numeric_only=True).reset_index()
other_states_heart_disease_data['State'] = 'Other states'

# Concatenate data frames
heart_disease_data_combined = pd.concat([ny_heart_disease_data, other_states_heart_disease_data])

# Plot line chart
line_chart = alt.Chart(heart_disease_data_combined).mark_line().encode(
    x=alt.X('Year:O', title='Year'),
    y=alt.Y('average(Deaths):Q', axis=alt.Axis(title='Average Number of Deaths')),
    color=alt.Color('State:N', title='State')
).properties(
    title='Deaths due to Heart disease in New York compared to Other States (2010-2015)',
    width=800, # set the width of the chart to 800 pixels
    height=400 # set the height of the chart to 400 pixels
)

line_chart

In [16]:
#line_chart.save(myJekyllDir+"cont_line_chart_2.json")

#### Plot choice explanation:

The third chart is similar to our second plot. We used a line chart again, but this time we compare New York with all other states combined. We chose this chart because it shows how a given measure, here deaths from heart disease, changes over time.

The graph depicts two lines, one for New York and one for all other states combined, representing the average number of deaths from heart disease from 2010 to 2015. The x-axis shows the years 2010 to 2015, while the y-axis shows the average number of heart disease deaths. A legend distinguishes the two lines.
The chart is titled "Deaths due to Heart disease in New York compared to Other States (2010-2015)," and its width and height are 800 and 400 pixels, respectively. Overall, the resulting line chart gives a clear picture of the average number of deaths attributable to heart disease in New York and in other states.
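One caveat when averaging over "all other states": the dataset contains a `United States` row (visible in the `head(10)` output earlier), which is a national total rather than a state. The sketch below uses made-up numbers, not CDC figures, to show how much that single row can shift the average; the variable names are hypothetical.

```python
import pandas as pd

# Illustrative rows only: made-up numbers, not CDC values.
sample = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010],
    "State": ["New York", "Alabama", "Alaska", "United States"],
    "Deaths": [50, 20, 10, 80],
})

# Everything except New York, with and without the national total row.
others = sample[sample["State"] != "New York"]
states_only = others[others["State"] != "United States"]

mean_with_total = others.groupby("Year")["Deaths"].mean()
mean_states_only = states_only.groupby("Year")["Deaths"].mean()

print(mean_with_total.loc[2010], mean_states_only.loc[2010])
```

In this toy example, the average more than doubles when the national total is counted as if it were a state, so it is worth checking which rows go into the "Other states" line.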