In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
#print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Introduction

The dataset we are going to explore contains public health data for the city of Chicago. The data is provided by the City of Chicago's open data platform. The documentation describes the data as follows:


"This dataset contains the cumulative number of deaths, average number of deaths annually, average annual crude and adjusted death rates with corresponding 95% confidence intervals, and average annual years of potential life lost per 100,000 residents aged 75 and younger due to selected causes of death, by Chicago community area, for the years 2006 – 2010."


Let's explore cause of death trends for people 75 and under and associate some public health indicators with cause of death.

# Load the data

Load the cause of death data and the public health indicator data.

In [None]:
# Let pandas create an index to make slicing easier --
# "Community health" index starts from 1, which might be confusing
deaths = pd.read_csv('./chicago-public-health-statistics/public-health-statistics-selected-underlying-causes-of-death-in-chicago-2006-2010.csv')
health = pd.read_csv('./chicago-public-health-statistics/public-health-statistics-selected-public-health-indicators-by-chicago-community-area.csv')

Explore the data. Some questions to think about: What caused the most deaths over the time period? Are there any outlier neighborhoods? Can we associate any public health indicators with the leading causes of death for people 75 and under in Chicago?

In [None]:
deaths.head()

What are the unique causes of death in the dataset?

In [None]:
# np.unique already returns the unique values in sorted order
np.unique(deaths['Cause of Death'])

# Data validation

# TODO: Exclude 'Chicago' overall community area from subtotals

Since the last commit, I noticed that the data includes a Community Area 0 that holds the deaths by cause for the whole city of Chicago. That is the same citywide total the analysis below tries to build by summing over community areas, so the analysis below needs to be redone with Community Area 0 excluded.

In [None]:
deaths[deaths['Community Area'] == 0]['Community Area Name'].unique()
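
One way to handle the TODO above is to split the citywide rows from the neighborhood rows before computing any subtotals. The following is a minimal sketch on a made-up toy table, not the real data: the death counts are invented, and `split_citywide` is a hypothetical helper name. It assumes, as described above, that Community Area 0 is the citywide total.

```python
import pandas as pd

# Toy stand-in for the `deaths` table; the counts are made up for illustration.
toy_deaths = pd.DataFrame({
    'Community Area': [0, 1, 2, 0, 1, 2],
    'Community Area Name': ['Chicago', 'Rogers Park', 'West Ridge'] * 2,
    'Cause of Death': ['Stroke'] * 3 + ['Lung cancer'] * 3,
    'Cumulative Deaths 2006 - 2010': [30, 10, 20, 50, 35, 15],
})


def split_citywide(df, area_col='Community Area'):
    """Return (citywide_rows, neighborhood_rows), treating area 0 as citywide."""
    is_citywide = df[area_col] == 0
    return df[is_citywide], df[~is_citywide]


citywide, neighborhoods = split_citywide(toy_deaths)

# Subtotals over neighborhoods only, so the citywide row is not double-counted.
neighborhood_totals = (
    neighborhoods
    .groupby('Cause of Death')['Cumulative Deaths 2006 - 2010']
    .sum()
)
print(neighborhood_totals)
```

Keeping the citywide rows in their own frame, rather than dropping them, leaves them available as a cross-check against the neighborhood sums.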

The data includes some aggregate categories, such as "All Causes" and "Cancer (all sites)." Additionally, according to the documentation, some categories are also counted within others: "Lung cancer" deaths are included in "Cancer (all sites)," and some "Firearm-related" deaths are also counted under "Suicide (intentional self-harm)."

We can validate this by looking at the cancer death counts.

Are all subcategories of cancer deaths included in the aggregate cancer data? Let's validate by totaling the subcategory deaths and comparing to "Cancer (all sites)."

Just glancing at the different cancer death types, it's doubtful that all kinds of cancer are represented in the subcategories. For instance, leukemia is not a category, even though it is a common cancer.

In [None]:
# Look for the different field names
causes = pd.Series(deaths['Cause of Death'].unique())
cancer = causes[causes.str.lower().str.contains('cancer')]
print(cancer)

Let's continue validating the numbers anyway.

In [None]:
# Separate deaths by all cancer and deaths by subcategories of cancer in the data
subcancer = deaths[(deaths['Cause of Death'].isin(cancer)) & (deaths['Cause of Death'] != 'Cancer (all sites)')]
cancer_all = deaths[deaths['Cause of Death'] == 'Cancer (all sites)']

If the total deaths for all subcategories of cancer equal the total deaths for "Cancer (all sites)," then we know the subcategories are exhaustive.

In [None]:
print(subcancer['Cumulative Deaths 2006 - 2010'].sum())
print(cancer_all['Cumulative Deaths 2006 - 2010'].sum())

As expected, we don't have information on cancer deaths for every specific subtype of cancer.
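
To put a number on the gap, the share of "Cancer (all sites)" deaths not covered by any listed subtype can be computed directly. This is a minimal sketch on invented counts, not results from the dataset; only the column and category names are taken from the cells above.

```python
import pandas as pd

col = 'Cumulative Deaths 2006 - 2010'

# Toy rows with made-up counts for illustration.
toy = pd.DataFrame({
    'Cause of Death': ['Cancer (all sites)', 'Lung cancer',
                       'Colorectal cancer', 'Stroke'],
    col: [100, 30, 10, 25],
})

# Flag cancer rows, then separate the aggregate row from the subtypes.
is_cancer = toy['Cause of Death'].str.contains('cancer', case=False)
is_total = toy['Cause of Death'] == 'Cancer (all sites)'

total = toy.loc[is_total, col].sum()
listed = toy.loc[is_cancer & ~is_total, col].sum()

# Fraction of all cancer deaths that no listed subtype accounts for.
unlisted_share = (total - listed) / total
print(f'{unlisted_share:.0%} of cancer deaths fall outside the listed subtypes')
```

Because both sums run over the same rows, including or excluding Community Area 0 scales both numbers together and leaves the share unchanged, as long as the same choice is applied to both.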

# Exploratory Data Analysis

Let's look at some descriptive statistics to understand the data and look for trends.

#### Neighborhoods

How many community areas are in the data? Note that this count includes Community Area 0, the citywide total.

In [None]:
len(np.unique(deaths['Community Area']))

#### Citywide statistics

Although our data is reported by neighborhood, let's look at the leading causes of death across the city. That way, we have a baseline to compare neighborhoods to.

In [None]:
deaths.groupby('Cause of Death')['Cumulative Deaths 2006 - 2010'].sum().sort_values(ascending = False).head(10)

This top ten isn't very informative because it includes some of the aggregate categories. Let's work on removing those.

We exclude the aggregate categories, such as "All Causes," and redo the list of top causes of death across the city.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Exclude the cause of death from the subtotal if it contains "all"
specific_deaths = deaths[~(deaths['Cause of Death'].str.lower().str.contains('all'))]
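
A caveat on this filter: a plain substring match on "all" also removes any cause whose name merely contains those letters. The sketch below shows the difference between a substring match and a whole-word match. The labels are a toy list, and "Falls" is a hypothetical category used only to show the failure mode; it is not assumed to appear in the dataset.

```python
import pandas as pd

# Toy labels for illustration; 'Falls' is a hypothetical specific cause.
toy_causes = pd.Series(['All Causes', 'Cancer (all sites)', 'Falls', 'Stroke'])

# Substring match: also flags 'Falls' because it contains the letters "all".
substring_hits = toy_causes.str.contains('all', case=False)

# Whole-word match: \b requires a word boundary on both sides of "all".
word_hits = toy_causes.str.contains(r'\ball\b', case=False, regex=True)

# Keep only the causes that are not aggregate categories.
specific_causes = toy_causes[~word_hits].tolist()
print(specific_causes)
```

If the set of aggregate labels is known in advance, filtering with an explicit list via `isin` would be even less ambiguous than a pattern.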

Now we have more specific insight into the top causes of death across the city.

In [None]:
# Totals for city
sd_gb = specific_deaths.groupby('Cause of Death')['Cumulative Deaths 2006 - 2010'].sum()
sd_gb.sort_values(ascending = False)

We can visualize this data to understand it better.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
# Draw on the axes created above and label through the same axes object
sd_gb.sort_values().plot(kind='barh', ax=ax)
ax.set_ylabel('')
ax.set_title(sd_gb.index.name + ' - Chicago, citywide')

From the visualization, it's clear that coronary heart disease is by far the leading specific cause of death in Chicago for the 2006 – 2010 period.

# Explore health indicators

Let's take a look at the other dataset.

In [None]:
health.head()

Let's look at the different health indicators that are available to us.

In [None]:
health.columns

We'll be able to join the cause of death dataset to the health indicators dataset on the "Community Area" key. Is there anything missing?

In [None]:
health.merge(deaths, on='Community Area').head()
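
An inner merge silently drops keys that appear in only one table, so it cannot show what is missing. A possible check, sketched here on toy frames with made-up values (the `Indicator` and `Deaths` columns are placeholders, not real column names), is an outer merge on the distinct keys with pandas' `indicator=True` option, which adds a `_merge` column recording where each key was found.

```python
import pandas as pd

# Toy stand-ins for the two tables; keys and values are made up.
toy_health = pd.DataFrame({'Community Area': [1, 2, 3],
                           'Indicator': [7.5, 7.9, 7.7]})
toy_deaths = pd.DataFrame({'Community Area': [0, 1, 2],
                           'Deaths': [60, 25, 35]})

# Compare only the distinct keys from each side.
check = (
    toy_health[['Community Area']].drop_duplicates()
    .merge(toy_deaths[['Community Area']].drop_duplicates(),
           on='Community Area', how='outer', indicator=True)
)

# 'left_only' keys exist only in health; 'right_only' only in deaths.
only_in_health = check.loc[check['_merge'] == 'left_only',
                           'Community Area'].tolist()
only_in_deaths = check.loc[check['_merge'] == 'right_only',
                           'Community Area'].tolist()
print(only_in_health, only_in_deaths)
```

In the toy data, area 0 shows up only on the deaths side, which is the kind of mismatch a citywide row could produce in the real merge.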

# To-dos, summary:

* Exclude community area 0, Chicago (citywide), from subtotals.
* Merge the public health dataset with the cause of death dataset.
* For the end of the project: to do additional analysis, we could bring in national cause of death statistics to see how Chicago, or a specific neighborhood in Chicago, compares to the rest of the US.