# How are People Dying in the United States? Visualizing Mortality

**Death** is a _morbid_ topic, but an important one for government, healthcare, economics and, of course, the medical sciences. Understanding how people die can shift research funding toward certain diseases or prompt preventative measures for groups at higher risk.

In the United States, the **Centers for Disease Control and Prevention (CDC)** collected [mortality data](https://wonder.cdc.gov/ucd-icd10.html) covering 1999-2015. The data is rich in demographic information, including age at death, underlying cause of death, gender, race, and geographic location (city/county).

With this data, there are many **questions we can ask about death**:
- What are the top causes of death in the United States? 
- Are men more likely to die than women? Does it depend on the cause of death? Does it depend on age?
- What causes of death are becoming more or less prevalent over time?

#### Learning Data Visualization

In this notebook, you will be introduced to **Matplotlib**, one of the most popular packages for data visualization in Python.

There are many different ways to _use_ **Matplotlib**, but this notebook will simply go through the basics. If you'd like to continue learning more about **Matplotlib**, you can review the documentation [here](http://matplotlib.org/).

<a id="mpl"></a>
## Import **Matplotlib** library

To begin, let's make sure that we have the appropriate libraries for plotting.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# This magic command renders charts inline, directly below the cell that creates them.
# It is not essential at this stage, but it is a good habit in notebooks.
%matplotlib inline 

import sys

<a id="getdata"></a>
## Getting the Data

Now read the file `deaths.csv`:

In [3]:
df = pd.read_csv("deaths.csv", encoding='GBK')  # GBK encoding lets pandas read Chinese characters in the file
df


Great! The data is now read into the variable, **`df`**.

<a id="understand"></a>
## Understanding the Data

#### The top 10 rows:

In [None]:
df.head(10)

#### The bottom 5 rows:

In [None]:
df.tail(5)

#### What is the age range of the data?

In [None]:
df.Age.describe()
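For orientation, here is a minimal sketch of two inspection methods that are handy in this section, run on a small made-up table rather than the real `deaths.csv`. The column names follow the ones used in this notebook; the values (including the `F`/`M` labels) are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from deaths.csv
toy = pd.DataFrame({
    "Year":   [2005, 2005, 2010, 2015],
    "Gender": ["F", "M", "F", "M"],
    "Age":    [30, 45, 60, 75],
    "Deaths": [10, 20, 30, 40],
})

# describe() summarizes a numeric column: count, mean, std, min, quartiles, max
age_summary = toy["Age"].describe()
print(age_summary)

# unique() lists each distinct value once, in order of first appearance
years = toy["Year"].unique()
print(years)
```

The same calls should work on any column of `df`, which may help with the questions that follow.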

#### Which years are included in this dataset?

In [None]:
# write your code here


#### To confirm, does the Gender column only contain male and female?

In [None]:
# write your code here



#### What are the summary statistics for the number of deaths?

In [None]:
# write your code here



#### What are the causes of death in this dataset?

In [None]:
causes = pd.DataFrame(df['Cause'].unique(), columns=['Death Cause']) # remove duplicates
causes = causes.sort_values(by='Death Cause')
causes.index = range(0, len(causes)) # re-index the row numbers
causes

<a id="year"></a>
## Deaths: by Year

#### Let's look at the data again:

In [None]:
df.head(3)

#### How many deaths are there overall in 2005, 2010 and 2015?

First, we **group the data by year**; then, by selecting the **Deaths** column, we can calculate the **sum of deaths per year**:

In [None]:
by_year = df.groupby("Year").Deaths.sum()
by_year
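To make the grouping step concrete, the sketch below applies the same `groupby(...).Deaths.sum()` pattern to a tiny made-up table, small enough to check by hand. The numbers are invented and do not come from the CDC data.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Year":   [2005, 2005, 2010, 2010, 2015],
    "Deaths": [10, 5, 7, 3, 20],
})

# Rows sharing a Year are pooled, then their Deaths values are added up
toy_by_year = toy.groupby("Year").Deaths.sum()
print(toy_by_year)
```

The result is a Series indexed by year, with one total per year.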

After grouping, plotting is easy. With **`.plot()`**, you just need to choose the kind of graph:

In [None]:
by_year.plot(kind="bar")

#### We can re-write the code for the same plot so it's easier to follow:

In [None]:
# Easier-to-read syntax: a trailing \ tells Python that the statement continues on the next line
df.groupby("Year")\
.Deaths\
.sum()\
.plot(kind="bar")
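An alternative to trailing backslashes, shown here as a sketch on made-up data, is to wrap the whole chain in parentheses. Python then continues the expression across lines on its own, and a comment can sit next to each step (a comment placed after a `\` is a syntax error).

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Year":   [2005, 2005, 2010, 2015],
    "Deaths": [10, 5, 7, 20],
})

totals = (
    toy
    .groupby("Year")   # pool rows by year
    .Deaths            # keep only the Deaths column
    .sum()             # add up deaths within each year
)
print(totals)
```

Appending `.plot(kind="bar")` as a last step inside the parentheses would draw the same kind of bar chart as the backslash version.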

## Split data into years

It's going to get confusing if we always work with all three years (2005, 2010 and 2015) at once.
  
Let's just start by exploring only the **deaths in 2015**.

<a id="gender"></a>
## Deaths: Male vs. Female

In 2015, which gender had more deaths?

In [1]:
df2015 = df[df.Year == 2015]

df2015\
.groupby("Gender")\
.Deaths\
.sum()


#### Can you create a simple bar graph to compare the total number of deaths for each gender?

In [None]:
## write your code here


Let's add **color** to the graph: <font color=red>red</font> for **Female** and  <font color=blue>blue</font> for **Male**.

We can add color using:
> `.plot(kind="bar", ` **`color=["red", "blue"]`**`)`

In [None]:
df2015\
.groupby("Gender")\
.Deaths\
.sum()\
.plot(kind="bar", color=["red", "blue"])
# Try both options. How does it work?
# .plot(kind="bar", color=["green", "red", "blue"]) 
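The colors in the list are applied to the bars in order, and `groupby` sorts its keys, so which bar gets which color depends on how the labels sort. One way to make the pairing explicit, sketched here on made-up data, is to build the color list from a dictionary. The `F`/`M` labels and the `color_map` name are assumptions for illustration; the real `Gender` column may use different values.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Deaths": [12, 8, 5, 6],
})

totals = toy.groupby("Gender").Deaths.sum()   # index is sorted: F, then M

# Hypothetical mapping from label to color
color_map = {"F": "red", "M": "blue"}
colors = [color_map[g] for g in totals.index]

ax = totals.plot(kind="bar", color=colors)
```

With this pattern the colors stay attached to the right labels even if the order of the bars changes.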

<a id="age"></a>
## Deaths: by Age

At what age did people die in 2015?

In [None]:
## write your code here


The graph above is difficult to read -- it's **too small**! 

We can increase the figure size using: 
> `.plot(kind="bar", ` **`figsize=[18, 6]`**`)`

In [None]:
## write your code here


# Play with other options to understand how they work:
# .plot(kind="bar", color="grey", figsize=[9, 6])
# .plot(kind="bar", color="grey", figsize=[150, 6])

<font color="green"> What do you notice from the graph above? </font>

Conclusions:

1. 

<a id="ageXgender"></a>
## Deaths: by Age & Gender

#### Is age of death affected by gender?

To group the data by two variables, simply pass a list:
> `.groupby(`**`["Age", "Gender"]`**`)`

In [None]:
df2015\
.groupby(["Age", "Gender"])\
.Deaths\
.sum()
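Grouping by two columns gives a result indexed by both `Age` and `Gender`. The sketch below, on made-up data, shows that shape and how `.unstack(1)` moves the second index level (`Gender`) into columns, which is the form the plots in this section are built from. The `F`/`M` labels are assumed for illustration.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Age":    [30, 30, 60, 60],
    "Gender": ["F", "M", "F", "M"],
    "Deaths": [1, 2, 3, 4],
})

grouped = toy.groupby(["Age", "Gender"]).Deaths.sum()
print(grouped)        # one value per (Age, Gender) pair

wide = grouped.unstack(1)
print(wide)           # rows: Age, columns: F and M
```

Each column of `wide` becomes its own line, bar series, or subplot when plotted.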

#### Let's create two subplots -- one for female deaths and one for male deaths by age.

We can create subplots using: 
> ...  
> **`.unstack(1)`**`\`  
> `.plot(kind="bar", ` **` subplots=True`**`)`

In [None]:
# .unstack(1) moves index level 1 (Gender) into columns; subplots=True then draws one plot per column
df2015\
.groupby(["Age", "Gender"])\
.Deaths\
.sum()\
.unstack(1)\
.plot(kind="bar", color=["red", "blue"], figsize=[18, 10], subplots=True)

<font color="green"> What do you notice from the graph above? </font>

Conclusions:

1. 

We can also **stack** the Male and Female together to form a total bar.

To create stacked bar charts, we can use:
>`.plot(kind="bar", ` **`stacked=True`**`)`

In [None]:
df2015\
.groupby(["Age", "Gender"])\
.Deaths\
.sum()\
.unstack(1)\
.plot(kind="bar", color=["red", "blue"], figsize=[18, 10], stacked=True)

But this isn't very helpful because it is not easy to compare M to F in a stacked bar chart.

<font color="green"> Give examples of when a stacked bar chart would be useful</font>

Examples (2):

1. ...

#### Because `Age` is a continuous variable, it might help to compare lines on a **line graph**. 

We can create line graphs using:
> `.plot(kind="`**`line`**`")`

In [None]:
df2015\
.groupby(["Age", "Gender"])\
.Deaths\
.sum()\
.unstack(1)\
.plot(kind="line", figsize=[18, 10], subplots=True)

With lines instead of bars, it becomes easier to compare the two genders across age in **one line graph**.

To plot multiple groups in the same graph, make sure that you have **unstacked the data** and **kept subplots as False (default)** :
>`...`  
>**`.unstack(1)`**`\`  
>`.plot(kind="line")   #Default: subplots = False`

In [None]:
# Try deleting the .unstack(1) line and check the result
df2015\
.groupby(["Age", "Gender"])\
.Deaths\
.sum()\
.unstack(1)\
.plot(kind="line", color=["red", "blue"], figsize=[18, 6])

Let's add a main title and a y-axis label:

> `.plot(kind = "line",` **`title = "Deaths in 2015 by Age and Gender"`**`)`


In [None]:
## write your code here
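As one possible approach, sketched on made-up data: `.plot()` returns a Matplotlib `Axes` object, so the title can be passed to `.plot()` and the y-axis label set afterwards with `set_ylabel`. The data, title, and label text here are only examples.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Age":    [30, 30, 60, 60],
    "Gender": ["F", "M", "F", "M"],
    "Deaths": [1, 2, 3, 4],
})

ax = (
    toy
    .groupby(["Age", "Gender"])
    .Deaths
    .sum()
    .unstack(1)
    .plot(kind="line", title="Deaths by Age and Gender (made-up data)")
)
ax.set_ylabel("Number of deaths")   # y-axis label is set on the returned Axes
```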


<a id="causeX2015"></a>
## Deaths: Top Causes of Death in 2015

Let's look at the top causes of death in 2015:

In [None]:
df2015\
.groupby(["Cause"])\
.Deaths\
.sum()\
.plot(kind="bar", color="black")

This is very difficult to read. We need to **sort the data** and **rotate the graph (horizontal bar chart)**.

1. We can **sort** the data using:
> **`.sort_values('Deaths', ascending=True)`**

2. We can create a **horizontal bar chart** using:
> **`.plot(kind="barh")`**

In [None]:
# .agg({'Deaths': 'sum'}) sums the Deaths column and returns a DataFrame. How does this differ from .Deaths.sum()?
df2015\
.groupby(["Cause"])\
.agg({'Deaths': 'sum'})\
.sort_values('Deaths', ascending=True)\
.plot(kind="barh", legend=False, color="black", figsize=[9, 12])
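Two details in the cell above are worth a closer look, sketched here on made-up data. Passing a dictionary to `.agg()` returns a DataFrame with a `Deaths` column, which is why `sort_values('Deaths')` takes a column name, whereas `.Deaths.sum()` returns a Series. And because a horizontal bar chart draws the first row at the bottom, sorting in ascending order puts the largest cause at the top. The cause names below are placeholders.

```python
import pandas as pd

# Made-up numbers and placeholder cause names, for illustration only
toy = pd.DataFrame({
    "Cause":  ["A", "B", "A", "C"],
    "Deaths": [5, 20, 10, 1],
})

as_frame = toy.groupby("Cause").agg({"Deaths": "sum"})   # DataFrame with a Deaths column
as_series = toy.groupby("Cause").Deaths.sum()            # Series

# Ascending sort: smallest first, so barh would place the largest bar on top
ordered = as_frame.sort_values("Deaths", ascending=True)
print(ordered)
```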

#### Let's look at the Top 10 most common causes of death. 

In [None]:
## write your code here


<a id="causeXyear"></a>
## Deaths: Top Causes of Death by Year

In [None]:
df\
.groupby(["Cause","Year"])\
.agg({'Deaths': 'sum'})\
.sort_values('Deaths', ascending = False)\
.unstack(1)\
.plot(kind="barh", legend=True, figsize=[10, 24])

The visualization above contains a lot of information (maybe too much!). But did you notice? 

**Deaths caused by HIV disease have decreased every five years since 2005!**

<a id="causeXgender"></a>
## Deaths: Causes of Death by Gender

In [None]:
## write your code here


<a id="causeXage"></a>
## Deaths: Causes of Death by Age

Because there are so many causes of death, let's choose just a few causes to visualize by age of death:
- "Alzheimer's disease" 
- "Diseases of heart" 
- "Malignant neoplasms" 
- "Accidents (unintentional injuries)"

In [None]:
clist = ["Alzheimer's disease", 
         "Diseases of heart", 
         "Malignant neoplasms", 
         "Accidents (unintentional injuries)"]

df2015_clist = df2015[df2015["Cause"].isin(clist)]  # keep only rows whose Cause is in clist

df2015_clist\
.groupby(["Age", "Cause"])\
.agg({'Deaths': 'sum'})\
.sort_values('Deaths', ascending=False)\
.unstack(1)\
.plot(kind="line", legend=True, figsize=[10, 6])

<font color="green"> Discuss: What do you notice from the graph above? </font>

Conclusions:

1. 

<a id="causeXgenderXage"></a>
## Deaths: Causes of Death by Gender & Age

This visualization is particularly difficult because there are 2 genders x 3 years x 51 causes. It's virtually impossible to place all of this data on a single graph and make it easy to understand.

The best thing to do is to visualize some of the data, or just the data that is most interesting.
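One way to narrow things down, sketched on made-up data, is to keep only the causes with the most deaths before plotting. `nlargest` is used here as one option; the cutoff of 2 and the cause names are placeholders.

```python
import pandas as pd

# Made-up numbers and placeholder cause names, for illustration only
toy = pd.DataFrame({
    "Cause":  ["A", "B", "C", "A", "B", "C"],
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "Deaths": [5, 20, 1, 10, 30, 2],
})

# Total deaths per cause, then the names of the 2 largest
top_causes = toy.groupby("Cause").Deaths.sum().nlargest(2).index

# Keep only the rows belonging to those causes
subset = toy[toy["Cause"].isin(top_causes)]
print(subset)
```

Looping over a short list like `top_causes` produces a handful of figures instead of one per cause.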

In [None]:
clist = df.Cause.unique()

for cause in clist:
    df2015_clist = df2015[df2015["Cause"].isin([cause])]
    
    df2015_clist\
    .groupby(["Age", "Gender"])\
    .agg({'Deaths': 'sum'})\
    .unstack([1])\
    .plot(kind="line", legend=True, color=('r', 'b'), figsize=[10, 8], title=str(cause))

<font color = "green"> Summarize all your conclusions and formulate a general conclusion based on the studied data: </font>

Dataset conclusions:

