# Creative Programming Assignment 01

Author: Kelly Zhang

### Monthly Provisional Counts of Death

This dataset is published by the Centers for Disease Control and Prevention on data.gov. It details the count of deaths for each month, ranging from 2019 to 2021. Death counts are further broken down per age group, sex, and race/ethnicity for select underlying causes.

You can access the dataset here: 
https://catalog.data.gov/dataset/monthly-provisional-counts-of-deaths-by-age-group-sex-and-race-ethnicity-for-select-causes

I will answer the following questions:
1. Which age group has the highest death count per month?
2. What are the top three causes of death over the years? 
3. What is the top cause of death per age group?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 1. Load the data into a pandas dataframe

In [None]:
df = pd.read_csv('Monthly Death Counts.csv')
df

**Clean up data**


In [None]:
df.columns

In [None]:
# merge respiratory system ailments
df['Respiratory diseases'] = df['Chronic lower respiratory diseases (J40-J47)'] + df['Other diseases of respiratory system (J00-J06,J30-J39,J67,J70-J98)']

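# merge both COVID-19 columns
# (caution: the multiple-cause count may already include deaths where COVID-19
# was the underlying cause, in which case this sum double-counts them)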
df['COVID-19'] = df['COVID-19 (U071, Multiple Cause of Death)'] + df['COVID-19 (U071, Underlying Cause of Death)']

df

Drop unnecessary columns

In [None]:
to_drop = ['AnalysisDate', 
          'Start Date',
          'End Date',
          'Jurisdiction of Occurrence', 
          'Chronic lower respiratory diseases (J40-J47)',
          'Other diseases of respiratory system (J00-J06,J30-J39,J67,J70-J98)',
          'COVID-19 (U071, Multiple Cause of Death)',
          'COVID-19 (U071, Underlying Cause of Death)']

df.drop(to_drop, inplace=True, axis=1)
df

Modify column names to make them simpler

In [None]:
df.rename(
    columns={'Date Of Death Year': 'Year', 
         'Date Of Death Month': 'Month',
         'Sex': 'Sex',
         'Race/Ethnicity': 'Race/Ethnicity', 
         'AgeGroup': 'Age Group', 
         'AllCause': 'Total', 
         'NaturalCause': 'Natural Causes',
         'Septicemia (A40-A41)': 'Septicemia', 
         'Malignant neoplasms (C00-C97)': 'Malignant neoplasms',
         'Diabetes mellitus (E10-E14)': 'Diabetes mellitus', 
         'Alzheimer disease (G30)': 'Alzheimer disease',
         'Influenza and pneumonia (J09-J18)': 'Influenza and pneumonia',
         'Nephritis, nephrotic syndrome and nephrosis (N00-N07,N17-N19,N25-N27)': 'Nephritis, nephrotic syndrome and nephrosis',
         'Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified (R00-R99)': 'Not elsewhere classified',
         'Diseases of heart (I00-I09,I11,I13,I20-I51)': 'Heart diseases',
         'Cerebrovascular diseases (I60-I69)': 'Cerebrovascular diseases',
         'Respiratory diseases': 'Respiratory diseases',
         'COVID-19': 'COVID-19'}, 
      inplace=True)

df

Drop any rows with NaN values

In [None]:
df = df.dropna()
df

Change the 'Male' value in the 'Sex' column to 'M'

In [None]:
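# assign the result back; calling replace(..., inplace=True) on a selected
# column may not update df in newer pandas versions
df['Sex'] = df['Sex'].replace('Male', 'M')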
df

Combine Year and Month into one column, 'Date', for easy grouping. Set every entry to Day 1, because the Day is not defined in this dataset.

In [None]:
df['Date'] = pd.to_datetime(df[['Year','Month']].assign(DAY=1))
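
As a quick check of how this assembly behaves, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names mirror the notebook. `pd.to_datetime` builds a date from year, month, and day columns, and it matches those column names case-insensitively.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({'Year': [2019, 2020, 2021], 'Month': [1, 6, 12]})

# Add a constant day column so to_datetime has year, month, and day to assemble.
toy['Date'] = pd.to_datetime(toy[['Year', 'Month']].assign(DAY=1))

print(toy['Date'].dt.strftime('%Y-%m-%d').tolist())
```

Every resulting date falls on the first of its month, so grouping by `Date` is equivalent to grouping by year and month together.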

In [None]:
# Drop Month and Year columns
df.drop(columns=['Month', 'Year'], inplace=True)

# Move Date column to first column
date_column = df.pop('Date')
df.insert(0, 'Date', date_column)

df

### 2. Rough Overview of data:

In [None]:
df.describe()

### 3. Analyze Data

a. Print each column name alongside its value in the first row

In [None]:
first_row = df.values[0]  # values of the first row, in column order
for column, value in zip(df.columns, first_row):
    print(column, ":", value)

b. Sum the COVID-19 deaths for each month

In [None]:
covid_df = df.groupby('Date')['COVID-19'].agg(['sum'])
covid_df

c. Simple plot: Number of COVID-related deaths per month

In [None]:
plt.plot(covid_df)
plt.xlabel("Month")
plt.xticks(rotation=90)
plt.ylabel("Number of COVID-19 Deaths")
plt.title("Number of COVID-19 Deaths Per Month")

### Q1. Which age group has the highest death count per month?

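d. Pivot table of COVID-19 deaths by age group, and a plot of part of the pivot table

In [None]:
# no aggfunc is given, so pivot_table averages the COVID-19 counts
# over the rows (sex and race/ethnicity) in each date and age group
pivot = pd.pivot_table(df,
                       values='COVID-19',
                       index=['Date'],
                       columns=['Age Group'])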
pivot.tail()
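
Because `pd.pivot_table` aggregates with the mean unless told otherwise, the values above are averages rather than counts. The sketch below uses a small made-up frame (not the real dataset) to compare the default with `aggfunc='sum'`, which would give total counts per age group instead.

```python
import pandas as pd

# Made-up rows for a single month; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01'] * 4),
    'Age Group': ['75-84 years', '75-84 years',
                  '85 years and over', '85 years and over'],
    'Sex': ['F', 'M', 'F', 'M'],
    'COVID-19': [10, 30, 20, 60],
})

# Default aggregation: mean over the rows in each (Date, Age Group) cell.
mean_pivot = pd.pivot_table(toy, values='COVID-19',
                            index='Date', columns='Age Group')

# Explicit sum: total count in each (Date, Age Group) cell.
sum_pivot = pd.pivot_table(toy, values='COVID-19',
                           index='Date', columns='Age Group',
                           aggfunc='sum')

print(mean_pivot)
print(sum_pivot)
```

If every age group has the same number of rows per month, the two versions rank the age groups identically. After `dropna()` that may no longer hold, so the sum is the safer choice when the question is about counts.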

In [None]:
age_groups = ['0-4 years', '5-14 years', '15-24 years', '25-34 years',
              '35-44 years', '45-54 years', '55-64 years', '65-74 years',
              '75-84 years', '85 years and over']
pivot[age_groups].plot(figsize=(10,6))
plt.grid()

### Q2. What are the top three causes of death over the years? 

In [None]:
# get the aggregate sum of each numeric column
col_sums = df.sum(axis=0, numeric_only=True)

# remove the first entry ('Total'), which is the all-cause count rather than a specific cause
col_sums.drop(index=col_sums.index[0], inplace=True)

col_sums

In [None]:
# plot each cause of death vs number of deaths
plt.plot(col_sums)
plt.xlabel("Cause of Death")
plt.xticks(rotation=90)
plt.ylabel("Number of Deaths")
plt.title("Number of Deaths Per Cause of Death")

### Q3. What is the top cause of death per age group?

In [None]:
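# sum the deaths for each cause of death, per age group
# (numeric_only=True skips the Date, Sex, and Race/Ethnicity columns)
df3 = df.groupby(by=['Age Group']).sum(numeric_only=True)
df3 = df3.iloc[: , 1:] # remove first column, which is "Total"
df3 = df3.transpose() # swap rows and columns so each cause of death is a row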
df3

In [None]:
df3.plot(figsize=(15,6)) # one line per age group, causes of death along the x-axis

plt.xlabel("Cause of Death", fontsize=14)
plt.xticks(rotation=90, fontsize=14)
plt.ylabel("Number of Deaths", fontsize=14)
plt.title("Number of Deaths Per Cause of Death", fontsize=20)

plt.show()

### 4. Discussion

I imported the dataset, AH Monthly Provisional Counts of Deaths for Select Causes of Death by Sex, Age, and Race and Hispanic Origin, which contains provisional monthly death counts from 2019 to 2021. The counts are broken down by cause, sex, age group, and race/ethnicity.

I was curious about the top causes of death, which age group had the highest death counts, and what the top cause of death per age group is. 

The age group with the highest death counts is 85 years and over; across all age groups, the highest counts occur in January 2021.

The top three causes of death are natural causes (count: 7973174), heart diseases (count: 1845070), and malignant neoplasms (count: 1639339), with the fourth highest being COVID-19 (count: 1352990).
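
The ranking can also be read off programmatically instead of from the plot. The sketch below uses `Series.nlargest` on hypothetical totals (illustrative numbers, not the dataset's real counts). It also shows the ranking with `Natural Causes` removed, on the assumption that this column is an umbrella category covering the individual causes rather than a cause in its own right.

```python
import pandas as pd

# Hypothetical totals for illustration only; not the dataset's real counts.
col_sums = pd.Series({
    'Natural Causes': 900,
    'Heart diseases': 300,
    'Malignant neoplasms': 250,
    'COVID-19': 200,
    'Septicemia': 20,
})

# Three largest entries, sorted from highest to lowest.
top_three = col_sums.nlargest(3)

# Same ranking without the umbrella category (an assumption about the data).
top_three_specific = col_sums.drop('Natural Causes').nlargest(3)

print(top_three)
print(top_three_specific)
```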

The top cause of death for each age group appears to be natural causes, with malignant neoplasms, heart diseases, or COVID-19 as the second-highest cause, depending on the age group.
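
One way to confirm the per-age-group reading without relying on the line plot is `DataFrame.idxmax`, which returns the label of the largest value in each row. This is a minimal sketch on made-up counts; the frame `by_age` is hypothetical and stands in for the per-age-group sums before they are transposed.

```python
import pandas as pd

# Made-up counts for illustration; rows are age groups, columns are causes.
by_age = pd.DataFrame(
    {
        'Heart diseases': [5, 400],
        'Malignant neoplasms': [8, 300],
        'COVID-19': [2, 350],
    },
    index=pd.Index(['5-14 years', '85 years and over'], name='Age Group'),
)

# For each age group (row), the column name holding the largest count.
top_cause = by_age.idxmax(axis=1)

print(top_cause)
```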