The [Tennessee Department of Health](https://www.tn.gov/health/health-program-areas/statistics/health-data/death-statistics.html) publishes data on the number of deaths by cause, broken down by county. In this notebook, you'll see how to use for loops and functions to efficiently read in, clean up, and combine data from multiple files.

In [None]:
import pandas as pd

The first dataset we'll work with is contained in `TN Deaths Malignant Neoplasms - 2018.xlsx`.

If you inspect the file, you'll see that the data is divided up into two tables.

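First, let's see how we can read in the left-hand table. We can use the `read_excel` function and point it to columns A through G. Passing `header = 4` tells pandas to take the column names from the fifth row of the sheet, since rows are counted from zero.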

In [None]:
neoplasms = pd.read_excel('data/TN Deaths Malignant Neoplasms - 2018.xlsx',
                          header = 4,
                          usecols = 'A:G'
                          )

In [None]:
neoplasms

We don't need the first row or the last few rows, so we can drop them by slicing with `.loc`. Note that `.loc` slices by label and includes both endpoints.

In [None]:
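# .copy() gives us an independent DataFrame, which avoids a SettingWithCopyWarning
# when we assign to columns later on
neoplasms = neoplasms.loc[1:48].copy()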

In [None]:
neoplasms

Next, let's give the DataFrame better column names.

In [None]:
colnames = ['county', 'total_number', 'total_rate', 'white_number', 'white_rate', 'black_number', 'black_rate']
neoplasms.columns = colnames

In [None]:
neoplasms.info()

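A couple of columns are being treated as objects rather than as numeric values. We can fix this using the `pd.to_numeric` function. Setting `errors = 'coerce'` replaces any value that can't be parsed as a number with `NaN` instead of raising an error.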

In [None]:
# Convert the total_number column to numeric
neoplasms['total_number'] = pd.to_numeric(neoplasms['total_number'], errors = 'coerce')

# Convert the total_rate column to numeric
neoplasms['total_rate'] = pd.to_numeric(neoplasms['total_rate'], errors = 'coerce')

# We could continue this for all of the columns, but we can save ourselves some trouble by using a for loop
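
To see what `errors = 'coerce'` does in isolation, here is a minimal sketch on a made-up Series. The values, including the `'*'` placeholder, are hypothetical and are not taken from the spreadsheet.

```python
import pandas as pd

# Hypothetical raw values stored as objects; '*' stands in for any non-numeric entry
raw = pd.Series(['12', '7.5', '*'], dtype=object)

# errors='coerce' turns anything that can't be parsed into NaN
converted = pd.to_numeric(raw, errors='coerce')

print(converted)
```

The two parseable strings become floats and the placeholder becomes a missing value, so the column can be used in calculations afterwards.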

Notice that we are using the same setup for each column, and the only thing that changes is the column name. This is the perfect opportunity to use a loop to save typing and avoid copy/paste errors.

In [None]:
numeric_cols = ['total_number', 'total_rate', 'white_number', 'white_rate',
       'black_number', 'black_rate']

# Use a for loop over the list of numeric columns to convert each one
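
As a rough illustration of the pattern, the sketch below loops over a list of column names on a small toy DataFrame. The toy data and the names `toy` and `toy_numeric_cols` are assumptions made for this example; the same loop shape should carry over to `neoplasms` and `numeric_cols`.

```python
import pandas as pd

# Hypothetical stand-in for the table, with numbers stored as strings
toy = pd.DataFrame({
    'county': ['Anderson', 'Bedford'],
    'total_number': ['10', '*'],
    'total_rate': ['1.5', '2.5'],
}, dtype=object)

toy_numeric_cols = ['total_number', 'total_rate']

# Same conversion for every column; only the column name changes
for col in toy_numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

toy.info()
```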

Notice that the entire process will be identical for the right-hand table, with the only change being the columns that we point `read_excel` to. This is another opportunity to make use of a for loop. Since we will be reusing this piece of code later on other causes of death, let's switch from calling the DataFrame `neoplasms` to a more generic `df`.

We'll create a list to store the two resulting DataFrames.

In [None]:
# Convert the above to a for loop
dfs = []
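
One way to picture the loop is sketched below, using a small CSV string as a stand-in for the spreadsheet so that the example runs without any files. The CSV text, the column positions, and the name `toy_dfs` are all assumptions for illustration; with the real workbook, the loop would iterate over the Excel column ranges passed to `read_excel`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a sheet with two side-by-side tables
sheet = (
    "county_left,number_left,county_right,number_right\n"
    "Anderson,10,Marion,20\n"
    "Bedford,30,Marshall,40\n"
)

toy_dfs = []
for cols in [[0, 1], [2, 3]]:
    # Same reading and renaming steps each time; only usecols changes
    df = pd.read_csv(io.StringIO(sheet), usecols=cols)
    df.columns = ['county', 'total_number']
    toy_dfs.append(df)

print(len(toy_dfs))
```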


Finally, we can combine our two dataframes into one using `concat`.

In [None]:
# Combine them together
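
For reference, here is a minimal sketch of stacking two DataFrames that share the same columns. The toy frames `left` and `right` are made up for this example.

```python
import pandas as pd

# Hypothetical halves of a table, with identical column names
left = pd.DataFrame({'county': ['Anderson', 'Bedford'], 'total_number': [10, 30]})
right = pd.DataFrame({'county': ['Marion', 'Marshall'], 'total_number': [20, 40]})

# ignore_index=True rebuilds the index as 0..n-1 instead of repeating each frame's labels
stacked = pd.concat([left, right], ignore_index=True)

print(stacked)
```

Without `ignore_index = True`, the result would keep the original index labels of each piece, which can lead to duplicate labels.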


We'll keep just one of the value columns, `total_number`. 

There is also a heart disease dataset that we would like to combine with this one, so let's rename this column to something more descriptive.

In [None]:
neoplasms = neoplasms[['county', 'total_number']].rename(columns = {'total_number': 'neoplasms'})

Now, let's reuse the code block above to read in the heart disease data.

In [None]:
# Fill this in with code to read in the heart disease data


Now, let's combine the results so that we end up with one row per county and one column per cause of death.

In [None]:
# Combine the two dataframes
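
A merge on the shared `county` column is one way to line up the causes side by side. The sketch below uses two made-up frames, `toy_neoplasms` and `toy_heart`, with invented counts.

```python
import pandas as pd

# Hypothetical per-cause tables; note the rows are in a different order
toy_neoplasms = pd.DataFrame({'county': ['Anderson', 'Bedford'], 'neoplasms': [10, 30]})
toy_heart = pd.DataFrame({'county': ['Bedford', 'Anderson'], 'heart_disease': [25, 15]})

# Rows are matched on the county name, not on their position
combined = pd.merge(left=toy_neoplasms, right=toy_heart, on='county')

print(combined)
```

Because the match is made on the key column, the row order of the two inputs does not need to agree.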


What if we want to pull in all of the data that we have? We could copy and paste the above for loop and change the filenames, but that would repeat the same code many times. We are better off writing a function. Let's convert the above into a function that returns a cleaned-up DataFrame.

In [None]:
# Create a function for the above work
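
The general shape of such a function might look like the sketch below. To keep it runnable without the Excel files, this hypothetical helper `clean_counts` takes an in-memory DataFrame rather than a filename; a version for this notebook would also need to include the reading steps.

```python
import pandas as pd

def clean_counts(raw, cause):
    """Hypothetical helper: return county plus one numeric column named after the cause."""
    df = raw.copy()  # leave the caller's DataFrame untouched
    df.columns = ['county', 'total_number']
    df['total_number'] = pd.to_numeric(df['total_number'], errors='coerce')
    return df.rename(columns={'total_number': cause})

# Made-up raw table with messy headers and a non-numeric placeholder
toy_raw = pd.DataFrame({'COUNTY': ['Anderson', 'Bedford'], 'NUMBER': ['10', '*']}, dtype=object)
toy_clean = clean_counts(toy_raw, 'accident')

print(toy_clean)
```

Passing the cause name as a parameter means each call produces a distinctly named value column, which makes the later merge straightforward.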


Test it out on the accident deaths.

Since we have a function, we can use it to pull in all of the data we need. A list of filenames would make our job easier, and fortunately the `glob` module from the standard library can build one for us.

In [None]:
import glob
filenames = glob.glob('data/*.xlsx')

filenames
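
`glob.glob` does not promise any particular ordering of its results, so the list can differ between machines. The sketch below, which creates a few empty throwaway files in a temporary directory (the file names are made up), shows how wrapping the call in `sorted` gives a reproducible order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files; only the .xlsx ones should match
    for name in ['b.xlsx', 'a.xlsx', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # sorted() makes the order reproducible across machines
    found = sorted(glob.glob(os.path.join(tmp, '*.xlsx')))
    basenames = [os.path.basename(path) for path in found]

print(basenames)
```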

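Now, make a list of causes to go with the filenames. The causes must be in the same order as the filenames printed above. `glob` does not guarantee any particular order, so check the output on your machine before relying on this list.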

In [None]:
causes = ['diabetes', 'accident', 'cerebrovascular', 'heart_disease', 'malignant_neoplasms']

Using these two lists and a for loop, we can pull in and clean up all of the needed data.

In [None]:
dfs = []

# Loop through the filenames and causes to pull in all of the data
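
When two lists need to be walked in step, `zip` pairs up their elements. The sketch below uses placeholder filenames and cause names (all hypothetical) and simply records each pair; in the notebook, the loop body would instead call the cleaning function and append its result to `dfs`.

```python
# Hypothetical parallel lists; the names are placeholders, not the real files
toy_filenames = ['data/file_one.xlsx', 'data/file_two.xlsx']
toy_causes = ['cause_one', 'cause_two']

# zip stops at the shorter list without complaint, so check the lengths first
if len(toy_filenames) != len(toy_causes):
    raise ValueError('filenames and causes must have the same length')

pairs = []
for filename, cause in zip(toy_filenames, toy_causes):
    pairs.append((filename, cause))

print(pairs)
```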

Finally, let's combine all of the resulting dataframes together into one dataframe.

In [None]:
deaths = dfs[0]

for df in dfs[1:]:
    # Name the join key explicitly rather than relying on shared column names
    deaths = pd.merge(left = deaths, right = df, on = 'county')
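
The same fold can also be written with `functools.reduce`, which applies a two-argument function cumulatively across a list. This is an alternative to the explicit loop, not a replacement for it, and the toy frames below are made up for the example.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-cause tables that all share a county column
toy_dfs = [
    pd.DataFrame({'county': ['Anderson', 'Bedford'], 'diabetes': [1, 2]}),
    pd.DataFrame({'county': ['Anderson', 'Bedford'], 'accident': [3, 4]}),
    pd.DataFrame({'county': ['Anderson', 'Bedford'], 'heart_disease': [5, 6]}),
]

# Merge the first two frames, then merge that result with the third, and so on
toy_deaths = reduce(lambda left, right: pd.merge(left, right, on='county'), toy_dfs)

print(toy_deaths)
```

Both versions use the default inner join, so a county that is missing from any one table would be dropped from the final result.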

In [None]:
deaths