***
# ETL Project: Extract, Transform, Load
***
## Step 1: Extract
> The data sources are CSV files. After exploring the data needed for this project, we start the retrieval process by reading the dataset into a pandas DataFrame.
> The dataset used in this part is intended to help answer the following question:
> * What is the prevalence of overdose deaths across the US from 1999 to 2014?
> * Data source: https://data.world/health/opioid-overdose-deaths

### Importing Dependencies

In [1]:
# Import Dependencies:
import pandas as pd
import os

In [3]:
# Building the CSV file path in an OS-independent way:
death_cause = os.path.join("Resources", "Multiple_Cause_death.csv")
# Echo the path so the cell displays it:
death_cause

'Resources/Multiple_Cause_death.csv'

In [8]:
# Reading the data file into a pandas DataFrame, reusing the path built above:
death_cause_df = pd.read_csv(death_cause)
death_cause_df.head()

Unnamed: 0,State,Year,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Prescriptions Dispensed by US Retailers in that year (millions)
0,Alabama,1999,39,4430141,0.9,0.6,1.2,116
1,Alabama,2000,46,4447100,1.0,0.8,1.4,126
2,Alabama,2001,67,4467634,1.5,1.2,1.9,138
3,Alabama,2002,75,4480089,1.7,1.3,2.1,142
4,Alabama,2003,54,4503491,1.2,0.9,1.6,149


***
## Step 2: Transform
***
> Transforming the dataset to suit the needs of our project. This includes:
> 1. Cleaning Data
> 2. Removing NaNs
> 3. Selecting needed columns
> 4. Re-naming columns
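
The four steps listed above can also be chained into a single helper. The following is a minimal sketch, not the notebook's actual code: the `transform` function and the `sample` frame are hypothetical names introduced for illustration, and `sample` holds only the five Alabama rows shown in the Extract output.

```python
import pandas as pd

RATE_COL = "Crude Rate"
RX_COL = "Prescriptions Dispensed by US Retailers in that year (millions)"


def transform(raw_df):
    """Hypothetical helper: drop NaN rows, select columns, rename them."""
    keep = ["State", "Year", "Population", RATE_COL, RX_COL]
    return (
        raw_df.dropna(how="any")   # steps 1-2: drop rows containing any NaN
        .loc[:, keep]              # step 3: keep only the needed columns
        .rename(columns={          # step 4: shorter column names
            RATE_COL: "Death per 100K",
            RX_COL: "US Dispensed Prescriptions (millions)",
        })
    )


# Five rows copied from the head() output of the Extract step.
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "Year": [1999, 2000, 2001, 2002, 2003],
    "Deaths": [39, 46, 67, 75, 54],
    "Population": [4430141, 4447100, 4467634, 4480089, 4503491],
    RATE_COL: [0.9, 1.0, 1.5, 1.7, 1.2],
    "Crude Rate Lower 95% Confidence Interval": [0.6, 0.8, 1.2, 1.3, 0.9],
    "Crude Rate Upper 95% Confidence Interval": [1.2, 1.4, 1.9, 2.1, 1.6],
    RX_COL: [116, 126, 138, 142, 149],
})

overdose_death = transform(sample)
print(overdose_death.columns.tolist())
```

Keeping the steps in one function makes it easier to rerun the same transformation on a refreshed copy of the CSV, although the cell-by-cell version is easier to inspect at each stage.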

In [10]:
# Cleaning the dataset by dropping every row that contains at least one NaN:
cleaned_death_cause_df = death_cause_df.dropna(how='any')
cleaned_death_cause_df.head()

Unnamed: 0,State,Year,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Prescriptions Dispensed by US Retailers in that year (millions)
0,Alabama,1999,39,4430141,0.9,0.6,1.2,116
1,Alabama,2000,46,4447100,1.0,0.8,1.4,126
2,Alabama,2001,67,4467634,1.5,1.2,1.9,138
3,Alabama,2002,75,4480089,1.7,1.3,2.1,142
4,Alabama,2003,54,4503491,1.2,0.9,1.6,149
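
Because `dropna(how='any')` removes rows silently, it can help to count missing values first and compare row counts afterwards. The sketch below is illustrative only: it uses a few columns from the five rows shown above, and one value is blanked on purpose (the source rows have no gap there) so the effect is visible. The names `sample_with_gap`, `missing_per_column`, and `rows_dropped` are hypothetical.

```python
import numpy as np
import pandas as pd

# A few columns from the five rows shown in the output above.
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "Year": [1999, 2000, 2001, 2002, 2003],
    "Deaths": [39, 46, 67, 75, 54],
    "Crude Rate": [0.9, 1.0, 1.5, 1.7, 1.2],
})

# Blank one value purely for illustration; this gap is NOT in the real data.
sample_with_gap = sample.copy()
sample_with_gap.loc[2, "Crude Rate"] = np.nan

# Count NaNs per column before dropping anything.
missing_per_column = sample_with_gap.isna().sum()

# Drop rows with any NaN, then report how many rows were lost.
cleaned = sample_with_gap.dropna(how="any")
rows_dropped = len(sample_with_gap) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

Running the same two checks on `death_cause_df` would show whether the real file contains any incomplete records before they are discarded.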


In [18]:
# Selecting only the columns needed to answer the question above:
death_cause_subset = cleaned_death_cause_df[[
    "State", "Year", "Population", "Crude Rate",
    "Prescriptions Dispensed by US Retailers in that year (millions)"]]
death_cause_subset.head()

Unnamed: 0,State,Year,Population,Crude Rate,Prescriptions Dispensed by US Retailers in that year (millions)
0,Alabama,1999,4430141,0.9,116
1,Alabama,2000,4447100,1.0,126
2,Alabama,2001,4467634,1.5,138
3,Alabama,2002,4480089,1.7,142
4,Alabama,2003,4503491,1.2,149


In [21]:
# Renaming the subset dataframe columns:
overdose_death = death_cause_subset.rename(columns={
    'Crude Rate': 'Death per 100K',
    'Prescriptions Dispensed by US Retailers in that year (millions)': 'US Dispensed Prescriptions (millions)'})
overdose_death.head()

Unnamed: 0,State,Year,Population,Death per 100K,US Dispensed Prescriptions (millions)
0,Alabama,1999,4430141,0.9,116
1,Alabama,2000,4447100,1.0,126
2,Alabama,2001,4467634,1.5,138
3,Alabama,2002,4480089,1.7,142
4,Alabama,2003,4503491,1.2,149
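
As a rough illustration of how the renamed table could be used for the question posed in Step 1, the sketch below computes a population-weighted death rate per year. It is an assumption about one possible follow-up, not part of the original notebook: `yearly_rate` is a hypothetical helper, and the frame is rebuilt from only the five Alabama rows shown above, so each year's result simply equals Alabama's own rate.

```python
import pandas as pd

# The five rows shown in the output above, rebuilt so the example runs alone.
overdose_death = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "Year": [1999, 2000, 2001, 2002, 2003],
    "Population": [4430141, 4447100, 4467634, 4480089, 4503491],
    "Death per 100K": [0.9, 1.0, 1.5, 1.7, 1.2],
    "US Dispensed Prescriptions (millions)": [116, 126, 138, 142, 149],
})


def yearly_rate(df):
    """Hypothetical helper: population-weighted deaths per 100K for each year."""
    work = pd.DataFrame({
        "Year": df["Year"],
        "Population": df["Population"],
        # Weight each state's rate by its population.
        "weighted": df["Death per 100K"] * df["Population"],
    })
    totals = work.groupby("Year").sum()
    return (totals["weighted"] / totals["Population"]).rename("Death per 100K")


rate_by_year = yearly_rate(overdose_death)
print(rate_by_year)
```

Weighting by population avoids treating a small state's rate the same as a large state's, which a plain mean of the `Death per 100K` column across states would do.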
