***
# ETL Project: Extract, Transform, Load
***
## Step 1: Extract
> The data sources are CSV files. After exploring the data needed for this project, we begin the retrieval process by reading in the dataset.
> The dataset used in this part is intended to help answer the following question:
> * What is the prevalence of overdose deaths across the US from 1999 to 2014?
> * Data source: https://data.world/health/opioid-overdose-deaths

### Importing Dependencies

In [2]:
# Import Dependencies:
import pandas as pd
import os

In [3]:
# Creating csv data file path: 
death_cause = os.path.join("Resources", "Multiple_Cause_death.csv")

### Store CSV into DataFrame

In [4]:
# Reading the data file into a pandas DataFrame, using the path built above:
death_cause_df = pd.read_csv(death_cause)
death_cause_df.head()

| | State | Year | Deaths | Population | Crude Rate | Crude Rate Lower 95% Confidence Interval | Crude Rate Upper 95% Confidence Interval | Prescriptions Dispensed by US Retailers in that year (millions) |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1999 | 39 | 4430141 | 0.9 | 0.6 | 1.2 | 116 |
| 1 | Alabama | 2000 | 46 | 4447100 | 1.0 | 0.8 | 1.4 | 126 |
| 2 | Alabama | 2001 | 67 | 4467634 | 1.5 | 1.2 | 1.9 | 138 |
| 3 | Alabama | 2002 | 75 | 4480089 | 1.7 | 1.3 | 2.1 | 142 |
| 4 | Alabama | 2003 | 54 | 4503491 | 1.2 | 0.9 | 1.6 | 149 |


***
## Step 2: Transform
***
> Transforming the dataset to suit the needs of our project. This will include:
> 1. Cleaning Data
> 2. Removing NaNs
> 3. Selecting needed columns
> 4. Re-naming columns

In [5]:
# Cleaning dataset and dropping any bad records:
cleaned_death_cause_df = death_cause_df.dropna(how='any')
cleaned_death_cause_df.head()

| | State | Year | Deaths | Population | Crude Rate | Crude Rate Lower 95% Confidence Interval | Crude Rate Upper 95% Confidence Interval | Prescriptions Dispensed by US Retailers in that year (millions) |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1999 | 39 | 4430141 | 0.9 | 0.6 | 1.2 | 116 |
| 1 | Alabama | 2000 | 46 | 4447100 | 1.0 | 0.8 | 1.4 | 126 |
| 2 | Alabama | 2001 | 67 | 4467634 | 1.5 | 1.2 | 1.9 | 138 |
| 3 | Alabama | 2002 | 75 | 4480089 | 1.7 | 1.3 | 2.1 | 142 |
| 4 | Alabama | 2003 | 54 | 4503491 | 1.2 | 0.9 | 1.6 | 149 |
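
Before dropping rows, it can help to see how many records would actually be removed. The sketch below is an illustration, not part of the original notebook. It uses a small hypothetical frame `sample_df` in place of `death_cause_df`: the first two rows come from the output above, and the third row with missing values is invented purely to demonstrate the behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for death_cause_df; the last row is invented
# to show how rows with missing values are handled.
sample_df = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "Year": [1999, 2000, 2001],
    "Deaths": [39, 46, np.nan],
    "Crude Rate": [0.9, 1.0, np.nan],
})

# Count missing values per column before dropping anything.
missing_per_column = sample_df.isna().sum()

# Drop every row that contains at least one NaN, as in the cell above.
cleaned_sample_df = sample_df.dropna(how="any")

rows_dropped = len(sample_df) - len(cleaned_sample_df)
print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
```

If `rows_dropped` is zero on the real data, the `dropna` call is a no-op, which would be consistent with the identical `head()` output before and after cleaning.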


In [8]:
# Casting the Crude Rate column to string (applied to the original DataFrame, not the cleaned copy):
death_cause_df['Crude Rate'] = death_cause_df['Crude Rate'].astype(str)
death_cause_df

| | State | Year | Deaths | Population | Crude Rate | Crude Rate Lower 95% Confidence Interval | Crude Rate Upper 95% Confidence Interval | Prescriptions Dispensed by US Retailers in that year (millions) |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | 1999 | 39 | 4430141 | 0.9 | 0.6 | 1.2 | 116 |
| 1 | Alabama | 2000 | 46 | 4447100 | 1 | 0.8 | 1.4 | 126 |
| 2 | Alabama | 2001 | 67 | 4467634 | 1.5 | 1.2 | 1.9 | 138 |
| 3 | Alabama | 2002 | 75 | 4480089 | 1.7 | 1.3 | 2.1 | 142 |
| 4 | Alabama | 2003 | 54 | 4503491 | 1.2 | 0.9 | 1.6 | 149 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 811 | Wyoming | 2010 | 49 | 563626 | 8.7 | 6.4 | 11.5 | 210 |
| 812 | Wyoming | 2011 | 47 | 568158 | 8.3 | 6.1 | 11 | 219 |
| 813 | Wyoming | 2012 | 47 | 576412 | 8.2 | 6 | 10.8 | 217 |
| 814 | Wyoming | 2013 | 52 | 582658 | 8.9 | 6.7 | 11.7 | 207 |
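
Casting to `str` keeps the values as text, which rules out numeric operations such as averaging or sorting by magnitude. If the goal is numeric analysis, one option is `pd.to_numeric` with `errors="coerce"`. This is a sketch under an assumption the source does not confirm: that the column may hold non-numeric placeholders. The `rates` Series and the `"Unreliable"` value below are hypothetical.

```python
import pandas as pd

# Hypothetical Crude Rate values stored as text; "Unreliable" is an
# assumed placeholder, not a value confirmed by the dataset.
rates = pd.Series(["0.9", "1", "1.5", "Unreliable"], name="Crude Rate")

# Convert to floats; anything that cannot be parsed becomes NaN.
numeric_rates = pd.to_numeric(rates, errors="coerce")

print(numeric_rates.dtype)         # float64
print(numeric_rates.isna().sum())  # number of unparseable entries
```

Entries coerced to NaN could then be removed with `dropna` or inspected separately, depending on what the analysis needs.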


### Create new data with select columns

In [6]:
# Selecting only the columns needed to answer the question above:
death_cause_subset = cleaned_death_cause_df[[
    "State",
    "Year",
    "Population",
    "Crude Rate",
    "Prescriptions Dispensed by US Retailers in that year (millions)",
]]
death_cause_subset.head()

| | State | Year | Population | Crude Rate | Prescriptions Dispensed by US Retailers in that year (millions) |
|---|---|---|---|---|---|
| 0 | Alabama | 1999 | 4430141 | 0.9 | 116 |
| 1 | Alabama | 2000 | 4447100 | 1.0 | 126 |
| 2 | Alabama | 2001 | 4467634 | 1.5 | 138 |
| 3 | Alabama | 2002 | 4480089 | 1.7 | 142 |
| 4 | Alabama | 2003 | 4503491 | 1.2 | 149 |


In [22]:
# Renaming the subset dataframe columns:
overdose_death = death_cause_subset.rename(columns={
    'Crude Rate': 'Death',
    'Prescriptions Dispensed by US Retailers in that year (millions)': 'dispensed_prescriptions'})
overdose_death.head()

| | State | Year | Population | Death | dispensed_prescriptions |
|---|---|---|---|---|---|
| 0 | Alabama | 1999 | 4430141 | 0.9 | 116 |
| 1 | Alabama | 2000 | 4447100 | 1.0 | 126 |
| 2 | Alabama | 2001 | 4467634 | 1.5 | 138 |
| 3 | Alabama | 2002 | 4480089 | 1.7 | 142 |
| 4 | Alabama | 2003 | 4503491 | 1.2 | 149 |


In [24]:
# Saving needed subset into csv file: 
overdose_death.to_csv('Resources/overdose_death.csv', index=False)
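
A quick way to confirm the export is to read the file back and compare it with what was written. The following is a minimal sketch rather than part of the original notebook: `sample` stands in for `overdose_death` (rows taken from the output above), and a temporary directory is used instead of `Resources/` so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Rows taken from the output above; stands in for overdose_death.
sample = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "Year": [1999, 2000],
    "Population": [4430141, 4447100],
    "Death": [0.9, 1.0],
    "dispensed_prescriptions": [116, 126],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical path; the notebook writes to Resources/overdose_death.csv.
    csv_path = os.path.join(tmp_dir, "overdose_death.csv")
    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# With index=False no extra index column is written, so the columns match.
columns_match = list(reloaded.columns) == list(sample.columns)
print(columns_match, reloaded.shape)
```

Without `index=False`, the reloaded frame would gain an extra unnamed column holding the old index.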

In [9]:
# # What is the average number of deaths per year and state?
# overdose_death_query = """
#     SELECT Year,
#            Death  -- crude rate per 100K
#     FROM overdose_death
#     WHERE Year = 1999;
# """
# overdose_death_query
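
The commented-out query suggests the cleaned table is meant to be queried with SQL. Below is a minimal sketch of that load step, assuming an in-memory SQLite database; the source does not say which database the project uses. The table name `overdose_death` follows the query above, and the sample rows come from the earlier output.

```python
import sqlite3

import pandas as pd

# Rows taken from the earlier output; stands in for the saved subset.
sample = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alabama"],
    "Year": [1999, 2000, 2001],
    "Population": [4430141, 4447100, 4467634],
    "Death": [0.9, 1.0, 1.5],
    "dispensed_prescriptions": [116, 126, 138],
})

# In-memory SQLite database: an assumption for illustration only.
conn = sqlite3.connect(":memory:")
sample.to_sql("overdose_death", conn, index=False, if_exists="replace")

# Year is stored as an integer, so it is compared without quotes.
query = """
    SELECT State, Year, Death
    FROM overdose_death
    WHERE Year = 1999;
"""
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

Storing the query in its own variable (here `query`) avoids overwriting the `overdose_death` DataFrame, which the commented-out draft would do if it were run as written.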