***
# ETL Project: Extract, Transform, Load
***
## Step 1: Extract 
> The data are extracted from CSV files. After exploring the data needed for this project, we begin retrieval by writing the extraction code, then pass the results through the transformation steps.
### Part 3 of our project: Demographic usage and overdose deaths from opioids across the US
### In this part we retrieve datasets that may help answer the following questions:
> * What is the prevalence of overdose deaths across the US from 1999 to 2014?
> * What are the rates of usage organized by demographics across the US?
> * Using California as a model, is there a relationship between enrollment in medication-assisted treatment and rates of overdose deaths?
### These are the dataset sources:
> * Opioid Overdose Deaths: https://data.world/health/opioid-overdose-deaths
> * Drug Use, Employment, Work Absence, Income, Race, Education: https://data.world/balexturner/drug-use-employment-work-absence-income-race-education
> * Medication-Assisted Treatment in Medi-Cal for Opioid Use: https://data.world/chhs/8329a339-ab77-4d05-ab7a-405d0ae5765c

### Importing Dependencies

In [8]:
# Import Dependencies:
import pandas as pd
import os
from sqlalchemy import create_engine
# config.py is a local module holding the database credentials
from config import username, password

In [9]:
# Build the paths to the CSV data files:
death_cause = os.path.join("Resources", "Multiple_Cause_death.csv")
workforce = os.path.join("Resources", "NSDUH Workforce Adults.csv")
medication_assisted = os.path.join("Resources", "mat_annually.csv")

### Store CSV into DataFrame

In [10]:
# Read the overdose-death CSV into a pandas DataFrame, using the path built above:
death_cause_df = pd.read_csv(death_cause)
death_cause_df.head()

Unnamed: 0,State,Year,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Prescriptions Dispensed by US Retailers in that year (millions)
0,Alabama,1999,39,4430141,0.9,0.6,1.2,116
1,Alabama,2000,46,4447100,1.0,0.8,1.4,126
2,Alabama,2001,67,4467634,1.5,1.2,1.9,138
3,Alabama,2002,75,4480089,1.7,1.3,2.1,142
4,Alabama,2003,54,4503491,1.2,0.9,1.6,149
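
The `Crude Rate` column in the preview above is consistent with deaths per 100,000 population, rounded to one decimal place. The sketch below recomputes it from the five previewed rows as a sanity check; the `recomputed_rate` column name and the rounding rule are assumptions made for illustration, not something the dataset documents here.

```python
import pandas as pd

# First five rows copied from the death_cause_df preview above
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "Year": [1999, 2000, 2001, 2002, 2003],
    "Deaths": [39, 46, 67, 75, 54],
    "Population": [4430141, 4447100, 4467634, 4480089, 4503491],
    "Crude Rate": [0.9, 1.0, 1.5, 1.7, 1.2],
})

# Assumption: crude rate = deaths per 100,000 population, rounded to 1 decimal
sample["recomputed_rate"] = (
    sample["Deaths"] / sample["Population"] * 100_000
).round(1)

# Compare with a small tolerance rather than exact float equality
rates_match = bool(
    (sample["recomputed_rate"] - sample["Crude Rate"]).abs().lt(1e-9).all()
)
print(sample[["Year", "Crude Rate", "recomputed_rate"]])
print("Rates match:", rates_match)
```

A check like this can be run on the full DataFrame to flag rows where the published rate and the raw counts disagree.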


In [3]:
# Read the NSDUH workforce CSV into a pandas DataFrame, using the path built above:
workforce_df = pd.read_csv(workforce)
workforce_df.head()

Unnamed: 0.1,Unnamed: 0,IRPINC3,IRFAMIN3,marij_ever,marij_month,marij_year,cocaine_ever,cocaine_month,cocaine_year,crack_ever,...,EverDrugTest,EverDrugTest2,race_str,race_num,education,WouldWorkForDrugTester,SelectiveLeave,SkipSick,sex,druglist
0,1,2,4,1,1,1,1,0,0,1,...,0.0,No,Hispanic,7,2,3,0,0,1,Marijuana Cocaine Crack Hallucinogen 0
1,2,4,4,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,2,1,0,0,1,0
2,3,4,7,1,0,0,0,0,0,0,...,1.0,Yes,White,1,3,3,0,0,2,Marijuana 0
3,4,4,7,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,3,3,0,0,1,0
4,5,2,3,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,1,1,2,2,2,0


In [4]:
# Read the medication-assisted treatment CSV into a pandas DataFrame, using the path built above:
medication_assisted_treatment = pd.read_csv(medication_assisted)
medication_assisted_treatment.head()

Unnamed: 0,County,Year,Medication_Assisted_Treatment,Beneficiaries,Status,Annotation,Annotation_Description
0,Statewide,2010,Buprenorphine,1265.0,F,,
1,Statewide,2011,Buprenorphine,1680.0,F,,
2,Statewide,2012,Buprenorphine,2099.0,F,,
3,Statewide,2013,Buprenorphine,2129.0,F,,
4,Statewide,2014,Buprenorphine,5000.0,F,,


***
## Step 2: Transform
***
> Transform the datasets to suit the needs of our project. This includes:
> 1. Cleaning Data
> 2. Removing NaNs
> 3. Selecting needed columns
> 4. Re-naming columns
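
The four steps listed above can be sketched as a single chained function. This is a minimal illustration on made-up toy data, not the notebook's actual pipeline: the `transform` helper, the `toy` DataFrame, and the `extra` column are hypothetical, while the other column names mirror the workforce dataset.

```python
import pandas as pd
import numpy as np

def transform(df, keep_columns, rename_map):
    """Drop incomplete rows, keep the selected columns, and rename them."""
    return (
        df.dropna(how="any")            # steps 1-2: drop rows with any NaN
          .loc[:, keep_columns]         # step 3: select the needed columns
          .rename(columns=rename_map)   # step 4: rename columns
    )

# Hypothetical toy data; the third row has a missing income value
toy = pd.DataFrame({
    "IRFAMIN3": [4, 7, np.nan],
    "race_str": ["Hispanic", "White", "White"],
    "sex": [1, 2, 1],
    "extra": ["a", "b", "c"],
})

result = transform(
    toy,
    keep_columns=["IRFAMIN3", "race_str", "sex"],
    rename_map={
        "IRFAMIN3": "total_family_income",
        "race_str": "race",
        "sex": "gender",
    },
)
print(result)
```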

In [6]:
# Clean the workforce dataset by dropping every row that contains a missing value:
cleaned_workforce_df = workforce_df.dropna(how='any')
cleaned_workforce_df.head()

Unnamed: 0.1,Unnamed: 0,IRPINC3,IRFAMIN3,marij_ever,marij_month,marij_year,cocaine_ever,cocaine_month,cocaine_year,crack_ever,...,EverDrugTest,EverDrugTest2,race_str,race_num,education,WouldWorkForDrugTester,SelectiveLeave,SkipSick,sex,druglist
0,1,2,4,1,1,1,1,0,0,1,...,0.0,No,Hispanic,7,2,3,0,0,1,Marijuana Cocaine Crack Hallucinogen 0
1,2,4,4,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,2,1,0,0,1,0
2,3,4,7,1,0,0,0,0,0,0,...,1.0,Yes,White,1,3,3,0,0,2,Marijuana 0
3,4,4,7,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,3,3,0,0,1,0
4,5,2,3,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,1,1,2,2,2,0
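
Because `dropna(how='any')` inspects every column, a row can be discarded for a gap in a column the project never uses. The sketch below compares it with `dropna(subset=...)` on hypothetical toy data; the `needed` list and the `unused_col` column are assumptions for illustration only.

```python
import pandas as pd
import numpy as np

# Columns the analysis actually needs (hypothetical selection)
needed = ["IRFAMIN3", "race_str"]

toy = pd.DataFrame({
    "IRFAMIN3": [4, 7, np.nan, 3],
    "race_str": ["Hispanic", "White", "White", "Hispanic"],
    "unused_col": ["a", "b", "c", None],  # gap only in an unneeded column
})

# Drops any row with a missing value in any column
rows_any = len(toy.dropna(how="any"))

# Drops only rows with a missing value in the needed columns
rows_subset = len(toy.dropna(subset=needed))

print("how='any':", rows_any, "rows kept")
print("subset=needed:", rows_subset, "rows kept")
```

Whether the stricter or the narrower rule is appropriate depends on how the remaining columns are used later; the notebook keeps the stricter one.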


In [7]:
# Clean the overdose-death dataset by dropping every row that contains a missing value:
cleaned_death_cause_df = death_cause_df.dropna(how='any')
cleaned_death_cause_df.head()

Unnamed: 0,State,Year,Deaths,Population,Crude Rate,Crude Rate Lower 95% Confidence Interval,Crude Rate Upper 95% Confidence Interval,Prescriptions Dispensed by US Retailers in that year (millions)
0,Alabama,1999,39,4430141,0.9,0.6,1.2,116
1,Alabama,2000,46,4447100,1.0,0.8,1.4,126
2,Alabama,2001,67,4467634,1.5,1.2,1.9,138
3,Alabama,2002,75,4480089,1.7,1.3,2.1,142
4,Alabama,2003,54,4503491,1.2,0.9,1.6,149


### Create a new DataFrame with selected columns

In [5]:
# Keep only the columns needed to answer the project questions:
workforce_subset = cleaned_workforce_df[["IRFAMIN3", "painrelieve_ever", "EmploymentStatus", "race_str", "education", "sex"]]
workforce_subset

Unnamed: 0,IRFAMIN3,painrelieve_ever,EmploymentStatus,race_str,education,sex
0,4,0,1,Hispanic,2,1
1,4,0,1,Hispanic,2,1
2,7,0,1,White,3,2
3,7,0,1,Hispanic,3,1
4,3,0,2,Hispanic,1,2
...,...,...,...,...,...,...
32035,1,0,1,White,3,2
32036,3,0,1,Hispanic,3,2
32037,6,0,1,White,4,1
32038,6,0,1,Hispanic,2,1


In [6]:
# Renaming the subset dataframe columns:
demographic_drug_use = workforce_subset.rename(columns={
    'IRFAMIN3': 'total_family_income',
    'painrelieve_ever': 'pain_relieve_ever',
    'EmploymentStatus': 'employment_status',
    'race_str': 'race',
    'sex': 'gender'
})
demographic_drug_use.head()

Unnamed: 0,total_family_income,pain_relieve_ever,employment_status,race,education,gender
0,4,0,1,Hispanic,2,1
1,4,0,1,Hispanic,2,1
2,7,0,1,White,3,2
3,7,0,1,Hispanic,3,1
4,3,0,2,Hispanic,1,2


In [7]:
# Save the renamed subset to a CSV file (with a .csv extension so other tools recognize it):
demographic_drug_use.to_csv('Resources/demographic_drug_use.csv', index=False)

***
## Step 3: Load
***
> * This is the final step of the ETL process, where we load the extracted and transformed data into a database.
> * For this step we use PostgreSQL to load and store the data.

### Connect to local database

In [17]:
# Connecting to localhost database:

engine = create_engine(f'postgresql://{username}:{password}@localhost:5432/etl_db')
connection = engine.connect()
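
If the password contains characters such as `@` or `/`, placing it directly in the connection string can make the URL ambiguous. One common remedy is to percent-encode the credentials first. The sketch below only builds the URL string; the credentials shown are hypothetical placeholders, not values from `config.py`.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
username = "etl_user"
password = "p@ss/word"

# Percent-encode each credential before embedding it in the URL
url = (
    f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
    "@localhost:5432/etl_db"
)
print(url)
```

The resulting string can be passed to `create_engine` in place of the f-string used above.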

In [18]:
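# Reviewing the tables in the SQL database
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in 1.4 and later)
from sqlalchemy import inspect
inspect(engine).get_table_names()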

['overdose_death']
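
The notebook ends after listing the tables, so the load itself is not shown. A minimal sketch of that step, assuming `DataFrame.to_sql` is used, follows. To keep it runnable without a PostgreSQL server, an in-memory SQLite connection stands in for the database, and the `demographic_drug_use` table name is hypothetical; with the notebook's setup, `engine` would be passed instead of `conn`.

```python
import sqlite3
import pandas as pd

# Rows copied from the demographic_drug_use preview earlier in the notebook
demographic_drug_use = pd.DataFrame({
    "total_family_income": [4, 4, 7, 7, 3],
    "pain_relieve_ever": [0, 0, 0, 0, 0],
    "employment_status": [1, 1, 1, 1, 2],
    "race": ["Hispanic", "Hispanic", "White", "Hispanic", "Hispanic"],
    "education": [2, 2, 3, 3, 1],
    "gender": [1, 1, 2, 1, 2],
})

# Assumption: in-memory SQLite stands in for the PostgreSQL database
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, replacing it if it already exists
demographic_drug_use.to_sql(
    "demographic_drug_use", conn, if_exists="replace", index=False
)

# Read the table back to confirm the load
loaded = pd.read_sql("SELECT * FROM demographic_drug_use", conn)
row_count = len(loaded)
conn.close()

print(loaded)
```

Using `if_exists="replace"` makes the cell safe to re-run; `if_exists="append"` would be the choice when the table is created separately with a fixed schema.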