***
# ETL Project: Extract, Transform, Load
***
## Step 1: Extract
> The data source is a CSV file. After exploring the data needed for this project, we begin the retrieval process by reading in the dataset.
> The dataset used in this part is intended to help answer the following question:
> * What are the rates of drug use, organized by demographics, across the US?
> * Data source: https://data.world/balexturner/drug-use-employment-work-absence-income-race-education


### Importing Dependencies

In [1]:
# Import Dependencies:
import pandas as pd
import os

In [3]:
# Creating csv data file path: 
workforce = os.path.join("Resources", "NSDUH Workforce Adults.csv")

### Store CSV into DataFrame

In [5]:
# Read the data file into a pandas DataFrame, reusing the path built above:
workforce_df = pd.read_csv(workforce)
workforce_df.head()

Unnamed: 0.1,Unnamed: 0,IRPINC3,IRFAMIN3,marij_ever,marij_month,marij_year,cocaine_ever,cocaine_month,cocaine_year,crack_ever,...,EverDrugTest,EverDrugTest2,race_str,race_num,education,WouldWorkForDrugTester,SelectiveLeave,SkipSick,sex,druglist
0,1,2,4,1,1,1,1,0,0,1,...,0.0,No,Hispanic,7,2,3,0,0,1,Marijuana Cocaine Crack Hallucinogen 0
1,2,4,4,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,2,1,0,0,1,0
2,3,4,7,1,0,0,0,0,0,0,...,1.0,Yes,White,1,3,3,0,0,2,Marijuana 0
3,4,4,7,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,3,3,0,0,1,0
4,5,2,3,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,1,1,2,2,2,0


***
## Step 2: Transform
***
> Transforming the dataset to suit the needs of our project. This will include:
> 1. Cleaning Data
> 2. Removing NaNs
> 3. Selecting needed columns
> 4. Re-naming columns
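
The four steps above can be sketched as a single function. This is only a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code: the helper name `transform`, the `RENAME_MAP` constant, and the sample values are assumptions, while the column names follow the dataset used here.

```python
import pandas as pd

# Columns to keep and their new names (a subset of the notebook's choices).
RENAME_MAP = {
    "IRFAMIN3": "total_family_income",
    "race_str": "race",
    "sex": "gender",
}

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clean, subset, and rename in one pass."""
    return (
        df.dropna(how="any")            # steps 1-2: drop rows with any NaN
          .loc[:, list(RENAME_MAP)]     # step 3: keep only the needed columns
          .rename(columns=RENAME_MAP)   # step 4: readable column names
    )

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "IRFAMIN3": [4, 7, None],
    "race_str": ["Hispanic", "White", "White"],
    "sex": [1, 2, 1],
    "unused": [0, 0, 0],
})
result = transform(sample)
print(result)
```

Because NaNs are dropped before the columns are selected, rows whose only missing values sit in columns that are later discarded are removed as well. Selecting columns first would keep those rows.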

In [11]:
# Cleaning dataset and dropping any bad records:
cleaned_workforce_df = workforce_df.dropna(how='any')
cleaned_workforce_df.head()

Unnamed: 0.1,Unnamed: 0,IRPINC3,IRFAMIN3,marij_ever,marij_month,marij_year,cocaine_ever,cocaine_month,cocaine_year,crack_ever,...,EverDrugTest,EverDrugTest2,race_str,race_num,education,WouldWorkForDrugTester,SelectiveLeave,SkipSick,sex,druglist
0,1,2,4,1,1,1,1,0,0,1,...,0.0,No,Hispanic,7,2,3,0,0,1,Marijuana Cocaine Crack Hallucinogen 0
1,2,4,4,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,2,1,0,0,1,0
2,3,4,7,1,0,0,0,0,0,0,...,1.0,Yes,White,1,3,3,0,0,2,Marijuana 0
3,4,4,7,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,3,3,0,0,1,0
4,5,2,3,0,0,0,0,0,0,0,...,1.0,Yes,Hispanic,7,1,1,2,2,2,0
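
Before relying on `dropna(how='any')`, it can help to see how many values are missing and how many rows the call removes. The sketch below uses a small made-up DataFrame (the values are assumptions, not taken from the NSDUH file) to show one way to check this.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values, for illustration only.
df = pd.DataFrame({
    "education": [2, np.nan, 3, 1],
    "race_str": ["Hispanic", "White", None, "White"],
    "sex": [1, 2, 2, 1],
})

missing_per_column = df.isna().sum()    # NaN count for each column
cleaned = df.dropna(how="any")          # drop rows with at least one NaN
rows_dropped = len(df) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(df)} rows")
```

Running the same checks on `workforce_df` would show whether a large share of records is being discarded by the cleaning step.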


### Create New DataFrame with Selected Columns

In [13]:
# Select only the subset of columns needed to answer the project question:
workforce_subset = cleaned_workforce_df[["IRFAMIN3", "painrelieve_ever", "EmploymentStatus", "race_str", "education", "sex"]]
workforce_subset

Unnamed: 0,IRFAMIN3,painrelieve_ever,EmploymentStatus,race_str,education,sex
0,4,0,1,Hispanic,2,1
1,4,0,1,Hispanic,2,1
2,7,0,1,White,3,2
3,7,0,1,Hispanic,3,1
4,3,0,2,Hispanic,1,2
...,...,...,...,...,...,...
32035,1,0,1,White,3,2
32036,3,0,1,Hispanic,3,2
32037,6,0,1,White,4,1
32038,6,0,1,Hispanic,2,1


In [14]:
# Renaming the subset dataframe columns:
demographic_drug_use = workforce_subset.rename(columns={
    'IRFAMIN3': 'total_family_income',
    'painrelieve_ever': 'pain_relieve_ever',
    'EmploymentStatus': 'employment_status',
    'race_str': 'race',
    'sex': 'gender'
})
demographic_drug_use.head()

Unnamed: 0,total_family_income,pain_relieve_ever,employment_status,race,education,gender
0,4,0,1,Hispanic,2,1
1,4,0,1,Hispanic,2,1
2,7,0,1,White,3,2
3,7,0,1,Hispanic,3,1
4,3,0,2,Hispanic,1,2
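
With the columns renamed, a usage rate per demographic group can be computed with a group-by. The following is a minimal sketch on made-up rows. It assumes `pain_relieve_ever` is coded 0/1 with 1 meaning "has ever used", which should be confirmed against the dataset's codebook.

```python
import pandas as pd

# Made-up rows shaped like demographic_drug_use, for illustration only.
demo = pd.DataFrame({
    "race": ["Hispanic", "Hispanic", "White", "White", "White", "White"],
    "pain_relieve_ever": [0, 1, 0, 0, 0, 1],
})

# Assumption: pain_relieve_ever is coded 0/1, so the group mean
# equals the share of respondents in each group with a value of 1.
usage_rate_by_race = demo.groupby("race")["pain_relieve_ever"].mean()
print(usage_rate_by_race)
```

The same pattern applies to the other demographic columns, such as `education` or `gender`, by changing the group-by key.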


In [15]:
# Save the subset to a CSV file (with a .csv extension so other tools recognize it):
demographic_drug_use.to_csv('Resources/demographic_drug_use.csv', index=False)
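
A quick way to confirm the save step is to read the file back and compare it with the DataFrame in memory. The sketch below does this with made-up rows and a temporary directory, so the path and values are assumptions rather than the project's real files.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for demographic_drug_use.
demo = pd.DataFrame({
    "total_family_income": [4, 7],
    "race": ["Hispanic", "White"],
    "gender": [1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demographic_drug_use.csv")
    # index=False keeps the row index out of the file; otherwise it is
    # written as an unnamed column that read_csv labels "Unnamed: 0".
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(demo.columns)
same_shape = reloaded.shape == demo.shape
print(same_columns, same_shape)
```

Writing a CSV with its index included is a likely origin of the `Unnamed: 0` columns seen in the raw dataset above, which is why `index=False` is worth keeping here.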