# Units analysis on the collected water data

In [26]:
import pandas as pd
import numpy as np
import re

In [3]:
labs = pd.read_csv("data/lab_results.csv")
standards = pd.read_csv("data/state_regulations.csv")


In [4]:
labs.head()

Unnamed: 0,station_id,station_name,full_station_name,station_number,station_type,latitude,longitude,status,county_name,sample_code,sample_date,sample_depth,sample_depth_units,parameter,result,reporting_limit,units,method_name
0,8135,01S04E32C001M,01S04E32C001M,01S04E32C001M,Groundwater,37.8073,121.5617,Review Status Unknown,Alameda,WDIS_0719152,05/03/1967 09:00,,Feet,Conductance,3480.0,1.0,uS/cm,EPA 120.1
1,8135,01S04E32C001M,01S04E32C001M,01S04E32C001M,Groundwater,37.8073,121.5617,Review Status Unknown,Alameda,WDIS_0719152,05/03/1967 09:00,,Feet,Dissolved Boron,7.7,0.1,mg/L,"Std Method 4500-B, C"
2,8135,01S04E32C001M,01S04E32C001M,01S04E32C001M,Groundwater,37.8073,121.5617,Review Status Unknown,Alameda,WDIS_0719152,05/03/1967 09:00,,Feet,Dissolved Calcium,68.0,1.0,mg/L,EPA 215.2
3,8135,01S04E32C001M,01S04E32C001M,01S04E32C001M,Groundwater,37.8073,121.5617,Review Status Unknown,Alameda,WDIS_0719152,05/03/1967 09:00,,Feet,Dissolved Chloride,758.0,0.1,mg/L,"Std Method 4500-Cl, B"
4,8135,01S04E32C001M,01S04E32C001M,01S04E32C001M,Groundwater,37.8073,121.5617,Review Status Unknown,Alameda,WDIS_0719152,05/03/1967 09:00,,Feet,Dissolved Magnesium,59.0,0.1,mg/L,"Std Method 3500-Mg, E"


In [5]:
standards.head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1.0,0.05,0.6,2001.0,,,mg/L
1,Antimony,0.006,0.006,0.001,2016.0,0.006,0.006,mg/L
2,Arsenic,0.01,0.002,4e-06,2004.0,0.01,0.0,mg/L
3,Asbestos,7.0,0.2,7.0,2003.0,7.0,7.0,MFL
4,Barium,1.0,0.1,2.0,2003.0,2.0,2.0,mg/L


## Unit Analysis

- The state standards list most units in mg/L, but not all contaminants share the same units (asbestos, for example, is listed in MFL)
- The lab tests more parameters than just contaminants; here we only want to compare contaminants
- Some contaminant names in the standards table differ from the corresponding parameter names in the labs table
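
The three observations above can be checked with a small sketch. It uses toy frames rather than the real `labs` and `standards` tables: the parameter name "Dissolved Arsenic", the `ug/L` unit, and the helper `normalize_name` are all assumptions made for illustration, and stripping a "Dissolved" prefix is only a guess at how the names differ.

```python
import pandas as pd

# Toy stand-ins for `labs` and `standards`. "Dissolved Arsenic" and the
# "ug/L" unit are hypothetical values, chosen only to show a name
# discrepancy and a unit mismatch.
labs_demo = pd.DataFrame({
    "parameter": ["Conductance", "Dissolved Boron",
                  "Dissolved Arsenic", "2,4,5-TP (Silvex)"],
    "units": ["uS/cm", "mg/L", "mg/L", "ug/L"],
})
standards_demo = pd.DataFrame({
    "Contaminant": ["Arsenic", "Asbestos", "2,4,5-TP (Silvex)"],
    "Units": ["mg/L", "MFL", "mg/L"],
})


def normalize_name(name: str) -> str:
    """Lowercase a name and drop a leading 'dissolved ' (assumed convention)."""
    cleaned = name.strip().lower()
    prefix = "dissolved "
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):]
    return cleaned


# Build a shared join key on both tables.
labs_demo["key"] = labs_demo["parameter"].map(normalize_name)
standards_demo["key"] = standards_demo["Contaminant"].map(normalize_name)

# Which units appear in the standards, and how often?
unit_counts = standards_demo["Units"].value_counts()

# Keep only lab parameters that correspond to a regulated contaminant,
# then flag rows whose lab units differ from the standard's units.
matched = labs_demo.merge(standards_demo, on="key", how="inner")
matched["units_match"] = matched["units"] == matched["Units"]
print(matched[["parameter", "Contaminant", "units", "Units", "units_match"]])
```

Rows where `units_match` is false would need a conversion before the result can be compared against an MCL.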

Thoughts: 
- A lot of the columns can be dropped from the labs table to reduce the size of the data
    - Are station_name, full_station_name, and station_number always identical? If so, only one needs to be kept
    - status, sample_code, sample_date, sample_depth, sample_depth_units, reporting_limit, and method_name are unnecessary

- I don't need to add all of the columns from the standards table to the labs
    - Contaminant, State_MCL, Federal_MCL, and Units are required

- I need to add a column for measurement type
    - Contaminants will be listed as contaminants
    - Others include physical attributes such as turbidity and odor
    - Electrochemical - conductance and pH
    - Non-contaminants such as minerals and other elements?
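
A minimal sketch of the clean-up ideas listed above, on a toy frame built from two rows of `labs.head()`. The lookup `TYPE_BY_PARAMETER`, the set `contaminant_names`, and the function `measurement_type` are hypothetical; the category names follow the list, but which parameter belongs where is a guess.

```python
import pandas as pd

# Two rows taken from labs.head(), restricted to a handful of columns.
labs_demo = pd.DataFrame({
    "station_name": ["01S04E32C001M", "01S04E32C001M"],
    "full_station_name": ["01S04E32C001M", "01S04E32C001M"],
    "station_number": ["01S04E32C001M", "01S04E32C001M"],
    "status": ["Review Status Unknown", "Review Status Unknown"],
    "method_name": ["EPA 120.1", "EPA 215.2"],
    "parameter": ["Conductance", "Dissolved Calcium"],
    "result": [3480.0, 68.0],
    "units": ["uS/cm", "mg/L"],
})

# Are the three station identifier columns redundant on every row?
same_ids = bool(
    (labs_demo["station_name"].eq(labs_demo["full_station_name"])
     & labs_demo["station_name"].eq(labs_demo["station_number"])).all()
)

# Columns judged unnecessary above; errors="ignore" skips any that are absent.
drop_cols = ["status", "sample_code", "sample_date", "sample_depth",
             "sample_depth_units", "reporting_limit", "method_name"]
if same_ids:
    drop_cols += ["full_station_name", "station_number"]
trimmed = labs_demo.drop(columns=drop_cols, errors="ignore")

# Hypothetical category lookup; membership is an assumption.
TYPE_BY_PARAMETER = {
    "conductance": "electrochemical",
    "ph": "electrochemical",
    "turbidity": "physical",
    "odor": "physical",
}
# In the notebook this set would be derived from standards["Contaminant"].
contaminant_names = {"arsenic", "barium"}


def measurement_type(parameter: str) -> str:
    """Classify a parameter name into one of the rough categories."""
    key = parameter.strip().lower()
    if key in contaminant_names:
        return "contaminant"
    return TYPE_BY_PARAMETER.get(key, "other")


trimmed["measurement_type"] = trimmed["parameter"].map(measurement_type)
print(trimmed)
```

With 435 distinct parameters, a hand-written lookup like this will not scale on its own; it would likely need to be combined with the contaminant list from the standards table.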

In [12]:
len(labs['parameter'].unique())

435

Unfortunately, that's a lot. Taking a look at the first 50, sorted alphabetically.

In [23]:
top50 = labs.parameter.sort_values().unique()[0:50]

In [24]:
top50

array(['(Aminomethyl)phosphonic acid',
       '*No Lab Analyses (Field Measures Only)',
       '1,1,1,2-Tetrachloroethane', '1,1,1-Trichloroethane',
       '1,1,2,2-Tetrachloroethane', '1,1,2-Trichloroethane',
       '1,1,2-Trichlorotrifluoroethane', '1,1-Dichloroethane',
       '1,1-Dichloroethene', '1,1-Dichloropropene',
       '1,2,3-Trichlorobenzene', '1,2,3-Trichloropropane',
       '1,2,4-Trichlorobenzene', '1,2,4-Trimethylbenzene',
       '1,2-Dibromo-3-chloropropane (DBCP)', '1,2-Dibromoethane (EDB)',
       '1,2-Dichlorobenzene', '1,2-Dichloroethane', '1,2-Dichloropropane',
       '1,3,5-Trimethylbenzene', '1,3-Dichlorobenzene',
       '1,3-Dichloropropane', '1,4-Dichlorobenzene', '1-Naphthol',
       '100-Day Biochemical Oxygen Demand', '2,2-Dichloropropane',
       '2,3,6-Trichlorobenzoic Acid',
       '2,3,7,8-Tetrachlorodibenzo-p-dioxin', '2,3-Dibromopropionic acid',
       '2,4,5-T', '2,4,5-TP (Silvex)', '2,4-D', '2,4-DB',
       '2-Chloroallyl diethyldithiocarbamate (CDE

In [25]:
standards

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1.000,0.050,0.600000,2001.0,,,mg/L
1,Antimony,0.006,0.006,0.001000,2016.0,0.006,0.006,mg/L
2,Arsenic,0.010,0.002,0.000004,2004.0,0.010,0.000,mg/L
3,Asbestos,7.000,0.200,7.000000,2003.0,7.000,7.000,MFL
4,Barium,1.000,0.100,2.000000,2003.0,2.000,2.000,mg/L
...,...,...,...,...,...,...,...,...
86,"2,4,5-TP (Silvex)",0.050,0.001,0.003000,2014.0,0.050,0.050,mg/L
87,Total Trihalomethanes,0.080,,,,0.080,,mg/L
88,Haloacetic Acids (five) (HAA5),0.060,,,,0.060,,mg/L
89,Bromate,0.010,0.005,0.000100,2009.0,0.010,0.000,mg/L

