In [1]:
from tabula import read_pdf
from tabulate import tabulate
import pandas as pd
import numpy as np


In [2]:
filename = r"Data\mcls_dlrs_phgs.pdf"  # raw string so the backslash is not read as an escape
df_list = read_pdf(filename, pages='all')

In [35]:
df_list

[              State Regulated  State  State.1  State PHG  State.2 Federal  \
 0          Inorganic Chemical    MCL      DLR        NaN  Date of     MCL   
 1                 Contaminant    NaN      NaN        NaN      PHG     NaN   
 2                    Aluminum      1     0.05        0.6     2001      --   
 3                    Antimony  0.006    0.006      0.001     2016   0.006   
 4                     Arsenic  0.010    0.002   0.000004     2004   0.010   
 5             Asbestos (MFL =  7 MFL  0.2 MFL      7 MFL     2003   7 MFL   
 6   million fibers per liter;    NaN      NaN        NaN      NaN     NaN   
 7              for fibers >10    NaN      NaN        NaN      NaN     NaN   
 8               microns long)    NaN      NaN        NaN      NaN     NaN   
 9                      Barium      1      0.1          2     2003       2   
 10                  Beryllium  0.004    0.001      0.001     2003   0.004   
 11                    Cadmium  0.005    0.001    0.00004     20

## The File

tabula returns a list of DataFrames, one per table in the PDF. Each table covers a different class of water pollutant, such as radionuclides, organics, disinfectants...
The tables differ in length, three of them have shifted columns plus an extra column, some record the number 0 as the string 'zero', and many of the missing values are represented as '--'.

The column headings are inconsistent and unintuitive, so the first function standardizes the column names to meaningful measures.

## Decontaminate Labels
This identifies:
- Contaminant
- State Maximum Contaminant Level (MCL)
- State Detection Limit for Purposes of Reporting (DLR)
- State Public Health Goals (often smells and tastes)
- Public Health Goal Date
- Federal Maximum Contaminant Level (MCL)
- Federal Maximum Contaminant Level Goal (MCLG)

In [3]:
def Decontaminate_Labels(df_list):
    for df in df_list:
        df.rename(columns={df.columns[0]: "Contaminant",
                           df.columns[1]: "State_MCL",
                           df.columns[2]: "State_DLR",
                           df.columns[3]: "State_PHG",
                           df.columns[4]: "PHG_Date",
                           df.columns[5]: "Federal_MCL",
                           df.columns[6]: "Federal_MCLG"
                           }, inplace=True)
    return df_list
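
The positional renaming above can be sketched on a toy frame. The column names and the single data row below are invented stand-ins for tabula's output, and `relabel` is a hypothetical helper, not part of the notebook. Zipping the existing columns with the target names renames only the first seven, so a table that arrives with an extra trailing column keeps that column under its original name.

```python
import pandas as pd

TARGET_NAMES = ["Contaminant", "State_MCL", "State_DLR", "State_PHG",
                "PHG_Date", "Federal_MCL", "Federal_MCLG"]

def relabel(df):
    """Rename the leading columns by position; extra columns keep their names."""
    # zip stops at the shorter sequence, so only the first seven columns are mapped
    mapping = dict(zip(df.columns, TARGET_NAMES))
    return df.rename(columns=mapping)

# Toy table with one more column than expected (assumed layout)
toy = pd.DataFrame(
    [["Aluminum", "1", "0.05", "0.6", "2001", "--", "--", "x"]],
    columns=["State Regulated", "State", "State.1", "State PHG",
             "State.2", "Federal", "Federal.1", "Extra"],
)
relabeled = relabel(toy)
print(list(relabeled.columns))
```

Returning a new frame instead of renaming in place is a design choice of this sketch; the notebook's function mutates each table directly.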


In [4]:
df_relabeled = Decontaminate_Labels(df_list)

In [5]:
df_relabeled[0].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Inorganic Chemical,MCL,DLR,,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Aluminum,1,0.05,0.6,2001,--,--
3,Antimony,0.006,0.006,0.001,2016,0.006,0.006
4,Arsenic,0.010,0.002,4e-06,2004,0.010,zero


Just looking at the first table, there are issues with the first two rows: they are not actually data, but continued headers. These will be handled in later functions.
The other visible issue is in the Aluminum row, where two missing values are labeled '--', which occurs throughout the document.

## Decontaminate_Nulls
This next function aims to change all '--' values in each list to a numpy NaN

In [6]:
df_relabeled[0].replace('--', np.nan, inplace=True)

In [7]:

def Decontaminate_Nulls(df_list):   
    import numpy as np
    for df in df_list:
        df.replace('--', np.nan, inplace=True)
    return df_list
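
As a hedged variation on the exact-match replacement above, the sketch below uses a regular expression so that cells padded with whitespace are also caught. The toy values are invented, and whether the real PDF contains padded dashes is an assumption. Note that this pattern would also treat a lone '-' as missing.

```python
import numpy as np
import pandas as pd

# Invented sample cells: plain dashes, padded dashes, and real values
toy = pd.DataFrame({"Federal_MCL": ["--", " -- ", "0.006"],
                    "Federal_MCLG": ["--", "zero", "0.006"]})

# Match cells that consist only of dashes plus optional surrounding whitespace
cleaned = toy.replace(r"^\s*-+\s*$", np.nan, regex=True)
print(cleaned.isna().sum().to_dict())
```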


In [8]:
df_nulled = Decontaminate_Nulls(df_relabeled)

In [9]:
df_nulled[0].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Inorganic Chemical,MCL,DLR,,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Aluminum,1,0.05,0.6,2001,,
3,Antimony,0.006,0.006,0.001,2016,0.006,0.006
4,Arsenic,0.010,0.002,4e-06,2004,0.010,zero


As can be seen, the nulls have been addressed; this was also checked in several of the other tables.

The next issue that needs to be addressed is the presence of the leftover subheading rows.

## Decontaminate Rows
Looking through each of the tables in the original documentation and the tabulated data, the rows that do not name a contaminant will have one of the following issues in the 'State_MCL' column:
- It will be NaN
- It will say 'MCL'
- It may say 'mrem/yr' in the case of radioactive material

Some of the known contaminants truly have a NaN value for the State_MCL. 
This raises two questions:
- Is there an overarching Federal MCL that must already be met?
- Is this contaminant only a goal? 

If there is a Federal_MCL, the State_MCL will be set equal to the Federal_MCL, since it MUST be met. 
If there is NO Federal_MCL or State_MCL, the contaminant will not be included in this study. 
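
A minimal sketch of these two rules, on invented rows (the contaminant names and numbers are hypothetical, not taken from the PDF): the state limit falls back to the federal limit where it is missing, and rows with neither limit are dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Contaminant": ["A (state limit)", "B (federal only)", "C (goal only)"],
    "State_MCL": [0.006, np.nan, np.nan],
    "Federal_MCL": [0.006, 0.010, np.nan],
})

# Rule 1: fall back to the federal limit where the state limit is missing
toy["State_MCL"] = toy["State_MCL"].fillna(toy["Federal_MCL"])
# Rule 2: drop contaminants that have neither limit
in_scope = toy.dropna(subset=["State_MCL"]).reset_index(drop=True)
print(in_scope)
```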

In [10]:
df_nulled[3].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Radionuclides,MCL,DLR,PHG,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Gross alpha particle,15,3,none,,15,zero
3,activity - OEHHA,,,,,,
4,concluded in 2003 that,,,,,,


As the table above illustrates, and as held in every table, whenever there was no State_MCL there was never a Federal_MCL either. The reverse was not true: some contaminants have a State_MCL but no Federal_MCL. Because this held for all 14 categories, I was able to drop every row with a null State_MCL, as such a row is either a header or a contaminant outside the scope of the project.

### Units
One thing missing from this table is the unit of measure for these specifications. Because sites report in different units, it will be important to convert the measurements to units that can be compared with the regulations.
The default, as specified in the documentation, is mg/L unless otherwise specified. Because the regulatory values will later be converted to ug/L (see the note below), this function adds a 'Units' column with every row set to 'ug/L'.

Contaminants specified in other units will have their values and units adjusted on a case-by-case basis in a later function, Decontaminate_Values.

NOTE: While the values in the state regulations, by default, are in mg/L, most of the collected data were in ug/L. Instead of changing the ug/L on a case-by-case basis from the lab data, I will convert the mg/L to ug/L for all state regs, and then convert back any that should remain in mg/L. This cannot be done yet, because not all columns are numeric, but this will happen after the data within the columns are set to numeric.
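
The arithmetic behind this plan can be sketched in plain Python. The lookup table and the `to_ug_per_l` helper are hypothetical illustrations and are not used by the notebook's functions; they only show the factors involved (1 mg = 1000 ug, 1 pg = 0.000001 ug).

```python
import math

# Multipliers that turn a value in the given unit into ug/L
TO_UG_PER_L = {"mg/L": 1000.0, "ug/L": 1.0, "pg/L": 1e-6}

def to_ug_per_l(value, unit):
    """Convert a concentration to ug/L; raises KeyError for unknown units."""
    return value * TO_UG_PER_L[unit]

# Antimony's state MCL of 0.006 mg/L corresponds to 6 ug/L
antimony = to_ug_per_l(0.006, "mg/L")
print(math.isclose(antimony, 6.0))
```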

In [11]:
def Decontaminate_Rows(df_list):
    for n in range(len(df_list)):
        df = df_list[n]
        # Keep rows whose State_MCL is present and is not a leftover sub-header
        keep = df['State_MCL'].notna() & ~df['State_MCL'].isin(['MCL', 'mrem/yr'])
        # .copy() avoids a SettingWithCopyWarning when 'Units' is added
        df_list[n] = df.loc[keep].copy()
        df_list[n]['Units'] = 'ug/L'
    return df_list

In [12]:
df_sub_headless = Decontaminate_Rows(df_nulled)

In [13]:
df_sub_headless[4]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Federal\rMCLG,Units
0,Strontium-90,8,2,0.35,,2006,,,ug/L
1,Tritium,"""20,000""","""1,000""",400.0,,2006,,,ug/L
2,Uranium,20,1,0.43,,2001,30 μg/L,zero,ug/L


### Column Discrepancies
There were discrepancies in the number of columns in three of the tables: 4, 7, and 11. In each, one of the earlier columns was shifted, leaving incorrect values under the later headings and adding an extra Federal\rMCLG column at the end.


## Decontaminate Lists
Each of these tables was examined, paying attention to PHG_Date and any NaN values, the best indicators of where the shift occurred. Each was then corrected by dropping the column that contained only NaN values and using a dictionary to rename the remaining columns to the appropriate names.
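
The manual inspection described here could be supported by a small report like the one below. It is a sketch only: `shift_report` is a hypothetical helper, and the toy frame is modeled on two rows of the table 4 output shown earlier, with the carriage return in the extra column name kept as tabula produced it.

```python
import numpy as np
import pandas as pd

EXPECTED = ["Contaminant", "State_MCL", "State_DLR", "State_PHG",
            "PHG_Date", "Federal_MCL", "Federal_MCLG", "Units"]

def shift_report(df):
    """List columns that are entirely NaN and columns that are not expected."""
    empty = [c for c in df.columns if df[c].isna().all()]
    extra = [c for c in df.columns if c not in EXPECTED]
    return {"all_nan": empty, "unexpected": extra}

toy = pd.DataFrame({
    "Contaminant": ["Strontium-90", "Uranium"],
    "State_MCL": ["8", "20"],
    "State_DLR": ["2", "1"],
    "State_PHG": ["0.35", "0.43"],
    "PHG_Date": [np.nan, np.nan],          # the empty column marks the shift
    "Federal_MCL": ["2006", "2001"],       # dates landed one column too far right
    "Federal_MCLG": [np.nan, "30 μg/L"],
    "Federal\rMCLG": [np.nan, "zero"],     # extra column created by the shift
    "Units": ["ug/L", "ug/L"],
})
report = shift_report(toy)
print(report)
```

An all-NaN column together with an unexpected column name is the pattern that flagged tables 4, 7, and 11.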

In [14]:

def Decontaminate_Lists(df_list):
    for n in range(len(df_list)):
        if n == 4:
            df_list[n].drop(columns='PHG_Date', inplace=True)
            df_list[n].rename(columns={'Federal_MCL': 'PHG_Date',
                                       'Federal_MCLG': 'Federal_MCL',
                                       'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
        elif n == 7:
            df_list[n].drop(columns='State_DLR', inplace=True)
            df_list[n].rename(columns={'State_PHG': 'State_DLR',
                                       'PHG_Date': 'State_PHG',
                                       'Federal_MCL': 'PHG_Date',
                                       'Federal_MCLG': 'Federal_MCL',
                                       'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
        elif n == 11:
            df_list[n].drop(columns='Federal_MCLG', inplace=True)
            df_list[n].rename(
                columns={'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
    return df_list

In [15]:
df_decon_lists = Decontaminate_Lists(df_sub_headless)

In [16]:
df_decon_lists[4]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Strontium-90,8,2,0.35,2006,,,ug/L
1,Tritium,"""20,000""","""1,000""",400.0,2006,,,ug/L
2,Uranium,20,1,0.43,2001,30 μg/L,zero,ug/L


This particular table shows some other interesting issues that need to be addressed: 
- Tritium's values are strings, not numeric
- Strontium and Tritium units should be in pCi/L
- Uranium's Federal Goal should be the numeric value zero, not a string

Each table, upon investigation, has its own issues that need to be addressed.
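
The kinds of cell seen above (quoted numbers with thousands separators, values with a unit attached, and the word 'zero') can be parsed with a small helper. This is a sketch under the assumption that these are the only patterns of interest; `parse_limit` is a hypothetical name, and the notebook itself fixes these cells by hand in the next function.

```python
import re

def parse_limit(text):
    """Return (value, unit) from a raw cell; unit is None when none is stated."""
    s = str(text).strip().strip('"')
    if s.lower() == "zero":
        return 0.0, None
    # A number that may contain thousands separators, then an optional unit
    match = re.match(r"^(\d[\d,]*\.?\d*)\s*(.*)$", s)
    if match is None:
        return None, None
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).strip() or None
    return number, unit

for cell in ['"20,000"', "30 μg/L", "zero", "7 MFL", "none"]:
    print(repr(cell), parse_limit(cell))
```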

## Decontaminate Values
This function addresses the issues that are specific to each list, and cannot be standardized across the rest of the documentation. 

Not all of the tables needed modifications; those that did not, such as 12 and 14, are not included here.

NOTE: The data are still not converted to numeric, so I still can't perform the bulk unit conversion, but I am preparing the inserted data to be multiplied by 1000, so as not to have to correct the same rows twice.

In [17]:

def Decontaminate_Values(df_list):
    import numpy as np
    for n in range(len(df_list)):
        if n == 0:
            # Fixes the string zero to numerical
            df_list[n].loc[4, ["Federal_MCLG"]] = [0]
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "PHG_Date", "Federal_MCL", "Federal_MCLG", "Units"]] = [
                'Asbestos', .007, .0002, .007, 2003, .007, .007, 'MFL']  # Removes the units from every value to numerical values
            # Changes long text to just chromium, total - changes 'withdrawn' to Null
            df_list[n].loc[12, ["Contaminant", "State_PHG"]] = ['Chromium, Total', np.nan]
        elif n == 1:
            df_list[n].loc[1, ["State_MCL", "State_DLR", "State_PHG", "Federal_MCL", "Federal_MCLG", "Units"]] = [.00015, 0.0001, 0.00015, 0.0002, 0.0002, 'mg/L']  # Cyanide
            df_list[n].loc[2, ["State_MCL", "State_DLR", "State_PHG", "Federal_MCL", "Federal_MCLG", "Units"]] = [0.002, 0.0001, 0.001, 0.004, 0.004, 'mg/L'] # Fluoride
            df_list[n].loc[3, ["Contaminant", "PHG_Date"]] = ['Mercury', 2005]
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrate', .010, .045, 'mg/L as N'] # Mult by 10 to convert from 10mg as N to mg/L as N (instead of 10 mg/L as N)
            df_list[n].loc[6, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrite', .001, .001, 'mg/L as N']
            df_list[n].loc[7, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrate + Nitrite', .010, .010, 'mg/L as N'] # mult by 10 to convert to mg/l as N
            df_list[n].loc[10, ["PHG_Date"]] = [2004]
        elif n == 2:
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
        elif n == 3:
            df_list[n].loc[2, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "PHG_Date", "Federal_MCL", "Federal_MCLG", "Units"]] = [
                "Gross Alpha Particle", 0.0225, 0.0045, np.nan, np.nan, 0.0225, 0.0, 'ug/L'] # adjusted by 1.5 to convert pCi/L to ug/L
            df_list[n].loc[7, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "PHG_Date", "Federal_MCL", "Federal_MCLG", "Units"]] = [
                "Gross Beta Particle", 0.004, 0.004, np.nan, np.nan, 0.004, 0.0, 'mrem/yr']
            df_list[n].loc[14, ["Contaminant", "State_MCL", "Federal_MCL",  "Federal_MCLG", "Units"]] = [
                'Radium-226 + Radium-228', 0.0075, 0.0075, 0.0, 'ug/L'] # Mult by 1.5 to get into ug/L
        elif n == 4:
            df_list[n].loc[0, ["State_MCL", "State_DLR", "State_PHG",'Units']] = [0.012, 0.003, 0.000525, 'ug/L'] #strontium-90, multiplied all values by 1.5 to adjust from pCi/L to ug/L
            df_list[n].loc[1, ["State_MCL", "State_DLR", "State_PHG", "Units"]] = [30, 1.5, .6, 'ug/L'] #tritium, adjusted by 1.5 to ug/L
            df_list[n].loc[2, ["State_MCL", "State_DLR", "State_PHG", "Federal_MCL", "Federal_MCLG", 'Units']] = [
                0.02, 0.001, 0.00043, .030, 0.0, 'ug/L'] #uranium
        elif n == 5:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[1, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date"]] = [2009]
            df_list[n].loc[3, ["Contaminant"]] = ['1,4-Dichlorobenzene(p-DCB)']
            df_list[n].loc[4, ["Contaminant"]] = ['1,1-Dichloroethane (1,1-DCA)']
            df_list[n].loc[5, ["Contaminant", "PHG_Date", "Federal_MCLG"]] = [
                '1,2-Dichloroethane (1,2-DCA)', 2005, 0.0]
            df_list[n].loc[6, ["Contaminant"]] = [
                '1,1-Dichloroethylene (1,1-DCE)']
        elif n == 6:
            df_list[n].loc[1, ["Contaminant"]] = ['trans-1,2-Dichloroethylene']
            df_list[n].loc[2, ["Contaminant", "Federal_MCLG"]] = [
                'Dichloromethane (Methylene chloride)', 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["PHG_Date"]] = [2006]
            df_list[n].loc[6, ["Contaminant"]] = [
                'Methyl tertiary butyl ether (MTBE)']
            df_list[n].loc[9, ["Contaminant"]] = ['1,1,2,2-Tetrachloroethane']
            df_list[n].loc[10, ["Contaminant", "Federal_MCLG"]] = [
                'Tetrachloroethylene (PCE)', 0.0]
        elif n == 7:
            df_list[n].loc[0, ["Contaminant"]] = [
                '1,1,1-Trichloroethane (1,1,1-TCA)']
            df_list[n].loc[1, ["Contaminant"]] = [
                '1,1,2-Trichloroethane (1,1,2-TCA)']
            df_list[n].loc[2, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[3, ["Contaminant"]] = [
                'Trichlorofluoromethane (Freon 11)']
            df_list[n].loc[4, ["Contaminant", "PHG_Date"]] = [
                '1,1,2-Trichloro-1,2,2-Trifluoroethane (Freon 113)', 2011]
            df_list[n].loc[5, ["Federal_MCLG"]] = [0.0]
        elif n == 8:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date"]] = [2009]
        elif n == 9:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date", "Federal_MCLG"]] = [2006, 0.0]
            df_list[n].loc[3, ["PHG_Date"]] = [2009]
            df_list[n].loc[4, ["Contaminant", "Federal_MCLG"]] = [
                '1,2-Dibromo-3-chloropropane (DBCP)', 0.0]
            df_list[n].loc[5, ["Contaminant"]] = [
                '2,4-Dichlorophenoxyacetic acid (2,4-D)']
            df_list[n].loc[6, ["Contaminant"]] = ['Di(2-ethylhexyl)adipate']
            df_list[n].loc[7, ["Contaminant", "Federal_MCLG"]] = [
                'Di(2-ethylhexyl)phthalate (DEHP)', 0.0]
            df_list[n].loc[8, ["PHG_Date"]] = [2010]
        elif n == 10:
            df_list[n].loc[1, ["Contaminant", "Federal_MCL", "Federal_MCLG"]] = [
                'Ethylene dibromide (EDB)', .00005, 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[5, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[6, ["Contaminant"]] = ['Hexachlorocyclopentadiene']
            df_list[n].loc[7, ["PHG_Date"]] = [2005]
            df_list[n].loc[11, ["Federal_MCLG"]] = [0.0]
        elif n == 11:
            df_list[n].loc[0, ["Contaminant", "Federal_MCLG"]] = [
                'Polychlorinated biphenyls (PCBs)', 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["Contaminant", "State_MCL", "State_DLR"]] = [
                '1,2,3-Trichloropropane', .000005, .000005]
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "Federal_MCL",
                               "Federal_MCLG", "Units"]] = ['2,3,7,8-TCDD (dioxin)', 0.03, 0.005, 0.00005, 0.03, 0.0, 'pg/L'] # Changed to pg/L and adjusted values for multplication by 1000
        elif n == 13:
            df_list[n].loc[2, ["Contaminant"]] = [
                'Haloacetic Acids (five) (HAA5)']
            df_list[n].loc[8, ["State_DLR", "Federal_MCLG"]] = [.0000050, 0.0]
    return df_list



In [18]:
df_values = Decontaminate_Values(df_decon_lists)

In [19]:
df_values[11]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Polychlorinated biphenyls (PCBs),0.0005,0.0005,9e-05,2007,0.0005,0.0,ug/L
1,Simazine,0.004,0.001,0.004,2001,0.004,0.004,ug/L
2,Thiobencarb,0.07,0.001,0.042,2016,,,ug/L
3,Toxaphene,0.003,0.001,3e-05,2003,0.003,0.0,ug/L
4,"1,2,3-Trichloropropane",5e-06,5e-06,7e-07,2009,,,ug/L
5,"2,3,7,8-TCDD (dioxin)",0.03,0.005,5e-05,2010,0.03,0.0,pg/L
6,"2,4,5-TP (Silvex)",0.05,0.001,0.003,2014,0.05,0.05,ug/L


Notice that 2,3,7,8-TCDD (dioxin) now has units of pg/L.

In [20]:
df_values[11].dtypes

Contaminant     object
State_MCL       object
State_DLR       object
State_PHG       object
PHG_Date         int64
Federal_MCL     object
Federal_MCLG    object
Units           object
dtype: object

## Decontaminate_Datatypes
The previous function has now fixed the issues within each of the tables, adjusting the units, changing any additional missing or unknown values to NaN, and changing strings to numerical values where appropriate.
Decontaminate_Datatypes converts the value columns to numeric and is applied to the full dataframe after the individual tables have been concatenated.


In [21]:
def Decontaminate_Datatypes(df):
    df.State_MCL = pd.to_numeric(df.State_MCL)
    df.State_DLR = pd.to_numeric(df.State_DLR)
    df.State_PHG = pd.to_numeric(df.State_PHG)
    df.Federal_MCL = pd.to_numeric(df.Federal_MCL)
    df.Federal_MCLG = pd.to_numeric(df.Federal_MCLG)
    return df 
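
Before running a strict conversion like the one above, it can help to list the cells that would fail to parse. The sketch below uses `pd.to_numeric` with `errors="coerce"` on an invented series; the sample values are assumptions chosen to resemble the raw cells seen earlier.

```python
import pandas as pd

# Invented column mixing clean numbers, leftover text, and a true missing value
raw = pd.Series(["0.006", "zero", "7 MFL", None, 0.01], name="Federal_MCLG")

# errors="coerce" turns unparseable cells into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

# Cells that held something but failed to parse still need a manual fix
leftover = raw[converted.isna() & raw.notna()]
print(leftover.tolist())
```

An empty `leftover` suggests the strict conversion will succeed for that column.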

In [22]:
df = pd.concat(df_list, ignore_index=True)
df_numeric = Decontaminate_Datatypes(df)

In [23]:
df_numeric

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1.000,0.050000,0.600000,2001,,,ug/L
1,Antimony,0.006,0.006000,0.001000,2016,0.006,0.006,ug/L
2,Arsenic,0.010,0.002000,0.000004,2004,0.010,0.000,ug/L
3,Asbestos,0.007,0.000200,0.007000,2003,0.007,0.007,MFL
4,Barium,1.000,0.100000,2.000000,2003,2.000,2.000,ug/L
...,...,...,...,...,...,...,...,...
86,"2,4,5-TP (Silvex)",0.050,0.001000,0.003000,2014,0.050,0.050,ug/L
87,Total Trihalomethanes,0.080,,,,0.080,,ug/L
88,Haloacetic Acids (five) (HAA5),0.060,,,,0.060,,ug/L
89,Bromate,0.010,0.000005,0.000100,2009,0.010,0.000,ug/L


## Decontaminate_Unit_Conversion

Finally, the regulatory values are converted from mg/L to ug/L by multiplying every value column by 1000.
The only two contaminants that should remain in mg/L are Cyanide and Dissolved Fluoride.

To handle this, their values were divided by 1000 in the Decontaminate_Values function above and their units were set to mg/L, so the multiplication here restores the original numbers.

In [24]:
def Decontaminate_Unit_Conversion(df):
    # Multiply every regulatory value by 1000 (mg/L -> ug/L)
    value_cols = ['State_MCL', 'State_DLR', 'State_PHG',
                  'Federal_MCL', 'Federal_MCLG']
    df[value_cols] = df[value_cols] * 1000
    return df

In [25]:
df_converted = Decontaminate_Unit_Conversion(df_numeric)

In [26]:
df_converted.head(50)

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1000.0,50.0,600.0,2001.0,,,ug/L
1,Antimony,6.0,6.0,1.0,2016.0,6.0,6.0,ug/L
2,Arsenic,10.0,2.0,0.004,2004.0,10.0,0.0,ug/L
3,Asbestos,7.0,0.2,7.0,2003.0,7.0,7.0,MFL
4,Barium,1000.0,100.0,2000.0,2003.0,2000.0,2000.0,ug/L
5,Beryllium,4.0,1.0,1.0,2003.0,4.0,4.0,ug/L
6,Cadmium,5.0,1.0,0.04,2006.0,5.0,5.0,ug/L
7,"Chromium, Total",50.0,10.0,,1999.0,100.0,100.0,ug/L
8,Cyanide,0.15,0.1,0.15,1997.0,0.2,0.2,mg/L
9,Fluoride,2.0,0.1,1.0,1997.0,4.0,4.0,mg/L


In [27]:
df_converted.tail(41)


Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
50,"1,1,2-Trichloro-1,2,2-Trifluoroethane (Freon 113)",1200.0,10.0,4000.0,2011.0,,,ug/L
51,Vinyl chloride,0.5,0.5,0.05,2000.0,2.0,0.0,ug/L
52,Xylenes,1750.0,0.5,1800.0,1997.0,10000.0,10000.0,ug/L
53,Alachlor,2.0,1.0,4.0,1997.0,2.0,0.0,ug/L
54,Atrazine,1.0,0.5,0.15,1999.0,3.0,3.0,ug/L
55,Bentazon,18.0,2.0,200.0,2009.0,,,ug/L
56,Benzo(a)pyrene,0.2,0.1,0.007,2010.0,0.2,0.0,ug/L
57,Carbofuran,18.0,5.0,0.7,2016.0,40.0,40.0,ug/L
58,Chlordane,0.1,0.1,0.03,2006.0,2.0,0.0,ug/L
59,Dalapon,200.0,10.0,790.0,2009.0,200.0,200.0,ug/L


### Verification 
All values were verified against the PDF; nitrate and nitrate + nitrite are now in the correct units and measures, as are the values converted from pCi/L to ug/L and from mg/L to pg/L.
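
A manual verification like this can be recorded as a repeatable spot check. The sketch below is an illustration only: the small frame is a stand-in built from three rows of the converted output shown above, and `mismatches` is a hypothetical helper.

```python
import math
import pandas as pd

# Stand-in for the converted frame, using rows shown in the output above
converted = pd.DataFrame({
    "Contaminant": ["Aluminum", "Arsenic", "Cyanide"],
    "State_MCL": [1000.0, 10.0, 0.15],
    "Units": ["ug/L", "ug/L", "mg/L"],
})

# Values transcribed by hand for comparison: (State_MCL, Units)
expected = {"Aluminum": (1000.0, "ug/L"),
            "Arsenic": (10.0, "ug/L"),
            "Cyanide": (0.15, "mg/L")}

def mismatches(df, expected):
    """Return the contaminants whose limit or unit differs from the expectation."""
    indexed = df.set_index("Contaminant")
    bad = []
    for name, (value, unit) in expected.items():
        row = indexed.loc[name]
        if not math.isclose(row["State_MCL"], value) or row["Units"] != unit:
            bad.append(name)
    return bad

print(mismatches(converted, expected))
```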

## Decontaminate
The final function calls each of the subroutines in the order followed above. It takes the filename of the PDF, so reading the file no longer has to be done as a separate step. It concatenates all of the tables into a single DataFrame before converting the datatypes and units.

In [28]:
def Decontaminate(filename):
    from tabula import read_pdf
    from tabulate import tabulate
    import pandas as pd

    df_list = read_pdf(filename, pages='all')
    Decontaminate_Labels(df_list)
    Decontaminate_Nulls(df_list)
    Decontaminate_Rows(df_list)
    Decontaminate_Lists(df_list)
    Decontaminate_Values(df_list)
    df = pd.concat(df_list, ignore_index=True)
    Decontaminate_Datatypes(df)
    Decontaminate_Unit_Conversion(df)
    return df

In [29]:
decon = Decontaminate(filename)

In [30]:
decon.head(50)

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1000.0,50.0,600.0,2001.0,,,ug/L
1,Antimony,6.0,6.0,1.0,2016.0,6.0,6.0,ug/L
2,Arsenic,10.0,2.0,0.004,2004.0,10.0,0.0,ug/L
3,Asbestos,7.0,0.2,7.0,2003.0,7.0,7.0,MFL
4,Barium,1000.0,100.0,2000.0,2003.0,2000.0,2000.0,ug/L
5,Beryllium,4.0,1.0,1.0,2003.0,4.0,4.0,ug/L
6,Cadmium,5.0,1.0,0.04,2006.0,5.0,5.0,ug/L
7,"Chromium, Total",50.0,10.0,,1999.0,100.0,100.0,ug/L
8,Cyanide,0.15,0.1,0.15,1997.0,0.2,0.2,mg/L
9,Fluoride,2.0,0.1,1.0,1997.0,4.0,4.0,mg/L


In [31]:
decon.dtypes

Contaminant      object
State_MCL       float64
State_DLR       float64
State_PHG       float64
PHG_Date         object
Federal_MCL     float64
Federal_MCLG    float64
Units            object
dtype: object

### LATER UPDATE
Now that I'm looking at the parameters from the state lab results, the naming convention in the collected data is different. The lab names are not unique rows like those in the regulatory sheet, so I will change the regulatory names to match those of the collected data.

To do this, each of the lists was sorted alphabetically in SQL. Since I only need to address the contaminants present in the state regulatory material, I only need to scroll through the 435 rows of distinct labels in the lab tests to verify the labeling of the 91 in the standards. Most of the convention issues are in the first few, as these are the chemical structures with acronyms or abbreviations that are not present in the collected data.

- Dichloroethylene and Dichloroethene are the same chemical structure, but use different naming conventions. The state regulations use Dichloroethylene in their documentation, but the lab measurements use Dichloroethene.
- Many parameters in the measurements table are present as both dissolved and total. While no values in the standards state dissolved, several do say total, so unless specified as total, the values will be designated Dissolved. For example, the standards state only "Aluminum", but there are measurements for both Dissolved and Total Aluminum.
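
The two conventions above can be expressed as a small rule-based helper. This is a sketch, not the approach the notebook takes (which renames rows one by one): `lab_name` is a hypothetical function, and the set of parameters measured both ways is an assumption that would have to come from the lab data.

```python
def lab_name(name, measured_both_ways):
    """Translate a regulatory name into the lab's naming convention."""
    # "-chloroethylene" and "-chloroethene" name the same compounds
    name = name.replace("chloroethylene", "chloroethene")
    # Default to the dissolved measurement unless the standard says total
    if name in measured_both_ways and not name.lower().startswith("total"):
        name = "Dissolved " + name
    return name

both = {"Aluminum", "Arsenic"}  # assumed subset of parameters reported both ways
for raw in ["1,1-Dichloroethylene", "Aluminum", "Ethylene dibromide"]:
    print(raw, "->", lab_name(raw, both))
```

Matching on "chloroethylene" rather than "ethylene" leaves names such as Ethylene dibromide and Methylene chloride untouched.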

I will create a function that changes the problematic chemical names identified in this review, as not all of them were modified in the original Decontaminate_Values function.


In [32]:
def Decontaminate_Names(df):
    df.loc[0, ['Contaminant']] = ['Dissolved Aluminum']
    df.loc[1, ['Contaminant']] = ['Dissolved Antimony']
    df.loc[2, ['Contaminant']] = ['Dissolved Arsenic']
    df.loc[3, ['Contaminant']] = ['Asbestos, Chrysotile']
    df.loc[4, ['Contaminant']] = ['Dissolved Barium']
    df.loc[5, ['Contaminant']] = ['Dissolved Beryllium']
    df.loc[6, ['Contaminant']] = ['Dissolved Cadmium']
    df.loc[7, ['Contaminant']] = ['Total Chromium']
    df.loc[8, ['Contaminant']] = ['Cyanide']
    df.loc[9, ['Contaminant']] = ['Dissolved Fluoride']
    df.loc[10, ['Contaminant']] = ['Dissolved Mercury']
    df.loc[11, ['Contaminant']] = ['Dissolved Nickel']
    df.loc[12, ['Contaminant']] = ['Dissolved Nitrate']
    df.loc[13, ['Contaminant']] = ['Dissolved Nitrite']
    df.loc[14, ['Contaminant']] = ['Dissolved Nitrate + Nitrite']
    df.loc[16, ['Contaminant']] = ['Dissolved Selenium']
    df.loc[17, ['Contaminant']] = ['Dissolved Thallium']
    df.loc[18, ['Contaminant']] = ['Dissolved Copper']
    df.loc[19, ['Contaminant']] = ['Dissolved Lead']
    df.loc[23, ['Contaminant']] = ['Dissolved Strontium']
    df.loc[25, ['Contaminant']] = ['Dissolved Uranium']
    df.loc[27, ['Contaminant']] = ['Carbon tetrachloride']
    df.loc[28, ['Contaminant']] = ['1,2-Dichlorobenzene']
    df.loc[29, ['Contaminant']] = ['1,4-Dichlorobenzene']
    df.loc[30, ['Contaminant']] = ['1,1-Dichloroethane']
    df.loc[31, ['Contaminant']] = ['1,2-Dichloroethane']
    df.loc[32, ['Contaminant']] = ['1,1-Dichloroethene'] # Note that Dichloroethylene and Dichloroethene are the same chemical compound
    df.loc[33, ['Contaminant']] = ['cis-1,2-Dichloroethene']
    df.loc[34, ['Contaminant']] = ['trans-1,2-Dichloroethene']
    df.loc[35, ['Contaminant']] = ['Methylene chloride'] # This was listed as Dichloromethane (Methylene chloride); the labs use the latter
    df.loc[36, ['Contaminant']] = ['1,2-Dichloropropane']
    df.loc[37, ['Contaminant']] = ['cis-1,3-Dichloropropene'] # There is an issue here where the labs collected cis and trans separately, but the state only regulates the mixture
    df.loc[39, ['Contaminant']] = ['Methyl tert-butyl ether (MTBE)'] # tert is an abbreviation for tertiary
    df.loc[40, ['Contaminant']] = ['Chlorobenzene'] # Chlorobenzene is a specific and simplest of the monochlorobenzenes
    df.loc[43, ['Contaminant']] = ['Tetrachloroethene']  # There is a problem with the lab data here; they have both tetrachloroethylene and tetrachloroethene, which are the same thing
    df.loc[46, ['Contaminant']] = ['1,1,1-Trichloroethane']
    df.loc[47, ['Contaminant']] = ['1,1,2-Trichloroethane']
    df.loc[48, ['Contaminant']] = ['Trichloroethene']
    df.loc[49, ['Contaminant']] = ['Trichlorofluoromethane']
    df.loc[50, ['Contaminant']] = ['1,1,2-Trichlorotrifluoroethane']
    df.loc[52, ['Contaminant']] = ['Total Xylene, (total)']
    df.loc[60, ['Contaminant']] = ['1,2-Dibromo-3-chloropropane (DBCP)']
    df.loc[61, ['Contaminant']] = ['2,4-D']
    df.loc[62, ['Contaminant']] = ['Bis(2-ethylhexyl) adipate'] # this is the same compound as Di(2-ethylhexyl)adipate
    # This is the same compound as Di(2-ethylhexyl)phthalate
    df.loc[63, ['Contaminant']] = ['bis(2-Ethylhexyl) phthalate']
    df.loc[64, ['Contaminant']] = ['Dinoseb (DNPB)']
    df.loc[68, ['Contaminant']] = ['Ethylene Dibromide']
    df.loc[74, ['Contaminant']] = ['BHC-gamma (Lindane)']
    df.loc[78, ['Contaminant']] = ['Pentachlorophenol (PCP)']
    df.loc[79, ['Contaminant']] = ["PCB's"]
    df.loc[84, ['Contaminant']] = ['1,2,3-Trichloropropane']
    df.loc[85, ['Contaminant']] = ['2,3,7,8-Tetrachlorodibenzo-p-dioxin']
    df.loc[86, ['Contaminant']] = ['2,4,5-TP (Silvex)']
    return df
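
Because the function above addresses rows by number, it depends on the row order never changing. A hedged alternative is to key the renames on the regulatory name itself. The sketch below uses three of the pairs from the function above; `rename_by_label` and the toy frame are hypothetical.

```python
import pandas as pd

# Subset of the renames above, keyed by regulatory name instead of row number
LAB_NAMES = {
    "Aluminum": "Dissolved Aluminum",
    "Asbestos": "Asbestos, Chrysotile",
    "Chromium, Total": "Total Chromium",
}

def rename_by_label(df, names):
    """Return a copy with Contaminant values swapped for their lab names."""
    out = df.copy()
    # Series.replace with a dict swaps exact matches and leaves other values alone
    out["Contaminant"] = out["Contaminant"].replace(names)
    return out

toy = pd.DataFrame({"Contaminant": ["Asbestos", "Bromate", "Aluminum"]})
renamed = rename_by_label(toy, LAB_NAMES)
print(renamed["Contaminant"].tolist())
```

The rows here are deliberately out of their original order to show that the mapping does not rely on position.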
    

        
    

In [33]:
decon

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1000.0,50.000,600.000,2001,,,ug/L
1,Antimony,6.0,6.000,1.000,2016,6.0,6.0,ug/L
2,Arsenic,10.0,2.000,0.004,2004,10.0,0.0,ug/L
3,Asbestos,7.0,0.200,7.000,2003,7.0,7.0,MFL
4,Barium,1000.0,100.000,2000.000,2003,2000.0,2000.0,ug/L
...,...,...,...,...,...,...,...,...
86,"2,4,5-TP (Silvex)",50.0,1.000,3.000,2014,50.0,50.0,ug/L
87,Total Trihalomethanes,80.0,,,,80.0,,ug/L
88,Haloacetic Acids (five) (HAA5),60.0,,,,60.0,,ug/L
89,Bromate,10.0,0.005,0.100,2009,10.0,0.0,ug/L


In [34]:
Decontaminate_Names(decon)

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Dissolved Aluminum,1000.0,50.000,600.000,2001,,,ug/L
1,Dissolved Antimony,6.0,6.000,1.000,2016,6.0,6.0,ug/L
2,Dissolved Arsenic,10.0,2.000,0.004,2004,10.0,0.0,ug/L
3,"Asbestos, Chrysotile",7.0,0.200,7.000,2003,7.0,7.0,MFL
4,Dissolved Barium,1000.0,100.000,2000.000,2003,2000.0,2000.0,ug/L
...,...,...,...,...,...,...,...,...
86,"2,4,5-TP (Silvex)",50.0,1.000,3.000,2014,50.0,50.0,ug/L
87,Total Trihalomethanes,80.0,,,,80.0,,ug/L
88,Haloacetic Acids (five) (HAA5),60.0,,,,60.0,,ug/L
89,Bromate,10.0,0.005,0.100,2009,10.0,0.0,ug/L


In [35]:
def Decontaminate(filename):
    from tabula import read_pdf
    from tabulate import tabulate
    import pandas as pd

    df_list = read_pdf(filename, pages='all')
    Decontaminate_Labels(df_list)
    Decontaminate_Nulls(df_list)
    Decontaminate_Rows(df_list)
    Decontaminate_Lists(df_list)
    Decontaminate_Values(df_list)
    df = pd.concat(df_list, ignore_index=True)
    Decontaminate_Datatypes(df)
    Decontaminate_Unit_Conversion(df)
    Decontaminate_Names(df)
    return df


In [36]:
Decontaminate(filename)

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Dissolved Aluminum,1000.0,50.000,600.000,2001,,,ug/L
1,Dissolved Antimony,6.0,6.000,1.000,2016,6.0,6.0,ug/L
2,Dissolved Arsenic,10.0,2.000,0.004,2004,10.0,0.0,ug/L
3,"Asbestos, Chrysotile",7.0,0.200,7.000,2003,7.0,7.0,MFL
4,Dissolved Barium,1000.0,100.000,2000.000,2003,2000.0,2000.0,ug/L
...,...,...,...,...,...,...,...,...
86,"2,4,5-TP (Silvex)",50.0,1.000,3.000,2014,50.0,50.0,ug/L
87,Total Trihalomethanes,80.0,,,,80.0,,ug/L
88,Haloacetic Acids (five) (HAA5),60.0,,,,60.0,,ug/L
89,Bromate,10.0,0.005,0.100,2009,10.0,0.0,ug/L
