In [1]:
from tabula import read_pdf
from tabulate import tabulate
import pandas as pd
import numpy as np


In [2]:
filename = "Data/mcls_dlrs_phgs.pdf"  # forward slash avoids the invalid "\m" escape
df_list = read_pdf(filename, pages='all')

In [3]:
df_list

[              State Regulated  State  State.1  State PHG  State.2 Federal  \
 0          Inorganic Chemical    MCL      DLR        NaN  Date of     MCL   
 1                 Contaminant    NaN      NaN        NaN      PHG     NaN   
 2                    Aluminum      1     0.05        0.6     2001      --   
 3                    Antimony  0.006    0.006      0.001     2016   0.006   
 4                     Arsenic  0.010    0.002   0.000004     2004   0.010   
 5             Asbestos (MFL =  7 MFL  0.2 MFL      7 MFL     2003   7 MFL   
 6   million fibers per liter;    NaN      NaN        NaN      NaN     NaN   
 7              for fibers >10    NaN      NaN        NaN      NaN     NaN   
 8               microns long)    NaN      NaN        NaN      NaN     NaN   
 9                      Barium      1      0.1          2     2003       2   
 10                  Beryllium  0.004    0.001      0.001     2003   0.004   
 11                    Cadmium  0.005    0.001    0.00004     20

## The File

`read_pdf` returns a list of DataFrames, one per table in the PDF. Each table covers a different class of water contaminants, such as radioactive, organics, disinfectants...
The tables differ in length, three of them have an extra column because values were shifted during extraction, some give the number 0 as the string 'zero', and many missing values are represented as '--'.

The column headings are inconsistent and unintuitive, so the first function standardizes the column names to meaningful labels.
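
A quick way to see these differences is to summarize the shape and column names of every table in the list. The sketch below is only an illustration: `summarize_tables` is a hypothetical helper, and `toy_tables` are made-up stand-ins, not the real `read_pdf` output.

```python
import pandas as pd

def summarize_tables(tables):
    """Return one row per table: position, row count, column count, column names."""
    rows = []
    for i, table in enumerate(tables):
        rows.append({
            "table": i,
            "n_rows": table.shape[0],
            "n_cols": table.shape[1],
            "columns": list(table.columns),
        })
    return pd.DataFrame(rows)

# Toy stand-ins for two extracted tables (hypothetical values)
toy_tables = [
    pd.DataFrame({"State Regulated": ["Aluminum"], "State": ["1"], "Federal": ["--"]}),
    pd.DataFrame({"State Regulated": ["Uranium"], "State": ["20"],
                  "Federal": ["30"], "Federal\rMCLG": ["zero"]}),
]
summary = summarize_tables(toy_tables)
print(summary[["table", "n_rows", "n_cols"]])
```

Running the same helper on `df_list` should make any table with an unexpected column count stand out.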

## Decontaminate Labels
This identifies: 
- Contaminant
- State Maximum Contaminant Level (MCL)
- State Detection Limit for Reporting (DLR)
- State Public Health Goals (often smells and tastes) (PHG)
- Public Health Goal Date
- Federal Maximum Contaminant Level (MCL)
- Federal Maximum Contaminant Level Goal (MCLG)

In [4]:
def Decontaminate_Labels(df_list):
    for df in df_list:
        df.rename(columns={df.columns[0]: "Contaminant",
                           df.columns[1]: "State_MCL",
                           df.columns[2]: "State_DLR",
                           df.columns[3]: "State_PHG",
                           df.columns[4]: "PHG_Date",
                           df.columns[5]: "Federal_MCL",
                           df.columns[6]: "Federal_MCLG"
                           }, inplace=True)
    return df_list


In [5]:
df_relabeled = Decontaminate_Labels(df_list)

In [6]:
df_relabeled[0].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Inorganic Chemical,MCL,DLR,,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Aluminum,1,0.05,0.6,2001,--,--
3,Antimony,0.006,0.006,0.001,2016,0.006,0.006
4,Arsenic,0.010,0.002,4e-06,2004,0.010,zero


Just looking at the first table, there are issues with the first two rows: they are not actually data, but continued headers. These will be removed in a later function. 
The other issue, visible in the Aluminum row, is that missing values are written as '--' rather than NaN, which occurs throughout the document. 

## Decontaminate_Nulls
This next function changes all '--' values in each table to a NumPy NaN.

In [7]:
df_relabeled[0].replace('--', np.nan, inplace=True)

In [8]:

def Decontaminate_Nulls(df_list):
    # Replace the '--' placeholder with NaN in every table, in place
    for df in df_list:
        df.replace('--', np.nan, inplace=True)
    return df_list


In [9]:
df_nulled = Decontaminate_Nulls(df_relabeled)

In [10]:
df_nulled[0].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Inorganic Chemical,MCL,DLR,,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Aluminum,1,0.05,0.6,2001,,
3,Antimony,0.006,0.006,0.001,2016,0.006,0.006
4,Arsenic,0.010,0.002,4e-06,2004,0.010,zero


As can be seen, the null placeholders have been replaced, and this was also checked in several of the other tables. 
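
Rather than spot-checking a few tables, the placeholder can be counted in every table. This is a minimal sketch on toy data; `count_placeholder` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def count_placeholder(tables, placeholder="--"):
    """Count the cells equal to `placeholder` in each table."""
    return [int((table == placeholder).sum().sum()) for table in tables]

# Toy tables modelled on the first rows shown above
toy_tables = [
    pd.DataFrame({"Contaminant": ["Aluminum", "Antimony"],
                  "Federal_MCL": ["--", "0.006"],
                  "Federal_MCLG": ["--", "0.006"]}),
    pd.DataFrame({"Contaminant": ["Arsenic"],
                  "Federal_MCL": ["0.010"],
                  "Federal_MCLG": ["zero"]}),
]

before = count_placeholder(toy_tables)
cleaned = [table.replace("--", np.nan) for table in toy_tables]
after = count_placeholder(cleaned)
print(before, after)
```

Applied to the real list, every count should be zero after the null-replacement step.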

The next issue that needs to be addressed is the presence of the leftover subheading rows.

## Decontaminate Rows
Looking through each of the tables in the original documentation and in the extracted data, the rows that do not name a contaminant have one of the following issues in the 'State_MCL' column:
- It will be NaN
- It will say 'MCL'
- It may say 'mrem/yr' in the case of radioactive material

Some of the known contaminants truly have a NaN value for the State_MCL. 
This raises two questions: 
- Is there an overarching Federal MCL that must already be met?
- Is this contaminant only a goal? 

If there is a Federal_MCL, the State_MCL will be set equal to the Federal_MCL, since it MUST be met. 
If there is NO Federal_MCL or State_MCL, the contaminant will not be included in this study. 
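
The two rules can be sketched with `fillna` and `dropna`. The rows below are made up purely to exercise both rules and are not values from the PDF.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: A has both limits, B only a federal limit, C has neither
toy = pd.DataFrame({
    "Contaminant": ["A", "B", "C"],
    "State_MCL": [0.005, np.nan, np.nan],
    "Federal_MCL": [0.010, 0.002, np.nan],
})

# Rule 1: where the state limit is missing, fall back to the federal limit
toy["State_MCL"] = toy["State_MCL"].fillna(toy["Federal_MCL"])

# Rule 2: drop contaminants that still have no limit at all
in_scope = toy.dropna(subset=["State_MCL"])
print(in_scope)
```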

In [11]:
df_nulled[3].head()

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG
0,Radionuclides,MCL,DLR,PHG,Date of,MCL,MCLG
1,Contaminant,,,,PHG,,
2,Gross alpha particle,15,3,none,,15,zero
3,activity - OEHHA,,,,,,
4,concluded in 2003 that,,,,,,


As the table above illustrates, in every table a missing State_MCL was never accompanied by a Federal_MCL, although the reverse (a State_MCL with no Federal_MCL) did occur. Because this held for all 14 categories, I was able to drop every row with a null State_MCL, since such a row is either a leftover header or a contaminant outside the scope of the project. 
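
This observation can also be checked programmatically instead of by eye. The sketch below assumes the standardized column names; `federal_without_state` is a hypothetical helper and the tables are toy data.

```python
import numpy as np
import pandas as pd

def federal_without_state(tables):
    """Return (table position, row label) pairs with a Federal_MCL but no State_MCL."""
    hits = []
    for i, table in enumerate(tables):
        mask = table["State_MCL"].isna() & table["Federal_MCL"].notna()
        hits.extend((i, idx) for idx in table.index[mask.to_numpy()])
    return hits

# Shaped like the radionuclide table: header rows and wrapped text rows
toy_ok = pd.DataFrame({"State_MCL": ["MCL", np.nan, "15", np.nan],
                       "Federal_MCL": ["MCL", np.nan, "15", np.nan]})
# Hypothetical counter-example, included only to show that the check fires
toy_bad = pd.DataFrame({"State_MCL": [np.nan], "Federal_MCL": ["0.010"]})

print(federal_without_state([toy_ok]))
print(federal_without_state([toy_ok, toy_bad]))
```

An empty result for the real list would support dropping rows on State_MCL alone.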

### Units
One thing missing from this table is the unit of measure for these limits. Because the reporting is in different units at different sites, it will be important to convert the reported values to units that can be compared to the regulations. 
The default, as specified in the documentation, is mg/L unless otherwise specified, so this function adds a 'Units' column set to 'mg/L' for every row. 

Values that are specified in other units will be adjusted on a case-by-case basis in Decontaminate_Values.
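
For mass concentrations, the later comparison against mg/L limits comes down to a lookup of conversion factors. The sketch below is an assumption about how that step might look; `TO_MG_PER_L` and `to_mg_per_l` are hypothetical names. Units such as pCi/L or MFL are not mass concentrations and cannot be converted with a simple factor, so they raise an error here.

```python
# Conversion factors from common mass-concentration units to mg/L
TO_MG_PER_L = {"mg/L": 1.0, "ug/L": 1e-3, "ng/L": 1e-6}

def to_mg_per_l(value, unit):
    """Convert a mass concentration to mg/L; reject units with no mass basis."""
    try:
        return value * TO_MG_PER_L[unit]
    except KeyError:
        raise ValueError(f"cannot convert {unit!r} to mg/L") from None

# A value reported as 30 ug/L, expressed in mg/L
print(to_mg_per_l(30, "ug/L"))
```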

In [12]:
def Decontaminate_Rows(df_list):
    for n in range(len(df_list)):
        # Rows without a State_MCL are headers or out-of-scope contaminants
        df = df_list[n].dropna(subset=['State_MCL'])
        # Remove the leftover sub-header rows; copy to avoid SettingWithCopyWarning
        df = df.loc[~df.State_MCL.isin(['MCL', 'mrem/yr'])].copy()
        df['Units'] = 'mg/L'  # default unit, adjusted per contaminant later
        df_list[n] = df
    return df_list

In [13]:
df_sub_headless = Decontaminate_Rows(df_nulled)

In [14]:
df_sub_headless[4]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Federal\rMCLG,Units
0,Strontium-90,8,2,0.35,,2006,,,mg/L
1,Tritium,"""20,000""","""1,000""",400.0,,2006,,,mg/L
2,Uranium,20,1,0.43,,2001,30 μg/L,zero,mg/L


### Column Discrepancies
There were discrepancies in the number of columns in three of the tables: 4, 7, and 11. In each of these, the values were shifted out of one of the earlier columns, which left that column empty, put the later values under the wrong headings, and added an extra Federal\rMCLG column at the end. 


## Decontaminate Lists
Each of these tables was examined, paying attention to the PHG_Date and any NaN values, the best indicators of where the shifts occurred. Each table was adjusted by dropping the column that contained only NaN values, and then using a dictionary to rename the other columns to the appropriate names. 
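
The shifted tables can also be located programmatically: they have more columns than expected, and the column the values were shifted out of is entirely NaN. In this sketch, `find_shifted` and `EXPECTED_COLS` are hypothetical, and the toy rows are simplified from tables shown in this notebook.

```python
import numpy as np
import pandas as pd

EXPECTED_COLS = 8  # assumption: seven standardized columns plus 'Units'

def find_shifted(tables, expected_cols=EXPECTED_COLS):
    """Map table position to its all-NaN columns, for tables with extra columns."""
    shifted = {}
    for i, table in enumerate(tables):
        if table.shape[1] > expected_cols:
            shifted[i] = [c for c in table.columns if table[c].isna().all()]
    return shifted

toy_normal = pd.DataFrame({
    "Contaminant": ["Simazine"], "State_MCL": ["0.004"], "State_DLR": ["0.001"],
    "State_PHG": ["0.004"], "PHG_Date": ["2001"], "Federal_MCL": ["0.004"],
    "Federal_MCLG": ["0.004"], "Units": ["mg/L"],
})
toy_shifted = pd.DataFrame({
    "Contaminant": ["Strontium-90", "Tritium", "Uranium"],
    "State_MCL": ["8", "20,000", "20"], "State_DLR": ["2", "1,000", "1"],
    "State_PHG": ["0.35", "400", "0.43"], "PHG_Date": [np.nan, np.nan, np.nan],
    "Federal_MCL": ["2006", "2006", "2001"],
    "Federal_MCLG": [np.nan, np.nan, "30 μg/L"],
    "Federal\rMCLG": [np.nan, np.nan, "zero"], "Units": ["mg/L"] * 3,
})

shifted = find_shifted([toy_normal, toy_shifted])
print(shifted)
```

The all-NaN column marks where the shift began, which is the column the function above drops before renaming.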

In [15]:

def Decontaminate_Lists(df_list):
    for n in range(len(df_list)):
        if n == 4:
            df_list[n].drop(columns='PHG_Date', inplace=True)
            df_list[n].rename(columns={'Federal_MCL': 'PHG_Date',
                                       'Federal_MCLG': 'Federal_MCL',
                                       'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
        elif n == 7:
            df_list[n].drop(columns='State_DLR', inplace=True)
            df_list[n].rename(columns={'State_PHG': 'State_DLR',
                                       'PHG_Date': 'State_PHG',
                                       'Federal_MCL': 'PHG_Date',
                                       'Federal_MCLG': 'Federal_MCL',
                                       'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
        elif n == 11:
            df_list[n].drop(columns='Federal_MCLG', inplace=True)
            df_list[n].rename(
                columns={'Federal\rMCLG': 'Federal_MCLG'}, inplace=True)
    return df_list

In [16]:
df_decon_lists = Decontaminate_Lists(df_sub_headless)

In [83]:
df_decon_lists[3]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
2,Gross Alpha Particle,15,3.0,none,,15,0.0,pCi/L
7,Gross Beta Particle,4,4.0,none,,4,0.0,pCi/L
14,Radium-226 + Radium-228,5,,,,5,0.0,pCi/L


Table 4, displayed earlier after Decontaminate_Rows, shows some other issues that need to be addressed: 
- Tritium's values are strings, not numeric
- Uranium's Federal_MCL is in ug/L, which the Units column should reflect
- Uranium's federal goal should be the numeric value zero, not the string 'zero'

Each table, upon investigation, has its own issues that need to be addressed.
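
Some of these fixes follow a pattern that could be handled generically before the table-specific edits. The sketch below is an assumption, not the approach used in this notebook: `clean_numeric` is a hypothetical helper that strips quotes and thousands separators and maps 'zero' to 0.

```python
import numpy as np
import pandas as pd

def clean_numeric(series):
    """Parse strings such as '20,000' or 'zero' as numbers; unparsable values become NaN."""
    text = series.astype(str).str.replace(",", "", regex=False).str.strip('" ')
    text = text.replace({"zero": "0"})
    return pd.to_numeric(text, errors="coerce")

# Toy column mixing the string forms seen in the extracted tables
raw = pd.Series(["8", '"20,000"', "zero", np.nan], dtype=object)
cleaned_values = clean_numeric(raw)
print(cleaned_values)
```

Values that carry a unit, such as '7 MFL' or '30 μg/L', would become NaN with this helper, so they still need the manual treatment described next.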

## Decontaminate Values
This function addresses the issues that are specific to each list, and cannot be standardized across the rest of the documentation. 

Tables that needed no modifications, such as 12 and 14, are not included here.

In [84]:

def Decontaminate_Values(df_list):
    import numpy as np
    for n in range(len(df_list)):
        if n == 0:
            # Change the string 'zero' to the number 0
            df_list[n].loc[4, ["Federal_MCLG"]] = [0]
            # Strip the 'MFL' unit from each value and record it in Units instead
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "PHG_Date", "Federal_MCL", "Federal_MCLG", "Units"]] = [
                'Asbestos', 7.0, 0.2, 7.0, 2003, 7.0, 7.0, 'MFL']
            # Shorten the long name to 'Chromium, Total' and change 'withdrawn' to null
            df_list[n].loc[12, ["Contaminant", "State_PHG"]] = ['Chromium, Total', np.nan]
        elif n == 1:
            df_list[n].loc[3, ["Contaminant", "PHG_Date"]] = ['Mercury', 2005]
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrate', 10, 45, '10 as N mg/L']
            df_list[n].loc[6, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrite', 1, 1, '1 as N mg/L']
            df_list[n].loc[7, ["Contaminant", "State_MCL", "State_PHG", "Units"]] = [
                'Nitrate + Nitrite', 10, 10, '10 as N mg/L']
            df_list[n].loc[10, ["PHG_Date"]] = [2004]
        elif n == 2:
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
        elif n == 3:
            df_list[n].loc[2, ["Contaminant", "State_PHG", "PHG_Date", "Federal_MCLG", "Units"]] = [
                "Gross Alpha Particle", np.nan, np.nan, 0.0, 'pCi/L']
            df_list[n].loc[7, ["Contaminant", "State_PHG", "PHG_Date", "Federal_MCLG", "Units"]] = [
                "Gross Beta Particle", np.nan, np.nan, 0.0, 'pCi/L']
            df_list[n].loc[14, ["Contaminant", "Federal_MCLG", "Units"]] = [
                'Radium-226 + Radium-228', 0.0, 'pCi/L']
        elif n == 4:
            df_list[n].loc[0, ["Units"]] = ['pCi/L']
            df_list[n].loc[1, ["State_MCL", "State_DLR", "State_PHG", "Units"]] = [20000, 1000, 400, 'pCi/L']
            df_list[n].loc[2, ["Federal_MCL", "Federal_MCLG", 'Units']] = [
                30, 0.0, 'pCi/L (ug/L for Federal_MCL)']
        elif n == 5:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[1, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date"]] = [2009]
            df_list[n].loc[3, ["Contaminant"]] = ['1,4-Dichlorobenzene(p-DCB)']
            df_list[n].loc[4, ["Contaminant"]] = ['1,1-Dichloroethane (1,1-DCA)']
            df_list[n].loc[5, ["Contaminant", "PHG_Date", "Federal_MCLG"]] = [
                '1,2-Dichloroethane (1,2-DCA)', 2005, 0.0]
            df_list[n].loc[6, ["Contaminant"]] = [
                '1,1-Dichloroethylene (1,1-DCE)']
        elif n == 6:
            df_list[n].loc[1, ["Contaminant"]] = ['trans-1,2-Dichloroethylene']
            df_list[n].loc[2, ["Contaminant", "Federal_MCLG"]] = [
                'Dichloromethane (Methylene chloride)', 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["PHG_Date"]] = [2006]
            df_list[n].loc[6, ["Contaminant"]] = [
                'Methyl tertiary butyl ether (MTBE)']
            df_list[n].loc[9, ["Contaminant"]] = ['1,1,2,2-Tetrachloroethane']
            df_list[n].loc[10, ["Contaminant", "Federal_MCLG"]] = [
                'Tetrachloroethylene (PCE)', 0.0]
        elif n == 7:
            df_list[n].loc[0, ["Contaminant"]] = [
                '1,1,1-Trichloroethane (1,1,1-TCA)']
            df_list[n].loc[1, ["Contaminant"]] = [
                '1,1,2-Trichloroethane (1,1,2-TCA)']
            df_list[n].loc[2, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[3, ["Contaminant"]] = [
                'Trichlorofluoromethane (Freon 11)']
            df_list[n].loc[4, ["Contaminant", "PHG_Date"]] = [
                '1,1,2-Trichloro-1,2,2-Trifluoroethane (Freon 113)', 2011]
            df_list[n].loc[5, ["Federal_MCLG"]] = [0.0]
        elif n == 8:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date"]] = [2009]
        elif n == 9:
            df_list[n].loc[0, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[2, ["PHG_Date", "Federal_MCLG"]] = [2006, 0.0]
            df_list[n].loc[3, ["PHG_Date"]] = [2009]
            df_list[n].loc[4, ["Contaminant", "Federal_MCLG"]] = [
                '1,2-Dibromo-3-chloropropane (DBCP)', 0.0]
            df_list[n].loc[5, ["Contaminant"]] = [
                '2,4-Dichlorophenoxyacetic acid (2,4-D)']
            df_list[n].loc[6, ["Contaminant"]] = ['Di(2-ethylhexyl)adipate']
            df_list[n].loc[7, ["Contaminant", "Federal_MCLG"]] = [
                'Di(2-ethylhexyl)phthalate (DEHP)', 0.0]
            df_list[n].loc[8, ["PHG_Date"]] = [2010]
        elif n == 10:
            df_list[n].loc[1, ["Contaminant", "Federal_MCL", "Federal_MCLG"]] = [
                'Ethylene dibromide (EDB)', 0.00005, 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[5, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[6, ["Contaminant"]] = ['Hexachlorocyclopentadiene']
            df_list[n].loc[7, ["PHG_Date"]] = [2005]
            df_list[n].loc[11, ["Federal_MCLG"]] = [0.0]
        elif n == 11:
            df_list[n].loc[0, ["Contaminant", "Federal_MCLG"]] = [
                'Polychlorinated biphenyls (PCBs)', 0.0]
            df_list[n].loc[3, ["Federal_MCLG"]] = [0.0]
            df_list[n].loc[4, ["Contaminant", "State_MCL", "State_DLR"]] = [
                '1,2,3-Trichloropropane', 0.000005, 0.000005]
            df_list[n].loc[5, ["Contaminant", "State_MCL", "State_DLR", "State_PHG", "Federal_MCL",
                               "Federal_MCLG"]] = ['2,3,7,8-TCDD (dioxin)', 3.0e-8, 5.0e-9, 5.0e-11, 3.0e-8, 0.0]
        elif n == 13:
            df_list[n].loc[2, ["Contaminant"]] = [
                'Haloacetic Acids (five) (HAA5)']
            df_list[n].loc[8, ["State_DLR", "Federal_MCLG"]] = [0.0050, 0.0]
    return df_list



In [85]:
df_values = Decontaminate_Values(df_decon_lists)

In [93]:
df_values[11]

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Polychlorinated biphenyls (PCBs),0.0005,0.0005,9e-05,2007,0.0005,0.0,mg/L
1,Simazine,0.004,0.001,0.004,2001,0.004,0.004,mg/L
2,Thiobencarb,0.07,0.001,0.042,2016,,,mg/L
3,Toxaphene,0.003,0.001,3e-05,2003,0.003,0.0,mg/L
4,"1,2,3-Trichloropropane",5e-06,5e-06,7e-07,2009,,,mg/L
5,"2,3,7,8-TCDD (dioxin)",0.0,0.0,0.0,2010,0.0,0.0,mg/L
6,"2,4,5-TP (Silvex)",0.05,0.001,0.003,2014,0.05,0.05,mg/L


In [94]:
df_values[11].dtypes

Contaminant     object
State_MCL       object
State_DLR       object
State_PHG       object
PHG_Date         int64
Federal_MCL     object
Federal_MCLG    object
Units           object
dtype: object

## Decontaminate_Datatypes
The previous function fixed the issues within each of the tables: adjusting the units, changing any additional missing or unknown values to NaN, and changing strings to numerical values where appropriate. The value columns are still stored with the object dtype, however, so this function converts them to numeric.
It will be applied to the full DataFrame after the individual tables have been concatenated. 


In [106]:
def Decontaminate_Datatypes(df):
    # Convert the limit and goal columns from object to numeric dtype
    numeric_cols = ["State_MCL", "State_DLR", "State_PHG",
                    "Federal_MCL", "Federal_MCLG"]
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col])
    return df
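
By default `pd.to_numeric` raises a `ValueError` on the first value it cannot parse, so it can help to list the offending entries beforehand. A minimal sketch, with `unconvertible` as a hypothetical helper and toy values:

```python
import pandas as pd

def unconvertible(series):
    """Return the non-null entries of `series` that pd.to_numeric cannot parse."""
    converted = pd.to_numeric(series, errors="coerce")
    return series[converted.isna() & series.notna()]

# Toy column: two clean values, one missing, two that still need manual fixes
toy = pd.Series(["0.006", "zero", None, 7.0, "7 MFL"], dtype=object)
leftovers = unconvertible(toy)
print(leftovers)
```

An empty result for each column would suggest the conversion is safe to run.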

## Decontaminate
The final function calls each of the subroutines in the order followed above. It takes the filename of the PDF, so the file no longer has to be read in a separate step, and it concatenates all of the tables into a single DataFrame. 

In [107]:
def Decontaminate(filename):
    # Extract every table from the PDF, then run the cleaning steps in order
    df_list = read_pdf(filename, pages='all')
    df_list = Decontaminate_Labels(df_list)
    df_list = Decontaminate_Nulls(df_list)
    df_list = Decontaminate_Rows(df_list)
    df_list = Decontaminate_Lists(df_list)
    df_list = Decontaminate_Values(df_list)
    # Combine the per-category tables and convert the value columns to numeric
    df = pd.concat(df_list, ignore_index=True)
    return Decontaminate_Datatypes(df)

In [108]:
decon = Decontaminate(filename)

In [109]:
decon

Unnamed: 0,Contaminant,State_MCL,State_DLR,State_PHG,PHG_Date,Federal_MCL,Federal_MCLG,Units
0,Aluminum,1.000,0.050,0.600000,2001,,,mg/L
1,Antimony,0.006,0.006,0.001000,2016,0.006,0.006,mg/L
2,Arsenic,0.010,0.002,0.000004,2004,0.010,0.000,mg/L
3,Asbestos,7.000,0.200,7.000000,2003,7.000,7.000,MFL
4,Barium,1.000,0.100,2.000000,2003,2.000,2.000,mg/L
...,...,...,...,...,...,...,...,...
87,"2,4,5-TP (Silvex)",0.050,0.001,0.003000,2014,0.050,0.050,mg/L
88,Total Trihalomethanes,0.080,,,,0.080,,mg/L
89,Haloacetic Acids (five) (HAA5),0.060,,,,0.060,,mg/L
90,Bromate,0.010,0.005,0.000100,2009,0.010,0.000,mg/L


In [111]:
decon.dtypes

Contaminant      object
State_MCL       float64
State_DLR       float64
State_PHG       float64
PHG_Date         object
Federal_MCL     float64
Federal_MCLG    float64
Units            object
dtype: object