Introduction
The study site is a 154-acre active landfill in New England. Long-term monitoring of groundwater and surface water has been conducted over the past twenty years in accordance with an EPA-approved monitoring and remediation plan. CERCLA requires that sites with extended remediation plans submit a full status report to the EPA every five years. To show the effectiveness of the applied remedial techniques, time series analyses have been performed every five years on the sampling data collected since sampling began. Currently, the analyses are conducted using the software program Carstat. Carstat is a “black box” program in which users input data and export plots without seeing any intermediate steps. This program has been used for the majority of the project life, and the code behind its statistical analyses is unknown. Therefore, the goal of this code is to recreate the “black box”, enabling future reproducibility of the data analyses. This code is referred to as White Box.

For this class, only the past 10 years of data have been used for analysis, as this data is the most complete in terms of the sampling locations and the sampling parameters.


Section 1. What does the data describe? What are the important fields?

Contaminant level data was collected from predetermined locations during quarterly sampling rounds from 2010 to the present (2019). Laboratory testing was performed on each sample to obtain results for 53 volatile organic compounds (VOCs), 43 semi-volatile organic compounds (SVOCs), and the following metals: arsenic, beryllium, cadmium, chromium, cyanide, iron, lead, manganese, nickel, nitrates and vanadium. Results from these sampling rounds were stored in a SQL-based proprietary relational database.

Analytical results from the quarterly groundwater sampling rounds were submitted as tables to the appropriate Federal and State regulatory agencies. After five years, the EPA required a time series analysis over five years at a 95% confidence level to assess any statistically significant changes in the levels of the identified contaminants of concern (COCs) at sampling locations that showed elevated levels of a particular COC. These COCs were given action levels by the EPA as indicators of where further remedial intervention was needed.

The final dataset included 116 sampling locations, divided by quarter years (Jan-March, April-June, July-Sept, Oct-Dec). Each location was tested for 106 constituents, resulting in a dataset of approximately 140,000 lines. Any chemical that did not have an action level established by the EPA was given an action level of 0. Not all of these sampling locations are subject to EPA reporting requirements (discussed in Section 4).
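
The quarter label stored with each result (for example 2010-Q1 in the SAMPLE_DATE_yyyy-Qq column) can be derived from the sample date. The following is a minimal sketch, assuming pandas and made-up dates; the helper name quarter_label is hypothetical and is not part of the original notebook.

```python
import pandas as pd

def quarter_label(dates):
    """Return labels such as '2010-Q1' for an iterable of date strings."""
    dt = pd.to_datetime(pd.Series(list(dates)))
    return (dt.dt.year.astype(str) + "-Q" + dt.dt.quarter.astype(str)).tolist()

# Illustrative dates only, one per calendar quarter
labels = quarter_label(["2010-02-03", "2010-05-14", "2010-08-30", "2010-11-02"])
print(labels)
```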

Section 2. Who collected it?

The samples were collected by trained environmental scientists/engineers/geologists in laboratory-provided containers. The samples were delivered to a qualified laboratory for testing. Results were returned as PDFs, Excel files, and most recently, as an electronic data deliverable (EDD).

Section 3. How was the data collected? Does the data describe a sample or a population?

The samples were collected using standard practice field techniques for groundwater/surface water sampling. Most often, low-flow techniques were used. Water was purged from a well through a YSI meter, which measures a combination of dissolved oxygen, conductivity, specific conductance, resistivity, total dissolved solids, pH, ORP, temperature and turbidity. These parameters were monitored until the well was stabilized, meaning all the mentioned parameters fell within specific ranges over a set period of time. When the location was considered stabilized, it was sampled using laboratory-provided containers. The samples were delivered the same day, in iced coolers, to the laboratory for testing.
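
The stabilization check described above can be sketched in code. This is only an illustration: the rule used here (the last three readings within 10% of their mean) and the function name is_stabilized are assumptions, not the criteria from the site's sampling plan, and the readings are made up.

```python
def is_stabilized(readings, tolerance_pct=10.0, n_consecutive=3):
    """True if the last n readings all lie within tolerance_pct of their mean.

    tolerance_pct and n_consecutive are hypothetical defaults.
    """
    if len(readings) < n_consecutive:
        return False
    window = readings[-n_consecutive:]
    mean = sum(window) / n_consecutive
    if mean == 0:
        return all(r == 0 for r in window)
    return all(abs(r - mean) / abs(mean) * 100.0 <= tolerance_pct for r in window)

# Illustrative turbidity readings (NTU), not site data
turbidity = [48.0, 21.0, 9.5, 9.1, 9.3]
stable_now = is_stabilized(turbidity)        # last three readings are close together
stable_early = is_stabilized(turbidity[:3])  # readings are still dropping quickly
print(stable_now, stable_early)
```

In practice each monitored parameter would have its own tolerance, so a well would count as stabilized only when every parameter passes its own check.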

The locations of the wells and surface water sampling points were determined from the inferred groundwater flow on the site and from known sources of contamination. In general, the sampling locations are either downgradient of the source zones (also called "hot zones") or around the perimeter of the landfill to monitor any offsite migration. While a particular sampling location may be considered a sample of all the water on the site, the number and location of the sampling points are fairly representative of the "population" of all naturally occurring water at the site.

Section 4. Describe the process to load, clean, and prepare the data.

The general structure of this notebook is as follows:
1. Import data
2. Exclude non-COC wells
3. Run time series analysis
4. Add trend lines
5. Identify statistically significant changes
6. Visualizations

In [3]:
#Import data and needed packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Load Dataset with SAMPLE DATE as a date index
CLF_original = pd.read_csv('../dataset/CSC593dataset_original.csv', index_col="SAMPLEDATE", parse_dates=["SAMPLEDATE"])

CLF_original['jdate'] = CLF_original.index.to_julian_date()

In [4]:
#Check if dataset has loaded
CLF_original.head()

SAMPLEDATE,CAS_RN,CHEMICAL_NAME,AL_RESULT_VALUE,ACTION_LEVEL_UNIT,SYS_LOC_CODE,SYS_SAMPLE_CODE,SAMPLE_ID,SAMPLE_DATE_yyyy-Qq,SAMPLE_TYPE_CODE,MATRIX_CODE,...,REPORT_RESULT_VALUE,REPORT_RESULT_UNIT,REPORT_RESULT_LIMIT,REPORTABLE_RESULT,DETECT_FLAG,LAB_QUALIFIERS,REPORTING_DETECTION_LIMIT,DETECTION_LIMIT_UNIT,METHOD_ANALYTE_GROUP,jdate
2010-02-03,100-02-7,4-Nitrophenol,51.0,ug/L,NELF15B,NELF15B020310,2066285,2010-Q1,N,GW,...,51.0,ug/L,51.0,Y,N,R,51.0,ug/L,2019 CLF SVOCs,2455230.5
2010-02-03,100-41-4,Ethylbenzene,1.0,ug/L,NELF15B,NELF15B020310,2066285,2010-Q1,N,GW,...,1.0,ug/L,1.0,Y,N,R,1.0,ug/L,2019 CLF VOCs,2455230.5
2010-02-03,100-42-5,Styrene,1.0,ug/L,NELF15B,NELF15B020310,2066285,2010-Q1,N,GW,...,1.0,ug/L,1.0,Y,N,R,1.0,ug/L,2019 CLF VOCs,2455230.5
2010-02-03,10061-01-5,"cis-1,3-Dichloropropene",0.5,ug/L,NELF15B,NELF15B020310,2066285,2010-Q1,N,GW,...,0.5,ug/L,0.5,Y,N,,0.5,ug/L,2019 CLF VOCs,2455230.5
2010-02-03,10061-02-6,"trans-1,3-Dichloropropene",0.5,ug/L,NELF15B,NELF15B020310,2066285,2010-Q1,N,GW,...,0.5,ug/L,0.5,Y,N,,0.5,ug/L,2019 CLF VOCs,2455230.5


In [5]:
#See shape and context of data
CLF_original.shape

(141645, 25)

In [6]:
#Number of sampling locations
CLF_original.SYS_LOC_CODE.nunique()

116

Subsetting the Data

As mentioned above, SYS_LOC_CODE refers to the sampling location. However, we don't need all 116 locations for the analysis, as the EPA only identified 46 locations with COCs. The first step in cleaning the data is to subset it based on the COC list (located in the dataset folder).

In [7]:
#Import Contaminant of Concern (COC) list
COC_List = pd.read_csv('../dataset/COC_List.csv')
#check
COC_List.head()

Unnamed: 0,SYS_LOC_CODE,Constituent1,Constituent2,Constituent3,Constituent4,Constituent5,Constituent6,Constituent7
0,SW4,arsenic,beryllium,chlorobenzene,manganese,nitrate as N,,
1,SW4EFF,nitrate as N,,,,,,
2,SW5,arsenic,iron,lead,manganese,nitrate as N,,
3,SW7,lead,manganese,nitrate as N,,,,
4,SW107,beryllium,lead,manganese,nitrate as N,,,


In [8]:
#Make COC_Loc Column 1 into a list to filter against main dataset.
COC_Loc = COC_List["SYS_LOC_CODE"].tolist()
#check
print(COC_Loc)

['SW4', 'SW4EFF', 'SW5', 'SW7', 'SW107', 'MW03ML11', 'MW03ML12A', 'MW03ML12B', 'MW03ML12C', 'MW03ML12D', 'MW03ML12E', 'MW1457S', 'MW1458S', 'MW1460A', 'MW1561A', 'MW1561B', 'MW14ML15A', 'MW14ML15B', 'MW14ML15C', 'MW14ML15D', 'MW14ML16A', 'MW14ML16B', 'MW14ML16C', 'MW14ML16D', 'MW14ML17A', 'MW14ML17B', 'MW14ML17C', 'MW14ML17D', 'MW14ML18A', 'MW14ML18B', 'MW14ML18C', 'MW14ML18D', 'MW14ML19A', 'MW14ML19B', 'MW14ML19C', 'MW14ML19D', 'MW15ML20A', 'MW15ML20B', 'WE87ML5A', 'MWP51AR', 'MWP51B', 'MWP52A', 'MWP52B', 'MWP53A', 'MWP53B', 'WSLCS']


In [9]:
#Filter data based on COC locations
CLF = CLF_original[CLF_original['SYS_LOC_CODE'].isin(COC_Loc)]
#Check number of unique location rows
CLF.SYS_LOC_CODE.nunique()

46

In [18]:
#Preview the filtered data (CLF.shape gives the row count)
CLF.head()

SAMPLEDATE,CAS_RN,CHEMICAL_NAME,AL_RESULT_VALUE,ACTION_LEVEL_UNIT,SYS_LOC_CODE,SYS_SAMPLE_CODE,SAMPLE_ID,SAMPLE_DATE_yyyy-Qq,SAMPLE_TYPE_CODE,MATRIX_CODE,...,REPORT_RESULT_VALUE,REPORT_RESULT_UNIT,REPORT_RESULT_LIMIT,REPORTABLE_RESULT,DETECT_FLAG,LAB_QUALIFIERS,REPORTING_DETECTION_LIMIT,DETECTION_LIMIT_UNIT,METHOD_ANALYTE_GROUP,jdate
2010-02-05,100-02-7,4-Nitrophenol,53.0,ug/L,SW7,SW7020510,2067131,2010-Q1,N,SW,...,53.0,ug/L,53.0,Y,N,R,53.0,ug/L,2019 CLF SVOCs,2455232.5
2010-02-05,100-41-4,Ethylbenzene,1.0,ug/L,SW7,SW7020510,2067131,2010-Q1,N,SW,...,1.0,ug/L,1.0,Y,N,R,1.0,ug/L,2019 CLF VOCs,2455232.5
2010-02-05,100-42-5,Styrene,1.0,ug/L,SW7,SW7020510,2067131,2010-Q1,N,SW,...,1.0,ug/L,1.0,Y,N,R,1.0,ug/L,2019 CLF VOCs,2455232.5
2010-02-05,10061-01-5,"cis-1,3-Dichloropropene",0.5,ug/L,SW7,SW7020510,2067131,2010-Q1,N,SW,...,0.5,ug/L,0.5,Y,N,,0.5,ug/L,2019 CLF VOCs,2455232.5
2010-02-05,10061-02-6,"trans-1,3-Dichloropropene",0.5,ug/L,SW7,SW7020510,2067131,2010-Q1,N,SW,...,0.5,ug/L,0.5,Y,N,,0.5,ug/L,2019 CLF VOCs,2455232.5


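The jdate column holds the Julian date: a continuous count of days since noon UTC on January 1, 4713 BC. The value 2455230.5 therefore corresponds to midnight at the start of 2010-02-03, and the column gives the regression a numeric time axis.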
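
As a quick check on the jdate format, the sketch below converts the first sample date from the table above with pandas' to_julian_date and confirms that consecutive calendar days differ by exactly 1.0. The variable names are illustrative.

```python
import pandas as pd

# First sample date shown in the table above
idx = pd.DatetimeIndex(["2010-02-03"])
jd = idx.to_julian_date()[0]
print(jd)  # 2455230.5, matching the jdate column

# Consecutive calendar days are one Julian day apart
step = (pd.Timestamp("2010-02-04").to_julian_date()
        - pd.Timestamp("2010-02-03").to_julian_date())
print(step)
```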

The subset appears to have been successful: the dataframe has been reduced from 116 sampling locations to the required 46, and from 141,645 rows to 77,872 rows.

Next, the dataset can be further filtered based on the identified COCs. Even though each sample is tested for over 100 parameters, the EPA is only interested in 15 constituents. An object is created below containing the parameters as they appear in the original dataset. Note that not all of the parameters are needed for each sampling location.

In [11]:
#Creation of EPA COC parameter list
COC_Param = ["1,1-Dichloroethane", "1,2-Dichlorobenzene", "1,4-Dichlorobenzene", "Arsenic", "Benzene", "Beryllium", "Chlorobenzene", "cis-1,2-Dichloroethene", "Iron", "Lead", "Manganese", "Nickel", "NITRATE/NITRITE AS N", "Trichloroethene (TCE)", "Vinyl chloride"]
len(COC_Param)

15

In [12]:
#COC_Param list is applied to the dataset to filter down constituents.
CLF2 = CLF[CLF['CHEMICAL_NAME'].isin(COC_Param)]
#check
CLF2.CHEMICAL_NAME.nunique()

15

In [13]:
#check
CLF2.shape

(13866, 25)
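
Because not every parameter is needed at every location, a further step could keep only the (location, constituent) pairs named in the COC list. The sketch below shows one way to do that with toy data: reshape the wide list to long form with melt, then use an inner merge. The name_map dictionary is an assumption, needed because the COC list spells constituents differently from CHEMICAL_NAME (for example "nitrate as N" versus "NITRATE/NITRITE AS N"), and the result values are made up.

```python
import pandas as pd

# Toy stand-in for COC_List.csv (two locations from the list above)
coc_wide = pd.DataFrame({
    "SYS_LOC_CODE": ["SW4EFF", "SW7"],
    "Constituent1": ["nitrate as N", "lead"],
    "Constituent2": [None, "manganese"],
    "Constituent3": [None, "nitrate as N"],
})

# Hypothetical mapping from COC-list spelling to dataset CHEMICAL_NAME
name_map = {
    "nitrate as N": "NITRATE/NITRITE AS N",
    "lead": "Lead",
    "manganese": "Manganese",
}

# Wide -> long: one row per (location, constituent) pair
coc_long = (
    coc_wide.melt(id_vars="SYS_LOC_CODE", value_name="constituent")
    .dropna(subset=["constituent"])
    .assign(CHEMICAL_NAME=lambda d: d["constituent"].map(name_map))
    [["SYS_LOC_CODE", "CHEMICAL_NAME"]]
)

# Toy results table; values are illustrative only
results = pd.DataFrame({
    "SYS_LOC_CODE": ["SW4EFF", "SW4EFF", "SW7", "SW7"],
    "CHEMICAL_NAME": ["NITRATE/NITRITE AS N", "Lead", "Lead", "Iron"],
    "REPORT_RESULT_VALUE": [1.2, 0.4, 3.1, 250.0],
})

# Inner merge keeps only rows whose (location, chemical) pair is on the COC list
per_location = results.merge(coc_long, on=["SYS_LOC_CODE", "CHEMICAL_NAME"], how="inner")
print(per_location)
```

Note that merge discards the dataframe index, so with the real data the SAMPLEDATE index would need reset_index() before the merge and set_index("SAMPLEDATE") after it.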

In [20]:
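#Select one well location (mask built from CLF2, the dataframe being filtered)
mw03ml11 = CLF2[CLF2.SYS_LOC_CODE=='MW03ML11']
#Select one parameter at that location
mw03ml11benz = mw03ml11[mw03ml11.CHEMICAL_NAME=='Benzene']

An earlier version of this cell built the mask from CLF instead of CLF2 and raised:

ValueError: cannot reindex from a duplicate axis

The SAMPLEDATE index contains duplicate values because many results share a sample date. When a boolean mask comes from a different dataframe, pandas has to align it to the target by index label, and it cannot do that when labels repeat. Building the mask from CLF2 itself avoids the alignment.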
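
One way to sidestep index alignment on a duplicated date index is to build every condition from the dataframe being filtered and combine the conditions into a single mask. The sketch below uses a toy frame with made-up values; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with a duplicated date index, like the dataset (illustrative values)
dates = pd.to_datetime(["2010-02-03", "2010-02-03", "2010-05-04", "2010-05-04"])
df = pd.DataFrame(
    {
        "SYS_LOC_CODE": ["MW03ML11", "SW7", "MW03ML11", "MW03ML11"],
        "CHEMICAL_NAME": ["Benzene", "Benzene", "Benzene", "Lead"],
        "REPORT_RESULT_VALUE": [5.0, 1.0, 4.0, 2.0],
    },
    index=dates,
)

# Both conditions come from df itself, so the mask lines up row for row
mask = (df["SYS_LOC_CODE"] == "MW03ML11") & (df["CHEMICAL_NAME"] == "Benzene")
subset = df.loc[mask]
print(subset)
```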

In [None]:
import seaborn as sns
sns.set()
#Plot benzene results against Julian date (x and y passed as keywords)
sns.lineplot(x='jdate', y='REPORT_RESULT_VALUE', data=mw03ml11benz)
plt.show()

In [None]:
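#Regression Model
import statsmodels.api as sm
import scipy.stats as stats

#Keep results after 2013-01-01
recent = mw03ml11benz[mw03ml11benz.index > '2013-01-01']

y = recent.REPORT_RESULT_VALUE
#Julian date as the predictor, plus an intercept column named 'const'
x = sm.add_constant(recent['jdate'])

lm = sm.OLS(y, x).fit()
lm.summary()

#Slope (change in result per day) and p-values
lm.params['jdate']
lm.pvalues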
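
The slope and p-value from a regression like this could be turned into the trend categories used in this report. The sketch below is one possible rule, assuming a two-sided test on the slope at alpha = 0.05 (the 95% confidence level mentioned earlier); it is not Carstat's method, which is unknown. It uses scipy.stats.linregress on synthetic data, and the function name classify_trend is hypothetical.

```python
import numpy as np
from scipy import stats

def classify_trend(jdates, values, alpha=0.05):
    """Label a series by the sign and significance of its OLS slope.

    Assumed rule: p < alpha on the slope means 'Significantly ...'.
    A slope of exactly zero is labeled 'Decreasing' in this sketch.
    """
    fit = stats.linregress(jdates, values)
    direction = "Increasing" if fit.slope > 0 else "Decreasing"
    if fit.pvalue < alpha:
        return "Significantly " + direction
    return direction

# Synthetic quarterly series (illustrative, not site data)
rng = np.random.default_rng(0)
jd = 2455230.5 + 91.0 * np.arange(20)
falling = 10.0 - 0.4 * np.arange(20) + rng.normal(0, 0.2, 20)
rising = 1.0 + 0.5 * np.arange(20) + rng.normal(0, 0.2, 20)
falling_label = classify_trend(jd, falling)
rising_label = classify_trend(jd, rising)
print(falling_label, "|", rising_label)
```

Applying the same function to each location and constituent pair would give the counts per category that the report is meant to reproduce.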

Section 5. List three questions that can be answered using the data. For each question, provide at least one visualization, an explanation of the question, and a relevant table/summary statistic/output of a statistical model.

The three questions are as follows:
1. Was the code successful in recreating plots similar to the proprietary "black box" code?
2. Based on the COCs identified by the EPA at certain sampling locations, which locations are trending downward (i.e., which locations are being successfully remediated)?
3. Are there any locations with significant increasing or decreasing trends over the last five years of data, which may indicate extreme subsurface conditions or contaminant source locations?
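
The third question depends on isolating the last five years of data. A minimal sketch of that window, assuming a dataframe with a date index like the one used in this notebook; the helper name last_n_years and the toy values are illustrative.

```python
import pandas as pd

def last_n_years(df, years=5):
    """Keep rows whose date index falls within `years` of the latest sample date."""
    cutoff = df.index.max() - pd.DateOffset(years=years)
    return df[df.index > cutoff]

# Toy yearly series (illustrative values only)
idx = pd.to_datetime([f"{y}-02-03" for y in range(2010, 2020)])
toy = pd.DataFrame({"REPORT_RESULT_VALUE": range(10)}, index=idx)

recent_window = last_n_years(toy)
print(recent_window.index.min(), len(recent_window))
```

Anchoring the window to the latest sample date, rather than a fixed cutoff string, keeps the filter valid as new sampling rounds are added.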

The code was not successful in recreating the plots. Had it been successful, it would have identified 11 locations with Significantly Decreasing Trends and 23 locations with Significantly Increasing Trends. Twenty-seven locations would have shown a general Decreasing Trend and 21 a general, but not Significant, Increasing Trend.

The Significantly Increasing Trend locations would indicate where contaminants are actively moving in the subsurface and where the next targeted remediation should occur. Further analysis could include cross-referencing locations where previous remediation activities occurred to assess the effectiveness of those activities.
