
# BEE 4310/6310: Environmental Statistics and Learning  <br> Assignment #1 (10 pts)

**Include responses to everything below in bold (including plots), and make sure your final assignment is well organized in a single markdown PDF submitted to Canvas. This makes it easier to grade (and easier to give partial credit).**

**Remember to include an acknowledgement at the bottom of this assignment if generative AI was used for coding support, including a list of the problems for which it was used.**

<img src="Learning_Outcomes_1.png" width="1000"/>

**Techniques and Topics:** 
- Visual data exploration
- Maximum likelihood estimation
- AIC
- Q-Q plots
- CDFs and quantile functions
- Parametric and nonparametric bootstrapping

**Packages and functions covered in DataCamp exercises (note: not all will be needed in the problems below)** 

| numpy as np       | matplotlib.pyplot as plt  | pandas as pd  | scipy.stats | seaborn as sns |
| -----------       | ------------------------  | ------------  | ----------- | -------------- |
|np.array           | plt.plot                  | pd.DataFrame  | uniform.cdf | sns.scatterplot|
|np.mean            | plt.show                  | pd.read_csv   | uniform.rvs | sns.lmplot     |
|np.median          | plt.xscale                | df.iloc       | uniform.ppf |                |
|np.var             | plt.yscale                | df.loc        | binom.cdf   |                |
|np.std             | plt.scatter               | df.sample     | binom.rvs   |                |
|np.quantile        | plt.hist                  | df.iterrows   | binom.ppf   |                |
|np.random.rand     | plt.clf                   | df.apply      | norm.cdf    |                |
|np.random.randint  | plt.xlabel                | series.corr   | norm.rvs    |                |
|np.random.seed     | plt.ylabel                |               | norm.ppf    |                |
|np.logical_or      | plt.title                 |               | poisson.cdf |                |
|np.logical_and     | plt.xticks                |               | poisson.rvs |                |
|np.nditer          | plt.yticks                |               | poisson.pmf |                |
|np.transpose       | plt.text                  |               |             |                |
|                   | plt.grid                  |               |             |                |


<img align="right" src="Owego_WaterSupply.png" width="500"/>

The town of Owego is concerned about the quality of their groundwater-based drinking water supply. In response, the town would like to consider the feasibility of switching their entire water supply source to surface water. It would be relatively inexpensive to withdraw water from the Owego Creek directly below the Owego Creek near Owego NY streamflow gage (ID # 01514000, see map). However, Owego needs to determine whether environmental flow regulations would make it difficult for them to reliably withdraw water from the Owego Creek. These environmental flow regulations, designed to ensure enough water remains instream to support local aquatic ecosystems, are enforced through water withdrawal permits administered by the NY Department of Environmental Conservation (DEC). Staff at DEC put out technical guidance (TOGS for short) that can be used to determine the amount of flow that must remain instream before water can be withdrawn for domestic supply. For those who are interested, the TOGS document is on Canvas. 

Your assignment is to determine the required pass-by flow for the Owego Creek in a low flow month, and then determine how this requirement will influence the reliability of water withdrawals that the town of Owego is considering for its water supply needs. 



1. Import the following packages: numpy, pandas, matplotlib.pyplot. Also import norm, lognorm, and gamma from scipy.stats.

2. Download monthly streamflow data for the United States Geological Survey (USGS) Owego Creek near Owego NY streamflow gage (ID # 01514000). The data is available on Canvas, with three columns: year, month, and average flow in that month (in cubic feet per second, cfs). 

3.	**(1 pt)** Load the data using the read_csv() function from Pandas. Then, estimate the average monthly flow for each calendar month, generating 12 monthly average values. **Create a line plot of the monthly hydrograph. Be sure to label your axes with units. Determine and report which month has the lowest average monthly flow.** This will be the ‘design month.’ By examining water demand and supply conflicts during the driest month, our analysis will consider the most constraining time of year for water withdrawals. 

4.	Create a new variable (‘owego_min_month’) that only contains the data for the design month. This variable should be a vector that has 48 elements, one for each year of the record. 
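
The grouping and subsetting steps in problems #3 and #4 can be sketched as follows. This is a minimal illustration on synthetic data, not the Canvas file: the column names `year`, `month`, and `flow_cfs` are assumptions, so substitute the actual headers from the CSV you load with `pd.read_csv()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Canvas file; the column names
# 'year', 'month', and 'flow_cfs' are assumptions, not the real headers.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1970, 1975), 12)
months = np.tile(np.arange(1, 13), 5)
flow = rng.gamma(shape=2.0, scale=50.0, size=years.size)
df = pd.DataFrame({"year": years, "month": months, "flow_cfs": flow})

# Average over all years within each calendar month (12 values).
monthly_mean = df.groupby("month")["flow_cfs"].mean()

# Calendar month with the smallest average flow.
design_month = monthly_mean.idxmin()

# All observations for that month, one per year, as a NumPy vector.
design_flows = df.loc[df["month"] == design_month, "flow_cfs"].to_numpy()
```

With the synthetic five-year record, `design_flows` has five elements; with the real record it should have one element per year.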

5.	**(1 pt) Plot a histogram of the data in owego_min_month. Be sure to label the x-axis.** 

    **Calculate and report the first two moments of the observed data, rounded to two decimal places. Briefly comment on the shape of the distribution, e.g., is it symmetric or skewed?** 
    
    **At this point, which of the following distributions (normal, log-normal, gamma) do you think may be a good fit for these data and why?** 

6.	**(1 pt)** Fit normal, lognormal, and gamma distributions to these data using maximum likelihood. You can use functions designed for MLE from the scipy.stats package: norm.fit(), lognorm.fit(), gamma.fit(). Note that for the lognormal and gamma distributions, scipy.stats assumes by default a 3-parameter version of these models, allowing for a ‘location’ parameter that shifts the whole distribution to the left or right (i.e., moves the lower bound of the distribution above or below zero). In this application, we are going to force this location parameter equal to zero (i.e., we will use 2-parameter versions of these models). You can do this by setting the argument ‘floc’ equal to 0 in your function call when fitting the models. 

    **Report the fitted parameters and the maximized log-likelihood value for each distribution, all rounded to two decimal places. Which model has the best log-likelihood value?** 

    **Next, re-create the histogram from problem #5, but this time set the argument 'density=True'. Add to this histogram a line representing the pdf of the model with the best log-likelihood value. Briefly comment on the fit of this pdf to the data.**

    Note: The vertical scale of a 'frequency histogram', which is the default version of a histogram, shows the number of observations in each bin. In contrast, the vertical scale of a 'density histogram' is rescaled so that the total area of all the bars equals 1: the frequency of each bar is divided by the product of *n* (the total number of observations) and *w* (the width of each bar). This makes it possible to show the probability density curve of a fitted probability distribution using the same vertical scale. 

    **Finally, for the probability model with the best log-likelihood score selected above, re-fit this model to your data, but this time let the location parameter be fit as well. Add a pdf for this version of the model to the same histogram you created above. Be sure to add a legend to this plot distinguishing which pdf belongs to the 2-parameter and 3-parameter versions of the model.** 
    
    **Does this 3-parameter version of the model look like a better or worse fit to the data than the 2-parameter version? Why might we want to force the location parameter (i.e., the lower bound) of our fitted distribution to zero, rather than let it be fit based on the data? Construct your argument based on the overarching goal of this assignment.**
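
As a rough sketch of the fitting workflow in problem #6, the example below fits the three candidate distributions to synthetic data and evaluates the maximized log-likelihood as the sum of log-densities at the fitted parameters. The data and the dictionary layout are illustrative assumptions; only `norm.fit()`, `lognorm.fit()`, `gamma.fit()`, the `floc` argument, and `logpdf()` come from scipy.stats. The last lines check numerically that a density histogram has total area 1.

```python
import numpy as np
from scipy.stats import norm, lognorm, gamma

# Synthetic positive data standing in for owego_min_month.
rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=20.0, size=48)

# norm.fit returns (loc, scale); lognorm.fit and gamma.fit return
# (shape, loc, scale). floc=0 pins the location parameter at zero.
fits = {
    "normal": (norm, norm.fit(x)),
    "lognormal": (lognorm, lognorm.fit(x, floc=0)),
    "gamma": (gamma, gamma.fit(x, floc=0)),
}

# Maximized log-likelihood: sum of log-densities at the fitted parameters.
loglik = {name: dist.logpdf(x, *params).sum()
          for name, (dist, params) in fits.items()}

for name, (dist, params) in fits.items():
    print(name, np.round(params, 2), round(loglik[name], 2))

# Density histogram: bar heights = counts / (n * w), so total area is 1.
heights, edges = np.histogram(x, bins=10, density=True)
area = np.sum(heights * np.diff(edges))
```

Dropping `floc=0` from a `fit()` call gives the 3-parameter version, in which the location is estimated along with the other parameters.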

 

7.	**(1 pt) Calculate and report the AIC for the three models from problem #6, rounded to two decimal places (only consider the 2-parameter versions of each model here). Using the AIC, select and report the model that best fits the data.** This will be our candidate probability model for these data. 

    **Then, develop a Q-Q plot (with labeled axes) to visually evaluate how well your candidate model fits the data. Use parametric bootstrapping to put 95% confidence bounds on your Q-Q plot, and provide an interpretation of the results.**
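
One possible way to set up the AIC calculation and a Q-Q plot with parametric bootstrap bounds is sketched below, using a gamma model on synthetic data purely as an example (your selected model may differ). It uses AIC = 2k - 2 ln L with k = 2 free parameters. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as is the choice of pointwise percentile bounds on the simulated order statistics.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=20.0, size=48)   # synthetic stand-in data
n = x.size

# Fit a 2-parameter gamma (location fixed at 0); AIC = 2k - 2 ln L.
a, loc, scale = gamma.fit(x, floc=0)
k = 2  # free parameters: shape and scale
aic = 2 * k - 2 * gamma.logpdf(x, a, loc, scale).sum()

# Q-Q plot: sorted observations vs. model quantiles at plotting positions.
p = (np.arange(1, n + 1) - 0.5) / n
theoretical = gamma.ppf(p, a, loc, scale)
observed = np.sort(x)

# Parametric bootstrap: simulate B samples of size n from the fitted model,
# sort each one, and take pointwise 2.5% / 97.5% quantiles of the
# simulated order statistics.
B = 1000
sims = np.sort(gamma.rvs(a, loc, scale, size=(B, n), random_state=rng), axis=1)
lower, upper = np.quantile(sims, [0.025, 0.975], axis=0)

fig, ax = plt.subplots()
ax.plot(theoretical, observed, "o", label="observations")
ax.plot(theoretical, theoretical, "k-", label="1:1 line")
ax.plot(theoretical, lower, "r--", label="95% bootstrap bounds")
ax.plot(theoretical, upper, "r--")
ax.set_xlabel("Theoretical quantiles (cfs)")
ax.set_ylabel("Observed quantiles (cfs)")
ax.legend()
# Call plt.show() when running this as a script rather than in a notebook.
```

Observed points that fall outside the dashed bounds are more extreme than would be expected from sampling variability alone if the fitted model were correct.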

8.	**(1 pt) Determine and report the pass-by flow (rounded to two decimal places) at the Owego Creek gaging station in the design month using the TOGS guidance (i.e., use Table 1, copied at end of this assignment) and your fitted distribution of choice.** Note that the drainage area for the Owego gage is 185 square miles. Also be mindful that Table 1 provides you with exceedance probabilities, rather than non-exceedance probabilities. 
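
The distinction between exceedance and non-exceedance probabilities can be illustrated with a short sketch. The fitted parameters and the exceedance probability below are placeholders, not values from Table 1 or from the Owego data; the point is only that `ppf()` expects a non-exceedance probability, so an exceedance probability must be converted first.

```python
from scipy.stats import gamma

# Hypothetical fitted parameters and exceedance probability, for illustration
# only; take the real exceedance level from Table 1.
a, loc, scale = 3.0, 0.0, 20.0
p_exceed = 0.75   # placeholder: flow exceeded 75% of the time

# ppf is the inverse CDF, so it takes a non-exceedance probability.
flow_at_exceedance = gamma.ppf(1 - p_exceed, a, loc, scale)

# Equivalent: isf is the inverse survival function and takes p_exceed directly.
flow_via_isf = gamma.isf(p_exceed, a, loc, scale)
```

Both calls return the same flow, which is the value exceeded with probability `p_exceed` under the fitted model.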

9.	**(1 pt)** Using reasonable estimates of per capita water use in the United States and the population of the town of Owego (Google it; make sure to use the town, not the village, of Owego), determine an estimate of average daily water demand for the town (convert this to cubic feet per second, or cfs). **Report this average daily water demand (rounded to two decimal places), along with the per capita usage and population numbers you used.**  

    **Using your selected probability model, report the probability (as a percentage, rounded to 2 decimal places) that the average flow for the design month will be below this demand level, without consideration of the pass-by flow requirement.** That is, what is the probability that direct withdrawals from the stream will be unable to meet the town’s average water supply needs during the design month? 
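
The unit conversion in problem #9 is simple arithmetic, sketched below. The helper name `gpd_to_cfs` is hypothetical, and the per capita use and population are placeholder inputs, NOT values for Owego; look up and justify your own. The conversion assumes US liquid gallons (about 7.48052 gallons per cubic foot) and 86,400 seconds per day.

```python
GALLONS_PER_CUBIC_FOOT = 7.48052   # US liquid gallons in one cubic foot
SECONDS_PER_DAY = 86_400


def gpd_to_cfs(gallons_per_day):
    """Convert a flow in US gallons per day to cubic feet per second."""
    return gallons_per_day / GALLONS_PER_CUBIC_FOOT / SECONDS_PER_DAY


# Placeholder inputs, NOT the values for Owego: look up your own.
per_capita_gpd = 100.0
population = 10_000
demand_cfs = gpd_to_cfs(per_capita_gpd * population)
```

As a sanity check, one million gallons per day works out to roughly 1.55 cfs.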


10.	**(1 pt)** If the town needs to acquire a DEC permit that first requires them to allow the entire pass-by flow to pass downstream before withdrawing water to meet their domestic supply needs, how does this change the reliability of their water supply in the design month? That is, **report the probability (as a percentage, rounded to 2 decimal places) that the average flow for the design month will be unable to fully meet both the pass-by flow and Owego's municipal demand.** 
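
The two shortfall probabilities in problems #9 and #10 are both evaluations of the fitted CDF, as the sketch below shows. All numbers are hypothetical placeholders, and a gamma model is assumed only for illustration.

```python
from scipy.stats import gamma

# Hypothetical values for illustration only.
a, loc, scale = 3.0, 0.0, 20.0   # fitted 2-parameter gamma (loc fixed at 0)
demand_cfs = 1.5                 # placeholder municipal demand
passby_cfs = 25.0                # placeholder pass-by flow

# P(flow < demand): shortfall probability ignoring the pass-by requirement.
p_fail_demand = gamma.cdf(demand_cfs, a, loc, scale)

# P(flow < pass-by + demand): the stream must cover both before demand is met.
p_fail_both = gamma.cdf(passby_cfs + demand_cfs, a, loc, scale)

print(f"{100 * p_fail_demand:.2f}%  {100 * p_fail_both:.2f}%")
```

Because the CDF is non-decreasing, adding the pass-by requirement can only raise the probability of a shortfall.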

11.	**(1 pt) Using a non-parametric bootstrap (with B=1000 bootstrap samples), develop and report 95% confidence intervals (rounded to two decimal places) for both MLE parameter estimates of the distribution you used in problems #8-#10.** 
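
A non-parametric bootstrap for the parameter estimates can be sketched as below: resample the observations with replacement, re-fit the model to each resample, and take percentiles of the re-fitted parameters. The gamma model, the synthetic data, and the use of percentile intervals are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
x = rng.gamma(shape=3.0, scale=20.0, size=48)   # synthetic stand-in data

B = 1000
boot_params = np.empty((B, 2))
for b in range(B):
    # Resample the observations with replacement, keeping the sample size.
    xb = rng.choice(x, size=x.size, replace=True)
    a_b, _, scale_b = gamma.fit(xb, floc=0)
    boot_params[b] = (a_b, scale_b)

# Percentile 95% confidence intervals, one entry per parameter
# (index 0 = shape, index 1 = scale).
ci_low, ci_high = np.quantile(boot_params, [0.025, 0.975], axis=0)
```

Unlike the parametric bootstrap used for the Q-Q plot, this version never simulates from the fitted model; all variability comes from resampling the data themselves.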

12.	**(1 pt)** Now we are interested in the uncertainty in your result from problem #10. Using a non-parametric bootstrap (with B=1000 bootstrap samples), simulate from the sampling distribution of the probability (as a percentage) that the average flow for the design month will not be sufficient to meet both pass-by flow requirements and municipal water supply needs. **Plot the sampling distribution for this probability as a boxplot, and add as a red point on top of this boxplot the original probability calculated in problem #10.** 

    **Approximately how high could the probability of not meeting both pass-by flow requirements and municipal water supply needs reach?**
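
A sketch of the bootstrap and boxplot described in problem #12 follows. The helper `fail_pct` is hypothetical, the data are synthetic, and the threshold is a placeholder. As a simplifying assumption, the threshold (pass-by flow plus demand) is held fixed across bootstrap samples; whether the pass-by flow should instead be recomputed from each re-fitted model is an analysis choice this sketch does not settle.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=20.0, size=48)   # synthetic stand-in data
threshold = 26.5   # placeholder for pass-by flow + demand, in cfs


def fail_pct(sample):
    """Percent chance that flow < threshold under a gamma fit (loc = 0)."""
    a, loc, scale = gamma.fit(sample, floc=0)
    return 100 * gamma.cdf(threshold, a, loc, scale)


original = fail_pct(x)

# Re-fit the model to B resamples and recompute the shortfall probability.
B = 1000
boot = np.array([fail_pct(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

fig, ax = plt.subplots()
ax.boxplot(boot)
ax.plot(1, original, "ro", label="original estimate")   # box sits at x = 1
ax.set_ylabel("Probability of shortfall (%)")
ax.legend()
# Call plt.show() when running this as a script rather than in a notebook.
```

The upper whisker and any points beyond it give a sense of how large the shortfall probability could plausibly be given sampling uncertainty.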

13.	**(1 pt) In 1-2 sentences, provide an interpretation of your results from both problems #10 and #12 above, with respect to the Town of Owego’s plan to use Owego Creek as the sole source for its domestic water supply.**

**Table 1. TOGS guidance for pass-by flows** 

<img src="TOGS_table.png" width="1000"/>