<a href="https://colab.research.google.com/github/ReidelVichot/DSTEP23/blob/main/week_5/dstep23_dsny_trash_part2_rvichot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **DSTEP23 // Dept of Sanitation in NYC, Part 2: trends and periodicity**

*September 28, 2023*

This notebook will explore two questions related to waste removal by New York City's Department of Sanitation:

- ***What is the relationship between refuse and recycling?***

- ***Is there periodicity in the tonnage data?***

The data can be found [here](https://data.cityofnewyork.us/City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5).

---

### **From Part 1**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# -- set the filename and read the data while parsing the MONTH column
fname = "https://data.cityofnewyork.us/api/views/ebb7-mvp5/rows.csv?accessType=DOWNLOAD"
dsny  = pd.read_csv(fname, parse_dates=["MONTH"])

# -- sub-select only the first six columns
cols = dsny.columns[:6]
dsny = dsny[cols]

# -- rename the columns for ease of use
dsny.columns = ["month", "borough", "district", "refuse", "paper", "mgp"]

# -- create a column that is all recycling
dsny["recy"] = dsny["paper"] + dsny["mgp"]

# -- sort by month values
dsny = dsny.sort_values("month", ignore_index=True)

# -- let's concentrate on 1995 through 2018 (24 years)
ind_tlo = dsny["month"] >= "1995-01-01"
ind_thi = dsny["month"] < "2019-01-01"
ind_tot = ind_tlo & ind_thi
dsny = dsny[ind_tot]

# -- convert NaNs to 0s (this is a CHOICE!)
dsny = dsny.fillna(0.0)

# -- group by month and sum to aggregate across the whole city
nyc_ts = dsny.groupby("month").sum(numeric_only=True)

# -- display the original and result
display(dsny)
display(nyc_ts)

# -- plot the result
fig, ax = plt.subplots(figsize=(10, 5))
nyc_ts.plot(y="refuse", ylabel="total NYC refuse [tons]", xlabel="", legend=False, color="k", ax=ax)
fig.show()

### **Structure in Time Series Data and Filtering**

We see that there is a lot of variability in the time series data **"on multiple time scales"**.  We can isolate the short time scale behavior by removing trends.  Let's concentrate on Brooklyn for now:

In [None]:
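# -- sub-select Brooklyn (assumes the borough label is "Brooklyn")
bk = dsny[dsny["borough"] == "Brooklyn"]

# -- group by month and sum to aggregate across the borough
bk_ts = bk.groupby("month").sum(numeric_only=True)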

Let's plot Brooklyn again, but add some additional gridlines,

In [None]:
# -- plot total refuse for Brooklyn with minor grid lines
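fig, ax = plt.subplots(figsize=(10, 3))
bk_ts.plot(y="refuse", legend=False, color="k", ax=ax)
ax.grid(axis="x", which="major", lw=2)
ax.grid(axis="x", which="minor", lw=0.5)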
ax.set_xlabel("")
ax.set_ylabel("total Brooklyn refuse [tons]")
fig.show()

Zooming in a bit,

In [None]:
# -- restrict x-axis range


Convert units,

In [None]:
# -- display index
bk_ts.index
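
Assuming the index is a `DatetimeIndex` (as expected when `MONTH` is parsed as dates), pandas exposes the number of days in each month through `days_in_month`. The following is a minimal sketch of the unit conversion on synthetic data; the values are made up and the names `months`, `monthly`, and `daily` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# synthetic example: the same 100 tons/day collected every day of 2018
# (made-up numbers, not DSNY data)
months = pd.date_range("2018-01-01", periods=12, freq="MS")
ndays = months.days_in_month.to_numpy()
monthly = pd.DataFrame({"refuse": 100.0 * ndays}, index=months)

# the monthly totals dip in February only because it has fewer days
print(monthly["refuse"].idxmin())

# dividing each row by the number of days in its month removes that dip
daily = monthly.div(ndays, axis=0)
print(daily["refuse"].unique())
```

Passing `axis=0` makes the division run down the rows (one divisor per month) rather than across the columns.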

In [None]:
# -- divide by days in the month
bk_ts = bk_ts.div(bk_ts.index.days_in_month.values, axis=0)


In [None]:
# -- restrict x-axis range


In [None]:
# -- plot total refuse for Brooklyn with minor grid lines


Sharp dips in February have gone away, but it still seems that there are ***seasonal*** effects.  Let's look at all of the districts in BK,

In [None]:
# -- plot total refuse and refuse of individual districts
fig, ax = plt.subplots(2, 1, figsize=(10, 6))

bk_ts.plot(y="refuse", legend=False, color="k", ax=ax[0])
ax[0].grid(axis="x", which="major", lw=2)
ax[0].grid(axis="x", which="minor", lw=0.5)
ax[0].set_xlabel("")
ax[0].set_ylabel("total Brooklyn refuse [tons/day]")

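# -- one column per district, converted to tons/day (a pivot is one possible approach)
bkdist = bk.pivot_table(index="month", columns="district", values="refuse", aggfunc="sum")
bkdist = bkdist.div(bkdist.index.days_in_month.values, axis=0)
bkdist.plot(lw=0.5, ax=ax[1])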

ax[1].grid(axis="x", which="major", lw=2)
ax[1].set_xlabel("")
ax[1].set_ylabel("total refuse [tons/day]")
ax[1].legend(loc="upper left")

fig.show()

Long time scale trends can be found by "filtering" time series data.  One of the most common filters is the rolling mean:

In [None]:
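# -- take the rolling mean of the Brooklyn data with quarterly (3-month)
#    and annual (12-month) windows; centering the window is a CHOICE
bk_ts_03 = bk_ts.rolling(3, center=True).mean()
bk_ts_12 = bk_ts.rolling(12, center=True).mean()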

In [None]:
# -- plot the two
fig, ax = plt.subplots(figsize=(10, 5))
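bk_ts.plot(y="refuse", color="k", label="raw data", ax=ax)
bk_ts_03.plot(y="refuse", label="3-month rolling mean", ax=ax)
bk_ts_12.plot(y="refuse", label="12-month rolling mean", ax=ax)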

ax.grid(axis="x", which="major", lw=2)
ax.grid(axis="x", which="minor", lw=0.5)
ax.set_xlabel("")
ax.set_ylabel("total Brooklyn refuse [tons/day]")
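
To see why the window length matters, here is a minimal sketch on synthetic data (a made-up linear trend plus a 12-month cycle, not DSNY values; the names `ts`, `ts_03`, and `ts_12` are illustrative). It assumes centered windows, which is one choice among several.

```python
import numpy as np
import pandas as pd

# synthetic monthly series: linear trend plus a 12-month cycle
# (made-up numbers, not DSNY data)
months = pd.date_range("2010-01-01", periods=72, freq="MS")
t = np.arange(len(months))
trend = 3000.0 - 5.0 * t
cycle = 200.0 * np.sin(2.0 * np.pi * t / 12.0)
ts = pd.Series(trend + cycle, index=months, name="refuse")

# 3-month window: smooths month-to-month changes but keeps most of the cycle
ts_03 = ts.rolling(3, center=True).mean()

# 12-month window: averages over one full cycle, leaving roughly the trend
ts_12 = ts.rolling(12, center=True).mean()

# values near the ends are NaN because the window is incomplete there
print(ts_12.isna().sum())

# largest departure from the underlying trend after each filter
print((ts_03 - trend).abs().max(), (ts_12 - trend).abs().max())
```

A window that spans exactly one period averages the cycle away, which is why the 12-month filter is the natural choice for isolating behavior on time scales longer than a year.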

So the 12-month rolling mean gives us the long time scale behavior.  What if we compare the smoothed refuse to recycling?

In [None]:
# -- plot refuse and recycling


[This WNYC story](https://www.wnyc.org/story/bloomberg-and-garbage-pile-unfinished-business/) provides some context that might help explain some of the characteristics of this plot.

### **Covariance and correlation**

The above plot is interesting for a few reasons...  Let's try to calculate the correlation coefficient between the two:

&nbsp;&nbsp;&nbsp; <big> **$C = \frac{cov(v_1, \ v_2)}{\sigma_{v_1} \sigma_{v_2}}$** </big>

&nbsp;&nbsp;&nbsp; **$cov(v_1, \ v_2) = \frac{1}{N - 1} \sum_i (v_{1, i} \ - \  \bar{v}_1) \ (v_{2, i} \ - \  \bar{v}_2)$**

Perhaps put more simply,

&nbsp;&nbsp;&nbsp; **$C = \langle v_1^{\prime} \cdot v_2^{\prime} \rangle$**

where

&nbsp;&nbsp;&nbsp; $v_1^{\prime} = \frac{v_1 \ - \ \langle v_1 \rangle}{\sigma_{v_1}}$ <small> &nbsp;&nbsp;&nbsp; this is called <u>**standardization**</u></small>

In [None]:
# -- to visualize, make a scatter plot of the tonnage values


The correlation coefficient defined as above *ranges from 1 (perfectly correlated) to -1 (perfectly anti-correlated)*.
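
The definition above can be checked numerically. The following is a minimal sketch on synthetic data (two made-up series that move in opposite directions, not DSNY values; `v1`, `v2`, and the `c_*` names are illustrative). It standardizes each series, averages the product with the $N - 1$ normalization, and compares the result with the built-in pandas and NumPy calculations.

```python
import numpy as np
import pandas as pd

# synthetic example: two noisy series that move in opposite directions
# (made-up numbers, not DSNY data)
rng = np.random.default_rng(0)
v1 = pd.Series(np.linspace(3000.0, 2500.0, 50) + rng.normal(0.0, 20.0, 50))
v2 = pd.Series(np.linspace(300.0, 450.0, 50) + rng.normal(0.0, 10.0, 50))

# standardize: subtract the mean and divide by the standard deviation
# (pandas .std() uses the N - 1 normalization by default)
v1_st = (v1 - v1.mean()) / v1.std()
v2_st = (v2 - v2.mean()) / v2.std()

# correlation coefficient from the definition, normalized by N - 1
n = len(v1)
c_manual = (v1_st * v2_st).sum() / (n - 1)

# the same quantity from pandas and NumPy
c_pandas = v1.corr(v2)
c_numpy = np.corrcoef(v1, v2)[0, 1]

print(c_manual, c_pandas, c_numpy)
```

Because both series are standardized first, the result does not depend on their units or overall scale, only on how they vary together.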

In [None]:
# -- calculate Pearson correlation coefficient


**<u>ANSWER TO FIRST QUESTION</u>: Refuse and Recycling are <i>anti</i>-correlated <small>(in Brooklyn)</small>.**

### **Auto-Correlation: Determining Periodicity**

We can remove the smoothed time series from the raw data to isolate the short time scale (aka "high frequency") behavior:

In [None]:
# -- isolate short time scale behavior
bk_short = bk_ts - bk_ts_12

In [None]:
# -- plot the short time scale (high frequency) behavior


There does *seem* to be some periodicity in this time series.  We can extract that periodicity by extending the concept of correlation to <u>auto-correlation</u>, which is the correlation of a time series with itself **shifted by some time lag**.
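
As a minimal sketch of this idea, the example below builds a synthetic monthly series (a made-up trend, 12-month cycle, and noise, not DSNY values), removes a centered 12-month rolling mean, and evaluates the correlation at each lag with `Series.autocorr`. The names `ts`, `short`, `lags`, and `acf` are illustrative, and the range of lags is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# synthetic monthly series: trend + 12-month cycle + noise
# (made-up numbers, not DSNY data)
rng = np.random.default_rng(0)
months = pd.date_range("2005-01-01", periods=120, freq="MS")
t = np.arange(len(months))
ts = pd.Series(
    3000.0 - 3.0 * t
    + 200.0 * np.sin(2.0 * np.pi * t / 12.0)
    + rng.normal(0.0, 30.0, len(t)),
    index=months,
)

# remove the long time scale behavior with a centered 12-month rolling mean
short = (ts - ts.rolling(12, center=True).mean()).dropna()

# correlate the series with itself shifted by each lag (in months)
lags = np.arange(1, 19)
acf = pd.Series([short.autocorr(lag=int(k)) for k in lags], index=lags)

# the lag with the strongest correlation indicates the period
print(acf.idxmax())
```

For a periodic signal the auto-correlation is strongly negative at half the period and peaks again at the full period, so the location of that peak is an estimate of the period.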

In [None]:
# -- plot autocorrelation function


**<u>ANSWER TO SECOND QUESTION</u>: Refuse has periodicity on annual time scales <small>(in Brooklyn)</small>.**