In [1]:
# This notebook processing the data without OHE to use for HistGradientBoost algorithm

# Recycling Effectiveness in MA

### *Part 3: Baseline Regressions for Recycling Rate in Total Population Based on Service Attributes of Each Municipality*


In [2]:
import pandas as pd
import numpy as np
# import plotly.graph_objects as go
# from plotly.graph_objs import *
import matplotlib.pyplot as plt
import seaborn as sns


## Importing and Inspection of Data

In [3]:
# Import the 2019 municipal survey results into a df
# usecols is just trimming off additional columns that had to do with special/hazardous recyclables

serv19 = pd.read_csv('data/MA_MSW_Collection_Data/serv19cleaned.csv', index_col='Municipality Name')
serv19.head()

Unnamed: 0_level_0,Contact Name,Total Number of Households,Households Served by Municipal Trash Program,Households Served by Municipal Recycling Program,Trash Service Type,Carts for Trash,Trash Cart size,Recycling Service Type,Recycling Collection Frequency,SS Recycling,...,Does trash disposal tonnage include bulky waste?,Bulky waste tonnage,Fee for bulky waste?,Annual Bulky \nWaste \nLimit,Tip Fee as of 1/1/2020,Tons Single Stream Recyclables,Newspaper,Cardboard,Mixed Paper,Commingled
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Angela Dahlstrom,6558.0,4486.0,4486.0,Curbside,Yes,64.0,Curbside,Weekly,Yes,...,Yes,,Yes,,86.5,1413.42,,,,
Acton,Corey York,9800.0,3846.0,4335.0,Drop-off,,,Drop-off,,,...,Yes,,Yes,,57.16,,,,683.04,407.27
Acushnet,Dan Menard,4304.0,3591.0,3591.0,Curbside,Yes,65.0,Both,Bi-weekly,Yes,...,No,41.0,Yes,,64.6,879.5,3.7,20.0,16.94,27.4
Adams,Linda Cernik,3867.0,664.0,664.0,Drop-off,,,Drop-off,,,...,No,4.43,Yes,,110.0,,,,94.13,45.48
Agawam,Tracy DeMaio,12031.0,8879.0,8879.0,Curbside,Yes,65.0,Curbside,Bi-weekly,Yes,...,No,275.17,Yes,30.0,74.0,2238.0,,,,


In [4]:
serv19.columns

Index(['Contact Name', 'Total Number of Households',
       'Households Served by Municipal Trash Program',
       'Households Served by Municipal Recycling Program',
       'Trash Service Type', 'Carts for Trash', 'Trash Cart size',
       'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size',
       'Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service',
       'Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       '

In [5]:
cols_to_use = ['Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       'Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service', 'Households Served by Municipal Trash Program', 'Trash Service Type', 'Carts for Trash', 'Trash Cart size', 'Does trash disposal tonnage include bulky waste?','Fee for bulky waste?',
       'Annual Bulky \nWaste \nLimit', 'Tip Fee as of 1/1/2020', 'Enforced Trash Limits at Curb', 'Maximum # bags/ barrels per week',
       'Barrel size in gallons (eg 32 64 etc)', 'Trash Enforced by Muni',
       'Trash Enforced by Hauler', 'Dedicated Trash Enforcement Personnel','Households Served by Municipal Recycling Program', 'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size', 'Enforced Mandatory Recycling',
       'Applies to Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
       'Applies to Commercial Generators', 'Recycling Enforced by Muni',
       'Recycling Enforced by Hauler',
       'Dedicated Mandatory Recycling Enforcement Personnel',
       '# Hours Enforcement Personnel on Street', 'Private Hauler regulations that require recycling',
       'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1'
              ]

In [6]:
# general information related to funding for service.

serv_fund = ['Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',]

In [7]:
# general service information for non-residental buildings

gen_serv = ['Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service', ]

In [8]:
# quantities and specifications related to trash services

trash_serv = ['Households Served by Municipal Trash Program', 'Trash Service Type', 'Carts for Trash', 'Trash Cart size', 'Does trash disposal tonnage include bulky waste?','Fee for bulky waste?',
       'Annual Bulky \nWaste \nLimit', 'Tip Fee as of 1/1/2020', 'Enforced Trash Limits at Curb', 'Maximum # bags/ barrels per week',
       'Barrel size in gallons (eg 32 64 etc)', 'Trash Enforced by Muni',
       'Trash Enforced by Hauler', 'Dedicated Trash Enforcement Personnel', ]

In [9]:
# quantitative data on trash tonnages collected

trash_tonnage_data = ['Households Served by Municipal Trash Program', 'Trash Disposal Tonnage', 'Bulky waste tonnage', ]

In [10]:
# quantities and specifications related to recycle services

recycle_serv= ['Households Served by Municipal Recycling Program', 'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size', 'Enforced Mandatory Recycling',
       'Applies to Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
       'Applies to Commercial Generators', 'Recycling Enforced by Muni',
       'Recycling Enforced by Hauler',
       'Dedicated Mandatory Recycling Enforcement Personnel',
       '# Hours Enforcement Personnel on Street', 'Private Hauler regulations that require recycling',
       'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1']

In [11]:
# Import summary data for municipal waste tonnages in 2019

tonnages19 = pd.read_csv('data/MA_MSW_Collection_Data/musum19.csv', index_col='Municipality Name')
tonnages19.head()

Unnamed: 0_level_0,tot_households,stream_type,tons_ss_recyclables,tons_ms_recyclables,tons_recyclables_total,hh_served_by_mu_recycle,tons_recyclables/hh,hh_served_by_mu_trash,tons_trash_total,tons_trash/hh,%recycle/hh,total_waste/hh
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Abington,6558.0,ss,1413.42,0.0,1413.42,4486.0,0.315074,4486.0,3826.66,0.853023,0.269733,1.168096
Acton,9800.0,ms,0.0,1090.31,1090.31,4335.0,0.251513,3846.0,2148.67,0.558677,0.310437,0.81019
Acushnet,4304.0,ss+,879.5,68.04,947.54,3591.0,0.263865,3591.0,3446.38,0.959727,0.215648,1.223592
Adams,3867.0,ms,0.0,139.61,139.61,664.0,0.210256,664.0,134.47,0.202515,0.509377,0.412771
Agawam,12031.0,ss,2238.0,0.0,2238.0,8879.0,0.252055,8879.0,6717.17,0.756523,0.249912,1.008579


## Setting up data for regression

In [12]:
df_for_regression = serv19.loc[:,cols_to_use].merge(tonnages19.loc[:,['%recycle/hh','total_waste/hh']], left_index=True, right_index=True)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Recycling Enforced by Muni,Recycling Enforced by Hauler,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Yes,,,,,,,,,Both,...,Yes,Yes,Yes,20.0,Yes,Yes,Yes,Yes,0.269733,1.168096
Acton,,,Yes,Yes,Yes,,100.0,30.0,Yes,Both,...,Yes,,No,,,,,,0.310437,0.81019


In [13]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Abington to Yarmouth
Data columns (total 47 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       192 non-null    object 
 1   Solid Waste program funded by annual fee?                                                         85 non-null     object 
 2   Solid Waste program funded by transfer station access fee?                                        73 non-null     object 
 3   Solid Waste program funded by per-visit fee?                                                      18 non-null     object 
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                132 non-null    object 

### Dealing with NaNs

There are lot of columns missing quite a few values. 277 entries, let's look at anything that has less than 271 non-null values. I can see the features with 271 non-null values are datapoints that I knew were missing some data. I get to that last.

In [14]:
df_for_regression.count()[df_for_regression.count() < 271].sort_values()

Solid Waste program funded by per-visit fee?                                                         18
What is the per-visit fee?                                                                           18
Trash Enforced by Muni                                                                               31
Dedicated Trash Enforcement Personnel                                                                31
# Hours Enforcement Personnel on Street                                                              33
Trash Enforced by Hauler                                                                             35
Annual Bulky \nWaste \nLimit                                                                         48
Barrel size in gallons (eg 32 64 etc)                                                                50
Maximum # bags/ barrels per week                                                                     50
Enforced Trash Limits at Curb                                   

Okay, some of the NaNs are because they really mean "no" or "0", like "Solid Waste program funded by per-visit fee?". Others are numeric and I'll need to handle those appropriately.

In [15]:
# enter each feature name to see why they may have NaNs
df_for_regression['Recycling Cart Size'].value_counts()

96.0    33
64.0    25
95.0    12
65.0     7
0.0      1
35.0     1
Name: Recycling Cart Size, dtype: int64

In [16]:
# columns with only "yes", and NaN likely means "no"

OHE_list_1 = ['Solid Waste program funded by per-visit fee?', 'Trash Enforced by Muni', 'Trash Enforced by Hauler', 'Enforced Trash Limits at Curb', 'Applies to Commercial Generators.1', 
'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program', 'Recycling Enforced by Hauler', 'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
'Applies to Commercial Generators', 'Carts for Trash', 'Private Hauler regulations that require recycling', 'Solid Waste program funded by transfer station access fee?', 'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
'Carts for Recycling', 'Solid Waste program funded by annual fee?', 'Recycling Enforced by Muni', 'Solid Waste program funded by PAYT/ SMART revenue?', 'PAYT/ SMART', 'SS Recycling', 'Applies to Residential Generators Eligible to be Served by Municipal Program',
'Solid Waste program funded by property tax?']

In [17]:
# columns with "yes" and "no", but I will assume NaN still likely means "no"
# note that enforcement personnel has only 31 non-null values but # of hours enforces is 33.

OHE_list_2 = ['Dedicated Trash Enforcement Personnel', 'Dedicated Mandatory Recycling Enforcement Personnel', 'Enforced Mandatory Recycling', 'Fee for bulky waste?']

In [18]:
# columns where NaN likely means "0"

fill_w_zero = ['What is the per-visit fee?', '# Hours Enforcement Personnel on Street',  'Annual Bulky \nWaste \nLimit', 'Maximum # bags/ barrels per week', 'What is the transfer station access fee?', 'What is the annual fee?', 'Tip Fee as of 1/1/2020']

In [19]:
# columns that have multiple options 
# barrel size and cart size can probably be simplified to 32, 48, 64, and 96

# barrel size and trash cart size, only impute where recycling service type is curbside or both, otherwise 0

impute_w_mode = ['Barrel size in gallons (eg 32 64 etc)', 'Trash Cart size', 'Recycling Cart Size', 'Recycling Collection Frequency']


In [20]:
for col in OHE_list_1:
#     df_for_regression[col] = df_for_regression[col].str.replace('Yes', '1')
    df_for_regression[col].fillna('No', inplace = True)
#     df_for_regression[col] = pd.to_numeric(df_for_regression[col])

In [21]:
for col in OHE_list_2:
#     df_for_regression[col] = df_for_regression[col].str.replace('Yes', '1')
#     df_for_regression[col] = df_for_regression[col].str.replace('No', '0')
    df_for_regression[col].fillna('No', inplace = True)
#     df_for_regression[col] = pd.to_numeric(df_for_regression[col])

In [22]:
df_for_regression[fill_w_zero] = df_for_regression[fill_w_zero].fillna(0)

I'm struggling with the last group because if bin size is missing, it probably means there's no limit of what you can put out or it's drop off only. I could see bin size being important because if the bin is too small, they may recycle less. If no bin is needed, maybe you can recycle more?

It's almost like... you could make a bin scale:
```
0       no curbside pick up
0.25    0-35 gal bins
0.50    35-65 gal bins
0.75    65-96 gal bins
1       no bins provided
```
I also think that Barrel size and Trash cart size can be joined into one. I think Trash cart size is in reference to specifically supplied bins whle barrel size refers to max bin sizes accept. I could take the Barrel size and any missing values, first try to fill with trash cart size. Then if there is curbside pickup, missing values are 1 and if no curbside, values are 0.


Lastly, recycling frequency is only missing valuse because of drop-off only. I think I can still just OHE it? hopefully weekly would be `0 1`, biweekly would be `1 0`, and drop-off only would be `0 0`.

In [23]:
df_for_regression['Recycling Collection Frequency'] = df_for_regression['Recycling Collection Frequency'].fillna('None')

In [24]:
df_for_regression['Recycling Collection Frequency'].value_counts()

None         135
Weekly        77
Bi-weekly     65
Name: Recycling Collection Frequency, dtype: int64

In [25]:
def bin_rank(size, serve_type, limit_size):
    if limit_size != np.nan:
        x = (
            0 if serve_type == 'Drop-off' or serve_type == 'None' else
            0.25 if limit_size < 35 else
            0.50 if limit_size < 65 else
            0.75 if limit_size < 96 else
            1)
    else:
        x = (
            0 if serve_type == 'Drop-off' or serve_type == 'None' else
            0.25 if size < 35 else
            0.50 if size < 65 else
            0.75 if size < 96 else
            1)
    return x

In [26]:
df_for_regression['Trash Service Type'].unique()

array(['Curbside', 'Drop-off', 'Both', 'None'], dtype=object)

In [27]:
df_for_regression['Trash Bin Size Ranking'] = df_for_regression.apply(lambda row: bin_rank(row['Trash Cart size'], row['Trash Service Type'],row['Barrel size in gallons (eg 32 64 etc)']), axis=1)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Recycling Enforced by Hauler,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh,Trash Bin Size Ranking
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Yes,No,No,No,No,0.0,0.0,0.0,No,Both,...,Yes,Yes,20.0,Yes,Yes,Yes,Yes,0.269733,1.168096,0.5
Acton,No,No,Yes,Yes,Yes,0.0,100.0,30.0,Yes,Both,...,No,No,0.0,No,No,No,No,0.310437,0.81019,0.0


In [28]:
df_for_regression.drop(columns=['Trash Cart size','Barrel size in gallons (eg 32 64 etc)'], inplace=True)

In [29]:
def bin_rank(size, serve_type):
    x = (
        0 if serve_type == 'Drop-off' or serve_type == 'None' else
        0.25 if size < 35 else
        0.50 if size < 65 else
        0.75 if size < 96 else
        1)
    return x

In [30]:
df_for_regression['Recycling Cart Size']

Municipality Name
Abington       64.0
Acton           NaN
Acushnet       96.0
Adams           NaN
Agawam         95.0
               ... 
Woburn          NaN
Worcester       NaN
Worthington     NaN
Wrentham       96.0
Yarmouth        NaN
Name: Recycling Cart Size, Length: 277, dtype: float64

In [31]:
df_for_regression['Recycle Bin Size Ranking'] = df_for_regression.apply(lambda row: bin_rank(row['Recycling Cart Size'], row['Recycling Service Type']), axis=1)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh,Trash Bin Size Ranking,Recycle Bin Size Ranking
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Yes,No,No,No,No,0.0,0.0,0.0,No,Both,...,Yes,20.0,Yes,Yes,Yes,Yes,0.269733,1.168096,0.5,0.5
Acton,No,No,Yes,Yes,Yes,0.0,100.0,30.0,Yes,Both,...,No,0.0,No,No,No,No,0.310437,0.81019,0.0,0.0


In [32]:
df_for_regression.drop(columns=['Recycling Cart Size'], inplace=True)

In [33]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Abington to Yarmouth
Data columns (total 46 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       277 non-null    object 
 1   Solid Waste program funded by annual fee?                                                         277 non-null    object 
 2   Solid Waste program funded by transfer station access fee?                                        277 non-null    object 
 3   Solid Waste program funded by per-visit fee?                                                      277 non-null    object 
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                277 non-null    object 

In [34]:
# for the last missing values, I'm just going to drop those columns because I was missing the trash data in the original dataset

df_for_regression.dropna(inplace=True)

In [35]:
# fixing some annoying formatting
df_for_regression.rename(columns=dict(zip(df_for_regression.columns, df_for_regression.columns.str.replace('\n',''))), inplace=True)

In [36]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 271 entries, Abington to Yarmouth
Data columns (total 46 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       271 non-null    object 
 1   Solid Waste program funded by annual fee?                                                         271 non-null    object 
 2   Solid Waste program funded by transfer station access fee?                                        271 non-null    object 
 3   Solid Waste program funded by per-visit fee?                                                      271 non-null    object 
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                271 non-null    object 

## Regression

### Removing highly correlated data

First off, I don't need `total_waste/hh` because this is a cluster feature (used in notebooke Part 2), rather than a regression prediction feature. So I will remove this.

In [37]:
df_for_regression.drop(columns='total_waste/hh', inplace = True)

I want to look at the correlations for the data to be used for regression. I know some of the information will be _highly_ correlated (like over 0.75) from the last notebook I worked on with fully encoded data. So I'm going to first remove anything that's obviously correlated in the full dataset as is. And then I will created a second dataset, sort of pre-emptively deleting the same highly correlated "primary questions" that I removed from the last encoded dataset.

In [38]:
# colors "true" values to make the table more readable
def color_val_red(val):
    color = 'red' if abs(val) > 0.75 and val < 1 else 'black'
    return 'color: {}'.format(color)

In [39]:
# Now the whole table

dfcorr= df_for_regression.corr()
mask = ((dfcorr>0.75) & (dfcorr<1)).any(axis=0)
dfcorr[mask].T.style.applymap(color_val_red)

Unnamed: 0,Households Served by Municipal Trash Program,Households Served by Municipal Recycling Program,Trash Bin Size Ranking,Recycle Bin Size Ranking
What is the annual fee?,-0.018321,-0.020551,0.071906,0.067112
What is the transfer station access fee?,-0.112409,-0.113328,-0.319931,-0.335387
What is the per-visit fee?,-0.044233,-0.041662,-0.198586,-0.215054
Households Served by Municipal Trash Program,1.0,0.997106,0.276656,0.21581
Annual Bulky Waste Limit,0.102215,0.09839,0.222642,0.198364
Tip Fee as of 1/1/2020,-0.011109,-0.0088,-0.112853,-0.174166
Maximum # bags/ barrels per week,0.041469,0.037119,-0.021528,0.395508
Households Served by Municipal Recycling Program,0.997106,1.0,0.280911,0.219082
# Hours Enforcement Personnel on Street,0.187082,0.205081,0.141942,0.145346
%recycle/hh,-0.149513,-0.152167,0.03908,-0.049924


Like last time: `Households Served by Municipal Trash Program` and `Households Served by Municipal Recycling Program` are correlated. I think I feel ok removing both these columns because this is similar to, or will scale linearly with population. Population isn't in this table but it is in the "cluster" data. However, I would need something like this for the baseline regression... okay, I'll keep one, let's go with `Households Served by Municipal Recycling Program` since this whole analysis is about recycling. Maybe I'll run two regressions, with and without this feature, just to see how it does.

`Trash Bin Size Ranking` and `Recycle Bin Size Ranking` are correlated. This is tricky because I'm interested in both of these features. However, a lot of the time, if a municipality as a big trash can, they have a big recycling can and vice versa. I will keep the `Recycle Bin Size Ranking` as is, since this is a recycling study, and remove `Trash Bin Size Ranking`.


In [40]:
df_for_regression.drop(columns=['Trash Bin Size Ranking', 'Households Served by Municipal Trash Program'], inplace=True)

In [41]:
# Remaking the correlation matrix

dfcorr= df_for_regression.corr()

In [42]:
mask = ((dfcorr>0.75) & (dfcorr<1)).any(axis=0)

In [43]:
dfcorr[mask].T.style.applymap(color_val_red)

What is the annual fee?
What is the transfer station access fee?
What is the per-visit fee?
Annual Bulky Waste Limit
Tip Fee as of 1/1/2020
Maximum # bags/ barrels per week
Households Served by Municipal Recycling Program
# Hours Enforcement Personnel on Street
%recycle/hh
Recycle Bin Size Ranking


Cool, no more correlations above 0.75.

In [44]:
# # For saving file for this version of the cleaned dataset
# df_for_regression.to_csv('data/data_for_regression_unencoded1.csv', index=True)

Categorical Features removed from last notebook:

In [45]:
df_for_regression2 = df_for_regression.drop(columns=['Solid Waste program funded by annual fee?', 'Solid Waste program funded by per-visit fee?', 'Solid Waste program funded by PAYT/ SMART revenue?',
    'Enforced Trash Limits at Curb', 'Enforced Mandatory Recycling', 'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
    'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program', 'Applies to Commercial Generators.1'])

In [46]:
# # For saving file for this version of the cleaned dataset
# df_for_regression2.to_csv('data/data_for_regression_unencoded2.csv', index=True)

I guess I need to use an ordinal encoder for HGB.

In [47]:
from sklearn.preprocessing import OrdinalEncoder

In [48]:
cat_cols = list(df_for_regression.select_dtypes(include=[object]).columns)
cat_cols

['Solid Waste program funded by property tax?',
 'Solid Waste program funded by annual fee?',
 'Solid Waste program funded by transfer station access fee?',
 'Solid Waste program funded by per-visit fee?',
 'Solid Waste program funded by PAYT/ SMART revenue?',
 'PAYT/ SMART',
 'Municipal Buildings Trash and Recycling Service',
 'School Trash and Recycling Service',
 'Business Trash and Recycling Service',
 'Non-resident Trash and Recycling Service',
 'Trash Service Type',
 'Carts for Trash',
 'Does trash disposal tonnage include bulky waste?',
 'Fee for bulky waste?',
 'Enforced Trash Limits at Curb',
 'Trash Enforced by Muni',
 'Trash Enforced by Hauler',
 'Dedicated Trash Enforcement Personnel',
 'Recycling Service Type',
 'Recycling Collection Frequency',
 'SS Recycling',
 'Carts for Recycling',
 'Enforced Mandatory Recycling',
 'Applies to Residential Generators Eligible to be Served by Municipal Program',
 'Applies to Residential Generators not Eligible to be Served by the Municip

In [49]:
oe = OrdinalEncoder()
df_for_regression[cat_cols] = oe.fit_transform(df_for_regression[cat_cols])
df_for_regression['Recycling Collection Frequency'].unique()

array([2., 1., 0.])

In [50]:
cat_cols1 = list(df_for_regression2.select_dtypes(include=[object]).columns)

In [51]:
oe2 = OrdinalEncoder()
df_for_regression2[cat_cols1] = oe2.fit_transform(df_for_regression2[cat_cols1])
df_for_regression2['Recycling Collection Frequency'].unique()

array([2., 1., 0.])

### Making the Baseline Regression *

I'm just looking at the HistGradientBoost algorithm.

In [52]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
# from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# from sklearn.feature_selection import RFE

In [53]:
# Data with none of the primary questions removed
X0 = df_for_regression.drop(columns=['%recycle/hh'])
y0 = df_for_regression['%recycle/hh']

# Data with the primary questions removed
X1 = df_for_regression2.drop(columns=['%recycle/hh'])
y1 = df_for_regression2['%recycle/hh']

In [54]:
def fit_model(model, name):
    model.fit(X_train, y_train)
    score_train = model.score(X_train, y_train)
    score_test = model.score(X_test, y_test)
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    fit_results[name] = (score_train, score_test, rmse, mae)
    return

In [55]:
fit_results = {}

### No removed data

In [56]:
# cat_cols = list(X0.select_dtypes(include=[object]).columns)

# mask = [X0.columns.get_loc(col) for col in cat_cols]

mask = [True if col in cat_cols else False for col in X0.columns]

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X0, y0, random_state=8)

In [59]:
model0 = DummyRegressor()
fit_model(model0, 'dummy')

In [60]:
hgb = HistGradientBoostingRegressor(max_iter=1_000, min_samples_leaf=5, max_depth=5, categorical_features = mask, random_state=123)
fit_model(hgb, 'GradBoost')

In [61]:
pd.DataFrame(fit_results, index=['Train Score', 'Test Score','rmse', 'mae']).T

Unnamed: 0,Train Score,Test Score,rmse,mae
dummy,0.0,-0.001343,0.086263,0.065243
GradBoost,0.999998,0.01777,0.085436,0.06529


In [62]:
params = {
#     'learning_rate' : [0.05, 0.1, 0.2],
    'max_iter': [100,1000,10000],
    'max_depth': [3,5,None],
    'min_samples_leaf': [3,5,10],
    'max_bins': [20,255]
}

In [63]:
gs = GridSearchCV(HistGradientBoostingRegressor( categorical_features = mask, random_state=123), param_grid=params, n_jobs=-1)
fit_model(gs, 'gs_hgb')

In [64]:
pd.DataFrame(fit_results, index=['Train Score', 'Test Score','rmse', 'mae']).T

Unnamed: 0,Train Score,Test Score,rmse,mae
dummy,0.0,-0.001343,0.086263,0.065243
GradBoost,0.999998,0.01777,0.085436,0.06529
gs_hgb,0.954706,-0.076826,0.089455,0.0677


In [65]:
gs.best_estimator_

HistGradientBoostingRegressor(categorical_features=[True, True, True, True,
                                                    True, False, False, False,
                                                    True, True, True, True,
                                                    True, True, True, True,
                                                    True, False, False, True,
                                                    False, True, True, True,
                                                    False, True, True, True,
                                                    True, True, ...],
                              max_bins=20, max_depth=5, min_samples_leaf=3,
                              random_state=123)

### Removed data

In [66]:
# cat_cols1 = list(X1.select_dtypes(include=[object]).columns)
# mask = [X1.columns.get_loc(col) for col in cat_cols1]
mask = [True if col in cat_cols1 else False for col in X1.columns]

In [67]:
mask

[True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False]

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X1, y1, random_state=8)

In [69]:
model1 = DummyRegressor()
fit_model(model1, 'dummy1')

In [70]:
hgb1 = HistGradientBoostingRegressor(max_iter=1_000, min_samples_leaf=5, max_depth=5, categorical_features = mask, random_state=123)
fit_model(hgb1, 'GradBoost1')

In [71]:
pd.DataFrame(fit_results, index=['Train Score', 'Test Score','rmse', 'mae']).T

Unnamed: 0,Train Score,Test Score,rmse,mae
dummy,0.0,-0.001343,0.086263,0.065243
GradBoost,0.999998,0.01777,0.085436,0.06529
gs_hgb,0.954706,-0.076826,0.089455,0.0677
dummy1,0.0,-0.001343,0.086263,0.065243
GradBoost1,0.999996,0.008915,0.08582,0.065543


In [84]:
params = {
    'learning_rate' : [0.05, 0.1, 0.2],
#     'max_iter': [100,1000,10000],
#     'max_depth': [3,5,None],
#     'min_samples_leaf': [3,5,10],
#     'max_bins': [20,255]
}

In [73]:
gs1 = GridSearchCV(HistGradientBoostingRegressor( categorical_features = mask, random_state=123), param_grid=params, n_jobs=-1)
fit_model(gs1, 'gs_hgb1')

In [74]:
pd.DataFrame(fit_results, index=['Train Score', 'Test Score','rmse', 'mae']).T

Unnamed: 0,Train Score,Test Score,rmse,mae
dummy,0.0,-0.001343,0.086263,0.065243
GradBoost,0.999998,0.01777,0.085436,0.06529
gs_hgb,0.954706,-0.076826,0.089455,0.0677
dummy1,0.0,-0.001343,0.086263,0.065243
GradBoost1,0.999996,0.008915,0.08582,0.065543
gs_hgb1,0.949586,-0.151719,0.092514,0.070577


In [76]:
gs1.best_estimator_

HistGradientBoostingRegressor(categorical_features=[True, True, False, False,
                                                    False, True, True, True,
                                                    True, True, True, True,
                                                    True, True, False, False,
                                                    False, True, True, True,
                                                    False, True, True, True,
                                                    True, True, True, True,
                                                    True, True, ...],
                              max_bins=20, max_depth=5, min_samples_leaf=3,
                              random_state=123)

In [87]:
hgb1 = HistGradientBoostingRegressor(max_iter=10_000, min_samples_leaf=3, max_depth=5, categorical_features = mask, max_bins=20, random_state=123)
gs2 = GridSearchCV(hgb1, params, n_jobs=-1)
fit_model(gs2, 'gs_hgb_2')

In [88]:
pd.DataFrame(fit_results, index=['Train Score', 'Test Score','rmse', 'mae']).T

Unnamed: 0,Train Score,Test Score,rmse,mae
dummy,0.0,-0.001343,0.086263,0.065243
GradBoost,0.999998,0.01777,0.085436,0.06529
gs_hgb,0.954706,-0.076826,0.089455,0.0677
dummy1,0.0,-0.001343,0.086263,0.065243
GradBoost1,0.999996,0.008915,0.08582,0.065543
gs_hgb1,0.949586,-0.151719,0.092514,0.070577
test1,0.949586,-0.151719,0.092514,0.070577
GradBoost1_2,1.0,-0.293163,0.09803,0.074484
gs_hgb_2,1.0,-0.293163,0.09803,0.074484


In [151]:
results = pd.DataFrame(fit_results, index=['Score','RMSE']).T
results.sort_values('Score', ascending = False)

Unnamed: 0,Score,RMSE
vr_bagging_lr_w_rfe_ridge,0.241313,0.063418
rfe_ridge_lr,0.236891,0.063603
gs_rfe_ridge_lr,0.228621,0.063946
plain_ridge,0.220996,0.064262
rfe_lr_lr,0.219112,0.064339
lr_scaler,0.218597,0.064361
plain_linreg,0.218597,0.064361
gs_on_ridge,0.209219,0.064746
bagging,0.153595,0.066984
RFR,0.137405,0.067622


In [152]:
model8.named_steps

{'standardscaler': StandardScaler(),
 'rfe': RFE(estimator=Ridge(), n_features_to_select=11),
 'linearregression': LinearRegression()}

In [153]:
rfe_lr_lr_coefs = pd.Series(data= model9.named_steps['linearregression'].coef_, index = X.columns[model9.named_steps['rfe'].support_], name= 'rfe_lr_lr')

In [154]:
rfe_ridge_lr_coefs = pd.Series(data= model8.named_steps['linearregression'].coef_, index = X.columns[model8.named_steps['rfe'].support_], name='rfe_ridge_lr')

In [155]:
bagging_coefs = np.mean([
    tree.feature_importances_ for tree in model4.estimators_
], axis=0)
# model4.estimators_[0].feature_importances_

In [156]:
model_coefs = pd.DataFrame(data={'plain_linreg':model1.coef_, 'plain_ridge':model6.named_steps['ridge'].coef_, "bagging":bagging_coefs},index=X.columns)

In [157]:
model_coefs = model_coefs.merge(rfe_lr_lr_coefs,how = 'left',left_index=True,right_index=True)

In [158]:
model_coefs = model_coefs.merge(rfe_ridge_lr_coefs,how = 'left',left_index=True,right_index=True)

In [159]:
model_coefs = model_coefs.fillna(0)

In [161]:
model_coefs['vr_bag_w_rfe_ridge_lr'] = (model_coefs['bagging'] + model_coefs['rfe_ridge_lr']) / 2

In [178]:
model_coefs.sort_values(by='vr_bag_w_rfe_ridge_lr')

Unnamed: 0,plain_linreg,plain_ridge,bagging,rfe_lr_lr,rfe_ridge_lr,vr_bag_w_rfe_ridge_lr
Recycling Collection Frequency_Bi-weekly,-0.0435297,-0.013459,0.004723,-0.013893,-0.016732,-0.006004
Non-resident Trash and Recycling Service_Trash,-0.1436053,-0.00992,8.1e-05,0.0,-0.00958,-0.004749
# Hours Enforcement Personnel on Street,-0.001566063,-0.01894,0.009754,-0.013662,-0.017136,-0.003691
Non-resident Trash and Recycling Service_Recycling,-0.05296178,-0.008017,0.004352,0.0,-0.009221,-0.002434
Municipal Buildings Trash and Recycling Service_Trash,2.449933e-12,0.0,0.0,0.0,0.0,0.0
Dedicated Trash Enforcement Personnel,0.007588894,0.001495,0.00039,0.0,0.0,0.000195
Fee for bulky waste?,-0.02783589,-0.011908,0.012969,0.0,-0.010811,0.001079
Trash Enforced by Hauler,-0.02427997,-0.008423,0.002407,0.0,0.0,0.001203
Recycling Service Type_Curbside,-0.02325173,-0.004533,0.003942,0.0,0.0,0.001971
Trash Enforced by Muni,0.02157855,0.007229,0.004277,0.0,0.0,0.002139


The best non-ensemble models were Linear Regressions using Ridge or Linear Regression RFE before fitting. Let's see what theses models said are the most important features...

**Negative Correlations** -- *presences of these features or higher amounts of these features **reduce** recycling effectiveness. /b/ indicates these features were in both the RFE results with linear regression and ridge. If no /b/ is present, the feature only appeard in the ridge results.*

* /b/`# Hours Enforcement Personnel on Street` -- surprising...
* /b/`Recycling Collection Frequency_Bi-weekly` -- doesn't do as well as recycling weekly, I suppose
* /b/`Households Served by Municipal Recycling Program` -- this is probably showing some indication of population size
* `Fee for bulky waste?`  -- I'm not sure what to make of this. Could it be that some of the bulky waste is recyclable? like kiddy-pools?
* `Non-resident Trash and Recycling Service` (Trash + Recycling) -- Not sure why servicing non-residental buildings would lead to lower recycling over all..

**Positive Correlations** -- *presences of these features or higher amounts of these features **increase** recycling effectiveness. /b/ indicates these features were in both the RFE results with linear regression and ridge. If no /b/ is present, the feature only appeard in the ridge results.*
* `Business Trash and Recycling Service_Recycling` -- I think this makes sense because if you are encouraged to recycle at work or out and about, you're probably more likely to recycle at home. Or maybe municipalities that are more recycling-conscious would want to implement comprehensive recycling outside just residential homes... hard to say. That's why I'm surprised "non-residential services" somehow led to lower recycling...
* `Trash Service Type_Curbside` -- More convenient for the participants, but I'm surprised *recycling* curbside didn't matter more...
* /b/`Dedicated Mandatory Recycling Enforcement Personnel` -- LOL ... so why is hours of enforcement a negative impact?
* /b/`Carts for Recycling` -- Makes sense. No excuse to not recycle if you already have a bin
* /b/`PAYT/ SMART` -- Makes sense, if you pay for the amount you throw out, you probably want to minimize your volume of trash by recycling. Plus, that money goes to further waste handling funding!

The Voting Regression model that used both the linear regression with ridge RFE and Bagging did have the highest score by a small amount. It shares a lot of the same important features (makes sense since half of this model is the linear regression with ridge RFE) but there are some difference:
* bulky waste fee impact was flipped (but also insignificant)
* Households Served by Municipal Recycling Program impact was flipped (and is highly significant)
* Seemed a lot more concerned with funding. In addition to `PAYT/ SMART`, the bagging regressor brought in: `What is the annual fee?`, `What is the transfer station access fee`, `Tip Fee as of 1/1/2020`
* Seemed to focus more on positive correlations actually

If I want to put more of an emphasis on funding, I may want to select to voting regressor. However, I'll probably also compare future results to the linear regression using ridge RFE. My major take aways in seeing the erratic scores and sometimes logically-opposing feature importances, I think running a predictive model on the population as a whole does not make sense. Assessing smaller sub-populations will probably yield better scores and insights. That's the theory anyways...

In [169]:
results = results.sort_values('Score', ascending=False)

In [170]:
# # For saving files

# model_coefs.to_csv('data/baseline_models_coefs.csv', index=True)
# results.to_csv('data/baseline_models_scores.csv', index=True)