# Recycling Effectiveness in MA

### *Part 3: Baseline Regressions for Recycling Rate in Total Population Based on Service Attributes of Each Municipality*


In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.graph_objs import *
import matplotlib.pyplot as plt
import seaborn as sns


## Importing and Inspection of Data

In [2]:
# Import the 2019 municipal survey results into a df
# usecols is just trimming off additional columns that had to do with special/hazardous recyclables

serv19 = pd.read_csv('data/MA_MSW_Collection_Data/serv19cleaned.csv', index_col='Municipality Name')
serv19.head()

Unnamed: 0_level_0,Contact Name,Total Number of Households,Households Served by Municipal Trash Program,Households Served by Municipal Recycling Program,Trash Service Type,Carts for Trash,Trash Cart size,Recycling Service Type,Recycling Collection Frequency,SS Recycling,...,Does trash disposal tonnage include bulky waste?,Bulky waste tonnage,Fee for bulky waste?,Annual Bulky \nWaste \nLimit,Tip Fee as of 1/1/2020,Tons Single Stream Recyclables,Newspaper,Cardboard,Mixed Paper,Commingled
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Angela Dahlstrom,6558.0,4486.0,4486.0,Curbside,Yes,64.0,Curbside,Weekly,Yes,...,Yes,,Yes,,86.5,1413.42,,,,
Acton,Corey York,9800.0,3846.0,4335.0,Drop-off,,,Drop-off,,,...,Yes,,Yes,,57.16,,,,683.04,407.27
Acushnet,Dan Menard,4304.0,3591.0,3591.0,Curbside,Yes,65.0,Both,Bi-weekly,Yes,...,No,41.0,Yes,,64.6,879.5,3.7,20.0,16.94,27.4
Adams,Linda Cernik,3867.0,664.0,664.0,Drop-off,,,Drop-off,,,...,No,4.43,Yes,,110.0,,,,94.13,45.48
Agawam,Tracy DeMaio,12031.0,8879.0,8879.0,Curbside,Yes,65.0,Curbside,Bi-weekly,Yes,...,No,275.17,Yes,30.0,74.0,2238.0,,,,


In [3]:
serv19.columns

Index(['Contact Name', 'Total Number of Households',
       'Households Served by Municipal Trash Program',
       'Households Served by Municipal Recycling Program',
       'Trash Service Type', 'Carts for Trash', 'Trash Cart size',
       'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size',
       'Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service',
       'Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       '

In [4]:
cols_to_use = ['Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       'Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service', 'Households Served by Municipal Trash Program', 'Trash Service Type', 'Carts for Trash', 'Trash Cart size', 'Does trash disposal tonnage include bulky waste?','Fee for bulky waste?',
       'Annual Bulky \nWaste \nLimit', 'Tip Fee as of 1/1/2020', 'Enforced Trash Limits at Curb', 'Maximum # bags/ barrels per week',
       'Barrel size in gallons (eg 32 64 etc)', 'Trash Enforced by Muni',
       'Trash Enforced by Hauler', 'Dedicated Trash Enforcement Personnel','Households Served by Municipal Recycling Program', 'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size', 'Enforced Mandatory Recycling',
       'Applies to Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
       'Applies to Commercial Generators', 'Recycling Enforced by Muni',
       'Recycling Enforced by Hauler',
       'Dedicated Mandatory Recycling Enforcement Personnel',
       '# Hours Enforcement Personnel on Street', 'Private Hauler regulations that require recycling',
       'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1'
              ]

In [5]:
# general information related to funding for service.

serv_fund = ['Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',]

In [6]:
# general service information for non-residental buildings

gen_serv = ['Municipal Buildings Trash and Recycling Service',
       'School Trash and Recycling Service',
       'Business Trash and Recycling Service',
       'Non-resident Trash and Recycling Service', ]

In [7]:
# quantities and specifications related to trash services

trash_serv = ['Households Served by Municipal Trash Program', 'Trash Service Type', 'Carts for Trash', 'Trash Cart size', 'Does trash disposal tonnage include bulky waste?','Fee for bulky waste?',
       'Annual Bulky \nWaste \nLimit', 'Tip Fee as of 1/1/2020', 'Enforced Trash Limits at Curb', 'Maximum # bags/ barrels per week',
       'Barrel size in gallons (eg 32 64 etc)', 'Trash Enforced by Muni',
       'Trash Enforced by Hauler', 'Dedicated Trash Enforcement Personnel', ]

In [8]:
# quantitative data on trash tonnages collected

trash_tonnage_data = ['Households Served by Municipal Trash Program', 'Trash Disposal Tonnage', 'Bulky waste tonnage', ]

In [9]:
# quantities and specifications related to recycle services

recycle_serv= ['Households Served by Municipal Recycling Program', 'Recycling Service Type', 'Recycling Collection Frequency',
       'SS Recycling', 'Carts for Recycling', 'Recycling Cart Size', 'Enforced Mandatory Recycling',
       'Applies to Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
       'Applies to Commercial Generators', 'Recycling Enforced by Muni',
       'Recycling Enforced by Hauler',
       'Dedicated Mandatory Recycling Enforcement Personnel',
       '# Hours Enforcement Personnel on Street', 'Private Hauler regulations that require recycling',
       'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1']

In [10]:
# Import summary data for municipal waste tonnages in 2019

tonnages19 = pd.read_csv('data/MA_MSW_Collection_Data/musum19.csv', index_col='Municipality Name')
tonnages19.head()

Unnamed: 0_level_0,tot_households,stream_type,tons_ss_recyclables,tons_ms_recyclables,tons_recyclables_total,hh_served_by_mu_recycle,tons_recyclables/hh,hh_served_by_mu_trash,tons_trash_total,tons_trash/hh,%recycle/hh,total_waste/hh
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Abington,6558.0,ss,1413.42,0.0,1413.42,4486.0,0.315074,4486.0,3826.66,0.853023,0.269733,1.168096
Acton,9800.0,ms,0.0,1090.31,1090.31,4335.0,0.251513,3846.0,2148.67,0.558677,0.310437,0.81019
Acushnet,4304.0,ss+,879.5,68.04,947.54,3591.0,0.263865,3591.0,3446.38,0.959727,0.215648,1.223592
Adams,3867.0,ms,0.0,139.61,139.61,664.0,0.210256,664.0,134.47,0.202515,0.509377,0.412771
Agawam,12031.0,ss,2238.0,0.0,2238.0,8879.0,0.252055,8879.0,6717.17,0.756523,0.249912,1.008579


## Setting up data for regression

In [11]:
df_for_regression = serv19.loc[:,cols_to_use].merge(tonnages19.loc[:,['%recycle/hh','total_waste/hh']], left_index=True, right_index=True)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Recycling Enforced by Muni,Recycling Enforced by Hauler,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,Yes,,,,,,,,,Both,...,Yes,Yes,Yes,20.0,Yes,Yes,Yes,Yes,0.269733,1.168096
Acton,,,Yes,Yes,Yes,,100.0,30.0,Yes,Both,...,Yes,,No,,,,,,0.310437,0.81019


In [12]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Abington to Yarmouth
Data columns (total 47 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       192 non-null    object 
 1   Solid Waste program funded by annual fee?                                                         85 non-null     object 
 2   Solid Waste program funded by transfer station access fee?                                        73 non-null     object 
 3   Solid Waste program funded by per-visit fee?                                                      18 non-null     object 
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                132 non-null    object 

### Dealing with NaNs

There are lot of columns missing quite a few values. 277 entries, let's look at anything that has less than 271 non-null values. I can see the features with 271 non-null values are datapoints that I knew were missing some data. I get to that last.

In [13]:
df_for_regression.count()[df_for_regression.count() < 271].sort_values()

Solid Waste program funded by per-visit fee?                                                         18
What is the per-visit fee?                                                                           18
Trash Enforced by Muni                                                                               31
Dedicated Trash Enforcement Personnel                                                                31
# Hours Enforcement Personnel on Street                                                              33
Trash Enforced by Hauler                                                                             35
Annual Bulky \nWaste \nLimit                                                                         48
Barrel size in gallons (eg 32 64 etc)                                                                50
Maximum # bags/ barrels per week                                                                     50
Enforced Trash Limits at Curb                                   

Okay, some of the NaNs are because they really mean "no" or "0", like "Solid Waste program funded by per-visit fee?". Others are numeric and I'll need to handle those appropriately.

In [14]:
# enter each feature name to see why they may have NaNs
df_for_regression['Recycling Cart Size'].value_counts()

96.0    33
64.0    25
95.0    12
65.0     7
35.0     1
0.0      1
Name: Recycling Cart Size, dtype: int64

In [15]:
# columns with only "yes", and NaN likely means "no"

OHE_list_1 = ['Solid Waste program funded by per-visit fee?', 'Trash Enforced by Muni', 'Trash Enforced by Hauler', 'Enforced Trash Limits at Curb', 'Applies to Commercial Generators.1', 
'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program', 'Recycling Enforced by Hauler', 'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
'Applies to Commercial Generators', 'Carts for Trash', 'Private Hauler regulations that require recycling', 'Solid Waste program funded by transfer station access fee?', 'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
'Carts for Recycling', 'Solid Waste program funded by annual fee?', 'Recycling Enforced by Muni', 'Solid Waste program funded by PAYT/ SMART revenue?', 'PAYT/ SMART', 'SS Recycling', 'Applies to Residential Generators Eligible to be Served by Municipal Program',
'Solid Waste program funded by property tax?']

In [16]:
# columns with "yes" and "no", but I will assume NaN still likely means "no"
# note that enforcement personnel has only 31 non-null values but # of hours enforces is 33.

OHE_list_2 = ['Dedicated Trash Enforcement Personnel', 'Dedicated Mandatory Recycling Enforcement Personnel', 'Enforced Mandatory Recycling', 'Fee for bulky waste?']

In [17]:
# columns where NaN likely means "0"

fill_w_zero = ['What is the per-visit fee?', '# Hours Enforcement Personnel on Street',  'Annual Bulky \nWaste \nLimit', 'Maximum # bags/ barrels per week', 'What is the transfer station access fee?', 'What is the annual fee?', 'Tip Fee as of 1/1/2020']

In [18]:
# columns that have multiple options 
# barrel size and cart size can probably be simplified to 32, 48, 64, and 96

# barrel size and trash cart size, only impute where recycling service type is curbside or both, otherwise 0

impute_w_mode = ['Barrel size in gallons (eg 32 64 etc)', 'Trash Cart size', 'Recycling Cart Size', 'Recycling Collection Frequency']


In [19]:
for col in OHE_list_1:
    df_for_regression[col] = df_for_regression[col].str.replace('Yes', '1')
    df_for_regression[col].fillna(0, inplace = True)
    df_for_regression[col] = pd.to_numeric(df_for_regression[col])

In [20]:
for col in OHE_list_2:
    df_for_regression[col] = df_for_regression[col].str.replace('Yes', '1')
    df_for_regression[col] = df_for_regression[col].str.replace('No', '0')
    df_for_regression[col].fillna(0, inplace = True)
    df_for_regression[col] = pd.to_numeric(df_for_regression[col])

In [21]:
df_for_regression[fill_w_zero] = df_for_regression[fill_w_zero].fillna(0)

I'm struggling with the last group because if bin size is missing, it probably means there's no limit of what you can put out or it's drop off only. I could see bin size being important because if the bin is too small, they may recycle less. If no bin is needed, maybe you can recycle more?

It's almost like... you could make a bin scale:
```
0       no curbside pick up
0.25    0-35 gal bins
0.50    35-65 gal bins
0.75    65-96 gal bins
1       no bins provided
```
I also think that Barrel size and Trash cart size can be joined into one. I think Trash cart size is in reference to specifically supplied bins whle barrel size refers to max bin sizes accept. I could take the Barrel size and any missing values, first try to fill with trash cart size. Then if there is curbside pickup, missing values are 1 and if no curbside, values are 0.


Lastly, recycling frequency is only missing valuse because of drop-off only. I think I can still just OHE it? hopefully weekly would be `0 1`, biweekly would be `1 0`, and drop-off only would be `0 0`.

In [22]:
recycle_collection_encoding = pd.get_dummies(pd.DataFrame(df_for_regression['Recycling Collection Frequency']))

In [23]:
recycle_collection_encoding

Unnamed: 0_level_0,Recycling Collection Frequency_Bi-weekly,Recycling Collection Frequency_Weekly
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Abington,0,1
Acton,0,0
Acushnet,1,0
Adams,0,0
Agawam,1,0
...,...,...
Woburn,1,0
Worcester,0,1
Worthington,0,0
Wrentham,1,0


In [24]:
df_for_regression = df_for_regression.merge(recycle_collection_encoding, left_index=True,right_index=True)
df_for_regression.head()

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh,Recycling Collection Frequency_Bi-weekly,Recycling Collection Frequency_Weekly
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,1,0,0,0,0,0.0,0.0,0.0,0,Both,...,1,20.0,1,1,1,1,0.269733,1.168096,0,1
Acton,0,0,1,1,1,0.0,100.0,30.0,1,Both,...,0,0.0,0,0,0,0,0.310437,0.81019,0,0
Acushnet,1,0,0,0,0,0.0,0.0,0.0,0,Both,...,0,0.0,1,1,0,0,0.215648,1.223592,1,0
Adams,1,0,1,0,1,0.0,50.0,0.0,1,Both,...,0,0.0,1,1,0,1,0.509377,0.412771,0,0
Agawam,1,0,0,0,0,0.0,0.0,0.0,0,Both,...,0,0.0,0,0,0,0,0.249912,1.008579,1,0


In [25]:
df_for_regression.drop(columns='Recycling Collection Frequency', inplace=True)

In [26]:
def bin_rank(size, serve_type, limit_size):
    if limit_size != np.nan:
        x = (
            0 if serve_type == 'Drop-off' or serve_type == 'None' else
            0.25 if limit_size < 35 else
            0.50 if limit_size < 65 else
            0.75 if limit_size < 96 else
            1)
    else:
        x = (
            0 if serve_type == 'Drop-off' or serve_type == 'None' else
            0.25 if size < 35 else
            0.50 if size < 65 else
            0.75 if size < 96 else
            1)
    return x

In [27]:
df_for_regression['Trash Service Type'].unique()

array(['Curbside', 'Drop-off', 'Both', 'None'], dtype=object)

In [28]:
df_for_regression['Trash Bin Size Ranking'] = df_for_regression.apply(lambda row: bin_rank(row['Trash Cart size'], row['Trash Service Type'],row['Barrel size in gallons (eg 32 64 etc)']), axis=1)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,# Hours Enforcement Personnel on Street,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh,Recycling Collection Frequency_Bi-weekly,Recycling Collection Frequency_Weekly,Trash Bin Size Ranking
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,1,0,0,0,0,0.0,0.0,0.0,0,Both,...,20.0,1,1,1,1,0.269733,1.168096,0,1,0.5
Acton,0,0,1,1,1,0.0,100.0,30.0,1,Both,...,0.0,0,0,0,0,0.310437,0.81019,0,0,0.0


In [29]:
df_for_regression.drop(columns=['Trash Cart size','Barrel size in gallons (eg 32 64 etc)'], inplace=True)

In [30]:
def bin_rank(size, serve_type):
    x = (
        0 if serve_type == 'Drop-off' or serve_type == 'None' else
        0.25 if size < 35 else
        0.50 if size < 65 else
        0.75 if size < 96 else
        1)
    return x

In [31]:
df_for_regression['Recycling Cart Size']

Municipality Name
Abington       64.0
Acton           NaN
Acushnet       96.0
Adams           NaN
Agawam         95.0
               ... 
Woburn          NaN
Worcester       NaN
Worthington     NaN
Wrentham       96.0
Yarmouth        NaN
Name: Recycling Cart Size, Length: 277, dtype: float64

In [32]:
df_for_regression['Recycle Bin Size Ranking'] = df_for_regression.apply(lambda row: bin_rank(row['Recycling Cart Size'], row['Recycling Service Type']), axis=1)
df_for_regression.head(2)

Unnamed: 0_level_0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?,PAYT/ SMART,Municipal Buildings Trash and Recycling Service,...,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1,%recycle/hh,total_waste/hh,Recycling Collection Frequency_Bi-weekly,Recycling Collection Frequency_Weekly,Trash Bin Size Ranking,Recycle Bin Size Ranking
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Abington,1,0,0,0,0,0.0,0.0,0.0,0,Both,...,1,1,1,1,0.269733,1.168096,0,1,0.5,0.5
Acton,0,0,1,1,1,0.0,100.0,30.0,1,Both,...,0,0,0,0,0.310437,0.81019,0,0,0.0,0.0


In [33]:
df_for_regression.drop(columns=['Recycling Cart Size'], inplace=True)

In [34]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Abington to Yarmouth
Data columns (total 47 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       277 non-null    int64  
 1   Solid Waste program funded by annual fee?                                                         277 non-null    int64  
 2   Solid Waste program funded by transfer station access fee?                                        277 non-null    int64  
 3   Solid Waste program funded by per-visit fee?                                                      277 non-null    int64  
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                277 non-null    int64  

In [35]:
# for the last missing values, I'm just going to drop those columns because I was missing the trash data in the original dataset

df_for_regression.dropna(inplace=True)

In [36]:
# fixing some annoying formatting
df_for_regression.rename(columns=dict(zip(df_for_regression.columns, df_for_regression.columns.str.replace('\n',''))), inplace=True)

In [37]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 271 entries, Abington to Yarmouth
Data columns (total 47 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       271 non-null    int64  
 1   Solid Waste program funded by annual fee?                                                         271 non-null    int64  
 2   Solid Waste program funded by transfer station access fee?                                        271 non-null    int64  
 3   Solid Waste program funded by per-visit fee?                                                      271 non-null    int64  
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                271 non-null    int64  

#### Encoding to make all features numeric

In [38]:
from sklearn.preprocessing import LabelEncoder

In [39]:
le = LabelEncoder()
df_for_regression['Does trash disposal tonnage include bulky waste?'] = le.fit_transform(df_for_regression['Does trash disposal tonnage include bulky waste?'])

In [40]:
from category_encoders import OneHotEncoder 

In [41]:
ohe = OneHotEncoder(use_cat_names=True)
df_for_regression = ohe.fit_transform(df_for_regression)

In [42]:
ohe.get_feature_names

<bound method OneHotEncoder.get_feature_names of OneHotEncoder(cols=['Municipal Buildings Trash and Recycling Service',
                    'School Trash and Recycling Service',
                    'Business Trash and Recycling Service',
                    'Non-resident Trash and Recycling Service',
                    'Trash Service Type', 'Recycling Service Type'],
              use_cat_names=True)>

In [43]:
df_for_regression = df_for_regression.drop(columns=['Municipal Buildings Trash and Recycling Service_Neither', 'School Trash and Recycling Service_Neither', 'Business Trash and Recycling Service_Neither',
                               'Non-resident Trash and Recycling Service_Neither', 'Trash Service Type_Drop-off', 'Recycling Service Type_Drop-off'])

In [44]:
df_for_regression.info()

<class 'pandas.core.frame.DataFrame'>
Index: 271 entries, Abington to Yarmouth
Data columns (total 55 columns):
 #   Column                                                                                            Non-Null Count  Dtype  
---  ------                                                                                            --------------  -----  
 0   Solid Waste program funded by property tax?                                                       271 non-null    int64  
 1   Solid Waste program funded by annual fee?                                                         271 non-null    int64  
 2   Solid Waste program funded by transfer station access fee?                                        271 non-null    int64  
 3   Solid Waste program funded by per-visit fee?                                                      271 non-null    int64  
 4   Solid Waste program funded by PAYT/ SMART revenue?                                                271 non-null    int64  

## Regression

### Removing highly correlated data

First off, I don't need `total_waste/hh` because this is a cluster feature (used in notebooke Part 2), rather than a regression prediction feature. So I will remove this.

In [45]:
df_for_regression.drop(columns='total_waste/hh', inplace = True)

I want to look at the correlations for the data to be used for regression. I know some of the information will be _highly_ correlated (like over 0.75) and I will probably want to remove these. But there are 55 features so jumping straight into a heat map may not be super helpful. Instead, I'll make a matrix of the correlations for different subcategories found in the original survey. These subcategories often had a "primary question" and then bunch of follow up questions if the survery-answerer marked yes to the primary question. This is leading to a lot of highly correlated features. After assessing each subcategory, I'll look at the full matrix of remaining features. I will apply a style to these matrices to identify any correlations above 0.75. 

In [46]:
# colors "true" values to make the table more readable
def color_val_red(val):
    color = 'red' if abs(val) > 0.75 and val < 1 else 'black'
    return 'color: {}'.format(color)

In [47]:
# Program Funding Mechanism  
df_for_regression[['Solid Waste program funded by property tax?',
       'Solid Waste program funded by annual fee?',
       'Solid Waste program funded by transfer station access fee?',
       'Solid Waste program funded by per-visit fee?',
       'Solid Waste program funded by PAYT/ SMART revenue?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?']].corr().style.applymap(color_val_red)

Unnamed: 0,Solid Waste program funded by property tax?,Solid Waste program funded by annual fee?,Solid Waste program funded by transfer station access fee?,Solid Waste program funded by per-visit fee?,Solid Waste program funded by PAYT/ SMART revenue?,What is the annual fee?,What is the transfer station access fee?,What is the per-visit fee?
Solid Waste program funded by property tax?,1.0,-0.368231,-0.173991,-0.144251,-0.153305,-0.437962,-0.249015,-0.180522
Solid Waste program funded by annual fee?,-0.368231,1.0,-0.227104,0.141645,-0.062497,0.799488,-0.219411,0.067305
Solid Waste program funded by transfer station access fee?,-0.173991,-0.227104,1.0,0.205468,0.207061,-0.1533,0.710962,0.210263
Solid Waste program funded by per-visit fee?,-0.144251,0.141645,0.205468,1.0,0.006892,0.091526,0.259141,0.862883
Solid Waste program funded by PAYT/ SMART revenue?,-0.153305,-0.062497,0.207061,0.006892,1.0,-0.12679,0.127689,0.067448
What is the annual fee?,-0.437962,0.799488,-0.1533,0.091526,-0.12679,1.0,-0.162668,0.048642
What is the transfer station access fee?,-0.249015,-0.219411,0.710962,0.259141,0.127689,-0.162668,1.0,0.332847
What is the per-visit fee?,-0.180522,0.067305,0.210263,0.862883,0.067448,0.048642,0.332847,1.0


remove `Solid Waste program funded by annual fee?` as this is redundant with `What is the annual fee?`.

remove `Solid Waste program funded by per-visit fee?` as this is redundant with `What is the per-visit fee?`.

In [48]:
df_for_regression.drop(columns=['Solid Waste program funded by annual fee?', 'Solid Waste program funded by per-visit fee?', 'Solid Waste program funded by PAYT/ SMART revenue?'], inplace=True)

In [49]:
# for some reason, these two municipalities said they're funded by PAYT/SMART but they did not respond to the `PAYT/ SMART` question so I'll manually enter it for them
df_for_regression.at[['Shirley', 'Middleborough'],['PAYT/ SMART']] = 1

In [50]:
# Trash Limits

df_for_regression[['Enforced Trash Limits at Curb',
       'Maximum # bags/ barrels per week', 'Trash Enforced by Muni',
       'Trash Enforced by Hauler', 'Dedicated Trash Enforcement Personnel', 'Trash Bin Size Ranking']].corr().style.applymap(color_val_red)

Unnamed: 0,Enforced Trash Limits at Curb,Maximum # bags/ barrels per week,Trash Enforced by Muni,Trash Enforced by Hauler,Dedicated Trash Enforcement Personnel,Trash Bin Size Ranking
Enforced Trash Limits at Curb,1.0,0.793255,0.75559,0.809635,0.41152,0.041874
Maximum # bags/ barrels per week,0.793255,1.0,0.451582,0.664328,0.18802,-0.021528
Trash Enforced by Muni,0.75559,0.451582,1.0,0.622123,0.544634,0.040247
Trash Enforced by Hauler,0.809635,0.664328,0.622123,1.0,0.274815,0.027197
Dedicated Trash Enforcement Personnel,0.41152,0.18802,0.544634,0.274815,1.0,0.02157
Trash Bin Size Ranking,0.041874,-0.021528,0.040247,0.027197,0.02157,1.0


`Enforced Trash Limits at Curb` needs to go. It's a primary question and the other questions are follow ups. So anyone who doesn't enforce,  all the other cateories will be 0.

In [51]:
df_for_regression.drop(columns=['Enforced Trash Limits at Curb'], inplace=True)

In [52]:
# Mandatory Recycling

df_for_regression[['Enforced Mandatory Recycling',
       'Applies to Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Residential Generators not Eligible to be Served by the Municipal Program',
       'Applies to Commercial Generators', 'Recycling Enforced by Muni',
       'Recycling Enforced by Hauler',
       'Dedicated Mandatory Recycling Enforcement Personnel',
       '# Hours Enforcement Personnel on Street']].corr().style.applymap(color_val_red)

Unnamed: 0,Enforced Mandatory Recycling,Applies to Residential Generators Eligible to be Served by Municipal Program,Applies to Residential Generators not Eligible to be Served by the Municipal Program,Applies to Commercial Generators,Recycling Enforced by Muni,Recycling Enforced by Hauler,Dedicated Mandatory Recycling Enforcement Personnel,# Hours Enforcement Personnel on Street
Enforced Mandatory Recycling,1.0,0.698249,0.267236,0.488373,0.878021,0.589756,0.42084,0.300477
Applies to Residential Generators Eligible to be Served by Municipal Program,0.698249,1.0,0.479167,0.438797,0.613077,0.411796,0.293851,0.209808
Applies to Residential Generators not Eligible to be Served by the Municipal Program,0.267236,0.479167,1.0,0.661157,0.249355,0.149594,0.12999,0.139982
Applies to Commercial Generators,0.488373,0.438797,0.661157,1.0,0.446735,0.239424,0.191462,0.181045
Recycling Enforced by Muni,0.878021,0.613077,0.249355,0.446735,1.0,0.35599,0.479305,0.342221
Recycling Enforced by Hauler,0.589756,0.411796,0.149594,0.239424,0.35599,1.0,0.328417,0.154383
Dedicated Mandatory Recycling Enforcement Personnel,0.42084,0.293851,0.12999,0.191462,0.479305,0.328417,1.0,0.713994
# Hours Enforcement Personnel on Street,0.300477,0.209808,0.139982,0.181045,0.342221,0.154383,0.713994,1.0


Remove `Enforced Mandatory Recycling` as this is the primary question and everything else is follow-up.

In [53]:
df_for_regression.drop(columns=['Enforced Mandatory Recycling'], inplace=True)

In [54]:
df_for_regression.columns

Index(['Solid Waste program funded by property tax?',
       'Solid Waste program funded by transfer station access fee?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       'Municipal Buildings Trash and Recycling Service_Both',
       'Municipal Buildings Trash and Recycling Service_Recycling',
       'Municipal Buildings Trash and Recycling Service_Trash',
       'School Trash and Recycling Service_Both',
       'School Trash and Recycling Service_Recycling',
       'Business Trash and Recycling Service_Recycling',
       'Business Trash and Recycling Service_Both',
       'Non-resident Trash and Recycling Service_Both',
       'Non-resident Trash and Recycling Service_Recycling',
       'Non-resident Trash and Recycling Service_Trash',
       'Households Served by Municipal Trash Program',
       'Trash Service Type_Curbside', 'Trash Service Type_Both',
       'Carts for Trash', 'Does trash dispos

In [55]:
# Private Hauler Regulations  

df_for_regression[['Private Hauler regulations that require recycling',
       'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1']].corr().style.applymap(color_val_red)

Unnamed: 0,Private Hauler regulations that require recycling,Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,Applies to Commercial Generators.1
Private Hauler regulations that require recycling,1.0,0.902629,0.863387,0.823806
Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program,0.902629,1.0,0.778674,0.752542
Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program,0.863387,0.778674,1.0,0.813319
Applies to Commercial Generators.1,0.823806,0.752542,0.813319,1.0


This one is a bit trickier because they're all correlated. It seems like in this case I'm going to want to remove all the follow-up questions: `'Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1'`

In [56]:
df_for_regression.drop(columns=['Applies to Haulers Serving Residential Generators Eligible to be Served by Municipal Program',
       'Applies to Haulers Serving Residential Generators Not Eligible to be Served by Municipal Program',
       'Applies to Commercial Generators.1'], inplace=True)

In [57]:
# Now the whole table

dfcorr= df_for_regression.corr()

In [58]:
# dfcorr = dfcorr[abs(dfcorr['%recycle/hh']) >= 0.1]

In [59]:
mask = ((dfcorr>0.75) & (dfcorr<1)).any(axis=0)

In [60]:
dfcorr[mask].T.style.applymap(color_val_red)

Unnamed: 0,Households Served by Municipal Trash Program,Households Served by Municipal Recycling Program,Trash Bin Size Ranking,Recycle Bin Size Ranking
Solid Waste program funded by property tax?,0.121544,0.123729,0.213753,0.266806
Solid Waste program funded by transfer station access fee?,-0.138441,-0.14066,-0.337256,-0.358551
What is the annual fee?,-0.018321,-0.020551,0.071906,0.067112
What is the transfer station access fee?,-0.112409,-0.113328,-0.319931,-0.335387
What is the per-visit fee?,-0.044233,-0.041662,-0.198586,-0.215054
PAYT/ SMART,-0.198205,-0.1964,-0.075483,-0.255131
Municipal Buildings Trash and Recycling Service_Both,0.041967,0.043764,0.158465,0.157938
Municipal Buildings Trash and Recycling Service_Recycling,-0.019238,-0.021324,-0.073676,-0.085135
Municipal Buildings Trash and Recycling Service_Trash,-0.02134,-0.021634,-0.05736,-0.060049
School Trash and Recycling Service_Both,0.072898,0.072404,0.392244,0.448444


`Households Served by Municipal Trash Program` and `Households Served by Municipal Recycling Program` are correlated. I think I feel ok removing both these columns because this is similar to, or will scale linearly with population. Population isn't in this table but it is in the "cluster" data. However, I would need something like this for the baseline regression... okay, I'll keep one, let's go with `Households Served by Municipal Recycling Program` since this whole analysis is about recycling. Maybe I'll run two regressions, with and without this feature, just to see how it does.

`Trash Bin Size Ranking` and `Recycle Bin Size Ranking` are correlated. This is tricky because I'm interested in both of these features. However, a lot of the time, if a municipality as a big trash can, they have a big recycling can and vice versa. I supposed what I could do is keep the `Recycle Bin Size Ranking` as is, since this is a recycling study, but make a new feature--like a manually engineered polynomial in a way--of the ratio of `recycle bin rank:trash bin rank` and I can see how that faires in the correlation matrix.



In [61]:
df_for_regression['recycle bin rank:trash bin rank'] = df_for_regression['Recycle Bin Size Ranking'].div(df_for_regression['Trash Bin Size Ranking'])

In [62]:
df_for_regression[['Trash Bin Size Ranking', 'Recycle Bin Size Ranking', 'recycle bin rank:trash bin rank']][df_for_regression['recycle bin rank:trash bin rank'].isna()].sort_values(by='Trash Bin Size Ranking')

Unnamed: 0_level_0,Trash Bin Size Ranking,Recycle Bin Size Ranking,recycle bin rank:trash bin rank
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Acton,0.0,0.0,
Sandwich,0.0,0.0,
Russell,0.0,0.0,
Royalston,0.0,0.0,
Rowe,0.0,0.0,
...,...,...,...
Douglas,0.0,0.0,
Dennis,0.0,0.0,
Deerfield,0.0,0.0,
Medfield,0.0,0.0,


In [63]:
# looks like we got some NaNs where the rank was 0 for both categories.
df_for_regression['recycle bin rank:trash bin rank'] = df_for_regression['recycle bin rank:trash bin rank'].fillna(0)

In [64]:
df_for_regression[['Trash Bin Size Ranking', 'Recycle Bin Size Ranking', 'recycle bin rank:trash bin rank']][df_for_regression['recycle bin rank:trash bin rank'].isna()]

Unnamed: 0_level_0,Trash Bin Size Ranking,Recycle Bin Size Ranking,recycle bin rank:trash bin rank
Municipality Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


In [65]:
# checking the new column

df_for_regression[['Trash Bin Size Ranking', 'Recycle Bin Size Ranking', 'recycle bin rank:trash bin rank']].corr().style.applymap(color_val_red)

Unnamed: 0,Trash Bin Size Ranking,Recycle Bin Size Ranking,recycle bin rank:trash bin rank
Trash Bin Size Ranking,1.0,0.823893,0.351215
Recycle Bin Size Ranking,0.823893,1.0,0.761509
recycle bin rank:trash bin rank,0.351215,0.761509,1.0


No, the new column is still _too_ strongly correlate (>0.75) with recycle bin ranking so I must remove `Trash Bin Size Ranking` and my new feature to remove any high correlations. I'll also the `Households Served by Municipal Trash Program` feature mentioned above.

In [66]:
df_for_regression.drop(columns=['Trash Bin Size Ranking', 'recycle bin rank:trash bin rank', 'Households Served by Municipal Trash Program'], inplace=True)

In [67]:
# Remaking the correlation matrix

dfcorr= df_for_regression.corr()

In [68]:
mask = ((dfcorr>0.75) & (dfcorr<1)).any(axis=0)

In [69]:
dfcorr[mask].T.style.applymap(color_val_red)

Solid Waste program funded by property tax?
Solid Waste program funded by transfer station access fee?
What is the annual fee?
What is the transfer station access fee?
What is the per-visit fee?
PAYT/ SMART
Municipal Buildings Trash and Recycling Service_Both
Municipal Buildings Trash and Recycling Service_Recycling
Municipal Buildings Trash and Recycling Service_Trash
School Trash and Recycling Service_Both
School Trash and Recycling Service_Recycling
Business Trash and Recycling Service_Recycling
Business Trash and Recycling Service_Both
Non-resident Trash and Recycling Service_Both
Non-resident Trash and Recycling Service_Recycling
Non-resident Trash and Recycling Service_Trash
Trash Service Type_Curbside
Trash Service Type_Both
Carts for Trash
Does trash disposal tonnage include bulky waste?
Fee for bulky waste?
Annual Bulky Waste Limit
Tip Fee as of 1/1/2020
Maximum # bags/ barrels per week
Trash Enforced by Muni
Trash Enforced by Hauler
Dedicated Trash Enforcement Personnel
Households Served by Municipal Recycling Program
Recycling Service Type_Curbside
Recycling Service Type_Both
SS Recycling
Carts for Recycling
Applies to Residential Generators Eligible to be Served by Municipal Program
Applies to Residential Generators not Eligible to be Served by the Municipal Program
Applies to Commercial Generators
Recycling Enforced by Muni
Recycling Enforced by Hauler
Dedicated Mandatory Recycling Enforcement Personnel
# Hours Enforcement Personnel on Street
Private Hauler regulations that require recycling
%recycle/hh
Recycling Collection Frequency_Bi-weekly
Recycling Collection Frequency_Weekly
Recycle Bin Size Ranking


Cool, no more correlations above 0.75.

In [70]:
# # For saving files
# df_for_regression.to_csv('data/data_for_regression.csv', index=True)

### Making the Baseline Regression *

I will test out several algorithms and determine which result in models with the best scores. Of the top scoring-models, I will determine the coefficients or feature importances. These will translate to the most important aspects of recycling / waste management services in maximizing recycling across the entire population. The resulting scores may not be high; however, this will give further credance to my theory that there are sub-groups within the population that can be observed independently to come up with a better predictive model, and thus better service suggests for specific municipalities.

\* due to some data abnormalities found in the regression section, I have started a new notebook [part3b.ipynb](part3b.ipynb).

In [71]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, BaggingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFE

In [72]:
X = df_for_regression.drop(columns=['%recycle/hh'])
y = df_for_regression['%recycle/hh']

In [108]:
X.columns

Index(['Solid Waste program funded by property tax?',
       'Solid Waste program funded by transfer station access fee?',
       'What is the annual fee?', 'What is the transfer station access fee?',
       'What is the per-visit fee?', 'PAYT/ SMART',
       'Municipal Buildings Trash and Recycling Service_Both',
       'Municipal Buildings Trash and Recycling Service_Recycling',
       'Municipal Buildings Trash and Recycling Service_Trash',
       'School Trash and Recycling Service_Both',
       'School Trash and Recycling Service_Recycling',
       'Business Trash and Recycling Service_Recycling',
       'Business Trash and Recycling Service_Both',
       'Non-resident Trash and Recycling Service_Both',
       'Non-resident Trash and Recycling Service_Recycling',
       'Non-resident Trash and Recycling Service_Trash',
       'Trash Service Type_Curbside', 'Trash Service Type_Both',
       'Carts for Trash', 'Does trash disposal tonnage include bulky waste?',
       'Fee for bulky

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [109]:
fit_results = {}

def fit_model(model, name):
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    fit_results[name] = (score, rmse)
    return score

In [None]:
model0 = DummyRegressor()
fit_model(model0, 'dummy')

In [111]:
model1 = LinearRegression(n_jobs=-1)
fit_model(model1, 'plain_linreg')

0.21859654846224486

In [114]:
model2 = RandomForestRegressor(n_jobs=-1, random_state=123)
fit_model(model2, 'RFR')

0.13740517005245223

In [115]:
model3 = HistGradientBoostingRegressor(max_iter=10_000, random_state=123)
fit_model(model3, 'GradBoost')

-0.25126471106226456

In [120]:
model4 = BaggingRegressor(n_jobs=-1, n_estimators=100, random_state=123)
fit_model(model4, 'bagging')

0.15359534479323111

In [121]:
model5 = make_pipeline(StandardScaler(), HistGradientBoostingRegressor(max_iter=10_000, random_state=123))
fit_model(model5, 'GB_w_scaler')

-0.25126471106226456

In [122]:
model6 = make_pipeline(StandardScaler(),Ridge())
fit_model(model6, 'plain_ridge')

0.22099629888522287

In [123]:
model7 = make_pipeline(StandardScaler(),Lasso())
fit_model(model7, 'lasso')

-0.005464464679645564

In [124]:
params6 = {
    'ridge__alpha':[1,10,100]
}
gs6 = GridSearchCV(model6, params6, n_jobs=-1)
fit_model(gs6, 'gs_on_ridge')

0.20921888682507017

In [129]:
params2={
    'n_estimators':[100,1000],
    'max_depth': [25,50]
}

gs2 = GridSearchCV(RandomForestRegressor(random_state=123), params2, n_jobs=-1)
fit_model(gs2, 'gs_on_RFR')

0.11768490658402153

In [126]:
pd.DataFrame(gs2.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.279999,0.024883,0.014,0.002279558,25,100,"{'max_depth': 25, 'n_estimators': 100}",0.130418,0.046887,-0.175587,-0.654952,0.143313,-0.101984,0.299177,4
1,2.444399,0.121693,0.085799,0.007195073,25,1000,"{'max_depth': 25, 'n_estimators': 1000}",0.145396,0.071692,-0.116526,-0.574878,0.083434,-0.078176,0.263342,3
2,0.236799,0.007781,0.011,5.519789e-07,50,100,"{'max_depth': 50, 'n_estimators': 100}",0.118661,0.093043,-0.034194,-0.509845,0.135378,-0.039391,0.242659,1
3,2.0182,0.356314,0.057399,0.01076386,50,1000,"{'max_depth': 50, 'n_estimators': 1000}",0.136908,0.073096,-0.1157,-0.456528,0.132171,-0.046011,0.224844,2


The irregular outcomes between splits leads me to believe there are "multiple profiles" in one dataset, further supporting my instinct to compare the baseline model against the clustered models.

In [127]:
model8 = make_pipeline(StandardScaler(), RFE(Ridge()), LinearRegression())
params8 = {
    'rfe__n_features_to_select':[10,20,30]
}
gs8 = GridSearchCV(model8, params8, n_jobs=-1)
fit_model(gs8, 'gs_rfe_ridge_lr')

0.22862135241911452

In [128]:
pd.DataFrame(gs8.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfe__n_features_to_select,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.043201,0.007025,0.003001,2e-06,10,{'rfe__n_features_to_select': 10},-0.027404,-1.307755,-0.439416,-0.084713,0.107802,-0.350297,0.511726,1
1,0.033,0.004147,0.003198,0.000399,20,{'rfe__n_features_to_select': 20},-0.065728,-0.8741,-0.593595,-0.2939,-0.005166,-0.366498,0.327269,2
2,0.022597,0.004759,0.002799,0.0004,30,{'rfe__n_features_to_select': 30},-0.200684,-1.357271,-0.515951,-0.386792,-0.131086,-0.518357,0.440905,3


In [144]:
model8 = make_pipeline(StandardScaler(), RFE(Ridge(), n_features_to_select=11), LinearRegression())
fit_model(model8, 'rfe_ridge_lr')

0.2368909310876468

In [177]:
model9 = make_pipeline(StandardScaler(), RFE(LinearRegression(), n_features_to_select=6), LinearRegression())
fit_model(model9, 'rfe_lr_lr')

0.21911175946737882

In [132]:
model10 = make_pipeline(StandardScaler(), LinearRegression())
fit_model(model10, 'lr_scaler')

0.21859654846299525

In [145]:
pd.DataFrame(fit_results, index=['score','rmse']).T.sort_values(by='score', ascending=False)

Unnamed: 0,score,rmse
rfe_ridge_lr,0.236891,0.063603
gs_rfe_ridge_lr,0.228621,0.063946
plain_ridge,0.220996,0.064262
rfe_lr_lr,0.219112,0.064339
lr_scaler,0.218597,0.064361
plain_linreg,0.218597,0.064361
gs_on_ridge,0.209219,0.064746
bagging,0.153595,0.066984
RFR,0.137405,0.067622
gs_on_RFR,0.117685,0.06839


In [149]:
vr11 = VotingRegressor([('lr_w_rfe', model8), ('bagging', model4)])
fit_model(vr11, 'vr_bagging_lr_w_rfe_ridge')

0.24131292985001096

In [151]:
results = pd.DataFrame(fit_results, index=['Score','RMSE']).T
results.sort_values('Score', ascending = False)

Unnamed: 0,Score,RMSE
vr_bagging_lr_w_rfe_ridge,0.241313,0.063418
rfe_ridge_lr,0.236891,0.063603
gs_rfe_ridge_lr,0.228621,0.063946
plain_ridge,0.220996,0.064262
rfe_lr_lr,0.219112,0.064339
lr_scaler,0.218597,0.064361
plain_linreg,0.218597,0.064361
gs_on_ridge,0.209219,0.064746
bagging,0.153595,0.066984
RFR,0.137405,0.067622


In [152]:
model8.named_steps

{'standardscaler': StandardScaler(),
 'rfe': RFE(estimator=Ridge(), n_features_to_select=11),
 'linearregression': LinearRegression()}

In [153]:
rfe_lr_lr_coefs = pd.Series(data= model9.named_steps['linearregression'].coef_, index = X.columns[model9.named_steps['rfe'].support_], name= 'rfe_lr_lr')

In [154]:
rfe_ridge_lr_coefs = pd.Series(data= model8.named_steps['linearregression'].coef_, index = X.columns[model8.named_steps['rfe'].support_], name='rfe_ridge_lr')

In [155]:
bagging_coefs = np.mean([
    tree.feature_importances_ for tree in model4.estimators_
], axis=0)
# model4.estimators_[0].feature_importances_

In [156]:
model_coefs = pd.DataFrame(data={'plain_linreg':model1.coef_, 'plain_ridge':model6.named_steps['ridge'].coef_, "bagging":bagging_coefs},index=X.columns)

In [157]:
model_coefs = model_coefs.merge(rfe_lr_lr_coefs,how = 'left',left_index=True,right_index=True)

In [158]:
model_coefs = model_coefs.merge(rfe_ridge_lr_coefs,how = 'left',left_index=True,right_index=True)

In [159]:
model_coefs = model_coefs.fillna(0)

In [161]:
model_coefs['vr_bag_w_rfe_ridge_lr'] = (model_coefs['bagging'] + model_coefs['rfe_ridge_lr']) / 2

In [178]:
model_coefs.sort_values(by='vr_bag_w_rfe_ridge_lr')

Unnamed: 0,plain_linreg,plain_ridge,bagging,rfe_lr_lr,rfe_ridge_lr,vr_bag_w_rfe_ridge_lr
Recycling Collection Frequency_Bi-weekly,-0.0435297,-0.013459,0.004723,-0.013893,-0.016732,-0.006004
Non-resident Trash and Recycling Service_Trash,-0.1436053,-0.00992,8.1e-05,0.0,-0.00958,-0.004749
# Hours Enforcement Personnel on Street,-0.001566063,-0.01894,0.009754,-0.013662,-0.017136,-0.003691
Non-resident Trash and Recycling Service_Recycling,-0.05296178,-0.008017,0.004352,0.0,-0.009221,-0.002434
Municipal Buildings Trash and Recycling Service_Trash,2.449933e-12,0.0,0.0,0.0,0.0,0.0
Dedicated Trash Enforcement Personnel,0.007588894,0.001495,0.00039,0.0,0.0,0.000195
Fee for bulky waste?,-0.02783589,-0.011908,0.012969,0.0,-0.010811,0.001079
Trash Enforced by Hauler,-0.02427997,-0.008423,0.002407,0.0,0.0,0.001203
Recycling Service Type_Curbside,-0.02325173,-0.004533,0.003942,0.0,0.0,0.001971
Trash Enforced by Muni,0.02157855,0.007229,0.004277,0.0,0.0,0.002139


The best non-ensemble models were Linear Regressions using Ridge or Linear Regression RFE before fitting. Let's see what theses models said are the most important features...

**Negative Correlations** -- *presences of these features or higher amounts of these features **reduce** recycling effectiveness. /b/ indicates these features were in both the RFE results with linear regression and ridge. If no /b/ is present, the feature only appeard in the ridge results.*

* /b/`# Hours Enforcement Personnel on Street` -- surprising...
* /b/`Recycling Collection Frequency_Bi-weekly` -- doesn't do as well as recycling weekly, I suppose
* /b/`Households Served by Municipal Recycling Program` -- this is probably showing some indication of population size
* `Fee for bulky waste?`  -- I'm not sure what to make of this. Could it be that some of the bulky waste is recyclable? like kiddy-pools?
* `Non-resident Trash and Recycling Service` (Trash + Recycling) -- Not sure why servicing non-residental buildings would lead to lower recycling over all..

**Positive Correlations** -- *presences of these features or higher amounts of these features **increase** recycling effectiveness. /b/ indicates these features were in both the RFE results with linear regression and ridge. If no /b/ is present, the feature only appeard in the ridge results.*
* `Business Trash and Recycling Service_Recycling` -- I think this makes sense because if you are encouraged to recycle at work or out and about, you're probably more likely to recycle at home. Or maybe municipalities that are more recycling-conscious would want to implement comprehensive recycling outside just residential homes... hard to say. That's why I'm surprised "non-residential services" somehow led to lower recycling...
* `Trash Service Type_Curbside` -- More convenient for the participants, but I'm surprised *recycling* curbside didn't matter more...
* /b/`Dedicated Mandatory Recycling Enforcement Personnel` -- LOL ... so why is hours of enforcement a negative impact?
* /b/`Carts for Recycling` -- Makes sense. No excuse to not recycle if you already have a bin
* /b/`PAYT/ SMART` -- Makes sense, if you pay for the amount you throw out, you probably want to minimize your volume of trash by recycling. Plus, that money goes to further waste handling funding!

The Voting Regression model that used both the linear regression with ridge RFE and Bagging did have the highest score by a small amount. It shares a lot of the same important features (makes sense since half of this model is the linear regression with ridge RFE) but there are some difference:
* bulky waste fee impact was flipped (but also insignificant)
* Households Served by Municipal Recycling Program impact was flipped (and is highly significant)
* Seemed a lot more concerned with funding. In addition to `PAYT/ SMART`, the bagging regressor brought in: `What is the annual fee?`, `What is the transfer station access fee`, `Tip Fee as of 1/1/2020`
* Seemed to focus more on positive correlations actually

If I want to put more of an emphasis on funding, I may want to select to voting regressor. However, I'll probably also compare future results to the linear regression using ridge RFE. My major take aways in seeing the erratic scores and sometimes logically-opposing feature importances, I think running a predictive model on the population as a whole does not make sense. Assessing smaller sub-populations will probably yield better scores and insights. That's the theory anyways...

In [169]:
results = results.sort_values('Score', ascending=False)

In [170]:
# # For saving files

# model_coefs.to_csv('data/baseline_models_coefs.csv', index=True)
# results.to_csv('data/baseline_models_scores.csv', index=True)