<h1 align='center'><font size='5'>Q3 & Q4 Relationship between Complaints and Housing Characteristics</font></h1>

* ### The housing characteristics dataset (PLUTO) can be downloaded [here](https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/xuk2-nczf); only the Bronx data is used
* ### Only selected columns of the housing characteristics dataset are loaded (see below)

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')

df_BX = pd.read_csv('./PlUTO_for_WEB/BX_18v1.csv',usecols=['Address', 'BldgArea', 'BldgDepth', 'BuiltFAR', 
                                                         'CommFAR', 'FacilFAR', 'Lot', 'LotArea', 
                                                         'LotDepth', 'NumBldgs', 'NumFloors', 'OfficeArea', 
                                                         'ResArea', 'ResidFAR', 'RetailArea', 'YearBuilt', 
                                                         'YearAlter1', 'ZipCode', 'YCoord', 'XCoord'],
                   dtype={'ZipCode':'Int64'})
df_BX.head()

Unnamed: 0,Lot,ZipCode,Address,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
0,1,10454,122 BRUCKNER BOULEVARD,15000,0,0,0,0,1,0.0,200.0,0.0,0,0,0.0,6.02,5.0,6.5,1005957.0,232162.0
1,4,10454,126 BRUCKNER BOULEVARD,13770,752,0,272,0,2,1.0,100.0,16.0,1931,1994,0.05,6.02,5.0,6.5,1006076.0,232156.0
2,10,10454,138 BRUCKNER BOULEVARD,35000,39375,0,0,0,1,2.0,200.0,200.0,1931,0,1.13,6.02,5.0,6.5,1006187.0,232036.0
3,17,10454,144 BRUCKNER BOULEVARD,2500,12500,12500,0,0,1,5.0,100.0,85.0,1931,2001,5.0,6.02,5.0,6.5,1006299.0,232033.0
4,18,10454,148 BRUCKNER BOULEVARD,1875,8595,6876,0,1719,1,5.0,75.0,70.0,1920,2009,4.58,6.02,5.0,6.5,1006363.0,232040.0


In [3]:
print(df_BX.dtypes,'\n')
print(df_BX.shape)

Lot             int64
ZipCode         Int64
Address        object
LotArea         int64
BldgArea        int64
ResArea         int64
OfficeArea      int64
RetailArea      int64
NumBldgs        int64
NumFloors     float64
LotDepth      float64
BldgDepth     float64
YearBuilt       int64
YearAlter1      int64
BuiltFAR      float64
ResidFAR      float64
CommFAR       float64
FacilFAR      float64
XCoord        float64
YCoord        float64
dtype: object 

(89854, 20)


<div class="alert alert-block alert-info">

## Table of Contents <a id='toc'></a>
1. <font size='3'>[Data cleaning, wrangling](#1)</font>
2. <font size='3'>[Relationship between types of complaints and housing characteristics](#2)</font>
    1. <font size='3'>[Merge tables, drop duplicated columns and NA rows](#2)</font>
    2. <font size='3'>[Feature selection](#2.2)</font>
    3. <font size='3'>[Model selection & evaluation](#2.3)</font>
3. <font size='3'>[Relationship between number of heating-related complaints and housing characteristics](#3)</font>
    1. <font size='3'>[Preprocessing](#3)</font>
    2. <font size='3'>[Feature selection](#3.2)</font>
    3. <font size='3'>[Model selection & evaluation](#3.3)</font>

</div>

# [Data cleaning, wrangling](#toc) <a id='1'></a>

In [4]:
# read HPDComplaint.csv, keep only Bronx complaints of the top 5 complaint types, and extract the year and month each complaint was created
df_HPD = pd.read_csv('HPDComplaint.csv',parse_dates=['created_date','closed_date'])
df_HPD.drop('Unnamed: 0',axis=1,inplace=True)
df_HPD['created_year'] = df_HPD['created_date'].map(lambda x: x.strftime('%Y')).astype(int)
df_HPD['created_month'] = df_HPD['created_date'].map(lambda x: x.strftime('%m')).astype(int)
df_HPD.set_index('complaint_type',inplace=True)
df_HPD = df_HPD.loc[['HEAT/HOT WATER','HEATING','PLUMBING','GENERAL CONSTRUCTION','UNSANITARY CONDITION'],:]
df_HPD = df_HPD[df_HPD['borough']=='BRONX']
# add a new column for processed days, and fill in NA with now() time minus created date
df_HPD['processed_days'] = (df_HPD['closed_date']-df_HPD['created_date']).dt.days.fillna((pd.Timestamp.now()-df_HPD['created_date']).dt.days).astype(int)
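
Because open complaints are filled with `pd.Timestamp.now()`, the `processed_days` values change every time the notebook is rerun. A minimal sketch of the same calculation with a fixed reference date follows; the toy frame and the `snapshot` date are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy complaints; the second one is still open (no closed_date).
toy = pd.DataFrame({
    "created_date": pd.to_datetime(["2019-11-01", "2019-11-05"]),
    "closed_date": pd.to_datetime(["2019-11-04", None]),
})

# Hypothetical fixed snapshot date, used instead of pd.Timestamp.now()
# so that rerunning the calculation gives the same numbers.
snapshot = pd.Timestamp("2019-11-10")

# Open complaints are treated as if they closed on the snapshot date.
closed_or_snapshot = toy["closed_date"].fillna(snapshot)
toy["processed_days"] = (closed_or_snapshot - toy["created_date"]).dt.days
print(toy["processed_days"].tolist())
```

Pinning the date to the day the data was downloaded would keep the feature stable across reruns.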

In [5]:
# drop borough and closed_date columns
df_HPD.reset_index(inplace=True)
df_HPD.drop(columns=['borough','closed_date'],inplace=True)
# df_HPD[df_HPD.drop(columns=['address_type','city','resolution_description']).isnull().any(axis=1)]
print(df_HPD.shape)
print(df_HPD.dtypes)
df_HPD.head()

(996983, 16)
complaint_type                    object
unique_key                         int64
created_date              datetime64[ns]
status                            object
resolution_description            object
location_type                     object
incident_zip                     float64
incident_address                  object
street_name                       object
address_type                      object
city                              object
latitude                         float64
longitude                        float64
created_year                       int32
created_month                      int32
processed_days                     int32
dtype: object


Unnamed: 0,complaint_type,unique_key,created_date,status,resolution_description,location_type,incident_zip,incident_address,street_name,address_type,city,latitude,longitude,created_year,created_month,processed_days
0,HEAT/HOT WATER,44649082,2019-11-07 18:15:23,Open,The following complaint conditions are still o...,RESIDENTIAL BUILDING,10451.0,751 GERARD AVENUE,GERARD AVENUE,ADDRESS,BRONX,40.824936,-73.926456,2019,11,119
1,HEAT/HOT WATER,44648076,2019-11-07 09:34:31,Open,The complaint you filed is a duplicate of a co...,RESIDENTIAL BUILDING,10457.0,739 EAST 182 STREET,EAST 182 STREET,ADDRESS,BRONX,40.849734,-73.885476,2019,11,119
2,HEAT/HOT WATER,44646827,2019-11-07 11:13:32,Open,The following complaint conditions are still o...,RESIDENTIAL BUILDING,10458.0,319 EAST 197 STREET,EAST 197 STREET,ADDRESS,BRONX,40.866864,-73.889066,2019,11,119
3,HEAT/HOT WATER,44514095,2019-11-06 19:51:53,Open,The following complaint conditions are still o...,RESIDENTIAL BUILDING,10468.0,30 WEST 184 STREET,WEST 184 STREET,ADDRESS,BRONX,40.860995,-73.903757,2019,11,120
4,HEAT/HOT WATER,44514012,2019-11-06 08:41:31,Open,The following complaint conditions are still o...,RESIDENTIAL BUILDING,10463.0,3135 GODWIN TERRACE,GODWIN TERRACE,ADDRESS,BRONX,40.880061,-73.905143,2019,11,120


# [Finding relationship between the type of complaints and Housing characteristics](#toc) <a id='2'></a>
## 1. Merge complaints and housing tables and drop duplicated columns and NA rows

In [6]:
# merge the two tables (complaint table and house info table) on address
df_mix = pd.merge(df_HPD,df_BX,left_on='incident_address',right_on='Address')
# df_mix.columns
# df_mix.loc[:,'city'].unique()
df_mix.drop(columns=['resolution_description','incident_address','address_type','city','incident_zip','location_type'],inplace=True)
df_mix.drop(index=df_mix.index[df_mix.isnull().any(axis=1)],inplace=True)
# df_mix.loc[:,'CommFAR'].unique()
print(df_mix.shape)
print(df_mix.dtypes)
df_mix.head()

(797888, 30)
complaint_type            object
unique_key                 int64
created_date      datetime64[ns]
status                    object
street_name               object
latitude                 float64
longitude                float64
created_year               int32
created_month              int32
processed_days             int32
Lot                        int64
ZipCode                    Int64
Address                   object
LotArea                    int64
BldgArea                   int64
ResArea                    int64
OfficeArea                 int64
RetailArea                 int64
NumBldgs                   int64
NumFloors                float64
LotDepth                 float64
BldgDepth                float64
YearBuilt                  int64
YearAlter1                 int64
BuiltFAR                 float64
ResidFAR                 float64
CommFAR                  float64
FacilFAR                 float64
XCoord                   float64
YCoord                   float64
dtype: object

Unnamed: 0,complaint_type,unique_key,created_date,status,street_name,latitude,longitude,created_year,created_month,processed_days,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
0,HEAT/HOT WATER,44649082,2019-11-07 18:15:23,Open,GERARD AVENUE,40.824936,-73.926456,2019,11,119,...,115.0,105.0,1928,0,5.31,6.02,0.0,6.5,1004556.0,239929.0
1,HEAT/HOT WATER,41234155,2018-12-18 03:11:13,Closed,GERARD AVENUE,40.824936,-73.926456,2018,12,2,...,115.0,105.0,1928,0,5.31,6.02,0.0,6.5,1004556.0,239929.0
2,HEAT/HOT WATER,44160698,2019-10-27 12:19:15,Closed,GERARD AVENUE,40.824936,-73.926456,2019,10,1,...,115.0,105.0,1928,0,5.31,6.02,0.0,6.5,1004556.0,239929.0
3,HEAT/HOT WATER,27787711,2014-04-04 00:00:00,Closed,GERARD AVENUE,40.824691,-73.926605,2014,4,4,...,115.0,105.0,1928,0,5.31,6.02,0.0,6.5,1004556.0,239929.0
4,HEAT/HOT WATER,27822358,2014-04-11 00:00:00,Closed,GERARD AVENUE,40.824691,-73.926605,2014,4,2,...,115.0,105.0,1928,0,5.31,6.02,0.0,6.5,1004556.0,239929.0
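
The merge above is an inner join on raw address strings, so complaints whose address has no exact match in the housing table are dropped silently (996,983 rows before the merge, 797,888 after the merge and NA removal). One way to see how many rows fail to match is a left join with `indicator=True`. The sketch below uses made-up toy frames, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the complaint and housing tables (values are made up).
complaints = pd.DataFrame({
    "unique_key": [1, 2, 3, 4],
    "incident_address": ["751 GERARD AVENUE", "751 GERARD AVENUE",
                         "30 WEST 184 STREET", "1 UNKNOWN PLACE"],
})
lots = pd.DataFrame({
    "Address": ["751 GERARD AVENUE", "30 WEST 184 STREET"],
    "NumFloors": [6.0, 5.0],
})

# A left join keeps every complaint; the _merge column records whether a lot matched.
# validate="many_to_one" raises if an address appears more than once in the lot table.
checked = complaints.merge(lots, how="left", left_on="incident_address",
                           right_on="Address", indicator=True,
                           validate="many_to_one")
match_counts = checked["_merge"].value_counts()
n_matched = int(match_counts["both"])
n_unmatched = int(match_counts["left_only"])
print(n_matched, n_unmatched)
```

If the real lot table contains repeated addresses, the `validate` check would fail, which is itself useful to know because repeated addresses duplicate complaint rows in the merged table.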


In [7]:
# One-hot encoding every street name produced over 950 dummy features, and converting
# them to a NumPy array ran out of memory, so keep only the 10 streets with the most complaints.
top_streets = df_mix['street_name'].value_counts().head(10).index
df = df_mix[df_mix['street_name'].isin(top_streets)]
df.head()

Unnamed: 0,complaint_type,unique_key,created_date,status,street_name,latitude,longitude,created_year,created_month,processed_days,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
1791,HEAT/HOT WATER,44511445,2019-11-06 16:58:40,Open,WALTON AVENUE,40.830614,-73.922071,2019,11,120,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1793,HEAT/HOT WATER,29298365,2014-11-15 00:00:00,Closed,WALTON AVENUE,40.830614,-73.922071,2014,11,2,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1794,HEAT/HOT WATER,29671698,2015-01-08 00:00:00,Closed,WALTON AVENUE,40.830614,-73.922071,2015,1,6,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1795,HEAT/HOT WATER,31983907,2015-11-13 20:59:09,Closed,WALTON AVENUE,40.830614,-73.922071,2015,11,4,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1796,HEAT/HOT WATER,32017131,2015-11-18 18:31:21,Closed,WALTON AVENUE,40.830614,-73.922071,2015,11,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
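
Filtering to the top 10 streets discards every complaint on other streets. An alternative that keeps all rows while still limiting the number of dummy columns is to pool the less frequent streets into a single category. The sketch below assumes a toy series and a hypothetical `"OTHER"` label; it is not what the notebook does.

```python
import pandas as pd

# Toy street names with uneven frequencies (made-up data).
streets = pd.Series(
    ["GRAND CONCOURSE"] * 4 + ["WALTON AVENUE"] * 3
    + ["MORRIS AVENUE"] * 2 + ["BAILEY AVENUE"],
    name="street_name",
)

# Keep the 2 most frequent streets and pool everything else under "OTHER".
top_streets = streets.value_counts().head(2).index
grouped = streets.where(streets.isin(top_streets), "OTHER")

# One dummy column per remaining category.
dummies = pd.get_dummies(grouped, prefix="street_name")
print(sorted(dummies.columns))
```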


In [8]:
# extract all numerical features to get correlation matrix
df_num = df.drop(columns=['unique_key','created_date','Address','complaint_type','street_name','status'])
df_num.head()

Unnamed: 0,latitude,longitude,created_year,created_month,processed_days,Lot,ZipCode,LotArea,BldgArea,ResArea,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
1791,40.830614,-73.922071,2019,11,120,11,10452,10770,48369,48369,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1793,40.830614,-73.922071,2014,11,2,11,10452,10770,48369,48369,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1794,40.830614,-73.922071,2015,1,6,11,10452,10770,48369,48369,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1795,40.830614,-73.922071,2015,11,4,11,10452,10770,48369,48369,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1796,40.830614,-73.922071,2015,11,1,11,10452,10770,48369,48369,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0


## 2. [Feature selection using filter, wrapper, and embedded methods](#toc) <a id='2.2'></a>

In [9]:
# correlation matrix
corr_mattrix = df_num.corr()
corr_mattrix.style.background_gradient(cmap='coolwarm')

Unnamed: 0,latitude,longitude,created_year,created_month,processed_days,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
latitude,1.0,0.421645,-0.0394924,-0.0116546,-0.0102597,0.10423,0.348848,-0.101131,-0.184229,-0.185348,-0.0662773,-0.0579446,0.037343,-0.125802,0.000481488,-0.11531,-0.017547,-0.0937164,-0.190998,-0.122215,-0.0528906,-0.25918,0.4222,0.999937
longitude,0.421645,1.0,0.0510706,-0.0239543,-0.0246356,0.0828592,0.780085,-0.22627,-0.224286,-0.216826,-0.0853529,-0.0690369,-0.0367372,-0.130847,-0.220026,-0.207809,-0.00862702,-0.138445,-0.00862985,-0.418959,-0.0432663,-0.39374,0.999637,0.421967
created_year,-0.0394924,0.0510706,1.0,-0.154209,-0.169944,-0.00755781,0.0345465,-0.0102994,-0.013832,-0.0145412,0.000965032,-0.00175989,-0.017615,0.0104888,-0.0346644,-0.03105,-0.0122044,0.000134123,0.0169419,-0.025644,0.0154967,-0.0183135,0.051301,-0.0397009
created_month,-0.0116546,-0.0239543,-0.154209,1.0,-0.0465004,0.00659415,-0.0176874,0.00586558,0.0156527,0.0149422,0.00529406,0.00622845,0.0135652,0.0158251,-0.00905992,0.00875772,0.00545763,0.0129785,0.0237557,0.0285858,0.0136276,0.0165132,-0.0244287,-0.0114886
processed_days,-0.0102597,-0.0246356,-0.169944,-0.0465004,1.0,0.0181388,-0.0294911,-0.00664724,-0.0111402,-0.0104957,-0.00982503,0.00583128,0.00134107,-0.0104281,-0.00549401,-0.000874913,-0.0104457,-0.0123303,-0.00434767,0.0214029,0.00830716,0.0201951,-0.0247088,-0.0100994
Lot,0.10423,0.0828592,-0.00755781,0.00659415,0.0181388,1.0,0.0839415,0.0236975,0.0144605,0.0109868,0.0360805,0.013867,0.0596819,-0.00053166,-0.0796005,-0.0585862,0.0261789,0.0358189,0.00325098,-0.116455,-0.0385921,-0.163314,0.079665,0.106503
ZipCode,0.348848,0.780085,0.0345465,-0.0176874,-0.0294911,0.0839415,1.0,-0.185857,-0.210147,-0.203704,-0.0873186,-0.0856802,-0.0356272,-0.114915,-0.163169,-0.216226,-0.00489666,-0.0969195,-0.0425064,-0.473509,-0.0192663,-0.353773,0.779551,0.349345
LotArea,-0.101131,-0.22627,-0.0102994,0.00586558,-0.00664724,0.0236975,-0.185857,1.0,0.822642,0.804888,0.179336,0.054528,0.367815,0.547433,0.673977,0.423545,0.0875529,0.120867,0.0710933,-0.00538578,-0.0607394,-0.00719091,-0.223884,-0.100378
BldgArea,-0.184229,-0.224286,-0.013832,0.0156527,-0.0111402,0.0144605,-0.210147,0.822642,1.0,0.995285,0.152441,0.103243,0.234293,0.677237,0.513995,0.449288,0.0794612,0.183801,0.525379,0.100327,-0.00691105,0.10371,-0.22406,-0.183567
ResArea,-0.185348,-0.216826,-0.0145412,0.0149422,-0.0104957,0.0109868,-0.203704,0.804888,0.995285,1.0,0.0997182,0.0729095,0.226691,0.649377,0.514307,0.457081,0.0729608,0.194099,0.543553,0.106383,-0.0102749,0.10995,-0.216868,-0.184669


In [10]:
# feature selection: 1) filter out highly correlated features (>0.95)
lower = corr_mattrix.where(np.tril(np.ones(corr_mattrix.shape),k=-1).astype(bool)).abs()
to_drop = [i for i in lower.index if any(lower[i]>0.95)]
df_num.drop(columns=to_drop,inplace=True)
df_num.head()

Unnamed: 0,created_year,created_month,processed_days,Lot,ZipCode,LotArea,ResArea,OfficeArea,RetailArea,NumBldgs,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
1791,2019,11,120,11,10452,10770,48369,0,0,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1793,2014,11,2,11,10452,10770,48369,0,0,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1794,2015,1,6,11,10452,10770,48369,0,0,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1795,2015,11,4,11,10452,10770,48369,0,0,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
1796,2015,11,1,11,10452,10770,48369,0,0,1,...,117.5,98.0,1928,0,4.49,6.02,0.0,6.5,1005929.0,241960.0
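
The correlation filter can be illustrated on a small synthetic frame. This sketch masks the strictly upper triangle, so it drops the later column of each highly correlated pair, whereas the lower-triangle version above drops the earlier one. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,      # linear function of "a", so |r| is 1
    "b": rng.normal(size=200),    # independent of "a"
})

corr = toy.corr().abs()
# Keep only the strictly upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop a column if it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```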


In [11]:
# Standardize numerical features
from sklearn.preprocessing import StandardScaler
df_standardize = pd.DataFrame(StandardScaler().fit(df_num).transform(df_num),columns=df_num.columns)
df_standardize

Unnamed: 0,created_year,created_month,processed_days,Lot,ZipCode,LotArea,ResArea,OfficeArea,RetailArea,NumBldgs,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
0,1.750425,1.197288,0.208990,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,-0.052896,-0.068058,-0.013288,-0.631476,0.139671,0.777725,-0.152758,0.695197,-1.356044,-1.148402
1,-0.358429,1.197288,-0.128997,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,-0.052896,-0.068058,-0.013288,-0.631476,0.139671,0.777725,-0.152758,0.695197,-1.356044,-1.148402
2,0.063342,-1.224597,-0.117540,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,-0.052896,-0.068058,-0.013288,-0.631476,0.139671,0.777725,-0.152758,0.695197,-1.356044,-1.148402
3,0.063342,1.197288,-0.123268,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,-0.052896,-0.068058,-0.013288,-0.631476,0.139671,0.777725,-0.152758,0.695197,-1.356044,-1.148402
4,0.063342,1.197288,-0.131861,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,-0.052896,-0.068058,-0.013288,-0.631476,0.139671,0.777725,-0.152758,0.695197,-1.356044,-1.148402
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
167061,1.328654,0.470723,-0.068847,-0.056360,-0.951897,-1.019351,-1.080185,-0.141144,-0.358298,-0.134518,...,-0.779581,-1.050247,-0.393271,1.596392,-2.036384,-0.859283,-0.152758,-0.760756,-0.328495,0.201388
167062,1.328654,0.470723,-0.068847,-0.056360,-0.951897,-1.019351,-1.080185,-0.141144,-0.358298,-0.134518,...,-0.779581,-1.050247,-0.393271,1.596392,-2.036384,-0.859283,-0.152758,-0.760756,-0.328495,0.201388
167063,1.328654,0.712911,0.160297,-0.179986,1.692247,-1.008367,-1.094799,-0.141144,-0.358298,-0.134518,...,-0.515332,-0.663324,-0.097729,-0.631476,-2.400379,-1.500127,-0.152758,-0.760756,1.725831,-1.158537
167064,1.750425,-0.982408,-0.057389,0.204628,1.692247,-1.008367,-1.106497,-0.141144,-0.358298,-0.134518,...,-0.515332,-1.139537,-0.182170,-0.631476,-2.629854,-1.500127,-0.152758,-0.760756,1.666129,-1.099364
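
The scaler above is fitted on all rows before the train-test split, so statistics from the test rows influence the scaled training data. A common alternative, sketched here on synthetic data with assumed variable names, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features

# Split first, then learn the mean and standard deviation from the training rows only.
X_tr, X_te = train_test_split(X_toy, test_size=0.33, random_state=0)
scaler = StandardScaler().fit(X_tr)

X_tr_s = scaler.transform(X_tr)  # training columns have mean 0 and unit variance
X_te_s = scaler.transform(X_te)  # test rows reuse the training statistics
print(X_tr_s.mean(axis=0).round(6), X_te_s.mean(axis=0).round(2))
```

For tree-based models such as the random forest selected later, scaling has little effect on the fit, so the practical impact here is probably small.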


In [12]:
# concatenate df_num and categorical columns in df
dfX = pd.concat([df_standardize, df['street_name'].reset_index(drop=True)],axis=1)
# handle categorical data: one-hot encode with get_dummies
dfX_dummies = pd.get_dummies(dfX)
print('Original columns: ',list(dfX.columns))
print('Dummies columns: ', list(dfX_dummies.columns))
dfX_dummies.head()

Original columns:  ['created_year', 'created_month', 'processed_days', 'Lot', 'ZipCode', 'LotArea', 'ResArea', 'OfficeArea', 'RetailArea', 'NumBldgs', 'NumFloors', 'LotDepth', 'BldgDepth', 'YearBuilt', 'YearAlter1', 'BuiltFAR', 'ResidFAR', 'CommFAR', 'FacilFAR', 'XCoord', 'YCoord', 'street_name']
Dummies columns:  ['created_year', 'created_month', 'processed_days', 'Lot', 'ZipCode', 'LotArea', 'ResArea', 'OfficeArea', 'RetailArea', 'NumBldgs', 'NumFloors', 'LotDepth', 'BldgDepth', 'YearBuilt', 'YearAlter1', 'BuiltFAR', 'ResidFAR', 'CommFAR', 'FacilFAR', 'XCoord', 'YCoord', 'street_name_BAILEY AVENUE', 'street_name_BOYNTON AVENUE', 'street_name_CRESTON AVENUE', 'street_name_DAVIDSON AVENUE', 'street_name_DECATUR AVENUE', 'street_name_GRAND CONCOURSE', 'street_name_MORRIS AVENUE', 'street_name_SEDGWICK AVENUE', 'street_name_SHERIDAN AVENUE', 'street_name_WALTON AVENUE']


Unnamed: 0,created_year,created_month,processed_days,Lot,ZipCode,LotArea,ResArea,OfficeArea,RetailArea,NumBldgs,...,street_name_BAILEY AVENUE,street_name_BOYNTON AVENUE,street_name_CRESTON AVENUE,street_name_DAVIDSON AVENUE,street_name_DECATUR AVENUE,street_name_GRAND CONCOURSE,street_name_MORRIS AVENUE,street_name_SEDGWICK AVENUE,street_name_SHERIDAN AVENUE,street_name_WALTON AVENUE
0,1.750425,1.197288,0.20899,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,0,0,0,0,0,0,0,0,0,1
1,-0.358429,1.197288,-0.128997,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,0,0,0,0,0,0,0,0,0,1
2,0.063342,-1.224597,-0.11754,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,0,0,0,0,0,0,0,0,0,1
3,0.063342,1.197288,-0.123268,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,0,0,0,0,0,0,0,0,0,1
4,0.063342,1.197288,-0.131861,-0.482183,-1.091063,-0.410778,-0.358873,-0.141144,-0.358298,-0.134518,...,0,0,0,0,0,0,0,0,0,1


In [13]:
# define X (features) and y (target)
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
X_name = dfX_dummies.columns
y_name = ['GENERAL CONSTRUCTION', 'HEAT/HOT WATER', 'HEATING', 'PLUMBING', 'UNSANITARY CONDITION']
print(len(X_name))
print(y_name)
X = dfX_dummies.loc[:,X_name].values
fit_y = LabelEncoder().fit(df.loc[:,'complaint_type'])
y = fit_y.transform(df.loc[:,'complaint_type'])
print(X.shape, '\n', y.shape)
# list(X.columns)

31
['GENERAL CONSTRUCTION', 'HEAT/HOT WATER', 'HEATING', 'PLUMBING', 'UNSANITARY CONDITION']
(167066, 31) 
 (167066,)


In [14]:
# feature selection: 2) wrapper method: recursive feature elimination with cross-validation (RFECV)
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
rfe_selector = RFECV(estimator=model,min_features_to_select=21,n_jobs=-1)
fit_rfe = rfe_selector.fit(X,y)
print('Number of features: ', fit_rfe.n_features_)
print('Selected features: ', X_name[fit_rfe.support_])
print('Feature Ranking: ', fit_rfe.ranking_)

Number of features:  21
Selected features:  Index(['created_year', 'created_month', 'processed_days', 'Lot', 'ZipCode',
       'LotArea', 'ResArea', 'OfficeArea', 'RetailArea', 'NumFloors',
       'LotDepth', 'BldgDepth', 'YearBuilt', 'YearAlter1', 'BuiltFAR',
       'ResidFAR', 'FacilFAR', 'XCoord', 'YCoord', 'street_name_MORRIS AVENUE',
       'street_name_WALTON AVENUE'],
      dtype='object')
Feature Ranking:  [ 1  1  1  1  1  1  1  1  1  6  1  1  1  1  1  1  1  7  1  1  1 11  8  3
  4 10  2  1  9  5  1]
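
To read the output above: a ranking of 1 marks a selected feature, and larger numbers mark features that were eliminated earlier. The sketch below reproduces that behaviour on a small synthetic classification problem; the data and the parameter values are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic problem: 8 features, only 3 of which carry signal.
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

# Recursively drop the least important feature and keep the subset
# with the best cross-validated score (never fewer than 2 features).
selector = RFECV(DecisionTreeClassifier(random_state=0),
                 min_features_to_select=2, cv=3)
selector.fit(X_toy, y_toy)

print("Number of features:", selector.n_features_)
print("Ranking (1 = selected):", selector.ranking_)
```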


In [15]:
# feature selection: 3) embedded method: random forest feature importances, keep the 15 most important features
from sklearn.ensemble import RandomForestClassifier
X = fit_rfe.transform(X)
rfc_fit = RandomForestClassifier(random_state=0).fit(X,y)
X_name = [X_name[fit_rfe.support_][i] for i in np.argsort(rfc_fit.feature_importances_)[::-1]][:15]
X_name

['created_year',
 'processed_days',
 'created_month',
 'YCoord',
 'XCoord',
 'BuiltFAR',
 'Lot',
 'ResArea',
 'LotDepth',
 'LotArea',
 'BldgDepth',
 'YearBuilt',
 'ZipCode',
 'YearAlter1',
 'RetailArea']

## 3. [Model selection with grid search and model evaluation with cross-validation](#toc) <a id='2.3'></a>

In [16]:
# train-test split
from sklearn.model_selection import train_test_split
X = dfX_dummies[X_name].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [17]:
# model selection using pipeline & GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('classifier', RandomForestClassifier())])
# 'sqrt' is what 'auto' meant for classifiers; 'auto' was deprecated and later removed from scikit-learn
param_grid = [{'classifier': [RandomForestClassifier(random_state=0)],
               'classifier__n_estimators': [10,100],
               'classifier__max_features': [1,3,'sqrt']},
              # learning_rate must be strictly positive, so the grid starts above 0
              {'classifier': [AdaBoostClassifier(random_state=0)], 
               'classifier__n_estimators': [10,100], 
               'classifier__learning_rate': np.linspace(0.1,3,5)}, 
              # the 'saga' solver supports both l1 and l2 penalties; C must be strictly positive
              {'classifier': [LogisticRegression(random_state=0, solver='saga')],
               'classifier__penalty': ['l1','l2'],
               'classifier__C': np.linspace(0.1,3,5)}]
grid = GridSearchCV(estimator=pipe,param_grid=param_grid,scoring='roc_auc_ovr',verbose=3,n_jobs=-3)
model = grid.fit(X_train,y_train)

Fitting 5 folds for each of 26 candidates, totalling 130 fits


[Parallel(n_jobs=-3)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-3)]: Done  20 tasks      | elapsed:   29.3s
[Parallel(n_jobs=-3)]: Done 116 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-3)]: Done 130 out of 130 | elapsed:  2.4min finished


In [18]:
# find the best model and corresponding parameters and fitting score
print(model.best_params_,'\n',model.best_score_)

{'classifier': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=3,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False), 'classifier__max_features': 3, 'classifier__n_estimators': 100} 
 0.9452560635103964


In [19]:
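# model evaluation: 10-fold cross-validation of the best pipeline on the held-out split
# (cross_val_score clones the pipeline and refits it on each fold)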
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=10,shuffle=True,random_state=0)
cv_res = cross_val_score(model.best_estimator_,X_test,y_test,scoring='roc_auc_ovr',cv=kf)
print('ROC_AUC for fitting the data with Random Forest Classifier is: ', cv_res.mean())

ROC_AUC for fitting the data with Random Forest Classifier is:  0.9398942797395538
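
The cell above cross-validates on the test split, which refits the pipeline on folds of the test data. A complementary check is to score the already fitted model once on data it has never seen. The sketch below shows the idea on synthetic three-class data; the data, the forest size, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class problem standing in for the complaint types.
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33,
                                          random_state=0, stratify=y_toy)

# Fit once on the training split only.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# One-vs-rest ROC AUC needs class probabilities, not hard labels.
proba = clf.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class="ovr")
print("Held-out one-vs-rest ROC AUC:", round(auc, 3))
```

In the notebook, the equivalent call would pass `model.predict_proba(X_test)` and `y_test` to `roc_auc_score`, since `GridSearchCV` refits the best pipeline on the full training split by default.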


In [20]:
# rank the most important features: from most important to least important
important_feature = model.best_estimator_['classifier'].feature_importances_
feature_ranking = [X_name[idx] for idx in np.argsort(important_feature)[::-1]]
print("The importance of features, ranking from most important to least important: \n", feature_ranking)

The importance of features, ranking from most important to least important: 
 ['created_year', 'processed_days', 'created_month', 'YCoord', 'XCoord', 'BuiltFAR', 'Lot', 'ResArea', 'LotArea', 'LotDepth', 'BldgDepth', 'YearBuilt', 'ZipCode', 'YearAlter1', 'RetailArea']


- **<p style='font-size: 1.8em; color: red'>Conclusion III: </p>**

    * **<font size='4'>In this section, the relationship between the top 5 complaint types and various housing characteristics was investigated.</font>**
    * **<font size='4'>Feature selection identified the features most strongly associated with the complaint type, including the year and month the complaint was created, the processing time, the location of the building (X/Y coordinates), the built floor area ratio, residential area, lot area, lot and building depth, and the years the building was built and altered.</font>**
    * **<font size='4'>The best model was a random forest, with a one-vs-rest ROC AUC of about 0.94 (0.945 in the grid search and 0.940 on the held-out split), and it could be used to predict the 5 types of complaints in the Bronx.</font>**
    * **<font size='4'>The NYC government could use the features mentioned above to predict future types of complaints, including general construction, heating, plumbing, and unsanitary conditions, and then plan improvements for the community.</font>**

# [Relationship between number of heating-related complaints and Housing characteristics](#toc) <a id='3'></a> 
## 1. Select only heating-related complaints in the Bronx, merge with the housing table, drop duplicated columns and NA rows

In [21]:
# 1) keep only the heating-related complaint types
# 2) group by incident_address and count the number of complaints for each address
# 3) merge with the housing characteristics table on address and sort by number of complaints in descending order
df_mix_heat = df_HPD[(df_HPD['complaint_type']=='HEAT/HOT WATER') | (df_HPD['complaint_type']=='HEATING')]\
                .groupby('incident_address').count()['unique_key'].to_frame().reset_index()\
                .merge(df_BX,left_on='incident_address',right_on='Address')\
                .sort_values(by='unique_key',ascending=False)
# add the columns of 'created_year','created_month','processed_days' to the number of heating-related complaints table
df_mix_heat = df_mix_heat.merge(df_HPD.loc[:,['incident_address','created_year','created_month','processed_days']]
                  .drop_duplicates(subset='incident_address'),on='incident_address')
# drop the address columns
df_mix_heat.drop(columns=['incident_address','Address'],inplace=True)
# drop rows containing NA
df_mix_heat = df_mix_heat[~df_mix_heat.isnull().any(axis=1)]
df_mix_heat

Unnamed: 0,unique_key,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,...,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord,created_year,created_month,processed_days
0,7112,7,10463,21320,54001,54000,0,0,1,5.0,...,0,2.53,3.44,0.0,4.8,1012722.0,261446.0,2014,3,1
1,5763,54,10472,12319,61500,61500,0,0,1,6.0,...,0,4.99,2.43,0.0,4.8,1018133.0,239710.0,2019,10,0
2,3001,34,10451,28444,122800,111800,8000,3000,1,6.0,...,0,4.32,6.02,0.0,6.5,1005800.0,240335.0,2014,3,4
3,2791,19,10458,7150,34320,34320,0,0,1,6.0,...,0,4.80,3.00,0.0,3.0,1016326.0,256197.0,2019,11,0
4,2503,1,10462,45000,174400,174400,0,0,1,6.0,...,0,3.88,3.44,0.0,4.8,1021763.0,249982.0,2019,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17018,1,51,10461,2500,3912,3912,0,0,1,3.0,...,0,1.56,2.43,0.0,4.8,1025367.0,246144.0,2018,1,7
17019,1,36,10466,1700,1216,1216,0,0,2,2.0,...,0,0.72,0.90,0.0,2.0,1025539.0,262316.0,2017,11,9
17020,1,82,10466,1700,1216,1216,0,0,1,2.0,...,0,0.72,0.90,0.0,2.0,1025531.0,262298.0,2012,12,7
17021,1,7,10466,1700,1880,1880,0,0,1,3.0,...,0,1.11,0.90,0.0,2.0,1027780.0,263911.0,2012,12,4


In [22]:
# standardize features
from sklearn.preprocessing import StandardScaler
dfX_heat = pd.DataFrame(StandardScaler().fit_transform(
                        df_mix_heat.drop(columns=['unique_key'])),
                        columns=df_mix_heat.drop(columns=['unique_key']).columns)
dfX_heat

Unnamed: 0,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,...,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord,created_year,created_month,processed_days
0,-0.166909,0.066116,0.276339,0.636689,0.806425,-0.037310,-0.141576,-0.188814,0.761401,1.453047,...,-0.452324,0.158175,0.648025,-0.186322,0.637770,-0.677566,1.355243,-0.509892,-0.900955,-0.084902
1,-0.085358,1.466940,0.092914,0.773515,0.968685,-0.037310,-0.141576,-0.188814,1.256323,-0.111350,...,-0.452324,1.141524,-0.032378,-0.186322,0.637770,0.046558,-1.076808,1.648351,0.724266,-0.094392
2,-0.120061,-1.801649,0.421513,1.891989,2.056907,0.385305,0.855667,-0.188814,1.256323,2.053140,...,-0.452324,0.873701,2.386084,-0.186322,1.693062,-1.603899,-1.006876,-0.509892,-0.900955,-0.056431
3,-0.146088,-0.712120,-0.012421,0.277592,0.380656,-0.037310,-0.141576,-0.188814,1.256323,0.133087,...,-0.452324,1.065574,0.351612,-0.186322,-0.479598,-0.195262,0.767930,1.648351,0.956440,-0.094392
4,-0.177320,-0.089531,0.758895,2.833477,3.411236,-0.037310,-0.141576,-0.188814,1.256323,2.333020,...,-0.452324,0.697818,0.648025,-0.186322,0.637770,0.532341,0.072531,1.648351,0.956440,-0.094392
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16992,-0.090563,-0.245178,-0.107180,-0.277230,-0.277210,-0.037310,-0.141576,-0.188814,-0.228444,-0.111350,...,-0.452324,-0.229568,-0.032378,-0.186322,0.637770,1.014645,-0.356904,1.216702,-1.365303,-0.027960
16993,-0.116590,0.533057,-0.123482,-0.326421,-0.335537,-0.037310,-0.141576,0.800480,-0.723366,-0.478005,...,-0.452324,-0.565346,-1.063087,-0.186322,-1.100357,1.037663,1.452588,0.785053,0.956440,-0.008979
16994,-0.036774,0.533057,-0.123482,-0.326421,-0.335537,-0.037310,-0.141576,-0.188814,-0.723366,-0.478005,...,-0.452324,-0.565346,-1.063087,-0.186322,-1.100357,1.036592,1.450574,-1.373190,1.188615,-0.027960
16995,-0.166909,0.533057,-0.123482,-0.314305,-0.321172,-0.037310,-0.141576,-0.188814,-0.228444,-0.478005,...,-0.452324,-0.409449,-1.063087,-0.186322,-1.100357,1.337563,1.631053,-1.373190,1.188615,-0.056431


## 2. [Feature selection](#toc) <a id='3.2'></a>

In [23]:
# the correlation matrix shows no highly correlated feature pairs (|r| > 0.95)
corr_mat_heat = dfX_heat.corr()
corr_mat_heat.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord,created_year,created_month,processed_days
Lot,1.0,-0.0103693,0.083482,0.174747,0.181272,0.00944881,0.140863,0.138991,0.135064,0.0440068,0.00608057,0.0175853,-0.0248164,0.129962,0.0250662,0.0296181,0.0339864,-0.0480576,0.00946423,0.0362009,0.00439235,-0.0036227
ZipCode,-0.0103693,1.0,-0.00859659,-0.0953434,-0.101774,-0.0156465,-0.0549193,0.0521724,-0.270419,0.00476181,-0.149161,0.0714889,-0.23091,-0.207161,-0.504062,-0.0987543,-0.486496,0.647174,0.315151,-0.0473786,-0.0393471,-0.0290731
LotArea,0.083482,-0.00859659,1.0,0.613661,0.371727,0.689763,0.0628017,0.472536,0.163437,0.420256,0.182181,0.00654544,0.0507982,0.0206006,0.023811,0.0156556,0.0336093,-0.0162902,0.00534795,0.039896,-0.00201279,0.0142033
BldgArea,0.174747,-0.0953434,0.613661,1.0,0.869274,0.426755,0.175169,0.502244,0.552185,0.51702,0.415325,0.0287127,0.129038,0.477106,0.185297,0.0389349,0.190472,-0.152677,-0.0140306,0.100084,0.0362219,0.0176594
ResArea,0.181272,-0.101774,0.371727,0.869274,1.0,0.0467097,0.0831973,0.46422,0.581677,0.420269,0.322382,0.0272467,0.122066,0.549272,0.204857,-0.00853094,0.198133,-0.167525,-0.0092163,0.1059,0.0564776,0.0197151
OfficeArea,0.00944881,-0.0156465,0.689763,0.426755,0.0467097,1.0,0.0603793,0.239798,0.0511724,0.231584,0.207478,0.0024523,0.0349626,0.00841883,0.0152192,0.05217,0.0265703,-0.0134343,-0.00947848,0.0174622,-0.02111,-0.000426428
RetailArea,0.140863,-0.0549193,0.0628017,0.175169,0.0831973,0.0603793,1.0,0.0159746,0.0930638,0.131772,0.157462,-0.00318113,0.0868552,0.0803097,0.0886397,0.179053,0.107365,-0.0629638,-0.0278842,0.0252138,0.00114499,0.00625118
NumBldgs,0.138991,0.0521724,0.472536,0.502244,0.46422,0.239798,0.0159746,1.0,0.0065774,0.232469,0.0274616,0.0695086,0.0131881,0.2239,-0.0714174,-0.0136285,-0.0645767,0.0809499,0.0367112,-0.00163382,-0.00116715,-0.000791753
NumFloors,0.135064,-0.270419,0.163437,0.552185,0.581677,0.0511724,0.0930638,0.0065774,1.0,0.294226,0.463902,0.138294,0.240043,0.474192,0.394751,0.0245805,0.385362,-0.370478,-0.0690658,0.146392,0.0804526,0.0141182
LotDepth,0.0440068,0.00476181,0.420256,0.51702,0.420269,0.231584,0.131772,0.232469,0.294226,1.0,0.447087,0.00709194,0.0649718,0.0270568,0.030873,0.035697,0.0469296,-0.0353499,0.0168437,0.0337033,0.00119508,0.0316894
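
Reading a 22 x 22 heat map by eye makes it easy to miss a strong pair. The following is a minimal sketch of listing such pairs programmatically. The helper name `top_correlated_pairs`, the 0.6 threshold, and the synthetic `demo` frame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.6):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so every pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# synthetic stand-in for dfX_heat (hypothetical data, not the PLUTO values)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({'BldgArea': base,
                     'ResArea': base + 0.3 * rng.normal(size=500),
                     'Lot': rng.normal(size=500)})
strong_pairs = top_correlated_pairs(demo)
print(strong_pairs)
```

On the real data the same call would be `top_correlated_pairs(dfX_heat)`, assuming `dfX_heat` is the standardized feature frame used above.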


In [24]:
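# feature selection 1) filter method: compute the ANOVA F-value per feature and keep features with p < 0.05
# note: f_classif treats each distinct complaint count in y_heat as a class label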
from sklearn.feature_selection import SelectKBest, f_classif
X_heat = dfX_heat.values
y_heat = df_mix_heat['unique_key'].values
fvalue_selector = SelectKBest(score_func=f_classif,k=13)
fit_fvalue = fvalue_selector.fit(X_heat,y_heat)
print(np.sort(fit_fvalue.scores_))
print(np.sort(fit_fvalue.pvalues_))
X_name_heat = dfX_heat.columns[fit_fvalue.pvalues_<0.05]
print(X_name_heat)

[ 0.03829484  0.16675284  0.41965651  0.56007573  0.62459767  0.83140759
  0.86047831  0.88620286  0.90767344  1.46332946  2.59799438  2.7852032
  3.00354622  3.25744462  4.28165523  5.36285114  5.52963099  6.89124188
  7.25300008  8.05140574  9.91839597 18.90474716]
[0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 0.00000000e+000 1.01733452e-247 2.19319137e-236 4.52769051e-164
 5.40154135e-099 9.01347142e-084 4.21010937e-071 1.30013274e-060
 1.46517301e-009 9.14928680e-001 9.56331328e-001 9.83078186e-001
 9.95314905e-001 1.00000000e+000 1.00000000e+000 1.00000000e+000
 1.00000000e+000 1.00000000e+000]
Index(['ZipCode', 'BldgArea', 'ResArea', 'NumFloors', 'LotDepth', 'BldgDepth',
       'YearAlter1', 'BuiltFAR', 'ResidFAR', 'FacilFAR', 'XCoord',
       'created_year', 'created_month'],
      dtype='object')
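
Because the target here is a count rather than a class label, a regression-oriented score function may be a more natural fit. The sketch below uses scikit-learn's `f_regression` on synthetic data. It is an alternative to the cell above, not the method this notebook used, and `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
# hypothetical numeric target driven by columns 0 and 2 only
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=400)

# univariate F-test for a linear relationship between each feature and y
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected_idx = np.flatnonzero(selector.get_support())
print('selected columns:', selected_idx)
print('p-values:', selector.pvalues_)
```

Unlike the cell above, the scores and p-values are printed in column order, so each value can be matched to its feature.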


In [25]:
# feature selection 2) embedded method: Ridge regression 
from sklearn.linear_model import Ridge
X_heat = dfX_heat[X_name_heat].values
fit_heat = Ridge().fit(X_heat,y_heat)
print('The fitting coefficients are: ',fit_heat.coef_)
print('R2: ',fit_heat.score(X_heat,y_heat))
X_name_heat = [X_name_heat[idx] for idx in np.argsort(abs(fit_heat.coef_))[::-1]][:6]
print('The 6 most relevant features: ', X_name_heat)

The fitting coefficients are:  [  1.97262181 -11.22023667  17.53571959  14.50681964  -3.69516739
  13.38613307  -1.75403743  -0.15030333   8.36018427  -2.77627076
   0.84561891  11.45557924   4.81808541]
R2:  0.08829418819121591
The 6 most relevant features:  ['ResArea', 'NumFloors', 'BldgDepth', 'created_year', 'BldgArea', 'ResidFAR']
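
Ranking by absolute coefficient is only meaningful when the features share a common scale, which appears to be the case here since `dfX_heat` looks standardized. A minimal sketch of the same ranking that keeps the feature names attached to the coefficients, using synthetic data and hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
cols = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=cols)
y_demo = (5 * X_demo['f1'] - 3 * X_demo['f3'] + 0.5 * X_demo['f0']
          + rng.normal(scale=0.5, size=300))

ridge = Ridge(alpha=1.0).fit(X_demo.values, y_demo.values)
# a Series keeps each coefficient paired with its feature name
ranking = pd.Series(ridge.coef_, index=cols).abs().sort_values(ascending=False)
top2 = list(ranking.index[:2])
print(ranking)
```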


## 3. [Model selection & evaluation](#toc) <a id='3.3'></a>

In [26]:
# train-test split on the 6 selected features
from sklearn.model_selection import train_test_split
X_heat = dfX_heat[X_name_heat].values
X_heat_train, X_heat_test, y_heat_train, y_heat_test = train_test_split(X_heat, y_heat, test_size=0.33, random_state=0)

In [27]:
# model selection & evaluation using GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
model_heat = {}
model_name = ['ridge','randomforest','adaboost','svr']
reg = [Ridge(),RandomForestRegressor(random_state=0),AdaBoostRegressor(random_state=0),SVR()]
# note: np.linspace(0,3,5) starts at 0, which is invalid for AdaBoost's learning_rate
# and for SVR's C (both must be > 0), so those candidates fail to fit
param = [{'alpha': np.linspace(0,3,5)}, 
         {'n_estimators': [10,100], 
          'max_features': [3,'auto',None]},  # 'auto' equals None here and is removed in newer scikit-learn
         {'n_estimators': [10,100], 
          'learning_rate': np.linspace(0,3,5)}, 
         {'kernel': ['linear','rbf'], 
          'C': np.linspace(0,3,5)}
         ]
for idx, regres, params in zip(model_name,reg,param):
    grid_heat = GridSearchCV(estimator=regres,param_grid=params,verbose=3,n_jobs=-3)
    model_heat[idx] = grid_heat.fit(X_heat_train,y_heat_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-3)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-3)]: Done  23 out of  25 | elapsed:    2.1s remaining:    0.1s
[Parallel(n_jobs=-3)]: Done  25 out of  25 | elapsed:    2.1s finished
[Parallel(n_jobs=-3)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-3)]: Done  30 out of  30 | elapsed:    9.1s remaining:    0.0s
[Parallel(n_jobs=-3)]: Done  30 out of  30 | elapsed:    9.1s finished


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-3)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-3)]: Done  28 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-3)]: Done  50 out of  50 | elapsed:    1.3s finished
[Parallel(n_jobs=-3)]: Using backend LokyBackend with 6 concurrent workers.


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-3)]: Done  28 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-3)]: Done  50 out of  50 | elapsed:   38.4s finished
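
The search above can also be written with an explicit `Pipeline`, so that scaling is refit inside every cross-validation fold. The following is a sketch under stated assumptions: the data is synthetic, the step names `scale` and `reg` are arbitrary, and the grids use strictly positive values only. It does not reproduce the numbers logged above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# one (estimator, grid) pair per model; grid keys are prefixed with the step name
candidates = {
    'ridge': (Ridge(), {'reg__alpha': np.logspace(-2, 1, 4)}),
    'svr': (SVR(), {'reg__kernel': ['linear', 'rbf'], 'reg__C': [0.1, 1.0, 3.0]}),
}
searches = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()), ('reg', estimator)])
    searches[name] = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='r2').fit(X_demo, y_demo)

best_name = max(searches, key=lambda n: searches[n].best_score_)
print(best_name, searches[best_name].best_params_)
```

Passing `scoring='r2'` explicitly makes the selection metric visible instead of relying on each estimator's default `score` method.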


In [28]:
# pick the model whose grid search reached the highest cross-validated best_score_
model_name = ['ridge','randomforest','adaboost','svr']
best_model = model_name[np.argmax([model_heat[idx].best_score_ for idx in model_name])]
print('Best model: ', model_heat[best_model].best_estimator_)

Best model:  Ridge(alpha=3.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)


In [29]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=10,shuffle=True,random_state=0)
cv_res = cross_val_score(model_heat[best_model].best_estimator_,X_heat_test,y_heat_test,scoring='r2',cv=kf)
print('R2 for fitting the data with the', best_model, 'model is: ', cv_res.mean())

R2 for fitting the data with the ridge model is:  0.11707350956826405
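
Note that `cross_val_score` refits the estimator on folds of the test set, so the figure above is a cross-validated score on the held-out third rather than a score for the model trained on the training split. A minimal sketch of the more conventional check (fit on train, score once on test), using synthetic data and hypothetical variable names:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)
model = Ridge(alpha=3.0).fit(X_tr, y_tr)   # fit on the training split only
test_r2 = r2_score(y_te, model.predict(X_te))  # score once on unseen data
print('held-out R2:', test_r2)
```

With the notebook's objects, the equivalent call would be `model_heat[best_model].best_estimator_.score(X_heat_test, y_heat_test)`, since `GridSearchCV` refits the best estimator on the full training split by default.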


- **<p style='font-size: 1.8em; color: red'>Conclusion IV: </p>**

    * **<font size='4'>In this section, the relationship between the number of heating-related complaints (identified in Q1&Q2) and various housing characteristics was investigated.</font>**
    * **<font size='4'>Feature selection identified the six features most strongly associated with the number of heating complaints: residential area, number of floors, building depth, the year the case was created, building area, and residential FAR.</font>**
    * **<font size='4'>The best model was found to be ridge regression (alpha = 3.0), with an R2 score of about 0.12.</font>**
    * **<font size='4'>This R2 score is low, so the current features predict the number of heating-related complaints poorly, and collecting additional features is recommended. Although the correlations are weak, the government could still target the features listed above to improve community living conditions.</font>**