# 1. Context and data upload

Data comes from [Seattle Open Data](https://data.seattle.gov/Community/2021-Building-Energy-Benchmarking/bfsh-nrm6).

The first step is to connect to the source. The data is available via the [Socrata Open Data API (SODA)](https://dev.socrata.com/), which returns [1000 rows by default](https://dev.socrata.com/docs/paging.html). Since the dataset contains 3663 rows, I set the maximum number of rows to retrieve to a value greater than 3663.

In [1]:
import pandas as pd

link_source = 'https://data.seattle.gov/resource/bfsh-nrm6.csv'
rows_to_retrieve = '?$limit=5000'

init_data = pd.read_csv(link_source+rows_to_retrieve)

# Let's have a look at the shape:
print("This dataset contains {} rows and {} columns".format(
    init_data.shape[0], init_data.shape[1]))

This dataset contains 3663 rows and 42 columns
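
For a dataset too large to fetch in one request, the same endpoint could be read page by page. This is a minimal sketch, assuming the `$limit` and `$offset` paging parameters described in the SODA paging documentation. `build_page_urls` and the page size are hypothetical names and values chosen for illustration; they are not part of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.seattle.gov/resource/bfsh-nrm6.csv'


def build_page_urls(base_url, total_rows, page_size=1000):
    """Return one URL per page, using the $limit/$offset parameters."""
    urls = []
    for offset in range(0, total_rows, page_size):
        # safe='$' keeps the '$' prefix readable instead of encoding it
        query = urlencode({'$limit': page_size, '$offset': offset}, safe='$')
        urls.append(f'{base_url}?{query}')
    return urls


page_urls = build_page_urls(BASE_URL, total_rows=3663)
print(len(page_urls))
print(page_urls[0])
```

Each URL could then be passed to `pd.read_csv` and the pieces combined with `pd.concat`. For 3663 rows, a single request with a larger `$limit` (as done above) is simpler.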


In [2]:
# Display all the columns for analysis
pd.set_option('display.max_columns', init_data.shape[1]+1)
# Set the column width (defaults to 50)
#pd.set_option('display.max_colwidth', 40)

init_data.head()

Unnamed: 0,osebuildingid,datayear,buildingname,buildingtype,taxparcelidentificationnumber,address,city,state,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeuiwn_kbtu_sf,siteeui_kbtu_sf,siteenergyuse_kbtu,siteenergyusewn_kbtu,sourceeuiwn_kbtu_sf,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
0,1,2021,MAYFLOWER PARK HOTEL,NonResidential,659000030,405 OLIVE WAY,SEATTLE,WA,98101,47.6122,-122.33799,DOWNTOWN,1,1927,12,1,88434,88434,0,78,73.800003,73.599998,6510477,6522024,144.399994,144.199997,Hotel,Hotel,88434,,0,,0,944955,1798672,14876,Compliant,No Issue,3224187,1487620,241.6,2.7
1,2,2021,PARAMOUNT HOTEL,NonResidential,659000220,724 PINE ST,SEATTLE,WA,98101,47.61307,-122.33361,DOWNTOWN,1,1996,11,1,103566,88502,15064,96,52.200001,52.200001,4617104,4617104,99.099998,99.099998,Hotel,Hotel,88502,Parking,15064,,0,657478,0,23738,Compliant,No Issue,2243315,2373790,135.4,1.5
2,3,2021,WESTIN HOTEL (Parent Building),NonResidential,659000475,1900 5TH AVE,SEATTLE,WA,98101,47.61367,-122.33822,DOWNTOWN,1,1969,41,3,956110,759392,196718,96,46.5,46.5,43953212,43953212,105.300003,105.300003,Hotel,Hotel,945349,Parking,117783,Swimming Pool,0,8673722,10583473,37750,Compliant,No Issue,29594739,3775000,1201.4,1.6
3,5,2021,HOTEL MAX,NonResidential,659000640,620 STEWART ST,SEATTLE,WA,98101,47.61412,-122.33664,DOWNTOWN,1,1926,10,1,61320,61320,0,76,79.5,79.5,4873753,4873753,136.0,136.0,Hotel,Hotel,61320,,0,,0,509497,1167770,19676,Compliant,No Issue,1738403,1967580,208.6,3.4
4,8,2021,WARWICK SEATTLE HOTEL,NonResidential,659000970,401 LENORA ST,SEATTLE,WA,98121,47.61375,-122.34047,DOWNTOWN,1,1980,18,1,175580,113580,62000,90,92.400002,92.0,11358936,11409090,161.5,161.100006,Hotel,Hotel,123445,Parking,68009,Swimming Pool,0,1333597,0,68087,Compliant,No Issue,4550233,6808700,380.4,3.3


# 2. Overview of the data and pre-selection of columns

First, it is important to note that only the __non-residential__ data should be used here.

The aim here is to predict the __CO2 emissions__ and the __total energy consumption__ for the non-residential buildings. It should be based on structural data (size and usage of the buildings, construction date, location, etc).
- It is interesting to note that: SiteEnergyUse(kBtu) ~= Electricity(kBtu) + SteamUse(kBtu) + NaturalGas(kBtu)
- The sum is not exactly equal (this can be verified by creating a calculated column), possibly because of rounding errors.
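
The calculated column mentioned in the list could be sketched as follows. The three rows are copied from the `head()` output in section 1 (OSE building IDs 1, 2 and 5). The column names `energy_sum_kbtu` and `gap_kbtu` are hypothetical and only used for this illustration.

```python
import pandas as pd

# Values copied from the first rows displayed in section 1
sample = pd.DataFrame({
    'siteenergyuse_kbtu': [6510477, 4617104, 4873753],
    'electricity_kbtu': [3224187, 2243315, 1738403],
    'steamuse_kbtu': [1798672, 0, 1167770],
    'naturalgas_kbtu': [1487620, 2373790, 1967580],
})

parts = ['electricity_kbtu', 'steamuse_kbtu', 'naturalgas_kbtu']
# Calculated column: sum of the three energy sources
sample['energy_sum_kbtu'] = sample[parts].sum(axis=1)
# Gap between the sum of the sources and the reported total
sample['gap_kbtu'] = sample['energy_sum_kbtu'] - sample['siteenergyuse_kbtu']

print(sample[['siteenergyuse_kbtu', 'energy_sum_kbtu', 'gap_kbtu']])
```

On these three rows the gap is only a few kBtu, which is consistent with rounding. The same check on the full dataset would show whether larger gaps exist.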

It is also important to evaluate the usefulness of the __ENERGY STAR Score__ here, and its predictive power.

In [3]:
init_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3663 entries, 0 to 3662
Data columns (total 42 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   osebuildingid                   3663 non-null   int64  
 1   datayear                        3663 non-null   int64  
 2   buildingname                    3663 non-null   object 
 3   buildingtype                    3663 non-null   object 
 4   taxparcelidentificationnumber   3663 non-null   object 
 5   address                         3663 non-null   object 
 6   city                            3663 non-null   object 
 7   state                           3663 non-null   object 
 8   zipcode                         3663 non-null   int64  
 9   latitude                        3663 non-null   float64
 10  longitude                       3663 non-null   float64
 11  neighborhood                    3661 non-null   object 
 12  councildistrictcode             36

In [4]:
# First, let's remove the duplicate rows (if any).

# Make a copy:
copy_dedup = init_data.copy()

print(f'Rows before dedup: {init_data.shape[0]}')
# Ignore_index to reset the index:
copy_dedup.drop_duplicates(inplace=True, ignore_index=True)
print(f'Rows after dedup: {copy_dedup.shape[0]}')

Rows before dedup: 3663
Rows after dedup: 3663


There are none, great!

## 2.1. Removing the "Weather Normalized" columns

- I remove these columns because I'm not exactly sure what "weather normalized" means or how the values were calculated.

In [5]:
wn_cols = ['siteeuiwn_kbtu_sf', 'siteenergyusewn_kbtu', 'sourceeuiwn_kbtu_sf']

init_no_wn = init_data.drop(columns=wn_cols)
init_no_wn.shape

(3663, 39)

## 2.2. Selecting the non-residential rows

In [6]:
# Let's check the BuildingType
col_ = 'buildingtype'

init_no_wn[col_].unique()

array(['NonResidential', 'Nonresidential COS', 'Multifamily MR (5-9)',
       'SPS-District K-12', 'Campus', 'Multifamily LR (1-4)',
       'Multifamily HR (10+)', 'Nonresidential WA'], dtype=object)

In [7]:
# Let's drop the rows that are residential
# I'm assuming SPS stands for Seattle Public Schools
to_drop = ['Multifamily MR (5-9)', 'Multifamily LR (1-4)',
           'Multifamily HR (10+)']

nonRes_data = init_no_wn[~init_no_wn[col_].isin(to_drop)]
nonRes_data.shape

(1700, 39)

## 2.3. Selecting the rows for CO2 emissions, total energy consumption and ENERGY STAR Score

In [8]:
# Define the variables to check
co2_indic = 'totalghgemissions'
energy_indic = 'siteenergyuse_kbtu'
energy_star = 'energystarscore'

In [9]:
# Let's count how many rows have at least 1 of the 3 variables missing
len(nonRes_data[nonRes_data[co2_indic].isna() |
                nonRes_data[energy_indic].isna() |
                nonRes_data[energy_star].isna()])

0

Looks like there are none, great!
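
One caveat: `isna()` only catches true missing values. Several rows displayed further down in this notebook carry 0 in these columns, sometimes together with the compliance issue 'Missing 2021 EUI or Electricity Data', so 0 may be acting as a placeholder. The sketch below counts zeros on three rows copied from the outputs of this notebook (OSE building IDs 1, 136 and 223). Treating 0 as missing is an assumption that would need to be checked against the dataset documentation.

```python
import pandas as pd

# Three rows copied from outputs in this notebook
sample = pd.DataFrame({
    'totalghgemissions': [241.6, 0.0, 41.8],
    'siteenergyuse_kbtu': [6510477, 0, 2581250],
    'energystarscore': [78, 0, 0],
})

# Number of zeros per column
zero_counts = (sample == 0).sum()
print(zero_counts)

# Number of rows with at least one zero among the three variables
rows_with_zero = (sample == 0).any(axis=1).sum()
print(rows_with_zero)
```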

## 2.4. 'Object' columns

In [10]:
# Now, let's investigate the 'object' columns and their categories
obj_data = nonRes_data.select_dtypes(include='object')

# Print the shape and show a sample
print("This dataset contains {} rows and {} columns\n".format(
    obj_data.shape[0], obj_data.shape[1]))
obj_data.head()

This dataset contains 1700 rows and 13 columns



Unnamed: 0,buildingname,buildingtype,taxparcelidentificationnumber,address,city,state,neighborhood,epapropertytype,largestpropertyusetype,secondlargestpropertyusetype,thirdlargestpropertyusetype,compliancestatus,complianceissue
0,MAYFLOWER PARK HOTEL,NonResidential,659000030,405 OLIVE WAY,SEATTLE,WA,DOWNTOWN,Hotel,Hotel,,,Compliant,No Issue
1,PARAMOUNT HOTEL,NonResidential,659000220,724 PINE ST,SEATTLE,WA,DOWNTOWN,Hotel,Hotel,Parking,,Compliant,No Issue
2,WESTIN HOTEL (Parent Building),NonResidential,659000475,1900 5TH AVE,SEATTLE,WA,DOWNTOWN,Hotel,Hotel,Parking,Swimming Pool,Compliant,No Issue
3,HOTEL MAX,NonResidential,659000640,620 STEWART ST,SEATTLE,WA,DOWNTOWN,Hotel,Hotel,,,Compliant,No Issue
4,WARWICK SEATTLE HOTEL,NonResidential,659000970,401 LENORA ST,SEATTLE,WA,DOWNTOWN,Hotel,Hotel,Parking,Swimming Pool,Compliant,No Issue


### Building Name

In [11]:
# This might not be relevant here, unless the aim is to "point out" some
# buildings that are "outliers" and help them reduce whatever is not good.
# I just count the number of different building names:
col_ = 'buildingname'

len(obj_data[col_].unique())

1669

There are 1669 distinct names for 1700 rows, so some names are duplicated. This could be because several buildings share a brand name without sharing the same address.

In [12]:
# Then, I drop the column:
sub_df_1 = nonRes_data.drop(columns=col_)
sub_df_1.shape

(1700, 38)

### Tax Parcel ID

In [13]:
# Let's also count the number of different parcel ID:
col_ = 'taxparcelidentificationnumber'

len(obj_data[col_].unique())

1602

In [14]:
# Let's investigate the duplicated rows:
obj_data[obj_data.duplicated([col_])].head()

Unnamed: 0,buildingname,buildingtype,taxparcelidentificationnumber,address,city,state,neighborhood,epapropertytype,largestpropertyusetype,secondlargestpropertyusetype,thirdlargestpropertyusetype,compliancestatus,complianceissue
109,UNION GOSPEL MISSION ASSN / HOPE PLACE,NonResidential,7378600265,3802 S OTHELLO ST,SEATTLE,WA,GREATER DUWAMISH,Other - Lodging/Residential,Other - Lodging/Residential,Office,Food Service,Compliant,No Issue
128,WESTWOOD VILLAGE - BLDG B,NonResidential,3624039009,2600 SW BARTON ST,SEATTLE,WA,DELRIDGE NEIGHBORHOODS,Retail Store,Retail Store,,,Not Compliant,Portfolio Manager Account Not Shared
147,CITY LIGHT - SOUTH SERVICE CENTER BLDG B,Nonresidential COS,7666205660,400 S SPOKANE ST,SEATTLE,WA,GREATER DUWAMISH,Other - Utility,Other - Utility,"Repair Services (Vehicle, Shoe, Locksmith, etc.)",,Compliant,No Issue
197,WESTLAKE TOWER,NonResidential,9301500000,1601 5TH AVE,SEATTLE,WA,DOWNTOWN,Office,Office,Parking,,Compliant,No Issue
252,IBM BUILDING,NonResidential,2400002,1200 5TH AVE,SEATTLE,WA,DOWNTOWN,Office,Office,Parking,Other - Restaurant/Bar,Compliant,No Issue


In [15]:
# Pick one ID to investigate:
nonRes_data[nonRes_data[col_]=='3624039009']

Unnamed: 0,osebuildingid,datayear,buildingname,buildingtype,taxparcelidentificationnumber,address,city,state,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
127,223,2021,WESTWOOD VILLAGE - BLDG A,NonResidential,3624039009,2600 SW BARTON ST,SEATTLE,WA,98126,47.52254,-122.36627,DELRIDGE NEIGHBORHOODS,1,1966,1,1,64984,64984,0,0,39.599998,2581250,93.800003,Retail Store,Retail Store,65195,,0,,0,570470,0,6348,Not Compliant,Portfolio Manager Account Not Shared,1946444,634806,41.8,0.6
128,224,2021,WESTWOOD VILLAGE - BLDG B,NonResidential,3624039009,2600 SW BARTON ST,SEATTLE,WA,98126,47.52254,-122.36627,DELRIDGE NEIGHBORHOODS,1,1965,1,1,67745,67745,0,0,109.900002,7444998,144.699997,Retail Store,Retail Store,67763,,0,,0,332386,0,63109,Not Compliant,Portfolio Manager Account Not Shared,1134101,6310897,339.9,5.0
2130,24636,2021,WESTWOOD VILLAGE - BUILDING E,NonResidential,3624039009,2600 SW BARTON ST,SEATTLE,WA,98126,47.52254,-122.36627,DELRIDGE NEIGHBORHOODS,1,1990,1,1,40265,40265,0,42,60.099998,2452712,149.0,Retail Store,Retail Store,40793,,0,,0,586728,0,4508,Not Compliant,Portfolio Manager Account Not Shared,2001916,450796,32.2,0.8
2131,24640,2021,WESTWOOD VILLAGE - BUILDING F,NonResidential,3624039009,2600 SW BARTON ST,SEATTLE,WA,98126,47.52254,-122.36627,DELRIDGE NEIGHBORHOODS,1,2005,1,1,26208,26208,0,74,48.799999,1581233,104.800003,Retail Store,Retail Store,32427,,0,,0,291104,0,5880,Not Compliant,Portfolio Manager Account Not Shared,993247,587986,35.3,1.3


This could be investigated further, but it looks like these rows are different buildings (or parts of a building) that share the same parcel.

Also, the parcel ID should not be a meaningful feature as-is: each row has its own "energy use" value, so the rows can be considered separate.
- In case this assumption is wrong, add the feature back or find a way to create a new feature that could be used.
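
If the parcel turned out to matter, one possible derived feature is the number of benchmarked rows sharing a parcel. This is only a sketch: `rows_on_parcel` is a hypothetical column name, and the toy frame combines the first hotel row with the four Westwood Village rows shown above.

```python
import pandas as pd

col_parcel = 'taxparcelidentificationnumber'

# Toy frame: one single-row parcel and the four Westwood Village rows
sample = pd.DataFrame({
    'osebuildingid': [1, 223, 224, 24636, 24640],
    col_parcel: ['659000030', '3624039009', '3624039009',
                 '3624039009', '3624039009'],
})

# Hypothetical feature: number of rows sharing the same parcel ID
sample['rows_on_parcel'] = (
    sample.groupby(col_parcel)[col_parcel].transform('size'))
print(sample)
```

With such a column in place, the raw parcel ID could still be dropped, since the count keeps the "shared parcel" information in numerical form.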

In [16]:
# I also remove this column:
sub_df_2 = sub_df_1.drop(columns=col_)
sub_df_2.shape

(1700, 37)

### City

In [17]:
# Check the City
col_ = 'city'

obj_data[col_].unique()

array(['SEATTLE', '0', 'Seattle'], dtype=object)

In [18]:
# This is to be expected considering the data (from Seattle).
# I can drop this column:
to_drop = 'city'

sub_df_3 = sub_df_2.drop(columns=to_drop)
sub_df_3.shape

(1700, 36)

### State

In [19]:
# Check the State
col_ = 'state'

obj_data[col_].unique()

array(['WA', '0'], dtype=object)

In [20]:
# Same thing here. I can drop this column:
to_drop = 'state'

sub_df_4 = sub_df_3.drop(columns=to_drop)
sub_df_4.shape

(1700, 35)

### Neighborhood

In [21]:
# Check the Neighborhood
col_ = 'neighborhood'

obj_data[col_].unique()

array(['DOWNTOWN', 'SOUTHEAST', 'NORTHEAST', 'EAST',
       'SHARED: CENTRAL & EAST', 'NORTH', 'MAGNOLIA / QUEEN ANNE',
       'LAKE UNION', 'GREATER DUWAMISH', 'BALLARD', 'NORTHWEST',
       'CENTRAL', 'SOUTHWEST', 'DELRIDGE NEIGHBORHOODS',
       'SHARED: NORTH & NORTHWEST',
       'SHARED: GREATER DUWAMISH & DELRIDGE NEIGHBORHOODS',
       'SHARED: BALLARD & LAKE UNION', 'SHARED: LAKE UNION & NORTHWEST',
       'SHARED: BALLARD & NORTHWEST', 'water', nan], dtype=object)

Depending on domain knowledge, some of these neighborhoods (the 'SHARED: ...' values in particular) could be grouped together.

The 'water' and missing values will need a fix.
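
One possible fix is sketched below, assuming that 'water' is not a usable neighborhood and can be treated like a missing value. Both are mapped to 'UNK', the same placeholder used for the property types later in this notebook. Other choices, such as inferring the neighborhood from latitude/longitude, may well be better.

```python
import pandas as pd

# Toy series with the two problematic cases
sample = pd.Series(['DOWNTOWN', 'water', None, 'BALLARD'],
                   name='neighborhood')

# mask() blanks out the 'water' entries, then fillna() maps every
# missing value (original or blanked) to 'UNK'
cleaned = sample.mask(sample == 'water').fillna('UNK')
print(cleaned.tolist())
```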

### EPAPropertyType (primary use of a property)

In [22]:
# Check the EPA
col_ = 'epapropertytype'

obj_data[col_].unique()

array(['Hotel', 'Police Station', 'Other - Entertainment/Public Assembly',
       'Library', 'Fitness Center/Health Club/Gym', 'Mixed Use Property',
       'Courthouse', 'Prison/Incarceration', 'K-12 School',
       'College/University', 'Office', 'Self-Storage Facility',
       'Other - Mall', nan, 'Parking', 'Medical Office', 'Other',
       'Social/Meeting Hall', 'Performing Arts', 'Data Center',
       'Supermarket/Grocery Store', 'Multifamily Housing',
       'Hospital (General Medical & Surgical)', 'Fire Station', 'Museum',
       'Repair Services (Vehicle, Shoe, Locksmith, etc)',
       'Worship Facility', 'Other - Lodging/Residential',
       'Non-Refrigerated Warehouse', 'Retail Store', 'Financial Office',
       'Manufacturing/Industrial Plant', 'Other - Utility',
       'Residence Hall/Dormitory', 'Enclosed Mall', 'Laboratory',
       'Convention Center', 'Outpatient Rehabilitation/Physical Therapy',
       'Distribution Center', 'Other/Specialty Hospital',
       'Other - S

In [23]:
# That's a long list that I might rework (do some grouping), but not for now.
# I see some housing/residential related rows, let's investigate:

# Define the "abnormal" values:
# --> Is a 'Hotel' considered Residential?
residential_list = ['Multifamily Housing', 'Other - Lodging/Residential',
                    'Residence Hall/Dormitory', 'Residential Care Facility']

# Let's count how many rows there are, and show a sample:
print('There are {} abnormal rows\nHere\'s a sample:\n'.format(
    len(sub_df_4[sub_df_4[col_].isin(residential_list)])))
sub_df_4[sub_df_4[col_].isin(residential_list)].head()

There are 33 abnormal rows
Here's a sample:



Unnamed: 0,osebuildingid,datayear,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
86,136,2021,NonResidential,14115 AURORA AVE N,98133,47.73141,-122.3458,NORTHWEST,1,2001,4,1,51390,51390,0,0,0.0,0,0.0,Multifamily Housing,Multifamily Housing,51390,,0,,0,0,0,0,Not Compliant,Missing 2021 EUI or Electricity Data,1166524,0,0.0,0.0
89,139,2021,NonResidential,13300 STONE AVE N,98133,47.72529,-122.34115,NORTHWEST,1,2002,3,1,69138,69138,0,0,0.0,0,0.0,Multifamily Housing,Multifamily Housing,69138,,0,,0,0,0,0,Not Compliant,Missing 2021 EUI or Electricity Data,1528017,0,0.0,0.0
109,180,2021,NonResidential,3802 S OTHELLO ST,98118,47.53722,-122.284,GREATER DUWAMISH,2,2009,5,1,89821,89821,0,0,16.299999,1448529,18.200001,Other - Lodging/Residential,Other - Lodging/Residential,46483,Office,37938,Food Service,4579,16696,0,13916,Compliant,No Issue,56967,1391560,74.1,0.8
178,294,2021,NonResidential,1118 5TH AVE,98101,47.6079,-122.33266,DOWNTOWN,7,1913,9,1,107572,107572,0,84,39.599998,4020211,79.5,Residence Hall/Dormitory,Residence Hall/Dormitory,63934,Office,27143,Retail Store,10568,595237,1989262,0,Compliant,No Issue,2030949,0,173.5,1.6
627,19451,2021,NonResidential,2611 S DEARBORN ST,98144,47.59505,-122.29824,CENTRAL,1,1963,3,1,66402,66402,0,0,95.199997,6322400,151.699997,Residential Care Facility,Residential Care Facility,66402,Parking,0,,0,574886,0,43609,Compliant,No Issue,1961512,4360890,239.7,3.6


In [24]:
# Without further knowledge, I'll assume this is a misclassification of the
# BuildingType and drop these rows:
sub_df_5 = sub_df_4[~sub_df_4[col_].isin(residential_list)]
sub_df_5.shape

(1667, 35)

In [25]:
# I also replace the 'nan' by 'UNK' (Unknown)
sub_df_6 = sub_df_5.copy()
sub_df_6[col_] = sub_df_5[col_].fillna('UNK')

### Largest Property Use Type

In [26]:
# Check the Largest Property Use Type
col_ = 'largestpropertyusetype'

obj_data[col_].unique()

array(['Hotel', 'Police Station', 'Other - Entertainment/Public Assembly',
       'Library', 'Fitness Center/Health Club/Gym', 'Social/Meeting Hall',
       'Courthouse', 'Prison/Incarceration', 'K-12 School',
       'College/University', 'Office', 'Self-Storage Facility',
       'Other - Mall', 'Senior Living Community', 'Parking',
       'Medical Office', 'Other', 'Performing Arts', 'Data Center',
       'Supermarket/Grocery Store', 'Multifamily Housing',
       'Hospital (General Medical & Surgical)', 'Fire Station', 'Museum',
       'Repair Services (Vehicle, Shoe, Locksmith, etc.)',
       'Worship Facility', 'Other - Lodging/Residential',
       'Non-Refrigerated Warehouse', 'Retail Store', 'Financial Office',
       'Manufacturing/Industrial Plant', 'Other - Utility',
       'Transportation Terminal/Station', 'Residence Hall/Dormitory',
       'Laboratory', 'Convention Center', 'Restaurant', 'Enclosed Mall',
       nan, 'Outpatient Rehabilitation/Physical Therapy',
       'Distr

In [27]:
# Same issue here.
# I see some housing/residential related rows, let's investigate:

# Define the "abnormal" values:
# --> Is a 'Hotel' considered Residential?
residential_list = ['Senior Living Community', 'Multifamily Housing',
                    'Other - Lodging/Residential', 'Residence Hall/Dormitory',
                    'Residential Care Facility']

# Let's count how many rows there are, and show a sample:
print('There are {} abnormal rows\nHere\'s a sample:'.format(
    len(sub_df_6[sub_df_6[col_].isin(residential_list)])))
sub_df_6[sub_df_6[col_].isin(residential_list)].head()

There are 26 abnormal rows
Here's a sample:


Unnamed: 0,osebuildingid,datayear,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
57,84,2021,NonResidential,4831 35TH AVE SW,98126,47.55837,-122.37751,SOUTHWEST,1,1922,5,1,371257,296313,74944,44,151.800003,44967996,222.800003,UNK,Senior Living Community,296313,Parking,0,,0,3149638,0,342214,Compliant,No Issue,10746565,34221431,1862.0,6.3
58,85,2021,NonResidential,13023 GREENWOOD AVE N,98133,47.72426,-122.35599,NORTHWEST,1,1970,2,1,93397,93397,0,1,167.0,15596807,273.5,UNK,Senior Living Community,93397,,0,,0,1535177,0,103588,Compliant,No Issue,5238024,10358780,571.8,6.1
163,278,2021,NonResidential,802 S DEARBORN ST,98134,47.59626,-122.3218,DOWNTOWN,1,1999,5,1,102796,90185,12611,0,85.300003,8770639,205.0,UNK,Senior Living Community,55259,Medical Office,40091,K-12 School,3231,1986257,0,19935,Not Compliant,Default Data,6777109,1993530,133.9,1.5
864,20145,2021,NonResidential,909 BOREN AVE,98104,47.60897,-122.32507,EAST,3,2006,6,1,62334,51457,10877,0,58.299999,2996646,134.800003,UNK,Senior Living Community,47805,Parking,10877,Bank Branch,3552,632587,0,8383,Compliant,No Issue,2158386,838260,53.5,1.0
941,20377,2021,NonResidential,2821 S WALDEN ST,98144,47.57188,-122.29531,SOUTHEAST,1,1981,3,1,58665,58665,0,50,96.900002,7595482,177.399994,UNK,Senior Living Community,78374,Parking,0,,0,993130,0,42069,Compliant,No Issue,3388561,4206920,237.4,4.0


In [28]:
# Without further knowledge, I'll assume this is a misclassification and
# drop these rows:
sub_df_7 = sub_df_6[~sub_df_6[col_].isin(residential_list)]
sub_df_7.shape

(1641, 35)

In [29]:
# Check the missing values
sub_df_7[sub_df_7[col_].isna()].head()

Unnamed: 0,osebuildingid,datayear,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
211,340,2021,NonResidential,600 1ST AVE,98104,47.60219,-122.33347,DOWNTOWN,1,1900,6,1,89355,89355,0,0,0.0,0,0.0,UNK,,0,,0,,0,0,0,0,Not Compliant,Portfolio Manager Account Not Shared,0,0,0.0,0.0
325,473,2021,NonResidential,500 BOREN AVE N,98109,47.62383,-122.3356,LAKE UNION,7,2009,5,1,271071,172371,98700,0,0.0,0,0.0,UNK,,0,,0,,0,0,0,0,Contact the Help Desk,Unknown - Contact the Help Desk,0,0,0.0,0.0
352,514,2021,NonResidential,12525 AURORA AVE N,98133,47.72053,-122.34739,NORTHWEST,1,1997,1,1,131387,131387,0,0,0.0,0,0.0,UNK,,0,,0,,0,0,0,0,Contact the Help Desk,Unknown - Contact the Help Desk,0,0,0.0,0.0
402,576,2021,NonResidential,818 NE NORTHGATE WAY,98125,47.7094,-122.31843,SHARED: NORTH & NORTHWEST,1,1969,2,1,98539,98539,0,0,0.0,0,0.0,UNK,,0,,0,,0,0,0,0,Not Compliant,Portfolio Manager Account Not Shared,0,0,0.0,0.0
609,851,2021,NonResidential,2120 S JACKSON ST,98144,47.5997,-122.30364,CENTRAL,1,1974,4,2,114000,114000,0,0,0.0,0,0.0,UNK,,0,,0,,0,0,0,0,Not Compliant,Portfolio Manager Account Not Shared,0,0,0.0,0.0


In [30]:
# Since the EPA property type is also unknown for these rows, I fill with UNK too
sub_df_8 = sub_df_7.copy()
sub_df_8[col_] = sub_df_7[col_].fillna('UNK')

### 2nd Largest Property Use Type

In [31]:
# Check the 2nd Largest Property Use Type
col_ = 'secondlargestpropertyusetype'

obj_data[col_].unique()

array([nan, 'Parking', 'Office', 'Restaurant', 'Barracks', 'K-12 School',
       'Laboratory', 'Non-Refrigerated Warehouse', 'Other - Education',
       'Vocational School', 'Hotel', 'Retail Store',
       'Personal Services (Health/Beauty, Dry Cleaning, etc)',
       'Other - Entertainment/Public Assembly', 'Medical Office',
       'Enclosed Mall', 'Performing Arts', 'Data Center',
       'Residence Hall/Dormitory', 'Distribution Center',
       'Repair Services (Vehicle, Shoe, Locksmith, etc.)',
       'Movie Theater', 'Multifamily Housing', 'Other',
       'Transportation Terminal/Station', 'Other - Services',
       'Other - Recreation', 'Food Service', 'College/University',
       'Other - Restaurant/Bar', 'Supermarket/Grocery Store',
       'Adult Education', 'Fitness Center/Health Club/Gym',
       'Refrigerated Warehouse', 'Library',
       'Outpatient Rehabilitation/Physical Therapy',
       'Manufacturing/Industrial Plant', 'Social/Meeting Hall',
       'Bar/Nightclub', 'Self

In [32]:
# Same issue here.
# I see some housing/residential related rows, let's investigate:

# Define the "abnormal" values:
# --> Is a 'Hotel' considered Residential?
residential_list = ['Residence Hall/Dormitory', 'Multifamily Housing',
                    'Other - Lodging/Residential', 'Residential Care Facility',
                    'Senior Living Community']

# Let's count how many rows there are, and show a sample:
print('There are {} abnormal rows\nHere\'s a sample:'.format(
    len(sub_df_8[sub_df_8[col_].isin(residential_list)])))
sub_df_8[sub_df_8[col_].isin(residential_list)].head()

There are 15 abnormal rows
Here's a sample:


Unnamed: 0,osebuildingid,datayear,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
137,234,2021,Nonresidential COS,802 ROY ST,98109,47.62552,-122.34064,MAGNOLIA / QUEEN ANNE,1,1926,2,1,50292,50292,0,0,0.0,0,0.0,Non-Refrigerated Warehouse,Non-Refrigerated Warehouse,38693,Residence Hall/Dormitory,5000,Office,4730,0,0,0,Compliant,No Issue,133200,0,0.0,0.0
155,264,2021,NonResidential,516 1ST AVE W,98119,47.6239,-122.35756,MAGNOLIA / QUEEN ANNE,1,2002,4,1,82061,42882,39179,0,51.200001,7619336,143.300003,Supermarket/Grocery Store,Supermarket/Grocery Store,76268,Multifamily Housing,72564,,0,2233100,0,0,Not Compliant,Account Requires Verification,7619337,0,31.5,0.7
467,657,2021,NonResidential,2200 WESTLAKE AVE,98121,47.61783,-122.33729,DOWNTOWN,1,2006,0,1,516407,516407,0,93,67.0,34618024,128.100006,Mixed Use Property,Parking,385196,Multifamily Housing,340236,Hotel,80067,4988754,0,175964,Compliant,No Issue,17021629,17596400,1005.0,1.9
492,694,2021,NonResidential,3100 AIRPORT WAY S,98134,47.57654,-122.32047,GREATER DUWAMISH,1,2007,1,23,213696,213696,0,41,45.5,5880402,119.0,Office,Office,80157,Multifamily Housing,34302,Non-Refrigerated Warehouse,14900,1544537,0,6104,Compliant,No Issue,5269960,610440,54.2,0.3
741,19801,2021,NonResidential,2225 1ST AVE,98121,47.61294,-122.34638,DOWNTOWN,1,1909,3,1,30700,30700,0,0,30.6,939527,54.900002,Office,Office,19980,Other - Lodging/Residential,10720,,0,117239,0,5395,Compliant,No Issue,400018,539510,30.3,1.0


This part would definitely require domain knowledge on how to best handle these rows:
- they could be removed because we do not want any residential building at all,
- they could be removed based on a certain % of the total surface (say, if at least 20% is residential then the row gets removed),
- they could be kept because the main/largest use type is not residential,
- the residential part could be subtracted from the total of each row, which would also require adjusting the energy/CO2 values of the rows (this _seems_ feasible but is too complicated for now),
- etc
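
As an illustration of the percentage-based option, here is a minimal sketch on four rows copied from the sample above (OSE building IDs 234, 264, 694 and 19801). It assumes that `secondlargestpropertyuse` holds the floor area of the 2nd largest use, and the 20% threshold is the arbitrary value mentioned in the list.

```python
import pandas as pd

residential_list = ['Residence Hall/Dormitory', 'Multifamily Housing',
                    'Other - Lodging/Residential', 'Residential Care Facility',
                    'Senior Living Community']

# Four rows copied from the sample above
sample = pd.DataFrame({
    'propertygfatotal': [50292, 82061, 213696, 30700],
    'secondlargestpropertyusetype': ['Residence Hall/Dormitory',
                                     'Multifamily Housing',
                                     'Multifamily Housing',
                                     'Other - Lodging/Residential'],
    # Assumption: this column is the floor area of the 2nd largest use
    'secondlargestpropertyuse': [5000, 72564, 34302, 10720],
})

threshold = 0.20  # hypothetical cut-off, to be set with domain knowledge

is_residential = sample['secondlargestpropertyusetype'].isin(residential_list)
share = sample['secondlargestpropertyuse'] / sample['propertygfatotal']

# Keep a row unless its residential share reaches the threshold
kept = sample[~(is_residential & (share >= threshold))]
print(kept.shape)
```

With this rule, the rows where the residential part is roughly 10% and 16% of the total surface are kept, and the two rows above 20% are removed.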

For now, I'll just pick an easy solution because it's only 15 rows: I'll drop them.

In [33]:
# Drop the rows:
sub_df_9 = sub_df_8[~sub_df_8[col_].isin(residential_list)]
sub_df_9.shape

(1626, 35)

In [34]:
# I also replace the 'nan' by 'UNK'
sub_df_10 = sub_df_9.copy()
sub_df_10[col_] = sub_df_9[col_].fillna('UNK')

### 3rd Largest Property Use Type

In [35]:
# Check for the completion of the column
col_ = 'thirdlargestpropertyusetype'

nb_na = sub_df_10[col_].isna().sum()
length = len(sub_df_10[col_])

print(f'Completion of the column: {(1 - nb_na/length) * 100:.2f}%')

Completion of the column: 21.16%


In [36]:
# This column is too sparse, so I just drop it.
# I consider the Largest and 2nd Largest Property Use Types to carry enough
# information, so I don't create a new column encoding the missingness:
sub_df_11 = sub_df_10.drop(columns=col_)
sub_df_11.shape

(1626, 34)

### Compliance Status & Issue

In [37]:
# Check the Compliance Status
col_ = 'compliancestatus'

obj_data[col_].unique()

array(['Compliant', 'Not Compliant', 'Contact the Help Desk'],
      dtype=object)

In [38]:
# Check the Compliance Issue
col_ = 'complianceissue'

obj_data[col_].unique()

array(['No Issue', 'Missing 2021 EUI or Electricity Data',
       'Portfolio Manager Account Not Shared',
       'Account Requires Verification', 'Default Data',
       'Unknown - Contact the Help Desk'], dtype=object)

In [39]:
# I'll assume these columns are not helpful for the prediction, so I drop them:
to_drop = ['compliancestatus', 'complianceissue']

sub_df_12 = sub_df_11.drop(columns=to_drop)
sub_df_12.shape

(1626, 32)

In [40]:
# Finally, let's reset the index
sub_df_13 = sub_df_12.reset_index(drop=True)

## 2.5. Numerical columns

In [41]:
# Let's select the non-'object' (numerical) columns
num_data = sub_df_13.select_dtypes(exclude='object')

# Print the shape and show a sample
print("This dataset contains {} rows and {} columns\n".format(
    num_data.shape[0], num_data.shape[1]))
num_data.head()

This dataset contains 1626 rows and 26 columns



Unnamed: 0,osebuildingid,datayear,zipcode,latitude,longitude,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,largestpropertyusetypegfa,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
0,1,2021,98101,47.6122,-122.33799,1,1927,12,1,88434,88434,0,78,73.599998,6510477,144.199997,88434,0,0,944955,1798672,14876,3224187,1487620,241.6,2.7
1,2,2021,98101,47.61307,-122.33361,1,1996,11,1,103566,88502,15064,96,52.200001,4617104,99.099998,88502,15064,0,657478,0,23738,2243315,2373790,135.4,1.5
2,3,2021,98101,47.61367,-122.33822,1,1969,41,3,956110,759392,196718,96,46.5,43953212,105.300003,945349,117783,0,8673722,10583473,37750,29594739,3775000,1201.4,1.6
3,5,2021,98101,47.61412,-122.33664,1,1926,10,1,61320,61320,0,76,79.5,4873753,136.0,61320,0,0,509497,1167770,19676,1738403,1967580,208.6,3.4
4,8,2021,98121,47.61375,-122.34047,1,1980,18,1,175580,113580,62000,90,92.0,11358936,161.100006,123445,68009,0,1333597,0,68087,4550233,6808700,380.4,3.3


### OSE Building ID (unique ID)

In [42]:
# Check the unique ID
col_ = 'osebuildingid'

# This should not raise an error (otherwise, some IDs are duplicated):
assert len(num_data[col_].unique()) == num_data.shape[0]

In [43]:
# The column should not have any predictive power, and can/should be dropped:
sub_df_14 = sub_df_13.drop(columns=col_)
sub_df_14.shape

(1626, 31)

### Data Year

In [44]:
# Check the year
col_ = 'datayear'

num_data[col_].unique()

array([2021])

In [45]:
# This is to be expected considering the data. I can drop this column:
sub_df_15 = sub_df_14.drop(columns=col_)
sub_df_15.shape

(1626, 30)

### Zipcode

In [46]:
# Check the Zipcode
col_ = 'zipcode'

num_data[col_].unique()

array([98101, 98121, 98104, 98118, 98105, 98112, 98125, 98109, 98103,
       98108, 98199, 98115, 98107, 98144, 98119, 98122, 98146, 98106,
       98133, 98126, 98134, 98117, 98136, 98116, 98177, 98102, 98155,
       98178, 98195,     0])

In [47]:
# Count the rows where the zipcode is '0' and display them:
print(f'There are {len(num_data[num_data[col_] == 0])} rows with 0 as Zipcode:')
num_data[num_data[col_]==0]

There are 15 rows with 0 as Zipcode:


Unnamed: 0,osebuildingid,datayear,zipcode,latitude,longitude,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,largestpropertyusetypegfa,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
1473,49770,2021,0,47.65034,-122.30907,1,1960,10,1,1739802,1615291,124511,59,201.600006,325567936,397.799988,1615291,124511,0,46049940,168030736,4148,157122395,414810,14619.9,9.1
1535,50148,2021,0,47.66148,-122.31569,4,2016,7,1,180257,125562,54695,86,66.5,8345792,133.100006,125562,54695,0,1331356,0,38032,4542587,3803207,220.8,1.8
1536,50150,2021,0,47.62009,-122.34799,1,2016,8,1,142805,114736,28069,95,58.0,6673531,114.400002,115050,28069,0,1030574,0,31572,3516318,3157210,182.2,1.6
1537,50160,2021,0,47.61523,-122.33835,1,2016,37,2,1496961,1076961,420000,95,39.900002,45050740,101.199997,1023298,448625,59672,11209760,0,68030,38247701,6803030,519.5,0.5
1538,50166,2021,0,47.6479,-122.33814,1,2016,4,1,338989,220902,118087,67,57.900002,12152586,152.699997,205727,117684,2839,3230227,0,11310,11021535,1131050,105.7,0.5
1539,50170,2021,0,47.70598,-122.33486,5,2016,2,1,44745,44745,0,95,38.299999,1711505,107.300003,44667,0,0,501613,0,0,1711505,0,7.1,0.2
1542,50192,2021,0,47.61632,-122.33304,1,2016,21,1,489821,361575,128246,89,42.299999,15805341,116.800003,373458,110813,0,4525949,0,3628,15442538,362800,83.2,0.2
1543,50193,2021,0,47.66306,-122.3002,1,2016,4,1,47406,47406,0,11,65.099998,2927390,182.100006,45000,0,0,857969,0,0,2927390,0,12.1,0.3
1544,50194,2021,0,47.66673,-122.38309,6,2016,4,1,37100,37100,0,93,32.400002,1114496,79.0,34350,0,0,258570,0,2323,882241,232260,16.0,0.4
1545,50195,2021,0,47.63849,-122.37695,1,2016,1,1,25065,25065,0,98,51.599998,1329135,144.5,25763,0,0,389547,0,0,1329135,0,5.5,0.2


In [48]:
# I could try to deduce the Zipcode through the address or the TaxParcelID,
# but I'll go with the easy (maybe not optimal) solution here: drop the rows
sub_df_16 = sub_df_15[sub_df_15[col_]!=0]
sub_df_16.shape

(1611, 30)
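
As an alternative to dropping the rows: the rows with a 0 zipcode shown above still have valid coordinates, so one rough option would be to borrow the zipcode of the nearest building that has one. This is a different technique from the address or TaxParcelID lookup mentioned in the comment, and only a sketch: the frame `toy`, its values and the helper `nearest_zip` are hypothetical, and a nearest-neighbour guess can be wrong near zipcode boundaries.

```python
import pandas as pd

# Made-up rows: two with a known zipcode, one with the 0 placeholder
toy = pd.DataFrame({
    'zipcode':   [98101, 98105, 0],
    'latitude':  [47.6122, 47.6564, 47.6615],
    'longitude': [-122.3380, -122.3104, -122.3157],
})

known = toy[toy['zipcode'] != 0]

def nearest_zip(row):
    # Squared distance in degrees: crude, but enough to rank close neighbours
    d2 = ((known['latitude'] - row['latitude']) ** 2
          + (known['longitude'] - row['longitude']) ** 2)
    return known.loc[d2.idxmin(), 'zipcode']

# Fill only the rows where the zipcode is the 0 placeholder
missing = toy['zipcode'] == 0
toy.loc[missing, 'zipcode'] = toy[missing].apply(nearest_zip, axis=1)
print(toy)
```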

### Latitude and Longitude
* Latitude should range between -90 and +90
* Longitude should range between -180 and +180
* Seattle coordinates should be around lat. +47 / long. -122

In [49]:
col_lat = 'latitude'
col_long = 'longitude'

# Check the latitude
print('Latitude:\n- Min: {}\n- Max: {}'.format(num_data[col_lat].min(),
                                               num_data[col_lat].max()))
print('\nLongitude:\n- Min: {}\n- Max: {}'.format(num_data[col_long].min(),
                                                  num_data[col_long].max()))

Latitude:
- Min: 0.0
- Max: 47.73387

Longitude:
- Min: -122.41182
- Max: 0.0


In [50]:
# Let's investigate the rows where the Latitude is less than 40:
sub_df_16[sub_df_16[col_lat]<40]

Unnamed: 0,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
1492,NonResidential,625 SW 100TH ST,98146,0.0,0.0,,0,2010,2,1,44162,44162,0,0,36.200001,1597700,77.0,Pre-school/Daycare,Pre-school/Daycare,44162,Parking,15761,0,288311,0,6140,983716,613980,36.7,0.8


In [51]:
# Let's also investigate the rows where the Longitude is greater than -120:
sub_df_16[sub_df_16[col_long]>-120]

Unnamed: 0,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
1492,NonResidential,625 SW 100TH ST,98146,0.0,0.0,,0,2010,2,1,44162,44162,0,0,36.200001,1597700,77.0,Pre-school/Daycare,Pre-school/Daycare,44162,Parking,15761,0,288311,0,6140,983716,613980,36.7,0.8


In [52]:
# Looks like this row needs some fixing, but I'll just remove it:
sub_df_17 = sub_df_16[sub_df_16[col_lat]>40]
sub_df_17.shape

(1610, 30)
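
The filter above only tests the latitude. A slightly stricter variant checks both coordinates against a box around the city. A minimal sketch, assuming deliberately loose bounds that I picked by hand (they are not official city limits) and a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    'latitude':  [47.6122, 0.0, 47.7339],
    'longitude': [-122.3380, 0.0, -122.4118],
})

# Hypothetical, loose bounds around Seattle (assumed, not official)
LAT_MIN, LAT_MAX = 47.0, 48.0
LONG_MIN, LONG_MAX = -123.0, -122.0

# between() is inclusive on both ends by default
in_box = (toy['latitude'].between(LAT_MIN, LAT_MAX)
          & toy['longitude'].between(LONG_MIN, LONG_MAX))
print(toy[in_box])
```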

### CouncilDistrictCode

In [53]:
# Check the code
col_ = 'councildistrictcode'

num_data[col_].unique()

array([1, 7, 3, 4, 2, 6, 5, 0])

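Codes 1 to 7 match Seattle's seven council districts. The 0 is not a valid district: it appears at least on the row with 0.0 coordinates shown above, so it looks like a placeholder for a missing value.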

### Year Built

In [54]:
# Check the Year Built
col_ = 'yearbuilt'

num_data[col_].unique()

array([1927, 1996, 1969, 1926, 1980, 1999, 1904, 1998, 1928, 1922, 2004,
       1930, 1983, 1907, 1916, 1985, 1961, 2001, 1991, 1955, 1978, 1949,
       1989, 1906, 1994, 1992, 1990, 1950, 1900, 1954, 1911, 1973, 1920,
       1910, 1982, 1908, 2000, 1997, 1962, 2008, 2016, 1965, 2010, 1938,
       1986, 1970, 2002, 1923, 2003, 1957, 1964, 1941, 1929, 1963, 1959,
       2006, 1915, 1958, 2011, 2007, 1951, 1953, 1952, 1960, 1937, 1966,
       1968, 1925, 1924, 2005, 1931, 1972, 1914, 1995, 1981, 1976, 2009,
       1909, 1971, 1988, 1979, 1947, 1984, 1956, 1912, 1977, 1921, 1913,
       1945, 1974, 1975, 1946, 1967, 1987, 1932, 1948, 1993, 1918, 1905,
       1902, 1940, 1939, 1944, 1917, 1942, 1903, 2012, 2013, 1919, 2017,
       1901, 2019, 1936, 1935, 2014, 1896, 2015, 2018, 2020])

In [55]:
# Check for the min and max values
print('Construction year:\n- Oldest: {}\n- Most recent: {}'.format(
    num_data[col_].min(), num_data[col_].max()))

Construction year:
- Oldest: 1896
- Most recent: 2020


Seems normal.

### Number of floors

In [56]:
# Check the # of floors
col_ = 'numberoffloors'

num_data[col_].unique()

array([12, 11, 41, 10, 18,  2,  8, 15, 25,  9, 33,  6, 28,  5, 19,  7,  1,
        3,  4, 24, 20, 34,  0, 16, 23, 17, 36, 22, 47, 29, 14, 49, 37, 42,
       63, 13, 21, 55, 46, 30, 56, 27, 76, 39, 44, 45, 38])

In [57]:
# Check for the min and max values
print('Number of floors:\n- Min: {}\n- Max: {}'.format(
    num_data[col_].min(), num_data[col_].max()))

Number of floors:
- Min: 0
- Max: 76


The range is plausible, although a value of 0 floors is probably a placeholder rather than a real count.

### Number of buildings

In [58]:
# Check the # of buildings
col_ = 'numberofbuildings'

num_data[col_].unique()

array([  1,   3,   2,   0,  27,  10,  11,  16,   4,   8,   6,   5,  39,
        25,  14,   7, 111])

In [59]:
# There are some abnormal values here. Let's check the value 0:
num_data[num_data[col_]==0]

Unnamed: 0,osebuildingid,datayear,zipcode,latitude,longitude,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,largestpropertyusetypegfa,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
86,141,2021,98126,47.53284,-122.37493,1,1926,3,0,44324,44324,0,0,34.299999,1742902,68.400002,50742,0,0,275093,0,8043,938617,804280,46.6,1.1
87,142,2021,98119,47.65115,-122.36037,1,1927,2,0,82746,82746,0,0,106.400002,8804385,194.399994,82746,0,0,1145847,0,48948,3909630,4894760,276.1,3.3
315,486,2021,98109,47.623,-122.35589,1,1964,4,0,127735,70135,57600,93,48.700001,3414394,99.5,70135,61100,0,568560,0,14745,1939925,1474470,86.3,1.2
431,631,2021,98104,47.59685,-122.33342,1,1904,4,0,83400,83400,0,0,0.0,0,0.0,83400,0,0,0,0,0,107602,0,0.0,0.0
593,865,2021,98144,47.58203,-122.29854,1,1972,1,0,166014,166014,0,39,34.900002,5800354,92.5,166014,0,0,1551185,0,5077,5292643,507710,48.9,0.3
652,19892,2021,98101,47.61206,-122.33785,1,1926,4,0,27000,27000,0,0,104.300003,2815636,237.300003,16200,5400,5400,577754,0,8443,1971298,844340,53.0,2.0
681,20216,2021,98199,47.66304,-122.39169,1,1926,3,0,48560,48560,0,0,51.200001,3059511,143.199997,59805,0,0,896691,0,0,3059511,0,12.7,0.3
691,20394,2021,98134,47.57411,-122.33715,1,1959,1,0,32192,32192,0,0,137.899994,7123873,196.100006,51442,168,35,443343,0,56112,1512686,5611190,304.3,9.5
741,20835,2021,98102,47.64596,-122.32628,1,1989,3,0,76245,42420,33825,57,54.700001,2250568,153.300003,41117,33825,0,659604,0,0,2250568,0,9.3,0.2
798,21227,2021,98133,47.72362,-122.34897,1,1997,1,0,20595,20595,0,0,80.5,1656904,128.199997,20595,0,0,150980,0,11418,515144,1141760,62.8,3.0


In [60]:
# This seems to be a mistake. The value could either be 1, or 10, or something
# else that I could fill using an imputer, but for now I just drop these rows:
sub_df_18 = sub_df_17[sub_df_17[col_]!=0]
sub_df_18.shape

(1592, 30)
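
For reference, the imputation option mentioned in the comment could look like the sketch below. It treats 0 as missing and fills it with the column median. The frame `toy` is made up, and the median is only one possible fill strategy, not necessarily the right one for this dataset.

```python
import pandas as pd

toy = pd.DataFrame({'numberofbuildings': [1, 0, 3, 1, 0, 1]})

col = toy['numberofbuildings']
# Treat 0 as missing: mask() replaces the matching entries with NaN
as_missing = col.mask(col == 0)
# median() skips NaN by default, so only the valid counts are used
filled = as_missing.fillna(as_missing.median()).astype(int)
print(filled.tolist())
```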

In [61]:
# Now let's also investigate the value 111:
init_data[init_data[col_]==111]

Unnamed: 0,osebuildingid,datayear,buildingname,buildingtype,taxparcelidentificationnumber,address,city,state,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeuiwn_kbtu_sf,siteeui_kbtu_sf,siteenergyuse_kbtu,siteenergyusewn_kbtu,sourceeuiwn_kbtu_sf,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetype,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,compliancestatus,complianceissue,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
3235,49967,2021,UW - UNIVERSITY OF WASHINGTON SEATTLE CAMPUS (...,Campus,1625049001,4000 15TH AVE NE,SEATTLE,WA,98105,47.65644,-122.31041,NORTHEAST,1,1900,0,111,9320156,9320156,0,0,0.0,0.0,0,0,0.0,0.0,College/University,College/University,15216474,,0,,0,0,0,0,Compliant,No Issue,919167906,39449200,0.0,0.0


In [62]:
# This seems to be an outlier, and hopefully the buildings it includes are
# present somewhere in the other rows. I also drop it:
sub_df_19 = sub_df_18[sub_df_18[col_]!=111]
sub_df_19.shape

(1591, 30)

### Property G(ross) F(loor) A(rea)
- total
- building(s)
- parking

#### Property GFA total

In [71]:
# Select the GFA total
col_ = 'propertygfatotal'

# Approx. number of square feet in one m2 (divide an area in sf by it to get m2):
sf_to_m2 = 10.76

# Check for the min and max values
print('Prop. GFA total:\n- Min: {:.2f} m2\n- Max: {:.2f} m2'.format(
    num_data[col_].min()/sf_to_m2,
    num_data[col_].max()/sf_to_m2))

Prop. GFA total:
- Min: 1858.74 m2
- Max: 866185.50 m2
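
The 10.76 factor is an approximation. Since one foot is defined as exactly 0.3048 m, the conversion can also be written exactly. A small sketch (the names `SQFT_TO_M2` and `sqft_to_m2` are mine, not from the notebook):

```python
# 1 ft is defined as exactly 0.3048 m, so 1 sq ft = 0.3048**2 m2
SQFT_TO_M2 = 0.3048 ** 2

def sqft_to_m2(area_sqft):
    """Convert an area from square feet to square metres."""
    return area_sqft * SQFT_TO_M2

# Roughly 20,000 sf is what the minimum printed above corresponds to
print(round(sqft_to_m2(20000), 2))
```

The difference from the approximate factor is well under 0.1 %, so it does not change any conclusion here.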


#### Property GFA building(s)

In [74]:
# Select the GFA building(s)
col_ = 'propertygfabuilding_s'

# Approx. number of square feet in one m2 (divide an area in sf by it to get m2):
sf_to_m2 = 10.76

# Check for the min and max values
print('Prop. GFA building(s):\n- Min: {:.2f} m2\n- Max: {:.2f} m2'.format(
    num_data[col_].min()/sf_to_m2,
    num_data[col_].max()/sf_to_m2))

Prop. GFA building(s):
- Min: 1063.20 m2
- Max: 866185.50 m2


#### Property GFA parking

In [75]:
# Select the GFA parking
col_ = 'propertygfaparking'

# Approx. number of square feet in one m2 (divide an area in sf by it to get m2):
sf_to_m2 = 10.76

# Check for the min and max values
print('Prop. GFA parking:\n- Min: {:.2f} m2\n- Max: {:.2f} m2'.format(
    num_data[col_].min()/sf_to_m2,
    num_data[col_].max()/sf_to_m2))

Prop. GFA parking:
- Min: 0.00 m2
- Max: 63824.35 m2


#### Check that total = buildings + parking

In [79]:
# First make a copy of the data
pgfa_df = sub_df_19.copy()

In [80]:
p_total = 'propertygfatotal'
p_building = 'propertygfabuilding_s'
p_parking = 'propertygfaparking'

# Calculate the total of building + parking:
pgfa_df['computed_gfa_total'] = pgfa_df[p_building] + pgfa_df[p_parking]

In [85]:
# Now compare the two columns
sum(pgfa_df[p_total] != pgfa_df['computed_gfa_total'])

0

Every row matches, so the three GFA columns are consistent with each other.

### EnergySTARScore

In [78]:
# Select the score
col_ = 'energystarscore'

# Check for the min and max values
print('EnergySTARScore:\n- Min: {}\n- Max: {}'.format(
    num_data[col_].min(), num_data[col_].max()))

EnergySTARScore:
- Min: 0
- Max: 100


### Site Energy Use (kBtu)

In [87]:
# Select the Site Energy Use (kBtu)
col_ = 'siteenergyuse_kbtu'

# Check for the min and max values
print('Site Energy Use:\n- Min: {}\n- Max: {}'.format(
    num_data[col_].min(), num_data[col_].max()))

Site Energy Use:
- Min: 0
- Max: 518003488


In [89]:
# Check how many have no consumption:
len(sub_df_19[sub_df_19[col_]<1])

79

In [91]:
# Let's investigate a sample:
sub_df_19[sub_df_19[col_]<1].sample(n=10, random_state=23)

Unnamed: 0,buildingtype,address,zipcode,latitude,longitude,neighborhood,councildistrictcode,yearbuilt,numberoffloors,numberofbuildings,propertygfatotal,propertygfabuilding_s,propertygfaparking,energystarscore,siteeui_kbtu_sf,siteenergyuse_kbtu,sourceeui_kbtu_sf,epapropertytype,largestpropertyusetype,largestpropertyusetypegfa,secondlargestpropertyusetype,secondlargestpropertyuse,thirdlargestpropertyusetypegfa,electricity_kwh,steamuse_kbtu,naturalgas_therms,electricity_kbtu,naturalgas_kbtu,totalghgemissions,ghgemissionsintensity
602,NonResidential,6500 URSULA PL S,98108,47.54437,-122.30863,GREATER DUWAMISH,2,1962,1,1,44700,44700,0,0,0.0,0,0.0,Office,Office,47870,UNK,0,0,0,0,0,1517797,0,0.0,0.0
696,NonResidential,811 5TH AVE,98104,47.60547,-122.33118,DOWNTOWN,1,1908,5,1,41536,41536,0,0,0.0,0,0.0,UNK,UNK,0,UNK,0,0,0,0,0,0,0,0.0,0.0
1164,NonResidential,3228 1ST AVE S,98134,47.57459,-122.3339,GREATER DUWAMISH,1,1937,1,1,20975,20975,0,0,0.0,0,0.0,Other - Entertainment/Public Assembly,Other - Entertainment/Public Assembly,13144,Other,6600,1070,0,0,0,11957,11570,0.0,0.0
1148,NonResidential,4710 NE 70TH ST,98115,47.67984,-122.27739,NORTHEAST,1,1946,1,1,24295,24295,0,0,0.0,0,0.0,Worship Facility,Worship Facility,24295,Pre-school/Daycare,12147,0,0,0,0,127805,0,0.0,0.0
338,NonResidential,13050 AURORA AVE N,98133,47.72371,-122.34293,NORTHWEST,1,1997,1,1,50083,50083,0,0,0.0,0,0.0,Supermarket/Grocery Store,Supermarket/Grocery Store,27999,Retail Store,22143,0,0,0,0,3610756,0,0.0,0.0
1278,NonResidential,5201 1ST AVE S,98134,47.55523,-122.33448,GREATER DUWAMISH,1,1927,1,1,22622,22622,0,0,0.0,0,0.0,UNK,UNK,0,UNK,0,0,0,0,0,0,0,0.0,0.0
840,Nonresidential COS,201 THOMAS ST,98109,47.62082,-122.35251,MAGNOLIA / QUEEN ANNE,7,1992,2,1,40600,40600,0,0,0.0,0,0.0,Performing Arts,Performing Arts,36000,UNK,0,0,0,0,0,775897,655980,0.0,0.0
1073,NonResidential,2709 AIRPORT WAY S,98134,47.5793,-122.32226,GREATER DUWAMISH,2,1960,1,1,24109,24109,0,0,0.0,0,0.0,Manufacturing/Industrial Plant,Manufacturing/Industrial Plant,24109,Parking,0,0,0,0,0,0,0,0.0,0.0
1406,NonResidential,1520 13TH AVE,98122,47.61481,-122.31536,EAST,1,1920,2,1,23040,23040,0,0,0.0,0,0.0,UNK,UNK,0,UNK,0,0,0,0,0,0,0,0.0,0.0
334,NonResidential,12525 AURORA AVE N,98133,47.72053,-122.34739,NORTHWEST,1,1997,1,1,131387,131387,0,0,0.0,0,0.0,UNK,UNK,0,UNK,0,0,0,0,0,0,0,0.0,0.0
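
Several rows in this sample report a Site Energy Use of 0 while `electricity_kbtu` is not 0, which suggests the total is missing rather than truly zero. A minimal sketch to flag such rows, using three rows copied from the sample above (the frame `toy` and the variable names are mine):

```python
import pandas as pd

# Three rows copied from the sample above (only the relevant columns)
toy = pd.DataFrame({
    'address': ['6500 URSULA PL S', '811 5TH AVE', '13050 AURORA AVE N'],
    'siteenergyuse_kbtu': [0, 0, 0],
    'electricity_kbtu': [1517797, 0, 3610756],
    'naturalgas_kbtu': [0, 0, 0],
})

fuel_cols = ['electricity_kbtu', 'naturalgas_kbtu']
# Zero site total, yet at least one fuel column reports consumption
inconsistent = ((toy['siteenergyuse_kbtu'] == 0)
                & (toy[fuel_cols].sum(axis=1) > 0))
print(toy.loc[inconsistent, 'address'].tolist())
```

Rows flagged this way are candidates for recomputing the total from the fuel columns, while rows where everything is 0 carry no usable energy information.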


### Site E(nergy) U(se) I(ntensity) (kBtu/sf)
- Calculated as Site Energy Use / GFA
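
The formula can be checked against the first two rows of the numeric table. The bullet does not say which GFA is meant. On these two rows the building GFA (`propertygfabuilding_s`) reproduces the reported value, whereas the total GFA of the second row would give about 44.6. This is an observation on two rows only, not a confirmed definition.

```python
# (siteenergyuse_kbtu, propertygfabuilding_s, reported siteeui_kbtu_sf)
rows = [
    (6510477, 88434, 73.599998),
    (4617104, 88502, 52.200001),
]

for energy_kbtu, gfa_sf, reported_eui in rows:
    # Site EUI = Site Energy Use / GFA
    computed_eui = energy_kbtu / gfa_sf
    print(f'computed: {computed_eui:.1f}  reported: {reported_eui:.1f}')
```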

### Source EUI (kBtu/sf)
- Source Energy Use is the annual energy used to operate the property, including losses from generation, transmission, and distribution
- Calculated as Source Energy Use / GFA

### Largest Property Use Type (GFA)

### 2nd Largest Property Use Type

### 3rd Largest Property Use Type (GFA)

### Electricity:
- kwh
- kBtu

### SteamUse (kBtu)

### NaturalGas:
- therms
- kBtu
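
Electricity and natural gas each appear in two units, so one column of each pair is redundant. A quick sanity check on the first row of the table, assuming the usual conversion factors of about 3.412 kBtu per kWh and 100 kBtu per therm (the constant names are mine):

```python
KBTU_PER_KWH = 3.412    # approximate conversion factor
KBTU_PER_THERM = 100.0  # 1 therm = 100,000 Btu

# First row of the table: kWh, therms, and the reported kBtu values
electricity_kwh, electricity_kbtu = 944955, 3224187
naturalgas_therms, naturalgas_kbtu = 14876, 1487620

# Relative gap between the converted value and the reported one
elec_gap = abs(electricity_kwh * KBTU_PER_KWH - electricity_kbtu) / electricity_kbtu
gas_gap = abs(naturalgas_therms * KBTU_PER_THERM - naturalgas_kbtu) / naturalgas_kbtu
print(f'electricity gap: {elec_gap:.6f}, natural gas gap: {gas_gap:.6f}')
```

Both gaps are tiny and compatible with rounding in the source columns, so keeping only one unit per energy source should not lose information.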

### GHGemissions:
- total
- intensity