# ASHRAE Great Energy Predictor
### *by Jose Correa*

# Introduction


How much does it cost to cool a skyscraper in the summer?  A lot! And not just in dollars, but in environmental impact.

Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is: are the improvements working? Under pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model; however, current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don't work with different building types.

This machine learning analysis aims to develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings, at several different sites around the world, over a three-year timeframe. With better estimates of these energy-saving investments, large-scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.


*Dataset Description*

Assessing the value of energy efficiency improvements can be challenging, as there's no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled, the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower-cost financing.
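
The savings calculation described above amounts to subtracting the metered consumption from the counterfactual baseline. A minimal sketch with invented numbers follows; the values and variable names are illustrative assumptions and do not come from the dataset.

```python
# Hypothetical hourly values in kWh for a retrofitted building (invented numbers)
modeled_baseline = [120.0, 115.0, 130.0]  # counterfactual: predicted usage without the retrofit
metered_actual = [100.0, 98.0, 105.0]     # measured usage after the retrofit

# Savings per hour = counterfactual prediction minus actual consumption
hourly_savings = [b - a for b, a in zip(modeled_baseline, metered_actual)]
total_savings = sum(hourly_savings)
print(hourly_savings, total_savings)
```

Any error in the modeled baseline carries straight into the savings figure, which is why the accuracy of the counterfactual model matters.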

This competition challenges you to build these counterfactual models across four energy types based on historic usage rates and observed weather. The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

<a id='toc'></a>

# Table of Contents
1. [Basic Data Wrangling](#bdw-0)
    1. [Data processing](#bdw-1)
    2. [Shape of the data frame](#bdw-2)
    3. [Numeric and categorical columns distribution](#bdw-3)
    4. [Duplicate values](#bdw-4)
    5. [Null values](#bdw-5)
2. [Exploratory Data Analysis (EDA)](#eda-0)
    1. [Relationship between..???](#eda-1)
3. [Statistical Analysis](#sa-0)
    1. [Statistically significant difference..](#sa-1)
    2. [Correlation with ??](#sa-2)
4. [Advanced Statistical Analysis](#asa-0)
    1. [Linear regression for ??](#asa-1)
    2. [Logistic regression for ??](#asa-2)
5. [Conclusion](#concl)

<a id='bdw-0'></a>

# 1. Basic Data Wrangling

<a id='bdw-1'></a>

## 1.1. Data processing

In [1]:
# Import Python packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Setting figure size
plt.rcParams['figure.figsize']=(8.0,6.0)

### Join train.csv and building_metadata.csv
TODO: find a way to restrict the data to about 300k rows to reduce processing time.
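
One possible way to cap the working set at a fixed number of rows is to draw a reproducible random sample after loading. The sketch below uses a small synthetic frame in place of `train.csv`; the helper name `sample_rows`, the seed, and the toy data are assumptions for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd


def sample_rows(df: pd.DataFrame, max_rows: int, seed: int = 42) -> pd.DataFrame:
    """Return at most `max_rows` randomly chosen rows of `df` (hypothetical helper)."""
    if len(df) <= max_rows:
        return df.copy()
    # random_state makes the sample reproducible between runs;
    # sort_index restores the original row order
    return df.sample(n=max_rows, random_state=seed).sort_index()


# Synthetic stand-in for train.csv: 1,000 rows spread over 10 buildings
demo_df = pd.DataFrame({
    "building_id": np.arange(1000) % 10,
    "meter": 0,
    "meter_reading": np.linspace(0.0, 99.9, 1000),
})

small_df = sample_rows(demo_df, max_rows=300)
print(small_df.shape)
```

With the real file, `pd.read_csv('data/train.csv', nrows=300_000)` would avoid loading everything, but it keeps only the first rows of the file, which may not be representative of the whole dataset.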

#### train file

In [2]:
# train file, data loading. The file is comma (,) separated.
# The index is auto-generated, since the 'building_id' column doesn't have unique values
train_df = pd.read_csv('data/train.csv', sep=',')
display(train_df.head(3))

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0
1,1,0,2016-01-01 00:00:00,0.0
2,2,0,2016-01-01 00:00:00,0.0


In [3]:
# General info
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 4 columns):
 #   Column         Dtype  
---  ------         -----  
 0   building_id    int64  
 1   meter          int64  
 2   timestamp      object 
 3   meter_reading  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 616.9+ MB


#### building_meta file 

In [5]:
# building_meta file, data loading. The file is comma (,) separated.
# The index is auto-generated.
building_df = pd.read_csv('data/building_metadata.csv', sep=',')
display(building_df.head(3))

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,0,0,Education,7432,2008.0,
1,0,1,Education,2720,2004.0,
2,0,2,Education,5376,1991.0,


In [7]:
# General info
building_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   site_id      1449 non-null   int64  
 1   building_id  1449 non-null   int64  
 2   primary_use  1449 non-null   object 
 3   square_feet  1449 non-null   int64  
 4   year_built   675 non-null    float64
 5   floor_count  355 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 68.1+ KB


#### Join train_df and building_df on 'building_id'

In [10]:
# Join train_df and building_df on 'building_id'
train_building_df = train_df.merge(building_df, on='building_id', how='inner')
train_building_df.sample(3)

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count
13766419,1107,1,2016-11-03 12:00:00,447.321,13,Education,184098,,
4682436,354,0,2016-11-27 11:00:00,30.81,3,Education,77500,,
4728182,360,0,2016-02-15 00:00:00,59.18,3,Education,69600,1949.0,
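
The join above relies on each `building_id` appearing only once in the metadata. A minimal sketch of how that assumption could be checked at merge time, using the `validate` parameter of `DataFrame.merge` on toy frames (the frames `readings` and `buildings` are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for train_df and building_df
readings = pd.DataFrame({"building_id": [0, 0, 1], "meter_reading": [0.0, 1.5, 2.0]})
buildings = pd.DataFrame({"building_id": [0, 1], "primary_use": ["Education", "Office"]})

# validate='many_to_one' raises pandas.errors.MergeError if building_id
# is duplicated in the right-hand (metadata) frame
joined = readings.merge(buildings, on="building_id", how="inner", validate="many_to_one")
print(joined.shape)
```

If the metadata contained duplicate keys, the merge would fail loudly instead of silently multiplying rows.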


In [11]:
# Review merged df
train_building_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   building_id    int64  
 1   meter          int64  
 2   timestamp      object 
 3   meter_reading  float64
 4   site_id        int64  
 5   primary_use    object 
 6   square_feet    int64  
 7   year_built     float64
 8   floor_count    float64
dtypes: float64(3), int64(4), object(2)
memory usage: 1.4+ GB


#### Weather_train file

In [14]:
# weather_train.csv file, data loading. The file is comma (,) separated.
# The index is auto-generated, since the 'site_id' column doesn't have unique values
weather_train_df = pd.read_csv('data/weather_train.csv', sep=',')
weather_train_df.sample(3)

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
78535,8,2016-12-19 19:00:00,29.4,,20.0,0.0,1024.8,0.0,0.0
10111,1,2016-02-25 18:00:00,4.3,,-4.7,,1016.8,330.0,2.1
43,0,2016-01-02 19:00:00,22.2,,12.8,0.0,1017.6,60.0,3.1


In [15]:
# General info
weather_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139773 entries, 0 to 139772
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   site_id             139773 non-null  int64  
 1   timestamp           139773 non-null  object 
 2   air_temperature     139718 non-null  float64
 3   cloud_coverage      70600 non-null   float64
 4   dew_temperature     139660 non-null  float64
 5   precip_depth_1_hr   89484 non-null   float64
 6   sea_level_pressure  129155 non-null  float64
 7   wind_direction      133505 non-null  float64
 8   wind_speed          139469 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 9.6+ MB


#### Rename column header for timestamp to avoid confusion 

In [17]:
# Rename the 'timestamp' column to 'meter_timestamp'
train_building_df.rename(columns={'timestamp': 'meter_timestamp'}, inplace=True)

In [18]:
# Check successful rename
train_building_df.head(3)

Unnamed: 0,building_id,meter,meter_timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,
1,0,0,2016-01-01 01:00:00,0.0,0,Education,7432,2008.0,
2,0,0,2016-01-01 02:00:00,0.0,0,Education,7432,2008.0,


In [19]:
# For weather_train_df
# Rename the 'timestamp' column to 'weather_timestamp'
weather_train_df.rename(columns={'timestamp': 'weather_timestamp'}, inplace=True)

In [20]:
# Check successful rename
weather_train_df.head(3)

Unnamed: 0,site_id,weather_timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
1,0,2016-01-01 01:00:00,24.4,,21.1,-1.0,1020.2,70.0,1.5
2,0,2016-01-01 02:00:00,22.8,2.0,21.1,0.0,1020.2,0.0,0.0


#### Merge train_building_df with 'weather_train.csv' on 'site_id' and timestamp
This first attempt is made without changing the timestamp data type from object to datetime.

In [34]:
# Merge the DataFrames on 'site_id' and 'meter_timestamp' equal to 'weather_timestamp'
# merged_data = 
train_building_df.merge(weather_train_df, left_on=['site_id', 'meter_timestamp'], right_on=['site_id', 'weather_timestamp'], how='inner')

Unnamed: 0,building_id,meter,meter_timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,weather_timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,0,2016-01-01 00:00:00,0.0000,0,Education,7432,2008.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
1,1,0,2016-01-01 00:00:00,0.0000,0,Education,2720,2004.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
2,2,0,2016-01-01 00:00:00,0.0000,0,Education,5376,1991.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
3,3,0,2016-01-01 00:00:00,0.0000,0,Education,23685,2002.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
4,4,0,2016-01-01 00:00:00,0.0000,0,Education,116607,1975.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20125600,1400,1,2016-03-24 12:00:00,15.3753,15,Lodging/residential,21168,1928.0,,2016-03-24 12:00:00,1.7,,1.7,,1016.8,0.0,0.0
20125601,1400,1,2016-03-24 13:00:00,25.0848,15,Lodging/residential,21168,1928.0,,2016-03-24 13:00:00,2.8,,2.2,,1016.6,320.0,1.5
20125602,1400,1,2016-03-24 14:00:00,32.3439,15,Lodging/residential,21168,1928.0,,2016-03-24 14:00:00,5.6,,4.4,,1015.2,110.0,2.1
20125603,1400,1,2016-03-24 15:00:00,24.2214,15,Lodging/residential,21168,1928.0,,2016-03-24 15:00:00,11.1,,4.4,,1013.9,150.0,5.1


The row count decreased after the merge: 20,125,605 rows versus 20,216,100 before.

#### Timestamp data type change from object to datetime

In [35]:
# train_building_df
train_building_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 9 columns):
 #   Column           Dtype  
---  ------           -----  
 0   building_id      int64  
 1   meter            int64  
 2   meter_timestamp  object 
 3   meter_reading    float64
 4   site_id          int64  
 5   primary_use      object 
 6   square_feet      int64  
 7   year_built       float64
 8   floor_count      float64
dtypes: float64(3), int64(4), object(2)
memory usage: 1.4+ GB


In [36]:
# Convert the timestamp from 'object' to a datetime data type
train_building_df['meter_timestamp'] = train_building_df['meter_timestamp'].astype('datetime64[ns]')
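
As an alternative to `astype`, `pd.to_datetime` accepts an explicit format string and can flag unparseable values instead of raising. The sketch below is only an illustration on a toy series; the `raw` values, including the deliberately invalid entry, are assumptions and not taken from the dataset.

```python
import pandas as pd

# Toy timestamps in the same layout as the CSV files, plus one invalid entry
raw = pd.Series(["2016-01-01 00:00:00", "2016-01-01 01:00:00", "not a date"])

# errors='coerce' turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.isna().sum())  # number of values that could not be parsed
```

Counting the `NaT` values afterwards gives a quick check that every timestamp was converted.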

In [37]:
# Verification
train_building_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 9 columns):
 #   Column           Dtype         
---  ------           -----         
 0   building_id      int64         
 1   meter            int64         
 2   meter_timestamp  datetime64[ns]
 3   meter_reading    float64       
 4   site_id          int64         
 5   primary_use      object        
 6   square_feet      int64         
 7   year_built       float64       
 8   floor_count      float64       
dtypes: datetime64[ns](1), float64(3), int64(4), object(1)
memory usage: 1.4+ GB


In [38]:
# weather_train_df
# timestamp data type change from object to datetime
weather_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139773 entries, 0 to 139772
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   site_id             139773 non-null  int64  
 1   weather_timestamp   139773 non-null  object 
 2   air_temperature     139718 non-null  float64
 3   cloud_coverage      70600 non-null   float64
 4   dew_temperature     139660 non-null  float64
 5   precip_depth_1_hr   89484 non-null   float64
 6   sea_level_pressure  129155 non-null  float64
 7   wind_direction      133505 non-null  float64
 8   wind_speed          139469 non-null  float64
dtypes: float64(7), int64(1), object(1)
memory usage: 9.6+ MB


In [39]:
# Convert the timestamp from 'object' to a datetime data type
weather_train_df['weather_timestamp'] = weather_train_df['weather_timestamp'].astype('datetime64[ns]')

In [40]:
# Verification
weather_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139773 entries, 0 to 139772
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   site_id             139773 non-null  int64         
 1   weather_timestamp   139773 non-null  datetime64[ns]
 2   air_temperature     139718 non-null  float64       
 3   cloud_coverage      70600 non-null   float64       
 4   dew_temperature     139660 non-null  float64       
 5   precip_depth_1_hr   89484 non-null   float64       
 6   sea_level_pressure  129155 non-null  float64       
 7   wind_direction      133505 non-null  float64       
 8   wind_speed          139469 non-null  float64       
dtypes: datetime64[ns](1), float64(7), int64(1)
memory usage: 9.6 MB


#### Inner merge between train_building_df and weather_train_df

In [44]:
# Merge the DataFrames on 'site_id' and 'meter_timestamp' equal to 'weather_timestamp'
merged_train_df = train_building_df.merge(weather_train_df, left_on=['site_id', 'meter_timestamp'], right_on=['site_id', 'weather_timestamp'], how='inner')
merged_train_df.sample(3)

Unnamed: 0,building_id,meter,meter_timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,weather_timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
136688,50,0,2016-02-25 10:00:00,0.0,0,Other,4698,1981.0,,2016-02-25 10:00:00,12.8,4.0,6.7,0.0,1016.9,300.0,7.2
19887801,1334,1,2016-11-18 18:00:00,5.595,15,Office,130794,1933.0,,2016-11-18 18:00:00,19.4,0.0,8.3,,1018.9,190.0,3.1
13677863,1210,0,2016-03-09 11:00:00,172.363,13,Office,92500,,,2016-03-09 11:00:00,2.2,,0.6,0.0,1009.4,250.0,4.1


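The row count decreased after the merge: 20,125,605 rows versus 20,216,100 before. Not every meter record has a weather record with the same 'site_id' and timestamp, so the inner merge drops the unmatched rows.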
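
To see which meter records have no matching weather record, one option is a left merge with `indicator=True`, which labels each row by where it was found. The sketch below runs on small invented frames (`meters` and `weather` are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Toy meter readings: site 0 has three hours, site 1 has two
meters = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1],
    "meter_timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00",
        "2016-01-01 00:00", "2016-01-01 01:00",
    ]),
    "meter_reading": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Toy weather: the 02:00 record for site 0 and the 01:00 record for site 1 are missing
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "weather_timestamp": pd.to_datetime([
        "2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00",
    ]),
    "air_temperature": [25.0, 24.4, 3.8],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
checked = meters.merge(
    weather,
    left_on=["site_id", "meter_timestamp"],
    right_on=["site_id", "weather_timestamp"],
    how="left",
    indicator=True,
)

# Rows flagged 'left_only' are the ones an inner merge would drop
unmatched = checked[checked["_merge"] == "left_only"]
unmatched_per_site = unmatched.groupby("site_id").size()
print(unmatched_per_site)
```

Applied to the real frames, the per-site counts would show whether the missing weather hours are concentrated in a few sites.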

#### Left merge between train_building_df and weather_train_df

In [42]:
# Merge the DataFrames on 'site_id' and 'meter_timestamp' equal to 'weather_timestamp'
# merged_data = 
train_building_df.merge(weather_train_df, left_on=['site_id', 'meter_timestamp'], right_on=['site_id', 'weather_timestamp'], how='left')

Unnamed: 0,building_id,meter,meter_timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,weather_timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,0,2016-01-01 00:00:00,0.00,0,Education,7432,2008.0,,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0
1,0,0,2016-01-01 01:00:00,0.00,0,Education,7432,2008.0,,2016-01-01 01:00:00,24.4,,21.1,-1.0,1020.2,70.0,1.5
2,0,0,2016-01-01 02:00:00,0.00,0,Education,7432,2008.0,,2016-01-01 02:00:00,22.8,2.0,21.1,0.0,1020.2,0.0,0.0
3,0,0,2016-01-01 03:00:00,0.00,0,Education,7432,2008.0,,2016-01-01 03:00:00,21.1,2.0,20.6,0.0,1020.1,0.0,0.0
4,0,0,2016-01-01 04:00:00,0.00,0,Education,7432,2008.0,,2016-01-01 04:00:00,20.0,2.0,20.0,-1.0,1020.0,250.0,2.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20216095,403,0,2016-12-31 19:00:00,43.66,3,Education,49500,1962.0,,2016-12-31 19:00:00,9.4,,-6.7,0.0,1016.7,200.0,11.8
20216096,403,0,2016-12-31 20:00:00,43.64,3,Education,49500,1962.0,,2016-12-31 20:00:00,8.9,,-6.1,0.0,1016.3,200.0,8.2
20216097,403,0,2016-12-31 21:00:00,43.89,3,Education,49500,1962.0,,2016-12-31 21:00:00,8.9,6.0,-6.1,0.0,1015.4,190.0,7.7
20216098,403,0,2016-12-31 22:00:00,44.37,3,Education,49500,1962.0,,2016-12-31 22:00:00,8.9,,-6.1,0.0,1015.7,200.0,8.2


Continue with the inner merge, since it guarantees fewer null values in the weather columns.

In [45]:
merged_train_df.sample(3)

Unnamed: 0,building_id,meter,meter_timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,weather_timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
14113660,1172,2,2016-05-07 06:00:00,2421.88,13,Manufacturing/industrial,63847,,,2016-05-07 06:00:00,18.9,6.0,5.0,0.0,1009.5,20.0,8.2
15637091,1179,1,2016-11-28 21:00:00,150.94,13,Entertainment/public assembly,143228,,,2016-11-28 21:00:00,10.0,,7.8,-1.0,981.8,160.0,4.6
11893905,992,0,2016-11-03 01:00:00,24.0,9,Office,51483,,,2016-11-03 01:00:00,27.2,4.0,21.1,0.0,1017.4,120.0,3.1


In [46]:
# Overview
merged_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20125605 entries, 0 to 20125604
Data columns (total 17 columns):
 #   Column              Dtype         
---  ------              -----         
 0   building_id         int64         
 1   meter               int64         
 2   meter_timestamp     datetime64[ns]
 3   meter_reading       float64       
 4   site_id             int64         
 5   primary_use         object        
 6   square_feet         int64         
 7   year_built          float64       
 8   floor_count         float64       
 9   weather_timestamp   datetime64[ns]
 10  air_temperature     float64       
 11  cloud_coverage      float64       
 12  dew_temperature     float64       
 13  precip_depth_1_hr   float64       
 14  sea_level_pressure  float64       
 15  wind_direction      float64       
 16  wind_speed          float64       
dtypes: datetime64[ns](2), float64(10), int64(4), object(1)
memory usage: 2.5+ GB


In [47]:
# Shape
merged_train_df.shape

(20125605, 17)

In [48]:
# TODO: find a way to filter down to about 300k rows to reduce processing time

In [54]:
# Count by unique values
merged_train_df.groupby(['site_id'])['primary_use'].count()

site_id
0     1076662
1      552034
2     2530025
3     2369014
4      746664
5      779195
6      667989
7      359642
8      567915
9     2678102
10     411313
11     117259
12     314869
13    2711454
14    2499502
15    1743966
Name: primary_use, dtype: int64

In [62]:
# Count by unique values
count_by_primary_use = merged_train_df.groupby(['primary_use'])['primary_use'].count()

In [68]:
# Sort ascending
count_by_primary_use = count_by_primary_use.sort_values(ascending=True)
count_by_primary_use

primary_use
Religious worship                  31775
Utility                            55016
Technology/science                 76713
Services                           96493
Warehouse/storage                 111838
Retail                            112564
Food sales and service            114041
Manufacturing/industrial          124458
Parking                           213777
Other                             242163
Healthcare                        397992
Public services                  1658858
Lodging/residential              2130981
Entertainment/public assembly    2254880
Office                           4379290
Education                        8124766
Name: primary_use, dtype: int64

Continue the analysis with the Healthcare records only (397,992 rows).
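
Restricting the data to a single `primary_use` value can be done with a boolean mask. A minimal sketch on a toy frame follows; `demo_df` and `healthcare_df` are hypothetical names used for illustration only.

```python
import pandas as pd

# Toy stand-in for merged_train_df
demo_df = pd.DataFrame({
    "building_id": [1, 2, 3, 4],
    "primary_use": ["Education", "Healthcare", "Office", "Healthcare"],
    "meter_reading": [10.0, 20.0, 30.0, 40.0],
})

# The boolean mask keeps only rows whose primary_use is exactly 'Healthcare'
healthcare_df = demo_df[demo_df["primary_use"] == "Healthcare"].reset_index(drop=True)
print(len(healthcare_df))
```

Resetting the index keeps later positional lookups simple, at the cost of losing the original row labels.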

[back to TOC](#toc)

<a id='bdw-2'></a>

## 1.2. Shape of the data frame

[back to TOC](#toc)

<a id='bdw-3'></a>

## 1.3. Numeric and categorical columns distribution

[back to TOC](#toc)

<a id='bdw-4'></a>

## 1.4. Duplicate values

[back to TOC](#toc)

<a id='bdw-5'></a>

## 1.5. Null values

[back to TOC](#toc)

<a id='eda-0'></a>

# 2. Exploratory Data Analysis (EDA)

<a id='eda-1'></a>

## 2.1. Relationship between ...???

[back to TOC](#toc)

<a id='sa-0'></a>

# 3. Statistical Analysis

<a id='sa-1'></a>

## 3.1. Statistically significant difference

[back to TOC](#toc)

<a id='sa-2'></a>

## 3.2. Correlation with ??

[back to TOC](#toc)

<a id='asa-0'></a>

# 4. Advanced Statistical Analysis

<a id='asa-1'></a>

## 4.1. Linear regression for ??

[back to TOC](#toc)

<a id='asa-2'></a>

## 4.2. Logistic regression for ??

[back to TOC](#toc)

<a id='concl'></a>

# 5. Conclusion

In summary, describing the relationship between mosquito species and WNV prevalence involves data analysis and visualization, among other techniques. The following insights aim to show that the mosquito species are a significant vector for the virus and to contribute to informed decision-making in public health and mosquito control efforts in the City of Chicago:
* The mosquito 'Culex Pipiens' requires deeper analysis, since it is the species most frequently found to be WNV-positive.
* It is not recommended to continue using OVI traps in the study, since they have a low trapping rate. Give priority to Sentinel and CDC traps, which have the highest trap efficacy.
* Implement additional safety measures during August to decrease the mosquito population, such as:
    * Eliminate or regularly empty containers that can collect water, such as buckets, flower pots, and clogged gutters.
    * Apply mosquito repellent on exposed skin when spending time outdoors.
    * Wear long sleeves and pants when outside.
    * Set up more mosquito traps.
    * Introduce natural mosquito predators like birds, bats, or dragonflies to your area, if possible.
    * Plant mosquito-repelling herbs and flowers in your garden, such as citronella, lavender, and marigolds.

[back to TOC](#toc)