# ASHRAE Great Energy Predictor
### *by Jose Correa*

# Introduction

How much does it cost to cool a skyscraper in the summer?  A lot! And not just in dollars, but in environmental impact.

Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is, are the improvements working? Under pay-for-performance financing, the building owner makes payments based on the difference between their real energy consumption and what they would have used without any retrofits. The latter values have to come from a model. Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don’t work with different building types.

This machine learning analysis aims to develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings, at several different sites around the world, over a three-year timeframe. With better estimates of these energy-saving investments, large scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.


*Dataset Description*

Assessing the value of energy efficiency improvements can be challenging as there's no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower cost financing.
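The savings calculation described above can be sketched in a few lines. The numbers and the names `modeled_baseline_kwh` and `actual_kwh` are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical hourly values, for illustration only (not from the dataset)
modeled_baseline_kwh = pd.Series([120.0, 118.5, 130.2])  # counterfactual: building without the retrofit
actual_kwh = pd.Series([95.0, 97.5, 101.2])              # metered usage after the retrofit

# Savings attributed to the retrofit = modeled baseline - actual consumption
savings_kwh = modeled_baseline_kwh - actual_kwh
total_savings_kwh = savings_kwh.sum()
print(total_savings_kwh)
```

The more accurate the modeled baseline, the more trustworthy the savings figure.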

This competition challenges you to build these counterfactual models across four energy types based on historic usage rates and observed weather. The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

<a id='toc'></a>

# Table of Contents
1. [Basic Data Wrangling](#bdw-0)
    1. [Data processing](#bdw-1)
    2. [Shape of the data frame](#bdw-2)
    3. [Numeric and categorical columns distribution](#bdw-3)
    4. [Duplicate values](#bdw-4)
    5. [Null values](#bdw-5)
2. [Exploratory Data Analysis (EDA)](#eda-0)
    1. [Relationship between..???](#eda-1)
3. [Statistical Analysis](#sa-0)
    1. [Statistically significant difference..](#sa-1)
    2. [Correlation with ??](#sa-2)
4. [Advanced Statistical Analysis](#asa-0)
    1. [Linear regression for ??](#asa-1)
    2. [Logistic regression for ??](#asa-2)
5. [Conclusion](#concl)

<a id='bdw-0'></a>

# 1. Basic Data Wrangling

<a id='bdw-1'></a>

## 1.1. Data processing

In [1]:
# Import Python packages
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Setting figure size
plt.rcParams['figure.figsize']=(8.0,6.0)

In [5]:
# Load the train file. The file is comma (,) separated.
# The index is auto generated, since the 'building_id' column doesn't have unique values
df_train = pd.read_csv('data/train.csv', sep=',')
display(df_train.head(3))

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0
1,1,0,2016-01-01 00:00:00,0.0
2,2,0,2016-01-01 00:00:00,0.0


In [14]:
# General info
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   building_id    int64  
 1   meter          int64  
 2   timestamp      object 
 3   meter_reading  float64
 4   year           int32  
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 694.1+ MB


In [16]:
# Convert timestamp from 'object' to a datetime data type
df_train['timestamp'] = df_train['timestamp'].astype('datetime64[ns]')

In [18]:
# Confirm that timestamp now has a datetime dtype
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 5 columns):
 #   Column         Dtype         
---  ------         -----         
 0   building_id    int64         
 1   meter          int64         
 2   timestamp      datetime64[ns]
 3   meter_reading  float64       
 4   year           int32         
dtypes: datetime64[ns](1), float64(1), int32(1), int64(2)
memory usage: 694.1 MB
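As an alternative to `astype`, `pd.to_datetime` with an explicit `format` can parse the same strings. This is a minimal sketch on a small hypothetical frame (`df_demo`), assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS` layout shown in the preview of the train file:

```python
import pandas as pd

# Hypothetical two-row frame that mimics the 'timestamp' column of the train file
df_demo = pd.DataFrame({'timestamp': ['2016-01-01 00:00:00', '2016-01-01 01:00:00']})

# An explicit format avoids per-value format inference and fails loudly on unexpected layouts
df_demo['timestamp'] = pd.to_datetime(df_demo['timestamp'], format='%Y-%m-%d %H:%M:%S')
print(df_demo['timestamp'].dtype)
```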


In [21]:
# About 20 million rows
df_train.shape

(20216100, 5)

In [12]:
# Find a way to restrict the data to 300k rows to speed up processing
# Plot a chart to identify records per year
# Extract the year from 'timestamp' and create the column 'year'
df_train['year']=pd.to_datetime(df_train['timestamp']).dt.year
df_train.head(3)

Unnamed: 0,building_id,meter,timestamp,meter_reading,year
0,0,0,2016-01-01 00:00:00,0.0,2016
1,1,0,2016-01-01 00:00:00,0.0,2016
2,2,0,2016-01-01 00:00:00,0.0,2016


In [13]:
# count records per year
df_train.groupby(['year'])['year'].count()

year
2016    20216100
Name: year, dtype: int64

The data only contains records for 2016, so filtering by year does not reduce it. Try filtering by 'building_id' instead.

In [22]:
# How many unique values?
df_train['timestamp'].nunique()

8784

As many unique values as there are hours in 2016, which is a leap year (366 days × 24 h = 8784 h).
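The hour count can be checked with pandas. Since 2016 is a leap year, the sketch below computes the number of hours between the first instants of 2016 and 2017, using only built-in `Timestamp` arithmetic:

```python
import pandas as pd

start, end = pd.Timestamp('2016-01-01'), pd.Timestamp('2017-01-01')

# A leap year has 366 days instead of 365
is_leap = start.is_leap_year
# Dividing the year's duration by one hour gives the number of hourly slots
hours_in_2016 = int((end - start) / pd.Timedelta(hours=1))
print(is_leap, hours_in_2016)  # True 8784
```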

In [23]:
# Review 'building_id'
# unique values
df_train['building_id'].nunique()

1449

In [30]:
# Count by unique values
df_train.groupby(['building_id'])['building_id'].count()

building_id
0       8784
1       8784
2       8784
3       8784
4       8784
        ... 
1444    7445
1445    7449
1446    7472
1447    7471
1448    7452
Name: building_id, Length: 1449, dtype: int64

In [None]:
# How many records belong to buildings with 8784 records, the equivalent of a full year?

In [31]:
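# Create a column 'record_year' with the number of records of each row's building.
# transform('count') returns one value per row; assigning the grouped counts directly
# would align them on the row index instead of on 'building_id'.
df_train['record_year'] = df_train.groupby('building_id')['building_id'].transform('count')
df_train.head(3)

Unnamed: 0,building_id,meter,timestamp,meter_reading,year,record_year
0,0,0,2016-01-01,0.0,2016,8784
1,1,0,2016-01-01,0.0,2016,8784
2,2,0,2016-01-01,0.0,2016,8784


In [40]:
# Count the buildings with exactly 8784 records (one full year of hourly readings)
df_train.loc[df_train['record_year'] == 8784, 'building_id'].nunique()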

267

In [41]:
# Rows in total
267*8784

2345328

This is higher than the 300k goal, so another filter is needed. Merge with building_metadata.csv to explore other criteria.
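One possible way to stay under a row budget is to keep whole buildings, chosen at random, until the budget is reached. The sketch below is an assumption about how this could be done, not the approach taken in this notebook; `sample_buildings`, `max_rows` and the toy frame are hypothetical names.

```python
import pandas as pd

def sample_buildings(df, max_rows, seed=0):
    """Keep randomly chosen whole buildings while the total row count stays <= max_rows."""
    # Rows per building, shuffled in a reproducible order
    counts = df['building_id'].value_counts().sample(frac=1, random_state=seed)
    # The running total is increasing, so this keeps a prefix of the shuffled buildings
    keep = counts[counts.cumsum() <= max_rows].index
    return df[df['building_id'].isin(keep)]

# Toy data: 4 buildings with 3 rows each
toy = pd.DataFrame({'building_id': [0, 1, 2, 3] * 3, 'meter_reading': range(12)})
subset = sample_buildings(toy, max_rows=7)
print(subset['building_id'].nunique(), len(subset))  # 2 6
```

Keeping whole buildings preserves each building's full time series, which a plain random sample of rows would not.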

In [42]:
# Load the building_metadata file. The file is comma (,) separated.
# The index is auto generated
df_building = pd.read_csv('data/building_metadata.csv', sep=',')
display(df_building.head(3))

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,0,0,Education,7432,2008.0,
1,0,1,Education,2720,2004.0,
2,0,2,Education,5376,1991.0,


In [43]:
df_building.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   site_id      1449 non-null   int64  
 1   building_id  1449 non-null   int64  
 2   primary_use  1449 non-null   object 
 3   square_feet  1449 non-null   int64  
 4   year_built   675 non-null    float64
 5   floor_count  355 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 68.1+ KB
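The non-null counts above show that `year_built` and `floor_count` are missing for more than half of the buildings. A minimal sketch of how the share of missing values per column could be computed, shown on a small hypothetical frame (`df_demo`) rather than on `df_building`:

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
df_demo = pd.DataFrame({
    'square_feet': [7432, 2720, 5376, 23685],
    'year_built': [2008.0, 2004.0, np.nan, np.nan],
    'floor_count': [np.nan, np.nan, np.nan, 2.0],
})

# isna() marks missing cells; the column mean of booleans is the missing fraction
missing_share = df_demo.isna().mean()
print(missing_share)
```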


In [44]:
# Join df_train and df_building on 'building_id'
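A minimal sketch of this join, assuming a left merge so that every meter reading is kept. It runs on two small hypothetical frames (`train_demo`, `building_demo`) that mimic the columns of `df_train` and `df_building`:

```python
import pandas as pd

# Hypothetical stand-ins for df_train and df_building
train_demo = pd.DataFrame({
    'building_id': [0, 0, 1],
    'meter': [0, 0, 0],
    'meter_reading': [0.0, 1.5, 2.0],
})
building_demo = pd.DataFrame({
    'site_id': [0, 0],
    'building_id': [0, 1],
    'primary_use': ['Education', 'Education'],
})

# validate='many_to_one' raises if 'building_id' is duplicated in the right frame
merged = train_demo.merge(building_demo, on='building_id', how='left', validate='many_to_one')
print(merged.shape)  # (3, 5)
```

With `how='left'` the row count of the readings is unchanged, which makes it easy to confirm that the join did not duplicate or drop records.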

[back to TOC](#toc)

<a id='bdw-2'></a>

## 1.2. Shape of the data frame

[back to TOC](#toc)

<a id='bdw-3'></a>

## 1.3. Numeric and categorical columns distribution

[back to TOC](#toc)

<a id='bdw-4'></a>

## 1.4. Duplicate values

[back to TOC](#toc)

<a id='bdw-5'></a>

## 1.5. Null values

[back to TOC](#toc)

<a id='eda-0'></a>

# 2. Exploratory Data Analysis (EDA)

<a id='eda-1'></a>

## 2.1. Relationship between ...???

[back to TOC](#toc)

<a id='sa-0'></a>

# 3. Statistical Analysis

<a id='sa-1'></a>

## 3.1. Statistically significant difference

[back to TOC](#toc)

<a id='sa-2'></a>

## 3.2. Correlation with ??

[back to TOC](#toc)

<a id='asa-0'></a>

# 4. Advanced Statistical Analysis

<a id='asa-1'></a>

## 4.1. Linear regression for ??

[back to TOC](#toc)

<a id='asa-2'></a>

## 4.2. Logistic regression for ??

[back to TOC](#toc)

<a id='concl'></a>

# Conclusion

In summary, describing the relationship between mosquito species and WNV prevalence involves data analysis and visualization among other techniques. The following insights aim to show that the mosquito species are a significant vector for the virus and contribute to informed decision-making in public health and mosquito control efforts in the City of Chicago:
* The mosquito 'Culex Pipiens' requires deeper analysis, since it is the species most frequently found WNV positive.
* It is not recommended to continue using OVI traps in the study, since they have a low trapping rate. Give priority to Sentinel and CDC traps, which have the highest trap efficacy.
* Implement additional safety measures during August to decrease the mosquito population, such as:
    * Eliminate or regularly empty containers that can collect water, such as buckets, flower pots, and clogged gutters.
    * Apply mosquito repellent on exposed skin when spending time outdoors.
    * Wear long sleeves and pants when outside.
    * Set up more mosquito traps.
    * Introduce natural mosquito predators like birds, bats, or dragonflies to your area, if possible.
    * Plant mosquito-repelling herbs and flowers in your garden, such as citronella, lavender, and marigolds.

[back to TOC](#toc)