# WiDS Datathon 2022

## Overview

The WiDS Datathon 2022 focuses on a prediction task involving roughly 100k observations of **building energy usage records** collected over 7 years across several states within the United States. The dataset consists of building characteristics (e.g. floor area, facility type), weather data for the location of the building (e.g. annual average temperature, annual total precipitation), and the energy usage of the building in the given year, measured as **Site Energy Usage Intensity (Site EUI)**.

**Each row in the data corresponds to a single building observed in a given year**. The task is to **predict the Site EUI for each row**, given the characteristics of the building and the weather data for its location.

## Data Dictionary

  - `id`: building id
  - `Year_Factor`: anonymized year in which the weather and energy usage factors were observed
  - `State_Factor`: anonymized state in which the building is located
  - `building_class`: building classification
  - `facility_type`: building usage type
  - `floor_area`: floor area (in square feet) of the building
  - `year_built`: year in which the building was constructed
  - `energy_star_rating`: the energy star rating of the building
  - `ELEVATION`: elevation of the building location
  - `january_min_temp`: minimum temperature in January (in Fahrenheit) at the location of the building
  - `january_avg_temp`: average temperature in January (in Fahrenheit) at the location of the building
  - `january_max_temp`: maximum temperature in January (in Fahrenheit) at the location of the building
  - `cooling_degree_days`: the cooling degree days for a given day are the number of degrees by which the daily average temperature exceeds 65 degrees Fahrenheit. Each month is summed to produce an annual total at the location of the building.
  - `heating_degree_days`: the heating degree days for a given day are the number of degrees by which the daily average temperature falls below 65 degrees Fahrenheit. Each month is summed to produce an annual total at the location of the building.
  - `precipitation_inches`: annual precipitation in inches at the location of the building
  - `snowfall_inches`: annual snowfall in inches at the location of the building
  - `snowdepth_inches`: annual snow depth in inches at the location of the building
  - `avg_temp`: average temperature over a year at the location of the building
  - `days_below_30F`: total number of days below 30 degrees Fahrenheit at the location of the building
  - `days_below_20F`: total number of days below 20 degrees Fahrenheit at the location of the building
  - `days_below_10F`: total number of days below 10 degrees Fahrenheit at the location of the building
  - `days_below_0F`: total number of days below 0 degrees Fahrenheit at the location of the building
  - `days_above_80F`: total number of days above 80 degrees Fahrenheit at the location of the building
  - `days_above_90F`: total number of days above 90 degrees Fahrenheit at the location of the building
  - `days_above_100F`: total number of days above 100 degrees Fahrenheit at the location of the building
  - `days_above_110F`: total number of days above 110 degrees Fahrenheit at the location of the building
  - `direction_max_wind_speed`: wind direction for maximum wind speed at the location of the building. Given in 360-degree compass point directions (e.g. 360 = north, 180 = south, etc.).
  - `direction_peak_wind_speed`: wind direction for peak wind gust speed at the location of the building. Given in 360-degree compass point directions (e.g. 360 = north, 180 = south, etc.).
  - `max_wind_speed`: maximum wind speed at the location of the building
  - `days_with_fog`: number of days with fog at the location of the building
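
The degree-day definitions above are easier to follow with a small worked example. The sketch below is only an illustration under stated assumptions: the function names and the five sample temperatures are hypothetical and not part of the dataset, and the 65 degree Fahrenheit base comes from the data dictionary entries for `cooling_degree_days` and `heating_degree_days`.

```python
BASE_TEMP_F = 65.0  # base temperature from the data dictionary definitions


def cooling_degree_days(daily_avg_temps_f, base=BASE_TEMP_F):
    """Sum the degrees by which each daily average exceeds the base."""
    return sum(max(t - base, 0.0) for t in daily_avg_temps_f)


def heating_degree_days(daily_avg_temps_f, base=BASE_TEMP_F):
    """Sum the degrees by which each daily average falls below the base."""
    return sum(max(base - t, 0.0) for t in daily_avg_temps_f)


# Hypothetical daily average temperatures (Fahrenheit) for five days
sample_temps = [50.0, 65.0, 70.0, 80.0, 60.0]

cdd = cooling_degree_days(sample_temps)  # 5 (from 70) + 15 (from 80)
hdd = heating_degree_days(sample_temps)  # 15 (from 50) + 5 (from 60)
print(cdd, hdd)
```

A day at exactly 65 degrees contributes to neither total, so a mild climate can have low values for both features.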

## Environment Setup

In [1]:
import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 5)
pd.set_option('display.float_format',  '{:.5f}'.format)
warnings.simplefilter("ignore")

In [3]:
if 'notebooks' in os.getcwd():
    os.chdir('..')

DATA_PATH = os.path.join(os.getcwd(), 'data', '')  # trailing '' keeps the path separator
train_fname = 'train.csv'
test_fname = 'test.csv'

## Data Exploration

### Reading in Data

In [4]:
train_df = pd.read_csv(DATA_PATH + train_fname, header='infer', low_memory=False)
train_df.columns = [x.lower() for x in train_df.columns]
print(train_df.shape)

(75757, 64)


In [5]:
test_df = pd.read_csv(DATA_PATH + test_fname, header='infer', low_memory=False)
test_df.columns = [x.lower() for x in test_df.columns]
print(test_df.shape)

(9705, 63)


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75757 entries, 0 to 75756
Data columns (total 64 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   year_factor                75757 non-null  int64  
 1   state_factor               75757 non-null  object 
 2   building_class             75757 non-null  object 
 3   facility_type              75757 non-null  object 
 4   floor_area                 75757 non-null  float64
 5   year_built                 73920 non-null  float64
 6   energy_star_rating         49048 non-null  float64
 7   elevation                  75757 non-null  float64
 8   january_min_temp           75757 non-null  int64  
 9   january_avg_temp           75757 non-null  float64
 10  january_max_temp           75757 non-null  int64  
 11  february_min_temp          75757 non-null  int64  
 12  february_avg_temp          75757 non-null  float64
 13  february_max_temp          75757 non-null  int

In [9]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9705 entries, 0 to 9704
Data columns (total 63 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   year_factor                9705 non-null   int64  
 1   state_factor               9705 non-null   object 
 2   building_class             9705 non-null   object 
 3   facility_type              9705 non-null   object 
 4   floor_area                 9705 non-null   float64
 5   year_built                 9613 non-null   float64
 6   energy_star_rating         7451 non-null   float64
 7   elevation                  9705 non-null   float64
 8   january_min_temp           9705 non-null   int64  
 9   january_avg_temp           9705 non-null   float64
 10  january_max_temp           9705 non-null   int64  
 11  february_min_temp          9705 non-null   int64  
 12  february_avg_temp          9705 non-null   float64
 13  february_max_temp          9705 non-null   int64

Columns with nulls:

  - `year_built`
  - `energy_star_rating`
  - `direction_max_wind_speed`
  - `direction_peak_wind_speed`
  - `max_wind_speed`
  - `days_with_fog`
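
One way to produce a list like this directly, rather than reading it off the `info()` output, is to count missing values per column. The sketch below uses a small hypothetical frame (`toy_df`, with made-up values) in place of `train_df`; the same two steps should apply to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df with a few missing values
toy_df = pd.DataFrame({
    "floor_area": [61242.0, 2740.0, 280025.0, 55325.0],
    "year_built": [1942.0, np.nan, 1951.0, 1980.0],
    "energy_star_rating": [11.0, 45.0, np.nan, np.nan],
})

# Count nulls per column and express them as a share of all rows
null_counts = toy_df.isna().sum()
null_summary = pd.DataFrame({
    "n_null": null_counts,
    "pct_null": (100 * null_counts / len(toy_df)).round(2),
})

# Keep only columns that actually contain nulls, most affected first
null_summary = null_summary[null_summary["n_null"] > 0]
null_summary = null_summary.sort_values("n_null", ascending=False)
print(null_summary)
```

Reporting the percentage alongside the count makes it easier to compare the train and test sets, which differ in size.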

### Categorical Attributes

  - `year_factor`
  - `state_factor`
  - `building_class`
  - `facility_type`

In [7]:
cat_cols = ['year_factor', 'state_factor', 'building_class', 'facility_type']
for col in cat_cols:
    print(col, train_df[col].nunique())
    print(train_df[col].value_counts())
    print(2*len(col)*'-')

year_factor 6
6    22449
5    18308
4    12946
3    10879
2     9058
1     2117
Name: year_factor, dtype: int64
----------------------
state_factor 7
State_6     50840
State_11     6412
State_1      5618
State_2      4871
State_4      4300
State_8      3701
State_10       15
Name: state_factor, dtype: int64
------------------------
building_class 2
Residential    43558
Commercial     32199
Name: building_class, dtype: int64
----------------------------
facility_type 60
Multifamily_Uncategorized              39455
Office_Uncategorized                   12512
Education_Other_classroom               3860
Lodging_Hotel                           2098
2to4_Unit_Building                      1893
                                       ...  
Food_Service_Other                        17
Mixed_Use_Predominantly_Residential        9
Public_Assembly_Stadium                    9
Service_Drycleaning_or_Laundry             9
Lodging_Uncategorized                      5
Name: facility_type, Length: 60
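
The counts above are heavily imbalanced: one state accounts for most rows and another has only 15. The sketch below is a rough illustration that turns the `state_factor` counts printed above into shares and flags sparse levels. The `MIN_ROWS` threshold is a hypothetical choice for illustration, not something the notebook prescribes.

```python
# Row counts per state, copied from the value_counts() output above
state_counts = {
    "State_6": 50840,
    "State_11": 6412,
    "State_1": 5618,
    "State_2": 4871,
    "State_4": 4300,
    "State_8": 3701,
    "State_10": 15,
}

total_rows = sum(state_counts.values())

# Share of the training rows that each state contributes
state_shares = {s: round(c / total_rows, 3) for s, c in state_counts.items()}

# Hypothetical threshold below which a level is treated as sparse
MIN_ROWS = 100
sparse_states = [s for s, c in state_counts.items() if c < MIN_ROWS]

print(total_rows)
print(state_shares["State_6"])
print(sparse_states)
```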

### Numerical Attributes

In [8]:
num_cols = [x for x in train_df.columns if x not in cat_cols and x != 'id']
for col in num_cols:
    print(col)
    print(train_df[col].describe().loc[['mean', '50%']])

floor_area
mean   165983.86586
50%     91367.00000
Name: floor_area, dtype: float64
year_built
mean   1952.30676
50%    1951.00000
Name: year_built, dtype: float64
energy_star_rating
mean   61.04861
50%    67.00000
Name: energy_star_rating, dtype: float64
elevation
mean   39.50632
50%    25.00000
Name: elevation, dtype: float64
january_min_temp
mean   11.43234
50%    11.00000
Name: january_min_temp, dtype: float64
january_avg_temp
mean   34.31047
50%    34.45161
Name: january_avg_temp, dtype: float64
january_max_temp
mean   59.05495
50%    59.00000
Name: january_max_temp, dtype: float64
february_min_temp
mean   11.72057
50%     9.00000
Name: february_min_temp, dtype: float64
february_avg_temp
mean   35.52684
50%    34.10714
Name: february_avg_temp, dtype: float64
february_max_temp
mean   58.48628
50%    61.00000
Name: february_max_temp, dtype: float64
march_min_temp
mean   21.60628
50%    25.00000
Name: march_min_temp, dtype: float64
march_avg_temp
mean   44.46929
50%    44.51613
Name:
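
Comparing the mean with the median gives a quick, informal hint about skew. The sketch below reuses four of the (mean, median) pairs printed above and flags columns where the two differ noticeably. The helper name and the 25% threshold are assumptions made for illustration only.

```python
# (mean, median) pairs copied from the summary output above
reported_stats = {
    "floor_area": (165983.86586, 91367.0),
    "year_built": (1952.30676, 1951.0),
    "energy_star_rating": (61.04861, 67.0),
    "elevation": (39.50632, 25.0),
}


def mean_median_ratio(mean, median):
    """Ratio above 1 hints at right skew, below 1 at left skew."""
    return mean / median


ratios = {
    col: round(mean_median_ratio(mean, median), 3)
    for col, (mean, median) in reported_stats.items()
}

# Hypothetical threshold: flag columns whose mean is >25% away from the median
THRESHOLD = 0.25
flagged = [col for col, r in ratios.items() if abs(r - 1) > THRESHOLD]

print(ratios)
print(flagged)
```

Here `floor_area` and `elevation` are flagged because their means sit well above their medians, which is consistent with a long right tail. A histogram is a more reliable check than this ratio.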

### Plots