# Weather Data
This notebook assumes that you downloaded archive.zip from Kaggle,
renamed it BuildingData.zip,
and stored it in the data subdirectory.
It also assumes that the zip file contains the weather.csv file.

The weather file has one row per hour per site for two years, with 8 feature columns. Mean air temperature varies widely across sites: from 7.8 to 25.1 degrees Celsius.

In [1]:
DATAPATH=''
try:
    # On Google Drive, set path to my drive / data directory.
    from google.colab import drive
    IN_COLAB = True
    PATH='/content/drive/'
    drive.mount(PATH)
    DATAPATH=PATH+'My Drive/data/'  # must end in "/"
except ImportError:
    # Not on Colab (google.colab is unavailable): use the local data directory.
    IN_COLAB = False
    DATAPATH='data/'  # must end in "/"

ZIP_FILE='BuildingData.zip'
ZIP_PATH = DATAPATH+ZIP_FILE
WEATHER_FILE='weather.csv'
MODEL_FILE='Model'  # will be used later to save models

In [2]:
from os import listdir
import csv
from zipfile import ZipFile
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib import colors
mycmap = colors.ListedColormap(['red','blue'])  # list color for label 0 then 1
np.set_printoptions(precision=2)

In [3]:
def read_csv_to_numpy(filename):  # array of strings; header is row 0
    # Open the file passed in (the original referenced an undefined ELEC_PATH).
    with open(filename, 'r', newline='') as handle:
        data_iter = csv.reader(handle, delimiter=',', quotechar='"')
        rows = [row for row in data_iter]
        return np.asarray(rows, dtype=None)
# Pandas incorporates column headers, row numbers, timestamps, and NaN for missing values.
def read_csv_to_panda(filename): # pandas data frame
    return pd.read_csv(filename)
def read_zip_to_panda(zip_filename,csv_filename):
    # Context managers close the archive and the member once the frame is loaded.
    with ZipFile(zip_filename) as zip_handle:
        with zip_handle.open(csv_filename) as csv_handle:
            return pd.read_csv(csv_handle)
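
Because the notebook assumes the archive contains weather.csv, it can help to confirm that before reading. The following is a minimal sketch, not part of the original notebook: `zip_has_member` is a hypothetical helper, and the in-memory archive is a stand-in for the Kaggle file.

```python
import io
from zipfile import ZipFile

def zip_has_member(zip_source, member_name):
    """Return True if the archive lists a member with exactly this name."""
    with ZipFile(zip_source) as zip_handle:
        return member_name in zip_handle.namelist()

# Demo on a tiny in-memory archive (a stand-in, not the Kaggle download).
demo_buffer = io.BytesIO()
with ZipFile(demo_buffer, 'w') as zip_handle:
    zip_handle.writestr('weather.csv', 'timestamp,site_id,airTemperature\n')
demo_buffer.seek(0)

print(zip_has_member(demo_buffer, 'weather.csv'))
```

With the notebook's own names, the equivalent check would be `zip_has_member(ZIP_PATH, WEATHER_FILE)`.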

## Weather data
We have 2 years of hourly weather data per site ID.  
A site is a geographical area such as a college campus.  
Each site is code-named after an animal, such as Bear.  
Each site has multiple buildings.  
Each building is code-named with a person's name, such as Lulu.  

In [4]:
wet_df = read_zip_to_panda(ZIP_PATH,WEATHER_FILE)

In [5]:
wet_df

,timestamp,site_id,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed
0,2016-01-01 00:00:00,Panther,19.4,,19.4,0.0,,,0.0,0.0
1,2016-01-01 01:00:00,Panther,21.1,6.0,21.1,-1.0,,1019.4,0.0,0.0
2,2016-01-01 02:00:00,Panther,21.1,,21.1,0.0,,1018.8,210.0,1.5
3,2016-01-01 03:00:00,Panther,20.6,,20.0,0.0,,1018.1,0.0,0.0
4,2016-01-01 04:00:00,Panther,21.1,,20.6,0.0,,1019.0,290.0,1.5
...,...,...,...,...,...,...,...,...,...,...
331161,2017-12-31 19:00:00,Mouse,8.5,,4.8,,,992.3,210.0,8.2
331162,2017-12-31 20:00:00,Mouse,8.5,,4.5,,,992.1,210.0,7.2
331163,2017-12-31 21:00:00,Mouse,8.2,,4.0,,,992.1,230.0,10.3
331164,2017-12-31 22:00:00,Mouse,7.5,,4.3,,,993.7,260.0,12.9


In [6]:
print("Air temp and wind speed: 300K reports.")
wet_df.describe()

Air temp and wind speed: 300K reports.


,airTemperature,cloudCoverage,dewTemperature,precipDepth1HR,precipDepth6HR,seaLvlPressure,windDirection,windSpeed
count,331038.0,160179.0,330838.0,197980.0,18162.0,309542.0,318161.0,330592.0
mean,14.235343,1.920907,7.64937,0.955738,13.53656,1016.063498,184.391299,3.569554
std,9.990392,2.550744,9.201438,8.273852,43.801017,8.052463,111.571354,2.335197
min,-28.9,0.0,-35.0,-1.0,-1.0,968.2,0.0,0.0
25%,7.8,0.0,1.8,0.0,0.0,1011.6,90.0,2.1
50%,14.4,0.0,8.5,0.0,0.0,1016.2,200.0,3.1
75%,21.1,4.0,13.9,0.0,5.0,1020.9,280.0,5.0
max,48.3,9.0,26.7,597.0,770.0,1050.1,360.0,24.2
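
The `count` row above differs by column because `describe()` counts only non-missing values. A minimal sketch of how the missing share per column could be computed directly follows; `missing_fraction` is a hypothetical helper, and the demo frame reuses a few values from the first rows shown earlier.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Fraction of NaN entries per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Small demo frame built from the first four Panther rows shown earlier.
demo_df = pd.DataFrame({
    'airTemperature': [19.4, 21.1, 21.1, 20.6],
    'cloudCoverage': [np.nan, 6.0, np.nan, np.nan],
    'precipDepth6HR': [np.nan, np.nan, np.nan, np.nan],
})
demo_missing = missing_fraction(demo_df)
print(demo_missing)
```

On the full data, `missing_fraction(wet_df)` would rank the weather columns by how sparsely they are reported.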


In [7]:
print("Weather observations per site:")
wet_df.site_id.value_counts()

Weather observations per site:


Panther     17544
Gator       17544
Fox         17543
Bear        17542
Hog         17542
Rat         17539
Peacock     17539
Eagle       17536
Swan        17535
Bull        17529
Bobcat      17525
Mouse       17516
Robin       17516
Shrew       17516
Wolf        17505
Lamb        17500
Cockatoo    16975
Crow        16860
Moose       16860
Name: site_id, dtype: int64
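
Two full years of hourly data covering 2016 (a leap year) and 2017 span 731 days, or 17544 hours, which matches the counts for Panther and Gator; the other sites fall short of that. A rough sketch of counting the missing hours per site is shown below. `missing_hours_per_site` is a hypothetical helper, and the start and end timestamps are assumptions supplied by the caller.

```python
import pandas as pd

def missing_hours_per_site(df, start, end):
    """Count hourly timestamps absent per site between start and end (inclusive).

    Sites with no rows at all inside the window do not appear in the result.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Number of hourly slots in the closed interval [start, end].
    expected = int((end - start) / pd.Timedelta(hours=1)) + 1
    stamps = pd.to_datetime(df['timestamp'])
    in_range = df[(stamps >= start) & (stamps <= end)]
    # nunique ignores duplicate timestamps within a site.
    observed = in_range.groupby('site_id')['timestamp'].nunique()
    return (expected - observed).sort_values(ascending=False)

# Demo: a three-hour window in which Mouse lacks the 01:00 report.
demo_df = pd.DataFrame({
    'timestamp': ['2016-01-01 00:00:00', '2016-01-01 01:00:00',
                  '2016-01-01 02:00:00', '2016-01-01 00:00:00',
                  '2016-01-01 02:00:00'],
    'site_id': ['Panther', 'Panther', 'Panther', 'Mouse', 'Mouse'],
})
demo_gaps = missing_hours_per_site(
    demo_df, '2016-01-01 00:00:00', '2016-01-01 02:00:00')
print(demo_gaps)
```

For the full frame, a call such as `missing_hours_per_site(wet_df, '2016-01-01 00:00:00', '2017-12-31 23:00:00')` would apply, assuming the data is meant to run through the last hour of 2017.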

In [8]:
print("Outside Air Temperature observations for one site:")
gator_df = wet_df[wet_df['site_id']=='Gator']
gator_temp_df=gator_df['airTemperature']
gator_temp_df.describe()

Outside Air Temperature observations for one site:


count    17541.000000
mean        22.805091
std          5.792597
min          1.700000
25%         19.400000
50%         23.900000
75%         26.700000
max         36.100000
Name: airTemperature, dtype: float64

In [9]:
print("Stats for outside air temp per site:")
wet_df.groupby(by=['site_id'])['airTemperature'].agg(['mean','std','min','max']).sort_values('mean')

Stats for outside air temp per site:


site_id,mean,std,min,max
Moose,7.798055,12.012177,-28.8,33.9
Crow,7.798055,12.012177,-28.8,33.9
Cockatoo,9.26769,10.616823,-23.9,33.9
Hog,9.561038,12.31958,-28.9,35.6
Wolf,9.957812,4.995753,-4.8,26.1
Lamb,11.000171,5.023139,-3.0,30.0
Bobcat,11.552921,11.30004,-19.4,38.3
Mouse,11.842307,6.117376,-4.1,33.9
Robin,11.842307,6.117376,-4.1,33.9
Shrew,11.842307,6.117376,-4.1,33.9
