# Step 2.2) Feature engineering: weather

We'd like to establish the connections between the features and decide which data we need for prediction. We will: <br>
* Look at the correlation between #calls_t and weather_t <br>
* Look at the autocorrelation of the weather

In [120]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pyexasol
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
import statsmodels as sm

### Collect the data

If you'd like, you can take a look at the query that was used.

In [None]:
def print_query(file_name):
    print('Would you like to see the query? y/n')
    ans = input()
    if ans=='y':
        with open('files/queries/'+file_name, 'r') as file:
            query_time_series_hour = file.read()
        print('-'*50)
        print(query_time_series_hour)
        print('-'*50)
print_query('query_weather_to_calls_data.txt')

Would you like to see the query? y/n


The result of the query:

In [None]:
ds_raw = pd.read_csv('files/weather_to_numb_calls.csv',sep=';',decimal=',')
ds_raw = ds_raw.drop_duplicates(subset=['DT_ISO']) #make sure no duplicates are present
ds_raw = ds_raw.reset_index(drop=True)

## Explore the data

In [None]:
ds_raw.info()

In [None]:
ds_raw['DT_ISO'] = pd.to_datetime(ds_raw.DT_ISO)

In [None]:
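def correlation(df):
    # correlation is defined only for numeric (and boolean dummy) columns;
    # text and datetime columns are left out explicitly
    corr = df.select_dtypes(include=['number', 'bool']).corr()
    sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdBu')  # plot the correlation heatmap
    return corr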

In [None]:
correlation(ds_raw)

We can see that TEMP is highly correlated with **FEELS_LIKE, TEMP_MIN and TEMP_MAX**. They don't bring any new information to the model, so we will **exclude** them from consideration. <br> Humidity is also quite highly correlated with the temperature, but we will not exclude it for now.

In [None]:
# as discussed before we exclude it, as it's highly correlated
columns_to_drop = ['FEELS_LIKE','TEMP_MIN','TEMP_MAX']
ds_raw = ds_raw.drop(columns_to_drop,axis = 1) 

Though we know that the number of calls depends on the weather, the correlation matrix doesn't show it. The correlation coefficient captures only linear dependence, and the dependence here is not necessarily linear. So let us also look at the scatter plot.
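
To illustrate why a weak correlation does not rule out a dependence, here is a minimal sketch on synthetic data (not taken from the dataset). The variable `y` is fully determined by `x`, yet the Pearson coefficient, which `corr()` computes by default, is practically zero because the relationship is not linear.

```python
import numpy as np
import pandas as pd

# Synthetic data (made up for illustration): y depends on x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
demo = pd.DataFrame({"x": x, "y": x ** 2})

# Pearson correlation measures linear dependence only.
pearson = demo["x"].corr(demo["y"])
print(f"Pearson correlation of x and x**2: {pearson:.3f}")
```

A scatter plot of the same data would show a clear parabola, which is why the scatter matrix is a useful complement to the heatmap.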

In [None]:
pd.plotting.scatter_matrix(ds_raw, alpha = 1, figsize = (30, 30))
plt.show()

From the scatter plot alone it is hard to see how the number of calls depends on the weather. <br>
We can also look at the relationship between the number of calls and the weather description.

In [None]:
weather_types = ds_raw['WEATHER_DESCRIPTION'].unique()
for w_type in weather_types:
    calls_w_type= ds_raw[ds_raw['WEATHER_DESCRIPTION']==w_type]
    numb_calls=calls_w_type['NUMB_CALLS'].sum()
    print(w_type+' '*(30-len(w_type)), numb_calls)

As expected, we get many more calls in total when the sky is clear, and significantly fewer when there is rain. We couldn't see a clear dependence of the calls on each individual weather attribute in the graphs, as this dependence could be more complex. <br>
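
The totals per weather type also depend on how often each type occurs, so it can help to look at the mean number of calls per row next to the sum. The sketch below uses made-up toy numbers. Only the column names `WEATHER_DESCRIPTION` and `NUMB_CALLS` come from the dataset, and the values are chosen purely to show that the two views can rank weather types differently.

```python
import pandas as pd

# Toy data (made up for illustration); column names follow the notebook.
toy = pd.DataFrame({
    "WEATHER_DESCRIPTION": ["sky is clear"] * 4 + ["light rain"] * 2,
    "NUMB_CALLS": [2, 3, 1, 2, 3, 3],
})

# count = how often the weather type occurs, sum = total calls,
# mean = calls per row, which does not depend on the frequency of the type.
summary = (
    toy.groupby("WEATHER_DESCRIPTION")["NUMB_CALLS"]
       .agg(["count", "sum", "mean"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

In this toy example the clear-sky rows have the larger total only because they are more frequent. Running the same aggregation on `ds_raw` would show whether that effect plays a role in the real data.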

Let us also plot each weather parameter together with the number of calls over one year:

In [None]:
def plot_time_series(df, one_col='', omit=None, normalize=False, span=0, calls=False):
    # build a fresh list: a mutable default (omit=[]) would be shared between calls
    omit = list(omit or []) + ['DT_ISO']
    if span > 0:
        df = df.set_index('DT_ISO')  # returns a copy, the original is not modified
        # the rolling mean is defined only for numeric columns
        df = df.select_dtypes(include='number').rolling(span).mean()
        x_axis = df.index
    else:
        x_axis = df['DT_ISO']
    if one_col:
        columns_time_series = [one_col]
    else:
        columns_time_series = [col for col in df.columns if col not in omit]
    if calls and 'NUMB_CALLS' not in columns_time_series:
        columns_time_series.append('NUMB_CALLS')
    plt.figure("figure", figsize=(40, 15))
    for col in columns_time_series:
        if normalize:
            # rescale so that the maximum of every attribute is 10
            y_axis = 10 * df[col] / df[col].max()
        else:
            y_axis = df[col]
        plt.plot(x_axis, y_axis)
    plt.legend(columns_time_series)
    plt.grid(True)
    plt.show()

In [None]:
year_2017=ds_raw[(ds_raw['DT_ISO']>='2017-01-01')&(ds_raw['DT_ISO']<'2018-01-01')]
columns=['TEMP', 'PRESSURE', 'HUMIDITY', 'WIND_SPEED', 'WIND_DEG',\
       'RAIN_1H', 'RAIN_3H', 'SNOW_1H', 'SNOW_3H', 'CLOUDS_ALL']
for col in columns:
    print(col)
    plot_time_series(year_2017,one_col=col,normalize=False,span=0,calls=True)

Let us plot the moving average over a window of one week (24*7 hourly rows) to make the graphs smoother, and also normalize the values so that the curves are comparable on one scale:

In [None]:
for col in columns:
    print(col)
    plot_time_series(year_2017,one_col=col,normalize=True,span=24*7,calls=True)
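
The smoothing and rescaling used for these plots can be tried in isolation. The following is a minimal sketch on a synthetic hourly series (a made-up daily sine cycle, not the real temperature), assuming one row per hour as in the `24*7` window above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (made up): four weeks of a daily cycle around 10.
n = 24 * 7 * 4
idx = pd.Timestamp("2017-01-01") + pd.to_timedelta(np.arange(n), unit="h")
temp = pd.Series(10 + 5 * np.sin(2 * np.pi * np.arange(n) / 24), index=idx)

span = 24 * 7                          # one week of hourly rows
smooth = temp.rolling(span).mean()     # the first span-1 values are NaN
scaled = 10 * smooth / smooth.max()    # rescale so that the maximum is 10

print("NaN values at the start:", smooth.isna().sum())
print("Maximum after rescaling:", scaled.max())
```

Because the window covers whole days, the daily cycle averages out and only the slower changes remain. Note that the first `span-1` points of each smoothed curve are missing.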

### Some observations from time series plots:
The **pressure** hardly varies throughout the year and could be **omitted**. Humidity mirrors #calls: when **humidity** rises, #calls often declines. #calls tends to rise during the summer, as the **temperature** rises and **rain and clouds** decline. **WIND_DEG** and #calls also seem to have similar fluctuations and trend. The intuitive explanation, that fire spreads more easily on a windy day, concerns wind speed rather than wind direction, so this similarity should be treated with caution. **Snow** is very rare, so the plot does not reveal any connection between it and #calls.

Let's add dummy variables for the categorical attributes to the dataset and look at the correlation matrix again.

In [None]:
def add_dummies(df, attribute):
    df = df.copy()  # do not modify the caller's DataFrame
    # cast to string so that the attribute is treated as categorical
    df[attribute] = df[attribute].astype(str)
    # drop_first=True: number of dummy variables = number of categories - 1
    dummies = pd.get_dummies(df[[attribute]], drop_first=True)
    df = df.drop(attribute, axis=1)
    df = pd.concat([df, dummies], axis=1)
    return df

In [None]:
ds_raw = ds_raw.rename(columns={"WEATHER_DESCRIPTION": "wd"}) # shorten the name beforehand
categorical_attr=['wd','H','D','M','S']
for attr in categorical_attr:
    ds_raw=add_dummies(ds_raw,attr)
ds_raw.info()
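
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column. The category values are made up, and only the column name `wd` follows the notebook.

```python
import pandas as pd

# Toy column (made up) with three categories.
toy = pd.DataFrame({"wd": ["clear", "rain", "snow", "rain"]})

# Three categories give two dummy columns; the first category in sorted
# order ("clear") is dropped and is encoded by all dummies being zero.
dummies = pd.get_dummies(toy[["wd"]], drop_first=True)
print(list(dummies.columns))
print(dummies)
```

Dropping one category avoids a perfectly collinear set of dummy columns, since the dropped category is implied whenever all the others are zero.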

In [None]:
ds_raw.columns 

In [None]:
# some column names have extra spaces in their name
ds_raw.columns = ds_raw.columns.str.strip()

In [None]:
corr = correlation(ds_raw)

In [None]:
(abs(corr['NUMB_CALLS']).sort_values(ascending=False))[:50]

In [None]:
def autocorrelation_attr(df,n_lags=50):
    plt.rcParams['figure.max_open_warning'] = 0
    weather_time_series = [col for col in df.columns if col not in ('NUMB_CALLS','DT_ISO') ]
    #plt.subplots(figsize=(10,5))
    for col in weather_time_series:
        plot_acf(df[col],lags=n_lags)
        plt.title(col)
    plt.show()

In [None]:
autocorrelation_attr(ds_raw,n_lags = 60)
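
For reading these plots it helps to recall what the bars represent: the correlation of a series with a copy of itself shifted by a given number of rows. The sketch below computes the standard sample autocorrelation with NumPy on a synthetic series with a 24-hour cycle. The helper `sample_acf` is hypothetical and written only for this illustration; it is not the statsmodels implementation, which additionally draws confidence bands.

```python
import numpy as np

def sample_acf(x, n_lags):
    """Sample autocorrelation for lags 0..n_lags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)  # sum of squared deviations over the full sample
    return np.array([
        np.dot(x[k:], x[:len(x) - k]) / denom
        for k in range(n_lags + 1)
    ])

# Synthetic hourly series (made up): two weeks of a pure daily cycle.
hours = np.arange(24 * 14)
series = np.sin(2 * np.pi * hours / 24)

acf = sample_acf(series, n_lags=48)
print("lag 0: ", round(acf[0], 2))
print("lag 12:", round(acf[12], 2))
print("lag 24:", round(acf[24], 2))
```

For a series with a daily cycle, the autocorrelation is strongly negative at a lag of half a day and strongly positive at a lag of a full day. A similar pattern in the plots for a weather attribute would indicate daily seasonality.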

We can gain more insight into which weather parameters influence #calls the most after applying Gradient Boosting.