# Step 1: Exploring & preprocessing raw data

In [None]:
import pandas as pd
df_weather_raw = pd.read_csv('files/weather_data.csv',sep=',',decimal='.')
df_calls_raw = pd.read_csv('files/calls_data.csv',sep=',',decimal='.')

In [None]:
df_weather_raw.head(2)

In [None]:
df_weather_raw.describe()

In [None]:
df_calls_raw.head()

Since we are given hourly weather information for Seattle as a whole, we won't differentiate the calls by location within Seattle. In the calls dataset we therefore remove all the columns **except the datetime and the incident number** (we keep the incident number because several calls can arrive at the same time). <br><br>
In the weather data, we drop the redundant columns: **dt, timezone, city_name, lat, lon, sea_level, grnd_level, weather_id, weather_icon and weather_main** (as this information is already included in weather_description).  <br>
(A small observation: it is not clear why the dataset contains different timezones while the latitude and longitude do not change.)  

In [None]:
columns_to_drop_weather = ['dt', 'timezone', 'lat', 'lon', 'sea_level', 'grnd_level',
                           'weather_id', 'weather_icon', 'weather_main', 'city_name']
columns_to_drop_calls = ['Address', 'Type', 'Latitude', 'Longitude', 'Report Location']
df_weather_raw = df_weather_raw.drop(columns=columns_to_drop_weather)
df_calls_raw = df_calls_raw.drop(columns=columns_to_drop_calls)

In [None]:
print('Missing values calls?')
display(df_calls_raw.isnull().any())
print('Missing values weather?')
display(df_weather_raw.isnull().any())

The missing values in **rain_1h, rain_3h, snow_1h, snow_3h** can be replaced with zeros: the weather description of the rows with missing data mentions no rain or snow.
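A minimal sketch of a more targeted fill, assuming that only the four precipitation columns should default to zero and that gaps in any other column should stay visible. The toy frame and its values are invented for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_weather_raw; all values are made up for illustration.
toy_weather = pd.DataFrame({
    'temp': [280.1, np.nan, 282.4],
    'rain_1h': [np.nan, 0.3, np.nan],
    'rain_3h': [np.nan, np.nan, np.nan],
    'snow_1h': [np.nan, np.nan, 0.1],
    'snow_3h': [np.nan, np.nan, np.nan],
    'weather_description': ['sky is clear', 'light rain', 'light snow'],
})

# Fill only the precipitation columns; a missing temperature stays NaN.
precip_cols = ['rain_1h', 'rain_3h', 'snow_1h', 'snow_3h']
toy_weather[precip_cols] = toy_weather[precip_cols].fillna(0)

print(toy_weather.isnull().any())
```

If the precipitation columns are the only ones with missing values, this gives the same result as filling the whole frame with zeros.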

In [None]:
df_weather_raw = df_weather_raw.fillna(0)
display(df_weather_raw.isnull().any())

In [None]:
df_weather_raw.describe()

#### Let's convert the timestamps of the calls to the standard 'YYYY-MM-DD HH:MM:SS' format

In [None]:
df_calls_raw['Datetime'] = pd.to_datetime(df_calls_raw.Datetime)
df_calls_raw.head()

In [None]:
# for weather data
# convert '2002-01-01 00:00:00 +0000 UTC' to 2002-01-01 00:00:00
# the timestamp itself is the first 19 characters; everything after it is the ' +0000 UTC' suffix
df_weather_raw['dt_iso'] = df_weather_raw['dt_iso'].str[:19]
# parse the whole column at once instead of applying pd.to_datetime row by row
df_weather_raw['dt_iso'] = pd.to_datetime(df_weather_raw['dt_iso'])
df_weather_raw.head()

In [None]:
df_calls_raw.head()

In [None]:
# remove the duplicates with respect to the timestamp for the weather
# each row is a single reading taken at some point within the hour, so it is fair to keep one of the two values
# but we could also take an average
df_weather_raw = df_weather_raw.drop_duplicates(subset=['dt_iso'])
df_weather_raw = df_weather_raw.reset_index(drop=True)
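The averaging alternative mentioned in the comment could look roughly like the sketch below. It assumes numeric columns are averaged per timestamp and text columns keep their first value; the toy frame and the aggregation rules are illustrative, not the approach used in this notebook.

```python
import pandas as pd

# Toy weather frame with one duplicated timestamp; values are invented.
toy = pd.DataFrame({
    'dt_iso': pd.to_datetime(['2015-11-01 00:00:00',
                              '2015-11-01 00:00:00',
                              '2015-11-01 01:00:00']),
    'temp': [280.0, 282.0, 279.0],
    'weather_description': ['light rain', 'mist', 'mist'],
})

# Average the numeric column and keep the first description per timestamp.
agg_rules = {'temp': 'mean', 'weather_description': 'first'}
toy_dedup = toy.groupby('dt_iso', as_index=False).agg(agg_rules)

print(toy_dedup)
```

Averaging keeps information from both readings, but the retained text description still comes from only one of them.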

In [None]:
len(df_weather_raw)

In [None]:
len(df_weather_raw['dt_iso'].unique())

### Let's find out if we are missing any timestamps in the call and weather data

In [None]:
df_calls_raw['tDiff'] = df_calls_raw.Datetime.diff()
df_calls_raw[df_calls_raw.tDiff > pd.Timedelta(hours=1)]

In [None]:
df_weather_raw['tDiff'] = df_weather_raw.dt_iso.diff()
df_weather_raw[df_weather_raw.tDiff > pd.Timedelta(hours=1)]
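As a complementary check, the missing hours can be listed explicitly by comparing the observed timestamps with a complete hourly range. This is a sketch on invented timestamps, assuming the series is expected to contain exactly one entry per hour.

```python
import pandas as pd

# Toy hourly timestamps with a gap between 01:00 and 04:00.
toy_times = pd.Series(pd.to_datetime([
    '2015-11-01 00:00:00',
    '2015-11-01 01:00:00',
    '2015-11-01 04:00:00',
]))

# Build the full hourly range and keep the hours that never appear in the data.
expected = pd.date_range(toy_times.min(), toy_times.max(), freq=pd.Timedelta(hours=1))
missing_hours = expected.difference(pd.DatetimeIndex(toy_times))

print(missing_hours)
```

Unlike the `diff()` approach, which flags the row after each gap, this returns the missing timestamps themselves.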

As expected, the fire department doesn't receive calls every hour. This means that when we combine the calls and weather data, we can simply use a **left join** that keeps every weather hour, and substitute the missing call values with zeros. 
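The left join with zero filling can be sketched in pandas as follows. The toy data, the column name `Incident Number` and the helper column `n_calls` are assumptions made for illustration; the notebook itself performs this step in the database.

```python
import pandas as pd

# Toy hourly weather table (left side of the join); values are invented.
toy_weather = pd.DataFrame({
    'dt_iso': pd.to_datetime(['2015-11-01 00:00:00', '2015-11-01 01:00:00',
                              '2015-11-01 02:00:00', '2015-11-01 03:00:00']),
    'temp': [280.0, 279.5, 279.0, 278.5],
})

# Toy calls table; 'Incident Number' is an assumed column name.
toy_calls = pd.DataFrame({
    'Datetime': pd.to_datetime(['2015-11-01 00:10:00', '2015-11-01 00:45:00',
                                '2015-11-01 02:30:00']),
    'Incident Number': ['F1', 'F2', 'F3'],
})

# Count calls per hour by truncating each timestamp to the start of its hour.
hour_start = toy_calls['Datetime'].dt.floor('h')
calls_per_hour = (
    toy_calls.groupby(hour_start)['Incident Number']
    .count()
    .rename('n_calls')
    .reset_index()
)

# Left join keeps every weather hour; hours without calls get n_calls = 0.
merged = toy_weather.merge(calls_per_hour, how='left',
                           left_on='dt_iso', right_on='Datetime')
merged['n_calls'] = merged['n_calls'].fillna(0).astype(int)
merged = merged.drop(columns='Datetime')

print(merged)
```

Putting the weather table on the left guarantees one row per hour in the result, which is what a regular hourly time series needs.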

### We will use the data for the past 5 years: from 2015-11-01 up to (but not including) 2020-11-01

In [None]:
df_calls_raw = df_calls_raw[(df_calls_raw['Datetime']>='2015-11-01 00:00:00') & (df_calls_raw['Datetime']<'2020-11-01 00:00:00')]
df_weather_raw = df_weather_raw[(df_weather_raw['dt_iso']>='2015-11-01 00:00:00') & (df_weather_raw['dt_iso']<'2020-11-01 00:00:00')]

### Next:
Now we will aggregate the information about the calls per hour and join it with the weather information. As suggested, we will be using the database for feature engineering. <br>

Normally we could execute the queries from Python; however, since that cannot be reproduced here, the SQL queries used and their output (a CSV file) are provided instead.<br>
Just for demonstration, this cell (not executable) shows an example of how we could work with the database from Python.
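A minimal sketch of what querying a database from Python might look like. It uses an in-memory SQLite database purely as a stand-in, since the database actually used for this project is not specified here; the table name `calls`, the column `incident_number` and the query are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for the real (unspecified) database.
conn = sqlite3.connect(':memory:')

# Toy calls table with invented rows; names are hypothetical.
toy_calls = pd.DataFrame({
    'Datetime': ['2015-11-01 00:10:00', '2015-11-01 00:45:00', '2015-11-01 02:30:00'],
    'incident_number': ['F1', 'F2', 'F3'],
})
toy_calls.to_sql('calls', conn, index=False)

# Hypothetical query: count the calls in each hour.
query = """
SELECT strftime('%Y-%m-%d %H:00:00', Datetime) AS hour_start,
       COUNT(*) AS n_calls
FROM calls
GROUP BY hour_start
ORDER BY hour_start
"""
calls_per_hour_sql = pd.read_sql_query(query, conn)
conn.close()

print(calls_per_hour_sql)
```

With a different database engine, the connection object and the date-truncation function in the query would change, while `pd.read_sql_query` would be used the same way.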

In [None]:
df_weather_raw.head()

In [None]:
df_calls_raw = df_calls_raw.drop('tDiff',axis = 1)
df_weather_raw = df_weather_raw.drop('tDiff',axis = 1)

In [None]:
# we export these files so that they can be imported into the database
#df_weather_raw.to_csv('weather_import_db.csv') 
#df_calls_raw.to_csv('calls_import_db.csv')

In [None]:
def print_query(file_name):
    """Ask the user whether to print the SQL query stored in files/queries/<file_name>."""
    print('Would you like to see the query? y/n')
    ans = input()
    if ans == 'y':
        # read the whole query file as plain text
        with open('files/queries/' + file_name, 'r') as file:
            query = file.read()
        print('-' * 50)
        print(query)
        print('-' * 50)

In [None]:
print_query('time_series_analysis.txt')