#### The Objective

The ultimate aim of the challenge is to **predict the area of wildfires in 7 regions in Australia for February 2021** from historical data, before the fires have happened!

There are three submissions:
- 1) Predict wildfires in February 2020.
- 2) Predict wildfires in the 3rd and 4th weeks of January 2021.
- 3) Predict wildfires in February 2021.

#### 1.2 Historical Weather

This dataset contains daily aggregates computed from the hourly ERA5 climate reanalysis. Find more information about this data [here](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview) and [here](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5).

#### Variables

* All variables are aggregated to daily values from `YYYY-mm-ddT01:00:00Z` to `YYYY-mm-(dd+1)T00:00:00Z`
* `Precipitation` is derived from total precipitation. Hourly raw data is converted from m/hour to mm/hour 
* [`Relative humidity`](https://en.wikipedia.org/wiki/Relative_humidity) is derived from the temperature and dewpoint
* `Soil water content` is given for 0 - 7 cm below the surface
* `Solar radiation`, or Surface Solar Radiation Downwards. Units are converted from J/h to MJ/h
* `Temperature`
* `Wind speed` is calculated for every hour from the Easterly and Northerly 10 meter wind components
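
The derivations listed above can be sketched in a few lines. This is a minimal illustration, not the challenge's actual preprocessing code: the function names and sample values are hypothetical, and it assumes the standard vector-magnitude formula for combining the easterly (`u`) and northerly (`v`) wind components.

```python
import numpy as np

def wind_speed(u10, v10):
    """Wind speed magnitude (m/s) from easterly (u) and northerly (v) 10 m components."""
    return np.hypot(u10, v10)

def precipitation_m_to_mm(total_precip_m):
    """Convert precipitation depth from metres to millimetres."""
    return np.asarray(total_precip_m, dtype=float) * 1000.0

def radiation_j_to_mj(radiation_j):
    """Convert solar radiation from joules to megajoules."""
    return np.asarray(radiation_j, dtype=float) / 1e6

# Hypothetical hourly samples, for illustration only
u = np.array([3.0, 0.0, -6.0])
v = np.array([4.0, 2.0, 8.0])
speed = wind_speed(u, v)

precip_mm = precipitation_m_to_mm([0.0, 0.002])
radiation_mj = radiation_j_to_mj([2.5e6])
print(speed, precip_mm, radiation_mj)
```
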

#### Steps:
[1. Load Packages](#LoadPackages) 

[2. Descriptive Stats](#DescriptiveStats) 

[3. Evaluating for Missing Values (no missing values)](#MissingValues) 

[4. Checking for Duplicates (no duplicates)](#Duplicates) 

[5. Rearranging Table via Pivot](#PivotTable) 

[6. Evaluate Re-Arranged Parameter Columns for Missing and Duplicates](#RearrangedTable) 

[7. Weather Data Review](#DataReview) 

[8. Save out Pre-Processed "C&P_Weather" CSV File](#PreprocessedWeather) 

#### Load packages <a class="anchor" id="LoadPackages"></a>

In [1]:
# Import the necessary packages for analysis and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt, mpld3
%matplotlib inline
import json
import datetime

from shapely.geometry import Polygon, mapping
import geopandas as gpd
import folium
from folium.plugins import TimeSliderChoropleth
import seaborn as sns
import plotly.express as px

sns.set_style("whitegrid")

import warnings
warnings.filterwarnings("ignore")

#### Notes:

* Data type has been changed to match across all other datasets.
* Renaming columns to make more sense.
* No null values.
* No duplicates or drops.

In [2]:
# Load the dataset
weather = "H_Weather.csv"
print("Reading file: '{}'".format(weather))
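# Date is read as object here and converted to datetime in the next step.
# (parse_dates=[1] pointed at the Region column, so it had no effect on Date.)
weather_df = pd.read_csv(weather)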
print("Loaded...")

# Columns and their datatypes
print(weather_df.dtypes)

weather_df.head()

Reading file: 'H_Weather.csv'
Loaded...
Date                    object
Region                  object
Parameter               object
count()[unit: km^2]    float64
min()                  float64
max()                  float64
mean()                 float64
variance()             float64
dtype: object


Unnamed: 0,Date,Region,Parameter,count()[unit: km^2],min(),max(),mean(),variance()
0,2005-01-01,NSW,Precipitation,800234.348986,0.0,1.836935,0.044274,0.028362
1,2005-01-01,NSW,RelativeHumidity,800234.348986,13.877194,80.522964,36.355567,253.559937
2,2005-01-01,NSW,SoilWaterContent,800234.348986,0.002245,0.414305,0.170931,0.007758
3,2005-01-01,NSW,SolarRadiation,800234.348986,14.515009,32.169781,26.749389,6.078587
4,2005-01-01,NSW,Temperature,800234.348986,14.485785,35.878704,27.341182,18.562212


#### Notes:
For every region {object}:

    1 - Date: loaded as an object; needs converting to datetime64[ns] (format YYYY-MM-DD)
    2 - Parameter includes: {object}

            Precipitation (mm/day)
            Relative humidity (%)
            Soil water content (m3/m3)
            Solar radiation (MJ/day)
            Temperature (C)
            Wind speed (m/s)

    3 - Count - (km2) {float64}

In [3]:
#changing date type for consistency across all datasets
weather_df['Date'] = pd.to_datetime(weather_df['Date'])
weather_df.head()

Unnamed: 0,Date,Region,Parameter,count()[unit: km^2],min(),max(),mean(),variance()
0,2005-01-01,NSW,Precipitation,800234.348986,0.0,1.836935,0.044274,0.028362
1,2005-01-01,NSW,RelativeHumidity,800234.348986,13.877194,80.522964,36.355567,253.559937
2,2005-01-01,NSW,SoilWaterContent,800234.348986,0.002245,0.414305,0.170931,0.007758
3,2005-01-01,NSW,SolarRadiation,800234.348986,14.515009,32.169781,26.749389,6.078587
4,2005-01-01,NSW,Temperature,800234.348986,14.485785,35.878704,27.341182,18.562212


In [4]:
# rename columns
weather_cols = ['Date', 'Region', 'Parameter', 'area', 'min', 'max', 'mean', '2nd_moment']
weather_df.columns= weather_cols

In [5]:
weather_df.columns.tolist()

['Date', 'Region', 'Parameter', 'area', 'min', 'max', 'mean', '2nd_moment']
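
Assigning a list to `.columns` relies on the column order staying fixed. As an alternative sketch (not the approach used in this notebook), the same renaming can be done by name with `DataFrame.rename`, which leaves any unlisted column untouched. The empty `raw` frame below is a stand-in for the loaded data.

```python
import pandas as pd

# Stand-in frame with the raw header names shown in the output above
raw = pd.DataFrame(columns=[
    "Date", "Region", "Parameter",
    "count()[unit: km^2]", "min()", "max()", "mean()", "variance()",
])

# Map each raw name to the shorter name used in the rest of the notebook
rename_map = {
    "count()[unit: km^2]": "area",
    "min()": "min",
    "max()": "max",
    "mean()": "mean",
    "variance()": "2nd_moment",
}
renamed = raw.rename(columns=rename_map)
print(renamed.columns.tolist())
```
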

#### Descriptive Stats <a class="anchor" id="DescriptiveStats"></a>

In [6]:
weather_df.dtypes

Date          datetime64[ns]
Region                object
Parameter             object
area                 float64
min                  float64
max                  float64
mean                 float64
2nd_moment           float64
dtype: object

In [7]:
weather_df.shape

(242781, 8)

In [8]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242781 entries, 0 to 242780
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   Date        242781 non-null  datetime64[ns]
 1   Region      242781 non-null  object        
 2   Parameter   242781 non-null  object        
 3   area        242781 non-null  float64       
 4   min         242781 non-null  float64       
 5   max         242781 non-null  float64       
 6   mean        242781 non-null  float64       
 7   2nd_moment  242781 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 14.8+ MB


#### Evaluating for Missing Values <a class="anchor" id="MissingValues"></a>

In [9]:
# check for missing values
weather_df.isna().sum()

Date          0
Region        0
Parameter     0
area          0
min           0
max           0
mean          0
2nd_moment    0
dtype: int64

#### Checking for Duplicates <a class="anchor" id="Duplicates"></a>

In [10]:
# find duplicates
weather_df.duplicated().sum()

0
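
`duplicated()` with no arguments only flags rows that match in every column. Because the table is later pivoted on `Date`, `Region` and `Parameter`, it may also be worth checking for repeats of that key alone. The toy frame below is made up to show the difference; it is not taken from the dataset.

```python
import pandas as pd

# Toy data: two NSW temperature rows share a key but differ in value
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-01", "2005-01-01"]),
    "Region": ["NSW", "NSW", "NT"],
    "Parameter": ["Temperature", "Temperature", "Temperature"],
    "mean": [27.3, 27.9, 29.9],
})

# Full-row check: no row is identical to another
full_row_dupes = int(toy.duplicated().sum())

# Key-only check: the second NSW row repeats an existing key
key_dupes = int(toy.duplicated(subset=["Date", "Region", "Parameter"]).sum())

print(full_row_dupes, key_dupes)
```

A non-zero key count would matter here because `pivot_table` averages repeated keys by default instead of raising an error.
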

In [11]:
weather_df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
area,242781.0,1100786.0,795500.546644,67785.261409,229453.2,991315.104985,1730605.0,2528546.0
min,242781.0,9.539068,14.572877,-5.055067,7.586398e-07,2.368203,14.09707,90.27615
max,242781.0,27.22397,31.337855,0.0,5.494375,15.527743,32.08391,509.8331
mean,242781.0,16.68256,21.045676,0.0,0.3413224,6.709059,24.97674,95.953
2nd_moment,242781.0,39.0485,94.837253,0.0,0.3435371,2.954298,16.55526,2064.897


In [12]:
print("Rows    : ", weather_df.shape[0])
print("Columns : ", weather_df.shape[1])
print("\nFeatures : ", weather_df.columns.tolist())
print("\nMissing Values : \n",weather_df.isnull().any())
print("\nUnique Values : \n",weather_df.nunique())
print("Number of records: {}".format(len(weather_df)))
print("Number of regions: {}\n".format(len(weather_df['Region'].unique())))
print(weather_df['Region'].unique())
print(weather_df['Parameter'].unique())

Rows    :  242781
Columns :  8

Features :  ['Date', 'Region', 'Parameter', 'area', 'min', 'max', 'mean', '2nd_moment']

Missing Values : 
 Date          False
Region        False
Parameter     False
area          False
min           False
max           False
mean          False
2nd_moment    False
dtype: bool

Unique Values : 
 Date            5783
Region             7
Parameter          6
area               7
min           174495
max           241802
mean          242662
2nd_moment    242662
dtype: int64
Number of records: 242781
Number of regions: 7

['NSW' 'NT' 'QL' 'SA' 'TA' 'VI' 'WA']
['Precipitation' 'RelativeHumidity' 'SoilWaterContent' 'SolarRadiation'
 'Temperature' 'WindSpeed']
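
The unique counts above hint that some readings are absent even though no cell is null: 5783 dates × 7 regions × 6 parameters would give 242,886 rows, while the table has 242,781. One way to list the absent combinations is sketched below; `missing_combinations` is a hypothetical helper and the toy frame is illustrative.

```python
import pandas as pd

def missing_combinations(df, keys=("Date", "Region", "Parameter")):
    """Return key combinations that never occur in df (hypothetical helper)."""
    keys = list(keys)
    # Every combination that would exist if the table were complete
    full = pd.MultiIndex.from_product([df[k].unique() for k in keys], names=keys)
    # Combinations actually present
    present = pd.MultiIndex.from_frame(df[keys])
    return full.difference(present).to_frame(index=False)

# Row-count arithmetic from the summary above
expected_rows = 5783 * 7 * 6
shortfall = expected_rows - 242781
print(expected_rows, shortfall)

# Toy data: Precipitation has no row on the second date
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-06-21", "2019-06-21", "2019-06-22"]),
    "Region": ["NSW", "NSW", "NSW"],
    "Parameter": ["Precipitation", "Temperature", "Temperature"],
})
gaps = missing_combinations(toy)
print(gaps)
```
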


#### Re-arranging Table via Pivot Function <a class="anchor" id="PivotTable"></a>

In [13]:
# Re-apply the column names (already set in cell 4, so this is a no-op)
weather_df.columns = ['Date', 'Region', 'Parameter', 'area', 'min', 'max', 'mean', '2nd_moment']
weather_df.head()

Unnamed: 0,Date,Region,Parameter,area,min,max,mean,2nd_moment
0,2005-01-01,NSW,Precipitation,800234.348986,0.0,1.836935,0.044274,0.028362
1,2005-01-01,NSW,RelativeHumidity,800234.348986,13.877194,80.522964,36.355567,253.559937
2,2005-01-01,NSW,SoilWaterContent,800234.348986,0.002245,0.414305,0.170931,0.007758
3,2005-01-01,NSW,SolarRadiation,800234.348986,14.515009,32.169781,26.749389,6.078587
4,2005-01-01,NSW,Temperature,800234.348986,14.485785,35.878704,27.341182,18.562212


In [14]:
# rearrange the Parameter values into columns (one column per statistic and parameter)
df_pivot = weather_df.pivot_table(values=['min','max','mean','2nd_moment'], index=['Date','Region', 'area'], columns=['Parameter'])
df_pivot

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,2nd_moment,2nd_moment,2nd_moment,2nd_moment,2nd_moment,2nd_moment,max,max,max,max,...,mean,mean,mean,mean,min,min,min,min,min,min
Unnamed: 0_level_1,Unnamed: 1_level_1,Parameter,Precipitation,RelativeHumidity,SoilWaterContent,SolarRadiation,Temperature,WindSpeed,Precipitation,RelativeHumidity,SoilWaterContent,SolarRadiation,...,SoilWaterContent,SolarRadiation,Temperature,WindSpeed,Precipitation,RelativeHumidity,SoilWaterContent,SolarRadiation,Temperature,WindSpeed
Date,Region,area,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
2005-01-01,NSW,8.002343e+05,0.028362,253.559937,0.007758,6.078587,18.562212,0.850048,1.836935,80.522964,0.414305,32.169781,...,0.170931,26.749389,27.341182,3.323550,0.000000,13.877194,0.002245,14.515009,14.485785,1.354448
2005-01-01,NT,1.357561e+06,546.059262,584.201131,0.026743,58.942658,12.920252,1.930014,315.266815,95.683342,0.496140,31.634459,...,0.167735,19.781791,29.881492,5.296892,0.000000,14.558820,0.000000,2.518120,24.179960,1.840394
2005-01-01,QL,1.730605e+06,35.641257,403.134377,0.012679,29.500832,13.792599,0.883048,74.452164,95.898270,0.472416,31.982830,...,0.185641,27.056979,28.842866,3.483753,0.000000,14.443199,0.000000,6.033827,20.951620,1.106028
2005-01-01,SA,9.913151e+05,0.042837,246.044713,0.001917,7.914246,34.799336,1.655908,3.193624,81.980751,0.263911,31.734528,...,0.056047,27.142643,30.793675,4.657538,0.000000,10.618136,0.000000,17.861103,14.095855,2.023657
2005-01-01,TA,6.778526e+04,12.068597,111.754034,0.007121,12.826400,4.912013,2.963118,13.604791,81.501442,0.368189,33.225517,...,0.211360,26.755711,11.788805,5.408138,0.003973,43.906574,0.000000,20.742302,6.686816,1.995647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-10-31,QL,1.730605e+06,4.229566,316.368911,0.006935,4.796284,8.811943,0.549077,26.366877,87.803581,0.442840,30.670065,...,0.141467,28.745128,24.777835,3.137745,0.000000,12.749949,0.000000,19.898029,17.249844,1.180556
2020-10-31,SA,9.913151e+05,0.000492,316.308826,0.005425,17.093209,9.466862,0.640926,0.259189,82.642616,0.436347,29.623133,...,0.083082,26.530568,18.947783,3.548225,0.000000,12.512371,0.000000,15.608338,9.948414,2.062619
2020-10-31,TA,6.778526e+04,0.243603,10.936866,0.009317,8.718983,2.388990,0.859296,2.179307,86.428932,0.376833,28.439209,...,0.264608,21.782732,11.648813,2.501697,0.000703,69.801094,0.000000,15.503663,8.217092,1.300771
2020-10-31,VI,2.294532e+05,3.148454,44.766480,0.005050,9.917196,4.088503,1.019079,11.436618,93.374763,0.455111,28.041906,...,0.324260,19.553751,13.167147,3.838360,0.000000,66.459290,0.000000,11.170260,9.186510,1.783996


In [15]:
#resetting the index on the new table formed
df_pivot.reset_index(inplace=True)
df_pivot.head()

Unnamed: 0_level_0,Date,Region,area,2nd_moment,2nd_moment,2nd_moment,2nd_moment,2nd_moment,2nd_moment,max,...,mean,mean,mean,mean,min,min,min,min,min,min
Parameter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Precipitation,RelativeHumidity,SoilWaterContent,SolarRadiation,Temperature,WindSpeed,Precipitation,...,SoilWaterContent,SolarRadiation,Temperature,WindSpeed,Precipitation,RelativeHumidity,SoilWaterContent,SolarRadiation,Temperature,WindSpeed
0,2005-01-01,NSW,800234.3,0.028362,253.559937,0.007758,6.078587,18.562212,0.850048,1.836935,...,0.170931,26.749389,27.341182,3.32355,0.0,13.877194,0.002245,14.515009,14.485785,1.354448
1,2005-01-01,NT,1357561.0,546.059262,584.201131,0.026743,58.942658,12.920252,1.930014,315.266815,...,0.167735,19.781791,29.881492,5.296892,0.0,14.55882,0.0,2.51812,24.17996,1.840394
2,2005-01-01,QL,1730605.0,35.641257,403.134377,0.012679,29.500832,13.792599,0.883048,74.452164,...,0.185641,27.056979,28.842866,3.483753,0.0,14.443199,0.0,6.033827,20.95162,1.106028
3,2005-01-01,SA,991315.1,0.042837,246.044713,0.001917,7.914246,34.799336,1.655908,3.193624,...,0.056047,27.142643,30.793675,4.657538,0.0,10.618136,0.0,17.861103,14.095855,2.023657
4,2005-01-01,TA,67785.26,12.068597,111.754034,0.007121,12.8264,4.912013,2.963118,13.604791,...,0.21136,26.755711,11.788805,5.408138,0.003973,43.906574,0.0,20.742302,6.686816,1.995647


In [16]:
# Renaming Column names
df_pivot.columns = [col[0] if not(col[1]) else '{1}_{0}'.format(*col) for col in df_pivot.columns.values]
df_pivot.head()

Unnamed: 0,Date,Region,area,Precipitation_2nd_moment,RelativeHumidity_2nd_moment,SoilWaterContent_2nd_moment,SolarRadiation_2nd_moment,Temperature_2nd_moment,WindSpeed_2nd_moment,Precipitation_max,...,SoilWaterContent_mean,SolarRadiation_mean,Temperature_mean,WindSpeed_mean,Precipitation_min,RelativeHumidity_min,SoilWaterContent_min,SolarRadiation_min,Temperature_min,WindSpeed_min
0,2005-01-01,NSW,800234.3,0.028362,253.559937,0.007758,6.078587,18.562212,0.850048,1.836935,...,0.170931,26.749389,27.341182,3.32355,0.0,13.877194,0.002245,14.515009,14.485785,1.354448
1,2005-01-01,NT,1357561.0,546.059262,584.201131,0.026743,58.942658,12.920252,1.930014,315.266815,...,0.167735,19.781791,29.881492,5.296892,0.0,14.55882,0.0,2.51812,24.17996,1.840394
2,2005-01-01,QL,1730605.0,35.641257,403.134377,0.012679,29.500832,13.792599,0.883048,74.452164,...,0.185641,27.056979,28.842866,3.483753,0.0,14.443199,0.0,6.033827,20.95162,1.106028
3,2005-01-01,SA,991315.1,0.042837,246.044713,0.001917,7.914246,34.799336,1.655908,3.193624,...,0.056047,27.142643,30.793675,4.657538,0.0,10.618136,0.0,17.861103,14.095855,2.023657
4,2005-01-01,TA,67785.26,12.068597,111.754034,0.007121,12.8264,4.912013,2.963118,13.604791,...,0.21136,26.755711,11.788805,5.408138,0.003973,43.906574,0.0,20.742302,6.686816,1.995647
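
The pivot, index reset and column flattening above are easier to follow on a small table. The sketch below repeats the same three steps on made-up values with two parameters and two statistics.

```python
import pandas as pd

# Long format: one row per (Date, Region, Parameter); values are illustrative
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01"] * 4),
    "Region": ["NSW", "NSW", "NT", "NT"],
    "Parameter": ["Temperature", "WindSpeed", "Temperature", "WindSpeed"],
    "min": [14.5, 1.4, 24.2, 1.8],
    "mean": [27.3, 3.3, 29.9, 5.3],
})

# Wide format: one row per (Date, Region), two-level columns (statistic, parameter)
wide = long_df.pivot_table(values=["min", "mean"],
                           index=["Date", "Region"],
                           columns="Parameter")
wide = wide.reset_index()

# Flatten ('mean', 'Temperature') to 'Temperature_mean'; index columns keep their name
wide.columns = [stat if not param else f"{param}_{stat}" for stat, param in wide.columns]
print(wide.columns.tolist())
```
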


In [17]:
# Rearranging Data and column
params = df_pivot.columns.tolist()[3:]
params.sort()
weather_data = df_pivot[df_pivot.columns.tolist()[:3] + params].copy()
weather_data.head()

Unnamed: 0,Date,Region,area,Precipitation_2nd_moment,Precipitation_max,Precipitation_mean,Precipitation_min,RelativeHumidity_2nd_moment,RelativeHumidity_max,RelativeHumidity_mean,...,SolarRadiation_mean,SolarRadiation_min,Temperature_2nd_moment,Temperature_max,Temperature_mean,Temperature_min,WindSpeed_2nd_moment,WindSpeed_max,WindSpeed_mean,WindSpeed_min
0,2005-01-01,NSW,800234.3,0.028362,1.836935,0.044274,0.0,253.559937,80.522964,36.355567,...,26.749389,14.515009,18.562212,35.878704,27.341182,14.485785,0.850048,7.670482,3.32355,1.354448
1,2005-01-01,NT,1357561.0,546.059262,315.266815,9.884958,0.0,584.201131,95.683342,61.494675,...,19.781791,2.51812,12.920252,38.136787,29.881492,24.17996,1.930014,9.704402,5.296892,1.840394
2,2005-01-01,QL,1730605.0,35.641257,74.452164,1.453053,0.0,403.134377,95.89827,47.959364,...,27.056979,6.033827,13.792599,37.047943,28.842866,20.95162,0.883048,7.675632,3.483753,1.106028
3,2005-01-01,SA,991315.1,0.042837,3.193624,0.059078,0.0,246.044713,81.980751,30.057683,...,27.142643,17.861103,34.799336,38.326847,30.793675,14.095855,1.655908,10.044715,4.657538,2.023657
4,2005-01-01,TA,67785.26,12.068597,13.604791,3.099497,0.003973,111.754034,81.501442,65.086764,...,26.755711,20.742302,4.912013,16.22851,11.788805,6.686816,2.963118,11.432408,5.408138,1.995647


In [18]:
num_rows, num_cols = weather_data.shape
print("There are total {} records in the following {} columns:\n".format(num_rows, num_cols))
print("\n".join(list(weather_data.columns)))

There are total 40481 records in the following 27 columns:

Date
Region
area
Precipitation_2nd_moment
Precipitation_max
Precipitation_mean
Precipitation_min
RelativeHumidity_2nd_moment
RelativeHumidity_max
RelativeHumidity_mean
RelativeHumidity_min
SoilWaterContent_2nd_moment
SoilWaterContent_max
SoilWaterContent_mean
SoilWaterContent_min
SolarRadiation_2nd_moment
SolarRadiation_max
SolarRadiation_mean
SolarRadiation_min
Temperature_2nd_moment
Temperature_max
Temperature_mean
Temperature_min
WindSpeed_2nd_moment
WindSpeed_max
WindSpeed_mean
WindSpeed_min


#### Evaluate Re-Arranged Parameter Columns for Missing and Duplicates <a class="anchor" id="RearrangedTable"></a>

Note: Check for null values in the weather data parameter columns now.

In [19]:
weather_data.isna().sum()

Date                            0
Region                          0
area                            0
Precipitation_2nd_moment        7
Precipitation_max               7
Precipitation_mean              7
Precipitation_min               7
RelativeHumidity_2nd_moment    42
RelativeHumidity_max           42
RelativeHumidity_mean          42
RelativeHumidity_min           42
SoilWaterContent_2nd_moment     0
SoilWaterContent_max            0
SoilWaterContent_mean           0
SoilWaterContent_min            0
SolarRadiation_2nd_moment      14
SolarRadiation_max             14
SolarRadiation_mean            14
SolarRadiation_min             14
Temperature_2nd_moment         14
Temperature_max                14
Temperature_mean               14
Temperature_min                14
WindSpeed_2nd_moment           28
WindSpeed_max                  28
WindSpeed_mean                 28
WindSpeed_min                  28
dtype: int64

Checking NULL values for - PRECIPITATION

In [20]:
#cross checking null values in the new arranged data for the Precipitation column
weather_data.loc[weather_data['Precipitation_mean'].isna(), :]

Unnamed: 0,Date,Region,area,Precipitation_2nd_moment,Precipitation_max,Precipitation_mean,Precipitation_min,RelativeHumidity_2nd_moment,RelativeHumidity_max,RelativeHumidity_mean,...,SolarRadiation_mean,SolarRadiation_min,Temperature_2nd_moment,Temperature_max,Temperature_mean,Temperature_min,WindSpeed_2nd_moment,WindSpeed_max,WindSpeed_mean,WindSpeed_min
36995,2019-06-22,NSW,800234.3,,,,,114.312088,91.814453,68.422677,...,11.564692,5.427815,4.957193,13.923991,6.138674,-0.300265,0.670847,9.339998,2.914745,1.048711
36996,2019-06-22,NT,1357561.0,,,,,125.066021,74.113892,30.802208,...,17.438541,12.637497,17.84191,25.157646,14.323231,7.124587,0.82178,9.785229,4.616783,2.790959
36997,2019-06-22,QL,1730605.0,,,,,136.242906,83.852173,46.585385,...,16.071312,10.554407,22.149479,25.403965,11.880667,4.738083,1.266645,9.982671,3.428681,1.280596
36998,2019-06-22,SA,991315.1,,,,,189.151252,90.240486,50.957245,...,12.181231,6.261038,2.724242,13.625819,8.601456,2.367725,1.11302,8.605908,3.424624,1.245013
36999,2019-06-22,TA,67785.26,,,,,38.48772,97.402702,86.778359,...,5.85425,4.417063,6.325362,11.469337,4.57491,-0.214279,0.857406,6.195638,1.771625,0.6963
37000,2019-06-22,VI,229453.2,,,,,41.1992,96.209885,84.582572,...,7.781152,5.230695,3.68577,11.80181,5.001564,0.078639,0.361891,6.246301,2.101499,0.802938
37001,2019-06-22,WA,2528546.0,,,,,467.014533,98.127876,39.120916,...,11.886737,2.460803,6.561743,24.897877,18.15278,11.241782,1.324167,8.518429,4.775665,2.197379


In [21]:
# verify in the original data that there are no Precipitation rows for 2019-06-22 (a missing reading, not zero rainfall)
weather_df.loc[weather_df['Date'] == "2019-06-22", :]

Unnamed: 0,Date,Region,Parameter,area,min,max,mean,2nd_moment
221893,2019-06-22,NSW,RelativeHumidity,800234.3,47.983139,91.814453,68.422677,114.312088
221894,2019-06-22,NSW,SoilWaterContent,800234.3,0.000723,0.41293,0.206911,0.008979
221895,2019-06-22,NSW,SolarRadiation,800234.3,5.427815,14.121369,11.564692,2.390587
221896,2019-06-22,NSW,Temperature,800234.3,-0.300265,13.923991,6.138674,4.957193
221897,2019-06-22,NSW,WindSpeed,800234.3,1.048711,9.339998,2.914745,0.670847
221898,2019-06-22,NT,RelativeHumidity,1357561.0,17.733223,74.113892,30.802208,125.066021
221899,2019-06-22,NT,SoilWaterContent,1357561.0,0.0,0.264075,0.067087,0.003592
221900,2019-06-22,NT,SolarRadiation,1357561.0,12.637497,20.044704,17.438541,2.493807
221901,2019-06-22,NT,Temperature,1357561.0,7.124587,25.157646,14.323231,17.84191
221902,2019-06-22,NT,WindSpeed,1357561.0,2.790959,9.785229,4.616783,0.82178


Checking NULL values for - TEMPERATURE

In [22]:
weather_data.loc[weather_data['Temperature_mean'].isna(), :]

Unnamed: 0,Date,Region,area,Precipitation_2nd_moment,Precipitation_max,Precipitation_mean,Precipitation_min,RelativeHumidity_2nd_moment,RelativeHumidity_max,RelativeHumidity_mean,...,SolarRadiation_mean,SolarRadiation_min,Temperature_2nd_moment,Temperature_max,Temperature_mean,Temperature_min,WindSpeed_2nd_moment,WindSpeed_max,WindSpeed_mean,WindSpeed_min
10493,2009-02-08,NSW,800234.3,0.111808,3.443802,0.09644,0.0,,,,...,28.499942,18.516203,,,,,3.19568,9.101792,3.861615,1.108891
10494,2009-02-08,NT,1357561.0,120.248699,80.331871,7.512906,0.0,,,,...,21.338024,5.833764,,,,,3.405661,10.320077,4.139762,0.98892
10495,2009-02-08,QL,1730605.0,279.584007,176.865433,9.807672,0.0,,,,...,19.476571,2.483397,,,,,1.510411,8.917587,3.688307,0.638895
10496,2009-02-08,SA,991315.1,0.00504,1.049786,0.015489,0.0,,,,...,27.29943,16.878788,,,,,2.527251,11.238954,6.878877,1.923652
10497,2009-02-08,TA,67785.26,1.719551,6.913176,1.033623,0.0,,,,...,21.995847,16.771393,,,,,2.148888,10.531963,4.759034,2.525351
10498,2009-02-08,VI,229453.2,0.502153,5.414611,0.361654,0.0,,,,...,24.60996,17.23601,,,,,1.588085,9.120632,5.007785,2.073939
10499,2009-02-08,WA,2528546.0,4.710052,22.894337,0.710461,0.0,,,,...,26.453286,14.515725,,,,,3.458938,10.121881,6.069578,2.050614
34419,2018-06-19,NSW,800234.3,2.780286,28.75461,0.347647,0.0,,,,...,10.198048,3.253069,,,,,1.183612,9.898138,2.797399,1.053749
34420,2018-06-19,NT,1357561.0,0.000539,1.732284,0.00135,0.0,,,,...,17.050411,11.142137,,,,,0.452138,9.185046,3.272381,1.700933
34421,2018-06-19,QL,1730605.0,0.011031,2.666157,0.010751,0.0,,,,...,16.228667,8.911916,,,,,0.713732,11.24132,3.401858,1.166037


Checking the original data for Temperature rows, since Temperature is null for two dates: 2009-02-08 and 2018-06-19.

In [23]:
weather_df.loc[weather_df['Date'] == "2009-02-08", :]

Unnamed: 0,Date,Region,Parameter,area,min,max,mean,2nd_moment
62951,2009-02-08,NSW,Precipitation,800234.3,0.0,3.443802,0.09644,0.111808
62952,2009-02-08,NSW,SoilWaterContent,800234.3,0.01704425,0.383857,0.149632,0.004283
62953,2009-02-08,NSW,SolarRadiation,800234.3,18.5162,31.312689,28.499942,4.225705
62954,2009-02-08,NSW,WindSpeed,800234.3,1.108891,9.101792,3.861615,3.19568
62955,2009-02-08,NT,Precipitation,1357561.0,0.0,80.331871,7.512906,120.248699
62956,2009-02-08,NT,SoilWaterContent,1357561.0,6.010911e-07,0.494226,0.178735,0.025239
62957,2009-02-08,NT,SolarRadiation,1357561.0,5.833764,30.037846,21.338024,40.175295
62958,2009-02-08,NT,WindSpeed,1357561.0,0.9889196,10.320077,4.139762,3.405661
62959,2009-02-08,QL,Precipitation,1730605.0,0.0,176.865433,9.807672,279.584007
62960,2009-02-08,QL,SoilWaterContent,1730605.0,6.010911e-07,0.512477,0.253117,0.020698


In [24]:
weather_df.loc[weather_df['Date'] == "2018-06-19", :]

Unnamed: 0,Date,Region,Parameter,area,min,max,mean,2nd_moment
206458,2018-06-19,NSW,Precipitation,800234.3,0.0,28.75461,0.347647,2.780286
206459,2018-06-19,NSW,SoilWaterContent,800234.3,0.003328,0.417228,0.213582,0.010501
206460,2018-06-19,NSW,SolarRadiation,800234.3,3.253069,13.556427,10.198048,5.565272
206461,2018-06-19,NSW,WindSpeed,800234.3,1.053749,9.898138,2.797399,1.183612
206462,2018-06-19,NT,Precipitation,1357561.0,0.0,1.732284,0.00135,0.000539
206463,2018-06-19,NT,SoilWaterContent,1357561.0,0.0,0.28254,0.067721,0.00376
206464,2018-06-19,NT,SolarRadiation,1357561.0,11.142137,19.234465,17.050411,1.936097
206465,2018-06-19,NT,WindSpeed,1357561.0,1.700933,9.185046,3.272381,0.452138
206466,2018-06-19,QL,Precipitation,1730605.0,0.0,2.666157,0.010751,0.011031
206467,2018-06-19,QL,SoilWaterContent,1730605.0,0.0,0.367845,0.132323,0.00421


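This confirms that Temperature has no rows in the original data for two dates, 2009-02-08 and 2018-06-19. Relative humidity, which is derived from temperature, is absent from the rows shown for the same dates.

This also confirms that the pivot arranged the data correctly: the null values exist because the original data has no readings for those parameters on those dates.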

In [25]:
# find only the columns that have missing values
null_columns = weather_data.columns[weather_data.isna().any()]
weather_data[null_columns].isna().sum()

Precipitation_2nd_moment        7
Precipitation_max               7
Precipitation_mean              7
Precipitation_min               7
RelativeHumidity_2nd_moment    42
RelativeHumidity_max           42
RelativeHumidity_mean          42
RelativeHumidity_min           42
SolarRadiation_2nd_moment      14
SolarRadiation_max             14
SolarRadiation_mean            14
SolarRadiation_min             14
Temperature_2nd_moment         14
Temperature_max                14
Temperature_mean               14
Temperature_min                14
WindSpeed_2nd_moment           28
WindSpeed_max                  28
WindSpeed_mean                 28
WindSpeed_min                  28
dtype: int64

In [26]:
# Display the index for missing values
#weather_data[weather_data.isna().any(axis=1)].index

In [27]:
# columns DataFrame with missing values
weather_data[weather_data.isna().any(axis=1)][null_columns]

Unnamed: 0,Precipitation_2nd_moment,Precipitation_max,Precipitation_mean,Precipitation_min,RelativeHumidity_2nd_moment,RelativeHumidity_max,RelativeHumidity_mean,RelativeHumidity_min,SolarRadiation_2nd_moment,SolarRadiation_max,SolarRadiation_mean,SolarRadiation_min,Temperature_2nd_moment,Temperature_max,Temperature_mean,Temperature_min,WindSpeed_2nd_moment,WindSpeed_max,WindSpeed_mean,WindSpeed_min
4165,0.103268,5.449200,0.047475,0.000000,,,,,4.271965,18.647676,16.310927,10.194108,7.459332,17.644295,9.641613,0.967949,0.189276,4.213328,2.308445,0.836806
4166,0.000143,0.264100,0.001280,0.000000,,,,,1.270609,23.526985,21.581259,17.291561,8.550165,25.505045,19.497667,13.527777,0.707050,8.489419,3.466174,0.817809
4167,0.005799,1.947693,0.008666,0.000000,,,,,2.051810,22.918108,20.219926,14.184232,6.671278,24.321589,18.480937,9.788080,0.703204,10.176150,2.893979,1.045502
4168,0.001177,0.828103,0.003671,0.000000,,,,,2.074407,19.501968,17.004990,8.967422,7.754605,20.850836,15.690520,9.280705,1.247970,8.295160,3.754106,1.237808
4169,6.470552,10.133667,2.516017,0.000000,,,,,2.596430,12.314164,8.716046,5.677072,3.853864,12.436457,7.332671,2.521077,1.945459,10.319861,4.225402,2.398384
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37137,0.009111,2.198727,0.013224,0.000000,134.210051,87.916321,52.754183,24.495546,,,,,11.213843,25.035809,16.473898,10.031658,,,,
37138,1.691292,11.486980,0.534394,0.000000,97.019022,92.161705,64.657422,41.511185,,,,,3.464617,16.233313,12.788035,7.456407,,,,
37139,82.333966,46.158062,8.328443,0.096614,70.055438,97.620720,85.213593,66.834213,,,,,4.993865,10.564859,5.011099,-0.156643,,,,
37140,36.169893,33.372616,7.296902,1.563267,35.051307,94.720634,84.244987,63.105289,,,,,3.007378,12.426391,7.625355,2.217532,,,,


In [28]:
#fill null values with zeros
weather_data = weather_data.fillna(0)
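
Filling with zero treats a missing reading as a real value of 0, which for temperature or humidity is far from the neighbouring days. For comparison only, the sketch below shows a different technique, linear interpolation within each region. It is not what this notebook does, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: the middle day is missing in both regions
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2009-02-07", "2009-02-08", "2009-02-09"] * 2),
    "Region": ["NSW"] * 3 + ["NT"] * 3,
    "Temperature_mean": [24.0, np.nan, 26.0, 30.0, np.nan, 32.0],
})

# Sort so that interpolation runs along time within each region
toy = toy.sort_values(["Region", "Date"]).reset_index(drop=True)

# Zero fill, as in the cell above
toy["zero_filled"] = toy["Temperature_mean"].fillna(0)

# Alternative: linear interpolation per region (assumes evenly spaced daily rows)
toy["interpolated"] = (
    toy.groupby("Region")["Temperature_mean"]
       .transform(lambda s: s.interpolate(limit_direction="both"))
)
print(toy)
```
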

In [29]:
weather_data.isna().sum()

Date                           0
Region                         0
area                           0
Precipitation_2nd_moment       0
Precipitation_max              0
Precipitation_mean             0
Precipitation_min              0
RelativeHumidity_2nd_moment    0
RelativeHumidity_max           0
RelativeHumidity_mean          0
RelativeHumidity_min           0
SoilWaterContent_2nd_moment    0
SoilWaterContent_max           0
SoilWaterContent_mean          0
SoilWaterContent_min           0
SolarRadiation_2nd_moment      0
SolarRadiation_max             0
SolarRadiation_mean            0
SolarRadiation_min             0
Temperature_2nd_moment         0
Temperature_max                0
Temperature_mean               0
Temperature_min                0
WindSpeed_2nd_moment           0
WindSpeed_max                  0
WindSpeed_mean                 0
WindSpeed_min                  0
dtype: int64

In [30]:
# find duplicates
weather_data.duplicated().sum()

0

#### Weather Data Review <a class="anchor" id="DataReview"></a>

In [31]:
weather_data.dtypes

Date                           datetime64[ns]
Region                                 object
area                                  float64
Precipitation_2nd_moment              float64
Precipitation_max                     float64
Precipitation_mean                    float64
Precipitation_min                     float64
RelativeHumidity_2nd_moment           float64
RelativeHumidity_max                  float64
RelativeHumidity_mean                 float64
RelativeHumidity_min                  float64
SoilWaterContent_2nd_moment           float64
SoilWaterContent_max                  float64
SoilWaterContent_mean                 float64
SoilWaterContent_min                  float64
SolarRadiation_2nd_moment             float64
SolarRadiation_max                    float64
SolarRadiation_mean                   float64
SolarRadiation_min                    float64
Temperature_2nd_moment                float64
Temperature_max                       float64
Temperature_mean                  

In [32]:
# row counts per Region
weather_data.pivot_table(index= ['Region'], aggfunc='size')

Region
NSW    5783
NT     5783
QL     5783
SA     5783
TA     5783
VI     5783
WA     5783
dtype: int64

#### Saving out the final C&P_Weather CSV File <a class="anchor" id="PreprocessedWeather"></a>

In [33]:
final_file = "C&P_Weather.csv"
print("Saving file: '{}'".format(final_file))
weather_data.to_csv(final_file, index=False, encoding='utf-8')
print("File Saved...")

Saving file: 'C&P_Weather.csv'
File Saved...


In [34]:
# check DataFrame exported
# raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"P:\Wildfires_Australia\cfc_wildfireforecastforAustralia\C&P_Weather.csv")
df['Date'] = pd.to_datetime(df['Date'])

In [35]:
df.head()

Unnamed: 0,Date,Region,area,Precipitation_2nd_moment,Precipitation_max,Precipitation_mean,Precipitation_min,RelativeHumidity_2nd_moment,RelativeHumidity_max,RelativeHumidity_mean,...,SolarRadiation_mean,SolarRadiation_min,Temperature_2nd_moment,Temperature_max,Temperature_mean,Temperature_min,WindSpeed_2nd_moment,WindSpeed_max,WindSpeed_mean,WindSpeed_min
0,2005-01-01,NSW,800234.3,0.028362,1.836935,0.044274,0.0,253.559937,80.522964,36.355567,...,26.749389,14.515009,18.562212,35.878704,27.341182,14.485785,0.850048,7.670482,3.32355,1.354448
1,2005-01-01,NT,1357561.0,546.059262,315.266815,9.884958,0.0,584.201131,95.683342,61.494675,...,19.781791,2.51812,12.920252,38.136787,29.881492,24.17996,1.930014,9.704402,5.296892,1.840394
2,2005-01-01,QL,1730605.0,35.641257,74.452164,1.453053,0.0,403.134377,95.89827,47.959364,...,27.056979,6.033827,13.792599,37.047943,28.842866,20.95162,0.883048,7.675632,3.483753,1.106028
3,2005-01-01,SA,991315.1,0.042837,3.193624,0.059078,0.0,246.044713,81.980751,30.057683,...,27.142643,17.861103,34.799336,38.326847,30.793675,14.095855,1.655908,10.044715,4.657538,2.023657
4,2005-01-01,TA,67785.26,12.068597,13.604791,3.099497,0.003973,111.754034,81.501442,65.086764,...,26.755711,20.742302,4.912013,16.22851,11.788805,6.686816,2.963118,11.432408,5.408138,1.995647


In [36]:
df.shape

(40481, 27)

In [37]:
df.Date.dtype.name

'datetime64[ns]'
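
CSV files do not store dtypes, which is why `Date` has to be converted again after reloading. As a small sketch, the conversion can also happen during the read by passing the column name to `parse_dates`. An in-memory buffer stands in for the saved file and the values are illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "Date": pd.to_datetime(["2005-01-01", "2005-01-02"]),
    "Region": ["NSW", "NT"],
    "Temperature_mean": [27.3, 29.9],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the Date column while reading, by name rather than by position
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```
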

In [38]:
df.isna().sum()

Date                           0
Region                         0
area                           0
Precipitation_2nd_moment       0
Precipitation_max              0
Precipitation_mean             0
Precipitation_min              0
RelativeHumidity_2nd_moment    0
RelativeHumidity_max           0
RelativeHumidity_mean          0
RelativeHumidity_min           0
SoilWaterContent_2nd_moment    0
SoilWaterContent_max           0
SoilWaterContent_mean          0
SoilWaterContent_min           0
SolarRadiation_2nd_moment      0
SolarRadiation_max             0
SolarRadiation_mean            0
SolarRadiation_min             0
Temperature_2nd_moment         0
Temperature_max                0
Temperature_mean               0
Temperature_min                0
WindSpeed_2nd_moment           0
WindSpeed_max                  0
WindSpeed_mean                 0
WindSpeed_min                  0
dtype: int64