#### The Objective

The ultimate aim of the challenge is to **predict the area of wildfires in 7 regions in Australia for February 2021** using historical data, before the fires have happened!

There are three submissions:
- 1) Predict wildfires in February 2020.
- 2) Predict wildfires in the 3rd and 4th weeks of January 2021.
- 3) Predict wildfires in February 2021.

#### 1.1 Historical Wildfires

This wildfire dataset contains data on fire activities in Australia starting from 2005. Additional information can be found [here](https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/c6-mcd14dl). For this competition, all MCD14DL data was further processed by using [IBM PAIRS Geoscope](https://ibmpairs.mybluemix.net/).

#### Variables 

* `Region`: the 7 regions
* `Date`: the day in UTC; each record covers the 24 hours starting from that date
* `Estimated_fire_area`: daily sum of estimated fire area for presumed vegetation fires with a confidence > 75% for each region, in km$^2$
* `Mean_estimated_fire_brightness`: daily mean (averaged over the flagged fire pixels, i.e. `Count`) of estimated fire brightness for presumed vegetation fires with a confidence level > 75%, in Kelvin
* `Mean_estimated_fire_radiative_power`: daily mean of estimated radiative power for presumed vegetation fires with a confidence level > 75% for a given region in megawatts
* `Mean_confidence`: daily mean of confidence for presumed vegetation fires with a confidence level > 75% 
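* `Std_confidence`: daily standard deviation of confidence for presumed vegetation fires with a confidence level > 75% (the original description reads "estimated fire radiative power in megawatts", which matches neither the column name nor the values)
* `Var_confidence`: daily variance of confidence for presumed vegetation fires with a confidence level > 75% (same caveat as for `Std_confidence`)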
* `Count`: daily number of pixels for presumed vegetation fires with a confidence level larger than 75% for a given region
* `Replaced`: indicates whether the data has been replaced with standard-quality data once it becomes available (usually with a 2-3 month lag). The description says this is flagged with a `Y`, though the rows shown in this notebook carry `R` and `N`. Replaced data has slightly higher quality in terms of locations
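
As a quick sanity check on the two confidence statistics, the sketch below tests whether `Var_confidence` is the square of `Std_confidence`. It uses only the five NSW rows printed later in this notebook, typed in by hand, so it is an illustration and not a check on the full file; the tolerance `rtol=1e-4` is an assumption chosen to absorb rounding in the printed values.

```python
import numpy as np
import pandas as pd

# Values copied by hand from the first five rows displayed in this notebook
# (NSW, 2005-01-04 to 2005-01-08).
sample = pd.DataFrame({
    "Std_confidence": [2.886751, 8.088793, 3.21455, 7.52994, 7.937254],
    "Var_confidence": [8.333333, 65.428571, 10.333333, 56.7, 63.0],
})

# If the two columns describe the same quantity, std squared should match
# the variance up to the rounding of the printed numbers.
consistent = np.isclose(
    sample["Std_confidence"] ** 2, sample["Var_confidence"], rtol=1e-4
)
print(consistent.all())
```

On the real data the same comparison could be run on `wildfires_df` directly, after excluding rows where either column is null.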

#### Steps:
[1. Load Packages](#LoadPackages) 

[2. Descriptive Stats](#DescriptiveStats) 

[3. Evaluating for Missing Values](#MissingValues) 

[4. Fixing Missing Values (2 columns)](#FixingMissing) 

[5. Checking for Duplicates (no duplicates)](#Duplicates) 

[6. Fixing Duplicates (only a note placeholder)](#FixingDuplicates) 

[7. Wildfires Data Review](#DataReview) 

[8. Save out Pre-Processed "C&P_Wildfires" CSV File](#PreprocessedWildfires) 

#### Load packages <a class="anchor" id="LoadPackages"></a>

In [1]:
# Import the necessary packages for analysis and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpld3
%matplotlib inline
import json
import datetime

from shapely.geometry import Polygon, mapping
import geopandas as gpd
import folium
from folium.plugins import TimeSliderChoropleth
import seaborn as sns
import plotly.express as px

sns.set_style("whitegrid")

import warnings
warnings.filterwarnings("ignore")

#### Notes:

* This file contains `Estimated_fire_area`, which can be used as the label to train the model.
* The other columns can also be used as features to train the model.
* Data types have been changed to match across all other datasets.
* Filled the 2,207 null values in each of the `Std_confidence` and `Var_confidence` columns with the column mean.
* No duplicates or drops.

In [2]:
# Load the dataset
wildfires = "H_Wildfires.csv"
print("Reading file: '{}'".format(wildfires))
wildfires_df = pd.read_csv(wildfires, parse_dates=[1])
print("Loaded...")

# Columns and their datatypes
print(wildfires_df.dtypes)

wildfires_df.head()

Reading file: 'H_Wildfires.csv'
Loaded...
Region                                         object
Date                                   datetime64[ns]
Estimated_fire_area                           float64
Mean_estimated_fire_brightness                float64
Mean_estimated_fire_radiative_power           float64
Mean_confidence                               float64
Std_confidence                                float64
Var_confidence                                float64
Count                                           int64
Replaced                                       object
dtype: object


Unnamed: 0,Region,Date,Estimated_fire_area,Mean_estimated_fire_brightness,Mean_estimated_fire_radiative_power,Mean_confidence,Std_confidence,Var_confidence,Count,Replaced
0,NSW,2005-01-04,8.68,312.266667,42.4,78.666667,2.886751,8.333333,3,R
1,NSW,2005-01-05,16.61125,322.475,62.3625,85.5,8.088793,65.428571,8,R
2,NSW,2005-01-06,5.52,325.266667,38.4,78.333333,3.21455,10.333333,3,R
3,NSW,2005-01-07,6.264,313.87,33.8,92.2,7.52994,56.7,5,R
4,NSW,2005-01-08,5.4,337.383333,122.533333,91.0,7.937254,63.0,3,R


#### Notes:
For every region {object}:
- 1 - Date : (format YYYY-MM-DD) {datetime64[ns]}
- 2 - Estimated fire area : (km2) {float64}
- 3 - Mean estimated brightness: (K) {float64}
- 4 - Mean estimated fire radiative power (MW) {float64}

In [3]:
# Convert Date to datetime for consistency across all datasets
wildfires_df['Date'] = pd.to_datetime(wildfires_df['Date'])
wildfires_df.head()

Unnamed: 0,Region,Date,Estimated_fire_area,Mean_estimated_fire_brightness,Mean_estimated_fire_radiative_power,Mean_confidence,Std_confidence,Var_confidence,Count,Replaced
0,NSW,2005-01-04,8.68,312.266667,42.4,78.666667,2.886751,8.333333,3,R
1,NSW,2005-01-05,16.61125,322.475,62.3625,85.5,8.088793,65.428571,8,R
2,NSW,2005-01-06,5.52,325.266667,38.4,78.333333,3.21455,10.333333,3,R
3,NSW,2005-01-07,6.264,313.87,33.8,92.2,7.52994,56.7,5,R
4,NSW,2005-01-08,5.4,337.383333,122.533333,91.0,7.937254,63.0,3,R


#### Descriptive Stats <a class="anchor" id="DescriptiveStats"></a>

In [4]:
wildfires_df.dtypes

Region                                         object
Date                                   datetime64[ns]
Estimated_fire_area                           float64
Mean_estimated_fire_brightness                float64
Mean_estimated_fire_radiative_power           float64
Mean_confidence                               float64
Std_confidence                                float64
Var_confidence                                float64
Count                                           int64
Replaced                                       object
dtype: object

In [5]:
wildfires_df.shape

(26406, 10)

In [6]:
wildfires_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26406 entries, 0 to 26405
Data columns (total 10 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Region                               26406 non-null  object        
 1   Date                                 26406 non-null  datetime64[ns]
 2   Estimated_fire_area                  26406 non-null  float64       
 3   Mean_estimated_fire_brightness       26406 non-null  float64       
 4   Mean_estimated_fire_radiative_power  26406 non-null  float64       
 5   Mean_confidence                      26406 non-null  float64       
 6   Std_confidence                       24199 non-null  float64       
 7   Var_confidence                       24199 non-null  float64       
 8   Count                                26406 non-null  int64         
 9   Replaced                             26406 non-null  object        
dtypes: datetime64[ns](1), float64(6), int64(1), object(2)

#### Evaluating for Missing Values <a class="anchor" id="MissingValues"></a>

In [7]:
# check for missing values
wildfires_df.isna().sum()

Region                                    0
Date                                      0
Estimated_fire_area                       0
Mean_estimated_fire_brightness            0
Mean_estimated_fire_radiative_power       0
Mean_confidence                           0
Std_confidence                         2207
Var_confidence                         2207
Count                                     0
Replaced                                  0
dtype: int64

In [8]:
wildfires_df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Estimated_fire_area,26406.0,144.297966,314.453847,1.0,8.911875,38.434091,146.951278,10120.94317
Mean_estimated_fire_brightness,26406.0,319.662078,8.862005,290.7,313.933333,319.784412,325.403144,381.95
Mean_estimated_fire_radiative_power,26406.0,83.621258,67.510022,0.0,44.150391,67.133333,103.123611,2178.6
Mean_confidence,26406.0,87.574735,4.371972,76.0,85.0,87.771429,90.498403,100.0
Std_confidence,24199.0,7.228302,1.995221,0.0,6.68701,7.707025,8.236665,16.970563
Var_confidence,24199.0,56.229092,25.898935,0.0,44.716106,59.398234,67.842642,288.0
Count,26406.0,72.059305,150.973128,1.0,5.0,20.0,74.0,3954.0


In [9]:
print("Rows    : ", wildfires_df.shape[0])
print("Columns : ", wildfires_df.shape[1])
print("\nFeatures : ", wildfires_df.columns.tolist())
print("\nMissing Values : \n",wildfires_df.isnull().any())
print("\nUnique Values : \n",wildfires_df.nunique())
print("Number of records: {}".format(len(wildfires_df)))
print("Number of regions: {}\n".format(len(wildfires_df['Region'].unique())))
print(wildfires_df['Region'].unique())

Rows    :  26406
Columns :  10

Features :  ['Region', 'Date', 'Estimated_fire_area', 'Mean_estimated_fire_brightness', 'Mean_estimated_fire_radiative_power', 'Mean_confidence', 'Std_confidence', 'Var_confidence', 'Count', 'Replaced']

Missing Values : 
 Region                                 False
Date                                   False
Estimated_fire_area                    False
Mean_estimated_fire_brightness         False
Mean_estimated_fire_radiative_power    False
Mean_confidence                        False
Std_confidence                          True
Var_confidence                          True
Count                                  False
Replaced                               False
dtype: bool

Unique Values : 
 Region                                     7
Date                                    5782
Estimated_fire_area                    18041
Mean_estimated_fire_brightness         20203
Mean_estimated_fire_radiative_power    21415
Mean_confidence                        

In [10]:
# find only the columns that have missing values
null_columns = wildfires_df.columns[wildfires_df.isna().any()]
wildfires_df[null_columns].isna().sum()

Std_confidence    2207
Var_confidence    2207
dtype: int64

In [11]:
# Check the reason for above null values
wildfires_df.loc[wildfires_df.Std_confidence.isna(), :]
wildfires_df.loc[wildfires_df.Var_confidence.isna(), :]

Unnamed: 0,Region,Date,Estimated_fire_area,Mean_estimated_fire_brightness,Mean_estimated_fire_radiative_power,Mean_confidence,Std_confidence,Var_confidence,Count,Replaced
48,NSW,2005-02-26,1.00,303.15,8.0,79.0,,,1,R
149,NSW,2005-06-12,1.00,302.55,17.9,79.0,,,1,R
154,NSW,2005-06-18,5.27,301.30,71.9,77.0,,,1,R
157,NSW,2005-06-25,9.60,300.70,145.9,76.0,,,1,R
163,NSW,2005-07-09,2.80,294.65,37.8,79.0,,,1,R
...,...,...,...,...,...,...,...,...,...,...
26327,WA,2020-08-09,2.34,300.15,30.2,85.0,,,1,N
26331,WA,2020-08-13,1.10,320.35,27.1,83.0,,,1,N
26332,WA,2020-08-14,1.00,302.15,15.8,77.0,,,1,N
26335,WA,2020-08-20,1.92,326.85,86.2,92.0,,,1,N


In [12]:
# distinct "Count" column values when Std_confidence and Var_confidence are NULL.
print("Distinct 'Count' column values when Std_confidence and Var_confidence are NULL.\n")
Count_values = wildfires_df.loc[(wildfires_df['Std_confidence'].isna()) & (wildfires_df['Var_confidence'].isna()), 'Count'].values
print("'Count' Column Values: {}".format(Count_values))
# Display the index for missing values
#wildfires_df[wildfires_df.isna().any(axis=1)].index

Distinct 'Count' column values when Std_confidence and Var_confidence are NULL.

'Count' Column Values: [1 1 1 ... 1 1 1]
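
The printed array above is truncated, so it cannot by itself show that every null row has `Count` equal to 1. A minimal sketch of a stricter check follows, using `value_counts()` to collapse repeated values. The frame `toy` is a hypothetical stand-in for `wildfires_df`, with made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for wildfires_df.
toy = pd.DataFrame({
    "Count": [3, 1, 8, 1, 5],
    "Std_confidence": [2.89, np.nan, 8.09, np.nan, 7.53],
    "Var_confidence": [8.33, np.nan, 65.43, np.nan, 56.7],
})

# Rows where both statistics are missing.
both_null = toy["Std_confidence"].isna() & toy["Var_confidence"].isna()

# value_counts() lists each distinct Count once with its frequency, so a
# single entry for 1 means every null row comes from a one-pixel day.
null_count_summary = toy.loc[both_null, "Count"].value_counts()
print(null_count_summary)
```

Applied to `wildfires_df`, a summary with any key other than 1 would mean the nulls have a second cause worth investigating.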


In [13]:
# columns DataFrame with missing values 
wildfires_df[wildfires_df.isna().any(axis=1)][null_columns]

Unnamed: 0,Std_confidence,Var_confidence
48,,
149,,
154,,
157,,
163,,
...,...,...
26327,,
26331,,
26332,,
26335,,


#### Fixing Missing Values <a class="anchor" id="FixingMissing"></a>

In [14]:
#### Fill NaN with "mean" value
Std_confidence_mean_value = wildfires_df.Std_confidence.mean()
Std_confidence_mean_value


7.228302073334739

In [15]:
wildfires_df.Std_confidence = wildfires_df.Std_confidence.fillna(Std_confidence_mean_value)

In [16]:
Var_confidence_mean_value = wildfires_df.Var_confidence.mean()
Var_confidence_mean_value

56.22909232542304

In [17]:
wildfires_df.Var_confidence = wildfires_df.Var_confidence.fillna(Var_confidence_mean_value)

In [18]:
#check for missing values
wildfires_df.isna().sum()

Region                                 0
Date                                   0
Estimated_fire_area                    0
Mean_estimated_fire_brightness         0
Mean_estimated_fire_radiative_power    0
Mean_confidence                        0
Std_confidence                         0
Var_confidence                         0
Count                                  0
Replaced                               0
dtype: int64

In [19]:
# Alternative approach (not applied; the column means were used above instead):
# Std_confidence and Var_confidence are null where Count equals 1, i.e. a single
# pixel stands behind the other values, so these nulls could be filled with zero.
#wildfires_df.loc[wildfires_df['Std_confidence'].isna(), 'Std_confidence'] = 0
#wildfires_df.loc[wildfires_df['Var_confidence'].isna(), 'Var_confidence'] = 0
#print("\nMissing Values : \n",wildfires_df.isnull().any())  # check that no null values remain
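
Why a one-pixel day would produce a null can be illustrated with pandas itself. The sketch below assumes the upstream statistics were computed with the sample estimator (dividing by n - 1), which this notebook does not confirm; the value `79.0` is an arbitrary example.

```python
import pandas as pd

# A single confidence value, as on a day with Count == 1.
single_pixel = pd.Series([79.0])

# pandas defaults to the sample estimator (ddof=1), which divides by
# n - 1 = 0 for one observation and therefore returns NaN.
sample_std = single_pixel.std()
sample_var = single_pixel.var()

# The population estimator (ddof=0) is defined for one value and is zero.
population_std = single_pixel.std(ddof=0)

print(sample_std, sample_var, population_std)
```

Under that assumption, filling with zero treats a single pixel as having no spread, while filling with the column mean assigns it the typical spread of multi-pixel days. Which is preferable depends on how the model uses these features.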

#### Checking for Duplicates <a class="anchor" id="Duplicates"></a>

In [20]:
# find duplicates
wildfires_df.duplicated().sum()

0
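
`duplicated()` with no arguments only flags rows that are identical in every column. A complementary check, sketched below on a hypothetical toy frame, looks for repeated `Region`/`Date` pairs, since two rows for the same region and day with different measurements would not be caught by the full-row test.

```python
import pandas as pd

# Hypothetical toy frame: the first two rows share Region and Date
# but differ in the measured area.
toy = pd.DataFrame({
    "Region": ["NSW", "NSW", "NT", "NT"],
    "Date": pd.to_datetime(
        ["2005-01-04", "2005-01-04", "2005-01-04", "2005-01-05"]
    ),
    "Estimated_fire_area": [8.68, 8.70, 5.52, 6.26],
})

# Full-row duplicates: every column must match.
full_row_dupes = toy.duplicated().sum()

# Key duplicates: only Region and Date must match.
key_dupes = toy.duplicated(subset=["Region", "Date"]).sum()

print(full_row_dupes, key_dupes)
```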

#### Fixing Duplicates <a class="anchor" id="FixingDuplicates"></a>

In [21]:
# If there were duplicates, they could be removed by dropping them,
# resetting the index, and checking the total number of records

# Remove Duplicates
#wildfires_df.drop_duplicates(inplace=True)

# Reset dataframe index
#wildfires_df.reset_index(drop=True, inplace=True)

# Number of records
#num_rows, num_cols = wildfires_df.shape
#print("Total Records:\t{}".format(num_rows))

# First five rows in data
#wildfires_df.head()

#### Wildfires Data Review <a class="anchor" id="DataReview"></a>

In [22]:
# frequencies for Region column
wildfires_df.pivot_table(index= ['Region'], aggfunc='size')

Region
NSW    4623
NT     5053
QL     5533
SA     1990
TA     1404
VI     2176
WA     5627
dtype: int64
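
The row counts differ considerably between regions, which suggests that not every region has a record for every day. A minimal sketch of how the gaps could be quantified per region is shown below; the frame `toy` and the column names `first_date`, `last_date`, `rows`, `calendar_days`, and `missing_days` are hypothetical, and the sketch does not say why a day is absent.

```python
import pandas as pd

# Hypothetical toy frame standing in for wildfires_df.
toy = pd.DataFrame({
    "Region": ["NSW", "NSW", "NSW", "TA", "TA"],
    "Date": pd.to_datetime(
        ["2005-01-04", "2005-01-05", "2005-01-08", "2005-01-04", "2005-01-10"]
    ),
})

# First date, last date, and number of rows per region.
coverage = toy.groupby("Region")["Date"].agg(
    first_date="min", last_date="max", rows="count"
)

# Days spanned by each region's records versus days actually present.
coverage["calendar_days"] = (
    coverage["last_date"] - coverage["first_date"]
).dt.days + 1
coverage["missing_days"] = coverage["calendar_days"] - coverage["rows"]

print(coverage)
```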

#### Saving out the final C&P_Wildfires CSV File <a class="anchor" id="PreprocessedWildfires"></a>

In [23]:
final_file = "C&P_Wildfires.csv"
print("Saving file: '{}'".format(final_file))
wildfires_df.to_csv(final_file, index=False, encoding='utf-8')
print("File Saved...")

Saving file: 'C&P_Wildfires.csv'
File Saved...


In [24]:
# check DataFrame exported (raw string so the backslashes in the Windows path are not treated as escapes)
df = pd.read_csv(r"P:\Wildfires_Australia\cfc_wildfireforecastforAustralia\C&P_Wildfires.csv")
df['Date'] = pd.to_datetime(df['Date'])

In [25]:
df.head()

Unnamed: 0,Region,Date,Estimated_fire_area,Mean_estimated_fire_brightness,Mean_estimated_fire_radiative_power,Mean_confidence,Std_confidence,Var_confidence,Count,Replaced
0,NSW,2005-01-04,8.68,312.266667,42.4,78.666667,2.886751,8.333333,3,R
1,NSW,2005-01-05,16.61125,322.475,62.3625,85.5,8.088793,65.428571,8,R
2,NSW,2005-01-06,5.52,325.266667,38.4,78.333333,3.21455,10.333333,3,R
3,NSW,2005-01-07,6.264,313.87,33.8,92.2,7.52994,56.7,5,R
4,NSW,2005-01-08,5.4,337.383333,122.533333,91.0,7.937254,63.0,3,R


In [26]:
df.shape

(26406, 10)

In [27]:
df.Date.dtype.name

'datetime64[ns]'

In [28]:
df.isna().sum()

Region                                 0
Date                                   0
Estimated_fire_area                    0
Mean_estimated_fire_brightness         0
Mean_estimated_fire_radiative_power    0
Mean_confidence                        0
Std_confidence                         0
Var_confidence                         0
Count                                  0
Replaced                               0
dtype: int64
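
The export checks above can also be written as one round trip. The sketch below uses an in-memory buffer in place of the CSV file and a hypothetical two-row frame, and parses `Date` while reading via `parse_dates` so that no separate `pd.to_datetime` call is needed.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for wildfires_df.
original = pd.DataFrame({
    "Region": ["NSW", "NT"],
    "Date": pd.to_datetime(["2005-01-04", "2005-01-05"]),
    "Estimated_fire_area": [8.68, 16.61125],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the Date column while reading.
reloaded = pd.read_csv(buffer, parse_dates=["Date"])

# Shape, dates, and null counts should survive the round trip.
same_shape = reloaded.shape == original.shape
same_dates = bool((reloaded["Date"] == original["Date"]).all())
no_nulls = int(reloaded.isna().sum().sum()) == 0
print(same_shape, same_dates, no_nulls)
```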