#### Imports

In [None]:
# Basic imports
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
from importlib import reload

In [None]:
# Custom imports
import utils
import emissions_utils
import preparation_utils

#### Index

1. [Introduction](#intro)<br>
2. [Retrieving Wildfire Data](#rw)<br>
3. [Gathering Weather Data](#gwd)<br>
4. [Adding Emissions](#ae)<br>
5. [Data Cleaning](#dc)<br>
6. [Calculating Additional Features](#caf)<br>

---
<a id='intro'></a>
# Introduction

As noted previously, the scarcity of larger fires in our sample hurts the accuracy and performance of the classification models that we hope to build. While upsampling techniques can improve the class distribution, the better alternative is to draw more real observations from our wildfire data set. This adds observations for the minority classes, but because the full data set is itself unevenly distributed, the total number of observations will not increase massively.

Regardless, we will continue by collecting only fires that belong to fire classes D and above and removing any that were already collected in previous samples. We will then replicate the process that we carried out previously, collecting weather and emissions data for the newly acquired observations, and ultimately concatenate the result onto the sample that we initially generated.

---
<a id='rw'></a>
## Retrieving Wildfire Data
To begin, we will have to query the `.sqlite` file that contains the wildfire data.

In [6]:
# Get only the relevant columns
con = sqlite3.connect("wildfire_data/wildfires.sqlite")
query = """
    SELECT FIRE_YEAR, DISCOVERY_DOY, FIRE_SIZE, FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, STATE
    FROM fires
"""
wildfires = pd.read_sql_query(query, con)
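
Since only classes D and above are needed for this round of sampling, the class filter could also be pushed into the SQL query itself. The sketch below is only an illustration of that idea: it uses an in-memory database with a few made-up rows as a stand-in for `wildfires.sqlite`, and only the table and column names are taken from the query above.

```python
import sqlite3

# Hypothetical in-memory stand-in for wildfires.sqlite (rows are made up)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fires (FIRE_YEAR INTEGER, FIRE_SIZE REAL, FIRE_SIZE_CLASS TEXT)")
con.executemany(
    "INSERT INTO fires VALUES (?, ?, ?)",
    [(2005, 0.1, "A"), (2003, 232.0, "D"), (2010, 8000.0, "G")],
)

# Build one "?" placeholder per wanted class and let sqlite3 bind the values
wanted = ("D", "E", "F", "G")
placeholders = ", ".join("?" for _ in wanted)
query = f"""
    SELECT FIRE_YEAR, FIRE_SIZE_CLASS
    FROM fires
    WHERE FIRE_SIZE_CLASS IN ({placeholders})
"""
large_fires = con.execute(query, wanted).fetchall()
con.close()
```

With the real file, the same `WHERE` clause could be passed to `pd.read_sql_query` through its `params` argument. The notebook instead loads every class and filters in pandas, which keeps the full table available for inspection.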

In [7]:
# Drop duplicates
wildfires.drop_duplicates(inplace=True)

In [8]:
# Create the DATE column
wildfires['DATE'] = pd.to_datetime(wildfires['FIRE_YEAR'] * 1000 + wildfires['DISCOVERY_DOY'], format='%Y%j')

# Pop the column
date = wildfires.pop('DATE')

# Insert it into the relevant position
wildfires.insert(0, 'DATE', date)

In [9]:
wildfires.head()

Unnamed: 0,DATE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE
0,2005-02-02,2005,33,0.1,A,40.036944,-121.005833,CA
1,2004-05-12,2004,133,0.25,A,38.933056,-120.404444,CA
2,2004-05-31,2004,152,0.1,A,38.984167,-120.735556,CA
3,2004-06-28,2004,180,0.1,A,38.559167,-119.913333,CA
4,2004-06-28,2004,180,0.1,A,38.559167,-119.933056,CA
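
The `DATE` column relies on the `%Y%j` format, where `%j` is the day of the year. A minimal plain-Python sketch of the same conversion is shown here; the helper name `doy_to_date` is hypothetical, and the two inputs are taken from the first two rows of the output above.

```python
from datetime import datetime

def doy_to_date(year: int, day_of_year: int) -> datetime:
    # year * 1000 + day_of_year yields e.g. 2005033: the year followed by a
    # zero-padded three-digit day of year, which is what "%Y%j" expects
    return datetime.strptime(str(year * 1000 + day_of_year), "%Y%j")

first = doy_to_date(2005, 33)    # first row of the table above
second = doy_to_date(2004, 133)  # second row; 2004 is a leap year
print(first.date(), second.date())
```

Multiplying the year by 1000 is what supplies the zero padding, so a day such as 33 becomes `033` without any string formatting.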


As we are only concerned with getting wildfires from classes D, E, F, and G, we will create a temporary DataFrame in which only these classes are present.

In [25]:
g = wildfires[wildfires['FIRE_SIZE_CLASS'].isin(['D', 'E', 'F', 'G'])]['FIRE_SIZE_CLASS']

In [26]:
utils.count_percentage_df(g)

Unnamed: 0,Count,Percentage of Total
D,28419,0.52546
E,14106,0.260817
F,7786,0.143961
G,3773,0.069762


As expected, very large fires (classes F and G) are still in a relative minority. Since we want to improve the distribution of our data, and for considerations of time, we will gather 3773 observations from each class, which is the size of the smallest class (G).

In [40]:
classes = ['D', 'E', 'F', 'G']
# Use an integer dtype so that the collected index labels are not cast to floats
all_indexes = np.array([], dtype=int)

for fire_class in classes:
    tmp = wildfires[wildfires['FIRE_SIZE_CLASS']==fire_class].sample(3773)
    all_indexes = np.concatenate((all_indexes, tmp.index.values), axis=None)
    
len(all_indexes) == 3773 * 4

True
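
Because `sample` is called without a seed, re-running the cell above draws a different set of fires each time. As a hedged alternative, the sketch below uses `groupby(...).sample(...)` with a fixed `random_state` on a small made-up DataFrame; the toy values and the seed `42` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the wildfires DataFrame (values are made up)
toy = pd.DataFrame({
    "FIRE_SIZE_CLASS": list("DDDDEEEFFG"),
    "FIRE_SIZE": [150.0, 232.0, 125.0, 277.0, 400.0, 650.0, 900.0,
                  1500.0, 3000.0, 8000.0],
})

# Size of the smallest class, so every class can supply the same number of rows
n_per_class = int(toy["FIRE_SIZE_CLASS"].value_counts().min())

# random_state fixes the draw, so re-running the cell gives the same rows
balanced = toy.groupby("FIRE_SIZE_CLASS").sample(n=n_per_class, random_state=42)

# The original index labels are preserved, as in the loop above
sampled_indexes = balanced.index.to_numpy()
```

The original index labels survive the sampling, so `sampled_indexes` plays the same role as `all_indexes` in the loop above.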

As we have the indexes of the observations that were already sampled, we can remove those indexes that appear in our initial sample.

In [44]:
sample = pd.read_pickle('data/30k_wildfires_weather_emissions.pkl')
already_collected = sample['index'].values

In [45]:
all_indexes = [idx for idx in all_indexes
               if idx not in already_collected]
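
The membership test in the list comprehension scans `already_collected` once per candidate. A minimal sketch of the same filter using a set lookup follows; the function name is hypothetical, and the index values are illustrative (three of them appear later in this notebook, `42` is made up).

```python
def drop_already_collected(candidates, already_collected):
    # A set gives constant-time membership checks; candidate order is preserved
    seen = set(already_collected)
    return [idx for idx in candidates if idx not in seen]

new_indexes = drop_already_collected(
    [1059593, 780956, 1358818, 714311],  # freshly sampled indexes
    [780956, 42],                        # indexes present in the earlier sample
)
print(new_indexes)
```

For a few thousand indexes either version finishes quickly; the set mainly matters if the earlier sample grows much larger.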

In [46]:
len(all_indexes)

14863

We see, then, that after removing the wildfires that have already been sampled, we are left with 14,863 new observations, which have the following count and distribution:

In [47]:
df = wildfires.loc[all_indexes]

In [49]:
utils.count_percentage_df(df['FIRE_SIZE_CLASS'])

Unnamed: 0,Count,Percentage of Total
E,3722,0.250421
F,3720,0.250286
D,3713,0.249815
G,3708,0.249479


---
<a id='gwd'></a>
## Gathering Weather Data
Having collected the wildfire data, it is time to call the weather API to gather the weather information for each latitude and longitude over the seven days up to and including the discovery date.

In [50]:
# Necessary import
import api
import urllib.request
import json
import sys

# Aesthetic imports
from ipywidgets import IntProgress
from IPython.display import display

In [51]:
# Get API Key
key = api.API_KEY

# BaseURL
BaseURL = 'https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/'

# Data dictionary into which the weather data will be appended
data = {
    'tempmax': [],
    'avg_tempmax': [],
    'temp': [], 
    'avg_temp': [],
    'humidity': [],
    'avg_humidity': [],
    'precip': [], 
    'avg_precip':[],
    'dew': [], 
    'avg_dew' :[],
    'windspeed': [],
    'avg_windspeed': [],
    'winddir': [],
    'avg_winddir': [],
    'pressure': [],
    'avg_pressure': []
}

# Instantiate the bar
progress = IntProgress(min=0, max=df.shape[0]) 

# Display the bar
display(progress) 

# Iterate through the dataframe and calculate weather data
for index, row in df.iterrows():
    
    # Update progress bar
    progress.value += 1
    
    # Create the table variables
    end_date = row['DATE']
    start_date = end_date - pd.Timedelta(6, 'days')
    latitude = row['LATITUDE']
    longitude = row['LONGITUDE']
    
    # Create the API Query
    query = f'{latitude}%2C%20{longitude}/{start_date.date()}/{end_date.date()}?unitGroup=metric&include=days&key={api.API_KEY}&contentType=json'
    url = BaseURL + query
    
    try: 
        response = urllib.request.urlopen(url)
        # Parse the results as JSON
        string = response.read().decode('utf-8')
        jsonData = json.loads(string)
    except urllib.error.HTTPError as e:
        ErrorInfo = e.read().decode()
        print('Error code: ', e.code, ErrorInfo)
        sys.exit()

    # Create lists for the values of the last 7 days
    variables = {
        'tempmax': [],    
        'temp': [],
        'humidity': [],
        'precip': [],
        'dew': [],
        'windspeed': [],
        'winddir': [],
        'pressure': []
    }
    
    # Parse the json data
    for daily_data in jsonData['days']:
        # Iterate through the variables to get the data for the last 7 days
        for variable in variables:
            if daily_data[variable] is not None:
                variables[variable].append(daily_data[variable]) 
            else:
                variables[variable].append(np.nan)
    
    # Append all the variables to the data dictionary
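    # (the loop variable is named `column` so that it does not overwrite the API `key`)
    for column in data:
        if column.startswith('avg_'):
            metric = column[4:]
            data[column].append(utils.mean(variables[metric]))
        else:
            data[column].append(variables[column])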

IntProgress(value=0, max=14863)
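
The request URL in the loop above is assembled by hand. As a sketch of the same structure, the helper below builds it with `urllib.parse` and takes the key as an argument. The function name `build_timeline_url` and the placeholder key are hypothetical; the base URL and query parameters are the ones used in the cell above, and no request is sent.

```python
from datetime import date, timedelta
from urllib.parse import quote, urlencode

BASE_URL = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/"

def build_timeline_url(latitude, longitude, end_date, api_key, days=7):
    # The window covers `days` days ending on (and including) end_date
    start_date = end_date - timedelta(days=days - 1)
    # quote() turns "lat, lon" into "lat%2C%20lon", as in the hand-written query
    location = quote(f"{latitude}, {longitude}")
    params = urlencode({
        "unitGroup": "metric",
        "include": "days",
        "key": api_key,
        "contentType": "json",
    })
    return f"{BASE_URL}{location}/{start_date}/{end_date}?{params}"

# Coordinates and date of one fire from this sample; the key is a placeholder
url = build_timeline_url(41.363889, -88.173056, date(2003, 4, 14), api_key="YOUR_KEY")
print(url)
```

Passing the key as a parameter keeps the credential out of the notebook source, so it can be read from `api.API_KEY` or an environment variable.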

Having stored the weather information in a dictionary, we can load it into a DataFrame to inspect some of the elements.

In [52]:
# Convert dictionary into DataFrame
weather_info = pd.DataFrame(data)

In [53]:
weather_info.head()

Unnamed: 0,tempmax,avg_tempmax,temp,avg_temp,humidity,avg_humidity,precip,avg_precip,dew,avg_dew,windspeed,avg_windspeed,winddir,avg_winddir,pressure,avg_pressure
0,"[0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0]",14.057143,"[0.2, 2.4, 6.8, 10.6, 6.6, 9.7, 19.8]",8.014286,"[84.1, 66.9, 46.4, 38.8, 55.6, 56.6, 35.8]",54.885714,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[-2.1, -3.4, -5.8, -3.7, -1.8, 0.6, 3.4]",-1.828571,"[22.9, 22.0, 14.8, 22.3, 27.7, 20.5, 34.2]",23.485714,"[38.5, 31.8, 76.8, 14.3, 29.4, 154.8, 199.4]",77.857143,"[1026.7, 1026.2, 1021.9, 1014.9, 1019.4, 1023....",1021.328571
1,"[17.9, 17.2, 20.1, 16.7, 16.2, 20.7, 21.2]",18.571429,"[10.5, 10.5, 14.4, 10.0, 8.0, 12.0, 15.0]",11.485714,"[65.9, 64.0, 52.6, 54.7, 55.1, 40.5, 38.8]",53.085714,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[3.6, 3.2, 3.5, 0.5, -1.6, -2.3, 0.6]",1.071429,"[22.3, 24.1, 33.5, 29.5, 24.1, 27.7, 25.9]",26.728571,"[334.6, 129.6, 197.4, 288.6, 320.2, 181.3, 173.9]",232.228571,"[1012.9, 1014.8, 1006.8, 1014.8, 1023.7, 1021....",1016.271429
2,"[32.5, 32.8, 35.1, 35.2, 33.4, 33.8, 34.0]",33.828571,"[26.6, 26.9, 28.4, 28.0, 27.0, 27.4, 28.3]",27.514286,"[74.2, 69.4, 72.7, 73.6, 81.8, 80.3, 78.3]",75.757143,"[0.0, 0.0, 0.0, 0.15, 2.19, 0.04, 0.25]",0.375714,"[20.9, 19.8, 22.4, 22.1, 23.4, 23.3, 23.7]",22.228571,"[20.5, 14.2, 16.3, 18.4, 13.9, 22.3, 16.3]",17.414286,"[73.1, 35.5, 44.9, 72.4, 51.0, 267.8, 7.2]",78.842857,"[1017.7, 1017.0, 1017.9, 1018.8, 1017.5, 1016....",1017.457143
3,"[3.8, 4.4, 3.9, 5.7, 12.9, 13.9, 16.7]",8.757143,"[1.7, 1.7, 1.0, 0.6, 5.1, 5.4, 9.1]",3.514286,"[93.4, 85.7, 81.3, 61.8, 37.1, 44.4, 33.9]",62.514286,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[0.7, -0.5, -1.9, -6.4, -9.0, -7.4, -6.5]",-4.428571,"[16.6, 27.7, 22.3, 27.7, 27.7, 25.9, 29.5]",25.342857,"[57.0, 85.0, 46.3, 31.2, 26.4, 191.6, 198.8]",90.9,"[1018.0, 1015.1, 1012.7, 1012.6, 1013.1, 1017....",1014.914286
4,"[11.0, 11.6, 11.8, 7.7, 10.8, 20.0, 21.4]",13.471429,"[2.8, 6.2, 8.7, 3.8, 3.7, 10.7, 14.6]",7.214286,"[70.0, 57.3, 90.0, 85.5, 62.6, 44.7, 39.8]",64.271429,"[0.0, 0.0, 1.24, 0.0, 0.0, 0.0, 0.0]",0.177143,"[-3.0, -2.2, 7.1, 1.5, -3.6, -2.5, 0.8]",-0.271429,"[18.4, 23.0, 25.2, 31.8, 21.9, 39.1, 20.6]",25.714286,"[98.3, 123.6, 125.9, 299.1, 310.3, 186.3, 58.8]",171.757143,"[1022.4, 1015.2, 1009.3, 1017.6, 1026.3, 1018....",1017.971429
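
The `avg_` columns come from `utils.mean`, whose implementation is not shown in this notebook. The sketch below is one plausible version, under the assumption that a week with any missing daily value yields `NaN` rather than a partial average; the name `mean_or_nan` is hypothetical.

```python
import math

def mean_or_nan(values):
    # Assumption: an empty list or any missing daily value makes the average NaN
    if not values or any(v is None or math.isnan(v) for v in values):
        return math.nan
    return sum(values) / len(values)

# tempmax values of the first row in the table above
complete = mean_or_nan([0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0])
# A week with one missing reading
incomplete = mean_or_nan([0.7, math.nan, 13.6])
print(round(complete, 6), incomplete)
```

For the first row this reproduces the displayed `avg_tempmax` of 14.057143. Whether the real helper propagates `NaN` or skips missing values cannot be confirmed from the output alone.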


In [56]:
df_reset = df.reset_index()

Having reset the index of the wildfire DataFrame, its rows now line up with the default integer index of `weather_info`, so we can concatenate the two column-wise.

In [59]:
df_with_weather = pd.concat([df_reset, weather_info], axis=1)

In [65]:
df_with_weather.head()

Unnamed: 0,index,DATE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,tempmax,...,precip,avg_precip,dew,avg_dew,windspeed,avg_windspeed,winddir,avg_winddir,pressure,avg_pressure
0,1059593,2003-04-14,2003,104,232.0,D,41.363889,-88.173056,IL,"[0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0]",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[-2.1, -3.4, -5.8, -3.7, -1.8, 0.6, 3.4]",-1.828571,"[22.9, 22.0, 14.8, 22.3, 27.7, 20.5, 34.2]",23.485714,"[38.5, 31.8, 76.8, 14.3, 29.4, 154.8, 199.4]",77.857143,"[1026.7, 1026.2, 1021.9, 1014.9, 1019.4, 1023....",1021.328571
1,780956,1992-02-21,1992,52,150.0,D,34.587299,-95.611298,OK,"[17.9, 17.2, 20.1, 16.7, 16.2, 20.7, 21.2]",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[3.6, 3.2, 3.5, 0.5, -1.6, -2.3, 0.6]",1.071429,"[22.3, 24.1, 33.5, 29.5, 24.1, 27.7, 25.9]",26.728571,"[334.6, 129.6, 197.4, 288.6, 320.2, 181.3, 173.9]",232.228571,"[1012.9, 1014.8, 1006.8, 1014.8, 1023.7, 1021....",1016.271429
2,1358818,2010-06-15,2010,166,277.0,D,27.0012,-81.4362,FL,"[32.5, 32.8, 35.1, 35.2, 33.4, 33.8, 34.0]",...,"[0.0, 0.0, 0.0, 0.15, 2.19, 0.04, 0.25]",0.375714,"[20.9, 19.8, 22.4, 22.1, 23.4, 23.3, 23.7]",22.228571,"[20.5, 14.2, 16.3, 18.4, 13.9, 22.3, 16.3]",17.414286,"[73.1, 35.5, 44.9, 72.4, 51.0, 267.8, 7.2]",78.842857,"[1017.7, 1017.0, 1017.9, 1018.8, 1017.5, 1016....",1017.457143
3,714311,1992-04-29,1992,120,125.0,D,45.966667,-68.466668,ME,"[3.8, 4.4, 3.9, 5.7, 12.9, 13.9, 16.7]",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",0.0,"[0.7, -0.5, -1.9, -6.4, -9.0, -7.4, -6.5]",-4.428571,"[16.6, 27.7, 22.3, 27.7, 27.7, 25.9, 29.5]",25.342857,"[57.0, 85.0, 46.3, 31.2, 26.4, 191.6, 198.8]",90.9,"[1018.0, 1015.1, 1012.7, 1012.6, 1013.1, 1017....",1014.914286
4,1506649,2011-03-12,2011,71,285.0,D,36.27996,-93.94546,AR,"[11.0, 11.6, 11.8, 7.7, 10.8, 20.0, 21.4]",...,"[0.0, 0.0, 1.24, 0.0, 0.0, 0.0, 0.0]",0.177143,"[-3.0, -2.2, 7.1, 1.5, -3.6, -2.5, 0.8]",-0.271429,"[18.4, 23.0, 25.2, 31.8, 21.9, 39.1, 20.6]",25.714286,"[98.3, 123.6, 125.9, 299.1, 310.3, 186.3, 58.8]",171.757143,"[1022.4, 1015.2, 1009.3, 1017.6, 1026.3, 1018....",1017.971429
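
The reset matters because `pd.concat` with `axis=1` aligns rows on index labels rather than on position. A small sketch with two toy frames shows the difference; the frames are made up, reusing a few values from the output above.

```python
import pandas as pd

# Fires keep their original labels from the full table
fires = pd.DataFrame({"STATE": ["IL", "OK"]}, index=[1059593, 780956])
# Weather rows were built in a loop, so they carry the default index 0, 1
weather = pd.DataFrame({"avg_temp": [8.014286, 11.485714]})

# Without a reset, no labels match: the result has four half-empty rows
misaligned = pd.concat([fires, weather], axis=1)

# After reset_index(), both frames share labels 0 and 1 and line up row by row
aligned = pd.concat([fires.reset_index(), weather], axis=1)
print(misaligned.shape, aligned.shape)
```

`reset_index()` also keeps the old labels in a column named `index`, which is how fires already collected can be identified in later sampling rounds.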


We also save this DataFrame so that the data can be accessed again in the future without repeating the API calls.

In [61]:
df_with_weather.to_pickle('sample_data/large_fire_samples_with_weather.pkl')

---
<a id='ae'></a>
## Adding Emissions

As before, we also need to find the emissions data for the wildfires.

In [62]:
ch4_df = pd.read_csv('data/EDGAR Emissions/ch4.csv')
co2_df = pd.read_csv('data/EDGAR Emissions/co2.csv')
n2o_df = pd.read_csv('data/EDGAR Emissions/n2o.csv')

In [63]:
# Drop unnamed column
dfs = [ch4_df, co2_df, n2o_df]

for emissions_df in dfs:
    emissions_df.drop('Unnamed: 0', axis=1, inplace=True)

In [69]:
ch4_data = emissions_utils.GetEmissions(df_with_weather, ch4_df)
co2_data = emissions_utils.GetEmissions(df_with_weather, co2_df)
n2o_data = emissions_utils.GetEmissions(df_with_weather, n2o_df)

In [70]:
df_with_weather['ch4'] = ch4_data['emission']
df_with_weather['co2'] = co2_data['emission']
df_with_weather['n2o'] = n2o_data['emission']

In [72]:
df_with_weather.head()

Unnamed: 0,index,DATE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,tempmax,...,avg_dew,windspeed,avg_windspeed,winddir,avg_winddir,pressure,avg_pressure,ch4,co2,n2o
0,1059593,2003-04-14,2003,104,232.0,D,41.363889,-88.173056,IL,"[0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0]",...,-1.828571,"[22.9, 22.0, 14.8, 22.3, 27.7, 20.5, 34.2]",23.485714,"[38.5, 31.8, 76.8, 14.3, 29.4, 154.8, 199.4]",77.857143,"[1026.7, 1026.2, 1021.9, 1014.9, 1019.4, 1023....",1021.328571,9.967485e-11,2.751741e-08,9.263724e-12
1,780956,1992-02-21,1992,52,150.0,D,34.587299,-95.611298,OK,"[17.9, 17.2, 20.1, 16.7, 16.2, 20.7, 21.2]",...,1.071429,"[22.3, 24.1, 33.5, 29.5, 24.1, 27.7, 25.9]",26.728571,"[334.6, 129.6, 197.4, 288.6, 320.2, 181.3, 173.9]",232.228571,"[1012.9, 1014.8, 1006.8, 1014.8, 1023.7, 1021....",1016.271429,2.820779e-11,1.685158e-09,1.506214e-12
2,1358818,2010-06-15,2010,166,277.0,D,27.0012,-81.4362,FL,"[32.5, 32.8, 35.1, 35.2, 33.4, 33.8, 34.0]",...,22.228571,"[20.5, 14.2, 16.3, 18.4, 13.9, 22.3, 16.3]",17.414286,"[73.1, 35.5, 44.9, 72.4, 51.0, 267.8, 7.2]",78.842857,"[1017.7, 1017.0, 1017.9, 1018.8, 1017.5, 1016....",1017.457143,7.241625e-11,4.184836e-09,2.445356e-12
3,714311,1992-04-29,1992,120,125.0,D,45.966667,-68.466668,ME,"[3.8, 4.4, 3.9, 5.7, 12.9, 13.9, 16.7]",...,-4.428571,"[16.6, 27.7, 22.3, 27.7, 27.7, 25.9, 29.5]",25.342857,"[57.0, 85.0, 46.3, 31.2, 26.4, 191.6, 198.8]",90.9,"[1018.0, 1015.1, 1012.7, 1012.6, 1013.1, 1017....",1014.914286,8.259238e-12,4.874638e-09,1.488044e-12
4,1506649,2011-03-12,2011,71,285.0,D,36.27996,-93.94546,AR,"[11.0, 11.6, 11.8, 7.7, 10.8, 20.0, 21.4]",...,-0.271429,"[18.4, 23.0, 25.2, 31.8, 21.9, 39.1, 20.6]",25.714286,"[98.3, 123.6, 125.9, 299.1, 310.3, 186.3, 58.8]",171.757143,"[1022.4, 1015.2, 1009.3, 1017.6, 1026.3, 1018....",1017.971429,2.110244e-10,6.19839e-09,3.427687e-12


In [73]:
df_with_weather.to_pickle('sample_data/large_fire_samples_with_weather_emissions.pkl')

---
<a id='dc'></a>
## Data Cleaning

This step replicates the cleaning process that we conducted earlier: wildfires for which insufficient weather data could be collected are removed.

In [4]:
df_with_weather = pd.read_pickle('sample_data/large_fire_samples_with_weather_emissions.pkl')

In [6]:
df_raw = df_with_weather.copy()

In [7]:
df_raw.head()

Unnamed: 0,index,DATE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,tempmax,...,avg_dew,windspeed,avg_windspeed,winddir,avg_winddir,pressure,avg_pressure,ch4,co2,n2o
0,1059593,2003-04-14,2003,104,232.0,D,41.363889,-88.173056,IL,"[0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0]",...,-1.828571,"[22.9, 22.0, 14.8, 22.3, 27.7, 20.5, 34.2]",23.485714,"[38.5, 31.8, 76.8, 14.3, 29.4, 154.8, 199.4]",77.857143,"[1026.7, 1026.2, 1021.9, 1014.9, 1019.4, 1023....",1021.328571,9.967485e-11,2.751741e-08,9.263724e-12
1,780956,1992-02-21,1992,52,150.0,D,34.587299,-95.611298,OK,"[17.9, 17.2, 20.1, 16.7, 16.2, 20.7, 21.2]",...,1.071429,"[22.3, 24.1, 33.5, 29.5, 24.1, 27.7, 25.9]",26.728571,"[334.6, 129.6, 197.4, 288.6, 320.2, 181.3, 173.9]",232.228571,"[1012.9, 1014.8, 1006.8, 1014.8, 1023.7, 1021....",1016.271429,2.820779e-11,1.685158e-09,1.506214e-12
2,1358818,2010-06-15,2010,166,277.0,D,27.0012,-81.4362,FL,"[32.5, 32.8, 35.1, 35.2, 33.4, 33.8, 34.0]",...,22.228571,"[20.5, 14.2, 16.3, 18.4, 13.9, 22.3, 16.3]",17.414286,"[73.1, 35.5, 44.9, 72.4, 51.0, 267.8, 7.2]",78.842857,"[1017.7, 1017.0, 1017.9, 1018.8, 1017.5, 1016....",1017.457143,7.241625e-11,4.184836e-09,2.445356e-12
3,714311,1992-04-29,1992,120,125.0,D,45.966667,-68.466668,ME,"[3.8, 4.4, 3.9, 5.7, 12.9, 13.9, 16.7]",...,-4.428571,"[16.6, 27.7, 22.3, 27.7, 27.7, 25.9, 29.5]",25.342857,"[57.0, 85.0, 46.3, 31.2, 26.4, 191.6, 198.8]",90.9,"[1018.0, 1015.1, 1012.7, 1012.6, 1013.1, 1017....",1014.914286,8.259238e-12,4.874638e-09,1.488044e-12
4,1506649,2011-03-12,2011,71,285.0,D,36.27996,-93.94546,AR,"[11.0, 11.6, 11.8, 7.7, 10.8, 20.0, 21.4]",...,-0.271429,"[18.4, 23.0, 25.2, 31.8, 21.9, 39.1, 20.6]",25.714286,"[98.3, 123.6, 125.9, 299.1, 310.3, 186.3, 58.8]",171.757143,"[1022.4, 1015.2, 1009.3, 1017.6, 1026.3, 1018....",1017.971429,2.110244e-10,6.19839e-09,3.427687e-12


In [8]:
df_raw.isna().sum()

index                 0
DATE                  0
FIRE_YEAR             0
DISCOVERY_DOY         0
FIRE_SIZE             0
FIRE_SIZE_CLASS       0
LATITUDE              0
LONGITUDE             0
STATE                 0
tempmax               0
avg_tempmax           1
temp                  0
avg_temp            996
humidity              0
avg_humidity       1036
precip                0
avg_precip         2983
dew                   0
avg_dew            1036
windspeed             0
avg_windspeed       994
winddir               0
avg_winddir        2809
pressure              0
avg_pressure       2915
ch4                   0
co2                   0
n2o                   0
dtype: int64

In [9]:
remove_indexes = preparation_utils.GetNullListIndex(df_raw)

In [10]:
len(remove_indexes)

5980
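
`preparation_utils.GetNullListIndex` is not shown here. As a rough sketch of what such a helper might do, the function below flags every row whose daily lists contain at least one missing value. The function name, the row layout (a list of dicts) and the sample values are all assumptions for illustration.

```python
import math

def rows_with_missing_values(rows, list_columns):
    # Return the positions of rows where any daily list has a missing entry
    bad = []
    for position, row in enumerate(rows):
        for column in list_columns:
            if any(v is None or math.isnan(v) for v in row[column]):
                bad.append(position)
                break  # one incomplete list is enough to flag the row
    return bad

# Made-up rows: the second has a gap in its precipitation readings
rows = [
    {"temp": [0.2, 2.4, 6.8], "precip": [0.0, 0.0, 0.0]},
    {"temp": [10.5, 10.5, 14.4], "precip": [0.0, math.nan, 0.0]},
]
remove_positions = rows_with_missing_values(rows, ["temp", "precip"])
print(remove_positions)
```

In the notebook, the returned labels are passed to `DataFrame.drop(index=...)`, which removes the flagged rows in one step.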

In [12]:
df_cleaned = df_raw.drop(index=remove_indexes)

In [13]:
df_cleaned.isna().sum()

index              0
DATE               0
FIRE_YEAR          0
DISCOVERY_DOY      0
FIRE_SIZE          0
FIRE_SIZE_CLASS    0
LATITUDE           0
LONGITUDE          0
STATE              0
tempmax            0
avg_tempmax        0
temp               0
avg_temp           0
humidity           0
avg_humidity       0
precip             0
avg_precip         0
dew                0
avg_dew            0
windspeed          0
avg_windspeed      0
winddir            0
avg_winddir        0
pressure           0
avg_pressure       0
ch4                0
co2                0
n2o                0
dtype: int64

In [14]:
utils.count_percentage_df(df_cleaned['FIRE_SIZE_CLASS'])

Unnamed: 0,Count,Percentage of Total
D,2429,0.273444
E,2354,0.265001
F,2166,0.243837
G,1934,0.217719


In [15]:
df_cleaned.head()

Unnamed: 0,index,DATE,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,tempmax,...,avg_dew,windspeed,avg_windspeed,winddir,avg_winddir,pressure,avg_pressure,ch4,co2,n2o
0,1059593,2003-04-14,2003,104,232.0,D,41.363889,-88.173056,IL,"[0.7, 6.1, 13.6, 20.6, 9.4, 19.0, 29.0]",...,-1.828571,"[22.9, 22.0, 14.8, 22.3, 27.7, 20.5, 34.2]",23.485714,"[38.5, 31.8, 76.8, 14.3, 29.4, 154.8, 199.4]",77.857143,"[1026.7, 1026.2, 1021.9, 1014.9, 1019.4, 1023....",1021.328571,9.967485e-11,2.751741e-08,9.263724e-12
1,780956,1992-02-21,1992,52,150.0,D,34.587299,-95.611298,OK,"[17.9, 17.2, 20.1, 16.7, 16.2, 20.7, 21.2]",...,1.071429,"[22.3, 24.1, 33.5, 29.5, 24.1, 27.7, 25.9]",26.728571,"[334.6, 129.6, 197.4, 288.6, 320.2, 181.3, 173.9]",232.228571,"[1012.9, 1014.8, 1006.8, 1014.8, 1023.7, 1021....",1016.271429,2.820779e-11,1.685158e-09,1.506214e-12
2,1358818,2010-06-15,2010,166,277.0,D,27.0012,-81.4362,FL,"[32.5, 32.8, 35.1, 35.2, 33.4, 33.8, 34.0]",...,22.228571,"[20.5, 14.2, 16.3, 18.4, 13.9, 22.3, 16.3]",17.414286,"[73.1, 35.5, 44.9, 72.4, 51.0, 267.8, 7.2]",78.842857,"[1017.7, 1017.0, 1017.9, 1018.8, 1017.5, 1016....",1017.457143,7.241625e-11,4.184836e-09,2.445356e-12
3,714311,1992-04-29,1992,120,125.0,D,45.966667,-68.466668,ME,"[3.8, 4.4, 3.9, 5.7, 12.9, 13.9, 16.7]",...,-4.428571,"[16.6, 27.7, 22.3, 27.7, 27.7, 25.9, 29.5]",25.342857,"[57.0, 85.0, 46.3, 31.2, 26.4, 191.6, 198.8]",90.9,"[1018.0, 1015.1, 1012.7, 1012.6, 1013.1, 1017....",1014.914286,8.259238e-12,4.874638e-09,1.488044e-12
4,1506649,2011-03-12,2011,71,285.0,D,36.27996,-93.94546,AR,"[11.0, 11.6, 11.8, 7.7, 10.8, 20.0, 21.4]",...,-0.271429,"[18.4, 23.0, 25.2, 31.8, 21.9, 39.1, 20.6]",25.714286,"[98.3, 123.6, 125.9, 299.1, 310.3, 186.3, 58.8]",171.757143,"[1022.4, 1015.2, 1009.3, 1017.6, 1026.3, 1018....",1017.971429,2.110244e-10,6.19839e-09,3.427687e-12


---
<a id='caf'></a>
## Calculating Additional Features

In [16]:
df_cleaned = preparation_utils.GenerateFeatures(df_cleaned)

In [35]:
df_cleaned = preparation_utils.RemoveColumns(df_cleaned)

In [36]:
df_cleaned.head()

Unnamed: 0,FIRE_YEAR,DISCOVERY_DOY,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,avg_tempmax,avg_temp,avg_humidity,...,precip_variance,precip_delta,dew_variance,dew_delta,windspeed_variance,windspeed_delta,winddir_variance,winddir_delta,pressure_variance,pressure_delta
0,2003,104,232.0,D,41.363889,-88.173056,IL,14.057143,8.014286,54.885714,...,0.0,0.0,7.864898,5.5,31.552653,11.3,4393.119592,160.9,16.827755,-9.5
1,1992,52,150.0,D,34.587299,-95.611298,OK,18.571429,11.485714,53.085714,...,0.0,0.0,5.124898,-3.0,12.656327,3.6,5591.173469,-160.7,28.450612,6.2
2,2010,166,277.0,D,27.0012,-81.4362,FL,33.828571,27.514286,75.757143,...,0.556367,0.25,1.770612,2.8,8.504082,-4.2,6388.276735,-65.9,0.568163,-0.6
3,1992,120,125.0,D,45.966667,-68.466668,ME,8.757143,3.514286,62.514286,...,0.0,0.0,12.290612,-7.2,17.136735,12.9,4668.631429,141.8,4.49551,-2.9
4,2011,71,285.0,D,36.27996,-93.94546,AR,13.471429,7.214286,64.271429,...,0.188278,0.0,12.262041,3.8,45.43551,2.2,8314.153878,-39.5,24.979184,-6.0
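
The internals of `preparation_utils.GenerateFeatures` are not shown. The displayed values are nonetheless consistent with two simple definitions: `_variance` as the population variance of the seven daily values, and `_delta` as the last value minus the first. The sketch below checks that reading against the `dew` list of the first row; the function name `weekly_features` is hypothetical.

```python
import statistics

def weekly_features(values):
    # Population variance (divides by n) and last-minus-first change
    variance = statistics.pvariance(values)
    delta = values[-1] - values[0]
    return variance, delta

# dew values of the first row (fire index 1059593)
dew = [-2.1, -3.4, -5.8, -3.7, -1.8, 0.6, 3.4]
dew_variance, dew_delta = weekly_features(dew)
print(round(dew_variance, 6), round(dew_delta, 1))
```

This gives 7.864898 and 5.5, matching `dew_variance` and `dew_delta` in the first row of the table above. It confirms the formulas only for that row, not how the helper is actually implemented.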


In [37]:
df_cleaned.to_pickle('sample_data/large_fires_cleaned.pkl')

We will now combine the files that we have created in a separate notebook and continue working from there.