In [14]:
import json
import pandas as pd
import geopandas as gpd
from datetime import date

# Background

This notebook takes a closer look at missing data in the HMS smoke dataset. It addresses the following question:
1. What does it mean for a shapefile to be empty versus missing for a given date?

The dataset reviewed here consists of HMS smoke data collected from https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/ using code/1_collect_HMS_daily_shapes.py. It was altered in the following ways:
* "year" and "date" columns were added, with values stored as strings so that the data can be saved as a shapefile.
* Columns were added to days that previously lacked them. For example, 2005-08-05 originally had no Density column; it now has a Density column filled with NA values. This is a side effect of concatenating all daily dataframes.
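
The NA fill described in the second bullet is standard `pd.concat` behaviour when the input frames have different columns. A minimal sketch with made-up rows (the `Start` column, the later date, and all values are illustrative assumptions, not taken from the HMS data):

```python
import pandas as pd

# Hypothetical stand-ins for two daily tables; the values are invented for illustration.
early_day = pd.DataFrame({"Start": ["1200"], "date": ["2005-08-05"]})  # no Density column
later_day = pd.DataFrame({"Start": ["1330"], "Density": [5.0], "date": ["2011-06-01"]})

# Concatenation keeps the union of the columns; cells a frame lacks become NaN.
combined = pd.concat([early_day, later_day], ignore_index=True)
print(combined)
```

The row for the earlier day ends up with a Density value of NaN, which is why the original column set of each day has to be looked up separately.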

The original number of rows, column names, and CRS of each day's shapefile are recorded in the metadata CSV.

## Load data

In [15]:
# read in missing days file
f = "../data/hms_2005_2021_absent_dates.json"
with open(f, "r") as file:
    absent = json.load(file)
    
absent

{'first_date': '2005-08-05',
 'last_date': '2021-12-31',
 'missing': ['2005-08-09',
  '2005-08-10',
  '2006-03-27',
  '2006-04-01',
  '2006-07-14',
  '2006-07-15',
  '2006-11-04',
  '2007-03-31',
  '2007-08-21',
  '2008-01-23',
  '2008-01-24',
  '2008-03-05',
  '2009-01-30',
  '2009-04-08',
  '2012-10-07',
  '2015-06-02',
  '2015-08-20',
  '2016-03-06',
  '2016-11-12',
  '2017-04-27',
  '2017-05-31',
  '2017-06-01',
  '2017-06-22',
  '2017-07-18',
  '2019-07-10',
  '2019-08-10',
  '2020-07-08'],
 'no_entries': ['2005-12-18',
  '2006-01-22',
  '2006-01-29',
  '2006-06-30',
  '2006-12-24',
  '2007-01-08',
  '2007-02-03',
  '2007-04-10',
  '2007-12-12',
  '2007-12-25',
  '2008-06-05',
  '2008-12-15',
  '2008-12-25',
  '2008-12-26',
  '2009-06-05',
  '2009-10-13',
  '2009-12-15',
  '2010-02-09',
  '2010-02-23',
  '2010-12-26',
  '2011-11-01',
  '2011-12-25',
  '2012-04-24',
  '2012-12-16',
  '2012-12-23',
  '2013-01-04',
  '2013-01-10',
  '2013-11-27',
  '2013-11-28',
  '2013-12-01',
  '20
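
To put the two lists in context, each can be expressed as a share of all days between `first_date` and `last_date`. The sketch below uses a shortened stand-in for `absent` (the name `absent_example` and the shortened `last_date` are assumptions made for illustration; only the key names follow the JSON shown above):

```python
from collections import Counter
from datetime import date

# Stand-in with the same keys as the loaded JSON; the lists and date range are shortened.
absent_example = {
    "first_date": "2005-08-05",
    "last_date": "2005-12-31",
    "missing": ["2005-08-09", "2005-08-10"],
    "no_entries": ["2005-12-18"],
}

first = date.fromisoformat(absent_example["first_date"])
last = date.fromisoformat(absent_example["last_date"])
n_days = (last - first).days + 1  # inclusive of both endpoints

for key in ("missing", "no_entries"):
    dates = absent_example[key]
    per_year = Counter(d[:4] for d in dates)  # counts keyed by year string
    share = len(dates) / n_days
    print(f"{key}: {len(dates)} of {n_days} days ({share:.1%}), by year: {dict(per_year)}")
```

Running the same loop on the full `absent` dictionary would show whether missing or empty days cluster in particular years.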

## Question 1: What does it mean for a daily shapefile to be missing, as opposed to empty?

To get a clearer understanding of what missing versus empty shapefiles represent, we compare the missing and empty dates against the dates of California wildfires through 2017.
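
One way to carry out this check is to count, for each absent date, how many fires were burning on that day. The sketch below uses invented fire records and a hypothetical helper, `count_active_fires`; it assumes a table with `start` and `end` datetime columns and treats both endpoints as inclusive.

```python
import pandas as pd

def count_active_fires(dates, fires):
    """Return a Series mapping each date string to the number of fires active on it.

    `fires` needs datetime columns `start` and `end` (both treated as inclusive).
    Rows with a missing start or end never match, because comparisons with NaT are False.
    """
    dates = pd.to_datetime(pd.Series(dates))
    counts = [
        int(((fires["start"] <= d) & (fires["end"] >= d)).sum())
        for d in dates
    ]
    return pd.Series(counts, index=dates.dt.strftime("%Y-%m-%d"))

# Invented example records, not taken from the wildfire data.
toy_fires = pd.DataFrame({
    "name": ["A", "B"],
    "start": pd.to_datetime(["2005-08-01", "2005-08-09"]),
    "end": pd.to_datetime(["2005-08-09", "2005-08-12"]),
})

active = count_active_fires(["2005-08-09", "2005-08-10", "2005-12-18"], toy_fires)
print(active)
```

A nonzero count for a missing or empty date would indicate that fires were active even though no smoke polygons were recorded for that day.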

In [22]:
f = "https://gist.githubusercontent.com/lazarogamio/d64e0d04b1ce1f2a3bd08db7526fa632/raw/3de009f5e5bdc9a86489f7ec9e181ca574a3021d/axios-calfire-wildfire-data.csv"
fire_df = pd.read_csv(f, usecols=['name','start','end'])
fire_df['start'] = pd.to_datetime(fire_df['start'])
fire_df['end'] = pd.to_datetime(fire_df['end'])
fire_df = fire_df.loc[fire_df['start'].dt.year > 2004]
fire_df

Unnamed: 0,name,start,end
472,URUTTA,2005-05-01,2005-09-15
473,CLAYTON,2005-05-15,2005-05-16
474,MENDOTA,2005-05-17,2005-05-17
475,BLM #1(APRICOT),2005-05-17,2005-05-17
476,JOHNSON RANCH VMP,2005-05-19,2005-06-01
...,...,...,...
1459,THOMAS,2017-12-04,2017-12-15
1460,CREEK,2017-12-05,2017-12-15
1461,RYE,2017-12-05,2017-12-12
1462,Skirball,2017-12-06,2017-12-11


In [None]:

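# produce the inclusive range of dates covered by each wildfire,
# skipping rows that lack a start or end date
dated = fire_df.dropna(subset=['start', 'end'])
wildfire_periods = [pd.date_range(s, e) for s, e in zip(dated['start'], dated['end'])]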