# Summary 
In this notebook, I fill in missing county information in the wildfires database using each fire's latitude and longitude. I also derive the 'Month' and 'Day of Week' features from the discovery date.

# Table of Contents
1. [Date Features](#date_feats)
2. [Fill County Information](#county)
    2.1. [Demo function](#demo)
    2.2. [Join county reference table](#county_ref)
3. [Next Steps](#next)

# Imports 

In [1]:
import pandas as pd
pd.set_option('display.max_columns',500) # avoid truncation

import pickle
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# For FCC API
import requests
from bs4 import BeautifulSoup
import lxml


# Data
Picking up from "Fires2 1 Exploratory Data Analysis"

In [2]:
fires = pd.read_pickle('lean_fires.pkl')
print(fires.shape)
fires.head()
# still the subset but easy to apply to whole dataset

(50000, 13)


Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause
334821,343186,2001,2452135.5,0.5,B,39.0499,-114.8342,NV,White Pine,33.0,White Pine,14.0,Lightning
1674798,201838989,2013,2456426.5,1.0,B,32.606075,-87.309651,AL,Perry,105.0,Perry,,Accident
1692175,201862585,2013,2456540.5,1.0,B,31.666004,-96.449247,TX,Limestone,293.0,Limestone,,Other
1135865,1385051,2008,2454679.5,0.1,A,33.953889,-116.496944,CA,,,,,Other
130533,131832,2000,2451723.5,1.5,B,37.923056,-120.101111,CA,,,,17.0,Lightning


<a id="date_feats"></a>
# 1. Date Features

In [3]:
# using some pandas datetime attributes
fires['DISCOVERY_DATE'] = pd.to_datetime(fires['DISCOVERY_DATE'], unit='D', origin='julian') #julian dates
fires['Month'] = fires['DISCOVERY_DATE'].dt.month
fires['DayofWeek'] = fires['DISCOVERY_DATE'].dt.day_name() # weekday_name was removed in pandas 1.0
fires.head()

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek
334821,343186,2001,2001-08-14,0.5,B,39.0499,-114.8342,NV,White Pine,33.0,White Pine,14.0,Lightning,8,Tuesday
1674798,201838989,2013,2013-05-14,1.0,B,32.606075,-87.309651,AL,Perry,105.0,Perry,,Accident,5,Tuesday
1692175,201862585,2013,2013-09-05,1.0,B,31.666004,-96.449247,TX,Limestone,293.0,Limestone,,Other,9,Thursday
1135865,1385051,2008,2008-08-01,0.1,A,33.953889,-116.496944,CA,,,,,Other,8,Friday
130533,131832,2000,2000-06-28,1.5,B,37.923056,-120.101111,CA,,,,17.0,Lightning,6,Wednesday
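
As a quick sanity check on the Julian-date conversion, the sketch below converts the raw `DISCOVERY_DATE` values of the first two rows in two ways: with `origin='julian'`, and by subtracting the Julian day of the Unix epoch (2440587.5) by hand. The variable names are illustrative only and are not part of the notebook's pipeline.

```python
import pandas as pd

# Raw Julian day values copied from the first two rows of the table
julian_days = pd.Series([2452135.5, 2456426.5])

# Route 1: let pandas interpret the floats as Julian days
via_origin = pd.to_datetime(julian_days, unit='D', origin='julian')

# Route 2: subtract the Julian day of 1970-01-01 00:00 and add the remainder to the epoch
UNIX_EPOCH_JD = 2440587.5
via_offset = pd.Timestamp('1970-01-01') + pd.to_timedelta(julian_days - UNIX_EPOCH_JD, unit='D')

print(via_origin.dt.strftime('%Y-%m-%d').tolist())
print(via_origin.dt.month.tolist())
print(via_origin.dt.day_name().tolist())
```

Both routes should agree, and the dates should match the converted `DISCOVERY_DATE` column shown in the output table.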


<a id="county"></a>
# 2. Fill County Information
The Federal Communications Commission (FCC) provides an API that looks up county and state FIPS codes from a latitude/longitude pair. To be considerate to the API provider, I use the API only for rows whose county information is missing or numeric. For new fires, it is easiest to use the API to fill in the county information as needed.

In [4]:
def fill_county(county, lat, lon):
    '''***REQUIRES Requests and BeautifulSoup packages (lxml provides the XML parser)*** 
    `import requests
    from bs4 import BeautifulSoup`
    Inputs:  
    "county" = name of a county (str, or None/NaN if missing)
    "lat" = latitude in degrees (float)
    "lon" = longitude in degrees (float)
    Output: pd.Series of length 2
    If county is available, returns the lowercased name and the original name
    If unavailable, returns the lowercased county name and FIPS code from the API'''
    
    # Replace a missing county (None or NaN) with a placeholder that triggers the lookup
    if pd.isna(county): 
        county = '0'
    county = str(county)
    
    # Short or all-digit entries are not usable county names, so fill from the API
    if len(county) < 3 or county.isdigit():
        try: # in case the API response is malformed or has no County tag
            url = 'https://geo.fcc.gov/api/census/block/find?latitude={}&longitude={}&showall=true&format=xml'\
                    .format(lat, lon)
            soup = BeautifulSoup(requests.get(url).text, features="xml")
            tag = soup.find('County')
            return pd.Series((tag['name'].lower(), tag['FIPS']))
        except (ValueError, TypeError, KeyError) as err:
            # Report the failure and fall back to the input value
            print('%s for (%s, %s)' % (type(err).__name__, lat, lon))
            return pd.Series((county, county))
    else:
        return pd.Series((county.lower(), county)) 
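
Before applying the function to many rows, it can help to estimate how many API calls it would trigger. The sketch below mirrors the function's check (missing, shorter than three characters, or all digits) as a vectorized mask. `needs_lookup` is a hypothetical helper, not part of the original notebook, and it treats NaN as missing in addition to None.

```python
import pandas as pd

def needs_lookup(county: pd.Series) -> pd.Series:
    """Boolean mask: True where the entry is missing, shorter than 3 characters, or all digits."""
    # Use the same '0' placeholder for nulls that fill_county uses
    as_text = county.fillna('0').astype(str)
    return (as_text.str.len() < 3) | as_text.str.isdigit()

# Toy values modeled on the COUNTY column in the sample rows
sample = pd.Series(['White Pine', None, '81', 'CRAWFORD', float('nan')])
mask = needs_lookup(sample)
print(mask.tolist())
print(f"{mask.sum()} of {len(mask)} rows would trigger an API call")
```

On the real data, something like `needs_lookup(fires['COUNTY']).sum()` would give the request count up front.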


<a id="demo"></a>
### 2.1. Demo with a fire with known location and small subset
More extensive debugging and testing of this function is in the older notebooks (see "Fires cleaning"). I chose not to call the API for every row of the full data set: 1.88 million rows is a lot to look up, and it is good to be respectful of the service.  

The fires that already have location info are merged with another reference table to get the complete FIPS code. 

`fill_county` was applied to the full dataset, but that run is not shown here.
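
One possible way to reduce requests further when running a lookup at scale is to cache results per rounded coordinate pair and pause before each real request. This is a sketch only and not what was done in this notebook: `make_cached_lookup`, `fetch`, the rounding precision, and the delay are all hypothetical, and a stand-in replaces the API call so the example runs offline.

```python
import time

def make_cached_lookup(fetch, precision=3, delay=0.0):
    """Wrap fetch(lat, lon) so repeated nearby coordinates reuse one result.

    fetch     -- any callable that performs the real lookup (hypothetical)
    precision -- decimal places used to round coordinates into a cache key
    delay     -- seconds to sleep before each real (uncached) call
    """
    cache = {}

    def lookup(lat, lon):
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            time.sleep(delay)          # be gentle with the remote service
            cache[key] = fetch(lat, lon)
        return cache[key]

    return lookup

# Stand-in for the real API call; records how often it is actually invoked
calls = []
def fake_fetch(lat, lon):
    calls.append((lat, lon))
    return ('riverside', '06065')

lookup = make_cached_lookup(fake_fetch, precision=3)
first = lookup(33.953889, -116.496944)
second = lookup(33.9539, -116.4969)    # rounds to the same key, so no second fetch
print(first, second, len(calls))
```

Rounding trades accuracy for fewer calls: points close to a county line can be assigned their neighbor's county, so the precision would need to be chosen with that in mind.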

In [5]:
fires.iloc[0]

FOD_ID                          343186
FIRE_YEAR                         2001
DISCOVERY_DATE     2001-08-14 00:00:00
FIRE_SIZE                          0.5
FIRE_SIZE_CLASS                      B
LATITUDE                       39.0499
LONGITUDE                     -114.834
STATE                               NV
COUNTY                      White Pine
FIPS_CODE                          033
FIPS_NAME                   White Pine
hr                                  14
Cause                        Lightning
Month                                8
DayofWeek                      Tuesday
Name: 334821, dtype: object

In [6]:
fill_county(fires.iloc[0]['COUNTY'], fires.iloc[0]['LATITUDE'], fires.iloc[0]['LONGITUDE'])

0    white pine
1    White Pine
dtype: object

In [7]:
# create minimal subset to demo county lookup
subset = fires.head(10).copy() # copy so new columns can be added without a SettingWithCopyWarning
subset

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek
334821,343186,2001,2001-08-14,0.5,B,39.0499,-114.8342,NV,White Pine,33.0,White Pine,14.0,Lightning,8,Tuesday
1674798,201838989,2013,2013-05-14,1.0,B,32.606075,-87.309651,AL,Perry,105.0,Perry,,Accident,5,Tuesday
1692175,201862585,2013,2013-09-05,1.0,B,31.666004,-96.449247,TX,Limestone,293.0,Limestone,,Other,9,Thursday
1135865,1385051,2008,2008-08-01,0.1,A,33.953889,-116.496944,CA,,,,,Other,8,Friday
130533,131832,2000,2000-06-28,1.5,B,37.923056,-120.101111,CA,,,,17.0,Lightning,6,Wednesday
197691,200351,1993,1993-04-09,0.5,B,35.1167,-107.3673,NM,,,,13.0,Accident,4,Friday
480083,516759,2008,2008-08-01,1.0,B,31.66277,-93.48503,LA,Sabine,85.0,Sabine,,Other,8,Friday
170847,172525,2004,2004-09-20,0.1,A,46.015,-114.223333,MT,81,81.0,Ravalli,12.0,Other,9,Monday
1452498,20004335,2002,2002-03-31,5.0,B,37.70822,-91.37444,MO,CRAWFORD,55.0,Crawford,,Accident,3,Sunday
1774558,300122136,2014,2014-02-20,4.0,B,35.27409,-93.29173,AR,LOGAN,83.0,Logan,15.0,Other,2,Thursday


In [8]:
%time subset[['COUNTY2', 'COUNTY_ID']] = subset.apply(lambda row: fill_county(row['COUNTY'], row['LATITUDE'], row['LONGITUDE']), axis=1)

CPU times: user 140 ms, sys: 8.57 ms, total: 149 ms
Wall time: 2.4 s




<a id="county_ref"></a>
### 2.2. Join reference table
This demo uses only 10 rows, but it shows some of the challenges faced. The same series of steps was applied to the whole data set to fill in county information and get the 'COUNTY_ID' column that identifies each county. 

In [9]:
cols = ['STATE', 'StateID', 'FIPS_CODE', 'COUNTY2']
#renamed cols for joining
url ='https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt'
county_list = pd.read_csv(url, header=None, sep=',', usecols=[0,1,2,3], names=cols, dtype='str')

In [10]:
county_list.head()

Unnamed: 0,STATE,StateID,FIPS_CODE,COUNTY2
0,AL,1,1,Autauga County
1,AL,1,3,Baldwin County
2,AL,1,5,Barbour County
3,AL,1,7,Bibb County
4,AL,1,9,Blount County


In [11]:
%%time 
# remove obvious extraneous word from county name 
county_list['COUNTY2'] = (county_list['COUNTY2'].str.replace(' County', '', regex=True)
                                                .str.lower())

CPU times: user 3.4 ms, sys: 233 µs, total: 3.63 ms
Wall time: 3.55 ms


In [12]:
# concat to get 5 digit code, which is the 'COUNTY_ID' column from the fill_county function
county_list['FullID'] = county_list['StateID'] + county_list['FIPS_CODE']

In [13]:
# drop FIPS_CODE to avoid duplication with original table
county_list = county_list.drop(columns='FIPS_CODE')

In [14]:
# Join 
%time merged = subset.merge(county_list, how='left', on=['STATE','COUNTY2'])

CPU times: user 6.1 ms, sys: 0 ns, total: 6.1 ms
Wall time: 53.3 ms
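
A left join leaves the reference columns null wherever the county names do not match, which is easy to miss. The sketch below uses toy frames (not the real data) to show how `indicator=True` flags those rows. The frame names are illustrative; the county names and codes are modeled on rows that appear in this notebook.

```python
import pandas as pd

# Toy stand-ins for `subset` and `county_list`
fires_demo = pd.DataFrame({
    'STATE': ['NV', 'LA'],
    'COUNTY2': ['white pine', 'sabine'],
})
ref_demo = pd.DataFrame({
    'STATE': ['NV', 'LA'],
    'COUNTY2': ['white pine', 'sabine parish'],  # ' County' was stripped, ' Parish' was not
    'FullID': ['32033', '22085'],
})

merged_demo = fires_demo.merge(ref_demo, how='left', on=['STATE', 'COUNTY2'], indicator=True)

# Rows found only in the left table did not match any reference county
unmatched = merged_demo[merged_demo['_merge'] == 'left_only']
print(unmatched[['STATE', 'COUNTY2']])
```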


In [15]:
 # these we already looked up via coordinates, hence IDed
IDed = merged[merged['COUNTY_ID'].str.isnumeric()]
IDed

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek,COUNTY2,COUNTY_ID,StateID,FullID
3,1385051,2008,2008-08-01,0.1,A,33.953889,-116.496944,CA,,,,,Other,8,Friday,riverside,6065,6,6065
4,131832,2000,2000-06-28,1.5,B,37.923056,-120.101111,CA,,,,17.0,Lightning,6,Wednesday,tuolumne,6109,6,6109
5,200351,1993,1993-04-09,0.5,B,35.1167,-107.3673,NM,,,,13.0,Accident,4,Friday,cibola,35006,35,35006
7,172525,2004,2004-09-20,0.1,A,46.015,-114.223333,MT,81.0,81.0,Ravalli,12.0,Other,9,Monday,ravalli,30081,30,30081


In [16]:
#still need to fill IDs
noID = merged[~merged['COUNTY_ID'].str.isnumeric()].copy() # copy so COUNTY_ID can be reassigned safely
noID

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek,COUNTY2,COUNTY_ID,StateID,FullID
0,343186,2001,2001-08-14,0.5,B,39.0499,-114.8342,NV,White Pine,33,White Pine,14.0,Lightning,8,Tuesday,white pine,White Pine,32.0,32033.0
1,201838989,2013,2013-05-14,1.0,B,32.606075,-87.309651,AL,Perry,105,Perry,,Accident,5,Tuesday,perry,Perry,1.0,1105.0
2,201862585,2013,2013-09-05,1.0,B,31.666004,-96.449247,TX,Limestone,293,Limestone,,Other,9,Thursday,limestone,Limestone,48.0,48293.0
6,516759,2008,2008-08-01,1.0,B,31.66277,-93.48503,LA,Sabine,85,Sabine,,Other,8,Friday,sabine,Sabine,,
8,20004335,2002,2002-03-31,5.0,B,37.70822,-91.37444,MO,CRAWFORD,55,Crawford,,Accident,3,Sunday,crawford,CRAWFORD,29.0,29055.0
9,300122136,2014,2014-02-20,4.0,B,35.27409,-93.29173,AR,LOGAN,83,Logan,15.0,Other,2,Thursday,logan,LOGAN,5.0,5083.0


In [17]:
# Remember, COUNTY_ID still holds a county name (not a numeric ID) in these rows because the name was already available.
# Fill it with the 'FullID' column from the reference table.
noID['COUNTY_ID'] = noID['FullID'] 



In [18]:
# Note, however, there were still nulls. 
noID[noID['COUNTY_ID'].isnull()]
# It turns out there were many more extraneous words like 'County' that make the join fail
# stop_words = ['County','Census Area', 'Municipality', 'Borough', 'Parish', 'city', 'Municipio', 'District', 'and']

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek,COUNTY2,COUNTY_ID,StateID,FullID
6,516759,2008,2008-08-01,1.0,B,31.66277,-93.48503,LA,Sabine,85,Sabine,,Other,8,Friday,sabine,,,


In [19]:
filled = noID[~noID['COUNTY_ID'].isnull()] #set filled aside 
stillna = noID[noID['COUNTY_ID'].isnull()].copy() #remaining set to work with; copied so columns can be assigned

In [20]:
# Still true that these records have the county info even if they are mismatched with the reference table.
# Concat StateID with the FIPS_CODE from the original data to make the full ID.
# Make dict for quicker lookup of the StateID
stateidsdict = dict(zip(county_list['STATE'], county_list['StateID'])) 

In [21]:
%time stillna['StateID'] = stillna['STATE'].apply(lambda x:stateidsdict[x])

CPU times: user 37.6 ms, sys: 98 µs, total: 37.7 ms
Wall time: 37.6 ms




In [22]:
%time stillna['COUNTY_ID'] = stillna['StateID'] + stillna['FIPS_CODE']

CPU times: user 37.3 ms, sys: 67 µs, total: 37.3 ms
Wall time: 37.2 ms




In [23]:
stillna

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause,Month,DayofWeek,COUNTY2,COUNTY_ID,StateID,FullID
6,516759,2008,2008-08-01,1.0,B,31.66277,-93.48503,LA,Sabine,85,Sabine,,Other,8,Friday,sabine,22085,22,
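
A full county FIPS code has five digits: a two-digit state code followed by a three-digit county code. If either part has lost its leading zeros, or was read as a float such as 85.0, plain string concatenation produces the wrong ID. The sketch below pads both parts before joining them. `full_fips` is a hypothetical helper, and the toy frame stands in for `stillna`.

```python
import pandas as pd

def full_fips(state_id: pd.Series, county_code: pd.Series) -> pd.Series:
    """Build 5-digit FIPS strings from state and county parts given as strings or numbers."""
    def pad(part: pd.Series, width: int) -> pd.Series:
        # Go through a nullable integer type so '85', '085', and 85.0 all become 85
        as_int = pd.to_numeric(part, errors='coerce').astype('Int64')
        return as_int.astype('string').str.zfill(width)
    return pad(state_id, 2) + pad(county_code, 3)

# Mixed representations of the same kinds of codes seen in this notebook
demo = pd.DataFrame({'StateID': ['22', '6', '1'], 'FIPS_CODE': ['085', '65', 105.0]})
demo['COUNTY_ID'] = full_fips(demo['StateID'], demo['FIPS_CODE'])
print(demo['COUNTY_ID'].tolist())
```

Rows with a missing part come out as `<NA>` rather than a truncated code, so they remain easy to find afterwards.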


In [24]:
fixed = pd.concat([IDed, filled, stillna])

In [25]:
fixed.shape # make sure didn't gain any rows

(10, 19)

In [26]:
# ***Possible Alternative***
# shorten county name to account for the other words and possibly get a more complete join
# however, would still have to check for nulls so may not save any time

# county_list[~county_list['COUNTY2'].str.contains('County')]['COUNTY2'].unique()
# county_list['COUNTY2'] = (county_list['COUNTY2'].str.replace(' County', '', regex=True)
#                           .str.replace(' Census Area', '', regex=True)
#                           .str.replace(' Municipality', '', regex=True)
#                           .str.replace(' Borough', '', regex=True)
#                           .str.replace(' Parish', '', regex=True)
#                           .str.replace(' city', '', regex=True)
#                           .str.replace(' Municipio', '', regex=True)
#                           .str.replace(' District', '', regex=True)
#                           .str.lower())
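
The chained replacements in the alternative above could also be collapsed into a single regular expression anchored at the end of the name. This is a sketch under the assumption that the extra words only ever appear as trailing suffixes; the suffix list comes from the stop words noted earlier (omitting 'and'), and the sample names are illustrative.

```python
import re
import pandas as pd

# Suffixes taken from the stop-word list noted earlier in this notebook
suffixes = ['County', 'Census Area', 'Municipality', 'Borough',
            'Parish', 'city', 'Municipio', 'District']

# Match exactly one suffix, preceded by whitespace, only at the end of the name
pattern = r'\s+(?:' + '|'.join(re.escape(s) for s in suffixes) + r')$'

names = pd.Series(['Autauga County', 'Sabine Parish', 'Baltimore city', 'District of Columbia'])
cleaned = names.str.replace(pattern, '', regex=True).str.lower()
print(cleaned.tolist())
```

One caveat: stripping suffixes can make distinct entries collide within a state. Maryland, for example, has both Baltimore County and Baltimore city, so a join on the shortened name would still need a check for duplicates as well as nulls.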

<a id="next"></a> 
# 3. Next Steps
Originally, I went on to create historical features about previous fires that occurred in the same location or month, which seemed to provide some signal. For this second take, I will instead first look up elevation and weather data for the fires. 