# Summary 
In this notebook, I describe two ways of gathering weather data and appending it to the fires data, then walk through the Google BigQuery method in detail. I will explore the new columns of the full data set in the next notebook.

# Table of Contents
[Method 1: NCDC API](#method1)  
[Method 2: Google BigQuery](#method2)
1. [US Weather Station IDs](#usids)
2. [Pull Weather from GBQ](#gbqweather)
3. [State to Station Dictionary](#state2station)
4. [Find Nearest Stations](#near)
5. [Join Weather](#join)

<a id='method1'></a>
# Method 1: NCDC API
One way to get the weather at a given location is through the National Climatic Data Center (NCDC). I tried this during Take 1 to retrieve precipitation, temperature, and average wind speed for my small subsets but was unable to scale it up for the whole dataset. Look at 'Fires 6 scrape weather' for details.

#### NCDC API
Pros:
- Can retrieve all the stations associated with a FIPS code, allowing you to calculate an average 
- Potentially more features from a weather station

Cons:
- Can only get multiple dates at a time if looking up one specific station 
- Data coverage is spotty at best for features outside the standard ones like temperature and precipitation
- Maximum of 10,000 API requests per key per day, which is not ideal for a large number of lookups given how restricted the queries are
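
To put the request cap in perspective, here is a rough back-of-the-envelope estimate. It assumes one API request per fire record, which is a simplification for illustration rather than a measured figure; the record count is the number of rows in the full fires data set.

```python
import math

# Rough estimate: days of API quota needed to look up every fire once.
N_FIRES = 1_880_443          # rows in the full fires data set
REQUESTS_PER_DAY = 10_000    # NCDC limit per key per day
REQUESTS_PER_FIRE = 1        # assumption: one request per fire record

days_needed = math.ceil(N_FIRES * REQUESTS_PER_FIRE / REQUESTS_PER_DAY)
print(days_needed, 'days with a single key')
```

Any extra requests per fire (for example one per weather feature or per station) would multiply this figure.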


<a id='method2'></a>
# Method 2: Google BigQuery

Google BigQuery (GBQ) hosts multiple weather data sets, including two from the NOAA* that are updated daily: GHCN** and GSOD***. I decided to query GSOD because the data has thorough descriptions and the schema is easier to work with. Each row is one station on a given day, and the columns are the weather features. The data set is divided into separate tables, one per year.

I want to join my fire records to the corresponding weather data. To minimize the data involved for the join, it is imperative to filter the weather data first. 

\*National Oceanic and Atmospheric Administration  
\*\*Global Historical Climatology Network  
\*\*\*Global Surface Summary of the Day  

<a id='usids'></a>
### Step 1: Filter for US-only stations

You can get the station information from GBQ:

```
SELECT usaf, wban, country, state, lat, lon, begin, `end`  
FROM `bigquery-public-data.noaa_gsod.stations`
WHERE country = 'US'
```

However, when looking up descriptions for the columns, I found you can download the Integrated Surface Database Station History from the NOAA FTP server. The .txt version has the descriptions, while the .csv contains only the data.

isd_stations_url = 'ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv'

Quick filter with bash:
```
! cat isd-history.csv | grep \"US\" > us_only_stations.csv
```
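
The same filter can also be done in pandas, which matches on the country column only rather than on a quoted `US` anywhere in the line. This is a minimal sketch on a made-up three-row sample; the rows are for illustration only, and the header names follow the column names used in this notebook, which is an assumption about the file layout. Reading `USAF` and `WBAN` as strings keeps any leading zeros.

```python
import io
import pandas as pd

# Tiny stand-in for isd-history.csv (made-up rows, for illustration only)
sample_csv = io.StringIO(
    '"USAF","WBAN","STATION NAME","CTRY","ST","CALL","LAT","LON","ELEV(M)","BEGIN","END"\n'
    '"007018","99999","EXAMPLE POD","","","","0.0","0.0","7018.0","20110309","20130730"\n'
    '"722212","92814","ST AUGUSTINE AIRPORT","US","FL","KSGJ","29.959","-81.34","3.1","20060101","20181023"\n'
    '"010010","99999","EXAMPLE ISLAND","NO","","ENJA","70.933","-8.667","9.0","19310101","20181023"\n'
)

# dtype=str on the ID columns preserves leading zeros such as '007018'
stations = pd.read_csv(sample_csv, dtype={'USAF': str, 'WBAN': str})

# keep only rows whose country code is exactly 'US'
us_only = stations[stations['CTRY'] == 'US']
print(us_only[['USAF', 'WBAN', 'STATION NAME', 'ST']])
```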




In [1]:
import pandas as pd
from collections import defaultdict
from scipy.spatial.distance import cdist
import numpy as np
import time
import pickle
import os

In [2]:
csv_path = 'us_only_stations.csv'
all_stations = pd.read_csv(csv_path, header=None,
                           names=['USAF', 'WBAN', 'STATION NAME', 'CTRY', 'ST',
                                  'CALL', 'LAT', 'LON', 'ELEV(M)', 'BEGIN',  'END'])
all_stations.head()

Unnamed: 0,USAF,WBAN,STATION NAME,CTRY,ST,CALL,LAT,LON,ELEV(M),BEGIN,END
0,621010,99999,MOORED BUOY,US,,,50.6,-2.933,-999.0,20080721,20080721
1,621110,99999,MOORED BUOY,US,,,58.9,-0.2,-999.0,20041118,20041118
2,621130,99999,MOORED BUOY,US,,,58.4,0.3,-999.0,20040726,20040726
3,621160,99999,MOORED BUOY,US,,,58.1,1.8,-999.0,20040829,20040829
4,621170,99999,MOORED BUOY,US,,,57.9,0.1,-999.0,20040726,20040726


In [3]:
all_stations.count()

USAF            7370
WBAN            7370
STATION NAME    7316
CTRY            7370
ST              6672
CALL            5073
LAT             7249
LON             7249
ELEV(M)         7249
BEGIN           7370
END             7370
dtype: int64

In [4]:
# keep only those that have relevant dates and valid coordinates
stations_has_data_beyond_1992 = all_stations[all_stations['END'] // 10000 > 1992]
drop_nulls = stations_has_data_beyond_1992.dropna(axis=0,
                                                  subset=['LAT', 'LON', 'BEGIN', 'END'])
print(len(stations_has_data_beyond_1992), 'stations with data beyond 1992')
print(len(stations_has_data_beyond_1992) - len(drop_nulls), 'stations with nulls')

5868 stations with data beyond 1992
64 stations with nulls


In [5]:
drop_nulls.nunique()

USAF            3774
WBAN            2487
STATION NAME    5485
CTRY               1
ST                53
CALL            2448
LAT             3722
LON             4326
ELEV(M)         2406
BEGIN           1711
END              756
dtype: int64

In [6]:
# a USAF ID is not tied to a single state: count how many states each ID maps to
drop_nulls.groupby('USAF')['ST'].nunique().sort_values(ascending=False).value_counts()

1     3265
0      501
2        7
48       1
Name: ST, dtype: int64

In [7]:
drop_nulls[drop_nulls['USAF'] == '996350']

Unnamed: 0,USAF,WBAN,STATION NAME,CTRY,ST,CALL,LAT,LON,ELEV(M),BEGIN,END
5509,996350,99999,ST. AUGUSTINE FL 40NM ENE OF ST AUGUSTI,US,,,30.0,-80.6,0.0,20020626,20101231


In [8]:
drop_nulls.shape

(5804, 11)

In [9]:
# stations with no state code (like 996350 above) drop out of the per-state lookup in Step 3,
# so it's possible to miss out on some closer stations, but coverage should be adequate
drop_nulls[drop_nulls['STATION NAME'].str.contains('AUGUSTINE')]

Unnamed: 0,USAF,WBAN,STATION NAME,CTRY,ST,CALL,LAT,LON,ELEV(M),BEGIN,END
1674,722212,92814,ST AUGUSTINE AIRPORT,US,FL,KSGJ,29.959,-81.34,3.1,20060101,20181023
1675,722212,99999,ST AUGUSTINE,US,FL,KSGJ,29.967,-81.333,3.0,19970123,20071231
5408,994410,99999,ST. AUGUSTINE FL,US,,,29.86,-81.26,0.0,19870122,20180425
5433,994700,99999,AUGUSTINE ISLAND,US,AK,,59.38,-153.35,9.1,20020703,20180425
5509,996350,99999,ST. AUGUSTINE FL 40NM ENE OF ST AUGUSTI,US,,,30.0,-80.6,0.0,20020626,20101231


<a id='gbqweather'></a>
### Step 2: Pull US weather data from GBQ
This step won't run here because the data was already downloaded in the 'gbq_get_weather' notebook. The idea is to filter each year's GBQ table down to the stations in the list of US station IDs.

In [10]:
us_stations = drop_nulls['USAF'].unique()

In [11]:
len(us_stations)

3774

In [12]:
# str() of a tuple looks like ('a', 'b'), which can be dropped straight into a SQL IN clause
us_stations2 = tuple(us_stations)
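
Formatting a tuple into the query works because Python prints a tuple of strings as `('a', 'b')`, which is also valid SQL for an `IN` list. One edge case is worth knowing: a one-element tuple prints with a trailing comma, which is not valid SQL. A small helper that quotes each ID explicitly sidesteps this; `sql_in_list` is a hypothetical name used for illustration, not part of any library.

```python
def sql_in_list(ids):
    """Build a parenthesised, quoted SQL list from an iterable of station IDs."""
    ids = [str(i) for i in ids]
    if not ids:
        raise ValueError('need at least one ID')
    # station IDs are digit strings here; reject anything that would need escaping
    for i in ids:
        if "'" in i:
            raise ValueError(f'unexpected quote in ID: {i!r}')
    return '(' + ', '.join(f"'{i}'" for i in ids) + ')'


print(str(('690140', '722212')))   # tuple repr already looks like a SQL list
print(str(('690140',)))            # a single ID leaves a trailing comma
print(sql_in_list(['690140']))     # explicit quoting avoids it
```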

In [13]:
# python API for GBQ

from google.cloud import bigquery
from google.oauth2 import service_account
# set credentials with oauth2; set up your own JSON credentials file
credentials = service_account.Credentials.from_service_account_file('/home/douglas/Downloads/weather-5deec6be7e9f.json')
# create client for weather project using the above service account credentials
client = bigquery.Client(project='weather-214817', credentials=credentials)

These were the columns I queried but there are a few more available.

Column | Description
----- | -----
stn | Station ID (USAF in isd-history)
year | Year
mo | Month
da | Day
temp | Mean temperature for the day in degrees Fahrenheit to tenths. Missing = 9999.9
stp | Mean station pressure for the day in millibars to tenths. Missing = 9999.9
wdsp | Mean wind speed for the day in knots to tenths. Missing = 999.9
max | Maximum temperature reported during the day in degrees Fahrenheit to tenths. Missing = 9999.9
prcp | Total precipitation (rain and/or melted snow) reported during the day in inches and hundredths. Missing = 99.99; many stations do not report '0' on days with no precipitation, so '99.99' often appears on those days
thunder | Indicator (1 = yes, 0 = no/not reported) for the occurrence of thunder during the day
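
Because missing readings are encoded as sentinel numbers rather than nulls, they need to be converted before any averaging or modelling. A minimal sketch, assuming the column names from the query in this notebook (with `max` aliased to `max_temp`) and a made-up two-row frame:

```python
import numpy as np
import pandas as pd

# Sentinel values that mean "missing", taken from the column descriptions
MISSING = {'temp': 9999.9, 'stp': 9999.9, 'wdsp': 999.9,
           'max_temp': 9999.9, 'prcp': 99.99}

# made-up sample rows for illustration
weather = pd.DataFrame({
    'temp': [68.1, 9999.9], 'stp': [9999.9, 1017.2], 'wdsp': [2.4, 999.9],
    'max_temp': [78.8, 9999.9], 'prcp': [0.0, 99.99],
})

# swap each column's sentinel for NaN so pandas treats it as missing
cleaned = weather.copy()
for col, sentinel in MISSING.items():
    cleaned[col] = cleaned[col].replace(sentinel, np.nan)
print(cleaned)
```

Whether a missing `prcp` should become NaN or be treated as zero is a modelling choice, given that many stations skip reporting on dry days.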

In [14]:
out_dir = '/home/douglas/ds_projects/Predicting_Wildfire_Size/data'


def pull_us_weather(us_stations, out_dir):
    '''Download weather data for US stations
    
    '''
    # Get names and queries ready
    df_names = ['weather' + str(yr) for yr in np.arange(1992,2016,1)]
    queries = ['''SELECT stn, year, mo, da, temp, stp, wdsp, max as max_temp, prcp, thunder  
        FROM `bigquery-public-data.noaa_gsod.gsod{}` 
        WHERE stn IN {} order by stn, mo'''.format(str(yr), tuple(us_stations)) for yr in np.arange(1992,2016)]
    
    # Store pulled dataframes in a dict, with key=year, value=df
    weather_dfs = dict()
    for year, query in zip(df_names, queries):
        query_job = client.query(query)
        rows = query_job.result()
        weather_dfs[year] = rows.to_dataframe()
        print(year, 'downloaded')  
            
    # Pickle dataframes
    for name in df_names:
        df = weather_dfs[name]
        with open(f'{out_dir}/{name}.pkl', 'wb') as picklefile:
            pickle.dump(df, picklefile)
        print(name, 'saved')
    return 'Done'

Sample timings for the two loops within 'pull_us_weather'

```
weather_dfs = dict()
for year, query in zip(df_names, queries):
    query_job = client.query(query)
    rows = query_job.result()
    weather_dfs[year] = rows.to_dataframe()
    print(year, 'downloaded')   
```
CPU times: user 7min 19s, sys: 11.3 s, total: 7min 30s
Wall time: 31min 59s

```
for name in df_names:
    df = weather_dfs[name]
    with open('/home/douglas/ds_projects/Predicting_Wildfire_Size/data/{}.pkl'.format(name), 'wb') as picklefile:
        pickle.dump(df, picklefile)
    print(name, 'saved')
```
CPU times: user 36.1 s, sys: 21.5 s, total: 57.6 s
Wall time: 47min 28s

The data returned by these queries had some issues, including duplicate station and date combinations, that caused extra rows when joining. The function below cleans them up.


In [15]:
def clean_weather_data(weather_df):
    '''Build a pd.datetime 'date' column from year/mo/da and
    remove duplicate station id and date combinations'''
    original_length = len(weather_df)
    # drop stations that use WBAN index instead of USAF; copy so the column assignment below is safe
    weather_df = weather_df[weather_df['stn'] != '999999'].copy()
    # relies on year, mo and da being zero-padded strings, so + concatenates them into YYYYMMDD
    weather_df['date'] = (weather_df['year'] + weather_df['mo'] + weather_df['da']).astype('int')
    weather_df['date'] = pd.to_datetime(weather_df['date'], format='%Y%m%d')
    weather_df = weather_df.sort_values(['stn', 'date'])  # sort before looking for dupes
    # duplicated() marks every repeat of a (stn, date) pair as True except the first occurrence
    dupe_mask = weather_df[['stn', 'date']].duplicated()
    weather_df = weather_df[~dupe_mask]  # keep only non dupes

    print('Rows before and after trimming:', original_length, len(weather_df))
    print('sample cols', weather_df[['date', 'stn']].sample(1))

    return weather_df
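
To see why duplicate station and date rows matter, the toy example below (made-up values) left-joins two fires onto weather that contains one repeated key. The join returns more rows than there are fires until the duplicates are dropped. Passing `validate='many_to_one'` to the merge is one way to make pandas raise an error instead of silently adding rows.

```python
import pandas as pd

fires = pd.DataFrame({'FOD_ID': [1, 2],
                      'stn': ['690140', '690140'],
                      'date': pd.to_datetime(['1999-01-04', '1999-01-05'])})
weather = pd.DataFrame({'stn': ['690140', '690140', '690140'],
                        'date': pd.to_datetime(['1999-01-04', '1999-01-04', '1999-01-05']),
                        'temp': [68.1, 68.1, 66.8]})

# duplicate (stn, date) key: the first fire is matched twice
joined_raw = fires.merge(weather, how='left', on=['stn', 'date'])

# same de-duplication idea as clean_weather_data
deduped = weather.drop_duplicates(subset=['stn', 'date'])
joined_clean = fires.merge(deduped, how='left', on=['stn', 'date'],
                           validate='many_to_one')

# check that validate would have caught the duplicated key
try:
    fires.merge(weather, how='left', on=['stn', 'date'], validate='many_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True

print(len(fires), len(joined_raw), len(joined_clean), caught)
```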

Uncomment below to apply. 

In [16]:
# weather_files = sorted([p for p in os.listdir('./data/') if 'weather' in p])

# # Apply cleaning to weather dataframes
# for file in weather_files:
#     weather_df = pd.read_pickle(f'./data/{file}')
#     cleaned_df = clean_weather_data(weather_df)
#     cleaned_df.to_pickle(f'./data/clean_{file}')

In [17]:
# example weather df
w1999 = pd.read_pickle('./data/clean_weather1999.pkl')
w1999.head()

Unnamed: 0,stn,year,mo,da,temp,stp,wdsp,max_temp,prcp,thunder,date
8,690140,1999,1,4,68.1,9999.9,2.4,78.8,0.0,0,1999-01-04
10,690140,1999,1,5,66.8,9999.9,2.6,75.6,0.0,0,1999-01-05
14,690140,1999,1,6,64.7,9999.9,1.5,75.6,0.0,0,1999-01-06
19,690140,1999,1,7,54.9,9999.9,1.3,72.0,0.0,0,1999-01-07
17,690140,1999,1,8,57.4,9999.9,3.0,73.4,0.0,0,1999-01-08


<a id='state2station'></a>
### Step 3: Create state-to-list-of-stations dictionary

In [18]:
%%time
# key is state code, value is 2d array of all stations in that state
# array columns: LAT, LON, USAF, BEGIN, END
station_locs = {}
for state, group in drop_nulls.groupby('ST'):
    station_locs[state] = group[['LAT', 'LON', 'USAF', 'BEGIN', 'END']].values

CPU times: user 44 ms, sys: 72 µs, total: 44.1 ms
Wall time: 43.4 ms


In [19]:
#example entry, truncated
print(station_locs['AZ'][:5])

[[32.533 -114.51700000000001 '696454' 19840426 20061214]
 [32.5 -114.15 '696464' 19900809 20061214]
 [32.733000000000004 -113.633 '697564' 19830901 19970418]
 [32.65 -114.617 '699604' 19870701 20121231]
 [35.650999999999996 -112.148 '720059' 20040408 20040614]]


<a id='near'></a>
### Step 4: Find the closest station by coordinates

In [20]:
def _get_sorted_stations(row, station_locs):
    '''
    Calculate distances to each station from latitude and longitude
    Returns a sorted list of stations'''
    
    location = row[['LATITUDE', 'LONGITUDE']].values.reshape(1, 2)
    
    # look up state's array of stations and calc distance
    state_stations = station_locs[row['STATE']]
    dists = cdist(location, state_stations[:, :2])
    
    # np.argsort returns the indices that would sort the distances
    sorted_indices = np.argsort(dists)
    return state_stations[sorted_indices]
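
A toy run of the same `cdist` plus `argsort` idea, with made-up coordinates. Note that the default `cdist` metric is Euclidean, so the "distance" here is in degrees of latitude/longitude rather than kilometres. That is a reasonable shortcut for ranking nearby stations, but it is an approximation.

```python
import numpy as np
from scipy.spatial.distance import cdist

# one fire location and three stations as (lat, lon); values are made up
fire = np.array([[33.0, -112.0]])
stations = np.array([[35.0, -112.0],    # 2 degrees north
                     [33.1, -112.1],    # closest
                     [32.0, -113.0]])   # in between
station_ids = np.array(['A', 'B', 'C'])

dists = cdist(fire, stations)           # shape (1, 3): one row per fire
order = np.argsort(dists[0])            # indices from nearest to farthest
print(station_ids[order], dists[0][order].round(3))
```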

In [21]:
#UNUSED
# def find_closest_station(row, station_locs):
#     '''Use 3 closest instead
#     '''
#     sorted_stations = _get_sorted_stations(row, station_locs).reshape(-1,5)    
#     fire_date = row['date_as_int']

    
#     for station in sorted_stations: 
#         if fire_date > station[3] and fire_date < station[4]:
#             return station[2]  # 2 = USAF idx
#         else:
#             continue
#     print("No matching station found for fire", row['FOD_ID'])
#     return None

In [22]:
def find_3_closest_stations(row, state2stations):
    '''Function intended for df.apply
    Requires a 'date_as_int' column in df
    *Inputs*
    row: dataframe row
    state2stations: dict, state-to-stations hash map
    *Returns* tuple of the (up to) 3 closest stations whose BEGIN-END range
    contains the date of the fire, padded with None
    '''
    # get stations info, closest first, reshaped to one row per station
    sorted_stations = _get_sorted_stations(row, state2stations).reshape(-1, 5)
    fire_date = row['date_as_int']

    # keep track of 3 closest - backups to fill NAs if possible
    top_3_stations = []

    # loop through stations, closest first
    for station in sorted_stations:
        if len(top_3_stations) == 3:
            break
        # take if date is between station's start(3) and end(4) dates
        if station[3] < fire_date < station[4]:
            top_3_stations.append(station[2])  # 2 = USAF idx

    if not top_3_stations:
        print("No matching station found for fire", row['FOD_ID'])
    # pad with None so exactly 3 values are always returned
    return tuple(top_3_stations + [None] * (3 - len(top_3_stations)))
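
The selection logic can be exercised on its own with a few made-up stations. This sketch mirrors the loop above in plain Python; `pick_active_stations` is a hypothetical helper name, and the station tuples are `(usaf, begin, end)`, already sorted nearest first.

```python
def pick_active_stations(sorted_stations, fire_date, k=3):
    """Return the k nearest station IDs active on fire_date, padded with None."""
    picked = [usaf for usaf, begin, end in sorted_stations
              if begin < fire_date < end][:k]
    return tuple(picked + [None] * (k - len(picked)))


# nearest first; dates are YYYYMMDD integers (made-up values)
stations = [('111111', 19900101, 19991231),   # closed before the fire
            ('222222', 19950101, 20101231),
            ('333333', 20000101, 20201231)]

print(pick_active_stations(stations, 20010814))
```

The nearest station is skipped because it stopped reporting before the fire date, so the result holds the two remaining stations followed by a `None` placeholder.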

In [23]:
# demo subset
lean_fires = pd.read_pickle('lean_fires.pkl')

In [24]:
lean_fires.head()

Unnamed: 0,FOD_ID,FIRE_YEAR,DISCOVERY_DATE,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,COUNTY,FIPS_CODE,FIPS_NAME,hr,Cause
334821,343186,2001,2452135.5,0.5,B,39.0499,-114.8342,NV,White Pine,33.0,White Pine,14.0,Lightning
1674798,201838989,2013,2456426.5,1.0,B,32.606075,-87.309651,AL,Perry,105.0,Perry,,Accident
1692175,201862585,2013,2456540.5,1.0,B,31.666004,-96.449247,TX,Limestone,293.0,Limestone,,Other
1135865,1385051,2008,2454679.5,0.1,A,33.953889,-116.496944,CA,,,,,Other
130533,131832,2000,2451723.5,1.5,B,37.923056,-120.101111,CA,,,,17.0,Lightning


In [25]:
def match_station(df, state_to_stations):
    '''Match fire record with its 3 closest stations
    Requires df to have columns 
    ['LATITUDE', 'LONGITUDE', 'STATE', 'DISCOVERY_DATE']'''
    # Preprocess df
    if df['DISCOVERY_DATE'].dtype == 'float64':
        df['DISCOVERY_DATE'] = pd.to_datetime(df['DISCOVERY_DATE'], origin='julian', unit='D')
    df['date_as_int'] = (10000 * df['DISCOVERY_DATE'].dt.year + 100 *
                                df['DISCOVERY_DATE'].dt.month + df['DISCOVERY_DATE'].dt.day)
    
    # Apply function
    new_cols = df.apply(find_3_closest_stations, args=(state_to_stations,), axis=1)
    print(type(new_cols))
    # convert tuples to dataframe
    unpackdf = pd.DataFrame(new_cols.tolist(),
                            columns=['weather_station1','weather_station2', 'weather_station3'],
                            index=new_cols.index)
    # append to original df
    df = pd.concat([df, unpackdf],axis=1)
    return df
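
The date handling in `match_station` can be checked on a single value. `DISCOVERY_DATE` is stored as a Julian day number, and `pd.to_datetime` with `origin='julian'` converts it to a timestamp. The integer form `YYYYMMDD` then compares directly against the stations' `BEGIN` and `END` columns. The value used here is the `DISCOVERY_DATE` of the first row of `lean_fires`.

```python
import pandas as pd

julian = pd.Series([2452135.5])  # DISCOVERY_DATE of the first lean_fires row
dates = pd.to_datetime(julian, origin='julian', unit='D')

# same arithmetic as match_station: year, month, day packed into one integer
date_as_int = 10000 * dates.dt.year + 100 * dates.dt.month + dates.dt.day
print(dates.iloc[0].date(), date_as_int.iloc[0])
```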

In [26]:
%%time 
lean_fires = match_station(lean_fires, station_locs)

<class 'pandas.core.series.Series'>
CPU times: user 34.7 s, sys: 365 ms, total: 35.1 s
Wall time: 35.1 s


In [27]:
# the new cols
lean_fires.iloc[:, -4:].head()

Unnamed: 0,date_as_int,weather_station1,weather_station2,weather_station3
334821,20010814,724860,725824,724770
1674798,20130514,999999,999999,999999
1692175,20130905,720779,722469,722561
1135865,20080801,722868,747187,720165
130533,20000628,724810,724815,724926


In [28]:
lean_fires[['weather_station1','weather_station2', 'weather_station3']].describe()

Unnamed: 0,weather_station1,weather_station2,weather_station3
count,50000,49852,49734
unique,2133,2251,2355
top,999999,999999,999999
freq,2026,2340,2818


<a id='join'></a>
### Step 5: Join Weather
I wrote a couple of scripts to join the weather df's from step 2 to the fires data. The core component is a for loop that goes through each year, loads that year's weather data, slices the fire dataframe to that year's subset, and does a left join between fires and weather. 

The script below joins weather at the closest station on the day after the fire is discovered. Fewer than 25% of small fires (classes A, B, C) last into the next day, so the next day's weather conditions may carry signal about whether a fire keeps growing.

In the full data set, you'll notice there are more columns, such as weather data from the three stations for the first day. This was done with a similar script joining the weather df, just matching on the three different station IDs from Step 4.

The other lines of code are there to accommodate my folder structure and hardware constraints, because I needed to split things up into parts and then recombine them. Since the core idea of the script is pretty straightforward, I won't adapt the code to demo with the subset here.

```
from datetime import timedelta

weather_files = sorted([p for p in os.listdir('./data/') if 'clean_weather' in p])

fires = pd.read_pickle('nov_14_fires_joined_3weather_stations.pkl')
input_length = len(fires)
print(input_length)

# Add new date columns 
fires['day_2'] = fires['DISCOVERY_DATE'] + timedelta(days=1)
fires['day_3'] = fires['DISCOVERY_DATE'] + timedelta(days=2)


fire_subsets = []
n = 0
date_col = 'day_2'
for filename in weather_files:
    
    cyear = int(filename[-8:-4]) #magic numbers from file naming
    
    fires_from_year = fires[fires[date_col].dt.year == cyear]  # slice year
    
    # Load weather data df for that year
    weather_df = pd.read_pickle('./data/{}'.format(filename))

    add_day2 = pd.merge(fires_from_year, weather_df, 
                        how='left', 
                        left_on=['day_2', 'weather_station1'], 
                        right_on=['date', 'stn'], 
                        suffixes=('', '_day2'))
    print(add_day2.shape)
    
    fire_subsets.append(add_day2)

    # Save part once size gets to a certain point 
    if len(fire_subsets) % 8 == 0:
        n += 1
        fires_df = pd.concat(fire_subsets, axis=0)
        fires_df.to_pickle('dec_4_fires_joined_3weather_days_part{}.pkl'.format(n))
        fire_subsets = []
        
# Merge together
part_filenames = sorted([p for p in os.listdir('./') if 'dec_4_fires_joined_3weather_days_part' in p])
print(part_filenames)
final_parts = []
for part in part_filenames:
    df = pd.read_pickle(part)
    print(part, df.shape)
    final_parts.append(df)
fires = pd.concat(final_parts, axis=0)

#Make sure no rows gained
print('Total Rows: ', '\n Input=', input_length, '\n Final=', len(fires))
```
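
One detail of the script worth illustrating is that fires are sliced by the year of `day_2`, not by the year of discovery: a fire discovered on 31 December needs weather from the next year's table. The toy example below (made-up rows) shows the next-day key landing in the following year and the left join keeping one row per fire.

```python
import pandas as pd

fires = pd.DataFrame({
    'FOD_ID': [1, 2],
    'DISCOVERY_DATE': pd.to_datetime(['1999-12-31', '2000-06-28']),
    'weather_station1': ['724810', '724810'],
})
fires['day_2'] = fires['DISCOVERY_DATE'] + pd.Timedelta(days=1)

# stand-in for the cleaned 2000 weather table
weather_2000 = pd.DataFrame({
    'stn': ['724810', '724810'],
    'date': pd.to_datetime(['2000-01-01', '2000-06-29']),
    'temp': [41.0, 77.5],
})

# both fires belong to the 2000 slice once keyed on day_2
fires_2000 = fires[fires['day_2'].dt.year == 2000]
joined = pd.merge(fires_2000, weather_2000, how='left',
                  left_on=['day_2', 'weather_station1'],
                  right_on=['date', 'stn'], suffixes=('', '_day2'))
print(joined[['FOD_ID', 'day_2', 'temp']])
```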

In [29]:
# misnamed pickle file - full records with 2 days 
full_fires = pd.read_pickle('dec_4_fires_joined_3weather_days.pkl')

In [30]:
pd.set_option('display.max_columns', 500)
full_fires.head()

Unnamed: 0,FOD_ID,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,STATE,Month,DayofWeek,DISCOVERY_TIME2,COUNTY2_x,COUNTY_ID,CLASS,StateID,CAUSE,Prev_Lightning_Fires,Prev_Accident_Fires,Prev_Arson_Fires,Prev_Other_Fires,Prev_Fires_at_Location,Prev_Fires_Same_Month,Prev_1_fires2,Prev_2_Fires,Prev_3_fires,Elevation,date_as_int,weather_station,weather_station1,weather_station2,weather_station3,stn_x,year_x,mo_x,da_x,temp_x,stp_x,wdsp_x,max_temp_x,prcp_x,thunder_x,date_x,stn_y,year_y,mo_y,da_y,temp_y,stp_y,wdsp_y,max_temp_y,prcp_y,thunder_y,date_y,stn,year,mo,da,temp,stp,wdsp,max_temp,prcp,thunder,date,day_2,day_3,stn_day2,year_day2,mo_day2,da_day2,temp_day2,stp_day2,wdsp_day2,max_temp_day2,prcp_day2,thunder_day2,date_day2
0,1127038,1992-01-01,1,,5.0,Debris Burning,5.0,B,34.3917,-78.5683,NC,1,Wednesday,,columbus,37047,1,37,Accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29,19920101,723035,723035,723013,746930,723035.0,1992.0,1.0,1.0,43.5,9999.9,8.0,55.0,0.0,0.0,1992-01-01,723013,1992,1,1,47.9,1028.2,8.7,66.9,0.0,0,1992-01-01,746930.0,1992.0,1.0,1.0,44.1,9999.9,5.9,54.0,0.0,0.0,1992-01-01,1992-01-02,1992-01-03,723035.0,1992.0,1.0,2.0,46.3,9999.9,6.7,54.0,0.0,0.0,1992-01-02
1,878089,1992-01-01,1,,13.0,Missing/Undefined,15.0,C,42.240662,-105.292504,WY,1,Wednesday,,albany,56001,2,56,Other,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1808,19920101,725685,725685,725645,725643,725685.0,1992.0,1.0,1.0,27.5,9999.9,4.2,44.1,0.0,0.0,1992-01-01,725645,1992,1,1,16.8,9999.9,3.7,28.2,0.04,0,1992-01-01,,,,,,,,,,,NaT,1992-01-02,1992-01-03,725685.0,1992.0,1.0,2.0,35.9,9999.9,7.5,48.9,0.0,0.0,1992-01-02
2,19096771,1992-01-01,1,10.0,9.0,Miscellaneous,0.58,B,32.1325,-82.761,GA,1,Wednesday,10.0,wheeler,13309,1,13,Other,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,59,19920101,722135,722135,722130,722175,722135.0,1992.0,1.0,1.0,49.5,9999.9,7.8,55.0,0.0,0.0,1992-01-01,722130,1992,1,1,48.1,1020.1,9.9,55.9,0.0,0,1992-01-01,722175.0,1992.0,1.0,1.0,45.0,9999.9,6.7,55.0,0.0,0.0,1992-01-01,1992-01-02,1992-01-03,722135.0,1992.0,1.0,2.0,54.5,9999.9,6.8,63.0,0.2,0.0,1992-01-02
3,19094893,1992-01-01,1,200.0,3.0,Smoking,0.72,B,31.1216,-84.2153,GA,1,Wednesday,200.0,mitchell,13205,1,13,Accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71,19920101,722160,722160,722166,747810,722160.0,1992.0,1.0,1.0,48.5,9999.9,9.7,57.0,0.0,0.0,1992-01-01,722166,1992,1,1,50.0,9999.9,11.5,61.0,0.0,0,1992-01-01,747810.0,1992.0,1.0,1.0,51.2,9999.9,8.2,55.9,0.0,0.0,1992-01-01,1992-01-02,1992-01-03,722160.0,1992.0,1.0,2.0,51.1,9999.9,9.5,55.9,0.12,0.0,1992-01-02
4,1100722,1992-01-01,1,,2.0,Equipment Use,8.0,B,29.54,-83.21,FL,1,Wednesday,,dixie,12029,1,12,Accident,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,19920101,722120,722120,722146,722055,,,,,,,,,,,NaT,722146,1992,1,1,52.8,1017.2,10.0,61.0,0.0,0,1992-01-01,722055.0,1992.0,1.0,1.0,56.5,9999.9,7.4,57.9,0.0,0.0,1992-01-01,1992-01-02,1992-01-03,,,,,,,,,,,NaT


In [31]:
full_fires.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1880443 entries, 0 to 74501
Data columns (total 80 columns):
FOD_ID                    int64
DISCOVERY_DATE            datetime64[ns]
DISCOVERY_DOY             int64
DISCOVERY_TIME            object
STAT_CAUSE_CODE           float64
STAT_CAUSE_DESCR          object
FIRE_SIZE                 float64
FIRE_SIZE_CLASS           object
LATITUDE                  float64
LONGITUDE                 float64
STATE                     object
Month                     int64
DayofWeek                 object
DISCOVERY_TIME2           float64
COUNTY2_x                 object
COUNTY_ID                 object
CLASS                     category
StateID                   object
CAUSE                     object
Prev_Lightning_Fires      float64
Prev_Accident_Fires       float64
Prev_Arson_Fires          float64
Prev_Other_Fires          float64
Prev_Fires_at_Location    float64
Prev_Fires_Same_Month     float64
Prev_1_fires2             float64
Prev_2_Fires