# Wildfires Caused by the Weather
## Part 1: Data Acquisition

In this part, we acquire the data by *crawling*.<br>
We crawl the **National Interagency Fire Center** site.<br>
The data is the history of wildfires in the USA.<br>
The link below explains the attributes. We will remove some of them and add weather-related columns in Part 2 of this project (data cleaning):<br>
[Wildland fire locations full history](https://data-nifc.opendata.arcgis.com/datasets/nifc::wfigs-wildland-fire-locations-full-history/about)


#### Preceding Step - import modules (packages)
This step is necessary in order to use external packages. 

In [1]:
import bs4
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup 
from collections import defaultdict

## Selenium support ##

# Uncomment the below lines of code for installing selenium for the first time:
#!pip install selenium
#!pip install webdriver-manager

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
from webdriver_manager.chrome import ChromeDriverManager
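# The driver below is Firefox (GeckoDriverManager), so the Firefox Service class is the one needed
from selenium.webdriver.firefox.service import Service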
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#### Global variables and constants
Here we define the global constants used throughout this notebook.

In [2]:
BASE_URL = "https://data-nifc.opendata.arcgis.com/datasets/nifc::wfigs-wildland-fire-locations-full-history/explore?showTable=true"
CSV_NAME = "Wildfire_history.csv"

## Auxiliary functions

### getFullHTMLContent() implementation - *START*
In this section, we will implement the getFullHTMLContent() function and its auxiliary functions

In [None]:
from selenium.common.exceptions import NoSuchElementException

## This function gets a driver and the element we want to scroll, and scrolls it down
## until the whole table has been loaded
def scrollElementDown(driver, element):

    # Get the current scroll height.
    last_height = driver.execute_script("return arguments[0].scrollHeight", element)

    while True:

        # Scroll down to the bottom.
        driver.execute_script("arguments[0].scrollTo(0, arguments[0].scrollHeight);", element)

        # Wait for the next batch of rows to load.
        time.sleep(1)

        # Calculate the new scroll height and compare it with the last scroll height.
        new_height = driver.execute_script("return arguments[0].scrollHeight", element)

        if new_height == last_height:
            # The height stopped growing: check whether we really scrolled to the bottom
            # by comparing the loaded-rows count with the total (tokens 1 and 3 of the counter text).
            table_info = driver.find_element(by=By.CLASS_NAME, value='feature-table-count').text.split()
            if table_info[1] == table_info[3]:
                print("found {} from {}".format(table_info[1], table_info[3]))
                break
            else:
                try:
                    # If the loader element is present, the site is still fetching rows.
                    driver.find_element(by=By.CLASS_NAME, value='loader')
                except NoSuchElementException:
                    # This site is extremely slow and sometimes gets stuck while loading.
                    # To get it moving again, we scroll back to the top and then scroll down again.
                    driver.execute_script("arguments[0].scrollTo(0, 0);", element)
                    time.sleep(2)

        last_height = new_height
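
The stop condition above compares two whitespace-separated tokens of the counter text. As a minimal sketch, the same check can be written so that it does not depend on token positions. The helper names `parse_table_count` and `is_fully_loaded` are hypothetical, and the sample text `Showing 100 of 219,358` is only an assumption about what the site displays; the sketch assumes the first two integers in the text are the loaded and total row counts.

```python
import re


def parse_table_count(text):
    """Return (loaded, total) from counter text such as 'Showing 100 of 219,358'."""
    # Assumption: the first two integers are the loaded and the total row counts.
    numbers = re.findall(r"\d[\d,]*", text)
    if len(numbers) < 2:
        raise ValueError("unexpected counter text: {!r}".format(text))
    # Drop thousands separators before converting to int.
    loaded, total = (int(n.replace(",", "")) for n in numbers[:2])
    return loaded, total


def is_fully_loaded(text):
    """True when every row reported by the counter has been loaded."""
    loaded, total = parse_table_count(text)
    return loaded == total
```

Comparing integers instead of raw strings also avoids a mismatch if the two numbers are formatted differently.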

In [None]:
## This function loads the site with Selenium, scrolls the table to its end and returns the full HTML content
## (the html_file_name parameter is currently unused)
def getFullHTMLContent(url, html_file_name):
    driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
    try:
        driver.get(url)

        # Get the table div which we will scroll
        table = driver.find_element(by=By.CLASS_NAME, value='infinite-scroll-container')

        # Scroll the table down until all rows are loaded
        scrollElementDown(driver, table)

        # Return the HTML content
        return driver.page_source
    finally:
        # Always close the browser, even if scrolling fails
        driver.quit()
    
    

### getFullHTMLContent() implementation - *END*

### crawlWildFiresHistory() implementation - *START*
In this section, we will implement the crawlWildFiresHistory() function and its auxiliary functions

In [None]:
## This function uses BeautifulSoup to get all the table column names
def getColumns(table):
    columns = table.find_all("th")
    table_columns = []
    for th in columns:
        table_columns.append(th.get_text().strip())
    return table_columns

In [None]:
## This function uses BeautifulSoup to get all the table rows
def getRows(table):
    table_rows = []
    rows = table.find("tbody").find_all("tr")
    for tr in rows:
        cells = tr.find_all('td')
        # Use a distinct loop variable so the outer `tr` is not shadowed
        row = [td.get_text().strip() for td in cells]
        table_rows.append(row)
    return table_rows

In [None]:
## This function uses BeautifulSoup to read the table, create a data frame and save it to a CSV file
def crawlWildFiresHistory(html_content):
    data = {}
    soup = BeautifulSoup(html_content, "html.parser")
    table = soup.find("table")
    columns = getColumns(table)
    rows = getRows(table)

    # Combine the columns and the rows into one dictionary (column name -> list of values)
    # for creating the data frame
    for i in range(len(rows[0])):
        data[columns[i]] = [row[i] for row in rows]

    df = pd.DataFrame(data)
    # Save the data frame to CSV_NAME (to_csv returns None when a path is given)
    df.to_csv(CSV_NAME, index=False)
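
The loop in crawlWildFiresHistory() turns a list of rows into a dictionary of columns, which is a transposition. A minimal sketch of the same idea with `zip` follows. The helper name `rows_to_columns` and the sample column names and values are hypothetical and are not taken from the crawled table.

```python
def rows_to_columns(columns, rows):
    """Map each column name to the list of its values across all rows."""
    # zip(*rows) transposes the row lists into per-column tuples;
    # zip(columns, ...) then pairs every column name with its values.
    return {name: list(values) for name, values in zip(columns, zip(*rows))}


# Hypothetical sample data, for illustration only.
columns = ["Name", "Acres", "State"]
rows = [["A", "10", "MT"], ["B", "0.1", "CA"]]
print(rows_to_columns(columns, rows))
# {'Name': ['A', 'B'], 'Acres': ['10', '0.1'], 'State': ['MT', 'CA']}
```

One difference to keep in mind: `zip` silently truncates to the shortest row, while the index-based loop raises an `IndexError` on a row that is shorter than the first one.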
    

### crawlWildFiresHistory() implementation - *END*

###  Main program - *START* 
This is the main program, which we execute to run the data acquisition.

#### Crawling part
In this section, we use the getFullHTMLContent() function to get the full HTML content with Selenium. <br>
Then, we pass the content to the crawlWildFiresHistory() function, which parses the table and saves the data to a CSV file.

In [None]:
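# getFullHTMLContent() never uses its html_file_name argument, so None is passed
# in place of HTML_FILE_NAME, which is not defined anywhere in this notebook
html_content = getFullHTMLContent(BASE_URL, None)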
crawlWildFiresHistory(html_content)

#### Data exploration
In this section, we take a brief look at the data we crawled.

In [3]:
df = pd.read_csv(CSV_NAME)
df


| | ABCDMisc | ADSPermissionState | CalculatedAcres | ContainmentDateTime | ControlDateTime | DailyAcres | DiscoveryAcres | DispatchCenterID | EstimatedCostToDate | FinalFireReportApprovedByTitle | ... | IsDispatchComplete | OrganizationalAssessment | StrategicDecisionPublishDate | CreatedOnDateTime_dt | ModifiedOnDateTime_dt | Source | GlobalID | IsCpxChild | CpxName | CpxID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | CERTIFIED | 50.64 | 2020/08/06 23:13:07+00 | 2020/08/06 23:13:24+00 | 50.60 | 20.00 | MTMCC | | | ... | 0 | | | 2020/08/06 19:50:29+00 | 2020/08/12 20:46:01+00 | IRWIN | {E5436898-ED0D-4CB1-90C0-D61915FE1F29} | | | |
| 1 | | DEFAULT | | | | | 0.10 | CALACC | | | ... | 0 | | | 2020/02/28 20:52:36+00 | 2020/02/28 20:52:36+00 | IRWIN | {0E79B7FD-2882-43CF-8CFA-911BD1C8F77A} | | | |
| 2 | | DEFAULT | | 2017/10/18 00:30:00+00 | 2017/10/18 00:35:00+00 | 50.00 | 50.00 | MTKIC | | | ... | 0 | | | 2017/10/18 13:46:40+00 | 2017/11/09 22:08:19+00 | IRWIN | {FAC59A92-E6AD-443B-8625-4AAABCF7F533} | | | |
| 3 | | DEFAULT | | | | | | CAMVIC | | | ... | 0 | | | 2019/07/01 20:10:12+00 | 2019/07/01 20:10:12+00 | IRWIN | {5DF06F41-9948-49D3-B00A-2D3A1D1049C5} | | | |
| 4 | | DEFAULT | | | | | | | | | ... | 0 | | | 2016/06/20 22:39:02+00 | 2016/06/20 22:39:02+00 | IRWIN | {F378818E-D541-4E0A-9A44-C81886C2B8B4} | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 219353 | | DEFAULT | | 2022/05/19 05:24:00+00 | | 0.01 | 0.01 | CARRCC | | | ... | 0 | | | 2022/05/19 05:34:01+00 | 2022/05/19 05:35:18+00 | IRWIN | {54E4F816-8F07-4FED-B0F2-0002A4557577} | 0.0 | | |
| 219354 | | DEFAULT | | 2022/05/19 06:07:00+00 | | 0.01 | 0.01 | CARRCC | | | ... | 0 | | | 2022/05/19 05:54:05+00 | 2022/05/19 07:05:11+00 | IRWIN | {B0811A71-78D6-45AD-A4AA-9DB841925EC1} | 0.0 | | |
| 219355 | | DEFAULT | | 2022/05/19 07:22:06+00 | 2022/05/19 07:57:20+00 | 0.01 | 0.01 | CAMMCC | | | ... | 0 | | | 2022/05/19 07:22:16+00 | 2022/05/19 07:57:27+00 | IRWIN | {2A5EC384-919C-4ECB-ACBC-656E118E1FB1} | 0.0 | | |
| 219356 | | DEFAULT | | 2022/05/19 07:52:01+00 | 2022/05/19 07:52:04+00 | 0.01 | 0.01 | CAMMCC | | | ... | 0 | | | 2022/05/19 07:52:08+00 | 2022/05/19 07:52:13+00 | IRWIN | {6661A7F0-34EF-4E23-85D1-B19F181B3C2D} | 0.0 | | |


In [5]:
print("********** Data Frame info **********")
print(df.info())

********** Data Frame info **********
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219358 entries, 0 to 219357
Data columns (total 93 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   ABCDMisc                         10813 non-null   object 
 1   ADSPermissionState               219358 non-null  object 
 2   CalculatedAcres                  4900 non-null    float64
 3   ContainmentDateTime              131246 non-null  object 
 4   ControlDateTime                  119570 non-null  object 
 5   DailyAcres                       150112 non-null  float64
 6   DiscoveryAcres                   158154 non-null  float64
 7   DispatchCenterID                 181979 non-null  object 
 8   EstimatedCostToDate              13638 non-null   float64
 9   FinalFireReportApprovedByTitle   0 non-null       float64
 10  FinalFireReportApprovedByUnit    2616 non-null    object 
 11  FinalFireReportApprovedDate

In [6]:
print("********** Data Frame describe **********")
print(df.describe(include='all'))

********** Data Frame describe **********
       ABCDMisc ADSPermissionState  CalculatedAcres     ContainmentDateTime  \
count     10813             219358      4900.000000                  131246   
unique      474                  4              NaN                  110938   
top        EKV5            DEFAULT              NaN  2020/12/31 18:00:00+00   
freq        461             181390              NaN                     137   
mean        NaN                NaN      6044.412936                     NaN   
std         NaN                NaN     30862.760752                     NaN   
min         NaN                NaN         0.003400                     NaN   
25%         NaN                NaN         2.977500                     NaN   
50%         NaN                NaN        67.230000                     NaN   
75%         NaN                NaN      1098.949225                     NaN   
max         NaN                NaN    963405.350400                     NaN   
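
The outputs above show that date columns such as ContainmentDateTime are read as `object`, that is, as plain strings like `2020/08/06 23:13:07+00`. As a minimal sketch of how such a string could be turned into a timezone-aware value, the helper below uses only the standard library. The name `parse_nifc_timestamp` is hypothetical, and the sketch assumes every value carries a two-digit hour offset like the rows shown here.

```python
from datetime import datetime, timezone


def parse_nifc_timestamp(text):
    """Parse a string such as '2020/08/06 23:13:07+00' into an aware datetime."""
    # Assumption: the offset is always two digits of hours ('+00').
    # The %z directive expects hours and minutes, so zero minutes are
    # appended before parsing ('+00' becomes '+0000').
    return datetime.strptime(text + "00", "%Y/%m/%d %H:%M:%S%z")


stamp = parse_nifc_timestamp("2020/08/06 23:13:07+00")
print(stamp.isoformat())
# 2020-08-06T23:13:07+00:00
```

Missing values (NaN) would have to be skipped before calling the helper, since it only accepts strings.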

         

###  Main program - *END* 