# Scraping EM-DAT: A First Attempt
This is my attempt to scrape the very detailed disaster database hosted by [EM-DAT](http://www.emdat.be/disaster_list/index.html). [Fabio](http://fabio-lana.blogspot.it/) inspired this, and a big shout-out goes to Stevie B. The tutorial is split roughly evenly between code and background research.

# Introduction

Risk is calculated by the following equation: [1][ref1]
        
$$RS = PT * PL * V * A$$

Where
+ **PT** is the temporal (e.g. annual) probability of occurrence of a specific hazard scenario (Hs) with a given return period in an area
+ **PL** is the locational or spatial probability of occurrence of a specific hazard scenario with a given return period in an area affecting the elements-at-risk
+ **V** is the physical vulnerability, specified as the degree of damage to a specific element-at-risk Es given the local intensity caused due to the occurrence of hazard scenario Hs
+ **A** is the quantification (value) of the specific type of element at risk evaluated. If the value is expressed in monetary terms, the risk may also be expressed as potential damage.
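
To make the multiplication concrete, here is a minimal sketch of the risk equation. The function name `risk_score` and all input values are invented for illustration; they do not come from EM-DAT or from Fabio's post, and the sketch assumes PT, PL and V are expressed as fractions between 0 and 1.

```python
def risk_score(pt, pl, v, a):
    """Return RS = PT * PL * V * A for one hazard scenario and one element-at-risk."""
    # Assumption: the two probabilities and the vulnerability are fractions in [0, 1].
    for name, value in (("pt", pt), ("pl", pl), ("v", v)):
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must lie in [0, 1], got %r" % (name, value))
    return pt * pl * v * a


# Hypothetical inputs: a 1-in-50-year flood (PT = 0.02), a 50% chance that it
# reaches the building (PL = 0.5), 40% expected damage (V = 0.4) and a
# building valued at 200,000 in some currency (A).
rs = risk_score(pt=0.02, pl=0.5, v=0.4, a=200000)
print(rs)  # roughly 800, in the same currency as A
```

Because A is given in monetary terms here, the result reads as an expected annual damage for that single building.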

$$E=f(F,L,P)$$

Where
+ **E** is the nature/extent of effects,
+ **F** is the flood characteristics (extent, depth, velocity, etc.),
+ **L** is the location characteristics (inside/outside buildings, nature of housing, etc.),
+ **P** is the population characteristics (age, health, GDP, income, gender, etc.)

*The **F** in this case represents the distribution both spatially and temporally. Fabio has a good representation of the different aspects of this in the following image:*

![risk framework](http://1.bp.blogspot.com/-RaW18RN6ZIU/Va313WJ2WeI/AAAAAAAABfM/AiiEa3UOihU/s640/image049.png)

Hazards in general can be defined by their likelihood of occurrence (temporal probability) within a certain area (spatial probability).

Fabio breaks the flood hazard component into two main layers:

**Expected Impact Value (EIV)** is based on the probability of a hazard happening in a given month and the number of people affected by the hazard.
>A simple SPARC example, if a pixel has a 0.1 chance of being flooded in any given March, and an estimated areas risk population (ARP) of 1000, its March EIV is 100.
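
The arithmetic in that example can be written out as a short sketch. The function name `expected_impact_value` is my own label, not something taken from SPARC; the two numbers are the ones quoted above.

```python
def expected_impact_value(p_flood, at_risk_population):
    """Expected number of people affected in one pixel for one month."""
    return p_flood * at_risk_population


# Values from the quoted example: a 0.1 chance of flooding in March
# and an at-risk population of 1000 in the pixel.
march_eiv = expected_impact_value(0.1, 1000)
print(march_eiv)  # 100.0
```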

>It is also known, that the most accurate estimates on damages and people at risk are achieved when the number of return periods are located in the lower tail of the risk curve, while adding an extra return period in the upper tail has very little influence on the final risk estimate. [2][ref1]

He also uses the monthly distribution of rainfall in the calculation against the ARP, because of the correlation between rainfall and flooding events. Below are some of the key characteristics involved in the analysis, as listed in the background paper prepared for the Global Assessment Report on Disaster Risk Reduction 2013:

+ **Hydromorphometric:**
    + Drainage area
    + Mean basin elevation
    + Mean basin slope
    + Basin shape
    + Main channel length
    + Main channel slope
    + Drainage frequency
    + Distance to final outlet
+ **Land cover:**
    + Surface water storage
    + Forest cover
    + Impervious cover
+ **Climatic time-series:**
    + Mean annual precipitation
    + Temporal mean of monthly maximum precipitation
    + Minimum mean monthly temperature
+ **Climatic zones:**
    + Percentage area of Köppen-Geiger climatic zones
+ **Upstream dam network:**
    + Dam characteristics




### "Source-Pathway-Receptor-Consequence" Model

![pic](http://2.bp.blogspot.com/-SvZl09V9OEM/Va31tru8DrI/AAAAAAAABfM/z28qIqmzqT4/s1600/image012.jpg)


In [66]:
# -*- coding: utf-8 -*-
import os, json, sys, string, io
import pandas as pd
import numpy as np
import urllib2
import pycountry

In [118]:
from scripts import scrape

In [121]:
cc_dict = scrape.iso3dict()  # ISO3 codes; the loop in the next cell iterates over cc_dict

In [93]:
appended_data=[]
for k in cc_dict:
    df_country = emudat2df(k)
    appended_data.append(df_country)
    print "Done with ISO3:"+k

Done with ISO3:DZA
Done with ISO3:AGO
Done with ISO3:EGY
Done with ISO3:BGD
Done with ISO3:NER
Done with ISO3:LIE
Done with ISO3:NAM
Done with ISO3:BGR
Done with ISO3:BOL
Done with ISO3:GHA
Done with ISO3:CCK
Done with ISO3:PAK
Done with ISO3:CPV
Done with ISO3:JOR
Done with ISO3:LBR
Done with ISO3:LBY
Done with ISO3:MYS
Done with ISO3:IOT
Done with ISO3:PRI
Done with ISO3:MYT
Done with ISO3:PRK
Done with ISO3:PSE
Done with ISO3:TZA
Done with ISO3:BWA
Done with ISO3:KHM
Done with ISO3:UMI
Done with ISO3:TTO
Done with ISO3:PRY
Done with ISO3:HKG
Done with ISO3:SAU
Done with ISO3:LBN
Done with ISO3:SVN
Done with ISO3:BFA
Done with ISO3:SVK
Done with ISO3:MRT
Done with ISO3:HRV
Done with ISO3:CHL
Done with ISO3:CHN
Done with ISO3:KNA
Done with ISO3:JAM
Done with ISO3:SMR
Done with ISO3:GIB
Done with ISO3:DJI
Done with ISO3:GIN
Done with ISO3:FIN
Done with ISO3:URY
Done with ISO3:VAT
Done with ISO3:STP
Done with ISO3:SYC
Done with ISO3:NPL
Done with ISO3:CXR
Done with ISO3:LAO
...

In [96]:
complete_df = pd.concat(appended_data, axis=0)
complete_df.to_csv("CompleteListEMDAT.csv")

In [100]:
complete_df.dtypes

dis_subtype       object
dis_type          object
end_date          object
insur_dam         object
iso               object
location          object
st_date           object
total_affected    object
total_dam         object
total_deaths      object
dtype: object

In [102]:
df = complete_df.copy()

In [105]:
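# convert_objects() has been removed from current pandas; pd.to_numeric is the replacement.
# to_numeric is a module-level function, so calling df.to_numeric(...) raises
# AttributeError: 'DataFrame' object has no attribute 'to_numeric'
num_cols = ['total_affected', 'total_dam', 'total_deaths', 'insur_dam']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')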

In [104]:
df.head()

| disaster_no | dis_subtype | dis_type | end_date | insur_dam | iso | location | st_date | total_affected | total_dam | total_deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| 1910-0005 | Ground movement | Earthquake | 24/06/1910 | | DZA | Kabylie | 24/06/1910 | | | 12 |
| 1927-0012 | -- | Flood | 01/11/1927 | | DZA | Mostaganem | 01/11/1927 | | | 3000 |
| 1946-0001 | Ground movement | Earthquake | 12/02/1946 | | DZA | Constantine | 12/02/1946 | | | 276 |
| 1952-0026 | -- | Flood | /09/1952 | | DZA | | /09/1952 | | | 25 |
| 1954-0017 | Ground movement | Earthquake | 09/09/1954 | 129250.0 | DZA | Orleansville | 09/09/1954 | | 6000.0 | 1250 |
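
Note that some dates in this output are incomplete: `/09/1952` has no day. A strict date parser would reject such values, so one option is to split each date into its parts and keep whatever is present. The helper below is a hypothetical sketch (`split_emdat_date` is not part of the scraping script), and it assumes every value follows the `dd/mm/yyyy` layout with optional empty parts.

```python
def split_emdat_date(text):
    """Split a 'dd/mm/yyyy' string into (day, month, year); missing parts become None."""
    parts = text.split("/")
    if len(parts) != 3:
        raise ValueError("expected dd/mm/yyyy, got %r" % text)
    # An empty part means that component was not recorded.
    day, month, year = [int(p) if p.strip() else None for p in parts]
    return day, month, year


print(split_emdat_date("24/06/1910"))  # (24, 6, 1910)
print(split_emdat_date("/09/1952"))    # (None, 9, 1952)
print(split_emdat_date("//1903"))      # (None, None, 1903)
```

Keeping the three parts separate avoids inventing a day or month for events where only the year is known.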


In [106]:
len(df)

12128

In [107]:
df.to_pickle("complete_emdat_nat_haz.pkl") 

In [16]:
cols = data['data'][0]
cols

{u'associated_dis': u'--',
 u'associated_dis2': u'--',
 u'country_name': u'United States',
 u'dis_subtype': u'Tropical cyclone',
 u'dis_type': u'Storm',
 u'disaster_no': u'1900-0003',
 u'end_date': u'08/09/1900',
 u'insur_dam': u'0',
 u'iso': u'USA',
 u'location': u'Galveston (Texas)',
 u'start_date': u'08/09/1900',
 u'total_affected': u'0',
 u'total_dam': u'30000',
 u'total_deaths': u'6000'}

In [24]:
disaster_df = pd.DataFrame.from_dict(output).T  # transpose: one row per disaster
disaster_df.head(5)

| disaster_no | dis_subtype | dis_type | end_date | insur_dam | iso | location | st_date | total_affected | total_dam | total_deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| 1900-0003 | Tropical cyclone | Storm | 08/09/1900 | 0 | USA | Galveston (Texas) | 08/09/1900 | 0 | 30000 | 6000 |
| 1903-0002 | -- | Flood | //1903 | 0 | USA | Passaic, Delaware | //1903 | 0 | 480000 | 0 |
| 1903-0003 | -- | Flood | //1903 | 0 | USA | Heppner, Oregon | //1903 | 0 | 0 | 250 |
| 1903-0010 | Convective storm | Storm | //1903 | 0 | USA | Gainesville (Georgia) | //1903 | 0 | 0 | 98 |
| 1906-0004 | Tropical cyclone | Storm | //1906 | 0 | USA | Florida | //1906 | 0 | 0 | 164 |


In [25]:
disaster_df.dtypes

dis_subtype       object
dis_type          object
end_date          object
insur_dam         object
iso               object
location          object
st_date           object
total_affected    object
total_dam         object
total_deaths      object
dtype: object
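
Every column comes back as `object` because the JSON values are strings. A minimal sketch of converting the count and damage columns with `pd.to_numeric` follows. The toy frame only mimics the shape of the scraped data, and the `"--"` in `total_dam` is an assumed placeholder added to show how unparseable values are handled.

```python
import pandas as pd

# Toy frame with string values shaped like the scraped fields.
toy = pd.DataFrame(
    {
        "dis_type": ["Storm", "Flood", "Flood"],
        "total_dam": ["30000", "480000", "--"],  # "--" is an assumed bad value
        "total_deaths": ["6000", "0", "250"],
    },
    index=["1900-0003", "1903-0002", "1903-0003"],
)

numeric_cols = ["total_dam", "total_deaths"]
# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
```

Converting only the named columns leaves text fields such as `dis_type` untouched.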

In [29]:
disaster_df.location

1900-0003                                    Galveston (Texas)
1903-0002                                    Passaic, Delaware
1903-0003                                      Heppner, Oregon
1903-0010                                Gainesville (Georgia)
1906-0004                                              Florida
1906-0005                      Mississippi, Alabama, Pensacola
1906-0013                           San Francisco (California)
1909-0003                                Grand Isle (Lousiana)
1909-0004                                      Velasco (Texas)
1909-0017                                            Louisiana
1910-0003                                              Florida
1913-0005                                 Ohio, Indiana, Texas
1915-0004                                            El Centro
1915-0005           Cavelston (Texas), New Orleans (Louisiana)
1918-0003                                Louisiana (Southwest)
1918-0007                                  Minnesota Wi

# Plotting

In [None]:
import matplotlib.pyplot as plt


The original script I ran...
```python
import json
import urllib2  # Python 2 only; Python 3 moved urlopen to urllib.request

import pandas as pd

# Raw records for every country scraped so far, keyed by ISO3 code.
complete_dict = {}

# JSON key in the EM-DAT response -> column name in the DataFrame.
FIELDS = {
    'iso': 'iso',
    'location': 'location',
    'dis_type': 'dis_type',
    'dis_subtype': 'dis_subtype',
    'start_date': 'st_date',
    'end_date': 'end_date',
    'total_affected': 'total_affected',
    'total_dam': 'total_dam',
    'total_deaths': 'total_deaths',
    'insur_dam': 'insur_dam',
}


def emudat2df(countryiso):
    """Fetch the disaster list for one ISO3 code and return it as a DataFrame."""
    url = ('http://www.emdat.be/disaster_list/php/search.php?_dc=1455563990858&continent=&region=&iso='
           + countryiso +
           '&from=1900&to=2015&group=Climatological%27%2C%27Geophysical%27%2C%27Hydrological%27%2C%27Meteorological'
           '&type=&options=associated_dis%2Cassociated_dis2%2Ctotal_deaths%2Ctotal_affected%2Ctotal_dam%2Cinsur_dam'
           '&page=1&start=0&limit=25')
    response = urllib2.urlopen(url)
    data = json.load(response)
    # One entry per disaster, keyed by its disaster number.
    # reversed() keeps the last-to-first order of the original while loop.
    output = {}
    for record in reversed(data['data']):
        output[record['disaster_no']] = {col: record[key] for key, col in FIELDS.items()}
    complete_dict[countryiso] = output
    # Transpose so each row is a disaster and each column a field.
    return pd.DataFrame.from_dict(output).T
```
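
The long query string in the script is easier to read and adapt when it is assembled from its parts. The sketch below rebuilds the same URL with the standard library. It is written for Python 3 (`urllib.parse`), and `build_emdat_url` and its `year_from`/`year_to` parameters are hypothetical names of mine. It only builds the string; whether the endpoint still answers is not something this sketch checks.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.emdat.be/disaster_list/php/search.php"


def build_emdat_url(countryiso, year_from=1900, year_to=2015):
    """Assemble the search URL for one ISO3 code (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the script.
    params = [
        ("_dc", "1455563990858"),
        ("continent", ""),
        ("region", ""),
        ("iso", countryiso),
        ("from", year_from),
        ("to", year_to),
        ("group", "Climatological','Geophysical','Hydrological','Meteorological"),
        ("type", ""),
        ("options", "associated_dis,associated_dis2,total_deaths,"
                    "total_affected,total_dam,insur_dam"),
        ("page", 1),
        ("start", 0),
        ("limit", 25),
    ]
    # urlencode percent-escapes the quotes and commas (%27, %2C).
    return BASE_URL + "?" + urlencode(params)


print(build_emdat_url("DZA"))
```

With the parameters spelled out, changing the year range or the disaster groups no longer means editing an escaped string by hand.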


# Sources Cited

[ref1]: http://fabio-lana.blogspot.it/2015/07/using-python-for-mixing-hystorical-data.html "Fabio's Blog"