# 01 Obtaining Data

This project started from an observation: countries with fewer economic resources seem to experience more environmental catastrophes (hurricanes, typhoons) than countries that enjoy financial success. This made us wonder whether repeated natural disasters affect a country's economy.

To find out, we test this hypothesis against data. This notebook describes how we obtained the data for the project.


In [1]:
import pandas as pd
import warnings

## 1.1 Nat. Disasters - Importing Data


We obtained a dataset about natural disasters from [Kaggle](https://www.kaggle.com/datasets/jnegrini/emdat19002021?resource=download). It contains information about natural disasters from 1900 to 2021. The data is provided by [EM-DAT](https://www.emdat.be/), the Emergency Events Database. EM-DAT was created with the support of the World Health Organisation (WHO) and the Belgian Government. According to EM-DAT, the data is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.



In [8]:
raw_natural_disaster_df = pd.read_csv('data/all_natural_disasters.csv')

relevant_columns = ['Year', 'Disaster Subgroup', 'Disaster Type', 'Event Name', 'Country', 
'ISO', 'Region', 'Continent', 'Start Year', 'End Year', 'Total Deaths', 'Total Affected']

natural_disaster_df = raw_natural_disaster_df[relevant_columns]
natural_disaster_df.head()

Unnamed: 0,Year,Disaster Subgroup,Disaster Type,Event Name,Country,ISO,Region,Continent,Start Year,End Year,Total Deaths,Total Affected
0,1900,Climatological,Drought,,Cabo Verde,CPV,Western Africa,Africa,1900,1900,11000.0,
1,1900,Climatological,Drought,,India,IND,Southern Asia,Asia,1900,1900,1250000.0,
2,1902,Geophysical,Earthquake,,Guatemala,GTM,Central America,Americas,1902,1902,2000.0,
3,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,1902,1902,1000.0,
4,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,1902,1902,6000.0,


## 1.2 Nat. Disasters - Handling Columns

### 1.2.1 Start & End Year
The Start Year and End Year columns are combined into a single Duration column (End Year minus Start Year). This makes it easier to identify events that spanned more than one calendar year.

In [12]:
# Transforming Start and End Year Column to Duration
natural_disaster_df = natural_disaster_df.assign(Duration=(natural_disaster_df['End Year']-natural_disaster_df['Start Year']))
natural_disaster_df = natural_disaster_df.drop(['Start Year', 'End Year'], axis=1)
natural_disaster_df.head()

Unnamed: 0,Year,Disaster Subgroup,Disaster Type,Event Name,Country,ISO,Region,Continent,Total Deaths,Total Affected,Duration
0,1900,Climatological,Drought,,Cabo Verde,CPV,Western Africa,Africa,11000.0,,0
1,1900,Climatological,Drought,,India,IND,Southern Asia,Asia,1250000.0,,0
2,1902,Geophysical,Earthquake,,Guatemala,GTM,Central America,Americas,2000.0,,0
3,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,1000.0,,0
4,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,6000.0,,0


In [15]:
natural_disaster_df['Duration'].unique()

array([ 0,  4,  9,  2,  3,  6,  1,  5, 50], dtype=int64)

In [17]:
natural_disaster_df['Duration'].value_counts()

0     15522
1       476
2        56
4        31
3        25
5        10
6         3
9         2
50        1
Name: Duration, dtype: int64

Most natural disasters started and ended within the same calendar year (Duration 0). It would probably be more informative to know how many days or months they lasted.

TODO: Here we could investigate a little what is the best way to measure time (days/weeks/months/years)
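
One possible way to approach the TODO above is to compute durations in days from full start and end dates. The sketch below assumes that the raw EM-DAT export also carries `Start Month`, `Start Day`, `End Month` and `End Day` columns; these are not among the columns selected in this notebook, so the names are an assumption. It runs on a small made-up frame, not on the real data, and `assemble_dates` is a hypothetical helper.

```python
import pandas as pd

def assemble_dates(df, prefix):
    """Build a datetime Series from '<prefix> Year/Month/Day' columns.

    Rows with any missing date part become NaT instead of being guessed.
    """
    parts = df[[f'{prefix} Year', f'{prefix} Month', f'{prefix} Day']].copy()
    parts.columns = ['year', 'month', 'day']
    # Only rows with all three parts present are converted
    complete = parts.dropna().astype(int)
    dates = pd.to_datetime(complete, errors='coerce')
    # Re-align with the original index; incomplete rows end up as NaT
    return dates.reindex(df.index)

# Made-up rows for illustration only (assumed column names, not real records)
sample_df = pd.DataFrame({
    'Start Year': [2000, 2001], 'Start Month': [1, 6], 'Start Day': [1, None],
    'End Year': [2000, 2001], 'End Month': [1, 7], 'End Day': [31, 15],
})

start_dates = assemble_dates(sample_df, 'Start')
end_dates = assemble_dates(sample_df, 'End')
duration_days = (end_dates - start_dates).dt.days
print(duration_days)
```

Events with an incomplete start or end date keep a missing duration here, which seems safer than silently filling in a default month or day.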

In [21]:
# Find the event with a duration of 50 years. Probably an outlier or a data-entry mistake?
for idx, value in enumerate(natural_disaster_df['Duration']):
    if value == 50:
        print(f"index {idx}")
        print(natural_disaster_df.iloc[idx])

index 15421
Year                            1969
Disaster Subgroup        Geophysical
Disaster Type             Earthquake
Event Name                       NaN
Country                      Morocco
ISO                              MAR
Region               Northern Africa
Continent                     Africa
Total Deaths                    11.0
Total Affected                   NaN
Duration                          50
Name: 15421, dtype: object
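
The same kind of lookup can also be sketched with a boolean mask instead of an explicit loop. The frame below is made up and only stands in for `natural_disaster_df`.

```python
import pandas as pd

# Made-up frame standing in for natural_disaster_df
toy_df = pd.DataFrame({
    'Year': [1968, 1969, 1970],
    'Country': ['A', 'B', 'C'],
    'Duration': [0, 50, 1],
})

# The comparison yields a boolean Series; indexing with it keeps matching rows
long_events = toy_df[toy_df['Duration'] == 50]
print(long_events)
```

The mask returns all matching rows at once and keeps their original index labels, so no separate index bookkeeping is needed.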


### 1.2.2 Disaster Subgroup & Type
- Subgroup is renamed to Group, since the original Group column contained the same value (Natural) in every row.
- "Disaster" can be removed from the column names.

In [22]:
rename_dic = {'Disaster Subgroup' : 'Group', 'Disaster Type' : 'Type'}
natural_disaster_df = natural_disaster_df.rename(columns=rename_dic)
natural_disaster_df.head()

Unnamed: 0,Year,Group,Type,Event Name,Country,ISO,Region,Continent,Total Deaths,Total Affected,Duration
0,1900,Climatological,Drought,,Cabo Verde,CPV,Western Africa,Africa,11000.0,,0
1,1900,Climatological,Drought,,India,IND,Southern Asia,Asia,1250000.0,,0
2,1902,Geophysical,Earthquake,,Guatemala,GTM,Central America,Americas,2000.0,,0
3,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,1000.0,,0
4,1902,Geophysical,Volcanic activity,Santa Maria,Guatemala,GTM,Central America,Americas,6000.0,,0


## 1.3 Nat. Disasters - Display Dataframe Info

In [23]:
natural_disaster_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16126 entries, 0 to 16125
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Year            16126 non-null  int64  
 1   Group           16126 non-null  object 
 2   Type            16126 non-null  object 
 3   Event Name      3861 non-null   object 
 4   Country         16126 non-null  object 
 5   ISO             16126 non-null  object 
 6   Region          16126 non-null  object 
 7   Continent       16126 non-null  object 
 8   Total Deaths    11413 non-null  float64
 9   Total Affected  11617 non-null  float64
 10  Duration        16126 non-null  int64  
dtypes: float64(2), int64(2), object(7)
memory usage: 1.4+ MB


## 1.4 Nat. Disasters - Preparing for Merge

### 1.4.1 Dropping irrelevant Years
The World Bank API only provides GDPs for the years 1960 to 2021. To test whether an event affects a country's GDP, we use the GDP from the year before the event, the year of the event, and the three years after it. This means that only events from 1961 to 2018 can be used in this project.

In [33]:
drop_indexes = natural_disaster_df[(natural_disaster_df['Year'] < 1961)].index
drop_indexes = drop_indexes.append(natural_disaster_df[(natural_disaster_df['Year'] > 2018)].index)
nat_df = natural_disaster_df.drop(drop_indexes)

#nat_df['Year'].unique()
nat_df['Year'].value_counts()

2002    532
2000    523
2005    498
2007    449
2001    447
2006    446
2010    441
1999    416
2004    405
2008    400
2015    398
2003    392
2009    384
2017    371
2012    370
1998    363
2011    357
2013    353
2016    350
2014    348
2018    338
1997    323
1990    303
1995    277
1996    273
1993    267
1991    266
1994    255
1988    234
1992    232
1987    227
1983    206
1989    189
1985    175
1986    174
1984    156
1982    150
1981    146
1980    144
1977    141
1978    137
1979    122
1976     99
1966     84
1968     83
1969     83
1970     82
1967     80
1974     72
1965     68
1975     67
1973     65
1972     63
1964     63
1971     63
1963     44
1961     29
1962     29
Name: Year, dtype: int64
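
The 1961 to 2018 window follows directly from the GDP coverage and the number of years needed around each event. A minimal sketch of that arithmetic, using a made-up frame and `Series.between` (inclusive on both ends by default) as an alternative way to filter:

```python
import pandas as pd

GDP_FIRST_YEAR, GDP_LAST_YEAR = 1960, 2021  # GDP coverage stated above
YEARS_BEFORE, YEARS_AFTER = 1, 3            # GDP window needed around each event

first_usable = GDP_FIRST_YEAR + YEARS_BEFORE
last_usable = GDP_LAST_YEAR - YEARS_AFTER

# Made-up event years for illustration
toy_df = pd.DataFrame({'Year': [1959, 1960, 1961, 2000, 2018, 2019]})

# Keep only the events whose full GDP window is available
usable_df = toy_df[toy_df['Year'].between(first_usable, last_usable)]
usable_df = usable_df.reset_index(drop=True)
print(first_usable, last_usable)
print(usable_df)
```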

In [34]:
# Fixing Index
nat_df = nat_df.reset_index(drop=True)
nat_df.head()

Unnamed: 0,Year,Group,Type,Event Name,Country,ISO,Region,Continent,Total Deaths,Total Affected,Duration
0,1961,Meteorological,Storm,,Bangladesh,BGD,Southern Asia,Asia,11000.0,,0
1,1961,Meteorological,Storm,,Bangladesh,BGD,Southern Asia,Asia,,,0
2,1961,Meteorological,Storm,,Bangladesh,BGD,Southern Asia,Asia,266.0,,0
3,1961,Meteorological,Storm,Hattie,Belize,BLZ,Central America,Americas,275.0,,0
4,1961,Climatological,Drought,,Canada,CAN,Northern America,Americas,,,0


### 1.4.2 Obtaining List of ISO Codes

In [35]:
# Unique ISO codes, so that the GDPs are requested only once per country
country_iso_codes = nat_df['ISO'].unique().tolist()
len(country_iso_codes)

227

## 1.5 GDP's - Testing API

## Requests
- GDP Definition: https://api.worldbank.org/v2/indicator/NY.GDP.MKTP.CD
- Getting GDP all: https://api.worldbank.org/v2/country/all/indicator/NY.GDP.MKTP.CD?page=1
- Getting GDP by Country ISO Code: https://api.worldbank.org/v2/country/cpv/indicator/NY.GDP.MKTP.CD?per_page=62
- Getting GDP for Certain Years: https://api.worldbank.org/v2/country/cpv/indicator/NY.GDP.MKTP.CD?date=1960:1964
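
The request patterns listed above differ only in the country code and the query parameters, so they can be assembled by a small helper. This is a sketch: `build_gdp_url` is a hypothetical function written for illustration, and it only builds the URL string without sending a request.

```python
BASE_URL = 'https://api.worldbank.org/v2'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'  # GDP (current US$)

def build_gdp_url(iso, start_year=None, end_year=None, per_page=None):
    """Assemble a GDP request URL for one country code (or 'all')."""
    url = f'{BASE_URL}/country/{iso}/indicator/{GDP_INDICATOR}'
    params = []
    if start_year is not None and end_year is not None:
        params.append(f'date={start_year}:{end_year}')
    if per_page is not None:
        params.append(f'per_page={per_page}')
    # Append the query string only when at least one parameter is set
    return url + ('?' + '&'.join(params) if params else '')

print(build_gdp_url('cpv', 1960, 1964))
print(build_gdp_url('cpv', per_page=62))
```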

In [36]:
import requests
import xml.etree.ElementTree as ET
from os.path import exists

### 1.5.1 Testing Request and Handling XML Response

In [40]:
response = requests.get('https://api.worldbank.org/v2/country/cpv/indicator/NY.GDP.MKTP.CD?date=2000:2001')

root = ET.fromstring(response.content)

for child in root:
    print(child.tag)
    for subchild in child:
        print('\t', subchild.tag, subchild.text)

{http://www.worldbank.org}data
	 {http://www.worldbank.org}indicator GDP (current US$)
	 {http://www.worldbank.org}country Cabo Verde
	 {http://www.worldbank.org}countryiso3code CPV
	 {http://www.worldbank.org}date 2001
	 {http://www.worldbank.org}value 563024383.296626
	 {http://www.worldbank.org}unit None
	 {http://www.worldbank.org}obs_status None
	 {http://www.worldbank.org}decimal 0
{http://www.worldbank.org}data
	 {http://www.worldbank.org}indicator GDP (current US$)
	 {http://www.worldbank.org}country Cabo Verde
	 {http://www.worldbank.org}countryiso3code CPV
	 {http://www.worldbank.org}date 2000
	 {http://www.worldbank.org}value 539227277.626411
	 {http://www.worldbank.org}unit None
	 {http://www.worldbank.org}obs_status None
	 {http://www.worldbank.org}decimal 0


In [45]:

for entry in root:
    numb = "{:,.2f}".format(float(entry.find('{http://www.worldbank.org}value').text))
    print(entry.find('{http://www.worldbank.org}date').text, ' : ',numb)

2001  :  563,024,383.30
2000  :  539,227,277.63
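
Repeating the full `{http://www.worldbank.org}` prefix in every lookup can be avoided with a namespace map, which `ElementTree` accepts in `find`, `findall` and `findtext`. The sketch below parses a hand-built, abridged XML string modelled on the response printed above; the name of the root element is an assumption, and `parse_gdp` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefix 'wb' is only used inside the search paths
NS = {'wb': 'http://www.worldbank.org'}

# Abridged sample built by hand from the output shown above
SAMPLE_XML = """<wb:data xmlns:wb="http://www.worldbank.org">
  <wb:data>
    <wb:countryiso3code>CPV</wb:countryiso3code>
    <wb:date>2001</wb:date>
    <wb:value>563024383.296626</wb:value>
  </wb:data>
  <wb:data>
    <wb:countryiso3code>CPV</wb:countryiso3code>
    <wb:date>2000</wb:date>
    <wb:value>539227277.626411</wb:value>
  </wb:data>
</wb:data>"""

def parse_gdp(xml_text):
    """Return a {year: gdp} dict; missing or empty values become None."""
    root = ET.fromstring(xml_text)
    gdp_by_year = {}
    for entry in root.findall('wb:data', NS):
        date = entry.findtext('wb:date', namespaces=NS)
        value = entry.findtext('wb:value', namespaces=NS)
        gdp_by_year[date] = float(value) if value else None
    return gdp_by_year

print(parse_gdp(SAMPLE_XML))
```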


## 1.6 GDP's - Obtaining GDP's for all Countries & Save

In [46]:
warnings.filterwarnings('ignore', category=FutureWarning)

# ONLY DO THIS STEP WHEN NECESSARY
# Duration ~ 45min
if (not exists('data/country_gdps.csv')):
    gdp_rows = []
    error_isos = []

    # Iterate over the ISO codes and collect all GDP values per country
    for iso in country_iso_codes:
        try:
            url = f'https://api.worldbank.org/v2/country/{iso}/indicator/NY.GDP.MKTP.CD?per_page=62'
            response = requests.get(url)

            root = ET.fromstring(response.content)

            gdp_dict = {'iso' : iso}
            for entry in root:
                gdp_dict[entry.find('{http://www.worldbank.org}date').text] = entry.find('{http://www.worldbank.org}value').text

            gdp_rows.append(gdp_dict)

        except Exception:
            # Collect the ISO codes for which the World Bank API returns no usable data
            error_isos.append(iso)

    # Build the dataframe once (DataFrame.append was removed in pandas 2.0);
    # sorting the columns puts the years in ascending order, followed by 'iso'
    country_gpds_df = pd.DataFrame(gdp_rows).sort_index(axis=1)

    # Save to file
    error_df = pd.DataFrame(error_isos)
    error_df.to_csv('data/worldbank_iso_erros.csv')
    country_gpds_df.to_csv('data/country_gdps.csv')

else:
    country_gpds_df = pd.read_csv('data/country_gdps.csv', index_col=0)
    error_df = pd.read_csv('data/worldbank_iso_erros.csv', index_col=0)
    error_isos = error_df['0'].to_list()

country_gpds_df.head()


Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,iso
0,4274893913.49536,4817580183.60155,5081413339.78635,5319458351.16235,5386054619.34987,5906636557.00092,6439687598.32325,7253575399.3215,7483685473.51275,8471006100.95399,...,149990451022.29,172885454931.453,195078678697.23,265236247989.155,293754646182.389,321379023557.462,351238438542.792,373902134700.41,416264942893.326,BGD
1,28071888.5622288,29964370.7125857,31856922.8615428,33749405.0118998,36193826.1234775,40069930.0699301,44405594.4055944,47379310.3448276,44910179.6407186,47305389.2215569,...,1581844936.42351,1676406801.54095,1734320479.1395,1796928936.71505,1844906692.53913,1887465218.48204,1945250235.57603,1585631670.3461,1789923264.03,BLZ
2,40461721692.6468,40934952063.9468,42227447631.9159,45029988561.2124,49377522896.703,54515179580.7148,61088384036.6515,65668655501.1254,71829810519.8955,79148411661.6902,...,1846597421834.98,1805749878439.94,1556508816217.14,1527994741907.43,1649265644244.09,1725329192783.02,1742015045482.31,1645423407568.36,1990761609665.23,CAN
3,,,,,,,,,,,...,47648211133.2183,55612228233.5179,64589334978.8013,74296618481.0882,81770791970.982,84269348327.3454,95912590628.1412,107657734392.446,111271112329.975,ETH
4,62225478000.8822,67461644222.0352,75607529809.9288,84759195105.8693,94007851047.3678,101537248148.427,110045852177.928,118972977486.207,129785441507.456,141903068680.309,...,2811876903329.03,2855964488590.19,2439188643162.5,2472964344587.17,2595151045197.65,2790956878746.66,2728870246705.88,2630317731455.26,2937472757953.44,FRA


## 1.7 GDP's - Reviewing Errors
Taking a look at the countries the World Bank does not list GDPs for.

In [55]:
nat_df[nat_df['ISO'].isin(error_isos)][['ISO', 'Country']].drop_duplicates()

Unnamed: 0,ISO,Country
16,DFR,Germany Fed Rep
558,SUN,Soviet Union


## 1.8 Nat. Disasters - Clean Up & Save
The World Bank does not seem to list GDPs for countries that no longer exist, nor for extremely small countries with populations of fewer than 10,000.

For now, these countries are excluded from the dataframe. Their GDPs can still be looked up manually should that become necessary.

In [47]:
nat_df_reduced = nat_df.drop(nat_df[nat_df['ISO'].isin(error_isos)].index)
nat_df_reduced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13834 entries, 0 to 14051
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Year            13834 non-null  int64  
 1   Group           13834 non-null  object 
 2   Type            13834 non-null  object 
 3   Event Name      3424 non-null   object 
 4   Country         13834 non-null  object 
 5   ISO             13834 non-null  object 
 6   Region          13834 non-null  object 
 7   Continent       13834 non-null  object 
 8   Total Deaths    9706 non-null   float64
 9   Total Affected  10300 non-null  float64
 10  Duration        13834 non-null  int64  
dtypes: float64(2), int64(2), object(7)
memory usage: 1.3+ MB
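
The exclusion step can also be written as a negated `isin` mask. The sketch below uses a made-up frame and additionally resets the index, which is an optional choice and not something the notebook does at this point.

```python
import pandas as pd

# Made-up frame standing in for nat_df
toy_df = pd.DataFrame({
    'ISO': ['BGD', 'DFR', 'CAN', 'SUN'],
    'Country': ['Bangladesh', 'Germany Fed Rep', 'Canada', 'Soviet Union'],
})
missing_isos = ['DFR', 'SUN']  # codes without GDP data

# ~ negates the mask, so only rows whose ISO is NOT in the list are kept
kept_df = toy_df[~toy_df['ISO'].isin(missing_isos)].reset_index(drop=True)
print(kept_df)
```

Without the reset, the remaining rows keep their original labels, which is why the `info()` output above reports 13834 entries on an index running from 0 to 14051.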


In [48]:
nat_df_reduced.to_csv('data/all_natural_disasters_reduced.csv')