# Data Cleansing

## Data metadata

### [Natural Disaster Trends](https://zenodo.org/records/7930540)

**Disaster_Group:** EM-DAT stores different types of disasters: natural, technological and complex. The dataset has been filtered to contain only natural disasters, so this column is "Natural" for all rows; it is kept for compatibility reasons.

**Disaster_Subgroup:** Every natural disaster is assigned to one of the following six subgroups: Biological, Geophysical, Climatological, Hydrological, Meteorological and Extra-terrestrial to describe the type of natural disaster. No missing values are present for this attribute.

**Disaster_type:** For every natural disaster event one main disaster type is identified. If two or more disasters are related because they are consequences of each other, then this information is encoded in the attributes Associated_Dis and Associated_Dis2. No missing values are present for this attribute.

**Disaster Sub-Type:** Subdivision of the attribute Disaster_type, so that, for example, the disaster type Storm can be further classified as a tropical, extra-tropical or convective storm.

**Disaster Sub-Sub Type:** Any appropriate subdivision of the disaster sub-type (not applicable to all disaster sub-types). The database provides these two further levels for breaking down the types of natural disasters. For example, the disaster type Storm can be subdivided into Tropical storm, Extra-tropical storm or Convective storm, and the category Convective storm can be subdivided even further. Since the analysis aims to detect trends at a high level, classifying each event by the attributes Disaster_Subgroup and Disaster_Type was considered sufficient; the further subdivisions Disaster_Subtype and Disaster_Subsubtype are intended only for detailed analysis. The full table is shown in the Appendix.

**Associated_Dis:** Secondary event triggered by a natural disaster (e.g. a landslide after a flood, an explosion after an earthquake, ...).

**Associated_Dis2:** Another secondary event triggered by a natural disaster (e.g. a landslide after a flood, an explosion after an earthquake, ...).

Example: If a tsunami is triggered by an earthquake, then the attribute Disaster_Type would be Earthquake, the attribute Disaster_Subtype would be Ground movement and the attribute Associated_Dis would be Tsunami/Tidal wave.
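
The encoding in this example can be queried directly with pandas. The following is a minimal sketch: the three rows are made up for illustration and are not taken from EM-DAT, and only the column names follow the attribute descriptions above.

```python
import pandas as pd

# Toy rows (hypothetical values) that mimic the encoding described above.
events = pd.DataFrame(
    {
        "Disaster_Type": ["Earthquake", "Earthquake", "Flood"],
        "Disaster_Subtype": ["Ground movement", "Ground movement", "Riverine flood"],
        "Associated_Dis": ["Tsunami/Tidal wave", None, "Landslide"],
    }
)

# Earthquakes whose secondary event was a tsunami.
is_quake = events["Disaster_Type"] == "Earthquake"
has_tsunami = events["Associated_Dis"] == "Tsunami/Tidal wave"
tsunami_quakes = events[is_quake & has_tsunami]

print(len(tsunami_quakes))
```

On the real data the same check would probably also need to look at Associated_Dis2, since a secondary event can be recorded in either attribute.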

**Country:** The country in which the disaster occurred or had an impact. If a disaster affected more than one country, a separate entry is created in the database for each country affected. No missing values are present for this attribute.

**ISO:** Unique 3-letter code for each country defined by ISO 3166. No missing values are present for this attribute.

**Region:** The region to which the country belongs, based on the UN regional division. No missing values are present for this attribute.

**Continent:** The continent to which the country belongs. No missing values are present for this attribute.

**Start_Year:** The year when the disaster occurred. No missing values are present for this attribute.

**End Year:** The year when the disaster ended. No missing values are present for this attribute. For sudden-impact disasters, the month and the day are also well defined and available. For disasters that develop gradually over a longer period (e.g. drought) and have no specific start date, the day attribute is empty. For our questions the exact date plays a subordinate role, so the year in which the disaster began is sufficient for our analysis.

**Total_Deaths:** Number of people who lost their lives because of the event, plus the number of people whose whereabouts since the disaster are unknown and who are presumed dead based on official figures. Missing values are present for approx. 25% of all events.

**No_Affected:** Number of people requiring immediate assistance during an emergency situation. The indicator "affected" is often reported and is widely used by different actors to convey the extent, impact, or severity of a disaster in non-spatial terms. If no value is available for the attribute Total_Deaths, this attribute could be used as a proxy.
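
One possible way to implement the proxy idea is sketched below. The values are hypothetical, and the helper columns `deaths_missing` and `impact_proxy` are names invented for this illustration. The sketch uses the attribute name No_Affected from the description above; the loaded CSV names this column Total_Affected.

```python
import pandas as pd

# Hypothetical values; column names follow the attribute descriptions above.
events = pd.DataFrame(
    {
        "Total_Deaths": [12.0, None, 0.0],
        "No_Affected": [300.0, 1500.0, None],
    }
)

# Record where the death toll is missing before anything is filled in.
events["deaths_missing"] = events["Total_Deaths"].isna()

# Use the death toll when present, otherwise fall back to the number affected.
events["impact_proxy"] = events["Total_Deaths"].fillna(events["No_Affected"])

print(events)
```

Keeping the `deaths_missing` flag matters because the proxy column mixes two different quantities: without the flag, a fallback value cannot be told apart from a reported death toll.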

### [Global Spread of Conflict by Country and Population](https://datacatalog.worldbank.org/search/dataset/0041070/Global-Spread-of-Conflict-by-Country-and-Population)

This dataset describes the global spread of conflict, by population and by country, for the years 2000-2016.

```Fields: TODO```

### [World Happiness Report](https://worldhappiness.report/data/)

```Fields: TODO```

### [Climate Change: Earth Surface Temperature Data]()

**Date:** starts in 1750 for average land temperature and in 1850 for maximum and minimum land temperatures and for global ocean and land temperatures

**LandAverageTemperature:** global average land temperature in Celsius

**LandAverageTemperatureUncertainty:** the 95% confidence interval around the average

**LandMaxTemperature:** global average maximum land temperature in Celsius

**LandMaxTemperatureUncertainty:** the 95% confidence interval around the maximum land temperature

**LandMinTemperature:** global average minimum land temperature in Celsius

**LandMinTemperatureUncertainty:** the 95% confidence interval around the minimum land temperature

**LandAndOceanAverageTemperature:** global average land and ocean temperature in Celsius

**LandAndOceanAverageTemperatureUncertainty:** the 95% confidence interval around the global average land and ocean temperature
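
The uncertainty columns can be turned into explicit interval bounds. The sketch below rests on two assumptions that the descriptions above do not confirm: that each uncertainty value is the half-width of the 95% interval (value ± uncertainty), and that dates are stored as ISO strings. The two rows and the `lower` and `upper` column names are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the field list above.
temps = pd.DataFrame(
    {
        "Date": ["1850-01-01", "1850-02-01"],
        "LandAverageTemperature": [0.75, 3.0],
        "LandAverageTemperatureUncertainty": [1.0, 1.25],
    }
)

# Parse the date strings so the frame can later be grouped by year or decade.
temps["Date"] = pd.to_datetime(temps["Date"])

# Assumption: the uncertainty is the half-width of the 95% interval.
avg = temps["LandAverageTemperature"]
unc = temps["LandAverageTemperatureUncertainty"]
temps["lower"] = avg - unc
temps["upper"] = avg + unc

print(temps[["Date", "lower", "upper"]])
```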

In [82]:
# Imports

import pandas as pd

### disaster.csv

In [83]:
# load disaster.csv

disaster_df = pd.read_csv('../data/disaster.csv')
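
When a CSV file was written with its row index, the first header field is empty and pandas names that column `Unnamed: 0` on load. As an alternative to renaming the column afterwards, it can be read as the index right away. The sketch below uses a small in-memory stand-in for the file; its contents are made up.

```python
import io

import pandas as pd

# Stand-in for ../data/disaster.csv (hypothetical rows): the first header
# field is empty, which is what produces the "Unnamed: 0" column on load.
csv_text = ",Disaster_Type,Start_Year\n0,Flood,1990\n1,Storm,1991\n"

# Default load: the unnamed first column becomes a regular data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```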

In [84]:
# get insights on data

print(disaster_df.shape)
print()
print(disaster_df.dtypes)
print()
print(disaster_df.count())

(16132, 17)

Unnamed: 0               int64
Disaster_Group          object
Disaster_Subgroup       object
Disaster_Type           object
Disaster_Subtype        object
Disaster_Subsubtype     object
Country                 object
ISO                     object
Region                  object
Continent               object
Associated_Dis          object
Associated_Dis2         object
Start_Year               int64
End_Year                 int64
Total_Deaths           float64
Total_Affected         float64
Disaster_Decade          int64
dtype: object

Unnamed: 0             16132
Disaster_Group         16132
Disaster_Subgroup      16132
Disaster_Type          16132
Disaster_Subtype       13001
Disaster_Subsubtype     1074
Country                16132
ISO                    16132
Region                 16132
Continent              16132
Associated_Dis          3402
Associated_Dis2          717
Start_Year             16132
End_Year               16132
Total_Deaths           11485
Total_Affe

In [85]:
# name the first column ("Unnamed: 0") "id" and make it the index
disaster_df.rename(columns={"Unnamed: 0":"id"}, inplace=True)
disaster_df.set_index("id",inplace=True)

# remove columns that are unnecessary or have many missing values
disaster_df.drop(columns=["Disaster_Subtype","Disaster_Subsubtype","Associated_Dis","Associated_Dis2","Disaster_Decade"],inplace=True)
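
The columns above are chosen by hand. A data-driven variant could rank columns by their share of missing values and drop those above a cut-off. This is only a sketch on made-up rows, and the 0.5 threshold is an arbitrary assumption; it would not necessarily reproduce the hand-picked list above.

```python
import pandas as pd

# Hypothetical rows with two sparsely filled columns.
df = pd.DataFrame(
    {
        "Disaster_Type": ["Flood", "Storm", "Drought", "Flood"],
        "Disaster_Subsubtype": [None, None, None, "Hail"],
        "Associated_Dis2": [None, "Landslide", None, None],
    }
)

# Fraction of missing values per column (isna gives booleans, mean averages them).
missing_share = df.isna().mean()

# Assumed cut-off: drop columns where more than half of the values are missing.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()
trimmed = df.drop(columns=sparse_columns)

print(missing_share)
print(list(trimmed.columns))
```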

In [86]:
# replace NaNs

# replace NaNs in the Total_Deaths column with 0, so that the column reflects only the deaths we are aware of
disaster_df["Total_Deaths"] = disaster_df["Total_Deaths"].fillna(0)

# replace NaNs in the Total_Affected column with the death count, because at least that many people were affected
disaster_df["Total_Affected"] = disaster_df["Total_Affected"].fillna(disaster_df["Total_Deaths"])

# fix the types of the columns
disaster_df = disaster_df.convert_dtypes()
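
Filling with 0 makes "unknown" indistinguishable from "zero deaths reported". If that distinction might matter later, one option is to record which cells were empty before filling. The sketch below applies the same fill rules as the cell above to made-up rows; the flag columns `Deaths_Imputed` and `Affected_Imputed` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows covering the three cases: deaths known, affected known, both unknown.
df = pd.DataFrame(
    {
        "Total_Deaths": [10.0, None, None],
        "Total_Affected": [None, 200.0, None],
    }
)

# Remember which cells were originally empty.
df["Deaths_Imputed"] = df["Total_Deaths"].isna()
df["Affected_Imputed"] = df["Total_Affected"].isna()

# Same fill rules as in the cell above.
df["Total_Deaths"] = df["Total_Deaths"].fillna(0)
df["Total_Affected"] = df["Total_Affected"].fillna(df["Total_Deaths"])

# Whole-number floats become the nullable Int64 dtype.
df = df.convert_dtypes()

print(df)
```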

In [87]:
# observe the data again

print(disaster_df.shape)
print()
print(disaster_df.dtypes)
print()
print(disaster_df.count())

(16132, 11)

Disaster_Group       string[python]
Disaster_Subgroup    string[python]
Disaster_Type        string[python]
Country              string[python]
ISO                  string[python]
Region               string[python]
Continent            string[python]
Start_Year                    Int64
End_Year                      Int64
Total_Deaths                  Int64
Total_Affected                Int64
dtype: object

Disaster_Group       16132
Disaster_Subgroup    16132
Disaster_Type        16132
Country              16132
ISO                  16132
Region               16132
Continent            16132
Start_Year           16132
End_Year             16132
Total_Deaths         16132
Total_Affected       16132
dtype: int64
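
Instead of comparing the printed counts by eye, the expectations for the cleaned frame could be written as explicit checks. The following is a sketch on made-up rows; `check_cleaned` is a hypothetical helper, and the rules that End_Year must not precede Start_Year and that counts must not be negative are assumptions, not properties documented for the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the cleaned frame.
cleaned = pd.DataFrame(
    {
        "ISO": ["IND", "CAN"],
        "Start_Year": [1990, 2001],
        "End_Year": [1990, 2002],
        "Total_Deaths": [5, 0],
        "Total_Affected": [100, 0],
    }
).convert_dtypes()


def check_cleaned(df):
    """Return a list of human-readable problems found in the frame."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    # Assumed rule: a disaster cannot end before it starts.
    if (df["End_Year"] < df["Start_Year"]).any():
        problems.append("End_Year before Start_Year")
    # Assumed rule: counts of people cannot be negative.
    if (df[["Total_Deaths", "Total_Affected"]] < 0).any().any():
        problems.append("negative counts")
    return problems


print(check_cleaned(cleaned))
```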


### create the countries dataframe

In [88]:
# create a new dataframe to store the country names, continents and ISO codes
# (drop duplicates before ISO becomes the index, because drop_duplicates ignores the index)
countries_df = disaster_df[["ISO","Country","Continent"]].drop_duplicates().set_index("ISO")
countries_df.rename(columns={"Country":"Name"}, inplace=True)

# remove the redundant country name and continent columns from disaster dataframe
disaster_df.drop(columns=["Country","Continent"], inplace=True)

| ISO | Name | Continent |
| --- | --- | --- |
| CPV | Cabo Verde | Africa |
| IND | India | Asia |
| GTM | Guatemala | Americas |
| CAN | Canada | Americas |
| COM | Comoros (the) | Africa |
| ... | ... | ... |
| QAT | Qatar | Asia |
| BLM | Saint Barthélemy | Americas |
| MAF | Saint Martin (French Part) | Americas |
| SXM | Sint Maarten (Dutch part) | Americas |
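
After the split, the country name and continent of an event have to be looked up through its ISO code. The sketch below shows one way to build the lookup table and join it back. The frame names are hypothetical and the rows are toy data: the country, ISO and continent values are taken from the table above, while the death counts are invented.

```python
import pandas as pd

# Toy events; the Total_Deaths values are invented.
disasters = pd.DataFrame(
    {
        "ISO": ["IND", "IND", "CAN"],
        "Country": ["India", "India", "Canada"],
        "Continent": ["Asia", "Asia", "Americas"],
        "Total_Deaths": [5, 0, 2],
    }
)

# One row per country, keyed by ISO code.
countries = (
    disasters[["ISO", "Country", "Continent"]]
    .drop_duplicates()
    .set_index("ISO")
    .rename(columns={"Country": "Name"})
)

# Drop the columns that are now redundant in the event table.
events = disasters.drop(columns=["Country", "Continent"])

# Join the country details back on; validate raises if an ISO code is not unique in countries.
restored = events.merge(
    countries, left_on="ISO", right_index=True, how="left", validate="many_to_one"
)

print(restored)
```

The `validate="many_to_one"` argument acts as a guard: if the same ISO code ever appeared with two different names, the lookup table would contain duplicate keys and the merge would fail rather than silently duplicate events.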


### conflict.xlsx

### happiness.xls

### temperature.csv