## Data Preprocessing

### Dataset: public_emdat_2025-02-22.csv

In this notebook, we load the dataset into a DataFrame and remove the columns that will not be included in our ontology. The columns that remain after cleaning will be mapped onto specific ontology classes.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("public_emdat_2025-02-22.csv")
df

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,External IDs,Event Name,ISO,...,Reconstruction Costs ('000 US$),"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI,Admin Units,Entry Date,Last Update
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,DJI,...,,,,,,,58.111474,"[{""adm1_code"":1093,""adm1_name"":""Ali Sabieh""},{...",2006-03-01,2023-09-25
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SDN,...,,,,,,,56.514291,"[{""adm1_code"":2757,""adm1_name"":""Northern Darfu...",2006-03-08,2023-09-25
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,,,SOM,...,,,,,,,56.514291,"[{""adm1_code"":2691,""adm1_name"":""Bay""},{""adm1_c...",2006-03-08,2023-09-25
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Transport,Road,Road,,,AGO,...,,,,,,,56.514291,,2004-10-27,2023-09-25
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,,,AGO,...,,,,,10000.0,17695.0,56.514291,"[{""adm2_code"":4214,""adm2_name"":""Baia Farta""},{...",2005-02-03,2023-09-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16182,2025-0101-USA,No,tec-tra-air-air,Technological,Transport,Air,Air,,,USA,...,,,,,,,,,2025-02-17,2025-02-19
16183,2025-0102-NGA,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),,School dormitory,NGA,...,,,,,,,,,2025-02-17,2025-02-19
16184,2025-0103-KEN,No,tec-ind-col-col,Technological,Industrial accident,Collapse (Industrial),Collapse (Industrial),,Gold mine,KEN,...,,,,,,,,,2025-02-17,2025-02-19
16185,2025-0104-USA,No,tec-tra-air-air,Technological,Transport,Air,Air,,,USA,...,,,,,,,,,2025-02-17,2025-02-19


In [3]:
df.columns

Index(['DisNo.', 'Historic', 'Classification Key', 'Disaster Group',
       'Disaster Subgroup', 'Disaster Type', 'Disaster Subtype',
       'External IDs', 'Event Name', 'ISO', 'Country', 'Subregion', 'Region',
       'Location', 'Origin', 'Associated Types', 'OFDA/BHA Response', 'Appeal',
       'Declaration', 'AID Contribution ('000 US$)', 'Magnitude',
       'Magnitude Scale', 'Latitude', 'Longitude', 'River Basin', 'Start Year',
       'Start Month', 'Start Day', 'End Year', 'End Month', 'End Day',
       'Total Deaths', 'No. Injured', 'No. Affected', 'No. Homeless',
       'Total Affected', 'Reconstruction Costs ('000 US$)',
       'Reconstruction Costs, Adjusted ('000 US$)',
       'Insured Damage ('000 US$)', 'Insured Damage, Adjusted ('000 US$)',
       'Total Damage ('000 US$)', 'Total Damage, Adjusted ('000 US$)', 'CPI',
       'Admin Units', 'Entry Date', 'Last Update'],
      dtype='object')

In [4]:
df.isnull().sum()

DisNo.                                           0
Historic                                         0
Classification Key                               0
Disaster Group                                   0
Disaster Subgroup                                0
Disaster Type                                    0
Disaster Subtype                                 0
External IDs                                 13645
Event Name                                   11080
ISO                                              0
Country                                          0
Subregion                                        0
Region                                           0
Location                                       702
Origin                                       12139
Associated Types                             12767
OFDA/BHA Response                                0
Appeal                                           0
Declaration                                      0
AID Contribution ('000 US$)    

First, we drop every column that contains at least one null value, since we will not import any null values into the ontology:

In [5]:
df = df.dropna(axis=1)

In [6]:
df.isnull().sum()

DisNo.                0
Historic              0
Classification Key    0
Disaster Group        0
Disaster Subgroup     0
Disaster Type         0
Disaster Subtype      0
ISO                   0
Country               0
Subregion             0
Region                0
OFDA/BHA Response     0
Appeal                0
Declaration           0
Start Year            0
End Year              0
Entry Date            0
Last Update           0
dtype: int64
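
The column-wise drop above is strict: a single missing value is enough to remove a whole column, while every row is kept. The following is a minimal sketch of that behaviour on a toy frame. The frame `toy` and its values are illustrative stand-ins, not the EM-DAT data.

```python
import pandas as pd

# Toy stand-in for the EM-DAT frame; the values are illustrative only.
toy = pd.DataFrame(
    {
        "DisNo.": ["1999-9388-DJI", "2000-0001-AGO", "2000-0002-AGO"],
        "Country": ["Djibouti", "Angola", "Angola"],
        "Event Name": [None, None, "School dormitory"],  # mostly missing
        "Total Deaths": [None, 1.0, 2.0],                # one missing value
    }
)

# Record which columns contain at least one null before dropping them.
null_counts = toy.isnull().sum()
dropped = null_counts[null_counts > 0].index.tolist()

# axis=1 drops columns (not rows) that contain any missing value.
cleaned = toy.dropna(axis=1)

print("dropped:", dropped)
print("kept:", cleaned.columns.tolist())
print("rows before/after:", len(toy), len(cleaned))
```

Logging the dropped column names this way makes it easy to confirm that no column intended for the ontology was lost only because of a few missing entries.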

In [7]:
df

Unnamed: 0,DisNo.,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,ISO,Country,Subregion,Region,OFDA/BHA Response,Appeal,Declaration,Start Year,End Year,Entry Date,Last Update
0,1999-9388-DJI,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,DJI,Djibouti,Sub-Saharan Africa,Africa,Yes,No,No,2001,2001,2006-03-01,2023-09-25
1,1999-9388-SDN,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SDN,Sudan,Northern Africa,Africa,No,No,No,2000,2001,2006-03-08,2023-09-25
2,1999-9388-SOM,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SOM,Somalia,Sub-Saharan Africa,Africa,No,No,No,2000,2001,2006-03-08,2023-09-25
3,2000-0001-AGO,No,tec-tra-roa-roa,Technological,Transport,Road,Road,AGO,Angola,Sub-Saharan Africa,Africa,No,No,No,2000,2000,2004-10-27,2023-09-25
4,2000-0002-AGO,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,AGO,Angola,Sub-Saharan Africa,Africa,No,No,Yes,2000,2000,2005-02-03,2023-09-25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16182,2025-0101-USA,No,tec-tra-air-air,Technological,Transport,Air,Air,USA,United States of America,Northern America,Americas,No,No,No,2025,2025,2025-02-17,2025-02-19
16183,2025-0102-NGA,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),NGA,Nigeria,Sub-Saharan Africa,Africa,No,No,No,2025,2025,2025-02-17,2025-02-19
16184,2025-0103-KEN,No,tec-ind-col-col,Technological,Industrial accident,Collapse (Industrial),Collapse (Industrial),KEN,Kenya,Sub-Saharan Africa,Africa,No,No,No,2025,2025,2025-02-17,2025-02-19
16185,2025-0104-USA,No,tec-tra-air-air,Technological,Transport,Air,Air,USA,United States of America,Northern America,Americas,No,No,No,2025,2025,2025-02-17,2025-02-19


Next, we drop the columns that are not needed in the ontology:

In [8]:
df = df.drop(columns=['Historic', 'Classification Key', 'ISO', 'OFDA/BHA Response', 'Appeal', 'Declaration'])
df

Unnamed: 0,DisNo.,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Country,Subregion,Region,Start Year,End Year,Entry Date,Last Update
0,1999-9388-DJI,Natural,Climatological,Drought,Drought,Djibouti,Sub-Saharan Africa,Africa,2001,2001,2006-03-01,2023-09-25
1,1999-9388-SDN,Natural,Climatological,Drought,Drought,Sudan,Northern Africa,Africa,2000,2001,2006-03-08,2023-09-25
2,1999-9388-SOM,Natural,Climatological,Drought,Drought,Somalia,Sub-Saharan Africa,Africa,2000,2001,2006-03-08,2023-09-25
3,2000-0001-AGO,Technological,Transport,Road,Road,Angola,Sub-Saharan Africa,Africa,2000,2000,2004-10-27,2023-09-25
4,2000-0002-AGO,Natural,Hydrological,Flood,Riverine flood,Angola,Sub-Saharan Africa,Africa,2000,2000,2005-02-03,2023-09-25
...,...,...,...,...,...,...,...,...,...,...,...,...
16182,2025-0101-USA,Technological,Transport,Air,Air,United States of America,Northern America,Americas,2025,2025,2025-02-17,2025-02-19
16183,2025-0102-NGA,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),Nigeria,Sub-Saharan Africa,Africa,2025,2025,2025-02-17,2025-02-19
16184,2025-0103-KEN,Technological,Industrial accident,Collapse (Industrial),Collapse (Industrial),Kenya,Sub-Saharan Africa,Africa,2025,2025,2025-02-17,2025-02-19
16185,2025-0104-USA,Technological,Transport,Air,Air,United States of America,Northern America,Americas,2025,2025,2025-02-17,2025-02-19


In [9]:
df = df.drop(columns=['End Year', 'Entry Date', 'Last Update'])
df

Unnamed: 0,DisNo.,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Country,Subregion,Region,Start Year
0,1999-9388-DJI,Natural,Climatological,Drought,Drought,Djibouti,Sub-Saharan Africa,Africa,2001
1,1999-9388-SDN,Natural,Climatological,Drought,Drought,Sudan,Northern Africa,Africa,2000
2,1999-9388-SOM,Natural,Climatological,Drought,Drought,Somalia,Sub-Saharan Africa,Africa,2000
3,2000-0001-AGO,Technological,Transport,Road,Road,Angola,Sub-Saharan Africa,Africa,2000
4,2000-0002-AGO,Natural,Hydrological,Flood,Riverine flood,Angola,Sub-Saharan Africa,Africa,2000
...,...,...,...,...,...,...,...,...,...
16182,2025-0101-USA,Technological,Transport,Air,Air,United States of America,Northern America,Americas,2025
16183,2025-0102-NGA,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),Nigeria,Sub-Saharan Africa,Africa,2025
16184,2025-0103-KEN,Technological,Industrial accident,Collapse (Industrial),Collapse (Industrial),Kenya,Sub-Saharan Africa,Africa,2025
16185,2025-0104-USA,Technological,Transport,Air,Air,United States of America,Northern America,Americas,2025


In [10]:
df[df['Disaster Type'] != df['Disaster Subtype']]

Unnamed: 0,DisNo.,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,Country,Subregion,Region,Start Year
4,2000-0002-AGO,Natural,Hydrological,Flood,Riverine flood,Angola,Sub-Saharan Africa,Africa,2000
5,2000-0003-BGD,Natural,Meteorological,Extreme temperature,Cold wave,Bangladesh,Southern Asia,Asia,2000
10,2000-0008-GTM,Natural,Geophysical,Volcanic activity,Ash fall,Guatemala,Latin America and the Caribbean,Americas,2000
11,2000-0009-IRN,Natural,Meteorological,Storm,Storm (General),Iran (Islamic Republic of),Southern Asia,Asia,2000
14,2000-0012-MOZ,Natural,Hydrological,Flood,Riverine flood,Mozambique,Sub-Saharan Africa,Africa,2000
...,...,...,...,...,...,...,...,...,...
16166,2025-0048-ITA,Natural,Meteorological,Storm,Storm (General),Italy,Southern Europe,Europe,2025
16172,2025-0062-MYS,Natural,Hydrological,Flood,Flood (General),Malaysia,South-eastern Asia,Asia,2025
16173,2025-0072-BRA,Natural,Meteorological,Storm,Severe weather,Brazil,Latin America and the Caribbean,Americas,2025
16174,2025-0075-AUS,Natural,Hydrological,Flood,Flood (General),Australia,Australia and New Zealand,Oceania,2025


The Disaster Subtype column gives a more detailed description of the Disaster Type, so the two columns carry largely the same information. To reduce redundancy in the data, we drop the more general column (Disaster Type) and keep Disaster Subtype, which describes the specific disaster.
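
One way to back up this redundancy argument is to check that every subtype appears under exactly one type, so the type could always be rebuilt from the subtype. The sketch below runs that check on a handful of type/subtype pairs copied from the output above. The names `sample`, `types_per_subtype`, and `subtype_to_type` are hypothetical, and a real check would run on the full `df`.

```python
import pandas as pd

# Type/subtype pairs copied from the notebook output; not the full dataset.
sample = pd.DataFrame(
    {
        "Disaster Type": [
            "Drought", "Road", "Flood", "Extreme temperature", "Storm", "Flood",
        ],
        "Disaster Subtype": [
            "Drought", "Road", "Riverine flood", "Cold wave",
            "Storm (General)", "Flood (General)",
        ],
    }
)

# Count how many distinct types each subtype appears under.
types_per_subtype = sample.groupby("Disaster Subtype")["Disaster Type"].nunique()

# If every subtype maps to exactly one type, dropping the type column
# loses no information that the subtype does not already imply.
is_redundant = bool((types_per_subtype == 1).all())

# Lookup table that would let the type be recovered later if needed.
subtype_to_type = (
    sample.drop_duplicates("Disaster Subtype")
    .set_index("Disaster Subtype")["Disaster Type"]
    .to_dict()
)

print(is_redundant)
print(subtype_to_type["Riverine flood"])
```

If the check fails on the full data, keeping both columns (or saving the lookup table alongside the CSV) would be the safer choice.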

In [11]:
df = df.drop(columns='Disaster Type')
df

Unnamed: 0,DisNo.,Disaster Group,Disaster Subgroup,Disaster Subtype,Country,Subregion,Region,Start Year
0,1999-9388-DJI,Natural,Climatological,Drought,Djibouti,Sub-Saharan Africa,Africa,2001
1,1999-9388-SDN,Natural,Climatological,Drought,Sudan,Northern Africa,Africa,2000
2,1999-9388-SOM,Natural,Climatological,Drought,Somalia,Sub-Saharan Africa,Africa,2000
3,2000-0001-AGO,Technological,Transport,Road,Angola,Sub-Saharan Africa,Africa,2000
4,2000-0002-AGO,Natural,Hydrological,Riverine flood,Angola,Sub-Saharan Africa,Africa,2000
...,...,...,...,...,...,...,...,...
16182,2025-0101-USA,Technological,Transport,Air,United States of America,Northern America,Americas,2025
16183,2025-0102-NGA,Technological,Miscellaneous accident,Fire (Miscellaneous),Nigeria,Sub-Saharan Africa,Africa,2025
16184,2025-0103-KEN,Technological,Industrial accident,Collapse (Industrial),Kenya,Sub-Saharan Africa,Africa,2025
16185,2025-0104-USA,Technological,Transport,Air,United States of America,Northern America,Americas,2025


In [12]:
df.columns

Index(['DisNo.', 'Disaster Group', 'Disaster Subgroup', 'Disaster Subtype',
       'Country', 'Subregion', 'Region', 'Start Year'],
      dtype='object')

The dataset now contains information on disaster events, the country and region of each event, and the year the event occurred. This data maps directly to the DisasterEvent and Location classes in our ontology. We can now save the dataset to a CSV file to import into our ontology.

In [13]:
df.to_csv('disaster_dataset.csv', index=False)
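
As a rough illustration of how the remaining columns could be split between the two ontology classes, the sketch below groups them with a plain dictionary. The class names come from this notebook, but the assignment of columns to classes in `CLASS_COLUMNS` is an assumption, and `sample` holds three rows copied from the output above in place of the saved file.

```python
import pandas as pd

# Assumed grouping of the eight remaining columns by ontology class.
# The column-to-class assignment is a guess, not the project's actual mapping.
CLASS_COLUMNS = {
    "DisasterEvent": [
        "DisNo.", "Disaster Group", "Disaster Subgroup",
        "Disaster Subtype", "Start Year",
    ],
    "Location": ["Country", "Subregion", "Region"],
}

# Three rows copied from the notebook output, standing in for disaster_dataset.csv.
sample = pd.DataFrame(
    {
        "DisNo.": ["1999-9388-DJI", "2000-0001-AGO", "2000-0002-AGO"],
        "Disaster Group": ["Natural", "Technological", "Natural"],
        "Disaster Subgroup": ["Climatological", "Transport", "Hydrological"],
        "Disaster Subtype": ["Drought", "Road", "Riverine flood"],
        "Country": ["Djibouti", "Angola", "Angola"],
        "Subregion": ["Sub-Saharan Africa"] * 3,
        "Region": ["Africa"] * 3,
        "Start Year": [2001, 2000, 2000],
    }
)

# One record per event; DisNo. serves as the event identifier.
events = sample[CLASS_COLUMNS["DisasterEvent"]]

# One record per distinct location, since many events share a country.
locations = (
    sample[CLASS_COLUMNS["Location"]].drop_duplicates().reset_index(drop=True)
)

print(len(events), "events,", len(locations), "locations")
```

Deduplicating the location columns keeps each country as a single individual, so that several events can point to the same location instead of creating one per row.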