# Final Project: Coding for Data Science

## Title: Analyze natural disasters that occurred in Southeast Asia from 2000 to early 2024 and predict the natural disasters that may happen in Vietnam in the coming years.

### Advisors:
- Mr. Phạm Trọng Nghĩa  
- Mr. Lê Ngọc Thành  
- Mr. Vũ Công Thành  

### Students:
- Trần Trường Giang - 22120085  
- Lê Đại Hòa - 22120108  
- Lê Hoàng Vũ - 22120461


# A. INTRODUCTION
- This notebook is the result of a Data Science project in which we worked with a dataset of natural disasters that occurred in Southeast Asia from 2000 to early 2024. We aimed to review the region's situation over the last two decades, draw useful conclusions, visualize noteworthy patterns, and, finally, make some predictions.

# B. ABOUT THE DATA
- We based this project on the EM-DAT dataset. EM-DAT (Emergency Events Database) is a global database of disasters developed by the Centre for Research on the Epidemiology of Disasters (CRED). Its main objective is to record and provide comprehensive information about major natural and technological (man-made) disasters worldwide, ranging from climate-related events (hurricanes, floods, droughts, etc.) to industrial and transport accidents.

- EM-DAT is an important resource for researchers, governments, and international organizations that study and prepare for disasters. Note, however, that the database only records events that meet inclusion criteria such as:

    - At least 10 deaths (including dead and missing).
    - At least 100 people affected.
    - A call for international assistance or a declaration of a state of emergency by the government.
    - Other factors, such as the scale of the disaster.
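
The criteria listed above can be read as a logical OR over simple thresholds. The following is a minimal sketch under that assumption; the function name `meets_emdat_criteria` and its parameter names are hypothetical and are not part of EM-DAT or of this notebook's dataset.

```python
def meets_emdat_criteria(deaths=0, affected=0, appeal=False, declaration=False):
    """Return True if an event satisfies at least one inclusion threshold.

    Hypothetical helper: the thresholds mirror the list above, and the
    "at least one criterion" reading is an assumption of this sketch.
    """
    return bool(
        deaths >= 10          # at least 10 deaths (dead and missing)
        or affected >= 100    # at least 100 people affected
        or appeal             # call for international assistance
        or declaration        # state of emergency declared
    )


print(meets_emdat_criteria(deaths=3, affected=250))  # qualifies via `affected`
print(meets_emdat_criteria(deaths=2, affected=40))   # below every threshold
```

Because small events fall below these thresholds, counts derived from the database are best read as counts of recorded major events rather than of all events.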

# C. IMPLEMENTATION

## 1. Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display

## 2. Data collection

## 3. Data preprocessing and exploration

In [2]:
# Forward slashes avoid the invalid '\d' escape sequence and work on every OS.
df = pd.read_excel('../data/disaster_sea.xlsx')
df.head(5)

| | DisNo. | Historic | Classification Key | Disaster Group | Disaster Subgroup | Disaster Type | Disaster Subtype | External IDs | Event Name | ISO | ... | Reconstruction Costs ('000 US$) | Reconstruction Costs, Adjusted ('000 US$) | Insured Damage ('000 US$) | Insured Damage, Adjusted ('000 US$) | Total Damage ('000 US$) | Total Damage, Adjusted ('000 US$) | CPI | Admin Units | Entry Date | Last Update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000-0038-PHL | No | nat-hyd-flo-fla | Natural | Hydrological | Flood | Flash flood | | | PHL | ... | | | | | 4080.0 | 7219.0 | 56.514291 | [{"adm2_code":24275,"adm2_name":"Agusan Del No... | 2004-10-27 | 2023-09-25 |
| 1 | 2000-0066-PHL | No | nat-hyd-flo-coa | Natural | Hydrological | Flood | Coastal flood | | | PHL | ... | | | | | | | 56.514291 | [{"adm2_code":24203,"adm2_name":"Tawi-tawi"}] | 2003-07-01 | 2023-09-25 |
| 2 | 2000-0082-IDN | No | nat-hyd-mmw-mud | Natural | Hydrological | Mass movement (wet) | Mudslide | | | IDN | ... | | | | | 11600.0 | 20526.0 | 56.514291 | [{"adm2_code":18035,"adm2_name":"Brebes"}] | 2005-07-21 | 2023-09-25 |
| 3 | 2000-0089-PHL | No | nat-geo-vol-ash | Natural | Geophysical | Volcanic activity | Ash fall | | Mt. Mayon | PHL | ... | | | | | 2214.0 | 3918.0 | 56.514291 | [{"adm2_code":24240,"adm2_name":"Albay"}] | 2005-06-01 | 2023-09-25 |
| 4 | 2000-0108-IDN | No | nat-bio-epi-vir | Natural | Biological | Epidemic | Viral disease | | Dengue fever | IDN | ... | | | | | | | 56.514291 | | 2003-07-01 | 2023-09-25 |


### 3.0 How many rows and columns are there in the dataset?

In [3]:
df.shape

(1278, 46)

### 3.1 Explore rows

### 3.2 Explore columns

In [4]:
pd.set_option('display.max_colwidth', None)  
# Relative path (same data folder as the Excel file) so the notebook runs on other machines.
attributes = pd.read_csv('../data/attributes.csv')
attributes

| | Column name | Description | Explanation |
|---|---|---|---|
| 0 | Dis No. | A unique 8-digit identifier including the year (4 digits) and a sequential number (4 digits) for each disaster event (i.e., 2004-0659). In the EM-DAT Public Table, the ISO country code is appended. | Dis No.: A unique 8-digit identifier including the year (4 digits) and a sequential number (4 digits) for each disaster event (i.e., 2004-0659). In the EM-DAT Public Table, the ISO country code is appended.; |
| 1 | Historic | Binary field specifying whether or not the disaster happened before 2000, using the Start Year. Data before 2000 should be considered of lesser quality | Historic: Binary field specifying whether or not the disaster happened before 2000, using the Start Year. Data before 2000 should be considered of lesser quality; |
| 2 | Classification Key | A unique 15-character string identifying disasters in terms of the Group, Subgroup, Type and Subtype classification hierarchy. | Classification Key: A unique 15-character string identifying disasters in terms of the Group, Subgroup, Type and Subtype classification hierarchy.; |
| 3 | Disaster Group | The disaster group, i.e., “Natural” or “Technological.” | Disaster Group: The disaster group, i.e., “Natural” or “Technological.”; |
| 4 | Disaster Subgroup | The disaster subgroup. | Disaster Subgroup: The disaster subgroup.; |
| 5 | Disaster Type | The disaster type. | Disaster Type: The disaster type.; |
| 6 | Disaster Subtype | The disaster subtype. | Disaster Subtype: The disaster subtype.; |
| 7 | External IDs | List of identifiers for external resources (GLIDE, USGS, DFO), in the format “`<source>:<identifier>`” and separated by the pipe character ("\|"). | External IDs: List of identifiers for external resources (GLIDE, USGS, DFO), in the format “`<source>:<identifier>`” and separated by the pipe character ("\|").; |
| 8 | Event Name | Short specification for disaster identification, e.g., storm names (e.g., “Mitch”), plane type in air crash (e.g., “Boeing 707”), disease name (e.g., “Cholera”), or volcano name (e.g., “Etna”). | Event Name: Short specification for disaster identification, e.g., storm names (e.g., “Mitch”), plane type in air crash (e.g., “Boeing 707”), disease name (e.g., “Cholera”), or volcano name (e.g., “Etna”).; |
| 9 | ISO | The International Organization for Standardization (ISO) 3-letter code referring to the Country. The ISO 3166 norm is used. | ISO: The International Organization for Standardization (ISO) 3-letter code referring to the Country. The ISO 3166 norm is used.; |


### 3.3 Handle missing data & Convert data

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1278 entries, 0 to 1277
Data columns (total 46 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   DisNo.                                     1278 non-null   object 
 1   Historic                                   1278 non-null   object 
 2   Classification Key                         1278 non-null   object 
 3   Disaster Group                             1278 non-null   object 
 4   Disaster Subgroup                          1278 non-null   object 
 5   Disaster Type                              1278 non-null   object 
 6   Disaster Subtype                           1278 non-null   object 
 7   External IDs                               286 non-null    object 
 8   Event Name                                 380 non-null    object 
 9   ISO                                        1278 non-null   object 
 10  Country                 

In [6]:
df.isnull().sum()

DisNo.                                          0
Historic                                        0
Classification Key                              0
Disaster Group                                  0
Disaster Subgroup                               0
Disaster Type                                   0
Disaster Subtype                                0
External IDs                                  992
Event Name                                    898
ISO                                             0
Country                                         0
Subregion                                       0
Region                                          0
Location                                       23
Origin                                        709
Associated Types                              799
OFDA/BHA Response                               0
Appeal                                          0
Declaration                                     0
AID Contribution ('000 US$)                  1194
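
Raw null counts are easier to act on when expressed as a share of all rows. The sketch below is one possible way to summarize them, not the notebook's actual procedure: `missing_summary`, the toy frame, and the 50% cut-off `MAX_MISSING_PCT` are all assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for `df`; the values are made up for illustration only.
toy = pd.DataFrame({
    "Disaster Type": ["Flood", "Storm", "Flood", "Earthquake"],
    "Event Name": [np.nan, "Name A", np.nan, np.nan],
    "Location": ["A", "B", np.nan, "D"],
})

summary = missing_summary(toy)
print(summary)

# Keep only columns whose missing share is at or below a chosen threshold.
MAX_MISSING_PCT = 50.0  # hypothetical cut-off, to be tuned for the real data
kept_columns = summary.index[summary["missing_pct"] <= MAX_MISSING_PCT].tolist()
print(kept_columns)
```

Applied to the real `df`, the same function would show which columns (for example `AID Contribution ('000 US$)`, with 1194 of 1278 values missing) are too sparse to analyze directly.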


### 3.4 Data distribution

## 4. Question proposing & Answering

### 4.1 What are the most common types of natural disasters that occurred in Southeast Asia from 2000 to early 2024?
- (e.g., floods, typhoons, earthquakes, etc.)



In [7]:
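# Coerce the year columns to plain years; values that cannot be parsed become NaN.
df['Start Year'] = pd.to_datetime(df['Start Year'], format='%Y', errors='coerce').dt.year
df['End Year'] = pd.to_datetime(df['End Year'], format='%Y', errors='coerce').dt.year
# Keep only events that fall within the 2000-2024 study period.
modified_df = df[(df['Start Year'] >= 2000) & (df['End Year'] <= 2024)]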

In [8]:
# Count events per disaster type and show the most frequent one.
disasters_count = modified_df['Disaster Type'].value_counts()
disasters_count.head(1)

Flood    595
Name: Disaster Type, dtype: int64
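
Since the question asks about the most common types (plural), a ranking with relative shares may be more informative than the single top entry. A minimal sketch on a toy series follows; the values are made up and only stand in for `modified_df['Disaster Type']`.

```python
import pandas as pd

# Toy stand-in for modified_df['Disaster Type']; values are illustrative only.
types = pd.Series(
    ["Flood", "Storm", "Flood", "Earthquake", "Flood", "Storm"],
    name="Disaster Type",
)

counts = types.value_counts()                      # sorted, most frequent first
ranking = counts.to_frame("count")
ranking["share_pct"] = (counts / counts.sum() * 100).round(1)

print(ranking.head(3))
```

On the real data, replacing `types` with `modified_df['Disaster Type']` would produce the same kind of table for all disaster types.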

### 4.2 What are the main causes of natural disasters in Southeast Asia, and how do they differ between regions?
- (e.g., climate change, tectonic activity, deforestation, etc.)



### 4.3 What has been the level of impact of natural disasters on each country in recent years?
- (e.g., economic losses, infrastructure damage, human casualties.)
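
One way to approach this question is to aggregate a damage column per country. The sketch below assumes the `Country` and `Total Damage, Adjusted ('000 US$)` columns seen in the dataset; the toy frame and its numbers are invented, and the choice of this particular impact metric is an assumption rather than the notebook's confirmed method.

```python
import numpy as np
import pandas as pd

DAMAGE = "Total Damage, Adjusted ('000 US$)"

# Toy stand-in for modified_df; countries and amounts are made up.
toy = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country A", "Country B", "Country C"],
    DAMAGE: [100.0, 50.0, np.nan, 25.0, np.nan],
})

impact = (
    toy.groupby("Country")[DAMAGE]
    .agg(events="size", reported="count", total_damage="sum")
    .sort_values("total_damage", ascending=False)
)
print(impact)
```

Keeping `events` (all rows) next to `reported` (rows with a damage figure) makes it visible when a low total simply reflects missing damage data rather than low impact.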






### 4.4 What is the trend of natural disaster occurrences over the years in Southeast Asia?
- (e.g., increasing, decreasing, or fluctuating patterns.)
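
A yearly event count, optionally smoothed, is a simple starting point for the trend question. This is a sketch on toy data, assuming the `Start Year` column used earlier; the 3-year window is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-in for modified_df; the years are made up for illustration.
toy = pd.DataFrame({"Start Year": [2000, 2000, 2001, 2003, 2003, 2003, 2004]})

# Count events per year, inserting 0 for years with no recorded event.
first, last = int(toy["Start Year"].min()), int(toy["Start Year"].max())
yearly = toy.groupby("Start Year").size().reindex(range(first, last + 1), fill_value=0)

# A centred 3-year rolling mean smooths out single-year spikes.
smoothed = yearly.rolling(window=3, center=True, min_periods=1).mean()

print(yearly.to_dict())
print(smoothed.round(2).to_dict())
```

With the real data, `yearly.plot(kind='bar')` would give a quick visual check of whether occurrences are increasing, decreasing, or fluctuating.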


### 4.5 Which provinces in Vietnam experience the most storms and floods per year, based on the last 3 years of data?
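
Province-level counts would have to come from the `Admin Units` column, which stores a JSON list per event. The sketch below shows one possible way to extract unit names and tally them. The `adm2_name` key appears in the sample rows; the `adm1_name` key, the helper `admin_unit_names`, and the placeholder rows are assumptions, and which administrative level corresponds to a Vietnamese province would need to be checked against the real data.

```python
import json
from collections import Counter


def admin_unit_names(cell):
    """Extract unit names from one Admin Units cell (a JSON string)."""
    if not isinstance(cell, str):
        return []  # missing cells are NaN (a float), not a string
    try:
        units = json.loads(cell)
    except json.JSONDecodeError:
        return []  # e.g. a cell truncated for display
    names = []
    for unit in units:
        # Assumption: entries carry either adm1_* or adm2_* keys.
        name = unit.get("adm1_name") or unit.get("adm2_name")
        if name:
            names.append(name)
    return names


# Toy rows shaped like the Admin Units column; the names are placeholders.
rows = [
    '[{"adm1_code":1,"adm1_name":"Province A"},{"adm1_code":2,"adm1_name":"Province B"}]',
    '[{"adm2_code":3,"adm2_name":"Province A"}]',
    float("nan"),
]

counts = Counter(name for cell in rows for name in admin_unit_names(cell))
print(counts.most_common(2))
```

On the real data, the rows would first be filtered to Vietnam, to the storm and flood disaster types, and to the last three years before counting.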


### 4.6 Can we construct a **machine learning classifier** to automatically determine the **severity level** of a disaster event based on multi-dimensional indicators?

## 5. Evaluation

# D. CONCLUSION

# E. REFERENCES