# Temperature Anomaly and Natural Disasters Analysis in Germany

This notebook aims to analyze temperature anomaly trends in Germany from 1961 to 2023 and investigate natural disasters in Germany from 2001 to 2023. Additionally, it will evaluate the economic and human impact of these disasters.

## Questions this project aims to address

1. What are the temperature anomaly trends over the years 1961 to 2023 in Germany?
2. Are the temperature anomaly trends similar to previous reports seen across the world?
3. What natural disasters have struck Germany from 2001 till 2023?
4. What is the economic loss that has been reported due to these disasters?
5. What is the reported loss of human lives due to these disasters?




## Data Sources

### 1. EM-DAT - The International Disaster Database
* Metadata URL: https://public.emdat.be/data
* Data URL: https://public.emdat.be/data
* Data Type: Microsoft Excel

This data source contains various information about the types of disasters across the world at both country-wide and regional levels. For this project, only the data concerning Germany has been taken into account. The main objective is to identify the disaster that Germany is most vulnerable to, considering the loss of life and total damage incurred due to these disasters.

EM-DAT contains data on the occurrence and impacts of over 26,000 mass disasters worldwide from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organizations, reinsurance companies, research institutes, and press agencies. The Centre for Research on the Epidemiology of Disasters (CRED) distributes the data in open access for non-commercial use. The terms and conditions of the dataset allow its use for academic, non-commercial purposes, provided that the data is not reproduced or distributed and no derivative of the database is created in an unauthorized manner. As all of these conditions are met and the source is cited here, this database can be used in this project.

The database used is in a tabular format, specifically Microsoft Excel. Several numerical columns in the data are incomplete, for example: 'Magnitude', 'Total Deaths', 'No. Injured', 'No. Affected', 'Total Affected', 'Insured Damage ('000 US$)', 'Insured Damage, Adjusted ('000 US$)', 'Total Damage ('000 US$)', and 'Total Damage, Adjusted ('000 US$)'.

However, other columns such as 'Date' and 'Disaster Type' are complete. As no assumptions about the magnitude of economic losses can be made, the missing values in the incomplete columns have been filled with 0. The data is otherwise found to be consistent.

The dataset for Germany was further sub-divided into meteorological and hydrological disasters, as these two types of disasters have been found to be the most frequent. The timeline for recorded hydrological disasters ranges from 2002-2021, and for meteorological disasters from 2001-2023. Thus, the dataset appears to be appropriate to answer the questions this project aims to solve.
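
The zero-filling and the split into meteorological and hydrological subsets described above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning code: the `Country` column name, the helper `split_by_subgroup()`, and all values are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the EM-DAT export; all values are invented.
# The "Country" column name is an assumption for this sketch.
raw = pd.DataFrame(
    {
        "Country": ["Germany", "Germany", "France", "Germany"],
        "Disaster Subgroup": [
            "Meteorological", "Hydrological", "Hydrological", "Meteorological",
        ],
        "Disaster Type": ["Storm", "Flood", "Flood", "Extreme temperature"],
        "Total Deaths": [np.nan, 20.0, 5.0, 3.0],
        "Total Damage ('000 US$)": [150000.0, np.nan, 1000.0, np.nan],
    }
)

NUMERIC_COLS = ["Total Deaths", "Total Damage ('000 US$)"]


def split_by_subgroup(df, country="Germany"):
    """Keep one country, zero-fill numeric gaps, and split by disaster subgroup."""
    subset = df[df["Country"] == country].copy()
    # Missing numeric values become 0, mirroring the choice described above.
    subset[NUMERIC_COLS] = subset[NUMERIC_COLS].fillna(0)
    return {
        name.lower(): subset[subset["Disaster Subgroup"] == name].reset_index(drop=True)
        for name in ("Meteorological", "Hydrological")
    }


frames = split_by_subgroup(raw)
print({name: len(frame) for name, frame in frames.items()})
```

Returning a dictionary keyed by subset name keeps the two dataframes together, which matches how the pipeline later hands them on for saving.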

**Citation:** EM-DAT, CRED / UCLouvain, Brussels, Belgium – www.emdat.be

More information about disaster classification can be found here: [Disaster Classification System](https://doc.emdat.be/docs/data-structure-and-content/disaster-classification-system/).

More information on the legal use of this dataset can be found here: [Terms of Use](https://doc.emdat.be/docs/legal/terms-of-use/).


### 2. Food and Agriculture Organization of the United Nations (FAO)
* Metadata URL: https://www.fao.org/faostat/en/#data/ET
* Data URL: https://bulks-faostat.fao.org/production/Environment_Temperature_change_E_All_Data.zip
* Data Type: CSV

The FAO is a specialized agency of the United Nations that leads international efforts to defeat hunger. The FAO encourages the use of its databases for research, statistical, and scientific purposes. Access, downloading, creating copies, and re-disseminating datasets are subject to the following terms: "Unless specifically stated otherwise, all datasets disseminated through the databases below are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 IGO (CC BY-NC-SA 3.0 IGO)." As the terms and conditions are met and the source has been cited here, this dataset can be used in this project.

The FAOSTAT Temperature Change on Land domain disseminates statistics on mean surface temperature changes by country, with annual updates. The current dissemination covers the period 1961–2023. Statistics are available for monthly, seasonal, and annual mean temperature anomalies, i.e., temperature changes with respect to a baseline climatology corresponding to the period 1951–1980. The standard deviation of the temperature change from the baseline methodology is also available.
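
To make the definition of an anomaly concrete, the sketch below computes the deviation of each year's value from the mean of a 1951–1980 baseline. The temperature series is invented, the helper `anomaly()` is hypothetical, and FAOSTAT's actual processing may differ in detail; the example only illustrates the idea of "change with respect to a baseline climatology".

```python
import pandas as pd

# Made-up annual mean temperatures in degrees Celsius, for illustration only.
years = list(range(1951, 1991))
temps = pd.Series([8.0 + 0.02 * (year - 1951) for year in years], index=years)


def anomaly(series, base_start=1951, base_end=1980):
    """Subtract the baseline-period mean from every value in the series."""
    # Label-based slicing with .loc includes both end years.
    baseline = series.loc[base_start:base_end].mean()
    return series - baseline


anoms = anomaly(temps)
print(anoms.loc[1990])
```

By construction, the anomalies inside the baseline period average to zero, so positive values in later years indicate warming relative to 1951–1980.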

For the purpose of this project, the dataset has been further sub-divided into the following datasets:
- Annual temperature change
- Annual standard deviation
- Seasonal temperature change
- Seasonal standard deviation
- Meteorological temperature change
- Meteorological standard deviation

The data is tabular (structured) and in CSV format. The dataset timeline ranges from 1961-2023. The data is complete and consistent, with all temperature values present and measured in degrees Celsius. This project aims to study only the temperature anomaly of Germany, making the dataset relevant to the problem at hand.
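
One way to carve the FAO table into per-element, per-period subsets is sketched below. The column names (`Area`, `Months`, `Element`, `Y1961`, `Y2023`), the period labels, and the helper names are assumptions based on a synthetic table; the exact label strings in the bulk file may differ. How the project's "annual" subset maps onto these rows is not stated here, so the sketch simply labels non-seasonal rows as `monthly`.

```python
import pandas as pd

# Synthetic rows loosely shaped like the FAOSTAT bulk file; numbers are invented.
raw = pd.DataFrame(
    {
        "Area": ["Germany"] * 6 + ["France"],
        "Months": ["January", "Meteorological year", "Dec-Jan-Feb"] * 2 + ["January"],
        "Element": ["Temperature change"] * 3
        + ["Standard Deviation"] * 3
        + ["Temperature change"],
        "Y1961": [0.5, 0.9, 0.3, 1.9, 0.6, 1.4, 0.2],
        "Y2023": [2.8, 2.4, 2.1, 1.9, 0.6, 1.4, 2.0],
    }
)

# Assumed season labels; adjust to whatever strings the real file uses.
SEASONS = {"Dec-Jan-Feb", "Mar-Apr-May", "Jun-Jul-Aug", "Sep-Oct-Nov"}
PREFIXES = {"Temperature change": "temp_change", "Standard Deviation": "std_dev"}


def period_kind(label):
    """Classify a 'Months' label as met (meteorological year), seasonal, or monthly."""
    if label == "Meteorological year":
        return "met"
    if label in SEASONS:
        return "seasonal"
    return "monthly"


def split_fao(df, area="Germany"):
    """Return a dict of dataframes keyed by '<element prefix>_<period kind>'."""
    subset = df[df["Area"] == area].copy()
    subset["kind"] = subset["Months"].map(period_kind)
    out = {}
    for (element, kind), group in subset.groupby(["Element", "kind"]):
        key = f"{PREFIXES[element]}_{kind}"
        out[key] = group.drop(columns="kind").reset_index(drop=True)
    return out


frames = split_fao(raw)
print(sorted(frames))
```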

**Citation:**
FAO. [Environment_Temperature_change_E_Europe_NOFLAG]. License: CC BY-NC-SA 3.0 IGO. Extracted from: [FAO Database](https://bulks-faostat.fao.org/production/Environment_Temperature_change_E_All_Data.zip). Date of Access: 20-05-2024.


## Data Pipeline

### Running the Pipeline

The pipeline can be run using the bash script `pipeline.sh` present in the project directory. It runs `pipeline.py`, also located in the project directory. The `pipeline.py` script runs three Python scripts present inside the **components** folder. These scripts are responsible for different stages of the pipeline.

1. **Ingestion**
   The first script is **ingestion.py**, which is responsible for fetching the datasets. There are two functions for ingestion:
   - `ingestion_from_url()`: Directly fetches the dataset from the provided URL.
   - The EM-DAT dataset required a second, browser-based method. Although the dataset is openly available for download, the user must be logged in to the EM-DAT website. An initial attempt to log in with the requests library did not work, so the Selenium library was used to mimic the way a user logs into the website. This requires the ChromeDriver for the respective system to be installed and available on the PATH. Because Selenium can handle JavaScript and interact with the browser like a human user, the authentication succeeds. Selenium then navigates to the webpage containing the data and clicks the Download button on that page. In this way, the data is fetched from the URL provided in `pipeline.py`. The username and password are read from a `.env` file, which must be present in the components folder. The environment variables are named `EM_DAT_USERNAME` and `EM_DAT_PASSWORD`.
   Having to implement two different methods of fetching the datasets was a challenge in the data ingestion phase.

   After the ingestion step is complete, the functions return the dataset in CSV format to the pipeline.

2. **Data Preparation**
   The next step is data preparation, accomplished by the `data_cleaning()` function present in the script **data_prep.py**. The function takes an ID associated with the dataset to identify which dataset it is supposed to clean. The datasets being different require different cleaning steps:
   
   - For the FAO Temperature Change on Land dataset, unnecessary columns are dropped. Only temperature anomaly data from 1961-2023 and Monthly (Jan-Dec)/Seasonal/Meteorological columns are shortlisted. For the final analysis, the dataset is divided into six datasets:
     - `temp_change_annual.csv`
     - `temp_change_seasonal.csv`
     - `temp_change_met.csv`
     - `std_dev_annual.csv`
     - `std_dev_met.csv`
     - `std_dev_seasonal.csv`
   
     This function returns a dictionary, with the key being the name of the dataframe and the value being the dataset itself.

   - For the EM-DAT disaster dataset, only data related to Germany is shortlisted. There are two categories of disasters: natural and technological. This study focuses only on natural disasters. The 'Start Day' and 'End Day' columns, which indicate the days on which the disaster began and ended, are filled with the first day of the month wherever the value is NaN. The columns 'Start Year', 'Start Month', and 'Start Day' are combined to form a full date in the format YYYY-MM-DD and converted to a datetime object. The original columns are dropped. Similar steps are taken to create the 'End Date' column. In categorical columns, None values are replaced with 'Unknown'. In numerical columns, NaNs are filled with zeros. Finally, using the 'Meteorological' and 'Hydrological' categories of the 'Disaster Subgroup' column, two dataframes are created for further analysis. The number of data points for any other 'Disaster Subgroup' was insufficient for analysis. This function also returns a dictionary, with the key being the name of the dataframe and the value being the dataframe.
 
   The main challenge in this step of the pipeline was deciding on a strategy to handle the incomplete data. Combining the year, month, and day columns into complete date objects, while substituting the first day of the month where the day was missing, was also tricky.

3. **Saving Dataframes**
   The last step of the pipeline is to save the returned dataframes as separate CSV files. The goal is to perform Exploratory Data Analysis (EDA) on the datasets to answer the questions posed by this project. CSV was found to be a suitable format for this task because it can be easily read back into a dataframe.
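
The date-assembly part of the Data Preparation step can be sketched as follows. This is a minimal illustration on invented rows, assuming that year and month are always present and only the day may be missing; `build_date()` is a hypothetical helper, not the project's `data_cleaning()` function.

```python
import numpy as np
import pandas as pd

# Invented events; the middle row has no recorded start day.
events = pd.DataFrame(
    {
        "Start Year": [2002, 2013, 2021],
        "Start Month": [8, 6, 7],
        "Start Day": [11, np.nan, 12],
    }
)


def build_date(df, prefix):
    """Combine '<prefix> Year/Month/Day' into a single '<prefix> Date' column."""
    source_cols = [f"{prefix} Year", f"{prefix} Month", f"{prefix} Day"]
    parts = pd.DataFrame(
        {
            "year": df[source_cols[0]],
            "month": df[source_cols[1]],
            # A missing day falls back to the first day of the month.
            "day": df[source_cols[2]].fillna(1),
        }
    ).astype(int)
    result = df.drop(columns=source_cols)
    # pd.to_datetime assembles dates from columns named year, month, and day.
    result[f"{prefix} Date"] = pd.to_datetime(parts)
    return result


cleaned = build_date(events, "Start")
print(cleaned)
```

Calling `build_date(df, "End")` on a table with the corresponding 'End' columns would produce the 'End Date' column in the same way.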


```python
from IPython.display import HTML, display

# Local image file
image_path = "../data/Drawing1.png"

# HTML with image and caption
html_code = f"""
    <figure>
        <img src="{image_path}" alt="ETL Image" style="width:100%;">
        <figcaption>fig: Data Engineering Pipeline</figcaption>
    </figure>
"""

# Display the HTML
display(HTML(html_code))
```


## Results and Limitations

### Chosen Data Format for the Output of the Pipeline

The output data of the pipeline is saved in CSV format. CSV was chosen for several reasons:
- **Simplicity**: CSV files are easy to read and write in most programming languages, making them highly accessible.
- **Compatibility**: CSV is a widely supported format across various data analysis and visualization tools.
- **Efficiency**: For the purposes of this project, CSV files are sufficient to store and manage the structured datasets without adding unnecessary complexity.
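
As a rough sketch of the saving step, the snippet below writes a dictionary of dataframes to one CSV file each and reads one back. The helper `save_frames()`, the dictionary key, and the sample row are assumptions for illustration. It also shows one practical caveat of CSV: column types are not stored, so date columns have to be parsed again on load.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames, out_dir):
    """Write each dataframe to '<out_dir>/<name>.csv' and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths


# Invented one-row dataframe standing in for a cleaned pipeline output.
frames = {
    "meteorological": pd.DataFrame(
        {"Start Date": pd.to_datetime(["2002-08-11"]), "Total Deaths": [0.0]}
    )
}

with tempfile.TemporaryDirectory() as tmp:
    (path,) = save_frames(frames, tmp)
    # Dates come back as plain strings unless parse_dates is given.
    reloaded = pd.read_csv(path, parse_dates=["Start Date"])

print(reloaded.dtypes)
```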

### Critical Reflection on the Data and Potential Issues

- **Temperature Anomaly Dataset**
  - **Issues**: The dataset appears to be high quality with no missing values. However, temperature anomalies alone may not fully capture the complexities of climate change impacts.
  - **Potential Improvements**: Incorporating additional climate indicators such as precipitation patterns and extreme weather events could provide a more comprehensive view in future projects.

- **Natural Disasters Dataset**
  - **Issues**: The dataset had missing values that were filled with zeros or 'Unknown', which could introduce bias or inaccuracies. Additionally, the economic and human impact data might be underreported due to incomplete records or reporting discrepancies.
  - **Potential Improvements**: Accessing more granular data or supplementary sources could enhance the dataset's accuracy in future work.
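
The bias that zero-filling can introduce is easy to see on a small, invented example. The values below are not taken from EM-DAT; they only show that treating unreported damage as 0 pulls averages down, while totals stay the same.

```python
import numpy as np
import pandas as pd

# Invented damage figures; two of the four events have no reported value.
damage = pd.Series([100.0, np.nan, 300.0, np.nan], name="Total Damage ('000 US$)")

mean_reported = damage.mean()                # NaN is skipped: (100 + 300) / 2
mean_zero_filled = damage.fillna(0).mean()   # zeros count: (100 + 0 + 300 + 0) / 4
share_missing = damage.isna().mean()         # fraction of events without a value

print(mean_reported, mean_zero_filled, share_missing)
```

Reporting the share of missing values alongside any average is one simple way to make this limitation visible to readers of the analysis.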

