# Data Journalism in Python -
# A Practical Introduction to Programming


### Natalie Widmann




Winter semester 2022/2023


Universität Leipzig





### Aggregated figures for Natural Disasters in EM-DAT

Link: https://data.humdata.org/dataset/emdat-country-profiles


In 1988, the **Centre for Research on the Epidemiology of Disasters (CRED)** launched the **Emergency Events Database (EM-DAT)**. EM-DAT was created with the initial support of the **World Health Organisation (WHO) and the Belgian Government**.

The main objective of the database is to **serve the purposes of humanitarian action at national and international levels**. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the **occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day**. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.



In [1]:
# Install pip packages from within the Jupyter notebook
!pip3 install pandas
!pip3 install openpyxl

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [51]:
import pandas as pd
data = pd.read_excel('../data/clean_emdat.xlsx', engine="openpyxl")

  warn("Workbook contains no default style, apply openpyxl's default")


In [52]:
data

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
0,#date +occurred,#country +name,#country +code,#cause +group,#cause +subgroup,#cause +type,#cause +subtype,#frequency,#affected +ind,#affected +ind +killed,,#value +usd,
1,1900,Cabo Verde,CPV,Natural,Climatological,Drought,Drought,1,,11000,,,3.077091
2,1900,India,IND,Natural,Climatological,Drought,Drought,1,,1250000,,,3.077091
3,1900,Jamaica,JAM,Natural,Hydrological,Flood,,1,,300,,,3.077091
4,1900,Japan,JPN,Natural,Geophysical,Volcanic activity,Ash fall,1,,30,,,3.077091
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10338,2022,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,3400,13,,,
10339,2022,South Africa,ZAF,Natural,Hydrological,Flood,,7,143119,562,3.164000e+09,,
10340,2022,Zambia,ZMB,Natural,Hydrological,Flood,,1,15000,3,,,
10341,2022,Zimbabwe,ZWE,Natural,Hydrological,Flood,,1,,,,,


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10342 entries, 1 to 10342
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          10342 non-null  int64  
 1   Country                       10342 non-null  object 
 2   ISO                           10342 non-null  object 
 3   Disaster Group                10342 non-null  object 
 4   Disaster Subroup              10342 non-null  object 
 5   Disaster Type                 10342 non-null  object 
 6   Disaster Subtype              8225 non-null   object 
 7   Total Events                  10342 non-null  int64  
 8   Total Affected                7506 non-null   float64
 9   Total Deaths                  7317 non-null   float64
 10  Total Damage (USD, original)  3796 non-null   float64
 11  Total Damage (USD, adjusted)  3766 non-null   float64
 12  CPI                           10149 non-null  float64
dtypes

## Back to the dashboard...

## Research questions

- How many deaths are there in total in Germany?
- How severely is a country affected by natural disasters?
- What share do the different types of natural disaster have in this?
- How has the number of natural disasters developed over the years?

#### How many deaths are there in total?

In [79]:
data["Total Deaths"].sum()

22845977.0

#### How many deaths are there in total in Germany?

In [83]:
data

Unnamed: 0.1,Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)",CPI
0,1,1900,Cabo Verde,Climatological,Drought,Drought,1,11000.0,11000.0,,3.077091
1,2,1900,India,Climatological,Drought,Drought,1,1250000.0,1250000.0,,3.077091
2,3,1900,Jamaica,Hydrological,Flood,No Subtype,1,300.0,300.0,,3.077091
3,4,1900,Japan,Geophysical,Volcanic activity,Ash fall,1,30.0,30.0,,3.077091
4,5,1900,Turkey,Geophysical,Earthquake,Ground movement,1,140.0,140.0,,3.077091
...,...,...,...,...,...,...,...,...,...,...,...
10337,10338,2022,Yemen,Hydrological,Flood,Flash flood,1,3400.0,13.0,,
10338,10339,2022,South Africa,Hydrological,Flood,No Subtype,7,143119.0,562.0,3.164000e+09,
10339,10340,2022,Zambia,Hydrological,Flood,No Subtype,1,15000.0,3.0,,
10340,10341,2022,Zimbabwe,Hydrological,Flood,No Subtype,1,0.0,0.0,,


In [84]:
data['Country'] == 'Germany'

0        False
1        False
2        False
3        False
4        False
         ...  
10337    False
10338    False
10339    False
10340    False
10341    False
Name: Country, Length: 10342, dtype: bool

In [85]:
data[data['Country'] == 'Germany']

Unnamed: 0.1,Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)",CPI
3120,3121,1990,Germany,Meteorological,Storm,No Subtype,6,64.0,64.0,4.440000e+09,48.218797
3288,3289,1991,Germany,Meteorological,Storm,No Subtype,1,0.0,0.0,5.000000e+06,50.260853
3430,3431,1992,Germany,Geophysical,Earthquake,Ground movement,1,1525.0,1.0,5.000000e+07,51.783162
3431,3432,1992,Germany,Hydrological,Flood,No Subtype,1,0.0,0.0,3.010000e+07,51.783162
3581,3582,1993,Germany,Hydrological,Flood,Riverine flood,1,100000.0,5.0,6.000000e+08,53.311620
...,...,...,...,...,...,...,...,...,...,...,...
9468,9469,2019,Germany,Meteorological,Storm,Convective storm,1,1.0,1.0,,94.349092
9707,9708,2020,Germany,Meteorological,Storm,Extra-tropical storm,1,33.0,0.0,,95.512967
9951,9952,2021,Germany,Hydrological,Flood,No Subtype,1,1000.0,197.0,4.000000e+10,100.000000
9952,9953,2021,Germany,Meteorological,Storm,Convective storm,2,604.0,1.0,,100.000000


In [82]:
data['Germany'] 

KeyError: 'Germany'
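
The `KeyError` arises because a string inside square brackets is looked up as a column label, while `'Germany'` is only a value inside the `Country` column. The following is a minimal sketch on a made-up toy DataFrame (the values are assumptions for illustration, not taken from EM-DAT) that contrasts column selection, row filtering with a boolean mask, and the failing lookup:

```python
import pandas as pd

# Toy data for illustration only (not taken from EM-DAT)
toy = pd.DataFrame({
    'Country': ['Germany', 'India', 'Germany'],
    'Total Deaths': [5.0, 100.0, 1.0],
})

# A string inside [] selects a column by its label
deaths = toy['Total Deaths']

# A boolean Series inside [] keeps only the rows marked True
mask = toy['Country'] == 'Germany'
germany_rows = toy[mask]

# 'Germany' is a value in a column, not a column label, so this lookup fails
try:
    toy['Germany']
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(germany_rows)
print('Lookup by value failed:', lookup_failed)
```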

Correcting the country names

In [None]:
# Find all spellings of Germany
for country in data['Country'].unique():
    if 'german' in country.lower():
        print(country)
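
The same search can also be written without an explicit loop. This is a minimal sketch using `Series.str.contains()` with `case=False`. The country names in the toy DataFrame are assumptions for illustration and may not match the actual contents of the `Country` column:

```python
import pandas as pd

# Toy data for illustration only; the real column may contain other spellings
toy = pd.DataFrame({
    'Country': ['Germany Fed Rep', 'Germany Dem Rep', 'Germany', 'Mexico', 'Germany'],
})

# case=False makes the substring match case-insensitive
is_german = toy['Country'].str.contains('german', case=False)

# unique() removes duplicates while keeping the order of first appearance
variants = list(toy.loc[is_german, 'Country'].unique())
print(variants)
```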

A function that cleans this up

In [None]:
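def clean_country(country):
    # Assumption: every name containing 'german' (e.g. 'Germany Fed Rep') means Germany
    return 'Germany' if 'german' in country.lower() else country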


In [None]:
# Try the function on one name that needs cleaning and one that does not
for text in ['Germany Fed Rep', 'Mexico']:
    print(clean_country(text))

Applying the function to all values of a column

In [None]:
for index, row in data.iterrows():
    data.loc[index, "Country"] = clean_country(row['Country'])

In [None]:
help(data.loc)

In [None]:
data[data['Country'] == 'Germany']

**or** (much clearer and more efficient)

With `apply()`, a function can be applied to an entire column or row of the DataFrame.

In [None]:
data['Country'] = data['Country'].apply(clean_country)
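
To see on a small scale what `apply()` does, here is a minimal sketch on a made-up toy DataFrame. The body of `clean_country` is an assumption (it maps every name containing "german" to "Germany"); the version used in the lecture may differ:

```python
import pandas as pd

def clean_country(country):
    # Assumption: every name containing 'german' refers to Germany
    if 'german' in country.lower():
        return 'Germany'
    return country

# Toy data for illustration only (not taken from EM-DAT)
toy = pd.DataFrame({'Country': ['Germany Fed Rep', 'Mexico', 'Germany Dem Rep']})

# apply() calls clean_country once per value and returns a new Series
toy['Country'] = toy['Country'].apply(clean_country)
cleaned = toy['Country'].tolist()
print(cleaned)
```

Compared with the `iterrows()` loop, the whole column is replaced in one assignment, so there is no need to write back to individual cells with `data.loc`.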

#### How many deaths were there in total in Germany?

In [None]:
germany_data = data[data['Country'] == 'Germany']
germany_data['Total Deaths'].sum()

#### How many deaths were there in total in India?

#### A general function that returns the total number of deaths for a country
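
One possible way to write such a function is sketched below. The name `total_deaths` and the toy values are assumptions for illustration; only the column names `Country` and `Total Deaths` come from the dataset:

```python
import pandas as pd

def total_deaths(data, country):
    """Return the sum of 'Total Deaths' over all rows of one country."""
    country_data = data[data['Country'] == country]
    # sum() skips missing values (NaN) by default
    return country_data['Total Deaths'].sum()

# Toy data for illustration only (not taken from EM-DAT)
toy = pd.DataFrame({
    'Country': ['India', 'Germany', 'India'],
    'Total Deaths': [100.0, 5.0, None],
})

india_total = total_deaths(toy, 'India')
germany_total = total_deaths(toy, 'Germany')
print(india_total, germany_total)
```

With the real DataFrame, the call would be `total_deaths(data, 'India')`. A country that does not occur in the data yields `0.0`, because the sum of an empty selection is zero.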

### Dashboard Part II

#### What share do the different types of natural disaster have in Germany?

# Visualizing DataFrames

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib makes easy things easy and hard things possible.

Install [matplotlib](https://matplotlib.org/)

In [None]:
!pip3 install --upgrade pip
!pip3 install --upgrade Pillow
!pip3 install matplotlib

In [None]:
import matplotlib.pyplot as plt

In [None]:
germany_data['Disaster Type'].value_counts()
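
`value_counts()` counts rows. In this aggregated dataset a row summarises all events of one type in one country and year, so the row count is not the same as the number of events. The sketch below, on made-up toy values, shows both views as percentages: `value_counts(normalize=True)` for the share of rows, and a sum over `Total Events` per type (using `groupby()`, which is introduced further below) for the share of events:

```python
import pandas as pd

# Toy data for illustration only (not taken from EM-DAT)
toy = pd.DataFrame({
    'Disaster Type': ['Storm', 'Flood', 'Storm'],
    'Total Events': [6, 1, 2],
})

# Share of rows per disaster type, in percent
row_share = toy['Disaster Type'].value_counts(normalize=True) * 100

# Share of events per disaster type, in percent
events_per_type = toy.groupby('Disaster Type')['Total Events'].sum()
event_share = events_per_type / toy['Total Events'].sum() * 100

print(row_share.round(1))
print(event_share.round(1))
```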

In [None]:
# .plot() returns a matplotlib Axes object, not a Figure
ax = germany_data['Disaster Type'].value_counts().plot(kind='bar')

In [None]:
germany_data['Disaster Type'].value_counts().plot(kind='pie')

## Research questions

#### Which natural disasters cause the most deaths?

`.groupby()` groups a DataFrame by the values of one or more columns.

The columns to group by are passed as an argument, followed by the desired computation on each group. The result is returned as a DataFrame, or as a Series when a single column is selected before the computation.

In [None]:
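# numeric_only=True restricts the sum to numeric columns and skips text columns such as Country
data.groupby('Disaster Type').sum(numeric_only=True)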

In [None]:
data.groupby('Disaster Type')['Total Deaths'].sum()

`.groupby()` can also group by several columns at once.

In [None]:
data.groupby(['Disaster Type', 'Disaster Subtype'])['Total Deaths'].sum()
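
To make the grouping step easier to follow, here is a minimal sketch on a made-up toy DataFrame (the values are assumptions for illustration, not taken from EM-DAT). It groups by one column, then by two, and sorts the result so that the deadliest type comes first:

```python
import pandas as pd

# Toy data for illustration only (not taken from EM-DAT)
toy = pd.DataFrame({
    'Disaster Type': ['Flood', 'Flood', 'Storm', 'Flood'],
    'Disaster Subtype': ['Flash flood', 'Riverine flood', 'Convective storm', 'Flash flood'],
    'Total Deaths': [13.0, 5.0, 1.0, 2.0],
})

# One grouping column: the result is a Series indexed by 'Disaster Type'
by_type = toy.groupby('Disaster Type')['Total Deaths'].sum()

# Two grouping columns: the result has a two-level index (type, subtype)
by_subtype = toy.groupby(['Disaster Type', 'Disaster Subtype'])['Total Deaths'].sum()

# Sort descending so the type with the most deaths comes first
ranked = by_type.sort_values(ascending=False)

print(ranked)
print(by_subtype)
```

By default, `groupby()` drops rows whose grouping key is missing. Since `Disaster Subtype` contains missing values in the raw data, grouping by it leaves those rows out unless the gaps are filled first or `dropna=False` is passed.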

#### Visualization

In [None]:
data.groupby('Disaster Type')['Total Deaths'].sum().plot(kind='pie')

#### Which natural disasters cause the most deaths in Germany?

#### General function

In [None]:
country = 'India'
country_data = data[data['Country'] == country]
country_data.groupby('Disaster Type')['Total Deaths'].sum().plot(kind='pie')

## Dashboard Part IV

#### How has the number of natural disasters developed over the years?

In [None]:
yearly_events = data.groupby('Year')['Total Events'].sum()
# yearly_events is a Series indexed by year, so no x/y arguments are needed
yearly_events.plot(kind='line', title='Number of natural disasters per year')

## Country dashboard

In [None]:
def plot_pie(country_data):
    # Pie chart of total deaths per disaster type
    country_data.groupby('Disaster Type')['Total Deaths'].sum().plot(kind='pie', title='Share of deaths by natural disaster type')
    plt.show()

In [None]:
def plot_time_evolution(country_data):
    # Line chart of the number of events per year
    yearly_events = country_data.groupby('Year')['Total Events'].sum()
    yearly_events.plot(kind='line', title='Number of natural disasters per year')
    plt.show()

In [None]:
def death_overview(country_data):
    total_deaths = country_data["Total Deaths"].sum()
    print(f'People killed since 1900: {total_deaths:,.0f}')

In [None]:
def compute_anteil(country_total, world_total):
    # Share of the country in the worldwide total, in percent
    anteil = round(country_total / (world_total / 100.0), 2)
    print(f'{anteil}% of all people affected by natural disasters worldwide live here.')

In [None]:
def analyze(data, country):
    print(f'Natural disasters in {country.upper()} \n')
    country_data = data[data['Country'] == country]
    
    compute_anteil(country_data['Total Affected'].sum(), data['Total Affected'].sum())
    death_overview(country_data)
    plot_pie(country_data)
    plot_time_evolution(country_data)

In [None]:
analyze(data, 'Bangladesh')

### Overview of the world

How can we adapt the self-defined function `analyze()` so that it can also produce overall statistics covering all countries of the world?

In [None]:
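def analyze(data, country):
    print(f'Natural disasters in {country.upper()} \n')
    # One possible adaptation: 'world' keeps all rows, any other value filters by country
    country_data = data if country == 'world' else data[data['Country'] == country]
    # death_overview() stands in for stats_overview(), which is not defined in this notebook
    death_overview(country_data)
    plot_pie(country_data)
    plot_time_evolution(country_data)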

In [None]:
analyze(data, 'world')

# Time for feedback



Link: https://ahaslides.com/HP3D5

![Feedback QR Code](../imgs/qrcode_vl7.png)

