The joined data provide seasonal case numbers per influenza virus type for each district, together with each district's population density. This study aims to analyse general seasonal trends for the whole of Baden-Württemberg, to detect differences in the spread of influenza between densely and sparsely populated areas, and to analyse trends in virus variation over time and its spread at district level.

This leads to the following hypotheses:
- The prevalence of influenza cases generally increased, with a strong increase from 2012 onwards (a sudden increase in virus variant diversity, introducing new clades/strains) and a sudden decrease in 2020 due to the COVID-19 measures.
- The spread (reflected in case numbers) differs significantly between sparsely and densely populated areas.
- The increase in virus diversity leads to higher case numbers and to more distinct virus variants being recorded per district.

In [82]:
# imports
import os
import re
import math

import numpy as np
import pandas as pd
from bokeh.palettes import magma, Set3
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import HoverTool,WheelZoomTool, PanTool, ResetTool, Legend, Band
from bokeh.models.widgets import Panel, Tabs
import hvplot.pandas
from bokeh.io import output_notebook
output_notebook()

# Load the merged dataset

In [83]:
# define the path under which the file is located
path = 'C:/Data_Science_for_Life_Sciences_MASTER/programming1/programming_1_influenza/data/'

# read the file
df = pd.read_csv(path + 'influenza_pop_dens_merged.csv', sep = '\t')
df.head()

Unnamed: 0.1,Unnamed: 0,virus_type,district,season,case_number,year,area [ha],population,population_density,population_density_bw,pop_exp,pop_dens_exp,area_exp,area [km^2],area ex [km^2]
0,0,-nicht erhoben-,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
1,1,-nicht ermittelbar-,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
2,2,Influenza A Virus,AD,2000/01,20.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
3,3,Influenza A(H1N1) Virus (vorpandemisch),AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
4,4,Influenza A(H3N2) Virus,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93


In [84]:
# drop not required columns
df = df.drop(columns = ['Unnamed: 0', 'area [ha]', 'population', 'population_density', 'area [km^2]'])
df.head()

Unnamed: 0,virus_type,district,season,case_number,year,population_density_bw,pop_exp,pop_dens_exp,area_exp,area ex [km^2]
0,-nicht erhoben-,AD,2000/01,0.0,2000,294.0,170214.78,125.35,135793.33,1357.93
1,-nicht ermittelbar-,AD,2000/01,0.0,2000,294.0,170214.78,125.35,135793.33,1357.93
2,Influenza A Virus,AD,2000/01,20.0,2000,294.0,170214.78,125.35,135793.33,1357.93
3,Influenza A(H1N1) Virus (vorpandemisch),AD,2000/01,0.0,2000,294.0,170214.78,125.35,135793.33,1357.93
4,Influenza A(H3N2) Virus,AD,2000/01,0.0,2000,294.0,170214.78,125.35,135793.33,1357.93


In [85]:
# rename the 

# Analyse the seasonal trends for whole Baden-Württemberg for available time period

The assumption was that case numbers increase from the beginning of the records until now. In addition, an increase in virus variant diversity was reported for 2012. It was assumed that this increase in diversity led to higher overall case numbers (vaccines were then probably less effective, and a well-matched vaccine per season was harder to predict). A sudden decrease is expected in season 2020/21 due to the COVID-19 measures.

In [86]:
total_cases_bw = df[['year', 'case_number']].groupby(['year']).sum()
total_cases_bw
p5 = figure(title = 'total case numbers baden - württemberg', x_axis_label = 'years', y_axis_label = 'total case numbers')

p5.line(total_cases_bw.index.values, total_cases_bw.case_number.values, line_width=2)

show(p5)

In [87]:
total_cases_pop_bw = df[['year', 'case_number', 'pop_exp', 'area ex [km^2]']]
total_cases_pop_bw = total_cases_pop_bw.groupby('year').agg({'case_number':['sum'], 'pop_exp': ['sum'], 'area ex [km^2]': ['sum']}).reset_index()
total_cases_pop_bw.columns = ["year", "case_number", "pop_exp", "area_exp_[km^2]"]
total_cases_pop_bw['pop_dens_bw'] = total_cases_pop_bw['pop_exp'] / total_cases_pop_bw['area_exp_[km^2]']

# relate the case numbers to the population density (cases / population density);
# a ratio of cases per 100,000 inhabitants is not computed in this cell
total_cases_pop_bw['cases/pop_dens'] = total_cases_pop_bw['case_number']/total_cases_pop_bw['pop_dens_bw']
total_cases_pop_bw
p6 = figure(title = 'total case numbers baden - württemberg in relation to population density', 
             x_axis_label = 'years', y_axis_label = 'total case numbers/population density')

p6.line(total_cases_pop_bw.year, total_cases_pop_bw['cases/pop_dens'], line_width=2)

show(p6)

In [88]:
total_cases_pop_bw.year.astype(str)
p = figure(x_range = total_cases_pop_bw.year.astype(str), height=250, title="total case numbers per population density",
           toolbar_location=None, tools="")

p.vbar(x = total_cases_pop_bw.year.astype(str), top = total_cases_pop_bw['cases/pop_dens'], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1.2

show(p)

Indeed, the overall trend shows an increase from 2000 onwards, with a sudden decrease in 2020. The increase in case numbers appears stronger for the period 2013 to 2019. However, there is a local maximum in 2009 (season 2009/10). Before searching the literature for possible reasons, it is worth looking at how the individual virus types contribute to these trends.
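The size of the drop in 2020 can be quantified with a year-on-year relative change. The following is a minimal sketch and not part of the original analysis: it uses only the 'Influenza A Virus' case numbers for 2016 to 2021 as they appear in the pivot table of this notebook, and the variable names `influenza_a` and `yearly_change` are chosen for illustration.

```python
import pandas as pd

# 'Influenza A Virus' case numbers for 2016-2021, copied from the pivot table in this notebook
influenza_a = pd.Series(
    [8153.0, 8967.0, 14344.0, 12424.0, 53.0, 37.0],
    index=[2016, 2017, 2018, 2019, 2020, 2021],
    name="case_number",
)

# relative change compared with the previous year (the first year has no predecessor)
yearly_change = influenza_a.pct_change()
print((yearly_change * 100).round(1))
```

The same call could be applied to the yearly totals in `total_cases_bw`, assuming that frame is available as built above.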

In [89]:
virus_types = df[['year', 'case_number', 'virus_type']].groupby(['year', 'virus_type']).sum()
virus_types = virus_types.reset_index()
virus_types.virus_type.nunique()
line_col_set3 = Set3[12]

p6 = figure(plot_width=1000, title = 'total case numbers per virus type in baden - württemberg', x_axis_label = 'years',
             y_axis_label = 'total case numbers', x_range=[2000,2022],
             tools="hover", tooltips="@year:  @case_number for @virus_type")

# one line per virus type
for (name, group), color in zip(virus_types.groupby('virus_type'), line_col_set3):
    cds = ColumnDataSource(data = group)
    p6.line(x = 'year', y = 'case_number', source=cds, line_color = color, line_width = 2, legend_label = name)

# move the legend out of the plot area to the left of the figure
legend = p6.legend[0]
p6.center = [item for item in p6.center if not isinstance(item, Legend)]
p6.add_layout(legend, 'left')

show(p6)

In [185]:
virus_types.year = virus_types.year.astype(str)
vt_wide = virus_types.pivot(index='virus_type', columns='year', values='case_number')
vt_wide

year,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
virus_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-andere/sonstige-,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
-nicht erhoben-,24.0,49.0,3.0,21.0,9.0,33.0,7.0,11.0,16.0,10.0,...,94.0,236.0,530.0,561.0,674.0,331.0,347.0,365.0,3.0,3.0
-nicht ermittelbar-,0.0,0.0,1.0,4.0,2.0,2.0,14.0,1.0,0.0,0.0,...,0.0,0.0,6.0,0.0,1.0,5.0,7.0,21.0,0.0,0.0
Influenza A Virus,315.0,194.0,436.0,555.0,930.0,634.0,1614.0,1133.0,2069.0,794.0,...,1933.0,1461.0,3905.0,4155.0,8153.0,8967.0,14344.0,12424.0,53.0,37.0
Influenza A Virus (zoonotisch),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Influenza A(H1N1) Virus (vorpandemisch),0.0,0.0,0.0,2.0,10.0,12.0,25.0,64.0,7.0,0.0,...,0.0,6.0,9.0,15.0,4.0,30.0,66.0,50.0,1.0,0.0
Influenza A(H1N1)pdm09 Virus,0.0,0.0,0.0,0.0,1.0,255.0,725.0,1140.0,3643.0,9561.0,...,916.0,185.0,301.0,266.0,147.0,591.0,586.0,461.0,0.0,0.0
Influenza A(H1N2) Virus,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,4.0,20.0,1.0,1.0,13.0,0.0,0.0
Influenza A(H3N2) Virus,0.0,6.0,71.0,18.0,76.0,14.0,48.0,16.0,168.0,17.0,...,70.0,78.0,355.0,48.0,271.0,41.0,113.0,47.0,2.0,0.0
Influenza A/B Virus nicht differenziert nach A oder B,0.0,17.0,354.0,234.0,233.0,142.0,343.0,115.0,233.0,172.0,...,1455.0,658.0,1653.0,1127.0,2405.0,3310.0,112.0,106.0,3.0,2.0


In [181]:
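# one list of yearly case numbers per virus type, plus the shared 'years' column used as x
vt = vt_wide.index.tolist()
years = vt_wide.columns.tolist()
vt_dict = vt_wide.T.to_dict('list')
vt_dict['years'] = years
# vbar_stack needs exactly one colour per virus type; Set3 has 12, so fall back to magma for more
colors = Set3[12][:len(vt)] if len(vt) <= 12 else magma(len(vt))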

In [186]:
p = figure(x_range = years, plot_height=500, title="case numbers by virus type in Baden-Württemberg",
           toolbar_location=None, tools="")

p.vbar_stack(vt, x = 'years', width=0.9, color=colors, source = vt_dict)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = 1.2

show(p)

Splitting the case numbers by virus type adds a further dimension and yields the following insights:
- For influenza A, there does seem to be the predicted trend of increasing case numbers over time. Influenza A contributes the most to the total case numbers.
- Influenza B shows trends similar to influenza A, but its case numbers decrease earlier than 2020.
- There is a mixed category for cases that were not differentiated between influenza A and B; it shows trends similar to influenza B.
- The high case numbers in season 2009/10 were mainly caused by one virus type, which did not cause high case numbers in the following years. (Possible explanations: later vaccines covered this specific variant, the variant was selected against, or the strain circulated in some regions only and did not spread further.)
- The remaining categories are either influenza C or cases that were not differentiated further.
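How much each virus type contributes per year can also be expressed as a share of the yearly total. The sketch below uses a small hypothetical table (the numbers are not taken from the dataset) in the same long format as `virus_types`; the names `cases`, `wide` and `share` are illustrative.

```python
import pandas as pd

# hypothetical long-format case numbers (not taken from the dataset)
cases = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "virus_type": ["Influenza A Virus", "Influenza B Virus"] * 2,
    "case_number": [300.0, 100.0, 150.0, 50.0],
})

# years as rows, virus types as columns
wide = cases.pivot(index="year", columns="virus_type", values="case_number")

# divide each row by its total to get the fraction contributed per virus type
share = wide.div(wide.sum(axis=1), axis=0)
print(share.round(2))
```

Shares make years with very different totals comparable, which absolute case numbers do not.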

The categories that do not name a specific virus strain are not informative when observing how the virus types evolve. For that reason, the categories 'others' ('-andere/sonstige-'), 'not recorded' ('-nicht erhoben-') and 'not ascertainable' ('-nicht ermittelbar-') are removed.

In [7]:
vt_out = ['-nicht erhoben-', '-nicht ermittelbar-', '-andere/sonstige-']

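# keep only rows whose virus type is not in the exclusion list;
# .copy() makes the result an independent frame so columns can be added later without a warning
df_filtered_vt = df.query('virus_type not in @vt_out').copy()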

# Study the virus spreading on district level

Each virus variant is studied per district. The difficulty here is to see trends across 43 districts (normally 44, but RT was excluded); therefore, the districts were categorized. The categorization supports answering the question of whether the spreading patterns differ between rural and urban areas.

The first plan was to separate the districts into SK and LK. SK means city district, LK means rural district. However, when reading in the data it became apparent that some SKs do not have a much higher population density than the LKs. Therefore, the categorization was based on population density instead.

The literature distinguishes between rural, suburban and urban areas. According to it, urban areas are characterized by a population density above 1500 and a population above 50 000; suburban areas by a population density above 300 and a population above 500; and rural areas by a population density below 300 and a population below 500. This approach is applied to raster cells, and it was unclear whether it is applicable in this case as well. [https://ec.europa.eu/regional_policy/sources/docgener/work/2014_01_new_urban.pdf]
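The literature rule described above can be written down as a small function. This is only a sketch of the thresholds as stated in this notebook, not of the raster-based method of the cited source; the function name `classify_area` is hypothetical. The first two calls use the density and population values of the districts AD and U as they appear in the merged dataset, the third uses made-up values.

```python
def classify_area(density, population):
    """Classify an area with the density/population thresholds quoted in the text."""
    if density > 1500 and population > 50_000:
        return "urban"
    if density > 300 and population > 500:
        return "suburban"
    return "rural"


print(classify_area(137.0, 185929.0))      # district AD
print(classify_area(1088.12, 129148.73))   # district U
print(classify_area(3000.0, 600000.0))     # hypothetical dense city
```

Applied to whole districts instead of raster cells, the rule ignores how the population is distributed inside a district, which is one reason the notebook does not rely on it.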

Therefore, it was chosen to follow an investigative approach to find the categories.

In [8]:
p7 = df.hvplot.violin(y = 'pop_dens_exp', ylabel = 'population density [i/km^2]', 
                                            title = 'distribution of population density in Baden-Württemberg')

p7

The violin plot shows that the majority of districts have a population density below 500 inhabitants per square kilometer. Several districts have a population density between 500 and about 2250 inhabitants per square kilometer. A clearly separated group ranges between 2500 and 3120 inhabitants per square kilometer.

Instead of separating the districts into the two groups urban and rural, they are split into five categories.

The previously mentioned literature uses 0 - 300: rural, 300 - 1500: suburban, > 1500: urban. Since there are many rural areas, more categories are introduced to differentiate them further: sparse: 0 - 150, rural: 150 - 300, subrural: 300 - 500, suburban: 500 - 1500, urban: 1500 - inf.

The categories are built from the population density and added to the merged dataset as the column 'dem_dim' (demographic dimension).
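Because the population density changes from year to year, a district close to a bin edge can switch category over time. The sketch below shows, on hypothetical densities for three made-up districts, how `pd.cut` assigns values at the edges and how switching districts could be listed; the names `demo`, `n_categories` and `switching` are illustrative.

```python
import pandas as pd

# hypothetical densities [inhabitants/km^2] for three districts over two years
demo = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C", "C"],
    "year": [2000, 2001] * 3,
    "pop_dens_exp": [149.0, 151.0, 300.0, 310.0, 1600.0, 1700.0],
})

bins = [0, 150, 300, 500, 1500, float("inf")]
labels = ["sparse", "rural", "subrural", "suburban", "urban"]

# intervals are right-closed by default, so a density of exactly 300 still counts as 'rural'
demo["dem_dim"] = pd.cut(demo["pop_dens_exp"], bins=bins, labels=labels)

# districts that fall into more than one category over the years
n_categories = demo.groupby("district")["dem_dim"].nunique()
switching = n_categories[n_categories > 1].index.tolist()
print(demo)
print(switching)
```

If switching districts turn out to be a problem, one option would be to categorize each district once, for example by its mean density over all years.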

In [9]:
# add the category column to the dataset
df['dem_dim'] = pd.cut(df['pop_dens_exp'], bins=[0, 150, 300, 500, 1500, float('Inf')], 
                                       labels=['sparse', 'rural', 'subrural', 'suburban', 'urban'])

In [10]:
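# positions of the districts that fall into more than one category over the years
t = np.where(df.groupby('district')['dem_dim'].nunique() > 1)
# districts that are 'subrural' in at least one year (the count is 1 by construction after filtering)
df[df.dem_dim == 'subrural'].groupby('district')['dem_dim'].nunique()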

district
BAD    1
BS     1
ENZ    1
GP     1
HNL    1
KAL    1
KN     1
RA     1
RM     1
RN     1
TUE    1
Name: dem_dim, dtype: int64

In [11]:
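# sum the numeric columns only; text and categorical columns cannot be summed meaningfully
total_case_distr = df.groupby(['district', 'year']).sum(numeric_only=True).reset_index()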
total_case_distr

Unnamed: 0,district,year,case_number,population_density_bw,pop_exp,pop_dens_exp,area_exp,area ex [km^2]
0,AD,2000,20.0,3528.0,2042577.36,1504.20,1629519.96,16295.16
1,AD,2001,14.0,3564.0,2059595.52,1516.68,1629519.96,16295.16
2,AD,2002,34.0,3576.0,2076613.80,1529.28,1629519.96,16295.16
3,AD,2003,7.0,3588.0,2093631.96,1541.76,1629519.96,16295.16
4,AD,2004,18.0,3600.0,2110650.12,1554.36,1629519.96,16295.16
...,...,...,...,...,...,...,...,...
941,WT,2017,138.0,3708.0,2030035.68,1794.72,1357383.96,13573.80
942,WT,2018,178.0,3720.0,2042523.96,1805.76,1357383.96,13573.80
943,WT,2019,9.0,3732.0,2055012.24,1816.80,1357383.96,13573.80
944,WT,2020,0.0,3732.0,2067500.52,1827.84,1357383.96,13573.80


In [12]:
# choose line colors
p8_col = Set3[12]

# make tabbed line plot: one tab per population-density category
tabs_list = []
for district_group in df.dem_dim.dropna().unique():
    data = df[df.dem_dim == district_group]
    p8 = figure(title = 'case number per district')
    dfs = data.groupby(['district', 'year'])['case_number'].sum().reset_index()
    # note: Set3 offers 12 colours, so zip() stops after 12 districts per category
    for (name, group), color in zip(dfs.groupby('district'), p8_col):
        p8.line(x = group.year, y = group.case_number, line_color = color, line_width = 2, legend_label = str(name))

    tabs_list.append(Panel(child = p8, title = str(district_group)))

tabs = Tabs(tabs = tabs_list)
show(tabs)

No clear trends per district group are visible. The plan was to apply smoothing and then test whether the categories differ significantly. However, given the visualisations, this does not seem meaningful. Trends may become visible when selecting by virus variant.
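One way such a comparison between two categories could be carried out is a permutation test on the difference in mean case rates. The sketch below is not part of the original analysis: the function `permutation_test` is written here for illustration, and the two lists of rates are hypothetical values, not results from the dataset.

```python
import numpy as np


def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in means of two samples."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # reshuffle the group labels and recompute the difference in means
        rng.shuffle(pooled)
        diff = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # add-one correction keeps the estimated p-value strictly above zero
    return (hits + 1) / (n_perm + 1)


# hypothetical yearly cases per 100,000 inhabitants for two district categories
sparse_rates = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7]
urban_rates = [25.2, 30.1, 27.8, 22.4, 29.5, 26.0]
p_value = permutation_test(sparse_rates, urban_rates)
print(f"p = {p_value:.4f}")
```

A permutation test makes no assumption about the distribution of the case numbers, which seems appropriate given the strongly skewed yearly counts; with five categories, pairwise tests would additionally need a correction for multiple comparisons.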

In [13]:
# Trying to smooth things out by forming averages
# .assign() returns a new frame, which avoids pandas' SettingWithCopyWarning
df_filtered_vt = df_filtered_vt.assign(dem_dim = pd.cut(df_filtered_vt['pop_dens_exp'], bins=[0, 150, 300, 500, 1500, float('Inf')], 
                                       labels=['sparse', 'rural', 'subrural', 'suburban', 'urban']))
filtered_cat = df_filtered_vt[['virus_type', 'district', 'dem_dim', 'case_number', 'year']].groupby(['year', 'dem_dim']).agg({'case_number':['min','max','mean']}).reset_index()
# the column names must follow the aggregation order (min, max, mean)
filtered_cat.columns = ["year", "dem_dim", "min", "max", "mean"]
filtered_cat

Unnamed: 0,year,dem_dim,min,max,mean
0,2000,sparse,0.0,28.0,1.123457
1,2000,rural,0.0,13.0,0.472222
2,2000,subrural,0.0,24.0,1.063492
3,2000,suburban,0.0,17.0,0.833333
4,2000,urban,0.0,32.0,2.000000
...,...,...,...,...,...
105,2021,sparse,0.0,1.0,0.013889
106,2021,rural,0.0,12.0,0.188034
107,2021,subrural,0.0,3.0,0.037037
108,2021,suburban,0.0,8.0,0.291667


In [14]:
p9 = figure(plot_width=1000, title = 'smoothed case numbers per district category using the mean', x_axis_label = 'years',
             y_axis_label = 'mean case numbers', x_range=[2000,2022],
             tools="hover", tooltips="@year:  @mean for @dem_dim")

for (name, group), color in zip(filtered_cat.groupby('dem_dim'), line_col_set3):
    cds = ColumnDataSource(data = group)

    # shade the range between minimum and maximum; the band needs the same data source as the line
    band = Band(base = 'year', lower = 'min', upper = 'max', source = cds, fill_color = color, fill_alpha = 0.2, line_color = color)
    p9.add_layout(band)

    p9.line(x = 'year', y = 'mean', source=cds, line_color = color, line_width = 2, legend_label = name)

# move the legend out of the plot area to the left of the figure
legend = p9.legend[0]
p9.center = [item for item in p9.center if not isinstance(item, Legend)]
p9.add_layout(legend, 'left')

show(p9)

In [15]:
p10 = figure(plot_width=1000, title = 'smoothed case numbers per district category using max case number', x_axis_label = 'years',
             y_axis_label = 'maximum case numbers', x_range=[2000,2022],
             tools="hover", tooltips="@year:  @max for @dem_dim")

for (name, group), color in zip(filtered_cat.groupby('dem_dim'), line_col_set3):
    cds = ColumnDataSource(data = group)
    p10.line(x = 'year', y = 'max', source=cds, line_color = color, line_width = 2, legend_label = name)

# move the legend of p10 (not p9) out of the plot area to the left of the figure
legend = p10.legend[0]
p10.center = [item for item in p10.center if not isinstance(item, Legend)]
p10.add_layout(legend, 'left')

show(p10)

# Study the spreading of virus variants on district level

Observe whether there is a shared pattern of case numbers in virus type per district category.

In [16]:
# choose line colors
p8_col = Set3[12]

# make tabbed line plot: one tab per population-density category
tabs_list = []
for district_group in df.dem_dim.dropna().unique():
    data = df[df.dem_dim == district_group]
    p8 = figure(title = 'case number per district')
    dfs = data.groupby(['district', 'year'])['case_number'].sum().reset_index()
    # note: Set3 offers 12 colours, so zip() stops after 12 districts per category
    for (name, group), color in zip(dfs.groupby('district'), p8_col):
        p8.line(x = group.year, y = group.case_number, line_color = color, line_width = 2, legend_label = str(name))

    tabs_list.append(Panel(child = p8, title = str(district_group)))

tabs = Tabs(tabs = tabs_list)
show(tabs)

# Study the spreading of virus variants per district category
First, sum everything up per category.
Second, form the mean per category.
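The two planned steps could look roughly as follows. This is a sketch on a hypothetical excerpt (three made-up districts, one year), not the result of running it on the merged dataset; the column names follow the notebook, while `demo`, `summary`, `total` and `mean_per_district` are names chosen for illustration.

```python
import pandas as pd

# hypothetical excerpt: one row per district, year and virus type
demo = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C", "C"],
    "dem_dim": ["rural", "rural", "rural", "rural", "urban", "urban"],
    "year": [2019] * 6,
    "virus_type": ["Influenza A Virus", "Influenza B Virus"] * 3,
    "case_number": [10.0, 2.0, 30.0, 4.0, 100.0, 20.0],
})

# first: total cases per category; second: mean over the districts of each category
summary = (
    demo.groupby(["dem_dim", "year", "virus_type"], as_index=False)
        .agg(total=("case_number", "sum"), mean_per_district=("case_number", "mean"))
)
print(summary)
```

The mean is less sensitive than the total to the number of districts in a category, which differs strongly between the five categories.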

# Final representation

The final representation shall be a geographical graphic (a map). For each district (except RT, which was excluded), the total case number as well as the case number per virus type shall be selectable. To make the districts comparable, the case numbers are normalized by population density. The year can be selected with a slider.
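The normalization per district could be sketched as follows. The data are hypothetical values in the layout of the merged dataset; besides the ratio to population density, the sketch also computes cases per 100,000 inhabitants as an alternative that this notebook mentions but does not implement. The names `demo`, `per_district`, `cases_per_pop_dens` and `cases_per_100k` are illustrative.

```python
import pandas as pd

# hypothetical excerpt in the layout of the merged dataset
demo = pd.DataFrame({
    "district": ["AD", "AD", "U", "U"],
    "year": [2019] * 4,
    "virus_type": ["Influenza A Virus", "Influenza B Virus"] * 2,
    "case_number": [120.0, 30.0, 200.0, 50.0],
    "pop_exp": [200000.0, 200000.0, 125000.0, 125000.0],
    "pop_dens_exp": [150.0, 150.0, 1000.0, 1000.0],
})

# sum the cases, but take population and density only once per district and year:
# they are repeated on every virus-type row, so summing them would inflate them
per_district = demo.groupby(["district", "year"], as_index=False).agg(
    case_number=("case_number", "sum"),
    pop_exp=("pop_exp", "first"),
    pop_dens_exp=("pop_dens_exp", "first"),
)

per_district["cases_per_pop_dens"] = per_district["case_number"] / per_district["pop_dens_exp"]
per_district["cases_per_100k"] = per_district["case_number"] / per_district["pop_exp"] * 100_000
print(per_district)
```

Which of the two ratios is shown on the map is a design decision: the ratio to density follows the earlier cells of this notebook, while cases per 100,000 inhabitants is the more common epidemiological measure.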

In [42]:
import folium #as fm
import pandas as pd
import param
import panel as pn
import random
pn.extension(sizing_mode="stretch_width")

In [40]:
test = df[['district', 'year', 'case_number']]
test1 = test[test.year == 2009].groupby(['district','year']).sum().reset_index()
test1.head()

Unnamed: 0,district,year,case_number
0,AD,2009,327.0
1,BAD,2009,55.0
2,BB,2009,624.0
3,BC,2009,269.0
4,BHS,2009,335.0


In [187]:
# denominations in the json file
import json
communities_geo = r'C:/Data_Science_for_Life_Sciences_MASTER/programming1/programming_1_influenza/data/geodata/landkreise_simplify200.geojson'

# open the json file - json.load() methods returns a python dictionary
with open(communities_geo, 'rb') as communities_file:
    communities_json = json.load(communities_file)


# loop through the features to print the district name ('GEN') of each entry in the json file
for feature in communities_json['features']:
    print(feature['properties']['GEN'])

Stuttgart
Boeblingen
Esslingen
Goeppingen
Ludwigsburg
Rems-Murr-Kreis
Heilbronn
Heilbronn
Hohenlohekreis
Schwaebisch Hall
Main-Tauber-Kreis
Heidenheim
Ostalbkreis
Baden-Baden
Karlsruhe
Karlsruhe
Rastatt
Heidelberg
Mannheim
Neckar-Odenwald-Kreis
Rhein-Neckar-Kreis
Pforzheim
Calw
Enzkreis
Freudenstadt
Freiburg im Breisgau
Breisgau-Hochschwarzwald
Emmendingen
Ortenaukreis
Rottweil
Schwarzwald-Baar-Kreis
Tuttlingen
Konstanz
Loerrach
Waldshut
Tuebingen
Zollernalbkreis
Ulm
Alb-Donau-Kreis
Biberach
Bodenseekreis
Ravensburg
Sigmaringen
Konstanz


In [36]:
# import complete dataset (with RT)
df_complete = pd.read_csv(path + 'influenza_pop_dens_merged.csv', sep = '\t')
df_tc = df_complete[['district', 'year', 'case_number']].groupby(['district', 'year']).sum().reset_index()
df_tc[df_tc.district == 'SHA']
# preprocess data: wide format is required, which means the years need to be headers and the case numbers need 
#to be distributed accordingly
df_tc_wide = df_tc.pivot_table(values='case_number', index='district', columns='year').reset_index()
df_tc_wide
df_complete

Unnamed: 0.1,Unnamed: 0,virus_type,district,season,case_number,year,area [ha],population,population_density,population_density_bw,pop_exp,pop_dens_exp,area_exp,area [km^2],area ex [km^2]
0,0,-nicht erhoben-,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
1,1,-nicht ermittelbar-,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
2,2,Influenza A Virus,AD,2000/01,20.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
3,3,Influenza A(H1N1) Virus (vorpandemisch),AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
4,4,Influenza A(H3N2) Virus,AD,2000/01,0.0,2000,135732.0,185929.0,137.0,294.0,170214.78,125.35,135793.33,1357.9333,1357.93
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11347,11347,Influenza A(H1N1)pdm09 Virus,U,2021/22,0.0,2021,,,,,129148.73,1088.12,,,118.69
11348,11348,Influenza A(H1N2) Virus,U,2021/22,0.0,2021,,,,,129148.73,1088.12,,,118.69
11349,11349,Influenza C Virus,U,2021/22,0.0,2021,,,,,129148.73,1088.12,,,118.69
11350,11350,Influenza A Virus (zoonotisch),U,2021/22,0.0,2021,,,,,129148.73,1088.12,,,118.69


In [45]:
# names in the data frame
dataframe_names = df_tc_wide.district.unique().tolist()
dataframe_names
# names in the json file - the same order as in the data frame 
geojson_names = ['Alb-Donau-Kreis', 'Baden-Baden', 'Boeblingen', 'Biberach', 'Breisgau-Hochschwarzwald', 'Zollernalbkreis', 
                 'Bodenseekreis',   'Calw', 'Emmendingen', 'Enzkreis', 'Esslingen', 'Freudenstadt', 'Freiburg im Breisgau',
                 'Goeppingen', 'Heidelberg', 'Heidenheim', 'Heilbronn', 'Heilbronn',  'Karlsruhe',  'Karlsruhe', 'Konstanz',
                 'Hohenlohekreis', 'Loerrach', 'Ludwigsburg', 'Mannheim', 'Main-Tauber-Kreis', 'Neckar-Odenwald-Kreis',
                 'Ostalbkreis', 'Ortenaukreis', 'Pforzheim', 'Rastatt', 'Rems-Murr-Kreis', 'Rhein-Neckar-Kreis', 'Reutlingen',
                 'Ravensburg', 'Rottweil', 'Stuttgart', 'Schwaebisch Hall', 'Sigmaringen', 'Tuebingen', 'Tuttlingen',
                 'Ulm', 'Schwarzwald-Baar-Kreis', 'Waldshut']

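# replace data frame names by json names; the pairing is positional,
# so both lists must have the same length and the same order
df_tc_wide['district'] = df_tc_wide['district'].replace(dict(zip(dataframe_names, geojson_names)))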
#df_tc_wide[df_tc_wide.district == 'Waldshut']
dataframe_names

['Alb-Donau-Kreis',
 'Baden-Baden',
 'Boeblingen',
 'Biberach',
 'Breisgau-Hochschwarzwald',
 'Zollernalbkreis',
 'Bodenseekreis',
 'Calw',
 'Emmendingen',
 'Enzkreis',
 'Esslingen',
 'Freudenstadt',
 'Freiburg im Breisgau',
 'Goeppingen',
 'Heidelberg',
 'Heidenheim',
 'Heilbronn',
 'Karlsruhe',
 'Konstanz',
 'Hohenlohekreis',
 'Loerrach',
 'Ludwigsburg',
 'Mannheim',
 'Main-Tauber-Kreis',
 'Neckar-Odenwald-Kreis',
 'Ostalbkreis',
 'Ortenaukreis',
 'Pforzheim',
 'Rastatt',
 'Rems-Murr-Kreis',
 'Rhein-Neckar-Kreis',
 'Reutlingen',
 'Ravensburg',
 'Rottweil',
 'Stuttgart',
 'Schwaebisch Hall',
 'Sigmaringen']
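Since the renaming relies on list order, it may be worth checking which names have no counterpart in the GeoJSON file before drawing the map. The sketch below uses an explicit code-to-name mapping instead of two parallel lists. The first five pairs follow the positional pairing used in this notebook; the 'RT' entry is an assumption, and `code_to_name`, `geojson_gen` and `unmatched` are illustrative names. The set `geojson_gen` contains only a subset of the names printed from the GeoJSON file earlier.

```python
# district code -> GeoJSON name; the first five pairs follow the positional pairing
# used in this notebook, the 'RT' entry is an assumption
code_to_name = {
    "AD": "Alb-Donau-Kreis",
    "BAD": "Baden-Baden",
    "BB": "Boeblingen",
    "BC": "Biberach",
    "BHS": "Breisgau-Hochschwarzwald",
    "RT": "Reutlingen",
}

# subset of the 'GEN' values printed from the GeoJSON file
geojson_gen = {
    "Alb-Donau-Kreis", "Baden-Baden", "Boeblingen", "Biberach",
    "Breisgau-Hochschwarzwald", "Stuttgart",
}

# names that would not be matched by key_on='feature.properties.GEN'
unmatched = sorted(name for name in code_to_name.values() if name not in geojson_gen)
print(unmatched)
```

Note that the printed GeoJSON names contain 'Heilbronn' and 'Karlsruhe' twice (city and rural district), so the 'GEN' property alone cannot tell these districts apart.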

In [43]:
#https://codefor.de/projekte/hn-geojson-utilities/ 
#http://opendatalab.de/projects/geojson-utilities/
districts_geo = r'C:/Data_Science_for_Life_Sciences_MASTER/programming1/programming_1_influenza/data/geodata/landkreise_simplify200.geojson'

# create the base map
communities_map = folium.Map(location=[48.758339, 8.243008], zoom_start=7.5, tiles='stamenwatercolor')

# generate choropleth map
# (Map.choropleth() is deprecated and no longer available in newer folium releases; folium.Choropleth replaces it)
folium.Choropleth(
    geo_data=districts_geo,
    data=df_tc_wide,
    columns=['district', 2019],
    key_on='feature.properties.GEN',
    fill_color='YlGnBu', 
    fill_opacity=1, 
    line_opacity=1,
    legend_name='case_number',
    smooth_factor=0).add_to(communities_map)

# display map
communities_map

