In [1]:
import pandas as pd
import plotly.express as px
import numpy as np

In [2]:
vaccindata = pd.read_excel("./Data/Covid19_Vaccine.xlsx", sheet_name="Vaccinerade kommun och ålder")
vaccindata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Län                  2900 non-null   int64  
 1   Län_namn             2900 non-null   object 
 2   Kommun               2900 non-null   int64  
 3   Kommun_namn          2900 non-null   object 
 4   Ålder                2900 non-null   object 
 5   Befolkning           2900 non-null   int64  
 6   Antal minst 1 dos    2900 non-null   int64  
 7   Antal minst 2 doser  2900 non-null   int64  
 8   Antal 3 doser        2320 non-null   float64
 9   Antal 4 doser        870 non-null    float64
 10  Andel minst 1 dos    2900 non-null   float64
 11  Andel minst 2 doser  2900 non-null   float64
 12  Andel 3 doser        2320 non-null   float64
 13  Andel 4 doser        870 non-null    float64
dtypes: float64(6), int64(5), object(3)
memory usage: 317.3+ KB


In [3]:
vaccindata.sample(3)

,Län,Län_namn,Kommun,Kommun_namn,Ålder,Befolkning,Antal minst 1 dos,Antal minst 2 doser,Antal 3 doser,Antal 4 doser,Andel minst 1 dos,Andel minst 2 doser,Andel 3 doser,Andel 4 doser
1678,14,Västra Götalands län,1484,Lysekil,80-89,1019,1000,999,979.0,908.0,0.981354,0.980373,0.960746,0.89107
1152,12,Skåne län,1278,Båstad,18-29,1565,1207,1167,501.0,,0.771246,0.745687,0.320128,
219,1,Stockholms län,186,Lidingö,90 eller äldre,725,706,702,688.0,629.0,0.973793,0.968276,0.948966,0.867586


Finding out the number of counties in the dataset:

In [4]:
vaccindata["Län"].nunique()

21

Double-checking the information

In [5]:
vaccindata["Län_namn"].nunique()

21

Alternatively, we can write a helper function that performs the same double-check and prints the count only if the two columns agree.

In [6]:
from Functions import count_and_check

In [7]:
count_and_check("Number of counties in the dataset: ", vaccindata, "Län", "Län_namn")

Number of counties in the dataset:  21
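
The `Functions` module itself is not shown in this notebook, so the following is only a minimal sketch of what a helper like `count_and_check` might look like. The body, the mismatch message and the return value are assumptions; only the call signature is taken from the cell above.

```python
import pandas as pd


def count_and_check(message, df, code_col, name_col):
    """Hypothetical sketch: count unique values in two columns and
    print the count only if both columns agree."""
    n_codes = df[code_col].nunique()
    n_names = df[name_col].nunique()
    if n_codes == n_names:
        print(message, n_codes)
        return n_codes
    # The counts disagree, so report both instead of a single answer
    print(f"Mismatch: {n_codes} unique values in {code_col!r}, "
          f"{n_names} in {name_col!r}")
    return None


# Toy data, not the real vaccination file
toy = pd.DataFrame({
    "Län": [1, 1, 3],
    "Län_namn": ["Stockholms län", "Stockholms län", "Uppsala län"],
})
result = count_and_check("Number of counties in the dataset: ", toy, "Län", "Län_namn")
```

Comparing only the number of unique values is a weak check: it would not catch a code that maps to two different names if another name were missing. A stricter variant could count unique (code, name) pairs instead.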


As we can see, there are 21 counties in the dataset, which corresponds to the total number of counties in Sweden.

In [8]:
count_and_check("Number of municipalities in the dataset: ", vaccindata, "Kommun", "Kommun_namn")

Number of municipalities in the dataset:  290


As we can see, there are 290 municipalities in the dataset, which means all Swedish municipalities are covered.

In [9]:
dataset_population = vaccindata["Befolkning"].sum()
dataset_population

9092790

The population represented in the dataset is 9 092 790 people.

In [10]:
age_groups = vaccindata["Ålder"].unique()
age_groups

array(['12-15', '16-17', '18-29', '30-39', '40-49', '50-59', '60-69',
       '70-79', '80-89', '90 eller äldre'], dtype=object)

As we can see, the dataset has no age group for children aged 0-11.
So we cannot calculate the number of children under 18 from the dataset alone.
Instead, we estimate it using the following steps:
1. Find the total population of Sweden for the year 2022.
2. Find the number of adults (18+) in the dataset.
3. Subtract the second number from the first; the difference is an estimate of the number of children under 18 in Sweden.

According to [this source](https://www.macrotrends.net/countries/SWE/sweden/population), the population of Sweden in 2022 was 10,549,347 people.

In [11]:
total_population = 10549347

In [12]:
# Exclude the two under-18 age groups so that only adults (18+) remain
dataset_adults = vaccindata[~vaccindata["Ålder"].isin(['12-15', '16-17'])]["Befolkning"].sum()
dataset_adults

8347420

In [13]:
total_children = total_population - dataset_adults
print("The number of children under the age of 18 in Sweden according to the dataset is ", total_children)

The number of children under the age of 18 in Sweden according to the dataset is  2201927


As we can see, the estimated number of children under the age of 18 in Sweden is 2 201 927.

In [14]:
# The dataset covers ages 12 and up, so the remainder is attributed to ages 0-11
children_0_11 = total_population - dataset_population
children_0_11

1456557

Of these, children aged 0-11 (not represented in the dataset) number 1 456 557.

In [15]:
ages_data = vaccindata.groupby("Ålder")["Befolkning"].sum()
ages_data = pd.concat([ages_data, pd.Series({'0-11': children_0_11})])
ages_data

12-15              503831
16-17              241539
18-29             1475950
30-39             1467590
40-49             1298156
50-59             1339798
60-69             1121922
70-79             1033113
80-89              496750
90 eller äldre     114141
0-11              1456557
dtype: int64
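
As a quick sanity check, the three under-18 groups should add up to the under-18 total computed earlier. The sketch below copies the figures from the outputs above. Keep in mind that the total population comes from an external source rather than from the dataset, so the 0-11 figure is an estimate.

```python
# Figures copied from the notebook outputs above
total_population = 10_549_347   # external 2022 estimate
dataset_adults = 8_347_420      # ages 18+ in the dataset
children_0_11 = 1_456_557       # total_population - dataset_population
children_12_15 = 503_831
children_16_17 = 241_539

# Under-18 total computed two ways: by subtraction and by adding the parts
total_children = total_population - dataset_adults
under_18_by_parts = children_0_11 + children_12_15 + children_16_17

assert under_18_by_parts == total_children
print(total_children)
```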

In [16]:
ages = vaccindata['Ålder'].unique()
ages = np.append(ages, '0-11')
ages

array(['12-15', '16-17', '18-29', '30-39', '40-49', '50-59', '60-69',
       '70-79', '80-89', '90 eller äldre', '0-11'], dtype=object)

We could already draw a chart of the Swedish population, but I want to arrange the data into a data frame first, in case further manipulation is needed.

In [17]:
age_dic = {"age_group": ages,
           "population": ages_data}

age_frame = pd.DataFrame(age_dic)
age_frame

,age_group,population
12-15,12-15,503831
16-17,16-17,241539
18-29,18-29,1475950
30-39,30-39,1467590
40-49,40-49,1298156
50-59,50-59,1339798
60-69,60-69,1121922
70-79,70-79,1033113
80-89,80-89,496750
90 eller äldre,90 eller äldre,114141
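
Note that the cell above pairs a label array with a Series by position, so the labels match the counts only because both happen to be in the same order. A slightly more robust alternative, sketched here on toy values rather than the real counts, builds the frame from the Series alone so that each label always stays attached to its own value. The resulting frame has a plain integer index instead of the age-group index shown above.

```python
import pandas as pd

# Toy series standing in for ages_data (values are illustrative, not the real counts)
ages_data = pd.Series({"12-15": 5, "16-17": 2, "0-11": 14})

age_frame = (
    ages_data
    .rename_axis("age_group")        # name the index so it becomes a column
    .reset_index(name="population")  # move the index into a column, name the values
)
print(age_frame)
```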


In [18]:
fig = px.pie(age_frame, values='population', names='age_group', title="Swedish population by age group")
fig.show()
fig.write_html("./Visualizations/Plotly parts 1_2/Swedish_population_age.html")


Let's find the percentage of people vaccinated with 1, 2 and 3 doses in each county.

In [19]:
counties_vaccinated = vaccindata.groupby('Län_namn')[
    ['Antal minst 1 dos', 'Antal minst 2 doser', 'Antal 3 doser', 'Befolkning']
    ].sum() 
counties_vaccinated

Län_namn,Antal minst 1 dos,Antal minst 2 doser,Antal 3 doser,Befolkning
Blekinge län,122500,120727,92259.0,139327
Dalarnas län,221420,218009,164296.0,252075
Gotlands län,48785,47930,37423.0,53924
Gävleborgs län,220389,215267,159636.0,252216
Hallands län,259143,255329,191997.0,295663
Jämtlands län,102236,100525,73332.0,115398
Jönköpings län,274960,270266,199488.0,317355
Kalmar län,190931,188522,147192.0,216763
Kronobergs län,149141,146494,103745.0,175503
Norrbottens län,198514,195919,149293.0,220199


In [20]:
counties_vaccinated["% 1 dos"] = counties_vaccinated["Antal minst 1 dos"] / counties_vaccinated["Befolkning"] * 100
counties_vaccinated["% 2 doser"] = counties_vaccinated["Antal minst 2 doser"] / counties_vaccinated["Befolkning"] * 100
counties_vaccinated["% 3 doser"] = counties_vaccinated["Antal 3 doser"] / counties_vaccinated["Befolkning"] * 100
pd.set_option("display.float_format", '{:.2f}'.format)
counties_vaccinated

Län_namn,Antal minst 1 dos,Antal minst 2 doser,Antal 3 doser,Befolkning,% 1 dos,% 2 doser,% 3 doser
Blekinge län,122500,120727,92259.0,139327,87.92,86.65,66.22
Dalarnas län,221420,218009,164296.0,252075,87.84,86.49,65.18
Gotlands län,48785,47930,37423.0,53924,90.47,88.88,69.4
Gävleborgs län,220389,215267,159636.0,252216,87.38,85.35,63.29
Hallands län,259143,255329,191997.0,295663,87.65,86.36,64.94
Jämtlands län,102236,100525,73332.0,115398,88.59,87.11,63.55
Jönköpings län,274960,270266,199488.0,317355,86.64,85.16,62.86
Kalmar län,190931,188522,147192.0,216763,88.08,86.97,67.9
Kronobergs län,149141,146494,103745.0,175503,84.98,83.47,59.11
Norrbottens län,198514,195919,149293.0,220199,90.15,88.97,67.8


In [21]:
fig_procent = px.bar(counties_vaccinated, 
             x = counties_vaccinated.index,
             y = ["% 1 dos", "% 2 doser", "% 3 doser"],
             title = "Percentage vaccinated per county",
             barmode = 'group',
             labels = dict(value = '%', variable = 'number of doses', Län_namn = 'County')
             )

fig_procent.show()
fig_procent.write_html("./Visualizations/Plotly parts 1_2/percentage_vaccinated_per_county.html")

Of course, I could have produced the previous graph more directly from the "Andel..." columns, as I do below for Stockholms and Västra Götalands counties. That route needs some data cleaning first, because those columns contain missing (NaN) values. It also lets us check the dataset for consistency: the figures obtained the first way should be close to those derived from the "Andel..." columns.

In [22]:
vaccindata.isna().sum()

Län                       0
Län_namn                  0
Kommun                    0
Kommun_namn               0
Ålder                     0
Befolkning                0
Antal minst 1 dos         0
Antal minst 2 doser       0
Antal 3 doser           580
Antal 4 doser          2030
Andel minst 1 dos         0
Andel minst 2 doser       0
Andel 3 doser           580
Andel 4 doser          2030
dtype: int64

In [23]:
vaccindata["Andel 3 doser"] = vaccindata["Andel 3 doser"].fillna(0)
vaccindata["Andel 4 doser"] = vaccindata["Andel 4 doser"].fillna(0)
vaccindata.isna().sum()

Län                       0
Län_namn                  0
Kommun                    0
Kommun_namn               0
Ålder                     0
Befolkning                0
Antal minst 1 dos         0
Antal minst 2 doser       0
Antal 3 doser           580
Antal 4 doser          2030
Andel minst 1 dos         0
Andel minst 2 doser       0
Andel 3 doser             0
Andel 4 doser             0
dtype: int64
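
Replacing missing shares with 0 is a modelling choice and it affects any average taken afterwards. By default `mean()` in pandas skips NaN values, whereas a filled-in 0 is counted as a real observation and pulls the mean down. A minimal sketch with made-up shares (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy shares: two age groups with data, one with no data (NaN)
shares = pd.Series([0.60, 0.80, np.nan])

mean_skipping_nan = shares.mean()          # NaN ignored: (0.60 + 0.80) / 2
mean_with_zeros = shares.fillna(0).mean()  # NaN counted as 0: (0.60 + 0.80 + 0) / 3

print(round(mean_skipping_nan, 3), round(mean_with_zeros, 3))
```

Which of the two is appropriate depends on the question: filling with 0 treats the groups without data as unvaccinated members of the population, while skipping NaN restricts the average to the groups that have data.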

In [24]:
counties = vaccindata.groupby("Län_namn")

In [25]:
counties_procent_vacc = counties[["Andel minst 1 dos", "Andel minst 2 doser", "Andel 3 doser", "Andel 4 doser"]].mean()*100
counties_procent_vacc

Län_namn,Andel minst 1 dos,Andel minst 2 doser,Andel 3 doser,Andel 4 doser
Blekinge län,85.65,84.2,59.59,24.82
Dalarnas län,85.56,84.13,59.48,26.44
Gotlands län,89.03,87.34,62.25,26.34
Gävleborgs län,85.17,82.89,57.96,25.3
Hallands län,85.92,84.42,59.94,25.79
Jämtlands län,86.09,84.4,57.85,24.87
Jönköpings län,86.0,84.67,59.98,25.92
Kalmar län,85.13,83.95,60.3,25.82
Kronobergs län,82.5,81.02,55.65,24.57
Norrbottens län,87.03,85.54,60.23,25.44


In [26]:
stockholm_vastra_counties = counties_procent_vacc.loc[["Stockholms län", "Västra Götalands län"]]

In [27]:
stockholm_vastra_counties

Län_namn,Andel minst 1 dos,Andel minst 2 doser,Andel 3 doser,Andel 4 doser
Stockholms län,83.59,80.59,56.83,24.23
Västra Götalands län,85.94,83.97,57.79,25.34


In [28]:
data_from_first_calc = counties_vaccinated.loc[["Stockholms län", "Västra Götalands län"]][
    ["% 1 dos", "% 2 doser", "% 3 doser"]
    ]
data_from_first_calc

Län_namn,% 1 dos,% 2 doser,% 3 doser
Stockholms län,83.12,80.11,56.95
Västra Götalands län,85.51,83.36,58.09


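So the two calculations agree fairly well. The small differences most likely arise because the first calculation divides county totals, so every row is weighted by its population, while the second takes an unweighted mean of the row-level shares, so small municipalities and age groups count as much as large ones.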
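
One likely reason the two tables differ is that a ratio of sums and a mean of ratios are not the same quantity. The sketch below uses a toy county with two rows of very different size. The numbers are invented for illustration and only the column names follow the dataset.

```python
import pandas as pd

# Toy county with two rows of very different size (illustrative numbers only)
toy = pd.DataFrame({
    "Befolkning": [1000, 100],
    "Antal minst 1 dos": [800, 95],
})
toy["Andel minst 1 dos"] = toy["Antal minst 1 dos"] / toy["Befolkning"]

# Ratio of sums: every person counts equally (the approach of cell 20)
weighted = toy["Antal minst 1 dos"].sum() / toy["Befolkning"].sum() * 100

# Mean of row shares: every row counts equally (the approach of cell 25)
unweighted = toy["Andel minst 1 dos"].mean() * 100

print(round(weighted, 2), round(unweighted, 2))
```

Here the small row with a high share lifts the unweighted mean well above the population-weighted figure. In the real data the rows within a county are more alike, so the gap is much smaller.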

In [29]:
fig_two_counties = px.bar(stockholm_vastra_counties, 
        title = "Stockholm and Västra Götalands counties",
        barmode = 'group',
        labels = dict(value = '%', variable = 'Number of doses', Län_namn = 'County')
             )
fig_two_counties.show()
fig_two_counties.write_html("./Visualizations/Plotly parts 1_2/Percentage_two_counties.html")