<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/4_1_visualization_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Covid-19 Vaccines Analysis
Many vaccines have been introduced so far to fight Covid-19. No vaccine guarantees 100% efficacy, and manufacturers do not claim otherwise, but vaccination can still save lives by building immunity.

Thus, each country tries to vaccinate a large part of its population, usually without depending on a single vaccine. That is what we analyze in this exercise: how many vaccinations each country has administered to fight Covid-19.

We use the data from [this Kaggle dataset](https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress). Feel free to explore the data and its columns.

Let's get the data first.

In [None]:
!pip install opendatasets
!pip install wordcloud

In [None]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress", force=True)

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# Load the data into Pandas DataFrame
# Hint: use pd.read_csv and pass "/content/covid-world-vaccination-progress/country_vaccinations.csv"
df = <your-code-here>

# Look at first few rows
df.head()

As we can see, the dataset has columns like `country`, `iso_code`, `date`, `total_vaccinations`, `people_vaccinated`, `people_fully_vaccinated`, etc.

An initial look at the table above shows that the data contains null values too. We will deal with them later.

When a dataset has this many columns, the `info()` function is the usual way to get an overview: the data type of each column, the number of non-null values per column, and the memory usage.

In [None]:
# Get info about the data
# Hint: use df.info()
<your-code-here>

The output above shows that the dataset contains many null values, which we will deal with later. It also shows two data types: `object` (used here for strings) and `float64`.

In [None]:
# Total count of null values
# Hint: use df.isnull() and then .sum()
<your-code-here>
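
As a minimal sketch of what the null count reports, here is the same idea on a toy frame. The values are made up for illustration and are not taken from the Kaggle data; `null_share` is an extra that is not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "total_vaccinations": [10.0, np.nan, 5.0],
    "people_vaccinated": [np.nan, np.nan, 3.0],
})

# isnull() gives a boolean frame; sum() counts the True values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Share of missing values per column, as a percentage
null_share = (toy.isnull().mean() * 100).round(1)
print(null_share)
```

Looking at the share rather than the raw count can make it easier to judge which columns are too sparse to analyze.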

## Data Cleaning
When we work with real datasets, we have to check the data quality before starting any type of analysis.
Here we go through some of the necessary data cleaning steps:

In [7]:
# Simply replacing null values with 0
# Hint: use df.fillna and pass value=0 and inplace=True
<your-code-here>

# Converting floats to int because they are count values
# Hint: use df.dtypes.items() to loop over columns and their type
for column, dtype in <your-code-here>:
  if dtype=='float64':
    # Hint: use df[column] and .astype(int)
    df[column] = <your-code-here>

# Extract Year, Month and Day from the date into separate columns
# Hint: use pd.DatetimeIndex(df['date']) and use year, month and day to extract each
df['year'] = <your-code-here>
df['month'] = <your-code-here>
df['day'] = <your-code-here>
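
The three cleaning steps can be sketched on a toy frame as follows. The data is made up, and the sketch reassigns the result of `fillna` instead of using `inplace=True`, which is a stylistic choice and not the exercise's solution.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "date": ["2021-01-30", "2021-01-31", "2021-02-01"],
    "daily_vaccinations": [100.0, np.nan, 250.0],
})

# Step 1: replace missing values with 0
toy = toy.fillna(0)

# Step 2: cast every float64 column to int
for column, dtype in toy.dtypes.items():
    if dtype == "float64":
        toy[column] = toy[column].astype(int)

# Step 3: split the date into year, month and day columns
dates = pd.DatetimeIndex(toy["date"])
toy["year"] = dates.year
toy["month"] = dates.month
toy["day"] = dates.day

print(toy)
```

Keep in mind that `astype(int)` truncates the fractional part, so a ratio column such as `people_vaccinated_per_hundred` loses its decimals when it is cast this way.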

### Basic Statistics
Let's look at a few statistics from the data:

In [None]:
# Hint: use .nunique() on df.country and df.vaccines
print("Number of countries:", <your-code-here>)
print("Number of vaccines:", <your-code-here>)

# Hint: use df.date.min() and .max() to capture the date range
print(f"Data range from {<your-code-here>} to {<your-code-here>}")
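
Here is a small sketch of these statistics on made-up data. It assumes that the `vaccines` column stores each combination of vaccines as one comma-separated string; under that assumption `nunique()` counts combinations, and counting individual vaccines needs an extra split.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "vaccines": ["X", "X", "X, Y", "Y"],
    "date": ["2021-03-01", "2021-01-15", "2020-12-20", "2021-02-02"],
})

n_countries = toy.country.nunique()
n_vaccine_combinations = toy.vaccines.nunique()

# Split the combinations into single names before counting (assumes ", " as separator)
n_single_vaccines = toy.vaccines.str.split(", ").explode().nunique()

# ISO-formatted date strings (YYYY-MM-DD) sort correctly even as plain text
first_date, last_date = toy.date.min(), toy.date.max()

print(n_countries, n_vaccine_combinations, n_single_vaccines)
print(f"Data range from {first_date} to {last_date}")
```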

## Total Vaccines per Country

Let's calculate the total number of vaccinations per country to answer:
 - What are the top countries by total vaccinations?

In [None]:
# Calculate total number of vaccines per country
#   - Group data by country: df.groupby('country')
#   - Select the field: .total_vaccinations
#   - Get maximum value of the field per group: .max()
#   - Sort Descending: .sort_values(ascending=False)
countries_total_vaccines_df = <your-code-here>

# Print top countries with highest number of vaccines
print(countries_total_vaccines_df.head())
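
The chain of steps listed in the comments can be sketched on toy data like this (values are made up). Because `total_vaccinations` is cumulative, the maximum per country is the latest total, whereas a sum would count the same doses many times.

```python
import pandas as pd

# Toy frame with made-up cumulative totals; the 0 stands for a filled missing value
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "total_vaccinations": [10, 40, 25, 0, 70],
})

totals = (
    toy.groupby("country")             # one group per country
       .total_vaccinations             # select the field
       .max()                          # largest cumulative value per group
       .sort_values(ascending=False)   # highest first
)
print(totals)
```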

## Bar Plot
We can use a bar plot to show the countries with the most vaccinations. This plot shows how many vaccinations each country has administered so far in comparison to the others.

In [None]:
# Top-100 countries
# Hint: use .iloc[:100] on countries_total_vaccines_df
# then .plot.bar and pass figsize=(20,5)
ax_ = <your-code-here>

_ = ax_.set_title("Top-100 Countries With The Most Vaccinations Overall")
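
A minimal sketch of the same plotting call on a toy Series follows. The values are made up, and the `Agg` backend line is only there so that the sketch also runs outside a notebook; in Colab it is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import pandas as pd

# Toy Series, already sorted in descending order (made-up values)
totals = pd.Series({"C": 70, "A": 40, "B": 25, "D": 5}, name="total_vaccinations")

# Keep the first three entries and draw one bar per country
ax = totals.iloc[:3].plot.bar(figsize=(6, 3))
ax.set_title("Top-3 Countries (toy data)")
ax.set_ylabel("Total vaccinations")
```

`plot.bar` returns a matplotlib `Axes` object, which is why the title can be set on it afterwards.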

## Word Art of Countries
A word cloud is a distinctive way to present information from a dataset. Each word is drawn at a size proportional to a weight, which is usually how often the word occurs in the data. We create it with the `wordcloud` library.

Here we show the country names sized by their total vaccinations.

In [None]:
from wordcloud import WordCloud

# Convert to dictionary
# Hint: use .to_dict() on countries_total_vaccines_df
countries_total_vaccines = <your-code-here>

# Create WordCloud of country names using their total vaccines
# Hint: use WordCloud()
wc = <your-code-here>

# Hint: use generate_from_frequencies and pass countries_total_vaccines
wc.<your-code-here>

# Plot the word cloud
plt.figure(figsize=(15,7))
plt.axis('off')

# Plot the output of WordCloud, which is an image
# Hint: use plt.imshow(wc)
<your-code-here>

plt.show()

## Vaccination Trend

### Line Plot
To check the vaccination trend in each country, we draw a line plot where the x-axis is the date, the y-axis is the number of daily vaccinations, and the color is set by country.

### Plotly
Let's use `plotly`, another useful Python package for visualization.
It gives us an `interactive` plot: you can hover over it to see more detailed information about individual points.

In [None]:
import plotly.express as px

# Plot an interactive line plot
# Hint: use px.line and pass df, x='date', y='daily_vaccinations' and color='country'
fig = <your-code-here>

# Set the title
fig.update_layout(
    title={
            'text' : "Daily vaccination trend",
            'y':0.95, # Proportion from the bottom
            'x':0.5 # Proportion from the left
        },
    xaxis_title="Date",
    yaxis_title="Daily Vaccinations"
)

# Show the plot
fig.show()

As we can see, the trend is mixed across countries: at some times a country's daily vaccinations rise, and at other times they fall.

## Missing Values in Cumulative Fields
Previously we simply replaced the missing values with 0. Let's take a look at what it means to have zeros in a field that accumulates over time (a cumulative field):

Let's consider the total number of people vaccinated (the `people_vaccinated` column). We expect this value to accumulate and increase over time. If we don't have the value for one specific date (a NULL value), what does it mean? Does it mean that the vaccinations were taken back from people who had already been vaccinated!? :D I don't think so!

In the worst-case scenario, we can assume that there were no new vaccinations on that day, which keeps the value of the field the same as before.

So let's replace the zero values with the previous existing value in the `people_vaccinated` column and see the difference for a sample country that has missing values:

In [None]:
# Sample country with missing values
country = 'Ireland'

# Sample cumulative column with missing values
column = 'people_vaccinated'

# Select the data only for the given country and field
# Hint: use df.query and pass f"country=='{country}'"
# then select [['date', column]]
selected_df = <your-code-here>

# Create a plot
fig, ax = plt.subplots(1,2, figsize=(15,5))

# Plot the data with missing values
# Hint: use selected_df.plot and pass x='date', y=column and ax=ax[0]
<your-code-here>

# Replace the zeros by forward fill (ffill), which propagates the last valid observation forward
# Hint: use selected_df.replace and pass to_replace=0, method='ffill'
# (newer pandas versions deprecate `method` here; there, use selected_df.replace(0, np.nan).ffill())
no_missing_df = <your-code-here>

# Plot the data without missing values
no_missing_df.plot(x='date', y=column, ax=ax[1])
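
Since newer pandas releases deprecate the `method` argument of `replace`, here is a hedged sketch of the same forward fill that avoids it. The data is made up, and `mask` is used here as one possible way to turn the zeros back into missing values.

```python
import pandas as pd

# Toy cumulative series where 0 stands for "value was missing" (made-up values)
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=6, freq="D"),
    "people_vaccinated": [0, 100, 0, 0, 180, 0],
})

filled = toy.copy()
s = filled["people_vaccinated"]
filled["people_vaccinated"] = (
    s.mask(s == 0)   # turn zeros back into NaN
     .ffill()        # carry the last known value forward
     .fillna(0)      # a leading gap has nothing to carry, so it stays 0
     .astype(int)
)
print(filled)
```

The filled column never decreases, which is what we expect from a cumulative field.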

## People Vaccinated VS Fully Vaccinated
Now let's compare two fields with each other. For example, let's compare how many people are vaccinated with how many are fully vaccinated in one country (e.g. Germany).



In [None]:
# Select the Country
country = 'Germany'

# Select the columns to compare
column_1 = 'people_fully_vaccinated'
column_2 = 'people_vaccinated'

# Select the data only for the given country and fields
# Hint: use df.query and pass f"country=='{country}'"
# select the columns [['date', column_1, column_2]]
selected_df = <your-code-here>

# Replace zeros using ffill method
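# Hint: use selected_df.replace and pass to_replace=0, method='ffill'
# (newer pandas versions deprecate `method` here; there, use selected_df.replace(0, np.nan).ffill())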
no_missing_df = <your-code-here>

# Plot the data
fig, ax = plt.subplots(figsize=(10,5))

no_missing_df.plot(x='date', ax=ax)

Can you interpret this?

Let's use `graph_objects` from `plotly`, with scatter traces and `stackgroup`, to stack these two plots on top of each other:

In [None]:
import plotly.graph_objects as go

plot = go.Figure(data=[
    # Hint: use no_missing_df['date'] for x and no_missing_df[column_1] for y
            go.Scatter( # First plot based on first column
              x = <your-code-here>,
              y = <your-code-here>,
              stackgroup='one', # set a stackgroup name
              name = column_1,
              marker_color= 'orange'),
            go.Scatter( # Second plot based on second column
              x = no_missing_df['date'],
              y = no_missing_df[column_2],
              stackgroup='one', # use the same stackgroup name as the previous
              name = column_2,
              marker_color= 'blue')
            ])
plot.update_layout(
    title={
            'text': f'People vaccinated vs Fully vaccinated till date in {country}',
            'y':0.95,
            'x':0.5
        },
        xaxis_title="Date"
    )
plot.show()

As we can see, around 60M people are fully vaccinated in Germany.
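
One thing to keep in mind when reading a stacked plot: traces in the same `stackgroup` are added on top of each other, so the upper line shows the sum of both columns. The sketch below uses made-up numbers and assumes that fully vaccinated people are also counted in `people_vaccinated`; under that assumption, stacking a derived "partially vaccinated" column avoids counting them twice.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-07-01", "2021-08-01"]),
    "people_vaccinated": [40, 55, 60],
    "people_fully_vaccinated": [20, 40, 55],
})

# Height of the upper line when both columns are stacked as they are
toy["stacked_top"] = toy["people_fully_vaccinated"] + toy["people_vaccinated"]

# People with a first dose only; stacked on top of the fully vaccinated,
# this reaches exactly people_vaccinated
toy["partially_vaccinated"] = toy["people_vaccinated"] - toy["people_fully_vaccinated"]

print(toy)
```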

## Comparing Fully Vaccinated People Between Two Countries
Now let's compare the share of fully vaccinated people between two countries (e.g. Germany vs France).

In [None]:
# Select countries
country_1 = 'Germany'
country_2 = 'France'

# Select the column
column = 'people_fully_vaccinated_per_hundred'

# Select data for target countries for comparison
# Hint: use .query() and pass f"country=='{country_1}' or country=='{country_2}'"
# select columns [['date', 'country', column]]
selected_df = <your-code-here>

# Plot two countries data in one plot
# Hint: use px.line and pass selected_df and
# x='date', y=column, color='country'
fig = <your-code-here>

fig.update_layout(
    title={
            'text': f"{column} - {country_1} vs {country_2}",
            'y':0.95,
            'x':0.5
        },
    xaxis_title="Date",
    yaxis_title=column
)
fig.show()

As we can see, both countries fully vaccinated people at a rather similar pace until May 2021. Germany then moved faster up to September 2021, after which France caught up, but both follow the same overall pattern.
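
To put numbers on such a comparison, the selected data can be reshaped so that each country becomes a column. This is a sketch on made-up values; it uses `isin` instead of `query` for the selection, which is an alternative and not the exercise's hint, and the `gap` column is a hypothetical helper.

```python
import pandas as pd

# Toy data with made-up percentages
toy = pd.DataFrame({
    "date": ["2021-05-01"] * 3 + ["2021-06-01"] * 3,
    "country": ["Germany", "France", "Spain"] * 2,
    "people_fully_vaccinated_per_hundred": [8.0, 9.5, 10.1, 20.3, 17.2, 21.0],
})

country_1, country_2 = "Germany", "France"
column = "people_fully_vaccinated_per_hundred"

# Keep only the two countries and the columns we need
selected = toy[toy["country"].isin([country_1, country_2])][["date", "country", column]]

# One row per date, one column per country
wide = selected.pivot(index="date", columns="country", values=column)

# Positive values mean country_1 is ahead on that date
wide["gap"] = wide[country_1] - wide[country_2]
print(wide)
```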

## Point Map
Now let's see how vaccinations are going in different countries using maps. The marker sizes correspond to `people_vaccinated_per_hundred`, and a different color is used for each country.

In [None]:
# Select column
column = "people_vaccinated_per_hundred"

# Find last available date in our data
# Hint: use df.date.max()
captured_date = <your-code-here>
print("Last date:", captured_date)


# Select max values per country
# Hint: use df.groupby and pass ["country", "iso_code"], as_index=False
# select [column] and .max()

selected_df = <your-code-here>

# Plot data on map
# Hint: use px.scatter_geo and pass selected_df
# use "iso_code" for locations and "country" for color
fig = px.scatter_geo(<your-code-here>,
                     locations=<your-code-here>,
                     color=<your-code-here>, # which column to use to set the color of markers
                     hover_name="country", # column added to hover information
                     size=column, # size of markers
                     projection="natural earth")
fig.show()
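
The grouping step for the map can be sketched on toy data as follows (made-up values and ISO codes). With `as_index=False` the group keys stay as ordinary columns, which is the shape the plotting functions expect.

```python
import pandas as pd

# Toy data; the 0.0 on B's last date stands for a filled missing value
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "iso_code": ["AAA", "AAA", "BBB", "BBB"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "people_vaccinated_per_hundred": [1.0, 2.5, 4.0, 0.0],
})
column = "people_vaccinated_per_hundred"

# One row per country, holding the largest value seen for that country
per_country = toy.groupby(["country", "iso_code"], as_index=False)[column].max()
print(per_country)
```

Taking the maximum instead of the value on the last date keeps country B at 4.0 here, even though its last entry was missing and had been filled with 0.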

## Choropleth Map
Let's look at the same data in a slightly different view using a choropleth map.

In [None]:
# Hint: use px.choropleth and pass selected_df
# for locations use "iso_code" and for locationmode use "ISO-3"
# color=column, hover_name="country", color_continuous_scale=px.colors.sequential.Blues
<your-code-here>


Useful links:
- https://www.analyticsvidhya.com/blog/2021/05/analyze-covid-vaccination-progress-using-python/
- https://thecleverprogrammer.com/2021/04/13/covid-19-vaccines-analysis-with-python/