<h1 style="text-align: center;color: SlateBlue;font-family: courier;font-size: 300%;">COVID-19 VACCINATION ANALYSIS</h1>

# Table of Contents:

>- [About this EDA](#About-this-EDA:)
>- [About the Data](#About-the-Data:)
>- [Importing Libraries](#Importing-Libraries:)
>- [Reading the Data](#Reading-the-Data:)
>- [Cleaning the Data](#Cleaning-the-Data:)
>- [Exploratory Data Analysis](#Exploratory-Data-Analysis:)
>   - [Basic Info about Dataset](#Basic-Info-about-Dataset)
>   - [Amount of Vaccinated People](#Amount-of-Vaccinated-People)
>   - [Country wise Daily vaccination](#Country-wise-Daily-Vaccination)
>   - [Percent of Population Vaccinated](#Percent-of-Population-Vaccinated)
>   - [People vaccinated once VS People vaccinated twice](#People-vaccinated-once-VS-People-vaccinated-twice)
>   - [Most Popular Vaccine Scheme](#Most-Popular-Vaccine-Scheme)
>   - [Market share of Vaccine Schemes](#Market-share-of-Vaccine-Schemes)
>   - [TreeMap of Total Vaccinations per country, grouped by Vaccine Scheme](#TreeMap-of-Total-Vaccinations-per-country,-grouped-by-Vaccine-Scheme)
>   - [Visualising on a Map](#Visualising-on-a-Map)
>   - [Animating the Vaccination Progress](#Animating-the-Vaccination-Progress)
>- [Conclusion](#Conclusion:)

# About this EDA:

This is my first exploratory data analysis and my first submission on Kaggle. It is a basic EDA in which I analyse the dataset and try to extract meaningful information.

I also used this notebook as my first-semester project. The timeline of some events described may be slightly out of date, as I wrote it a month before uploading. I apologise for that.

If you like this project, please upvote and share your feedback 😁


# About the Data:

This Notebook would not have been possible without this [Dataset](https://www.kaggle.com/gpreda/covid-world-vaccination-progress) provided by [@Gabriel Preda](https://www.kaggle.com/gpreda).

The Data contains the following information:

- __Country__ - this is the country for which the vaccination information is provided.
- __Country ISO Code__ - ISO code for the country.
- __Date__ - date for the data entry; for some dates we have only the daily vaccinations, for others, only the (cumulative) total.
- __Total number of vaccinations__ - this is the absolute number of total immunizations in the country.
- __Total number of people vaccinated__ - a person, depending on the immunization scheme, will receive one or more (typically 2) doses; at a certain moment, the number of vaccinations might be larger than the number of people.
- __Total number of people fully vaccinated__ - this is the number of people that received the entire set of immunization according to the immunization scheme (typically 2); at a certain moment in time, there might be a certain number of people that received one vaccine and another number (smaller) of people that received all vaccines in the scheme.
- __Daily vaccinations (raw)__ - for a certain data entry, the raw (unsmoothed) number of vaccinations for that date/country.
- __Daily vaccinations__ - for a certain data entry, the number of vaccinations for that date/country (smoothed in the source data).
- __Total vaccinations per hundred__ - ratio (in percent) between vaccination number and total population up to the date in the country.
- __Total number of people vaccinated per hundred__ - ratio (in percent) between population immunized and total population up to the date in the country.
- __Total number of people fully vaccinated per hundred__ - ratio (in percent) between population fully immunized and total population up to the date in the country.
- __Number of vaccinations per day__ - number of daily vaccination for that day and country.
- __Daily vaccinations per million__ - ratio (in ppm) between vaccination number and total population for the current date in the country.
- __Vaccines used in the country__ - names of the vaccines used in the country (up to date).
- __Source name__ - source of the information (national authority, international organization, local organization etc.).
- __Source website__ - website of the source of information.

# Importing Libraries:

We initialize the Python packages we will use for data ingestion, preparation and visualization.
We will use : 
- `Pandas` to read and process the data.
- `Plotly` for visualization.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Reading the Data:

In [None]:
df = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')
df.head()

In [None]:
df.info()


# Cleaning the Data:
As you can see, a lot of null values are present in the data. This may be because:
1. The vaccination drive has just started and will take some time for all the countries to catch up.
2. This data is collected daily from the Our World in Data GitHub repository for Covid-19, so some inconsistencies may have crept in while the dataset was being built.

There may be other reasons but let's move forward with cleaning the data and filling the missing values.


In [None]:
df.isnull().sum()
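
To judge how serious the gaps are, it can help to look at the share of missing values per column rather than the raw counts. The following is a minimal sketch on a small made-up frame (the `toy` data and the `missing_pct` name are illustrative assumptions, not part of the dataset); the same expression should work on `df`.

```python
import pandas as pd

# Hypothetical miniature of the vaccination table, with some gaps
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations": [10.0, None, None, 40.0],
    "daily_vaccinations": [None, 5.0, 7.0, 9.0],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing cells
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns with the most gaps first, which makes it easier to decide which ones need attention.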

In [None]:
df[df['iso_code'].isnull()]['country'].value_counts()


- It seems that the ISO code is missing for these 4 entries.
- Scotland, England, Northern Ireland and Wales are all part of the United Kingdom.
- As the United Kingdom has its own entry in the data, these rows can be dropped.


In [None]:
# Drop the UK constituent nations; `~` negates the boolean mask
df = df.loc[~df.country.isin(['England', 'Scotland', 'Wales', 'Northern Ireland'])]

df[df['iso_code'].isnull()]['country'].value_counts()
#The empty Series output shows that all nan values in iso_code are fixed

In [None]:
df.isnull().sum()

In [None]:
df.columns

### Other Apparent Inconsistencies
* `total_vaccinations` is broken down into `people_vaccinated` and `people_fully_vaccinated`, and any NaN can be ignored.
* `people_fully_vaccinated` gives the number of people who have received every dose in the scheme (typically two). Any NaN in this can be ignored.
* `total_vaccinations_per_hundred` is the number of vaccinations per 100 people (i.e. `total_vaccinations` divided by the population of the country, times 100). Let us rename this.
* `people_fully_vaccinated_per_hundred` is the percentage of the population that is fully vaccinated (i.e. `people_fully_vaccinated` divided by the population of the country, times 100). Let us rename this.
* `daily_vaccinations_per_million` is the number of daily vaccinations per million inhabitants of the country. This doesn't have inconsistencies.

In [None]:
df.rename(columns = {'total_vaccinations_per_hundred':'total_vaccinations_percent',
                     'people_fully_vaccinated_per_hundred':'people_fully_vaccinated_percent',
                    'people_vaccinated_per_hundred':'people_vaccinated_percent'}, inplace=True)
df.columns

# Exploratory Data Analysis:
## Basic Info about Dataset

In [None]:
print('Data point starts from:',df.date.min(),'\n')
print('Data point ends at:',df.date.max(),'\n')
print('Total no of Countries in the data set:',len(df.country.unique()),'\n')
print('Total no of unique Vaccine Schemes in the data set:',len(df.vaccines.unique()),'\n')

In [None]:
# All the different countries
df.country.unique()

In [None]:
# All the different kinds of vaccines
df.vaccines.unique()

## Amount of Vaccinated People

In [None]:
# Here we are creating `country_data` which store basic info about a country, like the vaccine scheme it uses, total 
# vaccinations completed and its percentage with the population

country_data = df.copy()
cols = ['country', 'total_vaccinations', 'iso_code', 'vaccines', 'total_vaccinations_percent']

country_data = country_data[cols].groupby('country').max().sort_values('total_vaccinations', ascending=False)
country_data.reset_index(inplace = True)

country_data.columns = ['Country', 'Total Vaccinations', 'iso_code', 'Vaccines', 'Total Vaccinations Percentage']
country_data

In [None]:
fig = px.bar(country_data[:40], x = 'Country', y = 'Total Vaccinations', color = 'Total Vaccinations')

fig.update_layout(title = dict(text = 'Vaccinations World-Wide Comparison', x=0.5, y=0.95))
fig.update_xaxes(title = 'Countries', title_font = dict(size=18, family='Courier', color='crimson'), tickangle=-90)
fig.update_yaxes(title = 'Total Vaccinations', title_font = dict(size=18, family='Courier', color='crimson'))

fig.show()

From the plot, some interesting facts stand out:
- The __United States__, despite having the highest number of people affected by Covid-19, has the highest number of vaccinations.
- __China__, from where the virus started spreading, is second.
- __India__, which has been supplying vaccines to the world, is in 3rd position.
- The __UK__, where a new variant strain of the virus was found, is right next.
- Following that, we have __Israel__, __UAE__, __Brazil__, __Germany__ and others.

## Country wise Daily Vaccination

In [None]:
top_countries = ['USA','CHN','GBR','IND','ISR','ARE','BRA','DEU','TUR','ITA','FRA']
fig = px.line(df[df.iso_code.isin(top_countries)], x='date', y='daily_vaccinations', color='country')

fig.update_layout(title = dict(text = 'World-Wide Daily Vaccination Timeline', x=0.5, y=0.95), 
                  legend = dict(title = 'Country', traceorder = 'reversed'))
fig.update_xaxes(title = 'Timeline', title_font = dict(size=18, family='Courier', color='crimson'))
fig.update_yaxes(title = 'Daily Vaccinations', title_font = dict(size=18, family='Courier', color='crimson'))

fig.show()

From the plot, we can deduce:
- The line plot for China is composed entirely of straight segments. This may be because the CCP restricts the flow of information in and out of China, so figures arrive only at intervals and should be taken with a grain of salt.
- By comparison, the curve for the USA is much smoother. We can also see that, although the USA was heavily affected by the virus, its vaccination drive is highly effective.
- Others, like the UK, show a steady increase in daily vaccinations, and India, while supplying vaccines to many countries, maintains a respectable 3rd position.
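
One mechanical way straight segments can arise is sketched below. It assumes (this is an assumption, not something the dataset description above confirms) that daily figures are derived from cumulative totals that are reported only occasionally, with the gaps filled by linear interpolation. The numbers are made up.

```python
import pandas as pd

# Hypothetical cumulative totals, reported only on the first and last day
cumulative = pd.Series(
    [0.0, None, None, 300.0],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
)

# Linear interpolation spreads the increase evenly over the unreported days
filled = cumulative.interpolate()

# The implied daily figure is then the same on every day of the gap,
# which a line plot draws as a flat, straight segment
daily = filled.diff()
print(daily)
```

If a country's curve shows long runs of identical daily values, sparse reporting of this kind is worth checking before reading anything into the shape.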

## Percent of Population Vaccinated

In [None]:
top_country_data = country_data.sort_values('Total Vaccinations Percentage', ascending=False)[:30]

fig = px.bar(top_country_data, x = 'Country', y = 'Total Vaccinations Percentage', color = 'Total Vaccinations Percentage')

fig.update_layout(title = dict(text = 'Percentage of Vaccinated Population World-Wide Comparison', x=0.5, y=0.95))
fig.update_xaxes(title = 'Countries', title_font = dict(size=18, family='Courier', color='crimson'), tickangle=-90)
fig.update_yaxes(title = 'Percentage of Population Vaccinated', title_font = dict(size=14, family='Courier', color='crimson'))

fig.show()

- Countries like Israel, Gibraltar, UAE and Seychelles have the highest level of vaccinated people per hundred.
- We can see that the value for Gibraltar is `137%`. Since this metric counts doses per hundred people, a value above 100 suggests that Gibraltar has given a first dose to most of its population and is well on its way to fully vaccinating it.
- But one shouldn't forget that the populations of these countries are small, which may be the reason for such high figures.
- The United Kingdom also has really high results, which is impressive as its population is almost __7__ times larger than the UAE's or Israel's and, what is really incredible, __2016__ times larger than Gibraltar's!
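
A value above 100 is possible because the metric plotted here counts doses, not people. The arithmetic can be sketched as follows; the population and dose figures are invented for illustration and the function name is hypothetical.

```python
def per_hundred(count, population):
    """Return `count` expressed per 100 inhabitants."""
    return 100 * count / population


# Hypothetical territory: 1000 residents, 900 first doses and 470 second doses
population = 1000
first_doses = 900
second_doses = 470

# Doses per hundred counts every injection, so it can exceed 100
doses_rate = per_hundred(first_doses + second_doses, population)

# People per hundred counts each person once, so it is capped at 100
people_rate = per_hundred(first_doses, population)

print(doses_rate, people_rate)
```

In the dataset, the first quantity corresponds to `total_vaccinations_per_hundred` and the second to `people_vaccinated_per_hundred`.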

## People vaccinated once _VS_ People vaccinated twice

In [None]:
# group the df by date and sum the cumulative counts over all countries
vaccinated_df = df.copy()
vaccinated_df = vaccinated_df.groupby('date')[['people_fully_vaccinated', 'people_vaccinated']].sum()

# reset_index turns the date index back into a column; then sort by date
# (sort_values returns a new frame, so the result has to be assigned)
vaccinated_df.reset_index(inplace = True)
vaccinated_df = vaccinated_df.sort_values('date')

# plot the values
plot = go.Figure(data=[
    go.Scatter( 
        x = vaccinated_df['date'], 
        y = vaccinated_df['people_vaccinated'], 
        stackgroup='two',
        name = 'People Vaccinated once', 
        marker_color= '#35eb28'),
    
    go.Scatter(
        x = vaccinated_df['date'], 
        y = vaccinated_df['people_fully_vaccinated'], 
        stackgroup='one', 
        name = 'People Vaccinated twice', 
        marker_color= '#c4eb28')
    ]) 
    
plot.update_layout(title = dict(text= 'People vaccinated once vs Fully vaccinated till date', x = 0.5, y = 0.95))
plot.update_layout(legend = dict(orientation = "h", yanchor = "bottom", y = 1.02, xanchor = "right", x = 1))
plot.update_xaxes(title = 'Timeline', title_font = dict(size=18, family='Courier', color='crimson'))
plot.update_yaxes(title = 'Amount of vaccinated people', title_font = dict(size=18, family='Courier', color='crimson'))
    
plot.show()

#### The above plot shows the World-Wide comparison between the First and Second Dose of the Vaccine
From the plot we can determine:
- The `People Vaccinated` line rises gradually but shows peculiar dips in the early days. These dips fall on Wednesdays and weekends. Because the plotted values are cumulative totals summed over whichever countries reported on each date, the dips most likely reflect missing reports on those days rather than people preferring not to be vaccinated then.
- In the `People Vaccinated fully` line, the ascent is also gradual but _much_ slower than in the `People Vaccinated` line.
- We can also notice that `People Vaccinated` line starts to lift from __Dec 19__ and `People Vaccinated fully` line starts to lift from __Jan 9__. This is because of a gap of 3 weeks between the first and second dose of the vaccine.
- It will take some time before Covid-19 can be eradicated, as the vaccine is reported to reach its __95__% effectiveness only 4 weeks after the second dose.
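
A possible cause of the dips in the `People Vaccinated` line is that the columns are cumulative and not every country reports every day, so a plain sum per date drops whichever countries are missing. The sketch below uses made-up numbers to show the effect and one way to smooth it, by carrying each country's last reported total forward before summing. The frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cumulative counts; country B did not report on the second day
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"] * 2,
    "people_vaccinated": [10, 20, 30, 100, None, 120],
})

# Plain sum per date: the missing report counts as 0 and the total dips
naive = toy.groupby("date")["people_vaccinated"].sum()

# Carry each country's last known total forward, then sum per date
filled = toy.sort_values(["country", "date"]).copy()
filled["people_vaccinated"] = filled.groupby("country")["people_vaccinated"].ffill()
carried = filled.groupby("date")["people_vaccinated"].sum()

print(naive.tolist())    # the middle value dips
print(carried.tolist())  # the totals never decrease
```

Forward filling is a reasonable choice for cumulative columns because a country's total cannot fall between two reports; it would not be appropriate for daily counts.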

## Most Popular Vaccine Scheme

In [None]:
# Sum only the numeric column of interest so text columns are not aggregated
top_vaccine = country_data.groupby('Vaccines')[['Total Vaccinations']].sum().sort_values(by = 'Total Vaccinations', ascending = False)
top_vaccine.reset_index(inplace = True)

fig = px.bar(top_vaccine, x = 'Vaccines', y = 'Total Vaccinations', 
            color = 'Vaccines', color_discrete_sequence = px.colors.sequential.Rainbow_r)

fig.update_layout(height= 575, title = dict(text = 'Total Vaccinations per Scheme', x=0.5, y=0.95), 
                  legend_title='Types of Vaccine Scheme')
fig.update_xaxes(title = 'Vaccines', title_font = dict(size=18, family='Courier', color='crimson'), showticklabels = False)
fig.update_yaxes(title = 'Total Vaccinations', title_font = dict(size=18, family='Courier', color='crimson'))
fig.update_layout(showlegend=False) #toggle legend

fig.show()

## Market share of Vaccine Schemes

In [None]:
world_wide_vaccine_use = df.copy().vaccines.value_counts().to_dict()
vaccine_type = {}

for scheme,value in world_wide_vaccine_use.items():
    for name in scheme.split(','):
        vaccine_type[name.strip()] = vaccine_type.get(name.strip(),0) + value

fig = px.pie(values = vaccine_type.values(), names = vaccine_type.keys(),
             color_discrete_sequence = px.colors.sequential.Sunset_r)

fig.update_layout(title = dict(text = 'Market share of Vaccines', x=0.5, y=0.95),
                  legend_title='Types of Vaccines')
fig.show()

From the above charts, we can see that:

- The most used vaccine around the world is __Pfizer/BioNTech__ and the second most used vaccine is __Oxford/AstraZeneca__.
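
Note that the pie chart weights each scheme by the number of rows it has in `df` (one row per country and date), because `value_counts()` counts rows. An alternative is to count each country once per vaccine. The sketch below shows that variant on made-up data; the country names, the mapping and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical mapping of country -> vaccine scheme string, one entry per country
country_schemes = {
    "Country A": "Pfizer/BioNTech",
    "Country B": "Moderna, Pfizer/BioNTech",
    "Country C": "Oxford/AstraZeneca, Pfizer/BioNTech",
}


def count_vaccines(schemes):
    """Count in how many countries each individual vaccine appears."""
    counts = Counter()
    for scheme in schemes.values():
        for name in scheme.split(","):
            counts[name.strip()] += 1
    return counts


vaccine_counts = count_vaccines(country_schemes)
print(vaccine_counts.most_common())
```

With the real data, a mapping like this could be built from the `Country` and `Vaccines` columns of `country_data`, which already holds one row per country.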

## TreeMap of Total Vaccinations per country, grouped by Vaccine Scheme

In [None]:
fig = px.treemap(country_data, path = ['Vaccines', 'Country'], values = 'Total Vaccinations', height = 650,
                custom_data = ['Country', 'Vaccines', 'Total Vaccinations'])

fig.update_layout(title = dict(text = 'Total vaccinations per country, grouped by Vaccine Scheme', x=0.5, y=0.95))
fig.update_traces(hovertemplate = 'Country: %{customdata[0]}<br>Vaccine: %{customdata[1]}<br>Total Vaccinations: %{customdata[2]}')
fig.show()

* From the above Treemap we can realise that a Bar and Pie Plot may often only show a part of the information that can be observed, whereas a Treemap can accurately show the share of a particular vaccine world-wide, the countries that are using the said vaccine and can even show comparisons between all the countries.
* As the Treemap shows so much information at a time, it can help one understand the data much more accurately.

## Visualising on a Map

In [None]:
fig = px.choropleth(country_data, locations = 'Country', color = 'Vaccines',
                    hover_data = ['Country', 'Vaccines'], locationmode = "country names", 
                    projection = 'natural earth')

fig.update_layout(legend = dict(title = "Vaccine Scheme", orientation = "h", y=-0.1), showlegend = False)
fig.update_layout(title = dict(text = 'Countries using same Vaccine Scheme', x=0.5, y=0.95),
                 geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True, 
                            landcolor = "white", showlakes = False, showframe = False))

fig.show()

- From the above visualisation, we can see which countries are using similar vaccination schemes.


In [None]:
fig = px.choropleth(country_data, locations = 'Country', color = 'Total Vaccinations', 
                    locationmode = 'country names', color_continuous_scale = 'rainbow', 
                    hover_name = 'Country', projection = 'natural earth')

fig.update_layout(title = dict(text = 'Total Vaccinations in every Country', x=0.5, y=0.95), 
                 geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True, 
                            landcolor = "white",showlakes = False, showframe = False))

fig.show()

- In the above visualisation, we can see the countries and the total vaccinations they have completed.


## Animating the Vaccination Progress

In [None]:
# Let's create a copy of the df, fill each NaN with the next valid value below it (backward fill)
# and sort the values according to date
time_df = df.copy()
time_df = time_df.bfill()

time_df['date'] = pd.to_datetime(time_df['date'])
time_df = time_df.sort_values('date', ascending=True)
time_df['date'] = time_df['date'].dt.strftime('%m-%d-%Y')
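
The fill step above runs over the whole frame in row order, so a gap at the end of one country's rows can pick up a value from the next country. The sketch below, on made-up data, contrasts that with filling inside each country separately. It is one possible alternative under these assumptions, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical rows ordered by country, then date; country A has a gap on its last day
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"],
    "daily_vaccinations": [10.0, None, 500.0, 600.0],
})

# Frame-wide backward fill: the gap in A takes the next row's value, which belongs to B
global_bfill = toy["daily_vaccinations"].bfill()

# Forward fill within each country: the gap in A takes A's own previous value
per_country_ffill = toy.groupby("country")["daily_vaccinations"].ffill()

print(global_bfill.tolist())
print(per_country_ffill.tolist())
```

Grouping by country before filling keeps each country's values from leaking into its neighbours in the table.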

In [None]:
fig = px.choropleth(time_df, locations = 'country', color = 'daily_vaccinations',
                    locationmode = 'country names', hover_name = 'country', 
                    animation_frame = 'date', projection = 'natural earth', 
                    color_continuous_scale = px.colors.sequential.Plasma)

fig.update_layout(title = dict(text = 'Daily Vaccinations World-Wide Timeline', x=0.47, y=0.95), 
                  geo = dict(showocean = True, oceancolor = "#7af8ff", showland = True, 
                             landcolor = "white", showlakes = False, showframe = False))
fig.show()

- The above animation shows the vaccination progress over a period of 3 months (`13-12-2020` to `13-3-2021`).
- With this animation we can compare, on a global scale, when each country started its vaccination drive and at what rate it vaccinated its population.

# Conclusion:

From our analysis we can conclude the following:
- Although the vaccination drive is going at full pace, in most countries people are not yet fully vaccinated.
- In North America and Europe, the Pfizer/BioNTech, Oxford/AstraZeneca and Moderna vaccines are widely used.
- The most used vaccine around the world is Pfizer/BioNTech.
- Countries like Israel and Gibraltar have administered an incredible __100+__ doses per hundred people.

With this information we can see that the Covid-19 vaccination drive is going at an incredible pace. Diseases like Polio and Smallpox, which caused many deaths, took decades or even centuries to bring under control. At the current pace, Covid-19 could be brought under control in a relatively short time.