<center><img src="https://img.pravda.com/images/doc/1/c/1cb93ce-36bfa52-vaccine690.jpg" width=700></img></center> 

# <center><b>Hello guys!

### <center>In this notebook you will see the process of dealing with missing data and filling it with appropriate values.

### <center>You can also find exploratory data analysis along with visualization on histograms and geo plots here.

### <center>Moreover, at the end of the notebook you can find the selection of ARIMA model parameters and<br><br><br>predictions of the number of vaccinated people for the next 7 days!

### <center>I hope this notebook will be interesting and useful for you!

# <b><br><center>Preparing data

Importing all needed libraries.<br><br>

In [None]:
import numpy as np, pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt, seaborn as sns
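import scipy.stats  # import the submodule explicitly; a bare `import scipy` does not guarantee scipy.stats is loaded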
import warnings
import plotly.express as px
from itertools import product
import statsmodels.api as sm
import datetime
from tqdm import tqdm
warnings.filterwarnings('ignore')

<br>Loading data.<br><br>

In [None]:
data = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')

<br>Check if everything loaded fine.<br><br>

In [None]:
data

In [None]:
data.shape

In [None]:
# Drop every record dated after 2021-02-06.
data = data.drop(data[data.date > '2021-02-06'].index)

# <br><center><b>Missing data<br><br>

Now, let's check out if we have any missing data in our dataset.<br><br>

In [None]:
data.isna().sum()

<br>As can be seen, there is quite a lot of missing data.<br><br>

Let's drop the rows with missing total_vaccinations, as without this value a row doesn't make much sense.<br><br>
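
As a minimal sketch of this step, here is the same idea on a tiny made-up frame (the `toy` data and its values are assumptions, not part of the real dataset): count the missing values per column, then keep only the rows where total_vaccinations is present.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'total_vaccinations': [100.0, np.nan, 50.0, 80.0],
    'people_vaccinated': [90.0, np.nan, np.nan, 70.0],
})

# Count missing values per column before cleaning.
missing_before = toy.isna().sum()

# Keep only the rows where total_vaccinations is present.
toy_clean = toy.dropna(subset=['total_vaccinations'])
missing_after = toy_clean.isna().sum()

print(missing_before, missing_after, sep='\n\n')
```

`dropna(subset=...)` looks only at the listed column, so missing values in other columns survive and still have to be handled separately.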

In [None]:
data = data.drop(data[data.total_vaccinations.isna()].index)

In [None]:
data.isna().sum()

## <br>	<b>&bull;</b> people_vaccinated<br>

In [None]:
check_data = data.drop(data[data.people_vaccinated.isna()].index)

In [None]:
check_data.head()

<br>As can be seen from our data, the values of the total_vaccinations column are mostly the same as those of the people_vaccinated column.<br><br>

total_vaccinations_per_hundred and people_vaccinated_per_hundred are also very similar.<br><br>

Let's check the correlation to see whether this is true.<br><br>

In [None]:
plt.subplots(figsize=(8, 8))
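# Correlation is only defined for numeric columns, so select them explicitly.
sns.heatmap(check_data.select_dtypes('number').corr(), annot=True, square=True)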
plt.show()

<br>As can be seen from the heatmap, these features have an almost perfect correlation.
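
To make the idea concrete, here is a small sketch on made-up numbers (the `toy` frame is an assumption): when one column is an exact multiple of another, their Pearson correlation is 1, and text columns are left out of the calculation.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],                  # non-numeric, excluded below
    'total_vaccinations': [10.0, 20.0, 30.0, 40.0],
    'people_vaccinated': [9.0, 18.0, 27.0, 36.0],     # exactly 0.9 * total_vaccinations
})

# Pearson correlation matrix of the numeric columns only.
corr = toy.select_dtypes('number').corr()
print(corr)
```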

## <br>	<b>&bull; </b>people_vaccinated and people_vaccinated_per_hundred<br>

people_vaccinated and people_vaccinated_per_hundred correlate strongly with total_vaccinations and total_vaccinations_per_hundred.<br><br>

##### <b> Let's check the hypothesis that these columns' distributions are the same. </b><br><br>

##### <b> From here on we will use the Mann-Whitney U test for this purpose. </b><br><br>
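
As a rough illustration of how the test behaves, here is a sketch on made-up samples (the arrays `a`, `b_same` and `b_shifted` are assumptions): for two identical samples the p-value is large, so the hypothesis of equal distributions can't be rejected, while for a clearly shifted sample the p-value is tiny.

```python
import numpy as np
from scipy import stats

a = np.arange(1, 21, dtype=float)   # values 1..20
b_same = a.copy()                   # identical sample
b_shifted = a + 100                 # every value is larger than anything in `a`

# Two-sided Mann-Whitney U test: the null hypothesis is that both samples
# come from the same distribution.
_, p_same = stats.mannwhitneyu(a, b_same, alternative='two-sided')
_, p_shifted = stats.mannwhitneyu(a, b_shifted, alternative='two-sided')

print('identical samples p-value :', p_same)
print('shifted sample p-value    :', p_shifted)
```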

In [None]:
scipy.stats.mannwhitneyu(check_data.total_vaccinations, check_data.people_vaccinated, alternative='two-sided')

<br><br>

In [None]:
scipy.stats.mannwhitneyu(check_data.total_vaccinations_per_hundred, check_data.people_vaccinated_per_hundred, alternative='two-sided')

<br>The p-values are much greater than 0.05, which means we can't reject our hypothesis. <br><br>

So, we will fill the missing values with total_vaccinations minus the difference between these columns' mean values.<br><br>
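
Here is a minimal sketch of this filling rule on made-up numbers (the `toy` frame is an assumption): the mean difference is computed on the rows where both values are known and then subtracted from total_vaccinations wherever people_vaccinated is missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'total_vaccinations': [10.0, 20.0, 30.0, 40.0],
    'people_vaccinated': [8.0, np.nan, 26.0, np.nan],
})

# Rows where people_vaccinated is known: means are 20 and 17, so diff is 3.
known = toy.dropna(subset=['people_vaccinated'])
diff = known.total_vaccinations.mean() - known.people_vaccinated.mean()

# fillna accepts a Series and aligns it by index, so each missing value
# gets "its own" total_vaccinations minus the mean difference.
toy['people_vaccinated'] = toy.people_vaccinated.fillna(toy.total_vaccinations - diff)
print(toy)
```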

In [None]:
diff = check_data.total_vaccinations.mean() - check_data.people_vaccinated.mean()
diff_per_hundred = check_data.total_vaccinations_per_hundred.mean() - check_data.people_vaccinated_per_hundred.mean()

data.people_vaccinated = data.people_vaccinated.fillna(data.total_vaccinations - diff)
data.people_vaccinated_per_hundred = data.people_vaccinated_per_hundred.fillna(data.total_vaccinations_per_hundred - diff_per_hundred)

<br>Let's check if everything is OK.<br><br>

In [None]:
data.isna().sum()

<br>Everything went fine, so we can move on  <b>&#10003;

## <br>	<b>&bull; </b>daily_vaccinations and daily_vaccinations_per_million<br>

daily_vaccinations and daily_vaccinations_per_million correlate strongly with people_vaccinated and people_vaccinated_per_hundred.<br><br>

##### <b> Let's check the hypothesis that these columns' distributions are the same. <br><br>

In [None]:
# Drop NaNs before testing and state the alternative explicitly.
scipy.stats.mannwhitneyu(check_data.people_vaccinated, check_data.daily_vaccinations.dropna(), alternative='two-sided')

<br>

In [None]:
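# Drop NaNs before testing and state the alternative explicitly.
scipy.stats.mannwhitneyu(check_data.people_vaccinated_per_hundred.dropna(), check_data.daily_vaccinations_per_million.dropna(), alternative='two-sided')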

<br>p-values are much less than 0.05, which means we will reject our hypothesis.<br><br>

<br>So, let's just fill missing values with zeros.<br><br>

In [None]:
data.daily_vaccinations = data.daily_vaccinations.fillna(0)
data.daily_vaccinations_per_million = data.daily_vaccinations_per_million.fillna(0)

<br>Let's check if everything is OK.<br><br>

In [None]:
data.isna().sum()

<br>Everything worked fine <b>&#10003;</b>

## <br>	<b>&bull; </b>people_fully_vaccinated and people_fully_vaccinated_per_hundred<br>

people_fully_vaccinated and people_fully_vaccinated_per_hundred correlate strongly with total_vaccinations and total_vaccinations_per_hundred.

##### <b> <br>Let's check the hypothesis that these columns' distributions are the same. </b><br><br>

In [None]:
# Drop NaNs before testing and state the alternative explicitly.
scipy.stats.mannwhitneyu(check_data.people_fully_vaccinated.dropna(), check_data.total_vaccinations, alternative='two-sided')

<br>

In [None]:
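# Drop NaNs before testing and state the alternative explicitly.
scipy.stats.mannwhitneyu(check_data.people_fully_vaccinated_per_hundred.dropna(), check_data.total_vaccinations_per_hundred.dropna(), alternative='two-sided')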

<br>p-values are much less than 0.05, which means we will reject our hypothesis.<br><br>

Let's fill missing values with 0.<br><br>

In [None]:
data.people_fully_vaccinated = data.people_fully_vaccinated.fillna(0)
data.people_fully_vaccinated_per_hundred = data.people_fully_vaccinated_per_hundred.fillna(0)

<br>Let's check if everything is OK.<br><br>

In [None]:
data.isna().sum()

<br>We can move on <b>&#10003;

## <br>	<b>&bull; </b>daily_vaccinations_raw<br>

daily_vaccinations_raw correlates strongly with daily_vaccinations.<br><br>

##### <b> Let's check the hypothesis that these columns' distributions are the same. </b><br><br>

In [None]:
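# Drop NaNs before testing and state the alternative explicitly.
scipy.stats.mannwhitneyu(check_data.daily_vaccinations_raw.dropna(), check_data.daily_vaccinations.dropna(), alternative='two-sided')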

<br>The p-value is much less than 0.05, which means we will reject our hypothesis.<br><br>

Let's fill missing values with 0.<br><br>

In [None]:
data.daily_vaccinations_raw = data.daily_vaccinations_raw.fillna(0)

<br>Let's check if everything worked fine.<br><br>

In [None]:
data.isna().sum()

<br>Everything worked fine <b>&#10003;

## <br>	<b>&bull; </b> iso_code

<br>Let's find out which countries have a missing ISO code.<br><br>

In [None]:
data[data.iso_code.isna()].country.unique()

<br>These are the ISO codes used for these countries: GB-ENG for England, NC for Northern Cyprus, GB-NIR for Northern Ireland, GB-SCT for Scotland, GB-WLS for Wales.<br><br>


We will fill the missing ISO codes with the appropriate ones.<br><br>
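
One way to restrict such a fill to the iso_code column is to map country names to codes and pass the result to `fillna`. This is only a sketch on a made-up frame (the `toy` data, the `note` column and the `iso_by_country` name are assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['England', 'Wales', 'France', 'Northern Cyprus'],
    'iso_code': [np.nan, np.nan, 'FRA', np.nan],
    'note': [np.nan, 'x', 'y', 'z'],   # hypothetical extra column that must stay untouched
})

# Country name -> code to use when iso_code is missing.
iso_by_country = {'England': 'GB-ENG', 'Wales': 'GB-WLS', 'Northern Cyprus': 'NC'}

# Only missing iso_code values are replaced; existing codes and other columns are kept.
toy['iso_code'] = toy.iso_code.fillna(toy.country.map(iso_by_country))
print(toy)
```

Calling `fillna` on whole rows would also overwrite missing values in every other column, which is why the fill is limited to one column here.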

In [None]:
# Fill only the iso_code column, using the country name to pick the right code.
iso_fix = {'England': 'GB-ENG', 'Northern Ireland': 'GB-NIR', 'Scotland': 'GB-SCT',
           'Wales': 'GB-WLS', 'Northern Cyprus': 'NC'}
data.iso_code = data.iso_code.fillna(data.country.map(iso_fix))

<br>Let's check if everything went fine.<br><br>

In [None]:
data.isna().sum()

<br>We have finally dealt with the missing data, which took quite a while 😀

# <center><br><b>EDA with visualization

## <br>	<b>&bull;</b> Amount of vaccinated people<br>

First of all, let's visualize which countries have the highest number of vaccinated citizens.<br><br>
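
Since total_vaccinations is a running total, taking the maximum per country gives each country's latest figure. A small sketch of this idea on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'date': ['2021-02-01', '2021-02-02', '2021-02-01', '2021-02-02'],
    'total_vaccinations': [100, 150, 400, 420],
})

# Largest cumulative value per country, biggest country first.
latest = (toy.groupby('country')['total_vaccinations']
             .max()
             .sort_values(ascending=False))
print(latest)
```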

In [None]:
cols = ['country', 'total_vaccinations', 'iso_code', 'vaccines', 'total_vaccinations_per_hundred']
vacc_amount = data[cols].groupby('country').max().sort_values('total_vaccinations', ascending=False)

In [None]:
plt.figure(figsize=(16, 7))
plt.bar(vacc_amount.index, vacc_amount.total_vaccinations)
plt.xticks(rotation = 90)
plt.ylabel('Amount of vaccinated citizens')
plt.xlabel('Countries')
plt.show()

<b><br>As can be seen from the plot, the vaccination amounts of China and the USA are much greater than those of other countries, and the USA is the leader.</b><br><br>


Let's take a look at the same data, but on the map.<br><br>

In [None]:
fig = px.choropleth(locations=vacc_amount.iso_code, color=vacc_amount.total_vaccinations, title='Amount of vaccinated citizens', 
                   color_continuous_scale='rainbow')
fig.show('notebook')

<b><br>As can be seen from this map, many European countries, along with some Arab countries, Indonesia, Argentina and Ecuador, have the lowest number of vaccinated citizens.</b><br>

<b><br>At the same time, the United Kingdom (mostly England, the largest part of the UK), which is really close to continental Europe, is among the top 3 countries by vaccination amount.</b>



## <br>	<b>&bull;</b> Amount of vaccinated people per hundred<br>

Let's find out which country has the highest level of vaccinated people per hundred.<br><br>

This way we will understand which country has vaccinated the largest share of its population.<br><br>

In [None]:
vacc_amount = vacc_amount.sort_values('total_vaccinations_per_hundred', ascending=False)

In [None]:
plt.figure(figsize=(14, 5))
plt.bar(vacc_amount.index, vacc_amount.total_vaccinations_per_hundred)
plt.xticks(rotation = 90)
plt.ylabel('Amount of vaccinated people per hundred')
plt.xlabel('Countries')
plt.show()

<br><b>Israel, the UAE and Gibraltar have the highest level of vaccinated people per hundred.<br><br>

<b>But we shouldn't forget that the populations of these countries aren't very large, which might be the reason for such high figures.<br><br>

<b>The United Kingdom (along with England, Northern Ireland, Scotland and Wales) also has really high results, which is impressive because its population is almost 7 times larger than the UAE's and Israel's and, what is really incredible, <u>2016</u> times larger than Gibraltar's! <br><br>

Now, let's take a look at the same data on a map.<br><br>

In [None]:
fig = px.choropleth(locations=vacc_amount.iso_code, color=vacc_amount.total_vaccinations_per_hundred, title='Amount of vaccinated citizens per hundred', 
                   color_continuous_scale='rainbow')
fig.show('notebook')

<br><b>It can now be seen that the USA's level of vaccinations per hundred is also high.<br><br>

<b>The lowest levels are in Russia, Mexico, South America and Asian countries.

## <br>	<b>&bull;</b> The most popular vaccine <br>

Now let's find out which vaccine is the most popular.<br><br>

In [None]:
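# Sum only total_vaccinations; summing the text columns (e.g. iso_code) makes no sense.
vacc_pop = vacc_amount.groupby('vaccines')[['total_vaccinations']].sum().sort_values('total_vaccinations', ascending=False)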

In [None]:
plt.figure(figsize=(10, 5))
plt.bar(vacc_pop.index, vacc_pop.total_vaccinations)
plt.xticks(rotation = 90)
plt.ylabel('Amount of vaccinated people')
plt.xlabel('Vaccines')
plt.show()

<b><br>The plot shows that the Pfizer/BioNTech vaccine seems to be the most popular and widespread one.<br><br>

<b>Covishield and Covaxin are probably the least popular.<br><br>

Let's also visualize it on a map.<br><br>

In [None]:
fig = px.choropleth(locations=vacc_amount.iso_code, color=vacc_amount.vaccines, title='Name of the vaccine', 
                   color_continuous_scale='rainbow')
fig.show()

<br><b>It can easily be seen that Pfizer/BioNTech really is the most popular and widespread vaccine. It is mostly used in Europe and North America. <br><br>

<b>The Sputnik V vaccine is used by Russia, Argentina and Serbia.<br><br>

<b>Covaxin and Covishield are used only in Asian countries.<br><br>

<b>Sinovac is being used in Turkey, Indonesia, Brazil and China.<br><br>

<b>And finally, CNBG is only being used in China.<br><br>

# <b><center>Vaccination amount prediction

## <b>&bull;</b> How the vaccination process changed over time

In [None]:
t_cols = ['date', 'total_vaccinations']
timeseries_cov = data[t_cols].groupby('date').sum()[4:-1]

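def invboxcox(y, l):
    # Inverse Box-Cox transform: maps transformed values y back to the original scale.
    if l == 0:
        # For lambda = 0 the Box-Cox transform is log(x), so the inverse is exp(y).
        return np.exp(y)
    else:
        # For lambda != 0 the transform is (x**l - 1) / l, so x = (l*y + 1)**(1/l).
        return np.exp(np.log(l*y+1)/l)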

In [None]:
plt.figure(figsize=(20,7))
timeseries_cov.total_vaccinations.plot()
plt.xticks(rotation=45)
plt.show()

<br><b>It can be seen that, although the number of vaccinated people falls on some days, vaccination has a strong long-term uptrend.<br><br>

## <b>&bull;</b> Timeseries transformations to make it stationary <br>

To be able to predict future values, our timeseries <u><b>must be stationary</b></u>.<br><br>

Let's check if it is true with the help of the Dickey-Fuller test.<br><br>

<b>Our null hypothesis is that our time series isn't stationary.<br><br>

In [None]:
print('p-value : {}'.format(sm.tsa.stattools.adfuller(timeseries_cov.total_vaccinations)[1]))

<br><br>Our p-value is extremely high, well above 0.05, so we can't reject the hypothesis that the time series is non-stationary.<br><br>

Let's use the Box-Cox transformation.<br><br>
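
As a quick sanity check of the transformation, here is a sketch on made-up positive values (the array `x` is an assumption). `scipy.stats.boxcox` returns the transformed data together with the fitted lambda, and `scipy.special.inv_boxcox` undoes the transformation.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])   # made-up positive values

# Forward transform: lambda is estimated by maximum likelihood.
y, lmbda = stats.boxcox(x)

# Inverse transform should give the original values back.
x_back = inv_boxcox(y, lmbda)

print('lambda :', lmbda)
print('round trip OK :', np.allclose(x, x_back))
```

Box-Cox only works on strictly positive data, which holds for cumulative vaccination counts.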

In [None]:
timeseries_cov['total_vaccinations_box'], l = scipy.stats.boxcox(timeseries_cov.total_vaccinations)

In [None]:
print('p-value : {}'.format(sm.tsa.stattools.adfuller(timeseries_cov.total_vaccinations_box)[1]))

<br>Our p-value is still higher than 0.05.<br>

In [None]:
plt.figure(figsize=(20,7))
timeseries_cov.total_vaccinations_box.plot()
plt.xticks(rotation=45)
plt.show()

<br>We will seasonally difference our time series with a period of 2 days.<br><br>
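
A small sketch of what differencing with a lag of 2 does, on a made-up series (`s` is an assumption): subtracting the series shifted by 2 is the same as `Series.diff(2)`, and each pass leaves 2 more missing values at the start.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25, 36])

d1 = s - s.shift(2)     # first lag-2 difference
d2 = d1 - d1.shift(2)   # the same operation applied a second time

# The shift-based difference matches the built-in diff with periods=2.
same = d1.equals(s.diff(2))

print(d1.tolist())
print(d2.tolist())
print(same)
```

This is why the first 4 values have to be skipped after two such passes.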

In [None]:
timeseries_cov['total_vaccinations_box_diff1int2'] = timeseries_cov.total_vaccinations_box - timeseries_cov.total_vaccinations_box.shift(2)

In [None]:
timeseries_cov['total_vaccinations_box_diff2int2'] = timeseries_cov['total_vaccinations_box_diff1int2'] - timeseries_cov['total_vaccinations_box_diff1int2'].shift(2)

In [None]:
print('p-value : {}'.format(sm.tsa.stattools.adfuller(timeseries_cov.total_vaccinations_box_diff2int2[4:])[1]))

<br>Now our p-value is much less than 0.05, which means we can reject non-stationarity and treat our time series as stationary. Let's check this with a decomposition.<br><br>

In [None]:
sm.tsa.seasonal_decompose(timeseries_cov.total_vaccinations_box_diff2int2[4:], period=1).plot()
plt.show()

<br>As we can see, the trend disappeared because of our differencing. Let's move on.<br><br>

## <b>&bull;</b> ACF and PACF (Autocorrelation function and Partial autocorrelation function) <br>

Now, let's check the autocorrelation and partial autocorrelation of our time series.<br><br>
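
To show what the ACF plot is built from, here is a hand-written sample autocorrelation in plain NumPy (the helper `sample_acf` and the alternating series `x` are assumptions for illustration; statsmodels computes this for us in the plots).

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # For lag k, multiply the series by a copy of itself shifted by k steps.
    return np.array([np.sum(centered[:len(x) - k] * centered[k:]) / denom
                     for k in range(max_lag + 1)])

# Alternating series: neighbours are opposite, values 2 steps apart are equal.
x = np.array([1, -1, 1, -1, 1, -1, 1, -1])
acf_vals = sample_acf(x, 2)
print(acf_vals)
```

Lag 0 is always 1, lag 1 is strongly negative and lag 2 is strongly positive, which is the kind of pattern that hints at a period of 2.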

In [None]:
plt.figure(figsize=(20, 7))
ax = plt.subplot(211)
# Use integer division: the number of lags must be a whole number.
sm.graphics.tsa.plot_acf(timeseries_cov.total_vaccinations_box_diff2int2[4:], 
                         lags=(len(timeseries_cov)-4)//4, ax=ax)
ax = plt.subplot(212)
sm.graphics.tsa.plot_pacf(timeseries_cov.total_vaccinations_box_diff2int2[4:], 
                         lags=(len(timeseries_cov)-4)//4, ax=ax)
plt.show()

<br>We will choose our parameters in the range 0-6.<br><br>

As we have done two seasonal differencings and no simple ones, D (the number of seasonal differences) will be 2 and d (the number of simple differences) will be 0.<br>

In [None]:
d = 0
D = 2

<br>Now we will train many models and choose the one with the lowest Akaike Information Criterion (AIC).<br><br>

In [None]:
%%time
results = []
best_aic = float('inf')

parameters = list(product(np.arange(0, 7), np.arange(0, 7), np.arange(0, 7), np.arange(0, 7)))

for param in tqdm(parameters):
    try:
        arima = sm.tsa.statespace.SARIMAX(timeseries_cov.total_vaccinations_box, order=(param[0], d, param[1]), 
                                          seasonal_order=(param[2], D, param[3], 2)).fit(disp=False)
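    except Exception:
        # Skip parameter combinations for which the model cannot be fitted.
        continue
    aic = arima.aic
    if aic < best_aic:
        optimal_arima = arima
        best_aic = aic
        best_param = param
    # Store the AIC of the current model, not of the best one found so far.
    results.append([param, aic])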

<br>Let's check the optimal model's info. <br><br>

In [None]:
print(optimal_arima.summary())

<br>Now, let's compare our timeseries with ARIMA's.<br><br>

In [None]:
timeseries_cov['arima'] = invboxcox(optimal_arima.fittedvalues, l)
plt.figure(figsize=(20,7))
timeseries_cov.total_vaccinations.plot()
timeseries_cov.arima.plot(color='r')
plt.xticks(rotation=45)
plt.show()

<br>It seems the ARIMA time series is pretty close to ours. You can improve its accuracy by searching over a wider range of parameters, but this will also take a lot of time.<br><br>

## <b>&bull;</b> Making prediction

<br>Now, let's create predictions for the next week.<br><br>
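
Date labels for a 7-day forecast can also be generated from the last observed date instead of being typed by hand. A minimal sketch, assuming a hypothetical last date of 2021-02-09 (`last_date` is an assumption):

```python
import pandas as pd

last_date = pd.Timestamp('2021-02-09')   # hypothetical last observed date

# Seven consecutive days starting right after the last observation.
future = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=7, freq='D')
labels = future.strftime('%Y-%m-%d').tolist()
print(labels)
```

Formatting the dates as strings keeps them in the same form as the date column of the dataset.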

In [None]:
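# Forecast the 7 steps that follow the last observation of the fitted series.
n_obs = len(timeseries_cov)
date = ['2021-02-'+str(x) for x in range(10, 17)]
# Keep a one-column DataFrame so that history and predictions share the same column.
timeseries = timeseries_cov[['total_vaccinations']]
pred_df = pd.DataFrame(index=date)
pred_df['total_vaccinations'] = invboxcox(optimal_arima.predict(start=n_obs, end=n_obs + 6).values, l)
timeseries = pd.concat([timeseries, pred_df])

In [None]:
timeseries[-7:]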

<br>And at the end let's visualize our predictions.<br><br>

In [None]:
timeseries_cov['arima'] = invboxcox(optimal_arima.fittedvalues, l)
plt.figure(figsize=(20,7))
timeseries.total_vaccinations.plot(color='r')
timeseries_cov.total_vaccinations.plot()
plt.xticks(rotation=45)
plt.show()

# <center><br><b><br> Thank you for reading this notebook!

# <center><b><br>Good luck!