# Introduction
The COVID-19 pandemic has significantly impacted global health and economies. To combat its spread, vaccines have been developed and distributed worldwide. This project aims to analyze COVID-19 vaccination data across different countries, providing insights into vaccination trends, progress, and disparities.

# Problem Statement:

To analyze the global distribution of COVID-19 vaccines, identify disparities in vaccination rates across countries, and assess the impact of vaccinations on controlling the pandemic.


# Understand Global Vaccination Trends

Analyze how vaccination efforts progressed over time across different countries.
Identify patterns in vaccine distribution.

# Identify Top & Bottom Countries

Determine which countries have the highest and lowest vaccination rates.
Compare vaccination efforts across continents.

# Analyze Vaccine Distribution

Identify the types of vaccines used in different countries.
Compare the adoption rate of different vaccines.
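
One way to compare vaccine adoption is to count how many countries report each vaccine. The sketch below assumes the `vaccines` column stores a comma-separated list per row; the toy rows and the names `toy`, `per_vaccine`, and `countries_per_vaccine` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Toy rows mimicking a comma-separated `vaccines` column (values are illustrative)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "vaccines": ["Pfizer/BioNTech, Moderna", "Pfizer/BioNTech", "Sinovac, Pfizer/BioNTech"],
})

# Split each list and expand it to one row per (country, vaccine) pair
per_vaccine = (
    toy.assign(vaccine=toy["vaccines"].str.split(", "))
       .explode("vaccine")
       .drop_duplicates(["country", "vaccine"])
)

# Number of distinct countries reporting each vaccine
countries_per_vaccine = (
    per_vaccine.groupby("vaccine")["country"].nunique().sort_values(ascending=False)
)
print(countries_per_vaccine)
```

Counting distinct countries rather than rows avoids over-weighting countries that simply have more daily records.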

# Study the Impact of Vaccination Efforts

Examine how total vaccinations correlate with COVID-19 cases/deaths (if additional data is available).
Evaluate the effectiveness of vaccination campaigns.
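
If a case dataset is available, the correlation step could look roughly like the sketch below. The `cases` table and its `new_cases_per_million` column are hypothetical (the vaccination CSV does not contain them), and the numbers are toy values chosen only to show the mechanics.

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=4, freq="D")

# Toy vaccination data (column name taken from the vaccination dataset)
vaccinations = pd.DataFrame({
    "country": ["A"] * 4,
    "date": dates,
    "people_vaccinated_per_hundred": [10.0, 20.0, 30.0, 40.0],
})

# Hypothetical case data from a separate source
cases = pd.DataFrame({
    "country": ["A"] * 4,
    "date": dates,
    "new_cases_per_million": [40.0, 30.0, 20.0, 10.0],
})

# Align the two sources on country and date before correlating
merged = vaccinations.merge(cases, on=["country", "date"], how="inner")
corr = merged["people_vaccinated_per_hundred"].corr(merged["new_cases_per_million"])
print(f"Pearson correlation: {corr:.2f}")
```

A correlation of this kind describes association only; it does not by itself show that vaccination caused a change in cases.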

# Predict Future Vaccination Trends (if using ML models)

Forecast future vaccination rates based on historical data.
Identify countries that may need more vaccine supply based on trends.

# Real-World Significance

* Tracking Global Vaccination Progress – Helps monitor immunization rates, identify lagging regions, and ensure timely responses.

* Identifying Inequities in Vaccine Distribution – Highlights disparities between countries and socioeconomic groups, aiding policy decisions.

* Evaluating Vaccine Effectiveness – Determines how well vaccines reduce infections, hospitalizations, and deaths.

* Preparing for Future Pandemics – Provides insights into vaccination strategies to improve future outbreak responses.

* Understanding Public Behavior & Vaccine Hesitancy – Helps address misinformation and improve public health campaigns.

* Economic & Social Impact – Assesses how vaccinations influence economic recovery, workforce stability, and social behaviors.


# Future Scope

# Advanced Predictive Modeling
Forecasting future pandemics or healthcare trends using ARIMA, LSTMs, or XGBoost.
# Data Integration for Deeper Insights
Combining multiple data sources (WHO, CDC, local health records) for richer analysis.
# AI & ML Applications
Using deep learning for disease detection or NLP for analyzing public health sentiment.
# Public Health & Policy Making
Data-driven decision-making models for vaccine distribution and policy impact analysis.
# Real-Time Dashboards & Visualization
Power BI/Tableau dashboards showing live public health trends.
# Expansion Beyond COVID-19
Extending analysis to other diseases (influenza, tuberculosis) or healthcare analytics.


# Step 1: Import Required Libraries

In [None]:
!pip install statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier # Import RandomForestClassifier from sklearn.ensemble
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import r2_score
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Step 2: Understanding the Dataset

In [None]:
df=pd.read_csv('/content/country_vaccinations.csv')

In [None]:
print(df)

# Step 3: Preview the Dataset

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.country.value_counts()

In [None]:
df.vaccines.value_counts()

In [None]:
df.columns

# Step 4: Check the Shape of the Dataset

In [None]:
num_rows, num_columns = df.shape
print(f"Number of Instances (Rows): {num_rows}")
print(f"Number of Features (Columns): {num_columns}")

# Step 5: Check column information

In [None]:
# info() prints its report directly and returns None, so it is not wrapped in display()
df.info()

# Step 6: Check for Duplicates

In [None]:
print(f"Duplicate rows: {df.duplicated().sum()}")

In [None]:
df['date']=pd.to_datetime(df['date'])

In [None]:
print(df.dtypes)

# Step 7: Check for Missing Values

In [None]:
for column in ['country', 'source_name', 'vaccines']:
    print(f"\nUnique values in '{column}': {df[column].unique()[:20]}")  # Displaying first 20 unique values for brevity
print("\nMissing Values per Column:")
display(df.isnull().sum())
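
Raw missing counts are easier to judge as a share of all rows. The following is a minimal sketch on toy data, assuming the same column names as the dataset; `toy` and `missing_pct` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two numeric columns
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "daily_vaccinations": [100.0, np.nan, 300.0, np.nan],
    "people_vaccinated": [1.0, 2.0, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a high missing share may be better handled by dropping or by per-country forward filling than by a single global fill value.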

# Step 8: Summary Statistics

In [None]:
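print("Descriptive Statistics for Numerical Columns:")
display(df.describe())
print("\nUnique Values in Categorical Columns:")
display(df.select_dtypes(exclude=["number", "datetime"]).nunique())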

# Step 9: Analyze Data Types

In [None]:
print(df.dtypes.value_counts())

In [None]:
print(df)

In [None]:
df.fillna(df.median(numeric_only=True), inplace=True)

# Check if the 'vaccines' column exists before proceeding
if 'vaccines' in df.columns:
    # Encoding categorical variables (if needed)
    df = pd.get_dummies(df, columns=["vaccines"], drop_first=True)
else:
    print("Warning: 'vaccines' column not found in DataFrame. Skipping encoding.")

# Normalize numerical data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]] = scaler.fit_transform(
    df[["total_vaccinations", "people_vaccinated", "people_fully_vaccinated"]]
)
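
For reference, min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks a manual calculation against `MinMaxScaler` on toy values; the array contents are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy column of cumulative counts
values = np.array([[100.0], [250.0], [400.0]])

# Manual min-max scaling, column by column
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())
```

Note that after scaling, these columns no longer hold raw counts, so any later totals or rankings computed from them are on the scaled 0 to 1 range.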

In [None]:
import matplotlib.pyplot as plt

numerical_columns = ['total_vaccinations', 'daily_vaccinations', 'people_vaccinated', 'people_fully_vaccinated']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
for i, column in enumerate(numerical_columns):
  row = i // 2
  col = i % 2
  df[column].hist(ax=axes[row, col], bins=20)
  axes[row, col].set_title(f"Distribution of {column}")
  axes[row, col].set_xlabel(column)
  axes[row, col].set_ylabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
# Fill missing values in vaccination-related columns with 0
columns_to_fill_with_zero = ['daily_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
                             'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
                             'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million']
df[columns_to_fill_with_zero] = df[columns_to_fill_with_zero].fillna(0)

df['source_name'] = df['source_name'].fillna('Unknown')
# The 'vaccines' column was removed by get_dummies, so we don't need to fill it
#df['vaccines'] = df['vaccines'].fillna('Unknown')

df['date'] = pd.to_datetime(df['date'])

df.drop_duplicates(inplace=True)

# Remove 'vaccines' from the list of columns to print unique values for
for column in ['country', 'source_name']:
  print(f"\nUnique values in '{column}': {df[column].unique()[:20]}")  # Displaying first 20 unique values for brevity


print("\nMissing Values per Column:")
display(df.isnull().sum())

In [None]:
df

In [None]:
plt.figure(figsize=(12, 6))
numeric_df = df.select_dtypes(include=np.number)  # Select only numeric columns for correlation
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate the IQR and identify outliers based on a threshold (e.g., 1.5 times the IQR)
q1 = df['daily_vaccinations'].quantile(0.25)
q3 = df['daily_vaccinations'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df['daily_vaccinations'] < lower_bound) | (df['daily_vaccinations'] > upper_bound)]

# Create a box plot to visualize the distribution and outliers
plt.figure(figsize=(15, 8))
plt.boxplot(df['daily_vaccinations'].dropna())
plt.ylabel('Daily Vaccinations')
plt.title('Box Plot of Daily Vaccinations with Outliers')
plt.show()

# Create a scatter plot to highlight the outliers
plt.figure(figsize=(15, 8))
plt.scatter(df.index, df['daily_vaccinations'], label='Daily Vaccinations')
plt.scatter(outliers.index, outliers['daily_vaccinations'], color='red', label='Outliers')
plt.xlabel('Index')
plt.ylabel('Daily Vaccinations')
plt.title('Scatter Plot of Daily Vaccinations with Outliers')
plt.legend()
plt.show()
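
The IQR rule used above can be wrapped in a small helper so it is reusable for other columns. This is a sketch on toy values; `iqr_bounds` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier fences based on k times the IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy daily counts with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 500])
lower, upper = iqr_bounds(values)

# Keep only the points outside the fences
outlier_values = values[(values < lower) | (values > upper)]
print(lower, upper, outlier_values.tolist())
```

Because countries differ greatly in size, applying the helper per country (for example inside a `groupby`) may flag more meaningful outliers than one global threshold.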

In [None]:
df['total_vaccinations'] = df['total_vaccinations'].fillna(0)
df['people_vaccinated'] = df['people_vaccinated'].fillna(0)
df['people_fully_vaccinated'] = df['people_fully_vaccinated'].fillna(0)

df.loc[df['total_vaccinations'] < df['people_vaccinated'], 'total_vaccinations'] = df['people_vaccinated']

df.loc[df['total_vaccinations'] < df['people_fully_vaccinated'], 'total_vaccinations'] = df['people_fully_vaccinated']

In [None]:
total_vaccinations_per_country = df.groupby('country')['total_vaccinations'].max()

population_data = df.groupby('country').size()
population_data = population_data.fillna(10000000)

vaccination_rate = total_vaccinations_per_country / population_data

df['daily_vaccination_growth_rate'] = df.groupby('country')['daily_vaccinations'].pct_change()


df['rolling_avg_daily_vaccinations'] = df.groupby('country')['daily_vaccinations'].rolling(window=7).mean().reset_index(level=0, drop=True)

print("Total Vaccinations per Country:")
display(total_vaccinations_per_country.head())
print("\nVaccination Rate:")
display(vaccination_rate.head())
print("\nDataFrame with New Features:")
display(df.head())
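
Rolling averages and growth rates are only meaningful when rows are in date order within each country. The sketch below shows one way to do this with `groupby(...).transform`, which keeps the result aligned with the original index; the toy data and the 3-day window are assumptions chosen for brevity.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2021-01-01", periods=4, freq="D")) * 2,
    "daily_vaccinations": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
})

# Sort so that each country's rows are chronological
toy = toy.sort_values(["country", "date"]).reset_index(drop=True)

# Per-country rolling mean; transform returns values aligned to toy's index
toy["rolling_avg"] = toy.groupby("country")["daily_vaccinations"].transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Per-country day-over-day growth; the first row of each country is NaN
toy["growth_rate"] = toy.groupby("country")["daily_vaccinations"].pct_change()
print(toy)
```

Using `min_periods=1` avoids leading NaN values in the rolling column; drop that argument if partial windows should not be averaged.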

In [None]:
# Distribution of Daily Vaccinations
sns.histplot(df["daily_vaccinations"], bins=60, kde=True)
plt.title("Daily Vaccinations Distribution")
plt.show()

In [None]:
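# NOTE: the dataset has no population column. The number of rows per country is used
# below only as a stand-in denominator, so this "rate" is not a per-capita figure;
# replace population_data with real population counts if available.
population_data = df.groupby('country').size()
population_data = population_data.fillna(10000000)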

# Calculate the vaccination rate
total_vaccinations_by_country = df.groupby('country')['total_vaccinations'].max()
vaccination_rate_by_country = total_vaccinations_by_country / population_data

# Create a scatter plot
plt.figure(figsize=(15, 8))
plt.scatter(total_vaccinations_by_country, vaccination_rate_by_country)
plt.xlabel('Total Vaccinations')
plt.ylabel('Vaccination Rate')
plt.title('Vaccination Rate vs. Total Vaccinations')
plt.grid(True)
plt.show()
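
Since the dataset already carries a population-normalized column, `total_vaccinations_per_hundred`, a per-capita comparison may not need external population data at all. A minimal sketch on toy rows, assuming that column is cumulative so that its maximum is the latest reported level:

```python
import pandas as pd

# Toy rows; real values would come from country_vaccinations.csv
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "total_vaccinations_per_hundred": [5.0, 12.5, 40.0, 61.0],
})

# Highest cumulative doses per hundred people reported by each country
rate_per_hundred = toy.groupby("country")["total_vaccinations_per_hundred"].max()
ranked = rate_per_hundred.sort_values(ascending=False)
print(ranked)
```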

In [None]:
average_daily_vaccinations = df.groupby(['country', 'date'])['daily_vaccinations'].mean().reset_index()
plt.figure(figsize=(12, 6))
for country in average_daily_vaccinations['country'].unique():
    country_data = average_daily_vaccinations[average_daily_vaccinations['country'] == country]
    # Plot inside the loop so every country gets a line, not only the last one
    plt.plot(country_data['date'], country_data['daily_vaccinations'], label=country)
plt.xlabel('Date')
plt.ylabel('Average Daily Vaccinations')
plt.title('Daily Vaccination Trends')
plt.legend()
plt.show()

highest_avg_daily_vaccinations = average_daily_vaccinations.groupby('country')['daily_vaccinations'].mean().nlargest(10)
lowest_avg_daily_vaccinations = average_daily_vaccinations.groupby('country')['daily_vaccinations'].mean().nsmallest(10)

print("Countries with the highest average daily vaccination rates:")
display(highest_avg_daily_vaccinations)
print("\nCountries with the lowest average daily vaccination rates:")
display(lowest_avg_daily_vaccinations)

In [None]:
total_vaccinations_by_country = df.groupby('country')['total_vaccinations'].max()
highest_total_vaccinations = total_vaccinations_by_country.nlargest(10)
lowest_total_vaccinations = total_vaccinations_by_country.nsmallest(10)
print("Countries with the highest total vaccination numbers:")
display(highest_total_vaccinations)
print("\nCountries with the lowest total vaccination numbers:")
display(lowest_total_vaccinations)
plt.figure(figsize=(12, 6))
for country in df['country'].unique():
    country_data = df[df['country'] == country]
    # Plot inside the loop so every country gets a line, not only the last one
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country)
plt.xlabel('Date')
plt.ylabel('Total Vaccinations')
plt.title('Total Vaccination Trends')
plt.legend()
plt.show()

In [None]:
population_data = df.groupby('country').size()
population_data = population_data.fillna(10000000)
vaccination_rate_by_country = total_vaccinations_by_country / population_data

# Analyze the countries with the highest and lowest vaccination rates
highest_vaccination_rate = vaccination_rate_by_country.nlargest(5)
lowest_vaccination_rate = vaccination_rate_by_country.nsmallest(5)

print("Countries with the highest vaccination rates:")
display(highest_vaccination_rate)
print("\nCountries with the lowest vaccination rates:")
display(lowest_vaccination_rate)
plt.figure(figsize=(12, 6))
for country in df['country'].unique():
    country_data = df[df['country'] == country]
    plt.plot(country_data['date'], country_data['total_vaccinations'] / population_data[country], label=country)
plt.xlabel('Date')
plt.ylabel('Vaccination Rate')
plt.title('Vaccination Rate Trends')
plt.show()

In [None]:
selected_countries = ['United States', 'United Kingdom', 'India', 'China', 'Brazil']
average_daily_vaccinations = df.groupby(['country', 'date'])['daily_vaccinations'].mean().reset_index()
plt.figure(figsize=(15, 8))
for country in selected_countries:
  country_data = average_daily_vaccinations[average_daily_vaccinations['country'] == country]
  plt.plot(country_data['date'], country_data['daily_vaccinations'], label=country)
plt.xlabel('Date')
plt.ylabel('Average Daily Vaccinations')
plt.title('Average Daily Vaccination Trends for Selected Countries')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
df_forecast = df.groupby("date")["total_vaccinations"].sum()
# Check stationarity (Dickey-Fuller Test)
adf_test = adfuller(df_forecast.dropna())
print("ADF Statistic:", adf_test[0])
print("p-value:", adf_test[1])

In [None]:
from statsmodels.tsa.arima.model import ARIMA
models_fit = ARIMA(df_forecast, order=(5,1,0)).fit()
print(models_fit.summary())

# Forecast next 30 days; keep the result separate so the historical series is not overwritten
forecast_values = models_fit.forecast(steps=30)

# Build the forecast index so it starts the day after the last observed date
forecast_index = pd.date_range(start=df_forecast.index[-1], periods=31, freq="D")[1:]

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(df_forecast, label="Actual")
plt.plot(forecast_index, np.asarray(forecast_values), label="Forecast", color="red")
plt.title("ARIMA Forecasting - Total Vaccinations")
plt.xlabel("Date")
plt.ylabel("Total Vaccinations")
plt.legend()
plt.show()
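
Before relying on a forecast, it can help to hold out the last few days and compare against a naive baseline that simply repeats the last observed value. The sketch below uses only pandas and NumPy on a toy series; the 3-day hold-out and the variable names are assumptions, and an ARIMA forecast for the same hold-out window would be scored the same way.

```python
import numpy as np
import pandas as pd

# Toy cumulative series: 100, 200, ..., 1000 over ten days
series = pd.Series(
    np.arange(100.0, 1100.0, 100.0),
    index=pd.date_range("2021-01-01", periods=10, freq="D"),
)

# Chronological split: the last 3 days are held out, never shuffled
train, test = series.iloc[:-3], series.iloc[-3:]

# Naive baseline: repeat the last training value for every held-out day
naive_forecast = pd.Series(train.iloc[-1], index=test.index)

rmse = float(np.sqrt(((test - naive_forecast) ** 2).mean()))
print(f"Naive baseline RMSE: {rmse:.2f}")
```

A model is only worth keeping if its hold-out error is clearly lower than this baseline.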

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Define features and target
features = [
    "people_vaccinated", "people_fully_vaccinated", "daily_vaccinations_raw", "daily_vaccinations",
    "total_vaccinations_per_hundred", "people_vaccinated_per_hundred", "people_fully_vaccinated_per_hundred",
    "daily_vaccinations_per_million", "source_name", "source_website"
]
target = "total_vaccinations"

# Drop rows with missing values
df = df.dropna(subset=features + [target])
X = df[features].copy()  # copy so the new encoded columns do not trigger SettingWithCopyWarning
y = df[target]

# Encode categorical variables before splitting, so labels that appear only in
# the test split are still known to the encoder
label_encoder_source = LabelEncoder()
X["source_name_encoded"] = label_encoder_source.fit_transform(X["source_name"])

label_encoder_website = LabelEncoder()
X["source_website_encoded"] = label_encoder_website.fit_transform(X["source_website"])

# Drop original categorical columns
X = X.drop(["source_name", "source_website"], axis=1)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Model evaluation with regression metrics.
# Accuracy and classification_report apply to discrete class labels, not to a
# continuous target such as total_vaccinations, so they are not reported here.
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", np.sqrt(mse))
print("R-squared Score:", r2_score(y_test, y_pred))
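
For a continuous target, error-based metrics are the usual choice. The toy example below shows how MAE, RMSE, and R-squared are computed with scikit-learn; the arrays are illustrative values, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_hat)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # penalizes large errors more
r2 = r2_score(y_true, y_hat)                       # share of variance explained

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

MAE and RMSE are in the same units as the target, which makes them easier to interpret than MSE.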

# Step 10: Hyperparameter Tuning (GridSearchCV)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Load the dataset
file_path = "/content/country_vaccinations.csv"
df = pd.read_csv(file_path)

# Selecting relevant numerical features (excluding target variable)
features = ['total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated',
            'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
            'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million']
target = 'daily_vaccinations'

# Dropping rows where target is NaN
df_cleaned = df.dropna(subset=[target])

# Splitting data into features (X) and target (y)
X = df_cleaned[features]
y = df_cleaned[target]

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a pipeline with imputation, scaling, and RandomForestRegressor
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

# Defining hyperparameter grid
# Reduced search space to potentially reduce processing time
param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 10],
    'model__min_samples_split': [2, 5]
}

# Performing GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
# GridSearchCV only sets best_params_ after the full search finishes,
# so an interrupted run has no partial result to report
try:
    grid_search.fit(X_train, y_train)
except KeyboardInterrupt:
    print("Grid search interrupted before completion; no best parameters are available.")
else:  # If fit completes without interruption
    print("Best parameters:", grid_search.best_params_)
    print("RMSE:", np.sqrt(-grid_search.best_score_))
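
The RMSE reported by the grid search is a cross-validated score on the training split. A common follow-up is to score the refit best pipeline on the held-out test split as well. The sketch below does this on synthetic data from `make_regression` with a deliberately tiny grid so it runs quickly; the data, grid values, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic regression data standing in for the vaccination features
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(random_state=0)),
])
param_grid = {"model__n_estimators": [10, 20], "model__max_depth": [None, 5]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# Cross-validated RMSE of the best configuration (training split only)
cv_rmse = np.sqrt(-search.best_score_)

# RMSE of the refit best pipeline on data it has never seen
test_rmse = np.sqrt(mean_squared_error(y_test, search.best_estimator_.predict(X_test)))

print("Best parameters:", search.best_params_)
print(f"CV RMSE: {cv_rmse:.2f}  Test RMSE: {test_rmse:.2f}")
```

A test RMSE far above the cross-validated RMSE would suggest overfitting or a mismatch between the splits.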