<a href="https://colab.research.google.com/github/N-Vasu-Reddy/Exploring-COVID19-Data/blob/main/Covid-19-Data-Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **What is COVID-19?**

COVID-19 is a respiratory illness caused by a new virus. Symptoms include fever, coughing, sore throat and shortness of breath. The virus can spread from person to person, but good hygiene can prevent infection.

**Related Information about COVID-19**
COVID-19 is not always fatal, but it spreads faster than many other diseases, such as the common cold. Every virus has a basic reproduction number (R0), which indicates how many people, on average, catch the disease from one infected person. According to initial research, the R0 of COVID-19 is 2.7.
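
To make the reproduction number concrete, the sketch below counts new infections per generation of spread. It rests on a strong simplifying assumption: every infected person infects exactly R0 others and nobody becomes immune. The function name `infections_per_generation` and the choice of five generations are illustrative and not part of the original analysis.

```python
def infections_per_generation(r0, generations, initial_cases=1):
    """Return new infections in each generation, assuming a constant R0 (illustrative only)."""
    return [initial_cases * r0 ** g for g in range(generations + 1)]

# R0 = 2.7 is the value quoted in the text above
new_cases = infections_per_generation(r0=2.7, generations=5)
for generation, cases in enumerate(new_cases):
    print(f"Generation {generation}: about {cases:.0f} new infections")
```

Under this assumption, each generation is 2.7 times larger than the previous one, which is what exponential growth means here.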

Currently, the goal of scientists around the world is to "Flatten the Curve". COVID-19 is currently growing at an exponential rate worldwide. Flattening the curve means that even if the number of confirmed cases keeps increasing, those cases should be spread over a longer period of time. Put simply, if COVID-19 is going to infect 100K people, those infections should occur over one year rather than within one month.
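
The 100K example can be turned into rough arithmetic. The sketch below compares the average number of new cases per day when the same total is spread over one month versus one year. It assumes a uniform spread over each period and a 30-day month, which is a simplification; the helper `average_daily_cases` is a hypothetical name used only for illustration.

```python
TOTAL_CASES = 100_000  # figure used in the example above

def average_daily_cases(total, days):
    """Average new cases per day if `total` cases are spread evenly over `days`."""
    return total / days

one_month = average_daily_cases(TOTAL_CASES, 30)   # assumed 30-day month
one_year = average_daily_cases(TOTAL_CASES, 365)

print(f"Spread over one month: about {one_month:.0f} cases per day")
print(f"Spread over one year:  about {one_year:.0f} cases per day")
print(f"Daily load is roughly {one_month / one_year:.0f} times higher in the one-month scenario")
```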

The reason to Flatten the Curve is to reduce the load on medical systems, so that more focus can go to research into a medicine for the disease.

Every Pandemic has four stages:

**Stage 1**: Confirmed Cases come from other countries

**Stage 2**: Local Transmission Begins

**Stage 3**: Communities impacted with local transmission

**Stage 4**: Significant Transmission with no end in sight

Italy, the USA, the UK and France are the countries currently in Stage 4, while India is on the edge of Stage 3.

Besides travel bans, cross-border shutdowns and bans on immigration, other ways to tackle a disease like COVID-19 are testing, contact tracing and quarantine.

**Objective of the Notebook**

The objective of this notebook is to study the COVID-19 outbreak with the help of some basic visualization techniques, to compare China, where COVID-19 originated, with the rest of the world, and to perform predictions in order to study the impact and spread of COVID-19 in the coming days.

#Let's get started

#1. Importing Libraries and Setting Up



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from ipywidgets import interact # For interactive widgets
import warnings
warnings.filterwarnings('ignore') # Ignore warnings

#2. Loading and Displaying Datasets

In [None]:
cases_df = pd.read_csv('/content/WHO-COVID-19-global-table-data.csv',encoding='latin1')
#cases_df contains information about covid-19 cases and deaths globally.
vacc_df = pd.read_csv('/content/vaccination-data.csv',encoding='latin1')
#vacc_df contains information about covid-19 vaccinations globally.
loc_df = pd.read_csv('/content/location.csv',encoding='latin1')
#loc_df contains geospatial location information of each country.

In [None]:
cases_df.head()

In [None]:
vacc_df.head()

In [None]:
loc_df.head()

#3. Dataset Information

In [None]:
cases_df.info()

In [None]:
vacc_df.info()

#4. Cleaning and Preprocessing Datasets

##4.1. Dropping Irrelevant Columns

In [None]:
cases_df.drop(["WHO Region","Deaths - newly reported in last 7 days per 100000 population"], axis=1, inplace=True)
vacc_df.drop(["WHO_REGION", "DATA_SOURCE", "DATE_UPDATED", "VACCINES_USED", "NUMBER_VACCINES_TYPES_USED", "FIRST_VACCINE_DATE"], axis=1, inplace=True)
vacc_df = vacc_df.dropna(subset=['TOTAL_VACCINATIONS'])

##4.2. Renaming Columns

In [None]:
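#renaming columns of cases_df, vacc_df and loc_df to improve readability of the features
#note: the source columns are per 100000 population; the "_per10000" suffix is kept as-is because later cells use these names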
cases_df.rename(columns={
    "Cases - cumulative total": "cases_tot",
    "Cases - cumulative total per 100000 population": "cases_tot_per10000",
    "Cases - newly reported in last 7 days": "new_cases_7d",
    "Cases - newly reported in last 7 days per 100000 population": "new_cases_7d_per10000",
    "Cases - newly reported in last 24 hours": "new_cases_24h",
    "Deaths - cumulative total": "deaths_tot",
    "Deaths - cumulative total per 100000 population": "deaths_tot_per10000",
    "Deaths - newly reported in last 7 days": "new_deaths_7d",
    "Deaths - newly reported in last 24 hours": "new_deaths_24h",
    "Name": "country"
}, inplace=True)

vacc_df.rename(columns={"COUNTRY": "country"}, inplace=True)
loc_df.rename(columns={"country": "code", "name": "country"}, inplace=True)
vacc_df.columns = [col.lower() for col in vacc_df.columns]

In [None]:
cases_df.head()

In [None]:
vacc_df.head()

#5. Merging Datasets

In this step, I create a single COVID-19 dataframe (covid_df) from the three available dataframes (loc_df, cases_df and vacc_df). loc_df and cases_df are inner-joined on the country name, and the vaccination data is then left-joined so that countries without vaccination records are kept.

In [None]:
# Merging location, COVID-19 cases, and vaccination data into a single dataframe: covid_df
covid_df = pd.merge_ordered(loc_df, cases_df, on="country", how="inner")
# Join on "country" explicitly; the left join keeps countries without vaccination records
covid_df = pd.merge_ordered(covid_df, vacc_df, on="country", how="left")
covid_df.rename(columns={covid_df.columns[0]: "code"}, inplace=True)

In [None]:
covid_df.head()
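
A quick way to see which countries lose information in such a merge is the `indicator` option of `DataFrame.merge`. The sketch below uses small made-up frames (`loc_toy`, `cases_toy`, `vacc_toy`) as stand-ins for the real dataframes. It also uses `DataFrame.merge` rather than `pd.merge_ordered`, since `indicator` is a parameter of the former. It illustrates the idea and is not the notebook's actual merge.

```python
import pandas as pd

# Toy stand-ins for loc_df, cases_df and vacc_df (all values are made up)
loc_toy = pd.DataFrame({"code": ["AA", "BB", "CC"], "country": ["A", "B", "C"]})
cases_toy = pd.DataFrame({"country": ["A", "B"], "cases_tot": [100, 250]})
vacc_toy = pd.DataFrame({"country": ["A"], "total_vaccinations": [80]})

# Inner join: only countries present in both location and case data survive
merged = loc_toy.merge(cases_toy, on="country", how="inner")

# Left join with indicator: "_merge" records whether a vaccination row was found
merged = merged.merge(vacc_toy, on="country", how="left", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(merged)
print("Countries without vaccination data:", unmatched)
```

Country C is dropped by the inner join because it has no case data, while country B is kept by the left join with a missing `total_vaccinations` value.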

#6. Feature Engineering

##6.1. Creating New Features

In [None]:
covid_df['vaccination_gap'] = covid_df['persons_vaccinated_1plus_dose'] - covid_df['persons_last_dose']
covid_df['alive'] = covid_df['cases_tot'] - covid_df['deaths_tot']

In [None]:
covid_df.head()

##6.2. Replacing Zeroes with NaN

In [None]:
# Replacing 0 values with NaN for specific columns
cols_to_replace = [
    'cases_tot', 'cases_tot_per10000', 'new_cases_7d', 'new_cases_7d_per10000',
    'new_cases_24h', 'deaths_tot', 'deaths_tot_per10000', 'new_deaths_7d',
    'new_deaths_24h', 'total_vaccinations', 'persons_vaccinated_1plus_dose',
    'persons_booster_add_dose'
]
for col in cols_to_replace:
    if col in covid_df.columns:
        covid_df[col] = covid_df[col].replace(0, np.nan)

In [None]:
covid_df.head()

In [None]:
covid_df.shape

In [None]:
covid_df.info()

In [None]:
#covid_df.to_csv("covid_df.csv",index=False)

#7. Profile Reporting

In [None]:
!pip install ydata-profiling --quiet

In [None]:
import ydata_profiling

In [None]:
profile = ydata_profiling.ProfileReport(covid_df)
profile.to_notebook_iframe()

#8. Data Analysis and Visualization

##8.1. Pie Chart: COVID-19 Cases Distribution

In [None]:
fig = go.Figure(data=[go.Pie(
    labels=['Cumulative sum of cases','Cumulative sum of vaccinations','Cumulative sum of deaths'],
    values=[covid_df['cases_tot'].sum(),covid_df['total_vaccinations'].sum(),covid_df['deaths_tot'].sum()],
    hole=0.4,
    marker=dict(colors=['#ffc107', '#28a745','#dc3545'])
)])
fig.update_layout(
    title_text='COVID-19 Cases Distribution',
    title_x=0.5,
    annotations=[dict(text='Cases', x=0.5, y=0.5, font_size=20, showarrow=False)]
)
fig.show()

##8.2. Correlation Matrix

In [None]:
numeric_cols = covid_df.select_dtypes(include=np.number).drop(columns=['latitude', 'longitude'], errors='ignore')
corr_matrix = numeric_cols.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of COVID-19 Data')
plt.show()

##8.3. Country-Specific Analysis

In [None]:
fig,axes=plt.subplots(1,3,figsize=(20,10))
top_10_countries = covid_df.nlargest(10, 'cases_tot')
colors = sns.color_palette("Blues_r", n_colors=20)
sns.barplot(x='country', y='cases_tot', data=top_10_countries, palette=colors,ax=axes[0])
axes[0].set_title('Top 10 Countries with Highest Total Cases')
axes[0].set_xlabel('Country')
axes[0].set_ylabel('Total Cases')
axes[0].tick_params(axis='x',rotation=90)

top_10_countries = covid_df.nlargest(10, 'deaths_tot').sort_values(by='deaths_tot', ascending=False)
colors = sns.color_palette("Reds_r", n_colors=30)
sns.barplot(x='country', y='deaths_tot', data=top_10_countries,palette=colors,ax=axes[1])
axes[1].set_title('Top 10 Countries with Highest Death Cases')
axes[1].set_xlabel('Country')
axes[1].set_ylabel('Total Death Cases')
axes[1].tick_params(axis='x',rotation=90)

top_10_countries = covid_df.nlargest(10, 'total_vaccinations')
colors = sns.color_palette("Greens_r", n_colors=20)
sns.barplot(x='country', y='total_vaccinations', data=top_10_countries,palette=colors,ax = axes[2])
axes[2].set_title('Top 10 Countries with Highest Total Vaccinations')
axes[2].set_xlabel('Country')
axes[2].set_ylabel('Total Vaccinations')
axes[2].tick_params(axis='x',rotation=90)  # rotate only the x tick labels, as in the other two panels
plt.show()

##8.4. Statistical Analysis

###8.4.1.  Boxplot of Total Vaccinations

In [None]:
covid_df[['total_vaccinations','total_vaccinations_per100']].describe()

In [None]:
# iloc[2:] drops the two countries with the most vaccinations so they do not compress the box
sns.boxplot(x='total_vaccinations',data=covid_df.sort_values(by='total_vaccinations',ascending=False).iloc[2:,:])
plt.title('Boxplot of Total Vaccinations')
plt.xlabel('Total Vaccinations')
plt.show()

###8.4.2. Boxplot of Total Vaccinations Per 100

In [None]:
covid_df['total_vaccinations_per100'].describe()

In [None]:
# iloc[2:] drops the two countries with the highest rate so they do not compress the box
sns.boxplot(x='total_vaccinations_per100',data=covid_df.sort_values(by='total_vaccinations_per100',ascending=False).iloc[2:,:])
plt.title('Boxplot of Total Vaccinations Per 100')
plt.xlabel('Total Vaccinations Per 100')
plt.show()

###8.4.3. Histogram of Total Vaccinations Per 100

In [None]:
sns.histplot(x='total_vaccinations_per100',kde=True,bins=30,data=covid_df)
plt.title('Histogram of Total Vaccinations Per 100')
plt.xlabel('Total Vaccinations Per 100')
plt.ylabel('Frequency')
plt.show()

###8.4.4. Outlier Detection for Total Vaccinations

In [None]:
# Series.quantile skips NaN values; np.quantile would return NaN for this column
vacc = covid_df['total_vaccinations']
Q1 = vacc.quantile(0.25)
Q3 = vacc.quantile(0.75)
iqr_value = Q3 - Q1  # interquartile range
lower_bound = Q1 - 1.5*iqr_value
upper_bound = Q3 + 1.5*iqr_value
# Rows outside the 1.5*IQR fences are flagged as outliers
outliers = covid_df[(vacc<lower_bound) | (vacc>upper_bound)]
print(len(outliers))

### Observations:


1. The highest frequency occurs around 100 vaccinations per 100 people. This indicates that a significant number of countries (or regions) have achieved approximately this level of vaccination.
2. The distribution appears to be right-skewed, with a long tail extending towards higher vaccination rates. This suggests that while many countries have moderate vaccination levels, a few countries have exceptionally high rates.
3. The density curve illustrates a bimodal distribution, indicating two distinct clusters of countries: one centered around 100 vaccinations per 100 people, and the other around 200 vaccinations per 100 people.
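
One way to check the skewness observation numerically, rather than only by eye, is the sample skewness from `Series.skew()`, where a positive value points to a longer right tail. The sketch below uses a small made-up series (`rates`) as a stand-in for `total_vaccinations_per100`, so its numbers say nothing about the real data.

```python
import pandas as pd

# Made-up vaccination rates per 100 people, with a few very high values
rates = pd.Series([20, 45, 60, 75, 90, 100, 105, 110, 150, 210, 320])

skewness = rates.skew()          # sample skewness; NaN values would be skipped
mean_rate = rates.mean()
median_rate = rates.median()

print(f"Skewness: {skewness:.2f}")
print(f"Mean: {mean_rate:.1f}, median: {median_rate:.1f}")
```

For a right-skewed distribution the mean usually sits above the median, because the long upper tail pulls the mean up.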

##8.5 Bivariate analysis

### 8.5.1 Total Vaccinations vs Total Vaccinations Per 100

In [None]:
sns.scatterplot(x='total_vaccinations',y='total_vaccinations_per100',data=covid_df)
plt.title('Total Vaccinations vs Total Vaccinations Per 100')
plt.xlabel('Total Vaccinations')
plt.ylabel('Total Vaccinations Per 100')
plt.show()

In [None]:
correlation = covid_df[['total_vaccinations', 'total_vaccinations_per100']].corr()
print(correlation)

###8.5.2 Full Vaccination vs. Booster Uptake

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=covid_df, x='persons_last_dose_per100', y='persons_booster_add_dose_per100')
plt.title('Full Vaccination vs. Booster Uptake')
plt.xlabel('Full Vaccination Rate (per 100)')
plt.ylabel('Booster Uptake Rate (per 100)')
plt.show()

In [None]:
# Correlation between the two rates shown in the scatter plot above
covid_df['persons_last_dose_per100'].corr(covid_df['persons_booster_add_dose_per100'])

There is a high correlation between the full-vaccination (last dose) rate and the booster uptake rate. This suggests that regions where fewer people completed the last dose also saw lower booster uptake, so awareness of the COVID-19 pandemic and of vaccination should be promoted in those regions.
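
A correlation like this is sensitive to missing values and to a few extreme countries, so it can help to look at both the Pearson and the rank-based Spearman coefficient. The sketch below does that on a small made-up frame (`toy`) that reuses the two column names. `DataFrame.corr` ignores rows where either value is missing, and the numbers here are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rates per 100 people, including missing values
toy = pd.DataFrame({
    "persons_last_dose_per100": [10, 35, 50, np.nan, 70, 85],
    "persons_booster_add_dose_per100": [1, 10, 22, 30, np.nan, 60],
})

# Off-diagonal entry of the 2x2 correlation matrix; NaN pairs are excluded
pearson = toy.corr(method="pearson").iloc[0, 1]
spearman = toy.corr(method="spearman").iloc[0, 1]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

If the two coefficients differ a lot on the real data, the relationship is likely monotonic but not linear, or driven by a few outliers.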

##8.6 Comparing China with the Rest of the World

### 8.6.1 Filter Data for China and the Rest of the World

In [None]:
china_df = covid_df[covid_df['country']=='China']
world_df = covid_df[covid_df['country']!='China']
china_stats = china_df[['cases_tot', 'deaths_tot', 'total_vaccinations']].iloc[0].values
world_stats = world_df[['cases_tot', 'deaths_tot', 'total_vaccinations']].sum().values
categories = ['Total Cases', 'Total Deaths', 'Total Vaccinations']

### 8.6.2 Pie Charts for Proportional Comparison

In [None]:
#Creating a pie chart comparing China with the rest of the World
total_cases = [china_stats[0], world_stats[0]]
total_deaths = [china_stats[1], world_stats[1]]
total_vaccinations = [china_stats[2], world_stats[2]]
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

for ax, data, title in zip(
    axs,
    [total_cases, total_deaths, total_vaccinations],
    ['Total Cases', 'Total Deaths', 'Total Vaccinations']
):
    ax.pie(data, labels=['China', 'Rest of the World'], autopct='%1.1f%%', colors=['blue', 'orange'])
    ax.set_title(title)

plt.suptitle('Proportional Comparison of China vs. Rest of the World')
plt.tight_layout()
plt.show()

### 8.6.3. Bar Chart for Comparison

In [None]:
#Creating a bar chart comparing China with the rest of the World
fig = go.Figure()
fig.add_trace(go.Bar(
    x=categories,
    y=china_stats,
    name='China',
    marker_color='blue'
))
fig.add_trace(go.Bar(
    x=categories,
    y=world_stats,
    name='Rest of the World',
    marker_color='orange'
))
fig.update_layout(
    title='Comparison of China vs. Rest of the World',
    xaxis_title='Category',
    yaxis_title='Counts (log scale)',
    yaxis_type='log',
    barmode='group'
)
fig.show()

##8.7. Choropleth Analysis

### 8.7.1. Choropleth Map for Vaccination Rate

In [None]:
fig = px.choropleth(
    covid_df,
    locations='iso3',
    color='total_vaccinations_per100',
    hover_name='country',
    title='Vaccination Rate per 100 People by Country',
    color_continuous_scale='matter'
)
fig.show()


### 8.7.2. Dynamic COVID-19 Data Visualization Map

In [None]:
def create_map(column):
    base_map = folium.Map(location=[0, 0], zoom_start=2, tiles="cartodbpositron",
                          control_scale=True, no_wrap=True, max_bounds=True)
    # Skip countries with missing coordinates or a missing value for this column
    data = covid_df.dropna(subset=["latitude", "longitude", column])
    max_value = data[column].max()
    min_value = data[column].min()
    # One color per metric: blue for cases, red for deaths, green for vaccinations
    color = "blue" if column == "cases_tot" else "red" if column == "deaths_tot" else "green"
    for _, row in data.iterrows():
        # Scale the marker radius linearly between 5 and 25 based on the column value
        radius = 5 + (row[column] - min_value) / (max_value - min_value) * 20 if max_value > min_value else 3
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=radius,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.6,
            tooltip=f"{column}: {row[column]}<br>Country: {row['country']}"
        ).add_to(base_map)

    return base_map

interact(create_map, column=["cases_tot", "deaths_tot", "total_vaccinations", "total_vaccinations_per100"])
