<a href="https://colab.research.google.com/github/Subhajit-Dey45/Covid-19AnalysisandVisualization/blob/main/Covid_19AnalysisandVisualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Title : Covid-19 Analysis and Visualization**

##Github Link=

## **Project Summary**
The study uses three datasets: country-level statistics, time-series growth data, and US mortality figures broken down by age and condition. Cleaning steps, such as dropping the columns that are almost entirely null (NewCases, NewDeaths, and NewRecovered), keep sparsely populated fields from distorting the analysis.

### **Project Description**
The rapid global spread of COVID-19 created an unprecedented public health crisis, characterized by a massive volume of data that was often incomplete or inconsistent across different regions. Public health officials and researchers faced the significant challenge of transforming this raw, "dirty" data into actionable insights to understand the virus's trajectory and impact.

## 1. Importing Necessary Libraries

In [49]:
# Data analysis and Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.io as pio
import plotly.express as px

## Importing the Datasets
The project loads three CSV files, one per dataset.

In [50]:
covid=pd.read_csv("/content/covid.csv")
covid_group=pd.read_csv("/content/covid_grouped.csv")
covid_death=pd.read_csv("/content/coviddeath.csv")

### Initial Data Inspection


In [51]:
covid.head()

Unnamed: 0,Country/Region,Continent,Population,TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region,iso_alpha
0,USA,North America,331198100.0,5032179,,162804.0,,2576668.0,,2292707.0,18296.0,15194.0,492.0,63139605.0,190640.0,Americas,USA
1,Brazil,South America,212710700.0,2917562,,98644.0,,2047660.0,,771258.0,8318.0,13716.0,464.0,13206188.0,62085.0,Americas,BRA
2,India,Asia,1381345000.0,2025409,,41638.0,,1377384.0,,606387.0,8944.0,1466.0,30.0,22149351.0,16035.0,South-EastAsia,IND
3,Russia,Europe,145940900.0,871894,,14606.0,,676357.0,,180931.0,2300.0,5974.0,100.0,29716907.0,203623.0,Europe,RUS
4,South Africa,Africa,59381570.0,538184,,9604.0,,387316.0,,141264.0,539.0,9063.0,162.0,3149807.0,53044.0,Africa,ZAF


Further information about the dataset, such as its shape and column types, helps us decide how to prepare it for analysis.

In [52]:
# Returns tuple of shape (Rows, columns)
covid.shape

(209, 17)

In [53]:
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country/Region    209 non-null    object 
 1   Continent         208 non-null    object 
 2   Population        208 non-null    float64
 3   TotalCases        209 non-null    int64  
 4   NewCases          4 non-null      float64
 5   TotalDeaths       188 non-null    float64
 6   NewDeaths         3 non-null      float64
 7   TotalRecovered    205 non-null    float64
 8   NewRecovered      3 non-null      float64
 9   ActiveCases       205 non-null    float64
 10  Serious,Critical  122 non-null    float64
 11  Tot Cases/1M pop  208 non-null    float64
 12  Deaths/1M pop     187 non-null    float64
 13  TotalTests        191 non-null    float64
 14  Tests/1M pop      191 non-null    float64
 15  WHO Region        184 non-null    object 
 16  iso_alpha         209 non-null    object 
dtypes: float64(12), int64(1), object(4)

The dataset contains 209 rows and 17 columns: Country/Region, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases, "Serious,Critical" (a single column), Tot Cases/1M pop, Deaths/1M pop, TotalTests, Tests/1M pop, WHO Region, and iso_alpha.

Key observations:

Missing Values: NewCases, NewDeaths and NewRecovered are almost entirely empty (3 to 4 non-null values out of 209). "Serious,Critical" (122 non-null) and WHO Region (184 non-null) also have notable gaps.

Data Types: 12 columns are float64, 4 are object, and TotalCases is the only int64 column.

The other two datasets can be explored in the same way.
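Since the same three checks (shape, dtypes, missing values) are repeated for each dataset, they could be wrapped in a small helper. The sketch below is illustrative only: the function name `summarize` and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and percent missing."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })


# Toy frame standing in for one of the real datasets (hypothetical values)
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "TotalCases": [10, 20, 30, 40],
    "NewCases": [None, None, None, 5.0],
})

summary = summarize(toy)
print(summary)
```

In the notebook, the same call could be applied to `covid`, `covid_group`, and `covid_death` in turn.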

In [54]:
covid_group.head()

Unnamed: 0,Date,Country/Region,Confirmed,Deaths,Recovered,Active,New cases,New deaths,New recovered,WHO Region,iso_alpha
0,2020-01-22,Afghanistan,0,0,0,0,0,0,0,Eastern Mediterranean,AFG
1,2020-01-22,Albania,0,0,0,0,0,0,0,Europe,ALB
2,2020-01-22,Algeria,0,0,0,0,0,0,0,Africa,DZA
3,2020-01-22,Andorra,0,0,0,0,0,0,0,Europe,AND
4,2020-01-22,Angola,0,0,0,0,0,0,0,Africa,AGO


In [55]:
# Returns tuple of shape (Rows, columns)
covid_group.shape

(35156, 11)

In [56]:
covid_group.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35156 entries, 0 to 35155
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Date            35156 non-null  object
 1   Country/Region  35156 non-null  object
 2   Confirmed       35156 non-null  int64 
 3   Deaths          35156 non-null  int64 
 4   Recovered       35156 non-null  int64 
 5   Active          35156 non-null  int64 
 6   New cases       35156 non-null  int64 
 7   New deaths      35156 non-null  int64 
 8   New recovered   35156 non-null  int64 
 9   WHO Region      35156 non-null  object
 10  iso_alpha       35156 non-null  object
dtypes: int64(7), object(4)
memory usage: 3.0+ MB


This dataset contains 35156 rows and 11 columns: Date, Country/Region, Confirmed, Deaths, Recovered, Active, New cases, New deaths, New recovered, WHO Region, and iso_alpha.

Key observations:

Missing Values: No missing values

Data Types: 7 columns are int64 and 4 are object. Date is stored as an object (string) column, so it must be converted to datetime before any time-series work.

In [57]:
covid_death.head()

Unnamed: 0,Data as of,Start Week,End Week,State,Condition Group,Condition,ICD10_codes,Age Group,Number of COVID-19 Deaths,Flag
0,08/30/2020,02/01/2020,08/29/2020,US,Respiratory diseases,Influenza and pneumonia,J09-J18,0-24,122.0,
1,08/30/2020,02/01/2020,08/29/2020,US,Respiratory diseases,Influenza and pneumonia,J09-J18,25-34,596.0,
2,08/30/2020,02/01/2020,08/29/2020,US,Respiratory diseases,Influenza and pneumonia,J09-J18,35-44,1521.0,
3,08/30/2020,02/01/2020,08/29/2020,US,Respiratory diseases,Influenza and pneumonia,J09-J18,45-54,4186.0,
4,08/30/2020,02/01/2020,08/29/2020,US,Respiratory diseases,Influenza and pneumonia,J09-J18,55-64,10014.0,


In [58]:
# Returns tuple of shape (Rows, columns)
covid_death.shape

(12260, 10)

In [59]:
covid_death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12260 entries, 0 to 12259
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Data as of                 12260 non-null  object 
 1   Start Week                 12260 non-null  object 
 2   End Week                   12260 non-null  object 
 3   State                      12260 non-null  object 
 4   Condition Group            12260 non-null  object 
 5   Condition                  12260 non-null  object 
 6   ICD10_codes                12260 non-null  object 
 7   Age Group                  12260 non-null  object 
 8   Number of COVID-19 Deaths  5354 non-null   float64
 9   Flag                       6906 non-null   object 
dtypes: float64(1), object(9)
memory usage: 957.9+ KB


This dataset contains 12260 rows and 10 columns: Data as of, Start Week, End Week, State, Condition Group, Condition, ICD10_codes, Age Group, Number of COVID-19 Deaths, and Flag.

Key observations:

Missing Values: Number of COVID-19 Deaths has 6906 missing values and Flag has 5354. The two counts sum to 12260, which suggests that each row carries either a death count or a flag.

Data Types: 9 columns are object, and Number of COVID-19 Deaths is the only float64 column. The three date columns (Data as of, Start Week, End Week) are stored as strings.

## 2. Dataset Cleaning
Data cleaning is the process of identifying incomplete, incorrect, or irrelevant parts of a dataset and then correcting or removing them.

### Check missing data

In [60]:
covid.isnull().sum()

Unnamed: 0,0
Country/Region,0
Continent,1
Population,1
TotalCases,0
NewCases,205
TotalDeaths,21
NewDeaths,206
TotalRecovered,4
NewRecovered,206
ActiveCases,4


From this we can see that NewCases, NewDeaths and NewRecovered are almost entirely null (205 or 206 of 209 values missing).
These columns carry too little information to be useful, so we drop them.

In [61]:
covid.drop(['NewCases', 'NewDeaths', 'NewRecovered'],
              axis=1, inplace=True)

These columns were dropped because over 98% of their values are null, which would skew any daily trend analysis.
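Rather than naming the columns by hand, the same rule can be expressed as a threshold on the fraction of nulls. The following is a minimal sketch under that assumption; `drop_sparse_columns` and the 0.9 default are hypothetical choices, not the notebook's own code.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_null_fraction: float = 0.9) -> pd.DataFrame:
    """Return a copy of df without columns whose null fraction exceeds the threshold."""
    null_fraction = df.isna().mean()
    sparse = null_fraction[null_fraction > max_null_fraction].index
    return df.drop(columns=sparse)


# Toy frame with one nearly empty column (hypothetical values)
toy = pd.DataFrame({
    "TotalCases": range(20),
    "NewCases": [1.0] + [None] * 19,               # 95% null, dropped
    "TotalDeaths": [None] * 2 + list(range(18)),   # 10% null, kept
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

A threshold keeps the decision explicit and reusable if the source file is refreshed and the set of sparse columns changes.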

In [62]:
# Select a random sample of 5 rows from the covid dataset
covid.sample(5)

Unnamed: 0,Country/Region,Continent,Population,TotalCases,TotalDeaths,TotalRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/1M pop,WHO Region,iso_alpha
151,Sao Tome and Principe,Africa,219544.0,878,15.0,797.0,66.0,,3999.0,68.0,3079.0,14025.0,Africa,STP
16,Turkey,Asia,84428331.0,237265,5798.0,220546.0,10921.0,580.0,2810.0,69.0,5081802.0,60191.0,Europe,TUR
202,Saint Kitts and Nevis,North America,53237.0,17,,16.0,1.0,,319.0,,1146.0,21526.0,Americas,KNA
192,Saint Martin,North America,38729.0,53,3.0,41.0,9.0,1.0,1368.0,77.0,1183.0,30546.0,,MAF
154,Vietnam,Asia,97425470.0,747,10.0,392.0,345.0,,8.0,0.1,482456.0,4952.0,WesternPacific,VNM


In [63]:
covid_group.isnull().sum()

Unnamed: 0,0
Date,0
Country/Region,0
Confirmed,0
Deaths,0
Recovered,0
Active,0
New cases,0
New deaths,0
New recovered,0
WHO Region,0


From this we can gather that there are no null values in this dataset

In [64]:
covid_death.isnull().sum()

Unnamed: 0,0
Data as of,0
Start Week,0
End Week,0
State,0
Condition Group,0
Condition,0
ICD10_codes,0
Age Group,0
Number of COVID-19 Deaths,6906
Flag,5354


From this we can gather that there are null values in Number of COVID-19 Deaths and Flag.

In [65]:
# fillna returns a new Series, so assign the result back to the column
covid_death["Number of COVID-19 Deaths"] = covid_death["Number of COVID-19 Deaths"].fillna(0)
covid_death["Flag"] = covid_death["Flag"].fillna(0)

The missing values are filled with 0. Note that rows flagged "Counts less than 10 suppressed." hide a small count rather than a confirmed zero, so filling with 0 may slightly understate deaths in those rows. Let's check again:

In [66]:
covid_death.isnull().sum()

Unnamed: 0,0
Data as of,0
Start Week,0
End Week,0
State,0
Condition Group,0
Condition,0
ICD10_codes,0
Age Group,0
Number of COVID-19 Deaths,0
Flag,0


Now there are no missing values
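A point that is easy to miss when filling values: `fillna` returns a new Series and leaves the original untouched unless the result is assigned back. A minimal sketch on a toy frame (the column name `deaths` is illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"deaths": [5.0, None, 12.0]})

# Calling fillna without assignment leaves the frame unchanged
toy["deaths"].fillna(0)
nulls_before = int(toy["deaths"].isna().sum())

# Assigning the result back stores the filled values
toy["deaths"] = toy["deaths"].fillna(0)
nulls_after = int(toy["deaths"].isna().sum())

print(nulls_before, nulls_after)
```

Re-running `isnull().sum()` after the fill, as the notebook does, is a quick way to confirm the assignment took effect.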

## 3. Key Insights and Visualizations

In [67]:
px.bar(covid.head(15), x = 'Country/Region',
       y = 'TotalCases',color = 'TotalCases',
       height = 500,hover_data = ['Country/Region', 'Continent'])

This bar chart shows the first 15 rows of the dataset, which correspond to the 15 countries with the most Covid cases because the file appears to be ordered by TotalCases in descending order. The chart reveals a massive disparity in case volumes, with a few countries bearing a disproportionate share of the global burden. The USA, Brazil, and India are the most heavily affected nations in terms of total volume.
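Using `head(15)` for a "top 15" view relies on the file already being sorted. If that ordering cannot be guaranteed, an explicit selection with `nlargest` is a safer option. A small sketch on a toy frame (the values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "TotalCases": [50, 900, 10, 300],
})

# nlargest orders by the given column and keeps the top n rows,
# regardless of the original row order
top2 = toy.nlargest(2, "TotalCases")
print(top2["Country/Region"].tolist())
```

In the notebook this would amount to passing `covid.nlargest(15, 'TotalCases')` to `px.bar` instead of `covid.head(15)`.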

In [68]:
px.bar(covid.head(15), x = 'Country/Region', y = 'TotalDeaths',
       color = 'TotalDeaths', height = 500,
       hover_data = ['Country/Region', 'Continent'])

Disparity Between Cases and Mortality

Comparing the two charts shows that a high case count does not translate proportionally into a high death toll.

Varied Mortality Ratios: The USA leads in both categories, but countries such as the UK and Mexico have taller "Total Deaths" bars relative to their "Total Cases" bars than nations such as Russia or Saudi Arabia.

In [69]:
covid_group['Date'] = pd.to_datetime(covid_group['Date'])
global_trend = covid_group.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()

fig = px.line(global_trend, x='Date', y=['Confirmed', 'Deaths', 'Recovered'],
              title='Global COVID-19 Trends Over Time')
fig.show()

The chart shows rapid, roughly exponential growth in confirmed cases beginning in late March 2020. The "Recovered" line rises along with "Confirmed", though the gap between them widens toward the end of the timeline, which means active cases were still accumulating even as recoveries grew. The "Deaths" line looks flat mainly because it shares an axis with the far larger confirmed counts; whether the global Case Fatality Rate (CFR) actually declined is better judged by plotting Deaths divided by Confirmed directly.
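To examine the fatality trend without the distortion of a shared axis, the CFR can be computed per date as deaths divided by confirmed cases. The sketch below assumes a frame shaped like `global_trend` (columns `Date`, `Confirmed`, `Deaths`); the helper name `add_cfr` and the toy numbers are hypothetical.

```python
import pandas as pd


def add_cfr(trend: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of trend with a 'CFR (%)' column (deaths / confirmed * 100)."""
    out = trend.copy()
    # Mask days with zero confirmed cases to avoid dividing by zero
    confirmed = out["Confirmed"].where(out["Confirmed"] > 0)
    out["CFR (%)"] = out["Deaths"] / confirmed * 100
    return out


# Toy time series standing in for global_trend (hypothetical values)
toy_trend = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-03-01", "2020-06-01"]),
    "Confirmed": [0, 1000, 20000],
    "Deaths": [0, 50, 600],
})

cfr = add_cfr(toy_trend)
print(cfr[["Date", "CFR (%)"]])
```

The resulting column could then be plotted with `px.line(cfr, x='Date', y='CFR (%)')` to see the trend on its own scale.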

In [70]:
# Scatter plot for Testing Efficiency
fig = px.scatter(covid.head(50),
                 x='TotalTests',
                 y='TotalCases',
                 size='Population',
                 color='Continent',
                 hover_name='Country/Region',
                 log_x=True, log_y=True,
                 title='Testing Volume vs. Total Cases (Top 50 Countries)',
                 labels={'TotalTests': 'Total Tests (Log Scale)', 'TotalCases': 'Total Cases (Log Scale)'})

fig.update_layout(template="plotly_dark")
fig.show()

There is a strong positive correlation between testing volume and total cases. Countries that ran the most tests, such as the USA and Russia, also rank among those with the most identified cases. However, how high a bubble sits relative to its horizontal position reflects the country's positivity rate (cases per test). Countries high on the Y-axis with low X-axis values were likely under-testing, capturing only the most severe cases rather than the full scope of the outbreak.
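The positivity rate described above can be made explicit as total cases divided by total tests. This is only a rough proxy, since `TotalTests` counts tests rather than people tested. The helper name `add_positivity` and the toy values are assumptions for illustration.

```python
import pandas as pd


def add_positivity(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'Positivity (%)' = TotalCases / TotalTests * 100."""
    out = df.copy()
    # Missing or zero test counts yield NaN rather than an error or infinity
    tests = out["TotalTests"].where(out["TotalTests"] > 0)
    out["Positivity (%)"] = out["TotalCases"] / tests * 100
    return out


# Toy frame with the same column names as the country-level dataset
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "TotalCases": [100, 500, 50],
    "TotalTests": [10000.0, 2500.0, None],
})

ranked = add_positivity(toy).sort_values("Positivity (%)", ascending=False)
print(ranked[["Country/Region", "Positivity (%)"]])
```

Sorting by this column surfaces the countries most likely to be under-testing; rows without test data fall to the end.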

In [71]:
# Filtering for a specific condition group to make the chart readable
respiratory_deaths = covid_death[(covid_death['Condition Group'] == 'Respiratory diseases') &
                                 (covid_death['State'] == 'US') &
                                 (covid_death['Age Group'] != 'All Ages')]

fig = px.bar(respiratory_deaths,
             x='Age Group',
             y='Number of COVID-19 Deaths',
             color='Condition',
             title='US Respiratory Deaths by Age Group',
             barmode='group')

fig.show()

The chart shows mortality rising steeply with age, consistent with age being a strong risk factor. Deaths involving "Influenza and pneumonia" and "Respiratory failure" increase sharply in the 65-74, 75-84, and 85+ age groups. Conversely, the 0-24 and 25-34 age groups show minimal mortality, indicating that the respiratory impact of COVID-19 was disproportionately lethal to the elderly population.

In [76]:
# Data Cleaning: Handle missing values in 'Continent'
# (label them 'Unknown' so the affected row is kept in the groupby)
covid['Continent'] = covid['Continent'].fillna('Unknown')

continent_df = covid.groupby('Continent').agg({
    'TotalCases': 'sum',
    'TotalDeaths': 'sum',
    'TotalRecovered': 'sum',
    'ActiveCases': 'sum',
    'TotalTests': 'sum',
    'Population': 'sum'
}).reset_index()

# Feature Engineering: Calculate derived metrics
# Adding Mortality Rate and Cases per Million per Continent
continent_df['Mortality Rate (%)'] = (continent_df['TotalDeaths'] / continent_df['TotalCases']) * 100
continent_df['Cases per Million'] = (continent_df['TotalCases'] / continent_df['Population']) * 1_000_000

# Sort by Total Cases for better visualization
continent_df = continent_df.sort_values(by='TotalCases', ascending=False)
fig1 = px.bar(continent_df,
             x='Continent',
             y='TotalCases',
             text_auto='.2s',
             title='Total COVID-19 Cases by Continent',
             color='TotalCases',
             template='plotly_dark',
             color_continuous_scale='Reds')
fig1.show()

From this graph we can see that North America and Asia had the highest numbers of Covid-19 cases.

In [77]:
# Mortality Rate vs. Testing Capacity (Bubble Chart)
# This helps see if higher testing correlates with lower reported mortality
fig2 = px.scatter(continent_df,
                 x='TotalTests',
                 y='Mortality Rate (%)',
                 size='Population',
                 color='Continent',
                 hover_name='Continent',
                 log_x=True,
                 title='Mortality Rate vs. Total Tests (Bubble size = Population)',
                 template='plotly_white')
fig2.show()

The Mortality Rate vs. Total Tests scatter plot suggests that continents with lower testing capacity often show higher "apparent" mortality rates. Without widespread testing, only the most severe cases are recorded, which artificially inflates the Case Fatality Rate (CFR). With only a handful of continent-level points, this pattern is suggestive rather than conclusive.
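One way to check this relationship with more data points is a rank correlation at country level between tests per million and CFR. The sketch below computes a Spearman correlation as the Pearson correlation of ranks; the helper name `spearman` and the toy values are hypothetical, and a `CFR (%)` column is assumed to exist.

```python
import pandas as pd


def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation: Pearson correlation of ranks, using rows where both values exist."""
    pair = pd.concat([x, y], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())


# Toy country-level frame (hypothetical values)
toy = pd.DataFrame({
    "Tests/1M pop": [1000.0, 5000.0, 20000.0, 80000.0, None],
    "CFR (%)": [6.0, 4.5, 3.0, 1.5, 2.0],
})

rho = spearman(toy["Tests/1M pop"], toy["CFR (%)"])
print(round(rho, 2))
```

A value near -1 would support the claim that more testing goes with lower apparent CFR; a value near 0 would not. Rank correlation is used here because both quantities are heavily skewed.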

In [78]:
# Proportion of Global Deaths (Pie Chart)
fig3 = px.pie(continent_df,
             values='TotalDeaths',
             names='Continent',
             title='Distribution of Global COVID-19 Deaths',
             hole=0.4,
             color_discrete_sequence=px.colors.sequential.RdBu)
fig3.show()

The Proportion of Global Deaths pie chart shows that although Asia ranks among the continents with the most cases, the Americas and Europe account for a disproportionately large share of deaths. One plausible explanation is the older populations and higher comorbidity rates in many Western nations compared with the younger demographics in Africa and parts of Oceania.

In [72]:
# World Map of Active Cases
fig = px.choropleth(covid,
                    locations="iso_alpha",
                    color="ActiveCases",
                    hover_name="Country/Region",
                    color_continuous_scale=px.colors.sequential.YlOrRd,
                    title="Global Map of Active COVID-19 Cases")

fig.update_layout(geo=dict(showframe=False, showcoastlines=False))
fig.show()

The map highlights geographical hotspots, with North and South America (specifically the USA and Brazil) showing the highest intensity. This suggests that by the time of this snapshot the epicenter of the pandemic had shifted from Asia and Europe to the Western Hemisphere. The lighter colors across parts of Africa may reflect either lower prevalence or, more likely, lower rates of testing and reporting.

In [73]:
# Calculate Case Fatality Rate (CFR)
covid['CFR (%)'] = (covid['TotalDeaths'] / covid['TotalCases']) * 100

fig = px.box(covid,
             x='Continent',
             y='CFR (%)',
             points="all",
             color='Continent',
             title='Distribution of Case Fatality Rate (%) by Continent')

fig.show()

Europe and the Americas show a wider spread of CFR, possibly because they include both highly developed and developing healthcare systems. Africa shows a surprisingly tight distribution with a relatively low median CFR, which may reflect a younger median population age; the US age-group analysis showed that younger people are far less likely to suffer fatal respiratory outcomes.
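The visual impression from the box plot can be backed by numbers: the median and interquartile range of CFR per continent. A minimal sketch on toy data follows; the values are hypothetical, and the column names match those used in the notebook.

```python
import pandas as pd

# Toy country-level frame (hypothetical values)
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Europe", "Europe", "Europe"],
    "CFR (%)": [1.0, 3.0, 2.0, 4.0, 12.0],
})

grouped = toy.groupby("Continent")["CFR (%)"]
stats = pd.DataFrame({
    "median": grouped.median(),
    # Interquartile range: spread of the middle half of the values
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    "n": grouped.count(),
})
print(stats)
```

Including the count `n` alongside the spread helps flag continents where a tight distribution is simply the result of few reporting countries.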

## Final Conclusion: Synthesis of Global Pandemic Trends
The analysis of the COVID-19 datasets reveals a complex global landscape characterized by varying levels of healthcare resilience and regional vulnerabilities. By synthesizing the three distinct datasets (Country-level, Time-series, and US Mortality), several critical conclusions emerge:

1. The Decoupling of Cases and Mortality: The time-series data shows confirmed cases climbing steeply while the "Deaths" curve rises far more slowly. This is consistent with broader testing and improved clinical management over time. The datasets used here appear to date from mid-2020 (the US mortality file is as of 08/30/2020), before vaccination campaigns and later variant waves such as Omicron, so they cannot speak to those effects.

2. Regional Resilience vs. Vulnerability: The Case Fatality Rate (CFR) analysis highlights significant disparities. Regions with higher "TotalTests" generally showed lower CFRs, not necessarily because fewer people died, but because robust testing captured the true denominator of mild and asymptomatic cases.

3. Comorbidity as a Primary Driver: The US mortality data indicates that COVID-19 rarely acted in isolation. Respiratory diseases and circulatory issues were among the "Condition Groups" most often linked to fatal outcomes, supporting the view of the pandemic as a "syndemic": a set of linked health problems interacting with a social environment.

## Actionable Public Health Recommendations
Based on the insights derived from this EDA, the following strategies are recommended for future public health responses:

* Targeted Protection for High-Risk Demographics: Since mortality is heavily concentrated in specific age groups and those with pre-existing conditions, future lockdowns or social distancing measures should be "precision-targeted" to protect these cohorts rather than being universally applied, minimizing economic disruption.

* Infrastructure Investment in Testing: The correlation between high testing rates and better outcomes suggests that rapid-response diagnostic infrastructure is just as important as hospital bed capacity.

* Standardization of Global Data: The missing values in "NewCases" and "NewDeaths" across different regions during the peak of the pandemic point to the need for a unified, real-time global reporting standard to allow for better international coordination.