The primary goal of this mini-project is to clean, analyze, and visualize COVID-19 data to identify trends, patterns, and key insights. The specific objectives are:

1. Perform Data Cleaning: Handle missing values, standardize date formats, and filter out inconsistencies.

2. Explore Trends in COVID-19 Cases and Deaths:
   - Analyze daily and cumulative trends in infections and fatalities.
   - Compare case and death rates across different countries and regions.

3. Create Data Visualizations:
   - Line Plots: Show the trend of cases, deaths, and vaccinations over time.
   - Bar Charts: Compare cases, deaths, and vaccinations by country.
   - Scatter Plots: Explore relationships between infection rates and testing or vaccination rates.
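
As a rough illustration of the trend objectives, the sketch below derives cumulative counts from daily counts and ranks countries by total cases. The frame `toy`, its simplified column names (`date`, `country`, `new_cases`), and its numbers are made up for the example and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical toy data: three days of daily case counts for two countries.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-11-01", "2024-11-02", "2024-11-03"] * 2),
    "country": ["Brazil"] * 3 + ["Canada"] * 3,
    "new_cases": [971, 500, 514, 176, 300, 1133],
})

# Cumulative trend: running sum of daily counts within each country, in date order.
toy = toy.sort_values(["country", "date"])
toy["cumulative_cases"] = toy.groupby("country")["new_cases"].cumsum()

# Cross-country comparison: total cases per country, largest first.
totals = toy.groupby("country")["new_cases"].sum().sort_values(ascending=False)

print(toy)
print(totals)
```

The same `groupby(...).cumsum()` pattern would apply to deaths, and `totals` is the kind of series a bar chart by country could be drawn from.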

In [3]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# Load the dataset

DT = pd.read_csv('my data/Dataset2.csv')

In [7]:
DT

Unnamed: 0,DATE,country,NEW Cases,NEW_DEATHS,vaccinated
0,2024-11-01,Argentina,,,unknown
1,2024-11-01,Australia,0.0,0.0,327
2,2024-11-01,Australia,0.0,0.0,327
3,2024-11-01,Brazil,971.0,48.0,430
4,2024-11-01,Canada,176.0,8.0,unknown
...,...,...,...,...,...
160,2024-11-30,Brazil,514.0,25.0,unknown
161,2024-11-30,Canada,1133.0,56.0,438
162,2024-11-30,Canada,1133.0,56.0,438
163,2024-11-30,China,0.0,0.0,unknown
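
The display above contains rows that look identical (for example rows 1 and 2, and rows 161 and 162). One possible way to detect and drop exact duplicates is sketched below on a small frame named `sample`, a hypothetical stand-in that copies a few of the displayed values. Whether these rows are true duplicates in the full dataset is an assumption that should be checked before anything is dropped.

```python
import pandas as pd

# Small stand-in frame; values copied from the rows displayed above.
sample = pd.DataFrame({
    "DATE": ["2024-11-01", "2024-11-01", "2024-11-01", "2024-11-01"],
    "country": ["Australia", "Australia", "Brazil", "Canada"],
    "NEW_DEATHS": [0.0, 0.0, 48.0, 8.0],
    "vaccinated": ["327", "327", "430", "unknown"],
})

# duplicated() marks every row that exactly repeats an earlier row.
n_dupes = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```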


In [9]:
DT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATE         165 non-null    object 
 1   country      165 non-null    object 
 2    NEW Cases   154 non-null    float64
 3   NEW_DEATHS   139 non-null    float64
 4   vaccinated   165 non-null    object 
dtypes: float64(2), object(3)
memory usage: 6.6+ KB
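
The `info()` output shows inconsistent column names (including a leading space in ` NEW Cases`) and a `DATE` column stored as `object` rather than as a datetime. A minimal sketch of one way to standardize both follows. The frame `raw` is a two-row stand-in for the real data, and the lowercase, underscore-separated naming scheme is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Two-row stand-in using the column names reported by DT.info().
raw = pd.DataFrame({
    "DATE": ["2024-11-01", "2024-11-01"],
    "country": ["Argentina", "Australia"],
    " NEW Cases": [np.nan, 0.0],
    "NEW_DEATHS": [np.nan, 0.0],
    "vaccinated": ["unknown", "327"],
})

clean = raw.copy()

# Normalize names: trim whitespace, lowercase, replace inner spaces with underscores.
clean.columns = clean.columns.str.strip().str.lower().str.replace(" ", "_")

# Parse the ISO-formatted date strings into a datetime column.
clean["date"] = pd.to_datetime(clean["date"], format="%Y-%m-%d")

print(clean.dtypes)
```

With consistent names, later cells can refer to `new_cases` without having to remember the leading space.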


In [11]:
DT.head()

Unnamed: 0,DATE,country,NEW Cases,NEW_DEATHS,vaccinated
0,2024-11-01,Argentina,,,unknown
1,2024-11-01,Australia,0.0,0.0,327
2,2024-11-01,Australia,0.0,0.0,327
3,2024-11-01,Brazil,971.0,48.0,430
4,2024-11-01,Canada,176.0,8.0,unknown


In [13]:
DT.describe()

Unnamed: 0,NEW Cases,NEW_DEATHS
count,154.0,139.0
mean,579.032468,28.47482
std,406.07835,20.640586
min,0.0,0.0
25%,271.75,13.5
50%,589.0,29.0
75%,801.25,39.5
max,1730.0,86.0


In [15]:
DT.isnull()

Unnamed: 0,DATE,country,NEW Cases,NEW_DEATHS,vaccinated
0,False,False,True,True,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
160,False,False,False,False,False
161,False,False,False,False,False
162,False,False,False,False,False
163,False,False,False,False,False


In [17]:
# Count the missing values in each column
DT.isnull().sum()

DATE            0
country         0
 NEW Cases     11
NEW_DEATHS     26
vaccinated      0
dtype: int64
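
Raw counts are easier to judge as a share of all rows: with 165 rows, 11 missing values is about 6.7% and 26 is about 15.8%. The sketch below shows how such percentages could be computed directly, using a small hypothetical frame `demo` in place of `DT`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two missing death counts.
demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "Brazil", "Canada"],
    "NEW_DEATHS": [np.nan, 0.0, 48.0, np.nan],
})

# isnull() yields booleans; their column mean is the fraction missing.
missing_pct = demo.isnull().mean() * 100

print(missing_pct.round(1))
```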

In [19]:
# Identify columns with at least one missing value
DT.isnull().any(axis=0)

DATE           False
country        False
 NEW Cases      True
NEW_DEATHS      True
vaccinated     False
dtype: bool
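
Note that `vaccinated` is reported as having no missing values even though the displayed rows contain the string `unknown`: pandas only treats `NaN`/`None` as missing. One option, sketched here under the assumption that `unknown` is the only placeholder used in the file, is to declare it as a missing-value marker when reading. The inline CSV text is a stand-in for the real file.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file; values copied from the displayed rows.
csv_text = """DATE,country,vaccinated
2024-11-01,Argentina,unknown
2024-11-01,Australia,327
2024-11-01,Brazil,430
2024-11-01,Canada,unknown
"""

# Default read: 'unknown' stays a string, so nothing is flagged as missing.
as_text = pd.read_csv(io.StringIO(csv_text))

# Declaring the placeholder turns it into NaN and makes the column numeric.
as_missing = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])

print(as_text["vaccinated"].isnull().sum())
print(as_missing["vaccinated"].isnull().sum())
```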

In [23]:
# Identify columns in which every value is missing
DT.isnull().all(axis=0)

DATE           False
country        False
 NEW Cases     False
NEW_DEATHS     False
vaccinated     False
dtype: bool

In [25]:
# Number of columns in which every value is missing
DT.isnull().all(axis=0).sum()

0

In [27]:
# Rows with at least one missing value
DT.isnull().any(axis=1)

0       True
1      False
2      False
3      False
4      False
       ...  
160    False
161    False
162    False
163    False
164    False
Length: 165, dtype: bool
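
The boolean Series above can also serve as a mask to pull out the incomplete rows themselves, which is often more informative than the True/False listing. A minimal sketch, using a three-row stand-in frame `demo` with values copied from the first displayed rows:

```python
import numpy as np
import pandas as pd

# Three-row stand-in; only the first row has missing values.
demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "Brazil"],
    "NEW_DEATHS": [np.nan, 0.0, 48.0],
    "vaccinated": ["unknown", "327", "430"],
})

# True for each row that has at least one missing value.
mask = demo.isnull().any(axis=1)

# Boolean indexing keeps only the flagged rows.
incomplete_rows = demo[mask]
n_incomplete = int(mask.sum())

print(incomplete_rows)
print(n_incomplete)
```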

In [29]:
# Rows in which every value is missing
DT.isnull().all(axis=1)

0      False
1      False
2      False
3      False
4      False
       ...  
160    False
161    False
162    False
163    False
164    False
Length: 165, dtype: bool

In [31]:
# Number of rows in which every value is missing
DT.isnull().all(axis=1).sum()

0
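
Since no row or column is entirely empty, the remaining gaps are partial and have to be either dropped or filled. The sketch below contrasts two common options on a hypothetical frame `demo` with made-up numbers and simplified column names. Filling with the per-country median is only one possible choice and is not necessarily the approach this project takes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the daily case counts.
demo = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil", "Canada", "Canada"],
    "new_cases": [971.0, np.nan, 514.0, 176.0, np.nan],
})

# Option 1: drop rows where the case count is missing.
dropped = demo.dropna(subset=["new_cases"])

# Option 2: fill each gap with the median of that country's known values.
filled = demo.copy()
country_median = filled.groupby("country")["new_cases"].transform("median")
filled["new_cases"] = filled["new_cases"].fillna(country_median)

print(dropped)
print(filled)
```

Dropping keeps only observed values but shortens the time series, while filling keeps every date at the cost of introducing estimated numbers.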