In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#Load the data
data_url = 'https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv?raw=true'
df = pd.read_csv(data_url)

#Explore the data
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())

#Clean the data
# OWID labels the United States as 'United States' in the 'location' column
countries_of_interest = ['Kenya', 'United States', 'India']
df_filtered = df[df['location'].isin(countries_of_interest)].copy()
df_filtered['date'] = pd.to_datetime(df_filtered['date'])
numerical_cols = ['total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_vaccinations']
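# Interpolate within each country so gaps are not filled from another country's rows
for col in numerical_cols:
    df_filtered[col] = df_filtered.groupby('location')[col].transform(lambda s: s.interpolate())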
df_filtered.dropna(subset=['date'], inplace=True)
print(df_filtered.head())
print(df_filtered.info())

#Exploratory Data Analysis (EDA)
plt.figure(figsize=(12, 6))
sns.set_style('darkgrid')
for country in countries_of_interest:
    country_data = df_filtered[df_filtered['location'] == country]
    plt.plot(country_data['date'], country_data['total_cases'], label=country, linewidth=2)
plt.title('Total COVID-19 Cases Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Total Cases', fontsize=12)
plt.legend(fontsize=10)
plt.xticks(rotation=45)
plt.show()

# Repeat similar plotting code for total deaths, new cases, and death rate

df_filtered['death_rate'] = df_filtered['total_deaths'] / df_filtered['total_cases']
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = df_filtered[df_filtered['location'] == country]
    plt.plot(country_data['date'], country_data['death_rate'], label=country, linewidth=2)
plt.title('COVID-19 Death Rate (Total Deaths / Total Cases)', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Death Rate', fontsize=12)
plt.legend(fontsize=10)
plt.xticks(rotation=45)
plt.show()

#Visualize Vaccination Progress
plt.figure(figsize=(12, 6))
for country in countries_of_interest:
    country_data = df_filtered[df_filtered['location'] == country]
    plt.plot(country_data['date'], country_data['total_vaccinations'], label=country, linewidth=2)
plt.title('Total COVID-19 Vaccinations Over Time', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Total Vaccinations', fontsize=12)
plt.legend(fontsize=10)
plt.xticks(rotation=45)
plt.show()

#Choropleth Map 
latest_data = df_filtered.groupby('location').last().reset_index()
fig = px.choropleth(latest_data,
                    locations='location',
                    locationmode='country names',
                    color='total_cases',
                    hover_name='location',
                    title='Total COVID-19 Cases by Country (Latest Data)',
                    color_continuous_scale='Reds')
fig.show()


ModuleNotFoundError: No module named 'pandas'

## Import Libraries

This code imports the necessary Python libraries for data analysis and visualization:

pandas (pd): For data manipulation and working with DataFrames.

matplotlib.pyplot (plt): For creating basic plots and charts.

seaborn (sns): For enhanced and more visually appealing plots.

plotly.express (px): For creating interactive plots, including the choropleth map.

## Load the Data

This code downloads the COVID-19 dataset from the Our World in Data GitHub repository and loads it into a pandas DataFrame:

data_url stores the URL of the CSV file.

pd.read_csv(data_url) reads the CSV file from the URL into a DataFrame, which is then assigned to the variable df.
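
A minimal sketch of narrowing the load, assuming only a few columns are needed. `usecols` and `parse_dates` are standard `pd.read_csv` parameters. The in-memory CSV is a made-up stand-in for the real file (illustrative values only) so the snippet runs without a network connection.

```python
import io
import pandas as pd

# Made-up sample using column names from the OWID file; the values are illustrative only.
sample_csv = io.StringIO(
    "iso_code,location,date,total_cases,new_cases\n"
    "KEN,Kenya,2020-03-13,1,1\n"
    "KEN,Kenya,2020-03-14,1,0\n"
    "IND,India,2020-01-30,1,1\n"
)

# Read only the columns needed and parse 'date' while loading.
wanted = ["location", "date", "total_cases", "new_cases"]
sample_df = pd.read_csv(sample_csv, usecols=wanted, parse_dates=["date"])

print(sample_df.dtypes)
```

With the real file, the same two parameters can be passed alongside data_url; the separate pd.to_datetime step then becomes optional.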

## Explore the Data

This code provides a preliminary look at the structure and content of the dataset:

df.head(): Displays the first 5 rows of the DataFrame, showing the columns and some sample data.

df.info(): Prints a summary of the DataFrame, including the column names, data types, and the number of non-null values.

df.describe(): Generates descriptive statistics for the numerical columns in the DataFrame, such as mean, standard deviation, and quartiles.

df.isnull().sum(): Calculates the number of missing values in each column, which helps in identifying data cleaning needs.
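
Raw missing-value counts are hard to compare across columns, so a share per column can be easier to read. The sketch below uses a made-up frame (named `demo` here purely for illustration) in place of df.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df; the values are illustrative only.
demo = pd.DataFrame({
    "total_cases": [1.0, 2.0, np.nan, 4.0],
    "total_vaccinations": [np.nan, np.nan, np.nan, 10.0],
})

# Fraction of missing values per column, largest first.
missing_share = demo.isnull().mean().sort_values(ascending=False)
print(missing_share)
```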



## Clean the Data

This code performs several data cleaning and preprocessing steps:

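countries_of_interest: Defines the list of countries for focused analysis (Kenya, the United States, and India). The names must match the dataset's 'location' column exactly; OWID labels the United States as 'United States', so a value such as 'USA' matches no rows.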

df_filtered: Filters the DataFrame to include only the specified countries. The .copy() method creates an independent DataFrame, so the column assignments that follow do not trigger pandas' SettingWithCopyWarning or silently modify a view of the original.

df_filtered['date']: Converts the 'date' column to datetime objects, enabling time-based operations.

numerical_cols: Specifies the numerical columns to be processed.

The loop iterates through the numerical columns and fills missing values using linear interpolation (the default for interpolate()), which estimates each missing value from the nearest known values before and after it.

df_filtered.dropna(subset=['date'], inplace=True): Removes rows where the 'date' column has missing values.

The final prints display the first few rows and the structure of the cleaned data.
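
Because the filtered frame stacks several countries, interpolation needs to stay within each country; otherwise a gap at a boundary is filled from the neighbouring country's rows. A minimal sketch on made-up data, comparing a column-wide `interpolate()` with a per-country `groupby(...).transform(...)`:

```python
import numpy as np
import pandas as pd

# Made-up data: Kenya's first value is missing and sits right after India's rows.
demo = pd.DataFrame({
    "location": ["India", "India", "India", "Kenya", "Kenya", "Kenya"],
    "total_cases": [100.0, np.nan, 300.0, np.nan, 10.0, 20.0],
})

# Column-wide: Kenya's leading gap is filled using India's last value.
naive = demo["total_cases"].interpolate()

# Per country: gaps are filled only from rows of the same country.
per_country = demo.groupby("location")["total_cases"].transform(
    lambda s: s.interpolate()
)

print(pd.DataFrame({"naive": naive, "per_country": per_country}))
```

In the per-country version, Kenya's leading gap stays missing, since there is no earlier Kenyan value to interpolate from.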

## Total COVID-19 Cases Over Time

This code generates a line plot showing the trend of total COVID-19 cases over time for the selected countries:

A figure of size 12x6 is created for the plot.

sns.set_style('darkgrid') applies a dark grid style to the plot for better readability.

The code iterates through the countries_of_interest.  In each iteration, it filters the data for a specific country and plots the total_cases against the date.

The plot includes a title, x-axis label ('Date'), y-axis label ('Total Cases'), and a legend to distinguish the countries.

plt.xticks(rotation=45) rotates the x-axis labels by 45 degrees for better readability.

plt.show() displays the plot.
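
The same plotting steps recur for each metric, so they could be wrapped in a small helper. This is a sketch only: `plot_metric` is a hypothetical name, not part of the notebook, and the demo frame holds made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_metric(frame, column, countries, title, ylabel):
    """Hypothetical helper: draw one line per country for `column`, return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 6))
    for country in countries:
        country_data = frame[frame["location"] == country]
        ax.plot(country_data["date"], country_data[column], label=country, linewidth=2)
    ax.set_title(title, fontsize=16)
    ax.set_xlabel("Date", fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)
    ax.legend(fontsize=10)
    ax.tick_params(axis="x", labelrotation=45)
    return ax

# Made-up data standing in for df_filtered.
demo = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]),
    "total_cases": [10, 12, 100, 130],
})
ax = plot_metric(demo, "total_cases", ["Kenya", "India"],
                 "Total COVID-19 Cases Over Time", "Total Cases")
```

Calling plt.show() afterwards displays the figure; passing 'total_deaths' or 'new_cases' as the column would cover the other plots.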

## Total COVID-19 Deaths Over Time

This code generates a line plot showing the trend of total COVID-19 deaths over time for the selected countries. The structure of this code is very similar to the previous one, but it plots total_deaths instead of total_cases.

## Daily New COVID-19 Cases

This code generates a line plot showing the daily new COVID-19 cases for the selected countries. It plots new_cases against date for each country.

## COVID-19 Death Rate (Total Deaths / Total Cases)

This code calculates the death rate as the ratio of total deaths to total cases and plots it over time for the selected countries:

df_filtered['death_rate'] = df_filtered['total_deaths'] / df_filtered['total_cases']:  Calculates the death rate and stores it in a new column called 'death_rate'.

The subsequent code generates a line plot of the 'death_rate' over time for each country.
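
Early rows can have zero total cases, in which case the ratio is undefined. One possible way to handle this, sketched on made-up values, is to treat zero cases as missing before dividing:

```python
import numpy as np
import pandas as pd

# Made-up values; the first row has no recorded cases yet.
demo = pd.DataFrame({
    "total_cases": [0.0, 50.0, 200.0],
    "total_deaths": [0.0, 1.0, 5.0],
})

# Replace zero cases with NaN so the rate is NaN there instead of a division by zero.
cases = demo["total_cases"].replace(0, np.nan)
demo["death_rate"] = demo["total_deaths"] / cases

print(demo)
```

Matplotlib leaves gaps for NaN values, so the undefined rates simply do not appear in the line plot.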

## Total COVID-19 Vaccinations Over Time

This code generates a line plot showing the total number of COVID-19 vaccinations over time for the selected countries.

## Choropleth Map of Total Cases (Latest Data)

This code creates a choropleth map to visualize the total number of COVID-19 cases for the selected countries using the latest available data:

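latest_data = df_filtered.groupby('location').last().reset_index(): Groups the filtered data by country and takes, for each country, the last non-null value in every column. Because the rows are ordered by date, this approximates the most recent data, although values in different columns may come from different dates.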

px.choropleth(): Creates the choropleth map.

locations='location': Specifies the column containing the country names.

locationmode='country names':  Indicates that the locations are specified as country names.

color='total_cases':  Sets the color of each country based on the total_cases value.

hover_name='location':  Displays the country name when hovering over it.

title: Sets the title of the map.

color_continuous_scale='Reds':  Uses a red color scale to represent the number of cases.

fig.show():  Displays the interactive map.
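
The behaviour of groupby(...).last() is worth seeing on a small example, since it returns the last non-null value per column and not the last row. The sketch below uses made-up data and shows one possible alternative that keeps whole rows:

```python
import numpy as np
import pandas as pd

# Made-up data: Kenya's most recent row has no total_cases value.
demo = pd.DataFrame({
    "location": ["Kenya", "Kenya", "India", "India"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-02"]),
    "total_cases": [340.0, np.nan, 4400.0, 4500.0],
})

# last(): Kenya gets the 2023-01-02 date paired with the 2023-01-01 case count.
via_last = demo.groupby("location").last().reset_index()

# Alternative: keep the most recent complete row per country.
latest_rows = (
    demo.dropna(subset=["total_cases"])
        .sort_values("date")
        .groupby("location")
        .tail(1)
        .reset_index(drop=True)
)

print(via_last)
print(latest_rows)
```

For a map of cumulative totals either approach gives the same colour value; the difference matters when the date shown next to the value needs to be the date it was reported.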