In [8]:
# COVID-19 Exploratory Data Analysis (EDA)

## 1. Introduction

# This project performs an Exploratory Data Analysis (EDA) of a publicly available COVID-19 dataset that tracks confirmed cases and deaths by county across the United States. Each row gives the case and death counts reported for one county on one date (the counts appear to be running totals rather than daily increments), together with the state name and the county FIPS code.

# The main goals of this analysis are to:

# - Understand the structure and contents of the dataset
# - Clean the data and handle any missing or inconsistent values
# - Summarize key statistics (e.g., total cases, deaths, trends over time)
# - Identify which counties or states were most affected
# - Visualize trends and patterns over time and across regions
# - Explore relationships between different features (e.g., cases vs. deaths)

# The analysis uses **Pandas** for data manipulation, **Matplotlib** and **Seaborn** for visualization, and basic descriptive statistics to draw insights from the data.

In [9]:
## 2. Load and Inspect the Data

# In this section, we will load the COVID-19 dataset using Pandas and perform an initial inspection to understand the structure, columns, and types of data we are working with.

In [10]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("us-counties.csv")

# Display the first 5 rows of the dataset
df.head()


         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1     0.0
1  2020-01-22  Snohomish  Washington  53061.0      1     0.0
2  2020-01-23  Snohomish  Washington  53061.0      1     0.0
3  2020-01-24       Cook    Illinois  17031.0      1     0.0
4  2020-01-24  Snohomish  Washington  53061.0      1     0.0


In [11]:
# Check the shape of the dataset (rows, columns)
print("Dataset shape:", df.shape)

Dataset shape: (2502832, 6)


In [12]:
# List all column names
print("Columns in dataset:")
print(df.columns.tolist())

Columns in dataset:
['date', 'county', 'state', 'fips', 'cases', 'deaths']


In [13]:
# Summarize column data types and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   date    object 
 1   county  object 
 2   state   object 
 3   fips    float64
 4   cases   int64  
 5   deaths  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 114.6+ MB
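
The `df.info()` output shows `date` stored as `object` (plain strings) and `fips` as `float64`. A minimal sketch of one way to tidy these types during cleaning follows. It runs on a tiny stand-in frame named `sample` rather than the full `df`, and the Autauga and Unknown rows are illustrative values added for the example, not rows taken from the output above.

```python
import pandas as pd

# Tiny stand-in for df. The first row mirrors the sample output above;
# the other two rows are illustrative (a 4-digit FIPS and a missing FIPS).
sample = pd.DataFrame({
    "date": ["2020-01-21", "2020-03-24", "2020-03-25"],
    "county": ["Snohomish", "Autauga", "Unknown"],
    "state": ["Washington", "Alabama", "Rhode Island"],
    "fips": [53061.0, 1001.0, None],
    "cases": [1, 1, 2],
    "deaths": [0.0, 0.0, 0.0],
})

# Parse the date strings into datetime values so they can be sorted,
# resampled, and plotted on a time axis.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# FIPS codes are identifiers, not quantities: convert float -> nullable
# integer -> string, then left-pad to 5 characters. Missing codes stay <NA>.
sample["fips"] = (
    sample["fips"]
    .astype("Int64")
    .astype("string")
    .str.zfill(5)
)

print(sample.dtypes)
print(sample)
```

Keeping `fips` as a string preserves leading zeros (for example `01001`), which a numeric column drops.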


In [14]:
# Summary statistics for the numeric columns
df.describe()

            fips       cases      deaths
count  2479154.0   2502832.0   2445227.0
mean    31399.58     10033.8      161.61
std     16342.51    47525.22    820.3335
min       1001.0         0.0         0.0
25%      19023.0       382.0         6.0
50%      30011.0      1773.0        33.0
75%      46111.0      5884.0       101.0
max      78030.0   2908425.0     40267.0
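
A maximum of 2,908,425 cases in a single row suggests that `cases` and `deaths` are running totals per county rather than daily increments. Under that assumption, daily new cases could be derived roughly as sketched here. The frame `sample`, its county and state names, and the `new_cases` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical cumulative counts for two made-up counties.
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-03-01", "2020-03-02", "2020-03-03",
        "2020-03-01", "2020-03-02",
    ]),
    "county": ["County A", "County A", "County A", "County B", "County B"],
    "state": ["State X"] * 5,
    "cases": [1, 3, 6, 10, 10],
})

# Order rows so that each county's dates are consecutive.
sample = sample.sort_values(["state", "county", "date"])

# Difference the cumulative totals within each county. The first row of
# each group has no previous day, so it keeps its cumulative value.
sample["new_cases"] = (
    sample.groupby(["state", "county"])["cases"]
    .diff()
    .fillna(sample["cases"])
    .astype("int64")
)

print(sample)
```

The grouping uses `state` and `county` rather than `fips` because `fips` is not populated for every row.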


In [15]:
# Count missing values in each column
df.isnull().sum()

date          0
county        0
state         0
fips      23678
cases         0
deaths    57605
dtype: int64
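
Against 2,502,832 rows, these counts mean `fips` is missing in about 0.9% of rows and `deaths` in about 2.3%. The missing values also explain why `deaths` is stored as `float64` rather than `int64`. Before deciding how to handle the gaps, it may help to see where they occur. The sketch below shows one way to do that on a small stand-in frame; the rows in `sample` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: two with a missing FIPS code, one with missing deaths.
sample = pd.DataFrame({
    "county": ["Snohomish", "Unknown", "Unknown", "Cook"],
    "state": ["Washington", "Rhode Island", "Rhode Island", "Illinois"],
    "fips": [53061.0, None, None, 17031.0],
    "deaths": [0.0, None, 0.0, 0.0],
})

# Percentage of missing values per column.
missing_pct = sample.isnull().mean().mul(100).round(2)
print(missing_pct)

# Which state/county labels account for the rows without a FIPS code?
missing_fips_by_county = (
    sample[sample["fips"].isnull()]
    .groupby(["state", "county"])
    .size()
    .sort_values(ascending=False)
)
print(missing_fips_by_county)
```

On the full data, the same two steps would run on `df` in place of `sample`.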