# COVID-19 Data Analysis (Beginner Project)

## Day 1: Project Setup & Dataset Overview

### Objective
- Load the WHO COVID-19 dataset
- Understand the structure of the data
- Identify the columns needed for analysis


In [13]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# basic settings
plt.style.use("default")


In [14]:
df = pd.read_csv(
    r"C:\Users\RUCHI\Downloads\WHO-COVID-19-global-data.csv",
    sep=";",
    engine="python"
)

df.head()



Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,05/01/2020,AF,Afghanistan,EMRO,,0,,0
1,12/01/2020,AF,Afghanistan,EMRO,,0,,0
2,19/01/2020,AF,Afghanistan,EMRO,,0,,0
3,26/01/2020,AF,Afghanistan,EMRO,,0,,0
4,02/02/2020,AF,Afghanistan,EMRO,,0,,0


In [15]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57840 entries, 0 to 57839
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Date_reported      57840 non-null  object 
 1   Country_code       57599 non-null  object 
 2   Country            57840 non-null  object 
 3   WHO_region         53502 non-null  object 
 4   New_cases          39028 non-null  float64
 5   Cumulative_cases   57840 non-null  int64  
 6   New_deaths         25001 non-null  float64
 7   Cumulative_deaths  57840 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 3.5+ MB


In [16]:
df.describe()

Unnamed: 0,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
count,39028.0,57840.0,25001.0,57840.0
mean,19881.04,1792594.0,282.323947,20010.23
std,270782.3,7797691.0,1214.392195,81864.4
min,-65079.0,0.0,-3432.0,0.0
25%,43.0,4162.75,4.0,28.0
50%,393.0,45883.0,20.0,565.0
75%,3968.0,522087.2,105.0,6974.5
max,40475480.0,103436800.0,47687.0,1194158.0
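
The `min` row above shows negative values in `New_cases` and `New_deaths`. In a count column these are often retrospective corrections, but that reading is an assumption here, since the dataset itself does not say so. A minimal sketch for locating such rows, using a small made-up frame (`sample` is hypothetical and only mimics the WHO column names):

```python
import pandas as pd

# Hypothetical sample rows that mimic the WHO column names; the values are made up.
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "New_cases": [120.0, -15.0, None, 40.0],
    "New_deaths": [3.0, 1.0, -2.0, None],
})

# Rows where either count column is negative (NaN compares as False, so it is skipped).
negative_rows = sample[(sample["New_cases"] < 0) | (sample["New_deaths"] < 0)]
print(negative_rows)

# Number of negative entries per column.
negative_counts = (sample[["New_cases", "New_deaths"]] < 0).sum()
print(negative_counts)
```

Running the same two expressions on `df` would show how many rows are affected before deciding whether to keep, clip, or drop them.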


## Day 2: Data Cleaning & Preparation

### Objective
- Select the required columns
- Handle missing values
- Convert the date column to datetime


In [17]:
import os
os.getcwd()


'c:\\Users\\RUCHI\\OneDrive\\Documents\\COVID19-Data-Analysis'


In [19]:
print(os.listdir())


['.git', 'covid-19_analysis.ipynb', 'data', 'README.md', 'requirments .txt']


In [24]:
# Keep only the columns needed for the analysis. The .copy() makes df an
# independent DataFrame, so later column assignments do not raise
# SettingWithCopyWarning.
df = df[['Date_reported', 'Country_code', 'Country', 'Cumulative_cases', 'New_deaths']].copy()
df.head()

Unnamed: 0,Date_reported,Country_code,Country,Cumulative_cases,New_deaths
0,05/01/2020,AF,Afghanistan,0,
1,12/01/2020,AF,Afghanistan,0,
2,19/01/2020,AF,Afghanistan,0,
3,26/01/2020,AF,Afghanistan,0,
4,02/02/2020,AF,Afghanistan,0,


In [31]:
# Dates are stored as DD/MM/YYYY, so parse them day-first.
df['Date_reported'] = pd.to_datetime(df['Date_reported'], dayfirst=True)


In [32]:
df['Date_reported'].head()


0   2020-01-05
1   2020-01-12
2   2020-01-19
3   2020-01-26
4   2020-02-02
Name: Date_reported, dtype: datetime64[ns]
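
The dates in this file are written day-first, so `05/01/2020` means 5 January. The short sketch below shows why that matters. It uses an explicit `format` string as an alternative to `dayfirst=True`, and `raw` is a hypothetical stand-in for the real column:

```python
import pandas as pd

# Hypothetical stand-in for the first few Date_reported strings.
raw = pd.Series(["05/01/2020", "12/01/2020", "19/01/2020"])

# Explicit day/month/year format: a string that does not match raises an error
# instead of being silently reinterpreted.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed.dt.month.tolist())  # [1, 1, 1]

# For comparison, reading the first string month-first gives 1 May, not 5 January.
month_first = pd.to_datetime("05/01/2020", format="%m/%d/%Y")
print(month_first.month)  # 5
```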

In [33]:
df.dtypes


Date_reported       datetime64[ns]
Country_code                object
Country                     object
Cumulative_cases             int64
New_deaths                 float64
dtype: object

In [34]:
# Normalize column names: strip stray whitespace and convert to lowercase.
# The date column was already converted to datetime above, so it is not parsed again.
df.columns = df.columns.str.strip().str.lower()


In [35]:
df.columns


Index(['date_reported', 'country_code', 'country', 'cumulative_cases',
       'new_deaths'],
      dtype='object')

In [36]:
df.isnull().sum()


date_reported           0
country_code          241
country                 0
cumulative_cases        0
new_deaths          32839
dtype: int64
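
The 241 missing `country_code` values are worth a closer look before filling them: 57840 rows divided by 241 is exactly 240, which is consistent with a single country missing its code on every row. One possible cause, stated here as an assumption and not verified against the file, is that pandas reads the literal string `NA` (Namibia's ISO code) as a missing value by default. A minimal sketch on a made-up two-row CSV (`csv_text` is hypothetical and uses the original, un-lowercased column names):

```python
import io
import pandas as pd

# Hypothetical CSV in the same semicolon-separated layout; the values are made up.
csv_text = "Country_code;Country;New_deaths\nNA;Namibia;1\nAF;Afghanistan;2\n"

# Default parsing: the literal string "NA" is treated as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text), sep=";")
missing_countries = default_df.loc[default_df["Country_code"].isna(), "Country"].unique()
print(missing_countries)

# Treat only empty fields as missing, so "NA" survives as a real code.
strict_df = pd.read_csv(
    io.StringIO(csv_text), sep=";", keep_default_na=False, na_values=[""]
)
print(strict_df["Country_code"].tolist())
```

On the real data, `df.loc[df['country_code'].isna(), 'country'].unique()` would show which country or countries are affected.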

In [37]:
# Fill missing death counts with 0 and missing country codes with a text
# placeholder, so the code column does not mix strings and numbers.
# Reassigning (instead of inplace=True) avoids SettingWithCopyWarning.
df = df.fillna({'new_deaths': 0, 'country_code': 'Unknown'})


In [38]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57840 entries, 0 to 57839
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date_reported     57840 non-null  datetime64[ns]
 1   country_code      57840 non-null  object        
 2   country           57840 non-null  object        
 3   cumulative_cases  57840 non-null  int64         
 4   new_deaths        57840 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 2.2+ MB


## Day 3: COVID-19 Cases Trend Over Time

### Objective
- Analyze how cumulative cases grew over time
- Visualize the trend for a single country
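
The two objectives above can be sketched as a small helper. This is one possible approach and not the notebook's actual Day 3 code: `plot_country_trend` is a hypothetical name, and `sample` is made-up data in the cleaned column layout from Day 2.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_country_trend(data, country):
    """Plot cumulative cases over time for one country and return the Axes."""
    # Filter to one country and make sure the rows are in date order.
    subset = data[data["country"] == country].sort_values("date_reported")
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(subset["date_reported"], subset["cumulative_cases"])
    ax.set_title(f"Cumulative COVID-19 cases: {country}")
    ax.set_xlabel("Date reported")
    ax.set_ylabel("Cumulative cases")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return ax


# Hypothetical sample in the cleaned layout; the values are made up and
# deliberately out of date order to show the sorting step.
sample = pd.DataFrame({
    "date_reported": pd.to_datetime(["2020-01-19", "2020-01-05", "2020-01-12"]),
    "country": ["Afghanistan"] * 3,
    "cumulative_cases": [7, 0, 3],
})
ax = plot_country_trend(sample, "Afghanistan")
```

In the notebook, calling `plot_country_trend(df, "Afghanistan")` followed by `plt.show()` would draw the same chart from the cleaned data.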
