## Data Cleaning and Analysis Practice

This notebook uses a Kaggle dataset on the impact of COVID-19 to practice data cleaning, analysis, and visualization.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('data/CovidDeaths.csv')

In [4]:
df.head()

Unnamed: 0,iso_code,continent,location,date,population,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,...,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million
0,AFG,Asia,Afghanistan,1/3/2020,41128772,,0.0,,,0.0,...,,,,,,,,,,
1,AFG,Asia,Afghanistan,1/4/2020,41128772,,0.0,,,0.0,...,,,,,,,,,,
2,AFG,Asia,Afghanistan,1/5/2020,41128772,,0.0,,,0.0,...,,,,,,,,,,
3,AFG,Asia,Afghanistan,1/6/2020,41128772,,0.0,,,0.0,...,,,,,,,,,,
4,AFG,Asia,Afghanistan,1/7/2020,41128772,,0.0,,,0.0,...,,,,,,,,,,


In [5]:
df.shape

(309799, 26)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309799 entries, 0 to 309798
Data columns (total 26 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   iso_code                            309799 non-null  object 
 1   continent                           295057 non-null  object 
 2   location                            309799 non-null  object 
 3   date                                309799 non-null  object 
 4   population                          309799 non-null  int64  
 5   total_cases                         273627 non-null  float64
 6   new_cases                           300906 non-null  float64
 7   new_cases_smoothed                  299642 non-null  float64
 8   total_deaths                        252976 non-null  float64
 9   new_deaths                          301000 non-null  float64
 10  new_deaths_smoothed                 299770 non-null  float64
 11  total_cases_per_million   
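
The info output lists date as an object column, so the dates are still plain strings. A minimal sketch of converting them, run on a toy frame rather than the real file. It assumes the strings are month/day/year, which the head output (1/3/2020, 1/4/2020, ...) suggests but does not prove:

```python
import pandas as pd

# Toy frame mimicking the date strings shown by df.head(); not the real data.
sample = pd.DataFrame({"date": ["1/3/2020", "1/4/2020", "12/31/2020"]})

# Assumption: month/day/year ordering. An explicit format avoids a silent
# day/month swap and raises an error if a string does not match.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

print(sample["date"].dt.month.tolist())
print(sample["date"].dt.day.tolist())
```

On the real data the same call would be applied to df['date'], after first checking a few rows with a day value above 12 to confirm the ordering.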

In [7]:
df['location'].value_counts()

Argentina          1231
Sweden             1230
North America      1230
Asia               1230
Italy              1230
                   ... 
Scotland           1145
Wales              1135
Macao               795
Northern Cyprus     691
Western Sahara        1
Name: location, Length: 255, dtype: int64

In [8]:
df['location'].unique()

array(['Afghanistan', 'Africa', 'Albania', 'Algeria', 'American Samoa',
       'Andorra', 'Angola', 'Anguilla', 'Antigua and Barbuda',
       'Argentina', 'Armenia', 'Aruba', 'Asia', 'Australia', 'Austria',
       'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados',
       'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan',
       'Bolivia', 'Bonaire Sint Eustatius and Saba',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Cayman Islands', 'Central African Republic', 'Chad', 'Chile',
       'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Curacao',
       'Cyprus', 'Czechia', 'Democratic Republic of Congo', 'Denmark',
       'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'England', 'Equatorial Guinea', 'Eritrea',

In [9]:
df.isna().sum()

iso_code                                   0
continent                              14742
location                                   0
date                                       0
population                                 0
total_cases                            36172
new_cases                               8893
new_cases_smoothed                     10157
total_deaths                           56823
new_deaths                              8799
new_deaths_smoothed                    10029
total_cases_per_million                36172
new_cases_per_million                   8893
new_cases_smoothed_per_million         10157
total_deaths_per_million               56823
new_deaths_per_million                  8799
new_deaths_smoothed_per_million        10029
reproduction_rate                     124982
icu_patients                          273842
icu_patients_per_million              273842
hosp_patients                         273060
hosp_patients_per_million             273060
weekly_icu
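
Raw counts are hard to compare at a glance, so it may help to express them as a share of all rows. A small sketch on a made-up frame (the values are invented for illustration; only the column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names come from the dataset.
toy = pd.DataFrame({
    "continent": ["Asia", None, "Europe", None],
    "new_cases": [0.0, 5.0, np.nan, 2.0],
    "icu_patients": [np.nan, np.nan, np.nan, 10.0],
})

# isna() gives booleans; their mean is the fraction of missing values per column.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this ranking (icu_patients and hosp_patients in the real output) are mostly empty, which matters when choosing how to handle them.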

In [10]:
df['continent'].value_counts()

Africa           69769
Europe           67112
Asia             61442
North America    50200
Oceania          29377
South America    17157
Name: continent, dtype: int64

Common ways to deal with missing values:
1. Remove the affected records.
2. Replace the missing values with the mean, median, or mode.
3. Use a special placeholder value such as "Unknown".

For the continent column, I am going to replace the NaNs with the placeholder "Unknown".
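
The three options listed above can be sketched side by side on a toy frame. The values are invented, and which option suits which column is a judgment call rather than a rule:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "continent": ["Asia", None, "Europe"],
    "new_cases": [10.0, np.nan, 30.0],
})

# 1. Remove the records that have a missing value in a given column.
dropped = toy.dropna(subset=["new_cases"])

# 2. Replace missing numeric values with a summary statistic (median here).
filled_median = toy["new_cases"].fillna(toy["new_cases"].median())

# 3. Replace missing categorical values with a placeholder label.
filled_label = toy["continent"].fillna("Unknown")

print(len(dropped))
print(filled_median.tolist())
print(filled_label.tolist())
```

A placeholder keeps every row, which is why it is a reasonable fit for a categorical column like continent where a mean or median has no meaning.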


In [11]:
df['continent'] = df['continent'].fillna('Unknown')

Check that no NaNs remain in the continent column:

In [13]:
df['continent'].isnull().sum()

0
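
It is also worth seeing which locations ended up with the placeholder. The location counts include names such as "Asia" and "North America", so the rows without a continent may be aggregate rows rather than countries. That is an assumption to verify on the real data; the sketch below only shows the lookup on a toy frame:

```python
import pandas as pd

# Toy frame with invented rows; in the notebook this would be df itself.
toy = pd.DataFrame({
    "location": ["Afghanistan", "Asia", "Asia", "Italy", "North America"],
    "continent": ["Asia", "Unknown", "Unknown", "Europe", "Unknown"],
})

# Distinct locations whose continent was filled with the placeholder.
unknown_locations = toy.loc[toy["continent"] == "Unknown", "location"].unique()
print(unknown_locations)
```

If the real result contains only aggregates, those rows should probably be kept apart from country-level analysis to avoid double counting.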

Next, look at the reproduction_rate column:

In [18]:
df['reproduction_rate'].describe()

count    184817.000000
mean          0.911495
std           0.399925
min          -0.070000
25%           0.720000
50%           0.950000
75%           1.140000
max           5.870000
Name: reproduction_rate, dtype: float64
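
The minimum of -0.07 stands out, because a reproduction rate should not be negative. One possible way to handle this, sketched on invented values, is to count the negative entries and treat them as missing. Whether they are data errors is an assumption, not something the dataset confirms:

```python
import numpy as np
import pandas as pd

# Toy column with invented values, including one negative entry and one NaN.
toy = pd.DataFrame({"reproduction_rate": [0.95, -0.07, np.nan, 1.14, 5.87]})

# Comparisons with NaN evaluate to False, so only real negatives are flagged.
negative = toy["reproduction_rate"] < 0
print(negative.sum())

# mask() replaces the flagged entries with NaN and leaves the rest untouched.
cleaned = toy["reproduction_rate"].mask(negative)
print(cleaned.min())
```

Masking rather than dropping keeps the rows available for the other columns.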