Importing the libraries needed for the data cleaning process. These are usually pandas and NumPy, plus Matplotlib and seaborn when some columns need to be visualized.

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Importing the dataset file. For a CSV file this is usually done with the pd.read_csv() function. Datasets can also come in other formats, such as Excel or JSON, for which pandas has matching readers (pd.read_excel(), pd.read_json()).

To display the first 5 rows of the dataset, we use the df.head() method.

To display the last 5 rows of the dataset, we use the df.tail() method.

To display every row instead of a truncated view, we first call pd.set_option('display.max_rows', None), which removes the row limit for all later output.
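
The loading and preview steps can be sketched on a small in-memory CSV. The sample rows and the shortened column name `Cases` are hypothetical stand-ins for the real file, and `pd.option_context` is used here as an alternative to `pd.set_option` so that the row limit is lifted only temporarily.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """Country,Year,Cases
Afghanistan,2016,677
Afghanistan,2015,58064
Sweden,1984,
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

first_two = sample.head(2)  # first n rows (n defaults to 5)
last_one = sample.tail(1)   # last n rows (n defaults to 5)

# Lift the row limit only inside this block; the global setting is untouched.
with pd.option_context("display.max_rows", None):
    print(sample)
```

The empty `Cases` field in the last row is read as NaN, which is how missing values show up in the df.tail() output.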

In [21]:
df = pd.read_csv(r'C:\Users\user\Desktop\data.csv')
df.head()

Unnamed: 0,Country,Year,Number of reported cases of cholera,Number of reported deaths from cholera,Cholera case fatality rate,WHO Region
0,Afghanistan,2016,677,5,0.7,Eastern Mediterranean
1,Afghanistan,2015,58064,8,0.01,Eastern Mediterranean
2,Afghanistan,2014,45481,4,0.0,Eastern Mediterranean
3,Afghanistan,2013,3957,14,0.35,Eastern Mediterranean
4,Afghanistan,2012,12,0,0.1,Eastern Mediterranean


In [22]:
df.tail()

Unnamed: 0,Country,Year,Number of reported cases of cholera,Number of reported deaths from cholera,Cholera case fatality rate,WHO Region
2487,Russian Federation,1980,,0.0,,Europe
2488,Russian Federation,1971,,0.0,,Europe
2489,Sweden,1984,,0.0,,Europe
2490,Switzerland,1980,,0.0,,Europe
2491,Cambodia,2011,,,0.0,Western Pacific


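To check the data type of every column, we use the df.dtypes attribute. Always ensure that each column has the correct data type. Here the case counts, death counts, and fatality rate are read as object (strings) even though they should be numeric, so they need cleaning before conversion.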

In [23]:
df.dtypes

Country                                   object
Year                                       int64
Number of reported cases of cholera       object
Number of reported deaths from cholera    object
Cholera case fatality rate                object
WHO Region                                object
dtype: object
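
When a column that should be numeric comes back as object, it helps to see which values are responsible. The sketch below is one possible way to do that with `pd.to_numeric`; the sample values are hypothetical and only illustrate the kind of entries that block a numeric dtype.

```python
import pandas as pd

# Hypothetical values showing why a numeric column can load as strings.
cases = pd.Series(["677", "58 064", "Unknown", "12"], name="Reported cases")

# errors="coerce" turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(cases, errors="coerce")

# The rows that failed to parse are the ones that need cleaning.
offenders = cases[parsed.isna()]
print(offenders.tolist())
```

Listing the offending values first makes it easier to decide which cleaning steps a column needs.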

The df.info() method prints a summary of the dataset: the index range, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2492 entries, 0 to 2491
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Country          2492 non-null   object
 1   Year             2492 non-null   int64 
 2   Reported cases   2492 non-null   int32 
 3   Reported deaths  2492 non-null   int32 
 4   Fatality rate    2492 non-null   object
 5   WHO Region       2492 non-null   object
dtypes: int32(2), int64(1), object(3)
memory usage: 97.5+ KB


Renaming the three long column names to shorter ones for easier access during further processing.

In [41]:
df = df.rename(columns={"Number of reported cases of cholera": "Reported cases",
                        "Number of reported deaths from cholera": "Reported deaths",
                        "Cholera case fatality rate": "Fatality rate"})
df.head()

Unnamed: 0,Country,Year,Reported cases,Reported deaths,Fatality rate,WHO Region
0,Afghanistan,2016,677,5,0.7,Eastern Mediterranean
1,Afghanistan,2015,58064,8,0.01,Eastern Mediterranean
2,Afghanistan,2014,45481,4,0.0,Eastern Mediterranean
3,Afghanistan,2013,3957,14,0.35,Eastern Mediterranean
4,Afghanistan,2012,12,0,0.1,Eastern Mediterranean


Counting the null values in each column

In [42]:
df.isnull().sum()

Country            0
Year               0
Reported cases     0
Reported deaths    0
Fatality rate      0
WHO Region         0
dtype: int64

Filling the nulls with 0 rather than dropping the rows. The columns still hold strings at this point, so the fill value is the string '0'.


In [43]:
df['Reported cases'] = df['Reported cases'].fillna('0')
df['Reported deaths'] = df['Reported deaths'].fillna('0')
df['Fatality rate'] = df['Fatality rate'].fillna('0')

df.isnull().sum()

Country            0
Year               0
Reported cases     0
Reported deaths    0
Fatality rate      0
WHO Region         0
dtype: int64
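
To see the effect of the fill, the null counts can be compared before and after. This is a minimal sketch on a hypothetical two-column frame; passing a dict to `fillna` is an alternative to filling each column separately.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "Reported cases": ["677", np.nan, "12"],
    "Reported deaths": ["5", "0", np.nan],
})

nulls_before = toy.isnull().sum()

# A dict maps each column to its own fill value.
filled = toy.fillna({"Reported cases": "0", "Reported deaths": "0"})

nulls_after = filled.isnull().sum()
print(nulls_before.tolist(), nulls_after.tolist())
```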

Removing spaces from the values of the three data columns so that they can later be converted to numbers

In [None]:
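# regex=False treats " " as a literal space regardless of the pandas version
df['Reported cases'] = df['Reported cases'].str.replace(" ", "", regex=False)
df['Reported deaths'] = df['Reported deaths'].str.replace(" ", "", regex=False)
df['Fatality rate'] = df['Fatality rate'].str.replace(" ", "", regex=False)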
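
Replacing `" "` removes plain spaces only. If a column might also contain tabs or newlines (an assumption, not something seen in this dataset), a regular expression can cover them. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical values with a space inside a number, padding, and a tab.
raw = pd.Series(["58 064", " 677 ", "12\t"])

# Literal replacement: removes plain spaces but leaves the tab in place.
spaces_only = raw.str.replace(" ", "", regex=False)

# Regex replacement: \s+ matches any run of whitespace, including tabs.
all_whitespace = raw.str.replace(r"\s+", "", regex=True)

print(spaces_only.tolist())
print(all_whitespace.tolist())
```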

Replacing 'Unknown' entries with 0

In [45]:
df['Reported cases'] = df['Reported cases'].replace('Unknown', '0')
df['Reported deaths'] = df['Reported deaths'].replace('Unknown', '0')
df['Fatality rate'] = df['Fatality rate'].replace('Unknown', '0')
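
Note that replace('Unknown', '0') matches only cells that are exactly 'Unknown'. The sketch below, on hypothetical data, shows that behaviour and one possible way to also catch differently cased variants, in case a dataset contains them.

```python
import pandas as pd

# Hypothetical frame; the lower-case 'unknown' is an assumed variant.
toy = pd.DataFrame({
    "Reported cases": ["677", "Unknown", "12"],
    "Reported deaths": ["5", "unknown", "Unknown"],
})
cols = ["Reported cases", "Reported deaths"]

# Exact match: the lower-case variant is left untouched.
exact = toy[cols].replace("Unknown", "0")

# Case-insensitive: build a boolean mask, then overwrite the flagged cells.
is_placeholder = toy[cols].apply(lambda s: s.str.lower() == "unknown")
cleaned = toy[cols].mask(is_placeholder, "0")

print(exact["Reported deaths"].tolist())
print(cleaned["Reported deaths"].tolist())
```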

Converting Reported cases and Reported deaths from object to int

In [39]:
df['Reported cases'] = df['Reported cases'].astype(int)
df['Reported deaths'] = df['Reported deaths'].astype(int)

In [32]:
df.dtypes

Country            object
Year                int64
Reported cases      int32
Reported deaths     int32
Fatality rate      object
WHO Region         object
dtype: object
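
The int32 above is what astype(int) produces on some platforms (for example, Windows with older NumPy versions), and Fatality rate is still object. A possible follow-up, which is not part of the original notebook, is to name the target types explicitly and convert the rate to float as well. The frame below is a hypothetical stand-in for the cleaned data.

```python
import pandas as pd

# Hypothetical cleaned values, still stored as strings.
toy = pd.DataFrame({
    "Reported cases": ["677", "58064", "0"],
    "Fatality rate": ["0.7", "0.01", "0"],
})

# A dict passed to astype converts several columns in one call;
# naming "int64" explicitly avoids platform-dependent integer widths.
converted = toy.astype({"Reported cases": "int64", "Fatality rate": "float64"})

print(converted.dtypes)
```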

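Checking for incorrect spellings in the Country column. value_counts() lists each distinct name with its frequency, so a misspelled variant shows up as a separate entry.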

In [33]:
df['Country'].value_counts()

India                                                   64
Malaysia                                                49
Viet Nam                                                47
Iran (Islamic Republic of)                              45
Myanmar                                                 45
Nigeria                                                 45
Cameroon                                                44
Bangladesh                                              44
Philippines                                             43
Ghana                                                   43
Japan                                                   43
Liberia                                                 42
United Kingdom of Great Britain and Northern Ireland    41
Singapore                                               41
United Republic of Tanzania                             40
Mozambique                                              40
Benin                                                   

Checking for incorrect spellings in the WHO Region column

In [34]:
df['WHO Region'].value_counts()

Africa                   991
Western Pacific          391
Europe                   327
South-East Asia          292
Americas                 256
Eastern Mediterranean    235
Name: WHO Region, dtype: int64
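
Scanning value_counts() by eye works for six regions but is harder for a long country list. One possible aid, sketched here with the standard-library `difflib` module, is to flag pairs of names that are nearly identical. The names and the 0.85 similarity cutoff are hypothetical choices for illustration.

```python
import difflib
import pandas as pd

# Hypothetical names with a trailing space and a misspelling.
countries = pd.Series(["Nigeria", "Nigeria ", "Nigera", "Ghana", "India"])

# Strip padding first so 'Nigeria ' and 'Nigeria' count as one name.
cleaned = countries.str.strip()
unique_names = sorted(cleaned.unique())

# For each name, look for other names that are suspiciously similar.
suspects = {}
for name in unique_names:
    others = [n for n in unique_names if n != name]
    close = difflib.get_close_matches(name, others, n=3, cutoff=0.85)
    if close:
        suspects[name] = close

print(suspects)
```

Flagged pairs still need a manual check, since two legitimately different names can be close in spelling.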

In [35]:
df.head()

Unnamed: 0,Country,Year,Reported cases,Reported deaths,Fatality rate,WHO Region
0,Afghanistan,2016,677,5,0.7,Eastern Mediterranean
1,Afghanistan,2015,58064,8,0.01,Eastern Mediterranean
2,Afghanistan,2014,45481,4,0.0,Eastern Mediterranean
3,Afghanistan,2013,3957,14,0.35,Eastern Mediterranean
4,Afghanistan,2012,12,0,0.1,Eastern Mediterranean
