### This notebook covers preliminary analysis: data exploration, data cleaning, and preparation of the data for further analysis.

First, import the required libraries as well as the dataset.

In [1]:
# import the libraries

import numpy as np # for array and math operations
import pandas as pd # for data manipulation
import matplotlib.pyplot as plt # for visualization
import seaborn as sns # for visualization

In [2]:
# load the dataset using pandas

data = pd.read_csv('../data/dataset.csv')

Now explore and clean the data, and prepare it for further analysis.

In [3]:
# display the dataset (pandas shows the first and last few rows) to understand the kind of data it contains

data

Unnamed: 0,year_month,year,month,age,gender,mp
0,2021_9,2021,sept,0,m,1
1,2021_9,2021,sept,0,m,1
2,2021_9,2021,sept,0,m,1
3,2021_9,2021,sept,0,m,1
4,2021_9,2021,sept,1,f,0
...,...,...,...,...,...,...
3796,2007_10,2007,oct,60,m,0
3797,2007_10,2007,oct,50,f,0
3798,2007_11,2007,nov,4,m,0
3799,2007_11,2007,nov,16,m,0


The dataset has 6 columns:
- year_month: the year and month in which the patient visited the health center (time)
- year: the year in which the patient visited the health center and was diagnosed (time)
- month: the month in which the patient was diagnosed (time)
- age: age of the patient in years (0 means the patient was less than 1 year old) (continuous data)
- gender: the sex of the patient (categorical data)
- mp: malaria parasite status determined by microscopy or rapid diagnostic tests (RDTs) (1 = positive for malaria parasite, 0 = negative for malaria parasite) (categorical)

Check the data type and related information for each variable.

In [4]:
# understand properties of the variables in the dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3801 entries, 0 to 3800
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year_month  3801 non-null   object
 1   year        3801 non-null   int64 
 2   month       3801 non-null   object
 3   age         3801 non-null   int64 
 4   gender      3801 non-null   object
 5   mp          3801 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 178.3+ KB


Check the unique categories under each categorical variable (year, month, gender, mp).

In [5]:
# check the unique categories for each variable

cat_cols = ['year', 'month', 'gender', 'mp']  # a plain list of names; avoids shadowing the built-in `list`
for col in cat_cols:
    print('unique categories in', col)
    print(data[col].unique())
    print('-----------------------')


unique categories in year
[2021 2020 2019 2018 2017 2016 2014 2015 2011 2010 2012 2013 2007 2009
 2008 2006]
-----------------------
unique categories in month
['sept' 'oct' 'nov' 'may ' 'may' 'mar' 'june' 'july' 'jan' 'feb' 'dec'
 'aug' 'april' 'mar ' 'aprl' 'dec ' '0ct' 'june ']
-----------------------
unique categories in gender
['m' 'f' 'm ' ' m' ' f' 'd' 'g' 'f ']
-----------------------
unique categories in mp
[1 0 9]
-----------------------


From the output above, the data spans 16 different years. The month, gender and mp columns contain irregularities, probably due to data entry errors.

The month column ought to have 12 unique values. The output above shows 18: trailing spaces and misspellings have created additional categories.
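Before correcting the labels, it can help to see how many rows each irregular value affects. The following is a minimal sketch on a small hypothetical sample (`sample` is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# hypothetical sample that mimics the irregular month labels seen above
sample = pd.DataFrame({'month': ['may', 'may ', 'oct', '0ct', 'oct', 'aprl']})

# count rows per raw label; trailing spaces and typos appear as separate labels
month_counts = sample['month'].value_counts()
print(month_counts)

# repr() makes leading or trailing whitespace visible in the printed labels
print([repr(m) for m in sample['month'].unique()])
```

Running the same two calls on `data['month']` would show whether an irregular label is a one-off typo or a recurring entry pattern.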

In [6]:
# handling irregularities in the month column.
# df['column'] = df['column'].replace({'value to replace': 'replacement value'})

# replace '0ct' with 'oct'
# replace 'aprl' with 'april'
# replace 'dec ' with 'dec'
# replace 'may ' with 'may' 
# replace 'june ' with 'june' 
# replace 'mar ' with 'mar' 

data['month'] = data['month'].replace({'0ct':'oct', 'dec ':'dec', 'aprl':'april', 'mar ':'mar', 'may ':'may', 'june ':'june'})
data['month'].unique()

array(['sept', 'oct', 'nov', 'may', 'mar', 'june', 'july', 'jan', 'feb',
       'dec', 'aug', 'april'], dtype=object)
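Most of the irregular labels differ only by whitespace, so an alternative is to strip whitespace first and map only the genuine misspellings. This is a sketch of that variant, not the approach used in the cell above; `months` and `typo_map` are hypothetical names:

```python
import pandas as pd

# hypothetical sample with the same kinds of irregularities as the month column
months = pd.Series(['sept', 'may ', 'mar ', 'aprl', '0ct', 'dec ', 'june '])

# step 1: strip leading/trailing whitespace (handles 'may ', 'mar ', 'dec ', 'june ')
# step 2: map the remaining genuine misspellings to their intended labels
typo_map = {'0ct': 'oct', 'aprl': 'april'}
cleaned = months.str.strip().replace(typo_map)

print(cleaned.tolist())
```

The same `.str.strip()` step would also cover the whitespace variants in the gender column, leaving only the unexpected letters to be mapped explicitly.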

In [7]:
# handling irregularities in gender category
# replace 'm ', ' m' with 'm'
# replace ' f', 'f ', 'd', 'g' with 'f'

data['gender'] = data['gender'].replace({'m ':'m', ' m':'m', ' f':'f', 'f ':'f', 'd':'f', 'g':'f'})
data['gender'].unique()

array(['m', 'f'], dtype=object)

In [8]:
# handling irregularities in the mp column
# replace 9 with 0

data['mp'] = data['mp'].replace(9,0)
data['mp'].unique()

array([1, 0], dtype=int64)
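After recoding, a quick check can confirm that each cleaned column contains only expected values. A minimal sketch, assuming a small hypothetical frame `cleaned` in place of `data` and an `expected` mapping built from the cleaned outputs above:

```python
import pandas as pd

# hypothetical cleaned sample; in the notebook this would be `data`
cleaned = pd.DataFrame({
    'month': ['sept', 'oct', 'may'],
    'gender': ['m', 'f', 'f'],
    'mp': [1, 0, 0],
})

# allowed values per column, taken from the cleaned outputs above
expected = {
    'month': {'jan', 'feb', 'mar', 'april', 'may', 'june',
              'july', 'aug', 'sept', 'oct', 'nov', 'dec'},
    'gender': {'m', 'f'},
    'mp': {0, 1},
}

# collect any values that fall outside the allowed set for each column
unexpected = {
    col: sorted(set(cleaned[col].tolist()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```

An empty list for every column means no irregular labels remain.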

The last task in preparing the data is to classify age into different categories.

In [9]:
# display all the unique values in the age variable

data['age'].unique()

array([ 0,  1,  2,  4, 10, 11, 12, 17, 18, 20, 21, 23, 24, 32, 35, 36, 37,
       41, 43, 50, 51, 54, 60, 70,  5,  6,  7,  8, 22, 28, 30, 49, 56, 66,
       76, 80,  3,  9, 19, 27, 31, 33, 45, 68, 74, 16, 14, 15, 25, 34, 39,
       40, 47, 52, 53, 55, 57, 63, 71, 72, 13, 26, 38, 44, 64, 90, 29, 42,
       58, 65, 69, 77, 62, 48, 67, 75, 61, 73, 46, 59, 85, 82, 83, 78, 88,
       86, 98, 79, 93], dtype=int64)

- From the output above, age ranges from 0 years (< 12 months) to 98 years.
- This is malaria-related data, and we would like to understand prevalence levels for patients aged <1yr, 1-<5yrs, 5-<15yrs, 15-<45yrs, 45-<65yrs and >=65yrs.

In [None]:
# reference example of pd.cut taken from a different dataset
# ('students' is not defined in this notebook, so the call is left commented out)
# categorise age into 3 categories
# 10 to 13 yrs: (9, 13], 9 is excluded and 13 is included
# 14 to 16 yrs: (13, 16], 13 is excluded and 16 is included
# 17 to 19 yrs: (16, 19], 16 is excluded and 19 is included.

# students['Age_group'] = pd.cut(x=students['Age'], bins=[9, 13, 16, 19], labels=['10 to 13 yrs','14 to 16 yrs','17 to 19 yrs'])


In [17]:
# create a new column in the dataset called age_groups
# categorise age into 6 categories
# <1yr: (data['age'].min()-1, 0], only age 0 falls in this bin
# 1-<5yrs: (0, 4], 0 is excluded and 4 is included (patients between 1 and 4 years)
# 5-<15yrs: (4, 14], patients between 5 and 14 years
# 15-<45yrs: (14, 44], patients between 15 and 44 years
# 45-<65yrs: (44, 64], patients between 45 and 64 years
# >=65yrs: (64, data['age'].max()], patients aged 65 years and above

bins = [data['age'].min()-1, 0, 4, 14, 44, 64, data['age'].max()]
labels = ['<1yr', '1-<5yrs', '5-<15yrs', '15-<45yrs', '45-<65yrs', '>=65yrs']
data['age_groups'] = pd.cut(x=data['age'], bins=bins, labels=labels)
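The labels are written with "<" upper bounds, while `pd.cut` closes bins on the right by default. Passing `right=False` lets the bin edges match the labels directly. The sketch below uses hypothetical boundary ages (`ages`) to compare both forms; it is an illustration, not a change to the cell above:

```python
import numpy as np
import pandas as pd

# hypothetical ages chosen to sit on each bin boundary
ages = pd.Series([0, 1, 4, 5, 14, 15, 44, 45, 64, 65, 98])
labels = ['<1yr', '1-<5yrs', '5-<15yrs', '15-<45yrs', '45-<65yrs', '>=65yrs']

# right-closed bins, as in the cell above: (-1, 0], (0, 4], ..., (64, max]
right_closed = pd.cut(ages, bins=[-1, 0, 4, 14, 44, 64, ages.max()], labels=labels)

# left-closed bins: [0, 1), [1, 5), [5, 15), [15, 45), [45, 65), [65, inf)
left_closed = pd.cut(ages, bins=[0, 1, 5, 15, 45, 65, np.inf],
                     labels=labels, right=False)

# both forms should assign every age to the same group
same = right_closed.astype(str).tolist() == left_closed.astype(str).tolist()
print(same)
```

For integer ages the two forms agree. The left-closed form would also handle fractional ages such as 4.5 in line with the "1-<5yrs" label, whereas the right-closed form would place 4.5 in "5-<15yrs".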


Another comparison I would like to make is between patients below 15 years of age and patients aged 15 years and above. Hence I will create another variable called 'age_cat'.

In [28]:
# create a variable age_cat with 2 values, <15yrs and >=15yrs
# <15yrs: (data['age'].min()-1, 14], patients between 0 and 14 years
# >=15yrs: (14, data['age'].max()], patients aged 15 years and above

cat_bins = [data['age'].min()-1, 14, data['age'].max()]  # renamed from `bin` to avoid shadowing the built-in
cat_labels = ['<15yrs', '>=15yrs']
data['age_cat'] = pd.cut(x=data['age'], bins=cat_bins, labels=cat_labels)

In [29]:
data

Unnamed: 0,year_month,year,month,age,gender,mp,age_groups,age_cat
0,2021_9,2021,sept,0,m,1,<1yr,<15yrs
1,2021_9,2021,sept,0,m,1,<1yr,<15yrs
2,2021_9,2021,sept,0,m,1,<1yr,<15yrs
3,2021_9,2021,sept,0,m,1,<1yr,<15yrs
4,2021_9,2021,sept,1,f,0,1-<5yrs,<15yrs
...,...,...,...,...,...,...,...,...
3796,2007_10,2007,oct,60,m,0,45-<65yrs,>=15yrs
3797,2007_10,2007,oct,50,f,0,45-<65yrs,>=15yrs
3798,2007_11,2007,nov,4,m,0,1-<5yrs,<15yrs
3799,2007_11,2007,nov,16,m,0,15-<45yrs,>=15yrs
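One way to confirm that the bins behave as intended is to look at the smallest and largest age in each group. A minimal sketch on hypothetical boundary ages (`sample` stands in for `data`):

```python
import pandas as pd

# hypothetical sample with one age on each side of every bin edge
sample = pd.DataFrame({'age': [0, 1, 4, 5, 14, 15, 44, 45, 64, 65, 98]})

bins = [sample['age'].min() - 1, 0, 4, 14, 44, 64, sample['age'].max()]
labels = ['<1yr', '1-<5yrs', '5-<15yrs', '15-<45yrs', '45-<65yrs', '>=65yrs']
sample['age_groups'] = pd.cut(x=sample['age'], bins=bins, labels=labels)

# smallest and largest age per group; observed=True skips empty categories
bounds = sample.groupby('age_groups', observed=True)['age'].agg(['min', 'max'])
print(bounds)
```

Each group's `min` and `max` should fall inside the range its label describes, and no age should be left without a group.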


We have successfully cleaned and prepared the data, which can now be used for further analysis and modelling.


Next, we will save the cleaned dataset to a new file, which will be used in subsequent notebooks.

In [30]:
data.to_csv('../data/cleaned_data.csv')
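CSV stores plain text, so the categorical dtype and category order of `age_groups` and `age_cat` are not preserved on reload. The sketch below shows one way a later notebook might restore them. It uses a hypothetical three-row frame and an in-memory buffer instead of the real file, and passes `index=False`, which is a choice made for this sketch and differs from the call above:

```python
import io
import pandas as pd

labels = ['<15yrs', '>=15yrs']

# hypothetical sample standing in for the cleaned dataset
sample = pd.DataFrame({'age': [3, 20, 70]})
sample['age_cat'] = pd.cut(sample['age'],
                           bins=[-1, 14, sample['age'].max()],
                           labels=labels)

# write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# restore the ordered categories explicitly after reading the CSV
reloaded['age_cat'] = pd.Categorical(reloaded['age_cat'],
                                     categories=labels, ordered=True)
print(reloaded['age_cat'].cat.categories.tolist())
```

Restoring the order matters mainly for sorting and for plots that should list the age groups from youngest to oldest.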