# About the Notebook

Goal: Inspect and clean the data, including identifying missing and duplicated data.

About the datasets:
- There are three malaria datasets
    - Malaria Deaths per Age Group
    - Malaria Death Rate
    - Malaria Incidence Rate
    
**Definitions**
|*Term*|*Definition*|
|---|---|
|Death rate|The number of deaths from malaria per 100,000 people.|
|Incidence rate|The number of new cases of malaria in a year per 1,000 population at risk.|
|Deaths/ Total Deaths|Annual number of deaths from malaria.|
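
As a worked illustration of the two rate definitions above, the sketch below computes each rate from made-up counts. The function names and all figures are hypothetical and are not taken from the datasets. It also computes a crude rate, whereas the death-rate column in the source data is age-standardized.

```python
def death_rate_per_100k(deaths: float, population: float) -> float:
    """Deaths from malaria per 100,000 people (crude rate)."""
    return deaths * 100_000 / population


def incidence_rate_per_1k(new_cases: float, population_at_risk: float) -> float:
    """New malaria cases in a year per 1,000 population at risk."""
    return new_cases * 1_000 / population_at_risk


# Hypothetical figures, for illustration only
example_death_rate = death_rate_per_100k(deaths=50, population=2_000_000)
example_inc_rate = incidence_rate_per_1k(new_cases=300, population_at_risk=10_000)

print(example_death_rate)  # 2.5
print(example_inc_rate)    # 30.0
```

Note that the two rates use different denominators (total population vs population at risk) and different scales, so they are not directly comparable.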

**Source**: [Our World in Data](https://ourworldindata.org/malaria#)

# PART 1. Imports

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd

#Import Data_analysis functions
from data_analysis import *

#Country Data Exploration
from geonamescache import GeonamesCache
import pycountry

In [2]:
#Import csv files
age_grp_death_df = pd.read_csv("../data_input/2018_malaria_deaths_age.csv")
death_rate_df = pd.read_csv("../data_input/2018_malaria_deaths.csv")
inc_rate_df = pd.read_csv("../data_input/2018_malaria_inc.csv")

In [3]:
#List of data set names
dataframe_ls = "age_grp_death_df,death_rate_df,inc_rate_df"
dataframe_ls = dataframe_ls.split(',')

# PART 2. Initial Data Exploration and Processing

## 2.1 Quick Glance of the three data sets
> Snippet of the first three rows of each dataset

In [4]:
print("QUICK GLANCE OF DATA SETS:\n")

for data in dataframe_ls:
    #Name of dataframe
    print("--"*5, f"DATAFRAME: {data}", "--"*5)
    
    #Display first three rows of dataframe
    df_ = globals()[data]
    df_.head(3)
    
    #Show data size
    print(f"Dataframe size: {len(df_)}\n")

QUICK GLANCE OF DATA SETS:

---------- DATAFRAME: age_grp_death_df ----------


Unnamed: 0.1,Unnamed: 0,entity,code,year,age_group,deaths
0,1,Afghanistan,AFG,1990,Under 5,184.606435
1,2,Afghanistan,AFG,1991,Under 5,191.658193
2,3,Afghanistan,AFG,1992,Under 5,197.140197


Dataframe size: 30780

---------- DATAFRAME: death_rate_df ----------


Unnamed: 0,Entity,Code,Year,"Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)"
0,Afghanistan,AFG,1990,6.80293
1,Afghanistan,AFG,1991,6.973494
2,Afghanistan,AFG,1992,6.989882


Dataframe size: 6156

---------- DATAFRAME: inc_rate_df ----------


Unnamed: 0,Entity,Code,Year,"Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)"
0,Afghanistan,AFG,2000,107.1
1,Afghanistan,AFG,2005,46.5
2,Afghanistan,AFG,2010,23.9


Dataframe size: 508



---
**COMMENT**: About the data<br>

- There are three data sets with varying size:
    1. `age_grp_death_df`: Showing no. of deaths, per age group, per entity, and per year
    1. `death_rate_df`: Showing death rates per entity, and per year
    1. `inc_rate_df`: Showing incidence rates per entity, and per year
    
PLAN:
- Rename dataframe columns (for simplicity & consistency)
- Remove redundant columns (as seen in age_grp_death_df)
---
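
A leading `Unnamed: 0` column usually appears when a CSV was saved together with its index. The sketch below reproduces this with a small in-memory CSV (a stand-in for the real file, with rows copied from the snippet above) and shows one possible alternative to dropping the column afterwards: reading it as the index with `index_col=0`. This is an assumption about how the files were produced, not something the notebook confirms.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV that was saved together with its index
csv_text = """,entity,code,year,age_group,deaths
1,Afghanistan,AFG,1990,Under 5,184.606435
2,Afghanistan,AFG,1991,Under 5,191.658193
"""

# Default read: the unnamed leading column becomes a data column
raw_df = pd.read_csv(io.StringIO(csv_text))
print(raw_df.columns.tolist())

# Treating the leading column as the index keeps it out of the data columns
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed_df.columns.tolist())
```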

## 2.2 Initial Data Processing

**RENAME COLUMNS**<br>
> For consistency and simplicity

`age_grp_death_df`: Capitalise the first letter of each column name

In [5]:
# Function to uppercase the first letter of a string
def uppercase_first_letter(s):
    return s[0].upper() + s[1:]

# Uppercase the first letter of each column name for age_grp_death_df
age_grp_death_df.columns = [uppercase_first_letter(col) for col in age_grp_death_df.columns]
age_grp_death_df.head(3)

Unnamed: 0.1,Unnamed: 0,Entity,Code,Year,Age_group,Deaths
0,1,Afghanistan,AFG,1990,Under 5,184.606435
1,2,Afghanistan,AFG,1991,Under 5,191.658193
2,3,Afghanistan,AFG,1992,Under 5,197.140197


`death_rate_df` & `inc_rate_df`: Rename the last column for simplicity

In [6]:
#Simplify Last column name in death_rate_df
last_col = death_rate_df.columns.tolist()[-1]
death_rate_df.rename(
    columns = {last_col:'DeathRate_per100K'},
    inplace = True)
death_rate_df.head(3)

Unnamed: 0,Entity,Code,Year,DeathRate_per100K
0,Afghanistan,AFG,1990,6.80293
1,Afghanistan,AFG,1991,6.973494
2,Afghanistan,AFG,1992,6.989882


In [7]:
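#Simplify last column name in inc_rate_df
#NOTE: the source column is per 1,000 population at risk, so 'per100K' in the new name does not reflect the unit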
last_col = inc_rate_df.columns.tolist()[-1]
inc_rate_df.rename(
    columns = {last_col:'IncRate_per100K'},
    inplace = True)
inc_rate_df.head(3)

Unnamed: 0,Entity,Code,Year,IncRate_per100K
0,Afghanistan,AFG,2000,107.1
1,Afghanistan,AFG,2005,46.5
2,Afghanistan,AFG,2010,23.9


**DROP REDUNDANT COLUMN**<br>
> Drop first column in age_grp_death_df: `Unnamed: 0`

In [8]:
#Showing initial sets of columns in age_grp_death_df
age_grp_death_df.columns.tolist()

['Unnamed: 0', 'Entity', 'Code', 'Year', 'Age_group', 'Deaths']

In [9]:
#Removing the first column which is redundant
age_grp_death_df = age_grp_death_df.drop(age_grp_death_df.columns[0], axis=1)
age_grp_death_df.head(3)

Unnamed: 0,Entity,Code,Year,Age_group,Deaths
0,Afghanistan,AFG,1990,Under 5,184.606435
1,Afghanistan,AFG,1991,Under 5,191.658193
2,Afghanistan,AFG,1992,Under 5,197.140197


---
**COMMENT**: What's done and What's next<br>

We have renamed the columns and dropped redundant columns.<br>
We will now perform data checking and exploration in the next segment.

---

# Part 3: Data Exploration across the datasets

## 3.1 Data Cleaning: Inspect for obvious issues
- Inspect datatype
- Inspect missing data
- Inspect data value
- Inspect for duplicates
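
The `df_summary` helper used in the next cell comes from the local `data_analysis` module, which is not shown here. The sketch below is a hypothetical stand-in (`quick_summary` is an invented name) that covers the four inspections listed above using standard pandas calls. It is not the notebook's actual implementation.

```python
import pandas as pd


def quick_summary(df: pd.DataFrame) -> dict:
    """Collect basic data-quality facts about a dataframe (hypothetical helper)."""
    null_counts = df.isna().sum()
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,  # (rows, columns)
        "n_duplicates": int(df.duplicated().sum()),  # fully duplicated rows
        "dtypes": df.dtypes.astype(str).to_dict(),
        "nulls": null_counts[null_counts > 0].to_dict(),  # only columns with gaps
        "numeric_range": {c: (numeric[c].min(), numeric[c].max()) for c in numeric},
        "n_unique_text": df.select_dtypes(exclude="number").nunique().to_dict(),
    }


# Tiny made-up frame: one missing Code, no duplicated rows
sample = pd.DataFrame({
    "Entity": ["Afghanistan", "Afghanistan", "Caribbean"],
    "Code": ["AFG", "AFG", None],
    "Year": [1990, 1991, 1990],
    "Deaths": [184.6, 191.7, 12.0],
})
summary = quick_summary(sample)
print(summary["nulls"])
```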

In [10]:
for data in dataframe_ls:
    df_ = globals()[data]
    df_summary(df_, data)
    print("\n\n")

---------------DATAFRAME SUMMARY OF: age_grp_death_df---------------
SHAPE(col,rows): (30780, 5)

DUPLICATES
Number of duplicates: 0

DATA TYPE
Entity        object
Code          object
Year           int64
Age_group     object
Deaths       float64
dtype: object

MISSING DATA
Columns with missing values:
    col  num_nulls  perc_null
0  Code       4320       0.14

DATA VALUES: Quantitative data
  Column Name  Minimum Value  Maximum Value
0        Year         1990.0    2016.000000
1      Deaths            0.0  752025.548675

DATA VALUES: Qualitative data
--Column 'Entity' has
 228 unique values

--Column 'Code' has
 196 unique values

--Column 'Age_group' has
 5 unique values

---------------END OF DATA SUMMARY OF age_grp_death_df---------------



---------------DATAFRAME SUMMARY OF: death_rate_df---------------
SHAPE(col,rows): (6156, 4)

DUPLICATES
Number of duplicates: 0

DATA TYPE
Entity                object
Code                  object
Year                   int64
DeathRate_per1

---
**COMMENT**: Summary of Data Checking<br>

|Area of inspection|Description|Plan|
|---|---|---|
|Duplicates|Absent|NIL|
|Missing Data|Missing values are present in the `Code` column|The `Code` column holds the country code. Further examination is needed to understand the nature of the missing data before deciding how to address it. The gaps may be related to `Entity`, as not every element in the `Entity` column is necessarily a country.|
|Data type|Appropriate data type for respective columns|NIL|
|Data values|<li>The number of unique entities and codes differs across the three datasets<li>The datasets cover different time periods<li>No obviously abnormal values (e.g. negative values) detected so far.|Explore the data in more detail, including examining each entity to determine whether it represents a country, a region or a US state.|

---

## 3.2 Further Data Exploration: `Entity` & `Code`

**ABOUT [geonamescache](https://pypi.org/project/geonamescache/)**:<br>
- A Python library that provides functions to retrieve names and other information of continents, countries as well as US states and counties as Python dictionaries. 
- We will be using this library to:
    - Standardize country names
    - Label the countries by their continent code

**ABOUT [pycountry](https://pypi.org/project/pycountry/)**<br>
- A Python package that provides information about countries and territories. It allows developers to work with ISO standard country codes, official names, common names, and other related data.
- We will be using this library to identify whether an entity is a country.

### Step 1: Check if `Code` is consistent with the ISO 3166-1 alpha-3 standard
- We use pycountry to perform this check.
- Information about [ISO 3166](https://en.wikipedia.org/wiki/ISO_3166#:~:text=ISO%203166%20is%20an%20ISO,e.g.%2C%20provinces%20or%20states)

In [11]:
#Check if code is consistent with the  ISO 3166-1 alpha-3 standard.

#Set of ISO 3166-1 alpha-3 codes in `pycountry`
alpha3_codes = set([country.alpha_3 for country in pycountry.countries])

#Create an empty set
non_alpha3_set = set()

#Populate non_alpha3_set with elements in `Code` columns but not in ISO 3166-1 alpha-3 codes
for data in dataframe_ls:
    df = globals()[data]
    cond_= ~df['Code'].isin(alpha3_codes)
    sub_df = df.loc[cond_, 'Code'] 
    non_alpha3_set = non_alpha3_set| set(sub_df)
non_alpha3_set

{'OWID_WRL', nan}

---
**COMMENT**: About `Code` columns<br>
- The three-letter codes are indeed ISO 3166-1 alpha-3 country codes, each representing a country.
- Next step is to explore the rows whose `Code` is `OWID_WRL` or missing (*NaN*).
---
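
The two kinds of non-matching value, missing codes and non-ISO codes, can be separated explicitly. The sketch below does this in plain Python on a toy list. `known_alpha3` is a small hard-coded stand-in for the set built from `pycountry`, and `is_missing` is a hypothetical helper.

```python
import math

# Small hard-coded stand-in for the alpha-3 set built from pycountry
known_alpha3 = {"AFG", "DZA", "AGO"}

# Sample `Code` values: valid codes, the OWID aggregate code, and a missing value
codes = ["AFG", "DZA", "OWID_WRL", float("nan"), "AGO"]


def is_missing(value) -> bool:
    """True for NaN/None, the usual markers of a missing code."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing = [c for c in codes if is_missing(c)]
non_iso = {c for c in codes if not is_missing(c) and c not in known_alpha3}

print(len(missing))  # 1
print(non_iso)       # {'OWID_WRL'}
```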

### Step 2: Which `Entity` values use the Code `OWID_WRL` or have a missing Code?

In [12]:
# Entity with Code: OWID_WRL
for data in dataframe_ls:
    df = globals()[data]
    cond_= df['Code']=="OWID_WRL"
    _d = df.loc[cond_, ['Entity']]
    _d['Entity'].unique()

array(['World'], dtype=object)

array(['World'], dtype=object)

array(['World'], dtype=object)

In [13]:
# Entity with no Code (nan)
no_code_entity = set()
for data in dataframe_ls:
    df = globals()[data]
    cond_= df['Code'].isna()
    _d = df.loc[cond_, ['Entity']]
    no_code_entity = no_code_entity|set(_d['Entity'].unique())
no_code_entity    

{'Andean Latin America',
 'Australasia',
 'Caribbean',
 'Central Asia',
 'Central Europe',
 'Central Latin America',
 'Central Sub-Saharan Africa',
 'Early-demographic dividend',
 'East Asia',
 'East Asia & Pacific',
 'East Asia & Pacific (IDA & IBRD)',
 'East Asia & Pacific (excluding high income)',
 'Eastern Europe',
 'Eastern Sub-Saharan Africa',
 'England',
 'Fragile and conflict affected situations',
 'Heavily indebted poor countries (HIPC)',
 'High SDI',
 'High-income Asia Pacific',
 'High-middle SDI',
 'IBRD only',
 'IDA & IBRD total',
 'IDA blend',
 'IDA only',
 'IDA total',
 'Late-demographic dividend',
 'Latin America & Caribbean',
 'Latin America & Caribbean (IDA & IBRD)',
 'Latin America & Caribbean (excluding high income)',
 'Latin America and Caribbean',
 'Least developed countries: UN classification',
 'Low & middle income',
 'Low SDI',
 'Low income',
 'Low-middle SDI',
 'Lower middle income',
 'Middle SDI',
 'Middle income',
 'North Africa and Middle East',
 'North Amer

In [14]:
#Check Entity with Code and Entity without Code are mutually exclusive
have_code_entity = set()
for data in dataframe_ls:
    df = globals()[data]
    cond_= ~df['Code'].isna()
    _d = df.loc[cond_, ['Entity']]
    have_code_entity = have_code_entity|set(_d['Entity'].unique())
have_code_entity & no_code_entity

set()

---
**COMMENT**: Findings<br>
1. Entities with a `Code` are countries, apart from `World`, which uses the code `OWID_WRL`.
1. Entities without a `Code` are regions, classifications, or groupings related to demographic, economic, or geographical areas; they do not represent individual countries.
1. 'England', 'Scotland', 'Wales' and 'Northern Ireland' do not have a `Code` because they fall under the 'United Kingdom'.
    - Subsequent analysis needs to be mindful of this.

PLAN:<br>
- Create new column `Entity_type` to label the data.
- Rename the countries to ensure consistency:
    - Countries may be named or referred to in different ways (e.g. Cape Verde vs Cabo Verde).
    - Countries will be renamed using their country code as the reference, with names taken from the Python library `GeonamesCache`.
---
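
The labelling plan can be traced on a handful of entities. The sketch below is a simplified, row-level version of the rules (`label_entity` is a hypothetical name) and uses the same keyword list as the notebook. Matching is a case-sensitive substring test, so "indebted" matches the keyword `debt`, while a name that matches no keyword falls through to `region_class`.

```python
UK_COUNTRIES = {"England", "Scotland", "Wales", "Northern Ireland"}
KEYWORDS = ["income", "IDA", "IBRD", "SDI", "demographic", "debt", "developed"]


def label_entity(entity: str, code) -> str:
    """Label one (Entity, Code) pair; `code` is None when missing."""
    if entity == "World":
        return "world"
    if code is not None or entity in UK_COUNTRIES:
        return "country"
    if any(keyword in entity for keyword in KEYWORDS):
        return "income_class"
    return "region_class"


samples = [
    ("Afghanistan", "AFG"),
    ("World", "OWID_WRL"),
    ("England", None),
    ("Heavily indebted poor countries (HIPC)", None),
    ("Fragile and conflict affected situations", None),
    ("Caribbean", None),
]
for entity, code in samples:
    print(f"{entity!r:45} -> {label_entity(entity, code)}")
```

Because the rule is purely keyword-based, groupings such as "Fragile and conflict affected situations" are labelled `region_class` even though they are not geographic regions.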

### Step 3: New column `Entity_type` and standardised country names (`Updated_Entity`)

**NEW COLUMN: `Entity_type`**

In [15]:
#UK Countries
UK_countries = {'England','Scotland','Wales','Northern Ireland'}

# Income Classifications Entity
keywords = ['income',
            'IDA',
            'IBRD',
            'SDI',
            'demographic',
            'debt',
            'developed']
Income_class_entity = {x for x in no_code_entity if any(keyword in x for keyword in keywords)}

# Region Classifications Entity
Region_class_entity = {x for x in no_code_entity if x not in Income_class_entity|UK_countries}

In [16]:
#Function for labelling data based on Entity type:
def entity_type(ent):
    if ent in UK_countries:
        return 'country'
    elif ent in Income_class_entity:
        return 'income_class'
    elif ent in Region_class_entity:
        return 'region_class'
    else:
        return 'others'

In [17]:
#Create new Entity_type column across the datasets:

for data in dataframe_ls:
    df = globals()[data].copy()
    not_na_rows = df['Code'].notna()
    
    df['Entity_type'] = df['Entity'].apply(lambda x: entity_type(x))
    df.loc[not_na_rows,'Entity_type'] = 'country'
    df.loc[df['Entity']=='World','Entity_type'] = 'world'
    
    #printouts to view data for checking
    print(f"Dataframe: {data}")
    print("Unique Entity_type in dataframe:")
    df['Entity_type'].unique().tolist()
    df.head(3)
    globals()[data] = df

Dataframe: age_grp_death_df
Unique Entity_type in dataframe:


['country', 'region_class', 'income_class', 'world']

Unnamed: 0,Entity,Code,Year,Age_group,Deaths,Entity_type
0,Afghanistan,AFG,1990,Under 5,184.606435,country
1,Afghanistan,AFG,1991,Under 5,191.658193,country
2,Afghanistan,AFG,1992,Under 5,197.140197,country


Dataframe: death_rate_df
Unique Entity_type in dataframe:


['country', 'region_class', 'income_class', 'world']

Unnamed: 0,Entity,Code,Year,DeathRate_per100K,Entity_type
0,Afghanistan,AFG,1990,6.80293,country
1,Afghanistan,AFG,1991,6.973494,country
2,Afghanistan,AFG,1992,6.989882,country


Dataframe: inc_rate_df
Unique Entity_type in dataframe:


['country', 'income_class', 'region_class', 'world']

Unnamed: 0,Entity,Code,Year,IncRate_per100K,Entity_type
0,Afghanistan,AFG,2000,107.1,country
1,Afghanistan,AFG,2005,46.5,country
2,Afghanistan,AFG,2010,23.9,country


**STANDARDISING COUNTRY NAMES** (`Updated_Entity`)

In [18]:
#Step 1: create a dictionary with key: country code, value: country name
gc = GeonamesCache()

# Access data for specific entities
gc_countries = gc.get_countries_by_names()

# List of country CODE in GeonamesCache
gc_country_codes = [value['iso3'] for key, value in gc_countries.items()]
# List of country NAME in GeonamesCache
gc_country_names = [key for key, value in gc_countries.items()]
#Create dictionary
gc_country_dict = {k:v for k,v in zip(gc_country_codes,gc_country_names)}

In [19]:
#Step 2, create Updated Entity Column
for data in dataframe_ls:
    df = globals()[data].copy()
    
    #1. Duplicate the original entity column
    df['Updated_Entity'] = df['Entity']
    
    #2. Rename country names
    cond_1 = df['Code'].notna() 
    cond_2 = df['Entity']!= 'World'
    df.loc[cond_1 & cond_2,'Updated_Entity'] = df['Code'].map(gc_country_dict)
    
    #3 view changes
    print(f"Dataframe: {data}")
    print("Country names with updates:")
    cond = df['Updated_Entity'] != df['Entity']
    df.loc[cond, ['Entity','Updated_Entity']].drop_duplicates()
    
    #4 Update dataframe
    globals()[data] = df
    

Dataframe: age_grp_death_df
Country names with updates:


Unnamed: 0,Entity,Updated_Entity
945,Cape Verde,Cabo Verde
1269,Congo,Republic of the Congo
1323,Cote d'Ivoire,Ivory Coast
1431,Czech Republic,Czechia
1458,Democratic Republic of Congo,Democratic Republic of the Congo
3213,Macedonia,North Macedonia
3510,Micronesia (country),Micronesia
4158,Palestine,Palestinian Territory
5238,Swaziland,Eswatini
5454,Timor,Timor Leste


Dataframe: death_rate_df
Country names with updates:


Unnamed: 0,Entity,Updated_Entity
945,Cape Verde,Cabo Verde
1269,Congo,Republic of the Congo
1323,Cote d'Ivoire,Ivory Coast
1431,Czech Republic,Czechia
1458,Democratic Republic of Congo,Democratic Republic of the Congo
3213,Macedonia,North Macedonia
3510,Micronesia (country),Micronesia
4158,Palestine,Palestinian Territory
5238,Swaziland,Eswatini
5454,Timor,Timor Leste


Dataframe: inc_rate_df
Country names with updates:


Unnamed: 0,Entity,Updated_Entity
64,Cape Verde,Cabo Verde
88,Congo,Republic of the Congo
96,Cote d'Ivoire,Ivory Coast
100,Democratic Republic of Congo,Democratic Republic of the Congo
440,Swaziland,Eswatini
456,Timor,Timor Leste
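
`Series.map` returns NaN for any code that is absent from the lookup dictionary. A defensive variant, sketched below with a made-up two-entry dictionary and an invented code `ZZZ`, falls back to the original name in that case. This is an optional safeguard, not what the notebook does.

```python
import pandas as pd

# Made-up lookup table that deliberately lacks the invented code "ZZZ"
code_to_name = {"CPV": "Cabo Verde", "CZE": "Czechia"}

df = pd.DataFrame({
    "Entity": ["Cape Verde", "Czech Republic", "Exampleland"],
    "Code": ["CPV", "CZE", "ZZZ"],
})

# map() gives NaN for unknown codes; fillna() restores the original name there
df["Updated_Entity"] = df["Code"].map(code_to_name).fillna(df["Entity"])
print(df)
```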


In [20]:
#Check that each code only has ONE unique Entity
for data in dataframe_ls:
    df = globals()[data].copy()
    grouped = df.groupby("Code")
    count_unique_names = grouped["Entity"].nunique()
    codes_with_multiple_names = count_unique_names[count_unique_names > 1]
    codes_with_multiple_names

Series([], Name: Entity, dtype: int64)

Series([], Name: Entity, dtype: int64)

Series([], Name: Entity, dtype: int64)

---
**COMMENT**: What's done and What's Next<br>
1. Entities with a `Code` are countries, apart from `World`, which uses the code `OWID_WRL`.
1. Entities without a `Code` are regions, classifications, or groupings related to demographic, economic, or geographical areas; they do not represent individual countries.
1. 'England', 'Scotland', 'Wales' and 'Northern Ireland' do not have a `Code` because they fall under the 'United Kingdom'.
    - Subsequent analysis needs to be mindful of this.
1. Checked that each `Code` maps to only one unique `Entity`.

`Update`:
1. New column `Entity_type` created, labelling each entity as `country`, `region_class`, `income_class` or `world`.
1. New column `Updated_Entity` created to ensure country names are standardised.

PLAN:<br>
- Next, we will explore whether the unique entities are the same across the three datasets.
---

### Step 4: Are the unique entities the same across the datasets?

Are the unique entities the same between `age_grp_death_df` & `death_rate_df`?

In [21]:
{x for x in age_grp_death_df['Updated_Entity']} == {x for x in death_rate_df['Updated_Entity']}

True

Are the unique entities the same between `inc_rate_df` and the other two datasets?
> - The entities in `age_grp_death_df` and `death_rate_df` are identical.
> - Earlier on we noticed that the number of unique entities in `inc_rate_df` is smaller than in the other two datasets.
> - As such, we will examine the difference in entities between the incidence dataset (`inc_rate_df`) and the death datasets (`age_grp_death_df` and `death_rate_df`).
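
The comparison comes down to three set operations: intersection, symmetric difference and difference. The sketch below shows them on toy sets; the set contents are made up for illustration.

```python
# Toy entity sets standing in for the death and incidence datasets
death_entities = {"Afghanistan", "Albania", "World", "Western Europe"}
inc_entities = {"Afghanistan", "World", "IDA only"}

in_both = death_entities & inc_entities          # intersection
in_exactly_one = death_entities ^ inc_entities   # symmetric difference
only_in_inc = inc_entities - death_entities      # difference
only_in_death = death_entities - inc_entities

print(sorted(in_both))
print(sorted(in_exactly_one))
print(sorted(only_in_inc))
```

The symmetric difference is always the union of the two one-sided differences, which gives a quick consistency check on the counts.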

In [22]:
ent_in_deathdata = set(age_grp_death_df['Updated_Entity'])
ent_in_incdata = set(inc_rate_df['Updated_Entity'])
ent_in_all = ent_in_deathdata & ent_in_incdata
#Entities in exactly one of the two groups (symmetric difference)
ent_in_either_notboth = ent_in_deathdata ^ ent_in_incdata
#Entities found only in the incidence dataset
ent_onlyin_incdata = ent_in_incdata - ent_in_deathdata

print(f"""Number of Entity that:
Are in all three datasets: {len(ent_in_all)}
Are exclusively in either incidence dataset or death datasets: {len(ent_in_either_notboth)}
Are exclusively in incident dataset: {len(ent_onlyin_incdata)}""")

Number of Entity that:
Are in all three datasets: 102
Are exclusively in either incidence dataset or death datasets: 151
Are exclusively in incident dataset: 25


In [23]:
#Nature of the entity in all three datasets:
cond = death_rate_df['Updated_Entity'].isin(ent_in_all)
_d = death_rate_df.loc[cond,['Updated_Entity','Entity_type']].drop_duplicates()
_d
_d['Entity_type'].value_counts()

Unnamed: 0,Updated_Entity,Entity_type
0,Afghanistan,country
54,Algeria,country
162,Angola,country
216,Argentina,country
351,Azerbaijan,country
...,...,...
5940,Vietnam,country
6048,World,world
6075,Yemen,country
6102,Zambia,country


country         99
region_class     2
world            1
Name: Entity_type, dtype: int64

In [24]:
#Nature of the entity not found in inc_rate_df but found in other two data sets:
cond = ~death_rate_df['Updated_Entity'].isin(ent_in_all)
_d = death_rate_df.loc[cond,['Updated_Entity','Entity_type']].drop_duplicates()
_d
_d['Entity_type'].value_counts()

Unnamed: 0,Updated_Entity,Entity_type
27,Albania,country
81,American Samoa,country
108,Andean Latin America,region_class
135,Andorra,country
189,Antigua and Barbuda,country
...,...,...
5805,U.S. Virgin Islands,country
5832,Uruguay,country
5967,Wales,country
5994,Western Europe,region_class


country         100
region_class     20
income_class      6
Name: Entity_type, dtype: int64

In [25]:
#Nature of the entity not found in death_rate_df but in inc_rate_df:
cond = ~inc_rate_df['Updated_Entity'].isin(ent_in_all)
_d = inc_rate_df.loc[cond,['Updated_Entity','Entity_type']].drop_duplicates()
_d
_d['Entity_type'].value_counts()

Unnamed: 0,Updated_Entity,Entity_type
112,Early-demographic dividend,income_class
116,East Asia & Pacific,region_class
120,East Asia & Pacific (IDA & IBRD),income_class
124,East Asia & Pacific (excluding high income),income_class
148,Fragile and conflict affected situations,region_class
188,Heavily indebted poor countries (HIPC),income_class
196,IBRD only,income_class
200,IDA & IBRD total,income_class
204,IDA blend,income_class
208,IDA only,income_class


income_class    22
region_class     3
Name: Entity_type, dtype: int64

In [26]:
#Entity_types breakdown:
for data in dataframe_ls:
    print("\n",data)
    df = globals()[data].copy()
    _d = df.loc[:,['Updated_Entity','Entity_type']].drop_duplicates()
    _d['Entity_type'].value_counts()


 age_grp_death_df


country         199
region_class     22
income_class      6
world             1
Name: Entity_type, dtype: int64


 death_rate_df


country         199
region_class     22
income_class      6
world             1
Name: Entity_type, dtype: int64


 inc_rate_df


country         99
income_class    22
region_class     5
world            1
Name: Entity_type, dtype: int64

---
**COMMENT**: What's done and What's Next<br>
1. The death datasets (`age_grp_death_df` and `death_rate_df`) have identical sets of entities.
1. `inc_rate_df` has a different set of entities from the other two datasets.

|||
|---|---|
|Found in all three datasets:|<li>99 countries<li>2 regions<li>0 income class<li>`World`|
|Found in death-related datasets only:|<li>100 countries<li>20 regions<li>6 income class|
|Found in incidence rate dataset only:|<li>0 countries<li>3 regions<li>22 income class|


PLAN:<br>
- The insights gained here will be useful in subsequent cross-data analysis.
- Next, we will explore whether all entities have records for every year, or whether some entities lack records in certain years.
---
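
One way the overlap could be used in cross-data analysis is an inner join on entity and year, which keeps only the rows present in both tables. The sketch below uses the notebook's column names with made-up values; it is an illustration of the idea, not part of the cleaning steps.

```python
import pandas as pd

# Made-up rows; only (Afghanistan, 2000) appears in both tables
death = pd.DataFrame({
    "Updated_Entity": ["Afghanistan", "Afghanistan", "Albania"],
    "Year": [2000, 2001, 2000],
    "DeathRate_per100K": [1.0, 2.0, 3.0],
})
inc = pd.DataFrame({
    "Updated_Entity": ["Afghanistan", "IDA only"],
    "Year": [2000, 2000],
    "IncRate_per100K": [10.0, 20.0],
})

# Inner join drops entities and years that exist in only one dataset
merged = death.merge(inc, on=["Updated_Entity", "Year"], how="inner")
print(merged)
```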

## 3.3 Further Data Exploration: `Entity` and `Year`
- Explore whether all entities have records for every year, or whether some entities lack records in certain years.
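
Counting unique entities per year shows whether coverage is complete, but not which records are absent when it is not. A minimal sketch of one way to list the missing (entity, year) pairs, using made-up records:

```python
from itertools import product

# Made-up records: entity "B" has no entry for 2005
records = [("A", 2000), ("A", 2005), ("B", 2000)]

entities = {entity for entity, _ in records}
years = {year for _, year in records}

# Every combination that should exist, minus the ones that do
missing_pairs = sorted(set(product(entities, years)) - set(records))
print(missing_pairs)  # [('B', 2005)]
```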

In [27]:
for data in dataframe_ls:
    df = globals()[data].copy()
    print(f"\nDataframe: {data}")
    print(f"Number of unique Entity in dataset: {df['Updated_Entity'].nunique()}")
    print("Number of unique Entity per year in dataset:")
    df.groupby('Year')['Updated_Entity'].nunique()


Dataframe: age_grp_death_df
Number of unique Entity in dataset: 228
Number of unique Entity per year in dataset:


Year
1990    228
1991    228
1992    228
1993    228
1994    228
1995    228
1996    228
1997    228
1998    228
1999    228
2000    228
2001    228
2002    228
2003    228
2004    228
2005    228
2006    228
2007    228
2008    228
2009    228
2010    228
2011    228
2012    228
2013    228
2014    228
2015    228
2016    228
Name: Updated_Entity, dtype: int64


Dataframe: death_rate_df
Number of unique Entity in dataset: 228
Number of unique Entity per year in dataset:


Year
1990    228
1991    228
1992    228
1993    228
1994    228
1995    228
1996    228
1997    228
1998    228
1999    228
2000    228
2001    228
2002    228
2003    228
2004    228
2005    228
2006    228
2007    228
2008    228
2009    228
2010    228
2011    228
2012    228
2013    228
2014    228
2015    228
2016    228
Name: Updated_Entity, dtype: int64


Dataframe: inc_rate_df
Number of unique Entity in dataset: 127
Number of unique Entity per year in dataset:


Year
2000    127
2005    127
2010    127
2015    127
Name: Updated_Entity, dtype: int64

---
**COMMENT**: Findings<br>
1. Every unique entity has a data entry for each year reported in its dataset.
1. Unlike the other two datasets, which have records for every year from 1990 to 2016, `inc_rate_df` only has records for 2000, 2005, 2010 and 2015.

PLAN:<br>
Explore whether every entity and year combination has all the age groups.

---

## 3.4 Further Data Exploration: `Entity` and `Age_group`
- Explore whether every entity and year combination has all the age groups.

In [28]:
age_grp_death_df.head()

Unnamed: 0,Entity,Code,Year,Age_group,Deaths,Entity_type,Updated_Entity
0,Afghanistan,AFG,1990,Under 5,184.606435,country,Afghanistan
1,Afghanistan,AFG,1991,Under 5,191.658193,country,Afghanistan
2,Afghanistan,AFG,1992,Under 5,197.140197,country,Afghanistan
3,Afghanistan,AFG,1993,Under 5,207.357753,country,Afghanistan
4,Afghanistan,AFG,1994,Under 5,226.209363,country,Afghanistan


In [29]:
age_grp_death_df['Age_group'].unique()

array(['Under 5', '70 or older', '5-14', '15-49', '50-69'], dtype=object)

In [30]:
_s = age_grp_death_df.groupby(['Age_group','Year'])['Updated_Entity'].nunique()
total_unique_ent = age_grp_death_df['Updated_Entity'].nunique()
len(_s) == len(_s[_s==total_unique_ent])

True

---
**COMMENT**: Findings<br>
1. Every entity and year combination has all the age groups.
---
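
The check above groups by age group and year. A complementary view, sketched below on made-up data, groups by entity and year instead and lists any pair that does not cover every age group.

```python
import pandas as pd

# Made-up long-format data: entity "B" lacks the "5-14" group in 1990
df = pd.DataFrame({
    "Updated_Entity": ["A", "A", "B"],
    "Year": [1990, 1990, 1990],
    "Age_group": ["Under 5", "5-14", "Under 5"],
})

expected_groups = df["Age_group"].nunique()
groups_per_pair = df.groupby(["Updated_Entity", "Year"])["Age_group"].nunique()

# Pairs that do not cover every age group
incomplete = groups_per_pair[groups_per_pair < expected_groups]
print(incomplete)
```

An empty result would mean every entity and year combination is complete.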

# Part 4: Notebook Summary & Export

The three datasets have been checked for obvious issues and their nature has been explored.
In the next notebook, we will perform a more focused and granular analysis with data visualisations.

**SUMMARY OF THE THREE DATA SETS**
|DataSet|Description|Similarity with other datasets|
|---|---|---|
|age_grp_death_df|<li>228 unique entities (199 countries, 22 regions, 6 income classes, plus `World`)<li>Shows the number of malaria deaths in 5 age groups for the years 1990-2016|<li>Identical set of entities to `death_rate_df`<li>Has none of the income-class entities found in `inc_rate_df`<li>Of the 199 countries in this dataset, 99 are also found in `inc_rate_df`.|
|death_rate_df|<li>228 unique entities (199 countries, 22 regions, 6 income classes, plus `World`)<li>Shows malaria death rates for the years 1990-2016|<li>Identical set of entities to `age_grp_death_df`<li>Has none of the income-class entities found in `inc_rate_df`<li>Of the 199 countries in this dataset, 99 are also found in `inc_rate_df`.|
|inc_rate_df|<li>127 unique entities (99 countries, 5 regions, 22 income classes, plus `World`)<li>Shows malaria incidence rates for the years 2000, 2005, 2010 and 2015|<li>All countries in this dataset are also found in the other two datasets<li>Has the most income-class entities of the three datasets|

In [31]:
for data in dataframe_ls:
    globals()[data].to_csv(f"../data_output/{data}.csv", index=False)