# Data analysis of GLOBAL_PUBLIC_HEALTH_DATASET

This notebook explores the GLOBAL_PUBLIC_HEALTH_DATASET `.csv` file, which covers 183 countries and 5 official WHO regions, using the Pandas package in Python 3. By applying different Pandas functions, target subsets can be selected and exported as a `.csv` file for further research. In this example, the filtered data could be used to examine factors that influence the total population.

### Step overview
**Application required: Anaconda (installation process included in step 1)**
1. Preparation
2. Import the Pandas package
3. Import and read GLOBAL_PUBLIC_HEALTH_dataset.csv as a `.csv` file in Python 3
4. Filter out the target dataset using Pandas functions
5. Export the filtered dataset as a `.csv` file

### STEP 1: Get started
- Install Anaconda through [Anaconda installer](https://unc-libraries-data.github.io/Python/Setup.html#Anaconda-Installation).
- Open Anaconda and click on "Launch" under Jupyter Lab.
- Navigate the list of directories in the file browser to find the directory that contains your GLOBAL_PUBLIC_HEALTH_DATASET.csv file. Click on the blue "+" button in the top left corner and then create a new Python 3 (ipykernel) `.ipynb` file in the same directory.


### STEP 2: Import Pandas package
- **Pandas** is a Python package, included with Anaconda by default, that lets us store and analyze large datasets as tabular, two-dimensional objects (dataframes) with familiar features like rows, columns, and headers. It provides a series of tools and functions for restructuring and filtering parts of the data.
- Import the Pandas package using the following command:


In [4]:
import numpy as np
import pandas as pd

### STEP 3: Import and read GLOBAL_PUBLIC_HEALTH_dataset.csv as `.csv` file in Python 3
- The function `pd.read_csv()` reads tabular data from a comma-separated values (CSV) file into a dataframe object, which we name `raw_data` here; any other variable name would work.  
  Inside the parentheses, we pass the path to the `.csv` file on the computer. Because the `.csv` file and the notebook `.ipynb` file were placed in the same directory in STEP 1, we can pass just the name of the `.csv` file.
- The following command is used to import the file:

In [26]:
raw_data = pd.read_csv("public_health_dataset.csv")
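
To see what `pd.read_csv()` returns without depending on the real file, the sketch below parses a small in-memory excerpt. The three rows are copied from the output shown later in this notebook; the `csv_text` and `sample` names are hypothetical and exist only for this illustration.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for the real .csv file.
csv_text = """Name,Life expectancy at birthb (years),Maternal mortality ratioc (per 100 000 live births)
Australia,83.0,6
Austria,81.6,5
Chile,80.7,13
"""

# read_csv accepts a file path or any file-like object;
# StringIO lets us parse text without a file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.head())  # first rows, to confirm the columns parsed as expected
```

Checking `.shape` and `.head()` right after reading is a quick way to confirm that the file was found and that the header row was interpreted correctly.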

### STEP 4: Filter out target data using Pandas functions
- In this notebook, the goal is to generate a subset `.csv` file that includes the columns "**Name**", "**Total populationa (000s)**", "**Life expectancy at birthb (years)**", "**Healthy life expectancy at birthb (years)**" and "**Maternal mortality ratioc (per 100 000 live births)**" for countries with a maternal mortality ratio (per 100 000 live births) of **10 or less** and a life expectancy at birth of **80 or more**.
- To start, we use the attribute `<Dataframe name>.dtypes` to determine the data type of each column. A column can only be compared against thresholds such as **10** or **80** when its data type is numeric (`int` or `float`).
  Note: in this example, the `<Dataframe name>` is `raw_data`.

In [28]:
raw_data.dtypes

Name                                                                                 object
Total populationa (000s)                                                             object
Life expectancy at birthb (years)                                                   float64
Healthy life expectancy at birthb (years)                                           float64
Maternal mortality ratioc (per 100 000 live births)                                   int64
Proportion of births attended by skilled health personneld (%)                       object
Under-five mortality ratee (per 1000 live births)                                     int64
Neonatal mortality ratee (per 1000 live births)                                      object
New HIV infectionsf (per 1000 uninfected population)                                 object
Tuberculosis incidenceg (per 100 000 population)                                      int64
Malaria incidenceh (per 1000 population at risk)                                
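
Several columns in the listing above are `object`, which usually means they contain text such as the space-separated thousands in `25 500` or the `-` placeholder for missing values. As a minimal sketch, assuming the separator is an ordinary space and that `-` marks a missing value, such columns could be converted to numbers as follows. The `population` and `births` series are hypothetical excerpts built from values in this notebook's output.

```python
import pandas as pd

# Hypothetical excerpts of two object-typed columns.
population = pd.Series(["25 500", "9 006", "11 590"],
                       name="Total populationa (000s)")
births = pd.Series(["99", "98", "-"],
                   name="Proportion of births attended by skilled health personneld (%)")

# Assumption: the thousands separator is a plain space. Remove it, then parse.
population_num = pd.to_numeric(population.str.replace(" ", "", regex=False))

# errors="coerce" turns entries that cannot be parsed, such as "-", into NaN
# instead of raising an error.
births_num = pd.to_numeric(births, errors="coerce")

print(population_num.dtype)
print(births_num.dtype)
```

This conversion is not needed for the two columns filtered in this notebook, but it would be required before applying a numeric condition to "Total populationa (000s)".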

- Since both columns we filter on are numeric ("**Life expectancy at birthb (years)**" is `float64` and "**Maternal mortality ratioc (per 100 000 live births)**" is `int64`), the next step is to keep only the rows where "**Life expectancy at birthb (years)**" is **80** or more (>= 80).
- `<Dataframe name>.loc[]` is a frequently used indexer for selecting from a large data frame. Row and column labels, or a Boolean condition on the rows, can be placed in the brackets to select rows and columns.  
  In the example below, `raw_data["Life expectancy at birthb (years)"]` refers only to the column named "Life expectancy at birthb (years)", and comparing it with `>=80` marks the rows to keep.

In [30]:
raw_data.loc[raw_data["Life expectancy at birthb (years)"]>=80]

Unnamed: 0,Name,Total populationa (000s),Life expectancy at birthb (years),Healthy life expectancy at birthb (years),Maternal mortality ratioc (per 100 000 live births),Proportion of births attended by skilled health personneld (%),Under-five mortality ratee (per 1000 live births),Neonatal mortality ratee (per 1000 live births),New HIV infectionsf (per 1000 uninfected population),Tuberculosis incidenceg (per 100 000 population),Malaria incidenceh (per 1000 population at risk),Hepatitis B surface antigen (HBsAg) prevalence among children under 5 yearsi (%),Reported number of people requiring interventions against NTDsj
7,Australia,25 500,83.0,70.9,6,99,4,2,0.03,7,-,0.13,20 401
8,Austria,9 006,81.6,70.9,5,98,4,2,-,5,-,0.16,22
15,Belgium,11 590,81.4,70.6,5,-,4,2,-,8,-,0.09,18
30,Canada,37 742,82.2,71.3,10,98,5,3,-,6,-,0.34,0
33,Chile,19 116,80.7,70.0,13,100,7,4,0.26,15,-,0.03,16
38,Costa Rica,5 094,80.8,70.0,27,99,8,6,0.34,10,<0.1,0.02,10 590
42,Cyprus,1 207,83.1,72.4,6,99,3,2,-,6,-,0.34,1
46,Denmark,5 792,81.3,71.0,4,95,4,3,0.02,5,-,0.68,0
58,Finland,5 541,81.6,71.0,3,100,2,1,-,4,-,0.81,3
59,France,65 274,82.5,72.1,8,98,4,3,-,8,-,0.15,120
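
The filter works in two stages: the comparison produces a Boolean series (a mask), and `.loc[]` keeps the rows where the mask is `True`. The sketch below makes both stages visible on a toy data frame. The first two rows come from the output above, while the third row is a labeled placeholder, not real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Australia", "Chile", "Placeholder (hypothetical)"],
    "Life expectancy at birthb (years)": [83.0, 80.7, 75.0],
})

# Stage 1: the comparison yields one True/False value per row.
mask = toy["Life expectancy at birthb (years)"] >= 80
print(mask.tolist())

# Stage 2: .loc keeps only the rows where the mask is True.
filtered = toy.loc[mask]
print(filtered["Name"].tolist())
```

Note that the filtered frame keeps the original row labels, which is why the index in the output above starts at 7 rather than 0.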


- To save this newly filtered data frame for future reference, we can assign it to a new variable: `subset_lifeexpectancy`.


In [31]:
subset_lifeexpectancy = raw_data.loc[raw_data["Life expectancy at birthb (years)"]>=80]

- Similarly, we keep only the rows where "**Maternal mortality ratioc (per 100 000 live births)**" is **10** or less (<= 10) and name the new data frame `subset_maternalmortality`.

In [39]:
subset_maternalmortality = raw_data.loc[raw_data["Maternal mortality ratioc (per 100 000 live births)"]<=10]

- Moving on, we can merge `subset_lifeexpectancy` and `subset_maternalmortality` to obtain a new data frame of countries with Maternal mortality ratioc (per 100 000 live births) of 10 or less and Life expectancy at birthb (years) of 80 or more.
- The function `pd.merge()` combines the two data frames and, by default, uses all column names that appear in both as keys. The argument `how="inner"` keeps only the rows that appear in both data frames.

In [43]:
pd.merge(subset_lifeexpectancy,subset_maternalmortality, how="inner")

Unnamed: 0,Name,Total populationa (000s),Life expectancy at birthb (years),Healthy life expectancy at birthb (years),Maternal mortality ratioc (per 100 000 live births),Proportion of births attended by skilled health personneld (%),Under-five mortality ratee (per 1000 live births),Neonatal mortality ratee (per 1000 live births),New HIV infectionsf (per 1000 uninfected population),Tuberculosis incidenceg (per 100 000 population),Malaria incidenceh (per 1000 population at risk),Hepatitis B surface antigen (HBsAg) prevalence among children under 5 yearsi (%),Reported number of people requiring interventions against NTDsj
0,Australia,25 500,83.0,70.9,6,99,4,2,0.03,7,-,0.13,20 401
1,Austria,9 006,81.6,70.9,5,98,4,2,-,5,-,0.16,22
2,Belgium,11 590,81.4,70.6,5,-,4,2,-,8,-,0.09,18
3,Canada,37 742,82.2,71.3,10,98,5,3,-,6,-,0.34,0
4,Cyprus,1 207,83.1,72.4,6,99,3,2,-,6,-,0.34,1
5,Denmark,5 792,81.3,71.0,4,95,4,3,0.02,5,-,0.68,0
6,Finland,5 541,81.6,71.0,3,100,2,1,-,4,-,0.81,3
7,France,65 274,82.5,72.1,8,98,4,3,-,8,-,0.15,120
8,Germany,83 784,81.7,70.9,7,96,4,2,0.03,6,-,0.21,116
9,Greece,10 423,81.1,70.9,3,100,4,2,0.09,5,-,0.14,41
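
To illustrate how an inner merge on all shared columns behaves, here is a small sketch with two toy frames. The shortened column names (`life_expectancy`, `maternal_mortality`) and the placeholder row are assumptions made for brevity; the other values come from the output above.

```python
import pandas as pd

# Toy stand-in for the rows with life expectancy >= 80.
high_life = pd.DataFrame({
    "Name": ["Australia", "Canada", "Chile"],
    "life_expectancy": [83.0, 82.2, 80.7],
    "maternal_mortality": [6, 10, 13],
})

# Toy stand-in for the rows with maternal mortality <= 10.
low_maternal = pd.DataFrame({
    "Name": ["Australia", "Canada", "Placeholder (hypothetical)"],
    "life_expectancy": [83.0, 82.2, 75.0],
    "maternal_mortality": [6, 10, 9],
})

# Without an `on=` argument, merge uses every column name the two frames
# share as the key, so a row must match in all of them to be kept.
both = pd.merge(high_life, low_maternal, how="inner")
print(both["Name"].tolist())
print(list(both.index))
```

Only rows present in both frames survive, and the result receives a fresh index starting at 0, which matches the renumbered index in the merged output above.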


- As in the previous steps, assign the result to a new variable: `subset_both`.

In [None]:
subset_both=pd.merge(subset_lifeexpectancy,subset_maternalmortality, how="inner")
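
As an alternative to filtering twice and merging, both conditions can be combined into a single Boolean mask with `&`. This is a different technique from the merge used in this notebook, shown here only as a sketch on a toy frame whose values are taken from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Australia", "Canada", "Chile", "Costa Rica"],
    "Life expectancy at birthb (years)": [83.0, 82.2, 80.7, 80.8],
    "Maternal mortality ratioc (per 100 000 live births)": [6, 10, 13, 27],
})

life_ok = toy["Life expectancy at birthb (years)"] >= 80
maternal_ok = toy["Maternal mortality ratioc (per 100 000 live births)"] <= 10

# `&` combines the two masks element-wise: a row is kept only if both are True.
combined = toy.loc[life_ok & maternal_ok]
print(combined["Name"].tolist())
```

One practical difference is that this approach keeps the original row labels, whereas `pd.merge()` renumbers the rows from 0.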

- The last part of Step 4 is to select the target columns "**Name**", "**Total populationa (000s)**", "**Life expectancy at birthb (years)**", "**Healthy life expectancy at birthb (years)**" and "**Maternal mortality ratioc (per 100 000 live births)**" from `subset_both`.
- For this we use the indexer `<Dataframe name>.iloc[]`. Unlike `<Dataframe name>.loc[]`, it takes integer positions in the brackets.  
  In this example, `[:,:5]` selects all rows and the first 5 columns.

In [50]:
subset_both.iloc[:,:5] #all rows and the first 5 columns. 

Unnamed: 0,Name,Total populationa (000s),Life expectancy at birthb (years),Healthy life expectancy at birthb (years),Maternal mortality ratioc (per 100 000 live births)
0,Australia,25 500,83.0,70.9,6
1,Austria,9 006,81.6,70.9,5
2,Belgium,11 590,81.4,70.6,5
3,Canada,37 742,82.2,71.3,10
4,Cyprus,1 207,83.1,72.4,6
5,Denmark,5 792,81.3,71.0,4
6,Finland,5 541,81.6,71.0,3
7,France,65 274,82.5,72.1,8
8,Germany,83 784,81.7,70.9,7
9,Greece,10 423,81.1,70.9,3


- As in the previous steps, assign the result to a new variable: `subset`.

In [51]:
subset = subset_both.iloc[:,:5]
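
Selecting by position relies on the target columns being the first five. A sketch of the label-based equivalent with `.loc[]` is shown below on a toy frame; the `wanted` list is a hypothetical helper name, and the values are taken from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Australia", "Austria"],
    "Total populationa (000s)": ["25 500", "9 006"],
    "Life expectancy at birthb (years)": [83.0, 81.6],
    "Healthy life expectancy at birthb (years)": [70.9, 70.9],
    "Maternal mortality ratioc (per 100 000 live births)": [6, 5],
    "Under-five mortality ratee (per 1000 live births)": [4, 4],
})

# Hypothetical helper list naming the five target columns explicitly.
wanted = [
    "Name",
    "Total populationa (000s)",
    "Life expectancy at birthb (years)",
    "Healthy life expectancy at birthb (years)",
    "Maternal mortality ratioc (per 100 000 live births)",
]

by_name = toy.loc[:, wanted]       # select columns by label
by_position = toy.iloc[:, :5]      # select the first five columns by position

# Both selections give the same result for this column order.
print(by_name.equals(by_position))
```

Selecting by name is more verbose but keeps working if the column order of the source file changes.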

### STEP 5: Export the filtered dataset as `.csv` file
- The method `.to_csv()` is used for this.  
  Inside the parentheses we add the filename and extension. In this example that is `subset.csv`, which exports a `.csv` file to our working directory.
- The argument `index=False` tells Pandas not to write the row index numbers to the file.

In [53]:
subset.to_csv("subset.csv", index=False)
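
To see the effect of `index=False`, the sketch below writes a toy frame to in-memory buffers instead of a file and reads it back. The buffer names are hypothetical, and the two rows are taken from the output earlier in this notebook.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Name": ["Australia", "Austria"],
    "Life expectancy at birthb (years)": [83.0, 81.6],
})

# Write without the index, then read the text back in.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.columns.tolist())

# Write with the default index=True for comparison: the row labels are
# saved as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(restored_with_index.columns.tolist())
```

When the index is written, reading the file back produces an additional column that Pandas labels `Unnamed: 0`, which is usually unwanted in an exported subset.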