vignettes/DDharmonize_validate_BirthCounts.Rmd

---
title: "DDharmonize_validate_BirthCounts"
output: rmarkdown::html_vignette
author: ""
description: >
  This function implements a workflow for birth records extracted from vital registration databases and censuses. The workflow includes extracting data from DemoData, harmonizing age groups, identifying full series, validating totals by age, and eventually producing clean and harmonized datasets for each location.
vignette: >
  %\VignetteIndexEntry{DDharmonize_validate_BirthCounts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)


```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(rddharmony)

```
## Introduction

`DDharmonize_validate_BirthCounts()` implements a workflow for birth records extracted from vital registration databases and censuses. The workflow includes extracting data from the UNPD (United Nations Population Division) database, harmonizing age groups, identifying full series, validating totals by age, and eventually producing clean and harmonized datasets for each location. See the [harmonization workflow](https://shelmith-kariuki.github.io/rddharmony/articles/Harmonization_Workflow.html) article for a detailed overview of this process.

The birth records are grouped into two types of data:

+ Births by age of mother and sex of child

+ Total births by sex of child

## Function definition

```{r }

# clean_df <- DDharmonize_validate_BirthCounts(locid, 
#                                              times, 
#                                              process = c("census", "vr"),
#                                              return_unique_ref_period = TRUE,
#                                              retainKeys = FALSE)

                                             
# Example: extracting Sweden's data
# clean_df <- DDharmonize_validate_BirthCounts(locid = 752,
#                                              times = c(2010, 2011),
#                                              process = c("census", "vr"),
#                                              return_unique_ref_period = TRUE,
#                                              retainKeys = FALSE)

```


## Function arguments

The function takes the following arguments:

**`locid`:** A numeric variable giving the location id of the location of interest. Run `View(get_locations())` to get the list of valid location ids, which are listed in the `PK_LocID` variable.
You can also run `check_locid(insert locid here)` to check whether a location id is valid (i.e. whether it is among the locations on the UNPD website). With a valid id, `check_locid()` returns a message confirming that the id is valid, along with the name of the corresponding location. With an invalid id, it returns a message directing the user to run `View(get_locations())` to get the list of valid location ids. See the example below.

```{r checking validity of location ids}
## valid id
## check_locid(752)

## invalid id
## check_locid(2021)
```


**`times`:** The period of the data to be extracted. You can extract a single year of data, e.g. `times = 2020`, or a longer period, e.g. `times = c(1950, 2020)`.

**`process`:** The process through which the data were collected or obtained, i.e. either a census (`"census"`) or vital registration (`"vr"`). By default, the function pulls data obtained through both processes.

**`return_unique_ref_period`:** Specifies whether the returned data should contain one unique id (`return_unique_ref_period == TRUE`) or several ids (`return_unique_ref_period == FALSE`) per time label. An id uniquely identifies each set of records based on `LocID`, `LocName`, `DataProcess`, `ReferencePeriod`, `DataSourceName`, `StatisticalConceptName`, `DataTypeName` and `DataReliabilityName`. The definitions of these variables are provided later in this article.
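
The difference between one id and several ids per time label can be illustrated on a toy data frame. The sketch below is only an illustration in base R: the records are made up, the `" - "` separator used to build the id is an assumption, and keeping the first id per time label is an arbitrary choice that does not reproduce the rule the package applies.

```{r}
# Toy records (made-up values): two data sources report births for 2010.
toy <- data.frame(
  LocID          = c(752, 752, 752),
  LocName        = "Sweden",
  TimeLabel      = c("2010", "2010", "2011"),
  DataSourceName = c("Source A", "Source B", "Source A"),
  stringsAsFactors = FALSE
)

# Assumption: an id is a plain concatenation of the identifying fields;
# the " - " separator is illustrative only.
toy$id <- paste(toy$LocID, toy$LocName, toy$TimeLabel,
                toy$DataSourceName, sep = " - ")

# Number of distinct ids per time label: 2010 has two, 2011 has one.
ids_per_label <- tapply(toy$id, toy$TimeLabel,
                        function(x) length(unique(x)))
ids_per_label

# Keep a single id per time label. Taking the first id is an arbitrary
# choice for this sketch, not the package's selection rule.
first_id <- tapply(toy$id, toy$TimeLabel, function(x) x[1])
toy_unique <- toy[toy$id %in% first_id, ]
toy_unique
```

With several ids per time label the user sees every competing series for a given year; with one id per time label each year appears exactly once.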

**`retainKeys`:** Specifies whether only a few (`retainKeys == FALSE`) or all (`retainKeys == TRUE`) variables should be retained in the output.

## Output structure

The function returns a clean dataset with 26 variables (when `retainKeys == TRUE`), which are defined below:

+ `id`: A unique id that is generated by combining the `LocID`, `LocName`, `DataProcess`, type of data (births), `TimeLabel`, `DataProcessType`, `DataSourceName`, `StatisticalConceptName`, `DataTypeName` and `DataReliabilityName`.

+ `LocID`: Location id. This is a numerical location code (a 3-digit code following the ISO 3166-1 numeric standard, i.e. the UNSD M49 codes); see http://en.wikipedia.org/wiki/ISO_3166-1_numeric.

+ `LocName`: Name of the country or territory identified by each location id, e.g. when `LocID` == 404, `LocName` == Kenya.

+ `IndicatorName`: Identifies the type of data i.e. `Births by age of mother and sex of child` or `Total births by sex of child`.

+ `IndicatorID`: An id representing each indicator. `IndicatorID` = 170 where `IndicatorName` == `Births by age of mother and sex of child` and `IndicatorID` = 159 where `IndicatorName` == `Total births by sex of child`.

+ `TimeStart`: Start date of the period of interest (e.g. 01/01/2000).

+ `TimeLabel`: The year of interest (e.g. 2000).

+ `TimeEnd`: End date of the period of interest (e.g. 31/12/2000).

+ `TimeMid`: Mid-period based on `TimeStart` and `TimeEnd` (e.g. 2000.500).

+ `DataProcessType`:  Defines the process used to collect or to obtain the data.

+ `DataSourceName`: This defines the source of data (e.g. Demographic Year Book, World Health Organization records, etc).

+ `StatisticalConceptName`: Defines the concept under which individuals (or vital events) are recorded (e.g. De-facto, Year of occurrence).

+ `DataTypeName`: Indicates the type of collected data or estimation process used to derive the data.

+ `DataReliabilityName`: Denotes the reliability of the data values, e.g. Error, typo or invalid value; Very low quality; Low; Fair; High quality (the default is unknown). This rating is either obtained from the data source or assigned by default during initial loading.

+ `AgeLabel`: Age bracket (where data is abridged) or single year of age (where the data is complete).

+ `AgeStart`: The start age of a particular age label, e.g. where the age label is 10-14, the start age is 10.

+ `AgeEnd`: The upper bound of a particular age label plus one, e.g. where the age label is 10-14, the end age is 15.

+ `AgeSpan`: The difference between `AgeEnd` and `AgeStart`, e.g. where the age label is 10-14, the age span is 5.

+ `AgeSort`: Defines the order of a particular age label when the age labels are arranged in ascending order.

+ `abridged`: Identifies age labels that have an age span of 5 years, including open age groups and the total, but excluding the label `0-4`.

+ `five_year`: Identifies age labels that are abridged, including `0-4`.

+ `complete`: Identifies single year of age labels, including open age groups and the total. 

+ `non_standard`: Identifies age labels that are not standard, i.e. those whose age span is neither 1 nor 5 and that are neither the total nor unknown.

+ `SexID`: ID for each of the sex groups (1: Males, 2: Females, 3: Both sexes).        

+ `DataValue`: The numerical value of interest for a specific set of unique characteristics as defined by `id`.    

+ `note`: Gives the user more information about a particular record in cases where the data is not fully clean, e.g. the record may not have been harmonized because of non-standard age groups, or the series may be missing data for one or more age groups.
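
The relationship between `AgeLabel`, `AgeStart`, `AgeEnd` and `AgeSpan` described above can be sketched in base R. The age labels below are hypothetical, the open age group is simply left as `NA` (how the package encodes open groups is not covered here), and `span_class` is a simplified, span-only classification introduced for illustration rather than the package's `abridged`, `five_year`, `complete` and `non_standard` flags.

```{r}
# Hypothetical age labels; "50+" is an open age group.
age_labels <- c("10-14", "15-19", "20", "21", "15-24", "50+")
is_open <- grepl("\\+$", age_labels)

# Split labels such as "10-14" into lower and upper bounds;
# single-year labels such as "20" have the same lower and upper bound.
parts <- strsplit(sub("\\+$", "", age_labels), "-", fixed = TRUE)
age_start <- as.numeric(vapply(parts, `[`, character(1), 1))
upper <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))

# AgeEnd is the upper bound plus one; left as NA for the open group here.
age_end <- ifelse(is_open, NA_real_, upper + 1)
age_span <- age_end - age_start

# Simplified classification based on the age span alone (illustrative).
span_class <- ifelse(is.na(age_span), "open",
              ifelse(age_span == 1, "single year",
              ifelse(age_span == 5, "five-year", "non-standard")))

ages <- data.frame(AgeLabel = age_labels, AgeStart = age_start,
                   AgeEnd = age_end, AgeSpan = age_span,
                   span_class = span_class, stringsAsFactors = FALSE)
ages
```

For the label `10-14` this gives a start age of 10, an end age of 15 and an age span of 5, matching the definitions above, while `15-24` has a span of 10 and is therefore treated as non-standard in this sketch.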

## Harmonization Workflow

For a detailed explanation of the harmonization workflow, [see this article](https://shelmith-kariuki.github.io/rddharmony/articles/Harmonization_Workflow.html).