# MRP Outputs Review

## What this notebook does

This notebook reads in csv, jpeg, yaml and html files that have been sent to the review bucket, so that they can be checked for quality and disclosure. Every item requested for export must first be disclosure controlled by a separate DisCO (Disclosure Control Officer); this cannot be the same person who ran the data and requested the export. Once an item has been disclosure controlled, a Data Journey manager can be notified to move the data ready for export.

## Setup

Run this code chunk to authenticate and load the packages the rest of the notebook needs.

In [None]:
## Authentication for GCS
options("googleAuthR.httr_oauth_cache" = "gce.oauth")
googleAuthR::gar_gce_auth()
devtools::install("../gcptools", upgrade = FALSE, quiet = TRUE)
install.packages("jpeg", quiet = TRUE)
library(googleCloudStorageR)
library(readr)
library(dplyr, quietly = TRUE)
# SECURITY FEATURE FOR PREVENTING .ipynb FILES FROM ACCIDENTALLY UPLOADING DATA TO GITHUB [DO NOT REMOVE]
gcptools::commit_hooks_setup("/home/jupyter/CIS_disclosure_control")

## Project Name

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Insert the project name in quotation marks to the right of the assignment operator <br> <em><strong>
    (e.g. project_name <- "20221121_mrp")</strong></em>.

In [None]:
project_name <- "20221205_agecontour_swabs/"

In [None]:

#' @title get_filepaths_from_project_name
#' 
#' @description function to search all of the available filepaths within a given project_name and return them as a dataframe
#'
#' @param project_name a string containing the exact project name (case-sensitive) which you want to get the filepaths from
#' @param file_type a string selecting the file type to list; one of ".csv", ".jpeg|.jpg", ".yaml|.yml" or ".html" (default ".csv")
#' 
#' @return dataframe containing the filepaths of all of the files of that type within the project
#'
#' @export 

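get_filepaths_from_project_name <- function(project_name, file_type = c(".csv", ".jpeg|.jpg", ".yaml|.yml", ".html")){
    
    file_type <- match.arg(file_type)
    
    # keep only objects whose name starts with the project name
    project_pattern <- paste0("^", project_name)
    
    project_files <- grep(x = googleCloudStorageR::gcs_list_objects("polestar-prod-review")$name,
                   pattern = project_pattern,
                   value = TRUE)
    
    # match each "." literally and anchor the extension to the end of the name,
    # so that e.g. ".csv" cannot match "xcsv" in the middle of a filename
    approved_file_patterns <- paste0("(", gsub(".", "[.]", file_type, fixed = TRUE), ")$")
    
    project_files <- grep(x = project_files,
                   pattern = approved_file_patterns,
                   value = TRUE)
    
    as.data.frame(project_files)
}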
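
To see what the two `grep()` passes in the function above do without touching the bucket, the following minimal sketch runs the same kind of filtering on a made-up vector of object names. The object names and the `filter_project_files()` helper are hypothetical and exist only for illustration; nothing here calls googleCloudStorageR.

```r
# Hypothetical object names standing in for gcs_list_objects(...)$name
object_names <- c(
    "20221205_agecontour_swabs/input.csv",
    "20221205_agecontour_swabs/config.yaml",
    "20221205_agecontour_swabs/plot.jpeg",
    "20221121_mrp/probs_over_time.csv"
)

# Hypothetical helper: keep names that start with the project prefix,
# then keep those whose extension matches the file type pattern
filter_project_files <- function(object_names, project_name, file_type_pattern){
    in_project <- grep(paste0("^", project_name), object_names, value = TRUE)
    grep(file_type_pattern, in_project, value = TRUE)
}

csv_files <- filter_project_files(object_names,
                                  "20221205_agecontour_swabs/",
                                  "[.]csv$")
print(csv_files)
```

Writing the dot as `[.]` and ending the pattern with `$` means only names that really end in `.csv` are returned, and the `^` prefix keeps files from other projects out of the list.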

## Checking .csv files

Run this code chunk to list the .csv files in the project

In [None]:
get_filepaths_from_project_name(project_name, ".csv")

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Select the csv filepath you want to review by replacing the string below with a filepath to the csv <br> 
<em><strong> (e.g. csv_path <- "20221121_mrp/probs_over_time_mrp_20221115_DTS221122_1411UTC.csv") </strong></em>

In [None]:
csv_path <- "20221205_agecontour_swabs/input_age_contour_swabs_20221126_DTS221206_1028UTC.csv" 

Run this code chunk to retrieve the data from the bucket

In [None]:
data <- suppressMessages(gcptools::gcp_read_csv(csv_path, bucket="review_bucket"))

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Preview what types of data are in the csv, by printing data to the console <br>
Look specifically for columns which could contain identifiable information, such as <em><strong><u>counts</u></strong></em> or other identifiable details like names, addresses, emails etc. <br> 

In [None]:
head(data) 
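
As a rough aid to the manual check described above, column names can be scanned for words that hint at identifiable information. This is only a sketch: the keyword list, the `flag_risky_columns()` helper and the toy data frame are all hypothetical, and a clean result does not replace looking at the data itself.

```r
# Hypothetical keyword list based on the examples in the instruction above
risky_keywords <- c("count", "name", "address", "email")

# Hypothetical helper: return column names containing any keyword (case-insensitive)
flag_risky_columns <- function(data, keywords = risky_keywords){
    pattern <- paste(keywords, collapse = "|")
    grep(pattern, names(data), ignore.case = TRUE, value = TRUE)
}

# Toy data frame standing in for the csv under review (values are made up)
toy_data <- data.frame(
    region = c("North", "South"),
    sample_count = c(4, 120),
    Email = c("a@example.com", "b@example.com"),
    probability_increase = c(0.1, 0.9)
)

flagged <- flag_risky_columns(toy_data)
print(flagged)
```

Columns flagged this way are good candidates for the closer look in the next step.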

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> To have a closer look at values within the columns, insert the column name below and run the next code chunk to see what's inside. <br> 
<em><strong> e.g. column_name <- "probability_increase" </strong></em> 

In [None]:
column_name <- "visit_date"

In [None]:
#' @title explore_column
#' 
#' @description function for use in disclosure control to explore the contents of a particular column
#'
#' @param data a dataframe containing the data you want to explore
#' @param column_name a string containing the exact name (case-sensitive) of the column you want to explore
#' 
#' @return either unique values or a percentage of low counts under a given threshold (default 10) 
#'
#' @export 

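explore_column <- function(data, column_name){
    
    if(!column_name %in% names(data)){
        stop("Column '", column_name, "' was not found in the data")
    }

    if(is.numeric(data[[column_name]])){
        # counts and other numeric columns: report the share of low values
        reveal_low_counts(data, column_name)
    } else {
        # character, factor, date and logical columns: list the distinct values
        get_unique_values(data, column_name)
    }
}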


#' @title get_unique_values
#' 
#' @description function for use in disclosure control to explore the contents of a character or factor type column
#'
#' @param data a dataframe containing the data you want to explore
#' @param column_name a string containing the exact name (case-sensitive) of the column you want to explore
#' 
#' @return a pretty list containing the unique values within that column
#'
#' @export 

get_unique_values <- function(data, column_name){
    
    # as.character() keeps factor labels and dates readable when printed
    unique_values <- as.character(unique(data[[column_name]]))
    
    cat(c("Unique values:", unique_values), sep = "\n   ")

}

#' @title reveal_low_counts
#' 
#' @description function for use in disclosure control to explore the percentage of low counts within a given numeric column
#'
#' @param data a dataframe containing the data you want to explore
#' @param column_name a string containing the exact name (case-sensitive) of the column you want to explore
#' @param low_count_threshold the threshold at which counts are considered disclosive (default 10)
#' 
#' @return a message to the console which states the percentage of low counts within the column. 
#'
#' @export 

reveal_low_counts <- function(data, column_name, low_count_threshold = 10){
    
    total_rows <- nrow(data)
    
    # use the threshold argument rather than a hard-coded 10
    perc_low <- data %>%
        dplyr::filter(.data[[column_name]] < low_count_threshold) %>%
        dplyr::summarise(perc_low = paste0(round(100 * dplyr::n() / total_rows), "%")) %>%
        dplyr::pull()
    
    paste(perc_low, "of rows contain values less than", low_count_threshold)
}

In [None]:
explore_column(data, column_name)
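
The two kinds of summary produced above can be reproduced on a toy data frame to show what the output means. This is a base R sketch with made-up values, not the notebook's own functions: it computes the percentage of rows below a threshold for a numeric column and the distinct values for a character column.

```r
# Toy data standing in for a csv under review (values are made up)
toy_data <- data.frame(
    region = c("North", "South", "North", "East"),
    n_positive = c(3, 25, 8, 40)
)

low_count_threshold <- 10

# Share of rows whose count falls below the threshold
is_low <- toy_data$n_positive < low_count_threshold
perc_low <- round(100 * sum(is_low, na.rm = TRUE) / nrow(toy_data))
message(perc_low, "% of rows contain values less than ", low_count_threshold)

# Distinct values of a character column
unique_regions <- unique(toy_data$region)
print(unique_regions)
```

Here two of the four rows (3 and 8) are below 10, so half of the rows would count as low.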

In [None]:
#' @title gcp_read_yaml
#' 
#' @description function to read yaml files into the environment from a GCP bucket
#'
#' @param filepath a string containing file path (including folder name) to the yaml file which you want to read in
#' @param bucket a string containing the name of the gcp bucket you want to read the yaml file from 
#' 
#' @return a list containing the contents of the yaml file 
#'
#' @export 

gcp_read_yaml <- function(filepath, bucket = "polestar-prod-review"){
    
    tmp <- tempfile(fileext = ".yaml")
    
    # delete the temporary copy when the function exits, even if reading fails
    on.exit(unlink(tmp), add = TRUE)
    
    suppressMessages(googleCloudStorageR::gcs_get_object(filepath, bucket = bucket, saveToDisk = tmp))
    
    yaml::read_yaml(tmp)
}
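
The temp-file pattern used by the function above (save to a temporary file, parse it, delete it) can be tried locally without any bucket access. The sketch below assumes the `yaml` package is installed; the config contents are made up and simply stand in for a downloaded object.

```r
# Assumes the 'yaml' package is installed; no bucket access is needed
tmp <- tempfile(fileext = ".yaml")

# Made-up config contents standing in for a downloaded object
writeLines(c("model: mrp", "regions:", "  - England", "  - Wales"), tmp)

# Parse the file into an R list
config <- yaml::read_yaml(tmp)

# Remove the temporary copy once it has been parsed
unlink(tmp)

str(config)
```

The parsed result is an ordinary R list, so individual settings can be inspected with `config$model` or `config$regions`.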

## Checking .yml/.yaml files 

Run this code chunk to list the .yml or .yaml files in the project

In [None]:
get_filepaths_from_project_name(project_name, ".yaml|.yml")

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Insert the filepath of the yaml file you want to explore <br> 
<em><strong> e.g. yaml_filepath <- "20221121_mrp/configs_mrp_20221115_DTS221215_1814UTC.yaml" </strong></em> 

In [None]:
yaml_filepath <- "20221205_agecontour_swabs/config_age_contour_swabs_20221126_DTS221206_1028UTC.yaml"

Run this code chunk to retrieve the contents of the yaml file from the review bucket

In [None]:
gcp_read_yaml(yaml_filepath)

## Checking .jpg/.jpeg files

Run this code chunk to list the .jpg or .jpeg files in the project

In [None]:
get_filepaths_from_project_name(project_name, ".jpeg|.jpg")

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Insert the filepath of the jpeg file you want to review in the next code chunk

In [None]:
jpeg_filepath <- "20221205_agecontour_swabs/plot_age_contour_swabs_regions_20221126_DTS221206_1028UTC.jpeg"

In [None]:
#' @title print_jpeg
#' 
#' @description function to print jpegs to the console
#'
#' @param jpeg_filepath a string containing a jpeg file path (including folder name) that you want to print to the console
#' 
#' @return an image of the jpeg within the console
#'
#' @export 

print_jpeg <- function(jpeg_filepath){
    
    jpeg <- suppressMessages(googleCloudStorageR::gcs_get_object(jpeg_filepath, bucket = "polestar-prod-review"))

    options(repr.plot.width = 25, repr.plot.height = 25)

    plot(0:3000, type = 'n')

    rasterImage(image = jpeg, 0, 0, 3000, 3000)
}
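
The function above draws every image into a fixed 3000 x 3000 square, which stretches plots that are not square. A possible alternative, sketched here on a synthetic image so that it runs without the bucket, is to size the plotting region from the image's own dimensions. It assumes the downloaded object is a height x width (x channels) array, as returned by `jpeg::readJPEG()`; whether `gcs_get_object()` returns exactly that in this environment is an assumption.

```r
# Synthetic greyscale "image": 200 rows (height) by 300 columns (width)
img <- matrix(seq(0, 1, length.out = 200 * 300), nrow = 200, ncol = 300)

img_height <- dim(img)[1]
img_width  <- dim(img)[2]

# Empty plot whose axes span the image size; asp = 1 keeps pixels square
plot(NA, type = "n",
     xlim = c(0, img_width), ylim = c(0, img_height),
     asp = 1, xlab = "", ylab = "", axes = FALSE)

# Draw the raster so that it fills the width x height rectangle
rasterImage(img, 0, 0, img_width, img_height)
```

With `asp = 1` and axis limits taken from the image, a wide plot stays wide instead of being squeezed into a square.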

Run this code chunk to print the jpeg to the console

In [None]:
print_jpeg(jpeg_filepath)

## Checking .html files

Run this code chunk to list the .html files in the project

In [None]:
get_filepaths_from_project_name(project_name, ".html")

In [None]:
#' @title retrieve_html_file
#' 
#' @description function to retrieve html file from the review bucket and put it in the QA_reports folder
#'
#' @param html_filepath a string containing a html file path (including folder name) that you want to save to the QA_reports folder
#' 
#' @return a message saying where to find the html file
#'
#' @export 

retrieve_html_file <- function(html_filepath){

    main_directory <- "/home/jupyter"
    sub_directory <- "QA_reports"

    if (!file.exists(paste0(main_directory,"/",sub_directory))){
        dir.create(file.path(main_directory, sub_directory))
        setwd(file.path(main_directory, sub_directory))   
    }

    suppressMessages(gcptools::download_qa_report_to_notebook(file = html_filepath, 
                                            bucket = "polestar-prod-review" ))

    print(paste0("The ", html_filepath, " file has been saved in the '", main_directory,"/", sub_directory, "' folder"))

}
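
The folder handling in the function above can be sketched on its own with `dir.exists()` and `file.path()`. In this illustration `tempdir()` stands in for "/home/jupyter" so that it is safe to run anywhere, and the report filename is hypothetical; the sketch does not download anything or call gcptools.

```r
# tempdir() stands in for "/home/jupyter" so the sketch is safe to run anywhere
main_directory <- tempdir()
report_dir <- file.path(main_directory, "QA_reports")

# Create the folder only if it is missing
if (!dir.exists(report_dir)) {
    dir.create(report_dir, recursive = TRUE)
}

# Hypothetical report path, used only to show how a destination could be built
html_filepath <- "20221121_mrp/example_report.html"
destination <- file.path(report_dir, basename(html_filepath))
print(destination)
```

Building paths with `file.path()` avoids hand-written separators, and `dir.exists()` checks specifically for a directory rather than any file of that name.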

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Replace the string assigned to html_filepath below to select the correct file from the review bucket<br> 
<em><strong> e.g. html_filepath <- "20221121_mrp/MRP_QA_England_Datarun20221121_Co20221115_PrevCo20221108.html" </strong></em> 

In [None]:
html_filepath <- "20221121_mrp/MRP_QA_England_Datarun20221121_Co20221115_PrevCo20221108.html"

In [None]:
retrieve_html_file(html_filepath)

<FONT COLOR="RED"> **INSTRUCTION:**</FONT> Locate the QA_reports folder at the repo level of folders (i.e. where you would go to navigate to other repos) to review the report.