# 1.2 Data pre-processing and wrangling for risk factors data
The risk factors fall into two categories:

1. Social factors <br>
    1.1. New Zealand Health Survey (NZHS) data - Ministry of Health <br> 
    1.2. Education - Stats NZ <br>
    1.3. Work hours - Stats NZ <br>
    1.4. Income - Stats NZ <br>
    1.5. Birth numbers (Children) - Stats NZ <br>
2. Environmental factors <br>
    2.1. DHB map geometry data (for mapping environment factors to DHB regions) - Stats NZ <br> 
    2.2. Earthquake - GeoNet <br>
    2.3. Temperature - Stats NZ <br>
    2.4. Air quality - LAWA database <br>
    2.5. Ground water quality - LAWA database <br>

We cleaned the risk factor datasets from these sources and combined them into a single clean dataset, "rf.Rdata", which holds all risk factors in long format. <br>
Primary key: {DHB, year, sex, category, rf}<br>
Join key to the cancer data: {DHB, year}; the NZHS data additionally join on sex, i.e. {DHB, year, sex} <br>

"category": 
* NZHS
* Work Hours
* Education
* Income
* Birth Number
* Earthquake
* Air quality
* Water quality
* Temperature<br>

"rf": the specific risk factor within each category, such as "maximum magnitude of earthquake".
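As a minimal sketch of how the primary key and the join key described above are meant to work, the toy example below checks key uniqueness and joins on {DHB, year}. The tables `rf_toy` and `cancer_toy`, the `rate` column, and all values are hypothetical stand-ins, not part of the real data.

```r
library(dplyr)

# Toy stand-in for the long-format risk factor table (illustrative values only)
rf_toy <- tibble(
  DHB = c("Auckland", "Auckland", "Northland"),
  year = c(2013, 2013, 2013),
  sex = "AllSex",
  category = c("Income", "Education", "Income"),
  rf = c("> $50,000", "level 7", "> $50,000"),
  value = c(40, 20, 30),
  type = "percentage"
)

# Hypothetical cancer table keyed by DHB and year
cancer_toy <- tibble(
  DHB = c("Auckland", "Northland"),
  year = c(2013, 2013),
  rate = c(350, 400)
)

# Each {DHB, year, sex, category, rf} combination should occur exactly once
n_dup <- rf_toy %>%
  count(DHB, year, sex, category, rf) %>%
  filter(n > 1) %>%
  nrow()

# Attach the cancer measure to every risk factor row of the same DHB and year
joined <- inner_join(rf_toy, cancer_toy, by = c("DHB", "year"))
```

For the NZHS rows, which carry sex-specific values, the same join would use `by = c("DHB", "year", "sex")`, assuming the cancer table also has a `sex` column.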

In [1]:
#loading library
library(tidyverse)
library(sf)
library(skimr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.12.0, GDAL 3.7.1, PROJ 9.3.0; sf_use_s2() is TRUE



## 1. Social Factors Datasets

### 1.1. New Zealand Health Survey data

In [2]:
#### New Zealand Health Survey data ####
nzhs <- read_csv('data/raw/nz-health-survey-2017-20-regional-update-dhb-prevalences.csv')
nzhs %>% 
  filter(population == "adults") %>% #select adults population group for analysis
  filter(!grepl("-",year) & type == "STD") %>% #filter age-standardized data with type == "STD", remove combined year data.
  mutate(DHB = case_when(region == 'Tairāwhiti' ~ "Tairawhiti", #modify DHB names to match the cancer data
                         region == 'Waitematā' ~ 'Waitemata',
                         region == "New Zealand" ~ 'All New Zealand',
                         TRUE ~ region),
         year = as.numeric(str_extract(year,"^[0-9]*")), #just keep the year information
         rf = short.description,
         sex = case_when(sex == 'All' ~ "AllSex",
                         TRUE ~ sex),
         value = Prevalence,
         type = "percentage"
  ) %>% 
  select(DHB,year,sex,rf,value,type) %>%
  mutate(category = "NZHS", .before = 4) ->
  nzhs_long # Note there are some missing values after changing to wide data format


# write_csv(nzhs_wide,"data/clean/nzhs_wide.csv") #save clean nzhs data for correlation analysis

nzhs_long %>% head(3)


Rows: 772978 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): population, short.description, region, type, year, agegroup, sex, E...
dbl (5): Prevalence, CL.Lower.Bound, CL.Upper.Bound, estimated.number, sampl...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


DHB,year,sex,category,rf,value,type
<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>
Auckland,2011,AllSex,NZHS,Physically active,46.2,percentage
Auckland,2012,AllSex,NZHS,Physically active,51.8,percentage
Auckland,2013,AllSex,NZHS,Physically active,45.4,percentage


### 1.2. Education data

In [3]:
#### Education data ####

education <- read.csv('data/raw/Highest_qualification_long_updated_16-7-20.csv')

#clean the Education data and map to qualification levels
qualifications_to_level <- c(
  "No qualification" = 'No qualification',
  "Level 1 certificate" = 'level 1',
  "Level 2 certificate" = 'level 2',
  "Level 3 certificate" = 'level 3',
  "Level 4 certificate" = 'level 4',
  "Level 5 diploma" = 'level 5',
  "Level 6 diploma" = 'level 6',
  "Bachelor degree and Level 7 qualification" = 'level 7',
  "Post-graduate and honours degrees" = 'level 8',
  "Masters degree" = 'level 9',
  "Doctorate degree" = 'level 10'
)

#create new variables with the "≥ levels" format
qualifications_map <- c(
  'No qualification' = "≥ level 1",
  'level 1' = "≥ level 2",
  'level 2' = "≥ level 3",
  'level 3' = "≥ level 4",
  'level 4' = "≥ level 5",
  'level 5' = "≥ level 6",
  'level 6' = "≥ level 7",
  'level 7' = "≥ level 8",
  'level 8' = "≥ level 9",
  'level 9' = "≥ level 10"
)


education %>% 
  mutate(Area_description = str_replace( Area_description,'Capital and Coast',"Capital & Coast")) %>% #unify DHB names
  filter(Year >= 2011 & Year <= 2020 & #screen years match the cancer data
           grepl("DHB",Area_code_and_description) & #filter rows with DHB information
           !Highest_qualification_descriptor %in% c('Overseas secondary school qualification', "Not elsewhere included","Total")) %>% #keep only NZ education qualification data
  rename( DHB = Area_description, year = Year) %>%
  mutate(rf =  qualifications_to_level[Highest_qualification_descriptor], Count = as.numeric(Count) ) %>%
  group_by(DHB,year) %>%
  mutate(value = Count / Count[Highest_qualification_descriptor == "Total stated"] * 100,
         type = "percentage") %>% #calculate the percentage of each qualification level
  filter(Highest_qualification_descriptor != 'Total stated') %>% 
  select(DHB,year,rf,value,type) %>%
  mutate(sex = "AllSex", category = 'Education' , .before =3)-> edu1

edu1 %>%  
  group_by(DHB,year) %>%
  mutate(value = 100 - cumsum(value)) %>% #calculate the share above each level, i.e. each "≥ level" variable
  filter(rf != 'level 10') %>% #drop the top level, where the cumulative share reaches 100% and the remainder is 0
  mutate(rf = qualifications_map[rf]) -> edu2

edu <- rbind(edu1,edu2) 

edu %>% head(3)

DHB,year,sex,category,rf,value,type
<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>
Northland,2013,AllSex,Education,No qualification,27.40435,percentage
Northland,2013,AllSex,Education,level 1,16.61563,percentage
Northland,2013,AllSex,Education,level 2,11.19303,percentage
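The `100 - cumsum(value)` step used for the "≥ level" variables (and reused below for work hours, income, and birth numbers) can be illustrated with a small worked example. The shares below are made up, and the sketch assumes, as the pipeline implicitly does, that rows within each DHB and year are already ordered from the lowest to the highest level.

```r
library(dplyr)

# Made-up shares for one DHB and year, ordered from lowest to highest level
shares <- tibble(
  DHB = "Example",
  year = 2013,
  rf = c("No qualification", "level 1", "level 2"),
  value = c(50, 30, 20)
)

# Map each level to the label of the "at least the next level" variable
level_map <- c(
  "No qualification" = "≥ level 1",
  "level 1" = "≥ level 2"
)

at_least <- shares %>%
  group_by(DHB, year) %>%
  mutate(value = 100 - cumsum(value)) %>% # share strictly above the current level
  ungroup() %>%
  filter(rf != "level 2") %>%             # top level: nothing lies above it
  mutate(rf = unname(level_map[rf]))
```

Here 50% of people hold at least level 1 (100 - 50) and 20% hold at least level 2 (100 - 50 - 30). If the input rows were not sorted by level, the cumulative sums would be meaningless, so an explicit ordering step may be worth adding before `cumsum()`.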


### 1.3. Working hours data

In [4]:
#### Working hours ####

work_hours <- read.csv('data/raw/total_hours_worked_long_updated_16-7-20.csv')

#create new variables with the "≥ hours" format
work_hours_map <- c(
  "1-9 hours worked" = "≥ 10 hours"  ,
  "10-19 hours worked" = "≥ 20 hours",
  "20-29 hours worked" = "≥ 30 hours",
  "30-39 hours worked" = "≥ 40 hours",
  "40-49 hours worked" = "≥ 50 hours",
  "50-59 hours worked" = "≥ 60 hours"
)

work_hours %>% 
  mutate(Area_description = str_replace( Area_description,'Capital and Coast',"Capital & Coast")) %>% #unify DHB names
  filter(Year >= 2011 & Year <= 2020 &  #screen years match the cancer data
           grepl("DHB",Area_code_and_description) & #filter rows with DHB information
           !Hours_worked_week_descriptor %in% c( "Not elsewhere included","Total")) %>%
  rename( DHB = Area_description, year = Year) %>%
  mutate(Count = as.numeric(Count) ) %>%
  group_by(DHB,year) %>%
  mutate(value = Count / Count[Hours_worked_week_descriptor == "Total stated"] * 100,  #calculate the percentage of each work hours length
         rf = Hours_worked_week_descriptor,
         type = "percentage") %>%
  filter(Hours_worked_week_descriptor != 'Total stated') %>% 
  select(DHB,year,rf,value,type) %>%
  mutate(sex = "AllSex", category = 'Work Hours' , .before =3) -> whs1

whs1 %>%  
  group_by(DHB,year) %>%
  mutate(value = 100 - cumsum(value)) %>% #calculate the share above each band, i.e. each "≥ hours" variable
  filter(rf != '60 hours or more worked') %>% #drop the top band, where the cumulative share reaches 100% and the remainder is 0
  mutate(rf = work_hours_map[rf]) -> whs2

whs <- rbind(whs1,whs2) 

whs %>% head(3)

DHB,year,sex,category,rf,value,type
<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>
Northland,2013,AllSex,Work Hours,1-9 hours worked,5.855106,percentage
Northland,2013,AllSex,Work Hours,10-19 hours worked,8.145103,percentage
Northland,2013,AllSex,Work Hours,20-29 hours worked,10.830644,percentage


### 1.4. Income data 

In [5]:
#### Income data  ####

income_census <- read.csv('data/raw/Total_personal_income_long_updated_16-7-20.csv')


#create new variables with the "> $" format
income_map <- c(
  "$5,000 or less" = "> $5,000",
  "$5,001-$10,000" = "> $10,000",
  "$10,001-$20,000" = "> $20,000",
  "$20,001-$30,000" = "> $30,000",
  "$30,001-$50,000" = "> $50,000",
  "$50,001-$70,000" = "> $70,000"
)

income_census %>% 
  mutate(Area_description = str_replace( Area_description,'Capital and Coast',"Capital & Coast")) %>% #unify DHB names
  filter(Year >= 2011 & Year <= 2020 &  #screen years match the cancer data
           grepl("DHB",Area_code_and_description) & #filter rows with DHB information
           !Grouped_personal_income_descriptor %in% c( "Not stated","Total","Median, ($)")) %>%
  rename( DHB = Area_description, year = Year) %>%
  mutate(Count = as.numeric(Count) ) %>%
  group_by(DHB,year) %>%
  mutate(value = Count / Count[Grouped_personal_income_descriptor == "Total stated"] * 100,
         rf = Grouped_personal_income_descriptor,
         type = "percentage") %>% #calculate the percentage of each income range
  filter(Grouped_personal_income_descriptor != 'Total stated') %>% 
  select(DHB,year,rf,value,type)  %>%
  mutate(sex = "AllSex", category = 'Income' , .before =3)-> income1


income1 %>%  
  group_by(DHB,year) %>%
  mutate(value = 100 - cumsum(value)) %>% #calculate the share above each income band, i.e. each "> $" variable
  filter(rf != '$70,001 or more') %>% #drop the top band, where the cumulative share reaches 100% and the remainder is 0
  mutate(rf = income_map[rf]) -> income2


income <- rbind(income1,income2) 

income %>% head(3)

DHB,year,sex,category,rf,value,type
<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>
Northland,2013,AllSex,Income,"$5,000 or less",13.532017,percentage
Northland,2013,AllSex,Income,"$5,001-$10,000",5.814834,percentage
Northland,2013,AllSex,Income,"$10,001-$20,000",24.238187,percentage


### 1.5. Birth numbers

In [6]:
#### Birth numbers ####

children_census <- read.csv('data/raw/Number_of_children_born_long_updated_16-7-20.csv')

#create new variables with the "> children" format
children_map <- c(
  "No children" = "> 0 children",
  "One child" = "> 1 children",
  "Two children" = "> 2 children",
  "Three children" = "> 3 children",
  "Four children" = "> 4 children",
  "Five children" = "> 5 children"
)


children_census %>% 
  mutate(Area_description = str_replace( Area_description,'Capital and Coast',"Capital & Coast")) %>% #unify DHB names
  filter(Year >= 2011 & Year <= 2020 & #screen years match the cancer data
           grepl("DHB",Area_code_and_description) & #filter rows with DHB information
           !Number_children_born_descriptor %in% c("Total","Not elsewhere included","Object to answering")) %>%
  rename( DHB = Area_description, year = Year) %>%
  mutate(Count = as.numeric(Count) ) %>%
  group_by(DHB,year) %>%
  mutate(value = Count / Count[Number_children_born_descriptor == "Total stated"] * 100,
         rf = Number_children_born_descriptor,
         type = "percentage") %>%  #calculate the percentage 
  filter(Number_children_born_descriptor != 'Total stated') %>% 
  select(DHB,year,rf,value,type) %>%
  mutate(sex = "AllSex", category = 'Birth Number' , .before =3) -> children1


children1 %>%  
  group_by(DHB,year) %>%
  mutate(value = 100 - cumsum(value)) %>% #calculate the share above each count, i.e. each "> children" variable
  filter(rf != 'Six or more children') %>%  #drop the top group, where the cumulative share reaches 100% and the remainder is 0
  mutate(rf = children_map[rf]) -> children2

children <- rbind(children1,children2) 

children %>% head(3)

DHB,year,sex,category,rf,value,type
<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>
Northland,2013,AllSex,Birth Number,No children,22.12185,percentage
Northland,2013,AllSex,Birth Number,One child,11.16309,percentage
Northland,2013,AllSex,Birth Number,Two children,26.58493,percentage


## 2. Environment Data

Environment data were mapped to DHB regions by spatially joining each record's coordinates to the DHB boundary polygons in DHB_map.
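The point-to-region mapping can be sketched on toy geometry as follows. The two square "regions", the site names, and the coordinates are invented for illustration; the real pipeline uses the DHB boundary file instead.

```r
library(sf)

# Helper that builds a closed square polygon from its corner coordinates
square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two invented regions standing in for DHB polygons
regions <- st_sf(
  DHB = c("Region A", "Region B"),
  geometry = st_sfc(square(0, 0, 1, 1), square(1, 0, 2, 1), crs = 4326)
)

# Invented measurement sites; s3 lies outside both regions
sites <- data.frame(
  site = c("s1", "s2", "s3"),
  lon = c(0.5, 1.5, 5),
  lat = c(0.5, 0.5, 5)
)

# Convert to an sf object, then attach the region each point falls in
sites_sf <- st_as_sf(sites, coords = c("lon", "lat"), crs = 4326)
joined <- st_join(sites_sf, regions)

# Points outside every region get NA and are dropped, as in the cells below
mapped <- joined[!is.na(joined$DHB), ]
```

By default `st_join()` performs a left join using `st_intersects`, so unmatched points are kept with `NA` attributes, which is why the `filter(!is.na(DHB))` step is needed afterwards.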

### 2.1. DHB geometry map data

In [7]:
DHB_map <- st_read("data/raw/NZ_District_Health_Board_boundaries_-_generalised.kml", quiet=TRUE)
DHB_map %>% 
  mutate(DHB = case_when(DHB_name == "Capital and Coast" ~ "Capital & Coast", #modify DHB names in DHB_map to match the cancer data
                         TRUE ~ DHB_name)) %>%
  select(DHB,geometry) -> 
  DHB_map


### 2.2. Earthquake data

In [8]:
### Earthquake data ####

earthquake_2007_to_2023 <- read_csv('data/raw/earthquake2007-2023.csv') #load earthquake data

earthquake_2007_to_2023 %>% 
  filter(eventtype == 'earthquake') %>%
  filter(origintime >= as.POSIXct("2011-01-01", tz = "UTC") & origintime < as.POSIXct("2021-01-01", tz = "UTC")) %>% #keep 2011-2020 to match the cancer data, including events on the boundary days
  mutate(year = format(origintime, "%Y"))%>% #extract year information 
  select(year,longitude,latitude,magnitude,depth) %>% 
  st_as_sf(.,coords = c("longitude", "latitude"), crs = 4326) %>% #transform into sf object for coordinates mapping
  st_join(.,DHB_map) %>% # coordinates mapping
  filter(!is.na(DHB))-> #remove coordinates that failed to map to DHB region
  earthquake_map

earthquake_map %>%
  as.data.frame() %>%
  select(-geometry) %>%
  group_by(year,DHB) %>% #summarize based on the year and DHB group
  summarise(magnitude_max =  max(magnitude), magnitude_mean = mean(magnitude), 
            depth_max =  max(depth), depth_mean = mean(depth), 
            counts = n()) %>%
  pivot_wider(names_from = DHB, names_sep = "-" , values_from = c(3:7), values_fill = 0) %>% #some DHBs have no earthquakes in a given year; pivot wider with values_fill = 0 to create those rows
  pivot_longer(names_to = c('rf','DHB'), names_sep = '-', values_to = "value", cols = -1) %>% #pivot back to long format so that all values sit in a single column
  select(DHB,year,rf,value) %>%
  mutate(sex = "AllSex", category = 'Earthquake' , .before =3) %>%
  mutate(type = "value")-> #flag these rows as raw values rather than percentages
  earthquake

earthquake %>% head(3)

Rows: 380390 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): publicid, eventtype, magnitudetype, depthtype, evaluationmethod, ...
dbl  (11): longitude, latitude, magnitude, depth, usedphasecount, usedstatio...
dttm  (2): origintime, modificationtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.


DHB,year,sex,category,rf,value,type
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>
Bay of Plenty,2011,AllSex,Earthquake,magnitude_max,4.621,value
Canterbury,2011,AllSex,Earthquake,magnitude_max,6.2,value
Capital & Coast,2011,AllSex,Earthquake,magnitude_max,3.683,value


### 2.3. Temperature data

In [9]:
### Temperature data ####
#load temperature data with read_xlsx and suppress warnings messages
suppressWarnings(tmp_1951_to_2022 <- readxl::read_xlsx('data/raw/annual&seasonal-temperature 1951-2022.xlsx'))


tmp_1951_to_2022 %>% 
  filter(year >= 2011 & year <= 2020) %>% #filter year matching cancer data
  select( year, site, statistic, season , temperature,lat,lon) %>% 
  st_as_sf(.,coords = c("lon", "lat"), crs = 4326) %>% #transform into sf object for coordinates mapping
  st_join(.,DHB_map) %>% # coordinates mapping
  filter(!is.na(DHB))-> #remove coordinates that failed to map to DHB region
  tmp_map

tmp_map %>%
  as.data.frame() %>%
  select(-c(geometry,site)) %>%
  group_by(year,DHB,statistic,season) %>% #summarize based on the year and DHB group
  summarise(temperature = mean(temperature)) %>%
  ungroup(.) %>%
  pivot_wider(names_from = c(statistic,season), names_sep = '_', values_from = temperature) %>%
  pivot_longer(names_to = c('rf'),  values_to = "value", cols = -c(1:2)) %>%
  select(DHB,year,rf,value) %>%
  mutate(sex = "AllSex", category = 'Temperature' , .before =3) %>%
  mutate(type = "value")->
  tmp

tmp %>% head(3)

[1m[22m`summarise()` has grouped output by 'year', 'DHB', 'statistic'. You can
override using the `.groups` argument.


DHB,year,sex,category,rf,value,type
<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>
Bay of Plenty,2011,AllSex,Temperature,Average_Annual,15.69167,value
Bay of Plenty,2011,AllSex,Temperature,Average_Autumn,16.6,value
Bay of Plenty,2011,AllSex,Temperature,Average_Spring,14.56667,value


### 2.4. Air quality data

In [10]:
### Air quality data ####
air_2016_to_2022 <- readxl::read_xlsx('data/raw/air-quality2016-2022.xlsx')

air_2016_to_2022 %>% 
  mutate(year =  format(as.Date(`Sample Date`),'%Y'), #extract year info
         Latitude = as.numeric(Latitude),
         Longitude = as.numeric(Longitude)) %>%
  filter(year >= 2011 & year <= 2020) %>% #filter year matching cancer data
  select( Latitude, Longitude, year, Indicator,Concentration) %>% 
  st_as_sf(.,coords = c("Longitude", "Latitude"), crs = 4326) %>% #transform into sf object for coordinates mapping
  st_join(.,DHB_map) %>%  # coordinates mapping
  filter(!is.na(DHB))-> #remove coordinates that failed to map to DHB region
  air_map

air_map %>%
  as.data.frame() %>%
  select(-c(geometry)) %>%
  group_by(year,DHB,Indicator) %>%  #summarize based on the year and DHB group
  summarise(concentration_max = max(Concentration,na.rm=T),
            concentration_mean = mean(Concentration,na.rm=T)) %>%
  pivot_longer(names_to = c('rf'),  values_to = "value", cols = -c(1:3)) %>%
  mutate(rf = paste0(Indicator,'_',rf)) %>%
  select(DHB,year,rf,value) %>%
  mutate(sex = "AllSex", category = 'Air quality' , .before =3) %>%
  mutate(type = "value") -> 
  air 

air %>% head(3)


[1m[22m`summarise()` has grouped output by 'year', 'DHB'. You can override using the
`.groups` argument.


DHB,year,sex,category,rf,value,type
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>
Auckland,2016,AllSex,Air quality,PM10_concentration_max,62.6,value
Auckland,2016,AllSex,Air quality,PM10_concentration_mean,16.03421,value
Auckland,2016,AllSex,Air quality,PM2.5_concentration_max,19.4,value


### 2.5. Ground water quality

In [11]:
#### Ground water quality ####

suppressWarnings(water_2014_to_2021 <- readxl::read_xlsx('data/raw/ground water qualitiy 2014-2021.xlsx'))

water_2014_to_2021 %>% 
  mutate(year =  format(as.Date(Date),'%Y')) %>% #extract year info
  filter(year >= 2011 & year <= 2020) %>% #filter year matching cancer data
  select( Latitude, Longitude, year, Indicator,CensoredValue) %>% 
  st_as_sf(.,coords = c("Longitude", "Latitude"), crs = 4326) %>% #transform into sf object for coordinates mapping
  st_join(.,DHB_map) %>% # coordinates mapping
  filter(!is.na(DHB)) -> #remove coordinates that failed to map to DHB region
  water_map

water_map %>%
  as.data.frame() %>%
  mutate(Indicator = str_replace(Indicator,"E\\. coli","E\\. Coli")) %>%
  select(-c(geometry)) %>%
  group_by(year,DHB,Indicator) %>% #summarize based on the year and DHB group
  summarise(censoredValue_max = max(CensoredValue,na.rm=T),
            censoredValue_mean = mean(CensoredValue,na.rm=T)) %>%
  pivot_longer(names_to = c('rf'),  values_to = "value", cols = -c(1:3)) %>%
  mutate(rf = paste0(Indicator,'_', str_remove(rf,'censoredValue_'))) %>%
  select(DHB,year,rf,value) %>%
  mutate(sex = "AllSex", category = 'Water quality' , .before =3) %>%
  mutate(type = "value") -> water

water %>% head(3)

[1m[22m`summarise()` has grouped output by 'year', 'DHB'. You can override using the
`.groups` argument.


DHB,year,sex,category,rf,value,type
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>
Auckland,2011,AllSex,Water quality,Chloride_max,18.7,value
Auckland,2011,AllSex,Water quality,Chloride_mean,17.5,value
Auckland,2011,AllSex,Water quality,Dissolved Reactive Phosphorus_max,0.078,value


## 3. Combine All Risk Factors data

In [12]:
## combine all rf datasets

rf <- reduce(list(nzhs_long,whs,children,income,edu,earthquake,tmp,water,air),rbind) %>%
  mutate(year = as.numeric(year))

rf %>% skimr::skim() # view data 

rf %>% mutate(checked = duplicated(paste(DHB,year,sex,category,rf)))  %>% count(checked) # check no duplicates


── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             56170     
Number of columns          7         
_______________________              
Column type frequency:               
  character                5         
  numeric                  2         
________________________             
Group variables            None      

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
[90m1[39m DHB                   0             1   5  34     0       22          0
[90m2[39m sex                   0             1   4   6     0        3          0
[90m3[39m category              0             1   4  13     0        9          0
[90m4[39m rf                    0             1   4  61     0      182          0
[90m5[39m type                  0             1   5  10     0        2          0

──

“'length(x) = 17 > 1' in coercion to 'logical(1)'”


Unnamed: 0_level_0,skim_type,skim_variable,n_missing,complete_rate,character.min,character.max,character.empty,character.n_unique,character.whitespace,numeric.mean,numeric.sd,numeric.p0,numeric.p25,numeric.p50,numeric.p75,numeric.p100,numeric.hist
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,character,DHB,0,1,5.0,34.0,0.0,22.0,0.0,,,,,,,,
2,character,sex,0,1,4.0,6.0,0.0,3.0,0.0,,,,,,,,
3,character,category,0,1,4.0,13.0,0.0,9.0,0.0,,,,,,,,
4,character,rf,0,1,4.0,61.0,0.0,182.0,0.0,,,,,,,,
5,character,type,0,1,5.0,10.0,0.0,2.0,0.0,,,,,,,,
6,numeric,year,0,1,,,,,,2015.1612,2.605127,2011.0,2013.0,2015.0,2018.0,2020.0,▇▇▇▇▅
7,numeric,value,0,1,,,,,,109.5362,6357.463406,-1.766667,6.4,16.25417,39.9,765000.0,▇▁▁▁▁


checked,n
<lgl>,<int>
False,56170


In [13]:
save(rf,file = 'data/clean/rf.Rdata')