Fix: namespace issues with hk_accidents #16

Merged: 16 commits merged on Sep 4, 2021

@martinctc martinctc commented Sep 3, 2021

Summary

This branch resolves the loading error with hk_accidents as mentioned in #15.

To resolve this error, a new implementation is introduced: download_data() downloads the requested dataset and reads it into R using fst::read_fst(). The datasets are stored on GitHub in the fst format for fast loading and high compression.

Changes

The changes made in this PR are:

  1. Removed the defunct method of loading hk_accidents, as it caused issues with CRAN's R CMD check.
  2. Added new fst files to GitHub (committed directly to the master branch, separately). These files are stored in data-ready, which is excluded from the package build via .Rbuildignore.
  3. Added function download_data().

New implementation

You can download and return a dataset with the following code:

download_data(dataset = "hk_accidents")
download_data(dataset = "hk_casualties")
download_data(dataset = "hk_vehicles")

The files are downloaded to a temporary directory, and the download_data() function returns a data frame directly.
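
A minimal sketch of how a function like download_data() could work is shown below. It is not the package's actual implementation: the helper names build_dataset_url() and download_data_sketch() are hypothetical, and the URL pattern is assumed from the raw GitHub link quoted in the review discussion.

```r
# Hypothetical helper: build the raw GitHub URL for a dataset name.
# The base URL is an assumption taken from the link quoted in the review.
build_dataset_url <- function(dataset) {
  base_url <- "https://github.com/Hong-Kong-Districts-Info/hkdatasets/raw/master/data-ready/"
  paste0(base_url, dataset, ".fst")
}

# Hypothetical sketch: download the fst file to a temporary directory
# and return its contents as a data frame.
download_data_sketch <- function(dataset) {
  dest <- file.path(tempdir(), paste0(dataset, ".fst"))
  # mode = "wb" keeps the binary fst file intact on Windows
  utils::download.file(build_dataset_url(dataset), destfile = dest, mode = "wb", quiet = TRUE)
  fst::read_fst(dest)
}

# Usage (requires network access and the fst package):
# hk_accidents <- download_data_sketch("hk_accidents")
```

Separating the URL construction from the download keeps the part that needs no network access easy to check on its own.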


Check

  • The R CMD checks pass.

Note

This fixes #15.

@martinctc martinctc self-assigned this Sep 3, 2021
@martinctc martinctc added the bug and enhancement labels Sep 3, 2021
@martinctc martinctc linked an issue Sep 3, 2021 that may be closed by this pull request

@KHwong12 KHwong12 left a comment

Thanks for this new implementation (and for introducing me to the fst format)! I have only one concern, about the download link.

Comment on lines 26 to 30
if (is.null(dataset)) {
  stop("please provide the name of the dataset to pull.")
}

Would it be better to impose a stricter check on the dataset name? If the user mistypes it, the resulting error message, cannot open URL 'https://github.com/Hong-Kong-Districts-Info/hkdatasets/raw/master/data-ready/SOME-RANDOM-CHARS.fst', may be hard to understand.

Suggested change
if (is.null(dataset)) {
  stop("please provide the name of the dataset to pull.")
}
AVAILABLE_DATASETS <- c("hk_accidents", "hk_casualties", "hk_vehicles")
if (!(dataset %in% AVAILABLE_DATASETS)) {
  stop(
    paste0(
      "Please provide the name of the dataset to pull.\n",
      "Datasets currently available: ",
      # Collapse the character vector into a single comma-separated string
      paste0(AVAILABLE_DATASETS, collapse = ", ")
    )
  )
}
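
As a rough illustration of the stricter check discussed here, the validation could be wrapped in a standalone helper so the error message can be inspected without touching the network. The function name check_dataset_name() and the exact wording of the message are hypothetical, not taken from the package.

```r
# Dataset names listed in this pull request
AVAILABLE_DATASETS <- c("hk_accidents", "hk_casualties", "hk_vehicles")

# Hypothetical helper: stop with a readable message unless `dataset`
# is a single, known dataset name; otherwise return it invisibly.
check_dataset_name <- function(dataset) {
  valid <- !is.null(dataset) &&
    length(dataset) == 1 &&
    dataset %in% AVAILABLE_DATASETS
  if (!valid) {
    stop(
      "Unknown dataset. Datasets currently available: ",
      paste(AVAILABLE_DATASETS, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(dataset)
}

# Capture the message produced by a mistyped name
msg <- tryCatch(
  check_dataset_name("hk_accident"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Checking the length as well guards against a vector of names being passed, which would otherwise make the if() condition fail with a less helpful error.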

martinctc (Contributor Author) replied:

That's a great suggestion. I'll add checks on the dataset argument. Did this work for you locally? I like fst for its fast loading speed, so hopefully this will be a fairly scalable solution in the long term.

KHwong12 (Contributor) replied:

Yep, I ran the script that downloads and loads the fst data on my local machine, and it works well.

martinctc (Contributor Author) replied:

Fantastic. I've now added some notes to NEWS.md and the README; this should be ready to merge to main and push to CRAN.

KHwong12 commented Sep 3, 2021

One additional question - are data types preserved when converted to fst format?

I compared the native dataset and the fst file of hk_accidents and do not see observable differences. Just want to ensure things will not go wrong for other datasets.

Native dataset
str(hk_accidents)
'data.frame':	95821 obs. of  32 variables:
 $ Date_Time                   : POSIXct, format: "2014-01-01 02:17:00" "2014-01-01 06:17:00" "2014-01-01 07:38:00" "2014-01-01 05:22:00" ...
 $ OBJECTID                    : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Year                        : num  2014 2014 2014 2014 2014 ...
 $ Serial_No_                  : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Severity                    : chr  "Serious" "Slight" "Slight" "Slight" ...
 $ District_Council_District   : chr  "E" "CW" "YTM" "N" ...
 $ Hit_and_Run                 : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ Weather                     : chr  "Clear" "Clear" "Clear" "Dull" ...
 $ Rain                        : chr  "Not raining" "Not raining" "Not raining" "Not raining" ...
 $ Natural_Light               : chr  "Daylight" "Dawn/Dusk" "Daylight" "Dark" ...
 $ Junction_Control            : chr  "Not junction" "Traffic signal" "Not junction" "Traffic signal" ...
 $ Vehicle_Movements           : chr  "One moving vehicle" "One moving vehicle" "One moving vehicle" "Two moving vehicles - from opposite direction" ...
 $ Type_of_Collision           : chr  "Vehicle collision with Object" "Vehicle collision with Nothing" "Vehicle collision with Object" "Vehicle collision with Vehicle" ...
 $ No_of_Vehicles_Involved     : num  1 1 1 3 1 1 2 2 1 1 ...
 $ No_of_Casualties_Injured    : num  1 1 2 2 1 1 1 6 1 2 ...
 $ Grid_E                      : num  840782 832909 835270 830780 837229 ...
 $ Grid_N                      : num  816576 816551 817969 840400 825912 ...
 $ latitude                    : num  22.3 22.3 22.3 22.5 22.4 ...
 $ longitude                   : num  114 114 114 114 114 ...
 $ Within_70m                  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ Precise_Location            : chr  "Near Lamppost 28647 Island Eastern Corridor" "Outside No. 137 Des Voeux Road West" "China Hong Kong City No. 33 - 33 Canton Road YT Kowloon Tower 3, Podium (No road closure)" "Near Lamppost AD8473 Po Shek Wu Road Sheung Shui New Territories 氕䊰嶟斷葇汙ç¼D8473" ...
 $ Accident                    : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ Junction_Type               : chr  NA "Cross roads" NA "Cross roads" ...
 $ Crossing_Control            : chr  "No crossing control" "On a crossing control" "No crossing control" "On a crossing control" ...
 $ Crossing_Type               : chr  NA NA NA NA ...
 $ Street_Name                 : chr  "ISLAND EASTERN CORRIDOR" "DES VOEUX ROAD WEST" "CANTON ROAD" "PO SHEK WU ROAD" ...
 $ Road_Type                   : chr  "One way" "Two way" "One way" "More than 2 carriageways" ...
 $ Cycle_Type                  : chr  "Others" "Others" "Others" "Others" ...
 $ Type_of_Collision_with_cycle: chr  "Vehicle collision with Object" "Vehicle collision with Nothing" "Vehicle collision with Object" "Vehicle collision with Vehicle" ...
 $ Structure_Type              : chr  "At grade road" "At grade road" "At grade road" "At grade road" ...
 $ Road_Hierarchy              : chr  "Expressway" "Main Road" "Secondary Road" "Main Road" ...
 $ Road_Ownership              : chr  "Public Road" "Public Road" "Public Road" "Public Road" ...
fst file
str(hk_accidents_fst_down)
'data.frame':	95821 obs. of  32 variables:
 $ Date_Time                   : POSIXct, format: "2014-01-01 02:17:00" "2014-01-01 06:17:00" "2014-01-01 07:38:00" "2014-01-01 05:22:00" ...
 $ OBJECTID                    : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Year                        : num  2014 2014 2014 2014 2014 ...
 $ Serial_No_                  : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Severity                    : chr  "Serious" "Slight" "Slight" "Slight" ...
 $ District_Council_District   : chr  "E" "CW" "YTM" "N" ...
 $ Hit_and_Run                 : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
 $ Weather                     : chr  "Clear" "Clear" "Clear" "Dull" ...
 $ Rain                        : chr  "Not raining" "Not raining" "Not raining" "Not raining" ...
 $ Natural_Light               : chr  "Daylight" "Dawn/Dusk" "Daylight" "Dark" ...
 $ Junction_Control            : chr  "Not junction" "Traffic signal" "Not junction" "Traffic signal" ...
 $ Vehicle_Movements           : chr  "One moving vehicle" "One moving vehicle" "One moving vehicle" "Two moving vehicles - from opposite direction" ...
 $ Type_of_Collision           : chr  "Vehicle collision with Object" "Vehicle collision with Nothing" "Vehicle collision with Object" "Vehicle collision with Vehicle" ...
 $ No_of_Vehicles_Involved     : num  1 1 1 3 1 1 2 2 1 1 ...
 $ No_of_Casualties_Injured    : num  1 1 2 2 1 1 1 6 1 2 ...
 $ Grid_E                      : num  840782 832909 835270 830780 837229 ...
 $ Grid_N                      : num  816576 816551 817969 840400 825912 ...
 $ latitude                    : num  22.3 22.3 22.3 22.5 22.4 ...
 $ longitude                   : num  114 114 114 114 114 ...
 $ Within_70m                  : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ Precise_Location            : chr  "Near Lamppost 28647 Island Eastern Corridor" "Outside No. 137 Des Voeux Road West" "China Hong Kong City No. 33 - 33 Canton Road YT Kowloon Tower 3, Podium (No road closure)" "Near Lamppost AD8473 Po Shek Wu Road Sheung Shui New Territories 氕䊰嶟斷葇汙ç¼D8473" ...
 $ Accident                    : logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
 $ Junction_Type               : chr  NA "Cross roads" NA "Cross roads" ...
 $ Crossing_Control            : chr  "No crossing control" "On a crossing control" "No crossing control" "On a crossing control" ...
 $ Crossing_Type               : chr  NA NA NA NA ...
 $ Street_Name                 : chr  "ISLAND EASTERN CORRIDOR" "DES VOEUX ROAD WEST" "CANTON ROAD" "PO SHEK WU ROAD" ...
 $ Road_Type                   : chr  "One way" "Two way" "One way" "More than 2 carriageways" ...
 $ Cycle_Type                  : chr  "Others" "Others" "Others" "Others" ...
 $ Type_of_Collision_with_cycle: chr  "Vehicle collision with Object" "Vehicle collision with Nothing" "Vehicle collision with Object" "Vehicle collision with Vehicle" ...
 $ Structure_Type              : chr  "At grade road" "At grade road" "At grade road" "At grade road" ...
 $ Road_Hierarchy              : chr  "Expressway" "Main Road" "Secondary Road" "Main Road" ...
 $ Road_Ownership              : chr  "Public Road" "Public Road" "Public Road" "Public Road" ...

@martinctc martinctc commented

> One additional question - are data types preserved when converted to fst format?
>
> I compared the native dataset and the fst file of hk_accidents and do not see observable differences. Just want to ensure things will not go wrong for other datasets.

As far as I'm aware all the main types like string, numeric, factors, and dates are preserved in fst. Unless we start doing trickier things like list-columns, we should be fine!
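
One way to check this for other datasets is a small round-trip comparison of column classes, sketched below. The helper compare_column_classes() and the toy data frame are hypothetical; the fst part only runs if the fst package is installed, and no particular result is claimed here.

```r
# Hypothetical helper: return the names of columns whose classes differ
# between two data frames that share the same column names.
compare_column_classes <- function(a, b) {
  class_label <- function(col) paste(class(col), collapse = "/")
  classes_a <- vapply(a, class_label, character(1))
  classes_b <- vapply(b, class_label, character(1))
  if (!identical(names(classes_a), names(classes_b))) {
    stop("Column names differ.")
  }
  names(classes_a)[classes_a != classes_b]
}

# Round-trip a toy data frame through fst, if the package is available
if (requireNamespace("fst", quietly = TRUE)) {
  original <- data.frame(
    when = as.POSIXct("2014-01-01 02:17:00", tz = "UTC"),
    day = as.Date("2014-01-01"),
    severity = factor("Serious"),
    hit_and_run = FALSE,
    n_vehicles = 1,
    street = "CANTON ROAD",
    stringsAsFactors = FALSE
  )
  path <- tempfile(fileext = ".fst")
  fst::write_fst(original, path)
  round_trip <- fst::read_fst(path)
  # An empty character vector means every column kept its class
  print(compare_column_classes(original, round_trip))
}
```

The same helper could be applied to a native dataset and its downloaded fst counterpart instead of comparing str() output by eye.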

KHwong12 commented Sep 4, 2021

> > One additional question - are data types preserved when converted to fst format?
> > I compared the native dataset and the fst file of hk_accidents and do not see observable differences. Just want to ensure things will not go wrong for other datasets.
>
> As far as I'm aware all the main types like string, numeric, factors, and dates are preserved in fst. Unless we start doing trickier things like list-columns, we should be fine!

Great. We probably won't use list-columns anyway, as the datasets should be kept as simple as possible.

@martinctc martinctc merged commit b940c96 into master Sep 4, 2021
Linked issue: hk_accidents namespace not found