Adds metadata attributes to ls.socrata(). Fixes #72 #113

tomschenkjr · 2016-10-31T22:26:11Z

This provides a fix for #72.

The catalog fields are now downloaded from Socrata's Project Open Data Metadata file, data.json, and attaches it as an attribute to the data frame obtained from ls.socrata().

Added a test to ensure that the catalog fields are present as attributes.

@geneorama - can you review this pull request and accept if everything looks good?

coveralls · 2016-10-31T22:31:01Z

Coverage increased (+0.1%) to 96.25% when pulling 0ed533d on issue72 into 47d0366 on dev.

geneorama · 2016-11-03T17:34:11Z

@tomschenkjr

During my review I remembered the issues that I had previously encountered with ls.socrata and the solutions that I developed.

Technically speaking ls.socrata does return a data.frame, but as it stands the data.frame is hard to use as a data.frame because it's not completely flattened. Whenever I use ls.socrata I examine the structure of the output with str and the result is so verbose that I can't see the top and bottom of the output.

I think it would be nice if the output columns were vectors, and I think there's a good argument to do it within this issue.

The columns contactPoint, publisher, keyword, and theme can be pretty easily converted to vectors, but the distribution column is more complicated. I've outlined some approaches to handling the distribution column below.

The columns contactPoint, publisher are easy to flatten into a single row with two columns. So, contactPoint could become contactPoint:<column1>, contactPoint:<column2>. In the past I've used the following approach:

cp <- do.call(cbind, df[, "contactPoint"])
colnames(cp) <- paste0("contactPoint:", colnames(cp))
df$contactPoint <- NULL
df <- cbind(df, cp)

The key keywords and theme columns are currently lists, but we could easily flatten them by converting them to converting them to JSON. I have been using this approach:

# clean up keywords
keywords <- data_df$keyword
data_df$keyword <- NA_character_
for(i in 1:nrow(data_df)){
  if(!is.null(keywords[[i]])){
    data_df$keyword[i] <- jsonlite::toJSON(keywords[[i]])
  }
}

The distribution column is a list that contains irregular and somewhat large data.frames. As such, it would be hard to flatten each element of distribution into a single row. However, distribution can be pretty easily normalized and put into a separate data.frame. Here's what I did to separate it out:

distro <- df[, "distribution"]
df[, "distribution"] <- NULL
distro <- lapply(1:length(distro), function(i)
    distro[[i]] <- cbind(landingPage = df$landingPage[i], distro[[i]])) ## add landing page as a key
distro <- do.call(rbind, distro)

This is nice, but now we would have df_data and distro to return to the user, which introduces design questions.

Here are some approaches that I would propose to handling the distribution return value.

Scenario 1

At first I thought this would require a two part return value (table + distribution), but then I realized that we could put it into the attributes. It might sound crazy to put a data.frame into the attributes, but there are other functions that do this such as the regex functions.

I've worked out an example of the code for option 1.
https://gist.github.com/geneorama/c03a2db1463b32622f07e8779a8cb712

Scenario 2

We could start returning a list with each element listed separately. In a lot of ways this is the cleanest, most flexible approach. However, it's kind of annoying.

df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed.
distribution as a separate flattened table
@context
@id
@type
conformsTo
describedBy

It would be less annoying with two return values (df_data, and the distribution), and we could continue to attach the other elements as attributes to df_data. This is a new kind of annoying, with attributes nested within a return value.

Scenario 3

We could make it so that the distribution information isn't returned by default, and create a separate function that would retrieve that information if the user wanted (or it could be an option) in ls.socrata.

In that scenario ls.socrata would return

df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed. There would be no distribution column

Something else like ls.socrata.distros would return the distro data.frame.

Scenario 4

We could not worry about it, and live with the complicated data.frame, or make it a feature request.

EDIT: Renamed headings to "Scenario" from "Option" and added some explanation to Scenario 2, and added a second sub option to scenario 2.

tomschenkjr · 2016-11-03T17:35:51Z

Make it a feature request. Purpose of this sprint is to fix bugs.

From: Gene Leynes [mailto:notifications@github.com]
Sent: Thursday, November 03, 2016 12:34 PM
To: Chicago/RSocrata RSocrata@noreply.github.com
Cc: Schenk, Tom Tom.Schenk@cityofchicago.org; Mention mention@noreply.github.com
Subject: Re: [Chicago/RSocrata] Adds metadata attributes to ls.socrata(). Fixes #72 (#113)

@tomschenkjrhttps://github.com/tomschenkjr

During my review I remembered the issues that I had previously encountered with ls.socrata and the solutions that I developed.

Technically speaking ls.socrata does return a data.frame, but as it stands the data.frame is hard to use as a data.frame because it's not completely flattened. Whenever I use ls.socrata I examine the structure of the output with str and the result is so verbose that I can't see the top and bottom of the output.

I think it would be nice if the output columns were vectors, and I think there's a good argument to do it within this issue.

The columns contactPoint, publisher, keyword, and theme can be pretty easily converted to vectors, but the distribution column is more complicated. I've outlined some approaches to handling the distribution column below.

The columns contactPoint, publisher are easy to flatten into a single row with two columns. So, contactPoint could become contactPoint:, contactPoint:. In the past I've used the following approach:

cp <- do.call(cbind, df[, "contactPoint"])

colnames(cp) <- paste0("contactPoint:", colnames(cp))

df$contactPoint <- NULL

df <- cbind(df, cp)

The key keywords and theme columns are currently lists, but we could easily flatten them by converting them to converting them to JSON. I have been using this approach:

clean up keywords

keywords <- data_df$keyword

data_df$keyword <- NA_character_

for(i in 1:nrow(data_df)){

if(!is.null(keywords[[i]])){

data_df$keyword[i] <- jsonlite::toJSON(keywords[[i]])

}

The distribution column is a list that contains irregular and somewhat large data.frames. As such, it would be hard to flatten each element of distribution into a single row. However, distribution can be pretty easily normalized and put into a separate data.frame. Here's what I did to separate it out:

distro <- df[, "distribution"]

df[, "distribution"] <- NULL

distro <- lapply(1:length(distro), function(i)

distro[[i]] <- cbind(landingPage = df$landingPage[i], distro[[i]])) ## add landing page as a key

distro <- do.call(rbind, distro)

This is nice, but now we would have df_data and distro to return to the user, which introduces design questions.

Here are some approaches that I would propose to handling the distribution return value.

Option 1

At first I thought this would require a two part return value (table + distribution), but then I realized that we could put it into the attributes. It might sound crazy to put a data.frame into the attributes, but there are other functions that do this such as the regex functions.

I've worked out an example of the code for option 1.
https://gist.github.com/geneorama/c03a2db1463b32622f07e8779a8cb712

Option 2

We could start returning a list with each element listed separately.

df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed.
distribution as a separate flattened table
@context
@id
@type
conformsTo
describedBy

Option 3

We could make it so that the distribution information isn't returned by default, and create a separate function that would retrieve that information if the user wanted (or it could be an option) in ls.socrata.

In that scenario ls.socrata would return

df_data : the main table that we already have, but flattened into a proper data.frame. This would keep the attributes as proposed. There would be no distribution column

Something else like ls.socrata.distros would return the distro data.frame.

Option 4

We could not worry about it, and live with the complicated data.frame, or make it a feature request.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com//pull/113#issuecomment-258216195, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABkC0aIMj4E5hNB2zwxXNntjyFQwIomPks5q6hsUgaJpZM4KllKn.

This e-mail, and any attachments thereto, is intended only for use by the addressee(s) named herein and may contain legally privileged and/or confidential information. If you are not the intended recipient of this e-mail (or the person responsible for delivering this document to the intended recipient), you are hereby notified that any dissemination, distribution, printing or copying of this e-mail, and any attachment thereto, is strictly prohibited. If you have received this e-mail in error, please respond to the individual sending the message, and permanently delete the original and any copy of any e-mail and printout thereof.

geneorama · 2016-11-03T17:47:51Z

Based on conversation just now with @tomschenkjr I'm closing the pull request as it stands, and copying the previous comment to a new issue.

geneorama · 2016-11-03T17:49:23Z

Had to reopen to merge

tomschenkjr and others added 3 commits October 30, 2016 22:28

Added @id, @context, etc. to ls.socrata. See #72

8451789

Fixes #72

4820653

Added example on attributes from ls.socrata. See #72

0ed533d

tomschenkjr added this to the v1.7.1 milestone Oct 31, 2016

tomschenkjr assigned geneorama Oct 31, 2016

tomschenkjr added the in progress label Oct 31, 2016

PriyaDoIT added the enhancement label Nov 3, 2016

geneorama closed this Nov 3, 2016

tomschenkjr removed the in progress label Nov 3, 2016

geneorama reopened this Nov 3, 2016

tomschenkjr added the in progress label Nov 3, 2016

geneorama merged commit bb5e4e9 into dev Nov 3, 2016

tomschenkjr removed the in progress label Nov 3, 2016

PriyaDoIT deleted the issue72 branch November 16, 2016 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds metadata attributes to ls.socrata(). Fixes #72 #113

Adds metadata attributes to ls.socrata(). Fixes #72 #113

tomschenkjr commented Oct 31, 2016

coveralls commented Oct 31, 2016 •

edited

Loading

geneorama commented Nov 3, 2016 •

edited

Loading

tomschenkjr commented Nov 3, 2016

geneorama commented Nov 3, 2016

geneorama commented Nov 3, 2016

Adds metadata attributes to ls.socrata(). Fixes #72 #113

Adds metadata attributes to ls.socrata(). Fixes #72 #113

Conversation

tomschenkjr commented Oct 31, 2016

coveralls commented Oct 31, 2016 • edited Loading

geneorama commented Nov 3, 2016 • edited Loading

Scenario 1

Scenario 2

Scenario 3

Scenario 4

tomschenkjr commented Nov 3, 2016

clean up keywords

geneorama commented Nov 3, 2016

geneorama commented Nov 3, 2016

coveralls commented Oct 31, 2016 •

edited

Loading

geneorama commented Nov 3, 2016 •

edited

Loading