Added Lists in the New Data Frame dialog #7468

Merged

Conversation

anastasia-mbithe
Contributor

@anastasia-mbithe anastasia-mbithe commented May 11, 2022

Fixes #7387
@rdstern, @N-thony @lloyddewit.
I have added the second part of the issue to the dialog. It is ready for review.

Collaborator

@rdstern rdstern left a comment

@anastasia-mbithe this is great, and it mostly works.
a) I suggest adding a checkbox, unchecked by default, labelled something like "R Command". If checked, it shows the box with the command, and that box could be shorter as well.
b) It mostly works fine and gives the lists we need. A few do not work, and I think they all give the same error. Here is an example:
image

Others with the same error include words/word_clues, where none of words_five, four, six work.
This error could be a bit more of a puzzle to diagnose?

@anastasia-mbithe
Contributor Author

@rdstern, can we have the lists as a tibble instead of as a data frame?

@rdstern
Collaborator

rdstern commented May 16, 2022

I assume that could be nice, but I am not sure why you are asking for this? What's the advantage? I am adding @lilyclements who is much better placed to answer/confirm this.

@anastasia-mbithe
Contributor Author

@rdstern, a tibble can accommodate the unequal number of rows, and that could fix our issue here. Though the general output looks different from what data.frame gives.

games cluedo
This is what we will have for the "games/cluedo"

animals dinosaurs
And this for the "animals/dinosaurs"

@lilyclements
Contributor

@anastasia-mbithe can you post the R code for the case where we use a tibble and for the case where we use a table? That might give me a better idea of the aim of the dialog.

I know we do want to use tibbles instead of data frames in R Instat. It does introduce some differences (eg no row names). So this could be a good place to start on that. I’m not entirely sure what the “hold up” is on that.

@anastasia-mbithe
Contributor Author

tibble
This is the code when using tibble and it runs fine.

errorTable
I have used table and it gives this error too.

@anastasia-mbithe
Contributor Author

@lilyclements, The other option we have is filling the shortest column with NAs to match the number of rows of the longest column, then having it as a data frame. I found cbind.fill()/ rbind.fill(), though I couldn't find the functions in the current R version. Do you know of another function that could do the same?
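
One way to pad shorter columns with NAs without `cbind.fill()` is sketched below in base R. It is an illustration only, not code from this PR: `pad_to_data_frame` is a hypothetical helper name, and it assumes a flat named list of atomic vectors (nested lists such as "games/cluedo" would need flattening first).

```r
# Hypothetical helper: pad every element of a flat named list with NA
# so that all elements have the same length, then build a data frame.
pad_to_data_frame <- function(x) {
  n <- max(lengths(x))
  padded <- lapply(x, function(v) {
    length(v) <- n  # extending an atomic vector fills the new slots with NA
    v
  })
  as.data.frame(padded, stringsAsFactors = FALSE)
}

# Toy input with elements of different lengths
example <- list(
  rooms = c("kitchen", "ballroom", "library"),
  victim = "Dr Black"
)
padded_example <- pad_to_data_frame(example)
print(padded_example)
```

The padding relies on the fact that assigning a larger `length()` to an atomic vector extends it with `NA`, so no extra package is needed.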

@rdstern
Collaborator

rdstern commented May 17, 2022

@anastasia-mbithe I really like your search for solutions here. I was happy with the data layout for the dinosaurs in a data frame, and would not want to lose that layout. I'd like to solve the importing of these troublesome lists with something that could be used by schoolchildren, even if they have to learn some data manipulation. So, making the data into a proper list, or into multiple variables would be ideal.

@N-thony
Collaborator

N-thony commented May 19, 2022

> @anastasia-mbithe I really like your search for solutions here. I was happy with the data layout for the dinosaurs in a data frame, and would not want to lose that layout. I'd like to solve the importing of these troublesome lists with something that could be used by schoolchildren, even if they have to learn some data manipulation. So, making the data into a proper list, or into multiple variables would be ideal.

@anastasia-mbithe any progress?

@anastasia-mbithe
Contributor Author

> > @anastasia-mbithe I really like your search for solutions here. I was happy with the data layout for the dinosaurs in a data frame, and would not want to lose that layout. I'd like to solve the importing of these troublesome lists with something that could be used by schoolchildren, even if they have to learn some data manipulation. So, making the data into a proper list, or into multiple variables would be ideal.
>
> @anastasia-mbithe any progress?

@N-thony Not yet, I am still looking for a way to have the "particular" lists as a data frame.

@lilyclements
Contributor

lilyclements commented May 23, 2022

@rdstern @anastasia-mbithe I don't think I fully understand what we want the output to be in R-Instat for something like games/cluedo.

In R, if I run rcorpora::corpora("games/cluedo"), I get a lot of lists with different lengths:

$description
[1] "Characters, rooms and weapons from the board game Cluedo / Clue."

$victim
$victim$Cluedo
[1] "Dr Black"

$victim$Clue
[1] "Mr Boddy"


$suspects
$suspects$Cluedo
[1] "Miss Scarlett"   "Professor Plum"  "Mrs Peacock"     "Reverend Green"  "Colonel Mustard" "Mrs White"      

$suspects$Clue
[1] "Miss Scarlet"    "Professor Plum"  "Mrs Peacock"     "Mr Green"        "Colonel Mustard" "Mrs White"      


$weapons
$weapons$Cluedo
[1] "candlestick" "dagger"      "lead pipe"   "revolver"    "rope"        "spanner"    

$weapons$Clue
[1] "candlestick" "knife"       "lead pipe"   "revolver"    "rope"        "wrench"     


$rooms
 [1] "kitchen"       "ballroom"      "conservatory"  "dining room"   "cellar"        "billiard room" "library"       "lounge"        "hall"          "study"        

$secret_passages
     [,1]      [,2]          
[1,] "kitchen" "study"       
[2,] "lounge"  "conservatory"
  1. @rdstern How would we want this displayed in R-Instat?
  2. If we do the tibble suggestion from @anastasia-mbithe we get the output below
    data.frame(tibble::tibble(rcorpora::corpora("games/cluedo")))
    This makes the most sense to me.
                                                                                                                                        rcorpora..corpora..games.cluedo..
1                                                                                                        Characters, rooms and weapons from the board game Cluedo / Clue.
2                                                                                                                                                      Dr Black, Mr Boddy
3 Miss Scarlett, Professor Plum, Mrs Peacock, Reverend Green, Colonel Mustard, Mrs White, Miss Scarlet, Professor Plum, Mrs Peacock, Mr Green, Colonel Mustard, Mrs White
4                                                          candlestick, dagger, lead pipe, revolver, rope, spanner, candlestick, knife, lead pipe, revolver, rope, wrench
5                                                                       kitchen, ballroom, conservatory, dining room, cellar, billiard room, library, lounge, hall, study
6                                                                                                                                    kitchen, lounge, study, conservatory
  3. Running a list with just two elements, such as rcorpora::corpora("animals/dinosaurs"), works in creating a data frame.
    (I say two elements, because it gives "description" and "dinosaurs").
    Using tibble here makes it messy.
    (tibble::tibble(rcorpora::corpora("animals/dinosaurs")))
    So, just to make it a bit more complicated, we want to say something like:
data <- rcorpora::corpora("animals/dinosaurs") 
if (length(data) == 2){
  data.frame(data = data.frame(data))
} else {
  data.frame(data = data.frame(tibble::tibble(data)))
}

The "if" statement can be in VB code not R code, and so this can be run in a sub.

(side note: as_tibble does not work for lists of different sizes, which is why I am using tibble)

  4. So, I agree we should use tibble for lists whose elements have different sizes (e.g. "games/cluedo"), but only if that is the sort of output we want! If we do use tibble, we should use it only for lists whose elements differ in size.
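
As an alternative to testing `length(data) == 2`, the choice between the two layouts could be driven by whether a plain data frame can be built at all. The sketch below is an assumption-laden illustration in base R, not the approach adopted in this PR: `corpus_to_data_frame` is a hypothetical name, the inputs are toy lists rather than real rcorpora output, and the fallback collapses each element into a single text value instead of using tibble. Note that `data.frame()` recycles length-one elements, which is why a description plus one longer vector works.

```r
# Hypothetical helper: try a plain data frame first, and fall back to one row
# per top-level element when the element sizes cannot be combined.
corpus_to_data_frame <- function(x) {
  tryCatch(
    data.frame(x, stringsAsFactors = FALSE),
    error = function(e) {
      data.frame(
        name = names(x),
        value = vapply(
          x,
          function(element) paste(unlist(element), collapse = ", "),
          character(1),
          USE.NAMES = FALSE
        ),
        stringsAsFactors = FALSE
      )
    }
  )
}

# Toy lists (not real rcorpora output)
dinosaurs_like <- list(
  description = "demo list",
  dinosaurs = c("Aardonyx", "Abelisaurus")
)
cluedo_like <- list(
  description = "demo list",
  suspects = c("Miss Scarlett", "Professor Plum"),
  rooms = c("kitchen", "ballroom", "library")
)

dinosaurs_df <- corpus_to_data_frame(dinosaurs_like)  # plain data frame
cluedo_df <- corpus_to_data_frame(cluedo_like)        # fallback layout
```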

@lilyclements
Contributor

@rdstern I've noticed this is in "blocker" - would you mind having a look at my comment above when you get the time?

@rdstern
Collaborator

rdstern commented May 31, 2022

I am not sure whether I am answering your points above. I hope:
a) Most of the categories can be read?
b) We can quickly find a workable solution?
c) We could have examples we trap and don't allow.

Here is another example, that I quite like - more than cluedo - so I'd be just slightly disappointed if it can't be read?
rcorpora::corpora("words/word_clues/clues_five")
This gives:

# Code run from Script Window
rcorpora::corpora("words/word_clues/clues_five")

$description
[1] "a list of common 5-letter words followed by crossword/thesaurus-style hints for that word"

$data
$data$abase
 [1] "put down"     "humiliate"    "cut down"     "bring down"   "belittle"    
 [6] "put to shame" "humble"       "lower"        "demean"       "degrade"     

$data$abate
[1] "diminish"  "let up"    "fade away" "die down"  "lessen"    "drop off" 
[7] "subside"   "fall off" 

$data$abbot
[1] "monastery head"   "monastic title"   "monk's superior"  "monastery leader"

$data$abort
[1] "scrub a space mission" "cancel"                "call off"             
[4] "cut short"             "halt"                 

$data$about
[1] "concerning"   "regarding"    "aproximately"

$data$abuse
[1] "take advantage of" "bully"             "wrong"            
[4] "mishandle"         "corrupt practice" 


and so on. It is a long list. I was hoping that could be stored in 3 variables, namely the first with the description, the second with the name, and the third with the list (e.g. "take advantage of", "bully", etc.). Then we can always use our text handling facilities to split the list into the separate bits if we wish. (We may have to extend our existing dialogues and that would be fine.)
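
The three-variable layout described above could look roughly like the following base R sketch. It uses a small hand-built list shaped like the output shown earlier (only two words, with shortened clue lists), and the column names `description`, `name` and `list` are assumptions taken from the wording of the comment rather than from the merged code.

```r
# Toy list shaped like rcorpora::corpora("words/word_clues/clues_five")
clues <- list(
  description = "a list of common 5-letter words followed by hints",
  data = list(
    abase = c("put down", "humiliate"),
    abate = c("diminish", "let up", "lessen")
  )
)

# One row per word: the description is recycled, and the clues for each word
# are collapsed into a single comma-separated text value.
clue_table <- data.frame(
  description = clues$description,
  name = names(clues$data),
  list = vapply(clues$data, paste, character(1), collapse = ", ",
                USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(clue_table)
```

Keeping the clues as one delimited string per word leaves the later splitting step to the existing text-handling dialogs, as suggested above.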

@lilyclements
Contributor

lilyclements commented May 31, 2022

@rdstern we have three different examples here.
A list of two, but with a single entry in the second part (the dinosaur name)
df <- rcorpora::corpora("animals/dinosaurs")

A list of two, but with multiple entries in the second part (e.g. "take advantage of", "bully", ...)
df <- rcorpora::corpora("words/word_clues/clues_five")

A list of more than two
df <- rcorpora::corpora("games/cluedo")

To handle these three cases, I've written some R code (below):

read_corpora <- function(data){
  data_unlist <- NULL
  description <- NULL
  for (i in 1:length(df)){
    if (names(df[i]) == "description") {
      description <- df[i][[1]]
    } else {
      if (class(df[[i]]) == "character"){
        data_unlist[[i]] <- data.frame(list = df[[i]])
      } else if (class(df[[i]]) == "list"){
        data_unlist_i <- purrr::map(.x = names(df[[i]]), .f = ~data.frame(list = df[[i]][[.x]]))
        names(data_unlist_i) <- names(df[[i]])
        data_unlist[[i]] <- plyr::ldply(data_unlist_i, .id = "name")
      } else if ("matrix" %in% class(df[[i]])){
        data_unlist[[i]] <- data.frame(list = do.call(paste, c(data.frame(df[[i]]), sep="-")))
      } else if (class(df[[i]]) == "data.frame"){
        data_unlist[[i]] <- data.frame(list = df[[i]])
      }
    }
  }
  names(data_unlist) <- names(df)
  data_unlist <- plyr::ldply(data_unlist, .id = "variable")
  if (!is.null(description)){
    data_full <- data.frame(description = description, data_unlist)
  } else {
    data_full <- data.frame(data_unlist)
  }
  return(data_full)
}

We could make this into a function to be called into the dialog for this particular package? Or would we rather avoid writing our own functions?
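
The type checks in the function above compare `class(x)` with a single string. For a matrix, `class()` returns `c("matrix", "array")` in R 4.0 and later, so such a comparison has length two, and `if()` rejects a condition of length greater than one from R 4.2.0 onwards. A possible variant of the dispatch using base R predicates is sketched below. `element_to_df` is a hypothetical name, and the list branch assumes a named list of atomic vectors one level deep; it is not the code that was merged.

```r
# Hypothetical helper: convert one element of a corpus into a data frame.
# Order matters: a data frame is also a list, so it is tested first.
element_to_df <- function(x) {
  if (is.data.frame(x)) {
    x
  } else if (is.matrix(x)) {
    # Join the cells of each row, e.g. "kitchen-study"
    data.frame(list = apply(x, 1, paste, collapse = "-"),
               stringsAsFactors = FALSE)
  } else if (is.list(x)) {
    # Assumes a named list of atomic vectors (one level of nesting)
    data.frame(name = rep(names(x), lengths(x)),
               list = unlist(x, use.names = FALSE),
               stringsAsFactors = FALSE)
  } else {
    data.frame(list = x, stringsAsFactors = FALSE)
  }
}

passages <- matrix(c("kitchen", "lounge", "study", "conservatory"), ncol = 2)
victim <- list(Cluedo = "Dr Black", Clue = "Mr Boddy")

passages_df <- element_to_df(passages)
victim_df <- element_to_df(victim)
rooms_df <- element_to_df(c("kitchen", "ballroom"))
```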

@rdstern
Collaborator

rdstern commented May 31, 2022

@lilyclements I like the idea of functions! I don't see why we are avoiding them.

@lilyclements
Contributor

lilyclements commented May 31, 2022

@rdstern great, should we store the function in stand_alone_functions.R?

I have tested it with some of the data sets in the rcorpora package (but not all given there are a lot!)

@anastasia-mbithe
Contributor Author

@lilyclements, I made the changes although there's a consistent error for all datasets like I had explained to you before. Kindly have a look at the code.

image

@N-thony
Collaborator

N-thony commented Jun 14, 2022

> @lilyclements, I made the changes although there's a consistent error for all datasets like I had explained to you before. Kindly have a look at the code.
>
> image

@lilyclements please could you help here? Thanks.

@anastasia-mbithe the issue with the R function was just that the parameter is named `data`, but inside the function it is referred to as `df`. This should fix the issue with the R function!
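
A minimal reproduction of this kind of naming mismatch is sketched below, using hypothetical functions `broken_names` and `fixed_names` rather than the real `read_corpora`. When no object called `df` exists in the calling environment, R typically resolves the name to the F-distribution density function `stats::df`, so subsetting it fails no matter which dataset is passed in. Whether this matches the error in the screenshot is an assumption.

```r
# The argument is called `data`, but the body uses `df` by mistake.
broken_names <- function(data) {
  names(df[1])
}

# Same function with the body referring to the argument.
fixed_names <- function(data) {
  names(data[1])
}

x <- list(description = "demo", values = 1:3)

# Capture the error instead of stopping the script
broken_result <- tryCatch(broken_names(x), error = function(e) e)
fixed_result <- fixed_names(x)
```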
@anastasia-mbithe
Contributor Author

@lilyclements Thank you for the changes. The function is now working well, even for the "special" case lists. As I was testing randomly, everything worked until I got to "religion/christian_saints", which gives this error:

image

I checked the list in R and noticed it has two variables with many missing values. Is there a way we can add NAs to the empty cells?

@anastasia-mbithe good find, and good investigation. The issue here wasn't to do with the NA values, but instead that the function doesn't have a way to handle this "case".
The function, until now, has only handled lists. However, this is a data frame.

``` 
x <- rcorpora::corpora("religion/christian_saints")
head(x)
class(x)
```

So we just have to add in a new `if` statement for how to handle the data if it is a data frame (that is, in its simplest case!).
I've done this and made the changes in this commit.
This is ready for you again now.
@anastasia-mbithe
Contributor Author

Thank you so much @lilyclements.
@rdstern and @lloyddewit, this is ready for review.

@lilyclements
Contributor

lilyclements commented Jul 4, 2022

@rdstern there was a small bug in the last version which prevented some of the functions from working, including religion > religions and the words/word ones. This is now sorted.

I'll look at "Transportation > Commercial airlines" and "Words/Verbs with conjugations" tomorrow.

@rdstern
Collaborator

rdstern commented Jul 4, 2022

Looks very good now. religion works as do the word clues. The only 2 I can find that don't work now are
a) transportation > commercial aircraft - but that's of minor importance.
b) words > verbs with conjugations still gives the annoying error that then persists. But perhaps that can wait? If so, I'll approve now.

@lilyclements
Contributor

@rdstern i suggest we write them up in an issue. I’m happy to write the issue tomorrow.

rdstern
rdstern previously approved these changes Jul 5, 2022
Collaborator

@rdstern rdstern left a comment

This is great and almost everything is working now. I am approving in its current form, though there may be a new issue to enable the final couple of lists to be imported - or at least trapped.

Contributor

@lloyddewit lloyddewit left a comment

@lilyclements Thanks for the recent changes, I just have a couple of comments about readability

instat/static/InstatObject/R/stand_alone_functions.R (outdated)
Comment on lines 2566 to 2613
if (length(data[[i]]) == 0) {
  data_unlist[[i]] <- data.frame(NA)
} else {
  for (j in 1:length(data[[i]])){
    if (class(data[[i]][[j]]) %in% c("character", "factor", "logical", "numeric", "integer")){
      data_unlist_2[[j]] <- data.frame(list = data[[i]][[j]])
    } else if (class(data[[i]][[j]]) == "list"){
      if (length(data[[i]][[j]]) == 0) {
        data_unlist_3[[j]] <- data.frame(list = NA)
      } else {
        for (k in 1:length(data[[i]][[j]])){
          if (class(data[[i]][[j]][[k]]) %in% c("character", "factor", "logical", "numeric", "integer")){
            data_unlist_3[[k]] <- data.frame(list = data[[i]][[j]][[k]])
          } else if (class(data[[i]][[j]][[k]]) == "list"){
            if (length(data[[i]][[j]][[k]]) == 0){
              data_unlist_4[[k]] <- data.frame(list = NA)
            } else {
              for (l in 1:length(data[[i]][[j]][[k]])){
                if (class(data[[i]][[j]][[k]][[l]]) %in% c("character", "factor", "logical", "numeric", "integer")){
                  data_unlist_4[[l]] <- data.frame(list = data[[i]][[j]][[k]][[l]])
                } else if (class(data[[i]][[j]][[k]][[l]]) == "list"){
                  if (length(data[[i]][[j]][[k]][[l]]) == 0) {
                    data_unlist_4[[l]] <- data.frame(list = NA)
                  } else {
                    if (!is.null(names(data[[i]][[j]][[k]][[l]]))){
                      data_unlist_2_i <- purrr::map(.x = names(data[[i]][[j]][[k]][[l]]), .f = ~data.frame(list = data[[i]][[j]][[k]][[l]][[.x]]))
                      names(data_unlist_2_i) <- names(data[[i]][[j]][[k]][[l]])
                      data_unlist_4[[l]] <- plyr::ldply(data_unlist_2_i, .id = "variable4")
                    } else {
                      data_unlist_4[[l]] <- (plyr::ldply(data[[i]][[j]][[k]][[l]], rbind, .id = "variable4"))
                    }
                  }
                }
              }
            }
            names(data_unlist_4) <- names(data[[i]][[j]][[k]][1:length(data_unlist_4)])
            data_unlist_3[[k]] <- plyr::ldply(data_unlist_4, .id = "variable4")
          }
        }
      }
      names(data_unlist_3) <- names(data[[i]][[j]][1:length(data_unlist_3)])
      data_unlist_2[[j]] <- plyr::ldply(data_unlist_3, .id = "variable3")
    }
  }
}
names(data_unlist_2) <- names(data[[i]][1:length(data_unlist_2)])
data_unlist[[i]] <- plyr::ldply(data_unlist_2, .id = "variable2")
Contributor

@lilyclements Would it be more readable to move this code into a separate function?
Also, this code seems to be going down 4 levels and doing similar things at each level (with the exception of the 4th level).
Some lines of code are repeated several times.
Could we replace this nested/duplicated code with a smaller recursive function?

Contributor

Yes I was thinking this too. I wrote a comment, but just realised I wrote it in the "commit" so it was lost!

"these should all work now. The main issue was with the religion one - it contains a list in a list in a list in a list in a list (data$Indigenous Traditional$Historical Polytheism$Indo-European$Hellenistic$Pythagoreanism) so it took a little while.

I think there is a much nicer way to do it than I am doing. I have seen, for example, Danny call his own function in the function. I think that is what we would want here. I can spend some time on cleaning this up and looking into that, if you think it is worth it? Alternatively, I can do that on another branch once this is merged."

I can do a recursive function. Shall I do it on this branch or elsewhere? I do not know how long it will take as I have not written a recursive function before. I also have a few other priorities to get to this week.

Contributor

I have made some edits to the function to make it recursive and avoid as much repetition!
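
For readers unfamiliar with the pattern, a recursive flattening of a nested list could look roughly like the base R sketch below. It is not the code committed in this PR: `flatten_nested` is a hypothetical name, the output columns `path` and `value` are assumptions, and the toy input only imitates the shape of a deeply nested corpus.

```r
# Hypothetical helper: walk a nested list and return one row per leaf value,
# with the names along the way joined into a "/"-separated path.
flatten_nested <- function(x, path = "") {
  # Base case: anything that is not a list is treated as a leaf
  if (!is.list(x)) {
    if (length(x) == 0) x <- NA
    return(data.frame(path = path, value = as.character(x),
                      stringsAsFactors = FALSE))
  }
  # An empty list still gets a placeholder row
  if (length(x) == 0) {
    return(data.frame(path = path, value = NA_character_,
                      stringsAsFactors = FALSE))
  }
  keys <- names(x)
  if (is.null(keys)) keys <- as.character(seq_along(x))
  # Recursive case: the function calls itself on every child
  pieces <- lapply(seq_along(x), function(i) {
    child_path <- if (nzchar(path)) paste(path, keys[i], sep = "/") else keys[i]
    flatten_nested(x[[i]], child_path)
  })
  do.call(rbind, pieces)
}

nested <- list(a = list(b = list(c = c("x", "y")), d = "z"), e = list())
flat <- flatten_nested(nested)
print(flat)
```

Because each level is handled by the same function, the depth of nesting no longer needs to be known in advance.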

instat/static/InstatObject/R/stand_alone_functions.R (outdated)
lilyclements and others added 3 commits July 5, 2022 11:51
Co-authored-by: lloyddewit <57253949+lloyddewit@users.noreply.github.com>
Co-authored-by: lloyddewit <57253949+lloyddewit@users.noreply.github.com>
rdstern
rdstern previously approved these changes Jul 8, 2022
@lilyclements
Contributor

@lloyddewit after our call, I found a solution which avoids recursion but is hopefully still readable (or at least a lot less complex) and works for lists nested to any depth. Let me know what you think.

lloyddewit
lloyddewit previously approved these changes Jul 12, 2022
Contributor

@lloyddewit lloyddewit left a comment

Wow, the latest commit is a big reduction in size/complexity. It looks like you used the power of the R libraries to do the heavy lifting, which is great.
I approved, there's just one open question.
Also, you mentioned that you might add some comments to explain what's happening. Do you still plan to do that? An example of a multi-level list would be helpful.
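
As an illustration of a multi-level list being flattened by a library routine rather than by hand-written loops, the sketch below uses base R's `unlist()`. The list is invented toy data that only imitates the nesting of the religion corpus, and `unlist()` is used purely as an example; it is not necessarily what the merged commit relies on.

```r
# Toy multi-level list (invented values, four levels deep under `data`)
religions_like <- list(
  description = "demo",
  data = list(
    Indigenous = list(
      Polytheism = list(
        Hellenistic = c("Pythagoreanism", "Stoicism")
      )
    ),
    Other = "Example"
  )
)

# unlist() walks every level and joins the names along the way with "."
flat <- unlist(religions_like)
flat_table <- data.frame(
  path = names(flat),
  value = unname(flat),
  stringsAsFactors = FALSE
)
print(flat_table)
```

Each row keeps the chain of names that led to the value, so the levels can be split back into separate variables afterwards if needed.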

rdstern
rdstern previously approved these changes Jul 12, 2022
Collaborator

@rdstern rdstern left a comment

Great

@lilyclements lilyclements dismissed stale reviews from rdstern and lloyddewit via 216fefb July 12, 2022 09:56
@lilyclements
Contributor

@lloyddewit thanks for the reminder - I have just added some comments into the function. However, this has dismissed yours and @rdstern's reviews. Sorry for that - would you mind reapproving?

@lloyddewit lloyddewit merged commit 1106477 into IDEMSInternational:master Jul 12, 2022
Successfully merging this pull request may close these issues.

Two additions to the File > New Data Frame
6 participants