Added Lists in the New Data Frame dialog #7468

anastasia-mbithe · 2022-05-11T07:30:05Z

Fixes #7387
@rdstern, @N-thony @lloyddewit.
I have added the second part of the issue to the dialog. It is ready for review.

rdstern

@anastasia-mbithe this is great, and it mostly works.
a) I suggest adding a checkbox, unchecked by default, with a label something like R Command. If checked, it shows the box with the command, and that box could be shorter as well.
b) It mostly works fine and gives the lists we need. A few do not work, and I think they all give the same error. Here is an example:

Others with the same error include words/word_clues, where none of words_five, four, six work.
This error could be a bit more of a puzzle to diagnose?

anastasia-mbithe · 2022-05-16T12:21:26Z

@rdstern, can we have the lists as a tibble instead of as a data frame?

rdstern · 2022-05-16T12:26:31Z

I assume that could be nice, but I am not sure why you are asking for this? What's the advantage? I am adding @lilyclements who is much better placed to answer/confirm this.

anastasia-mbithe · 2022-05-16T12:39:54Z

@rdstern, a tibble can accommodate the unequal number of rows, and that could fix our issue here. Though the general output looks different from what data.frame gives.
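
To illustrate the point about unequal lengths, here is a minimal sketch in base R. The `cluedo_like` list is made-up data, loosely shaped like the "games/cluedo" output, and is not read from rcorpora. `data.frame()` refuses columns of different lengths, whereas a list column (roughly how a tibble would hold them) keeps each element whole.

```r
# Hypothetical data: elements of unequal length, loosely shaped like a corpus list
cluedo_like <- list(
  victim  = c("Dr Black", "Mr Boddy"),
  weapons = c("candlestick", "dagger", "rope"),
  rooms   = c("kitchen", "ballroom", "library", "study")
)

# data.frame() needs compatible column lengths, so this conversion fails;
# tryCatch() captures the error message instead of stopping the script
as_df <- tryCatch(data.frame(cluedo_like), error = function(e) conditionMessage(e))
print(as_df)

# A list column keeps one row per element, whatever its length
as_list_col <- data.frame(name = names(cluedo_like))
as_list_col$value <- unname(cluedo_like)
print(lengths(as_list_col$value))
```

The trade-off is the one noted above: the list-column layout prints differently from an ordinary data frame.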

This is what we will have for the "games/cluedo"

And this for the "animals/dinosaurs"

lilyclements · 2022-05-16T13:08:25Z

@anastasia-mbithe can you post the R code for the tibble version and for the table version? That might give me a better idea of the aim of the dialog.

I know we do want to use tibbles instead of data frames in R-Instat. It does introduce some differences (e.g. no row names). So this could be a good place to start on that. I’m not entirely sure what the “hold up” is on that.

anastasia-mbithe · 2022-05-16T13:24:00Z

This is the code when using tibble and it runs fine.

I have used table and it gives this error too.

anastasia-mbithe · 2022-05-16T13:29:39Z

@lilyclements, the other option we have is padding the shorter columns with NAs to match the number of rows of the longest column, then having it as a data frame. I found cbind.fill()/rbind.fill(), though I couldn't find these functions in the current R version. Do you know of another function that could do the same?
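
For the padding idea, base R may be enough: assigning a longer `length` to a vector fills the new positions with `NA`. A minimal sketch on made-up vectors follows; the helper name `pad_to_data_frame` is hypothetical and is not part of rcorpora or R-Instat.

```r
# Hypothetical helper: pad every vector with NA up to the longest length
pad_to_data_frame <- function(x) {
  n <- max(lengths(x))
  padded <- lapply(x, function(v) {
    length(v) <- n  # extending a vector fills the new slots with NA
    v
  })
  data.frame(padded, stringsAsFactors = FALSE)
}

# Made-up example with columns of length 2 and 3
example_list <- list(
  victim  = c("Dr Black", "Mr Boddy"),
  weapons = c("candlestick", "dagger", "rope")
)
padded_df <- pad_to_data_frame(example_list)
print(padded_df)
```

This only covers flat lists of vectors; nested lists would need flattening first.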

rdstern · 2022-05-17T03:59:15Z

@anastasia-mbithe I really like your search for solutions here. I was happy with the data layout for the dinosaurs in a data frame, and would not want to lose that layout. I'd like to solve the importing of these troublesome lists with something that could be used by schoolchildren, even if they have to learn some data manipulation. So, making the data into a proper list, or into multiple variables would be ideal.

N-thony · 2022-05-19T09:10:37Z

@anastasia-mbithe any progress?

anastasia-mbithe · 2022-05-19T10:12:12Z

@N-thony Not yet, am still looking for a way to have the "particular" lists as a data frame.

lilyclements · 2022-05-23T11:25:54Z

@rdstern @anastasia-mbithe I don't think I fully understand what we want the output to be in R-Instat for something like games/cluedo.

In R, if I run rcorpora::corpora("games/cluedo"), I get a lot of lists with different lengths:

$description
[1] "Characters, rooms and weapons from the board game Cluedo / Clue."

$victim
$victim$Cluedo
[1] "Dr Black"

$victim$Clue
[1] "Mr Boddy"


$suspects
$suspects$Cluedo
[1] "Miss Scarlett"   "Professor Plum"  "Mrs Peacock"     "Reverend Green"  "Colonel Mustard" "Mrs White"      

$suspects$Clue
[1] "Miss Scarlet"    "Professor Plum"  "Mrs Peacock"     "Mr Green"        "Colonel Mustard" "Mrs White"      


$weapons
$weapons$Cluedo
[1] "candlestick" "dagger"      "lead pipe"   "revolver"    "rope"        "spanner"    

$weapons$Clue
[1] "candlestick" "knife"       "lead pipe"   "revolver"    "rope"        "wrench"     


$rooms
 [1] "kitchen"       "ballroom"      "conservatory"  "dining room"   "cellar"        "billiard room" "library"       "lounge"        "hall"          "study"        

$secret_passages
     [,1]      [,2]          
[1,] "kitchen" "study"       
[2,] "lounge"  "conservatory"

@rdstern How would we want this displayed in R-Instat?
If we do the tibble suggestion from @anastasia-mbithe we get the output below
data.frame(tibble::tibble(rcorpora::corpora("games/cluedo")))
This makes the most sense to me.

                                                                                                                                        rcorpora..corpora..games.cluedo..
1                                                                                                        Characters, rooms and weapons from the board game Cluedo / Clue.
2                                                                                                                                                      Dr Black, Mr Boddy
3 Miss Scarlett, Professor Plum, Mrs Peacock, Reverend Green, Colonel Mustard, Mrs White, Miss Scarlet, Professor Plum, Mrs Peacock, Mr Green, Colonel Mustard, Mrs White
4                                                          candlestick, dagger, lead pipe, revolver, rope, spanner, candlestick, knife, lead pipe, revolver, rope, wrench
5                                                                       kitchen, ballroom, conservatory, dining room, cellar, billiard room, library, lounge, hall, study
6                                                                                                                                    kitchen, lounge, study, conservatory

Running a list with just two elements, such as rcorpora::corpora("animals/dinosaurs"), works in creating a data frame.
(I say two elements, because it gives "description" and "dinosaurs").
Using tibble here makes it messy.
(tibble::tibble(rcorpora::corpora("animals/dinosaurs")))
So, just to make it a bit more complicated, we want to say something like:

data <- rcorpora::corpora("animals/dinosaurs") 
if (length(data) == 2){
  data.frame(data = data.frame(data))
} else {
  data.frame(data = data.frame(tibble::tibble(data)))
}

The "if" statement can be in VB code not R code, and so this can be run in a sub.

(side note: as_tibble does not work for lists of different sizes, which is why I am using tibble)

So, I agree we should use tibble for lists of different element sizes (e.g. "games/cluedo"), but only if that is the sort of output we want! If we do use tibble for those lists, we should only use it on lists with different sizes.
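
As a possible alternative to testing `length(data) == 2`, the branch could test whether the elements can form columns at all. This is only a rough sketch in base R: `fits_data_frame` is a hypothetical helper, and the two lists are made-up stand-ins for corpus output.

```r
# Hypothetical check: TRUE when every element is a plain (non-matrix) vector
# and all lengths are either 1 or one shared value
fits_data_frame <- function(x) {
  if (!all(vapply(x, is.atomic, logical(1)))) return(FALSE)
  if (any(vapply(x, is.matrix, logical(1)))) return(FALSE)
  n <- lengths(x)
  length(unique(n[n != 1])) <= 1
}

# Stand-in for a simple two-element corpus
dinosaurs_like <- list(
  description = "A list of dinosaurs",
  dinosaurs = c("Aardonyx", "Abelisaurus")
)

# Stand-in for a corpus with nested elements of different sizes
cluedo_like <- list(
  description = "Cluedo",
  victim = list(Cluedo = "Dr Black", Clue = "Mr Boddy"),
  rooms = c("kitchen", "study", "hall")
)

print(fits_data_frame(dinosaurs_like))
print(fits_data_frame(cluedo_like))
```

A check like this would not depend on how many elements the list has, only on their shapes.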

lilyclements · 2022-05-30T12:05:17Z

@rdstern I've noticed this is in "blocker" - would you mind having a look at my comment above when you get the time?

rdstern · 2022-05-31T07:30:03Z

I am not sure whether I am answering your points above. I hope:
a) Most of the categories can be read?
b) We can quickly find a workable solution?
c) We could have examples we trap and don't allow.

Here is another example, that I quite like - more than cluedo - so I'd be just slightly disappointed if it can't be read?
rcorpora::corpora("words/word_clues/clues_five")
This gives:

# Code run from Script Window
rcorpora::corpora("words/word_clues/clues_five")

$description
[1] "a list of common 5-letter words followed by crossword/thesaurus-style hints for that word"

$data
$data$abase
 [1] "put down"     "humiliate"    "cut down"     "bring down"   "belittle"    
 [6] "put to shame" "humble"       "lower"        "demean"       "degrade"     

$data$abate
[1] "diminish"  "let up"    "fade away" "die down"  "lessen"    "drop off" 
[7] "subside"   "fall off" 

$data$abbot
[1] "monastery head"   "monastic title"   "monk's superior"  "monastery leader"

$data$abort
[1] "scrub a space mission" "cancel"                "call off"             
[4] "cut short"             "halt"                 

$data$about
[1] "concerning"   "regarding"    "aproximately"

$data$abuse
[1] "take advantage of" "bully"             "wrong"            
[4] "mishandle"         "corrupt practice"

and so on. It is a long list. I was hoping it could be stored in 3 variables: the first with the description, the second with the name, and the third with the list (e.g. "take advantage of", "bully", etc.). Then we can always use our text handling facilities to split the list into its separate parts if we wish. (We may have to extend our existing dialogues, and that would be fine.)
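
The three-variable layout could look roughly like this. It is only a sketch on a hand-copied fragment of the output shown earlier, not a call to rcorpora; the column names and the ", " separator are assumptions.

```r
# Fragment copied by hand from the output above, not read from rcorpora
clues <- list(
  description = "a list of common 5-letter words followed by crossword/thesaurus-style hints for that word",
  data = list(
    abbot = c("monastery head", "monastic title", "monk's superior", "monastery leader"),
    abort = c("scrub a space mission", "cancel", "call off", "cut short", "halt")
  )
)

# One row per word; the hints are collapsed into a single text value
clues_df <- data.frame(
  description = clues$description,
  name = names(clues$data),
  list = unname(vapply(clues$data, paste, character(1), collapse = ", ")),
  stringsAsFactors = FALSE
)

# The text can later be split back into the separate hints
hints_abort <- strsplit(clues_df$list[clues_df$name == "abort"], ", ", fixed = TRUE)[[1]]
print(hints_abort)
```

Collapsing works only while no hint contains the separator itself, so a less common separator may be safer in practice.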

lilyclements · 2022-05-31T09:58:16Z

@rdstern we have three different examples here.
A list of two, but with a single entry in the second part (the dinosaur name)
df <- rcorpora::corpora("animals/dinosaurs")

A list of two, but with multiple entries in the second part (e.g. "take advantage of", "bully", ...)
df <- rcorpora::corpora("words/word_clues/clues_five")

A list of more than two
df <- rcorpora::corpora("games/cluedo")

To handle these three cases, I've written some R code (below):

read_corpora <- function(data){
  data_unlist <- NULL
  description <- NULL
  for (i in 1:length(df)){
    if (names(df[i]) == "description") {
      description <- df[i][[1]]
    } else {
      if (class(df[[i]]) == "character"){
        data_unlist[[i]] <- data.frame(list = df[[i]])
      } else if (class(df[[i]]) == "list"){
        data_unlist_i <- purrr::map(.x = names(df[[i]]), .f = ~data.frame(list = df[[i]][[.x]]))
        names(data_unlist_i) <- names(df[[i]])
        data_unlist[[i]] <- plyr::ldply(data_unlist_i, .id = "name")
      } else if ("matrix" %in% class(df[[i]])){
        data_unlist[[i]] <- data.frame(list = do.call(paste, c(data.frame(df[[i]]), sep="-")))
      } else if (class(df[[i]]) == "data.frame"){
        data_unlist[[i]] <- data.frame(list = df[[i]])
      }
    }
  }
  names(data_unlist) <- names(df)
  data_unlist <- plyr::ldply(data_unlist, .id = "variable")
  if (!is.null(description)){
    data_full <- data.frame(description = description, data_unlist)
  } else {
    data_full <- data.frame(data_unlist)
  }
  return(data_full)
}

We could make this into a function to be called into the dialog for this particular package? Or would we rather avoid writing our own functions?

rdstern · 2022-05-31T10:00:39Z

@lilyclements I like the idea of functions! I don't see why we are avoiding them.

lilyclements · 2022-05-31T14:14:21Z

@rdstern great, should we store the function in stand_alone_functions.R?

I have tested it with some of the data sets in the rcorpora package (but not all, given there are a lot!)

anastasia-mbithe · 2022-06-09T08:57:14Z

@lilyclements, I made the changes although there's a consistent error for all datasets like I had explained to you before. Kindly have a look at the code.

N-thony · 2022-06-14T06:22:49Z

@lilyclements please could you help here? Thanks.

@anastasia-mbithe

@anastasia-mbithe the issue with the R function was just that the parameter is named `data`, but the body of the function refers to it as `df`. This should fix the issue with the R function!
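
A reduced illustration of that kind of mismatch, as a sketch with made-up function names: when the body refers to `df` rather than the `data` parameter, R looks `df` up outside the function. With no such object defined, the name resolves to `stats::df` (a function), which cannot be subset. If a `df` object does exist in the session, the function silently uses it instead of its argument.

```r
# Buggy: the parameter is `data`, but the body reads `df`
first_element_buggy <- function(data) df[[1]]

# Fixed: the body uses the parameter it was given
first_element_fixed <- function(data) data[[1]]

corpus <- list(description = "demo", items = c("a", "b", "c"))

# No `df` of our own yet, so the lookup finds stats::df and subsetting fails
result_buggy <- tryCatch(first_element_buggy(corpus),
                         error = function(e) conditionMessage(e))
print(result_buggy)
print(first_element_fixed(corpus))

# Once some `df` exists, the buggy version runs but ignores its argument
df <- list(description = "a different object")
result_wrong <- first_element_buggy(corpus)
print(result_wrong)
```

This may be why a function like this can appear to work in a session where `df` was assigned earlier.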

anastasia-mbithe · 2022-06-15T07:02:10Z

@lilyclements Thank you for the changes. The function is now working well, even for the "special" case lists. As I was testing randomly, everything was fine until I got to "religion/christian_saints", which gives this error:

I checked the list in R and noticed it has two variables with many missing values. Is there a way we can add NAs to the empty cells?

@anastasia-mbithe

@anastasia-mbithe good find, and good investigation. The issue here wasn't to do with the NA values, but instead that the function doesn't have a way to handle this "case". The function, until now, has only handled lists. However, this is a data frame.

```
x <- rcorpora::corpora("religion/christian_saints")
head(x)
class(x)
```

So we just have to add in a new `if` statement for how to handle the data if it is a data frame (that is, in its simplest case!). I've done this and made the changes in this commit. This is ready for you again now.
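
One detail worth keeping in mind when adding such a branch: a data frame is also a list, so if the checks are written with `is.list()`, the data frame test has to come first. The sketch below uses a hypothetical `describe_shape` helper and made-up data; it is not the code from the commit. Using `is.*()` tests also avoids comparing `class(x)` to a single string, which can misbehave because `class()` may return more than one value (for example, for a matrix in recent R versions).

```r
# Hypothetical helper: report which branch an object would take
describe_shape <- function(x) {
  if (is.data.frame(x)) {        # must come first: data frames are lists too
    "data frame"
  } else if (is.matrix(x)) {
    "matrix"
  } else if (is.list(x)) {
    "list"
  } else if (is.character(x)) {
    "character"
  } else {
    "other"
  }
}

# Made-up data frame with missing values, standing in for a corpus
saints_like <- data.frame(saint = c("A", "B"), notes = c(NA, "example"))

print(is.list(saints_like))
print(describe_shape(saints_like))
print(describe_shape(list(description = "demo")))
print(describe_shape(matrix(c("kitchen", "lounge", "study", "conservatory"), ncol = 2)))
```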

anastasia-mbithe · 2022-06-16T07:43:05Z

Thank you so much @lilyclements.
@rdstern and @lloyddewit, this is ready for review.

lilyclements · 2022-07-04T18:31:10Z

@rdstern there was a small bug in the last version which prevented some of the functions from working, including religion > religions and the words/word ones. This is now sorted.

I'll look at "Transportation > Commercial airlines" and "Words/Verbs with conjugations" tomorrow.

rdstern · 2022-07-04T19:16:35Z

Looks very good now. religion works as do the word clues. The only 2 I can find that don't work now are
a) transportation > commercial aircraft - but that's of minor importance.
b) words > verbs with conjugations still gives the annoying error that then persists. But perhaps that can wait? If so, I'll approve now.

lilyclements · 2022-07-04T19:30:20Z

@rdstern I suggest we write them up in an issue. I’m happy to write the issue tomorrow.

rdstern

This is great and almost everything is working now. I am approving in its current form, though there may be a new issue to enable the final couple of lists to be imported - or at least trapped.

lloyddewit

@lilyclements Thanks for the recent changes, I just have a couple of comments about readability