Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collapsing strings #25

Open
sergiocorreia opened this issue Nov 2, 2017 · 10 comments
Open

Collapsing strings #25

sergiocorreia opened this issue Nov 2, 2017 · 10 comments

Comments

@sergiocorreia
Copy link

Just a quick question: is there a reason why we can't use strings as targets in collapse?

EG:

sysuse auto
gcollapse (first) make

Returns:

. sysuse auto
(1978 Automobile Data)

. gcollapse (first) make
Source make must be numeric.
r(198);

On principle, some target variables (first, last, firstnm, lastnm) could be strings.

@mcaceresb
Copy link
Owner

mcaceresb commented Nov 2, 2017

Well, I wrote the entire code base assuming sources for summary stats had to be numeric. I note in the docs that gcollapse can't do this, but let me know if it should be more prominently stated.

That is, I wrote the C base and the internals assuming the sources are numeric, and only after noticed they could be string for some commands.

I'm not sure I can be persuaded to change it, as it would involve a non-trivial amount of work for what I see as little gain, but feel free to make a case for it.

@sergiocorreia
Copy link
Author

Oh, I agree it's not worth it, and now that I see it it's also documented on the help file.

The way I see it, gtools is mostly useful with big datasets. And strings are not what you want to have with big datasets, so the usage is low.

@mdroste
Copy link

mdroste commented Aug 26, 2019

I had a few relevant comments for this discussion.

  • I think that strings are occasionally somewhat useful even with big datasets. For instance, it is often more efficient to store information as 1-byte str1's rather than numeric byte variables (e.g. a string called gender = "M" or "F" or "O") rather than "byte male = 1 or 0", since the string only takes 1 byte and weirdly the byte takes 2. If you've got 50 such variables and are only using them to subset your data rather than as an input in a model, this trick can save a lot of space, and my coworkers and I have used it a lot.

  • In cases where a string variable only takes on a small number of values, then converting it into a numeric variable with value labels, using gcollapse, and then recasting the numeric converted variable back into a string works just fine (e.g. with encode, decode, or probably more efficiently by manually iterating over values and creating a new variable). I'd imagine this could become costly if the number of values a string variable takes is large, but in many use cases I'd imagine that doesn't occur. So it would be possible to do this behind the scenes in gtools as well, though maybe not worth your time. Just wanted to throw this idea out there in case!

@mcaceresb mcaceresb reopened this Aug 26, 2019
@mcaceresb
Copy link
Owner

@mdroste While I have come to see the value in adding strings to gcollapse, it seems to me to be way too much work.

  1. Given the way gcollapse was written (which is not particularly clean, but that is my fault), adapting it to allow for strings would require rewriting the C internals rather extensively.

  2. But even if I did that, I am in for a world of debugging in Stata because there are so many places where I have just assumed the gcollapse inputs are numbers.

I think a nice compromise is perhaps writing extensions to gegen that allow strings? Let me know how that sounds, and the types of functions that would be useful.

@mdroste
Copy link

mdroste commented Sep 2, 2019

Hey Mauricio - I totally understand about the complete rewriting the codebase of gtools - I do not intend to relitigate the earlier discussion here. My point was basically that one can leverage the built-in Stata functions encode/decode to allow gtools to collapse strings (albeit not blazingly fast, since encode/decode are themselves a little slow) without needing to rewrite any C code at all. Does that make sense? I am not suggesting you actually do this, and I leave the cost-benefit analysis here to you since you're better informed -- I just wanted to point out that a rewrite of gtools is not necessary to allow for collapsing strings via (first), (firstnm), (last), etc.

In words, gcollapse would be modified as follows: Check for numeric type variables, as before. If there are string types, make sure they are only present where sensible, i.e. (first), (last), etc. Throw an error if string variables present where type is not supported (i.e. not first or last or whatever), otherwise encode any string variables. Run gcollapse as normal on the encoded variables. Afterwards, decode any encoded variables.

This isn't super-fast, but is still way faster than collapse, and importantly allows collapses that would have run before with string variables to continue to run with gcollapse.

@mcaceresb
Copy link
Owner

I wouldn't want to do the decoding/encoding internally, though. It would be much easier for me if the user did that externally (and much more transparent about what is happening; not to mention I wouldn't have to deal with a bunch of internal debugging).

I'd be willing to provide bulk encode and decode utilities, if that's something you think will help? So sth like

tempfile cache
bulkencode g1 = str1 [g2 = str2 ...], file(`cache')
gcollapse (min) g1 [(firstnm) g2 ...]
bulkdecode g1 = str1 [g2 = str2 ...], file(`cache')

@sergiocorreia
Copy link
Author

Now that Stata has frames, wouldn't it be easy to allow the user to save the string-number pair in a separate frame? EG:

* Create dataset
clear all
sysuse auto, clear
decode foreign, gen(str_foreign)
drop foreign

* Create category
gegen byte foreign = group(str_foreign)

* Create frame with number-string pair
frame put foreign str_foreign, into(foreign_keys)
frame foreign_keys: bys foreign: keep if _n == 1
frame foreign_keys: list

* Now we can drop the string (and merge it later if we want to)
drop str_foreign

@mcaceresb
Copy link
Owner

@sergiocorreia I won't have Stata 16 any time soon, though, so I wouldn't be able to test that out. However, it does seem like a better solution for users who can implement it. You'd need one frame per variable. I'd suggest this, though, to avoid the sort:

foreach var in str1 str2 ... {
    glevelsof `var', nolocal gen(uniq_) group(int_`var')
    frame put uniq_`var' in 1 / `=r(J)', into(`var'_keys)
    frame `var'_keys: gen long int_`var' = _n
    drop uniq_`var'
}

In your example,

clear all
sysuse auto, clear
rename foreign int_foreign
decode int_foreign, gen(foreign)
local var foreign
drop int_foreign
glevelsof `var', nolocal gen(uniq_) group(int_`var')

frame put uniq_`var' in 1 / `=r(J)', into(`var'_keys)
frame `var'_keys: gen int_`var' = _n
frame `var'_keys: list
drop uniq_`var'

@sergiocorreia
Copy link
Author

That's even better than saving the keys (because we know they will be 1..N). One thing though is that frame put only allows in when all the variables are copied; otherwise you need to use if.

@mcaceresb
Copy link
Owner

mcaceresb commented Sep 9, 2019

It doesn't allow if either. I think this is the way to go (though it doesn't account for missing values or multiple targets, but I think it's an OK workaround):

foreach var in strvar1 strvar2 ... {
    local type_`var': type `var'
    frame create `var'_keys
    glevelsof `var', nolocal gen(uniq_) group(int_`var')
    frame `var'_keys: set obs `=r(J)'
    frame `var'_keys: gen `type_`var'' uniq_`var' = _frval(default, uniq_`var', _n)
    drop uniq_`var'
}

// gcollapse using int_`var'

foreach var in strvar1 strvar2 ... {
    gen `type_`var'' `var' = _frval(`var'_keys, uniq_`keys', int_`var')
    drop int_`var'
}

So

clear all
sysuse auto, clear
local var make
local type_`var': type `var'
frame create `var'_keys
glevelsof `var', nolocal gen(uniq_) group(int_`var')

frame `var'_keys: set obs `=r(J)'
frame `var'_keys: gen `type_`var'' uniq_`var' = _frval(default, uniq_`var', _n)
drop uniq_`var'

gcollapse (min) int_`var', by(rep78)
gen `type_`var'' `var' = _frval(`var'_keys, uniq_`keys', int_`var')
drop int_`var'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants