Missing vs None in edtreat and edinvest columns #47

Open
georgm8 opened this issue Mar 14, 2023 · 6 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@georgm8
Contributor

georgm8 commented Mar 14, 2023

The HDRUK Data Processing document states the following for edtreat and edinvest

NOTE: If all edtreat/edinvest fields empty, data is considered missing. If one or more fields complete, empty fields are considered no investigation/treatment. (None)

How should this be interpreted from a data-processing and aggregation point of view?

The _edtreat() and _edinvest() functions currently don't make the distinction between 'Missing' and 'None' and also categorise any codes that are not present in feature_maps.py as 'Urgent'

This would be my literal interpretation of the statement above. If we let 0 represent an empty field and 99 the code for 'No Treatment' then:

All fields empty - no change in processing required
| 0 | 0 | 0 | 0 | 0 | --> | 0 | 0 | 0 | 0 | 0 |

When aggregated this would result in Missing: 5

One or more fields complete* - replace all empty fields (0) with 'None' (99)
| 1 | 2 | 3 | 0 | 0 | --> | 1 | 2 | 3 | 99 | 99 |

When aggregated this would result in None: 2

However, this doesn't make much sense to me. If all the fields are empty for a particular row, I would have expected that row to be counted as 1 rather than 5. In the second example, I would have expected the 0 values to be ignored rather than converted into a 'None' category.
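
For illustration, here is a minimal pandas sketch of the literal interpretation described above. It assumes empty fields are stored as 0 and uses 99 as a stand-in for the 'No Treatment' code, as in the examples; neither is the real encoding, and `fill_none_codes` is a hypothetical helper, not a function in the pipeline.

```python
import pandas as pd

EMPTY = 0          # assumption: empty fields are encoded as 0, as in the examples
NO_TREATMENT = 99  # assumption: placeholder for the real 'No Treatment' code


def fill_none_codes(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Replace empty fields with NO_TREATMENT in rows with at least one completed field."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    # Rows where one or more of the fields is complete
    has_any = (out[cols] != EMPTY).any(axis=1)
    # Only those rows get their empty fields converted; all-empty rows stay untouched
    out.loc[has_any, cols] = out.loc[has_any, cols].replace(EMPTY, NO_TREATMENT)
    return out


df = pd.DataFrame(
    [[0, 0, 0, 0, 0], [1, 2, 3, 0, 0]],
    columns=[f"edtreat_{i:02d}" for i in range(1, 6)],
)
filled = fill_none_codes(df, "edtreat_")

# Field-level aggregation (the literal reading) versus row-level counting
missing_fields = int((filled == EMPTY).to_numpy().sum())
none_fields = int((filled == NO_TREATMENT).to_numpy().sum())
missing_rows = int((filled == EMPTY).all(axis=1).sum())
```

Counting fields gives Missing: 5 and None: 2 for these two rows, whereas counting rows gives a single missing record, which is the difference being questioned here.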

@vvcb
Member

vvcb commented Mar 14, 2023

@georgm8 , @quindavies , @ccarenzoIC , can you please sense check whether the following makes sense? It has been a long day.

Definition of a non-urgent attendance from the Sheffield specs is here > https://docs.google.com/document/d/1wYBZmMUDR_uQBp6tK43FH7xST7ItvA3lktHFAPi8SPI/edit#heading=h.lbu7kq7o1jqd

Calculating non-urgent attendance

Winter Pressures analysis requires calculation of non-urgent ED attendances. An ED attendance is defined as non-urgent if it meets the following criteria:

aedepttype == 01 AND 
edattendcat == 01 AND 
disstatus == non-urgent AND
ALL( edinvest_nn == non-urgent OR edinvest_nn == none) AND
ALL( edtreat_nn == non-urgent OR edtreat_nn == none)

Assuming that df is the name of the dataframe with all the good rows that you have at the end of the second validation, then the following code should work. But, needs sense checking based on what @georgm8 is saying above.

# Create boolean masks (filters) based on the Sheffield specs
mask_eddepttype = df.eddepttype == "1"
mask_edattendcat = df.edattendcat == "1"
mask_disstatus_cat = df.disstatus_cat == "Non-urgent"

# Boolean mask for non-urgent investigations: every edinvest_nn_cat column
# must be "Non-urgent" or "ERROR:Missing Data" (treated here as "none")
edinvest_cat = df.filter(regex="edinvest_[0-9]{2}_cat")
mask_edinvest = ((edinvest_cat == "Non-urgent") | (edinvest_cat == "ERROR:Missing Data")).all(axis=1)

# Boolean mask for non-urgent treatments, built the same way
edtreat_cat = df.filter(regex="edtreat_[0-9]{2}_cat")
mask_edtreat = ((edtreat_cat == "Non-urgent") | (edtreat_cat == "ERROR:Missing Data")).all(axis=1)

# Boolean mask for non-urgent attendances
mask_nonurgent = (
    mask_eddepttype & mask_edattendcat & mask_disstatus_cat & mask_edinvest & mask_edtreat
)

On the LTH data, non-urgent attendances based on these criteria account for just over 10% of the 66500+ visits. @quindavies - this needs validation please.

And for the logistic regression, edinvest and edtreat are binned into 2 categories (<=1 and >1). It should be easy enough to do this in a similar manner to above, I think. Happy to have a crack at it tomorrow evening if not already solved.
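
A rough sketch of how that binning might look, assuming the `_cat` columns follow the same naming pattern as above and that both "ERROR:Missing Data" and "None" mean no investigation was recorded in that slot. The helper name `bin_entry_counts` and the bin labels are hypothetical.

```python
import pandas as pd

# Assumption: these category labels mean "nothing recorded in this slot"
NO_ENTRY = ["ERROR:Missing Data", "None"]


def bin_entry_counts(df: pd.DataFrame, regex: str) -> pd.Series:
    """Count recorded entries per row and bin the count into '<=1' and '>1'."""
    cats = df.filter(regex=regex)
    # Number of slots per row that hold an actual investigation/treatment
    n_entries = (~cats.isin(NO_ENTRY)).sum(axis=1)
    return n_entries.gt(1).map({False: "<=1", True: ">1"})


toy = pd.DataFrame(
    {
        "edinvest_01_cat": ["Urgent", "Non-urgent", "ERROR:Missing Data"],
        "edinvest_02_cat": ["Non-urgent", "ERROR:Missing Data", "ERROR:Missing Data"],
    }
)
edinvest_bin = bin_entry_counts(toy, "edinvest_[0-9]{2}_cat")
```

The same call with `"edtreat_[0-9]{2}_cat"` would bin the treatment columns.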

Regarding "The _edtreat() and _edinvest() functions currently don't make the distinction between 'Missing' and 'None' and also categorise any codes that are not present in feature_maps.py as 'Urgent'": those functions transform the values in each column individually, whereas the distinction between Missing and None depends on the values across all the _[0-9]{2}_cat columns in each row.

If this is required as part of the feature engineering workflow, then those functions can be modified to not only transform each of the columns but also produce the desired aggregate outputs - eg. number of urgent investigations, number of non-urgent investigations, total number of investigations, etc. - depending on what is required.
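
As a sketch of what such aggregate outputs could look like, the hypothetical helper below derives per-row counts from the already categorised columns. It assumes the category labels "Urgent" and "Non-urgent" used elsewhere in this thread, and it assumes the total should count only those two categories; the output column names are made up for illustration.

```python
import pandas as pd


def count_categories(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Per-row counts of urgent, non-urgent and total entries for one feature group."""
    cats = df.filter(regex=rf"^{prefix}_[0-9]{{2}}_cat$")
    return pd.DataFrame(
        {
            f"{prefix}_n_urgent": (cats == "Urgent").sum(axis=1),
            f"{prefix}_n_nonurgent": (cats == "Non-urgent").sum(axis=1),
            # Assumption: missing/none/unmapped slots do not count towards the total
            f"{prefix}_n_total": cats.isin(["Urgent", "Non-urgent"]).sum(axis=1),
        }
    )


toy = pd.DataFrame(
    {
        "edinvest_01_cat": ["Urgent", "Non-urgent", "ERROR:Missing Data"],
        "edinvest_02_cat": ["Non-urgent", "Non-urgent", "ERROR:Missing Data"],
        "edinvest_03_cat": ["ERROR:Missing Data", "Non-urgent", "ERROR:Missing Data"],
    }
)
counts = count_categories(toy, "edinvest")
```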

vvcb added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Mar 15, 2023
@georgm8
Contributor Author

georgm8 commented Mar 16, 2023

The code you've written above depends on how you have defined "ERROR:Missing Data" and the pre-processing done within the columns.

In the processing I've done so far I've made a distinction between 'Missing' and 'None' such that any blank value (represented as 0) is mapped to "ERROR:Missing Data". Then the SNOMED code for 'No investigation' is mapped to the category 'None'. However, the 0 values are sometimes transformed into the SNOMED code for 'No investigation' depending on the values within the other columns.

Some examples:

If we let the SNOMED code for 'None' (No investigation) be 99, then, for a particular row, if all edinvest_nn columns are empty like this:

0 | 0 | 0 | 0 | 0

Then this is categorised as 'Missing' and therefore not considered in the analysis - i.e. ALL(edinvest_nn == None) evaluates to false because the data is considered invalid and therefore not included


If we have a mixture of valid codes and empty fields like this:

1 | 2 | 3 | 0 | 0

Then we should consider those two zero values to be 'None' (as per HDRUK's definition) and convert them to the value 99. Again, ALL(edinvest_nn == None) evaluates to false because not all the edinvest columns contain 'None' values.


The only situation in which ALL(edinvest_nn == None) evaluates to true is where the SNOMED code for None appears at least once in the edinvest columns and is the only code in that particular row. Some examples would be:

99 | 0 | 0 | 0 | 0
0 | 99 | 99 | 0 | 0
0 | 0 | 0 | 0 | 99

Within the pipeline, to ensure these edinvest columns are categorised as 'None', I convert all the 0 values in such rows to 99.
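
The three cases above can be summarised in a short sketch. It assumes 0 for an empty field and 99 as a stand-in for the SNOMED 'No investigation' code, as in the examples; `classify_rows` and the label "Has investigations" are hypothetical names used only for illustration.

```python
import pandas as pd

EMPTY = 0               # assumption: empty fields are encoded as 0
NO_INVESTIGATION = 99   # assumption: placeholder for the SNOMED 'No investigation' code


def classify_rows(codes: pd.DataFrame) -> pd.Series:
    """Label each row as 'Missing', 'None' or 'Has investigations'."""
    is_empty = codes == EMPTY
    is_none = codes == NO_INVESTIGATION
    # Every field empty: the row is treated as missing data
    all_empty = is_empty.all(axis=1)
    # At least one 'No investigation' code and no other code in the row
    only_none = is_none.any(axis=1) & (is_empty | is_none).all(axis=1)
    labels = pd.Series("Has investigations", index=codes.index)
    labels[all_empty] = "Missing"
    labels[only_none] = "None"
    return labels


codes = pd.DataFrame(
    [
        [0, 0, 0, 0, 0],
        [1, 2, 3, 0, 0],
        [99, 0, 0, 0, 0],
        [0, 99, 99, 0, 0],
        [0, 0, 0, 0, 99],
    ],
    columns=[f"edinvest_{i:02d}" for i in range(1, 6)],
)
labels = classify_rows(codes)
```

Only rows labelled 'None' here would satisfy ALL(edinvest_nn == None) under this interpretation.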

@vvcb
Member

vvcb commented Mar 16, 2023

How many rows are affected by this in your dataset? I will ask @quindavies to look at this when she is back.

And minor point - should rename 'None' to 'No investigations', to avoid confusion with None.

@georgm8
Contributor Author

georgm8 commented Mar 16, 2023

5% of the data is classified as 'Missing' and 10% is classified as 'No Investigation'

Despite my lengthy explanation above, I think it's just easier to treat 'No Investigations' and 'Missing' the same way (i.e. as None) 😅

@quindavies
Collaborator

In terms of missing/none in our data, we don't have any records with missing treatments or investigations. If the fields in a record are all empty, it is because none took place. Because of this, I've used the given codes for none for all empty fields, so we shouldn't have any "ERROR:Missing Data".

We also need to think about how we handle "ERROR:Unmapped - In Refset" in the treatments. There seems to be a large proportion of codes missing from the feature mappings for investigations and treatments. I think we found the problem to be due to the two different reference sets? I've done a bit of manual work locally to merge the two sets together, but it is still an issue for treatments.

@ccarenzoIC
Contributor

Hi, in our case, for the treatments and investigations, if the field was empty, we considered it as "No Investigation/No Treatment".
