Download and Regularly Update All Device Data #176

Closed
funkrusher opened this issue Sep 26, 2021 · 12 comments
@funkrusher
funkrusher commented Sep 26, 2021

I'm not sure if I'm in the right place here with my question...

The openFDA website says:

To keep your downloaded data up to date, you need to re-download the data every time it is updated. 

The data I need to download is about 10 gigabytes. So if I want to keep my data up to date, I would need to download 10 gigabytes every day.

  • Question 1: Is this correct, or is there another way to keep my local database of MAUDE data up to date with the live database without downloading 10 gigabytes every day?
  • Question 2: Where can I see whether an endpoint has been updated? Is it in download.json, in "meta.last_updated"?
  • Question 3: Do the events (deviceEvents, deviceClassifications) have a unique ID field I can rely on to uniquely identify the records in my system?

Hopefully I'm still in the right place here for my questions :)

dkrylovsb self-assigned this Sep 27, 2021
@dkrylovsb
Collaborator

There is definitely no need to download 10 GB of data daily, especially since MAUDE is updated only weekly. You can request https://api.fda.gov/download.json and look at the export_date field to determine whether the downloadable files have changed since you last pulled them down.
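A minimal sketch of that check in Python, for illustration only. It assumes the index is laid out as results -> category -> endpoint -> export_date (for example results.device.event.export_date); the function names are hypothetical, and the previously seen date is expected to come from wherever your crawler stores its state.

```python
import json
import urllib.request

DOWNLOAD_INDEX_URL = "https://api.fda.gov/download.json"


def fetch_download_index(url=DOWNLOAD_INDEX_URL):
    """Fetch and parse the download index (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def get_export_date(index, category, endpoint):
    """Return the export_date for one endpoint, e.g. ("device", "event").

    Assumption: the index nests as results -> category -> endpoint.
    """
    return index["results"][category][endpoint]["export_date"]


def needs_download(index, category, endpoint, last_seen_export_date):
    """True if export_date differs from the value stored after the last pull."""
    return get_export_date(index, category, endpoint) != last_seen_export_date
```

A crawler would call fetch_download_index() once per run, compare each endpoint with needs_download(), and store the new export_date only after the files have been downloaded successfully.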

Device Events uses mdr_report_key as the "primary key" for the dataset (see here). And based on this description, a combination of medical specialty, product code and regulation number could be used as a key for Device Classification.

@funkrusher
Author

Thank you very much.

  • My plan is now to crawl weekly: each crawl will look at the export_date of the dataset and download it only if it has changed since the previous crawl.

Thank you for providing the primary key for deviceEvents and a possible one for deviceClassification.

I hope it's not too much to ask, but I'm a beginner and it would be nice if you could also help me find the primary keys for the rest of the device datasets. These would be:

device.enforcement
device.event [OK]
device.classification [OK]
device.510k
device.pma
device.recall
device.registrationlisting
device.udi
device.covid19serology

I'm not a native speaker, so it's a bit hard for me to find the primary keys of the datasets in the documentation text. I know they should be given in the "Searchable Fields" documentation on your website. For example: https://open.fda.gov/apis/device/event/searchable-fields/

@funkrusher
Author

funkrusher commented Sep 27, 2021

My most important requirement is that a given mdr_report_key always identifies the same device-event record in subsequent runs, including after openFDA has updated the download files.

That way I can use it as the primary key in my local database: if I have already read a device-event record with mdr_report_key=123 in a previous run, I do an SQL UPDATE; if mdr_report_key=123 is not yet in my database, I do an SQL INSERT. I hope it can work this way.
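A minimal sketch of this update-or-insert flow, using Python's built-in sqlite3 module. The table name device_event, the payload column, and the helper names are hypothetical choices for illustration; only mdr_report_key comes from the discussion.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the local database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS device_event ("
        " mdr_report_key TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL)"
    )
    return conn


def upsert_event(conn, record):
    """UPDATE the row for this mdr_report_key, or INSERT it if it is new.

    Returns "updated" or "inserted".
    """
    key = str(record["mdr_report_key"])
    payload = json.dumps(record, sort_keys=True)
    cursor = conn.execute(
        "UPDATE device_event SET payload = ? WHERE mdr_report_key = ?",
        (payload, key),
    )
    if cursor.rowcount == 0:
        # No existing row matched the key, so this record is new.
        conn.execute(
            "INSERT INTO device_event (mdr_report_key, payload) VALUES (?, ?)",
            (key, payload),
        )
        action = "inserted"
    else:
        action = "updated"
    conn.commit()
    return action
```

Storing the whole record as JSON keeps the sketch short; a real schema would more likely map the fields you need to their own columns.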

@dkrylovsb
Collaborator

Sure thing:

device.510k: k_number
device.pma: pma_number, supplement_number
device.recall: product_res_number
device.registrationlisting: registration_number
device.udi: look within the identifiers array for the identifier of type "Primary" 
device.covid19serology: evaluation_id, date_performed, sample_no

And yes, mdr_report_key works exactly as you described above.
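For illustration, the keys listed above could be kept in one lookup table so that every dataset is handled the same way. This is only a sketch: the helper names are hypothetical, and for device.udi it assumes each entry in the identifiers array has "type" and "id" fields.

```python
# Key fields per dataset, as listed in this thread.
KEY_FIELDS = {
    "device.event": ("mdr_report_key",),
    "device.510k": ("k_number",),
    "device.pma": ("pma_number", "supplement_number"),
    "device.recall": ("product_res_number",),
    "device.registrationlisting": ("registration_number",),
    "device.covid19serology": ("evaluation_id", "date_performed", "sample_no"),
}


def primary_udi(record):
    """Return the id of the identifier whose type is "Primary", or None.

    Assumption: identifiers is a list of objects with "type" and "id".
    """
    for identifier in record.get("identifiers", []):
        if identifier.get("type") == "Primary":
            return identifier.get("id")
    return None


def record_key(dataset, record):
    """Build a tuple that identifies one record within its dataset."""
    if dataset == "device.udi":
        return (primary_udi(record),)
    return tuple(record[field] for field in KEY_FIELDS[dataset])
```

The resulting tuple can serve as a composite primary key in a local database, for example (pma_number, supplement_number) for device.pma.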

@funkrusher
Author

awesome, thx

@grimuz

grimuz commented Sep 28, 2021

Writing with my second account here... One final key was missing :) but I think I found it.
For device.enforcements I think it could be "recallNumber".

@Mariano215

Mariano215 commented Sep 28, 2021 via email

@grimuz

grimuz commented Oct 1, 2021

Also, one additional question.

device.event.mdr_text is a sub-array within device.event. It contains the texts and provides a field mdr_text_key.

I wanted to use this field as a unique key, but it seems there are duplicates for this field.

For example:

This query shows that the mdr_text_key "16363682" exists multiple times for the same device-event (mdr_report_key).

It seems to me that it should be unique, is it not?

@dkrylovsb
Collaborator

Yes, that field should be unique, but apparently there are duplicates within the source data files. For example, the key you referenced above is duplicated in foitext2003.txt and foitext2004.txt:

foitext2003.txt:503218|16363682|D|1||DURING A BILATERAL HERNIA PROCEDURE, THE ANCHOR DID NOT FIX NORMALLY AND THE MESH LOOSENED AFTER FIXATION. FURTHER, THE ANCHOR CAUSED A BIGGER BLEEDING WHICH WAS CONTROLLED BY AN RF DEVICE. A BIGGER TROCAR WAS TAKEN AND THE PROCEDURE WAS FINISHED WITH AN "EMS". NO CONSEQUENCES FOR THE PT.
foitext2004.txt:503218|16363682|D|1||DURING A BILATERAL HERNIA PROCEDURE, THE ANCHOR DID NOT FIX NORMALLY AND THE MESH LOOSENED AFTER FIXATION. FURTHER, THE ANCHOR CAUSED A BIGGER BLEEDING WHICH WAS CONTROLLED BY AN RF DEVICE. A BIGGER TROCAR WAS TAKEN AND THE PROCEDURE WAS FINISHED WITH AN "EMS". NO CONSEQUENCES FOR THE PT.

We will work on enhancing the pipeline to catch and remove duplicates and report back once done. Thank you for bringing this to our attention.
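For anyone who wants to check their own copy of the source files in the meantime, here is a rough sketch of a duplicate check. It is not the openFDA pipeline; it assumes, based on the two sample lines above, that the files are pipe-delimited with mdr_text_key in the second column. The function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_text_keys(lines_by_file):
    """Map each repeated mdr_text_key to the list of files it appears in.

    lines_by_file: dict of file name -> iterable of pipe-delimited lines.
    Assumption: the second field of each line is the mdr_text_key.
    """
    seen = defaultdict(list)
    for file_name, lines in lines_by_file.items():
        for line in lines:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            seen[fields[1]].append(file_name)
    return {key: files for key, files in seen.items() if len(files) > 1}
```

If the files carry a header row, skip it before passing the lines in; otherwise the header's column name will be reported as a duplicate across files.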

@grimuz

grimuz commented Oct 5, 2021

@dkrylovsb

OK thanks, that's good to know.

I have also noticed duplicates in the field "recallNumber" of "device.enforcements".
Is it OK for me to use this field as the primary key of "device.enforcements"?

Sorry for asking so many questions, but I guess that will be my final one (for now) :D

@dkrylovsb
Collaborator

device.event.mdr_text duplicates have been removed.

Yes, recall_number should be used as the primary key for the Device Enforcements dataset. There are indeed a small number of duplicate records (6 in total), which we are also going to look at and fix shortly. Thank you for bringing this to our attention.

@dkrylovsb
Collaborator

The duplicates in the Device Recall Enforcement dataset have been removed as well.
