Frequently Seen Issues

mim18 edited this page Oct 16, 2020 · 3 revisions

The following is a summary of some of the most frequently seen issues (FSIs! New acronym!) in data submissions to N3C. If you have a chance to check for these issues prior to your first submission, you may be able to save yourself some back-and-forth with us.

Note that we try to keep “dealbreaker” problems (those issues that will prevent your data from being released) to a minimum, so know that we are not looking for every possible data quality issue.

So, without further ado…

Common Formatting Issues: “Dealbreakers”

  • Data fields are not quoted. This is an issue particularly when your data itself contains pipes (“|”), which are our delimiter. It is really important that all of your string-type data fields are surrounded by double quotes in your extract.
  • Your ZIP directory is not correctly structured. Our data ingestion process is automated, and it will not be able to process your data if you have extra folders, incorrectly nested folders, etc. Review the “Examples” section of your model’s N3C documentation for the correct structure.
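As a sketch of the quoting requirement, Python's `csv` module can produce pipe-delimited, fully quoted output even when a value itself contains a pipe. (The field names here are made up for illustration; they are not the real extract schema.)

```python
import csv
import io

# Hypothetical rows; "note" deliberately contains a pipe and a quote character.
rows = [
    {"person_id": "1001", "note": "BP 120|80 recorded"},
    {"person_id": "1002", "note": 'Patient said "fine"'},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["person_id", "note"],
    delimiter="|",
    quotechar='"',
    quoting=csv.QUOTE_ALL,  # quote every field so embedded pipes can't break parsing
)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Because every field is quoted, the embedded `|` in `BP 120|80` is read back as data rather than as a delimiter, and embedded double quotes are escaped by doubling them, per standard CSV convention.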

Remember that using our R or Python exporters will take care of all formatting problems before they happen. Though it is fine to run our scripts manually, we highly recommend the exporters to save you some heartburn.

Common Data Quality Issues: “Dealbreakers”

The following are issues that will prevent your data from being released into the Enclave.

  • COVID test results are not available. This seems to occur when sites do not map lab tests with qualitative test results to their common data model’s controlled vocabulary. You may want to review this Wiki article we wrote that covers how to map your COVID labs if you have not done so already. Here’s a good rule of thumb: If the only way you are able to find your COVID test results in your data is to run a non-standard query (like a string match, for example), then we will not be able to find your results on our end.
  • One or more string fields contain potential site identifiers. Depending on your source ETLs, you may be loading free text into certain fields. We exclude as many of these fields as possible, and we also run scripts over the remaining fields to check for certain red flags (things like “Mr.”, “Ms.”, “Dr.”, etc.). However, you know your data best, so it is always a good idea to ensure that you are not inadvertently sending identifying data with your payload.
  • Your site’s data is highly non-compliant with your chosen data model. Non-compliant source data will cause problems when we try to map it to N3C OMOP. Issues that we have seen raise this flag include:
    • An ACT site’s demographics are not in the ACT ontology format. If the representation of your demographic data uses a local format, please map to the ACT format before submission. This means using the following prefixes: DEM|SEX:, DEM|RACE:, DEM|VITAL STATUS:, DEM|HISP:, and only using values that are in the ACT model.
    • An OMOP site has many concepts in the wrong tables, or a high percentage of non-standard concepts. We run checks on DOMAIN_ID (the field in the CONCEPT table which dictates which domain a concept belongs in) to look for compliance with basic OMOP convention. You will be notified if your CDM has a high number of concepts in the wrong domains. We understand that you may want to put all Conditions in the CONDITION_OCCURRENCE domain; however, a compliant OMOP CDM puts each condition code in whatever domain its DOMAIN_ID in the CONCEPT table specifies (e.g., it could be OBSERVATION, CONDITION_OCCURRENCE, etc.). We expect OMOP vocabulary convention to be followed during ETL. This includes correct usage of standard concepts (inclusive of putting only standard concepts in the standard concept column).
  • Unique keys. Remember to run our primary key duplicate checker. If you’re using our exporters, this will automatically run, but if you’re running the SQL manually, you will need to make sure to run the VALIDATION_SCRIPT code block in the extract file. If you have no duplicates, you will get no results when you run this query. No results = good results! If you do get results, that means you have duplicate primary keys in one or more tables. We will ask you to fix this prior to submitting, as duplicate keys prevent us from being able to ingest your data.
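If you are checking by hand rather than via the VALIDATION_SCRIPT code block, a minimal duplicate-key check might look like the following Python sketch. (The file contents and field names are invented for the example; substitute your own extract and primary key column.)

```python
import csv
import tempfile
from collections import Counter

# Build a toy pipe-delimited extract containing a duplicated primary key.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write('"person_id"|"gender"\n"1"|"F"\n"2"|"M"\n"1"|"F"\n')
    path = f.name

def duplicate_keys(path, key_field, delimiter="|"):
    """Return primary-key values that appear more than once (empty list = good)."""
    with open(path, newline="") as fh:
        counts = Counter(row[key_field] for row in csv.DictReader(fh, delimiter=delimiter))
    return [key for key, n in counts.items() if n > 1]

print(duplicate_keys(path, "person_id"))  # → ['1']
```

As with the SQL check, no results means good results: an empty list indicates the table's primary key is unique.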

Common Data Quality Issues: “Highly Recommended”

The following are issues that may not prevent your data from being released into the Enclave, but that we may inquire about. If it is easy for you to fix these things, we would love to have them, but understand if it is not possible for your site.

  • Death data. We ask that sites review the death data included in their submission for plausibility (e.g., there should be no deaths prior to Jan 2020, unless you are date shifting). If your site does not provide any death data, we may check in to see if it’s possible to provide in future payloads. This may include discharge disposition for data models that support that information (OMOP, PCORnet).

  • Visit concepts/types. Knowing whether a visit is inpatient/outpatient/emergency can be important for many research questions. If your site has a high percentage of null or non-standard values in this field, we may reach out to see if additional information can be mapped in future payloads. For i2b2 sites, make sure you use only ACT-defined visit types:
    • EI - Emergency Department Admit To Inpatient
    • E - Emergency Department Visit
    • I - Inpatient Hospital Stay
    • N - No Information
    • NA - Non-Acute Hospital Stay
    • X - Other Ambulatory Visit
    • O - Ambulatory Visit

  • End dates. End dates are generally not required fields in models other than OMOP. But, they can be useful.

    • For non-OMOP sites: Despite the fact that they are not required, end dates for visits, particularly inpatient visits, are really important for being able to calculate things like length of stay. If most or all of your visit end dates are null, we will likely reach out and ask whether this can be rectified.
    • For OMOP sites: Many OMOP sites fill in missing end dates (which are required fields in OMOP) with “dummy” dates like 1/1/1800. We are able to work around this, but note that this convention may cause some researchers to calculate somewhat bizarre lengths of stay for your site until they understand the cause.
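To illustrate why dummy end dates matter, here is a small Python sketch that computes length of stay while treating the 1/1/1800 placeholder as “no end date.” (The visit records are hypothetical; a real query would read from your VISIT_OCCURRENCE table.)

```python
from datetime import date

# Placeholder convention some OMOP sites use for a missing end date (assumption
# based on the example above; your site's dummy value may differ).
DUMMY_END_DATE = date(1800, 1, 1)

# Hypothetical visits as (visit_start_date, visit_end_date) pairs.
visits = [
    (date(2020, 3, 1), date(2020, 3, 8)),   # normal inpatient stay
    (date(2020, 4, 2), DUMMY_END_DATE),     # missing end date encoded as a dummy
]

def length_of_stay(start, end):
    """Length of stay in days, or None when the end date is the dummy placeholder."""
    if end == DUMMY_END_DATE:
        return None
    return (end - start).days

print([length_of_stay(s, e) for s, e in visits])  # → [7, None]
```

Without the guard, the second visit would come out to roughly negative 80,000 days, which is exactly the kind of “bizarre length of stay” a researcher might compute before noticing the convention.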

Note that if both visit concepts/types AND end dates are frequently missing/non-standard, that may qualify as a dealbreaker. We really need one or the other in order to appropriately classify encounters during analysis.