Survey Data in CDM #90

clairblacketer opened this Issue Aug 1, 2017 · 9 comments



clairblacketer commented Aug 1, 2017

Adding PROM data to CDM

  • Requestor: Colin Orr and Catherine Kerr, ICON plc
  • Revising party: Joshua Ransom, Anna Corning, Emelly Rusli, Rayhnuma Ahmed, Aaron Stern; SHYFT Analytics
  • Discussion: here


ICON plc is currently engaged in a project with ICHOM (International Consortium for Health Outcomes Measurement).

ICHOM's mission is to unlock the potential of value-based healthcare by defining global Standard Sets of outcome measures that really matter to patients for the most relevant medical conditions and by driving adoption and reporting of these measures worldwide.

ICHOM brings together patient representatives, clinician leaders, and registry leaders from all over the world to develop Standard Sets, comprehensive yet parsimonious sets of outcomes and case-mix variables for specific medical conditions that ICHOM recommends all providers track.

Each Standard Set focuses on patient-centered results, and provides an internationally-agreed upon method for measuring each of these outcomes. ICHOM believes that standardized outcomes measurement will open up new possibilities to compare performance globally, allow clinicians to learn from each other, and rapidly improve the care provided to patients.

ICHOM Standard Sets include baseline conditions and risk factors to enable meaningful case-mix adjustment globally, ensuring that comparisons of outcomes will take into account the differences in patient populations across not just providers, but also countries and regions. They also include high-level treatment variables to allow stratification of outcomes by major treatment types. A comprehensive data dictionary, as well as scoring guides for patient-reported outcomes is provided for each Standard Set.


ICON plc is developing a platform to ingest, store and analyse the outcome measures and is using the OMOP Common Data Model to store the data. The current CDM satisfies many of the requirements, but there are some gaps, specifically:

  • We need to store data relating to each Patient Reported Outcome (PRO) questionnaire that a patient completes. Examples include the timestamp of completion, whether the patient completed it with assistance, and the role of the person who completed it. We also need to store the timing of the survey relative to the treatment the patient received, for example 'baseline' or 'six month follow-up'. This contextual data allows us to compare outcomes over time. To store it, we propose introducing a new SURVEY table. Each row represents one instance of a completed survey and links together that survey's questions and answers, which are stored in a new RESPONSE table.
  • To store the responses to the questions on a questionnaire, we propose introducing a new RESPONSE table. Patient responses are stored as key-value pairs, along with supporting contextual information such as the text of the source question and the type of field used to collect the data. All patient responses link back to a specific questionnaire instance in the SURVEY table.

SURVEY table

(other potential table name options: PRO, PATIENT_REPORTED_OUTCOME, PROM, others?)

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| SURVEY_OCCURRENCE_ID | Yes | integer | Unique identifier for each completed survey |
| SURVEY_CONCEPT_ID | Yes | integer | A foreign key that refers to a survey Concept identifier in the Standardized Vocabularies |
| PERSON_ID | Yes | integer | A foreign key identifier to the Person in the PERSON table about whom the survey was completed |
| VISIT_OCCURRENCE_ID | No | integer | A foreign key to the visit in the visit table during which the treatment relating to this survey was carried out |
| SURVEY_DATE | Yes | date | Date on which the survey was completed |
| ASSISTED_CONCEPT_ID | No | integer | A foreign key that refers to a data source Concept identifier in the Standardized Vocabularies. Example: Yes = concept_id 45877994 (LOINC concept_code LA33-6); No = concept_id 45878245 (LOINC concept_code LA32-8) |
| ASSISTED_SOURCE_VALUE | No | varchar(10) | Source value representing whether the patient required assistance to complete the survey. Example: "Completed without assistance", "Completed with assistance" |
| SURVEY_RECORDER_CONCEPT_ID | No | integer | A foreign key that refers to a data source Concept identifier in the Standardized Vocabularies. Example: Research Associate = concept_id 4074477 (SNOMED concept_code 224614003); Patient = concept_id 4023409 (SNOMED concept_code 116154003) |
| SURVEY_RECORDER_SOURCE_VALUE | No | varchar(10) | Source code representing the role of the person who completed the survey. Example: Administrative, Clinician, Patient-reported |
| TIMING_CONCEPT_ID | No | integer | A foreign key that refers to a timing Concept identifier in the Standardized Vocabularies. Example: 3 month follow-up = concept_id 44789369 (SNOMED obs concept_code 200521000000107) |
| TIMING_SOURCE_VALUE | No | varchar(100) | Text string representing the timing of the survey. Example: 'BASELINE' |
| COLLECTION_METHOD_CONCEPT_ID | No | varchar(10) | A foreign key that refers to a collection method Concept identifier in the Standardized Vocabularies. Example: Telephone Reported = concept_id 4084141 (obs SNOMED concept_code 281313006) |
| COLLECTION_METHOD_SOURCE_VALUE | No | varchar(10) | The collection type as it appears in the source data. This code is mapped to a Standard Concept in the Standardized Vocabularies and the original code is stored here for reference. Example: Paper, Telephone, Electronic Questionnaire |
| DURATION | No | varchar(50) | Time taken to complete the survey (HH:MM:SS) |
| SURVEY_SOURCE_VALUE | No | varchar(100) | The survey name/title as it appears in the source data. This code is mapped to a Standard Concept in the Standardized Vocabularies and the original code is stored here for reference. |
| SURVEY_SOURCE_IDENTIFIER | No | varchar(100) | Unique identifier for each completed survey in the source system |
| VALIDATED_SURVEY_CONCEPT_ID | No | integer | A foreign key that refers to a data source Concept identifier in the Standardized Vocabularies. Example: Yes = concept_id 45877994 (LOINC concept_code LA33-6); No = concept_id 45878245 (LOINC concept_code LA32-8) |
| SURVEY_VERSION_NUMBER | No | varchar(20) | Version number of the questionnaire or survey used |
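
A minimal sketch of how the proposed SURVEY table might be declared, using Python's built-in sqlite3 module purely for illustration. Column names and required flags follow the table above; the SQLite types, the in-memory database, and the hypothetical survey_concept_id (1001) and person_id (42) are assumptions and not part of the proposal. The other concept IDs are the examples quoted in the column descriptions.

```python
import sqlite3

# In-memory database, used only to illustrate the proposed structure.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE survey (
        survey_occurrence_id            INTEGER PRIMARY KEY,
        survey_concept_id               INTEGER NOT NULL,
        person_id                       INTEGER NOT NULL,
        visit_occurrence_id             INTEGER,
        survey_date                     TEXT    NOT NULL,  -- ISO 8601 date
        assisted_concept_id             INTEGER,
        assisted_source_value           TEXT,
        survey_recorder_concept_id      INTEGER,
        survey_recorder_source_value    TEXT,
        timing_concept_id               INTEGER,
        timing_source_value             TEXT,
        collection_method_concept_id    TEXT,   -- listed as varchar(10) in the proposal
        collection_method_source_value  TEXT,
        duration                        TEXT,   -- HH:MM:SS
        survey_source_value             TEXT,
        survey_source_identifier        TEXT,
        validated_survey_concept_id     INTEGER,
        survey_version_number           TEXT
    )
""")

# One completed survey: unassisted (45878245), recorded by the patient
# (4023409), at the 3 month follow-up (44789369).
conn.execute(
    "INSERT INTO survey (survey_occurrence_id, survey_concept_id, person_id, "
    "survey_date, assisted_concept_id, survey_recorder_concept_id, "
    "timing_concept_id) VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 1001, 42, "2017-08-01", 45878245, 4023409, 44789369),
)

row = conn.execute("SELECT survey_date, timing_concept_id FROM survey").fetchone()
print(row)
```

Fields marked "Yes" under Required become NOT NULL here, so an insert that omits SURVEY_CONCEPT_ID or SURVEY_DATE is rejected.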


Patient responses to survey questions are stored in the RESPONSE table. Each record in the RESPONSE table represents a single question/response pair and is linked to a specific survey/questionnaire instance through SURVEY_OCCURRENCE_ID. Each response record answers a specific question identified by QUESTION_CONCEPT_ID. This concept ID refers to a unique question in the CONCEPT table that identifies the concept DOMAIN (see the example below for more details). An individual survey question can have multiple responses (e.g. "which of these items relate to you: a, b, c, …?"). Each response is stored as a separate record in the RESPONSE table.
The response/answer to a survey question can be stored as RESPONSE_AS_CONCEPT_ID, RESPONSE_AS_NUMBER, RESPONSE_AS_STRING, and/or RESPONSE_AS_DATETIME.

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| RESPONSE_OCCURRENCE_ID | Yes | integer | Unique identifier for each response |
| PERSON_ID | Yes | integer | A foreign key identifier to the Person in the PERSON table about whom the response was recorded |
| SURVEY_OCCURRENCE_ID | Yes | integer | A foreign key to the SURVEY table identifying the survey in which the question and response occurred |
| QUESTION_CONCEPT_ID | Yes | integer | A foreign key that refers to a question Concept identifier in the Standardized Vocabularies |
| QUESTION_SOURCE_VALUE | No | varchar(255) | The question as it appears in the source data. This code is mapped to a Standard Concept in the Standardized Vocabularies and the original code is stored here for reference. |
| RESPONSE_DATE | Yes | date | Date on which the response was recorded |
| RESPONSE_DATETIME | No | datetime | Date and time at which the response was recorded |
| RESPONSE_AS_CONCEPT_ID | Yes | integer | Foreign key that refers to a response Concept identifier in the Standardized Vocabularies |
| RESPONSE_AS_STRING | No | varchar(255) | The response stored as a string. This is applicable to questions where the result is expressed as verbatim text. |
| RESPONSE_AS_NUMBER | No | float | The response stored as a number. This is applicable to questions where the result is expressed as a numeric value. |
| RESPONSE_AS_DATETIME | No | datetime | The response stored as a datetime. This is applicable to questions where the result is expressed as a historical date/time. |
| RESPONSE_RANGE_LOW | No | varchar(50) | The lowest value of the range of responses for the question |
| RESPONSE_RANGE_HIGH | No | varchar(50) | The highest value of the range of responses for the question |
| RESPONSE_FIELD_TYPE_CONCEPT_ID | No | integer | A foreign key to the concept table indicating the type of field used to collect the response (multi-select, radio button, slider…) |
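
To illustrate the "one record per response" rule for multi-select questions, here is a small sketch in plain Python. The `Response` class mirrors a subset of the proposed RESPONSE columns; the helper `expand_multi_select` and every ID used (person 42, question 9001, options 9101 and 9103) are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Response:
    """One question/response pair, following the proposed RESPONSE columns."""
    response_occurrence_id: int
    person_id: int
    survey_occurrence_id: int
    question_concept_id: int
    response_date: date
    response_as_concept_id: int
    response_as_string: Optional[str] = None
    response_as_number: Optional[float] = None


def expand_multi_select(first_id: int, person_id: int, survey_occurrence_id: int,
                        question_concept_id: int, response_date: date,
                        selected_concept_ids: List[int]) -> List[Response]:
    """Create one Response record per option the patient selected."""
    return [
        Response(first_id + offset, person_id, survey_occurrence_id,
                 question_concept_id, response_date, concept_id)
        for offset, concept_id in enumerate(selected_concept_ids)
    ]


# Hypothetical IDs: question 9001, with options 9101 ("a") and 9103 ("c") ticked.
records = expand_multi_select(100, 42, 1, 9001, date(2017, 8, 1), [9101, 9103])
for record in records:
    print(record.response_occurrence_id, record.response_as_concept_id)
```

Both records share the same QUESTION_CONCEPT_ID and SURVEY_OCCURRENCE_ID, so the answers to one question can be regrouped by filtering on those two columns.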

Other Considerations

  • Extensions to the concept table include the survey and response data that is not currently contained in the standard libraries. All custom extensions to the concept table have been stored in the negative address space so as not to conflict with the currently defined standard. These extensions are not included in the definition of this proposal but should be considered for future work.
  • There is no formal definition of the relationship between a questionnaire/survey and the questions presented on that survey. There is an implicit relationship created when survey/response data is stored. If an explicit relationship is required, this can be achieved using the FACT_RELATIONSHIP table.

Use Cases

  • The example below describes the data to be stored for a question on the HOOSPS (Hip Disability and Osteoarthritis Outcome Score) patient questionnaire.
  • The question asks the degree of difficulty in descending stairs due to the patient's hip problem. The patient answers "Moderate".
  • The CONCEPT table contains domain data for the survey HOOSPS, question (HPS1) plus all the potential values that a patient can respond with.

CONCEPT table - example

2020 HPS1 Metadata Domain Domain ICHOM generated
2021 None HPS1 ICHOM Observation PRO Measure S 0
2022 Mild HPS1 ICHOM Observation PRO Measure S 1
2023 Moderate HPS1 ICHOM Observation PRO Measure S 2
2024 Severe HPS1 ICHOM Observation PRO Measure S 3
2025 Extreme HPS1 ICHOM Observation PRO Measure S 4

In this instance the questionnaire captures the patient response as code 2. The CONCEPT_ID is determined by finding the row in the concept table whose DOMAIN_ID matches the specific question (HPS1) and whose CONCEPT_CODE matches the response value (2).
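
The lookup described above can be sketched as follows. The rows are a subset of the example CONCEPT table, reduced to the four values the lookup needs (concept_id, concept_name, domain_id, concept_code); the function name `find_concept_id` and the choice to raise an error when there is no unique match are assumptions made for illustration.

```python
# Subset of the example CONCEPT rows:
# (concept_id, concept_name, domain_id, concept_code)
CONCEPTS = [
    (2021, "None", "HPS1", "0"),
    (2022, "Mild", "HPS1", "1"),
    (2023, "Moderate", "HPS1", "2"),
    (2024, "Severe", "HPS1", "3"),
    (2025, "Extreme", "HPS1", "4"),
]


def find_concept_id(concepts, question_code, response_code):
    """Return the concept_id whose domain_id and concept_code both match."""
    matches = [
        concept_id
        for concept_id, _name, domain_id, concept_code in concepts
        if domain_id == question_code and concept_code == str(response_code)
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one concept for {question_code}/{response_code}, "
            f"found {len(matches)}"
        )
    return matches[0]


print(find_concept_id(CONCEPTS, "HPS1", 2))  # 2023, the concept for "Moderate"
```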

SURVEY table - example

| Column | Value | Comment |
| --- | --- | --- |
| SURVEY_CONCEPT_ID | 3501 | Concept for HOOSPS survey |
| SURVEY_DATE | 2016-07-14 | |
| ASSISTED_CONCEPT_ID | 3601 | Concept for "completed without assistance" |
| ASSISTED_SOURCE_VALUE | completed without assistance | |
| SURVEY_RECORDER_CONCEPT_ID | 3611 | Concept for "patient reported" |
| SURVEY_RECORDER_SOURCE_VALUE | P-Rep | Source system value for "patient reported" |
| TIMING_CONCEPT_ID | 3621 | Concept for baseline timing |
| COLLECTION_METHOD_CONCEPT_ID | 3631 | Concept for "electronic questionnaire" |
| COLLECTION_METHOD_SOURCE_VALUE | E-QUEST | Source system value for "electronic questionnaire" |
| DURATION | null | Unknown in this case |
| VALIDATED_SURVEY_CONCEPT_ID | 3701 | Concept for "validated survey" |

RESPONSE table - example

| Column | Value | Comment |
| --- | --- | --- |
| QUESTION_SOURCE_VALUE | degree of difficulty in... | |
| RESPONSE_DATE | 2016-07-14 | |
| RESPONSE_AS_CONCEPT_ID | 2023 | Concept for "Moderate" |
| RESPONSE_FIELD_TYPE_CONCEPT_ID | 4001 | Concept for radio button |
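
As a rough sketch of how the two example records tie together, the snippet below loads them into trimmed-down SURVEY and RESPONSE tables (sqlite3, in memory) and joins them on SURVEY_OCCURRENCE_ID. Only a handful of columns are kept. The key values 1 and 10, the person_id 42, and the use of concept 2020 (HPS1) as the QUESTION_CONCEPT_ID are assumptions, since the example does not list them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survey (
        survey_occurrence_id INTEGER PRIMARY KEY,
        survey_concept_id    INTEGER NOT NULL,
        person_id            INTEGER NOT NULL,
        survey_date          TEXT    NOT NULL,
        timing_concept_id    INTEGER
    );
    CREATE TABLE response (
        response_occurrence_id INTEGER PRIMARY KEY,
        person_id              INTEGER NOT NULL,
        survey_occurrence_id   INTEGER NOT NULL
            REFERENCES survey (survey_occurrence_id),
        question_concept_id    INTEGER NOT NULL,
        response_date          TEXT    NOT NULL,
        response_as_concept_id INTEGER NOT NULL
    );
    -- HOOSPS survey (3501) completed at baseline (3621)
    INSERT INTO survey VALUES (1, 3501, 42, '2016-07-14', 3621);
    -- Answer "Moderate" (2023) to question HPS1 (2020)
    INSERT INTO response VALUES (10, 42, 1, 2020, '2016-07-14', 2023);
""")

# Fetch each response together with the context of the survey it belongs to.
rows = conn.execute("""
    SELECT s.survey_concept_id, s.timing_concept_id,
           r.question_concept_id, r.response_as_concept_id
    FROM response AS r
    JOIN survey AS s ON s.survey_occurrence_id = r.survey_occurrence_id
    WHERE r.person_id = ?
""", (42,)).fetchall()
print(rows)
```

The join is what makes the timing context ("baseline", "six month follow-up") available when comparing a patient's answers to the same question over time.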



gowthamrao commented Aug 27, 2017

Survey data is very important, but I think we could potentially use a combination of visit_occurrence, the new visit_detail and the observation table, with minor modifications, to achieve the same end result. Visit_detail may be able to represent much of the proposed survey table, and the observation table may be able to represent much of the response table.




clairblacketer commented Sep 6, 2017

Here is Colin and Catherine's proposal presented on 9/5/2017:

2017 OMOP - CDM Survey data proposal v2.pdf



ericaVoss commented Sep 7, 2017

I actually think this fits one of the questions that @alondhe, @clairblacketer and I put out on the forum: where do you put patient-reported information like HRA or Patient Reported Medications? I think the suggestion that gets adopted here could be used to answer this question.

Is it at all possible to consider the OBSERVATION table instead of the RESPONSE table (like @gowthamrao is suggesting above)? I feel like we can still make it fit. Instead of adding a whole new table we could add some columns to an existing table.



ColinOrr2006 commented Sep 7, 2017

Erica, yes, for some of this data I agree, it is the same class of problem. However, as you have already pointed out, some of the data belongs in its already defined domain, such as drug exposure. There, the method of collection may be somewhat irrelevant, or at least secondary. For the collection of validated PRO survey data, the collection mechanism is extremely important, hence the need to track the incidence of a questionnaire being completed. It is possible to store the answers to the questionnaires in the OBSERVATION table with a number of key modifications, and that is what I am currently doing in the absence of a RESPONSE table. I have included a RESPONSE table in my proposal because when I ask myself whether this data (observations that are collected today AND responses to a patient questionnaire) is ever combined for analysis purposes, the answer is no. When I am interrogating patient responses to validated (or non-validated) patient questionnaires, I am scoring patients on a very specific measure or reported outcome. I never add, subtract or count this data together with general unvalidated observations. I don't see any value in merging this data that would offset the impact of changing the structure of the observation table to accommodate questionnaire responses.

Hope that adds some clarity to my reasoning.




cgreich commented Sep 9, 2017


Thanks for pushing this forward with a vengeance, this is really important work. And I've got a whole lot of comments I would like to engage with you guys on. However, I feel this towel-like long comment exchange is not a good way of doing it: it is very hard to refer to anything and have an in-depth, meaningful discussion. You keep scrolling up and down till the mouse gets sore. Should we figure out something more streamlined, until the proposal has matured a little more? Other working groups have used:

  • Google docs (you can edit and comment, and it is clear who said what)
  • Workshop sessions, where somebody prepares and walks the group through all the detail, including what folks have brought up

Let me know if you need help with setting things up, we'll gladly put some effort in.

My worries are mostly around the following issues:

  1. Do we really need all these fields? We need to keep the CDM clean and 100% fit for purpose. If something has no use case, it ain't getting in. For example, why do we need to know whether a survey question was a radio button? What is the analytical use case?
  2. We need to do something about connecting the surveys to the general concept space. For example, if there were a question "Do you have high blood pressure" and the answer is "Yes", then we somehow need to let folks find out about it. The standard way in the OMOP CDM is a record in the Condition table with a concept for "Hypertension", because that's where analytical methods and studies will look. How do we enable that?
  3. Surveys, in contrast to the rest of the CDM, are indirect data, which means their time stamp is not when things happen, but when questions are answered. The CDM data are usually facts timestamped to the moment they happen. Of course, that is not totally true, because there are survey questions that deal with how the patient feels currently. All this needs to be properly represented. We need to find effective ways to link survey questions and answers to real events in the patient data (you already heard me not being happy about using the VISIT table as a cheapo mechanism for this).
  4. We need to find ways to create effective vocabularies for the surveys, which allow for versions, connections between questions and their potential answers, categories of questions and overlaps of questions in different surveys.

BTW: There is no "negative address space". All concept_ids are positive integers. There is a convention that if you have local codes that are completely useless to the outside world, but you want to create private concepts, you use the ID space above 2 Billion. But the problem at hand doesn't seem to apply to that. We are talking about standardized Survey Concepts.
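
The ID-space convention mentioned here can be expressed as a simple check. This is only a sketch: the constant and function names are made up for illustration, and the 2 billion boundary is taken directly from the comment above.

```python
# Convention described above: private/local concepts live above 2 billion.
LOCAL_CONCEPT_ID_FLOOR = 2_000_000_000


def is_local_concept_id(concept_id: int) -> bool:
    """Return True if the ID falls in the space reserved by convention for local concepts."""
    if concept_id < 0:
        # There is no negative address space for concept_ids.
        raise ValueError(f"concept_id must not be negative: {concept_id}")
    return concept_id > LOCAL_CONCEPT_ID_FLOOR


print(is_local_concept_id(45877994))       # False: a standard vocabulary concept
print(is_local_concept_id(2_000_000_001))  # True: a private, local concept
```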



ColinOrr2006 commented Sep 11, 2017

Chris, I agree with the comment on collaboration, so there is a meeting / workshop set up for next week to go through this. In the meantime, just to comment on your feedback above (which is great)

  1. Field type is an important attribute when reconciling the quality of the data. For validated questionnaires (which make up the bulk of the PRO data I am collecting) it is important to know this information. It matters when attempting to understand whether the source instrument has been migrated to an electronic format in accordance with guidelines. However, if you can recommend an alternative method to capture this metadata, please let me know.
  2/3. Comments 2 and 3 above are very closely related, so I have combined them. I understand the points you are making, but for the majority of the survey data I am collecting (validated patient questionnaires), the survey data itself is the end point. How the patient feels, mobility, pain,... these are the end points, and the timestamp relates to the patient experience when they actually answer the question. However, you do raise an important point about questionnaires that relate to items that are "less" subjective. If the survey collects objective data such as medications or treatments, I would argue: why would you want to store this data as a questionnaire at all? The survey is just a harvesting method to collect objective data about the patient, and that data should be stored in its respective domain in the CDM. However, if you do need to consider how the data was collected (perhaps to establish a trust level in the data) or when it was collected, then that data should be stored in the SURVEY domain.
    It is an important distinction and requires further discussion.

  4. Linking questions to a list of valid answers is something that I have left out of the proposal, but perhaps it makes sense to address it at this stage. It could potentially be achieved using the fact_relationship table, but I suggest it should only be used where absolutely required, as it could get out of hand very quickly in terms of data volumes.

Happy to set this up as a Google doc for the workshop if you think that will help. Christian, do you want to plan this offline prior to the meeting? Happy to work with you on how best to conduct the workshop.



baileych commented Sep 12, 2017

I'm sorry I couldn't make the call last week, but I'm glad there's discussion about structuring PRO data in the CDM. The PEPR Consortium (a group of sites doing validation of pediatric PROMIS measures) has been hammering out a structure for this purpose as well, which I've pasted below in the hope that it'll be useful for the discussion. Some top-level points:

  • We went with a separate table rather than putting responses in observation because there were several metadata elements that applied specifically to surveys, and we thought a new table was a better option than burdening every row in observation with the extra columns.
  • Some of the description is PROMIS-specific, but we did try to set up the structure to generalize.
  • We put both the individual responses and the composite scores in the same table, with different concept IDs to distinguish them.
  • We left the survey structure out of the data model, except insofar as it needs to be represented in the vocabulary to fill in the requisite metadata fields.
  • We didn't normalize the metadata common to all items in a given administration out of the table. Normalizing is definitely a cleaner approach for storage; we defer to consensus on what's easier for analysis.
  • We did capture some additional scoring characteristics that are important for normalized scoring; it'll be interesting to hear how this might look outside the world of PROMIS and similar instruments.


PEPR pro_occurrence Table

| Field | NOT NULL Constraint | Data Type | Description |
| --- | --- | --- | --- |
| pro_id | Yes | Integer | Unique identifier for each PRO response (Primary Key) |
| person_id | Yes | Integer | Foreign key to person table |
| pro_concept_id | Yes | Integer | A foreign key referring to a standard concept ID in the Standardized Vocabularies for the specific question or item used to obtain this PRO; this value may encode the instrument as well |
| instrument_concept_id | No | Integer | A foreign key referring to a standard concept ID in the Standardized Vocabularies for each instrument |
| instrument_version | No | Varchar(32) | Version of instrument |
| pro_type_concept_id | Yes | Integer | A foreign key referring to a standard concept ID in the Standardized Vocabularies for the administration method and mode |
| respondent_type_concept_id | No | Integer | A foreign key referring to a standard concept ID in the Standardized Vocabularies indicating the respondent type (e.g. patient, parent, proxy) to the PRO items |
| value_as_concept_id | No | Integer | A foreign key referring to a standard concept ID in the Standardized Vocabularies indicating the response to the item, for qualitative items |
| value_as_number | No | Numeric | Raw score for quantitative items |
| theta | No | Numeric | Theta value for item |
| scaled_score | No | Numeric | Scaled score for item |
| standard_error | No | Numeric | Standard error of scaled score |
| pro_response_date | No | Date | Date on which the PRO was obtained |
| pro_response_datetime | No | Datetime | The time at which the PRO was obtained |
| visit_occurence_id | No | Integer | Foreign key to visit occurrence table indicating the visit on which the PRO was obtained |
| provider_id | No | Integer | A foreign key to the provider in the provider table who was responsible for initiating or obtaining the PRO |
| instrument_source_value | No | Varchar(255) | Source value for the instrument name |
| pro_source_value | No | Varchar(255) | Source value for the item ID |
| admin_method_source_value | No | Varchar(64) | Source value for the method by which the respondent completed the measure (self-administered vs interviewer-administered) |
| admin_mode_source_value | No | Varchar(64) | Source value for the mode in which the respondent completed the measure (paper/telephone/computer, CAT) |
| respondent_source_value | No | Varchar(255) | Source value indicating respondent type |
| value_source_value | No | Varchar(255) | Source value for item response |


  • A PRO (person-reported outcome) differs from a generic observation
    in that it is derived from a structured survey or
    similar instrument. While this may include simple inventories, the
    instruments are typically structured surveys with validated
    scoring methods.

  • The pro_concept_id will specify either an
    individual item whose response is captured, or a composite score for
    all or part of an instrument. This concept will also typically
    include information about the instrument from which the item
    was drawn. However, where necessary for disambiguation or for
    convenience in analyses, the instrument_concept_id can provide
    (possibly redundant) specification of the instrument.

  • The pro_type_concept_id provides
    information about how the PRO was obtained, including method (e.g.
    self-administered or interviewer-administered) and mode (e.g.
    paper and pen, simple computer-based survey, CAT).

  • The respondent_type_concept_id provides information about the
    person who directly provided the PRO (e.g. patient,
    parent, proxy).

  • A standardized representation of the PRO response is captured in
    value_as_number (typically for numeric or Likert scales), or
    value_as_concept_id (typically for qualitative scales,
    Yes/No responses). Both may be used if a particular categorical
    response corresponds to a specific raw score (e.g. Likert scale of
    Never-Rarely-Sometimes-Often-Always corresponding to 1-5 score). If
    the response can only be represented as text, both of these fields
    are set to NULL, and analyses use value_source_value.

  • The method for obtaining scaled_score is determined by the
    instrument (e.g. T-score for PROMIS items).
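
A short sketch of the value-capture rule described in the notes above: a categorical response that corresponds to a raw score fills both value_as_number and value_as_concept_id, while a response that can only be represented as text leaves both NULL and relies on value_source_value. The Likert mapping follows the Never-to-Always example; the concept IDs (9001 to 9005) and the names `ProValue` and `encode_response` are hypothetical.

```python
from typing import NamedTuple, Optional


class ProValue(NamedTuple):
    value_as_number: Optional[float]
    value_as_concept_id: Optional[int]
    value_source_value: str


# Hypothetical mapping: source label -> (raw score, concept ID).
LIKERT = {
    "Never": (1, 9001),
    "Rarely": (2, 9002),
    "Sometimes": (3, 9003),
    "Often": (4, 9004),
    "Always": (5, 9005),
}


def encode_response(source_value: str) -> ProValue:
    """Fill the standardized value fields where a mapping exists."""
    mapped = LIKERT.get(source_value)
    if mapped is None:
        # Text-only response: both standardized fields stay NULL (None).
        return ProValue(None, None, source_value)
    score, concept_id = mapped
    return ProValue(float(score), concept_id, source_value)


print(encode_response("Often"))
print(encode_response("Only when climbing stairs"))
```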



ColinOrr2006 commented Sep 13, 2017

@baileych, thanks for that input; in the context of PROMIS it makes perfect sense. Some of your attributes have less ambiguous definitions than those contained in my earlier proposal. From a general PRO perspective, and to address some additional requirements (e.g. tracking the progression of a patient's condition over time), I have normalized your PRO_occurrence into two separate tables. There are a number of attributes in the SURVEY table that are key for DQ analysis and validity of the data.
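
One way such a normalization could look is sketched below. It is an illustration only, not the actual transformation used: the grouping key (person, instrument, response date), the column renames from pro_occurrence to SURVEY/RESPONSE names, and the sample IDs are all assumptions.

```python
from typing import Dict, List, Tuple


def normalize(flat_rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split flat per-item rows into survey-level and response-level records."""
    surveys: Dict[tuple, dict] = {}
    responses: List[dict] = []
    for row in flat_rows:
        # Assumption: one administration = same person, instrument and date.
        key = (row["person_id"], row["instrument_concept_id"],
               row["pro_response_date"])
        if key not in surveys:
            surveys[key] = {
                "survey_occurrence_id": len(surveys) + 1,
                "person_id": row["person_id"],
                "survey_concept_id": row["instrument_concept_id"],
                "survey_date": row["pro_response_date"],
            }
        responses.append({
            "survey_occurrence_id": surveys[key]["survey_occurrence_id"],
            "question_concept_id": row["pro_concept_id"],
            "response_as_concept_id": row["value_as_concept_id"],
        })
    return list(surveys.values()), responses


# Two items answered in the same administration (hypothetical IDs).
flat = [
    {"person_id": 42, "instrument_concept_id": 3501,
     "pro_response_date": "2016-07-14",
     "pro_concept_id": 2020, "value_as_concept_id": 2023},
    {"person_id": 42, "instrument_concept_id": 3501,
     "pro_response_date": "2016-07-14",
     "pro_concept_id": 2030, "value_as_concept_id": 2031},
]
surveys, responses = normalize(flat)
print(len(surveys), len(responses))
```

Metadata that is common to every item in an administration is stored once on the survey record, and each response carries only the key that points back to it.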

Adding the score into this PRO_occurrence table is a great idea. It is not something I currently store, but it makes sense to do it here. I am going to merge some of the concepts here into my earlier proposal and walk through them during the workshop scheduled for next Tuesday (19th Sept).




clairblacketer commented Nov 27, 2017

This proposal was updated as #137
