# Medical procedures dataset

### Description

Coming soon

### Introduction

The Human Phenotype Project collects data on medical procedures through online surveys in which participants self-report the procedures they have undergone. The dataset therefore depends on participants to describe their own experiences with medical interventions. Accurate and complete information about the use of these procedures is needed to assess their actual effects on personal health.

### Measurement protocol 
<!-- long measurement protocol for the data browser -->
During the registration phase of the study, participants are required to provide details about their medical procedures in the Initial Medical Survey. Further data is then gathered in the Follow-up Medical Survey when participants return for subsequent visits.

### Data availability 
<!-- for the example notebooks -->
The information is stored in two parquet files: `initial_medical.parquet` and `follow_up_medical.parquet`.

### Relevant links

* [Pheno Knowledgebase](https://knowledgebase.pheno.ai/datasets/050-medical_procedures)
* [Pheno Data Browser](https://pheno-demo-app.vercel.app/folder/50)

In [1]:
%load_ext autoreload
%autoreload 2
from pheno_utils import PhenoLoader
import pandas as pd

In [2]:
pl = PhenoLoader('medical_procedures', base_path='s3://pheno-synthetic-data/data')
pl

PhenoLoader for medical_procedures with
63 fields
3 tables: ['initial_medical', 'follow_up_medical', 'age_sex']

## Data dictionary

In [3]:
pl.dict.head()

Unnamed: 0_level_0,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,stability,units,sampling_rate,strata,sexed,debut,completed,pandas_dtype,data_coding
tabular_field_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
collection_timestamp,50,initial_medical,Collection timestamp,medical_procedures/initial_medical.parquet,,,Timestamp of measurements collection,Datetime,Single,Accruing,Time,,Collection time,Both sexes,2/26/2020,,"datetime64[ns, Asia/Jerusalem]",
collection_date,50,initial_medical,Collection date,medical_procedures/initial_medical.parquet,,,Date of measurments collection,Date,Single,Accruing,Time,,Collection time,Both sexes,2/26/2020,,datetime64[ns],
timezone,50,initial_medical,Timezone,medical_procedures/initial_medical.parquet,,,Timezone of the measurments,Categorical (single),Single,Accruing,,,Collection time,Both sexes,2/26/2020,,category,
fertility_treatment_final_year,50,initial_medical,Fertility treatment final year,medical_procedures/initial_medical.parquet,,,Fertility treatment final year,Continuous,Single,Accruing,,,Primary,Females only,2/26/2020,,int,
invasive_procedure,50,initial_medical,Invasive procedure,medical_procedures/initial_medical.parquet,,,Have you previously undergone surgery or an in...,Categorical (single),Single,Accruing,,,Primary,Both sexes,2/26/2020,,int,21.0


## Data coding
The data codings table maps each code to the actual answer text for categorical features.

In [4]:
df_codes = pl.data_codings
pl.data_codings

Unnamed: 0,code_number,coding,english,hebrew,answer_field_name,sexed,ukbb_compatible,ukbb_similar_coding,scripting_instruction,description,notes
0,001_03,0,asia/jerusalem,זמן ישראל,,,False,,,Timezone,
1,062_02,1,BRCA-1,BRCA-1,brca_type_1,Both sexes,,,,,
2,062_02,2,BRCA-2,BRCA-2,brca_type_2,Both sexes,,,,,
3,062_02,-1,Do not know,לא יודע/ת,brca_type_3,Both sexes,,,NMUL,,
4,062_03,1,"A relative of mine got breast, ovarian, or pro...","קרוב/ת משפחה חלה/תה בסרטן השד, השחלה, או הערמו...",brca_indication_1,Both sexes,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
1323,009_03,0,5.3.81.5,,,,False,,,Sleep monitoring processing software version,
1324,009_04,0,Inconclusive REM,,,,False,,,Sleep monitoring warning: Inconclusive REM,
1325,009_04,1,Respiratory indices Not Available (N/A) due to...,,,,False,,,Sleep monitoring warning: Respiratory indices ...,
1326,009_04,2,Insufficient signals for Respiratory Indices c...,,,,False,,,Sleep monitoring warning: Insufficient signals...,
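
As a rough illustration of how the codings table can be used, the sketch below builds a code-to-English lookup for one `code_number` and applies it to a coded column. It runs on a toy table whose rows are copied from the output above; the helper `coding_to_english` and the `answers` series are hypothetical, and the sketch assumes the `coding` column holds integers, which may not be true for the real table.

```python
import pandas as pd

# Toy excerpt shaped like pl.data_codings (rows copied from the output above).
data_codings = pd.DataFrame(
    {
        "code_number": ["062_02", "062_02", "062_02", "001_03"],
        "coding": [1, 2, -1, 0],
        "english": ["BRCA-1", "BRCA-2", "Do not know", "asia/jerusalem"],
    }
)


def coding_to_english(codings: pd.DataFrame, code_number: str) -> dict:
    """Return a {coding: english} lookup for one code_number (hypothetical helper)."""
    subset = codings[codings["code_number"] == code_number]
    return dict(zip(subset["coding"], subset["english"]))


brca_lookup = coding_to_english(data_codings, "062_02")

# Hypothetical coded answers to a single-choice question.
answers = pd.Series([1, -1, 2, 1], name="brca_type")
labels = answers.map(brca_lookup)
print(labels.tolist())
```

In the notebook itself, `transform_answers` from `pheno_utils` is the supported way to do this conversion; the sketch only shows the idea behind the lookup.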


## Exploration of the data

Each column represents a single question. Answers are saved as numbers.
The type of the question (integer, or category with single/multiple choice) can be found in the data dictionary.

In [5]:
pl.dict.field_type.value_counts()

field_type
Categorical (single)      38
Continuous                26
Categorical (multiple)    16
Date                       7
Text                       5
Datetime                   2
Integer                    2
Name: count, dtype: int64

### Categorical single choice questions
For categorical questions, answers are coded according to the data coding. By default, PhenoLoader shows the values in English. To view the raw codes instead, set `preferred_language` to `'coding'`.

In [7]:
pl = PhenoLoader('medical_procedures', preferred_language='coding', base_path='s3://pheno-synthetic-data/data')

In [8]:
# Field names in the dictionary index are lowercase, so exclude 'timezone'
single = pl.dict[(pl.dict['field_type'] == 'Categorical (single)') & (pl.dict.index != 'timezone')].index.values.tolist()

single_question = 'cardiac_MRI_results'

pl[single_question].value_counts()


cardiac_MRI_results
abnormal               5
normal                 5
Name: count, dtype: int64

To change the representation of the answers, we can use the function `transform_answers` from `pheno_utils.questionnaires_handler`, or change the `preferred_language` setting of PhenoLoader (the default is English).

In [9]:
from pheno_utils.questionnaires_handler import transform_answers

Transform to English

In [10]:
original_series = pl[single_question].squeeze()

tranformed_english = transform_answers(single_question, original_series, transform_from='coding', transform_to='english', dict_df=pl.dict, mapping_df=pl.data_codings)
tranformed_english.value_counts()

cardiac_MRI_results
abnormal    5
normal      5
Name: count, dtype: int64

### Categorical multiple choice questions

In [11]:
pl.dict[pl.dict['field_type'] =='Categorical (multiple)'].head(5)

Unnamed: 0_level_0,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,stability,units,sampling_rate,strata,sexed,debut,completed,pandas_dtype,data_coding
tabular_field_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
invasive_procedure_type,50,initial_medical,Invasive procedure type,medical_procedures/initial_medical.parquet,,,Please select from the following list of invas...,Categorical (multiple),Multiple,Accruing,,,Primary,Both sexes,2/26/2020,,object,050_01
number_of_hospitalization_this_year,50,initial_medical,Number of hospitalization this year,medical_procedures/initial_medical.parquet,,,Please specify how many times have you been ho...,Categorical (multiple),Multiple,Accruing,,,Primary,Both sexes,2/26/2020,,object,050_03
performed_tests,50,initial_medical,Performed tests,medical_procedures/initial_medical.parquet,,,Did you perform any of the following tests,Categorical (multiple),Multiple,Accruing,,,Primary,Both sexes,2/26/2020,,object,050_02
polyps_information,50,initial_medical,Polyps information,medical_procedures/initial_medical.parquet,,,Please detail any findings as far as you know,Categorical (multiple),Multiple,Accruing,,,Primary,Both sexes,2/26/2020,,object,050_05
eco_heart_past_year,50,follow_up_medical,What surgery,medical_procedures/follow_up_medical.parquet,,,(22)? Have you had one of these tests in the l...,Categorical (multiple),Multiple,Accruing,,,Primary,Both sexes,2/26/2020,,object,046_01


In [12]:
multiple = pl.dict[pl.dict['field_type'] == 'Categorical (multiple)'].index.values.tolist()
multiple_question = 'performed_tests'


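# If the field name appears more than once in the dictionary, the lookup
# returns a Series; take its first entry by position.
if isinstance(pl.dict.loc[multiple_question]['data_coding'], pd.Series):
    code = pl.dict.loc[multiple_question]['data_coding'].iloc[0]
else:
    code = pl.dict.loc[multiple_question]['data_coding']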
code

'050_02'
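
The `isinstance` check in the cell above is needed because a label lookup with `.loc` returns a scalar when the index label is unique and a Series when it is repeated. The sketch below reproduces both cases on a toy dictionary. The rows are hypothetical (in particular, it is an assumption that a field name can be listed once per table), and `get_data_coding` is an illustrative helper, not part of `pheno_utils`.

```python
import pandas as pd

# Toy dictionary in which one field name is listed twice (hypothetical rows).
dict_df = pd.DataFrame(
    {
        "feature_set": ["initial_medical", "follow_up_medical", "initial_medical"],
        "data_coding": ["050_02", "050_02", "050_01"],
    },
    index=pd.Index(
        ["performed_tests", "performed_tests", "invasive_procedure_type"],
        name="tabular_field_name",
    ),
)


def get_data_coding(dictionary: pd.DataFrame, field: str) -> str:
    """Return the data coding of a field, whether or not its label is repeated."""
    value = dictionary.loc[field, "data_coding"]
    if isinstance(value, pd.Series):
        # Repeated label: take the first entry by position.
        return value.iloc[0]
    return value


print(get_data_coding(dict_df, "performed_tests"))
print(get_data_coding(dict_df, "invasive_procedure_type"))
```

Taking the first entry assumes that all rows for the same field share one coding; if that might not hold, it would be safer to check `value.nunique() == 1` first.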

In [13]:
pl[multiple_question].dropna().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,performed_tests
participant_id,cohort,research_stage,array_index,Unnamed: 4_level_1
5516424321,10k,00_00_visit,0,"[Eco -heart at rest, Echo heart in the effort,..."
5027574288,10k,00_00_visit,0,"[Heart mapping, Doppler of the neck arteries, ..."
7783260382,10k,00_00_visit,0,"[Doppler of the neck arteries, An endoscopic e..."
1178277844,10k,00_00_visit,0,"[Lung function, An endoscopic examination of t..."
1622660825,10k,00_00_visit,0,"[Lung function, (CT, MRI) brain imaging]"


In [14]:

original_series = pl[multiple_question][multiple_question]

tranformed_english = transform_answers(multiple_question, original_series, transform_from='coding', transform_to='english',
                                     dict_df=pl.dict, mapping_df=df_codes)
tranformed_english.dropna().head()

participant_id  cohort  research_stage  array_index
5516424321      10k     00_00_visit     0              [Eco -heart at rest, Echo heart in the effort,...
5027574288      10k     00_00_visit     0              [Heart mapping, Doppler of the neck arteries, ...
7783260382      10k     00_00_visit     0              [Doppler of the neck arteries, An endoscopic e...
1178277844      10k     00_00_visit     0              [Lung function, An endoscopic examination of t...
1622660825      10k     00_00_visit     0                       [Lung function, (CT, MRI) brain imaging]
Name: performed_tests, dtype: object

In [15]:
from collections import Counter
# Flatten the list of lists into a single list
flattened = [item for sublist in tranformed_english.dropna() for item in sublist]

# Count the frequency of each answer
answer_counts = Counter(flattened)

# Display the counts
print(answer_counts)

Counter({'Over the age of 50 - Did you do colonoscopy as part of a review?': 268, 'An endoscopic examination of the digestive tract unnecessarily by age review? Gastroscopy or colonoscopy': 230, 'Lung function': 196, 'Eco -heart at rest': 176, '(CT, MRI) brain imaging': 168, 'Echo heart in the effort': 165, 'Effort test (ergometry)': 130, 'Doppler of the neck arteries': 106, 'Heart mapping': 75, 'MRI cardal': 61, 'CT cardal': 56, 'Cordial catheterization': 12})
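
Raw counts are easier to read when ranked and expressed as a share of the participants who answered. The sketch below shows one way to do this with the standard library. The `answers` list is hypothetical toy data, with `None` standing in for a participant who skipped the question (in the notebook, `dropna()` plays that role).

```python
from collections import Counter
from itertools import chain

# Hypothetical multiple-choice answers; None marks a skipped question.
answers = [
    ["Lung function", "(CT, MRI) brain imaging"],
    ["Heart mapping", "Lung function"],
    None,
    ["Lung function"],
]

# Keep only participants who answered, then count every selected option.
answered = [a for a in answers if a is not None]
counts = Counter(chain.from_iterable(answered))

# most_common() yields (answer, count) pairs from most to least frequent.
for answer, n in counts.most_common():
    share = n / len(answered)
    print(f"{answer}: {n} ({share:.0%} of respondents)")
```

Because each participant can select several options, the shares do not need to sum to 100%.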
