# Medical symptoms dataset

### Description 

Medical symptoms data captures detailed information about an individual's current health status by recording physical and mental symptoms they experience. This data includes subjective descriptions such as pain levels, fatigue, and other discomforts, along with observed signs like fever or swelling. Tracking symptoms over time helps in identifying potential health issues, understanding disease onset and progression, and supporting diagnosis. It provides valuable real-time insights into patient well-being and is crucial for personalized healthcare management and treatment planning.

### Introduction

The Human Phenotype Project study collects medical data through online surveys, where participants self-report their experiences with various medical symptoms. This method depends on individuals accurately conveying their own health experiences. Obtaining detailed and thorough information about the symptoms people experience is essential to understand their actual impact on individual health.

### Measurement protocol 
<!-- long measurement protocol for the data browser -->
In the initial phase of the study, during registration, participants report any medical symptoms they have experienced in the Initial Medical Survey. Additional data is then collected in the Follow-up UKBB Survey during subsequent visits, allowing for ongoing tracking of these symptoms.

Participants are also asked to complete an additional survey focusing on Irritable Bowel Syndrome (IBS) and digestive health. This digestive health survey is an adaptation of the UK Biobank online gastro-intestinal health self-assessment questionnaire. Many of the questions in the UK Biobank gut health (IBS) questionnaire were adopted from the World Gastroenterology Association questionnaire (2009). The questions cover the participant's bowel habits, abdominal pain patterns, accompanying symptoms, and family history of IBS.

### Data availability 
<!-- for the example notebooks -->
The information is stored in three Parquet files: `initial_medical.parquet`, `follow_up_ukbb.parquet`, and `digestive_health.parquet`.

### Relevant links

* [Pheno Knowledgebase](https://knowledgebase.pheno.ai/datasets/049-medical_symptoms)
* [Pheno Data Browser](https://pheno-demo-app.vercel.app/folder/49)


In [1]:
%load_ext autoreload
%autoreload 2
from pheno_utils import PhenoLoader
import random

In [2]:
pl = PhenoLoader('medical_symptoms', base_path='s3://pheno-synthetic-data/data')
pl

PhenoLoader for medical_symptoms with
72 fields
4 tables: ['initial_medical', 'follow_up_ukbb', 'digestive_health', 'age_sex']

# Data dictionary

In [3]:
pl.dict.head()

tabular_field_name,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,stability,units,sampling_rate,strata,sexed,debut,completed,pandas_dtype,data_coding
collection_timestamp,49,initial_medical,Collection timestamp,medical_symptoms/initial_medical.parquet,,,Timestamp of measurements collection,Datetime,Single,Accruing,Time,,Collection time,Both sexes,2/26/2020,,"datetime64[ns, Asia/Jerusalem]",
collection_date,49,initial_medical,Collection date,medical_symptoms/initial_medical.parquet,,,Date of measurments collection,Date,Single,Accruing,Time,,Collection time,Both sexes,2/26/2020,,datetime64[ns],
timezone,49,initial_medical,Timezone,medical_symptoms/initial_medical.parquet,,,Timezone of the measurments,Categorical (single),Single,Accruing,,,Collection time,Both sexes,2/26/2020,,string,
are_you_suffering_for_the_following_symptoms,49,initial_medical,Are you suffering for the following symptoms,medical_symptoms/initial_medical.parquet,,,Are you suffering for the following symptoms,Categorical (multiple),Single,Accruing,,,Primary,Both sexes,2/26/2020,,object,049_01
how_often_headache,49,initial_medical,How often headache,medical_symptoms/initial_medical.parquet,,,How often do you have a headaches?,Categorical (single),Single,Accruing,,,Primary,Both sexes,2/26/2020,,int,049_02


In [4]:
pl.dfs['digestive_health'].head()

participant_id,cohort,research_stage,array_index,abdomen_pain_frequency_past_3_months,abdomen_pain_women_during_period,abdomen_pain_past_6_months,abdomen_pain_improve_after_bowl_movement,abdomen_pain_increases_bowl_movements,abdomen_pain_decreases_bowl_movements,soft_stool_during_abdomen_pain,hard_stool_during_abdomen_pain,hard_stool_frequency_past_3_months,soft_stool_frequency_past_3_months,...,history_diagnoses_celiac,history_diagnoses_celiac_method,history_diagnoses_ibs,history_ibs_symptoms_speed,history_ibs_first_symptoms_infection,history_ibs_first_symptoms_infection_choice,history_ibs_first_symptoms_fever,history_ibs_first_symptoms_diarrhoea,history_ibs_first_symptoms_bloody_diarrhoea,history_ibs_first_symptoms_vomiting
5516424321,10k,00_00_visit,0,Two to three days a month,,,Sometimes,Often,Sometimes,Sometimes,Sometimes,Sometimes,Sometimes,...,No,,No,,,,,,,
5027574288,10k,00_00_visit,0,One day a week,No,,Most of the time,Often,Sometimes,Sometimes,Sometimes,Sometimes,Sometimes,...,No,,No,,,,,,,
7783260382,10k,00_00_visit,0,,,,,,,,,,,...,,,,,,,,,,
1178277844,10k,00_00_visit,0,Never,No,,,,Never,,Never,Never,Never,...,No,,No,,,,,,,
1622660825,10k,00_00_visit,0,Never,,,,,,,,Never,Never,...,No,,No,,,,,,,


## Data coding
A data coding maps each possible answer of a categorical feature to the value used to represent it in the database.

In [5]:
df_codes = pl.data_codings
df_codes.head()

Unnamed: 0,code_number,coding,english,hebrew,answer_field_name,sexed,ukbb_compatible,ukbb_similar_coding,scripting_instruction,description,notes
0,001_03,0,asia/jerusalem,זמן ישראל,,,False,,,Timezone,
1,062_02,1,BRCA-1,BRCA-1,brca_type_1,Both sexes,,,,,
2,062_02,2,BRCA-2,BRCA-2,brca_type_2,Both sexes,,,,,
3,062_02,-1,Do not know,לא יודע/ת,brca_type_3,Both sexes,,,NMUL,,
4,062_03,1,"A relative of mine got breast, ovarian, or pro...","קרוב/ת משפחה חלה/תה בסרטן השד, השחלה, או הערמו...",brca_indication_1,Both sexes,,,,,
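
To make the idea of a coding table concrete, here is a minimal sketch of decoding stored values by hand with pandas. It only mimics the `code_number`, `coding`, `english`, and `hebrew` columns shown above; the code number `demo_01`, the question `demo_question`, the helper `decode`, and all labels are hypothetical and do not come from the dataset. The sketch also assumes codes are stored as integers, which may not hold for the real table.

```python
import pandas as pd

# Toy coding table mimicking four columns of pl.data_codings.
# "demo_01" and its labels are made up for illustration only.
codings = pd.DataFrame({
    "code_number": ["demo_01", "demo_01", "demo_01"],
    "coding": [0, 1, -1],
    "english": ["No", "Yes", "Do not know"],
    "hebrew": ["לא", "כן", "לא יודע/ת"],
})

# Stored answers for a hypothetical single-choice question coded with demo_01.
answers = pd.Series([1, 0, 0, -1, 1], name="demo_question")


def decode(series, codings, code_number, language="english"):
    """Map stored codes to the labels of one coding (hypothetical helper)."""
    table = codings[codings["code_number"] == code_number]
    lookup = dict(zip(table["coding"], table[language]))
    return series.map(lookup)


decoded = decode(answers, codings, "demo_01")
print(decoded.value_counts())
```

Passing `language="hebrew"` to the same helper returns the Hebrew labels instead. For real data, `transform_answers` from `pheno_utils` is the intended route; this sketch only illustrates the lookup that a coding table represents.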


Each column represents a single question. Answers are saved as numbers.
The type of the question (integer, or category with single/multiple choice) can be found in the data dictionary.

In [6]:
pl.dict['field_type'].value_counts()

field_type
Categorical (single)      61
Integer                   21
Datetime                   3
Date                       3
Categorical (multiple)     2
Name: count, dtype: int64

### Numeric questions

For numeric questions, the answers are the numbers participants entered when filling in the survey.

In [7]:
numeric = pl.dict[pl.dict['field_type'] == 'Integer'].index.values.tolist()
# Select a random numeric question
numeric_question = random.choice(numeric)

pl[numeric_question][numeric_question].value_counts()

bowl_movements_daily_max
2     185
0     181
3     110
4      72
1      47
15      2
13      2
5       2
23      1
11      1
Name: count, dtype: int64
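
Because numeric answers are entered freely, a quick summary can help spot unusual values. The sketch below rebuilds the answers from the counts printed above (synthetic data) using only the standard library, so it runs without access to the dataset. The cutoff `MAX_PLAUSIBLE` is an arbitrary value chosen for illustration, not a rule from the study.

```python
from statistics import median

# Counts copied from the value_counts() output above (synthetic data).
counts = {2: 185, 0: 181, 3: 110, 4: 72, 1: 47, 15: 2, 13: 2, 5: 2, 23: 1, 11: 1}

# Expand the counts back into one answer per response, in sorted order.
answers = sorted(value for value, n in counts.items() for _ in range(n))

total = len(answers)
typical = median(answers)

# Hypothetical plausibility cutoff, used only to flag answers worth a second look.
MAX_PLAUSIBLE = 10
unusual = [value for value in answers if value > MAX_PLAUSIBLE]

print(f"responses: {total}, median: {typical}, max: {max(answers)}")
print(f"answers above {MAX_PLAUSIBLE}: {unusual}")
```

On the real tables, `describe()` on the corresponding pandas column gives a similar overview.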

### Categorical single choice questions

For categorical questions, answers are coded according to the data coding. By default, PhenoLoader shows the values in English. To view the coded values instead, set `preferred_language` to `'coding'`.

In [13]:
pl = PhenoLoader('medical_symptoms', preferred_language='coding', base_path='s3://pheno-synthetic-data/data')
pl

PhenoLoader for medical_symptoms with
72 fields
4 tables: ['initial_medical', 'follow_up_ukbb', 'digestive_health', 'age_sex']

In [14]:
single = pl.dict[pl.dict['field_type'] == 'Categorical (single)'].index.values.tolist()
single_question = 'chest_pain_during_power_walking_up_stairs_uphill'

pl[single_question][single_question].value_counts()

chest_pain_during_power_walking_up_stairs_uphill
No     3
Yes    2
Name: count, dtype: int64

To change the representation of the answers, we can use the function `transform_answers` from `pheno_utils.questionnaires_handler`.

In [15]:
from pheno_utils.questionnaires_handler import transform_answers

Transform to English

In [16]:
# Convert the coded answers to their English labels
transformed_english = transform_answers(single_question, pl[single_question][single_question],
                                        transform_from='coding', transform_to='english',
                                        dict_df=pl.dict, mapping_df=df_codes)
transformed_english.value_counts()

chest_pain_during_power_walking_up_stairs_uphill
No     3
Yes    2
Name: count, dtype: int64

Transform to Hebrew

In [17]:
# Convert the coded answers to their Hebrew labels
transformed_hebrew = transform_answers(single_question, pl[single_question][single_question],
                                       transform_from='coding', transform_to='hebrew',
                                       dict_df=pl.dict, mapping_df=df_codes)
transformed_hebrew.value_counts()

chest_pain_during_power_walking_up_stairs_uphill
No     3
Yes    2
Name: count, dtype: int64