# Family history

### Description 

Family history information in medical research involves collecting data on genetic and hereditary conditions passed down through generations. This includes predispositions to diseases such as cancer, cardiovascular conditions, diabetes, and mental health disorders, helping identify potential health risks based on familial patterns.

### Introduction

The Human Phenotype Project conducts comprehensive data collection through online surveys, where participants voluntarily provide information on various aspects influencing their health. This includes information about a participant's family history, captured through two lifestyle surveys and the Initial Medical Survey.

### Measurement protocol 
<!-- long measurement protocol for the data browser -->
These lifestyle surveys are modeled after the UK Biobank's touch screen questionnaire. Participants receive the full version via email to complete on the Zoho platform, either before or after their baseline visit. A shorter, follow-up version of the questionnaire is then filled out by participants during subsequent visits. 

### Data availability 
<!-- for the example notebooks -->
The information is stored in two Parquet files: `initial_medical.parquet` and `ukbb.parquet`. The `ukbb.parquet` file contains the questions asked in our lifestyle surveys at the baseline and follow-up stages.

### Relevant links

* [Pheno Knowledgebase](https://knowledgebase.pheno.ai/datasets/052-family_history.html)
* [Pheno Data Browser](https://pheno-demo-app.vercel.app/folder/52)


In [1]:
%load_ext autoreload
%autoreload 2

from pheno_utils import PhenoLoader
import pandas as pd
import random

In [2]:
pl = PhenoLoader('family_history', base_path='s3://pheno-synthetic-data/data')
pl

PhenoLoader for family_history with
57 fields
3 tables: ['initial_medical', 'ukbb', 'age_sex']

# Data dictionary

In [3]:
pl.dict.head()

tabular_field_name,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,pandas_dtype,stability,units,sampling_rate,strata,sexed,debut,completed,data_coding
sudden_death_family,52,initial_medical,Sudden death family,family_history/initial_medical.parquet,,,Has one or more of your close family members d...,Categorical (single),Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,21
family_history,52,initial_medical,Family history,family_history/initial_medical.parquet,,,Has one or more of your close family members s...,Categorical (multiple),Multiple,object,Accruing,,,Primary,Both sexes,2/26/2020,,052_01
hypertension_family_number,52,initial_medical,Hypertension family number,family_history/initial_medical.parquet,,,How many of your close family members have hig...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
hyperlipidemia_family_number,52,initial_medical,Hyperlipidemia family number,family_history/initial_medical.parquet,,,How many of your close family members have ele...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
type_1_diabetes_family_number,52,initial_medical,Type 1 diabetes family number,family_history/initial_medical.parquet,,,How many of your close family members have typ...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,


In [4]:
pl.dfs['ukbb'].head()

participant_id,cohort,research_stage,array_index,early_life_breastfeeding,early_life_comparative_body_weight,early_life_comparative_body_height,early_life_handedness,early_life_adopted,early_life_multiple_birth,early_life_mother_smoking,family_history_father_alive,family_history_adopted_father_alive,family_history_current_father_age,...,family_history_number_adopted_brothers,family_history_number_full_sisters,family_history_number_adopted_sisters,family_history_past_siblings_illness_1,family_history_past_adopted_siblings_illness_1,family_history_past_siblings_illness_2,family_history_past_adopted_siblings_illness_2,family_history_number_older_siblings,family_history_non_accidental_death,data_source
5516424321,10k,00_00_visit,0,,,,,,,,No,,0,...,0,0,0,"[None of the above (group 2), Do not know (gro...",[],[None of the above (group 2)],[],0,No,follow_up_ukbb_survey
5027574288,10k,00_00_visit,0,,,,,,,,,,0,...,0,0,0,[Heart disease],[],[],[],0,,
7783260382,10k,00_00_visit,0,No,Thinner,Shorter,Right-handed,No,No,No,No,,0,...,0,1,0,[None of the above (group 2)],[],[None of the above (group 2)],[],0,No,initial_ukbb_survey
1178277844,10k,00_00_visit,0,,,,Right-handed,,,No,No,,0,...,0,0,0,"[None of the above (group 2), Do not know (gro...",[],[None of the above (group 2)],[],0,No,follow_up_ukbb_survey
1622660825,10k,00_00_visit,0,Do not know,About average,About average,Right-handed,No,No,No,Yes,,72,...,0,0,0,"[None of the above (group 2), Do not know (gro...",[],[None of the above (group 2)],[],0,No,initial_ukbb_survey


## Data coding
Data codings map the stored values of categorical features to the actual answers in English and Hebrew.

In [5]:
df_codes = pl.data_codings
df_codes.head()

Unnamed: 0,code_number,coding,english,hebrew,answer_field_name,sexed,ukbb_compatible,ukbb_similar_coding,scripting_instruction,description,notes
0,001_03,0,asia/jerusalem,זמן ישראל,,,False,,,Timezone,
1,062_02,1,BRCA-1,BRCA-1,brca_type_1,Both sexes,,,,,
2,062_02,2,BRCA-2,BRCA-2,brca_type_2,Both sexes,,,,,
3,062_02,-1,Do not know,לא יודע/ת,brca_type_3,Both sexes,,,NMUL,,
4,062_03,1,"A relative of mine got breast, ovarian, or pro...","קרוב/ת משפחה חלה/תה בסרטן השד, השחלה, או הערמו...",brca_indication_1,Both sexes,,,,,
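
As a rough illustration of how such a coding table can be read, the sketch below builds a lookup from a few rows copied from the preview above and decodes a list of stored values. The helper names `build_lookup` and `decode` and the `answers` list are hypothetical, and this is not how `pheno_utils` implements decoding; it only shows the idea of a `(code_number, coding)` key pointing to a label in each language.

```python
# Rows copied from the df_codes preview: (code_number, coding, english, hebrew)
coding_rows = [
    ("062_02", 1, "BRCA-1", "BRCA-1"),
    ("062_02", 2, "BRCA-2", "BRCA-2"),
    ("062_02", -1, "Do not know", "לא יודע/ת"),
]


def build_lookup(rows, language):
    """Map (code_number, coding) -> label in the requested language."""
    column = {"english": 2, "hebrew": 3}[language]
    return {(row[0], row[1]): row[column] for row in rows}


def decode(values, code_number, lookup):
    """Replace each stored number with its label; unknown codes are kept as-is."""
    return [lookup.get((code_number, value), value) for value in values]


english = build_lookup(coding_rows, "english")
hebrew = build_lookup(coding_rows, "hebrew")

answers = [1, -1, 2, 99]  # hypothetical stored answers; 99 has no label
decoded_english = decode(answers, "062_02", english)
decoded_hebrew = decode(answers, "062_02", hebrew)
print(decoded_english)
print(decoded_hebrew)
```

Keeping unknown codes unchanged, rather than dropping them, makes unexpected values easy to spot after decoding.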


Each column represents a single question, and answers are stored as numbers.
The question type (integer, or categorical with single or multiple choice) is listed in the `field_type` column of the data dictionary.

In [6]:
pl.dict['field_type'].value_counts()

field_type
Integer                   27
Categorical (single)      15
Categorical (multiple)    14
Datetime                   2
Date                       2
Name: count, dtype: int64

### Numeric questions

For numeric questions, the answers are the numbers that participants entered when filling out the survey.

In [7]:
pl.dict[pl.dict['field_type'] == 'Integer'].head()

tabular_field_name,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,pandas_dtype,stability,units,sampling_rate,strata,sexed,debut,completed,data_coding
hypertension_family_number,52,initial_medical,Hypertension family number,family_history/initial_medical.parquet,,,How many of your close family members have hig...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
hyperlipidemia_family_number,52,initial_medical,Hyperlipidemia family number,family_history/initial_medical.parquet,,,How many of your close family members have ele...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
type_1_diabetes_family_number,52,initial_medical,Type 1 diabetes family number,family_history/initial_medical.parquet,,,How many of your close family members have typ...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
type_2_diabetes_family_number,52,initial_medical,Type 2 diabetes family number,family_history/initial_medical.parquet,,,How many of your close family members have typ...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,
stroke_family_number,52,initial_medical,Stroke family number,family_history/initial_medical.parquet,,,How many of your close family members have had...,Integer,Single,int,Accruing,,,Primary,Both sexes,2/26/2020,,


In [8]:
# Keep only the baseline visit (research_stage "00_00_visit")
col = 'hypertension_family_number'

df = pl[[col, 'age', 'sex', 'collection_date']].loc[:,:,"00_00_visit",:,:]

df.head()



participant_id,cohort,array_index,hypertension_family_number,age,sex
5516424321,10k,0,0,54.447672,Female
5027574288,10k,0,0,54.169913,Male
7783260382,10k,0,0,45.892093,Male
1178277844,10k,0,0,44.4351,Female
1622660825,10k,0,0,44.621536,Male
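
Selecting one research stage from a multi-level index can also be written by level name. The sketch below uses a small toy frame, not the real data: the participant IDs, the values, and the `02_00_visit` stage label are made up for illustration, while the index level names follow the `ukbb` table shown earlier. It relies on the standard pandas `DataFrame.xs` method, which drops the selected level from the result by default.

```python
import pandas as pd

# Toy frame with the same index levels as the ukbb table (values are hypothetical)
index = pd.MultiIndex.from_tuples(
    [
        (1, "10k", "00_00_visit", 0),
        (1, "10k", "02_00_visit", 0),  # hypothetical follow-up stage label
        (2, "10k", "00_00_visit", 0),
    ],
    names=["participant_id", "cohort", "research_stage", "array_index"],
)
toy = pd.DataFrame({"hypertension_family_number": [0, 1, 2]}, index=index)

# Keep only baseline rows; the research_stage level is removed from the index
baseline = toy.xs("00_00_visit", level="research_stage")
print(baseline)
```

Naming the level avoids having to count positions in the index, which helps when a table has a different number of index levels.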


### Categorical single choice questions

For categorical questions, answers are coded according to the data coding. By default, PhenoLoader shows the values in English.

In [9]:

single_question = 'early_life_adopted'
pl[single_question][single_question].value_counts()

early_life_adopted
No             314
Yes              2
Do not know      1
Name: count, dtype: int64

To view the data codings, set `preferred_language` to `'coding'` when creating the `PhenoLoader`.

In [10]:
pl = PhenoLoader('family_history', preferred_language='coding', base_path='s3://pheno-synthetic-data/data')

In [11]:
print(single_question, 'mapping:', pl.dict.loc[single_question]['data_coding'])
pl[single_question][single_question].value_counts()

early_life_adopted mapping: 100349


early_life_adopted
No             314
Yes              2
Do not know      1
Name: count, dtype: int64

To change how the answers are represented, use the `transform_answers` function from `pheno_utils.questionnaires_handler`, or set the `preferred_language` parameter of `PhenoLoader` (the default is English).

In [12]:
from pheno_utils.questionnaires_handler import transform_answers

Transform to Hebrew

In [13]:
tranformed_hebrew = transform_answers(single_question, pl[single_question][single_question], transform_from='Coding', transform_to='Hebrew', dict_df=pl.dict, mapping_df=df_codes)
print(single_question, pl.dict.loc[single_question]['data_coding'])
tranformed_hebrew.value_counts()

early_life_adopted 100349


early_life_adopted
No             314
Yes              2
Do not know      1
Name: count, dtype: int64

Transform to English

In [14]:
tranformed_english = transform_answers(single_question, pl[single_question][single_question], transform_from='Coding', transform_to='English', dict_df=pl.dict, mapping_df=df_codes)
print(single_question, pl.dict.loc[single_question]['data_coding'])
tranformed_english.value_counts()

early_life_adopted 100349


early_life_adopted
No             314
Yes              2
Do not know      1
Name: count, dtype: int64

### Categorical multiple questions

Categorical multiple-choice questions are coded in the same way as categorical single-choice questions, but each row stores the selected answers as a list.

In [15]:
multiple = pl.dict[pl.dict['field_type'] == 'Categorical (multiple)'].index.values.tolist()
# Select a random multiple-choice question
multiple_question = random.choice(multiple)


# `data_coding` comes back as a Series when the field name appears more than once in the dictionary
coding_value = pl.dict.loc[multiple_question]['data_coding']
if isinstance(coding_value, pd.Series):
    code = coding_value.iloc[0]
else:
    code = coding_value
    
print(multiple_question, code)
pl[multiple_question][multiple_question].value_counts()

tranformed_english = transform_answers(multiple_question, pl[multiple_question][multiple_question], transform_from='coding', transform_to='english', 
                                     dict_df=pl.dict, mapping_df=df_codes)


family_history_past_father_illness_2 1010


In [16]:
from collections import Counter
# Flatten the list of lists into a single list
flattened = [item for sublist in tranformed_english.dropna() for item in sublist]

# Count the frequency of each answer
answer_counts = Counter(flattened)

# Display the counts
print(answer_counts)

Counter({'None of the above (group 2)': 295, 'Do not know (group 2)': 74, 'Severe depression': 59, 'Prostate cancer': 34, 'Bowel cancer': 25, 'Prefer not to answer (group 2)': 13, "Parkinson's disease": 11, 'Lung cancer': 1})
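
The counter above counts mentions, so a participant who selects several options contributes to several totals. A complementary view is the share of answering participants who selected each option. The sketch below is a minimal illustration on hypothetical responses written as plain Python lists; `selection_share` is a made-up helper, and treating `None` and empty lists as "did not answer" is an assumption that may not suit every question.

```python
from collections import Counter

# Hypothetical responses: one list of selected answers per participant,
# None for a participant with no recorded answer.
responses = [
    ["None of the above (group 2)"],
    ["Severe depression", "Prostate cancer"],
    ["Severe depression"],
    None,
    [],
]


def selection_share(rows):
    """Fraction of answering participants who selected each option."""
    # Assumption: None and empty lists mean the question was not answered
    answered = [set(row) for row in rows if row is not None and len(row) > 0]
    if not answered:
        return {}
    counts = Counter(option for selected in answered for option in selected)
    return {option: n / len(answered) for option, n in counts.items()}


shares = selection_share(responses)
for option, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{option}: {share:.2f}")
```

Because options are not mutually exclusive, these shares do not need to sum to one.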
