# Lifestyle and environment

### Description

Lifestyle and environmental factors encompass a wide range of influences on an individual's health and well-being. These factors may include behaviors such as diet, physical activity, sleep patterns, smoking, and alcohol consumption, as well as environmental exposures like air quality, pollution, and climate. Together, they play a crucial role in determining overall health outcomes, impacting both the prevention and progression of diseases.

### Introduction

The Human Phenotype Project conducts comprehensive data collection through online surveys, where participants voluntarily provide information on various aspects influencing their health. This includes lifestyle and environment data, captured through two lifestyle surveys.

### Measurement protocol 
<!-- long measurement protocol for the data browser -->
These lifestyle surveys are modeled after the UK Biobank's touch screen questionnaire. Participants receive the full version via email to complete on the Zoho platform, either before or after their baseline visit. A shorter, follow-up version of the questionnaire is then filled out by participants during subsequent visits. 

### Data availability 
<!-- for the example notebooks -->
The information is stored in one parquet file: `lifestyle_and_environment.parquet`.

### Relevant links

* [Pheno Knowledgebase](https://knowledgebase.pheno.ai/datasets/055-lifestyle_and_environment.html)
* [Pheno Data Browser](https://pheno-demo-app.vercel.app/folder/55)


In [1]:
%load_ext autoreload
%autoreload 2
from pheno_utils import PhenoLoader
import pandas as pd
import random

In [2]:
pl = PhenoLoader('lifestyle_and_environment', base_path='s3://pheno-synthetic-data/data')
pl

PhenoLoader for lifestyle_and_environment with
161 fields
2 tables: ['lifestyle_and_environment', 'age_sex']

## Data dictionary

In [3]:
pl.dict.head()

Unnamed: 0_level_0,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,pandas_dtype,stability,units,sampling_rate,strata,sexed,debut,completed,data_coding
tabular_field_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
collection_timestamp,55,lifestyle_and_environment,Collection time,lifestyle_and_environment/lifestyle_and_enviro...,,,Collection time,Datetime,Single,"datetime64[ns, Asia/Jerusalem]",Accruing,,,Collection time,Both sexes,1/9/2019,,
collection_date,55,lifestyle_and_environment,Collection date,lifestyle_and_environment/lifestyle_and_enviro...,,,Collection date,Date,Single,datetime64[ns],Accruing,,,Collection time,Both sexes,1/9/2019,,
timezone,55,lifestyle_and_environment,Timezone,lifestyle_and_environment/lifestyle_and_enviro...,,,Timezone,Categorical (single),Single,category,Accruing,,,Collection time,Both sexes,1/9/2019,,
data_source,55,lifestyle_and_environment,Data source,lifestyle_and_environment/lifestyle_and_enviro...,,,Data source of survey information,Text,Single,string,Accruing,,,Auxiliary,Both sexes,1/9/2019,,
activity_walking_10min_days_weekly,55,lifestyle_and_environment,Number of days/week walked 10+ minutes,lifestyle_and_environment/lifestyle_and_enviro...,,,"In a typical WEEK, on how many days did you wa...",Integer,Single,int,Accruing,Days/Week,,Primary,Both sexes,1/9/2019,,


## Dataset

In [4]:
pl.dfs['lifestyle_and_environment'].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,collection_timestamp,collection_date,timezone,data_source,activity_walking_10min_days_weekly,activity_walking_minutes_daily,activity_moderate_days_weekly,activity_moderate_minutes_daily,activity_vigorous_days_weekly,activity_vigorous_minutes_daily,...,extra_diet_press_oil_type,extra_diet_main_dairy_source,extra_diet_cheese_fat_percentage,extra_diet_moldy_cheese_frequency,extra_diet_cereal_sugared,skin_color,skin_ease_of_burn,skin_past_sunburns,hair_color,activiy_indoors_pc_daily
participant_id,cohort,research_stage,array_index,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1
5516424321,10k,00_00_visit,0,2022-08-02 09:32:01.373358080,2022-10-27 17:21:28.801640960,asia/jerusalem,follow_up_ukbb_survey,5,22,2,25,0,32,...,[],,,,,,,0,,0
5027574288,10k,00_00_visit,0,2023-01-23 04:30:06.857399296,2023-05-09 07:50:09.348581376,asia/jerusalem,follow_up_ukbb_survey,7,10,1,21,0,6,...,[Refined oil],,,,,,,0,,0
7783260382,10k,00_00_visit,0,2021-06-12 20:58:43.712332800,2020-04-18 19:37:33.439312640,asia/jerusalem,initial_ukbb_survey,5,0,0,21,0,0,...,"[Cold pressed olive oil, Refined oil, Oil - un...",Beef,9% fat or less cheese,Once every few months,Do not eat,Fair,Get moderately tanned,2,Light brown,0
1178277844,10k,00_00_visit,0,2022-12-14 07:48:11.993735936,2023-05-17 00:00:00.000000000,asia/jerusalem,initial_ukbb_survey,5,0,0,0,0,0,...,"[Cold pressed olive oil, Refined oil, Oil - un...",Beef,Fat cheese,Once every few months,Do not eat,Fair,"Never tan, only burn",0,Dark brown,0
1622660825,10k,00_00_visit,0,2021-06-17 19:35:15.170488064,2021-12-27 13:58:43.397170688,asia/jerusalem,follow_up_ukbb_survey,1,6,0,78,0,3,...,[Refined oil],,,,,,,0,,0


## Data coding
For categorical features, a data coding is the mapping between the values stored in the database and the answers they represent.

In [5]:
df_codes = pl.data_codings
df_codes.head(5)

Unnamed: 0,code_number,coding,english,hebrew,answer_field_name,sexed,ukbb_compatible,ukbb_similar_coding,scripting_instruction,description,notes
0,001_03,0,asia/jerusalem,זמן ישראל,,,False,,,Timezone,
1,062_02,1,BRCA-1,BRCA-1,brca_type_1,Both sexes,,,,,
2,062_02,2,BRCA-2,BRCA-2,brca_type_2,Both sexes,,,,,
3,062_02,-1,Do not know,לא יודע/ת,brca_type_3,Both sexes,,,NMUL,,
4,062_03,1,"A relative of mine got breast, ovarian, or pro...","קרוב/ת משפחה חלה/תה בסרטן השד, השחלה, או הערמו...",brca_indication_1,Both sexes,,,,,
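
To make the mapping concrete, here is a minimal sketch of how a coding table can be used to turn stored codes into readable labels. The column names (`code_number`, `coding`, `english`) mirror the table above, but the rows, the `toy_01` coding and the `toy_question` name are made up for illustration, and storing the codes as strings is an assumption.

```python
import pandas as pd

# Toy coding table; the rows are illustrative, not taken from the dataset.
codes = pd.DataFrame(
    {
        "code_number": ["toy_01", "toy_01", "toy_01"],
        "coding": ["1", "2", "-1"],
        "english": ["Yes", "No", "Do not know"],
    }
)

# Stored answers for one hypothetical question that uses coding "toy_01".
answers = pd.Series(["1", "2", "2", "-1"], name="toy_question")

# Build a code -> label lookup for this coding and apply it to the answers.
lookup = codes.loc[codes["code_number"] == "toy_01"].set_index("coding")["english"]
labels = answers.map(lookup)
print(labels.tolist())
```

Codes missing from the lookup come back as `NaN`, which is a quick way to spot answers that do not match the coding.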


Each column represents a single question. Answers are saved as numbers.
The type of the question (integer, or category with single/multiple choice) can be found in the data dictionary.

In [6]:
pl.dict['field_type'].value_counts()

field_type
Categorical (single)      100
Integer                    46
Categorical (multiple)     10
Datetime                    1
Date                        1
Text                        1
Name: count, dtype: int64

### Numeric questions

For numeric questions, the answers are the numbers participants entered when filling in the survey.

In [7]:
pl.dict[pl.dict['field_type'] == 'Integer'].head()

Unnamed: 0_level_0,folder_id,feature_set,field_string,relative_location,bulk_file_extension,bulk_dictionary,description_string,field_type,array,pandas_dtype,stability,units,sampling_rate,strata,sexed,debut,completed,data_coding
tabular_field_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
activity_walking_10min_days_weekly,55,lifestyle_and_environment,Number of days/week walked 10+ minutes,lifestyle_and_environment/lifestyle_and_enviro...,,,"In a typical WEEK, on how many days did you wa...",Integer,Single,int,Accruing,Days/Week,,Primary,Both sexes,1/9/2019,,
activity_walking_minutes_daily,55,lifestyle_and_environment,Duration of walks,lifestyle_and_environment/lifestyle_and_enviro...,,,How many minutes did you usually spend walking...,Integer,Single,int,Accruing,Minutes/Day,,Primary,Both sexes,1/9/2019,,
activity_moderate_days_weekly,55,lifestyle_and_environment,Number of days/week of moderate physical activ...,lifestyle_and_environment/lifestyle_and_enviro...,,,"In a typical WEEK, on how many days did you do...",Integer,Single,int,Accruing,Days/Week,,Primary,Both sexes,1/9/2019,,
activity_moderate_minutes_daily,55,lifestyle_and_environment,Duration of moderate activity,lifestyle_and_environment/lifestyle_and_enviro...,,,How many minutes did you usually spend doing m...,Integer,Single,int,Accruing,Minutes/Day,,Primary,Both sexes,1/9/2019,,
activity_vigorous_days_weekly,55,lifestyle_and_environment,Number of days/week of vigorous physical activ...,lifestyle_and_environment/lifestyle_and_enviro...,,,"In a typical WEEK, how many days did you do 10...",Integer,Single,int,Accruing,Days/Week,,Primary,Both sexes,1/9/2019,,


In [8]:
# keep only the baseline visit
col = 'activity_walking_minutes_daily'

df = pl[[col, 'age', 'sex', 'collection_date']].loc[:,:,"00_00_visit",:,:]

df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,activity_walking_minutes_daily,age,sex,collection_date
participant_id,cohort,array_index,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
5516424321,10k,0,22,54.447672,Female,2022-10-27 17:21:28.801640960
5027574288,10k,0,10,54.169913,Male,2023-05-09 07:50:09.348581376
7783260382,10k,0,0,45.892093,Male,2020-04-18 19:37:33.439312640
1178277844,10k,0,0,44.4351,Female,2023-05-17 00:00:00.000000000
1622660825,10k,0,6,44.621536,Male,2021-12-27 13:58:43.397170688


In [None]:
from pheno_utils.age_reference_plots import GenderAgeRefPlot

gender_refplots = GenderAgeRefPlot(df.dropna(subset=[col,"sex", "age"]), col, age_col="age")
gender_refplots.plot()
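
Because numeric answers are plain numbers, they can be combined arithmetically. The sketch below derives a weekly walking total from two of the columns shown earlier, using the first five rows of the synthetic data. The derived column name `walking_minutes_weekly` is hypothetical, and treating the daily minutes as applying to each walking day is an assumption about how the questions were meant to be read.

```python
import pandas as pd

# Values copied from the first five rows of the synthetic dataset shown above.
toy = pd.DataFrame(
    {
        "activity_walking_10min_days_weekly": [5, 7, 5, 5, 1],
        "activity_walking_minutes_daily": [22, 10, 0, 0, 6],
    }
)

# Assumed derivation: days walked per week times minutes walked per day.
toy["walking_minutes_weekly"] = (
    toy["activity_walking_10min_days_weekly"] * toy["activity_walking_minutes_daily"]
)
print(toy["walking_minutes_weekly"].tolist())
```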

### Categorical single choice questions

For categorical questions, answers are coded according to the data coding. By default, PhenoLoader shows the values in English.

In [10]:
single = pl.dict[pl.dict['field_type'] == 'Categorical (single)'].index.values.tolist()
# Select a random single-choice categorical question
single_question = random.choice(single)
pl[single_question][single_question].value_counts()

diet_cereal_intake_choice
Less than one    28
Do not know       2
Name: count, dtype: int64

To view the coded values instead, set `preferred_language='coding'` when creating the PhenoLoader.

In [11]:
pl = PhenoLoader('lifestyle_and_environment', preferred_language='coding', base_path='s3://pheno-synthetic-data/data')

In [12]:
single = pl.dict[pl.dict['field_type'] == 'Categorical (single)'].index.values.tolist()
# Select a random single-choice categorical question
single_question = random.choice(single)
pl[single_question][single_question].value_counts()

diet_vegeterian_indicator
No     295
Yes     45
Name: count, dtype: int64

To change the representation of the answers, we can use the function `transform_answers` from `pheno_utils.questionnaires_handler`.

In [13]:
from pheno_utils.questionnaires_handler import transform_answers

Transform to Hebrew

In [14]:
tranformed_hebrew = transform_answers(single_question, pl[single_question][single_question], transform_from='Coding', transform_to='Hebrew', dict_df=pl.dict, mapping_df=df_codes)
tranformed_hebrew.value_counts()

diet_vegeterian_indicator
No     295
Yes     45
Name: count, dtype: int64

Transform to English

In [15]:
tranformed_english = transform_answers(single_question, pl[single_question][single_question], transform_from='Coding', transform_to='English', dict_df=pl.dict, mapping_df=df_codes)
tranformed_english.value_counts()

diet_vegeterian_indicator
No     295
Yes     45
Name: count, dtype: int64

### Categorical multiple choice questions

Categorical multiple choice questions are coded in the same way as categorical single choice questions, but each row stores a list of answers.

In [16]:
multiple = pl.dict[pl.dict['field_type'] == 'Categorical (multiple)'].index.values.tolist()
# Select a random multiple-choice categorical question
multiple_question = random.choice(multiple)


# If the field name appears more than once in the dictionary, take the first data coding
if isinstance(pl.dict.loc[multiple_question]['data_coding'], pd.Series):
    code = pl.dict.loc[multiple_question]['data_coding'].iloc[0]
else:
    code = pl.dict.loc[multiple_question]['data_coding']
    
print(multiple_question, code)
pl[multiple_question][multiple_question].value_counts()

tranformed_english = transform_answers(multiple_question, pl[multiple_question][multiple_question], transform_from='coding', transform_to='english', 
                                     dict_df=pl.dict, mapping_df=df_codes)

extra_diet_press_oil_type 040_04


In [17]:

from collections import Counter
# Flatten the list of lists into a single list
flattened = [item for sublist in tranformed_english.dropna() for item in sublist]

# Count the frequency of each answer
answer_counts = Counter(flattened)

# Display the counts
print(answer_counts)

Counter({'Refined oil': 398, 'Oil - unknown type of pickling.': 231, 'Cold pressed olive oil': 225, 'Butter': 213, 'Other cold pressed oil': 42, 'Margarine': 29})
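
The same tally can also be produced with pandas alone. The sketch below uses `Series.explode` to put each list element on its own row before counting. The toy series and its name are illustrative only; the real column would be the transformed answers from the cell above.

```python
import pandas as pd

# Toy multiple-choice answers: one list per row, with a missing and an empty answer.
answers = pd.Series(
    [["Refined oil"], ["Cold pressed olive oil", "Refined oil"], None, []],
    name="toy_multiple_choice",
)

# explode() gives each list element its own row; empty lists become NaN,
# which value_counts() drops by default.
counts = answers.dropna().explode().value_counts()
print(counts)
```

Compared with `Counter`, this keeps the result as a pandas Series, which is convenient for plotting or joining with other summaries.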
