<img src="health.jpg">

###### Image credit https://lamvascular.com/healthy-lifestyle-dallas/

# Project 2019
## Programming for Data Analysis

For this project students must create a dataset by simulating a real-world phenomenon of their choosing. They may pick any phenomenon they wish. Then, rather than collecting data related to the phenomenon, they should model and synthesise such data in Python, using the numpy.random package for this purpose.

Students should:
- Choose a real-world phenomenon that can be measured and for which they could collect at least one hundred data points across at least four variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail their research and implement the simulation in a Jupyter notebook - the dataset itself can be displayed in an output cell within the notebook.

## Lifestyle risk indicators
For this project I decided to base the simulated data on the real-world phenomenon of factors that contribute to increased lifestyle risk. Lifestyle risk factors are defined by Ireland's Health Service Executive (HSE) as *"largely preventable...including tobacco and alcohol usage, together with physical inactivity, poor diet and obesity"*. Lifestyle risk factors are behaviours; the degree to which an individual indulges in them can indicate their propensity to develop chronic diseases. They are not, however, definitive: many factors can influence an individual's likelihood of developing chronic diseases, but it is well recognised that certain behaviours have a strong influence.

While researching this project I concluded that the main measurable factors that influence lifestyle risk are:
- Diet 
- Smoking
- Alcohol Consumption
- Exercise

In the following section I will discuss each of these factors and how they combine to increase the risk of an individual developing a chronic disease. For this project I imagined that an employer asked me to produce a Python script that would calculate lifestyle risk, and that ultimately the script would be used on data collected during a voluntary street survey. However, some testing of the script's accuracy was required before the survey was conducted, so fake data was to be simulated as part of the testing process.

### Diet
Body Mass Index (BMI) is a well-established measure of whether an individual is overweight or underweight. As such I have used it as a proxy for whether or not a person has a healthy diet.

Using the metric system, Body Mass Index is calculated by dividing an individual's weight in kilograms by the square of their height in metres.
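As a quick check of the formula, the short sketch below applies it to the average Irish heights and weights quoted in this section. The function name `bmi` is my own and is not part of the notebook's code.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Average man (80.7 kg, 1.77 m) and average woman (69 kg, 1.63 m)
print(round(bmi(80.7, 1.77), 1))  # 25.8
print(round(bmi(69.0, 1.63), 1))  # 26.0
```

Both averages land just above 25, which is consistent with a large share of the population being classed as overweight.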

I researched the mean and variance for the heights of men and women in Ireland and their average weights, discovering that:
- The average height of a man in Ireland is 177cm with a standard deviation of 7.42cm
- The average height of a woman in Ireland is 163cm with a standard deviation of 7.11cm
- The average weight of a man in Ireland is 80.7 kilograms and 72% of men are overweight
- The average weight of a woman in Ireland is 69 kilograms and 52% of women are overweight

For the purpose of simulating data I concluded that heights and weights could be generated using a normal distribution. I also discovered that 48% of the population is male and 52% female, and decided to factor this into the simulated data because the prevalence of smoking varies by gender. HSE guidance for low-risk alcohol consumption also varies by gender.

Body Mass Categories are defined as follows:
- Underweight: <18.5
- Normal: 18.5 - 25
- Overweight: 25 - 30
- Obese: >30

### Smoking
The prevalence of smoking in Ireland is very well documented. The HSE's Smoking Prevalence Tracker (half-year 2018), a survey carried out by Ipsos MRBI, indicates that 21.1% of men and 16.6% of women smoke one or more cigarettes each week.

### Alcohol Consumption
Alcohol Action Ireland, an advocacy group for the reduction of alcohol harm, advises that there is no "safe" amount of alcohol: drinking any amount increases the risk of damage to health, and the risk generally increases in line with how much is drunk. Binge drinking is associated with additional health risks.

Guidelines set a weekly limit of 17 standard drinks of alcohol for men and 11 standard drinks for women. CSO data on per capita alcohol consumption in Ireland estimates it at 11.46 litres of pure alcohol per year. One standard drink contains 13ml of pure alcohol, so on average the per capita alcohol consumption in Ireland is 16.95 standard drinks per week (11,460ml ÷ 52 weeks ÷ 13ml).
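The conversion from litres per year to standard drinks per week can be checked with a few lines of arithmetic. This is only a worked example of the figures quoted above; the constant names are my own.

```python
LITRES_PER_YEAR = 11.46       # CSO per capita estimate quoted above
ML_PER_STANDARD_DRINK = 13    # pure alcohol per standard drink, as quoted above
WEEKS_PER_YEAR = 52

ml_per_week = LITRES_PER_YEAR * 1000 / WEEKS_PER_YEAR
drinks_per_week = ml_per_week / ML_PER_STANDARD_DRINK

print(round(ml_per_week, 1))      # 220.4
print(round(drinks_per_week, 2))  # 16.95
```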

There are many other factors involved in alcohol consumption and risk. Over 19% of people in Ireland do not drink alcohol; 83% of men and 79% of women consume alcohol. Risk is also influenced by drinking behaviours such as binge drinking, defined by the World Health Organisation as consuming six or more standard drinks in one session.

For the purpose of simulating alcohol consumption I assumed a uniform distribution of alcohol consumption between zero and twenty units per week.

### Exercise
Sedentary behaviours are a significant lifestyle risk factor. The HSE's National Guidelines on Physical Activity for Ireland recommends adults participate in a minimum of 150 minutes of exercise each week, 30 minutes per day over five days.

## Purpose and development of the code
The primary purpose of this code is to assign a 'Lifestyle Risk' category based on Body Mass Index, smoker or non-smoker status, weekly alcohol consumption and exercise. In my imaginary scenario my employer will conduct a survey that measures height and weight, then asks respondents whether they smoke, how much alcohol they consume each week and how much strenuous exercise they take each week. In advance of conducting the survey I simulate a dataset to aid the development and testing of the final algorithm.

Respondents will also indicate their gender.

I initially created a pandas DataFrame with column headings for each of the variables being collected and calculated.

#### Simulating gender
Gender is simulated by using numpy.random.choice() to choose between 'Male' and 'Female' with a probability of 0.48 for male and 0.52 for female, based on CSO population statistics.
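A minimal sketch of this step is shown below. It uses NumPy's `default_rng` generator rather than the `np.random.seed` call in the notebook's code cell, and a sample of 10,000 rather than 100, both of which are my own choices to make the proportions easy to check.

```python
import numpy as np

rng = np.random.default_rng(100)  # seeded so the sketch is repeatable

# 48% male, 52% female, as in the CSO figures quoted above
genders = rng.choice(['Male', 'Female'], size=10_000, p=[0.48, 0.52])

# With a large sample the observed share should sit close to the target
share_male = np.mean(genders == 'Male')
print(f"Male share: {share_male:.3f}")
```

With only 100 rows, as in the notebook, the observed split can drift several percentage points from 48/52 purely by chance.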

#### Simulating BMI
As previously mentioned, BMI is simulated using heights and weights that approximate the population of Ireland. I populated a column in the DataFrame with the result of this simple arithmetic (weight divided by height squared). I then used pandas.cut() to convert the continuous BMI variable into the categorical values 'Underweight', 'Normal', 'Overweight' and 'Obese'.
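The binning step can be illustrated on a handful of hand-picked rows. The heights and weights below are hypothetical values chosen to hit each category; the bin edges are the ones passed to `pd.cut` in the notebook's code cell.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: heights in metres, weights in kilograms
demo = pd.DataFrame({
    'Height': [1.60, 1.70, 1.77, 1.63, 1.80],
    'Weight': [45.0, 65.0, 80.7, 69.0, 100.0],
})
demo['BMI'] = demo['Weight'] / demo['Height'] ** 2

# Intervals are closed on the right by default, so a BMI of exactly 18.5
# falls in 'Underweight' and exactly 24.9 falls in 'Normal'
demo['BMI Range'] = pd.cut(
    demo['BMI'],
    bins=[-np.inf, 18.5, 24.9, 29.9, np.inf],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
)

print(demo['BMI Range'].tolist())
# ['Underweight', 'Normal', 'Overweight', 'Overweight', 'Obese']
```

Because BMI is continuous, edges of 24.9 and 29.9 place a value such as 24.95 in 'Overweight'; edges of 25 and 30 would match the category definitions given earlier more closely.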

#### Simulating smoking
Smoker or non-smoker is simulated by defining a function that calls numpy.random.choice() to assign 'Yes' or 'No' with gender-specific probabilities: males have a probability of 0.211 of being a smoker and females have a probability of 0.166.
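One way to sketch the gender-dependent draw is to keep the rates in a dictionary and look up the right one per respondent. The names `SMOKING_RATE` and `simulate_smoker` are hypothetical, and the check over 5,000 simulated men is only there to show the rate is reproduced.

```python
import numpy as np

rng = np.random.default_rng(100)

# HSE prevalence figures quoted above
SMOKING_RATE = {'Male': 0.211, 'Female': 0.166}

def simulate_smoker(gender, rng):
    """Return 'Yes' or 'No' using the smoking rate for the given gender."""
    p_yes = SMOKING_RATE[gender]
    return str(rng.choice(['Yes', 'No'], p=[p_yes, 1 - p_yes]))

men = [simulate_smoker('Male', rng) for _ in range(5_000)]
male_rate = sum(answer == 'Yes' for answer in men) / len(men)
print(f"Simulated male smoking rate: {male_rate:.3f}")
```

Keeping the rates in one place means a new prevalence survey only requires changing the dictionary, not the function.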

#### Simulating alcohol consumption
To simulate alcohol consumption I defined a function that assumes a uniform distribution of alcohol consumption between zero and twenty units per week. This is inaccurate as it does not sufficiently describe actual alcohol consumption in Ireland. However, I considered it sufficient for the purpose of the simulation. With this data I created a second function that again converted the continuous variable into categorical values. Moderate drinking was assigned if fewer than the recommended standard units of alcohol were consumed each week. Heavy drinking was assigned otherwise.
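The sketch below shows what the uniform assumption implies once the gender-specific limits are applied. It is a vectorised variant written for illustration, not the notebook's row-by-row functions; `WEEKLY_LIMIT` and `demo` are names I have introduced.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)

WEEKLY_LIMIT = {'Male': 17, 'Female': 11}  # guideline limits quoted earlier

demo = pd.DataFrame({
    'Genders': rng.choice(['Male', 'Female'], size=1_000, p=[0.48, 0.52]),
})
demo['Alcohol units/ wk'] = rng.uniform(0, 20, size=len(demo))

# Compare each respondent's units with the limit for their gender
limit = demo['Genders'].map(WEEKLY_LIMIT)
demo['Alcohol Consumption'] = np.where(
    demo['Alcohol units/ wk'] < limit, 'Moderate', 'Heavy'
)

heavy_share = (demo['Alcohol Consumption'] == 'Heavy').groupby(demo['Genders']).mean()
print(heavy_share)
```

Under a uniform draw on 0 to 20, the expected share of heavy drinkers is (20 - 17) / 20 = 15% for men and (20 - 11) / 20 = 45% for women, which is one concrete way the simplification departs from real consumption patterns.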

#### Simulating exercise
I first created a list of possible minutes of exercise per week. I again used numpy.random.choice(), passing the probability that a respondent undertakes each amount of exercise per week; the probabilities approximate data provided in HSE surveys. Less than 150 minutes of exercise per week was used to assign a higher risk category.
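To make the effect of the chosen probabilities visible, the sketch below draws from the same list of minutes and weights used in the notebook's code cell and compares the expected and observed share of respondents below the 150-minute guideline. The larger sample size and the `default_rng` generator are my own choices.

```python
import numpy as np

rng = np.random.default_rng(100)

minutes = [25, 50, 100, 125, 150, 175, 200]
probs = [0.2, 0.2, 0.2, 0.2, 0.1, 0.05, 0.05]  # must sum to 1

exercise = rng.choice(minutes, size=10_000, p=probs)

# Probability mass on values below the guideline: 4 * 0.2 = 0.8
expected_below = sum(p for m, p in zip(minutes, probs) if m < 150)
observed_below = np.mean(exercise < 150)

print(f"Expected share below guideline: {expected_below:.2f}")
print(f"Observed share below guideline: {observed_below:.3f}")
```

With these weights about four in five simulated respondents fall short of the guideline, so the exercise condition is met by most rows in the dataset.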

#### Determining lifestyle risk
I defined a function that checks the combination of continuous and categorical variables in each row and returns a probable lifestyle risk as follows:
- Normal
- Medium
- High
- Very High
- Extremely High
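A long chain of `if` statements can also be expressed as a lookup table. The sketch below is one possible restructuring, not the notebook's implementation: `RISK_TABLE` and `lifestyle_risk_lookup` are hypothetical names, and the table values mirror the branches of the notebook's `lifestyle_risk()` function that can actually be reached (the conditions in that function come in identical pairs, so only the first of each pair ever runs).

```python
# Risk for non-smokers taking less than 150 minutes of exercise per week,
# keyed by (BMI range, alcohol consumption)
RISK_TABLE = {
    ('Underweight', 'Moderate'): 'Medium',
    ('Underweight', 'Heavy'): 'High',
    ('Normal', 'Moderate'): 'Normal',
    ('Normal', 'Heavy'): 'High',
    ('Overweight', 'Moderate'): 'Medium',
    ('Overweight', 'Heavy'): 'High',
    ('Obese', 'Moderate'): 'Very High',
    ('Obese', 'Heavy'): 'Very High',
}

def lifestyle_risk_lookup(bmi_range, alcohol, smoker, exercise_minutes):
    """Return a risk label for one respondent."""
    if smoker == 'Yes':
        return 'Extremely High'   # smoking overrides every other factor
    if exercise_minutes >= 150:
        return 'Normal'           # non-smokers meeting the exercise guideline
    return RISK_TABLE.get((bmi_range, alcohol), 'Normal')

print(lifestyle_risk_lookup('Overweight', 'Moderate', 'No', 125))  # Medium
print(lifestyle_risk_lookup('Overweight', 'Heavy', 'Yes', 50))     # Extremely High
print(lifestyle_risk_lookup('Overweight', 'Moderate', 'No', 150))  # Normal
```

A table like this makes it easy to see every combination at a glance and to add a second table for respondents who do meet the exercise guideline.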

#### Running the functions in the DataFrame
I used the pandas apply() method to execute each of the functions in the DataFrame.
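There are two flavours of `apply` in play: `Series.apply` passes one value at a time from a single column, while `DataFrame.apply` with `axis=1` passes a whole row. The toy example below illustrates the difference; the data and the helper `gender_initial` are made up, and `drinker_type` is a simplified stand-in for the notebook's function.

```python
import pandas as pd

demo = pd.DataFrame({
    'Genders': ['Male', 'Female', 'Female'],
    'Alcohol units/ wk': [18.0, 10.0, 12.5],
})

def gender_initial(gender):
    """Element-wise helper: receives a single value from one column."""
    return gender[0]

def drinker_type(row):
    """Row-wise helper: receives a whole row and can read several columns."""
    limit = 17 if row['Genders'] == 'Male' else 11
    return 'Moderate' if row['Alcohol units/ wk'] < limit else 'Heavy'

demo['Initial'] = demo['Genders'].apply(gender_initial)         # Series.apply
demo['Alcohol Consumption'] = demo.apply(drinker_type, axis=1)  # DataFrame.apply, row by row

print(demo['Alcohol Consumption'].tolist())  # ['Heavy', 'Moderate', 'Heavy']
```

Row-wise `apply` is convenient but calls the Python function once per row, so for large surveys a vectorised expression would usually be faster.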



### Further development of the project
The accuracy of the simulation in my project could be substantially improved by including other factors such as age, since alcohol consumption, prevalence of smoking, BMI and exercise all vary by age. Ages in Ireland also follow a trapezoidal distribution, which would have been a good inclusion. Drinking behaviours were not included in the simulation; they could have been added by defining an additional function and an associated column in the DataFrame.

In [9]:
# Import pandas to create the DataFrame and numpy to generate random data
import pandas as pd
import numpy as np

In [17]:
# Seed the random generator to freeze the output and hence make debugging easier
np.random.seed(100)

#Create the Male/ Female object
genders = ['Male', 'Female']

#Create the smoker/ non-smoker object
smokers = ['Yes', 'No']

#Create the dataframe with column headings
df = pd.DataFrame(columns=['Genders', 'Height', 'Weight', 'Alcohol units/ wk', 'Exercise', 'BMI', 'BMI Range',
                           'Alcohol Consumption','Smoker', 'Life Style Risk'])

#Populate the Genders column with men and women
df['Genders'] = np.random.choice(genders, 100, p=[0.48, 0.52])

#Create list of weekly minutes of exercise taken
ave_minutes_exercise_week = [25, 50, 100, 125, 150, 175, 200]

#Populate the dataframe with minutes of exercise according to certain probabilities
df['Exercise'] = np.random.choice(ave_minutes_exercise_week, 100, p=[0.2, 0.2, 0.2, 0.2, 0.1, 0.05, 0.05])

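#Define a function to create Male and Female heights (in metres) according to a normal distribution
#Note: the scales 0.0272 and 0.0267 equal sqrt(7.42) cm and sqrt(7.11) cm, i.e. the quoted figures
#treated as variances; read as standard deviations they would be 0.0742 and 0.0711
def what_height(gender):
    if gender == 'Male':
        return np.random.normal(1.77, 0.0272)
    elif gender == 'Female':
        return np.random.normal(1.63, 0.0267)

#Define a function to create Male and Female weights (in kilograms) according to a normal distribution
#Note: a scale of 0.025 kg keeps almost every simulated weight within about 0.1 kg of the mean
def what_weight(gender):
    if gender == 'Male':
        return np.random.normal(80.7, 0.025)
    elif gender == 'Female':
        return np.random.normal(69, 0.025)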

#Define a function that assigns smoker or non-smoker
def what_smoker(smoke):
    if smoke == 'Male':
        return np.random.choice(smokers, p=[0.211, 0.789])
    if smoke == 'Female':
        return np.random.choice(smokers, p=[0.166, 0.834])

#Define a function that assigns weekly units of standard drinks
def what_alcohol(alcohol):
    if alcohol == 'Male':
        return np.random.uniform(0, 20)
    if alcohol == 'Female':
        return np.random.uniform(0, 20)  

#Define a function that converts alcohol consumption into a categorical value of moderate or heavy drinker
def drinker_type(row):
    if row['Genders'] == 'Male' and row['Alcohol units/ wk'] < 17:
        return 'Moderate'
    if row['Genders'] == 'Female' and row['Alcohol units/ wk'] < 11:
        return 'Moderate'
    else:
        return 'Heavy'    

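#Define a function that assigns lifestyle risk based on the output of the other functions
#Note: the conditions below come in identical pairs, so the second branch of each pair is never reached;
#one branch of each pair was presumably meant to test row['Exercise'] >= 150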
def lifestyle_risk(row):
    if row['BMI Range'] == 'Underweight' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Medium'
    if row['BMI Range'] == 'Underweight' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'High'
    if row['BMI Range'] == 'Underweight' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'High'
    if row['BMI Range'] == 'Underweight' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High'  
    if row['BMI Range'] == 'Normal' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Normal'
    if row['BMI Range'] == 'Normal' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'High'
    if row['BMI Range'] == 'Normal' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'High'
    if row['BMI Range'] == 'Normal' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High' 
    if row['BMI Range'] == 'Overweight' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Medium'
    if row['BMI Range'] == 'Overweight' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High'
    if row['BMI Range'] == 'Overweight' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'High'
    if row['BMI Range'] == 'Overweight' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High' 
    if row['BMI Range'] == 'Obese' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High'
    if row['BMI Range'] == 'Obese' and row['Alcohol Consumption'] == 'Moderate' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High'
    if row['BMI Range'] == 'Obese' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Very High'
    if row['BMI Range'] == 'Obese' and row['Alcohol Consumption'] == 'Heavy' and row['Smoker'] == 'No' and row['Exercise'] <150:
        return 'Extremely High' 
    if row['Smoker'] == 'Yes':
        return 'Extremely High'
    else:
        return 'Normal'

#Run the functions in the DataFrame
df['Height'] = df['Genders'].apply(what_height)
df['Weight'] = df['Genders'].apply(what_weight)
df['BMI'] = df['Weight']/df['Height']**2
df['Smoker'] = df['Genders'].apply(what_smoker)
df['Alcohol units/ wk'] = df['Genders'].apply(what_alcohol)
df['BMI Range'] = pd.cut(df['BMI'], [-np.inf,18.5,24.9, 29.9, np.inf], 
                         labels=["Underweight", "Normal", "Overweight", "Obese"]
                        ).astype(str)
df['Alcohol Consumption'] = df.apply(drinker_type, axis=1)
df['Life Style Risk'] = df.apply(lifestyle_risk, axis=1)

In [18]:
#Display the DataFrame, rounding height, weight and BMI to two decimal places and alcohol units to one
df.round({'Height': 2, 'Weight': 2, 'BMI': 2, 'Alcohol units/ wk': 1})

| | Genders | Height | Weight | Alcohol units/ wk | Exercise | BMI | BMI Range | Alcohol Consumption | Smoker | Life Style Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 1.68 | 69.06 | 6.2 | 125 | 24.60 | Normal | Moderate | No | Normal |
| 1 | Male | 1.75 | 80.68 | 11.6 | 125 | 26.33 | Overweight | Moderate | No | Medium |
| 2 | Male | 1.80 | 80.70 | 10.4 | 125 | 24.92 | Overweight | Moderate | No | Medium |
| 3 | Female | 1.62 | 69.06 | 17.6 | 50 | 26.19 | Overweight | Heavy | Yes | Extremely High |
| 4 | Male | 1.77 | 80.71 | 5.4 | 125 | 25.77 | Overweight | Moderate | Yes | Extremely High |
| 5 | Male | 1.76 | 80.65 | 6.0 | 150 | 26.18 | Overweight | Moderate | No | Normal |
| 6 | Female | 1.65 | 69.03 | 6.2 | 125 | 25.35 | Overweight | Moderate | No | Medium |
| 7 | Female | 1.59 | 68.99 | 2.0 | 200 | 27.39 | Overweight | Moderate | No | Normal |
| 8 | Male | 1.82 | 80.67 | 5.9 | 200 | 24.28 | Normal | Moderate | Yes | Extremely High |
| 9 | Female | 1.59 | 69.02 | 14.6 | 25 | 27.25 | Overweight | Heavy | Yes | Extremely High |


## End

Please refer to README.md for a complete list of references used during the research of this project. 