# Simulating a real-world phenomenon
***

### Table of contents

#### 1. Introduction

#### 2. Variables

#### 3. Distributions

#### 4. Relationships

#### 5. Conclusion

#### 6. References

<br>

### 1. Introduction
***
Synthetic data is information that is artificially manufactured rather than generated by real-world events. It is created algorithmically and is used as a stand-in for production or operational data in test data sets, to validate models and, increasingly, to train machine learning models. It can be numerical, binary or categorical, and its values should preferably be generated at random.

The data set will be created using the ```numpy.random``` package, which generates random numbers and draws random samples. The real-world phenomenon simulated below is the weight loss of an overweight male over the course of a calendar year, from 01/01/20 to 31/12/20. As I have a personal trainer, I will use some of the variables he uses to track my progress; every day I have to submit a check-in spreadsheet. Although my personal goal is not to lose weight, weight loss is the scenario used in this example.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

### 2. Variables
***
The following variables will be used in the data set:

1. Date
2. Day
3. Weight (kg)
4. Resting Heart Rate (bpm)
5. Fasting Blood Glucose (mmol/L)
6. Exercise
7. Calorie Intake
8. Calorie Output
9. Calorie Difference
10. Total Daily Energy Expenditure
11. Body Mass Index

#### Date

In [2]:
start_date = "2020-01-01"
end_date = "2020-12-31"

date = pd.date_range(start_date, end_date, freq= "D")

df = pd.DataFrame({"Date": date})

print(df)

          Date
0   2020-01-01
1   2020-01-02
2   2020-01-03
3   2020-01-04
4   2020-01-05
..         ...
361 2020-12-27
362 2020-12-28
363 2020-12-29
364 2020-12-30
365 2020-12-31

[366 rows x 1 columns]


#### Day

In [3]:
df["Day"] = df["Date"].dt.day_name()

df["Day"] = df["Day"].astype("category")

print(df)

          Date        Day
0   2020-01-01  Wednesday
1   2020-01-02   Thursday
2   2020-01-03     Friday
3   2020-01-04   Saturday
4   2020-01-05     Sunday
..         ...        ...
361 2020-12-27     Sunday
362 2020-12-28     Monday
363 2020-12-29    Tuesday
364 2020-12-30  Wednesday
365 2020-12-31   Thursday

[366 rows x 2 columns]


#### Activity

In [4]:
activities = ["Rest", "Cardio", "Legs", "Chest", "Delts", "Arms"]

mon = df["Day"] == "Monday"
tue = df["Day"] == "Tuesday"
wed = df["Day"] == "Wednesday"
thu = df["Day"] == "Thursday"
fri = df["Day"] == "Friday"
sat = df["Day"] == "Saturday"
sun = df["Day"] == "Sunday"

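# Each call below draws a single activity, so every occurrence of a given
# weekday is assigned the same activity for the whole year.
# Monday to Thursday: one of the four resistance sessions, equally likely.
pmon = np.random.choice(activities, p=[0, 0, 0.25, 0.25, 0.25, 0.25])
ptue = np.random.choice(activities, p=[0, 0, 0.25, 0.25, 0.25, 0.25])
pwed = np.random.choice(activities, p=[0, 0, 0.25, 0.25, 0.25, 0.25])
pthu = np.random.choice(activities, p=[0, 0, 0.25, 0.25, 0.25, 0.25])
# Friday to Sunday: rest or cardio, equally likely.
pfri = np.random.choice(activities, p=[0.5, 0.5, 0, 0, 0, 0])
psat = np.random.choice(activities, p=[0.5, 0.5, 0, 0, 0, 0])
psun = np.random.choice(activities, p=[0.5, 0.5, 0, 0, 0, 0])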

df["Activity"] = np.select([mon, tue, wed, thu, fri, sat, sun], [pmon, ptue, pwed, pthu, pfri, psat, psun], default=np.nan)
df["Activity"] = df["Activity"].astype("category")

print(df)

          Date        Day Activity
0   2020-01-01  Wednesday     Legs
1   2020-01-02   Thursday     Arms
2   2020-01-03     Friday     Rest
3   2020-01-04   Saturday   Cardio
4   2020-01-05     Sunday   Cardio
..         ...        ...      ...
361 2020-12-27     Sunday   Cardio
362 2020-12-28     Monday    Delts
363 2020-12-29    Tuesday     Legs
364 2020-12-30  Wednesday     Legs
365 2020-12-31   Thursday     Arms

[366 rows x 3 columns]
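Because each `np.random.choice` call above returns one value, a given weekday keeps the same activity all year. If the intent were instead to vary the activity from week to week, one option is to draw a separate sample for every row. The following is only a sketch under that assumption: `weekday_probs`, `sample_activities` and the `demo` frame are hypothetical names, not part of the notebook, and the probabilities are copied from the cell above.

```python
import numpy as np
import pandas as pd

activities = ["Rest", "Cardio", "Legs", "Chest", "Delts", "Arms"]

# Same probabilities as in the cell above: resistance training Monday to
# Thursday, rest or cardio Friday to Sunday.
train = [0, 0, 0.25, 0.25, 0.25, 0.25]
recover = [0.5, 0.5, 0, 0, 0, 0]
weekday_probs = {
    "Monday": train, "Tuesday": train, "Wednesday": train, "Thursday": train,
    "Friday": recover, "Saturday": recover, "Sunday": recover,
}

def sample_activities(days, rng):
    """Draw one activity per row, using that row's weekday probabilities."""
    return [str(rng.choice(activities, p=weekday_probs[day])) for day in days]

rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable

# A separate frame (hypothetical) so the notebook's df is left untouched.
demo = pd.DataFrame({"Date": pd.date_range("2020-01-01", "2020-12-31", freq="D")})
demo["Day"] = demo["Date"].dt.day_name()
demo["Activity"] = pd.Categorical(sample_activities(demo["Day"], rng),
                                  categories=activities)

print(demo.head(7))
```

With this variant the Friday to Sunday rows still only ever contain "Rest" or "Cardio", but two Mondays can differ.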


#### Calorie Input

Intensity values used to estimate the calories burned by each activity:

1. Cardio: jogging, general = 7.0
2. Legs: resistance (weight) training, squats, slow or explosive effort = 5.0
3. Chest/delts/arms: resistance (weight) training, multiple exercises, 8-15 repetitions at varied resistance = 3.5
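The three figures above look like MET (metabolic equivalent of task) values, for which calories burned are commonly estimated as MET × body weight in kg × duration in hours. The sketch below works through that arithmetic, assuming one-hour sessions and the 80 kg body weight used in the next cell; `met_calories` and `met_values` are hypothetical names introduced for illustration.

```python
def met_calories(met, weight_kg, hours=1.0):
    """Estimate calories burned as MET x body weight (kg) x duration (h)."""
    return met * weight_kg * hours

# Intensity values listed above, keyed by the activity names used in the data set.
met_values = {"Cardio": 7.0, "Legs": 5.0, "Chest": 3.5, "Delts": 3.5, "Arms": 3.5}

# Estimated calories for a one-hour session at 80 kg (assumed duration).
hourly = {name: met_calories(met, 80) for name, met in met_values.items()}
print(hourly)
```

These hourly figures (560, 400 and 280 calories) are the upper bounds from which the notebook then samples a daily value.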

In [5]:
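# Body weight (kg) used to scale the exercise calorie estimates.
start_weight = 80

# Upper bound of calories burned per session: intensity value x body weight.
rest = start_weight  # not used below; rest days burn 0 exercise calories
cardio = start_weight * 7.0
legs = start_weight * 5.0
chest = start_weight * 3.5
delts = start_weight * 3.5
arms = start_weight * 3.5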

In [6]:
def cals(row):
    if row["Activity"] == "Rest":
        return 0
    if row["Activity"] == "Cardio":
        return int(np.random.uniform(cardio/2, cardio))
    if row["Activity"] == "Legs":
        return int(np.random.uniform(legs/2, legs))
    if row["Activity"] == "Chest":
        return int(np.random.uniform(chest/2, chest))
    if row["Activity"] == "Delts":
        return int(np.random.uniform(delts/2, delts))
    if row["Activity"] == "Arms":
        return int(np.random.uniform(arms/2, arms))
    
df["Cals Burned"] = df.apply(cals, axis=1)

print(df)

          Date        Day Activity  Cals Burned
0   2020-01-01  Wednesday     Legs          363
1   2020-01-02   Thursday     Arms          180
2   2020-01-03     Friday     Rest            0
3   2020-01-04   Saturday   Cardio          517
4   2020-01-05     Sunday   Cardio          458
..         ...        ...      ...          ...
361 2020-12-27     Sunday   Cardio          493
362 2020-12-28     Monday    Delts          233
363 2020-12-29    Tuesday     Legs          279
364 2020-12-30  Wednesday     Legs          235
365 2020-12-31   Thursday     Arms          268

[366 rows x 4 columns]


In [7]:
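def cal_intake():
    # Daily calorie intake drawn from a normal distribution (mean 2000, sd 165)
    # truncated to the range 1500-2500, one value per day of the year.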
    low = 1500
    high = 2500
    average = 2000
    std_dev = 165
    x = stats.truncnorm((low - average) / std_dev, (high - average) / std_dev, loc=average, scale=std_dev)
    cal_val = x.rvs(366).astype(int)
    return cal_val

cal_intake()

array([1731, 2185, 1896, 2028, 2017, 2138, 2045, 2117, 1913, 1960, 1810,
       2031, 2043, 1844, 2238, 2052, 1931, 1878, 1778, 2116, 2103, 1854,
       1833, 1992, 1939, 1925, 2094, 1925, 1981, 1713, 1952, 1789, 1916,
       1959, 1942, 1809, 2340, 1741, 1779, 1981, 2028, 1990, 1929, 1902,
       1858, 1770, 1898, 1982, 2360, 2219, 2116, 2044, 2138, 1919, 1933,
       1789, 2065, 2014, 2147, 2221, 1980, 2151, 1821, 2017, 1836, 2018,
       1977, 2064, 2286, 1978, 1957, 1748, 2004, 1788, 2243, 2127, 2161,
       1962, 1814, 1810, 1966, 2007, 2102, 1915, 1721, 2032, 2035, 1727,
       2077, 1971, 1785, 2298, 1846, 2080, 1509, 2205, 1941, 1922, 2209,
       2014, 1843, 2216, 2262, 1839, 1758, 2013, 2111, 2171, 1980, 1927,
       2076, 2047, 2114, 1885, 1958, 2086, 1935, 2220, 1927, 1905, 2107,
       2153, 2280, 2140, 1922, 2098, 2076, 2231, 2269, 2063, 2127, 2146,
       2197, 1823, 2150, 1783, 1995, 2204, 2051, 2010, 1855, 2258, 1952,
       1905, 2265, 2071, 2008, 2077, 1998, 1742, 20

In [8]:
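# Rest and cardio days have a lower intake: a single random reduction of
# 200-300 calories, drawn once, is subtracted from the sampled intake.
less_cals = np.random.uniform(200, 300)
low_intake_day = df["Activity"].isin(["Rest", "Cardio"])
df["Cals In"] = np.where(low_intake_day, cal_intake() - int(less_cals), cal_intake())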
print(df)

          Date        Day Activity  Cals Burned  Cals In
0   2020-01-01  Wednesday     Legs          363     1902
1   2020-01-02   Thursday     Arms          180     2077
2   2020-01-03     Friday     Rest            0     1582
3   2020-01-04   Saturday   Cardio          517     1642
4   2020-01-05     Sunday   Cardio          458     1484
..         ...        ...      ...          ...      ...
361 2020-12-27     Sunday   Cardio          493     1756
362 2020-12-28     Monday    Delts          233     1997
363 2020-12-29    Tuesday     Legs          279     2144
364 2020-12-30  Wednesday     Legs          235     1855
365 2020-12-31   Thursday     Arms          268     1953

[366 rows x 5 columns]
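One detail of `cal_intake` worth spelling out: `scipy.stats.truncnorm` takes its lower and upper limits in standard-deviation units relative to `loc` and `scale`, not in calories, which is why the function divides by `std_dev`. A small sketch of that conversion, reusing the same numbers; the seed is an assumption added only to make the example repeatable.

```python
import numpy as np
import scipy.stats as stats

low, high = 1500, 2500        # calorie limits, as in cal_intake
average, std_dev = 2000, 165  # mean and standard deviation, as in cal_intake

# Convert the calorie limits to standard-deviation units for truncnorm.
a = (low - average) / std_dev
b = (high - average) / std_dev
print(round(a, 2), round(b, 2))  # -3.03 3.03

# Draw one value per day; random_state is set only for repeatability.
sample = stats.truncnorm(a, b, loc=average, scale=std_dev).rvs(366, random_state=0)

# Every draw should lie inside the original calorie limits.
print(sample.min() >= low, sample.max() <= high)
```

With limits roughly three standard deviations from the mean, the truncation only removes the extreme tails.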


#### BMI

In [9]:
# Height (cm) and starting weight (kg) variables.
height = 180
starting_weight = 90

# BMI function.
def bmi(weight, height):
    result = weight/(height/100)**2
    return result

# Starting BMI.
starting_bmi = bmi(starting_weight, height)
print("Starting BMI:", round(starting_bmi, 2) )

Starting BMI: 27.78


#### BMR

In [10]:
# Variables to determine BMR.
age = 30
male = 5
female = -161

# BMR function
def bmr(weight, height, age, sex):
    result = 10 * weight + 6.25 * height - 5 * age + sex
    return result

# Calculating and displaying starting BMR.
starting_bmr = bmr(starting_weight, height, age, male)
print('Starting BMR: ', round(starting_bmr), 'calories.')

Starting BMR:  1880 calories.


#### TDEE

In [11]:
# TDEE function.
def tdee(weight):
    result = bmr(weight, height, age, male) * 1.2
    return result
         
# Calculating and displaying starting TDEE.
starting_tdee = tdee(starting_weight)
print('Starting TDEE:', round(starting_tdee), 'calories.')

Starting TDEE: 2256 calories.
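The `bmr` function above has the form of the Mifflin-St Jeor equation, and `tdee` multiplies it by an activity factor of 1.2. As a hand check of the two printed values, using the symbols $w$ (weight, 90 kg), $h$ (height, 180 cm), $a$ (age, 30) and $s$ (sex constant, 5 for males), which are notation introduced here rather than taken from the notebook:

$$
\begin{aligned}
\mathrm{BMR} &= 10w + 6.25h - 5a + s = 10(90) + 6.25(180) - 5(30) + 5 = 1880 \\
\mathrm{TDEE} &= 1.2 \times \mathrm{BMR} = 1.2 \times 1880 = 2256
\end{aligned}
$$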


In [12]:
# Creating TDEE column with just first row filled.
df["TDEE"] = np.nan
df.loc[0, "TDEE"] = starting_tdee

# Printing head of current dataset.
df.head()

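| | Date | Day | Activity | Cals Burned | Cals In | TDEE |
|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN |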


#### Weight Goal and Calorie Deficit

In [13]:
# Target total weight loss in kilograms.
target = 10

# Target weekly weight loss.
weekly_target = target/52
print('Target weekly weight loss:', str(round(weekly_target, 3)) + 'kg')

# Estimated number of kilograms per calorie based on 3500cal = 0.454kg.
kg_per_cal = 0.454/3500 

# Function converting calories to kilograms. 
def cal_to_kg(calories):
    kg = calories * kg_per_cal
    return kg

# Weekly and daily calorie deficits.
weekly_cal_deficit = 1500
daily_cal_deficit = weekly_cal_deficit/7

# Weekly weight loss, daily calorie deficit and daily calorie allowance targets.
print('')
print('Estimated weekly weight loss based on calorie deficit of', str(weekly_cal_deficit) + ':', str(cal_to_kg(weekly_cal_deficit)) + 'kg')
print('')
print('Daily calorie deficit required:', round(daily_cal_deficit))
print('')
print('Daily calorie allowance:', round(starting_tdee - daily_cal_deficit))

Target weekly weight loss: 0.192kg

Estimated weekly weight loss based on calorie deficit of 1500: 0.19457142857142856kg

Daily calorie deficit required: 214

Daily calorie allowance: 2042
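The weekly deficit of 1500 calories can be related back to the 10 kg target by inverting the calorie-to-kilogram conversion. The sketch below reuses the notebook's constants; `required_weekly_deficit` is a hypothetical name introduced for illustration.

```python
# Constants as defined in the cell above.
target = 10                  # total weight loss target (kg)
kg_per_cal = 0.454 / 3500    # kilograms per calorie

# Weekly loss needed to reach the target over 52 weeks.
weekly_target = target / 52

# Calorie deficit per week that corresponds to that loss.
required_weekly_deficit = weekly_target / kg_per_cal
print(round(required_weekly_deficit))  # 1483
```

The chosen deficit of 1500 is therefore this figure rounded up, which is why the estimated weekly loss (0.195 kg) slightly exceeds the weekly target (0.192 kg).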


#### Cals Out

In [14]:
# Creating cals_out column.
df["Cals Out"] = np.nan
df.loc[0, "Cals Out"] = df.loc[0, "TDEE"] + df.loc[0, "Cals Burned"]

# Printing head of current dataset.
df.head()

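| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN | NaN |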


#### Weight

In [15]:
# Creating cal_dif column.
df["Cal Diff"] = np.nan
df.loc[0, "Cal Diff"] = df.loc[0, "Cals In"] - df.loc[0, "Cals Out"]

# Printing head of current dataset.
df.head()

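| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out | Cal Diff |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 | -717.0 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN | NaN | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN | NaN | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN | NaN | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN | NaN | NaN |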


In [16]:
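# Creating gain_or_loss column: the calorie difference converted to kilograms,
# plus a small random daily fluctuation (mean 0.0274 kg, sd 0.01 kg) of random sign.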
df["Kg Gain/Loss"] = np.nan
df.loc[0, "Kg Gain/Loss"] = cal_to_kg(df.loc[0, "Cal Diff"]) + (np.random.choice((-1, 1)) * np.random.normal(0.0274, 0.01))

# Printing head of current dataset.
df.head()

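| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out | Cal Diff | Kg Gain/Loss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 | -717.0 | -0.070943 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN | NaN | NaN | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN | NaN | NaN | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN | NaN | NaN | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN | NaN | NaN | NaN |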


In [17]:
# Creating the weight column.
df["Weight"] = np.nan
df.loc[0, "Weight"] = starting_weight + df.loc[0, "Kg Gain/Loss"]

# Printing head of current dataset.
df.head()

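| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out | Cal Diff | Kg Gain/Loss | Weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 | -717.0 | -0.070943 | 89.929057 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN | NaN | NaN | NaN | NaN |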


#### BMI

In [18]:
# Creating BMI column.
df["BMI"] = np.nan
df.loc[0, "BMI"] = bmi(df.loc[0, "Weight"], height)

# Printing head of current dataset.
df.head()

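| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out | Cal Diff | Kg Gain/Loss | Weight | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 | -717.0 | -0.070943 | 89.929057 | 27.755882 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | NaN | NaN | NaN | NaN | NaN | NaN |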


#### RHR

In [36]:
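# Resting heart rate (bpm): the normal range is 60-100.
# Simulated as a linear decline from 70 to 62 over the year, plus Gaussian noise (sd 1 bpm).
# One value per row of df (2020 is a leap year, so 366 days).
rhr = np.linspace(70, 62, len(df)) + np.random.normal(0, 1, len(df))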
print(rhr)

[70.60667219 69.92006816 71.77770609 69.25051525 70.02091207 69.00490835
 71.2550013  68.97582368 70.49487874 69.47458567 69.52847421 70.14840953
 70.1603754  69.3915813  69.50683759 70.87014941 70.57337627 70.03249265
 67.99397971 69.40679479 69.60051593 68.98620269 68.60944335 69.99990532
 68.53025252 70.00128812 68.26438056 68.36650042 68.81981271 68.82119656
 69.65077949 70.58670747 69.23249715 69.64625027 69.68708596 70.11855115
 69.86287014 69.2258638  69.77507815 70.0542683  70.19156987 69.20453318
 69.00856782 68.81341502 70.56373109 70.61042642 68.22209085 69.2695306
 68.99398197 66.74947678 70.15764608 71.43797743 68.78563047 67.72989518
 68.8984793  68.0336393  68.31851653 68.12923034 68.54506287 69.99258071
 69.09474917 69.73608909 69.55800901 68.33491563 68.52946387 67.970812
 68.08038557 67.80585257 68.40505869 68.66029154 69.5362994  69.03593496
 69.76156061 68.64763351 69.0874176  66.81040512 69.31571217 67.32756907
 69.97692239 68.47383145 68.63621644 69.38474924 67.36

          Date        Day Activity  Cals Burned  Cals In         TDEE  \
0   2020-01-01  Wednesday     Legs          363     1902  2256.000000   
1   2020-01-02   Thursday     Arms          180     2077  2255.148679   
2   2020-01-03     Friday     Rest            0     1582  2255.043502   
3   2020-01-04   Saturday   Cardio          517     1642  2254.417672   
4   2020-01-05     Sunday   Cardio          458     1484  2252.813335   
..         ...        ...      ...          ...      ...          ...   
361 2020-12-27     Sunday   Cardio          493     1756  1981.646778   
362 2020-12-28     Monday    Delts          233     1997  1981.178097   
363 2020-12-29    Tuesday     Legs          279     2144  1981.209366   
364 2020-12-30  Wednesday     Legs          235     1855  1981.509458   
365 2020-12-31   Thursday     Arms          268     1953  1980.653859   

        Cals Out     Cal Diff  Kg Gain/Loss     Weight        BMI  \
0    2619.000000  -717.000000     -0.070943  89.929057

#### FBG

A fasting blood glucose between 5.0 and 5.5 mmol/L is optimal. For adding noise to a signal, see https://stackoverflow.com/questions/14058340/adding-noise-to-a-signal-in-python

In [40]:
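# Linear decline from 5.8 to 5.2 mmol/L plus Gaussian noise (sd 0.1), one value per row of df.
fbg = np.linspace(5.8, 5.2, len(df)) + np.random.normal(0, 0.1, len(df))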
print(fbg)

[5.95403129 5.80769613 5.70415246 5.74438002 5.74554427 5.9484355
 5.75785973 5.54108998 5.67870222 5.69260533 5.93122259 5.91521353
 5.60369799 5.69899256 5.88558359 5.85707412 5.86380909 5.69499786
 6.06788509 5.86609042 5.72727413 5.68024971 5.79927808 5.84826709
 5.95026908 5.7989951  5.85293504 5.74124128 5.73605693 5.71982783
 5.90555081 5.73411448 5.6206462  5.68568389 5.73431861 5.80743415
 5.64631857 5.96706138 5.70443475 5.78319554 5.86569156 5.56395076
 5.84423628 5.8562837  5.79236988 5.55535261 5.58550557 5.77796423
 5.72999382 5.76658159 5.59220416 5.78379038 5.62131664 5.67519119
 5.56823883 5.67222019 5.65678483 5.88551188 5.69359751 5.65486381
 5.70327452 5.53307856 5.79063329 5.77761582 5.63422527 5.7053728
 5.59955465 5.51024433 5.75085637 5.61561534 5.54634279 5.79606109
 5.61427309 5.69829564 5.68287398 5.65274597 5.60447768 5.70131757
 5.85020168 5.70814978 5.5817823  5.84234282 5.70128969 5.64376095
 5.78049539 5.73703694 5.55113424 5.82128576 5.60665789 5.816579
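Both series above follow the same pattern: a linear trend with Gaussian noise added. To store them as columns, their length has to match the number of rows in the frame (366, since 2020 is a leap year). A minimal sketch of that step follows; the helper `noisy_trend`, the `demo` frame, the column names `RHR` and `FBG`, the rounding and the seed are all assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def noisy_trend(start, stop, n, noise_sd, rng):
    """Linear trend from start to stop over n points, plus Gaussian noise."""
    return np.linspace(start, stop, n) + rng.normal(0, noise_sd, n)

rng = np.random.default_rng(seed=1)  # seeded only to make the sketch repeatable

# A separate frame (hypothetical) with one row per day of 2020.
demo = pd.DataFrame({"Date": pd.date_range("2020-01-01", "2020-12-31", freq="D")})
n = len(demo)

# Resting heart rate in whole beats per minute (rounding is an assumption).
demo["RHR"] = noisy_trend(70, 62, n, 1.0, rng).round().astype(int)
# Fasting blood glucose in mmol/L, to one decimal place (also an assumption).
demo["FBG"] = noisy_trend(5.8, 5.2, n, 0.1, rng).round(1)

print(demo.head())
```

Taking the length from the frame rather than hard-coding it keeps the series aligned if the date range changes.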

#### Populating the Data Set

In [22]:
# For loop to fill in missing values, starting from the second row.
for i in range(1, len(df)):
    # TDEE depends on the weight of the previous day.
    df.loc[i, "TDEE"] = tdee(df.loc[i-1, "Weight"])
    # Cals Out is the sum of TDEE and the calories burned through exercise.
    df.loc[i, "Cals Out"] = df.loc[i, "TDEE"] + df.loc[i, "Cals Burned"]
    # Cal Diff is Cals In minus Cals Out.
    df.loc[i, "Cal Diff"] = df.loc[i, "Cals In"] - df.loc[i, "Cals Out"]
    # Kg Gain/Loss is Cal Diff converted to kilograms plus a random fluctuation.
    df.loc[i, "Kg Gain/Loss"] = cal_to_kg(df.loc[i, "Cal Diff"]) + (np.random.choice((-1, 1)) * np.random.normal(0.0357, 0.01))
    # Weight is the previous day's weight plus Kg Gain/Loss (negative for a loss).
    df.loc[i, "Weight"] = df.loc[i-1, "Weight"] + df.loc[i, "Kg Gain/Loss"]
    # BMI is the conversion of weight to a BMI score.
    df.loc[i, "BMI"] = bmi(df.loc[i, "Weight"], height)

# Printing head of current dataset.
df.head()

| | Date | Day | Activity | Cals Burned | Cals In | TDEE | Cals Out | Cal Diff | Kg Gain/Loss | Weight | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | Wednesday | Legs | 363 | 1902 | 2256.0 | 2619.0 | -717.0 | -0.070943 | 89.929057 | 27.755882 |
| 1 | 2020-01-02 | Thursday | Arms | 180 | 2077 | 2255.148679 | 2435.148679 | -358.148679 | -0.008765 | 89.920292 | 27.753177 |
| 2 | 2020-01-03 | Friday | Rest | 0 | 1582 | 2255.043502 | 2255.043502 | -673.043502 | -0.052153 | 89.868139 | 27.73708 |
| 3 | 2020-01-04 | Saturday | Cardio | 517 | 1642 | 2254.417672 | 2771.417672 | -1129.417672 | -0.133695 | 89.734445 | 27.695816 |
| 4 | 2020-01-05 | Sunday | Cardio | 458 | 1484 | 2252.813335 | 2710.813335 | -1226.813335 | -0.119778 | 89.614667 | 27.658848 |
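Since each day's weight is the previous day's weight plus that day's gain or loss, the weight column is the starting weight plus the running total of `Kg Gain/Loss`. The sketch below checks this against the five rows shown above; the arrays simply copy the displayed values, and `reconstructed` is a hypothetical name.

```python
import numpy as np

starting_weight = 90  # kg, as defined earlier

# Values copied from the first five rows of the table above.
gain_or_loss = np.array([-0.070943, -0.008765, -0.052153, -0.133695, -0.119778])
shown_weight = np.array([89.929057, 89.920292, 89.868139, 89.734445, 89.614667])

# Starting weight plus the cumulative sum of the daily changes.
reconstructed = starting_weight + np.cumsum(gain_or_loss)

# Agreement is up to the rounding of the displayed figures.
print(np.allclose(reconstructed, shown_weight, atol=1e-5))
```

The same identity suggests the loop could be replaced by a cumulative sum if TDEE did not depend on the previous day's weight.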
