# STAT 404 project

author:
  - Yanping (Dedoria) Wang (89845473)
  - Kevin Zhu (81805673)
  - Raojun Bo (55832919)


We use a data set from Kaggle on [weight loss under different diets](https://www.kaggle.com/datasets/tombenny/foodhabbits).


## Objective:
To conduct a factorial experiment investigating the main and interaction effects of diet type, age, and gender on weight loss among adults aged 16–60. The study analyzes how different combinations of diet, age group, and gender influence weight loss over a 6-week period, in order to identify the most effective diets for different population subgroups.

In [1]:
library(dplyr)
library(ggplot2)



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [2]:
# Read the data
food <- read.csv("data/foodDiet.csv")
head(food)

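|   | Person | gender | Age | Height | pre.weight | Diet | weight6weeks |
|---|--------|--------|-----|--------|------------|------|--------------|
|   | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 25 | 0 | 41 | 171 | 60 | 2 | 60.0 |
| 2 | 26 | 0 | 32 | 174 | 103 | 2 | 103.0 |
| 3 | 1 | 0 | 22 | 159 | 58 | 1 | 54.2 |
| 4 | 2 | 0 | 46 | 192 | 60 | 1 | 54.0 |
| 5 | 3 | 0 | 55 | 170 | 64 | 1 | 63.3 |
| 6 | 4 | 0 | 33 | 171 | 64 | 1 | 61.1 |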


## Variable meaning

| Variable Name         | Levels of the Variable      | Changes to Original Variable    |
|-----------------------|-----------------------------|-------------------------------------|
| Person        | 1–78, ID of each participant    | Dropped (not used in the analysis)            |
| gender        | 0 for female, 1 for male     | Recoded as a factor (Female, Male)            |
| Age (years)        | 16–60    | Grouped as 16–45 (Young) and 46–60 (Elder)        |
| Height (cm)     | Continuous, 141–201 cm  | Used with pre.weight to calculate BMI                    |
| pre.weight (kg)  | Continuous, 58–103 kg (weight before the experiment)      | Converted to BMI, then categorized as Underweight (<18.5), Healthy (18.5–25), Overweight (25–30), Obesity (>=30)     |
| Diet         | 1 (Placebo), 2 (Keto), 3 (Vegan)    | Converted to a factor            |
| weight6weeks (kg)    | Continuous, 53–103 kg (weight after 6 weeks on the diet)      | Used with Height to calculate BMI        |
| weightloss (kg)    | Continuous, weight6weeks − pre.weight (negative values mean weight was lost)   | Standardized       |
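
The age grouping in the table can be checked on a few boundary values. The sketch below is only an illustration: the ages are made-up test values, not rows from the data set, and it assumes the same `cut()` breaks as the preprocessing cell that follows.

```r
# Illustrative boundary ages (not taken from the data)
ages <- c(16, 45, 46, 60)

# Same breaks as the preprocessing cell: intervals are (15, 45] and (45, 60]
age_group <- cut(
  ages,
  breaks = c(15, 45, 60),
  right = TRUE,
  labels = c("Young", "Elder")
)

print(data.frame(ages, age_group))
```

With `right = TRUE`, an age of exactly 45 falls in the "Young" group and 46 is the first "Elder" age.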

In [3]:
# Create a copy of the original data
food_original <- food

# Modify the data: drop 'Person' and perform necessary transformations
food_modified <- food_original %>%
  select(-Person) %>%  # Drop 'Person' column
  mutate(
    # Recode 'gender'
    gender = factor(gender, levels = c(0, 1), labels = c("Female", "Male")),
    
    # Reclassify 'Age' into age groups
    AgeGroup = cut(
      Age,
      breaks = c(15, 45, 60),
      right = TRUE,
      labels = c("Young", "Elder")
    ),
    
    # Convert 'Height' to meters
    Height_m = Height / 100,
    
    # Calculate BMI before the experiment
    BMI_pre = pre.weight / (Height_m^2),
    
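    # Categorize BMI_pre; right = FALSE gives left-closed intervals, so that
    # 18.5 counts as "Healthy" and 30 as "Obesity", as in the variable table
    BMI_category_pre = cut(
      BMI_pre,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      right = FALSE,
      labels = c("Underweight", "Healthy", "Overweight", "Obesity")
    ),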
    
    # Calculate BMI after 6 weeks
    BMI_post = weight6weeks / (Height_m^2),
    
    # Calculate weight change; negative values mean weight was lost
    weightloss = weight6weeks - pre.weight
  )

# Select required columns for the new dataset
food_new <- food_modified %>%
  select(
    gender,
    AgeGroup,
    Diet,
    BMI_pre,
    BMI_category_pre,
    BMI_post,
    weightloss
  )

# Identify numeric columns, excluding Diet (an integer code for a categorical factor)
numeric_cols <- food_new %>%
  select(where(is.numeric), -Diet) %>%
  names()

# Standardize numeric columns and add them to the new dataset
food_new <- food_new %>%
  mutate(across(all_of(numeric_cols), ~ scale(.)[,1], .names = "{col}_std"))

head(food_new)

|   | gender | AgeGroup | Diet | BMI_pre | BMI_category_pre | BMI_post | weightloss | BMI_pre_std | BMI_post_std | weightloss_std |
|---|--------|----------|------|---------|------------------|----------|------------|-------------|--------------|----------------|
|   | `<fct>` | `<fct>` | `<int>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Female | Young | 2 | 20.51913 | Healthy | 20.51913 | 0.0 | -1.136746 | -0.8409967 | 1.50691963 |
| 2 | Female | Young | 2 | 34.02035 | Obesity | 34.02035 | 0.0 | 2.2089119 | 2.6665571 | 1.50691963 |
| 3 | Female | Young | 1 | 22.94213 | Healthy | 21.43903 | -3.8 | -0.5363172 | -0.6020131 | 0.01758659 |
| 4 | Female | Elder | 1 | 16.27604 | Underweight | 14.64844 | -6.0 | -2.1882023 | -2.3661771 | -0.84465885 |
| 5 | Female | Elder | 1 | 22.14533 | Healthy | 21.90311 | -0.7 | -0.733768 | -0.4814449 | 1.23256881 |
| 6 | Female | Young | 1 | 21.88708 | Healthy | 20.89532 | -2.9 | -0.7977641 | -0.7432658 | 0.37032336 |
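
As a quick check on what `scale(.)[,1]` does, the sketch below compares it with the manual z-score formula. It uses only the six `weightloss` values printed above, so the resulting z-scores will not match the `weightloss_std` column, which is standardized over all 78 rows.

```r
# The six weightloss values shown in the head() output (illustration only)
x <- c(0.0, 0.0, -3.8, -6.0, -0.7, -2.9)

# scale() centres by the mean and divides by the sample standard deviation;
# [, 1] turns the one-column matrix it returns back into a plain vector
z_scale  <- scale(x)[, 1]
z_manual <- (x - mean(x)) / sd(x)

all.equal(z_scale, z_manual)
```

Because standardization is a linear rescaling of the response, the F statistics and p-values of an ANOVA on `weightloss_std` are the same as those on `weightloss`; only the sums of squares change scale.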


In [4]:
# Convert the grouping variables to factors so that aov() treats them as categorical
food_new <- food_new %>%
  mutate(
    gender = as.factor(gender),
    AgeGroup = as.factor(AgeGroup),
    Diet = as.factor(Diet)
  )
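
The conversion matters mainly for `Diet`, which is stored as an integer. The sketch below uses a small invented data set (the `toy` values are hypothetical, not from the diet data) to show how the degrees of freedom differ when a three-level code is left numeric versus treated as a factor.

```r
# Toy data, invented for illustration; not the diet data set
toy <- data.frame(
  diet = rep(1:3, each = 4),
  y    = c(1.0, 1.2, 0.8, 1.1,  2.0, 2.3, 1.9, 2.1,  1.4, 1.6, 1.5, 1.3)
)

# Numeric code: aov() fits a single slope, so the term has 1 degree of freedom
df_numeric <- summary(aov(y ~ diet, data = toy))[[1]]$Df

# Factor: aov() compares the three group means, so the term has 2 degrees of freedom
df_factor <- summary(aov(y ~ factor(diet), data = toy))[[1]]$Df

df_numeric  # 1 10
df_factor   # 2  9
```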

In [5]:

# Mean, SD, and count of weight change for each Diet x AgeGroup cell
grouped_summary_age_diet <- food_new %>%
  group_by(Diet, AgeGroup) %>%
  summarise(
    mean_weightloss = mean(weightloss, na.rm = TRUE),
    sd_weightloss = sd(weightloss, na.rm = TRUE),
    count = n()
  )

# Same summary for each Diet x gender cell
grouped_summary_gender_diet <- food_new %>%
  group_by(Diet, gender) %>%
  summarise(
    mean_weightloss = mean(weightloss, na.rm = TRUE),
    sd_weightloss = sd(weightloss, na.rm = TRUE),
    count = n()
  )


# Same summary for each Diet x AgeGroup x gender cell
grouped_summary_interaction_diet <- food_new %>%
  group_by(Diet, AgeGroup, gender) %>%
  summarise(
    mean_weightloss = mean(weightloss, na.rm = TRUE),
    sd_weightloss = sd(weightloss, na.rm = TRUE),
    count = n()
  )

# Factors to summarise one at a time
three <- c("Diet", "AgeGroup", "gender")


`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
`summarise()` has grouped output by 'Diet', 'AgeGroup'. You can override using
the `.groups` argument.
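
The `summarise()` messages above are informational. One way to silence them, sketched here on a small invented data frame (the `toy` values are hypothetical), is to state the grouping of the result explicitly with `.groups`.

```r
library(dplyr)

# Toy data, invented for illustration; not the diet data set
toy <- data.frame(
  Diet       = factor(c(1, 1, 2, 2)),
  gender     = factor(c("Female", "Male", "Female", "Male")),
  weightloss = c(-3.0, -3.6, -2.3, -4.1)
)

cell_summary <- toy %>%
  group_by(Diet, gender) %>%
  summarise(
    mean_weightloss = mean(weightloss),
    count = n(),
    .groups = "drop"  # return an ungrouped tibble and suppress the message
  )

is_grouped_df(cell_summary)  # FALSE
```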


In [6]:

for (var in three) {
  
  # Dynamically group by the current variable
  summary_table <- food_new %>%
    group_by(.data[[var]]) %>%  # Use .data pronoun for dynamic variable
    summarise(
      mean_weightloss = mean(weightloss, na.rm = TRUE),
      sd_weightloss = sd(weightloss, na.rm = TRUE),
      count = n()
    ) %>%
    arrange(desc(mean_weightloss))  # Sort by mean change; the largest loss (most negative) comes last
  
  # Print the summary table with a header
  cat("\nSummary Statistics by", var, ":\n")
  print(summary_table)
}


Summary Statistics by Diet :
# A tibble: 3 × 4
  Diet  mean_weightloss sd_weightloss count
  <fct>           <dbl>         <dbl> <int>
1 2               -3.03          2.52    27
2 1               -3.3           2.24    24
3 3               -5.15          2.40    27

Summary Statistics by AgeGroup :
# A tibble: 2 × 4
  AgeGroup mean_weightloss sd_weightloss count
  <fct>              <dbl>         <dbl> <int>
1 Elder              -3.75          2.04    21
2 Young              -3.88          2.73    57

Summary Statistics by gender :
# A tibble: 2 × 4
  gender mean_weightloss sd_weightloss count
  [output truncated]

From these summaries, weight loss appears to be somewhat related to gender and diet type.

In [7]:
print(grouped_summary_gender_diet)
print(grouped_summary_age_diet)

# A tibble: 6 × 5
# Groups:   Diet [3]
  Diet  gender mean_weightloss sd_weightloss count
  <fct> <fct>            <dbl>         <dbl> <int>
1 1     Female           -3.05          2.07    14
2 1     Male             -3.65          2.54    10
3 2     Female           -2.28          2.31    16
4 2     Male             -4.11          2.53    11
5 3     Female           -5.88          1.89    15
6 3     Male             -4.23          2.72    12
# A tibble: 6 × 5
# Groups:   Diet [3]
  Diet  AgeGroup mean_weightloss sd_weightloss count
  [output truncated]

From these two tables, the weight change appears to be unrelated to age group and somewhat related to gender, and the pattern may change across diets.
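
One way to look at the gender pattern across diets is an interaction plot of the cell means. The sketch below re-enters the six means from the Diet × gender table printed above by hand; the data frame name `cell_means` and the plot styling are choices made here for illustration.

```r
library(ggplot2)

# Cell means copied from the Diet x gender summary printed above
cell_means <- data.frame(
  Diet   = factor(rep(1:3, each = 2)),
  gender = factor(rep(c("Female", "Male"), times = 3)),
  mean_weightloss = c(-3.05, -3.65, -2.28, -4.11, -5.88, -4.23)
)

# One line per gender; non-parallel lines suggest a Diet x gender interaction
p <- ggplot(cell_means,
            aes(x = Diet, y = mean_weightloss, colour = gender, group = gender)) +
  geom_line() +
  geom_point(size = 2) +
  labs(x = "Diet", y = "Mean weight change (kg)", colour = "Gender")

print(p)
```

In these means, males lose more than females on diets 1 and 2, while females lose more on diet 3, so the two lines cross between diets 2 and 3.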

In [8]:
print(grouped_summary_interaction_diet)

# A tibble: 12 × 6
# Groups:   Diet, AgeGroup [6]
   Diet  AgeGroup gender mean_weightloss sd_weightloss count
   <fct> <fct>    <fct>            <dbl>         <dbl> <int>
 1 1     Young    Female           -3.34         2.30      8
 2 1     Young    Male             -3.56         2.87      8
 3 1     Elder    Female           -2.67         1.84      6
 4 1     Elder    Male             -4            0.141     2
 5 2     Young    Female           -2.28         2.61     12
 6 2     Young    Male             -3.91         2.75      9
 7 2     Elder    Female           -2.3          1.31      4
 [output truncated]

In [9]:
# Full three-way factorial ANOVA on the standardized weight change
factorial_model <- aov(weightloss_std ~  Diet*AgeGroup*gender, data = food_new)
summary(factorial_model)

                     Df Sum Sq Mean Sq F value  Pr(>F)   
Diet                  2  10.92   5.460   6.125 0.00363 **
AgeGroup              1   0.04   0.036   0.041 0.84085   
gender                1   0.14   0.144   0.162 0.68867   
Diet:AgeGroup         2   0.16   0.078   0.088 0.91597   
Diet:gender           2   6.32   3.158   3.542 0.03458 * 
AgeGroup:gender       1   0.05   0.049   0.055 0.81482   
Diet:AgeGroup:gender  2   0.54   0.270   0.303 0.73947   
Residuals            66  58.84   0.891                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
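
For reference, the formula `Diet*AgeGroup*gender` corresponds to the usual three-way factorial model. The Greek symbols below are notation introduced here, not names used in the notebook: $\alpha_i$ is the effect of diet $i$ ($i = 1, 2, 3$), $\beta_j$ of age group $j$ ($j = 1, 2$), and $\gamma_k$ of gender $k$ ($k = 1, 2$).

$$
y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k
         + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}
         + (\alpha\beta\gamma)_{ijk} + \varepsilon_{ijkl},
\qquad \varepsilon_{ijkl} \sim N(0, \sigma^2)
$$

The degrees of freedom in the table follow from the numbers of levels: $2$, $1$, $1$ for the main effects, $2$, $2$, $1$ for the two-way interactions, $2$ for the three-way interaction, and $78 - 3 \cdot 2 \cdot 2 = 66$ for the residuals.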

In [10]:
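# Refit, keeping the terms that seem important. AgeGroup enters only through
# Diet:AgeGroup:gender, so that term absorbs every age-related effect (6 Df)
# and the residuals are the same as in the full model.
factorial_model2 <- aov(weightloss_std ~ Diet + gender + Diet:gender + Diet:AgeGroup:gender, data = food_new)
summary(factorial_model2)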

                     Df Sum Sq Mean Sq F value  Pr(>F)   
Diet                  2  10.92   5.460   6.125 0.00363 **
gender                1   0.16   0.159   0.178 0.67415   
Diet:gender           2   6.29   3.143   3.525 0.03510 * 
Diet:gender:AgeGroup  6   0.80   0.133   0.149 0.98862   
Residuals            66  58.84   0.891                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
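
The sum of squares for `gender` changes from 0.14 to 0.16 between the two tables, and `Diet:gender` from 6.32 to 6.29. This is expected: `aov()` reports sequential sums of squares, and with unequal cell counts each term's value depends on which terms were entered before it. The sketch below shows the same effect on a small invented unbalanced data set (the `toy` values are hypothetical); it uses `anova(lm(...))`, which produces the same sequential table as `aov()`.

```r
# Toy unbalanced two-factor data, invented for illustration
toy <- data.frame(
  A = factor(c("a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2")),
  B = factor(c("b1", "b1", "b1", "b2", "b1", "b2", "b2", "b2")),
  y = c(1.2, 0.7, 1.9, 2.4, 2.8, 3.1, 2.2, 3.6)
)

# Sum of squares for A when it is entered first versus after B
ss_A_first <- anova(lm(y ~ A + B, data = toy))["A", "Sum Sq"]
ss_A_last  <- anova(lm(y ~ B + A, data = toy))["A", "Sum Sq"]

ss_A_first  # about 3.78
ss_A_last   # about 1.65
```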

The two ANOVA tables show that diet type has a statistically significant effect on weight loss at the 0.01 level (F = 6.125, p = 0.0036), and that the Diet × gender interaction is significant at the 0.05 level (F = 3.542, p = 0.035 in the full model). No other main effect or interaction is significant (all p > 0.67).
Practical significance: Diet and the Diet × gender interaction have the largest sums of squares among the model terms (10.92 and 6.32 out of a total of about 77, roughly 14% and 8% of the variation). The remaining terms together explain about 1%, and the rest is residual variation.
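
The share of variation attributed to each term (eta squared) can be computed directly from the sums of squares. The sketch below re-enters the rounded values from the first ANOVA table by hand, so the percentages are approximate.

```r
# Sums of squares copied from the full-model ANOVA table (rounded as printed)
ss <- c(
  "Diet"                 = 10.92,
  "AgeGroup"             = 0.04,
  "gender"               = 0.14,
  "Diet:AgeGroup"        = 0.16,
  "Diet:gender"          = 6.32,
  "AgeGroup:gender"      = 0.05,
  "Diet:AgeGroup:gender" = 0.54,
  "Residuals"            = 58.84
)

# Eta squared: each term's sum of squares as a fraction of the total
eta_sq <- ss / sum(ss)

round(100 * eta_sq, 1)  # percentages; Diet is about 14.2, Diet:gender about 8.2
```

The total is close to 77 because the response was standardized, so its total sum of squares equals n − 1 = 77.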