# **Lab #7. Repeated Measures Analysis for Sensor & Human Data**
In this lecture, we will explore how to choose the right method for analyzing repeated measures data.

*This material is a joint work of TAs from IC Lab at KAIST, including Gyuna Kim, Dejiang Zhang, and Panyu Zhang. This work is licensed under CC BY-SA 4.0.*

### Preparation
First, we install the required Python libraries and download the lab data.

In [28]:
!pip install pingouin



In [29]:
!git clone https://github.com/Kaist-ICLab/IoTDS.git

fatal: destination path 'IoTDS' already exists and is not an empty directory.


# Overview: What We'll Learn Today

![scenario](https://drive.google.com/uc?export=view&id=1X6qZyy9vh0qcEsH7xky26uTv3gFq73tO)

In this session, we'll explore several statistical methods for analyzing **repeated measures data** — where multiple measurements are collected from the same participants over time or under different conditions.

Many traditional statistical methods (like the independent t-test) assume that each observation is **independent** of the others. However, in repeated measures designs, this assumption doesn't hold, because the data points are **correlated within individuals**.  
If we ignore this correlation, it can lead to **underestimated standard errors** and ultimately, **misleading test results**.
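
To make this risk concrete, here is a minimal simulation sketch (it does not use the lab dataset). It assumes two groups with no true difference, where each hypothetical participant contributes several correlated observations; the group sizes and variances are illustrative assumptions. Treating every observation as independent should reject the null hypothesis far more often than the nominal 5%, while testing on per-participant means should stay close to it.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulate_group(n_subjects=20, n_obs=10, subject_sd=1.0, noise_sd=1.0):
    """Return an (n_subjects, n_obs) array with one shared random offset per subject."""
    subject_effect = rng.normal(0.0, subject_sd, size=(n_subjects, 1))
    noise = rng.normal(0.0, noise_sd, size=(n_subjects, n_obs))
    return subject_effect + noise

n_sims = 500
naive_rejections = 0
aggregated_rejections = 0
for _ in range(n_sims):
    group_a = simulate_group()
    group_b = simulate_group()  # same distribution, so the null hypothesis is true
    # Naive analysis: pretend all 200 observations per group are independent
    if ttest_ind(group_a.ravel(), group_b.ravel()).pvalue < 0.05:
        naive_rejections += 1
    # Respecting the structure: one mean per participant
    if ttest_ind(group_a.mean(axis=1), group_b.mean(axis=1)).pvalue < 0.05:
        aggregated_rejections += 1

naive_rate = naive_rejections / n_sims
aggregated_rate = aggregated_rejections / n_sims
print(f"False-positive rate, observations treated as independent: {naive_rate:.2f}")
print(f"False-positive rate, per-participant means: {aggregated_rate:.2f}")
```

Averaging per participant is only the simplest way to respect the clustering; the models introduced later (GEE, GLMM) keep all observations while still accounting for the correlation.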

## What You'll Learn

1. **Understand what repeated measures data is**

2. **Choose appropriate statistical tests**
   - Independent vs. dependent samples
   - Parametric vs. non-parametric methods

3. **Apply basic tests for repeated measures**
   - *Friedman Test*: A non-parametric alternative to one-way repeated measures ANOVA

4. **Go beyond traditional tests with more flexible models**
   - *GLM* (Generalized Linear Model)
   - *GLMM* (Generalized Linear Mixed Model)
   - *GEE* (Generalized Estimating Equations)

➤ We'll walk through each of these with examples and hands-on code in the following sections.

# **Repeated Measurements**

In human-centered research, data are often correlated. For example:
- Weekly mood ratings (e.g., stress)  
- Before vs. after an intervention

### Within-subject and Between-subject Correlation
- **Within-subject correlation:**  
  A participant’s stress level this week is probably similar to last week’s — their values are *correlated over time*.

- **Between-subject correlation:**  
  Participants from the same group (e.g., KAIST Students) may show *similar overall patterns*.

Because of these dependencies in the data, we need to use **statistical models that account for repeated or clustered structures**.  

# **Case Study: AI Counseling Agent** 🤖
Background:
- You’ve developed an **AI-based counseling agent** to help users manage depression.  
- To evaluate its effectiveness, you conducted a **3-week study** where participants interacted with the agent and completed a **PHQ-9 questionnaire** each week. (*PHQ-9 measures depressive symptoms — higher scores indicate more severe symptoms.*)

Goal:  
- **Determine whether PHQ-9 scores significantly change over Week 1 (W1), Week 2 (W2), and Week 3 (W3).**

This kind of study results in **repeated measures data**, where each participant contributes multiple, correlated observations over time.

![Target Scenario Illustration](https://drive.google.com/uc?export=view&id=1aeueOQqhzNicrkW1a_0salVwdut58fuy)

> **Note**: This dataset is **synthetic**, and was generated for instructional purposes by the TA.

## 📊 Exploratory Data Analysis (EDA)

Let’s begin by exploring the data from our **AI-based counseling agent** study.

### Dataset Overview

This dataset includes:

- `PID`: Participant ID  
- `Week`: Time point (1 to 3)
- `EngagementMinutes`: Usage duration (min)
- `EngagementLevel`: low, high
- `Group`: Optional grouping (e.g., agent type)  
- `PHQ9Score`: PHQ-9 score (0–27) (Related to depression)
- `PHQ9Binary`: PHQ-9 binary (0: non-depressed, 1: depressed)

### Initial Exploration Steps

1. Load the dataset
2. Check for missing values → Skip: we already know there are no missing values
3. Inspect summary statistics
4. Visualize PHQ-9 scores by week

In [30]:
import pandas as pd
from IPython.display import display

# Load the dataset
counseling_data = pd.read_csv("/content/IoTDS/counseling_data.csv")

display(counseling_data)

Unnamed: 0,PID,Group,Gender,Week,EngagementMinutes,EngagementLevel,PHQ9Score,PHQ9Binary
0,P001,chat,male,1,33.0,low,17,1
1,P001,chat,male,2,37.5,low,15,1
2,P001,chat,male,3,22.0,low,11,1
3,P002,voice,male,1,56.9,low,16,1
4,P002,voice,male,2,87.9,high,17,0
...,...,...,...,...,...,...,...,...
115,P039,chat,male,2,129.8,high,13,0
116,P039,chat,male,3,107.2,high,8,1
117,P040,chat,female,1,106.2,high,17,1
118,P040,chat,female,2,104.9,high,18,1


The `groupby` function in Python is used to group data based on one or more columns and apply aggregate functions or transformations to each group.

In [31]:
weekly_stats = counseling_data.groupby('Week')['PHQ9Score'].describe()

print("PHQ9Score Descriptive Statistics by Week:")
display(weekly_stats)

PHQ9Score Descriptive Statistics by Week:


Week,count,mean,std,min,25%,50%,75%,max
1,40.0,16.675,1.403064,14.0,15.75,17.0,17.25,19.0
2,40.0,16.225,1.746608,13.0,15.0,16.0,18.0,19.0
3,40.0,10.55,1.986493,7.0,9.0,11.0,12.0,14.0


In [32]:
import plotly.express as px

fig = px.box(
    counseling_data,
    x="Week",
    y="PHQ9Score",
    title="Distribution of PHQ-9 Scores by Week",
    labels={"PHQ9Score": "PHQ-9 Score", "Week": "Week"},
    category_orders={"Week": [1, 2, 3]}
)
fig.update_layout(boxmode='group')
fig.show()

## Statistical Test Selection

Repeated measures data from human subjects usually involve **dependent observations** and often **violate normality assumptions**.

### Summary:
|               |             | Parametric Test (normality assumed)                  | Non-parametric Test             |
|---------------|-------------|------------------------------|----------------------------|
| Independent   | 2           | Independent t-test           | Mann-Whitney U test        |
| Independent   | ≥3          | One-way ANOVA                | Kruskal-Wallis test        |
| Dependent     | 2           | Paired t-test                | Wilcoxon signed-rank test  |
| Dependent     | ≥3          | One Way Repeated Measures ANOVA      | Friedman test              |
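
As a quick reference, the table above can be encoded as a small lookup. This is only a sketch: the function name `choose_test` and its arguments are hypothetical and not part of any library.

```python
def choose_test(dependent: bool, n_conditions: int, normal: bool) -> str:
    """Look up the test suggested by the summary table (hypothetical helper)."""
    if n_conditions < 2:
        raise ValueError("Need at least two groups or conditions to compare.")
    three_or_more = n_conditions >= 3
    table = {
        # (dependent, three_or_more, normal): test name
        (False, False, True): "Independent t-test",
        (False, False, False): "Mann-Whitney U test",
        (False, True, True): "One-way ANOVA",
        (False, True, False): "Kruskal-Wallis test",
        (True, False, True): "Paired t-test",
        (True, False, False): "Wilcoxon signed-rank test",
        (True, True, True): "One-way repeated measures ANOVA",
        (True, True, False): "Friedman test",
    }
    return table[(dependent, three_or_more, normal)]

# Three weekly measurements from the same participants, normality not assumed
print(choose_test(dependent=True, n_conditions=3, normal=False))  # Friedman test
```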

## Normality Test

Here are three widely used approaches for checking normality:

| **Method**  | **When to Use?** |
|------------------|-------------|
| **Shapiro-Wilk Test**  | Widely used for small to medium samples |
| **Kolmogorov-Smirnov Test**  | Suitable for larger samples; compares the sample's empirical CDF with a reference (normal) CDF |
| **Q–Q Plots**  | Visual assessment by comparing quantiles of data against a normal distribution |
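
The sketch below runs all three checks on two synthetic samples (one normal, one skewed) that are generated here purely for illustration. Two caveats: the Kolmogorov-Smirnov p-value is only approximate when the mean and SD are estimated from the same sample, and instead of drawing the Q-Q plot we report the correlation `r` of its points, where values closer to 1 indicate a straighter line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(loc=15.0, scale=2.0, size=200),
    "skewed": rng.exponential(scale=2.0, size=200),
}

normality_results = {}
for name, x in samples.items():
    # Shapiro-Wilk test on the raw sample
    shapiro_stat, shapiro_p = stats.shapiro(x)
    # Kolmogorov-Smirnov test against N(0, 1) after standardizing the sample
    z = (x - x.mean()) / x.std(ddof=1)
    ks_stat, ks_p = stats.kstest(z, "norm")
    # Q-Q plot coordinates plus the least-squares fit through them
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(x, dist="norm")
    normality_results[name] = {"shapiro_p": shapiro_p, "ks_p": ks_p, "qq_r": r}
    print(f"{name}: Shapiro p={shapiro_p:.4f}, KS p={ks_p:.4f}, Q-Q r={r:.4f}")
```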

### What if Normality is Violated?
- **Apply transformations** (e.g., log, square root) to normalize data -- But in this exercise we will skip this
- **Use non-parametric tests** instead

### **Exercise: Run Normality Test and Select the Appropriate Statistical Test**
1️⃣ Run a **Shapiro-Wilk test** on `counseling_data` to assess whether the `PHQ-9 scores` follow a normal distribution.  
2️⃣ Based on the normality results:
   - If normal → Use **parametric tests**.
   - If non-normal → Use **non-parametric tests**.

3️⃣ Which test is appropriate?

In [33]:
from scipy.stats import shapiro

results = []
for week in [1, 2, 3]:
    # If there are any missing values, you will need to drop them, but there are no missing values in our lab data
    scores = counseling_data[counseling_data["Week"] == week]["PHQ9Score"]
    # Shapiro-Wilk Test
    shapiro_stat, shapiro_p = shapiro(scores)

    results.append({
        'Week': week,
        'Shapiro-Wilk Statistic': shapiro_stat,
        'p-value': shapiro_p
    })

shapiro_results = pd.DataFrame(results)
display(shapiro_results)

Unnamed: 0,Week,Shapiro-Wilk Statistic,p-value
0,1,0.926268,0.01219
1,2,0.939123,0.032308
2,3,0.931441,0.017952


➤ Result: All weeks have `p-value < 0.05`, suggesting that the **normality assumption is violated for each week**.

> (Note)
> If the p-value were > 0.05, we would have no evidence against normality and could use **RM-ANOVA**. However, since our results show **p-value < 0.05 for all weeks, we cannot assume normality**. Therefore, we will use the **Friedman test**, which is the appropriate non-parametric alternative when the normality assumption is violated.

📝 Supplementary notes: While **missing data** can be handled using **imputation techniques** (e.g., mean or median imputation), we will skip this step in this lecture.

## Friedman test
- Non-parametric alternative to RM-ANOVA  
- Used when data are **ordinal** or **not normally distributed**  
- Compares **within-subject differences across times or conditions**

### Limitations
- Cannot handle missing data  
- Cannot model **between-subject effects** (e.g., male vs. female)
- Does not show **which conditions differ** → post-hoc test needed  
- **Lower statistical power** than parametric tests
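
To see what the test actually computes, here is a sketch that builds the Friedman statistic by hand: rank the conditions within each participant, sum the ranks per condition, and compare the result with a chi-square distribution. The scores are hypothetical continuous values (so there are no ties, and no tie correction is needed), not the lab data.

```python
import numpy as np
from scipy.stats import chi2, friedmanchisquare, rankdata

rng = np.random.default_rng(1)
n_subjects, n_conditions = 12, 3
# Hypothetical scores: rows = participants, columns = conditions
scores = rng.normal(loc=[16.0, 15.5, 11.0], scale=2.0, size=(n_subjects, n_conditions))

# Rank the conditions within each participant (1 = lowest score)
ranks = np.apply_along_axis(rankdata, 1, scores)
rank_sums = ranks.sum(axis=0)

n, k = n_subjects, n_conditions
statistic = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums**2) - 3.0 * n * (k + 1)
p_value = chi2.sf(statistic, df=k - 1)

# Cross-check against SciPy's implementation
reference = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"manual: chi2={statistic:.4f}, p={p_value:.4g}")
print(f"scipy:  chi2={reference.statistic:.4f}, p={reference.pvalue:.4g}")
```

Because only within-participant ranks enter the formula, the magnitude of the differences is discarded, which is one reason for the lower power noted above.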

In [34]:
# wide format for Friedman test
counseling_data_wide = counseling_data.pivot(index="PID", columns="Week", values="PHQ9Score")

# display the wide format
print("Wide Format for Friedman Test:")
display(counseling_data_wide.head())

Wide Format for Friedman Test:


PID,Week 1,Week 2,Week 3
P001,17,15,11
P002,16,17,11
P003,18,18,13
P004,16,16,10
P005,16,17,12


In [35]:
from scipy.stats import friedmanchisquare

friedman_results = friedmanchisquare(
    counseling_data_wide[1],
    counseling_data_wide[2],
    counseling_data_wide[3],
)

# Display results
print(friedman_results)

FriedmanchisquareResult(statistic=np.float64(64.32679738562096), pvalue=np.float64(1.0755074633296326e-14))


Since the **p-value is smaller than 0.05**, there is a **statistically significant difference** in `PHQ9Score` across weeks.

➤ Then we can perform **post-hoc tests** to further analyze which specific weeks differ from each other.

## Post-hoc test

In [36]:
import pingouin as pg

# The test used here is Wilcoxon signed-rank test, since parametric=False is specified.
# Pingouin automatically chooses Wilcoxon for within-subject, non-parametric tests.

posthoc = pg.pairwise_tests(
    data=counseling_data,       # Long-format DataFrame (must include PID, Week, PHQ9Score)
    dv="PHQ9Score",             # Dependent variable
    within="Week",              # Within-subject factor
    subject="PID",              # Subject identifier (repeated measures)
    parametric=False,           # Use non-parametric test → Wilcoxon signed-rank test
    padjust="bonf",             # Bonferroni correction for multiple comparisons
    effsize="hedges"            # Hedges' g for effect size
)

# Pretty-print the results in a clean table
pg.print_table(posthoc, floatfmt=".3f")



POST HOC TESTS

Contrast      A    B  Paired    Parametric      W-val  alternative      p-unc    p-corr  p-adjust      hedges
----------  ---  ---  --------  ------------  -------  -------------  -------  --------  ----------  --------
Week          1    2  True      False         148.500  two-sided        0.012     0.037  bonf           0.281
Week          1    3  True      False           0.000  two-sided        0.000     0.000  bonf           3.527
Week          2    3  True      False           0.000  two-sided        0.000     0.000  bonf           3.005



# **Interpretation of Post-hoc Results**

- Week 1 vs 2: p = 0.037 (Bonferroni-corrected), g = 0.28 → Slight improvement
- Week 1 vs 3: p < 0.001, g = 3.53 → Big drop
- Week 2 vs 3: p < 0.001, g = 3.00 → Continued drop

➤ **Main change occurred between Week 2 and 3**
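
The Hedges' g values can be reproduced by hand from the weekly means and standard deviations in the descriptive statistics table. The sketch below assumes the effect size is the mean difference divided by the square root of the average of the two weekly variances, multiplied by the usual small-sample correction. That formula is our assumption about how the values were computed, but it matches the post-hoc table to three decimals.

```python
import math

# Weekly mean and SD copied from the descriptive statistics by week (n = 40 per week)
weekly = {
    1: {"mean": 16.675, "sd": 1.403064},
    2: {"mean": 16.225, "sd": 1.746608},
    3: {"mean": 10.55, "sd": 1.986493},
}
n_per_week = 40

def hedges_g(a, b, n_a=n_per_week, n_b=n_per_week):
    """Standardized mean difference with small-sample correction (assumed formula)."""
    sd_avg = math.sqrt((a["sd"] ** 2 + b["sd"] ** 2) / 2)
    cohens_d = (a["mean"] - b["mean"]) / sd_avg
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)
    return cohens_d * correction

effect_sizes = {
    (first, second): hedges_g(weekly[first], weekly[second])
    for first, second in [(1, 2), (1, 3), (2, 3)]
}
for (first, second), g in effect_sizes.items():
    print(f"Week {first} vs Week {second}: g = {g:.3f}")
```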


📝 Supplementary Notes: If you're interested, check out the [Pingouin library](https://pingouin-stats.org/build/html/generated/pingouin.pairwise_tests.html) for more on pairwise tests
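
For readers who want to see the mechanics, the same kind of post-hoc analysis can be sketched with SciPy alone: run a Wilcoxon signed-rank test for every pair of weeks and multiply each p-value by the number of comparisons (Bonferroni). The data here are hypothetical continuous scores generated for illustration, so the numbers will not match the table above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)
n_participants = 30
baseline = rng.normal(16.0, 1.5, size=n_participants)
# Hypothetical weekly scores sharing a per-participant baseline
wide = {
    1: baseline + rng.normal(0.0, 1.0, size=n_participants),
    2: baseline + rng.normal(-0.5, 1.0, size=n_participants),
    3: baseline + rng.normal(-6.0, 1.0, size=n_participants),
}

pairs = list(combinations(wide, 2))  # (1, 2), (1, 3), (2, 3)
posthoc_rows = []
for week_a, week_b in pairs:
    w_stat, p_unc = wilcoxon(wide[week_a], wide[week_b])
    # Bonferroni: multiply by the number of comparisons and cap at 1
    p_corr = min(1.0, p_unc * len(pairs))
    posthoc_rows.append({"A": week_a, "B": week_b, "W": w_stat, "p_unc": p_unc, "p_corr": p_corr})

for row in posthoc_rows:
    print(f"Week {row['A']} vs {row['B']}: W={row['W']:.1f}, "
          f"p-unc={row['p_unc']:.4g}, p-corr={row['p_corr']:.4g}")
```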

# Generalized Linear Model (GLM) and the extensions of GLM

So far, we've used the Friedman test to analyze repeated measures data.  
While it's suitable for simple, balanced designs, it has important limitations:

- Cannot handle **missing data** well  
- Requires **equal time intervals** and **balanced data**  
- Cannot model **group-level effects** (e.g., male vs. female)  
- Ignores **subject-specific variation** in baseline levels or time trends  

## Scenarios

### Scenario 1: Comparing Group-Level Effects

- What if we want to examine how a group-level factor (e.g., gender, intervention type) influences user outcomes?  
- The Friedman test can't address this, as it doesn't model **between-subject differences**.
- ➡️ We need a model that can capture **both within-subject and group-level effects**.
  - **Generalized Estimating Equations (GEE)**

### Scenario 2: Unbalanced ESM Data  
![Unbalanced Longitudinal Data](https://drive.google.com/uc?export=view&id=1ueWdesIC_AZ7xeIFu3CRcbqfmGcIZXcW)
- In real-world studies (like those using **Experience Sampling Method - ESM**), users may not respond on a fixed schedule.
- This leads to:
  - Different numbers of measurements per person
  - Irregular time intervals
  - Missing data points
- In this case, a **GLMM** can flexibly handle irregularity and missingness in the data.

### What can we use?

| Model | Description |
|-------|-------------|
| **GLM** (Generalized Linear Model) | Models outcomes from the exponential family using link functions and fixed effects |
| **GEE** (Generalized Estimating Equations) | Estimates marginal effects from correlated data without random effects |
| **GLMM** (Generalized Linear Mixed Model) | Models outcomes from the exponential family using both fixed and random effects |

These models offer:  
- Flexibility in outcome distributions (not limited to normal)  
- The ability to model time trends and subject-level variability  
- Robust handling of correlated observations and missing data

## Generalized Linear Model (GLM)
- GLM models outcomes from the **exponential family** using link functions and fixed effects.
- The choice of distribution depends on the outcome's probability structure —
e.g., binary → binomial/logistic, count → Poisson, continuous → Gaussian, skewed continuous → Gamma.
- *Examples: linear regression, logistic regression, Poisson regression, gamma regression*
  - Gaussian GLM with identity link is equivalent to ordinary least squares (OLS) linear regression
![Probability distributions](https://drive.google.com/uc?export=view&id=1KO4KulLAikk_I-Ihh0TOQPmMOlKvlfWv)

### **Limitation in repeated measures context:**
GLM assumes **independent observations**.
However, repeated measures data are correlated → leads to invalid standard errors.

➡️ Use **GEE** or **GLMM** to handle within-subject correlation.

## Generalized Estimating Equations (GEE)

- GEE estimates **marginal (population-averaged) effects** from correlated or repeated measures data.
- It accounts for within-subject correlation without specifying random effects, using a **working correlation** structure (e.g., exchangeable, AR(1)).
- Suitable when the focus is on **overall group-level effects** rather than individual trajectories.

## Generalized Linear Mixed Model (GLMM)
- GLMM extends GLM by including both **fixed** and **random effects**, allowing for **subject-specific inference**.
- Handles **hierarchical**, **clustered**, or **repeated measures** data by modeling **correlation within subjects** through random effects.
- Useful when modeling **individual variability** and **within-group heterogeneity** is important.

# (GEE Practice) Extension of AI Counseling Agent Case Study: Comparing Group-level Effects 🤖

![Group Comparison](https://drive.google.com/uc?export=view&id=1Z72iR8rTtM0vGocflSI-TbDaIxmdlKWV)

## Target Scenario (Extension of AI Counseling Agent Study)
Let’s assume the researcher wanted to examine how the interaction style with the counseling agent affects user outcomes.
To do this, participants were divided into two groups:

- Group A: Users interacting with a Voice Agent
- Group B: Users interacting with a Chat Agent

Over the course of three weeks, all participants completed PHQ-9 questionnaires weekly, just like before.

- `PID`: Participant ID  
- `Week`: Time point (1 to 3)
- `EngagementMinutes`: Usage duration (min)
- `EngagementLevel`: low, high
- `Group`: Optional grouping (e.g., agent type)  
- `PHQ9Score`: PHQ-9 score (0–27) (Related to depression)
- `PHQ9Binary`: PHQ-9 binary (0: non-depressed, 1: depressed)

## Objective
Your goal is to:

Use **GEE** to determine whether there is a **significant difference** in the change of PHQ-9 scores over time **between the two groups** (Voice vs. Chat), while accounting for **individual-level repeated measurements**.

In [37]:
# Counseling_data

display(counseling_data.head())

Unnamed: 0,PID,Group,Gender,Week,EngagementMinutes,EngagementLevel,PHQ9Score,PHQ9Binary
0,P001,chat,male,1,33.0,low,17,1
1,P001,chat,male,2,37.5,low,15,1
2,P001,chat,male,3,22.0,low,11,1
3,P002,voice,male,1,56.9,low,16,1
4,P002,voice,male,2,87.9,high,17,0


In [38]:
# Distribution of PHQ-9 Scores by Week and Group

import plotly.express as px

fig = px.box(
    counseling_data,
    x="Week",
    y="PHQ9Score",
    color="Group",
    title="Distribution of PHQ-9 Scores by Week and Group",
    labels={"PHQ9Score": "PHQ-9 Score", "Week": "Week"},
    category_orders={"Week": [1, 2, 3]}
)
fig.update_layout(boxmode='group', legend_title_text='Group')
fig.show()

In [39]:
# Check types of each column
counseling_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PID                120 non-null    object 
 1   Group              120 non-null    object 
 2   Gender             120 non-null    object 
 3   Week               120 non-null    int64  
 4   EngagementMinutes  120 non-null    float64
 5   EngagementLevel    120 non-null    object 
 6   PHQ9Score          120 non-null    int64  
 7   PHQ9Binary         120 non-null    int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 7.6+ KB


In [40]:
import statsmodels.api as sm

# Convert group and week variables to categorical
counseling_data["Group"] = counseling_data["Group"].astype("category")
counseling_data["Week"] = counseling_data["Week"].astype("category")

# Define exchangeable correlation structure
exchangeable_corr = sm.cov_struct.Exchangeable()

# Fit GEE model with interaction term
gee_model = sm.GEE.from_formula(
    "PHQ9Score ~ Week * Group",        # Main effects + interaction
    groups="PID",                      # Repeated measures by PID
    cov_struct=exchangeable_corr,      # Correlation structure
    data=counseling_data
)

gee_result = gee_model.fit()
print(gee_result.summary())

                               GEE Regression Results                              
Dep. Variable:                   PHQ9Score   No. Observations:                  120
Model:                                 GEE   No. clusters:                       40
Method:                        Generalized   Min. cluster size:                   3
                      Estimating Equations   Max. cluster size:                   3
Family:                           Gaussian   Mean cluster size:                 3.0
Dependence structure:         Exchangeable   Num. iterations:                     2
Date:                     Sun, 06 Apr 2025   Scale:                           2.958
Covariance type:                    robust   Time:                         11:46:26
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                   16.4000      0.335     49.004 

**Interpretation**
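- **Week 2** shows a **small** but **significant decrease** in PHQ-9 scores relative to Week 1 (coefficient β = -0.45, p = 0.049).
- **Week 3** shows a **large** and **significant decrease** relative to Week 1 (β = -5.45, p < 0.001).
- With the `Week * Group` interaction in the model, these Week coefficients describe the reference group (Chat).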
  
  → Participants' depressive symptoms improved over time overall.

- No significant difference between Voice and Chat groups at Week 1 (β = 0.55, p = 0.200)
- At **Week 3**, the **Voice group improved more** than the Chat group (β = -1.35, p = 0.001)

  → Suggests the Voice agent may be more effective in reducing PHQ-9 scores over time.
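
To connect the coefficients to actual scores, the worked sketch below adds them up into model-implied group means. It uses only the coefficients quoted in the interpretation above and assumes Chat is the reference group and Week 1 the reference week (the default treatment coding); the dictionary keys are shorthand labels, not the exact names in the model summary.

```python
# Coefficients quoted in the interpretation (Gaussian family, identity link)
coef = {
    "Intercept": 16.40,     # Chat group at Week 1
    "Week3": -5.45,         # change from Week 1 to Week 3 in the Chat group
    "Voice": 0.55,          # Voice minus Chat at Week 1
    "Week3:Voice": -1.35,   # extra Week 1 -> Week 3 change in the Voice group
}

chat_w1 = coef["Intercept"]
chat_w3 = coef["Intercept"] + coef["Week3"]
voice_w1 = coef["Intercept"] + coef["Voice"]
voice_w3 = voice_w1 + coef["Week3"] + coef["Week3:Voice"]

print(f"Chat:  Week 1 = {chat_w1:.2f}, Week 3 = {chat_w3:.2f}, change = {chat_w3 - chat_w1:.2f}")
print(f"Voice: Week 1 = {voice_w1:.2f}, Week 3 = {voice_w3:.2f}, change = {voice_w3 - voice_w1:.2f}")
```

The interaction coefficient is exactly the difference between the two changes (-6.80 for Voice versus -5.45 for Chat).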

### **Discussion of GEE**
GEE is especially useful when the research goal is to estimate **population-level trends**, rather than tracking individual-specific patterns.  
It is commonly used in epidemiological and public health research, where **subject-level variation is treated as a nuisance**.

However, since GEE does **not explicitly model individual trajectories**, it may not be ideal when subject-specific effects or within-group heterogeneity are key research questions.

#### **When is GEE recommended?**

According to [Neuhaus et al., 2006](https://www.sciencedirect.com/science/article/pii/S0306460306001018?ref=pdf_download&fr=RR-2&rr=923b19d5fa40ea15):
> GEE is appropriate **when the number of participants is at least 30**, and each participant has **3 to 5 repeated observations**.

This is because GEE relies on **large sample theory** (asymptotic properties) for valid and stable inference.

So if your study meets these criteria and your interest lies in **population-level effects**, then **GEE is a robust and interpretable choice** for repeated measures analysis.

#### **Working Correlation** Structure (in GEE)
In GEE, researchers must specify a **working correlation structure**, which models how repeated observations within a subject are related.
- **Exchangeable**: all time points equally correlated
- **AR(1)**: closer time points more correlated
- **Unstructured**: allows any pattern of correlation
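
The first two structures are easy to write out explicitly. The sketch below builds the implied correlation matrices for three time points with NumPy; the helper names and the value `rho = 0.5` are illustrative assumptions.

```python
import numpy as np

def exchangeable(n_times: int, rho: float) -> np.ndarray:
    """Every pair of time points shares the same correlation rho."""
    return np.full((n_times, n_times), rho) + (1 - rho) * np.eye(n_times)

def ar1(n_times: int, rho: float) -> np.ndarray:
    """Correlation decays as rho ** |i - j| with the distance between time points."""
    idx = np.arange(n_times)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rho = 0.5
print("Exchangeable:\n", exchangeable(3, rho))
print("AR(1):\n", ar1(3, rho))
```

Under AR(1), Week 1 and Week 3 are less correlated (0.25) than adjacent weeks (0.5), whereas the exchangeable structure treats all pairs alike. An unstructured matrix would instead estimate each off-diagonal entry separately.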

# (GLMM Practice) Stress Monitoring via ESM in the Wild 📱

## Target Scenario
Assume we are working with an **in-the-wild dataset**, where participants report their stress levels multiple times per day using the **Experience Sampling Method (ESM)** through their smartphones.

![Unbalanced Longitudinal Data](https://drive.google.com/uc?export=view&id=1ueWdesIC_AZ7xeIFu3CRcbqfmGcIZXcW)

Because the data is collected in natural settings:

- Each participant has a different number of measurements (i.e., unbalanced data).

- The time intervals between measurements are irregular.

This results in a longitudinal dataset with repeated measures, but unlike earlier structured cases, it is:

- Unbalanced (unequal number of samples per person)
- Hierarchical (measurements nested within individuals)

## Method
In this scenario, we use **Generalized Linear Mixed Models (GLMMs)** to:

- Model the relationship between **passively sensed features** (e.g., screen time, location) and **self-reported stress levels**

- Account for the **random effects** due to individual differences

- Handle **non-normal outcomes** when applicable

## Objective
Your goal is to:

Build a **GLMM** to predict stress levels using sensor-based features as fixed effects,
while incorporating **random intercepts (and possibly slopes)** for each participant to account for **between-participant variation** (which induces correlation among each participant's repeated measurements).

> **Note**: This dataset is **synthetic**, and was generated for instructional purposes by the TA.

In [41]:
# Load the dataset
esm_data = pd.read_csv("/content/IoTDS/esm_data.csv")

display(esm_data)

Unnamed: 0,PID,University,Gender,Timestamp,ScreenTimeSumBefore1Hr,StepCountSumBefore1Hr,Location,StressScore
0,P001,A,female,2025-03-04 08:56:06.479621765,1849.060173,2070,work,19
1,P001,A,female,2025-03-04 09:47:37.319771159,625.953149,3320,outside,11
2,P001,A,female,2025-03-07 09:25:33.998410366,870.141987,1922,home,17
3,P001,A,female,2025-03-07 10:33:00.416640129,861.052040,1333,work,13
4,P001,A,female,2025-03-08 08:06:11.396028044,1465.872188,848,work,19
...,...,...,...,...,...,...,...,...
392,P032,A,male,2025-03-18 08:56:22.627619557,844.220371,536,home,20
393,P032,A,male,2025-03-26 09:46:15.083863894,702.402989,984,work,20
394,P032,A,male,2025-03-28 08:01:12.472215284,908.369852,3302,outside,14
395,P032,A,male,2025-03-28 09:38:10.116109158,1249.255280,671,home,21


In [42]:
esm_data.describe()

Unnamed: 0,ScreenTimeSumBefore1Hr,StepCountSumBefore1Hr,StressScore
count,397.0,397.0,397.0
mean,984.524337,2079.191436,19.702771
std,329.379091,1049.682912,3.885653
min,40.876092,24.0,6.0
25%,763.272683,1276.0,17.0
50%,978.127976,1742.0,20.0
75%,1190.93506,3042.0,22.0
max,1965.429433,4609.0,31.0


In [43]:
esm_data['University_Gender'] = esm_data['University'] + ' - ' + esm_data['Gender']

# Boxplot of StressScore by University and Gender
fig = px.box(esm_data, x='University_Gender', y='StressScore', title='Boxplot of StressScore by University and Gender')
fig.update_layout()
fig.show()

In [44]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
esm_data['ScreenTime_scaled'] = scaler.fit_transform(esm_data['ScreenTimeSumBefore1Hr'].values.reshape(-1, 1))
esm_data['StepCount_scaled'] = scaler.fit_transform(esm_data['StepCountSumBefore1Hr'].values.reshape(-1, 1))
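
Standardization simply subtracts the mean and divides by the standard deviation, so each coefficient can later be read as the change in stress per one SD of the feature. A minimal NumPy sketch of the same transformation, using a few illustrative values rather than the full column:

```python
import numpy as np

# A handful of illustrative screen-time values (seconds)
screen_time = np.array([1849.06, 625.95, 870.14, 861.05, 1465.87])

# z-score using the population SD (ddof=0), which is what StandardScaler uses
scaled = (screen_time - screen_time.mean()) / screen_time.std(ddof=0)

print("scaled values:", np.round(scaled, 3))
print("mean:", round(float(scaled.mean()), 6), "std:", round(float(scaled.std(ddof=0)), 6))
```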

In [45]:
import statsmodels.formula.api as smf

# Linear mixed model (a Gaussian GLMM with identity link) with a random intercept per participant
glmm_model = smf.mixedlm(
    "StressScore ~ University * Gender + ScreenTime_scaled + StepCount_scaled",
    data=esm_data,
    groups=esm_data["PID"]
).fit()

print(glmm_model.summary())

                  Mixed Linear Model Regression Results
Model:                   MixedLM      Dependent Variable:      StressScore
No. Observations:        397          Method:                  REML       
No. Groups:              32           Scale:                   3.8039     
Min. group size:         7            Log-Likelihood:          -867.3749  
Max. group size:         19           Converged:               Yes        
Mean group size:         12.4                                             
--------------------------------------------------------------------------
                               Coef.  Std.Err.    z    P>|z| [0.025 0.975]
--------------------------------------------------------------------------
Intercept                      16.719    0.695  24.054 0.000 15.357 18.081
University[T.B]                 3.602    1.101   3.273 0.001  1.445  5.759
Gender[T.male]                  2.457    1.051   2.338 0.019  0.397  4.517
University[T.B]:Gender[T.male] -0.583    1.5

**Interpretation**:

- **Intercept** (β = 16.719, p < 0.001): The mean StressScore for participants who are:
  - in University A
  - and female
  - with average ScreenTime and StepCount
  - with average random effect (since it's a mixed model)

- **University B vs University A** (β = 3.602, p < 0.05): Participants from **University B have a significantly higher stress score** than those from University A.

- **Male vs Female** (β = 2.457, p < 0.05): **Male** participants show a **higher stress score** compared to female participants, with this difference being statistically significant.

- **University B × Male Interaction** (β = -0.583, p = 0.698): The interaction between being from University B and male gender has **no significant effect** on the stress score.

- **ScreenTime** (β = 1.322, p < 0.001): **Increased screen time** is significantly associated with **higher stress scores**.

- **StepCount** (β = -1.157, p < 0.001): **Higher step counts** are significantly associated with **lower stress scores**, suggesting that physical activity has a beneficial effect on reducing stress.

- **Random effect** (Group Var = 4.009): Group Var is roughly comparable to the residual variance (Scale), indicating that between-person differences are substantial enough to justify the use of a random intercept.
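
One common way to summarize this comparison is the intraclass correlation coefficient (ICC): the share of the total variance (after accounting for the fixed effects) that is due to differences between participants. A worked sketch using the two variance estimates reported above:

```python
group_var = 4.009      # random-intercept variance (Group Var)
residual_var = 3.8039  # residual variance (Scale)

# Share of variance attributable to between-person differences
icc = group_var / (group_var + residual_var)
print(f"ICC = {icc:.3f}")  # ICC = 0.513
```

An ICC of about 0.51 means roughly half of the remaining variation in stress scores lies between participants, which is far from the zero that an independence assumption would imply.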


---

# Homework #6: Analyzing repeated measures data

*   3 points towards your final score
*   Must submit a single colab file to KLMS by next Wednesday 23:59:59 (April 9th)

Your task is to analyze the data through the following steps:

## 1. RM-ANOVA test and post-hoc test (1.0 pt)
In our analysis of the `counseling_data`, we first performed a **normality test** on `PHQ9Score` and found that the **data did not follow a normal distribution**. Therefore, we used the **Friedman test**. However, what if we skipped the normality test and directly applied **RM-ANOVA** instead? Would the results be similar to those of the Friedman test? Run the code and compare the outcomes.

- 1.1 Run RM-ANOVA and interpret the result (0.3 pt)

- 1.2 If there are significant group differences, conduct a post-hoc test (0.3 pt)

- 1.3 Interpret the RM-ANOVA (and post-hoc test) results and compare them with the Friedman test results (0.2 pt)

- 1.4 Briefly discuss the potential risks of applying statistical tests that do not match the data characteristics. Explain why it is important to choose statistical tests based on the data distribution. (3–5 sentences) (0.2 pt)

### 1.1 Run One-way RM-ANOVA on `counseling_data.csv` and interpret the result (0.3 pt)

Normally, to conduct an RM-ANOVA, we would need to perform a **normality test** and **Mauchly's test** for **sphericity**. However, in this assignment, we will skip these steps.
Also, note that our dataset does not contain any missing values.

In [46]:
# Use counseling_data

display(counseling_data.head())

Unnamed: 0,PID,Group,Gender,Week,EngagementMinutes,EngagementLevel,PHQ9Score,PHQ9Binary
0,P001,chat,male,1,33.0,low,17,1
1,P001,chat,male,2,37.5,low,15,1
2,P001,chat,male,3,22.0,low,11,1
3,P002,voice,male,1,56.9,low,16,1
4,P002,voice,male,2,87.9,high,17,0


In [47]:
# You may want to use pingouin library
import pingouin as pg

# RM-ANOVA (0.2 pt)
rm_anova = pg.rm_anova(
    data=counseling_data,
    dv="PHQ9Score",                    # Write your answer here
    within="Week",                # Write your answer here
    subject="PID",               # Write your answer here
)

print(rm_anova)

  Source  ddof1  ddof2           F         p-unc     p-GG-corr       ng2  \
0   Week      2     78  476.232482  1.920408e-44  4.261410e-38  0.727255   

        eps  sphericity   W-spher   p-spher  
0  0.848595       False  0.821582  0.023898  
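
The output flags a sphericity violation (`sphericity = False`, `p-spher` ≈ 0.024), which is why a Greenhouse-Geisser corrected p-value is reported. As a sketch of what that correction does, we assume it multiplies both degrees of freedom by the epsilon estimate and re-evaluates the same F statistic; the numbers below are copied from the table above.

```python
from scipy.stats import f

f_value = 476.232482
ddof1, ddof2 = 2, 78
eps = 0.848595  # Greenhouse-Geisser epsilon from the output

# Shrink both degrees of freedom by epsilon
df1_gg = eps * ddof1
df2_gg = eps * ddof2

p_unc = f.sf(f_value, ddof1, ddof2)
p_gg = f.sf(f_value, df1_gg, df2_gg)
print(f"Corrected dof: ({df1_gg:.3f}, {df2_gg:.3f})")
print(f"p uncorrected = {p_unc:.3g}, p Greenhouse-Geisser = {p_gg:.3g}")
```

Fewer effective degrees of freedom make the test more conservative, so the corrected p-value is larger than the uncorrected one, although both are far below 0.05 here.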


In [48]:
# Interpretation (0.1 pt)

# Interpretation template (modify based on your results)
print("Since the p-value is smaller than 0.05, there is a significant difference across weeks.")

Since the p-value is smaller than 0.05, there is a significant difference across weeks.


### 1.2 Post-hoc test and interpretation (0.3 pt)

In [49]:
# Post-hoc pairwise comparisons (0.2 pt)
import pingouin as pg

posthoc = pg.pairwise_tests(
    data=counseling_data,
    dv="PHQ9Score",                      # Dependent variable; Write your answer
    within="Week",                  # Within-subject factor; Write your answer
    subject="PID",                 # Subject identifier; Write your answer
    padjust="bonf",             # Adjustment method for multiple comparisons
    parametric=True             # Use True if you based your test on RM-ANOVA
)

print(posthoc)

  Contrast  A  B  Paired  Parametric          T   dof alternative  \
0     Week  1  2    True        True   2.623424  39.0   two-sided   
1     Week  1  3    True        True  26.654508  39.0   two-sided   
2     Week  2  3    True        True  22.328100  39.0   two-sided   

          p-unc        p-corr p-adjust       BF10    hedges  
0  1.235914e-02  3.707743e-02     bonf      3.404  0.281319  
1  1.209990e-26  3.629971e-26     bonf   2.37e+23  3.527309  
2  7.977015e-24  2.393104e-23     bonf  4.336e+20  3.004842  


# **Interpretation of Post-hoc Results**

- Week 1 vs 2: p = 0.037 (Bonferroni-corrected), g = 0.28 → Slight improvement
- Week 1 vs 3: p < 0.001, g = 3.53 → Big drop
- Week 2 vs 3: p < 0.001, g = 3.00 → Continued drop

➤ **Main change occurred between Week 2 and 3**

### 1.3 Compare the RM-ANOVA and post-hoc results with the Friedman test results (0.2 pt)


RM-ANOVA and the Friedman test (each followed by post-hoc tests) lead to the same conclusion: PHQ-9 scores differ significantly across weeks. The post-hoc tests further confirm that the biggest change occurs between Week 2 and Week 3. However, RM-ANOVA yields a much smaller p-value (1.920408e-44 uncorrected, 4.261410e-38 with the Greenhouse-Geisser correction) than the Friedman test (about 1.08e-14).


### 1.4 Briefly discuss the potential risks of applying statistical tests that do not match the data characteristics. Explain why it is important to choose statistical tests based on the data distribution. (3–5 sentences) (0.2 pt)


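Applying a statistical test whose assumptions do not match the data characteristics can lead to misleading results. In our case, RM-ANOVA assumes approximately normally distributed data, whereas the Friedman test makes no such assumption. When the normality assumption is violated, the RM-ANOVA p-values become unreliable. Checking the data distribution first and choosing the test accordingly is therefore essential for trustworthy conclusions.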

## 2. GEE Modeling and interpretation (1.0 pt)
In this assignment, you will fit a **GEE** (Generalized Estimating Equation) model using `counseling_data.csv`.
We used an **exchangeable correlation** structure and **`PHQ9Score`** (a continuous variable) in the lab. Now you will apply an **AR(1) structure** and use **`PHQ9Binary`** (a binary outcome) instead -- and **interpret how the results change**.

### Guideline
Using the same counseling_data.csv dataset:
- Outcome: PHQ9Binary
  - PHQ9Binary is a binary variable indicating depression risk (1 if PHQ9Score > 10, else 0).
- Predictors
  - Week, Group, and their interaction (Hint: `Week * Group`)
  - `EngagementMinutes`
  - Interaction between Week and EngagementMinutes (Hint: `Week:EngagementMinutes`)
- Cluster ID: PID
- Correlation structure: AR(1)
- Family: Use an appropriate exponential family depending on the outcome variable.
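
To make the difference between the two working correlation structures concrete: exchangeable assumes the same correlation for every pair of measurements from a participant, while AR(1) assumes the correlation decays with the time gap, corr(i, j) = ρ^|i − j|. The sketch below builds both matrices for three weekly measurements. The value ρ = 0.5 is made up for illustration and is not an estimate from `counseling_data.csv`; the function names are likewise illustrative.

```python
def ar1_corr(n_times, rho):
    """AR(1) working correlation: rho raised to the time gap between measurements."""
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]

def exchangeable_corr(n_times, rho):
    """Exchangeable working correlation: the same rho for every off-diagonal pair."""
    return [[1.0 if i == j else rho for j in range(n_times)] for i in range(n_times)]

rho = 0.5  # hypothetical value, for illustration only
ar1 = ar1_corr(3, rho)
exch = exchangeable_corr(3, rho)

print("AR(1):")
for row in ar1:
    print(row)
print("Exchangeable:")
for row in exch:
    print(row)
```

The two matrices differ only in the Week 1 vs Week 3 entry, where AR(1) assumes a weaker correlation because the measurements are two steps apart.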

In [50]:
# Label distribution
counseling_data['PHQ9Binary'].value_counts()

# 0: Non-depressed, 1: Depressed

PHQ9Binary  count
1              78
0              42

### 2.1 GEE Modeling (0.5 pt)

In [51]:
# Write your code here

# Convert Week to integer
counseling_data["Week"] = counseling_data["Week"].astype(int)

# Define AR(1) correlation structure
working_corr = sm.cov_struct.Autoregressive()

# Fit GEE model with AR(1) (0.5 pt)
gee_model = sm.GEE.from_formula(
    "PHQ9Binary ~ Week * Group + EngagementMinutes + Week:EngagementMinutes",  # Model formula
    groups="PID",                     # Cluster ID
    cov_struct=working_corr,          # AR(1) working correlation
    family=sm.families.Binomial(),    # Binomial family for the binary outcome
    data=counseling_data
)

gee_result = gee_model.fit()
print(gee_result.summary())


grid=True will become default in a future version



                               GEE Regression Results                              
Dep. Variable:                  PHQ9Binary   No. Observations:                  120
Model:                                 GEE   No. clusters:                       40
Method:                        Generalized   Min. cluster size:                   3
                      Estimating Equations   Max. cluster size:                   3
Family:                           Binomial   Mean cluster size:                 3.0
Dependence structure:       Autoregressive   Num. iterations:                    13
Date:                     Sun, 06 Apr 2025   Scale:                           1.000
Covariance type:                    robust   Time:                         11:46:29
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  9.1792      3.028      3.032      0

### 2.2 GEE Interpretation (0.5 pt)

- **Voice group** shows a **large** but **non-significant decrease** in the log-odds of depression risk (`PHQ9Binary` = 1) (coefficient β = -1.4273, p = 0.281).
- **Week** shows a **large** but **non-significant decrease** (β = -1.3154, p = 0.281).

- **Engagement minutes** shows a **small** but **significant decrease** (β = -0.0839, p = 0.004).


Because the outcome is binary, these coefficients are on the log-odds scale rather than in PHQ-9 points. The Voice group and Week coefficients are large in magnitude but not statistically significant, with p-values above 0.2.

Engagement minutes is the most statistically significant of the three. Its coefficient is small, but it is expressed per additional minute of engagement.


The results show that the apparent benefit in the voice group could be due to chance.
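
With a Binomial family (logit link by default in statsmodels), the GEE coefficients are log-odds, so exponentiating them gives odds ratios that are easier to read. The sketch below uses the three coefficients quoted in the interpretation above; the dictionary labels are illustrative and do not reproduce the exact term names in the model summary.

```python
import math

# Coefficients quoted in the interpretation above (log-odds scale)
coefficients = {
    "Voice group": -1.4273,
    "Week": -1.3154,
    "EngagementMinutes": -0.0839,
}

# exp(beta) is the multiplicative change in the odds of PHQ9Binary = 1
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

# A per-minute effect compounds: odds ratio for 10 additional minutes
or_10_minutes = math.exp(10 * coefficients["EngagementMinutes"])
print(f"EngagementMinutes (+10 min): odds ratio = {or_10_minutes:.3f}")
```

The per-minute odds ratio is close to 1, but it compounds over longer engagement. Because the model also contains interaction terms, these main-effect odds ratios should be read as conditional on the other interacting variables being at their reference value or zero.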


## 3. GLMM Modeling and Interpretation (1.0 pt)
In this section, you will learn how to build a **Generalized Linear Mixed Model (GLMM)** using real-world repeated measures data and interpret the results.

We’ll work with `esm_data.csv`, a dataset collected using the **Experience Sampling Method (ESM)**. Each participant (identified by `PID`) has reported their stress levels, along with contextual and behavioral data.


### Guideline
Fit a **GLMM** where:
- **Outcome**: `StressScore`
- **Fixed effects**: `University`, `Gender`, `Location`, `ScreenTime_scaled` and `StepCount_scaled`
- **Random effect**: `PID` (participant ID)


### 3.1 GLMM Modeling (0.5 pt)

In [52]:
# Load the dataset
esm_data = pd.read_csv("/content/IoTDS/esm_data.csv")

In [53]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
esm_data['ScreenTime_scaled'] = scaler.fit_transform(esm_data['ScreenTimeSumBefore1Hr'].values.reshape(-1, 1))
esm_data['StepCount_scaled'] = scaler.fit_transform(esm_data['StepCountSumBefore1Hr'].values.reshape(-1, 1))

In [54]:
# Write your code here

model = smf.mixedlm("StressScore ~ University + Gender + Location + ScreenTime_scaled + StepCount_scaled",                        # Write your equation here
                esm_data,
                groups=esm_data["PID"])

result = model.fit()
print(result.summary())

            Mixed Linear Model Regression Results
Model:               MixedLM  Dependent Variable:  StressScore
No. Observations:    397      Method:              REML       
No. Groups:          32       Scale:               3.8247     
Min. group size:     7        Log-Likelihood:      -869.1093  
Max. group size:     19       Converged:           Yes        
Mean group size:     12.4                                     
--------------------------------------------------------------
                    Coef.  Std.Err.   z    P>|z| [0.025 0.975]
--------------------------------------------------------------
Intercept           16.817    0.642 26.193 0.000 15.559 18.076
University[T.B]      3.291    0.740  4.446 0.000  1.840  4.742
Gender[T.male]       2.174    0.741  2.933 0.003  0.721  3.627
Location[T.outside]  0.042    0.479  0.087 0.930 -0.897  0.980
Location[T.work]     0.036    0.245  0.145 0.885 -0.445  0.516
ScreenTime_scaled    1.321    0.103 12.856 0.000  1.120  1.523
StepC

### 3.2 GLMM Interpretation (0.5 pt)
- Interpret how Location affects stress.
- Are there significant differences by Gender or University?

> **Note**
> Although `StressScore` is not perfectly normally distributed, we will use the **Gaussian family** for simplicity. Continuous outcomes are easier to model with Gaussian assumptions in Python, and LMMs are fairly robust to slight non-normality.

Location does not appear to affect stress: both location coefficients are below 0.05, and neither is statistically significant (p-values above 0.8).

The male gender coefficient is 2.174 (p = 0.003), which means male participants report stress scores about 2.17 points higher on average than the reference gender, a statistically significant difference.

University shows a similar pattern: University B has a coefficient of 3.291 (p < 0.001).

The only effect that reduces stress is step count, with a coefficient of -1.166.
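
Because `ScreenTime_scaled` and `StepCount_scaled` were standardized with `StandardScaler`, their coefficients describe the change in `StressScore` per one standard deviation of the raw variable, not per raw unit. The sketch below reproduces the standardization in plain Python and converts the `ScreenTime_scaled` coefficient (1.321, from the summary above) back to a per-unit effect. The raw screen-time values are made up for illustration, so the printed per-unit number does not apply to `esm_data.csv`.

```python
import statistics

# Hypothetical raw screen-time values (not from esm_data.csv)
raw_screen_time = [120.0, 300.0, 45.0, 600.0, 210.0]

# StandardScaler subtracts the mean and divides by the population SD (ddof = 0)
mean = statistics.fmean(raw_screen_time)
sd = statistics.pstdev(raw_screen_time)
scaled = [(x - mean) / sd for x in raw_screen_time]

coef_per_sd = 1.321                # ScreenTime_scaled coefficient from the model summary
coef_per_unit = coef_per_sd / sd   # effect of one raw unit under the made-up data

print(f"Mean = {mean:.1f}, SD = {sd:.1f}")
print(f"Effect per 1 SD: {coef_per_sd:.3f}")
print(f"Effect per raw unit (hypothetical data): {coef_per_unit:.5f}")
```

Standardizing also puts screen time and step count on a common scale, which is what makes the magnitudes of their coefficients comparable to each other.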