# 📊 Data Comparison: Training vs Production

**Purpose:** Detect data drift between training and production datasets.

**Why It Matters:**
- The model learned the distribution of its training data.
- Production data may follow a different distribution.
- When the two diverge (drift), predictions may become unreliable.

**What We Check:**
1. Statistical differences (mean, std)
2. Distribution differences (range, percentiles)
3. Per-sensor drift
4. Data quality issues

In [1]:
import pandas as pd

In [7]:
with_activities = pd.read_csv("all_users_data_labeled.csv")



In [8]:
with_activities.head()

Unnamed: 0,timestamp,Ax_w,Ay_w,Az_w,Gx_w,Gy_w,Gz_w,activity,User
0,2005-05-01 00:04:59.000,8.500479,-4.88562,-1.247403,-18.151813,1.957602,-0.36204,ear_rubbing,1
1,2005-05-01 00:04:59.020,8.62877,-4.715455,-1.247403,-20.759232,4.337617,-1.517069,ear_rubbing,1
2,2005-05-01 00:04:59.040,8.571752,-4.521656,-0.966979,-18.816491,4.285163,-1.692155,ear_rubbing,1
3,2005-05-01 00:04:59.060,8.581255,-4.436574,-0.962144,-16.541813,4.180102,-1.74454,ear_rubbing,1
4,2005-05-01 00:04:59.080,8.310418,-4.394033,-0.957309,-22.858267,3.0428,0.792587,ear_rubbing,1


In [11]:
without_activities = pd.read_csv("sensor_fused_50Hz.csv")
without_activities.head()

Unnamed: 0,timestamp_ms,timestamp_iso,Ax,Ay,Az,Gx,Gy,Gz
0,1742802739780,2025-03-24T07:52:19.780000Z,-95.41802,-1240.532,-1086.742,39.15614,-7.520049,7.881836
1,1742802739800,2025-03-24T07:52:19.800000Z,-159.8373,-1364.8455,-883.68655,27.50125,-10.19752,10.262
2,1742802739820,2025-03-24T07:52:19.820000Z,-161.2903,-1311.844,-529.8176,1.041086,-4.790063,2.981754
3,1742802739840,2025-03-24T07:52:19.840000Z,-123.02625,-1157.1745,-302.61215,-21.76137,-7.012552,0.549316
4,1742802739860,2025-03-24T07:52:19.860000Z,-149.6658,-806.8806,-136.52045,-44.00368,-6.960001,-3.230397


In [12]:
with_activities.describe()


Unnamed: 0,Ax_w,Ay_w,Az_w,Gx_w,Gy_w,Gz_w,User
count,385326.0,385326.0,385326.0,385326.0,385326.0,385326.0,385326.0
mean,3.21858,1.282084,-3.528949,0.599312,0.225205,0.088668,3.529199
std,6.568341,4.35147,3.236169,49.930271,14.811729,14.166785,1.712415
min,-26.846118,-16.560816,-45.230434,-818.951569,-302.422165,-242.407539,1.0
25%,-1.531575,-1.586314,-6.077462,-6.11384,-2.717356,-2.910012,2.0
50%,6.381298,2.209306,-2.95412,0.672039,0.235012,0.060095,4.0
75%,8.757061,4.118933,-1.503652,7.454204,3.034424,2.979887,5.0
max,31.954008,41.857702,24.150296,835.17215,206.932671,231.835464,6.0


In [13]:
without_activities.describe()

Unnamed: 0,timestamp_ms,Ax,Ay,Az,Gx,Gy,Gz
count,181699.0,181673.0,181673.0,181673.0,181673.0,181673.0,181673.0
mean,1742805000000.0,-16.234198,-19.022889,-1001.556902,0.492359,0.150784,0.131591
std,1049043.0,11.315276,31.041663,19.924188,5.332085,4.716818,1.873494
min,1742803000000.0,-1455.48775,-1924.73735,-2833.4155,-800.86145,-584.833433,-143.26595
25%,1742804000000.0,-17.921175,-21.007985,-1002.9575,0.446533,0.127297,0.112309
50%,1742805000000.0,-16.46811,-19.241257,-1001.9715,0.498668,0.145051,0.1466
75%,1742805000000.0,-15.015045,-14.74413,-1000.986,0.550757,0.179977,0.180059
max,1742806000000.0,955.633,1263.5635,390.8329,448.9011,582.54495,253.45925
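
The two files name their sensor columns differently (`Ax_w` ... `Gz_w` in training, `Ax` ... `Gz` in production), so the columns need to be aligned before they can be compared. The following is a minimal sketch, assuming the `_w` suffix is the only naming difference; the helper name `side_by_side_summary` is hypothetical. Note also that the accelerometer summaries above differ by orders of magnitude (for example, an `Az` mean of about -3.53 in training versus about -1001.56 in production). This may indicate that the two files use different units, but that is an inference from the numbers, not something the data files state.

```python
import pandas as pd

SENSORS = ["Ax", "Ay", "Az", "Gx", "Gy", "Gz"]

def side_by_side_summary(train: pd.DataFrame, prod: pd.DataFrame) -> pd.DataFrame:
    """Return mean/std/min/max per sensor for both datasets in one table."""
    # Assumption: training columns carry a "_w" suffix, production columns do not.
    train_aligned = train.rename(columns={f"{s}_w": s for s in SENSORS})[SENSORS]
    prod_aligned = prod[SENSORS]
    stats = ["mean", "std", "min", "max"]
    # Rows are sensors; columns are (dataset, statistic) pairs.
    return pd.concat(
        {"train": train_aligned.agg(stats).T, "prod": prod_aligned.agg(stats).T},
        axis=1,
    )
```

With the frames loaded above, this would be called as `side_by_side_summary(with_activities, without_activities)`.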


## 📖 Understanding Data Drift Parameters

### What is Data Drift?

**Definition:** When production data distribution differs significantly from training data.

**Example:**
```
Training: Users aged 20-30, flat walking
Production: Users aged 60-70, stairs
→ Different patterns! Model might fail.
```

### Key Drift Metrics

#### 1. Mean Drift
**Formula:**
```python
# Absolute difference between the per-sensor means
drift_mean = abs(production_mean - training_mean)
```

**Threshold:** drift > 0.1 (in the sensor's own units) indicates a significant shift

**Example:**
```
Training Az mean: 9.81 m/s²
Production Az mean: 10.2 m/s²
Drift: |10.2 - 9.81| = 0.39 ⚠️ HIGH DRIFT
```

**What it means:**
- Data is systematically shifted (bias)
- Sensors might be calibrated differently
- Different user behavior patterns
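
As a worked example, the mean-drift check can be applied to the gyroscope means reported by the `describe()` outputs earlier in this notebook. This is a sketch rather than a definitive implementation; the function name `mean_drift` is hypothetical. The accelerometer columns are left out because their scales differ so much between the two files that an absolute threshold of 0.1 would not be informative.

```python
MEAN_DRIFT_THRESHOLD = 0.1  # threshold stated in this section

def mean_drift(training_mean: float, production_mean: float) -> float:
    """Absolute difference between the production and training means."""
    return abs(production_mean - training_mean)

# Gyroscope means copied from the describe() outputs: (training, production)
gyro_means = {
    "Gx": (0.599312, 0.492359),
    "Gy": (0.225205, 0.150784),
    "Gz": (0.088668, 0.131591),
}

results = {}
for sensor, (train_mean, prod_mean) in gyro_means.items():
    drift = mean_drift(train_mean, prod_mean)
    is_high = drift > MEAN_DRIFT_THRESHOLD
    results[sensor] = (round(drift, 3), is_high)
    print(f"{sensor}: drift={drift:.3f} high={is_high}")
```

On these numbers only `Gx` (drift of about 0.107) crosses the 0.1 threshold, and only narrowly.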

#### 2. Standard Deviation Drift
**Formula:**
```python
# Absolute difference between the per-sensor standard deviations
drift_std = abs(production_std - training_std)
```

**Threshold:** drift > 0.15 (in the sensor's own units) indicates a change in variability

**Example:**
```
Training Az std: 0.50
Production Az std: 0.85
Drift: |0.85 - 0.50| = 0.35 ⚠️ HIGH DRIFT
```

**What it means:**
- Data variability has changed
- More/less dynamic movements
- Different activity intensity
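
A minimal sketch of the standard-deviation check on raw samples follows. Two details matter when reproducing the numbers that pandas reports: pandas computes the sample standard deviation (`ddof=1`) while NumPy defaults to `ddof=0`, and the production `describe()` output shows fewer sensor values than timestamps, so missing samples have to be skipped. The function name `std_drift` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

STD_DRIFT_THRESHOLD = 0.15  # threshold stated in this section

def std_drift(training_values, production_values, ddof: int = 1) -> float:
    """Absolute difference between standard deviations, ignoring NaNs.

    ddof=1 matches the pandas default (the value shown by describe());
    NumPy's own default is ddof=0.
    """
    train_std = np.nanstd(np.asarray(training_values, dtype=float), ddof=ddof)
    prod_std = np.nanstd(np.asarray(production_values, dtype=float), ddof=ddof)
    return float(abs(prod_std - train_std))

# Synthetic data for illustration only (not taken from the notebook's files)
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=0.50, size=10_000)
prod = rng.normal(loc=0.0, scale=0.85, size=10_000)
drift = std_drift(train, prod)
print(f"std drift = {drift:.2f}, high = {drift > STD_DRIFT_THRESHOLD}")
```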

#### 3. Range Drift
**Check if production values exceed training range:**

**Example:**
```
Training range: [-12, 15]
Production range: [-18, 25]
→ Production has values outside training!
```

**Risk:** The model has to extrapolate beyond the values it saw during training, so its predictions may be unreliable.
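
The range check can be sketched as follows. Besides asking whether any production value falls outside the training range, it reports the fraction that does, which helps separate a few outliers from a wholesale shift. Reporting a fraction is an addition for illustration, and the function name `out_of_range_fraction` is hypothetical.

```python
import numpy as np

def out_of_range_fraction(training_values, production_values) -> float:
    """Fraction of production samples outside the training [min, max] range."""
    train = np.asarray(training_values, dtype=float)
    prod = np.asarray(production_values, dtype=float)
    prod = prod[~np.isnan(prod)]  # ignore missing samples
    if prod.size == 0:
        return 0.0
    lo, hi = np.nanmin(train), np.nanmax(train)
    outside = (prod < lo) | (prod > hi)
    return float(outside.mean())

# Training spans [-12, 15]; two of the four production samples fall outside it
print(out_of_range_fraction([-12, 0, 15], [-18, -5, 3, 25]))  # 0.5
```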

#### 4. Per-Sensor Drift

Check each sensor separately:

| Sensor | Train Mean | Prod Mean | Drift | Status |
|--------|-----------|-----------|-------|--------|
| Ax | 0.12 | 0.17 | 0.05 | ✅ OK |
| Ay | -0.08 | -0.05 | 0.03 | ✅ OK |
| Az | 9.81 | 10.2 | 0.39 | ⚠️ HIGH |
| Gx | 0.05 | 0.13 | 0.08 | ✅ OK |
| Gy | -0.02 | 0.50 | 0.52 | ⚠️ HIGH |
| Gz | 0.01 | 0.10 | 0.09 | ✅ OK |
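
A table like the one above can be built with pandas from two Series of per-sensor means, for example `df[sensors].mean()` on each dataset once the column names are aligned. This is a sketch under that assumption; the function name `per_sensor_drift` is hypothetical, and the values below are the illustrative ones from the table, not measurements.

```python
import pandas as pd

MEAN_DRIFT_THRESHOLD = 0.1  # mean-drift threshold used in this notebook

def per_sensor_drift(train_means: pd.Series, prod_means: pd.Series) -> pd.DataFrame:
    """Build a per-sensor table of means, absolute drift, and status."""
    table = pd.DataFrame({"train_mean": train_means, "prod_mean": prod_means})
    table["drift"] = (table["prod_mean"] - table["train_mean"]).abs()
    is_high = table["drift"] > MEAN_DRIFT_THRESHOLD
    table["status"] = is_high.map({True: "HIGH", False: "OK"})
    return table

# Illustrative means taken from the table above
train_means = pd.Series(
    {"Ax": 0.12, "Ay": -0.08, "Az": 9.81, "Gx": 0.05, "Gy": -0.02, "Gz": 0.01}
)
prod_means = pd.Series(
    {"Ax": 0.17, "Ay": -0.05, "Az": 10.2, "Gx": 0.13, "Gy": 0.50, "Gz": 0.10}
)
drift_table = per_sensor_drift(train_means, prod_means)
print(drift_table.round(2))
```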

### Drift Thresholds Summary

| Metric | Threshold | Action |
|--------|-----------|--------|
| **Mean drift** | > 0.1 | ⚠️ Investigate |
| **Std drift** | > 0.15 | ⚠️ Check variability |
| **Range outside** | Any | ⚠️ Extrapolation risk |
| **KS p-value** | < 0.05 | ⚠️ Different distributions |
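
The KS row refers to the two-sample Kolmogorov-Smirnov test, which compares the full empirical distributions of two samples rather than a single statistic. A minimal sketch using `scipy.stats.ks_2samp` follows; the wrapper name `ks_drift` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

KS_ALPHA = 0.05  # p-value threshold from the summary table

def ks_drift(training_values, production_values, alpha: float = KS_ALPHA):
    """Two-sample KS test; returns (statistic, p_value, drifted)."""
    train = np.asarray(training_values, dtype=float)
    prod = np.asarray(production_values, dtype=float)
    # Drop missing samples before testing
    train, prod = train[~np.isnan(train)], prod[~np.isnan(prod)]
    result = ks_2samp(train, prod)
    p_value = float(result.pvalue)
    return float(result.statistic), p_value, p_value < alpha

# Synthetic data for illustration only
rng = np.random.default_rng(0)
baseline = rng.normal(size=2_000)
same = ks_drift(baseline, baseline)           # identical samples
shifted = ks_drift(baseline, baseline + 0.5)  # mean shifted by 0.5
print(same[2], shifted[2])
```

With sample sizes in the hundreds of thousands, as in this notebook, even very small distribution differences tend to give p-values below 0.05, so the KS statistic itself is worth reading alongside the p-value.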

### What to Do When Drift Is Detected

✅ **Low drift (< thresholds):** Proceed with inference

⚠️ **Medium drift (near thresholds):**
- Monitor predictions closely
- Validate on sample data
- Check prediction confidence

❌ **High drift (>> thresholds):**
- **Investigate cause:**
  - Different users?
  - Different devices?
  - Different activities?
- **Actions:**
  - Collect more training data from production
  - Retrain model with combined data
  - Recalibrate sensors
  - Adjust preprocessing

### Example Interpretation

**Scenario:**
```
Ax drift: 0.05  ✅ OK
Ay drift: 0.03  ✅ OK
Az drift: 0.45  ❌ HIGH
Gx drift: 0.08  ✅ OK
Gy drift: 0.52  ❌ HIGH
Gz drift: 0.09  ✅ OK
```

**Analysis:**
- Accelerometer Z-axis drifted (maybe gravity sensor issue)
- Gyroscope Y-axis drifted (maybe rotation pattern different)
- Other sensors OK

**Action:**
- Check Az calibration
- Investigate Gy activities (turning motions?)
- Consider retraining model
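
The scenario above can be checked mechanically against the mean-drift threshold of 0.1. The sketch below also maps each value to the low/medium/high levels described earlier. The notebook only says "near thresholds" and ">> thresholds" without giving numbers, so the `high_factor` cutoff (here, twice the threshold) and the function name `classify_drift` are hypothetical choices, not part of the original method.

```python
def classify_drift(drift: float, threshold: float, high_factor: float = 2.0) -> str:
    """Map a drift value to 'low', 'medium', or 'high'.

    high_factor is a hypothetical cutoff: values at or above
    high_factor * threshold are treated as high drift.
    """
    if drift < threshold:
        return "low"
    if drift < high_factor * threshold:
        return "medium"
    return "high"

# Drift values from the scenario above
scenario = {"Ax": 0.05, "Ay": 0.03, "Az": 0.45, "Gx": 0.08, "Gy": 0.52, "Gz": 0.09}
levels = {sensor: classify_drift(d, threshold=0.1) for sensor, d in scenario.items()}
flagged = [sensor for sensor, level in levels.items() if level != "low"]
print(flagged)  # ['Az', 'Gy']
```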