# Modeling Human Activity States Using Hidden Markov Models

This notebook walks through an end-to-end Hidden Markov Model (HMM) pipeline for human activity recognition from accelerometer and gyroscope data.

## Project Overview

We will:
1. Generate synthetic sensor data mimicking real smartphone sensors
2. Extract time-domain and frequency-domain features
3. Implement and train an HMM
4. Evaluate the model performance
5. Visualize results and analyze findings

**Activities to classify:**
- Standing
- Walking  
- Jumping
- Still (no movement)

## 1. Setup and Imports

In [None]:
# Import required libraries
import sys
import os
sys.path.append('../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Import our custom modules
from data_collection import ActivityDataGenerator, load_real_sensor_data
from feature_extraction import FeatureExtractor
from hmm_model import ActivityHMM, create_evaluation_table

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("Setup complete!")

## 2. Data Collection and Generation

Since we're demonstrating the methodology, we'll generate synthetic sensor data that mimics real accelerometer and gyroscope readings from smartphone sensors.

**Note:** In a real implementation, you would use the Sensor Logger app or similar to collect actual sensor data.

In [None]:
# Initialize data generator
sampling_rate = 50  # Hz
generator = ActivityDataGenerator(sampling_rate=sampling_rate)

print(f"Sampling rate: {sampling_rate} Hz")
print(f"Activities to model: {generator.activities}")
print(f"\nActivity parameters:")
for activity, params in generator.activity_params.items():
    print(f"  {activity}: frequency={params['frequency']} Hz, acc_std={params['acc_std']}")

In [None]:
# Generate dataset
print("Generating synthetic sensor data...")
dataset = generator.generate_dataset(
    samples_per_activity=12,  # 12 samples per activity
    duration_per_sample=8.0   # 8 seconds per sample
)

print(f"\nDataset shape: {dataset.shape}")
print(f"Columns: {list(dataset.columns)}")
print(f"\nActivity distribution:")
print(dataset['activity'].value_counts())

# Save dataset
dataset_path = '../data/synthetic_activity_data.csv'
generator.save_dataset(dataset, dataset_path)
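The internals of `ActivityDataGenerator` are not shown in this notebook. As a rough illustration of how a `frequency` and `acc_std` pair could turn into a signal, the sketch below adds a sinusoid, Gaussian noise, and a constant gravity offset. The function `synth_accel`, its `amplitude` parameter, and the choice of axes are assumptions made for this example, not the generator's actual method.

```python
import numpy as np

def synth_accel(frequency_hz, acc_std, duration_s=8.0, sampling_rate=50,
                amplitude=1.0, seed=0):
    """Hypothetical generator: periodic motion + Gaussian noise + gravity."""
    rng = np.random.default_rng(seed)
    n = int(duration_s * sampling_rate)          # number of samples
    t = np.arange(n) / sampling_rate             # time axis in seconds
    acc = rng.normal(0.0, acc_std, size=(n, 3))  # sensor noise on x, y, z
    acc[:, 0] += amplitude * np.sin(2 * np.pi * frequency_hz * t)  # motion on x
    acc[:, 2] += 9.81                            # gravity on z (assumed orientation)
    return t, acc

t, acc = synth_accel(frequency_hz=2.0, acc_std=0.3)
print(acc.shape)  # (400, 3): 8 seconds at 50 Hz, three axes
```

A low-motion activity would use a small `amplitude` and `acc_std`, while an energetic one would use larger values of both.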

### Visualize Sample Sensor Data

In [None]:
# Visualize sensor data for each activity
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.flatten()

for i, activity in enumerate(generator.activities):
    # Get sample data for this activity
    activity_data = dataset[dataset['activity'] == activity]
    first_sample = activity_data['sample_id'].iloc[0]
    sample_data = activity_data[activity_data['sample_id'] == first_sample]
    
    # Create time axis
    time_seconds = np.arange(len(sample_data)) / sampling_rate
    
    # Plot accelerometer magnitude
    acc_magnitude = np.sqrt(sample_data['acc_x']**2 + sample_data['acc_y']**2 + sample_data['acc_z']**2)
    axes[i].plot(time_seconds, acc_magnitude, label='Acc Magnitude', linewidth=2)
    
    # Plot gyroscope magnitude
    gyro_magnitude = np.sqrt(sample_data['gyro_x']**2 + sample_data['gyro_y']**2 + sample_data['gyro_z']**2)
    axes[i].plot(time_seconds, gyro_magnitude, label='Gyro Magnitude', linewidth=2)
    
    axes[i].set_title(f'{activity.title()} Activity')
    axes[i].set_xlabel('Time (seconds)')
    axes[i].set_ylabel('Magnitude')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Sensor Data by Activity Type', y=1.02, fontsize=16)
plt.show()

## 3. Feature Extraction

We extract both time-domain and frequency-domain features from sliding windows of sensor data.

In [None]:
# Initialize feature extractor
window_size = 2.0  # 2 second windows
overlap = 0.5      # 50% overlap

extractor = FeatureExtractor(
    window_size=window_size,
    overlap=overlap,
    sampling_rate=sampling_rate
)

print(f"Window size: {window_size} seconds ({extractor.window_samples} samples)")
print(f"Overlap: {overlap*100}%")
print(f"Step size: {extractor.step_size} samples")

# Show feature names that will be extracted
feature_names = extractor.get_feature_names()
print(f"\nTotal features to extract: {len(feature_names)}")
print(f"\nSample feature names:")
for name in feature_names[:10]:
    print(f"  {name}")
print("  ...")
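The window arithmetic can be checked by hand. The sketch below assumes the step size is `window_samples * (1 - overlap)` and that a trailing partial window is dropped; whether `FeatureExtractor` follows exactly these rules is an assumption. The helper `window_layout` is hypothetical.

```python
def window_layout(n_samples, window_size_s, overlap, sampling_rate):
    """Return (window_samples, step, start indices) for a sliding window."""
    window_samples = int(round(window_size_s * sampling_rate))
    step = int(round(window_samples * (1.0 - overlap)))
    if step <= 0:
        raise ValueError("overlap must be less than 1")
    if n_samples < window_samples:
        n_windows = 0  # recording too short for even one full window
    else:
        n_windows = (n_samples - window_samples) // step + 1
    starts = [i * step for i in range(n_windows)]
    return window_samples, step, starts

# An 8-second recording at 50 Hz has 400 samples
window_samples, step, starts = window_layout(400, 2.0, 0.5, 50)
print(window_samples, step, len(starts))  # 100 50 7
```

Under these assumptions each 8-second recording yields 7 windows, so 12 recordings per activity would give 84 feature rows per activity.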

In [None]:
# Extract features from the dataset
print("Extracting features from sensor data...")
features_df = extractor.extract_features_from_dataset(dataset, include_labels=True)

print(f"\nFeature extraction complete!")
print(f"Features shape: {features_df.shape}")
print(f"\nFeature windows per activity:")
print(features_df['activity'].value_counts())

# Save features
features_path = '../data/extracted_features.csv'
features_df.to_csv(features_path, index=False)
print(f"\nFeatures saved to {features_path}")
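To make the two feature families concrete, here is a minimal sketch that computes a few time-domain statistics and one frequency-domain feature for a single accelerometer window. The signal magnitude area is taken as the mean of |x| + |y| + |z|, which is a common definition but only an assumption about what `acc_sma` means in `FeatureExtractor`. The name `acc_dominant_freq` is hypothetical.

```python
import numpy as np

def window_features(acc, sampling_rate):
    """acc: array of shape (n_samples, 3) holding x, y, z for one window."""
    acc = np.asarray(acc, dtype=float)
    feats = {}
    # Time domain: per-axis mean and standard deviation
    for i, axis in enumerate("xyz"):
        feats[f"acc_{axis}_mean"] = acc[:, i].mean()
        feats[f"acc_{axis}_std"] = acc[:, i].std()
    # Signal magnitude area: mean of summed absolute values (assumed definition)
    feats["acc_sma"] = np.abs(acc).sum(axis=1).mean()
    magnitude = np.linalg.norm(acc, axis=1)
    feats["acc_magnitude_mean"] = magnitude.mean()
    # Frequency domain: strongest non-DC component of the magnitude signal
    centered = magnitude - magnitude.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / sampling_rate)
    feats["acc_dominant_freq"] = freqs[np.argmax(spectrum)]
    return feats

# Test signal: gravity on z plus a 2 Hz oscillation, one 2-second window
sampling_rate = 50
t = np.arange(100) / sampling_rate
acc = np.zeros((100, 3))
acc[:, 2] = 9.81 + np.sin(2 * np.pi * 2.0 * t)
feats = window_features(acc, sampling_rate)
print(feats["acc_dominant_freq"])  # close to 2.0 Hz
```

With a 2-second window at 50 Hz the frequency resolution is 0.5 Hz, which bounds how finely step or jump rates can be told apart.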

### Visualize Feature Distributions

In [None]:
# Select important features for visualization
important_features = [
    'acc_x_mean', 'acc_y_mean', 'acc_z_mean',
    'acc_x_std', 'acc_y_std', 'acc_z_std',
    'gyro_x_std', 'gyro_y_std', 'gyro_z_std',
    'acc_sma', 'gyro_sma', 'acc_magnitude_mean'
]

# Create feature distribution plots
n_features = len(important_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
axes = axes.flatten()

for i, feature in enumerate(important_features):
    if feature in features_df.columns:
        sns.boxplot(data=features_df, x='activity', y=feature, ax=axes[i])
        axes[i].set_title(f'{feature}')
        axes[i].tick_params(axis='x', rotation=45)
    else:
        axes[i].set_visible(False)

# Hide unused subplots
for i in range(len(important_features), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.suptitle('Feature Distributions by Activity', y=1.02, fontsize=16)
plt.show()

## 4. Data Splitting

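Split the data into training and testing sets by `sample_id`, so that all windows from one recording land in the same set. Because adjacent windows overlap by 50%, splitting at the window level would leak nearly identical data into the test set.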

In [None]:
# Split data by sample_id to ensure no data leakage
unique_samples = features_df['sample_id'].unique()
print(f"Total unique samples: {len(unique_samples)}")

# Split samples by activity to ensure balanced train/test split
train_samples = []
test_samples = []

for activity in generator.activities:
    activity_samples = [s for s in unique_samples if s.startswith(activity)]
    print(f"{activity}: {len(activity_samples)} samples")
    
    # Use 80% for training, 20% for testing
    n_train = int(0.8 * len(activity_samples))
    train_samples.extend(activity_samples[:n_train])
    test_samples.extend(activity_samples[n_train:])

print(f"\nTrain samples: {len(train_samples)}")
print(f"Test samples: {len(test_samples)}")

# Create train and test dataframes
train_df = features_df[features_df['sample_id'].isin(train_samples)].copy()
test_df = features_df[features_df['sample_id'].isin(test_samples)].copy()

print(f"\nTraining set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nTraining set activity distribution:")
print(train_df['activity'].value_counts())
print(f"\nTest set activity distribution:")
print(test_df['activity'].value_counts())
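The cell above takes the first 80% of each activity's samples in their existing order. A variant that shuffles with a fixed seed before splitting, and then checks for leakage, might look like the following sketch. The helper `split_sample_ids` and the `activity_NN` identifier format are assumptions for illustration.

```python
import random

def split_sample_ids(sample_ids_by_activity, train_fraction=0.8, seed=42):
    """Stratified split of sample IDs: shuffle within each activity, then cut."""
    rng = random.Random(seed)
    train, test = [], []
    for activity, ids in sample_ids_by_activity.items():
        ids = sorted(ids)      # fixed starting order for reproducibility
        rng.shuffle(ids)
        n_train = int(train_fraction * len(ids))
        train.extend(ids[:n_train])
        test.extend(ids[n_train:])
    return train, test

activities = ["standing", "walking", "jumping", "still"]
ids = {a: [f"{a}_{i:02d}" for i in range(12)] for a in activities}
train_ids, test_ids = split_sample_ids(ids)

# No recording may appear on both sides of the split
assert not set(train_ids) & set(test_ids)
print(len(train_ids), len(test_ids))  # 36 12
```

With 12 recordings per activity, `int(0.8 * 12)` gives 9 training and 3 test recordings per activity.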

## 5. Hidden Markov Model Implementation

Now we implement and train our HMM model for activity recognition.

In [None]:
# Initialize HMM model
n_states = 4  # One state per activity
hmm_model = ActivityHMM(
    n_states=n_states,
    covariance_type="full",
    random_state=42
)

print(f"HMM Model Configuration:")
print(f"  Number of states: {n_states}")
print(f"  Activities: {hmm_model.activities}")
print(f"  State mapping: {hmm_model.state_to_activity}")

In [None]:
# Train the HMM model
print("Training HMM model...")
hmm_model.fit(train_df)

print("\nTraining completed!")

### Visualize Transition Matrix

In [None]:
# Visualize the learned transition matrix
hmm_model.visualize_transition_matrix()
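How `ActivityHMM.fit` estimates the transition matrix is not shown here. When state labels are available, one common approach is to count transitions between consecutive windows and normalize each row. The sketch below illustrates that idea only; the function `estimate_transitions` and the `smoothing` constant are assumptions.

```python
import numpy as np

def estimate_transitions(sequences, n_states, smoothing=1e-3):
    """Row-normalized transition counts from labeled state sequences."""
    # Small additive smoothing keeps unseen transitions from being exactly zero
    counts = np.full((n_states, n_states), smoothing)
    for seq in sequences:
        for current, following in zip(seq[:-1], seq[1:]):
            counts[current, following] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Two short labeled sequences over three states
seqs = [[0, 0, 0, 1, 1], [1, 1, 2, 2, 2]]
A = estimate_transitions(seqs, n_states=3)
print(A.round(3))
```

If every recording contains a single activity, counting within recordings produces an almost diagonal matrix, because no activity changes are ever observed. Cross-activity transitions can only be learned from sequences that actually contain them.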

## 6. Model Evaluation

Evaluate the model on unseen test data and calculate performance metrics.

In [None]:
# Evaluate on test data
print("Evaluating model on test data...")
metrics = hmm_model.evaluate(test_df)

print(f"\nOverall Accuracy: {metrics['overall_accuracy']:.3f}")
print(f"Mean Log Probability: {metrics['mean_log_probability']:.3f}")

# Create evaluation table as required by project
eval_table = create_evaluation_table(metrics, hmm_model.activities)
print("\n=== Evaluation Results Table ===")
print(eval_table.round(3))

In [None]:
# Get detailed predictions for analysis
predicted_states, log_probs = hmm_model.predict(test_df)
true_states = test_df['activity'].map(hmm_model.activity_to_state).values

# Convert back to activity names for confusion matrix
predicted_activities = [hmm_model.state_to_activity[state] for state in predicted_states]
true_activities = [hmm_model.state_to_activity[state] for state in true_states]

# Create confusion matrix
cm = confusion_matrix(true_activities, predicted_activities, labels=hmm_model.activities)
cm_df = pd.DataFrame(cm, index=hmm_model.activities, columns=hmm_model.activities)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues', cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Test Set')
plt.xlabel('Predicted Activity')
plt.ylabel('True Activity')
plt.show()

# Print classification report
print("\n=== Detailed Classification Report ===")
print(classification_report(true_activities, predicted_activities))
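The per-activity numbers in the classification report can be recovered directly from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); `per_class_metrics` is a hypothetical helper and the matrix is a made-up example, not a result from this notebook.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, F1 per class from a confusion matrix (rows = true)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Illustrative two-class matrix
example_cm = [[8, 2],
              [1, 9]]
precision, recall, f1 = per_class_metrics(example_cm)
print(precision.round(3), recall.round(3), f1.round(3))
```

Classes that are never predicted get a precision of 0 here instead of raising a division error.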

### Visualize Predictions

In [None]:
# Visualize predictions for a few test samples
test_sample_ids = test_df['sample_id'].unique()[:4]  # First 4 test samples

for sample_id in test_sample_ids:
    print(f"\nVisualizing predictions for sample: {sample_id}")
    hmm_model.visualize_predictions(test_df, sample_id=sample_id)
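State sequences in an HMM are commonly decoded with the Viterbi algorithm, which finds the single most likely path given the transition and emission probabilities. Whether `ActivityHMM.predict` uses Viterbi is an assumption. The sketch below is a generic log-domain implementation with made-up probabilities, showing how sticky transitions smooth out an isolated noisy window.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path. log_emit has shape (T, K): log p(obs_t | state k)."""
    T, K = log_emit.shape
    delta = np.empty((T, K))               # best log score of a path ending in k
    backptr = np.zeros((T, K), dtype=int)  # best predecessor of state k at time t
    delta[0] = log_start + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # scores[i, j]: from i to j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):         # follow back-pointers to the start
        path[t] = backptr[t + 1, path[t + 1]]
    return path, delta[-1].max()

start = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1],
                [0.1, 0.9]])
emit = np.log([[0.9, 0.1], [0.9, 0.1], [0.4, 0.6],
               [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
path, log_prob = viterbi(start, trans, emit)
print(path.tolist())                 # [0, 0, 0, 0, 1, 1]
print(emit.argmax(axis=1).tolist())  # [0, 0, 1, 0, 1, 1] without the HMM
```

The third window slightly favors state 1 on its own, but the decoded path keeps it in state 0 because switching twice costs more than the weak emission evidence gains.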

## 7. Analysis and Insights

Let's analyze the model performance and extract insights.

In [None]:
# Analyze transition probabilities
transition_matrix = hmm_model.model.transmat_
print("=== Transition Matrix Analysis ===")
print("\nMost likely transitions (probability > 0.1):")

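# Look up names through state_to_activity instead of assuming that
# state index i corresponds to hmm_model.activities[i]
for i in range(transition_matrix.shape[0]):
    for j in range(transition_matrix.shape[1]):
        prob = transition_matrix[i, j]
        if prob > 0.1:
            from_activity = hmm_model.state_to_activity[i]
            to_activity = hmm_model.state_to_activity[j]
            print(f"  {from_activity} -> {to_activity}: {prob:.3f}")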

# Analyze which activities are most/least distinguishable
print("\n=== Activity Distinguishability ===")
for activity in hmm_model.activities:
    precision = metrics.get(f'{activity}_precision', 0)
    recall = metrics.get(f'{activity}_recall', 0)
    f1 = metrics.get(f'{activity}_f1', 0)
    print(f"{activity}: Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")

In [None]:
# Feature importance analysis (based on variance between activities)
feature_cols = [col for col in features_df.columns 
                if col not in ['sample_id', 'window_idx', 'activity']]

# Score each feature as total variance / mean within-class variance
# (a rough proxy for class separability; values near 1 mean little separation)
feature_importance = {}
for feature in feature_cols[:20]:  # Only the first 20 feature columns are scored
    if feature in features_df.columns:
        # Ratio of overall variance to the average per-activity variance
        overall_var = features_df[feature].var()
        within_class_var = features_df.groupby('activity')[feature].var().mean()
        
        if within_class_var > 0:
            importance = overall_var / within_class_var
            feature_importance[feature] = importance

# Sort by importance
sorted_features = sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)

print("=== Top 10 Most Discriminative Features (among the first 20 scored) ===")
for i, (feature, importance) in enumerate(sorted_features[:10]):
    print(f"{i+1:2d}. {feature}: {importance:.3f}")
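The score computed above divides total variance by within-class variance. A closely related alternative is the ratio of between-class to within-class sums of squares, sometimes called a Fisher score. The sketch below is one way to compute it; `fisher_ratio` is a hypothetical helper and the sample values are invented for illustration.

```python
import numpy as np

def fisher_ratio(values, labels):
    """Between-class over within-class sum of squares for one feature."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    between = 0.0
    within = 0.0
    for label in np.unique(labels):
        group = values[labels == label]
        between += len(group) * (group.mean() - grand_mean) ** 2
        within += ((group - group.mean()) ** 2).sum()
    return between / within if within > 0 else float("inf")

labels = ["still", "still", "jumping", "jumping"]
separating = [0.0, 0.1, 5.0, 5.1]   # class means far apart, little spread
overlapping = [0.0, 5.0, 0.1, 5.1]  # class means nearly equal, large spread
print(fisher_ratio(separating, labels) > fisher_ratio(overlapping, labels))  # True
```

Applied per column with `features_df[feature]` and `features_df['activity']`, this gives a ranking that starts at 0 for a feature with identical class means.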

## 8. Model Persistence

Save the trained model for future use.

In [None]:
# Save the trained model
model_path = '../results/trained_hmm_model.pkl'
os.makedirs('../results', exist_ok=True)
hmm_model.save_model(model_path)

# Save evaluation results
results_path = '../results/evaluation_results.csv'
eval_table.to_csv(results_path, index=False)
print(f"Evaluation results saved to {results_path}")

# Save detailed metrics
metrics_df = pd.DataFrame([metrics])
metrics_path = '../results/detailed_metrics.csv'
metrics_df.to_csv(metrics_path, index=False)
print(f"Detailed metrics saved to {metrics_path}")
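The `.pkl` extension suggests that `save_model` uses Python's `pickle` module, although its implementation is not shown. A save-and-reload round trip with plain `pickle` looks roughly like the sketch below; the `params` dictionary is a stand-in for illustration, not the real model object.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model's parameters (illustrative values only)
params = {
    "activities": ["standing", "walking", "jumping", "still"],
    "transmat": [[0.9, 0.1], [0.2, 0.8]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(params, f)      # serialize to disk
    with open(path, "rb") as f:
        restored = pickle.load(f)   # read it back

print(restored == params)  # True
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.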

## 9. Conclusions and Discussion

### Key Findings:

1. **Model Performance**: The HMM is trained to separate the four activity states. Overall accuracy and per-activity precision, recall, and F1 are reported in the evaluation table in Section 6.

2. **Transition Patterns**: The transition matrix encodes how likely each activity is to persist or change between consecutive windows. Transitions with probability above 0.1 are listed in Section 7.

3. **Feature Importance**: In the variance-ratio analysis in Section 7, time-domain features (especially standard deviation and magnitude-based features) proved most discriminative. Only the first 20 feature columns were scored, so this ranking is partial.

4. **Activity Distinguishability**: Some activities (like jumping vs. still) are easier to distinguish than others (like standing vs. walking). The confusion matrix in Section 6 shows which pairs are confused most often.

### Limitations and Future Improvements:

1. **Synthetic Data**: This demonstration uses synthetic data. Real sensor data would have more noise and variability.

2. **Limited Activities**: Only four activities were modeled. Real-world applications might need more states.

3. **Feature Engineering**: Additional domain-specific features could improve performance.

4. **Model Complexity**: More sophisticated HMM variants (e.g., hierarchical HMMs) could capture more complex behaviors.

### Real-World Applications:

- **Health Monitoring**: Track daily activity patterns
- **Fitness Applications**: Automatic exercise recognition
- **Smart Home Systems**: Context-aware automation
- **Elderly Care**: Fall detection and activity monitoring

## 10. Instructions for Real Data Collection

To use this system with real sensor data:

### Step 1: Data Collection
1. Install **Sensor Logger** app (iOS/Android) or **Physics Toolbox Accelerometer** (Android)
2. Configure sampling rate: 50-100 Hz
3. Record each activity for 5-10 seconds, repeat ~12 times per activity
4. Export data as CSV files

### Step 2: Data Loading
```python
# Load real sensor data
real_data = load_real_sensor_data('path/to/your/sensor_data.csv')

# Add activity labels manually or use file naming convention
real_data['activity'] = 'walking'  # Set appropriate activity
real_data['sample_id'] = 'walking_01'  # Set sample identifier
```
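If recordings are named after the activity, the labels can be derived from the file name instead of being typed by hand. The sketch below assumes a hypothetical `activity_NN.csv` naming scheme such as `walking_01.csv`; both the pattern and the helper `labels_from_filename` are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumed naming convention: lowercase activity, underscore, digits
_NAME_PATTERN = re.compile(r"^(?P<activity>[a-z]+)_(?P<index>\d+)$")

def labels_from_filename(path):
    """Return (activity, sample_id) parsed from a path like 'walking_01.csv'."""
    stem = Path(path).stem
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"File name does not match 'activity_NN': {stem!r}")
    return match.group("activity"), stem

activity, sample_id = labels_from_filename("recordings/walking_01.csv")
print(activity, sample_id)  # walking walking_01
```

The two values can then be assigned to `real_data['activity']` and `real_data['sample_id']` for each loaded file.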

### Step 3: Feature Extraction and Training
```python
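# Extract features from real data, keeping the activity labels for training
# (include_labels=True mirrors the call used on the synthetic dataset above)
features_real = extractor.extract_features_from_dataset(real_data, include_labels=True)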

# Train model on real data
hmm_real = ActivityHMM(n_states=4)
hmm_real.fit(features_real)
```

The rest of the pipeline remains the same!