# IoTShield - Sensor Data Analysis & ML Model Training

## Privacy-Preserving Real-Time Home Automation System

This notebook explores the IoTShield sensor dataset and trains two anomaly detection models on it: an unsupervised Isolation Forest and a supervised Random Forest classifier.

- **Dataset**: `sensor_data.csv` (10,000 samples)
- **Goal**: Train models to detect anomalies in IoT sensor data

---

## 1. Load and Explore the Dataset

First, we'll import required libraries and load the sensor data.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

# Configure display settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("All libraries imported successfully!")

In [None]:
# Load the dataset
df = pd.read_csv('../dataset/sensor_data.csv')

# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Display basic information
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"\nDataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Time Range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"\nFirst 5 rows:")
display(df.head())

print(f"\nData Types:")
print(df.dtypes)

print(f"\nStatistical Summary:")
display(df.describe())

## 2. Data Preprocessing and Cleaning

Check for data quality issues and clean the dataset.

In [None]:
# Check for missing values
print("="*60)
print("DATA QUALITY CHECK")
print("="*60)

print(f"\nMissing Values:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values found!")
else:
    print(missing[missing > 0])

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate Rows: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Removed {duplicates} duplicate rows")

# Anomaly distribution
print(f"\nAnomaly Distribution:")
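anomaly_counts = df['is_anomaly'].value_counts()
# .get() avoids a KeyError if one of the classes is absent from the data
n_normal = anomaly_counts.get(0, 0)
n_anomaly = anomaly_counts.get(1, 0)
print(f"Normal samples: {n_normal} ({n_normal/len(df)*100:.2f}%)")
print(f"Anomaly samples: {n_anomaly} ({n_anomaly/len(df)*100:.2f}%)")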

# Device distribution
print(f"\nDevice Distribution:")
print(df['device_id'].value_counts())

# Location distribution
print(f"\nLocation Distribution:")
print(df['location'].value_counts())

## 3. Exploratory Data Analysis (EDA)

### 3.1 Sensor Data Distribution

In [None]:
# Visualize sensor distributions
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Sensor Data Distributions (Normal vs Anomaly)', fontsize=16, fontweight='bold')

sensors = ['temperature', 'humidity', 'gas_level', 'flame_detected', 'motion_detected', 'light_level']
colors = ['#FF6B6B', '#4ECDC4']

for idx, sensor in enumerate(sensors):
    ax = axes[idx // 3, idx % 3]
    
    # Separate normal and anomaly data
    normal_data = df[df['is_anomaly'] == 0][sensor]
    anomaly_data = df[df['is_anomaly'] == 1][sensor]
    
    # Plot histograms
    ax.hist([normal_data, anomaly_data], bins=30, label=['Normal', 'Anomaly'], 
            color=colors, alpha=0.7, edgecolor='black')
    ax.set_title(f'{sensor.replace("_", " ").title()}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Sensor distribution plots generated!")

### 3.2 Correlation Analysis

In [None]:
# Correlation heatmap
numeric_cols = ['temperature', 'humidity', 'gas_level', 'flame_detected', 'motion_detected', 'light_level', 'is_anomaly']
correlation_matrix = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Sensor Features', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nFeature correlations with is_anomaly (highest first):")
anomaly_corr = correlation_matrix['is_anomaly'].sort_values(ascending=False)
for feature, corr in anomaly_corr.items():
    if feature != 'is_anomaly':
        print(f"  • {feature}: {corr:.3f}")

## 4. Feature Engineering

Prepare features for machine learning models.
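
Standardization rescales each feature to z = (x - mean) / std. As a minimal sketch on made-up readings (the values and the names `readings` and `demo_scaler` are illustrative, not taken from `sensor_data.csv`), the block below checks that `StandardScaler` matches this formula with the population standard deviation, and that a new reading is scaled with the stored statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative readings (temperature, humidity); not taken from sensor_data.csv
readings = np.array([[22.0, 40.0],
                     [24.0, 45.0],
                     [26.0, 50.0],
                     [60.0, 20.0]])

demo_scaler = StandardScaler().fit(readings)
scaled = demo_scaler.transform(readings)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (readings - readings.mean(axis=0)) / readings.std(axis=0)
matches_formula = np.allclose(scaled, manual)

# A new reading is scaled with the stored mean_ and scale_, not its own statistics
new_reading = np.array([[25.0, 42.0]])
new_scaled = demo_scaler.transform(new_reading)

print(matches_formula, scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Because the scaler stores its statistics, the same fitted object has to be reused at prediction time, which is why the notebook saves it alongside the models.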

In [None]:
# Select features for ML model
feature_columns = ['temperature', 'humidity', 'gas_level', 'flame_detected', 'motion_detected', 'light_level']
target_column = 'is_anomaly'

# Prepare X and y
X = df[feature_columns].values
y = df[target_column].values

print("="*60)
print("FEATURE ENGINEERING")
print("="*60)
print(f"\nFeatures (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"\nFeature columns: {feature_columns}")
print(f"Target column: {target_column}")

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"\nFeatures standardized (mean=0, std=1)")
print(f"Scaled features shape: {X_scaled.shape}")

## 5. Train-Test Split

Split data into training and testing sets.
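
The `stratify` argument keeps the anomaly rate roughly equal in both partitions, which matters when anomalies are rare. The sketch below illustrates this on synthetic labels with a 10% positive rate; the data and the `*_demo` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 6 features, 10% labelled as anomalies
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 6))
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y_demo, both partitions keep an anomaly rate close to 10%
print(f"train rate: {y_tr.mean():.3f}, test rate: {y_te.mean():.3f}")
```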

In [None]:
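# Split the raw features 80/20, then fit the scaler on the training rows only
# so that test-set statistics do not leak into preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)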

print("="*60)
print("TRAIN-TEST SPLIT")
print("="*60)
print(f"\nTraining set:")
print(f"   • X_train: {X_train.shape}")
print(f"   • y_train: {y_train.shape}")
print(f"   • Anomalies: {y_train.sum()} ({y_train.sum()/len(y_train)*100:.2f}%)")

print(f"\nTesting set:")
print(f"   • X_test: {X_test.shape}")
print(f"   • y_test: {y_test.shape}")
print(f"   • Anomalies: {y_test.sum()} ({y_test.sum()/len(y_test)*100:.2f}%)")

print(f"\nData split completed!")

## 6. Train Machine Learning Models

### 6.1 Isolation Forest (Unsupervised Anomaly Detection)
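
`IsolationForest.predict` returns +1 for inliers and -1 for outliers, and the `contamination` parameter sets where that cut falls on the anomaly score. The following is a minimal sketch on synthetic 2-D points (the data, the `contamination=0.05` value, and the `demo_forest` name are assumptions, not the notebook's settings) showing how the raw output maps to the 0/1 convention of `is_anomaly`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic 2-D data: a dense cluster plus a few far-away points (illustrative only)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outlier_points = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X_demo = np.vstack([normal_points, outlier_points])

demo_forest = IsolationForest(contamination=0.05, random_state=0).fit(X_demo)

raw_pred = demo_forest.predict(X_demo)          # +1 = inlier, -1 = outlier
scores = demo_forest.decision_function(X_demo)  # lower = more anomalous
labels = (raw_pred == -1).astype(int)           # 1 = anomaly, matching is_anomaly

# predict() is a threshold on decision_function() at zero
same_as_threshold = np.array_equal(labels, (scores < 0).astype(int))
print(labels.sum(), same_as_threshold)
```

Working with `decision_function` directly makes it possible to pick a different alert threshold later without retraining.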

In [None]:
# Train Isolation Forest model
print("="*60)
print("TRAINING ISOLATION FOREST MODEL")
print("="*60)

iso_forest = IsolationForest(
    contamination=0.1,
    random_state=42,
    n_estimators=100,
    max_samples='auto',
    verbose=1
)

print("\nTraining model...")
iso_forest.fit(X_train)

# Predictions
y_pred_iso = iso_forest.predict(X_test)
# IsolationForest returns 1 for normal and -1 for anomaly; map to 0 (normal) and 1 (anomaly)
y_pred_iso = (y_pred_iso == -1).astype(int)

# Calculate accuracy
accuracy_iso = accuracy_score(y_test, y_pred_iso)

print(f"\nModel trained successfully!")
print(f"Test Accuracy: {accuracy_iso*100:.2f}%")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_iso, target_names=['Normal', 'Anomaly']))

### 6.2 Random Forest Classifier (Supervised Learning)
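
A single train/test split gives one estimate of performance. As a hedged sketch, stratified cross-validation with an F1 score gives a steadier picture when anomalies are rare. The data below comes from `make_classification` as a stand-in for the sensor features, and `class_weight="balanced"` is an optional setting that the notebook's own classifier does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the sensor features: 6 columns, about 10% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.9, 0.1], random_state=42
)

# class_weight="balanced" is an assumption added here, not the notebook's setting
demo_rf = RandomForestClassifier(
    n_estimators=50, max_depth=10, class_weight="balanced", random_state=42
)

# Each fold keeps the class ratio; F1 is computed on the anomaly class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(demo_rf, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1 per fold: {np.round(f1_scores, 3)}, mean: {f1_scores.mean():.3f}")
```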

In [None]:
# Train Random Forest Classifier
print("="*60)
print("TRAINING RANDOM FOREST CLASSIFIER")
print("="*60)

rf_classifier = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=10,
    min_samples_split=5,
    verbose=1
)

print("\nTraining model...")
rf_classifier.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_classifier.predict(X_test)

# Calculate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print(f"\nModel trained successfully!")
print(f"Test Accuracy: {accuracy_rf*100:.2f}%")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Normal', 'Anomaly']))

# Feature importance
print(f"\nFeature Importance:")
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_classifier.feature_importances_
}).sort_values('Importance', ascending=False)
print(feature_importance)

## 7. Model Evaluation and Comparison
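
Accuracy can be misleading when only about 10% of samples are anomalies: a detector that never raises an alarm still scores about 90%. The sketch below uses hand-made labels (illustrative, not model output) to show why precision, recall, and F1 on the anomaly class are worth reading next to accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels: 90 normal readings followed by 10 anomalies
y_true = np.array([0] * 90 + [1] * 10)

# A useless detector that never raises an alarm
y_never = np.zeros(100, dtype=int)
# A detector with 5 false alarms that catches 8 of the 10 anomalies
y_detect = np.array([1] * 5 + [0] * 85 + [1] * 8 + [0] * 2)

for name, y_hat in [("never alarms", y_never), ("detector", y_detect)]:
    acc = accuracy_score(y_true, y_hat)
    prec = precision_score(y_true, y_hat, zero_division=0)
    rec = recall_score(y_true, y_hat, zero_division=0)
    f1 = f1_score(y_true, y_hat, zero_division=0)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

The two detectors differ by only three points of accuracy, yet the first one misses every anomaly.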

In [None]:
# Compare models
print("="*60)
print("MODEL COMPARISON")
print("="*60)

models_comparison = pd.DataFrame({
    'Model': ['Isolation Forest', 'Random Forest Classifier'],
    'Accuracy': [accuracy_iso * 100, accuracy_rf * 100],
    'Type': ['Unsupervised', 'Supervised']
})

print("\n", models_comparison.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion Matrix - Isolation Forest
cm_iso = confusion_matrix(y_test, y_pred_iso)
sns.heatmap(cm_iso, annot=True, fmt='d', cmap='Blues', ax=axes[0], 
            xticklabels=['Normal', 'Anomaly'], yticklabels=['Normal', 'Anomaly'])
axes[0].set_title(f'Isolation Forest\nAccuracy: {accuracy_iso*100:.2f}%', fontsize=12, fontweight='bold')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Confusion Matrix - Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['Normal', 'Anomaly'], yticklabels=['Normal', 'Anomaly'])
axes[1].set_title(f'Random Forest Classifier\nAccuracy: {accuracy_rf*100:.2f}%', fontsize=12, fontweight='bold')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

# Determine best model
best_model_name = 'Random Forest Classifier' if accuracy_rf > accuracy_iso else 'Isolation Forest'
best_accuracy = max(accuracy_rf, accuracy_iso)
print(f"\nBest Model: {best_model_name} (Accuracy: {best_accuracy*100:.2f}%)")

## 8. Save the Models and Scaler

Save both trained models and the fitted scaler for deployment in the IoTShield system.
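
At prediction time the saved scaler has to be applied before the saved model, in the same order as during training. The following is a minimal round-trip sketch under stated assumptions: synthetic 6-feature data, a temporary directory, and hypothetical file names (`demo_model.pkl`, `demo_scaler.pkl`) rather than the paths this notebook writes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Stand-in training data with 6 features (synthetic, for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 6))

demo_scaler = StandardScaler().fit(X_demo)
demo_model = IsolationForest(random_state=42).fit(demo_scaler.transform(X_demo))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names; the notebook uses its own paths
    demo_model_path = os.path.join(tmp_dir, "demo_model.pkl")
    demo_scaler_path = os.path.join(tmp_dir, "demo_scaler.pkl")
    joblib.dump(demo_model, demo_model_path)
    joblib.dump(demo_scaler, demo_scaler_path)

    loaded_model = joblib.load(demo_model_path)
    loaded_scaler = joblib.load(demo_scaler_path)

# Inference applies the loaded scaler first, then the loaded model
new_reading = rng.normal(size=(1, 6))
before = demo_model.decision_function(demo_scaler.transform(new_reading))
after = loaded_model.decision_function(loaded_scaler.transform(new_reading))
print(np.allclose(before, after))
```

Files written by `joblib` are pickles, so they are best loaded with the same scikit-learn version that produced them and only from trusted sources.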

In [None]:
# Save both models and the scaler
print("="*60)
print("SAVING MODELS")
print("="*60)

# Save Isolation Forest (used by IoTShield system)
iso_forest_path = '../isolation_forest_model.pkl'
joblib.dump(iso_forest, iso_forest_path)
print(f"\nIsolation Forest saved: {iso_forest_path}")

# Save Random Forest Classifier
rf_classifier_path = '../random_forest_classifier.pkl'
joblib.dump(rf_classifier, rf_classifier_path)
print(f"Random Forest saved: {rf_classifier_path}")

# Save the scaler
scaler_path = '../scaler.pkl'
joblib.dump(scaler, scaler_path)
print(f"Scaler saved: {scaler_path}")

print(f"\nAll models and preprocessors saved successfully!")
print(f"\nModel files location: ml_models/")
print(f"   • isolation_forest_model.pkl (Anomaly Detection)")
print(f"   • random_forest_classifier.pkl (Classification)")
print(f"   • scaler.pkl (Feature Scaling)")

print(f"\nModels are ready for deployment in the IoTShield system!")

## Summary

### Completed Tasks:
1. **Dataset Loaded**: 10,000 sensor readings with 6 features
2. **Data Preprocessing**: No missing values, no duplicates
3. **EDA**: Visualized sensor distributions and correlations
4. **Feature Engineering**: Standardized features for ML
5. **Model Training**: 
   - Isolation Forest (Unsupervised)
   - Random Forest Classifier (Supervised)
6. **Model Evaluation**: Compared accuracy and confusion matrices
7. **Model Deployment**: Saved models for IoTShield system

### Key Findings:
- **Anomaly Rate**: ~10% of samples
- **Strongest Correlations**: Gas level and flame detection with anomalies
- **Best Model**: Check comparison results above
- **Models Ready**: For real-time anomaly detection in IoTShield

### Next Steps:
1. Integrate the models with the Django backend
2. Test real-time predictions with the IoT simulator
3. Monitor model performance on live data
4. Retrain with real sensor data when it becomes available
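
One simple way to approach the monitoring step is to track the share of anomaly predictions over a sliding window and flag it when it moves far from the rate seen in training. The sketch below is hypothetical: the `AnomalyRateMonitor` class, its defaults, and the thresholds are illustrative assumptions, not part of IoTShield.

```python
from collections import deque


class AnomalyRateMonitor:
    """Track the share of anomaly predictions over the last `window` readings."""

    def __init__(self, expected_rate=0.10, tolerance=0.10, window=200):
        # All three defaults are illustrative assumptions, not IoTShield settings
        self.expected_rate = expected_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def update(self, prediction):
        """Record one prediction (1 = anomaly, 0 = normal) and return the current rate."""
        self.recent.append(int(prediction))
        return sum(self.recent) / len(self.recent)

    def drifted(self):
        """Return True once the window is full and the rate is far from the expected one."""
        if len(self.recent) < self.recent.maxlen:
            return False
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.expected_rate) > self.tolerance


monitor = AnomalyRateMonitor(window=50)
for i in range(50):
    monitor.update(1 if i % 10 == 0 else 0)  # 10% anomalies, as expected
steady = monitor.drifted()

for _ in range(50):
    monitor.update(1)  # sudden burst of anomaly predictions
burst = monitor.drifted()

print(steady, burst)
```

A sustained shift in this rate can indicate either a real incident or a change in sensor behavior that calls for retraining.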