# Digital Twin Python & ML Foundations Notebook
Get hands-on with Python basics, descriptive statistics, and machine learning concepts commonly used when building digital twins.

## Learning Objectives
- Refresh core Python syntax for data analysis tasks.
- Practice computing mean, median, and mode with pure Python and pandas.
- Follow the end-to-end steps of a simple machine learning workflow.
- Explore supervised, unsupervised, and reinforcement learning concepts relevant to digital twins.
- Complete short exercises to solidify each concept.

## Setup Checklist
The notebook uses pandas, NumPy, scikit-learn, matplotlib, and seaborn. Install any that are missing with `pip install pandas numpy scikit-learn matplotlib seaborn`.

In [None]:
# Core libraries used throughout the notebook
import math
import random
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, classification_report

## 1. Python Refresher
Python is often the glue language for digital twins. Let us revisit fundamental elements such as variables, lists, loops, and functions.

### Key Concepts
- Variables & data types (int, float, string, bool).
- Lists, tuples, dictionaries for sensor data and configuration.
- Control flow (`if`, `for`, `while`) for decision logic.
- Functions to encapsulate transformation steps.
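
The concepts above can be combined in a few lines. The following is a minimal sketch rather than part of the original notebook: the `sensor_config` dictionary, its keys, the `classify` helper, and the readings are hypothetical names and values chosen for illustration.

```python
# Hypothetical sensor configuration: a dict holding strings and a (low, high) tuple
sensor_config = {"name": "temp_01", "unit": "degC", "limits": (15.0, 35.0)}
readings = [22.1, 36.4, 14.2, 23.0]  # made-up values


def classify(value, limits):
    """Label a reading relative to a (low, high) tuple of bounds."""
    low, high = limits  # tuple unpacking
    if value < low:
        return "too_low"
    if value > high:
        return "too_high"
    return "ok"


# for loop (as a list comprehension) over the readings
labels = [classify(v, sensor_config["limits"]) for v in readings]

# while loop: count readings seen before the first out-of-range value
in_range_streak = 0
while in_range_streak < len(labels) and labels[in_range_streak] == "ok":
    in_range_streak += 1

print(labels)
print(in_range_streak)
```

Keeping the limits in the configuration dictionary, instead of hard-coding them in the function, makes the same helper reusable across sensors.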

In [None]:
# Simple Python recap: managing sensor readings
sensor_readings = [22.1, 22.5, 22.0, 21.9, 22.3]

def add_reading(readings, new_value):
    """Append a new sensor value if it is within acceptable bounds."""
    if 15 <= new_value <= 35:  # simple validation
        readings.append(new_value)
    return readings

# 14.2 is outside the 15-35 range, so add_reading leaves the list unchanged for it
for value in [22.7, 23.4, 14.2]:
    add_reading(sensor_readings, value)

sensor_readings

### Exercise 1
1. Create a dictionary representing a machine with keys: `name`, `location`, `status`.
2. Write a function `toggle_status(machine)` that flips the status between `'online'` and `'offline'`.
3. Use a loop to simulate five toggles and print the status history.

## 2. Descriptive Statistics: Mean, Median, Mode
Digital twins continuously summarise sensor feeds. Mean, median, and mode give a quick first view of where the readings sit and how they are distributed.

### Definitions
- **Mean**: Arithmetic average. Sensitive to outliers.
- **Median**: Middle value when sorted. Robust to extremes.
- **Mode**: Most frequent value. Useful for categorical states.

We will compute these using both pure Python and pandas.
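
To make the "sensitive to outliers" and "robust to extremes" remarks concrete, here is a small sketch using Python's standard `statistics` module. The readings are made up for illustration, and the single value of 9.0 stands in for a faulty spike.

```python
import statistics

readings = [3.1, 3.2, 3.2, 3.3, 3.5]  # made-up vibration values (mm/s)
with_outlier = readings + [9.0]        # the same data plus one faulty spike

for label, values in [("clean", readings), ("with outlier", with_outlier)]:
    print(
        label,
        round(statistics.mean(values), 3),
        round(statistics.median(values), 3),
        statistics.mode(values),
    )

# How far each statistic moves when the outlier is added
mean_shift = statistics.mean(with_outlier) - statistics.mean(readings)
median_shift = statistics.median(with_outlier) - statistics.median(readings)
print(round(mean_shift, 3), round(median_shift, 3))
```

With these values the mean rises by roughly 0.96 mm/s, the median moves by only 0.05 mm/s, and the mode stays at 3.2.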

In [None]:
# Sample vibration amplitude readings (mm/s)
vibration = [3.2, 3.5, 3.1, 3.6, 3.2, 3.3, 3.2, 5.0]

mean_value = sum(vibration) / len(vibration)
sorted_values = sorted(vibration)
mid = len(sorted_values) // 2
median_value = (sorted_values[mid] + sorted_values[~mid]) / 2  # ~mid == -(mid + 1) indexes from the end, so odd and even lengths both work
mode_value = Counter(vibration).most_common(1)[0][0]

stats_summary = pd.Series(vibration).describe()

mean_value, median_value, mode_value, stats_summary

### Exercise 2
1. Create a list of hourly energy consumption readings (kWh).
2. Compute mean, median, and mode with Python.
3. Use `pandas.Series(value_list).plot(kind='hist')` to visualise the distribution (requires matplotlib).

## 3. Basic Machine Learning Workflow
Typical steps when creating predictive components for a digital twin:
1. **Problem Definition**: e.g., predict temperature 5 minutes ahead.
2. **Data Collection**: sensors, logs, simulations.
3. **Feature Engineering**: rolling averages, gradients, domain heuristics.
4. **Model Selection & Training**: choose algorithm, fit on historical data.
5. **Evaluation**: error metrics, confusion matrix, cross-validation.
6. **Deployment & Monitoring**: integrate with twin runtime, track drift.
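
As an illustration of step 3, the sketch below derives a rolling average and a gradient from a short temperature series with pandas. The values, the column names, and the window of three samples are assumptions made for this example only.

```python
import pandas as pd

# Made-up temperature samples at a fixed sampling interval
temps = pd.Series([20.0, 21.0, 23.0, 22.0, 24.0], name="temp")

features = pd.DataFrame({
    "temp": temps,
    # Mean of the current and up to two previous samples
    "rolling_mean_3": temps.rolling(window=3, min_periods=1).mean(),
    # Change since the previous sample; the first row has no predecessor
    "gradient": temps.diff().fillna(0.0),
})
print(features)
```

`min_periods=1` keeps the first rows instead of producing missing values, at the cost of averaging over fewer samples there.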

In [None]:
# Generate a small synthetic dataset: ambient temperature vs. cooling load
np.random.seed(42)
ambient_temp = np.random.normal(loc=25, scale=3, size=120)
cooling_load = 0.8 * ambient_temp + np.random.normal(loc=0, scale=1.5, size=120)
data = pd.DataFrame({
    'ambient_temp': ambient_temp,
    'cooling_load': cooling_load
})
data.head()

In [None]:
# Split into features (X) and target (y)
X = data[['ambient_temp']]
y = data['cooling_load']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Evaluate
predictions = regressor.predict(X_test)
rmse = math.sqrt(mean_squared_error(y_test, predictions))
rmse

### Exercise 3
1. Add a new feature `humidity` (random values 40-60%).
2. Retrain the regression model with both features and compare the RMSE.
3. Create a scatter plot of predicted vs. actual cooling load to inspect the errors.
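
The workflow list mentions cross-validation as an evaluation option. A single train/test split gives one RMSE figure, whereas k-fold cross-validation gives one per fold. The sketch below is self-contained and regenerates a similar synthetic dataset; the seed, the sample size, and the choice of five folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
temp = rng.normal(25, 3, size=100)                 # synthetic ambient temperature
load = 0.8 * temp + rng.normal(0, 1.5, size=100)   # synthetic cooling load with noise
X = temp.reshape(-1, 1)                            # scikit-learn expects a 2-D feature array

# scikit-learn maximises scores, so RMSE is reported as a negative number
scores = cross_val_score(
    LinearRegression(), X, load, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(rmse_per_fold.round(2), round(rmse_per_fold.mean(), 2))
```

A large spread between folds suggests that the single-split RMSE depends heavily on which rows ended up in the test set.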

## 4. Supervised Learning Concepts
Supervised learning maps inputs to known outputs. In digital twins, common tasks include condition monitoring, failure prediction, and parameter estimation.

### Algorithms often used for digital twins
- **Random Forest**: Handles non-linear relationships, robust to noise.
- **Gradient Boosting (XGBoost/LightGBM)**: High accuracy for tabular data.
- **Neural Networks**: Capture complex temporal/spatial dynamics in twin simulations.
- **Support Vector Machines**: Effective for smaller datasets with clear margins.
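
In scikit-learn, several of these model families share the same `fit`/`score` interface, so they can be swapped with little code. The sketch below is an assumption-laden illustration: it uses a generic synthetic dataset from `make_classification` rather than twin data, and scikit-learn's own `GradientBoostingClassifier` stands in for XGBoost/LightGBM. Neural networks are left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Generic synthetic classification data (not sensor data)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                    # same call for every estimator
    accuracy[name] = model.score(X_test, y_test)   # mean accuracy on held-out rows
print(accuracy)
```

Accuracy on a toy dataset says little about real twin data; the point is only that the surrounding code stays the same when the estimator changes.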

In [None]:
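# Classification example: flag unusually high cooling-load readings
df = data.copy()
# Label a reading as a spike when it exceeds the mean by more than two standard deviations
spike_threshold = df['cooling_load'].mean() + 2 * df['cooling_load'].std()
df['spike_flag'] = (df['cooling_load'] > spike_threshold).astype(int)

X_cls = df[['ambient_temp']]
y_cls = df['spike_flag']
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_cls, y_cls, test_size=0.25, random_state=7)

clf = RandomForestClassifier(n_estimators=200, random_state=7)
clf.fit(Xc_train, yc_train)
preds = clf.predict(Xc_test)
# Spikes are rare, so a class may go unpredicted; zero_division=0 reports 0 instead of warning
# print() renders the report as a table instead of a string with escaped newlines
print(classification_report(yc_test, preds, zero_division=0))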

### Exercise 4
1. Replace `ambient_temp` with two features: `ambient_temp` and `rolling_mean_temp` (`df['ambient_temp'].rolling(window=3).mean().bfill()`).
2. Retrain the classifier and observe changes in precision/recall.
3. Experiment with `max_depth` and `min_samples_split` in `RandomForestClassifier`.
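
Because the spike label marks only readings far above the mean, positives are rare, and a plain random split can leave very few of them in the test set. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses made-up labels (8 positives out of 100) to show the effect; it assumes each class has at least two members, which stratification requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 92 normal readings, 8 spikes
y = np.array([0] * 92 + [1] * 8)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# The 8% spike rate is preserved in both parts
print(int(y_tr.sum()), len(y_tr))
print(int(y_te.sum()), len(y_te))
```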

## 5. Unsupervised Learning Concepts
Unsupervised learning finds structure without labelled outcomes. Useful for clustering equipment behaviour, anomaly detection, and feature learning.

### Algorithms
- **K-Means**: Cluster similar operating states.
- **DBSCAN**: Discover dense regions and outliers.
- **Autoencoders**: Learn embeddings for high-dimensional sensor arrays.
- **Principal Component Analysis (PCA)**: Reduce dimensionality for monitoring dashboards.
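
As a complement to K-Means, the sketch below shows how DBSCAN separates dense regions from isolated points. The two synthetic operating regimes, the single outlier, and the `eps` and `min_samples` values are all assumptions picked so that the example is easy to read.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense synthetic operating regimes: (ambient_temp, cooling_load)
normal_ops = rng.normal(loc=[25.0, 20.0], scale=0.3, size=(50, 2))
high_load = rng.normal(loc=[32.0, 26.0], scale=0.3, size=(50, 2))
outlier = np.array([[45.0, 5.0]])  # one implausible reading
X = np.vstack([normal_ops, high_load, outlier])

# Points with fewer than min_samples neighbours within eps are labelled -1 (noise)
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Unlike K-Means, DBSCAN does not need the number of clusters up front, but `eps` depends on the feature scale, so features with different units usually need scaling first.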

In [None]:
# Cluster operating states using K-Means
state_df = pd.DataFrame({
    'ambient_temp': ambient_temp,
    'cooling_load': cooling_load
})
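# n_init is set explicitly because its default changed across scikit-learn releases
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
# Fit on the two feature columns only, so re-running never clusters on the label column
state_df['state_cluster'] = kmeans.fit_predict(state_df[['ambient_temp', 'cooling_load']])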
state_df['state_cluster'].value_counts()

### Exercise 5
1. Visualise clusters using `seaborn.scatterplot` or `matplotlib`.
2. Change `n_clusters` to 2 and 4 and compare cluster distributions.
3. Brainstorm how each cluster could correspond to distinct operating modes in a digital twin scenario.
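
When comparing different values of `n_clusters`, a numeric score can complement the visual check. The sketch below uses the silhouette score from scikit-learn on synthetic blobs; the blob centres, spread, and the range of `k` values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 means tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k)
```

On real sensor data the groups are rarely this clean, so the score is a hint to weigh against domain knowledge of operating modes, not a final answer.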

## 6. Reinforcement Learning Concepts
Reinforcement learning (RL) agents learn actions by interacting with an environment. In digital twins, RL can optimise control policies (e.g., HVAC setpoints).

### Key Terms
- **State**: Representation of the system (temperature, demand, etc.).
- **Action**: Control adjustments (valve opening, fan speed).
- **Reward**: Feedback signal (energy saved, comfort maintained).
- **Policy**: Strategy mapping states to actions.
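
These terms come together in the tabular Q-learning update, which nudges the value of a state-action pair towards the observed reward plus the discounted best value of the next state. The helper below is a hypothetical stand-alone function written for this illustration, with a learning rate of 0.1 and a discount factor of 0.9 assumed as defaults.

```python
def q_update(q_sa, reward, best_next_q, alpha=0.1, gamma=0.9):
    """Return the updated Q-value for one (state, action) pair."""
    td_target = reward + gamma * best_next_q  # reward now plus discounted future value
    td_error = td_target - q_sa               # how far the current estimate is off
    return q_sa + alpha * td_error


# Assumed scenario: an action earns reward 2 and leaves the state unchanged,
# starting from an all-zero table, so the best next value is the entry just updated.
q_first = q_update(0.0, reward=2, best_next_q=0.0)
q_second = q_update(q_first, reward=2, best_next_q=q_first)
print(round(q_first, 3), round(q_second, 3))
```

The first update gives 0.2 and the second 0.398. If the same step were repeated indefinitely, the value would approach 2 / (1 - 0.9) = 20, which shows how the discount factor bounds the long-run return.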

Below is a compact Q-learning illustration for a thermostat agent.

In [None]:
# Minimal Q-learning loop for a thermostat-like control
states = ['cold', 'comfort', 'hot']
actions = ['heat', 'idle', 'cool']
q_table = pd.DataFrame(0, index=states, columns=actions, dtype=float)
transition = {
    ('cold', 'heat'): ('comfort', 1),
    ('cold', 'idle'): ('cold', -1),
    ('cold', 'cool'): ('cold', -2),
    ('comfort', 'heat'): ('hot', -1),
    ('comfort', 'idle'): ('comfort', 2),
    ('comfort', 'cool'): ('cold', -1),
    ('hot', 'heat'): ('hot', -2),
    ('hot', 'idle'): ('hot', -1),
    ('hot', 'cool'): ('comfort', 1)
}

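alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
random.seed(0)  # make the exploration sequence reproducible
state = 'comfort'
# The state is never reset, so each iteration is one step of a single continuing run
for step in range(200):
    # Epsilon-greedy: explore with probability epsilon, otherwise take the best-known action
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = q_table.loc[state].idxmax()
    next_state, reward = transition[(state, action)]
    best_future = q_table.loc[next_state].max()
    # Move the estimate towards reward + discounted best value of the next state
    q_table.loc[state, action] += alpha * (reward + gamma * best_future - q_table.loc[state, action])
    state = next_state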

q_table

### Exercise 6
1. Modify the reward structure to penalise frequent heating/cooling to encourage energy efficiency.
2. Split the run into episodes (for example, reset the state every 20 steps), track the cumulative reward per episode, and plot it to see learning progress.
3. Discuss how to scale this to a simulator-backed digital twin environment (e.g., using `gymnasium`).
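
As a starting point for item 3, the sketch below wraps the thermostat transitions in a small class whose `reset` and `step` methods mirror the return shapes used by `gymnasium` environments: `(observation, info)` and `(observation, reward, terminated, truncated, info)`. It is a hypothetical stand-in, not a `gymnasium.Env` subclass: it defines no observation or action spaces, and the `ThermostatEnv` name and `max_steps` parameter are assumptions.

```python
import random


class ThermostatEnv:
    """Toy environment with gymnasium-style reset/step return values."""

    STATES = ["cold", "comfort", "hot"]
    ACTIONS = ["heat", "idle", "cool"]
    # (state, action) -> (next_state, reward)
    TRANSITIONS = {
        ("cold", "heat"): ("comfort", 1),
        ("cold", "idle"): ("cold", -1),
        ("cold", "cool"): ("cold", -2),
        ("comfort", "heat"): ("hot", -1),
        ("comfort", "idle"): ("comfort", 2),
        ("comfort", "cool"): ("cold", -1),
        ("hot", "heat"): ("hot", -2),
        ("hot", "idle"): ("hot", -1),
        ("hot", "cool"): ("comfort", 1),
    }

    def __init__(self, max_steps=20, seed=None):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.state = None
        self.steps = 0

    def reset(self):
        """Start a new episode from a random state."""
        self.state = self.rng.choice(self.STATES)
        self.steps = 0
        return self.state, {}

    def step(self, action):
        """Apply one action and report the outcome."""
        self.state, reward = self.TRANSITIONS[(self.state, action)]
        self.steps += 1
        terminated = False                       # this toy task has no terminal state
        truncated = self.steps >= self.max_steps  # episode ends on the step limit
        return self.state, reward, terminated, truncated, {}


# One episode with a placeholder policy that always idles
env = ThermostatEnv(max_steps=5, seed=0)
obs, info = env.reset()
total_reward = 0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step("idle")
    total_reward += reward
    done = terminated or truncated
print(obs, total_reward)
```

Replacing the transition table with calls into a simulator would keep the agent-side loop unchanged, which is the main benefit of agreeing on this interface early.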

## 7. Capstone Task Ideas
Tie everything together with mini-projects:
- **Capstone A**: Build a baseline digital twin for a smart room that forecasts next-hour temperature using regression and flags anomalies with classification.
- **Capstone B**: Cluster multi-sensor data (temperature, vibration, energy) to identify equipment operating states and map them to maintenance actions.
- **Capstone C**: Prototype a small RL agent that adjusts HVAC setpoints within a simulated environment for energy savings.

Document assumptions, metrics, and visualisations for peer review.

## Additional Resources
- *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow* by Aurélien Géron.
- *Digital Twin Driven Smart Design* (Elsevier), for industrial case studies.
- Open-source toolkits: `simpy` for discrete-event simulation, `gymnasium` for RL environments.