Question:
You are given a dataset of vehicle telemetry logs. Each row corresponds to one minute of data and includes features such as:

- speed (km/h)
- engine_rpm
- acceleration
- brake_pressure
- steering_angle

Each row is labeled as one of three driving states:

- Normal driving
- Aggressive driving
- System anomaly

Write Python code to:

1. Load and preprocess the dataset (handle missing values, normalize features if needed).
2. Split the dataset into training and test sets.
3. Train a simple classification model (e.g., Logistic Regression, Random Forest, or Gradient Boosting).
4. Evaluate the model with accuracy and a confusion matrix.
5. Demonstrate how you would predict the driving state for new incoming telemetry data in real time.

# 0) Setup & Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import itertools, joblib
np.random.seed(42)


# 1) Create TWO tables and JOIN them

In [3]:
# --- Table A: telemetry-level rows (multiple rows per vehicle) ---
n_rows = 500
vehicle_ids = np.random.choice([101,102,103,104,105,106], size=n_rows, p=[.15,.2,.2,.2,.15,.1])

df_telemetry = pd.DataFrame({
    "vehicle_id": vehicle_ids,
    "speed": np.random.randint(0, 150, n_rows),
    "engine_rpm": np.random.randint(600, 6500, n_rows),
    "acceleration": np.random.uniform(-3.5, 3.5, n_rows),
    "brake_pressure": np.random.uniform(0, 1, n_rows),
    "steering_angle": np.random.uniform(-50, 50, n_rows)
})

# Introduce a few missing values to simulate reality
for col in ["acceleration", "brake_pressure"]:
    mask = np.random.rand(n_rows) < 0.03
    df_telemetry.loc[mask, col] = np.nan

# --- Table B: driver-level metadata (one row per vehicle_id, some vehicles missing) ---
df_drivers = pd.DataFrame({
    "vehicle_id": [101,102,103,104,106],  # 105 missing on purpose
    "driver_name": ["Alice","Bob","Charlie","Dana","Ed"],
    "age": [35, 42, 29, 51, 38],
    "region": ["West","East","West","Central","East"]
})

# --- JOIN: enrich telemetry with driver info ---
df = pd.merge(df_telemetry, df_drivers, on="vehicle_id", how="left")
df.head()


   vehicle_id  speed  engine_rpm  acceleration  brake_pressure  steering_angle  driver_name   age   region
0         103    144        5091      3.272579        0.680975       -1.162537      Charlie  29.0     West
1         106    140        2155     -2.201788             NaN       23.756575           Ed  38.0     East
2         104     45        1404     -0.265319        0.618948        2.523976         Dana  51.0  Central
3         104     34        5629     -3.491377        0.323500      -46.735432         Dana  51.0  Central
4         102    133        1879      0.688291        0.834303       36.855151          Bob  42.0     East


In [4]:
df.shape

(500, 9)
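The left join above keeps all 500 telemetry rows, but rows for vehicle 105 get no driver metadata. A minimal sketch of how that could be checked explicitly, using the `validate` and `indicator` parameters of `DataFrame.merge`. The two small frames here are toy stand-ins (an assumption for illustration), not the notebook's `df_telemetry` and `df_drivers`.

```python
import pandas as pd

# Toy stand-ins for the telemetry and driver tables (illustrative values only)
telemetry = pd.DataFrame({
    "vehicle_id": [101, 102, 105, 105, 106],
    "speed": [40, 88, 120, 15, 63],
})
drivers = pd.DataFrame({
    "vehicle_id": [101, 102, 106],  # 105 has no driver record
    "region": ["West", "East", "East"],
})

# validate="many_to_one" raises a MergeError if the driver table has duplicate keys,
# which would otherwise silently multiply telemetry rows.
# indicator=True adds a "_merge" column recording where each row found a match.
merged = telemetry.merge(
    drivers, on="vehicle_id", how="left",
    validate="many_to_one", indicator=True,
)

# A left join against a unique key must not change the telemetry row count.
assert len(merged) == len(telemetry)

unmatched_ids = sorted(
    merged.loc[merged["_merge"] == "left_only", "vehicle_id"].unique().tolist()
)
print("Vehicles without driver metadata:", unmatched_ids)
```

Rows flagged `left_only` are the ones whose `driver_name`, `age`, and `region` end up as NaN and later rely on imputation.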

# 2) Create labels (supervised target) + Quick EDA

In [5]:
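# Simple rule-based labels so patterns are learnable.
# Rules are checked in order, so "System anomaly" takes precedence over "Aggressive driving".
def label_row(r):
    # pd.notna guards make the handling of missing values explicit: NaN never satisfies a rule
    hard_brake_at_speed = pd.notna(r["brake_pressure"]) and r["brake_pressure"] > 0.85 and r["speed"] > 105
    over_rev_at_speed = r["engine_rpm"] > 5800 and r["speed"] > 120
    if hard_brake_at_speed or over_rev_at_speed:
        return "System anomaly"
    elif (pd.notna(r["acceleration"]) and r["acceleration"] > 2.2) or (abs(r["steering_angle"]) > 32) or (r["speed"] > 130):
        return "Aggressive driving"
    else:
        return "Normal driving"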

df["label"] = df.apply(label_row, axis=1)

print("Shape:", df.shape)
print("\nNulls:\n", df.isna().sum())
print("\nClass distribution:\n", df["label"].value_counts())

Shape: (500, 10)

Nulls:
 vehicle_id         0
speed              0
engine_rpm         0
acceleration      17
brake_pressure    17
steering_angle     0
driver_name       76
age               76
region            76
label              0
dtype: int64

Class distribution:
 label
Aggressive driving    259
Normal driving        216
System anomaly         25
Name: count, dtype: int64
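Row-wise `df.apply(..., axis=1)` is fine for 500 rows but gets slow on large logs. The same thresholds used in `label_row` can be expressed as boolean masks and combined with `np.select`, which checks conditions in order. The sketch below uses a small hand-made frame (an assumption for illustration) and relies on the fact that comparisons against NaN evaluate to False.

```python
import numpy as np
import pandas as pd

# Hand-made rows, each chosen to hit one rule (illustrative values only)
toy = pd.DataFrame({
    "speed":          [110, 125, 60, 135, 50, 110],
    "engine_rpm":     [3000, 6000, 2500, 3000, 2000, 3000],
    "acceleration":   [0.5, 0.1, 2.5, 0.0, np.nan, 0.0],
    "brake_pressure": [0.90, 0.20, 0.10, 0.30, 0.40, np.nan],
    "steering_angle": [0.0, 5.0, 1.0, 2.0, -10.0, 40.0],
})

# NaN > threshold is False, so missing values never trigger a rule.
anomaly = ((toy["brake_pressure"] > 0.85) & (toy["speed"] > 105)) | (
    (toy["engine_rpm"] > 5800) & (toy["speed"] > 120)
)
aggressive = (
    (toy["acceleration"] > 2.2)
    | (toy["steering_angle"].abs() > 32)
    | (toy["speed"] > 130)
)

# np.select takes the first matching condition, so anomaly wins over aggressive.
toy["label"] = np.select(
    [anomaly.to_numpy(), aggressive.to_numpy()],
    ["System anomaly", "Aggressive driving"],
    default="Normal driving",
)
print(toy[["speed", "engine_rpm", "label"]])
```

Because the labels are a deterministic function of the features, a tree-based model can recover them almost exactly, while a linear model struggles with rules such as `abs(steering_angle) > 32`.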


# 3) Define features, split data
(Split first, then fit the imputer and scaler on the training set only, so that no test-set statistics leak into preprocessing.)

In [6]:
# Feature columns (include joined columns)
num_features = ["speed","engine_rpm","acceleration","brake_pressure","steering_angle","age"]
cat_features = ["driver_name","region"]

X = df[num_features + cat_features]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

X_train.shape, X_test.shape

((375, 8), (125, 8))
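With only 25 "System anomaly" rows out of 500, `stratify=y` matters: it keeps the class proportions roughly equal across the two splits. A small sketch of how that could be verified. It rebuilds a label vector from the class counts printed above and uses a placeholder feature column (`row_id`, hypothetical) instead of the real features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Label vector rebuilt from the printed class distribution (259 / 216 / 25)
y = pd.Series(
    ["Aggressive driving"] * 259 + ["Normal driving"] * 216 + ["System anomaly"] * 25,
    name="label",
)
X = pd.DataFrame({"row_id": range(len(y))})  # placeholder feature, not real telemetry

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Compare class shares in the full data and in each split.
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).round(3)
print(proportions)
print("\nTest-set class counts:\n", y_test.value_counts())
```

Even with stratification, the test set holds only a handful of anomaly rows, so per-class metrics for that class move in large steps.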

# 4) Preprocessing (impute + scale numeric, impute + one-hot categorical)

In [7]:
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_features),
        ("cat", categorical_pipe, cat_features)
    ]
)
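To make the "fit on train only" point concrete, the sketch below builds the same kind of `ColumnTransformer` on two tiny hand-made frames (an assumption for illustration, with only one numeric and one categorical column). It shows that the imputation median comes from the training rows and that a category unseen during fitting is encoded as all zeros because of `handle_unknown="ignore"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made train/test frames (illustrative values only)
train = pd.DataFrame({
    "speed": [10.0, 20.0, np.nan, 40.0],
    "region": pd.Series(["West", "East", np.nan, "East"], dtype=object),
})
test = pd.DataFrame({
    "speed": [np.nan, 1000.0],
    "region": pd.Series(["North", np.nan], dtype=object),  # "North" never seen in train
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("num", num_pipe, ["speed"]),
    ("cat", cat_pipe, ["region"]),
])

# Statistics (median, mean/std, category list) are learned from train only.
pre.fit(train)
train_median = pre.named_transformers_["num"].named_steps["imputer"].statistics_[0]
print("Median learned from train:", train_median)

# transform() reuses those statistics; the extreme test value 1000.0 does not affect them.
out = pre.transform(test)
out = out.toarray() if hasattr(out, "toarray") else out
print(out)  # columns: scaled speed, region=East, region=West
```

Wrapping the preprocessor and the classifier in one `Pipeline`, as the next section does, gives this behaviour automatically when `fit` is called on the training split.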

# 5) Models: Logistic Regression & Random Forest

In [15]:
# Logistic Regression (fundamentals)
logreg = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=500, class_weight="balanced"))
])

# Random Forest (strong structured baseline)
rf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=300, max_depth=None, random_state=42,
        class_weight=None  # classes are skewed here (25 of 500 rows are anomalies); try "balanced_subsample"
    ))
])

logreg.fit(X_train, y_train)
rf.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)
y_pred_rf = rf.predict(X_test)

In [14]:
print("LogReg accuracy:", accuracy_score(y_test, y_pred_lr))
print("RF     accuracy:", accuracy_score(y_test, y_pred_rf))

print("\nLogReg report:\n", classification_report(y_test, y_pred_lr))
print("\nRF report:\n", classification_report(y_test, y_pred_rf))

LogReg accuracy: 0.512
RF     accuracy: 0.96

LogReg report:
                     precision    recall  f1-score   support

Aggressive driving       0.58      0.46      0.51        65
    Normal driving       0.53      0.57      0.55        54
    System anomaly       0.20      0.50      0.29         6

          accuracy                           0.51       125
         macro avg       0.44      0.51      0.45       125
      weighted avg       0.54      0.51      0.52       125


RF report:
                     precision    recall  f1-score   support

Aggressive driving       0.93      1.00      0.96        65
    Normal driving       1.00      0.98      0.99        54
    System anomaly       1.00      0.33      0.50         6

          accuracy                           0.96       125
         macro avg       0.98      0.77      0.82       125
      weighted avg       0.96      0.96      0.95       125
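The task also asks for a confusion matrix. The RF report shows a recall of 0.33 on 6 "System anomaly" test rows, meaning only 2 were caught, and a confusion matrix would show which class absorbed the rest. Below is a minimal sketch with `sklearn.metrics.confusion_matrix` on a hypothetical set of true and predicted labels. In the notebook, `y_test` and `y_pred_rf` (or `y_pred_lr`) would be passed instead.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["Normal driving", "Aggressive driving", "System anomaly"]
N, A, S = labels

# Hypothetical labels and predictions (not the notebook's actual results)
y_true = [N, N, N, A, A, A, S, S]
y_pred = [N, N, A, A, A, A, A, S]

# Passing labels= fixes the row/column order; rows are true classes, columns are predictions.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true: {name}" for name in labels],
    columns=[f"pred: {name}" for name in labels],
)
print(cm_df)

accuracy = accuracy_score(y_true, y_pred)
# Per-class recall is the diagonal divided by the row totals.
recall = pd.Series(cm.diagonal() / cm.sum(axis=1), index=labels)
print("\nAccuracy:", accuracy)
print("\nRecall per class:\n", recall.round(2))
```

Reading the matrix row by row is more informative than overall accuracy when one class is rare, since a model can score well while missing most anomalies.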


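For the last task, real-time prediction, one option is to wrap the fitted pipeline in a small function that takes a single telemetry record and returns the predicted state with its probability. The sketch below is self-contained, so it trains a stand-in model on synthetic data with a simplified two-class label. The function name `predict_state`, the `FEATURES` list, and the synthetic training set are assumptions for illustration. In the notebook, the fitted `rf` pipeline and its full feature list would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

FEATURES = ["speed", "engine_rpm", "acceleration", "brake_pressure", "steering_angle"]

# Synthetic training data standing in for the notebook's training split
rng = np.random.default_rng(0)
n = 300
train = pd.DataFrame({
    "speed": rng.integers(0, 150, n),
    "engine_rpm": rng.integers(600, 6500, n),
    "acceleration": rng.uniform(-3.5, 3.5, n),
    "brake_pressure": rng.uniform(0, 1, n),
    "steering_angle": rng.uniform(-50, 50, n),
})
# Simplified two-class label for the sketch only
labels = np.where(train["speed"] > 130, "Aggressive driving", "Normal driving")

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(train[FEATURES], labels)


def predict_state(record: dict) -> dict:
    """Score one incoming telemetry record (hypothetical helper)."""
    # reindex enforces the training column order; missing keys become NaN
    # and are filled by the imputer inside the pipeline.
    row = pd.DataFrame([record]).reindex(columns=FEATURES).astype(float)
    proba = model.predict_proba(row)[0]
    best = int(np.argmax(proba))
    classes = model.named_steps["clf"].classes_
    return {"state": str(classes[best]), "confidence": float(proba[best])}


# Simulated stream of incoming records; the second one lacks brake_pressure.
incoming = [
    {"speed": 145, "engine_rpm": 4200, "acceleration": 1.1,
     "brake_pressure": 0.2, "steering_angle": 3.0},
    {"speed": 60, "engine_rpm": 2200, "acceleration": 0.3, "steering_angle": -4.0},
]
for record in incoming:
    print(predict_state(record))
```

In a long-running service, the fitted pipeline would typically be saved once with `joblib.dump` and loaded at startup with `joblib.load`, so each incoming record only pays for a `predict_proba` call.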