Question:
You are given a dataset of vehicle telemetry logs. Each row corresponds to one minute of data and includes features such as:

- speed (km/h)
- engine_rpm
- acceleration
- brake_pressure
- steering_angle

Each row is labeled as one of three driving states:

- Normal driving
- Aggressive driving
- System anomaly

Write Python code to:

1. Load and preprocess the dataset (handle missing values, normalize features if needed).
2. Split the dataset into training and test sets.
3. Train a simple classification model (e.g., Logistic Regression, Random Forest, or Gradient Boosting).
4. Evaluate the model with accuracy and a confusion matrix.
5. Demonstrate how you would predict the driving state for new incoming telemetry data in real time.

In [2]:
# --- Part 1: Setup ---
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# --- Part 2: Create synthetic dataset ---
np.random.seed(42)

n_samples = 200
data = {
    "speed": np.random.randint(0, 150, n_samples),          # km/h
    "engine_rpm": np.random.randint(500, 6000, n_samples),  # rpm
    "acceleration": np.random.uniform(-3, 3, n_samples),    # m/s^2
    "brake_pressure": np.random.uniform(0, 1, n_samples),   # normalized [0,1]
    "steering_angle": np.random.uniform(-45, 45, n_samples) # degrees
}

df = pd.DataFrame(data)

# Generate labels from simple rules; the first matching rule wins
rule_conditions = [
    (df["brake_pressure"] > 0.8) & (df["speed"] > 100),
    (df["acceleration"] > 2) | (df["steering_angle"].abs() > 30),
]
rule_labels = ["System anomaly", "Aggressive driving"]

# Rows that match neither rule fall through to "Normal driving"
df["label"] = np.select(rule_conditions, rule_labels, default="Normal driving")

In [3]:
df

Unnamed: 0,speed,engine_rpm,acceleration,brake_pressure,steering_angle,label
0,102,4561,-0.864164,0.812901,-30.216078,System anomaly
1,92,3869,1.547077,0.999718,28.311725,Normal driving
2,14,762,-2.913639,0.996637,14.867750,Normal driving
3,106,1123,-2.303564,0.555432,2.075888,Normal driving
4,71,1516,-2.723984,0.768987,-12.705256,Normal driving
...,...,...,...,...,...,...
195,38,1970,-0.187841,0.841829,-42.977762,Aggressive driving
196,81,1080,-0.511083,0.139772,-20.819038,Normal driving
197,103,2794,-1.359558,0.795267,3.747079,Normal driving
198,128,3647,-2.661747,0.201627,12.013040,Normal driving


# Quick EDA (shape, nulls, class balance, simple distributions)

In [5]:
print("Shape:", df.shape)
print("\nNulls per column:\n", df.isna().sum())
print("\nClass distribution:\n", df["label"].value_counts())

Shape: (200, 6)

Nulls per column:
 speed             0
engine_rpm        0
acceleration      0
brake_pressure    0
steering_angle    0
label             0
dtype: int64

Class distribution:
 label
Normal driving        106
Aggressive driving     82
System anomaly         12
Name: count, dtype: int64
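The class distribution above is imbalanced: only 12 of the 200 rows are "System anomaly". One common way to compensate is to weight classes inversely to their frequency. The sketch below is an assumption about how that could be done, not a step the notebook takes; it rebuilds the label counts from the output above as a stand-in array (`labels` and `class_weights` are illustrative names) and uses scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels reproducing the class counts printed above (assumption:
# in the notebook itself, df["label"] would be used instead).
labels = np.array(
    ["Normal driving"] * 106
    + ["Aggressive driving"] * 82
    + ["System anomaly"] * 12
)

# "balanced" assigns each class the weight n_samples / (n_classes * class_count),
# so rarer classes receive larger weights.
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

class_weights = {str(c): float(w) for c, w in zip(classes, weights)}
for name, weight in class_weights.items():
    print(f"{name}: {weight:.3f}")
```

A dictionary like this can be passed as the `class_weight` argument of `RandomForestClassifier` or `LogisticRegression`; passing `class_weight="balanced"` directly applies the same formula.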


# 3) Train/Test split & preprocessing

In [7]:
X = df.drop(columns=['label'])

In [8]:
y = df['label']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

In [13]:
num_features = X.columns.tolist()

In [14]:
num_features

['speed', 'engine_rpm', 'acceleration', 'brake_pressure', 'steering_angle']

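# Correct order:

# Always run train_test_split first, then fit the scaler on the training set only and apply it to the test set. Fitting the scaler before the split leaks test-set statistics into preprocessing.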
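A minimal sketch of that order, assuming `StandardScaler` as the scaler (the notebook has not chosen one at this point). It uses a small stand-in DataFrame with hypothetical `demo_` names so the cell runs on its own; in the notebook, `X_train` and `X_test` from the split above would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): two numeric features and two balanced classes.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "speed": rng.integers(0, 150, 40),
    "engine_rpm": rng.integers(500, 6000, 40),
})
demo_y = np.array(["Normal driving", "Aggressive driving"] * 20)

# Step 1: split first, so the test rows stay unseen during preprocessing.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=42, stratify=demo_y
)

# Step 2: learn the mean and standard deviation from the training rows only.
scaler = StandardScaler()
demo_X_train_scaled = scaler.fit_transform(demo_X_train)

# Step 3: reuse the training statistics on the test rows (transform, not fit).
demo_X_test_scaled = scaler.transform(demo_X_test)
```

Tree-based models such as the `RandomForestClassifier` imported earlier are insensitive to feature scaling, so this step matters mainly if a scale-sensitive model such as Logistic Regression is used.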