### **Hybrid Car Scanner Dataset Schema**

1. **Core Metadata**

 - `timestamp` → Time of log entry

 - `vehicle_id` → Unique identifier for the car

 - `trip_id` → Driving session identifier

 - `location_lat`, `location_lon` → GPS coordinates (optional, for context)


2. **Generic OBD‑II Signals**

 - `dtc_code` → Diagnostic Trouble Code (e.g., P0420, P0A80)

 - `rpm` → Engine revolutions per minute

 - `vehicle_speed` → Speed in km/h or mph

 - `coolant_temp` → Engine coolant temperature (°C)

 - `throttle_position` → % open

 - `intake_air_temp` → °C

 - `maf` → Mass airflow rate (g/s)


3. **Hybrid‑Specific Signals**

**Battery Management System (BMS):**

 - `battery_soc` → State of Charge (%)

 - `battery_soh` → State of Health (%)

 - `battery_voltage` → Pack voltage (V)

 - `battery_current` → Current flow (A)

 - `battery_temp_avg` → Average cell temperature (°C)

 - `battery_temp_max` → Max cell temperature (°C)

**Inverter / Converter:**

 - `inverter_temp` → °C

 - `inverter_voltage` → V

 - `inverter_current` → A

 - `dc_dc_converter_status` → On/Off

**Electric Motor:**

 - `motor_rpm` → Revolutions per minute

 - `motor_torque` → Nm

 - `regen_braking_status` → Active/Inactive

**Hybrid Control ECU:**

 - `hybrid_mode` → EV only / Hybrid assist / Charging

 - `ecu_fault_code` → Manufacturer‑specific hybrid ECU codes

**Cooling Systems:**

 - `battery_fan_speed` → RPM

 - `inverter_coolant_pump_status` → On/Off


4. **Labels / Targets (for ML training)**

 - `fault_category` → Engine / Transmission / Hybrid Battery / Inverter / Motor / Cooling

 - `fault_description` → Human‑readable explanation (e.g., “Replace Hybrid Battery Pack”)

 - `recommended_action` → Suggested fix (e.g., “Check battery cooling fan wiring”)
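
The four column groups above can be written down once and reused to check incoming logs. This is a minimal sketch, assuming pandas; `SCHEMA`, `OPTIONAL`, and `missing_columns` are illustrative names, not part of any library.

```python
import pandas as pd

# Column groups copied from the schema above (32 columns in total).
SCHEMA = {
    "core": ["timestamp", "vehicle_id", "trip_id", "location_lat", "location_lon"],
    "obd2": ["dtc_code", "rpm", "vehicle_speed", "coolant_temp",
             "throttle_position", "intake_air_temp", "maf"],
    "hybrid": ["battery_soc", "battery_soh", "battery_voltage", "battery_current",
               "battery_temp_avg", "battery_temp_max", "inverter_temp",
               "inverter_voltage", "inverter_current", "dc_dc_converter_status",
               "motor_rpm", "motor_torque", "regen_braking_status", "hybrid_mode",
               "ecu_fault_code", "battery_fan_speed", "inverter_coolant_pump_status"],
    "labels": ["fault_category", "fault_description", "recommended_action"],
}

# The schema marks the GPS columns as optional.
OPTIONAL = {"location_lat", "location_lon"}


def missing_columns(df: pd.DataFrame) -> list:
    """Return required schema columns that are absent from df, in schema order."""
    expected = [col for group in SCHEMA.values() for col in group]
    return [col for col in expected if col not in df.columns and col not in OPTIONAL]
```

Calling `missing_columns(hybrid_data)` right after `pd.read_csv` would then list any required column a log file lacks before training starts.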

**How You’ll Use This Schema**

 - **Data Science:** Clean and normalize sensor values, encode categorical codes.

 - **Machine Learning:** Train classifiers on `dtc_code + sensor values → fault_category`.

 - **Deep Learning:** Use time‑series signals (`battery_soc`, `inverter_temp`) for anomaly detection.

 - **LLM + Prompt Engineering:** Generate explanations from `fault_description + recommended_action`.

 - **RAG:** Retrieve manufacturer manuals when an `ecu_fault_code` appears.



**Summary**

 - Start with generic OBD‑II columns (RPM, speed, coolant temp, DTCs).

 - Extend with hybrid‑specific signals (battery SOC, inverter temp, motor torque).

 - Add labels for supervised learning and AI interpretation.

#### **Let’s make this concrete with a sample synthetic dataset row. This will show you how both generic OBD‑II signals and hybrid‑specific signals can be logged together in one CSV line.**

In [2]:
import pandas as pd

hybrid_data = pd.read_csv('test_1.csv')
hybrid_data.head()

Unnamed: 0,timestamp,vehicle_id,trip_id,location_lat,location_lon,dtc_code,rpm,vehicle_speed,coolant_temp,throttle_position,...,motor_rpm,motor_torque,regen_braking_status,hybrid_mode,ecu_fault_code,battery_fan_speed,inverter_coolant_pump_status,fault_category,fault_description,recommended_action
0,2025-12-09T05:30:00Z,HYB123,TRIP456,6.998,-3.456,P0A80,2200,65,90,35,...,1500,120,ACTIVE,Hybrid Assist,P0A80,1800,ON,Hybrid Battery,Replace Hybrid Battery Pack,Check battery cooling fan and wiring; schedule...


In [3]:
hybrid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   timestamp                     1 non-null      object 
 1   vehicle_id                    1 non-null      object 
 2   trip_id                       1 non-null      object 
 3   location_lat                  1 non-null      float64
 4   location_lon                  1 non-null      float64
 5   dtc_code                      1 non-null      object 
 6   rpm                           1 non-null      int64  
 7   vehicle_speed                 1 non-null      int64  
 8   coolant_temp                  1 non-null      int64  
 9   throttle_position             1 non-null      int64  
 10  intake_air_temp               1 non-null      int64  
 11  maf                           1 non-null      float64
 12  battery_soc                   1 non-null      int64  
 13  battery_s

In [6]:
hybrid_data['timestamp'] = pd.to_datetime(hybrid_data['timestamp'])


In [23]:
hybrid_data.head().T

Unnamed: 0,0
timestamp,2025-12-09 05:30:00+00:00
vehicle_id,HYB123
trip_id,TRIP456
location_lat,6.998
location_lon,-3.456
dtc_code,P0A80
rpm,2200
vehicle_speed,65
coolant_temp,90
throttle_position,35


In [8]:
hybrid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype              
---  ------                        --------------  -----              
 0   timestamp                     1 non-null      datetime64[ns, UTC]
 1   vehicle_id                    1 non-null      object             
 2   trip_id                       1 non-null      object             
 3   location_lat                  1 non-null      float64            
 4   location_lon                  1 non-null      float64            
 5   dtc_code                      1 non-null      object             
 6   rpm                           1 non-null      int64              
 7   vehicle_speed                 1 non-null      int64              
 8   coolant_temp                  1 non-null      int64              
 9   throttle_position             1 non-null      int64              
 10  intake_air_temp               1 non-null  

**Breakdown of This Row**

**Generic OBD‑II:**

- `dtc_code = P0A80` (fault code)

- `rpm = 2200, vehicle_speed = 65 km/h, coolant_temp = 90 °C`

**Hybrid‑Specific:**

- `battery_soc = 42%, battery_soh = 75%`

- `inverter_temp = 55 °C, motor_torque = 120 Nm`

- `regen_braking_status = ACTIVE, hybrid_mode = Hybrid Assist`

**Labels:**

- `fault_category = Hybrid Battery`

- `fault_description = Replace Hybrid Battery Pack`

- `recommended_action = Check battery cooling fan and wiring; schedule battery replacement`


**How You’ll Use This**

- **Data Science:** Clean and normalize values (SOC %, temps, voltages).

- **ML/DL:** Train models to classify hybrid faults and detect anomalies.

- **LLM + Prompt Engineering:** Generate human‑friendly explanations from `fault_description + recommended_action`.

- **RAG:** Retrieve repair manual snippets when an `ecu_fault_code` appears.

**How You Can Use This Dataset**

- **Data Science:**

    - Fill missing values (e.g., set `battery_fan_speed = 0` when the fan is OFF).

    - Normalize continuous features (RPM, voltage, temp).

    - Encode categorical features (`hybrid_mode`, `dc_dc_converter_status`).

- **Machine Learning:**

    - Train classifiers on `dtc_code + sensor values → fault_category`.

    - Example: `P0A80` → Hybrid Battery.

- **Deep Learning:**

    - Use sequences of `battery_soc`, `battery_temp_avg`, `inverter_temp` to detect anomalies.

- **LLM + Prompt Engineering:**

    - Generate human‑friendly explanations from `fault_description + recommended_action`.

- **RAG:**

    - Store manufacturer manuals for codes like `P0A80` and `P0C78`.

    - Retrieve repair steps when those codes appear.
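
For the categorical encoding step, one-hot encoding is one option that avoids implying an order between modes. This is a minimal sketch, assuming pandas; the four rows reuse `hybrid_mode` and `regen_braking_status` values from the sample data, and the variable names are illustrative.

```python
import pandas as pd

# Four rows modeled on the sample data (categorical columns only).
df = pd.DataFrame({
    "hybrid_mode": ["Hybrid Assist", "EV Only", "Hybrid Assist", "Charging"],
    "regen_braking_status": ["ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE"],
})

# One indicator column per category value; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["hybrid_mode", "regen_braking_status"], dtype=int)
print(encoded.columns.tolist())
```

Tree models such as RandomForest usually cope with integer label codes as well, so treat the choice between one-hot and label encoding as something to test rather than a fixed rule.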

In [22]:
# Import library
import torch
import torch.nn as nn
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [13]:
# 1 Load with Pandas
# Practicing with test_2.csv dataset
hybrid_data_2 = pd.read_csv('test_2.csv')
hybrid_data_2.head()

Unnamed: 0,timestamp,vehicle_id,trip_id,dtc_code,rpm,vehicle_speed,coolant_temp,throttle_position,intake_air_temp,maf,...,motor_rpm,motor_torque,regen_braking_status,hybrid_mode,ecu_fault_code,battery_fan_speed,inverter_coolant_pump_status,fault_category,fault_description,recommended_action
0,2025-12-09T05:30:00Z,HYB123,TRIP001,P0A80,2200,65,90,35,28,12.5,...,1500,120,ACTIVE,Hybrid Assist,P0A80,1800,ON,Hybrid Battery,Replace Hybrid Battery Pack,Check cooling fan and schedule battery replace...
1,2025-12-09T05:45:00Z,HYB123,TRIP001,P0C78,1800,50,85,30,27,11.0,...,1400,100,INACTIVE,EV Only,P0C78,1600,ON,Inverter,Drive Motor A Inverter Performance,Inspect inverter coolant pump and wiring
2,2025-12-09T06:00:00Z,HYB124,TRIP002,P0420,2500,70,95,40,29,13.2,...,1600,130,ACTIVE,Hybrid Assist,P0420,1700,ON,Engine,Catalyst Efficiency Below Threshold,Check catalytic converter and oxygen sensors
3,2025-12-09T06:15:00Z,HYB125,TRIP003,P0A7F,2000,55,88,32,26,10.8,...,1450,110,ACTIVE,Charging,P0A7F,1500,ON,Hybrid Battery,Hybrid Battery Deterioration,Monitor SOC trends; plan battery replacement
4,2025-12-09T06:30:00Z,HYB126,TRIP004,P0500,2300,0,80,20,25,9.5,...,0,0,INACTIVE,EV Only,P0500,0,OFF,Chassis,Vehicle Speed Sensor Malfunction,Check wheel speed sensors and wiring harness


2. **Preprocess**

    - Normalize continuous features (`rpm`, `battery_soc`, `inverter_temp`).

    - Encode categorical (`hybrid_mode`, `dc_dc_converter_status`).

    - Handle missing or zero readings (e.g., engine `rpm = 0` in EV‑only mode is a valid value, not a gap).

3. **Train Models**

    - **ML:** RandomForest/XGBoost → classify `fault_category`.

    - **DL:** LSTM/Autoencoder → detect anomalies in `battery_soc`, `inverter_temp`.

4. **LLM + RAG**

    - Use `fault_description` + `recommended_action` as training prompts.

    - Store manufacturer manuals in a vector DB → retrieve context for hybrid codes.
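
Steps 2 and 3 can also be wired together as a single scikit-learn pipeline, so the scaler and encoder are fitted only on whatever rows are passed to `fit`. This is a minimal sketch, assuming scikit-learn and pandas; the five rows are modeled on the `test_2.csv` preview (only columns visible there), and the model is fitted on all of them purely to show the wiring, not to evaluate it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rows modeled on the test_2.csv preview.
df = pd.DataFrame({
    "rpm": [2200, 1800, 2500, 2000, 2300],
    "vehicle_speed": [65, 50, 70, 55, 0],
    "coolant_temp": [90, 85, 95, 88, 80],
    "hybrid_mode": ["Hybrid Assist", "EV Only", "Hybrid Assist", "Charging", "EV Only"],
    "dtc_code": ["P0A80", "P0C78", "P0420", "P0A7F", "P0500"],
    "fault_category": ["Hybrid Battery", "Inverter", "Engine", "Hybrid Battery", "Chassis"],
})

numeric = ["rpm", "vehicle_speed", "coolant_temp"]
categorical = ["hybrid_mode", "dtc_code"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    # handle_unknown="ignore" encodes unseen codes as all zeros instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

X, y = df[numeric + categorical], df["fault_category"]
model.fit(X, y)
pred = model.predict(X.iloc[[0]])
print(pred)
```

With a real dataset you would pass only the training split to `model.fit`, which keeps test rows out of the scaling statistics.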

In [14]:
df = hybrid_data_2.copy()

In [15]:
df.isna().sum()

timestamp                       0
vehicle_id                      0
trip_id                         0
dtc_code                        0
rpm                             0
vehicle_speed                   0
coolant_temp                    0
throttle_position               0
intake_air_temp                 0
maf                             0
battery_soc                     0
battery_soh                     0
battery_voltage                 0
battery_current                 0
battery_temp_avg                0
battery_temp_max                0
inverter_temp                   0
inverter_voltage                0
inverter_current                0
dc_dc_converter_status          0
motor_rpm                       0
motor_torque                    0
regen_braking_status            0
hybrid_mode                     0
ecu_fault_code                  0
battery_fan_speed               0
inverter_coolant_pump_status    0
fault_category                  0
fault_description               0
recommended_ac

Data Cleaning & Preprocessing

In [18]:
# Handle missing values (example: fill with 0 or mean)
df.fillna(0, inplace=True)
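
Filling every gap with 0 is the simplest option, but a 0 in a temperature column reads like a real measurement. One alternative is to fill by column type, sketched below on a toy frame; the values are illustrative and not taken from `test_2.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values).
df = pd.DataFrame({
    "coolant_temp": [90.0, np.nan, 95.0],
    "battery_fan_speed": [1800.0, 1600.0, np.nan],
    "hybrid_mode": ["Hybrid Assist", None, "Hybrid Assist"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Numeric gaps get the column mean; categorical gaps get the most frequent value.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```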

In [19]:
# Normalize continuous features
num_cols = ['rpm', 'vehicle_speed', 'coolant_temp', 'battery_soc', 'battery_voltage', 'inverter_temp']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

In [20]:
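# Encode categorical features (one encoder per column, so each mapping stays available later)
dtc_encoder = LabelEncoder()
mode_encoder = LabelEncoder()
df['dtc_encoded'] = dtc_encoder.fit_transform(df['dtc_code'])
df['mode_encoded'] = mode_encoder.fit_transform(df['hybrid_mode'])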
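
A `LabelEncoder` remembers only the last column it was fitted on, so keeping one encoder per column lets you map values in both directions later. This is a minimal sketch, assuming scikit-learn; the codes and modes come from the `test_2.csv` preview, and the encoder names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

dtc_codes = ["P0A80", "P0C78", "P0420", "P0A7F", "P0500"]
modes = ["Hybrid Assist", "EV Only", "Hybrid Assist", "Charging", "EV Only"]

dtc_encoder = LabelEncoder().fit(dtc_codes)
mode_encoder = LabelEncoder().fit(modes)

# classes_ is sorted, so the integer for a code is its position in this list.
print(list(dtc_encoder.classes_))

code_id = dtc_encoder.transform(["P0A80"])[0]
print(code_id, dtc_encoder.inverse_transform([code_id])[0])
```

Note that `transform` raises a `ValueError` for a code the encoder has not seen, which matters once users type codes into an interface.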

Fault Classification (ML)

In [21]:
X = df[['rpm', 'vehicle_speed', 'coolant_temp', 'battery_soc', 'battery_voltage', 'inverter_temp', 'dtc_encoded', 'mode_encoded']]
y = df['fault_category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
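# zero_division=0 reports 0.00 for classes with no predicted or true samples instead of emitting warnings
print(classification_report(y_test, y_pred, zero_division=0))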
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

     Chassis       0.00      0.00      0.00         0
    Inverter       1.00      1.00      1.00         1
Transmission       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.33      0.33      0.33         2
weighted avg       0.50      0.50      0.50         2

[[0 0 0]
 [0 1 0]
 [1 0 0]]




Sensor Anomaly Detection (DL)

In [25]:
class LSTMAnomalyDetector(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, input_dim)
        
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)
    
    
# Example: train on sequences of battery SOC and inverter temp
# Reconstruction error > threshold = anomaly detected
# Note: This detects anomalies in hybrid battery/inverter signals before a fault code appears.
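
The two steps named in the comments, building sequences and thresholding reconstruction error, can be sketched with NumPy alone. The helper names (`make_windows`, `reconstruction_errors`, `fit_threshold`) are illustrative, the signal is synthetic, and a noisy copy of the input stands in for the model's reconstruction, so this shows the bookkeeping rather than a trained detector.

```python
import numpy as np


def make_windows(series: np.ndarray, seq_len: int) -> np.ndarray:
    """Slice a (time, features) array into overlapping (n_windows, seq_len, features) sequences."""
    n_windows = series.shape[0] - seq_len + 1
    return np.stack([series[i:i + seq_len] for i in range(n_windows)])


def reconstruction_errors(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Mean squared error per window."""
    return ((x - x_hat) ** 2).mean(axis=(1, 2))


def fit_threshold(normal_errors: np.ndarray, k: float = 3.0) -> float:
    """Threshold = mean + k standard deviations of errors measured on normal data."""
    return float(normal_errors.mean() + k * normal_errors.std())


rng = np.random.default_rng(0)
# Synthetic (time, 2) signal standing in for scaled battery_soc and inverter_temp.
signal = rng.normal(size=(50, 2))
windows = make_windows(signal, seq_len=10)

# Stand-in for model output: a near-perfect reconstruction, except for window 5.
recon = windows + rng.normal(scale=0.01, size=windows.shape)
recon[5] += 1.0

errors = reconstruction_errors(windows, recon)
threshold = fit_threshold(np.delete(errors, 5), k=3.0)
flags = errors > threshold
print(windows.shape, bool(flags[5]))
```

The `(n_windows, seq_len, features)` layout matches what an LSTM built with `batch_first=True` expects, so `torch.from_numpy(windows).float()` could be fed to the model directly.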

Natural Language Explanation (LLM)

In [None]:
from transformers import pipeline

# Load a small LLM for demo
generator = pipeline("text-generation", model="gpt2")

code = "P0A80"
context = "Replace Hybrid Battery Pack"

prompt = f"Explain fault code {code}: {context}. Provide likely causes and recommended fixes."
explanation = generator(prompt, max_length=80)[0]['generated_text']

print(explanation)

# This turns raw codes into human‑friendly explanations.
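
To feed both label columns into the prompt, a small template function is enough. This is a minimal sketch in plain Python; `build_prompt` is an illustrative name, and the wording of the instruction line is one possibility among many.

```python
def build_prompt(code: str, fault_description: str, recommended_action: str) -> str:
    """Assemble a diagnostic prompt from one labeled dataset row."""
    return (
        f"Fault code: {code}\n"
        f"Diagnosis: {fault_description}\n"
        f"Recommended action: {recommended_action}\n"
        "Explain this fault to a car owner in plain language, "
        "covering likely causes and next steps."
    )


prompt = build_prompt(
    "P0A80",
    "Replace Hybrid Battery Pack",
    "Check battery cooling fan and wiring; schedule battery replacement",
)
print(prompt)
```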


Knowledge Integration (RAG)

In [None]:
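# Import paths follow older LangChain releases; newer ones expose these classes from langchain_community
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

docs = ["P0A80: Replace Hybrid Battery Pack - check cooling fan and wiring",
        "P0C78: Drive Motor A Inverter Performance - inspect inverter coolant pump"]

embeddings = HuggingFaceEmbeddings()
db = FAISS.from_texts(docs, embeddings)

# RetrievalQA expects a LangChain LLM, so wrap the transformers pipeline from the previous cell
llm = HuggingFacePipeline(pipeline=generator)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())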
rag_response = qa.run("Explain P0A80")

print(rag_response)

# This enriches explanations with manufacturer manuals.
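
The retrieval step can be tried without downloading an embedding model. The sketch below swaps the FAISS and embedding stack for TF-IDF vectors with cosine similarity from scikit-learn, which is a simpler technique, not what the cell above uses; `retrieve` is an illustrative helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "P0A80: Replace Hybrid Battery Pack - check cooling fan and wiring",
    "P0C78: Drive Motor A Inverter Performance - inspect inverter coolant pump",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)


def retrieve(query: str, top_k: int = 1) -> list:
    """Return the top_k documents most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [docs[i] for i in ranked]


print(retrieve("Explain P0A80"))
```

Keyword matching like this works when the query contains the exact code; embeddings become more useful once queries describe symptoms in free text.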

Deployment (Chatbot Interface)

In [None]:
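import gradio as gr

# Dedicated encoder for DTC codes. LabelEncoder sorts its classes, so this reproduces the dtc_encoded mapping.
code_encoder = LabelEncoder().fit(df['dtc_code'])

def chatbot_interface(code, rpm, speed, temp):
    # Scale the three inputs with the fitted scaler; the other sensors sit at their training mean (0 after scaling)
    raw = scaler.mean_.copy()
    raw[:3] = [rpm, speed, temp]
    scaled = scaler.transform(raw.reshape(1, -1))[0]
    # Feature order matches X; mode_encoded stays 0 as a placeholder. Unseen codes raise ValueError.
    features = list(scaled) + [code_encoder.transform([code])[0], 0]
    category = clf.predict([features])[0]
    explanation = qa.run(f"Explain {code}")
    return f"Fault Category: {category}\nExplanation: {explanation}"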

gr.Interface(fn=chatbot_interface,
             inputs=["text","number","number","number"],
             outputs="text").launch()


# This gives you a chatbot UI where users input a code and sensor values, and get AI‑powered interpretations.

**Summary**

- **Data Science:** Clean and preprocess with pandas and scikit-learn.

- **ML:** Classify fault categories with RandomForest/XGBoost.

- **DL:** Detect anomalies with an LSTM or autoencoder.

- **LLM + Prompt Engineering:** Generate explanations.

- **RAG:** Retrieve hybrid manuals for context.

- **Deployment:** Serve a chatbot or dashboard with Gradio or Streamlit.